Where We Are Today With Google’s Mobile-First Index

Okay, so it’s been a few years now since Google announced the mobile-first index.

Most sites have been moved over to Google’s mobile-first index and it’s no longer a “hot” topic in SEO.

I found a tweet from John Mueller, Google Search Advocate, in 2021 that sums up the lack of focus on this topic the best:

Going with that mentality that mobile-first indexing is a “part of life” (which I wholeheartedly agree with), as an SEO, it is helpful to know some of the history and where we are today.

For instance, since the announcement of the mobile-first index years ago, Google has now also placed emphasis on Page Experience, which is a ranking factor and very much incorporates mobile.

Before we jump into that topic, let’s first get into the beginnings of the mobile-first index and what we know so far.

Then, we’ll get into what Google is looking for in mobile usability, what it means to have an identical experience on mobile and desktop, how you can meet Google’s expectations of mobile-first best practices, and more.

Google’s Mobile-First Indexing

No, There Are Not Two Indexes

Google has stated that there isn’t a separate mobile-first index.

Instead, mobile-first indexing means Google primarily uses the mobile version of the webpage for ranking and indexing purposes.

In 2018, Google explained that with mobile-first indexing, the URL of the mobile-friendly version of your site is indexed.

If your website has separate mobile and desktop URLs, Google shows the mobile URL to mobile users and the desktop URL to desktop users.

Regardless, the indexed content will be the mobile version.

Shifting To The Mobile-First Index

At the end of 2017, Google announced that it would start slowly rolling out mobile-first indexing.

By March 2018, Google stated that they were expanding the rollout and instructed websites to prepare.

Fast forward three years and not all websites have been switched over to the mobile-first index.

In June 2020, Google stated that while most websites were set to mobile indexing, there were still many that were not.

Google announced at that point that instead of switching in September 2020, it would delay mobile-first indexing until March 2021.

Google cited a number of issues encountered with sites as a reason for delaying the rollout, including problems with robots meta tags, lazy-loading, blocked assets, primary content, and mobile images and videos.

Eventually, in November 2021, Google removed its self-imposed deadline, explaining that there were still sites not yet in the mobile-first index because they weren’t ready to be moved over.

Google went on to say that the lack of readiness was due to several unexpected challenges faced by these websites.

According to Google, “because of these difficulties, we’ve decided to leave the timeline open for the last steps of mobile-first indexing.”

Google also stated that “we currently don’t have a specific final date for the move to mobile-first indexing and want to be thoughtful about the remaining bigger steps in that direction.”

Mobile-First Indexing As The Default For New Websites

If your website was published after July 1, 2019, mobile-first indexing is enabled by default.

Google made this announcement in May 2019 and explained that the change applied to websites that were previously unknown to Google Search.

The announcement went into detail about why Google would make mobile-first indexing the default for new websites.

According to Google, after crawling the web with a smartphone Googlebot over the years, they concluded that new websites are typically ready for this type of crawling.

Mobile Usability And Mobile-First Indexing Are Not Synonyms

In January 2019, Mueller explained that if your content does not pass the mobile usability test, it could still be moved to mobile-first indexing.

Even if Search Console’s “mobile usability” report showed that your site had valid URLs, it didn’t mean those pages were ready for mobile-first indexing.

Mobile usability is “completely separate” from mobile-first indexing, according to Mueller. Consequently, pages could be enabled for mobile-first indexing even if they were not considered usable on a mobile device.

You can hear Mueller’s explanation in the video below, starting at the 41:12 mark:

“So, first off, again mobile usability is completely separate from mobile-first indexing.

A site can or cannot be usable from a mobile point of view, but it can still contain all of the content that we need for mobile-first indexing.

An extreme example, if you take something like a PDF file, then on mobile that would be terrible to navigate. The links will be hard to click, the text will be hard to read.

But all of the text is still there, and we could perfectly index that with mobile-first indexing.

Mobile usability is not the same as mobile-first indexing.”

In summary, mobile-friendliness and mobile-responsive layouts are not mandatory for mobile-first indexing.

Since pages without mobile versions still work on a mobile device, they were eligible for indexing.

The Mobile & Desktop Experiences Should Be The Same

Google added to their mobile-first indexing best practices in January 2020, and the big emphasis was on providing an identical experience on mobile and desktop.

Matt Southern provided a great summarized list of what Google meant by the same experience:

  • Ensuring Googlebot can access and render mobile and desktop page content and resources.
  • Making sure the mobile site contains the same content as the desktop site.
  • Using the same meta robots tags on the mobile and desktop site.
  • Using the same headings on the mobile site and desktop site.
  • Making sure the mobile and desktop sites have the same structured data.
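
A few of the checks in this list can be automated. The following is a minimal sketch, assuming you already have the rendered HTML of the desktop and mobile versions of a page as strings. The names `PageSignals`, `extract_signals`, and `parity_report` are illustrative and not part of any Google tool, and only headings and meta robots tags are compared.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect heading text and meta robots directives from one HTML page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.meta_robots = []
        self._current = None  # heading tag currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS:
            self._current = tag
            self.headings.append("")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            # Normalize "index, follow" and "follow,index" to the same value.
            directives = sorted(part.strip() for part in content.split(","))
            self.meta_robots.append(",".join(directives))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.headings[-1] += data


def extract_signals(html):
    """Return the headings and meta robots values found in an HTML string."""
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    return {
        "headings": [text.strip() for text in parser.headings],
        "meta_robots": sorted(parser.meta_robots),
    }


def parity_report(desktop_html, mobile_html):
    """Map each signal name to True when desktop and mobile match."""
    desktop = extract_signals(desktop_html)
    mobile = extract_signals(mobile_html)
    return {key: desktop[key] == mobile[key] for key in desktop}
```

A `False` value in the report points to a signal that differs between the two versions and is worth inspecting by hand.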

Google warns that if you purposefully serve less content on the mobile version of a page than the desktop version, you will likely experience a drop in traffic.

The reason? According to Google, they won’t be able to get as much information from the page as before (when the desktop version was used).

Instead, Google recommends that the primary content on the mobile site be the same as on the desktop site. Google even suggests using the same headings on the mobile version.

To drive this point home even more, Google mentions in its mobile-indexing documentation that only the content on the mobile site is used in indexing.

Therefore, you should be sure that your mobile site has the same content as your desktop site.

Mueller reiterated this fact during Pubcon Pro Virtual 2020 with the following comment:

“…we’re now almost completely indexing the web using a smart phone Googlebot, which matches a lot more what users would actually see when they search.

And one of the things that we noticed that people are still often confused about is with regards to, like if I only have something on desktop, surely Google will still see that and it will also take into account the mobile content.

But actually, it is the case that we will only index the mobile content in the future.

So when a site is shifted over to mobile first indexing, we will drop everything that’s only on the desktop site. We will essentially ignore that.

…anything that you want to have indexed, it needs to be on the mobile site.”

You can read more about Mueller’s comments here: Google Mobile-First Index – Zero Desktop Content March 2021.

Google’s Mobile-First Indexing Best Practices

Google provides a comprehensive list of best practices for mobile-first indexing “to make sure that your users have the best experience.”

Most of the information Google shares as best practices is not really new.

Instead, the list is a compilation of various recommendations and advice that Google has provided elsewhere over the years.

In addition to the list of recommendations above about creating the same experience on mobile and desktop, other best practices include:

  • Making sure the error page status is the same on the mobile and desktop sites.
  • Avoiding fragment URLs in the mobile site.
  • Making sure the desktop pages have equivalent mobile pages.
  • Verifying both the mobile and desktop sites in Search Console.
  • Checking hreflang links on separate mobile URLs.
  • Making sure the mobile site can handle an increased crawl rate.
  • Making sure the robots.txt directives are the same on the mobile and desktop sites.
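
For sites with separate mobile and desktop hosts, the robots.txt item in this list could be checked roughly as follows. This is a sketch using Python's standard `urllib.robotparser`; the helper names and the sample paths are assumptions for illustration, and the robots.txt contents are passed in as text so the example does not depend on any real site.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def robots_mismatches(desktop_txt, mobile_txt, paths, agent="Googlebot"):
    """Return the paths that one robots.txt allows and the other blocks."""
    desktop = parse_robots(desktop_txt)
    mobile = parse_robots(mobile_txt)
    return [
        path
        for path in paths
        if desktop.can_fetch(agent, path) != mobile.can_fetch(agent, path)
    ]


# Hypothetical example: the mobile host also blocks its image folder.
desktop_rules = "User-agent: *\nDisallow: /private/\n"
mobile_rules = "User-agent: *\nDisallow: /private/\nDisallow: /images/\n"
print(robots_mismatches(desktop_rules, mobile_rules,
                        ["/", "/private/a", "/images/logo.png"]))
```

Any path returned here is crawlable on one version but blocked on the other, which is the kind of blocked-asset difference the best practices warn about.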

Google offers an entire section focused on suggestions for separate URLs.

The “Troubleshooting” section of the best practices document is also worth checking out.

It includes common errors that can either cause your site to not be ready for mobile-first indexing or could lead to a drop in rankings once your site is enabled.

Note that Mueller explained nothing has changed with mobile-first indexing related to sites with separate mobile URLs using rel-canonical. Mueller recommends keeping the annotations the same.

Google will use the mobile URL as canonical even if the rel-canonical points to the desktop URL.

Mueller created a helpful graphic that shows a “before and after” indexing process for desktop and m-dot URLs.

Read more: Google’s John Mueller Clears Confusion About Mobile-First Index.

One last note about best practices.

In Google’s mobile-first indexing best practices documentation, it states, “While it’s not required to have a mobile version of your pages to have your content included in Google’s search results, it is very strongly recommended.”

While it might seem obvious to have a mobile version, I have gotten pushback when speaking about mobile-first.

At one conference, an attendee asked during my session if having a mobile version of the site was necessary if no one was coming from a mobile device.

He kept emphasizing “no one.” My answer? Do it anyway.

Not only does Google very highly recommend it, but visitors, especially repeat visitors, might not be using mobile devices because of the poor experience.

We need to focus not just on getting pages ranked in search results, but also on ensuring that the visitor has a good experience once on the page.

Page Experience Update + Mobile-First

The Page Experience update also needs to be part of the conversation.

The Page Experience update was officially released for mobile devices in 2021 and includes measurement signals regarding how visitors perceive their experience of interacting with your web page.

According to Google, this perception goes beyond just the information value provided on the page. Therefore, Google takes into account the loading performance, visual stability, and interactivity of the page, which together are known as Core Web Vitals.

Page Experience also looks at mobile-friendliness, HTTPS, and intrusive interstitials, which were already a part of the ranking algorithm.

For instance, mobile-friendliness was announced as a ranking factor in 2015, which led to Mobilegeddon (the industry’s name for the update… not Google’s name).

This factor took into account text readability, spacing of tap targets, and unplayable content.

A year later, Google announced that it was strengthening this ranking factor.

Originally, the mobile-friendly update was meant to apply to mobile search results only, but now with the mobile-first index, it applies overall.

Let’s get back to Core Web Vitals.

Core Web Vitals are factors Google considers important in a user’s overall experience on the webpage, including Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

Each of these factors contributes to the user experience and is scored as “Good,” “Needs Improvement,” or “Poor.”
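
To make the three ratings concrete, here is a small sketch that buckets a measured value into “Good,” “Needs Improvement,” or “Poor.” The threshold numbers are not stated in this article; they are the ones Google has published for these three metrics (2.5 and 4.0 seconds for LCP, 100 and 300 milliseconds for FID, 0.1 and 0.25 for CLS), so treat them as an assumption to verify against current documentation. The `rate` function name is illustrative.

```python
# (upper bound for "Good", upper bound for "Needs Improvement") per metric.
# Assumed from Google's published Core Web Vitals thresholds.
THRESHOLDS = {
    "LCP": (2.5, 4.0),   # seconds
    "FID": (100, 300),   # milliseconds
    "CLS": (0.1, 0.25),  # unitless layout shift score
}


def rate(metric, value):
    """Return the rating bucket for one metric value."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "Good"
    if value <= needs_improvement:
        return "Needs Improvement"
    return "Poor"


for metric, value in [("LCP", 2.1), ("FID", 180), ("CLS", 0.31)]:
    print(metric, value, rate(metric, value))
```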

Now, let’s see how this relates to mobile-first indexing.

There is a lot of overlap between Core Web Vitals and the mobile-first index because both look at how a page performs on a mobile device.

To tie this together, you can reference one of the mobile-first indexing best practices provided by Google, which is to ensure your mobile site loads fast.

Google offers specific recommendations, including using Google PageSpeed Insights and focusing on the “Speed” section. Note that you can also test speed with other tools, such as GTMetrix and WebPageTest.

Martin Splitt, who works in Google’s Developer Relations, was asked in May 2021 if the Page Experience Update was going to roll out on mobile and desktop pages at the same time.

His response was that it would start with mobile pages first, which it did in August 2021. It would be rolled out on desktop pages in February 2022.

It was also made clear that Google would assess mobile pages separately from desktop pages, meaning there is no aggregate score of mobile and desktop (at least not for now).

You can access both the desktop and mobile Page Experience reports in Google Search Console.

Just as you need to pay attention to the desktop and mobile versions of your site for the mobile-first index, you also need to for the Page Experience update.

Check out Core Web Vitals: A Complete Guide for detailed information about this update and how to implement fixes.

One last note before we move on: When Google scores a page, it will test the speed, stability, and usability of the page version that the user ends up seeing.

Here’s where things get tricky. For Core Web Vitals, if you have an AMP version, Google will use it for page experience scoring (i.e., speed, quality, and usability). The mobile version would not be used.

Yet, the mobile version is what would be crawled for the mobile-first index.

So, to sum it up, the AMP version would be used for Core Web Vitals scoring and the mobile version would be used for mobile-first indexing.

Read Google Mobile-First Indexing and Scoring of Sites with Mobile and AMP Versions for the full explanation from Mueller.

Improve Performance In Google’s Mobile-First Index

Here is a consolidated list of items to check that build on some of the best practices already provided.

1. If You Have Multiple Versions, Make Sure Important Content Is Shown On All

Make sure your important content – including structured data, internal links, images, and so on – is on the mobile version of your website, too.

Google even warns in its mobile-indexing best practices that if you have less content on your mobile page than the desktop page, you will experience some traffic loss when your site is moved to mobile-first indexing.

Read more here: Google: Mobile-Friendly Does Not Mean Ready For Mobile-First Index.

2. Let Googlebot Access And Render Your Content

Google recommends that you use the same meta robots tags on the mobile site, avoid lazy-loading primary content (Googlebot can’t load content that requires user interaction), and allow Googlebot to crawl your resources.

3. Verify Structured Data

Double-check that your structured data is the same on both the desktop and mobile versions of your website and also ensure the URLs are correct.
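
One way to double-check this is to pull the JSON-LD blocks out of both versions and compare them. The sketch below assumes structured data is embedded as `application/ld+json` script tags and that you have the HTML of both versions as strings; `JsonLdCollector`, `structured_data`, and `same_structured_data` are illustrative names. It uses strict equality, which suits a responsive site. For separate mobile URLs, the URLs inside the structured data are expected to differ, so the comparison would need to be relaxed.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every JSON-LD script block on a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def structured_data(html):
    """Return the page's JSON-LD blocks in a normalized, comparable form."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    # Re-serialize with sorted keys so key order and whitespace do not matter.
    return sorted(
        json.dumps(json.loads(block), sort_keys=True)
        for block in collector.blocks
    )


def same_structured_data(desktop_html, mobile_html):
    """True when both versions carry the same JSON-LD content."""
    return structured_data(desktop_html) == structured_data(mobile_html)
```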

4. Improve Mobile Page Speed

Page speed has been a factor to consider for a long time and it is even more important with the mobile-first index and Page Experience update.

Advanced Core Web Vitals: A Technical SEO Guide is packed with how-to advice on identifying and addressing speed-related factors that impact Core Web Vitals and mobile-first indexing.

5. Keep An Eye On Mobile Errors

As with most SEO work, getting a site to perform well in the mobile-first index is not a “one and done” task. You need to be closely monitoring Search Console so that you can identify and fix mobile errors.

Make it a habit to regularly view the “mobile usability” and “Core Web Vitals” reports in Search Console.

Keep Reading: Google’s Changelog On Mobile-First Indexing

The changelog in Google’s mobile-first indexing best practices gives a quick recap of the changes since 2016.

As you can tell, there is a lot to know and keep in mind on mobile-first indexing.

Make sure you are staying on top of best practices and monitoring your website’s performance to succeed in the world of mobile-first indexing.

Featured Image: DisobeyArt/Shutterstock



Where Are The Advertisers Leaving Twitter Going For The Super Bowl?

Since Elon Musk’s takeover of Twitter on October 27, 2022, things at the social media company have gone from bad to worse.

You probably saw this coming from a mile away – especially if you had read about a study by Media Matters that was published on November 22, 2022, entitled, “In less than a month, Elon Musk has driven away half of Twitter’s top 100 advertisers.”

If you missed that, then you’ve probably read Matt G. Southern’s article in Search Engine Journal, which was entitled, “Twitter’s Revenue Down 40% As 500 Top Advertisers Pull Out.”

This mass exodus creates a challenge for digital advertising executives and their agencies. Where should they go long term?

And what should they do in the short term – with Super Bowl LVII coming up on Sunday, February 12, 2023?

Ideally, these advertisers would follow their audience. If they knew where Twitter users were going, their ad budgets could follow them.

But it isn’t clear where Twitter users are going – or if they’ve even left yet.

Fake Followers On Twitter And Brand Safety

According to the latest data from Similarweb, a digital intelligence platform, there were 6.9 billion monthly visits to Twitter worldwide during December 2022 – up slightly from 6.8 billion in November, and down slightly from 7.0 billion in October.

So, if a high-profile user like Boston Mayor Michelle Wu has taken a step back from the frequent posts on her Twitter account, @wutrain, which has more than 152,000 followers, then it appears that other users have stepped up their monthly visits.

This includes several accounts that had been banned previously for spreading disinformation, which Musk unbanned.

(Disinformation is defined as “deliberately misleading or biased information,” while misinformation may be spread without the sender having harmful intentions.)

It’s also worth noting that SparkToro, which provides audience research software, also has a free tool called Fake Follower Audit, which analyzes Twitter accounts.

This tool defines “fake followers” as ones that are unreachable and will not see the account’s tweets either because they’re spam, bots, and propaganda, or because they’re no longer active on Twitter.

On Jan. 24, 2023, I used this tool and found that 70.2% of the 126.5 million followers of the @elonmusk account were fake.

According to the tool, accounts with a similar-sized following to @elonmusk have a median of 41% fake followers. So, Elon Musk’s account has more fake followers than most.

Screenshot from SparkToro, January 2023

By comparison, 20.6% of the followers of the @wutrain account were fake. So, Michelle Wu’s account has fewer fake followers than accounts with a similar-sized following.

SparkToro results for fake followers. Screenshot from SparkToro, January 2023

In fact, most Twitter accounts have significant numbers of fake followers.

This underlines the brand safety concerns that many advertisers and media buyers have, but it doesn’t give them any guidance on where they should move their ad dollars.

Who Are Twitter’s Top Competitors And What Are Their Monthly Visits?

So, I asked Similarweb if they had more data that might help. And they sent me the monthly visits from desktop and mobile devices worldwide for Twitter and its top competitors:

  • YouTube.com: 34.6 billion in December 2022, down 2.8% from 35.6 billion in December 2021.
  • Facebook.com: 18.1 billion in December 2022, down 14.2% from 21.1 billion in December 2021.
  • Twitter.com: 6.9 billion in December 2022, up 1.5% from 6.8 billion in December 2021.
  • Instagram.com: 6.3 billion in December 2022, down 3.1% from 6.5 billion in December 2021.
  • TikTok.com: 1.9 billion in December 2022, up 26.7% from 1.5 billion in December 2021.
  • Reddit.com: 1.8 billion in December 2022, down 5.3% from 1.9 billion in December 2021.
  • LinkedIn.com: 1.5 billion in December 2022, up 7.1% from 1.4 billion in December 2021.
  • Pinterest.com: 1.0 billion in December 2022, up 11.1% from 0.9 billion in December 2021.

The most significant trends worth noting are that monthly visits to TikTok are up 26.7% year over year from a smaller base, while monthly visits to Facebook are down 14.2% from a bigger base.
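
The year-over-year percentages in the list above follow from the rounded visit counts. As a quick sketch, the calculation is (new − old) ÷ old × 100. The `pct_change` helper is an illustrative name and the figures are the Similarweb numbers quoted above, in billions.

```python
# Monthly visits in billions: (December 2021, December 2022).
visits = {
    "YouTube.com": (35.6, 34.6),
    "Facebook.com": (21.1, 18.1),
    "Twitter.com": (6.8, 6.9),
    "Instagram.com": (6.5, 6.3),
    "TikTok.com": (1.5, 1.9),
    "Reddit.com": (1.9, 1.8),
    "LinkedIn.com": (1.4, 1.5),
    "Pinterest.com": (0.9, 1.0),
}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


for site, (dec_2021, dec_2022) in visits.items():
    print(f"{site}: {pct_change(dec_2021, dec_2022):+.1f}%")
```

The same helper gives the cross-platform comparison: `pct_change(18.1, 34.6)` comes to roughly 91, meaning YouTube had about 91% more visits than Facebook in December 2022.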

So, the short-term events at Twitter over the past 90 days may have taken the spotlight off the long-term trends at TikTok and Facebook over the past year for some industry observers.

But based on Southern’s article in Search Engine Journal, “Facebook Shifts Focus To Short-Form Video After Stock Plunge,” which was published on February 6, 2022, Facebook CEO Mark Zuckerberg is focused on these trends.

In a call with investors, Zuckerberg said back then:

“People have a lot of choices for how they want to spend their time, and apps like TikTok are growing very quickly. And this is why our focus on Reels is so important over the long term.”

Meanwhile, there were 91% more monthly visits to YouTube in December 2022 than there were to Facebook. And that only counts the visits that Similarweb tracks from mobile and desktop devices.

Similarweb doesn’t track visits from connected TVs (CTVs).

Measuring Data From Connected TVs (CTVs) And Co-Viewing

Why would I wish to draw your attention to CTVs?

First, global viewers watched a daily average of over 700 million hours of YouTube content on TV devices, according to YouTube internal data from January 2022.

And Insider Intelligence reported in 2022 that 36.4% of the U.S. share of average time spent per day with YouTube came from connected devices, including Apple TV, Google Chromecast, Roku, and Xfinity Flex, while 49.3% came from mobile devices, and 14.3% came from desktops or laptops.

Second, when people watch YouTube on a connected TV, they often watch it together with their friends, family, and colleagues – just like they did at Super Bowl parties before the pandemic.

There’s even a term for this behavior: Co-viewing.

And advertisers can now measure their total YouTube CTV audience using real-time and census-level surveys in over 100 countries and 70 languages.

This means Heineken and Marvel Studios can measure the co-viewing of their Super Bowl ad in more than 100 markets around the globe where Heineken 0.0 non-alcoholic beer is sold, and/or 26 countries where “Ant-Man and The Wasp: Quantumania” is scheduled to be released three to five days after the Big Game.

It also enables Apple Music to measure the co-viewing of their Super Bowl LVII Halftime Show during Big Game parties worldwide (except Mainland China, Iran, North Korea, and Turkmenistan, where access to YouTube is currently blocked).

And, if FanDuel has already migrated to Google Analytics 4 (GA4), then the innovative sports-tech entertainment company can not only measure the co-viewing of their Big Game teasers on YouTube AdBlitz in 16 states where sports betting is legal, but also measure engaged-view conversions (EVCs) from YouTube within 3 days of viewing Rob Gronkowski’s attempt to kick a live field goal.

 

Advertisers couldn’t do that in 2022. But they could in a couple of weeks.

If advertisers want to follow their audience, then they should be moving some of their ad budgets out of Facebook, testing new tactics, and experimenting with new initiatives on YouTube in 2023.

Where should the advertisers leaving Twitter shift their budgets long term? And how will that change their Super Bowl strategies in the short term?

According to Similarweb, monthly visits to ads.twitter.com, the platform’s ad-buying portal, dropped 15% worldwide from 2.5 million in December 2021 to 2.1 million in December 2022.

So, advertisers were heading for the exit weeks before they learned that 500 top advertisers had left the platform.

Where Did Their Ad Budgets Go?

Well, it’s hard to track YouTube advertising, which is buried in Google’s sprawling ad business.

And we can’t use business.facebook.com as a proxy for interest in advertising on that platform because it’s used by businesses for other purposes, such as managing organic content on their Facebook pages.

But monthly visits to ads.snapchat.com, that platform’s ad-buying portal, jumped 88.3% from 1.6 million in December 2021 to 3.0 million in December 2022.

Monthly visits to ads.tiktok.com are up 36.6% from 5.1 million in December 2021 to 7.0 million in December 2022.

Monthly visits to ads.pinterest.com are up 23.3% from 1.1 million in December 2021 to 1.4 million in December 2022.

And monthly visits to business.linkedin.com are up 14.6% from 5.7 million in December 2021 to 6.5 million in December 2022.

It appears that lots of advertisers are hedging their bets by spreading their money around.

Now, most of them should probably continue to move their ad budgets into Snapchat, TikTok, Pinterest, and LinkedIn – unless the “Chief Twit” can find a way to keep his microblogging service from becoming “a free-for-all hellscape, where anything can be said with no consequences!”

How will advertisers leaving Twitter change their Super Bowl plan this year?

To double-check my analysis, I interviewed Joaquim Salguerio, who is the Paid Media Director at LINK Agency. He’s managed media budgets of over eight figures at multiple advertising agencies.

Below are my questions and his answers.

Greg Jarboe: “Which brands feel that Twitter has broken their trust since Musk bought the platform?”

Joaquim Salguerio: “I would say that several brands will have different reasonings for this break of trust.

First, if you’re an automaker, there’s suddenly a very tight relationship between Twitter and one of your competitors.

Second, advertisers that are quite averse to taking risks with their communications because of brand safety concerns might feel that they still need to be addressed.

Most of all, in a year where we’re seeing mass layoffs from several corporations, the Twitter troubles have given marketing teams a reason to re-evaluate its effectiveness during a time of budget cuts. That would be a more important factor than trust for most brands.

Obviously, there are some famous cases, such as the Lou Paskalis case, but it’s difficult to pinpoint a brand list that would have trust as their only concern.”

GJ: “Do you think it will be hard for Twitter to regain their trust before this year’s Super Bowl?”

JS: “It’s highly unlikely that any brand that has lost trust in Twitter will change its mind in the near future, and definitely not in time for the Super Bowl. Most marketing plans for the event will be finalized by now and recent communications by Twitter leadership haven’t signaled any change in direction.

If anything, from industry comments within my own network, I can say that comments from Musk recently (“Ads are too frequent on Twitter and too big. Taking steps to address both in coming weeks.”) were quite badly received. For any marketers that believe Twitter advertising isn’t sufficiently effective, this pushes them further away.

Brand communications should still occur on Twitter during Super Bowl though – it will have a peak in usage. And advertising verticals that should dominate the advertising space on Twitter are not the ones crossing the platform off their plans.”

GJ: “How do you think advertisers will change their Super Bowl plans around Twitter this year?”

JS: “The main change for advertising plans will likely be for brand comms amplification. As an example, the betting industry will likely be heavily present on Twitter during the game and I would expect little to no change in plans.

In the FMCG category, though, time sensitivity won’t be as important, which means that social media teams will likely be making an attempt at virality without relying as much on paid dollars.

If budgets are to diverge, they will likely be moved within the social space and toward platforms that will have user discussion/engagement from the Super Bowl (TikTok, Reddit, etc.)”

GJ: “What trends will we see in advertising budget allocation for this year’s Super Bowl?”

JS: “We should see budget planning much in line with previous years in all honesty. TV is still the most important media channel on Super Bowl day.

Digital spend will likely go towards social platforms; we predict growth in TikTok and Reddit advertising around the big day for most brands.

Twitter should still have a strong advertising budget allocated to the platform by the verticals aiming to get actions from users during the game (food delivery/betting/etc.).”

GJ: “Which platforms will benefit from this shift?”

JS: “Likely, we will see TikTok as the biggest winner from a shift in advertising dollars, as the growth numbers are making it harder to ignore the platform as a placement that needs to be in the plan.

Reddit can also capture some of this budget as it has the right characteristics marketers are looking for around the Super Bowl – it’s relevant to what’s happening at the moment and similar demographics.”

GJ: “Do you think advertisers that step away from Twitter for this year’s Big Game will stay away long term?”

JS: “That is impossible to know, as it’s completely dependent on how the platform evolves and the advertising solutions it will provide. Twitter’s proposition was always centered around brand marketing (their performance offering was always known to be sub-par).

Unless brand safety concerns are addressed by brands that decided to step away, it’s hard to foresee a change.

I would say that overall, Super Bowl ad spend on Twitter should not be as affected as it’s been portrayed – it makes sense to reach audiences where audiences are.

Especially if you know the mindset. The bigger issue is what happens when there isn’t a Super Bowl or a World Cup.”

Featured Image: Brocreative/Shutterstock



Is ChatGPT Use Of Web Content Fair?

Large Language Models (LLMs) like ChatGPT train using multiple sources of information, including web content. This data forms the basis of summaries of that content in the form of articles that are produced without attribution or benefit to those who published the original content used for training ChatGPT.

Search engines download website content (called crawling and indexing) to provide answers in the form of links to the websites.

Website publishers have the ability to opt out of having their content crawled and indexed by search engines through the Robots Exclusion Protocol, commonly referred to as Robots.txt.

The Robots Exclusion Protocol is not an official Internet standard but it’s one that legitimate web crawlers obey.
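
To illustrate how the opt-out works mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleLLMBot` and the example.com URL are hypothetical, chosen only for illustration. The sketch shows how a compliant crawler would interpret the rules; it says nothing about whether a given crawler actually obeys them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent, url="https://example.com/article"):
    """True when the rules above permit this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print("ExampleLLMBot:", may_crawl("ExampleLLMBot"))
print("Googlebot:", may_crawl("Googlebot"))
```

Because the rules are keyed on the user-agent string, this kind of opt-out only helps when a crawler identifies itself and chooses to honor the file.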

Should web publishers be able to use the Robots.txt protocol to prevent large language models from using their website content?

Large Language Models Use Website Content Without Attribution

Some who are involved with search marketing are uncomfortable with how website data is used to train machines without giving anything back, like an acknowledgement or traffic.

Hans Petter Blindheim (LinkedIn profile), Senior Expert at Curamando, shared his opinions with me.

Hans commented:

“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work because it offers credibility and as a professional courtesy.

It’s called a citation.

But the scale at which ChatGPT assimilates content and does not grant anything back differentiates it from both Google and people.

A website is generally created with a business directive in mind.

Google helps people find the content, providing traffic, which has a mutual benefit to it.

But it’s not like large language models asked your permission to use your content, they just use it in a broader sense than what was expected when your content was published.

And if the AI language models do not offer value in return – why should publishers allow them to crawl and use the content?

Does their use of your content meet the standards of fair use?

When ChatGPT and Google’s own ML/AI models trains on your content without permission, spins what it learns there and uses that while keeping people away from your websites – shouldn’t the industry and also lawmakers try to take back control over the Internet by forcing them to transition to an “opt-in” model?”

The concerns that Hans expresses are reasonable.

In light of how fast technology is evolving, should laws concerning fair use be reconsidered and updated?

I asked John Rizvi, a Registered Patent Attorney (LinkedIn profile) who is board certified in Intellectual Property Law, if Internet copyright laws are outdated.

John answered:

“Yes, without a doubt.

One major bone of contention in cases like this is the fact that the law inevitably evolves far more slowly than technology does.

In the 1800s, this maybe didn’t matter so much because advances were relatively slow and so legal machinery was more or less tooled to match.

Today, however, runaway technological advances have far outstripped the ability of the law to keep up.

There are simply too many advances and too many moving parts for the law to keep up.

As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t an entirely bad thing.

So, in one regard, yes, Intellectual Property law does need to evolve if it even purports, let alone hopes, to keep pace with technological advances.

The primary problem is striking a balance between keeping up with the ways various forms of tech can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.

The law also has to take care not to legislate against possible uses of tech so broadly as to strangle any potential benefit that may derive from them.

You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.

And attempting to envision every conceivable usage of technology years or decades before the framework exists to make it viable or even possible would be an exceedingly dangerous fool’s errand.

In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended.

That’s not likely to change anytime soon, unless we hit a massive and unanticipated tech plateau that allows the law time to catch up to current events.”

So it appears that copyright law has many considerations to balance when it comes to how AI is trained, and there is no simple answer.

OpenAI and Microsoft Sued

An interesting case that was recently filed alleges that OpenAI and Microsoft used open source code to create their Copilot product without crediting its authors.

The problem with using open source code this way is that many open source licenses require attribution.

According to an article published in a scholarly journal:

“Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement.

As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’

The resulting product allegedly omitted any credit to the original creators.”

The author of that article, who is a legal expert on the subject of copyrights, wrote that many view open source Creative Commons licenses as a “free-for-all.”

Some may also consider the phrase free-for-all a fair description of how datasets composed of Internet content are scraped and used to generate AI products like ChatGPT.

Background on LLMs and Datasets

Large language models train on multiple datasets of content. Datasets can consist of emails, books, government data, Wikipedia articles, and even collections of websites linked from Reddit posts that have at least three upvotes.

Many of the datasets related to the content of the Internet have their origins in the crawl created by a non-profit organization called Common Crawl.

Their dataset, the Common Crawl dataset, is available free for download and use.

The Common Crawl dataset is the starting point for many other datasets that are created from it.

For example, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).

This is how GPT-3 researchers used the website data contained within the Common Crawl dataset:

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.

This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.

However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.

Therefore, we took 3 steps to improve the average quality of our datasets:

(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,

(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and

(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”
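
To make step (2) more concrete, here is a toy sketch of what "fuzzy deduplication at the document level" can look like. It is not the researchers' pipeline. It assumes a simple technique (overlapping three-word shingles compared with Jaccard similarity), and the function names, the sample documents, and the 0.5 threshold are all made up for illustration.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedupe(documents: list, threshold: float = 0.5) -> list:
    """Keep a document only if it is not too similar to one already kept."""
    kept = []
    kept_shingles = []
    for doc in documents:
        current = shingles(doc)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


# Hypothetical sample documents: the second is a near copy of the first.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river",
    "search engines crawl and index website content to answer queries",
]
unique_docs = fuzzy_dedupe(docs)
print(len(unique_docs))
```

In this sample the near copy is dropped and the other two documents are kept. Comparing every document against every kept document is far too slow for web-scale data, so a real pipeline would need an approximate method instead.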

Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-to-Text Transfer Transformer (T5), has its roots in the Common Crawl dataset, too.

Their research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:

“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on their AI blog that further explains how Common Crawl data (which contains content scraped from the Internet) was used to create C4.

They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”
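
The kind of cleaning Google describes can be pictured with a short sketch. This is only an illustration of the three ideas named in the quote (deduplication, discarding incomplete sentences, removing unwanted content), not Google's actual code. The punctuation rule, the placeholder blocklist, and the sample lines are assumptions made for the example.

```python
# Assumption: a line counts as a "complete sentence" if it ends with one of these.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')

# Placeholder blocklist for illustration only; a real one would be far longer.
BLOCKLIST = {"lorem", "ipsum"}


def clean_lines(lines: list) -> list:
    """Drop incomplete, blocklisted, and exactly duplicated lines."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip()
        # Discard lines that do not end like a sentence (also skips empty lines).
        if not text.endswith(TERMINAL_PUNCTUATION):
            continue
        # Discard lines containing any blocklisted word.
        words = {word.strip('.,!?"').lower() for word in text.split()}
        if words & BLOCKLIST:
            continue
        # Discard exact duplicates of lines already kept.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = [
    "Common Crawl publishes web crawl data.",
    "Click here to read more",
    "Lorem ipsum dolor sit amet.",
    "Common Crawl publishes web crawl data.",
]
result = clean_lines(sample)
print(result)
```

Only the first line survives here: the second has no closing punctuation, the third matches the blocklist, and the fourth is a duplicate.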

Google, OpenAI, even Oracle’s Open Data are using Internet content, your content, to create datasets that are then used to create AI applications like ChatGPT.

Common Crawl Can Be Blocked

It is possible to block Common Crawl and, in doing so, opt out of all the datasets that are based on Common Crawl.

But if the site has already been crawled, then the website data is already in datasets. There is no way to remove your content from the Common Crawl dataset or from derivative datasets such as C4.

Using the Robots.txt protocol will only block future crawls by Common Crawl. It won’t stop researchers from using content already in the dataset.

How to Block Common Crawl From Your Data

Blocking Common Crawl is possible through the use of the Robots.txt protocol, within the limitations discussed above.

The Common Crawl bot is called CCBot.

It is identified using the most up to date CCBot User-Agent string: CCBot/2.0

Blocking CCBot with Robots.txt is accomplished the same way as with any other bot.

Here is the code for blocking CCBot with Robots.txt:

User-agent: CCBot
Disallow: /
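
If you want to confirm that the rule does what you expect, one option is to test it with Python's built-in robots.txt parser. This is a minimal sketch: the rules are parsed from a string so the check runs offline, and example.com and the is_blocked helper are placeholders, not anything Common Crawl provides.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above. To test a live site instead, point the
# parser at the real file with set_url() and read().
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules forbid user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
ccbot_blocked = is_blocked(ROBOTS_TXT, "CCBot/2.0", "https://example.com/any-page")
googlebot_blocked = is_blocked(ROBOTS_TXT, "Googlebot", "https://example.com/any-page")
print(ccbot_blocked, googlebot_blocked)
```

With only the CCBot rule in place, the check reports CCBot as blocked and leaves other crawlers, such as Googlebot, unaffected.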

CCBot crawls from Amazon AWS IP addresses.

CCBot also follows the nofollow Robots meta tag:

<meta name="robots" content="nofollow">
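
To check whether a page actually carries that tag, a small script can read the robots meta directives out of the HTML. The sketch below uses only Python's standard library; the class name, the helper function, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # The content attribute may list several comma-separated directives.
            for directive in attr_map.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives present in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
page = '<html><head><meta name="robots" content="nofollow"></head><body></body></html>'
has_nofollow = "nofollow" in robots_directives(page)
print(has_nofollow)
```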

What If You’re Not Blocking Common Crawl?

Web content can be downloaded without permission. That is how browsers work: they download content.

Neither Google nor anybody else needs permission to download and use content that is published publicly.

Website Publishers Have Limited Options

The consideration of whether it is ethical to train AI on web content doesn’t seem to be a part of any conversation about the ethics of how AI technology is developed.

It seems to be taken for granted that Internet content can be downloaded, summarized and transformed into a product called ChatGPT.

Does that seem fair? The answer is complicated.

Featured image by Shutterstock/Krakenimages.com




SEO

Google Updates Discover Follow Feed Guidelines


Google updated their Google Discover feed guidelines to emphasize the most important elements to include in the feed in order for it to be properly optimized.

Google Discover Feed

The Google Discover Follow feed feature offers relevant content to Chrome Android users and represents an important source of traffic that is matched to user interests.

The Google Discover Follow feature is a component of Google Discover, a way to capture a steady stream of traffic apart from Google News and Google Search.

Google’s Discover Follow feature works by allowing users to choose to receive updates about the latest content on a site they are interested in.

The way to participate in Discover Follow is through an optimized RSS or Atom feed.

If the feed is properly optimized on a website, users can choose to follow a website or a specific category of a website, depending on how the publisher configures their RSS/Atom feeds.

Audiences that follow a website will see the new content populate their Discover Follow feed, which in turn brings fresh waves of traffic to participating websites that are properly optimized.

According to Google:

“The Follow feature lets people follow a website and get the latest updates from that website in the Following tab within Discover in Chrome.

Currently, the Follow button is a feature that’s available to signed-in users in English in the US, New Zealand, South Africa, UK, Canada, and Australia that are using Chrome Android.”

Receiving traffic from the Discover Follow feature only happens for sites with properly optimized feeds that follow the Discover Follow feature guidelines.

Updated Guidance for Google Discover Follow Feature

Google updated their guidelines for the Discover Follow feature to emphasize the importance of the feed <title> element and the per item <link> elements, and to tell publishers to make sure their feeds include them.

The new guidance states:

“The most important content for the Follow feature is your feed <title> element and your per item <link> elements. Make sure your feed includes these elements.”

Presumably the absence of these two elements may result in Google being unable to understand the feed and display it for users, resulting in a loss of traffic.

Site publishers who participate in the Google Discover Follow feature should verify that their RSS or Atom feeds properly display the <title> and <link> elements.
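
One way to run that check is with a short script. The sketch below is an assumption-laden example, not a Google tool: it only handles RSS 2.0 feeds (Atom uses namespaced elements and would need different lookups), and the function name and sample feed are invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_rss_feed(xml_text: str) -> list:
    """Return a list of problems found with the feed <title> and item <link>s."""
    problems = []
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return ["no <channel> element found (is this an RSS 2.0 feed?)"]

    # The feed-level <title> must exist and contain text.
    title = channel.findtext("title", default="").strip()
    if not title:
        problems.append("feed <title> is missing or empty")

    # Every <item> should carry its own <link>.
    for position, item in enumerate(channel.findall("item"), start=1):
        link = item.findtext("link", default="").strip()
        if not link:
            problems.append(f"item {position} has no <link>")
    return problems


# Hypothetical feed: the second item is missing its <link>.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
    </item>
    <item>
      <title>Second post</title>
    </item>
  </channel>
</rss>
"""

feed_problems = check_rss_feed(SAMPLE_FEED)
print(feed_problems)
```

An empty list means both elements were found everywhere they were expected. For the sample feed, the script flags the second item.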

Google Discover Optimization

Publishers and SEOs are familiar with optimizing for Google Search.

But many content publishers may be unaware of how to optimize for Google Discover in order to enjoy the loads of traffic that result from properly optimizing for Google Discover and the Google Discover Follow feature.

The Follow Feed feature, a component of Google Discover, is a way to help ensure that the website obtains a steady stream of relevant traffic beyond organic search.

This is why it’s important to make sure that your RSS/Atom feeds are properly optimized.

Read Google’s announcement of the updated guidance and read the complete Follow Feature feed guidelines here.

Featured image by Shutterstock/fizkes


