SEO

Are ChatGPT, Bard and Dolly 2.0 Trained On Pirated Content?

Published

April 24, 2023

Large Language Models (LLMs) like ChatGPT, Bard and even open source versions are trained on public Internet content. But there are indications that popular AIs might also be trained on datasets created from pirated books.

Is Dolly 2.0 Trained on Pirated Content?

Dolly 2.0 is an open source AI that was recently released. The intent behind Dolly is to democratize AI by making it available to everyone who wants to create something with it, even commercial products.

But there’s also a privacy issue with concentrating AI technology in the hands of three major corporations and trusting them with private data.

Given a choice, many businesses would prefer to not hand off private data to third parties like Google, OpenAI and Meta.

Even Mozilla, the open source browser and app company, is investing in growing the open source AI ecosystem.

The intent behind open source AI is unquestionably good.

But there is an issue with the data that is used to train these large language models because some of it consists of pirated content.

The open source ChatGPT clone, Dolly 2.0, was created by a company called Databricks.

Dolly 2.0 is based on an open source Large Language Model (LLM) called Pythia, which was created by an open source group called EleutherAI.

EleutherAI created eight versions of LLMs of different sizes within the Pythia family of LLMs.

Databricks used one of them, the 12 billion parameter version, to create Dolly 2.0, together with a dataset that Databricks created itself (a dataset of questions and answers that was used to train Dolly 2.0 to follow instructions).

The thing about the EleutherAI Pythia LLM is that it was trained using a dataset called the Pile.

The Pile dataset is composed of multiple sets of English language texts, one of which is a dataset called Books3. The Books3 dataset contains the text of books that were pirated and hosted at a pirate site called Bibliotik.

This is what the Databricks announcement says:

“Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.”

Pythia LLM Was Created With the Pile Dataset

The Pythia research paper by EleutherAI mentions that Pythia was trained using the Pile dataset.

This is a quote from the Pythia research paper:

“We train 8 model sizes each on both the Pile …and the Pile after deduplication, providing 2 copies of the suite which can be compared.”

Deduplication means that they removed redundant data. It’s a process for creating a cleaner dataset.
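As a rough illustration of the idea only, exact deduplication can be sketched in Python by hashing each document and keeping the first copy. This is not the procedure EleutherAI used, which the quote above does not describe. The function name and the whitespace/case normalization are assumptions made for the example.

```python
import hashlib


def deduplicate(documents):
    """Keep the first occurrence of each document and drop exact repeats."""
    seen = set()
    unique = []
    for text in documents:
        # Assumption: collapse whitespace and case so trivial variants match.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


corpus = ["The cat sat.", "the  cat sat.", "A different page."]
print(deduplicate(corpus))
```

Real-world pipelines often go further and remove near-duplicates as well, which needs fuzzier matching than a plain hash.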

So what’s in the Pile? There’s a Pile research paper that explains what’s in that dataset.

Here’s a quote from the research paper for Pile where it says that they use the Books3 dataset:

“In addition we incorporate several existing high-quality datasets: Books3 (Presser, 2020)…”

The Pile dataset research paper links to a tweet by Shawn Presser that says what is in the Books3 dataset:

“Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.
Now you do. Now everyone does.
Presenting “books3”, aka “all of bibliotik”
– 196,640 books
– in plain .txt
– reliable, direct download, for years: https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz”

So… taken together, the quotes above show that the Pile dataset, which incorporates Books3, was used to train the Pythia LLM, which in turn served as the foundation for the Dolly 2.0 open source AI.

Is Google Bard Trained on Pirated Content?

The Washington Post recently published a review of Google’s Colossal Clean Crawled Corpus dataset (also known as C4) in which they discovered that Google’s dataset also contains pirated content.

The C4 dataset is important because it’s one of the datasets used to train Google’s LaMDA LLM, a version of which is what Bard is based on.

The actual dataset is called Infiniset and the C4 dataset makes up about 12.5% of the total text used to train LaMDA. Citations to those facts about Bard can be found here.

The Washington Post article reported:

“The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library.
Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department.
At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.”

The flaw in the Washington Post analysis is that they’re looking at a version of C4 but not necessarily the one that LaMDA was trained on.

The research paper for the C4 dataset was published in July 2020. Within a year of publication another research paper was published that discovered that the C4 dataset was biased against people of color and the LGBT community.

The research paper is titled Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.

The researchers discovered that the dataset contained negative sentiment against people of Arab identities and excluded documents associated with Black and Hispanic authors, as well as documents that mention sexual orientation.

The researchers wrote:

“Our examination of the excluded data suggests that documents associated with Black and Hispanic authors and documents mentioning sexual orientations are significantly more likely to be excluded by C4.EN’s blocklist filtering, and that many excluded documents contained non-offensive or non-sexual content (e.g., legislative discussions of same-sex marriage, scientific and medical content).
This exclusion is a form of allocational harms …and exacerbates existing (language-based) racial inequality as well as stigmatization of LGBTQ+ identities…
In addition, a direct consequence of removing such text from datasets used to train language models is that the models will perform poorly when applied to text from and about people with minority identities, effectively excluding them from the benefits of technology like machine translation or search.”

The researchers concluded that filtering “bad words” and other attempts to “clean” the dataset were too simplistic and warranted a more nuanced approach.

Those conclusions are important because they show that it was well known that the C4 dataset was flawed.

LaMDA was developed in 2022 (two years after the C4 dataset) and the associated LaMDA research paper says that it was trained with C4.

But that’s just a research paper. What happens in real life on a production model can be vastly different from what’s in the research paper.

When discussing a research paper it’s important to remember that Google consistently says that what’s in a patent or research paper isn’t necessarily what’s in use in Google’s algorithm.

Google is highly likely to be aware of those conclusions and it’s not unreasonable to assume that Google developed a new version of C4 for the production model, not just to address inequities in the dataset but to bring it up to date.

Google doesn’t say what’s in their algorithm. It’s a black box. So we can’t say with certainty that the technology underlying Google Bard was trained on pirated content.

To make it even clearer, Bard was released in 2023, using a lightweight version of LaMDA. Google has not defined what a lightweight version of LaMDA is.

So there’s no way to know what content was contained within the datasets used to train the lightweight version of LaMDA that powers Bard.

One can only speculate as to what content was used to train Bard.

Does GPT-4 Use Pirated Content?

OpenAI is extremely private about the datasets used to train GPT-4. The last time OpenAI mentioned datasets was in the research paper for GPT-3, published in 2020, and even there it’s somewhat vague and imprecise about what’s in the datasets.

The TowardsDataScience website in 2021 published an interesting review of the available information in which they conclude that indeed some pirated content was used to train early versions of GPT.

They write:

“…we find evidence that BookCorpus directly violated copyright restrictions for hundreds of books that should not have been redistributed through a free dataset.
For example, over 200 books in BookCorpus explicitly state that they “may not be reproduced, copied and distributed for commercial or non-commercial purposes.””

It’s difficult to conclude whether GPT-4 used any pirated content.

Is There A Problem With Using Pirated Content?

One would think that it may be unethical to use pirated content to train a large language model and profit from the use of that content.

But the laws may actually allow this kind of use.

I asked Kenton J. Hutcherson, Internet Attorney at Hutcherson Law, what he thought about the use of pirated content in the context of training large language models.

Specifically, I asked if someone uses Dolly 2.0, which may be partially created with pirated books, would commercial entities who create applications with Dolly 2.0 be exposed to copyright infringement claims?

Kenton answered:

“A claim for copyright infringement from the copyright holders of the pirated books would likely fail because of fair use.
Fair use protects transformative uses of copyrighted works.
Here, the pirated books are not being used as books for people to read, but as inputs to an artificial intelligence training dataset.
A similar example came into play with the use of thumbnails on search results pages. The thumbnails are not there to replace the webpages they preview. They serve a completely different function—they preview the page.
That is transformative use.”

Karen J. Bernstein of Bernstein IP offered a similar opinion.

“Is the use of the pirated content a fair use? Fair use is a commonly used defense in these instances.
The concept of the fair use defense only exists under US copyright law.
Fair use is analyzed under a multi-factor analysis that the Supreme Court set forth in a 1994 landmark case.
Under this scenario, there will be questions of how much of the pirated content was taken from the books and what was done to the content (was it “transformative”), and whether such content is taking the market away from the copyright creator.”

AI technology is bounding forward at an unprecedented pace, seemingly evolving on a week to week basis. Perhaps in a reflection of the competition and the financial windfall to be gained from success, Google and OpenAI are becoming increasingly private about how their AI models are trained.

Should they be more open about such information? Can they be trusted that their datasets are fair and non-biased?

The use of pirated content to create these AI models may be legally protected as fair use, but just because one can, does that mean one should?

Featured image by Shutterstock/Roman Samborskyi

SEO

2024 WordPress Vulnerability Report Shows Errors Sites Keep Making

Published

April 18, 2024

Max

2024 Annual WordPress security report by WPScan

WordPress security scanner WPScan’s 2024 WordPress vulnerability report calls attention to WordPress vulnerability trends and suggests the kinds of things website publishers (and SEOs) should be looking out for.

Among the report’s key findings: just over 20% of vulnerabilities were rated as high or critical level threats, while medium severity threats made up the majority at 67% of reported vulnerabilities. Many people treat medium level vulnerabilities as if they were low-level threats. That is a mistake, because they are not low level and deserve attention.

The WPScan report advised:

“While severity doesn’t translate directly to the risk of exploitation, it’s an important guideline for website owners to make an educated decision about when to disable or update the extension.”

WordPress Vulnerability Severity Distribution

Critical level vulnerabilities, the highest level of threat, represented only 2.38% of vulnerabilities, which is essentially good news for WordPress publishers. Yet as mentioned earlier, when combined with the percentage of high level threats (17.68%), the number of concerning vulnerabilities rises to just over 20%.

Here are the percentages by severity ratings:

Critical 2.38%
Low 12.83%
High 17.68%
Medium 67.12%
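The combined figure can be checked with a few lines of Python. This is only arithmetic on the percentages listed above; the variable names are illustrative.

```python
# Severity percentages as listed in the article.
severity = {"Critical": 2.38, "Low": 12.83, "High": 17.68, "Medium": 67.12}

# Share of vulnerabilities rated high or critical.
high_or_critical = round(severity["Critical"] + severity["High"], 2)

# The published figures are rounded, so the total need not be exactly 100.
total = round(sum(severity.values()), 2)

print(high_or_critical)
print(total)
```

The high and critical ratings add up to 20.06%, and the four rounded figures total 100.01%.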

Authenticated Versus Unauthenticated

Authenticated vulnerabilities are those that require an attacker to first attain user credentials and their accompanying permission levels in order to exploit a particular vulnerability. Exploits that require subscriber-level authentication are the most exploitable of the authenticated exploits and those that require administrator level access present the least risk (although not always a low risk for a variety of reasons).

Unauthenticated attacks are generally the easiest to exploit because anyone can launch an attack without having to first acquire a user credential.

The WPScan vulnerability report found that about 22% of reported vulnerabilities required subscriber level or no authentication at all, representing the most exploitable vulnerabilities. At the other end of the exploitability scale are vulnerabilities requiring admin permission levels, which represent a total of 30.71% of reported vulnerabilities.

Permission Levels Required For Exploits

Vulnerabilities requiring administrator level credentials represented the highest percentage of exploits, followed by Cross Site Request Forgery (CSRF) with 24.74% of vulnerabilities. This is interesting because CSRF is an attack that uses social engineering to get a victim to click a link from which the user’s permission levels are acquired. This is a risk that WordPress publishers should be aware of because all it takes is for an admin level user to follow a link, which then enables the hacker to assume admin level privileges on the WordPress website.

The following are the percentages of exploits ordered by the roles necessary to launch an attack.

Ascending Order Of User Roles For Vulnerabilities

Author 2.19%
Subscriber 10.4%
Unauthenticated 12.35%
Contributor 19.62%
CSRF 24.74%
Admin 30.71%
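As a quick check on how the most exploitable categories add up, the listed role percentages can be summed in Python. Only the figures above are used; the variable names are illustrative.

```python
# Percentages by required role, as listed in the article.
roles = {
    "Author": 2.19,
    "Subscriber": 10.4,
    "Unauthenticated": 12.35,
    "Contributor": 19.62,
    "CSRF": 24.74,
    "Admin": 30.71,
}

# Vulnerabilities needing subscriber-level access or no authentication at all.
minimal_auth = round(roles["Subscriber"] + roles["Unauthenticated"], 2)

# Roles ranked from most to least common.
ranked = sorted(roles, key=roles.get, reverse=True)

print(minimal_auth)
print(ranked)
```

The subscriber and unauthenticated categories sum to 22.75%, which is the figure behind the "about 22%" cited earlier.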

Most Common Vulnerability Types Requiring Minimal Authentication

Broken Access Control in the context of WordPress refers to a security failure that can allow an attacker without necessary permission credentials to gain access to higher credential permissions.

In the section of the report that looks at the occurrences and vulnerabilities underlying unauthenticated or subscriber level vulnerabilities reported (Occurrence vs Vulnerability on Unauthenticated or Subscriber+ reports), WPScan breaks down the percentages for each vulnerability type that is most common for exploits that are the easiest to launch (because they require minimal to no user credential authentication).

The WPScan threat report noted that Broken Access Control represents a whopping 84.99% followed by SQL injection (20.64%).

The Open Worldwide Application Security Project (OWASP) defines Broken Access Control as:

“Access control, sometimes called authorization, is how a web application grants access to content and functions to some users and not others. These checks are performed after authentication, and govern what ‘authorized’ users are allowed to do.
Access control sounds like a simple problem but is insidiously difficult to implement correctly. A web application’s access control model is closely tied to the content and functions that the site provides. In addition, the users may fall into a number of groups or roles with different abilities or privileges.”

SQL injection, at 20.64%, represents the second most prevalent type of vulnerability, which WPScan referred to as both “high severity and risk” in the context of vulnerabilities requiring minimal authentication levels, because attackers can access and/or tamper with the database, which is the heart of every WordPress website.

These are the percentages:

Broken Access Control 84.99%
SQL Injection 20.64%
Cross-Site Scripting 9.4%
Unauthenticated Arbitrary File Upload 5.28%
Sensitive Data Disclosure 4.59%
Insecure Direct Object Reference (IDOR) 3.67%
Remote Code Execution 2.52%
Other 14.45%

Vulnerabilities In The WordPress Core Itself

The overwhelming majority of vulnerability issues were reported in third-party plugins and themes. However, a total of 13 vulnerabilities were reported in the WordPress core itself in 2023. Of those thirteen, only one was rated as a high severity threat. High is the second highest rating under the Common Vulnerability Scoring System (CVSS), with Critical being the highest.

The WordPress core platform itself is held to the highest standards and benefits from a worldwide community that is vigilant in discovering and patching vulnerabilities.

Website Security Should Be Considered As Technical SEO

Site audits don’t normally cover website security but in my opinion every responsible audit should at least talk about security headers. As I’ve been saying for years, website security quickly becomes an SEO issue once a website’s rankings start disappearing from the search engine results pages (SERPs) due to being compromised by a vulnerability. That’s why it’s critical to be proactive about website security.

According to the WPScan report, the main points of entry for hacked websites were leaked credentials and weak passwords. Ensuring strong password standards plus two-factor authentication is an important part of every website’s security stance.

Using security headers is another way to help protect against Cross-Site Scripting and other kinds of vulnerabilities.
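As a minimal sketch of what an audit check for security headers might look like, the Python function below reports which commonly recommended headers are absent from a set of response headers. The header list is an assumption chosen for illustration, not a list taken from the WPScan report, and the sample headers are made up.

```python
# Assumption: a commonly cited set of security headers, not an official list.
RECOMMENDED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(response_headers):
    """Return the recommended headers that are absent (names are case-insensitive)."""
    present = {name.lower() for name in response_headers}
    return [name for name in RECOMMENDED_HEADERS if name.lower() not in present]


# Hypothetical response headers from a site under audit.
sample = {
    "Content-Type": "text/html",
    "x-frame-options": "SAMEORIGIN",
    "Strict-Transport-Security": "max-age=31536000",
}
print(missing_security_headers(sample))
```

The presence of a header says nothing about whether its value is well configured, so a result like this is a starting point for the audit conversation rather than a verdict.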

Lastly, a WordPress firewall and website hardening are also useful proactive approaches to website security. I once added a forum to a brand new website I created and it was immediately under attack within minutes. Believe it or not, virtually every website worldwide is under attack 24 hours a day by bots scanning for vulnerabilities.

Read the WPScan Report:

WPScan 2024 Website Threat Report

Featured Image by Shutterstock/Ljupco Smokovski

SEO

An In-Depth Guide And Best Practices For Mobile SEO

Published

April 18, 2024

Max

Mobile SEO: An In-Depth Guide And Best Practices

Over the years, search engines have encouraged businesses to improve mobile experience on their websites. More than 60% of web traffic comes from mobile, and in some cases based on the industry, mobile traffic can reach up to 90%.

Since Google has completed its switch to mobile-first indexing, the question is no longer “if” your website should be optimized for mobile, but how well it is adapted to meet these criteria. A new challenge has emerged for SEO professionals with the introduction of Interaction to Next Paint (INP), which replaced First Input Delay (FID) on March 12, 2024.

Thus, understanding mobile SEO’s latest advancements, especially with the shift to INP, is crucial. This guide offers practical steps to optimize your site effectively for today’s mobile-focused SEO requirements.

What Is Mobile SEO And Why Is It Important?

The goal of mobile SEO is to optimize your website to attain better visibility in search engine results specifically tailored for mobile devices.

This form of SEO not only aims to boost search engine rankings, but also prioritizes enhancing mobile user experience through both content and technology.

While, in many ways, mobile SEO and traditional SEO share similar practices, additional steps related to site rendering and content are required to meet the needs of mobile users and the speed requirements of mobile devices.

Does this need to be a priority for your website? How urgent is it?

Consider this: 58% of the world’s web traffic comes from mobile devices.

If you aren’t focused on mobile users, there is a good chance you’re missing out on a tremendous amount of traffic.

Mobile-First Indexing

Additionally, as of 2023, Google has switched its crawlers to a mobile-first indexing priority.

This means that the mobile experience of your site is critical to maintaining efficient indexing, which is the step before ranking algorithms come into play.

How Much Of Your Traffic Is From Mobile?

How much traffic potential you have with mobile users can depend on various factors, including your industry (B2B sites might attract primarily desktop users, for example) and the search intent your content addresses (users might prefer desktop for larger purchases, for example).

Regardless of where your industry and the search intent of your users might be, the future will demand that you optimize your site experience for mobile devices.

How can you assess your current mix of mobile vs. desktop users?

An easy way to see what percentage of your users is on mobile is to go into Google Analytics 4.

Click Reports in the left column.
Click on the Insights icon on the right side of the screen.
Scroll down to Suggested Questions and click on it.
Click on Technology.
Click on Top Device model by Users.
Then click on Top Device category by Users under Related Results.
The breakdown of Top Device category will match the date range selected at the top of GA4.

Screenshot from GA4, March 2024

You can also set up a report in Looker Studio.

Add your site to the Data source.
Add Device category to the Dimension field.
Add 30-day active users to the Metric field.
Click on Chart to select the view that works best for you.

A screen capture from Looker Studio showing a pie chart with a breakdown of mobile, desktop, tablet, and Smart TV users for a site

Screenshot from Looker Studio, March 2024

You can add more Dimensions to really dig into the data to see which pages attract which type of users, what the mobile-to-desktop mix is by country, which search engines send the most mobile users, and so much more.

How To Check If Your Site Is Mobile-Friendly

Now that you know how to build a report on mobile and desktop usage, you need to figure out if your site is optimized for mobile traffic.

While Google removed the mobile-friendly testing tool from Google Search Console in December 2023, there are still a number of useful tools for evaluating your site for mobile users.

Bing still has a mobile-friendly testing tool that will tell you the following:

Viewport is configured correctly.
Page content fits device width.
Text on the page is readable.
Links and tap targets are sufficiently large and touch-friendly.
Any other issues detected.
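The first item on that list can be approximated locally. The Python sketch below looks for a viewport meta tag that sets the width to the device width. It illustrates the concept only and is not how Bing’s tool works; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class ViewportFinder(HTMLParser):
    """Record the content of a <meta name="viewport"> tag, if one exists."""

    def __init__(self):
        super().__init__()
        self.viewport_content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "viewport":
            self.viewport_content = attributes.get("content") or ""


def has_responsive_viewport(html):
    """True when the page declares a viewport containing width=device-width."""
    finder = ViewportFinder()
    finder.feed(html)
    content = finder.viewport_content
    return content is not None and "width=device-width" in content.replace(" ", "")


good_page = '<html><head><meta name="viewport" content="width=device-width, initial-scale=1"></head></html>'
bad_page = "<html><head><title>No viewport</title></head></html>"
print(has_responsive_viewport(good_page))
print(has_responsive_viewport(bad_page))
```

A passing result only confirms that the tag is present. Whether content actually fits the screen still needs one of the rendering-based tools described here.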

Google’s Lighthouse Chrome extension provides you with an evaluation of your site’s performance across several factors, including load times, accessibility, and SEO.

To use, install the Lighthouse Chrome extension.

Go to your website in your browser.
Click on the orange lighthouse icon in your browser’s address bar.
Click Generate Report.
A new tab will open and display your scores once the evaluation is complete.

An image showing the Lighthouse Scores for a website.

Screenshot from Lighthouse, March 2024

You can also use the Lighthouse report in Developer Tools in Chrome.

Simply click on the three dots next to the address bar.
Select “More Tools.”
Select Developer Tools.
Click on the Lighthouse tab.
Choose “Mobile” and click the “Analyze page load” button.

An image showing how to get to Lighthouse within Google Chrome Developer Tools.

Screenshot from Lighthouse, March 2024

Another option that Google offers is the PageSpeed Insights (PSI) tool. Simply add your URL into the field and click Analyze.

PSI will integrate any Core Web Vitals scores into the resulting view so you can see what your users are experiencing when they come to your site.

An image showing the PageSpeed Insights scores for a website.

Screenshot from PageSpeed Insights, March 2024

Other tools, like WebPageTest.org, will graphically display the processes and load times for everything it takes to display your webpages.

With this information, you can see which processes block the loading of your pages, which ones take the longest to load, and how this affects your overall page load times.

You can also emulate the mobile experience by using Developer Tools in Chrome, which allows you to switch back and forth between a desktop and mobile experience.

An image showing how to change the device emulation for a site within Google Chrome Developer Tools

Screenshot from Google Chrome Developer Tools, March 2024

Lastly, use your own mobile device to load and navigate your website:

Does it take forever to load?
Are you able to navigate your site to find the most important information?
Is it easy to add something to cart?
Can you read the text?

How To Optimize Your Site Mobile-First

With all these tools, keep an eye on the Performance and Accessibility scores, as these directly affect mobile users.

Expand each section within the PageSpeed Insights report to see what elements are affecting your score.

These sections can give your developers their marching orders for optimizing the mobile experience.

While mobile speeds for cellular networks have steadily improved around the world (the average speed in the U.S. has jumped to 27.06 Mbps from 11.14 Mbps in just eight years), speed and usability for mobile users are at a premium.

Best Practices For Mobile Optimization

Unlike traditional SEO, which can focus heavily on ensuring that you are using the language of your users as it relates to the intersection of your products/services and their needs, optimizing for mobile SEO can seem heavy on technical SEO.

While you still need to be focused on matching your content with the needs of the user, mobile search optimization will require the aid of your developers and designers to be fully effective.

Below are several key factors in mobile SEO to keep in mind as you’re optimizing your site.

Site Rendering

How your site responds to different devices is one of the most important elements in mobile SEO.

The two most common approaches to this are responsive design and dynamic serving.

Responsive design is the most common of the two options.

Using your site’s cascading style sheets (CSS) and flexible layouts, as well as responsive content delivery networks (CDN) and modern image file types, responsive design allows your site to adjust to a variety of screen sizes, orientations, and resolutions.

With the responsive design, elements on the page adjust in size and location based on the size of the screen.

You can simply resize the window of your desktop browser and see how this works.

An image showing the difference between Web.dev in a full desktop display vs. a mobile display using responsive design.

Screenshot from web.dev, March 2024

This is the approach that Google recommends.

Adaptive design, also known as dynamic serving, consists of multiple fixed layouts that are dynamically served to the user based on their device.

Sites can have a separate layout for desktop, smartphone, and tablet users. Each design can be modified to remove functionality that may not make sense for certain device types.

This is a less efficient approach, but it does give sites more control over what each device sees.

While these will not be covered here, there are two other options:

Progressive Web Apps (PWA), which can seamlessly integrate into a mobile app.
Separate mobile site/URL (which is no longer recommended).

Interaction to Next Paint (INP)

Google has introduced Interaction to Next Paint (INP) as a more comprehensive measure of user experience, succeeding First Input Delay. FID measures the time from when a user first interacts with your page (e.g., clicking a link, tapping a button) to the time when the browser is actually able to begin processing event handlers in response to that interaction. INP, on the other hand, broadens the scope by measuring the responsiveness of a website throughout the entire lifespan of a page, not just the first interaction.

Note that actions such as hovering and scrolling do not influence INP. However, keyboard-driven scrolling or navigation counts as keystrokes, which may trigger events that INP measures. The scrolling that results from such an interaction is itself not measured.

Scrolling may indirectly affect INP, for example in scenarios where users scroll through content and additional content is lazy-loaded from an API. While the act of scrolling itself isn’t included in the INP calculation, the processing necessary for loading additional content can create contention on the main thread, thereby increasing interaction latency and adversely affecting the INP score.

What qualifies as an optimal INP score?

An INP under 200ms indicates good responsiveness.
Between 200ms and 500ms needs improvement.
Over 500ms means the page has poor responsiveness.
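These thresholds are easy to encode when you are bucketing measurements from your own monitoring data. The Python function below is a minimal sketch. Treating exactly 200ms as good and exactly 500ms as needing improvement is an assumption about the boundary cases, and the function name is illustrative.

```python
def classify_inp(inp_ms):
    """Map an INP value in milliseconds to a responsiveness rating."""
    if inp_ms < 0:
        raise ValueError("INP cannot be negative")
    # Assumption: boundary values fall into the better bucket.
    if inp_ms <= 200:
        return "good"
    if inp_ms <= 500:
        return "needs improvement"
    return "poor"


for sample_ms in (120, 350, 800):
    print(sample_ms, classify_inp(sample_ms))
```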

These are common issues that cause poor INP scores:

Long JavaScript Tasks: Heavy JavaScript execution can block the main thread, delaying the browser’s ability to respond to user interactions. To avoid this, break long JavaScript tasks into smaller chunks, for example by using the Scheduler API.
Large DOM (HTML) Size: A large DOM (starting from 1500 elements) can severely impact a website’s interactive performance. Every additional DOM element increases the work required to render pages and respond to user interactions.
Inefficient Event Callbacks: Event handlers that execute lengthy or complex operations can significantly affect INP scores. Poorly optimized callbacks attached to user interactions, like clicks, key presses, or taps, can block the main thread, delaying the browser’s ability to render visual feedback promptly. This happens, for example, when handlers perform heavy computations or initiate synchronous network requests on click.
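The Scheduler API is a browser JavaScript feature, so it cannot be shown directly in Python. As a language-neutral sketch of the same chunk-and-yield idea, the example below uses Python’s asyncio event loop as a stand-in for the browser’s main thread: the long job is processed in small chunks and control is handed back between chunks so that a pending "interaction" can run. All names here are hypothetical.

```python
import asyncio


async def process_in_chunks(items, chunk_size, handle):
    """Process items in small batches, yielding to the event loop between batches."""
    results = []
    for start in range(0, len(items), chunk_size):
        for item in items[start:start + chunk_size]:
            results.append(handle(item))
        # Yield control so other pending work (the stand-in for input handling) can run.
        await asyncio.sleep(0)
    return results


async def demo():
    log = []

    async def interaction():
        log.append("interaction handled")

    async def long_job():
        output = await process_in_chunks(list(range(6)), 2, lambda x: x * x)
        log.append("work finished")
        return output

    squares, _ = await asyncio.gather(long_job(), interaction())
    return squares, log


squares, log = asyncio.run(demo())
print(squares)
print(log)
```

Because the job yields after its first chunk, the interaction is handled before the work finishes. That ordering is the effect that breaking up long tasks aims for in the browser.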

You can troubleshoot INP issues using free and paid tools.

As a good starting point, I recommend checking your INP scores by geo via treo.sh, which gives you high-level insight into where you struggle the most.

INP scores by Geos

Image Optimization

Images add a lot of value to the content on your site and can greatly affect the user experience.

From page speeds to image quality, you could adversely affect the user experience if you haven’t optimized your images.

This is especially true for the mobile experience. Images need to adjust to smaller screens, varying resolutions, and screen orientation.

Use responsive images
Implement lazy loading
Compress your images (use WebP)
Add your images to your sitemap
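The first two items can be combined in a single image tag. The Python helper below is a sketch that builds an img element with srcset, sizes, and loading="lazy" attributes. The file naming pattern (base path plus width) and the sizes="100vw" value are assumptions for illustration, so adjust them to match how your images are generated and laid out.

```python
from html import escape


def responsive_img(base, ext, widths, alt):
    """Build an <img> tag with a srcset of width variants and native lazy loading."""
    # Assumption: variants are stored as "<base>-<width>w.<ext>".
    srcset = ", ".join(f"{base}-{w}w.{ext} {w}w" for w in widths)
    fallback = f"{base}-{max(widths)}w.{ext}"
    return (
        f'<img src="{fallback}" srcset="{srcset}" '
        f'sizes="100vw" loading="lazy" alt="{escape(alt, quote=True)}">'
    )


tag = responsive_img("/img/hero", "webp", [480, 960], "Hero")
print(tag)
```

Lazy loading suits images further down the page. Images visible on first load are usually better left to load eagerly.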

Optimizing images is an entire science, and I advise you to read our comprehensive guide on image SEO to learn how to implement these recommendations.

Avoid Intrusive Interstitials

Google rarely uses concrete language to state that something is a ranking factor or will result in a penalty, so you know it means business about intrusive interstitials in the mobile experience.

Intrusive interstitials are basically pop-ups on a page that prevent the user from seeing content on the page.

John Mueller, Google’s Senior Search Analyst, stated that they are specifically interested in the first interaction a user has after clicking on a search result.

Examples of intrusive interstitial pop-ups on a mobile site according to Google.

Not all pop-ups are considered bad. Interstitial types that are considered “intrusive” by Google include:

Pop-ups that cover most or all of the page content.
Non-responsive interstitials or pop-ups that are impossible for mobile users to close.
Pop-ups that are not triggered by a user action, such as a scroll or a click.

Structured Data

Most of the tips provided in this guide so far are focused on usability and speed and have an additive effect, but there are changes that can directly influence how your site appears in mobile search results.

Search engine results pages (SERPs) haven’t been the “10 blue links” in a very long time.

They now reflect the diversity of search intent, showing a variety of different sections to meet the needs of users. Local Pack, shopping listing ads, video content, and more dominate the mobile search experience.

As a result, it’s more important than ever to provide structured data markup to the search engines, so they can display rich results for users.

In this example, you can see that both Zojirushi and Amazon have included structured data for their rice cookers, and Google is displaying rich results for both.

An image of a search result for Japanese rice cookers that shows rich results for Zojirushi and Amazon.

Screenshot from search for [Japanese rice cookers], Google, March 2024

Adding structured data markup to your site can influence how well your site shows up for local searches and product-related searches.

Using JSON-LD, you can mark up the business, product, and services data on your pages in Schema markup.
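To make this concrete, here is a minimal sketch of generating Product markup as JSON-LD. The `@context`, `@type`, `offers`, `priceCurrency`, and `availability` properties come from the Schema.org vocabulary; the `ProductInput` shape, the function names, and any product values you pass in are hypothetical and only serve the example.

```typescript
// Input data for one product page (hypothetical shape).
interface ProductInput {
  name: string;
  description: string;
  imageUrl: string;
  brand: string;
  price: number;
  currency: string; // ISO 4217 code, e.g. "USD"
  inStock: boolean;
}

// Map the input onto a Schema.org Product with a nested Offer.
function buildProductJsonLd(p: ProductInput): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: p.name,
    description: p.description,
    image: p.imageUrl,
    brand: { "@type": "Brand", name: p.brand },
    offers: {
      "@type": "Offer",
      price: p.price.toFixed(2),
      priceCurrency: p.currency,
      availability: p.inStock
        ? "https://schema.org/InStock"
        : "https://schema.org/OutOfStock",
    },
  };
}

// Wrap the data in the script tag that carries JSON-LD in a page.
function toScriptTag(data: Record<string, unknown>): string {
  // Escape "<" so a value containing "</script>" cannot close the tag early.
  const json = JSON.stringify(data, null, 2).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">\n${json}\n</script>`;
}
```

The resulting tag can be placed in the page's HTML. It is worth validating the output with a structured data testing tool before relying on it, since search engines decide on their own whether to show rich results.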

If you use WordPress as the content management system for your site, there are several plugins available that will automatically mark up your content with structured data.

Content Style

When you think about your mobile users and the screens on their devices, this can greatly influence how you write your content.

Mobile users prefer concise writing over long, detailed paragraphs.

Each key point in your content should be a single line of text that easily fits on a mobile screen.

Your font sizes should adjust to the screen’s resolution to avoid eye strain for your users.

If possible, allow for a dark or dim mode for your site to further reduce eye strain.

Headers should be concise and address the searcher’s intent. Rather than lengthy section headers, keep it simple.

Finally, make sure that your text renders in a font size that’s readable.

Tap Targets

As important as text size, the tap targets on your pages should be sized and laid out appropriately.

Tap targets include navigation elements, links, form fields, and buttons like “Add to Cart” buttons.

Targets smaller than 48 pixels by 48 pixels and targets that overlap or are overlapped by other page elements will be called out in the Lighthouse report.
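The two checks described above, minimum size and overlap, can be approximated with a few lines of code. This is a simplified sketch and not how Lighthouse implements its audit; the `Rect` shape and the function names are assumptions. In a browser, the rectangles could come from `getBoundingClientRect()` on each interactive element.

```typescript
// Position and size of one tap target in CSS pixels (hypothetical shape).
interface Rect {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Minimum side length taken from the 48 by 48 pixel guideline above.
const MIN_TAP_SIZE = 48;

function isTooSmall(r: Rect, min: number = MIN_TAP_SIZE): boolean {
  return r.width < min || r.height < min;
}

// Two axis-aligned rectangles overlap when they intersect on both axes.
function overlaps(a: Rect, b: Rect): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Return a human-readable list of size and overlap problems.
function auditTapTargets(targets: Rect[]): string[] {
  const problems: string[] = [];
  targets.forEach((target, i) => {
    if (isTooSmall(target)) {
      problems.push(`${target.id}: smaller than ${MIN_TAP_SIZE}x${MIN_TAP_SIZE}`);
    }
    for (let j = i + 1; j < targets.length; j++) {
      if (overlaps(target, targets[j])) {
        problems.push(`${target.id} overlaps ${targets[j].id}`);
      }
    }
  });
  return problems;
}
```

An empty result means no target in the list failed either check; it does not replace running the actual Lighthouse report.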

Tap targets are essential to the mobile user experience, especially for ecommerce websites, so optimizing them is vital to the health of your online business.

Prioritizing These Tips

If you have delayed making your site mobile-friendly until now, this guide may feel overwhelming. As a result, you may not know what to prioritize first.

As with so many other optimizations in SEO, it’s important to understand which changes will have the greatest impact, and this is just as true for mobile SEO.

Think of SEO as a framework in which your site’s technical aspects are the foundation of your content. Without a solid foundation, even the best content may struggle to rank.

Responsive or Dynamic Rendering: If your site requires the user to zoom and scroll right or left to read the content on your pages, no number of other optimizations can help you. This should be first on your list.
Content Style: Rethink how your users will consume your content online. Avoid very long paragraphs. “Brevity is the soul of wit,” to quote Shakespeare.
Image Optimization: Begin migrating your images to next-gen image formats and optimize your content delivery network for speed and responsiveness.
Tap Targets: A site that prevents users from navigating or converting into sales won’t be in business long. Make navigation, links, and buttons usable for them.
Structured Data: While this element ranks last in priority on this list, rich results can improve your chances of receiving traffic from a search engine, so add this to your to-do list once you’ve completed the other optimizations.

Summary

From How Search Works, “Google’s mission is to organize the world’s information and make it universally accessible and useful.”

If Google’s primary mission is focused on making all the world’s information accessible and useful, then you know they will prefer surfacing sites that align with that vision.

Since a growing percentage of users are on mobile devices, you may want to read the mission statement as if the word “everywhere” were added to the end.

Are you missing out on traffic from mobile devices because of a poor mobile experience?

If you hope to remain relevant, make mobile SEO a priority now.

Featured Image: Paulo Bobita/Search Engine Journal

SEO

HARO Has Been Dead for a While

Published

April 17, 2024

Entireweb News Bot

Every SEO’s favorite ~~link-building~~ collaboration tool, HARO, was officially killed off for good last week by Cision. It’s now been wrapped into a new product: Connectively.

I know nothing about the new tool. I haven’t tried it. But after trying to use HARO recently, I can’t say I’m surprised or saddened by its death. It’s been a walking corpse for a while.

I used HARO way back in the day to build links. It worked. But a couple of months ago, I experienced the platform from the other side when I decided to try to source some “expert” insights for our posts.

After just a few minutes of work, I got hundreds of pitches:

So, I grabbed a cup of coffee and began to work through them. It didn’t take long before I lost the will to live. Every other pitch seemed like nothing more than lazy AI-generated nonsense from someone who definitely wasn’t an expert.

Here’s one of them:

Example of an AI-generated pitch in HARO

Seriously. Who writes like that? I’m a self-confessed dullard (any fellow Dull Men’s Club members here?), and even I’m not that dull…

I don’t think I looked through more than 30-40 of the responses. I just couldn’t bring myself to do it. It felt like having a conversation with ChatGPT… and not a very good one!

Despite only reviewing a few dozen of the many pitches I received, one stood out to me:

Example HARO pitch that caught my attention

Believe it or not, this response came from a past client of mine who runs an SEO agency in the UK. Given how knowledgeable and experienced he is (he actually taught me a lot about SEO back in the day when I used to hassle him with questions on Skype), this pitch rang alarm bells for two reasons:

I truly doubt he spends his time replying to HARO queries
I know for a fact he’s no fan of Neil Patel (sorry, Neil, but I’m sure you’re aware of your reputation at this point!)

So… I decided to confront him 😉

Here’s what he said:

Shocker.

I pressed him for more details:

I’m getting a really good deal and paying per link rather than the typical £xxxx per month for X number of pitches. […] The responses as you’ve seen are not ideal but that’s a risk I’m prepared to take as realistically I dont have the time to do it myself. He’s not native english, but I have had to have a word with him a few times about clearly using AI. On the low cost ones I don’t care but on authority sites it needs to be more refined.

I think this pretty much sums up the state of HARO before its death. Most “pitches” were just AI answers from SEOs trying to build links for their clients.

Don’t get me wrong. I’m not throwing shade here. I know that good links are hard to come by, so you have to do what works. And the reality is that HARO did work. Just look at the example below. You can tell from the anchor and surrounding text in Ahrefs that these links were almost certainly built with HARO:

Example of links built with HARO, via Ahrefs' Site Explorer

But this was the problem. HARO worked so well back in the day that it was only a matter of time before spammers and the #scale crew ruined it for everyone. That’s what happened, and now HARO is no more. So…

If you’re a link builder, I think it’s time to admit that HARO link building is dead and move on.

No tactic works well forever. It’s the law of sh**ty clickthroughs. This is why you don’t see SEOs having huge success with tactics like broken link building anymore. They’ve moved on to more innovative tactics or, dare I say it, are just buying links.

Sidenote.

Talking of buying links, here’s something to ponder: if Connectively charges for pitches, are links built through those pitches technically paid? If so, do they violate Google’s spam policies? It’s a murky old world this SEO lark, eh?

If you’re a journalist, Connectively might be worth a shot. But with experts being charged for pitches, you probably won’t get as many responses. That might be a good thing. You might get less spam. Or you might just get spammed by SEOs with deep pockets. The jury’s out for now.

My advice? Look for alternative methods like finding and reaching out to experts directly. You can easily use tools like Content Explorer to find folks who’ve written lots of content about the topic and are likely to be experts.

For example, if you look for content with “backlinks” in the title and go to the Authors tab, you might see a familiar name. 😉

Finding people to request insights from in Ahrefs' Content Explorer

I don’t know if I’d call myself an expert, but I’d be happy to give you a quote if you reached out on social media or emailed me (here’s how to find my email address).

Alternatively, you can bait your audience into giving you their insights on social media. I did this recently with a poll on X and included many of the responses in my guide to toxic backlinks.

Me, indirectly sourcing insights on social media

Either of these options is quicker than using HARO because you don’t have to sift through hundreds of responses looking for a needle in a haystack. If you disagree with me and still love HARO, feel free to tell me why on X 😉