SEO

Google Bard AI – What Sites Were Used To Train It?

Google’s Bard is based on the LaMDA language model, which was trained on Infiniset, a collection of datasets built from Internet content. Very little is known about where that data came from or how Google obtained it.

The 2022 LaMDA research paper lists percentages of different kinds of data used to train LaMDA, but only 12.5% comes from a public dataset of crawled content from the web and another 12.5% comes from Wikipedia.

Google is purposely vague about where the rest of the scraped data comes from but there are hints of what sites are in those datasets.

Google’s Infiniset Dataset

Google Bard is based on a language model called LaMDA, which is an acronym for Language Model for Dialogue Applications.

LaMDA was trained on a dataset called Infiniset.

Infiniset is a blend of Internet content that was deliberately chosen to enhance the model’s ability to engage in dialogue.

The LaMDA research paper (PDF) explains why they chose this composition of content:

“…this composition was chosen to achieve a more robust performance on dialog tasks …while still keeping its ability to perform other tasks like code generation.

As future work, we can study how the choice of this composition may affect the quality of some of the other NLP tasks performed by the model.”

The research paper uses the spellings “dialog” and “dialogs,” which are the standard spellings in this context within computer science.

In total, LaMDA was pre-trained on 1.56 trillion words of “public dialog data and web text.”

The dataset is composed of the following mix:

  • 12.5% C4-based data
  • 12.5% English language Wikipedia
  • 12.5% code documents from programming Q&A websites, tutorials, and others
  • 6.25% English web documents
  • 6.25% Non-English web documents
  • 50% dialogs data from public forums
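
As a rough illustration of scale, the percentages above can be applied to the 1.56 trillion word total. This is only a back-of-the-envelope sketch: it assumes each percentage is a share of the total word count, and the dictionary keys are labels chosen for this example.

```python
# Rough word counts per Infiniset component.
# Assumption: each percentage is a share of the 1.56 trillion pre-training words.
TOTAL_WORDS = 1.56e12

INFINISET_MIX = {
    "C4-based data": 12.5,
    "English language Wikipedia": 12.5,
    "Code documents (programming Q&A, tutorials, others)": 12.5,
    "English web documents": 6.25,
    "Non-English web documents": 6.25,
    "Dialogs data from public forums": 50.0,
}


def words_per_component(mix, total_words):
    """Return an estimated word count for each component of the mix."""
    return {name: total_words * pct / 100 for name, pct in mix.items()}


estimates = words_per_component(INFINISET_MIX, TOTAL_WORDS)

# The shares should account for the whole dataset.
assert sum(INFINISET_MIX.values()) == 100.0

for name, words in estimates.items():
    print(f"{name}: ~{words / 1e9:.1f} billion words")
```

Under that assumption, the forum dialog data alone would account for roughly 780 billion words, and each 12.5% slice for roughly 195 billion.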

The first two parts of Infiniset (C4 and Wikipedia) are composed of data that is known.

The C4 dataset, which will be explored shortly, is a specially filtered version of the Common Crawl dataset.

Only 25% of the data is from a named source (the C4 dataset and Wikipedia).

The rest of the data that makes up the bulk of the Infiniset dataset, 75%, consists of words that were scraped from the Internet.

The research paper doesn’t say how the data was obtained from websites, what websites it was obtained from or any other details about the scraped content.

Google only uses generalized descriptions like “Non-English web documents.”

The word “murky” describes something that is not explained and is mostly concealed.

Murky is the best word for describing the 75% of data that Google used for training LaMDA.

There are some clues that may give a general idea of what sites are contained within the 75% of web content, but we can’t know for certain.

C4 Dataset

C4 is a dataset developed by Google in 2020. C4 stands for “Colossal Clean Crawled Corpus.”

This dataset is based on the Common Crawl data, which is an open-source dataset.

About Common Crawl

Common Crawl is a registered non-profit organization that crawls the Internet on a monthly basis to create free datasets that anyone can use.

The Common Crawl organization is currently run by people who have worked for the Wikimedia Foundation, former Googlers, and a founder of Blekko, and it counts as advisors people like Peter Norvig, Director of Research at Google, and Danny Sullivan (also of Google).

How C4 is Developed From Common Crawl

The raw Common Crawl data is cleaned up by removing things like thin content, obscene words, lorem ipsum, and navigational menus, and by deduplicating the remaining text, in order to limit the dataset to the main content.

The point of filtering out unnecessary data was to remove gibberish and retain examples of natural English.
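
The kind of cleanup described above can be sketched in a few lines. This is a simplified illustration, not the actual C4 pipeline: the blocklist entry, the thresholds, and the function names are all assumptions made for the example, and the deduplication here only drops exact duplicate pages.

```python
import re

# Hypothetical blocklist; a real one would be far longer.
BLOCKLIST = {"badword"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3   # assumed threshold
MIN_LINES_PER_PAGE = 2   # assumed threshold, kept small for the example


def clean_page(text):
    """Return the cleaned page text, or None if the page should be dropped."""
    lowered = text.lower()
    if "lorem ipsum" in lowered:
        return None  # placeholder text
    words = set(re.findall(r"[a-z']+", lowered))
    if words & BLOCKLIST:
        return None  # page contains a blocklisted word
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Menus and other fragments rarely end like a sentence.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)
    if len(kept) < MIN_LINES_PER_PAGE:
        return None  # thin content
    return "\n".join(kept)


def clean_corpus(pages):
    """Clean every page and drop exact duplicates of the cleaned text."""
    seen = set()
    result = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None or cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result


pages = [
    "Home | About | Contact\nThis is a real sentence about patents.\nHere is another full sentence.",
    "Lorem ipsum dolor sit amet.\nMore placeholder text here.",
    "Home | About | Contact\nThis is a real sentence about patents.\nHere is another full sentence.",
    "Too short.",
]
cleaned = clean_corpus(pages)
print(len(cleaned), "page(s) kept")
```

In this toy run only the first page survives: its menu line is stripped, the lorem ipsum page and the two-word page are dropped, and the repeated page is removed as a duplicate. A blocklist filter of this kind is also the step that the 2021 analysis linked to the disproportionate removal of some pages.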

This is what the researchers who created C4 wrote:

“To assemble our base data set, we downloaded the web extracted text from April 2019 and applied the aforementioned filtering.

This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text.

We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets…”

There are other unfiltered versions of C4 as well.

The research paper that describes the C4 dataset is titled, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (PDF).

Another research paper from 2021 (Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus – PDF) examined the make-up of the sites included in the C4 dataset.

Interestingly, the second research paper discovered anomalies in the original C4 dataset that resulted in the removal of webpages that were Hispanic and African American aligned.

Hispanic aligned webpages were removed by the blocklist filter (swear words, etc.) at the rate of 32% of pages.

African American aligned webpages were removed at the rate of 42%.

Presumably those shortcomings have been addressed…

Another finding was that 51.3% of the C4 dataset consisted of webpages that were hosted in the United States.

Lastly, the 2021 analysis of the original C4 dataset acknowledges that the dataset represents just a fraction of the total Internet.

The analysis states:

“Our analysis shows that while this dataset represents a significant fraction of a scrape of the public internet, it is by no means representative of English-speaking world, and it spans a wide range of years.

When building a dataset from a scrape of the web, reporting the domains the text is scraped from is integral to understanding the dataset; the data collection process can lead to a significantly different distribution of internet domains than one would expect.”

The following statistics about the C4 dataset are from the second research paper that is linked above.

The top 25 websites (by number of tokens) in C4 are:

  1. patents.google.com
  2. en.wikipedia.org
  3. en.m.wikipedia.org
  4. www.nytimes.com
  5. www.latimes.com
  6. www.theguardian.com
  7. journals.plos.org
  8. www.forbes.com
  9. www.huffpost.com
  10. patents.com
  11. www.scribd.com
  12. www.washingtonpost.com
  13. www.fool.com
  14. ipfs.io
  15. www.frontiersin.org
  16. www.businessinsider.com
  17. www.chicagotribune.com
  18. www.booking.com
  19. www.theatlantic.com
  20. link.springer.com
  21. www.aljazeera.com
  22. www.kickstarter.com
  23. caselaw.findlaw.com
  24. www.ncbi.nlm.nih.gov
  25. www.npr.org

These are the top 25 represented top level domains in the C4 dataset:

Screenshot from Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

If you’re interested in learning more about the C4 dataset, I recommend reading Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (PDF) as well as the original 2020 research paper (PDF) for which C4 was created.

What Could Dialogs Data from Public Forums Be?

50% of the training data comes from “dialogs data from public forums.”

That’s all that Google’s LaMDA research paper says about this training data.

If one were to guess, Reddit and other top communities like StackOverflow are safe bets.

Reddit is used in many important datasets such as ones developed by OpenAI called WebText2 (PDF), an open-source approximation of WebText2 called OpenWebText2 and Google’s own WebText-like (PDF) dataset from 2020.

Google also published details of another dataset of public dialog sites a month before the publication of the LaMDA paper.

This dataset that contains public dialog sites is called MassiveWeb.

We’re not speculating that the MassiveWeb dataset was used to train LaMDA.

But it contains a good example of what Google chose for another language model that focused on dialogue.

MassiveWeb was created by DeepMind, which is owned by Google.

It was designed for use by a large language model called Gopher (link to PDF of research paper).

MassiveWeb uses dialog web sources that go beyond Reddit in order to avoid creating a bias toward Reddit-influenced data.

It still uses Reddit. But it also contains data scraped from many other sites.

Public dialog sites included in MassiveWeb are:

  • Reddit
  • Facebook
  • Quora
  • YouTube
  • Medium
  • StackOverflow

Again, this isn’t suggesting that LaMDA was trained with the above sites.

It’s just meant to show what Google could have used, by showing a dataset Google was working on around the same time as LaMDA, one that contains forum-type sites.

The Remaining 37.5%

The last group of data sources are:

  • 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc;
  • 12.5% Wikipedia (English)
  • 6.25% English web documents
  • 6.25% Non-English web documents.

Google does not specify what sites are in the Programming Q&A Sites category that makes up 12.5% of the dataset that LaMDA trained on.

So we can only speculate.

Stack Overflow and Reddit seem like obvious choices, especially since they were included in the MassiveWeb dataset.

What “tutorials” sites were crawled? We can only speculate what those “tutorials” sites may be.

That leaves the final three categories of content, two of which are exceedingly vague.

English language Wikipedia needs no discussion; we all know Wikipedia.

But the following two are not explained:

“English web documents” and “Non-English web documents” are generalized descriptions that together cover 12.5% of the content included in the dataset.

That’s all the information Google gives about this part of the training data.

Should Google Be Transparent About Datasets Used for Bard?

Some publishers feel uncomfortable that their sites are used to train AI systems because, in their opinion, those systems could in the future make their websites obsolete and cause them to disappear.

Whether that’s true or not remains to be seen, but it is a genuine concern expressed by publishers and members of the search marketing community.

Google is frustratingly vague about the websites used to train LaMDA as well as what technology was used to scrape the websites for data.

As was seen in the analysis of the C4 dataset, the methodology of choosing which website content to use for training large language models can affect the quality of the language model by excluding certain populations.

Should Google be more transparent about what sites are used to train its AI, or at least publish an easy-to-find transparency report about the data that was used?

Featured image by Shutterstock/Asier Romero




Measuring Content Impact Across The Customer Journey

Understanding the impact of your content at every touchpoint of the customer journey is essential – but that’s easier said than done. From attracting potential leads to nurturing them into loyal customers, there are many touchpoints to look into.

So how do you identify and take advantage of these opportunities for growth?

Watch this on-demand webinar and learn a comprehensive approach for measuring the value of your content initiatives, so you can optimize resource allocation for maximum impact.

You’ll learn:

  • Fresh methods for measuring your content’s impact.
  • Fascinating insights using first-touch attribution, and how it differs from the usual last-touch perspective.
  • Ways to persuade decision-makers to invest in more content by showcasing its value convincingly.

With Bill Franklin and Oliver Tani of DAC Group, we unravel the nuances of attribution modeling, emphasizing the significance of layering first-touch and last-touch attribution within your measurement strategy. 

Check out these insights to help you craft compelling content tailored to each stage, using an approach rooted in first-hand experience to ensure your content resonates.

Whether you’re a seasoned marketer or new to content measurement, this webinar promises valuable insights and actionable tactics to elevate your SEO game and optimize your content initiatives for success. 

View the slides below or check out the full webinar for all the details.


How to Find and Use Competitor Keywords

Competitor keywords are the keywords your rivals rank for in Google’s search results. They may rank organically or pay for Google Ads to rank in the paid results.

Knowing your competitors’ keywords is the easiest form of keyword research. If your competitors rank for or target particular keywords, it might be worth it for you to target them, too.

There is no way to see your competitors’ keywords without a tool like Ahrefs, which has a database of keywords and the sites that rank for them. As far as we know, Ahrefs has the biggest database of these keywords.

How to find all the keywords your competitor ranks for

  1. Go to Ahrefs’ Site Explorer
  2. Enter your competitor’s domain
  3. Go to the Organic keywords report

The report is sorted by traffic to show you the keywords sending your competitor the most visits. For example, Mailchimp gets most of its organic traffic from the keyword “mailchimp.”

Mailchimp gets most of its organic traffic from the keyword, “mailchimp”.

Since you’re unlikely to rank for your competitor’s brand, you might want to exclude branded keywords from the report. You can do this by adding a Keyword > Doesn’t contain filter. In this example, we’ll filter out keywords containing “mailchimp” or any potential misspellings:

Filtering out branded keywords in Organic keywords report

If you’re a new brand competing with one that’s established, you might also want to look for popular low-difficulty keywords. You can do this by setting the Volume filter to a minimum of 500 and the KD filter to a maximum of 10.

Finding popular, low-difficulty keywords in Organic keywords
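
The same two filters can be applied to an exported keyword list outside the tool. The sketch below assumes a hypothetical export with a phrase, a monthly volume, and a KD score per row; the `Keyword` class, the function name, and the sample rows are made up for illustration and are not Ahrefs data or an Ahrefs API.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    phrase: str
    volume: int      # monthly search volume
    difficulty: int  # Keyword Difficulty (KD) score


def filter_keywords(keywords, brand_terms, min_volume=500, max_kd=10):
    """Drop branded keywords, then keep popular, low-difficulty ones."""
    brand_terms = [term.lower() for term in brand_terms]
    selected = []
    for kw in keywords:
        phrase = kw.phrase.lower()
        # Equivalent of a "Doesn't contain" filter on the brand and its misspellings.
        if any(term in phrase for term in brand_terms):
            continue
        if kw.volume >= min_volume and kw.difficulty <= max_kd:
            selected.append(kw)
    return selected


# Made-up rows for illustration only.
exported = [
    Keyword("mailchimp login", 90000, 60),
    Keyword("mail chimp pricing", 4000, 8),
    Keyword("email subject line ideas", 1200, 7),
    Keyword("what is a newsletter", 300, 5),
    Keyword("email marketing", 50000, 85),
]

shortlist = filter_keywords(exported, brand_terms=["mailchimp", "mail chimp"])
print([kw.phrase for kw in shortlist])
```

With these sample rows only “email subject line ideas” passes: the two branded rows are excluded, one row falls below the volume minimum, and one exceeds the KD maximum.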

How to find keywords your competitor ranks for, but you don’t

  1. Go to Competitive Analysis
  2. Enter your domain in the This target doesn’t rank for section
  3. Enter your competitor’s domain in the But these competitors do section
Competitive analysis report

Hit “Show keyword opportunities,” and you’ll see all the keywords your competitor ranks for, but you don’t.

Content gap report

You can also add a Volume and KD filter to find popular, low-difficulty keywords in this report.

Volume and KD filter in Content gap

How to find keywords multiple competitors rank for, but you don’t

  1. Go to Competitive Analysis
  2. Enter your domain in the This target doesn’t rank for section
  3. Enter the domains of multiple competitors in the But these competitors do section
Competitive analysis report with multiple competitors

You’ll see all the keywords that at least one of these competitors ranks for, but you don’t.

Content gap report with multiple competitors

You can also narrow the list down to keywords that all competitors rank for. Click on the Competitors’ positions filter and choose All 3 competitors:

Selecting all 3 competitors to see keywords all 3 competitors rank for
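
Conceptually, a content gap is a set difference: the keywords competitors rank for, minus the keywords you rank for. The sketch below shows the “at least one competitor” and “all competitors” variants on made-up data. The function name, the `require_all` parameter, and the example domains are hypothetical and only illustrate the logic, not how the report is implemented.

```python
def content_gap(own_keywords, competitor_keywords, require_all=False):
    """Return keywords competitors rank for that the target site does not.

    competitor_keywords maps each competitor domain to its keywords.
    With require_all=False a keyword qualifies if any competitor ranks for it;
    with require_all=True every competitor must rank for it.
    """
    keyword_sets = [set(keywords) for keywords in competitor_keywords.values()]
    if not keyword_sets:
        return set()
    if require_all:
        pool = set.intersection(*keyword_sets)
    else:
        pool = set.union(*keyword_sets)
    return pool - set(own_keywords)


# Made-up data for illustration only.
ours = {"email marketing"}
competitors = {
    "competitor-a.example": {"email marketing", "newsletter ideas", "drip campaign"},
    "competitor-b.example": {"newsletter ideas", "drip campaign"},
    "competitor-c.example": {"newsletter ideas", "email marketing"},
}

any_gap = content_gap(ours, competitors)
all_gap = content_gap(ours, competitors, require_all=True)
print(sorted(any_gap))
print(sorted(all_gap))
```

Requiring all competitors narrows the list considerably, which is why that filter tends to surface the keywords most central to the niche.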

How to find keywords your competitor is bidding on

  1. Go to Ahrefs’ Site Explorer
  2. Enter your competitor’s domain
  3. Go to the Paid keywords report
Paid keywords report

This report shows you the keywords your competitors are targeting via Google Ads.

Since your competitor is paying for traffic from these keywords, it may indicate that they’re profitable for them—and could be for you, too.

You know what keywords your competitors are ranking for or bidding on. But what do you do with them? There are basically three options.

1. Create pages to target these keywords

You can only rank for keywords if you have content about them. So, the most straightforward thing you can do for competitors’ keywords you want to rank for is to create pages to target them.

However, before you do this, it’s worth clustering your competitor’s keywords by Parent Topic. This will group keywords that mean the same or similar things so you can target them all with one page.

Here’s how to do that:

  1. Export your competitor’s keywords, either from the Organic Keywords or Content Gap report
  2. Paste them into Keywords Explorer
  3. Click the “Clusters by Parent Topic” tab
Clustering keywords by Parent Topic

For example, Mailchimp ranks for keywords like “what is digital marketing” and “digital marketing definition.” These and many others get clustered under the Parent Topic of “digital marketing” because people searching for them are all looking for the same thing: a definition of digital marketing. You only need to create one page to potentially rank for all these keywords.

Keywords under the cluster of "digital marketing"
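
If you work with an export rather than the tool’s interface, the grouping step itself is simple. The sketch below assumes each exported row already carries a parent topic label (assigning that label is the hard part, and it is done by the tool, not by this code). The function name and the row layout are hypothetical.

```python
from collections import defaultdict


def cluster_by_parent_topic(rows):
    """Group keyword phrases under their parent topic."""
    clusters = defaultdict(list)
    for keyword, parent_topic in rows:
        clusters[parent_topic].append(keyword)
    return dict(clusters)


# (keyword, parent topic) pairs, using examples from this article.
rows = [
    ("what is digital marketing", "digital marketing"),
    ("digital marketing definition", "digital marketing"),
    ("press release example", "press release template"),
    ("press release format", "press release template"),
]

clusters = cluster_by_parent_topic(rows)
for topic, keywords in clusters.items():
    print(f"{topic}: {len(keywords)} keywords")
```

Each resulting cluster is a candidate for a single page that targets every keyword in the group.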

2. Optimize existing content by filling subtopics

You don’t always need to create new content to rank for competitors’ keywords. Sometimes, you can optimize the content you already have to rank for them.

How do you know which keywords you can do this for? Try this:

  1. Export your competitor’s keywords
  2. Paste them into Keywords Explorer
  3. Click the “Clusters by Parent Topic” tab
  4. Look for Parent Topics you already have content about

For example, if we analyze our competitor, we can see that seven keywords they rank for fall under the Parent Topic of “press release template.”

Our competitor ranks for seven keywords that fall under the "press release template" cluster

If we search our site, we see that we already have a page about this topic.

Site search finds that we already have a blog post on press release templates

If we click the caret and check the keywords in the cluster, we see keywords like “press release example” and “press release format.”

Keywords under the cluster of "press release template"

To rank for the keywords in the cluster, we can probably optimize the page we already have by adding sections about the subtopics of “press release examples” and “press release format.”

3. Target these keywords with Google Ads

Paid keywords are the simplest—look through the report and see if there are any relevant keywords you might want to target, too.

For example, Mailchimp is bidding for the keyword “how to create a newsletter.”

Mailchimp is bidding for the keyword “how to create a newsletter”

If you’re ConvertKit, you may also want to target this keyword since it’s relevant.

If you decide to target the same keyword via Google Ads, you can hover over the magnifying glass to see the ads your competitor is using.

Mailchimp's Google Ad for the keyword “how to create a newsletter”

You can also see the landing page your competitor directs ad traffic to under the URL column.

The landing page Mailchimp is directing traffic to for “how to create a newsletter”


Google Confirms Links Are Not That Important

Google confirms that links are not that important anymore

Google’s Gary Illyes confirmed at a recent search marketing conference that Google needs very few links, adding to the growing body of evidence that publishers need to focus on other factors. Gary tweeted confirmation that he did indeed say those words.

Background Of Links For Ranking

In the late 1990s, links were discovered to be a good signal for search engines to use for validating how authoritative a website is. Soon after, Google discovered that anchor text could be used to provide semantic signals about what a webpage was about.

One of the most important research papers was Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, published around 1998 (link to research paper at the end of the article). The main discovery of this research paper is that there were too many web pages and no objective way to filter search results for quality in order to rank web pages for a subjective idea of relevance.

The author of the research paper discovered that links could be used as an objective filter for authoritativeness.

Kleinberg wrote:

“To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most “authoritative” or ‘definitive’ ones.”

This is the most influential research paper on links because it kick-started more research on ways to use links not only as an authority metric but also as a subjective metric for relevance.
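
The link analysis in Kleinberg’s paper is commonly known as HITS, or hubs and authorities: a good authority is a page linked to by good hubs, and a good hub is a page that links to good authorities. The sketch below is a minimal version of that iteration on a made-up three-link graph. The function name, the iteration count, and the example domains are assumptions for illustration, and this is not a description of how Google ranks pages.

```python
import math


def hits(links, iterations=20):
    """Compute hub and authority scores for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hubs = {page: 1.0 for page in pages}
    authorities = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        authorities = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authorities[target] += hubs[source]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hubs = {
            page: sum(authorities[t] for t in links.get(page, []))
            for page in pages
        }
        # Normalize so the scores stay bounded across iterations.
        for scores in (authorities, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hubs, authorities


# Made-up link graph for illustration only.
links = {
    "directory.example": ["guide.example", "reference.example"],
    "blog.example": ["guide.example"],
    "guide.example": [],
}

hubs, authorities = hits(links)
top_authority = max(authorities, key=authorities.get)
top_hub = max(hubs, key=hubs.get)
print("Top authority:", top_authority)
print("Top hub:", top_hub)
```

In this tiny graph the page that receives the most links from well-connected pages ends up as the top authority, which is the sense in which links act as a filter for authoritativeness.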

Objective is something factual. Subjective is something that’s closer to an opinion. The founders of Google discovered how to use the subjective opinions of the Internet as a relevance metric for what to rank in the search results.

What Larry Page and Sergey Brin discovered and shared in their research paper (The Anatomy of a Large-Scale Hypertextual Web Search Engine – link at end of this article) was that it was possible to harness the power of anchor text to determine the subjective opinion of relevance from actual humans. It was essentially crowdsourcing the opinions of millions of websites, expressed through the link structure between webpages.

What Did Gary Illyes Say About Links In 2024?

At a recent search conference in Bulgaria, Google’s Gary Illyes made a comment about how Google doesn’t really need that many links and how Google has made links less important.

Patrick Stox tweeted about what he heard at the search conference:

“‘We need very few links to rank pages… Over the years we’ve made links less important.’ @methode #serpconf2024”

Google’s Gary Illyes tweeted a confirmation of that statement:

“I shouldn’t have said that… I definitely shouldn’t have said that”

Why Links Matter Less

The initial state of anchor text when Google first used links for ranking purposes was absolutely non-spammy, which is why it was so useful. Hyperlinks were primarily used as a way to send traffic from one website to another website.

But by 2004 or 2005, Google was using statistical analysis to detect manipulated links. Around 2004, “powered-by” links in website footers stopped passing anchor text value. By 2006, links close to the word “advertising” stopped passing link value, and links from directories stopped passing ranking value. By 2012, Google deployed a massive link algorithm called Penguin that destroyed the rankings of likely millions of websites, many of which were using guest posting.

The link signal eventually became so bad that Google decided in 2019 to selectively use nofollow links for ranking purposes. Google’s Gary Illyes confirmed that the change to nofollow was made because of the link signal.

Google Explicitly Confirms That Links Matter Less

In 2023, Google’s Gary Illyes shared at PubCon Austin that links were not even in the top 3 of ranking factors. Then in March 2024, coinciding with the March 2024 Core Algorithm Update, Google updated its spam policies documentation to downplay the importance of links for ranking purposes.

Google March 2024 Core Update: 4 Changes To Link Signal

The documentation previously said:

“Google uses links as an important factor in determining the relevancy of web pages.”

The documentation was updated to remove the word “important.”

Links are now listed as just another factor:

“Google uses links as a factor in determining the relevancy of web pages.”

At the beginning of April, Google’s John Mueller advised that there are more useful SEO activities to engage in than links.

Mueller explained:

“There are more important things for websites nowadays, and over-focusing on links will often result in you wasting your time doing things that don’t make your website better overall”

Finally, Gary Illyes explicitly said that Google needs very few links to rank webpages and confirmed it.

Why Google Doesn’t Need Links

The reason why Google doesn’t need many links is likely the extent of AI and natural language understanding that Google uses in its algorithms. Google must be highly confident in its algorithm to be able to explicitly say that it doesn’t need them.

Way back when Google implemented nofollow into the algorithm, many link builders who sold comment spam links continued to lie that comment spam still worked. As someone who started link building at the very beginning of modern SEO (I was the moderator of the link building forum at the #1 SEO forum of that time), I can say with confidence that links stopped playing much of a role in rankings several years ago, which is why I stopped about five or six years ago.

Read the research papers

Authoritative Sources in a Hyperlinked Environment – Jon M. Kleinberg (PDF)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Featured Image by Shutterstock/RYO Alexandre
