Google Bard AI – What Sites Were Used To Train It?

Google’s Bard is based on the LaMDA language model, which was trained on a dataset of Internet content called Infiniset. Very little is known about where that data came from or how it was obtained.

The 2022 LaMDA research paper lists percentages of different kinds of data used to train LaMDA, but only 12.5% comes from a public dataset of crawled content from the web and another 12.5% comes from Wikipedia.

Google is purposely vague about where the rest of the scraped data comes from but there are hints of what sites are in those datasets.

Google’s Infiniset Dataset

Google Bard is based on a language model called LaMDA, which is an acronym for Language Model for Dialogue Applications.

LaMDA was trained on a dataset called Infiniset.

Infiniset is a blend of Internet content that was deliberately chosen to enhance the model’s ability to engage in dialogue.

The LaMDA research paper (PDF) explains why they chose this composition of content:

“…this composition was chosen to achieve a more robust performance on dialog tasks …while still keeping its ability to perform other tasks like code generation.

As future work, we can study how the choice of this composition may affect the quality of some of the other NLP tasks performed by the model.”

The research paper refers to “dialog” and “dialogs,” which is how the words are spelled in this context, within the realm of computer science.

In total, LaMDA was pre-trained on 1.56 trillion words of “public dialog data and web text.”

The dataset is comprised of the following mix:

  • 12.5% C4-based data
  • 12.5% English language Wikipedia
  • 12.5% code documents from programming Q&A websites, tutorials, and others
  • 6.25% English web documents
  • 6.25% Non-English web documents
  • 50% dialogs data from public forums
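
To make the mix concrete, here is a small sketch of how a training pipeline might draw examples according to these proportions. The weights come from the LaMDA paper; the sampling helper itself is a hypothetical illustration, not Google’s code.

```typescript
// Infiniset source weights as reported in the LaMDA paper.
const INFINISET_MIX: Array<[string, number]> = [
  ["public forum dialogs", 0.5],
  ["C4 (filtered Common Crawl)", 0.125],
  ["English Wikipedia", 0.125],
  ["programming Q&A, tutorials, code docs", 0.125],
  ["English web documents", 0.0625],
  ["non-English web documents", 0.0625],
];

// Pick a source with probability proportional to its weight
// (illustrative sketch of weighted sampling).
function sampleSource(rand: () => number = Math.random): string {
  let r = rand();
  for (const [source, weight] of INFINISET_MIX) {
    if (r < weight) return source;
    r -= weight;
  }
  return INFINISET_MIX[INFINISET_MIX.length - 1][0]; // floating-point guard
}
```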

The first two parts of Infiniset (C4 and Wikipedia) comprise data that is known.

The C4 dataset, which will be explored shortly, is a specially filtered version of the Common Crawl dataset.

Only 25% of the data is from a named source (the C4 dataset and Wikipedia).

The rest of the data that makes up the bulk of the Infiniset dataset, 75%, consists of words that were scraped from the Internet.

The research paper doesn’t say how the data was obtained from websites, what websites it was obtained from or any other details about the scraped content.

Google only uses generalized descriptions like “Non-English web documents.”

The word “murky” describes something that is unexplained and mostly concealed, and it is the best word for describing the 75% of data that Google used for training LaMDA.

There are some clues that may give a general idea of what sites are contained within the 75% of web content, but we can’t know for certain.

C4 Dataset

C4 is a dataset developed by Google in 2020. C4 stands for “Colossal Clean Crawled Corpus.”

This dataset is based on the Common Crawl data, which is an open-source dataset.

About Common Crawl

Common Crawl is a registered non-profit organization that crawls the Internet on a monthly basis to create free datasets that anyone can use.

The Common Crawl organization is currently run by people who have worked for the Wikimedia Foundation, former Googlers, and a founder of Blekko, and it counts as advisors people like Peter Norvig, Director of Research at Google, and Danny Sullivan (also of Google).

How C4 is Developed From Common Crawl

The raw Common Crawl data is cleaned up by removing thin content, obscene words, lorem ipsum placeholder text, and navigational menus, and by deduplicating repeated text, in order to limit the dataset to the main content.

The point of filtering out unnecessary data was to remove gibberish and retain examples of natural English.

This is what the researchers who created C4 wrote:

“To assemble our base data set, we downloaded the web extracted text from April 2019 and applied the aforementioned filtering.

This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text.

We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets…”
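
To make that filtering concrete, here is a toy sketch of C4-style cleanup heuristics. It loosely follows the heuristics described in the paper (minimum words per line, lines ending in terminal punctuation, blocklisted phrases); the thresholds and blocklist entries are illustrative, and this is not the actual C4 pipeline.

```typescript
// Illustrative blocklist entries; the real C4 filters pages containing
// "lorem ipsum" and lines containing "javascript", among other rules.
const BLOCKLIST = ["lorem ipsum", "javascript"];

// Keep lines that look like natural English sentences.
function keepLine(line: string): boolean {
  const text = line.trim();
  if (text.split(/\s+/).length < 3) return false;       // thin content
  if (!/[.!?"']$/.test(text)) return false;             // menus, nav fragments
  const lower = text.toLowerCase();
  return !BLOCKLIST.some((bad) => lower.includes(bad)); // placeholder/boilerplate
}

// Apply the line filter to a whole extracted document.
function cleanDocument(raw: string): string {
  return raw.split("\n").filter(keepLine).join("\n");
}
```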

There are other unfiltered versions of C4 as well.

The research paper that describes the C4 dataset is titled, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (PDF).

Another research paper from 2021, (Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus – PDF) examined the make-up of the sites included in the C4 dataset.

Interestingly, the second research paper discovered anomalies in the original C4 dataset that resulted in the removal of webpages that were Hispanic and African American aligned.

Hispanic aligned webpages were removed by the blocklist filter (swear words, etc.) at the rate of 32% of pages.

African American aligned webpages were removed at the rate of 42%.

Presumably those shortcomings have been addressed…

Another finding was that 51.3% of the C4 dataset consisted of webpages that were hosted in the United States.

Lastly, the 2021 analysis of the original C4 dataset acknowledges that the dataset represents just a fraction of the total Internet.

The analysis states:

“Our analysis shows that while this dataset represents a significant fraction of a scrape of the public internet, it is by no means representative of English-speaking world, and it spans a wide range of years.

When building a dataset from a scrape of the web, reporting the domains the text is scraped from is integral to understanding the dataset; the data collection process can lead to a significantly different distribution of internet domains than one would expect.”

The following statistics about the C4 dataset are from the second research paper that is linked above.

The top 25 websites (by number of tokens) in C4 are:

  1. patents.google.com
  2. en.wikipedia.org
  3. en.m.wikipedia.org
  4. www.nytimes.com
  5. www.latimes.com
  6. www.theguardian.com
  7. journals.plos.org
  8. www.forbes.com
  9. www.huffpost.com
  10. patents.com
  11. www.scribd.com
  12. www.washingtonpost.com
  13. www.fool.com
  14. ipfs.io
  15. www.frontiersin.org
  16. www.businessinsider.com
  17. www.chicagotribune.com
  18. www.booking.com
  19. www.theatlantic.com
  20. link.springer.com
  21. www.aljazeera.com
  22. www.kickstarter.com
  23. caselaw.findlaw.com
  24. www.ncbi.nlm.nih.gov
  25. www.npr.org

The top 25 represented top-level domains in the C4 dataset are shown in a screenshot from Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.

If you’re interested in learning more about the C4 dataset, I recommend reading Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (PDF) as well as the original 2020 research paper (PDF) for which C4 was created.

What Could Dialogs Data from Public Forums Be?

50% of the training data comes from “dialogs data from public forums.”

That’s all that Google’s LaMDA research paper says about this training data.

If one were to guess, Reddit and other top communities like StackOverflow are safe bets.

Reddit is used in many important datasets such as ones developed by OpenAI called WebText2 (PDF), an open-source approximation of WebText2 called OpenWebText2 and Google’s own WebText-like (PDF) dataset from 2020.

Google also published details of another dataset of public dialog sites a month before the publication of the LaMDA paper.

This dataset that contains public dialog sites is called MassiveWeb.

We’re not suggesting that the MassiveWeb dataset was used to train LaMDA.

But it contains a good example of what Google chose for another language model that focused on dialogue.

MassiveWeb was created by DeepMind, which is owned by Google.

It was designed for use by a large language model called Gopher (link to PDF of research paper).

MassiveWeb uses dialog web sources that go beyond Reddit in order to avoid creating a bias toward Reddit-influenced data.

It still uses Reddit. But it also contains data scraped from many other sites.

Public dialog sites included in MassiveWeb are:

  • Reddit
  • Facebook
  • Quora
  • YouTube
  • Medium
  • StackOverflow

Again, this isn’t suggesting that LaMDA was trained with the above sites.

It’s just meant to show what Google could have used, by showing a dataset Google was working on around the same time as LaMDA, one that contains forum-type sites.

The Remaining 37.5%

The last group of data sources is:

  • 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc;
  • 12.5% Wikipedia (English)
  • 6.25% English web documents
  • 6.25% Non-English web documents.

Google does not specify what sites are in the Programming Q&A Sites category that makes up 12.5% of the dataset that LaMDA trained on.

So we can only speculate.

Stack Overflow and Reddit seem like obvious choices, especially since they were included in the MassiveWeb dataset.

What “tutorials” sites were crawled? Again, we can only speculate.

That leaves the final three categories of content, two of which are exceedingly vague.

English language Wikipedia needs no discussion; we all know Wikipedia.

But the following two are not explained:

“English web documents” and “Non-English web documents” are only a general description of the 12.5% of the dataset made up of those sites.

That’s all the information Google gives about this part of the training data.

Should Google Be Transparent About Datasets Used for Bard?

Some publishers feel uncomfortable that their sites are used to train AI systems because, in their opinion, those systems could in the future make their websites obsolete and cause them to disappear.

Whether that’s true or not remains to be seen, but it is a genuine concern expressed by publishers and members of the search marketing community.

Google is frustratingly vague about the websites used to train LaMDA as well as what technology was used to scrape the websites for data.

As was seen in the analysis of the C4 dataset, the methodology of choosing which website content to use for training large language models can affect the quality of the language model by excluding certain populations.

Should Google be more transparent about what sites are used to train their AI or at least publish an easy to find transparency report about the data that was used?

Featured image by Shutterstock/Asier Romero



Source link


Google On Hyphens In Domain Names


Google’s John Mueller answered a question on Reddit about why people don’t use hyphens with domains and if there was something to be concerned about that they were missing.

Domain Names With Hyphens For SEO

I’ve been working online for 25 years, and I remember when using hyphens in domains was something affiliates did for SEO, back when Google was still influenced by keywords in the domain, the URL, and basically keywords anywhere on the webpage. It wasn’t something that everyone did; it was mainly popular with some affiliate marketers.

Another reason for choosing domain names with keywords in them was that site visitors tended to convert at a higher rate because the keywords essentially prequalified the site visitor. I know from experience how useful two-keyword domains (and one word domain names) are for conversions, as long as they didn’t have hyphens in them.

A consideration that caused hyphenated domain names to fall out of favor is their untrustworthy appearance, which can work against conversion rates because trustworthiness is an important factor for conversions.

Lastly, hyphenated domain names look tacky. Why go with tacky when a brandable domain is easier for building trust and conversions?

Domain Name Question Asked On Reddit

This is the question asked on Reddit:

“Why don’t people use a lot of domains with hyphens? Is there something concerning about it? I understand when you tell it out loud people make miss hyphen in search.”

And this is Mueller’s response:

“It used to be that domain names with a lot of hyphens were considered (by users? or by SEOs assuming users would? it’s been a while) to be less serious – since they could imply that you weren’t able to get the domain name with fewer hyphens. Nowadays there are a lot of top-level-domains so it’s less of a thing.

My main recommendation is to pick something for the long run (assuming that’s what you’re aiming for), and not to be overly keyword focused (because life is too short to box yourself into a corner – make good things, course-correct over time, don’t let a domain-name limit what you do online). The web is full of awkward, keyword-focused short-lived low-effort takes made for SEO — make something truly awesome that people will ask for by name. If that takes a hyphen in the name – go for it.”

Pick A Domain Name That Can Grow

Mueller is right about picking a domain name that won’t lock your site into one topic. When a site grows in popularity, the natural growth path is to expand the range of topics the site covers. But that’s hard to do when the domain is locked into one rigid keyword phrase. That’s one of the downsides of picking a “Best + keyword + reviews” domain, too. Those domains can’t grow bigger, and they look tacky.

That’s why I’ve always recommended brandable domains that are memorable and encourage trust in some way.

Read the post on Reddit:

Are domains with hyphens bad?

Read Mueller’s response here.

Featured Image by Shutterstock/Benny Marty


Reddit Post Ranks On Google In 5 Minutes


Google’s Danny Sullivan disputed the assertions made in a Reddit discussion that Google is showing a preference for Reddit in the search results. But a Redditor’s example proves that it’s possible for a Reddit post to rank in the top ten of the search results within minutes and to actually improve rankings to position #2 a week later.

Discussion About Google Showing Preference To Reddit

A Redditor (gronetwork) complained that Google is sending so many visitors to Reddit that the server is struggling with the load and shared an example that proved that it can only take minutes for a Reddit post to rank in the top ten.

That post was part of a 79-post Reddit thread in which many in the r/SEO subreddit were complaining about Google allegedly giving too much preference to Reddit over legitimate sites.

The person who did the test (gronetwork) wrote:

“…The website is already cracking (server down, double posts, comments not showing) because there are too many visitors.

…It only takes few minutes (you can test it) for a post on Reddit to appear in the top ten results of Google with keywords related to the post’s title… (while I have to wait months for an article on my site to be referenced). Do the math, the whole world is going to spam here. The loop is completed.”

Reddit Post Ranked Within Minutes

Another Redditor asked if they had tested if it takes “a few minutes” to rank in the top ten and gronetwork answered that they had tested it with a post titled, Google SGE Review.

gronetwork posted:

“Yes, I have created for example a post named “Google SGE Review” previously. After less than 5 minutes it was ranked 8th for Google SGE Review (no quotes). Just after Washingtonpost.com, 6 authoritative SEO websites and Google.com’s overview page for SGE (Search Generative Experience). It is ranked third for SGE Review.”

It’s true: not only does that specific post (Google SGE Review) rank in the top 10, it started out in position 8 and actually improved its ranking, and it is currently listed beneath the number one result for the search query “SGE Review”.

Screenshot Of Reddit Post That Ranked Within Minutes

Anecdotes Versus Anecdotes

Okay, the above is just one anecdote. But it’s a heck of an anecdote, because it proves that it’s possible for a Reddit post to rank within minutes and stay at the top of the search results, above other possibly more authoritative websites.

hankschrader79 shared that Reddit posts outrank Toyota Tacoma forums for a phrase related to mods for that truck.

Google’s Danny Sullivan responded to that post and the entire discussion to dispute the claim that Reddit is always prioritized over other forums.

Danny wrote:

“Reddit is not always prioritized over other forums. [super vhs to mac adapter] I did this week, it goes Apple Support Community, MacRumors Forum and further down, there’s Reddit. I also did [kumo cloud not working setup 5ghz] recently (it’s a nightmare) and it was the Netgear community, the SmartThings Community, GreenBuildingAdvisor before Reddit. Related to that was [disable 5g airport] which has Apple Support Community above Reddit. [how to open an 8 track tape] — really, it was the YouTube videos that helped me most, but it’s the Tapeheads community that comes before Reddit.

In your example for [toyota tacoma], I don’t even get Reddit in the top results. I get Toyota, Car & Driver, Wikipedia, Toyota again, three YouTube videos from different creators (not Toyota), Edmunds, a Top Stories unit. No Reddit, which doesn’t really support the notion of always wanting to drive traffic just to Reddit.

If I guess at the more specific query you might have done, maybe [overland mods for toyota tacoma], I get a YouTube video first, then Reddit, then Tacoma World at third — not near the bottom. So yes, Reddit is higher for that query — but it’s not first. It’s also not always first. And sometimes, it’s not even showing at all.”

hankschrader79 conceded that they were generalizing when they wrote that Google always prioritizes Reddit. But they also insisted that this didn’t diminish what they said is a fact: that Google’s “prioritization” of forum content has benefited Reddit more than actual forums.

Why Is The Reddit Post Ranked So High?

It’s possible that Google “tested” that Reddit post in position 8 within minutes and that user interaction signals indicated to Google’s algorithms that users prefer to see that Reddit post. If that’s the case, then it’s not a matter of Google showing preference to the Reddit post; rather, it’s users that are showing the preference, and the algorithm is responding to those preferences.

Nevertheless, an argument can be made that user preferences for Reddit can be a manifestation of Familiarity Bias. Familiarity Bias is when people show a preference for things that are familiar to them. If a person is familiar with a brand because of all the advertising they were exposed to then they may show a bias for the brand products over unfamiliar brands.

Users who are familiar with Reddit may choose Reddit because they don’t know the other sites in the search results or because they have a bias that Google ranks spammy and optimized websites and feel safer reading Reddit.

Google may be picking up on those user interaction signals that indicate a preference and satisfaction with the Reddit results but those results may simply be biases and not an indication that Reddit is trustworthy and authoritative.

Is Reddit Benefiting From A Self-Reinforcing Feedback Loop?

It may very well be that Google’s decision to prioritize user-generated content has started a self-reinforcing pattern that draws users into Reddit through the search results. Because the answers seem plausible, those users start to prefer Reddit results. When they’re exposed to more Reddit posts, their familiarity bias kicks in and they begin to show a preference for Reddit. So what could be happening is that users and Google’s algorithm are creating a self-reinforcing feedback loop.

Is it possible that Google’s decision to show more user generated content has kicked off a cycle where more users are exposed to Reddit which then feeds back into Google’s algorithm which in turn increases Reddit visibility, regardless of lack of expertise and authoritativeness?

Featured Image by Shutterstock/Kues


WordPress Releases A Performance Plugin For “Near-Instant Load Times”


WordPress released an official plugin that adds support for a cutting edge technology called speculative loading that can help boost site performance and improve the user experience for site visitors.

Speculative Loading

Rendering is the process of constructing a webpage so that it displays: when your browser downloads the HTML, images, and other resources and puts them together into a webpage, that’s rendering. Prerendering is putting that webpage together (rendering it) in the background.

What this plugin does is enable the browser to prerender the entire webpage that a user might navigate to next. It does that by anticipating which webpage the user might navigate to based on where they are hovering.

Chrome lists a preference for only prerendering when there is an at least 80% probability of a user navigating to another webpage. The official Chrome support page for prerendering explains:

“Pages should only be prerendered when there is a high probability the page will be loaded by the user. This is why the Chrome address bar prerendering options only happen when there is such a high probability (greater than 80% of the time).”

There is also a caveat in that same developer page that prerendering may not happen based on user settings, memory usage and other scenarios (more details below about how analytics handles prerendering).

The Speculation Rules API solves a problem that previous solutions could not: in the past, they simply prefetched resources like JavaScript and CSS, without actually prerendering the entire webpage.


The official WordPress page about this new functionality describes it:

“The Speculation Rules API is a new web API… It allows defining rules to dynamically prefetch and/or prerender URLs of certain structure based on user interaction, in JSON syntax—or in other words, speculatively preload those URLs before the navigation.

This API can be used, for example, to prerender any links on a page whenever the user hovers over them. Also, with the Speculation Rules API, “prerender” actually means to prerender the entire page, including running JavaScript. This can lead to near-instant load times once the user clicks on the link as the page would have most likely already been loaded in its entirety. However that is only one of the possible configurations.”
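
As a concrete illustration, here is a sketch of the JSON payload such a rule carries, built in TypeScript. The `"/*"` pattern and `"moderate"` eagerness are illustrative choices, not necessarily what the WordPress plugin generates.

```typescript
// Build the JSON that would go inside a <script type="speculationrules"> tag.
function buildSpeculationRules(): string {
  const rules = {
    prerender: [
      {
        source: "document",             // match links found in the document
        where: { href_matches: "/*" },  // any same-origin URL
        eagerness: "moderate",          // roughly: start when the user hovers
      },
    ],
  };
  return JSON.stringify(rules);
}

// In a page this string would be embedded as:
// <script type="speculationrules">{"prerender":[...]}</script>
```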

The new WordPress plugin adds support for the Speculation Rules API. The Mozilla developer pages, a great resource for understanding web technologies, describe it like this:

“The Speculation Rules API is designed to improve performance for future navigations. It targets document URLs rather than specific resource files, and so makes sense for multi-page applications (MPAs) rather than single-page applications (SPAs).

The Speculation Rules API provides an alternative to the widely-available <link rel=”prefetch”> feature and is designed to supersede the Chrome-only deprecated <link rel=”prerender”> feature. It provides many improvements over these technologies, along with a more expressive, configurable syntax for specifying which documents should be prefetched or prerendered.”

See also: Are Websites Getting Faster? New Data Reveals Mixed Results

Performance Lab Plugin

The new plugin was developed by the official WordPress performance team which occasionally rolls out new plugins for users to test ahead of possible inclusion into the actual WordPress core. So it’s a good opportunity to be first to try out new performance technologies.

The new WordPress plugin is by default set to prerender “WordPress frontend URLs” which are pages, posts, and archive pages. How it works can be fine-tuned under the settings:

Settings > Reading > Speculative Loading

Browser Compatibility

The Speculation Rules API is supported as of Chrome 108; however, the specific rules used by the new plugin require Chrome 121 or higher. Chrome 121 was released in early 2024.

Browsers that do not support the API will simply ignore the speculation rules, and the plugin will have no effect on the user experience.

Check out the new Speculative Loading WordPress plugin developed by the official core WordPress performance team.

How Analytics Handles Prerendering

A WordPress developer commented with a question asking how Analytics would handle prerendering and someone else answered that it’s up to the Analytics provider to detect a prerender and not count it as a page load or site visit.

Fortunately, both Google Analytics and Google Publisher Tags (GPT) are able to handle prerenders. The Chrome developers support page has a note about how analytics handles prerendering:

“Google Analytics handles prerender by delaying until activation by default as of September 2023, and Google Publisher Tag (GPT) made a similar change to delay triggering advertisements until activation as of November 2023.”

Possible Conflict With Ad Blocker Extensions

There are a couple of things to be aware of about this plugin, aside from the fact that it’s an experimental feature that requires Chrome 121 or higher.

A WordPress plugin developer commented that this feature may not work for browsers that use the uBlock Origin ad-blocking extension.

Download the plugin:
Speculative Loading Plugin by the WordPress Performance Team

Read the announcement at WordPress
Speculative Loading in WordPress

See also: WordPress, Wix & Squarespace Show Best CWV Rate Of Improvement

