
SEO

Is This Dataset Used For Google’s AI Search?


Google published a research paper on a new kind of dataset for training a language model to retrieve sentences that exactly answer a question within an open-ended dialogue.

We don’t know if Google is using this dataset. But researchers claim it outperforms models trained on other datasets.

Many research papers, like the one published for LaMDA, don’t mention the specific contexts in which the research could be used.

For example, the LaMDA research paper (PDF) vaguely concludes:

“LaMDA is a step closer to practical and safe open-ended dialog systems, which can in turn unlock a wide range of useful applications.”

This research paper states the problem it is solving: how to create a dataset for training a machine to take part in an open-ended dialogue by selecting a sentence from a webpage.

Why This Dataset is Important

What makes this research paper of interest is that the researchers conclude that it could be used for factually grounding generative AI output, like what is seen in Google’s new Search Generative Experience.

Given that the research paper was presented at an information retrieval conference (Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval), it’s fairly safe to guess that this algorithm is related to information retrieval, which means search.

One last thing to note is that the research on this new kind of dataset was presented last year in 2022 but it has apparently gone unnoticed… Until now.

What Google Set Out to Achieve With the New Dataset

The researchers explain what they are focused on:

“In this paper we focus on open-ended dialogues: two parties converse in turns on any number of topics with no restrictions to the topic shifts and type of discussion on each topic.

In addition, the dialogue is not grounded to a specific document, in contrast to the setting used in some previous work…

The task we address is retrieving sentences from some document corpus that contain information useful for generating (either automatically or by humans) the next turn in the dialogue.

We note that the dialogue turns can be questions, queries, arguments, statements, etc.”

A New Kind of Dataset For Language Model Training

The problem the researchers are solving is how to retrieve a sentence from a webpage as the answer to an open-ended question, a type of question that needs more than a yes or no answer.

The research paper explains that what is missing to make that ability happen in a machine is an appropriate conversational dataset.

They explain that existing datasets are used for two reasons:

  1. Datasets that evaluate dialogue responses from a generative AI but are not used to train it to actually retrieve the relevant information for those responses.
  2. Datasets for search engines or question answering, which focus on a single question-and-answer passage.

They explain the shortcomings of existing datasets:

“…in most of these datasets, the returned search results are not viewed as part of the dialogue.

…in both conversational passage retrieval and conversational QA datasets, there is a user asking questions or queries that reflect explicit intents with information needs, as opposed to natural dialogues where intents may be only implicitly represented, e.g., in affirmative statements.

To sum, existing conversational datasets do not combine natural human-human conversations with relevance annotations for sentences retrieved from a large document corpus.

We therefore constructed such a dataset…”

How the New Dataset Was Created

The researchers created a dataset that can be used to train an algorithm that can retrieve a sentence that is the correct response in an open-ended dialogue.

The dataset consists of Reddit conversations matched to answers from Wikipedia, plus human annotations (relevance ratings) of those question and answer pairs.

Reddit data was downloaded from Pushshift.io, an archive of Reddit conversations (Pushshift FAQ).

The research paper explains:

“To address a broader scope of this task where any type of dialogue can be used, we constructed a dataset that includes open-ended dialogues from Reddit, candidate sentences from Wikipedia for each dialogue and human annotations for the sentences.

The dataset includes 846 dialogues created from Reddit threads.

For each dialogue, 50 sentences were retrieved from Wikipedia using an unsupervised initial retrieval method.

These sentences were judged by crowd workers for relevance, that is, whether they contained information useful for generating the next turn in the dialogue.”

The dataset they created is available at GitHub.
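To make the structure described above easier to picture, here is a minimal Python sketch of how one record might be represented and how a simple unsupervised retriever could pick candidate sentences. The class names, field names, and the term-overlap scoring are assumptions made for illustration. They are not taken from the paper or the GitHub release, which may use a different schema and a different retrieval method.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Candidate:
    sentence: str            # a sentence taken from a Wikipedia page
    relevant: bool = False   # crowd-worker judgment (hypothetical field name)


@dataclass
class Dialogue:
    turns: list[str]         # dialogue turns taken from a Reddit thread
    candidates: list[Candidate] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(dialogue: Dialogue, corpus: list[str], k: int = 50) -> list[Candidate]:
    """Rank corpus sentences by term overlap with the dialogue and keep the top k.

    Term overlap is a stand-in chosen for this sketch; the paper only says
    that an unsupervised initial retrieval method was used.
    """
    query = tokenize(" ".join(dialogue.turns))
    ranked = sorted(corpus, key=lambda s: len(query & tokenize(s)), reverse=True)
    return [Candidate(s) for s in ranked[:k]]


corpus = [
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
    "The egg was laid by a bird that was not a chicken.",
]
dialogue = Dialogue(turns=["Which came first, the chicken or the egg?"])
dialogue.candidates = retrieve(dialogue, corpus, k=2)
```

In the real dataset each dialogue has 50 candidates, and the `relevant` flag would be filled in from the crowd workers' judgments rather than left at its default.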

Example of a dialogue question:

 

“Which came first, the chicken or the egg?”

An example of an irrelevant answer:

“Domesticated chickens have been around for about 10,000 years. Eggs have been around for hundreds of millions of years.”

An example of a correct webpage sentence that can be used for an answer is:

“Put more simply by Neil deGrasse Tyson:
‘Which came first: the chicken or the egg? The egg-laid by a bird that was not a chicken.’”

Retrieval Methodology

For the retrieval part, they cite prior research in language models and other methods and settle on a weak supervision approach.

They explain:

“Fine-tuning of retrieval models requires relevance labels for training examples in a target task.

These are sometimes scarce or unavailable.

One approach to circumvent this is to automatically generate labels and train a weakly supervised model on these annotations.

…We follow the weak supervision paradigm in our model training, with a novel weak Reddit annotator for retrieval in a dialogue context.”
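The weak supervision idea in the quote can be sketched in a few lines of Python. The passages quoted here do not describe how the paper's "weak Reddit annotator" works, so the heuristic below is purely a hypothetical stand-in: it labels a candidate sentence as relevant when it shares enough words with the reply that actually followed in the thread. The function names and the threshold are also assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def weak_label(candidate: str, next_turn: str, threshold: float = 0.3) -> int:
    """Return 1 if the candidate overlaps enough with the real next turn, else 0.

    Overlap is the Jaccard similarity of the two token sets. This is an
    illustrative heuristic, not the annotator used in the paper.
    """
    a, b = tokens(candidate), tokens(next_turn)
    if not a or not b:
        return 0
    jaccard = len(a & b) / len(a | b)
    return 1 if jaccard >= threshold else 0


def build_training_set(next_turn: str, candidates: list[str]) -> list[tuple[str, int]]:
    """Pair every candidate sentence with an automatically generated label."""
    return [(c, weak_label(c, next_turn)) for c in candidates]


next_turn = "The egg came first, laid by a bird that was not a chicken."
candidates = [
    "The egg was laid by a bird that was not a chicken.",
    "Paris is the capital of France.",
]
training_set = build_training_set(next_turn, candidates)
labels = [label for _, label in training_set]
```

The point of the pattern is that no human has to rate the sentences: the labels are noisy, but they can be produced for as many threads as are available and then used to fine-tune a ranker.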

Is the Dataset Successful?

Google and other organizations publish many research papers that demonstrate varying levels of success.

Some research concludes with limited success, moving the state of the art by just a little if at all.

The research papers that are of interest (to me) are the ones that are clearly successful and outperform the current state of the art.

That’s the case with the development of this dataset for training a language model to retrieve sentences that accurately serve as a turn in an open-ended dialogue.

They report that a BERT model that was further fine-tuned with this dataset performs better than the same model without that step.

They write:

“Indeed, while RANKBERTMS outperforms all non-fine-tuned models, the RANKBERTMS→R model, which was further fine-tuned using our weakly supervised training set, improves the performance.

This method attains the highest performance with all performance gains over other methods being statistically significant.

This finding also demonstrates the effectiveness of our weak annotator and weakly supervised training set, showing that performance can be improved without manual annotation for training.”

Elsewhere the researchers report:

“We show that a neural ranker which was fined-tuned using our weakly supervised training set outperforms all other tested models, including a neural ranker fine-tuned on the MS Marco passage retrieval dataset.”

They also write that, as successful as this approach is, they are interested in furthering the state of the art even more than they already have.

The research paper concludes:

“In future work, we would like to devise BERT-based retrieval models that are trained based on weak supervision alone, using a pre-trained BERT, without the need for large annotated training sets like MS Marco.

We would also like to ground generative language models with our retrieval models and study the conversations that emerge from such grounding.”

Could This Approach Be In Use?

Google rarely confirms when specific research is used. There are some cases, such as with BERT, where Google confirms they are using it.

But in general the standard response is that just because Google publishes a research paper or a patent doesn’t mean that they are using it in their search algorithm.

That said, the research paper, which dates from mid-2022, indicated that a future direction was to study how generative language models (like those behind Bard and Google’s Search Generative Experience) can be grounded with its retrieval models.

A generative AI chat experience can result in the AI output making things up, which is technically known as hallucinating.

Grounding means anchoring the AI chat output with facts, typically from online sources, to help prevent hallucinations.

Bing uses a system called the Bing Orchestrator that checks webpages to ground the GPT output in facts.

Grounding keeps AI output tied to facts, which is something this dataset may be capable of supporting, in addition to selecting sentences from webpages as part of an answer.
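As a rough illustration of what grounding involves, the Python sketch below tries to attach a supporting source to a generated claim and returns nothing when no source backs it up. It is a toy under stated assumptions: the word-overlap score, the `min_overlap` threshold, and the example URLs are all hypothetical, and it does not describe how Google or Bing actually ground their output.

```python
import re
from typing import Optional


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ground(claim: str, sources: dict[str, str], min_overlap: int = 3) -> Optional[str]:
    """Return the URL of the source sharing the most words with the claim.

    Returns None when no source shares at least min_overlap words, which
    this sketch treats as "unsupported" (a possible hallucination).
    """
    best_url, best_score = None, 0
    claim_words = words(claim)
    for url, text in sources.items():
        score = len(claim_words & words(text))
        if score > best_score:
            best_url, best_score = url, score
    return best_url if best_score >= min_overlap else None


sources = {
    "https://example.com/chickens": "Domesticated chickens have been around for about 10,000 years.",
    "https://example.com/paris": "Paris is the capital of France.",
}
supported = ground("Chickens have been domesticated for about 10,000 years.", sources)
unsupported = ground("Bananas grow on trees.", sources)
```

A plain word count like this is easily fooled by common words such as "the" and "of", which is one reason a trained sentence retrieval model is attractive for the job.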

Screenshot of an answer from Google's Search Generative Experience that shows the answer with three citations to webpages with facts that ground the AI answer.

Read the Research Paper:

Abstract Webpage: A Dataset for Sentence Retrieval for Open-Ended Dialogues

Actual Research Paper: A Dataset for Sentence Retrieval for Open-Ended Dialogues

Featured image by Shutterstock/Camilo Concha




SEO

Competing Against Brands & Nouns Of The Same Name


An illustration of a man in a business suit interacting with a floating 3D network of connected nodes, symbolizing SEO strategy and digital technology, set against a stylized outdoor background with clouds and plants

Establishing and building a brand has always been both a challenge and an investment, even before the days of the internet.

One thing the internet has done, however, is make the world a lot smaller, and the frequency of brand (or noun) conflicts has greatly increased.

In the past year, I’ve been emailed and asked questions about these conflicts at conferences more than I have in my entire SEO career.

When you share your brand name with another brand, town, or city, Google has to determine the dominant user interpretation of the query or, if there are several common interpretations, which of them are the most common.

Noun and brand conflicts typically happen when:

  • A rebrand’s research focuses on other business names and doesn’t take general user search into consideration.
  • A brand chooses a word in one language that has a different use in another.
  • A name is chosen that is also a noun (e.g. the name of a town or city).

Some examples include Finlandia, which is both a brand of cheese and vodka; Graco, which is both a brand of commercial products and a brand of baby products; and Kong, which is both the name of a pet toy manufacturer and a tech company.

User Interpretations

From conversations I’ve had with marketers and SEO pros working for various brands with this issue, the underlying theme (and potential cause) comes down to how Google handles interpretation of what users are looking for.

When a user enters a query, Google processes the query to identify any known entities it contains.

It does this to improve the relevance of search results being returned (as outlined in its 2015 Patent #9,009,192). From this, Google also works to return related, relevant results and search engine results page (SERP) elements.

For example, when you search for a specific film or TV series, Google may return a SERP feature containing relevant actors or news (if deemed relevant) about the media.

This then leads to interpretation.

When Google receives a query, the search results often need to cater for multiple common interpretations and intents. This is no different when someone searches for a recognized branded entity like Nike.

When I search for Nike, I get a search results page that is a combination of branded web assets such as the Nike website and social media profiles, the Map Pack showing local stores, PLAs, the Nike Knowledge Panel, and third-party online retailers.

This variation is to cater for the multiple interpretations and intents that a user just searching for “Nike” may have.

Brand Entity Disambiguation

Now, if we look at brands that share a name, such as Kong, when Google checks for entities and references against the Knowledge Graph (and knowledge base sources), it gets two close matches: Kong Company and Kong, Inc.

The search results page is also littered with product listing ads (PLAs) and ecommerce results for pet toys, but the second blue link organic result is Kong, Inc.

Also on page one, we can find references to a restaurant with the same name (UK-based search), and in the image carousel, Google is introducing the (King) Kong film franchise.

It is clear that Google sees the dominant interpretation of this query to be the pet toy company, but has diversified the SERP further to cater for secondary and tertiary meanings.

In 2015, Google was granted a patent that included features of how Google might determine differences in entities of the same name.

This includes the possible use of annotations within the Knowledge Base – such as the addition of a word or descriptor – to help disambiguate entities with the same name. For example, the entries for Dan Taylor could be:

  • Dan Taylor (marketer).
  • Dan Taylor (journalist).
  • Dan Taylor (olympian).
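The descriptor idea can be pictured with a short Python sketch. This is a toy model, not Google's implementation: the `Entity` class, the `related_terms` sets, and the scoring by shared context words are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    descriptor: str           # annotation that tells same-name entities apart
    related_terms: frozenset  # hypothetical terms that co-occur with the entity

    @property
    def label(self) -> str:
        return f"{self.name} ({self.descriptor})"


# A tiny stand-in for a knowledge base holding three same-name entries.
KNOWLEDGE_BASE = [
    Entity("Dan Taylor", "marketer", frozenset({"seo", "marketing", "agency"})),
    Entity("Dan Taylor", "journalist", frozenset({"news", "reporter", "article"})),
    Entity("Dan Taylor", "olympian", frozenset({"olympics", "medal", "athlete"})),
]


def disambiguate(name: str, context: str) -> list[str]:
    """Order same-name entities by how many context words match their terms."""
    context_words = set(context.lower().split())
    matches = [e for e in KNOWLEDGE_BASE if e.name == name]
    ranked = sorted(
        matches,
        key=lambda e: len(e.related_terms & context_words),
        reverse=True,
    )
    return [e.label for e in ranked]


ranking = disambiguate("Dan Taylor", "olympics medal results")
```

Here the extra words in the query push the olympian entry to the top, while the other two entries keep their original order because neither matches any context word.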

From experience, how Google determines the “dominant” interpretation of the query, and then how it orders search results and result types, comes down to:

  • Which results users are clicking on when they perform the query (SERP interaction).
  • How established the entity is within the user’s market/region.
  • How closely the entity is related to previous queries the user has searched (personalization).

I’ve also observed that there is a correlation between extended brand searches and how they affect exact match branded search.

It’s also worth highlighting that this can be dynamic. Should a brand start receiving a high volume of mentions from multiple news publishers, Google will take this into account and amend the search results to better meet users’ needs and potential query interpretations at that moment in time.

SEO For Brand Disambiguation

Building a brand is not a task solely on the shoulders of SEO professionals. It requires buy-in from the wider business and ensuring the brand and brand messaging are both defined and aligned.

SEO can, however, influence this effort through the full spectrum of SEO: technical, content, and digital PR.

Google understands entities through the concept of relatedness, which is determined by the co-occurrence of entities and by how Google classifies and discriminates between those entities.

We can influence this with technical SEO, by using granular Schema markup and by making sure the brand name is consistent across all web properties and references.
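As one way to picture granular Schema markup, the Python sketch below builds a schema.org Organization object as JSON-LD. The brand name, URLs, and description are placeholders, and the choice of properties (`disambiguatingDescription` and `sameAs` alongside `name` and `url`) is an assumption about what might help in a same-name conflict, not a documented Google requirement.

```python
import json


def organization_jsonld(name: str, url: str, description: str, same_as: list[str]) -> str:
    """Build a schema.org Organization object and return it as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,  # keep this identical across all web properties
        "url": url,
        # Short text that separates this entity from others with the same name.
        "disambiguatingDescription": description,
        # Profiles and references that corroborate which entity this is.
        "sameAs": same_as,
    }
    return json.dumps(data, indent=2)


# Placeholder values for illustration only.
markup = organization_jsonld(
    name="Example Brand",
    url="https://www.example.com",
    description="Example Brand, the pet toy manufacturer (not the software company).",
    same_as=["https://www.example.com/social-profile"],
)
print(markup)
```

The resulting string would typically be placed inside a `script` element of type `application/ld+json` on the brand's site.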

This ties into how we then write about the brand in our content and the co-occurrence of the brand name with other entity types.

To reinforce this and build brand awareness, this should be coupled with digital PR efforts with the objective of brand placement and corroborating topical relevance.

A Note On Search Generative Experience

As it looks likely that Search Generative Experience, or at least components of it, is going to be the future of search, it’s worth noting that in tests we’ve done, Google can, at times, have issues generating AI snapshots for brands when multiple brands share the same name.

To check your brand’s exposure, I recommend asking Google and generating an SGE snapshot for your brand + reviews.

If Google isn’t 100% sure which brand you mean, it will start to include reviews and comments on companies of the same (or very similar) name.

It does disclose that they are different companies in the snapshot, but if your user is skim-reading and only looking at the summaries, this could be an accidental negative brand touchpoint.


Featured Image: VectorMine/Shutterstock


SEO

Google Rolls Out New ‘Web’ Filter For Search Results


Google logo inside the Google Indonesia office in Jakarta

Google is introducing a filter that allows you to view only text-based webpages in search results.

The “Web” filter, rolling out globally over the next two days, addresses demand from searchers who prefer a stripped-down, simplified view of search results.

Danny Sullivan, Google’s Search Liaison, states in an announcement:

“We’ve added this after hearing from some that there are times when they’d prefer to just see links to web pages in their search results, such as if they’re looking for longer-form text documents, using a device with limited internet access, or those who just prefer text-based results shown separately from search features.”

The new functionality is a throwback to when search results were more straightforward. Now, they often combine rich media like images, videos, and shopping ads alongside the traditional list of web links.

How It Works

On mobile devices, the “Web” filter will be displayed alongside other filter options like “Images” and “News.”

Screenshot from: twitter.com/GoogleSearchLiaison, May 2024.

If Google’s systems don’t automatically surface it based on the search query, desktop users may need to select “More” to access it.

Screenshot from: twitter.com/GoogleSearchLiaison, May 2024.

More About Google Search Filters

Google’s search filters allow you to narrow results by type. The options displayed are dynamically generated based on your search query and what Google’s systems determine could be most relevant.

The “All Filters” option provides access to filters that are not shown automatically.

Alongside filters, Google also displays “Topics” – suggested related terms that can further refine or expand a user’s original query into new areas of exploration.

For more about Google’s search filters, see its official help page.


Featured Image: egaranugrah/Shutterstock




SEO

Why Google Can’t Tell You About Every Ranking Drop


In a recent Twitter exchange, Google’s Search Liaison, Danny Sullivan, provided insight into how the search engine handles algorithmic spam actions and ranking drops.

The discussion was sparked by a website owner’s complaint about a significant traffic loss and the inability to request a manual review.

Sullivan clarified that a site could be affected by an algorithmic spam action or could simply not be ranking well due to other factors.

He emphasized that many sites experiencing ranking drops mistakenly attribute it to an algorithmic spam action when that may not be the case.

“I’ve looked at many sites where people have complained about losing rankings and decide they have a algorithmic spam action against them, but they don’t.”

Sullivan’s full statement will help you understand Google’s transparency challenges.

Additionally, he explains why the desire for manual review to override automated rankings may be misguided.

Challenges In Transparency & Manual Intervention

Sullivan acknowledged the idea of providing more transparency in Search Console, potentially notifying site owners of algorithmic actions similar to manual actions.

However, he highlighted two key challenges:

  1. Revealing algorithmic spam indicators could allow bad actors to game the system.
  2. Algorithmic actions are not site-specific and cannot be manually lifted.

Sullivan expressed sympathy for the frustration of not knowing the cause of a traffic drop and the inability to communicate with someone about it.

However, he cautioned against the desire for a manual intervention to override the automated systems’ rankings.

Sullivan states:

“…you don’t really want to think “Oh, I just wish I had a manual action, that would be so much easier.” You really don’t want your individual site coming the attention of our spam analysts. First, it’s not like manual actions are somehow instantly processed. Second, it’s just something we know about a site going forward, especially if it says it has change but hasn’t really.”

Determining Content Helpfulness & Reliability

Moving beyond spam, Sullivan discussed various systems that assess the helpfulness, usefulness, and reliability of individual content and sites.

He acknowledged that these systems are imperfect and some high-quality sites may not be recognized as well as they should be.

“Some of them ranking really well. But they’ve moved down a bit in small positions enough that the traffic drop is notable. They assume they have fundamental issues but don’t, really — which is why we added a whole section about this to our debugging traffic drops page.”

Sullivan revealed ongoing discussions about providing more indicators in Search Console to help creators understand their content’s performance.

“Another thing I’ve been discussing, and I’m not alone in this, is could we do more in Search Console to show some of these indicators. This is all challenging similar to all the stuff I said about spam, about how not wanting to let the systems get gamed, and also how there’s then no button we would push that’s like “actually more useful than our automated systems think — rank it better!” But maybe there’s a way we can find to share more, in a way that helps everyone and coupled with better guidance, would help creators.”

Advocacy For Small Publishers & Positive Progress

In response to a suggestion from Brandon Saltalamacchia, founder of RetroDodo, about manually reviewing “good” sites and providing guidance, Sullivan shared his thoughts on potential solutions.

He mentioned exploring ideas such as self-declaration through structured data for small publishers and learning from that information to make positive changes.

“I have some thoughts I’ve been exploring and proposing on what we might do with small publishers and self-declaring with structured data and how we might learn from that and use that in various ways. Which is getting way ahead of myself and the usual no promises but yes, I think and hope for ways to move ahead more positively.”

Sullivan said he can’t make promises or implement changes overnight, but he expressed hope for finding ways to move forward positively.


Featured Image: Tero Vesalainen/Shutterstock


