SEO

How NLP & NLU Work For Semantic Search

Published April 26, 2022

Natural language processing (NLP) and natural language understanding (NLU) are two often-confused technologies that make search more intelligent and ensure people can search and find what they want.

This intelligence is a core component of semantic search.

NLP and NLU are why you can type “dresses” and find that long-sought-after “NYE Party Dress” and why you can type “Matthew McConnahey” and get Mr. McConnaughey back.

With these two technologies, searchers can find what they want without having to type their query exactly as it’s found on a page or in a product.

NLP is one of those things that has built up such a large meaning that it’s easy to look past the fact that it tells you exactly what it is: NLP processes natural language, specifically into a format that computers can understand.

These kinds of processing can include tasks like normalization, spelling correction, or stemming, each of which we’ll look at in more detail.

NLU, on the other hand, aims to “understand” what a block of natural language is communicating.

It performs tasks that can, for example, identify verbs and nouns in sentences or important items within a text. People or programs can then use this information to complete other tasks.

Computers seem advanced because they can do a lot of actions in a short period of time. However, in a lot of ways, computers are quite daft.

They need the information to be structured in specific ways to build upon it. For natural language data, that’s where NLP comes in.

It takes messy data (and natural language can be very messy) and processes it into something that computers can work with.

Text Normalization

When searchers type text into a search bar, they are trying to find a good match, not play “guess the format.”

For example, to require a user to type a query in exactly the same format as the matching words in a record is unfair and unproductive.

We use text normalization to do away with this requirement so that the text will be in a standard format no matter where it’s coming from.

As we go through different normalization steps, we’ll see that there is no approach that everyone follows. Each normalization step generally increases recall and decreases precision.

A quick aside: “recall” means a search engine finds results that are known to be good.

Precision means a search engine finds only good results.

Search results could have 100% recall by returning every document in an index, but precision would be poor.

Conversely, a search engine could have 100% precision by only returning documents that it knows to be a perfect fit, but it will likely miss some good results.

Again, normalization generally increases recall and decreases precision.

Whether that movement toward one end of the recall-precision spectrum is valuable depends on the use case and the search technology. It isn’t a question of applying all normalization techniques but deciding which ones provide the best balance of precision and recall.
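The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a tiny index of ten documents of which four are known to be good; the function name and the document IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one result set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: what share of the returned results are good?
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what share of the good results were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


index = range(1, 11)       # ten documents in the index (illustrative)
relevant = {1, 2, 3, 4}    # the four documents known to be good

# Returning everything finds every good result but buries it in noise.
print(precision_recall(index, relevant))   # (0.4, 1.0)

# Returning only one sure match is perfectly precise but misses the rest.
print(precision_recall({1}, relevant))     # (1.0, 0.25)
```

Each normalization step tends to move a search engine from the second situation toward the first.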

Letter Normalization

The simplest normalization you could imagine would be the handling of letter case.

In English, at least, words are generally capitalized at the beginning of sentences, occasionally in titles, and when they are proper nouns. (There are other rules, too, depending on whom you ask.)

But in German, all nouns are capitalized. Other languages have their own rules.

These rules are useful. Otherwise, we wouldn’t follow them.

For example, capitalizing the first words of sentences helps us quickly see where sentences begin.

That usefulness, however, is diminished in an information retrieval context.

The meanings of words don’t change simply because they are in a title and have their first letter capitalized.

Even trickier is that there are rules, and then there is how people actually write.

If I text my wife, “SOMEONE HIT OUR CAR!” we all know that I’m talking about a car and not something different because the word is capitalized.

We can see this clearly by reflecting on how many people don’t use capitalization when communicating informally – which is, incidentally, how most case-normalization works.

Of course, we know that sometimes capitalization does change the meaning of a word or phrase. We can see that “cats” are animals, and “Cats” is a musical.

In most cases, though, the increased precision that comes with not normalizing on case is outweighed by the recall it costs.

The difference between the two is easy to tell via context, too, which we’ll be able to leverage through natural language understanding.

While less common in English, handling diacritics is also a form of letter normalization.

Diacritics are the marks, or “glyphs,” attached to letters, as in á, ë, or ç.

Words can otherwise be spelled the same, but added diacritics can change the meaning. In French, “élève” means “student,” while “élevé” means “elevated.”

Nonetheless, many people will not include the diacritics when searching, and so another form of normalization is to strip all diacritics, leaving behind the simple (and now ambiguous) “eleve.”
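Both kinds of letter normalization fit in a few lines. This is a minimal sketch using only Python's standard library; the function name is an assumption, and a real engine may well handle case and diacritics differently for each language.

```python
import unicodedata


def normalize_letters(text: str) -> str:
    """Fold case, then strip diacritics."""
    # casefold() is a more aggressive lower() intended for matching.
    folded = text.casefold()
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_letters("Cats"))    # cats
print(normalize_letters("élève"))   # eleve
print(normalize_letters("élevé"))   # eleve
```

The last two lines show the cost: after normalization, "student" and "elevated" are indistinguishable.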

Tokenization

The next normalization challenge is breaking down the text the searcher has typed in the search bar and the text in the document.

This step is necessary because word order does not need to be exactly the same between the query and the document text, except when a searcher wraps the query in quotes.

Breaking queries, phrases, and sentences into words may seem like a simple task: Just break up the text at each space.

Problems show up quickly with this approach. Again, let’s start with English.

Separating on spaces alone means that the phrase “Let’s break up this phrase!” yields us let’s, break, up, this, and phrase! as words.

For search, we almost surely don’t want the exclamation point at the end of the word “phrase.”

Whether we want to keep the contracted word “let’s” together is not as clear.

Some software will break the word down even further (“let” and “‘s”) and some won’t.

Some will not break down “let’s” while breaking down “don’t” into two pieces.

This process is called “tokenization.”

We call it tokenization for reasons that should now be clear: What we end up with are not words but discrete groups of characters. This is even more true for languages other than English.

German speakers, for example, can merge words (more accurately “morphemes,” but close enough) together to form a larger word. The German word for “dog house” is “Hundehütte,” which contains the words for both “dog” (“Hund”) and “house” (“Hütte”).
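To see the difference between splitting on spaces and a slightly smarter tokenizer, here is a minimal sketch. The regular expression is one assumption among many possible ones: it keeps contractions together and drops punctuation, which is a design choice rather than a rule.

```python
import re

text = "Let's break up this phrase!"

# Naive approach: split on whitespace only. Punctuation stays attached.
print(text.split())
# ["Let's", 'break', 'up', 'this', 'phrase!']

# Runs of word characters, optionally joined by an apostrophe
# (straight or curly), so that "let's" stays in one piece.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize(text))
# ["let's", 'break', 'up', 'this', 'phrase']
```

Note that this sketch would still treat "Hundehütte" as a single token. Splitting compound words needs a separate, usually dictionary-based step that is not shown here.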

Nearly all search engines tokenize text, but there are further steps an engine can take to normalize the tokens. Two related approaches are stemming and lemmatization.

Stemming And Lemmatization

Stemming and lemmatization take different forms of tokens and break them down for comparison.

For example, take the words “calculator” and “calculation,” or “slowing” and “slowly.”

We can see there are some clear similarities.

Stemming breaks a word down to its “stem,” the base that other variants of the word are built on. Stemming is fairly straightforward; you could do it on your own.

What’s the stem of “stemming?”

You can probably guess that it’s “stem.” Often stemming means removing prefixes or suffixes, as in this case.

There are multiple stemming algorithms, and the most popular is the Porter Stemming Algorithm, which has been around since the 1980s. It is a series of steps applied to a token to get to the stem.

Stemming can sometimes lead to results that you wouldn’t foresee.

Looking at the words “carry” and “carries,” you might expect that the stem of each of these is “carry.”

The actual stem, at least according to the Porter Stemming Algorithm, is “carri.”

This is because stemming attempts to compare related words and break down words into their smallest possible parts, even if that part is not a word itself.
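To show how a rule-based stemmer can land on a non-word like “carri,” here is a toy sketch. It is not the Porter Stemming Algorithm: it only applies a few suffix rules loosely modeled on the early steps of Porter-style stemmers, and the function name is made up for illustration.

```python
VOWELS = set("aeiou")


def toy_stem(token: str) -> str:
    """Apply a few illustrative suffix rules. NOT the full Porter algorithm."""
    word = token.lower()
    # Plural-style suffixes: "sses" -> "ss", "ies" -> "i", "s" -> "".
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # A trailing "y" after a consonant becomes "i", so that
    # "carry" and "carries" end up with the same stem.
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word


print(toy_stem("carry"))     # carri
print(toy_stem("carries"))   # carri
print(toy_stem("says"))      # say
```

The two forms of “carry” now match each other, which is all a search engine needs, even though the shared stem is not a word.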

On the other hand, if you want an output that will always be a recognizable word, you want lemmatization. Again, there are different lemmatizers, such as NLTK using Wordnet.

Lemmatization breaks a token down to its “lemma,” or the word which is considered the base for its derivations. The lemma from Wordnet for “carry” and “carries,” then, is what we expected before: “carry.”

Lemmatization will generally not break down words as much as stemming, nor will as many different word forms be considered the same after the operation.

The stems for “say,” “says,” and “saying” are all “say,” while the lemmas from Wordnet are “say,” “say,” and “saying.” To get these lemmas, lemmatizers are generally corpus-based.

If you want the broadest recall possible, you’ll want to use stemming. If you want the best possible precision, use neither stemming nor lemmatization.

Which you go with ultimately depends on your goals, but most searches can generally perform very well with neither stemming nor lemmatization, retrieving the right results, and not introducing noise.

Plurals

If you decide not to include lemmatization or stemming in your search engine, there is still one normalization technique that you should consider.

That is the normalization of plurals to their singular form.

Generally, ignoring plurals is done through the use of dictionaries.

Even if “de-pluralization” seems as simple as chopping off an “-s,” that’s not always the case. The first problem is with irregular plurals, such as “deer,” “oxen,” and “mice.”

A second problem is pluralization with an “-es” suffix, such as “potato.” Finally, there are simply the words that end in an “s” but aren’t plural, like “always.”

A dictionary-based approach will ensure that you introduce recall, but not incorrectly.
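A dictionary-based approach can be as simple as a lookup table. The following is a minimal sketch with a handful of made-up entries; a production dictionary would be far larger and specific to the language.

```python
# Illustrative plural -> singular pairs. Real dictionaries are much larger.
SINGULARS = {
    "dresses": "dress",
    "potatoes": "potato",
    "mice": "mouse",
    "oxen": "ox",
}


def singularize(token: str) -> str:
    """Return the singular form if the dictionary knows one, else the token."""
    word = token.lower()
    return SINGULARS.get(word, word)


print(singularize("potatoes"))   # potato
print(singularize("mice"))       # mouse
print(singularize("always"))     # always (ends in "s" but is left alone)
```

Because only listed pairs are rewritten, removing a problematic pair is a one-line change.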

Just as with lemmatization and stemming, whether you normalize plurals is dependent on your goals.

Cast a wider net by normalizing plurals, a more precise one by avoiding normalization.

Usually, normalizing plurals is the right choice, and you can remove normalization pairs from your dictionary when you find them causing problems.

One area, however, where you will almost always want to introduce increased recall is when handling typos.

Typo Tolerance And Spell Check

We have all encountered typo tolerance and spell check within search, but it’s useful to think about why it’s present.

Sometimes, there are typos because fingers slip and hit the wrong key.

Other times, the searcher thinks a word is spelled differently than it is.

Increasingly, “typos” can also result from poor speech-to-text understanding.

Finally, words can seem like they have typos but really don’t, such as in comparing “scream” and “cream.”

The simplest way to handle these typos, misspellings, and variations is to avoid trying to correct them at all. Instead, some algorithms can compare different tokens and measure how similar they are.

One of these is the Damerau-Levenshtein Distance algorithm.

This measure looks at how many edits are needed to go from one token to another.

You can then filter out all tokens with a distance that is too high.

(Two is generally a good threshold, but you will probably want to adjust this based on the length of the token.)

After filtering, you can use the distance for sorting results or feeding into a ranking algorithm.
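The filter-then-sort idea can be sketched as follows. The distance function implements the optimal string alignment variant of Damerau-Levenshtein (insertions, deletions, substitutions, and swaps of adjacent characters), which is an assumption about which variant to use. The function names and the sample tokens are illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete every character of a
    for j in range(cols):
        d[0][j] = j          # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters, as in "teh" -> "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def candidates(query_token, index_tokens, max_distance=2):
    """Keep tokens within the threshold, closest first."""
    scored = ((osa_distance(query_token, t), t) for t in index_tokens)
    return sorted((dist, t) for dist, t in scored if dist <= max_distance)


print(osa_distance("scream", "cream"))   # 1
print(osa_distance("teh", "the"))        # 1
print(candidates("dreses", ["dress", "dresses", "shoes"]))
# [(1, 'dress'), (1, 'dresses')]
```

The returned distances can be used directly as a sort key or passed on as one feature among many in a ranking algorithm.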

Many times, context can matter when determining if a word is misspelled or not. The word “scream” is probably correct after “I,” but not after “ice.”

Machine learning can be a solution for this by bringing context to this NLP task.

This spell check software can use the context around a word to identify whether it is likely to be misspelled and its most likely correction.

Typos In Documents

One thing that we skipped over before is that words may not only have typos when a user types them into a search bar.

Words may also have typos inside a document.

This is especially true when the documents are made of user-generated content.

This detail is relevant because if a search engine is only looking at the query for typos, it is missing half of the information.

The best typo tolerance should work across both query and document, which is why edit distance generally works best for retrieving and ranking results.

Spell check can be used to craft a better query or provide feedback to the searcher, but it is often unnecessary and should never stand alone.

Natural Language Understanding

While NLP is all about processing text and natural language, NLU is about understanding that text.

Named Entity Recognition

A task that can aid in search is that of named entity recognition, or NER. NER identifies key items, or “entities,” inside of text.

While some people will call NER natural language processing and others will call it natural language understanding, what’s clear is that it can find what’s important within a text.

For the query “NYE party dress” you would perhaps get back an entity of “dress” that is mapped to a type of “category.”

NER will always map an entity to a type, from as generic as “place” or “person,” to as specific as your own facets.

NER can also use context to identify entities.

A query of “white house” may refer to a place, while “white house paint” might refer to a color of “white” and a product category of “paint.”

Query Categorization

Named entity recognition is valuable in search because it can be used in conjunction with facet values to provide better search results.

Recalling the “white house paint” example, you can use the “white” color and the “paint” product category to filter down your results to only show those that match those two values.

This would give you high precision.

If you don’t want to go that far, you can simply boost all products that match one of the two values.

Query categorization can also help with recall.

For searches with few results, you can use the entities to include related products.

Imagine that there are no products that match the keywords “white house paint.”

In this case, leveraging the product category of “paint” can return other paints that might be a decent alternative, such as that nice eggshell color.
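The three uses of query entities described above (strict filtering, boosting, and falling back for recall) can be sketched like this. The sketch assumes the entities have already been extracted from the query; the product records, field names, and function names are all made up for illustration.

```python
# Hypothetical catalog records with two facets each.
PRODUCTS = [
    {"name": "Bright White Interior Paint", "color": "white", "category": "paint"},
    {"name": "Eggshell Interior Paint", "color": "eggshell", "category": "paint"},
    {"name": "White Picket Fence Panel", "color": "white", "category": "fencing"},
]

# Assumed NER output for the query "white house paint".
ENTITIES = {"color": "white", "category": "paint"}


def strict_filter(products, entities):
    """High precision: keep only products matching every entity."""
    return [p for p in products
            if all(p.get(facet) == value for facet, value in entities.items())]


def boost(products, entities):
    """Softer option: rank products by how many entities they match."""
    def matches(p):
        return sum(p.get(facet) == value for facet, value in entities.items())
    return sorted(products, key=matches, reverse=True)


def category_fallback(products, entities):
    """Recall option: when little matches, show everything in the category."""
    return [p for p in products if p.get("category") == entities.get("category")]


print([p["name"] for p in strict_filter(PRODUCTS, ENTITIES)])
# ['Bright White Interior Paint']
print([p["name"] for p in category_fallback(PRODUCTS, ENTITIES)])
# ['Bright White Interior Paint', 'Eggshell Interior Paint']
```

Which of the three to use depends on how many results the strict filter leaves and how much noise the searcher will tolerate.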

Document Tagging

Another way that named entity recognition can help with search quality is by moving the task from query time to ingestion time (when the document is added to the search index).

When ingesting documents, NER can use the text to tag those documents automatically.

These documents will then be easier to find for the searchers.

Either the searchers use explicit filtering, or the search engine applies automatic query-categorization filtering, to enable searchers to go directly to the right products using facet values.
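Tagging at ingestion time can be illustrated with a simple dictionary lookup. This is a stand-in for NER, not a trained model: the facet vocabulary and the function name are assumptions, and a real system would use context rather than exact word matches.

```python
import re

# Hypothetical facet vocabulary for a clothing and home-goods store.
FACET_TERMS = {
    "color": {"white", "red", "black"},
    "category": {"dress", "paint", "shoes"},
}


def tag_document(text: str) -> dict:
    """Return the facet values found in the text, grouped by facet."""
    tokens = set(re.findall(r"\w+", text.lower()))
    tags = {}
    for facet, terms in FACET_TERMS.items():
        found = sorted(tokens & terms)
        if found:
            tags[facet] = found
    return tags


print(tag_document("Sequined NYE party dress in black"))
# {'color': ['black'], 'category': ['dress']}
```

The resulting tags would be stored with the document in the index, ready for explicit filters or automatic query categorization.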

Intent Detection

Related to entity recognition is intent detection, or determining the action a user wants to take.

Intent detection is not the same as what we talk about when we say “identifying searcher intent.”

Identifying searcher intent is getting people to the right content at the right time.

Intent detection maps a request to a specific, pre-defined intent.

It then takes action based on that intent. A user searching for “how to make returns” might trigger the “help” intent, while “red shoes” might trigger the “product” intent.

In the first case, you could route the search to your help desk search.

In the second one, you could route it to the product search. This isn’t so different from what you see when you search for the weather on Google.

Look, and notice that you get a weather box at the very top of the page. (Newly launched web search engine Andi takes this concept to the extreme, bundling search in a chatbot.)
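Mapping a query to a pre-defined intent and routing on it can be sketched with a few keyword rules. Real intent detection is usually a trained classifier; the phrases, intent names, and backend names below are all assumptions made for illustration.

```python
# Hypothetical trigger phrases for the "help" intent.
HELP_PHRASES = ("how to", "return", "refund", "shipping")

# Hypothetical search backends, one per intent.
ROUTES = {"help": "help_desk_search", "product": "product_search"}


def detect_intent(query: str) -> str:
    """Map a query to one of two pre-defined intents."""
    q = query.lower()
    if any(phrase in q for phrase in HELP_PHRASES):
        return "help"
    return "product"   # default when nothing else matches


def route(query: str) -> str:
    """Pick the search backend for the detected intent."""
    return ROUTES[detect_intent(query)]


print(route("how to make returns"))   # help_desk_search
print(route("red shoes"))             # product_search
```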

For most search engines, intent detection, as outlined here, isn’t necessary.

Most search engines only have a single content type on which to search at a time.

When there are multiple content types, federated search can perform admirably by showing multiple search results in a single UI at the same time.

Other NLP And NLU tasks

There are plenty of other NLP and NLU tasks, but these are usually less relevant to search.

Tasks like sentiment analysis can be useful in some contexts, but search isn’t one of them.

You could imagine using translation to search multi-language corpuses, but it rarely happens in practice, and is just as rarely needed.

Question answering is an NLU task that is increasingly implemented into search, especially search engines that expect natural language searches.

Once again, you can see this on major web search engines.

Google, Bing, and Kagi will all immediately answer the question “how old is the Queen of England?” without needing to click through to any results.

Some search engine technologies have explored implementing question answering for more limited search indices, but outside of help desks or long, action-oriented content, the usage is limited.

Few searchers are going to an online clothing store and asking questions to a search bar.

Summarization is an NLU task that is more useful for search.

Much like with the use of NER for document tagging, automatic summarization can enrich documents. Summaries can be used to match documents to queries, or to provide a better display of the search results.

This better display can help searchers be confident that they have gotten good results and get them to the right answers more quickly.

Even including newer search technologies using images and audio, the vast, vast majority of searches happen with text. To get the right results, it’s important to make sure the search is processing and understanding both the query and the documents.

Semantic search brings intelligence to search engines, and natural language processing and understanding are important components.

NLP and NLU tasks like tokenization, normalization, tagging, typo tolerance, and others can help make sure that searchers don’t need to be search experts.

Instead, they can go from need to solution “naturally” and quickly.


Featured Image: ryzhi/Shutterstock


SEO

Measuring Content Impact Across The Customer Journey

Published April 23, 2024

Max


Understanding the impact of your content at every touchpoint of the customer journey is essential – but that’s easier said than done. From attracting potential leads to nurturing them into loyal customers, there are many touchpoints to look into.

So how do you identify and take advantage of these opportunities for growth?

Watch this on-demand webinar and learn a comprehensive approach for measuring the value of your content initiatives, so you can optimize resource allocation for maximum impact.

You’ll learn:

Fresh methods for measuring your content’s impact.
Fascinating insights using first-touch attribution, and how it differs from the usual last-touch perspective.
Ways to persuade decision-makers to invest in more content by showcasing its value convincingly.

With Bill Franklin and Oliver Tani of DAC Group, we unravel the nuances of attribution modeling, emphasizing the significance of layering first-touch and last-touch attribution within your measurement strategy.

Check out these insights to help you craft compelling content tailored to each stage, using an approach rooted in first-hand experience to ensure your content resonates.

Whether you’re a seasoned marketer or new to content measurement, this webinar promises valuable insights and actionable tactics to elevate your SEO game and optimize your content initiatives for success.

View the slides below or check out the full webinar for all the details.

SEO

How to Find and Use Competitor Keywords

Published April 23, 2024

Entireweb News Bot

Competitor keywords are the keywords your rivals rank for in Google’s search results. They may rank organically or pay for Google Ads to rank in the paid results.

Knowing your competitors’ keywords is the easiest form of keyword research. If your competitors rank for or target particular keywords, it might be worth it for you to target them, too.

There is no way to see your competitors’ keywords without a tool like Ahrefs, which has a database of keywords and the sites that rank for them. As far as we know, Ahrefs has the biggest database of these keywords.

How to find all the keywords your competitor ranks for

Go to Ahrefs’ Site Explorer
Enter your competitor’s domain
Go to the Organic keywords report

The report is sorted by traffic to show you the keywords sending your competitor the most visits. For example, Mailchimp gets most of its organic traffic from the keyword “mailchimp.”

Since you’re unlikely to rank for your competitor’s brand, you might want to exclude branded keywords from the report. You can do this by adding a Keyword > Doesn’t contain filter. In this example, we’ll filter out keywords containing “mailchimp” or any potential misspellings:

Filtering out branded keywords in Organic keywords report

If you’re a new brand competing with one that’s established, you might also want to look for popular low-difficulty keywords. You can do this by setting the Volume filter to a minimum of 500 and the KD filter to a maximum of 10.

Finding popular, low-difficulty keywords in Organic keywords
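If you export the report and want to apply the same filters outside the tool, the logic is a few comparisons. This is a minimal sketch under stated assumptions: the field names ("keyword", "volume", "kd") and the sample rows are hypothetical and do not reflect any actual export format or real metrics.

```python
# Hypothetical exported rows. Field names and numbers are made up.
ROWS = [
    {"keyword": "mailchimp login", "volume": 90000, "kd": 60},
    {"keyword": "email subject lines", "volume": 1200, "kd": 8},
    {"keyword": "newsletter ideas", "volume": 300, "kd": 5},
]


def low_difficulty_keywords(rows, brand_terms, min_volume=500, max_kd=10):
    """Drop branded keywords, then keep popular, low-difficulty ones."""
    return [
        row for row in rows
        if not any(term in row["keyword"].lower() for term in brand_terms)
        and row["volume"] >= min_volume
        and row["kd"] <= max_kd
    ]


print([r["keyword"] for r in low_difficulty_keywords(ROWS, ["mailchimp"])])
# ['email subject lines']
```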

How to find keywords your competitor ranks for, but you don’t

Go to Competitive Analysis
Enter your domain in the This target doesn’t rank for section
Enter your competitor’s domain in the But these competitors do section

Hit “Show keyword opportunities,” and you’ll see all the keywords your competitor ranks for, but you don’t.

You can also add a Volume and KD filter to find popular, low-difficulty keywords in this report.

How to find keywords multiple competitors rank for, but you don’t

Go to Competitive Analysis
Enter your domain in the This target doesn’t rank for section
Enter the domains of multiple competitors in the But these competitors do section

Competitive analysis report with multiple competitors

You’ll see all the keywords that at least one of these competitors ranks for, but you don’t.

Content gap report with multiple competitors

You can also narrow the list down to keywords that all competitors rank for. Click on the Competitors’ positions filter and choose All 3 competitors:

Selecting all 3 competitors to see keywords all 3 competitors rank for

Go to Ahrefs’ Site Explorer
Enter your competitor’s domain
Go to the Paid keywords report

This report shows you the keywords your competitors are targeting via Google Ads.

Since your competitor is paying for traffic from these keywords, it may indicate that they’re profitable for them—and could be for you, too.

You know what keywords your competitors are ranking for or bidding on. But what do you do with them? There are basically three options.

1. Create pages to target these keywords

You can only rank for keywords if you have content about them. So, the most straightforward thing you can do for competitors’ keywords you want to rank for is to create pages to target them.

However, before you do this, it’s worth clustering your competitor’s keywords by Parent Topic. This will group keywords that mean the same or similar things so you can target them all with one page.

Here’s how to do that:

Export your competitor’s keywords, either from the Organic Keywords or Content Gap report

Paste them into Keywords Explorer
Click the “Clusters by Parent Topic” tab

For example, Mailchimp ranks for keywords like “what is digital marketing” and “digital marketing definition.” These and many others get clustered under the Parent Topic of “digital marketing” because people searching for them are all looking for the same thing: a definition of digital marketing. You only need to create one page to potentially rank for all these keywords.

Keywords under the cluster of "digital marketing"

2. Optimize existing content by filling subtopics

You don’t always need to create new content to rank for competitors’ keywords. Sometimes, you can optimize the content you already have to rank for them.

How do you know which keywords you can do this for? Try this:

Export your competitor’s keywords

Paste them into Keywords Explorer
Click the “Clusters by Parent Topic” tab
Look for Parent Topics you already have content about

For example, if we analyze our competitor, we can see that seven keywords they rank for fall under the Parent Topic of “press release template.”

Our competitor ranks for seven keywords that fall under the "press release template" cluster

If we search our site, we see that we already have a page about this topic.

Site search finds that we already have a blog post on press release templates

If we click the caret and check the keywords in the cluster, we see keywords like “press release example” and “press release format.”

Keywords under the cluster of "press release template"

To rank for the keywords in the cluster, we can probably optimize the page we already have by adding sections about the subtopics of “press release examples” and “press release format.”

3. Target these keywords with Google Ads

Paid keywords are the simplest—look through the report and see if there are any relevant keywords you might want to target, too.

For example, Mailchimp is bidding for the keyword “how to create a newsletter.”

If you’re ConvertKit, you may also want to target this keyword since it’s relevant.

If you decide to target the same keyword via Google Ads, you can hover over the magnifying glass to see the ads your competitor is using.

You can also see the landing page your competitor directs ad traffic to under the URL column.


SEO

Google Confirms Links Are Not That Important

Published April 23, 2024

Max

Google confirms that links are not that important anymore

Google’s Gary Illyes confirmed at a recent search marketing conference that Google needs very few links, adding to the growing body of evidence that publishers need to focus on other factors. Gary tweeted confirmation that he did indeed say those words.

Background Of Links For Ranking

In the late 1990s, links were found to be a good signal for search engines to use in validating how authoritative a website is. Soon after, Google discovered that anchor text could be used to provide semantic signals about what a webpage was about.

One of the most important research papers was Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, published around 1998 (link to research paper at the end of the article). The main discovery of this research paper is that there were too many web pages and there was no objective way to filter search results for quality in order to rank web pages for a subjective idea of relevance.

The author of the research paper discovered that links could be used as an objective filter for authoritativeness.

Kleinberg wrote:

“To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most “authoritative” or ‘definitive’ ones.”

This is the most influential research paper on links because it kick-started more research on ways to use links not only as an authority metric but also as a subjective metric for relevance.

Objective is something factual. Subjective is something that’s closer to an opinion. The founders of Google discovered how to use the subjective opinions of the Internet as a relevance metric for what to rank in the search results.

What Larry Page and Sergey Brin discovered and shared in their research paper (The Anatomy of a Large-Scale Hypertextual Web Search Engine – link at end of this article) was that it was possible to harness the power of anchor text to determine the subjective opinion of relevance from actual humans. It was essentially crowdsourcing the opinions of millions of websites, expressed through the link structure between webpages.

What Did Gary Illyes Say About Links In 2024?

At a recent search conference in Bulgaria, Google’s Gary Illyes made a comment about how Google doesn’t really need that many links and how Google has made links less important.

Patrick Stox tweeted about what he heard at the search conference:

” ‘We need very few links to rank pages… Over the years we’ve made links less important.’ @methode #serpconf2024″

Google’s Gary Illyes tweeted a confirmation of that statement:

“I shouldn’t have said that… I definitely shouldn’t have said that”

Why Links Matter Less

The initial state of anchor text when Google first used links for ranking purposes was absolutely non-spammy, which is why it was so useful. Hyperlinks were primarily used as a way to send traffic from one website to another website.

But by 2004 or 2005, Google was using statistical analysis to detect manipulated links. Around 2004, “powered-by” links in website footers stopped passing anchor text value. By 2006, links close to the word “advertising” stopped passing link value, and links from directories stopped passing ranking value. By 2012, Google deployed a massive link algorithm called Penguin that destroyed the rankings of likely millions of websites, many of which were using guest posting.

The link signal eventually became so bad that Google decided in 2019 to selectively use nofollow links for ranking purposes. Google’s Gary Illyes confirmed that the change to nofollow was made because of the link signal.

Google Explicitly Confirms That Links Matter Less

In 2023, Google’s Gary Illyes shared at PubCon Austin that links were not even in the top 3 of ranking factors. Then in March 2024, coinciding with the March 2024 Core Algorithm Update, Google updated its spam policies documentation to downplay the importance of links for ranking purposes.


The documentation previously said:

“Google uses links as an important factor in determining the relevancy of web pages.”

The documentation was updated to remove the word “important.”

Links are now listed as just another factor:

“Google uses links as a factor in determining the relevancy of web pages.”

At the beginning of April, Google’s John Mueller advised that there are more useful SEO activities to engage in than links.

Mueller explained:

“There are more important things for websites nowadays, and over-focusing on links will often result in you wasting your time doing things that don’t make your website better overall”

Finally, Gary Illyes explicitly said that Google needs very few links to rank webpages and confirmed it.

I shouldn’t have said that… I definitely shouldn’t have said that
— Gary 鯨理／경리 Illyes (so official, trust me) (@methode) April 19, 2024

Why Google Doesn’t Need Links

The reason Google doesn’t need many links is likely the extent of AI and natural language understanding that Google uses in its algorithms. Google must be highly confident in its algorithm to be able to explicitly say that it doesn’t need them.

Way back when Google implemented the nofollow into the algorithm there were many link builders who sold comment spam links who continued to lie that comment spam still worked. As someone who started link building at the very beginning of modern SEO (I was the moderator of the link building forum at the #1 SEO forum of that time), I can say with confidence that links have stopped playing much of a role in rankings beginning several years ago, which is why I stopped about five or six years ago.

Read the research papers

Authoritative Sources in a Hyperlinked Environment – Jon M. Kleinberg (PDF)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Featured Image by Shutterstock/RYO Alexandre