

How NLP & NLU Work For Semantic Search




Natural language processing (NLP) and natural language understanding (NLU) are two often-confused technologies that make search more intelligent and help people find what they are looking for.

This intelligence is a core component of semantic search.

NLP and NLU are why you can type “dresses” and find that long-sought-after “NYE Party Dress” and why you can type “Matthew McConnahey” and get Mr. McConnaughey back.

With these two technologies, searchers can find what they want without having to type their query exactly as it’s found on a page or in a product.

NLP is one of those things that has built up such a large meaning that it’s easy to look past the fact that it tells you exactly what it is: NLP processes natural language, specifically into a format that computers can understand.

These kinds of processing can include tasks like normalization, spelling correction, or stemming, each of which we’ll look at in more detail.

NLU, on the other hand, aims to “understand” what a block of natural language is communicating.


It performs tasks that can, for example, identify verbs and nouns in sentences or important items within a text. People or programs can then use this information to complete other tasks.

Computers seem advanced because they can do a lot of actions in a short period of time. However, in a lot of ways, computers are quite daft.

They need the information to be structured in specific ways to build upon it. For natural language data, that’s where NLP comes in.

It takes messy data (and natural language can be very messy) and processes it into something that computers can work with.

Text Normalization

When searchers type text into a search bar, they are trying to find a good match, not play “guess the format.”

For example, to require a user to type a query in exactly the same format as the matching words in a record is unfair and unproductive.

We use text normalization to do away with this requirement so that the text will be in a standard format no matter where it’s coming from.


As we go through different normalization steps, we’ll see that there is no approach that everyone follows. Each normalization step generally increases recall and decreases precision.

A quick aside: “recall” measures how many of the known-good results a search engine finds.

“Precision” measures how many of the results it returns are actually good.

Search results could have 100% recall by returning every document in an index, but precision would be poor.

Conversely, a search engine could have 100% precision by only returning documents that it knows to be a perfect fit, but it will likely miss some good results.

Again, normalization generally increases recall and decreases precision.

Whether that movement toward one end of the recall-precision spectrum is valuable depends on the use case and the search technology. It isn’t a question of applying all normalization techniques but deciding which ones provide the best balance of precision and recall.
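To make the trade-off concrete, here is a minimal sketch that computes both measures for a single query. The document IDs and the set of “known good” results are made up for illustration, and treating an empty result list as 0% precision is an assumption of this sketch rather than a standard.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: IDs the engine returned.
    relevant: IDs known to be good results for the query.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Assumption: an empty result list counts as 0.0 precision.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical index of ten documents, four of which are good results.
index = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
relevant = {"d1", "d2", "d3", "d4"}

# Returning the whole index: perfect recall, poor precision.
everything = precision_recall(index, relevant)   # (0.4, 1.0)

# Returning only one sure match: perfect precision, poor recall.
cautious = precision_recall({"d1"}, relevant)    # (1.0, 0.25)
```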

Letter Normalization

The simplest normalization you could imagine would be the handling of letter case.


In English, at least, words are generally capitalized at the beginning of sentences, occasionally in titles, and when they are proper nouns. (There are other rules, too, depending on whom you ask.)

But in German, all nouns are capitalized. Other languages have their own rules.

These rules are useful. Otherwise, we wouldn’t follow them.

For example, capitalizing the first words of sentences helps us quickly see where sentences begin.

That usefulness, however, is diminished in an information retrieval context.

The meanings of words don’t change simply because they are in a title and have their first letter capitalized.

Even trickier is that there are rules, and then there is how people actually write.

If I text my wife, “SOMEONE HIT OUR CAR!” we all know that I’m talking about a car, and not about something different just because the word is capitalized.


We can see this clearly by reflecting on how many people don’t use capitalization when communicating informally – which is, incidentally, how most case-normalization works.

Of course, we know that sometimes capitalization does change the meaning of a word or phrase. We can see that “cats” are animals, and “Cats” is a musical.

In most cases, though, the precision gained by not normalizing on case is not worth the recall it costs.

The difference between the two is easy to tell via context, too, which we’ll be able to leverage through natural language understanding.

While less common in English, handling diacritics is also a form of letter normalization.

Diacritics are the marks, or “glyphs,” attached to letters, as in á, ë, or ç.

Words can otherwise be spelled the same, but added diacritics can change the meaning. In French, “élève” means “student,” while “élevé” means “elevated.”

Nonetheless, many people will not include the diacritics when searching, and so another form of normalization is to strip all diacritics, leaving behind the simple (and now ambiguous) “eleve.”
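Both kinds of letter normalization can be sketched in a few lines using only Python’s standard library. The function name and the `strip_diacritics` flag are illustrative choices, not part of any particular search engine.

```python
import unicodedata


def normalize_letters(text, strip_diacritics=True):
    """Case-fold text and, optionally, remove diacritics."""
    # casefold() is a more aggressive lower() meant for caseless matching.
    text = text.casefold()
    if strip_diacritics:
        # NFD splits "é" into "e" plus a combining accent, which is then dropped.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


print(normalize_letters("SOMEONE HIT OUR CAR!"))  # someone hit our car!
print(normalize_letters("Élève"))                 # eleve
print(normalize_letters("élevé"))                 # eleve
```

Note that the last two calls return the same string, which is exactly the ambiguity described above: the gain in recall comes at the cost of telling the two French words apart.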



The next normalization challenge is breaking down the text the searcher has typed in the search bar and the text in the document.

This step is necessary because word order does not need to be exactly the same between the query and the document text, except when a searcher wraps the query in quotes.

Breaking queries, phrases, and sentences into words may seem like a simple task: Just break up the text at each space.

Problems show up quickly with this approach. Again, let’s start with English.

Separating on spaces alone means that the phrase “Let’s break up this phrase!” yields us let’s, break, up, this, and phrase! as words.

For search, we almost surely don’t want the exclamation point at the end of the word “phrase.”

Whether we want to keep the contracted word “let’s” together is not as clear.

Some software will break the word down even further (“let” and “’s”) and some won’t.


Some will not break down “let’s” while breaking down “don’t” into two pieces.

This process is called “tokenization.”

We call it tokenization for reasons that should now be clear: What we end up with are not words but discrete groups of characters. This is even more true for languages other than English.

German speakers, for example, can merge words (more accurately “morphemes,” but close enough) together to form a larger word. The German word for “dog house” is “Hundehütte,” which contains the words for both “dog” (“Hund”) and “house” (“Hütte”).
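As a rough illustration of the choices involved for English, the sketch below compares splitting on spaces with a simple regular-expression tokenizer. The pattern and the `split_contractions` option are assumptions made for this example. Real tokenizers use many more rules, and this one does nothing for compound words like “Hundehütte.”

```python
import re

# A token is a run of word characters, optionally joined by internal apostrophes.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*")
# Used to split a contraction into its head and its apostrophe-led tail.
PIECE_RE = re.compile(r"['’]?\w+")


def tokenize(text, split_contractions=False):
    """Lowercase the text and return a list of tokens without punctuation."""
    tokens = TOKEN_RE.findall(text.lower())
    if not split_contractions:
        return tokens
    pieces = []
    for token in tokens:
        pieces.extend(PIECE_RE.findall(token))
    return pieces


phrase = "Let’s break up this phrase!"

print(phrase.split())
# ['Let’s', 'break', 'up', 'this', 'phrase!']
print(tokenize(phrase))
# ['let’s', 'break', 'up', 'this', 'phrase']
print(tokenize(phrase, split_contractions=True))
# ['let', '’s', 'break', 'up', 'this', 'phrase']
```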

Nearly all search engines tokenize text, but there are further steps an engine can take to normalize the tokens. Two related approaches are stemming and lemmatization.

Stemming And Lemmatization

Stemming and lemmatization take different forms of tokens and break them down for comparison.

For example, take the words “calculator” and “calculation,” or “slowing” and “slowly.”


We can see there are some clear similarities.

Stemming breaks a word down to its “stem,” the base that other variants of the word are built on. Stemming is fairly straightforward; you could do it on your own.

What’s the stem of “stemming?”

You can probably guess that it’s “stem.” Often stemming means removing prefixes or suffixes, as in this case.

There are multiple stemming algorithms, and the most popular is the Porter Stemming Algorithm, which has been around since the 1980s. It is a series of steps applied to a token to get to the stem.

Stemming can sometimes lead to results that you wouldn’t foresee.

Looking at the words “carry” and “carries,” you might expect that the stem of each of these is “carry.”

The actual stem, at least according to the Porter Stemming Algorithm, is “carri.”


This is because stemming attempts to compare related words and break down words into their smallest possible parts, even if that part is not a word itself.

On the other hand, if you want an output that will always be a recognizable word, you want lemmatization. Again, there are different lemmatizers, such as the WordNet-based lemmatizer in NLTK.

Lemmatization breaks a token down to its “lemma,” or the word which is considered the base for its derivations. The lemma from WordNet for “carry” and “carries,” then, is what we expected before: “carry.”

Lemmatization will generally not break down words as much as stemming, nor will as many different word forms be considered the same after the operation.

The stems for “say,” “says,” and “saying” are all “say,” while the lemmas from WordNet are “say,” “say,” and “saying.” To get these lemmas, lemmatizers are generally corpus-based.
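The difference between the two approaches can be sketched with a toy example. To be clear, this is not the Porter algorithm and not WordNet: the suffix list, the doubled-consonant rule, and the tiny lemma dictionary are all hypothetical, chosen only to show that rule-based stems need not be real words while dictionary-based lemmas always are.

```python
# Hypothetical suffix list, longest first; a real stemmer has many more rules.
SUFFIXES = ("ation", "ator", "ing", "ly", "es", "s")

# Hypothetical hand-made lemma dictionary standing in for a resource like WordNet.
LEMMAS = {"carries": "carry", "carried": "carry", "says": "say"}


def toy_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            stem = token[: -len(suffix)]
            # Undo a doubled final consonant, as in "stemm" -> "stem".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return token


def lemmatize(token):
    """Look the token up in the dictionary; unknown tokens are returned as-is."""
    return LEMMAS.get(token, token)


print(toy_stem("calculator"), toy_stem("calculation"))  # calcul calcul
print(toy_stem("slowing"), toy_stem("slowly"))          # slow slow
print(toy_stem("stemming"))                             # stem
print(lemmatize("carry"), lemmatize("carries"))         # carry carry
```

The stem “calcul” is not a word, but it lets “calculator” and “calculation” match each other, which is the point of stemming.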

If you want the broadest recall possible, you’ll want to use stemming. If you want the best possible precision, use neither stemming nor lemmatization.

Which you go with ultimately depends on your goals, but most searches can generally perform very well with neither stemming nor lemmatization, retrieving the right results, and not introducing noise.


If you decide not to include lemmatization or stemming in your search engine, there is still one normalization technique that you should consider.


That is the normalization of plurals to their singular form.

Generally, ignoring plurals is done through the use of dictionaries.

Even if “de-pluralization” seems as simple as chopping off an “-s,” that’s not always the case. The first problem is with irregular plurals, such as “deer,” “oxen,” and “mice.”

A second problem is pluralization with an “-es” suffix, such as “potatoes.” Finally, there are simply the words that end in an “s” but aren’t plural, like “always.”

A dictionary-based approach ensures that the recall you add comes only from correct plural-singular pairs.
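Here is a small sketch contrasting the naive “chop off the -s” rule with a dictionary lookup. The dictionary entries are hypothetical and far smaller than anything a real engine would ship.

```python
# Hypothetical plural -> singular pairs; real dictionaries are much larger.
SINGULARS = {
    "dresses": "dress",
    "potatoes": "potato",
    "mice": "mouse",
    "oxen": "ox",
    "shoes": "shoe",
}


def naive_singularize(token):
    """Chop off a trailing "s", right or wrong."""
    return token[:-1] if token.endswith("s") else token


def singularize(token):
    """Only normalize tokens that appear in the dictionary."""
    return SINGULARS.get(token, token)


for word in ("shoes", "potatoes", "mice", "always"):
    print(word, naive_singularize(word), singularize(word))
# shoes shoe shoe
# potatoes potatoe potato
# mice mice mouse
# always alway always
```

Removing a pair that causes problems is then a matter of deleting one dictionary entry.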

Just as with lemmatization and stemming, whether you normalize plurals is dependent on your goals.

Cast a wider net by normalizing plurals, a more precise one by avoiding normalization.

Usually, normalizing plurals is the right choice, and you can remove normalization pairs from your dictionary when you find them causing problems.


One area, however, where you will almost always want to introduce increased recall is when handling typos.

Typo Tolerance And Spell Check

We have all encountered typo tolerance and spell check within search, but it’s useful to think about why it’s present.

Sometimes, there are typos because fingers slip and hit the wrong key.

Other times, the searcher thinks a word is spelled differently than it is.

Increasingly, “typos” can also result from poor speech-to-text understanding.

Finally, words can seem like they have typos but really don’t, such as in comparing “scream” and “cream.”

The simplest way to handle these typos, misspellings, and variations is to avoid trying to correct them at all. Instead, some algorithms compare tokens directly and measure how different they are.

One of these is the Damerau-Levenshtein Distance algorithm.


This measure looks at how many edits are needed to go from one token to another.

You can then filter out all tokens with a distance that is too high.

(Two is generally a good threshold, but you will probably want to adjust this based on the length of the token.)

After filtering, you can use the distance for sorting results or feeding into a ranking algorithm.
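The filter-then-sort idea can be sketched as follows. The distance function implements the “optimal string alignment” variant of Damerau-Levenshtein, a common simplification in which insertions, deletions, substitutions, and swaps of adjacent characters each cost one edit. The length-based thresholds in `max_edits` are assumptions for illustration, not recommended values.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def max_edits(token):
    """Hypothetical threshold: allow more edits for longer tokens."""
    if len(token) <= 3:
        return 0
    if len(token) <= 7:
        return 1
    return 2


def fuzzy_matches(query_token, index_tokens):
    """Return (distance, token) pairs within the threshold, closest first."""
    limit = max_edits(query_token)
    scored = ((osa_distance(query_token, t), t) for t in index_tokens)
    return sorted(pair for pair in scored if pair[0] <= limit)


print(osa_distance("scream", "cream"))                          # 1
print(osa_distance("teh", "the"))                               # 1
print(fuzzy_matches("scraem", ["scream", "cream", "screen"]))   # [(1, 'scream')]
```

Because the same function compares a query token with tokens from the index, it tolerates typos on either side.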

Many times, context can matter when determining if a word is misspelled or not. The word “scream” is probably correct after “I,” but not after “ice.”

Machine learning can be a solution for this by bringing context to this NLP task.

This spell check software can use the context around a word to identify whether it is likely to be misspelled and its most likely correction.
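A very small sketch of the idea: choose between candidate words based on how often each one follows the previous word. The counts below are invented for illustration. A real system would learn such statistics (or a far richer model) from a large corpus.

```python
# Hypothetical bigram counts; real values would be learned from a corpus.
BIGRAM_COUNTS = {
    ("i", "scream"): 40,
    ("i", "cream"): 1,
    ("ice", "cream"): 500,
    ("ice", "scream"): 2,
}


def best_candidate(previous_word, candidates):
    """Pick the candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: BIGRAM_COUNTS.get((previous_word, word), 0))


print(best_candidate("i", ["scream", "cream"]))    # scream
print(best_candidate("ice", ["scream", "cream"]))  # cream
```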

Typos In Documents


One thing that we skipped over before is that words may not only have typos when a user types them into a search bar.

Words may also have typos inside a document.

This is especially true when the documents are made of user-generated content.

This detail is relevant because if a search engine is only looking at the query for typos, it is missing half of the information.

The best typo tolerance should work across both query and document, which is why edit distance generally works best for retrieving and ranking results.

Spell check can be used to craft a better query or provide feedback to the searcher, but it is often unnecessary and should never stand alone.

Natural Language Understanding

While NLP is all about processing text and natural language, NLU is about understanding that text.

Named Entity Recognition

A task that can aid in search is that of named entity recognition, or NER. NER identifies key items, or “entities,” inside of text.


While some people will call NER natural language processing and others will call it natural language understanding, what’s clear is that it can find what’s important within a text.

For the query “NYE party dress” you would perhaps get back an entity of “dress” that is mapped to a type of “category.”

NER will always map an entity to a type, from as generic as “place” or “person,” to as specific as your own facets.

NER can also use context to identify entities.

A query of “white house” may refer to a place, while “white house paint” might refer to a color of “white” and a product category of “paint.”

Query Categorization

Named entity recognition is valuable in search because it can be used in conjunction with facet values to provide better search results.

Recalling the “white house paint” example, you can use the “white” color and the “paint” product category to filter down your results to only show those that match those two values.


This would give you high precision.

If you don’t want to go that far, you can simply boost all products that match one of the two values.

Query categorization can also help with recall.

For searches with few results, you can use the entities to include related products.

Imagine that there are no products that match the keywords “white house paint.”

In this case, leveraging the product category of “paint” can return other paints that might be a decent alternative, such as that nice eggshell color.
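The filtering, boosting, and fallback described above can be sketched with plain dictionary lookups. Everything here is hypothetical: the facet vocabularies, the phrase table, and the three products are invented, keyword matching is left out entirely, and a production system would typically use a trained NER model rather than word lists.

```python
# Hypothetical facet vocabularies; a real store would load these from its catalog.
FACETS = {
    "color": {"white", "red", "eggshell"},
    "category": {"paint", "dress"},
}
# Hypothetical whole-query phrases that name a single entity.
PHRASES = {"white house": "place"}

PRODUCTS = [
    {"name": "Bright White Exterior Paint", "color": "white", "category": "paint"},
    {"name": "Eggshell Interior Paint", "color": "eggshell", "category": "paint"},
    {"name": "White Party Dress", "color": "white", "category": "dress"},
]


def tag_query(query):
    """Map a query to {entity_type: value} using simple lookups."""
    normalized = " ".join(query.lower().split())
    if normalized in PHRASES:
        # Context: "white house" on its own is a place, not a color.
        return {PHRASES[normalized]: normalized}
    entities = {}
    for token in normalized.split():
        for facet, values in FACETS.items():
            if token in values:
                entities[facet] = token
    return entities


def filter_products(products, entities):
    """High precision: keep only products matching every recognized facet."""
    return [p for p in products
            if all(p.get(facet) == value for facet, value in entities.items())]


def boost_products(products, entities):
    """Softer option: rank products by how many facets they match."""
    def score(product):
        return sum(product.get(facet) == value for facet, value in entities.items())
    return sorted(products, key=score, reverse=True)


def search(query):
    """Filter on all facets, then fall back to the category alone if empty."""
    entities = tag_query(query)
    results = filter_products(PRODUCTS, entities)
    if not results and "category" in entities:
        results = filter_products(PRODUCTS, {"category": entities["category"]})
    return [p["name"] for p in results]


print(tag_query("white house paint"))  # {'color': 'white', 'category': 'paint'}
print(search("white house paint"))     # ['Bright White Exterior Paint']
print(search("red paint"))             # no red paint, so both paints are returned
```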

Document Tagging

Another way that named entity recognition can help with search quality is by moving the task from query time to ingestion time (when the document is added to the search index).


When ingesting documents, NER can use the text to tag those documents automatically.

These documents will then be easier to find for the searchers.

Either the searchers use explicit filtering, or the search engine applies automatic query-categorization filtering, to enable searchers to go directly to the right products using facet values.

Intent Detection

Related to entity recognition is intent detection, or determining the action a user wants to take.

Intent detection is not the same as what we talk about when we say “identifying searcher intent.”

Identifying searcher intent is getting people to the right content at the right time.

Intent detection maps a request to a specific, pre-defined intent.

It then takes action based on that intent. A user searching for “how to make returns” might trigger the “help” intent, while “red shoes” might trigger the “product” intent.


In the first case, you could route the search to your help desk search.


In the second one, you could route it to the product search. This isn’t so different from what you see when you search for the weather on Google.

Look, and notice that you get a weather box at the very top of the page. (Newly launched web search engine Andi takes this concept to the extreme, bundling search in a chatbot.)
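As a minimal sketch of mapping a query to a pre-defined intent and then acting on it, the example below uses a hand-written list of trigger phrases. The triggers, intent names, and route names are all hypothetical. Real systems usually rely on a trained classifier, since substring rules like these misfire easily.

```python
# Hypothetical trigger phrases for the "help" intent.
HELP_TRIGGERS = ("how to", "how do i", "return", "refund", "track my order")

# Hypothetical mapping from intent to the search backend that should handle it.
ROUTES = {"help": "help_desk_search", "product": "product_search"}


def detect_intent(query):
    """Return "help" if any trigger phrase appears, otherwise "product"."""
    normalized = query.lower()
    if any(trigger in normalized for trigger in HELP_TRIGGERS):
        return "help"
    return "product"


def route(query):
    """Pick the backend for a query based on its detected intent."""
    return ROUTES[detect_intent(query)]


print(route("how to make returns"))  # help_desk_search
print(route("red shoes"))            # product_search
```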

For most search engines, intent detection, as outlined here, isn’t necessary.

Most search engines only have a single content type on which to search at a time.

When there are multiple content types, federated search can perform admirably by showing multiple search results in a single UI at the same time.

Other NLP And NLU tasks

There are plenty of other NLP and NLU tasks, but these are usually less relevant to search.

Tasks like sentiment analysis can be useful in some contexts, but search isn’t one of them.


You could imagine using translation to search multi-language corpuses, but it rarely happens in practice, and is just as rarely needed.

Question answering is an NLU task that is increasingly implemented into search, especially search engines that expect natural language searches.

Once again, you can see this on major web search engines.

Google, Bing, and Kagi will all immediately answer the question “how old is the Queen of England?” without needing to click through to any results.

Some search engine technologies have explored implementing question answering for more limited search indices, but outside of help desks or long, action-oriented content, the usage is limited.

Few searchers go to an online clothing store and ask questions of a search bar.

Summarization is an NLU task that is more useful for search.

Much like with the use of NER for document tagging, automatic summarization can enrich documents. Summaries can be used to match documents to queries, or to provide a better display of the search results.


This better display can help searchers be confident that they have gotten good results and get them to the right answers more quickly.

Even including newer search technologies using images and audio, the vast, vast majority of searches happen with text. To get the right results, it’s important to make sure the search is processing and understanding both the query and the documents.

Semantic search brings intelligence to search engines, and natural language processing and understanding are important components.

NLP and NLU tasks like tokenization, normalization, tagging, typo tolerance, and others can help make sure that searchers don’t need to be search experts.

Instead, they can go from need to solution “naturally” and quickly.


Featured Image: ryzhi/Shutterstock



Fact Checking: Get Your Facts Right




In the last decade or so, the concept of “fake news” has become a major thorn in the side of consumers and content writers alike.

Digital marketing experts who write SEO content at the enterprise level might not consider themselves journalists or news reporters – but there’s a greater overlap between the roles than many people realize.

Like journos, enterprise SEO content writers need to earn the trust of their audience by demonstrating authority, relevance, and experience.

And while you might think that, as a content marketing specialist, the only person you’re serving is your client or employer, the truth is that good SEO content provides just as much service to consumers.

You’re not just advertising to people; you’re helping them find answers, information, and solutions to their problems.

That’s why, for SEO content writers, getting the facts right is crucial.

“Fake news” has eroded a lot of people’s trust in media. Online content, in particular, is always fighting an uphill battle due to the oversaturation of the digital space – and the sheer amount of misinformation that finds its way into blogs and social media sites with little quality control.


Today, fact-checking is arguably more important than ever before.

One little mistake is all it takes to lose a consumer’s trust forever.

But what does it mean to get your facts right? Is it just ensuring every name is spelled correctly, and every claim has an attributed source?

Both of these things are an important part of SEO fact-checking, but they’re only a small piece of a large puzzle.

Enterprise SEO Fact Checking Best Practices

Fun fact: Even when consumers don’t know you’re lying, Google does.

Web pages with deceptive, inaccurate, or poorly vetted content are penalized and less likely to appear in search results.

Want to avoid the wrath of the almighty algorithm? Here’s what you need to do:

Get The Basics Right

A few paragraphs back, I mentioned that fact-checking isn’t limited to correctly writing people’s names, ages, positions, and pronouns.


Nevertheless, getting the basics right is still important. If you can’t do at least that much, then you won’t be prepared to do more in-depth fact-checking.

It’s especially important to get this information right when you’re quoting multiple people.

Not only do you need to attribute quotes and ideas to the proper sources, but you also have to make sure the information they shared with you is accurately reproduced.

Double Check Everything

If you get a quote from someone that says the sky is blue, go outside and look up, just to be sure.

Okay, that might be an exaggerated example – but you get the point.

Double and triple-check everything.

If you find a useful quote or statistic online, track down the original source. See if you can find other reliable web pages with the same information.

Don’t be afraid to do a little research yourself. Crunch the numbers and try to find corroborating evidence.


Never take anything at face value.

Go To The Source

Speaking of tracking down the sources of stats and quotes: That’s a cornerstone of fact-checking so important, it merits expanding on now.

Have you ever had a teacher or professor tell you, in no uncertain terms, never to use Wikipedia as a source?

Well, that’s just as true when writing enterprise-level SEO content. Wikipedia might be useful in pointing you toward helpful sources, but it shouldn’t be your primary text.

Nor should any second-hand source. If another web page states something as a fact, confirm where it got that fact.

If it’s a disreputable source and you parrot it, then you become a disreputable source, too.

Understand The Information

Content writing – especially at the enterprise level and especially in an agency (rather than in-house PR team) context – often requires authors to cover many different areas of expertise in many different industries.

It can be tempting to regurgitate and plagiarize information that already exists, but if you do that, you won’t be able to offer any meaningful insights.


You have to understand the information you’re relaying.

That will help you spot contradictions and factual errors and demonstrate genuine authority.

Is AI Automation The Future Of Fact Checking?

Enterprise-level content fact-checking requires a lot of time and effort, but cutting corners is a recipe for disaster.

Fortunately, just as it has with many other aspects of SEO, AI automation may soon be able to simplify the process.

The U.K.-based independent fact-checking organization Full Fact has been leading the charge in recent years to develop scalable, automated fact-checking tools.

Full Fact’s efforts have already garnered the attention of the biggest names in search engine technology.

The non-profit organization was one of the winners of the 2019 Google AI Impact Challenge, which provides funding for potentially revolutionary automation research projects.

Full Fact’s stated goal is to develop AI software capable of breaking down long content pieces into individual sentences, then identifying the types of claims those sentences represent, before finally cross-referencing those claims in real-time with the most up-to-date factual news data.


Though Full Fact is still years away from achieving its goal, the benefits of such a breakthrough for SEO content writing are self-evident.

That said, you don’t have to wait for the future to use AI automation and other software tools to help you fact-check.

For example, the Grammarly Plagiarism Checker not only identifies duplicate content taken from another source but also highlights portions of text requiring attribution.

Commonly used enterprise SEO tools like Semrush, Ahrefs, and Moz, meanwhile, can be used to investigate a domain’s authority, helping you decide which sources are considered reputable.

Fact-checking in today’s oversaturated news and information marketplace can be intimidating at first glance. But the number of resources available to content writers is growing by leaps and bounds every day.

Making full use of these resources better enables you to win consumer trust in an age when that kind of trust is a very delicate, precious, and valuable commodity.


Featured Image: redgreystock/Shutterstock

