Google’s SMITH Algorithm Outperforms BERT
Google recently published a research paper on a new algorithm called SMITH that it claims outperforms BERT for understanding long queries and long documents. In particular, what makes this new model better is that it is able to understand passages within documents in the same way BERT understands words and sentences, which enables the algorithm to understand longer documents.
On November 3, 2020 I read about a Google algorithm called SMITH that claims to outperform BERT. I briefly discussed it on November 25th in Episode 395 of the SEO 101 podcast.
I’ve been waiting until I had time to write a summary of it, because SMITH seems to be an important algorithm and deserves a thoughtful write-up, which I have humbly attempted here.
So here it is. I hope you enjoy it, and if you do, please share this article.
Is Google Using the SMITH Algorithm?
Google does not generally say what specific algorithms it is using. Although the researchers say that this algorithm outperforms BERT, until Google formally states that the SMITH algorithm is in use to understand passages within web pages, it is purely speculative to say whether or not it is in use.
What is the SMITH Algorithm?
SMITH is a new model for trying to understand entire documents. Models such as BERT are trained to understand words within the context of sentences.
In a very simplified description, the SMITH model is trained to understand passages within the context of the entire document.
While algorithms like BERT are trained on data sets to predict randomly hidden words from the context within sentences, the SMITH algorithm is also trained to predict hidden blocks of sentences from the context of the surrounding document.
This kind of training helps the algorithm understand larger documents better than the BERT algorithm, according to the researchers.
BERT Algorithm Has Limitations
This is how they present the shortcomings of BERT:
“In recent years, self-attention based models like Transformers… and BERT …have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length.
In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input.”
According to the researchers, the BERT algorithm is limited to understanding short documents. For a variety of reasons explained in the research paper, BERT is not well suited for understanding long-form documents.
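To make the “quadratic” limitation in the quote above concrete, here is a rough sketch in Python. It only counts token-to-token comparisons. The function names and the block size of 32 are illustrative assumptions, and the two-level count is a simplified stand-in for a hierarchical encoder, not a description of how SMITH is actually built.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token scores one full self-attention layer computes."""
    return num_tokens * num_tokens


def hierarchical_pairs(num_tokens: int, block_size: int) -> int:
    """Scores computed when attention runs inside fixed-size blocks first,
    then once more across one summary per block (a simplifying assumption)."""
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    within_blocks = num_blocks * block_size * block_size
    across_blocks = num_blocks * num_blocks
    return within_blocks + across_blocks


# block_size=32 is a hypothetical value chosen only for illustration.
for length in (512, 2048):
    print(length, attention_pairs(length), hierarchical_pairs(length, block_size=32))
```

Under these assumptions, quadrupling the input from 512 to 2048 tokens multiplies the full-attention count by 16 (from 262,144 to 4,194,304 pairs), which is why a plain self-attention model becomes expensive on long documents.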
The researchers propose their new algorithm which they say outperforms BERT with longer documents.
They then explain why long documents are difficult:
“…semantic matching between long texts is a more challenging task due to a few reasons:
1) When both texts are long, matching them requires a more thorough understanding of semantic relations including matching pattern between text fragments with long distance;
2) Long documents contain internal structure like sections, passages and sentences. For human readers, document structure usually plays a key role for content understanding. Similarly, a model also needs to take document structure information into account for better document matching performance;
3) The processing of long texts is more likely to trigger practical issues like out of TPU/GPU memories without careful model design.”
Larger Input Text
BERT is limited in how long its input documents can be. SMITH, as you will see further down, performs better the longer the document is.
This is a known shortcoming with BERT.
This is how they explain it:
“Experimental results on several benchmark data for long-form text matching… show that our proposed SMITH model outperforms the previous state-of-the-art models and increases the maximum input text length from 512 to 2048 when comparing with BERT based baselines.”
This fact of SMITH being able to do something that BERT is unable to do is what makes the SMITH model intriguing.
The SMITH model doesn’t replace BERT.
The SMITH model supplements BERT by doing the heavy lifting that BERT is unable to do.
The researchers tested it and said:
“Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention…, multi-depth attention-based hierarchical recurrent neural network…, and BERT.
Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048.”
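As a rough illustration of what the 512 and 2048 limits quoted above mean in practice, the sketch below measures how much of a long document fits inside each input limit. The 3,000-token document and the helper names are hypothetical, and real models count subword tokens rather than whole words, so treat the numbers as indicative only.

```python
def truncate(tokens, max_length):
    """Keep only the first max_length tokens, as a fixed-length model input would."""
    return tokens[:max_length]


def coverage(tokens, max_length):
    """Fraction of the document that fits inside the input limit."""
    if not tokens:
        return 1.0
    return len(truncate(tokens, max_length)) / len(tokens)


document = ["token"] * 3000  # hypothetical 3,000-token article
for limit in (512, 2048):
    print(limit, f"{coverage(document, limit):.0%}")
```

For this made-up document, a 512-token limit covers about 17% of the text and a 2048-token limit about 68%. Everything past the limit is never seen by the model.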
Long to Long Matching
If I am understanding the research paper correctly, it states that the problem of matching long queries to long content has not been adequately explored.
According to the researchers:
“To the best of our knowledge, semantic matching between long document pairs, which has many important applications like news recommendation, related article recommendation and document clustering, is less explored and needs more research effort.”
Later in the document they state that there have been some studies that come close to what they are researching.
But overall there appears to be a gap in researching ways to match long queries to long documents. That is the problem the researchers are solving with the SMITH algorithm.
Details of Google’s SMITH
I won’t go deep into the details of the algorithm but I will pick out some general features that communicate a high level view of what it is.
The document explains that they use a pre-training model that is similar to BERT and many other algorithms.
First a little background information so the document makes more sense.
Algorithm Pre-training
Pre-training is where an algorithm is trained on a data set. For typical pre-training of these kinds of algorithms, the engineers will mask (hide) random words within sentences. The algorithm tries to predict the masked words.
As an example, if a sentence is written as, “Old McDonald had a ____,” the algorithm when fully trained might predict that “farm” is the missing word.
As the algorithm learns, it eventually becomes optimized to make fewer mistakes on the training data.
The pre-training is done for the purpose of training the machine to be accurate and make fewer mistakes.
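The masking step described above can be sketched in a few lines of Python. This is a toy illustration that assumes whitespace-separated words and a made-up `mask_rate`. It shows only how training examples are prepared, not how a model learns to fill in the blanks.

```python
import random

MASK = "[MASK]"


def mask_words(sentence: str, mask_rate: float, rng: random.Random):
    """Hide a random subset of words and return (masked_words, hidden_answers)."""
    words = sentence.split()
    masked, answers = [], {}
    for position, word in enumerate(words):
        if rng.random() < mask_rate:
            masked.append(MASK)
            answers[position] = word  # what the model would be asked to predict
        else:
            masked.append(word)
    return masked, answers


rng = random.Random(0)  # fixed seed so repeated runs hide the same words
masked, answers = mask_words("Old McDonald had a farm", mask_rate=0.3, rng=rng)
print(" ".join(masked))
print(answers)
```

During training, the model sees the masked sentence and is scored on how well its guesses match the hidden answers.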
Here’s what the paper says:
“Inspired by the recent success of language model pre-training methods like BERT, SMITH also adopts the “unsupervised pre-training + fine-tuning” paradigm for the model training.
For the Smith model pre-training, we propose the masked sentence block language modeling task in addition to the original masked word language modeling task used in BERT for long text inputs.”
Blocks of Sentences are Hidden in Pre-training
Here is where the researchers explain a key part of the algorithm: how relations between sentence blocks in a document are used during pre-training to understand what the document is about.
“When the input text becomes long, both relations between words in a sentence block and relations between sentence blocks within a document becomes important for content understanding.
Therefore, we mask both randomly selected words and sentence blocks during model pre-training.”
The researchers next describe in more detail how this algorithm goes above and beyond the BERT algorithm.
What they’re doing is stepping up the training to go beyond word training to take on blocks of sentences.
Here’s how it is described in the research document:
“In addition to the masked word prediction task in BERT, we propose the masked sentence block prediction task to learn the relations between different sentence blocks.”
The SMITH algorithm is trained to predict blocks of sentences. My personal feeling about that is… that’s pretty cool.
This algorithm is learning the relationships between words and then leveling up to learn the context of blocks of sentences and how they relate to each other in a long document.
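The same idea, stepped up from words to blocks of sentences, might look like the sketch below. It is a simplified illustration: grouping a fixed number of sentences per block, the `[MASKED BLOCK]` placeholder, and the function names are all assumptions made for this example and are not taken from the paper.

```python
import random

BLOCK_MASK = "[MASKED BLOCK]"


def split_into_blocks(sentences, sentences_per_block):
    """Group consecutive sentences into fixed-size blocks."""
    return [
        sentences[start:start + sentences_per_block]
        for start in range(0, len(sentences), sentences_per_block)
    ]


def mask_blocks(blocks, num_to_mask, rng):
    """Hide whole blocks; return (visible_document, hidden_blocks_by_index)."""
    hidden_indexes = set(rng.sample(range(len(blocks)), num_to_mask))
    visible = [
        [BLOCK_MASK] if index in hidden_indexes else block
        for index, block in enumerate(blocks)
    ]
    hidden = {index: blocks[index] for index in sorted(hidden_indexes)}
    return visible, hidden


sentences = [f"Sentence {n}." for n in range(1, 9)]  # stand-in for a long document
blocks = split_into_blocks(sentences, sentences_per_block=2)
visible, hidden = mask_blocks(blocks, num_to_mask=1, rng=random.Random(0))
print(visible)
print(hidden)
```

The model would then be asked to work out which block belongs in the masked position, using the surrounding blocks as context.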
Section 4.2.2, titled “Masked Sentence Block Prediction,” provides more details on the process (research paper linked below).
Results of SMITH Testing
The researchers noted that SMITH does better with longer text documents.
“The SMITH model which enjoys longer input text lengths compared with other standard self-attention models is a better choice for long document representation learning and matching.”
In the end, the researchers concluded that the SMITH algorithm does better than BERT for long documents.
Why SMITH Research Paper is Important
One of the reasons I prefer reading research papers over patents is that the research papers share details of whether the proposed model does better than existing and state of the art models.
Many research papers conclude by saying that more work needs to be done. To me that means that the algorithm experiment is promising but likely not ready to be put into a live environment.
A smaller percentage of research papers say that the results outperform the state of the art. These are the research papers that in my opinion are worth paying attention to because they are likelier to make it into Google’s algorithm.
When I say likelier, I don’t mean that the algorithm is or will be in Google’s algorithm.
What I mean is that, relative to other algorithm experiments, the research papers that claim to outperform the state of the art are more likely to make it into Google’s algorithm.
SMITH Outperforms BERT for Long Form Documents
According to the conclusions reached in the research paper, the SMITH model outperforms many models, including BERT, for understanding long content.
“The experimental results on several benchmark datasets show that our proposed SMITH model outperforms previous state-of-the-art Siamese matching models including HAN, SMASH and BERT for long-form document matching.
Moreover, our proposed model increases the maximum input text length from 512 to 2048 when compared with BERT-based baseline methods.”
Is SMITH in Use?
As written earlier, until Google explicitly states they are using SMITH there’s no way to accurately say that the SMITH model is in use at Google.
That said, the research papers least likely to be in use are those that explicitly state that the findings are a first step toward a new kind of algorithm and that more research is necessary.
This is not the case with this research paper. The research paper authors confidently state that SMITH beats the state of the art for understanding long-form content.
That confidence in the results, and the lack of a statement that more research is needed, make this paper more interesting than others. It is well worth knowing about in case it gets folded into Google’s algorithm, now or sometime in the future.
Citation
Read the original research paper:
Description of the SMITH Algorithm
Download the SMITH Algorithm PDF Research Paper:
Google Warns About Misuse of Its Indexing API
Google has updated its Indexing API documentation with a clear warning about spam detection and the possible consequences of misuse.
Warning Against API Misuse
The new message in the guide says:
“All submissions through the Indexing API are checked for spam. Any misuse, like using multiple accounts or going over the usage limits, could lead to access being taken away.”
This warning is aimed at people trying to abuse the system by exceeding the API’s limits or breaking Google’s rules.
What Is the Indexing API?
The Indexing API allows websites to tell Google when job posting or livestream video pages are added or removed. It helps websites with fast-changing content get their pages crawled and indexed quickly.
But it seems some users have been trying to abuse this by using multiple accounts to get more access.
Impact of the Update
Google is now closely watching how people use the Indexing API. If someone breaks the rules, they might lose access to the tool, which could make it harder for them to keep their search results updated for time-sensitive content.
How To Stay Compliant
To use the Indexing API properly, follow these rules:
- Don’t go over the usage limits, and if you need more, ask Google instead of using multiple accounts.
- Use the API only for job postings or livestream videos, and make sure your data is correct.
- Follow all of Google’s API guidelines and spam policies.
- Use sitemaps along with the API, not as a replacement.
Remember, the Indexing API isn’t a shortcut to faster indexing. Follow the rules to keep your access.
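For readers who call the API from their own scripts, here is a minimal sketch of staying inside a daily budget on the client side. It builds the request body for the v3 publish endpoint but does not send anything. Authentication is left out entirely, the `DailyQuotaGuard` class is a hypothetical helper, and the limit of 200 is an assumed placeholder, so check the quota shown for your own project before relying on it.

```python
import json
from datetime import date

# Publish endpoint for the Indexing API (v3).
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
NOTIFICATION_TYPES = {"URL_UPDATED", "URL_DELETED"}


class DailyQuotaGuard:
    """Hypothetical client-side counter that refuses to exceed a daily budget."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.day = date.today()
        self.used = 0

    def allow(self, today=None):
        today = today or date.today()
        if today != self.day:  # new day: reset the counter
            self.day, self.used = today, 0
        if self.used >= self.daily_limit:
            return False
        self.used += 1
        return True


def build_notification(url, notification_type="URL_UPDATED"):
    """Return the JSON body for one notification."""
    if notification_type not in NOTIFICATION_TYPES:
        raise ValueError(f"unsupported type: {notification_type}")
    return json.dumps({"url": url, "type": notification_type})


guard = DailyQuotaGuard(daily_limit=200)  # assumed quota, not an official figure
if guard.allow():
    print("POST", ENDPOINT)
    print(build_notification("https://example.com/jobs/1234"))
```

Keeping the counter in one place per project, rather than spreading requests across accounts, is the point: if the budget runs out, the script stops and you request a higher quota instead.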
This Week in Search News: Simple and Easy-to-Read Update
Here’s what happened in the world of Google and search engines this week:
1. Google’s June 2024 Spam Update
Google finished rolling out its June 2024 spam update over a period of seven days. This update aims to reduce spammy content in search results.
2. Changes to Google Search Interface
Google has removed the continuous scroll feature for search results. Instead, it’s back to the old system of pages.
3. New Features and Tests
- Link Cards: Google is testing link cards at the top of AI-generated overviews.
- Health Overviews: There are more AI-generated health overviews showing up in search results.
- Local Panels: Google is testing AI overviews in local information panels.
4. Search Rankings and Quality
- Improving Rankings: Google said it can improve its search ranking system but will only do so on a large scale.
- Measuring Quality: Google’s Elizabeth Tucker shared how they measure search quality.
5. Advice for Content Creators
- Brand Names in Reviews: Google advises against leaving brand names out of review content.
- Fixing 404 Pages: Google explained when it’s important to fix 404 error pages.
6. New Search Features in Google Chrome
Google Chrome for mobile devices has added several new search features to enhance user experience.
7. New Tests and Features in Google Search
- Credit Card Widget: Google is testing a new widget for credit card information in search results.
- Sliding Search Results: When making a new search query, the results might slide to the right.
8. Bing’s New Feature
Bing is now using AI to write “People Also Ask” questions in search results.
9. Local Search Ranking Factors
Menu items and popular times might be factors that influence local search rankings on Google.
10. Google Ads Updates
- Query Matching and Brand Controls: Google Ads updated its query matching and brand controls, and advertisers are happy with these changes.
- Lead Credits: Google will automate lead credits for Local Service Ads. Google says this is a good change, but some advertisers are worried.
- tROAS Insights Box: Google Ads is testing a new insights box for tROAS (Target Return on Ad Spend) in Performance Max and Standard Shopping campaigns.
- WordPress Tag Code: There is a new conversion code for Google Ads on WordPress sites.
These updates highlight how Google and other search engines are continuously evolving to improve user experience and provide better advertising tools.
Exploring the Evolution of Language Translation: A Comparative Analysis of AI Chatbots and Google Translate
According to an article on PCMag, while Google Translate makes translating sentences into over 100 languages easy, regular users acknowledge that there’s still room for improvement.
In theory, large language models (LLMs) such as ChatGPT are expected to bring about a new era in language translation. These models consume vast amounts of text-based training data and real-time feedback from users worldwide, enabling them to quickly learn to generate coherent, human-like sentences in a wide range of languages.
However, despite the anticipation that ChatGPT would revolutionize translation, past experience shows that such expectations do not always hold up in practice. To put these claims to the test, PCMag conducted a blind test, asking fluent speakers of eight non-English languages to evaluate the translation results from various AI services.
The test compared ChatGPT (both the free and paid versions) to Google Translate, as well as to other competing chatbots such as Microsoft Copilot and Google Gemini. The evaluation involved comparing the translation quality for two test paragraphs across different languages, including Polish, French, Korean, Spanish, Arabic, Tagalog, and Amharic.
In the first test conducted in June 2023, participants consistently favored AI chatbots over Google Translate. ChatGPT, Google Bard (now Gemini), and Microsoft Bing outperformed Google Translate, with ChatGPT receiving the highest praise. ChatGPT demonstrated superior performance in converting colloquialisms, while Google Translate often provided literal translations that lacked cultural nuance.
For instance, ChatGPT accurately translated colloquial expressions like “blow off steam,” whereas Google Translate produced more literal translations that failed to resonate across cultures. Participants appreciated ChatGPT’s ability to maintain consistent levels of formality and its consideration of gender options in translations.
The success of AI chatbots like ChatGPT can be attributed to reinforcement learning with human feedback (RLHF), which allows these models to learn from human preferences and produce culturally appropriate translations, particularly for non-native speakers. However, it’s essential to note that while AI chatbots outperformed Google Translate, they still had limitations and occasional inaccuracies.
In a subsequent test, PCMag evaluated different versions of ChatGPT, including the free and paid versions, as well as language-specific AI agents from OpenAI’s GPTStore. The paid version of ChatGPT, known as ChatGPT Plus, consistently delivered the best translations across various languages. However, Google Translate also showed improvement, performing surprisingly well compared to previous tests.
Overall, while ChatGPT Plus emerged as the preferred choice for translation, Google Translate demonstrated notable improvement, challenging the notion that AI chatbots are always superior to traditional translation tools.
Source: https://www.pcmag.com/articles/google-translate-vs-chatgpt-which-is-the-best-language-translator