Google recently published a research paper on a new algorithm called SMITH that it claims outperforms BERT for understanding long queries and long documents. In particular, what makes this new model better is that it is able to understand passages within documents in the same way BERT understands words and sentences, which enables the algorithm to understand longer documents.
I’ve been waiting until I had some time to write a summary of it because SMITH seems to be an important algorithm and deserved a thoughtful write up, which I humbly attempted.
So here it is. I hope you enjoy it, and if you do, please share this article.
Is Google Using the SMITH Algorithm?
Google does not generally say what specific algorithms it is using. Although the researchers say that this algorithm outperforms BERT, until Google formally states that the SMITH algorithm is in use to understand passages within web pages, it is purely speculative to say whether or not it is in use.
What is the SMITH Algorithm?
SMITH is a new model for trying to understand entire documents. Models such as BERT are trained to understand words within the context of sentences.
In a very simplified description, the SMITH model is trained to understand passages within the context of the entire document.
While algorithms like BERT are trained on data sets to predict randomly hidden words from the context within sentences, the SMITH algorithm is additionally trained to predict hidden blocks of sentences.
This kind of training helps the algorithm understand larger documents better than the BERT algorithm, according to the researchers.
BERT Algorithm Has Limitations
This is how they present the shortcomings of BERT:
“In recent years, self-attention based models like Transformers… and BERT …have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length.
In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input.”
According to the researchers, the BERT algorithm is limited to understanding short documents. For a variety of reasons explained in the research paper, BERT is not well suited for understanding long-form documents.
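To make the "quadratic computational complexity" point concrete, here is a rough back-of-the-envelope sketch. It counts token-to-token comparisons for full self-attention at 512 and 2048 tokens, then compares that with a simplified two-level scheme that attends inside small blocks and once more across blocks. The function names and the block size of 32 are assumptions chosen for illustration, and the two-level count is not the paper's own cost model.

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons in full self-attention (grows as n squared)."""
    return num_tokens * num_tokens


def hierarchical_pairs(num_tokens: int, block_size: int) -> int:
    """Comparisons when attention runs inside fixed-size blocks, then once
    more across one summary vector per block.

    Illustrative two-level scheme only; block_size is a hypothetical choice.
    """
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    within_blocks = num_blocks * block_size * block_size
    across_blocks = num_blocks * num_blocks
    return within_blocks + across_blocks


# Going from 512 to 2048 tokens is 4x the text but 16x the comparisons.
print(attention_pairs(2048) // attention_pairs(512))  # 16

# Splitting 2048 tokens into blocks of 32 keeps the count far smaller.
print(attention_pairs(2048))         # 4194304
print(hierarchical_pairs(2048, 32))  # 69632
```

The exact numbers matter less than the shape: quadrupling the input multiplies the work of plain self-attention by sixteen, which is why a hierarchical design is attractive for long documents.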
The researchers propose their new algorithm which they say outperforms BERT with longer documents.
They then explain why long documents are difficult:
“…semantic matching between long texts is a more challenging task due to a few reasons:
1) When both texts are long, matching them requires a more thorough understanding of semantic relations including matching pattern between text fragments with long distance;
2) Long documents contain internal structure like sections, passages and sentences. For human readers, document structure usually plays a key role for content understanding. Similarly, a model also needs to take document structure information into account for better document matching performance;
3) The processing of long texts is more likely to trigger practical issues like out of TPU/GPU memories without careful model design.”
Larger Input Text
BERT is limited in how long a document it can handle. SMITH, as you will see further down, performs better the longer the document is.
This is a known shortcoming with BERT.
This is how they explain it:
“Experimental results on several benchmark data for long-form text matching… show that our proposed SMITH model outperforms the previous state-of-the-art models and increases the maximum input text length from 512 to 2048 when comparing with BERT based baselines.”
SMITH’s ability to do something that BERT cannot is what makes the model intriguing.
The SMITH model doesn’t replace BERT.
The SMITH model supplements BERT by doing the heavy lifting that BERT is unable to do.
The researchers tested it and said:
“Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention…, multi-depth attention-based hierarchical recurrent neural network…, and BERT.
Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048.”
Long to Long Matching
If I am understanding the research paper correctly, it states that the problem of matching long queries to long content has not been adequately explored.
According to the researchers:
“To the best of our knowledge, semantic matching between long document pairs, which has many important applications like news recommendation, related article recommendation and document clustering, is less explored and needs more research effort.”
Later in the document they state that there have been some studies that come close to what they are researching.
But overall there appears to be a gap in researching ways to match long queries to long documents. That is the problem the researchers are solving with the SMITH algorithm.
Details of Google’s SMITH
I won’t go deep into the details of the algorithm but I will pick out some general features that communicate a high level view of what it is.
The document explains that they use a pre-training model that is similar to BERT and many other algorithms.
First a little background information so the document makes more sense.
Pre-training is where an algorithm is trained on a data set. For typical pre-training of these kinds of algorithms, the engineers will mask (hide) random words within sentences. The algorithm tries to predict the masked words.
As an example, if a sentence is written as, “Old McDonald had a ____,” the algorithm when fully trained might predict, “farm” is the missing word.
As the algorithm learns, it eventually becomes optimized to make fewer mistakes on the training data.
The pre-training is done for the purpose of training the machine to be accurate and make fewer mistakes.
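The masking step described above can be sketched in a few lines of Python. This only prepares a training example (the masked sentence plus the hidden answers); it does not train anything. The function name, the `[MASK]` placeholder string, and the 15% mask rate are assumptions for illustration, not details taken from the SMITH paper.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def mask_words(sentence: str, mask_rate: float = 0.15, seed: int = 0):
    """Hide a random subset of words and return (masked_words, answers)."""
    rng = random.Random(seed)
    words = sentence.split()
    # Always hide at least one word so there is something to predict.
    num_to_mask = max(1, round(len(words) * mask_rate))
    positions = sorted(rng.sample(range(len(words)), num_to_mask))
    answers = {pos: words[pos] for pos in positions}
    masked = [MASK if i in answers else word for i, word in enumerate(words)]
    return masked, answers


masked, answers = mask_words("Old McDonald had a farm")
print(" ".join(masked))  # the sentence with one word replaced by [MASK]
print(answers)           # position -> hidden word the model should predict
```

During pre-training, the model sees only the masked sentence and is scored on how well it recovers the hidden words.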
Here’s what the paper says:
“Inspired by the recent success of language model pre-training methods like BERT, SMITH also adopts the “unsupervised pre-training + fine-tuning” paradigm for the model training.
For the Smith model pre-training, we propose the masked sentence block language modeling task in addition to the original masked word language modeling task used in BERT for long text inputs.”
Blocks of Sentences are Hidden in Pre-training
Here is where the researchers explain a key part of the algorithm, how relations between sentence blocks in a document are used for understanding what a document is about during the pre-training process.
“When the input text becomes long, both relations between words in a sentence block and relations between sentence blocks within a document becomes important for content understanding.
Therefore, we mask both randomly selected words and sentence blocks during model pre-training.”
The researchers next describe in more detail how this algorithm goes above and beyond the BERT algorithm.
What they’re doing is stepping up the training to go beyond word training to take on blocks of sentences.
Here’s how it is described in the research document:
“In addition to the masked word prediction task in BERT, we propose the masked sentence block prediction task to learn the relations between different sentence blocks.”
The SMITH algorithm is trained to predict blocks of sentences. My personal feeling about that is… that’s pretty cool.
This algorithm is learning the relationships between words and then leveling up to learn the context of blocks of sentences and how they relate to each other in a long document.
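As a rough illustration of the idea, the sketch below groups sentences into blocks and hides one whole block, which is the kind of example a masked sentence block task would be built from. The helper names, the two-sentences-per-block grouping, and the placeholder string are all hypothetical; the paper's actual procedure is described in its Section 4.2.2.

```python
import random

BLOCK_MASK = "[MASKED BLOCK]"  # hypothetical placeholder for a hidden block


def split_into_blocks(sentences, sentences_per_block=2):
    """Group consecutive sentences into fixed-size blocks."""
    return [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]


def mask_one_block(blocks, seed=0):
    """Hide one whole block; return (masked_blocks, hidden_index, hidden_text)."""
    rng = random.Random(seed)
    hidden_index = rng.randrange(len(blocks))
    masked = list(blocks)
    hidden_text = masked[hidden_index]
    masked[hidden_index] = BLOCK_MASK
    return masked, hidden_index, hidden_text


sentences = [
    "SMITH is a model for long documents.",
    "It builds on ideas from BERT.",
    "BERT hides random words during training.",
    "SMITH also hides blocks of sentences.",
    "The model must work out what the hidden block said.",
    "That teaches it how passages relate to each other.",
]
blocks = split_into_blocks(sentences)
masked, idx, hidden = mask_one_block(blocks)
print(masked)       # the document with one block replaced by the placeholder
print(idx, hidden)  # which block was hidden, and its original text
```

The model's job is then to use the surrounding blocks to work out which content belongs in the hidden slot.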
Section 4.2.2, titled, “Masked Sentence Block Prediction” provides more details on the process (research paper linked below).
Results of SMITH Testing
The researchers noted that SMITH does better with longer text documents.
“The SMITH model which enjoys longer input text lengths compared with other standard self-attention models is a better choice for long document representation learning and matching.”
In the end, the researchers concluded that the SMITH algorithm does better than BERT for long documents.
Why SMITH Research Paper is Important
One of the reasons I prefer reading research papers over patents is that the research papers share details of whether the proposed model does better than existing and state of the art models.
Many research papers conclude by saying that more work needs to be done. To me that means that the algorithm experiment is promising but likely not ready to be put into a live environment.
A smaller percentage of research papers say that the results outperform the state of the art. These are the research papers that in my opinion are worth paying attention to because they are likelier to make it into Google’s algorithm.
When I say likelier, I don’t mean that the algorithm is or will be in Google’s algorithm.
What I mean is that, relative to other algorithm experiments, the research papers that claim to outperform the state of the art are more likely to make it into Google’s algorithm.
SMITH Outperforms BERT for Long Form Documents
According to the conclusions reached in the research paper, the SMITH model outperforms many models, including BERT, for understanding long content.
“The experimental results on several benchmark datasets show that our proposed SMITH model outperforms previous state-of-the-art Siamese matching models including HAN, SMASH and BERT for long-form document matching.
Moreover, our proposed model increases the maximum input text length from 512 to 2048 when compared with BERT-based baseline methods.”
Is SMITH in Use?
As written earlier, until Google explicitly states they are using SMITH there’s no way to accurately say that the SMITH model is in use at Google.
That said, the research papers least likely to be in use are those that explicitly state that the findings are a first step toward a new kind of algorithm and that more research is necessary.
This is not the case with this research paper. The research paper authors confidently state that SMITH beats the state of the art for understanding long-form content.
That confidence in the results, and the absence of a statement that more research is needed, make this paper more interesting than others and well worth knowing about in case it gets folded into Google’s algorithm, now or in the future.
Read the original research paper:
Download the SMITH Algorithm PDF Research Paper:
Roger Montti is a search marketer with 20 years of experience.
I offer site audits and link building strategies.
How to Write For Google
Are you writing your SEO content based on the latest best practice tips?
I originally wrote this SEO copywriting checklist in 2012—my, how things have changed. Today, Google stresses quality content even more than before, conversational copy is critical, and there are revised SEO writing “rules.”
I’ve updated the list to reflect these changes and to provide additional information.
As a side note, I would argue that there’s no such thing as “writing for Google.” Yes, there are certain things you should do to make the Google gods happy. However, your most important goal should be writing clear, compelling, standout copy that tells a story.
I’m keeping the old headline in the hopes that I can convert some of the “write for Google” people to do things the right way.
Items to review before you start your SEO writing project
– Do you have enough information about your target reader?
Your copy will pack a powerful one-two punch if your content is laser-focused on your target reader. Ask your client or supervisor for a customer/reader persona document outlining your target readers’ specific characteristics. If the client doesn’t have a customer persona document, be prepared to spend an hour or more asking detailed questions.
Here’s more information on customer personas.
– Writing a sales page? Did you interview the client?
It’s essential to interview new clients and to learn more about their company, USP, and competition. Don’t forget to ask about industry buzzwords that should appear in the content.
Not sure what questions to ask to get the copywriting ball rolling? Here’s a list of 56 questions you can start with today.
– Writing a blog post? Get topic ideas from smart sources
When you’re blogging, it’s tempting to write about whatever strikes your fancy. The challenge is, what interests you may not interest your readers. If you want to make sure you’re writing must-read content, sites like Quora, LinkedIn, Google Trends, and BuzzSumo can help spark some ideas.
– Did you use Google for competitive intelligence ideas?
Check out the sites positioning in the top-10 and look for common characteristics. How long are competing articles? Do the articles link out to authoritative sources? Are there videos or infographics? Do the articles include quotes from industry experts? Your job is to write an essay that’s better than what’s already appearing in the top-10 — so let the competition be your guide.
– Did you conduct keyphrase research?
Yes, keyphrase research (and content optimization) is still a crucial SEO step. If you don’t give Google some keyphrase “cues,” your page probably won’t position the way you want.
Use a keyphrase research tool and find possible keyphrases for your page or post. As a hint: if you are tightly focusing on a topic, long-tail keyphrases are your best bet. Here’s more information about why long-tail keyphrases are so important.
If you are researching B2B keyphrases, know that the “traditional” keyphrase research steps may not apply. Here’s more information about what to do if B2B keyphrase research doesn’t work.
– What is your per-page keyphrase focus?
Writers are no longer forced to include the exact-match keyphrase over and over again. (Hurray!) Today, we can focus on a keyphrase theme that matches the search intent and weave in multiple related keyphrases.
– Did you expand your keyphrase research to include synonyms and close variants?
Don’t be afraid to include keyphrase synonyms and close variants on your page. Doing so opens up your positioning opportunities, makes your copy better, and is much easier to write!
Are you wondering if you should include your keyphrases as you write the copy — or edit them in later? It’s up to you! Here are the pros and cons of both processes.
– Do your keyphrases match the search intent?
Remember that Google is “the decider” when it comes to search intent. If you’re writing a sales page — and your desired keyphrase pulls up informational blog posts in Google – your sales page probably won’t position.
– Writing a blog post? Does your Title/headline work for SEO, social, and your readers?
Yes, you want your headline to be compelling, but you also want it to be keyphrase rich. Always include your main page keyphrase (or a close variant) in your Title and work in other keyphrases if they “fit.”
– Did you include keyphrase-rich subheadlines?
Subheadlines are an excellent way to visually break up your text, making it easy for readers to quick-scan your benefits and information. Additionally, just like with the H1 headline, adding a keyphrase to your subheadlines can (slightly) help reinforce keyphrase relevancy.
As a hint, sometimes, you can write a question-oriented subheadline and slip the keyphrase in more easily. Here’s more information about why answering questions is a powerful SEO content play.
– Is your Title “clickable” and compelling?
Remember, the search engine results page is your first opportunity for conversion. Focusing too much on what you think Google “wants” may take away your Title’s conversion power.
Consider how you can create an enticing Title that “gets the click” over the other search result listings. You have about 59 characters (with spaces) to work with, so writing tight is essential.
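If you draft Titles in bulk, a tiny script can flag the ones that run long. This is a minimal sketch that assumes the roughly 59-character budget mentioned above; the function name and the returned fields are made up for illustration, and what actually displays in a search result can vary.

```python
TITLE_LIMIT = 59  # character budget quoted in the checklist, spaces included


def check_title(title: str, limit: int = TITLE_LIMIT) -> dict:
    """Report a Title's length and how far over the limit it runs, if at all."""
    length = len(title)
    return {
        "length": length,
        "fits": length <= limit,
        "over_by": max(0, length - limit),
    }


print(check_title("How to Write For Google"))
# {'length': 23, 'fits': True, 'over_by': 0}
```

A Title that fits the budget still has to earn the click, so treat the check as a guardrail rather than a goal.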
– Does the meta description fit the intent of the page?
Yes, writers should create a meta description for every page. Why? Because they tell the reader what the landing page is about and help increase SERP conversions. Try experimenting with different calls-to-actions at the end, such as “learn more” or “apply now.” You never know what will entice your readers to click!
– Is your content written in a conversational style?
With voice search gaining prominence, copy that’s written in a conversational style is even more critical.
Read your copy out loud and hear how it sounds. Does it flow? Or does it sound too formal? If you’re writing for a regulated industry, such as finance, legal, or healthcare, you may not be able to push the conversational envelope too much. Otherwise, write like you talk.
Here’s more information about why conversational content is so important.
– Is your copy laser-focused on your audience?
A big mistake some writers make is creating copy that appeals to “everyone” rather than their specific target reader. Writing sales and blog pages that are laser-focused on your audience will boost your conversions and keep readers checking out your copy longer. Here’s how one company does it.
Plus, you don’t receive special “Google points” for writing long content. Even short copy can position if it fully answers the searcher’s query. Your readers don’t want to wade through 1,500 words to find something that can be explained in 300 words.
Items to review after you’ve written the page
– Did you use too many keyphrases?
Remember, there is no such thing as keyword density. If your content sounds keyphrase-heavy and stilted, reduce the keyphrase usage and focus more on your readers’ experience. Your page doesn’t receive bonus points for exact-matching your keyphrase multiple times. If your page sounds keyphrase stuffed when you read it out loud, dial back your keyphrase usage.
– Did you edit your content?
Resist the urge to upload your content as soon as you write it. Put it away and come back to it after a few hours (or even the next day.) Discover why editing your Web writing is so very important. Also, don’t think that adding typos will help your page position. They won’t.
– Is the content interesting to read?
Yes, it’s OK if your copy has a little personality. Here’s more information about working with your page’s tone and feel and how to avoid the “yawn response.” Plus, know that even FAQ pages can help with conversions — and yes, even position.
– Are your sentences and paragraphs easy to read?
Vary your sentence structure so you have a combination of longer and shorter sentences. If you find your sentences creeping over 30 or so words, edit them down and make them punchier. Your writing will have more impact if you do.
Plus, long paragraphs without much white space are hard to read off a computer monitor – and even harder to read on a smartphone. Split up your long paragraphs into shorter ones. Please.
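One quick way to spot sentences creeping past the 30-word mark is to count words per sentence. The sketch below assumes a naive rule for where sentences end (a period, question mark, or exclamation point followed by whitespace), so abbreviations such as "e.g." will trip it up. The function name and default threshold are illustrative.

```python
import re


def long_sentences(text: str, max_words: int = 30):
    """Return (word_count, sentence) pairs for sentences longer than max_words."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


draft = "Short one. " + " ".join(["word"] * 31) + ". Done!"
for count, sentence in long_sentences(draft):
    print(count, sentence[:40] + "...")
```

Reading the flagged sentences out loud is still the best test of whether they need to be cut down.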
– Are you forcing your reader onto a “dead end” page?
“Dead-end” pages (pages that don’t link out to related pages) can stop your readers dead in their tracks and hurt your conversion goals.
Want to avoid this? Read more about “dead-end” Web pages.
– Does the content provide the reader with valuable information?
Google warns against sites with “thin,” low-quality content that’s poorly written. In fact, according to Google, spelling errors are a bigger boo-boo than broken HTML. Make sure your final draft is typo-free, written well, and thoroughly answers the searcher’s query.
Want to know what Google considers quality content — directly from Google? Here are Google’s Quality Raters guidelines for more information.
– Did you use bullet points where appropriate?
If you find yourself writing a list-like sentence, use bullet points instead. Your readers will thank you, and the items will be much easier to read.
Plus, you can write your bullet points in a way that makes your benefit statements pop, front and center. Here’s how Nike does it.
– Is the primary CTA (call-to-action) clear, and is it easy to take action?
What action do you want your readers to take? Do you want them to contact you? Buy something? Sign up for your newsletter? Make sure you’re telling your reader what you want them to do, and make taking action easy. If you force people to answer multiple questions just to fill out a “contact us” form, you run the risk of people bailing out.
Here’s a list of seven CTA techniques that work.
– Do you have a secondary CTA (such as a newsletter signup or a white paper download)?
Do you want readers to sign up for your newsletter or learn about related products? Don’t bury your “sign up for our newsletter” button in the footer text. Instead, test different CTA locations (for instance, try including a newsletter signup link at the bottom of every blog post) and see where you get the most conversions.
– Does the page include too many choices?
It’s important to keep your reader focused on your primary and secondary CTAs. If your page lists too many choices (for example, a large, scrolling page of products), consider eliminating all “unnecessary” options that don’t support your primary call-to-action. Too many choices may force your readers into not taking any action at all.
– Did you include benefit statements?
People make purchase decisions based on what’s in it for them (yes, even your B2B buyers.) Highly specific benefit statements will help your page convert like crazy. Don’t forget to include a benefit statement in your Title (whenever possible) like “free shipping” or “sale.” Seeing this on the search results page will catch your readers’ eyes, tempting them to click the link and check out your site.
– Do you have vertical-specific testimonials?
It’s incredible how many great sales pages are testimonial-free. Testimonials are a must for any site, as they offer third-party proof that your product or service is superior. Plus, your testimonials can help you write better, more benefit-driven sales pages and fantastic comparison-review pages.
Here’s a way to make your testimonials more powerful.
And finally — the most important question:
– Does your content stand out and genuinely deserve a top position?
SEO writing is more than shoving keyphrases into the content. If you want to be rewarded by Google (and your readers), your content must stand out — not be a carbon copy of the current top-10 results. Take a hard look at your content and compare it against what’s currently positioning. Have you fully answered the searcher’s query? Did you weave in other value-added resources, such as expert quotes, links to external and internal resources (such as FAQ pages), videos, and graphics?
If so, congratulations! You’ve done your job.
Google Ads Serving Issue For Ads On Desktop Gmail
Google has a new serving issue with Google Ads that is impacting ad serving on the desktop version of Gmail. So if you are serving Google Ads on Gmail, your ads may not show to a “significant subset of users,” according to Google.
Google posted the incident over here and wrote “we’re aware of a problem with Google Ads affecting a significant subset of users. We will provide an update by Dec 24, 2021, 2:00 AM UTC detailing when we expect to resolve the problem. Please note that this resolution time is an estimate and may change. This issue is specific to ads serving on Gmail on Desktop browsers only.”
Again, the issue impacts only ads serving on Gmail on desktop browsers.
It started yesterday, December 23, 2021, at around 2pm ET and is still ongoing. Google is working on the issue but has yet to resolve it.
You can track the issue over here.
Forum discussion at Twitter.
Google Loses Top Domain Spot To TikTok
Google is no longer the world’s most popular domain after being dethroned by TikTok, according to rankings from web security company Cloudflare. The list of most popular domains is part of Cloudflare’s Year in Review report and represents domains that gained the most traffic from one year to another.
Google.com, which also includes Maps, Translate, and News among others, ended the previous year as the leader in Cloudflare’s rankings. At that time, TikTok ranked in 7th position. TikTok.com is now ending 2021 with a leap to the top spot, ahead of Google, Facebook, Amazon, and other world-leading domains.
Here’s the full list of the top 10 most popular domains as of late 2021:
Cloudflare describes TikTok’s journey toward becoming the most popular domain throughout the year 2021: “It was on February 17, 2021, that TikTok got the top spot for a day.
Back in March, TikTok got a few more days and also in May, but it was after August 10, 2021, that TikTok took the lead on most days. There were some days when Google was #1, but October and November were mostly TikTok’s days, including on Thanksgiving (November 25) and Black Friday (November 26).”
Also included in Cloudflare’s report are lists of the most popular social media domains, most popular e-commerce platforms, and most popular video streaming sites. To no surprise, Amazon ended the year as the most popular e-commerce domain, followed by Taobao, Ebay, and Walmart.
The list of most popular video streaming sites was dominated by giants such as Netflix, YouTube, and HBOMax. Interestingly, Twitch didn’t manage to crack the top 10.
Putting These Rankings In Perspective
Does this mean TikTok is now the biggest social media site? No, it still has a long way to go before reaching those heights. What this means is TikTok.com received more traffic than any other domain, according to Cloudflare. That doesn’t mean TikTok has more users than Google or competing social media sites. Insider Intelligence (formerly eMarketer) reports TikTok surpassed Snapchat and Twitter in global user numbers, but is well behind Facebook and Instagram.
In other words, TikTok is the third largest social media platform worldwide. The number of global TikTok users grew 59.8% in 2020 and went up by an additional 40.8% in 2021. Further, Insider Intelligence estimates TikTok will see a 15.1% growth in global users in 2022.
Should that estimate hold true, TikTok will hold a 20% share of overall social media users by the end of next year.
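Yearly growth percentages compound rather than add up, so it helps to multiply them out. The short sketch below uses only the three growth figures quoted above and expresses the result relative to the user count at the start of 2020. The variable names are illustrative, and the 2022 figure is an estimate, not a reported result.

```python
# Year-over-year growth in global TikTok users, as quoted above.
growth_rates = {2020: 0.598, 2021: 0.408, 2022: 0.151}  # 2022 is an estimate

multiplier = 1.0  # user base relative to the start of 2020
for year, rate in sorted(growth_rates.items()):
    multiplier *= 1 + rate
    print(year, round(multiplier, 2))
# 2020 1.6
# 2021 2.25
# 2022 2.59
```

In other words, if the 2022 estimate holds, the user base would be roughly 2.6 times what it was going into 2020.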
If TikTok isn’t part of your social media marketing strategy for 2022, these numbers are a good case for making it a priority.
Source: Matt Southern