Google’s Search Algorithm Exposed in Document Leak

Published May 29, 2024

The Search Algorithm Exposed: Inside Google’s Search API Documents Leak

Google’s search algorithm is, essentially, one of the biggest influencers of what gets found on the internet. It decides who gets to be at the top and enjoy the lion’s share of the traffic, and who gets relegated to the dark corners of the web, a.k.a. page two and beyond of the search results.

It’s the most consequential system of our digital world. How that system works has been largely a mystery for years, but no longer. The Google search document leak, which went public just yesterday, drops thousands of pages of purported ranking algorithm factors into our laps.

The Leak

There’s some debate as to whether the documentation was “leaked” or “discovered.” What we do know is that the API documentation was (likely accidentally) pushed live on GitHub, where it was then found.

The thousands of pages in these documents, which appear to come from Google’s internal Content Warehouse API, give us an unprecedented look into how Google search and its ranking algorithms work.

Fast Facts About the Google Search API Documentation

Reported to be the internal documentation for Google Search’s Content Warehouse API.
The documentation indicates this information is accurate as of March 2024.
2,596 modules are represented in the API documentation with 14,014 attributes. These are what we might call ranking factors or features, but not all attributes may be considered part of the ranking algorithm.
The documentation does not reveal how these ranking factors are weighted.

And here’s the kicker: several factors found in these documents are ones that Google has said, on record, that it does not track or include in its algorithms.

That’s invaluable to the SEO industry, and undoubtedly something that will direct how we do SEO for the foreseeable future.

Is The Document Real?

Another subject of debate is whether these documents are real. On that point, here’s what we know so far:

The documentation was on GitHub and was briefly made public from March to May 2024.
The documentation contained links to private GitHub repositories and internal pages — these required specific, Google-credentialed logins to access.
The documentation uses similar notation styles, formatting, and process/module/feature names and references seen in public Google API documentation.
Ex-Googlers say documentation similar to this exists on almost every Google team, i.e., with explanations and definitions for various API attributes and modules.

No doubt Google will deny this is their work (as of writing they refuse to comment on the leak). But all signs, so far, point to this document being the real deal, though I still caution everyone to take everything you learn from it with a grain of salt.

What We Learnt From The Google Search Document Leak

With over 2,500 technical documents to sift through, the insights we have so far are just the tip of the iceberg. I expect that the community will be analyzing this leak for months (possibly years) to gain more SEO-applicable insights.

Other articles have gotten into the nitty-gritty of it already. But if you’re having a hard time understanding all the technical jargon in those breakdowns, here’s a quick and simple summary of the points of interest identified in the leak so far:

Google uses something called “Twiddlers.” These are functions that help rerank a page (think boosting or demotion calculations).
Content can be demoted for reasons such as SERP signals (aka user behavior) indicating dissatisfaction, a link not matching the target site, using exact match domains, product reviews, location, or sexual content.
Google uses a variety of measurements related to clicks, including “badClicks”, “goodClicks”, “lastLongestClicks”, and “unsquashedClicks”.
Google keeps a copy of every version of every page it has ever indexed. However, it only uses the last 20 changes of any given URL when analyzing a page.
Google uses a domain authority metric called “siteAuthority”.
Google uses a system called “NavBoost” that uses click data for evaluating pages.
Google has a “sandbox” that websites are segregated into based on age or lack of trust signals, as indicated by an attribute called “hostAge”.
There is also an attribute called “smallPersonalSite” in the documentation, which may be related to the previous point. It is unclear what it is used for.
Google does identify entities on a webpage and can sort, rank, and filter them.
So far, the only attributes that can be connected to E-E-A-T are author-related attributes.
Google uses Chrome data as part of its page quality scoring, with a module featuring a site-level measure of views from Chrome (“chromeInTotal”).
The number, diversity, and source of your backlinks matter a lot, even if PageRank has not been mentioned by Google in years.
Keyword-optimized title tags that match search queries are important.
The “siteFocusScore” attribute measures how much a site is focused on a given topic.
Publish dates and how frequently a page is updated determine content “freshness,” which is also important.
Font size and text weight for links are things that Google notices. It appears that larger links are more positively received by Google.
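
To make the “Twiddlers” idea from the list above more concrete, here is a minimal sketch. It assumes a twiddler is simply a function that returns a multiplier for a result’s score after the initial ranking. The leaked documentation does not reveal implementations or weights, so every name here (`Result`, `demote_bad_clicks`, `rerank`) and the 0.5 demotion rule are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    url: str
    base_score: float  # score from an initial ranking pass (hypothetical)
    good_clicks: int
    bad_clicks: int


# In this sketch, a twiddler is just a function returning a score multiplier.
Twiddler = Callable[[Result], float]


def demote_bad_clicks(result: Result) -> float:
    """Halve the score when most recorded clicks were bad (made-up rule)."""
    total = result.good_clicks + result.bad_clicks
    if total == 0:
        return 1.0
    return 0.5 if result.bad_clicks / total > 0.5 else 1.0


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    """Apply every twiddler as a multiplier, then sort by adjusted score."""
    def adjusted(result: Result) -> float:
        score = result.base_score
        for twiddle in twiddlers:
            score *= twiddle(result)
        return score

    return sorted(results, key=adjusted, reverse=True)


results = [
    Result("https://example.com/a", 1.0, good_clicks=1, bad_clicks=9),
    Result("https://example.com/b", 0.8, good_clicks=9, bad_clicks=1),
]
reranked = rerank(results, [demote_bad_clicks])
print([r.url for r in reranked])
```

In this toy example, the page with the higher initial score drops below its competitor because most of its clicks were bad ones, which is the kind of boosting or demotion the leak attributes to Twiddlers.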

Author’s Note: This is not the first time a search engine’s ranking algorithm was leaked. I covered the Yandex hack and how it affects SEO in 2023, and you’ll see plenty of similarities in the ranking factors both search engines use.

Action Points for Your SEO

I did my best to review as many of the leaked “ranking features” as I could, as well as the original articles by Rand Fishkin and Mike King. From there, I have some insights I want to share with other SEOs and webmasters out there who want to know how to proceed with their SEO.

Links Matter — Link Value Affected by Several Factors

Links still matter. Shocking? Not really. It’s something I and other SEOs have been saying, even if link-related guidelines barely show up in Google news and updates nowadays.

Still, we need to emphasize link diversity and relevance in our off-page SEO strategies.

Some insights from the documentation:

PageRank of the referring domain’s homepage (also known as Homepage Trust) affects the value of the link.
Indexing tier matters. Regularly updated and accessed content is of the highest tier, and provides more value for your rankings.

If you want your off-page SEO to actually do something for your website, then focus on building links from websites that have authority, and from pages that are either fresh or are otherwise featured in the top tier.

Some PR might help here — news publications tend to drive the best results because of how well they fulfill these factors.

As for guest posts, there’s no clear indication that these will hurt your site, but I definitely would avoid approaching them as a way to game the system. Instead, be discerning about your outreach and treat it as you would if you were networking for new business partners.

Aim for Successful Clicks

The fact that clicks are a ranking factor should not be a surprise. Despite what Google’s team says, clicks are the clearest indicator of user behavior and how good a page is at fulfilling their search intent.

Google’s whole deal is providing the answers you want, so why wouldn’t they boost pages that seem to do just that?

The core of your strategy should be creating great user experiences. Great content that provides users with the right answers is how you do that. Aiming for qualified traffic is how you do that. Building a great-looking, functioning website is how you do that.

Go beyond just picking clickbait title tags and meta descriptions, and focus on making sure users get what they need from your website.

Author’s Note: If you haven’t been paying attention to page quality since the concepts of E-E-A-T and the HCU were introduced, now is the time to do so. Here’s my guide to ranking for the HCU to help you get started.

Keep Pages Updated

An interesting click-based measurement is the “last good click.” Its presence in a module related to indexing signals suggests that content decay can affect your rankings.

Be vigilant about which pages on your website are not driving the expected number of clicks for their SERP positions. Outdated posts should be audited to ensure the content has up-to-date and accurate information to help users in their search journey.

This should revive those posts and drive clicks, preventing content decay.

It’s especially important to start on this if you have content pillars on your website that aren’t driving the same traffic as they used to.
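
One way to spot the underperforming pages described above is to compare each page’s click-through rate against a baseline for its average position. The sketch below assumes you have exported clicks, impressions, and average position per page from your analytics tool. The `EXPECTED_CTR` values are placeholders I made up for illustration, not real benchmarks, and the function name and `tolerance` parameter are hypothetical.

```python
# Placeholder baseline CTRs by position. These are made-up values:
# replace them with figures from your own data.
EXPECTED_CTR = {1: 0.28, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def underperforming_pages(pages, expected_ctr=EXPECTED_CTR, tolerance=0.5):
    """Return (url, ctr, baseline) for pages whose CTR is below
    tolerance * baseline for their rounded average position."""
    flagged = []
    for page in pages:
        baseline = expected_ctr.get(round(page["position"]))
        if baseline is None or page["impressions"] == 0:
            continue  # no baseline for this position, or no data yet
        ctr = page["clicks"] / page["impressions"]
        if ctr < baseline * tolerance:
            flagged.append((page["url"], round(ctr, 4), baseline))
    return flagged


pages = [
    {"url": "/old-guide", "position": 1.2, "clicks": 50, "impressions": 1000},
    {"url": "/new-guide", "position": 2.0, "clicks": 200, "impressions": 1000},
]
audit_list = underperforming_pages(pages)
print(audit_list)
```

Pages that end up in the flagged list are good candidates for a content refresh, since they rank well but are not earning the clicks you would expect.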

Establish Expertise & Authority

Google does notice the entities on a webpage, which include a bunch of things, but what I want to focus on are those related to your authors.

E-E-A-T as a concept is pretty nebulous, because the “expertise” and “authority” of a website and its authors are hard to score. So, a lot of SEOs have been skeptical about it.

However, the presence of an “author” attribute combined with the in-depth mapping of entities in the documentation shows there is some weight to having a well-established author on your website.

So, apply author markups, create an author bio page and archive, and showcase your official profiles on your website to prove your expertise.
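
As an illustration of the author markup mentioned above, here is a minimal sketch that generates schema.org JSON-LD for an article and its author. The `Article`, `Person`, `author`, and `sameAs` terms come from the public schema.org vocabulary. The function names, the example.com URLs, and the author details are placeholders, and nothing in the leak confirms that this markup is what feeds the “author” attribute.

```python
import json


def article_jsonld(headline, author_name, author_url, profiles):
    """Build schema.org Article markup with a Person author as a JSON string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {
            "@type": "Person",
            "name": author_name,
            "url": author_url,          # the author bio page on your site
            "sameAs": list(profiles),   # official profiles elsewhere
        },
    }
    return json.dumps(data, indent=2)


def script_tag(jsonld):
    """Wrap the JSON-LD in the script tag that goes in the page's HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


markup = article_jsonld(
    headline="Example Article Title",
    author_name="Jane Doe",  # placeholder author
    author_url="https://example.com/authors/jane-doe",
    profiles=["https://www.linkedin.com/in/example"],
)
print(script_tag(markup))
```

The `url` field points to the author bio page on your own site, while `sameAs` lists the official profiles that back up the author’s credentials.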

Build Your Domain Authority

After countless Q&As and interviews where statements like “we don’t have anything like domain authority,” and “we don’t have website authority score,” were thrown around, we find that an attribute called “siteAuthority” does exist.

Though we don’t know specifically how this measure is computed, and how it weighs in the overall scoring for your website, we know it does matter to your rankings.

So, what do you need to do to improve site authority? It’s simple — keep following best practices and white-hat SEO, and you should be able to grow your authority within your niche.

Stick to Your Niche

Speaking of niches — I found the “siteFocusScore” attribute interesting. It appears that building more and more content within a specific topic is considered a positive.

It’s something other SEOs have hypothesized before. After all, the more you write about a topic, the more you must be an authority on that topic, right?

But anyone can write tons of blogs on a given topic nowadays with AI, so how do you stand out (and avoid the risk of sounding artificial and spammy)?

That’s where author entities and link-building come in. I do think that great content should be supplemented by link-building efforts, as a sort of way to show that hey, “I’m an authority with these credentials, and these other people think I’m an authority on the topic as well.”
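
If you want a rough self-audit of how focused your own site is, the sketch below measures the share of pages that fall under your most common topic. To be clear, this is not how “siteFocusScore” is computed (the documentation does not say). The function name and the idea of hand-labeling each page with a single topic are assumptions made purely for illustration.

```python
from collections import Counter


def topic_concentration(page_topics):
    """Return (dominant_topic, share_of_pages) for a list of per-page topic
    labels. Returns (None, 0.0) when the list is empty."""
    if not page_topics:
        return None, 0.0
    counts = Counter(page_topics)
    topic, count = counts.most_common(1)[0]
    return topic, count / len(page_topics)


# One hand-assigned topic label per published page (hypothetical data).
labels = ["seo", "seo", "seo", "recipes"]
dominant, share = topic_concentration(labels)
print(dominant, share)
```

A low share suggests your content is spread across unrelated topics, which is worth reviewing if you are trying to build authority in one niche.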

Key Takeaway

Most of the insights from the Google search document leak are things that SEOs have been working on for months (if not years). However, we now have solid evidence behind a lot of our hunches, proving that our theories are in fact best practices.

The biggest takeaway I have from this leak: Google relies on user behavior (click data and post-click behavior in particular) to find the best content. Other ranking factors supplement that. Optimize to get users to click on and then stay on your page, and you should see benefits to your rankings.

Could Google remove these ranking factors now that they’ve been leaked? They could, but it’s highly unlikely that they’ll remove vital attributes in the algorithm they’ve spent years building.

So my advice is to follow these now validated SEO practices and be very critical about any Google statements that follow this leak.
