How To Use IndexNow API With Python For Bulk Indexing

IndexNow is a protocol developed by Microsoft Bing and adopted by Yandex that enables webmasters and SEO pros to easily notify search engines when a webpage has been updated via an API.

And today, Microsoft announced that it is making the protocol easier to implement by ensuring that submitted URLs are shared between search engines.

Given its positive implications and the promise of a faster indexing experience for publishers, the IndexNow API should be on every SEO professional’s radar.

Using Python for automating URL submission to the IndexNow API or making an API request to the IndexNow API for bulk URL indexing can make managing IndexNow more efficient for you.

In this tutorial, you’ll learn how to do just that, with step-by-step instructions for using the IndexNow API to submit URLs to Microsoft Bing in bulk with Python.

Note: The IndexNow API is similar to Google’s Indexing API with only one difference: the Google Indexing API is only for job postings and web pages that contain livestream (broadcast) video content.

Google announced that it would test the IndexNow API but has not provided an update since.

Bulk Indexing Using IndexNow API with Python: Getting Started

Below are the Python packages and libraries that will be used for this IndexNow API tutorial.

  • Advertools (required).
  • Pandas (required).
  • Requests (required).
  • Time (optional).
  • JSON (optional).

Before getting started, reading the basics of the protocol can help you understand this IndexNow API and Python tutorial better. We will use an API key and a .txt file for authentication, along with specific HTTP headers.

IndexNow API Usage Steps With Python

1. Import The Python Libraries

To use the necessary Python libraries, we will use the “import” command.

  • Advertools will be used for sitemap URL extraction.
  • Requests will be used for making the GET and POST requests.
  • Pandas will be used for taking the URLs in the sitemap into a list object.
  • The “time” module will be used to prevent a “Too Many Requests” (429) error with the “sleep()” method.
  • JSON will be used for serializing the POST request body if needed.

Below, you will find all of the necessary import lines for the IndexNow API tutorial.

import advertools as adv
import pandas as pd
import requests
import json
import time

2. Extracting The Sitemap URLs With Python

To extract the URLs from a sitemap file, different web scraping methods and libraries can be used such as Requests or Scrapy.

But to keep things simple and efficient, I will use my favorite Python SEO package – Advertools.

With only a single line of code, all of the URLs within a sitemap can be extracted.

sitemap_urls = adv.sitemap_to_df("https://www.example.com/sitemap_index.xml")

The “sitemap_to_df” method of Advertools can extract all the URLs and other sitemap-related tags such as “lastmod” or “priority.”

Below, you can see the output of the “adv.sitemap_to_df” command.

Sitemap URL extraction can be done via Advertools’ “sitemap_to_df” method.

All of the URLs and dates are specified within the “sitemap_urls” variable.

Since sitemaps are useful sources for search engines and SEOs, Advertools’ sitemap_to_df method can be used for many different tasks including a Sitemap Python Audit.

But that’s a topic for another time.
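Because “sitemap_to_df” returns a plain pandas DataFrame, you can also filter it before submission. Below is a minimal sketch that keeps only recently modified URLs; the small hand-built DataFrame stands in for real sitemap output (the “loc” and “lastmod” columns match what Advertools returns), and the dates are placeholders:

```python
import pandas as pd

# Stand-in for the DataFrame returned by adv.sitemap_to_df:
# "loc" holds the URLs, "lastmod" holds the modification dates.
sitemap_urls = pd.DataFrame({
    "loc": ["https://www.example.com/a", "https://www.example.com/b"],
    "lastmod": pd.to_datetime(["2022-01-01", "2022-03-01"], utc=True),
})

# Keep only URLs modified after the cutoff, so that only changed
# pages are submitted to IndexNow.
cutoff = pd.Timestamp("2022-02-15", tz="UTC")
recent_urls = sitemap_urls.loc[sitemap_urls["lastmod"] >= cutoff, "loc"].to_list()
print(recent_urls)
```

Filtering this way keeps submissions aligned with the protocol’s guidance to only notify search engines about new or changed URLs.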

3. Take The URLs Into A List Object With “to_list()”

Python’s Pandas library has a method for taking a data frame column (data series) into a list object, to_list().

Below is an example usage:

sitemap_urls["loc"].to_list()

Below, you can see the result:

Pandas’ “to_list” method can be used with Advertools for listing the URLs.

All URLs within the sitemap are in a Python list object.

4. Understand The URL Syntax Of IndexNow API Of Microsoft Bing

Let’s take a look at the URL syntax of the IndexNow API.

Here’s an example:

https://<searchengine>/indexnow?url=url-changed&key=your-key

The URL syntax follows the RFC 3986 standard, with query parameters expressing the submission details.

  • The <searchengine> represents the search engine that the IndexNow API request will be sent to.
  • The “?url=” parameter specifies the URL that will be submitted to the search engine via the IndexNow API.
  • The “&key=” parameter is the API key that will be used for the request.
  • The “&keyLocation=” parameter points to the key file that proves you own the website that the IndexNow API will be used for.
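As a quick sketch, the endpoint described above can be assembled in Python with the standard library’s “urlencode”; the search engine host, key, and key location values below are placeholders:

```python
from urllib.parse import urlencode

def build_endpoint(search_engine, url, key, key_location):
    # urlencode percent-encodes the parameter values per RFC 3986.
    params = urlencode({"url": url, "key": key, "keyLocation": key_location})
    return f"https://{search_engine}/indexnow?{params}"

endpoint = build_endpoint(
    "www.bing.com",
    "https://www.example.com/changed-page/",
    "your-key",
    "https://www.example.com/your-key.txt",
)
print(endpoint)
```

Encoding the parameters rather than concatenating raw strings avoids malformed requests when URLs contain special characters.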

The “&keyLocation” will bring us to the API Key and its “.txt” version.

5. Gather The API Key For IndexNow And Upload It To The Root

You’ll need a valid key to use the IndexNow API.

Use this link to generate the Microsoft Bing IndexNow API Key.

There is no limit on generating IndexNow API keys.

Clicking the “Generate” button creates an IndexNow API Key.

When you click on the download button, it will download the “.txt” version of the IndexNow API Key.

An IndexNow API key can be generated at Microsoft Bing’s stated address.
The downloaded IndexNow API key as a .txt file.

The API key will be used both as the file name and as the content of the text file.

The API key inside the TXT file should match both the name of the file and the actual API key value.

The next step is uploading this TXT file to the root of the website’s server.

Since I use FileZilla for my FTP, I have uploaded it easily to my web server’s root.

By putting the .txt file into the web server’s root folder, the IndexNow API setup can be completed.
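Before submitting anything, it can be worth confirming that the key file is actually reachable and contains the key. The sketch below is an optional helper, not part of the original tutorial; the validation rule is split into a small pure function so it is easy to test, and the URL and key values are placeholders:

```python
import requests

def body_matches_key(status_code, body, key):
    # The key file must be served with a 200 status, and its body
    # (ignoring surrounding whitespace) must be exactly the key.
    return status_code == 200 and body.strip() == key

def key_file_is_valid(key, key_location):
    # Fetch the key file from the web server root and validate it.
    try:
        response = requests.get(key_location, timeout=10)
    except requests.RequestException:
        return False
    return body_matches_key(response.status_code, response.text, key)

# Example (placeholder values):
# key_file_is_valid("your-key", "https://www.example.com/your-key.txt")
```

Running this once after uploading the file catches FTP mistakes before hundreds of submissions fail authentication.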

The next step is a simple for loop example for submitting all of the URLs within the sitemap.

6. Submit The URLs Within The Sitemap With Python To IndexNow API

To submit a single URL to IndexNow, you can use a single “requests.get()” call. But to make the script more useful, we will use a for loop.

To submit URLs in bulk to the IndexNow API with Python, follow the steps below:

  1. Create a key variable with the IndexNow API key value.
  2. Replace the <searchengine> section with the search engine that you want to submit URLs to (Microsoft Bing or Yandex, for now).
  3. Assign all of the URLs from the sitemap to a variable as a list.
  4. Provide the URL of the “txt” key file within the root of the web server.
  5. Place the URL, key, and key location within the f-string endpoint.
  6. Start your for loop, and use “requests.get()” for every URL within the sitemap.

Below, you can see the implementation:

key = "22bc7c564b334f38b0b1ed90eec8f2c5"
location = "https://www.example.com/22bc7c564b334f38b0b1ed90eec8f2c5.txt"  # URL of the key file
url = sitemap_urls["loc"].to_list()
for i in url:
    endpoint = f"https://www.bing.com/indexnow?url={i}&key={key}&keyLocation={location}"
    response = requests.get(endpoint)
    print(i)
    print(endpoint)
    print(response.status_code, response.content)
    # time.sleep(5)  # uncomment to wait between requests

If you’re concerned about sending too many requests to the IndexNow API, you can use the Python time module to make the script wait between every request.
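As a variation on that idea, here is a sketch that pauses between submissions and retries once after an HTTP 429 (“Too Many Requests”) response. This is not the article’s original script; the pluggable “session” argument is an assumption added for testability, and the endpoint URLs are placeholders:

```python
import time

import requests

def submit_with_backoff(endpoints, delay=1, backoff=30, session=None):
    # session is injectable so the pacing logic can be tested offline.
    session = session or requests.Session()
    status_codes = []
    for endpoint in endpoints:
        response = session.get(endpoint, timeout=10)
        if response.status_code == 429:
            # Rate-limited: wait, then retry this endpoint once.
            time.sleep(backoff)
            response = session.get(endpoint, timeout=10)
        status_codes.append(response.status_code)
        time.sleep(delay)  # fixed pause between submissions
    return status_codes
```

Separating pacing from submission also makes it easy to tune the delay per search engine without touching the loop body.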

Here you can see the output of the script:

The empty string as the request’s response body represents the success of the IndexNow API request, according to Microsoft Bing’s IndexNow documentation.

The 200 Status Code means that the request was successful.

With the for loop, I have submitted 194 URLs to Microsoft Bing.

According to the IndexNow Documentation, the HTTP 200 Response Code signals that the search engine is aware of the change in the content or the new content. But it doesn’t necessarily guarantee indexing.

For instance, I used the same script for another website. After 120 seconds, Microsoft Bing showed that 31 results were found across four pages.

The only problem is that the first page shows only two results, and Bing says the URLs are blocked by robots.txt, even though the block was removed before submission.

This can happen if robots.txt was changed to unblock some URLs shortly before using the IndexNow API, because Bing does not appear to re-check robots.txt immediately.

Thus, even after the block is removed, Bing may try to index your website while still using the previous version of the robots.txt file.

This shows what happens if you use the IndexNow API while Bingbot is blocked via robots.txt.

On the second page, there is only one result:

Microsoft Bing might use a different indexation and pagination method than Google. The second page shows only one among the 31 results.

On the third page, there is no result, and it shows Microsoft Bing Translate for translating the string within the search bar.

Sometimes, Microsoft Bing interprets the “site:” search operator as part of the query.

When I checked Google Analytics, it showed that Bing still hadn’t crawled or indexed the website. I know this is true because I also checked the log files.

Below, you will see the Bing Webmaster Tools report for the example website:

Bing Webmaster Tools Report

It says that I submitted 38 URLs.

The next step will involve the bulk request with the POST Method and a JSON object.

7. Perform An HTTP POST Request To The IndexNow API

To perform an HTTP post request to the IndexNow API for a set of URLs, a JSON object should be used with specific properties.

  • The “host” property represents the search engine hostname.
  • The “key” property represents the API key.
  • The “keyLocation” property represents the location of the API key’s txt file within the web server.
  • The “urlList” property represents the URL set that will be submitted to the IndexNow API.
  • The headers represent the POST request headers that will be used, which are “Content-type” and “charset.”

Since this is a POST request, “requests.post()” will be used instead of “requests.get().”

Below, you will find an example of a set of URLs submitted to Microsoft Bing’s IndexNow API.

data = {
  "host": "www.bing.com",
  "key": "22bc7c564b334f38b0b1ed90eec8f2c5",
  "keyLocation": "https://www.example.com/22bc7c564b334f38b0b1ed90eec8f2c5.txt",
  "urlList": [
    "https://www.example.com/technical-seo/http-header/",
    "https://www.example.com/python-seo/nltk/lemmatize",
    "https://www.example.com/pagespeed/broser-hints/preload",
    "https://www.example.com/python-seo/nltk/stemming",
    "https://www.example.com/python-seo/categorize-queries/",
    "https://www.example.com/python-seo/nltk/tokenization",
    "https://www.example.com/review/oncrawl/",
    "https://www.example.com/technical-seo/hreflang/",
    "https://www.example.com/technical-seo/multilingual-seo/"
  ]
}
headers = {"Content-type": "application/json", "charset": "utf-8"}
# Serialize the payload to JSON and POST it to the IndexNow endpoint.
r = requests.post("https://www.bing.com/indexnow", data=json.dumps(data), headers=headers)
print(r.status_code, r.content)

In the example above, we have performed a POST Request to index a set of URLs.

We have used the “data” object for the “data” parameter of “requests.post()”, and the “headers” object for the “headers” parameter.

Since we POST a JSON object, the request should have the “Content-type: application/json” header with “charset: utf-8.”
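As a side note, the requests library can also handle the serialization and the “Content-type” header for you via its “json=” parameter. The sketch below shows the two equivalent approaches with placeholder values; the actual POST calls are commented out to avoid sending real requests:

```python
import json

import requests

# Placeholder payload, not real credentials or URLs.
data = {
    "host": "www.bing.com",
    "key": "your-key",
    "urlList": ["https://www.example.com/"],
}

# Option 1: requests serializes the dict and sets
# "Content-Type: application/json" automatically.
# requests.post("https://www.bing.com/indexnow", json=data)

# Option 2: serialize manually and set the headers yourself.
body = json.dumps(data)
headers = {"Content-Type": "application/json; charset=utf-8"}
# requests.post("https://www.bing.com/indexnow", data=body, headers=headers)
```

The “json=” form is less error-prone because the header and the body can never drift out of sync.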

After I made the POST request, my live log file analysis dashboard started to show hits from Bingbot 135 seconds later.

Bingbot Log File Analysis

8. Create Custom Functions For The IndexNow API To Save Time

Creating a custom function for IndexNow API is useful to decrease the time that will be spent on the code preparation.

Thus, I have created two different custom Python functions to use the IndexNow API for bulk requests and individual requests.

Below, you will find an example for only the bulk requests to the IndexNow API.

The custom function for bulk requests is called “submit_url_set.”

Even if you just fill in the parameters, you will still be able to use it properly.

def submit_url_set(set_: list, key, location, host="https://www.bing.com/indexnow", headers={"Content-type": "application/json", "charset": "utf-8"}):
    data = {
        "host": "www.bing.com",
        "key": key,
        "keyLocation": location,
        "urlList": set_,
    }
    r = requests.post(host, data=json.dumps(data), headers=headers)
    return r.status_code

An explanation of this custom function:

  • The “set_” parameter provides a list of URLs.
  • The “key” parameter provides the IndexNow API key.
  • The “location” parameter provides the location of the IndexNow API key’s txt file within the web server.
  • The “host” parameter provides the search engine host address.
  • The “headers” parameter provides the headers that are necessary for the IndexNow API.

I have defined default values for some of the parameters, such as “host” for Microsoft Bing. If you want to use the function for Yandex, you will need to state it while calling the function.

Below is an example usage:

submit_url_set(set_=sitemap_urls["loc"].to_list(), key="22bc7c564b334f38b0b1ed90eec8f2c5", location="https://www.example.com/22bc7c564b334f38b0b1ed90eec8f2c5.txt")

If you want to extract sitemap URLs with a different method, or if you want to use the IndexNow API for a different URL set, you will need to change the “set_” parameter value.
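For example, a hypothetical call targeting Yandex instead of the default Bing host could look like the sketch below; the key, URLs, and key file location are placeholders, and “https://yandex.com/indexnow” follows the IndexNow endpoint convention:

```python
# Placeholder arguments for a Yandex submission via the custom
# bulk function; none of these values are real.
yandex_call = dict(
    set_=["https://www.example.com/page-1/", "https://www.example.com/page-2/"],
    key="your-key",
    location="https://www.example.com/your-key.txt",
    host="https://yandex.com/indexnow",
)

# submit_url_set(**yandex_call)  # would POST the URL set to Yandex
```

Keeping the call arguments in a dict like this makes it easy to reuse the same URL set against multiple IndexNow-enabled search engines.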

Below, you will see an example of the Custom Python function for the IndexNow API for only individual requests.

def submit_url(urls: list, location, key="22bc7c564b334f38b0b1ed90eec8f2c5"):
    for i in urls:
        endpoint = f"https://www.bing.com/indexnow?url={i}&key={key}&keyLocation={location}"
        response = requests.get(endpoint)
        print(i)
        print(endpoint)
        print(response.status_code, response.content)
        # time.sleep(5)  # uncomment to wait between requests

Since this uses a for loop, you can submit URLs one by one. The search engine may prioritize these types of requests differently.

Since some bulk requests will include unimportant URLs, individual requests might be seen as more reasonable.

If you want to include the sitemap URL extraction within the function, you should incorporate Advertools into the function itself.

Tips For Using The IndexNow API With Python

An Overview of How The IndexNow API Works, Capabilities & Uses

  • The IndexNow API doesn’t guarantee that your website or the URLs that you submitted will be indexed.
  • You should only submit URLs that are new or for which the content has changed.
  • The IndexNow API impacts the crawl budget.
  • Microsoft Bing has a threshold for the URL Content Quality and Calculation of the Crawl Need for a URL. If the submitted URL is not good enough, they may not crawl it.
  • You can submit up to 10,000 URLs per POST request.
  • The IndexNow API suggests submitting URLs even if the website is small.
  • Submitting the same pages many times within a day can cause the IndexNow API to stop crawling the redundant URLs or the source.
  • The IndexNow API is useful for sites where the content changes frequently, like every 10 minutes.
  • IndexNow API is useful for pages that are gone and are returning a 404 response code. It lets the search engine know that the URLs are gone.
  • IndexNow API can be used for notifying of new 301 or 302 redirects.
  • The 200 Status Response Code means that the search engine is aware of the submitted URL.
  • The 429 Status Code means that you made too many requests to the IndexNow API.
  • If you put a “txt” file that contains the IndexNow API Key into a subfolder, the IndexNow API can be used only for that subfolder.
  • If you have two different CMSes, you can use two different IndexNow API keys for the two different site sections.
  • Subdomains need to use a different IndexNow API key.
  • Even if you already use a sitemap, using IndexNow API is useful because it efficiently tells the search engines of website changes and reduces unnecessary bot crawling.
  • All search engines that adopt the IndexNow API (Microsoft Bing and Yandex) share the URLs that are submitted between each other.
IndexNow API documentation and usage tips can be found above.
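Since a single POST request is capped at 10,000 URLs, very large sitemaps need to be split into compliant batches first. A minimal sketch, using placeholder URLs:

```python
def chunk_urls(urls, size=10_000):
    # Split the URL list into consecutive batches of at most `size`
    # items, matching the per-request limit of the IndexNow API.
    return [urls[i:i + size] for i in range(0, len(urls), size)]

# 25,000 placeholder URLs produce three compliant batches.
batches = chunk_urls([f"https://www.example.com/page-{n}/" for n in range(25_000)])
print(len(batches), [len(b) for b in batches])
```

Each batch can then be passed to the bulk submission function in its own POST request.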

In this IndexNow API tutorial and guideline with Python, we have examined a new search engine technology.

Instead of waiting to be crawled, publishers can notify the search engines to crawl when there is a need.

IndexNow reduces the use of search engine data center resources, and now you know how to use Python to make the process more efficient, too.

More resources:

An Introduction To Python & Machine Learning For Technical SEO

How to Use Python to Monitor & Measure Website Performance

Advanced Technical SEO: A Complete Guide


Featured Image: metamorworks/Shutterstock





Google CEO Confirms AI Features Coming To Search “Soon”


Google announced today that it will soon be rolling out AI-powered features in its search results, providing users with a new, more intuitive way to navigate and understand the web.

These new AI features will help users quickly understand the big picture and learn more about a topic by distilling complex information into easy-to-digest formats.

Google has a long history of using AI to improve its search results for billions of people.

The company’s latest AI technologies, such as LaMDA, PaLM, Imagen, and MusicLM, provide users with entirely new ways to engage with information.

Google is working to bring these latest advancements into its products, starting with search.

Statement From Google CEO Sundar Pichai

Sundar Pichai, CEO of Google and Alphabet, released a statement on Twitter about a conversational AI service that will be available in the coming weeks.

Bard, powered by LaMDA, is Google’s new language model for dialogue applications.

According to Pichai, Bard, which leverages Google’s vast intelligence and knowledge base, can deliver accurate and high-quality answers:

“In 2021, we shared next-gen language + conversation capabilities powered by our Language Model for Dialogue Applications (LaMDA). Coming soon: Bard, a new experimental conversational #GoogleAI service powered by LaMDA.

Bard seeks to combine the breadth of the world’s knowledge with the power, intelligence, and creativity of our large language models. It draws on information from the web to provide fresh, high-quality responses. Today we’re opening Bard up to trusted external testers.

We’ll combine their feedback with our own internal testing to make sure Bard’s responses meet our high bar for quality, safety, and groundedness and we will make it more widely available in coming weeks. It’s early, we will launch, iterate and make it better.”

In Summary

Increasingly, people are turning to Google for deeper insights and understanding.

With the help of AI, Google can consolidate insights for questions where there is no one correct answer, making it easier for people to get to the core of what they are searching for.

In addition to the AI features being rolled out in search, Google is also introducing a new experimental conversational AI service called Bard. Powered by LaMDA, Bard will use Google’s vast intelligence and knowledge base to deliver accurate and high-quality answers to users.

Google continues demonstrating its commitment to making search more intuitive and effective for users. As Pichai said in his statement, the company will continue to launch, iterate, and improve these new offerings in the coming weeks and months.

Source: Google




Google Updates Structured Data Guidance To Clarify Supported Formats


Google updated its structured data guidance to better emphasize that all three structured data formats are acceptable to Google, and to explain why JSON-LD is recommended.

The updated Search Central page is the Supported Formats section of the Introduction to structured data markup in Google Search webpage.

The most important changes were to add a new section title (Supported Formats), and to expand that section with an explanation of supported structured data formats.

Three Structured Data Formats

Google supports three structured data formats.

  1. JSON-LD
  2. Microdata
  3. RDFa

But only one of the above formats, JSON-LD, is recommended.

According to the documentation, the other two formats (Microdata and RDFa) are still fine to use. The update to the documentation explains why JSON-LD is recommended.
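For illustration, here is a minimal, hypothetical JSON-LD example (Article markup), built as a Python dict and serialized with the standard library; the same data could equally be expressed in Microdata or RDFa, per Google’s guidance:

```python
import json

# A hypothetical Article markup example; the headline and author
# values are placeholders, not taken from Google's documentation.
article_markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Headline",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

# This string would be embedded in a <script type="application/ld+json"> tag.
json_ld = json.dumps(article_markup, indent=2)
print(json_ld)
```

Because JSON-LD lives in a single script block rather than being woven through the HTML, it is generally easier to generate and maintain at scale, which matches Google’s stated reasoning.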

Google also made a minor change to the title of a preceding section to reflect that the section addresses structured data vocabulary.

The original section title, Structured data format, is now Structured data vocabulary and format.

Google added a section title to the section that offers guidance on Google’s preferred structured data format.

This is also the section with the most additional text added to it.

New Supported Formats Section Title

The updated content explains why Google prefers the JSON-LD structured data format, while confirming that the other two formats are acceptable.

Previously this section contained just two sentences:

“Google Search supports structured data in the following formats, unless documented otherwise:

Google recommends using JSON-LD for structured data whenever possible.”

The updated section now has the following content:

“Google Search supports structured data in the following formats, unless documented otherwise.

In general, we recommend using a format that’s easiest for you to implement and maintain (in most cases, that’s JSON-LD); all 3 formats are equally fine for Google, as long as the markup is valid and properly implemented per the feature’s documentation.

In general, Google recommends using JSON-LD for structured data if your site’s setup allows it, as it’s the easiest solution for website owners to implement and maintain at scale (in other words, less prone to user errors).”

Structured Data Formats

JSON-LD is arguably the easiest structured data format to implement, the easiest to scale, and the most straightforward to edit.

Most, if not all, WordPress SEO and structured data plugins output JSON-LD structured data.

Nevertheless, it’s a useful update to Google’s structured data guidance in order to make it clear that all three formats are still supported.

Google’s documentation on the change can be read here.

Featured image by Shutterstock/Olena Zaskochenko




Ranking Factors & The Myths We Found


Yandex is the search engine with the majority of market share in Russia and the fourth-largest search engine in the world.

On January 27, 2023, it suffered what is arguably one of the largest data leaks a modern tech company has endured in many years – and it is the second leak in less than a decade.

In 2015, a former Yandex employee attempted to sell Yandex’s search engine code on the black market for around $30,000.

The initial leak in January this year revealed 1,922 ranking factors, of which more than 64% were listed as unused or deprecated (superseded and best avoided).

This leak was just the file labeled kernel, but as the SEO community and I delved deeper, more files were found that, combined, contain approximately 17,800 ranking factors.

When it comes to practicing SEO for Yandex, the guide I wrote two years ago, for the most part, still applies.

Yandex, like Google, has always been public with its algorithm updates and changes, and in recent years, how it has adopted machine learning.

Notable updates from the past two to three years include:

  • Vega (which doubled the size of the index).
  • Mimicry (penalizing fake websites impersonating brands).
  • Y1 update (introducing YATI).
  • Y2 update (late 2022).
  • Adoption of IndexNow.
  • A fresh rollout and assumed update of the PF filter.

On a personal note, this data leak is like a second Christmas.

Since January 2020, I’ve run an SEO news website as a hobby dedicated to covering Yandex SEO and search news in Russia with 600+ articles, so this is probably the peak event of the hobby site.

I’ve also spoken twice at the Optimization conference – the largest SEO conference in Russia.

This is also a good test to see how closely Yandex’s public statements match the codebase secrets.

In 2019, working with Yandex’s PR team, I was able to interview engineers in their Search team and ask a number of questions sourced from the wider Western SEO community.

You can read the interview with the Yandex Search team here.

Whilst Yandex is primarily known for its presence in Russia, the search engine also has a presence in Turkey, Kazakhstan, and Georgia.

The data leak was believed to be politically motivated and the actions of a rogue employee, and contains a number of code fragments from Yandex’s monolithic repository, Arcadia.

Within the 44GB of leaked data, there’s information relating to a number of Yandex products including Search, Maps, Mail, Metrika, Disc, and Cloud.

What Yandex Has Had To Say

As I write this post (January 31st, 2023), Yandex has publicly stated that:

“the contents of the archive (leaked code base) correspond to the outdated version of the repository – it differs from the current version used by our services.”

And:

“It is important to note that the published code fragments also contain test algorithms that were used only within Yandex to verify the correct operation of the services.”

So, how much of this code base is actively used is questionable.

Yandex has also revealed that during its investigation and audit, it found a number of errors that violate its own internal principles, so it is likely that portions of this leaked code (that are in current use) may be changing in the near future.

Factor Classification

Yandex classifies its ranking factors into three categories.

This has been outlined in Yandex’s public documentation for some time, but I feel is worth including here, as it better helps us understand the ranking factor leak.

  • Static factors – Factors that are related directly to the website (e.g. inbound backlinks, inbound internal links, headers, and ads ratio).
  • Dynamic factors – Factors that are related to both the website and the search query (e.g. text relevance, keyword inclusions, TF*IDF).
  • User search-related factors – Factors relating to the user query (e.g. where is the user located, query language, and intent modifiers).

The ranking factors in the document are tagged to match the corresponding category, with TG_STATIC and TG_DYNAMIC, and then TG_QUERY_ONLY, TG_QUERY, TG_USER_SEARCH, and TG_USER_SEARCH_ONLY.
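To make the classification concrete, here is a hypothetical sketch of how the leaked factors could be tallied by tag once loaded as (name, tags) records; the sample factor names below are illustrative, not taken from the actual leak:

```python
from collections import Counter

# Illustrative (name, tags) records standing in for parsed factor data.
factors = [
    ("FI_PAGE_RANK", ["TG_STATIC"]),
    ("FI_TEXT_RELEVANCE", ["TG_DYNAMIC"]),
    ("FI_QUERY_LANGUAGE", ["TG_USER_SEARCH", "TG_QUERY"]),
]

# Count how many factors carry each category tag.
tag_counts = Counter(tag for _, tags in factors for tag in tags)
print(tag_counts)
```

Tallying tags this way is how the community derived figures like the number of deprecated or user-search-related factors from the leaked files.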

Yandex Leak Learnings So Far

From the data thus far, below are some of the affirmations and learnings we’ve been able to make.

There is so much data in this leak, it is very likely that we will be finding new things and making new connections in the next few weeks.

These include:

  • PageRank (a form of).
  • At some point, Yandex utilized TF*IDF.
  • Yandex still uses meta keywords, which are also highlighted in its documentation.
  • Yandex has specific factors for medical, legal, and financial topics (YMYL).
  • It also uses a form of page quality scoring, but this is known (ICS score).
  • Links from high-authority websites have an impact on rankings.
  • There’s nothing new to suggest Yandex can crawl JavaScript yet outside of already publicly documented processes.
  • Server errors and excessive 4xx errors can impact ranking.
  • The time of day is taken into consideration as a ranking factor.

Below, I’ve expanded on some other affirmations and learnings from the leak.

Where possible, I’ve also tied these leaked ranking factors to the algorithm updates and announcements that relate to them, or where we were told about them being impactful.

MatrixNet

MatrixNet is mentioned in a few of the ranking factors and was announced in 2009, and then superseded in 2017 by Catboost, which was rolled out across the Yandex product sphere.

This further adds validity to comments directly from Yandex, and one of the factor authors DenPlusPlus (Den Raskovalov), that this is, in fact, an outdated code repository.

MatrixNet was originally introduced as a new, core algorithm that took into consideration thousands of ranking factors and assigned weights based on the user location, the actual search query, and perceived search intent.

It is typically seen as an early version of Google’s RankBrain, though they are indeed two very different systems. MatrixNet was launched six years before RankBrain was announced.

MatrixNet has also been built upon, which isn’t surprising, given it is now 14 years old.

In 2016, Yandex introduced the Palekh algorithm that used deep neural networks to better match documents (webpages) and queries, even if they didn’t contain the right “levels” of common keywords, but satisfied the user intents.

Palekh was capable of processing 150 pages at a time, and in 2017 was updated with the Korolyov update, which took into account more depth of page content, and could work off 200,000 pages at once.

URL & Page-Level Factors

From the leak, we have learned that Yandex takes into consideration URL construction, specifically:

  • The presence of numbers in the URL.
  • The number of trailing slashes in the URL (and if they are excessive).
  • The number of capital letters in the URL.
Screenshot from author, January 2023

The age of a page (document age) and the last updated date are also important, and this makes sense.

As well as document age and last update, a number of factors in the data relate to freshness – particularly for news-related queries.

Yandex formerly used timestamps, not for ranking purposes but for “reordering” purposes, but this factor is now classified as unused.

Also in the deprecated column is the use of keywords in the URL. Yandex previously measured that three keywords from the search query in the URL would be an “optimal” result.

Internal Links & Crawl Depth

Whilst Google has gone on the record to say that for its purposes, crawl depth isn’t explicitly a ranking factor, Yandex appears to have an active piece of code that dictates that URLs that are reachable from the homepage have a “higher” level of importance.

Screenshot from author, January 2023

This mirrors John Mueller’s 2018 statement that Google gives “a little more weight” to pages linked directly from the homepage.

The ranking factors also highlight a specific token weighting for webpages that are “orphans” within the website linking structure.

Clicks & CTR

In 2011, Yandex released a blog post talking about how the search engine uses clicks as part of its rankings and also addresses the desires of the SEO pros to manipulate the metric for ranking gain.

Specific click factors in the leak look at things like:

  • The ratio of the number of clicks on the URL, relative to all clicks on the search.
  • The same as above, but broken down by region.
  • How often users click on the URL for a given search.

Manipulating Clicks

Manipulating user behavior, specifically “click-jacking”, is a known tactic within Yandex.

Yandex has a filter, known as the PF filter, that actively seeks out and penalizes websites that engage in this activity using scripts that monitor IP similarities and then the “user actions” of those clicks – and the impact can be significant.

The below screenshot shows the impact on organic sessions (сессии) after being penalized for imitating user clicks.

Image from Russian Search News, January 2023

User Behavior

The user behavior takeaways from the leak are some of the more interesting findings.

User behavior manipulation is a common SEO violation that Yandex has been combating for years. At the 2020 Optimization conference, then Head of Yandex Webmaster Tools Mikhail Slevinsky said the company is making good progress in detecting and penalizing this type of behavior.

Yandex penalizes user behavior manipulation with the same PF filter used to combat CTR manipulation.

Dwell Time

102 of the ranking factors contain the tag TG_USERFEAT_SEARCH_DWELL_TIME, and reference the device, user duration, and average page dwell time.

All but 39 of these factors are deprecated.

Screenshot from author, January 2023

Bing first used the term “dwell time” in a 2011 blog post, and in recent years Google has made it clear that it doesn’t use dwell time (or similar user interaction signals) as a ranking factor.

YMYL

YMYL (Your Money, Your Life) is a concept well known within Google, and it is not a new concept to Yandex either.

The data leak contains specific ranking factors for medical, legal, and financial content – but this was notably revealed back in 2019 at the Yandex Webmaster conference, when the company announced the Proxima Search Quality Metric.

Metrika Data Usage

Six of the ranking factors relate to the usage of Metrika data for the purposes of ranking. However, one of them is tagged as deprecated:

  • The number of similar visitors from the YandexBar (YaBar/Ябар).
  • The average time spent on URLs from those same similar visitors.
  • The “core audience” of pages on which there is a Metrika counter [deprecated].
  • The average time a user spends on a host when accessed externally (from another non-search site) from a specific URL.
  • Average ‘depth’ (number of hits within the host) of a user’s stay on the host when accessed externally (from another non-search site) from a particular URL.
  • Whether or not the domain has Metrika installed.

In Metrika, user data is handled differently.

Unlike Google Analytics, there are a number of reports focused on user “loyalty” combining site engagement metrics with return frequency, duration between visits, and source of the visit.

For example, with one click I can see a report breaking down individual site visitors:

Screenshot from Metrika, January 2023

Metrika also comes “out of the box” with heatmap tools and user session recording, and in recent years the Metrika team has made good progress in being able to identify and filter bot traffic.

With Google Analytics, there is an argument that Google doesn’t use UA/GA4 data for ranking purposes because of how easy it is to modify or break the tracking code – but Metrika counters are a lot more rigid, and a lot of the reports are fixed in terms of how the data is collected.

Impact Of Traffic On Rankings

Following on from Metrika data as a ranking factor, these factors effectively confirm that direct traffic and paid traffic (buying ads via Yandex Direct) can impact organic search performance:

  • Share of direct visits among all incoming traffic.
  • Green traffic share (aka direct visits) – Desktop.
  • Green traffic share (aka direct visits) – Mobile.
  • Search traffic – transitions from search engines to the site.
  • Share of visits to the site not via links (i.e., typed by hand or from bookmarks).
  • The number of unique visitors.
  • Share of traffic from search engines.
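These share-based factors are straightforward proportions over a site’s sessions. A minimal sketch, with made-up session labels (Yandex’s internal schema is not public):

```python
# Illustrative calculation of the traffic-share factors listed above.
# The session labels and sample data are invented for the example.
from collections import Counter

sessions = ["direct", "organic", "paid", "direct", "organic",
            "referral", "direct", "organic", "organic", "paid"]

counts = Counter(sessions)
total = len(sessions)

direct_share = counts["direct"] / total   # "green traffic" (direct) share
search_share = counts["organic"] / total  # share of visits from search engines

print(direct_share, search_share)  # 0.3 0.4
```

The desktop/mobile split in the list would be the same calculation run over device-segmented sessions.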

News Factors

There are a number of factors relating to “News”, including two that mention Yandex.News directly.

Yandex.News was an equivalent of Google News, but was sold to the Russian social network VKontakte in August 2022, along with another Yandex product “Zen”.

So, it’s not clear if these factors relate to a product no longer owned or operated by Yandex, or to how news websites are ranked in “regular” search.

Backlink Importance

Yandex has algorithms similar to Google’s to combat link manipulation – and has had them since the Nepot filter in 2005.

From reviewing the backlink ranking factors and some of the specifics in the descriptions, we can assume that the best practices for building links for Yandex SEO would be to:

  • Build links with a more natural frequency and varying amounts.
  • Build links with branded anchor texts as well as commercial keywords.
  • If buying links, avoid buying links from websites that have mixed topics.

Below is a list of link-related factors that can be considered affirmations of best practices:

  • The age of the backlink is a factor.
  • Link relevance based on topics.
  • Backlinks built from homepages carry more weight than internal pages.
  • Links from the top 100 websites by PageRank (PR) can impact rankings.
  • Link relevance based on the quality of each link.
  • Link relevance, taking into account the quality of each link, and the topic of each link.
  • Link relevance, taking into account the non-commercial nature of each link.
  • Percentage of inbound links with query words.
  • Percentage of query words in links (up to a synonym).
  • The links contain all the words of the query (up to a synonym).
  • Dispersion of the number of query words in links.

However, there are some link-related factors that are additional considerations when planning, monitoring, and analyzing backlinks:

  • The ratio of “good” versus “bad” backlinks to a website.
  • The frequency of links to the site.
  • The number of incoming SEO trash links between hosts.
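The first of these considerations is, at its simplest, a ratio over a classified backlink profile. The sketch below is purely illustrative; the boolean classification flag is invented, and Yandex’s actual link spam calculator reportedly weighs around 80 factors:

```python
# Hypothetical sketch of a good-vs-bad backlink ratio, one of the
# considerations above. The "is_good" flag is an invented stand-in for
# whatever classification Yandex's link spam calculator produces.

def link_quality_ratio(links: list[dict]) -> float:
    """Ratio of 'good' to 'bad' links in a backlink profile."""
    good = sum(1 for link in links if link["is_good"])
    bad = len(links) - good
    return good / bad if bad else float("inf")

# Example profile: 8 good links, 2 bad links
profile = [{"is_good": True}] * 8 + [{"is_good": False}] * 2
print(link_quality_ratio(profile))  # 4.0
```

Link frequency, the second consideration, would instead be measured over time windows (links acquired per week or month), which is also how a short-burst negative SEO attack would surface.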

The data leak also revealed that the link spam calculator has around 80 active factors that are taken into consideration, with a number of deprecated factors.

This raises the question of how well Yandex is able to recognize negative SEO attacks, given that it looks at the ratio of good versus bad links, and how it determines what a bad link is.

A negative SEO attack is also likely to be a short burst (high frequency) link event in which a site will unwittingly gain a high number of poor quality, non-topical, and potentially over-optimized links.

Yandex uses machine learning models to identify Private Blog Networks (PBNs) and paid links, and it draws similar inferences from link velocity and the time period over which links are acquired.

Typically, paid-for links are generated over a longer period of time, and these patterns (including link origin site analysis) are what the Minusinsk update (2015) was introduced to combat.

Yandex Penalties

There are two ranking factors, both deprecated, named SpamKarma and Pessimization.

Pessimization refers to reducing PageRank to zero and aligns with the expectations of severe Yandex penalties.

SpamKarma also aligns with assumptions made around Yandex penalizing hosts and individuals, as well as individual domains.

Onpage Advertising

There are a number of factors relating to advertising on the page, some of them deprecated (like the screenshot example below).

Screenshot from author, January 2023

It’s not known from the description exactly what the thought process behind this factor was, but it could be assumed that a high ratio of adverts to visible screen was a negative factor – much like how Google takes umbrage when adverts obscure the page’s main content or are obtrusive.

Tying this back to known Yandex mechanisms, the Proxima update also took into consideration the ratio of useful and advertising content on a page.

Can We Apply Any Yandex Learnings To Google?

Yandex and Google are disparate search engines with a number of differences, despite the dozens of engineers who have worked for both companies.

Because of this movement of talent, we can infer that some of these engineers will have built things in a similar fashion (though not direct copies) and applied learnings from previous iterations of their builds with their new employers.

What Russian SEO Pros Are Saying About The Leak

Much like the Western world, SEO professionals in Russia have been having their say on the leak across the various Runet forums.

The reaction in these forums has been different from that on SEO Twitter and Mastodon, with more of a focus on Yandex’s filters and on other Yandex products that are optimized as part of wider Yandex optimization campaigns.

It is also worth noting that a number of conclusions and findings from the data match what the Western SEO world is also finding.

Common themes in the Russian search forums:

  • Webmasters asking for insights into recent filters, such as Mimicry and the updated PF filter.
  • The age and relevance of some of the factors, due to author names no longer being at Yandex, and mentions of long-retired Yandex products.
  • The main interesting learnings are around the use of Metrika data, and information relating to the Crawler & Indexer.
  • A number of factors outline the usage of DSSM, which in theory was superseded by Palekh, a machine-learning search algorithm that Yandex announced in 2016.
  • A debate around ICS scoring in Yandex, and whether or not Yandex may provide more traffic to a site and influence its own factors by doing so.

The leaked factors, particularly around how Yandex evaluates site quality, have also come under scrutiny.

There is a long-standing sentiment in the Russian SEO community that Yandex oftentimes favors its own products and services in search results ahead of other websites, and webmasters are asking questions like:

Why does it bother going to all this trouble, when it just nails its services to the top of the page anyway?

In loosely translated documents, these are referred to as the Sorcerers or Yandex Sorcerers. In Google, we’d call these search engine results pages (SERPs) features – like Google Hotels, etc.

In October 2022, Kassir (a Russian ticket portal) claimed ₽328m compensation from Yandex due to lost revenue, caused by the “discriminatory conditions” in which Yandex Sorcerers took the customer base away from the private company.

This is off the back of a 2020 class action in which multiple companies raised a case with the Federal Antimonopoly Service (FAS) for anticompetitive promotion of its own services.

Featured Image: FGC/Shutterstock


