SEO

How To Do A Sitemap Audit For Better Indexing & Crawling Via Python

Sitemap auditing involves syntax, crawlability, and indexation checks for the URLs and tags in your sitemap files.

A sitemap file contains the URLs to index with further information regarding the last modification date, priority of the URL, images, videos on the URL, and other language alternates of the URL, along with the change frequency.

A sitemap index file can reference millions of URLs, even though a single sitemap can contain at most 50,000 URLs.

Auditing these URLs for better indexation and crawling might take time.

But with the help of Python and SEO automation, it is possible to audit millions of URLs within the sitemaps.

What Do You Need To Perform A Sitemap Audit With Python?

To understand the Python Sitemap Audit process, you’ll need:

  • A fundamental understanding of technical SEO and sitemap XML files.
  • Working knowledge of Python and sitemap XML syntax.
  • The ability to work with the Python libraries Pandas, Advertools, LXML, and Requests, and with XPath selectors.

Which URLs Should Be In The Sitemap?

A healthy XML sitemap file should meet the following criteria:

  • All URLs should have a 200 Status Code.
  • All URLs should be self-canonical.
  • URLs should be open to being indexed and crawled.
  • URLs shouldn’t be duplicated.
  • URLs shouldn’t be soft 404s.
  • The sitemap should have a proper XML syntax.
  • The URLs in the sitemap should have an aligning canonical with Open Graph and Twitter Card URLs.
  • The sitemap should contain no more than 50,000 URLs and be no larger than 50 MB.
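
The two size limits in the last criterion are easy to check mechanically. Below is a minimal sketch, assuming you already know the URL count and the uncompressed file size in bytes; the function name and the wording of the messages are illustrative and not part of any library.

```python
# Per-file limits from the sitemap protocol: 50,000 URLs and 50 MB (uncompressed).
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024


def sitemap_limit_issues(url_count, size_bytes):
    """Return a list of problems; an empty list means the file is within limits."""
    issues = []
    if url_count > MAX_URLS:
        issues.append(f"too many URLs: {url_count} > {MAX_URLS}")
    if size_bytes > MAX_BYTES:
        issues.append(f"file too large: {size_bytes} bytes > {MAX_BYTES}")
    return issues


# Hypothetical figures for two sitemap files.
print(sitemap_limit_issues(url_count=48_000, size_bytes=10_000_000))
print(sitemap_limit_issues(url_count=61_250, size_bytes=60 * 1024 * 1024))
```

A sitemap that fails either check should be split into several smaller files listed in a sitemap index.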

What Are The Benefits Of A Healthy XML Sitemap File?

Smaller sitemaps are better than larger sitemaps for faster indexation. This is particularly important in News SEO, as smaller sitemaps help increase the overall valid indexed URL count.

Differentiate frequently updated and static content URLs from each other to provide a better crawling distribution among the URLs.

Using the “lastmod” date in an honest way that aligns with the actual publication or update date helps a search engine to trust the date of the latest publication.

While performing the Sitemap Audit for better indexing, crawling, and search engine communication with Python, the criteria above are followed.

An Important Note…

When it comes to a sitemap’s nature and audit, Google and Microsoft Bing don’t use “changefreq” to learn how often a URL changes, or “priority” to understand the prominence of a URL. In fact, they call it a “bag of noise.”

However, Yandex and Baidu use all these tags to understand the website’s characteristics.

A 16-Step Sitemap Audit For SEO With Python

A sitemap audit can involve content categorization, site-tree, or topicality and content characteristics.

However, a sitemap audit for better indexing and crawlability mainly involves technical SEO rather than content characteristics.

In this step-by-step sitemap audit process, we’ll use Python to tackle the technical aspects of sitemap auditing millions of URLs.

Image created by the author, February 2022

1. Import The Python Libraries For Your Sitemap Audit

The following code block is to import the necessary Python Libraries for the Sitemap XML File audit.

import advertools as adv
import pandas as pd
from lxml import etree
# Optional: widen the Jupyter Notebook output cells to 100%.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Here’s what you need to know about this code block:

  • Advertools is necessary for taking the URLs from the sitemap file and for requesting their content or response status codes.
  • “Pandas” is necessary for aggregating and manipulating the data.
  • Plotly is useful for visualizing the sitemap audit output, although the code block above does not import it.
  • LXML is necessary for the syntax audit of the sitemap XML file.
  • IPython is optional to expand the output cells of Jupyter Notebook to 100% width.

2. Take All Of The URLs From The Sitemap

Millions of URLs can be taken into a Pandas data frame with Advertools, as shown below.

sitemap_url = "https://www.complaintsboard.com/sitemap.xml"
# Download the sitemap (or sitemap index) and flatten it into a data frame.
sitemap = adv.sitemap_to_df(sitemap_url)
# Save a copy so that later steps don't need to download the sitemap again.
sitemap.to_csv("sitemap.csv", index=False)
sitemap_df = pd.read_csv("sitemap.csv")
sitemap_df

Above, the Complaintsboard.com sitemap has been taken into a Pandas data frame, and you can see the output below.

Image: A general sitemap URL extraction with sitemap tags in Python.

In total, we have 245,691 URLs in the sitemap index file of Complaintsboard.com.

The website uses “changefreq,” “lastmod,” and “priority” inconsistently.

3. Check Tag Usage Within The Sitemap XML File

To understand which tags are used or not within the Sitemap XML file, use the function below.

def check_sitemap_tag_usage(sitemap):
    # isna() marks the URLs that lack a tag, so the "True" row counts missing tags.
    lastmod = sitemap["lastmod"].isna().value_counts()
    priority = sitemap["priority"].isna().value_counts()
    changefreq = sitemap["changefreq"].isna().value_counts()
    lastmod_perc = sitemap["lastmod"].isna().value_counts(normalize=True) * 100
    priority_perc = sitemap["priority"].isna().value_counts(normalize=True) * 100
    changefreq_perc = sitemap["changefreq"].isna().value_counts(normalize=True) * 100
    sitemap_tag_usage_df = pd.DataFrame(data={"lastmod": lastmod,
                                              "priority": priority,
                                              "changefreq": changefreq,
                                              "lastmod_perc": lastmod_perc,
                                              "priority_perc": priority_perc,
                                              "changefreq_perc": changefreq_perc})
    # A tag that is never (or always) missing leaves a NaN cell; fill it before casting.
    return sitemap_tag_usage_df.fillna(0).astype(int)

The function check_sitemap_tag_usage builds a data frame that summarizes the usage of the sitemap tags.

It applies the “isna()” and “value_counts()” methods to the “lastmod,” “priority,” and “changefreq” columns and combines the results via “pd.DataFrame”.

Below, you can see the output.

Image: Sitemap audit with Python for the sitemap tags’ usage.

The data frame above shows that 96,840 of the URLs do not have the “lastmod” tag, which is equal to 39% of the total URL count of the sitemap file.

The corresponding share is 19% for “priority” and “changefreq” within the sitemap XML file.

A website sends three main content freshness signals.

These are the dates on the web page (visible to the user), the dates in the structured data (invisible to the user), and “lastmod” in the sitemap.

If these dates are not consistent with each other, search engines can ignore the dates on the website as freshness signals.
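
As a rough illustration of that consistency check, the sketch below compares the three dates for a single URL. It assumes each date has already been extracted as an ISO 8601 string; the function names are hypothetical, and it compares calendar dates only, without converting time zones.

```python
from datetime import datetime


def to_date(value):
    """Parse an ISO 8601 date or datetime string and keep only the calendar date."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")).date()


def freshness_dates_agree(visible, structured, lastmod):
    """True when the visible, structured data, and sitemap dates fall on the same day."""
    return len({to_date(visible), to_date(structured), to_date(lastmod)}) == 1


# Hypothetical values for one URL.
print(freshness_dates_agree("2022-02-10", "2022-02-10T08:30:00+00:00", "2022-02-10T09:00:00Z"))
print(freshness_dates_agree("2022-02-10", "2022-02-10", "2021-12-01"))
```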

4. Audit The Site-tree And URL Structure Of The Website

Knowing which URL paths are the most important or the most crowded helps you decide where to focus the website’s SEO efforts and technical SEO audits.

A single technical SEO improvement can benefit thousands of URLs at once, which makes for a cost-effective SEO strategy.

Understanding the URL structure mainly means identifying the website’s more prominent sections and analyzing its content network.

To create a URL Tree Dataframe from a website’s URLs from the sitemap, use the following code block.

sitemap_url_df = adv.url_to_df(sitemap_df["loc"])
sitemap_url_df

With the help of “urllib” or the “advertools” as above, you can easily parse the URLs within the sitemap into a data frame.

Image: Creating a URL tree with URLLib or Advertools is easy. Checking the URL breakdowns helps to understand the overall information tree of a website.

The data frame above contains the “scheme,” “netloc,” “path,” and every “/” breakdown within the URLs as a “dir” which represents the directory.
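
For a single URL, the same kind of breakdown can be sketched with the standard library’s “urllib”. This is only an illustration of the idea, not how Advertools builds its data frame; the function name and the “dir_N” key names are this sketch’s own choices.

```python
from urllib.parse import urlsplit


def url_to_parts(url):
    """Split a URL into scheme, netloc, path, and one entry per directory level."""
    parts = urlsplit(url)
    row = {"scheme": parts.scheme, "netloc": parts.netloc, "path": parts.path}
    # Every non-empty segment between "/" characters becomes its own "dir" entry.
    directories = [segment for segment in parts.path.split("/") if segment]
    for position, segment in enumerate(directories, start=1):
        row[f"dir_{position}"] = segment
    return row


print(url_to_parts("https://www.example.com/reviews/acme-corp/page-2"))
```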

Auditing the URL structure of the website serves two objectives.

These are checking whether all URLs use “HTTPS” and understanding the content network of the website.

Content analysis with sitemap files is not directly part of “indexing and crawling,” so we will only touch on it briefly later in the article.

Check the next section to see the HTTPS usage of the sitemap URLs.

5. Check The HTTPS Usage On The URLs Within Sitemap

Use the following code block to check the HTTPS usage ratio for the URLs within the sitemap.

sitemap_url_df["scheme"].value_counts().to_frame()

The code block above counts the values in the “scheme” column, which contains the URLs’ protocol (HTTP or HTTPS).

Using “value_counts”, we see that all URLs are on HTTPS.

Image: Checking the HTTP URLs from the sitemaps can help to find bigger URL property consistency errors.

6. Check The Robots.txt Disallow Commands For Crawlability

The structure of the URLs within the sitemap helps to see whether there are any “submitted but disallowed” URLs.

To see whether there is a robots.txt file of the website, use the code block below.

import requests
r = requests.get("https://www.complaintsboard.com/robots.txt")
r.status_code
OUTPUT>>>
200

Simply, we send a “get request” to the robots.txt URL.

If the response status code is 200, it means there is a robots.txt file for the user-agent-based crawling control.

After checking the “robots.txt” existence, we can use the “adv.robotstxt_test” method for bulk robots.txt audit for crawlability of the URLs in the sitemap.

sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.complaintsboard.com/robots.txt", urls=sitemap_df["loc"], user_agents=["*"])
sitemap_df_robotstxt_check["can_fetch"].value_counts()

We have created a new variable called “sitemap_df_robotstxt_check”, and assigned the output of the “robotstxt_test” method.

We have used the URLs within the sitemap with the “sitemap_df[“loc”]”.

We have run the audit against the wildcard user-agent group via the “user_agents=["*"]” parameter and value pair.

You can see the result below.

True     245690
False         1
Name: can_fetch, dtype: int64

It shows that there is one URL that is disallowed but submitted.

We can filter the specific URL as below.

pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]

We have used “set_option” to expand all of the values within the “url_path” section.

Image: A URL appears as disallowed but submitted via a sitemap, as in the Google Search Console Coverage Reports.
We see that a “profile” page has been disallowed and submitted.
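
To make the “submitted but disallowed” logic concrete, here is a small sketch that uses the standard library’s “urllib.robotparser” instead of Advertools. The robots.txt rules and the URLs are made up for the example, and the standard library parser does not implement every matching rule that search engines apply (wildcards, for instance), so treat it as an approximation.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks profile pages for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs taken from a sitemap.
sitemap_urls = [
    "https://www.example.com/reviews/acme-corp",
    "https://www.example.com/profile/some-user",
]

# URLs that are submitted in the sitemap but disallowed for the "*" user-agent.
submitted_but_disallowed = [url for url in sitemap_urls if not parser.can_fetch("*", url)]
print(submitted_but_disallowed)
```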

Later, the same check can be extended to further examinations, such as “disallowed but internally linked”.

But to do that, we need to crawl at least 3 million URLs from ComplaintsBoard.com, which could be an entirely new guide.

Some websites’ URLs do not have a proper “directory hierarchy”, which makes it harder to analyze the URLs in terms of content network characteristics.

ComplaintsBoard.com doesn’t use a proper URL structure and taxonomy, so analyzing the website structure is not easy for an SEO or a search engine.

But the most used words within the URLs, or the content update frequency, can signal which topics the company actually focuses on.

Since we focus on “technical aspects” in this tutorial, you can read the Sitemap Content Audit here.

7. Check The Status Code Of The Sitemap URLs With Python

Every URL within the sitemap has to have a 200 Status Code.

A crawl has to be performed to check the status codes of the URLs within the sitemap.

But, since it’s costly when you have millions of URLs to audit, we can simply use a new crawling method from Advertools.

Without taking the response body, we can crawl just the response headers of the URLs within the sitemap.

It is useful to decrease the crawl time for auditing possible robots, indexing, and canonical signals from the response headers.

To perform a response header crawl, use the “adv.crawl_headers” method.

adv.crawl_headers(sitemap_df["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()

The output of the status code check for the URLs within the sitemap XML files can be seen below.

200    207866
404        23
Name: status, dtype: int64

It shows that 23 URLs from the sitemap actually return a 404.

They should be removed from the sitemap.

To audit which URLs from the sitemap are 404, use the filtration method below from Pandas.

df_headers[df_headers["status"] == 404]

The result can be seen below.

Image: Finding the 404 URLs from sitemaps is helpful against link rot.

8. Check The Canonicalization From Response Headers

From time to time, using canonicalization hints on the response headers is beneficial for crawling and indexing signal consolidation.

In this context, the canonical tag on the HTML and the response header has to be the same.

If there are two different canonicalization signals on a web page, the search engines can ignore both assignments.

For ComplaintsBoard.com, we don’t have a canonical response header.

  • The first step is auditing whether the response header for canonical usage exists.
  • The second step is comparing the response header canonical value to the HTML canonical value if it exists.
  • The third step is checking whether the canonical values are self-referential.

Check the columns of the output of the header crawl to check the Canonicalization from Response Headers.

df_headers.columns

Below, you can see the columns.

Image: Python SEO crawl output data frame columns. The “dataframe.columns” attribute is always useful to check.

If you are not familiar with the response headers, you may not know how to use canonical hints within response headers.

A response header can include the canonical hint with the “Link” value.

Advertools registers it directly as the “resp_headers_link” column.

Another problem is that the values appear within the “<URL>;” string pattern.

This means we will use regex to extract the URL.
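
To show what that extraction involves, the sketch below pulls the canonical target out of a raw “Link” header value with the standard library’s “re” module. It is a simplified illustration with made-up header values: it assumes entries are separated by commas and that no parameter value contains a comma. The function name is hypothetical.

```python
import re

# One "<target>; param=value; ..." entry of a Link header.
LINK_ENTRY = re.compile(r"<([^<>]+)>\s*;([^,]*)")


def canonical_from_link_header(header):
    """Return the target of the rel="canonical" entry, or None if there is none."""
    for target, params in LINK_ENTRY.findall(header):
        if re.search(r'rel\s*=\s*"?canonical"?', params, flags=re.IGNORECASE):
            return target
    return None


print(canonical_from_link_header('<https://www.example.com/>; rel="canonical"'))
print(canonical_from_link_header('<https://www.example.com/feed>; rel="alternate"'))
```

Checking the “rel” parameter matters because a “Link” header can also carry non-canonical relations such as “alternate” or “preload”.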

df_headers["resp_headers_link"]

You can see the result below.

Image: Sitemap URL response header. Screenshot from Pandas, February 2022.

The regex pattern “[^<>][a-z:/0-9-.]*” is good enough to extract the specific canonical value.

A self-canonicalization check with the response headers is below.

df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()

To list the URLs with canonicalization inconsistencies, we combine two boolean checks.

One checks whether the response header canonical hint is equal to the URL itself.

The other checks whether the status code is 200.

Since we have 404 URLs within the sitemap, their canonical value will be “NaN”, so the status check keeps them out of the result.

Image: Non-canonical URLs in the sitemap audit with Python. It shows there are specific URLs with canonicalization inconsistencies.
We have 29 outliers for technical SEO. Every wrong signal given to the search engine for indexation or ranking dilutes the ranking signals.

To see these URLs, use the code block below.

df_headers[(df_headers["response_header_canonical"] != df_headers["url"]) & (df_headers["status"] == 200)]

Image: The canonical values from the response headers. Screenshot from Pandas, February 2022.

Even a single “/” in the URL can cause canonicalization conflict as appears here for the homepage.

Image: ComplaintsBoard.com screenshot for checking the response header canonical value and the actual URL of the web page.
You can check the canonical conflict here.
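
When triaging such mismatches, it can help to separate the URLs that differ only by a trailing “/” from the ones that point somewhere else entirely. The helper names below are hypothetical and the URLs are examples; the first function is the strict comparison used in the audit, and the second only labels the trailing slash case.

```python
def is_self_canonical(url, canonical):
    """Strict, character-for-character comparison."""
    return url == canonical


def differs_only_by_trailing_slash(url, canonical):
    """True when the two values mismatch, but only because of trailing "/" characters."""
    return url != canonical and url.rstrip("/") == canonical.rstrip("/")


print(is_self_canonical("https://www.example.com", "https://www.example.com/"))
print(differs_only_by_trailing_slash("https://www.example.com", "https://www.example.com/"))
print(differs_only_by_trailing_slash("https://www.example.com/a", "https://www.example.com/b"))
```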

If you check log files, you will see that the search engine crawls the URLs from the “Link” response headers.

Thus, in technical SEO, this signal should be given weight.

9. Check The Indexing And Crawling Commands From Response Headers

There are 14 different X-Robots-Tag specifications for the Google search engine crawler.

The latest one is “indexifembedded”, which lets a page that carries “noindex” still be indexed when its content is embedded in another page.

The Indexing and Crawling directives can be in the form of a response header or the HTML meta tag.

This section focuses on the response header version of indexing and crawling directives.

  • The first step is checking whether the X-Robots-Tag property and values exist within the HTTP Header or not.
  • The second step is auditing whether it aligns itself with the HTML Meta Tag properties and values if they exist.

Use the code below to check the “X-Robots-Tag” from the response headers.

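def robots_tag_checker(dataframe: pd.DataFrame):
    # Return the first column whose name contains "robots", if there is one.
    for column in dataframe.columns:
        if "robots" in column.lower():
            return column
    # Only report a missing tag after every column has been checked.
    return "There is no robots tag"
robots_tag_checker(df_headers)
OUTPUT>>>
'There is no robots tag'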

We have created a custom function to check for “X-Robots-Tag” columns in the response header crawl output.

It appears that our test subject website doesn’t use the X-Robots-Tag.

If there were an X-Robots-Tag, the code block below should be used.

df_headers["response_header_x_robots_tag"].value_counts()
df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]

Check whether there is a “noindex” directive from the response headers, and filter the URLs with this indexation conflict.

In the Google Search Console Coverage Report, those appear as “Submitted marked as noindex”.
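
An exact match on “noindex” misses header values such as “noindex, nofollow” or “googlebot: noindex”. The sketch below parses a raw X-Robots-Tag value into directive names with plain Python. It is a simplified, hypothetical helper: it assumes directives are separated by commas, so a date value that contains a comma would confuse it.

```python
# Directives that carry their own value after a colon.
VALUE_DIRECTIVES = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def x_robots_directives(header_value):
    """Return the set of directive names found in one X-Robots-Tag header value."""
    directives = set()
    for part in header_value.lower().split(","):
        part = part.strip()
        name, _, rest = part.partition(":")
        if rest and name.strip() not in VALUE_DIRECTIVES:
            # "googlebot: noindex": the prefix is a crawler name, so keep only the rule.
            part = rest.strip()
            name = part.partition(":")[0]
        if part:
            directives.add(name.strip())
    return directives


def blocks_indexing(header_value):
    """"none" is shorthand for "noindex, nofollow", so both values block indexing."""
    return bool({"noindex", "none"} & x_robots_directives(header_value))


print(blocks_indexing("noindex, nofollow"))
print(blocks_indexing("googlebot: noindex"))
print(blocks_indexing("max-snippet:50, nofollow"))
```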

Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of the signals and trust the user-declared signals less.

10. Check The Self Canonicalization Of Sitemap URLs

Every URL in the sitemap XML files should give a self-canonicalization hint.

Sitemaps should only include the canonical versions of the URLs.

The Python code block in this section is to understand whether the sitemap URLs have self-canonicalization values or not.

To check the canonicalization from the HTML Documents’ “<head>” section, crawl the websites by taking their response body.

Use the code block below.

user_agent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The difference between “crawl_headers” and the “crawl” is that “crawl” takes the entire response body while the “crawl_headers” is only for response headers.

adv.crawl(sitemap_df["loc"],
          output_file="sitemap_crawl_complaintsboard.jl",
          follow_links=False,
          custom_settings={"LOG_FILE": "sitemap_crawl_complaintsboard.log",
                           "USER_AGENT": user_agent})

You can compare the output file sizes of the response header crawl and the entire response body crawl.

Image: Python crawl output size comparison.

Going from a 6 GB output to a 387 MB output is quite economical.

If a search engine just wants to see certain response headers and the status code, providing that information in the headers would make its crawl hits more economical.

How To Deal With Large DataFrames For Reading And Aggregating Data?

This section requires dealing with the large data frames.

A computer can’t read a Pandas DataFrame from a CSV or JL file if the file size is larger than the computer’s RAM.

Thus, the “chunking” method is used.
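
The idea behind chunking can be sketched with the standard library alone: read a fixed number of lines at a time so that only one chunk is held in memory. The function name is hypothetical, and the small temporary file below only stands in for a real crawl output.

```python
import json
import os
import tempfile
from itertools import islice


def read_jsonlines_in_chunks(path, chunk_size):
    """Yield lists of parsed rows, reading at most chunk_size lines at a time."""
    with open(path, encoding="utf-8") as handle:
        while True:
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            yield [json.loads(line) for line in lines if line.strip()]


# Build a tiny stand-in for a ".jl" crawl output file.
with tempfile.NamedTemporaryFile("w", suffix=".jl", delete=False, encoding="utf-8") as tmp:
    for number in range(5):
        tmp.write(json.dumps({"url": f"https://www.example.com/{number}"}) + "\n")

chunk_sizes = [len(chunk) for chunk in read_jsonlines_in_chunks(tmp.name, chunk_size=2)]
os.remove(tmp.name)
print(chunk_sizes)
```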

When a website’s sitemap XML file contains millions of URLs, the total crawl output can reach tens of gigabytes.

An iteration across sitemap crawl output data frame rows is necessary.

For chunking, use the code block below.

df_iterator = pd.read_json(
    "sitemap_crawl_complaintsboard.jl",
    chunksize=10000,
    lines=True)
for i, df_chunk in enumerate(df_iterator):
    output_df = pd.DataFrame(data={
        "url": df_chunk["url"],
        "canonical": df_chunk["canonical"],
        "self_canonicalised": df_chunk["url"] == df_chunk["canonical"]})
    # Write the header only with the first chunk, then append the rest.
    mode = "w" if i == 0 else "a"
    header = i == 0
    output_df.to_csv("canonical_check.csv", index=False, header=header, mode=mode)

# Load the much smaller summary file and keep the URLs that declare a different canonical.
df = pd.read_csv("canonical_check.csv")
df[(df["self_canonicalised"] == False) & (df["canonical"].notna())]

You can see the result below.

Image: Python SEO canonicalization audit.

We see that the paginated URLs from the “book” subfolder give canonical hints to the first page, which is not a correct practice according to the Google guidelines.

11. Check The Sitemap Sizes Within Sitemap Index Files

Every sitemap file should be no larger than 50 MB. Use the Python code block below to check the sitemap file sizes. It relies on the “sitemap_size_mb” column that Advertools adds to the sitemap data frame.

pd.pivot_table(sitemap_df, index="sitemap", values="sitemap_size_mb", aggfunc="max").sort_values(by="sitemap_size_mb", ascending=False)

You can see the result below.

Image: Python SEO sitemap size audit.

We see that all sitemap XML files are under 50MB.

For better and faster indexation, keeping the sitemap URLs valuable and unique while decreasing the size of the sitemap files is beneficial.

12. Check The URL Count Per Sitemap With Python

Every sitemap should contain no more than 50,000 URLs.

Use the Python code block below to check the URL Counts within the sitemap XML files.

(pd.pivot_table(sitemap_df,
                values=["loc"],
                index="sitemap",
                aggfunc="count")
 .sort_values(by="loc", ascending=False))

You can see the result below.

Image: Python SEO sitemap URL count audit.
All sitemaps have fewer than 50,000 URLs. Some sitemaps have only one URL, which wastes the search engine’s attention.

Keeping frequently updated URLs in sitemaps separate from the static and stale content URLs is beneficial.

Differences in URL count and content character between sitemaps help a search engine adjust crawl demand effectively for different website sections.

13. Check The Indexing And Crawling Meta Tags From URLs’ Content With Python

Even if a web page is not disallowed in robots.txt, it can still be kept out of the index, or have its links ignored, via HTML meta tags.

Thus, checking the HTML Meta Tags for better indexation and crawling is necessary.

Using the “custom selectors” is necessary to perform the HTML Meta Tag audit for the sitemap URLs.

sitemap = adv.sitemap_to_df("https://www.holisticseo.digital/sitemap.xml")
adv.crawl(url_list=sitemap["loc"][:1000],
          output_file="meta_command_audit.jl",
          follow_links=False,
          # Single quotes inside the XPath keep the Python string valid.
          xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
          custom_settings={"CLOSESPIDER_PAGECOUNT": 1000})
df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)
df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()

The “//meta[@name='robots']/@content” XPath selector extracts all the robots commands from the sitemap URLs.

We have used only the first 1,000 URLs in the sitemap.

And the crawl stops after the initial 1,000 responses.

I have used another website to check the crawling meta tags since ComplaintsBoard.com doesn’t have them in its source code.

You can see the result below.

Image: Python SEO meta robots audit.
None of the URLs from the sitemap have “nofollow” or “noindex” within the “robots” commands.

To check their values, use the code below.

df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]

You can see the result below.

Image: Meta tag audit from the websites.
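
For a single HTML document, the extraction that the XPath selector performs can be approximated with the standard library’s “html.parser”. This is only a sketch to show what is being collected; the class and function names are hypothetical and the HTML is a made-up example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.contents.append(attributes.get("content") or "")


def meta_robots(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.contents


sample_html = """
<html><head>
<meta name="description" content="An example page">
<meta name="robots" content="noindex, follow">
</head><body></body></html>
"""
print(meta_robots(sample_html))
```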

14. Validate The Sitemap XML File Syntax With Python

Sitemap XML file syntax validation is necessary to make sure that search engines can parse the sitemap file as intended.

Even if there are certain syntax errors, a search engine can still recognize the sitemap file during XML normalization.

But every syntax error can decrease the efficiency to some degree.

Use the code block below to validate the Sitemap XML File Syntax.

def validate_sitemap_syntax(xml_path: str, xsd_path: str):
    # Load the XSD schema, then validate the sitemap file against it.
    xmlschema_doc = etree.parse(xsd_path)
    xmlschema = etree.XMLSchema(xmlschema_doc)
    xml_doc = etree.parse(xml_path)
    result = xmlschema.validate(xml_doc)
    return result
validate_sitemap_syntax("sej_sitemap.xml", "sitemap.xsd")

For this example, I have used “https://www.searchenginejournal.com/sitemap_index.xml”. The XSD file involves the XML file’s context and tree structure.

It is stated in the first line of the Sitemap file as below.

For further information, you can also check DTD documentation.
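
Before running a full XSD validation, a lighter pre-check can catch files that are not even well-formed XML or that have an unexpected root element. The sketch below uses only the standard library’s “xml.etree.ElementTree”; it is not a replacement for schema validation, and the function name and sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
EXPECTED_ROOTS = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}


def quick_sitemap_check(xml_text):
    """Return "ok", or a short description of the first problem found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return f"not well-formed: {error}"
    if root.tag not in EXPECTED_ROOTS:
        return f"unexpected root element: {root.tag}"
    return "ok"


good = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/</loc></url></urlset>"
)
# The closing </url> tag is missing here.
broken = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/</loc></urlset>"
)
print(quick_sitemap_check(good))
print(quick_sitemap_check(broken))
```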

15. Check The Open Graph URL And Canonical URL Matching

It is not a secret that search engines also use the Open Graph and RSS Feed URLs from the source code for further canonicalization and exploration.

The Open Graph URLs should be the same as the canonical URL submission.

From time to time, even in Google Discover, Google chooses to use the image from the Open Graph.

To check the Open Graph URL and Canonical URL consistency, use the code block below.

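# Re-create the iterator, since the one from the canonical check is already exhausted.
df_iterator = pd.read_json("sitemap_crawl_complaintsboard.jl", chunksize=10000, lines=True)
for i, df_chunk in enumerate(df_iterator):
    if "og:url" in df_chunk.columns:
        output_df = pd.DataFrame(data={
            "canonical": df_chunk["canonical"],
            "og:url": df_chunk["og:url"],
            "open_graph_canonical_consistency": df_chunk["canonical"] == df_chunk["og:url"]})
        # Write the header only with the first chunk, then append the rest.
        mode = "w" if i == 0 else "a"
        header = i == 0
        output_df.to_csv("open_graph_canonical_consistency.csv", index=False, header=header, mode=mode)
    else:
        print("There is no Open Graph URL Property")
OUTPUT>>>
There is no Open Graph URL Property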

If there is an Open Graph URL Property on the website, it will give a CSV file to check whether the canonical URL and the Open Graph URL are the same or not.

But for this website, we don’t have an Open Graph URL.

Thus, I have used another website for the audit.

if "og:url" in df_meta_check.columns:
    output_df = pd.DataFrame(data={
        "canonical": df_meta_check["canonical"],
        "og:url": df_meta_check["og:url"],
        "open_graph_canonical_consistency": df_meta_check["canonical"] == df_meta_check["og:url"]})
    # A single data frame is written in one go, so no chunk index is needed.
    output_df.to_csv("df_og_url_canonical_audit.csv", index=False, mode="w")
else:
    print("There is no Open Graph URL Property")
df = pd.read_csv("df_og_url_canonical_audit.csv")
df

You can see the result below.

Image: Python SEO Open Graph URL audit.

We see that all canonical URLs and the Open Graph URLs are the same.

Image: Python SEO canonicalization audit.

16. Check The Duplicate URLs Within Sitemap Submissions

A sitemap index file shouldn’t have duplicated URLs across different sitemap files or within the same sitemap XML file.

The duplication of the URLs within the sitemap files can make a search engine download the sitemap files less since a certain percentage of the sitemap file is bloated with unnecessary submissions.

For certain situations, it can appear as a spamming attempt to control the crawling schemes of the search engine crawlers.

Use the code block below to check the duplicate URLs within the sitemap submissions.

sitemap_df["loc"].duplicated().value_counts()

You can see that 49,574 URLs from the sitemap are duplicated.

Image: Python SEO duplicated URL audit from the sitemap XML files.

To see which sitemaps have more duplicated URLs, use the code block below.

pd.pivot_table(sitemap_df[sitemap_df["loc"].duplicated()==True], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)

You can see the result.

Image: Python SEO sitemap audit for duplicated URLs.

Chunking the sitemaps can help with site-tree and technical SEO analysis.

To see the duplicated URLs within the Sitemap, use the code block below.

sitemap_df[sitemap_df["loc"].duplicated() == True]

You can see the result below.

Image: Duplicated sitemap URL audit output.
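
To see not only that a URL is duplicated but also which sitemap files repeat it, the rows can be grouped by URL. The sketch below does this with the standard library on a handful of made-up (loc, sitemap) pairs; the function name is hypothetical.

```python
from collections import defaultdict


def duplicated_urls(rows):
    """Map every URL that is listed more than once to the sitemap files listing it."""
    sitemaps_by_url = defaultdict(list)
    for loc, sitemap in rows:
        sitemaps_by_url[loc].append(sitemap)
    return {loc: sitemaps for loc, sitemaps in sitemaps_by_url.items() if len(sitemaps) > 1}


# Hypothetical (loc, sitemap) pairs, as they would appear in the sitemap data frame.
rows = [
    ("https://www.example.com/a", "sitemap-1.xml"),
    ("https://www.example.com/b", "sitemap-1.xml"),
    ("https://www.example.com/a", "sitemap-2.xml"),
    ("https://www.example.com/b", "sitemap-1.xml"),
]
print(duplicated_urls(rows))
```

A URL listed twice in the same file and a URL shared by two different files both show up here, which covers both cases named at the start of this step.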

Conclusion

I wanted to show how to validate a sitemap file for better and healthier indexation and crawling for Technical SEO.

Python is widely used for data science, machine learning, and natural language processing.

But, you can also use it for Technical SEO Audits to support the other SEO Verticals with a Holistic SEO Approach.

In a future article, we can expand these Technical SEO Audits further with different details and methods.

But, in general, this is one of the most comprehensive Technical SEO guides for Sitemaps and Sitemap Audit Tutorial with Python.

Featured Image: elenasavchina2/Shutterstock




SEO

Measuring Content Impact Across The Customer Journey

Understanding the impact of your content at every touchpoint of the customer journey is essential – but that’s easier said than done. From attracting potential leads to nurturing them into loyal customers, there are many touchpoints to look into.

So how do you identify and take advantage of these opportunities for growth?

Watch this on-demand webinar and learn a comprehensive approach for measuring the value of your content initiatives, so you can optimize resource allocation for maximum impact.

You’ll learn:

  • Fresh methods for measuring your content’s impact.
  • Fascinating insights using first-touch attribution, and how it differs from the usual last-touch perspective.
  • Ways to persuade decision-makers to invest in more content by showcasing its value convincingly.

With Bill Franklin and Oliver Tani of DAC Group, we unravel the nuances of attribution modeling, emphasizing the significance of layering first-touch and last-touch attribution within your measurement strategy. 

Check out these insights to help you craft compelling content tailored to each stage, using an approach rooted in first-hand experience to ensure your content resonates.

Whether you’re a seasoned marketer or new to content measurement, this webinar promises valuable insights and actionable tactics to elevate your SEO game and optimize your content initiatives for success. 

View the slides below or check out the full webinar for all the details.

SEO

How to Find and Use Competitor Keywords

Competitor keywords are the keywords your rivals rank for in Google’s search results. They may rank organically or pay for Google Ads to rank in the paid results.

Knowing your competitors’ keywords is the easiest form of keyword research. If your competitors rank for or target particular keywords, it might be worth it for you to target them, too.

There is no way to see your competitors’ keywords without a tool like Ahrefs, which has a database of keywords and the sites that rank for them. As far as we know, Ahrefs has the biggest database of these keywords.

How to find all the keywords your competitor ranks for

  1. Go to Ahrefs’ Site Explorer
  2. Enter your competitor’s domain
  3. Go to the Organic keywords report

The report is sorted by traffic to show you the keywords sending your competitor the most visits. For example, Mailchimp gets most of its organic traffic from the keyword “mailchimp.”

Mailchimp gets most of its organic traffic from the keyword “mailchimp.”

Since you’re unlikely to rank for your competitor’s brand, you might want to exclude branded keywords from the report. You can do this by adding a Keyword > Doesn’t contain filter. In this example, we’ll filter out keywords containing “mailchimp” or any potential misspellings:

Filtering out branded keywords in Organic keywords report

If you’re a new brand competing with one that’s established, you might also want to look for popular low-difficulty keywords. You can do this by setting the Volume filter to a minimum of 500 and the KD filter to a maximum of 10.

Finding popular, low-difficulty keywords in Organic keywords
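The same two filters can be applied to an exported keyword list outside the tool. The following is a minimal sketch, not Ahrefs' own logic: the `Keyword` record, the `filter_keywords` helper, and the sample rows are all hypothetical and invented for illustration, and a real export may use different field names.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    keyword: str
    volume: int  # monthly search volume (hypothetical field name)
    kd: int      # keyword difficulty score (hypothetical field name)


def filter_keywords(rows, brand_terms, min_volume=500, max_kd=10):
    """Drop branded keywords, then keep popular, low-difficulty ones."""
    kept = []
    for row in rows:
        text = row.keyword.lower()
        # Skip anything containing the brand name or a listed misspelling.
        if any(term in text for term in brand_terms):
            continue
        if row.volume >= min_volume and row.kd <= max_kd:
            kept.append(row)
    return kept


# Illustrative rows only; the numbers are made up, not real Ahrefs data.
sample = [
    Keyword("mailchimp", 500000, 80),
    Keyword("mail chimp login", 9000, 5),
    Keyword("email marketing", 40000, 85),
    Keyword("newsletter ideas", 1200, 8),
    Keyword("welcome email examples", 300, 4),
]
brand_terms = ["mailchimp", "mail chimp"]

result = filter_keywords(sample, brand_terms)
print([k.keyword for k in result])
```

Matching brand terms as lowercase substrings mirrors the “Doesn’t contain” filter described above; misspellings still have to be listed by hand.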

How to find keywords your competitor ranks for, but you don’t

  1. Go to Competitive Analysis
  2. Enter your domain in the This target doesn’t rank for section
  3. Enter your competitor’s domain in the But these competitors do section

Competitive analysis report

Hit “Show keyword opportunities,” and you’ll see all the keywords your competitor ranks for, but you don’t.

Content gap report

You can also add a Volume and KD filter to find popular, low-difficulty keywords in this report.

Volume and KD filter in Content gap

How to find keywords multiple competitors rank for, but you don’t

  1. Go to Competitive Analysis
  2. Enter your domain in the This target doesn’t rank for section
  3. Enter the domains of multiple competitors in the But these competitors do section

Competitive analysis report with multiple competitors

You’ll see all the keywords that at least one of these competitors ranks for, but you don’t.

Content gap report with multiple competitors

You can also narrow the list down to keywords that all competitors rank for. Click on the Competitors’ positions filter and choose All 3 competitors:

Selecting all 3 competitors to see keywords all 3 competitors rank for
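Conceptually, both views of the content gap are set operations: the default view is the union of the competitors' keywords minus your own, and the "all competitors" view is their intersection minus your own. Here is a minimal sketch of that idea, assuming each site's ranking keywords are already available as plain Python sets. The `content_gap` function, the domain names, and the keywords are hypothetical and do not reflect how Ahrefs computes the report.

```python
def content_gap(own, competitors, require_all=False):
    """Return keywords competitors rank for that `own` does not.

    own: set of keywords your site ranks for.
    competitors: dict mapping a competitor name to its keywords.
    require_all: if True, keep only keywords every competitor ranks for.
    """
    keyword_sets = [set(keywords) for keywords in competitors.values()]
    if not keyword_sets:
        return set()
    if require_all:
        pool = set.intersection(*keyword_sets)  # ranked by all competitors
    else:
        pool = set.union(*keyword_sets)  # ranked by at least one competitor
    return pool - set(own)


# Illustrative data only; domains and keywords are made up.
own = {"email marketing", "newsletter ideas"}
competitors = {
    "competitor-a.example": {"email marketing", "landing page builder", "email templates"},
    "competitor-b.example": {"email templates", "newsletter ideas", "crm software"},
    "competitor-c.example": {"email templates", "landing page builder"},
}

any_gap = content_gap(own, competitors)
all_gap = content_gap(own, competitors, require_all=True)
print(sorted(any_gap))
print(sorted(all_gap))
```

The intersection is always a subset of the union, so requiring all competitors can only shorten the list.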

How to find keywords your competitor is bidding on

  1. Go to Ahrefs’ Site Explorer
  2. Enter your competitor’s domain
  3. Go to the Paid keywords report

Paid keywords report

This report shows you the keywords your competitors are targeting via Google Ads.

Since your competitor is paying for traffic from these keywords, it may indicate that they’re profitable for them—and could be for you, too.

You know what keywords your competitors are ranking for or bidding on. But what do you do with them? There are basically three options.

1. Create pages to target these keywords

You can only rank for keywords if you have content about them. So, the most straightforward thing you can do for competitors’ keywords you want to rank for is to create pages to target them.

However, before you do this, it’s worth clustering your competitor’s keywords by Parent Topic. This will group keywords that mean the same or similar things so you can target them all with one page.

Here’s how to do that:

  1. Export your competitor’s keywords, either from the Organic Keywords or Content Gap report
  2. Paste them into Keywords Explorer
  3. Click the “Clusters by Parent Topic” tab

Clustering keywords by Parent Topic

For example, Mailchimp ranks for keywords like “what is digital marketing” and “digital marketing definition.” These and many others get clustered under the Parent Topic of “digital marketing” because people searching for them are all looking for the same thing: a definition of digital marketing. You only need to create one page to potentially rank for all these keywords.

Keywords under the cluster of "digital marketing"
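If you work with the exported data instead of the Clusters tab, the grouping is a simple group-by on the Parent Topic value. This is a minimal sketch that assumes each exported row already carries a Parent Topic label assigned by the tool. The code only groups rows by that label and does not determine the topic itself, and the `cluster_by_parent_topic` helper and the row layout are hypothetical.

```python
from collections import defaultdict


def cluster_by_parent_topic(rows):
    """Group (keyword, parent_topic) pairs into {parent_topic: [keywords]}."""
    clusters = defaultdict(list)
    for keyword, parent_topic in rows:
        clusters[parent_topic].append(keyword)
    return dict(clusters)


# Keyword/topic pairs taken from the examples in this article.
rows = [
    ("what is digital marketing", "digital marketing"),
    ("digital marketing definition", "digital marketing"),
    ("press release example", "press release template"),
    ("press release format", "press release template"),
]

clusters = cluster_by_parent_topic(rows)
for topic, keywords in clusters.items():
    print(f"{topic}: {len(keywords)} keywords")
```

Each key in the result is a candidate for one page that targets every keyword in its list.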

2. Optimize existing content by filling subtopics

You don’t always need to create new content to rank for competitors’ keywords. Sometimes, you can optimize the content you already have to rank for them.

How do you know which keywords you can do this for? Try this:

  1. Export your competitor’s keywords
  2. Paste them into Keywords Explorer
  3. Click the “Clusters by Parent Topic” tab
  4. Look for Parent Topics you already have content about

For example, if we analyze our competitor, we can see that seven keywords they rank for fall under the Parent Topic of “press release template.”

Our competitor ranks for seven keywords that fall under the "press release template" cluster

If we search our site, we see that we already have a page about this topic.

Site search finds that we already have a blog post on press release templates

If we click the caret and check the keywords in the cluster, we see keywords like “press release example” and “press release format.”

Keywords under the cluster of "press release template"

To rank for the keywords in the cluster, we can probably optimize the page we already have by adding sections about the subtopics of “press release examples” and “press release format.”

3. Target these keywords with Google Ads

Paid keywords are the simplest—look through the report and see if there are any relevant keywords you might want to target, too.

For example, Mailchimp is bidding for the keyword “how to create a newsletter.”

Mailchimp is bidding for the keyword “how to create a newsletter”

If you’re ConvertKit, you may also want to target this keyword since it’s relevant.

If you decide to target the same keyword via Google Ads, you can hover over the magnifying glass to see the ads your competitor is using.

Mailchimp's Google Ad for the keyword “how to create a newsletter”

You can also see the landing page your competitor directs ad traffic to under the URL column.

The landing page Mailchimp is directing traffic to for “how to create a newsletter”


SEO

Google Confirms Links Are Not That Important


Google’s Gary Illyes confirmed at a recent search marketing conference that Google needs very few links, adding to the growing body of evidence that publishers need to focus on other factors. Gary tweeted confirmation that he did indeed say those words.

Background Of Links For Ranking

Links were discovered in the late 1990s to be a good signal for search engines to use for validating how authoritative a website is. Soon after, Google discovered that anchor text could be used to provide semantic signals about what a webpage was about.

One of the most important research papers was Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, published around 1998 (link to research paper at the end of the article). The main discovery of this research paper is that there were too many web pages and no objective way to filter search results for quality in order to rank web pages for a subjective idea of relevance.

The author of the research paper discovered that links could be used as an objective filter for authoritativeness.

Kleinberg wrote:

“To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most ‘authoritative’ or ‘definitive’ ones.”

This is the most influential research paper on links because it kick-started more research on ways to use links not only as an authority metric but also as a subjective metric for relevance.

Objective is something factual. Subjective is something that’s closer to an opinion. The founders of Google discovered how to use the subjective opinions of the Internet as a relevance metric for what to rank in the search results.

What Larry Page and Sergey Brin discovered and shared in their research paper (The Anatomy of a Large-Scale Hypertextual Web Search Engine – link at end of this article) was that it was possible to harness the power of anchor text to determine the subjective opinion of relevance from actual humans. It was essentially crowdsourcing the opinions of millions of websites, expressed through the link structure between webpages.

What Did Gary Illyes Say About Links In 2024?

At a recent search conference in Bulgaria, Google’s Gary Illyes made a comment about how Google doesn’t really need that many links and how Google has made links less important.

Patrick Stox tweeted about what he heard at the search conference:

“‘We need very few links to rank pages… Over the years we’ve made links less important.’ @methode #serpconf2024”

Google’s Gary Illyes tweeted a confirmation of that statement:

“I shouldn’t have said that… I definitely shouldn’t have said that”

Why Links Matter Less

The initial state of anchor text when Google first used links for ranking purposes was absolutely non-spammy, which is why it was so useful. Hyperlinks were primarily used as a way to send traffic from one website to another website.

But by 2004 or 2005, Google was using statistical analysis to detect manipulated links. Around 2004, “powered-by” links in website footers stopped passing anchor text value, and by 2006, links close to the word “advertising” stopped passing link value. Links from directories stopped passing ranking value, and by 2012 Google deployed a massive link algorithm called Penguin that destroyed the rankings of likely millions of websites, many of which were using guest posting.

The link signal eventually became so bad that Google decided in 2019 to selectively use nofollow links for ranking purposes. Google’s Gary Illyes confirmed that the change to nofollow was made because of the link signal.

Google Explicitly Confirms That Links Matter Less

In 2023, Google’s Gary Illyes shared at PubCon Austin that links were not even in the top 3 of ranking factors. Then in March 2024, coinciding with the March 2024 Core Algorithm Update, Google updated its spam policies documentation to downplay the importance of links for ranking purposes.

Google March 2024 Core Update: 4 Changes To Link Signal

The documentation previously said:

“Google uses links as an important factor in determining the relevancy of web pages.”

The documentation was updated to remove the word “important.”

Links are now listed as just another factor:

“Google uses links as a factor in determining the relevancy of web pages.”

At the beginning of April, Google’s John Mueller advised that there are more useful SEO activities to engage in than focusing on links.

Mueller explained:

“There are more important things for websites nowadays, and over-focusing on links will often result in you wasting your time doing things that don’t make your website better overall”

Finally, Gary Illyes explicitly said that Google needs very few links to rank webpages and confirmed it.

Why Google Doesn’t Need Links

The reason Google doesn’t need many links is likely the extent of AI and natural language understanding that Google uses in its algorithms. Google must be highly confident in its algorithm to be able to say explicitly that it doesn’t need many links.

Way back when Google implemented nofollow in its algorithm, many link builders who sold comment spam links continued to lie that comment spam still worked. As someone who started link building at the very beginning of modern SEO (I was the moderator of the link building forum at the #1 SEO forum of that time), I can say with confidence that links stopped playing much of a role in rankings several years ago, which is why I stopped building links about five or six years ago.

Read the research papers

Authoritative Sources in a Hyperlinked Environment – Jon M. Kleinberg (PDF)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Featured Image by Shutterstock/RYO Alexandre
