SEO
How To Do A Sitemap Audit For Better Indexing & Crawling Via Python
Sitemap auditing involves syntax, crawlability, and indexation checks for the URLs and tags in your sitemap files.
A sitemap file contains the URLs to index with further information regarding the last modification date, priority of the URL, images, videos on the URL, and other language alternates of the URL, along with the change frequency.
Sitemap index files can involve millions of URLs, even if a single sitemap can only involve 50,000 URLs at the top.
Auditing these URLs for better indexation and crawling might take time.
But with the help of Python and SEO automation, it is possible to audit millions of URLs within the sitemaps.
What Do You Need To Perform A Sitemap Audit With Python?
To understand the Python Sitemap Audit process, you’ll need:
- A fundamental understanding of technical SEO and sitemap XML files.
- Working knowledge of Python and sitemap XML syntax.
- The ability to work with Python Libraries, Pandas, Advertools, LXML, Requests, and XPath Selectors.
Which URLs Should Be In The Sitemap?
A healthy sitemap XML sitemap file should include the following criteria:
- All URLs should have a 200 Status Code.
- All URLs should be self-canonical.
- URLs should be open to being indexed and crawled.
- URLs shouldn’t be duplicated.
- URLs shouldn’t be soft 404s.
- The sitemap should have a proper XML syntax.
- The URLs in the sitemap should have an aligning canonical with Open Graph and Twitter Card URLs.
- The sitemap should have less than 50.000 URLs and a 50 MB size.
What Are The Benefits Of A Healthy XML Sitemap File?
Smaller sitemaps are better than larger sitemaps for faster indexation. This is particularly important in News SEO, as smaller sitemaps help for increasing the overall valid indexed URL count.
Differentiate frequently updated and static content URLs from each other to provide a better crawling distribution among the URLs.
Using the “lastmod” date in an honest way that aligns with the actual publication or update date helps a search engine to trust the date of the latest publication.
While performing the Sitemap Audit for better indexing, crawling, and search engine communication with Python, the criteria above are followed.
An Important Note…
When it comes to a sitemap’s nature and audit, Google and Microsoft Bing don’t use “changefreq” for changing frequency of the URLs and “priority” to understand the prominence of a URL. In fact, they call it a “bag of noise.”
However, Yandex and Baidu use all these tags to understand the website’s characteristics.
A 16-Step Sitemap Audit For SEO With Python
A sitemap audit can involve content categorization, site-tree, or topicality and content characteristics.
However, a sitemap audit for better indexing and crawlability mainly involves technical SEO rather than content characteristics.
In this step-by-step sitemap audit process, we’ll use Python to tackle the technical aspects of sitemap auditing millions of URLs.
1. Import The Python Libraries For Your Sitemap Audit
The following code block is to import the necessary Python Libraries for the Sitemap XML File audit.
import advertools as adv import pandas as pd from lxml import etree from IPython.core.display import display, HTML display(HTML("<style>.container { width:100% !important; }</style>"))
Here’s what you need to know about this code block:
- Advertools is necessary for taking the URLs from the sitemap file and making a request for taking their content or the response status codes.
- “Pandas” is necessary for aggregating and manipulating the data.
- Plotly is necessary for the visualization of the sitemap audit output.
- LXML is necessary for the syntax audit of the sitemap XML file.
- IPython is optional to expand the output cells of Jupyter Notebook to 100% width.
2. Take All Of The URLs From The Sitemap
Millions of URLs can be taken into a Pandas data frame with Advertools, as shown below.
sitemap_url = "https://www.complaintsboard.com/sitemap.xml" sitemap = adv.sitemap_to_df(sitemap_url) sitemap.to_csv("sitemap.csv") sitemap_df = pd.read_csv("sitemap.csv", index_col=False) sitemap_df.drop(columns=["Unnamed: 0"], inplace=True) sitemap_df
Above, the Complaintsboard.com sitemap has been taken into a Pandas data frame, and you can see the output below.
In total, we have 245,691 URLs in the sitemap index file of Complaintsboard.com.
The website uses “changefreq,” “lastmod,” and “priority” with an inconsistency.
3. Check Tag Usage Within The Sitemap XML File
To understand which tags are used or not within the Sitemap XML file, use the function below.
def check_sitemap_tag_usage(sitemap): lastmod = sitemap["lastmod"].isna().value_counts() priority = sitemap["priority"].isna().value_counts() changefreq = sitemap["changefreq"].isna().value_counts() lastmod_perc = sitemap["lastmod"].isna().value_counts(normalize = True) * 100 priority_perc = sitemap["priority"].isna().value_counts(normalize = True) * 100 changefreq_perc = sitemap["changefreq"].isna().value_counts(normalize = True) * 100 sitemap_tag_usage_df = pd.DataFrame(data={"lastmod":lastmod, "priority":priority, "changefreq":changefreq, "lastmod_perc": lastmod_perc, "priority_perc": priority_perc, "changefreq_perc": changefreq_perc}) return sitemap_tag_usage_df.astype(int)
The function check_sitemap_tag_usage is a data frame constructor based on the usage of the sitemap tags.
It takes the “lastmod,” “priority,” and “changefreq” columns by implementing “isna()” and “value_counts()” methods via “pd.DataFrame”.
Below, you can see the output.
The data frame above shows that 96,840 of the URLs do not have the Lastmod tag, which is equal to 39% of the total URL count of the sitemap file.
The same usage percentage is 19% for the “priority” and the “changefreq” within the sitemap XML file.
There are three main content freshness signals from a website.
These are dates from a web page (visible to the user), structured data (invisible to the user), “lastmod” in the sitemap.
If these dates are not consistent with each other, search engines can ignore the dates on the websites to see their freshness signals.
4. Audit The Site-tree And URL Structure Of The Website
Understanding the most important or crowded URL Path is necessary to weigh the website’s SEO efforts or technical SEO Audits.
A single improvement for Technical SEO can benefit thousands of URLs simultaneously, which creates a cost-effective and budget-friendly SEO strategy.
URL Structure Understanding mainly focuses on the website’s more prominent sections and content network analysis understanding.
To create a URL Tree Dataframe from a website’s URLs from the sitemap, use the following code block.
sitemap_url_df = adv.url_to_df(sitemap_df["loc"]) sitemap_url_df
With the help of “urllib” or the “advertools” as above, you can easily parse the URLs within the sitemap into a data frame.
- Checking the URL breakdowns helps to understand the overall information tree of a website.
The data frame above contains the “scheme,” “netloc,” “path,” and every “/” breakdown within the URLs as a “dir” which represents the directory.
Auditing the URL structure of the website is prominent for two objectives.
These are checking whether all URLs have “HTTPS” and understanding the content network of the website.
Content analysis with sitemap files is not the topic of the “Indexing and Crawling” directly, thus at the end of the article, we will talk about it slightly.
Check the next section to see the SSL Usage on Sitemap URLs.
5. Check The HTTPS Usage On The URLs Within Sitemap
Use the following code block to check the HTTP Usage ratio for the URLs within the Sitemap.
sitemap_url_df["scheme"].value_counts().to_frame()
The code block above uses a simple data filtration for the “scheme” column which contains the URLs’ HTTPS Protocol information.
using the “value_counts” we see that all URLs are on the HTTPS.
6. Check The Robots.txt Disallow Commands For Crawlability
The structure of URLs within the sitemap is beneficial to see whether there is a situation for “submitted but disallowed”.
To see whether there is a robots.txt file of the website, use the code block below.
import requests r = requests.get("https://www.complaintsboard.com/robots.txt") R.status_code 200
Simply, we send a “get request” to the robots.txt URL.
If the response status code is 200, it means there is a robots.txt file for the user-agent-based crawling control.
After checking the “robots.txt” existence, we can use the “adv.robotstxt_test” method for bulk robots.txt audit for crawlability of the URLs in the sitemap.
sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.complaintsboard.com/robots.txt", urls=sitemap_df["loc"], user_agents=["*"]) sitemap_df_robotstxt_check["can_fetch"].value_counts()
We have created a new variable called “sitemap_df_robotstxt_check”, and assigned the output of the “robotstxt_test” method.
We have used the URLs within the sitemap with the “sitemap_df[“loc”]”.
We have performed the audit for all of the user-agents via the “user_agents = [“*”]” parameter and value pair.
You can see the result below.
True 245690 False 1 Name: can_fetch, dtype: int64
It shows that there is one URL that is disallowed but submitted.
We can filter the specific URL as below.
pd.set_option("display.max_colwidth",255) sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]
We have used “set_option” to expand all of the values within the “url_path” section.
- We see that a “profile” page has been disallowed and submitted.
Later, the same control can be done for further examinations such as “disallowed but internally linked”.
But, to do that, we need to crawl at least 3 million URLs from ComplaintsBoard.com, and it can be an entirely new guide.
Some website URLs do not have a proper “directory hierarchy”, which can make the analysis of the URLs, in terms of content network characteristics, harder.
Complaintsboard.com doesn’t use a proper URL structure and taxonomy, so analyzing the website structure is not easy for an SEO or Search Engine.
But the most used words within the URLs or the content update frequency can signal which topic the company actually weighs on.
Since we focus on “technical aspects” in this tutorial, you can read the Sitemap Content Audit here.
7. Check The Status Code Of The Sitemap URLs With Python
Every URL within the sitemap has to have a 200 Status Code.
A crawl has to be performed to check the status codes of the URLs within the sitemap.
But, since it’s costly when you have millions of URLs to audit, we can simply use a new crawling method from Advertools.
Without taking the response body, we can crawl just the response headers of the URLs within the sitemap.
It is useful to decrease the crawl time for auditing possible robots, indexing, and canonical signals from the response headers.
To perform a response header crawl, use the “adv.crawl_headers” method.
adv.crawl_headers(sitemap_df["loc"], output_file="sitemap_df_header.jl") df_headers = pd.read_json("sitemap_df_header.jl", lines=True) df_headers["status"].value_counts()
The explanation of the code block for checking the URLs’ status codes within the Sitemap XML Files for the Technical SEO aspect can be seen below.
200 207866 404 23 Name: status, dtype: int64
It shows that the 23 URL from the sitemap is actually 404.
And, they should be removed from the sitemap.
To audit which URLs from the sitemap are 404, use the filtration method below from Pandas.
df_headers[df_headers["status"] == 404]
The result can be seen below.
8. Check The Canonicalization From Response Headers
From time to time, using canonicalization hints on the response headers is beneficial for crawling and indexing signal consolidation.
In this context, the canonical tag on the HTML and the response header has to be the same.
If there are two different canonicalization signals on a web page, the search engines can ignore both assignments.
For ComplaintsBoard.com, we don’t have a canonical response header.
- The first step is auditing whether the response header for canonical usage exists.
- The second step is comparing the response header canonical value to the HTML canonical value if it exists.
- The third step is checking whether the canonical values are self-referential.
Check the columns of the output of the header crawl to check the Canonicalization from Response Headers.
df_headers.columns
Below, you can see the columns.
If you are not familiar with the response headers, you may not know how to use canonical hints within response headers.
A response header can include the canonical hint with the “Link” value.
It is registered as “resp_headers_link” by the Advertools directly.
Another problem is that the extracted strings appear within the “<URL>;” string pattern.
It means we will use regex to extract it.
df_headers["resp_headers_link"]
You can see the result below.
The regex pattern “[^<>][a-z:/0-9-.]*” is good enough to extract the specific canonical value.
A self-canonicalization check with the response headers is below.
df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:/0-9-.]*)") (df_headers["response_header_canonical"] == df_headers["url"]).value_counts()
We have used two different boolean checks.
One to check whether the response header canonical hint is equal to the URL itself.
Another to see whether the status code is 200.
Since we have 404 URLs within the sitemap, their canonical value will be “NaN”.
- We have 29 outliers for Technical SEO. Every wrong signal given to the search engine for indexation or ranking will cause the dilution of the ranking signals.
To see these URLs, use the code block below.
The Canonical Values from the Response Headers can be seen above.
df_headers[(df_headers["response_header_canonical"] != df_headers["url"]) & (df_headers["status"] == 200)]
Even a single “/” in the URL can cause canonicalization conflict as appears here for the homepage.
- You can check the canonical conflict here.
If you check log files, you will see that the search engine crawls the URLs from the “Link” response headers.
Thus in technical SEO, this should be weighted.
9. Check The Indexing And Crawling Commands From Response Headers
There are 14 different X-Robots-Tag specifications for the Google search engine crawler.
The latest one is “indexifembedded” to determine the indexation amount on a web page.
The Indexing and Crawling directives can be in the form of a response header or the HTML meta tag.
This section focuses on the response header version of indexing and crawling directives.
- The first step is checking whether the X-Robots-Tag property and values exist within the HTTP Header or not.
- The second step is auditing whether it aligns itself with the HTML Meta Tag properties and values if they exist.
Use the command below yo check the X-Robots-Tag” from the response headers.
def robots_tag_checker(dataframe:pd.DataFrame): for i in df_headers: if i.__contains__("robots"): return i else: return "There is no robots tag" robots_tag_checker(df_headers) OUTPUT>>> 'There is no robots tag'
We have created a custom function to check the “X-Robots-tag” response headers from the web pages’ source code.
It appears that our test subject website doesn’t use the X-Robots-Tag.
If there would be an X-Robots-tag, the code block below should be used.
df_headers["response_header_x_robots_tag"].value_counts() df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]
Check whether there is a “noindex” directive from the response headers, and filter the URLs with this indexation conflict.
In the Google Search Console Coverage Report, those appear as “Submitted marked as noindex”.
Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of the signals while making the search algorithms trust less to the user-declared signals.
10. Check The Self Canonicalization Of Sitemap URLs
Every URL in the sitemap XML files should give a self-canonicalization hint.
Sitemaps should only include the canonical versions of the URLs.
The Python code block in this section is to understand whether the sitemap URLs have self-canonicalization values or not.
To check the canonicalization from the HTML Documents’ “<head>” section, crawl the websites by taking their response body.
Use the code block below.
user_agent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The difference between “crawl_headers” and the “crawl” is that “crawl” takes the entire response body while the “crawl_headers” is only for response headers.
adv.crawl(sitemap_df["loc"], output_file="sitemap_crawl_complaintsboard.jl", follow_links=False, custom_settings={"LOG_FILE":"sitemap_crawl_complaintsboard.log", “USER_AGENT”:user_agent})
You can check the file size differences from crawl logs to response header crawl and entire response body crawl.
From 6GB output to the 387 MB output is quite economical.
If a search engine just wants to see certain response headers and the status code, creating information on the headers would make their crawl hits more economical.
How To Deal With Large DataFrames For Reading And Aggregating Data?
This section requires dealing with the large data frames.
A computer can’t read a Pandas DataFrame from a CSV or JL file if the file size is larger than the computer’s RAM.
Thus, the “chunking” method is used.
When a website sitemap XML File contains millions of URLs, the total crawl output will be larger than tens of gigabytes.
An iteration across sitemap crawl output data frame rows is necessary.
For chunking, use the code block below.
df_iterator = pd.read_json( 'sitemap_crawl_complaintsboard.jl', chunksize=10000, lines=True) for i, df_chunk in enumerate(df_iterator): output_df = pd.DataFrame(data={"url":df_chunk["url"],"canonical":df_chunk["canonical"], "self_canonicalised":df_chunk["url"] == df_chunk["canonical"]}) mode="w" if i == 0 else 'a' header = i == 0 output_df.to_csv( "canonical_check.csv", index=False, header=header, mode=mode ) df[((df["url"] != df["canonical"]) == True) & (df["self_canonicalised"] == False) & (df["canonical"].isna() != True)]
You can see the result below.
We see that the paginated URLs from the “book” subfolder give canonical hints to the first page, which is a non-correct practice according to the Google guidelines.
11. Check The Sitemap Sizes Within Sitemap Index Files
Every Sitemap File should be less than 50 MB. Use the Python code block below in the Technical SEO with Python context to check the sitemap file size.
pd.pivot_table(sitemap_df[sitemap_df["loc"].duplicated()==True], index="sitemap")
You can see the result below.
We see that all sitemap XML files are under 50MB.
For better and faster indexation, keeping the sitemap URLs valuable and unique while decreasing the size of the sitemap files is beneficial.
12. Check The URL Count Per Sitemap With Python
Every URL within the sitemaps should have fewer than 50.000 URLs.
Use the Python code block below to check the URL Counts within the sitemap XML files.
(pd.pivot_table(sitemap_df, values=["loc"], index="sitemap", aggfunc="count") .sort_values(by="loc", ascending=False))
You can see the result below.
- All sitemaps have less than 50.000 URLs. Some sitemaps have only one URL, which wastes the search engine’s attention.
Keeping sitemap URLs that are frequently updated different from the static and stale content URLs is beneficial.
URL Count and URL Content character differences help a search engine to adjust crawl demand effectively for different website sections.
13. Check The Indexing And Crawling Meta Tags From URLs’ Content With Python
Even if a web page is not disallowed from robots.txt, it can still be disallowed from the HTML Meta Tags.
Thus, checking the HTML Meta Tags for better indexation and crawling is necessary.
Using the “custom selectors” is necessary to perform the HTML Meta Tag audit for the sitemap URLs.
sitemap = adv.sitemap_to_df("https://www.holisticseo.digital/sitemap.xml") adv.crawl(url_list=sitemap["loc"][:1000], output_file="meta_command_audit.jl", follow_links=False, xpath_selectors= {"meta_command": "//meta[@name="robots"]/@content"}, custom_settings={"CLOSESPIDER_PAGECOUNT":1000}) df_meta_check = pd.read_json("meta_command_audit.jl", lines=True) df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()
The “//meta[@name=”robots”]/@content” XPATH selector is to extract all the robots commands from the URLs from the sitemap.
We have used only the first 1000 URLs in the sitemap.
And, I stop crawling after the initial 1000 responses.
I have used another website to check the Crawling Meta Tags since ComplaintsBoard.com doesn’t have it on the source code.
You can see the result below.
- None of the URLs from the sitemap have “nofollow” or “noindex” within the “Robots” commands.
To check their values, use the code below.
df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]
You can see the result below.
14. Validate The Sitemap XML File Syntax With Python
Sitemap XML File Syntax validation is necessary to validate the integration of the sitemap file with the search engine’s perception.
Even if there are certain syntax errors, a search engine can recognize the sitemap file during the XML Normalization.
But, every syntax error can decrease the efficiency for certain levels.
Use the code block below to validate the Sitemap XML File Syntax.
def validate_sitemap_syntax(xml_path: str, xsd_path: str) xmlschema_doc = etree.parse(xsd_path) xmlschema = etree.XMLSchema(xmlschema_doc) xml_doc = etree.parse(xml_path) result = xmlschema.validate(xml_doc) return result validate_sitemap_syntax("sej_sitemap.xml", "sitemap.xsd")
For this example, I have used “https://www.searchenginejournal.com/sitemap_index.xml”. The XSD file involves the XML file’s context and tree structure.
It is stated in the first line of the Sitemap file as below.
For further information, you can also check DTD documentation.
15. Check The Open Graph URL And Canonical URL Matching
It is not a secret that search engines also use the Open Graph and RSS Feed URLs from the source code for further canonicalization and exploration.
The Open Graph URLs should be the same as the canonical URL submission.
From time to time, even in Google Discover, Google chooses to use the image from the Open Graph.
To check the Open Graph URL and Canonical URL consistency, use the code block below.
for i, df_chunk in enumerate(df_iterator): if "og:url" in df_chunk.columns: output_df = pd.DataFrame(data={ "canonical":df_chunk["canonical"], "og:url":df_chunk["og:url"], "open_graph_canonical_consistency":df_chunk["canonical"] == df_chunk["og:url"]}) mode="w" if i == 0 else 'a' header = i == 0 output_df.to_csv( "open_graph_canonical_consistency.csv", index=False, header=header, mode=mode ) else: print("There is no Open Graph URL Property")
There is no Open Graph URL Property
If there is an Open Graph URL Property on the website, it will give a CSV file to check whether the canonical URL and the Open Graph URL are the same or not.
But for this website, we don’t have an Open Graph URL.
Thus, I have used another website for the audit.
if "og:url" in df_meta_check.columns: output_df = pd.DataFrame(data={ "canonical":df_meta_check["canonical"], "og:url":df_meta_check["og:url"], "open_graph_canonical_consistency":df_meta_check["canonical"] == df_meta_check["og:url"]}) mode="w" if i == 0 else 'a' #header = i == 0 output_df.to_csv( "df_og_url_canonical_audit.csv", index=False, #header=header, mode=mode ) else: print("There is no Open Graph URL Property") df = pd.read_csv("df_og_url_canonical_audit.csv") df
You can see the result below.
We see that all canonical URLs and the Open Graph URLs are the same.
16. Check The Duplicate URLs Within Sitemap Submissions
A sitemap index file shouldn’t have duplicated URLs across different sitemap files or within the same sitemap XML file.
The duplication of the URLs within the sitemap files can make a search engine download the sitemap files less since a certain percentage of the sitemap file is bloated with unnecessary submissions.
For certain situations, it can appear as a spamming attempt to control the crawling schemes of the search engine crawlers.
use the code block below to check the duplicate URLs within the sitemap submissions.
sitemap_df["loc"].duplicated().value_counts()
You can see that the 49574 URLs from the sitemap are duplicated.
To see which sitemaps have more duplicated URLs, use the code block below.
pd.pivot_table(sitemap_df[sitemap_df["loc"].duplicated()==True], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)
You can see the result.
Chunking the sitemaps can help with site-tree and technical SEO analysis.
To see the duplicated URLs within the Sitemap, use the code block below.
sitemap_df[sitemap_df["loc"].duplicated() == True]
You can see the result below.
Conclusion
I wanted to show how to validate a sitemap file for better and healthier indexation and crawling for Technical SEO.
Python is vastly used for data science, machine learning, and natural language processing.
But, you can also use it for Technical SEO Audits to support the other SEO Verticals with a Holistic SEO Approach.
In a future article, we can expand these Technical SEO Audits further with different details and methods.
But, in general, this is one of the most comprehensive Technical SEO guides for Sitemaps and Sitemap Audit Tutorial with Python.
More resources:
Featured Image: elenasavchina2/Shutterstock
SEO
YouTube Extends Shorts To 3 Minutes, Adds New Features
YouTube expands Shorts to 3 minutes, adds templates, AI tools, and the option to show fewer Shorts on the homepage.
- YouTube Shorts will allow 3-minute videos.
- New features include templates, enhanced remixing, and AI-generated video backgrounds.
- YouTube is adding a Shorts trends page and comment previews.
SEO
How To Stop Filter Results From Eating Crawl Budget
Today’s Ask An SEO question comes from Michal in Bratislava, who asks:
“I have a client who has a website with filters based on a map locations. When the visitor makes a move on the map, a new URL with filters is created. They are not in the sitemap. However, there are over 700,000 URLs in the Search Console (not indexed) and eating crawl budget.
What would be the best way to get rid of these URLs? My idea is keep the base location ‘index, follow’ and newly created URLs of surrounded area with filters switch to ‘noindex, no follow’. Also mark surrounded areas with canonicals to the base location + disavow the unwanted links.”
Great question, Michal, and good news! The answer is an easy one to implement.
First, let’s look at what you’re trying and apply it to other situations like ecommerce and publishers. This way, more people can benefit. Then, go into your strategies above and end with the solution.
What Crawl Budget Is And How Parameters Are Created That Waste It
If you’re not sure what Michal is referring to with crawl budget, this is a term some SEO pros use to explain that Google and other search engines will only crawl so many pages on your website before it stops.
If your crawl budget is used on low-value, thin, or non-indexable pages, your good pages and new pages may not be found in a crawl.
If they’re not found, they may not get indexed or refreshed. If they’re not indexed, they cannot bring you SEO traffic.
This is why optimizing a crawl budget for efficiency is important.
Michal shared an example of how “thin” URLs from an SEO point of view are created as customers use filters.
The experience for the user is value-adding, but from an SEO standpoint, a location-based page would be better. This applies to ecommerce and publishers, too.
Ecommerce stores will have searches for colors like red or green and products like t-shirts and potato chips.
These create URLs with parameters just like a filter search for locations. They could also be created by using filters for size, gender, color, price, variation, compatibility, etc. in the shopping process.
The filtered results help the end user but compete directly with the collection page, and the collection would be the “non-thin” version.
Publishers have the same. Someone might be on SEJ looking for SEO or PPC in the search box and get a filtered result. The filtered result will have articles, but the category of the publication is likely the best result for a search engine.
These filtered results can be indexed because they get shared on social media or someone adds them as a comment on a blog or forum, creating a crawlable backlink. It might also be an employee in customer service responded to a question on the company blog or any other number of ways.
The goal now is to make sure search engines don’t spend time crawling the “thin” versions so you can get the most from your crawl budget.
The Difference Between Indexing And Crawling
There’s one more thing to learn before we go into the proposed ideas and solutions – the difference between indexing and crawling.
- Crawling is the discovery of new pages within a website.
- Indexing is adding the pages that are worthy of showing to a person using the search engine to the database of pages.
Pages can get crawled but not indexed. Indexed pages have likely been crawled and will likely get crawled again to look for updates and server responses.
But not all indexed pages will bring in traffic or hit the first page because they may not be the best possible answer for queries being searched.
Now, let’s go into making efficient use of crawl budgets for these types of solutions.
Using Meta Robots Or X Robots
The first solution Michal pointed out was an “index,follow” directive. This tells a search engine to index the page and follow the links on it. This is a good idea, but only if the filtered result is the ideal experience.
From what I can see, this would not be the case, so I would recommend making it “noindex,follow.”
Noindex would say, “This is not an official page, but hey, keep crawling my site, you’ll find good pages in here.”
And if you have your main menu and navigational internal links done correctly, the spider will hopefully keep crawling them.
Canonicals To Solve Wasted Crawl Budget
Canonical links are used to help search engines know what the official page to index is.
If a product exists in three categories on three separate URLs, only one should be “the official” version, so the two duplicates should have a canonical pointing to the official version. The official one should have a canonical link that points to itself. This applies to the filtered locations.
If the location search would result in multiple city or neighborhood pages, the result would likely be a duplicate of the official one you have in your sitemap.
Have the filtered results point a canonical back to the main page of filtering instead of being self-referencing if the content on the page stays the same as the original category.
If the content pulls in your localized page with the same locations, point the canonical to that page instead.
In most cases, the filtered version inherits the page you searched or filtered from, so that is where the canonical should point to.
If you do both noindex and have a self-referencing canonical, which is overkill, it becomes a conflicting signal.
The same applies to when someone searches for a product by name on your website. The search result may compete with the actual product or service page.
With this solution, you’re telling the spider not to index this page because it isn’t worth indexing, but it is also the official version. It doesn’t make sense to do this.
Instead, use a canonical link, as I mentioned above, or noindex the result and point the canonical to the official version.
Disavow To Increase Crawl Efficiency
Disavowing doesn’t have anything to do with crawl efficiency unless the search engine spiders are finding your “thin” pages through spammy backlinks.
The disavow tool from Google is a way to say, “Hey, these backlinks are spammy, and we don’t want them to hurt us. Please don’t count them towards our site’s authority.”
In most cases, it doesn’t matter, as Google is good at detecting spammy links and ignoring them.
You do not want to add your own site and your own URLs to the disavow tool. You’re telling Google your own site is spammy and not worth anything.
Plus, submitting backlinks to disavow won’t prevent a spider from seeing what you want and do not want to be crawled, as it is only for saying a link from another site is spammy.
Disavowing won’t help with crawl efficiency or saving crawl budget.
How To Make Crawl Budgets More Efficient
The answer is robots.txt. This is how you tell specific search engines and spiders what to crawl.
You can include the folders you want them to crawl by marketing them as “allow,” and you can say “disallow” on filtered results by disallowing the “?” or “&” symbol or whichever you use.
If some of those parameters should be crawled, add the main word like “?filter=location” or a specific parameter.
Robots.txt is how you define crawl paths and work on crawl efficiency. Once you’ve optimized that, look at your internal links. A link from one page on your site to another.
These help spiders find your most important pages while learning what each is about.
Internal links include:
- Breadcrumbs.
- Menu navigation.
- Links within content to other pages.
- Sub-category menus.
- Footer links.
You can also use a sitemap if you have a large site, and the spiders are not finding the pages you want with priority.
I hope this helps answer your question. It is one I get a lot – you’re not the only one stuck in that situation.
More resources:
Featured Image: Paulo Bobita/Search Engine Journal
SEO
Ad Copy Tactics Backed By Study Of Over 1 Million Google Ads
Mastering effective ad copy is crucial for achieving success with Google Ads.
Yet, the PPC landscape can make it challenging to discern which optimization techniques truly yield results.
Although various perspectives exist on optimizing ads, few are substantiated by comprehensive data. A recent study from Optmyzr attempted to address this.
The goal isn’t to promote or dissuade any specific method but to provide a clearer understanding of how different creative decisions impact your campaigns.
Use the data to help you identify higher profit probability opportunities.
Methodology And Data Scope
The Optmyzr study analyzed data from over 22,000 Google Ads accounts that have been active for at least 90 days with a minimum monthly spend of $1,500.
Across more than a million ads, we assessed Responsive Search Ads (RSAs), Expanded Text Ads (ETAs), and Demand Gen campaigns. Due to API limitations, we could not retrieve asset-level data for Performance Max campaigns.
Additionally, all monetary figures were converted to USD to standardize comparisons.
Key Questions Explored
To provide actionable insights, we focused on addressing the following questions:
- Is there a correlation between Ad Strength and performance?
- How do pinning assets impact ad performance?
- Do ads written in title case or sentence case perform better?
- How does creative length affect ad performance?
- Can ETA strategies effectively translate to RSAs and Demand Gen ads?
As we evaluated the results, it’s important to note that our data set represents advanced marketers.
This means there may be selection bias, and these insights might differ in a broader advertiser pool with varying levels of experience.
The Relationship Between Ad Strength And Performance
Google explicitly states that Ad Strength is a tool designed to guide ad optimization rather than act as a ranking factor.
Despite this, marketers often hold mixed opinions about its usefulness, as its role in ad performance appears inconsistent.
Our data corroborates this skepticism. Ads labeled with an “average” Ad Strength score outperformed those with “good” or “excellent” scores in key metrics like CPA, conversion rate, and ROAS.
This disparity is particularly evident in RSAs, where the ROAS tends to decrease sharply when moving from “average” to “good,” with only a marginal increase when advancing to “excellent.”
Interestingly, Demand Gen ads also showed a stronger performance with an “average” Ad Strength, except for ROAS.
The metrics for conversion rates in Demand Gen and RSAs were notably similar, which is surprising since Demand Gen ads are typically designed for awareness, while RSAs focus on driving transactions.
Key Takeaways:
- Ad Strength doesn’t reliably correlate with performance, so it shouldn’t be a primary metric for assessing your ads.
- Most ads with “poor” or “average” Ad Strength labels perform well by standard advertising KPIs.
- “Good” or “excellent” Ad Strength labels do not guarantee better performance.
How Does Pinning Affect Ad Performance?
Pinning refers to locking specific assets like headlines or descriptions in fixed positions within the ad. This technique became common with RSAs, but there’s ongoing debate about its efficacy.
Some advertisers advocate for pinning all assets to replicate the control offered by ETAs, while others prefer to let Google optimize placements automatically.
Our data suggests that pinning some, but not all, assets offers the most balanced results in terms of CPA, ROAS, and CPC. However, ads where all assets are pinned achieve the highest relevance in terms of CTR.
Still, this marginally higher CTR doesn’t necessarily translate into better conversion metrics. Ads with unpinned or partially pinned assets generally perform better in terms of conversion rates and cost-based metrics.
Key Takeaways:
- Selective pinning is optimal, offering a good balance between creative control and automation.
- Fully pinned ads may increase CTR but tend to underperform in metrics like CPA and ROAS.
- Advertisers should embrace RSAs, as they consistently outperform ETAs – even with fully pinned assets.
Title Case Vs. Sentence Case: Which Performs Better?
The choice between title case (“This Is a Title Case Sentence”) and sentence case (“This is a sentence case sentence”) is often a point of contention among advertisers.
Our analysis revealed a clear trend: Ads using sentence case generally outperformed those in title case, particularly in RSAs and Demand Gen campaigns.
(RSA Data)
(ETA Data)
(Demand Gen)
ROAS, in particular, showed a marked preference for sentence case across these ad types, suggesting that a more natural, conversational tone may resonate better with users.
Interestingly, many advertisers still use a mix of title and sentence case within the same account, which counters the traditional approach of maintaining consistency throughout the ad copy.
Key Takeaways:
- Sentence case outperforms title case in RSAs and Demand Gen ads on most KPIs.
- Including sentence case ads in your testing can improve performance, as it aligns more closely with organic results, which users perceive as higher quality.
- Although ETAs perform slightly better with title case, sentence case is increasingly the preferred choice in modern ad formats.
The Impact Of Ad Length On Performance
Ad copy, particularly for Google Ads, requires brevity without sacrificing impact.
We analyzed the effects of character count on ad performance, grouping ads by the length of headlines and descriptions.
(RSA Data)
(ETA Data)
(Demand Gen Data)
Interestingly, shorter headlines tend to outperform longer ones in CTR and conversion rates, while descriptions benefit from moderate length.
Ads that tried to maximize character counts by using dynamic keyword insertion (DKI) or customizers often saw no significant performance improvement.
Moreover, applying ETA strategies to RSAs proved largely ineffective.
In almost all cases, advertisers who carried over ETA tactics to RSAs saw a decline in performance, likely because of how Google dynamically assembles ad components for display.
Key Takeaways:
- Shorter headlines lead to better performance, especially in RSAs.
- Focus on concise, impactful messaging instead of trying to fill every available character.
- ETA tactics do not translate well to RSAs, and attempting to replicate them can hurt performance.
Final Thoughts On Ad Optimizations
In summary, several key insights emerge from this analysis.
First, Ad Strength should not be your primary focus when assessing performance. Instead, concentrate on creating relevant, engaging ad copy tailored to your target audience.
Additionally, pinning assets should be a strategic, creative decision rather than a hard rule, and advertisers should incorporate sentence case into their testing for RSAs and Demand Gen ads.
Finally, focus on quality over quantity in ad copy length, as longer ads do not always equate to better results.
By refining these elements of your ads, you can drive better ROI and adapt to the evolving landscape of Google Ads.
Read the full Ad Strength & Creative Study from Optmyzr.
More resources:
Featured Image: Sammby/Shutterstock
-
SEARCHENGINES7 days ago
Daily Search Forum Recap: September 27, 2024
-
SEO6 days ago
How to Estimate It and Source Data
-
SEO7 days ago
9 Successful PR Campaign Examples, According to the Data
-
SEO5 days ago
Yoast Co-Founder Suggests A WordPress Contributor Board
-
SEO5 days ago
6 Things You Can Do to Compete With Big Sites
-
SEARCHENGINES6 days ago
Google’s 26th Birthday Doodle Is Missing
-
SEARCHENGINES3 days ago
Daily Search Forum Recap: September 30, 2024
-
SEARCHENGINES5 days ago
Google Volatility With Gains & Losses, Updated Web Spam Policies, Cache Gone & More Search News
You must be logged in to post a comment Login