

Find Resources Bigger Than 15 MB For Better Googlebot Crawling




Googlebot is an automatic and always-on web crawling system that keeps Google’s index refreshed.

The website estimates Google’s index to be more than 62 billion web pages.

Google’s search index is “well over 100,000,000 gigabytes in size.”

Googlebot and variants (smartphones, news, images, etc.) have certain constraints for the frequency of JavaScript rendering or the size of the resources.

Google uses crawling constraints to protect its own crawling resources and systems.

For instance, if a news website refreshes the recommended articles every 15 seconds, Googlebot might start to skip the frequently refreshed sections – since they won’t be relevant or valid after 15 seconds.

Years ago, Google announced that it does not crawl or use resources bigger than 15 MB.

On June 28, 2022, Google republished this blog post by stating that it does not use the excess part of the resources after 15 MB for crawling.

To emphasize that it rarely happens, Google stated that the “median size of an HTML file is 500 times smaller” than 15 MB.

Screenshot from the author, August 2022

The screenshot above shows the median desktop and mobile HTML file sizes. Most websites therefore never run into the 15 MB crawling constraint.

But, the web is a big and chaotic place.

Understanding the nature of the 15 MB crawling limit and ways to analyze it is important for SEOs.

An image, video, or bug can cause crawling problems, and this lesser-known SEO information can help projects protect their organic search value.


Is 15 MB Googlebot Crawling Limit Only For HTML Documents?


The 15 MB Googlebot crawling limit applies to all indexable and crawlable documents, including Google Earth, Hancom Hanword (.hwp), OpenOffice text (.odt), Rich Text Format (.rtf), and other Googlebot-supported file types.

Are Image And Video Sizes Summed With HTML Document?

No, every resource is evaluated separately by the 15 MB crawling limit.

If the HTML document is 14.99 MB and its featured image is also 14.99 MB, both will be crawled and used by Googlebot.

The HTML document’s size is not summed with the resources that are linked via HTML tags.
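The per-resource rule can be sketched in a few lines of Python. This is only an illustration: the example.com URLs and byte sizes are made up, and reading "15 MB" as 15 × 1024 × 1024 bytes is an assumption, since the exact byte count is not given here.

```python
from typing import Dict, List

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def over_limit(resources: Dict[str, int], limit: int = LIMIT_BYTES) -> List[str]:
    """Return the URLs whose own size exceeds the limit; sizes are never summed."""
    return [url for url, size in resources.items() if size > limit]


# Hypothetical page: the HTML and its featured image are each just under the limit.
page_resources = {
    "https://example.com/article.html": int(14.99 * 1024 * 1024),
    "https://example.com/featured.jpg": int(14.99 * 1024 * 1024),
    "https://example.com/huge-video.mp4": 40 * 1024 * 1024,
}

flagged = over_limit(page_resources)
print(flagged)
```

Only the video is flagged, even though the HTML document and the image together weigh almost 30 MB.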

Does Inlined CSS, JS, Or Data URI Bloat HTML Document Size?

Yes, inlined CSS, JS, and data URIs all count toward the HTML document size.

Thus, if the document exceeds 15 MB due to inlined resources and commands, it will affect the specific HTML document’s crawlability.
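To see how much of a document's weight comes from inlined resources, a rough estimate can be made with Python's standard `html.parser`. The class and function names below are hypothetical, and the sketch only counts inline `<script>` and `<style>` bodies plus attribute values that start with `data:`; it is an approximation, not a description of how Googlebot measures a page.

```python
from html.parser import HTMLParser
from typing import Tuple


class InlineWeight(HTMLParser):
    """Adds up the bytes of inline <script>/<style> bodies and data: URIs."""

    def __init__(self) -> None:
        super().__init__()
        self.inline_bytes = 0
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A <style> tag, or a <script> tag without src, carries its code inline.
        if tag == "style" or (tag == "script" and "src" not in attr_map):
            self._in_inline = True
        # Any attribute holding a data: URI embeds the resource in the HTML.
        for _, value in attrs:
            if value and value.startswith("data:"):
                self.inline_bytes += len(value.encode("utf-8"))

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_inline = False

    def handle_data(self, data):
        if self._in_inline:
            self.inline_bytes += len(data.encode("utf-8"))


def inline_share(html: str) -> Tuple[int, int]:
    """Return (inline bytes, total bytes) for an HTML string."""
    parser = InlineWeight()
    parser.feed(html)
    parser.close()
    return parser.inline_bytes, len(html.encode("utf-8"))


sample = (
    "<html><head><style>body{margin:0}</style></head>"
    "<body><img src='data:image/png;base64,AAAA'>"
    "<script>console.log(1)</script></body></html>"
)
inline, total = inline_share(sample)
print(f"{inline} of {total} bytes are inlined")
```

A page where the inline share is a large fraction of a multi-megabyte total is a candidate for moving CSS, JS, or images into external files.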

Does Google Stop Crawling The Resource If It Is Bigger Than 15 MB?

No, Google crawling systems do not stop crawling the resources that are bigger than the 15 MB limit.

They continue to fetch the file but use only the part that falls within the first 15 MB.

For an image bigger than 15 MB, Googlebot can fetch the image in chunks up to the 15 MB limit with the help of “Content-Range.”

The Content-Range is a response header that helps Googlebot or other crawlers and requesters perform partial requests.
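As an illustration of what such a partial response looks like, the sketch below parses a `Content-Range` value of the form `bytes start-end/total`. The header value and the 20,000,000-byte image are hypothetical, the function name is made up, and treating 15 MB as 15 × 1024 × 1024 bytes is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes start-end/total' into (start, end, total); total is None for '*'."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return start, end, total


# Range request header a client could send to ask for the first 15 MB only.
range_request = f"bytes=0-{LIMIT_BYTES - 1}"

# Hypothetical response header for a 20,000,000-byte image.
start, end, total = parse_content_range("bytes 0-15728639/20000000")
received = end - start + 1      # bytes delivered in this chunk
skipped = total - (end + 1)     # bytes beyond the requested range
print(range_request, received, skipped)
```

Byte positions in these headers are zero-based and inclusive, which is why the last requested byte is `LIMIT_BYTES - 1`.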

How To Audit The Resource Size Manually?

You can use Google Chrome Developer Tools to audit the resource size manually.

Follow the steps below on Google Chrome.

  • Open a web page document via Google Chrome.
  • Press F12.
  • Go to the Network tab.
  • Refresh the web page.
  • Order the resources according to the Waterfall.
  • Check the size column on the first row, which shows the HTML document’s size.

Below, you can see an example of a homepage HTML document, which is bigger than 77 KB.

Search Engine Journal homepage HTML results. Screenshot by author, August 2022

How To Audit The Resource Size Automatically And In Bulk?

Use Python to audit the HTML document size automatically and in bulk. Advertools and Pandas are two useful Python Libraries to automate and scale SEO tasks.

Follow the instructions below.

  • Import Advertools and Pandas.
  • Collect all the URLs in the sitemap.
  • Crawl all the URLs in the sitemap.
  • Filter the URLs with their HTML Size.
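
import advertools as adv
import pandas as pd

# Collect the URLs listed in the sitemap (put the sitemap URL between the quotes).
df = adv.sitemap_to_df("")

# Crawl every sitemap URL and write one JSON line per page to output.jl.
adv.crawl(df["loc"], output_file="output.jl", custom_settings={"LOG_FILE": "output_1.log"})

# Load the crawl output and list the pages from largest to smallest HTML size.
df = pd.read_json("output.jl", lines=True)
df[["url", "size"]].sort_values(by="size", ascending=False)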

The code block above extracts the sitemap URLs and crawls them.

The last line of the code only creates a data frame sorted in descending order by size.

URLs and size. Image created by author, August 2022

You can see the sizes of HTML documents as above.

The biggest HTML document in this example is around 700 KB, which is a category page.

So, this website is well within the 15 MB constraint. But we can check beyond the HTML documents.
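The "filter the URLs with their HTML size" step can also be sketched with the standard library alone. This assumes a crawl output in JSON Lines form with `url` and `size` fields, as in the data frame above; the function name is hypothetical, and reading 15 MB as 15 × 1024 × 1024 bytes is an assumption.

```python
import json
from typing import List, Tuple

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def oversized_pages(jl_path: str, limit: int = LIMIT_BYTES) -> List[Tuple[str, int]]:
    """Return (url, size) pairs above the limit, largest first."""
    rows = []
    with open(jl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines in the crawl file
            record = json.loads(line)
            size = record.get("size")
            # Rows without a numeric size (for example, failed fetches) are ignored.
            if isinstance(size, (int, float)) and size > limit:
                rows.append((record["url"], int(size)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `oversized_pages("output.jl")` after the crawl returns an empty list when every page is under the limit. Passing a lower `limit`, such as `500 * 1024`, surfaces unusually heavy pages well before they approach 15 MB.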

How To Check The Sizes of CSS And JS Resources?

You can use Puppeteer to check the size of CSS and JS resources.

Puppeteer is a Node.js package that controls Google Chrome, including in headless mode, for browser automation and website tests.

Most SEO pros use Lighthouse or Page Speed Insights API for their performance tests. But, with the help of Puppeteer, every technical aspect and simulation can be analyzed.

Follow the code block below.

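const puppeteer = require("puppeteer");
const XLSX = require("xlsx");
const path = require("path");

// URL of the page to audit. It is left empty in the source, so set it before running.
// (targetUrl is a helper name introduced here so the URL is defined once.)
const targetUrl = "";

(async () => {
    // Launch Chrome with a visible window (headless: false).
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto(targetUrl);
    console.log("Page loaded");

    // Read every performance entry recorded by the page.
    const perfEntries = JSON.parse(
        await page.evaluate(() => JSON.stringify(performance.getEntries()))
    );

    // Header row of the worksheet.
    const workSheetColumnName = ["name", "transferSize", "encodedSize", "decodedSize"];

    // Derive the output file name from the hostname (strip "www." and ".com").
    const urlObject = new URL(targetUrl);
    const hostName = urlObject.hostname;
    const domainName = hostName.replace(/^www\.|\.com$/g, "");

    const workSheetName = "Users";
    const filePath = `./${domainName}.xlsx`;
    const userList = perfEntries;

    const exportPerfToExcel = (userList) => {
        // One row per entry: the resource URL plus its three size metrics.
        const data = userList.map((url) => {
            return [url.name, url.transferSize, url.encodedBodySize, url.decodedBodySize];
        });
        const workBook = XLSX.utils.book_new();
        const workSheetData = [workSheetColumnName, ...data];
        const workSheet = XLSX.utils.aoa_to_sheet(workSheetData);
        XLSX.utils.book_append_sheet(workBook, workSheet, workSheetName);
        XLSX.writeFile(workBook, path.resolve(filePath));
        return true;
    };

    exportPerfToExcel(userList);
    await browser.close();
})();
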

If you do not know JavaScript or have not worked through a Puppeteer tutorial, these code blocks might be a little harder to follow. But the logic is actually simple.

It basically opens a URL, takes all the resources, and gives their “transferSize”, “encodedSize”, and “decodedSize.”

In this example, “decodedSize” is the size that we need to focus on. Below, you can see the result in the form of an XLS file.

Resource sizes: byte sizes of the resources from the website.

If you want to automate these processes for every URL again, you will need to use a for loop in the “” command.

According to your preferences, you can put every web page into a different worksheet or attach it to the same worksheet by appending it.


The 15 MB Googlebot crawling constraint is unlikely to block your technical SEO processes for now, but the median video, image, and JavaScript sizes have increased in the last few years.

The median image size on the desktop has exceeded 1 MB.

Timeseries of image bytes. Screenshot by author, August 2022

The video bytes exceed 5 MB in total.

Timeseries of video bytes. Screenshot by author, August 2022

In other words, from time to time, these resources – or some parts of these resources – might be skipped by Googlebot.

Thus, you should be able to check them automatically and in bulk, both to save time and to avoid missing an oversized resource.


Featured Image: BestForBest/Shutterstock


Keep an eye on what we are doing
Be the first to get latest updates and exclusive content straight to your email inbox.
We promise not to spam you. You can unsubscribe at any time.
Invalid email address


Google Quietly Ends Covid-Era Rich Results





Google removed the Covid-era structured data associated with the Home Activities rich results that allowed online events to be surfaced in search since August 2020, publishing a mention of the removal in the search documentation changelog.

Home Activities Rich Results

The structured data for the Home Activities rich results allowed providers of online livestreams, pre-recorded events and online events to be findable in Google Search.

The original documentation has been completely removed from the Google Search Central webpages and now redirects to a changelog notation explaining that the Home Activities rich result is no longer available for display.

The original purpose was to allow people to discover things to do from home while in quarantine, particularly online classes and events. Google’s rich results surfaced details of how to watch, description of the activities and registration information.

Providers of online events were required to use Event or Video structured data. Publishers and businesses that have this kind of structured data should be aware that this kind of rich result is no longer surfaced. It’s not necessary to remove the structured data if doing so is a burden; publishing structured data that isn’t used for rich results won’t hurt anything.

The changelog for Google’s official documentation explains:

“Removing home activity documentation
What: Removed documentation on home activity structured data.

Why: The home activity feature no longer appears in Google Search results.”

Read more about Google’s Home Activities rich results:

Google Announces Home Activities Rich Results

Read the Wayback Machine’s archive of Google’s original announcement from 2020:

Home activities

Featured Image by Shutterstock/Olga Strel



Google’s Gary Illyes: Lastmod Signal Is Binary





In a recent LinkedIn discussion, Gary Illyes, Analyst at Google, revealed that the search engine takes a binary approach when assessing a website’s lastmod signal from sitemaps.

The revelation came as Illyes encouraged website owners to upgrade to WordPress 6.5, which now natively supports the lastmod element in sitemaps.

When Mark Williams-Cook asked if Google has a “reputation system” to gauge how much to trust a site’s reported lastmod dates, Illyes stated, “It’s binary: we either trust it or we don’t.”

No Shades Of Gray For Lastmod

The lastmod tag indicates the date of the most recent significant update to a webpage, helping search engines prioritize crawling and indexing.

Illyes’ response suggests Google doesn’t factor in a website’s history or gradually build trust in the lastmod values being reported.

Google either accepts the lastmod dates provided in a site’s sitemap as accurate, or it disregards them.

This binary approach reinforces the need to implement the lastmod tag correctly and only specify dates when making meaningful changes.
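One way to avoid bumping lastmod on every trivial save is to compare a fingerprint of the main content before writing a new date. The sketch below is an assumption-heavy illustration: the function names are hypothetical, a SHA-256 hash of the main content stands in for "meaningful change," and it does not describe how WordPress or Google decide anything.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import date


def content_fingerprint(main_content: str) -> str:
    """Hash the main content so that later versions can be compared cheaply."""
    return hashlib.sha256(main_content.encode("utf-8")).hexdigest()


def next_lastmod(old_fingerprint: str, old_lastmod: str,
                 main_content: str, today: date) -> str:
    """Keep the old date unless the main content actually changed."""
    if content_fingerprint(main_content) == old_fingerprint:
        return old_lastmod
    return today.isoformat()


def sitemap_entry(loc: str, lastmod: str) -> str:
    """Build a single <url> element with <loc> and <lastmod> children."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(url, encoding="unicode")


# Hypothetical page whose body text was edited after its stored fingerprint was taken.
stored = content_fingerprint("Original article body")
lastmod = next_lastmod(stored, "2024-01-01", "Revised article body", date(2024, 4, 10))
print(sitemap_entry("https://example.com/post", lastmod))
```

In practice the fingerprint would be computed from the main content only, so that changes to sidebars, footers, or timestamps do not register as updates.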

Illyes commends the WordPress developer community for their work on version 6.5, which automatically populates the lastmod field without extra configuration.

Accurate Lastmod Essential For Crawl Prioritization

While convenient for WordPress users, the native lastmod support is only beneficial if Google trusts you’re using it correctly.

Inaccurate lastmod tags could lead to Google ignoring the signal when scheduling crawls.

With Illyes confirming Google’s stance, it shows there’s no room for error when using this tag.

Why SEJ Cares

Understanding how Google acts on lastmod can help ensure Google displays new publish dates in search results when you update your content.

It’s an all-or-nothing situation – if the dates are deemed untrustworthy, the signal could be disregarded sitewide.

With the information revealed by Illyes, you can ensure your implementation follows best practices to the letter.

Featured Image: Danishch/Shutterstock



How to Persuade Your Boss to Send You to Ahrefs Evolve




There’s one thing standing between you and several days of SEO, socializing, and Singaporean sunshine: your boss (and their Q4 budget 😅).

But don’t worry—we’ve got your back. Here are 5 arguments (and an example message) you can use to persuade your boss to send you to Ahrefs Evolve.

About Ahrefs Evolve

  • 2 days in sunny Singapore (Oct 24–25)
  • 500 digital marketing enthusiasts
  • 18 top speakers from around the world

Learn more and buy tickets.

SEO is changing at a breakneck pace. Between AI Overviews, Google’s rolling update schedule, their huge API leak, and all the documents released during their antitrust trial, it’s hard to keep up. What works in SEO today?

You could watch a YouTube video or two, maybe even attend an hour-long webinar. Or, much more effective: you could spend two full days learning from a panel of 18 international SEO experts, discussing your takeaways live with other attendees.

Evolve speakers from around the world.

Our world-class speakers are tackling the hardest problems and best opportunities in SEO today. The talk agenda covers topics like:

  • Responding to AI Overviews: Amanda King will teach you how to respond to AI Overviews, Google Gemini, and other AI search functions.
  • Surviving (and thriving) Google’s algo updates: Lily Ray will talk through Google’s recent updates, and share data-driven recommendations for what’s working in search today.
  • Planning for the future of SEO: Bernard Huang will talk through the failures of AI content and the path to better results.

(And attendees will get video recordings of each session, so you can share the knowledge with your teammates too.)

View the full talk agenda here.

There’s no substitute for meeting with influencers, peers, and partners in real life. 

Conferences create serendipity: chance encounters and conversations that can have a huge positive impact on you and your business. By way of example, these are some of the real benefits that have come my way from attending conferences:

  • Conversations that lead to new customers for our business,
  • Invitations to speak at events,
  • New business partnerships and co-marketing opportunities, and
  • Meeting people that we went on to hire.

There’s a “halo” effect that lingers long after the event is over: the people you meet will remember you for longer, think more highly of you, and be more likely to help you out, should you ask.

(And let’s not forget: there’s a lot of information, particularly in SEO, that only gets shared in person.)

The “international” part of Evolve matters too. Evolve is a different crowd to your local run-of-the-mill conference. It’s a chance to meet with people from markets you wouldn’t normally meet—from Australia to Indonesia and beyond.

Evolve attendees by home country.

If you’re an Ahrefs customer (thank you!), you’ll learn tons of tips, tricks and workflow improvements from attending Evolve. You’ll have opportunities to:

  • Attend talks from the Ahrefs team, showcasing advanced features and strategies that you can use in your own business.
  • Pick our brains at the Ahrefs booth, where we’ll offer informal 1:1 coaching sessions and previews of up-coming releases (like our new content optimization tool 🤫).
  • Join dedicated Ahrefs training workshops, hosted by the Ahrefs team and Ahrefs power users (tickets for these workshops will be sold separately).

As a manager myself, there are two questions I need answered when approving expenses:

  • Is this a reasonable cost?
  • Will we see a return on this investment?

To answer those questions: early bird tickets for Evolve start at $570. For context, “super early bird” tickets for MozCon (another popular SEO conference) this year were almost twice as much: $999.

There’s a lot included in the ticket price too:

  • World-class international speakers,
  • 5-star hotel venue,
  • 5-star hotel food (two tea breaks with snacks & lunch),
  • Networking afterparty, and
  • Full talk recordings to later share with your team.

SEO is a crucial growth channel for most businesses. If you can improve your company’s SEO performance after attending Evolve (and we think you will), you’ll very easily see a positive return on the investment.

Traveling to tropical Singapore (and eating tons of satay) is great for you, but it’s also great for your team. Attending Evolve is a chance to break with routine, reignite your passion for marketing, and come back to your job reinvigorated.

This would be true for any international conference, but it goes double for Singapore. It’s a truly unique place: an ultra-safe, high-tech city that brings together dozens of different cultures.

Little India in Singapore

You’ll discover different beliefs, working practices, and ways of business—and if you’re anything like me, come back a richer, wiser person for the experience.

If you’re nervous about pitching your boss on attending Evolve, remember: the worst that can happen is a polite “not this time”, and you’ll find yourself in the same position you are now.

So here goes: take this message template, tweak it to your liking, and send it to your boss over email or Slack… and I’ll see you in Singapore 😉

Email template

Hi [your boss’ name],

Our SEO tool provider, Ahrefs, is holding an SEO and digital marketing conference in Singapore in October. I’d like to attend, and I think it’s in the company’s interest:

  • The talks will help us respond to all the changes happening in SEO today. I’m particularly interested in the talks about AI and recent Google updates. 
  • I can network with my peers. I can discover what’s working at other companies, and explore opportunities for partnerships and co-marketing.
  • I can learn how we can use Ahrefs better across the organization.
  • I’ll come back reinvigorated with new ideas and motivation, and I can share my top takeaways and talk recordings with my team after the event.

Early bird tickets are $570. Given how important SEO is to the growth of our business, I think we’ll easily see a return from the spend.

Can we set up time to chat in more detail? Thanks!
