
How to Complete a Technical SEO Audit in 8 Steps


For someone performing their first technical SEO audit, the results can be both overwhelming and intimidating. Often, you can’t see the wood for the trees and have no idea how to fix things or where to even begin.

After years of working with clients, especially as the head of tech SEO for a U.K. agency, I’ve found technical SEO audits to be a near-daily occurrence. That experience has shown me how important it is, especially for newer SEOs, to understand what each issue is and why it matters.

Understanding issues found within a technical audit allows you to analyze a site fully and come up with a comprehensive strategy.

In this guide, I am going to walk you through a step-by-step process for a successful tech audit, explaining what each issue is and, perhaps more importantly, where it should sit on your priority list.

Whether it’s to make improvements on your own site or recommendations for your first client, this guide will help you to complete a technical SEO audit successfully and confidently in eight steps.

But first, let’s clarify some basics.


What is a technical SEO audit?

Technical SEO is the core foundation of any website. A technical SEO audit is an essential part of site maintenance that analyzes the technical aspects of your website.

An audit will check if a site is optimized properly for the various search engines, including Google, Bing, Yahoo, etc.

This includes ensuring there are no crawlability or indexation issues preventing your site from appearing on the search engine results pages (SERPs).

An audit involves analyzing all elements of your site to make sure that you have not missed out on anything that could be hindering the optimization process. In many cases, some minor changes can improve your ranking significantly.

Also, an audit can highlight technical problems your website has that you may not be aware of, such as hreflang errors, canonical issues, or mixed content problems.

When should you perform a technical SEO audit?

Generally speaking, I always like to do an initial audit on a new site—whether that is one I just built or one I am seeing for the first time from a client—and then audits on a quarterly basis.


I think it is advisable to get into good habits with regular audits as part of ongoing site maintenance. This is especially true if you are working with a site that continuously publishes new content.

It is also a good idea to perform an SEO audit when you notice that your rankings are stagnant or declining.

What do you need from a client before completing a technical audit?

Even if a client comes to me with goals that are not necessarily “tech SEO focused,” such as link building or creating content, it is important to remember that any technical issue can impede the success of the work we do going forward.

It is always important to assess the technical aspects of the site, offer advice on how to make improvements, and explain how those technical issues may impact the work we intend to do together.

With that said, if you intend on performing a technical audit on a website that is not your own, at a minimum, you will need access to the Google Search Console and Google Analytics accounts for that site.

How to perform a technical SEO audit in eight steps

For the most part, technical SEO audits are not easy. Unless you have a very small, simple business site that was perfectly built by an expert SEO, you’re likely going to run into some technical issues along the way.


Often, especially with more complex sites, such as those with a large number of pages or those in multiple languages, audits can be like an ever-evolving puzzle that can take days or even weeks to crack.

Regardless of whether you are looking to audit your own small site or a large one for a new client, I’m going to walk you through the eight steps that will help you to identify and fix some of the most common technical issues.

Step 1. Crawl your website

All you need to get started here is to set up a project in Ahrefs’ Site Audit, which you can even access for free as part of Ahrefs Webmaster Tools.

This tool scans your website to check how many URLs there are, how many are indexable, how many are not, and how many have issues.

From this, the audit tool creates an in-depth report on everything it finds to help you identify and fix any issues that are hindering your site’s performance.

Of course, more advanced issues may need further investigation that involves other tools, such as Google Search Console. But our audit tool does a great job at highlighting key issues, especially for beginner SEOs.


First, to run an audit with Site Audit, you will need to ensure your website is connected to your Ahrefs account as a project. The easiest way to do this is via Google Search Console, although you can verify your ownership by adding a DNS record or HTML file.

Verifying ownership in Ahrefs' Site Audit

Once your ownership is verified, it is a good idea to check the Site Audit settings before running your first crawl. If you have a bigger site, it is always best to increase the crawl speed before you start.

Changing crawl settings in Ahrefs' Site Audit

There are a number of standard settings in place. For a small, personal site, these settings may be fine as they are. However, a setting like the maximum number of pages crawled under “Limits” is something you may want to alter for bigger projects.

Setting the maximum number of pages crawled in Ahrefs' Site Audit

Also, if you are looking for in-depth insight on Core Web Vitals (CWV), you may want to add your Google API key here too.

Core Web Vitals settings in Ahrefs' Site Audit

Once happy with the settings, you can run a new crawl under the “Site Audit” tab.
Running a crawl in Ahrefs' Site Audit

Initially, after running the audit, you will be directed to the “Overview” page. This will give you a top-level view of what the tool has found, including the number of indexable vs. non-indexable pages, top issues, and an overall website health score out of 100.

The health score gives you a quick, easy-to-understand proxy metric for overall website health.

Health score metric in Ahrefs' Site Audit

From here, you can head over to the “All issues” tab. This breaks down all of the problems the crawler has found, how much of a priority they are to be fixed, and how to fix them.

"All issues" tab in Ahrefs' Site Audit

This report, alongside other tools, can help you to start identifying the issues that may be hindering your performance on the SERPs.

Step 2. Spotting crawlability and indexation issues

If your site has pages that can’t be crawled by search engines, your website may not be indexed correctly, if at all. If your website does not appear in the index, it cannot be found by users.

Ensuring that search bots can crawl your website and collect data from it correctly means search engines can accurately place your site on the SERPs and you can rank for those all-important keywords.

There are a few things you need to consider when looking for crawlability issues:

  • Indexation errors
  • Robots.txt errors
  • Sitemap issues
  • Optimizing the crawl budget

Identifying indexation issues

Priority: High

Ensuring your pages are indexed is imperative if you want to appear anywhere on Google.

The simplest way to check how your site is indexed is by heading to Google Search Console and checking the Coverage report. Here, you can see exactly which pages are indexed, which pages have warnings, as well as which ones are excluded and why:

Coverage report in Google Search Console

Note that pages will only appear in the search results if they are indexed without any issues.

If your pages are not being indexed, there are a number of issues that may be causing this. We will take a look at the top few below, but you can also check our other guide for a more in-depth walkthrough.


Checking the robots.txt file

Priority: High

The robots.txt file is arguably the most straightforward file on your website, but it is something people consistently get wrong. It lets you advise search engines on how to crawl your site, and it is easy to make errors.

Most search engines, especially Google, like to abide by the rules you set out in the robots.txt file. So if you accidentally tell a search engine not to crawl and/or index certain URLs or even your entire site, that’s what will happen.

This is what a robots.txt file that tells search engines not to crawl any pages looks like:

Disallowing search engines via robots.txt
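In plain text, that disallow-everything configuration is just two lines: a wildcard user-agent rule and a disallow rule covering the whole site.

User-agent: *
Disallow: /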

Often, these instructions are left in the file even after the site goes live, preventing the site from being crawled. This is one of those rare easy fixes that acts as a panacea for your SEO.

You can also check whether a single page is accessible and indexed by typing the URL into the Google Search Console search bar. If it’s not indexed yet and it’s accessible, you can “Request Indexing.”

Requesting indexing in Google Search Console

The Coverage report in Google Search Console can also let you know if you’re blocking certain pages in robots.txt despite them being indexed:
Pages blocked via robots.txt in Google Search Console

Robots meta tags

Priority: High

A robots meta tag is an HTML snippet that tells search engines how to crawl or index a certain page. It’s placed into the <head> section of a webpage and looks like this:

<meta name="robots" content="noindex" />

The noindex directive is the most common one. And as you’ve probably guessed, it tells search engines not to index the page. We also often see the following robots meta tag on pages across whole websites:

<meta name="robots" content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" />

This tells Google to use any of your content freely on its SERPs. The Yoast SEO plugin for WordPress adds this by default unless you add noindex or nosnippet directives.

If there are no robots meta tags on the page, search engines treat it as index, follow, meaning they can index the page and crawl all the links on it.


But noindex actually has a lot of uses:

  • Thin pages with little or no value for the user
  • Pages in the staging environment
  • Admin and thank-you pages
  • Internal search results
  • PPC landing pages
  • Pages about upcoming promotions, contests, or product launches
  • Duplicate content (use canonical tags to suggest the best version for indexing)

But improper use also happens to be a top indexability issue. Accidentally using the wrong directive can have a detrimental effect on your presence on the SERPs, so use it with care.

Checking the sitemap

Priority: High

An XML sitemap helps Google to navigate all of the important pages on your website. Considering crawlers can’t stop and ask for directions, a sitemap ensures Google has a set of instructions when it comes to crawling and indexing your website.

But much like crawlers can be accidentally blocked via the robots.txt file, pages can be left out of the sitemap, meaning they likely won’t get prioritized for crawling.

Also, by having pages in your sitemap that shouldn’t be there, such as broken pages, you can confuse crawlers and affect your crawl budget (more on that next).

You can check sitemap issues in Site Audit: Site Audit > All issues > Other.

Sitemap issues in Ahrefs' Site Audit

The main thing here is to ensure that all of the important pages you want indexed are in your sitemap, and to avoid including anything else.
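If you have never looked inside one, a minimal XML sitemap listing a single page looks like the snippet below (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>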

Checking the crawl budget

Priority: High (for large websites)

Crawl budget refers to how many pages a search engine will crawl on your site and how quickly it will crawl them.

A variety of things influence the crawl budget. These include the number of resources on the website, as well as how valuable Google deems your indexable pages to be.

Having a big crawl budget does not guarantee that you will rank at the top of the SERPs. But if all of your critical pages are not crawled due to crawl budget concerns, it is possible that those pages may not be indexed.

Your pages are likely being scanned as part of your daily crawl budget if they are popular, receive organic traffic and links, and are well-linked internally across your site.

New pages—as well as those that are not linked internally or externally, e.g., those found on newer sites—may not be crawled as frequently, if at all.


For larger sites with millions of pages, or sites that are updated often, crawl budget can be an issue. In general, if you have a large number of pages that aren’t being crawled or updated as frequently as you’d like, you should look into speeding up crawling.

Using the Crawl Stats report in Google Search Console can give you insight into how your site is being crawled and any issues that may have been flagged by the Googlebot.

Crawling insights via Google Search Console

You will also want to look into any flagged crawl statuses like the ones shown here:

Crawl status codes you might see in Google Search Console

Step 3. Checking technical on-page elements

It is important to check your on-page fundamentals. Although many SEOs may tell you that on-page issues like imperfect meta descriptions aren’t a big deal, I personally think addressing them is part of good SEO housekeeping.

Even Google’s John Mueller previously stated that having multiple H1 tags on a webpage isn’t an issue. However, let’s think about SEO as a points system.

If you and a competitor have sites that stand shoulder to shoulder on the SERP, then even the most basic of issues could be the catalyst that determines who ranks at the top. So in my opinion, even the most basic of housekeeping issues should be addressed.

So let’s take a look at the following:

  • Page titles and title tags
  • Meta descriptions
  • Canonical tags
  • Hreflang tags
  • Structured data

Page titles and title tags

Priority: Medium

Title tags have a lot more value than most people give them credit for. Their job is to let Google and site visitors know what a webpage is about—like this:


Title tag in Google search

Here’s what it looks like in raw HTML format:

<title>How to Craft the Perfect SEO Title Tag (Our 4-Step Process)</title>

In recent years, title tags have sparked a lot of debate in the SEO world. Google, it turns out, is likely to modify your title tag if it doesn’t like it.

Google rewrites around a third of title tags

One of the biggest reasons Google rewrites title tags is that they are simply too long. This is one issue that is highlighted within Site Audit.

Title tag rewrites highlighted in Ahrefs' Site Audit

In general, it is good practice to ensure all of your pages have title tags, none of which are longer than 60 characters.

Meta descriptions

Priority: Low


A meta description is an HTML tag that summarizes the contents of a page. It may be displayed as a snippet under the title tag in the search results to give further context.

Meta description shown in Google search
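In the page’s HTML, the meta description sits in the <head>; the wording below is just an illustrative example:

<meta name="description" content="Learn how to run a technical SEO audit in eight steps, from crawling your site to fixing indexation issues." />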

More visitors will click through to your website from the search results if it has a captivating meta description. And even though Google displays your supplied meta description only around 37% of the time, it is still important to ensure your most important pages have great ones.

Site Audit will also flag any meta descriptions that are missing, as well as those that are too long or too short.

Meta description issues highlighted in Ahrefs' Site Audit

But writing meta descriptions is more than just filling a space. It’s about enticing potential site visitors.

Check canonical tags

Priority: High

A canonical tag (rel="canonical") specifies the primary version of a set of duplicate or near-duplicate pages. To put it another way, if you have roughly the same content available under several URLs, you should use canonical tags to designate which version is the primary one that should be indexed.


How canonicalization works
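In practice, a duplicate page points to the primary version with a single line in its <head> (the URL below is a placeholder):

<link rel="canonical" href="https://example.com/primary-page/" />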

Canonical tags are an important part of SEO, mainly because Google doesn’t like duplicate content. Also, using canonical tags incorrectly (or not at all) can seriously affect your crawl budget.

If spiders are wasting their time crawling duplicate pages, it can mean that valuable pages are being missed.

You can find duplicate content issues in Site Audit: Site Audit > Reports > Duplicates > Issues.

Duplicate pages without canonical via Ahrefs' Site Audit

International SEO: hreflang tags

Priority: High

Although hreflang is seemingly yet another simple HTML tag, it is possibly the most complex SEO element to get your head around.

The hreflang tag is imperative for sites in multiple languages. If you have many versions of the same page in a different language or target different parts of the world—for example, one version in English for the U.S. and one version in French for France—you need hreflang tags.
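Using that example, each language version would carry a set of hreflang annotations in its <head>, along these lines (the URLs are placeholders, and each version should list every alternate, including itself):

<link rel="alternate" hreflang="en-us" href="https://example.com/en-us/page/" />
<link rel="alternate" hreflang="fr-fr" href="https://example.com/fr-fr/page/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/page/" />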

Translating a website is time consuming and costly, so you will want that effort to pay off by ensuring all versions show up in the relevant search results. Done properly, it also gives a better user experience by catering to users who consume content in different languages.

Plus, as clusters of multiple-language pages share each other’s ranking signals, using hreflang tags correctly can have a direct impact on rankings. This is alluded to by Gary Illyes from Google in this video.

You can find hreflang tag issues in Site Audit under localization: Site Audit > All issues > Localization.

Localization issues via Ahrefs' Site Audit

Structured data

Priority: High

Structured data, often referred to as schema markup, has a number of valuable uses in SEO.


Most prominently, structured data is used to help you get rich results or features in the Knowledge Panel. Here’s a great example: for recipe results, extra details are shown for each result, such as the rating.

Recipe results with structured data

You also get a feature in the Knowledge Panel that shows what a chocolate chip cookie is (along with some nutritional information):

Knowledge card in Google search

Because structured data helps Google better understand not only your website but also detailed information such as authors, it can both aid semantic search and improve expertise, authoritativeness, and trustworthiness, aka E-A-T.

Nowadays, JSON-LD is the preferred format for structured data, so stick with it if possible. But you may also encounter Microdata and RDFa.
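As a simple illustration, article markup in JSON-LD sits inside a script tag; the values below are placeholders:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Complete a Technical SEO Audit in 8 Steps",
  "author": {
    "@type": "Person",
    "name": "Jane Doe"
  }
}
</script>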

As part of your technical audit, you should be testing your structured data. A great tool for this is the Classy Schema testing tool.

Schema markup testing tool

You can also check your eligibility for rich results with Google’s Rich Results Test.

Google's Rich Results testing tool

Step 4. Identifying image issues

Image optimization is often overlooked when it comes to SEO. However, it has a number of benefits, including:

  • Improved load speed
  • More traffic from Google Images
  • A more engaging user experience
  • Improved accessibility

Image issues can be found in the main audit report: Site Audit > Reports > Images.

Image issues via Ahrefs' Site Audit

Broken images

Priority: High

Broken images cannot be displayed on your website. This makes for a bad user experience in general, but it can also look spammy, giving visitors the impression that the site is not well maintained or professional.

This can be especially problematic for anyone who monetizes their website, as it can make the website seem less trustworthy.

Image file size too large

Priority: High

Large images on your website can seriously impact your site speed and performance. Ideally, you want to display images in the smallest possible size and in an appropriate format, such as WebP.


The best option is to optimize the image file size before uploading the image to your website. Tools like TinyJPG can optimize your images before they’re added to your site.

If you are looking to optimize existing images, there are tools available, especially for more popular content management systems (CMSs) like WordPress. Plugins such as Imagify or WP-Optimize are great examples.

HTTPS page links to HTTP image

Priority: Medium

HTTPS pages that link to HTTP images cause what are called “mixed content issues.” This means the page itself loads securely over HTTPS, but a resource it loads, such as an image or video, is served over an insecure HTTP connection.

Mixed content is a security issue. For those who monetize sites with display ads, it can even prevent ad providers from allowing ads on your site. It also degrades the user experience of your website.

By default, certain browsers block insecure resource requests. If your page relies on these insecure resources, it may not function correctly when they are blocked.
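Assuming the image is also available over HTTPS, the fix is usually as simple as updating the URL of the resource (the URLs below are placeholders):

<!-- Mixed content: insecure image requested from a secure page -->
<img src="http://example.com/images/photo.jpg" alt="Product photo">

<!-- Fixed: the same image requested over HTTPS -->
<img src="https://example.com/images/photo.jpg" alt="Product photo">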


Missing alt text

Priority: Low

Alt text, or alternative text, describes an image on a website. It is an incredibly important part of image optimization, as it improves accessibility on your website for millions of people throughout the world who are visually impaired.

Often, those with a visual impairment use screen readers, which convert images into audio. Essentially, this is describing the image to the site visitor. Properly optimized alt text allows screen readers to inform site users with visual impairments exactly what they are seeing.
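In HTML, alt text is simply an attribute on the image tag; the filename and description below are examples:

<img src="chocolate-chip-cookies.jpg" alt="A stack of homemade chocolate chip cookies on a white plate">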

Alt text can also serve as anchor text for image links, help you to rank on Google Images, and improve topical relevance.

Step 5. Analyzing internal links

When most people think of “links” for SEO, they think about backlinks. How to build them, how many they should have, and so on.

What many people don’t realize is the sheer importance of internal linking. In fact, internal links are like the jelly to backlinks’ peanut butter. Can you have one without the other? Sure. Are they always better together? You bet!


Not only do internal links help your external link building efforts, but they also make for a better website experience for both search engines and users.

The proper siloing of topics using internal linking creates an easy-to-understand topical roadmap for everyone who comes across your site. This has a number of benefits:

  • Creates relevancy for keywords
  • Helps ensure all content is crawled
  • Makes it easy for visitors to find relevant content or products

Example of siloing on fitness website

Of course, when done right, all of this makes sense. But internal links should be audited when you first get your hands on a site because things may not be as orderly as you’d like.

4xx status codes

Priority: High

Go to Site Audit > Internal pages > Issues tab > 4XX page.

4XX page errors via Ahrefs' Site Audit

Here, you can see all of your site’s broken internal pages.

These are problematic because they waste “link equity” and provide users with a negative experience.

Here are a few options for dealing with these issues:

  • Bring back the broken page at the same address (if deleted by accident)
  • Redirect the broken page to a more appropriate location and update or remove all internal links pointing to it (see the example below)
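How you implement the redirect depends on your setup. On an Apache server, for instance, a hypothetical rule in .htaccess could look like this (the path and URL are placeholders):

Redirect 301 /old-broken-page/ https://example.com/new-page/

A 301 tells search engines the move is permanent, so link equity is consolidated on the new URL.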

Orphan pages

Priority: High

Go to Site Audit > Links > Issues tab > Orphan page (has no incoming internal links).

Orphan page issues via Ahrefs' Site Audit

Here, we highlight pages that have zero internal links pointing to them.

There are two reasons why indexable pages should not be orphaned:

  • They receive no PageRank from internal links because there are none.
  • They may not be found by Google at all unless you submit your sitemap through Google Search Console or they have backlinks from crawled pages on other websites.

If your website has multiple orphaned pages, filter the list from high to low by organic traffic. Adding internal links to orphaned pages that still receive organic traffic will likely help them gain far more.

Step 6. Checking external links

External links are hyperlinks within your pages that link to another domain. That means all of your backlinks—the links to your website from another one—are someone else’s external links.


See how the magic of the internet is invisibly woven together? *mind-blown emoji*

External links are often used to back up sources in the form of citations. For example, if I am writing a blog post and discussing metrics from a study, I’ll externally link to where I found that authoritative source.

Linking to credible sources makes your own website more credible to both visitors and search engines. This is because you show that your information is backed up with sound research.

Here’s what Google’s John Mueller said about external links:



How Compression Can Be Used To Detect Low Quality Pages


Compression can be used by search engines to detect low-quality pages. Although not widely known, it's useful foundational knowledge for SEO.

The concept of Compressibility as a quality signal is not widely known, but SEOs should be aware of it. Search engines can use web page compressibility to identify duplicate pages, doorway pages with similar content, and pages with repetitive keywords, making it useful knowledge for SEO.

Although the following research paper demonstrates a successful use of on-page features for detecting spam, the deliberate lack of transparency by search engines makes it difficult to say with certainty if search engines are applying this or similar techniques.

What Is Compressibility?

In computing, compressibility refers to how much a file (data) can be reduced in size while retaining essential information, typically to maximize storage space or to allow more data to be transmitted over the Internet.

TL;DR Of Compression

Compression replaces repeated words and phrases with shorter references, reducing the file size by significant margins. Search engines typically compress indexed web pages to maximize storage space, reduce bandwidth, and improve retrieval speed, among other reasons.

This is a simplified explanation of how compression works:

  • Identify Patterns:
    A compression algorithm scans the text to find repeated words, patterns, and phrases.
  • Shorter References Use Fewer Bits:
    The “code” that stands in for each replaced word or phrase uses less data than the original.
  • Shorter Codes Take Up Less Space:
    The codes and symbols use less storage space than the original words and phrases, which results in a smaller file size.

A bonus effect of using compression is that it can also be used to identify duplicate pages, doorway pages with similar content, and pages with repetitive keywords.

Research Paper About Detecting Spam

This research paper is significant because it was authored by distinguished computer scientists known for breakthroughs in AI, distributed computing, information retrieval, and other fields.


Marc Najork

One of the co-authors of the research paper is Marc Najork, a prominent research scientist who currently holds the title of Distinguished Research Scientist at Google DeepMind. He’s a co-author of the papers for TW-BERT, has contributed research for increasing the accuracy of using implicit user feedback like clicks, and worked on creating improved AI-based information retrieval (DSI++: Updating Transformer Memory with New Documents), among many other major breakthroughs in information retrieval.

Dennis Fetterly

Another of the co-authors is Dennis Fetterly, currently a software engineer at Google. He is listed as a co-inventor in a patent for a ranking algorithm that uses links, and is known for his research in distributed computing and information retrieval.

Those are just two of the distinguished researchers listed as co-authors of the 2006 Microsoft research paper about identifying spam through on-page content features. Among the several on-page content features the research paper analyzes is compressibility, which they discovered can be used as a classifier for indicating that a web page is spammy.

Detecting Spam Web Pages Through Content Analysis

Although the research paper was authored in 2006, its findings remain relevant today.

Then, as now, people attempted to rank hundreds or thousands of location-based web pages that were essentially duplicate content aside from city, region, or state names. Then, as now, SEOs often created web pages for search engines by excessively repeating keywords within titles, meta descriptions, headings, internal anchor text, and within the content to improve rankings.

Section 4.6 of the research paper explains:


“Some search engines give higher weight to pages containing the query keywords several times. For example, for a given query term, a page that contains it ten times may be higher ranked than a page that contains it only once. To take advantage of such engines, some spam pages replicate their content several times in an attempt to rank higher.”

The research paper explains that search engines compress web pages and use the compressed version to reference the original web page. The researchers note that excessive amounts of redundant words result in a higher level of compressibility, so they set about testing whether there is a correlation between a high level of compressibility and spam.

They write:

“Our approach in this section to locating redundant content within a page is to compress the page; to save space and disk time, search engines often compress web pages after indexing them, but before adding them to a page cache.

…We measure the redundancy of web pages by the compression ratio, the size of the uncompressed page divided by the size of the compressed page. We used GZIP …to compress pages, a fast and effective compression algorithm.”
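As a rough illustration of the heuristic the paper describes (not something search engines are confirmed to run), the ratio is easy to compute with gzip. The sample strings below are made up purely for the demonstration:

import gzip

def compression_ratio(text: str) -> float:
    # Ratio from the paper: uncompressed size divided by gzip-compressed size.
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

# A keyword-stuffed page is highly repetitive, so it compresses extremely well
# and produces a much higher ratio than varied, natural prose.
stuffed = "best cheap hotels in springfield " * 300
varied = (
    "Our downtown guide covers museums, late-night diners, river walks, "
    "and the quirky bookshops locals actually recommend. "
) * 3

print(round(compression_ratio(stuffed), 1))  # very high ratio
print(round(compression_ratio(varied), 1))   # noticeably lower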

High Compressibility Correlates To Spam

The results of the research showed that web pages with a compression ratio of at least 4.0 tended to be low-quality, spammy pages. However, results at the highest compression ratios became less consistent because there were fewer data points, making them harder to interpret.

Figure 9: Prevalence of spam relative to compressibility of page.

The researchers concluded:


“70% of all sampled pages with a compression ratio of at least 4.0 were judged to be spam.”

But they also discovered that using the compression ratio by itself still resulted in false positives, where non-spam pages were incorrectly identified as spam:

“The compression ratio heuristic described in Section 4.6 fared best, correctly identifying 660 (27.9%) of the spam pages in our collection, while misidentifying 2,068 (12.0%) of all judged pages.

Using all of the aforementioned features, the classification accuracy after the ten-fold cross validation process is encouraging:

95.4% of our judged pages were classified correctly, while 4.6% were classified incorrectly.

More specifically, for the spam class 1,940 out of the 2,364 pages were classified correctly. For the non-spam class, 14,440 out of the 14,804 pages were classified correctly. Consequently, 788 pages were classified incorrectly.”

The next section describes an interesting discovery about how to increase the accuracy of using on-page signals for identifying spam.

Insight Into Quality Rankings

The research paper examined multiple on-page signals, including compressibility. The researchers discovered that while each individual signal (classifier) was able to find some spam, relying on any one signal on its own resulted in flagging non-spam pages as spam, commonly referred to as false positives.


The researchers made an important discovery that everyone interested in SEO should know, which is that using multiple classifiers increased the accuracy of detecting spam and decreased the likelihood of false positives. Just as important, the compressibility signal only identifies one kind of spam but not the full range of spam.

The takeaway is that compressibility is a good way to identify one kind of spam, but other kinds of spam are not caught by this one signal.

This is the part that every SEO and publisher should be aware of:

“In the previous section, we presented a number of heuristics for assaying spam web pages. That is, we measured several characteristics of web pages, and found ranges of those characteristics which correlated with a page being spam. Nevertheless, when used individually, no technique uncovers most of the spam in our data set without flagging many non-spam pages as spam.

For example, considering the compression ratio heuristic described in Section 4.6, one of our most promising methods, the average probability of spam for ratios of 4.2 and higher is 72%. But only about 1.5% of all pages fall in this range. This number is far below the 13.8% of spam pages that we identified in our data set.”

So, even though compressibility was one of the better signals for identifying spam, it still was unable to uncover the full range of spam within the dataset the researchers used to test the signals.

Combining Multiple Signals

The above results indicated that individual signals of low quality are less accurate. So they tested using multiple signals and discovered that combining multiple on-page signals for detecting spam resulted in better accuracy, with fewer pages misclassified as spam.


The researchers explained that they tested the use of multiple signals:

“One way of combining our heuristic methods is to view the spam detection problem as a classification problem. In this case, we want to create a classification model (or classifier) which, given a web page, will use the page’s features jointly in order to (correctly, we hope) classify it in one of two classes: spam and non-spam.”

These are their conclusions about using multiple signals:

“We have studied various aspects of content-based spam on the web using a real-world data set from the MSNSearch crawler. We have presented a number of heuristic methods for detecting content based spam. Some of our spam detection methods are more effective than others, however when used in isolation our methods may not identify all of the spam pages. For this reason, we combined our spam-detection methods to create a highly accurate C4.5 classifier. Our classifier can correctly identify 86.2% of all spam pages, while flagging very few legitimate pages as spam.”

Key Insight:

Misidentifying “very few legitimate pages as spam” was a significant breakthrough. The important insight that everyone involved with SEO should take away from this is that one signal by itself can result in false positives. Using multiple signals increases the accuracy.

What this means is that SEO tests of isolated ranking or quality signals will not yield reliable results that can be trusted for making strategy or business decisions.

Takeaways

We don’t know for certain whether compressibility is used by search engines, but it is an easy-to-use signal that, combined with others, could catch simple kinds of spam, like thousands of city-name doorway pages with similar content. Even if search engines don’t use this signal, it shows how easy it is to catch that kind of search engine manipulation and that it is something search engines are well able to handle today.

Here are the key points of this article to keep in mind:

  • Doorway pages with duplicate content are easy to catch because they compress at a higher ratio than normal web pages.
  • Groups of web pages with a compression ratio above 4.0 were predominantly spam.
  • Negative quality signals used by themselves to catch spam can lead to false positives.
  • In this particular test, the researchers discovered that on-page negative quality signals only catch specific types of spam.
  • When used alone, the compressibility signal only catches redundancy-type spam, fails to detect other forms of spam, and leads to false positives.
  • Combining quality signals improves spam detection accuracy and reduces false positives.
  • Search engines today detect spam with higher accuracy through the use of AI like SpamBrain.

Read the research paper, which is linked from the Google Scholar page of Marc Najork:

Detecting spam web pages through content analysis



New Google Trends SEO Documentation


Google publishes new documentation for how to use Google Trends for search marketing

Google Search Central published new documentation on Google Trends, explaining how to use it for search marketing. This guide serves as an easy to understand introduction for newcomers and a helpful refresher for experienced search marketers and publishers.

The new guide has six sections:

  1. About Google Trends
  2. Tutorial on monitoring trends
  3. How to do keyword research with the tool
  4. How to prioritize content with Trends data
  5. How to use Google Trends for competitor research
  6. How to use Google Trends for analyzing brand awareness and sentiment

The section about monitoring trends explains that there are two kinds of rising trends, general and specific, both of which can be useful for developing content to publish on a site.

Using the Explore tool, you can leave the search box empty and view the current rising trends worldwide, or use a drop-down menu to focus on trends in a specific country. Users can further filter rising trends by time period, category, and type of search. The results show rising trends by topic and by keyword.

To search for specific trends, users just need to enter their queries and then filter them by country, time, category, and type of search.

The section called Content Calendar describes how to use Google Trends to understand which content topics to prioritize.


Google explains:

“Google Trends can be helpful not only to get ideas on what to write, but also to prioritize when to publish it. To help you better prioritize which topics to focus on, try to find seasonal trends in the data. With that information, you can plan ahead to have high quality content available on your site a little before people are searching for it, so that when they do, your content is ready for them.”

Read the new Google Trends documentation:

Get started with Google Trends



All the best things about Ahrefs Evolve 2024


Hey all, I’m Rebekah and I am your Chosen One to “do a blog post for Ahrefs Evolve 2024”.

What does that entail exactly? I don’t know. In fact, Sam Oh asked me yesterday what the title of this post would be. “Is it like…Ahrefs Evolve 2024: Recap of day 1 and day 2…?” 

Even as I nodded, I couldn’t get over how absolutely boring that sounded. So I’m going to do THIS instead: a curation of all the best things YOU loved about Ahrefs’ first conference, lifted directly from X.

Let’s go!

OUR HUGE SCREEN

CONFERENCE VENUE ITSELF

It was recently named the best new skyscraper in the world, by the way.

 

OUR AMAZING SPEAKER LINEUP – SUPER INFORMATIVE, USEFUL TALKS!

 


GREAT MUSIC

 

AMAZING GOODIES

 

SELFIE BATTLE

Some background: Tim and Sam have a challenge going on to see who can take the most selfies with all of you. Last I heard, Sam was winning – but there is room for a comeback yet!

 

THAT BELL

Everybody’s just waiting for this one.

 

STICKER WALL

AND, OF COURSE…ALL OF YOU!

 


There’s a TON more content on LinkedIn – click here – but I have limited time to get this post up and can’t quite figure out how to embed LinkedIn posts so…let’s stop here for now. I’ll keep updating as we go along!


