Semantic Keyword Clustering For 10,000+ Keywords [With Script]

Semantic keyword clustering can help take your keyword research to the next level.

In this article, you’ll learn how to use a Google Colaboratory sheet shared exclusively with Search Engine Journal readers.

This article will walk you through using the Google Colab sheet, give you a high-level view of how it works under the hood, and show you how to adjust it to suit your needs.

But first, why cluster keywords at all?

Common Use Cases For Keyword Clustering

Here are a few use cases for clustering keywords.

Faster Keyword Research:

  • Filter out branded keywords or keywords with no commercial value.
  • Group related keywords together to create more in-depth articles.
  • Group related questions and answers together for FAQ creation.

Paid Search Campaigns:

  • Create negative keyword lists for Ads using large datasets faster – stop wasting money on junk keywords!
  • Group similar keywords into campaign ideas for Ads.

Here’s an example of the script clustering similar questions together, perfect for an in-depth article!

Screenshot from Microsoft Excel, February 2022

Issues With Earlier Versions Of This Tool

If you’ve been following my work on Twitter, you’ll know I’ve been experimenting with keyword clustering for a while now.

Earlier versions of this script were based on the excellent PolyFuzz library using TF-IDF matching.

While it got the job done, it always produced some head-scratching clusters, and I felt the results could be improved.

Words that shared a similar pattern of letters would be clustered even if they were unrelated semantically.

Conversely, it was unable to cluster semantically related words such as “Bike” and “Bicycle”.

Earlier versions of the script also had other issues:

  • It didn’t work well in languages other than English.
  • It left a high number of keywords unclustered.
  • There wasn’t much control over how the clusters were created.
  • The script was limited to ~10,000 rows before it timed out due to a lack of resources.

Semantic Keyword Clustering Using Deep Learning Natural Language Processing (NLP)

Fast forward four months to the latest release, which has been completely rewritten to use state-of-the-art deep learning sentence embeddings.

Check out some of these awesome semantic clusters!

Notice that heated, thermal, and warm are contained within the same cluster of keywords?

Excel sheet showing an example of semantic keyword clustering. Screenshot from Microsoft Excel, February 2022

Or how about, Wholesale and Bulk?

Excel sheet showing another example of semantic keyword clustering. Screenshot from Microsoft Excel, February 2022

Dog and Dachshund, Xmas and Christmas?

Excel sheet showing another example of semantic keyword clustering, with Dachshund and dogs grouped together. Screenshot from Microsoft Excel, February 2022

It can even cluster keywords in over one hundred different languages!

Excel sheet showing another example of semantic keyword clustering in French. Screenshot from Microsoft Excel, February 2022

Features Of The New Script Versus Earlier Iterations

In addition to semantic keyword grouping, the following improvements have been added to the latest version of this script.

  • Support for clustering 10,000+ keywords at once.
  • Fewer keywords left in “no cluster” groups.
  • Ability to choose different pre-trained models (although the default model works fine!).
  • Ability to choose how closely related clusters should be.
  • Choice of the minimum number of keywords to use per cluster.
  • Automatic detection of character encoding and CSV delimiters.
  • Multi-lingual clustering.
  • Works with many common keyword exports out of the box. (Search Console Data, AdWords or third-party keyword tools like Ahrefs and Semrush).
  • Works with any CSV file with a column named “Keyword.”
  • Simple to use (The script works by inserting a new column called Cluster Name to any list of keywords uploaded).
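As a rough illustration of the delimiter detection and the “Keyword” column requirement listed above, here is a minimal sketch using only Python's standard library. It is not the script's own code: the function name read_keywords and the sample data are hypothetical, and character-encoding detection is left out.

```python
import csv
import io

def read_keywords(raw_text, column="Keyword"):
    """Detect the delimiter, then return the non-empty values of one column."""
    # csv.Sniffer guesses the delimiter from a sample of the file contents.
    dialect = csv.Sniffer().sniff(raw_text[:2048], delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(raw_text), dialect=dialect)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"No column named {column!r} found")
    return [row[column].strip() for row in reader if (row[column] or "").strip()]

# Hypothetical semicolon-delimited export.
sample = "Keyword;Volume\nalpaca socks;900\nalpaca wool socks;300\n"
keywords = read_keywords(sample)
print(keywords)
```

Restricting the sniffer to a handful of likely delimiters keeps it from mistaking spaces inside keywords for column separators.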

How To Use The Script In Five Steps (Quick Start)

To get started, you will need to click this link, and then choose the option, Open in Colab as shown below.

How to open Google Colab from GitHub. Screenshot from Google Colaboratory, February 2022

Change the Runtime type to GPU by selecting Runtime > Change Runtime Type.

Google Colab: how to change settings to use the GPU. Screenshot from Google Colaboratory, February 2022

Select Runtime > Run all from the top navigation within Google Colaboratory (or just press Ctrl+F9).

How to run all cells in Google Colab. Screenshot from Google Colaboratory, February 2022

Upload a .csv file containing a column called “Keyword” when prompted.

How to upload a file using Google Colab. Screenshot from Google Colaboratory, February 2022

Clustering should be fairly quick, but ultimately it depends on the number of keywords, and the model used.

Generally speaking, you should be good for 50,000 keywords.

If you see a CUDA out-of-memory error, you’re trying to cluster too many keywords at the same time!

(It’s worth noting that this script can easily be adapted to run on a local machine without the confines of Google Colaboratory.)

The Script Output

The script will run and append clusters to your original file in a new column called Cluster Name.

Cluster names are assigned using the shortest length keyword in the cluster.

For example, the cluster name for the following group of keywords has been set as “alpaca socks” because that is the shortest keyword in the cluster.

Example output from the script showing alpaca socks keywords grouped together. Screenshot from Microsoft Excel, February 2022
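The naming rule can be sketched in a couple of lines of Python. This is an illustration rather than the script's own code: the function name name_cluster is hypothetical, and breaking ties alphabetically is an assumption made only to keep the output stable.

```python
def name_cluster(keywords):
    """Pick the shortest keyword as the cluster name.

    Ties are broken alphabetically (an assumption, not taken from the script).
    """
    return min(keywords, key=lambda kw: (len(kw), kw))

cluster = ["alpaca socks for men", "alpaca socks", "warm alpaca socks"]
cluster_name = name_cluster(cluster)
print(cluster_name)
```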

Once clustering has been completed, a new file is automatically saved, with clusters appended in a new column to the original file.

How The Keyword Clustering Tool Works

This script is based upon the Fast Clustering algorithm and uses models which have been pre-trained at scale on large amounts of data.

This makes it easy to compute the semantic relationships between keywords using off-the-shelf models.

(You don’t have to be a data scientist to use it!)
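To give a feel for the idea, here is a simplified, pure-Python sketch of threshold-based clustering on embedding vectors. It is a stand-in for the concept, not the Fast Clustering implementation the script relies on: the function greedy_clusters and the three-dimensional vectors are made up for illustration, whereas the real script obtains much larger embeddings from a pre-trained sentence transformer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def greedy_clusters(vectors, threshold=0.85, min_cluster_size=2):
    """Group keywords whose similarity to a seed keyword meets the threshold."""
    names = list(vectors)
    # For every keyword, list the keywords at or above the threshold (itself included).
    neighbours = {
        n: [m for m in names if cosine(vectors[n], vectors[m]) >= threshold]
        for n in names
    }
    clusters, assigned = [], set()
    # Try the seeds with the most neighbours first.
    for seed in sorted(names, key=lambda n: len(neighbours[n]), reverse=True):
        group = [m for m in neighbours[seed] if m not in assigned]
        if seed in assigned or len(group) < min_cluster_size:
            continue
        clusters.append(group)
        assigned.update(group)
    unclustered = [n for n in names if n not in assigned]
    return clusters, unclustered

# Made-up embeddings: similar meanings point in similar directions.
toy_vectors = {
    "heated socks": (0.90, 0.10, 0.00),
    "thermal socks": (0.85, 0.20, 0.00),
    "warm socks": (0.80, 0.15, 0.05),
    "bulk coffee": (0.00, 0.10, 0.95),
    "wholesale coffee": (0.05, 0.00, 0.90),
    "dog leash": (0.10, 0.90, 0.10),
}
clusters, unclustered = greedy_clusters(toy_vectors)
print(clusters)
print(unclustered)
```

Because the comparison happens on vectors rather than on letters, “heated”, “thermal”, and “warm” can land in the same group even though the words share few characters.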

In fact, whilst I’ve made it customizable for those who like to tinker and experiment, I’ve chosen some balanced defaults which should be reasonable for most people’s use cases.

Different models can be swapped in and out of the script depending on the requirements, (faster clustering, better multi-language support, better semantic performance, and so on).

After a lot of testing, I found that the all-MiniLM-L6-v2 transformer provided the best balance between speed and accuracy.

If you prefer to experiment, you can replace the existing pre-trained model with any of the models listed here or on the Hugging Face Model Hub.

Swapping In Pre-Trained Models

Swapping in models is as easy as replacing the variable with the name of your preferred transformer.

For example, you can change the default model all-MiniLM-L6-v2 to all-mpnet-base-v2 by editing:

transformer = 'all-MiniLM-L6-v2'

to

transformer = 'all-mpnet-base-v2'

Here’s where you would edit it in the Google Colaboratory sheet.

How to choose a sentence transformer for keyword clustering. Screenshot from Google Colaboratory, February 2022

The Trade-off Between Cluster Accuracy And No Cluster Groups

A common complaint with previous iterations of this script is that it resulted in a high number of unclustered results.

Unfortunately, it will always be a balancing act between cluster accuracy and the number of keywords that end up in a cluster.

A higher cluster accuracy setting will result in a higher number of unclustered results.

There are two variables that can directly influence the size and accuracy of all clusters:

min_cluster_size

and

cluster accuracy

I have set a default of 85 (out of 100) for cluster accuracy and a minimum cluster size of 2.

In testing, I found this to be the sweet spot, but feel free to experiment!

Here’s where to set those variables in the script.

How to set the minimum cluster size and keyword cluster accuracy. Screenshot from Google Colaboratory, February 2022
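The trade-off can be illustrated with a small sketch. Everything here is hypothetical: the similarity scores are invented, the function unclustered_at is not part of the script, and treating a cluster accuracy of 85 as a cosine similarity threshold of 0.85 is an assumption.

```python
# Made-up similarity scores between keyword pairs.
similarities = {
    ("xmas jumper", "christmas jumper"): 0.91,
    ("xmas jumper", "christmas sweater"): 0.82,
    ("christmas jumper", "christmas sweater"): 0.88,
    ("dog bed", "dachshund bed"): 0.79,
}

def unclustered_at(cluster_accuracy, similarities):
    """Return keywords with no partner at or above the accuracy setting (0-100)."""
    threshold = cluster_accuracy / 100  # assumption: 85 maps to 0.85
    keywords = {kw for pair in similarities for kw in pair}
    matched = {
        kw
        for pair, score in similarities.items()
        if score >= threshold
        for kw in pair
    }
    return sorted(keywords - matched)

for accuracy in (75, 85, 95):
    print(accuracy, unclustered_at(accuracy, similarities))
```

With these invented numbers, raising the setting from 75 to 85 leaves the two dog bed keywords without a partner, and at 95 nothing is similar enough to be grouped at all.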

That’s it! I hope this keyword clustering script is useful to your work.

Featured Image: Graphic Grid/Shutterstock





9 Common Technical SEO Issues That Actually Matter

In this article, we’ll see how to find and fix technical SEO issues, but only those that can seriously affect your rankings.

If you’d like to follow along, get Ahrefs Webmaster Tools and Google Search Console (both are free) and check for the following issues.

1. Indexability issues

Indexability is a webpage’s ability to be indexed by search engines. Pages that are not indexable can’t be displayed on the search engine results pages and can’t bring in any search traffic.

Three requirements must be met for a page to be indexable:

  1. The page must be crawlable. If you haven’t blocked Googlebot from accessing the page in robots.txt or you have a website with fewer than 1,000 pages, you probably don’t have an issue there.
  2. The page must not have a noindex tag (more on that in a bit).
  3. The page must be canonical (i.e., the main version). 
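For a single page, the second and third requirements can be spot-checked with a short script. This is a minimal sketch using Python's built-in HTML parser, not how Site Audit works: the class and function names are hypothetical, and it ignores crawlability and any noindex sent through HTTP headers.

```python
from html.parser import HTMLParser

class IndexabilityParser(HTMLParser):
    """Record a robots noindex directive and the canonical URL, if present."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.noindex = self.noindex or "noindex" in content
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")

def indexability_signals(html, page_url):
    """Summarize the on-page signals that affect indexability."""
    parser = IndexabilityParser()
    parser.feed(html)
    return {
        "noindex": parser.noindex,
        "canonical": parser.canonical,
        # A page with no canonical tag, or one pointing at itself, is the main version.
        "self_canonical": parser.canonical in (None, page_url),
    }

page = (
    '<html><head><meta name="robots" content="noindex, follow">'
    '<link rel="canonical" href="https://example.com/a"></head></html>'
)
signals = indexability_signals(page, "https://example.com/a?ref=1")
print(signals)
```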

Solution

In Ahrefs Webmaster Tools (AWT):  

  1. Open Site Audit
  2. Go to the Indexability report 
  3. Click on issues related to canonicalization and “noindex” to see affected pages
Indexability issues in Site Audit

For canonicalization issues in this report, you will need to replace bad URLs in the link rel="canonical" tag with valid ones (i.e., returning an “HTTP 200 OK”). 

As for pages marked by “noindex” issues, these are the pages with the “noindex” meta tag placed inside their code. Chances are most of the pages found in the report should stay as is. But if you see any pages that shouldn’t be there, simply remove the tag. Do make sure those pages aren’t blocked by robots.txt first.

Recommendation

Click on the question mark on the right to see instructions on how to fix each issue. For more detailed instructions, click on the “Learn more” link. 

Instruction on how to fix an SEO issue in Site Audit

2. Sitemap issues

A sitemap should contain only pages that you want search engines to index.

When a sitemap isn’t regularly updated or an unreliable generator has been used to make it, a sitemap may start to show broken pages, pages that became “noindexed,” pages that were de-canonicalized, or pages blocked in robots.txt. 

Solution 

In AWT:

  1. Open Site Audit 
  2. Go to the All issues report
  3. Click on issues containing the word “sitemap” to find affected pages 
Sitemap issues shown in Site Audit

Depending on the issue, you will have to:

  • Delete the pages from the sitemap.
  • Remove the noindex tag on the pages (if you want to keep them in the sitemap). 
  • Provide a valid URL for the reported page. 

3. HTTPS issues

Google uses HTTPS encryption as a small ranking signal. This means you can experience lower rankings if you don’t have an SSL or TLS certificate securing your website.

But even if you do, some pages and/or resources on your pages may still use the HTTP protocol. 

Solution 

Assuming you already have an SSL/TLS certificate for all subdomains (if not, do get one), open AWT and follow these steps:

  1. Open Site Audit
  2. Go to the Internal pages report 
  3. Look at the protocol distribution graph and click on HTTP to see affected pages
  4. Inside the report showing pages, add a column for Final redirect URL 
  5. Make sure all HTTP pages are permanently redirected (301 or 308 redirects) to their HTTPS counterparts 
Protocol distribution graph
Internal pages issues report with added column

Finally, let’s check if any resources on the site still use HTTP: 

  1. Inside the Internal pages report, click on Issues
  2. Click on HTTPS/HTTP mixed content to view affected resources 
Site Audit reporting six HTTPS/HTTP mixed content issues

You can fix this issue by one of these methods:

  • Link to the HTTPS version of the resource (check this option first) 
  • Include the resource from a different host, if available 
  • Download and host the content on your site directly if you are legally allowed to do so
  • Exclude the resource from your site altogether
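To spot-check one page's HTML for mixed content outside of Site Audit, a small parser is enough. The sketch below is illustrative only: the class name MixedContentFinder and the choice of tags to inspect are assumptions, and it does not look inside CSS or JavaScript.

```python
from html.parser import HTMLParser

class MixedContentFinder(HTMLParser):
    """Collect sub-resources that are requested over plain HTTP."""

    RESOURCE_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return  # e.g. <a href> is a hyperlink, not a loaded resource
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" not in rel:
            return  # only stylesheets are treated as loaded resources here
        url = attrs.get(attr_name) or ""
        if url.lower().startswith("http://"):
            self.insecure.append((tag, url))

def find_mixed_content(html):
    """Return (tag, url) pairs for resources loaded over HTTP."""
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure

html = (
    '<img src="http://example.com/a.png">'
    '<script src="https://example.com/app.js"></script>'
    '<a href="http://example.com/">home</a>'
)
insecure = find_mixed_content(html)
print(insecure)
```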

Learn more: What Is HTTPS? Everything You Need to Know 

4. Duplicate content

Duplicate content happens when exact or near-duplicate content appears on the web in more than one place.

It’s bad for SEO mainly for two reasons: It can cause undesirable URLs to show in search results and can dilute link equity.

Content duplication is not necessarily a case of intentional or unintentional creation of similar pages. There are other less obvious causes such as faceted navigation, tracking parameters in URLs, or using trailing and non-trailing slashes.

Solution 

First, check if your website is available under only one URL. Because if your site is accessible as:

  • http://domain.com
  • http://www.domain.com
  • https://domain.com
  • https://www.domain.com

Then Google will see all of those URLs as different websites. 

The easiest way to check if users can browse only one version of your website: type in all four variations in the browser, one by one, hit enter, and see if they get redirected to the master version (ideally, the one with HTTPS). 
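If you prefer to keep a record of that manual check, the following sketch builds the four variations and compares where each one ended up. It makes no network requests: the observed dictionary is hypothetical data you would fill in yourself (from the browser or an HTTP client), and the function names are illustrative.

```python
def site_variants(domain):
    """Return the four protocol/www variations of a domain."""
    bare = domain[4:] if domain.startswith("www.") else domain
    return [
        f"{scheme}://{host}"
        for scheme in ("http", "https")
        for host in (bare, f"www.{bare}")
    ]

def not_redirected(final_urls, master):
    """List the variants whose final URL is not the master version."""
    return [
        variant
        for variant, final in final_urls.items()
        if final.rstrip("/") != master.rstrip("/")
    ]

variants = site_variants("domain.com")
# Hypothetical results: the URL the browser ended up on for each variant.
observed = {variant: "https://www.domain.com/" for variant in variants}
observed["http://domain.com"] = "http://domain.com/"  # this one did not redirect
problems = not_redirected(observed, "https://www.domain.com")
print(variants)
print(problems)
```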

You can also go straight into Site Audit’s Duplicates report. If you see 100% bad duplicates, that is likely the reason.

Duplicates report showing 100% bad duplicates
Simulation (other types of duplicates turned off).

In this case, choose one version that will serve as canonical (likely the one with HTTPS) and permanently redirect other versions to it. 

Then run a New crawl in Site Audit to see if there are any other bad duplicates left. 

Running a new crawl in Site Audit

There are a few ways you can handle bad duplicates depending on the case. Learn how to solve them in our guide.

Learn more: Duplicate Content: Why It Happens and How to Fix It 

5. Broken pages

Pages that can’t be found (4XX errors) and pages returning server errors (5XX errors) won’t be indexed by Google so they won’t bring you any traffic.

Furthermore, if broken pages have backlinks pointing to them, all of that link equity goes to waste. 

Broken pages are also a waste of crawl budget—something to watch out for on bigger websites. 

Solution

In AWT, you should: 

  1. Open Site Audit.
  2. Go to the Internal pages report.
  3. See if there are any broken pages. If so, the Broken section will show a number higher than 0. Click on the number to show affected pages.
Broken pages report in Site Audit

In the report showing pages with issues, it’s a good idea to add a column for the number of referring domains. This will help you make the decision on how to fix the issue. 

Internal pages report with no. of referring domains column added

Now, fixing broken pages (4XX error codes) is quite simple, but there is more than one possibility. Here’s a short graph explaining the process:

How to deal with broken pages

Dealing with server errors (the ones reporting a 5XX) can be a tougher one, as there are different possible reasons for a server to be unresponsive. Read this short guide for troubleshooting.

Recommendation

With AWT, you can also see 404s that were caused by incorrect links to your website. While this is not a technical issue per se, reclaiming those links may give you an additional SEO boost.

  1. Go to Site Explorer
  2. Enter your domain 
  3. Go to the Best by links report
  4. Add a “404 not found” filter
  5. Then sort the report by referring domains from high to low
How to find broken backlinks in Site Explorer
In this example, someone linked to us, leaving a comma inside the URL.

6. Links issues

If you’ve already dealt with broken pages, chances are you’ve fixed most of the broken links issues.

Other critical issues related to links are: 

  • Orphan pages – These are the pages without any internal links. Web crawlers have limited ability to access those pages (only from sitemap or backlinks), and there is no link equity flowing to them from other pages on your site. Last but not least, users won’t be able to access this page from the site navigation. 
  • HTTPS pages linking to internal HTTP pages – If an internal link on your website brings users to an HTTP URL, web browsers will likely show a warning about a non-secure page. This can damage your overall website authority and user experience.

Solution

In AWT, you can:

  1. Go to Site Audit.
  2. Open the Links report.
  3. Open the Issues tab. 
  4. Look for the following issues in the Indexable category. Click to see affected pages. 
Important SEO issues related to links

Fix HTTPS pages linking to internal HTTP pages by changing the links from HTTP to HTTPS, or simply delete those links if they are no longer needed.

An orphan page needs to be either linked to from some other page on your website or deleted if it holds no value to you.

Sidenote.

Ahrefs’ Site Audit can find orphan pages as long as they have backlinks or are included in the sitemap. For a more thorough search for this issue, you will need to analyze server logs to find orphan pages with hits. Find out how in this guide.

7. Mobile experience issues

Having a mobile-friendly website is a must for SEO. Two reasons: 

  1. Google uses mobile-first indexing – It’s mostly using the content of mobile pages for indexing and ranking.
  2. Mobile experience is part of the Page Experience signals – While Google will allegedly always “promote” the page with the best content, page experience can be a tiebreaker for pages offering content of similar quality. 

Solution

In GSC: 

  1. Go to the Mobile Usability report in the Experience section
  2. View affected pages by clicking on issues in the Why pages aren’t usable on mobile section 
Mobile Usability report in Google Search Console

You can read Google’s guide for fixing mobile issues here.  

8. Performance and stability issues 

Performance and visual stability are other aspects of Page Experience signals used by Google to rank pages. 

Google has developed a special set of metrics to measure user experience called Core Web Vitals (CWV). Site owners and SEOs can use those metrics to see how Google perceives their website in terms of UX. 

Google's search signals for page experience

While page experience can be a ranking tiebreaker, CWV is not a race. You don’t need to have the fastest website on the internet. You just need to score “good” ideally in all three categories: loading, interactivity, and visual stability. 

Three categories of Core Web Vitals

Solution 

In GSC: 

  1. First, click on Core Web Vitals in the Experience section of the reports.
  2. Then click Open report in each section to see how your website scores. 
  3. For pages that aren’t considered good, you’ll see a special section at the bottom of the report. Use it to see pages that need your attention.
How to find Core Web Vitals in Google Search Console
CWV issue report in Google Search Console

Optimizing for CWV may take some time. This may include things like moving to a faster (or closer) server, compressing images, optimizing CSS, etc. We explain how to do this in the third part of this guide to CWV. 

9. Bad website structure

Bad website structure in the context of technical SEO is mainly about having important organic pages too deep in the website structure.

Pages that are nested too deep (i.e., users need more than six clicks from the homepage to get to them) will receive less link equity from your homepage (likely the page with the most backlinks), which may affect their rankings. This is because link value diminishes with every link “hop.”
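Click depth is simply the shortest path from the homepage through internal links, so it can be computed with a breadth-first search. The sketch below assumes you already have the internal link graph as a dictionary. The page paths are invented, and the example uses a limit of one click only to keep the data small.

```python
from collections import deque

def click_depths(links, homepage):
    """Breadth-first search: fewest clicks needed to reach each page."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def too_deep(links, homepage, max_clicks=6):
    """Pages that need more than max_clicks clicks from the homepage."""
    depths = click_depths(links, homepage)
    return sorted(page for page, depth in depths.items() if depth > max_clicks)

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/blog", "/seo-guide"],
    "/seo-guide": ["/glossary"],
    "/blog": ["/blog/page-2"],
}
depths = click_depths(links, "/")
deep_pages = too_deep(links, "/", max_clicks=1)
print(depths)
print(deep_pages)
```

Pages that never appear in the result are not reachable through internal links at all, which makes them orphan pages rather than deep pages.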

Sidenote.

Website structure is important for other reasons too such as the overall user experience, crawl efficiency, and helping Google understand the context of your pages. Here, we’ll only focus on the technical aspect, but you can read more about the topic in our full guide: Website Structure: How to Build Your SEO Foundation.

Solution 

In AWT:

  1. Open Site Audit
  2. Go to Structure explorer, switch to the Depth tab, and set the data type to Data table
  3. Configure the Segment to only valid HTML pages and click Apply
  4. Use the graph to investigate pages that are more than six clicks away from the homepage
How to find site structure issues in Site Audit
Adding a new segment in Site Audit

The way to fix the issue is to link to these deeper nested pages from pages closer to the homepage. More important pages could find their place in site navigation, while less important ones can simply be linked from pages a few clicks closer to the homepage.

It’s a good idea to weigh user experience and the business role of your website when deciding what goes into sitewide navigation.

For example, we could probably give our SEO glossary a slightly higher chance to get ahead of organic competitors by including it in the main site navigation. Yet we decided not to because it isn’t such an important page for users who are not particularly searching for this type of information. 

We’ve moved the glossary only up a notch by including a link inside the beginner’s guide to SEO (which itself is just one click away from the homepage). 

Structure explorer showing glossary page is two clicks away from the homepage
One page from the glossary folder is two clicks away from the homepage.
Link that moved SEO glossary a click closer to the homepage
Just one link, even at the bottom of a page, can move a page higher in the overall structure.

Final thoughts 

When you’re done fixing the more pressing issues, dig a little deeper to keep your site in perfect SEO health. Open Site Audit and go to the All issues report to see other issues regarding on-page SEO, image optimization, redirects, localization, and more. In each case, you will find instructions on how to deal with the issue. 

All issues report in Site Audit

You can also customize this report by turning issues on/off or changing their priority. 

Issue report in Site Audit is customizable

Did I miss any important technical issues? Let me know on Twitter or Mastodon.



New Google Ads Feature: Account-Level Negative Keywords

Google Ads Liaison Ginny Marvin has announced that account-level negative keywords are now available to Google Ads advertisers worldwide.

The feature, which was first announced last year and has been in testing for several months, allows advertisers to add keywords to exclude traffic from all search and shopping campaigns, as well as the search and shopping portion of Performance Max, for greater brand safety and suitability.

Advertisers can access this feature from the account settings page to ensure their campaigns align with their brand values and target audience.

This is especially important for brands that want to avoid appearing in contexts that may be inappropriate or damaging to their reputation.

In addition to the brand safety benefits, the addition of account-level negative keywords makes the campaign management process more efficient for advertisers.

Instead of adding negative keywords to individual campaigns, advertisers can manage them at the account level, saving time and reducing the chances of human error.

You no longer have to worry about duplicating negative keywords in multiple campaigns or missing any that are vital to your brand safety.

Additionally, account-level negative keywords can improve the accuracy of ad targeting by excluding irrelevant or low-performing keywords that may adversely impact campaign performance. This can result in higher-quality traffic and a better return on investment.

Google Ads offers a range of existing brand suitability controls, including inventory types, digital content labels, placement exclusions, and negative keywords at the campaign level.

Marvin added that Google Ads is expanding account-level negative keywords to address various use cases and will have more to share soon.

This rollout is essential in giving brands more control over their advertising and ensuring their campaigns target the appropriate audience.


Featured Image: Primakov/Shutterstock



Google’s Gary Illyes Answers Your SEO Questions On LinkedIn

Google Analyst Gary Illyes offers guidance on large robots.txt files, the SEO impact of website redesigns, and the correct use of rel-canonical tags.

Illyes is taking questions sent to him via LinkedIn direct message and answering them publicly, offering valuable insights for those in the SEO community.

It’s already newsworthy for a Google employee to share SEO advice. This is especially so given it’s Illyes, who isn’t as active on social media as colleagues like Search Advocate John Mueller and Developer Advocate Martin Splitt.

Throughout the past week, Illyes has shared advice and offered guidance on the following subjects:

  • Large robots.txt files
  • The SEO impact of website redesigns
  • The correct use of rel-canonical tags

Considering the engagement his posts are getting, there’s likely more to come. Here’s a summary of what you missed if you’re not following him on LinkedIn.

Keep Robots.Txt Files Under 500KB

Regarding a previously published poll on the size of robots.txt files, Illyes shares a PSA for those with a file size larger than 500kb.

Screenshot from: linkedin.com/in/garyillyes/, January 2023.

Illyes advises paying attention to the size of your website’s robots.txt file, especially if it’s larger than 500kb.

Google’s crawlers only process the first 500kb of the file, so it’s crucial to ensure that the most important information appears first.

Doing this can help ensure that your website is properly crawled and indexed by Google.
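A quick way to see which rules would fall outside the processed portion is to count bytes line by line. This sketch is an approximation: reading “500kb” as 500 × 1024 bytes is an assumption, the function name is hypothetical, and a rule that straddles the limit is simply counted as ignored.

```python
LIMIT_BYTES = 500 * 1024  # assumption: "500kb" read as 500 * 1024 bytes

def lines_beyond_limit(robots_txt, limit=LIMIT_BYTES):
    """Return the non-empty lines that do not fit within the first `limit` bytes."""
    consumed = 0
    ignored = []
    for line in robots_txt.splitlines(keepends=True):
        consumed += len(line.encode("utf-8"))
        if consumed > limit and line.strip():
            ignored.append(line.strip())
    return ignored

robots = "User-agent: *\nDisallow: /tmp/\nDisallow: /private/\n"
# A tiny limit is used here only to make the effect visible on a short file.
ignored = lines_beyond_limit(robots, limit=32)
print(ignored)
```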

Website Redesigns May Cause Rankings To Go “Nuts”

When you redesign a website, it’s important to remember that its rankings in search engines may be affected.

As Illyes explains, this is because search engines use the HTML of your pages to understand and categorize the content on your site.

If you make changes to the HTML structure, such as breaking up paragraphs, using CSS styling instead of H tags, or adding unnecessary breaking tags, it can cause the HTML parsers to produce different results.

This can significantly impact your site’s rankings in search engines. Or, as Illyes phrases it, it can cause rankings to go “nuts”:

Screenshot from: linkedin.com/in/garyillyes/, January 2023.

Illyes advises using semantically similar HTML when redesigning the site and avoiding adding tags that aren’t necessary to minimize the SEO impact.

This will allow HTML parsers to better understand the content on your site, which can help maintain search rankings.

Don’t Use Relative Paths In Your Rel-Canonical

Don’t take shortcuts when implementing rel-canonical tags. Illyes strongly advises spelling out the entire URL path:

Screenshot from: linkedin.com/in/garyillyes/, January 2023.

Saving a few bytes using a relative path in the rel-canonical tag isn’t worth the potential issues it could cause.

With a relative path, search engines may end up treating the canonical as a different URL from the one you intended.

Spelling out the full URL path eliminates potential ambiguity and ensures that search engines identify the correct URL as the preferred version.
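If your templates currently emit relative canonicals, Python's standard library can flag them and show the absolute form. This is a minimal sketch with hypothetical function names and example URLs. How a given search engine actually resolves a relative canonical is not something this code models.

```python
from urllib.parse import urljoin, urlparse

def is_absolute(url):
    """True when the URL carries both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)

def absolute_canonical(page_url, canonical_href):
    """Resolve a canonical href against the page URL it appears on."""
    return urljoin(page_url, canonical_href)

page_url = "https://example.com/blog/post?utm=1"
relative = "/blog/post"
resolved = absolute_canonical(page_url, relative)
print(is_absolute(relative), resolved)
```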

In Summary

By answering questions sent to him via direct message and offering his expertise, Illyes is giving back to the community and providing valuable insights on various SEO-related topics.

This is a testament to Illyes’ dedication to helping people understand how Google works. Send him a DM, and your question may be answered in a future LinkedIn post.


Source: LinkedIn

Featured Image: SNEHIT PHOTO/Shutterstock


