
SEO

Find Keyword Cannibalization Using OpenAI’s Text Embeddings with Examples


This new series of articles focuses on working with LLMs to scale your SEO tasks. We hope to help you integrate AI into SEO so you can level up your skills.

We hope you enjoyed the previous article and understand what vectors, vector distance, and text embeddings are.

Following this, it’s time to flex your “AI knowledge muscles” by learning how to use text embeddings to find keyword cannibalization.

We will start with OpenAI’s text embedding models and compare them.

Model | Dimensionality | Pricing | Notes
text-embedding-ada-002 | 1536 | $0.10 per 1M tokens | Great for most use cases.
text-embedding-3-small | 1536 | $0.02 per 1M tokens | Faster and cheaper but less accurate.
text-embedding-3-large | 3072 | $0.13 per 1M tokens | More accurate for complex, long-text tasks; slower.

(*Tokens can be thought of roughly as words.)
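To get a feel for what these prices mean in practice, here is a rough cost estimate. It is a sketch, not a billing tool: it assumes OpenAI's rule of thumb that one token is about three-quarters of an English word, and the function name and ratio constant are made up for illustration. Exact counts would need a real tokenizer.

```python
# Rough cost estimate for embedding a list of titles.
# Assumption: 1 token is about 0.75 English words (a rule of thumb, not an exact count).
WORDS_PER_TOKEN = 0.75

# Prices in USD per 1M tokens, taken from the table above.
PRICE_PER_MILLION = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-large": 0.13,
}

def estimate_cost(texts, model):
    """Return (approx_tokens, approx_cost_usd) for embedding `texts` once."""
    words = sum(len(text.split()) for text in texts)
    tokens = round(words / WORDS_PER_TOKEN)
    cost = tokens / 1_000_000 * PRICE_PER_MILLION[model]
    return tokens, cost

# 10,000 copies of one sample title, standing in for a site export.
titles = ["Meta Tags: What You Need To Know For SEO"] * 10_000
tokens, cost = estimate_cost(titles, "text-embedding-ada-002")
print(f"~{tokens:,} tokens, ~${cost:.4f}")
```

Even for tens of thousands of titles, the estimated cost stays in the range of cents, so title-level experiments are cheap to repeat.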

But before we start, you need to install Python and Jupyter on your computer.

Jupyter is a web-based tool for professionals and researchers. It lets you perform complex data analysis and develop machine learning models in many programming languages, including Python.

Don’t worry – it’s really easy and takes little time to finish the installations. And remember, ChatGPT is your friend when it comes to programming.

In a nutshell:

  • Download and install Python.
  • Open your Windows command line or terminal on Mac.
  • Type these commands: pip install jupyterlab and pip install notebook
  • Run Jupyter with this command: jupyter lab

We will use Jupyter to experiment with text embeddings; you’ll see how fun it is to work with!

Next, you must sign up for OpenAI’s API and set up billing by adding funds to your balance.

OpenAI API billing settings

Once you’ve done that, set up email notifications to inform you when your spending exceeds a certain amount under Usage limits.

Then, obtain API keys under Dashboard > API keys, which you should keep private and never share publicly.
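One way to keep the key out of a notebook you might share is to read it from an environment variable. The sketch below assumes the variable is named OPENAI_API_KEY (the name the openai library also looks for by default); the helper name is hypothetical.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Read the API key from an environment variable instead of hard-coding it."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return key
```

In a notebook, you would then pass load_api_key() wherever the key is needed instead of pasting the key itself.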

OpenAI API keys

Now, you have all the necessary tools to start playing with embeddings.

  • Open your computer command terminal and type jupyter lab.
  • You should see something like the below image pop up in your browser.
  • Click on Python 3 under Notebook.
jupyter lab

In the opened window, you will write your code.

As a small task, let’s group similar URLs from a CSV. The sample CSV has two columns: URL and Title. Our script’s task will be to group URLs with similar semantic meanings based on the title so we can consolidate those pages into one and fix keyword cannibalization issues.

Here are the steps you need to do:

Install the required Python libraries with the following command in your computer’s terminal (or in a Jupyter notebook):

pip install pandas openai scikit-learn numpy unidecode

The ‘openai’ library is required to interact with the OpenAI API to get embeddings, and ‘pandas’ is used for data manipulation and handling CSV file operations.

The ‘scikit-learn’ library is necessary for calculating cosine similarity, and ‘numpy’ is essential for numerical operations and handling arrays. Lastly, unidecode is used to clean text.
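Before running the main script, it can help to confirm that the libraries are actually importable from the notebook's environment. This is a small sketch using only the standard library; the list and function name are illustrative.

```python
from importlib.util import find_spec

# Import names differ from pip names in one case: scikit-learn is imported as sklearn.
REQUIRED = ["pandas", "openai", "sklearn", "numpy", "unidecode"]

def missing_modules(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, rerun the pip install command in the same environment that Jupyter uses.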

Then, download the sample sheet as a CSV, rename the file to pages.csv, and upload it to your Jupyter folder where your script is located.
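A quick sanity check of the CSV can save a wasted API run. The sketch below assumes the two columns are named exactly URL and Title, as described above; the sample rows and the function name are made up for illustration.

```python
import csv
import io

def check_pages_csv(handle, required=("URL", "Title")):
    """Return the rows of a pages CSV, or raise if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing column(s): {', '.join(missing)}")
    # Skip rows with an empty title, since there is nothing to embed.
    return [row for row in reader if (row.get("Title") or "").strip()]

# Made-up sample data standing in for pages.csv.
sample = io.StringIO(
    "URL,Title\n"
    "https://example.com/a,Meta Tags: What You Need To Know For SEO\n"
    "https://example.com/b,\n"
)
rows = check_pages_csv(sample)
print(len(rows), "usable row(s)")
```

For the real file, open it with open("pages.csv", newline="", encoding="utf-8") and pass the handle to the same function.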

Set your OpenAI API key to the key you obtained in the step above, and copy-paste the code below into the notebook.

Run the code by clicking the play triangle icon at the top of the notebook.


import pandas as pd
import openai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import csv
from unidecode import unidecode

# Function to clean text
def clean_text(text: str) -> str:
    # First, replace known problematic characters with their correct equivalents
    replacements = {
        'â€\u201c': '–',      # en dash (UTF-8 bytes misread as Windows-1252)
        'â€\u2122': '’',      # right single quotation mark
        'â€\u0153': '“',      # left double quotation mark
        'â€\x9d': '”',        # right double quotation mark
        'â€\u02dc': '‘',      # left single quotation mark
        'â€\u201d': '\u2014'  # em dash
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    # Then, use unidecode to transliterate any remaining problematic Unicode characters
    text = unidecode(text)
    return text

# Load the CSV file with UTF-8 encoding from the root of the Jupyter project folder
df = pd.read_csv('pages.csv', encoding='utf-8')

# Clean the 'Title' column to remove unwanted symbols
df['Title'] = df['Title'].apply(clean_text)

# Set your OpenAI API key (this uses the client interface of openai>=1.0)
client = openai.OpenAI(api_key='your-api-key-goes-here')

# Function to get embeddings
def get_embedding(text):
    response = client.embeddings.create(input=[text], model="text-embedding-ada-002")
    return response.data[0].embedding

# Generate embeddings for all titles
df['embedding'] = df['Title'].apply(get_embedding)

# Create a matrix of embeddings
embedding_matrix = np.vstack(df['embedding'].values)

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embedding_matrix)

# Define similarity threshold
similarity_threshold = 0.9  # since threshold is 0.1 for dissimilarity

# Create a list to store groups
groups = []

# Keep track of visited indices
visited = set()

# Group similar titles based on the similarity matrix
for i in range(len(similarity_matrix)):
    if i not in visited:
        # Find all similar titles that are not already part of an earlier group
        similar_indices = [j for j in np.where(similarity_matrix[i] >= similarity_threshold)[0] if j not in visited]
        
        # Log comparisons
        print(f"\nChecking similarity for '{df.iloc[i]['Title']}' (Index {i}):")
        print("-" * 50)
        for j in range(len(similarity_matrix)):
            if i != j:  # Ensure that a title is not compared with itself
                similarity_value = similarity_matrix[i, j]
                comparison_result = 'greater' if similarity_value >= similarity_threshold else 'less'
                print(f"Compared with '{df.iloc[j]['Title']}' (Index {j}): similarity = {similarity_value:.4f} ({comparison_result} than threshold)")

        # Add these indices to visited
        visited.update(similar_indices)
        # Add the group to the list
        group = df.iloc[similar_indices][['URL', 'Title']].to_dict('records')
        groups.append(group)
        print(f"\nFormed Group {len(groups)}:")
        for item in group:
            print(f"  - URL: {item['URL']}, Title: {item['Title']}")

# Check if groups were created
if not groups:
    print("No groups were created.")

# Define the output CSV file
output_file = "grouped_pages.csv"

# Write the results to the CSV file with UTF-8 encoding
with open(output_file, 'w', newline="", encoding='utf-8') as csvfile:
    fieldnames = ['Group', 'URL', 'Title']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for group_index, group in enumerate(groups, start=1):
        for page in group:
            cleaned_title = clean_text(page['Title'])  # Ensure no unwanted symbols in the output
            writer.writerow({'Group': group_index, 'URL': page['URL'], 'Title': cleaned_title})
            print(f"Writing Group {group_index}, URL: {page['URL']}, Title: {cleaned_title}")

print(f"Output written to {output_file}")


This code reads a CSV file, ‘pages.csv,’ containing titles and URLs, which you can easily export from your CMS or get by crawling a client website using Screaming Frog.

Then, it cleans the titles of mis-encoded and non-ASCII characters, generates embedding vectors for each title using OpenAI’s API, calculates the similarity between the titles, groups similar titles together, and writes the grouped results to a new CSV file, ‘grouped_pages.csv.’
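The script sends one request per title. Since the embeddings endpoint also accepts a list of inputs, a batched variant is one possible speed-up for larger exports. This is a sketch under assumptions: the batch size of 100 is arbitrary, the helper names are hypothetical, and `client` is expected to be a client object from the openai library (version 1.0 or later). A stand-in client is included so the logic can be tried without network access.

```python
from types import SimpleNamespace

def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_titles(client, titles, model="text-embedding-ada-002", batch_size=100):
    """Embed titles in batches and return one vector per title, in input order."""
    vectors = []
    for batch in batched(titles, batch_size):
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so vectors line up with the order of the batch.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors

# Stand-in client for a dry run (hypothetical, for illustration only):
# it returns a one-number "embedding" equal to the length of each text.
class FakeEmbeddings:
    def create(self, input, model):
        data = [SimpleNamespace(index=i, embedding=[float(len(t))]) for i, t in enumerate(input)]
        return SimpleNamespace(data=data)

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
print(embed_titles(fake_client, ["a", "bb", "ccc"], batch_size=2))
```

With a real client, the result could be assigned straight to the DataFrame, for example df['embedding'] = embed_titles(client, df['Title'].tolist()).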

In the keyword cannibalization task, we use a similarity threshold of 0.9, which means that if the cosine similarity between two titles is less than 0.9, we consider the articles different. Visualized in a simplified two-dimensional space, this threshold corresponds to two vectors with an angle of approximately 26 degrees between them.


In your case, you may want to use a different threshold, like 0.85 (approximately 32 degrees between them), and run it on a sample of your data to evaluate the results and the overall quality of matches. If it is unsatisfactory, you can increase the threshold to make it more strict for better precision.
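The relationship between a threshold and its angle is just the inverse cosine, which is easy to check by hand. The sketch below computes cosine similarity from its definition and converts a few thresholds to degrees; the function names and the toy vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def threshold_to_degrees(similarity):
    """Angle in degrees between two vectors with the given cosine similarity."""
    return math.degrees(math.acos(similarity))

for threshold in (0.9, 0.85, 0.5):
    print(threshold, round(threshold_to_degrees(threshold), 1))

# Toy 2-D vectors: the same direction and a right angle.
print(cosine([1, 0], [2, 0]), cosine([1, 0], [0, 3]))
```

Note that the length of a vector does not matter here, only its direction, which is why [1, 0] and [2, 0] count as identical.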

You can install ‘matplotlib’ from the terminal with pip install matplotlib.

Then use the Python code below in a separate Jupyter notebook to visualize cosine similarities in two-dimensional space on your own. Try it; it’s fun!


import matplotlib.pyplot as plt
import numpy as np

# Define the angle for cosine similarity of 0.9. Change here to your desired value. 
theta = np.arccos(0.9)

# Define the vectors
u = np.array([1, 0])
v = np.array([np.cos(theta), np.sin(theta)])

# Define the 45 degree rotation matrix
rotation_matrix = np.array([
    [np.cos(np.pi/4), -np.sin(np.pi/4)],
    [np.sin(np.pi/4), np.cos(np.pi/4)]
])

# Apply the rotation to both vectors
u_rotated = np.dot(rotation_matrix, u)
v_rotated = np.dot(rotation_matrix, v)

# Plotting the vectors
plt.figure()
plt.quiver(0, 0, u_rotated[0], u_rotated[1], angles="xy", scale_units="xy", scale=1, color="r")
plt.quiver(0, 0, v_rotated[0], v_rotated[1], angles="xy", scale_units="xy", scale=1, color="b")

# Setting the plot limits to only positive ranges
plt.xlim(0, 1.5)
plt.ylim(0, 1.5)

# Adding labels and grid
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.title('Visualization of Vectors with Cosine Similarity of 0.9')

# Show the plot
plt.show()


I usually use 0.9 and higher for identifying keyword cannibalization issues, but you may need to lower it to 0.5 when dealing with old article redirects, as an old article may not have a nearly identical fresher counterpart, only a partially related one.

For redirects, it may also be better to concatenate the meta description with the title rather than use the title alone.

So, it depends on the task you are performing. We will review how to implement redirects in a separate article later in this series.
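To make the redirect case concrete, here is a minimal sketch of how a lower threshold could be used to pick the closest fresher article for each old one. It is not the implementation from the upcoming article; the function names, URLs, and 2-D vectors are made up, and real embeddings would have 1536 or more dimensions.

```python
import math

def embedding_text(title, meta_description=""):
    """Text to embed for redirects: the title, plus the meta description when available."""
    return f"{title}. {meta_description}".strip() if meta_description else title

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_redirect_targets(old_pages, new_pages, threshold=0.5):
    """Map each old URL to its most similar new URL, or None if nothing reaches the threshold."""
    result = {}
    for old_url, old_vec in old_pages.items():
        best_url, best_score = None, threshold
        for new_url, new_vec in new_pages.items():
            score = cosine(old_vec, new_vec)
            if score >= best_score:
                best_url, best_score = new_url, score
        result[old_url] = best_url
    return result

# Toy 2-D vectors standing in for real embeddings (made up for illustration).
old_pages = {"/old-meta-tags": [1.0, 0.2], "/old-unrelated": [-1.0, 0.1]}
new_pages = {"/meta-tags-guide": [0.9, 0.4], "/h1-tags": [0.1, 1.0]}
print(best_redirect_targets(old_pages, new_pages))
```

Unlike the grouping script, this keeps only the single best match per old page, which suits redirects because each old URL can point to just one destination.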

Now, let’s review the results with the three models mentioned above and see how they were able to identify close articles from our data sample from Search Engine Journal’s articles.

Data Sample

From the list, we already see that the 2nd and 4th articles cover the same topic on ‘meta tags.’ The articles in the 5th and 7th rows are pretty much the same – discussing the importance of H1 tags in SEO – and can be merged.

The article in the 3rd row doesn’t have any similarities with any of the articles in the list but has common words like “Tag” or “SEO.”

The article in the 6th row is again about H1, but not exactly the same as H1’s importance to SEO. Instead, it represents Google’s opinion on whether they should match.

Articles on the 8th and 9th rows are quite close but still different; they can be combined.

text-embedding-ada-002

By using ‘text-embedding-ada-002,’ we precisely found the 2nd and 4th articles with a cosine similarity of 0.92 and the 5th and 7th articles with a similarity of 0.91.

Screenshot from Jupyter log showing cosine similarities

It also generated output in which similar articles share the same group number. (Colors are applied manually for visualization purposes.)

Output sheet with grouped URLs

For the 2nd and 3rd articles, which have common words “Tag” and “SEO” but are unrelated, the cosine similarity was 0.86. This shows why a high similarity threshold of 0.9 or greater is necessary. If we set it to 0.85, it would be full of false positives and could suggest merging unrelated articles.
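The effect of the threshold can be replayed with the three similarity values reported above. The sketch below only filters those numbers; it does not call the API, and the function name is made up.

```python
# Cosine similarities reported above for text-embedding-ada-002
# (row numbers refer to the data sample).
reported = {
    (2, 4): 0.92,  # both about meta tags
    (5, 7): 0.91,  # both about the importance of H1 tags
    (2, 3): 0.86,  # share words like "Tag" and "SEO" but are unrelated
}

def pairs_at(threshold, similarities):
    """Return the row pairs whose similarity meets the threshold, in sorted order."""
    return sorted(pair for pair, score in similarities.items() if score >= threshold)

print(pairs_at(0.90, reported))
print(pairs_at(0.85, reported))
```

At 0.90 only the two genuine duplicates remain, while at 0.85 the unrelated pair of rows 2 and 3 slips in as a false positive.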

text-embedding-3-small

By using ‘text-embedding-3-small,’ quite surprisingly, it didn’t find any matches per our similarity threshold of 0.9 or higher.

For the 2nd and 4th articles, cosine similarity was 0.76, and for the 5th and 7th articles, it was 0.77.

To better understand this model through experimentation, I’ve added a slightly modified version of the 1st row with ’15’ vs. ’14’ to the sample.

  1. “14 Most Important Meta And HTML Tags You Need To Know For SEO”
  2. “15 Most Important Meta And HTML Tags You Need To Know For SEO”
An example which shows text-embedding-3-small results

By contrast, ‘text-embedding-ada-002’ gave 0.98 cosine similarity between those versions.

Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.92
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.76

Here, we see that this model is not quite a good fit for comparing titles.

text-embedding-3-large

This model’s dimensionality is 3072, twice that of ‘text-embedding-3-small’ and ‘text-embedding-ada-002’, which both have 1536 dimensions.

As it has more dimensions than the other models, we could expect it to capture semantic meaning with higher precision.

However, it gave the 2nd and 4th articles cosine similarity of 0.70 and the 5th and 7th articles similarity of 0.75.

I’ve tested it again with slightly modified versions of the first article with ’15’ vs. ’14’ and without ‘Most Important’ in the title.

  1. “14 Most Important Meta And HTML Tags You Need To Know For SEO”
  2. “15 Most Important Meta And HTML Tags You Need To Know For SEO”
  3. “14 Meta And HTML Tags You Need To Know For SEO”

Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.95
14 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.93
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.70
15 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.86

So we can see that ‘text-embedding-3-large’ is underperforming compared to ‘text-embedding-ada-002’ when we calculate cosine similarities between titles.
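Putting the reported numbers for the two known-duplicate pairs side by side makes the gap between the models easier to see. The sketch below only tabulates values already given in this article; the function name is made up.

```python
THRESHOLD = 0.9

# Similarities reported in this article for the same title pairs, per model.
reported = {
    "text-embedding-ada-002": {"rows 2 and 4": 0.92, "rows 5 and 7": 0.91},
    "text-embedding-3-small": {"rows 2 and 4": 0.76, "rows 5 and 7": 0.77},
    "text-embedding-3-large": {"rows 2 and 4": 0.70, "rows 5 and 7": 0.75},
}

def matches_per_model(scores, threshold):
    """Count, for each model, how many known-duplicate pairs reach the threshold."""
    return {
        model: sum(1 for value in pairs.values() if value >= threshold)
        for model, pairs in scores.items()
    }

for model, count in matches_per_model(reported, THRESHOLD).items():
    print(f"{model}: {count} of 2 pairs at or above {THRESHOLD}")
```

One reading of these numbers is that a threshold tuned on one model should not be carried over to another without re-testing on your own sample.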

I want to note that the accuracy of ‘text-embedding-3-large’ increases with the length of the text, but ‘text-embedding-ada-002’ still performs better overall.

Another approach could be to strip away stop words from the text. Removing these can sometimes help focus the embeddings on more meaningful words, potentially improving the accuracy of tasks like similarity calculations.

The best way to determine whether removing stop words improves accuracy for your specific task and dataset is to empirically test both approaches and compare the results.
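For such a test, stop words can be stripped before the titles are sent for embedding. The sketch below uses a tiny hand-picked stop-word list, which is an assumption for illustration; libraries such as NLTK or spaCy ship much longer lists.

```python
import string

# A tiny hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "what", "you"}

def remove_stop_words(text):
    """Drop stop words, comparing case-insensitively and ignoring surrounding punctuation."""
    kept = [
        word for word in text.split()
        if word.strip(string.punctuation).lower() not in STOP_WORDS
    ]
    return " ".join(kept)

print(remove_stop_words("Meta Tags: What You Need To Know For SEO"))
```

Embedding both the original and the stripped titles, then comparing the resulting similarity scores on a labeled sample, is one way to run the comparison.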

Conclusion

With these examples, you have learned how to work with OpenAI’s embedding models and can already perform a wide range of tasks.

For similarity thresholds, you need to experiment with your own datasets and see which thresholds make sense for your specific task by running it on smaller samples of data and performing a human review of the output.

Please note that the code we have in this article is not optimal for large datasets since you need to create text embeddings of articles every time there is a change in your dataset to evaluate against other rows.

To make it efficient, we must use vector databases and store embedding information there once generated. We will cover how to use vector databases very soon and change the code sample here to use a vector database.
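Until then, a simple on-disk cache can avoid paying for the same embedding twice. To be clear, this is not a vector database, only a stopgap sketch: the class name, file name, and the injected `embed` function are all hypothetical, and the whole file is rewritten on every miss, which would be slow for very large datasets.

```python
import hashlib
import json
import os
import tempfile

class EmbeddingCache:
    """Store embeddings in a JSON file keyed by model and text, so each text is embedded once."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.data = json.load(handle)

    @staticmethod
    def _key(model, text):
        return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

    def get(self, model, text, embed):
        """Return the cached vector, calling `embed(text)` only on a cache miss."""
        key = self._key(model, text)
        if key not in self.data:
            self.data[key] = embed(text)
            with open(self.path, "w", encoding="utf-8") as handle:
                json.dump(self.data, handle)
        return self.data[key]

# Dry run with a fake embed function that records how often it is called.
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
cache = EmbeddingCache(cache_path)
cache.get("text-embedding-ada-002", "Meta Tags", fake_embed)
cache.get("text-embedding-ada-002", "Meta Tags", fake_embed)
print(len(calls))
```

In the grouping script, `embed` would be the function that calls the OpenAI API, so unchanged titles cost nothing on later runs.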



Featured Image: BestForBest/Shutterstock


SEO

Reddit Limits Search Engine Access, Google Remains Exception


Reddit has recently tightened its grip on who can access its content, blocking major search engines from indexing recent posts and comments.

This move has sparked discussions in the SEO and digital marketing communities about the future of content accessibility and AI training data.

What’s Happening?

First reported by 404 Media, Reddit updated its robots.txt file, preventing most web crawlers from accessing its latest content.

Google, however, remains an exception, likely due to a $60 million deal that allows the search giant to use Reddit’s content for AI training.

Brent Csutoras, founder of Search Engine Journal, offers some context:

“Since taking on new investors and starting their pathway to IPO, Reddit has moved away from being open-source and allowing anyone to scrape their content and use their APIs without paying.”

The Google Exception

Currently, Google is the only major search engine able to display recent Reddit results when users search with “site:reddit.com.”

This exclusive access sets Google apart from competitors like Bing and DuckDuckGo.

Why This Matters

For users who rely on appending “Reddit” to their searches to find human-generated answers, this change means they’ll be limited to using Google or search engines that pull from Google’s index.

It presents new challenges for SEO professionals and marketers in monitoring and analyzing discussions on one of the internet’s largest platforms.

The Bigger Picture

Reddit’s move aligns with a broader trend of content creators and platforms seeking compensation for using their data in AI training.

As Csutoras points out:

“Publications, artists, and entertainers have been suing OpenAI and other AI companies, blocking AI companies, and fighting to avoid using public content for AI training.”

What’s Next?

While this development may seem surprising, Csutoras suggests it’s a logical step for Reddit.

He notes:

“It seems smart on Reddit’s part, especially since similar moves in the past have allowed them to IPO and see strong growth for their valuation over the last two years.”


FAQ

What is the recent change Reddit has made regarding content accessibility?

Reddit has updated its robots.txt file to block major search engines from indexing its latest posts and comments. This change exempts Google due to a $60 million deal, allowing Google to use Reddit’s content for AI training purposes.

Why does Google have exclusive access to Reddit’s latest content?

Google has exclusive access to Reddit’s latest content because of a $60 million deal that allows Google to use Reddit’s content for AI training. This agreement sets Google apart from other search engines like Bing and DuckDuckGo, which are unable to index new Reddit posts and comments.

What broader trend does Reddit’s recent move reflect?

Reddit’s decision to limit search engine access aligns with a larger trend where content creators and platforms seek compensation for the use of their data in AI training. Many publications, artists, and entertainers are taking similar actions to either block or demand compensation from AI companies using their content.


Featured Image: Mamun sheikh K/Shutterstock


SEO

Google Cautions On Blocking GoogleOther Bot


Google’s Gary Illyes answered a question about the non-search features that the GoogleOther crawler supports, then added a caution about the consequences of blocking GoogleOther.

What Is GoogleOther?

GoogleOther is a generic crawler created by Google for the various purposes that fall outside of those of bots that specialize for Search, Ads, Video, Images, News, Desktop and Mobile. It can be used by internal teams at Google for research and development in relation to various products.

The official description of GoogleOther is:

“GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development.”

Something that may be surprising is that there are actually three kinds of GoogleOther crawlers.

Three Kinds Of GoogleOther Crawlers

  1. GoogleOther
    Generic crawler for public URLs
  2. GoogleOther-Image
    Optimized to crawl public image URLs
  3. GoogleOther-Video
    Optimized to crawl public video URLs

All three GoogleOther crawlers can be used for research and development purposes. Research and development is just one purpose that Google publicly acknowledges for all three versions of GoogleOther.

What Non-Search Features Does GoogleOther Support?

Google doesn’t say what specific non-search features GoogleOther supports, probably because it doesn’t really “support” a specific feature. It exists for research and development crawling, which could be in support of a new product or an improvement to a current product. It’s a highly open and generic purpose.

This is the question, as read aloud by Gary:

“What non-search features does GoogleOther crawling support?”

Gary Illyes answered:

“This is a very topical question, and I think it is a very good question. Besides what’s in the public I don’t have more to share.

GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development.

Historically Googlebot was used for this, but that kind of makes things murky and less transparent, so we launched GoogleOther so you have better controls over what your site is crawled for.

That said GoogleOther is not tied to a single product, so opting out of GoogleOther crawling might affect a wide range of things across the Google universe; alas, not Search, search is only Googlebot.”

It Might Affect A Wide Range Of Things

Gary is clear that blocking GoogleOther wouldn’t have an effect on Google Search because Googlebot is the crawler used for indexing content. So if blocking any of the three versions of GoogleOther is something a site owner wants to do, then it should be okay to do that without a negative effect on search rankings.

But Gary also cautioned about the outcome of blocking GoogleOther, saying that it might have an effect on other products and services across Google. He didn’t state which products it could affect, nor did he elaborate on the pros or cons of blocking GoogleOther.
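For site owners who do decide to opt out, the rules can be checked before they go live. The sketch below parses a hypothetical robots.txt with Python's standard-library parser and confirms that the three GoogleOther crawlers are blocked while Googlebot is not. The file contents, URL, and function name are assumptions, and Python's matching rules may differ in detail from Google's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of the three GoogleOther crawlers only.
ROBOTS_TXT = """\
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
Disallow: /
"""

def is_allowed(user_agent, url="https://example.com/any-page", robots_txt=ROBOTS_TXT):
    """Check a robots.txt body with Python's standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("Googlebot", "GoogleOther", "GoogleOther-Image", "GoogleOther-Video"):
    print(agent, "allowed" if is_allowed(agent) else "blocked")
```

Because no rule in this file names Googlebot, search crawling is left untouched, which matches Gary's point that Search relies only on Googlebot.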

Pros And Cons Of Blocking GoogleOther

Whether or not to block GoogleOther doesn’t necessarily have a straightforward answer. There are several considerations to whether doing that makes sense.

Pros

Inclusion in research for a future Google product that’s related to search (maps, shopping, images, a new feature in search) could be useful. A site included in that kind of research might be one of the few chosen to test a feature that could increase its earnings.

Another consideration is that blocking GoogleOther to save on server resources is not necessarily a valid reason because GoogleOther doesn’t seem to crawl so often that it makes a noticeable impact.

If blocking Google from using site content for AI is a concern then blocking GoogleOther will have no impact on that at all. GoogleOther has nothing to do with crawling for Google Gemini apps or Vertex AI, including any future products that will be used for training associated language models. The bot for that specific use case is Google-Extended.

Cons

On the other hand it might not be helpful to allow GoogleOther if it’s being used to test something related to fighting spam and there’s something the site has to hide.

It’s possible that a site owner might not want to participate if GoogleOther comes crawling for market research or for training machine learning models (for internal purposes) that are unrelated to public-facing products like Gemini and Vertex.

Allowing GoogleOther to crawl a site for unknown purposes is like giving Google a blank check to use your site data in any way they see fit outside of training public-facing LLMs or purposes related to named bots like Googlebot.

Takeaway

Should you block GoogleOther? It’s a coin toss. There are potential benefits, but in general there isn’t enough information to make an informed decision.

Listen to the Google SEO Office Hours podcast at the 1:30 minute mark:

Featured Image by Shutterstock/Cast Of Thousands


SEO

AI Search Boosts User Satisfaction


A new study finds that despite concerns about AI in online services, users are more satisfied with search engines and social media platforms than before.

The American Customer Satisfaction Index (ACSI) conducted its annual survey of search and social media users, finding that satisfaction has either held steady or improved.

This comes at a time when major tech companies are heavily investing in AI to enhance their services.

Search Engine Satisfaction Holds Strong

Google, Bing, and other search engines have rapidly integrated AI features into their platforms over the past year. While critics have raised concerns about potential negative impacts, the ACSI study suggests users are responding positively.

Google maintains its position as the most satisfying search engine with an ACSI score of 81, up 1% from last year. Users particularly appreciate its AI-powered features.

Interestingly, Bing and Yahoo! have seen notable improvements in user satisfaction, notching 3% gains to reach scores of 77 and 76, respectively. These are their highest ACSI scores in over a decade, likely due to their AI enhancements launched in 2023.

The study hints at the potential of new AI-enabled search functionality to drive further improvements in the customer experience. Bing has seen its market share improve by small but notable margins, rising from 6.35% in the first quarter of 2023 to 7.87% in Q1 2024.

Customer Experience Improvements

The ACSI study shows improvements across nearly all benchmarks of the customer experience for search engines. Notable areas of improvement include:

  • Ease of navigation
  • Ease of using the site on different devices
  • Loading speed performance and reliability
  • Variety of services and information
  • Freshness of content

These improvements suggest that AI enhancements positively impact various aspects of the search experience.

Social Media Sees Modest Gains

For the third year in a row, user satisfaction with social media platforms is on the rise, increasing 1% to an ACSI score of 74.

TikTok has emerged as the new industry leader among major sites, edging past YouTube with a score of 78. This underscores the platform’s effective use of AI-driven content recommendations.

Meta’s Facebook and Instagram have also seen significant improvements in user satisfaction, showing 3-point gains. While Facebook remains near the bottom of the industry at 69, Instagram’s score of 76 puts it within striking distance of the leaders.

Challenges Remain

Despite improvements, the study highlights ongoing privacy and advertising challenges for search engines and social media platforms. Privacy ratings for search engines remain relatively low but steady at 79, while social media platforms score even lower at 73.

Advertising experiences emerge as a key differentiator between higher- and lower-satisfaction brands, particularly in social media. New ACSI benchmarks reveal user concerns about advertising content’s trustworthiness and personal relevance.

Why This Matters For SEO Professionals

This study provides an independent perspective on how users are responding to the AI push in online services. For SEO professionals, these findings suggest that:

  1. AI-enhanced search features resonate with users, potentially changing search behavior and expectations.
  2. The improving satisfaction with alternative search engines like Bing may lead to a more diverse search landscape.
  3. The continued importance of factors like content freshness and site performance in user satisfaction aligns with long-standing SEO best practices.

As AI becomes more integrated into our online experiences, SEO strategies may need to adapt to changing user preferences.


Featured Image: kate3155/Shutterstock
