

How To Build A Recommender System With TF-IDF And NMF (Python)




Topic clusters and recommender systems can help SEO experts to build a scalable internal linking architecture.

And as we know, internal linking can impact both user experience and search rankings. It’s an area we want to get right.

In this article, we will use Wikipedia data to build topic clusters and recommender systems with Python and the Pandas data analysis tool.

To achieve this, we will use the Scikit-learn library, a free software machine learning library for Python, with two main algorithms:

  • TF-IDF: Term frequency-inverse document frequency.
  • NMF: Non-negative matrix factorization, which is a group of algorithms in multivariate analysis and linear algebra that can be used to analyze dimensional data.

Specifically, we will:

  1. Extract all of the links from a Wikipedia article.
  2. Read text from Wikipedia articles.
  3. Create a TF-IDF map.
  4. Split queries into clusters.
  5. Build a recommender system.

Here is an example of topic clusters that you will be able to build: 

Screenshot from Pandas, February 2022

Moreover, here’s the overview of the recommender system that you can recreate.

Example of a recommender system in pandas. Screenshot from Pandas, February 2022

Ready? Let’s get a few definitions and concepts you’ll want to know out of the way first.

The Difference Between Topic Clusters & Recommender Systems

Topic clusters and recommender systems can be built in different ways.

In this case, the former is grouped by IDF weights and the latter by cosine similarity. 

In simple SEO terms:

  • Topic clusters can help to create an architecture where all articles are linked to.
  • Recommender systems can help to create an architecture where the most relevant pages are linked to.

What Is TF-IDF?

TF-IDF, or term frequency-inverse document frequency, is a statistic that expresses how important a given word is to a document relative to the collection of documents as a whole.

TF-IDF is calculated by multiplying term frequency and inverse document frequency.

  • TF: Number of times a word appears in a document/number of words in the document.
  • IDF: log(Number of documents / Number of documents that contain the word).

To illustrate this, let’s consider this situation with Machine Learning as a target word:

  • Document A contains the target word 10 times out of 100 words.
  • In the entire corpus, 30 documents out of 200 documents also contain the target word.

Then, the formula would be:

TF-IDF = (10/100) * log(200/30)
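Plugged into Python, the same arithmetic looks like this (using the natural log, which matches the IDF examples later in the article):

```python
import math

# Document A: the target word appears 10 times out of 100 words
tf = 10 / 100

# Corpus: 30 of 200 documents contain the target word
idf = math.log(200 / 30)

tf_idf = tf * idf
print(round(tf_idf, 4))  # 0.1897
```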

What TF-IDF Is Not

TF-IDF is not something new. It’s not something that you need to optimize for. 

According to John Mueller, it’s an old information retrieval concept that isn’t worth focusing on for SEO.

There is nothing in it that will help you outperform your competitors.

Still, TF-IDF can be useful to SEOs.

Learning how TF-IDF works gives insight into how a computer can interpret human language.

Consequently, one can leverage that understanding to improve the relevancy of the content using similar techniques.


What Is Non-negative Matrix Factorization (NMF)?

Non-negative matrix factorization, or NMF, is a dimensionality reduction technique often used in unsupervised learning. It approximates a non-negative matrix as the product of two smaller non-negative matrices, collapsing many features into a handful of latent topics.

In this article, NMF will be used to define the number of topics we want all the articles to be grouped under.
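To see what "factorization" means in practice, here is a toy, hand-rolled sketch of NMF using Lee and Seung's multiplicative update rules (purely illustrative; it is not Scikit-learn's implementation, which we will use later). It approximates a non-negative matrix V (documents × terms) as W (documents × topics) times H (topics × terms):

```python
import random

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, k, iters=500, seed=0):
    """Approximate V (m x n, non-negative) as W (m x k) @ H (k x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

# Toy example: a rank-1 matrix is recovered almost exactly with one topic
V = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
W, H = nmf(V, k=1)
```

In our case, V will be the TF-IDF matrix, k the number of topic clusters, and each row of W tells us how strongly an article belongs to each topic.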


Definition Of Topic Clusters

Topic clusters are groupings of related terms that can help you create an architecture where all articles are interlinked or on the receiving end of internal links.

Definition Of Recommender Systems

Recommender systems can help to create an architecture where the most relevant pages are linked to.

Building A Topic Cluster

As noted earlier, the topic clusters here are grouped by IDF weights and the recommender systems by cosine similarity.

Extract All The Links From A Specific Wikipedia Article

Extracting links on a Wikipedia page is done in two steps.

First, select a specific subject. In this case, we use the Wikipedia article on machine learning.

Second, use the Wikipedia API to find all the internal links on the article.

Here is how to query the Wikipedia API using the Python requests library.

import requests

main_subject = 'Machine learning'

url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'titles': main_subject,
    'prop': 'links',
    'pllimit': 'max',
}

r = requests.get(url, params=params)
r_json = r.json()
linked_pages = r_json['query']['pages']

# Each page entry contains the list of links found on it
page_titles = [link['title']
               for page in linked_pages.values()
               for link in page['links']]

The result is a list of all the pages linked from the initial article.

All the pages linked. Screenshot from Pandas, February 2022

These links represent each of the entities used for the topic clusters.

Select A Subset Of Articles

For performance purposes, we will select only the first 200 articles (including the main article on machine learning).

# select first X articles
num_articles = 200
pages = page_titles[:num_articles] 

# make sure to keep the main subject on the list
pages += [main_subject] 

# make sure there are no duplicates on the list
pages = list(set(pages))

Read Text From The Wikipedia Articles

Now, we need to extract the content of each article to perform the calculations for the  TF-IDF analysis.

To do so, we will fetch the API again for each of the pages stored in the pages variable.

From each response, we will store the text from the page and add it to a list called text_db.

Note that you may need to install tqdm and lxml packages to use them.

import requests
from lxml import html
from tqdm.notebook import tqdm

text_db = []
for page in tqdm(pages):
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'parse',
            'page': page,
            'format': 'json',
        }
    ).json()

    raw_html = response['parse']['text']['*']
    document = html.document_fromstring(raw_html)
    text = ''
    for p in document.xpath('//p'):
        text += p.text_content()
    text_db.append(text)

This query returns a list in which each element represents the text of the corresponding Wikipedia page.

## Print number of articles
print('Number of articles extracted: ', len(text_db))


Number of articles extracted:  201

As we can see, there are 201 articles.


This is because we added the article on “Machine learning” on top of the top 200 links from that page.

Furthermore, we can select the first article (index 0) and read the first 300 characters to gain a better understanding.

# read first 300 characters of 1st article
text_db[0][:300]


'\nBiology is the scientific study of life.[1][2][3] It is a natural science with a broad scope but has several unifying themes that tie it together as a single, coherent field.[1][2][3] For instance, all organisms are made up of cells that process hereditary information encoded in genes, which can '

Create A TF-IDF Map

In this section, we will rely on pandas and TfidfVectorizer to create a Dataframe that contains the bi-grams (two consecutive words) of each article.

Here, we are using TfidfVectorizer.

This is the equivalent of using CountVectorizer followed by TfidfTransformer, which you may see in other tutorials.

In addition, we need to remove the “noise”. In the field of Natural Language Processing, words like “the”, “a”, “I”, “we” are called “stopwords”.


In the English language, stopwords have low relevancy for SEOs and are overrepresented in documents.

Hence, using nltk, we will add a list of English stopwords to the TfidfVectorizer class.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Create a list of English stopwords
stop_words = stopwords.words('english')

# Instantiate the class
vec = TfidfVectorizer(
    stop_words=stop_words,
    ngram_range=(2, 2),  # bigrams
    use_idf=True
)

# Train the model and transform the data
tf_idf = vec.fit_transform(text_db)

# Create a pandas DataFrame
# (on scikit-learn >= 1.2, use vec.get_feature_names_out() instead)
df = pd.DataFrame(
    tf_idf.toarray(),
    columns=vec.get_feature_names(),
    index=pages
)

# Show the first lines of the DataFrame
df.head()
TF-IDF result. Screenshot from Pandas, February 2022

In the DataFrame above:

  • Rows are the documents.
  • Columns are the bi-grams (two consecutive words).
  • The values are the word frequencies (tf-idf).
Word frequencies. Screenshot from Pandas, February 2022
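To make the bi-gram columns concrete, here is a hand-rolled sketch of tokenization with stopword filtering. The stopword set is a toy one for illustration (nltk's English list has roughly 180 entries), and TfidfVectorizer performs all of these steps internally:

```python
# Toy stopword list for illustration only
stop_words = {'the', 'a', 'of', 'is', 'i', 'we'}

def bigrams(text):
    # keep only non-stopword tokens, then pair consecutive ones
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return list(zip(tokens, tokens[1:]))

print(bigrams('Machine learning is the study of algorithms'))
# [('machine', 'learning'), ('learning', 'study'), ('study', 'algorithms')]
```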

Sort The IDF Vectors

Below, we are sorting the Inverse document frequency vectors by relevance.

idf_df = pd.DataFrame(
    vec.idf_,
    index=vec.get_feature_names(),
    columns=['idf_weights']
)

# sort by descending IDF weight (most specific bigrams first)
idf_df.sort_values(by=['idf_weights'], ascending=False).head(10)

IDF weights. Screenshot from Pandas, February 2022

Specifically, the IDF vectors are calculated from the log of the number of articles divided by the number of articles containing each word.

The greater the IDF, the more relevant it is to an article.

The lower the IDF, the more common it is across all articles.

  • 1 mention out of 1 article = log(1/1) = 0.0
  • 1 mention out of 2 articles = log(2/1) = 0.69
  • 1 mention out of 10 articles = log(10/1) = 2.30
  • 1 mention out of 100 articles = log(100/1) = 4.61
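These values can be reproduced in a couple of lines of Python (natural log, as above):

```python
import math

for total_docs in (1, 2, 10, 100):
    # IDF for a word mentioned in exactly 1 of total_docs articles
    idf = math.log(total_docs / 1)
    print(f'1 mention out of {total_docs} article(s) = {idf:.2f}')
# 0.00, 0.69, 2.30, 4.61
```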

Split Queries Into Clusters Using NMF

Using the tf_idf matrix, we will split queries into topical clusters.

Each cluster will contain closely related bi-grams.

Firstly, we will use NMF to reduce the dimensionality of the matrix into topics.

Simply put, we will group 201 articles into 25 topics.

from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

# (optional) Disable FutureWarning of Scikit-learn
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# select number of topic clusters
n_topics = 25

# Create an NMF instance
nmf = NMF(n_components=n_topics)

# Fit the model to the tf_idf
nmf_features = nmf.fit_transform(tf_idf)

# normalize the features
norm_features = normalize(nmf_features)

We can see that the number of bigrams stays the same, but articles are grouped into topics.

# Compare processed VS unprocessed dataframes
print('Original df: ', df.shape)
print('NMF Processed df: ', nmf.components_.shape)

Secondly, for each of the 25 clusters, we will provide query recommendations.

# Create a dataframe of the NMF components (topic clusters x bigrams)
components = pd.DataFrame(
    nmf.components_,
    columns=[df.columns]
)

clusters = {}

# Show top 25 queries for each cluster
for i in range(len(components)):
    clusters[i] = []
    loop = dict(components.loc[i,:].nlargest(25)).items()
    for k,v in loop:
        clusters[i].append({'q':k[0],'sim_score': v})

Thirdly, we will create a data frame that shows the recommendations.

# Create dataframe using the clustered dictionary
grouping = pd.DataFrame(clusters).T
grouping['topic'] = grouping[0].apply(lambda x: x['q'])
grouping.drop(0, axis=1, inplace=True)
grouping.set_index('topic', inplace=True)

def show_queries(df):
    for col in df.columns:
        df[col] = df[col].apply(lambda x: x['q'])
    return df

# Only display the query in the dataframe
clustered_queries = show_queries(grouping)

Finally, the result is a DataFrame showing 25 topics along with the top 25 bigrams for each topic.

Example of a topic cluster in pandas. Screenshot from Pandas, February 2022

Building A Recommender System

Instead of building topic clusters, we will now build a recommender system using the same normalized features from the previous step.

The normalized features are stored in the norm_features variable.

# compute cosine similarities of each cluster
data = {}
# create dataframe
norm_df = pd.DataFrame(norm_features, index=pages)
for page in pages:
    # select the row for this page
    recommendations = norm_df.loc[page, :]

    # Compute cosine similarity against all pages
    # (dot product of L2-normalized vectors)
    similarities = norm_df.dot(recommendations)

    data[page] = []
    loop = dict(similarities.nlargest(20)).items()
    for k, v in loop:
        if k != page:
            data[page].append({'q':k,'sim_score': v})

What the code above does is:

  • Loops through each of the pages selected at the start.
  • Selects the corresponding row in the normalized dataframe.
  • Computes the cosine similarity between that page and all the others.
  • Selects the top 20 queries sorted by similarity score.
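Because the NMF features were L2-normalized earlier, cosine similarity between two pages reduces to a plain dot product. A minimal sketch with made-up topic vectors:

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # for unit-length vectors, cosine similarity == dot product
    return sum(x * y for x, y in zip(a, b))

page_a = l2_normalize([3.0, 1.0, 0.0])
page_b = l2_normalize([6.0, 2.0, 0.0])  # same topic mix, scaled
page_c = l2_normalize([0.0, 0.0, 5.0])  # a different topic entirely

print(round(cosine(page_a, page_b), 2))  # 1.0
print(round(cosine(page_a, page_c), 2))  # 0.0
```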

After the execution, we are left with a dictionary of pages containing lists of recommendations sorted by similarity score.

Similarity scores. Screenshot from Pandas, February 2022

The next step is to convert that dictionary into a DataFrame.

# convert dictionary to dataframe
recommender = pd.DataFrame(data).T

def show_queries(df):
    for col in df.columns:
        df[col] = df[col].apply(lambda x: x['q'])
    return df

# Only display the page title in each cell
recommender = show_queries(recommender)


The resulting DataFrame shows the parent query along with sorted recommended topics in each column.

Example of a recommender system in pandas. Screenshot from Pandas, February 2022


We are done building our own recommender system and topic cluster.

Interesting Contributions From The SEO Community

I am a big fan of Daniel Heredia, who has also played around with TF-IDF by finding relevant words with TF-IDF, textblob, and Python.


Python tutorials can be daunting.

A single article may not be enough.

If that is the case, I encourage you to read Koray Tuğberk GÜBÜR’s tutorial, which exposes a similar way to use TF-IDF.

Billy Bonaros also came up with a creative application of TF-IDF in Python and showed how to create a TF-IDF keyword research tool.


In the end, I hope you have learned a logic here that can be adapted to any website.

Understanding how topic clusters and recommender systems can help improve a website’s architecture is a valuable skill for any SEO pro wishing to scale their work.

Using Python and Scikit-learn, you have learned how to build your own – and have learned the basics of TF-IDF and of non-negative matrix factorization in the process.



Featured Image: Kateryna Reka/Shutterstock



SEO Legend, Mentor & Friend




The SEO industry will be forever changed with the loss of Bill Slawski, owner of SEO By The Sea, Director of Search at Go Fish Digital, educator, mentor, and friend.

Bill was a great many things to a lot of people. He has been a contributor here at Search Engine Journal since 2019, and a friend and mentor to many of us for decades more.

It’s not often you can say that someone has influenced and shaped an entire industry. But this is one of those times.

On May 19, 2022, the SEO industry learned that Bill Slawski had passed away.

The loss and sadness across our community were palpable.

Remembering Bill Slawski: SEO Legend, Mentor & Friend


A search patent expert, colleague and mentor to many, and a friend to many more, Bill influenced the lives of everyone in the search industry.

If you hadn’t read one of the thousands of articles he wrote or contributed to, watched one of his interviews, attended one of his talks, or listened to a podcast he was a guest on – I guarantee that someone you work with, learn from, or work for has.


This was due in no small part to Bill’s vast knowledge and expertise, combined with an unequaled passion for the nuances and technological advances that make search engines tick.

I spoke with Bill a few weeks ago as we were planning a feature article on the patents he felt are most impactful for search marketers.

In that interview, he explained his love for patents.

“One thing I always say about patents is they’re the best place to find assumptions about searchers, about search, and about the web. These are search engineers sharing their opinions in addition to solving problems,” he said.

He loved getting to see what engineers were thinking, and what they had to say when it comes to different problems on the web.

“One of my favorite types of patents to look up is when they repeat a patent and file a continuation,” Bill explained. “I like to look at these continuation patents and see how they’ve changed, because they don’t tell you, ‘This is what we’re doing.’”

That innate curiosity and true passion for unraveling the complexities of the search algorithms we work with each day made talking with Bill and reading his work a real joy.

I can’t tell you how many times I’ve gone to Bill or referenced his work in mine over the years, as have so many others.


He had a real talent for making complex concepts more accessible for readers and marketers of all stripes. As a result, his contributions to our collective understanding of how search works are unrivaled.


Bill Slawski’s work and knowledge are foundational to the practice of SEO as we know it today.

I speak for all of us at SEJ in saying we’re incredibly grateful for what he generously shared with each of us.

He was a close friend and respected colleague to our founder, Loren Baker, as well.

“Bill Slawski was a true friend of mine in more ways than one. First of all, he was a surprising mentor who helped me out quite a bit early on in my career, even before the days of social media or Search Engine Journal. He was my buddy and workmate,” Loren said.

Loren Baker and Bill Slawski

Bill and Loren worked together for a couple of years and spent a lot of time out in the parking lot in Havre de Grace, Maryland, smoking cigarettes and talking about Google patents.

“If anything, I would say that Bill taught me that there was much more to SEO than just ranking alone,” Loren explained, adding that Bill taught him the importance of incorporating a narrative into all of the work that you do.

“He taught me the ethics and workmanship behind creating a piece of digital art that people will want to read, will want to share, and will ultimately search for and click on–touching their lives,” he said. “I will miss Bill deeply. It’s very difficult losing friends.”

Having started in 1996 and launched SEO By The Sea in 2005, Bill was the go-to source when you wanted to understand how search engines work or how they change the way we search or live our lives.


But it was so much more than that.

Bill was generous with his time and eager to share his knowledge of search, information retrieval, NLP, and other information technology with any and all.

He had a gift for taking complex patents, algorithms, concepts, real-world behavior, and search engines and explaining how the world of search and information retrieval worked in a way that everyone could understand.

Bill seemed to have an instinct for understanding what you knew and didn’t know or where you were confused. He could fill in the gaps without making you feel silly for having asked. Even if it was the millionth time he’d answered that question.

You didn’t have to be an SEO rockstar or an experienced professional, either.

If you didn’t understand something or had questions, he would happily spend hours explaining the concepts and offering (or creating) resources to help. And as many in the industry who encountered Braggadocio can attest to, you always felt like a long-lost friend, even if you had just “met” him in text.

“It’s like when you go to a conference and you’re one of the first people there. And all the seats are still empty and there’s not a lot of discussion going on. That’s what the SEO world was like back then…I remember happening upon an SEO forum and just being a lurker. Just looking at what everybody was talking about and thinking, ‘this is a strange career. I’m not sure I can do this.’ In the end, I did it.

I started out working and promoting a website for a couple friends who started a business. And so helping them succeed in business was a pretty good motivation.” Bill Slawski, cognitiveSEO Talks interview, April 5, 2018

Bill’s wealth of knowledge extended far beyond search, too.


With a Bachelor of Arts in English from the University of Delaware and a Juris Doctor degree from Widener University School of Law, Bill spent 14 years as a court manager, administrator, technologist, and management analyst with the Superior Court of Delaware.


He loved nature and plants, and the ocean. He loved traveling and search conferences, but he ultimately found peace in nature and took advantage of it often. And he shared it with us all.

Bill pushed everyone to look beyond the headlines and keywords.

He was quick to add words of support and congratulations when someone shared an achievement. He encouraged everyone to explore the possible, to not be intimidated by new things, and to better understand the search ecosystem, not just the technology, so we could better serve our families, communities, colleagues, and clients.

His kindness, generosity, loyalty, and love of the industry knew no bounds.

The King of Podcasts on Twitter

Marshall Simmonds on Twitter

Here at Search Engine Journal, Bill was a familiar face on social media and a VIP contributor, but he was much more than that.

Matt Southern, News Writer

One of the things I’ll miss most about Bill Slawski is the outdoor photography he shared on Twitter.


As deeply entrenched as he was in SEO and online marketing, he always took time to step back from the keyboard and admire life’s beauty.

I think that’s something we could all benefit from doing more of.

Roger Montti, News Writer

I knew Bill Slawski for almost 20 years, from the forums and search marketing conferences. He created a stir with all the things he discovered in the patents, which went a long way toward demystifying what search engines did.


What impressed me the most was his generosity with his time and how encouraging he was to me and to everyone. I feel privileged and honored to have been able to call him a friend.

He will be profoundly missed.

Brent Csutoras, Advisor and Owner

So much of our marketing journey has been in understanding not only how something works with Google but what they are trying to accomplish over the coming years so we can be prepared and ready to pivot when needed.


Bill’s work with patents provided valuable insight very few individuals were capable of distilling and yet everyone benefited from.

He was instrumental in getting us to where we are as SEOs and digital marketers today.

Bill Slawski Was A Man Of Quiet Impact

“My first interaction with Bill Slawski was on Kim Krause Berg’s Cre8asite forum. I was trying to learn what SEO was all about, so I just lurked, soaking up knowledge from bragadocchio, Black Knight, Grumpus, Barry Welford, and others. I know that Bill started more than 10,000 threads there during his time as one of the admins, and one of the first things that struck me was his willingness to patiently share his knowledge. At the time, I had no idea who he was, but it quickly became obvious that he was someone who was worth listening to.”

~ Doc Sheldon, Facebook

That he was.

Atul Gawande once wrote that life is meaningful because it has a story–one driven by a deep need to identify purposes outside of ourselves and a transcendent desire to see and help others achieve their potential.

This was the very essence of Bill’s life.

Not just in the wealth of unparalleled knowledge and resources he has gifted to us, but in the inspiration, guidance, and encouragement he has instilled in us all. That is his legacy and one that will live on.

It’s been difficult to hit Publish on this piece as I don’t feel anything we share could do that legacy justice.


Search Engine Journal will leave Bill’s library of content here untouched in perpetuity, and we’ve left comments open below for all to share your contributions to this memorial for Bill.

Thank you, Bill, for sharing your intelligence, passion, and knowledge with the SEO community.

You will be sorely missed.

Written in collaboration with Angie Nikoleychuk.



