
A Practical Guide To Multi-touch Attribution

The customer journey involves multiple interactions between the customer and the merchant or service provider.

We call each interaction in the customer journey a touch point.

According to Salesforce.com, it takes, on average, six to eight touches to generate a lead in the B2B space.

The number of touchpoints is even higher for a customer purchase.

Multi-touch attribution is the mechanism for evaluating each touch point’s contribution toward conversion and giving appropriate credit to every touch point involved in the customer journey.

Conducting a multi-touch attribution analysis can help marketers understand the customer journey and identify opportunities to further optimize the conversion paths.

In this article, you will learn the basics of multi-touch attribution, and the steps of conducting multi-touch attribution analysis with easily accessible tools.

What To Consider Before Conducting Multi-Touch Attribution Analysis

Define The Business Objective

What do you want to achieve from the multi-touch attribution analysis?

Do you want to evaluate the return on investment (ROI) of a particular marketing channel, understand your customer’s journey, or identify critical pages on your website for A/B testing?

Different business objectives may require different attribution analysis approaches.

Defining what you want to achieve from the beginning helps you get the results faster.

Define Conversion

Conversion is the desired action you want your customers to take.

For ecommerce sites, it’s usually making a purchase, defined by the order completion event.

For other industries, it may be an account sign-up or a subscription.

Different types of conversion likely have different conversion paths.

If you want to perform multi-touch attribution on multiple desired actions, I would recommend separating them into different analyses to avoid confusion.

Define Touch Point

Touch point could be any interaction between your brand and your customers.

If this is your first time running a multi-touch attribution analysis, I would recommend defining it as a visit to your website from a particular marketing channel. Channel-based attribution is easy to conduct, and it could give you an overview of the customer journey.

If you want to understand how your customers interact with your website, I would recommend defining touchpoints based on pageviews on your website.

If you want to include interactions outside of the website, such as mobile app installation, email open, or social engagement, you can incorporate those events in your touch point definition, as long as you have the data.

Regardless of your touch point definition, the attribution mechanism is the same. The more granular the touch points are defined, the more detailed the attribution analysis is.

In this guide, we’ll focus on channel-based and pageview-based attribution.

You’ll learn about how to use Google Analytics and another open-source tool to conduct those attribution analyses.

An Introduction To Multi-Touch Attribution Models

The ways of crediting touch points for their contributions to conversion are called attribution models.

The simplest attribution model is to give all the credit to either the first touch point, for bringing in the customer initially, or the last touch point, for driving the conversion.

These two models are called the first-touch attribution model and the last-touch attribution model, respectively.

Obviously, neither the first-touch nor the last-touch attribution model is “fair” to the rest of the touch points.

Then, how about allocating credit evenly across all touch points involved in converting a customer? That sounds reasonable – and this is exactly how the linear attribution model works.

However, allocating credit evenly across all touch points assumes the touch points are equally important, which doesn’t seem “fair”, either.

Some argue the touch points near the end of the conversion paths are more important, while others are in favor of the opposite. As a result, we have the position-based attribution model that allows marketers to give different weights to touchpoints based on their locations in the conversion paths.

All the models mentioned above are under the category of heuristic, or rule-based, attribution models.
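To make the rule-based models concrete, here is a minimal Python sketch of how each one could split the credit for a single converting path. It is an illustration only, not how any particular analytics tool implements these models: the function name `heuristic_credit` and the 40/20/40 position-based split are assumptions chosen for the example.

```python
def heuristic_credit(path, model="linear", first_weight=0.4, last_weight=0.4):
    """Split one conversion's credit across the touch points in `path`.

    The 40/20/40 position-based weights are an assumption for illustration;
    the credits returned always sum to 1.
    """
    n = len(path)
    if n == 0:
        return {}
    if model == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "linear":
        weights = [1.0 / n] * n
    elif model == "position_based":
        if n == 1:
            weights = [1.0]
        elif n == 2:
            # No middle touches: share credit in proportion to the end weights.
            total = first_weight + last_weight
            weights = [first_weight / total, last_weight / total]
        else:
            middle = (1.0 - first_weight - last_weight) / (n - 2)
            weights = [first_weight] + [middle] * (n - 2) + [last_weight]
    else:
        raise ValueError(f"Unknown model: {model}")

    # A touch point that appears more than once accumulates its credits.
    credit = {}
    for touch, weight in zip(path, weights):
        credit[touch] = credit.get(touch, 0.0) + weight
    return credit


path = ["Organic Search", "Email", "Direct", "Email"]
for model in ("first_touch", "last_touch", "linear", "position_based"):
    print(model, heuristic_credit(path, model))
```

Because Email appears twice in this sample path, it collects credit from both of its positions under the linear and position-based rules.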

In addition to heuristic models, we have another model category called data-driven attribution, which is now the default model used in Google Analytics.

What Is Data-Driven Attribution?

How is data-driven attribution different from the heuristic attribution models?

Here are some highlights of the differences:

  • In a heuristic model, the rule of attribution is predetermined. Regardless of first-touch, last-touch, linear, or position-based model, the attribution rules are set in advance and then applied to the data. In a data-driven attribution model, the attribution rule is created based on historical data, and therefore, it is unique for each scenario.
  • A heuristic model looks at only the paths that lead to a conversion and ignores the non-converting paths. A data-driven model uses data from both converting and non-converting paths.
  • A heuristic model attributes conversions based on how many touches each touch point has and where those touches fall under the attribution rules. A data-driven model attributes conversions based on the measured effect of each touch point’s touches.

How To Evaluate The Effect Of A Touch Point

A common algorithm used by data-driven attribution is called Markov Chain. At the heart of the Markov Chain algorithm is a concept called the Removal Effect.

The Removal Effect, as the name suggests, is the impact on conversion rate when a touch point is removed from the pathing data.

This article will not go into the mathematical details of the Markov Chain algorithm.

Below is an example illustrating how the algorithm attributes conversion to each touch point.

The Removal Effect

Assume a scenario where 100 conversions come from 1,000 visitors who reach a website via three channels: A, B, and C. In this case, the conversion rate is 10%.

Intuitively, if a certain channel is removed from the conversion paths, those paths involving that particular channel will be “cut off” and end with fewer conversions overall.

If the conversion rate is lowered to 5%, 2%, and 1% when Channels A, B, & C are removed from the data, respectively, we can calculate the Removal Effect as the percentage decrease of the conversion rate when a particular channel is removed using the formula:

Removal Effect = 1 – (conversion rate with the channel removed / conversion rate with all channels)

Then, the last step is attributing conversions to each channel based on the share of the Removal Effect of each channel. Here is the attribution result:

Channel | Removal Effect | Share of Removal Effect | Attributed Conversions
A | 1 – (5% / 10%) = 0.5 | 0.5 / (0.5 + 0.8 + 0.9) = 0.23 | 100 * 0.23 = 23
B | 1 – (2% / 10%) = 0.8 | 0.8 / (0.5 + 0.8 + 0.9) = 0.36 | 100 * 0.36 = 36
C | 1 – (1% / 10%) = 0.9 | 0.9 / (0.5 + 0.8 + 0.9) = 0.41 | 100 * 0.41 = 41
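The same arithmetic can be reproduced in a few lines of Python. This sketch simply restates the worked example; the variable names are chosen for illustration, and the shares are rounded to two decimals before attributing conversions, as in the table.

```python
baseline_rate = 0.10          # 100 conversions from 1,000 visitors
total_conversions = 100

# Conversion rate observed when each channel is removed from the paths.
rate_without = {"A": 0.05, "B": 0.02, "C": 0.01}

# Removal Effect: relative drop in conversion rate when the channel is removed.
removal_effect = {
    channel: 1 - rate / baseline_rate for channel, rate in rate_without.items()
}

# Each channel's share of the summed Removal Effects, rounded as in the table.
total_effect = sum(removal_effect.values())
share = {
    channel: round(effect / total_effect, 2)
    for channel, effect in removal_effect.items()
}

# Conversions attributed in proportion to that share.
attributed = {channel: round(total_conversions * s) for channel, s in share.items()}

for channel in rate_without:
    print(channel, round(removal_effect[channel], 2), share[channel], attributed[channel])
```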

In a nutshell, data-driven attribution does not rely on the number or position of the touch points but on the impact of those touch points on conversion as the basis of attribution.
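For readers who want to see where those "conversion rate when a channel is removed" figures could come from, below is a simplified first-order Markov chain sketch in plain Python. It is an assumption-laden illustration, not the algorithm used by GA4 or any specific library: the function names, the `start`/`conv`/`null` state labels, and the tiny sample data are all made up for the example, and it assumes every path has at least one conversion or non-conversion and that the data contains at least one conversion overall.

```python
from collections import defaultdict


def transition_probabilities(paths):
    """paths: list of (touch points, conversions, non-conversions)."""
    counts = defaultdict(lambda: defaultdict(float))
    for touches, conversions, nulls in paths:
        states = ["start"] + list(touches)
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += conversions + nulls
        # The last touch point leads either to a conversion or to a dead end.
        counts[states[-1]]["conv"] += conversions
        counts[states[-1]]["null"] += nulls
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }


def conversion_probability(transitions, removed=None, sweeps=1000):
    """Probability of reaching 'conv' from 'start', by repeated sweeps.

    A removed channel keeps probability 0, so paths through it are cut off.
    """
    prob = defaultdict(float)
    prob["conv"] = 1.0
    for _ in range(sweeps):
        for state, nexts in transitions.items():
            if state == removed:
                continue
            prob[state] = sum(p * prob[nxt] for nxt, p in nexts.items())
    return prob["start"]


def markov_attribution(paths):
    transitions = transition_probabilities(paths)
    baseline = conversion_probability(transitions)
    channels = [state for state in transitions if state != "start"]
    effects = {
        c: 1 - conversion_probability(transitions, removed=c) / baseline
        for c in channels
    }
    total_effect = sum(effects.values())
    total_conversions = sum(conversions for _, conversions, _ in paths)
    return {c: total_conversions * e / total_effect for c, e in effects.items()}


# Hypothetical data: (path, conversions, non-conversions).
sample_paths = [(["A", "B"], 1, 0), (["B"], 1, 2)]
print(markov_attribution(sample_paths))
```

In this toy data every conversion passes through channel B, so removing B cuts off all conversions and B receives most of the credit.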

Multi-Touch Attribution With Google Analytics

Enough theory. Let’s look at how we can use the ubiquitous Google Analytics to conduct multi-touch attribution analysis.

As Google will stop supporting Universal Analytics (UA) from July 2023, this tutorial will be based on Google Analytics 4 (GA4) and we’ll use Google’s Merchandise Store demo account as an example.

In GA4, the attribution reports are under Advertising Snapshot as shown below on the left navigation menu.

After landing on the Advertising Snapshot page, the first step is selecting an appropriate conversion event.

GA4, by default, includes all conversion events for its attribution reports.

To avoid confusion, I highly recommend you pick only one conversion event (“purchase” in the below example) for the analysis.

advertising snapshot GA4Screenshot from GA4, November 2022

 

Understand The Conversion Paths In GA4

Under the Attribution section on the left navigation bar, you can open the Conversion Paths report.

Scroll down to the conversion path table, which shows all the paths leading to conversion.

At the top of this table, you can find the average number of days and number of touch points that lead to conversions.

GA4 touchpoints to conversionScreenshot from GA4, November 2022 

 

In this example, you can see that Google customers take, on average, almost 9 days and 6 visits before making a purchase on its Merchandise Store.

Find Each Channel’s Contribution In GA4

Next, click the All Channels report under the Performance section on the left navigation bar.

In this report, you can find the attributed conversions for each channel of your selected conversion event – “purchase”, in this case.

All channels report GA4Screenshot from GA4, November 2022

 

Now, you know Organic Search, together with Direct and Email, drove most of the purchases on Google’s Merchandise Store.

Examine Results From Different Attribution Models In GA4

By default, GA4 uses the data-driven attribution model to determine how much credit each channel receives. However, you can examine how different attribution models assign credit to each channel.

Click Model Comparison under the Attribution section on the left navigation bar.

For example, comparing the data-driven attribution model with the first touch attribution model (aka “first click model” in the below figure), you can see more conversions are attributed to Organic Search under the first click model (735) than the data-driven model (646.80).

On the other hand, Email has more attributed conversions under the data-driven attribution model (727.82) than the first click model (552).

Attribution models for channel grouping GA4Screenshot from GA4, November 2022

 

The data tells us that Organic Search plays an important role in bringing potential customers to the store, but it needs help from other channels to convert visitors (i.e., for customers to make actual purchases).

On the other hand, Email, by nature, interacts with visitors who have visited the site before and helps to convert returning visitors who initially came to the site from other channels.

Which Attribution Model Is The Best?

A common question, when it comes to attribution model comparison, is which attribution model is the best. I’d argue this is the wrong question for marketers to ask.

The truth is that no one model is absolutely better than the others as each model illustrates one aspect of the customer journey. Marketers should embrace multiple models as they see fit.

From Channel-Based To Pageview-Based Attribution

Google Analytics is easy to use, but it works well only for channel-based attribution.

If you want to further understand how customers navigate through your website before converting, and what pages influence their decisions, you need to conduct attribution analysis on pageviews.

While Google Analytics doesn’t support pageview-based attribution, there are other tools you can use.

We recently performed such a pageview-based attribution analysis on AdRoll’s website and I’d be happy to share with you the steps we went through and what we learned.

Gather Pageview Sequence Data

The first and most challenging step is gathering data on the sequence of pageviews for each visitor on your website.

Most web analytics systems record this data in some form. If your analytics system doesn’t provide a way to extract the data from the user interface, you may need to pull the data from the system’s database.

Similar to the steps we went through on GA4, the first step is defining the conversion. With pageview-based attribution analysis, you also need to identify the pages that are part of the conversion process.

As an example, for an ecommerce site with online purchase as the conversion event, the shopping cart page, the billing page, and the order confirmation page are part of the conversion process, as every conversion goes through those pages.

You should exclude those pages from the pageview data since you don’t need an attribution analysis to tell you those pages are important for converting your customers.

The purpose of this analysis is to understand what pages your potential customers visited prior to the conversion event and how they influenced the customers’ decisions.

Prepare Your Data For Attribution Analysis

Once the data is ready, the next step is to summarize and manipulate your data into the following four-column format. Here is an example.

data manipulation: 4-column formatScreenshot from author, November 2022

 

The Path column shows all the pageview sequences. You can use any unique page identifier, but I’d recommend using the URL or page path because it allows you to analyze the results by page type using the URL structure. “>” is the separator used between pages.

The Total_Conversions column shows the total number of conversions a particular pageview path led to.

The Total_Conversion_Value column shows the total monetary value of the conversions from a particular pageview path. This column is optional and is mostly applicable to ecommerce sites.

The Total_Null column shows the total number of times a particular pageview path failed to convert.
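If your pageview data arrives as one record per visitor journey, a small script can roll it up into this four-column shape. The sketch below is a hypothetical example in plain Python: the input layout (a list of pages, a converted flag, and an order value), the function name `summarize_paths`, and the page paths in `CONVERSION_PAGES` are assumptions, so adapt them to whatever your analytics export actually provides.

```python
from collections import defaultdict

# Hypothetical pages that belong to the conversion process and are excluded.
CONVERSION_PAGES = {"/cart", "/billing", "/order-confirmation"}


def summarize_paths(journeys, excluded_pages=CONVERSION_PAGES):
    """journeys: iterable of (pages, converted, value) tuples, one per visitor."""
    totals = defaultdict(
        lambda: {"Total_Conversions": 0, "Total_Conversion_Value": 0.0, "Total_Null": 0}
    )
    for pages, converted, value in journeys:
        kept = [page for page in pages if page not in excluded_pages]
        if not kept:
            continue  # nothing left to attribute once conversion pages are dropped
        row = totals[" > ".join(kept)]
        if converted:
            row["Total_Conversions"] += 1
            row["Total_Conversion_Value"] += value
        else:
            row["Total_Null"] += 1
    return [{"Path": path, **row} for path, row in totals.items()]


journeys = [
    (["/", "/pricing", "/cart", "/billing", "/order-confirmation"], True, 100.0),
    (["/", "/pricing"], False, 0.0),
    (["/", "/pricing", "/cart"], False, 0.0),
    (["/blog/a", "/pricing", "/cart", "/billing", "/order-confirmation"], True, 50.0),
]
for row in summarize_paths(journeys):
    print(row)
```

Note that an abandoned cart visit collapses into the same path as a visit that never reached the cart, because the cart page is excluded before the paths are counted.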

Build Your Page-Level Attribution Models

To build the attribution models, we leverage the open-source library called ChannelAttribution.

While this library was originally created for use in R and Python programming languages, the authors now provide a free Web app for it, so we can use this library without writing any code.

Upon signing into the Web app, you can upload your data and start building the models.

For first-time users, I’d recommend clicking the Load Demo Data button for a trial run. Be sure to examine the parameter configuration with the demo data.

Load Demo Data buttonScreenshot from author, November 2022

When you’re ready, click the Run button to create the models.

Once the models are created, you’ll be directed to the Output tab, which displays the attribution results from four different attribution models – first-touch, last-touch, linear, and data-driven (Markov Chain).

Remember to download the result data for further analysis.

For your reference, while this tool is called ChannelAttribution, it’s not limited to channel-specific data.

Since the attribution modeling mechanism is agnostic to the type of data given to it, it’d attribute conversions to channels if channel-specific data is provided, and to web pages if pageview data is provided.

Analyze Your Attribution Data

Organize Pages Into Page Groups

Depending on the number of pages on your website, it may make more sense to first analyze your attribution data by page groups rather than individual pages.

A page group can contain as few as just one page to as many pages as you want, as long as it makes sense to you.

Taking AdRoll’s website as an example, we have a Homepage group that contains just the homepage and a Blog group that contains all of our blog posts.

For ecommerce sites, you may consider grouping your pages by product categories as well.

Starting with page groups instead of individual pages allows marketers to have an overview of the attribution results across different parts of the website. You can always drill down from the page group to individual pages when needed.
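Grouping can be done with a simple mapping from URL path to group name, followed by a sum of the attributed conversions per group. The following Python sketch assumes a hypothetical URL structure (`/blog/`, `/features/`, `/pricing`) and made-up attribution values; the function names are illustrative only.

```python
from collections import defaultdict


def page_group(path):
    """Map a page path to a page group (hypothetical URL structure)."""
    if path == "/":
        return "Homepage"
    if path.startswith("/blog/"):
        return "Blog"
    if path.startswith("/features/"):
        return "Product Features"
    if path == "/pricing":
        return "Pricing"
    return "Other"


def attribution_by_group(page_values):
    """Sum page-level attributed conversions into page groups."""
    totals = defaultdict(float)
    for path, value in page_values.items():
        totals[page_group(path)] += value
    return dict(totals)


# Made-up attributed conversions per page, e.g. from one attribution model.
page_values = {
    "/": 10.0,
    "/blog/a": 2.5,
    "/blog/b": 1.5,
    "/features/x": 4.0,
    "/pricing": 6.0,
    "/careers": 0.5,
}
print(attribution_by_group(page_values))
```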

Identify The Entries And Exits Of The Conversion Paths

After all the data preparation and model building, let’s get to the fun part – the analysis.

I’d suggest first identifying the pages through which your potential customers enter your website and the pages that direct them to convert, by examining the patterns of the first-touch and last-touch attribution models.

Pages with particularly high first-touch and last-touch attribution values are the starting points and endpoints, respectively, of the conversion paths. These are what I call gateway pages.

Make sure these pages are optimized for conversion.

Keep in mind that this type of gateway page may not have very high traffic volume.

For example, as a SaaS platform, AdRoll’s pricing page doesn’t have high traffic volume compared to some other pages on the website but it’s the page many visitors visited before converting.

Find Other Pages With Strong Influence On Customers’ Decisions

After the gateway pages, the next step is to find out what other pages have a high influence on your customers’ decisions.

For this analysis, we look for non-gateway pages with high attribution value under the Markov Chain model.

Taking the group of product feature pages on AdRoll.com as an example, the pattern of their attribution value across the four models (shown below) shows they have the highest attribution value under the Markov Chain model, followed by the linear model.

This is an indication that they are visited in the middle of the conversion paths and played an important role in influencing customers’ decisions.

4 attribution models bar chartImage from author, November 2022

 

These types of pages are also prime candidates for conversion rate optimization (CRO).

Making them easier to be discovered by your website visitors and their content more convincing would help lift your conversion rate.

To Recap

Multi-touch attribution allows a company to understand the contribution of various marketing channels and identify opportunities to further optimize the conversion paths.

Start simply with Google Analytics for channel-based attribution. Then, dig deeper into a customer’s pathway to conversion with pageview-based attribution.

Don’t worry about picking the best attribution model.

Leverage multiple attribution models, as each attribution model shows different aspects of the customer journey.

Essential Functions For SEO Data Analysis

Learning to code, whether with Python, JavaScript, or another programming language, has a whole host of benefits, including the ability to work with larger datasets and automate repetitive tasks.

But despite the benefits, many SEO professionals are yet to make the transition – and I completely understand why! It isn’t an essential skill for SEO, and we’re all busy people.

If you’re pressed for time, and you already know how to accomplish a task within Excel or Google Sheets, then changing tack can feel like reinventing the wheel.

When I first started coding, I initially only used Python for tasks that I couldn’t accomplish in Excel – and it’s taken several years to get to the point where it’s my de facto choice for data processing.

Looking back, I’m incredibly glad that I persisted, but at times it was a frustrating experience, with many an hour spent scanning threads on Stack Overflow.

This post is designed to spare other SEO pros the same fate.

Within it, we’ll cover the Python equivalents of the most commonly used Excel formulas and features for SEO data analysis – all of which are available within a Google Colab notebook linked in the summary.

Specifically, you’ll learn the equivalents of:

  • LEN.
  • Drop Duplicates.
  • Text to Columns.
  • SEARCH/FIND.
  • CONCATENATE.
  • Find and Replace.
  • LEFT/MID/RIGHT.
  • IF.
  • IFS.
  • VLOOKUP.
  • COUNTIF/SUMIF/AVERAGEIF.
  • Pivot Tables.

Amazingly, to accomplish all of this, we’ll primarily be using a single library – Pandas – with a little help in places from its big brother, NumPy.

Prerequisites

For the sake of brevity, there are a few things we won’t be covering today, including:

  • Installing Python.
  • Basic Pandas, like importing CSVs, filtering, and previewing dataframes.

If you’re unsure about any of this, then Hamlet’s guide on Python data analysis for SEO is the perfect primer.

Now, without further ado, let’s jump in.

LEN

LEN provides a count of the number of characters within a string of text.

For SEO specifically, a common use case is to measure the length of title tags or meta descriptions to determine whether they’ll be truncated in search results.

Within Excel, if we wanted to count the second cell of column A, we’d enter:

=LEN(A2)
Screenshot from Microsoft Excel, November 2022

Python isn’t too dissimilar, as we can rely on the inbuilt len function, which can be combined with Pandas’ loc[] to access a specific row of data within a column:

len(df['Title'].loc[0])

In this example, we’re getting the length of the first row in the “Title” column of our dataframe.

len function python
Screenshot of VS Code, November, 2022

Finding the length of a cell isn’t that useful for SEO, though. Normally, we’d want to apply a function to an entire column!

In Excel, this would be achieved by selecting the formula cell on the bottom right-hand corner and either dragging it down or double-clicking.

When working with a Pandas dataframe, we can use str.len to calculate the length of rows within a series, then store the results in a new column:

df['Length'] = df['Title'].str.len()

str.len is a ‘vectorized’ operation, which is designed to be applied simultaneously to a series of values. We’ll use these operations extensively throughout this article, as they almost universally end up being faster than a loop.

Another common application of LEN is to combine it with SUBSTITUTE to count the number of words in a cell:

=LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1

In Pandas, we can achieve this by combining the str.split and str.len functions together:

df['No. Words'] = df['Title'].str.split().str.len()

We’ll cover str.split in more detail later, but essentially, what we’re doing is splitting our data based upon whitespaces within the string, then counting the number of component parts.

word count PythonScreenshot from VS Code, November 2022
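Putting the two measurements together, here is a small self-contained example that also flags titles likely to be truncated. The sample titles and the 60-character threshold are assumptions for illustration; search engines truncate by pixel width, so treat any character limit as a rough guide.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Title": ["Short title", "A" * 70]})

# Character count and word count for every row at once.
df["Length"] = df["Title"].str.len()
df["No. Words"] = df["Title"].str.split().str.len()

# Flag titles above an assumed 60-character guideline.
df["Too Long"] = df["Length"] > 60

print(df[["Length", "No. Words", "Too Long"]])
```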

Dropping Duplicates

Excel’s ‘Remove Duplicates’ feature provides an easy way to remove duplicate values within a dataset, either by deleting entirely duplicate rows (when all columns are selected) or removing rows with the same values in specific columns.

Excel drop duplicatesScreenshot from Microsoft Excel, November 2022

In Pandas, this functionality is provided by drop_duplicates.

To drop duplicate rows within a dataframe, type:

df.drop_duplicates(inplace=True)

To drop rows based on duplicates within a single column, include the subset parameter:

df.drop_duplicates(subset="column", inplace=True)

Or specify multiple columns within a list:

df.drop_duplicates(subset=['column','column2'], inplace=True)

One addition above that’s worth calling out is the presence of the inplace parameter. Including inplace=True allows us to overwrite our existing dataframe without needing to create a new one.

There are, of course, times when we want to preserve our raw data. In this case, we can assign our deduped dataframe to a different variable:

df2 = df.drop_duplicates(subset="column")
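The short example below shows how the different calls behave on the same data, using a made-up dataframe. It also uses the keep parameter, which controls which of the duplicate rows survives; by default, the first occurrence is kept.

```python
import pandas as pd

# Hypothetical sample data with repeated URLs.
df = pd.DataFrame(
    {
        "URL": ["/a", "/a", "/b", "/b"],
        "Keyword": ["x", "x", "y", "z"],
    }
)

# Only rows that are identical across every column are dropped.
exact = df.drop_duplicates()

# One row per URL, keeping the first occurrence (the default).
by_url = df.drop_duplicates(subset="URL")

# One row per URL, keeping the last occurrence instead.
by_url_last = df.drop_duplicates(subset="URL", keep="last")

print(len(exact), len(by_url), by_url_last["Keyword"].tolist())
```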

Text To Columns

Another everyday essential, the ‘text to columns’ feature can be used to split a text string based on a delimiter, such as a slash, comma, or whitespace.

As an example, splitting a URL into its domain and individual subfolders.

Excel drop duplicatesScreenshot from Microsoft Excel, November 2022

When dealing with a dataframe, we can use the str.split function, which creates a list for each entry within a series. This can be converted into multiple columns by setting the expand parameter to True:

df['URL'].str.split(pat="/", expand=True)
str split PythonScreenshot from VS Code, November 2022

As is often the case, our URLs in the image above have been broken up into inconsistent columns, because they don’t feature the same number of folders.

This can make things tricky when we want to save our data within an existing dataframe.

Specifying the n parameter limits the number of splits, allowing us to create a specific number of columns:

df[['Domain', 'Folder1', 'Folder2', 'Folder3']] = df['URL'].str.split(pat="/", expand=True, n=3)

Another option is to use pop to remove your column from the dataframe, perform the split, and then re-add it with the join function:

df = df.join(df.pop('Split').str.split(pat="/", expand=True))

Duplicating the URL to a new column before the split allows us to preserve the full URL. We can then rename the new columns:

df['Split'] = df['URL']

df = df.join(df.pop('Split').str.split(pat="/", expand=True))

df.rename(columns = {0:'Domain', 1:'Folder1', 2:'Folder2', 3:'Folder3', 4:'Parameter'}, inplace=True)
Split pop join functions PythonScreenshot from VS Code, November 2022

CONCATENATE

The CONCAT function allows users to combine multiple strings of text, such as when generating a list of keywords by adding different modifiers.

In this case, we’re adding “mens” and whitespace to column A’s list of product types:

=CONCAT($F$1," ",A2)
concat Excel
Screenshot from Microsoft Excel, November 2022

Assuming we’re dealing with strings, the same can be achieved in Python using the + operator:

df['Combined'] = 'mens' + ' ' + df['Keyword']

Or specify multiple columns of data:

df['Combined'] = df['Subdomain'] + df['URL']
concat PythonScreenshot from VS Code, November 2022

Pandas has a dedicated concat function, but this is more useful when trying to combine multiple dataframes with the same columns.

For instance, if we had multiple exports from our favorite link analysis tool:

df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
df3 = pd.read_csv('data3.csv')

dflist = [df, df2, df3]

df = pd.concat(dflist, ignore_index=True)

SEARCH/FIND

The SEARCH and FIND formulas provide a way of locating a substring within a text string.

These commands are commonly combined with ISNUMBER to create a Boolean column that helps filter down a dataset, which can be extremely helpful when performing tasks like log file analysis, as explained in this guide. E.g.:

=ISNUMBER(SEARCH("searchthis",A2))
isnumber search ExcelScreenshot from Microsoft Excel, November 2022

The difference between SEARCH and FIND is that FIND is case-sensitive.

The equivalent Pandas function, str.contains, is case-sensitive by default:

df['Journal'] = df['URL'].str.contains('engine', na=False)

Case insensitivity can be enabled by setting the case parameter to False:

df['Journal'] = df['URL'].str.contains('engine', case=False, na=False)

In either scenario, including na=False will prevent null values from being returned within the Boolean column.

One massive advantage of using Pandas here is that, unlike Excel, regex is natively supported by this function – as it is in Google Sheets via REGEXMATCH.

Chain together multiple substrings by using the pipe character, also known as the OR operator:

df['Journal'] = df['URL'].str.contains('engine|search', na=False)
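One side effect of regex being switched on by default is that characters such as "?" or "." are treated as pattern syntax rather than literal text. The sketch below, built on made-up URLs, shows one way to handle that by passing regex=False for a literal match, alongside the pipe-separated pattern from above.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame(
    {
        "URL": [
            "https://example.com/search?q=seo",
            "https://example.com/Engine",
            None,
        ]
    }
)

# Literal match: regex=False treats "?" as a plain character.
df["Has Query"] = df["URL"].str.contains("?", regex=False, na=False)

# Regex match on either substring, ignoring case.
df["Journal"] = df["URL"].str.contains("engine|search", case=False, na=False)

print(df[["Has Query", "Journal"]])
```

Because na=False is set, the row with the missing URL returns False in both columns instead of a null value.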

Find And Replace

Excel’s “Find and Replace” feature provides an easy way to individually or bulk replace one substring with another.

find replace ExcelScreenshot from Microsoft Excel, November 2022

When processing data for SEO, we’re most likely to select an entire column and “Replace All.”

The SUBSTITUTE formula provides another option here and is useful if you don’t want to overwrite the existing column.

As an example, we can change the protocol of a URL from HTTP to HTTPS, or remove it by replacing it with nothing.

When working with dataframes in Python, we can use str.replace:

df['URL'] = df['URL'].str.replace('http://', 'https://')

Or:

df['URL'] = df['URL'].str.replace('http://', '') # replace with nothing

Again, unlike Excel, regex can be used – like with Google Sheets’ REGEXREPLACE:

df['URL'] = df['URL'].str.replace('http://|https://', '', regex=True)

Alternatively, if you want to replace multiple substrings with different values, you can use Pandas’ replace method and provide a list.

This prevents you from having to chain multiple str.replace functions:

df['URL'] = df['URL'].replace(['http://', 'https://'], ['https://www.', 'https://www.'], regex=True)

LEFT/MID/RIGHT

Extracting a substring within Excel requires the usage of the LEFT, MID, or RIGHT functions, depending on where the substring is located within a cell.

Let’s say we want to extract the root domain and subdomain from a URL:

=MID(A2,FIND(":",A2,4)+3,FIND("/",A2,9)-FIND(":",A2,4)-3)
left mid right ExcelScreenshot from Microsoft Excel, November 2022

Using a combination of MID and multiple FIND functions, this formula is ugly, to say the least – and things get a lot worse for more complex extractions.

Again, Google Sheets does this better than Excel, because it has REGEXEXTRACT.

What a shame that when you feed it larger datasets, it melts faster than a Babybel on a hot radiator.

Thankfully, Pandas offers str.extract, which works in a similar way:

df['Domain'] = df['URL'].str.extract('.*://?([^/]+)')
str extract PythonScreenshot from VS Code, November 2022

Combine with fillna to prevent null values, as you would in Excel with IFERROR:

df['Domain'] = df['URL'].str.extract('.*://?([^/]+)').fillna('-')
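When a pattern contains more than one capture group, str.extract returns one column per group, and named groups become the column names. The following sketch uses made-up URLs and a slightly different pattern from the one above (anchored to the start of the string, with named groups), so treat it as a variation rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical sample data; the last row has no protocol and will not match.
df = pd.DataFrame(
    {
        "URL": [
            "https://www.example.com/blog/post",
            "http://shop.example.com/",
            "example.com/page",
        ]
    }
)

# Named capture groups become the column names of the result.
parts = df["URL"].str.extract(r"^(?P<Protocol>https?)://(?P<Domain>[^/]+)")

# Rows that do not match come back as nulls; fill them as with IFERROR.
df = df.join(parts.fillna("-"))

print(df)
```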

IF

IF statements allow you to return different values, depending on whether or not a condition is met.

To illustrate, suppose that we want to create a label for keywords that are ranking within the top three positions.

Excel IFScreenshot from Microsoft Excel, November 2022

Rather than using Pandas in this instance, we can lean on NumPy and the where function (remember to import NumPy, if you haven’t already):

df['Top 3'] = np.where(df['Position'] <= 3, 'Top 3', 'Not Top 3')

Multiple conditions can be used for the same evaluation by combining them with the & (AND) and | (OR) operators, and enclosing the individual criteria within round brackets:

df['Top 3'] = np.where((df['Position'] <= 3) & (df['Position'] != 0), 'Top 3', 'Not Top 3')

In the above, we’re returning “Top 3” for any keywords with a ranking less than or equal to three, excluding any keywords ranking in position zero.

IFS

Sometimes, rather than specifying multiple conditions for the same evaluation, you may want multiple conditions that return different values.

In this case, the best solution is using IFS:

=IFS(B2<=3,"Top 3",B2<=10,"Top 10",B2<=20,"Top 20")
IFS ExcelScreenshot from Microsoft Excel, November 2022

Again, NumPy provides us with the best solution when working with dataframes, via its select function.

With select, we can create a list of conditions, choices, and an optional value for when all of the conditions are false:

conditions = [df['Position'] <= 3, df['Position'] <= 10, df['Position'] <=20]

choices = ['Top 3', 'Top 10', 'Top 20']

df['Rank'] = np.select(conditions, choices, 'Not Top 20')
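As with IFS, the order of the conditions matters: np.select returns the choice for the first condition that is true for each row. The example below, using made-up positions, shows the intended ordering and what happens when the same conditions are supplied in reverse.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking positions.
df = pd.DataFrame({"Position": [1, 7, 15, 42]})

conditions = [df["Position"] <= 3, df["Position"] <= 10, df["Position"] <= 20]
choices = ["Top 3", "Top 10", "Top 20"]

# Narrowest condition first: each row gets its tightest matching bucket.
df["Rank"] = np.select(conditions, choices, default="Not Top 20")

# Broadest condition first: every position up to 20 is labeled "Top 20".
df["Rank Reversed"] = np.select(
    conditions[::-1], choices[::-1], default="Not Top 20"
)

print(df)
```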

It’s also possible to have multiple conditions for each of the evaluations.

Let’s say we’re working with an ecommerce retailer with product listing pages (PLPs) and product display pages (PDPs), and we want to label the type of branded pages ranking within the top 10 results.

The easiest solution here is to look for specific URL patterns, such as a subfolder or extension, but what if competitors have similar patterns?

In this scenario, we could do something like this:

conditions = [(df['URL'].str.contains('/category/')) & (df['Brand Rank'] > 0),
(df['URL'].str.contains('/product/')) & (df['Brand Rank'] > 0),
(~df['URL'].str.contains('/product/')) & (~df['URL'].str.contains('/category/')) & (df['Brand Rank'] > 0)]

choices = ['PLP', 'PDP', 'Other']

df['Brand Page Type'] = np.select(conditions, choices, None)

Above, we’re using str.contains to evaluate whether or not a URL in the top 10 matches our brand’s pattern, then using the “Brand Rank” column to exclude any competitors.

In this example, the tilde sign (~) indicates a negative match. In other words, we’re saying we want every brand URL that doesn’t match the pattern for a “PDP” or “PLP” to match the criteria for ‘Other.’

Lastly, None is included because we want non-brand results to return a null value.

Image: np.select in Python. Screenshot from VS Code, November 2022
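The same logic can be easier to read if each test is stored in a named mask first. The sketch below assumes a made-up set of URLs and assumes that “Brand Rank” is 0 for competitor results; neither comes from a real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical SERP rows; "Brand Rank" is assumed to be 0 for competitors.
df = pd.DataFrame({
    "URL": [
        "https://brand.example/category/shoes",
        "https://brand.example/product/red-shoe",
        "https://brand.example/blog/shoe-care",
        "https://rival.example/product/blue-shoe",
    ],
    "Brand Rank": [1, 2, 3, 0],
})

# Named boolean masks keep the conditions list short and readable.
is_brand = df["Brand Rank"] > 0
is_plp = df["URL"].str.contains("/category/", regex=False)
is_pdp = df["URL"].str.contains("/product/", regex=False)

conditions = [is_plp & is_brand, is_pdp & is_brand, ~is_plp & ~is_pdp & is_brand]
choices = ["PLP", "PDP", "Other"]

# Competitor rows match none of the conditions and fall back to None.
df["Brand Page Type"] = np.select(conditions, choices, default=None)

print(df)
```

Passing regex=False treats the pattern as plain text, which is a reasonable choice here because the subfolder names contain no regular expression syntax.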

VLOOKUP

VLOOKUP is an essential tool for joining together two distinct datasets on a common column.

In this case, we’re adding the URLs within column N to the keyword, position, and search volume data in columns A-C, using the shared “Keyword” column:

=VLOOKUP(A2,M:N,2,FALSE)
Image: VLOOKUP in Excel. Screenshot from Microsoft Excel, November 2022

To do something similar with Pandas, we can use merge.

Replicating the functionality of an SQL join, merge is an incredibly powerful function that supports a variety of different join types.

For our purposes, we want to use a left join, which will maintain our first dataframe and only merge in matching values from our second dataframe:

mergeddf = df.merge(df2, how='left', on='Keyword')
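To see what a left join keeps and what it leaves empty, here is a small sketch with two made-up dataframes (the keywords and URLs are assumptions). It also uses merge’s indicator parameter, which adds a _merge column showing where each row came from:

```python
import pandas as pd

# Hypothetical keyword data and a separate keyword-to-URL lookup.
df = pd.DataFrame({
    "Keyword": ["red shoes", "blue shoes", "green shoes"],
    "Position": [1, 4, 9],
    "Search Volume": [1000, 500, 50],
})
df2 = pd.DataFrame({
    "Keyword": ["red shoes", "blue shoes"],
    "URL": ["https://example.com/red", "https://example.com/blue"],
})

# how="left" keeps every row of df; keywords missing from df2 get NaN.
# indicator=True adds a "_merge" column ("both" or "left_only").
mergeddf = df.merge(df2, how="left", on="Keyword", indicator=True)

print(mergeddf)
```

Rows marked left_only are the equivalent of an #N/A result from VLOOKUP: the keyword exists in the first dataset but has no match in the second.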

One added advantage of performing a merge over a VLOOKUP is that, as with the newer XLOOKUP, the shared data doesn’t have to be in the first column of the second dataset.

It will also pull in every matching row of data rather than only the first match it finds.
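Pulling in every match changes the row count, which can be a surprise if you expect VLOOKUP-style behaviour. The sketch below, using made-up data, shows both outcomes; dropping duplicates before the merge is one possible way to keep only the first match, not the only one:

```python
import pandas as pd

# Hypothetical data: "red shoes" has two URLs in the lookup table.
df = pd.DataFrame({"Keyword": ["red shoes", "blue shoes"], "Position": [1, 4]})
df2 = pd.DataFrame({
    "Keyword": ["red shoes", "red shoes", "blue shoes"],
    "URL": ["/red-a", "/red-b", "/blue"],
})

# Default behaviour: one output row per match, so "red shoes" appears twice.
all_matches = df.merge(df2, how="left", on="Keyword")

# VLOOKUP-style behaviour: keep only the first URL per keyword before merging.
first_only = df2.drop_duplicates(subset="Keyword", keep="first")
first_match = df.merge(first_only, how="left", on="Keyword")

print(len(all_matches), len(first_match))
```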

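One common issue when using the function is that unwanted columns are duplicated. This occurs when the dataframes share multiple columns but you match on only one: Pandas keeps both copies of each remaining shared column and adds the suffixes _x and _y to tell them apart.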

To prevent this – and improve the accuracy of your matches – you can specify a list of columns:

mergeddf = df.merge(df2, how='left', on=['Keyword', 'Search Volume'])
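The difference is easiest to see by comparing the resulting column names. This sketch assumes two made-up dataframes that both contain “Keyword” and “Search Volume”:

```python
import pandas as pd

# Hypothetical dataframes sharing two columns: "Keyword" and "Search Volume".
df = pd.DataFrame({
    "Keyword": ["red shoes", "blue shoes"],
    "Search Volume": [1000, 500],
    "Position": [1, 4],
})
df2 = pd.DataFrame({
    "Keyword": ["red shoes", "blue shoes"],
    "Search Volume": [1000, 500],
    "URL": ["/red", "/blue"],
})

# Matching on one column duplicates the other shared column (_x and _y).
single_key = df.merge(df2, how="left", on="Keyword")

# Matching on both shared columns keeps a single copy of each.
multi_key = df.merge(df2, how="left", on=["Keyword", "Search Volume"])

print(list(single_key.columns))
print(list(multi_key.columns))
```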

In certain scenarios, you may actively want these columns to be included. For instance, when attempting to merge multiple monthly ranking reports:

mergeddf = (
    df.merge(df2, on='Keyword', how='left', suffixes=('', '_october'))
    .merge(df3, on='Keyword', how='left', suffixes=('', '_september'))
)

The above code snippet executes two merges to join together three dataframes with the same columns – which are our rankings for November, October, and September.

By labeling the months within the suffix parameters, we end up with a much cleaner dataframe that clearly displays the month, as opposed to the defaults of _x and _y seen in the earlier example.

Image: multiple merges in Python. Screenshot from VS Code, November 2022
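Once the months sit side by side, comparing them is a single subtraction. The following sketch assumes three tiny, made-up monthly reports; the dataframe names and the “Change vs October” column are hypothetical:

```python
import pandas as pd

# Hypothetical monthly ranking reports that share the same columns.
november = pd.DataFrame({"Keyword": ["red shoes", "blue shoes"], "Position": [1, 4]})
october = pd.DataFrame({"Keyword": ["red shoes", "blue shoes"], "Position": [3, 4]})
september = pd.DataFrame({"Keyword": ["red shoes"], "Position": [5]})

# Wrapping the chain in parentheses lets it span several lines.
merged = (
    november
    .merge(october, on="Keyword", how="left", suffixes=("", "_october"))
    .merge(september, on="Keyword", how="left", suffixes=("", "_september"))
)

# Positive values mean the keyword moved up between October and November.
merged["Change vs October"] = merged["Position_october"] - merged["Position"]

print(merged)
```

Because these are left joins, a keyword that is missing from an earlier report (here, “blue shoes” in September) simply gets an empty value for that month.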

COUNTIF/SUMIF/AVERAGEIF

In Excel, if you want to perform a statistical function based on a condition, you’re likely to use either COUNTIF, SUMIF, or AVERAGEIF.

Commonly, COUNTIF is used to determine how many times a specific string appears within a dataset, such as a URL.

We can accomplish this by declaring the ‘URL’ column as our range, then the URL within an individual cell as our criteria:

=COUNTIF(D:D,D2)
Image: COUNTIF in Excel. Screenshot from Microsoft Excel, November 2022

In Pandas, we can achieve the same outcome by using the groupby function:

df.groupby('URL')['URL'].count()
Image: groupby in Python. Screenshot from VS Code, November 2022

Here, the column declared within the round brackets indicates the individual groups, and the column listed in the square brackets is where the aggregation (i.e., the count) is performed.

The output we’re receiving isn’t perfect for this use case, though, because it’s consolidated the data.

Typically, when using Excel, we’d have the URL count inline within our dataset. Then we can use it to filter to the most frequently listed URLs.

To do this, use transform and store the output in a column:

df['URL Count'] = df.groupby('URL')['URL'].transform('count')
Image: groupby with transform in Python. Screenshot from VS Code, November 2022
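The contrast between the consolidated output and the inline output can be sketched with a few made-up rows (the keywords, URLs, and the threshold of two are assumptions):

```python
import pandas as pd

# Hypothetical ranking rows, for illustration only.
df = pd.DataFrame({
    "Keyword": ["k1", "k2", "k3", "k4"],
    "URL": ["/a", "/b", "/a", "/a"],
})

# Consolidated: one row per URL, like a summary table.
counts = df.groupby("URL")["URL"].count()

# Inline: the count is repeated on every row, like COUNTIF in a helper column.
df["URL Count"] = df.groupby("URL")["URL"].transform("count")

# The inline count can then be used as a filter.
frequent = df[df["URL Count"] >= 2]

print(counts)
print(frequent)
```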

You can also apply custom functions to groups of data by using a lambda (anonymous) function:

df['Google Count'] = df.groupby(['URL'])['URL'].transform(lambda x: x[x.str.contains('google')].count())

In our examples so far, we’ve been using the same column for our grouping and aggregations, but we don’t have to. Similarly to COUNTIFS/SUMIFS/AVERAGEIFS in Excel, it’s possible to group using one column, then apply our statistical function to another.

Going back to the earlier search engine results page (SERP) example, we may want to count all ranking PDPs on a per-keyword basis and return this number alongside our existing data:

df['PDP Count'] = df.groupby(['Keyword'])['URL'].transform(lambda x: x[x.str.contains('/product/|/prd/|/pd/')].count())
Image: groupby as COUNTIFS in Python. Screenshot from VS Code, November 2022

In Excel parlance, this would look something like this:

=SUM(COUNTIFS(A:A,[@Keyword],D:D,{"*/product/*","*/prd/*","*/pd/*"}))
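As an alternative to the lambda, the per-keyword PDP count can also be produced by building the True/False mask first and then summing it within each keyword group. This is a sketch of a different route to the same number, using made-up URLs, rather than the approach shown above:

```python
import pandas as pd

# Hypothetical SERP rows for two keywords.
df = pd.DataFrame({
    "Keyword": ["red shoes", "red shoes", "red shoes", "blue shoes", "blue shoes"],
    "URL": [
        "https://a.example/product/1",
        "https://b.example/blog/x",
        "https://c.example/pd/9",
        "https://a.example/category/blue",
        "https://d.example/prd/2",
    ],
})

# True where the URL looks like a product page.
is_pdp = df["URL"].str.contains("/product/|/prd/|/pd/")

# Summing booleans counts the True values within each keyword group.
df["PDP Count"] = is_pdp.groupby(df["Keyword"]).transform("sum")

print(df)
```

On large datasets this tends to be faster than a lambda, because the pattern match runs once over the whole column instead of once per group.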

Pivot Tables

Last, but by no means least, it’s time to talk pivot tables.

In Excel, a pivot table is likely to be our first port of call if we want to summarise a large dataset.

For instance, when working with ranking data, we may want to identify which URLs appear most frequently, and their average ranking position.

Image: pivot table in Excel. Screenshot from Microsoft Excel, November 2022

Again, Pandas has its own pivot table equivalent – but if all you want is a count of unique values within a column, this can be accomplished using the value_counts function:

count = df['URL'].value_counts()
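A short sketch of what value_counts returns, assuming a made-up URL column. The normalize parameter is a standard option that returns shares instead of raw counts:

```python
import pandas as pd

# Hypothetical URL column, for illustration only.
df = pd.DataFrame({"URL": ["/a", "/b", "/a", "/c", "/a", "/b"]})

# Counts are sorted from most to least frequent by default.
count = df["URL"].value_counts()

# The two most frequently ranking URLs.
top_two = count.head(2)

# normalize=True returns each URL's share of all rows instead of a count.
share = df["URL"].value_counts(normalize=True)

print(count)
print(share)
```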

Using groupby is also an option.

Earlier in the article, performing a groupby that aggregated our data wasn’t what we wanted – but it’s precisely what’s required here:

grouped = df.groupby('URL').agg(
     url_frequency=('Keyword', 'count'),
     avg_position=('Position', 'mean'),
     )

grouped.reset_index(inplace=True)
Image: groupby as a pivot table in Python. Screenshot from VS Code, November 2022
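Further aggregations can be added as extra named arguments. The sketch below assumes a small made-up dataset; the best_position and worst_position column names are hypothetical, and chaining reset_index and sort_values is just one way to arrange the output:

```python
import pandas as pd

# Hypothetical ranking data, for illustration only.
df = pd.DataFrame({
    "Keyword": ["red shoes", "blue shoes", "green shoes", "shoe care"],
    "URL": ["/shoes", "/shoes", "/shoes", "/blog/care"],
    "Position": [1, 4, 10, 2],
})

grouped = (
    df.groupby("URL")
    .agg(
        url_frequency=("Keyword", "count"),
        avg_position=("Position", "mean"),
        best_position=("Position", "min"),
        worst_position=("Position", "max"),
    )
    .reset_index()
    # Most frequently ranking URLs first, as in a sorted pivot table.
    .sort_values("url_frequency", ascending=False)
)

print(grouped)
```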

Two aggregate functions have been applied in the example above, but this could easily be expanded upon, as Pandas provides many more (such as sum, min, max, median, and nunique).

There are, of course, times when we do want to use pivot_table, such as when performing multi-dimensional operations.

To illustrate what this means, let’s reuse the ranking groupings we made using conditional statements and attempt to display the number of times a URL ranks within each group.

ranking_groupings = df.groupby(['URL', 'Grouping']).agg(
     url_frequency=('Keyword', 'count'),
     )
Image: groupby by URL and grouping in Python. Screenshot from VS Code, November 2022

This isn’t the best format to use, as multiple rows have been created for each URL.

Instead, we can use pivot_table, which will display the data in different columns:

pivot = pd.pivot_table(
    df,
    index=['URL'],
    columns=['Grouping'],
    aggfunc="size",
    fill_value=0,
)
Image: pivot table in Python. Screenshot from VS Code, November 2022
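For simple frequency tables like this one, pd.crosstab should give the same counts with less typing. The sketch below compares the two on a made-up dataset (the URLs and groupings are assumptions):

```python
import pandas as pd

# Hypothetical rows: each keyword's ranking URL and its ranking group.
df = pd.DataFrame({
    "Keyword": ["k1", "k2", "k3", "k4", "k5"],
    "URL": ["/a", "/a", "/a", "/b", "/b"],
    "Grouping": ["Top 3", "Top 3", "Top 10", "Top 10", "Top 20"],
})

# One row per URL, one column per grouping, cells hold the row counts.
pivot = pd.pivot_table(
    df,
    index="URL",
    columns="Grouping",
    aggfunc="size",
    fill_value=0,
)

# crosstab builds the same frequency table directly from two columns.
crosstab = pd.crosstab(df["URL"], df["Grouping"])

print(pivot)
print(crosstab)
```

Note that the grouping columns are sorted alphabetically, so “Top 10” appears before “Top 3”. If the order matters, the columns can be rearranged afterwards, for example with pivot[['Top 3', 'Top 10', 'Top 20']].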

Final Thoughts

Whether you’re looking for inspiration to start learning Python, or are already leveraging it in your SEO workflows, I hope that the above examples help you along on your journey.

As promised, you can find a Google Colab notebook with all of the code snippets here.

In truth, we’ve barely scratched the surface of what’s possible, but understanding the basics of Python data analysis will give you a solid base upon which to build.

Featured Image: mapo_japan/Shutterstock


