
Best AI Search Engines To Try Right Now

AI-powered search engines are a new breed redefining the search experience as we know it.

And when we talk about AI-powered search engines, Bing and Google SGE (Search Generative Experience) are currently the two that rise to the top.

For some time, they have been the most popular and widely recognizable names in AI search engines – and, as such, the ones that get the most attention.

But as with most things, the landscape is far from stagnant. Today, there are many other AI search engines out there that are just as useful as Bing and Google – and, in some ways, even better.

From privacy-focused search engines to those that prioritize publisher sourcing, we have curated a list of the six best AI search engines that exist right now and what you need to know about them.

Notably, at least one of them boasts a paid version with such a generous number of queries per day that some argue it surpasses the offerings of OpenAI’s ChatGPT Plus.

Let’s dive in.

1. Andi Search

Andi Search is a startup AI search engine that offers its own interpretation of a better way to explore the internet and find information.

After using Andi for a while, one gets the sense that there really is a better way to present information.

It also becomes apparent that Bing and Google SGE haven’t strayed far from the old 10 blue links paradigm.

There are three things that make Andi stand out:

  • The interface uses AI throughout the entire search results page, not just at the top of the page the way Bing and Google SGE do.
  • Images, summaries, and options are offered in a way that makes sense contextually.
  • All on-page elements work together to communicate the information users are seeking.

Andi Is More Than A Text-Based Search Engine

Humans are highly visual, and Andi does a good job of presenting information not just with text but with images, too.

Using Andi, it becomes apparent that both Bing and Google SGE are traditional search engines with an AI chatbox at the top of the page.

What Is Andi AI Search?

Andi is a factually grounded AI search engine that was purposely created to offer trustworthy answers while avoiding hallucinations that are common to GPT-based apps.

It offers keyword search results, answers complex multi-part questions, and fully understands natural language input.

Technology Used By Andi

Andi Search uses a mix of technologies.

In a 2022 Q&A, Andi Search engineers said they use several commercial and open-source large language models (LLMs), knowledge graphs, and results from Google, Bing, and other search engines (used about 50% of the time in 2022).

Andi AI Search Results

I asked Andi a complex question:

"Please tell me about the fictional star wars character Ahsoka and also explain if she is one of the most skilled Jedi of all time."

The answer was in the form of a short summary and a link to the source of the information, with an option to show the full search results or to summarize an answer to the question.

The summary provided correctly answered my complex question, even the part about whether the fictional character Ahsoka Tano is the most skilled Jedi of all time.

Search Results From Websites

On a desktop, the right-hand side contains a panel with the search results.

The results are not in the form of a standard 10 blue links, but rather, they consist of the featured image with text from the webpage.

Andi is currently in a Beta testing stage, but it is freely available to use – no need to sign up or have an account.

Andi Search And Privacy

Andi is a privacy-first AI company. It doesn’t store cookies, doesn’t share data, and no information is available to any employee of the company.

It even blocks Google’s FLoC tracking technology so that Google can’t follow you onto Andi.

Controversial Feature

Andi is a fine search engine in many ways, but there is one feature called Reader that publishers may not appreciate.

Screenshot Of Andi Reader Button

Screenshot from Andi Reader, September 2023

Clicking on the Read button reveals the entire webpage for users to read without visiting the website.

Below is a screenshot that shows how Andi publishes a snapshot of the webpage (content blurred by me):

Screenshot Of Andi Search Reader

Screenshot from Andi Reader, September 2023

Summary Of Andi Search

Andi is truly a rethink of how the search engines of today should function. It encourages users to rediscover the best that the web has to offer.

On the other hand, the engineers behind Andi may want to consider that publishers and search engines are part of an ecosystem and depend on each other.

2. Metaphor AI Search

Many AI startups are envisioning different ways to surface internet data by leveraging the power of AI. That approach does away with traditional crawlers and indexers.

Metaphor is an example of the out-of-the-box use of large language models.

A Q&A on Y Combinator/Hacker News reveals that the engineer behind Metaphor uses what he calls next link prediction.

The intuition underlying the approach is that training LLMs and indexing websites are somewhat similar.

What they did was create a model architecture that had the concept of links baked in.

An interesting feature of Metaphor is that growing the index of sites doesn’t require retraining the entire language model. It’s simply a matter of adding the additional data.

How Metaphor AI Search Works

Approaching Metaphor, it’s important to keep in mind that this isn’t a traditional style search engine.

It’s surfacing links.

Furthermore, users can select what kinds of links to show.

The user-selectable categories are:

  • Tweets.
  • Wiki.
  • News.
  • Research Papers.
  • Podcasts.
  • Events.
  • Blogs.
  • GitHub.
  • Companies.
  • Recipes.
  • All of the Above.

Metaphor Search Results

Searching for recipes shows results that are different from Google and Bing, in a good way.

I searched for authentic Spanish rice recipes as well as authentic mujadara recipes, and Metaphor surfaced links to websites with authentic recipes.

The quality of the sites was different from the dumbed-down and inauthentic recipes sometimes shown on Google.

Searches on Metaphor sometimes don’t generate what you’re looking for. For example, searching for SEO in the News category yielded irrelevant results.

Summary Of Metaphor

Metaphor is worth giving a try because it may be useful for certain kinds of searches.

It’s not a general search engine, and it doesn’t claim to be. It’s something different, and that can be refreshing sometimes.

Nevertheless, Metaphor is still in the early stages of development, and it shows in some searches.

3. Brave AI Search Summarizer

Screenshot from Brave, September 2023

Brave Search is a privacy-first search engine.

AI is deployed in a way that complements search and does not attempt to be a chatbot – it simply serves the goal of offering information.

Brave uses its own LLMs to assess the ranked webpages and offer a summarization. This function is called the Summarizer, which users can opt out of if they wish.

The Summarizer isn’t invoked for every search; only about 17% of searches will trigger the feature.

What’s great about the Summarizer is that it links to sources.

The screenshot below has the links circled in red.

Screenshot Of Brave Summarizer

Screenshot from Brave Summarizer, September 2023

Another use of AI is to generate webpage descriptions so that users can read a brief of what’s contained on the webpage.

The technology powering Brave consists of three LLMs trained for:

  • Question answering and improving search results.
  • Classification to weed out undesirable search results.
  • Summarizing and paraphrasing to add the final touches.

Brave Search Language Models

Brave Search uses two open-source language models that are available at Hugging Face.

The first language model is called BART (not to be confused with BERT, which is something different).

The second language model used by Brave is DeBERTa, which is a hybrid model based on Google’s BERT and Meta’s RoBERTa. DeBERTa is said to be an improvement over both BERT and RoBERTa.

Brave Search does not use Bing or Google search; it has its own webpage index.

Brave Search Summary

Brave Search is perfect for users who want a search engine that respects privacy, is easy to use, and is useful.

4. YOU AI Search Engine

YOU is an AI search engine that combines a large language model with up-to-date citations to websites, which makes it more than just a search engine.

You.com calls its assistant YouChat, a search assistant that’s built into the search engine.

Notable about YouChat is that, while it’s a privacy-first search assistant, it claims that it also gets better the more you use it.

Another outstanding feature is that YouChat can respond to the latest news and recent events.

YouChat can write code, summarize complex topics, generate images, and create content (in any language).

You.com features YouAgent, an AI agent that writes code, runs it in its own environment, and then takes further action based on the output.

It’s available at You.com, as an app for Android and iPhone, and as a Chrome extension.

All versions of You.com respect privacy and do not sell user data to advertisers.

The web version of You.com answers questions with summaries drawn from websites, which are linked from the answer.

You.com SERPs With Links To Websites

There are traditional search results in a right-hand panel, which consists of links to videos and websites.

Links To Webpages & Videos On Side Panel

YOU is available in a free and paid version.

The free version of You offers unlimited chat-based searches. It also provides a limited number of AI image and writing generations (ten each).

The paid versions, You Pro at $9.99/month and YouPro for Educational at $6.99/month, offer unlimited AI image and writing generations, personalized machine learning, and priority access.

Subscriptions are available at a lower price when paid on a yearly basis.

You.com Summary

You.com is a unique personalized search destination tuned to help users not just research topics but get things done.

The AI search engine answers questions in natural language while also citing links to websites and videos that offer comprehensive coverage of the topic.

You.com also provides chat-based AI tools that are capable of taking that research and creating something new with it.

5. Phind.com

Phind calls itself a generative AI search engine for developers.

On August 28, 2023, it announced a new LLM called Phind-CodeLlama-34B-v2 that outperforms GPT-4 on a benchmark called HumanEval.

The technology underpinning Phind is a serious contender.

While it self-describes as AI search for developers, it does a great job of surfacing answers from trustworthy websites.

A drop-down menu next to the search box allows users to choose between GPT-3.5 Turbo and the Phind Model (unlimited uses), as well as limited use of GPT-4.

Phind can answer complex questions such as, “How did Facebook become so popular?” or “What marketing lessons can be learned from how Shein promotes itself?” It can also respond to follow-up questions.

But the search results aren’t always perfect.

A query for “What is a Google-friendly way to build links to a website?” resulted in several valid suggestions but also the recommendation of guest posting for links.

A follow-up question resulted in a response that backtracked on the guest posting suggestion.

Summary Of Phind

Phind is a fine way to interact with a GPT-3.5-based search engine that can also output essays and solve other problems.

Users can sign in to save their chats.

6. Perplexity.ai

Perplexity.ai is a self-described answer engine.

It offers up-to-date, natural language answers to questions, including questions about the weather.

Perplexity is powered by LLMs, Internet data, and its updated version of Google’s PageRank, which creates a link graph of trustworthy websites.

Perplexity.ai is a serious contender in AI Search.

It was founded by people with links to DeepMind and OpenAI and is funded by leading Silicon Valley investors.

The search engine itself offers in-depth answers and suggests follow-up questions to dig even deeper.

Perplexity is even able to mix different kinds of information and come up with perfect answers.

For example, I asked it to write an article about SEO that corresponds to Ernest Hemingway’s writing style.

"Please create an informational article about search engine optimization, that is done in the style of Ernest Hemingway"

It offered suggestions like:

“Keep it simple: Hemingway’s writing style is known for its simplicity and clarity.

Similarly, when it comes to SEO, it’s important to keep your content simple and easy to understand.

Avoid using overly technical jargon or complex sentence structures that might confuse your readers.

…Hemingway’s writing style can be a useful guide for creating SEO-friendly content. By focusing on simplicity, readability, and clarity, you can create content that is both engaging and optimized for search engines.”

It’s not perfect, though.

I asked it about former Googler Matt Cutts, and it provided a comprehensive answer that also cited Matt as the founder of an SEO consultancy.

The wrong answer came from a LinkedIn page where the name of the consultancy was found in a sidebar containing people who are connected to him on LinkedIn.

What I think may have happened is that the Perplexity.ai crawler may not yet be able to separate the main content of a webpage from the rest of the content.

I asked it the link building question, “What is a Google-friendly way to build links to a website?” and it gave a reasonable answer.

A useful feature, as mentioned earlier, is the suggested follow-up topics. Perplexity labels them as “Related.”

The related topics offered for the link building question were:

  • What are some effective link building strategies for SEO?
  • How to find high-quality websites to link to your site?
  • What are some common mistakes to avoid when building links for SEO?

Screenshot Of Related Topics Suggestions

Screenshot from Perplexity.ai, September 2023

Perplexity.ai Is Publisher-Friendly

Something that should be mentioned is that Perplexity.ai is publisher-friendly.

It does a great job of linking to the websites from which the answers were sourced.

Screenshot from Perplexity.ai, September 2023

Perplexity.ai Summary

Perplexity is much more than a search engine; it’s a true answer engine.

It reimagines what question answering can be and does a great job of providing answers and also encouraging exploration with suggestions for related topics.

AI Search Engine Future Is Now

It’s been decades since users had such a vast choice in search engines.

There have never been so many viable alternatives to Google as there are today.

Generative AI is creating new ways to discover accurate and up-to-date information.

The “organic” 10 blue links are increasingly becoming a thing of the past.

Give some of these free AI search engines a try, because many are every bit as good as the top two.

Featured image by Shutterstock/SvetaZi

Google Warns Against Over-Reliance On SEO Tool Metrics

In a recent discussion on Reddit’s r/SEO forum, Google’s Search Advocate, John Mueller, cautioned against relying too heavily on third-party SEO metrics.

His comments came in response to a person’s concerns about dramatic changes in tool measurements and their perceived impact on search performance.

The conversation was sparked by a website owner who reported the following series of events:

  1. A 50% drop in their website’s Domain Authority (DA) score.
  2. A surge in spam backlinks, with 75% of all their website’s links acquired in the current year.
  3. An increase in spam comments, averaging 30 per day on a site receiving about 150 daily visits.
  4. A discrepancy between backlink data shown in different SEO tools.

The owner, who claimed never to have purchased links, is concerned about the impact of these spammy links on their site’s performance.

Mueller’s Perspective On Third-Party Metrics

Mueller addressed these concerns by highlighting the limitations of third-party SEO tools and their metrics.

He stated:

“Many SEO tools have their own metrics that are tempting to optimize for (because you see a number), but ultimately, there’s no shortcut.”

He cautioned against implementing quick fixes based on these metrics, describing many of these tactics as “smoke & mirrors.”

Mueller highlighted a crucial point: the metrics provided by SEO tools don’t directly correlate with how search engines evaluate websites.

He noted that actions like using disavow files don’t affect metrics from SEO tools, as these companies don’t have access to Google data.

This highlights the need to understand the sources and limitations of SEO tool data. Their metrics aren’t direct indicators of search engine rankings.

What To Focus On? Value, Not Numbers

Mueller suggested a holistic SEO approach, prioritizing unique value over specific metrics like Domain Authority or spam scores.

He advised:

“If you want to think about the long term, finding ways to add real value that’s unique and wanted by people on the web (together with all the usual SEO best practices as a foundation) is a good target.”

However, Mueller acknowledged that creating unique content isn’t easy, adding:

“Unique doesn’t mean a unique combination of words, but really something that nobody else is providing, and ideally, that others can’t easily provide themselves.

It’s hard, it takes a lot of work, and it can take a lot of time. If it were fast & easy, others would be – and probably are already – doing it and have more practice at it.”

Mueller’s insights encourage us to focus on what really matters: strategies that put users first.

This helps align content with Google’s goals and create lasting benefits.

Key Takeaways

  1. While potentially useful, third-party SEO metrics shouldn’t be the primary focus of optimization efforts.
  2. Dramatic changes in these metrics don’t reflect changes in how search engines view your site.
  3. Focus on creating unique content rather than chasing tool-based metrics.
  4. Understand the limitations and sources of SEO tool data.

Featured Image: JHVEPhoto/Shutterstock

A Guide To Robots.txt: Best Practices For SEO

Understanding how to use the robots.txt file is crucial for any website’s SEO strategy. Mistakes in this file can impact how your website is crawled and your pages’ search appearance. Getting it right, on the other hand, can improve crawling efficiency and mitigate crawling issues.

Google recently reminded website owners about the importance of using robots.txt to block unnecessary URLs.

Those include add-to-cart, login, or checkout pages. But the question is – how do you use it properly?

In this article, we will guide you through every nuance of how to do just that.

What Is Robots.txt?

Robots.txt is a simple text file that sits in the root directory of your site and tells crawlers what should be crawled.

Here is a quick reference to the key robots.txt directives:

  • User-agent: Specifies which crawler the rules apply to (see user agent tokens). Using * targets all crawlers.
  • Disallow: Prevents the specified URLs from being crawled.
  • Allow: Allows specific URLs to be crawled, even if a parent directory is disallowed.
  • Sitemap: Indicates the location of your XML sitemap, helping search engines discover it.

This is an example of robots.txt from ikea.com with multiple rules.

Example of robots.txt from ikea.com

Note that robots.txt doesn’t support full regular expressions and only has two wildcards:

  • Asterisk (*), which matches 0 or more sequences of characters.
  • Dollar sign ($), which matches the end of a URL.

Also, note that its rules are case-sensitive, e.g., “filter=” isn’t equal to “Filter=.”
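
For illustration, here is a hypothetical rule set (the parameter and file type are examples only) that uses both wildcards:

User-agent: *
Disallow: *color=*
Disallow: /*.xls$

The first rule blocks any URL containing “color=” (a URL containing “Color=” would not match, since matching is case-sensitive), and the second blocks any URL whose path ends with .xls.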

Order Of Precedence In Robots.txt

When setting up a robots.txt file, it’s important to know the order in which search engines decide which rules to apply in case of conflicting rules.

They follow these two key rules:

1. Most Specific Rule

The rule that matches more characters in the URL will be applied. For example:

User-agent: *
Disallow: /downloads/
Allow: /downloads/free/

In this case, the “Allow: /downloads/free/” rule is more specific than “Disallow: /downloads/” because it targets a subdirectory.

Google will allow crawling of subfolder “/downloads/free/” but block everything else under “/downloads/.”

2. Least Restrictive Rule

When multiple rules are equally specific, for example:

User-agent: *
Disallow: /downloads/
Allow: /downloads/

Google will choose the least restrictive one. This means Google will allow access to /downloads/.

Why Is Robots.txt Important In SEO?

Blocking unimportant pages with robots.txt helps Googlebot focus its crawl budget on valuable parts of the website and on crawling new pages. It also helps search engines save computing power, contributing to better sustainability.

Imagine you have an online store with hundreds of thousands of pages. There are sections of the website, like filtered pages, that may have a nearly infinite number of versions.

Those pages don’t have unique value, essentially contain duplicate content, and may create infinite crawl space, thus wasting your server and Googlebot’s resources.

That is where robots.txt comes in, preventing search engine bots from crawling those pages.

If you don’t do that, Google may try to crawl an infinite number of URLs with different (even non-existent) search parameter values, causing crawl spikes and wasting crawl budget.

When To Use Robots.txt

As a general rule, you should always ask why certain pages exist, and whether they have anything worth crawling and indexing for search engines.

If we start from this principle, we should certainly always block:

  • URLs that contain query parameters such as:
    • Internal search.
    • Faceted navigation URLs created by filtering or sorting options if they are not part of URL structure and SEO strategy.
    • Action URLs like add to wishlist or add to cart.
  • Private parts of the website, like login pages.
  • JavaScript files not relevant to website content or rendering, such as tracking scripts.
  • Scrapers and AI chatbots, to prevent them from using your content for their training purposes.

Let’s dive into examples of how you can use robots.txt for each case.

1. Block Internal Search Pages

The most common and absolutely necessary step is to block internal search URLs from being crawled by Google and other search engines, as almost every website has an internal search functionality.

On WordPress websites, it is usually an “s” parameter, and the URL looks like this:

https://www.example.com/?s=google

Gary Illyes from Google has repeatedly recommended blocking “action” URLs, as Googlebot can crawl them indefinitely, trying even non-existent URLs with different parameter combinations.

Here is the rule you can use in your robots.txt to block such URLs from being crawled:

User-agent: *
Disallow: *s=*
  1. The User-agent: * line specifies that the rule applies to all web crawlers, including Googlebot, Bingbot, etc.
  2. The Disallow: *s=* line tells all crawlers not to crawl any URLs that contain the query parameter “s=.” The wildcard “*” means it can match any sequence of characters before or after “s=.” However, it will not match URLs with uppercase “S” like “/?S=” since it is case-sensitive.

Here is an example of a website that managed to drastically reduce the crawling of non-existent internal search URLs after blocking them via robots.txt.

Screenshot from crawl stats report

Note that Google may index those blocked pages, but you don’t need to worry about them as they will be dropped over time.

2. Block Faceted Navigation URLs

Faceted navigation is an integral part of every ecommerce website. There can be cases where faceted navigation is part of an SEO strategy and aimed at ranking for general product searches.

For example, Zalando uses faceted navigation URLs for color options to rank for general product keywords like “gray t-shirt.”

However, in most cases, this is not the case, and filter parameters are used merely for filtering products, creating dozens of pages with duplicate content.

Technically, those parameters are no different from internal search parameters, with one difference: there may be multiple parameters, and you need to make sure you disallow all of them.

For example, if you have filters with the following parameters “sortby,” “color,” and “price,” you may use this set of rules:

User-agent: *
Disallow: *sortby=*
Disallow: *color=*
Disallow: *price=*

Based on your specific case, there may be more parameters, and you may need to add all of them.

What About UTM Parameters?

UTM parameters are used for tracking purposes.

As John Mueller stated in his Reddit post, you don’t need to worry about URL parameters that link to your pages externally.

John Mueller on UTM parameters

Just make sure to block any random parameters you use internally, and avoid linking internally to those pages, e.g., linking from your article pages to your search page with a search query such as “https://www.example.com/?s=google.”

3. Block PDF URLs

Let’s say you have a lot of PDF documents, such as product guides, brochures, or downloadable papers, and you don’t want them crawled.

Here is a simple robots.txt rule that will block search engine bots from accessing those documents:

User-agent: *
Disallow: /*.pdf$

The “Disallow: /*.pdf$” line tells crawlers not to crawl any URLs that end with .pdf.

By using /*, the rule matches any path on the website. As a result, any URL ending with .pdf will be blocked from crawling.

If you have a WordPress website and want to disallow PDFs from the uploads directory where you upload them via the CMS, you can use the following rule:

User-agent: *
Disallow: /wp-content/uploads/*.pdf$
Allow: /wp-content/uploads/2024/09/allowed-document.pdf$

You can see that we have conflicting rules here.

In case of conflicting rules, the more specific one takes priority, which means the last line ensures that only the specific file located in folder “wp-content/uploads/2024/09/allowed-document.pdf” is allowed to be crawled.

4. Block A Directory

Let’s say you have an API endpoint to which you submit your form data. Your form likely has an action attribute like action=”/form/submissions/.”

The issue is that Google will try to crawl that URL, /form/submissions/, which you likely don’t want. You can block these URLs from being crawled with this rule:

User-agent: *
Disallow: /form/

By specifying a directory in the Disallow rule, you are telling the crawlers to avoid crawling all pages under that directory, and you don’t need to use the (*) wildcard anymore, like “/form/*.”

Note that you must always specify relative paths and never absolute URLs, like “https://www.example.com/form/” for Disallow and Allow directives.

Be cautious to avoid malformed rules. For example, using /form without a trailing slash will also match a page /form-design-examples/, which may be a page on your blog that you want to index.
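
To make the difference concrete, compare these two rules (the paths are only illustrative):

# Malformed: also matches pages like /form-design-examples/
Disallow: /form
# Correct: matches only URLs under the /form/ directory
Disallow: /form/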

Read: 8 Common Robots.txt Issues And How To Fix Them

5. Block User Account URLs

If you have an ecommerce website, you likely have directories that start with “/myaccount/,” such as “/myaccount/orders/” or “/myaccount/profile/.”

With the top page “/myaccount/” being a sign-in page that you want to be indexed and found by users in search, you may want to disallow the subpages from being crawled by Googlebot.

You can use the Disallow rule in combination with the Allow rule to block everything under the “/myaccount/” directory (except the /myaccount/ page).

User-agent: *
Disallow: /myaccount/
Allow: /myaccount/$


And again, since Google uses the most specific rule, it will disallow everything under the /myaccount/ directory but allow only the /myaccount/ page to be crawled.

Here’s another use case of combining the Disallow and Allow rules: when you have your search page under the /search/ directory and want it to be found and indexed, but want to block the actual search result URLs:

User-agent: *
Disallow: /search/
Allow: /search/$

6. Block Non-Render Related JavaScript Files

Every website uses JavaScript, and many of these scripts are not related to the rendering of content, such as tracking scripts or those used for loading AdSense.

Googlebot can crawl and render a website’s content without these scripts. Therefore, blocking them is safe and recommended, as it saves requests and resources to fetch and parse them.

Below is a sample rule disallowing a JavaScript file that contains tracking pixels.

User-agent: *
Disallow: /assets/js/pixels.js

7. Block AI Chatbots And Scrapers

Many publishers are concerned that their content is being unfairly used to train AI models without their consent, and they wish to prevent this. The robots.txt rules below block a list of well-known AI crawlers and scrapers:

#ai chatbots
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: Bytespider
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Applebot-Extended
User-agent: Diffbot
Disallow: /
#scrapers
User-agent: Scrapy
User-agent: magpie-crawler
User-agent: CCBot
User-agent: omgili
User-agent: omgilibot
User-agent: Node/simplecrawler
Disallow: /

Here, each user agent is listed individually, and the rule Disallow: / tells those bots not to crawl any part of the site.

This, besides preventing AI training on your content, can help reduce the load on your server by minimizing unnecessary crawling.

For ideas on which bots to block, you may want to check your server log files to see which crawlers are exhausting your servers. And remember, robots.txt doesn’t prevent unauthorized access; it only asks compliant bots not to crawl.

8. Specify Sitemaps URLs

Including your sitemap URL in the robots.txt file helps search engines easily discover all the important pages on your website. This is done by adding a specific line that points to your sitemap location, and you can specify multiple sitemaps, each on its own line.

Sitemap: https://www.example.com/sitemap/articles.xml
Sitemap: https://www.example.com/sitemap/news.xml
Sitemap: https://www.example.com/sitemap/video.xml

Unlike Allow or Disallow rules, which allow only a relative path, the Sitemap directive requires a full, absolute URL to indicate the location of the sitemap.

Ensure the sitemaps’ URLs are accessible to search engines and have proper syntax to avoid errors.

Sitemap fetch error in Search Console

9. When To Use Crawl-Delay

The crawl-delay directive in robots.txt specifies the number of seconds a bot should wait before crawling the next page. While Googlebot does not recognize the crawl-delay directive, other bots may respect it.

It helps prevent server overload by controlling how frequently bots crawl your site.

For example, if you want ClaudeBot to crawl your content for AI training but want to avoid server overload, you can set a crawl delay to manage the interval between requests.

User-agent: ClaudeBot
Crawl-delay: 60

This instructs the ClaudeBot user agent to wait 60 seconds between requests when crawling the website.

Of course, there may be AI bots that don’t respect crawl delay directives. In that case, you may need to use a web firewall to rate limit them.

Troubleshooting Robots.txt

Once you’ve composed your robots.txt, you can use these tools to check whether the syntax is correct and whether you accidentally blocked an important URL.

1. Google Search Console Robots.txt Validator

Once you’ve updated your robots.txt, you must check whether it contains any error or accidentally blocks URLs you want to be crawled, such as resources, images, or website sections.

Navigate to Settings > robots.txt, and you will find the built-in robots.txt validator.

2. Google Robots.txt Parser

This is Google’s official open-source robots.txt parser, which is used in Search Console.

It requires advanced skills to install and run on your local computer, but it is highly recommended to take the time to do it as instructed on that page, because it lets you validate your robots.txt changes against the official Google parser before uploading them to your server.
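
For a quick local sanity check before reaching for Search Console or Google’s parser, you can also use Python’s standard urllib.robotparser module. This is only a minimal sketch: the standard-library parser implements the original robots exclusion protocol, does not interpret Google’s wildcard (*) and end-of-URL ($) extensions or longest-match precedence, and the rules and URLs below are placeholders.

from urllib.robotparser import RobotFileParser

# Placeholder rules to test (no wildcards, since urllib.robotparser does not interpret them)
robots_txt = """User-agent: *
Disallow: /form/
Disallow: /myaccount/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Placeholder URLs to check against the rules above
test_urls = [
    "https://www.example.com/form/submissions/",
    "https://www.example.com/myaccount/orders/",
    "https://www.example.com/blog/robots-txt-guide/",
]

for url in test_urls:
    allowed = parser.can_fetch("Googlebot", url)
    print(("ALLOWED" if allowed else "BLOCKED") + ": " + url)

For authoritative results, especially with wildcard rules, rely on the Search Console report or Google’s open-source parser.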

Centralized Robots.txt Management

Each domain and subdomain must have its own robots.txt, as Googlebot doesn’t recognize root domain robots.txt for a subdomain.

It creates challenges when you have a website with a dozen subdomains, as it means you have to maintain many separate robots.txt files.

However, it is possible to host a robots.txt file on a subdomain, such as https://cdn.example.com/robots.txt, and set up a redirect from https://www.example.com/robots.txt to it.

You can also do the reverse: host the file only under the root domain and redirect robots.txt requests from subdomains to the root.

Search engines will treat the redirected file as if it were located on the root domain. This approach allows centralized management of robots.txt rules for both your main domain and subdomains.

It helps make updates and maintenance more efficient. Otherwise, you would need to use a separate robots.txt file for each subdomain.
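
To confirm the redirect is actually in place, you can fetch the root domain’s robots.txt and inspect the redirect chain. Below is a minimal sketch using the third-party requests library, with example.com standing in for your own domain:

import requests

# Fetch the root robots.txt and follow redirects (placeholder domain)
response = requests.get("https://www.example.com/robots.txt", allow_redirects=True, timeout=10)

# Print each hop in the redirect chain, e.g. a 301 pointing to the CDN-hosted file
for hop in response.history:
    print(hop.status_code, "->", hop.headers.get("Location"))

# The final URL should be the centrally hosted file, and the body should contain its rules
print("Final URL:", response.url)
print(response.text[:200])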

Conclusion

A properly optimized robots.txt file is crucial for managing a website’s crawl budget. It ensures that search engines like Googlebot spend their time on valuable pages rather than wasting resources on unnecessary ones.

On the other hand, blocking AI bots and scrapers using robots.txt can significantly reduce server load and save computing resources.

Make sure you always validate your changes to avoid unexpected crawlability issues.

However, remember that while blocking unimportant resources via robots.txt may help increase crawl efficiency, the main factors affecting crawl budget are high-quality content and page loading speed.

Happy crawling!

Featured Image: BestForBest/Shutterstock

Google Search Has A New Boss: Prabhakar Raghavan Steps Down

Google has announced that Prabhakar Raghavan, the executive overseeing the company’s search engine and advertising products, will be stepping down from his current role.

The news came on Thursday in a memo from CEO Sundar Pichai to staff.

Nick Fox To Lead Search & Ads

Taking over Raghavan’s responsibilities will be Nick Fox, a longtime Google executive with experience across various departments.

Fox will now lead the Knowledge & Information team, which includes Google’s Search, Ads, Geo, and Commerce products.

Pichai expressed confidence in Fox’s ability to lead these crucial divisions, noting:

“Throughout his career, Nick has demonstrated leadership across nearly every facet of Knowledge & Information, from Product and Design in Search and Assistant, to our Shopping, Travel, and Payments products.”

Raghavan’s New Role

Raghavan will transition to the newly created position of Chief Technologist.

He will work closely with Pichai and other Google leaders in this role to provide technical direction.

Pichai praised Raghavan’s contributions, stating:

“Prabhakar’s leadership journey at Google has been remarkable, spanning Research, Workspace, Ads, and Knowledge & Information. He led the Gmail team in launching Smart Reply and Smart Compose as early examples of using AI to improve products, and took Gmail and Drive past 1 billion users.”

Past Criticisms

This recent announcement from Google comes in the wake of earlier criticisms leveled at the company’s search division.

In April, an opinion piece from Ed Zitron highlighted concerns about the direction of Google Search under Raghavan’s leadership.

The article cited industry analysts who claimed that Raghavan’s background in advertising, rather than search technology, had led to decisions prioritizing revenue over search quality.

Critics alleged that under Raghavan’s tenure, Google had rolled back key quality improvements to boost engagement metrics and ad revenue.

Internal emails from 2019 were referenced. They described a “Code Yellow” emergency response to lagging search revenues when Raghavan was head of Ads. This reportedly resulted in boosting sites previously downranked for using spam tactics.

Google has disputed many of these claims, maintaining that its advertising systems do not influence organic search results.

More Restructuring

As part of Google’s restructuring:

  1. The Gemini app team, led by Sissie Hsiao, will join Google DeepMind under CEO Demis Hassabis.
  2. Google Assistant teams focused on devices and home experiences will move to the Platforms & Devices division.

Looking Ahead

Fox’s takeover from Raghavan could shake things up at Google.

We may see faster AI rollouts in search and ads, plus more frequent updates. Fox might revisit core search quality, addressing recent criticisms.

Fox might push for quicker adoption of new tech to fend off competitors, especially in AI. He’s also likely to be more savvy about regulatory issues.

It’s important to note that these potential changes are speculative based on the limited information available.

The actual changes in leadership style and priorities will become clearer as Fox settles into his new role.


Featured Image: One Artist/Shutterstock
