What is Crawl Budget in SEO? How to Optimize it?


In SEO, especially on large sites, publishing new articles or important pages doesn’t mean all of those URLs will be indexed, let alone ranked.

So how can you understand the indexing process and, ideally, reduce the time Google takes to index your pages?

That’s where Googlebot comes in, together with the concept of crawl budget.

What is the Crawl Budget?

Crawl budget is the average number of URLs Googlebot will crawl on your site before leaving. Google uses two main indexing crawlers: Googlebot Desktop and Googlebot Smartphone (the latter associated with mobile-first indexing).


So, crawl budget optimization ensures that Googlebot isn’t wasting time crawling your unimportant pages (such as URLs auto-generated by filters or search boxes) at the risk of ignoring your important pages (articles, landing pages, and so on).

What affects the Crawl Budget process?

A website’s crawl budget is determined by several factors. The two main ones are the crawl rate limit and crawl demand, that is, how much the crawler wants your content. The size of the site also plays a part: smaller sites generally get a smaller budget and larger sites a larger one.

Let’s look at several factors that affect crawl budget positively or negatively:

  • Crawl Demand (it depends): An estimate of how much Google wants to crawl your pages, based on how popular they are and how stale their content is in Google’s index.
  • Crawl Rate Limit (positive): The crawl rate limit defines the maximum fetching rate for a given site and how long Googlebot must wait between fetches.
  • Crawl Health (it depends): As a website gets more popular and receives more visitors, it may need to upgrade its server capacity to cater to the increased demand. This is known as crawl health, and it matters because Googlebot needs to crawl and index the site frequently to keep up with the pace of new content.
  • Faceted navigation (negative): Filtering products by color, price range, size, and so on, as well as pagination, is not search-friendly because it can auto-generate a bunch of unnecessary URLs. If those URLs carry no canonical link back to the main page, they will cause duplicate content (see the canonical sketch after this list). Even worse, Googlebot gets confused and has no clue which URLs are important.
  • Session identifiers (session IDs) (negative): Parameters auto-generated for users or by a tracking setup.
  • Soft error pages (soft 404s) (negative): A “soft 404” is a response from a web server that returns 200 OK even though the requested page is not found. This can limit a site’s crawl coverage because search engines may index those duplicate URLs instead of URLs with unique content.
  • Hacked pages (negative): When your site gets hacked, URLs may start returning 404 errors, which clearly hurts crawling.
  • Infinite spaces (negative): Infinite spaces are large numbers of links that provide little new content for Googlebot to index. Failing to remove them may make it harder for Googlebot to crawl your site and could result in your site being partially or completely omitted from Google’s search results.
  • Low-quality content (negative): Suggests that your content lacks utility or value for Google or its readers. Search engines want to surface quality content with real value, so unhelpful or bland content works against you.
  • Proxies (it depends).
  • Sitemap (it depends): XML files that list a website’s URLs and how often each one changes. A sitemap lets webmasters signal the relative importance of each page on their site, so Googlebot can navigate and crawl the listed URLs more purposefully.
  • Robots.txt (it depends): A robots.txt file is a special text file that helps manage crawler traffic to a website by telling crawlers which files (web pages, media files, resource files) or directories should and shouldn’t be crawled.
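For the faceted-navigation case above, the usual fix is a rel="canonical" link that points the filtered variants back to the main listing page. Here is a minimal sketch, assuming a hypothetical /pants/ category on example.com:

```
<!-- placed in the <head> of filtered variants such as /pants/?color=blue&size=m -->
<link rel="canonical" href="https://example.com/pants/">
```

This tells Google which version should be treated as the primary page, so the filter variations stop competing with it for crawling and indexing.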

How to Optimize the Crawl Budget?

There are several ways to optimize the crawling process:

Avoid using too many JavaScript files

Heavy, data-driven JavaScript pages consume more crawl budget than plain HTML pages. If Googlebot and other search engine spiders spend too much of your crawl budget rendering scripts, some elements (such as photos or large video files) may never get rendered and end up invisible to users coming from search. This can lead to poor user experiences and lost traffic.

The Crawl Stats report by file type
JavaScript files crawled by Googlebot

Don’t block CSS and JavaScript files in robots.txt

For Google, it’s important to be able to see your website the way a real user sees it, so don’t block CSS and JS files in your robots.txt file. That way, Google can do a better job of understanding your website.

https://youtu.be/B9BWbruCiDc
Explanation from Google

There are several reasons why you shouldn’t block these files (a minimal robots.txt sketch follows this list):

  • Mobile-friendly algorithm: To decide whether a page is mobile-friendly, Google needs to be able to render it completely, including the JavaScript and CSS behind the above-the-fold content. Only then can it apply the mobile-friendly tag in search results and the associated ranking boost for mobile searches. Google has publicly recommended that webmasters allow these resources to be crawled if they want the mobile-friendly boost for their pages.
  • Page layout algorithm: This algorithm looks at where content is placed on a page in relation to ads. If Google determines a page contains more ads than content, it can devalue the rankings for that page. However, webmasters can use CSS wizardry to make it appear as if content is at the top of the page, while ads are below it.
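Here is that minimal robots.txt sketch. It keeps CSS and JavaScript crawlable while still blocking one unimportant section; the /internal-search/ path is only a placeholder, and Google’s wildcard support (* and $) is assumed:

```
# robots.txt - keep CSS and JS crawlable so Google can render pages fully
User-agent: Googlebot
Allow: /*.css$
Allow: /*.js$
# placeholder for an unimportant, auto-generated section
Disallow: /internal-search/
```

The key point is simply that no Disallow rule should match your CSS or JS assets.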

Check whether your site already has an unrestricted crawl rate

Google’s web-crawling bots have several safeguards to prevent them from overloading your site during indexing. However, if you feel you need to limit crawling, you can do this by placing a noindex robots meta tag on each page you don’t want indexed.
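As a rough illustration, such a tag goes in the page’s <head> and looks like this:

```
<!-- keeps this page out of Google's index -->
<meta name="robots" content="noindex">
```

Keep in mind that Googlebot has to be able to crawl the page to see the tag, so don’t also disallow that page in robots.txt.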

Then you can check your site’s current crawl rate setting with this tool from Google.

You should let Google optimize the crawl rate for your site

Check your font file format

One thing you should keep in mind is how heavily your font files get crawled. Don’t ignore them; the reports below show how much of the crawl they can account for:

The crawling status for “Page resource load”, classified by Googlebot type
Page resource load consisting mostly of font files

Check your server for issues

Search engine crawlers are programmed to prevent overloading any website. If your site returns server errors, or if the requested URLs time out often, the crawler will limit the volume of content it collects from your site.

The “Host had problems in the past” issue

The only solution for this kind of issue is upgrading to a better hosting package.
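Before upgrading, it can help to confirm whether your key URLs really return server errors or time out. Below is a minimal sketch using Python’s requests library; the URL list is hypothetical:

```python
# crawl_health_check.py - flag 5xx responses and timeouts on a few key URLs
import requests

URLS = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/category/shoes/",
]

for url in URLS:
    try:
        # 10-second timeout chosen arbitrarily as a "slow enough to worry" threshold
        response = requests.get(url, timeout=10)
        if response.status_code >= 500:
            print(f"{url} -> server error {response.status_code}")
        else:
            print(f"{url} -> {response.status_code}")
    except requests.exceptions.Timeout:
        print(f"{url} -> timed out")
    except requests.exceptions.RequestException as exc:
        print(f"{url} -> request failed: {exc}")
```

If several URLs show 5xx errors or timeouts, the hosting upgrade mentioned above is likely worth it.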

Reduce and/or redirect 404 URLs

To put it simply, every time Googlebot crawls a 404 URL, it burns part of your crawl budget on a page that no longer exists.

Not-found URLs crawled by Google

To save crawl budget, consider using a 301 redirect to send those URLs to their closest relevant replacement.
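For example, on an Apache server a permanent redirect can be declared in .htaccess like this (the paths are hypothetical; other servers and CMS plugins have their own equivalents):

```
# .htaccess - send a removed URL permanently to its replacement
Redirect 301 /old-article/ https://example.com/new-article/
```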

Watch out for your robots.txt file

All about crawl budget and the robots.txt file, from the SEO Mythbusting show by Google Webmasters

When you start editing the robots.txt file, just make sure to follow Google’s guidelines.

Disallow rules in the robots.txt file can cause conflicts: here, Disallow: /xxx*

But it still gets indexed:

Indexed although blocked by the robots.txt file
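To reproduce the pattern from the screenshot, the rule would look something like this (the /xxx path is just the example shown above):

```
# robots.txt - block crawling of auto-generated URLs starting with /xxx
User-agent: *
Disallow: /xxx*
```

Keep in mind that Disallow only stops crawling: a blocked URL can still appear in the index if other pages link to it, because Googlebot never fetches it and therefore never sees any noindex signal. If you need a URL out of the index entirely, let it be crawled and add a noindex tag instead.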

Use URL parameters to stop Googlebot from crawling unnecessary URLs

If you have ever faced auto-generated URLs caused by filter variations, this feature is what you need.

Step 1: Open the Google Search Console platform.

Step 2: Click on the URL parameters field on the left side.

The URL Parameters feature in Google Search Console

Step 3: Define which type of URL parameter you need to block from crawling.

According to Google’s documentation, there are five types of URL parameters:

  • Sorts (for example, sort=price_descending): Changes the order in which content is presented.
  • Narrows (for example, pants_size=M): Filters the content on the page.
  • Specifies (for example, store=women): Determines the general class of content displayed on a page. If the parameter specifies an exact item and is the only way to reach that content, you should select “Every URL” for the behavior.
  • Translates (for example, lang=fr): Displays a translated version of the content. If you use a parameter to show different languages, you probably do want Google to crawl the translated versions, using hreflang to indicate language variants of your page rather than blocking that content with this tool.
  • Paginates (for example, page=2): Displays a specific page of a long listing or article.
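Putting Google’s examples together, hypothetical URLs for each parameter type might look like this (example.com and the paths are placeholders):

```
https://example.com/pants?sort=price_descending    <- Sorts
https://example.com/pants?pants_size=M             <- Narrows
https://example.com/shop?store=women               <- Specifies
https://example.com/jackets?lang=fr                <- Translates
https://example.com/blog?page=2                    <- Paginates
```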

Step 4: Add parameter

Add parameter in Google Search Console
Add parameter in Google Search Console

Step 5: Set up the URL parameter

Setup URL parameter

There are some fields that need to be clarified:

  • Parameter (case sensitive): The exact name of the parameter as it appears in your URLs.
  • Does this parameter change page content seen by the user?: Yes: Changes, reorders, or narrows page content.
  • Which URLs with this parameter should Googlebot crawl?: No URLs (because you don’t want to spend budget crawling all of these URLs).

Crawl Budget FAQs

Here are several common questions about crawl budget:

How to increase the crawl rate?

You can’t. There are just two options:

  • Let Google optimize for my site (recommended).
  • Limit Google’s maximum crawl rate.

That means you can only decrease the crawl rate, not increase it.

Will it affect the crawl budget when I disallow URLs in the robots.txt file?

According to Google, no, it doesn’t affect the crawl budget at all.

Is Crawl Budget a ranking factor?

No, Crawl Budget is not a ranking factor.

Does page speed affect the crawl budget?

Yes, page speed affects the crawl budget, positively and negatively.

Googlebot is most efficient when it connects with websites quickly. For instance, Googlebot can fetch more content in the same number of connections if the website is speedy. Conversely, if a site returns a significantly high number of 5xx errors or connection timeouts, crawling slows down.

Will adding individual pages affect the crawl budget?

According to John Mueller, no, adding individual pages isn’t going to impact how Googlebot crawls your site.


This article will be updated if anything new gets discovered, so stay tuned.