In SEO, especially on large sites, publishing new articles or important pages doesn't mean all of those URLs get indexed, let alone ranked.
So, how can you understand the indexing process and perhaps even reduce the time it takes Google to index your pages?
That's where Googlebot comes in, and its behavior is governed by the crawl budget.
What is the Crawl Budget?
Crawl budget is the average number of URLs Googlebot will crawl on your site before leaving. Google mainly uses two indexing crawlers: Googlebot Desktop and Googlebot Smartphone (the crawler behind Mobile-First Indexing).
So, crawl budget optimization ensures that Googlebot doesn't waste time crawling your unimportant pages (such as URLs auto-generated by filters, the search box, and so on) at the risk of ignoring your important pages (articles, landing pages, and so on).
What affects the Crawl Budget process?
A website's crawl budget is determined by several factors. The crawl rate limit and the crawler's demand for content are two aspects of a website that affect it. The size of the site also plays a part: smaller sites get a smaller budget and larger sites get a larger one.
Let's have a look at several factors that affect the crawl budget, positively or negatively:
| Factor | Description | Effect on crawl budget |
| --- | --- | --- |
| Crawl demand | An estimate of how much Google wants to crawl your pages, based on how popular they are and how stale their content is in Google's index. | It depends |
| Crawl rate limit | Defines the maximum fetching rate for a given site and how long Googlebot must wait between fetches. | Positively |
| Crawl health | As a website grows more popular and receives more visitors, it may need to upgrade its server capacity to handle the increased demand. Googlebot needs to crawl and index the site frequently to keep up with new content, so how well the server copes (its crawl health) matters. | It depends |
| Faceted navigation | Filtering products by color, price range, size, and so on, or pagination, is not search-friendly because it can auto-generate a large number of unnecessary URLs. If those URLs carry no canonical link to the main page, they cause duplicate content; even worse, Googlebot gets confused and has no clue which URLs are important. | Negatively |
| Session identifiers (session IDs) | Parameters generated automatically by user sessions or tracking setups. | Negatively |
| Soft error pages (soft 404s) | A "soft 404" is a response in which the server returns 200 OK even though the requested page is not found. This can limit a site's crawl coverage because search engines may index those duplicate URLs instead of URLs with unique content. | Negatively |
| Hacked pages | When your site gets hacked, URLs may start returning 404 errors, so it surely affects crawling negatively. | Negatively |
| Infinite spaces | Infinite spaces are large numbers of links that provide little new content for Googlebot to index. Failing to remove these links may make it harder for Googlebot to crawl your site and could result in your site being partially or completely omitted from Google's search results. | Negatively |
| Low-quality content | Content that lacks utility or value for Google or its readers. Google wants to surface quality content with real value, so it punishes unhelpful or bland content. | Negatively |
| Sitemap | XML files that list a website's URLs and the frequency of changes to each URL, and let webmasters signal the importance of each page within the site. Thanks to that, Googlebot can crawl URLs in a guided way. | It depends |
| Robots.txt | A plain-text file that manages crawler traffic to your website by telling crawlers which files (web pages, media files, resource files) or directories you do and don't want crawled. | It depends |
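For illustration, here is a minimal XML sitemap following the standard sitemap protocol; the URL, date, and values shown are hypothetical placeholders rather than recommendations:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want Googlebot to discover -->
  <url>
    <loc>https://example.com/important-landing-page/</loc>
    <lastmod>2023-01-15</lastmod>      <!-- last time the page changed -->
    <changefreq>weekly</changefreq>    <!-- how often it tends to change -->
    <priority>0.8</priority>           <!-- relative importance within your site -->
  </url>
</urlset>
```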
How to Optimize the Crawl Budget?
In order to optimize the crawling process, there are several ways:
Don't block CSS and JS files
For Google, it's important to be able to see your website the way a real user sees it, so don't block CSS and JS files in your robots.txt file. That way, Googlebot can do a better job of understanding your website.
There are several reasons why you shouldn't block them:
- Page layout algorithm: this algorithm looks at where content is placed on a page in relation to ads. If Google determines a page contains more ads than content, it can devalue the rankings for that page. Webmasters can use CSS wizardry to make it appear as if content is at the top of the page while ads sit below it, so Google needs access to your CSS to evaluate the layout correctly.
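As a sketch (the directory names below are hypothetical), this is the kind of blanket robots.txt block to avoid, because it hides your styling and scripts from Googlebot:

```
# Anti-pattern: don't do this — it keeps Googlebot from rendering the page like a user would
User-agent: Googlebot
Disallow: /assets/css/
Disallow: /assets/js/
```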
Check whether your site's crawl rate is already unlimited
Google's web-crawling bots have several safeguards to prevent them from overloading your website during indexing. However, if you feel you need to limit things, you can do so by placing a noindex robots meta tag on each page you don't want indexed.
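For reference, the standard robots meta tag looks like this; it goes inside the `<head>` of each page you want to keep out of the index:

```html
<head>
  <!-- Tells compliant crawlers not to index this page -->
  <meta name="robots" content="noindex">
</head>
```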
Then, you can check your site's current crawl level in Google Search Console (the Crawl Stats report).
Check your font file formats
One thing you should keep in mind is how your font files are crawled. Don't ignore them: font files are fetched like other resources, so the formats you serve affect how efficiently Googlebot handles them.
Check for server issues
Search engine crawlers are programmed to avoid overloading any website. If your site returns server errors, or if requested URLs often time out, the crawler will limit the volume of content it collects from your site.
The usual solution for this kind of issue is upgrading to a better hosting package or adding server capacity.
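As a rough, unofficial way to spot-check this yourself (the URLs below are hypothetical placeholders for pages on your own site), a short script can report status codes and response times:

```python
# Minimal sketch: flag server errors and timeouts that would make crawlers back off.
import requests

SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/products/",
]

for url in SAMPLE_URLS:
    try:
        response = requests.get(url, timeout=10)
        seconds = response.elapsed.total_seconds()
        if response.status_code >= 500:
            print(f"{url} -> server error {response.status_code}")
        else:
            print(f"{url} -> {response.status_code} in {seconds:.2f}s")
    except requests.exceptions.Timeout:
        print(f"{url} -> timed out after 10s")
    except requests.exceptions.RequestException as exc:
        print(f"{url} -> request failed: {exc}")
```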
Reduce and/or redirect 404 URLs
To put it simply, every 404 URL that Googlebot crawls is crawl budget spent on a dead end.
To save crawl budget, consider pointing those URLs at their closest live equivalents with a 301 redirect.
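How you declare the redirect depends on your server; as one common example (assuming an Apache server with mod_alias, and hypothetical paths), a 301 can be set in the .htaccess file:

```apache
# Permanently redirect a dead URL to its closest live replacement
Redirect 301 /old-article https://example.com/new-article
```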
Watch your robots.txt file
When you start editing your robots.txt file, just make sure to follow Google's guidelines.
Keep in mind that robots.txt controls crawling, not indexing: a URL you disallow can still end up indexed if other pages link to it.
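A minimal robots.txt sketch along those lines (the directories, parameter, and sitemap location are hypothetical examples, not a template to copy verbatim):

```
User-agent: *
# Keep crawlers out of auto-generated, low-value URL spaces
Disallow: /search/
Disallow: /*?sort=

Sitemap: https://example.com/sitemap.xml
```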
Use URL parameters to stop Googlebot from crawling unnecessary URLs
If you've ever faced auto-generated URLs caused by filter variations, this feature is what you need.
Step 1: Open Google Search Console.
Step 2: Click the URL Parameters field on the left-hand side.
Step 3: Define which type of URL you need to block from crawling.
According to Google's documentation, there are five types of URL parameters:
- Sorts (for example, sort=price_descending): Changes the order in which content is presented.
- Narrows (for example, pants_size=M): Filters the content on the page.
- Specifies (for example, store=women): Determines the general class of content displayed on a page. If this specifies an exact item and it is the only way to reach that content, you should select "Every URL" for the behavior.
- Translates (for example, lang=fr): Displays a translated version of the content. If you use a parameter to show different languages, you probably do want Google to crawl the translated versions; use hreflang to indicate language variants of your page rather than blocking that content with this tool.
- Paginates (for example, page=2): Displays a specific page of a long listing or article.
Step 4: Add the parameter.
Step 5: Set up the URL parameter.
There are a few fields you need to fill in:
- Parameter (case sensitive): Enter the parameter name exactly as it appears in your URLs.
- Does this parameter change page content seen by the user?: Yes: Changes, reorders, or narrows page content.
- Which URLs with this parameter should Googlebot crawl?: No URLs (because you don't want to spend the budget crawling all of these URLs).
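For example, with hypothetical filter variations like the ones below, the parameter to enter would be sort, with crawling set to "No URLs":

```
https://example.com/pants?sort=price_descending
https://example.com/pants?sort=price_ascending
https://example.com/pants?sort=newest
```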
Crawl Budget FAQs
Here are several common questions about crawl budget:
How do I increase the crawl rate?
You can't. There are just two options, as below:
- Let Google optimize for my site (recommended).
- Limit Google’s maximum crawl rate.
That means you can only decrease the crawl rate, not increase it.
Does disallowing URLs in the robots.txt file affect crawl budget?
According to Google, no, it doesn't affect crawl budget at all.
Is Crawl Budget a ranking factor?
No, Crawl Budget is not a ranking factor.
Does page speed affect the crawl budget?
Yes, page speed affects the crawl budget, both positively and negatively.
Googlebot is most efficient when it can connect to websites quickly. For instance, Googlebot can fetch more content over the same number of connections if the website is speedy. Conversely, if a site returns a significant number of 5xx errors or connection timeouts, crawling slows down.
Will adding individual pages affect the crawl budget?
According to John Mueller, no, adding individual pages isn't going to impact how Googlebot crawls your site.
This article will be updated if anything new is discovered, so stay tuned.