Google published guidance on how to properly reduce Googlebot’s crawl rate due to an increase in erroneous use of 403/404 response codes, which could have a negative impact on websites.
The guidance noted that misuse of these response codes was increasing among web publishers and content delivery networks.
Rate Limiting Googlebot
Googlebot is Google’s automated software that visits (crawls) websites and downloads the content.
Rate limiting Googlebot means slowing down how fast Google crawls a website.
The phrase “Google’s crawl rate” refers to how many requests for webpages per second Googlebot makes.
There are times when a publisher may want to slow Googlebot down, for example if it’s causing too much server load.
Google recommends several ways to limit Googlebot’s crawl rate, chief among them the use of Google Search Console.
Rate limiting through Search Console will slow down the crawl rate for a period of 90 days.
Another way of affecting Google’s crawl rate is to use robots.txt to block Googlebot from crawling individual pages, directories (categories), or the entire website.
A good thing about robots.txt is that it only asks Google to refrain from crawling; it does not ask Google to remove a site from the index.
However, using robots.txt can result in “long-term effects” on Google’s crawling patterns.
Perhaps for that reason the ideal solution is to use Search Console.
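For publishers who do use robots.txt, it helps to check what a rule actually blocks before deploying it. The following is a minimal sketch using Python’s standard urllib.robotparser module; the /search/ directory and the helper name googlebot_may_crawl are assumptions made for illustration, not something taken from Google’s guidance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: "/search/" is an assumed example
# directory that the publisher does not want Googlebot to crawl.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search/
"""


def googlebot_may_crawl(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt lets Googlebot crawl `path`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", path)


if __name__ == "__main__":
    for path in ("/search/results", "/blog/post"):
        print(path, googlebot_may_crawl(path))
```

A rule like this only asks Googlebot not to crawl the matching paths. It says nothing about how fast the rest of the site is crawled.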
Google: Stop Rate Limiting With 403/404
Google published guidance on its Search Central blog advising publishers not to use 4XX response codes (except for the 429 response code) to rate limit Googlebot.
The blog post specifically mentioned the misuse of the 403 and 404 error response codes for rate limiting, but the guidance applies to all 4XX response codes except for the 429 response.
Google made the recommendation because it has seen an increase in publishers using those error response codes to limit Google’s crawl rate.
The 403 response code means that the visitor (Googlebot in this case) is prohibited from visiting the webpage.
The 404 response code tells Googlebot that the webpage could not be found.
The 429 response code, although it is also a 4XX client error, means “too many requests” and is a valid response for rate limiting.
Over time, Google may drop webpages from its search index if those pages keep returning 403 or 404 response codes.
That means that the pages will not be considered for ranking in the search results.
Google wrote:
“Over the last few months we noticed an uptick in website owners and some content delivery networks (CDNs) attempting to use 404 and other 4xx client errors (but not 429) to attempt to reduce Googlebot’s crawl rate.
The short version of this blog post is: please don’t do that…”
Ultimately, Google recommends using the 500, 503, or 429 error response codes.
The 500 response code means there was an internal server error. The 503 response means that the server is unable to handle the request for a webpage.
Google treats both kinds of responses as temporary errors, so it will return later to check whether the pages are available again.
A 429 error response tells the bot that it’s making too many requests, and the response can also ask the bot to wait for a set period of time before re-crawling.
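The idea of answering with 429 and asking the crawler to wait can be sketched as a small decision function. This is an illustrative sketch, not Google’s or any server’s actual implementation: the function name throttle_response, the requests-per-second threshold, and the one-hour wait are all assumed values. It uses the standard HTTP Retry-After header as one way to express the wait.

```python
from typing import Dict, Tuple


def throttle_response(
    requests_last_second: int,
    max_requests_per_second: int = 5,   # assumed threshold, tune per server
    retry_after_seconds: int = 3600,    # assumed wait of one hour
) -> Tuple[int, Dict[str, str]]:
    """Pick a status code and headers for an incoming crawler request.

    Returns 200 while the crawler stays under the limit. Once the limit is
    exceeded, returns 429 (Too Many Requests) with a Retry-After header,
    rather than a 403 or 404, which would misreport the page's status.
    """
    if requests_last_second <= max_requests_per_second:
        return 200, {}
    return 429, {"Retry-After": str(retry_after_seconds)}


if __name__ == "__main__":
    print(throttle_response(3))
    print(throttle_response(12))
```

A real server would apply this logic in its request handler or at the CDN, using its own measurement of request rate. The same shape works for a 503 response when the server as a whole is overloaded.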
Google recommends consulting their Developer Page about rate limiting Googlebot.
Read Google’s blog post:
Don’t use 403s or 404s for rate limiting
Featured image by Shutterstock/Krakenimages.com