Google Search Central has launched a new series called “Crawling December” to provide insights into how Googlebot crawls and indexes webpages.
Each week this month, Google will publish a new article exploring aspects of the crawling process that are rarely discussed but can significantly affect how sites get crawled.
The first post in the series covers the basics of crawling and sheds light on essential yet lesser-known details about how Googlebot handles page resources and manages crawl budgets.
Crawling Basics
Today’s websites are complex due to advanced JavaScript and CSS, making them harder to crawl than old HTML-only pages. Googlebot works like a web browser but on a different schedule.
When Googlebot visits a webpage, it first downloads the HTML from the main URL, which may link to JavaScript, CSS, images, and videos. Then, Google’s Web Rendering Service (WRS) uses Googlebot to download these resources to create the final page view.
Here are the steps in order:
- Initial HTML download
- Processing by the Web Rendering Service
- Resource fetching
- Final page construction
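Conceptually, the flow can be pictured with the short Python sketch below. It is purely illustrative of the steps above, not Google's actual pipeline, and example.com is a placeholder page:

```python
# Illustrative fetch-then-render flow; not Google's implementation.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceCollector(HTMLParser):
    """Collect URLs of resources (scripts, stylesheets, images, video) referenced by the HTML."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])
        elif tag in ("img", "video", "source") and attrs.get("src"):
            self.resources.append(attrs["src"])


page_url = "https://example.com/"                # placeholder URL
html = urlopen(page_url).read().decode("utf-8")  # 1. initial HTML download

collector = ResourceCollector()                  # 2. processing (parsing the HTML)
collector.feed(html)

for ref in collector.resources:                  # 3. resource fetching
    resource_url = urljoin(page_url, ref)
    print(f"fetched {resource_url} ({len(urlopen(resource_url).read())} bytes)")

# 4. final page construction: a real renderer would now execute JavaScript
#    and apply CSS to build the rendered page from the HTML plus resources.
```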
Crawl Budget Management
Crawling these extra resources can eat into the main website’s crawl budget. To offset this, Google says that “WRS tries to cache every resource (JavaScript and CSS) used in the pages it renders.”
Notably, the WRS cache lasts up to 30 days and is not influenced by the HTTP caching rules set by developers. This caching strategy helps conserve a site’s crawl budget.
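As a rough mental model of that behavior, here is a toy Python cache keyed by URL with a fixed 30-day lifetime that ignores HTTP cache headers entirely; the helper function and TTL are assumptions for the sketch, not a description of WRS internals:

```python
# Toy model of the behavior described above: resources cached by URL for up to
# 30 days, ignoring HTTP cache headers. The get_resource() helper and fixed TTL
# are assumptions for the sketch, not a description of WRS internals.
import time
from urllib.request import urlopen

TTL_SECONDS = 30 * 24 * 60 * 60   # 30 days
_cache = {}                       # url -> (fetched_at, body)


def get_resource(url: str) -> bytes:
    """Return a cached copy if it is younger than 30 days, otherwise refetch."""
    now = time.time()
    entry = _cache.get(url)
    if entry and now - entry[0] < TTL_SECONDS:
        return entry[1]           # cache hit: no crawl budget spent
    body = urlopen(url).read()    # cache miss: counts against crawl budget
    _cache[url] = (now, body)
    return body
```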
Recommendations
This post gives site owners tips on how to optimize their crawl budget:
- Reduce Resource Use: Use as few resources as possible to deliver a good user experience; this reduces how much crawl budget is spent rendering a page.
- Host Resources Separately: Place resources on a different hostname, like a CDN or subdomain. This can help shift the crawl budget burden away from your main site.
- Use Cache-Busting Parameters Wisely: Be careful with cache-busting parameters. Changing a resource’s URL can prompt Google to recrawl it even when the content is unchanged, which wastes crawl budget (illustrated below).
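For instance, fingerprinting a resource URL with a hash of its content keeps the URL stable until the file itself changes, unlike a timestamp or build-number parameter that changes on every deploy. A minimal sketch, where the file path and CDN hostname are hypothetical:

```python
# Content-hash "cache busting": the URL only changes when the file's bytes
# change, so unchanged resources keep a stable URL between deploys.
# The file path and CDN hostname below are hypothetical.
import hashlib
from pathlib import Path


def fingerprinted_url(path: str, base: str = "https://cdn.example.com/") -> str:
    """Build a resource URL whose version parameter is a hash of the file content."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]
    return f"{base}{Path(path).name}?v={digest}"


# Stable as long as static/app.js is unchanged; schemes like ?v=<build timestamp>
# change the URL (and trigger a refetch) on every deploy even when nothing changed.
print(fingerprinted_url("static/app.js"))
```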
Google also warns that blocking resource crawling with robots.txt can be risky: if Googlebot can’t access a resource needed for rendering, it may struggle to retrieve the page content and rank the page properly.
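One way to sanity-check this is to test your robots.txt rules against the resource URLs your pages depend on. A minimal sketch using Python’s standard library robotparser, with made-up rules and URLs:

```python
# Check whether robots.txt rules would block Googlebot from a resource the
# page needs for rendering. The rules and URLs here are made up for illustration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /assets/js/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

critical_resource = "https://example.com/assets/js/app.js"
if not parser.can_fetch("Googlebot", critical_resource):
    print(f"Blocked: {critical_resource} cannot be fetched for rendering.")
```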
Related: 9 Tips To Optimize Crawl Budget For SEO
Monitoring Tools
The Search Central team says the best way to see what resources Googlebot is crawling is by checking a site’s raw access logs.
You can identify Googlebot by its IP address using the ranges published in Google’s developer documentation.
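A rough sketch of that workflow in Python, assuming a combined-format access log (the path is a placeholder) and the googlebot.json ranges file referenced in Google’s documentation (verify the exact URL against the docs):

```python
# Sketch: pull Googlebot hits from a raw access log and verify client IPs
# against Google's published ranges. The log path is a placeholder, and the
# ranges URL should be double-checked against Google's developer docs.
import ipaddress
import json
from urllib.request import urlopen

RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

prefixes = json.load(urlopen(RANGES_URL))["prefixes"]
networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in prefixes
]


def is_googlebot_ip(ip: str) -> bool:
    """True if the IP falls inside one of Google's published Googlebot ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


with open("/var/log/nginx/access.log") as log:    # placeholder path
    for line in log:
        if "Googlebot" not in line:
            continue
        client_ip = line.split()[0]               # first field in combined log format
        status = "verified" if is_googlebot_ip(client_ip) else "possibly spoofed"
        print(status, line.rstrip())
```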
Why This Matters
This post clarifies three key points that impact how Google finds and processes your site’s content:
- Resource management directly affects your crawl budget, so hosting scripts and styles on CDNs can help preserve it.
- Google caches resources for 30 days regardless of your HTTP cache settings, which helps conserve your crawl budget.
- Blocking critical resources in robots.txt can backfire by preventing Google from properly rendering your pages.
Understanding these mechanics helps SEOs and developers make better decisions about resource hosting and accessibility – choices that directly impact how well Google can crawl and index their sites.
Related: Google Warns: URL Parameters Create Crawl Issues
Featured Image: ArtemisDiana/Shutterstock