Monday, February 24, 2025
Every now and then we get questions about robots.txt, robots meta tags, and the control
functionality that they offer. Following our December series on crawling,
we thought this would be the perfect time to put together a light refresher. So, if you’re curious
about these controls, follow along in this new blog post series!
Let’s start at the very beginning, with robots.txt.
So, what is robots.txt?
A “robots.txt” is a
file that any website can provide. In its simplest form, it’s a text file that’s stored on the
server. Almost all websites have a robots.txt file.
To look at one, take the domain name, add /robots.txt to the end, then browse to that
address. For example, this website's robots.txt file is at developers.google.com/robots.txt.
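As a sketch, a very simple robots.txt file might look like the following; the rules here are hypothetical, not a recommendation for any particular site:

```
# Hypothetical example: allow all crawlers everywhere,
# except a site's internal search result pages.
User-agent: *
Disallow: /search
Allow: /
```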
Most websites use content management systems (CMSes) that make these files automatically, but even
if you’re making your website “by hand”, it’s easy to create. We’ll take a look at some of the
variations in future posts.
What are these files for?
robots.txt files tell website crawlers which parts of a website are available for automated access
(we call that crawling), and which parts aren't. Rules can cover anything from a whole site to
sections of it, or even specific files within it. In addition to being machine-readable, the files
are also human-readable. This means that there's always a straightforward yes or no answer
regarding whether or not a page is allowed to be accessed in an automated fashion by a specific
crawler.
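That yes-or-no answer is easy to compute in code. As a minimal sketch, Python's standard urllib.robotparser module can parse a set of rules and answer whether a given URL may be fetched; the rules, the crawler name "ExampleBot", and the URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block internal search pages,
# allow everything else, for all crawlers.
rules = """\
User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A straightforward yes or no for each crawler/URL pair:
print(parser.can_fetch("ExampleBot", "https://example.com/about"))   # True
print(parser.can_fetch("ExampleBot", "https://example.com/search"))  # False
```

Any crawler, in any language, can implement the same check; the point is that the answer is unambiguous.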
It’s standard practice for anyone building a crawler to follow these directives, and easy for a
developer to support them—there are more than 1000 open-source libraries available
for developers. The file gives crawlers instructions for crawling a website efficiently. Modern
websites can be complex and challenging to navigate automatically, and robots.txt rules help
crawlers focus on appropriate content. The rules also help crawlers avoid dynamically created
pages that could put strain on the server and make crawling unnecessarily inefficient. Because
robots.txt files are both technically helpful and good for relations with website owners, most
commercial crawler operators follow them.
Built and expanded by the public
robots.txt files have been around almost as long as the internet has existed, and the format is
one of the essential tools that enables the internet to work as it does. HTML, the
foundation of web pages, was invented in 1991, the first browsers came in 1992, and robots.txt
arrived in 1994. That means it predates even Google, which was founded in 1998. The format has
been mostly unchanged since then, and
a file from the early days
would still be valid now. Through three years of global community engagement, it was made an
IETF proposed standard
in 2022.
If you have a website, chances are you also have a robots.txt file. There's a vibrant and active
community around robots.txt, and thousands of software tools help to build, test,
manage, or understand robots.txt files in all shapes and sizes. The beauty of robots.txt, though,
is that you don't need fancy tools: you can read the file in a browser, and, for a website
that you manage, adjust it in a simple text editor.
Looking forward…
The robots.txt format is flexible. There’s room for growth, the public web community can expand on
it, and crawlers can announce extensions when appropriate, without breaking existing usage. This
happened in 2007, when search engines announced the “sitemap”
directive. It’s also regularly happening as new “user-agents” are supported by crawler operators
and search engines, such as those used for AI purposes.
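For instance, a site can combine the original rules with the later sitemap extension in the same file; older parsers that don't understand the extra line simply ignore it. The URL below is a hypothetical placeholder:

```
User-agent: *
Disallow: /search

# The "sitemap" directive, announced in 2007, extends the
# format without breaking crawlers that predate it.
Sitemap: https://example.com/sitemap.xml
```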
robots.txt is here to stay. New file formats take a few years to be finalized with the larger
internet community, and proper tools to make them useful for the ecosystem take even longer. robots.txt
is easy, it's granular and expressive, it's well understood and accepted, and it just works, like it's
been working for decades now.
Curious to hear more about the details? Stay tuned
for the next editions of our Robots Refresher series on the Search Central blog!