A robots.txt file is a critical file that lives at the root of your website and provides a set of instructions to web crawlers and other robots about which pages they can or cannot access.
It is used to help manage crawl traffic, so that your website is not overwhelmed with requests and crawlers stay away from pages you don't want them visiting. Keep in mind that blocking a URL in robots.txt does not guarantee it stays out of search results: if you want to keep a specific page off of Google Search, you should use a noindex directive or protect the page with a password. But if you want to steer crawlers away from large sections of your site, robots.txt works well.
It’s important that you fully understand the power of robots.txt, because a poorly written file can severely damage your site’s SEO. On the flip side, it has plenty of benefits:
- Improve website performance by blocking crawlers from parts of your website they shouldn’t access, which reduces traffic to your servers.
- Keep sensitive or non-public areas of your site from surfacing in search results (keeping in mind that robots.txt is a set of instructions, not a security control).
- Improve the search indexing process by guiding crawlers to your most relevant pages.
Components of Robots.txt
The most important lines of a robots.txt file can be broken down into four buckets:
- User-agent: This specifies which web crawler or user agent the rules apply to. A wildcard character (*) signifies that the rules apply to all crawlers. An example of calling out specific user agents like Google-Extended and GPTBot can be found in Narcity’s robots.txt.
- Disallow: This directive tells crawlers which pages or directories they are not allowed to crawl. One common use of disallow is to keep crawlers away from pages that offer no search value, which also preserves crawl budget by preventing crawlers from wasting time on such pages. Note, though, that disallow prevents crawling, not indexing: a disallowed URL can still appear in search results if other sites link to it. Oftentimes you’ll block entire directories of files; for example, anything matching /core/* is blocked in our robots.txt.
- Allow: There may be instances when you want to make exceptions to a disallow rule, and this is when you use the allow directive: the specified pages or directories may be crawled despite a broader disallow rule. For example, Raw Story’s robots.txt allows /r/kappa/api/ to be crawled because it contains a custom-built sitemap, despite otherwise disallowing the /r/ folder.
- Sitemap: This directive provides the location of your XML sitemap file, which lists the URLs on your website that you want indexed. A good crawler will find these on its own, but a sitemap speeds up the process. Websites with multiple sitemaps can list all of them here; an example of listing multiple sitemaps can be found in Panorama’s robots.txt. Before adding a sitemap to robots.txt, check that it loads properly and actually contains URLs.
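Putting the four directives together, a minimal robots.txt might look like the sketch below. The paths, user agents, and sitemap URL are placeholders for illustration, not a recommended configuration:

```
# Rules for all crawlers
User-agent: *
Allow: /core/public/
Disallow: /core/

# Rules for one specific crawler
User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
```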
With the four components above, you can configure your robots.txt in a way that makes it clear which pages you want crawlers to index and which pages you want robots to stay away from. You can hide internal resources or non-public pages and block any duplicate content from confusing crawlers. Through the process, you are also optimizing your crawl budget.
One important note: While robots.txt provides a set of instructions, it doesn’t enforce them. Search engine crawlers and site health crawlers like Semrush are among the good bots that follow the rules, but spam bots are likely to ignore them. For that reason, be especially careful with any sensitive information that you are exposing on your website.
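To see how a well-behaved bot interprets these rules, you can test a set of directives with Python's standard `urllib.robotparser` module. The rules and URLs below are hypothetical; note that this parser applies the first matching rule it finds, so the allow exception is listed before the broader disallow:

```python
from urllib import robotparser

# Hypothetical rules: allow one subfolder, disallow the rest of /core/.
rules = """\
User-agent: *
Allow: /core/public/
Disallow: /core/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks before fetching; a spam bot simply won't ask.
print(rp.can_fetch("*", "https://example.com/core/admin"))        # False: disallowed
print(rp.can_fetch("*", "https://example.com/core/public/page"))  # True: allow exception
print(rp.can_fetch("*", "https://example.com/about"))             # True: no rule matches
```

A real crawler would load the live file with `rp.set_url(...)` and `rp.read()` instead of parsing an inline string.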
Common Issues
Search Engine Journal has a great list of the most common issues with robots.txt files that you should definitely give a read. Some of these include:
- noindex: If you have this in your robots.txt, your file may be very outdated, as Google began ignoring noindex rules in robots.txt as of 2019. It's best to remove noindex references.
- crawl-delay: This is supported by Bing but not Google, and crawl rate settings were removed entirely from Google Search Console at the end of 2023, so the directive offers little value in your robots.txt.
- missing sitemap: At least one sitemap should be in your robots.txt file.
- incorrect use of wildcards: The asterisk (*) matches any sequence of valid characters, and the dollar sign ($) marks the end of a URL, such as a filetype extension. Use these carefully so you don't accidentally block entire sections of your site.
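As an illustration, a hypothetical set of wildcard rules might look like this (the patterns are examples, not a recommended setup):

```
User-agent: *
# Block any URL that contains a query string
Disallow: /*?
# Block PDF files anywhere on the site; $ anchors the match
# to the end of the URL, so /report.pdf is blocked but a
# directory like /report.pdf-archive/ is not
Disallow: /*.pdf$
```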
Update Your Robots.txt
RebelMouse users can easily make changes to their robots.txt by launching the Layout & Design Tool from the Posts Dashboard menu. Navigate to Global Settings and you’ll find a line for robots.txt; click it to make your updates right there.
Validate Your Robots.txt Setup
Google Search Console lets you check that your robots.txt is set up properly. To do this, navigate to Settings at the bottom of the left-side navigation menu. Under Crawling, you should see robots.txt: “Valid.” For more insight, open the robots.txt report (right side of the screen), which shows when the file was last checked, the file path, the fetch status (fetched successfully, or not fetched for reasons such as not found), and the size of the file. Any issues will be noted. If you need to request a recrawl, you can do so on this page.
This is what you should see in Google Search Console for a valid robots.txt file.
If the robots.txt is not valid, you will see an error message and you can troubleshoot from there.
Request a Review
If you’d like one of our strategists to take a look at your robots.txt and make suggestions for optimizing it, simply get in touch and we can set that up with you.