Key Takeaways
- Robots.txt controls crawling, not indexing. A blocked URL can still rank if Google finds it via a link. To keep a page out of search, allow the crawl and use noindex. Never combine a Disallow with noindex on the same URL.
- Crawl budget is a large-site problem. It matters above roughly a million pages, or ten thousand fast-changing ones. Most small sites can ignore it.
- AI crawlers split three ways: training (GPTBot, ClaudeBot), search (OAI-SearchBot, Claude-SearchBot, PerplexityBot), and retrieval (ChatGPT-User, Claude-User, Perplexity-User). Govern each separately. Perplexity has no training crawler.
- Retrieval crawlers may ignore robots.txt because the fetch is user-initiated. For reliable control, use WAF rules built from official IP ranges.
- Most failures are governance failures, not syntax errors. The classic disaster is a staging Disallow: / shipped to production.
Robots.txt now does two jobs:
- It tells search engines where not to go.
- It decides what AI systems can do with your content.
This guide covers the directives, crawl budget, how to govern AI crawlers, and how to keep the file from quietly breaking, with every claim checked against official documentation.
The one thing to remember as you go: robots.txt controls crawling, not indexing, and almost every expensive mistake comes from forgetting that.
Part 1: Foundations
1. What Robots.txt Is, and What It Is Not
Robots.txt is a plain-text file that lives at the root of a website and tells automated visitors (crawlers, spiders, bots of all kinds) where you would prefer they not go. It sits at a fixed, public address:
https://example.com/robots.txt
The protocol behind it is the Robots Exclusion Protocol, which the IETF finally turned into a formal standard, RFC 9309, in September 2022. For the twenty-something years before that, robots.txt worked on the strength of convention alone: everyone agreed on roughly how it behaved, but the edges were fuzzy. RFC 9309 cleaned up the parts that used to cause arguments: how patterns are matched, which rule wins when two conflict, and what a crawler should do when the file is missing or broken.
In practical terms, the file tells crawlers which URLs they may visit, which they should leave alone, and where to find your sitemaps. It is the first thing a well-behaved crawler fetches before it touches anything else on your site, which makes it the opening handshake between your server and the wider machine web. Get that handshake wrong and everything downstream suffers: crawl efficiency, server load, search access, and now AI access too.
That is what it does. Here is what it does not do, and this is where the money gets lost.
- It does not control indexing. This is the single most common and most expensive misunderstanding in technical SEO. A rule like Disallow: /page/ tells crawlers not to visit /page/. It does nothing to stop Google from listing that URL in search results once Google learns the URL exists some other way: through a link, a sitemap, a redirect, or a mention on another site. If you actually want a page kept out of the index, the tool is <meta name="robots" content="noindex"> or an X-Robots-Tag header, not a Disallow.
- It does not provide security. RFC 9309 says so in plain language: the protocol is not a substitute for real content security, and listing a path in the file makes that path publicly discoverable. Anything sensitive needs authentication and access controls. A Disallow rule is a polite request, not a lock.
- It does not hide anything. Every line in your robots.txt is visible to anyone who types the URL. People who go looking for the directories you would rather keep quiet start by reading your robots.txt, because that is where owners helpfully list them.
- And compliance is voluntary. The well-known crawlers (Googlebot, Bingbot, ClaudeBot, GPTBot, PerplexityBot) honor the file. Bad-faith bots ignore it entirely. The whole system runs on the assumption of good-faith actors, which is worth keeping in mind every time you are tempted to treat robots.txt as a defense.
What changed between 2022 and 2026
The file itself barely changed. What we ask it to do changed a great deal:
- 2022: RFC 9309 standardizes the protocol.
- 2023: OpenAI's GPTBot arrives, and for the first time site owners have to decide whether they want their content feeding AI training.
- 2024 to 2025: that decision goes mainstream, and the cast grows: ClaudeBot, Claude-SearchBot, OAI-SearchBot, PerplexityBot.
- 2026: AI visibility, a different thing from AI governance, becomes a core part of the SEO conversation rather than a fringe one.
So robots.txt has quietly turned into a search-and-AI governance tool, and any site of meaningful size now faces a set of crawler-access decisions that simply did not exist three years ago.
Source: RFC 9309
2. Crawling vs Indexing vs Rendering vs Ranking
These four words get used interchangeably all the time, and that habit is the root of more costly SEO mistakes than any syntax error. They are four separate processes. Keeping them straight is most of the battle.
Crawling
Crawling is just this: a crawler asks your server for a URL, and your server answers.
GET /article/example HTTP/1.1
Googlebot requested the page, read the response, and moved on. That is the entire act of crawling, and crawling is the only one of the four processes that robots.txt directly controls.
Rendering
Modern search engines do not stop at reading raw HTML. They render the page the way a browser does, running the JavaScript, loading the CSS, building the layout. Google has described its own pipeline in exactly those terms. Any page that relies on JavaScript to bring in its content, its navigation, or its structured data is leaning on this step whether the team realizes it or not.
This is where a robots.txt mistake turns invisible and severe. If you block the CSS or JavaScript a page needs, Google can crawl the HTML perfectly and still fail to understand the page, because it never got to render it. Blocked rendering resources are one of the most damaging and most common misconfigurations there is, precisely because the page looks fine to a human and broken only to the crawler.
Indexing
After rendering, Google decides whether to store what it learned, and that stored record is what makes a page eligible to show up in results. Indexing is where robots.txt has no say at all. A URL you blocked can still be indexed if Google found it through an external link.
- The tool that actually prevents indexing is noindex: <meta name="robots" content="noindex">
- For files that cannot carry an HTML tag: X-Robots-Tag: noindex
And here is the interaction that trips up even experienced teams. If you block a page in robots.txt, Google cannot crawl it. If Google cannot crawl it, Google never sees the noindex tag sitting on it. So the page can end up indexed anyway, discovered through some external link, with Google unable to read the very instruction that was supposed to keep it out. This is why Google's own guidance is blunt about it: when your goal is to prevent indexing, allow the crawl and apply noindex. Do not block with robots.txt.
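The interaction described above can be written out as a small truth table. This is a minimal sketch in Python; the function name `indexing_outcome` and the wording of the outcomes are illustrative assumptions, not anything Google publishes.

```python
def indexing_outcome(disallowed: bool, has_noindex: bool) -> str:
    """Describe what a compliant search crawler can do with one URL."""
    if disallowed and has_noindex:
        # The crawler never fetches the page, so it never reads the tag.
        return "conflict: noindex is unreadable, URL may still be indexed from links"
    if disallowed:
        return "not crawled, but may still be indexed from links"
    if has_noindex:
        return "crawled, then kept out of the index"
    return "crawled and eligible for indexing"


for blocked in (True, False):
    for noindex in (True, False):
        print(blocked, noindex, "->", indexing_outcome(blocked, noindex))
```

Only the third branch, crawl allowed plus noindex, reliably keeps a page out of results.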
Ranking
Ranking is the ordering of indexed pages for a given query, and robots.txt touches it only indirectly. Efficient crawling helps ranking in the sense that it keeps your important content discovered and fresh. But a Disallow never nudges a page up or down the results. It removes the page from the running altogether.
The decision table
| Goal | Correct Tool |
|---|---|
| Prevent crawling | robots.txt Disallow |
| Prevent indexing | noindex meta tag or X-Robots-Tag |
| Consolidate duplicates | Canonical tag |
| Redirect content | 301 redirect |
| Secure content | Authentication / access controls |
| Manage AI access | robots.txt (crawler-specific) |
Sources: Google Block Indexing Documentation; Google Robots Meta Tag Documentation
3. Robots.txt Technical Specifications
Most of the time you will not think about the specification at all. But the few rules below decide what happens at the edges (when your server hiccups, when the file gets large, when a crawler caches it), and the edges are exactly where outages live.
- Where the file has to live. RFC 9309 is strict here: the file has to sit at the root of the host, at /robots.txt. Not /files/robots.txt, not /assets/robots.txt, not anywhere else. And it only governs the host it sits on. A file at www.example.com has no authority over blog.example.com. Every subdomain needs its own.
- Encoding. RFC 9309 requires UTF-8, and Google recommends it explicitly. Encoding mismatches can produce parsing behavior you did not intend.
- Size. Crawlers are required to parse at least 500 kibibytes, and Google processes up to that limit. Anything past it may be ignored. You are unlikely to ever hit this, since 500 KiB holds thousands of directives, but it is worth knowing the ceiling exists.
What different HTTP responses mean. This table matters more than it looks:
| Status Code | What Happens |
|---|---|
| 200 | File available; the rules in it must be followed |
| 404 / any 4xx | File treated as unavailable; the crawler may access anything |
| 5xx server error | Crawler must assume a complete block until the file is reachable again |
| Redirect (301/302) | Crawlers should follow at least five consecutive redirects |
- Caching. Crawlers should not hold a cached copy for more than 24 hours unless they cannot reach the file. In practice, your edits reach compliant crawlers within roughly a day, with the exact timing varying by crawler. Do not expect changes to take effect the moment you save.
- A security reminder, because it bears repeating. RFC 9309 treats the file's content as untrusted by design, and so should you. Never put sensitive paths, internal directory names, API endpoints, or anything security-relevant in robots.txt. The file is public on purpose.
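The status-code table above can be read as a small decision function. This is a minimal sketch in Python; the name `robots_fetch_policy` and the returned labels are hypothetical, and it classifies a single response rather than following a redirect chain.

```python
def robots_fetch_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt fetch to crawler behaviour."""
    if status_code == 200:
        return "parse-and-obey"      # rules in the file must be followed
    if 300 <= status_code < 400:
        return "follow-redirect"     # at least five consecutive hops
    if 400 <= status_code < 500:
        return "allow-all"           # file unavailable: no restrictions
    if 500 <= status_code < 600:
        return "disallow-all"        # file unreachable: assume full block
    return "undefined"


for code in (200, 301, 404, 503):
    print(code, robots_fetch_policy(code))
```

The asymmetry is the part worth remembering: a missing file opens the site, while a server error closes it.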
Source: RFC 9309, Section 2.3–2.5
Part 2: Directives and Syntax
4. Directives Reference
Everything in a robots.txt file is built from a handful of directives, and RFC 9309 nails down exactly how each one behaves.
User-agent
This line names the crawler that the rules underneath it apply to.
- User-agent: Googlebot
That targets Googlebot and nobody else. Matching is case-insensitive, so googlebot, Googlebot, and GOOGLEBOT all mean the same thing. The wildcard catches everyone:
- User-agent: *
The * group applies to any crawler that does not have its own, more specific group somewhere in the file. A crawler that does match a named group follows that group and ignores the * group entirely; the two are not combined. What RFC 9309 does merge is multiple groups that name the same crawler: their rules are read together as one group rather than one silently overriding the other.
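Group selection under RFC 9309 (section 2.2.1) can be sketched as follows: groups that name the crawler are merged, and the * group is used only when no named group matches. The function name `rules_for` and the pre-parsed `groups` structure are assumptions made for illustration, not a real parser.

```python
def rules_for(groups, product_token):
    """Return the rule lines that apply to one crawler.

    `groups` is a list of (user_agents, rules) pairs in file order.
    """
    token = product_token.lower()
    named, fallback = [], []
    matched_named = False
    for agents, rules in groups:
        lowered = [agent.lower() for agent in agents]
        if token in lowered:
            matched_named = True
            named.extend(rules)       # groups naming the crawler are merged
        elif "*" in lowered:
            fallback.extend(rules)    # used only if nothing names the crawler
    return named if matched_named else fallback


groups = [
    (["*"], ["Disallow: /search/"]),
    (["GPTBot"], ["Disallow: /"]),
    (["gptbot"], ["Allow: /public/"]),
]
print(rules_for(groups, "GPTBot"))
print(rules_for(groups, "Bingbot"))
```

In this sample GPTBot receives the two merged named-group rules and never sees `Disallow: /search/`, while Bingbot falls back to the * group.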
Disallow
Disallow blocks crawling of paths that match.
- Disallow: /search/
One quirk worth knowing: an empty Disallow: with nothing after it means allow everything. It is the same as having no restriction at all.
Allow
Allow carves an exception out of a broader Disallow, and the most specific match, measured by the length of the path in bytes, wins.
- Disallow: /wp-admin/
- Allow: /wp-admin/admin-ajax.php
That is the classic WordPress pattern: shut the admin area to crawlers, but keep open the one endpoint that rendering and various front-end features actually depend on.
Sitemap
Sitemap tells crawlers where your XML sitemaps live. It is not part of the core protocol, but it is so widely supported that you should always include it; RFC 9309 explicitly allows crawlers to read records it does not formally define.
- Sitemap: https://example.com/sitemap.xml
You can declare as many as you need:
- Sitemap: https://example.com/post-sitemap.xml
- Sitemap: https://example.com/news-sitemap.xml
How precedence actually works
When two rules fight, the most specific one wins, and specificity is measured by how many bytes are in the path pattern. If an Allow and a Disallow are exactly as specific as each other, Allow wins the tie.
- Disallow: /images/
- Allow: /images/public/
The result is that /images/public/ stays crawlable while everything else under /images/ does not.
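The precedence rule can be sketched in a few lines of Python. This version assumes plain path prefixes with no wildcards, and the function name `is_allowed` is hypothetical; specificity is the byte length of the pattern, and Allow wins an exact tie.

```python
def is_allowed(rules, path):
    """rules are (directive, path_prefix) pairs, e.g. ("disallow", "/images/")."""
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        length = len(prefix.encode("utf-8"))
        is_allow = directive.lower() == "allow"
        # Longer pattern wins; on an exact tie, Allow wins.
        if length > best_len or (length == best_len and is_allow):
            best_len = length
            allowed = is_allow
    return allowed


rules = [("disallow", "/images/"), ("allow", "/images/public/")]
print(is_allowed(rules, "/images/public/logo.png"))
print(is_allowed(rules, "/images/private/logo.png"))
```

Rule order in the file does not matter here; only the length of the matching pattern and the Allow tie-break do.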
What is not supported
Crawl-delay is the big one people assume works everywhere. It is not part of RFC 9309, and Google ignores it outright. Bing honors it, and Anthropic's documentation notes that ClaudeBot supports it as a non-standard extension, but treat it as crawler-specific and confirm the one you care about actually respects it before you rely on it. Older directives like Visit-time and Request-rate are not part of the standard either, and you should not build anything on them.
Source: RFC 9309, Section 2.2
5. Wildcards and Pattern Matching
RFC 9309 gives you exactly two special characters. Used carefully they are precise; used carelessly they are a foot-gun that can de-index a site.
The asterisk, *
It matches any sequence of characters, of any length, anywhere in the path.
- Disallow: /*?sort=
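That rule catches any URL in which sort is the first query parameter, whatever path comes before it:
- /shoes?sort=price
- /category/shoes?sort=newest
It does not catch URLs where sort appears later in the query string, because the pattern requires the literal characters ?sort=:
- /category/shoes?size=10&sort=newest
- /search?q=running+shoes&sort=rating
The * matches whatever sits between the leading / and the ?, but nothing turns the ? into an &. To cover both positions, pair the rule with Disallow: /*&sort=.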
The end marker, $
It pins the pattern to the end of the URL.
- Disallow: /*.pdf$
That matches /files/document.pdf but deliberately misses /files/document.pdf?download=true, because the URL does not end at .pdf. The $ lets you target a file type cleanly without accidentally catching URLs where those same characters happen to appear in the middle.
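To test a pattern before shipping it, the two special characters can be translated into a regular expression. This is a minimal sketch, not a full RFC 9309 matcher: the helper names are hypothetical, and percent-encoding normalization is ignored.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored_end = pattern.endswith("$")
    body = pattern[:-1] if anchored_end else pattern
    # Escape each literal piece, then rejoin the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored_end else ""))


def matches(pattern, url_path):
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/files/document.pdf"))
print(matches("/*.pdf$", "/files/document.pdf?download=true"))
print(matches("/*?sort=", "/shoes?sort=price"))
print(matches("/*?sort=", "/category/shoes?size=10&sort=newest"))
```

The last call is the one to watch: the literal `?sort=` in the pattern does not match a URL where the parameter arrives as `&sort=`.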
Patterns you will actually use
Block all session parameters
- Disallow: /*?session=
Block all sort parameters
- Disallow: /*?sort=
Block tracking parameters (verify no valuable pages use these)
- Disallow: /*?utm_source=
Block PDFs specifically
- Disallow: /*.pdf$
The overblocking trap
Wildcards are powerful, and that power runs in both directions. One careless pattern can wipe out thousands or millions of URLs in a single line. Before you ship a wildcard rule on a large site, pull a real sample of the URLs it would catch and confirm that none of them carry search or business value you cannot afford to lose.
The most dangerous line in the language is this:
- Disallow: /*
It blocks the entire site. It differs from Disallow: / only in the mechanics, and it deserves exactly the same fear.
Source: RFC 9309, Section 2.2.3
Part 3: Crawl Governance
6. Crawl Budget: The Definitive Guide
What crawl budget actually is
Google defines crawl budget as the set of URLs it both can and wants to crawl on your site.
Two separate forces decide it, and the important thing is that they do not add together. Both have to be satisfied.
- The first is crawl capacity: the most crawling Google's infrastructure will do without overloading your servers. Google works this out from your response times, your error rates, and your availability. Faster and more reliable servers earn more capacity.
- The second is crawl demand: how much Google actually wants to crawl you in the first place. Content popularity, how often you update, the size of your URL inventory, and staleness signals all feed into it.
The relationship between them is the part people miss. Google's documentation puts it directly: even if you never hit your capacity limit, low demand means less crawling. A site with plenty of capacity but little demand gets crawled below what its servers could handle. A site with high demand but constrained capacity never gets fully crawled.
You need both to be healthy, and optimizing one while ignoring the other gets you nowhere.
Who actually needs to think about this
Google frames its crawl budget material as an advanced guide, and it means it. The sites that genuinely need to care are large ones (a million or more pages changing at least weekly), fast-moving medium ones (ten thousand or more pages changing daily), and any site with a large pile of URLs that Search Console reports as "Discovered - currently not indexed."
If you run a local business site, a small blog, or a focused SaaS product, crawl budget is almost certainly not your problem, and time spent optimizing it is time wasted. Keep your sitemap current, glance at index coverage now and then, and move on to work that matters.
The one principle to govern by
Every URL should justify its crawl cost. When Google spends resources fetching a URL, that URL should pay them back with something discoverable: a ranking, a link, fresher content. URLs that consume crawl resources and return nothing are simply waste, and waste is what crawl governance exists to remove.
Where the waste usually hides
Internal search results are often the single largest source of waste on publisher and ecommerce sites. The combinations are effectively infinite, the pages carry little unique content, and they were never meant to rank in the first place.
- Disallow: /search/
- Disallow: /*?s=
- Disallow: /*?q=
Faceted navigation is the one Google calls out by name as a major challenge. A catalog with 10 sizes, 20 brands, and 15 colors generates 3,000 filter combinations per category before you even get to multi-select, and the vast majority of those combinations answer no real search demand.
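The arithmetic behind that figure is worth seeing once. The sketch below uses the facet counts from the example above; the second number is an added illustration that assumes each facet may also be left unset.

```python
from math import prod

facets = {"size": 10, "brand": 20, "color": 15}

# Exactly one value chosen from every facet.
all_three = prod(facets.values())

# Each facet may also be left unset; subtract the unfiltered page itself.
any_subset = prod(n + 1 for n in facets.values()) - 1

print(all_three)   # 3000
print(any_subset)  # 3695
```

Either way the count multiplies per category, which is why facet URLs outgrow the real catalog so quickly.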
Session and tracking parameters like ?session=12345 or ?utm_source=email spin up duplicate copies of pages that already exist, burning crawl on content that adds nothing to the index.
Sort parameters such as ?sort=price show the same products in a different order. Same content, no unique value, rarely worth a standalone slot in the index.
Deep archives and infinite pagination, the /page/500/ and the calendar that scrolls back to 2009, are classic crawl traps.
Duplicate content in general, whether through parameters, tracking, or variants that all resolve to the same thing, makes Google crawl the same page over and over.
AI crawl budget, the new wrinkle
AI crawlers now show up as measurable activity in the logs of large sites, and they do not behave like Googlebot. Googlebot crawls in service of one unified index. The AI crawlers represent several different systems with different jobs and different rhythms (training crawlers, search index crawlers, and retrieval crawlers), so it is worth tracking them separately in your logs. The rough groupings:
- Training crawlers: GPTBot, ClaudeBot
- Search crawlers: OAI-SearchBot, Claude-SearchBot, PerplexityBot
- Retrieval crawlers: ChatGPT-User, Claude-User, Perplexity-User
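One way to track these groups separately is to bucket log entries by user-agent token. This is a minimal sketch: the token list comes from the groupings above, while the sample header strings are shortened illustrations rather than real log lines, and substring matching on a header does nothing to detect spoofed user agents.

```python
from collections import Counter

# Groupings taken from the list above; extend as vendors add crawlers.
AI_CRAWLER_CATEGORY = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
}


def categorize(user_agent):
    """Return the category for a raw User-Agent header, or "other"."""
    ua = user_agent.lower()
    for token, category in AI_CRAWLER_CATEGORY.items():
        if token.lower() in ua:
            return category
    return "other"


# Illustrative header values, not copied from real logs.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; Perplexity-User/1.0)",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
]
counts = Counter(categorize(ua) for ua in sample)
print(counts)
```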
What Google explicitly recommends
Google's own best-practice list is worth following almost verbatim:
- Manage your URL inventory so Google knows what to crawl and what to skip.
- Consolidate duplicate content so crawl focuses on unique pages.
- Use robots.txt to block genuinely unimportant pages, not as a temporary lever.
- Return 404 or 410 for permanently removed pages. Blocked URLs linger in the crawl queue longer than removed ones.
- Keep sitemaps current with <lastmod>.
- Avoid long redirect chains.
- Improve page load performance, because faster pages support more efficient crawling.
One warning from Google deserves its own line. Do not use noindex as a crawl budget tactic. A noindex page still gets crawled so Google can read the tag and then drop it, which spends a crawl slot and indexes nothing. If you never want a page crawled at all, a robots.txt Disallow is the efficient choice.
Sources: Google Crawl Budget Documentation; Google Large Site Management
7. What to Block
Crawl restrictions exist to cut waste, not to throttle crawling for sport. The right targets are the URLs that spend resources and return nothing of value.
Internal search results
- Disallow: /search/
- Disallow: /*?s=
- Disallow: /*?q=
- Disallow: /site-search/
Session IDs
- Disallow: /*?session=
- Disallow: /*?sessionid=
- Disallow: /*?sid=
Sort parameters
- Disallow: /*?sort=
- Disallow: /*?orderby=
Cart and checkout
- Disallow: /cart/
- Disallow: /checkout/
- Disallow: /basket/
Account and authentication areas
- Disallow: /account/
- Disallow: /login/
- Disallow: /register/
- Disallow: /admin/
Preview and draft environments
- Disallow: /preview/
- Disallow: /draft/
Infinite URL spaces
- Disallow: /calendar/
- Disallow: /date/
- Disallow: /*?date=
On a staging server, and only on a staging server, you block everything:
- User-agent: *
- Disallow: /
Never let that block reach production. Shipping a staging restriction to the live site is the single most common cause of catastrophic crawl failures, and it has taken down sites you have heard of.
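A deploy pipeline can refuse to ship a file that blocks everything. The check below is a coarse sketch under stated assumptions: the function name `blocks_entire_site` is hypothetical, it only looks for a blanket Disallow inside the * group, and it does not evaluate Allow overrides.

```python
def blocks_entire_site(robots_txt):
    """True if the '*' group contains 'Disallow: /' or 'Disallow: /*'."""
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value in ("/", "/*") and "*" in current_agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /search/\nSitemap: https://example.com/sitemap.xml\n"
print(blocks_entire_site(staging), blocks_entire_site(production))
```

Run against the production artifact, a True result would be a reason to fail the build rather than a warning to read later.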
Sources: Google Crawl Budget Documentation; RFC 9309
8. What Never to Block
Over-blocking does as much damage as under-blocking, and a few categories of URL simply have to stay open or the site will not perform in modern search.
CSS and JavaScript. Do not block them.
Never do this
- Disallow: /css/
- Disallow: /js/
- Disallow: /static/
- Disallow: /assets/
Google renders your pages. To read the layout and visual hierarchy it needs the CSS, and to run your navigation, structured data, and dynamic content it needs the JavaScript. Block those resources and Google misreads the page, misses your schema, and indexes a half-built version of what your visitors actually see.
Images. They feed Google Images, Google Discover (which is hungry for good imagery), Google News, and your product and article structured data. Unless you have a concrete business reason to suppress image indexing, leave them crawlable.
Core content. Your blog, articles, news, products, categories, guides, and resources are the whole point of the site. Never block any of them without rigorous analysis first.
- /blog/
- /articles/
- /news/
- /products/
- /categories/
- /guides/
- /resources/
Framework and rendering assets. Next.js, Nuxt, and their peers serve the chunks a page needs to render from predictable paths, and blocking one of those paths can knock out Google's ability to render the entire site.
Never block without testing
- Disallow: /_next/
- Disallow: /_nuxt/
- Disallow: /static/chunks/
Sources: Google JavaScript SEO Documentation; Google Rendering Documentation
9. XML Sitemaps and Robots.txt
Robots.txt and sitemaps are two halves of the same conversation. Robots.txt says "do not waste your time here." The sitemap says "spend your time here." Together they are how you steer crawl attention.
Declaring sitemaps
- Sitemap: https://example.com/sitemap.xml
Declare them in robots.txt as a matter of habit. Google can find sitemaps through Search Console or by crawling, but a declaration in the file means any compliant crawler that reads it learns where your sitemaps are, Search Console account or not.
- Sitemap: https://example.com/post-sitemap.xml
- Sitemap: https://example.com/page-sitemap.xml
- Sitemap: https://example.com/news-sitemap.xml
For very large sites, point a single Sitemap line at a sitemap index file instead of listing every sitemap individually.
What belongs in a sitemap, and what does not
A sitemap should contain only URLs that should rank, are canonically correct, return a 200, and are not noindexed. Keep out anything that redirects, anything noindexed, anything returning a 404 or other error, and anything blocked by robots.txt. A sitemap stuffed with blocked URLs, redirects, and dead ends sends mixed signals and burns crawl on nothing.
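Those inclusion rules translate directly into a filter. This is a minimal sketch assuming a hypothetical `Page` record that a crawl or CMS export could supply; the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int       # final HTTP status when fetched
    canonical: str    # canonical URL the page declares
    noindex: bool     # meta robots or X-Robots-Tag noindex present
    blocked: bool     # disallowed by robots.txt


def sitemap_eligible(page):
    """Keep only pages that return 200, are self-canonical, indexable, and crawlable."""
    return (
        page.status == 200
        and page.canonical == page.url
        and not page.noindex
        and not page.blocked
    )


pages = [
    Page("https://example.com/guide", 200, "https://example.com/guide", False, False),
    Page("https://example.com/old", 301, "https://example.com/guide", False, False),
    Page("https://example.com/cart/", 200, "https://example.com/cart/", False, True),
]
sitemap_urls = [p.url for p in pages if sitemap_eligible(p)]
print(sitemap_urls)
```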
A note for publishers
Keep a dedicated news sitemap for Google News eligibility. It should hold only content from the last 48 hours and should update continuously as new pieces go live, not regenerate on some lazy schedule.
Sources: Google XML Sitemap Documentation; Sitemaps Protocol
10. Robots.txt and Rendering
Search engines do more than read HTML now. They render pages, running the JavaScript, loading the CSS, processing everything the way a browser would. That changes the whole relationship between what you block and what you are visible for.
- Google's pipeline runs in four steps: crawl the URL, parse the HTML, render by executing JavaScript and applying CSS, then index what it understands. Block any resource that the parse or render steps depend on, and you can quietly break indexing even though the URL itself was crawled without a hitch.
- The failure looks like this. A page brings in its main content through JavaScript. The rendering resources are blocked. Google sees an empty page. The structured data that JavaScript was going to inject never gets processed, the navigation links never get discovered, and the internal linking signals vanish. It hits hardest on React single-page apps, on Next.js apps rendering client-side, and on any site that depends on JavaScript for its breadcrumbs or schema.
- A few framework specifics worth knowing: Next.js keeps its static assets and hydration chunks under /_next/, and blocking that directory can break rendering across the whole site. Nuxt uses similar asset paths, so test rendering after any change. Angular's compiled assets are no different; block them and full rendering breaks.
- The general rule is simple enough to keep in your head: before you change any robots.txt rule that touches an asset directory, test the rendered output with Google's URL Inspection tool in Search Console, both before and after. If a browser needs a file to render the page, so does Google.
Sources: Google JavaScript SEO Documentation; Google Rendering Documentation
Part 4: AI Governance
11. The Complete AI Crawler Reference (2026)
For most of robots.txt's history, the crawlers it managed were search engines, and they all wanted roughly the same thing. Then 2023 arrived and a new category showed up that did not fit the old mental model at all. Any site of meaningful size now has to understand these crawlers, and the first thing to understand is the most important: not all AI crawlers are the same. They serve different systems and different purposes, and each one is a separate business decision.
They sort into three categories, and the rest of this section hangs off that taxonomy.
- Training crawlers collect publicly accessible content that may go into improving AI models. Block them and your content stops contributing to future training datasets.
- Search crawlers build the indexes that power AI search experiences. Block them and you lose visibility in AI-powered search results.
- Retrieval crawlers fetch a page in response to a specific user request. They behave differently from autonomous crawlers because a person asked for something in real time, and for that reason several of them are documented as not consistently honoring robots.txt.
OpenAI
GPTBot is the training crawler. Its user-agent is GPTBot, it respects Disallow directives, and OpenAI publishes its IP ranges at openai.com/gptbot.json. To keep your content out of training:
- User-agent: GPTBot
- Disallow: /
OAI-SearchBot is the search crawler behind ChatGPT Search. Its user-agent is OAI-SearchBot, it honors robots.txt, and its IPs live at openai.com/searchbot.json. Block it and you may lose visibility in ChatGPT Search results.
ChatGPT-User is the retrieval crawler, the one that visits a page when a user asks ChatGPT a question. OpenAI describes it as not used for automatic crawling, and notes that because these actions are user-initiated, robots.txt rules may not apply to it. Its IPs are at openai.com/chatgpt-user.json.
There is also OAI-AdsBot, which validates the safety of pages submitted as ads on ChatGPT. It is not a content-discovery crawler and sits outside the training, search, and retrieval framing above; include it in policy only if you run ads through ChatGPT.
Anthropic
Source: Anthropic official documentation
ClaudeBot is the training crawler. Its user-agent is ClaudeBot, and Anthropic states that its bots respect "do not crawl" signals by honoring industry-standard robots.txt directives. It also supports the non-standard Crawl-delay extension. To block training:
- User-agent: ClaudeBot
- Disallow: /
Claude-User is the retrieval crawler, fetching content when a Claude user asks for information from a page. Its user-agent is Claude-User, and Anthropic notes that disabling it stops the system from retrieving your content in response to a user query, which can reduce your visibility for user-directed web search. Anthropic documents it as honoring robots.txt even though its fetches are user-initiated.
Claude-SearchBot is the search crawler that navigates the web to improve search quality and relevance for Claude users. Its user-agent is Claude-SearchBot, it honors robots.txt, and disabling it stops your content from being indexed for search optimization, which can reduce your visibility and accuracy in Claude's search results.
All three Anthropic crawlers are confirmed in official Anthropic documentation as of April 2026.
Perplexity
Source: Perplexity official documentation
PerplexityBot is the search crawler, designed to surface and link websites in Perplexity's search results. Its user-agent is PerplexityBot, it honors robots.txt, and Perplexity is explicit that it is not used to crawl content for AI foundation models. IPs are at perplexity.com/perplexitybot.json.
Perplexity-User is the retrieval crawler that visits pages to help answer a user's question. Its user-agent is Perplexity-User, and Perplexity states plainly that because a user requested the fetch, this fetcher generally ignores robots.txt rules. It is also not used for crawling or training. IPs are at perplexity.com/perplexity-user.json.
One correction worth stating loudly, because the mistake is everywhere: Perplexity does not operate a training crawler. Any governance framework that files Perplexity under "Training" is simply wrong.
The traditional search crawlers, for reference
Googlebot is covered by Google's own crawler documentation and verification guidance. Bingbot is covered by Bing Webmaster Tools and, unlike Googlebot, honors Crawl-delay.
The retrieval-crawler nuance that matters
Both Perplexity and OpenAI have documented that their retrieval crawlers, Perplexity-User and ChatGPT-User, may not consistently honor robots.txt, because they run in response to user requests rather than as autonomous crawlers. Anthropic's Claude-User is documented as honoring robots.txt while acknowledging the same user-initiated context.
The practical takeaway is that robots.txt reliably governs autonomous crawl behavior but is not a dependable lever on retrieval crawlers. If you genuinely need to control retrieval access, you reach for bot management infrastructure, WAF rules and IP allowlists built from the official JSON sources, alongside or in place of robots.txt.
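Verifying a request against published ranges can be sketched with Python's standard `ipaddress` module. The JSON shape assumed here (a `prefixes` list with `ipv4Prefix` or `ipv6Prefix` keys) and the function names are assumptions to confirm against each vendor's actual file; the sample ranges are reserved documentation addresses, not real crawler IPs.

```python
import ipaddress


def load_networks(feed):
    """Collect CIDR blocks from a published IP-range document (assumed shape)."""
    networks = []
    for entry in feed.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(client_ip, networks):
    """True if the client address falls inside any published range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


# Placeholder data using documentation ranges (RFC 5737 / RFC 3849).
feed = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
networks = load_networks(feed)
print(is_verified("192.0.2.10", networks), is_verified("198.51.100.7", networks))
```

A WAF rule would apply the same test at the edge: a request claiming a crawler's user-agent from an address outside the published ranges is not that crawler.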
Matching the policy to the goal
| Objective | Training Crawlers | Search Crawlers | Retrieval Crawlers |
|---|---|---|---|
| Maximum visibility | Allow | Allow | Allow |
| Visibility without training | Block | Allow | Allow |
| Search only | Block | Allow | Block |
| Maximum restriction | Block | Block | Block |
A template for the most common stance, visibility without contributing to training:
Training: block if you do not want content used in model training
- User-agent: GPTBot
- Disallow: /
- User-agent: ClaudeBot
- Disallow: /
Search: allow for AI search visibility
- User-agent: OAI-SearchBot
- Allow: /
- User-agent: Claude-SearchBot
- Allow: /
- User-agent: PerplexityBot
- Allow: /
Retrieval: allow for AI assistant access
- User-agent: ChatGPT-User
- Allow: /
- User-agent: Claude-User
- Allow: /
- User-agent: Perplexity-User
- Allow: /
Standard search governance
- User-agent: *
- Disallow: /search/
- Disallow: /cart/
- Disallow: /checkout/
- Sitemap: https://example.com/sitemap.xml
Last verified June 2026. This ecosystem moves fast, so review this section quarterly and check it against the official documentation each time.
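Because the crawler list changes, some teams may prefer to generate these groups from a policy rather than edit them by hand. This is a minimal sketch: the crawler names come from this section, while the `render_policy` function and the policy dictionary are hypothetical.

```python
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "retrieval": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def render_policy(policy):
    """Render one group per crawler; policy maps category to "allow" or "block"."""
    lines = []
    for category, agents in CRAWLERS.items():
        rule = "Allow: /" if policy[category] == "allow" else "Disallow: /"
        for agent in agents:
            lines += [f"User-agent: {agent}", rule, ""]
    return "\n".join(lines).rstrip() + "\n"


text = render_policy({"training": "block", "search": "allow", "retrieval": "allow"})
print(text)
```

One design point to keep in mind: under RFC 9309 a crawler with its own named group does not fall back to the * group, so any shared rule meant to apply to these crawlers, such as Disallow: /search/, would need to be repeated inside their groups.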
12. AI Visibility Framework
AI governance and AI visibility are easy to confuse and worth keeping apart. Governance asks what access AI systems should have to your content. Visibility asks how present you actually are inside those systems. They are not the same lever, and they can move independently: a site can block every training crawler and still get cited from content crawled earlier, and a site can allow every crawler and never once surface in an AI answer.
Related reading: Governance decides whether AI systems can reach you. What earns a citation once they do is a separate discipline. See AI Visibility: What Gets Content Cited.
It helps to think of visibility as four stacked layers:
- Discovery. Can AI systems find your content at all? This rests on an accessible robots.txt, crawlable URLs, healthy sitemaps, and internal linking.
- Retrieval. Can they fetch it when a user asks? This needs accessible content (no paywalls blocking retrieval crawlers), fast responses, and retrieval-crawler access.
- Citation. Will they quote you in an answer? This turns on authority signals, clarity, entity recognition, and structured data. A site can be retrieved constantly and never cited.
- Recommendation. Will a system actively recommend your product? The top of the stack, driven by brand authority, consistent mentions across the web, and third-party references.
The most common strategic mistake is to treat visibility as a single on-off switch when the useful question is which layer the gap sits at. A publisher with strong content but restricted retrieval crawlers scores well on discovery and badly on retrieval. An ecommerce brand with great products but thin external mentions scores well on retrieval and badly on recommendation. Diagnose the layer before you prescribe the fix.
Aligning governance with visibility goals looks like this:
Goal | Training | Search | Retrieval |
|---|---|---|---|
Maximum visibility | Allow | Allow | Allow |
Visibility without training contribution | Block | Allow | Allow |
Search visibility, controlled retrieval | Block | Allow | Block* |
Maximum restriction | Block | Block | Block |
*Blocking retrieval crawlers through robots.txt may not be fully reliable, given the user-initiated behavior Perplexity and OpenAI both document.
The emerging publisher position has settled into a recognizable shape: block training to protect a content-licensing negotiating position, allow search to keep AI search visibility and referral potential, and allow retrieval to support user-initiated access. This stance keeps visibility intact while holding leverage on the training-data question.
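The goal-to-policy table above maps cleanly onto a small generator. The sketch below is one way to express it in Python. The dictionary keys are hypothetical labels for the four table rows, and the crawler lists are the ones named in this guide. It emits only the AI crawler groups. Per the footnote above, a `Disallow` aimed at retrieval crawlers may not be honored, so the generated file is a statement of policy rather than enforcement.

```python
# Crawler tokens per category, as listed in this guide.
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "retrieval": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

# Hypothetical keys, one per row of the goal table.
POLICIES = {
    "maximum_visibility": {"training": "allow", "search": "allow", "retrieval": "allow"},
    "visibility_without_training": {"training": "block", "search": "allow", "retrieval": "allow"},
    "search_only": {"training": "block", "search": "allow", "retrieval": "block"},
    "maximum_restriction": {"training": "block", "search": "block", "retrieval": "block"},
}

def build_ai_groups(objective):
    """Return robots.txt text with one group per crawler category."""
    policy = POLICIES[objective]
    lines = []
    for category, agents in CRAWLERS.items():
        decision = policy[category]
        lines.append(f"# {category}: {decision}")
        # Several User-agent lines may share a single rule set.
        lines.extend(f"User-agent: {agent}" for agent in agents)
        lines.append("Disallow: /" if decision == "block" else "Allow: /")
        lines.append("")  # blank line between groups for readability
    return "\n".join(lines)

print(build_ai_groups("visibility_without_training"))
```

Generating the groups from a single policy table keeps the file and the documented decision from drifting apart, which is the failure this section warns about.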
13. AI Governance Adoption Model
Most organizations today sit in one of two places: no AI policy at all, or a blanket block on everything. Neither one is governance. The model below traces the path from not knowing to governing on purpose.
Level | Description |
|---|---|
0: No awareness | No policy, no owner, no monitoring. The organization does not know which AI systems are accessing its content or how often. |
1: Reactive | A snap decision, usually a blanket block on GPTBot. Often incomplete and undocumented. |
2: Basic | Training, search, and retrieval crawlers identified as distinct categories, with an intentional decision for each. |
3: Documented | Written policies per category, ownership assigned, a review schedule in place. |
4: Operational | Monitoring live. Server logs reviewed for AI crawler activity. Policy reviews actually happening. |
5: Strategic | AI policy integrated with content, SEO, and licensing strategy. Business objectives drive it. |
6: Enterprise | Cross-functional ownership across SEO, legal, editorial, engineering, and product. |
7: Adaptive | Continuous monitoring and refinement, with quarterly reviews built into the operating rhythm. |
Wherever you sit, you should be able to answer five questions without pausing: Can AI models train on our content? Can AI search engines index it? Can AI assistants retrieve it for users? Are these policies documented and owned? When were they last reviewed?
The patterns by industry are fairly consistent:
- Publishers increasingly block training while allowing search and retrieval. The New York Times restricting AI training access in 2023 set off a wave of publisher policy development.
- Ecommerce tends to allow search and retrieval to fuel AI shopping discovery, with training decisions varying, as AI product recommendations become a competitive priority.
- SaaS companies often allow everything, since visibility in AI tool recommendations drives growth.
- Regulated industries tend to restrict across the board, driven by compliance rather than strategy.
Part 5: Decision Frameworks
14. The Robots.txt Decision Framework
The most useful thing to understand about robots.txt errors is that most of them are not syntax errors. They are tool-selection errors: reaching for robots.txt when the job actually needed noindex, or a canonical, or authentication. The framework below is built to catch that before you write a single directive.
Run the pre-filter first. Is this content sensitive or behind authentication? If yes (admin areas, customer portals, internal dashboards, private data), stop here and use authentication and access controls. Robots.txt is the wrong tool, full stop. If no, continue.
Then work through crawl and index, in order.
Should crawlers access this URL at all? If no (search pages, checkout, session URLs, infinite filter combinations, preview environments, crawl traps), use a robots.txt Disallow and you are done. If yes, keep going.
Should it appear in search results? If no (thank-you pages, campaign landing pages, internal tools, thin content not meant for search), allow the crawl and apply noindex. Do not use robots.txt for this, because a blocked page never reveals its noindex tag and can be indexed from external links anyway. If yes, keep going.
Is it a duplicate of another URL? Only ask this once you have confirmed the URL is both crawlable and indexable, since canonicals on blocked pages are invisible. If it is a duplicate (product variants, parameter URLs, tracking duplicates), use a canonical tag pointing at the preferred URL. If not, keep going.
Has the content permanently moved? If yes (migrations, restructures, consolidations, deletions with a good replacement), use a 301. If no, the URL is crawlable, indexable, canonical, and current, and you move on to the AI layer.
Finally, the AI decisions, which apply to all content, even content already blocked for traditional search.
Training access: do you want AI models training on this? Allow needs no directive; block uses a crawler-specific Disallow: / on GPTBot and ClaudeBot; conditional blocks by path rather than sitewide.
AI search access: do you want to appear in AI search? Allow needs no directive; block uses a crawler-specific Disallow on OAI-SearchBot, Claude-SearchBot, and PerplexityBot; conditional grants selective path access.
Retrieval access: do you want AI assistants fetching content for users? Allow needs no directive and is the most common choice; block uses a crawler-specific Disallow but may not be fully reliable; conditional pairs robots.txt with WAF rules. For anything you actually need enforced on retrieval crawlers, remember that Perplexity-User generally ignores robots.txt and ChatGPT-User may too, so build WAF-level rules from the official IP ranges rather than trusting the file alone.
The tool-selection cheat sheet, one more time, because it is the heart of the whole thing:
Goal | Correct Tool |
|---|---|
Prevent crawling | robots.txt Disallow |
Prevent indexing | noindex meta tag or X-Robots-Tag |
Consolidate duplicates | Canonical tag |
Redirect content | 301 redirect |
Secure content | Authentication + access controls |
Block AI training | robots.txt (training crawler user-agents) |
Block AI search | robots.txt (search crawler user-agents) |
Control AI retrieval | robots.txt + WAF for reliable enforcement |
License content commercially | Legal agreements (robots.txt is not a contract) |
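The traditional-search half of the framework is an ordered series of questions, so it can be written as a function that returns on the first decisive answer. The sketch below is an illustration in Python, with a hypothetical function name and parameter names. The order of the checks is the part that matters, since it mirrors the pre-filter, crawl, index, duplicate, and redirect sequence described in this section.

```python
def choose_tool(*, sensitive, should_crawl, should_index,
                is_duplicate, permanently_moved):
    """Return the tool the decision framework points to for one URL class."""
    if sensitive:
        # Pre-filter: robots.txt is never the answer for private content.
        return "authentication and access controls"
    if not should_crawl:
        return "robots.txt Disallow"
    if not should_index:
        # The crawl must stay open so the noindex can be read.
        return "noindex (meta robots or X-Robots-Tag)"
    if is_duplicate:
        return "canonical tag"
    if permanently_moved:
        return "301 redirect"
    return "no restriction needed"

# Internal search results: not sensitive, but a crawl trap.
print(choose_tool(sensitive=False, should_crawl=False, should_index=False,
                  is_duplicate=False, permanently_moved=False))

# Thank-you page: crawlable, but should not appear in search.
print(choose_tool(sensitive=False, should_crawl=True, should_index=False,
                  is_duplicate=False, permanently_moved=False))
```

Because the function returns early, a sensitive URL never reaches the robots.txt branch at all, which is exactly the tool-selection error the framework is designed to prevent.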
15. URL Governance Worksheet
Govern by URL class, not by individual page. Run each pattern through the worksheet below once, write down the decisions, and you have something auditable that survives staff turnover and CMS changes.
Field | Value |
|---|---|
URL pattern | |
Business owner | |
SEO owner | |
Engineering owner | |
Purpose | |
Business value | High / Medium / Low |
Should be crawled? | Yes / No |
Should be indexed? | Yes / No |
Duplicate version exists? | Yes / No |
Canonical target | |
Sensitive / auth required? | Yes / No |
AI training allowed? | Allow / Block / Conditional |
AI search allowed? | Allow / Block / Conditional |
AI retrieval allowed? | Allow / Block / Conditional |
Conditions / notes | |
Sitemap inclusion? | Yes / No |
Last reviewed | |
Last change and reason | |
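If the worksheet lives in code or a spreadsheet export rather than a document, a few of its fields can check each other. The sketch below models a subset of the fields as a Python dataclass and flags three contradictions this guide describes elsewhere: a duplicate with no canonical target, a canonical on a URL class that is blocked from crawling, and a sitemap entry for something not meant to be indexed. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UrlClassPolicy:
    url_pattern: str
    should_crawl: bool
    should_index: bool
    duplicate_exists: bool = False
    canonical_target: str = ""
    in_sitemap: bool = False

    def problems(self) -> list[str]:
        """Return human-readable contradictions between worksheet fields."""
        found = []
        if self.duplicate_exists and not self.canonical_target:
            found.append("duplicate exists but no canonical target is recorded")
        if self.canonical_target and not self.should_crawl:
            found.append("canonical set on a blocked URL class; crawlers cannot see it")
        if self.in_sitemap and not self.should_index:
            found.append("listed in the sitemap but not meant to be indexed")
        return found

# Hypothetical rows for illustration.
facet = UrlClassPolicy("/shoes?color=*", should_crawl=False, should_index=False,
                       duplicate_exists=True, canonical_target="/shoes")
article = UrlClassPolicy("/articles/*", should_crawl=True, should_index=True,
                         in_sitemap=True)

print(facet.problems())
print(article.problems())
```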
A filled-in version for a publisher:
URL Type | Crawl | Index | AI Retrieval | Notes |
|---|---|---|---|---|
Article | Yes | Yes | Allow | Core asset |
News article | Yes | Yes | Allow | News sitemap required |
Category hub | Yes | Yes | Allow | |
Topic hub | Yes | Usually yes | Allow | Evaluate by demand |
Author page | Depends | Depends | Depends | Evaluate by traffic |
Tag page | Depends | Depends | Depends | Thin tags get noindex |
Search page | No | No | Block | Crawl trap |
Preview URL | No | No | Block | |
Archive (recent) | Yes | Yes | Allow | |
Archive (deep) | Depends | Depends | Allow | Evaluate crawl cost |
And for ecommerce:
URL Type | Crawl | Index | AI Retrieval | Notes |
|---|---|---|---|---|
Product page | Yes | Yes | Allow | Primary asset |
Category page | Yes | Yes | Allow | Primary asset |
High-value facet | Yes | Yes | Allow | Verify demand |
Low-value facet | Usually no | Usually no | Allow | Assess individually |
Sort URL | Usually no | Usually no | Allow | |
Search results | No | No | Block | Crawl trap |
Cart | No | No | Block | No SEO value |
Checkout | No | No | Block | No SEO value |
Account | No | No | Block | Auth required |
A template for recording the AI access decision per crawler:
Access Type | Policy | Notes |
|---|---|---|
Googlebot | | |
Bingbot | | |
GPTBot (training) | Allow / Block / Conditional | |
ClaudeBot (training) | Allow / Block / Conditional | |
OAI-SearchBot (search) | Allow / Block / Conditional | |
Claude-SearchBot (search) | Allow / Block / Conditional | |
PerplexityBot (search) | Allow / Block / Conditional | Not a training crawler |
ChatGPT-User (retrieval) | Allow / Block / Conditional | May not honor robots.txt |
Claude-User (retrieval) | Allow / Block / Conditional | |
Perplexity-User (retrieval) | Allow / Block / Conditional | Generally ignores robots.txt |
16. URL Lifecycle Governance
A URL's value is not fixed. The page that deserves top crawl priority today may deserve none in two years, and mature organizations govern that arc deliberately rather than letting it drift.
A URL moves through roughly six stages:
- Creation. The goal is fast discovery and indexing. Add it to the sitemap immediately, give it internal links, and for publishers, the news sitemap.
- Growth. As it gains traffic and authority, watch crawl frequency in Search Console and keep rendering resources accessible.
- Maturity. Maintain visibility and update <lastmod> when the content refreshes.
- Decline. When traffic fades, decide whether to refresh, consolidate, or plan retirement. Does it still earn links and serve intent?
- Archive. Reference value without active promotion, usually still crawlable and indexable. Archived authoritative content is frequently cited by AI systems, a quietly valuable asset teams underrate.
- Retirement. When value is gone: 301 to a good replacement, 410 when there is none, a kept redirect to preserve authority, or maintain the page if legal retention requires it.
URL State | Crawl | Index | Sitemap | Crawl Priority |
|---|---|---|---|---|
New | Yes | Yes | Yes | High |
Growing | Yes | Yes | Yes | High |
Mature | Yes | Yes | Yes | Moderate |
Declining | Depends | Depends | Depends | Lower |
Archived | Usually yes | Often yes | Sometimes | Low |
Retired | Usually no | Usually no | No | None |
The reason to govern by class rather than by individual URL is simply that the by-URL approach does not survive scale. The most mature organizations write lifecycle policies for whole classes (articles, products, categories, events, campaigns, documentation) and apply them consistently, which is what keeps governance auditable as the site grows.
Part 6: Operations
17. Robots.txt During Site Migrations
Migrations have caused some of the worst SEO disasters on record, and the culprit is surprisingly often not a redirect chain or a canonical mistake. It is a single robots.txt line that shipped at launch.
The line is this one:
- User-agent: *
- Disallow: /
On a staging environment it is exactly right. Ride it into production by accident and it is catastrophic: Googlebot is told to stop crawling, discovery halts, content refresh stops, and recovery can take weeks. Staging restrictions reaching production is the single most common source of robots-related traffic disasters, which is why it earns its own line on every checklist below.
- Before launch, review the production robots.txt explicitly rather than assuming it is correct, confirm staging is restricted and protected from crawling, review every noindex directive for intended versus accidental, review canonicals sitewide, verify redirect chains are complete, and confirm sitemap declarations point at correct paths in the new environment.
- On launch day, verify example.com/robots.txt returns a 200, verify it contains no accidental Disallow: /, verify sitemap URLs are reachable, verify no asset directories are blocked, and submit the updated sitemap to Search Console.
- For at least 30 days after, monitor Crawl Stats daily, watch the Index Coverage report, review server logs for Googlebot activity, and watch for any drop in crawl rate.
Source: Google Site Migration Documentation
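The launch-day check for an accidental `Disallow: /` is easy to automate as a deploy gate. The sketch below is a simplified Python parser written for this one purpose, with hypothetical function names. It reports which user agents sit in a group containing a bare `Disallow: /`, and it deliberately ignores `Allow` lines, so a group that blocks everything and then re-opens a few paths is still flagged for a human to review.

```python
def sitewide_block_agents(robots_txt: str) -> list[str]:
    """Return user agents whose group contains a bare 'Disallow: /'."""
    blocked = []
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked

def blocks_all_crawlers(robots_txt: str) -> bool:
    """True if the wildcard group is blocked sitewide (the staging pattern)."""
    return "*" in sitewide_block_agents(robots_txt)

STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /search/\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)

print(blocks_all_crawlers(STAGING))     # staging file
print(blocks_all_crawlers(PRODUCTION))  # production file
```

A deliberate block on a training crawler such as GPTBot does not trip the gate. Only a sitewide block on `*` does, which is the pattern that causes the migration disasters described above.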
18. Auditing Framework
Most robots.txt files get written once and then forgotten, while the world around them keeps moving: migrations bolt on directives, CMS updates reshape URLs, new URL types appear, fresh AI crawler categories arrive. Skip regular auditing and governance quietly drifts out of alignment with reality.
How often depends on the site. Small sites can audit quarterly. Publishers and ecommerce sites should audit monthly. Enterprises should audit monthly and again after any significant change. After a migration, audit immediately and again 30 days post-launch.
The checklist itself:
Foundation
- robots.txt exists at the root
- Returns HTTP 200
- UTF-8 encoded
- Under 500 KiB
Crawl governance
- Search URLs reviewed
- Filter URLs reviewed
- Session parameters reviewed
- Preview URLs reviewed
- Archive pagination reviewed
Assets
- CSS directories accessible
- JavaScript directories accessible
- Images accessible
- Framework assets (/_next/, etc.) accessible
Sitemaps
- All sitemaps declared
- All declared sitemaps return 200
- No redirected URLs in sitemaps
- No noindexed URLs in sitemaps
- <lastmod> dates current
AI governance
- Training crawler policy documented
- Search crawler policy documented
- Retrieval crawler policy documented
- Policy matches business intent
Subdomains
- All subdomains have robots.txt
- All subdomain files reviewed
Monitoring
- Crawl Stats reviewed in Search Console
- Server logs reviewed for crawler activity
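The Foundation items on this checklist are mechanical enough to script. The sketch below checks an already-fetched response in Python, so it runs without network access. How you fetch the file is up to you, and the function name is hypothetical. The size limit is the 500 KiB figure from the checklist, expressed in bytes.

```python
MAX_BYTES = 500 * 1024  # 500 KiB

def foundation_problems(status_code: int, body: bytes) -> list[str]:
    """Check HTTP status, size, and encoding of a fetched robots.txt."""
    problems = []
    if status_code != 200:
        problems.append(f"expected HTTP 200, got {status_code}")
    if len(body) > MAX_BYTES:
        problems.append(f"file is {len(body)} bytes, over the {MAX_BYTES}-byte limit")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems

print(foundation_problems(200, b"User-agent: *\nDisallow: /search/\n"))
print(foundation_problems(404, b""))
```

An empty list means the foundation checks pass. The remaining checklist sections (crawl governance, assets, AI policy) still need human judgment.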
19. Testing Framework
Test every robots.txt change before it ships. There are no exceptions worth making, because the failure mode is invisible until traffic drops.
Work through it in layers:
- Syntax. Confirm the file parses. Watch for the usual culprits: a misspelled Disalow:, a missing leading slash (Disallow: search/ instead of Disallow: /search/), or rules placed before any User-agent line, which RFC 9309 says should be ignored.
- URL behavior. For every significant category, confirm the important pages are crawlable, the search and preview pages are blocked, and the asset directories are open.
- Patterns. For any wildcard rule, write out representative URLs and check the rule matches what you actually meant.
Test this rule:
- Disallow: /*?sort=
Against these URLs:
- /shoes?sort=price -> should be blocked
- /shoes?size=10 -> should NOT be blocked
- /shoes?sort=price&size=10 -> should be blocked
- /articles/sorting-algorithms -> should NOT be blocked
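The pattern test above can be run as code instead of by eye. The sketch below translates a robots.txt path pattern into a regular expression in Python, treating `*` as "zero or more of any character" and a trailing `$` as an end anchor, then checks the four URLs. It is a simplified matcher written for illustration (it skips percent-encoding normalization), and the function names are hypothetical.

```python
import re

def compile_rule(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def rule_matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path from its first character."""
    return compile_rule(pattern).match(path) is not None

RULE = "/*?sort="
CASES = [
    ("/shoes?sort=price", True),
    ("/shoes?size=10", False),
    ("/shoes?sort=price&size=10", True),
    ("/articles/sorting-algorithms", False),
]

for path, expected in CASES:
    actual = rule_matches(RULE, path)
    print(f"{path:32} blocked={actual} expected={expected}")
```

Running representative URLs this way tends to surface cases the original list missed. For example, `/shoes?size=10&sort=price` is not matched by `/*?sort=`, because the parameter there is preceded by `&` rather than `?`. If that URL should be blocked too, the rule set needs a second pattern.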
- Production verification. Right after deploy, fetch the live file and read it, test representative URLs with the URL Inspection tool, and watch Crawl Stats for any unexpected drop.
- Environments. Confirm production is crawlable where intended, staging is blocked sitewide, and every subdomain is verified on its own.
20. Log File Analysis
Search Console shows you what Google reports. Server logs show you what actually happened, and the two are not always the same. Every request writes a line (which crawler, when, what response code, how long the server took), and no other tool gives you crawl reality at that resolution.
The questions worth asking the logs are simple: which crawlers visit, which URLs they crawl, how often, what response codes they get back, and whether crawl resources are landing on your business priorities or wandering off into waste.
What to watch depends on the crawler:
- Search crawlers (Googlebot, Bingbot): crawl frequency by URL category, response code distribution, your most-crawled URLs (are these your best pages?), and your least-crawled ones (are these pages you care about?).
- AI training crawlers (GPTBot, ClaudeBot): crawl volume and frequency, which sections they target, and whether your policies are being respected.
- AI search crawlers: discovery patterns and which sections get attention.
- AI retrieval crawlers: what content is fetched and how often, since that tells you what is being surfaced to AI users in real time.
Response codes deserve their own quick reference:
Code | Meaning | Action |
|---|---|---|
200 | Page crawled successfully | Monitor the distribution |
301 | Redirect | Minimize chains |
404 | Not found | Return 410 for permanent removals |
410 | Gone, permanently removed | Stronger stop signal than 404 |
500 | Server error | Investigate immediately; 5xx makes crawlers back off |
Sources: Google Crawl Budget Documentation; Google Crawl Stats Documentation
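A first pass at these questions needs nothing more than a script that buckets log lines by crawler and response code. The Python sketch below assumes the common "combined" access log format, so adjust the regular expression if your server logs differently. The sample lines, IP addresses, and user-agent strings are invented for illustration. Matching on the user-agent string alone can be fooled by spoofing, so treat the counts as a starting point and verify anything important against official IP ranges.

```python
import re
from collections import Counter

# Crawler tokens and the categories this guide assigns to them.
CRAWLER_CATEGORIES = {
    "Googlebot": "search engine",
    "Bingbot": "search engine",
    "GPTBot": "ai training",
    "ClaudeBot": "ai training",
    "OAI-SearchBot": "ai search",
    "Claude-SearchBot": "ai search",
    "PerplexityBot": "ai search",
    "ChatGPT-User": "ai retrieval",
    "Claude-User": "ai retrieval",
    "Perplexity-User": "ai retrieval",
}

# Assumed layout: "METHOD path protocol" status size "referrer" "user agent"
LOG_LINE = re.compile(
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def identify(user_agent):
    """Return (token, category) for a known crawler, else None."""
    lowered = user_agent.lower()
    for token, category in CRAWLER_CATEGORIES.items():
        if token.lower() in lowered:
            return token, category
    return None

def summarize(lines):
    """Count requests per (crawler, category, status code)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        hit = identify(match.group("ua"))
        if hit is not None:
            counts[(hit[0], hit[1], int(match.group("status")))] += 1
    return counts

SAMPLE = [  # invented lines using documentation IP ranges
    '198.51.100.7 - - [10/Jun/2026:10:00:00 +0000] "GET /articles/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '198.51.100.8 - - [10/Jun/2026:10:00:05 +0000] "GET /search?q=x HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.8 - - [10/Jun/2026:10:00:09 +0000] "GET /old-page HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [10/Jun/2026:10:00:12 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

for key, count in sorted(summarize(SAMPLE).items()):
    print(key, count)
```

Grouping the same counts by URL section (the first path segment, say) is the natural next step, since that is what shows whether crawl activity is landing on priority content or on waste.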
21. Failure Postmortems
These are the failure patterns the SEO field has documented most often, and the ones that cost the most when they land. Each is worth recognizing on sight.
- Sitewide block from a staging deployment. The User-agent: * / Disallow: / that belonged on staging rides along to production through an infrastructure slip or a missing checklist line. Googlebot stops crawling the whole site, and recovery can take weeks with ranking loss along the way. Risk: critical. Prevention: name robots.txt on every launch checklist and confirm the production file returns 200 with correct content on launch day.
- Blocking CSS. A Disallow: /css/ left over from an era when blocking assets was considered a crawl-budget trick. Google can no longer fully render, layout comprehension degrades, and structured data may go unprocessed. Risk: high.
- Blocking JavaScript. Same legacy thinking, or a misunderstanding of framework paths, producing Disallow: /js/ or Disallow: /_next/. Google cannot render JavaScript-dependent content, navigation links go undiscovered, and in the worst case the whole site reads as an empty page. Risk: high.
- Blocking product categories. Aggressive crawl-budget optimization that never checked the business value of category pages. Category rankings collapse and revenue-driving keywords drop out of search. Risk: critical for ecommerce.
- Internal search left open. /search?q= crawlable with no restriction, so search result URLs pile up in the queue, an infinite supply of low-value combinations eats the budget, and high-value content gets crawled less. Risk: medium to high.
- Faceted navigation explosion. Filter URLs like ?size=, ?color=, and ?brand= all crawlable with no governance, turning millions of combinations crawlable, draining the budget, and starving product and category pages. Risk: high for ecommerce.
- AI governance misconfiguration. The intent was to block training and allow search and retrieval, but a misread of precedence, or more often a too-broad wildcard, blocks every AI crawler instead. AI search visibility disappears without anyone deciding it should. Risk: increasing.
- Governance drift. The file launched sensible, then time passed: new URL types, a CMS update, new AI crawlers, changed ownership, and nobody revisited it. Crawl traps that should be closed stay open, and restrictions on valuable content that should be lifted stay in place. Risk: high over time.
When you are actually working a failure, the diagnostic questions are almost always the same six: What changed? When? Who changed it? Was it tested before deployment? Was it reviewed and approved? Was it monitored afterward? The answers usually point at a governance gap rather than a technical one.
Failure Type | Risk Level |
|---|---|
Sitewide block from staging | Critical |
Product page block | Critical |
Category page block | Critical |
JavaScript blocked | High |
CSS blocked | High |
Governance drift | High |
Facet explosion | High |
Internal search open | Medium to High |
AI governance misconfiguration | Increasing |
Missing sitemap declarations | Medium |
22. Enterprise Governance Operating Model
At enterprise scale, almost every robots.txt failure is a governance failure rather than a technical one. The syntax is fine and the file parses. The real problem is that someone made the wrong decision, or nobody made the right one. The fix is to treat robots.txt as what it is: a production configuration file that deserves the same rigor as any other.
Ownership is where it starts, and a simple RACI keeps it honest:
- Crawl governance: SEO is accountable and defines the policy, engineering is responsible and implements it, and product and editorial are consulted on priorities and content visibility.
- AI governance: legal is accountable for licensing and compliance, SEO is consulted on visibility, engineering is responsible for implementation, and product is consulted on business impact.
- Deployment: engineering executes, SEO must approve before anything ships, and product stays aware of the impact.
Every change should move through a structured path: a documented proposal with rationale, a business-impact review, an SEO review, an engineering review, testing in staging, sign-off from the designated approvers, deployment, at least 14 days of post-deployment monitoring, and a rollback plan prepared before deployment rather than after something breaks.
A change-request template keeps that path repeatable:
- Requestor:
- Date:
- Proposed change (exact directive):
- Reason for change:
- Impacted URL patterns:
- Estimated impacted URL count:
- Expected outcome:
- Risk assessment: Low / Medium / High / Critical
- Testing completed: Yes / No
- Rollback plan:
- Approvers:
Review cadence tracks the audit cadence: quarterly for small sites, monthly for publishers and ecommerce, and monthly plus change-triggered reviews for enterprise.
Part 7: Real-World Analysis
23. Real-World Analysis Framework
Reading the robots.txt files of organizations operating at scale shows you something no documentation can: the decisions sophisticated teams actually make under real operational pressure. The point is not to copy them, since every site has its own objectives, architecture, and priorities. The point is to read the patterns, the choices that keep recurring across industries and scales, and to understand the reasoning underneath them.
For any file you study, look at six things: what is blocked and why, how sitemaps are used to manage discovery, whether rendering assets stay accessible, how crawl waste is reduced, whether AI crawler policies are present, and how sophisticated the implementation looks overall. And keep asking the one question that teaches the most. Not "what is blocked?" but "why is it blocked?" Every directive is a business decision in disguise, and the decision tells you more than the syntax does.
24. Publisher Analysis
Google's own robots.txt
Google's public file, at https://www.google.com/robots.txt, is one of the most instructive examples of crawl governance at extraordinary scale, partly because the company that builds the crawler is so careful about its own crawl exposure.
It blocks its own search results sitewide, then carves out the handful of search-related URLs that carry standalone informational value:
- Disallow: /search
- Allow: /search/about
- Allow: /search/howsearchworks
That is exactly the pattern this guide recommends, executed with surgical precision. The same precision shows up at the parameter level, where Google pairs Allow and Disallow with wildcards and end markers:
- Allow: /?hl=*&gws_rd=ssl$
- Disallow: /?hl=*&
The lessons travel well. Scale forces specificity, so large inventories need surgical rules rather than broad ones. Internal systems should be invisible to crawlers. Allow and Disallow used together inside one crawler group give you precision neither directive can reach alone. And even the company that builds the crawler governs its own exposure carefully.
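The Allow and Disallow pairing works because of the precedence rule: among all rules that match a URL, the longest pattern wins, and an Allow wins a tie. The Python sketch below applies that rule to the two excerpts quoted above. It is a simplified model with hypothetical function names. It measures pattern length in characters rather than octets and skips percent-encoding normalization, so use it to reason about rule sets, not as a reference implementation.

```python
import re

def _matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(body + (r"\Z" if anchored else ""), path) is not None

def is_allowed(rules, path: str) -> bool:
    """Apply longest-match precedence; no matching rule means allowed."""
    best_length, allowed = -1, True
    for directive, pattern in rules:
        if not _matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_length
        tie_goes_to_allow = len(pattern) == best_length and is_allow
        if longer or tie_goes_to_allow:
            best_length, allowed = len(pattern), is_allow
    return allowed

# The two rule excerpts quoted above, combined into one group.
RULES = [
    ("Disallow", "/search"),
    ("Allow", "/search/about"),
    ("Allow", "/search/howsearchworks"),
    ("Allow", "/?hl=*&gws_rd=ssl$"),
    ("Disallow", "/?hl=*&"),
]

for path in ["/search?q=robots", "/search/about", "/?hl=en&gws_rd=ssl", "/?hl=en&q=x"]:
    print(f"{path:24} allowed={is_allowed(RULES, path)}")
```

Note that rule order in the file plays no part in the outcome. Specificity decides, which is why the carve-outs work regardless of where they sit in the group.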
GitHub's robots.txt
GitHub's file, at https://github.com/robots.txt, governs one of the most URL-intensive user-generated platforms on the web, and it shows what aggressive, deliberate governance looks like.
It uses crawler-specific rules where different crawlers need different treatment:
- User-agent: bingbot
- Disallow: /ekansa/Open-Context-Data
- Disallow: */tarball/
- Disallow: */zipball/
- User-agent: *
- Disallow: /*/*/pulse
- Disallow: /*/*/projects
It restricts repository metadata views, the commits, branches, contributors, and forks that change constantly, generate crawl waste, and offer no SEO value in return:
- Disallow: /*/*/commits/
- Disallow: /*/*/branches
- Disallow: /*/*/contributors
- Disallow: /*/*/tags
- Disallow: /*/*/stargazers
- Disallow: /*/*/watchers
- Disallow: /*/*/network
- Disallow: /*/*/graphs
- Disallow: /*/*/compare
It blocks download and archive content regardless of URL structure, and it blocks tracking parameters:
- Disallow: /*/download
- Disallow: /*/archive/
- Disallow: /*source=*
- Disallow: /*ref_cta=*
The takeaways: user-generated platforms need aggressive URL governance by default, crawler-specific rules are a legitimate tool when crawlers genuinely need different treatment, and binary or download content should stay out of crawl no matter how the URLs are shaped.
Part 8: References
25. Security Myths
Three myths refuse to die, and each one is dangerous.
- Myth: Disallow protects content. It does not. RFC 9309 is explicit that the protocol is not a substitute for valid content security measures. A Disallow asks compliant crawlers not to visit a path. It blocks no one, and the URL stays publicly accessible to anyone who knows it.
- Myth: Disallow hides content. It does the opposite. Listing a path in robots.txt puts that path on public display in the file itself, and security researchers, competitors, and attackers all read robots.txt precisely to find the paths an owner hoped to bury.
- Myth: Bots must obey robots.txt. Compliant, good-faith crawlers honor it. Malicious bots do not. A scraper after your content for unauthorized use will not pause for a Disallow.
The correct approach to security is the boring, effective one: authentication and authorization at the application layer, firewalls and WAF rules, rate limiting, and access controls. Robots.txt is governance, not protection.
Sources: RFC 9309, Section 3; OWASP Access Control Guidance
26. Robots.txt vs Meta Robots vs X-Robots-Tag
These three get confused constantly because all three touch crawler behavior, but they solve entirely different problems.
Robots.txt controls crawling at the URL level through a file at the root.
- User-agent: *
- Disallow: /search/
The meta robots tag controls indexing for a single HTML page through a tag in the <head>.
- <meta name="robots" content="noindex">
- <meta name="robots" content="noindex, nofollow">
X-Robots-Tag controls indexing for any content type through an HTTP response header, which makes it the right tool for non-HTML files like PDFs and images that cannot carry a <head> tag.
- X-Robots-Tag: noindex
- X-Robots-Tag: noindex, nofollow
The interaction between them is the part that causes real damage. Block a page in robots.txt and Google cannot crawl it. If Google cannot crawl it, it never reads the meta robots tag. So a page carrying noindex that is also blocked by robots.txt may still end up indexed, discovered through an external link with the noindex instruction never verified.
The rule that follows is firm: never combine robots.txt blocking with noindex. Pick one. For content you want to exist but keep out of search, allow the crawl and apply noindex. For content you want out of the crawl queue entirely, use robots.txt.
Method | Crawl Control | Index Control |
|---|---|---|
robots.txt | Yes | No |
Meta robots | No | Yes |
X-Robots-Tag | No | Yes |
Sources: Google Robots Meta Tag Documentation; Google Block Indexing Documentation
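The blocked-plus-noindex conflict is easy to audit if you already have a list of URLs that carry noindex, for example from a crawl export. The Python sketch below flags any of them that robots.txt also blocks, which means the noindex can never be read. It uses the standard-library `urllib.robotparser`, which does prefix matching only and does not handle wildcards, so it suits simple rule sets. The function name and sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def hidden_noindex(robots_txt, noindex_urls, agent="Googlebot"):
    """Return noindex URLs the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]

ROBOTS = "User-agent: *\nDisallow: /search/\n"
NOINDEX_URLS = [
    "https://example.com/search/results",  # blocked, so its noindex is never seen
    "https://example.com/thank-you",       # crawlable, so its noindex works
]

print(hidden_noindex(ROBOTS, NOINDEX_URLS))
```

Every URL this returns needs a decision: either lift the robots.txt block so the noindex can take effect, or drop the noindex and rely on the crawl block alone.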
27. The Future of Robots.txt
Robots.txt is not going obsolete.
It remains the foundational way a site tells compliant agents what it prefers about crawling, and RFC 9309's formalization in 2022 strengthened its standing rather than weakening it. What is changing is its scope and its complexity.
AI governance is the defining evolution. For 29 years the file was a search governance tool. In roughly three years it became an AI governance tool as well, and splitting crawlers into training, search, and retrieval, then governing each independently, is a genuine structural change in what the file is for.
Retrieval-crawler compliance is the open problem. Both Perplexity and OpenAI have documented that their user-initiated retrieval crawlers may not consistently honor robots.txt, and as AI assistants fetch more content on behalf of users, the line between crawling and browsing keeps blurring. Closing that gap will probably require governance mechanisms that do not yet exist in mature form.
Licensing and governance are converging, with major publishers now treating robots.txt as one part of a wider AI content policy that runs all the way to commercial licensing negotiations. And bot management platforms are extending what the file can do, with Cloudflare, Akamai, and their peers offering managed AI crawler controls, IP verification, and enforcement that operates above the robots.txt layer.
The most likely near-term shape is a layered one: robots.txt stays the baseline governance layer, while platform-level enforcement, authentication, and possibly new protocol extensions cover the compliance gaps it cannot close alone.
Appendices
Appendix A: Directive Reference
Directive | Function | Standard? |
|---|---|---|
User-agent | Defines which crawler receives the rules | RFC 9309 |
Disallow | Blocks crawling of matching paths | RFC 9309 |
Allow | Overrides a broader Disallow; longest match wins | RFC 9309 |
Sitemap | Declares sitemap location | Widely supported; not in core RFC 9309 spec |
Crawl-delay | Suggests delay between requests | Not RFC 9309; Google ignores it; Bing and Anthropic support it |
Appendix B: Wildcard Reference
Character | Meaning | Example |
|---|---|---|
* | Zero or more of any character | Disallow: /*?sort= |
$ | End of the URL pattern | Disallow: /*.pdf$ |
Appendix C: Crawl vs Index vs Render vs Rank
Process | Meaning | Controlled By |
|---|---|---|
Crawl | Bot requests and retrieves the URL | robots.txt |
Parse | Bot processes the HTML | Cannot be separately controlled |
Render | Bot executes JS and CSS | Requires rendering resources to be crawlable |
Index | Search engine stores page information | noindex / X-Robots-Tag |
Rank | Page becomes eligible for results | Content, links, signals |
Appendix D: AI Crawler Reference Table (2026)
| Crawler | Organization | Category | Robots.txt Compliance | Training Use? | Source |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Training | Yes | Yes | developers.openai.com |
| OAI-SearchBot | OpenAI | Search | Yes | No | developers.openai.com |
| ChatGPT-User | OpenAI | Retrieval | May not consistently comply | No | developers.openai.com |
| ClaudeBot | Anthropic | Training | Yes | Yes | support.claude.com |
| Claude-SearchBot | Anthropic | Search | Yes | No | support.claude.com |
| Claude-User | Anthropic | Retrieval | Yes | No | support.claude.com |
| PerplexityBot | Perplexity | Search | Yes | No | docs.perplexity.ai |
| Perplexity-User | Perplexity | Retrieval | Generally ignores robots.txt | No | docs.perplexity.ai |
Perplexity does not operate a training crawler. Any framework that files Perplexity under "Training" is incorrect.
Last verified June 2026. Verify against official documentation before publication and quarterly thereafter.
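Governing each category separately can be expressed as a small generator that turns a per-category policy into robots.txt groups. The Python sketch below uses the crawler names from the table above; the function name `build_ai_rules`, the `"allow"`/`"block"` policy values, and the choice to default an unlisted category to allow are all assumptions made for illustration. The output is a request, not enforcement: crawlers that the table marks as not consistently compliant may still fetch.

```python
# Crawler names taken from the reference table; re-verify them against
# each vendor's documentation before relying on this list.
AI_CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "retrieval": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_ai_rules(policy):
    """Build robots.txt groups from a {category: "allow" | "block"} policy.

    Categories missing from `policy` default to allow (an assumption of
    this sketch). Crawlers in one category share a single group.
    """
    lines = []
    for category, agents in AI_CRAWLERS.items():
        blocked = policy.get(category) == "block"
        for agent in agents:
            lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /" if blocked else "Allow: /")
        lines.append("")  # blank line separates groups
    return "\n".join(lines).rstrip() + "\n"


print(build_ai_rules({"training": "block", "search": "allow", "retrieval": "allow"}))
```

Generating the groups from one policy table keeps the file consistent when a vendor adds or renames a crawler: the list changes in one place and the output follows.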
Appendix E: Glossary
- Allow: A directive that permits crawling of a path, overriding a broader Disallow. The most specific match wins.
- Canonical: An HTML tag or HTTP header that names the preferred URL when the same or similar content sits at multiple URLs.
- Crawl budget: The set of URLs Google can and wants to crawl, determined by the intersection of crawl capacity and crawl demand.
- Crawl capacity limit: The most crawling Google will do on a site without overloading the site's servers, based on response time, error rates, and availability.
- Crawl demand: How much Google wants to crawl a site, based on content popularity, freshness, and inventory size.
- Crawl trap: A URL pattern that generates excessive, low-value crawl activity, such as internal search, session parameters, or infinite calendar pagination.
- Disallow: A directive that blocks crawling of matching path patterns.
- Noindex: A meta tag or X-Robots-Tag instruction that keeps a page out of the index. Does not prevent crawling.
- REP, Robots Exclusion Protocol: The protocol governing robots.txt behavior, standardized as RFC 9309 in September 2022.
- Rendering: The process by which search engines execute JavaScript and apply CSS to understand a page the way a browser does.
- RFC 9309: The IETF standard formalizing the Robots Exclusion Protocol, published September 2022.
- Training crawler: An AI crawler that collects content for potential use in model training. Examples: GPTBot, ClaudeBot.
- Search crawler: An AI crawler that builds indexes for AI-powered search. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot.
- Retrieval crawler: An AI crawler that fetches content in response to a specific user request. Examples: ChatGPT-User, Claude-User, Perplexity-User. May not consistently honor robots.txt.
- User-agent: The identifier a crawler uses to announce itself. Directives are targeted at specific user-agents.
- X-Robots-Tag: An HTTP response header that carries indexing instructions for any content type, including PDFs and images.

