Mastering Robots.txt: Crawl, AI Crawler & Governance

Key Takeaways

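Robots.txt controls crawling, not indexing. A blocked URL can still rank if Google finds it via a link. To keep a page out of search, allow the crawl and use noindex. Never pair noindex with a Disallow on the same URL, because a crawler cannot read a tag on a page it is not allowed to fetch.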
Crawl budget is a large-site problem. It matters above roughly a million pages, or ten thousand fast-changing ones. Most small sites can ignore it.
AI crawlers split three ways: training (GPTBot, ClaudeBot), search (OAI-SearchBot, Claude-SearchBot, PerplexityBot), and retrieval (ChatGPT-User, Claude-User, Perplexity-User). Govern each separately. Perplexity has no training crawler.
Retrieval crawlers may ignore robots.txt because the fetch is user-initiated. For reliable control, use WAF rules built from official IP ranges.
Most failures are governance failures, not syntax errors. The classic disaster is a staging Disallow: / shipped to production.

Robots.txt now does two jobs:

It tells search engines where not to go.
It decides what AI systems can do with your content.

This guide covers the directives, crawl budget, how to govern AI crawlers, and how to keep the file from quietly breaking, with every claim checked against official documentation.

The one thing to remember as you go: robots.txt controls crawling, not indexing, and almost every expensive mistake comes from forgetting that.

Part 1: Foundations

1. What Robots.txt Is, and What It Is Not

Robots.txt is a plain-text file that lives at the root of a website and tells automated visitors (crawlers, spiders, bots of all kinds) where you would prefer they not go. It sits at a fixed, public address:

https://example.com/robots.txt

The protocol behind it is the Robots Exclusion Protocol, which the IETF finally turned into a formal standard, RFC 9309, in September 2022. For the twenty-something years before that, robots.txt worked on the strength of convention alone: everyone agreed on roughly how it behaved, but the edges were fuzzy. RFC 9309 cleaned up the parts that used to cause arguments: how patterns are matched, which rule wins when two conflict, and what a crawler should do when the file is missing or broken.

In practical terms, the file tells crawlers which URLs they may visit, which they should leave alone, and where to find your sitemaps. It is the first thing a well-behaved crawler fetches before it touches anything else on your site, which makes it the opening handshake between your server and the wider machine web. Get that handshake wrong and everything downstream suffers: crawl efficiency, server load, search access, and now AI access too.

That is what it does. Here is what it does not do, and this is where the money gets lost.

It does not control indexing. This is the single most common and most expensive misunderstanding in technical SEO. A rule like Disallow: /page/ tells crawlers not to visit /page/. It does nothing to stop Google from listing that URL in search results once Google learns the URL exists some other way, through a link, a sitemap, a redirect, or a mention on another site. If you actually want a page kept out of the index, the tool is <meta name="robots" content="noindex"> or an X-Robots-Tag header, not a Disallow.
It does not provide security. RFC 9309 says so in plain language: the protocol is not a substitute for real content security, and listing a path in the file makes that path publicly discoverable. Anything sensitive needs authentication and access controls. A Disallow rule is a polite request, not a lock.
It does not hide anything. Every line in your robots.txt is visible to anyone who types the URL. People who go looking for the directories you would rather keep quiet start by reading your robots.txt, because that is where owners helpfully list them.
And compliance is voluntary. The well-known crawlers, Googlebot, Bingbot, ClaudeBot, GPTBot, PerplexityBot, honor the file. Bad-faith bots ignore it entirely. The whole system runs on the assumption of good-faith actors, which is worth keeping in mind every time you are tempted to treat robots.txt as a defense.

What changed between 2022 and 2026

The file itself barely changed. What we ask it to do changed a great deal:

2022: RFC 9309 standardizes the protocol.
2023: OpenAI's GPTBot arrives, and for the first time site owners have to decide whether they want their content feeding AI training.
2024 to 2025: that decision goes mainstream, and the cast grows: ClaudeBot, Claude-SearchBot, OAI-SearchBot, PerplexityBot.
2026: AI visibility, a different thing from AI governance, becomes a core part of the SEO conversation rather than a fringe one.

So robots.txt has quietly turned into a search-and-AI governance tool, and any site of meaningful size now faces a set of crawler-access decisions that simply did not exist three years ago.

Source: RFC 9309

2. Crawling vs Indexing vs Rendering vs Ranking

These four words get used interchangeably all the time, and that habit is the root of more costly SEO mistakes than any syntax error. They are four separate processes. Keeping them straight is most of the battle.

Crawling

Crawling is just this: a crawler asks your server for a URL, and your server answers.

GET /article/example HTTP/1.1

Googlebot requested the page, read the response, and moved on. That is the entire act of crawling, and robots.txt is the only one of the four processes it directly controls.

Rendering

Modern search engines do not stop at reading raw HTML. They render the page the way a browser does, running the JavaScript, loading the CSS, building the layout. Google has described its own pipeline in exactly those terms. Any page that relies on JavaScript to bring in its content, its navigation, or its structured data is leaning on this step whether the team realizes it or not.

This is where a robots.txt mistake turns invisible and severe. If you block the CSS or JavaScript a page needs, Google can crawl the HTML perfectly and still fail to understand the page, because it never got to render it. Blocked rendering resources are one of the most damaging and most common misconfigurations there is, precisely because the page looks fine to a human and broken only to the crawler.

Indexing

After rendering, Google decides whether to store what it learned, and that stored record is what makes a page eligible to show up in results. Indexing is where robots.txt has no say at all. A URL you blocked can still be indexed if Google found it through an external link.

The tool that actually prevents indexing is noindex: <meta name="robots" content="noindex">
For files that cannot carry an HTML tag: X-Robots-Tag: noindex

And here is the interaction that trips up even experienced teams. If you block a page in robots.txt, Google cannot crawl it. If Google cannot crawl it, Google never sees the noindex tag sitting on it. So the page can end up indexed anyway, discovered through some external link, with Google unable to read the very instruction that was supposed to keep it out. This is why Google's own guidance is blunt about it: when your goal is to prevent indexing, allow the crawl and apply noindex. Do not block with robots.txt.
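The interaction can be written down as a small decision function. This is only a sketch: the function name and its three boolean inputs are illustrative assumptions, and real indexing decisions weigh far more signals than these.

```python
def can_appear_in_search(blocked_by_robots: bool, has_noindex: bool,
                         discoverable_elsewhere: bool) -> bool:
    """Simplified model: can this URL end up in the index?"""
    if blocked_by_robots:
        # The crawler never fetches the page, so any noindex on it goes unread.
        # The bare URL can still be indexed if it is discovered another way.
        return discoverable_elsewhere
    # The page is crawlable, so the noindex directive is seen and obeyed.
    return not has_noindex
```

Calling it with blocked_by_robots=True and has_noindex=True shows the trap: the noindex makes no difference to the result, because it is never read.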

Ranking

Ranking is the ordering of indexed pages for a given query, and robots.txt touches it only indirectly. Efficient crawling helps ranking in the sense that it keeps your important content discovered and fresh. But a Disallow never nudges a page up or down the results. It removes the page from the running altogether.

The decision table

Goal	Correct Tool
Prevent crawling	robots.txt Disallow
Prevent indexing	noindex meta tag or X-Robots-Tag
Consolidate duplicates	Canonical tag
Redirect content	301 redirect
Secure content	Authentication / access controls
Manage AI access	robots.txt (crawler-specific)

Sources: Google Block Indexing Documentation; Google Robots Meta Tag Documentation

3. Robots.txt Technical Specifications

Most of the time you will not think about the specification at all. But the few rules below decide what happens at the edges (when your server hiccups, when the file gets large, when a crawler caches it), and the edges are exactly where outages live.

Where the file has to live. RFC 9309 is strict here: the file has to sit at the root of the host, as in https://www.example.com/robots.txt. Not /files/robots.txt, not /assets/robots.txt, not anywhere else. And it only governs the host it sits on. A file at www.example.com has no authority over blog.example.com. Every subdomain needs its own.
Encoding. RFC 9309 requires UTF-8, and Google recommends it explicitly. Encoding mismatches can produce parsing behavior you did not intend.
Size. Crawlers are required to parse at least 500 kibibytes, and Google processes up to that limit. Anything past it may be ignored. You are unlikely to ever hit this, since 500 KiB holds thousands of directives, but it is worth knowing the ceiling exists.
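The encoding and size constraints are easy to check mechanically before a file ships. A minimal sketch, assuming you already have the raw bytes of the file in hand; the function name and the wording of the messages are illustrative, not part of any standard tool.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, the parsing floor named in RFC 9309


def check_robots_bytes(data: bytes) -> list:
    """Return a list of problems found in the raw robots.txt bytes."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    if len(data) > MAX_ROBOTS_BYTES:
        problems.append(f"{len(data)} bytes exceeds {MAX_ROBOTS_BYTES}")
    return problems
```

An empty list means neither constraint is violated. It says nothing about whether the rules themselves are sensible.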

What different HTTP responses mean. This table matters more than it looks:

Status Code	What Happens
200	File available; the rules in it must be followed
404 / any 4xx	File treated as unavailable; the crawler may access anything
5xx server error	Crawler must assume a complete block until the file is reachable again
Redirect (301/302)	Crawlers should follow at least five consecutive redirects

The 5xx row is the one that bites. If your server starts returning a 500 error on /robots.txt, compliant crawlers are required to treat your entire site as off-limits until the file recovers. That turns a small infrastructure blip into a sitewide crawl outage. For any large site, monitoring the availability of robots.txt is not a nice-to-have. It is a production requirement, the same as monitoring the homepage.
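The status table translates directly into a lookup, which is a handy shape for a monitoring check. This sketch follows the RFC 9309 categories only; the return labels are made up for illustration, and individual crawlers document their own exceptions (Google, for one, does not treat 429 like other 4xx codes).

```python
def crawl_policy(status: int) -> str:
    """Map the HTTP status of /robots.txt to the behavior RFC 9309 describes."""
    if 200 <= status < 300:
        return "obey-rules"        # file available: parse and follow it
    if 300 <= status < 400:
        return "follow-redirect"   # follow at least five hops, then re-evaluate
    if 400 <= status < 500:
        return "allow-all"         # file unavailable: no restrictions apply
    if 500 <= status < 600:
        return "disallow-all"      # file unreachable: assume a complete block
    raise ValueError(f"unexpected status {status}")
```

An alert on any "disallow-all" result for the production host is the monitoring rule the paragraph above is asking for.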

Caching. Crawlers should not hold a cached copy for more than 24 hours unless they cannot reach the file. In practice, your edits reach compliant crawlers within roughly a day, with the exact timing varying by crawler. Do not expect changes to take effect the moment you save.
A security reminder, because it bears repeating. RFC 9309 treats the file's content as untrusted by design, and so should you. Never put sensitive paths, internal directory names, API endpoints, or anything security-relevant in robots.txt. The file is public on purpose.

Source: RFC 9309, Section 2.3–2.5

Part 2: Directives and Syntax

4. Directives Reference

Everything in a robots.txt file is built from a handful of directives, and RFC 9309 nails down exactly how each one behaves.

User-agent

This line names the crawler that the rules underneath it apply to.

User-agent: Googlebot

That targets Googlebot and nobody else. Matching is case-insensitive, so googlebot, Googlebot, and GOOGLEBOT all mean the same thing. The wildcard catches everyone:

User-agent: *

The * group applies to any crawler that does not have its own, more specific group somewhere in the file. Once a crawler matches a named group, it follows that group alone and ignores the * group, so rules under * are not inherited. Merging happens only when several groups name the same crawler: RFC 9309 says those matching groups are combined and read as one rather than one silently overriding the other.
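Group selection under RFC 9309 can be sketched in a few lines: groups naming the same crawler are combined, and the * group is used only when no named group matches. This is a simplified reader, not a full parser. It assumes exact, case-insensitive matching on the product token and ignores every record other than User-agent, Allow, and Disallow; the function name and sample file are illustrative.

```python
def rules_for(robots_txt: str, product_token: str) -> list:
    """Return the (directive, path) rules that apply to one crawler."""
    groups = {}              # lowercased user-agent -> list of rules
    agents = []              # user-agents named by the group being read
    reading_rules = False    # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:                 # a rule line ended the last group
                agents, reading_rules = [], False
            agent = value.lower()
            if agent not in agents:
                agents.append(agent)
            groups.setdefault(agent, [])      # same name later: rules merge
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in agents:
                groups[agent].append((field, value))

    # A named group wins outright; "*" is only a fallback, never merged in.
    return groups.get(product_token.lower(), groups.get("*", []))


SAMPLE = """\
User-agent: *
Disallow: /search/

User-agent: GPTBot
Disallow: /

User-agent: gptbot
Disallow: /drafts/
"""
```

With SAMPLE, rules_for(SAMPLE, "GPTBot") returns both GPTBot rules and nothing from the * group, while rules_for(SAMPLE, "Bingbot") falls back to the single /search/ rule.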

Disallow

Disallow blocks crawling of paths that match.

Disallow: /search/

One quirk worth knowing: an empty Disallow: with nothing after it means allow everything. It is the same as having no restriction at all.

Allow

Allow carves an exception out of a broader Disallow, and the most specific match, measured by the length of the path in bytes, wins.

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

That is the classic WordPress pattern: shut the admin area to crawlers, but keep open the one endpoint that rendering and various front-end features actually depend on.

Sitemap

Sitemap tells crawlers where your XML sitemaps live. It is not part of the core protocol, but it is so widely supported that you should always include it; RFC 9309 explicitly allows crawlers to read records it does not formally define.

Sitemap: https://example.com/sitemap.xml

You can declare as many as you need:

Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

How precedence actually works

When two rules fight, the most specific one wins, and specificity is measured by how many bytes are in the path pattern. If an Allow and a Disallow are exactly as specific as each other, Allow wins the tie.

Disallow: /images/
Allow: /images/public/

The result is that /images/public/ stays crawlable while everything else under /images/ does not.
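Here is the precedence rule as a short function. It is a sketch that assumes plain prefix patterns with no * or $, and the rule list format (pairs of directive and path) is an illustrative choice rather than anything the RFC prescribes.

```python
def is_allowed(path: str, rules: list) -> bool:
    """Apply longest-match precedence to plain prefix rules (no wildcards)."""
    best_len, allowed = -1, True       # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue                   # an empty pattern matches nothing
        length = len(pattern.encode("utf-8"))   # specificity in octets
        is_allow = directive.lower() == "allow"
        # Longer pattern wins; on an exact tie, Allow beats Disallow.
        if length > best_len or (length == best_len and is_allow):
            best_len, allowed = length, is_allow
    return allowed


RULES = [("Disallow", "/images/"), ("Allow", "/images/public/")]
```

With RULES, /images/public/logo.png is allowed because the Allow pattern is the longer match, and /images/private/x.png is not.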

What is not supported

Crawl-delay is the big one people assume works everywhere. It is not part of RFC 9309, and Google ignores it outright. Bing honors it, and Anthropic's documentation notes that ClaudeBot supports it as a non-standard extension, but treat it as crawler-specific and confirm the one you care about actually respects it before you rely on it. Older directives like Visit-time and Request-rate are not part of the standard either, and you should not build anything on them.

Source: RFC 9309, Section 2.2

5. Wildcards and Pattern Matching

RFC 9309 gives you exactly two special characters for pattern matching. Used carefully they are precise; used carelessly they are a foot-gun that can de-index a site.

The asterisk, *

It matches any sequence of characters, of any length, anywhere in the path.

Disallow: /*?sort=

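That single rule catches any URL whose query string starts with sort, at any path depth:

/shoes?sort=price
/category/shoes?sort=newest
/search/results?sort=rating

The * matches whatever sits in front of the ?, so the path can be anything. The match on ?sort= is literal, though. A URL such as /category/shoes?size=10&sort=newest carries &sort= rather than ?sort=, and this rule does not catch it. To cover a parameter in any position, add a second rule, Disallow: /*&sort=.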

The end marker, $

It pins the pattern to the end of the URL.

Disallow: /*.pdf$

That matches /files/document.pdf but deliberately misses /files/document.pdf?download=true, because the URL does not end at .pdf. The $ lets you target a file type cleanly without accidentally catching URLs where those same characters happen to appear in the middle.
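One way to reason about both characters is to translate a pattern into a regular expression and test URLs against it. This is a sketch of the matching semantics as RFC 9309 describes them, not the code any crawler actually runs; both function names are illustrative.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then turn each * into "any characters".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, mirroring prefix matching on the path.
    return robots_pattern_to_regex(pattern).match(url_path) is not None
```

matches("/*.pdf$", "/files/document.pdf") is True, while matches("/*.pdf$", "/files/document.pdf?download=true") is False, which is the behavior described above.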

Patterns you will actually use

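A ? pattern only matches a parameter that comes first in the query string, so each parameter rule below is paired with an & variant for later positions.

Block session parameters

Disallow: /*?session=
Disallow: /*&session=

Block sort parameters

Disallow: /*?sort=
Disallow: /*&sort=

Block tracking parameters (verify no valuable pages use these)

Disallow: /*?utm_source=
Disallow: /*&utm_source=

Block PDFs specifically

Disallow: /*.pdf$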

The overblocking trap

Wildcards are powerful, and that power runs in both directions. One careless pattern can wipe out thousands or millions of URLs in a single line. Before you ship a wildcard rule on a large site, pull a real sample of the URLs it would catch and confirm that none of them carry search or business value you cannot afford to lose.

The most dangerous line in the language is this:

Disallow: /*

It blocks the entire site. It differs from Disallow: / only in the mechanics, and it deserves exactly the same fear.

Source: RFC 9309, Section 2.2.3

Part 3: Crawl Governance

6. Crawl Budget: The Definitive Guide

What crawl budget actually is

Google defines crawl budget as the set of URLs it both can and wants to crawl on your site.

Two separate forces decide it, and the important thing is that they do not add together. Both have to be satisfied.

The first is crawl capacity: the most crawling Google's infrastructure will do without overloading your servers. Google works this out from your response times, your error rates, and your availability. Faster and more reliable servers earn more capacity.
The second is crawl demand: how much Google actually wants to crawl you in the first place. Content popularity, how often you update, the size of your URL inventory, and staleness signals all feed into it.

The relationship between them is the part people miss. Google's documentation puts it directly: even if you never hit your capacity limit, low demand means less crawling. A site with plenty of capacity but little demand gets crawled below what its servers could handle. A site with high demand but constrained capacity never gets fully crawled.

You need both to be healthy, and optimizing one while ignoring the other gets you nowhere.

Who actually needs to think about this

Google frames its crawl budget material as an advanced guide, and it means it. The sites that genuinely need to care are large ones (a million or more pages changing at least weekly), fast-moving medium ones (ten thousand or more pages changing daily), and any site with a large pile of URLs that Search Console reports as "Discovered - currently not indexed."

If you run a local business site, a small blog, or a focused SaaS product, crawl budget is almost certainly not your problem, and time spent optimizing it is time wasted. Keep your sitemap current, glance at index coverage now and then, and move on to work that matters.

The one principle to govern by

Every URL should justify its crawl cost. When Google spends resources fetching a URL, that URL should pay them back with something discoverable: a ranking, a link, fresher content. URLs that consume crawl resources and return nothing are simply waste, and waste is what crawl governance exists to remove.

Where the waste usually hides

Internal search results are often the single largest source of waste on publisher and ecommerce sites. The combinations are effectively infinite, the pages carry little unique content, and they were never meant to rank in the first place.

Disallow: /search/
Disallow: /*?s=
Disallow: /*?q=

Faceted navigation is the one Google calls out by name as a major challenge. A catalog with 10 sizes, 20 brands, and 15 colors generates 3,000 filter combinations per category before you even get to multi-select, and the vast majority of those combinations answer no real search demand.
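The arithmetic behind that figure, as a quick check. The 3,000 assumes exactly one value chosen per facet; the last line shows the count under a different assumption of mine, that any facet may also be left unset.

```python
from itertools import product
from math import prod

facets = {"size": 10, "brand": 20, "color": 15}   # option counts from the text

# One value chosen from every facet: the simple product.
single_select = prod(facets.values())

# Same count by brute-force enumeration, as a sanity check.
enumerated = sum(1 for _ in product(*(range(n) for n in facets.values())))

# If each facet may also be left unset, every facet gains one extra state;
# subtract 1 for the unfiltered category page itself.
with_optional = prod(n + 1 for n in facets.values()) - 1
```

Multi-select filters push the number far higher still, since each facet then contributes every subset of its options rather than a single choice.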

Session and tracking parameters like ?session=12345 or ?utm_source=email spin up duplicate copies of pages that already exist, burning crawl on content that adds nothing to the index.

Sort parameters such as ?sort=price show the same products in a different order. Same content, no unique value, rarely worth a standalone slot in the index.

Deep archives and infinite pagination, the /page/500/ and the calendar that scrolls back to 2009, are classic crawl traps.

Duplicate content in general, whether through parameters, tracking, or variants that all resolve to the same thing, makes Google crawl the same page over and over.
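To size the duplication in your own crawl logs, one approach is to strip the parameters that never change page content and count how many distinct URLs remain. A minimal sketch: the NOISE_PARAMS set is a hypothetical starting list, and you would need to confirm which parameters are truly content-neutral on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that never change page content.
NOISE_PARAMS = {"session", "sessionid", "sid", "sort", "orderby"}


def strip_noise(url: str) -> str:
    """Drop noise and utm_* parameters so duplicate URLs collapse together."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
        and not key.lower().startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def duplication_ratio(urls: list) -> float:
    """Share of crawled URLs that collapse into an already-seen page."""
    unique = {strip_noise(u) for u in urls}
    return 1 - len(unique) / len(urls)
```

A high ratio on a sample of crawled URLs suggests that a large share of crawl activity is going to copies of pages that already exist.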

AI crawl budget, the new wrinkle

AI crawlers now show up as measurable activity in the logs of large sites, and they do not behave like Googlebot. Googlebot crawls in service of one unified index. The AI crawlers represent several different systems with different jobs and different rhythms (training crawlers, search index crawlers, and retrieval crawlers), so it is worth tracking them separately in your logs. The rough groupings:

Training crawlers: GPTBot, ClaudeBot
Search crawlers: OAI-SearchBot, Claude-SearchBot, PerplexityBot
Retrieval crawlers: ChatGPT-User, Claude-User, Perplexity-User
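Tracking them separately can start with a simple classifier over the User-Agent values in your access logs. This sketch uses the groupings listed above and matches by case-insensitive substring, which is an assumption about how the tokens appear in real headers; the function names are illustrative.

```python
from collections import Counter

# Groupings taken from the list above; matching is by substring.
AI_CRAWLERS = {
    "GPTBot": "training", "ClaudeBot": "training",
    "OAI-SearchBot": "search", "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "retrieval", "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
}


def classify(user_agent: str) -> str:
    """Return the crawler category for a User-Agent header, or 'other'."""
    ua = user_agent.lower()
    for token, category in AI_CRAWLERS.items():
        if token.lower() in ua:
            return category
    return "other"


def tally(user_agents: list) -> Counter:
    """Count requests per category from a list of User-Agent strings."""
    return Counter(classify(ua) for ua in user_agents)
```

User-Agent strings are trivially spoofed, so treat these counts as a first pass and confirm anything that matters against the operators' published IP ranges.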

What Google explicitly recommends

Google's own best-practice list is worth following almost verbatim:

Manage your URL inventory so Google knows what to crawl and what to skip.
Consolidate duplicate content so crawl focuses on unique pages.
Use robots.txt to block genuinely unimportant pages, not as a temporary lever.
Return 404 or 410 for permanently removed pages. Blocked URLs linger in the crawl queue longer than removed ones.
Keep sitemaps current with <lastmod>.
Avoid long redirect chains.
Improve page load performance, because faster pages support more efficient crawling.

One warning from Google deserves its own line. Do not use noindex as a crawl budget tactic. A noindex page still gets crawled so Google can read the tag and then drop it, which spends a crawl slot and indexes nothing. If you never want a page crawled at all, a robots.txt Disallow is the efficient choice.

Sources: Google Crawl Budget Documentation; Google Large Site Management

7. What to Block

Crawl restrictions exist to cut waste, not to throttle crawling for sport. The right targets are the URLs that spend resources and return nothing of value.

Internal search results

Disallow: /search/
Disallow: /*?s=
Disallow: /*?q=
Disallow: /site-search/

Session IDs

Disallow: /*?session=
Disallow: /*?sessionid=
Disallow: /*?sid=

Sort parameters

Disallow: /*?sort=
Disallow: /*?orderby=

Cart and checkout

Disallow: /cart/
Disallow: /checkout/
Disallow: /basket/

Account and authentication areas

Disallow: /account/
Disallow: /login/
Disallow: /register/
Disallow: /admin/

Preview and draft environments

Disallow: /preview/
Disallow: /draft/

Infinite URL spaces

Disallow: /calendar/
Disallow: /date/
Disallow: /*?date=

On a staging server, and only on a staging server, you block everything:

User-agent: *
Disallow: /

Never let that block reach production. Shipping a staging restriction to the live site is the single most common cause of catastrophic crawl failures, and it has taken down sites you have heard of.
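A deploy-time guard is one way to catch this before it ships. The sketch below uses Python's standard urllib.robotparser; the two function names and the exit message are illustrative. One known limitation: the standard-library parser does not understand wildcards, so a Disallow: /* would slip past this check and needs a separate test.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "*") -> bool:
    """True if the file stops the given crawler from fetching the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def assert_safe_for_production(robots_txt: str) -> None:
    # Fail the deploy instead of shipping a sitewide block.
    if blocks_everything(robots_txt):
        raise SystemExit("robots.txt disallows the whole site; refusing to deploy")
```

Run assert_safe_for_production against the file that is about to be published, in the same pipeline step that publishes it.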

Sources: Google Crawl Budget Documentation; RFC 9309

8. What Never to Block

Over-blocking does as much damage as under-blocking, and a few categories of URL simply have to stay open or the site will not perform in modern search.

CSS and JavaScript. Do not block them.

Never do this

Disallow: /css/
Disallow: /js/
Disallow: /static/
Disallow: /assets/

Google renders your pages. To read the layout and visual hierarchy it needs the CSS, and to run your navigation, structured data, and dynamic content it needs the JavaScript. Block those resources and Google misreads the page, misses your schema, and indexes a half-built version of what your visitors actually see.

Images. They feed Google Images, Google Discover (which is hungry for good imagery), Google News, and your product and article structured data. Unless you have a concrete business reason to suppress image indexing, leave them crawlable.

Core content. Your blog, articles, news, products, categories, guides, and resources are the whole point of the site. Never block any of them without rigorous analysis first.

/blog/
/articles/
/news/
/products/
/categories/
/guides/
/resources/

Framework and rendering assets. Next.js, Nuxt, and their peers serve the chunks a page needs to render from predictable paths, and blocking one of those paths can knock out Google's ability to render the entire site.

Never block without testing

Disallow: /_next/
Disallow: /_nuxt/
Disallow: /static/chunks/

Sources: Google JavaScript SEO Documentation; Google Rendering Documentation

9. XML Sitemaps and Robots.txt

Robots.txt and sitemaps are two halves of the same conversation. Robots.txt says "do not waste your time here." The sitemap says "spend your time here." Together they are how you steer crawl attention.

Declaring sitemaps

Sitemap: https://example.com/sitemap.xml

Declare them in robots.txt as a matter of habit. Google can find sitemaps through Search Console or by crawling, but a declaration in the file means any compliant crawler that reads it learns where your sitemaps are, Search Console account or not.

Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/page-sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

For very large sites, point at a sitemap index instead of listing everything:

Sitemap: https://example.com/sitemap_index.xml

What belongs in a sitemap, and what does not

A sitemap should contain only URLs that should rank, are canonically correct, return a 200, and are not noindexed. Keep out anything that redirects, anything noindexed, anything returning a 404 or other error, and anything blocked by robots.txt. A sitemap stuffed with blocked URLs, redirects, and dead ends sends mixed signals and burns crawl on nothing.
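Those inclusion rules reduce to a single predicate you can run over a crawl export before generating the sitemap. A sketch only: the Page record and its field names are hypothetical and would need mapping onto whatever your crawler or CMS actually reports.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int            # HTTP status the URL returns
    canonical: str         # canonical URL declared by the page
    noindex: bool          # page carries a noindex directive
    robots_blocked: bool   # URL is disallowed in robots.txt


def belongs_in_sitemap(page: Page) -> bool:
    """Apply the inclusion rules: 200, self-canonical, indexable, crawlable."""
    return (
        page.status == 200
        and page.canonical == page.url
        and not page.noindex
        and not page.robots_blocked
    )
```

Whether a URL "should rank" is a business judgment this predicate cannot make. It only filters out the URLs that are mechanically disqualified.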

A note for publishers

Keep a dedicated news sitemap to help Google News discover new articles quickly. It should hold only content from the last 48 hours and should update continuously as new pieces go live, not regenerate on some lazy schedule.

Sources: Google XML Sitemap Documentation; Sitemaps Protocol

10. Robots.txt and Rendering

Search engines do more than read HTML now. They render pages, running the JavaScript, loading the CSS, processing everything the way a browser would. That changes the whole relationship between what you block and what you are visible for.

Google's pipeline runs in four steps: crawl the URL, parse the HTML, render by executing JavaScript and applying CSS, then index what it understands. Block any resource that the parse or render steps depend on, and you can quietly break indexing even though the URL itself was crawled without a hitch.
The failure looks like this. A page brings in its main content through JavaScript. The rendering resources are blocked. Google sees an empty page. The structured data that JavaScript was going to inject never gets processed, the navigation links never get discovered, and the internal linking signals vanish. It hits hardest on React single-page apps, on Next.js apps rendering client-side, and on any site that depends on JavaScript for its breadcrumbs or schema.
A few framework specifics worth knowing: Next.js keeps its static assets and hydration chunks under /_next/, and blocking that directory can break rendering across the whole site. Nuxt uses similar asset paths, so test rendering after any change. Angular's compiled assets are no different; block them and full rendering breaks.
The general rule is simple enough to keep in your head: before you change any robots.txt rule that touches an asset directory, test the rendered output with Google's URL Inspection tool in Search Console, both before and after. If a browser needs a file to render the page, so does Google.
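Before reaching for URL Inspection, a cheap pre-check is to run the asset paths a page loads through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser and a hypothetical list of asset paths. It is not a substitute for testing the rendered output: the standard-library parser applies the first matching rule and ignores wildcards, so its verdicts can differ from Google's on files that mix Allow and Disallow.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_paths: list,
                   user_agent: str = "Googlebot") -> list:
    """Return the asset paths this crawler would not be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in asset_paths if not parser.can_fetch(user_agent, p)]
```

Feed it the script, stylesheet, and chunk URLs from a page's network log; anything it returns is a candidate for breaking the render.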

Sources: Google JavaScript SEO Documentation; Google Rendering Documentation

Part 4: AI Governance

11. The Complete AI Crawler Reference (2026)

For most of robots.txt's history, the crawlers it managed were search engines, and they all wanted roughly the same thing. Then 2023 arrived and a new category showed up that did not fit the old mental model at all. Any site of meaningful size now has to understand these crawlers, and the first thing to understand is the most important: not all AI crawlers are the same. They serve different systems and different purposes, and each one is a separate business decision.

They sort into three categories, and the rest of this section hangs off that taxonomy.

Training crawlers collect publicly accessible content that may go into improving AI models. Block them and your content stops contributing to future training datasets.
Search crawlers build the indexes that power AI search experiences. Block them and you lose visibility in AI-powered search results.
Retrieval crawlers fetch a page in response to a specific user request. They behave differently from autonomous crawlers because a person asked for something in real time, and for that reason several of them are documented as not consistently honoring robots.txt.

OpenAI

GPTBot is the training crawler. Its user-agent is GPTBot, it respects Disallow directives, and OpenAI publishes its IP ranges at openai.com/gptbot.json. To keep your content out of training:

User-agent: GPTBot
Disallow: /

OAI-SearchBot is the search crawler behind ChatGPT Search. Its user-agent is OAI-SearchBot, it honors robots.txt, and its IPs live at openai.com/searchbot.json. Block it and you may lose visibility in ChatGPT Search results.

ChatGPT-User is the retrieval crawler, the one that visits a page when a user asks ChatGPT a question. OpenAI describes it as not used for automatic crawling, and notes that because these actions are user-initiated, robots.txt rules may not apply to it. Its IPs are at openai.com/chatgpt-user.json.

There is also OAI-AdsBot, which validates the safety of pages submitted as ads on ChatGPT. It is not a content-discovery crawler and sits outside the training, search, and retrieval framing above; include it in policy only if you run ads through ChatGPT.

Anthropic

Source: Anthropic official documentation

ClaudeBot is the training crawler. Its user-agent is ClaudeBot, and Anthropic states that its bots respect "do not crawl" signals by honoring industry-standard robots.txt directives. It also supports the non-standard Crawl-delay extension. To block training:

User-agent: ClaudeBot
Disallow: /

Claude-User is the retrieval crawler, fetching content when a Claude user asks for information from a page. Its user-agent is Claude-User, and Anthropic notes that disabling it stops the system from retrieving your content in response to a user query, which can reduce your visibility for user-directed web search. It honors robots.txt even though its fetches are user-initiated.

Claude-SearchBot is the search crawler that navigates the web to improve search quality and relevance for Claude users. Its user-agent is Claude-SearchBot, it honors robots.txt, and disabling it stops your content from being indexed for search optimization, which can reduce your visibility and accuracy in Claude's search results.

All three Anthropic crawlers are confirmed in official Anthropic documentation as of April 2026.

Perplexity

Source: Perplexity official documentation

PerplexityBot is the search crawler, designed to surface and link websites in Perplexity's search results. Its user-agent is PerplexityBot, it honors robots.txt, and Perplexity is explicit that it is not used to crawl content for AI foundation models. IPs are at perplexity.com/perplexitybot.json.

Perplexity-User is the retrieval crawler that visits pages to help answer a user's question. Its user-agent is Perplexity-User, and Perplexity states plainly that because a user requested the fetch, this fetcher generally ignores robots.txt rules. It is also not used for crawling or training. IPs are at perplexity.com/perplexity-user.json.

One correction worth stating loudly, because the mistake is everywhere: Perplexity does not operate a training crawler. Any governance framework that files Perplexity under "Training" is simply wrong.

The traditional search crawlers, for reference

Googlebot and Bingbot each come with official documentation and verification guidance. Bingbot, unlike Googlebot, honors Crawl-delay.

The retrieval-crawler nuance that matters

Both Perplexity and OpenAI have documented that their retrieval crawlers, Perplexity-User and ChatGPT-User, may not consistently honor robots.txt, because they run in response to user requests rather than as autonomous crawlers. Anthropic's Claude-User is documented as honoring robots.txt while acknowledging the same user-initiated context.

The practical takeaway is that robots.txt reliably governs autonomous crawl behavior but is not a dependable lever on retrieval crawlers. If you genuinely need to control retrieval access, you reach for bot management infrastructure: WAF rules and IP-based allow or deny lists built from the official JSON sources, alongside or in place of robots.txt.
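The core of an IP-based rule is a membership test against published CIDR ranges. The sketch below works on an inline sample rather than a live fetch. The JSON shape (a prefixes list with ipv4Prefix and ipv6Prefix keys) is an assumption you should check against each operator's actual file, and the addresses are reserved documentation ranges, not real crawler IPs.

```python
import ipaddress
import json

# Assumed shape of a published range file; check the real document first.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
""")


def load_networks(doc: dict) -> list:
    """Collect every CIDR prefix in the document as a network object."""
    networks = []
    for entry in doc.get("prefixes", []):
        for cidr in entry.values():
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if the address falls inside any published range."""
    address = ipaddress.ip_address(ip)
    # Comparing across IP versions is safe: membership is simply False.
    return any(address in net for net in networks)
```

In practice the same ranges would be loaded into the WAF itself. A script like this is more useful for auditing logs, for example to confirm that requests claiming to be ChatGPT-User really originate from the published ranges.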

Matching the policy to the goal

Objective	Training Crawlers	Search Crawlers	Retrieval Crawlers
Maximum visibility	Allow	Allow	Allow
Visibility without training	Block	Allow	Allow
Search only	Block	Allow	Block
Maximum restriction	Block	Block	Block

A template for the most common stance, visibility without contributing to training:

# Training: block if you do not want content used in model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Search: allow for AI search visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Retrieval: allow for AI assistant access
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Standard search governance
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml
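If you maintain more than one stance across properties, generating the AI groups from a small policy table keeps them consistent. A minimal sketch: the crawler lists mirror the categories in this section, while the function name and the "allow"/"block" policy format are illustrative assumptions.

```python
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "retrieval": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_ai_rules(policy: dict) -> str:
    """Render one group per crawler from a category -> allow/block policy."""
    lines = []
    for category, agents in CRAWLERS.items():
        directive = "Disallow" if policy[category] == "block" else "Allow"
        lines.append(f"# {category}: {policy[category]}")
        for agent in agents:
            lines += [f"User-agent: {agent}", f"{directive}: /", ""]
    return "\n".join(lines)


VISIBILITY_WITHOUT_TRAINING = {
    "training": "block", "search": "allow", "retrieval": "allow",
}
```

build_ai_rules(VISIBILITY_WITHOUT_TRAINING) produces the AI portion of the template above. Note that a crawler with its own group does not also read the * group, so path rules such as Disallow: /search/ have to be repeated inside a named group if you want that crawler to follow them.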

Last verified June 2026. This ecosystem moves fast, so review this section quarterly and check it against the official documentation each time.

12. AI Visibility Framework

AI governance and AI visibility are easy to confuse and worth keeping apart. Governance asks what access AI systems should have to your content. Visibility asks how present you actually are inside those systems. They are not the same lever, and they can move independently: a site can block every training crawler and still get cited from content crawled earlier, and a site can allow every crawler and never once surface in an AI answer.

Related reading: Governance decides whether AI systems can reach you. What earns a citation once they do is a separate discipline. See AI Visibility: What Gets Content Cited.

It helps to think of visibility as four stacked layers:

Discovery. Can AI systems find your content at all? This rests on an accessible robots.txt, crawlable URLs, healthy sitemaps, and internal linking.
Retrieval. Can they fetch it when a user asks? This needs accessible content (no paywalls blocking retrieval crawlers), fast responses, and retrieval-crawler access.
Citation. Will they quote you in an answer? This turns on authority signals, clarity, entity recognition, and structured data. A site can be retrieved constantly and never cited.
Recommendation. Will a system actively recommend your product? The top of the stack, driven by brand authority, consistent mentions across the web, and third-party references.

The most common strategic mistake is to treat visibility as a single on-off switch when the useful question is which layer the gap sits at. A publisher with strong content but restricted retrieval crawlers scores well on discovery and badly on retrieval. An ecommerce brand with great products but thin external mentions scores well on retrieval and badly on recommendation. Diagnose the layer before you prescribe the fix.

Aligning governance with visibility goals looks like this:

Goal	Training	Search	Retrieval
Maximum visibility	Allow	Allow	Allow
Visibility without training contribution	Block	Allow	Allow
Search visibility, controlled retrieval	Block	Allow	Block*
Maximum restriction	Block	Block	Block

*Blocking retrieval crawlers through robots.txt may not be fully reliable, given the user-initiated behavior Perplexity and OpenAI both document.
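
The table can also be kept as data and rendered into directives, which makes the chosen stance easier to review in a change request. The following is a minimal sketch under stated assumptions: the CRAWLERS and GOALS dictionaries simply restate the table and the crawler lists from this guide, and render_ai_groups is a hypothetical helper, not part of any tool. It groups several User-agent lines above one rule, which RFC 9309 permits.

```python
# Crawler tokens per category, as listed in this guide.
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "retrieval": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

# One row per goal in the table: (training, search, retrieval).
GOALS = {
    "maximum visibility": ("allow", "allow", "allow"),
    "visibility without training": ("block", "allow", "allow"),
    "search visibility, controlled retrieval": ("block", "allow", "block"),
    "maximum restriction": ("block", "block", "block"),
}


def render_ai_groups(goal: str) -> str:
    """Render the AI crawler groups of a robots.txt file for one goal."""
    decisions = dict(zip(CRAWLERS, GOALS[goal]))
    lines = []
    for category, agents in CRAWLERS.items():
        lines.append(f"# {category}: {decisions[category]}")
        lines.extend(f"User-agent: {agent}" for agent in agents)
        lines.append("Disallow: /" if decisions[category] == "block" else "Allow: /")
        lines.append("")  # blank line between groups for readability
    return "\n".join(lines)


print(render_ai_groups("visibility without training"))
```

The output covers only the AI groups. The standard search rules and Sitemap lines would still be maintained separately, and a "block" on the retrieval row carries the same reliability caveat as the footnote above.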

The emerging publisher position has settled into a recognizable shape: block training to protect a content-licensing negotiating position, allow search to keep AI search visibility and referral potential, and allow retrieval to support user-initiated access. It keeps the visibility intact while holding leverage on the training-data question.

13. AI Governance Adoption Model

Most organizations today sit in one of two places: no AI policy at all, or a blanket block on everything. Neither one is governance. The model below traces the path from not knowing to governing on purpose.

Level	Description
0: No awareness	No policy, no owner, no monitoring. The organization does not know which AI systems are accessing its content or how often.
1: Reactive	A snap decision, usually a blanket block on GPTBot. Often incomplete and undocumented.
2: Basic	Training, search, and retrieval crawlers identified as distinct categories, with an intentional decision for each.
3: Documented	Written policies per category, ownership assigned, a review schedule in place.
4: Operational	Monitoring live. Server logs reviewed for AI crawler activity. Policy reviews actually happening.
5: Strategic	AI policy integrated with content, SEO, and licensing strategy. Business objectives drive it.
6: Enterprise	Cross-functional ownership across SEO, legal, editorial, engineering, and product.
7: Adaptive	Continuous monitoring and refinement, with quarterly reviews built into the operating rhythm.

Wherever you sit, you should be able to answer five questions without pausing: Can AI models train on our content? Can AI search engines index it? Can AI assistants retrieve it for users? Are these policies documented and owned? When were they last reviewed?

The patterns by industry are fairly consistent:

Publishers increasingly block training while allowing search and retrieval. The New York Times restricting AI training access in 2023 set off a wave of publisher policy development.
Ecommerce tends to allow search and retrieval to fuel AI shopping discovery, with training decisions varying, as AI product recommendations become a competitive priority.
SaaS companies often allow everything, since visibility in AI tool recommendations drives growth.
Regulated industries tend to restrict across the board, driven by compliance rather than strategy.

Part 5: Decision Frameworks

14. The Robots.txt Decision Framework

The most useful thing to understand about robots.txt errors is that most of them are not syntax errors. They are tool-selection errors: reaching for robots.txt when the job actually needed noindex, or a canonical, or authentication. The framework below is built to catch that before you write a single directive.

Run the pre-filter first. Is this content sensitive or behind authentication? If yes (admin areas, customer portals, internal dashboards, private data), stop here and use authentication and access controls. Robots.txt is the wrong tool, full stop. If no, continue.

Then work through crawl and index, in order.

Should crawlers access this URL at all? If no (search pages, checkout, session URLs, infinite filter combinations, preview environments, crawl traps), use a robots.txt Disallow and you are done. If yes, keep going.

Should it appear in search results? If no (thank-you pages, campaign landing pages, internal tools, thin content not meant for search), allow the crawl and apply noindex. Do not use robots.txt for this, because a blocked page never reveals its noindex tag and can be indexed from external links anyway. If yes, keep going.

Is it a duplicate of another URL? Only ask this once you have confirmed the URL is both crawlable and indexable, since canonicals on blocked pages are invisible. If it is a duplicate (product variants, parameter URLs, tracking duplicates), use a canonical tag pointing at the preferred URL. If not, keep going.

Has the content permanently moved? If yes (migrations, restructures, consolidations, deletions with a good replacement), use a 301. If no, the URL is crawlable, indexable, canonical, and current, and you move on to the AI layer.
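
The ordering of those questions matters more than any single answer, and it can be written down as a short function. This is a sketch of the sequence described above, not a tool from any vendor; the function name choose_tool and its boolean parameters are assumptions chosen for illustration.

```python
def choose_tool(*, sensitive: bool, should_crawl: bool, should_index: bool,
                is_duplicate: bool, permanently_moved: bool) -> str:
    """Walk the pre-filter and the crawl/index questions in order.

    The first question that applies decides the tool; later questions
    are only reached when every earlier one passed.
    """
    if sensitive:
        return "authentication and access controls"
    if not should_crawl:
        return "robots.txt Disallow"
    if not should_index:
        return "noindex, with crawling allowed"
    if is_duplicate:
        return "canonical tag"
    if permanently_moved:
        return "301 redirect"
    return "no restriction needed"


# An internal search page: not sensitive, but should never be crawled.
print(choose_tool(sensitive=False, should_crawl=False, should_index=False,
                  is_duplicate=False, permanently_moved=False))

# A thank-you page: crawlable, but should stay out of search results.
print(choose_tool(sensitive=False, should_crawl=True, should_index=False,
                  is_duplicate=False, permanently_moved=False))
```

Because the checks return early, a sensitive URL never reaches the robots.txt branch, which mirrors the rule that the file is never the answer for protected content.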

Finally, the AI decisions, which apply to all content, even content already blocked for traditional search.

Training access: do you want AI models training on this? Allow needs no directive; block uses a crawler-specific Disallow: / on GPTBot and ClaudeBot; conditional blocks by path rather than sitewide.

AI search access: do you want to appear in AI search? Allow needs no directive; block uses a crawler-specific Disallow on OAI-SearchBot, Claude-SearchBot, and PerplexityBot; conditional grants selective path access.

Retrieval access: do you want AI assistants fetching content for users? Allow needs no directive and is the most common choice; block uses a crawler-specific Disallow but may not be fully reliable; conditional pairs robots.txt with WAF rules. For anything you actually need enforced on retrieval crawlers, remember that Perplexity-User generally ignores robots.txt and ChatGPT-User may too, so build WAF-level rules from the official IP ranges rather than trusting the file alone.
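
The core of an IP-based rule is a simple membership test: does the request's address fall inside a range the vendor has published for that crawler? The sketch below shows that test with Python's standard ipaddress module. Everything in PUBLISHED_RANGES is a placeholder: the CIDR blocks are reserved documentation ranges, not real vendor addresses, and in practice the list would be loaded from each vendor's official file, whose exact format is not assumed here. The function name classify_request is also hypothetical.

```python
from ipaddress import ip_address, ip_network

# PLACEHOLDER data: documentation-only CIDR blocks, not real vendor ranges.
# Replace with ranges loaded from each vendor's official published list.
PUBLISHED_RANGES = {
    "ChatGPT-User": ["192.0.2.0/24"],
    "Perplexity-User": ["198.51.100.0/24"],
}


def classify_request(user_agent: str, remote_ip: str,
                     ranges: dict = PUBLISHED_RANGES) -> str:
    """Compare a claimed crawler identity against its published IP ranges.

    Returns "verified" when the address is inside a listed range,
    "spoofed" when the user-agent claims a crawler but the address is not,
    and "unlisted" when the user-agent names no crawler we track.
    """
    claimed = user_agent.lower()
    for token, cidrs in ranges.items():
        if token.lower() in claimed:
            address = ip_address(remote_ip)
            if any(address in ip_network(cidr) for cidr in cidrs):
                return "verified"
            return "spoofed"
    return "unlisted"


print(classify_request("Mozilla/5.0 (compatible; ChatGPT-User/1.0)", "192.0.2.44"))
print(classify_request("Mozilla/5.0 (compatible; ChatGPT-User/1.0)", "203.0.113.9"))
```

A WAF would apply the same logic at the edge. The point of the sketch is only that the decision rests on the address, because a user-agent string alone can be claimed by anyone.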

The tool-selection cheat sheet, one more time, because it is the heart of the whole thing:

Goal	Correct Tool
Prevent crawling	robots.txt Disallow
Prevent indexing	noindex meta tag or X-Robots-Tag
Consolidate duplicates	Canonical tag
Redirect content	301 redirect
Secure content	Authentication + access controls
Block AI training	robots.txt (training crawler user-agents)
Block AI search	robots.txt (search crawler user-agents)
Control AI retrieval	robots.txt + WAF for reliable enforcement
License content commercially	Legal agreements (robots.txt is not a contract)

15. URL Governance Worksheet

Govern by URL class, not by individual page. Run each pattern through the worksheet below once, write down the decisions, and you have something auditable that survives staff turnover and CMS changes.

Field	Value
URL pattern
Business owner
SEO owner
Engineering owner
Purpose
Business value	High / Medium / Low
Should be crawled?	Yes / No
Should be indexed?	Yes / No
Duplicate version exists?	Yes / No
Canonical target
Sensitive / auth required?	Yes / No
AI training allowed?	Allow / Block / Conditional
AI search allowed?	Allow / Block / Conditional
AI retrieval allowed?	Allow / Block / Conditional
Conditions / notes
Sitemap inclusion?	Yes / No
Last reviewed
Last change and reason
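
If the worksheet lives in code or a spreadsheet export rather than a document, a few of its rows can check each other. The sketch below models a subset of the fields as a Python dataclass and flags combinations this guide warns about elsewhere: sitemap entries that are blocked or noindexed, canonicals on blocked URLs, and duplicates with no canonical target. The class name UrlClassPolicy, its field names, and the sample patterns are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UrlClassPolicy:
    """A subset of the worksheet fields for one URL class."""
    pattern: str
    crawl: bool
    index: bool
    in_sitemap: bool
    duplicate_exists: bool = False
    canonical_target: Optional[str] = None

    def problems(self) -> List[str]:
        """Return contradictions between the recorded decisions."""
        found = []
        if self.in_sitemap and not self.crawl:
            found.append("in sitemap but blocked from crawling")
        if self.in_sitemap and not self.index:
            found.append("in sitemap but not meant to be indexed")
        if self.canonical_target and not self.crawl:
            found.append("canonical set on a blocked URL, so crawlers cannot see it")
        if self.duplicate_exists and not self.canonical_target:
            found.append("duplicate exists but no canonical target recorded")
        return found


# Hypothetical rows: a clean search-page policy and a contradictory one.
search_pages = UrlClassPolicy("/search/*", crawl=False, index=False, in_sitemap=False)
sort_urls = UrlClassPolicy("/*?sort=", crawl=False, index=False, in_sitemap=True,
                           duplicate_exists=True, canonical_target="/shoes")

print(search_pages.problems())
print(sort_urls.problems())
```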

A filled-in version for a publisher:

URL Type	Crawl	Index	AI Retrieval	Notes
Article	Yes	Yes	Allow	Core asset
News article	Yes	Yes	Allow	News sitemap required
Category hub	Yes	Yes	Allow
Topic hub	Yes	Usually yes	Allow	Evaluate by demand
Author page	Depends	Depends	Depends	Evaluate by traffic
Tag page	Depends	Depends	Depends	Thin tags get noindex
Search page	No	No	Block	Crawl trap
Preview URL	No	No	Block
Archive (recent)	Yes	Yes	Allow
Archive (deep)	Depends	Depends	Allow	Evaluate crawl cost

And for ecommerce:

URL Type	Crawl	Index	AI Retrieval	Notes
Product page	Yes	Yes	Allow	Primary asset
Category page	Yes	Yes	Allow	Primary asset
High-value facet	Yes	Yes	Allow	Verify demand
Low-value facet	Usually no	Usually no	Allow	Assess individually
Sort URL	Usually no	Usually no	Allow
Search results	No	No	Block	Crawl trap
Cart	No	No	Block	No SEO value
Checkout	No	No	Block	No SEO value
Account	No	No	Block	Auth required

A template for recording the AI access decision per crawler:

Access Type	Policy	Notes
Googlebot
Bingbot
GPTBot (training)	Allow / Block / Conditional
ClaudeBot (training)	Allow / Block / Conditional
OAI-SearchBot (search)	Allow / Block / Conditional
Claude-SearchBot (search)	Allow / Block / Conditional
PerplexityBot (search)	Allow / Block / Conditional	Not a training crawler
ChatGPT-User (retrieval)	Allow / Block / Conditional	May not honor robots.txt
Claude-User (retrieval)	Allow / Block / Conditional
Perplexity-User (retrieval)	Allow / Block / Conditional	Generally ignores robots.txt

16. URL Lifecycle Governance

A URL's value is not fixed. The page that deserves top crawl priority today may deserve none in two years, and mature organizations govern that arc deliberately rather than letting it drift.

A URL moves through roughly six stages:

Creation. The goal is fast discovery and indexing. Add the URL to the sitemap immediately, give it internal links, and, for publishers, include it in the news sitemap.
Growth. As it gains traffic and authority, watch crawl frequency in Search Console and keep rendering resources accessible.
Maturity. Maintain visibility and update <lastmod> when the content refreshes.
Decline. When traffic fades, decide whether to refresh, consolidate, or plan retirement. Does it still earn links and serve intent?
Archive. Reference value without active promotion, usually still crawlable and indexable. Archived authoritative content is frequently cited by AI systems, a quietly valuable asset teams underrate.
Retirement. When value is gone: 301 to a good replacement, 410 when there is none, a kept redirect to preserve authority, or maintain the page if legal retention requires it.

URL State	Crawl	Index	Sitemap	Crawl Priority
New	Yes	Yes	Yes	High
Growing	Yes	Yes	Yes	High
Mature	Yes	Yes	Yes	Moderate
Declining	Depends	Depends	Depends	Lower
Archived	Usually yes	Often yes	Sometimes	Low
Retired	Usually no	Usually no	No	None

The reason to govern by class rather than by individual URL is simply that the by-URL approach does not survive scale. The most mature organizations write lifecycle policies for whole classes (articles, products, categories, events, campaigns, documentation) and apply them consistently, which is what keeps governance auditable as the site grows.

Part 6: Operations

17. Robots.txt During Site Migrations

Migrations have caused some of the worst SEO disasters on record, and the culprit is surprisingly often not a redirect chain or a canonical mistake. It is a single robots.txt line that shipped at launch.

The line is this one:

User-agent: *
Disallow: /

On a staging environment it is exactly right. Ride it into production by accident and it is catastrophic: Googlebot is told to stop crawling, discovery halts, content refresh stops, and recovery can take weeks. Staging restrictions reaching production is the single most common source of robots-related traffic disasters, which is why it earns its own line on every checklist below.

Before launch, review the production robots.txt explicitly rather than assuming it is correct, confirm staging is restricted and protected from crawling, review every noindex directive for intended versus accidental, review canonicals sitewide, verify redirect chains are complete, and confirm sitemap declarations point at correct paths in the new environment.
On launch day, verify example.com/robots.txt returns a 200, verify it contains no accidental Disallow: /, verify sitemap URLs are reachable, verify no asset directories are blocked, and submit the updated sitemap to Search Console.
For at least 30 days after, monitor Crawl Stats daily, watch the Index Coverage report, review server logs for Googlebot activity, and watch for any drop in crawl rate.

Source: Google Site Migration Documentation
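
The launch-day check for an accidental Disallow: / is easy to automate. The sketch below scans robots.txt text and reports which user-agents have a bare Disallow: / in their group. It is a simplified reading of the file under stated assumptions: the function name sitewide_block_agents is hypothetical, the parsing handles only User-agent, Allow, and Disallow lines, and a group that pairs Disallow: / with narrower Allow rules is still reported so that a person can look at it.

```python
from typing import List


def sitewide_block_agents(robots_txt: str) -> List[str]:
    """Return the user-agents whose group contains a bare 'Disallow: /'."""
    blocked: List[str] = []
    current_agents: List[str] = []
    in_rules = False  # True once the current group has seen a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in current_agents if a not in blocked)
    return blocked


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: GPTBot\nDisallow: /\nUser-agent: *\nDisallow: /search/\n"

print(sitewide_block_agents(staging))
print(sitewide_block_agents(production))
```

In a deploy pipeline, the useful assertion is that "*" is absent from the result for the production file. An intentional block on a training crawler would still appear in the list, which is expected.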

18. Auditing Framework

Most robots.txt files get written once and then forgotten, while the world around them keeps moving: migrations bolt on directives, CMS updates reshape URLs, new URL types appear, fresh AI crawler categories arrive. Skip regular auditing and governance quietly drifts out of alignment with reality.

How often depends on the site. Small sites can audit quarterly. Publishers and ecommerce sites should audit monthly. Enterprises should audit monthly and again after any significant change. After a migration, audit immediately and again 30 days post-launch.

The checklist itself:

Foundation

robots.txt exists at the root
Returns HTTP 200
UTF-8 encoded
Under 500 KiB

Crawl governance

Search URLs reviewed
Filter URLs reviewed
Session parameters reviewed
Preview URLs reviewed
Archive pagination reviewed

Assets

CSS directories accessible
JavaScript directories accessible
Images accessible
Framework assets (/_next/, etc.) accessible

Sitemaps

All sitemaps declared
All declared sitemaps return 200
No redirected URLs in sitemaps
No noindexed URLs in sitemaps
<lastmod> dates current

AI governance

Training crawler policy documented
Search crawler policy documented
Retrieval crawler policy documented
Policy matches business intent

Subdomains

All subdomains have robots.txt
All subdomain files reviewed

Monitoring

Crawl Stats reviewed in Search Console
Server logs reviewed for crawler activity
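
The Foundation items in the checklist lend themselves to an automated check that runs alongside the manual review. The sketch below takes a status code and response body that the caller has already fetched and reports which foundation checks fail. The function name foundation_problems is an assumption, and fetching the file is deliberately left out so the example stays free of network calls.

```python
from typing import List

MAX_BYTES = 500 * 1024  # the 500 KiB limit from the checklist


def foundation_problems(status: int, body: bytes) -> List[str]:
    """Check an already-fetched robots.txt response against the Foundation list."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if len(body) > MAX_BYTES:
        problems.append(f"file is {len(body)} bytes, over the 500 KiB limit")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


healthy = b"User-agent: *\nDisallow: /search/\n"
print(foundation_problems(200, healthy))
print(foundation_problems(404, b"\xff\xfe"))
```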

19. Testing Framework

Test every robots.txt change before it ships. There are no exceptions worth making, because the failure mode is invisible until traffic drops.

Work through it in layers:

Syntax. Confirm the file parses. Watch for the usual culprits: a misspelled Disalow:, a missing leading slash (Disallow: search/ instead of Disallow: /search/), or rules placed before any User-agent line, which RFC 9309 says should be ignored.
URL behavior. For every significant category, confirm the important pages are crawlable, the search and preview pages are blocked, and the asset directories are open.
Patterns. For any wildcard rule, write out representative URLs and check the rule matches what you actually meant.

Test this rule:

Disallow: /*?sort=

Against these URLs:

/shoes?sort=price -> should be blocked
/shoes?size=10 -> should NOT be blocked
/shoes?sort=price&size=10 -> should be blocked
/articles/sorting-algorithms -> should NOT be blocked
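
That pattern test can be run mechanically rather than by eye. The sketch below translates a robots.txt path pattern into a regular expression, treating * as "zero or more of any character" and a trailing $ as "end of URL", then checks the four URLs above. The function names are assumptions, and the translation is a simplified reading of the matching rules (it ignores percent-encoding), not a copy of any crawler's implementation.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Convert a robots.txt path pattern to an anchored regular expression."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape literal text and turn each * into "match anything".
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_match(pattern: str, path: str) -> bool:
    """True when the rule pattern matches the URL path from its start."""
    return pattern_to_regex(pattern).match(path) is not None


RULE = "/*?sort="
EXPECTED = {
    "/shoes?sort=price": True,
    "/shoes?size=10": False,
    "/shoes?sort=price&size=10": True,
    "/articles/sorting-algorithms": False,
}

for path, should_block in EXPECTED.items():
    result = is_match(RULE, path)
    status = "ok" if result == should_block else "MISMATCH"
    print(f"{path:32} blocked={result} {status}")
```

Writing the URLs out this way also exposes gaps. Under this reading, /shoes?size=10&sort=price is not matched by /*?sort=, because the rule requires sort to follow the question mark directly. Whether that matters depends on how the site orders its parameters.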

Production verification. Right after deploy, fetch the live file and read it, test representative URLs with the URL Inspection tool, and watch Crawl Stats for any unexpected drop.
Environments. Confirm production is crawlable where intended, staging is blocked sitewide, and every subdomain is verified on its own.

20. Log File Analysis

Search Console shows you what Google reports. Server logs show you what actually happened, and the two are not always the same. Every request writes a line recording which crawler made it, when, what response code it got, and how long the server took. No other tool gives you crawl reality at that resolution.

The questions worth asking the logs are simple: which crawlers visit, which URLs they crawl, how often, what response codes they get back, and whether crawl resources are landing on your business priorities or wandering off into waste.

What to watch depends on the crawler:

Search crawlers (Googlebot, Bingbot): crawl frequency by URL category, response code distribution, your most-crawled URLs (are these your best pages?), and your least-crawled ones (are these pages you care about?).
AI training crawlers (GPTBot, ClaudeBot): crawl volume and frequency, which sections they target, and whether your policies are being respected.
AI search crawlers: discovery patterns and which sections get attention.
AI retrieval crawlers: what content is fetched and how often, since that tells you what is being surfaced to AI users in real time.

Response codes deserve their own quick reference:

Code	Meaning	Action
200	Page crawled successfully	Monitor the distribution
301	Redirect	Minimize chains
404	Not found	Return 410 for permanent removals
410	Gone, permanently removed	Stronger stop signal than 404
500	Server error	Investigate immediately; 5xx makes crawlers back off

Sources: Google Crawl Budget Documentation; Google Crawl Stats Documentation
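
A first pass at these questions needs little more than a regular expression and a counter. The sketch below tallies requests by crawler category and response code from access-log lines. It assumes the common "combined" log layout, and the sample lines use simplified, made-up user-agent strings and documentation IP addresses, so both the pattern and the token list would need adjusting to a real server's format. Matching on the user-agent string alone also counts spoofed requests, which is why IP verification still matters for anything policy-related.

```python
import re
from collections import Counter

# Assumes the "combined" access log layout; adjust for your server.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Crawler tokens grouped by the categories used in this guide.
CRAWLER_CATEGORIES = {
    "Googlebot": "search engine", "bingbot": "search engine",
    "GPTBot": "ai training", "ClaudeBot": "ai training",
    "OAI-SearchBot": "ai search", "Claude-SearchBot": "ai search",
    "PerplexityBot": "ai search",
    "ChatGPT-User": "ai retrieval", "Claude-User": "ai retrieval",
    "Perplexity-User": "ai retrieval",
}


def summarize(lines) -> Counter:
    """Count requests per (crawler category, status code)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token, category in CRAWLER_CATEGORIES.items():
            if token.lower() in agent:
                counts[(category, match.group("status"))] += 1
                break
    return counts


# Illustrative lines only: simplified user-agents, documentation IPs.
SAMPLE = [
    '192.0.2.10 - - [01/Jun/2026:10:00:00 +0000] "GET /articles/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [01/Jun/2026:10:00:05 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.12 - - [01/Jun/2026:10:00:09 +0000] "GET /pricing HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

for key, total in summarize(SAMPLE).items():
    print(key, total)
```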

21. Failure Postmortems

These are the failure patterns the SEO field has documented most often, and the ones that cost the most when they land. Each is worth recognizing on sight.

Sitewide block from a staging deployment. The User-agent: * / Disallow: / that belonged on staging rides along to production through an infrastructure slip or a missing checklist line. Googlebot stops crawling the whole site, and recovery can take weeks with ranking loss along the way. Risk: critical. Prevention: name robots.txt on every launch checklist and confirm the production file returns 200 with correct content on launch day.
Blocking CSS. A Disallow: /css/ left over from an era when blocking assets was considered a crawl-budget trick. Google can no longer fully render, layout comprehension degrades, and structured data may go unprocessed. Risk: high.
Blocking JavaScript. Same legacy thinking, or a misunderstanding of framework paths, producing Disallow: /js/ or Disallow: /_next/. Google cannot render JavaScript-dependent content, navigation links go undiscovered, and in the worst case the whole site reads as an empty page. Risk: high.
Blocking product categories. Aggressive crawl-budget optimization that never checked the business value of category pages. Category rankings collapse and revenue-driving keywords drop out of search. Risk: critical for ecommerce.
Internal search left open. /search?q= crawlable with no restriction, so search result URLs pile up in the queue, an infinite supply of low-value combinations eats the budget, and high-value content gets crawled less. Risk: medium to high.
Faceted navigation explosion. Filter URLs like ?size=, ?color=, and ?brand= all crawlable with no governance, turning millions of combinations crawlable, draining the budget, and starving product and category pages. Risk: high for ecommerce.
AI governance misconfiguration. The intent was to block training and allow search and retrieval, but a misread of precedence, or more often a too-broad wildcard, blocks every AI crawler instead. AI search visibility disappears without anyone deciding it should. Risk: increasing.
Governance drift. The file launched sensible, then time passed: new URL types, a CMS update, new AI crawlers, changed ownership, and nobody revisited it. Crawl traps that should be closed stay open, and restrictions on valuable content that should be lifted stay in place. Risk: high over time.

When you are actually working a failure, the diagnostic questions are almost always the same six: What changed? When? Who changed it? Was it tested before deployment? Was it reviewed and approved? Was it monitored afterward? The answers usually point at a governance gap rather than a technical one.

Failure Type	Risk Level
Sitewide block from staging	Critical
Product page block	Critical
Category page block	Critical
JavaScript blocked	High
CSS blocked	High
Governance drift	High
Facet explosion	High
Internal search open	Medium to High
AI governance misconfiguration	Increasing
Missing sitemap declarations	Medium

22. Enterprise Governance Operating Model

At enterprise scale, almost every robots.txt failure is a governance failure rather than a technical one. The syntax is fine and the file parses. The real problem is that someone made the wrong decision, or nobody made the right one. The fix is to treat robots.txt as what it is: a production configuration file that deserves the same rigor as any other.

Ownership is where it starts, and a simple RACI keeps it honest:

Crawl governance: SEO is accountable and defines the policy, engineering is responsible and implements it, and product and editorial are consulted on priorities and content visibility.
AI governance: legal is accountable for licensing and compliance, SEO is consulted on visibility, engineering is responsible for implementation, and product is consulted on business impact.
Deployment: engineering executes, SEO must approve before anything ships, and product stays aware of the impact.

Every change should move through a structured path: a documented proposal with rationale, a business-impact review, an SEO review, an engineering review, testing in staging, sign-off from the designated approvers, deployment, at least 14 days of post-deployment monitoring, and a rollback plan prepared before deployment rather than after something breaks.

A change-request template keeps that path repeatable:

Requestor:
Date:
Proposed change (exact directive):
Reason for change:
Impacted URL patterns:
Estimated impacted URL count:
Expected outcome:
Risk assessment: Low / Medium / High / Critical
Testing completed: Yes / No
Rollback plan:
Approvers:

Review cadence tracks the audit cadence: quarterly for small sites, monthly for publishers and ecommerce, and monthly plus change-triggered reviews for enterprise.

Part 7: Real-World Analysis

23. Real-World Analysis Framework

Reading the robots.txt files of organizations operating at scale shows you something no documentation can: the decisions sophisticated teams actually make under real operational pressure. The point is not to copy them, since every site has its own objectives, architecture, and priorities. The point is to read the patterns, the choices that keep recurring across industries and scales, and to understand the reasoning underneath them.

For any file you study, look at six things: what is blocked and why, how sitemaps are used to manage discovery, whether rendering assets stay accessible, how crawl waste is reduced, whether AI crawler policies are present, and how sophisticated the implementation looks overall. And keep asking the one question that teaches the most. Not "what is blocked?" but "why is it blocked?" Every directive is a business decision in disguise, and the decision tells you more than the syntax does.

24. Publisher Analysis

Google's own robots.txt

Google's public file, at https://www.google.com/robots.txt, is one of the most instructive examples of crawl governance at extraordinary scale, partly because the company that builds the crawler is so careful about its own crawl exposure.

It blocks its own search results sitewide, then carves out the handful of search-related URLs that carry standalone informational value:

Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks

That is exactly the pattern this guide recommends, executed with surgical precision. The same precision shows up at the parameter level, where Google pairs Allow and Disallow with wildcards and end markers:

Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&

The lessons travel well. Scale forces specificity, so large inventories need surgical rules rather than broad ones. Internal systems should be invisible to crawlers. Allow and Disallow used together inside one crawler group give you precision neither directive can reach alone. And even the company that runs the web's best-known crawler governs its own exposure carefully.
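
The reason those Allow and Disallow pairs work is the precedence rule: among all rules that match a URL, the longest pattern wins, and Allow wins a tie. The sketch below implements that rule and runs it against the two snippets quoted above. It is a simplified illustration, not Google's parser: the function names are assumptions, pattern length is measured as plain character count, and percent-encoding is ignored.

```python
import re


def _to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate * (any characters) and a trailing $ (end of URL) to a regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """Apply longest-match precedence; no matching rule means allowed."""
    best_length, best_allow = -1, True
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            allow = kind == "allow"
            longer = len(pattern) > best_length
            tie_for_allow = len(pattern) == best_length and allow
            if longer or tie_for_allow:
                best_length, best_allow = len(pattern), allow
    return best_allow


SEARCH_RULES = [
    ("disallow", "/search"),
    ("allow", "/search/about"),
    ("allow", "/search/howsearchworks"),
]
PARAM_RULES = [
    ("allow", "/?hl=*&gws_rd=ssl$"),
    ("disallow", "/?hl=*&"),
]

print(is_allowed(SEARCH_RULES, "/search?q=robots"))      # broad Disallow applies
print(is_allowed(SEARCH_RULES, "/search/about"))         # longer Allow wins
print(is_allowed(PARAM_RULES, "/?hl=en&gws_rd=ssl"))     # longer Allow wins
print(is_allowed(PARAM_RULES, "/?hl=en&q=robots"))       # only Disallow matches
```

Rule order in the file plays no part in this logic, which is why the carve-outs can sit after the broad Disallow without being shadowed by it.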

GitHub's robots.txt

GitHub's file, at https://github.com/robots.txt, governs one of the most URL-intensive user-generated platforms on the web, and it shows what aggressive, deliberate governance looks like.

It uses crawler-specific rules where different crawlers need different treatment:

User-agent: bingbot
Disallow: /ekansa/Open-Context-Data
Disallow: */tarball/
Disallow: */zipball/
User-agent: *
Disallow: /*/*/pulse
Disallow: /*/*/projects

It restricts repository metadata views such as commits, branches, contributors, and tags, which change constantly, generate crawl waste, and offer no SEO value in return:

Disallow: /*/*/commits/
Disallow: /*/*/branches
Disallow: /*/*/contributors
Disallow: /*/*/tags
Disallow: /*/*/stargazers
Disallow: /*/*/watchers
Disallow: /*/*/network
Disallow: /*/*/graphs
Disallow: /*/*/compare

It blocks download and archive content regardless of URL structure, and it blocks tracking parameters:

Disallow: /*/download
Disallow: /*/archive/
Disallow: /*source=*
Disallow: /*ref_cta=*

The takeaways: user-generated platforms need aggressive URL governance by default, crawler-specific rules are a legitimate tool when crawlers genuinely need different treatment, and binary or download content should stay out of crawl no matter how the URLs are shaped.

Part 8: References

25. Security Myths

Three myths refuse to die, and each one is dangerous.

Myth: Disallow protects content. It does not. RFC 9309 is explicit that the protocol is not a substitute for valid content security measures. A Disallow asks compliant crawlers not to visit a path. It blocks no one, and the URL stays publicly accessible to anyone who knows it.
Myth: Disallow hides content. It does the opposite. Listing a path in robots.txt puts that path on public display in the file itself, and security researchers, competitors, and attackers all read robots.txt precisely to find the paths an owner hoped to bury.
Myth: Bots must obey robots.txt. Compliant, good-faith crawlers honor it. Malicious bots do not. A scraper after your content for unauthorized use will not pause for a Disallow.

The correct approach to security is the boring, effective one: authentication and authorization at the application layer, firewalls and WAF rules, rate limiting, and access controls. Robots.txt is governance, not protection.

Sources: RFC 9309, Section 3; OWASP Access Control Guidance

26. Robots.txt vs Meta Robots vs X-Robots-Tag

These three get confused constantly because all three touch crawler behavior, but they solve entirely different problems.

Robots.txt controls crawling at the URL level through a file at the root.

User-agent: *
Disallow: /search/

The meta robots tag controls indexing for a single HTML page through a tag in the <head>.

<meta name="robots" content="noindex">
<meta name="robots" content="noindex, nofollow">

X-Robots-Tag controls indexing for any content type through an HTTP response header, which makes it the right tool for non-HTML files like PDFs and images that cannot carry a <head> tag.

X-Robots-Tag: noindex
X-Robots-Tag: noindex, nofollow
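
How the header gets attached depends on the server, but the idea is the same everywhere: add it to the response for the file types that cannot carry a meta tag. The sketch below shows one way to do that for an application served through Python's WSGI interface. The middleware name noindex_pdfs, the demo application, and the sample paths are all assumptions for illustration; a real deployment might set the header in the web server or CDN configuration instead.

```python
def noindex_pdfs(app):
    """Wrap a WSGI app so every .pdf response carries X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        is_pdf = environ.get("PATH_INFO", "").lower().endswith(".pdf")

        def start_with_header(status, headers, exc_info=None):
            if is_pdf:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that answers every path with a small body."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"placeholder body"]


def headers_for(path: str) -> dict:
    """Call the wrapped demo app directly and capture its response headers."""
    captured = {}

    def capture(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    list(noindex_pdfs(demo_app)({"PATH_INFO": path}, capture))
    return captured["headers"]


print(headers_for("/files/report.pdf"))
print(headers_for("/pricing"))
```

The header only works if the crawler can fetch the file, so the PDF paths must not also be disallowed in robots.txt.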

The interaction between them is the part that causes real damage. Block a page in robots.txt and Google cannot crawl it. If Google cannot crawl it, it never reads the meta robots tag. So a page carrying noindex that is also blocked by robots.txt may still end up indexed, discovered through an external link with the noindex instruction never verified.

The rule that follows is firm: never combine robots.txt blocking with noindex. Pick one. For content you want to exist but keep out of search, allow the crawl and apply noindex. For content you want out of the crawl queue entirely, use robots.txt.

Method	Crawl Control	Index Control
robots.txt	Yes	No
Meta robots	No	Yes
X-Robots-Tag	No	Yes

Sources: Google Robots Meta Tag Documentation; Google Block Indexing Documentation

27. The Future of Robots.txt

Robots.txt is not going obsolete.

It remains the foundational way a site tells compliant agents what it prefers about crawling, and RFC 9309's formalization in 2022 strengthened its standing rather than weakening it. What is changing is its scope and its complexity.

AI governance is the defining evolution. For 29 years the file was a search governance tool. In roughly three years it became an AI governance tool as well, and splitting crawlers into training, search, and retrieval, then governing each independently, is a genuine structural change in what the file is for.

Retrieval-crawler compliance is the open problem. Both Perplexity and OpenAI have documented that their user-initiated retrieval crawlers may not consistently honor robots.txt, and as AI assistants fetch more content on behalf of users, the line between crawling and browsing keeps blurring. Closing that gap will probably require governance mechanisms that do not yet exist in mature form.

Licensing and governance are converging, with major publishers now treating robots.txt as one part of a wider AI content policy that runs all the way to commercial licensing negotiations. And bot management platforms are extending what the file can do, with Cloudflare, Akamai, and their peers offering managed AI crawler controls, IP verification, and enforcement that operates above the robots.txt layer.

The most likely near-term shape is a layered one: robots.txt stays the baseline governance layer, while platform-level enforcement, authentication, and possibly new protocol extensions cover the compliance gaps it cannot close alone.

Appendices

Appendix A: Directive Reference

Directive	Function	Standard?
User-agent	Defines which crawler receives the rules	RFC 9309
Disallow	Blocks crawling of matching paths	RFC 9309
Allow	Overrides a broader Disallow; longest match wins	RFC 9309
Sitemap	Declares sitemap location	Widely supported; not in core RFC 9309 spec
Crawl-delay	Suggests delay between requests	Not RFC 9309; Google ignores it; Bing and Anthropic support it

Appendix B: Wildcard Reference

Character	Meaning	Example
*	Zero or more of any character	Disallow: /*?sort=
$	End of the URL pattern	Disallow: /*.pdf$

Appendix C: Crawl vs Index vs Render vs Rank

Process	Meaning	Controlled By
Crawl	Bot requests and retrieves the URL	robots.txt
Parse	Bot processes the HTML	Cannot be separately controlled
Render	Bot executes JavaScript and applies CSS	Requires rendering resources to be crawlable
Index	Search engine stores page information	noindex / X-Robots-Tag
Rank	Page becomes eligible for results	Content, links, signals

Appendix D: AI Crawler Reference Table (2026)

Crawler	Organization	Category	Robots.txt Compliance	Training Use?	Source
GPTBot	OpenAI	Training	Yes	Yes	developers.openai.com
OAI-SearchBot	OpenAI	Search	Yes	No	developers.openai.com
ChatGPT-User	OpenAI	Retrieval	May not consistently comply	No	developers.openai.com
ClaudeBot	Anthropic	Training	Yes	Yes	support.claude.com
Claude-SearchBot	Anthropic	Search	Yes	No	support.claude.com
Claude-User	Anthropic	Retrieval	Yes	No	support.claude.com
PerplexityBot	Perplexity	Search	Yes	No	docs.perplexity.ai
Perplexity-User	Perplexity	Retrieval	Generally ignores robots.txt	No	docs.perplexity.ai

Perplexity does not operate a training crawler. Any framework that files Perplexity under "Training" is incorrect.

Last verified June 2026. Verify against official documentation before publication and quarterly thereafter.

Appendix E: Glossary

Allow: A directive that permits crawling of a path, overriding a broader Disallow. The most specific match wins.
Canonical: An HTML tag or HTTP header that names the preferred URL when the same or similar content sits at multiple URLs.
Crawl budget: The set of URLs Google can and wants to crawl, set by the intersection of crawl capacity and crawl demand.
Crawl capacity limit: The most crawling Google's infrastructure will do on a site without overloading its servers, based on response time, error rates, and availability.
Crawl demand: How much Google wants to crawl a site, based on content popularity, freshness, and inventory size.
Crawl trap: A URL pattern that generates excessive, low-value crawl activity, such as internal search, session parameters, or infinite calendar pagination.
Disallow: A directive that blocks crawling of matching path patterns.
Noindex: A meta tag or X-Robots-Tag instruction that keeps a page out of the index. Does not prevent crawling.
REP, Robots Exclusion Protocol: The protocol governing robots.txt behavior, standardized as RFC 9309 in September 2022.
Rendering: The process by which search engines execute JavaScript and apply CSS to understand a page the way a browser does.
RFC 9309: The IETF standard formalizing the Robots Exclusion Protocol, published September 2022.
Training crawler: An AI crawler that collects content for potential use in model training. Examples: GPTBot, ClaudeBot.
Search crawler: An AI crawler that builds indexes for AI-powered search. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot.
Retrieval crawler: An AI crawler that fetches content in response to a specific user request. Examples: ChatGPT-User, Claude-User, Perplexity-User. May not consistently honor robots.txt.
User-agent: The identifier a crawler uses to announce itself. Directives are targeted at specific user-agents.
X-Robots-Tag: An HTTP response header that carries indexing instructions for any content type, including PDFs and images.
