# AI Crawlers and robots.txt
## AI Crawlers

Beyond classic search bots (Googlebot, Bingbot), websites now receive visits from AI crawlers — bots collecting data for training and inference of language models.
| Crawler | Company | Purpose | User-Agent |
|---|---|---|---|
| GPTBot | OpenAI | Training + Browse | GPTBot/1.0 |
| ChatGPT-User | OpenAI | Browse (realtime) | ChatGPT-User |
| ClaudeBot | Anthropic | Training | ClaudeBot |
| anthropic-ai | Anthropic | Training | anthropic-ai |
| Google-Extended | Google | AI training | Google-Extended |
| PerplexityBot | Perplexity | Inference/RAG | PerplexityBot |
| CCBot | Common Crawl | Training datasets | CCBot/2.0 |
| Bytespider | ByteDance | Training | Bytespider |
| Applebot-Extended | Apple | Apple Intelligence | Applebot-Extended |
| Amazonbot | Amazon | Alexa/AI | Amazonbot |
## Training vs Inference

Two fundamentally different processes:
Training — bulk data collection to form model weights. Happens every few months. Blocking training crawlers has no effect on current model responses.
Inference — real-time RAG/browse. The model fetches pages “on the fly” when answering a user. Blocking inference bots removes your site from live answers.
| | Training | Inference |
|---|---|---|
| Frequency | Every few months | Per query |
| Blocking impact | Next model version | Current answers |
| Example bots | GPTBot, ClaudeBot, CCBot | ChatGPT-User, PerplexityBot |
## robots.txt for AI Bots

### Block Training, Allow Browse
Section titled “Block Training, Allow Browse”# Classic search engines — full accessUser-agent: GooglebotAllow: /
# AI training crawlers — blockUser-agent: GPTBotDisallow: /
User-agent: ClaudeBotDisallow: /
User-agent: CCBotDisallow: /
User-agent: BytespiderDisallow: /
User-agent: Google-ExtendedDisallow: /
# AI browse/inference — allowUser-agent: ChatGPT-UserAllow: /
User-agent: PerplexityBotAllow: /Selective Blocking
Section titled “Selective Blocking”# Allow documentation, block private areasUser-agent: GPTBotAllow: /docs/Allow: /blog/Disallow: /admin/Disallow: /internal/Disallow: /api/private/Blocking AI ≠ Losing SEO
Googlebot and Google-Extended are different user-agents:
- Googlebot — indexing for Google Search. Blocking = losing search rankings
- Google-Extended — data for Gemini/AI Overviews. Blocking has no effect on organic search
Block Google-Extended without worrying about SEO.
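Because the distinction is purely a User-Agent match, you can verify your rules with Python's standard `urllib.robotparser` before deploying them:

```python
from urllib import robotparser

# The same rules as in robots.txt: Googlebot allowed, Google-Extended blocked
rules = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/docs/"))         # True
print(rp.can_fetch("Google-Extended", "https://example.com/docs/"))   # False
```

Search indexing stays allowed while AI-training collection is refused, confirming the two user-agents are evaluated independently.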
## New Directives

The industry is experimenting with robots.txt extensions for AI:

```txt
# Experimental directives (not all bots support them)
User-agent: *
DisallowAITraining: /
```

```html
<!-- HTML meta tag -->
<meta name="robots" content="noai, noimageai">
```

Cloudflare has proposed AI Audit — managing AI bots via a dashboard without editing robots.txt.
## llms.txt vs robots.txt

Two files solving opposite tasks:
| | robots.txt | llms.txt |
|---|---|---|
| Purpose | "Where NOT to go" | "What's IMPORTANT" |
| Audience | All crawlers | LLMs and AI tools |
| Format | Custom syntax | Markdown |
| Effect | Blocks access | Directs attention |
| Standard | RFC 9309 | Community convention |
Both files complement each other:
- `robots.txt` — protects private content from bots
- `llms.txt` — helps AI find the most valuable public content
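For context, a minimal llms.txt following the community convention (llmstxt.org) is plain Markdown: an H1 title, a one-line blockquote summary, and link lists pointing AI tools at your best content. The URLs and section names below are placeholders:

```markdown
# Example Project

> One-line summary of what the project does and who it is for.

## Docs

- [Getting Started](https://example.com/docs/start): install and first run
- [API Reference](https://example.com/docs/api): endpoints and parameters

## Optional

- [Blog](https://example.com/blog): release notes and deep dives
```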
## Working Template

Complete example for a documentation site:
```txt
# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training — block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI inference/browse — allow docs
User-agent: ChatGPT-User
Allow: /docs/
Disallow: /

User-agent: PerplexityBot
Allow: /docs/
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
```

Complement this robots.txt with an llms.txt file — and AI assistants will get structured access to your documentation.