AI Crawlers and robots.txt

This page covers the major AI crawlers (GPTBot, ClaudeBot, Google-Extended, and others), their User-Agent strings, the difference between training and inference crawling, and ready-to-use robots.txt templates.

Beyond classic search bots (Googlebot, Bingbot), websites now receive visits from AI crawlers — bots collecting data for training and inference of language models.

| Crawler | Company | Purpose | User-Agent |
|---|---|---|---|
| GPTBot | OpenAI | Training + Browse | GPTBot/1.0 |
| ChatGPT-User | OpenAI | Browse (realtime) | ChatGPT-User |
| ClaudeBot | Anthropic | Training | ClaudeBot |
| anthropic-ai | Anthropic | Training | anthropic-ai |
| Google-Extended | Google | AI training | Google-Extended |
| PerplexityBot | Perplexity | Inference/RAG | PerplexityBot |
| CCBot | Common Crawl | Training datasets | CCBot/2.0 |
| Bytespider | ByteDance | Training | Bytespider |
| Applebot-Extended | Apple | Apple Intelligence | Applebot-Extended |
| Amazonbot | Amazon | Alexa/AI | Amazonbot |
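Server-side, these User-Agent strings can be matched to log or rate-limit AI traffic. A minimal sketch (the token list mirrors the table; the function name is this article's, not any library's):

```python
# Known AI crawler tokens, restated from the table above; extend as new bots appear.
AI_CRAWLER_TOKENS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "PerplexityBot", "CCBot", "Bytespider",
    "Applebot-Extended", "Amazonbot",
)

def match_ai_crawler(user_agent: str):
    """Return the first matching crawler token, or None for other traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None

print(match_ai_crawler("Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"))  # GPTBot
print(match_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # None
```

Substring matching is deliberate: real crawler User-Agent headers embed the token inside a longer browser-like string.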

Two fundamentally different processes:

Training — bulk data collection to form model weights. Happens every few months. Blocking training crawlers has no effect on current model responses.

Inference — real-time RAG/browse. The model fetches pages “on the fly” when answering a user. Blocking inference bots removes your site from live answers.

| | Training | Inference |
|---|---|---|
| Frequency | Every few months | Per query |
| Blocking impact | Next model version | Current answers |
| Example bots | GPTBot, ClaudeBot, CCBot | ChatGPT-User, PerplexityBot |
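The distinction can be restated as a small helper for deciding what a block actually costs. The grouping below simply repeats the rows above; it is an editorial classification, not an official registry:

```python
# Bot categories restated from the training/inference table above.
TRAINING_BOTS = {"GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Google-Extended"}
INFERENCE_BOTS = {"ChatGPT-User", "PerplexityBot"}

def blocking_impact(bot: str) -> str:
    """Describe what disallowing this bot in robots.txt changes."""
    if bot in TRAINING_BOTS:
        return "next model version only"
    if bot in INFERENCE_BOTS:
        return "current live answers"
    return "unknown"

print(blocking_impact("GPTBot"))         # next model version only
print(blocking_impact("PerplexityBot"))  # current live answers
```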
A typical configuration template:

```
# Classic search engines — full access
User-agent: Googlebot
Allow: /

# AI training crawlers — block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

# AI browse/inference — allow
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```

For selective access, combine Allow and Disallow rules in a single group. Keep exactly one group per user-agent: under RFC 9309 a crawler merges every group that matches it, so a second GPTBot group elsewhere in the file would interact with the first.

```
# Allow documentation and blog, block private areas
User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /admin/
Disallow: /internal/
Disallow: /api/private/
```

Googlebot and Google-Extended are different user-agents:

  • Googlebot — indexing for Google Search. Blocking = losing search rankings
  • Google-Extended — data for Gemini/AI Overviews. Blocking has no effect on organic search

Block Google-Extended without worrying about SEO.

The industry is experimenting with robots.txt extensions for AI:

```
# Experimental directives (not all bots support them)
User-agent: *
DisallowAITraining: /
```

There is also an HTML meta tag variant:

```html
<!-- HTML meta tag -->
<meta name="robots" content="noai, noimageai">
```

Cloudflare has proposed AI Audit — managing AI bots via dashboard without editing robots.txt.

Two files solving opposite tasks:

| | robots.txt | llms.txt |
|---|---|---|
| Purpose | "Where NOT to go" | "What's IMPORTANT" |
| Audience | All crawlers | LLMs and AI tools |
| Format | Custom syntax | Markdown |
| Effect | Blocks access | Directs attention |
| Standard | RFC 9309 | Community convention |

Both files complement each other:

  1. robots.txt — protects private content from bots
  2. llms.txt — helps AI find the most valuable public content

Complete example for a documentation site:

robots.txt:

```
# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training — block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI inference/browse — allow docs
User-agent: ChatGPT-User
Allow: /docs/
Disallow: /

User-agent: PerplexityBot
Allow: /docs/
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
```
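Before deploying, a policy like this can be sanity-checked offline with Python's standard `urllib.robotparser`. The snippet below uses an abbreviated copy of the policy and illustrative URLs:

```python
from urllib import robotparser

# Abbreviated copy of the policy above; parse() accepts the file as a list of lines.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /docs/
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked everywhere; ChatGPT-User may fetch only under /docs/.
for bot, url in [
    ("Googlebot", "https://example.com/docs/intro"),
    ("GPTBot", "https://example.com/docs/intro"),
    ("ChatGPT-User", "https://example.com/docs/intro"),
    ("ChatGPT-User", "https://example.com/admin/"),
]:
    print(bot, url, rp.can_fetch(bot, url))
```

Note that `can_fetch` matches user-agents by substring, so it approximates crawler behaviour rather than reproducing any specific bot's parser.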

Complement this robots.txt with an llms.txt file — and AI assistants will get structured access to your documentation.
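For illustration, a minimal llms.txt for the same hypothetical site might look like this (all paths and descriptions are invented):

```markdown
# Example Product

> Hypothetical documentation site used in the robots.txt examples above.

## Docs

- [Quickstart](https://example.com/docs/quickstart): installation and first steps
- [API Reference](https://example.com/docs/api): endpoints and parameters

## Blog

- [Changelog](https://example.com/blog/changelog): release notes
```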
