AI Crawlers and robots.txt

This page covers the major AI crawlers (GPTBot, ClaudeBot, Google-Extended, and others), their User-Agent strings, the difference between training and inference crawling, and ready-to-use robots.txt templates.

Beyond classic search bots (Googlebot, Bingbot), websites now receive visits from AI crawlers — bots collecting data for training and inference of language models.

| Crawler | Company | Purpose | User-Agent |
|---|---|---|---|
| GPTBot | OpenAI | Training + Browse | GPTBot/1.0 |
| ChatGPT-User | OpenAI | Browse (realtime) | ChatGPT-User |
| ClaudeBot | Anthropic | Training | ClaudeBot |
| anthropic-ai | Anthropic | Training | anthropic-ai |
| Google-Extended | Google | AI training | Google-Extended |
| PerplexityBot | Perplexity | Inference/RAG | PerplexityBot |
| CCBot | Common Crawl | Training datasets | CCBot/2.0 |
| Bytespider | ByteDance | Training | Bytespider |
| Applebot-Extended | Apple | Apple Intelligence | Applebot-Extended |
| Amazonbot | Amazon | Alexa/AI | Amazonbot |
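Server-side, these User-Agent strings can be matched to log or rate-limit AI traffic. A minimal sketch (the token list mirrors the table; the function name is this article's, not any library's):

```python
# Known AI crawler tokens, restated from the table above; extend as new bots appear.
AI_CRAWLER_TOKENS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "PerplexityBot", "CCBot", "Bytespider",
    "Applebot-Extended", "Amazonbot",
)

def match_ai_crawler(user_agent: str):
    """Return the first matching crawler token, or None for other traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None

print(match_ai_crawler("Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"))  # GPTBot
print(match_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # None
```

Substring matching is deliberate: real crawler User-Agent headers embed the token inside a longer browser-like string.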

Two fundamentally different processes:

Training — bulk data collection to form model weights. Happens every few months. Blocking training crawlers has no effect on current model responses.

Inference — real-time RAG/browse. The model fetches pages “on the fly” when answering a user. Blocking inference bots removes your site from live answers.

| | Training | Inference |
|---|---|---|
| Frequency | Every few months | Per query |
| Blocking impact | Next model version | Current answers |
| Example bots | GPTBot, ClaudeBot, CCBot | ChatGPT-User, PerplexityBot |
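The distinction can be restated as a small helper for deciding what a block actually costs. The grouping below simply repeats the rows above; it is an editorial classification, not an official registry:

```python
# Bot categories restated from the training/inference table above.
TRAINING_BOTS = {"GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Google-Extended"}
INFERENCE_BOTS = {"ChatGPT-User", "PerplexityBot"}

def blocking_impact(bot: str) -> str:
    """Describe what disallowing this bot in robots.txt changes."""
    if bot in TRAINING_BOTS:
        return "next model version only"
    if bot in INFERENCE_BOTS:
        return "current live answers"
    return "unknown"

print(blocking_impact("GPTBot"))         # next model version only
print(blocking_impact("PerplexityBot"))  # current live answers
```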
A typical configuration template:

```
# Classic search engines — full access
User-agent: Googlebot
Allow: /

# AI training crawlers — block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

# AI browse/inference — allow
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```

For selective access, combine Allow and Disallow rules in a single group. Keep exactly one group per user-agent: under RFC 9309 a crawler merges every group that matches it, so a second GPTBot group elsewhere in the file would interact with the first.

```
# Allow documentation and blog, block private areas
User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /admin/
Disallow: /internal/
Disallow: /api/private/
```

Googlebot and Google-Extended are different user-agents:

  • Googlebot — indexing for Google Search. Blocking = losing search rankings
  • Google-Extended — data for Gemini/AI Overviews. Blocking has no effect on organic search

Block Google-Extended without worrying about SEO.

The industry is experimenting with robots.txt extensions for AI:

```
# Experimental directives (not all bots support them)
User-agent: *
DisallowAITraining: /
```

There is also an HTML meta tag variant:

```html
<!-- HTML meta tag -->
<meta name="robots" content="noai, noimageai">
```

Cloudflare has proposed AI Audit — managing AI bots via dashboard without editing robots.txt.

Two files solving opposite tasks:

| | robots.txt | llms.txt |
|---|---|---|
| Purpose | "Where NOT to go" | "What's IMPORTANT" |
| Audience | All crawlers | LLMs and AI tools |
| Format | Custom syntax | Markdown |
| Effect | Blocks access | Directs attention |
| Standard | RFC 9309 | Community convention |

Both files complement each other:

  1. robots.txt — protects private content from bots
  2. llms.txt — helps AI find the most valuable public content

Complete example for a documentation site:

robots.txt:

```
# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training — block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI inference/browse — allow docs
User-agent: ChatGPT-User
Allow: /docs/
Disallow: /

User-agent: PerplexityBot
Allow: /docs/
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
```
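Before deploying, a policy like this can be sanity-checked offline with Python's standard `urllib.robotparser`. The snippet below uses an abbreviated copy of the policy and illustrative URLs:

```python
from urllib import robotparser

# Abbreviated copy of the policy above; parse() accepts the file as a list of lines.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /docs/
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked everywhere; ChatGPT-User may fetch only under /docs/.
for bot, url in [
    ("Googlebot", "https://example.com/docs/intro"),
    ("GPTBot", "https://example.com/docs/intro"),
    ("ChatGPT-User", "https://example.com/docs/intro"),
    ("ChatGPT-User", "https://example.com/admin/"),
]:
    print(bot, url, rp.can_fetch(bot, url))
```

Note that `can_fetch` matches user-agents by substring, so it approximates crawler behaviour rather than reproducing any specific bot's parser.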

Complement this robots.txt with an llms.txt file — and AI assistants will get structured access to your documentation.
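For illustration, a minimal llms.txt for the same hypothetical site might look like this (all paths and descriptions are invented):

```markdown
# Example Product

> Hypothetical documentation site used in the robots.txt examples above.

## Docs

- [Quickstart](https://example.com/docs/quickstart): installation and first steps
- [API Reference](https://example.com/docs/api): endpoints and parameters

## Blog

- [Changelog](https://example.com/blog/changelog): release notes
```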
