Block AI Bots from Your Website
Stop AI crawlers from harvesting your content for training and answer engines. Here's every major AI bot — what it does, who runs it, and a copy-ready robots.txt rule to block it.
Every AI crawler, explained
OpenAI's training crawler. Blocking it stops your content being used to train future ChatGPT / GPT models. Does not affect ChatGPT's live browsing or search citations.
User-agent: GPTBot
Disallow: /
Fetches a page in real time when a ChatGPT user clicks a link or asks it to browse. Blocking removes your pages from those on-demand fetches.
User-agent: ChatGPT-User
Disallow: /
Indexes the web for ChatGPT Search. Block this if you do not want to appear as a cited source in ChatGPT's search results.
User-agent: OAI-SearchBot
Disallow: /
Anthropic's training crawler for Claude. Blocking it keeps your content out of Claude model training data.
User-agent: ClaudeBot
Disallow: /
Indexes pages to improve Claude's search answers. Block to stay out of Claude's search index.
User-agent: Claude-SearchBot
Disallow: /
Fetches pages on demand when a Claude user asks it to browse a URL.
User-agent: Claude-User
Disallow: /
Older Anthropic user-agent. Keep it in your block list for full coverage of legacy crawls.
User-agent: anthropic-ai
Disallow: /
Controls whether your content trains Gemini and Vertex AI. Critically, blocking it does NOT affect Googlebot or your Google Search rankings.
User-agent: Google-Extended
Disallow: /
Common Crawl's bot builds the open dataset that many LLMs train on. Blocking it cuts off a major upstream training source.
User-agent: CCBot
Disallow: /
Builds Perplexity's answer index. Block to avoid being indexed as a Perplexity source.
User-agent: PerplexityBot
Disallow: /
Fetches pages live when a Perplexity user's query needs them.
User-agent: Perplexity-User
Disallow: /
ByteDance's aggressive crawler feeding TikTok / Doubao AI. Often high-volume; many sites block it to save bandwidth and server load.
User-agent: Bytespider
Disallow: /
Amazon's crawler powering Alexa and Amazon AI features.
User-agent: Amazonbot
Disallow: /
Governs whether your content trains Apple Intelligence. Separate from Applebot: blocking it does NOT affect Siri or Spotlight.
User-agent: Applebot-Extended
Disallow: /
Meta's crawler for training Meta AI / Llama models.
User-agent: Meta-ExternalAgent
Disallow: /
Cohere's crawler for model training.
User-agent: cohere-ai
Disallow: /
Extracts structured data to build knowledge graphs sold to AI products.
User-agent: Diffbot
Disallow: /
Crawler for Timpi's decentralized search and AI index.
User-agent: Timpibot
Disallow: /
Own your content in the AI era
Every week brings a new crawler scraping the open web to train models and feed answer engines. Most generic robots.txt tools have not kept up — they still list dead bots from 2010 and none of the crawlers that actually matter now. This page documents the full modern list: GPTBot, ClaudeBot, Google-Extended, CCBot, PerplexityBot, Bytespider, Applebot-Extended, Meta-ExternalAgent and more.
A smart strategy usually blocks the training crawlers (so your work isn't used to build models for free) while allowing the answer/search crawlers (so you still earn citations and referral traffic). Pick per category, and remember that robots.txt is honored by the legitimate operators but is not a hard firewall. For enforcement that does not depend on a crawler's cooperation, add server-side or WAF rules too.
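The split strategy described above might look like the following robots.txt. This is a sketch, not a recommended policy. It assumes that one group may list several User-agent lines ahead of a shared rule, which the Robots Exclusion Protocol (RFC 9309) permits, and it sorts bots into "training" and "search" purely by the descriptions on this page. Which bots belong in which group remains a judgment call for each site.

```
# Training crawlers: keep content out of model training data
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: CCBot
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
Disallow: /

# Answer/search crawlers: allowed, so pages can still be cited
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
```

Crawling is permitted by default, so the second group mostly documents intent. It changes behavior only when a broader `User-agent: *` group would otherwise disallow those bots.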