Robots.txt for AI: How to Control Which AI Bots Access Your Site
AI crawlers are visiting your website right now — whether you know it or not.
GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and dozens of other AI crawlers are continuously scraping the web to train their models. The question isn't whether they're accessing your site, but whether you've given them permission.
This guide shows you how to use robots.txt to control which AI bots can access your content, which user agents to watch for, and how to make the right decision for your website.
Understanding AI Crawlers and robots.txt
What Is robots.txt?
robots.txt is a text file at the root of your website (yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access. It's been the standard for controlling traditional search engine crawlers like Googlebot — and now it's the primary tool for controlling AI crawlers too.
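For instance, a minimal robots.txt that admits every crawler while fencing off an admin area looks like this (the paths and domain are placeholders):

```
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
```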
How AI Crawlers Use robots.txt
Unlike traditional search crawlers that index pages for search results, AI crawlers use your content to train their language models. Here are the key differences:
| Aspect | Traditional Crawlers (Googlebot) | AI Crawlers (GPTBot, ClaudeBot) |
|---|---|---|
| Purpose | Index pages for search results | Collect training data for AI models |
| Citation | Links in search results | Content may be cited in AI responses |
| Traffic | Drives organic search traffic | May drive referral traffic if cited |
| Control | robots.txt + meta tags | robots.txt (primary control) |
| Visibility | Pages appear in SERPs | Content influences AI-generated answers |
Why robots.txt Matters More Than Ever
With AI search growing rapidly, your robots.txt configuration directly affects:
- Whether AI models can learn from your content
- Whether AI chatbots will cite your website
- Your AI Visibility Score across platforms
- Your competitive positioning in AI-generated answers
Complete List of AI Crawler User Agents
Here are the main AI crawler user agents you should know about:
Major AI Platform Crawlers
| User Agent | Company/Platform | Purpose | Documentation |
|---|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Web crawling for GPT model training | openai.com/gptbot |
| ChatGPT-User | OpenAI (ChatGPT) | Real-time content fetching for user queries | openai.com |
| ClaudeBot | Anthropic (Claude) | Web crawling for Claude model training | anthropic.com |
| PerplexityBot | Perplexity AI | Web crawling for search and answers | perplexity.ai |
| Google-Extended | Google (Gemini) | Training data for AI models (beyond search indexing) | developers.google.com |
| Claude-Web | Anthropic (Claude) | Claude web search feature | anthropic.com |
| Bytespider | ByteDance | AI model training data | bytedance.com |
| Applebot-Extended | Apple (Apple Intelligence) | Training data for Apple AI features | developer.apple.com |
| FacebookBot | Meta (Meta AI) | AI training and features | developers.facebook.com |
| CCBot | Common Crawl | Open web dataset used by many AI companies | commoncrawl.org |
| anthropic-ai | Anthropic (Claude) | Legacy Claude crawler | anthropic.com |
| OAI-SearchBot | OpenAI (SearchGPT) | OpenAI's search product | openai.com |
| cohere-ai | Cohere | AI model training | cohere.com |
AI-Powered SEO and Research Crawlers
| User Agent | Service | Purpose |
|---|---|---|
| Tinyscout | Tinybot | AI-powered web research |
| ZoominfoBot | ZoomInfo | AI business intelligence |
| SemrushBot | Semrush | SEO and AI analysis |
| AhrefsBot | Ahrefs | SEO and AI analysis |
| MJ12bot | Majestic | Link and content analysis |
How to Configure robots.txt for AI Crawlers
Option 1: Allow All AI Crawlers (Recommended for Most Sites)
If you want AI models to discover and cite your content:
```
# robots.txt — Allow AI crawlers

# Specifically allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Default: allow everything except non-public areas
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /wp-admin/
Disallow: /.env

# Sitemaps
Sitemap: https://yourdomain.com/sitemap.xml
```

Note that the `User-agent: *` rules are kept in a single group: duplicate groups for the same agent are handled inconsistently across parsers.
When to choose this option:
- Content sites and blogs that want AI visibility
- E-commerce stores that want product recommendations from AI
- SaaS companies that want to be featured in AI tool recommendations
- Any business that wants to appear in AI-generated search results
Option 2: Block All AI Crawlers
If you don't want AI models to use your content:
```
# robots.txt — Block AI crawlers

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow everything else
User-agent: *
Allow: /
```
When to choose this option:
- Sites with proprietary or confidential content
- News organizations protecting copyrighted material
- Membership sites with paywalled content
- Sites that want to control their AI narrative completely
Option 3: Selective AI Crawler Control
Block some AI crawlers while allowing others:
```
# robots.txt — Selective AI control

# Allow AI crawlers that cite sources
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block AI crawlers that only use data for training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# General rules
User-agent: *
Allow: /
Disallow: /admin/
```
When to choose this option:
- You want to appear in AI search results but not train AI models
- You trust some platforms more than others
- You're testing different AI visibility strategies
Option 4: Partial Content Access
Allow AI crawlers on your blog but block them from other areas:
```
# robots.txt — Partial AI access

# AI crawlers can access blog and public content
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Disallow: /

# Allow all for search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/
```
When to choose this option:
- You have mixed content (public blog + private areas)
- You want AI to cite your educational content but not your proprietary pages
- E-commerce sites that want product pages visible but not customer data
Advanced robots.txt Techniques for AI
Rate Limiting AI Crawlers
Some AI crawlers respect the Crawl-delay directive:
```
User-agent: GPTBot
Allow: /
Crawl-delay: 10

User-agent: ClaudeBot
Allow: /
Crawl-delay: 10
```
This asks crawlers to wait 10 seconds between requests, reducing server load. Support for `Crawl-delay` is inconsistent, however, so don't rely on it as your only rate-limiting mechanism.
Using Wildcards
You can use patterns to block or allow specific URL patterns:
```
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /api/*
Disallow: /*?ref=*
Disallow: /*.json$
```
Sitemap Guidance for AI
Include your sitemap so crawlers can discover your content efficiently. Note that `Sitemap:` is a file-wide directive: it applies to all crawlers regardless of which user-agent group it sits near, so it only needs to appear once:

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Beyond robots.txt: Additional AI Crawler Controls
Meta Tags
You can control AI access at the page level using meta tags:
```html
<!-- Block all AI crawlers from this page -->
<meta name="robots" content="noai, noimageai">

<!-- Block specific AI crawlers -->
<meta name="GPTBot" content="noindex, nofollow">
<meta name="ClaudeBot" content="noindex, nofollow">
<meta name="PerplexityBot" content="noindex, nofollow">
```
Note: Meta tag support varies by crawler, and nonstandard directives like `noai` are not universally honored. Treat robots.txt as the more reliable control.
X-Robots-Tag HTTP Headers
For non-HTML files (PDFs, images, documents):
```
X-Robots-Tag: noai, noimageai
```
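As a sketch, here is one way to attach that header server-side, assuming an nginx setup (Apache and other servers have equivalents, and the `noai`/`noimageai` values remain nonstandard):

```nginx
# Sketch: send X-Robots-Tag on PDF responses (inside a server block).
# noai/noimageai are nonstandard directives; not all crawlers honor them.
location ~* \.pdf$ {
    add_header X-Robots-Tag "noai, noimageai";
}
```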
llms.txt for Positive Control
While robots.txt controls what crawlers can't access, llms.txt tells them what they should find:
```
# yourdomain.com/llms.txt

> Your website description for AI models

## Articles
- [Article Title](https://yourdomain.com/article): Brief description

## About
- Author information and credentials
```
Learn more about creating an llms.txt file in our complete llms.txt guide.
Should You Block AI Crawlers?
This is one of the most debated questions in web publishing today. Here's a balanced view:
Reasons to Allow AI Crawlers ✅
| Reason | Explanation |
|---|---|
| AI citations drive traffic | Being cited by ChatGPT, Claude, or Perplexity sends high-intent visitors |
| AI visibility is the new SEO | As AI search grows, blocking crawlers means becoming invisible |
| Early mover advantage | Fewer competitors in AI search means more visibility now |
| Free exposure | AI models are essentially promoting your content for free |
| User expectations | Users expect AI to recommend the best resources |
Reasons to Block AI Crawlers ❌
| Reason | Explanation |
|---|---|
| Copyright concerns | Your content trains models without compensation |
| Competitive risk | AI could use your content to help competitors |
| Loss of control | AI might misrepresent or hallucinate about your content |
| No direct attribution | Some AI models cite content without clear links |
| Resource usage | AI crawlers consume bandwidth without guaranteed benefit |
Our Recommendation
For most websites, allowing AI crawlers is the better choice. The traffic and visibility benefits outweigh the risks. However, you should:
- Monitor AI crawler activity in your server logs
- Check AI responses regularly for how your content is being used
- Use llms.txt to guide AI models to your best content
- Review policies periodically as the AI landscape evolves
Read our full analysis in Should You Block AI Crawlers? Pros, Cons, and How to Decide.
Monitoring AI Crawler Activity
Check Server Logs
Monitor which AI crawlers are visiting your site:
```bash
# Count AI crawler visits per IP address
grep -i "GPTBot\|ClaudeBot\|PerplexityBot\|Google-Extended\|Applebot-Extended\|CCBot" \
  /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
```
Track AI Crawler Frequency
```bash
# Daily GPTBot request counts
grep "GPTBot" /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c
```
Use Analytics Tools
Many web analytics platforms can filter for AI crawler user agents. Look for:
- Cloudflare Analytics: Bot traffic reports
- Google Analytics 4: Filter reports by user agent
- Log analysis tools: GoAccess, AWStats, or custom scripts
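If you prefer to script the analysis yourself, a small Python sketch like the following tallies requests per AI crawler from an access log (the crawler list and sample log lines are illustrative; adapt them to your own log format):

```python
from collections import Counter

# Substring markers for the AI crawlers discussed above (extend as needed).
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "OAI-SearchBot",
]

def count_ai_hits(lines):
    """Count requests per AI crawler across access-log lines."""
    hits = Counter()
    for line in lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break  # attribute each request line to one crawler
    return hits

# Illustrative combined-format log lines:
sample = [
    '1.2.3.4 - - [12/Apr/2025:10:00:01 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "GPTBot/1.1"',
    '5.6.7.8 - - [12/Apr/2025:10:00:05 +0000] "GET /guides/ HTTP/1.1" 200 812 "-" "ClaudeBot/1.0"',
]
print(count_ai_hits(sample))
```

Pointed at your real log file (one line per request), this gives the same per-crawler counts as the grep commands above, in one pass.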
Verifying Your robots.txt Configuration
Test Your robots.txt
Use these tools to verify your configuration:
- Google Search Console: robots.txt report (the standalone robots.txt Tester has been retired)
- OpenAI GPTBot test: Check if your content appears in ChatGPT responses
- Perplexity test: Search for your content on Perplexity and check citations
- SeenByAI: Check your AI Visibility Score to see if AI platforms can find you
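You can also test rules locally before deploying them. Python's standard-library `urllib.robotparser` evaluates a robots.txt against a given user agent and path (the rules below are a made-up example):

```python
from urllib.robotparser import RobotFileParser

def crawler_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under these robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

rules = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(crawler_allowed(rules, "GPTBot", "/blog/post"))     # True
print(crawler_allowed(rules, "GPTBot", "/private/data"))  # False
```

Running this against your real robots.txt text, for each user agent you care about, catches typos and contradictory rules before they go live.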
Common robots.txt Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Wrong file location | Must be at root: /robots.txt | Move to yourdomain.com/robots.txt |
| Blocking everything | Disallow: / blocks all crawlers | Be specific with your rules |
| Typos in user agents | Crawlers won't match misspelled names | Double-check exact user agent strings |
| No sitemap reference | Crawlers may miss pages | Add Sitemap: directive |
| Contradictory rules | Parsers resolve conflicts differently; Google applies the most specific matching rule | Keep one unambiguous group per user agent |
| Blocking CSS/JS | Prevents proper page rendering | Allow /wp-content/, /static/, etc. |
Quick-Start Template
Here's a production-ready template you can customize:
```
# robots.txt for [Your Website]
# Last updated: 2025-04-12

# === AI Crawler Configuration ===

# OpenAI (ChatGPT)
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic (Claude)
User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Claude-Web
Allow: /

# Perplexity AI
User-agent: PerplexityBot
Allow: /

# Google (Search + AI)
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# === Traditional Search Engines ===
User-agent: Bingbot
Allow: /

# === General Rules ===
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/internal/
Disallow: /*.json$

# === Sitemaps ===
Sitemap: https://yourdomain.com/sitemap.xml
```
Key Takeaways
- robots.txt is your primary tool for controlling AI crawler access to your website
- Know the major AI user agents: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended
- Most websites should allow AI crawlers to maximize AI visibility and referral traffic
- Be specific with rules — allow blog content, block private areas
- Combine with llms.txt for positive guidance about your content
- Monitor and verify your configuration regularly
- The landscape is evolving — review your robots.txt quarterly as new crawlers emerge
Check Your AI Visibility
Want to know if AI crawlers are actually finding and citing your content? Get your free AI visibility report →
SeenByAI scans your website across ChatGPT, Claude, Perplexity, and Google AI to show your current AI Visibility Score and identify exactly what to fix.