How to Block AI Crawlers (And Which Ones to Let In)
As AI companies scrape the web to train their models, website owners face a critical decision: Should you let AI crawlers access your content?
This guide explains how to control AI crawler access using robots.txt, the implications of blocking or allowing different crawlers, and how to make the right choice for your site.
Why Control AI Crawler Access?
Reasons to Block AI Crawlers
- Protect proprietary content — Prevent your unique insights from training competitor AI models
- Reduce server load — AI crawlers can consume significant bandwidth
- Maintain competitive advantage — Keep your data exclusive
- Privacy concerns — Control how your content is used in AI training
- Legal considerations — Some jurisdictions have evolving regulations around AI training data
Reasons to Allow AI Crawlers
- AI search visibility — Get cited by ChatGPT, Claude, and Perplexity
- Traffic potential — AI citations can drive qualified visitors
- Brand authority — Being cited by AI models builds credibility
- Future-proofing — AI search is becoming mainstream
- User experience — Help AI assistants give accurate information about your business
Understanding AI Crawler User-Agents
Here are the most widely seen AI crawler user-agents as of 2025:
| Company | User-Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training ChatGPT and GPT models |
| OpenAI | ChatGPT-User | ChatGPT browsing plugin |
| Anthropic | ClaudeBot | Training Claude models |
| Anthropic | Claude-Web | Legacy Claude training crawler |
| Anthropic | Claude-User | Claude web access |
| Google | Google-Extended | AI training (separate from search) |
| Google | Googlebot | Search indexing (allows AI features) |
| Perplexity | PerplexityBot | Perplexity AI search indexing |
| Perplexity | Perplexity-User | Perplexity answer generation |
| Common Crawl | CCBot | Open web dataset (used by many AI companies) |
| ByteDance | Bytespider | TikTok/ByteDance AI training |
| Amazon | Amazonbot | Alexa and Amazon AI training |
| Meta | FacebookBot | Meta AI training |
| Microsoft | Bingbot | Bing search (includes AI features) |
| Apple | Applebot | Siri and Spotlight search |
| Apple | Applebot-Extended | Opt-out signal for Apple AI training |
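Before deciding what to block, it helps to know which of these crawlers already visit your site. The user-agent strings in the table can be tallied from a standard combined-format access log; a minimal sketch (the sample log lines and the exact agent list are illustrative):

```python
# Sketch: tally AI crawler hits in a combined-format access log.
# AI_AGENTS mirrors the table above; the sample lines are made up.
import re
from collections import Counter

AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "Claude-User",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "Amazonbot", "FacebookBot",
]

def tally_ai_hits(log_lines):
    """Count requests per AI crawler, matched by user-agent substring."""
    hits = Counter()
    for line in log_lines:
        # Combined log format puts the user-agent in the last quoted field.
        ua_match = re.search(r'"([^"]*)"\s*$', line)
        if not ua_match:
            continue
        ua = ua_match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in ua:
                hits[agent] += 1
    return hits

sample = [
    '1.2.3.4 - - [01/Mar/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Mar/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]
counts = tally_ai_hits(sample)
print(dict(counts))  # {'GPTBot': 1, 'CCBot': 1}
```

A few days of output like this tells you which blocks would actually change anything, and which crawlers never visit you in the first place.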
How to Block AI Crawlers
Basic Blocking in robots.txt
To block all AI crawlers from your entire site:
```
# Block OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Block Anthropic/Claude
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Claude-User
Disallow: /

# Block Google AI training (not search)
User-agent: Google-Extended
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block ByteDance/TikTok
User-agent: Bytespider
Disallow: /

# Block Amazon
User-agent: Amazonbot
Disallow: /

# Block Meta
User-agent: FacebookBot
Disallow: /
```
Selective Blocking
Block crawlers from specific sections. Keep all rules for one user-agent in a single group: some parsers honor only the first group that matches an agent, so splitting rules across groups can silently drop them:

```
# AI crawlers may use the public blog and about pages,
# but not admin, API, or premium areas
User-agent: GPTBot
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /about/
```
Blocking Specific File Types
Prevent AI crawlers from accessing certain content types. Note that the `*` wildcard and `$` end-anchor are not part of the original robots.txt draft; they are extensions standardized in RFC 9309 and supported by all major crawlers:

```
User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/

User-agent: Claude-Web
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/
```
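Because Python's standard `urllib.robotparser` does not implement `*` and `$`, you cannot use it to sanity-check wildcard rules like these. A rough sketch of RFC 9309-style path matching you can test against instead (simplified: real parsers also handle percent-encoding and longest-match precedence between Allow and Disallow):

```python
# Sketch: check whether a path matches a robots.txt rule using the
# '*' wildcard and '$' end anchor (RFC 9309 syntax). Simplified for
# illustration; not a full robots.txt evaluator.
import re

def rule_matches(rule: str, path: str) -> bool:
    """Translate a robots.txt path rule into an anchored regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    pattern = "^" + pattern + ("$" if anchored else "")
    return re.match(pattern, path) is not None

assert rule_matches("/*.pdf$", "/files/report.pdf")
assert not rule_matches("/*.pdf$", "/files/report.pdf?download=1")
assert rule_matches("/downloads/", "/downloads/archive.zip")
assert not rule_matches("/downloads/", "/blog/post")
```

The second assertion is a useful reminder: `$` anchors against the full path-plus-query string, so a `.pdf` URL with a query string is not matched by `/*.pdf$`.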
Strategic Approaches
Approach 1: Block All AI Training, Allow AI Search
Allow AI search engines to index you for citations, but block training crawlers:
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search-oriented crawlers (these cite sources)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:
```
Best for: Content sites that want AI search visibility but don't want to contribute to model training.
Approach 2: Allow All AI Access
Let all AI crawlers access your content:
```
# No AI-specific blocks
# Your content is open to all crawlers
```
Best for: Sites focused on maximum visibility, thought leadership, or open knowledge sharing.
Approach 3: Block Everything
Prevent all automated access (not recommended for most sites):
```
User-agent: *
Disallow: /
```
Best for: Private intranets, staging sites, or highly sensitive content.
Approach 4: Time-Delayed Access
Allow crawlers to reach content only after an exclusivity window. robots.txt has no time-based syntax, so this pattern depends on moving content between paths (or regenerating the file) on a schedule:

```
User-agent: GPTBot
# Premium content blocked for 30 days
Disallow: /premium/recent/
# Archive content allowed
Allow: /premium/archive/
```
Best for: News sites, research publications, or subscription-based content.
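Since robots.txt itself has no notion of dates, an embargo like this is typically implemented by regenerating the file from a scheduled job (e.g. a daily cron). A sketch, with hypothetical article paths and publish dates:

```python
# Sketch: regenerate per-article Disallow rules so content younger
# than 30 days is blocked for GPTBot. ARTICLES is illustrative data;
# in practice this would come from your CMS.
from datetime import date, timedelta

ARTICLES = {  # path -> publish date (hypothetical)
    "/premium/ai-report": date(2025, 3, 1),
    "/premium/old-analysis": date(2024, 11, 15),
}

def embargoed_paths(articles, today, embargo_days=30):
    """Paths still inside the embargo window."""
    cutoff = today - timedelta(days=embargo_days)
    return sorted(p for p, published in articles.items() if published > cutoff)

def render_rules(paths):
    lines = ["# Regenerated daily: premium content embargoed for 30 days",
             "User-agent: GPTBot"]
    lines += [f"Disallow: {p}" for p in paths]
    return "\n".join(lines)

print(render_rules(embargoed_paths(ARTICLES, today=date(2025, 3, 10))))
```

A path-convention approach (moving articles from `/premium/recent/` to `/premium/archive/` after 30 days) achieves the same effect without touching robots.txt, at the cost of changing URLs.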
Testing Your robots.txt
Manual Verification
- Check syntax: Use our free robots.txt checker
- Test specific URLs: Verify each blocked path is actually blocked
- Check AI-specific rules: Ensure AI user-agents are properly targeted
What to Test
```
Test URL: https://yoursite.com/private/content
Expected: Blocked for GPTBot, Claude-Web, etc.
Actual:   [Check with tool]

Test URL: https://yoursite.com/blog/public-post
Expected: Allowed for all
Actual:   [Check with tool]
```
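For plain path-prefix rules like these, checks can also be scripted with Python's standard-library `urllib.robotparser` (which, as noted, does not understand `*` or `$` wildcards). The rules and URLs below are illustrative:

```python
# Sketch: verify blocking rules offline with urllib.robotparser.
# Works for plain path-prefix rules; wildcard rules need a
# RFC 9309-aware matcher instead.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked everywhere; other agents only under /private/.
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/public-post"))    # False
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/public-post"))  # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/private/content"))   # False
```

Running a script like this in CI against your real robots.txt catches accidental rule changes before they ship.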
Which Crawlers Should You Allow?
Recommended: Allow These
| Crawler | Why Allow | Impact |
|---|---|---|
| ChatGPT-User | Enables ChatGPT browsing plugin citations | High visibility potential |
| PerplexityBot | Perplexity search citations | Growing traffic source |
| Perplexity-User | Perplexity answer generation | Direct citations |
| Googlebot | Search indexing + AI Overviews | Essential for SEO |
| Bingbot | Bing search + Copilot | Significant search traffic |
Consider Carefully
| Crawler | Considerations |
|---|---|
| GPTBot | Trains ChatGPT; blocks future citation potential but protects content |
| ClaudeBot / Claude-Web | Trains Claude; the same trade-off as GPTBot |
| Google-Extended | Separates AI training from search; can block without hurting SEO |
| Applebot / Applebot-Extended | Applebot serves Siri and Spotlight search; Applebot-Extended is the separate AI-training opt-out |
Often Blocked
| Crawler | Common Reason to Block |
|---|---|
| CCBot | Wide distribution of scraped data; hard to control usage |
| Bytespider | TikTok/ByteDance training; competitive concerns |
| Amazonbot | Amazon's AI training; retail/competitive concerns |
| FacebookBot | Meta's AI training; data usage concerns |
The Business Impact of Blocking
Potential Benefits
- Content protection — Your unique insights remain exclusive
- Competitive moat — Harder for competitors to replicate your approach
- Reduced costs — Lower bandwidth and server load
- Legal clarity — Clearer position on data usage
Potential Drawbacks
- Missed citations — AI models can't cite what they can't see
- Reduced visibility — Missing out on growing AI search traffic
- User frustration — AI assistants may give outdated info about you
- Future regret — Hard to reverse decision if AI search becomes dominant
Real-World Examples
Sites That Block AI Crawlers
The New York Times
- Blocks GPTBot, CCBot
- Protects premium journalism from training
- Maintains subscription value
Other major publishers
- Block most AI crawlers
- Negotiate direct licensing deals instead
- Monetize content access
Sites That Allow AI Crawlers
Wikipedia
- Allows all crawlers
- Mission-aligned with knowledge sharing
- Benefits from AI citations
Shopify
- Allows AI crawlers on public content
- Gains visibility in AI commerce recommendations
- Blocks from admin/internal areas
Making Your Decision
Questions to Ask
1. Is your content unique or proprietary?
   - Yes → Consider blocking training crawlers
   - No → Allow for visibility
2. Do you rely on search traffic?
   - Yes → Allow search crawlers (Googlebot, Bingbot)
   - No → More flexibility
3. Is AI citation valuable to your business?
   - Yes → Allow ChatGPT-User, PerplexityBot
   - No → Block without concern
4. Do you have the resources to monitor?
   - Yes → Adjust strategy based on results
   - No → Pick a position and stick with it
Our Recommendation
For most websites, we recommend:
```
# Allow AI search (citation potential)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:

User-agent: Claude-User
Disallow:

# Block AI training (content protection)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow search engines (essential SEO)
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:
```
This approach:
- ✅ Protects your content from training models
- ✅ Allows AI search engines to cite you
- ✅ Maintains traditional SEO
- ✅ Keeps options open for the future
Monitoring and Adjusting
Track AI Referral Traffic
Monitor your analytics for traffic from:
- chatgpt.com / chat.openai.com
- perplexity.ai
- claude.ai
- AI browser extensions and apps
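If your analytics export referrer URLs, the AI share of traffic can be tallied with a few lines. A sketch over made-up (referrer, landing path) rows; the domain list mirrors the bullets above:

```python
# Sketch: count AI-assistant referrals in exported analytics rows.
# The rows and domain set are illustrative.
from urllib.parse import urlparse
from collections import Counter

AI_REFERRER_DOMAINS = {
    "chatgpt.com", "chat.openai.com", "perplexity.ai", "claude.ai",
}

def ai_referrals(rows):
    """Tally visits whose referrer is an AI assistant domain."""
    hits = Counter()
    for referrer, _path in rows:
        domain = urlparse(referrer).netloc.removeprefix("www.")
        if domain in AI_REFERRER_DOMAINS:
            hits[domain] += 1
    return hits

rows = [
    ("https://chatgpt.com/", "/blog/pricing-guide"),
    ("https://www.perplexity.ai/search", "/blog/pricing-guide"),
    ("https://google.com/", "/"),
]
print(dict(ai_referrals(rows)))  # {'chatgpt.com': 1, 'perplexity.ai': 1}
```

Tracking this number quarter over quarter is the simplest way to tell whether allowing AI search crawlers is actually paying off.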
Revisit Your Decision Quarterly
The AI landscape changes rapidly. Review:
- New crawler user-agents
- Changes in referral traffic
- Business strategy shifts
- Competitive landscape
Tools and Resources
- robots.txt Checker — Verify your AI crawler blocks
- AI Visibility Scanner — Check if AI can access your content
- llms.txt Generator — Guide AI crawlers to your best content
Conclusion
Controlling AI crawler access is now a standard part of website management. The right approach depends on your content strategy, business model, and risk tolerance.
Remember:
- You can always change your robots.txt
- Start conservative and open up if beneficial
- Monitor the impact of your decisions
- What's right for others may not be right for you
Next step: Check your current robots.txt configuration.
→ Test your robots.txt for AI crawler blocks
Want to understand the full picture of your AI visibility? Run a complete AI visibility scan to see how AI models currently see your website.