How to Block AI Crawlers (And Which Ones to Let In)
As AI companies scrape the web to train their models, website owners face a critical decision: Should you let AI crawlers access your content?
This guide explains how to control AI crawler access using robots.txt, the implications of blocking or allowing different crawlers, and how to make the right choice for your site.
Why Control AI Crawler Access?
Reasons to Block AI Crawlers
- Protect proprietary content — Prevent your unique insights from training competitor AI models
- Reduce server load — AI crawlers can consume significant bandwidth
- Maintain competitive advantage — Keep your data exclusive
- Privacy concerns — Control how your content is used in AI training
- Legal considerations — Some jurisdictions have evolving regulations around AI training data
Reasons to Allow AI Crawlers
- AI search visibility — Get cited by ChatGPT, Claude, and Perplexity
- Traffic potential — AI citations can drive qualified visitors
- Brand authority — Being cited by AI models builds credibility
- Future-proofing — AI search is becoming mainstream
- User experience — Help AI assistants give accurate information about your business
Understanding AI Crawler User-Agents
Here are the most widely seen AI crawler user-agents as of 2025:
| Company | User-Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training ChatGPT and GPT models |
| OpenAI | ChatGPT-User | ChatGPT browsing plugin |
| Anthropic | ClaudeBot | Training Claude models |
| Anthropic | Claude-Web | Legacy Claude training crawler |
| Anthropic | Claude-User | Claude web access |
| Google | Google-Extended | AI training (separate from search) |
| Google | Googlebot | Search indexing (allows AI features) |
| Perplexity | PerplexityBot | Perplexity AI search indexing |
| Perplexity | Perplexity-User | Perplexity answer generation |
| Common Crawl | CCBot | Open web dataset (used by many AI companies) |
| ByteDance | Bytespider | TikTok/ByteDance AI training |
| Amazon | Amazonbot | Alexa and Amazon AI training |
| Meta | FacebookBot | Meta AI training |
| Microsoft | Bingbot | Bing search (includes AI features) |
| Apple | Applebot | Siri and Spotlight search |
| Apple | Applebot-Extended | Opt-out signal for Apple AI training |
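Before deciding what to block, it helps to know which of these crawlers already visit your site. The user-agent strings in the table can be tallied from a standard combined-format access log; a minimal sketch (the sample log lines and the exact agent list are illustrative):

```python
# Sketch: tally AI crawler hits in a combined-format access log.
# AI_AGENTS mirrors the table above; the sample lines are made up.
import re
from collections import Counter

AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "Claude-User",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "Amazonbot", "FacebookBot",
]

def tally_ai_hits(log_lines):
    """Count requests per AI crawler, matched by user-agent substring."""
    hits = Counter()
    for line in log_lines:
        # Combined log format puts the user-agent in the last quoted field.
        ua_match = re.search(r'"([^"]*)"\s*$', line)
        if not ua_match:
            continue
        ua = ua_match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in ua:
                hits[agent] += 1
    return hits

sample = [
    '1.2.3.4 - - [01/Mar/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Mar/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]
counts = tally_ai_hits(sample)
print(dict(counts))  # {'GPTBot': 1, 'CCBot': 1}
```

A few days of output like this tells you which blocks would actually change anything, and which crawlers never visit you in the first place.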
How to Block AI Crawlers
Basic Blocking in robots.txt
To block all AI crawlers from your entire site:
```
# Block OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Block Anthropic/Claude
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Claude-User
Disallow: /

# Block Google AI training (not search)
User-agent: Google-Extended
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block ByteDance/TikTok
User-agent: Bytespider
Disallow: /

# Block Amazon
User-agent: Amazonbot
Disallow: /

# Block Meta
User-agent: FacebookBot
Disallow: /
```
Selective Blocking
Block crawlers from specific sections. Keep all rules for one user-agent in a single group: some parsers honor only the first group that matches an agent, so splitting rules across groups can silently drop them:

```
# AI crawlers may use the public blog and about pages,
# but not admin, API, or premium areas
User-agent: GPTBot
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /about/
```
Blocking Specific File Types
Prevent AI crawlers from accessing certain content types. Note that the `*` wildcard and `$` end-anchor are not part of the original robots.txt draft; they are extensions standardized in RFC 9309 and supported by all major crawlers:

```
User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/

User-agent: Claude-Web
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/
```
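Because Python's standard `urllib.robotparser` does not implement `*` and `$`, you cannot use it to sanity-check wildcard rules like these. A rough sketch of RFC 9309-style path matching you can test against instead (simplified: real parsers also handle percent-encoding and longest-match precedence between Allow and Disallow):

```python
# Sketch: check whether a path matches a robots.txt rule using the
# '*' wildcard and '$' end anchor (RFC 9309 syntax). Simplified for
# illustration; not a full robots.txt evaluator.
import re

def rule_matches(rule: str, path: str) -> bool:
    """Translate a robots.txt path rule into an anchored regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    pattern = "^" + pattern + ("$" if anchored else "")
    return re.match(pattern, path) is not None

assert rule_matches("/*.pdf$", "/files/report.pdf")
assert not rule_matches("/*.pdf$", "/files/report.pdf?download=1")
assert rule_matches("/downloads/", "/downloads/archive.zip")
assert not rule_matches("/downloads/", "/blog/post")
```

The second assertion is a useful reminder: `$` anchors against the full path-plus-query string, so a `.pdf` URL with a query string is not matched by `/*.pdf$`.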
Strategic Approaches
Approach 1: Block All AI Training, Allow AI Search
Allow AI search engines to index you for citations, but block training crawlers:
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search-oriented crawlers (these cite sources)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:
```
Best for: Content sites that want AI search visibility but don't want to contribute to model training.
Approach 2: Allow All AI Access
Let all AI crawlers access your content:
```
# No AI-specific blocks
# Your content is open to all crawlers
```
Best for: Sites focused on maximum visibility, thought leadership, or open knowledge sharing.
Approach 3: Block Everything
Prevent all automated access (not recommended for most sites):
```
User-agent: *
Disallow: /
```
Best for: Private intranets, staging sites, or highly sensitive content.
Approach 4: Time-Delayed Access
Allow crawlers to reach content only after an exclusivity window. robots.txt has no time-based syntax, so this pattern depends on moving content between paths (or regenerating the file) on a schedule:

```
User-agent: GPTBot
# Premium content blocked for 30 days
Disallow: /premium/recent/
# Archive content allowed
Allow: /premium/archive/
```
Best for: News sites, research publications, or subscription-based content.
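Since robots.txt itself has no notion of dates, an embargo like this is typically implemented by regenerating the file from a scheduled job (e.g. a daily cron). A sketch, with hypothetical article paths and publish dates:

```python
# Sketch: regenerate per-article Disallow rules so content younger
# than 30 days is blocked for GPTBot. ARTICLES is illustrative data;
# in practice this would come from your CMS.
from datetime import date, timedelta

ARTICLES = {  # path -> publish date (hypothetical)
    "/premium/ai-report": date(2025, 3, 1),
    "/premium/old-analysis": date(2024, 11, 15),
}

def embargoed_paths(articles, today, embargo_days=30):
    """Paths still inside the embargo window."""
    cutoff = today - timedelta(days=embargo_days)
    return sorted(p for p, published in articles.items() if published > cutoff)

def render_rules(paths):
    lines = ["# Regenerated daily: premium content embargoed for 30 days",
             "User-agent: GPTBot"]
    lines += [f"Disallow: {p}" for p in paths]
    return "\n".join(lines)

print(render_rules(embargoed_paths(ARTICLES, today=date(2025, 3, 10))))
```

A path-convention approach (moving articles from `/premium/recent/` to `/premium/archive/` after 30 days) achieves the same effect without touching robots.txt, at the cost of changing URLs.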
Testing Your robots.txt
Manual Verification
- Check syntax: Use our free robots.txt checker
- Test specific URLs: Verify each blocked path is actually blocked
- Check AI-specific rules: Ensure AI user-agents are properly targeted
What to Test
```
Test URL: https://yoursite.com/private/content
Expected: Blocked for GPTBot, Claude-Web, etc.
Actual:   [Check with tool]

Test URL: https://yoursite.com/blog/public-post
Expected: Allowed for all
Actual:   [Check with tool]
```
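For plain path-prefix rules like these, checks can also be scripted with Python's standard-library `urllib.robotparser` (which, as noted, does not understand `*` or `$` wildcards). The rules and URLs below are illustrative:

```python
# Sketch: verify blocking rules offline with urllib.robotparser.
# Works for plain path-prefix rules; wildcard rules need a
# RFC 9309-aware matcher instead.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked everywhere; other agents only under /private/.
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/public-post"))    # False
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/public-post"))  # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/private/content"))   # False
```

Running a script like this in CI against your real robots.txt catches accidental rule changes before they ship.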
Which Crawlers Should You Allow?
Recommended: Allow These
| Crawler | Why Allow | Impact |
|---|---|---|
| ChatGPT-User | Enables ChatGPT browsing plugin citations | High visibility potential |
| PerplexityBot | Perplexity search citations | Growing traffic source |
| Perplexity-User | Perplexity answer generation | Direct citations |
| Googlebot | Search indexing + AI Overviews | Essential for SEO |
| Bingbot | Bing search + Copilot | Significant search traffic |
Consider Carefully
| Crawler | Considerations |
|---|---|
| GPTBot | Trains ChatGPT; blocks future citation potential but protects content |
| ClaudeBot / Claude-Web | Trains Claude; the same trade-off as GPTBot |
| Google-Extended | Separates AI training from search; can block without hurting SEO |
| Applebot / Applebot-Extended | Applebot serves Siri and Spotlight search; Applebot-Extended is the separate AI-training opt-out |
Often Blocked
| Crawler | Common Reason to Block |
|---|---|
| CCBot | Wide distribution of scraped data; hard to control usage |
| Bytespider | TikTok/ByteDance training; competitive concerns |
| Amazonbot | Amazon's AI training; retail/competitive concerns |
| FacebookBot | Meta's AI training; data usage concerns |
The Business Impact of Blocking
Potential Benefits
- Content protection — Your unique insights remain exclusive
- Competitive moat — Harder for competitors to replicate your approach
- Reduced costs — Lower bandwidth and server load
- Legal clarity — Clearer position on data usage
Potential Drawbacks
- Missed citations — AI models can't cite what they can't see
- Reduced visibility — Missing out on growing AI search traffic
- User frustration — AI assistants may give outdated info about you
- Future regret — Hard to reverse decision if AI search becomes dominant
Real-World Examples
Sites That Block AI Crawlers
The New York Times
- Blocks GPTBot, CCBot
- Protects premium journalism from training
- Maintains subscription value
Other major publishers
- Block most AI crawlers
- Negotiate direct licensing deals instead
- Monetize content access
Sites That Allow AI Crawlers
Wikipedia
- Allows all crawlers
- Mission-aligned with knowledge sharing
- Benefits from AI citations
Shopify
- Allows AI crawlers on public content
- Gains visibility in AI commerce recommendations
- Blocks from admin/internal areas
Making Your Decision
Questions to Ask
1. Is your content unique or proprietary?
   - Yes → Consider blocking training crawlers
   - No → Allow for visibility
2. Do you rely on search traffic?
   - Yes → Allow search crawlers (Googlebot, Bingbot)
   - No → More flexibility
3. Is AI citation valuable to your business?
   - Yes → Allow ChatGPT-User, PerplexityBot
   - No → Block without concern
4. Do you have the resources to monitor?
   - Yes → Adjust strategy based on results
   - No → Pick a position and stick with it
Our Recommendation
For most websites, we recommend:
```
# Allow AI search (citation potential)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:

User-agent: Claude-User
Disallow:

# Block AI training (content protection)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow search engines (essential SEO)
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:
```
This approach:
- ✅ Protects your content from training models
- ✅ Allows AI search engines to cite you
- ✅ Maintains traditional SEO
- ✅ Keeps options open for the future
Monitoring and Adjusting
Track AI Referral Traffic
Monitor your analytics for traffic from:
- chatgpt.com / chat.openai.com
- perplexity.ai
- claude.ai
- AI browser extensions and apps
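If your analytics export referrer URLs, the AI share of traffic can be tallied with a few lines. A sketch over made-up (referrer, landing path) rows; the domain list mirrors the bullets above:

```python
# Sketch: count AI-assistant referrals in exported analytics rows.
# The rows and domain set are illustrative.
from urllib.parse import urlparse
from collections import Counter

AI_REFERRER_DOMAINS = {
    "chatgpt.com", "chat.openai.com", "perplexity.ai", "claude.ai",
}

def ai_referrals(rows):
    """Tally visits whose referrer is an AI assistant domain."""
    hits = Counter()
    for referrer, _path in rows:
        domain = urlparse(referrer).netloc.removeprefix("www.")
        if domain in AI_REFERRER_DOMAINS:
            hits[domain] += 1
    return hits

rows = [
    ("https://chatgpt.com/", "/blog/pricing-guide"),
    ("https://www.perplexity.ai/search", "/blog/pricing-guide"),
    ("https://google.com/", "/"),
]
print(dict(ai_referrals(rows)))  # {'chatgpt.com': 1, 'perplexity.ai': 1}
```

Tracking this number quarter over quarter is the simplest way to tell whether allowing AI search crawlers is actually paying off.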
Revisit Your Decision Quarterly
The AI landscape changes rapidly. Review:
- New crawler user-agents
- Changes in referral traffic
- Business strategy shifts
- Competitive landscape
Tools and Resources
- robots.txt Checker — Verify your AI crawler blocks
- AI Visibility Scanner — Check if AI can access your content
- llms.txt Generator — Guide AI crawlers to your best content
Conclusion
Controlling AI crawler access is now a standard part of website management. The right approach depends on your content strategy, business model, and risk tolerance.
Remember:
- You can always change your robots.txt
- Start conservative and open up if beneficial
- Monitor the impact of your decisions
- What's right for others may not be right for you
Next step: Check your current robots.txt configuration.
→ Test your robots.txt for AI crawler blocks
Want to understand the full picture of your AI visibility? Run a complete AI visibility scan to see how AI models currently see your website.