robots.txt · AI Crawlers · Technical SEO · Privacy

How to Block AI Crawlers (And Which Ones to Let In)

A complete guide to controlling AI crawlers on your website. Learn how to block ChatGPT, Claude, and other AI bots using robots.txt, and understand which crawlers you might want to allow.

SeenByAI Team · April 5, 2025 · 8 min read

As AI companies scrape the web to train their models, website owners face a critical decision: Should you let AI crawlers access your content?

This guide explains how to control AI crawler access using robots.txt, the implications of blocking or allowing different crawlers, and how to make the right choice for your site.

Why Control AI Crawler Access?

Reasons to Block AI Crawlers

  1. Protect proprietary content — Prevent your unique insights from training competitor AI models
  2. Reduce server load — AI crawlers can consume significant bandwidth
  3. Maintain competitive advantage — Keep your data exclusive
  4. Privacy concerns — Control how your content is used in AI training
  5. Legal considerations — Some jurisdictions have evolving regulations around AI training data

Reasons to Allow AI Crawlers

  1. AI search visibility — Get cited by ChatGPT, Claude, and Perplexity
  2. Traffic potential — AI citations can drive qualified visitors
  3. Brand authority — Being cited by AI models builds credibility
  4. Future-proofing — AI search is becoming mainstream
  5. User experience — Help AI assistants give accurate information about your business

Understanding AI Crawler User-Agents

Here's a comprehensive list of AI crawler user-agents as of 2025:

| Company | User-Agent | Purpose |
| --- | --- | --- |
| OpenAI | GPTBot | Training ChatGPT and GPT models |
| OpenAI | ChatGPT-User | User-initiated ChatGPT browsing |
| Anthropic | Claude-Web | Training Claude models (Anthropic's newer token is ClaudeBot) |
| Anthropic | Claude-User | User-initiated Claude web access |
| Google | Google-Extended | AI training (separate from search) |
| Google | Googlebot | Search indexing (also feeds AI features) |
| Perplexity | PerplexityBot | Perplexity AI search indexing |
| Perplexity | Perplexity-User | Perplexity answer generation |
| Common Crawl | CCBot | Open web dataset (used by many AI companies) |
| ByteDance | Bytespider | TikTok/ByteDance AI training |
| Amazon | Amazonbot | Alexa and Amazon AI training |
| Meta | FacebookBot | Meta AI training |
| Microsoft | Bingbot | Bing search (includes AI features) |
| Apple | Applebot | Apple AI training and search |

How to Block AI Crawlers

Basic Blocking in robots.txt

To block all AI crawlers from your entire site:

# Block OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Block Anthropic/Claude (ClaudeBot is Anthropic's current training
# crawler; Claude-Web is its older token, so block both to be safe)
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Claude-User
Disallow: /

# Block Google AI training (not search)
User-agent: Google-Extended
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block ByteDance/TikTok
User-agent: Bytespider
Disallow: /

# Block Amazon
User-agent: Amazonbot
Disallow: /

# Block Meta
User-agent: FacebookBot
Disallow: /
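Before deploying rules like these, you can verify them locally with Python's standard-library robots.txt parser. A small sketch (the example.com URLs are placeholders):

```python
# Verify locally that a policy blocks GPTBot while leaving other crawlers alone.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/blog/post"))  # True
```

Because no rule targets `SomeOtherBot` and there is no `User-agent: *` group, the parser allows it by default, which mirrors how real crawlers interpret the file.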

Selective Blocking

Block crawlers from specific sections. Keep all rules for a given user-agent in a single group: if you split them across groups, some parsers honor only the first one they find:

# Allow GPTBot on the public blog, but keep it out of
# admin, API, and premium/member areas
User-agent: GPTBot
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /about/

Blocking Specific File Types

Prevent AI crawlers from accessing certain content types. Note that wildcard (*) and end-of-URL ($) matching are extensions popularized by Google; the major crawlers honor them, but they are not part of the original robots.txt standard:

User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/

User-agent: Claude-Web
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /downloads/

Strategic Approaches

Approach 1: Block Training, Allow AI Search

Allow AI search engines to index you for citations, but block training crawlers:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search-oriented crawlers (these cite sources)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:

Best for: Content sites that want AI search visibility but don't want to contribute to model training.

Approach 2: Allow All AI Access

Let all AI crawlers access your content:

# No AI-specific blocks
# Your content is open to all crawlers

Best for: Sites focused on maximum visibility, thought leadership, or open knowledge sharing.

Approach 3: Block Everything

Prevent all automated access. This also blocks Googlebot and Bingbot, so your search visibility will disappear along with AI access; it is not recommended for most sites:

User-agent: *
Disallow: /

Best for: Private intranets, staging sites, or highly sensitive content.

Approach 4: Time-Delayed Access

Allow crawlers only after content's exclusivity window has passed. robots.txt has no time-based directives, so this relies on your publishing workflow: keep recent content under a blocked path and move it to an allowed path once the window closes:

# Premium content blocked for 30 days
User-agent: GPTBot
Disallow: /premium/recent/

# Archive content allowed
User-agent: GPTBot
Allow: /premium/archive/

Best for: News sites, research publications, or subscription-based content.

Testing Your robots.txt

Manual Verification

  1. Check syntax: Use our free robots.txt checker
  2. Test specific URLs: Verify each blocked path is actually blocked
  3. Check AI-specific rules: Ensure AI user-agents are properly targeted

What to Test

Test URL: https://yoursite.com/private/content

Expected: Blocked for GPTBot, Claude-Web, etc.
Actual: [Check with tool]

Test URL: https://yoursite.com/blog/public-post

Expected: Allowed for all
Actual: [Check with tool]
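The checks above can be scripted with `urllib.robotparser`. In this sketch the inline sample policy is an assumption standing in for your live robots.txt, and yoursite.com is a placeholder:

```python
# Run a small expected-vs-actual matrix against a robots.txt policy.
from urllib.robotparser import RobotFileParser

sample_policy = """\
User-agent: GPTBot
Disallow: /private/

User-agent: Claude-Web
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_policy.splitlines())

checks = [
    # (user-agent, URL, expected "allowed?" result)
    ("GPTBot",     "https://yoursite.com/private/content",  False),
    ("Claude-Web", "https://yoursite.com/private/content",  False),
    ("GPTBot",     "https://yoursite.com/blog/public-post", True),
]

for agent, url, expected in checks:
    allowed = parser.can_fetch(agent, url)
    status = "OK" if allowed == expected else "MISMATCH"
    print(f"{status}: {agent} -> {url} (allowed={allowed})")
```

To test the live file instead, construct the parser with `RobotFileParser("https://yoursite.com/robots.txt")` and call `read()` before running the checks.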

Which Crawlers Should You Allow?

Recommended: Allow These

| Crawler | Why Allow | Impact |
| --- | --- | --- |
| ChatGPT-User | Citations when ChatGPT browses on a user's behalf | High visibility potential |
| PerplexityBot | Perplexity search citations | Growing traffic source |
| Perplexity-User | Perplexity answer generation | Direct citations |
| Googlebot | Search indexing + AI Overviews | Essential for SEO |
| Bingbot | Bing search + Copilot | Significant search traffic |

Consider Carefully

| Crawler | Considerations |
| --- | --- |
| GPTBot | Trains ChatGPT; blocking protects content but forfeits future citation potential |
| Claude-Web | Trains Claude; same trade-off as GPTBot |
| Google-Extended | Separates AI training from search; can be blocked without hurting SEO |
| Applebot | Apple's AI efforts; smaller impact today |

Often Blocked

| Crawler | Common Reason to Block |
| --- | --- |
| CCBot | Wide distribution of scraped data; hard to control downstream usage |
| Bytespider | TikTok/ByteDance training; competitive concerns |
| Amazonbot | Amazon AI training; retail/competitive concerns |
| FacebookBot | Meta AI training; data usage concerns |

The Business Impact of Blocking

Potential Benefits

  • Content protection — Your unique insights remain exclusive
  • Competitive moat — Harder for competitors to replicate your approach
  • Reduced costs — Lower bandwidth and server load
  • Legal clarity — Clearer position on data usage

Potential Drawbacks

  • Missed citations — AI models can't cite what they can't see
  • Reduced visibility — Missing out on growing AI search traffic
  • User frustration — AI assistants may give outdated info about you
  • Future regret — Hard to reverse decision if AI search becomes dominant

Real-World Examples

Sites That Block AI Crawlers

The New York Times

  • Blocks GPTBot, CCBot
  • Protects premium journalism from training
  • Maintains subscription value

Reddit

  • Blocks most AI crawlers
  • Negotiates direct licensing deals instead
  • Monetizes content access

Sites That Allow AI Crawlers

Wikipedia

  • Allows all crawlers
  • Mission-aligned with knowledge sharing
  • Benefits from AI citations

Shopify

  • Allows AI crawlers on public content
  • Gains visibility in AI commerce recommendations
  • Blocks from admin/internal areas

Making Your Decision

Questions to Ask

  1. Is your content unique/proprietary?

    • Yes → Consider blocking training crawlers
    • No → Allow for visibility
  2. Do you rely on search traffic?

    • Yes → Allow search crawlers (Googlebot, Bingbot)
    • No → More flexibility
  3. Is AI citation valuable to your business?

    • Yes → Allow ChatGPT-User, PerplexityBot
    • No → Block without concern
  4. Do you have the resources to monitor?

    • Yes → Can adjust strategy based on results
    • No → Pick a position and stick with it

Our Recommendation

For most websites, we recommend:

# Allow AI search (citation potential)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:

User-agent: Claude-User
Disallow:

# Block AI training (content protection)
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow search engines (essential SEO)
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

This approach:

  • ✅ Protects your content from training models
  • ✅ Allows AI search engines to cite you
  • ✅ Maintains traditional SEO
  • ✅ Keeps options open for the future
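If you expect to revise this policy as new crawlers appear, generating robots.txt from a single table can be easier than editing it by hand. A sketch (the `render_robots_txt` helper is illustrative, not a standard tool; crawler names follow the recommendation above):

```python
# Generate a robots.txt policy from a crawler -> disallowed-path mapping.
# An empty path renders as a bare "Disallow:" line, meaning full access.
POLICY = {
    # AI search / answer crawlers: allowed
    "ChatGPT-User": "",
    "PerplexityBot": "",
    "Perplexity-User": "",
    "Claude-User": "",
    # AI training crawlers: blocked site-wide
    "GPTBot": "/",
    "Claude-Web": "/",
    "CCBot": "/",
    "Bytespider": "/",
    # Traditional search engines: allowed
    "Googlebot": "",
    "Bingbot": "",
}

def render_robots_txt(policy: dict[str, str]) -> str:
    blocks = [
        f"User-agent: {agent}\nDisallow: {path}".rstrip() + "\n"
        for agent, path in policy.items()
    ]
    return "\n".join(blocks)

print(render_robots_txt(POLICY))
```

Adding or removing a crawler then becomes a one-line change to `POLICY`, and the rendered file stays consistently formatted.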

Monitoring and Adjusting

Track AI Referral Traffic

Monitor your analytics for traffic from:

  • chatgpt.com / chat.openai.com
  • perplexity.ai
  • claude.ai
  • AI browser extensions and apps
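A rough way to quantify this from a combined-format access log, where the referrer appears as a quoted field. The sample lines and log format are illustrative; adapt the parsing to your server:

```python
# Count requests whose referrer points at a known AI assistant domain.
AI_REFERRERS = ("chatgpt.com", "chat.openai.com", "perplexity.ai", "claude.ai")

def count_ai_referrals(log_lines):
    counts = {host: 0 for host in AI_REFERRERS}
    for line in log_lines:
        for host in AI_REFERRERS:
            if host in line:
                counts[host] += 1
                break  # attribute each request to one source
    return counts

sample = [
    '1.2.3.4 - - [05/Apr/2025] "GET /blog/post HTTP/1.1" 200 "https://chatgpt.com/" "Mozilla/5.0"',
    '5.6.7.8 - - [05/Apr/2025] "GET / HTTP/1.1" 200 "https://www.google.com/" "Mozilla/5.0"',
]
print(count_ai_referrals(sample))
# → {'chatgpt.com': 1, 'chat.openai.com': 0, 'perplexity.ai': 0, 'claude.ai': 0}
```

Running this weekly over your logs gives a simple trend line for whether allowing AI search crawlers is actually paying off in referrals.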

Revisit Your Decision Quarterly

The AI landscape changes rapidly. Review:

  • New crawler user-agents
  • Changes in referral traffic
  • Business strategy shifts
  • Competitive landscape

Conclusion

Controlling AI crawler access is now a standard part of website management. The right approach depends on your content strategy, business model, and risk tolerance.

Remember:

  • You can always change your robots.txt
  • Start conservative and open up if beneficial
  • Monitor the impact of your decisions
  • What's right for others may not be right for you

Next step: Check your current robots.txt configuration.

→ Test your robots.txt for AI crawler blocks


Want to understand the full picture of your AI visibility? Run a complete AI visibility scan to see how AI models currently see your website.
