
Robots.txt for AI: How to Control Which AI Bots Access Your Site

Learn how to configure robots.txt to control AI crawlers like GPTBot, ClaudeBot, and PerplexityBot. Includes a complete guide to all AI user agents and best practices.

SeenByAI Team·April 12, 2025·11 min read


AI crawlers are visiting your website right now — whether you know it or not.

GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and dozens of other AI crawlers are continuously scraping the web to train their models. The question isn't whether they're accessing your site, but whether you've given them permission.

This guide shows you how to use robots.txt to control which AI bots can access your content, which user agents to watch for, and how to make the right decision for your website.

Understanding AI Crawlers and robots.txt

What Is robots.txt?

robots.txt is a text file at the root of your website (yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access. It's been the standard for controlling traditional search engine crawlers like Googlebot — and now it's the primary tool for controlling AI crawlers too.
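To see how these rules are interpreted in practice, you can parse a robots.txt with Python's standard-library parser. A minimal sketch (the rules and paths here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed from a string instead of a live site
rules = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/private/report"))  # False: GPTBot is blocked here
print(rp.can_fetch("GPTBot", "/blog/post"))       # True: everything else is open
```

In production you would point the parser at a live file with `rp.set_url("https://yourdomain.com/robots.txt")` followed by `rp.read()`.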

How AI Crawlers Use robots.txt

Unlike traditional search crawlers that index pages for search results, AI crawlers use your content to train their language models. Here's the key difference:

| Aspect | Traditional Crawlers (Googlebot) | AI Crawlers (GPTBot, ClaudeBot) |
|---|---|---|
| Purpose | Index pages for search results | Collect training data for AI models |
| Citation | Links in search results | Content may be cited in AI responses |
| Traffic | Drives organic search traffic | May drive referral traffic if cited |
| Control | robots.txt + meta tags | robots.txt (primary control) |
| Visibility | Pages appear in SERPs | Content influences AI-generated answers |

Why robots.txt Matters More Than Ever

With AI search growing rapidly, your robots.txt configuration directly affects:

  • Whether AI models can learn from your content
  • Whether AI chatbots will cite your website
  • Your AI Visibility Score across platforms
  • Your competitive positioning in AI-generated answers

Complete List of AI Crawler User Agents

Here's a comprehensive list of AI crawler user agents you should know about:

Major AI Platform Crawlers

| User Agent | Company/Platform | Purpose | Documentation |
|---|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Web crawling for GPT model training | openai.com/gptbot |
| ChatGPT-User | OpenAI (ChatGPT) | Real-time content fetching for user queries | openai.com |
| ClaudeBot | Anthropic (Claude) | Web crawling for Claude model training | anthropic.com |
| PerplexityBot | Perplexity AI | Web crawling for search and answers | perplexity.ai |
| Google-Extended | Google (Gemini/Bard) | Training data for AI models (beyond search indexing) | developers.google.com |
| Claude-Web | Anthropic (Claude) | Claude web search feature | anthropic.com |
| Bytespider | ByteDance | AI model training data | bytedance.com |
| Applebot-Extended | Apple (Apple Intelligence) | Training data for Apple AI features | developer.apple.com |
| FacebookBot | Meta (Meta AI) | AI training and features | developers.facebook.com |
| CCBot | Common Crawl | Open web dataset used by many AI companies | commoncrawl.org |
| anthropic-ai | Anthropic (Claude) | Legacy Claude crawler | anthropic.com |
| OAI-SearchBot | OpenAI (SearchGPT) | OpenAI's search product | openai.com |
| cohere-ai | Cohere | AI model training | cohere.com |

AI-Powered SEO and Research Crawlers

| User Agent | Service | Purpose |
|---|---|---|
| Tinyscout | Tinybot | AI-powered web research |
| ZoominfoBot | ZoomInfo | AI business intelligence |
| SemrushBot | Semrush | SEO and AI analysis |
| AhrefsBot | Ahrefs | SEO and AI analysis |
| MJ12bot | Majestic | Link and content analysis |

How to Configure robots.txt for AI Crawlers

Option 1: Allow All AI Crawlers (Recommended for Most Sites)

If you want AI models to discover and cite your content:

# robots.txt — Allow AI crawlers

User-agent: *
Allow: /

# Specifically allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block access to non-public areas
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /wp-admin/
Disallow: /.env

# Sitemaps
Sitemap: https://yourdomain.com/sitemap.xml

When to choose this option:

  • Content sites and blogs that want AI visibility
  • E-commerce stores that want product recommendations from AI
  • SaaS companies that want to be featured in AI tool recommendations
  • Any business that wants to appear in AI-generated search results

Option 2: Block All AI Crawlers

If you don't want AI models to use your content:

# robots.txt — Block AI crawlers

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow everything else
User-agent: *
Allow: /

When to choose this option:

  • Sites with proprietary or confidential content
  • News organizations protecting copyrighted material
  • Membership sites with paywalled content
  • Sites that want to control their AI narrative completely

Option 3: Selective AI Crawler Control

Block some AI crawlers while allowing others:

# robots.txt — Selective AI control

# Allow AI crawlers that cite sources
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block AI crawlers that only use data for training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# General rules
User-agent: *
Allow: /
Disallow: /admin/

When to choose this option:

  • You want to appear in AI search results but not train AI models
  • You trust some platforms more than others
  • You're testing different AI visibility strategies

Option 4: Partial Content Access

Allow AI crawlers on your blog but block them from other areas:

# robots.txt — Partial AI access

# AI crawlers can access blog and public content
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Disallow: /

# Allow all for search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/
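Before deploying a mixed configuration like this, it's worth sanity-checking that the Allow/Disallow combination behaves as intended. A sketch using Python's stdlib parser (which applies rules in file order, first match wins, so listing Allow lines before the broad Disallow keeps the result consistent with crawlers like Googlebot that instead prefer the most specific rule):

```python
from urllib.robotparser import RobotFileParser

# The GPTBot group from the "partial access" configuration above
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/blog/my-post"))  # True: blog is open
print(rp.can_fetch("GPTBot", "/pricing"))       # False: everything else is blocked
```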

When to choose this option:

  • You have mixed content (public blog + private areas)
  • You want AI to cite your educational content but not your proprietary pages
  • E-commerce sites that want product pages visible but not customer data

Advanced robots.txt Techniques for AI

Rate Limiting AI Crawlers

Some AI crawlers respect the Crawl-delay directive:

User-agent: GPTBot
Allow: /
Crawl-delay: 10

User-agent: ClaudeBot
Allow: /
Crawl-delay: 10

This asks crawlers to wait 10 seconds between requests, reducing server load. Crawl-delay is a non-standard extension, so crawlers that don't support it (including Googlebot) simply ignore the line.
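Python's stdlib parser exposes the directive, which is handy for checking what a compliant crawler would see. A sketch (the parser only reads the value; enforcing the delay is up to the crawler):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Allow: /
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("GPTBot"))        # 10
print(rp.crawl_delay("SomeOtherBot"))  # None: no delay declared for other agents
```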

Using Wildcards

You can use patterns to block or allow specific URL patterns:

User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /api/*
Disallow: /*?ref=*
Disallow: /*.json$
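Note that * and $ come from Google's extension of the original robots.txt spec, and not every parser supports them (Python's urllib.robotparser, for instance, matches paths as literal prefixes). The wildcard semantics can be sketched with a small regex translation — a simplified illustration, not a full implementation:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Approximate robots.txt wildcard matching:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # un-escape the trailing '$' into a real anchor
    return re.match(regex, path) is not None

print(rule_matches("/api/*", "/api/v1/users"))          # True
print(rule_matches("/*.json$", "/data/feed.json"))      # True
print(rule_matches("/*.json$", "/data/feed.json?x=1"))  # False: '$' anchors the end
```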

Sitemap Guidance for AI

Include your sitemap so crawlers (AI or otherwise) can discover your content efficiently. Note that Sitemap is a global directive: it applies to all crawlers regardless of where it appears in the file, so you only need to declare it once:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Beyond robots.txt: Additional AI Crawler Controls

Meta Tags

You can signal AI preferences at the page level using meta tags:

<!-- Signal that AI use of this page is not permitted -->
<meta name="robots" content="noai, noimageai">

<!-- Target specific crawlers by name -->
<meta name="GPTBot" content="noindex, nofollow">
<meta name="ClaudeBot" content="noindex, nofollow">
<meta name="PerplexityBot" content="noindex, nofollow">

Note: Meta tag support varies widely. The noai and noimageai values are informal proposals rather than part of any standard, and most AI vendors document robots.txt as their primary (often only) supported control — so treat meta tags as a supplementary signal, not a guarantee.

X-Robots-Tag HTTP Headers

For non-HTML files (PDFs, images, documents), the same signals can be sent as an HTTP response header — with the same caveat that the noai values are informal and not universally honored:

X-Robots-Tag: noai, noimageai

llms.txt for Positive Control

While robots.txt controls what crawlers can't access, llms.txt tells them what they should find:

# yourdomain.com/llms.txt

> Your website description for AI models

## Articles
- [Article Title](https://yourdomain.com/article): Brief description

## About
- Author information and credentials

Learn more about creating an llms.txt file in our complete llms.txt guide.

Should You Block AI Crawlers?

This is one of the most debated questions in web publishing today. Here's a balanced view:

Reasons to Allow AI Crawlers ✅

| Reason | Explanation |
|---|---|
| AI citations drive traffic | Being cited by ChatGPT, Claude, or Perplexity sends high-intent visitors |
| AI visibility is the new SEO | As AI search grows, blocking crawlers means becoming invisible |
| Early mover advantage | Fewer competitors in AI search means more visibility now |
| Free exposure | AI models are essentially promoting your content for free |
| User expectations | Users expect AI to recommend the best resources |

Reasons to Block AI Crawlers ❌

| Reason | Explanation |
|---|---|
| Copyright concerns | Your content trains models without compensation |
| Competitive risk | AI could use your content to help competitors |
| Loss of control | AI might misrepresent or hallucinate about your content |
| No direct attribution | Some AI models cite content without clear links |
| Resource usage | AI crawlers consume bandwidth without guaranteed benefit |

Our Recommendation

For most websites, allowing AI crawlers is the better choice. The traffic and visibility benefits outweigh the risks. However, you should:

  1. Monitor AI crawler activity in your server logs
  2. Check AI responses regularly for how your content is being used
  3. Use llms.txt to guide AI models to your best content
  4. Review policies periodically as the AI landscape evolves

Read our full analysis in Should You Block AI Crawlers? Pros, Cons, and How to Decide.

Monitoring AI Crawler Activity

Check Server Logs

Monitor which AI crawlers are visiting your site:

# Check for AI crawler visits
grep -i "GPTBot\|ClaudeBot\|PerplexityBot\|Google-Extended\|Applebot-Extended\|CCBot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Track AI Crawler Frequency

# Daily AI crawler requests
grep "GPTBot" /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c

Use Analytics Tools

Many web analytics platforms can filter for AI crawler user agents. Look for:

  • Cloudflare Analytics: Bot traffic reports
  • Google Analytics 4: Filter reports by user agent
  • Log analysis tools: GoAccess, AWStats, or custom scripts
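If you prefer scripting over grep pipelines, the same tally can be done in a few lines of Python. A sketch over hypothetical log lines — in practice you would read them from your access-log file:

```python
from collections import Counter

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot")

# Hypothetical combined-log-format lines for illustration
log_lines = [
    '1.2.3.4 - - [12/Apr/2025:10:00:01 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [12/Apr/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
    '1.2.3.4 - - [12/Apr/2025:10:00:09 +0000] "GET /guides/x HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]

# Count requests per AI crawler by substring match on the user-agent field
hits = Counter()
for line in log_lines:
    for agent in AI_AGENTS:
        if agent in line:
            hits[agent] += 1

print(hits.most_common())  # [('GPTBot', 2), ('ClaudeBot', 1)]
```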

Verifying Your robots.txt Configuration

Test Your robots.txt

Use these tools to verify your configuration:

  1. Google Search Console: the robots.txt report (the standalone robots.txt Tester has been retired)
  2. OpenAI GPTBot test: Check if your content appears in ChatGPT responses
  3. Perplexity test: Search for your content on Perplexity and check citations
  4. SeenByAI: Check your AI Visibility Score to see if AI platforms can find you

Common robots.txt Mistakes

| Mistake | Problem | Fix |
|---|---|---|
| Wrong file location | Must be at the site root: /robots.txt | Move to yourdomain.com/robots.txt |
| Blocking everything | Disallow: / blocks all crawlers | Be specific with your rules |
| Typos in user agents | Crawlers won't match misspelled names | Double-check exact user agent strings |
| No sitemap reference | Crawlers may miss pages | Add a Sitemap: directive |
| Contradictory rules | Crawlers resolve conflicts differently (Google prefers the most specific matching rule) | Avoid overlapping Allow/Disallow rules where possible |
| Blocking CSS/JS | Prevents proper page rendering | Allow /wp-content/, /static/, etc. |

Quick-Start Template

Here's a production-ready template you can customize:

# robots.txt for [Your Website]
# Last updated: 2025-04-12

# === AI Crawler Configuration ===

# OpenAI (ChatGPT)
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic (Claude)
User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Claude-Web
Allow: /

# Perplexity AI
User-agent: PerplexityBot
Allow: /

# Google (Search + AI)
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# === Traditional Search Engines ===

User-agent: Bingbot
Allow: /

# === General Rules ===

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/internal/
Disallow: /*.json$

# === Sitemaps ===

Sitemap: https://yourdomain.com/sitemap.xml

Key Takeaways

  1. robots.txt is your primary tool for controlling AI crawler access to your website
  2. Know the major AI user agents: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended
  3. Most websites should allow AI crawlers to maximize AI visibility and referral traffic
  4. Be specific with rules — allow blog content, block private areas
  5. Combine with llms.txt for positive guidance about your content
  6. Monitor and verify your configuration regularly
  7. The landscape is evolving — review your robots.txt quarterly as new crawlers emerge

Check Your AI Visibility

Want to know if AI crawlers are actually finding and citing your content? Get your free AI visibility report →

SeenByAI scans your website across ChatGPT, Claude, Perplexity, and Google AI to show your current AI Visibility Score and identify exactly what to fix.
