Robots.txt for AI: How to Control Which AI Bots Access Your Site
AI crawlers are visiting your website right now — whether you know it or not.
GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and dozens of other AI crawlers are continuously scraping the web to train their models. The question isn't whether they're accessing your site, but whether you've given them permission.
This guide shows you how to use robots.txt to control which AI bots can access your content, which user agents to watch for, and how to make the right decision for your website.
Understanding AI Crawlers and robots.txt
What Is robots.txt?
robots.txt is a text file at the root of your website (yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access. It's been the standard for controlling traditional search engine crawlers like Googlebot — and now it's the primary tool for controlling AI crawlers too.
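For instance, a minimal robots.txt that admits every crawler while fencing off an admin area looks like this (the paths and domain are placeholders):

```
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
```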
How AI Crawlers Use robots.txt
Unlike traditional search crawlers that index pages for search results, AI crawlers use your content to train their language models. Here are the key differences:
| Aspect | Traditional Crawlers (Googlebot) | AI Crawlers (GPTBot, ClaudeBot) |
|---|---|---|
| Purpose | Index pages for search results | Collect training data for AI models |
| Citation | Links in search results | Content may be cited in AI responses |
| Traffic | Drives organic search traffic | May drive referral traffic if cited |
| Control | robots.txt + meta tags | robots.txt (primary control) |
| Visibility | Pages appear in SERPs | Content influences AI-generated answers |
Why robots.txt Matters More Than Ever
With AI search growing rapidly, your robots.txt configuration directly affects:
- Whether AI models can learn from your content
- Whether AI chatbots will cite your website
- Your AI Visibility Score across platforms
- Your competitive positioning in AI-generated answers
Complete List of AI Crawler User Agents
Here are the main AI crawler user agents you should know about:
Major AI Platform Crawlers
| User Agent | Company/Platform | Purpose | Documentation |
|---|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Web crawling for GPT model training | openai.com/gptbot |
| ChatGPT-User | OpenAI (ChatGPT) | Real-time content fetching for user queries | openai.com |
| ClaudeBot | Anthropic (Claude) | Web crawling for Claude model training | anthropic.com |
| PerplexityBot | Perplexity AI | Web crawling for search and answers | perplexity.ai |
| Google-Extended | Google (Gemini) | Training data for AI models (beyond search indexing) | developers.google.com |
| Claude-Web | Anthropic (Claude) | Claude web search feature | anthropic.com |
| Bytespider | ByteDance | AI model training data | bytedance.com |
| Applebot-Extended | Apple (Apple Intelligence) | Training data for Apple AI features | developer.apple.com |
| FacebookBot | Meta (Meta AI) | AI training and features | developers.facebook.com |
| CCBot | Common Crawl | Open web dataset used by many AI companies | commoncrawl.org |
| anthropic-ai | Anthropic (Claude) | Legacy Claude crawler | anthropic.com |
| OAI-SearchBot | OpenAI (SearchGPT) | OpenAI's search product | openai.com |
| cohere-ai | Cohere | AI model training | cohere.com |
AI-Powered SEO and Research Crawlers
| User Agent | Service | Purpose |
|---|---|---|
| Tinyscout | Tinybot | AI-powered web research |
| ZoominfoBot | ZoomInfo | AI business intelligence |
| SemrushBot | Semrush | SEO and AI analysis |
| AhrefsBot | Ahrefs | SEO and AI analysis |
| MJ12bot | Majestic | Link and content analysis |
How to Configure robots.txt for AI Crawlers
Option 1: Allow All AI Crawlers (Recommended for Most Sites)
If you want AI models to discover and cite your content:
```
# robots.txt — Allow AI crawlers

# Specifically allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Default: allow everything except non-public areas
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /wp-admin/
Disallow: /.env

# Sitemaps
Sitemap: https://yourdomain.com/sitemap.xml
```

Note that the `User-agent: *` rules are kept in a single group: duplicate groups for the same agent are handled inconsistently across parsers.
When to choose this option:
- Content sites and blogs that want AI visibility
- E-commerce stores that want product recommendations from AI
- SaaS companies that want to be featured in AI tool recommendations
- Any business that wants to appear in AI-generated search results
Option 2: Block All AI Crawlers
If you don't want AI models to use your content:
```
# robots.txt — Block AI crawlers

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow everything else
User-agent: *
Allow: /
```
When to choose this option:
- Sites with proprietary or confidential content
- News organizations protecting copyrighted material
- Membership sites with paywalled content
- Sites that want to control their AI narrative completely
Option 3: Selective AI Crawler Control
Block some AI crawlers while allowing others:
```
# robots.txt — Selective AI control

# Allow AI crawlers that cite sources
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block AI crawlers that only use data for training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# General rules
User-agent: *
Allow: /
Disallow: /admin/
```
When to choose this option:
- You want to appear in AI search results but not train AI models
- You trust some platforms more than others
- You're testing different AI visibility strategies
Option 4: Partial Content Access
Allow AI crawlers on your blog but block them from other areas:
```
# robots.txt — Partial AI access

# AI crawlers can access blog and public content
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /guides/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Disallow: /

# Allow all for search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/
```
When to choose this option:
- You have mixed content (public blog + private areas)
- You want AI to cite your educational content but not your proprietary pages
- E-commerce sites that want product pages visible but not customer data
Advanced robots.txt Techniques for AI
Rate Limiting AI Crawlers
Some AI crawlers respect the Crawl-delay directive:
```
User-agent: GPTBot
Allow: /
Crawl-delay: 10

User-agent: ClaudeBot
Allow: /
Crawl-delay: 10
```
This asks crawlers to wait 10 seconds between requests, reducing server load. Support for `Crawl-delay` is inconsistent, however, so don't rely on it as your only rate-limiting mechanism.
Using Wildcards
You can use patterns to block or allow specific URL patterns:
```
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /api/*
Disallow: /*?ref=*
Disallow: /*.json$
```
Sitemap Guidance for AI
Include your sitemap so crawlers can discover your content efficiently. Note that `Sitemap:` is a file-wide directive: it applies to all crawlers regardless of which user-agent group it sits near, so it only needs to appear once:

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Beyond robots.txt: Additional AI Crawler Controls
Meta Tags
You can control AI access at the page level using meta tags:
```html
<!-- Block all AI crawlers from this page -->
<meta name="robots" content="noai, noimageai">

<!-- Block specific AI crawlers -->
<meta name="GPTBot" content="noindex, nofollow">
<meta name="ClaudeBot" content="noindex, nofollow">
<meta name="PerplexityBot" content="noindex, nofollow">
```
Note: Meta tag support varies by crawler, and nonstandard directives like `noai` are not universally honored. Treat robots.txt as the more reliable control.
X-Robots-Tag HTTP Headers
For non-HTML files (PDFs, images, documents):
```
X-Robots-Tag: noai, noimageai
```
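As a sketch, here is one way to attach that header server-side, assuming an nginx setup (Apache and other servers have equivalents, and the `noai`/`noimageai` values remain nonstandard):

```nginx
# Sketch: send X-Robots-Tag on PDF responses (inside a server block).
# noai/noimageai are nonstandard directives; not all crawlers honor them.
location ~* \.pdf$ {
    add_header X-Robots-Tag "noai, noimageai";
}
```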
llms.txt for Positive Control
While robots.txt controls what crawlers can't access, llms.txt tells them what they should find:
```
# yourdomain.com/llms.txt

> Your website description for AI models

## Articles
- [Article Title](https://yourdomain.com/article): Brief description

## About
- Author information and credentials
```
Learn more about creating an llms.txt file in our complete llms.txt guide.
Should You Block AI Crawlers?
This is one of the most debated questions in web publishing today. Here's a balanced view:
Reasons to Allow AI Crawlers ✅
| Reason | Explanation |
|---|---|
| AI citations drive traffic | Being cited by ChatGPT, Claude, or Perplexity sends high-intent visitors |
| AI visibility is the new SEO | As AI search grows, blocking crawlers means becoming invisible |
| Early mover advantage | Fewer competitors in AI search means more visibility now |
| Free exposure | AI models are essentially promoting your content for free |
| User expectations | Users expect AI to recommend the best resources |
Reasons to Block AI Crawlers ❌
| Reason | Explanation |
|---|---|
| Copyright concerns | Your content trains models without compensation |
| Competitive risk | AI could use your content to help competitors |
| Loss of control | AI might misrepresent or hallucinate about your content |
| No direct attribution | Some AI models cite content without clear links |
| Resource usage | AI crawlers consume bandwidth without guaranteed benefit |
Our Recommendation
For most websites, allowing AI crawlers is the better choice. The traffic and visibility benefits outweigh the risks. However, you should:
- Monitor AI crawler activity in your server logs
- Check AI responses regularly for how your content is being used
- Use llms.txt to guide AI models to your best content
- Review policies periodically as the AI landscape evolves
Read our full analysis in Should You Block AI Crawlers? Pros, Cons, and How to Decide.
Monitoring AI Crawler Activity
Check Server Logs
Monitor which AI crawlers are visiting your site:
```bash
# Count AI crawler visits per IP address
grep -i "GPTBot\|ClaudeBot\|PerplexityBot\|Google-Extended\|Applebot-Extended\|CCBot" \
  /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
```
Track AI Crawler Frequency
```bash
# Daily GPTBot request counts
grep "GPTBot" /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c
```
Use Analytics Tools
Many web analytics platforms can filter for AI crawler user agents. Look for:
- Cloudflare Analytics: Bot traffic reports
- Google Analytics 4: Filter reports by user agent
- Log analysis tools: GoAccess, AWStats, or custom scripts
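If you prefer to script the analysis yourself, a small Python sketch like the following tallies requests per AI crawler from an access log (the crawler list and sample log lines are illustrative; adapt them to your own log format):

```python
from collections import Counter

# Substring markers for the AI crawlers discussed above (extend as needed).
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "OAI-SearchBot",
]

def count_ai_hits(lines):
    """Count requests per AI crawler across access-log lines."""
    hits = Counter()
    for line in lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break  # attribute each request line to one crawler
    return hits

# Illustrative combined-format log lines:
sample = [
    '1.2.3.4 - - [12/Apr/2025:10:00:01 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "GPTBot/1.1"',
    '5.6.7.8 - - [12/Apr/2025:10:00:05 +0000] "GET /guides/ HTTP/1.1" 200 812 "-" "ClaudeBot/1.0"',
]
print(count_ai_hits(sample))
```

Pointed at your real log file (one line per request), this gives the same per-crawler counts as the grep commands above, in one pass.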
Verifying Your robots.txt Configuration
Test Your robots.txt
Use these tools to verify your configuration:
- Google Search Console: robots.txt report (the standalone robots.txt Tester has been retired)
- OpenAI GPTBot test: Check if your content appears in ChatGPT responses
- Perplexity test: Search for your content on Perplexity and check citations
- SeenByAI: Check your AI Visibility Score to see if AI platforms can find you
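You can also test rules locally before deploying them. Python's standard-library `urllib.robotparser` evaluates a robots.txt against a given user agent and path (the rules below are a made-up example):

```python
from urllib.robotparser import RobotFileParser

def crawler_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under these robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

rules = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(crawler_allowed(rules, "GPTBot", "/blog/post"))     # True
print(crawler_allowed(rules, "GPTBot", "/private/data"))  # False
```

Running this against your real robots.txt text, for each user agent you care about, catches typos and contradictory rules before they go live.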
Common robots.txt Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Wrong file location | Must be at root: /robots.txt | Move to yourdomain.com/robots.txt |
| Blocking everything | Disallow: / blocks all crawlers | Be specific with your rules |
| Typos in user agents | Crawlers won't match misspelled names | Double-check exact user agent strings |
| No sitemap reference | Crawlers may miss pages | Add Sitemap: directive |
| Contradictory rules | Parsers resolve conflicts differently; Google applies the most specific matching rule | Keep one unambiguous group per user agent |
| Blocking CSS/JS | Prevents proper page rendering | Allow /wp-content/, /static/, etc. |
Quick-Start Template
Here's a production-ready template you can customize:
```
# robots.txt for [Your Website]
# Last updated: 2025-04-12

# === AI Crawler Configuration ===

# OpenAI (ChatGPT)
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic (Claude)
User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Claude-Web
Allow: /

# Perplexity AI
User-agent: PerplexityBot
Allow: /

# Google (Search + AI)
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# === Traditional Search Engines ===
User-agent: Bingbot
Allow: /

# === General Rules ===
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
Disallow: /api/internal/
Disallow: /*.json$

# === Sitemaps ===
Sitemap: https://yourdomain.com/sitemap.xml
```
Key Takeaways
- robots.txt is your primary tool for controlling AI crawler access to your website
- Know the major AI user agents: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended
- Most websites should allow AI crawlers to maximize AI visibility and referral traffic
- Be specific with rules — allow blog content, block private areas
- Combine with llms.txt for positive guidance about your content
- Monitor and verify your configuration regularly
- The landscape is evolving — review your robots.txt quarterly as new crawlers emerge
Check Your AI Visibility
Want to know if AI crawlers are actually finding and citing your content? Get your free AI visibility report →
SeenByAI scans your website across ChatGPT, Claude, Perplexity, and Google AI to show your current AI Visibility Score and identify exactly what to fix.