If I block AI bots from training data, will they still cite my pages?

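It depends on the crawler. Some providers separate training crawlers from search and browsing crawlers. OpenAI, for example, documents GPTBot for training, OAI-SearchBot for ChatGPT search results, and ChatGPT-User for user-initiated browsing. Blocking only GPTBot opts you out of training without removing your pages from ChatGPT search; blocking all three does both. Check each provider's documentation for its current user agents and opt-out paths.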

How do I check which bots are currently blocked on my site?

Access your robots.txt directly at yoursite.com/robots.txt. Look for Disallow rules on User-agent: * (which applies to all bots) and on specific AI crawler agents. Also check your CDN/WAF settings — Cloudflare's Bot Fight Mode and similar tools can block AI crawlers at the network level.
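The robots.txt part of this check can be scripted. The sketch below uses Python's standard-library urllib.robotparser and takes the robots.txt content as a string. The function name blocked_agents and the example URL are illustrative assumptions, and the agent list is the one used in this article. Python's parser applies the first matching rule in file order, which can differ from the longest-match behavior some crawlers use, so treat the result as a first pass. It does not detect CDN/WAF blocks.

```python
from urllib.robotparser import RobotFileParser

# AI crawler tokens discussed in this article.
AI_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "anthropic-ai",
    "PerplexityBot",
    "Amazonbot",
    "Google-Extended",
]


def blocked_agents(robots_txt, url="/", agents=AI_AGENTS):
    """Return the agents that the given robots.txt text disallows for url.

    Hypothetical helper: robots_txt is the file content as a string,
    for example what you see at yoursite.com/robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "\n"
        "User-agent: GPTBot\n"
        "Disallow: /\n"
    )
    # Only GPTBot has a rule that covers this path.
    print(blocked_agents(sample, "https://yoursite.com/blog/post"))
```

To run it against a live site, paste the content of your robots.txt into a string, or load it with RobotFileParser.set_url() followed by read().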

Should I allow all AI crawlers or only specific ones?

Allow all major AI crawlers unless you have a specific reason to block a particular one. Selective blocking (e.g., allowing Perplexity but blocking GPTBot) is possible but complex to maintain as new AI engines emerge. The default recommendation is to allow all and monitor for content misuse separately.

Authority GEO Signals · Published Mar 31, 2026

AI Bot Access via robots.txt

Allowing AI crawlers (GPTBot, ClaudeBot, PerplexityBot) to index and cite your content.

TL;DR: A page that blocks AI crawlers in robots.txt is effectively shut out of AI-generated answers, regardless of content quality. Several major sites blocked AI bots reactively in 2023–24 without realising the consequence: they became invisible to AI-generated answers.

Why AI Bot Access Matters

A page that blocks AI crawlers in robots.txt will not be fetched by the crawlers that honor it, so it cannot be cited in their AI-generated answers. No amount of schema markup, FAQ blocks, or authority references will help if the crawler cannot access the page in the first place. Bot access is the zeroth condition that all other GEO signals depend on.

In 2023–24, many publishers and websites added AI-specific blocks to their robots.txt reactively, often in response to concerns about training data usage. The consequence, which many did not anticipate, was exclusion from AI engine citation pools. Perplexity and ChatGPT's search features are documented as honoring robots.txt rules for their crawlers, so pages that disallow those crawlers drop out of their answers. Google AI Overviews draw on pages crawled by Googlebot, so they follow your Googlebot rules rather than the Google-Extended token.

The key AI crawler user agents to know:

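GPTBot: OpenAI's training crawler (OpenAI documents separate agents, OAI-SearchBot and ChatGPT-User, for ChatGPT search and user-initiated browsing)
ClaudeBot: Anthropic's crawler
anthropic-ai: an older Anthropic user agent token that still appears in many robots.txt files
PerplexityBot: Perplexity's crawler
Amazonbot: Amazon's crawler (Alexa/Rufus)
Google-Extended: not a separate crawler but a robots.txt token that controls whether content fetched by Google's crawlers may be used for Gemini training and grounding; Google states that it does not affect inclusion in Google Search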

How to Implement

Check your /robots.txt for any Disallow rules targeting these agents. To explicitly allow AI crawlers:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Google-Extended
Allow: /

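If you want to stay citable but opt out of training data usage, check each provider's documentation for its opt-out mechanism. Some providers publish separate user agents for training and for search, so you can Disallow the training agent while allowing the search agent; others use a control token such as Google-Extended.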
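As a sketch of that split, the hypothetical helper below generates robots.txt groups that allow one set of agents and disallow another. The function name build_robots_txt is an assumption made for illustration. The example tokens assume OpenAI's documented separation between GPTBot (training) and OAI-SearchBot (search), which you should verify against the provider's current documentation before deploying.

```python
def build_robots_txt(allow_agents, block_agents):
    """Return robots.txt text with one group per user agent.

    Hypothetical helper: agents in allow_agents get "Allow: /",
    agents in block_agents get "Disallow: /".
    """
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allow_agents]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in block_agents]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Assumed tokens: confirm each against the provider's documentation.
    print(
        build_robots_txt(
            allow_agents=["OAI-SearchBot", "PerplexityBot"],
            block_agents=["GPTBot"],
        )
    )
```

Generating the file from two explicit lists keeps the training and search decisions visible in one place, which makes the file easier to review when providers add or rename agents.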

Common Mistakes

Blanket Disallow: / applied to all bots: a wildcard group (User-agent: * with Disallow: /) blocks every AI crawler that does not have its own, more specific group
Blocking at the CDN/WAF level: Cloudflare and AWS WAF bot management may block AI crawlers independently of robots.txt; check your firewall rules
Only checking for Googlebot: verifying Googlebot access doesn't mean AI-specific crawlers are permitted; check each agent separately
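A rough way to spot the CDN/WAF case is to request the same page with a browser-like User-Agent and with crawler-like User-Agents, then compare status codes. The sketch below does that with the standard library. The User-Agent strings are simplified, illustrative assumptions rather than the exact strings the providers send, and the names fetch_status and compare_access are hypothetical. Many bot-management tools also verify source IP addresses, so a spoofed request may be treated differently from the real crawler; treat the output as a hint and confirm in your firewall logs.

```python
import urllib.error
import urllib.request

# Illustrative User-Agent strings (assumptions, not the providers' exact values).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36"
BOT_UAS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for url when sent with user_agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        # Connection failures (URLError) are left to propagate.
        return err.code


def compare_access(url, bot_uas=BOT_UAS, fetch=fetch_status):
    """Map each bot name to (status, suspicious).

    suspicious is True when the browser-like request succeeds (< 400)
    but the bot-like request is rejected (>= 400).
    """
    baseline = fetch(url, BROWSER_UA)
    report = {}
    for name, user_agent in bot_uas.items():
        status = fetch(url, user_agent)
        report[name] = (status, baseline < 400 and status >= 400)
    return report
```

Calling compare_access("https://yoursite.com/") returns a dictionary keyed by bot name. The fetch parameter exists so the comparison logic can be exercised without network access.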

Frequently Asked Questions

Related Signals

llms.txt

The companion file to robots.txt that tells AI engines what your site is about.

Content Freshness

After enabling AI bot access, freshness signals determine citation priority.

Schema Markup for AI Engines

Structured data that AI crawlers read once they have access to your pages.
