Your AI visibility starts with one decision the bot makes before any of the citation, embedding or co-citation work matters: can it actually read your page?
In 2026 that question splits into eight or nine sub-questions, one per active AI crawler. GPTBot wants to read your HTML for training. OAI-SearchBot wants to index you for ChatGPT’s real-time search. ClaudeBot reads deep technical content. Claude-SearchBot indexes you for Claude’s web search. PerplexityBot mostly waits until a user references your domain, then fetches in bursts. Each one has a different user agent, a different crawl cadence, a different path preference, and a different downstream effect on whether you show up in answers.
And almost none of them execute JavaScript.
That single fact, combined with an outdated robots.txt entry or a Cloudflare WAF rule someone enabled six months ago, is the biggest source of preventable AI invisibility on the web today. Fuel Online’s 2026 research found 34% of SaaS companies blocking at least one major AI crawler, almost always unintentionally.
This guide is the technical floor every brand needs to be standing on before optimising anything else. It covers what each crawler actually does, the robots.txt configuration that works in 2026, the rendering rules that decide whether your content is visible, the CDN and WAF gotchas that silently kill AI traffic, and the 7-day audit playbook that exposes problems most teams never see.
| Why this matters more than any other GEO work: Every cited stat in this article assumes the bot reached your page. If the bot got a 403, a 429, an empty JS shell, a redirect chain, or a Cloudflare challenge — none of the rest of your AI visibility work matters. Crawl access is the floor. Get this right, then everything else compounds. |
The 5 categories of AI bot crawling your site in 2026
Most AI bot guides treat every crawler the same way. They are not the same. Each one has a distinct job, a distinct cost-benefit profile, and a distinct policy decision attached to it. The right mental model is five categories, not one list.
| Category | Examples | What it does | Recommended default |
| 1. Training crawlers | GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent, CCBot | Bulk-collect web content for model training. 6–12 month delay before training data shows up in answers, and no guarantee any specific fact is retained. | Allow for evergreen content; some operators block to protect IP. Per-category decision. |
| 2. Search / retrieval crawlers | OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot | Build the real-time index that powers AI search answers. Blocking these = invisible in AI search. | ALWAYS allow. This is visibility infrastructure. |
| 3. User-triggered fetches | ChatGPT-User, Claude-User, Perplexity-User | Fired when a user pastes your URL into a chat or asks about your domain. Real-time, on-demand. Drives direct citation opportunities. | ALWAYS allow. Highest-value direct traffic signal. |
| 4. Specialist / agent fetches | Google-NotebookLM, Google-Read-Aloud, Google-CloudVertexBot, DuckAssistBot | Niche product use cases — note-taking, accessibility, enterprise AI tools. Lower volume but growing. | Allow by default; block individually if logs show abuse. |
| 5. Aggressive / low-value | Bytespider, some Diffbot patterns, unknown UAs | Aggressive crawling, often with poor cost-to-value ratio. | Block or hard rate-limit. |
The critical thing to internalise: Categories 1 and 2 are independently controllable for OpenAI and Anthropic. You can block GPTBot (training) while allowing OAI-SearchBot (search). You can block ClaudeBot (training) while allowing Claude-SearchBot (Claude’s web search index). Most teams in 2026 either block everything or allow everything because they don’t realise the split exists. Both are wrong defaults.
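A minimal sketch of that split expressed in robots.txt terms (the full copy-paste configuration appears later in this guide):

```
# Opt out of OpenAI model training...
User-agent: GPTBot
Disallow: /

# ...while staying indexed for ChatGPT search answers
User-agent: OAI-SearchBot
Allow: /
```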
What each major AI bot actually does — real log data
A 30-day server log study published in April 2026 (Digital Applied, 12 production sites across SaaS, ecommerce, agencies and publishers, 380–48,000 pages each) gave the first solid picture of how the major AI bots actually behave in production. The findings are sharper than any vendor blog post.
| Bot | Hits/site/day (median) | Revisit cadence | Crawl pattern | What it tells you |
| GPTBot | 4,200 | 2.4 days | Breadth-first; prefers /blog/, /docs/, /about/. Average crawl depth 3.8. | Most aggressive crawler. If you publish often, GPTBot sees it first. |
| ClaudeBot | 1,800 | 6.8 days | Depth-first; prefers /docs/ and /api/ paths. Average depth 5.2. | Patient, technical. Optimised for high-quality content, not freshness. |
| PerplexityBot | 980 | On-demand | Almost no scheduled background crawl. Bursts to 240 req/min when user query goes viral. | Edge cache and rate-limit defences matter here; they don’t for GPTBot. |
| Google-Extended | 540 | ~14 days | Low and steady. Mostly fetches URLs Googlebot has already indexed. | Slower cadence. Real Google AI integration reaches you through Googlebot, not Google-Extended. |
| OAI-SearchBot | ~600–900* | Continuous | Builds the index ChatGPT search reads at answer time. Distinct from GPTBot. | Allow this even if you block GPTBot — it’s how you appear in ChatGPT search answers. |
* Estimated from Cloudflare’s January 2026 global analysis.
All four major declared AI bots respected robots.txt 100% of the time across the study window. The bots that didn’t follow robots.txt were either undeclared (unknown UAs) or deprecated. The big takeaway: well-documented AI bots are well-behaved in 2026. The risk is not rogue crawlers — it’s your own infrastructure blocking compliant ones.
The crawl-to-refer ratio matters more than crawl volume
Volume of crawl is not the same as value of crawl. The crawl-to-refer ratios published in 2026 (pages crawled per visitor sent back) put the cost-benefit in sharp relief: GPTBot around 1,255:1, ClaudeBot around 20,583:1, and PerplexityBot the most favourable of the group, because every Perplexity citation is a clickable, attributed link that sends measurable referral traffic.
That doesn’t mean block GPTBot or ClaudeBot — they feed training corpora and search indexes that drive citation outcomes downstream. It means the policy decision is genuinely strategic, not binary. Allow the bots that drive citations, allow the ones that contribute to training your category, and only block the ones with both bad cost-to-value AND no upside path to visibility.
The robots.txt configuration that works in 2026
This is the configuration most operators should ship in 2026: allow search and user-triggered bots for visibility, make a deliberate choice on training bots, block aggressive low-value crawlers. Copy, paste, adjust to your priorities.
```
# robots.txt: balanced AI strategy for 2026
# Allow AI search visibility; deliberate stance on AI training

# ----- Traditional search engines -----
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# ----- AI SEARCH / RETRIEVAL: always allow (visibility infrastructure) -----
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# ----- USER-TRIGGERED fetches: always allow -----
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

# ----- TRAINING crawlers: allow for evergreen, decide per priority -----
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

# ----- AGGRESSIVE / LOW-VALUE: block -----
User-agent: Bytespider
Disallow: /

# ----- Sensible defaults -----
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /checkout/
Disallow: /cart/

Sitemap: https://yourdomain.com/sitemap.xml
```
The IP-protective variant (block training, keep search visibility)
If protecting content from model training is a higher priority than maximising training-corpus presence — common for media, legal, financial and B2B SaaS verticals — flip the training section to Disallow while keeping search bots open.
```
# Training crawlers: BLOCK to protect IP
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Search and user-triggered fetches remain allowed
# (so you stay visible in ChatGPT, Claude and Perplexity search answers)
```
| The 2026 user-agent gotcha most guides still get wrong: Claude-Web and anthropic-ai are deprecated. Sites still blocking only those strings are not actually blocking Anthropic’s current crawler. The active strings in 2026 are ClaudeBot (training), Claude-SearchBot (search index), and Claude-User (per-user fetches). If your robots.txt still references the deprecated agents, your policy is doing nothing. |
The CDN and WAF problem nobody is talking about
Your robots.txt can be perfect and your AI bot traffic can still be silently blocked. The reason: edge platforms — Cloudflare, Fastly, Akamai, AWS WAF, CloudFront — added ‘block AI bots’ toggles in 2024–2025, often defaulted to on, often enabled by a security engineer with no marketing context. The block happens at the edge before requests ever reach your origin, and robots.txt never gets evaluated.
This is the single largest source of accidental AI invisibility in 2026, by a wide margin.
The 5-step CDN/WAF audit
- Check the bot management settings in your CDN dashboard. Cloudflare: Security → Bots, then look for ‘Block AI Scrapers and Crawlers’ or ‘Manage AI Crawlers.’ Fastly: edge ACL rules. Akamai: Bot Manager policies. AWS WAF: managed rule groups, specifically AWSManagedRulesBotControlRuleSet.
- Look for managed security rules that include ‘AI bot’ or ‘scraper’ lists. Many WAF providers ship default rule groups that automatically block declared AI user agents — without notifying the operator and without showing up in robots.txt at all.
- Verify your CDN isn’t overriding your robots.txt. Cloudflare specifically has a ‘Manage your robots.txt’ feature that can silently take precedence over your origin file. Disable it and make sure the origin file remains canonical.
- Allowlist the specific user agents you want through. At a minimum: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User. If you also want training bots, add GPTBot, ClaudeBot, Google-Extended, CCBot.
- Pull edge logs (not just origin logs) for the last 7 days. Filter for status 403, 429 and any challenge codes (typically 403 with a CF-Ray header) against AI user agents. Every block is a citation you lost.
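A quick way to work through step 5 from the command line, sketched below. It assumes you have already exported your edge logs (Cloudflare Logpush, Fastly log streaming, CloudFront standard logs) to a local plain-text file with one request per line and a space-delimited status code; adjust the grep patterns to your export format.

```bash
#!/usr/bin/env bash
# Sketch: count blocked requests (403/429) per AI user agent in an edge log export.
# Assumes edge-log.txt holds one request per line with a space-delimited status code.

LOG=edge-log.txt
BOTS="GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot Claude-User PerplexityBot Perplexity-User"

for bot in $BOTS; do
  total=$(grep -c "$bot" "$LOG")
  blocked=$(grep "$bot" "$LOG" | grep -Ec ' (403|429) ')
  echo "$bot: $blocked blocked of $total requests"
done
```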
Edge platform-specific configuration patterns
Cloudflare
- Security → Bots → Configure Super Bot Fight Mode: set to ‘Allow’ for verified bots, not ‘Block’.
- Security → WAF → Managed Rules: disable any ‘AI Bots’ or ‘AI Scrapers’ rule group.
- Rules → Configuration Rules: add an explicit allow rule for the AI search user agents listed above.
- Caching → ‘Manage your robots.txt’: turn this off so your origin file controls everything.
Fastly
VCL or Edge Compute: add an explicit allow for the search and user-triggered AI user agents before any generic bot-block logic runs. Order matters — most accidental blocks come from a generic deny rule executing before the specific allow.
AWS CloudFront + WAF
Inspect managed rule groups for AWSManagedRulesBotControlRuleSet and AWSManagedRulesCommonRuleSet. The Bot Control set will block declared AI bots by default unless explicitly overridden. Add a higher-priority custom rule that matches the AI search and user-triggered user agents and sets the action to Allow.
Vercel / Netlify edge
Edge functions and middleware can introduce per-request rate limiting that doesn’t distinguish between bots and humans. Inspect any middleware that runs on the page-fetch path. Common pattern: a rate limiter capped at 100 req/min/IP trips on PerplexityBot’s burst behaviour (it can hit 240 req/min on a viral query) and returns 429s for the whole burst, killing the real-time citation.
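One way to test this before it bites: replay a PerplexityBot-style burst against a staging page and tally what the edge returns. The sketch below assumes curl and xargs are available and that the staging URL is a placeholder you supply; don’t aim it at production during peak traffic.

```bash
# Sketch: fire a PerplexityBot-style burst at a staging URL and tally status codes.
# A pile of 429s means the rate limiter would block the bot mid-citation.
URL="https://staging.yourdomain.com/your-key-page"   # placeholder: use staging, not production

seq 1 150 | xargs -P 20 -I{} \
  curl -s -o /dev/null -w "%{http_code}\n" -A "PerplexityBot" "$URL" \
  | sort | uniq -c
```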
The JavaScript rendering problem (and how to fix it)
Every major AI training and search crawler currently reads HTML source only. None of them execute JavaScript by default. This is the second-largest preventable cause of AI invisibility — and unlike robots.txt or CDN problems, it requires a rendering architecture change, not a configuration change.
The hard number: 46% of ChatGPT bot visits begin in reading mode — plain HTML, no JavaScript execution. If your key content is JavaScript-rendered, almost half of all GPT-driven crawl traffic sees an empty page.
How to test what AI bots actually see
Three quick tests, in increasing rigour:
- Curl test: `curl -A 'GPTBot' https://yourdomain.com/your-key-page > test.html`, then open test.html in a browser. If you see your headline, body copy, comparison tables and FAQ content — you’re fine. If you see an empty `<div id="root"></div>` — your content is JS-rendered and invisible to most AI bots.
- View-source test: In Chrome, right-click → View Page Source (not Inspect — Inspect shows the rendered DOM, View Source shows what the bot gets). Search for one distinctive phrase from your visible content. If it’s missing from view-source, it’s invisible to AI bots.
- Wayback / fetch-as test: Use a service that fetches your page without JS execution — Google Search Console’s URL Inspection (Live Test, then ‘View Crawled Page’) is the most authoritative for Google-Extended behaviour. Compare what it sees against what the user sees.
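To run the curl test across several AI user agents in one pass, a small sketch; the URL and the distinctive phrase are placeholders you supply.

```bash
#!/usr/bin/env bash
# Sketch: fetch one page as each major AI user agent and check whether a
# distinctive phrase from the visible content appears in the raw HTML.

URL="https://yourdomain.com/your-key-page"                      # placeholder
PHRASE="a distinctive sentence from your visible body copy"     # placeholder

for UA in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot PerplexityBot; do
  if curl -s -A "$UA" "$URL" | grep -qi "$PHRASE"; then
    echo "$UA: content visible"
  else
    echo "$UA: content MISSING (blocked, challenged, or JS-rendered)"
  fi
done
```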
The four fix patterns, ranked by effort
Pattern 1: Static generation (lowest effort, highest reliability)
Frameworks: Astro, Next.js (SSG), Eleventy, Hugo, Jekyll. Pages are pre-rendered at build time into static HTML. Every crawler — AI or human — sees the full content immediately. Best for content-heavy sites where pages don’t change per user. If your site is documentation, marketing pages, or a blog, this is the right answer.
Pattern 2: Server-side rendering / SSR
Frameworks: Next.js (SSR), Remix, Nuxt, SvelteKit. Pages are rendered on the server per request and delivered as full HTML. Heavier infrastructure than SSG but works for dynamic content. Standard recommendation for marketing + product sites with personalised content.
Pattern 3: Pre-rendering for bots only
Services: Prerender.io, Rendertron, Vercel’s prerender middleware. The CDN detects bot user agents and serves a pre-rendered HTML snapshot, while humans get the JS app. Works, but adds infrastructure complexity and a risk of serving stale content to bots if the prerender refresh cycle is poorly configured. Acceptable middle ground for sites where SPA architecture isn’t changing.
Pattern 4: Progressive enhancement
The page ships with full content in the initial HTML, then JS enhances interactivity on top. The hardest pattern to retrofit onto an existing React or Vue app, but the most robust long-term — humans, search bots, AI bots and accessibility tools all see the content immediately. This is where most performance-conscious teams are heading in 2026.
| The hidden cost of SPA architectures in 2026: If your marketing site is built on a client-rendered SPA framework, your AI bot crawl problem isn’t a one-day fix — it’s a sprint or two of engineering work. Most teams discover this when their AI citation share drops and they trace it back to a 2024 framework migration. Audit your rendering architecture now, before the next visibility loss is downstream of a deploy you can’t easily reverse. |
The structural cues AI bots use to parse your page
Getting the page rendered is necessary but not sufficient. Once an AI bot has the HTML, it has to decide what’s important — and the structural cues your page provides directly shape that decision. Pages with FAQ schema receive approximately 40% more citation weight than unstructured pages in independent testing. Schema, headings and answer formatting are not cosmetic — they’re parser hints.
The 6 structural priorities, in order
- Clean HTML semantics. Real `<h1>`–`<h6>` heading hierarchy. Real `<p>`, `<ul>`, `<ol>` and `<table>` tags. Don’t fake headings with styled `<div>`s — bots cannot reliably infer hierarchy from CSS.
- JSON-LD structured data. Organization, WebPage, BreadcrumbList, FAQPage, Article, Product, Review, Person, Service schemas as relevant. Bots parse JSON-LD directly and use it for entity disambiguation, category classification and direct fact extraction.
- FAQ blocks with explicit questions and answers. Bots love this structure because it maps directly to user-query format. Mark up with FAQPage schema. Place near the top of the page where extraction is highest-confidence.
- Comparison tables. Real `<table>` elements with `<thead>` and `<tbody>`. Listicle-style ‘best X for Y’ content gets disproportionate citation weight when the comparisons are tabular and parseable.
- Visible ‘Last updated’ dates. Bots use this as a freshness signal. ConvertMate’s 2026 analysis found 76.4% of ChatGPT citations come from content updated within the last 30 days. Make the date prominent, machine-readable, and accurate.
- Direct-answer paragraphs. The first 2–3 sentences after each H2 or H3 should answer the implicit question of that heading in full sentences. Bots extract these as ‘answer chunks’ for retrieval-augmented generation. Don’t bury the answer under preamble.
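A rough command-line pass over these checks, sketched below. It only greps the raw HTML, so treat it as a smoke test rather than a validator, and the URL is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: structural smoke test on the raw HTML of one URL.
# Greps are approximate; run anything that fails here through a real schema validator.

URL="https://yourdomain.com/your-key-page"   # placeholder
HTML=$(curl -s -A "GPTBot" "$URL")

count() { echo "$HTML" | grep -oi "$1" | wc -l; }

echo "JSON-LD blocks:    $(count 'application/ld+json')"
echo "FAQPage schema:    $(count 'FAQPage')"
echo "<h1> headings:     $(count '<h1')"
echo "<table> elements:  $(count '<table')"
echo "dateModified refs: $(count 'dateModified')"
```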
The structural work above pairs with the off-site signals our link building tools coverage outlines — both halves are needed: the bot has to reach your page AND find structured signals worth citing once it does.
llms.txt: what it is, what it isn’t, what to do about it
llms.txt is a proposed standard for telling AI systems where the LLM-friendly content on your site lives — typically a curated index of clean Markdown URLs. It is gaining adoption among technical-documentation sites. It is not, as of 2026, treated by any major AI platform as an access-control mechanism or a ranking signal.
What llms.txt is good for: a clean, machine-parseable index that human readers and a small but growing set of AI tools can use to find your most important content. Worth publishing if your site has substantial documentation.
What llms.txt is not good for: controlling AI bot access (use robots.txt and WAF rules), influencing citation rankings (no evidence this works), or replacing structured schema markup.
Ship one if your site warrants it. Don’t treat it as a substitute for the robots.txt and rendering work above.
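For reference, a minimal llms.txt sketch following the proposed llmstxt.org layout (H1 name, blockquote summary, H2 sections of link lists); every path and description below is a placeholder.

```
# Your Product Name
> One-sentence description of what the product does and who it is for.

## Documentation
- [Quickstart](https://yourdomain.com/docs/quickstart.md): install and first request
- [API reference](https://yourdomain.com/docs/api.md): endpoints, auth, rate limits

## Optional
- [Changelog](https://yourdomain.com/changelog.md): release notes
```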
The 7-day AI bot crawl audit
If you haven’t audited your AI bot crawl access in the last 6 months, you have problems you don’t know about. This is the structured 7-day audit that surfaces them.
Day 1: Log baseline
- Pull 30 days of server access logs. If you only have a CDN, pull edge logs too — they’re separate from origin logs and contain the requests blocked at the edge.
- Filter for these user-agent substrings: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, CCBot, Meta-ExternalAgent, Amazonbot, Bytespider.
- Count hits per bot. Note any bot you’d expect to see (GPTBot, ClaudeBot, PerplexityBot) that’s missing entirely — that’s a 100% block, almost always from CDN or WAF.
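A quick way to get those per-bot counts, assuming a standard combined-format access log at access.log (adjust the path and window to your setup):

```bash
# Sketch: hits per AI bot across the log window (combined-format access log assumed).
for bot in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot Claude-User \
           PerplexityBot Perplexity-User Google-Extended Applebot-Extended CCBot \
           Meta-ExternalAgent Amazonbot Bytespider; do
  printf '%-20s %s\n' "$bot" "$(grep -c "$bot" access.log)"
done
```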
Day 2: robots.txt audit
- Fetch https://yourdomain.com/robots.txt and read it line by line.
- Identify any explicit Disallow for the search/retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) or user-triggered fetches (ChatGPT-User, Claude-User, Perplexity-User) — these are visibility kills.
- Identify any deprecated agents (Claude-Web, anthropic-ai) that someone added thinking it would block Anthropic — these do nothing now.
- Make sure your training-crawler policy matches your strategic priorities. There’s no wrong answer, but the answer should be intentional.
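A quick grep for the two most common robots.txt problems, sketched below; it is not a full robots.txt parser, so review any matches by hand.

```bash
#!/usr/bin/env bash
# Sketch: flag visibility-killing Disallows and deprecated Anthropic agents.
# Only matches user-agent strings; it does not parse group scoping.

ROBOTS=$(curl -s https://yourdomain.com/robots.txt)

echo "--- Search / user-triggered bots mentioned (check they are not Disallowed) ---"
echo "$ROBOTS" | grep -iE 'OAI-SearchBot|Claude-SearchBot|PerplexityBot|ChatGPT-User|Claude-User|Perplexity-User'

echo "--- Deprecated Anthropic agents (these rules do nothing in 2026) ---"
echo "$ROBOTS" | grep -iE 'Claude-Web|anthropic-ai' || echo "none found"
```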
Day 3: CDN / WAF audit
- Run the 5-step CDN/WAF audit from earlier in this guide. This is where the most expensive accidental blocks live.
- Specifically check for: ‘Block AI scrapers’ toggles, managed rule groups for bot control, rate limits that don’t whitelist AI bots, and any CDN-level robots.txt override.
- If you find a block, document the rule, who created it, and the date — most were created 18+ months ago by team members no longer at the company.
Day 4: Rendering audit
- Run the three render tests (curl, view-source, fetch-as) on your top 20 URLs by traffic and citation potential.
- Any URL where the bot view is empty or substantially missing content from the user view goes on the rendering-fix list.
- Prioritise: any URL on your top-traffic, top-citation-potential or top-revenue list takes precedence over long-tail pages.
Day 5: Structural cues audit
- On the same top 20 URLs, check JSON-LD presence with Google’s Rich Results Test or Schema.org validator. Add FAQPage, Article, Product, Organization schemas as relevant.
- Audit heading hierarchy. Make sure each page has exactly one H1, structured H2/H3 nesting, and no fake-heading divs.
- Verify ‘Last updated’ dates are visible, accurate and machine-readable on every cited or citation-candidate page.
Day 6: Block remediation
- Fix the highest-impact findings first: any 100% block at CDN/WAF on a search or user-triggered AI bot.
- Deploy the updated robots.txt. Note: changes typically take ~24 hours for OpenAI’s systems to process and adjust search behaviour.
- Open tickets for the rendering fixes — these are engineering work, not configuration.
Day 7: Verification and monitoring
- Test each bot path manually: `curl -A 'GPTBot' https://yourdomain.com/`, repeat for each user agent. Look for a 200 status and full HTML in the response.
- Set up ongoing monitoring: a weekly automated check that the major AI bots are still hitting your site (a minimal cron sketch follows this list). A sudden drop to zero hits is your earliest warning of a new block.
- Document the audit results and re-audit cadence (quarterly minimum, monthly for sites with active infrastructure changes).
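A minimal, cron-friendly version of that weekly check. It assumes your recent requests are greppable from a single log path and uses mail(1) as a placeholder alert channel; swap in whatever alerting you actually run.

```bash
#!/usr/bin/env bash
# Sketch: weekly check that the expected AI bots are still reaching the site.
# Assumes the recent request log is greppable at $LOG; mail(1) is a placeholder
# alert channel (swap for a Slack webhook, PagerDuty, etc.).

LOG=/var/log/nginx/access.log   # adjust to your log path or aggregation export
ALERTS=""

for bot in GPTBot OAI-SearchBot ClaudeBot Claude-SearchBot PerplexityBot; do
  hits=$(grep -c "$bot" "$LOG")
  [ "$hits" -eq 0 ] && ALERTS="$ALERTS $bot"
done

if [ -n "$ALERTS" ]; then
  echo "WARNING: zero recent hits for:$ALERTS" \
    | mail -s "Possible AI crawler block" you@yourdomain.com
fi
```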
Most of the audit is configuration work that pays back in weeks. The rendering work is engineering work that pays back in months — but it’s also the work that, once done, doesn’t need redoing. Get it right once.
The 2026 crawl numbers every team should know
| Number | Source and operational meaning |
| 34% | Share of SaaS companies blocking at least one major AI crawler — almost always unintentionally (Fuel Online, 2026). The single largest preventable cause of AI invisibility. |
| 46% | Share of ChatGPT bot visits that begin in reading mode (plain HTML, no JS execution). If your content is JS-rendered, almost half of GPT crawl traffic sees nothing. |
| 50B+ | Daily AI crawler requests across Cloudflare’s network as of March 2025 (~1% of all web requests). Volume has grown materially since. |
| 4,200 | Median GPTBot hits per site per day across the 12-site April 2026 log study. The most aggressive AI crawler by 2–4× margin. |
| 100% | Robots.txt compliance rate observed for GPTBot, ClaudeBot, PerplexityBot and Google-Extended in the same 30-day study. Major declared bots are well-behaved. |
| 6.8 days | ClaudeBot’s median revisit cadence — significantly slower than GPTBot’s 2.4 days. Optimised for quality and depth, not freshness. |
| 240 req/min | Peak PerplexityBot burst observed when a single user query went viral. Rate limits without AI-bot allowlists kill citation opportunities in real time. |
| 76.4% | Share of ChatGPT citations coming from content updated within the last 30 days (ConvertMate, 2026). ‘Last updated’ visibility is a parser-readable signal. |
| 40% | Citation-weight uplift on pages with FAQ schema vs unstructured equivalents (independent 2026 testing). Schema is the cheapest crawl-optimisation work you can do. |
| 24 hours | Typical propagation time for robots.txt changes through OpenAI’s systems. Don’t expect instant policy enforcement; do expect day-2 results. |
For the wider data picture connecting crawl outcomes to AI citation share and competitive visibility, see our link building statistics reference.
5 mistakes that quietly kill AI crawl access
Mistake 1: Blocking GPTBot because ‘I don’t want my content used for training’
This is a legitimate decision — but the way most teams implement it is wrong. Blocking GPTBot does NOT block OAI-SearchBot or ChatGPT-User. If you block all three, you’re invisible in ChatGPT search; if you block only GPTBot, you’re still in ChatGPT search but not in OpenAI’s training set. Make the right decision per category, not the catastrophic decision globally.
Mistake 2: Trusting robots.txt without auditing the CDN
Robots.txt is meaningless if your CDN is returning 403s before the file is read. The edge platform is now the primary control plane for bot access in 2026 — robots.txt is documentation, not enforcement. Audit both.
Mistake 3: Treating SPA architecture as a marketing problem
A client-rendered React or Vue site is a structural AI invisibility problem, not a content problem. Adding more content, more schema, more outreach doesn’t fix it — the bots aren’t seeing any of it. Get to SSR, SSG or pre-rendering. Until then, every other piece of GEO work is mostly wasted.
Mistake 4: Using deprecated Anthropic user agents
Claude-Web and anthropic-ai are dead. If your robots.txt or WAF still references them, your policy is doing nothing. Current strings: ClaudeBot (training), Claude-SearchBot (search), Claude-User (user fetches). Update everywhere.
Mistake 5: Rate-limiting PerplexityBot like a normal crawler
PerplexityBot has a burst profile — quiet baseline, then up to 240 req/min when a user query goes viral. A generic 100 req/min rate limit returns 429s during exactly the moment your site is about to be cited in a high-traffic answer. Allowlist PerplexityBot’s published IP ranges (or its user agent) above any generic rate limiter.
The bottom line
AI bot crawl optimisation is not glamorous work. It’s robots.txt files, CDN dashboards, rendering architecture and server log analysis — none of which produces a slide for the leadership deck. But every other AI visibility lever in 2026 — citations, co-citation density, brand mentions, share of voice — sits on top of this foundation. If the bot can’t read the page, none of the other work compounds.
The brands winning AI visibility in 2026 are not the ones with the most content, the most backlinks or the highest DR. They’re the ones who fixed the crawl floor first, so every piece of content, every backlink and every co-citation placement they’ve earned since actually reaches the model. That’s the operational discipline behind every category-leader name you see in a ChatGPT answer.
Run the 7-day audit. Fix what it surfaces. Then revisit it quarterly, because the user-agent list, the CDN defaults and the rendering best practices will all keep shifting through 2027. The teams that build this into operating cadence — not project — are the ones that stop quietly losing citation share every time someone on infra flips a security toggle.
And the full set of tactics that turn that crawl access into actual citations sits in our link building strategies reference — every play in that guide assumes the work above is already done.
