Bot Management & AI Crawlers
The AI Crawler Landscape
The number of AI training crawlers has exploded. Each major AI company operates one or more crawlers that traverse the web to collect training data. Understanding who they are and how to verify them is the first step toward effective management.
| Crawler | Operator | Purpose | IP Verification |
|---|---|---|---|
| GPTBot | OpenAI | GPT model training | Published ranges |
| ClaudeBot | Anthropic | Claude model training | Published ranges |
| Bytespider | ByteDance | TikTok / Doubao AI training | Published ranges |
| CCBot | Common Crawl | Open web archive | Published ranges |
| Google-Extended | Google | Gemini AI training | Google IP ranges |
| Meta-ExternalAgent | Meta | Meta AI training | Published ranges |
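As a first-pass sketch, the crawlers in the table above can be recognised in application code by their user-agent tokens. The token-to-operator map below simply mirrors the table; it is illustrative, not an exhaustive registry, and exact user-agent strings vary by crawler version.

```python
# Token-to-operator map mirroring the table above. Assumed tokens only:
# real user-agent strings include version numbers and contact URLs.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Bytespider": "ByteDance",
    "CCBot": "Common Crawl",
    "Google-Extended": "Google",
    "Meta-ExternalAgent": "Meta",
}

def identify_ai_crawler(user_agent: str):
    """Return the operator name if the UA string contains a known token, else None."""
    ua = user_agent.lower()
    for token, operator in AI_CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return operator
    return None
```

Matching on the user-agent string alone identifies only the *claimed* identity; as the verification figures later in this section show, the claim still needs to be checked against the operator's published IP ranges.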
Blocking Googlebot is not the same as blocking Google-Extended. Googlebot indexes your site for Google Search — blocking it will remove you from search results. Google-Extended is a separate crawler used only for Gemini AI training. You can block one without affecting the other.
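A minimal robots.txt expressing that split might look like the following sketch: Googlebot is allowed by default (no rule needed), while the Google-Extended token opts the site out of Gemini AI training.

```
# No rule for Googlebot: it remains allowed, so Search indexing is unaffected.

# Opt out of Gemini AI training only.
User-agent: Google-Extended
Disallow: /
```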
Typical Bot Traffic Breakdown
Understanding the normal composition of bot traffic helps you identify anomalies and prioritise management efforts. Here is a typical breakdown for a website with moderate traffic.
| Bot Category | Typical Share |
|---|---|
| Search engines (Googlebot, Bingbot, etc.) | 30–40% |
| AI training crawlers | 15–25% |
| SEO tools (Ahrefs, SEMrush, Moz, etc.) | 10–15% |
| Social media (Facebook, Twitter, LinkedIn) | 5–10% |
| Monitoring services (UptimeRobot, Pingdom, etc.) | 3–5% |
| Feed readers (Feedly, NewsBlur, etc.) | 2–5% |
| Unknown / unverified scrapers | 10–20% |
Over 30% of traffic claiming a bot identity fails IP verification. This means nearly a third of requests that say they are Googlebot, Bingbot, or other known crawlers are actually coming from IPs that do not belong to those operators. Without IP verification, your bot analytics are significantly misleading.
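The standard verification technique is reverse-then-forward DNS: resolve the requesting IP to a hostname, check that the hostname belongs to the operator's domain, then resolve that hostname back and confirm it includes the original IP. A sketch for Googlebot follows; the domain suffixes follow Google's published guidance, and you would swap in each operator's documented domains (or published IP ranges) for other crawlers.

```python
import socket

# Suffixes Google documents for Googlebot reverse-DNS hostnames.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_matches(host: str, suffixes=GOOGLEBOT_SUFFIXES) -> bool:
    """Pure check: does the reverse-DNS hostname end in an expected domain?"""
    return host.endswith(suffixes)

def verify_crawler_ip(ip: str, suffixes=GOOGLEBOT_SUFFIXES) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm.
    Returns False on any DNS failure (treat unverifiable as unverified)."""
    try:
        host = socket.gethostbyaddr(ip)[0]          # reverse DNS
    except OSError:
        return False
    if not hostname_matches(host, suffixes):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
    except OSError:
        return False
```

Because spoofing a user-agent header is trivial but spoofing both DNS directions is not, this check is what separates the genuine crawlers from the roughly one-third of impersonators described above.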
Measuring Crawl Impact
Bots consume real infrastructure resources. Before making access decisions, measure the actual impact each crawler has on your servers.
Key metrics to track per bot
- Requests per day — the total volume of requests from each crawler
- Bandwidth consumed — total bytes transferred, which directly affects your CDN and hosting costs
- Response time impact — whether bot traffic is slowing down response times for real users
- Error rates — a bot generating high error rates may be hitting stale URLs or causing origin load
- Peak request rate — the maximum burst rate (requests per minute) during peak crawling periods
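The request, bandwidth, and error metrics above can be aggregated directly from your access logs. The sketch below assumes the standard combined log format (`... "REQUEST" STATUS BYTES "REFERRER" "USER-AGENT"`) and a hypothetical token list; it also applies the 20% share threshold from the guidance that follows.

```python
from collections import defaultdict

def bot_metrics(log_lines, bot_tokens=("Googlebot", "GPTBot", "ClaudeBot")):
    """Aggregate per-bot requests, bytes, and error counts from
    combined-log-format lines. bot_tokens is an illustrative list."""
    stats = defaultdict(lambda: {"requests": 0, "bytes": 0, "errors": 0})
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue
        ua = parts[5]
        status_bytes = parts[2].split()  # e.g. ['200', '1234']
        for token in bot_tokens:
            if token in ua:
                s = stats[token]
                s["requests"] += 1
                if len(status_bytes) >= 2 and status_bytes[1].isdigit():
                    s["bytes"] += int(status_bytes[1])
                if status_bytes and status_bytes[0].startswith(("4", "5")):
                    s["errors"] += 1
    return dict(stats)

def flag_dominant_bots(stats, share_threshold=0.20):
    """Flag any bot exceeding the given share of total bot requests."""
    total = sum(s["requests"] for s in stats.values()) or 1
    return [bot for bot, s in stats.items()
            if s["requests"] / total > share_threshold]
```

For bandwidth and request-rate figures at scale you would run this over a log-analysis pipeline rather than in-process Python, but the aggregation logic is the same.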
If a single bot accounts for more than 20% of your total bot traffic, investigate immediately. This level of crawling is excessive for almost any site and may indicate aggressive scraping, misconfigured crawl rates, or bot impersonation.
Robots.txt Strategies by Site Type
There is no one-size-fits-all robots.txt policy. The right strategy depends on your business model, content type, and values.
| Site Type | Recommended Approach | Rationale |
|---|---|---|
| News / Media | Block AI training crawlers | Original reporting has direct commercial value; allowing AI training risks displacing the source content in AI-generated answers |
| E-commerce | Selective approach | Product pages may benefit from AI visibility, but catalogue scraping is a concern; allow selectively and monitor closely |
| SaaS Documentation | Allow most AI crawlers | Having your docs referenced by AI assistants drives awareness and helps users find your product |
| Blog / Publisher | Values-based decision | Weigh the trade-off between AI visibility (potential traffic from AI citations) and content protection (training on your work without compensation) |
Your robots.txt decisions should be data-driven. Use LogLens to measure how much each AI crawler actually consumes before deciding whether to block it. A crawler making 10 requests per day is very different from one making 10,000.
Compliance Monitoring
Robots.txt is a voluntary standard. Declaring a Disallow rule does not technically prevent a crawler from accessing your pages — it only asks them not to. Monitoring compliance is essential.
Verify that blocked crawlers actually stop
After updating your robots.txt, check your server logs over the following days and weeks to confirm that the blocked crawlers have stopped visiting the disallowed paths. LogLens makes this easy by showing per-bot request history with path breakdowns.
Watch for non-compliant crawlers
Some crawlers — particularly less reputable ones — ignore robots.txt entirely. If you see continued crawling from a bot you have blocked in robots.txt, you will need to escalate to IP-level blocking at your CDN or firewall.
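A compliance check of this kind can be sketched as a log scan: for each bot you have disallowed, list any requests it still makes to the disallowed paths. The combined log format, bot tokens, and path prefixes below are illustrative placeholders for your own rules.

```python
def robots_violations(log_lines,
                      blocked_bots=("GPTBot",),
                      disallowed=("/private/",)):
    """Return (bot_token, path) pairs for requests that ignore robots.txt.
    Assumes combined log format; blocked_bots and disallowed mirror your
    own robots.txt rules (use disallowed=("/",) for a full block)."""
    violations = []
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue
        request, ua = parts[1], parts[5]
        fields = request.split()          # ['GET', '/path', 'HTTP/1.1']
        path = fields[1] if len(fields) >= 2 else ""
        for bot in blocked_bots:
            if bot in ua and path.startswith(tuple(disallowed)):
                violations.append((bot, path))
    return violations
```

Any non-empty result after the crawler has had time to re-fetch your robots.txt is the signal to escalate to IP-level blocking at the CDN or firewall.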
Review quarterly
The AI crawler landscape changes rapidly. New crawlers appear regularly, and existing ones may change their behaviour. Review your robots.txt policy and actual crawl data at least once per quarter to ensure your rules still reflect your intentions.
Robots.txt rules are public. Anyone can read your robots.txt to see which paths you consider sensitive or valuable. Avoid using Disallow rules as a security mechanism — they are a content access policy, not an access control system.