Security & Threat Detection with Server Logs
The Fake Bot Problem
User-Agent strings are one of the most commonly relied-upon signals for identifying crawlers, but they are trivially spoofed. Any HTTP client can set its User-Agent to anything it wants, and many attackers take advantage of this.
Attackers impersonate Googlebot and other legitimate crawlers for several reasons:
- Faster responses — many sites serve pre-rendered or cached content to known bots, giving faster access to page content
- Unblocked access — bot-specific allow lists often bypass geo-blocks, paywalls, or login walls
- Bypassed rate limits — sites frequently exempt recognised crawlers from rate limiting, allowing aggressive scraping
- Evasion of security rules — WAF rules and firewall policies often allow-list known bot User-Agents
Never trust User-Agent strings alone. Always verify crawler identity by checking the request's source IP against the bot operator's officially published IP ranges. LogLens does this verification automatically for all major crawlers.
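Google's documented verification method is a reverse DNS lookup on the source IP followed by a forward-confirming lookup. A minimal Python sketch of that check — function and parameter names are our own, and the resolvers are injectable so the logic can be exercised without live DNS:

```python
import socket

# Hostnames that a genuine Googlebot PTR record resolves under.
GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def verify_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname):
    """Return True only if `ip` passes reverse + forward DNS confirmation.

    `reverse` and `forward` default to real DNS lookups; they are
    parameters so the check can be tested with fake resolvers.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record: cannot be a verified crawler
    if not hostname.endswith(GOOGLEBOT_DOMAINS):
        return False  # claims a bot UA but reverse-resolves elsewhere
    try:
        # Forward-confirm: an attacker controls their own PTR record,
        # but cannot make googlebot.com resolve back to their IP.
        return forward(hostname) == ip
    except OSError:
        return False
```

The forward-confirming step is essential: anyone who controls their reverse DNS zone can publish a PTR record claiming `googlebot.com`, but only Google controls what that hostname resolves to.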
Identifying Content Scrapers
Content scrapers have distinctive patterns in server logs that set them apart from legitimate users and crawlers. Once you know what to look for, they become straightforward to spot.
Rapid sequential requests
Scrapers typically make dozens or hundreds of requests per minute from a single IP or a small cluster of IPs. Legitimate users rarely exceed a few pages per minute, and even high-volume crawlers like Googlebot pace their requests to avoid overloading servers.
No asset loading
Scrapers request HTML pages only. They do not load CSS, JavaScript, images, fonts, or any other assets that a real browser would need to render the page. If you see an IP fetching page after page with zero corresponding asset requests, it is almost certainly a scraper.
Systematic URL traversal
Look for request patterns that follow a predictable structure: alphabetical ordering, paginated sequences (/page/1, /page/2, /page/3...), or the exact order URLs appear in your sitemap. Human browsing is inherently irregular; machine traversal is not.
Missing referrer and cookie data
Scrapers rarely send referrer headers or manage cookies. A stream of requests with empty Referer headers and no cookies, combined with the patterns above, is a strong signal.
Cloud provider IP addresses
Most scrapers run on cloud infrastructure. Requests originating from AWS, Google Cloud Platform, or Microsoft Azure IP ranges — especially when combined with the patterns above — are highly likely to be automated scraping.
LogLens automatically flags IPs that exhibit scraping patterns. Use the IP Analysis page to filter by cloud provider ASN and cross-reference with request rates for fast identification.
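The signals above can be combined into a rough scoring pass over parsed log records. A sketch, assuming the log has already been parsed into (ip, path, referer) tuples and the window length is known — all names and thresholds here are illustrative, not LogLens internals:

```python
from collections import defaultdict

# Extensions a real browser would fetch while rendering a page.
ASSET_EXTS = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".woff2", ".ico")

def scraper_suspects(records, window_minutes, min_rate=30):
    """Flag IPs that fetch pages at high rate, load zero assets,
    and never send a Referer header."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    no_referer = defaultdict(int)
    for ip, path, referer in records:
        if path.lower().endswith(ASSET_EXTS):
            assets[ip] += 1
        else:
            pages[ip] += 1
            if not referer or referer == "-":
                no_referer[ip] += 1
    suspects = []
    for ip, n in pages.items():
        rate = n / window_minutes
        if rate >= min_rate and assets[ip] == 0 and no_referer[ip] == n:
            suspects.append((ip, round(rate, 1)))
    return sorted(suspects, key=lambda t: -t[1])
```

Requiring all three signals together keeps false positives low: a fast reader still loads assets, and an aggressive but honest crawler usually sends identifying headers.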
Vulnerability Scanning Patterns
Automated vulnerability scanners probe for common weaknesses by requesting paths associated with known exploits, exposed configuration files, and popular admin interfaces. Unless your site actually exposes one of these endpoints, such paths should never appear in legitimate traffic.
Watch for request surges to sensitive paths
- /wp-login.php
- /wp-admin/
- /.env
- /config.yml
- /.git/config
- /api/v1/users
- /graphql
- /swagger.json
- /api-docs
A surge of 404 responses to sensitive paths is one of the clearest indicators of automated vulnerability scanning. These requests typically arrive in bursts — tens or hundreds within a few minutes — and often originate from a single IP or a small set of rotating IPs.
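Detecting that burst pattern takes only a counter over parsed events. A sketch, assuming (ip, path, status) tuples from a single time window; the path set and threshold are illustrative:

```python
from collections import Counter

# Paths that indicate probing when they 404 (subset of the list above).
SENSITIVE = {"/wp-login.php", "/.env", "/.git/config", "/config.yml",
             "/swagger.json", "/api-docs"}

def scan_bursts(events, threshold=20):
    """Return {ip: count} for IPs whose sensitive-path 404s within the
    window exceed `threshold` — the vulnerability-scan signature."""
    hits = Counter(ip for ip, path, status in events
                   if status == 404 and path in SENSITIVE)
    return {ip: n for ip, n in hits.items() if n >= threshold}
```

Feeding this function events bucketed into short windows (e.g. five minutes) surfaces the burst behavior directly, since a scanner fires its whole path list in one pass.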
Suspicious IP Pattern Detection
Beyond specific attack signatures, certain IP-level patterns warrant immediate investigation.
Volume anomalies
A single IP making 1,000 or more requests per hour is abnormal for almost any website. Legitimate users rarely exceed a few hundred page views in a session, even on high-engagement sites.
Rotating IPs from the same subnet
Sophisticated attackers rotate through IP addresses within the same /24 or /16 subnet to avoid per-IP rate limits. When you see multiple IPs from the same block all making elevated requests, treat them as a single actor.
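Collapsing per-IP counts into subnet blocks makes a rotating attacker show up as a single actor. A sketch using Python's ipaddress module (the helper name is ours):

```python
import ipaddress
from collections import Counter

def requests_by_subnet(ips, prefix=24):
    """Aggregate a list of request source IPs into /24 (or /16) blocks."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us derive the containing network from a host IP.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts
```

Sorting the result by count surfaces blocks where many "different" IPs are collectively making elevated requests, which is exactly the rotation pattern described above.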
Traffic from unexpected regions
If your site primarily serves users in North America and Europe, a sudden spike of requests from a region where you have no audience is worth investigating — especially if paired with other suspicious signals.
Unusual timing patterns
A disproportionate volume of requests between 2:00 AM and 5:00 AM local time from User-Agent strings that claim to be mainstream browsers (Chrome, Firefox, Safari) is suspicious. Real users are largely asleep; automated tools are not.
Use LogLens IP Analysis to sort by request volume and filter by time of day. Combine with geographic filters to quickly surface the patterns described above.
Setting Up Effective Alerts
Detection is only useful if it surfaces threats quickly enough to act on them. Here are the alerts every site should configure.
Traffic spike alert
Trigger when any 15-minute window exceeds 5x the baseline for total requests. This catches DDoS attempts, aggressive scraping bursts, and scanning campaigns.
Error rate alert
Trigger when 5xx errors exceed 5% of total requests over a rolling window. Server errors during a traffic spike often indicate that an attack is impacting availability.
New high-volume bot alert
Trigger when an unverified bot makes 500 or more requests per hour. This catches new scrapers and impersonators before they can do significant damage.
Start with thresholds set high — 10x baseline for traffic spikes — and tighten gradually as you learn your normal patterns. Use rolling baselines (e.g., same hour last week) rather than static thresholds to account for natural traffic variation.
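The spike check itself reduces to a comparison against the rolling baseline, e.g. the same 15-minute window last week. A sketch — the `floor` parameter is our own addition, there to suppress noisy alerts when the baseline is tiny and ratios are meaningless:

```python
def spike_alert(current, baseline, multiplier=5, floor=100):
    """True when the current 15-minute request count exceeds `multiplier`
    times the same window's count last week.

    `floor` (an assumed default) skips alerting when traffic is so low
    that any small absolute increase would look like a large ratio.
    """
    return current >= floor and current > multiplier * baseline
```

With a baseline of 100 requests, the default settings fire at anything over 500 requests in the window, and never fire below 100 requests regardless of ratio.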
Response Strategies
Different threat types call for different responses. Acting quickly with the right approach prevents damage while avoiding collateral impact on legitimate traffic.
| Threat | Recommended Response | Details |
|---|---|---|
| Fake bots | Block at CDN/edge | Use Cloudflare WAF rules or AWS WAF to block IPs that claim a bot User-Agent but fail IP verification |
| Content scrapers | Rate limit | Enforce 60 requests per minute for unverified clients; tighten further for repeat offenders |
| Vulnerability scanners | Block + alert | Block the source IPs and investigate the specific paths being targeted to confirm no exposure |
| Credential stuffing | Rate limit + CAPTCHA | Apply strict rate limits to authentication endpoints and require CAPTCHA after failed attempts |
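The 60-requests-per-minute limit in the table can be enforced with a sliding-window counter per client. A self-contained sketch — the class name and defaults are illustrative, not a Cloudflare or AWS WAF API:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # client id -> deque of request timestamps

    def allow(self, client, now=None):
        """Record a request; return False if the client is over its limit."""
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(client, deque())
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

A sliding window avoids the burst-at-the-boundary problem of fixed-window counters (e.g. 60 requests at 0:59 followed by 60 more at 1:01), which matters when the clients being limited are scrapers probing for exactly that kind of gap.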
Security Review Cadence
Effective security monitoring is a habit, not a one-time setup. Establish a regular review cadence to stay ahead of evolving threats.
Weekly
- Review top IPs by request volume and check for new unverified high-volume sources
- Check the 404 report for new sensitive paths being probed
- Verify that active alerts fired correctly (or confirm no events warranted them)
Monthly
- Audit the full bot list — look for new bot User-Agent strings that appeared this month
- Review rate-limit and block rules for relevance; remove blocks that are no longer necessary
- Check alert thresholds against current traffic levels and adjust if your baseline has shifted
Quarterly
- Full security posture review: compare current threat landscape to previous quarter
- Update firewall rules and WAF managed rule sets
- Review geographic access policies if your audience has shifted
- Validate that all sensitive paths return proper responses (not leaked configuration data)