Over A Third Of My Visitors Are Bad Bots!

Posted on July 14, 2008
Filed Under Analysis, Bots

I’m just starting up the long, long learning curve of blocking bad bots. I have a long way to go yet, but I wondered what the work I had done so far had revealed.

The site running the most up-to-date code is not a huge one. A long-standing, decidedly niche site, it used to get 1000-2000 visitors a day depending on the day of the week. It is currently running far below that because I repeatedly messed up refactoring the bot blocking and did not test sufficiently before putting it live. (It’s only a hobby site, and lack of time meant it was easier to debug by promoting to live. It all went horribly wrong, but it was at least a deliberate choice.) The most up-to-date code is not particularly clever yet. I’m only blocking bots stupid enough to masquerade as spiders that can be identified with round-trip DNS, and bots not quite that stupid but still too stupid to manage sessions!
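For anyone unfamiliar with the round-trip DNS check mentioned above, here is a minimal sketch in Python. It is not the code running on the site. The suffix list and the function name `is_verified_spider` are assumptions made up for illustration, and the two lookup parameters exist only so the check can be exercised without touching the network.

```python
import socket

# Illustrative whitelist of hostname suffixes (an assumption, not the site's real list).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_spider(ip, suffixes=TRUSTED_SUFFIXES,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True if ip passes a round-trip (forward-confirmed reverse) DNS check."""
    # Step 1: reverse lookup, IP address -> hostname.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or the lookup failed

    # Step 2: the hostname must belong to a whitelisted search engine domain.
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False

    # Step 3: forward lookup, hostname -> IP addresses, must include the original IP.
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

A request whose user agent claims to be a known spider but whose IP fails this check is the kind of masquerading bot that would get a 403.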

The first problem was what data to compare. As usual with stats, nothing was going to be 100% accurate, but what would be best?

Sessions were out of the question. Because of the way the site handles sessions, stupid bots tend to repeatedly butt their thick heads against the front door, starting a new session each time. The infamous AVG LinkScanner has made thousands of unsuccessful requests on its own.

In the end I went with unique IP addresses. Not perfect, of course, due to caching, but I thought the volume of traffic was low enough to make it a useful comparison.

Over a week of traffic there were 1236 unique IPs that were not recognised good bots (i.e. white-listed search engine spiders identified by round-trip DNS). Compare this to the number of IPs that were not good bots and actually managed to request a page (not an embedded file or robots.txt) and get a 200 or 304 status in return: 806.
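The two counts could be pulled from an access log along these lines. This is only a sketch under several assumptions: the log is in the common Apache-style format, the list of "embedded file" extensions is a guess, and `good_bot_ips` is a hypothetical set of addresses that already passed the round-trip DNS check.

```python
import re

# Assumes Apache-style lines: ip ident user [date] "METHOD path PROTO" status ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

# Guessed list of extensions that count as embedded files rather than pages.
EMBEDDED_SUFFIXES = (".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico")


def is_page_request(path):
    """True for a page, False for robots.txt or an embedded file."""
    path = path.split("?", 1)[0].lower()
    return path != "/robots.txt" and not path.endswith(EMBEDDED_SUFFIXES)


def count_ips(lines, good_bot_ips=frozenset()):
    """Return (unique non-good-bot IPs, those that got a 200/304 for a page)."""
    all_ips = set()
    successful_ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        ip = match.group("ip")
        if ip in good_bot_ips:
            continue  # recognised good bots are excluded from both counts
        all_ips.add(ip)
        if match.group("status") in ("200", "304") and is_page_request(match.group("path")):
            successful_ips.add(ip)
    return len(all_ips), len(successful_ips)
```

Calling `count_ips(open("access.log"), good_bot_ips)` on a week of logs (the file name is a placeholder) would give the pair of numbers being compared here.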

That leaves 428 IP addresses that harboured agents that were too poorly coded to handle sessions, or got a 404 (looking for exploits) or a 403 (masquerading as a known spider). Over a third already, and bear in mind that I will not yet be catching even vaguely competent stealth crawlers. I expect that figure to rise!
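As a quick check of the "over a third" claim, the arithmetic can be sketched like this (the function name `bad_share` is made up for illustration). Note that straight subtraction of the two figures quoted above gives 430 rather than the 428 in the text; either way the share comes out just above one third.

```python
def bad_share(unique_ips, successful_ips):
    """Return (IPs that never got a page, their share of all non-good-bot IPs)."""
    unsuccessful = unique_ips - successful_ips
    return unsuccessful, unsuccessful / unique_ips


# Figures quoted in the post: 1236 unique IPs, 806 of which got a page.
unsuccessful, share = bad_share(1236, 806)
print(f"{unsuccessful} of 1236 IPs ({share:.1%}) never got a page")
```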
