If you’ve ever received an email with the subject “Your account has reached its monthly bandwidth limit”, your initial response may be: “wow, my site is really popular – that viral marketing campaign I paid for has delivered the goods”. Sadly, this isn’t always the right conclusion – indeed I’d go so far as to say it’s rarely the right conclusion. Your first step should be to dig into your web stats package or Google Analytics and look at your site’s traffic in more detail. You may discover that the majority of your visitors are not actually Homo sapiens.
Non-human visitors come in a variety of forms, but the most common is a type of bot known as a web crawler, spider or spiderbot. These fall into two broad categories: good and bad. The good bots are the ones you may have heard of – Googlebot, Bingbot and the crawlers from Yahoo and other search engines. These agents crawl your site and add it to a search index, which is not only acceptable, it’s desirable. Still, you’d like some control over what content the crawlers have access to, and this is where a simple text file called robots.txt can help. There are many articles extolling the virtues of robots.txt; for further reading, I’ll point you to the Wikipedia page on the Robots Exclusion Protocol.
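As a quick illustration, a minimal robots.txt placed at the root of your site might look like this – the paths and sitemap URL are made up for the example:

```
# Allow all compliant crawlers, but keep them out of the admin area
User-agent: *
Disallow: /wp-admin/
Disallow: /search/

# Tell crawlers where the sitemap lives
Sitemap: https://example.com/sitemap.xml
```

The key word is “compliant”: this file is a polite request, not an enforcement mechanism.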
What if bots ignore the rules? In this modern world of cyber warfare, bots are often the tool of choice. A malicious web crawler not only ignores the rules in robots.txt, it deliberately masquerades as a legitimate bot so that it can crawl your site, looking for telltale pages such as /wp-login.php that suggest the site is running WordPress. Having your site crawled and indexed by a malicious bot poses no direct threat, but it is a form of reconnaissance, and it’s reasonable to take countermeasures.
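You can often spot this kind of reconnaissance in your access logs. Here’s a rough Python sketch that flags requests for well-known probe paths in an Apache/Nginx “combined” format log – the probe list, log line and helper name are all my own invention for illustration:

```python
import re

# Illustrative list of paths that probing bots commonly request
PROBE_PATHS = {"/wp-login.php", "/xmlrpc.php", "/wp-admin/"}

# Loose pattern for the common "combined" access log format
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def looks_like_probe(line):
    """Return (ip, path, user-agent) if the request targets a probe path."""
    m = LOG_RE.match(line)
    if not m:
        return None
    if m.group("path") in PROBE_PATHS:
        return (m.group("ip"), m.group("path"), m.group("agent"))
    return None

# Example log line: a WordPress login probe claiming to be Googlebot
line = ('203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /wp-login.php HTTP/1.1" 404 162 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
print(looks_like_probe(line))
```

Note that the user-agent string proves nothing – anyone can send “Googlebot”. Google’s own documented advice for verifying the real Googlebot is a reverse-DNS lookup on the requesting IP, which should resolve to googlebot.com or google.com.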
Perhaps a more annoying consequence of bad bots is that they consume bandwidth – in other words, they eat into your monthly traffic allowance, assuming you have one. Each “hit” from a bot may be only a few kilobytes, but cumulatively it all adds up quickly. Even if you don’t have a limit, those requests are tying up resources that should be available for your human visitors.
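To see how fast “a few kilobytes” adds up, here’s a back-of-the-envelope calculation – the request rate and response size are invented figures, not measurements:

```python
# Illustrative figures: a bot fleet making 20,000 requests/day,
# with each response averaging 30 KB
requests_per_day = 20_000
avg_response_kb = 30

kb_per_day = requests_per_day * avg_response_kb
gb_per_month = kb_per_day * 30 / 1_000_000  # decimal GB, as hosts usually bill

print(f"{gb_per_month:.0f} GB/month")  # prints "18 GB/month"
```

That’s 18 GB a month of traffic that never put your site in front of a single human.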
What’s the solution? If you use WordPress, I’d recommend the “Blackhole for Bad Bots” plugin. It’s really easy to set up and can send you an email when a new bot appears on the radar; you can then decide whether it’s legit…or not. The plugin comes pre-configured with a list of “good” bots that it will ignore, and there is also a stand-alone PHP version that can be used on any PHP-based site.
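The general “blackhole” idea, as I understand it, is a honeypot: publish a hidden link that robots.txt forbids, and ban any client that fetches it anyway – a compliant crawler will never see it. Here’s a minimal sketch of that logic in Python rather than the plugin’s PHP; the names and return values are mine, not the plugin’s:

```python
# Hidden trap URL, listed under "Disallow:" in robots.txt
TRAP_PATH = "/blackhole/"
blocked_ips = set()  # a real implementation would persist this

def handle_request(ip, path):
    """Return an HTTP status code, banning any client that hits the trap."""
    if ip in blocked_ips:
        return 403               # previously trapped client
    if path == TRAP_PATH:
        blocked_ips.add(ip)      # fell into the trap: ban from now on
        return 403
    return 200                   # normal request

print(handle_request("198.51.100.7", "/about/"))      # prints 200
print(handle_request("198.51.100.7", "/blackhole/"))  # prints 403, now banned
print(handle_request("198.51.100.7", "/about/"))      # prints 403
```

The elegance is that the rule-breaking is the detection signal: no user-agent guessing, no IP reputation lists, just bots convicting themselves.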
Protecting yourself from bad bots is likely to become standard practice – there are already tools in plugins such as Wordfence that let you tweak the bot rules – but I think the Blackhole developer(s) have captured the zeitgeist really well in a simple, effective plugin.