Broken fake MJ12bot hitting my server
This morning I was puzzling over several hundred KB of peculiar log entries left by an MJ12bot variant that did not quite follow the well-mannered behaviour of the original. Here is a sample entry as illustration:
82.245.176.52 - - [30/Dec/2007:06:50:57 +0100] "GET /dw041-formication-agnosia/ _title=%22%22%3E%20%3Cabbr%20title=%22%22%3E%20%3Cacronym%20title=%22%22%3E%20%3Cb%3E%20%3C blockquote%20cite=%22%22%3E%20%3Ccode%3E%20%3Cem%3E%20%3Ci%3E%20%3Cstrike%3E%20%3C strong%3E%20%3C/small/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2/ HTTP/1.1" 404 39670 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"
Whatever this broken botware was trying to accomplish, it did not work out, other than leaving a fair amount of clutter in my server logs. A bit of research revealed that the original makers of this distributed search engine have been aware of these fake bots for a while:
20 Oct 2007 – in the last few days it has been brought to our attention that a number of fake MJ12bots appeared on the Net. These bots are not ours but they use fake MJ12bot user-agent – this is something we can’t do anything about just like with email spammers who fake email addresses so we all get spammed supposedly from our own emails or someone elses.
(source : http://www.majestic12.co.uk/projects/dsearch/mj12bot.php)
What impact this crawl will have remains to be seen. If I was lucky, this was merely some distributed search for email addresses. At worst, it was a search for “fresh” content to be displayed on doorway or made-for-AdSense pages, or someone compiling a personal list of “spammable targets” to sell to fellow bottom-feeder spammers. In any case, the best defence is of course to not let this bot have access to one’s server in the first place, and this is what I did to prevent this surprise from happening again:
Prerequisites: You need an Apache server.
1. Detection rule for User Agent
For simpler cases that do not require checking multiple conditions at once, I prefer using mod_setenvif. As we know from the MJ12bot page, the recent versions of this bot are in the 1.2.x range. Thus we can conclude that anything older than this can be safely discarded as a fake without any grave side effects. You can place the following line in either your .htaccess or httpd.conf file:
# deny fake bot
SetEnvIfNoCase User-Agent "^MJ12bot/v?1\.[01]\.[0-9]{1,2}" block
What does this entry do? It checks for any User-Agent header that starts with a pre-1.2.x version string (1.0.x or 1.1.x), regardless of whether the version number is preceded by a “v”. If there is a match, an environment variable called “block” will be set. This “block” variable can then be used for further actions, in our case denying access, of course.
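To see which strings the pattern catches, here is a small Python sketch. It only approximates the directive: Python's re module is not the regex engine Apache uses, although this particular pattern uses nothing beyond basic syntax, and re.IGNORECASE stands in for the "NoCase" part. Apart from the first sample, which comes from the log entry above, the User-Agent strings are made up for illustration.

```python
import re

# Python approximation of the SetEnvIfNoCase pattern from the post;
# re.IGNORECASE stands in for the "NoCase" part of the directive.
FAKE_MJ12 = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would set the "block" variable."""
    return FAKE_MJ12.search(user_agent) is not None


samples = [
    "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)",  # from the log entry
    "MJ12bot/1.0.8",                             # hypothetical: no "v"
    "mj12bot/v1.1.12",                           # hypothetical: lower case, 1.1.x
    "MJ12bot/v1.2.1",                            # hypothetical 1.2.x string
    "Mozilla/5.0 (compatible; MJ12bot/v1.0.8)",  # hypothetical: name not at start
]

for ua in samples:
    print("block" if is_blocked(ua) else "allow", ua)
```

The first three samples are blocked, the last two are allowed. The last one is worth noting: because the pattern is anchored with ^, a User-Agent that merely contains the bot name somewhere after the start is not caught.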
2. Deny the bot from crawling our site
Now that we have the variable we need to look for, we can block any User-Agent that matches our regular expression from above:
Deny from env=block
If you want to use it in .htaccess, this line merely needs to be placed after the SetEnvIfNoCase rule. Those who want to include it in httpd.conf have to place it within their VirtualHost section, preferably within a Directory or Location directive. An example as illustration:
<VirtualHost 192.168.0.1>
    [...]
    # Directory permissions
    <Directory "/home/web/example.com/html">
        Options Indexes FollowSymLinks MultiViews
        AllowOverride All
        Order deny,allow
        # place the SetEnvIfNoCase rule here
        Deny from env=block
    </Directory>
    [...]
</VirtualHost>
That should take care of the problem for a while. Do not forget to reload (Linux) or restart (FreeBSD) Apache after changing httpd.conf, so that the changes actually take effect. Those who put the rules into .htaccess are already done, as this file is read each time Apache accesses the directory.
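To check afterwards that the rule is doing its job, one option is to look at the status codes the fake bot receives: a denied request shows up in the access log as 403. The following Python sketch counts status codes for matching User-Agents. It assumes the common "combined" log format as in the sample entry above; the two log lines are shortened, made-up variants of that entry, and fake_bot_statuses is just a helper name I picked.

```python
import re

# Same pattern as the SetEnvIfNoCase rule, approximated in Python.
FAKE_MJ12 = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)

# Tail of a combined-format log line: "request" status bytes "referer" "agent"
LOG_TAIL = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"$')


def fake_bot_statuses(lines):
    """Count HTTP status codes of requests whose User-Agent matches the rule."""
    counts = {}
    for line in lines:
        match = LOG_TAIL.search(line.rstrip("\n"))
        if match and FAKE_MJ12.search(match.group(2)):
            status = match.group(1)
            counts[status] = counts.get(status, 0) + 1
    return counts


# Hypothetical, shortened log lines: one before and one after adding the rule.
sample_log = [
    '82.245.176.52 - - [30/Dec/2007:06:50:57 +0100] "GET /page/2/ HTTP/1.1" '
    '404 39670 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"',
    '82.245.176.52 - - [31/Dec/2007:06:50:57 +0100] "GET /page/2/ HTTP/1.1" '
    '403 512 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"',
]

print(fake_bot_statuses(sample_log))
```

To run it against a real log, pass an open file object instead of the sample list. Where the access log lives depends on your setup.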