Cyveillance fake IE 6 bot – fit for the filters
They are coming in legions these days… If Attributor’s performance was already underwhelming, it was nothing compared to the behaviour of Cyveillance’s bot at 38.100.41.112, which hit my netlabel site some hours ago. Here is a summary of what I found unacceptable:
1. Posing as IE 6.0 on Windows XP, albeit with a badly faked User-Agent string:
Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)
The original string for the OS version, however, should be Windows NT 5.1 or Windows NT 5.1; SV1 (the latter with Service Pack 2 installed).
2. No information about the bot owner, let alone a meaningful rDNS domain name pointer
3. Disregard for robots.txt and fetching content that was explicitly disallowed for robots. I do not have any tolerance for such a grave violation of my privacy; I get to decide for myself what content is available for indexing (or research).
4. Insane indexing speed with intervals of less than 2 seconds per page. Not only that, the bot is also broken and requests nonexistent content, thus increasing the load and cluttering one’s logfile.
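The faked platform token from point 1 is easy to test for mechanically. The following is only a minimal sketch in Python; the function name looks_like_fake_ie6 is made up for illustration, and the check assumes nothing more than what is stated above, namely that a genuine IE 6 on XP sends "Windows NT 5.1" and not the literal string "Windows XP".

```python
def looks_like_fake_ie6(user_agent: str) -> bool:
    """Flag User-Agent strings that claim MSIE 6.0 with a literal "Windows XP" token.

    A genuine IE 6 on XP reports "Windows NT 5.1" (plus "; SV1" with SP2),
    so the literal "Windows XP" token is treated as a sign of a fake.
    """
    return "MSIE 6.0" in user_agent and "Windows XP" in user_agent


# The string sent by the bot versus the string a genuine XP SP2 browser sends.
print(looks_like_fake_ie6("Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"))
print(looks_like_fake_ie6("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"))
```

The first call flags the bot's string, the second leaves the genuine one alone. A check like this only catches this particular sloppy fake, of course, not a bot that copies a real browser string.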
Querying Cogent’s rwhois server reveals the following:
[rwhois.cogentco.com]
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com 38.100.41.112
network:ID:NET-266429401A
network:Network-Name:NET-266429401A
network:IP-Network:38.100.41.64/26
network:Org-Name:CYVEILLANCE
network:Street-Address:1555 WILSON BLVD Suite 404
network:City:ARLINGTON
network:State:VA
network:Postal-Code:22209
network:Tech-Contact:ZC108-ARIN
network:Updated:2007-10-05 20:45:56
Other allocations of Cyveillance I could find by querying ARIN’s whois database are:
38.100.21.0/24
38.105.83.0/27
65.213.208.128/27
65.222.176.96/27
65.222.185.72/29
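To check whether an address from the logs falls into one of these allocations, Python's standard ipaddress module is sufficient. This is a sketch only: the list contains the ranges named in this post (including the /26 from the rwhois output), and the names CYVEILLANCE_NETWORKS and is_cyveillance are hypothetical.

```python
import ipaddress

# Ranges taken from the rwhois and ARIN lookups in this post.
CYVEILLANCE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "38.100.41.64/26",
        "38.100.21.0/24",
        "38.105.83.0/27",
        "65.213.208.128/27",
        "65.222.176.96/27",
        "65.222.185.72/29",
    )
]


def is_cyveillance(address: str) -> bool:
    """Return True if the address lies inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CYVEILLANCE_NETWORKS)


# The bot's address from the log, and an unrelated documentation address.
print(is_cyveillance("38.100.41.112"))
print(is_cyveillance("192.0.2.1"))
```

The list reflects the allocations at the time of writing; they may well change, so it would need to be refreshed from whois now and then.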
Now is the right moment to ask who Cyveillance are and what the company’s purpose is:
According to a diagram on their corporate website, their objective is to monitor the web for identity theft, phishing attacks, credit card fraud, unauthorised use (and abuse) of intellectual property and information leaks. Typically they are hired by companies to accomplish this task, so their focus lies on the investigations they are paid for (which does not mean that the public would not benefit from their endeavours, especially if they were hired by large ISPs). However, their aggressive spidering behaviour has earned them some critical remarks in the Wikipedia article dedicated to the company:
Numerous websites have complained about Cyveillance’s traffic for the following reasons:
1. Their robots access many pages in a short period of time and use a comparatively large amount of bandwidth.
2. They completely ignore the robots.txt exclusion protocol, which specifies pages that should not be accessed by robots
3. They use a falsified user-agent string, usually pretending to be some version of Microsoft Internet Explorer on some version of Windows, which is deceptive and can throw off log analysis.
Source: Cyveillance article at Wikipedia
I do not know which of the ranges mentioned are involved in spidering, but as I do not expect any communication from their ranges, I see no reason to grant them access to my server any longer. Perhaps once they have learnt how to behave themselves I might change my mind, but I do not expect that to happen. Why would they care about some German’s opinion? And vice versa, why would I care about some US corporation in Virginia, as Germany is still outside US jurisdiction?
Attributor – unsolicited copyright police pulling my feed
They have been on my radar for a while, and today I finally had enough of Attributor’s stealth bot activities and decided to opt out of their visits by adding their 64.41.145.0/24 address range to my firewall. In their own words, Attributor describe their service as pulling content from billions of websites they monitor, tossing it into a database and checking on their customers’ behalf for any signs of content duplication. If there is a match, they send out notifications on the copyright owner’s behalf and ask for a link back to the original site (mind you, this is restricted to customers who pay for their content police services). If those requests remain unanswered, they finally play the DMCA takedown notice card, which sounds scary at first but should have little bearing on anyone outside the USA.
In principle I welcome their endeavours and even wish they would nail down the hordes of scrapers who steal content as bot bait for their spam pages. But, and this is where I object to their activities, what is the point of monitoring resources outside their jurisdiction? My hosting company, my registrar and I are all located in Germany, and anyone with half a brain who is capable of searching the web for “whois” can find out this sensational news within minutes. Furthermore, why would I care about a bot that is dressed up as IE 6.0 on WinXP and does nothing but steal my bandwidth without any benefit in return? Their bot does not really want to know anything about robots.txt (this would defeat its purpose as a “sneaky” monitoring tool) and its crawling behaviour leaves a thing or two to be desired, too.
I do not want to go into too much detail, like wondering why the company is hiding behind some P.O. box in California, uses a private (!) registration shield despite being a corporation (exhibit A) and owns a netblock that is just about three weeks old (exhibit B). Perhaps they have legitimate reasons to conceal their identity, like protecting themselves from angry scrapers. Perhaps they used to reside somewhere else or are forced to move around pretty often to avoid being blocked by concerned webmasters. Whatever it is, I do not know and I do not really want to find out, as the only thing that matters to me is whether a bot behaves nicely, is to my or at least the public’s benefit and clearly announces itself as a bot, including a responsible owner (working website or email address). Everything else might turn out to be a sneaky spammer or scraper (more or less interchangeable terms), as “references” or “success stories” can easily be fabricated.
Their only address range, according to ARIN, is:
Attributor Corporation SAVV-S600611-2 (NET-64-41-145-0-1)
64.41.145.0 - 64.41.145.255
For denying access I recommend either the gentleman (or woman) method using httpd.conf or .htaccess:
Deny from 64.41.145.0/24
or the BOfH method using iptables:
~# iptables -A INPUT -s 64.41.145.0/24 -i eth0 -p tcp -j DROP
Update 03/02/08:
As Mark pointed out in the comment section, there is another range of Attributor on Abovenet:
Abovenet Communications, Inc NETBLK-ABOVENET-3 (NET-209-133-0-0-1)
209.133.0.0 - 209.133.127.255
Attributor Corp ABOV-I241-209-133-94-0-24 (NET-209-133-94-0-1)
209.133.94.0 - 209.133.94.255
Or as CIDR notation: 209.133.94.0/24
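If whois only reports a first and last address, the CIDR form does not have to be worked out by hand. As a small sketch, Python's ipaddress.summarize_address_range does the conversion; the helper name range_to_cidr is made up for this example.

```python
import ipaddress


def range_to_cidr(first: str, last: str) -> list:
    """Convert an inclusive first-last address range into CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(network) for network in networks]


# The two Attributor ranges as reported by ARIN.
print(range_to_cidr("64.41.145.0", "64.41.145.255"))    # ['64.41.145.0/24']
print(range_to_cidr("209.133.94.0", "209.133.94.255"))  # ['209.133.94.0/24']
```

A range that does not align to a single block simply comes back as several CIDR entries, each of which can go on its own Deny or iptables line.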
Block as you see fit.
Broken fake MJ12bot hitting my server
This morning I was wondering about several hundred KB of peculiar log entries left by an MJ12bot variant that did not exactly seem to follow the nice behaviour of the original. Here is a sample entry as illustration:
82.245.176.52 - - [30/Dec/2007:06:50:57 +0100] "GET /dw041-formication-agnosia/ _title=%22%22%3E%20%3Cabbr%20title=%22%22%3E%20%3Cacronym%20title=%22%22%3E%20%3Cb%3E%20%3C blockquote%20cite=%22%22%3E%20%3Ccode%3E%20%3Cem%3E%20%3Ci%3E%20%3Cstrike%3E%20%3C strong%3E%20%3C/small/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2/ HTTP/1.1" 404 39670 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"
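Requests of this kind stand out because the same path fragment is repeated back to back. As a rough sketch, and not something I run in production, a regular expression with a backreference can flag such runaway paths when grepping through a log; the name has_runaway_path and the threshold of three consecutive copies are arbitrary choices for illustration.

```python
import re

# One or two path components wrapped in slashes, followed by at least two
# more identical copies, e.g. "/page/2//page/2//page/2/".
REPEATED_SEGMENT = re.compile(r"(/[^/\s]+(?:/[^/\s]+)?/)\1{2,}")


def has_runaway_path(path: str) -> bool:
    """Return True if the request path repeats the same fragment three or more times in a row."""
    return REPEATED_SEGMENT.search(path) is not None


# A shortened version of the broken request, and a normal paginated URL.
print(has_runaway_path("/dw041-formication-agnosia/page/2//page/2//page/2/"))
print(has_runaway_path("/dw041-formication-agnosia/page/2/"))
```

Only the first path is flagged; an ordinary paginated URL passes, since its fragments are not repeated.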
Whatever this broken botware was trying to accomplish, it did not work out, other than leaving some decent clutter in my server logs. A bit of research revealed that the original makers of this distributed search engine have been aware of these fake bots for a while:
20 Oct 2007 – in the last few days it has been brought to our attention that a number of fake MJ12bots appeared on the Net. These bots are not ours but they use fake MJ12bot user-agent – this is something we can’t do anything about just like with email spammers who fake email addresses so we all get spammed supposedly from our own emails or someone elses.
(source : http://www.majestic12.co.uk/projects/dsearch/mj12bot.php)
What impact this crawl will have remains to be seen. If I am lucky, this was merely some distributed search for email addresses; at worst it was a search for “fresh” content to be displayed on doorway or made-for-AdSense pages, or someone compiling his personal list of “spammable targets” to sell to fellow bottom-feeder spammers. In any case, the best defence is of course not to let this bot have access to one’s server in the first place, and this is what I did to prevent this surprise from happening again:
Prerequisites: You need an Apache server.
1. Detection rule for User Agent
For less complicated stuff that does not require checking multiple conditions at once, I prefer using mod_setenvif. As we know from the MJ12bot page, recent versions of this bot are in the 1.2.x range. Thus we can conclude that anything older than this can be safely discarded as a fake without any grave side effects. You can place the following lines in either your .htaccess or httpd.conf file:
# deny fake bot
SetEnvIfNoCase User-Agent "^MJ12bot/v?1\.[01]\.[0-9]{1,2}" block
What does this entry do? It checks for any User-Agent header that starts with a pre-1.2.x MJ12bot version, regardless of whether the version number is preceded by a “v”. If there is a match, an environment variable called “block” is set. This “block” variable can then be used for further actions; in our case that means denying access, of course.
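Before putting a pattern like this on a live server, it can be tried out against a few User-Agent strings. The sketch below uses Python's re module with IGNORECASE to mimic SetEnvIfNoCase. Apache uses its own regex engine, but for a pattern this simple the two should agree. Apart from the string taken from my log, the sample User-Agents are invented for illustration.

```python
import re

# Same pattern as in the SetEnvIfNoCase line; IGNORECASE mimics "NoCase".
FAKE_MJ12BOT = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)


def would_be_blocked(user_agent: str) -> bool:
    """Return True if the rule would set the "block" variable for this User-Agent."""
    return FAKE_MJ12BOT.search(user_agent) is not None


samples = [
    "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)",  # from the log entry
    "MJ12bot/1.1.12",                                      # invented, no "v"
    "MJ12bot/v1.2.1 (http://majestic12.co.uk/bot.php?+)",  # invented 1.2.x string
]
for user_agent in samples:
    print(would_be_blocked(user_agent), user_agent)
```

The first two samples match, while the 1.2.x string passes. Note that the pattern is anchored with ^, so it only catches User-Agents that begin with “MJ12bot”.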
2. Deny the bot from crawling our site
Now that we have the variable we need to look for, we can block any User-Agent that matches our regular expression from above:
Deny from env=block
If you want to use it in .htaccess, this line merely needs to be placed after the SetEnvIfNoCase rule. Those who want to include it in httpd.conf have to take care to place it within their VirtualHost section, preferably within a Directory or Location section. An example as illustration:
<VirtualHost 192.168.0.1>
    [...]
    # Directory permissions
    <Directory "/home/web/example.com/html">
        Options Indexes FollowSymLinks MultiViews
        AllowOverride All
        Order deny,allow
        # place the SetEnvIfNoCase rule here
        Deny from env=block
    </Directory>
    [...]
</VirtualHost>
That should take care of the problem for a while. Do not forget to reload (Linux) or restart (FreeBSD) Apache after applying changes to httpd.conf, so that they actually take effect. Those who put it into .htaccess are already done, as this file is read each time Apache accesses the directory.