electro acoustic expressionism
nodepet
January 5th, 2008

Cyveillance fake IE 6 bot – fit for the filters

Filed under: Web — olliver @ 15:00 h

They are coming in legions these days… If Attributor’s performance was already underwhelming, it was nothing compared to the behaviour of Cyveillance’s bot sitting on 38.100.41.112, which hit my netlabel site some hours ago. Here is a summary of what I found unacceptable:

1. Posing as IE 6.0 on Windows XP, although as a badly faked version:

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

A genuine IE 6, however, reports the OS version as Windows NT 5.1, or as Windows NT 5.1; SV1 when Service Pack 2 is installed.

2. No information about the bot owner, let alone a meaningful rDNS domain name pointer

3. Disregard for robots.txt: the bot fetched content that was explicitly disallowed for robots. I have no tolerance for such a grave violation of my privacy; I get to decide for myself what content is available for indexing (or research).

4. Insane indexing speed, with intervals of less than 2 seconds per page. Not only that, the bot is also broken and asks for non-existent content, thus increasing the load and cluttering one’s logfile.
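The faked user agent from point 1 is easy to spot mechanically. Here is a minimal sketch in Python, assuming one only wants to catch strings that claim to be MSIE but carry the marketing name "Windows XP" as platform token, which as far as I know a genuine Internet Explorer never sends. The function name and the regular expression are my own illustration, not part of any log analysis tool:

```python
import re

# Matches the classic "Mozilla/4.0 (compatible; MSIE x.y; <tokens>)" layout.
# Group 1 is the claimed IE version, group 2 the remaining tokens.
MSIE_RE = re.compile(r"^Mozilla/4\.0 \(compatible; MSIE (\d+\.\d+); ([^)]*)\)")


def looks_like_fake_msie(user_agent: str) -> bool:
    """Return True if the string poses as MSIE but names 'Windows XP'.

    Assumption: a genuine IE 6 on XP reports 'Windows NT 5.1'
    (plus 'SV1' with Service Pack 2), never the marketing name.
    """
    match = MSIE_RE.match(user_agent)
    if match is None:
        # Not an MSIE-style string at all; this check makes no judgement.
        return False
    tokens = [token.strip() for token in match.group(2).split(";")]
    return "Windows XP" in tokens


if __name__ == "__main__":
    for ua in (
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    ):
        print(looks_like_fake_msie(ua), ua)
```

This only covers the one forgery seen here. A bot that copies a real browser string verbatim passes unnoticed, so the check complements the other points rather than replacing them.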

Querying Cogent’s rwhois server reveals the following:

[rwhois.cogentco.com]
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com
38.100.41.112
network:ID:NET-266429401A
network:Network-Name:NET-266429401A
network:IP-Network:38.100.41.64/26
network:Org-Name:CYVEILLANCE
network:Street-Address:1555 WILSON BLVD Suite 404
network:City:ARLINGTON
network:State:VA
network:Postal-Code:22209
network:Tech-Contact:ZC108-ARIN
network:Updated:2007-10-05 20:45:56

Other allocations of Cyveillance I could find by querying ARIN’s whois database are:

38.100.21.0/24
38.105.83.0/27
65.213.208.128/27
65.222.176.96/27
65.222.185.72/29
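To check whether an address from the logfile belongs to one of these allocations, something along the following lines should do. It is a sketch using Python's standard ipaddress module; the list simply repeats the ranges quoted above (including the /26 from the rwhois output), and the names CYVEILLANCE_NETWORKS and in_listed_ranges are made up for this example:

```python
import ipaddress

# Ranges as quoted in this post; they may well be outdated or incomplete.
CYVEILLANCE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "38.100.41.64/26",
        "38.100.21.0/24",
        "38.105.83.0/27",
        "65.213.208.128/27",
        "65.222.176.96/27",
        "65.222.185.72/29",
    )
]


def in_listed_ranges(address: str) -> bool:
    """Return True if the IPv4 address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CYVEILLANCE_NETWORKS)


if __name__ == "__main__":
    # The bot's address from the log lies inside 38.100.41.64/26.
    print(in_listed_ranges("38.100.41.112"))
```

The same list could be used to generate deny rules for whatever web server or firewall is in use.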

Now is the right moment to ask who Cyveillance are and what the company’s purpose is:
According to a diagram on their corporate website, their objective is to monitor the web for identity theft, phishing attacks, credit card fraud, unauthorised (ab-)use of intellectual property and information leaks. Typically they are hired by companies to accomplish this task, so their focus lies on investigations they are paid for (which does not mean that the public would not benefit from their endeavours, especially if they were hired by large ISPs). However, their aggressive spidering behaviour has earned them some critical remarks in the Wikipedia article dedicated to the company:

Numerous websites have complained about Cyveillance’s traffic for the following reasons:
1. Their robots access many pages in a short period of time and use a comparatively large amount of bandwidth.
2. They completely ignore the robots.txt exclusion protocol, which specifies pages that should not be accessed by robots
3. They use a falsified user-agent string, usually pretending to be some version of Microsoft Internet Explorer on some version of Windows, which is deceptive and can throw off log analysis.

Source: Cyveillance article at Wikipedia

I do not know which of the ranges mentioned are involved in spidering, but as I do not expect any communication from those ranges, I see no reason to grant them access to my server any longer. Perhaps once they have learnt how to behave themselves I might change my mind, but I expect that is not going to happen. Why would they care about some German’s opinion? And vice versa, why would I care about some US corporation in Virginia, as Germany is still outside US jurisdiction?
