Who is behind WebDataCentreBot?
It does not pay to leave ranges known to be occupied by popular hosting companies unblocked, unless you want to have fun with misbehaving or fake bots. My pleasure of meeting WebDataCentreBot was rather accidental: I had been too lazy to blocklist the SoftLayer ranges so that they could do nothing but send mail to me or receive mail from me.
Sitting on 67.228.177.87 and announcing itself as:
Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)
Not only did it jump right in and start indexing without bothering in the slightest about robots.txt, it also happily fetched content that robots.txt explicitly excludes. But then again, how should it know without reading the file in the first place? Well, I thought perhaps they would want to learn about the broken behaviour of their bot and fix it, but looking at their site webdatacentre.com, all I can find is:
Web Data Centre
Web Data Centre is an internet research project driven by a small team of researchers from different parts of the world. Its aim is to get a better understanding of the link structure of the web. More info is coming shortly.
(front page as of June 29th 2008)
And that was it. No point of contact whatsoever and looking at the registration data, things turn out to look pretty spammy:
Domain Name: WEBDATACENTRE.COM
Registrant [1435225]:
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Administrative Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Billing Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Technical Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Domain servers in listed order:
NS1.DOMAINSERVICE.COM 67.99.176.12
NS2.DOMAINSERVICE.COM 67.97.247.209
NS3.DOMAINSERVICE.COM 64.49.213.231
NS4.DOMAINSERVICE.COM 67.97.247.210
Record created on: 2008-06-27 05:46:23.0
Database last updated on: 2008-06-27 05:46:39.373
Domain Expires on: 2009-06-27 05:46:41.0
Registered a mere two days ago and hiding behind an anonymous privacy shield. Why would a business want to remain anonymous unless it has to conceal something? One also might expect a search engine to reveal its legitimacy by having a meaningful rDNS name that reflects the bot’s name, but nothing much to find here either:
olliver@bunkiten:~$ host 67.228.177.87
87.177.228.67.in-addr.arpa domain name pointer midphase.com.
Midphase.com is the generic PTR record of a Softlayer reseller:
%rwhois V-1.5:003fff:00 rwhois.softlayer.com (by Network Solutions, Inc. V-1.5.9.5)
network:Class-Name:network
network:ID:NETBLK-SOFTLAYER.67.228.160.0/19
network:Auth-Area:67.228.160.0/19
network:Network-Name:SOFTLAYER-67.228.160.0
network:IP-Network:67.228.177.0/24
network:IP-Network-Block:67.228.177.0-67.228.177.255
network:Organization;I:Hosting Services Inc.
network:Street-Address:223 West Jackson Blvd STE# 1014
network:City:Chicago
network:State:IL
network:Postal-Code:60606
network:Country-Code:US
network:Tech-Contact;I:sysadmins @ softlayer.com
network:Abuse-Contact;I:abuse @ midphase.com
network:Admin-Contact;I:IPADM258-ARIN
network:Created:20080128
network:Updated:20080324
network:Updated-By:ipadmin @ softlayer.com
An aggregated range of consecutive IP addresses registered to the bot-building outfit would seem more practical, especially for directing complaints to the appropriate persons. However, there is no information about the number of IP addresses in use by this anonymous entity, which effectively helps Midphase’s publicity-shy customers remain anonymous. Putting it all together, it seems more likely that they are content/email/webform-seeking spammers building a list for themselves or to sell to other spammers than an actual search engine. Even if I am entirely mistaken, I am still not particularly keen on bots that ignore established standards like robots.txt. Absent any communication channel, one has to conclude that one cannot opt out of their crawling by ordinary means.
Therefore, firewalling this particular range seems an appropriate solution to me:
iptables -A INPUT -s 67.228.177.0/24 -i eth0 -p tcp -m tcp ! --dport 25 --syn -j REJECT
This rule rejects all new incoming TCP connections except for SMTP, as there may be legitimate sites we would like to receive mail from or send mail to. We have to specify that only incoming SYN packets are rejected, because otherwise the replies to our own outgoing connections would be rejected as well, and mail to this address range would remain stuck in our queue and never get delivered. If this potential need for communication is not something to worry about, one can still apply the BOfH method and drop the range altogether:
iptables -A INPUT -s 67.228.177.0/24 -i eth0 -j DROP
Apache servers may also be happy about another SetEnvIf rule, preferably in httpd.conf/apache2.conf, or in .htaccess if the former is not an option due to a shared hosting account:
SetEnvIfNoCase User-Agent "WebDataCentre(Bot|\.com)" block
Deny from env=block
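Before deploying a User-Agent pattern like the one above, it can be worth checking offline what it will and will not catch. The following is only a rough Python sketch, not Apache itself: it assumes that Python's `re.search` with `re.IGNORECASE` is a close enough stand-in for the unanchored, case-insensitive matching of SetEnvIfNoCase, and the function name `is_blocked` is made up for illustration.

```python
import re

# Same pattern as in the SetEnvIfNoCase line; re.IGNORECASE stands in for "NoCase".
BLOCK_PATTERN = re.compile(r"WebDataCentre(Bot|\.com)", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-Agent string."""
    return BLOCK_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

The pattern is deliberately unanchored, so it also catches a client that only mentions the domain name without the "Bot" suffix.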
Update July 1st, 2008:
The bot has been spotted with another IP address, 66.150.224.245, this time without any rDNS record at all:
olliver@bunkiten:~$ host 66.150.224.245
Host 245.224.150.66.in-addr.arpa. not found: 3(NXDOMAIN)
Familiar setup: a /24 belonging to what is presumably an Internap reseller, and still without any details concerning the company/project.
CustName: Networld Internet Services
Address: P.O box 551
City: Skippack
StateProv: PA
PostalCode: 19474
Country: US
RegDate: 2007-01-16
Updated: 2007-01-16
NetRange: 66.150.224.0 - 66.150.224.255
CIDR: 66.150.224.0/24
NetName: INAP-PHI-NETWORLDINT-12098
NetHandle: NET-66-150-224-0-1
Parent: NET-66-150-0-0-1
NetType: Reassigned
Comment:
RegDate: 2007-01-16
Updated: 2007-01-16
RTechHandle: INO3-ARIN
RTechName: InterNap Network Operations Center
RTechPhone: +1-877-843-4662
RTechEmail: noc @ internap.com
OrgAbuseHandle: IAC3-ARIN
OrgAbuseName: Internap Abuse Contact
OrgAbusePhone: +1-206-256-9500
OrgAbuseEmail: abuse @ internap.com
OrgTechHandle: INO3-ARIN
OrgTechName: InterNap Network Operations Center
OrgTechPhone: +1-877-843-4662
OrgTechEmail: noc @ internap.com
In case you want to add another iptables rule based on the sample further above, simply replace 67.228.177.0/24 with 66.150.224.0/24 and you should be set.
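With the list of ranges growing, swapping the address by hand invites typos. As a small sketch, assuming you keep the ranges in a list and paste the printed commands into a root shell yourself, the mail-only rule from above can be generated per range. The helper name `mail_only_rule` and the `eth0` default are my own choices here; the iptables options are exactly the ones used in the sample rule.

```python
import ipaddress


def mail_only_rule(cidr: str, iface: str = "eth0") -> str:
    """Build the mail-only REJECT rule shown above for one address range.

    ip_network() raises ValueError for malformed input or a range whose
    host bits are set, so typos are caught before anything reaches iptables.
    """
    net = ipaddress.ip_network(cidr)
    return (
        f"iptables -A INPUT -s {net} -i {iface} -p tcp -m tcp "
        "! --dport 25 --syn -j REJECT"
    )


if __name__ == "__main__":
    # Ranges mentioned so far in this post.
    for cidr in ["67.228.177.0/24", "66.150.224.0/24"]:
        print(mail_only_rule(cidr))
```

The script only prints the commands; it does not touch the firewall on its own.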
Update July 4th, 2008:
Another sighting, this time crawling from Sweden, from IP address 77.110.52.67:
olliver@bunkiten:~$ host 77.110.52.67
67.52.110.77.in-addr.arpa is an alias for 77-110-52-67.univation.riksnet.nu.
77-110-52-67.univation.riksnet.nu domain name pointer ip67.univation.riksnet.nu.
So the pattern of using generic rDNS records obviously persists, as does their disregard for robots.txt.
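The rDNS check applied to each sighting can be expressed in a few lines. This is merely a sketch of the idea, not a full verification (a serious check would also confirm that the PTR name resolves back to the same address): it builds the in-addr.arpa name that `host` queries and tests whether a PTR name mentions the bot at all. The token `webdatacentre` and both function names are assumptions chosen for illustration.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Return the in-addr.arpa name that a PTR lookup for this address queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def ptr_mentions(ptr_name: str, token: str = "webdatacentre") -> bool:
    """True if the PTR hostname contains the given token (case-insensitive)."""
    return token.lower() in ptr_name.lower()


if __name__ == "__main__":
    print(reverse_name("67.228.177.87"))
    # PTR names copied from the host output in this post.
    for ptr in ["midphase.com.", "ip67.univation.riksnet.nu."]:
        print(ptr, ptr_mentions(ptr))
```

Neither of the PTR names seen so far mentions the bot, which is what "generic" means here.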
Whois:
inetnum: 77.110.52.64 - 77.110.52.79
netname: SE-RIKSNET-UNIVATION2
descr: Stockholm Univation AB site2
country: SE
admin-c: BEER3-RIPE
tech-c: BEER3-RIPE
status: ASSIGNED PA
mnt-by: MNT-RIKSNET
mnt-lower: MNT-RIKSNET
mnt-routes: MNT-RIKSNET
source: RIPE # Filtered
person: Bengt Erik Sandstrom
address: Graddvagen 7
address: S-906 20 Umea
address: Sweden
phone: +46 768 272022
nic-hdl: BEER3-RIPE
source: RIPE # Filtered
That range translates to 77.110.52.64/28, a rather small block this time, and this is also the value you would want to use for blocking them via iptables or other means.
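The translation from a whois start/end pair to CIDR notation need not be done in one's head. As a minimal sketch using only Python's standard `ipaddress` module (the wrapper name `range_to_cidrs` is made up here), `summarize_address_range` yields the smallest set of networks covering a range:

```python
import ipaddress


def range_to_cidrs(first: str, last: str) -> list[str]:
    """Convert an inclusive first-last address range into CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


if __name__ == "__main__":
    # inetnum line from the RIPE record above.
    print(range_to_cidrs("77.110.52.64", "77.110.52.79"))  # ['77.110.52.64/28']
```

A range that does not sit on a power-of-two boundary comes back as several blocks, each of which would need its own firewall rule.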
14 Comments »
We picked up WebDataCentreBot in ModSecurity today 30 June 2008.
Its IP was 66.150.224.245
The rule we use in ModSecurity is:
SecRule HTTP_User-Agent "WebDataCentreBot"
Sample Log from today:
66.150.224.245 - - [30/Jun/2008:14:53:17 -0400] "GET / HTTP/1.1" 406 251 "-" "Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)"
Thanks for the update. Seems they do not really want people to opt out from their crawling, do they? I’m afraid I won’t see their attempts at indexing from their new location, because everything from Internap is blocked at the firewall level, due to Internap’s less than stellar reputation:
http://www.spamhaus.org/statistics/networks.lasso
Rank 4 in Spamhaus’ top spammiest networks is not really something to be proud of.
Olliver
Thank you for this public service information. We started getting hit by WebDataCentre recently, and it saves time to know they are schmucks.
They just downloaded EVERY page on my website in under a minute. (Great way of helping my server resources)
IP Address was 77.110.52.67
No robots.txt file was accessed.
Thanks for the new IP address, Vince. Seems they’re all over the place, as this address is located in Sweden. Having heard about your sighting, I immediately checked my iptables stats, and it seems they attempted to fetch my sites from their SoftLayer range last night:
Entire /16 is denied access except for bidirectional mail traffic.
Olliver
What about www.munax.com ?
Dean,
Good question :-). I have to admit that I completely forgot about them since I firewalled their /25 several months ago.
According to their FAQ they announce themselves as fake IE because
They leave fake referrers because:
I do not recall whether they honoured robots.txt the first time I noticed them, but I know that other spiders do a sufficiently good job without these “features”. Fortunately firewalling them is easy so I need not waste any system resources for denying their requests.
Olliver
A client of mine just had their server nailed by this bot sitting on 66.90.118.101
WebDataCentreBot/1.0; +http://WebDataCentre.com/)
A Badbot
got it on 66.90.118.101 @ Jan 6th, identifies itself as WebDataCentreBot.
@ Jan 11th, with the same SESSIONID as on Jan 6th, the server gets flooded with requests from different IPs, all trying to break the registration process on my forum.
No WebDataCentreBot identification this time.
I see WebDataCentreBot every once in a while and the bot has a weird bug. My site uses ?arg= to read files, and the HTML uses 'link href="htm/default.css"'. What the bot does is to somehow go from this GET
GET /?arg=code/src/configfile.phps
which is valid to this GET
GET /?arg=code/htm/default.css
which is not. The problem is that it recurses like so
GET /?arg=code/htm/htm/default.css
GET /?arg=code/htm/htm/htm/default.css
GET /?arg=code/htm/htm/htm/htm/default.css
etc. doing so over hundreds of different iterations before giving up. Luckily my site is really small.
Jones,
This type of error is typical of a linking structure that uses relative paths like “directory/file.html” or “../directory/file.htm”. Numerous bots have issues if they stumble across paths like these. You may like to replace these with complete URI paths starting at the webroot, like “/directory/file.htm” or “/?parameter=value” for dynamic links, and then most issues of this kind should go away. Alternatively, you could leave your links as they are and add a
<base href="http://www.example.com/" />
line to the head section of your generated HTML output. This, however, has the disadvantage that local copies no longer work (for example when testing them on a LAN prior to uploading them to the production site).
Apart from this potential problem, the question is whether it is of any use to allow this bot indexing one’s site. Evidence suggests this might not be a terribly good idea, but eventually this is up to the individual webmasters themselves.
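For reference, this is how a standards-following client resolves the link from Jones’ example. The sketch below uses Python's `urllib.parse.urljoin`; the host name www.example.com is a placeholder, and the wrapper name `resolve` is only there for illustration. Note that the query string of the page takes no part in resolving a relative path, which is exactly what the bot appears to get wrong.

```python
from urllib.parse import urljoin

# Placeholder host; the path and query are taken from the example above.
PAGE = "http://www.example.com/?arg=code/src/configfile.phps"


def resolve(page_url: str, href: str) -> str:
    """Resolve a link found on page_url the way a browser would."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # Relative link: resolved against the page's path ("/"), not its query string.
    print(resolve(PAGE, "htm/default.css"))   # http://www.example.com/htm/default.css
    # Root-relative link: same target, and nothing left for a sloppy client to guess.
    print(resolve(PAGE, "/htm/default.css"))  # http://www.example.com/htm/default.css
```

Both forms point to the same file for a correct client, so switching to root-relative links costs nothing while removing the ambiguity for broken ones.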
66.90.118.101 IP for new webdatacentre bot. Used up a lot of bandwidth, too. Block.
ip now as 66.150.224.245 and as ‘User Agent: Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)’ still no request for robots.txt
72.249.45.161 - [18/Mar/2010:06:23:28 +0100] "GET / HTTP/1.1" 200 60486 "-" "Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)"
It just crawled my website.