electro acoustic expressionism
nodepet
June 29th, 2008

Who is behind WebDataCentreBot?

Filed under: Web — olliver @ 23:52 h

It pays to preemptively block ranges known to belong to popular hosting companies, unless you want to have fun with misbehaving or fake bots. My encounter with WebDataCentreBot was rather accidental: I had been lazy about blocklisting the SoftLayer ranges, which I would normally restrict so that hosts there can do nothing but exchange mail with me.

Sitting on 67.228.177.87 and announcing itself as:

Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)

Not only did it jump right in and start indexing without bothering in the slightest about robots.txt, it also happily fetched content that robots.txt explicitly excludes. But then again, how should it know without reading the file in the first place? I thought perhaps they would want to learn about their bot’s broken behaviour and fix it, but looking at their site, webdatacentre.com, all I can find is:

Web Data Centre

Web Data Centre is an internet research project driven by a small team of researchers from different parts of the world. Its aim is to get a better understanding of the link structure of the web. More info is coming shortly.

(front page as of June 29th 2008)

And that was it. No point of contact whatsoever, and the registration data looks pretty spammy:

Domain Name: WEBDATACENTRE.COM

Registrant [1435225]:
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US

Administrative Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Billing Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Technical Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Domain servers in listed order:

        NS1.DOMAINSERVICE.COM         67.99.176.12
        NS2.DOMAINSERVICE.COM         67.97.247.209
        NS3.DOMAINSERVICE.COM         64.49.213.231
        NS4.DOMAINSERVICE.COM         67.97.247.210

        Record created on:        2008-06-27 05:46:23.0
        Database last updated on: 2008-06-27 05:46:39.373
        Domain Expires on:        2009-06-27 05:46:41.0

Registered a mere two days ago and hiding behind an anonymous privacy shield. Why would a business want to remain anonymous unless it has to conceal something? One also might expect a search engine to reveal its legitimacy by having a meaningful rDNS name that reflects the bot’s name, but nothing much to find here either:

olliver@bunkiten:~$ host 67.228.177.87
87.177.228.67.in-addr.arpa domain name pointer midphase.com.

Midphase.com is the generic PTR record of a SoftLayer reseller:

%rwhois V-1.5:003fff:00 rwhois.softlayer.com (by Network Solutions, Inc. V-1.5.9.5)
network:Class-Name:network
network:ID:NETBLK-SOFTLAYER.67.228.160.0/19
network:Auth-Area:67.228.160.0/19
network:Network-Name:SOFTLAYER-67.228.160.0
network:IP-Network:67.228.177.0/24
network:IP-Network-Block:67.228.177.0-67.228.177.255
network:Organization;I:Hosting Services Inc.
network:Street-Address:223 West Jackson Blvd STE# 1014
network:City:Chicago
network:State:IL
network:Postal-Code:60606
network:Country-Code:US
network:Tech-Contact;I:sysadmins @ softlayer.com
network:Abuse-Contact;I:abuse @ midphase.com
network:Admin-Contact;I:IPADM258-ARIN
network:Created:20080128
network:Updated:20080324
network:Updated-By:ipadmin @ softlayer.com
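For those who prefer to script the rDNS check described above, here is a minimal sketch in Python. The helper names (reverse_pointer, ptr_mentions_bot, lookup_ptr) are made up for illustration; only the standard ipaddress and socket modules are used, and lookup_ptr depends on whatever resolver the machine is configured with.

```python
import ipaddress
import socket
from typing import Optional


def reverse_pointer(ip: str) -> str:
    """Return the in-addr.arpa name that a PTR query for `ip` asks for."""
    return ipaddress.ip_address(ip).reverse_pointer


def ptr_mentions_bot(ptr_name: str, bot_domain: str) -> bool:
    """True if the PTR host name is the bot's domain or a subdomain of it."""
    ptr = ptr_name.rstrip(".").lower()
    domain = bot_domain.rstrip(".").lower()
    return ptr == domain or ptr.endswith("." + domain)


def lookup_ptr(ip: str) -> Optional[str]:
    """Resolve the PTR record via the system resolver; None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
```

With the values from this post, reverse_pointer("67.228.177.87") gives the 87.177.228.67.in-addr.arpa name seen in the host output, and ptr_mentions_bot("midphase.com.", "webdatacentre.com") is False, which is exactly the complaint: the PTR record says nothing about the bot.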

An aggregated range of consecutive IP addresses registered to the outfit running the bot would seem more practical, especially for directing complaints to the appropriate people. However, there is no information about how many IP addresses this anonymous entity uses, which effectively helps Midphase’s publicity-shy customers remain anonymous. Putting it all together, it seems more likely that they are spammers harvesting content, email addresses and web forms, building a list for themselves or to sell to other spammers, than an actual search engine. Even if I am entirely mistaken, I am still not particularly keen on bots that ignore established standards like robots.txt. Absent any communication channel, one has to conclude that there is no ordinary way to opt out of their crawling.

Therefore, firewalling this particular range seems an appropriate solution to me:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -p tcp -m tcp ! --dport 25 --syn -j REJECT

This rule rejects all incoming TCP connections except SMTP, as there may be legitimate sites in that range we want to receive mail from or send mail to. We have to specify that only incoming SYN packets be rejected, because otherwise the replies to our own outgoing SMTP connections would be rejected as well (they arrive on a local port other than 25), and mail to this address range would remain stuck in our queue and never get delivered. If this potential need for communication is nothing to worry about, one can still apply the BOfH method and drop the range altogether:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -j DROP

Apache servers may also benefit from another SetEnvIf rule, preferably in httpd.conf/apache2.conf, or in .htaccess if the former is not an option because of a shared hosting account:

SetEnvIfNoCase User-Agent "WebDataCentre(Bot|\.com)" block

Deny from env=block
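To see what the pattern in that rule catches before deploying it, a quick sketch in Python can help. It assumes that Python’s re module and Apache’s regex engine agree on a pattern this simple; is_blocked is a hypothetical helper name, not anything Apache provides.

```python
import re

# Same pattern as in the SetEnvIfNoCase line; re.IGNORECASE mirrors "NoCase".
BLOCK_PATTERN = re.compile(r"WebDataCentre(Bot|\.com)", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would set the `block` variable."""
    # SetEnvIf matches anywhere in the header value, hence search(), not match().
    return BLOCK_PATTERN.search(user_agent) is not None
```

Feeding it the User-Agent string quoted at the top of this post returns True, both because of the "WebDataCentreBot" token and because of the "WebDataCentre.com" URL, so the rule would still match if they dropped one of the two.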

Update July 1st, 2008:

The bot has been spotted with another ip address, 66.150.224.245, this time without any rDNS record at all:

olliver@bunkiten:~$ host 66.150.224.245
Host 245.224.150.66.in-addr.arpa. not found: 3(NXDOMAIN)

Familiar setup: within a /24 of a presumed Internap reseller, and still without any details about the company/project.

CustName:   Networld Internet Services
Address:    P.O box 551
City:       Skippack
StateProv:  PA
PostalCode: 19474
Country:    US
RegDate:    2007-01-16
Updated:    2007-01-16

NetRange:   66.150.224.0 - 66.150.224.255
CIDR:       66.150.224.0/24
NetName:    INAP-PHI-NETWORLDINT-12098
NetHandle:  NET-66-150-224-0-1
Parent:     NET-66-150-0-0-1
NetType:    Reassigned
Comment:
RegDate:    2007-01-16
Updated:    2007-01-16

RTechHandle: INO3-ARIN
RTechName:   InterNap Network Operations Center
RTechPhone:  +1-877-843-4662
RTechEmail:  noc @ internap.com 

OrgAbuseHandle: IAC3-ARIN
OrgAbuseName:   Internap Abuse Contact
OrgAbusePhone:  +1-206-256-9500
OrgAbuseEmail:  abuse @ internap.com

OrgTechHandle: INO3-ARIN
OrgTechName:   InterNap Network Operations Center
OrgTechPhone:  +1-877-843-4662
OrgTechEmail:  noc @ internap.com

In case you want to add another iptables rule based on the sample further above, simply replace 67.228.177.0/24 with 66.150.224.0/24 and you should be set.
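If the list of ranges keeps growing, the rules can be generated rather than typed. The following is only a sketch: mail_only_rule is a hypothetical helper that prints the same command as the sample above, and it leaves actually running iptables to you.

```python
import ipaddress


def mail_only_rule(cidr: str, iface: str = "eth0") -> str:
    """Build the 'reject everything except SMTP' rule for one range."""
    # ip_network() is strict by default: it raises ValueError when host bits
    # are set, which catches typos such as 66.150.224.245/24.
    network = ipaddress.ip_network(cidr)
    return (
        f"iptables -A INPUT -s {network} -i {iface} "
        "-p tcp -m tcp ! --dport 25 --syn -j REJECT"
    )


if __name__ == "__main__":
    # The two /24 ranges mentioned so far in this post.
    for cidr in ("67.228.177.0/24", "66.150.224.0/24"):
        print(mail_only_rule(cidr))
```

The interface name eth0 is taken from the sample rule and is an assumption about your machine; pass a different iface if yours differs.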

Update July 4th, 2008:

Another sighting, this time crawling from Sweden with the IP address 77.110.52.67:

olliver@bunkiten:~$ host 77.110.52.67
67.52.110.77.in-addr.arpa is an alias for 77-110-52-67.univation.riksnet.nu.
77-110-52-67.univation.riksnet.nu domain name pointer ip67.univation.riksnet.nu.

So the pattern of using generic rDNS records obviously persists, as does their disregard for robots.txt.
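Whether a crawler ever asked for robots.txt can be read straight from the access log. Here is a minimal sketch in Python, assuming Apache’s "combined" log format; the function name clients_skipping_robots and the regular expression are illustrative and may need adjusting for other log formats.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# host ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def clients_skipping_robots(lines, agent_substring):
    """Return the sorted IPs whose User-Agent contains `agent_substring`
    (case-insensitive) and that never requested /robots.txt."""
    fetched = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_substring.lower() in match.group("agent").lower():
            # Ignore any query string when recording the requested path.
            path = match.group("path").split("?", 1)[0]
            fetched[match.group("ip")].add(path)
    return sorted(ip for ip, paths in fetched.items()
                  if "/robots.txt" not in paths)
```

Calling clients_skipping_robots(open("access.log"), "WebDataCentreBot") lists every address the bot used without fetching robots.txt; the file name access.log is a placeholder for wherever your server writes its log.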

Whois:

inetnum:        77.110.52.64 - 77.110.52.79
netname:        SE-RIKSNET-UNIVATION2
descr:          Stockholm Univation AB site2
country:        SE
admin-c:        BEER3-RIPE
tech-c:         BEER3-RIPE
status:         ASSIGNED PA
mnt-by:         MNT-RIKSNET
mnt-lower:      MNT-RIKSNET
mnt-routes:     MNT-RIKSNET
source:         RIPE # Filtered

person:         Bengt Erik Sandstrom
address:        Graddvagen 7
address:        S-906 20 Umea
address:        Sweden
phone:          +46 768 272022
nic-hdl:        BEER3-RIPE
source:         RIPE # Filtered

That range translates to 77.110.52.64/28, a rather small block this time, and this is also the value to use for blocking them via iptables or other means.
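Translating a whois inetnum range into CIDR notation need not be done by hand. A small sketch using Python’s standard ipaddress module follows; range_to_cidrs is a made-up wrapper name around the library’s summarize_address_range function.

```python
import ipaddress
from typing import List


def range_to_cidrs(first: str, last: str) -> List[str]:
    """Convert an inclusive first-last address range into CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    # A range that is not aligned on a power of two yields several blocks.
    return [str(network) for network in networks]


if __name__ == "__main__":
    # The inetnum range from the RIPE record above.
    print(range_to_cidrs("77.110.52.64", "77.110.52.79"))
```

For the RIPE record above this prints a list with the single entry 77.110.52.64/28, since the range covers 16 addresses starting on a multiple of 16.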

7 Comments »

  1. We picked up WebDataCentreBot in ModSecurity today 30 June 2008.

    Its IP was 66.150.224.245

    The rule we use in ModSecurity is:

    SecRule HTTP_User-Agent "WebDataCentreBot"

    Sample Log from today:

    66.150.224.245 - - [30/Jun/2008:14:53:17 -0400] "GET / HTTP/1.1" 406 251 "-" "Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)"

    Comment by Bot Buster — July 1st, 2008 @ 03:24 h
  2. Thanks for the update. Seems they do not really want people to opt out from their crawling, do they? I’m afraid I won’t see their attempts at indexing from their new location, because everything from Internap is blocked at the firewall level, due to Internap’s less than stellar reputation:

    http://www.spamhaus.org/statistics/networks.lasso

    Rank 4 in Spamhaus’ top spammiest networks is not really something to be proud of.

    Olliver

    Comment by olliver — July 1st, 2008 @ 07:24 h
  3. Thank you for this public service information. We started getting hit by WebDataCentre recently, and it saves time to know they are schmucks.

    Comment by Karl — July 1st, 2008 @ 18:33 h
  4. They just downloaded EVERY page on my website in under a minute. (Great way of helping my server resources)

    IP Address was 77.110.52.67

    No robots.txt file was accessed.

    Comment by Vince — July 4th, 2008 @ 08:02 h
  5. Thanks for the new IP address, Vince. Seems they’re all over the place, as this address is located in Sweden. Having heard about your sighting I immediately checked my iptables stats, and it seems they attempted to fetch my sites from their SoftLayer range last night:

    pkts bytes target     prot opt in     out     source               destination
    4    240   REJECT     tcp  --  eth0   *       67.228.0.0/16        0.0.0.0/0

    The entire /16 is denied access except for bidirectional mail traffic.

    Olliver

    Comment by olliver — July 4th, 2008 @ 09:37 h
  6. What about www.munax.com ?

    Comment by Dean S — September 23rd, 2008 @ 23:02 h
  7. Dean,
    Good question :-). I have to admit that I completely forgot about them since I firewalled their /25 several months ago.

    According to their FAQ they announce themselves as fake IE because

    Today, web servers are intelligent enough to react on the type of user agent. If our crawlers had a name, say MunaxRob or something like that, many web servers would not know about it and would return junk or maybe nothing at all.

    They leave fake referrers because:

    You might have set your web server to deny access to things (images for instance) on your site unless the Referer is a page on your web site. This is why the crawler access your site with a Referer page outside your site;

    I do not recall whether they honoured robots.txt the first time I noticed them, but I know that other spiders do a sufficiently good job without these “features”. Fortunately firewalling them is easy so I need not waste any system resources for denying their requests.

    Olliver

    Comment by olliver — September 24th, 2008 @ 01:10 h
