electro acoustic expressionism
nodepet
September 30th, 2008

dotbot - yet another useless robot…

Filed under: Web — olliver @ 10:36 h

Allow me to start with a question: What is the purpose of a legitimate robot? One would think it is fetching content at a reasonable pace whilst respecting the host’s restrictions in robots.txt. When a bot bothers to fetch robots.txt prior to crawling, does that mean it will also obey its rules? Not necessarily, it seems. When DotBot visited me two days ago, it did not seem to be interested in my content, but in collecting redirect messages without following them:

208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:08 +0200] "GET /category/photo HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:13 +0200] "GET /category/spam HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:18 +0200] "GET /category/web HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

This is just a small but representative sample. For reasons unknown to me, DotBot omits the trailing slash of the URI, which results in a 301 redirect (because there is no file of that name). If only the spider followed it, it could fetch something meaningful. To cut a long story short: except for robots.txt, this bot did not take home a single article, because it obviously does not know how to handle redirects. Quite a silly waste of resources in my opinion, but then again, what do I know about the bot’s purpose?
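Spotting this pattern by eye gets tedious in larger logfiles. The following is a minimal sketch, assuming Apache's combined log format and assuming that the redirect target is simply the requested path plus a trailing slash; the function name and the shortened user agent in the sample lines are my own choices for illustration:

```python
import re
from collections import defaultdict

# Combined log format: ip, identity, user, [time], "request", status, size, ...
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def unfollowed_redirects(lines):
    """Return {ip: [paths]} for slashless paths answered with 301 whose
    slash-terminated variant the same client never requested."""
    redirected = defaultdict(list)  # ip -> slashless paths answered with 301
    requested = defaultdict(set)    # ip -> every path the client asked for
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, path = match["ip"], match["path"]
        requested[ip].add(path)
        if match["status"] == "301" and not path.endswith("/"):
            redirected[ip].append(path)
    return {
        ip: [p for p in paths if p + "/" not in requested[ip]]
        for ip, paths in redirected.items()
    }

# Sample lines modelled on the excerpt above (user agent shortened).
sample = [
    '208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "DotBot/1.1"',
    '208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "DotBot/1.1"',
    '208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "DotBot/1.1"',
]
print(unfollowed_redirects(sample))
```

A client that shows up here with a long list of paths is collecting redirects without ever fetching the content behind them.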

On the DotNetDotCom website, the crawler’s presumable home, we can find the following statement:

Hi! Thanks for letting us crawl you!

We are just a few Seattle based guys trying to figure out how to make internet data as open as possible. You should be able to find everything you are looking for below. If not feel free to contact us. Happy Surfing!

The “we are just …” statement does not inspire much confidence in me. This impression is amplified by the next paragraph, which contains instructions on how to get rid of the bot:

1. First and foremost, curse our name. Trust us, it will feel good. Now breath gently…
2. Create a simple text file named robots.txt and place it in your server’s root directory. (http://www.yoursite.com/ «– Right There!)
3. Add the following code to your robots.txt file:
User-agent: dotbot
Disallow: /
4. Reflect on how easy that was.

To me this does not sound like a responsible operation, because it suggests that rather than fixing their bot, they urge “flamers” to opt out of their crawling. Regulars will know I am one of these flamers ;-) and of course this is not the only reason for my scepticism:

208.115.111.245 - - [28/Sep/2008:11:13:52 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:11:19:32 +0200] "GET /impressum HTTP/1.1" 301 241 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

Impressum is explicitly excluded from crawling in robots.txt because it contains sensitive information about me that I am required to publish by German law. Yet, despite reading robots.txt, DotBot chose to jump right onto it. Fortunately, it again failed to add a trailing slash to its request and to handle the resulting 301 redirect properly. Ignoring robots.txt is usually a knock-out criterion for a bot, and since experience has proven time and again that bad bots have a tendency to morph, I prefer to firewall them right away.
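What a well-behaved crawler is expected to do before each request can be sketched with Python's standard urllib.robotparser module. The robots.txt content below is hypothetical (my actual file is not reproduced in this post), and the helper name may_fetch is only there for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real file is not shown in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /impressum
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(user_agent, path):
    """Ask the parsed rules whether this user agent may request the path."""
    return parser.can_fetch(user_agent, path)

print(may_fetch("DotBot/1.1", "/impressum"))      # disallowed by the rule above
print(may_fetch("DotBot/1.1", "/category/web/"))  # no rule matches, so allowed
```

A bot that downloads robots.txt but never performs a check along these lines is reading the rules without applying them.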

Whois opines the following about their address space:

OrgName:    dotnetdotcom.org
OrgID:      DOTNE
Address:    93 S. Jackson Street #10070
City:       Seattle
StateProv:  WA
PostalCode: 98104-2818
Country:    US

NetRange:   208.115.111.240 - 208.115.111.255
CIDR:       208.115.111.240/28
OriginAS:   AS23033
NetName:    208-115-111-240-SLASH28
NetHandle:  NET-208-115-111-240-1
Parent:     NET-208-115-96-0-1
NetType:    Reassigned
Comment:
RegDate:    2008-07-21
Updated:    2008-07-21

I am not suggesting the DotNetDotCom owners are blackhats. But I have better things to do in my life than to debug other people’s bot operations. If DotBot fails even at elementary things like obeying robots.txt and following redirects, then I do not see why I should allow it to visit my sites. Blocking 208.115.111.240/28 should take care of the problem.

Comments (0)

June 29th, 2008

Who are behind WebDataCentreBot?

Filed under: Web — olliver @ 23:52 h

It does not pay not to preemptively block ranges known to be occupied by popular hosting companies, unless you want to have fun with misbehaving or fake bots. My encounter with WebDataCentreBot was rather accidental: I had been too lazy to blocklist the SoftLayer ranges so that they could do nothing but send mail to me or receive mail from me.

Sitting on 67.228.177.87 and announcing itself as:

Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)

Not only did it jump right in to start indexing without bothering in the slightest about robots.txt, but it also happily fetched content that was explicitly excluded in robots.txt. But then again, how should it know without reading it in the first place? Well, I thought perhaps they want to learn about the broken behaviour of their bot and fix it, but looking at their site webdatacentre.com, all I can find is:

Web Data Centre

Web Data Centre is an internet research project driven by a small team of researchers from different parts of the world. Its aim is to get a better understanding of the link structure of the web. More info is coming shortly.

(front page as of June 29th 2008)

And that was it. No point of contact whatsoever and looking at the registration data, things turn out to look pretty spammy:

Domain Name: WEBDATACENTRE.COM

Registrant [1435225]:
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US

Administrative Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Billing Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Technical Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Domain servers in listed order:

        NS1.DOMAINSERVICE.COM         67.99.176.12
        NS2.DOMAINSERVICE.COM         67.97.247.209
        NS3.DOMAINSERVICE.COM         64.49.213.231
        NS4.DOMAINSERVICE.COM         67.97.247.210

        Record created on:        2008-06-27 05:46:23.0
        Database last updated on: 2008-06-27 05:46:39.373
        Domain Expires on:        2009-06-27 05:46:41.0

Registered a mere two days ago and hiding behind an anonymous privacy shield. Why would a business want to remain anonymous unless it has to conceal something? One also might expect a search engine to reveal its legitimacy by having a meaningful rDNS name that reflects the bot’s name, but nothing much to find here either:

olliver@bunkiten:~$ host 67.228.177.87
87.177.228.67.in-addr.arpa domain name pointer midphase.com.
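The reverse lookup name queried above can be derived with Python's ipaddress module, and the "does the PTR record reflect the bot's operator" test is a simple suffix comparison. This is only a sketch: the helper names are mine, and crawler1.webdatacentre.com is a made-up hostname showing what a meaningful record could look like.

```python
import ipaddress

def reverse_name(ip):
    """Build the in-addr.arpa name that a PTR query for this address uses."""
    return ipaddress.ip_address(ip).reverse_pointer

def ptr_belongs_to(ptr_name, operator_domain):
    """True if the PTR hostname is the operator's domain or a host below it."""
    host = ptr_name.rstrip(".").lower()
    domain = operator_domain.lower()
    return host == domain or host.endswith("." + domain)

print(reverse_name("67.228.177.87"))
print(ptr_belongs_to("midphase.com.", "webdatacentre.com"))
# Hypothetical hostname, for comparison only:
print(ptr_belongs_to("crawler1.webdatacentre.com.", "webdatacentre.com"))
```

A PTR record alone can be set to anything by whoever controls the address block, so a serious check would also resolve the returned hostname forward again and compare it with the original address.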

Midphase.com is the generic PTR record of a SoftLayer reseller:

%rwhois V-1.5:003fff:00 rwhois.softlayer.com (by Network Solutions, Inc. V-1.5.9.5)
network:Class-Name:network
network:ID:NETBLK-SOFTLAYER.67.228.160.0/19
network:Auth-Area:67.228.160.0/19
network:Network-Name:SOFTLAYER-67.228.160.0
network:IP-Network:67.228.177.0/24
network:IP-Network-Block:67.228.177.0-67.228.177.255
network:Organization;I:Hosting Services Inc.
network:Street-Address:223 West Jackson Blvd STE# 1014
network:City:Chicago
network:State:IL
network:Postal-Code:60606
network:Country-Code:US
network:Tech-Contact;I:sysadmins @ softlayer.com
network:Abuse-Contact;I:abuse @ midphase.com
network:Admin-Contact;I:IPADM258-ARIN
network:Created:20080128
network:Updated:20080324
network:Updated-By:ipadmin @ softlayer.com

An aggregated range of consecutive ip addresses registered to the bot-building outfit would seem more practical, especially for directing complaints to the appropriate persons. However, there is no information about the number of ip addresses in use by this anonymous entity, which effectively helps Midphase’s publicity-shy customers remain anonymous. Putting it all together, it seems more likely that they are content/email/webform-seeking spammers building a list for themselves or to sell to other spammers than an actual search engine. Even if I am entirely mistaken, I am still not particularly keen on bots that ignore established standards like robots.txt. Absent any communication channels, one has to conclude that one may not be able to opt out of their crawling by ordinary means.

Therefore, firewalling this particular range seems an appropriate solution to me:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -p tcp -m tcp ! --dport 25 --syn -j REJECT

This rule rejects all incoming TCP connection attempts except those to the SMTP port, as there may be legitimate sites we would like to receive mail from or send mail to. We have to specify that only incoming SYN packets be rejected, because otherwise the replies to our own outgoing connections would be rejected as well, and mail to this address range would remain stuck in our queue and never be delivered. If this potential need for communication is of no concern, one can still apply the BOfH method and drop the range altogether:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -j DROP

Apache servers may also be happy about another SetEnvIf rule, preferably in httpd.conf/apache2.conf, or in .htaccess if the former is not an option due to a shared hosting account:

SetEnvIfNoCase User-Agent "WebDataCentre(Bot|\.com)" block

Deny from env=block
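Before deploying such a rule it is worth checking what the pattern actually matches. Apache uses PCRE, not Python's regex engine, but for a pattern this simple the two should behave alike, so the following sketch (with a helper name of my own) can serve as a quick sanity test:

```python
import re

# Same pattern as in the SetEnvIfNoCase rule; IGNORECASE mirrors "NoCase".
PATTERN = re.compile(r"WebDataCentre(Bot|\.com)", re.IGNORECASE)

def is_blocked(user_agent):
    """True if the user agent would set the 'block' environment variable."""
    return PATTERN.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)"))
print(is_blocked("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"))
```

Keep in mind that this only catches the bot for as long as it announces itself honestly.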

Update July 1st, 2008:

The bot has been spotted with another ip address, 66.150.224.245, this time without any rDNS record at all:

olliver@bunkiten:~$ host 66.150.224.245
Host 245.224.150.66.in-addr.arpa. not found: 3(NXDOMAIN)

A familiar setup: within a /24 of a presumable Internap reseller, and still without any details concerning the company/project.

CustName:   Networld Internet Services
Address:    P.O box 551
City:       Skippack
StateProv:  PA
PostalCode: 19474
Country:    US
RegDate:    2007-01-16
Updated:    2007-01-16

NetRange:   66.150.224.0 - 66.150.224.255
CIDR:       66.150.224.0/24
NetName:    INAP-PHI-NETWORLDINT-12098
NetHandle:  NET-66-150-224-0-1
Parent:     NET-66-150-0-0-1
NetType:    Reassigned
Comment:
RegDate:    2007-01-16
Updated:    2007-01-16

RTechHandle: INO3-ARIN
RTechName:   InterNap Network Operations Center
RTechPhone:  +1-877-843-4662
RTechEmail:  noc @ internap.com 

OrgAbuseHandle: IAC3-ARIN
OrgAbuseName:   Internap Abuse Contact
OrgAbusePhone:  +1-206-256-9500
OrgAbuseEmail:  abuse @ internap.com

OrgTechHandle: INO3-ARIN
OrgTechName:   InterNap Network Operations Center
OrgTechPhone:  +1-877-843-4662
OrgTechEmail:  noc @ internap.com

In case you want to add another iptables rule based on the sample further above, simply replace 67.228.177.0/24 with 66.150.224.0/24 and you should be set.

Update July 4th, 2008

Another sighting, this time crawling from Sweden using 77.110.52.67 as ip address:

olliver@bunkiten:~$ host 77.110.52.67
67.52.110.77.in-addr.arpa is an alias for 77-110-52-67.univation.riksnet.nu.
77-110-52-67.univation.riksnet.nu domain name pointer ip67.univation.riksnet.nu.

So the pattern of using generic rDNS records obviously persists, as does their ignorance concerning robots.txt.

Whois:

inetnum:        77.110.52.64 - 77.110.52.79
netname:        SE-RIKSNET-UNIVATION2
descr:	        Stockholm Univation AB site2
country:        SE
admin-c:        BEER3-RIPE
tech-c:         BEER3-RIPE
status:         ASSIGNED PA
mnt-by:         MNT-RIKSNET
mnt-lower:      MNT-RIKSNET
mnt-routes:     MNT-RIKSNET
source:         RIPE # Filtered

person:         Bengt Erik Sandstrom
address:        Graddvagen 7
address:        S-906 20 Umea
address:        Sweden
phone:          +46 768 272022
nic-hdl:        BEER3-RIPE
source:         RIPE # Filtered

That range would translate to 77.110.52.64/28, a rather small block this time, and this is also the value you would like to use for blocking them via iptables or other means.
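Translating a whois range into CIDR notation does not have to be done by hand. A small sketch using Python's standard ipaddress module (the function name is my own):

```python
import ipaddress

def range_to_cidr(first, last):
    """Convert an inclusive address range into the CIDR blocks covering it."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]

print(range_to_cidr("77.110.52.64", "77.110.52.79"))

# Confirm that the crawling address really falls inside the block.
print(ipaddress.ip_address("77.110.52.67") in ipaddress.ip_network("77.110.52.64/28"))
```

A range that is not aligned to a power-of-two boundary yields several blocks, each of which would need its own firewall rule.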

Comments (8)

March 10th, 2008

BioSearch bot: pointless POST requests

Filed under: Web — olliver @ 23:52 h

I really do not know what this bot is trying to accomplish, but it looks rather pointless:

66.167.105.59 - - [10/Mar/2008:04:47:50 +0100]
"POST / HTTP/1.1" 403 210 "-" "BioSearch"
66.167.105.59 - - [10/Mar/2008:04:47:51 +0100]
"POST /robots.txt HTTP/1.1" 403 220 "-" "BioSearch"

You would only use POST for submitting form data, not for retrieving data. Apart from that, the request order is wrong: A bot should first ask for robots.txt and then, depending on the outcome, either go away or start indexing.

Unfortunately, the netblock where this brainless wonder resides does not reveal any details about the bot’s owner:

[rwhois.covad.net]
%rwhois V-1.5:003fff:00 rwhois.covad.com (by Network Solutions, Inc. V-you-guess)
network:Class-Name:network
network:Auth-Area:66.167.0.0/16
network:ID:NETBLK-NONE-66-167-0-0.66.167.0.0/16
network:Network-Name:NONE-66-167-0-0
network:IP-Network:66.167.0.0/16
network:In-Addr-Server;I:ns3.covad.com
network:In-Addr-Server;I:ns4.covad.com
network:IP-Network-Block:66.167.0.0 - 66.167.255.255
network:Org-Name:Covad Communications
network:Street-Address:110 Rio Robles
network:City:San Jose
network:State:CA
network:Postal-Code:95134
network:Country-Code:US
network:Tech-Contact;I:ipadmin covad.com
network:Admin-Contact;I:ipadmin covad.com
network:Created:20030508150409000
network:Updated:20041506165200000

The rDNS of the offending ip address is not more talkative either:
h-66-167-105-99.lsanca54.dynamic.covad.net
This reads to me like Los Angeles/California.

Although there is a company in California called Biosearch Technologies, they are unrelated to the bot, as they are only offering products derived from their biological research and are based in the San Francisco area. So it seems to be just a broken anonymous bot which is safe to block.

Comments (0)

January 22nd, 2008

IE 6.0 omits trailing slash for webroot requests

Filed under: Web — olliver @ 10:53 h

Just when you think it could not happen, it does anyway…
I have just discovered that Internet Explorer 6.0 has the habit of omitting the trailing slash of a domain name whenever it is not explicitly appended in a request. This only happens for requests of the webroot (like www.example.com), because in all other cases Apache will automatically issue a 301 redirect to the version of the url with a trailing slash. This is irritating to me because all other browsers will automatically add the trailing slash if it is missing.

Here are some log entries to illustrate what it looks like when you omit the slash of the domain name:

192.168.0.16 - - [22/Jan/2008:09:42:11 +0100] "GET / HTTP/1.1"
200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.16 - - [22/Jan/2008:09:42:13 +0100]
"GET /wp-content/themes/nodepet/style.css HTTP/1.1"
304 - "http://www.nodepet.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Note the referrer string in the last line, "http://www.nodepet.com", which lacks the trailing slash. By sending this broken referrer string, IE is likely to break scripts which rely on the usual behaviour (e.g. referrer checks against hotlinking, or script automation) and cause them to deny access to legit visitors. Curiously, if you do add the slash to your request from the start, IE 6.0 will behave like any other browser:

192.168.0.1 - - [22/Jan/2008:09:48:16 +0100] "GET / HTTP/1.1"
200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.1 - - [22/Jan/2008:09:48:17 +0100]
"GET /wp-content/themes/nodepet/style.css HTTP/1.1"
304 - "http://www.nodepet.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Up to now I thought that the missing slash was a good indicator of a bot-generated fake referrer and would make a good SetEnvIf rule to deny access on. Now, however, it seems that this approach generates false positives. On the other hand, by loosening the check I open up the floodgates for spambots, which is not really something I am keen on.
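One conceivable compromise, sketched below, is to tolerate the slashless root referrer only when the client claims to be IE 6.0. This is an illustration of the trade-off and not a rule I have tested in production; the function name is mine, and since user agents are trivially faked, a spambot posing as IE 6.0 would still slip through.

```python
from urllib.parse import urlsplit

def referrer_acceptable(referrer, user_agent, own_host="www.nodepet.com"):
    """Accept referrers from our own host; a bare host without the
    trailing slash is tolerated only for clients claiming to be IE 6.0."""
    parts = urlsplit(referrer)
    if parts.hostname != own_host:
        return False
    if parts.path == "":
        # "http://www.nodepet.com" without slash: the IE 6.0 quirk
        return "MSIE 6.0" in user_agent
    return True

ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
print(referrer_acceptable("http://www.nodepet.com", ie6))
print(referrer_acceptable("http://www.nodepet.com", "SomeBot/1.0"))
print(referrer_acceptable("http://www.nodepet.com/", "SomeBot/1.0"))
```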

Comments (0)

January 11th, 2008

IE6.0 bug: horizontal scrollbars using italic fonts

Filed under: Web — olliver @ 16:22 h

I came across this weird bug whilst tweaking my site and could not find any website mentioning it. Or maybe I am too silly to use the right wording for my search queries as I cannot imagine that I am the first person to have noticed it.

First off, my layout here complies with XHTML 1.0 Strict. It would even validate as XHTML 1.1 after a minor configuration change on my web server (so that it uses a different MIME type for HTML documents). The CSS file I use also validates without the slightest warning. Now to the description:
Whenever a block element contains an embedded <i> or <em> tag (to mark a quote, for example), or is itself styled as italic, and its text stretches over multiple lines, IE 6.0 on Windows will freak out and add horizontal scrollbars without a visible cause. These bars disappear as soon as I remove italic from the offending style or the embedded <i> or <em> tag from the affected block element.

Perhaps someone who happens to read this post knows how to work around this problem or has a link pointing to a working hack? Any hint would be greatly appreciated, as I really would like to learn what exactly is causing this issue.

Comments (4)

January 9th, 2008

Radian6 - an abusive crawler from Canada

Filed under: Web — olliver @ 14:42 h

The above headline might be confusing, because at first glance Radian6 obeys robots.txt. However, once it is blocked by a server, it switches its user agent to some generic Java client and continues indexing. Anyway, before I go on, let me first introduce you to Radian6, its origins and purposes:

Radian6 is a crawler from Canada and its purpose becomes clear just by looking at the first lines on its home page:

Millions of blog posts. Viral videos. Reviews in forums. Sharing of photos. Status updates via microblogging. All conversations, all happening online right now and affecting brands, reputations, sales, you name it.

This is classic marketing speak at its best, and right, the company behind Radian6 provides marketing and promotion professionals with the latest “trends” in the blogosphere, so their customers can pick them up and modify their advertising campaigns accordingly, as you can read on this page. To me this boils down to bullying sites with a PR machinery, pushing them aside into irrelevance and snatching trends and slogans for corporate advertising or promo campaigns. All fine and dandy, but who am I to support the advertising industry by providing keywords they can eventually use against me? The bot is of no benefit to me, as I do not get to see its index unless I pay for it. Add the fact that Radian6 shows broken crawling behaviour, pulling the same content multiple times a day and omitting trailing slashes in urls (resulting in even more wasted traffic, as each request has to be redirected to the actual location), and the only conclusion for me is to deny this bot. And on my server that means 403 Forbidden:

142.166.3.122 - - [09/Jan/2008:03:20:01 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:06:39:35 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.123 - - [09/Jan/2008:09:46:53 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:12:50:18 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"

Now one would think a legitimate bot would eventually give up and move on to more commerce-friendly hosts. But that turned out to be wishful thinking, as the bot seems to have inherited a mental deficiency that prevents it from accepting that someone does not want to see it:

142.166.3.122 - - [09/Jan/2008:03:19:26 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:27 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:28 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:33 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:34 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:50 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:51 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:53 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:16 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:17 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"

Why is it that some outfits in the promotion and marketing industry (and their most radical variant called “spammers”) have such a disregard for individuals and believe everyone gladly accepts their “message blast” if it is only repeated often enough? Why do they cherish the delusion of being special, exempt from common ethical standards and more gifted and intelligent than their targeted “consumer base”? I assume they do not take such questions into consideration or accept opinions contrary to theirs, therefore I decided to add the netblock 142.166.0.0/16 to my growing list of firewalled ip ranges to prevent any more “stealth visits”.
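The user agent switch visible in the log excerpts above can also be found automatically by grouping user agents per client address. The following is a rough sketch, assuming each log entry sits on a single line in combined log format; the function name and the three sample lines (taken from the excerpts above and joined into one line each) are for illustration only:

```python
import re
from collections import defaultdict

# First field is the client address, last quoted field is the user agent.
ENTRY = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"$')

def agents_per_ip(lines):
    """Return {ip: [agents]} for clients seen with more than one user agent."""
    agents = defaultdict(set)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            agents[match["ip"]].add(match["agent"])
    return {ip: sorted(seen) for ip, seen in agents.items() if len(seen) > 1}

sample = [
    '142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"',
    '142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
    '142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
]
print(agents_per_ip(sample))
```

Several user agents behind one address are not suspicious in themselves (think of proxies or NAT), so the result is a list of candidates to look at and not a verdict.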

In case you wish to opt out from their visits too, simply add the following line to your .htaccess or httpd.conf file:

Deny from 142.166.0.0/16

Or, in case you are fortunate enough to run a dedicated server and do not expect any welcome visitors from that ip allocation, you may prefer to firewall their range right away:

iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -m tcp ! --dport 25 -j DROP

This rule leaves port 25 (SMTP) open as a communication channel. Since this is a rather large chunk of addresses, it may well contain responsible companies and individuals, and for those I still want to be reachable via email. Of course, if you do not have any such concerns you can also apply the BOFH method and silence this range entirely:

iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -j DROP

Comments (1)

January 7th, 2008

Yahoo loves kurzfilm and can’t let it go…

Filed under: Web — olliver @ 22:01 h

This romance started back in December:
One fine day, shortly before Christmas, I noticed some weird requests by Yahoo’s crawler:

74.6.28.164 - - [23/Dec/2007:08:25:16 +0100] "GET /kurzfilm/ HTTP/1.0" 301 307 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [23/Dec/2007:08:25:17 +0100] "GET /kurzfilm/ HTTP/1.0" 404 275 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Of course there was no directory called “kurzfilm” on my server, but it seemed some stale link was pointing to it and Yahoo checked to see whether there was something new to discover. If you look closely you will spot a 301 redirect before the actual 404 response. That is because I use mod_rewrite to redirect any request that does not use “www.mydomain.com” as host to that location first, in order to ensure that only this version of my domains will appear in search results. After some research I was even able to locate the origin: Some Dutch website used to link to it using the server’s ip address. Unfortunately this was a fatal mistake, as Yahoo is now querying this non-existent “kurzfilm” directory over and over again.

Google behaves differently in that regard: Once it cannot find anything there, it soon discards the url and moves on. It also obeys 301 (moved permanently) redirects and discards the previous destination after a while. But Yahoo?

Yahoo loves “kurzfilm” in the morning:

74.6.28.164 - - [06/Jan/2008:09:08:59 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:09:09:01 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

And “kurzfilm” in the evening:

74.6.28.164 - - [06/Jan/2008:20:02:50 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:20:02:52 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Did you notice the 301 redirect? It means Yahoo is still using the server’s ip address for its request, despite the 301 redirect, which should normally signal that future requests be permanently directed to the new destination instead. But then it would not be Yahoo, and so I expect to find “kurzfilm” in my logfiles around this time next year, too. Maybe I should create this “kurzfilm” directory already, just for Yahoo: The “kurzfilm” search engine # 1 :-).

Comments (0)

January 5th, 2008

Cyveillance fake IE 6 bot - fit for the filters

Filed under: Web — olliver @ 15:00 h

They are coming in legions these days… If Attributor’s performance was already underwhelming, it was nothing compared to the behaviour of Cyveillance’s bot sitting on 38.100.41.112 that hit my netlabel site some hours ago. Here is a summary of what I found unacceptable:

1. Posing as IE 6.0 on Windows XP, although as a badly faked version:

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

The original string for the OS version, however, should be Windows NT 5.1 or Windows NT 5.1; SV1 (the latter with Service Pack 2 installed).

2. No information about the bot owner, let alone a meaningful rDNS domain name pointer

3. Disregard for robots.txt and fetching content that was explicitly disallowed for robots. I do not have any tolerance for such a grave violation of my privacy; I get to decide for myself what content is available for indexing (or research).

4. Insane indexing speed with intervals < 2 seconds a page. Not only that, the bot is also broken and asks for non-existent content, thus increasing the load and the clutter in one’s logfile.
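The badly faked user agent from point 1 lends itself to a simple filter. The sketch below relies only on the observation made above, namely that a genuine IE 6.0 on Windows XP reports “Windows NT 5.1” and not the literal “Windows XP”; the function name is my own, and the check is a heuristic that will miss better fakes:

```python
import re

MSIE = re.compile(r"MSIE \d+\.\d+")

def looks_like_fake_msie(user_agent):
    """Flag user agents that claim to be MSIE but name the platform as
    'Windows XP', a token the genuine browser reports as 'Windows NT 5.1'."""
    return bool(MSIE.search(user_agent)) and "Windows XP" in user_agent

print(looks_like_fake_msie("Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"))
print(looks_like_fake_msie("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"))
```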

Querying Cogent’s rwhois server reveals the following:

[rwhois.cogentco.com]
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com
38.100.41.112
network:ID:NET-266429401A
network:Network-Name:NET-266429401A
network:IP-Network:38.100.41.64/26
network:Org-Name:CYVEILLANCE
network:Street-Address:1555 WILSON BLVD Suite 404
network:City:ARLINGTON
network:State:VA
network:Postal-Code:22209
network:Tech-Contact:ZC108-ARIN
network:Updated:2007-10-05 20:45:56

Other allocations of Cyveillance I could find by querying ARIN’s whois database are:

38.100.21.0/24
38.105.83.0/27
65.213.208.128/27
65.222.176.96/27
65.222.185.72/29

Now the moment is right to investigate the question who Cyveillance are and what the company’s purpose is:
According to a diagram on their corporate website, their objective is to monitor the web for identity theft, phishing attacks, credit card fraud, unauthorised (ab-)use of intellectual property and information leaks. Typically they are hired by companies to accomplish this task, thus their focus lies on investigations which they are paid for (which does not mean that the public would not benefit from their endeavours, especially if they were hired by large ISPs). However, their aggressive spidering behaviour earned them some critical remarks in a Wikipedia article dedicated to this company:

Numerous websites have complained about Cyveillance’s traffic for the following reasons:
1. Their robots access many pages in a short period of time and use a comparatively large amount of bandwidth.
2. They completely ignore the robots.txt exclusion protocol, which specifies pages that should not be accessed by robots
3. They use a falsified user-agent string, usually pretending to be some version of Microsoft Internet Explorer on some version of Windows, which is deceptive and can throw off log analysis.

Source: Cyveillance article at Wikipedia

I do not know which of the ranges mentioned are involved in spidering, but as I do not expect any communication from their ranges, I see no reason to grant them access to my server any longer. Perhaps once they have learnt how to behave themselves I might change my mind, but I expect that is not going to happen (why would they care about some German’s opinion? And vice versa, why would I care about some US corporation in Virginia, as Germany is still outside US jurisdiction?).

Comments (0)

January 4th, 2008

Attributor - unsolicited copyright police pulling my feed

Filed under: Web — olliver @ 11:58 h

They have been on my radar for a while, and today I finally had enough of Attributor’s stealth bot activities and decided to opt out of their visits by adding their 64.41.145.0/24 address range to my firewall. In their own words, Attributor describe their service as pulling content from billions of websites they monitor in order to toss it into a database and check on behalf of their customers whether there are any signs of content duplication. If there is a match, they send out notifications on the copyright owner’s behalf and ask for a link back to the original site (mind you, this is restricted to customers who pay for their content police services). In case those requests remain unanswered, they finally pull the DMCA takedown notice card, which sounds scary at first but should have little bearing on anyone outside of the USA.

In principle I welcome their endeavours and even wish they would nail down the hordes of scrapers who steal content as botbait for their spam pages. But, and here comes the part of their activities I object to, what is the point in monitoring resources outside of their jurisdiction? My hosting company, my registrar and I myself are located in Germany, and anyone with half a brain and capable of searching the web for “whois” can find out this sensational news within minutes. Furthermore, why would I care about a bot dressed up as IE6.0 on WinXP and doing nothing but stealing my bandwidth without any benefits in return? Their bot does not really want to know anything about robots.txt (this would defeat its purpose as a “sneaky” monitoring tool) and its crawling behaviour leaves a thing or two to be desired, too.

I do not want to go into too much detail, like wondering why the company is hiding behind some P.O. box in California, uses a private (!) registration shield despite being a corporation (exhibit A) and owns a netblock that is just about three weeks old (exhibit B). Perhaps they have legitimate reasons to conceal their identities, like protecting themselves from angry scrapers. Perhaps they used to reside somewhere else, or are forced to move around pretty often to avoid being blocked by concerned webmasters. Whatever it is, I do not know and I do not really want to find out, as the only thing that matters to me is whether a bot behaves nicely, is to my or at least the public’s benefit, and clearly announces itself as a bot, including a responsible owner (working website or email address). Everything else might turn out to be a sneaky spammer or scraper (more or less interchangeable terms), as “references” or “success stories” can easily be fabricated.

Their only address range, according to ARIN, is:

Attributor Corporation SAVV-S600611-2 (NET-64-41-145-0-1)
                                  64.41.145.0 - 64.41.145.255

For denying access I recommend either the gentleman (or woman) method using httpd.conf or .htaccess:

Deny from 64.41.145.0/24

or the BOfH method using iptables:

~# iptables -A INPUT -s 64.41.145.0/24 -i eth0 -p tcp -j DROP

That should take care of the problem for a while.
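
For those who would rather double-check the CIDR notation before feeding it to Apache or iptables, here is a minimal sketch in Python. It is not part of my server setup, merely an illustration: it collapses the first/last pair from the ARIN listing above into CIDR form and tests whether a given client address falls inside it. The names BLOCKED_NETWORKS and is_blocked as well as the two sample addresses are made up for this example.

```python
import ipaddress

# First and last address of the range as listed by ARIN above.
FIRST = ipaddress.ip_address("64.41.145.0")
LAST = ipaddress.ip_address("64.41.145.255")

# Collapse the first/last pair into the smallest list of CIDR networks.
BLOCKED_NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip lies inside any of the blocked networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    # Hypothetical sample addresses, one inside and one outside the range.
    for sample in ("64.41.145.17", "64.41.146.1"):
        print(sample, is_blocked(sample))
    print([str(network) for network in BLOCKED_NETWORKS])
```

The same approach should work for any other range that a whois lookup reports as a first/last pair rather than in CIDR notation.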

December 30th, 2007

Broken fake MJ12bot hitting my server

Filed under: Web — olliver @ 12:20 h

This morning I was wondering about several hundred KB of peculiar log entries left by an MJ12bot variant that did not exactly seem to follow the nice behaviour of the original. Here is a sample entry as illustration:

82.245.176.52 - - [30/Dec/2007:06:50:57 +0100] "GET /dw041-formication-agnosia/
_title=%22%22%3E%20%3Cabbr%20title=%22%22%3E%20%3Cacronym%20title=%22%22%3E%20%3Cb%3E%20%3C
blockquote%20cite=%22%22%3E%20%3Ccode%3E%20%3Cem%3E%20%3Ci%3E%20%3Cstrike%3E%20%3C
strong%3E%20%3C/small/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/
/page/2/ HTTP/1.1" 404 39670 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"

Whatever this broken botware was trying to accomplish, it did not work out, other than leaving some decent clutter in my server logs. A bit of research revealed that the original makers of this distributed search engine have been aware of these fake bots for a while:

20 Oct 2007 - in the last few days it has been brought to our attention that a number of fake MJ12bots appeared on the Net. These bots are not ours but they use fake MJ12bot user-agent - this is something we can’t do anything about just like with email spammers who fake email addresses so we all get spammed supposedly from our own emails or someone elses.

(source: http://www.majestic12.co.uk/projects/dsearch/mj12bot.php)

What impact this crawl will have remains to be seen. If I am lucky, this was merely some distributed search for email addresses; at worst it was a search for “fresh” content to be displayed on doorway or made-for-AdSense pages, or someone compiling a personal list of “spammable targets” to sell to fellow bottom-feeding spammers. In any case, the best defence is of course not to let this bot access one’s server in the first place, and this is what I did to prevent this surprise from happening again:

Prerequisites: You need an Apache server.

1. Detection rule for User Agent
For less complicated stuff that does not require checking multiple conditions at once, I prefer using mod_setenvif. As we know from the MJ12bot page, the recent versions of this bot are in the 1.2.x range. Thus we can conclude that anything older than this can safely be discarded as a fake without any grave side effects. You can place the following line in either your .htaccess or httpd.conf file:

# deny fake bot
SetEnvIfNoCase User-Agent "^MJ12bot/v?1\.[01]\.[0-9]{1,2}" block

What does this entry do? It checks for any User-Agent header that matches a pre-1.2.x version, regardless of whether the version string carries the preceding “v”. If there is a match, an environment variable called “block” is set. This “block” variable can then be used for further actions; in our case that means denying access, of course.
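
To see which User-Agent strings the rule would catch before putting it on a live server, the pattern can be tried out offline. The following is a minimal sketch in Python, assuming that Python's re module treats this particular pattern the same way Apache's regex engine does (it uses nothing exotic). The function name is_fake_mj12bot and all sample strings except the first one, which is taken from the log entry above, are made up for illustration.

```python
import re

# Same pattern as in the SetEnvIfNoCase line above;
# re.IGNORECASE stands in for the "NoCase" part of the directive.
FAKE_MJ12BOT = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)


def is_fake_mj12bot(user_agent: str) -> bool:
    """Return True if the User-Agent would set the "block" variable."""
    # search() is used because SetEnvIf matches anywhere in the header
    # unless the pattern itself is anchored, which it is here via "^".
    return FAKE_MJ12BOT.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)",  # from the log
        "MJ12bot/1.1.12",                           # hypothetical, no "v"
        "mj12bot/v1.0.8",                           # hypothetical, lower case
        "MJ12bot/v1.2.1",                           # hypothetical 1.2.x version
        "Mozilla/5.0 (compatible; MJ12bot/v1.0.8)", # hypothetical, other prefix
    ]
    for user_agent in samples:
        print(is_fake_mj12bot(user_agent), user_agent)
```

Note that because of the leading “^” the rule only fires when the header starts with the bot name; a User-Agent that merely contains “MJ12bot/v1.0.8” somewhere in the middle would pass.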

2. Deny the bot from crawling our site
Now that we have the variable we need to look for, we can block any User-Agent that matches our regular expression from above:

Deny from env=block

In case you want to use it in .htaccess, this line merely needs to be placed after the SetEnvIf rule. Those who want to include it in httpd.conf have to take care to place it within their VirtualHost section, preferably within a Directory or Location section. An example as illustration:

<VirtualHost 192.168.0.1>
[...]
  # Directory permissions
  <Directory "/home/web/example.com/html">
    Options Indexes FollowSymlinks MultiViews
    AllowOverride All
    Order deny,allow
    # apply SetEnvIfRule here
    Deny from env=block
[...]

That should take care of the problem for a while. Do not forget to reload (Linux) or restart (FreeBSD) Apache after applying changes to httpd.conf, so that they actually take effect. Those who put it into .htaccess are already done, as this file is read each time Apache accesses a directory.