dotbot - yet another useless robot…
Allow me to start with a question: What is the purpose of a legitimate robot? One would think it is fetching content at a reasonable pace whilst respecting the host’s restrictions in robots.txt. When a bot bothers to fetch robots.txt prior to its crawling, does that signify it will also process its rules? Not necessarily it seems. When Dotbot visited me two days ago, it did not seem to be interested in my content, but in collecting redirect messages without following them:
208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:08 +0200] "GET /category/photo HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:13 +0200] "GET /category/spam HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:18 +0200] "GET /category/web HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
This is just a small but representative sample: For reasons unknown to me the Dotbot omits the terminal slash of the URI which results in a 301 redirect (because there is no file of that name). Now if only the spider followed it, so that it could fetch something meaningful. To cut a long story short, except for robots.txt, there was not a single article this bot took home, because the robot obviously does not know how to handle redirects. Quite a silly waste of resources in my opinion, but then again, what do I know about the bot’s purpose?
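For comparison, here is a minimal sketch of what following such a redirect could look like on the bot's side. It is not DotBot's code, which I have never seen. The `fetch` callable, the `fake_fetch` stand-in and the example.com host are assumptions made up for illustration, and the `Location` value is a guess, because the logs above do not record it.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# (status code, Location header or None, body)
Response = Tuple[int, Optional[str], str]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(url: str,
                     fetch: Callable[[str], Response],
                     max_redirects: int = 5) -> Tuple[str, int, str]:
    """Request url and keep following redirects until a final answer arrives."""
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, body
    raise RuntimeError("too many redirects")


def fake_fetch(url: str) -> Response:
    """Hypothetical stand-in for a real HTTP request (no network involved)."""
    if url.endswith("/category/life"):
        return 301, "/category/life/", ""
    return 200, None, "<html>category page</html>"


if __name__ == "__main__":
    print(follow_redirects("http://www.example.com/category/life", fake_fetch))
```

A real crawler would plug its HTTP client in where `fake_fetch` sits; the cap on redirects is there so that a redirect loop cannot keep the bot busy forever.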
On the DotNetDotCom website, the crawler’s presumable home, we can find the following statement:
Hi! Thanks for letting us crawl you!
We are just a few Seattle based guys trying to figure out how to make internet data as open as possible. You should be able to find everything you are looking for below. If not feel free to contact us. Happy Surfing!
The “we are just …” statement does not raise much confidence in me. This impression is amplified by the next paragraph, which contains an instruction about how to get rid of the bot:
1. First and foremost, curse our name. Trust us, it will feel good. Now breath gently…
2. Create a simple text file named robots.txt and place it in your server’s root directory. (http://www.yoursite.com/ «– Right There!)
3. Add the following code to your robots.txt file:
User-agent: dotbot
Disallow: /
4. Reflect on how easy that was.
To me this does not sound like a responsible operation, because it suggests that rather than fixing their bot, they urge “flamers” to opt-out from their crawling. Regulars will know I am one of these flamers ;-) and of course this is not the only reason for my scepticism:
208.115.111.245 - - [28/Sep/2008:11:13:52 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:11:19:32 +0200] "GET /impressum HTTP/1.1" 301 241 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
Impressum is explicitly excluded from crawling in robots.txt because it contains sensitive information about me that I am required to publish under German law. Yet, despite reading robots.txt, DotBot chose to jump right onto it. Fortunately, it again failed to add a trailing slash to its request and to handle the resulting 301 redirect properly. Ignoring robots.txt is usually a KO criterion for a bot, and since experience has proven time and again that bad bots have a tendency to morph, I prefer to firewall them right away.
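To show how little effort it takes to honour such an exclusion, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content below is hypothetical (my real file is not reproduced in this post), and the example.com host and the `may_fetch` helper are likewise made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; only the /impressum exclusion
# is taken from the post, everything else is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /impressum
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/impressum", "/impressum/", "/category/web/"):
        print(path, may_fetch("DotBot/1.1", "http://www.example.com" + path))
```

Note that the rule is a plain path prefix, so it covers the request with and without the trailing slash.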
Whois opines the following about their address space:
OrgName: dotnetdotcom.org
OrgID: DOTNE
Address: 93 S. Jackson Street #10070
City: Seattle
StateProv: WA
PostalCode: 98104-2818
Country: US
NetRange: 208.115.111.240 - 208.115.111.255
CIDR: 208.115.111.240/28
OriginAS: AS23033
NetName: 208-115-111-240-SLASH28
NetHandle: NET-208-115-111-240-1
Parent: NET-208-115-96-0-1
NetType: Reassigned
Comment:
RegDate: 2008-07-21
Updated: 2008-07-21
I am not suggesting the DotNetDotCom owners are blackhats. But I have better things to do in my life than to debug other people’s bot operations. If DotBot fails even at elementary things like following robots.txt and redirects, then I do not see why I should allow it to visit my sites. Blocking 208.115.111.240/28 should take care of the problem.
Who is behind WebDataCentreBot?
It pays to preemptively block ranges known to be occupied by popular hosting companies, unless you want to have fun with misbehaving or fake bots. My pleasure of meeting the WebDataCentreBot was rather accidental: I had been lazy about blocklisting the SoftLayer ranges in such a way that hosts there can do nothing but send mail to or receive mail from me.
Sitting on 67.228.177.87 and announcing itself as:
Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)
Not only did it jump right in and start indexing without bothering in the slightest about robots.txt, it also happily accepted content that was explicitly excluded in robots.txt. But then again, how should it know without reading the file in the first place? Well, I thought perhaps they would want to learn about the broken behaviour of their bot and fix it, but looking at their site webdatacentre.com, all I can find is:
Web Data Centre
Web Data Centre is an internet research project driven by a small team of researchers from different parts of the world. Its aim is to get a better understanding of the link structure of the web. More info is coming shortly.
(front page as of June 29th 2008)
And that was it. No point of contact whatsoever and looking at the registration data, things turn out to look pretty spammy:
Domain Name: WEBDATACENTRE.COM
Registrant [1435225]:
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Administrative Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Billing Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Technical Contact [1435225]:
Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
Moniker Privacy Services
20 SW 27th Ave.
Suite 201
Pompano Beach
FL
33069
US
Phone: +1.9549848445
Fax: +1.9549699155
Domain servers in listed order:
NS1.DOMAINSERVICE.COM 67.99.176.12
NS2.DOMAINSERVICE.COM 67.97.247.209
NS3.DOMAINSERVICE.COM 64.49.213.231
NS4.DOMAINSERVICE.COM 67.97.247.210
Record created on: 2008-06-27 05:46:23.0
Database last updated on: 2008-06-27 05:46:39.373
Domain Expires on: 2009-06-27 05:46:41.0
Registered a mere two days ago and hiding behind an anonymous privacy shield. Why would a business want to remain anonymous unless it has to conceal something? One also might expect a search engine to reveal its legitimacy by having a meaningful rDNS name that reflects the bot’s name, but nothing much to find here either:
olliver@bunkiten:~$ host 67.228.177.87
87.177.228.67.in-addr.arpa domain name pointer midphase.com.
Midphase.com is the generic PTR record of a Softlayer reseller:
%rwhois V-1.5:003fff:00 rwhois.softlayer.com (by Network Solutions, Inc. V-1.5.9.5)
network:Class-Name:network
network:ID:NETBLK-SOFTLAYER.67.228.160.0/19
network:Auth-Area:67.228.160.0/19
network:Network-Name:SOFTLAYER-67.228.160.0
network:IP-Network:67.228.177.0/24
network:IP-Network-Block:67.228.177.0-67.228.177.255
network:Organization;I:Hosting Services Inc.
network:Street-Address:223 West Jackson Blvd STE# 1014
network:City:Chicago
network:State:IL
network:Postal-Code:60606
network:Country-Code:US
network:Tech-Contact;I:sysadmins @ softlayer.com
network:Abuse-Contact;I:abuse @ midphase.com
network:Admin-Contact;I:IPADM258-ARIN
network:Created:20080128
network:Updated:20080324
network:Updated-By:ipadmin @ softlayer.com
An aggregated range of consecutive IP addresses registered to the bot-building outfit would seem more practical, especially for directing complaints to the appropriate persons. However, there is no information about the number of IP addresses in use by this anonymous entity, which effectively helps Midphase’s publicity-shy customers remain anonymous. Putting it all together, it seems more likely that they are content/email/webform-seeking spammers building a list for themselves or to sell to other spammers than an actual search engine. Even if I am entirely mistaken, I am still not particularly keen on bots that ignore established standards like robots.txt. Absent any communication channels, one has to conclude that it is not possible to opt out of their crawling by ordinary means.
Therefore, firewalling this particular range seems an appropriate solution to me:
iptables -A INPUT -s 67.228.177.0/24 -i eth0 -p tcp -m tcp ! --dport 25 --syn -j REJECT
This rule rejects all incoming TCP connection attempts except those to SMTP, as there may be legitimate sites we would like to receive mail from or send mail to. We have to specify that only incoming SYN packets be rejected, because otherwise the replies to our own outgoing mail connections into this address range would be rejected as well, and that mail would remain stuck in our queue and never get delivered. If this potential need for communication is not a concern, one can still apply the BOfH method and drop the range altogether:
iptables -A INPUT -s 67.228.177.0/24 -i eth0 -j DROP
Apache servers may also be happy about another SetEnvIf rule, preferably in httpd.conf/apache2.conf, or in .htaccess if the former is not an option due to a shared hosting account:
SetEnvIfNoCase User-Agent "WebDataCentre(Bot|\.com)" block
Deny from env=block
Update July 1st, 2008:
The bot has been spotted with another ip address, 66.150.224.245, this time without any rDNS record at all:
olliver@bunkiten:~$ host 66.150.224.245
Host 245.224.150.66.in-addr.arpa. not found: 3(NXDOMAIN)
Familiar setup: within a /24 of a presumable Internap reseller, and still without any details concerning the company/project.
CustName: Networld Internet Services
Address: P.O box 551
City: Skippack
StateProv: PA
PostalCode: 19474
Country: US
RegDate: 2007-01-16
Updated: 2007-01-16
NetRange: 66.150.224.0 - 66.150.224.255
CIDR: 66.150.224.0/24
NetName: INAP-PHI-NETWORLDINT-12098
NetHandle: NET-66-150-224-0-1
Parent: NET-66-150-0-0-1
NetType: Reassigned
Comment:
RegDate: 2007-01-16
Updated: 2007-01-16
RTechHandle: INO3-ARIN
RTechName: InterNap Network Operations Center
RTechPhone: +1-877-843-4662
RTechEmail: noc @ internap.com
OrgAbuseHandle: IAC3-ARIN
OrgAbuseName: Internap Abuse Contact
OrgAbusePhone: +1-206-256-9500
OrgAbuseEmail: abuse @ internap.com
OrgTechHandle: INO3-ARIN
OrgTechName: InterNap Network Operations Center
OrgTechPhone: +1-877-843-4662
OrgTechEmail: noc @ internap.com
In case you want to add another iptables rule based on the sample further above, simply replace 67.228.177.0/24 with 66.150.224.0/24 and you should be set.
Update July 4th, 2008
Another sighting, this time crawling from Sweden using 77.110.52.67 as ip address:
olliver@bunkiten:~$ host 77.110.52.67
67.52.110.77.in-addr.arpa is an alias for 77-110-52-67.univation.riksnet.nu.
77-110-52-67.univation.riksnet.nu domain name pointer ip67.univation.riksnet.nu.
So the pattern of using generic rDNS records obviously persists, as does their ignorance concerning robots.txt.
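The rDNS checks above can be scripted along these lines. This is only a sketch: `ptr_query_name`, `lookup_ptr` and `ptr_mentions_bot` are names I made up, the comparison against the bot's name is a crude heuristic of my own, and the actual lookup depends on whatever resolver the machine uses.

```python
import ipaddress
import socket
from typing import Optional


def ptr_query_name(ip: str) -> str:
    """Build the in-addr.arpa name that `host` queries for an address."""
    return ipaddress.ip_address(ip).reverse_pointer


def lookup_ptr(ip: str) -> Optional[str]:
    """Ask the system resolver for the PTR name; None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def ptr_mentions_bot(ptr_name: Optional[str], bot_token: str) -> bool:
    """Crude heuristic (an assumption): does the PTR name reflect the bot?"""
    return ptr_name is not None and bot_token.lower() in ptr_name.lower()


if __name__ == "__main__":
    print(ptr_query_name("67.228.177.87"))
    print(ptr_mentions_bot("midphase.com", "webdatacentre"))
```

The records may well have changed since this was written, so `lookup_ptr` is not expected to reproduce the answers quoted in this post.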
Whois:
inetnum: 77.110.52.64 - 77.110.52.79
netname: SE-RIKSNET-UNIVATION2
descr: Stockholm Univation AB site2
country: SE
admin-c: BEER3-RIPE
tech-c: BEER3-RIPE
status: ASSIGNED PA
mnt-by: MNT-RIKSNET
mnt-lower: MNT-RIKSNET
mnt-routes: MNT-RIKSNET
source: RIPE # Filtered

person: Bengt Erik Sandstrom
address: Graddvagen 7
address: S-906 20 Umea
address: Sweden
phone: +46 768 272022
nic-hdl: BEER3-RIPE
source: RIPE # Filtered
That range translates to 77.110.52.64/28, a rather small block this time, and this is also the value you will want to use for blocking them via iptables or other means.
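If you do not want to do the range-to-CIDR conversion in your head, Python's standard `ipaddress` module can do it. The helper name `range_to_cidrs` is my own; the ranges are the ones from the whois outputs in this post.

```python
import ipaddress
from typing import List


def range_to_cidrs(first: str, last: str) -> List[str]:
    """Convert an inclusive address range into the CIDR blocks covering it."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]


if __name__ == "__main__":
    print(range_to_cidrs("77.110.52.64", "77.110.52.79"))
    print(range_to_cidrs("208.115.111.240", "208.115.111.255"))
```

A range that does not start on a suitable boundary comes back as several blocks, in which case each of them needs its own firewall rule.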
BioSearch bot: pointless POST requests
I really do not know what this bot is trying to accomplish, but it looks rather pointless:
66.167.105.59 - - [10/Mar/2008:04:47:50 +0100] "POST / HTTP/1.1" 403 210 "-" "BioSearch"
66.167.105.59 - - [10/Mar/2008:04:47:51 +0100] "POST /robots.txt HTTP/1.1" 403 220 "-" "BioSearch"
You would only use POST for submitting form data, but not retrieving data. Apart from that, the request order is wrong: A bot should first ask for robots.txt and then, depending on the outcome, either go away or start indexing.
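Both complaints can be checked mechanically against a log. The following sketch assumes Apache's combined log format, as in the excerpts in this post; the regular expression and the helper names are mine and will not cope with every malformed line a broken bot can produce.

```python
import re
from typing import Dict, List, Optional

# Apache "combined" log format, as used in the excerpts in this post.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Split one log line into named fields; None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def starts_with_robots_txt(entries: List[Dict[str, str]]) -> bool:
    """True if the first request of a visit is a GET for /robots.txt."""
    if not entries:
        return False
    first = entries[0]
    return first["method"] == "GET" and first["path"] == "/robots.txt"


BIOSEARCH_VISIT = [
    '66.167.105.59 - - [10/Mar/2008:04:47:50 +0100] "POST / HTTP/1.1" 403 210 "-" "BioSearch"',
    '66.167.105.59 - - [10/Mar/2008:04:47:51 +0100] "POST /robots.txt HTTP/1.1" 403 220 "-" "BioSearch"',
]

if __name__ == "__main__":
    entries = [e for e in map(parse_line, BIOSEARCH_VISIT) if e]
    print([e["method"] for e in entries], starts_with_robots_txt(entries))
```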
Unfortunately the netblock, where this brainless wonder resides, does not reveal any details about the bot’s owner:
[rwhois.covad.net]
%rwhois V-1.5:003fff:00 rwhois.covad.com (by Network Solutions, Inc. V-you-guess)
network:Class-Name:network
network:Auth-Area:66.167.0.0/16
network:ID:NETBLK-NONE-66-167-0-0.66.167.0.0/16
network:Network-Name:NONE-66-167-0-0
network:IP-Network:66.167.0.0/16
network:In-Addr-Server;I:ns3.covad.com
network:In-Addr-Server;I:ns4.covad.com
network:IP-Network-Block:66.167.0.0 - 66.167.255.255
network:Org-Name:Covad Communications
network:Street-Address:110 Rio Robles
network:City:San Jose
network:State:CA
network:Postal-Code:95134
network:Country-Code:US
network:Tech-Contact;I:ipadmin covad.com
network:Admin-Contact;I:ipadmin covad.com
network:Created:20030508150409000
network:Updated:20041506165200000
The rDNS of the offending ip address is not more talkative either:
h-66-167-105-99.lsanca54.dynamic.covad.net
This reads to me like Los Angeles/California.
Although there is a company in California called Biosearch Technologies, they are unrelated to the bot, as they are only offering products derived from their biological research and are based in the San Francisco area. So it seems to be just a broken anonymous bot which is safe to block.
IE 6.0 omits trailing slash for webroot requests
Just when you think it could not happen, it does anyway…
I have just discovered that Internet Explorer 6.0 has the habit of omitting the trailing slash of a domain name, whenever it is not explicitly appended in a request. This only works for requests of the webroot (like www.example.com), because in all other cases Apache will automatically launch a 301 redirect to the url version with a trailing slash. This is irritating to me because all other browsers will automatically add the trailing slash if it is missing.
Here are some log entries to illustrate what it looks like when you omit the slash of the domain name:
192.168.0.16 - - [22/Jan/2008:09:42:11 +0100] "GET / HTTP/1.1" 200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.16 - - [22/Jan/2008:09:42:13 +0100] "GET /wp-content/themes/nodepet/style.css HTTP/1.1" 304 - "http://www.nodepet.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
Note the referrer string in the second entry: http://www.nodepet.com, without a trailing slash. By sending this broken referrer string, IE is likely to break scripts that rely on the usual behaviour (e.g. referrer checks against hotlinking, or script automation), so that they deny access to legitimate visitors. Curiously, if you do add the slash to your request from the start, IE 6.0 will behave like any other browser:
192.168.0.1 - - [22/Jan/2008:09:48:16 +0100] "GET / HTTP/1.1" 200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.1 - - [22/Jan/2008:09:48:17 +0100] "GET /wp-content/themes/nodepet/style.css HTTP/1.1" 304 - "http://www.nodepet.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
Up to now I thought a missing slash was a good indicator of a bot-generated fake referrer and would make a good SetEnvIf rule to deny access on; however, it now seems that this approach generates false positives. On the other hand, by loosening the check I open up the floodgates for spambots, which is not really something I am keen on.
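For scripts that compare referrers, one way to tolerate IE's behaviour is to normalise the referrer before checking it. The sketch below uses Python's `urllib.parse`; the function names and the `SITE_ROOT` constant are assumptions for illustration, and this does nothing about the spambot dilemma described above, since a bot sending the slashless form would pass as well.

```python
from urllib.parse import urlsplit, urlunsplit

SITE_ROOT = "http://www.nodepet.com/"  # the site from the log excerpts


def normalize_referrer(referrer: str) -> str:
    """Give a bare 'http://host' referrer the '/' path it should have had."""
    parts = urlsplit(referrer)
    if parts.scheme and parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def is_own_referrer(referrer: str, site_root: str = SITE_ROOT) -> bool:
    """True if the referrer points at our own site, with or without slash."""
    return normalize_referrer(referrer).startswith(site_root)


if __name__ == "__main__":
    for ref in ("http://www.nodepet.com", "http://www.nodepet.com/",
                "http://www.nodepet.com.evil.example"):
        print(ref, is_own_referrer(ref))
```

Normalising first also means that a look-alike host such as the last example does not slip through a naive prefix match on the slashless form.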
IE6.0 bug: horizontal scrollbars using italic fonts
I came across this weird bug whilst tweaking my site and could not find any website mentioning it. Or maybe I am too silly to use the right wording for my search queries as I cannot imagine that I am the first person to have noticed it.
First off, my layout here complies with XHTML 1.0 Strict. It would even validate as XHTML 1.1 after a minor configuration change on my web server (so that it uses a different MIME type for HTML documents). The CSS file I use also validates without the slightest warning. Now to the description:
The bug shows up whenever a block element contains text that stretches over multiple lines and is set in italics, either through an embedded <i> or <em> tag (to mark a quote, for example) or through an italic style defined for the block element itself. In that case IE 6.0 on Windows will freak out and add horizontal scrollbars without a visible cause. These bars disappear as soon as I remove italic from the offending style or the embedded <i> or <em> tag from the affected block element.
Perhaps someone who happens to read this post knows how to work around this problem or has a link pointing to a working hack? Any hint would be greatly appreciated, as I really would like to learn what exactly is causing this issue.
Radian6 - an abusive crawler from Canada
The above headline might be confusing, because at first glance Radian6 obeys robots.txt. However, once it is blocked by a server, it switches its user agent to that of a generic Java client and continues indexing. Anyway, before I go on, let me first introduce you to Radian6, its origins and purposes:
Radian6 is a crawler from Canada and its purpose becomes clear just by looking at the first lines on its home page:
Millions of blog posts. Viral videos. Reviews in forums. Sharing of photos. Status updates via microblogging. All conversations, all happening online right now and affecting brands, reputations, sales, you name it.
This is classic marketing speak at its best, and indeed, the company behind Radian6 provides marketing and promotion professionals with the latest “trends” in the blogosphere, so their customers can pick them up and modify their advertising campaigns accordingly, as you can read on this page. To me this boils down to bullying sites with a PR machinery, pushing them aside into irrelevance and snatching trends and slogans for corporate advertising or promo campaigns. All fine and dandy, but why should I support the advertising industry by providing keywords they can eventually use against me? The bot is of no benefit to me, as I do not get to see its index unless I pay for it. Add the fact that Radian6 shows broken crawling behaviour, pulling the same content multiple times a day and omitting trailing slashes in URLs (which wastes even more traffic, as each request has to be redirected to the actual location), and the only conclusion for me is to deny this bot. On my server that means 403 Forbidden:
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:06:39:35 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:12:50:18 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
Now one would think at some time a legitimate bot will eventually give up and move on to more commerce friendly hosts. But that turned out to be wishful thinking, as the bot seems to have inherited a mental deficiency that prevents it from accepting that someone does not want to see it:
142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:27 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:28 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:33 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:34 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:50 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:51 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:53 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:16 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:17 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
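The pattern in the two log excerpts above, one address showing up under more than one user agent, is easy to hunt for in a logfile. A rough sketch, assuming the combined log format with the user agent as the last quoted field; the function names are made up, and the three sample lines are taken from the excerpts above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# First field is the client address, last quoted field the user agent.
LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def agents_by_ip(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect the distinct user agents seen for each client address."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = LINE.match(line)
        if match:
            seen[match.group("ip")].add(match.group("agent"))
    return dict(seen)


def switching_clients(lines: Iterable[str]) -> List[str]:
    """Addresses that appeared with more than one user agent."""
    return sorted(ip for ip, agents in agents_by_ip(lines).items()
                  if len(agents) > 1)


SAMPLE = [
    '142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"',
    '142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
    '142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
]

if __name__ == "__main__":
    print(switching_clients(SAMPLE))
```

Keep in mind that proxies and NAT gateways legitimately show several user agents behind one address, so a hit is a reason to look closer, not proof of bad intent.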
Why is it that some outfits in the promotion and marketing industry (and their most radical variant called “spammers”) have such a disregard for individuals and believe everyone gladly accepts their “message blast” if it is only repeated often enough? Why do they cherish the delusion of being special, exempt from common ethical standards and more gifted and intelligent than their targeted “consumer base”? I assume they neither consider such questions nor accept opinions contrary to theirs, therefore I decided to add the netblock 142.166.0.0/16 to my growing list of firewalled IP ranges to prevent any more “stealth visits”.
In case you wish to opt out from their visits too, simply add the following line to your .htaccess or httpd.conf file:
Deny from 142.166.0.0/16
Or, in case you are fortunate enough to run a dedicated server and do not expect any welcome visitors from that ip allocation, you may prefer to firewall their range right away:
iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -m tcp ! --dport 25 -j DROP
This rule leaves port 25 (SMTP) open as communication channel. Since this is a rather large chunk of addresses, it may well contain responsible companies and individuals and for those I still want to be reachable via email. Of course, if you do not have any such concerns you can also apply the BOFH method and silence this range entirely:
iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -j DROP
Yahoo loves kurzfilm and can’t let it go…
This romance started back in December:
One fine day, shortly before Christmas, I noticed some weird requests by Yahoo’s crawler:
74.6.28.164 - - [23/Dec/2007:08:25:16 +0100] "GET /kurzfilm/ HTTP/1.0" 301 307 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [23/Dec/2007:08:25:17 +0100] "GET /kurzfilm/ HTTP/1.0" 404 275 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Of course there was no directory called “kurzfilm” on my server, but it seemed some stale link was pointing to it and Yahoo checked to see whether there was something new to discover. If you look closely, you will spot a 301 redirect before the actual 404 response. That is because I use mod_rewrite to redirect any request that does not use “www.mydomain.com” as host to that location first, in order to ensure that only this version of my domains will appear in search results. After some research I was even able to locate the origin: some Dutch website used to link to it using the server’s IP address. Unfortunately this was a fatal mistake, as Yahoo is now querying this non-existent “kurzfilm” directory over and over again.
Google behaves differently in that regard: Once it cannot find anything there, it soon discards the URL and moves on. It also obeys 301 (moved permanently) redirects and discards the previous destination after a while. But Yahoo?
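For readers who wonder what this host redirect amounts to, here is the logic as a small Python sketch. It is not my actual mod_rewrite configuration, only the same decision written out; `canonical_redirect` is a made-up name, 192.0.2.10 is a documentation address standing in for the server's IP, and a Host header carrying a port number is not handled.

```python
from typing import Optional, Tuple

CANONICAL_HOST = "www.mydomain.com"  # placeholder host name used in this post


def canonical_redirect(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target URL) when the request used a non-canonical host.

    None means the request already used the canonical host and is served
    (or answered with 404) as usual.
    """
    if host.lower() == CANONICAL_HOST:
        return None
    return 301, "http://" + CANONICAL_HOST + path


if __name__ == "__main__":
    # A request addressed to the bare IP, as the stale link did.
    print(canonical_redirect("192.0.2.10", "/kurzfilm/"))
    # The follow-up request to the canonical host.
    print(canonical_redirect("www.mydomain.com", "/kurzfilm/"))
```

This is why each visit leaves two log lines: the redirect for the request by IP address, then the 404 for the same path on the canonical host.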
Yahoo loves “kurzfilm” in the morning:
74.6.28.164 - - [06/Jan/2008:09:08:59 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:09:09:01 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
And “kurzfilm” in the evening:
74.6.28.164 - - [06/Jan/2008:20:02:50 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:20:02:52 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Did you notice the 301 redirect? That means Yahoo is still using the server’s IP address for its request, despite the 301 redirect, which should normally signal that requests be permanently directed to the new destination instead. But then again, it would not be Yahoo otherwise, and so I shall expect to find “kurzfilm” in my logfiles around this time next year, too. Maybe I should create this “kurzfilm” directory already, just for Yahoo: The “kurzfilm” search engine # 1 :-).
Cyveillance fake IE 6 bot - fit for the filters
They are coming in legions these days… If Attributor’s performance was already underwhelming, it was nothing compared to the behaviour of Cyveillance’s bot sitting on 38.100.41.112 that hit my netlabel site some hours ago. Here is a summary of what I found unacceptable:
1. Posing as IE 6.0 on Windows XP, although as a badly faked version:
Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)
The original string for the OS version, however, should be Windows NT 5.1 or Windows NT 5.1; SV1 (the latter with Service Pack 2 installed).
2. No information about the bot owner, let alone a meaningful rDNS domain name pointer
3. Disregard for robots.txt and fetching content that was explicitly disallowed for robots. I do not have any tolerance for such a grave violation of my privacy; I get to decide for myself what content is available for indexing (or research).
4. Insane indexing speed with intervals < 2 seconds a page. Not only that, the bot is also broken and asks for non-existent content, thus increasing the load and the clutter in one’s logfile.
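The first point of the list lends itself to a simple check. The sketch below flags an MSIE user agent whose platform token reads "Windows XP", which, as noted above, a genuine IE on XP reports as "Windows NT 5.1". The regular expression and the function name are my own, and the heuristic is deliberately narrow: it knows nothing about other faked strings.

```python
import re

# "Mozilla/4.0 (compatible; MSIE <version>; <platform>..." as sent by IE.
MSIE_UA = re.compile(
    r"Mozilla/4\.0 \(compatible; MSIE (?P<version>[\d.]+); (?P<platform>[^;)]+)"
)


def has_fake_xp_token(user_agent: str) -> bool:
    """True if an MSIE user agent names its platform 'Windows XP'."""
    match = MSIE_UA.match(user_agent)
    if not match:
        return False
    return match.group("platform").strip() == "Windows XP"


if __name__ == "__main__":
    for ua in ("Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)",
               "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"):
        print(ua, has_fake_xp_token(ua))
```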
Querying Cogent’s rwhois server reveals the following:
[rwhois.cogentco.com]
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com
38.100.41.112
network:ID:NET-266429401A
network:Network-Name:NET-266429401A
network:IP-Network:38.100.41.64/26
network:Org-Name:CYVEILLANCE
network:Street-Address:1555 WILSON BLVD Suite 404
network:City:ARLINGTON
network:State:VA
network:Postal-Code:22209
network:Tech-Contact:ZC108-ARIN
network:Updated:2007-10-05 20:45:56
Other allocations of Cyveillance I could find by querying ARIN’s whois database are:
38.100.21.0/24
38.105.83.0/27
65.213.208.128/27
65.222.176.96/27
65.222.185.72/29
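With several allocations to keep track of, a quick membership test helps when going through a logfile. A sketch using Python's `ipaddress` module; the list holds the ranges named in this post (the rwhois result plus the ARIN allocations), and `is_cyveillance` is a made-up helper name.

```python
import ipaddress

# Ranges named in this post: the /26 from Cogent's rwhois plus the
# allocations found in ARIN's database.
CYVEILLANCE_RANGES = [
    "38.100.41.64/26",
    "38.100.21.0/24",
    "38.105.83.0/27",
    "65.213.208.128/27",
    "65.222.176.96/27",
    "65.222.185.72/29",
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in CYVEILLANCE_RANGES]


def is_cyveillance(ip: str) -> bool:
    """True if the address falls into one of the listed ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in NETWORKS)


if __name__ == "__main__":
    print(is_cyveillance("38.100.41.112"))
```

The list reflects what the registries said at the time of writing; allocations change, so it would need to be refreshed before being relied upon.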
Now the moment is right to investigate the question who Cyveillance are and what the company’s purpose is:
According to a diagram on their corporate website, their objective is to monitor the web for identity theft, phishing attacks, credit card fraud, unauthorised (ab-)use of intellectual property and information leaks. Typically they are hired by companies to accomplish this task, thus their focus lies on investigations which they are paid for (which does not mean that the public would not benefit from their endeavours, especially if they were hired by large ISPs). However, their aggressive spidering behaviour earned them some critical remarks in a Wikipedia article dedicated to this company:
Numerous websites have complained about Cyveillance’s traffic for the following reasons:
1. Their robots access many pages in a short period of time and use a comparatively large amount of bandwidth.
2. They completely ignore the robots.txt exclusion protocol, which specifies pages that should not be accessed by robots
3. They use a falsified user-agent string, usually pretending to be some version of Microsoft Internet Explorer on some version of Windows, which is deceptive and can throw off log analysis.
Source: Cyveillance article at Wikipedia
I do not know which of the ranges mentioned are involved in spidering, but as I do not expect any communication from their ranges, I see no reason to grant them access to my server any longer. Perhaps I might change my mind once they have learnt how to behave themselves, but I do not expect that to happen. Why would they care about some German’s opinion? And vice versa, why would I care about some US corporation in Virginia, as Germany is still outside of US jurisdiction?
Attributor - unsolicited copyright police pulling my feed
They have been on my radar for a while, and today I finally had enough of Attributor’s stealth bot activities and decided to opt out of their visits by adding their 64.41.145.0/24 address range to my firewall. In their own words, Attributor describe their service as pulling content from billions of websites they monitor in order to toss it into a database and check on behalf of their customers whether they are able to find any signs of content duplication. If there is a match, they send out notifications on the copyright owner’s behalf and ask for a link back to the original site (mind you, this is restricted to customers who pay for their content police services). In case those requests remain unanswered, they finally pull the DMCA takedown notice card, which sounds scary at first but should have little bearing on anyone outside of the USA.
In principle I welcome their endeavours and even wish they would nail down the hordes of scrapers who steal content as botbait for their spam pages. But, and here comes the part of their activities I object to, what is the point in monitoring resources outside of their jurisdiction? My hosting company, my registrar and I myself are located in Germany, and anyone with half a brain and capable of searching the web for “whois” can find out this sensational news within minutes. Furthermore, why would I care about a bot dressed up as IE 6.0 on WinXP and doing nothing but stealing my bandwidth without any benefit in return? Their bot does not really want to know anything about robots.txt (this would defeat its purpose as a “sneaky” monitoring tool) and its crawling behaviour leaves a thing or two to be desired, too.
I do not want to go into too much detail, like wondering why the company is hiding behind some P.O. box in California, uses a private (!) registration shield despite being a corporation (exhibit A) and owns a netblock that is just about three weeks old (exhibit B). Perhaps they have legitimate reasons to conceal their identities, like protecting themselves from angry scrapers. Perhaps they used to reside somewhere else or are forced to move around pretty often to avoid being blocked by concerned webmasters. Whatever it is, I do not know and I do not really want to find out, as the only thing that matters to me is whether a bot behaves nicely, is to my or at least the public’s benefit and clearly announces itself as a bot, including a responsible owner (working website or email address). Everything else might turn out to be a sneaky spammer or scraper (more or less interchangeable terms), as “references” or “success stories” can easily be fabricated.
Their only address range, according to ARIN, is:
Attributor Corporation SAVV-S600611-2 (NET-64-41-145-0-1)
64.41.145.0 - 64.41.145.255
For denying access I recommend either the gentleman (or woman) method using httpd.conf or .htaccess:
Deny from 64.41.145.0/24
or the BOfH method using iptables:
~# iptables -A INPUT -s 64.41.145.0/24 -i eth0 -p tcp -j DROP
That should take care of the problem for a while.
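Before blocking a whole netblock it does not hurt to double-check that the CIDR notation really covers the range ARIN lists, and to test a few addresses from the access log against it. What follows is only a minimal sketch using Python's standard ipaddress module; the function name is_blocked and the test addresses are my own picks for illustration and not taken from any log.

```python
import ipaddress

# The netblock from the ARIN record above, in the CIDR form used in the rules.
BLOCKED_NET = ipaddress.ip_network("64.41.145.0/24")


def is_blocked(addr: str) -> bool:
    """Return True if addr lies inside the blocked netblock."""
    return ipaddress.ip_address(addr) in BLOCKED_NET


# First and last address of the /24, to compare against the ARIN range.
print(BLOCKED_NET[0], "-", BLOCKED_NET[-1])

# Hypothetical addresses, chosen only to show one hit and one miss.
for candidate in ("64.41.145.17", "64.41.146.17"):
    print(candidate, "blocked" if is_blocked(candidate) else "not blocked")
```

The same check works for any other netblock by swapping the string passed to ip_network.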
Broken fake MJ12bot hitting my server
This morning I was wondering about several hundred KB of peculiar log entries left by an MJ12bot variant that did not exactly follow the nice behaviour of the original. Here is a sample entry as an illustration:
82.245.176.52 - - [30/Dec/2007:06:50:57 +0100] "GET /dw041-formication-agnosia/ _title=%22%22%3E%20%3Cabbr%20title=%22%22%3E%20%3Cacronym%20title=%22%22%3E%20%3Cb%3E%20%3C blockquote%20cite=%22%22%3E%20%3Ccode%3E%20%3Cem%3E%20%3Ci%3E%20%3Cstrike%3E%20%3C strong%3E%20%3C/small/page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2//page/2/ /page/2/ HTTP/1.1" 404 39670 "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"
Whatever this broken botware was trying to accomplish, it did not work out, other than leaving some decent clutter in my server logs. A bit of research revealed that the original makers of this distributed search engine have been aware of these fake bots for a while:
20 Oct 2007 - in the last few days it has been brought to our attention that a number of fake MJ12bots appeared on the Net. These bots are not ours but they use fake MJ12bot user-agent - this is something we can’t do anything about just like with email spammers who fake email addresses so we all get spammed supposedly from our own emails or someone elses.
(source : http://www.majestic12.co.uk/projects/dsearch/mj12bot.php)
What impact this crawl will have remains to be seen. If I was lucky, this was merely some distributed search for email addresses; at worst it was a search for “fresh” content to be displayed on doorway or made-for-AdSense pages, or someone compiling his personal list of “spammable targets” to sell to fellow bottom-feeder spammers. In any case, the best defence is of course to not let this bot access one’s server in the first place, and this is what I did to prevent this surprise from happening again:
Prerequisites: You need an Apache server.
1. Detection rule for User Agent
For less complicated stuff that does not require checking multiple conditions at once I prefer using mod_setenvif. As we know from the MJ12bot page, the recent versions of this bot are in the 1.2.x range. Thus we can conclude that anything older than this can safely be discarded as a fake without any grave side effects. You can place the following line in either your .htaccess or httpd.conf file:
# deny fake bot
SetEnvIfNoCase User-Agent "^MJ12bot/v?1\.[01]\.[0-9]{1,2}" block
What does this entry do? It checks, ignoring case, for any User-Agent header that starts with a pre-1.2.x version string, regardless of whether the version number is preceded by a “v”. If there is a match, an environment variable called “block” will be set. This “block” variable can then be used for further actions; in our case that would be denying access, of course.
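To get a feeling for what the pattern catches and what it lets through, it can be tried out on a few User-Agent strings before going live. This is just a rough sketch: it uses Python's re module rather than Apache's own regex engine (for a pattern this simple I would expect both to agree, but that is an assumption), and apart from the first string, which comes from the log entry above, the sample strings are hypothetical.

```python
import re

# Same pattern as in the SetEnvIfNoCase line; IGNORECASE mimics the "NoCase" part.
FAKE_MJ12 = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)


def is_fake(user_agent: str) -> bool:
    """Return True if the User-Agent would set the 'block' variable."""
    return FAKE_MJ12.search(user_agent) is not None


samples = [
    "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)",  # from the log above
    "MJ12bot/1.1.12",                              # hypothetical, no leading "v"
    "mj12bot/v1.0.8",                              # hypothetical, lower case
    "MJ12bot/v1.2.1",                              # hypothetical 1.2.x version
    "Mozilla/5.0 (compatible; MJ12bot/v1.0.8)",    # hypothetical, other prefix
]

for ua in samples:
    print("block" if is_fake(ua) else "pass ", ua)
```

Note that because of the leading ^ the rule only fires when the header starts with MJ12bot; a fake that wraps the name in a Mozilla-style string, like the last sample, would slip through.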
2. Deny the bot from crawling our site
Now that we have the variable we need to look for, we can block any User-Agent that matches our regular expression from above:
Deny from env=block
In case you want to use it in .htaccess, this line merely needs to be placed after the SetEnvIf rule. Those who want to include it in httpd.conf have to take care to place it within their VirtualHost section, preferably within a Directory or Location section. An example as an illustration:
<VirtualHost 192.168.0.1>
[...]
# Directory permissions
<Directory "/home/web/example.com/html">
Options Indexes FollowSymlinks MultiViews
AllowOverride All
Order deny,allow
# apply SetEnvIfRule here
Deny from env=block
[...]
That should take care of the problem for a while. Do not forget to reload (Linux) or restart (FreeBSD) Apache after having applied changes to httpd.conf, so that these changes actually take effect. Those who put it into .htaccess are already done, as this file is read on every request Apache serves from that directory.
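Whether the rule really bites can be seen in the access log: once the block is active, requests from the fake bot should show up with status 403 instead of 404 or 200. Below is a small sketch that tallies status codes for all log lines whose User-Agent matches the pattern from step 1. It assumes the combined log format shown in the samples above; the function name tally_statuses and the log path in the usage comment are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the SetEnvIfNoCase rule, applied to the User-Agent field.
FAKE_UA = re.compile(r"^MJ12bot/v?1\.[01]\.[0-9]{1,2}", re.IGNORECASE)

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$'
)


def tally_statuses(lines):
    """Count HTTP status codes of requests whose User-Agent matches FAKE_UA."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match and FAKE_UA.search(match.group("ua")):
            counts[match.group("status")] += 1
    return counts


# Usage (the path is an assumption, adjust it to your own setup):
# with open("/var/log/apache2/access.log", encoding="utf-8", errors="replace") as log:
#     print(tally_statuses(log))
```

Lines that do not fit the assumed format are silently skipped, so a count of zero may also mean that the log uses a different layout.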