electro acoustic expressionism
nodepet
September 30th, 2008

dotbot – yet another useless robot…

Filed under: Web — olliver @ 10:36 h

Allow me to start with a question: What is the purpose of a legitimate robot? One would think it is fetching content at a reasonable pace whilst respecting the host’s restrictions in robots.txt. When a bot bothers to fetch robots.txt before it starts crawling, does that mean it will also obey its rules? Not necessarily, it seems. When DotBot visited me two days ago, it did not seem to be interested in my content, but in collecting redirect messages without following them:

208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:08 +0200] "GET /category/photo HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:13 +0200] "GET /category/spam HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:18 +0200] "GET /category/web HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

This is just a small but representative sample. For reasons unknown to me, DotBot omits the trailing slash of the URI, which results in a 301 redirect (because there is no file of that name). If only the spider followed the redirect, it could fetch something meaningful. To cut a long story short, apart from robots.txt this bot did not take home a single article, because it evidently does not know how to handle redirects. Quite a silly waste of resources in my opinion, but then again, what do I know about the bot’s purpose?
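For anyone who wants to check their own logs for the same pattern, here is a minimal sketch in Python. It assumes the access log uses the Apache combined format shown above; the function name `summarize`, the regex group names and the three-line `SAMPLE` (copied from the requests listed above) are my own choices for illustration, not part of any tool.

```python
import re
from collections import Counter

# Apache "combined" log format. The group names are labels chosen for this sketch.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines, agent_token):
    """Count status codes for one bot and collect 301s on slash-less paths."""
    statuses = Counter()
    unslashed_redirects = []
    for line in lines:
        m = LOG_LINE.match(line)
        # Skip unparsable lines and requests from other user agents.
        if m is None or agent_token.lower() not in m.group("agent").lower():
            continue
        status = int(m.group("status"))
        statuses[status] += 1
        if status == 301 and not m.group("path").endswith("/"):
            unslashed_redirects.append(m.group("path"))
    return statuses, unslashed_redirects


UA = ("Mozilla/5.0 (compatible; DotBot/1.1; "
      "http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)")

# Three of the requests quoted in this post, with plain ASCII quotes.
SAMPLE = [
    f'208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "{UA}"',
    f'208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "{UA}"',
    f'208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "{UA}"',
]

if __name__ == "__main__":
    statuses, paths = summarize(SAMPLE, "DotBot")
    print(dict(statuses))
    print(paths)
```

If a bot's only 200 response is robots.txt and everything else is a 301 on a path without a trailing slash, it is most likely not following redirects.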

On the DotNetDotCom website, the crawler’s presumable home, we can find the following statement:

Hi! Thanks for letting us crawl you!

We are just a few Seattle based guys trying to figure out how to make internet data as open as possible. You should be able to find everything you are looking for below. If not feel free to contact us. Happy Surfing!

The “we are just …” statement does not inspire much confidence in me. This impression is reinforced by the next paragraph, which contains instructions on how to get rid of the bot:

1. First and foremost, curse our name. Trust us, it will feel good. Now breath gently…
2. Create a simple text file named robots.txt and place it in your server’s root directory. (http://www.yoursite.com/ «– Right There!)
3. Add the following code to your robots.txt file:
User-agent: dotbot
Disallow: /
4. Reflect on how easy that was.

To me this does not sound like a responsible operation, because it suggests that rather than fixing their bot, they urge “flamers” to opt out of their crawling. Regulars will know I am one of these flamers ;-) and of course this is not the only reason for my scepticism:

208.115.111.245 - - [28/Sep/2008:11:13:52 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:11:19:32 +0200] "GET /impressum HTTP/1.1" 301 241 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

The Impressum is explicitly excluded from crawling in robots.txt because it contains sensitive information about me that I am required to publish under German law. Yet despite reading robots.txt, DotBot went straight for it. Fortunately, it again failed to add a trailing slash to its request and to handle the resulting 301 redirect properly. Ignoring robots.txt is usually a knockout criterion for a bot, and since experience has proven time and again that bad bots have a tendency to morph, I prefer to firewall them right away.
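To illustrate what a well-behaved crawler should have concluded, here is a small sketch using Python's standard `urllib.robotparser`. Note that the rules in `ASSUMED_ROBOTS_TXT` are hypothetical: the actual content of my 77-byte robots.txt is not reproduced in this post, so the file below is only a stand-in that excludes `/impressum` for all agents. The helper name `allowed` is likewise just for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stand-in: the real robots.txt is not shown in this post.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /impressum
"""


def allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/impressum", "/category/web/"):
        print(path, allowed(ASSUMED_ROBOTS_TXT, "DotBot", path))
```

Under these assumed rules, a request for `/impressum` is off limits no matter whether the trailing slash is present, because the disallow rule is matched as a path prefix.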

Whois opines the following about their address space:

OrgName:    dotnetdotcom.org
OrgID:      DOTNE
Address:    93 S. Jackson Street #10070
City:       Seattle
StateProv:  WA
PostalCode: 98104-2818
Country:    US

NetRange:   208.115.111.240 - 208.115.111.255
CIDR:       208.115.111.240/28
OriginAS:   AS23033
NetName:    208-115-111-240-SLASH28
NetHandle:  NET-208-115-111-240-1
Parent:     NET-208-115-96-0-1
NetType:    Reassigned
Comment:
RegDate:    2008-07-21
Updated:    2008-07-21

I am not suggesting the DotNetDotCom owners are blackhats. But I have better things to do in my life than to debug other people’s bot operations. If DotBot fails even at elementary things like obeying robots.txt and following redirects, then I see no reason to allow it to visit my sites. Blocking 208.115.111.240/28 should take care of the problem.
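Before adding a block like this to a firewall, it is worth double-checking that the CIDR range really covers the offending address and nothing more than the whois NetRange. A quick sketch with Python's standard `ipaddress` module; the names `BLOCKED` and `is_blocked` are made up for this example, and how the range is actually enforced (firewall rule, web server config) is left open here.

```python
import ipaddress

# The range from the whois record above.
BLOCKED = ipaddress.ip_network("208.115.111.240/28")


def is_blocked(addr, network=BLOCKED):
    """Return True if addr falls inside the blocked network."""
    return ipaddress.ip_address(addr) in network


if __name__ == "__main__":
    # First and last address of the range, and its size.
    print(BLOCKED[0], BLOCKED[-1], BLOCKED.num_addresses)
    # The address seen in the access log.
    print(is_blocked("208.115.111.245"))
```

A /28 leaves 4 host bits, so the range spans 16 addresses, which matches the NetRange 208.115.111.240 - 208.115.111.255 reported by whois.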
