Radian6 – an abusive crawler from Canada
The above headline might seem unfair, because at first glance Radian6 obeys robots.txt. However, once a server blocks it, it switches its user agent to that of a generic Java client and continues indexing. Before I go on, let me first introduce you to Radian6, its origins and its purpose:
Radian6 is a crawler from Canada and its purpose becomes clear just by looking at the first lines on its home page:
Millions of blog posts. Viral videos. Reviews in forums. Sharing of photos. Status updates via microblogging. All conversations, all happening online right now and affecting brands, reputations, sales, you name it.
This is classic marketing speak at its best. And sure enough, the company behind Radian6 provides marketing and promotion professionals with the latest “trends” in the blogosphere, so that their customers can pick them up and adjust their advertising campaigns accordingly, as you can read on their website. To me this boils down to bullying sites with PR machinery, pushing them aside into irrelevance and snatching trends and slogans for corporate advertising or promo campaigns. All fine and dandy, but why should I support the advertising industry by providing keywords they can eventually use against me? The bot is of no benefit to me, as I do not get to see its index unless I pay for it.
On top of that, Radian6 shows broken crawling behaviour: it pulls the same content multiple times a day and omits trailing slashes in URLs, which wastes even more traffic because each such request first has to be redirected to the actual location. The only conclusion for me is to deny this bot, and on my server that means 403 Forbidden:
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:06:39:35 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:12:50:18 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
Now one would think that a legitimate bot eventually gives up and moves on to more commerce-friendly hosts. That turned out to be wishful thinking: the bot seems unable to accept that someone does not want to see it, and the same address keeps coming back under a generic Java user agent:
142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:27 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:28 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:33 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:34 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:50 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:51 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:53 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:16 +0100] "GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:17 +0100] "GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
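If you want to check your own logs for the same trick, the following is a minimal sketch in Python, assuming your server writes the Apache "combined" log format seen in the excerpts above. It groups requests by client address and reports every address that showed up under more than one user agent. The names `LOG_PATTERN`, `agents_by_ip`, `agent_switchers` and `SAMPLE` are made up for this illustration, and the sample lines are copied from the excerpts.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# ip identity user [time] "request" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def agents_by_ip(lines):
    """Map each client IP to the set of user agents it presented."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # lines that are not in combined format are skipped
            seen[match.group("ip")].add(match.group("agent"))
    return seen


def agent_switchers(lines):
    """Return only the IPs that appeared under more than one user agent."""
    return {
        ip: agents
        for ip, agents in agents_by_ip(lines).items()
        if len(agents) > 1
    }


# Sample entries copied from the log excerpts in this post.
SAMPLE = [
    '142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"',
    '142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
    '142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
]

if __name__ == "__main__":
    for ip, agents in sorted(agent_switchers(SAMPLE).items()):
        print(ip, sorted(agents))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `agent_switchers(open("access.log"))`, where the file name is a placeholder for wherever your server keeps its log. An address with several user agents is not proof of abuse on its own (proxies and shared gateways produce the same pattern), so treat the output as a list of candidates to look at by hand.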
Why is it that some outfits in the promotion and marketing industry (and their most radical variant, called “spammers”) have such a disregard for individuals and believe everyone will gladly accept their “message blast” if it is only repeated often enough? Why do they cherish the delusion of being special, exempt from common ethical standards, and more gifted and intelligent than their targeted “consumer base”? I assume they neither consider such questions nor accept opinions contrary to their own, so I decided to add the netblock 142.166.0.0/16 to my growing list of firewalled IP ranges to prevent any more “stealth visits”.
In case you wish to opt out of their visits too, simply add the following line to your .htaccess or httpd.conf file:
Deny from 142.166.0.0/16
Or, in case you are fortunate enough to run a dedicated server and do not expect any welcome visitors from that IP allocation, you may prefer to firewall their range right away:
iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -m tcp ! --dport 25 -j DROP
This rule leaves port 25 (SMTP) open as a communication channel. Since this is a rather large chunk of addresses, it may well contain responsible companies and individuals, and I still want to be reachable via email for those. Of course, if you do not have any such concerns, you can also apply the BOFH method and silence this range entirely:
iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -j DROP
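Before you deploy any of these rules, it is worth confirming that the range you typed really covers the addresses from your log, since a single mistyped digit in a netblock blocks the wrong people. The following is a small sketch using Python's standard `ipaddress` module. The names `BLOCKED`, `LOGGED` and `uncovered` are made up for this illustration, and the addresses are the two that appear in the log excerpts.

```python
import ipaddress

# The range used in the firewall rules of this post.
BLOCKED = ipaddress.ip_network("142.166.0.0/16")

# Client addresses that appear in the log excerpts.
LOGGED = ["142.166.3.122", "142.166.3.123"]


def uncovered(addresses, network):
    """Return the addresses that the given network does NOT contain."""
    return [a for a in addresses if ipaddress.ip_address(a) not in network]


if __name__ == "__main__":
    # How many addresses the rule affects in total.
    print("addresses in block:", BLOCKED.num_addresses)
    # Anything listed here would slip past the rule.
    print("not covered:", uncovered(LOGGED, BLOCKED))
```

A /16 spans 65,536 addresses, which is why the rules above leave a mail port open for everyone else living in that range.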
2 Comments
thanks for that i was wondering what that damn thing was constantly pinging my feed.
cheers for the advice
I don’t mind allowing access to companies who do things ethically and efficiently, but this kind of stuff is out of control. Thanks for the info and instructions.