Sphere adding query strings to search results
During my usual daily administrative tasks, I was less than pleased to discover that someone at Sphere apparently came up with a new idea. Or perhaps that idea has been around for quite a long time and I was merely fortunate enough to have been spared from learning of its existence. For those who do not know what I am referring to, here is a short explanation:
Sphere offers a widget for blogs that tries to present related blog posts for a particular subject. Perhaps not a bad idea in itself, maybe even a boon to those who want to research a subject and have a handy tool to consult more than one source prior to adding their own thoughts. However, there seem to be some issues:
A referer=sphere_search query string is appended to these links. Even worse, it seems that other search engines are encouraged to crawl these modified links. As a search engine bot cannot differentiate between a useless query string and one that has been added on purpose, it will treat the original and the modified link as two separate pages showing the same content, and ultimately this can leave the receiving end hit by the dreaded duplicate content penalties.
A Google query shows that quite a few sites now have to put up with indexed pages they never wanted:
http://www.google.com/search?q=inurl:referer=sphere_search
This pretty much looks to me like some marketing expert had the brilliant idea of introducing the referrer tracking model of the adult content industry to the mainstream without the affected webmasters’ consent. If you have been hit by this nonsense too, you can easily spot it in your logfiles, like I did this morning:
66.249.73.232 - - [28/Dec/2008:11:18:13 +0100] "GET /goals-for-2008-revisited?referer=sphere_search HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.232 - - [28/Dec/2008:11:18:14 +0100] "GET /goals-for-2008-revisited/?referer=sphere_search HTTP/1.1" 200 11189 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
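If you would rather not eyeball the logfile, a few lines of Python can pull out the affected requests. This is only a minimal sketch: it assumes Apache's combined log format with plain ASCII quotes, and the name find_sphere_requests is made up for illustration.

```python
import re

# Matches the request part of a combined-format log line, e.g.
# "GET /path?referer=sphere_search HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def find_sphere_requests(lines):
    """Return the requested URIs whose query string begins with referer=sphere."""
    hits = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we recognise
        uri = match.group(1)
        _path, _sep, query = uri.partition("?")
        # Case-insensitive, mirroring a [NC] style check
        if query.lower().startswith("referer=sphere"):
            hits.append(uri)
    return hits
```

Feeding it an open logfile, for example find_sphere_requests(open("access.log")), should list every URI that was requested with the unwanted query string. The file name is an assumption as well.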
You may notice that the first request omitted the terminating slash of the URI, a bad habit Sphere’s bot has been exhibiting for years without anyone bothering to fix it. Now the hassle of addressing these malformed requests in order to limit the damage they can do has been added to the mix. Apache users may like to use a mod_rewrite rule similar to this to work around the problem:
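# broken sphere search
# only act when a query string is present at all
RewriteCond %{QUERY_STRING} .
# ...and it begins with referer=sphere (NC = case-insensitive)
RewriteCond %{QUERY_STRING} ^referer=sphere [NC]
# 301 to the same path plus trailing slash; the final "?" drops the query string
RewriteRule (.*) http://www.example.com/$1/? [R=301,L]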
I assume you know how to use mod_rewrite. In case you don’t, please google for the basics before using this code, as inappropriate usage without understanding the implications can wreak havoc on a server. Here is a brief explanation of what it does:
First it checks whether a query string is present and, if so, whether it begins with “referer=sphere”. If both conditions hold, it forces a 301 redirect that re-adds the omitted terminal slash of our fake directory path and strips the query string. The stripping is done by the question mark appended at the end of the substitution. Now, when disaster has struck one of our pages, bots can be led to the actual source no matter what rubbish was originally picked up.
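To see what the rule does to a given request without touching a live server, its logic can be mimicked in a few lines of Python. This is a rough sketch, not a mod_rewrite emulator: it assumes the rule sits in an .htaccess file at the document root (so the leading slash is stripped before the pattern is matched), and the function name and default host are placeholders.

```python
from urllib.parse import urlsplit


def sphere_redirect_target(request_uri, host="www.example.com"):
    """Return the 301 target the rule would produce, or None if it does not fire."""
    parts = urlsplit(request_uri)
    # Both RewriteCond lines boil down to: the query string must begin
    # with referer=sphere, compared case-insensitively ([NC]).
    if not parts.query.lower().startswith("referer=sphere"):
        return None
    # Per-directory context: the leading slash is removed before (.*) matches.
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    # The substitution appends "/" to $1; its trailing "?" discards the query string.
    return f"http://{host}/{path}/"


print(sphere_redirect_target("/goals-for-2008-revisited?referer=sphere_search"))
print(sphere_redirect_target("/goals-for-2008-revisited/?referer=sphere_search"))
```

The first call yields the clean address with its trailing slash. The second shows a side effect of always appending the slash: a request that already ends in one comes out with a doubled slash, so the rule as written is aimed at the slashless variant.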
Of course this is merely curing symptoms. The source of the problem is this moronic addition of a query string, which can only be fixed by Sphere. Surely I’m not the only one on this planet who feels annoyed by such nonsensical additions, and perhaps in the end you can only opt out of their crawling:
The responsible outfit is located in this little corner of the web:
Votenet Solutions Inc, SAVV-S234898-5 (NET-64-14-117-224-1)
64.14.117.224 - 64.14.117.255
which equates to 64.14.117.224/27.
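If you want to double-check the conversion from an address range to CIDR notation, Python's standard ipaddress module can do it. A small sketch follows; the helper name range_to_cidr is made up for illustration.

```python
import ipaddress


def range_to_cidr(first, last):
    """Collapse an inclusive IPv4 range into the CIDR blocks that cover it."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]


print(range_to_cidr("64.14.117.224", "64.14.117.255"))  # ['64.14.117.224/27']
```

A /27 leaves five host bits, hence the 32 addresses from .224 to .255.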
Opting out can be done in five, ten, fifty different ways:
One may add an entry to robots.txt, another to .htaccess or httpd.conf (depending on shared/dedicated hosting) and freaks like me prefer using iptables, so they need not see the clutter of failed requests in their logfiles:
# sphere.com: broken link generation
iptables -A INPUT -s 64.14.117.224/27 -i eth0 -j DROP
That should put an end to the affiliate link mess.
2 Comments »
Works for me! Thank you! I did take the “/” after the $1 off, as I was getting a “//” at the end.
I would hope/assume that Google and the other engines would eventually compensate for that. But I still don’t want people bookmarking that version of the link.
Gary,
Glad to learn the article was of use to you.
I added the slash because Sphere had appended query strings to article links from which they had already omitted the terminal slash. Without it, you would redirect to the article without a terminating slash, and another redirect adding this slash would be triggered. I observed that Google does not bother with more than one redirect and simply stops following, which was something I wanted to avoid.
Of course if, for some reason, the slash is not omitted before the query string, then there’s no need to keep it after $1.
Olliver