Custom logfile for referrer spam
One of the most annoying characteristics of referrer spam is the clutter it leaves in Apache’s access log files, which makes analysing them nearly pointless. But Apache would not be Apache if there were no workaround. Thanks to mod_setenvif we can match request headers against regular expressions and, in the event of a match, set an environment variable. At first glance this sounds quite useless, because we would expect some sort of immediate action instead. But I can assure you it is not, because in the same way that we can write
Deny from 192.168.0.0/24
we can also check for the variable’s existence and regulate access depending on the outcome:
Deny from env=spam
or use it for creating special logfiles only for matches or the opposite:
CustomLog /var/log/httpd/test.example.com-access.log combined env=!spam
Based on this property we are now able to keep unwanted referrer spam and autoposting spambots out of our access.log, and we get a handy source for looking up badly behaved visitors so that we can take appropriate action (like firewalling them or merely denying them access). To make everything work we need root access to edit httpd.conf, so those without a dedicated server may not be able to organise their log entries like this.
Ok, ready? Fine, then let us get started. First, a summary of what we are trying to accomplish:
- Create a set of SetEnvIf rules that will be used to check incoming requests
- If a request meets the criterion of being unwanted, write it into the spam log
- If it is a legitimate request, write it into the access log
I will skip explaining what SetEnvIf is and what good it can do for you, so if you lack that knowledge you may have to study the Apache manual first. But fear not, if you are familiar with regular expressions you will quickly get into it. What we have to consider now is the location of our rules: it could be either httpd.conf or an .htaccess file, and each choice has its advantages:
Everything in httpd.conf is loaded once on startup (or each time you reload the server and force it to reread its configuration file) and then kept in memory for as long as the server is running. Thus things that always need to be done, like permanent redirects or search engine friendly links via mod_rewrite rules, are best kept here.
An .htaccess file, however, is read on every request that touches its directory. Therefore any changes made here take effect immediately. But because of the way this file works, a huge load of rules can slow down the server considerably, so you only want to keep things here that change very often and put as much as possible into your httpd.conf. To get back to the subject, the preferred location for our ruleset is of course httpd.conf, since spammy keywords in referrer strings are very unlikely to change over time.
In case there is not already a section with SetEnvIf rules, we create one now by adding the following block:
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent "Indy Library|OmniExplorer" spam
SetEnvIfNoCase Referer "^http://([0-9a-z_.\-]*(poker|casino)\.)" spam
SetEnvIf Remote_Addr "^69\.31\.(79|93|132)\.[0-9]{1,3}$" spam
</IfModule>
Note that this is only an example. Of course there is no limit to the level of complexity you can apply to your regexp rules, and there are other request attributes that can be used for filtering as well. Now each time one of our rules is triggered, the variable “spam” will be set and a check for it will return true (a simple boolean test).
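As a rough sketch of what filtering on other request attributes could look like, the rules below use Request_Method and Request_URI, which mod_setenvif accepts in place of a header name. The method list, the /guestbook/ path and the 192.168.0.x exception are made-up examples rather than patterns taken from real spam, so adjust them to whatever shows up in your own logs.

```apache
<IfModule mod_setenvif.c>
# Hypothetical rule: flag request methods this site never needs
SetEnvIf Request_Method "^(TRACE|CONNECT)$" spam
# Hypothetical rule: flag requests for a path that does not exist here
SetEnvIfNoCase Request_URI "^/guestbook/" spam
# Hypothetical exception: clear the flag again for the internal network
SetEnvIf Remote_Addr "^192\.168\.0\.[0-9]{1,3}$" !spam
</IfModule>
```

The rules are evaluated in the order they appear, so the exception with !spam has to come after the rules that set the variable.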
The next location we are heading to is the virtual host section of our httpd.conf:
<VirtualHost 192.168.0.1>
ServerAdmin webmaster@example.com
DocumentRoot /usr/local/www/test/
ServerName test.example.com
ErrorLog /var/log/httpd/test.example.com-error.log
CustomLog /var/log/httpd/test.example.com-spam.log combined env=spam
CustomLog /var/log/httpd/test.example.com-access.log combined env=!spam
</VirtualHost>
The notable additions are the two CustomLog lines. The negation of a match is no typo, it really is written like this (those familiar with programming will be slightly irritated, as most languages use != to mark a negative comparison).
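Both CustomLog lines refer to the log format by its nickname, combined. Most stock httpd.conf files already define it. In case yours does not, the line below is the usual definition as given in the Apache documentation, shown here only as a reminder of what the nickname stands for; it would need to sit in the server configuration alongside the CustomLog lines that use it.

```apache
# Definition of the "combined" nickname referenced by the CustomLog lines
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" combined
```

The Referer and User-Agent fields are what make this format useful here, since those are the headers the spam rules match against.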
Now there may be more complex comparisons that require the use of Rewrite rules, and then an entry, although blocked, would remain in our access log. Luckily, mod_rewrite takes even such situations into account and allows setting environment variables with the E flag. This means we can continue using our “spam” variable, only that we have to assign “1” as its value in order to indicate that the variable has been set. An example as illustration:
# spambot detection
RewriteCond %{THE_REQUEST} \?lng= [NC]
RewriteRule .* - [E=spam:1,L]
The only thing we need to take care of with this combination of SetEnvIf and Rewrite rules is that the final Deny from env=spam line is still in place, so that the detection of the variable actually results in denying access to our server (or whatever other action seems appropriate).
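Put together, the pieces could look roughly like the sketch below. It assumes Apache 2.2 style access control and that the Rewrite rules sit in the virtual host section rather than in an .htaccess file; as far as I can tell, rules in .htaccess run too late in request processing for a Deny line to see the variable. The Directory path simply reuses the DocumentRoot from the example host.

```apache
<VirtualHost 192.168.0.1>
    ServerName test.example.com
    DocumentRoot /usr/local/www/test/

    # Flag requests carrying the spambot query parameter
    RewriteEngine On
    RewriteCond %{THE_REQUEST} \?lng= [NC]
    RewriteRule .* - [E=spam:1,L]

    <Directory /usr/local/www/test/>
        # With this Order, a matching Deny overrides Allow from all
        Order Allow,Deny
        Allow from all
        Deny from env=spam
    </Directory>

    # Flagged requests go to the spam log, everything else to the access log
    CustomLog /var/log/httpd/test.example.com-spam.log combined env=spam
    CustomLog /var/log/httpd/test.example.com-access.log combined env=!spam
</VirtualHost>
```

On Apache 2.4 the Order/Allow/Deny lines would presumably be replaced by a RequireAll block containing Require all granted and Require not env spam. Another option is to let the Rewrite rule reject the request itself by using the F flag in place of L, in which case the variable is still set for logging.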
And voilà, we have now accomplished the following: spam is written to a separate logfile and no longer appears in the access log itself, so we are able to analyse our traffic again.