January 20, 2005

Using MT-Blacklist on referrer spam

Tony at juju.org has posted a Perl script that uses the MT-Blacklist master blacklist to eliminate referrer spam from your web server's access logs.

I used to think this would be useless, since at the time the domains used in referrer spam and weblog spam were completely different. Now I'm seeing the same domains I blacklist showing up in my referrers. Nice job, Tony!
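Tony's script is Perl, but the core idea — drop any access-log line whose referrer matches a blacklist pattern — can be sketched in a few lines of Python. The patterns and log lines below are invented examples, not entries from the real master blacklist:

```python
import re

# Hypothetical blacklist patterns; the real MT-Blacklist master list is a
# large file of regex fragments, one per line.
blacklist = [r"texas-holdem", r"online-casino\.example", r"cheap-pills"]
spam_re = re.compile("|".join(blacklist), re.IGNORECASE)

def scrub(lines):
    """Return only the access-log lines that match no blacklist pattern."""
    # Combined-format logs put the referrer in the second-to-last quoted
    # field; matching anywhere in the line is cruder but serviceable.
    return [line for line in lines if not spam_re.search(line)]

log = [
    '1.2.3.4 - - [20/Jan/2005] "GET / HTTP/1.1" 200 512 "http://online-casino.example/" "Mozilla"',
    '5.6.7.8 - - [20/Jan/2005] "GET / HTTP/1.1" 200 512 "http://a-real-reader.example/" "Mozilla"',
]
print(scrub(log))  # only the second, clean line survives
```

You'd run something like this over the raw log and point your stats analyzer at the scrubbed output, which is exactly the workflow discussed in the comments below.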

Posted on 2005/01/20 23:13

Comments:

Additional notes on dasBlog 1.7
Read more in dasBlog 1.7 - a few more notes »

Rutty (davidrutt.me.uk):

Interesting! My referrers list is chock full of spam sites - time to check this one out ;)

Posted on January 21, 2005 04:57 CET

David (cronaca.com):

I've been steadily adding to my .htaccess file, trying to keep up with all the referrer spam. This script sounds like just what we need. What might be the performance implications? I know that a longer .htaccess file will slow down loading, and I presume this new approach will not.

Posted on January 21, 2005 10:44 CET

Jay Allen (jayallen.org):

Good question. Let me know when you find out.

Posted on January 21, 2005 10:54 CET

Hanna (bitcheswithglitches.com/nwn):

It appears to be a script that can be run as a cron task (preferably right before the stats analyzer grabs the file and turns it into whatever it turns it into), so it should definitely not have any effect on page loading.

You can even specify domains for it to ignore in the log and not check against the blacklist (such as your own, a friend's blog that sends you a lot of traffic, Google, Yahoo, whatever), which speeds up the process.

I think it has to be run locally and then your stats analyzer needs to be pointed at the new de-referral-spammed output file, so it is not a simple task to set up, requiring a lot of access to the webserver that many people may not have. I asked in comments if it was something that could work remotely, outputting the file remotely and then replacing the stats file with that file. (Actually, I asked it differently, so I might get a different answer.)

Running it locally sounds like it puts a pretty hard hit on the server while it's actually running, though it only runs once a night. So if you install this and your service provider throttles how much processor time you can use, it might not work, or it might work fine. The only way to know is to try.

I really would prefer it to run remotely, though (an option to do it either way would be nice), just because that's the only way I could use this. I got in enough trouble from my service provider when MT was rebuilding pages whenever we got spam, blocked or not, and I was blocking hundreds of spams a day and therefore rebuilding the page hundreds of times a day. Thank heavens that's over. :)

Love,

Hanna

Posted on January 21, 2005 11:48 CET

tom sherman (underscorebleach.net/content/jotsheet):

To echo Hanna's reply to David's question, this is a solution to filter referrers after they've already gotten into your logfile. It doesn't prevent that, and hence it has no performance implications on Apache.

Also, shameless plug for a proposal I've written on referral spam. (Jay is going to get tired of me linking it!)

Proposal for a solution to referrer spam

Posted on January 21, 2005 11:52 CET

Rod Begbie (groovymother.com):

I was wondering about using blacklist.txt at the Apache level just last week.

In the meantime, I've hacked up a patch for AWStats which uses blacklist.txt to hide referer spam.

Rod.

Posted on January 21, 2005 11:54 CET

tom sherman (underscorebleach.net/content/jotsheet):

Rod,

For true, real-time filtering using Apache, check out Chris' directions for using mod_access_rbl. This doesn't use MT-Blacklist, however.

Posted on January 21, 2005 12:29 CET

demonsurfer (demon.twinflame.org):

Can't you just write the stats to a protected directory? Or do people particularly want referrer stats published for some reason?

Posted on January 21, 2005 14:11 CET

demonsurfer (demon.twinflame.org):

hmm.. or I wonder if you can just add that rel="nofollow" tag into the results if you're concerned about search spiders counting them...

Posted on January 21, 2005 14:12 CET

Jay Allen (jayallen.org):

Yes, as was mentioned in the Six Apart "Introduction to nofollow", a public list of referrers is a perfect place to add it.

Posted on January 21, 2005 14:37 CET
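Adding nofollow to a published referrer list is a one-line templating change. A minimal Python sketch (the helper name and example URL are illustrative, not from any particular stats package):

```python
import html

def referrer_link(url):
    """Render one entry of a public referrer list as an HTML link.

    rel="nofollow" tells search engines not to credit the link,
    removing the PageRank incentive for referrer spammers."""
    safe = html.escape(url, quote=True)
    return '<a href="%s" rel="nofollow">%s</a>' % (safe, safe)

print(referrer_link("http://example.com/?a=1&b=2"))
```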

David (cronaca.com):

Since my main interest here is in having stats that haven't been rendered unreadable by spam entries, that AWStats patch currently appears to offer the best solution. Purging log files doesn't seem to be all that practical for those of us running AWStats updates several times per day.

Some sort of referrer-Blacklist plug-in would be handy for those who maintain a realtime public list of referrers.

Posted on January 21, 2005 14:51 CET

Arve Bersvendsen (virtuelvis.com):

I would actually like to see something that worked the other way around:

I usually receive large batches of referer spam, and two to ten days after receiving this referer spam, someone tries to mass comment spam with these URLs.

Posted on January 21, 2005 16:45 CET

Derwood:

For me, this might not work well. I keep Apache logs one month at a time and use my own custom log-rolling scripts, so my logs can get quite large by the end of the month. Also, I use comparatively slow machines for my servers. But I can see where it would be a useful tool for some. My preference has been to use MT-Blacklist and MT-DSBL. However, I've just installed the mod_access_rbl plugin and it shows serious promise. If MT-Blacklist could be made to also harvest the IPs from comment spam, the Comment Spam Clearinghouse could provide those IPs as well.

I realize that IPs can be spoofed and that spammers often have legions of zombie systems. But it's something else we can do to add another layer of comment spam protection.

Just a thought..

Posted on January 22, 2005 01:13 CET

Tony (juju.org):

I think RBLs are a good idea in theory, but they've been a major pain for me. At work, it seems that half the time we set up an email server for someone, we have to deal with the fact that they're on some RBL somewhere, and it can be rather difficult to get off them.

For instance, I'd like to give mod_access_rbl a try, but I can't access its website. I'm on a Comcast cable modem and apparently I've been blocked. The really annoying part is that the message provides a link to check which list I'm on, but that link takes me to another page that I can't access! For people thinking of giving mod_access_rbl a try: be wary. You may block loads of innocent users.

Posted on January 22, 2005 07:36 CET

demonsurfer (demon.twinflame.org):

heh.. the irony. :)

Posted on January 22, 2005 08:27 CET

Jay Allen (jayallen.org):

As I've said many times in the past, I'm generally against IP-based solutions. DSBL is an exception because it is blocking confirmed open proxies. There are good reasons to use an open proxy, but vastly more bad reasons these days.

Posted on January 22, 2005 09:05 CET

tom sherman (underscorebleach.net/content/jotsheet):

I agree that IP-based solutions are poor. However, here we're talking about keeping crap out of log file reports, which I don't think should be published anyway. (If you do, that's your business, but I think it's just asking for trouble. Side issue.)

Anyway, my point is: if there are false positives with an IP-based solution for scrubbing your logfiles of referrer spam... so what?

Posted on January 22, 2005 21:43 CET

Jay Allen (jayallen.org):

Yeah. Sorry about that. It's my old trick-knee-jerk reaction acting up. :-) You're absolutely right.

Posted on January 23, 2005 05:33 CET

Tony (juju.org):

Right. Unfortunately, mod_access_rbl blocks access to their website completely! Using it to scrub logfiles is one thing; denying me access to their website because I'm on a Comcast cable modem is unacceptable. (takes a deep breath, counts to 10)

By the way, I just updated derefspam, making it about 3 times faster. I also started compiling my own blacklist.txt listing some of the major referral spammers that aren't in MT-Blacklist's.

Posted on January 23, 2005 11:03 CET

A couple days after I noticed the SPAM in my web statistics, the topic came up over at the Comment…
Read more in Your Referrers Are Showing »

Cowboy (emamat.com):

Hi. After installing MT-Blacklist, when I run:

http://mywebsite/cgi-bin/mt/mt-blacklist.cgi

I get this error:

Can't call method "id" on an undefined value at d:\domains\emamat.com\wwwroot\cgi-bin\mt\extlib/jayallen/Blacklist.pm line 2331.

Can you help me? Thanks.

Posted on January 25, 2005 19:23 CET

Jay Allen (jayallen.org):

1) Please don't post technical support requests here. That is what the forums are for.

2) Search for Windows. Start reading. In short, there are major bugs with file paths on Windows. Windows is hence unsupported. If you get it working, hooray, but the bugs aren't easy to fix.

Posted on January 25, 2005 20:45 CET

New blog, new layout, and finally a new post.. Well, for some time now I have been working on ways to reduce or limit blog spam. The most important thing that I’ve done to the blogs here is to add…
Read more in Wow, its been a long time.. »

I don’t know about you, but I hate getting a crapload of comment spam and referral spam. I’ve been able to block much of the former with MT-Blacklist, but the…
Read more in Major Server Cleaning and Securing »

Without realizing it, I’ve been quite ahead of MT-Blacklist for a while with Pivot-Blacklist by incorporating anti-referer spam protection. This article on Jay Allen’s site describes a way to eliminate spam referers not only from your recent referers l…
Read more in MT-Blacklist now also used against referer spam »

MT-Blacklist/Comment Spam Clearinghouse: Using MT-Blacklist on referrer spam…
Read more in Hot on the heels of ping spam comes referrer spam »

So this site has only been up since Jan. 31, and over the last 2 or 3 days it has received in excess of 20 trackback spam pings. What makes it even more demoralizing is that they are the only trackbacks…
Read more in Trackback spam »

Marco (i-marco.nl/weblog):

I thought I'd share this here; maybe it's an idea for MT-Blacklist as well. Xiffy, the guy behind Nucleus-Blacklist (which is based on my Pivot-Blacklist extension for Pivot), wrote some interesting code that generates .htaccess rules based on what the blacklist matched. I have incorporated his contributions into Pivot-Blacklist, and it works very nicely.

There are two kinds of rules:

IP-blocking rules are generated on an x-strikes-and-you're-out basis. The admin can configure the number of strikes. If spam is caught from the same IP over and over again, that IP is a candidate to be blocked completely. The system logs IPs to a 'suspects' file and moves them over to a 'blocked IPs' file when the threshold is exceeded.

Referer-denial rules are generated based on which rules from either the central blacklist or the user-defined rules are triggered by the spammers.

In the end, users can copy and paste the generated .htaccess rules into their own .htaccess, and then reset the matched-IPs / matched-rules file in Nucleus/Pivot-Blacklist.

Cheers,

Marco

Posted on February 25, 2005 00:26 CET
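The x-strikes idea Marco describes is easy to sketch. This Python fragment (the threshold and IPs are made up for illustration; his actual implementation is PHP inside Pivot/Nucleus) counts spam-source IPs and emits "deny from" lines once the threshold is exceeded:

```python
from collections import Counter

STRIKES = 3  # hypothetical admin-configurable strike count

def htaccess_rules(spam_ips, threshold=STRIKES):
    """Turn a log of caught-spam source IPs into 'deny from' lines for
    any IP seen at least `threshold` times (x strikes and you're out)."""
    hits = Counter(spam_ips)
    return ["deny from %s" % ip for ip, n in sorted(hits.items()) if n >= threshold]

caught = ["10.0.0.1", "10.0.0.1", "10.0.0.1", "192.168.0.9"]
print(htaccess_rules(caught))  # only the repeat offender is blocked
```

The one-strike IP stays in the "suspects" pool; only repeat offenders graduate to the blocked list, which keeps false positives from a single spoofed hit out of your .htaccess.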

demonsurfer (demon.twinflame.org):

Sort of like a personal IP blacklist instead of parsing against DSBL.org or opm.blitzed.org and the like?

Posted on February 25, 2005 12:27 CET
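For reference, a DNSBL check of the sort demonsurfer mentions works by reversing the IP's octets and looking the result up under the list's zone; an A record means "listed", NXDOMAIN means clean. A rough Python sketch (the zone names are illustrative):

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: 1.2.3.4 checked against list.dsbl.org
    becomes 4.3.2.1.list.dsbl.org."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone):
    # Resolving the constructed name succeeds only if the IP is listed.
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

print(dnsbl_query_name("1.2.3.4", "list.dsbl.org"))  # 4.3.2.1.list.dsbl.org
```

Marco's answer below explains why he layers a personal .htaccess list on top of DNSBL checks like this.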

Marco (i-marco.nl/weblog):

demonsurfer: well, it's different. My blacklist software does use DSBL and SURBL and catches practically all bad referers, which results in them being blocked from the last-referers lists on my blogs. However, they still pollute my stats page, because the request does get handled by Apache and therefore still gets counted as a 'hit'. Blocking them at the .htaccess level kills them off completely, so they don't appear anywhere. It does take maintenance, but personally I don't mind that as long as they stay away.

One could theoretically even convert the entire blacklist from this site into .htaccess rules, but I guess that could result in slow performance. Therefore I only use the rules that are actually triggered by the blacklist software.

Posted on February 26, 2005 01:16 CET

Simon Cox (simoncox.com):

I don't understand why a spoofed referer helps the spammer. He needs to put the domain or IP of the site he wants into our referer list. Can't we just block the IP or address of the site he's promoting in our .htaccess file? What am I missing here???

Posted on February 28, 2005 06:02 CET

Marco (i-marco.nl/weblog):

Simon, you can add matching rules to your .htaccess to deny specific referers (real or spoofed, it doesn't matter), and deny rules to completely block certain IPs from doing anything on your site. So in short: yes, we can. What exactly is the part you think you're missing?

Posted on March 01, 2005 10:23 CET
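For readers wondering what the two kinds of rules Marco describes look like in an .htaccess file, here is a sketch using mod_setenvif for the referrer match (the patterns and IP are invented examples, not entries from any real blacklist):

```apache
# Mark requests whose Referer matches a blacklisted pattern
SetEnvIfNoCase Referer "online-casino" spam_ref
SetEnvIfNoCase Referer "texas-holdem"  spam_ref

Order allow,deny
Allow from all
# Referer-denial rule: refuse requests carrying a blacklisted referrer
Deny from env=spam_ref
# IP-blocking rule: refuse everything from a repeat offender
Deny from 10.0.0.1
```

Denied requests get a 403 before any page is served, so they never inflate hit counts in the stats.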

Simon Cox (simoncox.com):

Thanks, Marco. I started blocking the IPs of undesirable sites but wondered why this couldn't be done automatically - I've read the arguments about blocking real users. Why is spoofing the IPs an issue? The spammers need the real IP to get the referrer into our referrer lists (which I don't actually have or use on my site).

I recently set up an automated script that adds a block to my .htaccess file if something requests a particular file - a file that I don't actually use, like mtcomments (because I always rename mine) - i.e. a honeypot trap. It's worked pretty well. In my robots file I also have a pointer to the directory with the trap in it, so that search engines will avoid that directory and naughty people looking to abuse my robots file will get blocked. If anyone wants this script, mail me - I can't remember where I got it.

Posted on March 01, 2005 12:29 CET

Marco (i-marco.nl/weblog):

Simon, the thing is: suppose you use the refspoof plugin for Mozilla/Firefox, enter www.some-spammy-site.com in the spoof box, and then fire it at your site. Your Apache access log (or worse, your last-referers list on your homepage, if you have one) will be polluted with that spam site. In the worst case it will appear on your home page, but even if it doesn't, it will still pollute your statistics by falsely inflating your pageview/visitor counts. .htaccess blocking takes care of this.

Being the author of Pivot-Blacklist, I'm interested in your honeypot approach and the script you use. Maybe I can do something with it to make the software even better than it is now (right now it already blocks 99.99% of comment/referer/trackback spam). So if you can, mail it to marco@i-marco.nl.

Thanks!

Posted on March 02, 2005 03:28 CET

If you publish a list of referring pages on your website you are going to have to deal with a problem: referrer spam. Apparently some shady companies like to advertise their pills, gambling scams…
Read more in What the hack? »

Posted on rakaz at 2005/03/17 06:55