Tonight I implemented the new global comment spam filters. A few weeks ago, I built some internal tools that let me quickly track and delete comment spam. That was a reactive stopgap, and I knew I would eventually have to build some sort of preventive filter to stop spam comments from even getting into journals.

And so this week I committed myself to building the new comment spam filter. And I think it's done.

Fortunately for me, comment spam is localized to a few specific keywords: viagra, cialis, vicodine, xanax, zoloft, casino, blackjack, backgammon, hgh, and a few more I can't remember off the top of my head.

Again, fortunately, these are only being posted by guests. Because I don't have the resources (time, expertise, or server power) to build a "true" Bayesian filter, I've built a rather basic implementation that simply checks each comment against known "flag" words that I predefine. In future iterations of the spam filter, users will be able to flag comments as spam and delete them, which should send the comment to a script that determines which keywords have a high probability of being associated with spam.
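A minimal sketch of that kind of keyword check, in Python. This is not the actual Tabulas code: the flag words are the ones listed above, while the names `FLAG_WORDS` and `find_flag_words` and the whole-word matching rule are assumptions made for illustration.

```python
import re

# Hypothetical predefined flag-word list (the words are from the post,
# the variable name is made up for this sketch).
FLAG_WORDS = {
    "viagra", "cialis", "vicodine", "xanax", "zoloft",
    "casino", "blackjack", "backgammon", "hgh",
}

# Split text into lowercase runs of letters and digits.
_WORD_RE = re.compile(r"[a-z0-9]+")


def find_flag_words(text):
    """Return the set of flag words that appear as whole words in text."""
    words = set(_WORD_RE.findall(text.lower()))
    return words & FLAG_WORDS


print(find_flag_words("Cheap VIAGRA and casino bonus!"))
print(find_flag_words("Nice post about chess"))
```

Matching on whole words keeps innocent words that merely contain a flag word from tripping the filter, at the cost of missing multi-word phrases and words glued together.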

If you're a guest posting a comment, and you have flag words in your comment, guest name, or link, you will *then* be prompted with a captcha.
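The rule above could be sketched like this in Python. The `Comment` fields, the shortened `FLAG_WORDS` tuple, and the function name `needs_captcha` are all hypothetical. This sketch uses plain substring matching so that a flag word buried in a URL is still caught, which is an assumption and not necessarily how the real filter matches.

```python
from dataclasses import dataclass

# Hypothetical, shortened flag-word list for the example.
FLAG_WORDS = ("viagra", "cialis", "casino", "blackjack")


@dataclass
class Comment:
    body: str
    author_name: str
    link: str
    is_guest: bool


def needs_captcha(comment):
    """Only guests whose comment, name, or link contains a flag word get the captcha."""
    if not comment.is_guest:
        return False
    haystack = " ".join([comment.body, comment.author_name, comment.link]).lower()
    return any(word in haystack for word in FLAG_WORDS)


spammy = Comment("hello there", "bob", "http://bestcasino.example", is_guest=True)
print(needs_captcha(spammy))
```

Logged-in users skip the check entirely, and guests with clean comments never see the prompt.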

A captcha (completely automated public Turing test to tell computers and humans apart) is one of those "please input the letters from the following image" type tests, like so:

I absolutely HATE captchas. HATE HATE HATE. I have failed some of those tests repeatedly (which I guess makes me a computer), which is incredibly frustrating. I even posted a joke captcha a while ago:

In any case, I decided to bite the bullet and implement the most rudimentary of all captcha images:

Yes, that is my captcha image. It is mind-numbingly easy to read. Yes, I realize it's very susceptible to screen readers, but if a spammer starts using a screen reader, I can simply spend a few hours obfuscating the text a bit then. No need to do premature optimization on the obfuscation, especially since I hate impossible-to-read captchas.

So basically, anonymous guest posters posting comments with bad spam keywords hit this prompt:

If they pass, the comment gets posted. Otherwise it's ignored. It's not the most advanced solution, but I'm pretty sure it's going to be enough to cut back on a majority of comment spam as it is. As I keep tuning the spam filter, I'm sure I can catch a higher percentage of spam comments.
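Put together, the posting flow might look something like this Python sketch. The function name, its parameters, and the case-insensitive captcha comparison are assumptions for illustration, not the real implementation.

```python
def handle_comment(is_guest, flagged, expected_captcha=None, typed_captcha=None):
    """Return "posted" or "ignored" for a submitted comment.

    is_guest: True for anonymous guest posters.
    flagged: True if the comment, guest name, or link contained a flag word.
    expected_captcha / typed_captcha: the text in the image and what the
    poster typed (both hypothetical parameters for this sketch).
    """
    # Members and clean guest comments go straight through.
    if not (is_guest and flagged):
        return "posted"
    # Flagged guest comments are posted only if the captcha was answered correctly.
    if expected_captcha is None or typed_captcha is None:
        return "ignored"
    if typed_captcha.strip().lower() == expected_captcha.lower():
        return "posted"
    return "ignored"


print(handle_comment(is_guest=True, flagged=True,
                     expected_captcha="tabulas", typed_captcha="Tabulas "))
```

A failed or missing captcha answer simply drops the comment, which matches the "otherwise it's ignored" behavior described above.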

So if you see any weird things happening with comments, please let me know.

Currently listening to: Leona Naess - Calling
Posted by roy on March 2, 2006 at 01:30 AM in Web Development, Tabulas | 9 Comments

roy has disabled commenting.
Comment posted on March 9th, 2006 at 01:19 AM
I especially hate the captcha images where you can't tell the difference between the "O" and "0" (zero)..
Comment posted on March 3rd, 2006 at 07:19 PM
So I suppose all the blind folks must be using screen readers now (and eventually even those won't help as the captcha inevitably becomes distorted)? So much for aural browsers.
Comment posted on March 2nd, 2006 at 09:53 PM
A weird thing did happen with a comment. As everyone does, I get an email sent to me whenever someone comments on my posts. A couple of days ago I received a "guest" comment that was full of links and junk. Yet when I tried to look for it to delete it, I couldn't find it. Not that I want it, mind you, just that it was a little weird. :)
Comment posted on March 2nd, 2006 at 10:59 PM
Actually, the past few days I had been very proactive in monitoring (in a reactive fashion). You probably did get the comment spam, then I probably went in and deleted it before you could :)

Now the comment spam gets blocked before it even gets posted.
Comment posted on March 4th, 2006 at 10:04 PM
Ah, that's what I like: a man who takes out the trash without being asked! ;-) :-D
Comment posted on March 2nd, 2006 at 08:04 AM
marinol.
Comment posted on March 2nd, 2006 at 06:53 AM
viagra, cialis, vicodine, xanax, zoloft, casino, blackjack, backgammon, hgh, and a few more I can't remember off the top of my head.
Comment posted on March 2nd, 2006 at 08:37 AM
online poker
Comment posted on March 2nd, 2006 at 12:13 PM
party poker