tackling comment spam

March 2, 2006

Tonight I implemented the new global comment spam filters. A few weeks ago, I built some internal tools that let me quickly track and delete comment spam. Those tools were only a reactive stopgap, and I knew I would eventually have to build some sort of preventive filter to stop spam comments from even getting into journals.

And so this week I committed myself to building the new comment spam filter. And I think it's done.

Fortunately for me, comment spam is localized to a few specific keywords: viagra, cialis, vicodine, xanax, zoloft, casino, blackjack, backgammon, hgh, and a few more I can't remember off the top of my head.

Again, fortunately, these are only being posted by guests. Because I don't have the resources (time, expertise, or server power) to build a "true" Bayesian filter, I've built a rather basic implementation that simply does a keyword check against known "flag" words that I predefine. In future iterations of the spam filter, users will be able to flag comments as spam and delete them, which should send the comment to a script that determines which keywords have a high probability of being associated with spam.
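A minimal sketch of that kind of keyword check, in Python. This is not the actual Tabulas code: the names `FLAG_WORDS` and `find_flag_words` are hypothetical, the list is seeded only with the keywords named in this post, and matching on whole lowercase tokens is an assumption.

```python
import re

# Hypothetical flag list, seeded with the keywords named in the post.
FLAG_WORDS = frozenset({
    "viagra", "cialis", "vicodine", "xanax", "zoloft",
    "casino", "blackjack", "backgammon", "hgh",
})


def find_flag_words(text, flag_words=FLAG_WORDS):
    """Return the set of flag words that appear as whole tokens in text."""
    # Lowercase the text and split it into runs of letters and digits,
    # so punctuation and URL separators do not hide a keyword.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens & set(flag_words)


print(find_flag_words("Cheap VIAGRA and casino bonuses!"))
print(find_flag_words("Nice post, Roy"))
```

Whole-token matching keeps a short keyword like "hgh" from firing inside unrelated words, but it would miss run-together variants such as "viagra4u". Substring matching would catch those at the cost of more false positives.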

If you're a guest posting a comment, and you have flag words in your comment, guest name, or link, you will *then* be prompted with a captcha.
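The rule above could be sketched roughly like this in Python. The function and parameter names (`needs_captcha`, `is_guest`, `guest_name`, `link`, `body`) are assumptions for illustration, as is the whole-token matching. The flag list contains only the keywords named in this post.

```python
import re

# Hypothetical flag list, seeded with the keywords named in the post.
FLAG_WORDS = frozenset({
    "viagra", "cialis", "vicodine", "xanax", "zoloft",
    "casino", "blackjack", "backgammon", "hgh",
})


def contains_flag_word(text):
    """True if any whole token in text is a flag word (None counts as empty)."""
    tokens = re.findall(r"[a-z0-9]+", (text or "").lower())
    return any(token in FLAG_WORDS for token in tokens)


def needs_captcha(is_guest, guest_name, link, body):
    """Decide whether a comment should trigger the captcha prompt."""
    # Logged-in users are never prompted; only guests are checked.
    if not is_guest:
        return False
    # A flag word in any one of the three guest-supplied fields is enough.
    return any(contains_flag_word(field) for field in (guest_name, link, body))


print(needs_captcha(True, "bob", "http://casino.example.com", "hi there"))
print(needs_captcha(False, "roy", "", "testing the word viagra"))
```

Guests whose comments contain no flag words never see the captcha at all, which keeps the annoyance limited to the comments most likely to be spam.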

A captcha (completely automated public Turing test to tell computers and humans apart) is one of those "please input the letters from the following image" type tests, like so:

I absolutely HATE captchas. HATE HATE HATE. I have failed some of those tests repeatedly (which I guess makes me a computer), which is incredibly frustrating. I even posted a joke captcha a while ago:

In any case, I decided to bite the bullet and implement the most rudimentary of all captcha images:

Yes, that is my captcha image. It is mind-numbingly easy to read. Yes, I realize it's very susceptible to screen readers, but if a spammer starts using a screen reader, I can simply spend a few hours obfuscating the text a bit then. No need to do premature optimization on the obfuscation, especially since I hate impossible-to-read captchas.

So basically, anonymous guest posters posting comments with bad spam keywords hit this prompt:

If they pass, the comment gets posted. Otherwise it's ignored. It's not the most advanced solution, but I'm pretty sure it's going to be enough to cut back on a majority of comment spam as it is. As I keep tuning the spam filter, I'm sure I can catch a higher percentage of spam comments.
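The pass-or-ignore step described above might look something like the following Python sketch. Everything here is hypothetical: the names `captcha_passes` and `process_comment`, the "posted"/"ignored" return values, and the choice to compare answers case-insensitively are assumptions, not the real implementation.

```python
def captcha_passes(expected, answer):
    """Compare the typed answer to the expected captcha text."""
    if expected is None or answer is None:
        return False
    # Assumption: ignore case and surrounding whitespace to be forgiving.
    return answer.strip().lower() == expected.strip().lower()


def process_comment(flagged, expected=None, answer=None):
    """Return "posted" or "ignored" for a guest comment.

    flagged  -- True if the comment contained flag words
    expected -- the text shown in the captcha image, if one was shown
    answer   -- what the guest typed, if anything
    """
    # Unflagged comments skip the captcha entirely.
    if not flagged:
        return "posted"
    # Flagged comments are posted only on a correct captcha answer.
    if captcha_passes(expected, answer):
        return "posted"
    return "ignored"


print(process_comment(flagged=False))
print(process_comment(flagged=True, expected="tabulas", answer="Tabulas "))
print(process_comment(flagged=True, expected="tabulas", answer="wrong"))
```

Failed attempts are silently dropped in this sketch rather than stored, matching the "otherwise it's ignored" behavior.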

So if you see any weird things happening with comments, please let me know.

Currently listening to: Leona Naess - Calling

Posted by roy on March 2, 2006 at 01:30 AM in Web Development, Tabulas | 9 Comments

dwooillk

Comment posted on March 9th, 2006 at 01:19 AM

I especially hate the captcha images where you can't tell the difference between the "O" and "0" (zero)..

Leedar

Comment posted on March 3rd, 2006 at 07:19 PM

So I suppose all the blind folks must be using screen readers now (and eventually even those won't help as the captcha inevitably becomes distorted)? So much for aural browsers.

Tallullah

Comment posted on March 2nd, 2006 at 09:53 PM

A weird thing did happen with a comment. As everyone does, I get an email sent to me whenever someone comments on my posts. A couple of days ago I received a "guest" comment that was full of links and junk. Yet when I tried to look for it to delete it, I couldn't find it. Not that I want it, mind you, just that it was a little weird. :)

roy

Comment posted on March 2nd, 2006 at 10:59 PM

Actually, the past few days I had been very proactive in monitoring (in a reactive fashion). You probably did get the comment spam, then I probably went in and deleted it before you could :)

Now the comment spam gets blocked before it even gets posted.

Tallullah

Comment posted on March 4th, 2006 at 10:04 PM

Ah, that's what I like: a man who takes out the trash without being asked! ;-) :-D

hapy

Comment posted on March 2nd, 2006 at 08:04 AM

marinol.

tonylee

Comment posted on March 2nd, 2006 at 06:53 AM

viagra, cialis, vicodine, xanax, zoloft, casino, blackjack, backgammon, hgh, and a few more I can't remember off the top of my head.