I turned my project into a tool you can use to check whether a domain name looks spammy. I'm not making any guarantees with it, though. It could be a nifty tool if you want a high-quality domain name, or at least one that is not obviously spammy:
UPDATE: There were a few bugs in the earlier version, I've fixed those (but not the other ones I don't know about yet), so now things should be (at least slightly) better!
You probably understand the basic idea of spam detection:
- The engines (and surfers) don't like the spam pages.
- Enter the PhD-types with their fancy models, support from the engines with their massive data-centers, funding for advanced research, and a whole lot more smarts than I've got.
On the one hand you could label everything as non-spam and you would never have a false positive. This is the point on the graph where x=0 and y=0. On the other hand, you could just label everything as spam and no SERPs would contain any spam pages. Of course, there would be no pages in any SERP at all, and this whole industry would have to go to back-up plans. I'd be a roadie for my wife's Rock Band tours.
Here is a plot illustrating the trade-offs, adapted by me from the AIRWeb 2008 conference results, including my own fancy algorithm. "y=x" is the baseline random classifier (you'd better be smarter than this one!)
see the actual results
Clearly, the graph shows that I totally rocked :) Ignore for a minute the line labeled "SEOmoz - fair"; we'll come back to it. As you can see, at a false positive rate of 10% (0.10), I was able to successfully label over 50% of the spam pages, outperforming the worst algorithm from the workshop (Skvortsov at ~39%) and performing nearly as well as the best (Geng, et al. at ~55%). My own algorithm, SpamDilettante® (patent pending!), developed in just two days with only the help of our very own totally amazing intern Danny Dover and secret dev weapon Ben Hendrickson, has outperformed some of the best and brightest researchers in the field of adversarial information retrieval.
Well, graphs lie. And so do I. Let me explain what's going on here. First of all, my algorithm really does classify spam. And I really did develop it in just two days without using a link-graph, extracting complex custom web features, or racking up many days or months of compute cluster time. But there are some important caveats I'll get to, and these are illustrated by the much worse line called "fair" above.
What I did was begin with one of Rand's blog posts (not one of his most popular, but one that's actually filled with excellent content; see the graph above). Most of the signals it describes I couldn't actually compute very easily:
- High ratio of ad blocks to content
- Small amounts of unique content
- Very few direct visits
- Less likely to have links from trusted sources
- Unlikely to register with Google/Yahoo!/MSN Local Services
- Many levels of links away from highly trusted websites
- Cloaking based on user-agent or IP address is common
However, some of the things I could get just from the domain name:
- Long domain names
- .info, .cc, .us and other cheap, easy to grab TLDs
- Use of common, high-commercial value spam keywords in the domain name
- More likely to contain multiple hyphens in the domain name
- Less likely to have .com or .org extensions
- Almost never have .mil, .edu or .gov extensions
With the hard work done for me, I wrote a script to extract the above features (in just a few lines of Python code). I took the 2/3 of the data which was labeled as spam/non-spam and divided it into an 80% "training set" and a 20% "test set". This is important because if you can see the labels for all your data, you might as well just hard-code what's spam and what's not. And then you win (remember that "perfect" algorithm?). Anyway, I just did a linear regression on the 80% training set and got my classifier.
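Here's a minimal sketch of what that kind of script might look like. To be clear, this is not the actual SpamDilettante code; the keyword list, the exact feature set, and the helper names (domain_features, train_classifier) are all illustrative assumptions:

```python
# A rough sketch of the approach described above, not the real tool:
# turn each domain name into a handful of numeric features and fit an
# ordinary least-squares regression on the 80% training split.
import numpy as np

SPAM_KEYWORDS = ["poker", "casino", "loan", "sex"]    # hypothetical short list
CHEAP_TLDS = {"info", "cc", "us", "biz"}              # cheap, easy-to-grab TLDs
TRUSTED_TLDS = {"com", "org", "gov", "edu", "mil"}    # rarely-spammy TLDs

def domain_features(domain):
    """Turn a bare domain name into a numeric feature vector."""
    tld = domain.lower().rpartition(".")[2]
    return [
        float(len(domain)),                                 # long names correlate with spam
        float(domain.count("-")),                           # multiple hyphens
        float(sum(kw in domain for kw in SPAM_KEYWORDS)),   # high-commercial-value keywords
        1.0 if tld in CHEAP_TLDS else 0.0,
        1.0 if tld in TRUSTED_TLDS else 0.0,
        1.0,                                                # bias term for the regression
    ]

def train_classifier(domains, labels, train_fraction=0.8):
    """Least-squares regression on the training split; returns weights plus the held-out test set."""
    X = np.array([domain_features(d) for d in domains])
    y = np.array(labels, dtype=float)                  # 1.0 = spam, 0.0 = non-spam
    n_train = int(train_fraction * len(domains))
    weights, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return weights, X[n_train:], y[n_train:]
```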
To get performance numbers I ran my classifier on the reserved 20% test set. Basically it spewed out a bunch of numbers like "0.87655", which you can think of as probabilities. To get the curve above, I tried a series of thresholds (e.g. IF prob > 0.7 THEN spam ELSE not spam). Each threshold trades false positives against false negatives, and sweeping it gives me the curves above.
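And the threshold sweep itself, roughly (again just a sketch, continuing the hypothetical train_classifier above rather than the tool's real evaluation code):

```python
# Score the held-out test set, then compute one (false positive rate,
# true positive rate) point per threshold to trace out the ROC curve.
import numpy as np

def roc_points(weights, X_test, y_test, thresholds=np.linspace(0.0, 1.0, 21)):
    scores = X_test @ weights              # regression outputs, read loosely as probabilities
    points = []
    for t in thresholds:
        predicted_spam = scores > t        # e.g. IF prob > 0.7 THEN spam ELSE not spam
        fp = np.sum(predicted_spam & (y_test == 0))
        tp = np.sum(predicted_spam & (y_test == 1))
        fpr = fp / max(np.sum(y_test == 0), 1)   # x-axis: fraction of good pages flagged
        tpr = tp / max(np.sum(y_test == 1), 1)   # y-axis: fraction of spam pages caught
        points.append((float(fpr), float(tpr)))
    return points
```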
And that's the story of how I beat the academicians.
O.K., back to reality for a moment; on to the caveats.
- As I pointed out in the introduction to the competition, my data set presents a much simpler classification problem (complete label coverage and almost no class imbalance)
- As Rebecca says, "it's just one of those 'common SEO knowledge' things: .info, .biz, a lot of .net [are spam]," and my dataset includes a lot of these "easy targets". The competition data is all .uk (and mostly .co.uk)
- My dataset is awfully small and likely has all kinds of sampling problems. My results probably do not generalize.
As E. Garcia said in his original post which started me on this, "it is time to revisit [the] drawing board."
Can someone tell me where to fill out the Spam Report for this guy?
Classic! Way to go Sean :o)
Well Nick, if there's any consolation, on a true positive note, you were trending better and closing the gap at false positive rates between .15 and .20. That has to count for something!
In a bit of irony, I put your Spam Detection Tool's own URL through the tool itself.
Yes - your tool considers itself Spammy. ;)
The URL spam detector is a very interesting idea. I was pleased to see that my own domain and my current hot project domains passed with nice low scores. But while it is interesting, there isn't too much of actual value right now.

My primary business provides web service tools, including email campaign management. Our terms of use require that email only be sent to users who have opted in, but there are violations, and we terminate customers who cause spam problems. I checked the domain names of sites that have generated spam complaints, and while their scores were a bit higher, they still passed. The 'spam' sites do look good and professional, but some are in industries that are known generators of spam email (the mortgage industry, for example), with spam defined here as unwanted email, not the kind where everything is forged to hide who sent it and where it came from.

I realize that the tool was more of an experiment and only looks at the URL and basic components of the site, but if it were extended a bit and could predict the sites that might engage in spammy behavior, that would be very powerful.
I think you're completely correct. This is an interesting idea, and that's what AIRWeb is all about. If you're interested in learning more there's a whole host of papers on the subject. One particularly interesting one is "Web Spam Taxonomy", by Zoltán Gyöngyi and Hector Garcia-Molina. This paper lays out some of the features these researchers are looking at, without introducing too many mathematical formulae.
I like the tool. Mostly because when I put some of my more spammy sounding domains through it - they come up clean.
I think as Sean has more or less pointed out with his examples - it is mostly relevant when dealing with the root site domain, and not individual posts or directories.
Even this very post would be considered spammy according to the tool (I imagine because of length and hyphens).
And it also seems that adding a trailing forward slash trips some spam flags.
Try putting the following into the tool:
www.seomoz.org
www.seomoz.org/
The second one comes up as spammy. And the only factor that seems to change is the subdomain depth.
Yeah, it's supposed to only look at the domain name; I've fixed the issue.
But uh... as per the blog post, this is at least a little bit tongue-in-cheek, so take the results with a grain of salt...
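For the curious: one plausible way to make a tool like this ignore trailing slashes and paths (a sketch of the general idea, not necessarily the actual fix) is to normalize whatever gets typed in down to a bare hostname before scoring:

```python
# Reduce user input to a bare hostname so a trailing slash or path
# can't change the score. Illustrative only, not the tool's real code.
from urllib.parse import urlparse

def normalize_to_hostname(raw):
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw      # urlparse only fills in the netloc when a scheme is present
    return urlparse(raw).hostname or ""

# normalize_to_hostname("www.seomoz.org") and normalize_to_hostname("www.seomoz.org/")
# both return "www.seomoz.org".
```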
Nick, I think everyone here commends your effort. I just like seeing all the false positives.
www.porn-sex-poker-games-casino-drugs-kill-kill.com
score: 1.22
What interests me here is the application to SEO and webdev. If URLs and domain names really can be correlated nicely to spam detection, it means that those of us who build legitimate sites probably have a number of best practices to consider in order to help avoid being classified incorrectly.
Aaron Wall had a good post about passing a manual review by Google, but this brings up the other side - passing an automated review.
I'm also insanely curious about what Google's false positive rate might be...
Hi.
Very curious about your URL spam detection algorithm. How can it be that the deeper your subdomain chain is, the lower the probability of spam?
I mean, is spam.spam.spam.nospam.com really less spammy than spam.nospam.com? Please elaborate.
Sorry to have to write this, but your tool is not worth the bytes necessary to create it.
It misses the semantics and behaviours of different languages. And I cannot find a reason for my name (my real name) to be scored as spammy, since my company has the same name.
My name has 5 letters. My TLD is .biz, which scores .24 worse than .com, so I end up at ~.58 instead of ~.34. Where is the relevance in that?
Your algorithm is gimmicky at best, but I would call it nonsense.
You should scrap this page, or make it exclusively for .com domains :-D
Never mind, but it had to be said.
Regards, Marc
Are .com and .org domains better than .net? Say you run the tool for perfume.com and perfume.net: the second one is classified as over 0.2 points more spammy than the first one...
Hey guys, I have to say that is a pretty awesome thing to see in regards to a breakdown by element and how each one gets factored into an algorithm tool like this. Obviously, to get even more precise detection, many more variables and more math will need to be applied, but Rome wasn't built overnight, so it will take testing, fine-tuning, and definitely new spam discoveries (along with what current search engines are known to factor in) for the tool to be optimal.
I've looked far and wide, and this is definitely one of the best and simplest detectors around. It could, however, better explain the breakdowns of spam so that beginner/intermediate SEO practitioners are able to learn from the tool.
my 2 cents.
Thanks for the feedback. You're right, more features would help. What we've found is that these URL features catch a lot of spam, but not much of the spam most users will ever see. Currently it doesn't look at anything except the URL itself: no page content, no title tags, no links, nothing.
All the features considered are listed in the tool. This was a short project, but if we start pushing more of this kind of analysis we will be sure to include beginner/intermediate documentation/discussion.
Nick, I might have missed it somewhere, but why does my URL go spammy if I drop the www and just go https://? I've been heavily promoting the latter for months :s
As you may have guessed, Nick, you're telling me I'm spammy, and I'm feeling like a douche bag for opting for two hyphens when selecting my domain over 3 years ago. Where's the nearest bridge :s
All of this is statistical correlation with all domain names we know about (or knew about at the time of writing this post).
At that time it helped to have a 'www'. But I wouldn't worry too much about that aspect. There are always exceptions to the rules. Unless you're embedding frequently spammed terms or using a bunch of hyphens in the hope that search engines will see your keywords, you're probably in good shape.
Personally I prefer sites that have the www, and people more often than not link to the www version (we have evidence that shows this is true). Dropping it just adds another redirect that your users, your server, and the search engines have to deal with.
Lol, yes, you're right. As above: hyphens and non-www. Dufus here! Doh
Nick,
You give me far too much credit! I really appreciate it and look forward to doing more work with you in the future.
I must say it's great to see ROC curves being used outside the context where I'm used to them. They are a really nice way of showing the tradeoff between sensitivity and specificity.
What are you using as your list of common, high-commercial value spam keywords?
It's not a big secret, words like "poker", "loan", "sex", and the like. I'd publish the list, but it is neither novel nor is it family friendly.
I do understand that having "sex" in your domain name doesn't mean you're a spammer. It's just statistically correlated with spam and helps me to achieve my ("poor" to "mediocre") 0.68 AUC score.
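If you're wondering where that AUC number comes from, it's just the area under the ROC curve from the post, which you can approximate from the threshold-sweep points with the trapezoid rule. A rough sketch, reusing the hypothetical roc_points() above rather than our actual evaluation code:

```python
def auc_from_points(points):
    """Trapezoid-rule area under the ROC curve defined by (fpr, tpr) points."""
    pts = sorted(points)                       # order points by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0    # trapezoid between adjacent points
    return area
```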
Nick,
Please take no offense at all. I think we all understand your intent and are just having a bit of fun with the tool you created.
That said, you've raised interest in something that is pretty unique and interesting.
As for carfeu, he was just disappointed that his family business came back with a spammy score. ;)
The subject has definitely captured my curiosity. With just a cursory review, it seems the most immediate problem is with the weight given to domain length and hyphens. That probably places about 90% of all blog posts in the category of "Spammy". A few tweaks here and there and this could be a helpful tool.
There are a few bugs to work out with this "tool" which are giving some bad results, worse even than it would give naturally.
Alright, I fixed a couple of the bugs which made it not actually work at all. Thanks for the bug reports.
Congrats! That's a pretty wild guess when it comes to other languages, though. On average, English needs fewer characters to form words than German or Swedish, and the Nordic languages use long words to express a single meaning. Thus, the chance of being flagged as spammy increases with the language chosen and with regional domains. I tested a couple of .de, .dk, and .se sites that are great sources of information in the educational market, and your tool red-flags them as SPAMMY. Keep it going though, our community needs such a tool!