I turned my project into a tool you can use to check whether a domain name looks spammy. I'm not making any guarantees with it, though. It could be a nifty tool if you want a high-quality domain name, or at least one that is not obviously spammy:
UPDATE: There were a few bugs in the earlier version, I've fixed those (but not the other ones I don't know about yet), so now things should be (at least slightly) better!
You probably understand the basic idea of spam detection:
- The engines (and surfers) don't like the spam pages.
- Enter the PhD-types with their fancy models, support from the engines with their massive data-centers, funding for advanced research, and a whole lot more smarts than I've got.
On the one hand you could label everything as non-spam and you would never have a false positive. This is the point on the graph where x=0 and y=0. On the other hand, you could just label everything as spam and no SERPs would contain any spam pages. Of course, there would be no pages in any SERP at all, and this whole industry would have to go to back-up plans. I'd be a roadie for my wife's Rock Band tours.
Here is a plot illustrating the trade-offs, adapted by me from the AIRWeb 2008 conference results, including my own fancy algorithm. "y=x" is the baseline random classifier (you'd better be smarter than this one!)
see the actual results
Clearly, the graph shows that I totally rocked :) Ignore for a minute the line labeled "SEOmoz - fair"; we'll come back to it. As you can see, at a false positive rate of 10% (0.10), I was able to successfully label over 50% of the spam pages, outperforming the worst algorithm from the workshop (Skvortsov at ~39%) and performing nearly as well as the best (Geng, et al. at ~55%). My own algorithm, SpamDilettante® (patent pending!), developed in just two days with only the help of our very own totally amazing intern Danny Dover and secret dev weapon Ben Hendrickson, has outperformed some of the best and brightest researchers in the field of adversarial information retrieval.
Well, graphs lie. And so do I. Let me explain what's going on here. First of all, my algorithm really does classify spam. And I really did develop it in just two days without using a link-graph, extracting complex custom web features, or racking up many days or months of compute cluster time. But there are some important caveats I'll get to, and these are illustrated by the much worse line called "fair" above.
What I did was begin with one of Rand's blog posts (not one of his most popular, but one that's actually filled with excellent content; see the graph above). Most of the signals it describes I couldn't actually compute very easily:
- High ratio of ad blocks to content
- Small amounts of unique content
- Very few direct visits
- Less likely to have links from trusted sources
- Unlikely to register with Google/Yahoo!/MSN Local Services
- Many levels of links away from highly trusted websites
- Cloaking based on user-agent or IP address is common
However, some of the things I could get just from the domain name:
- Long domain names
- .info, .cc, .us and other cheap, easy to grab TLDs
- Use of common, high-commercial value spam keywords in the domain name
- More likely to contain multiple hyphens in the domain name
- Less likely to have .com or .org extensions
- Almost never have .mil, .edu or .gov extensions
With the hard work done for me, I wrote a script to extract the above features (in just a few lines of Python code). I took the 2/3 of the data which was labeled as spam/non-spam and divided it into an 80% "training set" and a 20% "test set". This is important because if you can see the labels for all your data, you might as well just hard-code what's spam and what's not. And then you win (remember that "perfect" algorithm?). Anyway, I just did a linear regression on the 80% training set and got my classifier.
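Here's a minimal sketch of what that kind of script might look like. To be clear, this is not the actual SpamDilettante code; the keyword list, the exact feature set, and the helper names (domain_features, train_classifier) are all illustrative assumptions:

```python
# A rough sketch of the approach described above, not the real tool:
# turn each domain name into a handful of numeric features and fit an
# ordinary least-squares regression on the 80% training split.
import numpy as np

SPAM_KEYWORDS = ["poker", "casino", "loan", "sex"]    # hypothetical short list
CHEAP_TLDS = {"info", "cc", "us", "biz"}              # cheap, easy-to-grab TLDs
TRUSTED_TLDS = {"com", "org", "gov", "edu", "mil"}    # rarely-spammy TLDs

def domain_features(domain):
    """Turn a bare domain name into a numeric feature vector."""
    tld = domain.lower().rpartition(".")[2]
    return [
        float(len(domain)),                                 # long names correlate with spam
        float(domain.count("-")),                           # multiple hyphens
        float(sum(kw in domain for kw in SPAM_KEYWORDS)),   # high-commercial-value keywords
        1.0 if tld in CHEAP_TLDS else 0.0,
        1.0 if tld in TRUSTED_TLDS else 0.0,
        1.0,                                                # bias term for the regression
    ]

def train_classifier(domains, labels, train_fraction=0.8):
    """Least-squares regression on the training split; returns weights plus the held-out test set."""
    X = np.array([domain_features(d) for d in domains])
    y = np.array(labels, dtype=float)                  # 1.0 = spam, 0.0 = non-spam
    n_train = int(train_fraction * len(domains))
    weights, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return weights, X[n_train:], y[n_train:]
```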
To get performance numbers I ran my classifier on the reserved 20% test set. Basically it spewed out a bunch of numbers like "0.87655", which you can think of as probabilities. To get the curve above, I tried a series of thresholds (e.g. IF prob > 0.7 THEN spam ELSE not spam). Each threshold trades false positives against false negatives, and sweeping it gives me the curves above.
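And the threshold sweep itself, roughly (again just a sketch, continuing the hypothetical train_classifier above rather than the tool's real evaluation code):

```python
# Score the held-out test set, then compute one (false positive rate,
# true positive rate) point per threshold to trace out the ROC curve.
import numpy as np

def roc_points(weights, X_test, y_test, thresholds=np.linspace(0.0, 1.0, 21)):
    scores = X_test @ weights              # regression outputs, read loosely as probabilities
    points = []
    for t in thresholds:
        predicted_spam = scores > t        # e.g. IF prob > 0.7 THEN spam ELSE not spam
        fp = np.sum(predicted_spam & (y_test == 0))
        tp = np.sum(predicted_spam & (y_test == 1))
        fpr = fp / max(np.sum(y_test == 0), 1)   # x-axis: fraction of good pages flagged
        tpr = tp / max(np.sum(y_test == 1), 1)   # y-axis: fraction of spam pages caught
        points.append((float(fpr), float(tpr)))
    return points
```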
And that's the story of how I beat the academicians.
O.K., back to reality for a moment; on to the caveats.
- As I pointed out in the introduction to the competition, my data set presents a much simpler classification problem (complete label coverage and almost no class imbalance)
- As Rebecca says, "it's just one of those 'common SEO knowledge' things: .info, .biz, a lot of .net [are spam]," and my dataset includes a lot of these "easy targets". The competition data is all .uk (and mostly .co.uk)
- My dataset is awfully small and likely has all kinds of sampling problems. My results probably do not generalize.
As E. Garcia said in his original post which started me on this, "it is time to revisit [the] drawing board."
Can someone tell me where to fill out the Spam Report for this guy?
Classic! Way to go Sean :o)
Well Nick, if there's any consolation, on a true positive note, you were trending better and closing the gap at false positive rates between .15 and .20. That has to count for something!
In a bit of irony, I put your Spam Detection Tool's own URL through the tool itself.
Yes - your tool considers itself Spammy. ;)
The URL spam detector is a very interesting idea. I was pleased to see that my own domain and my current hot project domains passed with nice low scores. But while it is interesting, there isn't too much of actual value right now.

My primary business provides web service tools, including email campaign management. Our terms of use require that email only be sent to users who have opted in, but there are violations, and we terminate customers who cause spam problems. I checked the domain names of sites that have generated spam complaints, and while their scores were a bit higher, they still passed. The 'spam' sites do look good and professional, but some are in industries that are known generators of spam email (the mortgage industry, for example), with spam defined here as unwanted email, not the kind where everything is forged to hide who sent it and where it came from.

I realize that the tool was more of an experiment and only looks at the URL and basic components of the site, but if it were extended a bit and could predict the sites that might engage in spammy behavior, that would be very powerful.
I think you're completely correct. This is an interesting idea, and that's what AIRWeb is all about. If you're interested in learning more there's a whole host of papers on the subject. One particularly interesting one is "Web Spam Taxonomy", by Zoltán Gyöngyi and Hector Garcia-Molina. This paper lays out some of the features these researchers are looking at, without introducing too many mathematical formulae.
I like the tool. Mostly because when I put some of my more spammy sounding domains through it - they come up clean.
I think as Sean has more or less pointed out with his examples - it is mostly relevant when dealing with the root site domain, and not individual posts or directories.
Even this very post would be considered spammy according to the tool (I imagine because of length and hyphens).
And it also seems that adding a trailing forward slash trips some spam flags.
Try putting the following into the tool:
www.seomoz.org
www.seomoz.org/
The second one comes up as spammy. And the only factor that seems to change is the subdomain depth.
Yeah, it's supposed to only look at the domain name; I've fixed the issue.
But uh... as per the blog post, this is at least a little bit tongue-in-cheek, so take the results with a grain of salt...
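For the curious: one plausible way to make a tool like this ignore trailing slashes and paths (a sketch of the general idea, not necessarily the actual fix) is to normalize whatever gets typed in down to a bare hostname before scoring:

```python
# Reduce user input to a bare hostname so a trailing slash or path
# can't change the score. Illustrative only, not the tool's real code.
from urllib.parse import urlparse

def normalize_to_hostname(raw):
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw      # urlparse only fills in the netloc when a scheme is present
    return urlparse(raw).hostname or ""

# normalize_to_hostname("www.seomoz.org") and normalize_to_hostname("www.seomoz.org/")
# both return "www.seomoz.org".
```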
Nick, I think everyone here commends your effort. I just like seeing all the false positives.
www.porn-sex-poker-games-casino-drugs-kill-kill.com
score: 1.22
What interests me here is the application to SEO and webdev. If URLs and domain names really can be correlated nicely to spam detection, it means that those of us who build legitimate sites probably have a number of best practices to consider in order to help avoid being classified incorrectly.
Aaron Wall had a good post about passing a manual review by Google, but this brings up the other side - passing an automated review.
I'm also insanely curious about what Google's false positive rate might be...
Hi.
Very curious about your URL spam detection algorithm. How can it be that the deeper your subdomain chain is, the lower the probability of spam?
I mean, is spam.spam.spam.nospam.com really less spammy than spam.nospam.com? Please elaborate.
Sorry to have to write this, but your tool is not worth the bytes necessary to create it.
It misses the semantics and behaviours of different languages. And I cannot find a reason for my name (my real name) to be scored as spammy, since my company has the same name.
My name has 5 letters. My TLD is .biz, which scores .24 worse than .com, so I end up at ~.58 instead of ~.34. Where is the relevance in that?
Your algorithm is gimmicky at best, but I would call it nonsense.
You should scrap this page, or make it exclusively for .com domains :-D
Never mind, but it had to be said.
Regards, Marc
Are .com and .org domains better than .net? Say you run the tool for perfume.com and perfume.net: the second one is classified as over 0.2 points more spammy than the first one...
Hey guys, I have to say that is a pretty awesome thing to see in regards to a breakdown by element and how each one gets factored into an algorithm tool like this. Obviously, to get even more precise detection, many more variables and more math will need to be applied, but Rome wasn't built overnight, so it will take testing, fine-tuning, and definitely new spam discoveries (along with what current search engines are known to factor in) for the tool to be optimal.
I've looked far and wide, and this is definitely one of the best and simplest detectors around. It could, however, better explain the breakdowns of spam so that beginner/intermediate SEO practitioners are able to learn from the tool.
my 2 cents.
Thanks for the feedback. You're right, more features would help. What we've found is that these URL features catch a lot of spam, but not much of the spam most users will ever see. Currently it doesn't look at anything except the URL itself: no page content, no title tags, no links, nothing.
All the features considered are listed in the tool. This was a short project, but if we start pushing more of this kind of analysis we will be sure to include beginner/intermediate documentation/discussion.
Nick, I might have missed it somewhere, but why does my URL go spammy if I drop the www and just go https://? I've been heavily promoting the latter for months :s
As you may have guessed, Nick, you're telling me I'm spammy, and I'm feeling like a douche bag for opting for two hyphens when selecting my domain over 3 years ago. Where's the nearest bridge :s
All of this is statistical correlation with all domain names we know about (or knew about at the time of writing this post).
At that time it helped to have a 'www'. But I wouldn't worry too much about that aspect. There are always exceptions to the rules. Unless you're embedding frequently spammed terms or using a bunch of hyphens in the hope that search engines will see your keywords, you're probably in good shape.
Personally I prefer sites that have the www, and people more often than not link to the www version (we have evidence that shows this is true). Dropping it just adds another redirect that your users, your server, and the search engines have to deal with.
Lol, yes, you're right. As above: hyphens and non-www. Dufus here! Doh
Nick,
You give me far too much credit! I really appreciate it and look forward to doing more work with you in the future.
I must say it's great to see ROC curves being used outside the context where I'm used to them. They are a really nice way of showing the tradeoff between sensitivity and specificity.
What are you using as your list of common, high-commercial value spam keywords?
It's not a big secret, words like "poker", "loan", "sex", and the like. I'd publish the list, but it is neither novel nor is it family friendly.
I do understand that having "sex" in your domain name doesn't mean you're a spammer. It's just statistically correlated with spam and helps me to achieve my ("poor" to "mediocre") 0.68 AUC score.
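If you're wondering where that AUC number comes from, it's just the area under the ROC curve from the post, which you can approximate from the threshold-sweep points with the trapezoid rule. A rough sketch, reusing the hypothetical roc_points() above rather than our actual evaluation code:

```python
def auc_from_points(points):
    """Trapezoid-rule area under the ROC curve defined by (fpr, tpr) points."""
    pts = sorted(points)                       # order points by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0    # trapezoid between adjacent points
    return area
```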
Nick,
Please take no offense at all. I think we all understand your intent and are just having a bit of fun with the tool you created.
That said, you've raised interest in something that is pretty unique and interesting.
As for carfeu, he was just disappointed that his family business came back with a spammy score. ;)
The subject has definitely captured my curiosity. With just a cursory review, it seems the most immediate problem is with the weight given to domain length and hyphens. That probably places about 90% of all blog posts in the category of "Spammy". A few tweaks here and there and this could be a helpful tool.
There are a few bugs to work out with this "tool" which are giving some bad results, worse even than it would give naturally.
Alright, I fixed a couple of the bugs which made it not actually work at all. Thanks for the bug reports.
Congrats! That's a pretty wild guess when it comes to other languages, though. On average, English needs fewer characters to form words than German or Swedish, and the Nordic languages use long words to express a single meaning. Thus, the chance of being flagged as spammy increases with the language chosen and with regional domains. I tested a couple of .de, .dk, and .se sites that are great sources of information in the educational market, and your tool red-flags them as SPAMMY. Keep it going though, our community needs such a tool!