Let's try a little exercise...
Common features of spam domains include:
- Long domain names
- .info, .cc, .us and other cheap, easy to grab TLDs
- Short registration period (1 year, maybe 2)
- High ratio of ad blocks to content
- Javascript redirects from initial landing pages
- Use of common, high-commercial value spam keywords like "mortgage," "poker," "texas hold 'em," "porn," "student credit cards," and related terms
- Many links to other low quality, spam sites
- Few links to high quality, trusted sites
- High keyword frequencies and keyword densities
- Small amounts of unique content
- Very few direct visits
- Very few links sent out in (non-spam) email to the site
- Registered to people/entities not associated with trusted sites
- Not frequently registered with services like Yahoo! Site Explorer, Google Webmaster Central or Live Webmaster Tools
- Rarely have short, high value domain names
- Often contain many keyword-stuffed subdomains
- More likely to have longer domain names
- More likely to contain multiple hyphens in the domain name
- Less likely to have links from trusted sources
- Less likely to have SSL Security certificates
- Less likely to be in directories like DMOZ, Yahoo!, Librarian's Internet Index, etc.
- Unlikely to have any significant quantity of branded searches
- Unlikely to be bookmarked in services like My Yahoo!, Del.icio.us, Faves.com, etc.
- Unlikely to get featured in social voting sites like Digg, Reddit, Yahoo! Buzz, StumbleUpon, etc.
- Unlikely to have channels on YouTube, communities on Facebook or links from Wikipedia
- Unlikely to be mentioned on major news sites (either with or without link attribution)
- Unlikely to register with Google/Yahoo!/MSN Local Services
- Unlikely to have a legitimate physical address/phone number on the website
- Likely to have the domain associated with emails on blacklists
- Often contain a large number of snippets of "duplicate" content found elsewhere on the web
- Unlikely to contain unique content in the form of PDFs, PPTs, XLSs, DOCs, etc.
- Frequently feature commercially focused content
- Many levels of links away from highly trusted websites
- Rarely contain privacy policy and copyright notice pages
- Rarely listed in Better Business Bureau's Online Directory
- Rarely contains high grade level text content (as measured by metrics like Flesch-Kincaid Reading Level)
- Rarely have small snippets of text quoted on other websites and pages
- Cloaking based on user-agent or IP address is common
- Rarely contain paid analytics tracking software
- Rarely have online or offline marketing campaigns
- Rarely have affiliate link programs pointing to them
- Less likely to have .com or .org extensions
- Almost never have .mil, .edu or .gov extensions
- Rarely have links from domains with .edu or .gov extensions
- Almost never have links from domains with .mil extensions
- Rarely receive high quantities of monthly visits
- Rarely have visits lasting longer than 30 seconds
- Rarely have visitors bookmarking their domains in the browser
- Unlikely to buy significant quantities of PPC ad traffic
- Rarely have banner ad media buys
- Likely to have links to a significant portion of the sites and pages that link to them
- Extremely unlikely to be mentioned or linked-to in scientific research papers
- Unlikely to use expensive web technologies (Microsoft Server & Coding Products that Require a Licensing Fee)
- Likely to be registered by parties who own a very large number of domains
- Unlikely to attract significant return traffic
- More likely to contain malware, viruses or spyware (or any automated downloads)
For high quality content domains, the opposite is true (at least, for a good percentage of these). Now think about the sites you're building - which features apply to them? What could you do differently to be more like the "high quality" category and less like the "spam"?
BTW - Love to hear your take on features you think are common to spam, or to high quality sites.
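Several of the items above (long names, multiple hyphens, cheap TLDs, piles of subdomains) can be read straight off the hostname. Here's a rough sketch of how that might look in Python. The `CHEAP_TLDS` set, the `DomainSignals` fields and the naive "last label is the TLD" split are all assumptions made for illustration, not anything a search engine is known to use.

```python
from dataclasses import dataclass

# Hypothetical set of "cheap" TLDs, taken from the examples in the list above.
CHEAP_TLDS = {"info", "cc", "us"}


@dataclass
class DomainSignals:
    name_length: int      # characters in the label just left of the TLD
    hyphen_count: int     # hyphens in that label
    subdomain_count: int  # labels in front of it ("www" counts as one)
    cheap_tld: bool       # TLD appears in CHEAP_TLDS


def domain_signals(hostname: str) -> DomainSignals:
    """Extract simple name-based signals from a hostname.

    Naive split: assumes a single-label TLD such as .com or .info, so
    hostnames under two-part suffixes like .co.uk are not handled correctly.
    """
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError("expected a hostname with at least a name and a TLD")
    tld = labels[-1]
    name = labels[-2]
    return DomainSignals(
        name_length=len(name),
        hyphen_count=name.count("-"),
        subdomain_count=len(labels) - 2,
        cheap_tld=tld in CHEAP_TLDS,
    )


print(domain_signals("cheap-texas-hold-em-poker.info"))
print(domain_signals("www.seomoz.org"))
```

None of these signals means much alone; they only become interesting when several show up together with the content and link signals from the rest of the list.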
I will play devil's advocate here...
ways that Google might disagree with many of your examples
Few links to high quality, trusted sites:
Most scraper sites consist entirely of a paragraph of scraped text that then links to the site it was scraped from (high quality sites).
Very few direct visits:
A website that offers information to nuns in a small village probably has few direct visits, yet is still a quality site. A million such examples could also be included. Traffic levels are indicative of interest, not quality.
Not frequently registered with services like Yahoo! Site Explorer, Google Webmaster Central or Live Webmaster Tools:
I would say spammy sites are more likely to be registered to such services
Unilkely to get featured in social voting sites like Digg, Reddit, Yahoo! Buzz, StumbleUpon, etc.:
I would say spammy sites are more likely to be featured
Often contain a large number of snippets of "duplicate" content found elsewhere on the web:
Google itself consists entirely of snippets of "duplicate" content found elsewhere on the web, as do directories, reference sites, dictionaries, university websites, weather sites, public notice sites, government sites, federal, state, and city sites, etc.
Frequently feature commercially focused content:
So do commercial websites like amazon, ebay, etc.
Rarely contains high grade level text content (as measured by metrics like Flesch-Kincaid Reading Level):
I suck huge at grammar, but I do not think feedthebot is a spam site, and neither does Google. A school site that publishes third grader short stories wouldn't be considered spam either. Poetry is often not grammatically correct, poetry is spam? Or how about this post you just wrote, it uses the word "Unilkely" instead of "Unlikely" but I do not think this post is spam.
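For reference, the Flesch-Kincaid grade level is computed only from average sentence length and average syllables per word, so it does not look at grammar or spelling at all. A minimal sketch in Python follows. The formula constants are the standard published ones; the vowel-run syllable counter and the regex-based sentence splitter are crude assumptions of this sketch, so expect the numbers to differ from what real readability tools report.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive sentence split on ., ! and ? (abbreviations will be miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


simple = "The cat sat. The dog ran."
dense = "Algorithmic classification necessitates comprehensive evaluation."
print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

Short sentences of one-syllable words score very low and a single sentence of long words scores very high, whatever the quality of the writing. That arguably supports the point above: a low score describes the style of a text, not whether it is spam.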
Rarely receive high quantities of monthly visits:
Again traffic quantities are in no way indicative of quality, they are indicative of interest.
Unlikely to use expensive web technologies (Microsoft Server & Coding Products that Require a Licensing Fee):
The majority, the vast majority, of websites on the web do not use these.
Return traffic isn't a big factor for a number of markets.
I'd love to know how Rand knows which sites are (not) registered with Y!SE, GWC, and LWT.
I'm not sure why my accountant would have a Facebook community or a YouTube Channel.
Google would also probably say that their Analytics isn't a spam sign either.
Meh, you don't need to be Devil's Advocate to find exceptions to most of those points really.
Those are some mighty fine points there Pat! Gave you a thumb up :)
I would say it is impossible to decide if a site is spam through one fault, everything is at fault, it's part of being human :) But if a site triggers X amount of what Rand mentions then that decides what spam is.
a bit like how each point is a filter, trip too many filters and the big G will trip you up!
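The "trip too many filters" idea can be sketched as a simple count against a threshold. Everything in this sketch is hypothetical: the field names on the `site` dict, the three example checks and the threshold of 2 are made up for illustration, and nobody outside the search engines knows what the real filters or weights look like.

```python
from typing import Callable, Dict, List

Check = Callable[[dict], bool]

# Hypothetical filters, loosely based on items from Rand's list.
CHECKS: Dict[str, Check] = {
    "short_registration": lambda s: s["registration_years"] <= 2,
    "more_ads_than_content": lambda s: s["ad_blocks"] > s["content_blocks"],
    "no_privacy_policy": lambda s: not s["has_privacy_policy"],
}


def tripped_filters(site: dict, checks: Dict[str, Check] = CHECKS) -> List[str]:
    """Return the names of every filter the site trips."""
    return [name for name, check in checks.items() if check(site)]


def looks_spammy(site: dict, threshold: int = 2) -> bool:
    """Flag a site only when it trips at least `threshold` filters."""
    return len(tripped_filters(site)) >= threshold


hobby_site = {
    "registration_years": 1,
    "ad_blocks": 0,
    "content_blocks": 10,
    "has_privacy_policy": True,
}
doorway_site = {
    "registration_years": 1,
    "ad_blocks": 8,
    "content_blocks": 2,
    "has_privacy_policy": False,
}

print(tripped_filters(hobby_site), looks_spammy(hobby_site))      # ['short_registration'] False
print(tripped_filters(doorway_site), looks_spammy(doorway_site))  # all three filters, True
```

The hobby site trips one filter (a one-year registration) and is left alone, which is the point being made here: one fault shouldn't decide anything, a pile of them might.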
Oh, yeah - this post will get a lot of comments. Pat's list and Rishil's list are both exceptional.
Rand, I think the point of this 50+ item list is that if you are more on it - than off of it - you're probably not going to be trusted. The problem is you have set the bar pretty high in so many areas.
My complaints are:
Long domain names and .info, .cc, .us and other cheap, easy to grab TLDs
It is pretty damn challenging to come up with any kind of short domain without making up a new word, either by making a portmanteau or by using something that would otherwise be gibberish (i.e. SEOmoz.com).
I think idontlikeyouinthatway.com is a very trusted domain (though not always safe for work), and yet it can be argued that it is pretty damn long.
Use of common, high-commercial value spam keywords
And just the use of mortgage or poker, etc. doesn't indicate spam. I run a handful of real estate sites, if I didn't talk about mortgages I'd be doing my clients a disservice.
Short registration period (1 year, maybe 2)
I've never registered any of my domains for longer than this period, but I've had some of them for almost 7 or 8 years. Now granted, I'm not arguing I have any high-trust sites, but I think the age of the domain should matter more than the registration period.
Unlikely to contain unique content in the form of PDFs, PPTs, XLSs, DOCs, etc.
I dislike nearly all of these things on web sites. To me they are a huge usability issue. If I am using a browser and you can present the information in HTML - then just do so. The only two on this list I feel are acceptable are PDF and PPT - and those had both better be things you presented off the web.
I do applaud this effort though. I think it makes for an interesting checklist. It would be interesting to see what would happen (and if people would link to it) if someone constructed a site guided only by doing the opposite of everything here.
Lastly, this is a page on the Baltimore Sun. To me the amount of actual content on this page (unique or not) is minimal. I have the data and ability to make 1000s of these pages; if I were to do so - would it be spammy? Should pages like this be more trusted or less trusted?
dood!
"I have the data and ability to make 1000s of these pages, if I were to do so - would it be spammy? Should pages like this be more trusted or less trusted? "
less, unless you can make them provide "unique and useful content"
Registered to people/entities not associated with trusted sites
A large number of sites on the web aren't - What could you do differently to be more like the "high quality" category and less like the "spam"? What can I do, being an individual, but assume that my site is spam?
Extremely unlikely to be mentioned or linked-to in scientific research papers - I don't think Amazon is very likely to be linked-to by the scientific research community - unless they were recommending the latest re-release of Asimov's Foundation series...
Rarely have banner ad media buys - I don't think that most sites on the net can afford these.
Unlikely to buy significant quantities of PPC ad traffic - Au contraire - PPC is one of the best ways for AdSense publishers to capitalise on the long tail and drive traffic to their sites.
Almost never have links from domains with .mil extensions - Are you serious? It's a fair point that they wouldn't, but how many sites do? We might as well add to that list "Not likely to be mentioned in the Bible"
Rand - it's the first post of yours that I am not entirely sure I would want distributed in its present form. I think you should break it into the definites, the most likely, the possible and the never...
Almost never have .mil, .edu or .gov extensions
Rarely have links from domains with .edu or .gov extensions
Those aren't really reliable metrics for websites outside of the US, who will never have .mil, .edu or .gov extensions and likely will never get links from sites with those extensions. But sites can still have quality, trusted content within their own countries and even worldwide.
Great point. I should know this, and I don't: What is the trust supposedly associated with .ac.nz, .ac.uk etc?
I think you missed 'uses AdSense' . =)
It really is hard to tell spam from non-spam in many cases. Even the NY Times operates a site (consumersearch) that in my opinion is spammy.
The biggest tell-all for me is if the DNS has no associated MX records. Pretty much every legitimate site I know of demands that at least some of the owners (webmaster, at the least) have an associated email address with that domain.
Lots of big spammers create and/or manage their own DNS without thinking to set up superfluous mx records.
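As a sketch of what that check might look like in Python: the standard library has no MX lookup, so the function below takes the lookup as a parameter. The `has_mx_records` name, the `MxLookup` type and the `FAKE_ZONE` data are all hypothetical. In practice `lookup_mx` would be a thin wrapper around a DNS library such as dnspython, which is an assumption about tooling and not part of the original comment.

```python
from typing import Callable, Dict, Iterable, List

# A lookup takes a domain and returns its mail exchanger hostnames.
# It is expected to raise LookupError when the domain cannot be resolved.
MxLookup = Callable[[str], Iterable[str]]


def has_mx_records(domain: str, lookup_mx: MxLookup) -> bool:
    """Return True if the lookup yields at least one MX host for the domain."""
    try:
        return any(True for _ in lookup_mx(domain))
    except LookupError:
        # Treat "no such domain" the same as "no MX records".
        return False


# Stand-in data so the sketch runs without touching the network.
FAKE_ZONE: Dict[str, List[str]] = {
    "example.com": ["mail.example.com"],
    "spam-network.example": [],
}


def fake_lookup(domain: str) -> List[str]:
    # KeyError is a LookupError, so unknown domains count as "no MX".
    return FAKE_ZONE[domain]


print(has_mx_records("example.com", fake_lookup))           # True
print(has_mx_records("spam-network.example", fake_lookup))  # False
```

A missing MX record is at most a hint, since plenty of legitimate domains route all their mail through a parent domain.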
If you are running a network of sites you wouldn't necessarily set up mail for every domain - this - is true for bug businesses as well. I used to work for HCA Healthcare at some point - they are the second biggest hospital group. Every hospital had its own site, own domain name - but all the email was running off the main domain.
That doesn't mean that every hospital's site wasn't a legitimate one, does it?
I think MX is another flaw in the theory.
oh... that should be "big" not "bug". Someone please install a spell checker. Please . lol.
This post makes a lot more sense now.
There's an edit button so you can correct the error yourself.
I think this post (and comments) clearly illustrates how difficult it is to algorithmically detect a spam site with anything close to 100% certainty. Many people creating a website for the first time would fall into quite a few of these traps, and even if their intent and content is good, their lack of web savvy could trigger quite a few filter flags...
Great list - very comprehensive! Google should factor some of these into its algo - in fact we know it already does with its tendency towards .mil, .edu and .gov
I've been an internet user for many years (CompuServe was the big player back then :) and I've developed sort of a 6th sense for noticing spam, especially in the SERPs - I'm sure many other people have too. There's something difficult to put into words about identifying spam in this way; you just get that spammy feeling and hey presto... you click through to a pop-up infested mortgage/gambling/porn/ebook pitch!
I've personally never been penalised for running a .info domain - in fact it was a PR5 at one point and the SERPs loved it for some reason. So I'd be quite sad if less popular TLDs were flagged as spammy without factoring in many of the additional points you list. I guess it's a matter of getting tarred with the same brush.
I work for a company that claims to do SEO but they do nothing to a client's website. They order a new URL, put up a bunch of poorly written content using as many as 100 keywords and then add a box at the top of the pages for you to click on to be taken to the client's original site. Rankings are tracked on the new URL, not the client's site. They use RSS feeds for links on these sites along with a weather tracker for the client's city. Needless to say, they have a no refund policy and clients see no benefit from their "SEO" services. Does this sound like legitimate SEO? They probably have 1500 to 2000 customers but are plagued by so many chargebacks that they cannot even get a merchant account anymore. They run all their payments through Google Checkout.
What company is this?
heh... some search engines have lots of these features. :D
What I find really interesting, and I think is touched on by almost all of the comments here, is that determining if a page is spam or not is not as simple as checking a few rules.
This reminds me a little bit of a distinction that was brought up in the standards panel by (I think) Ian McAnerin, at SMX West a few weeks ago: standards vs. guidelines. McAnerin argued that guidelines are not hard rules, but instead are common rules of thumb that, when combined (in complex ways that are rarely publicly available), adhere to the standard.
I'm arguing that spam is a standard (by McAnerin's definition), and the above things are part of a broader set of guidelines.
So there are certainly pages with long domain-names-that-have-valuable-information.example, but more likely, in general, long domain names full-of-mortgage-and-viagra-spam-keywords.example are spam.
The web is totally saturated with crappy content! .EDU domains became link whores and directories are diluted into a mesh of interlinking hell. In every category, in every industry you'll find an "authority" recycling content and selling links...
If Andy Warhol were alive today he'd probably state that all domains will get their 15,000 links
Simply frustrating.
Think I might be being slow. By:
"Very few links sent out in (non-spam) email to the site"
Do you mean that (real!) people don't email links to spammy sites to each other?
Hello, my name is Michael Janik from Hamburg, Germany.
I want to give my compliments to randfish because his posts are the best.
They are very useful, full of information and written in a comprehensible manner.
Thank you!!!!
I'd agree with feedthebot. There are some of these things that SE could take as not relevant when rating content as spam. BUT there's always the possibility of the casual 'collateral damage', that is: to kill some innocent sites because a bigger number of bad sites follow some of the same rules too. Search Engines will always be a mystery somehow, their filters are weird.
Nice list Rand.. but what is the application of the list - just to evaluate a prospective link partner? Or is there more to it?
Rand's list kinda feels like a somewhat off the cuff mashup of random thoughts; many of which are clearly accurate - yet it lends itself to counterpoints.
I think the real value is the combination of Rand's views coupled with the subsequent discussion. One could probably synthesize a pretty strong list from the two.
Excellent individual contributions by Pat and Vin. Rishi also had some good points.
awww Sean I preferred your original comment!
I see this post more as a general guide to differentiating spam from quality content sites. I agree with Rand, it's just like a checklist; maybe some sites have a few of the aspects Rand mentions, and that doesn't mean they are low quality. The list helps us get a general idea of the concept of spam.
BUT RAND, it takes so much time to make a "real" quality website :) Actually, truth be told, in the long run it is easier to have a site with 95% of the features you attributed to a "good" website.
Hmm interesting post. The comments posted thus far have summed up many of the points that would be made on this end. Some of those can be disputed but overall it is impossible to adhere to all of those guidelines it seems.
You really hate long domain names, huh?
"Long domain names"
"Rarely have short, high value domain names"
"More likely to have longer domain names"
etc...
ok.. one for the list.. 2 parts
(from a former spam killer at a search engine)
part 1.
multiple gibberish subdomains - containing numbers and letters, usually attached to a reeeealy short domain.
like l0pht.org (not really but close)
(waiting to see who the first l33t haX0r will be to correct me)
Part 2.
multiple gibberish sub domains that are Reeeealy close to a popular domain.. i.e.
RNYspace.com = rnyspace.com
oh.. and... how about.... "has lots of amazon listings with one really long ass page, all of it centered"
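Part 2 above (domains that sit reeeealy close to a popular one) can be roughly sketched with an edit distance plus a small table of look-alike substitutions. The `LOOKALIKES` table, the `max_distance` default and the idea that RNYspace.com is meant to pass for myspace.com are all assumptions made for this sketch, not something the comment spells out.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical substitutions that look alike at a glance ("rn" reads as "m").
LOOKALIKES = (("rn", "m"), ("0", "o"), ("1", "l"))


def normalize(domain: str) -> str:
    """Lowercase the domain and collapse look-alike character sequences."""
    result = domain.lower()
    for fake, real in LOOKALIKES:
        result = result.replace(fake, real)
    return result


def resembles(domain: str, popular: Iterable[str], max_distance: int = 1) -> List[str]:
    """Return the popular domains this one imitates without being identical."""
    candidate = domain.lower()
    matches = []
    for target in popular:
        if candidate == target:
            continue  # the real thing, not an imitation
        if (normalize(candidate) == normalize(target)
                or edit_distance(candidate, target) <= max_distance):
            matches.append(target)
    return matches


print(resembles("RNYspace.com", ["myspace.com"]))  # ['myspace.com']
print(resembles("seomoz.org", ["myspace.com"]))    # []
```

Plain edit distance alone would miss this particular example (rnyspace.com is two edits from myspace.com), which is why the look-alike table is there.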
What are you trying to say about l0pht.org ? I own the domain, and it used to be used by l0pht crack industries (www.l0pht.com - twitter @stake) . Back in the mid to late 1990s. It's not close to anything, nor does it have anything to do with spam. But thanks for the mention!
Rand, I think this is a very valuable post. Anyone can find exceptions to the rule, but I think the point is that if your website has a few or more of these things going on it might be flagged for further investigation. It did bring up some points I hadn't thought of. Did Danny Sullivan's post on TrustRank have anything to do with motivation or timing for this post? Maybe it's an unrelated coincidence.
Where does crafting PR with the nofollow attribute/siloing fall into this picture? Ha, just kidding. Ignore that. I don't even want to throw gas on that fire right now.
Is this a link bait post?
I refer to Rebecca, the SEOmoz Link Bait spokesperson.
No idea--Rand doesn't usually explain to me the intent behind every blog post he writes. Contrary to what you may think, he doesn't report to me, I report to him. ;)