Spam Filtering and Lessons to Carry to the Web

Email spam filtering keeps getting more sophisticated, as filtering programs are trained by their own users to recognize and categorize messages as spam. Spam detection and elimination on the web is still a work in progress, but many of the same lessons are being applied.

It's critically important for SEOs, and web developers in general, to recognize these filtering and learning techniques in order to avoid being falsely labeled as spam by the search engines. A few of the more common methods are listed below:

Topic Detection - Frequent topics of abuse are well-known and include Nigerian-style scams, work-from-home offers, pharmaceuticals, etc.
Address & IP Location - Where a message originates, judged by its sending address and IP, can mark it as suspect. This can work on the web too, with geotargeting and web server IPs.
Bonus/Penalty Words - Certain terms are frequently used in spam, while others are much less likely to occur.
Readability Scales - Scoring the readability of an email or web page can indicate whether it is more or less likely to be spam.
Stopword Frequency - In natural language and personal correspondence, the percentage of stopwords (very common words such as "the", "and" and "of") is much higher than in a spam email. This is also true of spam web pages.
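
To make two of these signals concrete, here is a minimal Python sketch of stopword frequency and bonus/penalty word scoring. It is an illustration only, not how any email filter or search engine actually works: the word lists, the weights and the function names (tokenize, stopword_ratio, word_score) are all assumptions invented for this example, and a real filter would learn its weights from large amounts of labeled mail or pages.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was", "i", "you", "we",
}

# Hypothetical weights: positive = penalty word, negative = bonus word.
WORD_WEIGHTS = {
    "viagra": 2.0, "winner": 1.5, "free": 1.0,
    "meeting": -1.0, "thanks": -1.0,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stopword_ratio(text):
    """Return the fraction of tokens that are stopwords (0.0 if no tokens)."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def word_score(text):
    """Sum the bonus/penalty weights of every token; higher looks spammier."""
    return sum(WORD_WEIGHTS.get(w, 0.0) for w in tokenize(text))


natural = "Thanks for the notes, I will send the report to you after the meeting."
spammy = "Free viagra winner click now cheap pills free offer"

for label, text in (("natural", natural), ("spammy", spammy)):
    print(label, round(stopword_ratio(text), 2), word_score(text))
```

With these sample strings, the conversational sentence has a far higher stopword ratio and a negative word score, while the keyword-stuffed one has no stopwords at all and a positive score. Neither number means much on its own, which is why filters tend to combine many weak signals like these rather than rely on a single one.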

Understanding these techniques and thinking about them as you develop pages is a good way to prevent false spam classification. Several excellent resources include:

Spam Filters by Sam Holden
Spam & IR Research - Thread at SEW
