The sheer fact that you are reading YouMoz is a strong indicator that you already know full well about the recently launched Google Penguin Update. This is in fact the "over-optimization" penalty alluded to by Matt Cutts a few weeks ago. In Google's own words, "Sites affected by this change might not be easily recognizable as spamming without deep analysis or expertise, but the common thread is that these sites are doing much more than white hat SEO; we believe they are engaging in webspam tactics to manipulate search engine rankings."
This statement is very telling. It doesn't say the common thread is that these sites are doing black hat SEO. It says they are doing much more than white hat SEO. Sounds to me like you can take white hat SEO a little too far, even to the point of being labeled "webspam tactics" by Google.
So if the Penguin update was aimed at nuking spam or those using spammy tactics, what is spam? What are spammy tactics?
Google gave a few examples, such as nonsensical spun content and keyword stuffing. This stuff is obvious, and quite frankly, if the examples they gave were actually not marked as spam before, they should be embarrassed.
One thing is for sure with this update: there has been some serious backlash. Not only can you read the comments on Google's own blog post, but very respected people in the SEO industry voiced their concerns. Apparently Google and white hat SEO proponents had two very different definitions of spam.
Coincidentally, the following morning when I got to the office and checked my Gmail, I had a larger than normal amount of spam in my spam folder. A light bulb went off. Google runs Gmail, and if we are to glean any clues as to how they identify and classify spam, why not investigate Gmail?
I found some pretty interesting results. I think that by learning how Google identifies spam in email, we can learn how they are identifying spam in websites.
First of all (and I didn't know this before, probably because I rarely open spam messages), if you open a spam message, Google puts a little note telling you why the email was marked as spam. It looks something like this:
I decided to learn more and clicked through to a Google support page that explains a bit more about how Google identifies spam.
The first reason something would be marked as spam is for phishing. This is no surprise as Google doesn't want users to get duped into giving up personal or financial information to scammers.
Website equivalent? Sites with malware or maybe non-trusted merchants. Google is not a fan of any website that tries to infect computers with a virus or anything.
Their second reason deals with messages from an unconfirmed sender. These are messages where someone pretends to be sending from what appears to be an official website address, but they aren't actually with that website.
Website equivalent? Perhaps hacked or hijacked websites, websites registered and hosted outside the country, sites not registered with Google Webmaster Tools, or other sites that have suspicious ownership.
The next reason something would be marked as spam is because you previously marked it as spam. Persistent messages from the same user, identical subjects, stuff like that.
Website equivalent? In Chrome, you can block sites from your search results. Also sites with little engagement and high bounce rates would probably qualify here.
The next one is a biggie. It deals with similarity to suspicious messages. Google says here that "Gmail uses automated spam detection systems to analyze patterns and predict what types of messages are fraudulent or potentially harmful." They then go on to list some examples: typical spam language (adult content, get-rich-quick schemes), messages from accounts or IP addresses that previously sent spam, and suspicious attachments, to name a few.
Website equivalent? Here's where things get tricky. How does Google determine what is, in their words, "usually associated with spam"? We know the obvious ones, but what if your legitimate website in a legitimate industry were suddenly viewed as spam? Take SEO as an example: how many folks outside this industry do you suppose believe we are not spammers? Should the perception of the masses dictate what is or is not spam?
Other Google resources shed more light on this topic. Google released a video about fighting spam some time ago. It eventually leads you to a place where you can learn more about Gmail's spam fighting methods. Here is where things get really telling.
The first method they point out in combating spam is called community clicks. Basically, as more and more users mark stuff as spam, they use that data to determine what messages are spam. Think back to the Chrome extension to block sites in your search results. Think to the +1 button. Think to user statistics like bounce rate and engagement statistics like depth of visit, length of visit, etc.
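To make the "community clicks" idea concrete, here is a minimal sketch of how user spam reports could be aggregated into a per-item score. This is purely my own illustration; the report format, the impression counts, and the idea of using a simple flag ratio are all assumptions, not anything Google has published.

```python
from collections import defaultdict

def spam_scores(reports, impressions):
    """Hypothetical aggregation of user feedback.

    reports: list of (item_id, marked_spam) tuples, one per user action.
    impressions: dict mapping item_id -> number of users who saw the item.
    Returns the fraction of viewers who flagged each item as spam.
    """
    marks = defaultdict(int)
    for item, marked in reports:
        if marked:
            marks[item] += 1
    return {item: marks[item] / views for item, views in impressions.items()}

# Example: two of three viewers flagged msgA, nobody flagged msgB.
reports = [("msgA", True), ("msgA", True), ("msgA", False), ("msgB", False)]
impressions = {"msgA": 3, "msgB": 1}
scores = spam_scores(reports, impressions)
```

The same shape of computation would work whether the "items" are email messages, blocked sites from the Chrome extension, or +1'd results.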
We have all been trying to figure out what all of the sites that got hit had in common. What's the one thing we cannot see on other sites? We can see their content. We can see their links. We cannot see usage statistics. Google can.
Google even says this about Gmail spam: "Our team of leading spam-fighting scientists uses a number of advanced Google technologies. Though in many cases our best weapon is you." Can it be more revealing than this?
Gmail openly admits that their best asset in dealing with spam is user feedback. Why would we suspect their search results are any different? Why bother tracking search history, browsing history, offering free analytical software, implementing the +1 button and Chrome extension...to what end? For user feedback. Whether any of us knew it or not, we have all been sending feedback to the big G for quite some time. And now it is being put to use.
Think of this...why in each of our major keyword SERPs are there so many new sites? Why so many poor sites? Why so many that have never been there before? Because Google has no data on them. They have never been in the SERPs, so now that they are, Google will quickly realize, through this fancy new Penguin update, whether or not the result is good or spam. Count on search results to be much more volatile moving forward.
After talking about community clicks, the Gmail spam fighting page then goes on about quick adaptation. They laud their ability to quickly roll out new spam data as they receive it so that within minutes of new spam being created, they can identify it. What does this say about what we do? Think about the recent hit on link networks. Google can quickly discover and identify spam, and as of the Penguin update, they can roll it out globally in a hurry.
And in case you didn't think I was on to something here, the next spam fighting method says it all. I decided to post it all here, verbatim:
"Many Google teams provide pieces of the spam-protection puzzle, from distributed computing to language detection. For example, we use optical character recognition developed by the Google Book Search team to protect Gmail users from image spam. And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam."
Okay, so work done by the Google Book Search team has helped Gmail to identify image spam. Work done by the search team has helped Gmail classify spam. Obviously, these different branches of Google work together. So whatever methods they are using at Gmail, you had better believe that is being shared with the search team.
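The quote says machine-learning algorithms "combine hundreds of factors to classify spam." A toy version of combining factors is a weighted sum of feature indicators pushed through a logistic function. The features and weights below are invented for illustration; the real system is obviously far more complex.

```python
import math

# Invented example features and weights -- not Google's actual factors.
FEATURE_WEIGHTS = {
    "spammy_language": 2.0,
    "sender_previously_flagged": 3.0,
    "suspicious_attachment": 2.5,
    "known_good_sender": -4.0,  # negative weight: evidence AGAINST spam
}

def spam_probability(features):
    """features: dict of feature_name -> 0/1 indicator."""
    z = sum(FEATURE_WEIGHTS[f] * v for f, v in features.items())
    return 1 / (1 + math.exp(-z))  # logistic squash to a 0..1 probability

# A message with spammy language from a previously flagged sender
# scores very high; one from a known good sender scores very low.
p_bad = spam_probability({"spammy_language": 1, "sender_previously_flagged": 1})
p_good = spam_probability({"known_good_sender": 1})
```

The point of the sketch is just that many weak signals, each harmless alone, can combine into a confident classification, which is exactly what the quote describes.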
Think about that for a minute. Gmail is saying that they can filter email messages by their content and look for language typically associated with spam. I've noticed that anything related to insurance, pharmaceuticals, and loans all end up in my spam folder. What does that tell you? Gmail knows language associated with those products is known to be spam.
So what about websites? Wouldn't that knowledge of identifying and classifying spam be shared with the search and webspam teams? Don't you think Matt Cutts has access to Gmail's spam detection data? I bet he does, and I bet some of it is being seen in this Penguin update.
Gmail is a pretty well documented product. You can read up quite a bit on spam filters and how they work. I would recommend this to everyone as we can then get a better idea of what Google sees as spam content. For me the biggest takeaway is that Gmail openly admits to using user data and feedback in classifying and identifying spam. This should be a huge indicator to us all that user data is playing a role in how Google classifies and identifies spam on the web. The trick now is to figure out exactly what user data/feedback is being used.
I'm not really sure if you can glean much info from Gmail's spam filters. The people behind Gmail and Search are not the same.
So you think the departments never talk? Share code? Share research? Of course they do!
I completely agree. Not sharing information with other departments would be a waste of resources that could be diverted to other areas.
Agree. Different departments not different companies. I'm betting they share and play nice with each other.
This article sounds totally logical, especially if you take into consideration the fact that after the Penguin update, Viagra's website vanished from the search results, and everyone knows how much email spam there is about Viagra.
That would definitely suggest Penguin is automated rather than manual like the blog network thing was. As much spam as there is about Viagra, it is a legit product with a legit website.
Registering your site with Webmaster Tools is a great tip, and embracing Google+ ensures a healthy social media signal to Google. This is one of the first blogs on the Penguin update that makes sense of a topic with a lot of differing and confusing opinions flying around. Thanks Dan!
Excellent post Dan. Definitely thought-provoking!
Your Gmail theory is interesting and it would be great to see updates from you on your findings re. this theory! Makes you think about what Google actually do with all this data. In my experience, I've found Google never does anything without having multiple objectives, so you can be sure they use the Gmail data for more than just protecting their users from email spam.
Some of my sites have been affected by this update and are jumping around the SERPs 5+ times a day, something I've never seen before! Which got me thinking...
One theory I have is that the update may be ranking new sites for some SERPs to give them a "fair go" and measure their CTR, Bounce Rates etc from there. It would change the SERPs quite dramatically but also give others the chance to prove themselves if they have a better website - regardless of their age, link profile etc.
"if the examples they gave were actually not marked as spam before, they should be embarrassed." - My favourite part of your post!
Hey Dan, it's a nice post. I appreciate your research. Everyone in the SEO industry has been shaken very badly.
Someday Google should be penalized for over-updating ;)
I hope so... :)
Amazon now controls all English-speaking words. Amazon also owns imdb.com. According to SEMRush, Amazon ranks #1 for "sex toys"; do the same search in Bing and IMDB.com ranks #1 for XXX (you can figure out the traffic count for yourself). Amazon and Google were the top speakers at the ChannelAdvisors event days before Penguin. Amazon, eBay, Sears, etc. get feeds from third parties like CA. Amazon stock jumped $30 a share the same day as Penguin. Google also just funded $5 million for lobbyists in DC. Penguin is all about product. Penguin is all about $$$$$. It's Google and Amazon. After researching over 100 sites and product lines, it is clear that this update was about "product." Google uses the excuse that they do not owe anyone a living or a ranking position, but they have made it clear that they are going to put Amazon first for all e-com, and both are going to make a lot of money doing so. PENGUIN = PORN + PRODUCT = $$$$$ to AMAZON
Great post, and a clever idea using other Google platforms to identify how the Penguin update works. I'd also bet the domains and IPs from such spam automatically get marked as spam (or at least grey hat) sites in search results as well.
Totally agree. If you feel like you were collateral damage, it's quite possible you are using a shared host and a known spam network shared your hosting.
Very thought-provoking. My sites have been caught up in the Panda update, so information like this (thanks for all the links) is good to look at and think through. The sites seem to perform on a web analytics front (good usage and dwell times) but are probably 'over' optimised. Seems that last year's optimised is this year's spam... we talk about white hat & black hat, but to me an over-optimised white hat is now being accused of spamming.
I'm not saying user data is the only factor in Penguin, just that I think the same technology and process Google uses to identify email spam (lots of user data) is probably part of the Penguin update to identify webspam. Anyone caught in the algorithm then has their links and content analyzed with intense scrutiny. Lots of identical, keyword rich anchor text? Lots of blog comments? Lots of articles? Lots of irrelevant links? Well, if Google's user data shows people tend not to respond well to your site, then they look at these other metrics and if they catch low quality stuff, you are gone.
At least that would be the idea. Because plenty of people with bad links, lots of keyword anchors and the like did not get affected. But my guess is because Google sees users like these sites and interact well with them. Why remove a site that users like? So they look at what users are telling them through usage stats they don't like, run this Penguin analysis of their content and links, and if they are beyond a certain threshold, penalty.
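That two-stage guess (user signals first, then link/content scrutiny for sites that fail) can be sketched in a few lines. To be clear, this is my speculation made concrete, not a documented Google process, and every threshold below is an invented placeholder.

```python
def penguin_flag(bounce_rate, avg_time_on_site, exact_anchor_ratio, spam_link_ratio):
    """Speculative two-stage filter. All thresholds are made-up examples."""
    # Stage 1: do usage stats suggest users dislike the site?
    poor_user_signals = bounce_rate > 0.80 and avg_time_on_site < 15  # seconds
    if not poor_user_signals:
        return False  # users seem to like the site; leave it alone
    # Stage 2: only then scrutinize the link profile and content.
    unnatural_profile = exact_anchor_ratio > 0.60 or spam_link_ratio > 0.50
    return unnatural_profile

# Poor user signals + heavy exact-match anchors -> flagged.
hit = penguin_flag(0.90, 10, 0.70, 0.20)
# Poor user signals but a natural-looking profile -> spared.
spared = penguin_flag(0.90, 10, 0.10, 0.10)
```

This would also explain why sites with bad links but good engagement escaped: they never pass stage one.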
Nice post Dan, some really interesting insights there. I completely agree that departments (Gmail and search) will share code, insights and ideas, so it seems like a fairly reasoned argument.
Keep up the good work!
Good job... I think it is a very interesting theory.
Nice Piece of Information Dan!
SEO people would really love to know all about this, especially the best explanation of how this controversial update brought such a big change to the SERPs. I agree that learning how Penguin identifies spam and functions would be the best place to start in figuring out how to recover from this update.
Best Regards,
Tarjinder S. Kailey
Thanks for your post!
Really awesome article and very informative. I think we need to understand both aspects of the new Google Penguin update so that we can do our best in our SEO practice. We need to understand what Google Penguin actually wants and how Google is treating sites; if we understand the whole thing, then there is no problem ranking well in Google, or recovering what we have lost with the new update. "Spam" is described very well above, with the Gmail spam example that everyone knows, and the new update focuses more on webspam and things like keyword stuffing, over-optimization, etc. We need to think about this matter seriously.
Thanks for sharing such a useful and informative stuff
That's a very creative way of thinking, Daniel. It never crossed my mind that we could learn so much about Google's filtering system through Gmail spam messages. It makes a lot of sense indeed.
Good old article, especially the insights on Gmail spam.
Hey, thanks Dan for the post. Great insights and information, with many ways to look after your websites and keep them safe from the big G's crucial algo updates.
My website is www.sexysimran.com. Can anyone please check why my website dropped from the rankings on 24th April 2012? Until 23rd April it was on the first page for the keyword "Independent escorts in delhi".
Please resolve my problem
I would love to see a correlation between the negative impact of penguin/panda and the bounce rates on the ranked pages.
I do not see a reason why Google would NOT take into consideration a high bounce rate when evaluating the website quality. Taking it one step further, it would be interesting to see if adding on page events that would traditionally lower the reported bounce rate from a Google Analytics perspective make any difference in rankings.
Well, one documented thing is that affiliate sites were nailed pretty good for the most part in Penguin. The goal of an affiliate site is precisely for you to bounce. They want you to go to the merchant immediately. Just a thought there.
Might also take into account bounces on specific searches, bounce rate or click rate according to specific rankings, etc. Not sure what the metrics are, but as I said in another comment, I think they have fine tuned their metrics to identify sites they think users don't like, then they look at the content and links of said site, and if it is beyond a certain threshold of what is deemed natural in that niche, they get a penalty.
I do also support the influence of bounce rate and useless link building methods as spam-triggering factors. But I still doubt one thing about bounce rate. I believe 90% of websites do not have Google Webmaster Tools or the Analytics code installed. In such a situation, is it technically possible for Google to track the user's interaction? And if not, then how can they include a determining factor (bounce rate) in an algorithm (the spam filter algo) that will work for some sites (those with the Analytics code) and remain a vestigial part for others?
Some people have suggested that one way Google could track bounce rate on sites without Google Analytics is to track visitors who click on a search result, then quickly come back to the same results page (e.g. people who hit the back button).
From a technical standpoint, It would be possible for Google to measure bounce rate without Google Analytics even being installed. They could measure the SERP URL, the URL being clicked and then the visit back to the SERP. It would be a little more complicated than that and take into considerations other factors such as time etc but I wouldn't rule it out as a possibility.
Think about it :)
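The "click a result, then quickly come back to the SERP" measurement described above (often called pogo-sticking) is easy to sketch. The log format and the 30-second window below are assumptions for illustration only; nobody outside Google knows the real mechanics.

```python
from datetime import datetime, timedelta

# Assumed threshold: a return to the SERP within 30 seconds counts as a bounce.
POGO_WINDOW = timedelta(seconds=30)

def pogo_stick_bounces(events):
    """events: chronological (timestamp, action, url) tuples, where action is
    'serp_click' (user left the results page for url) or 'serp_return'
    (user came back to the same results page, e.g. via the back button).
    Returns the URLs the user quickly bounced back from."""
    bounces = []
    last_click = None
    for ts, action, url in events:
        if action == "serp_click":
            last_click = (ts, url)
        elif action == "serp_return" and last_click:
            click_ts, clicked_url = last_click
            if ts - click_ts <= POGO_WINDOW:
                bounces.append(clicked_url)  # quick return = likely dissatisfied
            last_click = None
    return bounces

events = [
    (datetime(2012, 5, 1, 10, 0, 0), "serp_click", "site-a.example"),
    (datetime(2012, 5, 1, 10, 0, 10), "serp_return", ""),   # back after 10s
    (datetime(2012, 5, 1, 10, 1, 0), "serp_click", "site-b.example"),
    (datetime(2012, 5, 1, 10, 5, 0), "serp_return", ""),    # back after 4 min
]
bounced = pogo_stick_bounces(events)
```

Note that this needs nothing installed on the destination site, which answers the objection about sites without Analytics.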
I think it does. I manage a few websites and am able to get less than a 1% bounce rate for one of them, with around 80% new visitors from different sources. A lot of my keywords for that website, over 80%, are now ranking in the top 5 of Google results.
I definitely see a correlation between bounce rate and Google rankings here, but we will need more such examples to actually claim it. :)
I had some sites that were nailed in Penguin, and they were for legitimate businesses with good link building practices. Yes, we syndicated some articles and press releases, but not through BMR or any network like that. Plus we have hundreds of links from "legit" directories, like Angie's List, etc. I personally think Penguin hammered some good sites that did not deserve the penalty.
So now someone can make a bunch of spam links to a competitor and kill their rankings? This whole update is a very bad idea by Google.
Great post. I would imagine like yourself that much of the google information is shared between departments and gets plugged in here, there and everywhere. Thus it would be logical to think that boundaries will be crossed.
I am hoping that SeoMoz and the powers that be start to develop (in time) a SpamRank for sites, So from a link building strategy you can avoid the worst of the worst.
Dan, thanks for this post. I know that sometimes not everyone sees the value, but today I did. We have a fairly new and small client that got hit pretty hard with the new update. We have been looking hard for more clues, but when I read this post this afternoon the lights went on. Our client had a link building company create literally thousands of links to his domain in a relatively short amount of time. We knew about them when we started, and knew that he had been taken advantage of, since the majority of them were not even considered by any of the major search engines. It occurred to me after reading the above that this,
"engaging in webspam tactics to manipulate search engine rankings"
might actually be catching up to him and us. I can't prove it, but it makes a lot of sense, especially since this is a local company going after local search results in a major US city. We will keep looking for clues and ideas, but this gave us more of an idea of where to look. Thanks a lot.
Interesting stuff to read! Someone mentioned above that these two teams work separately and are different, but they are dealing with the same kind of spam, and I would expect them to have almost the same kinds of strategies to counter it.
Very nice post and quite well researched and I would call it digging deep into the spam filters. Good work Dan!
Some very interesting points, Dan. It's sometimes really easy to forget that algorithmic updates are all based on human feedback - either from Google's search quality team or from user data. It seems perfectly logical that the data from Chrome, the +1 button and the block function have been combined into a major update.
Thanks for the insights!
I'm sure that Google uses their SERPs to detect who visits a site and how long they visit through IPs. However, what about a large school or business that uses the same IPs? How is Google going to count each visit? As a returning visit?
I like the relation between Gmail spam and web spam. However, mail spam is usually massed and the person receiving the email has nothing to do with the sender, and Gmail is not familiar with the sender. Sometimes, my real mail goes into a spam folder... Does that mean that sometimes sites that are not bad also fall into a spam folder?
I've been in a discussion about bounce rate and how real it is over the past few months. I entered into this discussion very hard-headed about bounce rate being a good calculator for user interaction. I was shown that it is not accurate at all and that bounce rate is really useless.
Any metric alone is probably useless, but user metrics, some combination of them, must be used in some way by Google. They have such vast data they surely put it to use. And yes sometimes sites that are not bad also get labeled spam, just as your real messages go to your spam folder. Google isn't perfect at detecting spam in email or web search.
Actually, I learned that there is a more accurate method for calculating bounce rate than anything provided by web tag analytics: use your server log analytics, filter out bots, and filter out your own IP address(es).
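As a rough sketch of that suggestion, here is how you might compute a bounce rate from raw server log entries while excluding bots and your own traffic. The hit format, the bot markers, and the placeholder IP are all assumptions for illustration; a real log parser would read Apache/Nginx combined log lines.

```python
# Illustrative placeholders -- replace with your own IPs and a fuller bot list.
BOT_MARKERS = ("bot", "crawler", "spider")
OWN_IPS = {"203.0.113.7"}

def bounce_rate(hits):
    """hits: one (ip, user_agent, session_id) tuple per pageview.
    A session with exactly one surviving pageview counts as a bounce."""
    pageviews = {}
    for ip, user_agent, session in hits:
        if ip in OWN_IPS:
            continue  # filter out your own visits
        if any(marker in user_agent.lower() for marker in BOT_MARKERS):
            continue  # filter out crawlers
        pageviews[session] = pageviews.get(session, 0) + 1
    if not pageviews:
        return 0.0
    single_page = sum(1 for views in pageviews.values() if views == 1)
    return single_page / len(pageviews)

hits = [
    ("198.51.100.1", "Mozilla/5.0", "s1"),
    ("198.51.100.1", "Mozilla/5.0", "s1"),      # s1 viewed two pages
    ("198.51.100.2", "Mozilla/5.0", "s2"),      # s2 bounced
    ("198.51.100.3", "Googlebot/2.1", "s3"),    # bot, filtered out
    ("203.0.113.7", "Mozilla/5.0", "s4"),       # own IP, filtered out
]
rate = bounce_rate(hits)  # one bounce out of two real sessions
```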
Google also tries to identify what kind of software is used for the website because it knows that some piece of software can be used to create low quality content (such as "social bookmarking" websites).
Excellent analysis.
My only question is: haven't these types of spam been identified in previous incarnations of Google algos? Any sort of malware, at least malware ID'd by Google, has been a no-no since we were all baby SEOs, and off-shore, blue pill pharmas have been tagged as spam since Clinton was in office.
So what's new? Are there refinements to the latest algo? Are they getting "smarter?"
Just asking...
Thanks for the informative post, written for the non-techs who hang out here.
Paul
Main point is that with Gmail we know their best asset in identifying spam is the user. We report things as spam or report them as not spam if that is the case. Feedback isn't so explicit with websites, but through our actions, which are all tracked by Google, we send "Feedback" in essence. Google is probably confident with the data they now have on website usage and are using it against websites that perform poorly with users.
Those that perform poorly likely have their content and links analyzed. If Google spots anything "unnatural" then boom, penalty.
"Think about that for a minute. Gmail is saying that they can filter email messages by their content and look for language typically associated with spam. I've noticed that anything related to insurance, pharmaceuticals, and loans all end up in my spam folder. What does that tell you? Gmail knows language associated with those products is known to be spam."
That's not because "insurance" and "loans" are spammy words by themselves; it's because the people who are blasting these insurance and loan emails are doing aggressive email spamming. Spamming just happens to be more prevalent with things like pharmaceutical sellers, so you see them in your spam folder more often. Gmail flags email as spam based on signals like:
a) the body wording being gibberish, or a lack of wording
b) the shady or dangerous link, which is ALWAYS included (it needs to be; that's the whole purpose of spamming)
c) the subject line content wording
d) the sender being blacklisted or identified as a spammer by Google or enough users flagging an email as spam
e) if it's an email from a sender that has never sent you anything before. Opening a spam email is like accepting it, and the next time they spam you it's likely to go to your inbox instead of the spam folder; if you didn't mark the previous email as spam, they assume you accepted it as legit.
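The signals in that list could be combined into something like the toy rule-based scorer below. The weights, the threshold, and the idea of a simple sum are my own illustration of the commenter's list, nothing more.

```python
def email_spam_score(body_is_gibberish, has_suspicious_link,
                     spammy_subject, sender_blacklisted, unknown_sender):
    """Toy scorer over the five signals (a)-(e); weights are invented."""
    score = 0
    score += 2 if body_is_gibberish else 0    # (a) gibberish or empty body
    score += 3 if has_suspicious_link else 0  # (b) shady link in the body
    score += 1 if spammy_subject else 0       # (c) subject line wording
    score += 4 if sender_blacklisted else 0   # (d) blacklisted sender
    score += 1 if unknown_sender else 0       # (e) first-time sender
    return score

SPAM_THRESHOLD = 4  # assumed cutoff for routing to the spam folder

# Gibberish body + shady link from a first-time sender clears the bar.
score = email_spam_score(True, True, False, False, True)
is_spam = score >= SPAM_THRESHOLD
```

Real filters weight and combine far more signals, but the additive shape is the same idea as the email-testing tools mentioned in the next comment.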
Email marketing tools let you test your email content before sending out an email blast and give feedback as to how likely your email will be flagged as being spam so you can adjust the messaging to increase the delivery success rate.
It is possible that a site that has been email spamming can be held accountable for that, and its rankings can suffer as a result, if the spamming can definitively be tracked back to it. But spammers are smarter than that. Quite often email spammers hijack innocent sites and use them to send out spam on their behalf, and those sites usually do get penalized as a result. It's your job as a webmaster to keep your site secure and free of malware or dangerous scripts.
I think this is comparing apples to oranges. Email spam is a different thing than webspam, and this smackdown was on sites using spammy link building tactics: low-level bad SEO link building like forum and comment link spamming, theme sponsoring, social bookmarking, free SEO directory add-URL sites, etc. On-page keyword stuffing won't help either, so probably a combination of on-page and incoming link profile thresholds will trigger a penalty. But Google has been doing this for a while; they're just getting more aggressive with it.
Here's an idea, Google: how about figuring out a way to NOT include incoming links or PR as a main basis of the algorithm monster you created? If the link benefit completely goes away, spamming will go down considerably. Sure, there will always be spammers doing it for traffic reasons, but most spammers engage in it because they are trying to boost PageRank.
I have set up my website in Google Webmaster Tools and it is working fine, but in Alexa it weirdly shows zero traffic at all. I wonder if Alexa is just having an error, or if the Alexa site is affected by the Penguin update, or if it is just my website?
My website is RANKBYTE, which I registered with Alexa, but I wonder what the problem is, since other traffic and rank checking tools return good results.