Since the Panda and Penguin updates, the SEO community has been talking more and more about machine learning, and yet often the term still isn't well understood. We know that it is the "magic" behind Panda and Penguin, but how does it work? Why didn't they use it earlier? What does it have to do with the periodic "data refreshes" we see for both of these algorithms?
I think machine learning is going to play a bigger and bigger role in SEO, so it is important that we have a basic understanding of how it works.
Disclaimer: Firstly, I'm no expert on machine learning. Secondly, I'm going to intentionally simplify aspects in places and brush over certain details that I don't feel are necessary. The goal of this post is not to give you a full or detailed understanding of machine learning, but instead to give you a high-level understanding that allows you to answer the questions in my opening paragraph should a client ask you about them. Lastly, Google is a black box, so obviously it is impossible to know for sure exactly how they are going about things, but this is my interpretation of the clues the SEO community has stitched together over time.
Watermelon farming
Machine learning is appropriate to use when there is a problem that does not have an exact answer (i.e. there isn't a right or wrong answer) and/or one that does not have a method of solution that we can fully describe.
Examples where machine learning is not appropriate would be a computer program that counts the words in a document, simply adds some numbers together, or counts the hyperlinks on a page.
Examples where machine learning would be appropriate are optical character recognition, determining whether an email is spam, or identifying a face in a photo. In all of these cases it is almost impossible for a human (who is most likely extremely good at these tasks) to write an exact set of rules for how to go about doing these things that they can feed into a computer program. Furthermore, there isn't always a right answer; one man's spam is another man's informative newsletter.
Explaining Machine Learning with Will Critchlow at SearchLove 2013 in London. I like watermelons.
The example I am going to use in this post is that of picking watermelons. Watermelons do not continue to ripen once they are picked, so it is important to pick them when they are perfectly ripe. Anyone who has been picking watermelons for years can look at a watermelon, give it a feel with their hands, and from its size, its colour and how firm it feels determine whether it is under-ripe, over-ripe or just right. They can do this with a high degree of accuracy. However, if you asked them to write down a list of rules or a flow chart that you or I could use to determine whether a specific watermelon was ripe, they would almost certainly fail - the problem doesn't have a clear-cut answer you can write into rules. Also note that there isn't necessarily a right or wrong answer - there may even be disagreement among the farmers.
You can imagine that the same is true about how to identify whether a webpage is spammy or not; it is hard or impossible to write an exact set of rules that work well, and there is room for disagreement.
Robo-farmers
However, this doesn't mean that it is impossible to teach a computer to find ripe watermelons; it is absolutely possible. We simply need a method that is more akin to how humans would learn this skill: learning by experience. This is where machine learning comes in.
Supervised learning
We can set up a computer (there are various methods - we don't need to know the details at this point, but the one you've likely heard of is artificial neural networks) such that we can feed it information about one melon after another (size, firmness, colour, etc.), and we also tell the computer whether that melon is ripe or not. This collection of melons is our "training set," and depending on the complexity of what is being learnt it needs to contain a lot of "melons" (or webpages or whatever).
Over time, the computer will begin to construct a model of how it thinks the various attributes of the melon play into whether it is ripe or not. Machine learning can handle situations where these interactions are relatively complex (e.g. the firmness of a ripe melon may change depending on the melon's colour and the ambient temperature). We show the computer each melon in the training set many times, in a round-robin fashion (imagine this was you: having noticed something you hadn't before, you could go back to previous melons and learn even more from them).
Once we're feeling confident that the computer is getting the hang of it, we can give it a test by showing it melons from another collection it has not yet seen (we call this set of melons the "validation set"), but this time we don't tell it whether the melons are ripe or not. Now the computer tries to apply what it has learnt and predict whether each melon is ripe or not (or even how ripe it may be). The number of melons it identifies correctly shows us how well it has learnt. If it didn't learn well, we may need to show it more melons, or we may need to tweak the algorithm (the "brain") behind the scenes and start again.
This type of approach is called supervised learning, because we supply the learning algorithm with the answer (whether each of the original melons is ripe or not). Alternative methods do exist, but supervised learning is the best starting point and likely covers a fair bit of what Google is doing.
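To make that train-then-validate loop concrete, here is a minimal sketch in Python using scikit-learn. The melon data, the feature choices and the tiny neural network are all invented for illustration - this is not how Google's systems are built, just the general shape of supervised learning described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Made-up training set: 200 melons, each described by weight (kg),
# firmness (0-1) and colour depth (0-1). The "ripe" rule below stands in
# for the farmer's judgement; in reality a human would label each melon.
features = rng.uniform([2.0, 0.0, 0.0], [6.0, 1.0, 1.0], size=(200, 3))
ripe = ((features[:, 1] > 0.5) & (features[:, 2] > 0.4)).astype(int)

# Hold back a validation set of melons the model never sees during training.
X_train, X_val, y_train, y_val = train_test_split(
    features, ripe, test_size=0.25, random_state=0)

# A small artificial neural network builds its own internal model of ripeness
# by seeing the training melons many times over.
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Test on melons it has never seen; if accuracy is poor, gather more labelled
# melons or tweak the network and train again.
print("Validation accuracy:", model.score(X_val, y_val))
```

Note that, exactly as described above, the model's quality is judged only on the held-back melons it never saw while training.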
One thing to note here is that even after you've trained the computer to identify ripe melons well, it cannot write that exhaustive set of rules we wanted from the farmer any more than the farmer could.
Caffeine infrastructure update
So how does all this fit with search?
First we need to rewind to 2010 and the rollout of the Caffeine infrastructure update. Little did we know it at the time, but Caffeine was the forefather of Panda and Penguin. It was Caffeine that allowed Panda and Penguin to come into existence.
Caffeine allowed Google to update its index far faster than ever before, and update PageRank for parts of the web's link graph independently of the rest of the graph. Previously, you had to recalculate PageRank for all pages on the web at once; you couldn't do just one webpage. With Caffeine, we believe that changed and they could estimate, with good accuracy, updated PageRank for parts of the web (sub-graphs) to account for new (or removed) links.
This meant a "live index" that is constantly updating, rather than having periodic updates.
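As a rough illustration of why full recomputation is so expensive, here is the textbook power-iteration version of PageRank on a made-up four-page graph. Every pass touches every page, so at web scale each recalculation is enormous; Google's actual incremental, sub-graph method is not public, so this only shows the baseline cost that Caffeine is believed to have improved on.

```python
import numpy as np

# Toy link graph: links[i] lists the pages that page i links to.
# (Every page here has at least one outlink, so no dangling-node handling.)
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)
damping = 0.85

ranks = np.full(n, 1.0 / n)
for _ in range(50):  # repeat until the scores settle
    new_ranks = np.full(n, (1.0 - damping) / n)
    for page, outlinks in links.items():
        share = damping * ranks[page] / len(outlinks)
        for target in outlinks:
            new_ranks[target] += share
    ranks = new_ranks

print({page: round(score, 3) for page, score in enumerate(ranks)})
```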
So, how does this tie in with machine learning, and how does it set the stage for Panda and Penguin? Let's put it all together...
Panda and Penguin
Caffeine allowed Google to update PageRank far faster than ever before, and this is likely the step that finally allowed them to apply machine learning at scale as a major part of the algorithm.
The problem that Panda set out to solve is very similar to the problem of determining whether a watermelon is ripe. Anyone reading this blog post could take a short look at a webpage and, in most cases, tell me how spammy that page is with a high degree of accuracy. However, very few people could write me an exact list of rules to judge that characteristic for pages they've not yet seen ("if there are more than x links, and there are y ads taking up z% of the screen above the fold..."). You could give some broad rules, but nothing that would be effective for all the pages where it matters. Consider also that if you (or Google) could construct such a list of strict rules, it would become easier to circumvent them.
So, Google couldn't write specific sets of rules to judge these spammy pages, which is why for years many of us would groan when we looked at a page that was clearly (in our minds) spammy but which was ranking well in the Google SERPs.
The exact same logic applies for Penguin.
The problems Google was facing were similar to the problem of watermelon farming. So why weren't they using machine learning from day one?
Training
Google likely created a training set by having their teams of human quality assessors give webpages a score for how spammy each page was. They would have had hundreds or thousands of assessors each review hundreds or thousands of pages, producing a huge list of webpages with associated spam scores (averaged across multiple assessors). I'm not 100% sure exactly what form this process took, but the explanation above gives a general idea.
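As a purely speculative sketch of what that labelling step could look like: several assessors score each page, the scores are averaged into a single label, and those labels become the training targets. The URLs, scores and scale below are all invented.

```python
from statistics import mean

# Hypothetical raw ratings: each page is scored for "spamminess" (say 0-10)
# by several independent assessors.
rater_scores = {
    "example.com/page-a": [8, 9, 7],
    "example.com/page-b": [1, 2, 1],
    "example.com/page-c": [5, 4, 6],
}

# Average the assessors' scores to get one training label per page. These
# labels, paired with features extracted from each page, would form the
# training set for a supervised learner.
training_labels = {url: mean(scores) for url, scores in rater_scores.items()}
print(training_labels)
```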
Now, recall that to learn how ripe the watermelons are we have to have a lot of melons and we have to look at each of them multiple times. This is a lot of work and takes time, especially given that we have to learn and update our understanding (we call that the "model") of how to determine ripeness. After that step we need to try our model out on the validation set (the melons we've not seen before) to assess whether it is working well or not.
In Google's case, this process is taking place across its whole index of the web. I'm not clear on the exact approach they would be using here, of course, but it seems clear that applying the above "learn and test" approach across the whole index is immensely resource intensive. The types of breakthroughs that Caffeine enabled, with a live index and faster computation on just parts of the graph, are what made machine learning finally viable. You can imagine that previously, if it took hours (or even minutes) to recompute values (be it PageRank or a spam metric), then doing this the thousands of times necessary for machine learning simply was not possible. Once Caffeine allowed them to begin, the timeline to Panda and subsequently Penguin was pretty quick, demonstrating that once they were able, they were keen to utilise machine learning as part of the algorithm (and it is clear why).
What next?
Each "roll out" of subsequent Panda and Penguin updates was when a new (and presumably improved) model had been calculated, tested, and could now be applied as a signal to the live index. Then, earlier this year, it was announced that Panda would be continuously updating and rolling out over periods of around 10 days, so the signs indicate that they are improving the speed and efficiency with which they can apply Machine Learning to the index.
Hummingbird seems to be setting the stage for additional updates.
I fully expect we will see more machine learning being applied to all areas of Google over the coming year. In fact, I think we are already seeing the next iterations of it with Hummingbird, and at Distilled we are viewing the Hummingbird update in a similar fashion to Caffeine. Whilst Hummingbird was an algorithm update rather than an infrastructure update, we can't shake the feeling that it is setting the foundations for something yet to come.
Wrap-up
I'm excited by the possibilities of machine learning being applied at this sort of scale, and I think we're going to see a lot more of it. This post set out to give a basic understanding of what is involved, but I'm afraid to tell you I'm not sure the watermelon science is 100% accurate. However, I think understanding the concept of Machine Learning can really help when trying to comprehend algorithms such as Panda and Penguin.
For anyone interested in a fun introduction to machine learning that's easy to understand but still very detailed and explanatory, I highly recommend NOVA's "Smartest Machine on Earth" documentary about IBM teaching Watson to play Jeopardy.
Totally agree that Hummingbird is setting the stage for other Google updates/changes in the future. Similar to Caffeine, it wasn't a small, self-contained algorithm added to the system (like Panda/Penguin); it changed and updated the way the Google search engine operates. We can only wait to see what is coming in the next few years!
Love that machine learning is being addressed so directly in the SEO community. Obviously this post only scratches the surface - as was noted - but I would love to learn more about this concept/topic, maybe even take a class on it or something. From a business perspective, Google is going to have to go in this direction in order to save as much time, money, and manpower as possible.
Love this post, Tom. Would love to see some follow-up posts on machine learning at some point!
In the supervised learning section you mention "even after you've trained the computer to identify ripe melons well, it cannot write that exhaustive set of rules we wanted from the farmer any more than the farmer could."
I'm confused by that statement. If the computer is identifying ripe melons with any degree of accuracy, it has to be using objective criteria to do that. Surely the computer could eventually write a big "if" statement that identifies ripe melons with XX% accuracy, right?
A very good question, Kane! You're right to be confused. :)
So - you could just save the machine learning program code and all the outputs from training and then technically you have a 'set of instructions' for this which you can write out. However, it isn't something that you or I could read and internalise - it would be far too complex for us to make sense of.
Imagine if we printed it out to make a massive flow chart (it would be huge). We could work through the instructions on a case by case basis (one melon at a time) but it wouldn't work to teach us anything about how to do it for all melons (the bigger picture), if that makes sense.
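To put a rough number on that, here's a small experiment (made-up melon data and scikit-learn's decision tree, chosen only because a tree is the closest thing to a printable flow chart) showing how quickly a model's written-out "rules" balloon:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Made-up melons: weight, firmness, colour, with a noisy notion of "ripe"
# (real-world labels are never perfectly clean).
features = rng.uniform(size=(500, 3))
ripe = ((features[:, 1] > 0.5) & (features[:, 2] > 0.4)).astype(int)
ripe = np.where(rng.random(500) < 0.15, 1 - ripe, ripe)  # flip ~15% of labels

# A fully grown decision tree is a printable "flow chart" of learnt rules.
tree = DecisionTreeClassifier(random_state=1).fit(features, ripe)
rules = export_text(tree, feature_names=["weight", "firmness", "colour"])

print(rules.count("\n"), "lines of nested rules for just 500 melons")
```

And that's a deliberately simple, printable kind of model - a trained neural network doesn't even give you its "rules" in a form like this.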
It happens quite often in machine learning that, due to the number of inputs, the complexity of how they interact with one another, and the range of outputs, you can train a system and yet be left unable to accurately predict how it will behave with certain inputs.
A potentially useful parallel to think about, which you may know, is chaos theory (it isn't exactly the same thing but close enough that thinking about it may help). Chaos theory deals with situations with no random element (we call these 'deterministic systems') yet the inputs can change slightly and create extremely different outputs. You could Google 'complex systems' for more information. :)
I know the last one as the "butterfly effect" :-) It's my favourite argument when I want to show that by travelling back in time you would change the future even if you just sneezed
@kane: I think this is a perfectly reasonable question. There are two real-world problems with writing a big-ass "if" statement, whether a human or a computer writes it.
1) It's really hard to code anything that can cover all possible cases across the entire Internet.
2) It's really hard to code anything when the thing you're coding against constantly changes by the second.
For those who are math or computer science nerds, I'm over simplifying. I'm writing a comment not a book. I do agree with Tom: If you follow the logic far enough, you eventually fall into "chaos theory." Yikes.
Some of this worries me, as AI work (which commonly fails to match human intent) continues to come to search. For example, pogosticking on sites meant to be quick reads, and throughout ecommerce, could be misinterpreted as a value signal that reflects in the SERPs. Alternatively, bots could easily be created to manipulate this scenario.
I think the implementation could be sound for the greater good, and I like to think Google is taking precautions (based on all the "it was a good idea at the time" rollouts of the past), but maybe I'm just paranoid.
I also guess this would mean the end of G's routine statements that user data and metrics don't affect rankings (but hey, we've sort of seen that overplayed by personalization anyway).
Awesome Theory
I completely agree with your idea that Google might have created a dedicated team to rate the quality of the pages being processed, and then used that data with a machine learning approach. I am excited to see whether any official person from Google encounters this question and how they might respond to it.
I'd be more interested to see the extent to which Google manually cleans up and reorders the SERPs, especially in those considered "spammy" niches, e.g. payday loans.
Good point, Nick.
I think they do it very frequently. I've read somewhere that they make something like 300 small updates to the algorithm per year on a case-by-case basis, based on certain spammy websites having high rankings or good websites having low rankings.
They DID! See, here's their cookbook - "Search Quality Rating Guidelines": https://static.googleusercontent.com/media/www.google.com/cs//insidesearch/howsearchworks/assets/searchqualityevaluatorguidelines.pdf
Awesome post, Anthony. I think Moz is also using machine learning to understand the Google algorithm: measuring the SERPs for repeated sets of keywords and trying to understand where they are heading, using supervised learning on different sets of keywords to cross-check themselves.
This is the same approach followed by many SEOs to understand what Google intends to do. Rightly said: Google is a black box.
Glad you used a farmer and melons in your example. I'm a farmer and I know how to identify a ripe melon. One of the best methods is by sound: just knock on the watermelon and listen. And you're right, I cannot write a set of rules for that sound. But a machine can learn to identify the sound of a ripe melon.
Now, what inputs do search engines use for "knocking" on sites? Are LDA and topic modelling used not only to determine relevancy, but also as a main metric to identify spam?
Yeah, my in-law does it too, but you can pick it a bit early, just not too early.
They have teams of evaluators that (more or less) randomly pick webpages and say: this website is relevant, this site is useless. They have a scale for that and strict rules.
https://static.googleusercontent.com/media/www.goo...
Google then feeds their black box with pairs (webpage data; evaluator's rating), just like you would feed the example machine manually with pairs (melon data; human-assigned level of ripeness).
A lot of useful information about machine learning in this post. I'm glad that I've read it. Anyway, I was expecting machine learning to become a very popular subject among SEOs sooner or later, and it really has since the Panda and Penguin updates.
Great post!!
From my understanding of machine learning, this also means that if you were to ask the Google Search team (who actually built Penguin or Panda) why your site is deemed spammy, they honestly won't be able to tell you. The machines already have a life of their own that their creators cannot follow exhaustively.
Do mine eyes deceive me, or has someone just described machine learning in plain English? Bravo.
Well, I believe it was necessary for me to learn about machine learning in SEO in order to put together a good conceptual plan for any SEO or Internet marketing campaign.
Thanks Tom!
Great article, great intro to machine learning. For anyone interested in exploring machine learning further, check out https://www.amazon.com/Machine-Learning-Hackers-Drew-Conway/dp/1449303714 for hands-on practical examples. Enjoy!
Thanks for the post. Scary but fun.
I also agree with the point about overripe and under-ripe: it's best to pick at the right time rather than too late or too early. Putting the right things in the right place at the right time gives the most advantage, and it matters a lot for success because competition is so much higher, so keep an eye on every factor that can have an impact on your business, brand or service.
Really useful insight thanks
Awesome post, thanks for the detailed information; it will be really helpful.
Fantastic insight!
Really took something away from this! Nice explanation of what it is and what it's not - pretty much everything an SEO needs to know about machine learning.
Nice comparison. Crawlers are gradually evolving, and anything which evolves over time is worth discovering and exploring. Hats off to the Google algorithm updates, all focused on giving end users a better web experience.
Great article, thank you for the profound and inspiring reading.
On a side note - noticed it when sharing on G+ - moz should escape the quotes in the meta tags - the meta description HTML rendering and parsing breaks because of unescaped quotes, quote:
<meta name="description" content="Since the Panda and Penguin updates, the SEO community has been talking more and more about machine learning, and yet often the term still isn't well understood. We know that it is the "magic" behind Panda and Penguin, but how does it work, and why didn't they use it earlier?" />
end quote.
Boris
Hi Tom,
Nice post regarding Google's algorithm updates...
Explained with a good example - me too, I like watermelon. And I would like to know one thing: how should SEOs prepare for these updates?
This is very interesting stuff! Trying to learn how Google does things is a (needed) step, in order to make great search marketing. But logically, we should all focus on the end goal (the users) instead of trying to master Google techniques.
That is the ideal, but the practical approach to this is often not the same unfortunately.
Yes! Caffeine has done a great job for crawling, and since then Panda and Penguin have gained the ability to get smarter. We are in an era where machines will learn, and after some time we will let them run independently. Yeah, but that's going to take time though. I hope it won't take the heart out of it.
Your blog is very important; I got something new from it. All your SEO tips are important and useful. I agree with you: fresh and unique content also plays a very important role in SEO.
Nice point to be discussed.