As part of our quest to understand the Algorithm, we do a lot of correlation analysis here at SEOmoz. We tend to dive right into the deep end, so I thought it might be a good time to take a step back and talk about the absolute basics of correlation, including some warnings about causation. We’ll often say (and hear) the fallback phrase – “correlation does not imply causation”, but people rarely dig into what that means.
To make this experience as pleasant as possible for the math-phobic, I drew you a picture. I’d like to introduce the world’s first Mathographic. Ok, it’s probably not the world’s first, and it’s really just an infographic, but give a guy credit for trying to keep you entertained.
View Full-sized Infographic (796 x 2200)
Obviously, our actual studies get a lot more complex than this, but everything has to start somewhere. If you’re interested in more advanced topics on correlation, here are a few references worth checking out:
- Correlation and Dependence (Wikipedia)
- Statistical Significance of Correlations (SUNY)
- Spearman’s Rank Correlation (U. Delaware)
- Non-linear Regression (Graphpad)
- Some Correlation Humor (xkcd)
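For anyone who would rather see the arithmetic than read about it, here is a minimal Python sketch of two measures from the references above: Pearson's r and Spearman's rank correlation. It is not the code behind our studies; the function names and the data are made up for illustration, and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r: the covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) to n (largest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y always rises with x, but along a curve.
x = list(range(1, 11))
y = [value ** 3 for value in x]

r_linear = pearson(x, y)
r_rank = spearman(x, y)
print(f"Pearson r    = {r_linear:.3f}")
print(f"Spearman rho = {r_rank:.3f}")
```

Pearson's r lands below 1 here because the points do not fall on a straight line, while Spearman's rho only asks whether the ordering of X matches the ordering of Y.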
Are there any other correlation-related topics you’d like to hear more about on the blog? Are there analyses that you’re interested in within the broader realm of SEO and social media? Let us know in the comments.
I've done some number crunching, and I believe that my thumbs ups are causing Dr. Pete to create great posts. The correlation between my thumbs up and Dr. Pete's posts is just too high to ignore.
Hopefully your post will help people understand the difference between correlation and causation - relevant XKCD!
Very useful article, Peter. Infographics have become more popular among bloggers, and they are one of the best ways to convey complex information clearly and understandably.
Several reasons why many bloggers prefer infographics are:
I have used many sites, and Visual.ly is my favourite; it is like a search engine for infographics. The Mathographic is quite interesting stuff, and I would love to see more of it as well.
Thank you Peter for this wonderful post.
I just discovered Visual.ly recently, and it is really great for random inspiration.
Thank you, Peter. Infographics are a great way to visualize unique stories, and people respond to them well. Visual.ly is a nice one, I must say.
I have to ask--what does imply causation, then?
Causation can only be inferred, never exactly known :)
There's the million-dollar question. "Imply" is a strong word with a specific mathematical/logical meaning - correlation can certainly suggest causation. We just have to realize that there are alternatives.
If X and Y have a strong correlation, common sense and experience can usually push us in one direction or another. Take the common example of the correlation between smoking and lung cancer. We can safely guess that lung cancer probably doesn't cause smoking, so if we have no idea what Z could be, we can start with the hypothesis that smoking causes cancer.
One way to test that is a controlled experiment. Take a group of similar people, split them into 2 or more groups, manipulate X, and measure how Y changes. If directly changing X changes Y, you've got more evidence for causation.
Of course, in many cases that's impractical or unethical. You can't force 1,000 people to smoke and 1,000 not to smoke for 20 years and see who gets cancer. So, you have to rely on other evidence, including:
(1) More correlation studies over different groups. It doesn't prove causation, but it helps build the case.
(2) Other observations and causal data. For example, scientists might find a specific chemical process in the body that is connected to lung cancer, and find that cigarettes contain the chemical involved. Again, not proof, but more evidence.
(3) Multi-variate correlations. Let's say we pull in a ton of demographic data and find that living in polluted areas is also highly correlated with lung cancer. Now, someone hypothesizes that people who live in highly-polluted areas are under stress, so they smoke more (it's our Z, in other words). You could test that by looking at the correlation between living in polluted areas and smoking.
As Frederik said, you can infer causation, but it usually involves evidence from multiple sources, unless you can reveal a smoking gun of causation.
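To make the hidden-Z idea a little more concrete, here is a small simulated sketch in Python. Everything in it is an assumption for illustration: the data is randomly generated, the variable names are hypothetical, and the first-order partial correlation is just one simple way to ask "what is left between X and Y once Z is accounted for?"

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Simulated world: Z drives both X and Y; X never influences Y directly.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [value + rng.gauss(0, 0.5) for value in z]
y = [value + rng.gauss(0, 0.5) for value in z]

r_xy = pearson(x, y)
r_xz = pearson(x, z)
r_yz = pearson(y, z)
r_xy_given_z = partial_correlation(r_xy, r_xz, r_yz)

print(f"r(X, Y)               = {r_xy:.3f}")
print(f"r(X, Y) controlling Z = {r_xy_given_z:.3f}")
```

In this made-up world X and Y are strongly correlated, yet the relationship all but vanishes once Z is controlled for. Real data is rarely this tidy, and you can only control for a Z you thought to measure.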
When dealing with human behavior and behavior modification, it is almost impossible to determine causation. That's why we test everything. Heuristics is the word here.
Hi Pete,
As you said in your first paragraph. "We’ll often say (and hear) the fallback phrase – “correlation does not imply causation”, but people rarely dig into what that means."
What do you think of this?
https://www.branded3.com/tweets-vs-rankings
Whilst the author has stated within the article "Obviously, correlation does not imply causation ...", the use of the word "affect" in the title of the article clearly reveals they have fallen into the trap.
This study proves a correlation between tweets and rankings but it does not prove that "tweets AFFECT rankings".
I read the study when it came out and thought it was interesting, but I'd agree that the title (especially "prove") was a bit over-the-top. Then again, we ARE marketers :)
Yup, it's a fair cop - poetic license.
This is definitely becoming a poster.
In all seriousness, I do most of my illustrations in Illustrator and this one is all line-art, so feel free to DM me if you want a scalable version (that goes for anyone).
The only way I know of to explain causation is DOE, or Design of Experiments :) I usually do the helicopter case study with my students to get them on board.
If you are like me, you can't help but see many people mixing up causation and correlation on the news, in discussions, or even in politics.
For example, red cars get more speeding tickets. Do red cars go faster?
Or
The Honda Accord is the most stolen car. Implied suggestion, if you own an Accord you are at a higher risk of having your car stolen. Incorrect conclusion (without further data) as the Accord is also the most popular car.
It is always fun to spot those, although it can also be frustrating ;-)
Thank you, Pete, for this clear explanation.
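The "most stolen car" example comes down to counts versus rates, which a few lines of Python can show. The model names and every figure below are invented for illustration; they are not real theft statistics.

```python
# Hypothetical figures, invented for illustration only.
cars = {
    "Popular sedan": {"on_road": 2_000_000, "stolen": 10_000},
    "Rare coupe": {"on_road": 50_000, "stolen": 1_000},
}


def theft_rate(stats):
    """Thefts per 1,000 cars of that model on the road."""
    return 1000 * stats["stolen"] / stats["on_road"]


# The model with the highest raw theft count...
most_stolen = max(cars, key=lambda name: cars[name]["stolen"])
# ...is not necessarily the model with the highest risk per owner.
riskiest = max(cars, key=lambda name: theft_rate(cars[name]))

for name, stats in cars.items():
    print(f"{name}: {stats['stolen']} thefts, "
          f"{theft_rate(stats):.1f} per 1,000 cars")
print(f"Most stolen: {most_stolen}; highest risk per car: {riskiest}")
```

With these made-up numbers the popular model tops the theft count simply because there are more of them, while the rarer model carries the higher risk per owner.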
I agree.
What I find particularly disturbing is that the media prey on people's inability to differentiate correlation and causation in order to create news by implying something they know is misleading.
For example, "scientists" (oh, yeah, define those dudes) have discovered that people who drink red wine live longer. Ergo, drinking red wine will cause you to live longer. People who drink red wine may just be an indicator of belonging to a different sector of society that has more disposable income, is better educated and oh, possibly goes to the gym every day before having a glass of wine with friends. There may be no statistical evidence at all that there is any causation here.
However, I'm prepared to roll with that one!
Thank you, Pete! This is a fantastic infographic. It is especially useful when you are showing inference data to determine ranking factors. One thing I would add is the importance of p-values. P-values determine whether or not your correlation is significant, and should not be ignored. For example, you could have a correlation of .85, but a p-value of .4 shows that the relationship could be completely random. However, a correlation of .85 with a p-value of .001 would show a significant positive linear relationship.
P-values would've been the next thing I added if I made the graphic any longer :) Eventually, it just became clear that this is a complex topic, and covering it all in one graphic wasn't going to happen.
P-values are funny with correlations, and a bit unique - what I see more often than a high correlation with a high p-value (high being "bad") is a low correlation with a significant/low p-value. Basically, it's statistically significant, but it still doesn't matter. You don't run into that with experimental design and things like F-tests.
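Here is a rough Python sketch of how sample size drives the p-value of a correlation. It uses Fisher's z-transformation with a normal approximation, which is my choice for keeping the example short (it is not necessarily what anyone in this thread used, and it is crude for very small samples). The sample sizes are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_value(r, n):
    """Two-sided p-value for "no correlation", via Fisher's z-transform.

    Normal approximation: reasonable for moderate n, rough for tiny n.
    Requires n > 3 and -1 < r < 1.
    """
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical case 1: a strong correlation from only 4 data points.
p_strong_small = approx_p_value(0.85, 4)

# Hypothetical case 2: a weak correlation from 2,000 data points.
p_weak_large = approx_p_value(0.10, 2000)

print(f"r = 0.85, n = 4:    p is about {p_strong_small:.2f}")
print(f"r = 0.10, n = 2000: p is about {p_weak_large:.6f}")
print(f"Share of variance explained by r = 0.10: {0.10 ** 2:.0%}")
```

Under this approximation the strong correlation from four points is not significant at the usual 0.05 level, while the weak correlation from 2,000 points is highly significant even though it explains only about 1% of the variance.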
I'm totally going to add this to the list of stuff to use to explain stuff
"The list of stuff to use to explain stuff" may now be my favorite list of all time.
This is also a nice explanation of the Post Hoc Ergo Propter Hoc logical fallacy.
I loooooooove this infographic! Beautiful, informative, irresistible, nice colors... just perfect!
I would like to see an infographic on how much time people spend on social networks.
I don't know if I would want to see such an infographic about myself. I would be scared of what it would say.
I once graphed time spent on Twitter vs. my productivity:
https://www.30go30.com/images/blog-20110524-1.gif
Not gonna lie, I think I enjoyed that second graphic more than the first mathographic. Don't get me wrong, your infographic is a great, clear explanation of the correlation/causation issue, but that second image is hull-ar-ee-us.
I think the outcome of that infographic would be shocking to say the least.
Hitwise did an infographic (well, pie chart really), about how much time people would spend on social networks if the internet was an hour: https://weblogs.hitwise.com/james-murray/2011/09/if_uk_internet_usage_was_just.html
Thanks for this Jenni!
I would have thought that e-mail would be much higher, though I guess it's hard for them to track Outlook?
Love that - what a simple way to illustrate a complex idea. Of course, 2 minutes of porn = a lot of lying Brits :)
Thanks Dr Pete!
I have to confess a little thrill of excitement when I saw the MATHOGRAPHIC! Not because it is about math, or because as more of a word nerd, it was nice to see it was made for me ;)
The most exciting thing about it is the fact that it is a TEACHING Mathographic! Not actually a representation of Data, which so many infographics these days seem to be. So good to see you take a different approach to the "medium".
One of the things I have noticed that seems to challenge people on the subject of correlation is the idea of trying to decide what action to take when there are two or more factors that seem to compete or conflict with each other. One that featured in a recent Q&A thread was built around +ve correlation for exact match domain and -ve correlation for long domain. While I hadn't seen an issue, since it is entirely possible to have a short exact match domain, it seems that some are confused by what they see as a conflict between the two.
Perhaps there is some scope for discussing specific combinations of ranking factors where these kinds of issues can occur.
Thanks again for the thrill!
Sha
Part of what's so challenging in our large-scale ranking factors studies is that you're looking at correlations for dozens of factors. When Google claims to have over 200 ranking factors and Bing claims over 1,000, even relatively small correlations can be interesting. It's also likely that many factors are inter-dependent. I think it's an important problem to tackle, because it has real impact, but like everything in the real-world, it's also very messy.
That's the kind of mathographic I'd like to see.
The question of correlation vs. causation isn't too crazy to me, but I could never easily wrap my head around things like appropriate sample size or what kind of correlation to run given the types of variables or what magnitude of correlation makes a statistical significance. I know, basically the whole gamut.
Inter-dependency is another issue, since I know OLS assumes all totally independent variables. Any sites or tips for that kind of stuff that aren't written like a math paper (which ARE muy interesting, but tend to be more descriptive of the processes behind the math, as opposed to actionable advice)
WOW! This reminds me of my math classes… all about graphs and tricky x, y and z
No, but honestly, we read a lot about correlation and causation on the internet, but the problem is that it’s usually in a very complex format… I really congratulate Dr. Pete for making it so simple to understand the basics of correlation and why correlation is not causation.
Very nice infographics…
You mean, 'mathographic'.
Yeah, like gfiorelli1 said, Dr. Pete is always good with his imagination for words… so yes, I can call it a mathographic, a.k.a. an infographic…
I had a calculus professor who instead of "X" and "Y" always said "These guys" and "Those guys". It was very confusing at 8am, but it kept me awake.
I'm personally a fan of infographic presentations, as they take subject matter that is complex, perhaps complicated, and make it user-friendly for the audience you're trying to attract. A word of caution: it is complicated, so hire an SEO expert who is experienced in assembling infographic models. Trust me, it's not as easy as it looks.
I'm new to this. The explanation really helped, thanks.
That was the easiest-to-understand demonstration of correlation. Thanks for making an infographic for us.
Well, as Gianluca Fiorelli said, your imagination with words is really admirable, and this quality makes your article 100% worth reading. Secondly, for the information you shared and how well you explained it, I am really very thankful to you for this article.
I like to keep my online marketing as simple as possible, personally!
However, the mathographic is an entertaining way of explaining things :-)
As a former math teacher turned internet marketer, this warmed my heart.
Dr. Pete, I love it. Very well done, strictly to the point, and illustrated perfectly. Nice way of simplifying it. All the best, man.
Tom
I find your graphic misleading. It looks like you are trying to demonstrate regression analysis rather than correlation analysis. More examples of correlations near 0 and near .5 would be good.
It's not that difficult anyway.
Cointegration asks the same question about time series. Regression analysis tries to quantify what a 1% move in A means for B.
Anyone interested in not being misled should look at Anscombe's Quartet.
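For readers who have not met Anscombe's Quartet, here is a minimal Python sketch using the four published data sets. It computes Pearson's r and the least-squares slope for each set; the helper name and layout are my own, and the point is only that the summary numbers agree while the scatter plots (not drawn here) look nothing alike.

```python
from math import sqrt

# Anscombe's four data sets. Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (Pearson r, least-squares slope of y on x) for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), sxy / sxx


results = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (r, slope) in results.items():
    print(f"Set {name}: r = {r:.3f}, slope = {slope:.3f}")
```

All four sets give nearly the same correlation and nearly the same regression slope, yet one is roughly linear, one is a smooth curve, one is a straight line with a single outlier, and one is a vertical stack of points plus one extreme value. Always plot the data.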
This is one of the clearer visual examples I have seen of the difference. It should be helpful in future when trying to get the basics of the difference across to a person.
Thanks for the diagram,
First Math-O-Graphic I have read ... makes sense though ;-) An easy way to explain it to all my "non math" colleagues
Great post Dr Pete; I always try my best to understand the posts involving correlation graphs, but this back-to-basics info will definitely help in the future.
And I think Mathographic has potential as a term/series, though I did read it as mathamographic for ages...
This is the infographic I think we all needed. Particularly the bit about the Z or unknown variable that could be the cause. A good example of this in SEO might be Retweets, for instance. It is likely that Retweets and rankings have a positive correlation, but it might be that the higher you rank, the more likely someone will click on the post and therefore retweet, rather than Google paying any attention to the number of RTs at all.
You might want to consider sending this infographic out to more than just the SEO community as well. There are many industries that could do with a little education around this issue.
Another way to explain it:
https://dilbert.com/strips/comic/2011-11-28/ :-)
This covers a basic concept that every SEO needs in their toolbox. Thanks for breaking it down Pete!
Dr. Pete, you amuse me. I love the Infogra... errr... Mathographic. :)
Nice design for your Mathographic!! ...and great topic
Good job Dr.Pete. It amazes me how people love to use infographics
Dr. Pete aka #pete, this is fantastic. I love it. Very nice job. Ken
BBC Radio 4 tried tricking tabloid newspapers into a correlation/causation trap. They did a study to show that in areas where there were more mobile phone masts, the birth rate went up....
Pretty cool to look at correlation in such a simple manner, people often forget the basics and then when complicated matters arise it's impossible to deduce anything.
I look forward to seeing what people would like measured in future articles!
Keep it simply stupid for my effort..kiss me (like my buddie would say) thanks tons for the easy explanation!
Marco Gutierrez
Great overall design of the infographic! It's a nice summary of correlation & causation. It's that pesky 'Z' factor that makes determining causation so difficult! Thank you.
Mathematics is a topic which affects all of us daily. It is also a subject which leads most people to click on to the next page. What do you think the correlation is between mathematical analysis articles and bounce rates? :)
Hats off to you, Dr. Pete. You found a way to take a daunting topic and break it down in such a way as to make it understandable to everyone. While this article won't help on its own, it is a key building block to understanding many aspects of SEO analysis.
Why aren't all math elements explained in this way? I like the RT and rankings correlation factor mentioned above. Can we think of other correlation/causation elements that affect SEO? For example, more links pointing to your site might "cause" a boost in your rankings, but how do you factor in spammy links and correlation?
You and your graphs :P
Good read - cheers.
Great infographic- beautifully done, may have to save that to my HD to explain to clients the difference between correlation and causation. Thank you!
This infographic should be given to Econ professors in college and handed out on day one. If I had this laminated inside my book, I wouldn't have been so afraid of coefficients and variables as I went through class. Nice infographic.
Love the article, but I think a little more could have been done with the infographic to help it stand on its own sans article.
Great read, Dr Pete. I'm always looking out for your posts. This is an issue that I am constantly dealing with, working with an outdoor visitor attraction that is hugely impacted by seasonality and weather.
I'd love to know if you have done any work on assessing the impact of seasonality/weather and the methods you use.
In the meantime I'll get on with your further reading!
You sure do know how to explain things in a very simple way.
Correlation may not imply causation, but it does show how much of the causation is attributable to the correlation.
Good color and design on your mathographic.
One of the things I have always liked about you, Peter, is your imagination with words... Mathographic is surely a great one, and I'm pretty sure that we are going to see it grow as an infographic subgenre.
On the other hand, the information given is useful indeed, and "Causation is not Correlation" is explained more easily here than in (sorry, bud) the dolphins example Rand gave in an older post/presentation.
Useful!
You often see people on forums and blogs claiming that they "discovered" something about Google's algorithm in one of their "experiments".
Unfortunately, many people don't understand the fact that correlation does not imply causation. On top of that, they think you can draw conclusions from even the smallest datasets.
This is why I like SEOmoz - here you can find the scientific methods that other blogs often ignore, delivered in language and with beautiful infographics that most people will understand.