I've been an SEO for a long while - nearly 8 years. In all that time, I still haven't been able to wean myself off the intoxicating drug dealt out by the Google toolbar - that "little green fairy dust" called PageRank. Intellectually, I know it's flawed in a multitude of ways, but so many people in our field (and in the broader webmaster/marketing community) still talk about "PR 4 websites" and how "I have a PR6 but he's still outranking me." I find myself thinking about it, using it in conversations and yes, even considering it as a metric for rankings.
There are many reasons why PageRank shouldn't be a primary metric for SEO:
- Infrequently updated - Google updates the PR scores in the toolbar 2-4X each year on an unpredictable and unpublished schedule. The PageRank score you see today could be dramatically different from the PageRank Google is using in its ranking/crawling calculations.
- 1 of 200+ ranking signals - Google's representatives have continually repeated that PageRank is just one of "more than two hundred" signals the engine applies to the rankings equation.
- Applies to pages, not sites - The PR score is based on individual URLs, not domains. Technically, there's no such thing as a "PR 5" website, just a website whose homepage URL displays "5" in the toolbar.
- Imprecise - PageRank is a logarithmic score when fitted to the 0-10 toolbar scale. We've estimated the log base at around 8-10, meaning that a PR5 URL has 8-10X more PageRank than a PR4. Yet, there's no granularity between values. One PR4 page might have 5 times more PageRank than another PR4 page, but the toolbar score won't tell you until the underlying value crosses the next logarithmic threshold.
- Intentionally Inaccurate - Google has been using toolbar PageRank to visually penalize pages and sites for buying/selling links for many years, but they readily admit they apply this filter intermittently so as not to tip off spammers. Thus, we're never sure when looking at PageRank whether a page/site has or hasn't had its PageRank reduced and whether that does or doesn't impact rankings (or the value passed by the non-manipulative links).
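To make the "Imprecise" point concrete, here's a quick sketch of how a logarithmic bucket hides large differences. The log base and the raw PageRank values below are pure assumptions for illustration - Google has never published either:

```python
import math

def toolbar_score(raw_pagerank, log_base=8.5):
    """Map a hypothetical raw PageRank value onto the 0-10 toolbar scale."""
    if raw_pagerank < 1:
        return 0
    return min(10, int(math.log(raw_pagerank, log_base)))

# Two pages differing 5X in raw PageRank can share the same toolbar score:
print(toolbar_score(1000))   # shows PR3 under this assumed base
print(toolbar_score(5000))   # also PR3, despite 5X more raw PageRank
```

Under these made-up numbers, both pages look identical in the toolbar even though one has five times the link juice of the other.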
But perhaps none of these are as compelling as the data put together by our in-house correlation, machine-learning & ranking model expert, Ben Hendrickson. Over the past few months, we've gotten an increasing influx of questions about PageRank in relation to our tools, the mozbar, and through Q+A, so Ben's gone ahead and done some hardcore correlation analyses to help answer our most pressing question about the toolbar PageRank score - does it matter, and if so, how much?
How Well Does PageRank Correlate to Rankings?
The short answer is - not very well, but not badly enough to suggest that Google's statement above is entirely inaccurate. Let's have the data do the talking:
Using Spearman correlation, we can see that for page one results ordering (the correlation we measured for all of these charts), Google's toolbar PageRank is around 0.18. A perfect correlation would be 1.00 and a completely useless/random correlation would be 0.00. In other words, PageRank has a positive correlation, but it's not particularly predictive.
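For readers who'd like to reproduce this kind of measurement, here's a minimal sketch of Spearman's rank correlation in pure Python. The SERP below is invented for illustration (so the result won't match our measured 0.18); positions are negated so that a metric that's higher for better-ranking pages yields a positive rho:

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1   # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho is Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical page-one SERP: invented toolbar PR scores for positions 1-10.
positions = [-p for p in range(1, 11)]          # negate so rank 1 is "best"
toolbar_pr = [6, 4, 5, 4, 3, 5, 2, 4, 3, 2]
print(round(spearman(toolbar_pr, positions), 2))  # → 0.71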
Interestingly, PageRank is even more useless on Yahoo!'s results ordering and Google.co.uk results (UK SEOs take note!) but nearly as good for Bing.com as the Google results in the US.
The next time your boss or client asks you about increasing their PageRank, show them this chart. It's the best evidence we at SEOmoz have to back up the statement "PageRank doesn't matter much." The metrics website owners and marketers should care about are traffic, conversions, and the lifetime value of the visits sent by search engines. PageRank (and similar metrics) don't help with these at all. SEOs, however, appreciate any proxies or metrics they can get their hands on that will help to better explain the rankings. We'll use the rest of the post to tackle that issue.
Is PageRank the Best Metric of its Kind?
Another interesting question we need to ask is whether other, similar metrics that model themselves on PageRank's data (and Google's link graph of the web) are potentially worth using. The chart below speaks directly to this question:
In this chart, we're looking exclusively at correlation with Google.com's rankings (in the US). PageRank and SEOmoz's own mozRank are extremely close, but, perhaps surprisingly, mozTrust (which uses a PageRank-like algorithm biased to trusted seed sources) and external mozRank (which counts only the mozRank to a URL coming from external links) both have higher correlations.
This suggests that, as Google's representatives have often said, "what others say about you is more important than what you say about yourself." Looking at the quantity of external link juice flowing to a page may be a better metric than just that page's total link juice (including values from both internal and external links).
Are Other Commonly Available Metrics Better Correlated?
When we saw that some PageRank-like metrics might be more usable (or at least very competitive substitutes) for this purpose, we naturally asked "what about non-PageRank metrics?" This next chart provides some answers:
The data here is especially interesting. Yahoo!'s link count is a good deal better than Google's PageRank in correlating with Google's own search results!
Perhaps not surprisingly, Page Authority, a metric that Ben builds with rank modeling, has the highest correlation with Google.com's rankings. It's about 51% "better" correlated than PageRank - a big step up, but still nowhere near telling the whole story. While this data may seem to make SEOmoz's metrics look quite good, in fact, our raw link counts are slightly worse than Yahoo!'s, suggesting that we still need to improve Linkscape's crawling and indexing.
Can We Value Websites/Domains with PageRank (or Other Metrics)?
Another big question we need to answer is around the concept of "homepage PageRank" being a measure of a site's ability to perform in Google's rankings. Correlation data can answer this quite competently:
The correlations here are, not surprisingly, considerably worse. Estimating a page's ranking based on page-specific metrics is hard enough, but to do so using only data we have about the domain that page is hosted on is extremely challenging. Still, we can see that some different metrics than those we used previously can offer some insight. Google's homepage PageRank certainly isn't great, but it's also not much worse than the best metric we've got - Domain Authority.
It's also extremely curious that Compete.com's traffic rank outperforms PageRank and that Yahoo!'s count of links to a domain underperforms, particularly considering its impressive show in the page-specific metrics. We did also attempt to pull Alexa data, but found the speed and consistency to be so poor that we couldn't get it all prior to publication.
The story with domain-level metrics that I'd love to tell you is "use Domain Authority," but being TAGFEE, I have to say that today, no single metric is, IMO, good enough. We'll be hard at work improving these over the weeks and months to come, but we'd also love to see other efforts to help solve this puzzle. Valuing a domain's ability to rank pages in Google may be challenging, but it's a very worthwhile goal.
Where/How to Access These Metrics
We used a variety of metrics in our correlation analyses above, and we certainly invite you to use any that are of interest in your own work:
- Google's PageRank Score
- via the Google Toolbar
- Also available, though potentially against Google's TOS, through the PageRank Checksum (please perform your own searches)
- Yahoo!'s Link Counts
- via Yahoo! Site Explorer
- via the Y!SE API
- Compete.com Rank
- via Compete.com's free tool on their website
- via the Compete.com API
- Alexa Rank
- via Alexa.com
- via Alexa Data Services API
- SEOmoz Metrics
- via the SEOmoz API (free up to 1 million calls / month)
- via Open Site Explorer
- via the Mozbar
- via Linkscape
Information about the Dataset Used for this Analysis
We suspected that folks would have questions about how the data was gathered, the source of the keyword/ranking information and several other pieces. Ben kindly answered many of these below:
How many keyword rankings did we collect?
Over 4,000 search results for Google.com and over 2,000 for the other engines (Google.co.uk, Bing.com, Yahoo.com).
What is our accuracy level with this data?
The standard error ranged between 0.00528743 and 0.00559586 for the Google.com correlations.
Standard error gives some idea of how much our answer is likely to change if we looked at a lot more queries than we did. If we had been able to look at an infinite number of queries similar to those we tested (ignoring that doing so would be impossible), the answer we would get would be 68% likely to be within one standard error of the answer we measured here, and 99.73% likely to be within three.
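In code, those intervals are just simple arithmetic on the measured correlation and its standard error (here, the Google.com toolbar PR figures from above, with the standard error rounded):

```python
rho = 0.18      # measured correlation for toolbar PR vs. Google.com rankings
se = 0.0056     # reported standard error (rounded)

# ~68% of repeated measurements would land within one standard error,
# ~99.73% within three:
one_se = (round(rho - se, 4), round(rho + se, 4))
three_se = (round(rho - 3 * se, 4), round(rho + 3 * se, 4))
print(one_se)    # (0.1744, 0.1856)
print(three_se)  # (0.1632, 0.1968)
```

Even the widest of these intervals stays well clear of zero, which is the point Rand makes later in the comments.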
What source provided the keywords/rankings we used?
From Google AdWords' suggested keywords for different categories. If one adds together all keywords for all of the top categories with all of the subcategories one level down, one gets a bit over 11,000 unique keywords. From this list, we randomly sampled keywords.
Why did we use Spearman rather than Pearson correlation?
Pearson's correlation is only good at measuring linear correlation, and many of the relationships we are looking at are not linear. If two quantities are strongly but exponentially correlated (as link counts generally are), we don't want to score them unfairly low.
[Update April 22nd 6pm: I should give sferguson credit for suggesting using Spearman's to us]
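A toy example makes the distinction concrete: data that is perfectly monotone but exponential earns a Spearman score of 1.0, while Pearson's sells it short. The numbers below are fabricated purely for illustration:

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """1-based ranks (values assumed distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

xs = [1, 2, 3, 4, 5, 6]
ys = [2 ** x for x in xs]             # perfectly monotone, but exponential

print(round(pearson(xs, ys), 3))      # ~0.906: Pearson is fooled by the curve
print(pearson(ranks(xs), ranks(ys)))  # 1.0: Spearman sees perfect correlation
```

That gap is exactly why a monotone-but-curved metric like raw link counts would be scored "unfairly low" by Pearson's.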
How did we handle "ties" in the results (when, for example, PageRank wasn't granular enough)?
We follow exactly the methodology that is suggested for Spearman's in textbooks, which is treating all tied values as having ranked indices equal to the average of the indices of the tying values. This might give an unexpected advantage to less granular metrics (like toolbar PageRank) because they can hedge and vote "tied" on close calls whereas more granular metrics do not. On this data it seems this is not affecting the results much, as the results appear similar to other ways of handling ties that do not have this effect.
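Here's a minimal sketch of that textbook average-rank treatment; the toolbar PR values are invented, but notice how the three tied PR4 pages all receive the averaged rank:

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1      # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

# Coarse toolbar PR scores produce many ties; the three PR4 pages all
# receive the averaged rank (2 + 3 + 4) / 3 = 3:
print(average_ranks([5, 4, 4, 4, 3]))  # [5.0, 3.0, 3.0, 3.0, 1.0]
```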
The Big Picture in Just a Few Words
Google's PageRank is, indeed, slightly correlated with their rankings (as well as with the rankings of other major search engines). However, other page-level metrics are dramatically better, including link counts from Yahoo! and Page Authority.
Homepage PR of a website is much less correlated with the ranking performance of pages on that site, but not entirely useless. Domain Authority is a slightly better metric for this purpose, as is the Compete.com Traffic Rank of the domain. None of these, however, are convincing enough to be highly useful today (in our opinion). The best they can do is serve as a proxy until (hopefully) better metrics arrive.
Looking forward to your comments and questions as always!
Oh, and if you found this post valuable, Tweets are appreciated :-)
Rand, your (or Ben's) reasoning for using Spearman correlation instead of Pearson is wrong. The difference between the two correlations is not that one describes linear and the other exponential correlation; it is that they differ in the type of variables they use. Both Spearman and Pearson are trying to find whether two variables correlate through a monotone function. The difference is that they treat different types of variables - Pearson deals with non-ranked or continuous variables, while Spearman deals with ranked data.
I am not sure whether using Spearman for number of linking domains or external links is the correct statistical test to use.
Since Pagerank can be treated as ranked variable, using Spearman sounds about right, I am not sure so much about mozRank and mozTrust (don't know enough about those variables)
Another point is that regardless of what test you use, Correlation Coefficient of .2-.3 is extremely low. It very roughly translates to the fact that the chance of a significant monotonic correlation between the two variables is 20-30%, which is considered random. Taking your standard errors into account, the differences between the correlation coefficients are not significant enough to draw conclusions from. If one takes into account that there are more than 200 parameters Google uses when ranking a website, it would make sense that the correlation between a single parameter and ranking would be statistically best described as random.
Branko
This is an excellent response and one that shouldn't be glossed over by people just because they might not understand how statistics work. I myself have only a fleeting experience of these correlation techniques, but it is enough to know that what whiteweb_b said:
"Another point is that regardless of what test you use, Correlation Coefficient of .2-.3 is extremely low."
...is very true. Don't be put off by the first part of this guy's response if you don't understand it (I only did at a very basic level); read the second half, because people need to understand what can actually be taken from this data (i.e. not a great deal, if you're looking for a relational pattern).
It was still an interesting original article, and the first half was still particularly useful to me.
I'm going to let Ben tackle the question regarding Spearman. We had previously used Pearson, but after significant research into potential correlation methodologies, determined this was the right one. Ben can certainly explain better than I.
With regard to the "nearly random" - that would be a gross inaccuracy. Random would be a correlation of 0.00. Even if we include the highest standard error of 0.00559586, not one of these correlations is close to 0.00 or randomness. They all clearly have a significant, measurable correlation with rankings. That's not to say that any of them are excellent, but as single metrics in an algorithmic formula that contains 200+ factors, they're somewhat significant.
Even with the 0.18 correlation (could be as low as 0.1745 with the standard error), I wouldn't try to claim that Google is lying on the corporate technology page. It may not be huge, but there is correlation with rankings, suggesting that indeed, pages with more PageRank have some better opportunity to rank higher.
Let me divert and digress. Are we totally missing a bigger problem?
1. Using random keywords for input data is bad. Instead, select keywords which are competitive (cpc and search volume). Those phrases are optimized. Much more to learn from difficult phrases than easy phrases.
2. Consider backlink anchor text. A PR10 means nothing if your anchor text is the phrase "click here." Simple case in point: search for the phrase "search engine" on Google. Guess what - Google (PR10) is nowhere near the top, while Dogpile (PR8, which is way less than a 10) is.
Come on guys. Please re-do with a better designed test.
If I recall, Spearman correlation is the appropriate statistic for comparing ranked and non-ranked data. I believe Kendall serves a similar purpose.
While a correlation of .2 is low, it can be significant with a large enough n. The significance levels of the correlations weren't reported here.
Now to go out on a limb. If two correlations both have a standard error of .005, A difference of .01 or greater between them should be significant at p < .05. I may however be completely wrong, and wouldn't know for sure without seeing the data.
Hi White Web,
Sean's response is exactly right, but I'll repeat it more verbosely :-)
Your statement that "both Spearman and Pearson are trying to find whether two variables correlate through a monotone function" is not accurate. The distinction between measuring only linear correlation or any monotonic correlation is actually the critical difference between them. I will touch on why that is, but first let me just cite the relevant Wikipedia articles.
The Wikipedia article on "Pearson correlation coefficient" starts by describing it as a "measure of the correlation (linear dependence) between two variables".
The Wikipedia article on "Spearman's rank correlation coefficient" starts with an example in the upper right showing that a "Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. In contrast, this does not give a perfect Pearson correlation."
Technically, Spearman's correlation is the same as Pearson's, except one first replaces the values of both variables with what their indices would be if one sorted them. This is what makes any monotonic function become a linear function, which normal Pearson's will score perfectly.
You make the comment "Pearson deals with non-ranked or continuous variables while Spearman deals with ranked data". This is only true because Spearman converts everything to ranked variables! If the variables are all ranked to begin with, Pearson and Spearman are identical. So it certainly is not correct to suggest one can only apply Spearman's to already-ranked data, or else there would never be a case where it would give a different value than Pearson's!
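A quick numeric check of that last claim - with data that is already ranked, ranking it again is a no-op, so the two coefficients coincide exactly (toy values, sketched in Python):

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Already-ranked data: ranking it again changes nothing, so the two agree.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson(xs, ys) == spearman(xs, ys))  # True
```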
.....
Your second point I think is confusing the idea of a small correlation coefficient with small certainty about the value of a correlation coefficient. We aren't looking at factors such as "does the page have any content relevant to the query" so we would expect the correlation to be fairly low. Nevertheless, understanding what metrics are the best to measure this query-independent strength of a page is quite important to SEO. This can be done by assuming independence between page content and the metrics, and then measuring the correlation just to the query-independent measure. One can argue about to what extent this assumption of independence is not valid and may cause some bias, but it is statistically quite sound. Does this make sense why this is valid, and why we would expect the correlation values to be fairly low?
Anyway, I like discussing math more than my usually programming, so do answer back if any of this doesn't make sense or you still take exception to the approach of the article :-)
Ben
Hi Ben
Thanks for the response
So for the first part, I did not claim that Pearson is better suited here than Spearman's. My claim was that the justification as stated in the article was not correct. While we can argue the validity of each test, it still seems to me that the explanation which says we use Spearman instead of Pearson because "link counts are generally exponentially correlated" is wrong. If we know that the link counts are exponentially correlated (or any other variable), there would be no need to establish independence. Rank correlation is used when we don't know whether variables are correlated and want to test that null hypothesis. Furthermore, the fact that Spearman deals with ranked data (either because that is its nature or because we ranked it so we can perform Spearman's on it) already tells us that the correlation (if it exists) will be linear, hence Spearman's nickname as "the Pearson test of ranked data". Additionally, I wrote that PageRank's correlation to rankings would suit the Spearman rank correlation perfectly, but am not so sure about the backlink count. One would have to transform link counts into ranks, and I do not see how that can be done consistently over a large number of SERPs (but please do correct me if I am wrong).
As for the second point, I do not think there is confusion on my part. I was talking about low values of the correlation coefficient, not about small certainty of the value. My point was that, yes, you can use Spearman's coefficient to test a null hypothesis which claims that two ranked variables are independent of one another (and to do that on a sample like yours, one must perform a Student's t-test, which I don't see in the article), but you cannot use Spearman's coefficients measured on different variables to compare the strength of correlations between those variables and the dependent variable. In other words, you can say that your Spearman's test rejects the null hypothesis which says there is no correlation between PR and ranking, but the strength of the correlation cannot be established from it, let alone compared with the different rho values of other parameters (especially parameters that are not naturally ranked, like link count).
I have been breaking my teeth on similar measurements on a different SEO subject and have consulted several statisticians (and large volume of literature) on these issues, so i can definitely appreciate the effort invested in trying to give interpretation to so much gathered data, but unfortunately the wish to publish the conclusion does not correlate with the significance of the results :)
Again, thanks for your response. I second the request expressed here for some more in-depth study on statistical analysis of different types of data we gather when performing SEO research.
Thanks for engaging on this. I'm still pretty sure I applied the math right, but I also enjoy chatting about it, and think we can reach some conclusions about this together.
Here is my argument for Spearman's over Pearson's a little more verbosely:
1) We cannot assume linearity of our data. Said more precisely, we cannot a priori assume that the metrics we are looking at are correlated if and only if they are linearly correlated. If I were correct that links are more exponentially correlated than linearly correlated, this would be obvious; but even if you don't accept that, I still think you must concede there is no reason to assume any correlation would necessarily be linear.
2) Given the correlations might not be entirely linear, we shouldn't use a measure of correlation that only measures linear correlation.
3) Pearson's only measures linear correlation. Spearman's does not. The quotes from Wikipedia in my first reply make this clear, although if you doubt this we can argue the math directly. So it seems pretty clear Spearman's is the way to go for this problem over Pearson's.
.....
It is true I didn't do the math to reject the null hypothesis on each value to show they were correlated, but that is only because I thought it was pretty clear we would always reject those null hypotheses and find them correlated. However, it is pretty easy to do this math. Consider the correlation of PR to Google.com. The correlation coefficient is greater than 0.18 and the standard error is less than 0.0056, so the null hypothesis being right would be an event of more than 32.143 (= 0.18/0.0056) standard deviations. This is unlikely enough that most online calculators round it to 0 probability, although using wolframalpha.com we can see the chance is less than 1*10^-278. That is less likely than winning a major lottery 30 times in a row.
I think that is strong enough significance to publish :-)
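Ben's arithmetic, sketched in code. The p-value here uses a one-sided normal-tail approximation, so treat its exact magnitude as indicative rather than precise - the takeaway is simply that it is astronomically small:

```python
import math

rho, se = 0.18, 0.0056
z = rho / se                            # standard deviations away from zero
p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided normal tail probability

print(round(z, 3))     # 32.143, the figure Ben quotes
print(p < 1e-200)      # True: vanishingly unlikely under the null hypothesis
```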
....
The claim you make in bold is that one cannot use Spearman's to compare the relative strength of correlations. Why is that? These coefficients are a measurement of correlation between 0 and 1, so greater and lesser are well defined. We compared them in the post - how did that not work? Do you really mean to say it is unfair to claim a Spearman's correlation of 0.99 is more correlated (measured by Spearman's coefficient) than a Spearman's correlation of 0.01? Spearman's correlation is a measure of how close the correlation is to being a monotonic function, so it seems pretty clear a measurement of 0.99 would be very close to monotonic correlation and 0.01 would be far away (at least relative to each other). How is that not valid?
You are right that one can use Spearman's coefficients to try to reject a NULL hypothesis of independence, but that doesn't mean one cannot also use the coefficients as a measure of correlation.
.....
You make the point that you think Spearman correlation cannot be applied to link counts, although you think it can be to PR, because one cannot convert the raw values to ranked values consistently over a large number of SERPs. Here is how one can (and must) do it. For each SERP, replace each link value with the index of that value if the values in that SERP were sorted. In fact, when applying Spearman's to PR values, one also needs to do this. It is part of Spearman's algorithm to do this, and because not all PR values will be in every SERP, one would get the wrong answer if one did not.
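A sketch of that per-SERP conversion, with invented link counts. Raw magnitudes that differ wildly across SERPs become comparable once each SERP is ranked internally:

```python
def serp_ranks(values):
    """Replace each metric value with its 1-based index in this SERP's sort."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# Hypothetical raw link counts from two different SERPs:
serp_a = [120, 3500, 40]
serp_b = [9, 880, 15000, 62]
print(serp_ranks(serp_a))  # [2, 3, 1]
print(serp_ranks(serp_b))  # [1, 3, 4, 2]
```

Spearman's coefficient is then computed on these within-SERP ranks rather than on the raw counts.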
....
Am I being clear? I tried to answer all of your objections, let me know if I missed something. I'm sure we can clarify everything.
Ben
Ben,
Really enjoying this exchange. Statistics are one of my biggest passions. Being well-versed in the field is certainly valuable for SEO. I'm glad to see that SEOmoz has a team member like you.
I'd like to second the request below for SEOmoz to write a blog post about using statistics for SEO. I think that the community would find it very valuable, and it would help us gain a deeper understanding of a lot of SEOmoz's research.
I emailed this to my boss and he asked what we can do to get our page rank higher . . . go figure.
LOL! It's kinda like you're living in a Monty Python sketch. Next thing you know, someone from the Ministry of Silly Walks will be emailing you asking for help to keyword stuff their home page.
I was already asked to keyword stuff the homepage. My boss also asked me if I can make the font color white so people can't read it. I actually remember 12 years ago when that worked (when Yahoo was #1).
I'm late to this discussion and not a statistician, but I will claim that comparisons between the "measured" coefficients e.g. "It's about 51% better correlated than PageRank - a big step up" are not valid.
I see Branko already noted this as well.
Given that these claims Rand makes such as "It's about 51% better" are what most people will take away from the discussion, and noting that they also suggest Rand's Tools are better than PageRank numbers, any responsible SEO has to question the ethics of publishing this (and labeling it science) -- as Michael Martine also noted above.
Now.. can *I* prove that it is invalid to compare the derived coefficients and assume a linear relationship such that claims of "51% better" are allowed? Correlation coefficients can almost never be compared to each other linearly as Rand compares them here. Branko noted that as well.
Maybe Ben can run some fabricated test data through his computers, to show that known-to-be correlated data produce high correlation coefficients via his scripts? And known-to-be-uncorrelated data produce low correlation coefficients? And then, using that test data, can we test whether or not the produced correlation coefficients relate to each other in a fashion that supports claims like "51% better"? Surely you can fabricate data that is actually 51% better, and run it through the programs. That would certainly remove a lot of the doubt surrounding these claims, and the validity of the approach.
The SEO community (or perhaps more accurately, the seomoz community) needs to decide whether or not or how much to trust seo "research" based in such statistical analyses. Most work this deep into statistics is outside our reach. But does that mean we accept it, and allow the claims to stand?
I pitched a talk on this issue for the SMX Advanced meeting, where I included an outline of how one can safely report SEO research findings without stepping into the muddy waters of unverifiable claims. It didn't get accepted this time. It is an important issue that we all need to address.
I'm not nearly as smart as Ben, but I have a couple of thoughts:
What if Ben gathered information only about competitive keywords, say those with a global monthly search volume over 3,000? I wonder if the correlation between PR and rankings would increase dramatically.
He may be using only popular keywords, but if he is using more random keywords then I could see where PR would be less predictive.
When I look up a couple random popular keywords here is what I find:
Pizza
Page 1 SERPS:
1. PR 7
2. PR 7
3. PR 7
4. PR 6
5. PR 5 (note domain name is pizza.com)
6. PR 6
7. PR 5
8. PR 4
9. PR 4
10. PR 5
Franchise
Page 1 SERPS:
1. PR 5 (note domain name is franchise.com)
2. PR 6
3. PR 6
4. PR 6
5. PR 6
6. PR 3 (note dictionary.reference.com - alexa rank 165)
7. PR 4
8. PR 4
9. PR 5
10. PR 5
To me, PR is just one of a number of factors, but I still think that in popular keyword searches, the results will basically show sites with similar PR in descending order unless other things are overwhelming PR. Looking at the PR of the other sites that come up can at least give you an indication of whether that's a keyword you should target with the page you had in mind, or whether you are probably out of your league.
Improve your PR by acquiring great links and your site will generally rank higher for various keywords.
You raise a good point. I touched on this in my comment above. There are probably additional factors that mediate the relationship between PageRank and SERP rankings. Query volume may be one of them.
It is a good point. More slicing and dicing would be good.
I did compare one-word vs. two-word queries, as that was easy to do. It showed that one-word queries were more correlated to about all of the metrics. One-word queries are probably high traffic, so this suggests you are probably right that higher-traffic terms correlate better with PR and all of the other metrics. But I thought this made a few too many assumptions to put in the post.
Adwords data would really be the right thing to use.
Hi Rand and SEOmoz team, I loved this article. As you know, we are trying to do some correlations here in Brazil, but I was wondering if you can give us some good resources to understand more about correlation and how we can do it ourselves.
Maybe Ben could create an article explaining how to measure correlation like you did here in this article. I'd love to learn and help the SEO community as you are already doing.
Again, congrats for this post. You are doing a great job!
From a big fan of yours, Fábio
A thumbs up for the request of an article from Ben
And a virtual one (I don't want to spam) for the willingness you show to help the SEO community.
Correlation is actually a fairly simple statistic. Excel, or a similar program, can calculate it as a function. Essentially, correlation (r) indicates how strongly Y moves with X; its square (r^2), the coefficient of determination, indicates how much of the variability in Y can be attributed to changes in X, expressed as a percentage.
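A minimal sketch of that computation with made-up values - the same quantity Excel's CORREL() function returns:

```python
def pearson(xs, ys):
    """Pearson's r, the same quantity Excel's CORREL() returns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

xs = [1, 2, 3, 4, 5]          # made-up metric values
ys = [2, 4, 5, 4, 5]          # made-up outcomes
r = pearson(xs, ys)
print(round(r, 3), round(r * r, 2))  # 0.775 0.6
```

Here r^2 of 0.6 would read as "60% of the variability in Y is attributable to X."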
Thank you for being so transparent about the correlation of SEOmoz's own metrics with rankings.
I am sure many would find it tempting to just say they have the best tools on the web and try to hide the flaws.
And I am very happy to have some stats that prove PR shouldn't be used. I guess I should be proud of not having the google toolbar installed ;)
This is a nicely detailed post, and it gives us more to read on PR to clarify and affirm the views we share with our clients.
Ah, PageRank. The amber nectar that turned to water.
About as relevant as a dmoz listing.
Maybe any post extolling the virtues of PageRank and dmoz listings should be auto-blocked by Google et al. to limit long conversations with partially educated managers/clients?
Definitely need to get back to stats at some point. Useful information... I love trying to find THE SEO metric that beats 'em all, although I know it's impossible.
One day, all in-house SEO teams will need a mathematician to be able to get the best results. It's the heart of SEO R&D imho
One question: What was this nice little application SEOmoz developed that could help to find correlation between two variables using curves? Can someone point it to me again?
Online Non-Linear Regression Tool
Thank you! Exactly what I was looking for! :)
I haven't seen any crappy site with a high PR, say 7 or above. Conversely, I have never seen an authority site with zero or no PR (unless it gets penalized). As a site grows in popularity, I see a gradual rise in PR. See how the PR of Twitter has increased over the last year. Although algorithmically there may be no relationship between PR and popularity, from a boss/client's point of view (someone who doesn't know SEO at the algorithmic level), it looks like there is. This relationship is not just between popularity and PR but also between domain authority and PR. Your client can always say, "All the popular websites I see have high/very high PR, or all high-PR sites are popular in their niches. If you have low PR, then it means you have low global link popularity, i.e. not many people find your site worth linking to, or it hasn't got juicy backlinks." This is something which is very hard to explain without wrestling with Spearman correlations.
Another problem is that on one hand we tell our client that PR is not important, and on the other we want him to change his site architecture or do link consolidation so that link juice can flow to the most important pages. This contradicts our own statement that PR is not important. PR may not be important ranking-wise, but it is very important for keeping pages on a site (especially very big sites) in the main index and hence rankable. If this were not the case, there would be no need to fix site crawlability issues.
If you have low true, under-the-hood PageRank, this is probably true. However, perform a highly competitive search like https://www.google.co.uk/search?q=bingo and note how the toolbar PageRank doesn't correlate to the ranking sites. WinkBingo's home page has a toolbar PR of 6 - higher than all the other ranking sites, but it ranks sixth.
There is going to be a good reason why Google ranks these pages, or any others, where it does; however, the green bar at the top of the ranking pages isn't much of a factor.
Secondly, paying close attention to any one page's toolbar PageRank (which is updated infrequently, often cosmetic and inaccurate due to PageRank's logarithmic nature, e.g. "One PR4 page might have 5 times more PageRank than another PR4 page") and building a site that can have authority easily passed around it, are two entirely different things.
I have also worked on a site that had a PR7 and ranked for nothing. Its PR7 had been obtained by a legitimate content sharing scheme that looked like spam due to content containing links, even though it wasn't spam and wasn't done by SEOs or with SEO in mind. Google gave it its toolbar PageRank, but the site was penalised. We explained the issue and the penalty was lifted. The site's PageRank went down, but it began ranking top-10 for its primary, competitive phrases. In other words, after our reconsideration request Google discounted all of the PageRank passing through those links and the toolbar PageRank went down. However, those links weren't helping to begin with (i.e. the toolbar PageRank score wasn't helping). No one wants a tbPR 7 page that doesn't rank. I'll take a tbPR4 page that ranks over that any day.
From the above example, it's worth noting that a penalised site won't always lose its toolbar PageRank, making toolbar PageRank even less trustworthy a metric.
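To make the logarithmic point above concrete, here is a tiny back-of-envelope sketch. The actual log base of toolbar PR is unknown; base 8 is just a commonly guessed value, and the helper name is mine, so treat all of this as an assumption rather than fact:

```python
# Rough illustration of toolbar PR's assumed logarithmic scale.
# Each toolbar score is treated as the low edge of a band whose
# boundaries sit a factor of `base` apart.
def nominal_pagerank_ratio(tbpr_high: int, tbpr_low: int, base: float = 8.0) -> float:
    """Nominal ratio of raw PageRank between two toolbar scores."""
    return base ** (tbpr_high - tbpr_low)

# Under a base-8 assumption, a PR5 URL nominally holds ~8x the raw
# PageRank of a PR4 URL, and ~64x that of a PR3 URL.
print(nominal_pagerank_ratio(5, 4))  # 8.0
print(nominal_pagerank_ratio(5, 3))  # 64.0
```

It also shows why two pages with the same toolbar score can differ several-fold in raw PageRank: the entire band between two displayed values collapses into one digit.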
May I say this? Jane, I love the clarity of your answers here in the blog (where I'd love to see you more) and in the Q&A.
Cheers! Shockingly, haven't blogged on here since my "I'm leaving Seattle" post in January of last year. Will do something about that when I can think of something original to say ;)
My word Jane. When you come out of the woodwork, you really come out with a splash! I echo gfiorelli (It seems like that's all I've been doing this week Gianluca. Will you get out of my head!) I'd love to see a blog post here from you.
Thanks for the kind words! I'll think about something to write :)
Himanshu makes an excellent point regarding crawlability and indexation. Not all SEO value is about ranking correlations. If the page isn't indexed there can be no rankings.
Matt Cutts has recently admitted that "the number of pages that we crawl is roughly proportional to your PageRank," speaking chiefly of the "crawl budget" of large sites and their crawling and indexation issues (a must read). This by itself gives PR significant importance.
As to ranking value, what I get, simplistically, from this excellent post and ensuing discussion, after all the statistical back and forth, is what we have always known (or did we?!): that in general, PR plays some small positive part in rankings that we shouldn't obsess over too much. To me, this study says the same in a more elaborate way, and is valuable as confirmation of that notion. For which, thank you.
Yet more reasons why I'm trying to teach myself about machine learning (thanks to Ben for his help behind the scenes!).
I guess when it's all said and done, it's gonna take us a while to completely wean ourselves off PageRank. It's in our veins!!
I'm not too crazy about statistics, so I'm not going to comment on the figures and all. But bottom line, if anyone asks about the importance of PageRank, I'll just point them here!
Thanks for that Rand.
I believe PR is part of the SEO mix, but not the be-all, end-all metric. It's interesting to hear what others in the SEO industry think about PR - particularly those who discredit PR and think it's worthless. Not to bash on those who think that, but more to learn and understand why they think that way.
"Worthless" is certainly a value judgement. Does PageRank provide utility? Is it useful in some way? Does knowing a URL's PageRank provide actionable information? If the answer to any of these questions is yes, then it is certainly not worthless.
First, great post.
Second: it's often enough to tell your boss/client that PageRank is one of over 200 signals - because many people think that PageRank = SERP rank. I recently came across a blog post stating that "Google says that page speed will influence PageRank"... which of course isn't true.
Third: with your first image (hook it into my veins) I hope you're somehow referring to this simpsons scene ;-)
Yeah. It is very unfortunate that Larry "The Web" Page had to name the algorithm after himself. Laypersons assume that PageRank is equivalent to the rank of your pages.
Thanks - good to know someone understands PR so I don't have to. Is there any truth in the gossip that PageRank can be inherited from a previously defunct URL that has been repurchased?
PR is just one of many factors to estimate a website's value.
I think PR is so popular because it provides a simple measure for ordinary people.
Many people want to know a website's value, but they don't want to - or have no time to - use so many tools to look inside a website.
I realise this might be an obvious question, but with regard to the UK search results, what did you do to eliminate localization as a factor when testing PageRank? Google has pretty clearly said that localization will re-order things based on your location, so to get a relevant result for google.co.uk it would seem reasonable to do the searches from a UK IP (maybe through a VPN) to make sure localization is eliminated as a factor.
Good question.
I made no effort, and I fetched all SERPs from American IPs.
I am not exactly sure of the latest on the work to do IP-geolocation-based personalization, although the impression I had was that it was only going to apply to a pretty limited number of queries such as "pizza" and "restaurant", and I hadn't heard news that it was actually being used yet. I would be interested to hear otherwise.
It is my understanding that when switching to a new domain, PageRank is inherited through a 301 redirect. Does the same apply if the domain is forwarded, as opposed to using a 301?
Thanks,
Dawn
It's funny that PageRank gets easily discounted mostly by those who don't have it and is appreciated more by those who do. Some of my sites rank well with low PR (to me, anything under PR4 is low), and some rank high with high PR (to me, PR7 and up is high).
Despite the above, I will take a non-ranking PR7 site over a ranking PR4 site any day, simply because I can always make it rank. And those who don't know how to take advantage of a high-PR site probably never owned one to begin with. Now, between owning one and managing one for a client, there is a big difference :)
If the objective of your site is a high PR and you consider a PR>7 a success in itself, then you are absolutely right. PR is important and you should spend time doing whatever it takes to rank.
But, if conversions are the objective, then you might want to further 'investigate' whether it always follows that PR7 sites lead to better conversions than PR4 sites.
Absolute dynamite post! The more time I spend on SEOmoz and the more I use the tools... the better an analyst I become. I would like to see how Alexa rankings and SERPs correlate, just for fun!
Your statement about standard errors is only true if Google.com rankings come from a Gaussian distribution -- which I would guess is not the case. Gaussians arise when many independent random processes come together to influence a phenomenon. That sounds like the opposite of Google.
A couple other statistical suggestions for y'all:
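One distribution-free way to get error bars without the Gaussian assumption is the bootstrap. A rough sketch on invented metric/position data (all numbers made up for illustration):

```python
# Bootstrap a confidence interval for Spearman's rho by resampling
# (metric, position) pairs with replacement -- no Gaussian assumption.
import random
from scipy.stats import spearmanr

random.seed(0)
metric   = [5, 3, 8, 2, 7, 4, 6, 1, 9, 2]   # e.g. toolbar PR of ten results
position = [2, 5, 1, 9, 3, 6, 4, 10, 1, 8]  # e.g. their SERP positions

pairs = list(zip(metric, position))
boot = []
for _ in range(2000):
    sample = [random.choice(pairs) for _ in pairs]
    xs, ys = zip(*sample)
    rho, _ = spearmanr(xs, ys)
    boot.append(rho)

boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"95% bootstrap interval for Spearman's rho: [{lo:.2f}, {hi:.2f}]")
```

The width of that interval is a useful reality check on how much (or little) ten data points can actually tell you.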
Great Post!
I've got to say that PageRank has always been really confusing. Sometimes you'll see a PageRank 0 at the top of a highly competitive keyword! Anyway, this article gives me some great information to relay to my clients and blog (Web-Directory-SEO). Great article, and keep them coming!
re Spearman. Yeah! Spearman was developed specifically for correlations among ranked things.
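A toy comparison makes the difference visible: on a perfectly monotonic but non-linear relationship, Pearson's r drops below 1 while Spearman's rho stays at exactly 1 (the numbers are invented):

```python
# Pearson measures linear association; Spearman measures monotonic
# association via ranks, which suits ranked SERP data better.
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 5 for v in x]  # monotonic but strongly non-linear

pearson_r, _  = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print(f"Pearson:  {pearson_r:.3f}")   # noticeably below 1
print(f"Spearman: {spearman_r:.3f}")  # prints 1.000
```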
I bet you used other ranking signals to correlate with the rankings, besides Compete, Alexa, moz, Google toolbar PR & Yahoo inlinks, for your own use.
I would like to know what those signals are - say, page loading time, etc.
I'd like to see all the known signals correlated with ranking and how it all fits together, to give a better picture and to prioritize them for better understanding. Because this math is so fun.
I am tempted to create a sample of my own and compute it using Perl, so I'm curious to know more about the sample you used.
Thanks for the experiment; looking forward to more of these.
Cheers.
Esh
If you do your own experiments, I would be interested to hear the results, and would probably try to convince you to write a YOUmoz post :-)
Ben
Hi everybody,
I just stumbled over this entry and I think there are some mistakes in how you draw your conclusions.
Just calculating a correlation between two values will not be enough. Because as you pointed out, Google uses 200 signals to rank pages.
You would need pages where the other 199 signals are similar to actually measure the correlation between PageRank and rank. If the correlation is then low, your statement is valid.
Here it could be that a page with PR5 is ranked higher than a page with PR8 because all the other signals are better. If all 199 other signals are similar, a page with PR6 will probably outrank a PR5 page.
And if that is the case all the time, then we have a correlation of 1.
Cheers
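The dilution effect being described can be sketched with a quick simulation (all numbers invented): even when a signal always contributes positively to the ranking score, its raw rank correlation looks modest once the other signals vary too.

```python
# Simulate rankings driven by 20 equally weighted signals, then compare
# how one signal vs. the combined score correlates with rank.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_pages, n_signals = 500, 20

signals = rng.random((n_pages, n_signals))
score = signals.sum(axis=1)           # every signal always helps
rank = (-score).argsort().argsort()   # 0 = best-ranked page

rho_one, _ = spearmanr(signals[:, 0], rank)  # a single signal vs. rank
rho_all, _ = spearmanr(score, rank)          # the full score vs. rank
print(f"one signal vs. rank:     {rho_one:.2f}")  # modest
print(f"combined score vs. rank: {rho_all:.2f}")  # -1.00
```

So a low observed correlation for one signal does not by itself show the signal is unimportant; it can simply reflect the variance contributed by everything else.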
Like anything in SEO, it's a long-term project, and attaining a high PR is certainly a goal. As others have stated, Google has 200 variables in its calculations and 12 years spent perfecting its algorithms, with huge and almost unlimited resources, so it's statistically very hard to emulate - but that doesn't mean it's impossible. I found this article very interesting, and the feedback even more so. LT
Thanks for the article.
I think it's really helpful for me.
I think it is important to note that data like Alexa and Compete are directionally inaccurate. Yes, this post is about correlation, not causation, but unlike the other measurements, which are less directly affected by rankings, Alexa and Compete data are very much inflated as sites move up the rankings... i.e., while PageRank might be a cause of good rankings, traffic measurements are a result of good rankings.
Once again, I understand this is about correlation, but I think it is still worth being said. Don't go out and start using Alexa and Compete data to determine whether a domain is worthwhile for acquiring a link.
I think the cause-and-effect relationships between these metrics are probably so nuanced that stating them with too much conviction would be a mistake. In my experience, multicollinearity becomes a big problem when working with multivariate statistics.
Is PageRank a Cause of good rankings or is it just the Result of good rankings, globally speaking? I don't know that's why I'm asking.
I believe it's probably the consequence, not the reason why you will rank better. The association between a good PageRank and a good SERP position won't always be 100%, or even near it: each website ranks better for certain keywords than others, the keywords themselves shift in relevance across the internet as more (or fewer) people search for them, and there are other parameters that PR takes into account... So in the end, PageRank is a big round number that gives you a global estimate of how "important" your site is (but since you have just 10 levels with which to rank every single website on earth, it will be hard for it to go up and down like a yo-yo).
It's so amazing! Thanks again.
These are by far my favourite kind of SEO blog posts, Rand - thanks for writing all of this out! Now, I just need to find the time to read it...
Bookmarked :-)
thanks for the post...very in depth...what do you think Google's replacement will be for PAGE RANK? Maybe a PAGE TRUST?
tony ;~)
But Google must be doing something right, since if I search, for instance, for "Macintosh" in Google.com, the first result has the most PR of anything on the first page of search results (9/10) - Apple's website - and the word "Macintosh" doesn't appear anywhere on Apple's homepage (not even once in the meta tags or page content).
I agree if people say PageRank is biased towards some type of criteria but I disagree to describe it as "useless". Of course, you miss a lot of information, but what could you expect of a single number as a measure of everything inside a website? There will always be a bias related to the relative weight you give each parameter that account for PR, but that is also true to every other kind of ranking or metric.
Google also takes synonyms into account when considering keywords. Since Apple's home page will have a large number of links to the domain it will already have a high (non-toolbar) pagerank. Combine this with "Macintosh" most likely being classed as a synonym for "Apple" and it would perform well on that search, even though the word itself doesn't appear
edit: Also, I think the aim of this article was to point out that using the PageRank published to the Google toolbar as an indicator of SEO success isn't always the best idea, and that there are other metrics which can match it better. This isn't the same as Google's internal algorithm, which takes into account more than just PageRank (which, iirc, is based mainly on links alone, whereas the Google algorithm probably also incorporates keyword density, domain age and a host of other factors).
I agree, PageRank isn't a measure of SEO success, and in fact I'm glad it is so because otherwise it would mean that if you followed every SEO practice, no matter how irrelevant the content on your site would be to others, it would always rank above another site with poorer SEO but much more "acclaimed" content by others.
PR gives you an estimate of "importance" of a website, and this "importance" is defined according to how important people who developed PR think backlinks, metatags and other parameters are, relative to each other, compared to other websites. Since other metrics give these parameters another relative importance, then their result will be different, as would be PR if you changed the algorithm
So it depends on the way you see it, because what is important for people who made PR is probably not the same for people who made seomoz metrics... and probably I tend to have a different definition of important in a certain context and you another one which doesn't match with neither PR nor seomoz metrics.
As you said, the toolbar PR could be old, depending on the last PR update. So what I'm missing is: how old is the toolbar PR you tested? The correlation should be much higher a few days after an update than some weeks later.
In the "Google model in my head", PageRank is not directly correlated with the SERPs. But there is a very clear correlation between PageRank and link power. That means a site that is linked to by high-PR sites has a better chance of ranking well. Often, after a while, this effect shows up in the visible PageRank - but it can take some months.
So - having PageRank is not important for a site's own ranking, but it is for all the sites it links to.
Thanks for the post :-)
And isn't this the main marketing claim of sooo many scummy directories? Add your site to improve your PR
Yes, definitely. I've even seen ones named things like free pr web directory, seo friendly directory etc. and they charge people to add their sites, scummy is a nice way of putting it!
Firstly, as pointed out in the article, it's very difficult to know when the updates are going to happen. As such, waiting until after one to make the comparison doesn't simulate the situation most SEOs would be in when making use of it.
To answer the question though: Wikipedia says April 3rd was the last update, so this test will be working with fairly up-to-date numbers.
I fetched all of the values after Sunday, so like SolidSquid notes, it would be using the values updated on April 3rd.
It is a good point: if we had access to Google's internal PR numbers, which are fresher and have more resolution, PR would almost certainly perform a bit better.
First: great insight
Second: I have to read it again more quietly to really take in all the info you gave.
Third: you say > The next time your boss or client asks you about increasing their PageRank; show them this chart.
I'd love to (and I surely will), but the effectiveness could be weaker, since the data sets are about the US and UK. Is it too much to ask you to follow up on this analysis and check out other regional Googles too - for instance the Italian one (or whichever ones would be useful according to the origin of SEOmoz's visits)?
P.S.: and I have to find my own Ben Hendrickson to deal with the math formulas... I was practically subscribed to an F in Math at school.
I'll be your very own Ben if you're hiring ;-)
But...that's impossible. You're a Sean.
That's never stopped me before..
I checked my email, and Sean is actually the one who recommended to Danny, who then recommended to me, using Spearman's instead of Pearson's for this sort of analysis. I edited the post above to note this!
So I think it is only fair to conclude he is a bit better at this stuff than I am!
The compulsion to check my PageRank is similar to the compulsion to Google myself. When I Google myself, it doesn't mean I'm going to show up for anything anyone actually searches ...it just helps me sleep better. It's kind of the SEO equivalent to checking my locks before going to bed.
"The data here is especially interesting. Yahoo!'s link count is a good deal better than Google's PageRank in correlating with Google's own search results!"
Is this about the link-queries or linkdomain-queries? (AKA links to the page or links to the domain?)
Also, what I would really like to see is how rankings correlate with the age of the cache date, which in my opinion is a better indicator than PageRank of how much authority (in the eyes of Google) a page has.
Comparing PageRank to Yahoo's link counts we used the number of external links reported by yahoo site explorer to the URL.
Later in the post, when comparing domain metrics like PageRank of the homepage and our Domain Authority score, we used the total number of external links to the domain.
I didn't check cache date. That would be interesting to look at.
Great post - I found it quite hard to follow, as correlations and maths really aren't a strong point, but the depth and explanations were really useful - and I appreciate the transparency too :)
But above all, you (SEOmoz) even show us PR, e.g. on https://www.seomoz.org/directories --> this "seduces" us (at least me) into looking at PR :-)
Petra
Fair point - we're planning to update that list in the near future, so that will be a good incentive to drop PR and put in DA/PA.
A fascinating post (and responses) despite my complete lack of understanding of the statistical methods discussed.
My understanding of Google PR is that it is not intended to have a bearing on a page's ability to rank for any given keyword. If it means anything at all, it is an indication of the value that a link from that page would pass on to another page.
A page that has been highly optimised for a specific keyword, with the same keyword in the anchor text of links into that page will always rank better than a poorly optimised page but with a high PR.
For me, PageRank is only a broad indication of the value of a link from that page. The PageRank of the pages on my clients' sites is only of interest to me in terms of the link juice they may pass internally.
I always hear people saying pagerank means nothing at all anymore - and I always knew it meant a little something - thanks for the data to prove it :)
Just yesterday I had to educate a client about PageRank and Google Search Rankings. I wish I had read this post earlier to show him the hard data. Thanks for the post.
Thanks for looking into using Spearman's correlation guys. I'm glad to see it was useful.
Because I'm a research fanatic, I would love to see a follow-up exploring these results further. If it were me, I would ask:
I believe the answers to these questions would give a great deal of insight into the strengths and limitations of measures like PageRank.
Thanks for the post.
Now, at least when clients ask about PageRank, I can show them these details and tell them to focus more on the content of the website and do some quality link building - then the PageRank will take care of itself.
Bharati Ahuja
I have been looking for a post like this~!!!
Thanks!!
The real and simple objective of any site (or at least commercial sites) is not to achieve a PageRank of 8, 9 or 10; it is to convert visitors into customers (i.e. $$), period.
Yes, a high PageRank usually means a site and its content are important, and it will get more traffic. But if all that traffic is not contributing to the site's goals, then PageRank 'matters not.'
In addition to this very useful post, which sheds light on PageRank's usefulness, managers would probably love to dig one step deeper and see whether a correlation between PageRank and conversions exists.
In other words, can we establish that high PageRank means high ROI? If so, then PageRank should be considered a 'big deal' by both SEOs and Google.
Put blatantly, I will pay attention to PageRank if I can say to my manager: "Look Jim, our revenues doubled because our PageRank went from 3 to 7 this quarter."
Ah, but correlation does not necessarily mean causation.
I agree.
But, what is the practical ultimate objective of looking at any metric?
In my case, I look at how that metric is impacting my bottom line (i.e. conversions).