This is my first YOUmoz post, and I would greatly appreciate your feedback. I will be actively responding to comments, and I know that we will get a great discussion going. Please comment with any critique, questions, or random thoughts that you may have. If you would rather skip the statistics, feel free to jump ahead to the discussion section.
Introduction
A couple of months ago, SEOmoz explored the relationship between a web page's PageRank and its position in search results. They concluded:
Google's PageRank is, indeed, slightly correlated with their rankings (as well as with the rankings of other major search engines). However, other page-level metrics are dramatically better, including link counts from Yahoo and Page Authority.
I was intrigued by the study, and vowed to investigate the metric using my own data set. Because all of my data are at the root domain level, I chose to focus on the homepage PageRank of each domain.
Methods
I averaged three months of data (November, 2009 - January, 2010), collected on the last day of each month for 1,316 root domains. Using Quantcast Media Planner, I selected websites that had chosen to make their traffic data public. To be included, websites had to have an average of at least 100,000 unique US visitors during this time period.
The domains selected for this study do not approximate a random sample of websites. Because of the way in which they were selected, the sample is biased toward sites with many US visitors and against sites with very few. There may also be differences between Quantified sites with public traffic data and non-Quantified websites. For example, Quantified domains are probably more likely to include advertising on their pages than sites without the Quantcast script.
PageRank
PageRank (PR) can only take eleven values (0-10). It is an ordinal variable, meaning that the difference between PR = 8 and PR = 9 is not the same as the difference between PR = 3 and PR = 4. Like mozRank, it probably exists on a log scale.
The median and mode PageRank among websites in this study were PR = 6, with a minimum of PR = 0, and a maximum of PR = 9. However, only ten websites had PR < 3, and only seven had PR = 9.
Results
SEOmoz Metrics
Using Spearman's correlation coefficient, I compared PageRank to several SEOmoz root domain metrics. Domain mozRank (linearized) was strongly correlated with PR (r = 0.62)*. This correlation was somewhat smaller than the 0.71 that SEOmoz reported in May, 2009. The disparity may be due to differences in methodology; SEOmoz used Pearson's correlation coefficient, and did not linearize mozRank. Additionally, PR data in my study were probably measured over a smaller range of values, potentially weakening the observed dependencies.
*All reported correlations are significant at p < .01.
MozTrust was also highly correlated with PageRank (r = .62), with Domain Authority somewhat less so (r = .55). The latter has since undergone some major changes, and this result may not reflect the metric as it exists today.
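To make the Spearman versus Pearson distinction concrete, here is a minimal sketch in plain Python. It is not the code used for this study, and the numbers are made up for illustration. Spearman's coefficient is simply Pearson's coefficient computed on ranks, with tied values sharing the average of their rank positions, which matters for a metric like PR that has only eleven possible values.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient applied to ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only (not data from the study).
toolbar_pr = [3, 4, 4, 5, 6, 6, 7, 8]
linear_score = [10, 30, 25, 90, 400, 350, 2000, 9000]

rho = spearman(toolbar_pr, linear_score)
r = pearson(toolbar_pr, linear_score)
```

With values like these, where one variable grows roughly exponentially relative to the other, the rank-based coefficient comes out higher than Pearson's, because ranking removes the effect of the skewed scale.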
Search Engine Indexing
I performed [site:example.com] queries using Google, Yahoo, and Bing APIs to approximate the number of pages indexed by each search engine. Much to my surprise, PageRank shared the strongest correlation with the number of pages indexed by Bing (r = .52), instead of Google (r = .30), or Yahoo (r = .24). My first thought was that Google might not have reported accurate counts, a phenomenon often noted by SEO professionals. However, there is some evidence that may indicate otherwise.
If Google's reported indexation numbers are inaccurate, we would expect the metric to have lower correlations with similar metrics. However, indexation numbers reported by Google and Yahoo share a fairly high Pearson's correlation coefficient (r = 0.38). Both appear to share smaller correlations with Bing: 0.34, and 0.26 respectively. Even more interesting, SEOmoz metrics seem to have much stronger correlations with Bing's indexed pages than the numbers reported by Google or Yahoo.
If Google is failing to accurately report the size of its index, we might expect that similar queries would also return inaccurate data. However, PageRank shares a high Spearman's correlation coefficient with the number of results returned by a Google [link:example.com] query (r = 0.65). The strength of this relationship appears similar to those between SEOmoz metrics and PR mentioned earlier. PR's correlation with the results of a Yahoo [linkdomain:example.com -site:example.com] query is somewhat smaller (r = 0.53).
If the number of pages Google reports having indexed is a relatively poor metric, we would also expect to find more variation between months than other search engines. However, I did not find this to be the case. In fact, Bing had by far the highest average percent change in the number of pages indexed, a whopping 355% increase per month. Google averaged an increase of 61%, and Yahoo an increase of only 2%.
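The post does not spell out how the average percent change was calculated. One plausible reading, sketched below under that assumption, is to compute each domain's month-over-month percent change and then average across all domains and months. The domain names and counts are hypothetical.

```python
def percent_changes(counts):
    """Month-over-month percent change for one domain's indexed-page counts.

    Assumes every earlier count is nonzero.
    """
    return [
        (later - earlier) / earlier * 100
        for earlier, later in zip(counts, counts[1:])
    ]


def mean_percent_change(domains):
    """Average of all month-over-month percent changes across all domains."""
    changes = [c for counts in domains.values() for c in percent_changes(counts)]
    return sum(changes) / len(changes)


# Hypothetical indexed-page counts for November, December, and January.
hypothetical_counts = {
    "a.example": [1000, 1100, 1210],
    "b.example": [200, 2000, 2200],
}

average_change = mean_percent_change(hypothetical_counts)
```

Note that an average computed this way is sensitive to outliers: in the made-up data, a single tenfold jump for one domain pulls the overall mean far above the 10% change seen in every other month.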
While it is still possible that the number of pages on each domain that Google reports to have indexed is inaccurate, I see another potential explanation. More so than with Yahoo or Google, the number of pages that Bing will index on any given domain is related to the quantity and quality of links to that domain. Perhaps, at least when it comes to indexation, Bing follows more of a traditional PageRank-like algorithm. After all, Google claims that PR is only one of more than 200 signals used for ranking pages. This theory is supported by the results of SEOmoz's comparison of Google's and Bing's ranking factors.
Social Media
PageRank even shares fairly strong correlations with social media metrics such as how many of a domain's pages are saved on Delicious (r = 0.49), how many stories it has on Digg (r = 0.38), and even the number of Tweets linking to one of its pages as measured by Topsy (r = .38).
Website Traffic
Last, but certainly not least, PageRank predicts website traffic with somewhat surprising strength. As reported by Quantcast, monthly page views, visits, and unique visitors are all significantly correlated with PR. Google's little green bar even correlates with visits per unique visitor (r = 0.18), but not page views per visit. However, putting this in context shows the value of a metric like Domain Authority.
So what exactly does all of this mean, and why is it important?
First, despite being a page-level metric, homepage PageRank is actually a fairly good predictor of many important domain-level variables relevant to SEO, social media, and website traffic.
For instance, on average, websites with a PR = 7 homepage had 2.6 times as many unique visitors as those with a PR = 6 homepage, which in turn had 1.5 times as many unique visitors as those with a PR = 5 homepage.
Second, homepage PageRank is sometimes used as a proxy for a hypothetical “domain PageRank.” While technically inaccurate, this study supports the idea that the PR of a website's homepage provides information about the domain as a whole.
While it may be limited to just eleven possible values, PR is surprisingly good at predicting the relative number of inbound links to a domain reported by Google and Yahoo, as well as the relative number of pages indexed by Bing. The key word here is “relative.” As an ordinal variable, PR cannot be used to predict the actual values of continuous variables.
Finally, this study provides evidence that SEOmoz's domain-level metrics may be good (and possibly better than PageRank) predictors of variables important to search, social media, and web analytics. This, as well as all of the results of this study, should be interpreted within the context of the included domains (high-traffic, US-centric, and publicly Quantified).
I hope you enjoyed reading my post, because I certainly enjoyed writing it. I intend to write many more based on your feedback. If you found this post interesting or valuable, I would greatly appreciate your thumbs up by clicking the icon below.
Wow, extremely thorough post! Lots of great data and it seems like you set up your test very well.
I think the important takeaway here is that PageRank is still a pretty decent stat to use when you need to quickly judge the approximate "value" of a page or domain. I most commonly use it when link-building: Does this page have PR? If so, how much? The more PR, the more valuable the link (assuming relevancy and other basics). If I'm sifting through a lot of pages, I won't want to look up tons of metrics, but a quick look at my toolbar can help me decide if this link opportunity is worth pursuing or not.
The other area where PR is still important (again, IMO), is for bloggers seeking to gain advertisers. Advertisers will be more likely to buy a banner ad on a blog if the blog's homepage has decent PR, and usually more willing to pay $$ if the page has good PR.
It seems like your data at least partially validates the use of PR for either of these purposes (although MozRank & Page Authority may be better). All three metrics are ways to instantly gauge the ballpark value of a website or webpage.
I totally agree with your takeaway Nick. It seems like PageRank is sometimes scoffed at by SEOs, and even referred to as "meaningless." It might not be the best metric around, but it is a quick and dirty way to evaluate a domain's value. Even if PageRank wasn't a good predictor of anything useful, the fact that people still talk about it means that it can't be ignored.
I agree. If PR was "meaningless" then why would Google spend money and resources to calculate it? To mess with SEOs? I doubt that. I notice that PRs tend to be very similar among competitors of a single keyword phrase.
Great post! I'll have to read it slowly in the morning... it has been a long time since stats class!
This is awesome stuff Sean! I love what you've done with the correlation numbers around PR.
One quick question - did you consider looking at the correlation of other variables/metrics with Quantcast visitor numbers? I'd be super interested to see if PR is indeed the best one for determining the overall popularity of a site.
What this data certainly suggests to me is that homepage PR may be a reasonable metric for observing the overall performance of a domain, though as you noted, others might be even better.
BTW - Huge thumbs up for the contribution and congrats on the first YOUmoz post!
Thanks for the kind words Rand. I have actually been gathering data about these domains from a variety of sources for about a year now. The study began as an evaluation of tools like Alexa, Compete, and Google (now DoubleClick) Ad Planner. I ended up including a variety of other data related to SEO and social media. I haven't reported the results yet for several reasons, primarily because I try my best to be very thorough in these things. As I'm sure the last couple of weeks have shown, research receives a lot of scrutiny, especially when conducted in big industries like SEO and audience measurement.
Anecdotally, I can tell you that for predicting traffic, the quantity of a domain's pages saved on Delicious by far outshined the other social media metrics. Interestingly, predicting website "use" (i.e. pages per visit, visits per unique visitor), is rather difficult. The only metric that really held up was the number of phrase-match searches for a domain (i.e. "example.com") per month (as reported by AdWords).
Interesting - I wouldn't have thought delicious bookmarks would have high correlation with traffic, but I guess it makes sense. Makes me think that the number of pages getting some high volume of tweeted links/references might also be a good indicator.
Phrase-match domain name also makes a lot of sense - good thinking to add that to the mix!
Great post,
I do agree that PR is a good enough indicator for link-building, and your study does a good job of confirming it.
I think PR is easy and always at hand for quick reference, without additional work. Maybe this is why, as you say, some SEOs refer to it as 'meaningless': just to overcomplicate their work, so no one else will figure out the 'secret' to what they do at work :-)))
Telling your client that PageRank is somehow beneath you does add some "mystique" to what you do.
Kudos on the post. But I think SEOmoz is getting too worried about statistics lately (probably related to a certain blog post elsewhere?).
Don't get me wrong, I think analysis like this is interesting, but I do believe it is coming from the wrong place. SEOmoz is a great site because it can educate and develop your thirst for SEO (and SEM in general), and when we get too worried to be "accurate" and "scientific" we lose our grip on what really drives a good SEO (or online marketer in general): passion.
Just my opinion, as always. :)
And a very valuable opinion! Science can only take us so far in understanding the world. Especially for an industry like SEO, the real question is: "is this information actionable?" Fortunately, I have a passion for research, so I like to think that I can help fill the important role of questioning our long-held assumptions like "PageRank is meaningless." Thanks again for your feedback!
Statistics don't replace passion. Statistics validate passion, or redirect it. SEO is a results driven industry, maybe to a fault (which is why black hat practices are so widespread.) All statistics do is help you figure out which ideas you are passionate about can really drive results, and which unfortunately do not, so that as a community we can all further our practices and achieve more.
I wondered when this question would come up. :) We've definitely had a lot of posts about statistics lately. This came through YOUmoz though, and it did really well. I didn't want to NOT promote it to the main blog only because we've recently been talking a lot about statistics. Sean wrote a great post on his own, so I wanted to let it shine a bit. It was well deserved.
Awesome analysis. I'm glad you chose to look at PR for the homepage and I agree that this study supports the idea that homepage PR provides information about the domain as a whole.
Knowing that PageRank flows to other pages of the site predominantly through the homepage, this article should help reinforce the importance of having a site architecture that complements a solid internal linking scheme. - Sean Rusinko
Excellent insight about internal linking Sean. Especially for large sites, making sure that homepage PageRank is flowing to other key pages is paramount.
Ok I get that Authority & Link Counts from Yahoo! (which I assume means the number of incoming links?) might be better indicators here, but what interests me is what's the best trait to look for on an incoming link to *me*. Is it still possibly Authority > PR ? Or another metric yet?
I know this might sound like a beginner question, but I've never asked anyone on this site.
My SEO guy has one of the best business models in the world, if not the best. I found him because he is top 3, and used to have a double there. He has explained most of what he knows in order to help me start getting my conglomerate of commerce websites to the top, but most people aren't absolutely clear on what, other than PR, constitutes the best inbound link. All I know for sure is that high-PR sites that decide to link to me seem to make a positive impact. For my part I *do* go around posting and building relationships with all the relevant consumer and tech-savvy forums, social, and blog groups out there. And I can tell you there are thousands.
I guess all these statistics really forced me to examine Authority for the first time, and for that I would like to thank you, Sean!
That is a great question, and one to which I do not think we will ever have a definitive answer. The best one that I can give is that this study supports the idea that homepage PageRank may not be a bad indicator of something like link value. As to the best metric, determining it with a decent level of certainty would require operationally defining "link value," and conducting a series of experiments to find out.
Unfortunately, such experiments are a little outside of my available resources at the moment.
PageRank is still the main algorithm of Google. But the PageRank bar (the one everybody knows), with its ranking from 0 to 10, is not very accurate.
I agree with you. Moreover, the higher the PageRank, the less accurate the bar is.
Great post Sean.
I have been doing SEO for 15 years and a lot of that went right over my head. :)
I have some questions that I hope can be answered.
I am going to use some stats from this month from 2 of my PR4 websites.
Site #1 - PR4, 1077 unique, 331 links. First online in Feb this year.
Site #2 - PR4, 31733 unique, 1746 links. Been online since the internet was only 5 computers (AOL, Delphi, Compuserve, Playnet and Prodigy). Every sort of linking campaign has been used in the 15 years of marketing. Organic linking is currently 99% of it. Users make tutorials using our software and link to us. I have records of over 5000 linking sites.
What I find interesting is that just about 100% of the 331 links on site 1 were placed on PR0 pages.
If the scale on which PR is (was?) calculated is logarithmic, as the industry generally assumes, the discrepancy is huge. If I go and complicate matters by throwing in a PR3 website, things really go out of kilter.
Site #3 - PR3, 6978 unique, 410 links.
What I think I am seeing is the method of calculation has changed. No longer based on a mathematically calculated formula, PR has been brought in line with relevance. Although all the links I built were on PR0 pages, the content was relevant to my linked pages.
What are your thoughts?
Question #2: while I have seen a lot of opinion stating that linking is a good thing for SEO, I have to ask, “How?” Has someone done a study that shows conclusively that placing links, and only placing links, has moved any of a website’s indexed pages even one notch up? On https://www.seomoz.org/blog/the-science-of-ranking-correlations there is a heading, “How Well Does PageRank Correlate to Rankings?” I don’t think this is the right question. I think it should be, “How much does Page Rank influence Rankings?”
I have several PR0 websites that pretty much dominate their market in search. Granted it is long tail mostly and VERY niche market, but if anyone searched for their terms they are right in the top.
I have seen pages with only ONE link beat PR2 and 3 sites. I have also seen lower pr sites beat higher PR sites in the index. I also saw my PR go from PR0 on Wed evening to PR4 the next morning. Not one of my indexed rankings changed position.
Google says a bunch of stuff about PR which I think is just blowing smoke. IMO the difference between the hidden Google PR and the PR shown in the toolbar is only a matter of decimal points. If on update day your site scored an increase to, say, PR3, you would not know if the site was actually PR3.2 or PR3.9, only that it was in the PR3 range. I don’t think Google would keep 2 separate databases, one for the public and one for internal use. Up to now (and I think this might be changing), PR was only calculated about 4 times a year. On the day my site jumped to 4, none of my friends and acquaintances noticed a change in their sites, so I would assume it was a localized calculation. Perhaps keeping in line with the Caffeine processing changes?
Enuff for a first post.
best,
Reg
Really interesting data, Sean. Nice work.
I'm curious about a follow-up (don't know how hard this would be in your data-set) - in your data, can you calculate the Pearson's correlation coefficient between mR and PR? I'd like to see how different it is (or isn't).
PR is tough, because it's a discrete variable that theoretically represents a continuous value that Google won't let us see. We also know that it's logarithmic, but we don't know what the power function is.
Do you mean calculating Pearson's between "pretty" mozRank and PageRank, linear mozRank and PageRank, or applying a transformation to linearize PageRank?
I was thinking whatever would be the closest approximation to your methodology, just to keep it easy. I wasn't 100% clear on the differences between your approach and the original study, and I was just curious how much impact the Pearson vs. Spearman calculations made.
Pearson correlation between linearized mozRank and linearized PageRank is 0.63. PageRank was linearized as 2.17^PR (that is, using a base of 2.17).
I'm fairly certain that the correlation was smaller than the one you found because I looked at a smaller range of data. Most of my PageRank values were between 3 and 8. It's possible that all of the correlations that I reported would be higher if I increased the range.
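For anyone who wants to try the linearization mentioned above, here is a rough sketch. Only the base of 2.17 comes from this thread. The PR values and the "linear metric" are invented for illustration, and the helper names are my own.

```python
from math import sqrt

BASE = 2.17  # base quoted in the comment above


def linearize(pr, base=BASE):
    """Map a log-scale toolbar PR value onto a linear scale as base ** pr."""
    return base ** pr


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical data: PR values and some metric already on a linear scale.
raw_pr = [3, 4, 5, 6, 7, 8]
linear_metric = [12, 20, 50, 100, 230, 480]

linear_pr = [linearize(p) for p in raw_pr]

r_raw = pearson(raw_pr, linear_metric)
r_linearized = pearson(linear_pr, linear_metric)
```

Because Pearson's coefficient measures linear association, putting both variables on the same kind of scale tends to raise it. Spearman's coefficient would be unchanged by this transformation, since raising a positive base to the PR value preserves the ordering.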
Very interesting - thanks. I'm always curious about the theoretical vs. practical impact of certain approaches. Sometimes, I find we argue about the details, only to realize that the upshot is fairly small. When in doubt, I like to see the numbers.
Thanks for bringing up the importance of practical implications. Because Google's ranking algorithm includes 200+ variables, analyzing individual correlations isn't likely to make much of a dent in explaining all of its variance. We would need to do a principal component analysis to be able to explain a significant portion of it. Of course, that raises the thorny issue of collinearity, which I suspect might be problematic for any models we created.
Yeah, every time I even ponder multivariate work, the complications just explode. There are way too many dependencies in the major ranking factors. All of these debates have me digging back into my regression textbooks, though - it's been a while.
I know how you feel. I've been scouring through a few stats textbooks that I haven't looked at in a couple of years to make sure I wasn't talking out of my rear. Sometimes it's easier to keep things "simple" by looking at only two variables. Of course, that unfortunately reduces the practicality of your results. I can only imagine what you guys must go through while trying to model search algorithms.
SeanWF--
Most of this stuff went right over my head, but I did enjoy the way you structured the post, i.e. charts, which I love because I'm a geek like that. ;)
It was very well-thought-out and organized. I can tell you spent a lot of time on this, and from what I've read in the comments, it was well worth it because it made a lot of sense to others and more importantly, it's a useful analysis.
I'll have to go back and re-read it a couple of times. (Mr. Math and I have a love/hate relationship.) Good job, and I hope to see more in the future!
Summer :)
Thanks for the feedback Summer. I think that a post called "Statistics 101 for SEOs" or something similar needs to be written. I'm not sure if I'm the guy for the job, but I'm seeing a lot of people requesting such an article. If you have any questions, feel free to message me on SEOmoz, or on Twitter (@SeanWF).
I would be *very interested in reading a Stats 101 kind of post. I think it would give me a good background on some of the stuff I keep reading about but don't understand yet. If it were written for the visual learner (colors, charts, etc.) then I'd probably understand it more. I *knew I should have paid attention in some of my high school math classes! You usually don't learn much about statistics when you're an English major and have a minor in Fine Arts! ;)
And as soon as I can figure out why Twitter isn't letting me into my account, I'll be sure to follow you! :)
Sean
Great job. Although I might quarrel over what you call large correlations, overall it's a good read. Proud of you.
Thanks,
For such high value metrics, I'd argue anything in the 0.40 to 0.60 range is pretty impressive.
Great Post Sean.
Goodnewscowboy has stolen my comment, but... even though my brain is smoking too and I need to reread your post calmly, I want to tell you that you wrote it in a way that gave me no temptation to escape.
Your reflections are really interesting and somehow give an order to the many assumptions we make on the basis of practice.
Thanks
Thanks Gianluca, I've really enjoyed reading your posts as well. There's a lot to take in, and a lot of it is definitely open to interpretation. Given what you've seen, what other SEO assumptions do you think deserve investigating?
Thank you Sean...
What other SEO assumptions... first, let me blame you for leaving your blog quite to itself, as I'd love to read more from you (it's not a reproach, simply a warm invitation; you have things to tell us in your fingers).
And one piece of statistical research I'd like to see one day is whether correlations exist between high social web activity and the link graph. I mean, facts that can prove that the social graph and the link graph are somehow correlated. If not, why? If yes, why again... and whether an evolution through time can be shown (therefore research that could be repeated yearly or every six months).
Haha, my blog is in desperate need of updating. I've got something coming up soon that everyone here should find very interesting. However, I'm considering taking a full-time search marketing internship with Disney. If that happens, I probably won't be blogging a whole lot. It's a tough decision..
I think I see what you're getting at with the link graph/social graph connection, but I'm not entirely sure. What do you mean by "web activities," and how could I measure it and compare it to the link graph?
By social web activities I mean freshness: on Facebook, for instance, the reposting via RSS of external articles from blogs, or the links, or the status notes. But also fan numbers (is there a correlation between fan numbers and web popularity?)
From Twitter... tweets and retweets with shortened links, follower numbers.
The how... well, to be totally honest, right now I have no clue how it could be possible to compare the two (social and link graph). But I suspect that some sort of correlation does exist. As you can see, I am in an "intuitional" phase, which is why I so look forward to posts like this one you wrote.
Ah... about Disney: shit shit shit, as the actors says before going on stage to call the good luck.
Interesting idea. I do have some API queries set up that can gather tweets, Facebook sharing/liking, and data from a few other social media platforms at the page level. I've mostly been working with domain-level stuff at this point. The place where I'm stuck is figuring out what I should use for a "seed set" of pages. They need to have a common theme, or I at least need to be able to give some sort of justification for why I chose them.
I think, but it's just my opinion, that a good seed could be the touristic one... but only hotels (or b&b) websites* with their corresponding pages and twitter account (all have both). And in order to have a small but consistent universe, I would choose a secondary touristic place (therefore no NY or Niagara Falls kind of location).
* Portals not... they're the wikipedia of tourism sector. Even if Yelp could be included as a mix between social and "classic" website alternative.
The only way I could ever beat you to the punch G. is when the World Cup is in full swing. So I'm milkin' it for all it's worth ;)
Gotcha... but the real reason why mine is one of the first comments (or before yours) is simply timezone... Jennita usually schedules the posts to go live when I'm obliged to wake up by my children... therefore I got used to checking whether there are new posts on SEOmoz around my 6/7 am (9/10 pm Seattle, but of the previous day than mine).
But, yes, use the wc2010 advantage for another three weeks ;)
OK, so what you're saying is that I have three more weeks to go rent some kids to have a hope of beating you...no, wait a minute. I don't live in Spain...I need to become a party animal so I can come in late every night.
Something tells me this is gonna be a hard sell to my wife...
AH! It's not my intention to cause family problems from some K miles from you.
Just take advantage of the fact that the World Cup has me with one eye on the TV screen and the other working... yes, I am getting cross-eyed.
Wow, simply too many comments to read, so not sure if someone brought this up or not, but there is a big, big problem with this statistical analysis. Namely, a pearson or in this case spearman is used to find correlations between variables of scale. Self admittedly, pagerank is a nominal variable, not one of scale.
Something like age is scale. The difference between 1 and 2 is the same as 20 and 21. In theory it can go infinitely high. Just like height, or weight. These are variables of scale. Pagerank is limited to 0-10.
When comparing a nominal or ordinal level variable to a scale variable you run a test to compare means. If the nominal variable contains only two values, you use an independent sample t-test. If it contains more than two values, you use an ANOVA.
What these tests tell us is sites with PR=0 have average rank of this, average visitors of that, vs. pr=6 sites which have an average rank of this and average visitors of that. The test then tells us if those differences between the means are statistically significant.
This would be a much more appropriate test for this kind of data. Any way the data can be made available to the public so I can give it a whirl? Or is it in the post and I just missed it?
I may be mistaken, but Spearman's correlation coefficient can be used to measure the relationship between an interval-level variable (i.e. unique visitors) and an ordinal variable (which PageRank is). In fact, the reason to use Spearman's over Pearson's is when comparing two ordinal variables or an ordinal and interval variable. If PageRank were in fact a nominal variable, then you would be absolutely correct in using a one-way ANOVA over a correlation statistic. However, if we assume that PageRank is inherently ordinal, Spearman's coefficient is a good fit.
While I will not be making my data public (at this time), the study is easily replicable using the methods I described in the post. If you'd like some suggestions for tools that will easily automate the data-collection process, send me a private message. Also, if there are any additional analyses that you'd like to see me perform, I'd be happy to report them here.
Sort of correct. Joreskog and Sorbom determined that for an ordinal variable to be considered continuous and thus used with correlation data, it must have 15 or more orderings. This is why I believe the ordinal measurement should be treated as nominal in this instance. But you are correct in that if there were 15 or more, a Spearman would be just fine. But isn't it only with another nominal variable? I thought Spearman would correlate nominal variables together, not one nominal and one scale. I can't remember on that one, I would have to look it up.
I'll send a private message for some of those tools, very interested in learning more about them.
Hadn't heard about the need for a minimum of 15 rankings. I'll have to look into that. I wasn't under the impression that correlations required that variables had to be continuous. Do you happen to have a source for that? Spearman's correlation coefficient transforms any interval variable's values into ranks, thus the only requirement is that both variables are at least measurable at an ordinal level.
Yeah it was something called the Monte Carlo studies or something. It was in my college textbook. Not sure on a source, would have to look it up, I just remember 15 was the number. After looking it up it appears a Spearman could measure correlation between any two types of variables, so right on there.
I guess for me I would still like to see ANOVAs with the data. What are the average visitors for each PageRank? How do they compare? Now that would be really interesting to me.
Well actually, the very last graph in this post and its preceding paragraph give an idea of that. I haven't run an ANOVA and something like Tukey's HSD to test significant differences, but hypothetically, the eta-squared value from the ANOVA should roughly equal the coefficient of determination from Spearman's. The downside of using an ANOVA is that by treating PageRank as a nominal variable, you lose important information.
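For readers following this exchange, here is a small sketch of the comparison-of-means approach being discussed: average visitors per PR level, plus eta-squared, the share of total variance explained by group membership (SS_between / SS_total). The numbers are invented; none of this is the study's data or code, and it stops short of a significance test.

```python
from collections import defaultdict


def group_means(groups, values):
    """Mean of `values` within each group label (e.g. each PR level)."""
    buckets = defaultdict(list)
    for g, v in zip(groups, values):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}


def eta_squared(groups, values):
    """Effect size for a one-way ANOVA: SS_between / SS_total."""
    grand_mean = sum(values) / len(values)
    means = group_means(groups, values)
    # Between-group sum of squares: each observation contributes the squared
    # distance of its group mean from the grand mean.
    ss_between = sum((means[g] - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total


# Hypothetical homepage PR values and unique visitors (in thousands).
pr_levels = [5, 5, 6, 6, 7, 7]
visitors = [100, 140, 150, 210, 400, 520]

means_by_pr = group_means(pr_levels, visitors)
effect_size = eta_squared(pr_levels, visitors)
```

Note that this treats each PR level as an unordered label, which is exactly the loss of information mentioned above: the calculation would give the same eta-squared if the PR labels were shuffled among the groups.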
nawabz, did you really just try to drop a link into the comments of this fabulous post? I find it highly offensive. Can SEOmoz remove this post??
Sean, wow, what a well-researched post. It is good to see someone putting time and effort into PageRank analysis. So many SEOs completely dismiss PR as irrelevant; I think you have just proved them wrong.
I think PR is useful as an indicator to help non-SEOs understand the concept of site authority etc.
Thank you for putting so much effort into helping us mere-mortals understand PR further :)
Thanks Danni,
That is such a great point about PageRank helping non-SEOs; I hadn't even thought of that! Especially when working with clients, being able to point to an "official" Google metric measuring site authority is an excellent way to teach them the ropes. From there, it's an easy transition to something like mozRank.
P.S.
I don't see a link in your profile. Do you have a website I can check out?
Sean,
I am not an SEO consultant. But I am a big SEO advocate. So I have a number of people that come to me and ask for SEO advice. Frankly, one of the biggest signs that a company needs SEO help is when they have a PR=0. I spoke with a fairly large Orlando pest control company just a couple of days ago. Large enough to be in the nation's top 100 list. I was totally floored when I saw a PR=0 on their site. Needless to say, they had some duplicate content issues and a few other SEO problems....
...Sorry for the long extra side note, MY POINT:
-YES. You should definitely use PR as a marketing tool when approaching new clients. As you said, it's like a 3rd party score card that says, "Your site needs SEO".
Business people understand when they are being marketed to. Having a third party opinion, i.e. Google, agreeing with your pitch is powerful.
Very, very interesting Sean. Thanks a lot for these stats!
I firmly believe in Page Rank, as part of my job entails submitting information for various clients' websites to certain directories. Admittedly there are hundreds, possibly thousands, of directories, but we do have to make sure we are using the most prominent sites: the higher the Page Rank, the better. When our clients are listed on these sites and rank highly on the search engines, this ultimately helps their overall SEO strategy. In short, Page Rank is key.
Sean, very good post, man! I really like the statistics, and you have good skills in interpreting them. Correlation is something that a lot of people ignore, and it can give you information like what you posted here.
Thanks!
Excellent article, and definitely agree with the idea Google would not spend resources on something that has no value.
Love how you have analysed a comprehensive amount of data to back yourself up. Thank you. LT
Just stumbled upon this old article, but couldn't resist adding my 2 cents.
I think that the more popular the keywords, the more important PR becomes. Top sites that rank high on very popular keywords are usually fairly SEO-optimized. There probably isn't that much to gain anymore SEO-wise, so PR becomes more important. However, on long-tail keywords, optimizing the pages SEO-wise may have significantly more impact than PR. This is consistent with Reg NBS-SEO's findings.
As far as the statistics go, I believe that it is very hard (if not impossible) to provide accurate, valid samples for these kinds of analyses, and it would need a team of SEO experts and experienced statisticians to analyse this data and provide useful and valid conclusions. I'm not saying this article is per se invalid or ill-founded, but the way in which it is written and in which the experiment is set up doesn't indicate an analysis as mentioned above.
For instance, the analysis (I'm referring to the previous article now*) would probably show a very different outcome if less popular keywords (i.e. long tail) were used.
*My comment is based on both this article and its prequel (https://www.seomoz.org/blog/the-science-of-ranking-correlations); since this is the latter, I post it here.
Chris,
I fully acknowledge your point about the validity of this sample. I did bring it up in this post, and fully outlined my methodology, reasoning, and limitations. In fact, at no point in my post did I refer to the group of websites I used as a “sample” in order to avoid confusion.
However, just because I did not use a random sample of websites does not mean that my results are invalid. It certainly limits my ability to draw conclusions about “all websites,” but that was not the population of interest. However, because this research sets a new precedent, I feel that the burden of proving that we would see “a very different outcome” using other websites is on anyone who chooses to replicate my work.
As an advanced SEO and experienced statistician, I can assure you that my methods are sound, and I stand by the quality of my work. Your points are well taken, as is your critique. It's an important part of the research process, and I appreciate that you took the time to review my post. Thanks Chris.
:-)
What were average unique domains and total links for each pagerank?
awesome analysis! thanks for your work
Appreciate the compliment. If you're asking how many domains had each PageRank value, the first graph shows the frequencies. As for links by PageRank, are you asking about the average number of inbound links for a PR 4 site compared to a PR 3 site? If so, by whose report? Yahoo, Google, or SEOmoz?
Yes, that's what I'm asking... I want to know if it's ~1k links for PR4 and ~10k for PR5.
I understand this would be hard to measure, first because you need to pick a good sample and all links are not created equal, but I wondered what they would look like...
I guess you would do it with Linkscape data because it will have the most accurate counts...
I've been pulling PR myself, and a couple of things I noticed...
First is that any #anchor in the URL tested will make it fail to return... When calculating the Google magic hash, if you pass just a domain name it seems to return the homepage PageRank, but I haven't found a case where the "domain" PR I pull and the https://domain.com/ result are different.
There are also some new parameters being passed by the Google Toolbar or Firefox, and it's getting back some other S number... I have no idea what it's for, but it looks a lot like an auth token to me, so I don't know.
Still haven't finished reverse engineering the Wonder Wheel JavaScript so I can make programmatic requests... but PageRank history and cache is almost done in SEO Site Tools.
This spreadsheet shows the average number of external links as measured by Yahoo for each PageRank value.
As for collecting information, I personally am a huge fan of SEOQuake. Custom parameters are extremely powerful, and are in fact how I gathered the majority of my data.
Congrats on your first YOUmoz post Sean. Quick question on your correlation significance numbers. Is the max 1.0? Thanks, Mark
Yup. Correlation coefficients range from -1 to 1, with 0 indicating no relationship. Squaring the correlation coefficient (i.e. 0.60^2 = 0.36) gives the coefficient of determination. This number tells us the percentage of variance in "y" we can explain with changes in "x".
In this example (r^2 = 0.36), we can say that we can explain 36% of the amount that "y" tends to vary about its mean with a regression equation based off of "x".
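As a small illustration of that statement, the sketch below (Python, with made-up numbers and hypothetical function names) fits a least-squares line and checks that the share of variance it explains, 1 - SS_residual / SS_total, matches the squared Pearson correlation. This identity holds for simple linear regression with an intercept.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def explained_variance(x, y):
    """Fraction of the variance in y explained by a least-squares line on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)
    return 1 - ss_residual / ss_total


# Made-up data purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

r = pearson_r(x, y)
print(r ** 2)                    # squared correlation
print(explained_variance(x, y))  # same quantity, computed from the fitted line
```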
PageRank - the SEO equivalent of "it's what's on the inside that counts".
Sean - this is a great job. This specific data is interesting, and more generally having SEO being more about statistics is exciting.
Whoa Nelly! You are a statistics animal Sean. I can see the massive amount of work that went into this post, and for that I congratulate you.
Now I just have to hit the reset button in my brain (that just went into overload after trying to take in all the data) and re-read it again. Slowly. Really slowly.
PS - Congrats on your first YOUmoz post!
Thanks GNC. I appreciate the compliments. I'm really into statistics, so it was actually very fun to write. I managed to write the bulk of it over just two days (12+ hour days), but I have been working with the data set for about a year now.
Funny thing is, I put more weight on SEOmoz's toolbar for Firefox and the calculations for Domain Authority and Page Authority. I've seen some pages with 0 PageRank do well on the Moz toolbar.
That sounds like a great choice Joe. The statistics I presented in this post suggest that SEOmoz's metrics may be better than PageRank as a predictor of many important variables. MozMetrics are certainly updated more often, and have greater granularity, which arguably makes them more actionable.
What I find confusing is that we have links from some pages that have a high PR, like 6, but Open Site Explorer gives them a PA of like 20! These pages are like 2 clicks from the homepage and any other page on the site.
From what I understand, PageRank is much more similar to mozRank than to Page Authority. MozRank was built off of a PageRankish model, whereas Page Authority takes many non-link variables into account.
So if you have a link from a page like in my example, would it be very valuable? It has a high PR but a very low PA.
That's a very good question. Unfortunately, I'm not sure I know the answer. You should probably ask someone from SEOmoz their opinion.
Absolutely brilliant! Blown away by the statistics. Thanks for sharing and hope to read many more posts from you.
Thanks
Awesome analysis. Well done!
Thanks for sharing!
With kind regards
Ortwin Oberhauser
www.ortwin-oberhauser.com
Sean, congrats on your first post and what a topic to start with!
As I have the mozbar installed, I tend to default to the mozRank metrics, knowing that the PR must be closely associated. I used to manually cross-reference both until the pattern was so similar that it was a waste of time to look at both. I stuck with the mozbar and reference that.
I'm glad to know that your study has shown mozRank to potentially be a better predictor.
Thanks Marshall. I do want to caution that I didn't measure if any one correlation was significantly higher or lower than any other correlation. There is a certain degree of error when working with these types of statistics, and in order to confidently say that mozRank is a significantly better metric than PageRank, I would have to conduct additional tests. However, the numbers I have reported do give a rough sense of the differences between the two.
I thumbed you up because it takes a lot of work and thought to put something like this together. However, I'm going to go out on a limb and disagree that PR predicts traffic. Unless you are only looking at organic traffic, this is a tough conclusion to draw. The data suggests that PR and traffic do have some correlation. However, I've seen Quantcast be 100% off on estimating traffic for some sites that I have analytics access to. The correlation between PR and traffic could mean that PR predicts a high PPC budget, for example. So just having high PR isn't necessarily going to predict that I will have a ton of organic traffic.
Thanks for the feedback. I guess I should have been more clear about the word "predict" in this context. When two variables are significantly correlated, we can say that both "predict" each other. Their ability to do so is determined by the value of the correlation (r) squared. I'm not trying to suggest that one "causes" the other, but that changes in PageRank tend to coincide with changes in traffic. You raise an excellent point about confounding variables like PPC budget, and I appreciate it.
As for the accuracy of Quantcast, the truth is that Quantcast's metrics are simply different than those given by an analytics platform. For Quantified (directly measured) sites, Quantcast performs all sorts of additional calculations and corrections, so it would make sense that they don't align with what you see in your analytics.
Edit: Should be reply to previous post disbelieving PR = Traffic, not a reply to Sean. Apologies!
PageRank can drive traffic. Traffic can drive PageRank. Once PageRank reaches a certain point, if a website has significant content, PPC will never reach as many people as organic results will. This is because a huge chunk of searches are based on long-tail keywords, and even my PPC accounts with over 75,000 keywords contain only a fraction of keyphrases longer than 3 keywords.
Let's just talk about what PageRank is for a second. PageRank is basically an algorithm that considers all the incoming links a page has, and based off the "weight/importance" the link carries gives the page receiving the links a score. Add all those scores together and you get PageRank.
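For readers who want to see the idea in code, here is a minimal sketch of the commonly published PageRank formulation (power iteration with a damping factor of 0.85, as in the original Brin and Page paper). It is only an illustration on a tiny invented link graph: the page names are hypothetical, and this is not a claim about how Google computes the 0-10 toolbar value today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages from a dict of {page: [pages it links to]}.

    Assumes every linked-to page also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current score evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented four-page web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, "c" ends up with the highest score because it receives the most links, and "a" scores well despite having a single inbound link because that link comes from the highest-scoring page. That is the "weight/importance" effect described above.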
Let's talk about traffic driving PageRank: How do you get links (in a natural fashion)? People see something they like, and they link it for the people visiting their page to check out. Those people check it out, and some of them link it, and it continues. This is how traffic drives PageRank.
How can PageRank drive traffic? By the definition of PageRank, the higher it goes, the more and better links you have. This obviously leads to more traffic: if CNN.com links to me in a news story, I'll receive a nice PageRank boost. That will also drive more traffic than a link from Annie's News I Like Blog. Furthermore, if I'm answering queries that people are searching for, higher PageRank will typically make me appear more relevant in results (disclaimer: lots of factors, yada yada). If they like my content, they link it, more people visit, etc. etc. PageRank can drive traffic too!
Outside of pornography, gambling, and more private areas of the web, I'd be shocked to find situations where PR and traffic are uncorrelated. I would be interested to find out what websites get the most from lower page rank.