(NOTE: This post is written by Ben Hendrickson and Rand Fishkin as a follow up to Ben's presentation at the Distilled/SEOmoz training seminar in London this week)
Our web index, Linkscape, updated again recently, and in addition to providing the traditional stats, we thought we'd share some of the cutting-edge research work we do here. Below, you'll find a post that requires extremely close and careful reading. Correlation data doesn't have all the answers, but it's certainly very interesting. Likewise, the ranking models data provides a great deal of insight, but it would be dangerous to simply look at the charts without reading the post carefully. There are a number of caveats to keep in mind - raw lines can mislead by themselves, so please be diligent!
[UPDATE Oct 26, 2009: There used to be a mistake. Below there is a line showing the correlation between unique domains linking and SERP order. That line was previously incorrect: instead of being the number of unique domains linking to the target page, it was the number of unique domains linking anywhere on the domain of the target page. The corrected line shows unique domains linking to be much more important, so I also added this line to the combined über score vs. individual features chart. My apologies for the error. -Ben]
First, some stats on the latest Linkscape index:
- Release date: Oct. 6th, 2009 (exactly 1 year after our initial launch)
- Root Domains: 57,422,144 (57 million)
- Subdomains: 215,675,235 (215 million)
- URLs: 40,596,773,936 (40.5 billion)
- Links: 456,939,586,207 (456 billion; if we also include pages that 301, that number climbs to 461 billion)
- Link Attributes:
- No-follow links, internal: 6,965,314,198 (1.51% of total)
- No-follow links, external: 2,765,319,261 (0.60% of total)
- No-follow links, total: 9,730,633,459 (2.11% of total)
- 301'ing URLs: 384,092,425 (0.08% of total)
- 302'ing URLs: 2,721,705,173 (0.59% of total)
- URLs employing 'rel=canonical': 52,148,989 (0.01% of total)
- Correlation between mozRank and Google PageRank: mean absolute error of 0.54
- Correlation between Domain mozRank (DmR) and homepage PageRank: mean absolute error of 0.37
Now let's get into some of the research around correlation data and talk about how we can use the features Linkscape provides to produce some interesting data. These first charts use raw correlation - just the relationship between the ranking positions and the individual feature. As noted above, please read the descriptions of each carefully before drawing conclusions and remember that correlation IS NOT causation. These charts are not meant to say that if you do these things, you will instantly get better rankings - they merely show what features apply to pages/sites ranking in the top positions.
Understanding the Charts:
- Mean Index By Value: These are used for the y-axes of many charts. Instead of averaging the raw values, we average each result's relative index within its SERP when the results are ordered by this value. So if there are 3 SERPs, and the page in the first position has the 4th most links in one SERP, the most links in the second, and the 10th most in the third, the mean index by # of links for the first position would be (4+1+10)/3 = 5. (There's a code sketch of this computation just after this list.)
- Mean Count Numbers - these numbers appear on the y-axis of the first chart, showing averages of link counts.
- Position: This is used for the x-axes of many charts. For these charts, this is specific to the organic position in Google.com, excluding any vertical or non-traditional search results (local, video, news, images, etc.)
- Error Bars: The bars that bound the trend lines in our charts can show confidence in two different things. On some charts, they show the 95% confidence interval for where the mean would be if we had infinite analogous data. These error bars show how confident we are in the line, and frequently have the word "stderr" in the title. On other charts, they show our confidence about what any given SERP will look like. These error bars are much wider, as we are much more certain of what the average of many SERPs will be than of what any given SERP will look like. Charts with these error bars are frequently labeled with "stddev" in the title.
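For the statistically inclined, here's a minimal sketch of the "mean index by value" computation with both styles of error bars. This is our illustration, not Linkscape's actual code; it assumes serps is a list of SERPs, each a list of result dicts ordered by Google position and carrying a hypothetical external_links count:

```python
import numpy as np

def mean_index_by_value(serps, feature="external_links"):
    """For each Google position, average the index each result would hold
    if its SERP were re-ordered by `feature` (1 = largest value)."""
    n_positions = min(len(serp) for serp in serps)
    per_position = [[] for _ in range(n_positions)]
    for serp in serps:
        values = np.array([result[feature] for result in serp], dtype=float)
        # Index by value: 1 goes to the result with the most links, etc.
        index_by_value = np.argsort(np.argsort(-values)) + 1
        for pos in range(n_positions):
            per_position[pos].append(index_by_value[pos])
    means = np.array([np.mean(x) for x in per_position])
    stddevs = np.array([np.std(x, ddof=1) for x in per_position])
    stderrs = stddevs / np.sqrt(len(serps))
    # "stderr" charts: plot means +/- 1.97 * stderrs (confidence in the line).
    # "stddev" charts: plot means +/- 1.97 * stddevs (spread of single SERPs).
    return means, stderrs, stddevs
```

Plugging in the toy example from the definition above (first-position indices of 4, 1 and 10 across three SERPs) returns a mean of 5 for position one.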
The data below is based on a collection of 10,000 searches for a variety of queries (biased towards generic and commercial rather than branded/informational queries), comprising roughly 250,000 individual results. Some results were excluded for errors during crawling or for returning non-HTML responses. Results are taken from Google.com in the US from October of 2009.
Are Links Well Correlated with Rankings?
Common SEO wisdom holds that the raw number of links that point to a result is a good predictor of ranking position. However, many SEOs have noticed that Yahoo! Site Explorer's link numbers (and even Google's numbers inside services like Webmaster Tools) can include a dramatic number of links that may not matter (nofollowed links, internal links, etc.) and exclude things (like 301 redirects) that matter quite a bit. Using the Linkscape data set, we can remove these noisy links and use only the number of external, followed links (and 301s) in our correlation analysis.
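In code, that cleanup step might look something like the sketch below. The DataFrame and its columns (source_domain, target_domain, nofollow, link_type) are illustrative assumptions, not Linkscape's actual schema:

```python
import pandas as pd

def countable_links(links: pd.DataFrame) -> pd.DataFrame:
    """Keep only the links the analysis counts: external, followed,
    and either direct or reached through a 301 redirect."""
    external = links["source_domain"] != links["target_domain"]
    followed = ~links["nofollow"]
    counted = links["link_type"].isin(["direct", "301"])
    return links[external & followed & counted]

# Per-page counts for the correlation analysis:
# counts = countable_links(links).groupby("target_url").size()
```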
This first chart certainly suggests that correlation exists, but the spikiness is a bit frustrating. Through deeper analysis, we found that this is largely due to results that have ranking pages with massive (or very tiny) quantities of links. Thus, it made sense to produce this next chart:
Here, we can see what would happen if we force-rank the results by number of links. This means we've taken each set of results and assigned a number (1, 2, 3, etc.) that corresponds to the quantity of links they have in comparison to the other pages ranking for that result (e.g. the page with the most links is assigned "1," the second-most links gets "2," etc). The smoothness of the line suggests it is fairly accurate, but we can be precise about our accuracy. The error bars below show the 95th percentile confidence interval for estimates of the mean.
We're looking good. The correlation is quite strong, suggesting that yes, the number of external, followed links is important and the standard error is low, so we can feel confident that the correlation is real. Clearly, though, comparing with the perfect fit line, links are not the whole picture. Having the most links out of your peers in the results is likely a very good goal, but it can't be the only goal.
The last piece here is to examine the standard deviation. This can tell us how much an individual page might vary from the averages.
This chart tells us that variation for any individual set of results can be quite large, so getting more links isn't always going to be a clear win. It's notable that the standard deviation in this chart is shown at the 95th percentile confidence level, which is actually 1.97 standard deviations away from the mean. On the whole, the # of external, followed links is clearly important and well correlated, but we're going to need to get more advanced in our models and broader in our thinking to get actionable information at a granular level.
Can Any Single Metric Predict the Rankings?
Boy, that sure would be nice... We've looked in the past at the quality of metrics like PageRank, Yahoo! Site Explorer's link counts, Alexa Rank, etc. The short answer is that they're barely better than random guessing. Google's PageRank score was (around February of 2009) approximately 16% better than random guessing at predicting ranking page (N vs. N+10, i.e. ranking on page 1 vs. page 2) and less than 5% better than random guessing at predicting ranking position (N vs. N+1, i.e. position 1 vs. position 2). The chart below shows correlations for a number of popular SEO metrics:
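As an aside, the "X% better than random guessing" framing can be read as a pairwise ordering test: how often does the metric put the higher-ranking of two results first? Here's a minimal sketch under that reading (ours, not SEOmoz's exact methodology), assuming serps is a list of SERPs whose result dicts carry the metric being tested:

```python
def pct_better_than_random(serps, metric="pagerank", gap=1):
    """How much better than a coin flip `metric` is at ordering pairs of
    results `gap` positions apart (gap=1 ~ position 1 vs. 2;
    gap=10 ~ page 1 vs. page 2)."""
    wins = ties = total = 0
    for serp in serps:
        for i in range(len(serp) - gap):
            higher, lower = serp[i][metric], serp[i + gap][metric]
            total += 1
            if higher > lower:
                wins += 1
            elif higher == lower:
                ties += 1
    accuracy = (wins + ties / 2) / total  # ties count as coin flips
    return 100 * (accuracy - 0.5) / 0.5   # 0% = random, 100% = perfect
```

Under this reading, PageRank's 16% figure corresponds to getting roughly 58% of the page-1-vs-page-2 pairs right.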
Since then, Nick, Ben and Chas have all been hard at work on improving the value and quality of Linkscape's index as well as the usefulness and signaling provided by the metrics. This next chart shows how we're progressing:
The correlations above map to the 35-50% better than random guessing range (though it's not a 1-to-1 comparison with the numbers above - watch for that in a future post) for the first result. That external mozRank (which represents the quantity of link juice to a page from external links) and external followed links correspond well to current rankings is interesting, and certainly lends an additional data point for link builders. This correlation line might, for example, suggest that in the "average" rankings scenario, earning links from high mozRank/PageRank pages with few links on them (so the links pass more juice), as well as higher raw quantities of external, followed links, are both very important. But even more, this chart supports the idea that earning links from unique domains is paramount.
[UPDATE (Oct 26, 2009): Previously there was a paragraph speculating why the above result for the importance of unique linking domains was so much lower than we previously calculated. As noted at the top of this post, this was because I used the wrong datapoint for unique domains linking. Correcting this made the discrepancy with earlier results disappear. The chart above is now correct. -Ben]
The frustrating part about this data is that it's not telling us the entire story, nor is it directly actionable for an individual search query. As you can see below, the standard deviation numbers show that for any given search, the range varies somewhat dramatically.
When we see this effect, just as we did above, the takeaway for an SEO doing work on a client project and attempting to achieve a particular ranking position is unclear. Employing these metrics as KPIs and as ways of valuing potential links is probably useful, and building competitive analysis tracking with these data points is likely to be considerably better than using more classic third-party metrics, but it doesn't say "do this to rank better," and that's the "holy grail" we're chasing.
How Do "On-Page" Factors Correlate with Rankings?
This post has dealt very little with on-page factors and their correlation to rankings. We'll look at that next.
Google recently announced that they ignore the meta keywords tag. This data, showing a very spiky line with stderr (standard error) bars that all straddle the horizontal at "13," certainly supports that assertion. (With roughly 25 results per SERP in our data set, a feature with no relationship at all to rankings should average out to a mean index of about (25+1)/2 = 13.) Employing the query term/phrase in the meta keywords is one of the least correlated signals we've examined.
Title tags that employ the query term, on the other hand, appear to have a real correlation with rankings. They're certainly not perfectly correlated, but on average, this chart tells us that Google has a clear preference (though not massively strong - note the smaller range in the y-axis) for pages that employ the query term in the title tag.
We've examined H1/H2/Hx tags in the past and come to the conclusion that they have little impact on rankings. This graph certainly suggests that's still the case. Employing the query in other on-page areas such as the body (anything between the <body> tags) and out anchors (employing the keyword in the <a> tag whether internal or external) have significantly greater correlation with rankings, while H1-H4 tag keyword use appears almost horizontal on the graph (suggesting no benefit is derived from its use). It's not as bad as the random effect we observed with meta keywords (the lines all start a tiny bit below 13 and end a tiny bit above), but the positive correlation is low and the horizontal is mostly inside the error bars.
This graph is the clearest illustration yet of why it's so important to build systems more advanced than simple, direct correlation. According to this chart, employing the query term in the path or filename of the URL is actually slightly negatively correlated with ranking highly, while the subdomain appears largely useless and the root domain has strong correlation. Granted, all of these (except the root domain) are on a very narrow band of the x-axis, but SEO experience tells us that using keywords in the name of a page is a very good thing, for both search rankings and click-through rate. Whenever we see data like this, a number of hypotheses arise. The one we like best internally right now is that the URL path/filename data may be skewed by root domain keyword usage. Essentially, when a root domain name already employs the keyword term, the engines may see those who also employ it in the path/filename as potentially keyword stuffing (a form of spam). It may also be that raw correlation sees a large number of pages with less-optimized URLs performing well due to other factors (links, domain authority, etc.). It's also true that most sites that employ the keyword in the path/filename don't use it in the root domain as well, so the negative of the one may be mixed in with the positive of the other.
Whatever the reason, this is a perfect example of why raw correlation is flawed and why a greater depth of analysis - and much more sophisticated models - are critical to getting more value out of the data.
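One simple way to probe that domain/path hypothesis would be to cross-tabulate keyword use in the root domain against keyword use in the path/filename and compare average positions in each cell. A hypothetical sketch, assuming a results DataFrame with boolean kw_in_domain and kw_in_path columns plus a position column:

```python
import pandas as pd

def keyword_interaction(results: pd.DataFrame) -> pd.DataFrame:
    """Mean ranking position for each combination of keyword placement.
    If path/filename keywords only look harmful when the root domain
    already contains the keyword, the (True, True) cell will stand out."""
    return results.pivot_table(index="kw_in_domain",
                               columns="kw_in_path",
                               values="position",
                               aggfunc="mean")
```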
Can We Build a Ranking Model that Gives more Actionable Takeaways?
To get to a true representation of the potential value of any given SEO action, we need a model that imitates Google's. This is no easy undertaking - Google employs a supposed 200 ranking factors, so while we've got lots of data points (on-page and link factors, plus lots of derivatives/combinations of these) the complexity is still a dramatic hurdle.
The "uber" score (red line in the graph above) is built by taking all of these features we have about webpages, domains and links from both on-page analysis and Linkscape data. We (well, technically, Ben) run them through a machine learning model that maps to the search results and produces a result that's considerably better correlated with rankings than any single metric. You can already see that in the top 10 search results, the slope of the line is looking really good - an indication that our metrics and analysis function better for predicting success in those areas (which, luckily, are the same positions SEOs care most about).
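To make that concrete, here's a loose sketch of the idea - a learned score plus a "marginal effect" query. The model choice, field names, and training data are illustrative assumptions; this is not SEOmoz's production system:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_uber_model(X_train, y_train):
    """Stand-in for the learned "uber" score: page/link features in,
    ranking score out."""
    return GradientBoostingRegressor().fit(X_train, y_train)

def marginal_effect(model, X, feature_idx):
    """Mean change in predicted score when one binary feature (say,
    "keyword in title") is switched on - a discrete analogue of the
    "mean derivative of uber" plotted in the charts that follow."""
    X_off, X_on = X.copy(), X.copy()
    X_off[:, feature_idx] = 0
    X_on[:, feature_idx] = 1
    return np.mean(model.predict(X_on) - model.predict(X_off))
```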
These machine learning ranking models let us take a much more sophisticated look at the value of employing a keyword in any particular on-page feature. Instead of going off simple correlation, we can actually ask, based on our best fit model, "what's the impact of using the keyword here?" Let's use the example we struggled with above showing negative correlation for keywords in path/filename:
As you can see, this model suggests that, once again, subdomains are largely useless places to put keywords, but the root domain is a very good place to employ it. Path and filename are slightly positive, which also fits with our expectations. It's also important to note that on this chart some lines dip below 0 on the "mean derivative of uber" y-axis in the 20-25 ranking position range. This suggests that for those results, the keyword use may actually be hurting them. Looking into some sample results, we can see that a number of the URLs in that 20-25 range seem to be trying too hard. They're using the keyword multiple times in the domain/path/filename and fit with what many SEOs call "spammy-looking." It could certainly be a weakness in our model's accuracy, but we think it's also likely that a lot of pages would actually benefit from being a bit less aggressive with their URL keyword stuffing.
In this next chart, we can see the standard deviation error bars. You can see that we're more confident that in the top results, employing keywords in these URL features won't hurt and is likely to help, while in the latter portion of the results, we've got a bit less confidence about the negative effects.
Let's turn our attention to those pesky H(x) tags again, and see if the ranking model has more to say about their impact/value.
We're still getting mostly similar results. It appears that H1-H4 tags are not great places to use keywords. As with the URL features, they seem to help a tiny bit (even less than the URL features, actually), then have a very tiny negative-to-flat effect in the later SERPs. Even with the error bars, this is fairly convincing evidence that H(x) tags just don't provide much value. A best practice might still suggest their use, but there are probably far more valuable places to use your keywords.
Our link measurements also get more sophisticated (and tell a more nuanced story) when we use the ranking models. You can see above that improving mozRank in the top results appears important, while raw # of links may be less valuable. However, when we look further back in the results, you see the negative dip, suggesting that some pages may be over-using mozRank and external links (quite possibly from less reputable/spammy sources). This graph doesn't have a ton of actionable data (as controlling the amount of mozRank or even the number of external links you get is probably not wise), but it does fit fairly nicely to a lot of the things we know about SEO - good links help, bad links might hurt.
The last graph shows some of the more interesting on-page features from our dataset. The big one here is the consistent suggestion to use images with good alt text that employs your keyword term/phrase. That green line is one of our highest correlations for on-page keyword usage. Putting keywords in bold, in body text (anywhere) and even in out anchors (remember, these are any anchors, not necessarily external links) has the same type of positive impact in the top SERPs and slight negative in the 20-25 range that we've seen previously. This shouldn't surprise us at all if we suspect that spammers/keyword-stuffers are playing more heavily in those result positions.
Conclusions & Take-Aways:
I know this is a lot of data to parse, but it's also pretty important to understand if you're in the SEO space and want to bring more data credibility and analysis to your projects. We suspect that SEOmoz isn't the only firm working on this (though we may be the only one willing to publicly share the data for now), and you can bring a lot of credibility to a client project or in-house effort with these data points showing the importance and predicted value of the changes you recommend as an SEO. There are plenty of people who malign our industry as being based on hunches and intuition rather than strong data. With these analyses, we're getting closer to closing that gap. We don't want to suggest that this data is perfect (the error bars and accuracy analyses show that's obviously not the case), but it's certainly a great extra piece to add to the equation.
Things the data suggests that we feel good about:
- Links are important, but naive link data can mislead. It seems wise to get more sophisticated with link analysis.
- No single metric can predict rankings (at least, not yet)
- H1s (and H2s-H4s) probably aren't very important places to use your keywords
- Alt attributes of images are probably pretty important places to use your keywords
- Keyword stuffing may be holding you back (particularly if you're outside the top 15 results and overusing it)
- Likewise, overdoing it with (not-so-great) links might be hurting you
We're definitely looking forward to comments and questions, but Ben & I are in the UK and may not be back online for a while (Ben's plane leaves in a few hours for the US and British Airways doesn't yet have wifi in-flight).
p.s. A shoutout to Tim Grice from SEOWizz, who put together this correlation analysis a few weeks back.
I am a regular follower of SEOmoz.org and always take note of Rand's articles. But this time I couldn't resist logging in and creating an account to say how interesting this article is.
Keep the good work up guys.
Tushar
Would this post's correlation with rankings be clearer if you could somehow account for where in the title the keyword was for a given ranking? If you see what I mean.
Certainly - in this limited analysis for the presentation in London, we didn't look at positioning or more complex questions, but that's certainly possible in both raw correlation and using the ranking models for the future.
We did some work a little while ago where we considered the position of matches within these areas, and including it did increase accuracy. So position does appear to matter. But the trade-off is that including it makes the results more complex to understand.
It should be possible to work out how to make it easy to understand AND contain position information. But in making this data, I opted for the lazy method of making it easier to understand by dropping position information. This is another thing for me to correct in the future.
Ben - just wanted to say I thought your presentation was really great. Particularly loved the disclaimer you gave at the beginning :-)
This is great, in-depth data, and kudos to Ben for all the hard work.
I think some people have trouble separating these large-scale analyses from what works or doesn't work for their own site (or handful of sites). One of the difficulties in applying aggregate data to any single site is the same problem that doctors often face. You're taking a large population and applying your findings to a sample of 1, which in many ways violates the whole point of statistics.
As an analogy, let's say that you find aggregate data to suggest that eating an orange per day creates a 3% improvement in overall health. If you apply that knowledge to someone who already has a healthy diet, with plenty of fruits and vegetables, that change will probably have no effect at all. That one person might naturally (and erroneously) assume that oranges have no benefit. Meanwhile, what if you gave that one orange/day to someone with scurvy. That individual would notice an almost miraculous improvement, potentially leading them to tout the wonders of oranges to everyone (also erroneously).
That's not at all to say that the aggregate data has no value. On the contrary, it's vital to our understanding of the search ecosystem as a whole. On the other hand, it's naive to think that one magic, one-sized-fits all tactic will simply fall out of the data. These analyses reveal just how complex and sophisticated the algorithm has become, and the task of applying this knowledge to any given data point (i.e. your website) isn't easy.
Hey Pete - you're absolutely right when we're talking about the correlation analysis at the top of the post, but when Ben moves into the ranking models features, this is predicting exactly that second piece you're seeking - what is the marginal value for any site hoping to achieve better rankings to employ (or not employ) a given tactic. Hence, a big takeaway for every site doing SEO should be (if you trust Ben's ability to accurately build a ranking model using machine learning and his confidence intervals) to not worry much about H(x) tags and do a bit more with alt text on images. As this model gets better and better, it will continue to give more of this globally actionable advice about what Google cares about and doesn't.
Like most of our readers, I think I'm still trying to understand the first half of this post, let alone the second (actually, I'm probably still on the first 20%) :) That's no insult to Ben - actually, it's a testament to just how rich this analysis is.
I was really only speaking to the people who run with any given piece of analysis, apply it to their site, and then give up when it doesn't work (or assume the analysis is wrong). I've been seeing this a lot on the URL stuff. People who already have 85% optimal URLs want to go through a ton of work to make them 95% optimal. Of course, this carries risk (of 301s not working, over-optimizing, etc.) for potentially very little gain. Meanwhile, someone whose URLs are highly parameter-loaded with no keyword representation might see significant improvements making that same change. I think this is true for a lot of SEO tactics - the value of any given tactic depends on your starting point.
The data above shows exactly what you're perceiving with projects. Look at the y-axis and how small the gradient is for the slope of the curves. Even when we say that putting keywords in URL paths/filenames is helpful, you can see that it's quite a tiny amount of benefit. In the future, we can do more to make it even easier to understand in exactly the way you're describing - saying something like: "your page is 91.5% optimized for the URL. Changing to a URL that includes the exact match keyword in the path would make this 92.6%, an improvement of 1.1%." From that, you can actually decide what's worthwhile vs. not (and you can see the ranking strengths of those folks ahead of you so you can get a better sense of whether getting that 1.1% boost for on-page is worthwhile).
NOTE: This isn't right around the corner, but it's the logical extension of where this work takes us :-)
This is why I like SEOmoz... tests which I can't do "at home". Thanks.
All of this data just gave me a nerdgasm. Great work guys!
If you claim H tags don't matter, I'd like to see if text size matters. Perhaps the false positives that people in the industry are getting on H tags come from the H tags actually being a larger font on the page.
Good call - we'll add that to the list!
Hey Brent! Long time no hear from!
However difficult it may be for us to step away from long-held assumptions and best practices, it will be increasingly important to base our suggestions and optimizations on real data - lest we run the risk of being seen as boondogglers.
Obviously the goal of sharing the data is to promote Linkscape, but I nevertheless thank you for being so open and willing to share this information with the community. SEOMoz is a blessing to the SEO industry.
Boy, do I wish I could put in the time to collect all that data and map it out like that. One day I will be this organized.
Really excellent work.
You suggested that others may be doing the same, but not publicly.
I've in fact been doing this kind of work for the last four years. Recently we published a study on Caffeine vs Vanilla Google which was really top level.
Internally though we do this kind of work, programming in python and R to pull together data sets and generate our graphs. We don't have Linkscape at the state you do but we're working on something a little different in that respect.
We've also been working on semantic profiling and automatic generation of individual maps of the web for our clients which present a fantastic amount of data. The killer is always trying to present it in an understandable format which I think you've done a great job of.
What you're doing here though is more fantastic than that - you're pointing out that SEO is extremely complex and in depth (because Search Engines themselves are) which I don't believe that any top US SEO blog has EVER done. Personally I die a little every time I see another top ten list.
SEO is getting a really bad name because people DON'T understand this, or vectors, or any of the basic maths necessary to produce these kinds of graphs. That whole debate recently on whether SEOs should understand HTML completely flummoxed me; here's hoping we get more of these posts.
Waaauuuvvvv!!!??
Speechless, overwhelmed... I think I just became a true fan ;o)
This is a tremendous piece of work which you might wanna send to Derek Powazek. These are plain and simple facts which definitely will act as proper evidence dismissing his rant from last week.
Once again - Very impressive :o)
Great post guys! I'm truly amazed by the effort you must have put into this article, or should I say thesis... Whatever the case may be, it's nice to view the colorful graphs and to read about your theories which, again, are well presented. In my humble opinion, the most important thing in this article is the fact that as long as we don't have a replica of Google's algorithms, your guess and other guesses are as good as any other professional SEO company's around the globe. It's unreasonable to think that we can get some sort of a predictive metric, which will point out rankings. The best thing to do, and it will never change, is to build a site with good content that naturally attracts visitors who naturally refer the site to other visitors who naturally start to debate the site's content and basically turn it into a hot topic. From that point it's a question of good "retention" - if you'll allow me to describe it that way. A lot of sites are good, a lot of sites attract visitors and most of them are doing good "retention" - Google and any other SE knows it as well as you do, but the remaining questions that will never be answered are: what decides which one is better than the other? And how does that decision affect their ranking?
For those unanswered questions you must go to SEO companies because that’s the closest you’ll ever get to trying to find an answer…
mottid - this piece caught my eye: "It's unreasonable to think that we can get some sort of a predictive metric, which will point out rankings."
Why is it unreasonable? Just because it's tremendously hard (Ben's been working on this for years now) doesn't mean it's not worth doing or not possible. I think that given more time and sweat equity, we can eventually reach something very close to the predictive metric you're describing. And when we do have it, we'll certainly plan on sharing and making it available :-)
I think what he is trying to say is that Google is always going to change their algorithm, so you (Ben) will never really finish this project. That doesn't mean you should stop trying!
That's exactly what I've been trying to say :)
What about correlating the mozRank of links with matching anchor-text (broad, phrase, exact) against Rankings, where all of the top 10 have at least 1 link meeting that criteria.
We have found with our own internal testing that we always run into the problem of having selected keywords for which no one wants to, or cares about, ranking. The metrics for ranking these sites become more obscure as Google really uses the extent of their 200+ ranking factors to find anything that appears relevant. These obscure rankings scrape the bottom of the data barrel, meaning it is more likely that Linkscape would not yet have sufficient data given index sizes.
It's amazing the amount of value you give away for free Rand, kudos!
I was hoping to see more solid data against keyword stuffing to show my boss, who loves to stuff keywords like Thanksgiving turkeys!
Nice work though.
Great post. Would love to hear some more on different combinations of the variables, instead of the 'all data combined' and the 'single source of data, ceteris paribus' graphs.
For example, is there a stronger correlation between body + H(x) vs. Position or is there a stronger correlation between bold + body vs. Position? Etcetc. (just curious, not criticizing any of the available stuff, which is great!)
Btw, any reason why the EM tag is left out of the On Page Matches graph? Keep up the amazing work!
Great point on the value of computing the accuracy of various combinations of features. That would be a good way of figuring out what is giving unique information.
EM was left out just because results for it are always similar to B, and fewer people use the EM tag, so our data for it is less precise than for B. I've sort of assumed the small difference between them that I see is just error, and that Google is likely merging them together in their algo.
Wow, let me just say I'm glad you do all the work and then let us copy!
Wow! There is some really amazing data and takeaways here. I had to break it up and read it a few different times to take it all in. I love statistics, and even this made my head hurt a little. It's nice to see that what Google says is actually the truth (i.e. the meta keywords tag not being used, etc.).
I think the real takeaway here is that staying current with SEO tactics is a necessity or you will be left behind. Things that work now may not work in 6 to 12 months, and your clients' sites should be ever-changing with the times.
Ben great work and keep it up, with data like this coming in, I'm sure you will find something amazing in no time!
Sensational post Ben - haven't taken it all in yet, but it makes me regret missing ProSEO. Thanks for sharing.
Great post - it's nice to have a text-based roundup after your fantastic presentation on Monday at SEOmoz London.
On a slightly side note - when and where can the attendees download all the presentations? There's a few I could do with going over again, especially all those Excel formulas that were so popular on day one!
cheers!
MOG
I believe you'll get an email with a link to download them. If you don't receive it by the end of next week, email me and I'll look into it.
Eyes... bleeding. Mind... reeling. Brain... hurts.
Aside from that, what a great post. Thanks for all the wonderful data!
When you say "According to this chart, employing the query term in the path or filename of the URL is actually slightly negatively correlated with ranking highly, while the subdomain appears largely useless and the root domain has strong correlation" how did you determine this in cases where the query term was concatenated in the domain, file name or subdomain?
For example:
Query term: obama cairo speech
URL: htt*p://www.seomoz.org/blog/obamacairospeech
I would be interested to know....
We do the words separately, and in the URL count any substring match regardless of tokenization. So in the charts above we would count:
https://www.cairofoobarobamaspeech.com
As matching each word of the query.
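In code, that matching rule is roughly the following (a sketch of the rule as described, not our production implementation):

```python
def url_word_matches(query: str, url: str) -> dict:
    """Check each query word separately as a raw substring of the URL,
    ignoring tokenization entirely."""
    url = url.lower()
    return {word: word in url for word in query.lower().split()}

# url_word_matches("obama cairo speech",
#                  "https://www.cairofoobarobamaspeech.com")
# -> {'obama': True, 'cairo': True, 'speech': True}
```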
THANKS. ...... for the SEO pro London AND this roundup.
Makes my SEO life easier
I have a problem: it's with the implied takeaway to not use keywords in the path/filename when you have a domain that employs the keyword. If a main category of a site includes a keyword that exists in the root domain, is it really more relevant to omit the keyword so as not to 'appear spammy' to the Googleybot (or any other search engine, for that matter)?
(consider the following made up examples)
discountapparel.com - if a main category is women's apparel or sports apparel:
this: discountapparel.com/womens-apparel/
less relevant than
this: discountapparel.com/womens/
smartphoneaccessories.com - if a main category is iphone accessories or blackberry accessories:
this: smartphoneaccessories.com/iphone-accessories/
less relevant than
this: smartphoneaccessories.com/iphone/
allshoes.com - if a main category is running shoes or golf shoes:
this: allshoes.com/running-shoes/
less relevant than
this: allshoes.com/running/
as a reminder from your post, correlation IS NOT causation.
I am launching a prototype beta platform (CMS/taxonomy builder/whatever) over the next few days that automates SEO (NOT intended for blackhat/spam/evil/etc.). The last thing I want to do is build in logic to please a search engine when it will result in decreased usability and is truly less relevant.
If anyone in the world sees this, I'd appreciate weigh-ins, as this is the one metric out of this fantastic analysis where direct correlation may be debatable in regards to algorithmic ranking vs. what's actually relevant.
That's impressive that you guys have indexed so much and set it up in a way to consume in a useful manner.
First of all thanks for the shout.
I think undertaking this kind of data analysis is always worthwhile; even though it can never be 100% accurate, it can lead us to strong assumptions.
The link quality over quantity debate is one I always feel torn about:
On one hand I have seen first hand the power of a quality link in ranking terms.
On the other, whenever I am undertaking competition analysis, the top-ranking sites are always littered with low-quality links from directories and spam sites. They obviously have quality links in the mix, but the low-quality ones dwarf them.
Another interesting side of this is anchor text: when I look at these sites with a few quality links and a boatload of low-quality ones, it is always the crappy links that use the anchor text the site is ranking for. Does this mean that although link juice may not pass, anchor text retains its normal weight??
Just very frustrating.
I think the low-quality links have a "link age" factor to a certain extent. Directory submission was all the rage back in the day (not that long ago), which is when I'd guess these links were made; they could have been present now for a few years...
I always do some of the low-quality link building, directories and articles, and find that a few months down the line, results are there to be seen, provided that there is also an element of on-site SEO, i.e. page titles and content.
Go for the low quality as a base or foundation, then go after the quality on top.
I think you're right there. Any articles I have published never show up quickly, but a few months later they are present and passing juice. I guess any campaign needs to be a mixture of quality and quantity.
With regards to the low-quality links being 5 years old, this may well be the case, but some pass huge amounts of link juice. As you say, maybe the age factor again.
I guess it's a little like the roots of a plant: the big, strong quality roots keep it anchored and upright, but there are hundreds of little ones that provide nutrients and assist growth.
I love seeing data points, and the machine learning y'all are doing sounds awesome. I'd like to know some more about how you built your crawler, how long it takes to do a crawl, and how you compile your index. Way cool work, guys.
Thanks for the explanation around these - this was a great presentation by Ben and good to talk to him in the pub afterwards.
I'd meant to ask at the time - is there any significance with a lot of the correlations crossing from +ve to -ve at the position 17 mark? Is there something magic which happens around position 17?
I'm not 100% sure - it doesn't correspond to precisely the same result in all the charts, and these are against the mean, so my educated guess would be that on average, the sites that are highly competitive but haven't devolved into spam/manipulation (with either on-page or link factors) tend to have much greater prominence in the higher-numbered results around 14-22(ish) than they do in the top positions (sort of suggesting Google's doing their job pretty well).
Ben might have a more sophisticated take on this, though. He gets home in 5-6 hours and will likely log in to answer :-)
Hey Tim!
The short answer I don't know exactly and should investigate more. Rand's educated guess seems plausible to me. But here are a couple of points on this subject:
There isn't anything magic about 17. We compute a ranking power for each result separately, and don't use any input signals that contain information about where Google ranked them (we'd get a very accurate and very useless model if we did). So it is only because the input signals for lower results tend to be different from the input signals for higher results that the average marginal effect of changing an input signal is different.
If you look at the 95th percentile bars for how this holds for individual results, it is pretty clear this isn't hard and fast. The marginal effects of changing the signals for top-ranking results in many SERPs are more like the average for lower positions than the average for higher positions (and vice versa).
I haven't done the work to figure out what is different about the average page by position that is making the difference. So at this point, anything I guess will be less likely to be right than what a hands-on SEO like you or Rand would guess.
Fantastic post Rand/Ben, and Tim Grice, nice work. It's refreshing to see real data back up assumptions and industry beliefs. It's difficult to monitor if you aren't measuring, so keep up the good work.
BTW, I love BA. I would fly them any day. Safe flight.
Impressive set of calculations. And nice graphs too.
But... it seems to me that you are putting too much weight on matching exact keywords in your calculations. What I still miss is semantics. Maybe an exact keyword match has no effect on the position in the search results, but a semantically related keyword has a big effect.
Are you also researching semantics at SEOMoz?
These particular models use somewhat naive metrics (though they're considerably more advanced than anything I've personally seen on or off the web in the past), but yes, we're certainly planning to add lots of features over time that can help us get better, and greater semantic analysis and complexity may well be part of that.
OK. That would be great.
One other suggestion to investigate: when you look at external links to a page, see whether the correlation improves if you also include the links to the parent of that page. I think it might.
Both good points.
Looking at related words does seem like an important future step.
I had not thought of links to the parent of a page, but that does make sense. I worry some about how to algorithmically determine the "parent page". To the degree it can't be reliably done, it makes it less likely Google is doing it (and it would be harder for me to implement). Do you have any thoughts? Would looking at the default page for one directory up be sufficient?
Another shortcoming that I am a bit embarrassed about is the lack of features where lengths of sections or counts of matches are weighted by the inverse frequency of the token in the corpus. The point being that the word "of" should be weighted less than the word "hotel". That seems like a really common thing to do in search, and not even that hard to do, but I just haven't done it yet...
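For reference, the weighting described here is standard inverse document frequency (IDF). A minimal sketch, illustrative rather than Linkscape code:

```python
import math

def idf(token: str, doc_freq: dict, n_docs: int) -> float:
    """Inverse document frequency: rare tokens ("hotel") get more weight
    than common ones ("of")."""
    return math.log(n_docs / (1 + doc_freq.get(token, 0)))

def weighted_match_count(query_tokens, matched_tokens, doc_freq, n_docs):
    """Count keyword matches, weighting each token by its IDF."""
    return sum(idf(t, doc_freq, n_docs)
               for t in query_tokens if t in matched_tokens)
```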
I noticed the relation with links to the parent in the following situation: page A (URL: sub.example.com) received a lot of links that improved the ranking of page B (URL: sub.example.com/cat/page-b) by 2-3 spots (the page is usually around spot 15). Page A also has a link to B in the body text. B has hardly any links pointing to it; most links point to the homepage of the subdomain (page A). I was able to reproduce this effect, so I am quite confident that there really is a relation. The improvement in ranking for B lasts only a short period, so it probably has to do with freshness.
To answer your question about what should be regarded as the parent of a page: going one directory up seems a reasonable approach to me, but including links to the homepage also seems a clever approach. Many people link to the homepage instead of linking directly to the page that they find interesting. Be careful, though, not to reinvent the PageRank algorithm :-)
Great post. I love being able to see some real metrics around what we all think we know, and it's a great step in the right direction to convince people that it's quality, not quantity, of links that can add real value.
I also agree with analysing links based on a number of factors, and it's good to see this broken out here. I'll definitely be using this as a basis for work on our clients, using the invaluable Linkscape tool!
Excellent post Rand. It is indeed "a lot of info to parse"... so I will probably be reading this again.
It also makes clear that great SEO is more than just putting keywords here and there, and is actually a technical challenge.
Keep up the good work :)
Ben, Rand... awesome post guys. Some really great analysis here.
Thanks for taking the time to put all that data together, and more so for sharing it.
Hi guys,
Ben actually made my brain melt a little with this presentation (in a good way) - having digested it more now it's even more awesome than I thought.
It really underlines just how much work you have put into making Linkscape (and the tools it powers) as useful and accurate as possible.
On that note I'm off to try and get as good at Excel as Will - probably be gone a while...
Harsh thumbing there!
Yeah, it took a while to digest some of the graphs. It must have taken forever to create the content for this post. Great info though, once you're able to digest it completely.
I loved Ben's presentation on Monday, and this write up is so great and valuable. It seems to be a genuine insight into how the search engines most likely calculate parts of the rankings.
Btw - happy one year anniversary Linkscape :D
(edit: my own terrible spelling)
Brilliant as ever, but especially in pointing out that, with data must come caution in reading into that data.
Looking forward to re-reading and re-re-reading this a few times.
Until then, I think this appropriately encapsulates my and many others' thoughts about this research.
My head was spinning and I had a terrific headache after reading this the first time. Now I'm reading it again and I'm starting to feel a little better.
I was kind of shocked at how little value you seem to put on using the H1, H2... tags. The way we were all brought up was to make sure our keywords were at least in the H1 tag.
I'm going to have to read this a third time to get a better feel for it and if I can come away understanding half of what's on this page I'll feel pretty good about it.
Later,
Jeff Sargent
Instead, we must look at the bigger picture of today's SEO - it's content marketing. And you can only win in the other areas - domain authority, traffic, organic search results, and ranking - by focusing on your content. [link removed]
How about starting to publish the presentation online as well, at the very least the slides?
Cheers
Ben's slides were actually just the graphs you see above, which he then talked about. The presentation was filmed, though, and will be available on the London PRO DVD.
Thanks for sharing this study. It seems that PageRank is still a good indicator.
Wowsers! I just found this post (I am a bit slow in catching up on the backlog of my pre-Moz days!)... and am completely blown away.
I dare not even ask how long this took Ben and Rand to put together, but I would wager it was no short task.
I really hope that you all can follow up with an equally entertaining and mind-boggling set this fall!
Really interested to see how often these things change, and how much some of these findings may have changed over the course of a year.
Reading back through all of these posts I have missed is definitely paying off - another gem of information!
It looks like a few sleepless nights were involved in this post!
Possibly daft question alert:
"(employing the keyword in the <a> tag whether internal or external) have significantly greater correlation with rankings.."
So an on-page anchor might be positive. An out anchor might also be positive, but is that for the domain in question or the one it's linking to?
I like the post - good data correlation, particularly about outbound anchor text and Google links in the ranking.
I really like this article!!! It is very helpful to have a visual representation of these different factors and to see the correlation between them.
Definitely a lot to absorb, but much appreciated, nonetheless. As an analytical person I have to admit even my eyes glazed over at times, but the text brought me back to reality.
I'm wondering though - would the competitiveness of the keyword phrase make a difference, or is it implied that the standard deviation would take that into account as well? The conclusion seemed to indicate that some on-page factors that might have been thought to be important in the past really aren't. But might that not change for uber-competitive keywords? Conversely, for long tails, if none of the other page elements are present, might the actual text between the body tags be a determining factor? In general, would the graphs change if the number of results returned changed from 200K to 302 billion (for football, e.g.), or between single-word keyword phrases and 5 or 6-word keyword phrases, and in what ways?
So if we were to use this data as delivered, then we would solely do the following 3 things:
"Get Links"
"Optimise Title Tags"
"Do some ALT Tags"
Doesn't really match up to the supposed 200+ ranking factors that Google uses to determine rank. I know it's more complex than the data shows in reality, but that is basically the summary.
You can manipulate/optimise these 3 areas and pretty much they are the only ones that will really determine your rank.
Superb stuff, as usual. Thank you very much for the analysis.
Second time to read this in the last 6 months and I think I understand it all a little better now, might help too that I've begun to play with the API and have started seeing what some of the raw data actually looks like.
Great information, as always!
Reading this post again in 2011!
Gentleman, you have truly outdone yourselves with this research project.
I am of the opinion that Google almost certainly employs a "randomness factor" as one of their 200+ determinants. In addition to increasing SERP variety, this factor adds noise to the data, making it more difficult for researchers to pin down the components of the algorithm.
If I were to conjecture, this factor is responsible for much of the high standard error you have found. Despite this, you have succeeded in creating some very valuable models for SEO practitioners. Hats off to your team.
Hey Sean - I think you might mean standard deviation (rather than standard error, which is quite narrow). Those numbers simply represent the 95% confidence interval or 1.97 standard deviations away from the mean.
The "randomness" concept is certainly interesting, but I think that given the right models, we should actually be able to see and quantify that somewhat (though it might take a while) :-)
Rand, Absolutely, and if any group can do so it will be SEOmoz. If there is in fact a "randomness factor" of some sort, it will simply make your task more difficult and possibly less predictive (although no less valuable). It's been a while since my last stats and research methods class, but I think I'm using the term "standard error" correctly. There's at least two different ways the term is used, so perhaps we're just thinking of different uses. Once again, fantastic work. This study provides significant evidence of the validity of the mozRank measure.
@sferguson & Rand:
A randomness factor is a very interesting idea, but only to explain small temporal adjustments (individual URLs hopping around ranking spots 2/3/4 over time).
=> If such a factor existed permanently, to which URL would you apply it? And if so, it would permanently hurt that URL disproportionately, which is not in accordance with Google's mission statement (don't be evil!!!) ;)
I personally have more belief in dynamic weightings that change in parallel to keyword competitiveness, but apply to every URL in the index.
How's that for a follow up? Hint! :P
Pretty cool stuff. I think this is all done with multiple regression analysis, yes? I'd be interested to see the R-squared and confidence intervals on these.
Apologies, I see the 95% confidence interval now. Still interested in the R-squared values, though! ;-)
This is absolutely a killer, thanks a lot, Rand and Ben.
The data shown in this post is really complicated. You are doing a great job - the complexity of Google's algorithm is built by hundreds of top engineers with their roughly 200 ranking factors, so Linkscape is very awesome already.
I've always used SEOmoz's backlink tool to check the number of inbound links with "keyword" anchor text; I think that's a good factor for sizing up the competition (although it's very time-consuming to evaluate the quality of each inbound link).
I am really looking forward to seeing more sophisticated link analysis, e.g. anchor text appearance, mozRank, mozTrust... etc.
Useful information - good to know about the heading tags.
Take a look at DaveN's new home page - he's stuck a load of internal links in his H1... and crafty guy that he is, he must be up to something... reminds me of an old technique I used to use lol... maybe retro SEO is the way to go!
retro seo ay?
*runs off and stuffs lists of keywords at the bottom of the page* :)
Amazing data. The one thing it screams in my face is "GET GOOD LINKS"
One little tiny thing:
Octoer of 2009 should be
October of 2009
(I am an insane spelling freak)
You missed the double "i" in "liittle" - spelling freak... which appeared here: "This post has dealt very liittle with on-page factors and their correlation to rankings. We'll look at that next."
Regardless: Has SEOmoz done any comparisons to Yahoo! or Bing? I think it would be interesting to see those comparisons, not just from an SEO standpoint, but so we could get a better handle on the global picture. For example, if having the keyword in the H1, H2, H3... doesn't provide any SEO benefits on Google, maybe it helps on Yahoo! / Bing.
As long as it doesn't hinder SEO efforts on Google to have the keyword in the H1, and it helps on Yahoo! or Bing, that would be good information to have. An amazing post; I definitely need to re-read this, and I wish I could have seen the presentation on this.
We did a little quantitative work earlier this year where we compared Google, Yahoo, and pre-Bing Microsoft, but everyone just seemed interested in the Google results, so I dropped the others this time. Maybe that was silly, especially given Bing is getting some good attention. You make a good point about why it would be useful to see the others.
I fixed the spelling errors, but you guys missed a few :P
You're the best but I promise with the amount of important data in this post the last thing I was looking for was spelling errors. I was merely trying to add whimsical humor to my comments.
haha! me too ;)
First of all, great post and thanks for sharing.
I would be interested to see a more in depth look at the hx tags. For example, filter out all the pages that use the h1 tag for their logo so it is not unique page to page.
More specifically, focus on pages where the h1 tag closely matches the title tag and contains the keyword. I would be interested to see if that played a role in ranking at all.
The improper use of the tag may be corrupting the data. Like some pages that use the h1 tag over and over again in an attempt to spam the engines.
Jason - I certainly agree with your analyses when it comes to the correlation data, but when the ranking models also show such a minor impact, it seems to me relatively convincing that even as a completely separate metric, using them properly and putting in keywords is likely to provide an extremely small amount of additional value. Certainly, as the model gets more accurate and the features more numerous and refined, we'll know even more, but at this point, we should have seen much more positive graphs if that were the case.
Cheers Rand/Ben
Really good follow up to the SEOmoz presentation.
Ben, excellent presentation by the way.
Interesting findings, interesting results - but is it just me, or does it not mention crucial factors such as which keywords were being monitored for rankings and where rankings were being monitored from?
We note some of that near the top. These results come from Google.com in the US and include approx. 10,000 searches that bias more towards commercial kinds of queries and things SEOs tend to try to rank for, rather than a purely random set. The SERPs, both for the correlation analysis and the testing SERPs for the ranking models (which are separate from the training data), were collected in October of 2009 (very recently).
Thanks Rand, just skim read the post earlier and missed it first time around.
Interesting stuff. Look forward to seeing more of this type of analysis.
Great post Rand! Much appreciated. I especially like the SEOwizz info, which seems to point to sitewide links not being an issue due to domain uniqueness not playing an important factor (based on the cursory results).
This is like a Thanksgiving feast of information. I am going to have to come back for seconds. Great post.
This is absolutely fantastic! It's such a treat to be able to rise above the usual guesswork and rumor milling with some actual hard data. This is going to give me something to mull over for a while now - great stuff!
Mind you, it's not the first of its kind. A while back, German SEO site Sistrix published some of their tool's data on OnPage factor/ranking correlations (the original's in German but I've linked to a Google-translated version). Some of their data there was really weird, especially the assertion that H2-H6 have a really high correlation with first page ranking while H1s had none at all. I always assumed this had to be a data blip, but this has become gospel in parts of the German SEO community.
Hence I'm really glad to see that your data doesn't back that finding up at all. Bit surprised that optimizing the H1 seems to be such a waste of time too though - this means a lot of guides to SEO are going to have to be rewritten. Oh well, at least this means I can stop pestering one of my clients to jiggle around the heading structure in their blog template...
Very interesting claims, with data to back them up. Some claims even go against common sense, like the H1 (H2-H4) tags. I've always thought of them like headlines in newspapers, highlighting the articles that follow them, which naturally contain your keywords. As always, great post.
VERY interesting.
Do you have any correlation data between keyword/keyphrase density and SERP ranking?
The "alt attributes are probably a pretty good place to put keywords" takeaway is stronger than I expected given previous posts here. Is this a departure from the previous stance that if a site has optimal alt attributes then it's probably doing a lot of SEO elements optimally? I can hear the pitter patter of folks running to check their image tags....
You bring up a good source of error.
In the charts at the start of the post, we don't do any controlling for other signals, so this is a big concern. In the later, model-based charts we do account for all of the other inputs we have, so it is less of a concern there. Still, inputs that Google has and we don't will get partially attributed in our models to the most correlated features we do have. So it is possible that there is something we don't know to look for, but that Google cares about, which is highly correlated with keyword usage in alt text. On the other hand, if this feature exists, what is it? The question isn't just hypothetical: if you have possible candidates, I'll try building features for them to see.
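To illustrate the attribution problem Ben describes, here's a minimal, synthetic sketch (the feature names, weights, and data are invented for illustration, not drawn from the Linkscape models): when a model is missing a signal the engine actually uses, a correlated feature it does have absorbs part of that signal's weight.

```python
# Synthetic demonstration of omitted-signal attribution: a feature correlated
# with a hidden signal picks up that signal's weight when the hidden signal
# is left out of the model. All names and numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

hidden = rng.normal(size=n)                          # a signal Google has but we don't
alt_text = 0.8 * hidden + 0.6 * rng.normal(size=n)   # our feature, correlated with it
links = rng.normal(size=n)                           # an independent feature we do have

# "True" ranking score: the engine weights the hidden signal, not alt text.
score = 2.0 * hidden + 1.0 * links + 0.1 * rng.normal(size=n)

def ols(X, y):
    # Ordinary least squares fit; returns one coefficient per column of X.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

with_hidden = ols(np.column_stack([hidden, alt_text, links]), score)
without_hidden = ols(np.column_stack([alt_text, links]), score)

print("alt-text weight when the hidden signal is modeled: %+.2f" % with_hidden[1])
print("alt-text weight when the hidden signal is missing: %+.2f" % without_hidden[0])
```

In the sketch, alt-text usage gets a weight near zero when the hidden signal is included and a substantially inflated one when it isn't, which is exactly why a strong model weight still isn't proof that the feature itself is what Google rewards.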
I would be very curious to see more analysis on the impact of links within a site on the ranking. It seems to me that internal links have a pretty powerful impact, but that you would need to really analyze the link structure, link keywords and even on-page position of the links to create a model with higher correlation.
I will have to go back and re-read the post, but I was not clear on whether you looked at the impact of internal links versus external.
"It seems to me that internal links have a pretty powerful impact.."
Agreed. I use a number of the tools at the pro level and don't get why I see some competitor pages blowing me away, except that they have more internal links pointing to the page with the targeted keyword/phrase. Some don't even have offsite backlinks anywhere in sight.
Perhaps it's the strength of the actual domain... and internal linking?
One example: DmR 6.98, DmT 7.15.
Awesome post, Rand & Ben. Linkscape is a great tool, and I'm glad to see it's getting closer to "Google".
I am trying to understand the usage of H tags.
On the H tags, you're suggesting not to put keywords in them, and that even if you do, it doesn't have any positive effect for SEO, right?
You are not suggesting we avoid them completely.
I think H tags are still important for site visitors, in terms of showing them their way around and telling them what the page is all about. I seem to get lost when pages don't have headings or clear navigation.
Can you clarify?
Thank you once again for a great post. Lots to take in here.
Yes - spot on. We're not saying "don't use them" we're just saying "don't expect to suddenly rank great in Google just because you use them" and "it appears Google doesn't care much if you do" and "there seem to be things that will help much more." :-)
Thanks for the reply, Rand. I appreciate that you are replying to so many comments on this post.
Are the links & positions for one website or one page?
The positions refer to the ranking position of an individual result which are always specific pages rather than broad domains (even a website's homepage is a "page" rather than a domain). However, in the link features, we certainly look at lots of things that relate to the root domain and subdomain in addition to page-specific metrics.
Nice takeaways.
This defeats Matt Cutts' preaching of "make good content and you'll win." It does, though, reinforce the old speaker's outline: tell the engines what you're going to tell them, tell them, and then tell them what you told them.
Slava
Amazing, beautiful, fantastic article.
Wow, more info that won't really help me rank better. Are you sure about the alt tags? Yeah, me neither.
With all due respect, this post demonstrates what a waste of time and effort most attempts to "figure it all out" are. More money is wasted on "SEO" than on any other commodity on the planet. Even earnest efforts like this one basically sum it all up by saying that there is no way to know what to do to improve rankings in Google. Most of the $ made in SEO is via "baffle 'em w/bs" while extracting as much money as you can, while you can, from the client.
You succeed on the 'net the same way you succeed on the street: offer something people want at a good price and provide great customer service. There is no "SEO" magic wand. (The closest thing is a direct match URL.) Write good content, have a site that is user-friendly, engage in social networking, and provide a useful product, and the rest will fall into place.
Obviously, I'm biased, but I have to strongly disagree. I've worked personally with so many great companies who have great products that are useful, unique, compelling and massively shared, yet it took SEO expertise (not always mine, certainly) to take their traffic 5-10X up.
I'd love to believe the same thing about the advertising industry (create a great product and people will buy it, right?) or the PR industry (do interesting things and the press will find you and write about you, right?) or the legal world (don't break the law and you'll be fine, right?) or the technology world (just use the hardware and software properly and you'll be fine, right?), but none of these are ever the case. Specialized knowledge, connections, insight, experience, research, and data make professionals in these arenas useful, desirable, and valuable. It seems naive to think that SEO would be any different.
I don't want to start an argument but I feel compelled to resoundingly disagree.
This kind of blanket statement just illustrates a lack of understanding of how search engines work (or at least a wilful misrepresentation) and I would invite you to read more of the content on SEOmoz to find out more about it.
@qualitycontent
Is your name Derek Powazek?
@qualitycontent:
Although I agree with you about the current 'cowboy rate' in the SEO industry (which is a very bad thing, IMO), and although I think everyone's entitled to his or her opinion, I think your comment here is the most naive one I've ever read about SEO in general, and about the true piece of art that Ben's post is in particular.
The exact thing this thesis (not just a post) aims for is to battle the 'thick thumb' parts of SEO and to provide useful quantitative insights into search engine metrics. Despite the lack of a definitive metrics/weights recipe for how Google generates SERPs, the correlation of the combined metrics provided here is growing towards perfection. And as in all professions, hobbies, et cetera: practice (research, in this case) makes perfect.
About your 'magic wand on SEO': how do you think Google actually prioritises? Do you think it's done manually? At random, in case 'Google doesn't know'? Are there maybe some corrupt internal Googlers who accept money from anyone to boost their rankings? NO! OF COURSE NOT! Google's algorithms are man-coded rules based on factors Google thinks are important. And by adding weights to each individual metric's score (including negative weights), an end result is produced for any given query: a SERP. And of course these factors and weights can be 'decompiled' by discrete simulation, such as SEOmoz/Ben has done here, as long as you have roughly the same data as Google has and enough disk space/memory and CPU power to do the math.
@Rand & Ben: excellent piece of work. Keep it up.
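To make the weighted-scoring idea in the comment above concrete, here's a toy sketch (the pages, features, and weights are entirely invented for illustration; they are not Google's actual factors or values):

```python
# Toy illustration of weighted scoring: each page's feature values get a
# (possibly negative) weight, the weighted sum is its score, and sorting
# by score produces the SERP. All names and numbers here are hypothetical.
pages = {
    "example.com/a": {"domain_links": 120, "anchor_match": 1.0, "spam_score": 0.1},
    "example.com/b": {"domain_links": 300, "anchor_match": 0.2, "spam_score": 0.0},
    "example.com/c": {"domain_links": 80,  "anchor_match": 0.9, "spam_score": 0.8},
}
weights = {"domain_links": 0.01, "anchor_match": 2.0, "spam_score": -3.0}

def score(features):
    # Weighted sum of feature values; negative weights penalize a page.
    return sum(weights[name] * value for name, value in features.items())

serp = sorted(pages, key=lambda url: score(pages[url]), reverse=True)
for position, url in enumerate(serp, start=1):
    print(position, url, round(score(pages[url]), 2))
```

"Decompiling" the weights, as the comment puts it, is then the inverse problem: given many observed SERPs and the feature values, fit the weights that best reproduce the orderings, which is essentially what the ranking models in this post do.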
As a usability specialist, I think your advice in the second paragraph represents the real BS in our industry. Define for me, in any succinct or applicable way what "good content", "user friendly", and "useful product" mean. People throw these useless truisms around as if they're actual advice, smile and nod at each other with smug satisfaction, and leave clients and business owners with absolutely nothing but frustration.
What Rand and Ben have provided here is hard, complex data. Putting it to work is tough, because building sites that rank is difficult. The same thing is true for usability and even content. There isn't a one-size-fits-all formula. Real usability takes a ton of analysis, a complex understanding of your data, and the willingness to drop all of your assumptions and dig deep into the minds of your audience. By your logic, if I can't reduce SEO or usability down to a few bullet points, it must be nonsense. I can't explain neuroanatomy, high-energy physics, or microbiology in a Top 10 list, either, so I guess those are all crap, too.
I dunno gang. This "qualitycontent" comment reads more like a troll looking to be contentious.