Last week I gave the keynote presentation at SMX Munich, Lessons Learned Building an Index of the WWW. In that presentation, I shared a great deal of data from our web index as well as some SEO tips based on our experience replicating many search engine activities (crawling, indexing, building a link graph, de-duplication, canonicalization, etc.). In this blog post, I'd like, first, to announce that Linkscape's new index, with crawl data from late March to early April (& upon which these data points are calculated), is now live - check it out here - and second, to share the charts, graphs and tips from my presentation.
The Linkscape Index
First off, some basic points about Linkscape's index:
- The crawl is intended to imitate what the major search engines crawl and keep in their indices. In talking to lots of folks from the engines who do this work, we've heard that while tens or hundreds of billions of pages are crawled, there are only "~5-10 billion pages worth keeping in a main index."
- Linkscape is a crawler-built index, meaning it uses a seed set and crawls outward via links to discover new URLs.
- The index currently biases towards pages with external links, meaning we don't crawl as deeply as the major engines do, but we try to crawl very broadly (to reach as many well-connected pages and unique domains as possible).
- The crawlers and data sources we currently employ all respect robots.txt.
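For those curious what that robots.txt check looks like in practice, here's a minimal sketch using Python's standard library. Note that "linkscape-bot" is a placeholder user-agent string for illustration only, not our crawler's actual UA:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def can_crawl(url: str, user_agent: str = "linkscape-bot") -> bool:
    """Check a site's robots.txt before fetching, as a polite crawler
    would. The user-agent string is a placeholder, not Linkscape's."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

print(can_crawl("https://www.seomoz.org/blog"))
```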
The Web's Structure
As we crawl, we see some well-known structural pieces making up the web:
Linkscape, like numerous academic sources (and, almost certainly, the major search engines), collects and stores data about three types of structural components - pages, subdomains and root domains. Link & content metrics, along with crawl parameters and query-independent ranking factors, are stored for each of these.
Linkscape also sees a view of the web that most IR students will be familiar with:
As others have noted in the past, the web's link structure tends to look a bit like a bowtie, with a large number of tightly linked, well-connected pages in the center and outliers on the borders with few incoming/outgoing links. Linkscape does a relatively good job with the center and with the linked-to edge (pages with few or no outbound links), but struggles more with pages that have no incoming links (as these are difficult to discover and often not worth keeping in an index).
Index Statistics
We've found these data points fascinating, and I'm excited to be able to share many of them for the first time. While Linkscape is not as comprehensive as Yahoo! or Google, it's far closer to a full representation of the web than a small sample. Our latest index update currently contains:
- 44,410,893,857 (44 Billion) pages
- 230,211,915 (230 Million) subdomains
- 54,712,427 (54 Million) root domains
- 474,779,069,489 (~475 Billion) links
For this index, the following data pieces apply:
* Note that the link distribution chart refers to "external, juice-passing links," which excludes links from a subdomain to itself, links on pages with a meta robots nofollow, and links that employ rel=nofollow.
* Note that the root domains linking chart refers only to pages/sites receiving links from unique root domains. For example, www.seomoz.org receives only one "linking root domain" from searchengineland.com, even though that site links to ours from many unique pages. Likewise with the links we receive from About.com and its numerous subdomains - in total, they count as just one unique root domain (see the counting sketch after these notes).
* Not surprisingly, most links on the web are incestuous to some degree: they come from internal links (pages on the same subdomain as the target), the same IP address (where multiple sites from the same owner are hosted), the same root domain, or the same C-block of IP addresses. If we can see these relationships with Linkscape, it follows that the search engines have an easy time of it as well - and these links are almost certainly not passing the same kind of value that external links from unique root domains, IP addresses and C-blocks would.
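To make the counting rule above concrete, here's a simplified sketch of how unique linking root domains might be tallied. The last-two-labels heuristic is an illustrative shortcut; a production system would use the Public Suffix List (e.g. via the tldextract package) to find the true registrable domain:

```python
from urllib.parse import urlsplit

def root_domain(url: str) -> str:
    """Naive registrable-domain extraction: keep the last two labels.
    This mishandles ccTLDs like .co.uk; real systems consult the
    Public Suffix List instead."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def count_linking_root_domains(link_sources: list[str]) -> int:
    """Each linking root domain counts once, no matter how many of its
    pages or subdomains carry a link to the target."""
    return len({root_domain(url) for url in link_sources})

inbound = [
    "https://searchengineland.com/some-post",     # many pages...
    "https://searchengineland.com/another-post",  # ...same root domain
    "https://experts.about.com/page",             # two subdomains...
    "https://gardening.about.com/page",           # ...same root domain
]
print(count_linking_root_domains(inbound))  # -> 2
```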
Some interesting data points on the above:
- 2.7% of all links on the web are nofollowed
- 73% of those are internal (so nofollow is actually far more popular as a link sculpting tool than a spam prevention device)
- 3 billion out of our 475 billion links (~0.6%) were found in noscript tags - while the engines recommend against this and talk about it as a spam tactic, we suspect that many of these are, in fact, legitimate uses and probably do get counted (due to their value in content discovery).
- 165,638,731 links (0.034%) aren't visible on the page (they're hidden off-screen using CSS or other tactics). Again, given the numbers, we wonder whether all of these are spam and whether they're all discounted by the engines (a rough detection heuristic is sketched after this list).
- This is our first index supporting the canonical URL tag, and so far we've seen just north of 16 million pages employing the parameter. While this is still a drop in the bucket on a global web scale, we'll be watching closely for how much support it generates over the months to come.
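On the hidden-link stat above: as a rough illustration of how off-screen links could be detected - this is a speculative sketch, not a description of Linkscape's actual pipeline - the code below flags anchors whose inline CSS hides them or pushes them far off screen. It assumes the third-party beautifulsoup4 package, inspects only inline styles, and ignores external stylesheets and computed layout:

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Inline-style patterns commonly used to hide links from visitors.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none"),
    re.compile(r"visibility\s*:\s*hidden"),
    re.compile(r"text-indent\s*:\s*-\d{3,}"),        # e.g. -9999px
    re.compile(r"(margin|left|top)\s*:\s*-\d{3,}"),  # shoved off screen
]

def hidden_links(html: str) -> list[str]:
    """Return hrefs of anchors whose inline style suggests hiding."""
    soup = BeautifulSoup(html, "html.parser")
    flagged = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").lower()
        if any(p.search(style) for p in HIDING_PATTERNS):
            flagged.append(a["href"])
    return flagged

html = '<a href="/ok">fine</a><a href="/spam" style="text-indent:-9999px">x</a>'
print(hidden_links(html))  # -> ['/spam']
```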
Search Engine & Linkscape Metrics
Like the search engines, we calculate a number of metrics on the pages, subdomains and root domains in our index to help uncover spam and sort by popularity & trustworthiness. The following are distributions of the metrics we currently employ:
* mozRank is our calculation of raw link popularity. Like Google's PageRank, Yahoo!'s WebRank and Live's StaticRank, it's a recursive algorithm that counts links as votes and treats links from more popular pages as more important (a toy sketch of this recursion appears after this list). We've found that while it's useful for discovering which pages to crawl and index, it's a poor measure of true importance and carries significant noise.
* Domain mozRank is calculated in the same fashion as page-level mozRank, but on the domain-level link graph. Thus, it only takes into account unique links that exist from one root domain to another and is agnostic as to whether a site has 1, 100 or 1,000 links to another. We've found this metric exceptionally valuable for identifying the popularity and importance of a root domain - on the subdomain link graph, it's more susceptible to manipulation and spam.
* mozTrust, which we also calculate on both the domain and page level link graphs, has proven highly effective as a spam identifier (particularly in combination with mozRank - the difference between the two is an excellent predictor of manipulative linking). mozTrust relies on the same intuition as Yahoo!'s TrustRank, running a recursive algorithm that passes juice down from trusted seed URLs/domains.
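For readers who want to see the "links as votes" recursion in miniature, here's a toy power-iteration sketch. With no seed set it behaves like textbook PageRank (the family mozRank belongs to); with a trusted seed set, the teleport probability flows only to the seeds, which is the TrustRank intuition behind mozTrust. This is a classroom illustration with made-up parameters, not our actual implementation:

```python
def link_rank(graph, seeds=None, damping=0.85, iters=50):
    """graph: {node: [nodes it links to]}. If `seeds` is given,
    teleportation flows only to the seed nodes (TrustRank-style);
    otherwise it is spread uniformly (PageRank-style)."""
    nodes = list(graph)
    n = len(nodes)
    base = ({u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
            if seeds else {u: 1.0 / n for u in nodes})
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - damping) * base[u] for u in nodes}
        for u, outs in graph.items():
            if not outs:  # dangling page: redistribute via the base vector
                for v in nodes:
                    nxt[v] += damping * rank[u] * base[v]
            else:
                share = damping * rank[u] / len(outs)
                for v in outs:  # each link passes an equal share of rank
                    nxt[v] += share
        rank = nxt
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
popularity = link_rank(g)          # mozRank-like raw popularity
trust = link_rank(g, seeds={"a"})  # mozTrust-like, "a" as trusted seed
```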
Measuring Correlation
Possibly the most interesting data I shared from an SEO application standpoint was around our research into the correlation of individual metrics to search engine rankings. Our own Ben Hendrickson has been doing significant data gathering and analysis, trying to answer the question,
How well does any single metric predict higher rankings?
His early results are enlightening:
In this chart, Ben's showing that no metric is particularly good at predicting rankings by itself, but if you had to use something, the number of root domains linking to a URL and that URL's mozRank are both just above the 95% confidence interval. Note that such classic SEO metrics as Yahoo! link counts and Alexa.com counts (which are included in many toolbars and appear in many SEO reports) are very nearly worthless.
The results are much better (though still not excellent) when we instead ask what metrics correlate with ranking 10 positions higher (essentially, what's the difference between page 1 vs. 2, 2 vs. 3, etc.). Here, Ben shows that while only a single metric is above the 95% confidence interval (domains linking to a URL), there are several that are 20%+ better than random guessing.
Perhaps the most surprising result of this (for me, at least) was the data showing that Google's link counts actually do have a correlation with rankings, suggesting that they're not completely random (even though they might feel that way given their small sample size).
Out of all the metrics, it's little surprise that # of linking root domains is a favorite (we use it, for example, to sort our Top 500 list). It's one of the most difficult metrics to manipulate effectively and has high correlation with trust, importance and search engine rankings.
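To give a flavor of this kind of analysis, here's a hedged sketch computing the Spearman rank correlation between a single metric and SERP position for one query. The numbers are invented, and this is not Ben's actual methodology (the post doesn't detail how those confidence intervals were derived); it assumes the third-party scipy package:

```python
from scipy.stats import spearmanr  # pip install scipy

# Hypothetical top-10 results for one query: position in the SERP and
# the number of unique root domains linking to each result.
serp_positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
linking_root_domains = [950, 720, 400, 410, 180, 220, 90, 60, 75, 12]

# Negate positions so "ranks higher" and "more domains" point the same way.
rho, p_value = spearmanr([-p for p in serp_positions], linking_root_domains)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Averaged across many queries and compared against an appropriate baseline, a metric's correlation would indicate how well it predicts rankings on its own.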
Top Tips for SEOs
Based on the work we do crawling and building an index, and the struggles we've encountered (and seen the engines similarly encounter), we've crafted a few short tips. While some of these are obvious and well known, it still pays to keep them in mind as high-level recommendations we feel confident the search engines would support:
- Don't rely on the search engine to canonicalize anything for you.
- Focus on link acquisition from a diverse set of root domains, not necessarily high-PageRank pages or those with high link counts.
- Make smart, usable, short URLs. They're far easier to process and have a much better correlation with useful, unique content an engine would want to keep in its index.
- If you want to earn lots of links, building a distributed content widget/badge/link that users embed in their sites/pages is an incredibly effective strategy. Just look at how many of the top pages on the web achieved that position employing this strategy.
- Don't rely on PageRank or raw link counts as accurate assessments of ranking potential. According to our data, neither is a high-signal metric, and neither correlates strongly with rankings.
- The social web is rising, as are those employing it effectively (again, check out the top sites list for evidence).
- Don't be afraid to use nofollow internally as it's clearly not an outlier on the web. However, do be cautious with its use - you can seriously screw things up if you make mistakes on that front.
- Keep content on a single subdomain and root domain wherever possible. The metrics of that domain will go a long way toward making that content visible and ranking-worthy.
- Avoid doing "strange" things from a technical and link acquisition perspective. The former makes you harder to crawl, process and index while the latter makes you stand out as possible spam/manipulation.
We hope you enjoy this data - please feel free to share - and enjoy using the new Linkscape index. Again, I'd like to give my congratulations and thanks to both Ben & Nick, who've done a tremendous job with Linkscape. If you have questions, please leave them in the comments and they should be able to provide answers and direction.
p.s. For those keeping track, this index update was almost exactly a month from our last one, and our goal is to maintain approximately 3-4 week intervals between updates for the foreseeable future. We're also doing a lot to improve the quality and focus of our index to capture more good stuff and deep stuff on mid-size and large domains (and less spam). We'd appreciate it if those of you who are producing lots of spam would help us out by ceasing to earn links from trustworthy, respectable sites and pages - thanks! :-)
"Don't be afraid to use nofollow internally as it's clearly not an outlier on the web. However, do be cautious with its use - you can seriously screw things up if you make mistakes on that front."
How about a future post or WBF about some of the things you can screw up by using nofollow internally?
I second that request. I'm about to start working on some internal link sculpting on a number of my sites and would love to see a Whiteboard Friday that covers best practices and the pitfalls to avoid.
Yes, I concur with Whitespark and BradleyT on getting a WBF written to outline nofollow uses and how you can screw things up. Definitely helpful to hear and see examples of this.
You got it! I'll do the WB Friday on this subject tomorrow morning :-)
WOW. That's some yummy looking data you all have over there! I would dive into this right now and soak it all in like a sponge, but after working on vector calculus for two hours... I think I'll have to read this all over again tomorrow.
Just wanted to say thanks for making all of this data available!
Great job you guys.
Wow, thanks Rand; great data - and thanks for presenting it in a way that we might stand a chance of being able to explain some of it to less technically-minded people.
The 'number of linking domains' metric is interesting; it supports the advice in SEOmoz's Pro Linkbuilding guide that getting links from a broad number of domains is valuable, because links from 10 domains probably involved 10 editorial decisions, while 10 links from 1 domain are unlikely to have.
Also great to have data on how nofollow is not being used for its original purpose.
Fascinating stuff.
Just a quick note to add, as per the message to @rand on Twitter: how will Geocities.com closing down affect the number of websites?
Stellar question!! :)
Rand/Nick, anyone got an answer? [Other than that some guys are trying to "save" Geocities]
I see some pages from Geocities still in the SERPs, and taking it down will put a decent dent in the Web. I wonder how big, etc. :)
I may be a little strange, but Rand, would you analyze the long tail of each data graph?
Maybe find some commonalities of each?
I ask because websites with high PR (7, 8, 9 - and visibility, for that matter) tend not to be a part of the trends of the web. For example, WSJ.com will rarely link to what they are talking about, but typical blogs would.
Like the rare nature of high PR sites. :D
Either way. Great data. Keep up the fantastic work! :D
I checked out the 500 list and I like seeing that Greenpeace has a higher link count than the US Dept of Justice! You know the DOJ gets .gov links...
Great job Greenpeace.
Sensational information Rand & SEOmoz team.
I too would like to find out more information about the long tail as somebody else previously referenced. Common characteristics of high PR sites is probably fairly obvious, however you may just discover some gems.
I've started advising clients that we have to drop any reference to Alexa. On a recent project, a site showed an Alexa ranking double mine, but once I reviewed the web analytics data, it turned out that site got 20 times the traffic of mine.
Typically, as an Alexa number gets higher (further away from #1), it shouldn't mean such a massive difference in traffic.
So whether my site should be getting less traffic or Alexa just sucks, I think your score of 1% for relevance is perfect.
Wow you guys have been busy.
Awesome data - thanks for sharing it.
It's tough to evaluate links in Linkscape using mozRank when you realize that some obviously-penalized pages are passing a great deal of mozRank. Any chance we might see a "possibly penalized" indicator in Linkscape reports?
Following that, is there any likelihood you'll alter your mozRank calculation to adjust for spam signals?
Great keynote. Great event. Lesson learned :)
Again and again. I love your tools.
It was nice meeting you again. Looking forward to the next event in Seattle in the summer.
Stay tuned for more happy days ...
Thanks for the post Rand! As a numbers geek I love seeing this hard data.
Perhaps the only thing I hear from clients as much as "meta tags" is "PageRank". I love having good, hard numbers to use to say "there is MUCH more to it" effectively.
Big THUMBS UP from me; this project is truly one of the most interesting experiments I've seen in a long time.
One thing that would make a great educational tool would be the ability to search your index using your ranking algorithms, to test the actual Google ranking projections that you are putting forward.
This would probably be the single most important tool for all SEOs out there in the professional space, as it would help us to test our theories (and yours) properly using this "sample" data that you are accruing.
It would also be a proactive way of identifying pages that have had a Google penalty, as your algo wouldn't cater for this in the same way, so we would be able to identify pages that, as you put it in the whiteboard session on Friday, are "over-optimized".
Anyway, a big thanks guys - really interested in seeing where this goes over the coming months!
Excellent data Rand. Thanks for sharing. One question based on the note below:
"* mozTrust, which we also calculate on both the domain and page level link graphs, has proven highly effective as a spam identifier (particularly in combination with mozRank - the difference between the two is an excellent predictor of manipulative linking)."
How much of a difference between mozTrust and mozRank would you look for to raise the link manipulation alarm?
We usually think a difference of more than 1 point is something that bears investigation. I'm not saying it's spam, but you might want to check out the links at that point, as this probably means there are a lot of links that really aren't contributing much value.
As Nick said, a full point difference often bears investigation, and a 2-3 point difference usually means something's fishy. This works with PageRank vs. mozRank, too. If the two numbers are off by more than 2-2.5, we see a high correlation with sites/pages that have been penalized (when PageRank is lower).
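To make these rules of thumb concrete, here's a trivial sketch of the check Nick and Rand describe above. The 1-point and 2-point thresholds are their stated heuristics, not hard rules:

```python
def link_profile_flag(mozrank: float, moztrust: float) -> str:
    """Flag pages whose raw popularity (mozRank) outruns their
    trust (mozTrust) - per the heuristics quoted above."""
    gap = mozrank - moztrust
    if gap > 2.0:
        return "fishy: likely manipulative linking"
    if gap > 1.0:
        return "worth investigating"
    return "looks normal"

print(link_profile_flag(mozrank=6.2, moztrust=3.9))  # -> fishy: ...
```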
I've been reading here for years.
This is possibly the best post I've ever read here.
I'm a big advocate of measuring and proving everything, so this kind of data is priceless.
Posts like this make statistics nerds happy! Thanks much!
Rand, an honest question on Linkscape: you say it's crawl-based and I see you have some hard data on most of my sites in Linkscape.
What's the name of the user agent (UA) your bot uses for crawling? I can't find a single mention of it in my access logs.
Thanks in advance for your answer!
Without any doubt, this is the most valuable post on SEOmoz so far. If there were a vote for the best SEOmoz post of 2009, I would definitely nominate this as the winner.
"Building a distributed content widget/badge/link that users embed in their sites/pages is an incredibly effective strategy" - this is done well with WordPress themes. And they work.
The correlation you found with Google backlinks is illuminating.
However, the question is how you counted them - was that just via the standard link: operator (which is known to strongly understate the number of links), or are these stats based on the number of backlinks from Google Webmaster accounts?
We used the link operator.
The important point about the link: operator, which does not reflect the correct absolute link count, is that it does reflect a good sample, at least insofar as rankings go. So if you've got two sites and one has a substantially lower count from "link:" than the other, it's likely the second will have a better chance of ranking.
Does that make sense?
Wow, I can't believe that there are only 23 comments on THIS post so far. I don't think I have ever gotten to a post with fewer than 40 within the first couple of hours! :-)
Love this kind of stuff, Rand - thanks for sharing it. I also love that it backs every theory I have been preaching for the past year! ;-) It's reassuring that there is hard data to back theories up. Although it's hard to get completely conclusive data in our industry, it's nice to take the data for what it is.
Great comment Migz!
-CWalk
I love the hard data associated with the post. One of my big take-aways was that Google seems more concerned with the number of linking domains, rather than the quality of the links. So my general strategy for link building shifts from pursuing a few high quality links to pursuing a vast variety of mediocre links from a large number of domains.
My question is whether Google counts domains not stored in its index as unique domains.
Personally, I would imagine that if Google isn't aware of a domain, or has banned it from the index, links from it would not count as a unique domain.
Does that sound right?
I would not uniformly go after many mediocre links. I think you need balance here. If you've already got a sizable inventory of highly authoritative links, it's important to diversify with some broader reach. But if you don't have any highly authoritative links, you should pull in some before going after more poor links.
Linkscape continues to impress. I cannot express how much value the Pro membership has added to my SEO efforts. Great work guys!
Spot of luck that the number of domains linking to a site is coming through as a major factor, as that's a principle I've used as the basis of my link building over the last couple of years.
Glad to see it confirmed on SEOmoz...
Thanks for sharing the info Rand. It's a great boon to be able to miss events like SMX Munich yet still be able to get a lot of the data presented.
Wow. Great post.
Very interesting (and surprising) to see that Google's link counts correlate with rankings.
Ok, I'm off to get some links from some more diverse domains now...
Excellent information! I can't wait for more stuff like this.
Wow, that's so many facts.
Thanks for making this public. Greatly appreciated.
What a great post. Thanks for sharing all of this data! You guys are doing some amazing work at the MOZ!
Not only fascinating data, but great analysis too. It's great to see where Linkscape is heading and what sort of data can substantiate best practices in SEO.
Not to mention, the less obvious discoveries can open one's eyes to the potential of natural search from a search engineer's perspective.
Perhaps additional data to add to the index would be a scale between non-optimized, semi-optimized and over-optimized pages/domains...
Rand -
Wow, that's some pretty eye-opening information there. I almost have trouble believing some of it, like trusting Google to reveal decent backlinks - but it's hard to argue when there are stats to back it up.
Question though - how did you guys determine which sites are hiding links off-screen? Was there a certain margin-left or absolute position number that you looked for to come up with that number? Just curious...
amazing post, thanks for sharing!
I don't think the data here shows that Google reveals good backlinks. Instead, what we're saying is that the count Google reveals is a consistent sampling of the overall count. So if you take two pages and look at their link counts from Google, it turns out that (relatively speaking) that's a good thing to compare when thinking about search engine performance.
But the actual links that they show you are probably not a good set of the most powerful links for the site.
Rand! Great post. I really found the data to be very helpful and very interesting!
Hi Rand,
Thanks for tips,
But one thing I wonder about: would we get a penalty for having many backlinks from the same site via (for example) profile pages?
What if NYTimes links to my site very often?
--Vusal