Those of you who've been running Linkscape reports this weekend may have noticed our new, fresher data in the index. This is our second update (the first was on Dec. 8th of 2008), and it brings a refresh of much of the crawl we originally launched with in October. This update is seriously good news for us, as it's taken far less time, energy and frustration than our first update, and it's good news for our members (and anyone who uses the free data), as many of the web's changes in Q4 2008 are now included in the index. We're working towards updates every 4-5 weeks, but it may take several more updates before we reach that level of freshness.

Some important & interesting statistics from this crawl:

  • URLs: 36,330,454,654
  • Subdomains: 225,767,675
  • Root Domains: 46,952,859
  • Links: 410,360,303,763

Distribution of External Links to a Page (from this index):

  • 50%: 0
  • 95%: 1
  • 99%: 8
  • 99.5%: 18
  • 99.9%: 108
  • 99.99%: 1,854
  • 99.9999%: 104,828

Distribution of Linking Domains to a Domain:

  • 50%: 0
  • 95%: 1
  • 99%: 4
  • 99.5%: 8
  • 99.9%: 26
  • 99.99%: 145
  • 99.9999%: 9,713

Distribution of External Links to a Root Domain:

  • 50%: 2
  • 95%: 80
  • 99%: 558
  • 99.5%: 1,266
  • 99.9%: 9,138
  • 99.99%: 94,999
  • 99.9999%: 12,286,546

It's crazy to realize that if your page has a little over 100 external links pointing to it, you're in the top 0.1% of all pages on the web. Likewise, if you have just 25 or so external domains linking to your site, you're in the top 0.1% of all websites. Across the entire web, the vast majority of websites never get 100 links pointing to them!
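For anyone who wants to check where a given page or site falls, here is a minimal Python sketch of a lookup against the percentile lists above. It assumes the first list describes external links to a page (which is how the paragraph above reads it); the function and variable names are made up for illustration and are not part of Linkscape.

```python
# Percentile thresholds copied from the lists above: (percentile, count).
# Assumption: the first list is external links pointing to a single page.
EXTERNAL_LINKS_TO_PAGE = [
    (50.0, 0), (95.0, 1), (99.0, 8), (99.5, 18),
    (99.9, 108), (99.99, 1854), (99.9999, 104828),
]
LINKING_DOMAINS_TO_DOMAIN = [
    (50.0, 0), (95.0, 1), (99.0, 4), (99.5, 8),
    (99.9, 26), (99.99, 145), (99.9999, 9713),
]


def highest_percentile_reached(count, table):
    """Return the highest published percentile whose threshold `count` meets.

    Returns None if `count` is below every threshold in the table.
    """
    reached = None
    for percentile, threshold in table:
        if count >= threshold:
            reached = percentile
    return reached


if __name__ == "__main__":
    for links in (0, 18, 100, 108):
        print(links, highest_percentile_reached(links, EXTERNAL_LINKS_TO_PAGE))
```

Because only a handful of percentiles were published, the lookup can only say which published threshold a count has reached, not an exact percentile in between.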

My takeaway? The Internet is really, really, really big. Also, remember that Linkscape's index is probably about 1/2 to 1/3 the size of major indices like Google & Yahoo!, so these numbers hold only for metrics from our index, not for all data sources. We also like to keep our index domain diverse, so Linkscape shows a lot of links between domains but is still shallow for large sites and deep internal pages (which could bias these numbers).

Sharp readers will note that our number of pages has actually decreased slightly since the last update - since this was a "refresh" update, that was anticipated. We found that about 10% of the pages we had previously crawled now return some type of error code - 404s, 500s, etc. Thus, if you've always wondered how much of the web decays over the course of 3 months, this might be a pretty good statistic to scratch that itch.

In terms of upgrades to the interface and the Linkscape tool, you've probably noticed the new basic report format:

New Basic Linkscape Report

And the new data detail tab (which, if you're a serious SEO or want the juicy KPIs from the tool, is the place to look):

Linkscape Data Detail Tab

If you've got feedback on either of these, please do leave a comment or send email to Adam _at_ SEOmoz.org. We're also looking at upgrading the basic comparison reports and the advanced reports in the near future, so ideas are always welcome on those, too.

And now on to something that I'm very excited to announce (and which I think you'll find fascinating)... Our top 250 domains list! Nick pulled out the top-ranked root domains (by Domain mozRank) along with some stats about them, and we've created a page on SEOmoz that we'll update with each index, showing some of the web's most powerful/important domains from a link perspective. Remember, Domain mozRank is like Google's PageRank - it's an iterative, Markov-chain algorithm, so this isn't just raw link counts; it's also based on the importance of who links to you.

Top 250 Domains via Linkscape
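To give a feel for what "iterative, Markov-chain algorithm" means, here is a small Python sketch of a textbook PageRank-style power iteration over a toy domain graph. This is not the actual Domain mozRank computation (which isn't described in this post); the damping factor of 0.85, the iteration count, the function name and the example domains are all assumptions chosen for illustration.

```python
def rank_domains(links, damping=0.85, iterations=50):
    """Textbook PageRank-style power iteration.

    `links` maps each source domain to the list of domains it links to.
    Returns a dict of domain -> score; the scores sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with every domain equally important.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Score held by domains with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each domain passes its score, split evenly, to the domains it links to.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-domain web.
    toy_graph = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example"],
        "c.example": ["a.example"],
        "d.example": ["c.example"],
    }
    scores = rank_domains(toy_graph)
    for domain in sorted(scores, key=scores.get, reverse=True):
        print(domain, round(scores[domain], 4))
```

In this toy graph, a.example has only one inbound link, yet it ends up scoring well above b.example because that single link comes from the highest-scoring domain. That's the "importance of who links to you" effect in miniature.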

Some of the most interesting finds for me:

  • Macromedia and Adobe are individually both at the top of the list; Adobe most probably for the PDF download links and Macromedia for the Flash player. Note that we don't wipe out domains via 301 like the search engines do, so we're actually seeing and reporting the links to domains that redirect (and then noting those redirects in the tool and passing the link juice on).
  • Miibeian.gov.cn is, I believe, a government website in China that many sites operating in the country are required to link to - hence the high rankings.
  • Sedoparking.com was one of the big outliers. Looking through their links, though, they clearly have a ton of domains (many of them important or once-important) linking in, even if the engines probably are not passing link weight.
  • Statcounter - I always knew they had a lot of links, but was surprised to see how many important domains still use their simple widget for counting traffic (and then providing that link back). It's a testament to being an early player on the web with a simple, useful product.
  • Blogspot and Wordpress are no surprise - since this is looking at root domains (aka *.domain.com), every link to every subdomain on those counts toward the root domain in the domain-level link graph.
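The root-domain roll-up mentioned in the last bullet can be sketched in a few lines of Python. This is a deliberately naive illustration, not how Linkscape does it: it assumes the root domain is simply the last two labels of the hostname, which is wrong for suffixes like .co.uk, and the function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def naive_root_domain(url):
    """Return the last two hostname labels, e.g. 'blogspot.com'.

    Naive assumption: ignores multi-part suffixes such as '.co.uk'.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


def links_per_root_domain(target_urls):
    """Count link targets per (naively computed) root domain."""
    return Counter(naive_root_domain(url) for url in target_urls)


if __name__ == "__main__":
    # Hypothetical link targets on different subdomains.
    targets = [
        "http://alice.blogspot.com/post-1",
        "http://bob.blogspot.com/",
        "http://www.wordpress.com/about",
        "http://blog.wordpress.com/",
    ]
    print(links_per_root_domain(targets))
```

Every subdomain's links land in the same bucket, which is why hosts with millions of user subdomains float to the top of a root-domain list.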

Digging through the full top 250 is certainly interesting, both as a web data aficionado and as an SEO. Seeing how some of the top domains have earned their links and realizing the strategies and brands that rise to the top could probably inspire another dozen or two blog posts. In the future, I'd like to try to do this for subdomains and pages as well - to see what specific content is garnering the web's links. Hopefully we can have that out in our next index update (or the one following).

To wrap things up, I just want to say publicly how proud I am of the work Nick & Ben have done, along with the contributions from Jeff and the entire dev and front-end team. While Linkscape is still in beta, and probably has a few more months at that status, the depth and breadth of the project, the engineering brilliance and the value it's brought to SEO still amaze me. In fact, Jeff and I just got this email today to help remind us that the long nights and weekends at the computer are worth it:

Hey you guys, I'm finally using Linkscape for one of my client projects, three months on :)  First time I have used it since the day after you launched; just been too busy with design projects & other stuff.
 
It just totally rocks.  I continue to be blown away by how easy it is to use, and how helpful the data is.  Particularly the competitive data and all of the sorting features.
 
I couldn't help myself from emailing you about it :)

Now it's back to work for the next updates, which should bring both fresher data and a much larger index as well.

p.s. We love feedback, so please do share!