Since the launch of Open Site Explorer and our API update, Chas, Ben, and I have invested a lot of time and energy into improving the freshness and completeness of Linkscape's data.  I'm pleased to announce that we've updated the Linkscape index with crawl data that's between two and five weeks old—the freshest it's ever been.  We've also changed how we select pages, in order to get deeper coverage on important domains and waste less time on prolific but unimportant ones.

You may recall Rand's recent post about prioritizing the best pages to crawl, and mine about churn in the web.  We've applied some of the principles from those posts to our own crawling and indexing.  Rand discussed how crawlers might discover good content on a domain by selecting well-linked-to entry points.

In the past, we selected pages to crawl based purely on mozRank.  That turned out to favor some unsavory elements (you know who you are :P).  Now, we look at each domain and determine how authoritative it is.  From there, we select pages using the principle Rand described: highly linked-to pages—the homepage, category pages, important pieces of deep content—link to other important pages we should crawl.  From intuition and experience, we believe this comes much closer to crawling the way a search engine would.
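To make that concrete, here's a minimal sketch of authority-first page selection.  It assumes you already have a per-domain authority score and per-page in-link counts; the names (`domain_authority`, `inlinks`, `outlinks`) are illustrative, not Linkscape's actual internals:

```python
import heapq
from urllib.parse import urlparse

def select_pages(seeds, domain_authority, inlinks, outlinks, budget):
    """Pick up to `budget` pages, best-linked-to entry points first."""
    def priority(url):
        # Rank by the domain's authority, then by how well linked-to the
        # page itself is.  Scores are negated because heapq is a min-heap
        # and we want the highest-scored page popped first.
        domain = urlparse(url).netloc
        return (-domain_authority.get(domain, 0.0), -inlinks.get(url, 0))

    frontier = [(priority(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen, chosen = set(seeds), []
    while frontier and len(chosen) < budget:
        _, url = heapq.heappop(frontier)
        chosen.append(url)
        # Highly linked-to pages point us at the domain's other
        # important pages, so their out-links join the frontier.
        for nxt in outlinks.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (priority(nxt), nxt))
    return chosen
```

The point is simply that a priority queue keyed on authority pulls in a domain's important pages first, so the crawl budget isn't burned on a prolific domain's long tail.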

In a past post, I discussed the importance of fresh data.  After all, if 25% of pages on the web disappear after one month, data collected two or more months ago just isn't actionable.
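To put rough numbers on that, here's what a steady 25%-per-month disappearance rate implies if you let it compound (treating churn as geometric is a simplifying assumption on my part, not a claim from the data):

```python
# Survival of crawled pages under an assumed constant 25%/month churn.
for months in range(1, 5):
    alive = 0.75 ** months
    print(f"after {months} month(s): ~{alive:.0%} of pages still exist")
```

By the two-month mark, barely more than half of what you crawled is still live, which is exactly why older data stops being actionable.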

From now on, we're focusing on keeping the index within that one-month window.  Before most of our data crosses the two-month mark (the point where much of it is out of date), we should have an index update for you.  If and when we show you historical data, we'll mark it as such.

What this means for you is that all of our tools powered by Linkscape, Open Site Explorer among them, will provide fresher, more relevant data, and we'll have better coverage than ever.  The same goes for products and tools built outside SEOmoz on either the free or paid API, and there are plenty of those.  In fact, you could build one too!
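If you're tempted, here's a minimal sketch of fetching link metrics for a URL.  The endpoint, parameter names, and request-signing scheme below are assumptions on my part, so treat the developer documentation as the source of truth; the credentials are obviously placeholders:

```python
import base64, hashlib, hmac, time, urllib.parse, urllib.request

ACCESS_ID = "member-xxxxxxxx"   # placeholder credentials
SECRET_KEY = "your-secret-key"

def url_metrics(target_url):
    # Assumed signing scheme: base64(HMAC-SHA1 over "AccessID\nExpires").
    expires = int(time.time()) + 300
    signature = base64.b64encode(
        hmac.new(SECRET_KEY.encode(),
                 f"{ACCESS_ID}\n{expires}".encode(),
                 hashlib.sha1).digest()).decode()
    query = urllib.parse.urlencode(
        {"AccessID": ACCESS_ID, "Expires": expires, "Signature": signature})
    endpoint = ("http://lsapi.seomoz.com/linkscape/url-metrics/"
                + urllib.parse.quote(target_url, safe="") + "?" + query)
    with urllib.request.urlopen(endpoint) as response:
        return response.read()  # JSON-encoded metrics for the URL

print(url_metrics("www.seomoz.org"))
```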

Because I know how much everyone likes numbers, here are some stats from our latest index:
  • URLs: 43,813,674,337
  • Subdomains: 251,428,688
  • Root Domains: 69,881,887
  • Links: 9,204,328,536,611
Our last index update was on January 17th, and you might recall some bigger numbers then.  Because of the changes to our crawl selection, our latest index should exclude a lot of duplicate content, spam pages, link farms, and spider traps while keeping the high-quality content.
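For the ratio-minded, a little arithmetic on those figures (nothing new here, just the stats above):

```python
urls         = 43_813_674_337
subdomains   = 251_428_688
root_domains = 69_881_887
links        = 9_204_328_536_611

print(f"links per URL:        ~{links / urls:.0f}")              # ~210
print(f"URLs per root domain: ~{urls / root_domains:.0f}")       # ~627
print(f"subdomains per root:  ~{subdomains / root_domains:.1f}") # ~3.6
```

Hundreds of URLs per root domain is consistent with the goal of going deeper on important domains rather than skimming homepages.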

Our next update is scheduled for March 11, though we'll push it out sooner if the data is ready early :)

As always, keep the feedback coming.  With our own toolset relying on this data, and dozens of partners using our API to develop their own applications, it's critical that we hear what you guys think.

NOTE: We're still updating the top 500 list; we'll tweet when it's ready.