As we rapidly approach the end of 2009 and the opening of 2010, we've got a much anticipated index update ready to roll out, gang. Say it with me: "twenty-ten". Oh yeah, I'm so gonna get a flying car and a cyberpunk android :) ...Ahem. I thought this would be a great time to take a look back at the year and ask, "where did all those pages go?" Being a data-driven kind of guy, I want to take a look at some numbers about churn, freshness, and what it means for the size of the web and web indexes over the last year, and the hundreds of billions, indeed trillion-plus URLs we've gotten our hands on.
This index update has a lot going on, so I've broken things out section by section:
- Analysis of the Web's Churn (or why having ten trillion URLs isn't very useful)
- Canonicalization, De-Duping & Choosing Which Pages to Keep
- Statistics on our December Linkscape Update
- New Updates to the FREE SEOmoz API (and a 90% price drop on the paid API)
An Analysis of the Web's Churn Rate
Not too long ago, at SMX East, I heard Joachim Kupke (senior software engineer on Google's indexing team) say that "a majority of the web is duplicate content". I made great use of that point at a Jane and Robot meet up shortly after. Now, I'd like to add my own corollary to that statement: "most of the web is short-lived".
After just a single month, a full 25% of the URLs are what we call "unverifiable". By that I mean that the content was either duplicate, included session parameters, or for some reason could not be retrieved (verified) again (404s, 500s, etc.). Six months later, 75% of the tens of billions of URLs we've seen are "unverifiable", and a year later, only 20% qualify for "verified" status. As Rand noted earlier this week, Google's doing a lot of verifying themselves.
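For the curious, here's a minimal sketch of what a re-verification pass could look like. This is an illustration of the idea, not our actual pipeline; the session-parameter list and the duplicate check are simplified assumptions.

```python
import requests
from urllib.parse import urlparse, parse_qs

# Hypothetical list of query parameters that usually indicate per-visitor sessions.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def verify_url(url, seen_content_hashes):
    """Classify a previously crawled URL as 'verified' or 'unverifiable'.

    A URL is unverifiable if it carries session parameters, can no longer
    be fetched (404s, 500s, timeouts), or its content duplicates a page
    we've already kept.
    """
    query_keys = {k.lower() for k in parse_qs(urlparse(url).query)}
    if query_keys & SESSION_PARAMS:
        return "unverifiable"          # session-specific URL

    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return "unverifiable"          # unreachable

    if resp.status_code != 200:
        return "unverifiable"          # 404s, 500s, etc.

    content_hash = hash(resp.text)     # crude duplicate check, for illustration only
    if content_hash in seen_content_hashes:
        return "unverifiable"          # duplicate content
    seen_content_hashes.add(content_hash)
    return "verified"
```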
To visualize this dramatic churn, imagine the web six months ago...
Using Joachim's point, plus what we've observed, that six-month-old content today looks something like this:
What this means for you as a marketer is that some of the links you build and the content you share across the web aren't permanent. If you engage heavily with high-churn portions of the web, the statistics you monitor over time can vary pretty wildly. It's important to understand the difference between getting links (and republishing content) in places that will make a splash now, but fade away, versus engaging in lasting ways. Of course, both are important (as high-churn areas may drive traffic that turns into more permanent value), but the distinction shouldn't be overlooked.
Canonicalization, De-Duping & Choosing Which Pages to Keep
Regarding Linkscape's indices, we capture both of these cases:
- We've got an up-to-date crawl including fresh content that's making waves right now. Blogscape helps power this, monitoring 10 million+ feeds and sending those back to Linkscape for inclusion in our crawl.
- We include the lasting content which will continue to support your SEO efforts by analyzing which sites and pages are "unverifiable" and removing these from each new index. This is why our index growth isn't cumulative -- we re-crawl the web each cycle to make sure that the links + data you're seeing are fresh and verifiable.
To put it another way, consider the quality of most of the pages on the web, as measured, for instance, by mozRank:
I think the graph speaks for itself. The vast majority of pages have very little "importance" as defined by a measure of link juice. So it doesn't surprise me (now at least) that most of these junk pages are disappearing after not too long. Of course, there are still plenty of really important pages that do stick around.
But what does this say about the pages we're keeping? First off, let's take out any discussion of the pages that we saw over a year ago (as we've seen above, likely less than 1/5th of them remain on the web). In just the past 12 months, we've seen between 500 billion and well over 1 trillion pages, depending on how you count it (via Danny at Search Engine Land).
So in just a year we've provided 500 billion unique URLs through Linkscape and the Linkscape-powered tools (Competitive Link Finder, Visualization, Backlink Analysis, etc.). And what's more, this represents less than half of the URLs we've seen in total, as the "scrubbing" we do for each index cuts approx. 50% of the "junk" (including canonicalization, de-duping, and straight tossing for spam and other reasons). There are likely many trillions of URLs out there, but the engines (and Linkscape) certainly don't want anything close to all of these in an index.
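To make the canonicalization side of that scrubbing a bit more concrete, here's a minimal sketch of the kind of URL normalization such a pass might apply before de-duping. It illustrates the general idea rather than Linkscape's actual rules, and the tracking/session parameter list is just an example.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical set of parameters that don't change the page's content.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}

def canonicalize(url):
    """Reduce equivalent URLs to a single canonical form before de-duping."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"              # treat /foo and /foo/ as one page
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in STRIP_PARAMS]
    query.sort()                                       # parameter order shouldn't matter
    return urlunparse((parts.scheme.lower(), host, path,
                       "", urlencode(query), ""))      # drop stripped params and fragment

# Both of these collapse to the same canonical URL:
#   canonicalize("http://www.seomoz.org/blog/?utm_source=feed")
#   canonicalize("http://WWW.SEOMOZ.ORG/blog")
```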
Linkscape's December Index Update:
In this latest index (compiled over approx. the last 30 days) we've included:
- 47,652,586,788 unique URLs (47.6 billion)
- 223,007,523 subdomains (223 million)
- 58,587,013 root domains (58.6 million)
- 547,465,598,586 links (547 billion)
We've checked that all of these URLs and links existed within the last month or so. And I call out this notion of "verified" because we believe that's what matters for a lot of reasons:
- Our own research on how search engines rank documents
- Your impact on the web (as in traditional marketing) and ability to compare progress over time
- Sharing reliable, trustworthy data with customers, both for self and competitive analysis
- Measuring progress and areas for improvement in search acquisition and SEO
I hope you'll agree. Or, at least, share your thoughts :)
New Updates to the Free & Paid Versions of our API
I also want to give a shout out to Sarah, who's been hard at work repackaging our site intelligence API suite. She's got all kinds of great stuff planned for early in the coming year, including tons of data in our free APIs. Plus, she's dropped the prices on our paid suite by nearly 90%.
Both of these items are great news for some of our many partners, including:
- Buzzstream - a tool for social media, PR and link management
- Brandwatch - a reputation monitoring tool
- Grader.com - Hubspot's popular site analysis tool
- Quirk's Search Status Bar
- And at least three of these top "10 Link Building Tools for Tracking Inbound Links"
Thanks to these partners we've doubled the traffic to our APIs to over 4 million hits per day, more than half of which are from external partners! We're really excited to be working with so many of you.
HUZZAH!
The Site Intelligence API is quickly becoming an indispensable tool for Virante and our clients. I'm very excited about the future of the API.
We're really glad you like it, and we appreciate the support you've shown.
Our commitment to our data services kept me up last night while one of our customers was making a very large transaction at exactly the same moment I was rolling out new data :)
We have used all the tools out there, even Majestic SEO, but Linkscape and related tools are still our go-to apps for client work. Keep up the good work, MOZers.
Thanks for the article! :)
As someone that likes using all of SEOmoz's tools it is exciting to have new data to play with.
Thanks for the easy-to-read breakdown.
Great stuff, I've really been looking forward to the next update. Interesting to read about the amount of churn too... not really surprising I guess.
(And Bladerunner = best film ever!)
"ready to roll out" = When??
:)
Yeah I want to know that too. I hate being teased :-(.
Some of my sites seem to have updated Linkscape info, while others - newer sites (with an age of 1-3 months) - still score four 0's in the toolbar. So I think it's already started to roll :)
Indeed, we got the new data up last night. Some stragglers might still fall in line over the next couple of days, but for 99.9% of the stuff you're looking at, it should be the new data.
Amazing amount of data.
Great stuff Nick. I'm excited to see Linkscape continue to grow, refine, improve, etc.
On another note, found a typo...
This should say
Thanks and keep up the good work!
Very interesting information on web churn and the sheer amount of low importance and duplicate pages out there.
It would be interesting to see how this analysis looks a year from now. I'd guess that the web churn rate is probably increasing and the percentage of low mozRank pages would go up too.
In the API wiki there are some "coming soon" return fields, like domain authority and page authority.
Can you say more about these metrics and when they will be available?
Still 2-4 weeks away, but yes, as soon as we get them up, we'll have a lot more data about them. The "sneak peek" answer is that domain authority and page authority are both going to be mashups of all our various link metrics about a domain/page that best match up to their potential ability to rank well in the search engines. Inspiration (and math) comes from our ranking models work here.
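Just for intuition, a "mashup" of link metrics could look something like a weighted model along these lines. The metric names, weights, and scaling below are made up for illustration and are not the actual formula.

```python
import math

# Hypothetical weights; in practice these would be fit by a ranking model
# against observed search results, not hand-picked.
WEIGHTS = {"mozrank": 0.6, "moztrust": 0.5,
           "linking_root_domains": 0.35, "total_links": 0.1}
BIAS = -5.0

def authority_score(metrics):
    """Combine several link metrics into a single 0-100 authority-style score."""
    z = BIAS + sum(WEIGHTS[name] * math.log1p(metrics.get(name, 0.0))
                   for name in WEIGHTS)
    return 100.0 / (1.0 + math.exp(-z))   # squash onto a 0-100 scale

# Example (made-up inputs): a page with mozRank 5.2, mozTrust 5.0,
# 1,200 linking root domains, and 45,000 total links.
print(authority_score({"mozrank": 5.2, "moztrust": 5.0,
                       "linking_root_domains": 1200, "total_links": 45000}))
```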
hmmm, still seeing old data for queries, anyone else?
Great post - nicely summed up how Google is really starting to remove bad or non-relevant pages!
Good job Nick!!!
It's funny actually, just yesterday I replied to an individual on here saying exactly what Nick has in this post. I wasn't stating any facts, just writing what I wanted from Google in the future.
It's good to know that they're thinking on the same wavelength as most people.
I echo what cyberpunkdreams says by the way. Bladerunner is one of the best films ever made. I wonder how many people who watched that film thought, "what if I'm a replicant?"
Awesome - looking forward to the new update to see if some of the newer domains now have a bit more independent Linkscape data showing why they are still ranking well against my clients.
This churn rate seems to tie in well with Rand's recent post https://www.seomoz.org/blog/googles-indexation-cap and also seems to support some recent mass dumping of pages from Google's index that seem to be occurring.
I would have thought that the # of URLs that are "unverifiable" would have been greater than 25%. I know many tools seem to be getting more false positives as the amount of data grows. Is this a similar issue for Linkscape?
Nice, not quite as exciting as a PR update but nearly.
Does anyone know the date of the last update as I want to review the impact I have had over x period on some sites.
ah yes, a good question. The last update was on October 7th, which is about twice as long as we usually go and even longer than we'd like to go.
Fortunately we've got Chas on the case, doubling our processing capacity.
Linkscape has definitely become one of my go-to link analysis tools and has gotten better every month since it was released.
How much server space is needed to hold a trillion URLs?
The average URL length is 100 bytes, so that's 100 trillion bytes, or about 90 TB of uncompressed data, just for the URLs. And if you consider that on average we store 10 links per page, that's 1 quadrillion bytes' worth of links (if you store them as "www.seomoz.org,www.janeandrobot.com"), or about a petabyte of data.
That's not to say we've got that 1 petabyte in warm cache and serving all of it out every month ;)
As you can imagine, we're pulling all kinds of engineering tricks to handle this stuff.
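If you want the arithmetic spelled out, here's the back-of-the-envelope version, using the rough averages above (they're ballpark figures, not measured constants):

```python
# Rough back-of-the-envelope sizing for a trillion-URL index.
URLS = 1_000_000_000_000        # 1 trillion URLs
AVG_URL_BYTES = 100             # ~100 bytes per URL on average
LINKS_PER_PAGE = 10             # ~10 stored links per page
AVG_LINK_BYTES = 100            # rough bytes per stored "source,target" pair

url_bytes = URLS * AVG_URL_BYTES                      # 1e14 bytes
link_bytes = URLS * LINKS_PER_PAGE * AVG_LINK_BYTES   # 1e15 bytes

TB = 1024 ** 4
print(f"URLs alone:   {url_bytes / TB:,.0f} TB uncompressed")        # ~91 TB
print(f"Link records: {link_bytes:,} bytes, roughly a petabyte")     # 1 quadrillion bytes
```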
Too big
Really sorry about this but...
...That's what she said.
That's a fair amount of data there. I'm sure they have ways of compressing the URIs or the data that's on the servers.
Yeah, we have some patent pending compression technology :P (I actually have mixed philosophical feelings about patents...)
Our data compresses better than 10:1 in many cases. And we use a compact numbering scheme to reference rows of data. So I think one month of data is about 10TB compressed but over 100TB uncompressed.
We also have a handful of metadata files, describing a tiny portion of the data, that we don't compress; those take up over 500GB.
It takes about 24 hours to set up a node to serve the index. Just moving the data needed takes 10 or 12 hours.
I know because I screwed this up a couple nights before OSE launch after staying up til 3am. Then I had to do it again...
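To give a feel for why a compact numbering scheme helps so much (the exact encoding we use isn't something I'll spell out here, so treat this as the general trick rather than our implementation): instead of storing full URL strings in every link record, you assign each URL an integer ID once and store links as ID pairs.

```python
class UrlDictionary:
    """Assign each URL a small integer ID; store the string only once."""

    def __init__(self):
        self.ids = {}       # url -> id
        self.urls = []      # id -> url

    def id_for(self, url):
        if url not in self.ids:
            self.ids[url] = len(self.urls)
            self.urls.append(url)
        return self.ids[url]

# A link becomes a pair of small integers instead of two ~100-byte strings:
d = UrlDictionary()
link = (d.id_for("http://www.seomoz.org/"), d.id_for("http://www.janeandrobot.com/"))
# With 8-byte IDs a link record drops from ~200 bytes to 16 bytes, and the
# integer columns themselves compress far better than raw URL text.
```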
You are my man. Thanks