Last week we updated the Linkscape index, and we've been doing it again this week. As I've pointed out in the past, up-to-date data is critical. So we're pushing everyone around here just about as hard as we can to provide that to you. This time we've got updated information on over 43 billion URLs, 275 million sub-domains, 77 million root domains, and 445 billion links. For those keeping track, the next update should be around April 15.
I've got three important points in this post. So for your click-y enjoyment: what happened to the links that dropped out of our index, a look inside our monthly pipeline, and what happened to last week's update.
If you've been keeping track, you may have noticed a drop in pages and links in our index in the last two or three months. You'll notice that I call these graphs "Fresh Index Size", by which I mean that these numbers by and large reflect only what we verified in the prior month. So what happened to those links?
Note: "March - 2" is the most recent update (since we had two updates this month!)
At the end of January, in response to user feedback, we changed our methodology around what we update and include. One of the things we hear a lot is, "awesome index, but where's my site?" Or perhaps, "great links, but I know this site links to me, where is it?" Internally we also discovered a number of sites that generate technically distinct content, but with no extra value for our index. One of my favorite examples of such a site is tnid.org. So we cut pages like those, and made an extra effort to include sites which previously had been excluded. And the results are good:
I'm actually really excited about this because our numbers are now very much in line with Netcraft's survey of active sites. But more importantly, I hope you are pleased too.
I've been spending time with Kate, our new VP of Engineering, bringing her up to speed about our technology. In addition to announcing the updated data, I also wanted to share some of our discussions. Below is a diagram of our monthly (well, 3-5 week) pipeline.
You can think of the open web as having an essentially endless supply of URLs to crawl, representing many petabytes of content. From that we select a much smaller set of pages to get updated content for on a monthly basis. In large part, this is due to politeness considerations: there are about 2.6 million seconds in a month, and most sites won't tolerate a bot fetching one page a second. So we can only get updated content for so many pages in a month.
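To make the politeness math concrete, here's a rough back-of-the-envelope sketch. The one-fetch-per-second rate is the assumption from the paragraph above, not a measured figure:

```python
# Rough politeness math: ~2.6 million seconds in a month, and a polite
# crawler fetches at most about one page per second from any one site.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000
POLITE_FETCHES_PER_SECOND = 1           # assumed tolerable rate, per site

max_pages_per_site = SECONDS_PER_MONTH * POLITE_FETCHES_PER_SECOND
print(f"Per-site monthly upper bound: {max_pages_per_site:,} pages")
# Per-site monthly upper bound: 2,592,000 pages
```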
From the updated content we fetch, we discover a very large amount of new content, representing a petabyte or more of new data. From this we merge non-canonical forms, remove duplicates, and synthesize some powerful metrics like Page Authority, Domain Authority, mozRank, etc.
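To give a flavor of what merging non-canonical forms looks like, here's a minimal sketch of URL canonicalization and de-duplication. The normalization rules shown are illustrative assumptions, not Linkscape's actual rules:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    # Illustrative rules only: lower-case the host, drop fragments and
    # default ports, strip trailing slashes.
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def dedupe(urls):
    # Keep only the first URL seen for each canonical form.
    seen = set()
    for url in urls:
        canon = canonicalize(url)
        if canon not in seen:
            seen.add(canon)
            yield canon

print(list(dedupe([
    "http://Example.com/page/",
    "http://example.com:80/page#section",
    "http://example.com/page",
])))
# ['http://example.com/page']
```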
Once we've got that data prepared, we drop our old (by then out of date) data, and push the updated information to our API. On about a monthly basis we turn over about 50 billion URLs, representing hundreds of terabytes of information.
What Happened To Last Week's Update
In the spirit of TAGFEE, I feel like I need to take some responsibility for last week's late update, and explain what happened.
One of our big goals is to provide fresh data. One way we can do that is to shorten the time between getting raw content and processing it; that corresponds to the "Newly Discovered Content" section of the chart above. For the last update we doubled the size of our infrastructure. Doubling the number of computers we have analyzing and synthesizing data also increased the coordination required between those computers: if everyone has to talk to everyone else and you double the number of people, you actually quadruple the number of relationships. This caused lots of problems we had to deal with at various points.
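That "double the people, quadruple the relationships" point is just the pairwise-connection formula n(n-1)/2. A quick illustration (the machine counts here are made up for the example, not our actual cluster sizes):

```python
# Pairwise coordination links between n machines: n * (n - 1) / 2.
# Doubling n roughly quadruples the number of links.
def links(n: int) -> int:
    return n * (n - 1) // 2

for n in (50, 100):  # hypothetical machine counts
    print(f"{n} machines -> {links(n)} machine-to-machine relationships")
# 50 machines -> 1225 machine-to-machine relationships
# 100 machines -> 4950 machine-to-machine relationships (roughly 4x)
```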
Another nasty side effect was that machine failures became even more common than we'd experienced before. If you know anything about Amazon Web Services and the Elastic Compute Cloud, then you know that those instances go down a lot :) So we needed an extra four days to get the data out.
Fortunately we've taken this as an opportunity to improve our infrastructure, fault tolerance and lots of other good tech start-up buzz words. Which is one of the reasons we're able to get this update out so quickly after the previous one.
As always, we really appreciate feedback, so keep it coming!
Phenomenal post Nick. So many people always ask about how a crawl and index of the web works, how it's maintained, how it scales, etc. You definitely deserve lots of thumbs up for this one - and not least for giving us some great graphics to accompany :-)
Also interesting about the "whys" of what can happen with masses of data in flux.
Yeah, and the point about the "fresh" index versus "all" our data is an interesting one that's coming up a lot. We've got some exciting work going on that front. The next three or so months should make some pretty interesting steps toward resolving that conflict.
the idea about "fresh" data made me think of a question from a WMT video. they were wondering if old urls that linked (like all those from geocities) still count and @mattcutts said that they dont (kinda duh) but it made me think... what is "old"? some links that are shown in my WMT aren't there anymore so when do they stop counting? they show when they were "last found" but when do they stop counting for google?
I'm guessing the answer is the next time the page gets crawled... but that could be a while for some pages, right? I dunno, I'm just commenting out loud.
Does Linkscape give a lower priority to pages it's already crawled? How long does it take to refresh all the Linkscape data?
Thanks for your time, and keep up the awesome work!
We evaluate what to pull in next regardless of whether we have crawled it in the past. Whatever is valuable is refreshed quickly. Whatever is unimportant falls out quickly.
This has two important effects (there's a rough sketch of the idea after this list):
1) we have a substantial amount of content refreshed within weeks of new content (which isn't perfect, but isn't bad either)
2) we have some lag (perhaps an index update or two) on crawling new pages, because we won't crawl a page until we've seen at least one link to it.
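To make that refresh/fall-out behaviour concrete, here's a minimal sketch of a value-driven recrawl queue. The scoring and URLs are invented for the example; this isn't the actual Linkscape scheduler:

```python
import heapq

def plan_crawl(candidates, budget):
    # candidates: iterable of (url, value); higher value = more important.
    # Whatever looks most valuable gets refreshed first; low-value URLs
    # never reach the front before the crawl budget runs out.
    heap = [(-value, url) for url, value in candidates]
    heapq.heapify(heap)
    plan = []
    while heap and len(plan) < budget:
        _, url = heapq.heappop(heap)
        plan.append(url)
    return plan

candidates = [
    ("http://bigsite.example/home", 9.5),        # seen before, still valuable
    ("http://newsite.example/post", 7.2),        # newly discovered via a link
    ("http://junk.example/?session=42", 0.1),    # falls out quickly
]
print(plan_crawl(candidates, budget=2))
# ['http://bigsite.example/home', 'http://newsite.example/post']
```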
Whenever that happens to us, we get more reliable. And now that we've really got dozens and dozens of machines, failures are routine for us.
So the fact that you routinely fail makes you more reliable...I am so stealing this to use as my new company tagline :-P
Seriously, thanks for two Linkscape updates this month. It gives new meaning to the term March Madness!
Why does he always write such good posts? Damn.
Thanks for this update!
I hope to run some new experiments with this new data soon. I need to share that with you in a YOUmoz post next week.
As Rand says: Take care!
Yay! YOUmoz post :)
Gives new life to the term March Madness for sure. Sounds like you guys could add a few things to the Murphy's Law book.
Thanks for the hard work and info!
What I love about new updates is the anticipation just before viewing the new report. I've got Linkscape set to email me when a new report is made. I can then view it as soon as I start work, which kind of kickstarts the day and lets me assess my past work/link building efforts.
Thanks very much Nick! Fresh data FTW! :)
Nice stats, thanks.
Alex
https://www.cleaningstar.com.au
I'm loving the move away from consulting and into development... with support for Yahoo! Site Explorer looking like it's coming to an end, we all need a decent independent link analysis site.
Carry on the good work!
>"In large part, this is due to politeness considerations: there's about 2.6 million seconds in a month, and most sites won't tolerate fetching one page a second by a bot. So we only can get updated content for so many pages in a month."
Your logic here is unsound. You can run crawlers in parallel, so that limitation is only ~2.6 million pages spidered per site, not overall! The real limitation is your bandwidth and processing power. Or am I wrong?
As an aside is there info somewhere about your hardware that you're running this on?
And one last thing, are you selling your index or thinking of starting your own SE?
You're right that crawlers can run in parallel. The point I'm trying to make is that all of Amazon.com (or Wikipedia, or plenty of other less well known, but still large, authoritative sites) cannot be crawled in just a few weeks because of politeness constraints.
So in a few weeks, you could crawl billions of pages. But you would still end up with only a partial update for many, many sites.
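One way to see this: parallelism multiplies throughput across hosts, but the per-host politeness cap stays the same. A rough sketch under the one-request-per-second assumption above (the site names and sizes are made up for illustration):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # ~2.6 million
PER_HOST_RATE = 1.0                     # assumed polite fetches per second, per host

# Hypothetical site sizes, just for the example.
sites = {
    "bigstore.example": 500_000_000,
    "bigwiki.example": 60_000_000,
    "smallblog.example": 2_000,
}

for host, pages in sites.items():
    crawlable = min(pages, int(SECONDS_PER_MONTH * PER_HOST_RATE))
    print(f"{host}: {crawlable:,} of {pages:,} pages refreshable this month "
          f"({crawlable / pages:.1%})")
# Many hosts can be crawled at once, but no single large site gets more
# than ~2.6 million pages refreshed per month.
```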
Was it intended to be done on Friday? :)
NOW I'm sure I'll get a GREAT weekend!!!
Yeah, I guess it's like a little weekend gift. Or something neat waiting for you when you get back to the office on Monday :)
Thanks for the info Nick! Your last point brings up this question: in terms of SEO architecture, could the issues we see with the AWS cloud potentially cause crawl and indexation issues for an average site with SEs?
AWS doesn't really have bad reliability. It's just that we've got so many machines that machine failure is almost to be expected. Whenever that happens to us, we get more reliable. And now that we've really got dozens and dozens of machines, failures are routine for us.
Great post Nick! The smell of freshly indexed pages is a great way to start the day ^^
Great explanation... I think the pieces in my head are all starting to fall into place.
Tony ;~)
Really good post..!!
Regards from factores posicionamiento web
-- Jen removed link