Last week we updated the Linkscape index, and we've been doing it again this week. As I've pointed out in the past, up-to-date data is critical. So we're pushing everyone around here just about as hard as we can to provide that to you. This time we've got updated information on over 43 billion URLs, 275 million sub-domains, 77 million root domains, and 445 billion links. For those keeping track, the next update should be around April 15.
I've got three important points in this post. So for your click-y enjoyment: what happened to the links that dropped out of our index, a look inside our monthly pipeline, and what happened to last week's update.
If you've been keeping track, you may have noticed a drop in pages and links in our index in the last two or three months. You'll notice that I call these graphs "Fresh Index Size", by which I mean that these numbers by and large reflect only what we verified in the prior month. So what happened to those links?
Note: "March - 2" is the most recent update (since we had two updates this month!)
At the end of January, in response to user feedback, we changed our methodology around what we update and include. One of the things we hear a lot is, "awesome index, but where's my site?" Or perhaps, "great links, but I know this site links to me, where is it?" Internally we also discovered a number of sites that generate technically distinct content, but with no extra value for our index. One of my favorite examples of such a site is tnid.org. So we cut pages like those, and made an extra effort to include sites which previously had been excluded. And the results are good:
I'm actually really excited about this because our numbers are now very much in line with Netcraft's survey of active sites. But more importantly, I hope you are pleased too.
I've been spending time with Kate, our new VP of Engineering, bringing her up to speed about our technology. In addition to announcing the updated data, I also wanted to share some of our discussions. Below is a diagram of our monthly (well, 3-5 week) pipeline.
You can think of the open web as having an essentially endless supply of URLs to crawl, representing many petabytes of content. From that we select a much smaller set of pages to get updated content for on a monthly basis. In large part, this is due to politeness considerations: there are about 2.6 million seconds in a month, and most sites won't tolerate a bot fetching one page a second. So we can only get updated content for so many pages in a month.
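To make the politeness math concrete, here's a rough back-of-the-envelope sketch. The one-fetch-per-second rate is the assumption from the paragraph above, not a measured figure:

```python
# Rough politeness math: ~2.6 million seconds in a month, and a polite
# crawler fetches at most about one page per second from any one site.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000
POLITE_FETCHES_PER_SECOND = 1           # assumed tolerable rate, per site

max_pages_per_site = SECONDS_PER_MONTH * POLITE_FETCHES_PER_SECOND
print(f"Per-site monthly upper bound: {max_pages_per_site:,} pages")
# Per-site monthly upper bound: 2,592,000 pages
```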
From the updated content we fetch, we discover a very large amount of new content, representing a petabyte or more of new data. From this we merge non-canonical forms, remove duplicates, and synthesize some powerful metrics like Page Authority, Domain Authority, mozRank, etc.
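To give a flavor of what merging non-canonical forms looks like, here's a minimal sketch of URL canonicalization and de-duplication. The normalization rules shown are illustrative assumptions, not Linkscape's actual rules:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    # Illustrative rules only: lower-case the host, drop fragments and
    # default ports, strip trailing slashes.
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def dedupe(urls):
    # Keep only the first URL seen for each canonical form.
    seen = set()
    for url in urls:
        canon = canonicalize(url)
        if canon not in seen:
            seen.add(canon)
            yield canon

print(list(dedupe([
    "http://Example.com/page/",
    "http://example.com:80/page#section",
    "http://example.com/page",
])))
# ['http://example.com/page']
```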
Once we've got that data prepared, we drop our old (by then out of date) data, and push the updated information to our API. On about a monthly basis we turn over about 50 billion URLs, representing hundreds of terabytes of information.
What Happened To Last Week's Update
In the spirit of TAGFEE, I feel like I need to take some responsibility for last week's late update, and explain what happened.
One of our big goals is to provide fresh data. One way we can do that is to shorten the time between getting raw content and processing it; that corresponds to the "Newly Discovered Content" section of the chart above. For the last update we doubled the size of our infrastructure. Doubling the number of computers we have analyzing and synthesizing data also increased the coordination required between those computers: if everyone has to talk to everyone else and you double the number of people, you actually quadruple the number of relationships. This caused lots of problems we had to deal with at various points.
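That "double the people, quadruple the relationships" point is just the pairwise-connection formula n(n-1)/2. A quick illustration (the machine counts here are made up for the example, not our actual cluster sizes):

```python
# Pairwise coordination links between n machines: n * (n - 1) / 2.
# Doubling n roughly quadruples the number of links.
def links(n: int) -> int:
    return n * (n - 1) // 2

for n in (50, 100):  # hypothetical machine counts
    print(f"{n} machines -> {links(n)} machine-to-machine relationships")
# 50 machines -> 1225 machine-to-machine relationships
# 100 machines -> 4950 machine-to-machine relationships (roughly 4x)
```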
Another nasty side effect was that machine failures became even more common than we'd experienced before. If you know anything about Amazon Web Services and the Elastic Compute Cloud, then you know that those instances go down a lot :) So we needed an extra four days to get the data out.
Fortunately we've taken this as an opportunity to improve our infrastructure, fault tolerance and lots of other good tech start-up buzz words. Which is one of the reasons we're able to get this update out so quickly after the previous one.
As always, we really appreciate feedback, so keep it coming!
Phenomenal post Nick. So many people always ask about how a crawl and index of the web works, how it's maintained, how it scales, etc. You definitely deserve lots of thumbs up for this one - and not least for giving us some great graphics to accompany :-)
Also interesting about the "whys" of what can happen with masses of data in flux.
Yeah, and the point about the "fresh" index versus "all" our data is an interesting one that's coming up a lot. We've got some exciting work going on that front. The next three or so months should make some pretty interesting steps toward resolving that conflict.
the idea about "fresh" data made me think of a question from a WMT video. they were wondering if old urls that linked (like all those from geocities) still count and @mattcutts said that they dont (kinda duh) but it made me think... what is "old"? some links that are shown in my WMT aren't there anymore so when do they stop counting? they show when they were "last found" but when do they stop counting for google?
I'm guessing the answer is the next time the page gets crawled... but that could be a while for some pages, right? I dunno, I'm just commenting out loud.
Does Linkscape give a lower priority to pages it's already crawled? How long does it take to refresh all the Linkscape data?
Thanks for your time, and keep up the awesome work!
We evaluate what to pull in next regardless of whether we have crawled it in the past. Whatever is valuable is refreshed quickly. Whatever is unimportant falls out quickly.
This has two important effects (there's a rough sketch of the idea after this list):
1) we have a substantial amount of content refreshed within weeks of new content (which isn't perfect, but isn't bad either)
2) we have some lag (perhaps an index update or two) on crawling new pages, because we won't crawl a page until we've seen at least one link to it.
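To make that refresh/fall-out behaviour concrete, here's a minimal sketch of a value-driven recrawl queue. The scoring and URLs are invented for the example; this isn't the actual Linkscape scheduler:

```python
import heapq

def plan_crawl(candidates, budget):
    # candidates: iterable of (url, value); higher value = more important.
    # Whatever looks most valuable gets refreshed first; low-value URLs
    # never reach the front before the crawl budget runs out.
    heap = [(-value, url) for url, value in candidates]
    heapq.heapify(heap)
    plan = []
    while heap and len(plan) < budget:
        _, url = heapq.heappop(heap)
        plan.append(url)
    return plan

candidates = [
    ("http://bigsite.example/home", 9.5),        # seen before, still valuable
    ("http://newsite.example/post", 7.2),        # newly discovered via a link
    ("http://junk.example/?session=42", 0.1),    # falls out quickly
]
print(plan_crawl(candidates, budget=2))
# ['http://bigsite.example/home', 'http://newsite.example/post']
```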
Whenever that happens to us, we get more reliable. And now that we've really got dozens and dozens of machines, failures are routine for us.
So the fact that you routinely fail makes you more reliable...I am so stealing this to use as my new company tagline :-P
Seriously, thanks for two Linkscape updates this month. It gives new meaning to the term March Madness!
Why does he always write such good posts? Damn.
Thanks for this update!
I hope to run some new experiments with this new data soon. I need to share that with you in a YOUmoz post next week.
As Rand says: Take care!
Yay! YOUmoz post :)
Gives new life to the term March Madness for sure. Sounds like you guys could add a few things to the Murphy's Law book.
Thanks for the hard work and info!
What I love about new updates is the anticipation just before viewing the new report. I've got Linkscape set to email me when a new report is made. I can then view it as soon as I start work, which kind of kickstarts the day and lets me assess my past work/link building efforts.
Thanks very much Nick! Fresh data FTW! :)
Nice stats, thanks.
Alex
https://www.cleaningstar.com.au
I'm loving the move away from consulting and into development... with support for Yahoo! Site Explorer looking like it's coming to an end, we all need a decent independent link analysis site.
Carry on the good work!
>"In large part, this is due to politeness considerations: there's about 2.6 million seconds in a month, and most sites won't tolerate fetching one page a second by a bot. So we only can get updated content for so many pages in a month."
Your logic here is unsound. You can run crawlers in parallel, so that limitation is only ~2.6 million pages spidered per site, not overall! The real limitation is your bandwidth and processing power. Or am I wrong?
As an aside is there info somewhere about your hardware that you're running this on?
And one last thing, are you selling your index or thinking of starting your own SE?
You're right that crawlers can run in parallel. The point I'm trying to make is that all of Amazon.com (or Wikipedia, or plenty of other less well known, but still large, authoritative sites) cannot be crawled in just a few weeks because of politeness constraints.
So in a few weeks, you could crawl billions of pages. But you would still end up with only a partial update for many, many sites.
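One way to see this: parallelism multiplies throughput across hosts, but the per-host politeness cap stays the same. A rough sketch under the one-request-per-second assumption above (the site names and sizes are made up for illustration):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # ~2.6 million
PER_HOST_RATE = 1.0                     # assumed polite fetches per second, per host

# Hypothetical site sizes, just for the example.
sites = {
    "bigstore.example": 500_000_000,
    "bigwiki.example": 60_000_000,
    "smallblog.example": 2_000,
}

for host, pages in sites.items():
    crawlable = min(pages, int(SECONDS_PER_MONTH * PER_HOST_RATE))
    print(f"{host}: {crawlable:,} of {pages:,} pages refreshable this month "
          f"({crawlable / pages:.1%})")
# Many hosts can be crawled at once, but no single large site gets more
# than ~2.6 million pages refreshed per month.
```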
Was it intended to be done on Friday? :)
NOW I'm sure I'll get a GREAT weekend!!!
Yeah, I guess it's like a little weekend gift. Or something neat waiting for you when you get back to the office on Monday :)
Thanks for the info Nick! Your last point brings up this question: in terms of SEO architecture, could the issues we see with the AWS cloud potentially cause crawl and indexation issues for an average site with SEs?
AWS doesn't really have bad reliability. It's just that we've got so many machines that machine failure is almost to be expected. Whenever that happens to us, we get more reliable. And now that we've really got dozens and dozens of machines, failures are routine for us.
Great post Nick! The smell of freshly indexed pages is a great way to start the day ^^
Great explanation... I think the pieces in my head are all starting to fall into place.
Tony ;~)
Really good post..!!
Regards from factores posicionamiento web
-- Jen removed link