Being in beta, we're still making a lot of improvements. This time around we made some serious changes to our crawling methodology to include more quality content and to avoid junk. Having spoken with search engineers, thought leaders, and our users, and having taken a look at the data, we think we've made plenty of progress on the raw size of our index (between 30 and 40 billion pages). Now we want to focus on index quality. With that said, here are the latest index stats:
- URLs: 36,651,796,236
- Subdomains: 214,625,541
- Root Domains: 50,734,663
- Links: 409,127,041,842
The current goals for the Linkscape team are:
- Freshness: data which is three to eight weeks old or better
- Coverage: include key deep pages and influential posts that might not be referenced from other sites
- Visibility: uncover actionable data and trends across all our data, rather than sorting or filtering just 3,000 links
- Measurability: provide data which is comparable index-to-index, and track those trends
- Quality: provide data which reflects the structure of the relevant web
When we ask how many external links point to a homepage, on average we report 40% of what Y!SE does. Of course, we discount nofollows, while Y!SE (in our experience) includes them. In our latest crawl, 3% of links are nofollowed (up from 1.8% when we originally launched). This probably reflects our crawl selection rather than a broader trend. It's entirely possible that homepage links are nofollowed more often than links to deep pages (think about those comments you leave on blogs with a link back home). We also keep link counts to the sources of 301s separate from their targets for reporting purposes. Don't worry, we still pass link juice through 301s.
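If you like pseudocode, here's a rough sketch of the counting rules above. The field names and the redirect map are made up for illustration and aren't our actual pipeline:

```python
# Hypothetical sketch only: illustrates the reporting rules described above.
# The field names ('source', 'target', 'nofollow') and the redirects map are
# illustrative, not Linkscape's actual schema.
from collections import Counter

def external_link_counts(links, target_url, redirects):
    """links: dicts with 'source', 'target', 'nofollow' keys.
    redirects: maps a 301'd URL to the URL it redirects to."""
    counts = Counter()
    for link in links:
        if link["nofollow"]:
            continue  # nofollowed links are discounted entirely
        if link["target"] == target_url:
            counts["direct"] += 1   # counted against the page itself
        elif redirects.get(link["target"]) == target_url:
            counts["via_301"] += 1  # reported separately from the target,
                                    # though link juice still flows through
    return counts
```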
When we ask how many external links point to a whole site, we report, on average, 90% of what Y!SE does. This number is quite dramatic IMHO. Here we're probably seeing less of the nofollow bias. This might also reflect a crawl bias and some canonicalization differences. But again, a strong statistical correlation suggests that we are giving an accurate site-wide link profile.
This graph is quite striking to me. We pulled all the links we had for a variety of pages and went back one month later to see how many of those pages were still linking. We want to assure you that the links we're reporting reflect the current state of the web, rather than the stale web from months or even years ago. As it turns out, we have a 91% success rate. When we ask for Y!SE's 1000 links we see a 97% success rate, making us quite competitive in this regard.
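For the curious, the recheck boils down to something like this. A minimal sketch: the requests library and the crude substring test stand in for our real fetching and link extraction.

```python
# Minimal sketch of the freshness recheck: re-fetch each linking page a
# month later and see whether it still links to the target. The substring
# test is a crude stand-in for proper HTML link parsing.
import requests

def still_linking(source_url, target_url):
    try:
        resp = requests.get(source_url, timeout=10)
    except requests.RequestException:
        return False  # unreachable pages count as stale links
    return resp.ok and target_url in resp.text

def freshness_rate(link_pairs):
    """link_pairs: (source_url, target_url) tuples recorded a month ago."""
    alive = sum(still_linking(s, t) for s, t in link_pairs)
    return alive / len(link_pairs) if link_pairs else 0.0
```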
A few other stats measuring our index quality:
- Pages mentioned in DMOZ also in Linkscape: 96%
- Domains mentioned in DMOZ also in Linkscape: 99%
- Average error of mozRank against Google Toolbar PR: 0.56 (best possible is 0.25 due to round-off)
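To see why 0.25 is the floor: Toolbar PR only comes in whole numbers, so even a continuous score that tracked it perfectly would still be off by the rounding error, which averages 0.25 when the fractional parts are spread evenly. A tiny illustrative simulation:

```python
# Toolbar PR is an integer, so the best achievable average error is the
# expected rounding error |x - round(x)|, which is 0.25 for evenly spread
# fractional parts.
import random

xs = [random.uniform(0, 10) for _ in range(1_000_000)]
print(sum(abs(x - round(x)) for x in xs) / len(xs))  # ~0.25
```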
Oh yeah, I just discovered this awesome social media tool! It's called... Twitter! (tongue-in-cheek)
But seriously, that's another great way to provide feedback, get questions answered, and in general keep up-to-date with what's going on behind the scenes here at the mozPlex.
I would love the ability to compare old reports with new ones for the same sites so I can see the growth trends.
Is this something you will be considering to do in the near future?
Would anyone else like that feature?
I'd second that request. I'm really interested in seeing the growth trends as some of the campaigns I am managing progress.
Third that! I was a tiny bit saddened when I used the Linkscape tool for the first time the other day and saw that you can't compare old reports to new ones.
This is definitely something we want to address (see point 4 in our goals).
Right now the comparisons are a little tricky because we're grooming our dataset. For instance, the first update doubled the size of our index. If you got a larger link count, is that because of index growth or real link growth? For now, I'd say focus on relative differences within an index (e.g. do you have more links or fewer than your next best competitor?).
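Here's the kind of within-index comparison I mean, with made-up numbers:

```python
# Toy illustration (made-up numbers): compare within one index rather than
# across indices, so index growth affects both sites roughly equally.
def relative_link_share(your_links, competitor_links):
    total = your_links + competitor_links
    return your_links / total if total else 0.0

print(relative_link_share(1_000, 2_000))  # older, smaller index -> ~0.33
print(relative_link_share(2_100, 4_000))  # index doubled        -> ~0.34
# Raw counts doubled, but the relative position barely moved.
```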
We will NOT come out of beta (whatever that means) without some kind of solution for this on the data side. So expect to see more around this in the next few months. But keep the feedback barrage on this front coming.
Woo hoo! Congrats guys, I've been waiting for this update for a long time. Maybe I'm just blind from excitement, but I can't figure out how to delete my old Linkscape reports. I want to re-run some old URLs through this update, but I keep getting redirected to the older (saved) versions. Any direction you can offer would be helpful.
Thanks!
Thanks for the report! We forgot to roll over a sticky bit. Bit is unstuck, try those tubes again!
Damn those sticky bits... they get everywhere and are hard to clean off.
I'm interested in the averages implied by the numbers at the top of the page.
I realise that the Linkscape crawl began at the strongest parts of the internet, but these are higher than I would have guessed. Let's see what happens as more of the internet's backwaters get included in the numbers.
Working with these aggregates is so difficult because of the vast long-tail nature of the web. The average is VERY different from the median here. Our feeling is that we're already including too much of the backwaters (by some definition of "backwater").
This gets at the interplay between points 2 and 5 of our goals: Coverage vs Quality.
That number 11.2 is actually too low for most pages you'll ever encounter. But you're right: if we count many of those "backwater" pages, whose content and links we largely don't include in our index, the number does come down.
Interestingly, I think the "average number of inlinks per URL" generally increases with more crawling. This statistic is related to the ratio of crawled URLs to discovered URLs. Consider the point just after Linkscape had crawled one page: it had found about 30 unique links, so it knew about 30 URLs, and the average number of inlinks per URL was exactly 1 (30 divided by 30). This ratio increases as one crawls more and finds an increasing portion of links pointing to URLs that have already been discovered.
As an intermediate data point, when the Linkscape crawl was only in the hundreds of millions of pages, the average number of links per URL was more like 3. Now, as you noted, it is more like 11. So it would seem to be increasing as we crawl more.
However, it could be leveling off. One never gets the ratio of discovered to crawled URLs down to 1, because a lot of URLs are uncrawlable: they are non-HTML content, they are blocked by robots.txt, they are clearly spam you don't want to crawl, they point to 404 pages, etc. And the second factor, the average number of links per page, has declined somewhat as we've crawled more, which also contributes to the leveling off of the average number of links per URL. So you might be right, and this statistic could now be decreasing.
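For reference, the implied average can be read straight off the index stats at the top of the post:

```python
# Implied average inlinks per URL from the index stats at the top of the post.
links = 409_127_041_842
urls = 36_651_796_236
print(round(links / urls, 1))  # -> 11.2
```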
"[...] how many external links to a whole site, we report, on average, 90% of what Y!SE does. This number is quite dramatic IMHO. Here we're probably seeing less of the nofollow bias. This might also reflect a crawl bias, and some canonicalization differences. But again, a strong statistical correlation suggests that we are giving an accurate site-wide link profile."
Presumably you've done some qualitative analysis to look at the discrepancy and see what sort of links that 10% consists of? You'd have to look through a few sites by hand, but surely that would show immediately whether the 10% can be written off as uncrawled domains linking in, or whether it's duplicate content, or whether Y! is not reporting unique links, etc.?
Unrelated, but the geek in me is interested in the [on disk] size of your database, the time and bandwidth to create it and any details about your crawler you care to share?
We have taken a look at (some of) that 10% (our study covered millions of links). It's still a little early to say, but the issues revolve around freshness and "depth". Some of the links are from the last two or three weeks, which is just outside our resolution. Also, some of the pages hosting those links apparently have no links pointing to them at all; they are only discoverable through RSS and browsing history.
We're working on both issues, but the freshness one is a little bit trickier to address.
Let me figure out the right way to release info about our back-end architecture. For now, suffice it to say, it's BIG. This is bigger than any project I've ever worked on (and I worked on something affectionately, if a little inaccurately, called the "Petabyte Project" at Cornell).
When we spoke with large-scale DB vendors, we got the impression that we would be the largest client they had in terms of size and data turnover frequency. Hence, we built an almost entirely home-grown solution.
We're big enough that we see machine failure and data corruption routinely. The data is big enough that develop-deploy-debug cycles take days. Both Ben and I have lost more than a few nights of sleep dealing with these issues.
Fantastic update Nick.
With Google officially coming out and saying they have 1 trillion URLs, do you think that's your target goal for Linkscape's index?
(I think it's safe to approximate that G has 1 trillion+, maybe 1.2 trillion by now... who knows; with the web's reach growing, it's assumed to keep growing.)
May I ask, where are all your URLs by geography?
Is it predominantly China/Japan/India? Or is it solely the US/UK/AU?
I think some difficult goals (maybe....at least from my perspective):
- Continuously fighting the duplicate battle
- Spidering PDFs
- Spidering Flash
- Spidering through Javascript
- Finding pages that have few/no links to them (the crap people submit to SEs)
- Links that are generated based on IP (geo-targeting links)
- (Maybe) Links within iFrames
I watched this video about Google's evolution, and they talked about how they used to spider the entire web in a couple of months and now they can spider the web in a few minutes. WTF! Geez. (I can't find the specific link, but hopefully the next paragraph will help bring some weight to my statement.)
I doubt Moz has the same depth of servers as Google, but Google uses 1,000 servers to process a single query in 0.12 seconds.
Keep up the fantastic work. I'm going to try to make our own tool, but it would take at least 6-12 months before I could even enter the web-crawling game. By then, who knows what the web will look like. ;) [I make no promises on that tool, but heck, by then you'll probably devour more of the web than Google. :D And if that's the case, I might just leave it up to the experts. :D]
Great job, I look forward to seeing how Linkscape develops.
Just thought this was a good find about young Google (something to add to my original comment).
Joshua,
The 1 trillion number is a bit of a hoax. As it turns out, we've "seen" many more URLs as well. The question is: what is rank-able content? A lot of people acknowledge that the GG index doesn't have anywhere near all of those URLs.
We've seen that same video. And actually, between it and the research above, it makes us think we're much closer to the engines than any of us previously thought. If their index fits in memory on those 1,000 machines (not saying that it does), and each has, say, 16GB of RAM (keep in mind that GG uses commodity hardware), then we're talking about ~16TB of data. That's in the neighborhood (bigger or smaller, depending on how you count it) of our dataset. And we don't have a keyword inverted index or a page cache. Even at 64GB of RAM each, we're still on the same order of magnitude.
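Back-of-envelope, using the assumed machine count and RAM figures above (not known specs):

```python
# Back-of-envelope estimate from the assumptions above: ~1,000 commodity
# machines with 16 GB of RAM each.
machines = 1_000
ram_gb = 16
print(machines * ram_gb / 1_000, "TB")  # -> 16.0 TB; 64 GB each gives 64 TB
```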
All of this (especially the research above) suggests that we are doing the right things and are providing a very reasonable view of the web.
Hmm. Sounds like you've done your homework. :)
I trust your judgement.
Ultimately it comes down to if the tool is useful for your purposes.
So really it's I who have to trust your judgment :)
I hope we can live up to that standard.
Haha. You're too kind, being all 'servant-ish'. :D
I was just throwing out some ideas. But I guess you already tackled them pretty easily, if you have the whole web in your hands. :)
The web being 16 TB? That sounds pretty small. With memory doubling every year, we'll hold 16 TB in a few years. :P
I love Linkscape. I really do. I just think there might be some data missing. Many times when I use a Linkscape credit and compare it to Y!SE, it's missing ~20-30% of the links. And with Google being a bigger dog than Yahoo, it could be more.
But again, maybe it's just politics. Maybe G is trying to impress everyone. Idk.
Just one micro-perspective from a little guy. :) I trust your judgement. You're a far smarter guy than me about SEs. You've actually read some of the patents; I've read the summaries. :P
Keep up the great work.
This is actually a really good discussion to have :)
We're definitely still missing plenty. When I say we're in the neighborhood, I mean we have a big chunk of the data; we have enough data to be extremely relevant (see the numbers above). But we are still growing to be sure.
Please, keep the feedback coming!
That's all folks! :)
This site is very good and very useful.
Hey Nick,
Excellent post on the update; we (along with many other people, no doubt) have been waiting for some data refinement for a while now. It's excellent to see Linkscape evolving; even at its inception it proved to be one of the most useful tools available, despite requiring a bit of calculated guesswork.
Once the data is satisfactory (in terms of size, quality and accuracy) I would also love to see some historical comparison between reports.
Hey Nick, those are great Linkscape improvements. Well done; I love the data Linkscape returns.
You mentioned:
'Of course, we discount nofollows, while Y!SE (in our experience) includes them.'
In my experience I have also noticed Y!SE returning results for nofollow links, even for poor-quality nofollow links. I think it is great that Linkscape discounts nofollows, as they can hamper clarity when analysing inbound links, unless those theoretical 'nofollow' links actually turn out to be less 'nofollow' and more 'dofollow' than we think they are. : )
Lastly, I agree with Andrew that it would be good to have the ability to compare several reports for the same website over a set period, e.g. 6 months... otherwise I end up making local copies of Linkscape reports and even printing them out and archiving them for better benchmarking.
This is a good point. I know I've seen some good discussion around how nofollow is interpreted. We will continue to refine how we interpret it to match the "relevant web" (i.e. rankings).
I want to leave myself plenty of wiggle-room for how we treat all of these things.
Will wrote a pretty good post regarding the same (and lots of other stuff).
Very helpful tool and glad to see improvements.
I'm new, so I haven't seen or used this tool before; just playing with it now.
Feel free to shoot a PM or twitter message my way if you have any questions or need some tips.
That goes for everyone. I've seen Q+A submissions about "how to use Linkscape". No need to spend a credit.