Last week, we launched a new Linkscape update with data crawled and indexed in August. Several folks noticed some significant changes in this index, particularly in link counts and some PA/DA metrics. I wanted to take some time in this post to talk about Linkscape's data, our process, some of the challenges we're facing and what you can expect to see with the index over the next several months.
Before we do that, here are the stats for the latest update:
- 45,200,112,724 (45.2 billion) URLs
- 425,981,698 (425.9 million) Subdomains
- 98,785,848 (98.7 million) Root Domains
- 373,046,145,690 (373 billion) Links
- Followed vs. Nofollowed
  - 2.22% of all links found were nofollowed
  - 58.7% of nofollowed links are internal; 41.3% are external
- Rel Canonical - 10.12% of all pages now employ a rel=canonical tag
- The average page has 80.08 links on it
  - 66.71 internal links on average
  - 13.37 external links on average
If you've been paying close attention to the stats on the Linkscape index updates, you might have observed that for the past year, domain diversity (the quantity of root domains in the index) and overall size (the number of unique URLs) appear to have an inverse relationship. When we have larger indices, we crawl fewer domains and when we crawl more domains, we tend to have fewer pages from them.
Here's a graphical comparison starting in August of last year:
As you can see, when we've crawled a larger number of unique domains, we've crawled fewer individual URLs. This has long been a frustration and an artifact of some of the systems we've used to build the service. In April of this year, we began testing a new crawling system that we hope will let us reach both depth and breadth, but there are a lot of complex, hard-to-build steps we need to take first to scale processing, fix bugs and streamline Linkscape's architecture.
Our VP Engineering, Kate, recently addressed this in a Q+A on the topic:
Hi everyone!
I just wanted to add a quick response to shed a bit more light on the situation. Last year we started on a project to drastically improve our index. The first part of that was to make our crawler discover more of the web - this included crawling deeper on domains, discovering more links faster (freshness), and capturing more links overall.
Background
To understand the changes, it might help if I explain how our crawler used to work and how we changed.
Our crawler used to crawl the web (for 3-4 weeks), then we would compute the link graph and create all the lists of links and metrics you see in Open Site Explorer - this is what we called processing (and it would take 2-3 weeks). As part of processing, we would select the top 10 billion URLs to crawl, and then start crawling those.
The problem with this system was that the data could be 7-8 weeks old (crawling time + processing + deployment to the API and OSE). It also wasn't recursive - meaning we would only discover new links when we processed that crawl, so it could take us several months before we would see new links that were deeper in domains.
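To make that cycle concrete, here's a toy, self-contained Python sketch of the batch approach - the function names and the inlink-count stand-in for real metrics are illustrative assumptions, not Linkscape's actual internals:

```python
# A toy sketch of the old batch cycle: crawl everything, then process,
# then reseed. Illustrative only - not Linkscape's actual code.

def old_batch_cycle(seed_urls, fetch, top_n=10):
    """Run one full index cycle; fetch(url) returns that page's links."""
    # Phase 1: crawl every seed URL (historically ~3-4 weeks).
    pages = {url: fetch(url) for url in seed_urls}

    # Phase 2: "processing" (~2-3 weeks) - build the link graph and,
    # as a stand-in for the real metrics, count each URL's inlinks.
    inlinks = {}
    for source, links in pages.items():
        for target in links:
            inlinks.setdefault(target, set()).add(source)

    # Phase 3: select the top URLs to seed the next crawl. Links
    # discovered this cycle aren't fetched until the next one, which
    # is why the published data could be 7-8 weeks old.
    ranked = sorted(inlinks, key=lambda u: len(inlinks[u]), reverse=True)
    return inlinks, ranked[:top_n]
```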
The Changes
We modified our crawler so we are crawling all the time - we crawl sites every day, week, or month, based on authority. As we crawl those sites, any new links we find are added to one of the buckets and will typically be crawled within that same index. This is exciting because we can go deeper, discover more links, and produce a higher quality index. The other benefit is that since we are crawling all the time, we can just take a snapshot of the crawl and run processing - without waiting for the last round of processing to finish - and this means we can update the index more often.
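Below is a minimal sketch of what that bucket scheduling might look like; the authority thresholds and the discount applied to newly discovered links are made-up illustrative values, not Linkscape's real configuration:

```python
# A toy recrawl scheduler: sites are bucketed by authority into daily,
# weekly, or monthly crawl cadences, and newly discovered links enter
# a bucket immediately. All thresholds here are illustrative.

from dataclasses import dataclass, field

@dataclass
class CrawlScheduler:
    daily: set = field(default_factory=set)
    weekly: set = field(default_factory=set)
    monthly: set = field(default_factory=set)

    def assign(self, url: str, authority: float) -> None:
        """Higher-authority URLs get recrawled more often."""
        if authority >= 0.8:
            self.daily.add(url)
        elif authority >= 0.4:
            self.weekly.add(url)
        else:
            self.monthly.add(url)

    def discover(self, new_url: str, parent_authority: float) -> None:
        """A newly found link enters a bucket right away, so it can be
        crawled within the current index rather than waiting for the
        next full processing run."""
        self.assign(new_url, parent_authority * 0.9)  # toy estimate
```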
However, in June we had a problem with the old crawlers, and we had to roll out the new version of the crawl and index with the OSE launch on July 27th. So even though our testing looked good when we released the new index, and correlations were higher than with the old crawl, we got complaints about things that were wrong.
The Issues
Binary files were in the index - There are normally only supposed to be links in the index, but because the new crawler went very deep on some domains, we started discovering all sorts of binary files which, when parsed, produced lots of weird links. So domains had all these links from sites that didn't actually link to them. We fixed this issue, and this is the first index with the fix.
We went too deep on big domains - There are a lot of knobs to turn on the new crawlers, from the number of sites we crawl daily/weekly/monthly to how many links we keep for different domains. One of the first things we noticed with this new crawl was that we had fewer domains in our index. So we dialed down how many URLs could come from a single domain - and this new index also contains that change.
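To illustrate, here's a hedged sketch of both fixes together; the accepted content types, the per-domain cap, and the function name are illustrative assumptions rather than the actual Linkscape settings:

```python
# A toy filter combining the two fixes: skip non-HTML (binary) files,
# and cap how many URLs any single domain can contribute. The values
# below are illustrative, not the real configuration.

from urllib.parse import urlparse

HTML_TYPES = {"text/html", "application/xhtml+xml"}
MAX_URLS_PER_DOMAIN = 100_000  # "dialed down" per-domain cap (made up)

def should_index(url, content_type, urls_per_domain):
    """Keep a fetched page only if it's parseable HTML and its domain
    hasn't already contributed too many URLs to the index."""
    # Fix 1: binary files, when parsed as HTML, produce garbage links.
    if content_type.split(";")[0].strip().lower() not in HTML_TYPES:
        return False
    # Fix 2: don't let deep crawls of huge domains crowd out diversity.
    domain = urlparse(url).netloc
    if urls_per_domain.get(domain, 0) >= MAX_URLS_PER_DOMAIN:
        return False
    urls_per_domain[domain] = urls_per_domain.get(domain, 0) + 1
    return True
```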
What We Are Doing
We recognize that all of you depend on this data. And we take the index quality very seriously.
We have already made a lot of other changes, increasing the overall size and adjusting how we crawl. However, since it still takes 2-4 weeks to process an index, some of those changes won't be visible for another 2-4 weeks.
We are also working on an updated, higher-correlating Page Authority/Domain Authority that should be out in a month or two - though those scores may also jump around a bit.
What You Can Do
Definitely keep sending us feedback. It really helps us understand what we may have missed in our testing and what we can do to fix it. Thanks again for your patience - we really want to deliver the best possible Linkscape for you, and I assure you the team is working nights and weekends to address these concerns. If anyone has questions, you can always email me or our help team (which tends to respond to emails much faster); all of us care a lot and really want to hear your feedback.
Thanks again,
Kate
On Friday night, I stayed late at the office with a number of folks from the Linkscape team (pictured below during their morning standup):
(clockwise from Martin in the center; Alec, Phil, Brandon, Carin, Matt and Walt)
There are big, tough problems around building a web index, particularly on a budget like ours vs. those of Google or Bing. We brainstormed a lot of ideas, but the big challenge comes down to this: any change we make today won't be observable for at least 5-6 weeks, making for a very slow iteration process. In software engineering, the faster you iterate and the faster you learn the impact of your changes, the faster you can improve. Linkscape doesn't provide a fast feedback loop today, and we know we need to address that before we invest tons of effort in improvements that "might" have a positive impact.
I can promise, however, that the team of engineers working on this are among the smartest, most capable, diligent and passionate people I've ever worked with or met. We know there's going to be 3-4 more months of hard slogging and indices of only moderately improved quality before we reach the levels we really want (our internal goal is 100 billion URLs in an index while maintaining domain diversity above 110 million root domains).
You can definitely help us by providing feedback when you think we've missed an important site or page, when metrics look out of whack or when something goes awry in OSE, the mozBar or your web app campaigns. We really appreciate your patience while we improve and your support for the Linkscape dataset. The team can tell you that I take our struggles personally and hard, but I'm incredibly bullish on what we'll be producing by the end of the year.
What to Expect in the Next 3 Months
- We'll have a new index out in just 7-10 days that further addresses some bugs (and has some more freshly crawled pages, too)
- Index sizes - look for 44-55 billion URLs; we probably won't get much over that until December, possibly later
- Domain diversity - look for 100mil+ starting in the next index, and likely maintaining near that or above for future indices
- Index updates may slip past 4-5 weeks as we try to make more fixes ahead of a new crawl or processing cycle (we'll keep the Linkscape calendar updated to make this a transparent process)
- We're releasing a new version of PA + DA that are likely to be much better correlated with Google rankings (giving a superior metric to judge the ranking potential of sites/pages). This might, however, result in some sites + pages rising or falling dramatically. My best advice here is to use your competitors and industry cohorts as a bar for comparison rather than just looking at the raw numbers over time (since the metric itself is changing, a "40" in October might not mean what a "40" means today).
Looking forward to hearing from you - the engineering team, along with myself and Kate, will be paying close attention to the comments on the thread and to any private feedback or emails to [email protected] on this topic as well. Thanks again - it's an honor to have such a great community of folks paying careful attention and deriving value from our products. We promise to live up to the high expectations you've got for us.
I just LOVE this - Linkscape updates every 4 weeks. Randfish for president?
Hi Rand,
Thanks for taking the time to keep the information flowing and for explaining the challenges that need to be overcome so we can all have some basic understanding of the way ahead.
I have to say that managing all the complexities in play seems just mind-boggling to me!
It's great to hear another update is not too far away.
It would have been nice if things could be easier for the team as they put their collective shoulder to the wheel, but as people who work in the world of SEO, we all know that the easiest path is usually not the best. :-)
I'll be looking forward to being a part of this next phase of growth for Linkscape and OSE. Sending feedback brings with it an extra little rush of excitement now...how awesome would it be if it were my little comment that unlocks some piece of the puzzle along the way?! :-)
I'll borrow from something I said in a recent Q&A thread to send a little message both to the SEOmoz team and to the rest of this awesome community who I know are committed to helping in the process:
"Also - when the problem seems insurmountable, remember two things ...
Sha
Love the tool, but I get frustrated when I see a link with high authority that's not cached by Google. It would be nice if there were an easy way to toss those links out. For example, the tool loves the spammy garden web directory and crawls it super deep and scores it high, whereas Googlebot cuts off its crawl much shallower.
Hi Sean,
That sounds like exactly the type of feedback that might be of use to the engineering team!
If you haven't already sent a detailed explanation of the issue to the help team, then I hope that you will soon.
Who knows, maybe it will be your comment that helps fit another piece into the puzzle! :)
If you've got some examples, that would be awesome! Like Google, we use mozRank (well, technically, they use PageRank, but the two algos are quite similar) to prioritize our crawl of the web, but unlike Google, we don't have sophisticated spam scores, so we often get "fooled" by those who manipulate the web's link graph. Working on a good spam metric is a priority we'll focus on following the upgrade of depth/breadth.
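For a sense of how link-based crawl prioritization works, here's the textbook PageRank power iteration - to be clear, this is the standard published algorithm with conventional defaults, not mozRank's actual formula:

```python
# Textbook PageRank via power iteration. Standard algorithm only -
# not mozRank. graph maps each page to the list of pages it links to.

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in new_rank:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:  # ignore pages outside graph
                        new_rank[target] += share
        rank = new_rank
    return rank

# A crawler would fetch the highest-ranked URLs first, e.g.:
# pagerank({"a": ["b"], "b": ["a", "c"], "c": []})
```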
If you want to send the specifics over, the feedback tab of OpenSiteExplorer (https://www.opensiteexplorer.org/) on the left-hand side is a great place, or you can email the help team.
Thanks Rand,
It's exciting to see the progress your team has been able to achieve with Linkscape and the direction you're heading.
Since links are such a big part of the search engine equation, we are extremely fortunate to have this awesome tool at our disposal.
Just reading the numbers you have thrown out here makes my head spin trying to calculate the sheer computing power it must take.
Thanks to you and your team for such great work and for making it available to us.
Nofollow links are not that bad, I think. Still, 2.22% seems okay.
I think we all understand that the comprehensive crawling behind a Linkscape update is hard stuff. If problems arise, we all know you'll do your best to solve them.
We should be glad that you provide us the data at all!
Can you explain what this brings to the party (a.k.a. us)? I don't have the fullest understanding of the whole Linkscape benefit. I know, of course, that you use it for metrics and for getting a deeper picture of certain URLs, but I regularly see URLs/domains that haven't been visited despite being around for a bit.
This isn't a critique as your post does explain well the effort required and the fact that you are competing against the big players.
But is it an uphill battle with no end, one where you will always be behind compared to the bigger players?
The latest index updates, as you can see from the charts above, are better in some ways and worse in others. We estimate that, in total, Bing + Google keep ~130 million root domains and ~150 billion pages in their main indices at any given time. Our current scale doesn't let us get there, but in the not-too-distant future, we hope to achieve 75%+ and possibly 90%+ of what the search engines keep from the web.
My goal with explaining the challenges we're facing and laying out the roadmap is to do exactly what Gyorgy described above - make sure folks are informed rather than surprised.
We'll likely never be 100% of Google/Bing in terms of all three dimensions - size, breadth and freshness - but we do think we can get very close, and we've seen that the closer we get, the better and more useful our tools and metrics become.
Rand, as a sysadmin/Linux dude (with a strong interest in data caching, so I know what it means, since I know you guys have a flat-file architecture) before even being an SEO, I would be really excited to see what the SEOmoz datacenter looks like...
S-I-C-K update!!
Keep up the great work guys!
@Rand, sir, this is the first time I've come to know that SEOmoz is working on such a great project. I'm not smart enough to give suggestions on such a big subject, but I'm commenting because the way you shared your improvement statistics with us is really respectable and noble. I really loved this post and wish you and your team all the best for a bright future with the project. Thanks.
Good stuff. I'll be happy to see some fresher links
Same here. I can't wait.
Based on a direct case study comparing Majestic and SEOmoz data, the two are more or less the same - I think that means both are close to the actual true data.
I use SEOmoz because it looks much better :)
Very interesting data from the Linkscape crawler; it's amazing that so much data can be "farmed", of sorts, from Google without any issues. Gotta be tough to get all those inner-page URLs, and I see it's mentioned that you've had some success fixing that - much props.
Hi Rand,
Well, it’s finally nice to see a public acknowledgement of this issue. We have been badgering you and your team constantly for over 2 months (since the back end of July) about this and didn’t really get a straight answer on the matter until now.
It’s a shame that your whole TAGFEE ethos slipped a little here as people are paying substantial sums of money to use Linkscape (and all tools based on Linkscape data) and to not be told that 75%+ of the data is effectively unworkable is a little sad.
I guess it also has further repercussions as many people (I guess foolishly) blindly take your data for granted and base SEO campaigns upon it; many of which will have been based on defunct data sets.
It’s especially a shame for ourselves as we had just built an awesome tool using your full API access which we haven’t been able to use since the summer because of the issue. (SEOmoz were kind enough to halt our payments for the API though until the matter is resolved).
Glad to see you know what the problem is though and that you have put the PR wheels in motion with this post to try and appease people like myself from wondering if you were trying to keep the whole issue hush-hush!
I look forward to the fix and using your API again!
Paul
Hi Paul - really appreciate your feedback and understand your frustration. Let me address a few specific points you brought up:
If you have any specific questions or issues, please feel free to email me directly - happy to follow up more. It sounds like you might be perceiving or feeling even more problems than the ones of scale/reach I described (and Kate noted) above.
Cheers!
Hey Rand,
Oh no, it’s not that we were being ignored - we were in fact getting immediate responses from both Twitter and your helpdesk. It was more that no one was answering our questions! The cynic in me was sensing a reluctance to discuss or even acknowledge the issue at all.
Your last bullet-pointed statement here, however, leads me to believe that we're still not on the same page and that you're not picking up on the same issues we are.
I’ll admit the 75% figure was plucked out of thin air for illustrative purposes; however, I would expect it to be wholly accurate, if not understated.
I will pull together an email for you that includes some screen grabs (of which we have provided for SEOmoz previously) so you can see exactly what we’re seeing and hopefully we can finally iron it out!
Thanks,
Paul
I'd love to know the nitty-gritty behind HOW your OSE crawler works... what it's built on, how long it takes to crawl, the hardware/bandwidth requirements, why Google doesn't nuke you for it, and all of that stuff. I'm almost positive I don't have the chops to build a full one myself, but it would be interesting to see how it all works.
Thanks for the updates - it's good to hear the detailed information about the indexing issues. It's better to be informed than surprised. ;-)
Very interesting to see that only 2.22% of the links are no-followed. I thought it would have been slightly higher than that.
As difficult as this is, thanks for being so open. SEOMoz is just one source of data and, although SEO campaigns are going to be based on this, you'd be mad to use it as your only source. Good luck getting things fixed! :)
Domain diversity and overall size have a completely inverse relationship - we can easily check it in the data. Thanks for producing such important data.
I think merging with MajesticSEO would be a good idea for building a web index. lol
They seem to have a great index, but I'm still in love with SEOmoz because of your link metrics.
BTW, I have a login problem with the mozBar on Firefox 6.0.2; I've already raised this issue via email.
Also, I noticed several of my websites' link metrics dropped, especially PA and DA. Is this because of that change?
Majestic has built a very large index, thanks to their process - they actually do things very differently than us or the engines, but it's impressive to see their size and reach. We've talked on occasion w/ the Majestic folks (who are great people, BTW), but for now, we're going to try slogging out the next 6 months and see if we can maintain the quality, metrics, canonicalization and index structure we prefer AND reach the broad size of the engines.
Re: link metrics drops; yeah, these are likely related to the link counts from dropping binary files as well as the shift in index focus (growing diversity and sacrificing some raw size). As I mentioned in the post, if you're an outlier (i.e. everyone else in your sphere - competitors/etc - has stayed at similar metrics but you've dropped or risen substantially) please send us feedback so we can check it out.
Thanks!
My competitors' link metrics also dropped a little bit, though not as much as what happened to my website.
But it seems that it doesn't correlate with my rankings.
I'm getting more long-tail keywords from organic search, my rankings remain the same, and some have even gone higher.
BTW, thanks Rand :D
Yudhis
One has the index. The other has the great link-valuation algorithm. Who is going to win the race? And more importantly, will anyone win it before the value of paying attention to total links diminishes?
Hi Rand,
I think Alex @ MajesticSEO would be interested in helping you if you come across major issues.
Thanks.
Amazing data!
Good to know that people have started using nofollow and canonical.