Over the last few months our engineering team has been working feverishly on a new index. Our crawlers were extremely successful, but we ran into a few bumps in the road along the way. Those proverbial bumps shifted our expected launch date from the beginning of April... to today. It's not all bad, though.
The new index went live today, and it's big. 159 billion URLs big.
That said, we learned an important lesson in all of this -- maybe bigger isn't better. Our community has voiced its opinion and we agree: consistent index launches are essential.
That's why we've made changes to dial back our crawlers to a manageable size in order to address the community's number one concern, releasing indexes on a reliable schedule. Our overarching goal is to increase index size over a long period of time, all while updating our processing architecture to maintain a reliable release schedule, starting with our next index that'll be launching in the coming weeks.
Before jumping into the juicy details of this index, I wanted to first point out that the other news (ahem) we announced today will play a large role in alleviating these delays in the future. Plans include both growing our team of engineers to solve these complex scaling problems and investing in the necessary computing resources to consistently produce a larger index. Money can't do everything, but it sure doesn't hurt when it comes to index consistency. :)
Here Are The Latest Stats
- 159,751,604,443 (159 billion) URLs
- 1,114,893,161 (1.1 billion) Subdomains
- 153,439,996 (153 million) Root Domains
- 1,768,519,682,804 (1.7 trillion) Links
- Followed vs. Nofollowed
  - 2.47% of all links found were nofollowed
  - 64.05% of nofollowed links are internal
  - 35.95% are external
- Rel Canonical - 11.13% of all pages now employ a rel=canonical tag
- The average page has 82.90 links on it
  - 71.75 internal links on average
  - 11.15 external links on average
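For a feel of what those percentages mean in absolute terms, here is a minimal sketch that turns the figures above into rough link counts. The constants are copied from the stats list; the function and key names are made up for illustration, and because the published percentages are rounded to two decimals, the counts are approximations rather than exact index figures.

```python
# Figures copied from the stats list above.
TOTAL_LINKS = 1_768_519_682_804
NOFOLLOW_SHARE = 0.0247           # share of all links that are nofollowed
NOFOLLOW_INTERNAL_SHARE = 0.6405  # share of nofollowed links that are internal
AVG_INTERNAL_LINKS = 71.75        # per page
AVG_EXTERNAL_LINKS = 11.15        # per page


def derived_stats():
    """Return approximate counts implied by the rounded percentages."""
    nofollowed = TOTAL_LINKS * NOFOLLOW_SHARE
    nofollowed_internal = nofollowed * NOFOLLOW_INTERNAL_SHARE
    nofollowed_external = nofollowed - nofollowed_internal
    return {
        "nofollowed_links": round(nofollowed),
        "nofollowed_internal": round(nofollowed_internal),
        "nofollowed_external": round(nofollowed_external),
        # Internal plus external should reproduce the 82.90 average.
        "avg_links_per_page": round(AVG_INTERNAL_LINKS + AVG_EXTERNAL_LINKS, 2),
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,}")
```

The nofollowed total comes out in the tens of billions, and the internal and external averages add back up to the 82.90 links per page quoted above.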
A Few Caveats
I know, I know -- there's always got to be a catch. So, the index isn't 100% up-to-speed right off the bat. But it's close. In an effort to get the new index out, we had to make a few sacrifices. Namely, our Anchor Text call will still be pulling from the old index when queried, meaning that if you request Anchor Text information it will be slightly dated. The legacy Anchor Text funnel will only hang around for 6-8 days from now, until we roll out a refresh to the index. Then, all will be back to normal.
What's This Mozscape Stuff?
Finally, you may have noticed Linkscape's shiny new API pages and snazzy new name. The short of it is that Linkscape is now Mozscape, both to better scale our API naming conventions and to refresh the brand. Along with that refresh came a drastic increase in speed (bumping the API rate limit from one request every 10 seconds to 10 requests per second on the paid levels), some great case studies of folks using our data, and a simplification of our pricing model. It's just the beginning of the big plans we have in store.
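If you're calling the API from a script, a small client-side throttle is one way to stay under a limit like 10 requests per second. The sketch below is generic and makes no assumptions about the Mozscape endpoints themselves: `Throttle`, `fetch_all`, and the `fetch` callable are hypothetical names for illustration, and `fetch` stands in for whatever function actually makes your request.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval


def fetch_all(urls, fetch, throttle):
    """Call the placeholder `fetch` once per URL, pausing as needed."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With `Throttle(10)`, successive calls are spaced at least 0.1 seconds apart. The clock and sleep functions are injectable only so the behavior can be checked without real waiting.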
Enjoy the updates, and if you've got any questions about the new index or Mozscape, drop me a line in the comments or via email.
Index update 5/9/12 (from Carin)
The full index is now live! This index is exactly the same data as what was released on 5/1, but includes the updated Anchor Text views.
I think many companies often get obsessed with something and lose focus on the big picture. I know that for PR it's much better to release new stuff often. However, to keep clients satisfied it's important to have a good, stable, robust product. Increasing the index and then decreasing it, or having a relatively slow update and then slowing it even more, causes problems for us - your clients. So my suggestion is, don't be like Google. Focus on doing well what you do instead of trying to be something else.
Lately I had some issues with SEOmoz. I got the e-mail update with new rankings, but in the campaign I got different results. That was the problem for the last 2 updates. Also, my ranks in rank tracking weren't updated this Monday as they should have been; it's still old data. When you add these problems to index freshness, then you get multiple small issues, but those are affecting our businesses. Don't forget the little guys on whose backs you grew the company (like Google is doing recently). What we need the most is a reliable and stable product, not new features every 1-2 months.
IMHO SEOmoz is one of the greatest SEO and IM resources overall on the web. I am a mozfan of course, and I like the fact that SEOmoz is listening to what the community is saying. I just wanted to give my thoughts on how I see things regarding the future of the product on which so many of us depend.
Best comment I have read in weeks.
Thanks for the thoughtful comment, Davor. We hear you loud and clear, and it's very much in line with our vision. The complexities with index consistency and size are something that we have an entire team dedicated to, as well as a hefty hardware investment that's on the horizon. It's a main focus for us, and is crucial to our product, as well as our users. We realize that.
Rest assured, we're working late into the evenings not to add features, but to fix the consistency and size issues that we've seen with the index over the past few months. Thanks again -- I hope that gives you better insight into where our head is at.
Still no sign of backlinks from Hubpages or Squidoo even though they have massive domain authority! Why aren't these being indexed by Mozscape?
We do index them, but those sites are massive, and we have limits on how far we'll crawl on very deep domains (due to both politeness issues and to create more index diversity). Over time, we may get deeper on those sites (and we're more likely to index those pages if they have external links pointing to them, rather than just the internal link structures).
I see pages I created in February are not getting Page Authority and MozRank. Also, over the past 2 months we launched 2 infographics and started a massive guest blogging campaign. How can my Domain Authority increase by only 1-2 points when over 2 months I got more than 400 links from really good blogs?
Hope I get my data some time this year :)
Totally correct. Some people pay for these services; not me exactly, but my bosses do. My team collects information about our sites and my bosses expect me to present fresh and correct data every month, but how can I do that when the source service is not working correctly? Update from February, huh? Not so huge an update, only 2 months old.
Next update should show data crawled primarily in late February through to end of March / early April, and be out there in 4 weeks or less, so you'll likely see this reflected then. The bigger processing has meant we're slower than normal in freshness, but we're working hard to catch up (and now have some funding to help with that).
We acquired about 75 new links since the last update and our DA only went up one point too. Perhaps it's because the higher your DA is, the harder it is to increase it? The whole algorithmic thing?
I'm rather confused.
When will the SEOmoz Competitive Domain Analysis match up with the Open Site Explorer with all this new info?
I just ran reports and now they are all different?
Should be very soon (next time your web app data / rankings / etc is updated). In the meantime, the OSE "compare" tab can perform a very similar function (albeit not as integrated - sorry about that).
Playing with the data now. Thanks SEOMoz.
There is a MASSIVE difference between my information in GWT vs. OSE. I know RogerBot is nowhere near as advanced as GoogleBot, but it's all I have for competitor analysis.
I agree. I recently launched a subdomain and it shows no links in OSE, while GWT has around 30+. The subdomain is about a month and a half old.
Hi Francisco - all the various indices (Google, Bing, SEOmoz, Majestic, etc) will have different crawl sizes, depths, things they keep in index, canonicalize, etc. so the numbers/data will always be different between them. Think of it like a census - no one's polling everyone, it's all sample sets, so while ratios will often be very similar, the raw counts are going to be all over.
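The census point can be illustrated with a toy simulation. Everything here is an assumption for illustration only: the size of the "web", the two crawl fractions, and the function names are invented, and the 2.47% nofollow share is simply borrowed from the stats in the post. It is not how any real index is built. Two crawlers that randomly sample different fractions of the same links report very different raw counts but nearly the same ratio.

```python
import random


def build_link_population(n_links, nofollow_share, rng):
    """Return a list of booleans; True means the link is nofollowed."""
    return [rng.random() < nofollow_share for _ in range(n_links)]


def crawl_sample(population, fraction, rng):
    """Simulate an index that only sees a random fraction of all links."""
    k = int(len(population) * fraction)
    return rng.sample(population, k)


def summarize(sample):
    """Raw link count and nofollow ratio for one simulated index."""
    nofollowed = sum(sample)
    return {"links": len(sample), "nofollow_ratio": nofollowed / len(sample)}


rng = random.Random(42)  # fixed seed so runs are repeatable
web = build_link_population(200_000, 0.0247, rng)
big_index = summarize(crawl_sample(web, 0.50, rng))
small_index = summarize(crawl_sample(web, 0.10, rng))

if __name__ == "__main__":
    print("big index:  ", big_index)
    print("small index:", small_index)
```

The raw link counts differ by a factor of five, while the two nofollow ratios land close to each other, which is the sense in which ratios stay comparable across indices even when raw counts do not.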
Is there a point of getting OSE index to ever grow as big as Google's index? I bet that it would be nice and it would make all of us SEOs and Webmasters happy but, does it make business and operation sense?
I have been telling many people in Q&A to use OSE as a guideline, not as an absolute truth of the internet.
It's great to get somewhat of a picture of the competition and to see if one's own campaigns are growing or not. Although I turn to GWT for root domains, I turn to SEOmoz for almost everything else.
I remember when you launched the WEBAPP. It's still very valuable to me. SUPER VALUABLE!
Hi, we have a separate PRO account. How do we take advantage of this ocean of potentially useful data for our SEO efforts? Thanks!
If you don't have an API key, you can generate one here -- that's all you need!
Just in case you haven't visited it, all of this data is also available to PRO members in Open Site Explorer as part of your PRO subscription! Check it out at www.opensiteexplorer.org. You get full access!
Roger! You are getting bigger and biggerrrrrr day by day. I am pretty excited about what will be next.
BTW I am happy my website DA and PA increased :)
Glad to see the update guys, and happy to see the new funding will help scale size and frequency.
Can you give a rough idea of when the crawl period was? It looks like the 2/28 update had crawl data from Dec/Jan for example. Just want to get an idea to compare to our link building logs.
Hey Kane - this index will contain crawl data from about mid-December to the end of February.
Thanks for the quick response Carin, that's very helpful.
Congrats on all the good news, guys. It sounds like you worked super hard on this update. That had to be stressful. Great work all around! Checking the new data now...
Will come back in a few days when the page loading times have improved. Don't know if the slowness will carry on because your new index is bigger.
Yeah, I'm getting all kinds of errors saying that something went wrong. It's probably just because everyone's on it at once.
Hey guys - the API slowness is from the latest rollout. We are in the process right now of redistributing the load - we're seeing a lot more traffic than we have in the past few weeks :)
You should see this slowness disappear in the next few hours!
I'm comparing the newest index to Ahrefs, and Ahrefs still seems better! Why is it that everyone says it's not as good? I'm not at all experienced with Ahrefs and don't know much about it. Is it not fresh either? Not accurate?
Ahrefs' SERP analysis is great. It's not as clunky as SEMRush and I love how accurate the data is. Moz can't touch it. The daily fresh links are also untouchable by Moz or Majestic. The link value from Ahrefs seems a bit off, and I have found no documentation that explains the value at all. I have links Majestic found right away that Ahrefs still hasn't found. As far as packaging your work for clients, nothing is as pretty as Moz. To actually get work done, I utilize the other two now.
OK, so the bottom line is that everything about Ahrefs' index is better except for accuracy of link authority? That's pretty impressive, then. I guess I'll use Ahrefs to view my backlinks and the MozBar for DA and PA and I'll just be all happy with everything.
Hi Andrew! Congratulations for this. Thank you so much for sharing this stat. I know that you and the team really worked hard to achieve your goal. Please keep us updated.
Update to the index deployment: The full index is now live! This index is exactly the same data as what was released on 5/1, but includes the updated Anchor Text views.
Wow that's a massive leap! Will see if we can have a look... Thanks
SEOmoz rocks.. Too much data for me to look at! :)
I would really love to see some statistics coming out of this: websites per language, per country code top level domains, per more linking towards facebook or anything else that might be really weird and fun :D
That's a nice side effect of having huge data about stuff that every one of us uses every day :D
Whoa, big data!
That reminds me of the book about Google I've read, on how Larry Page started Google and the problem he had crawling the internet. He was adding cheap desktop computers everywhere at first.. haha! It was a cheap solution that lasted a couple of years until he had to make the big move.
Seems like data for March onward is not there, waiting for that as well. Good work Rand!
Hey Asad - you are right, this index will contain data from about mid-December to the end of February, but we have our next index processing right now that will cover up until early/mid-April. We have that one scheduled to launch in 4 weeks, but we'll get it out there as soon as we can!
I have a new site, launched in March, and after the latest May PR update by Google it's a PR-2 now. This is the first time I've seen a site showing PR-2 while PA, DA, MozRank, MozTrust and everything else are zero! It's kind of interesting to see such a situation; now waiting for the next Linkscape update to see the correlation between the figures!
Fantastic post! Glad to see the update about index news. Can you please tell us which months this crawl data is from?
As Carin noted above, the oldest data is from December, newest from end of February (majority from Feb). Next index will take us through to end of March. We're working hard to bring processing time on large indices down, but in the meantime, may scale back on size so we can better balance with freshness.
Fan fricken tastic. Can't wait for the new anchor text data to come out; will be real useful for post-Penguin analysis.
Thanks Russ! I suspect you're right - though I'd also point out that we have another index processing that may be launched in the next 2-4 weeks that will contain some fresher link data (through to end of March / start of April). That one may be more useful for diagnosing the Penguin update specifically. This one will be really good to look at links acquired by sites/pages prior to March 2012 (since the processing this time around was ~90 days).
Rand, does a processing time of ~90 days mean that only links crawled before the processing started (>90 days ago) would find their way into the new index?
Congrats on all the big numbers this week!
Right! So basically we're talking about end of February for the freshest data in this index. The next index will be through to end of March (but hopefully launched in the next 30 days, maybe quite a bit less). After that, we're working to balance size vs. freshness until we can get processing better scaled, both hardware- and code-wise.
Thanks, good to know how it works.
All looking good Rand, great work on the 'other' news as well!
Speaking of Penguin, we've seen a few bad results since the update was rolled out, and have been doing some in-depth analysis to try and discover where we might have been hit. Very frustrating as we've never used mass link schemes and have made a big effort to focus on content, blogs and social during the last 18 months.
Anyway, my question:
Do updates such as Penguin pose major problems for the Mozscape crawls and analysis? If Google are essentially 'moving the goal posts' with such updates, do you have to adjust the way in which you determine Moz's ranking and authority figures?
I would assume that algorithm changes wouldn't affect it that much. I thought Moz' metrics were based mostly around trying to make an educated guess of the pagerank and trustrank of a page based on what they think google would. This excludes all the other (supposedly 100s) of ways that G uses to rank pages. This is just my outside perspective though.
PA actually reached its highest correlation with rankings ever in this index. Since PA is based on a machine learning system, it should keep up very nicely with changes Google makes (so long as the inputs don't change dramatically - e.g. they switch from using links and features of sites/pages to something we can't access like user/usage data).
Thanks for the reply Rand - all makes sense.
Thanks for getting a Penguin-friendly analysis refresh out so quickly. This upgrade is right on time.
Well done guys! This pleases me to no end, and I will certainly be making full use of all this juicy data. For what it's worth I'm having a few troubles with the Firefox toolbar today, but that could be completely unrelated.
Hey Dan - if you continue to see problems with the toolbar, let our help team know and we can investigate!
Great work, guys! But I'm not seeing any updated data for my campaigns?
Hey! You should see your campaign data updated very soon - next time your web app data / rankings / etc updates. Competitive Link Analysis is pulling from this latest index now.
FYI - You can always use the OSE compare tab when your campaign is slow to update; however, it's definitely not as convenient as having it in your campaign.
Erm... from 70 mil up to 160 in a month! What the F*** :) Bravo to Moz team! This is freaking crazy improvement!
Quoting
"That's why we've made changes to dial back our crawlers to a manageable size in order to address the community's number one concern, releasing indexes on a reliable schedule. Our overarching goal is to increase index size over a long period of time, all while updating our processing architecture to maintain a reliable release schedule, starting with our next index that'll be launching in the coming weeks."
Does this mean we can expect the number of backlinks, linking root domains, etc. for our websites to go back down in the next update?
The next update is as large as the one we launched today, but will end up taking almost 90 days to process as well - sacrificing freshness a little.
In order to strike a better balance with freshness, the index following the 5/29 release will drop in size - down to something we can produce in 4 to 6 weeks instead of 12 weeks.
I understand the freshness aspect, but will backlinks from all URLs in this index still appear in future updates?
I'd like to know this too. It wouldn't look very good to our clients if our links and DA looked like they dropped!
Finally! Glad the index got bigger of course, but I agree that freshness is the most important. In the perfect world we could have both...which might be in a few months?