We're continuing the trend of two index releases each month by bringing you the latest Mozscape index release today - only 15 days after our last release on February 12th! The latest Mozscape index took about 11 days to process, with a fairly significant portion crawled at the beginning of February. The crawl data spans about 38 days, so the oldest crawl data dates back to the beginning of January. You can access refreshed data across all of our applications - Open Site Explorer, the Mozbar, PRO campaigns, and the Mozscape API.
Our Big Data processing team (Martin York, Douglas Vojir, and Stephen Wood) has been working on some really exciting improvements to our processing code base that reduce the length of time processing takes, and has begun development on a highly anticipated new Mozscape index feature:
- The Mozscape index is created in one continuous batch processing pipeline. A massive amount of crawl data is downloaded first, then sorted and organized, and then the computations and magic are applied. Every so often, files get uploaded in a checkpoint step, so that if something catastrophic happens to the index, we'll be able to roll back to a fairly recent step.
Recently the Big Data processing team dug through this checkpointing code to see where they could optimize - and they really optimized! The time needed to checkpoint files varies throughout the pipeline, but the longest checkpointing step used to take about 60 hours to complete… With the optimization from Doug and Martin, this step now takes on average 2.18 hours! Holy time savings!!
- The first few steps in processing are dedicated to organizing how the work is going to be distributed across the entire Mozscape processing cluster. The files are broken out into what are called shards, which are then assigned across the entire fleet of machines. These shards aren't always completely full, which means one machine can finish its work well before another. Martin revisited this code as well to see what type of optimization could be applied. With the help of our master data scientist, Matt Peters, Martin was able to improve the distribution of work, saving around 25% of the time spent processing!
- One feature we hear requested fairly often is including HTTPS crawl data in the Mozscape index. Good news - development on this feature has begun, and we hope to have HTTPS data included in the Mozscape index this summer!
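To make the work-distribution idea above a bit more concrete, here is a minimal sketch of one common way to balance unevenly sized shards across machines. This is not the Mozscape team's actual code; the post doesn't describe their method. It assumes a simple greedy heuristic (largest shard first, always onto the least-loaded machine), and the function names and shard sizes are made up for illustration.

```python
import heapq


def round_robin_loads(shard_sizes, num_machines):
    """Naive baseline: deal shards out in order, ignoring their sizes."""
    loads = [0] * num_machines
    for i, size in enumerate(shard_sizes):
        loads[i % num_machines] += size
    return loads


def assign_shards(shard_sizes, num_machines):
    """Greedy balancing (hypothetical helper): place the largest shards
    first, each time on the machine with the least work so far."""
    # Min-heap of (current_load, machine_index).
    heap = [(0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_machines)]

    by_size = sorted(enumerate(shard_sizes), key=lambda p: p[1], reverse=True)
    for shard_id, size in by_size:
        load, machine = heapq.heappop(heap)
        assignment[machine].append(shard_id)
        heapq.heappush(heap, (load + size, machine))

    loads = [sum(shard_sizes[s] for s in shards) for shards in assignment]
    return assignment, loads


# Illustrative shard sizes (arbitrary units), some much fuller than others.
sizes = [10, 1, 9, 1, 8, 2, 7, 2]
naive = round_robin_loads(sizes, 2)
assignment, balanced = assign_shards(sizes, 2)

# The slowest machine determines when the whole step finishes.
print("round robin:", naive, "-> finishes at", max(naive))
print("greedy:     ", balanced, "-> finishes at", max(balanced))
```

With these made-up sizes, the naive split leaves one machine with 34 units of work and the other with 6, while the greedy split gives each machine 20, so the step finishes when the busiest machine reaches 20 instead of 34.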
Here are the metrics for this latest index:
- 82,275,594,589 (82 billion) URLs
- 9,097,532,641 (9.1 billion) Subdomains
- 148,991,416 (149 million) Root Domains
- 829,267,740,331 (829 billion) Links
- Followed vs. Nofollowed
- 2.25% of all links found were nofollowed
- 56.08% of nofollowed links are internal
- 43.92% are external
- Rel Canonical - 15.43% of all pages now employ a rel=canonical tag
- The average page has 73 links on it
- 62.93 internal links on average
- 10.33 external links on average
And the following correlations with Google's US search results:
- Page Authority - 0.35
- Domain Authority - 0.19
- MozRank - 0.24
- Linking Root Domains - 0.31
- Total Links - 0.25
- External Links - 0.29
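For readers wondering how a number like these can be computed, here is a small sketch of a rank correlation between a link metric and search result positions. The post doesn't say which correlation measure is used, so this assumes a Spearman rank correlation; the helper names and the sample scores are invented for illustration and are not Mozscape data.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: positions 1-5 in one search result and a hypothetical
# metric score for each page. Positions are negated so that "higher score
# at a better position" shows up as a positive correlation.
positions = [1, 2, 3, 4, 5]
scores = [60, 72, 45, 50, 30]
rho = spearman(scores, [-p for p in positions])
print(round(rho, 2))  # 0.8
```

The result always falls between -1 and 1: a value near 1 means the metric orders pages almost exactly as the search results do, and a value near 0 means the two orderings are largely unrelated.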
As you can see from the metrics above, the number of subdomains continues to grow, as we have discovered a small number of root domains that each have a substantial number of subdomains associated with them.
We always love to hear your thoughts! And remember, if you're ever curious about when Mozscape next updates, you can check the calendar here. We also maintain a list of previous index updates with metrics here.
Great work, glad to see the trend of twice a month. Can't wait until it's once a week!
Considering that the last index update took 15 days and that the Internet is constantly growing, I believe you'll have to wait a bit longer until weekly updates become reality.
Congratulations to the whole SEOmoz team for their effort.
Welcome and congrats :)
And I am thankful to the whole SEOmoz team who work and share this nice info with all of us.
Thanks again
Great to see the Mozscape index being updated on a consistent basis. Kudos to the SEOmoz team for this. Mozbot is becoming hyperactive and is consistent in delivering fruitful and updated results to its users on time these days. Not to forget the billions of pages it needs to crawl on the web and the bandwidth used before updating its index.
Congrats and thanks to the SEOmoz technology team on the performance and efficiency gains.
Thanks for sharing. I wish to use this comment to show my appreciation for the whole technical team behind it.
Great!
Off to download some updates - thanks!
Great to see the continued growth of Mozscape, some of the numbers are staggering :) Keep up the good work.
Yay! So you guys are gonna do it twice every month now. TU!
Congrats!
Congrats SEO Moz
Will y'all ever get back up to the ~168 billion URL level we saw last year? The more the merrier, and all that :)
Totally agree - the more the merrier for sure! :)
Unfortunately, the more the merrier, but the longer it takes to process...but we're working on solving that problem! Lots of exciting things coming this year!
Thanks for sharing
Where can I find more info on what the correlation figures mean? What can I conclude from the correlation of Domain Authority (0.19) with Google's US search results? From what I can remember from high school, this feels like a low correlation. Or am I seeing this wrong?
Hey there!!
Our data scientist, Matt Peters, posted a really informative blog post on what correlations mean last September. You can read up on the details here:
https://www.seomoz.org/blog/mozscape-correlation-analysis-google-algorithm-changes
Hope that helps!
Thanks for the link Carin. It was good to be able to expand on what the figures mean.
This is great news. We are currently involved in link removal after getting hit with an unnatural link warning, so we can now see how our progress is going every 2 weeks.
Thanks again
Always happy to see new data in OSE. Thanks for the fast release!
Is there currently a feature planned to show which backlinks have been changed since the last index update? So I can quickly see what has changed in my campaigns?
Hey there! That is a great suggestion - I'll make sure we have that on the feature list to prioritize!
Thanks, Carin
Thanks for confirming...
Nice news for me.
Thank you for your good work.
Thanks for the update and hard work from the team. I have a question. As the update takes 11 days to process, does the data represent (for some websites) the status from about 2 weeks ago? Thanks in advance.
Hey Nick!
We started this index on February 13th, so the most recent crawl data would be from 2/12.
Hope that helps!
Thanks, Carin
This is great news! Some fantastic progress!
Thanks and Congrats to those involved!
Nice job. The technology changes at SEOmoz last year have really paid off. It used to take months to get a refresh of the index.
Wow... Thanks, my inspirations... :)