We're continuing the trend of two index releases each month by bringing you the latest Mozscape index release today - only 15 days after our last release on February 12th! The latest Mozscape index took about 11 days to process, with a fairly significant portion crawled at the beginning of February. The crawl data spans about 38 days, so the oldest crawl data dates back to the beginning of January. You can access refreshed data across all of our applications - Open Site Explorer, the Mozbar, PRO campaigns, and the Mozscape API.

Our Big Data processing team (Martin York, Douglas Vojir, and Stephen Wood) has been working on some really exciting improvements to our processing code base that reduce the length of time processing takes, and has also begun development on a highly anticipated new Mozscape index feature:

  • The Mozscape index is created in one continuous batch processing pipeline. A massive amount of crawl data is first downloaded, then sorted and organized, and then the computations and magic are applied. Every so often, files get uploaded in a checkpoint step, so that if something catastrophic happens to the index, we'll be able to roll back to a fairly recent step.

    Recently the Big Data processing team dug through this checkpointing code to see where they could optimize - and they really optimized! The time needed to checkpoint files varies throughout the pipeline, but the longest checkpointing step used to take about 60 hours to complete… With the optimization from Doug and Martin, this step now takes on average 2.18 hours! Holy time savings!!
     
  • The first few steps in processing are dedicated to organizing how the work is going to be distributed across the entire Mozscape processing cluster. The files are broken out into what are called shards, which are then assigned across the entire fleet of machines. These shards aren't always completely full, which means one machine can finish its work well before another. Martin revisited this code as well to see what type of optimization could be applied. With the help of our master data scientist, Matt Peters, Martin was able to improve the distribution of work, saving around 25% of the time spent processing!
     
  • One feature we hear requested fairly often is including HTTPS crawl data in the Mozscape index. Good news - development on this feature has begun, and we hope to have HTTPS data included in the Mozscape index this summer! 
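
For the curious, here is a rough illustration of why uneven shards leave machines idle, as described in the work-distribution item above. This is not our actual processing code: the shard sizes are made up, and the function names (`round_robin`, `assign_shards`, `makespan`) are hypothetical. The sketch compares handing out shards in order with a simple greedy approach that gives the largest remaining shard to the least-loaded machine.

```python
import heapq


def round_robin(shard_sizes, num_machines):
    """Naive baseline: deal shards out in order, ignoring their size."""
    assignment = [[] for _ in range(num_machines)]
    for shard_id in range(len(shard_sizes)):
        assignment[shard_id % num_machines].append(shard_id)
    return assignment


def assign_shards(shard_sizes, num_machines):
    """Greedy balancing: largest shard first, always to the least-loaded machine."""
    # Min-heap of (current_load, machine_index) so the lightest machine pops first.
    heap = [(0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_machines)]
    by_size = sorted(enumerate(shard_sizes), key=lambda pair: pair[1], reverse=True)
    for shard_id, size in by_size:
        load, machine = heapq.heappop(heap)
        assignment[machine].append(shard_id)
        heapq.heappush(heap, (load + size, machine))
    return assignment


def makespan(shard_sizes, assignment):
    """Work on the busiest machine; the whole step waits for this one."""
    return max(sum(shard_sizes[s] for s in shards) for shards in assignment)


# Made-up shard sizes (arbitrary units of work), not real Mozscape data.
sizes = [10, 9, 8, 2, 10, 1, 9, 3]
naive = round_robin(sizes, 2)
balanced = assign_shards(sizes, 2)

print("round robin makespan:", makespan(sizes, naive))
print("greedy makespan:", makespan(sizes, balanced))
```

With these made-up sizes, the busiest machine does 37 units of work under the naive split and 27 under the greedy one. Since a step only finishes when its slowest machine does, evening out the load shortens the whole step.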

Here are the metrics for this latest index:

  • 82,275,594,589 (82 billion) URLs
  • 9,097,532,641 (9.1 billion) Subdomains
  • 148,991,416 (149 million) Root Domains
  • 829,267,740,331 (829 billion) Links
  • Followed vs. Nofollowed
    • 2.25% of all links found were nofollowed
    • 56.08% of nofollowed links are internal
    • 43.92% are external
  • Rel Canonical - 15.43% of all pages now employ a rel=canonical tag
  • The average page has 73 links on it
    • 62.93 internal links on average
    • 10.33 external links on average
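
If you'd like to turn the percentages above into rough absolute numbers, here is a small back-of-the-envelope calculation. It only uses the figures listed in this post. The variable names are ours for illustration, and the results are approximations because the published percentages are rounded.

```python
# Figures copied from the metrics list in this post.
total_links = 829_267_740_331
nofollow_share = 0.0225              # 2.25% of all links are nofollowed
internal_share_of_nofollow = 0.5608  # 56.08% of nofollowed links are internal
external_share_of_nofollow = 0.4392  # 43.92% of nofollowed links are external

# Approximate absolute counts derived from the rounded percentages.
nofollowed = total_links * nofollow_share
nofollowed_internal = nofollowed * internal_share_of_nofollow
nofollowed_external = nofollowed * external_share_of_nofollow

# The per-page averages should add up to roughly the stated 73 links per page.
avg_internal = 62.93
avg_external = 10.33
avg_total = avg_internal + avg_external

print(f"nofollowed links: about {nofollowed / 1e9:.2f} billion")
print(f"  internal: about {nofollowed_internal / 1e9:.2f} billion")
print(f"  external: about {nofollowed_external / 1e9:.2f} billion")
print(f"average links per page: {avg_total:.2f}")
```

That works out to roughly 18.66 billion nofollowed links, and the internal and external averages sum to 73.26, which matches the rounded figure of 73 links per page.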

And the following correlations with Google's US search results:

  • Page Authority - 0.35
  • Domain Authority - 0.19
  • MozRank - 0.24
  • Linking Root Domains - 0.31
  • Total Links - 0.25
  • External Links - 0.29

Crawl histogram for the February 27th Mozscape index

As you can see from the metrics above, the number of subdomains continues to increase, as we have discovered a small number of root domains that have a substantial number of subdomains associated with them.

We always love to hear your thoughts! And remember, if you're ever curious about when Mozscape next updates, you can check the calendar here. We also maintain a list of previous index updates with metrics here.