It's that time once again! Mozscape's latest index update is live as of today (and new data will be in OSE, the mozBar, and PRO by tomorrow). This new index is our largest yet, at 164 billion URLs; however, that comes with a few caveats. The major one is that we've got a smaller-than-normal number of domains in this index, so you may see link counts rising while unique linking root domains shrink. I asked the team why this happened, and our data scientist, Matt, provided a great explanation:

We schedule URLs to be crawled based on their PA + external mozRank so that we crawl the highest-quality pages. Since most high-PA pages are on a few large sites, this naturally biases the crawl toward fewer domains. To enforce some domain diversity, the V2 crawlers introduced a set of domain mozRank limits that limit the crawl depth on each domain. However, this doesn't guarantee a diverse index when the crawl schedule is full (as it was for Index 52).

In this case, many lower-quality domains with low PA/DA are cut from the schedule and left out of the index. This is the same problem we ran into when we first switched to the V2 crawlers last year and the domain diversity dropped way down. We've since fixed the problem by introducing another hard constraint that always schedules a few pages from each domain, regardless of PA. This was implemented a few weeks ago, and the domain numbers for Index 53 are going back up to 153 million.
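
As a rough illustration of the scheduling change Matt describes, the sketch below ranks candidate URLs by a quality score, caps the pages taken per domain, and first reserves a few slots for every domain. The function name, the `min_per_domain` and `max_per_domain` parameters, and the sample scores are all hypothetical, and the per-domain cap is a simplification of the domain mozRank limits; this is not the actual crawler code.

```python
from collections import defaultdict


def build_schedule(candidates, capacity, min_per_domain=1, max_per_domain=3):
    """Pick up to `capacity` URLs from (url, domain, score) tuples.

    Hypothetical sketch: `score` stands in for PA + external mozRank.
    """
    # Highest-scoring pages first.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    per_domain = defaultdict(int)
    chosen = []
    chosen_urls = set()

    # Pass 1 (hard constraint): reserve a few pages for every domain,
    # regardless of score, so low-scoring domains are not cut entirely.
    for url, domain, score in ranked:
        if len(chosen) >= capacity:
            break
        if per_domain[domain] < min_per_domain:
            chosen.append(url)
            chosen_urls.add(url)
            per_domain[domain] += 1

    # Pass 2: fill the remaining capacity by score, subject to the per-domain cap.
    for url, domain, score in ranked:
        if len(chosen) >= capacity:
            break
        if url not in chosen_urls and per_domain[domain] < max_per_domain:
            chosen.append(url)
            chosen_urls.add(url)
            per_domain[domain] += 1

    return chosen


# Invented candidates: one large high-scoring site and two smaller ones.
candidates = [
    ("big.example/a", "big.example", 90),
    ("big.example/b", "big.example", 85),
    ("big.example/c", "big.example", 80),
    ("big.example/d", "big.example", 75),
    ("mid.example/a", "mid.example", 40),
    ("small.example/a", "small.example", 5),
]

# With a full schedule and no floor, small.example never gets a slot...
without_floor = build_schedule(candidates, capacity=4, min_per_domain=0)
# ...while the per-domain floor keeps every domain represented.
with_floor = build_schedule(candidates, capacity=4, min_per_domain=1)
```

With the floor disabled, the four slots go to three `big.example` pages and one `mid.example` page; with it enabled, each of the three domains gets at least one page before the remaining slot is filled by score.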

Thankfully, the domains affected should be at the far edges of the web - those that aren't well linked-to or important. Still, we recognize that domain diversity matters, and we're focused on balancing it against index size moving forward.

Several other points may be of interest as well:

  • The last index took nearly 13 weeks to process; this one took only 7. This means relatively fresher data, though not as fresh as we'd like. The oldest information will be from February and the newest from mid-April.
  • Of all the URLs on which data was requested in the last month, this update has data for 88.56% of them (only very slightly lower than the last index's 88.80%).
  • This index still has very high correlations with rankings. Below are a few samples of Spearman correlations with higher rankings in Google.com (US):
    • Page Authority (PA) - 0.38
    • Domain Authority (DA) - 0.26
    • URL MozRank (mR) - 0.20
    • URL MozTrust (mT) - 0.22
    • Linking Root Domains to the URL - 0.29
    • Total # of Links to the URL - 0.22
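
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the two variables' ranks. The sketch below is a minimal pure-Python version. The sample data is invented for illustration and is not drawn from the index, and negating the ranking positions is just one convention for making "higher metric goes with higher ranking" come out positive.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented example: a metric value for the pages at positions 1..5 of one results page.
positions = [1, 2, 3, 4, 5]
metric = [62, 48, 55, 30, 41]

# Position 1 is the best ranking, so negate positions to make
# "higher metric goes with higher ranking" a positive correlation.
rho = spearman(metric, [-p for p in positions])
print(f"Spearman correlation: {rho:.2f}")
```

A value of 1 would mean the metric orders the pages exactly as the search engine does, and 0 would mean no monotonic relationship at all.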

This bit is important: Next index, we're going back down to between 70 and 90 billion URLs and focusing on getting back to much fresher updates (we're even aiming to get to updates every 2 weeks, though this is a challenging goal, not a guarantee). The 150 billion+ page indices are an awesome milestone, but as you've likely noticed, the extra data does not equate to hugely better correlations, nor even to massively higher amounts of data on the URLs most of our customers care about (as an example, in Index 50, we had ~53 billion pages and 82.09% of URLs requested had data). That said, once our architecture is more stable, we will be aiming to get to both huge index sizes and dramatically better freshness. Look for tons of work and improvements over the summer on both fronts.
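
The trade-off can be put in rough numbers using only figures quoted in this post (Index 50: ~53 billion pages and 82.09% coverage; Index 52: 164 billion URLs and 88.56% coverage). A quick back-of-the-envelope check, with illustrative variable names:

```python
# Figures quoted in this post; Index 50's size is approximate (~53 billion).
index_50 = {"urls_billions": 53, "coverage_pct": 82.09}
index_52 = {"urls_billions": 164, "coverage_pct": 88.56}

# How much bigger the index got, and how much coverage that bought.
size_ratio = index_52["urls_billions"] / index_50["urls_billions"]
coverage_gain = index_52["coverage_pct"] - index_50["coverage_pct"]

print(f"Index size grew about {size_ratio:.1f}x")
print(f"Coverage of requested URLs rose {coverage_gain:.2f} percentage points")
```

In other words, roughly tripling the index size added about 6.5 percentage points of coverage for the URLs people actually requested.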

Below are the stats for Index 52: 

  • 164,569,893,828 (164 billion) URLs
  • 1,222,033,252 (1.22 billion) Subdomains
  • 117,444,355 (117 million) Root Domains
  • 1,784,256,496,532 (1.78 trillion) Links
  • Followed vs. Nofollowed
    • 2.57% of all links found were nofollowed
    • 64.91% of nofollowed links are internal
    • 35.09% are external
  • Rel Canonical - 11.33% of all pages now employ a rel=canonical tag
  • The average page has 85.12 links on it
    • 74.38 internal links on average
    • 10.74 external links on average
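
The percentages and averages above can be sanity-checked, and turned into rough absolute counts, with a few lines of arithmetic. The variable names are illustrative, and the absolute nofollow counts are estimates derived from the rounded percentages rather than figures reported by the index.

```python
total_links = 1_784_256_496_532
nofollow_share = 0.0257            # 2.57% of all links were nofollowed
nofollow_internal_share = 0.6491   # 64.91% of nofollowed links are internal
nofollow_external_share = 0.3509   # 35.09% of nofollowed links are external

avg_internal, avg_external, avg_total = 74.38, 10.74, 85.12

# The breakdowns should add up to their stated totals.
shares_sum_to_one = abs(nofollow_internal_share + nofollow_external_share - 1.0) < 1e-9
averages_add_up = abs(avg_internal + avg_external - avg_total) < 1e-9
print("Nofollow split sums to 100%:", shares_sum_to_one)
print("Internal + external equals average total:", averages_add_up)

# Derived estimates of absolute counts, in billions.
nofollowed = total_links * nofollow_share
print(f"~{nofollowed / 1e9:.1f} billion nofollowed links")
print(f"~{nofollowed * nofollow_internal_share / 1e9:.1f} billion of them internal")
print(f"~{nofollowed * nofollow_external_share / 1e9:.1f} billion of them external")
```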

Feedback is greatly appreciated - this index should help with Penguin link data identification substantially more than our prior one, and the next one should be even more useful for that. Do remember that since this index stopped crawling and began processing in mid-April, link additions/removals that have happened since won't be reflected. Our next index will, hopefully, be out after five or fewer weeks of processing, which should improve that freshness. We're excited to see how this affects correlations and data quality.