It's time for another Mozscape index update. New data is now available in Open Site Explorer, the Mozbar, our other tools, and through the API. July's update comes with some good news and, potentially, some bad news too. As you're likely aware, the previous two indices, while huge in size (150B+ URLs each), suffered from a lack of freshness due to the additional processing time required to calculate our link graph and metrics over such phenomenally large numbers of links and pages. Today's index is still relatively large by prior standards (~78B URLs, larger than almost anything we launched before April 2012), and it's somewhat fresher: the link data in today's index was crawled almost entirely in May.
This index was originally scheduled to launch earlier, but ran into trouble, including Amazon's AWS outage and plenty of hardware failures. As we've mentioned in the past, SEOmoz is in the process of building a new private hybrid-cloud datacenter that will replace AWS for Mozscape and should provide us with much greater reliability. We know how important it is to have regular data updates you can count on, and we're putting people and money to work as fast as possible to get off of Amazon's unreliable systems.
Let's take a look at the full metrics for this index:
- 78,813,641,094 (78 billion) URLs
- 674,286,481 (674 million) Subdomains
- 165,476,769 (165 million) Root Domains
- 778,554,162,687 (778 billion) Links
Followed vs. Nofollowed
- 2.33% of all links found were nofollowed
- 57.62% of nofollowed links are internal
- 42.38% are external
- Rel Canonical - 12.5% of all pages now employ a rel=canonical tag
The average page has 74 links on it
- 63.28 internal links on average
- 10.72 external links on average
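For those curious about the mechanics, here's a minimal sketch (in Python) of how aggregate stats like the ones above could be derived from raw link records. The record layout and field names here are invented for illustration; they are not Mozscape's actual schema.

```python
# Minimal sketch: deriving aggregate link stats from raw link records.
# The record fields (source, target, nofollow) are hypothetical.
from urllib.parse import urlparse

links = [
    {"source": "http://example.com/a", "target": "http://example.com/b", "nofollow": False},
    {"source": "http://example.com/a", "target": "http://other.com/", "nofollow": True},
    {"source": "http://example.com/b", "target": "http://example.com/c", "nofollow": True},
]

def same_root_domain(a, b):
    # Crude root-domain check; a real system would use the public suffix list.
    return urlparse(a).hostname.split(".")[-2:] == urlparse(b).hostname.split(".")[-2:]

nofollowed = [l for l in links if l["nofollow"]]
pct_nofollow = 100 * len(nofollowed) / len(links)
pct_nf_internal = 100 * sum(same_root_domain(l["source"], l["target"]) for l in nofollowed) / len(nofollowed)
print(f"{pct_nofollow:.2f}% of links nofollowed; {pct_nf_internal:.2f}% of those internal")
```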
And here are the latest correlations between Mozscape metrics and Google's search results:
- Page Authority - 0.34
- Domain Authority - 0.23
- MozRank - 0.19
- Linking Root Domains - 0.24
- Total Links - 0.2
- External Links - 0.24
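Roughly speaking, numbers like these come from computing a rank correlation between a metric and Google's result ordering on each SERP, then averaging across many queries (that's the methodology our ranking-factors studies describe). Here's a simplified, illustrative sketch for a single SERP, with made-up numbers - not our production pipeline:

```python
# Illustrative sketch only: Spearman rank correlation between a metric and
# Google's ordering for one SERP. Real studies average this over thousands
# of queries. All numbers below are invented.
from scipy.stats import spearmanr

positions = [1, 2, 3, 4, 5]            # Google ranking positions
page_authority = [72, 65, 68, 50, 41]  # hypothetical PA for each result

# Negate position so "higher metric at better rankings" yields a positive rho.
rho, _ = spearmanr([-p for p in positions], page_authority)
print(f"Spearman rho for this SERP: {rho:.2f}")
```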
Because this update is much smaller in total URL count (roughly half the size of the prior, 165-billion-URL index), your link totals will likely be much smaller too, even if you've grown your link building efforts. Below is an example of the numbers for various Seattle startups across May's larger index and July's smaller one:
Above: May's 165 Billion URL index data
Above: July's smaller, 78 Billion URL index data
Note that, as one might expect, link counts are between 50-75% of their former value. This percentage will be lower for sites that get many links from the far corners of the less-traversed, less-popular pages and sites on the web, and higher for sites with links from more popular/well-linked-to sites and pages.
We're working hard to grow index size back up to 100 billion+ URLs in future updates. Our crawlers can already handle vastly more; it's the unreliability of Amazon's hardware that holds us back. Our engineers and sysops folks are working around the clock to get there as soon as we can.
We've also done some work recently to update the scoring systems for the Keyword Difficulty/SERPs Analysis Tool. You'll now see a more accurate and usable algorithm applied to results where very fresh pages are ranking, e.g. news, sports, trending topics, etc. Here's an example query that previously would have produced a keyword difficulty score of 1:
"Libor Rate Scandal" is a SERP that, until a few days ago, had virtually no traffic and very different results. All of the pages ranking now were produced in the last day or two, and thus don't yet have Page Authority scores. However, Domain Authority is now being used to help calculate keyword difficulty, which should seriously help those of you who analyze fresh results.
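To make the change concrete, here's a heavily simplified sketch of the fallback idea: when a ranking page is too new to have a Page Authority score, lean on its Domain Authority instead. The scoring and scale below are invented for illustration; the actual difficulty formula is more involved and not shown here.

```python
# Heavily simplified illustration of the DA-fallback idea described above.
# The weighting and scale are invented; the real difficulty formula differs.
def page_strength(page_authority, domain_authority):
    # Very fresh pages often have no PA yet; fall back to the domain's DA.
    return page_authority if page_authority else domain_authority

def keyword_difficulty(serp):
    """serp: list of (page_authority, domain_authority) for the top results."""
    strengths = [page_strength(pa, da) for pa, da in serp]
    return sum(strengths) / len(strengths)

# A fresh-news SERP of brand-new pages (PA 0) on strong domains no longer
# produces a nonsensical difficulty score of ~1.
print(keyword_difficulty([(0, 94), (0, 88), (0, 91)]))  # ~91
```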
The next 2-4 Mozscape index updates will continue to be on AWS, but we're now running 3-4 indices in parallel (which costs a fortune, but gives us fallback options if/when Amazon's failures lose an index or massively delay it). In the next 3-4 months, we hope to be operating indices off our new hybrid cloud environment and see much greater reliability, which will enable us to produce larger, fresher and more consistent updates.
The fact that the indices are different sizes every time makes it very hard to measure progress. We are fighting a battle to remove spam links created by our old link builders. I had been eagerly awaiting the update to see the progress we'd made, and now it's pretty much irrelevant, as I can't tell the difference between what's been removed and what simply hasn't been indexed because it's low quality...
I fully understand that what you guys are doing is incredibly difficult, but the erratic nature of the recent indices (constantly shifting dates, different index sizes, etc.) makes it almost impossible to use this as a reporting tool.
I totally empathize, and I wish there was more we could do to make the web a more stable place. Unfortunately, even if our indices were very constant in the quantity of URLs crawled, there would still be massive fluctuation. The reality of the web is that between 20-25% of the pages and sites we see in 2-3 weeks of crawling will no longer exist in their current form in the following crawl cycle (sometimes more). The web's dynamism makes it very, very hard to keep a consistent measure and thus, I strongly recommend doing competitive-based comparisons. That way, if your counts are down but so are your competitors', you can assume it's flux in the web overall vs. flux in just your own backlinks.
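As a quick illustration of what I mean (with made-up numbers): pull the same set of domains from both indices and compare everyone's retention rate, rather than your raw count.

```python
# Made-up numbers illustrating a competitive-based comparison: look at how
# much of each domain's link count survived between indices, not raw totals.
may_counts = {"yoursite.com": 120_000, "rival-a.com": 250_000, "rival-b.com": 90_000}
july_counts = {"yoursite.com": 70_000, "rival-a.com": 140_000, "rival-b.com": 55_000}

for domain, may in may_counts.items():
    print(f"{domain}: {july_counts[domain] / may:.0%} of May's links retained")

# If everyone retained ~55-60%, the drop reflects index size / web flux,
# not a problem specific to your backlink profile.
```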
Thanks for the response, Rand. I will definitely look more closely at doing competitive-based comparisons in the future.
"Unfortunately, even if our indices were very constant in the quantity of URLs crawled, there would still be massive fluctuation"
Yes, but at least then we would know that the differences were down to web fluctuations instead of crawl fluctuations! It seems like the crawls need to be more consistent in order to (try to) compare one to another. I do understand, though, that it's not easy finding the balance between processing time and freshness, especially when relying on AWS!
Just as an example of the problem: one of our websites' Domain Authority dropped by 8 points this month, while our main competitors didn't seem to move much. I don't believe we have actually lost as many links as the new index suggests; it's far more likely that more of our links were "from the far corners of the less-traversed, less-popular pages and sites on the web". The vast difference in index size this time around just makes this very difficult to fully interpret.
Fair points. I suspect that indices will mostly grow from this point on (maybe a few in the next 2-3 months that are similar in size to this one). Our goal long term is likely in the 250B+ range, as we suspect Google is closer to that range these days (in the main index anyway). As we reach for that, we may find that we need to scale up/down in order to reach freshness targets, deal with technical issues, etc.
Thus, while I totally agree with your points, I want to be honest and transparent that given our goals and the challenges, it's probably not something we can realistically achieve.
Oh - and lastly, because the sample size of any given index is so large, fluctuations in DA/PA are usually more a result of lost/gained links in comparison to the broader web and changes in Google's ranking algo than they are reflections of index size changes. It's the nice part about how DA/PA are built - they're relatively similar in ranking correlations between indices half this size and 2X this size.
Thanks Rand, the transparency is always appreciated!
SEOmoz has come a long way in a few years! That is a mountain of data. SEOmoz search engine next??
Totally. I'd love to see that, great idea. (Surely Rand et al. have already thought of it.)
Do the Microsoft thing and just throw more power at computational challenges!
Would actually be kinda cool if you made a refresher post outlining this new process you're working on (and how hard it is). Some of us take this data for granted, I think ;)
Or tell us some interesting things google has had to say about the index.
One of our chief engineers on the Mozscape project, Martin, did a great blog post illustrating some of our crawl details: https://devblog.seomoz.org/2012/06/how-does-seomoz-crawl-the-web/. We're planning to continue this trend with more on how we process, and I think another post on our hardware move could be great, too.
Rand,
Thanks for sharing link to Martin's post! I find these posts fascinating and inspiring. Would love to read more about your hardware, challenges you faced and how you solved them. Cheers!
Hey Jason,
Definitely stay tuned for an upcoming blog post on the new processing set up with the hybrid cloud. We'd love to give you all insight into the challenges we're running into processing larger indices and the solutions we're working on!
Look for a blog post in the next couple months.
Thanks!
Carin
I can hardly wait for that. My personal experience with AWS was a nightmare! It had so much potential, too... looking forward to the post.
@Jason Capashaw! Don't worry, because it's not yet settled what exactly this update is.
I've also heard about this index update, and at the same time I've heard talk of a "July Tuna" update, but I'm really not sure which is right. Let's see now what Google's algorithm says.
Looking forward to it, Carinoverturf!
And thanks for the link, Rand
I'm seeing some weird anchor text issues too.
If you provide some specifics, we can help look into it. Feel free to use either our feedback forum: https://seomoz.zendesk.com/categories/6327-known-issues-feature-requests or email the help team https://www.seomoz.org/help
Well it's present on just about every single domain I've checked. Some more than others.
For example:
https://www.opensiteexplorer.org/links?site=www.seomoz.org
All of these on the first page of the report above have corrupted anchor text:
www.emarketer.com/Sources.aspx
www.webpronews.com/can-your-site-lose-its-rankings-because-of-competitors-negative-seo-2012-04
feeds.feedburner.com/DailyBlogTips
www.smashingmagazine.com/2010/03/06/23-tools-and-tips-for-any-ecommerce-website/
By corrupted I mean the actual anchor text is there, but random additional text is appended to it in the OSE reports.
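If it helps to reproduce, here's a rough diagnostic along the lines of how I'm spotting these: fetch the linking page and check whether the reported anchor text appears verbatim, or is a real anchor with junk appended. This is my own home-grown helper (requests + BeautifulSoup), not a Moz tool.

```python
# Quick home-grown diagnostic (not a Moz tool): does the anchor text reported
# for a link actually appear verbatim on the source page, or is it a real
# anchor with extra text appended?
import requests
from bs4 import BeautifulSoup

def check_anchor(source_url, reported_anchor):
    """Return 'ok', 'corrupted' (real anchor + appended junk), or 'missing'."""
    html = requests.get(source_url, timeout=10).text
    anchors = [a.get_text(strip=True) for a in BeautifulSoup(html, "html.parser").find_all("a")]
    if reported_anchor in anchors:
        return "ok"
    if any(a and reported_anchor.startswith(a) for a in anchors):
        return "corrupted"  # the appended-text pattern described above
    return "missing"

# e.g. check_anchor("http://feeds.feedburner.com/DailyBlogTips", "anchor text from the OSE report")
```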
Interesting - I'm forwarding this on to the engineers right now to take a look. If you'd like to send any other examples, feel free to email me at [email protected].
I'll keep you posted on what we find!
Our engineers jumped online this weekend to look into this and, unfortunately, it appears to be more fallout from a parsing bug that was introduced into this index as a result of a recent change to our crawlers. You can read more about it in this forum post where Phil provided a pretty detailed explanation: https://seomoz.zendesk.com/entries/21655043-mozscape-index-update-temporarily-delayed-until-july-6th.
We're looking into what we can do to repair the view in the current index, but we do have another index in the works that should be complete in a few weeks. The parsing bug is 100% resolved in that index and all future indexes.
I know this doesn't help you out now with the reports you'll have to print - I'm really sorry for this inconvenience. I'll keep you posted as we look into this further!
Thanks,
Carin
Great news, that will keep me busy on Monday morning looking at fresh data!
What?
Some people wait until Monday, instead of obsessively checking every five minutes all day Friday to see if the update has been released?
Hard to believe.
Some people are also interested in spending time with their families, or aren't at work on weekends!
It's good to see you aren't wasting any time investing your new venture capital funding. lol. I can't tell you how great it is to get a look at this kind of aggregate data.
Very nice update :)
I had goose bumps seeing these statistics. Interesting to note that 57% of the nofollowed links are internal.
I saw the inbound links go down and attributed it to your smaller index size. My client seemed calm with that explanation.
But should that still be an issue here on 8/22/12? They keep wondering when their link counts will recover.
Or should I talk them out of using this metric month to month?
For historical reasons my entire site is HTTPS (not just where it's needed), but it seems like there is no way to use Open Site Explorer to see who actually links to the https address (if they use http or drop the protocol, it works nicely).
Am I wrong? If not, have you thought about doing something about it?
I was getting worried. Thanks for the update Rand.
Nice update, keep on sharing...
Thanks Rand for these updates. I like the way you always point out the positive points as well as the drawbacks. I always appreciate the honesty.
Rand, we too evaluated most of the public cloud service providers and finally decided to run our own private infrastructure, as it gives us a cost advantage as we scale up.
Moreover, as we are at a pretty early stage, we've got a lot of flexibility to play around with different configurations and learn.
Even though we built our setup using commodity hardware & FOSS, reliability and performance are on par. Details on our setup: https://www.searchenabler.com/blog/build-your-own-data-center/
There was some good discussion, and we got some great input on our private setup from Hacker News yesterday: https://news.ycombinator.com/item?id=4207439
Interesting to compare these correlation #'s to these.
Keep in mind, I studied advertising instead of marketing, so I didn't have to take stats or calc... so maybe these aren't even relevant. I mean... I realize they're from a different dataset, but it's the same search engine, right? I know... I know... if only it were that simple.
Same search engine, but they're measured at different times and using different metrics. We've done many correlation studies here at Moz, too, on search ranking factors, e.g. https://www.seomoz.org/article/search-ranking-factors#metrics
Rand, can you answer this please - two things:
1. I am still seeing 0 scores for DA and PA on sites we set up 5 months ago, which are well indexed and prominent in SERPs, not in the far-flung corners. This concerns me.
2. In old MozRank terms, if the home page had a PA of 40, for example, a direct sub-page linked from the home page, with no other links, would typically be about 35. Now I see on lots of sites that such a direct sub-page seems to consistently lose about half the PA, so it would be about 20. Is this deliberate? If so, can you explain the rationale, please? Does Moz now decrease the impact of internal links?
Thanks
Scrap that first point, my apologies, it has updated this time... grovel grovel
No worries! If #1 ever happens again, be sure to drop us a line and we can look into what might be happening. In this case, it sounds like we've got the data, and future updates should have more/better crawl info on the new sites.
Re #2 - that could be something our PA/DA algo changed as it was trying to better match Google's ranking results, or it could be an artifact of another change. It's not something we introduced manually, though.
On point #1, it took 5 months after launch for one of my sites to get beyond 0 DA and 0 PA. But it was for a brand new domain with few links and in an obscure niche.
5 months is still way too long (should be 2 at worst). This is actually a great case study in why we're biasing away from the huge index updates that take 2+ months to process and switching to smaller indices that can run in 3-4 weeks.
Are you saying, though, Rand, via the possible MozRank/PA change, that Google is giving less weight to internal links? This seems like a big change in your algo, so I'm trying to understand it. Thanks.
Unfortunately, it's hard to know for certain. It could be the result of many other things - machine-learning models are great for matching, but not so great for telling you why using a particular derivative of a metric helps produce better correlations.
Re #2: The PA/DA machine learning model hasn't been updated since last November so any changes you see in the metrics with this update are due to link graph changes (as seen by our crawlers) and not the model. See https://www.seomoz.org/blog/introducing-seomoz-updated-page-authority-and-domain-authority for a description of the current model. We monitor it closely with each update and it has been remarkably consistent in aggregate across the entire index and a large set of SERP results.
So now we have an even more robust tool that we can use to analyse the performance of our website. Looking forward to using it effectively!
There's some anchor text corruption going on with this update! Just have a look at the anchor text for a wide range of sites in Open Site Explorer!
Hey there - what type of corruption are you seeing? Would love it if you emailed me some Open Site Explorer examples and we'll take a look at what is going on! [email protected]
Thanks!
Carin
I just heard about the "July Tuna Update". So it's not confirmed yet what exactly Google's next update is?
Grammar police: "You'll know see a more accurate and usable algorithm." Thanks for the update!
Thanks, fixed it!