It's that time once again! Mozscape's latest index update is live as of today (and the new data will be in OSE, the mozBar, and PRO by tomorrow). This new index is our largest yet at 164 billion URLs; however, that comes with a few caveats. The major one is that we have a smaller-than-normal number of domains in this index, so you may see link counts rising while unique linking root domains shrink. I asked the team why this happened, and our data scientist, Matt, provided a great explanation:
We schedule URLs to be crawled based on their PA + external mozRank, so we crawl the highest-quality pages. Since most high-PA pages are on a few large sites, this naturally biases us toward crawling fewer domains. To enforce some domain diversity, the V2 crawlers introduced a set of domain mozRank limits that cap the crawl depth on each domain. However, this doesn't guarantee a diverse index when the crawl schedule is full (as it was for Index 52).
In this case, many lower-quality domains with low PA/DA are cut from the schedule and thus out of the index. This is the same problem we ran into when we first switched to the V2 crawlers last year and domain diversity dropped way down. We've since fixed the problem by introducing another hard constraint that always schedules a few pages from each domain, regardless of PA. This was implemented a few weeks ago, and the domain numbers for Index 53 are going back up to 153 million.
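To make that scheduling idea concrete, here's a minimal sketch of a priority-based scheduler with a per-domain floor. This is purely illustrative: the constants, function name, and capacity numbers are hypothetical, not Mozscape's actual implementation.

```python
# Minimal sketch, assuming priority is something like PA + external mozRank.
from collections import defaultdict

PAGES_GUARANTEED_PER_DOMAIN = 2   # hard floor so no domain is cut entirely
SCHEDULE_CAPACITY = 1_000_000     # total URLs the crawl schedule can hold

def build_schedule(pages):
    """pages: iterable of (url, domain, priority) tuples."""
    by_domain = defaultdict(list)
    for url, domain, priority in pages:
        by_domain[domain].append((priority, url))

    schedule = []
    leftovers = []
    # Pass 1: guarantee a few pages from every domain, regardless of priority.
    # (In practice the floor itself could overflow the schedule; handling
    # that edge case is omitted here.)
    for domain, items in by_domain.items():
        items.sort(reverse=True)  # highest-priority pages first
        schedule.extend(url for _, url in items[:PAGES_GUARANTEED_PER_DOMAIN])
        leftovers.extend(items[PAGES_GUARANTEED_PER_DOMAIN:])

    # Pass 2: fill remaining capacity with the highest-priority pages,
    # which naturally favors a few large, well-linked sites.
    leftovers.sort(reverse=True)
    remaining = max(SCHEDULE_CAPACITY - len(schedule), 0)
    schedule.extend(url for _, url in leftovers[:remaining])
    return schedule
```

The per-domain floor in the first pass is what keeps low-PA domains from being cut entirely when the schedule fills up.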
Thankfully, the domains affected should be at the far edges of the web - those that aren't well linked-to or important. Still, we recognize this matters, and we're focused on striking a better balance going forward.
Several other points may be of interest as well:
- The last index took nearly 13 weeks to process; this one took only 7. That means relatively fresher data, though not as fresh as we'd like: the oldest information is from February and the newest from mid-April.
- Of all the URLs on which data was requested in the last month, this update has data for 88.56% (only very slightly lower than last index's 88.80%).
This index still shows very high correlations with rankings. Below are a few sample Spearman correlations with higher rankings on Google.com (US); a quick illustration of how such a correlation is computed follows the list:
- Page Authority (PA) - 0.38
- Domain Authority (DA) - 0.26
- URL MozRank (mR) - 0.20
- URL MozTrust (mT) - 0.22
- Linking Root Domains to the URL - 0.29
- Total # of Links to the URL - 0.22
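Here's a toy example of the computation behind figures like these, using scipy. The SERP positions and PA values below are made up for illustration; this is not Moz's evaluation data or methodology code.

```python
# A minimal sketch of a Spearman rank correlation between a link metric
# and Google rankings (invented numbers, single hypothetical SERP).
from scipy.stats import spearmanr

# Hypothetical SERP: positions 1-10 and each result's Page Authority.
google_positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
page_authority   = [72, 65, 68, 55, 60, 48, 50, 42, 38, 35]

# Correlate PA with *higher* rankings, so negate position (rank 1 is best).
rho, p_value = spearmanr([-p for p in google_positions], page_authority)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```

In practice, correlations like those reported above are computed across many thousands of SERPs and averaged; a single query like this just shows the mechanics.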
This bit is important: next index, we're going back down to 70-90 billion URLs and focusing on much fresher updates (we're even aiming for updates every 2 weeks, though this is a challenging goal, not a guarantee). The 150 billion+ page indices are an awesome milestone, but as you've likely noticed, the extra data doesn't equate to hugely better correlations, nor even to massively more data on the URLs most of our customers care about (as an example, in Index 50 we had ~53 billion pages, and 82.09% of URLs requested had data). That said, once our architecture is more stable, we'll aim for both huge index sizes and dramatically better freshness. Look for tons of work and improvements over the summer on both fronts.
Below are the stats for Index 52:
- 164,569,893,828 (164 billion) URLs
- 1,222,033,252 (1.22 billion) Subdomains
- 117,444,355 (117 million) Root Domains
- 1,784,256,496,532 (1.78 trillion) Links
Followed vs. Nofollowed
- 2.57% of all links found were nofollowed
- 64.91% of nofollowed links are internal
- 35.09% are external
- Rel Canonical - 11.33% of all pages now employ a rel=canonical tag (see the parsing sketch after this list)
The average page has 85.12 links on it
- 74.38 internal links on average
- 10.74 external links on average
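For the curious, stats like the nofollow percentages and rel=canonical adoption above come from inspecting link attributes while parsing crawled HTML. Here's a hedged sketch of how that might look using BeautifulSoup; this is illustrative only, not Mozscape's actual parser, and the function name is invented.

```python
# A minimal sketch of per-page link-attribute extraction.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def page_link_stats(html, page_url):
    soup = BeautifulSoup(html, "html.parser")
    page_host = urlparse(page_url).netloc
    stats = {"internal": 0, "external": 0, "nofollow": 0, "has_canonical": False}

    for a in soup.find_all("a", href=True):
        # Relative hrefs have no netloc, so treat them as internal.
        target_host = urlparse(a["href"]).netloc or page_host
        stats["internal" if target_host == page_host else "external"] += 1
        if "nofollow" in (a.get("rel") or []):
            stats["nofollow"] += 1

    # rel=canonical lives in a <link> tag in the <head>.
    stats["has_canonical"] = soup.find("link", rel="canonical") is not None
    return stats

html = ('<head><link rel="canonical" href="https://example.com/"/></head>'
        '<body><a href="/about">About</a>'
        '<a href="https://other.com" rel="nofollow">Other</a></body>')
print(page_link_stats(html, "https://example.com/page"))
# {'internal': 1, 'external': 1, 'nofollow': 1, 'has_canonical': True}
```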
Feedback is greatly appreciated - this index should help with Penguin link-data identification substantially more than our prior one, and the next one should be even more useful for that. Do remember that since this index stopped crawling and began processing in mid-April, link additions/removals that have happened since won't be reflected. Our next index will, we hope, be out after 5 or fewer weeks of processing, improving that freshness. We're excited to see how this affects correlations and data quality.
We all appreciate larger, fresher updates. This latest report is mostly good news. A few thoughts / questions:
- I recognize the Linkscape crawl is the most complex, time-consuming task your team performs. With the ever-increasing pressures of examining links (ranging from Penguin to our clients' focus on DA to the basic needs of link building), SEOs need more information, faster. My team subscribes to multiple toolsets, including SEOmoz, for tracking and managing links.
And so my question is... with SEOmoz recently raising $18 million in funding, can you share a high-level overview of whether some of that funding will go toward making the world's best non-Google web crawler? From a naive point of view, it seems that if you double the number of servers involved in crawling, you can either crawl twice as many URLs or complete the job in half the time, either of which would be appreciated! I'm sure it's not that simple. We would all love to hear your vision for the next year.
- One suggestion for adjusting the crawl: if any user requests data on a domain using OSE or any SEOmoz tool, that domain could be added to a table that's explicitly included in future crawls. This would only apply to domains for which there is no data currently available in Linkscape. If this feature is costly to implement, perhaps you could limit it to paid subscribers.
- As for the icing on the cake, any plans for SEOmoz to introduce a method of link tracking? When I'm viewing a competitor's links in OSE, I would love to click a box next to a link to indicate that I wish to pursue it for a client. When I view my own clients' links, I would love to be able to do more to track them (e.g., mark links as spam, get alerts if a link disappears, etc.).
I love the Moz toolset, but I find myself forced to subscribe to multiple toolsets to get the most complete, accurate data. I look forward to one day finding a single toolset that can meet the needs of all my clients, or at least cover one area, such as backlinks.
I've definitely thought in the past that it would be really smart to give some sort of priority to domains that were looked up in OSE or have an SEOmoz campaign running for them. It seems like an easy way to offer more thorough results to paying customers, which would hopefully increase retention and reduce churn - something I'm sure SEOmoz considers one of its most important metrics, given that it's a subscription-based service.
I would also love to be able to have my team work with a single toolset, but I don't see it happening anytime soon.
Hey Ryan,
Thanks for all the great feedback and suggestions! You have a lot of great points - let me see if I can address them all:
1. Our plans toward making the world's best non-Google crawler: our crawler is amazing - too amazing for our processing :) We can easily crawl 160 billion URLs, or more by adding machines as you suggest; however, we need to rework the way we process the data in order to keep the index fresh. That's our biggest priority now as we work toward a larger index AND consistent releases. As Rand hints above, in the next 18 months or so we're working toward more of a rolling index, similar to Google's. That would allow us to process much more crawl data than we currently can.
2. The suggestion of injecting requested URLs is a great idea and one we've talked about. It's on the roadmap, but it has some complications that need to be worked out - including missing metrics for pages we didn't discover organically, and typos or malformed URLs typed into OSE. Definitely something we hope to work out in the future, though!
3. Link tracking - that is an interesting suggestion! I'll pass that one over to the product team for them to add to the feature request list. Thanks for the great ideas!
Thanks for the update Carin! All great to hear.
Why am I not seeing any of the new data?
Last Index Update: May 1st, 2012
Argh!
Hi optimizeguyz,
Are you looking at OSE? The index has been updated; it just looks like we haven't refreshed the stats and index date on the main page. Thanks for catching this! Try searching a site and let me know if you're not seeing the updated data.
Yeah, I'm not seeing any updated info. All the campaigns I monitor look the exact same as they did prior to this update. I'm new to SEOmoz, so maybe I'm not looking at this right, but it doesn't appear to have updated within OSE or MozBar.
Yep, Sam's totally right - sorry about that confusion! Also, PRO information can take up to a full day to update after a release, so you might not see the web app update until the end of the day. We're working to speed up that refresh in the web app - we know you all need it as soon as possible!
**grain of salt, I'm not a PRO member**
If losing 7.6% of requested URLs reduces the number of overall URLs parsed by 67.3%, and thus would theoretically mean an index update every 16 days, I think most customers would be thrilled with the fresher data (the rough math is sketched below). I'd say a 4:1 approach would help everyone: do four updates that cover ~80% of URLs requested at ~55 billion URLs each, then do a 'big' one. That gives a quarterly cycle of 5 updates, with at least one of them being larger and more encompassing (though not as fresh).
Hey look, 2 cents!
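For reference, here's the back-of-envelope arithmetic behind that 16-day figure, assuming processing time scales linearly with index size (a big assumption; the sizes come from this post and Index 50):

```python
# Rough proportional-scaling estimate, not a real capacity model.
big_index_urls   = 164e9   # Index 52: ~164 billion URLs
big_index_days   = 7 * 7   # ~7 weeks of processing
small_index_urls = 53e9    # roughly Index 50's size

small_index_days = big_index_days * small_index_urls / big_index_urls
print(f"~{small_index_days:.0f} days per small update")  # ~16 days
```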
Interesting idea, and definitely something to consider. Our goal right now is to "have our cake and eat it, too" by getting off Amazon's highly faulty hardware and onto something more robust, customized for the kinds of operations and redundancy that a web index and link-graph calculation require. If we're successful with that in the near future, our hope is to have very large index updates every 2 weeks (meaning 26 annually, or about 6-7 per quarter).
After that, there are some crazy plans to get to a rolling index system more like what Google does - potentially with daily freshness (that's probably 18+ months away, though).
I love cake, and when it comes to eating cake, I'm a Jedi. How about all the clever bods at SEOmoz stop chasing after Google and create the replacement? Google Search has had lots of fun for 15 years (I was there welcoming them in and kicking out HotBot, Lycos, et al.) and deserves to be pushed off the top (",). All tech has its day, and then it's time to change. The change cycle is getting shorter now, and I love a new world order nearly as much as cake. Before you agree, can I have 1,000 shares for the idea?
Every dog has its day and they keep on having theirs.
I'm in!
@ucffool great idea...
@randfish "index updates every 2 weeks" - now that would be impressive :)
I must say I've been using OSE for some time now. It's a pretty useful tool, but at this point it doesn't give me any real reason to buy, as there is no data available for my site, https://www.fashionox.com/. It's definitely something I may look into in the near future, though.
I'm going off topic a bit, but your Whiteboard Friday is excellent. I listen to it on my phone, as I'm always busy working on my site.
Great job, Moz engineering team! This is indeed a feat of proportions that I cannot even begin to understand. You guys do a great job!
I also think it's a good call to scale BACK the number of URLs crawled/processed in order to deal with scalability issues, and then to scale back up again. Fresh data is better than more data, IMO. Fresh + more is the sweet spot, but fresh and accurate is better than old and accurate.
Keep on keeping on, guys!
Thank you for setting challenging goals and being the foundation of the SEO world.
I'm still seeing a bit of old link data in OSE. There are a few links in particular that went away over 2 months ago, yet remain in my site's report. Any idea why?
My guess is that it wasn't part of the 164 billion. They have no reason to remove older data (some data is better than none) unless they've scanned it for fresh information.
Hey Anthony,
Link metrics in this index are going to be from early February to mid-April. If you're seeing old links, we probably recrawled that page in February or March, prior to the update. Hope that helps!
Thanks Carin (and Mario). I'll keep my eye on it after the next few crawls.
This wait seemed shorter compared to the last one, and the results seem to be much better. My only question is: is the anchor text breakdown updated with this roll-out? If I remember correctly, it wasn't fresh with the previous update.
Yep, the anchor text is updated with this release. We had so many operational issues with the AWS machines during the prior index that, in order to get the data to our users in a reasonable amount of time, we released that data separately. Definitely not the normal procedure! This release is a full update of the entire index :)
Hey Rand,
Thanks for the update (and the Update!) :)
I just hope that everyone takes a deep breath and THINKS about what they've been doing with their sites before the next index update arrives... I'm sure all these high-volume backlink-removal campaigns have to make a big impression on OSE data very soon.
It would just be wrong if people were to jump to the wrong conclusion and blame the index for changes brought about by their own activities!
Sha
While I'm glad you guys do what you do... I really don't read these (Mozscape index updated) blog posts :). But I'm sure many mozzers (that is a word, right?) appreciate the info.
Hooray for fresh data. :-) Keep up the good work, SEOmoz team.
I don't understand why you're going after numbers instead of focusing on quality. For example, I changed a lot of links (removed them or changed the anchor text) more than 5 months ago, but I still see the old data in OSE! That's terrible for me!
Good work. Thank you for sharing this data.
Yeah, you're right, itogers... SEOmoz provides us with the latest updates on Google algorithms. It really helps us stay competitive.
Are you trying to build a new search engine?
Great job with the update. Every 2 weeks is very challenging. Hope it works out!
The effort involved in something like this is just outstanding. Will we ever see the team at SEOmoz release their very own search engine?
That would be simply awesome.
Hi Rand
I am, like many others, a big fan of you and your team.
However, as Carin knows, some sites I manage are not being updated in OSE, and that's what I pay my dues for. These sites are not brand new, they're not on the edge of the net, and they have good links coming in. In other words, the metrics are unavailable to me, and so I can't advise my clients.
I never had this problem until about 3 months ago.
Fresher updates FTW!
Thank you, SEOmoz, for amazing link data! You help us SEOs keep an eye on the competition!
Good stuff, Rand.
AWESOME!!
Interesting news, Rand - hope you succeed in reaching your milestone.
Firstly, a huge congratulations on getting where you are. A long way from the puppet... but I digress. My team and I have been playing catch-up and looked at your new search; we found it a little confusing, but as I said, I've personally been out of the loop for some time. What you have appears to be SEO-specific - would I be correct there? The Cantufind Network I represent has to step up to the plate and refill the shelves, so to speak; would your API be an option during, and probably after, that process?
I'm kind of guessing I need to be a premium member to be asking all these (on the face of it) newbie questions, but one thing that really interests me about your ideas is that they work against all odds. We did notice some anomalies when we did our own check, but we wouldn't want to post them here - how, and to whom, should we address these, please, Rand?
Again, a massive good luck from the new team @cantufind.com - we really do mean it.
Hey there! Our help team is pretty fantastic and can probably answer all of the questions you have. Go ahead and email [email protected] with your questions, and I'm sure they'll either be able to answer them for you or get you in touch with the right person!
Thanks!
Carin