After some wrestling with Amazon's EC2 and the tragic loss of many hard disks therein, we've finally finished processing and have released the latest Linkscape update (previously scheduled for Feb. 14). This new index is, once again, quite large in comparison to our prior indices, and contains a mix of crawl data going back to the end of last year. In fact, this is technically our largest index ever!
Here are the latest stats:
- 65,997,728,692 (66 billion) URLs
- 601,062,802 (601 million) Subdomains
- 140,281,592 (140 million) Root Domains
- 739,867,470,316 (740 billion) Links
- Followed vs. Nofollowed:
  - 2.21% of all links found were nofollowed
  - 57.91% of nofollowed links are internal
  - 42.09% are external
- Rel Canonical: 11.11% of all pages now employ a rel=canonical tag
- The average page in this index has 71.88 links on it:
  - 60.98 internal links on average
  - 10.90 external links on average
We also ran our correlation metrics against a large set of Google search results and saw very similar data to last round. Here are the latest numbers using mean Spearman correlation coefficients (on a scale of 0 to 1, higher is better):
- Domain Authority: 0.26
- Page Authority: 0.37
- MozRank of a URL: 0.19
- # of Linking Root Domains to a URL: 0.26
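For those curious about the mechanics, here's a minimal sketch of how a mean Spearman coefficient like the ones above can be computed (the `serps` and `metric` names are hypothetical stand-ins; this is not our actual evaluation code):

```python
# Minimal sketch of the evaluation described above, using hypothetical data.
# For each SERP we take the results in Google's order, score each with a
# metric (e.g. Page Authority), and compute Spearman's rho between metric
# and position; the sign is flipped so "higher metric at higher positions"
# yields a positive coefficient. Per-SERP coefficients are then averaged.
from scipy.stats import spearmanr

def mean_spearman(serps, metric):
    """serps: list of result lists, each in Google's ranked order.
    metric: function mapping a result to its score."""
    rhos = []
    for results in serps:
        positions = range(1, len(results) + 1)           # 1 = top result
        scores = [metric(r) for r in results]
        rho, _ = spearmanr([-p for p in positions], scores)
        rhos.append(rho)
    return sum(rhos) / len(rhos)
```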
Our evaluation process also checks the comprehensiveness of our crawl data against a large set of Google results, and in this index, we've got link data on 82.09% of SERPs. This is slightly down from last month's 82.37%, which we suspect is a result of the late release. Crawl data ages with the web, and new URLs make their way into the SERPs, too. To help visualize our crawl, here's a histogram of when the URLs in this index were seen by Linkscape:
We always "replace" any older URLs with newer content if we recrawl or see new links to a page, so while there may be some "old, crusty" stuff from December, the vast majority of this index was crawled in mid-to-late January.
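Conceptually, that replacement rule is just "newest crawl wins" per URL; here's a toy sketch of the idea (not our production storage system, just the replacement semantics):

```python
# Toy illustration of the "newest crawl wins" rule: a recrawl (or newly
# discovered links to a page) replaces any older record for that URL.
index = {}  # url -> (last_seen_date, page_record)

def record_crawl(url, seen_date, page_record):
    existing = index.get(url)
    if existing is None or seen_date > existing[0]:
        index[url] = (seen_date, page_record)  # newer crawl replaces older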
In the next few weeks, we're working on a new, experimental index that may be massively larger (2-3X) than this one, and closer to the scale of what's in Google's main index. This is very exciting for us and, we hope, for all of you who use Open Site Explorer, the Mozbar, the Linkscape API, and tools from our partners like Hubspot, Conductor, Brightedge, and our newest API partner, Ginza Metrics (check out some cool stuff they're doing with Moz data here and in the screenshot below).
Ginza Metrics' New Backlink Analysis Tool
If you're interested in chatting about using Moz data in products, drop Andrew Dumont a line and he'll be happy to help. And, as always, feedback on this latest index, our tools, or our metrics is greatly appreciated.
Fantastic work by Rand and the OSE team. I was surprised on Quora the other day when I saw how much this tool costs to run!! Overall, one of the best SEO tools on the market =)
Thanks James - lots of people seemed to like that thread, so I'm linking to it here: https://www.quora.com/How-much-does-SEOMOZ-Linkscape-infrastructure-cost Sadly, costs went up substantially for this index due to us having to re-run a lot of processing (over $300K in total), but we're hoping Amazon is going to provide a refund for a good portion of that.
(eyes wide open) $300K... Hats off to Rand and the OSE team. SERIOUSLY! <3 SEOmoz
re: $300k - Wow! Still, it's quite awesome that you can just outsource the computing power to the Amazon cloud and not have to build your own datacenter. Very cool. Make sure you charge it on your credit card to collect the Air Miles... :)
Yeah, I really liked that thread and sent it to a few people who use the tool a lot. That's the best thing about SEOmoz: the transparency with information within the company.
OK, good job, but tell me: how long would it take you to start a search engine, namely Moz.com, in competition with Google.com? It may seem a bit funny at the moment, but I can see you are progressing well towards that (if you want to go for it)!
It was top secret, and finally you've revealed it :D
Oh no, you're taking it wrong here. TAGFEE doesn't leave any place for mysteries in professional work. You can still check Moz.com, and you'll find that Roger Bot is standing there behind the door!
We've talked a few times about building an actual search engine, but to be honest, it's not our mission or vision. We want to help people share their ideas on the web, and another search engine is somewhat ancillary to our goals around bigger, better, fresher data and better visualization, analytics and recommendations from that data.
That said, it may be an R&D project for us at some point to attempt a search engine build, simply to get more information we can use to provide recommendations and advice and inform our own product roadmap.
Guess what, Rand!
I signed up for Ginza Metrics after reading your comment about them, i.e. "check out some cool stuff they're doing with Moz data here and in the screenshot below."
And now I've got 41 (forty-one) emails in my inbox for activation of my account! Can you please tell them not to annoy users with their email bombardment? It's really annoying! Thanks.
*Update: Talked to 'Ray' there, and he is looking into the issue; it seems to be some sort of technical error.
Whoa! I'd definitely drop Ray a line about that - I'm sure it's unintentional.
ZOMOES! Zero Other Mashups Offer Equal Service.
This is Absolutely Awesome!!!
Look at how much work is available for people doing SEO. This is great data and I am really looking forward to your next update. Wow, what an amazing feat it will be to make this already inconceivably large index 2-3x larger.
Thank you, SEOmoz!
Ooo a day early. I am noticing a drop in things like mozRank & trust for us and our competitors (but DA was up). Not a lot, but your calculated metrics shaved a few percentage points off the scores, while links increased. Was there any real change to how things are calculated or is this just an effect of the change to the crawl? Not that it matters, just curious if you see anything similar.
Also, when I look at the top pages tab in OSE, it shows pages that have been dead for over a year and that we 301'd to their new pages. All links point to the new page; is there ever any update to that top page list? The site used to be PHP; now it's .NET. Our 5th "top page" is an old PHP page that is redirected to the new live one. That old one shouldn't carry any value anymore and should be invisible, right?
Thanks for the update!
We've seen that over time, too. I suspect what happens is the same thing you see with Google's PageRank score (though on a much more granular level). Basically, as the sites at the "top" of the link graph (those earning the most links) get matched to a PageRank / MozRank 10, the sites in the middle of the curve naturally distribute a bit lower. Hence, the best way to look at this is always in a competitive comparison view so you can see how your competitors' metrics are affected, too.
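To make that redistribution concrete, here's a hypothetical illustration (the real MozRank computation differs; this just shows how pinning the top of a log-scaled curve to 10 can pull mid-curve scores down even as raw link counts grow):

```python
import math

# Hypothetical log-scaled score pinned so the top of the link graph = 10.
# As the ceiling (top_raw_links) grows faster than a mid-curve site's raw
# count, that site's displayed score drifts down despite gaining links.
def display_score(raw_links, top_raw_links):
    return 10 * math.log(1 + raw_links) / math.log(1 + top_raw_links)

print(display_score(1_000, 1_000_000))   # ~5.00
print(display_score(1_200, 2_000_000))   # ~4.89 -- lower, despite growth
```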
At $300K an update, you guys should really start looking at building your own private cloud.
That's fantastic, I can see many more links now. Keep up the good work, SEOmoz!
WOW, really great news. Every month I'm counting the days until the next index update; it's really very helpful for SEO.
Wow! Fantastic work by SEOmoz as usual.
Nice! I totally missed this the other day. Now to begin digging...
Great stuff. It is good to see such transparency and I am encouraged to connect with your affiliates.
Thanks Rand and Team. Lots of new links... Thanks for this.
Thanks Rand and Team. Lots of new links found too!
I really enjoyed watching the inbound domains count in one campaign nearly triple after the update. It's nice to know that YOUR efforts allow me to track MY efforts. Keep up the great work. :)
Thanks again for this huge index!
I would be very curious to know what software architecture is used for processing this enormous amount of data on Amazon EC2, and, if you tried different solutions, which one was better (for example, SQL vs. NoSQL - assuming Amazon allows some freedom, especially for people who pay what you guys pay!).
Another thing I've always wondered is what machine learning technique you used to "reverse engineer" Google's algorithm, although I can understand if you don't want to disclose it.
Obviously I have many more questions in my head, but I don't want to ask too much!
Finally, I apologise if you answered this question already, but I could not find it anywhere (I found some hints about the machine learning algorithm in a Whiteboard Friday video with Ben, but nothing specific).
Thanks in advance,
Riccardo
I'm not going to be able to explain all the architectural bits, but I can say that most of the code is C++, and we actually use a custom-built, flat-file storage system rather than a SQL or other existing database.
As far as the machine learning process goes - we used some existing libraries, customized, to build a model against ~10K Google search results (the first 30 positions in each). Correlation against these is fairly simplistic, but then we have PA/DA, which are a mashup of all our metrics to get a "best fit" line using these keyword-agnostic metrics.
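If it helps to picture that, here's an oversimplified sketch with made-up numbers (ordinary least squares standing in for whatever we actually use; the features and data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Oversimplified sketch: fit a composite "best fit" score from
# keyword-agnostic link metrics against observed search positions.
# Feature names and values are made up; the real model differs.
X = np.array([[5.1, 120.0],   # e.g. [mozRank, linking root domains]
              [4.2,  40.0],
              [3.9,  35.0],
              [2.8,  10.0]])
y = np.array([1, 2, 3, 4])    # observed positions (1 = top)

model = LinearRegression().fit(X, y)
predicted = model.predict(X)  # lower predicted value = ranks higher
```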
Hope that helps!
Did you change your internal IDs on things like links in this last crawl? Up until this last OSE update, I was comparing links I had previously stored with new ones received from the API. I would use your lrid property in the API (the internal link ID) to see if the link had been previously discovered. That worked until this last crawl. Now all the lrids are new.
I'm not 100% sure about that, but certainly possible. Let me ask around the team.
On edit: got this from one of our big data team members:
"Yes, our internal IDs rotate out with each new index. Unfortunately any tools that were relying on them will need to be tweaked; at the very least they will need to collect new ids. Note also that these ids are explicitly not part of our public interface; use at your own risk and all that."
Thanks for finding out the answer! I will explore new ways of keeping up with historical link discovery.
The internal IDs on links, URLs, FQDNs, and PLDs are exactly that: internal IDs. These change on every index release. If you got lucky and some of them did not change across an index release, I would be quite surprised.
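If you need a stable key across index releases, one option (just a sketch on my part, not a supported feature) is to derive it yourself from the source/target URL pair instead of storing lrid:

```python
import hashlib

# Sketch: an index-independent link identifier derived from the
# normalized source and target URLs, instead of the rotating lrid.
def link_key(source_url, target_url):
    raw = source_url.strip().lower() + "\n" + target_url.strip().lower()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

seen_links = set()
key = link_key("https://example.com/a", "https://example.org/b")
newly_discovered = key not in seen_links
seen_links.add(key)
```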
BTW, it is a proud moment for me to be a part of SEOmoz (daily reader). Thanks, Rand!
Hi Rand,
This is just so freaking awesome.
Was just wondering: would it be possible to have something like "historic link profile data"? You guys remove historic data, but I believe you must store that data somewhere, right?
I am only asking this because, from an analyst's point of view, it would be very helpful for me. For example, if one of my competitors' rankings fell drastically, looking at the historic data and the fresh data would let me analyze what went wrong for them, and then I would refrain from using the same tactics.
- Sajeet
We actually DO have this now! If you go in your campaigns and look at the link data tab, you'll find historical link metrics for you and your competitors. Here's the post announcing it from December: https://www.seomoz.org/blog/historical-link-analysis-is-here
Hey Rand... Re: Ginza, assuming that in the sentence "check out some cool stuff they're doing with Moz data here", the here is meant to be linked? :)
Doh! Sorry - fixed that (and added a nice screenshot from them).
Congratulations!
Gratz! It's a really huge update for all webmasters :)
Always like seeing updated and more data.
I must say the average number of links on a page surprises me. Especially since most of my tools are configured to flag pages with >100 links, seeing that ~70 is the average made me realise how easy that limit is to exceed on a lot of the sites I deal with.
Sweet! Thanks for the hard work!
Looking good! Always happy to see updated data when I wake up.
That's great, guys. Thanks for all the hard work! I couldn't imagine having to work with 66 BILLION URLs! Ha!
The last update is a huge jump and a big surprise for me; all my websites have 70% more backlinks discovered. Great job!