I've got good news. Today marks a new Linkscape index (only 14 days after our previous index rollout) which means new data in Open Site Explorer, the Mozbar, the Web App and the Moz API. It's also more than 60% larger than our previous update in early January and shows better correlations with rankings in Google.com; I'm pretty excited.
For the past couple of years, SEOmoz has focused on surfacing quality links and high-quality metrics that correlate well with rankings, with the goal of showing off a large, representative sample of the web's link graph. However, we've heard feedback that this isn't enough and may not be exactly what many who research links are seeking (or at least, it's not fulfilling all the functions you need). We're responding by moving, starting with today's launch, to a new, consistently larger link index.
Today's index is built differently from past Linkscape updates. Rather than take only those pages we've crawled in the past 3-4 weeks, we're using all of the pages we've found since October 2011, replacing anything that's been more recently updated/crawled with a newer version and producing an index more like what you'd see from Google or Bing (where "fresh" content gets recrawled more frequently and static content is crawled/updated less often). This new index format will let us expose a much larger section of the web on an ongoing basis, and it reduces the redundancy of recrawling web pages that haven't been updated in months or years.
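To make the "keep the newest crawl of each URL" idea concrete, here's a minimal sketch in Python. It is not Linkscape's actual code: the `CrawledPage` record, the `build_index` function and the sample URLs are all hypothetical names made up for illustration, and a real index would obviously not fit in an in-memory dictionary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CrawledPage:
    """Hypothetical record for one crawled copy of a page."""
    url: str
    crawled_on: date
    outlinks: tuple = ()


def build_index(crawl_batches, earliest):
    """Keep the most recently crawled copy of each URL seen on or after `earliest`."""
    index = {}
    for batch in crawl_batches:
        for page in batch:
            if page.crawled_on < earliest:
                continue  # too old to include in this index
            current = index.get(page.url)
            if current is None or page.crawled_on > current.crawled_on:
                index[page.url] = page  # a newer crawl replaces the older copy
    return index


# Illustrative data: one page recrawled in December, one static page seen once.
batches = [
    [
        CrawledPage("http://example.com/", date(2011, 11, 2), ("http://example.com/a",)),
        CrawledPage("http://example.com/static", date(2011, 11, 5)),
    ],
    [
        CrawledPage("http://example.com/", date(2011, 12, 20), ("http://example.com/b",)),
    ],
]
index = build_index(batches, earliest=date(2011, 10, 1))
```

In this toy run the frequently updated page is represented by its December crawl, while the static page survives from its single November crawl instead of dropping out of the index.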
Below are two graphs showing the last year of Linkscape updates and their respective sizes in terms of individual URLs (at top) and root domains (at bottom):
As you can see, this latest index is considerably larger than anything we've produced recently. We had some success growing URL counts over the summer, but this actually lowered our domain diversity (and hurt some correlation numbers of metrics) so we rolled back to a previous index format until now.
This means you'll see more links pointing to your sites (on average, at least) and to those of your competitors. Our metrics' correlations are slightly increased (I hope to show off more detailed data on that in a future post with help from our data scientist, Matt), which was something we worried about with a much larger index, but we believe we've managed to retain mostly quality stuff (though I would expect there'll be more "junk" in this index than usual). The oldest crawled URLs included here were seen 82 days ago, and the newest stuff is as fresh as the New Year.
Despite this mix of old + new, the percent of "fresh" material is actually quite high. You can see a histogram below (ignore the green line) showing the distribution of URLs from various timeframes going into this new index. The most recent portion, crawled in the last 2/3rds of December, represents a solid majority.
Let's take a look at the raw stats for index 49:
- 58,316,673,893 (58 billion) URLs
- 639,806,598 (639 million) Subdomains
- 135,392,083 (135 million) Root Domains
- 617,554,278,005 (617 billion) Links
- Followed vs. Nofollowed
  - 2.10% of all links found were nofollowed
    - 56.50% of nofollowed links are internal
    - 43.50% are external
- Rel Canonical - 11.79% of all pages now employ a rel=canonical tag
- The average page has 87.36 links on it
  - 73.06 internal links on average
  - 14.29 external links on average
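For a rough sense of scale, the percentages above can be turned into approximate link counts. The sketch below assumes that the nofollow percentage applies to the full 617 billion link total, which the stats don't state explicitly; the variable names are just for illustration.

```python
# Figures copied from the index 49 stats above.
TOTAL_LINKS = 617_554_278_005
NOFOLLOW_SHARE = 0.0210           # 2.10% of all links
NOFOLLOW_INTERNAL_SHARE = 0.5650  # 56.50% of nofollowed links

# Assumption: the nofollow share is taken over the full link total.
nofollowed = TOTAL_LINKS * NOFOLLOW_SHARE
nofollowed_internal = nofollowed * NOFOLLOW_INTERNAL_SHARE
nofollowed_external = nofollowed - nofollowed_internal

# Sanity check: internal + external links per page vs. the reported average.
per_page_sum = 73.06 + 14.29
rounding_gap = abs(87.36 - per_page_sum)

print(f"Nofollowed links: ~{nofollowed / 1e9:.1f} billion")
print(f"  internal: ~{nofollowed_internal / 1e9:.1f} billion")
print(f"  external: ~{nofollowed_external / 1e9:.1f} billion")
print(f"Per-page sum differs from the reported average by {rounding_gap:.2f}")
```

Under that assumption, roughly 13 billion links in the index are nofollowed, and the small gap between the per-page figures is consistent with rounding.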
In addition to this good news, I have some potentially more hilarious and/or tragic stuff to share. I've made a deal with our Linkscape engineering group that if they release an index with 100+ billion URLs by March 30th (just 72 days away), I will shave/grow my facial hair to whatever style they collectively approve*. Thus, you may be seeing a Whiteboard Friday with a beardless or otherwise peculiar-looking presenter in the early Spring. :-)
As always, feedback is welcome and appreciated on this new index. If some of the pages or links are looking funny, please let us know.
* 20th century European dictator mustaches excluded
Nice update.
Why doesn't OSE move to more real-time updating though? Similar services are doing it and cost less, although they lack certain features I particularly like with OSE.
It's not currently possible to have metrics like mozRank/mozTrust/Page Authority/Domain Authority or sorting/filtering from the API side (which makes OSE and other tools fast when you re-sort or choose different options).
Basically, we crawl the web for a few weeks, then run a processing cycle on the data that's been crawled, which uses a metric ton of computers on EC2 for about 20 days. There's ways to improve efficiency here (we think) and to move to beefier computers that can run some of this faster, but we'll likely have at least 10-15 days of processing between index cycles.
Where we can and will offer more real-time updating is on our fresh-web index. Currently available through Blogscape in Labs (though occasionally down as we fix/update it), this pulls only fresh/new posts from 10mm+ feeds. In the next few months, that will become a more useful, robust and productized service and we hope can fill the gap between real-time and feature-rich.
This. Please don't think about turning SEOMoz into yet another link list for the sake of speed. Leave that to majestic and ahrefs. The incredible metrics are what make the Linkscape data set so invaluable.
thanks again for all the hard work from the Moz team. We truly appreciate it!
I would highly agree with that... with that being said, I truly appreciate the great job SEOmoz is doing for the betterment of quality SEO on the web.
All great things Rand! Awesome work. .:)
Hmm, I can't stop thinking of the idea that SEOmoz could one day make its own search engine.
I thought the exact same thing Netlogiq. The exact, same thing...
LOL don't Google it.....Moz it!
The Moz Panda #justathought :)
Thanks Rand - that's great. Improved correlation with Google is particularly exciting.
Great to wake up to the new update, looking forward to discovering the extra coverage. Think I will also be revisiting many of the backlink reports I produced yesterday!
Rand, I'm glad that Linkscape is getting 58 billion pages indexed, but this makes you look a bit immature, not you but Linkscape actually. Actually there is a correlation between these figures and immaturity. Remember, back in 2000 and 2001 Google used to have a phrase on their main homepage, and they used to mention things like "Google index: 1,060,000,000 web pages" and "Search 1,060,000,000 web pages". Now when I looked at your post title, it reminded me of the immature Google! Google matured later and removed the figures; I hope you too go beyond the figures. It's not the number of pages indexed by Linkscape that matters, it's the quality of your parameters. Things that matter here are "how much mozRank and PR correlate with each other" and how closely your PA and DA match Google's quality parameters. Rand, I may not be in a position to advise you on something, but I can suggest stuff to you and give ideas for your betterment. Keep it up!
Hmm. Interesting point - I agree that for a search engine, bigger isn't necessarily better, but what we heard and saw from customers was that they did want bigger, especially because link mining and link research are better accomplished with larger indices even if quality comparison/assessment isn't (necessarily - it could be, too).
Rest assured we'll keep working toward both goals - better correlation with rankings from our metrics AND larger indices so link researchers can find the deep stuff they're seeking.
asadwahab
Obviously Rand can respond to this directly, so I'm not going to put words in his mouth or claim I know his views on your comment.
What I want to speak to is my own opinion on what you said. And since it's just my opinion, please understand it's meant with respect of your own opinion.
Given that SEOmoz is not the size of, nor do they have the funding of a company such as Google, and while some people, such as yourself, are fixated on the "quality of the results", as far as I am concerned, the quality could be 100% for all I care, and yet if the data set were just a few hundred pages, SEOmoz would be worthless to me.
While the exact number of pages indexed isn't important to me, knowing that the moz team continues to strive to increase the volume of pages indexed means the world to me. Having an actual number assigned to it allows me to gauge the scale and scope of the progress they make in increasing the volume. It also humbles me to know that even a small company within the search community can manage to gather that much data, let alone do so with the intent, desire and willingness to also ensure as accurate a correlation data-set as possible.
As I communicated with Rand tonight on Twitter, having a much larger data set makes a great deal of difference to me in the work I do at the forensic level, because the larger the data set, the more likely patterns will reveal themselves. Patterns that are at the heart of my forensic analysis during SEO audits.
And sure, while there may come a point when not detailing counts may come across as "mature" to you, I for one hope Rand continues to do his best to maintain the TAGFEE policy.
I agree with you, a larger data set is a basic key in such indices and helps in knowing about the patterns, correlation and so on and on, but let me summarize my point.
Big numbers are great, in fact mind blowing, but 'talking' about and 'mentioning' them is not (IMO). It may distract one's attention from other things like quality.
And I agree that it's a small company, and it's doing great, but if they continue the pace, I am sure they will grow up and one day Matt Cutts will be sending a personal message to Rand and will ask him 'Hey Rand, can we (Google) have a look at your API? We need to check certain things' :P. It's all about thinking big and openly and I believe in SEOMoz, they are on the right path, they just need to act like a big company now, as they are getting bigger now.
Actually it shows progression.
I agree with Craig and besides - why shouldn't Rand and SEOMoz celebrate this - it's a success and is worth sharing.
Just FYI - Page Authority's correlation w/ Google rankings is ~0.37 on average with this index and Domain Authority is ~0.26. Those are the highest numbers in a long time and we hope we can push over 0.4 for PA in the next few indices.
As a comparison, PageRank in Google's Toolbar is typically ~0.12 (but it's been a while since we checked).
That's what I was looking for, actually this information was missing in the original post, that's what triggered me... Thanks for the information, Rand! Now I can have 'inner peace'!
Obviously quality does matter! But I believe number is the next big thing that I will consider before putting my money in their pocket, so I think celebration is great as it appreciates the efforts the team is making towards the betterment of the system.
I would highly agree with Alan's point that yes, quality is all that matters, but what if quality only extends to a few hundred links? So celebrating numbers is OK (it's good if not great!). I believe sharing these numbers with the community actually satisfies customers, lets them know that SEOmoz cares about them and gives them a reason to stay a customer!...
What a great start for 2012!
Presently I am supplementing my OSE results with Majestic and am also looking at Raven and other SEO tools. I hope OSE keeps progressing. I dream of a day when we may be able to use just OSE for link research without the need to subscribe to multiple toolsets.
Great Update MOZ!!!
Good job mozzers!
Great work, more links have been picked up for me, and PA is making more sense compared to competitors than ever before.
But crawls are not as deep as they were a few months ago, as only about 500 of my own pages are getting crawled (out of about 2000+). But it seems that it's the less important ones that are getting missed, so the trade-off is worth it.
update: this week's crawl was 10,001, guess a SEOmoz angel read my post ;)
I have been eagerly anticipating this update and really thought it would be a great improvement, but it is still the same. Google WM Tools says I have over 800 links; OSE says I have 59. That is exactly the same as before the update. Has it not propagated everywhere?
Awesome! Thanks! I'm looking forward to your new look :)
Hi, may I ask a question:
where do you get the indexed pages from?
Do you just get them from search engines (by searching a lot of keywords and crawling the SERPs), or do you crawl on the basis of your existing database?
Also, I find a correlation between PR and OSE Domain Authority.
When you calculate Domain Authority, what metrics do you consider as factors?
And I found that OSE links include a lot of in-site links.
So, what kinds of in-site links are counted as OSE links, and which are not?
I am an SEO from China and have been using your site for about a year, yet many things are not so clear to me.
Would you please kindly tell me?
We crawl the web ourselves, just like a search engine would.
Correlation between PR and DA should be pretty high, but it's even higher for PR and mozRank (given that both are built on very similar algorithms).
OSE should show you on-site/internal links as well as external, though more of the focus tends to be on external.
Amazing update!
I think there is a genuine case for Moz to use its ever-growing index to power their own search engine.
It certainly makes more sense than a Facebook-powered search engine.
Great work on obtaining improved correlation. It helps us all with our SEO efforts and, at the end of the day, our SERP results.
I'm loving this new update. It's found a decent number of our best links which weren't listed before (or had been and had vanished), and I expect the same of our competitors.
'Completeness' is part of the quality of this site, just as much as the correlation of the PR etc. Good to see it improve.
Great update, but earlier today I just made a report of 100 different sites' backlinks - looks like I'll be doing it over momentarily! #timingfail
Doh! Sorry about that. I should have mentioned in the post that this was an unexpected index update (not reflected on the calendar for us) because we weren't sure whether it would work and if all the numbers would look good (correlations-wise). Thus, we had a backup index rolling to launch Feb. 1st, but given the success (so far) of this one, we'll likely stick to this format and release something in early-mid February that's even larger, showing data from approx Nov-end of January.
p.s. Don't hold me to this, though. Linkscape is massively complex and challenging to work with, so our engineers will have final say on what gets launched when :-) I'll try to message that (likely on Twitter) when we have a clear picture.
This is an awesome new update. Finally, more coverage. Nice to see OSE catching up with some of the bigger indexes!
I made some tests and it seems that a large number of relevant backlinks are now returned for French websites. Thanks!
Rand, SEOmoz engineers, this is for you:
Thanks a lot! <3
To some comments above and later readers: Of course you should not focus on every link you see. Your competitors' SEOs may be even worse than you :p
But you should have the opportunity to know (and build) these links when needed.
Have a great day!
Just wanted to add - this is pretty damn close to Majestic SEO. I just ran a test on two of my websites and although it is slightly off, by around 200 links, it's far closer than it's ever been before.
Really well done - I'm looking forward to hearing about the 100+ billion URLs by March 30th!
Tell me Rand, why can't Linkscape add data instantly, and why does it need a refresh all the time?
I guess there are too many complications in dealing with such a big chunk of data. Real time would need too much computational power... P.S: That's what I understood from the post above!
I think it's because they need to recalculate mozrank each time. And since Mozrank is like PR, they need a large amount of resources to refresh.
Rand explains this above, but it is pretty simple. All the other competitors - the old yahoo site explorer, majestic SEO, ahrefs.com, even google webmaster tools, are all just big lists of links. Nothing more.
SEOMoz provides side by side metrics like mozRank and mozTrust. A fairer comparison would be to compare the update speed of SEOMoz to the Toolbar PR updates for Google, since they are essentially calculating the same metrics. Seeing as TBPR can take months to update, it seems SEOMoz is doing a bang up job.
My thanks for this ambitious rollout. After the demise of Yahoo! Site Explorer it would have been easy to rest on your laurels.
Excuse me now. I'm going to investigate what's happening with my sites.