On Thursday (November 3rd) of this past week, Linkscape's index updated - in record time, just 3 weeks. New link data is once again available in OpenSiteExplorer, via the SEOmoz API, and in the Mozbar. Here are the stats for our 46th index update:
- 43,077,387,028 (43 billion) URLs
- 480,597,551 (480 million) Subdomains
- 105,570,741 (105 million) Root Domains
- 356,255,241,471 (356 billion) Links
- Followed vs. Nofollowed
- 2.18% of all links found were nofollowed
- 58.21% of nofollowed links are internal, 41.79% are external
- Rel Canonical - 10.46% of all pages now employ a rel=canonical tag
- The average page has 77.28 links on it (down 0.19 from the last index)
- 64.86 internal links on average
- 12.42 external links on average
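As a quick arithmetic check, the internal and external averages add up to the overall per-page figure: 64.86 + 12.42 = 77.28 links per page.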
Since August, we've been struggling with a particularly devious problem: binary files in the index messing up link counts and showing links that Google + Bing probably aren't counting. In September's crawl, we put a blacklist on these files and saw a ~40% reduction in binary files. This time we've made even more progress (though it's tough to know exactly how much - we're continuing to evaluate), and you should see a significant reduction in these binary files.
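For the curious, filtering like this generally keys off the Content-Type header, falling back to the file extension when the header is missing or misconfigured (our engineers describe that fallback further down in the comments). Here's a minimal, hypothetical sketch in Python - the lists and names are illustrative, not our actual crawler code:

```python
from typing import Optional

# Hypothetical sketch of a binary-file filter for a crawler.
# The blacklists below are illustrative, not Linkscape's actual lists.
BINARY_EXTENSIONS = {".exe", ".zip", ".gz", ".dmg", ".iso", ".mp3", ".avi"}
BINARY_TYPE_PREFIXES = ("application/octet-stream", "application/zip",
                        "image/", "video/", "audio/")

def is_probably_binary(url: str, content_type: Optional[str]) -> bool:
    """Return True if the URL should be skipped as a likely binary file."""
    if content_type:
        # Trust a well-formed Content-Type header first.
        main_type = content_type.split(";")[0].strip().lower()
        return main_type.startswith(BINARY_TYPE_PREFIXES)
    # Header missing (or unusable): fall back to the file extension.
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return any(path.endswith(ext) for ext in BINARY_EXTENSIONS)
```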
In part because of the reduction in these files, processing time for the Linkscape index dropped, enabling us to produce a much faster index update. However, we're planning to produce a much larger index in December and thus anticipate processing time rising back up. On the plus side, this will mean a lot more link data. In 2012, we're aiming to reach into the 100 billion+ URL index size, closer to what we've heard Bing + Google keep in their main indices (~120-140 billion URLs).
As always, feedback on the new index is greatly appreciated - if you're seeing stuff we've missed, files we shouldn't have crawled or metrics that feel wrong, please let us know. Our engineers would love to hear from you.
Noticed the index update a few days ago - thank you! A ~30 day refresh on a 100 billion URL index will be an epic achievement. I don't envy the bandwidth cost of that!
On a separate note, Will mentioned this in his SearchLove presentation: https://blekko.com/webgrep - the ability to grep the LS index would be completely awesome. Even if you guys were just taking requests (like Blekko does), you could probably show the community some phenomenal insights!
Steve Souders' HTTP Archive is another interesting example of this: https://httparchive.org/
By the way, just a small feature request - when you update the index could we get an email notification? :-)
Hi Richard,
We are talking internally right now about a way to notify people about an update. Stay tuned, as we have a few options in mind but need to narrow them down and execute.
It is articles like this one that prove to me I made a wise decision to get the pro membership. Not only is the Moz team doing awesome work but the community adds so much value.
Thanks Rand and team for your responses and thanks community for interacting. GREAT STUFF!
Yeaaah! Good work! Fresh data is always a highlight of my day.
And I still hope that, one day, there will be a solution for tracking domain indicators like DA, DmR, and DmT historically, so that I don't have to type these values into an Excel spreadsheet after each update ;-)
Advanced Webranking (third-party software that imports Linkscape data) does provide historical tracking for the Moz values, but I would love to have this inside of SEOmoz Pro!
With sunny regards from Germany,
Sebastian
Yep - product team has that spec'd and it's being built into the web app as I type. Hopefully released before the year is out (maybe even in November).
I totally agree with softclick and I can't wait to get this feature. +1 for that. That is the feature I miss the most. After that you can just stop improving it. Kidding :)
Related to the web app: I saw different ranking results in the web app and in the "old" rank tracker tool. Why is that happening, and is that a normal thing? It was just a few hours' difference between the weekly update and checking in the old rank tracker tool. I'm seeing a similar thing on multiple domains.
The web app should have the most accurate, geo/personalization-agnostic rankings; I believe the old rank tracker may still be pulling results with Seattle geo-biasing.
Your reply makes me very happy!
Nice one Rand - I've been seeing those funny files in the OSE index; happy the problems have now been fixed up =)
Any plans to work on a historic index too as a side project?
Thank you, this is fantastic. I am seeing a lot of links from small blogs that I'm not used to seeing in the past - it seems like the index is getting great at including those as well. I'm seeing a lot more links in general to some pages on my site since the last update.
Great work!
I was seeing the same thing for my personal site and for Everywhereist.com. Going to check some more sites, but yeah, kinda cool that we seem to be crawling some of that stuff more deeply.
Awesome! Thanks for keeping results fresh!
Would be nice if reciprocal links were flagged like nofollowed ones are.
The last couple of days my Mozbar has been asking for a username and password even though I'm logged in. Is this something related to the latest update?
Thanks
Does Linkscape plan on updating again any time soon? My last crawl was in November and I need to send out a report.. Very frustrating..
Nice to see another update… I can see lots of links from different sources… great work SEOmoz!
I'm still finding OSE data way off from what's being reported by MajesticSEO and Ahrefs.com.
If out of three data sources, two are reasonably linear and one is way off the mark, is the only logical and correct assumption that the one way off the mark is unreliable at best and misleading at worst?
I find OSE data an interesting fiction, but it's gone way past the point of being useful and actionable data of the sort that we can put to clients.
As far as I understand it, it's simple: Majestic pulls in a lot of c**p even with their new "fresh data" approach, whereas OSE cuts off some percentage of the lowest-quality links. Also, I think OSE first needs to index a page and its links, and then on the next update it will index the pages linked from the pages indexed in the previous update - so it takes time for links to get picked up. I think for older domains and older links the differences are much smaller.
Just my 2 euro cents.
We generally map well to the numbers you see from referring links/domains in your Google Analytics, as well as what's reported in Google/Bing Webmaster Tools and Yahoo! Site Explorer. MJ tends to be much larger than any of these, often not in proportion. I'm unfamiliar with Ahrefs; not sure if they build their own indices or rely on third party data in whole or part.
Our goal is to always have the metrics that best correlate to rankings in Google, and for the past few years, whenever we or other third parties have run analyses, we tend to hit that mark. That's not to say we don't want to get bigger and fresher - we realize a sample set is not as valuable as the entire web's link graph (so long as they're all portions Google/Bing are crawling + counting, too).
I should also note that we do a lot of canonicalization and de-duplication of URLs, which we've found leads to higher quality stuff, but it does make the raw counts lower.
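To give a rough sense of what that canonicalization involves (a simplified, hypothetical sketch - not our production pipeline), normalizing things like host case, default ports, fragments, and empty paths collapses many URL variants into a single record:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Simplified URL canonicalization (illustrative only): lowercase the
    scheme and host, drop default ports and fragments, and normalize an
    empty path to '/'. Real pipelines handle many more cases than this."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    host = netloc.lower()
    if scheme == "http" and host.endswith(":80"):
        host = host[: -len(":80")]
    elif scheme == "https" and host.endswith(":443"):
        host = host[: -len(":443")]
    path = path or "/"
    return urlunsplit((scheme, host, path, query, ""))

urls = [
    "http://Example.com:80/page#comments",
    "http://example.com/page",
    "http://example.com/page?",
]
# De-duplication: keep one record per canonical form.
print({canonicalize(u) for u in urls})  # all three collapse to one URL
```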
Thanks for the update, Rand! Glad to see the data is a little more accurate now.
“In 2012, we're aiming to reach into the 100million+ URL index size”
100billion+?
Congrats on the update. Do you plan on doing (or have you already done) an analysis blog post of how your index has changed over time in terms of things like % nofollowed, number of pages with rel=canonical, etc? I’d be interested to see how quickly the web follows the search engines’ new initiatives.
Doh! Thanks for the catch; fixed that up.
In terms of analysis - yes; great idea. Next update (in December), I'll do some charts showing nofollow and rel canonical over time.
Looking forward to seeing these charts, Rand! Keep up the good work!
"As always, feedback on the new index is greatly appreciated - if you're seeing stuff we've missed, files we shouldn't have crawled...?
I was wondering, could you please purge my personal credit card information and my social security out of your recent crawled index data? (just kidding)
On a serious note, I was looking for a tool, or a service, that could generate a visual map/cloud of links pointing to one's website - something like what was described in the social-network-spam-and-author-rank post. Please point me in the right direction if that is available. Google and Bing can generate that map, but I don't think I have access to it.
This may be possible for your engineers to build, either through OpenSiteExplorer or at least via a PRO tool. Such a feature alone would tip me over into becoming a paid PRO member.
Otherwise, congratulations - 43 billion URLs, wow!
Best regards,
Sasha
I asked this question in QA, but I'm curious about how Linkscape handles .html URLs that 302 redirect to .exe files. Those .html files are showing up as top linked to URLs for our site.
If I change these to 301s, what happens with how Linkscape handles these URLs and files?
I can also robots.txt these URLs, but if I do that before you see that they're redirects to .exe files, what happens then?
Our affiliate program has resulted in quite a lot of links to these download.html files, which are all redirects to a URL that prompts a download of an exe.
Example:
https://www.bigfishgames.com/download-games/4346/top-chef/download.html
It's an odd problem, but one result is it's made the top pages report have a lot of noise in it.
Pinged the Linkscape engineering team, Justin - they should have a reply on this soon.
Hey Justin!
Great question! I forwarded this to our engineers and their response is below. They weren't exactly sure of your use case, so if this isn't a sufficient answer, let me know and we'll track down a deeper explanation for you! You can always email me at [email protected] if you want to reach out directly!
"I don't think the Linkscape crawlers treat 301s differently from 302s. We will track the actual status code and we may eventually try to follow the redirect based on its MozRank.
In the case of this particular URL, I checked and it is currently a 302. The redirected URL is an exe like stated, and the content-type is currently being specified as application/octet-stream. I don't believe these URLs should be causing us any problems since the content-type header is being specified correctly. The problem that we have with binary files is when the content-type is not being specified properly. Then we have to rely on the file extension..."
In short, the binary file is not an issue for our crawlers, but the link will be counted as a 301 or a 302 since it is essentially an inbound link.
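If you'd like to double-check what a crawler sees for a URL like this yourself, here's a rough sketch (using Python's requests library purely for illustration) that prints the redirect status codes and the final Content-Type header:

```python
import requests

# Rough sketch: inspect the redirect chain and the final Content-Type
# for a URL like the download.html example above.
url = "https://www.bigfishgames.com/download-games/4346/top-chef/download.html"

resp = requests.head(url, allow_redirects=True, timeout=10)
for hop in resp.history:
    print(hop.status_code, "->", hop.headers.get("Location"))
print("final status:", resp.status_code)
print("final Content-Type:", resp.headers.get("Content-Type"))
```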
I hope this helps, but let me know if you have more questions!
Thanks!!
Carin