Those of you who've been running Linkscape reports this weekend may have noticed our new, fresher data in the index. This is our second update (the first was on Dec. 8th of 2008) and it's brought with it a refresh of much of the crawl we originally launched with in October. This update is seriously good news for us, as it's taken far less time, energy and frustration than our first update, and it's good news for our members (and anyone who uses the free data) as much of the web's changes in Q4 2008 are now included in the index. We're working towards updates every 4-5 weeks, but it may take several more updates before we reach that level of freshness.
Some important & interesting statistics from this crawl:
- URLs: 36,330,454,654
- Subdomains: 225,767,675
- Root Domains: 46,952,859
- Links: 410,360,303,763
Distributions of External Links from this Index:
- 50%: 0
- 95%: 1
- 99%: 8
- 99.5%: 18
- 99.9%: 108
- 99.99%: 1854
- 99.9999%: 104,828
Distributions of Linking Domains to a Domain:
- 50%: 0
- 95%: 1
- 99%: 4
- 99.5%: 8
- 99.9%: 26
- 99.99%: 145
- 99.9999%: 9713
Distribution of External Links to a Root Domain:
- 50%: 2
- 95%: 80
- 99%: 558
- 99.5%: 1,266
- 99.9%: 9,138
- 99.99%: 94,999
- 99.9999%: 12,286,546
It's crazy to realize that if your page has 100 external links to it, you're in the top .1% of all pages on the web. Likewise, if you have just 25 external domains with links to your site, you're in the top .1% of all websites. On the entire web, the vast majority of websites never get 100 links pointing to them!
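For anyone who wants to place their own numbers against these distributions, here is a rough sketch of a lookup. The tables are copied from the lists above; the function name `percentile_floor` is made up for this illustration, and the result is only the highest published cut-off a count reaches, not an exact percentile.

```python
from bisect import bisect_right

# (percentile, cut-off) pairs copied from the distributions listed above.
EXTERNAL_LINKS_TO_PAGE = [
    (50.0, 0), (95.0, 1), (99.0, 8), (99.5, 18),
    (99.9, 108), (99.99, 1854), (99.9999, 104828),
]
LINKING_DOMAINS_TO_DOMAIN = [
    (50.0, 0), (95.0, 1), (99.0, 4), (99.5, 8),
    (99.9, 26), (99.99, 145), (99.9999, 9713),
]


def percentile_floor(table, count):
    """Return the highest published percentile whose cut-off is <= count.

    Returns None when count is below every cut-off in the table.
    """
    thresholds = [cutoff for _, cutoff in table]
    index = bisect_right(thresholds, count)
    return table[index - 1][0] if index else None


if __name__ == "__main__":
    print(percentile_floor(EXTERNAL_LINKS_TO_PAGE, 100))
    print(percentile_floor(LINKING_DOMAINS_TO_DOMAIN, 25))
```

Strictly by the table, 100 links or 25 linking domains land just under the 99.9% cut-offs (108 and 26), so the "top .1%" figures are rounded.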
My takeaway? The Internet is really, really, really big (also, remember that Linkscape's index is probably about 1/2 to 1/3 the size of major indices like Google & Yahoo!, so these numbers don't hold true for all data sources, just for metrics from our index). We also like to be domain diverse with our index, so Linkscape shows a lot of links between domains but is still shallow for large sites and deep internal pages (which could bias these numbers).
Sharp readers will note that our number of pages has actually slightly decreased since the last update - since this was a "refresh" update, that was actually anticipated. We found about 10% of pages that we had crawled are now issuing some type of error code - 404s, 500s, etc. Thus, if you've always wondered how much of the web decayed over the course of 3 months, this might be a pretty good statistic to scratch that itch.
In terms of upgrades to the interface and the Linkscape tool, you've probably noticed the new basic report format:
And the new data detail tab (which, if you're a serious SEO or want the juicy KPIs from the tool, is the place to look):
If you've got feedback on either of these, please do leave a comment or send email to Adam _at_ SEOmoz.org. We're also looking at upgrading the basic comparison reports and the advanced reports in the near future, so ideas are always welcome on those, too.
And now on to something that I'm very excited to announce (and which I think you'll find fascinating)... Our top 250 domains list! Nick pulled out the top ranked root domains (by Domain mozRank) and some stats about them and we've created a page on SEOmoz that we'll update with each index showing some of the web's most powerful/important domains from a link perspective. Remember, Domain mozRank is like Google's PageRank - it's an iterative, Markov-chain algorithm, so this isn't just raw numbers, it's based on the importance of who links to you as well.
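To make "iterative, Markov-chain algorithm" a bit more concrete, here is a minimal sketch of a textbook PageRank-style power iteration over a tiny link graph. This is not Linkscape's actual Domain mozRank computation; the damping factor, iteration count and the toy graph are assumptions chosen purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank-style power iteration (illustrative only).

    links maps each node to the list of nodes it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across its outlinks.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outlinks spreads its rank over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: a and b both link to c, and c links back to a.
    toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy_graph).items()):
        print(node, round(score, 3))
```

In the toy graph, `a` and `b` each give `c` one link, but `a` ends up far stronger than `b` because it is linked from the well-linked `c`. That is the "importance of who links to you" effect in miniature.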
Some of the most interesting finds for me:
- Macromedia and Adobe are individually both at the top of the list; Adobe most probably for the PDF download links and Macromedia for the Flash player. Note that we don't wipe out domains via 301 like the search engines do, so we're actually seeing and reporting the links to domains that redirect (and then noting those redirects in the tool and passing the link juice on).
- Miibeian.gov.cn is, I believe, a government website in China that many sites operating in the country are required to link to - hence the high rankings.
- Sedoparking.com was one of the big outliers, but looking through their links, even though the engines probably are not passing link weight, they clearly have a ton of domains (and many important and once-important ones) linking in.
- Statcounter - I always knew they had a lot of links, but was surprised to see how many important domains still use their simple widget for counting traffic (and then providing that link back). It's a testament to being an early player on the web with a simple, useful product.
- Blogspot and Wordpress are no surprise - since this is looking at root domains (aka *.domain.com), every link to every subdomain on those is counting on the domain-level link graph.
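As a rough illustration of what counting at the root-domain (*.domain.com) level means, here is a sketch that rolls link targets up to their root domain. The suffix list is a deliberately tiny, hypothetical stand-in (a real system would need a complete public suffix list), and the function names are made up for this example; it is not how Linkscape itself is implemented.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete set of two-label suffixes.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "gov.uk", "gov.cn"}


def root_domain(url):
    """Reduce a URL to its root domain, e.g. foo.blogspot.com -> blogspot.com."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def links_per_root_domain(target_urls):
    """Count link targets per root domain, lumping all subdomains together."""
    return Counter(root_domain(url) for url in target_urls)


if __name__ == "__main__":
    # Hypothetical link targets, for illustration only.
    targets = [
        "http://alice.blogspot.com/post-1",
        "http://bob.blogspot.com/",
        "http://www.bbc.co.uk/news",
    ]
    print(links_per_root_domain(targets))
```

Every link to any subdomain lands in the same bucket, which is why hosts with millions of user subdomains float to the top of a root-domain list.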
Digging through the full top 250 is certainly interesting, both as a web data aficionado and as an SEO. Seeing how some of the top domains have earned their links and realizing the strategies and brands that rise to the top could probably inspire another dozen or two blog posts. In the future, I'd like to try to do this for subdomains and pages as well - to see what specific content is garnering the web's links. Hopefully we can have that out in our next index update (or the one following).
To wrap things up, I just want to say publicly how proud I am of the work Nick & Ben have done, along with the contributions from Jeff and the entire dev and front-end team. While Linkscape is still in beta, and probably has a few more months at that status, the depth and breadth of the project, the engineering brilliance and the value it's brought to SEO still amazes me. In fact, Jeff and I just got this email today to help remind us that the long nights and weekends at the computer are worth it:
Hey you guys, I'm finally using Linkscape for one of my client projects, three months on :) First time I have used it since the day after you launched; just been too busy with design projects & other stuff. It just totally rocks. I continue to be blown away by how easy it is to use, and how helpful the data is. Particularly the competitive data and all of the sorting features. I couldn't help myself from emailing you about it :)
Now it's back to work for the next updates, which should bring both fresher data and a much larger index as well.
p.s. We love feedback, so please do share!
Looking at the data in more detail, I was intrigued by three domains: kqzyfj.com, anrdoezrs.net and tkqlhce.com. They all have a domain mozRank above 8, which is quite impressive for such nonsensically named domains.
So I checked their Whois info and they're all owned by Commission Junction, a "global leader in the online advertising channels of affiliate marketing and managed search" from ValueClick Group. Digging more I found a thread on a forum where a former employee explains that these "random domains are there more or less just to stay away from adblocks and/or used for different tracking". I run AdBlock on Firefox and they indeed aren't blocked -- yet! ;)
So to put it in a nutshell, these are affiliate links. Obviously, Google returns no archived page for these domains. And as affiliate links they redirect to other websites.
All this raises a couple of questions:
- how can they have such mozRank/mozTrust? A quick glance at Yahoo Site Explorer for tkqlhce.com shows no backlinks from any notable website.
- do you think they pass value?
- I don't know much about affiliate companies, but there might be others, some bigger than CJ. Have their domains been filtered out of the top 250, or do they appear too?
Hope someone reads my comment despite its length...
PS: edited to make paragraphs
No worries, I read your comment Keonda - I think these are valid questions. Whilst I love Linkscape, when you've got a tool this big looking at this much data, a top 250 table will highlight some interesting issues (such as mine below), and yours, I think, raises an interesting point about how Google etc. look at these "junk/temp" domains.
Just as an odd addendum - the Top 250 Domains page uses a Google Gadget plug-in that shows the chart of the domains and lets you re-sort and play with the data, but it doesn't seem to load for me in Firefox. It does, however, work fine in both Internet Explorer and Opera, so if you're having trouble, you might want to try those browsers.
I'll hold my hand up and admit that I didn't use Linkscape too much to begin with, but as time passes and the updates roll on, I am finding the data more and more useful.
On its own, the top domains list contains most of the ones I would expect to be there. It will be nice to see this charted over time to see what direction the web is heading in. Maybe you could turn it into a rank, showing who has risen in demand, and who has fallen?
Works for me (Kubuntu 8.10-ish) on Opera 9.52, but it times out on Konqueror 4.1.96RC1 and keeps attempting to load without completing on FF 3.0.5.
If I grab the table's link and just put that up then I can load it instantly in FF, just won't load in the page. XS-permissions problem?
HTH
Thanks a lot! You ROCK! I've been using the Linkscape tool while working with my clients to show how hard it can be to get positions in the SERPs for a desired keyword.
@rand,
I have a question about your SEOmoz.org URL structure optimization. Why haven't you solved the problem with the trailing slash?
For now you have:
https://www.seomoz.org/blog/linkscape-update-2-now-live/
and
https://www.seomoz.org/blog/linkscape-update-2-now-live
as different URL addresses.
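For readers unfamiliar with the issue: the usual fix is to pick one form and 301 redirect the other to it. Here is a minimal sketch of the normalization step only, assuming the no-trailing-slash form is the one being kept. The function name is made up for this example, and this says nothing about how SEOmoz.org actually handles its URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url):
    """Return url with trailing slashes removed from the path.

    The bare root path "/" is left alone so the homepage stays valid.
    """
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    with_slash = "https://www.seomoz.org/blog/linkscape-update-2-now-live/"
    without_slash = "https://www.seomoz.org/blog/linkscape-update-2-now-live"
    # Both forms collapse to the same address once normalized.
    print(strip_trailing_slash(with_slash) == strip_trailing_slash(without_slash))
```

A server would compare the requested URL with its normalized form and redirect when they differ, so that links to either version are credited to a single address.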
Another interesting one is the inclusion of uk.com -
You have 519,747 links, whereas Google's link command gives it 13 links - a pretty big margin of error.
So Linkscape has it as a major site - is this correct, or a mistake on Linkscape's part?
After all, if it was so powerful, would it not rank for "uk"? (I know it doesn't)
I'm not sure what causes this issue, but I suspect that uk.com isn't that powerful.
Thanks Robbothan!
Regarding your query, uk.com has a bunch of subdomains, from christmascards to online-guitar-lesson. This is where the links go (5.8 million inlinks according to Yahoo).
Ok - that makes sense - does that actually help any of them?
Ie: does it help https://www.uk.com
or anyone hosting ie www.diningtables.uk.com/
This is of interest to me because we control some ISPs where we host a few million people. Each of those users tends to operate in a different manner, as do we with our main site, so I don't see why we would appear in a top 1000 domains list.
I think it's a discrepancy caused by the fact that it's a pseudo top-level domain. You can buy example.uk.com instead of example.co.uk but it's not a "true" TLD - it's just selling sub-domains of uk.com. The domain weight is almost certainly not there (in the same way that links to *.co.uk don't add weight to other .co.uk domains) - and also like *.blogspot.com or *.wordpress.com. These have to be there when you look at the domain graph, but they don't necessarily inherit domain weight when there is no cross-linking between sub-domains.
ha took me a sec to understand! Makes sense tho - thanks for clarifying.
Ahhh! You know, I've never considered that! thanks Will
@Will: in the end, I think you are correct. Just because you have a blog on blogspot.com doesn't mean you will rank as well as www.blogspot.com for the same keywords.
However (and I love to be a contrarian), this may just be because the very high blogspot.com authority is diluted among everyone and his dog that has a blog on blogspot.com. QED :P
So, I think we're both right ;)
I'll see if I can't get the number of subdomains listed in the spreadsheet (at least next time around) and how the root domain mozRank is diluted amongst them.
Note to self : Read to the bottom of comments BEFORE responding to one. Otherwise, you might just repeat what someone else has said. Oops!
Google lists 1.2 million pages on UK.com. That sounds quite important. They're just a way to get around domain naming problems - I'm surprised that the main site www.uk.com is not something more interesting, like a "domain" selling page; it's just a standard squatting page (CC, broadband, etc.).
I think this was answered correctly already, so let me give the official seal of approval:
This list is a *.example.com type list. So when you see uk.com, it's really considering links to any page on any subdomain. It doesn't matter that there isn't any page at https://uk.com/. I don't think you can get this kind of info with a single query from Y!SE or GG.
We believe (and have some very strong experience and statistical evidence which I HOPE we'll be able to release sometime soon) that the organization a page is associated with (regardless of that page's stats) is VERY important to ranking.
Basically: a crappy page (with good keyword targeting) on strong domain (subdomain or root) is very rank worthy.
I think smart SEOs look at PageRank of the homepage of a domain to approximate this. We have Domain mozRank which we think is a better metric for this purpose.
hi nick - So you're saying that the ISP or hosting company you go with (for a personal site mainly) is pretty key - i.e., if you're on .wordpress.com then your SEO benefits versus an unknown blogging platform.
That makes a huge impact on ISP's that have hosting, such as us, as it's a selling point and a potential risk to us.
I'm saying it might be a factor in some cases, yes. And think about it from a user perspective. Perhaps wrongly, I see wordpress or blogspot (minus the spam which I can easily avoid) as VERY public platforms for some high-end blogs. This is as opposed to a blogging platform I've never heard of.
This impact could be an opportunity for you. If you're offering a hosted blogging platform, you could try and have that fill a niche, or build general credibility (and rankability) for that platform. This is likely as opposed to competitors that aren't doing those things. This seems like a competitive advantage.
@rand, just a thought for the next toolbar update: can you include something like Alexa ranking using this kind of data? Something cool like a little rank number..
Even if it is limited, being able to use multiple sources such as Google, PR, Alexa, Compete, Y! directory... to check the relative strength of competitors and potential links would be awesome!
I agree with Mintyman something showing how it has risen or fallen similar to your analytics tool.
congrats to your web dev team!
Hi Rand,
I'm a big fan of Linkscape. Thanks for the update notes. I have a question regarding the % of web sites with 100+ inbound links. Your .1% is quite a bit lower than I would've thought. Is there a way to find out the % of websites with 100+ links at different levels and verticals, so we can see the breakdown of the market? For example, could your data be broken down into, say, Fortune 1000 companies, mid-size companies, and Mom & Pop businesses? Or perhaps broken out by vertical like retail, travel, etc.?
Not to create a ton of extra work for you, but it would be pretty cool to know how saturated our relative markets are. I'm not sure how accurate my assumption is here, but my guess is that 100+ inbound links is a good gauge of whether or not they are either attempting to publish content worthy of links and/or have an in-house marketer with a passing interest in SEO. I think it would also be valuable to see how many web sites have 1,000+ inbound links as that (at least from a gut-check perspective) seems like the level at which the company either has an in-house SEO or has hired one at some point.
These link levels are up for debate, of course. However, I think that having some sort of line in the sand for various company sizes, verticals, and geographic areas would provide very useful data to determine market need and potential saturation in certain areas.
I look forward to more good data. Keep it coming!
You're thinking along the same lines as we are. We are trying to break this data down. And it's good to hear feedback that these kinds of aggregate stats are valuable to you.
As far as absolute link counts and comparing them to your own experience, keep in mind that our index is a (large) sample of the web. So our numbers might be anywhere from 1/2 to 1/10 the size you're going to see from Y!SE, which will already vary considerably from what you see at GG.
Needs to have PDF reports for report-obsessed management imo. Then I can say hey look here, before and after in a simple to digest format.
I think the bandwidth cost is expensive, but it is worth paying.
i am becoming a fan of linkscape..
Thanks Rand,
The more time goes by, the more I find myself using Linkscape first to give me an initial overview of whatever site I am looking at.
At times it is a real time saver and I like the many search options on there.
Just wish there wasn't a cap on how many advanced searches we can run per month :P
I'd like to suggest the ability to name saved reports, not sure if this is possible.
Thanks!
Anyone,
Linkscape is great! Superb job!
I don't get one thing though. What is the difference between 'sub-domain' and 'root domain' in the drop down right next to the 'run advanced report' button? And why is 'subdomain' pre-selected when the domain doesn't even use subdomains? Is this as simple as a bug?
Seomozers,
You do a good job explaining how everything works. It may make sense adding a quick description of what that drop down does.
Cheers!
What's the bandwidth monetary cost of your web crawling thus far, if it's not a secret? Bandwidth for web indexing on this scale seems to add up to a great cost.
It is expensive, yes :) Not sure if I can say much more than that. But it is lots of cash, and lots of human energy. I saw a picture of myself from a year ago and I have grown many deep lines in my face :(
My big task for the last couple of months has been efficiency. With the downturn in the economy the higher-ups around here aren't so cool with costs doubling every month :P
"Note that we don't wipe out domains via 301 like the search engines do"
Can anyone clarify this comment please?
Cheers,
Steve
Managed to view the list using the direct link to the google doc posted above. I'm intrigued that there are only 5 .co.uk domains in the list (bbc.co.uk, google.co.uk, guardian.co.uk, amazon.co.uk, timesonline.co.uk) and no .ac.uk or .gov.uk sites... I don't know if I'm *surprised* exactly, but intrigued, definitely. I know one of your goals is increasing international depth, so this may just be a feature of that...
Thanks for the update Rand. I use Linkscape for some research but I have been saddened that many domains I investigate are not in the index. I hope that has changed.
This was a "refresh" update which means we didn't add terribly many new domains or pages, just refreshed data that was already there. Rand mentioned that there was about 10% rot in our data (which is a nice metric to have IMHO).
Internally we have two tiers of data. The first tier are powerful, popular pages and domains (read: high mR). The second tier is stuff closely related, but potentially unpopular (read: low mR but having at least one link from a high mR page). We'll give you data about all of it.
This time around we just refreshed the first tier while expanding the second. It's an experiment :P
Next time around might not end up with a much bigger index--thanks for setting expectations, Rand >:( -- but we feel pretty good that it'll be a better mix in that first tier.
Oh and here's a rule of thumb for inclusion in that first tier:
* must have been around for at least 2-3 months, with at least 1 internal link, preferably more.
* must have at least 1 external link which has been around for at least 1-2 months.
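Written out as a quick check, the rule of thumb looks something like the sketch below. The `PageStats` fields and the default thresholds (the low end of each range, 2 months and 1 month) are assumptions for illustration only. Passing the check is a minimum bar and does not guarantee a spot in the first tier, which is also about having a high mR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical per-page (or per-domain) facts needed for the check."""
    age_months: float
    internal_links: int
    # Age of the oldest external link, or None if there are no external links.
    oldest_external_link_age_months: Optional[float]


def meets_first_tier_rule(stats, min_age_months=2.0, min_link_age_months=1.0):
    """Apply the rule of thumb above, using the low end of each range."""
    old_enough = stats.age_months >= min_age_months
    has_internal_link = stats.internal_links >= 1
    has_aged_external_link = (
        stats.oldest_external_link_age_months is not None
        and stats.oldest_external_link_age_months >= min_link_age_months
    )
    return old_enough and has_internal_link and has_aged_external_link


if __name__ == "__main__":
    established = PageStats(age_months=6, internal_links=4,
                            oldest_external_link_age_months=3)
    brand_new = PageStats(age_months=0.5, internal_links=2,
                          oldest_external_link_age_months=None)
    print(meets_first_tier_rule(established), meets_first_tier_rule(brand_new))
```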
Linkscape is definitely a great tool. Great to see that an update has been made. Gotta check out my websites now! :p
-Brenelz