As some of you likely noticed, Linkscape's index updated today with fresh data crawled over the past 30 days. Rather than simply provide the usual index update statistics, we thought it would be fun to do some whiteboard diagrams of how we make a Linkscape update happen here at the mozplex. We also felt guilty because our camera ate tonight's WB Friday (but Scott's working hard to get it up for tomorrow morning).
Linkscape, like most of the major web indices, starts with a seed set of trusted sites from which we crawl outwards to build our index. Over time, we've developed more sophisticated methods around crawl selection, but we're quite similar to Google, in that we crawl the web primarily in descending order of (in our case) mozRank importance.
For those keeping track, this index's raw data includes:
- 41,404,250,804 unique URLs/pages
- 86,691,236 unique root domains
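To make the "seed set, then crawl outwards in descending order of importance" idea concrete, here is a minimal sketch of a best-first crawl frontier. It assumes a toy in-memory link graph and made-up importance scores; the `crawl_order` function, the example hostnames and the score values are all hypothetical and are not Linkscape's actual crawler or data.

```python
import heapq

def crawl_order(seeds, links, score, limit):
    """Return up to `limit` URLs in the order a best-first crawler would visit them."""
    # heapq is a min-heap, so scores are negated to pop the most important URL first.
    frontier = [(-score.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        # Newly discovered links join the frontier, ranked by their importance score.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score.get(target, 0.0), target))
    return visited

# Toy link graph and made-up importance scores (hypothetical data).
links = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "d.example"],
}
score = {
    "seed.example": 9.0,
    "a.example": 3.0,
    "b.example": 7.0,
    "c.example": 5.0,
    "d.example": 1.0,
}
order = crawl_order(["seed.example"], links, score, limit=4)
print(order)
```

With a crawl budget of four pages, the low-scoring `d.example` is never fetched, which is the practical effect of ordering a crawl by importance.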
After crawling, we need to build indices on which we can process data, metrics and sort orders for our API to access.
When we started building Linkscape in late 2007, early 2008, we quickly realized that the quantity of data would overwhelm nearly every commercial database on the market. Something massive like Oracle may be able to handle the volume, but at an exorbitant price that a startup like SEOmoz couldn't bear. Thus, we created some unique, internal systems around flat file storage that enable us to hold data, process it and serve it without the financial and engineering burdens of a full database application.
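The post does not describe the internal storage format, so the following is only a generic illustration of why flat files can stand in for a database at this scale: fixed-width records sorted by key can be searched with a handful of disk seeks and no database server. The record layout (`RECORD`), the helper names `write_records` and `lookup`, and the sample values are assumptions made for the sketch.

```python
import os
import struct
import tempfile

# Hypothetical record layout: 8-byte unsigned URL id + 4-byte float score = 12 bytes.
RECORD = struct.Struct("<Qf")

def write_records(path, rows):
    """Write (url_id, score) rows, sorted by url_id, as fixed-width binary records."""
    with open(path, "wb") as f:
        for url_id, value in sorted(rows):
            f.write(RECORD.pack(url_id, value))

def lookup(path, url_id):
    """Binary-search the sorted file for url_id; return its score, or None if absent."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)              # jump to the end to learn the file size
        lo, hi = 0, f.tell() // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)       # fixed width means record i sits at i * size
            key, value = RECORD.unpack(f.read(RECORD.size))
            if key == url_id:
                return value
            if key < url_id:
                lo = mid + 1
            else:
                hi = mid
    return None

path = os.path.join(tempfile.mkdtemp(), "metrics.dat")
write_records(path, [(42, 5.5), (7, 3.25), (1000, 0.5)])
print(lookup(path, 42))
print(lookup(path, 8))
```

Because every record has the same width, a lookup in a file of N records needs about log2(N) seeks, and the file can be regenerated wholesale on each index update instead of being modified in place.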
Our next step, once the index is in place, is to calculate our key metrics as well as tabulate the standard sort orders for the API.
Algorithms like PageRank (and mozRank) are iterative and require a tremendous amount of processing power to compute. We're able to do this in the cloud, scaling up our need for number-crunching, mozRank-calculating goodness for about a week out of every month, but we're pretty convinced that in Google's early days, this was likely a big barrier (and may even have been a big part of the reason the "GoogleDance" only happened once every 30 days).
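To show what "iterative" means here, below is a minimal sketch of textbook PageRank computed by power iteration. It is not the mozRank formula, which this post does not spell out; the damping factor of 0.85, the convergence tolerance and the four-page toy graph are conventional or made-up values chosen for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank over a dict mapping each page to its list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its damped rank to each page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:   # stop once the scores have settled
            break
    return rank

# Toy graph (hypothetical): "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each pass touches every link in the graph, and many passes are needed before the scores settle, so the cost grows with both the size of the index and the number of iterations. That is why this step benefits from renting a lot of compute for a short burst.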
After processing, we're ready to push our data out into the SEOmoz API, where it can power our tools and those of our many partners, friends and community members.
The API currently serves more than 2 million requests for data each day (and an average request pulls ~10 metrics/pieces of data about a web page or site). That's a lot, but our goal is to more than triple that quantity by 2011, at which point we'll be closer to the request numbers going into a service like Yahoo! Site Explorer.
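A quick back-of-the-envelope calculation puts those figures in perspective. It treats "more than 2 million" as exactly 2 million and "~10 metrics" as exactly 10, so the results are rough lower bounds rather than measured numbers.

```python
requests_per_day = 2_000_000   # "more than 2 million requests" per day, used as a floor
metrics_per_request = 10       # "~10 metrics/pieces of data" per request

data_points_per_day = requests_per_day * metrics_per_request
avg_requests_per_second = requests_per_day / 86_400   # 86,400 seconds in a day
tripled_requests_per_day = 3 * requests_per_day

print(f"{data_points_per_day:,} data points served per day")
print(f"{avg_requests_per_second:.1f} requests per second on average")
print(f"{tripled_requests_per_day:,} requests per day once volume triples")
```

That works out to roughly 20 million data points a day today, and a tripling goal of more than 6 million requests a day.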
The SEOmoz API currently powers some very cool stuff:
- Open Site Explorer - my personal favorite way to get link information
- The mozBar - the SERPs overlay, analyze page feature and the link metrics displayed directly in the bar all come from the API
- Classic Linkscape - we're on our way to transitioning all of the features and functionality in Linkscape over to OSE, but in the meantime, PRO members can get access to many more granular metrics through these reports
- Dozens of External Applications - things like Carter Cole's Google Chrome toolbar, several tools from Virante's suite, Website Grader and lots more (we have an application gallery coming soon)
Each month, we repeat this process, learning big and small lessons along the way. We've gotten tremendously more consistent, redundant and error/problem free in 2010 so far, and our next big goal is to dramatically increase the depth of our crawl into those dark crevices of the web as well as ramping up the value and accuracy of our metrics.
We look forward to your feedback around this latest index update and any of the tools powered by Linkscape. Have a great Memorial Day Weekend!
Those are some incredible numbers, billions - wow. It keeps on amazing me how small the % of nofollow links is while I come across em so often.
The numbers are amazing.
There are as many unique pages in the index as there are stars in a small galaxy. A mind-blowing figure.
Honestly, it is shocking!!
93% of MY links are no-follow!! I honestly need to rectify that!!
Is mozbot a 13 legged cat? ;o)
You mean Cat Bus?
Why does that exist? Who thought: "You know what we need? We need a kitten turned into a form of mass-transportation."?
Twisted!
Looks a bit like a 13 legged french poodle to me haha
Linkscape and SeoMoz Toolbar were the inspiration for SEO Site Tools (now at 16,733 users) and a big helper in its success... it used 24,636,240 rows from Linkscape last month so thanks for the great data and all the wonderful work yall do
you're awesome
@CarterCole
after this comment installs jumped to over 18.5k users :)
UR SPIDUR CARTOON ROX!!11
I love Open Site Explorer (hey, I even love the GRAPHICS from it). My question is: does it provide exactly the same amount of links that Linkscape does? It is time consuming to run both for the same job. I will be very happy when they will become one.
Things I'd love to see added to the tool:
1. PR of the linking pages (I always have to keep the PR tool checker in a separate tab). Yes, PR is... old-fashioned... but still...
2. Linking domains to show not only the domain root but also the most juicy linking page from each domain.
I was expecting the index update in June so this was a nice surprise.
There was a lot of debate about LinkScape when it was born in 2007, according to the investigation I performed when I first joined SEOmoz, and I honestly don't want to get into it...
LinkScape is there, it's an awesome tool, and I make use of it everyday. No need to say I don't want to tell anything wrong about it.
But I'm curious... From what I read, it's almost impossible for LinkScape to have a crawler process that works like a standard SE. SEOmoz may not be a startup anymore, but the time and resources required to fetch 41 billion URLs are tremendous. Also, if I check my server logs, I can't find any UA related to the SEOmoz crawler...
Does it mean the LinkScape index is based on another index (such as Google, Yahoo, etc.), or that SEOmoz has a stealth crawler process w/o a defined UA?
Again, I don't want to debate the fact a crawler needs a UA and blah blah blah! :) Being honest, I don't really care about it. And I do want my websites to be indexed by LinkScape! I'm just being curious!
Thanks for your time!
We actually talked at the board meeting yesterday about being more transparent with a new version of our crawler. That should be coming out later this summer (depending on internal schedules around other things). Crawling started out as a big problem to overcome, but we've gotten pretty good at it, actually, and today, most of our scale issues come more from processing and the delays that causes with freshness, etc.
If you know anyone - we are hiring experienced engineers to work on making Linkscape bigger/better/more kickbutt, so please send them our way!
Thanks for your reply Rand.
I know it's not an easy question to answer from a company's point of view.
As an experienced bS software developer, I would be more than interested in working with SEOmoz to make Linkscape better. Unfortunately, I'm French Canadian... So I guess I'll just tell my american friends! :)
I was freaked by the update. Last time I had 1140 links, now I have 344 for my site rbgrant.co.uk. My page authority, mozrank and domain authority have been badly affected as a result. What am I to think now?
If you could shoot an email over to sitesupport at seomoz.org with details on the site, we'd be happy to take a look. It could be that A) some of those have disappeared B) we didn't crawl the pages that linked to you this round C) We previously overinflated numbers due to crawling duplicates (that we've now fixed).
Hi, I have a question... a little off topic but still about the linkscape data. I've been trying to get the SEOmoz API for the link data to work for sometime now. I'd really like to be able to use this new updated data with that.
But I always get the error message "unauthorized api..." - even when I formulate a correct call to the API. I even get this from the example request on this page: https://apiwiki.seomoz.org/Links-API
I've also submitted a few support tickets about this but never got a response, which is why I'm posting here. Hopefully someone can help with this.
Thanks in advance for your help! And have a great Memorial Day Weekend!
Yikes! I'm really sorry we haven't gotten back to you on support requests. If you shoot me an email (rand at seomoz.org) I'll forward it around and see what's going on.
Thanks Rand!
I just sent you an email.
I read this and I get excited for what else you can do with this technology infrastructure you've developed. You referred to the early days of Google. Hmm...
And the VC's aren't pounding down your door? Where do I sign up?
And when you go public, do pro members get first dibbies? hehe
Thank you for the update. Just in time for the weekend.
P.S. How high is that Whiteboard hung? :)
Have a great Memorial Day!!!
Next week's WBF will be animated using a flipbook! Actually that would be a fun thing to put on the lower corner of the pro guides :-)
Better idea! Make the flip books for a different day. Instead of Whiteboard Friday (WBF), make a FBW (flipbook Wednesday)!
Wow! When that makes me laugh out loud, I know I've worked too hard this week! Happy Memorial Day Weekend!
great doodles, thanks
Cheap hardware on cloud ftw. I take it you're coming to the point where you need your own file system like Google did. Where to put all the shards heh. Definitely time to get in some software engineers.
I am loving the data retrieved from both Linkscape and Open Site Explorer.
I use OSE more regularly since it's quicker and clearer to use and absorb the data but the more I learn the more I appreciate the full capacity of Linkscape.
The Mozbar is an additional tool that has proven momentous in my work.
After looking at the "newest" update, I was curious to see the impact of a change I made almost a month ago, but alas, it wasn't there. What is the current time lag between crawl, process, and post?
The index takes about 15-20 days to crawl and another 6-8 of processing, followed by the push, which takes actual man hours here (we're working to automate it) that can run to 2-3 days. Basically, what you see in this update should have been crawled between the end of April to early/mid May.
Wow, is this tool a huge expense for SEOmoz?
Rand,
Thanks for the reply. Are you keeping previous indexes and is there a plan (resources) to integrate historical data into future tools?
Rand,
I don't actually use the Seomoz tools. I use some other paid tools. Nevertheless, I am a frequent visitor of Seomoz, and a handful of other I.M. sites. The info here is always first rate, and I know I can take it to the bank.
Anyway, I just wanted to say a thanks for your attitude with regard to sharing information with the rest of us. It's quite admirable in my opinion. And, this post serves as a prime example of that willingness to share.
Thank you,
Doug
As SEOs we should be a bit more observant when the clues are before us. Either Rand shrank or that whiteboard is 2.5 feet higher than normal. Hint: look at the whiteboard tray. It's pit level.
Great post. Now I'm wondering how you duct taped a mainframe style database together to handle the flat files?
Anyhoo ... I feel a strange duality occurring in my SEO career. The deeper I get into the "everything" that matters in SEO (and the more excited I get about it all) the more I see that no one really cares about the man behind the curtain. My time is often spent trying to explain why a gray hat or black hat effort should be avoided. The rest of my time is spent empathizing with my client that such and such aspect of search is stupid or doesn't make sense ... and oh, yeah ... you still need to address the concern and pay for it. Oh, and that's only the on-page stuff.
C'mon Jonnie, pull up a stump and let me tell you a story about the days of the wild linkgraph.
Well done Rank.
Nice to see how everything works from the backstage. I use Carter Cole's Google Chrome toolbar and I definitely recommend it
Powers WebsiteGrader? No way, I looooove that tool - it all leads back to SEOMoz eh? :)
Also count me in for loving Whiteboard Fridays, always enjoy them.
Every road leads to SEOmoz ... ops, Rome... and Rand is its emperor ;)
Well, not the whole thing, just our metrics that are included in the service :-)
And yeah, I'm also a fan of the grader sites. Dharmesh has done a great job with those. The new LinkedIn grader is pretty slick, too (though in early alpha/beta).
Yep The website grader by hubspot is pretty neat
Really great news for me, as finally the new sites we put up are indexed and return results in linkscape, I was getting a bit worried.
Thanks
Hi Rand,
We've not noticed the change. We logged in today and it still says Linkscape has not updated since our last report, even though we know Yahoo Site Explorer is showing many more links.
Pretty great behind-the-scenes look at LinkScape, very interesting!
I have this question. It looks like Yahoo Site Explorer may die soon. I feel it would be a shame for them to drop it but they just might. Well, that would mean that your OSE will be the most likely candidate to fill that vacuum.
It seems to me that what stands between Linkscape/OSE as they are now and something of the magnitude of YSE is the brute force of crawling and processing and data storage. Yahoo can find and process a lot more links right?
So are you guys working on catching up with this scale? I am curious what your thinking is. OSE seems to have everything that a link analysis tool of this generation must have, except the number of links. What are your projections in that department?
I hope OSE just takes over link analysis completely!
I missed the classic tune of the WBF intro, but the solution you found with this cartoon version is great... and somehow a "static" post was the best way to explain the hard job behind the Linkscape crawl.
The Linkscape data and all the tools developed around it are a real necessity for my job, and I sometimes wonder how I could work without them.
Converging the classic Linkscape reports into OSE is a great idea. Just wondering if you are working on the possibility of collecting historical data (links lost or now discarded from crawling, graphically highlighting new links).
It would also be interesting to merge the Linkscape visualization and comparison tool into OSE, as this offers the opportunity to compare two different sites.
Finally... we all know SEOmoz is going to present something really great in the next few months... could you give us a bit of a preview here to calm our curiosity for a while?
haha, might go watch the first 5 seconds of last week's WBF - right to when Rand says "Howdy SEOMoz fans" :) - then come back to this post.
The new whiteboard looks HUGE! Rand will need a stool just to reach to the top - or he could sit on Will's shoulders :P
Great to see "behind the curtain".
I'm so impressed that you share this kind of data with everyone. Keep up the great work!
Pretty cool seeing how everything works behind the scenes!
I'm guessing with the huge new office you have quite a few more whiteboards!
Wow Rand. That is some whiteboard! Scott is going to have to set his camera up all the way across the room to be able to get the whole board in the shot!
Always like hearing about the behind the scenes stuff with your tools. And like those above, I look forward to Linkscape and OSE becoming one, although I'm curious as to what will be available publicly vs. what will be available for Pro only.
My favorite picture of all was the little stick guy pushing out the new index in step #4.
so any plans to launch a search engine to monetize all that data?
:)
Yay linkscape update! Now I have to go re-run everything. Cya in a week.
Just curious - do you guys run your own server farm or rent from companies like Amazon etc?
We run lots of our own stuff, but do processing and storage using cloud computing because it scales so nicely.
Hey Rand, if you need CPU cycles, why not ask your user base? There are ways to do that, Boinc being my favourite (I'm still looking for aliens). You could give us MozPoints for doing some of the work ;)
You're not the only one still looking for aliens... I'm giving my part of CPU to Boinc too since... mmm ... almost 6 years.
I always look forward to the video on Fridays...but this post was great as is. Now I get to see this video on Saturday ;~) In that first pic of you Rand, did they put that whiteboard WAY TOO HIGH ON THE WALL? or are you on your knees slaving away for all the mozzers......
tony ;~)
Very cool. Sounds like a lot of creativity went into the flat file storage method that enabled this system to exist without Oracle. Glad to see the SEOmoz team isn't easily hindered by obstacles.
Best,