I was notified by Michael at Social Patterns that Alexa has opened up their web index to automated tool builders and users of all kinds. There's a lot of great detail on their site, but the general gist is:
Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service...
Size: Each Web crawl consists of 100 Terabytes of Web content spanning 4 billion pages and 8 million sites.
Frequency: In addition to daily crawls of popular Web content, Alexa crawls a broad cross-section of the Web in 2-month snapshots.
Example: Edward wants to create a collection of JPEG images. He utilizes Alexa's Advanced Search to create a private search collection that locates all documents in the archive with a MIME type of jpeg, response code 200, url extension of jpg and a size of 64k or larger.
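The filter in the Edward example is essentially a predicate over per-document crawl metadata. As a rough sketch of that logic (the record field names here are my own assumptions; the post doesn't show Alexa's actual schema or API):

```python
# Hypothetical sketch of the filter from the Edward example.
# Assumes each crawl record is a dict with mime_type, status, url,
# and size fields -- Alexa's real Advanced Search interface is not shown here.

def matches_jpeg_filter(record):
    """True if a record matches Edward's criteria: MIME type jpeg,
    response code 200, .jpg extension, size 64 KB or larger."""
    return (
        "jpeg" in record["mime_type"]
        and record["status"] == 200
        and record["url"].lower().endswith(".jpg")
        and record["size"] >= 64 * 1024
    )

# Toy records standing in for crawl output.
records = [
    {"mime_type": "image/jpeg", "status": 200,
     "url": "http://example.com/photo.jpg", "size": 120_000},
    {"mime_type": "text/html", "status": 200,
     "url": "http://example.com/index.html", "size": 5_000},
    {"mime_type": "image/jpeg", "status": 404,
     "url": "http://example.com/missing.jpg", "size": 80_000},
]

hits = [r for r in records if matches_jpeg_filter(r)]
```

Only the first toy record passes all four tests; the point is just that each criterion is a cheap metadata check, which is what makes running it across a 100-terabyte crawl plausible.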
It sounds like an incredible service, though I note that it's not free. Michael's write-up has more details. I wish I had time to dig into it over the upcoming weeks and months, but, sadly, we've got a fairly full schedule. I did apply for an account, so if some spare developer time turns up I'll ask someone to look into it.
It's a good move by Alexa, but the thing to be concerned about is that opening up an index of this size will invite all sorts of spammy folks. That's bad. On the positive side, it means fewer bots eating up your bandwidth, since everyone would go to Alexa for scraping data.
Michael - it sounds like they're just storing content that is publicly available online. If you want to sue them for that, you should also sue Google for indexing/caching your pages, and web surfers for keeping cached copies in their browsers.