Having access to data and large data sets is something any SEO worth his salt craves. Sure, managing a massive dataset or database can be a bit of a hassle, but having good information is key and there are a handful of uses for other people's/sites' data sets that are readily available for purchase online. Big budget linkbuilding isn't the only way to spend your SEO budget these days!
Let's take a look at five examples of datasets that you can easily and readily purchase and how you might go about using them.
Geocities
The true motivation for this article was a chat with Tom... and the fact that you can now get Geocities (yes, seriously, the whole thing) in the form of a 1 TB torrent (thanks, Hacker News). How much is it going to cost you? Just your email address.
What would you do with Geocities, you ask? The sky's the limit with this one, really! I'm not saying you want to use any of the great tickers or the beautiful layouts and seizure-inducing colours for which Geocities is now famous.
However, for a start, you may very well want to use the huge volume of content, which could quite easily be respun for your own purposes. Or use the epic designs for mapping out your new site. Up to you!
Keyword Datasets
Why pay for keywords? Well, for starters, because sometimes you may find you have a client that has exhausted the entire set of data available through the AdWords API (yes, this has seriously happened before). If the site is strong enough and you find you're still able to rank reasonably for long-tail terms post-MayDay, there's no harm in creating some new content to target the long tail. This isn't to suggest that you should buy keyphrases and not do the research yourself but, as discussed, more data is almost always better than less.
And, most importantly: just because the data isn't in the API doesn't mean there isn't any search volume for it!
Some of the outfits out there selling keywords and keyphrases are:
- SEM Rush (AdWords, Google Words, "hidden" keywords)
- Hitwise also offers a range of products as well as one-off reports that will include a great deal of this information
- WordStream can hook you up with access to a few trillion keyphrases as well - access to their API runs from $300-$2,500 per month depending upon how many units you're looking for (pricing details).
- It's not yet up in its final form, but Rich Baxter is beta-testing a pretty darned good keyword tool you may want to look at.
- Finally, KeywordSpy is something I'd be interested in checking out.
This sort of thing won't come cheap, but it can be extremely valuable to the larger sites.
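As a rough illustration of putting a purchased keyword list to work, here is a minimal sketch that picks out long-tail phrases you are not already targeting. The column names (`keyword`, `volume`), the thresholds, and the sample rows are all assumptions made up for the example; real exports from any of the vendors above will look different.

```python
import csv
import io


def long_tail_candidates(rows, already_targeted, min_words=3, min_volume=10):
    """Return (keyphrase, volume) pairs worth a look, highest volume first."""
    seen = {k.strip().lower() for k in already_targeted}
    picked = []
    for row in rows:
        phrase = row["keyword"].strip().lower()
        volume = int(row["volume"])
        if phrase in seen:
            continue  # already targeted, or a duplicate of an earlier row
        if len(phrase.split()) < min_words or volume < min_volume:
            continue  # too short to be long-tail, or too little volume
        seen.add(phrase)
        picked.append((phrase, volume))
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample standing in for a purchased keyword export.
SAMPLE = """keyword,volume
cheap flights,90000
cheap flights to oslo in march,40
Cheap Flights to Oslo in March,40
last minute flights from leeds,70
flights to nowhere special,5
"""

rows = csv.DictReader(io.StringIO(SAMPLE))
results = long_tail_candidates(rows, already_targeted=["cheap flights"])
for phrase, volume in results:
    print(phrase, volume)
```

For a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample and adjust the column names to match.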
80 Legs Crawl Packages
Some of you may be familiar with using 80 legs as a tool to crawl and scrape your way through the interwebs. It's a tool that I've not spent nearly enough time with as I didn't find it quite as intuitive to use as Mozenda. However, the nice thing about 80 legs is that they have compensated for this a bit by offering packaged-up crawls.
The vast majority of the packages cost $350 per month (with the exception of the eBay Motors crawl at $150/month), though the data you could pull off these is extremely valuable and saves you the trouble of doing any of the crawling yourself (handy if your IP has been banned, you naughty SEOs).
Again, these sets could be used for anything from price-comparison to market analysis and right on down to content creation and keyphrase research. If you're one of the fortunate few working in the space for which these are offered you should definitely have a look.
Twitter Census
The Twitter Census dataset is just one example of the variety of datasets you can buy from InfoChimps, though the general concept of owning one year's worth of URLs, hashtags, and smiley usage seems like it could be used in a number of ways. You could create an infographic worthy of a link from the likes of Mashable, TechCrunch, etc.
Or you could use the data to monitor keyphrase usage, common abbreviations, or any other sort of trend in social interaction (it could be a great source of keyphrases as well, as the search engines begin to take social signals and include them directly in the SERPs). This set is currently priced at $300.
Linkscape
Rand was being a bit coy about this one, and at the time of press I wasn't able to get a serious price out of them, but there's a price for everything, right? Any serious bidders should probably get in touch with the SEOmoz team directly...
Along these lines, there are a number of other datasets that do not have a set price but that I'm sure you could get your hands on with enough money and the right contacts. These would include Backtype API data, Wordtracker, or Amazon's entire product catalogue. Ultimately, anyone with a brain for business and a load of data would sell you their info if you know how to ask for it.
Bonus SWAG
Don't you just love it when you can get your hands on some awesome free stuff that you never knew you wanted in the first place? Well, thankfully, there are a few datasets that I came across that I thought were worth sharing and could give you some value for free.
Feel free to take a gander at these datasets and try to make use of the data! Can you say "infographic ammunition"?
The entire dataset from the New York Stock Exchange from 1970-current (Open, Close, High, Low, Volume).
Massive sets of US Census Data.
And for those of us based over in the UK - huge volumes of UK Government data right at your fingertips.
Other Huge Datasets to get stuck into:
Project Gutenberg, for over 6,000 full books available online. These book lists at the very least could be of interest.
Any number of the Google Labs Datasets, my personal favourite of which is "Broadband Penetration in Europe".
The Freebase data dump, which happens to include 26 GB of the world's information from the likes of Wikipedia, Freebase and a handful of other datasets.
Any number of epic, not-to-be-missed datasets from Elastic Web's Public Datasets. These include Wikipedia, IMDB, Stack Overflow, etc.
Final Pro-Tip
One thing you may have noticed: a byproduct of providing large datasets to people is that they tend to be solid gold for linkbait. We could focus an entire post around this, but if you've got access to great data and you're not offering it out to your users/curious SEOs, what are you thinking?! Publish the data, make it free to download, and require a link back for attribution from anyone who wants to use it. Simples!
I also use Outwit for data harvesting. There is a free (limited features) version available as a Firefox plugin: outwit.com
Outwit on Google image search is super for speedy and decent image harvesting.
Don't forget Google's n-gram set from the Linguistic Data Consortium: https://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
(Note that it's not on their labs page, either)
- Ben
Whoa, that's a massive dataset: "The n-gram counts were generated from approximately 1 trillion word tokens of text..." :-|
It's something like 86 GB, unzipped. Each file is *only* strings of text. Lots of data. It definitely helps to know different databases and database design to make good use of it.
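To give a feel for the "know your databases" point, here is a minimal sketch that streams n-gram lines into SQLite and keeps only phrases containing a seed word. It assumes each line is a phrase, a tab, then a count; the table name, the function names, and the sample lines are hypothetical and not taken from the dataset's documentation.

```python
import sqlite3


def load_ngrams(lines, seed, db):
    """Stream n-gram lines into SQLite, keeping phrases that contain `seed`."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS ngrams (phrase TEXT PRIMARY KEY, count INTEGER)"
    )
    seed = seed.lower()
    for line in lines:
        # Assumed layout: "<phrase>\t<count>"; anything else is skipped.
        phrase, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep or not count.isdigit():
            continue
        if seed in phrase.lower().split():
            db.execute(
                "INSERT OR REPLACE INTO ngrams VALUES (?, ?)", (phrase, int(count))
            )
    db.commit()


def top_phrases(db, limit=10):
    """Return the most frequent stored phrases as (phrase, count) tuples."""
    cur = db.execute(
        "SELECT phrase, count FROM ngrams ORDER BY count DESC, phrase LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


# Made-up sample lines standing in for one of the n-gram files.
sample = [
    "cheap car insurance\t1200\n",
    "car insurance quotes\t3400\n",
    "home insurance quotes\t900\n",
    "broken line without a count\n",
    "classic car shows\t150\n",
]
db = sqlite3.connect(":memory:")
load_ngrams(sample, "car", db)
print(top_phrases(db))
```

Because the loader only iterates over lines, an open file handle can be passed in place of the sample list, so a multi-gigabyte file never has to fit in memory.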
Thanks for the info; project Gutenberg is one of my favourite resource sites
Two of my favorite sources of data are Google Insights for Search (download the data in csv format), and Twapper Keeper.
Nice one Sean!
I hadn't seen Twapper Keeper so that's definitely cool. Can't even say how happy I am having written this post as people have been so great about sharing some of their favourite datasets! So much data to play with it's hard to know even where to start.
Note that freebase.com is from MetaWeb, who got acquired by Google earlier this year.
I have a strong suspicion that Google are using the Freebase data to get closer to having their algorithm "understand" web page content, instead of just picking out keywords.
I blogged about it here: https://bigiain.posterous.com/is-this-the-genesis-of-the-semantic-web
Iain
Hey Iain,
that very topic formed a large-ish part of a presentation at the SEOmoz PRO SEO seminars.
& I agree, for what it's worth.
Thanks Sam, this really is a lot of stuff- extremely interesting, but also easy to get caught up in.
Amazing how much information one can find, I could already make use of some of the data. Cheers!
Check out the dataset tag on delicious too, loads of great stuff there:
https://www.delicious.com/tag/dataset
Thanks to @Ontolo for reminding me about Google's huge n-gram set too. Must get my hands on it soon.
I use www.spyfu.com. This website will tell you the keywords your competitors are using and how much money they are spending on pay per click. Some features are free; others you have to pay for.
Can you post more of this, I want to learn more about it.
Nice details, thanks for sharing this post.
Totally agree with you - mozenda is much easier to use than 80 legs - which is a pity as it's much more expensive!
Wow, great research behind this post!
Hi Sam
Enjoyed your presentation at PROSEO and this is a really useful post, too.
I'd like to suggest an additional source of keyword information, Wordtracker, which also sells keyword reports - https://www.wordtracker.com/reports.html.
Wordtracker has access to a database of more than half a billion searches a year, going back several years. So, it's a good source of recent and historical information. And these are real searches by real people on real search engines.
We can also provide you with data on the competition you'll face for any keyword - so, for example, we can tell you how many competing pages there are (ie, pages with the keyword in the title tag, or in the text of an inbound link).
For more information, just drop me an email justin at wordtracker.com
Hope that's useful.
Justin
Can we get more information on how this data can be used? I have no idea. It seems to me that with such a huge amount of data, finding something useful would be like finding a needle in a haystack.
Agreed. I'm curious if anyone could get anything valuable out of all the Geocities data, because my memories of Geocities sites involve tons of music, glittery/animated graphics and lots of pages saying 'Moved to [domain name]'.
- Jenni
Sure thing. I guess there are a number of ways in which this data could be used and a few different suggestions I might make.
First off there is loads of information you could realistically use to create an infographic or release data on Geocities (number of sites, number of sites about music, number of sites with traffic counters, etc.).
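Counts like these could be gathered with a fairly crude script. The sketch below is only an illustration: the regular expressions are rough guesses at what a hit counter, an "under construction" notice, or a "moved to" page looks like, and the directory layout of the extracted archive is an assumption.

```python
import re
from collections import Counter
from pathlib import Path

# Crude, hypothetical heuristics; expect false positives (e.g. "encounter").
PATTERNS = {
    "hit_counter": re.compile(r"counter", re.I),
    "under_construction": re.compile(r"under\s+construction", re.I),
    "midi_music": re.compile(r"\.mid\b", re.I),
    "moved_notice": re.compile(r"\bmoved\s+to\b", re.I),
}


def tally_page(html, counts):
    """Update `counts` with one page's worth of pattern matches."""
    counts["pages"] += 1
    for name, pattern in PATTERNS.items():
        if pattern.search(html):
            counts[name] += 1


def tally_archive(root):
    """Walk an extracted archive and tally every .htm/.html file under it."""
    counts = Counter()
    for path in Path(root).rglob("*.htm*"):
        if path.is_file():
            # latin-1 can decode any byte, which suits old pages of unknown encoding.
            tally_page(path.read_text(encoding="latin-1"), counts)
    return counts


# Small in-memory demo so the sketch runs without the 1 TB download.
demo = Counter()
tally_page('<img src="underconstruction.gif"> Site under construction!', demo)
tally_page('<embed src="song.mid"> <img src="/cgi-bin/counter.cgi">', demo)
tally_page("We have moved to a new domain", demo)
print(dict(demo))
```

Pointing `tally_archive` at the folder where the torrent was unpacked would give the raw numbers for an infographic, though the patterns would need tuning against real pages first.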
Alternatively, and this is less useful from a pure SEO perspective, you could relaunch the whole thing (as someone is already doing).
You could use the handwritten content to contribute to articles or to flesh out some pages.
It's not necessarily going to work for everyone and it may take some time to find the quality stuff, but it's certainly worth a look for sites that are struggling for new content.
I think the opportunities are quite limitless; it just needs a bit of creativity.
Perhaps Martin will let you know what he plans to do with that 1tb of data :)
Thanks for all the info Sam! I haven’t seen too many articles about what you can do with such large data pools, so it’s definitely useful to have a resourceful post like this.
And I must say, those Geocities layouts are a fantastic display of amazing web design :-p
Alternatively- you could just use all of the fun animated gifs to create some solid linkbait gold (like this one):
https://www.textfiles.com/underconstruction/
Oh my. What a page...
I'm flabbergasted at how many animated .gifs are for "under construction" messages... It makes me wonder what all those people were doing on Geocities that would require so much construction.
Thanks for that image, man. I haven't seen so many animated .gifs for years...
Now, to look for unicorns and dancing hampsters!
"I said there was but one solitary thing about the past worth remembering, and that was the fact that it is past-can't be restored." Mark Twain
Fantastic Sam, I'm clearly not thinking outside the box enough! Brain's a bit slow today ;)
- Jenni
Definitively some great resources!
Thank you for the post!
Excellent post! I totally agree that good, valuable data is essential. Data drives!
Nice post, thanks.
This post has got the "X Factor" (if you're not British you probably don't get that reference, sorry).
Fantastic links Sam, and IT support are now bricking it because I've just asked for 3 TB of storage area outside our company backup array. When I start asking those questions, people start getting nervous!
I really like geocities. Thanks so much for this post!
Search engine optimisation
Phenomenal post, thanks.