Historically, I have written and presented about big data: using data to create insights, and automating your data ingestion process by connecting to APIs and leveraging advanced database technologies.
Recently I spoke at SMX West about leveraging the rich data in webmaster tools. After the panel, I was approached by the in-house SEO of a small company, who asked me how he could extract and leverage all the rich data out there without having a development team or large budget. I pointed him to the CSV exports and some of the more hidden tools to extract Google data, such as the GA Query Builder and the YouTube Analytics Query Builder.
However, what do you do if there is no API? What do you do if you want to look at unstructured data, or use a data source that does not provide an export?
For today's analytics pros, the world of scraping—or content extraction (it sounds less black hat)—has evolved a lot, and there are many great technologies and tools out there to help solve these problems. Many companies have emerged that specialize in programmatic content extraction, such as Mozenda, ScraperWiki, Import.io, and Outwit, but for today's example I will use Kimono Labs. Kimono is simple and easy to use and offers very competitive pricing (including a very functional free version). I should also note that I have no connection to Kimono; it's simply the tool I used for this example.
Before we get into the actual "scraping" I want to briefly discuss how these tools work.
The purpose of a tool like Kimono is to take unstructured data (not organized or exportable) and convert it into a structured format. The prime example of this is any ranking tool. A ranking tool reads Google's results page, extracts the information and, based on certain rules, creates a visual view of the data: your ranking report.
Kimono Labs allows you to extract this data either on demand or as a scheduled job. Once you've extracted the data, it then allows you to either download it via a file or extract it via their own API. This is where Kimono really shines—it basically allows you to take any website or data source and turn it into an API or automated export.
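To make the unstructured-to-structured idea concrete, here is a minimal Python sketch of the kind of transformation a tool like Kimono automates for you. The HTML fragment and field names below are invented for illustration, not real markup from any site:

```python
# Turn an unstructured HTML fragment into structured rows (title + link).
# SAMPLE is made-up example markup, not real Google results HTML.
from html.parser import HTMLParser

SAMPLE = """
<div class="result"><a href="https://example.com/a">First title</a></div>
<div class="result"><a href="https://example.org/b">Second title</a></div>
"""

class ResultExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []       # structured output: list of {title, link} dicts
        self._href = None    # link of the anchor we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text inside an <a> becomes the title; pair it with the href.
        if self._href and data.strip():
            self.rows.append({"title": data.strip(), "link": self._href})
            self._href = None

parser = ResultExtractor()
parser.feed(SAMPLE)
print(parser.rows)
```

A point-and-click tool does exactly this kind of pairing for you, just without the code.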
For today's exercise I would like to create two scrapers.
A. A ranking tool that will take Google's results and store them in a data set, just like any other ranking tool. (Disclaimer: this is meant only as an example, as scraping Google's results is against Google's Terms of Service).
B. A ranking tool for Slideshare. We will simulate a Slideshare search and then extract all the results including some additional metrics. Once we have collected this data, we will look at the types of insights you are able to generate.
1. Sign up
Signup is simple; just go to https://www.kimonolabs.com/signup and complete the form. You will then be brought to a welcome page where you will be asked to drag their bookmarklet into your bookmarks bar.
The Kimonify Bookmarklet is the trigger that will start the application.
2. Building a ranking tool
Simply navigate your browser to Google and perform a search; in this example I am going to use the term "scraping." Once the results page is displayed, press the Kimonify button (in some cases you might need to search again). You should then see a screen like the one below:
It is basically the default results page, but on the top you should see the Kimono Tool Bar. Let's have a close look at that:
The bar is broken down into a few actions:
- URL – The current URL you are analyzing.
- ITEM NAME – Once you define an item to collect, you should name it.
- ITEM COUNT – This will show you the number of results in your current collection.
- NEW ITEM – Once you have completed the first item, you can click this to start collecting the next set.
- PAGINATION – You use this mode to define the pagination link.
- UNDO – I hope I don't have to explain this ;)
- EXTRACTOR VIEW – The mode you see in the screenshot above.
- MODEL VIEW – Shows you the data model (the items and the type).
- DATA VIEW – Shows you the actual data the current page would collect.
- DONE – Saves your newly created API.
After you press the bookmarklet you need to start tagging the individual elements you want to extract. You can do this simply by clicking on the desired elements on the page (if you hover over it, it changes color for collectable elements).
Kimono will then try to identify similar elements on the page; it will highlight some suggested ones and you can confirm a suggestion via the little checkmark:
A great way to make sure you have the correct elements is by looking at the count. For example, we know that Google shows 10 results per page, so we want to see "10" in the item count box, which indicates that we have 10 similar items marked. Now go ahead and name your new item group. Each collection of elements should have a unique name; on this page, it would be "Title".
Now it's time to confirm the data; just click on the little Data icon to see a preview of the actual data this page would collect. In the data view you can switch between different formats (JSON, CSV and RSS). If everything went well, it should look like this:
As you can see, it not only extracted the visual title but also the underlying link. Good job!
To collect some more info, click on the Extractor icon again and pick out the next element.
Now click on the Plus icon and then on the description of the first listing. Since the first listing contains site links, it is not clear to Kimono what the structure is, so we need to help it along and click on the next description as well.
As soon as you do this, Kimono will identify some other descriptions; however, our count only shows 8 instead of the 10 items that are actually on that page. As we scroll down, we see some entries with author markup; Kimono is not sure if they are part of the set, so click the little checkbox to confirm. Your count should jump to 10.
Now that you have identified all 10 objects, go ahead and name that group; the process is the same as in the Title example. In order to make our tool better than others, I would like to add one more set: the author info.
Once again, click the Plus icon to start a new collection and scroll down to click on the author name. Because this is totally unstructured, Kimono will make a few recommendations; in this case we are working by exclusion, so press the X for everything that's not an author name. Since the word "by" is included, highlight only the name, not "by," to exclude it (keep in mind you can always undo if things get odd).
Once you've highlighted both names, the results should look like the screenshot below, with the count in the circle being 2, representing the two authors listed on this page.
Out of interest I did the same for the number of people in their Google+ circles. Once you have done that, click on the Model View button, and you should see all the fields. If you click on the Data View you should see the data set with the authors and circles.
As a final step, let's go back to the Extractor view and define the pagination; just click the Pagination button (it looks like a book) and select the next link. Once you have done that, click Done.
You will be presented with a screen similar to this one:
Here you simply name your API, define how often you want this data to be extracted and how many pages you want to crawl. All of these settings can be changed manually; I would leave it with On demand and 10 pages max to not overuse your credits.
Once you've saved your API, there are a ton of options (too many to review here). Kimono has a great learning section you can check out any time.
Collecting the listings requires a quick setup. Click on the pagination tab, turn it on, and set your schedule to On demand to pull data when you ask it to. Your screen should look like this:
Now press Crawl and Kimono will start collecting your data. If you see any issues, you can always click on Edit API and go back to the extraction screen.
Once the crawl is completed, go to the Test Endpoint tab to view or download your data (I prefer CSV because you can easily open it in Excel, Spotfire, etc.). A possible next step here would be doing this for multiple keywords and then analyzing the impact of, say, G+ authority on rankings. Again, many of you might say that a ranking tool can already do this, and that's true, but I wanted to cover the basics before we dive into the next one.
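As a quick illustration of what you can do with the CSV once it's downloaded, here is a short Python sketch that summarizes a ranking export. The column names (position, title, author) and the sample rows are assumptions standing in for whatever you named your Kimono items:

```python
# Summarize a hypothetical ranking CSV: best position reached per author.
# EXPORT stands in for the downloaded file; in practice: open("ranking.csv").
import csv
import io
from collections import defaultdict

EXPORT = """position,title,author
1,Scraping 101,Jane Doe
2,Data tips,John Roe
3,More scraping,Jane Doe
"""

best_by_author = defaultdict(lambda: 999)  # 999 = "worse than any real rank"
for row in csv.DictReader(io.StringIO(EXPORT)):
    pos = int(row["position"])
    best_by_author[row["author"]] = min(best_by_author[row["author"]], pos)

print(dict(best_by_author))  # best ranking position per author
```

The same pattern scales to multiple keywords: add a keyword column and group by (keyword, author).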
3. Extracting SlideShare data
With Slideshare's recent growth in popularity, it has become the document-sharing tool of choice for many marketers. But what's really on Slideshare? Who are the influencers? What makes it tick? We can utilize a custom scraper to extract that kind of data from Slideshare.
To get started, point your browser to Slideshare and pick a keyword to search for.
For our example I want to look at presentations that talk about PPC in English, sorted by popularity, so the URL would be:
https://www.slideshare.net/search/slideshow?ft=presentations&lang=en&page=1&q=ppc&qf=qf1&sort=views&ud=any
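Since that search URL is just a base path plus query parameters, you can also generate it (or variants for other keywords) programmatically. A small Python sketch, using only the parameters visible in the URL above:

```python
# Build the Slideshare search URL from its query parameters, so you can
# swap in other keywords or pages. Parameters taken from the URL above.
from urllib.parse import urlencode

def slideshare_search_url(keyword, page=1):
    params = {
        "ft": "presentations",  # presentations only
        "lang": "en",           # English results
        "page": page,
        "q": keyword,           # the search term
        "qf": "qf1",
        "sort": "views",        # sorted by popularity
        "ud": "any",
    }
    return "https://www.slideshare.net/search/slideshow?" + urlencode(params)

print(slideshare_search_url("ppc"))
```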
Once you are on that page, pick the Kimonify button as you did earlier and tag the elements. In this case I will tag:
- Title
- Description
- Category
- Author
- Likes
- Slides
Once you have tagged those, go ahead and add the pagination as described above.
That will make a nice rich dataset which should look like this:
Hit Done and you're finished. In order to quickly highlight the benefits of this rich data, I am going to load the data into Spotfire to get some interesting statistics (I hope).
4. Insights
Rather than do a step-by-step walkthrough of how to build dashboards, which you can find here, I just want to show you some insights you can glean from this data:
- Most Popular Authors by Category. This shows you the top contributors and the categories they are in for PPC (squares sized by Likes)
- Correlations. Is there a correlation between the number of slides and the number of likes? Why not find out?
- Category with the most PPC content. Discover where your content works best (most likes).
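To make the correlation idea concrete, here is a small Python sketch computing a Pearson correlation over an invented sample of slide counts and like counts. The numbers are made up; your exported dataset would supply the real ones:

```python
# Pearson correlation between slide count and like count (invented data).
from math import sqrt

slides = [10, 25, 40, 60, 80]   # slides per deck (made-up sample)
likes  = [12, 30, 35, 70, 90]   # likes per deck (made-up sample)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

r = pearson(slides, likes)
print(round(r, 2))  # close to 1.0 here -> strong positive relationship
```

Remember that correlation isn't causation: longer decks might simply come from authors with bigger audiences.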
5. Output
One of the great things about Kimono that we have not really covered is that it actually converts websites into APIs. That means you build them once, and each time you need the data you can call it up. As an example, if I call up the Slideshare API again tomorrow, the data will be different. So you have basically appified Slideshare. The interesting part here is the flexibility that Kimono offers. If you go to the How to Use slide, you will see the way Kimono treats the source URL. In this case it looks like this:
Aside from the export, you can pull data from Kimono via their own API; in this case you call the default URL,
https://www.kimonolabs.com/api/YOURAPIID?apikey=YO...
You would get the default data from the original URL; however, as illustrated in the table above, you can dynamically adjust elements of the source URL.
For example, if you append "&q=SEO"
(https://www.kimonolabs.com/api/YOURAPIID?apikey=YOURAPIKEY&q=SEO)
you would get the top slides for SEO instead of PPC. You can change any of the URL options easily.
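Putting that together, here is a hedged Python sketch of how a script might construct such an endpoint URL. YOURAPIID and YOURAPIKEY are the same placeholders used above, not real credentials:

```python
# Build a Kimono endpoint URL with extra source-URL parameters appended.
# api_id and api_key are placeholders; extra kwargs (like q="SEO") are
# passed through to the source URL as described above.
from urllib.parse import urlencode

def kimono_endpoint(api_id, api_key, **source_params):
    params = {"apikey": api_key, **source_params}
    return f"https://www.kimonolabs.com/api/{api_id}?" + urlencode(params)

url = kimono_endpoint("YOURAPIID", "YOURAPIKEY", q="SEO")
print(url)

# In a live script you would then fetch and parse it, e.g.:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
```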
I know this was a lot of information, but believe me when I tell you, we have just scratched the surface. Tools like Kimono offer a variety of advanced functions that really open up the possibilities. Once you start to realize the potential, you will come up with some amazing, innovative ideas. I would love to see some of them shared here in the comments. So get out there and start scraping … and please feel free to tweet at me or reply below with any questions or comments!
Man, I love Kimono. I've tried a few alternatives, like import.io, but you can't really beat the speed and ease of the bookmarklet. This is a SUPERB introduction to data mining, taking scraping a level beyond the typical SEO Tools for Excel tutorial. Great post, Benjamin.
I wasn't so impressed with import.io (slow and buggy) - will give this one a try - though I'm still looking for a data extractor that will grab GPS coordinates from Google Maps in business directories.
Hello,
Thank you for the feedback. As for your lat/long question, have you tried this API: https://developers.google.com/places/documentation/details ?
Check out PromptCloud; they are good at data extraction services.
Thank you Paul!
Dang! This YouMoz post really snuck up on me! Unbelievably detailed guide, Benjamin. I've been looking for an easier way to submit a DuckDuckHack "Goodie" and this free Kimono tool is EXACTLY what I needed to get it done quickly and easily.
Another thing this post does is it really shows the simplicity of scraping and how it can so easily structure unorganized/unstructured data. I hadn't heard of/used Kimono before but now I can't wait to get started! Thanks again.
Thank you, Brady, I appreciate the feedback. Another part I really like about API-based tools like Kimono is the fact that they turn any set of scraping rules into an actual API that can be reused and extended vs. a one-time export.
Totally agree. You can basically design an entire product/app without even touching the code! Really neat platform to use.
Great work, Benjamin. I've become somewhat obsessed with data scraping over the past 6 months (if you read my Moz post a few weeks ago, you'd know this!). I've been playing around with a load of different tools at the moment (as well as looking at my own) and I have to say that Kimono is one of my favourites.
I'm a big fan of using the SEO Tools plugin for Excel but there's a lot of scalability issues that come in there. I particularly enjoyed your thoughts on analysing SlideShare content - that could make for an interesting report in itself.
Keep up the good work!
Hey this is a great article. Thanks for all the accompanying graphs and charts!
Thank you!
Nothing is more fun than organizing and aligning 30 screenshots ;)
A very good informational tool. Let me just implement this step by step guide.
Benjamin, Kimono Labs looks incredible thanks for posting about it.
Does it support websites which require you to login to view the data?
Also does it support websites which require you to enter data, click submit in order for the data to be viewed?
Thanks,
Jamie
Thanks for the article, had no idea this existed.
I've used Scrapebox but never really delved further than just pulling keyword suggestions.
This seems like you can do some pretty cool things with it.
Great list of tools; let me use them one by one, step by step. Thanks for such useful information...
[link removed]
Good stuff Benjamin, adding to my list of scraping tools (https://www.garethjames.net/a-guide-to-web-scrapping-tools/) and linking to this post.
I found https://www.promptcloud.com to be a better solution for scraping web data.
Great article, Thanks for sharing the nice presentation about big data—using data to create insights. It’s very informative. I found one more good resource related to Big Data through Intellipaat. They are providing Big Data Certification training including Hadoop and its Ecosystem with the 24*7 Lifetime Support.
Free as in beer. For free as in freedom, use ScreenSlicer
Do you have a time estimate for how long this would take?
Thanks for the intro to Kimono. We use Outwit here and it's been effective but it looks like Kimono is even more versatile.
We could have really used this a few months ago when migrating a client's site. Every few days we would use Chrome's "Scrape Similar" extension to grab the top 100 URLs with a site: search to see which new URLs had entered the index and which old ones had been removed. Sounds like Kimono would have put this on auto-pilot for us.
Never thought that somebody could do scraping for free so professionally. Thanks for the insight.
Wow! Another in-depth post on data scraping. I am just starting to experiment with data scraping and this one with kimono tool makes me want to check it out, particularly the extractor feature. Nice :)
Maybe I found a great app ... Thanks a lot.
Wow, this is really interesting stuff. I can only begin to imagine all of the ways you could apply this to any number of scenarios.
Kimono got bought out and it's not working anymore.
Import.io is good but it's kinda expensive for me. I've tried the free trial before but see the price starts from $249, so...
You may want to look at this new web scraping software called Octoparse. It takes a little time to begin. But they have rich tutorials on the website. Plus it doesn't require programming skills.
They provide cloud-based scraping service and API access.
You can use it to create API but you have to pay for the service. Free version doesn't have the API and cloud service.
I've been using it for 2 months with the Standard plan at $89, which is a lot cheaper than import.io. So far so good.
In case you want to know, here is the link:
https://www.octoparse.com/
Paul, aren't you with Octoparse? https://www.octoparse.com/pad/octoparse.xml
Yes, Kimono is an excellent tool for web scraping! Programmers can easily scrape data from websites using this tool!! But if you are not a programmer and need quality and accurate data, then I recommend you check out the website LOGINWORKS https://www.loginworks.com/our-services/web-scraping/ , it has expertise in getting quality data!
Kimono is really a cool tool to scrape the data from website. But in order to get the accurate and quality data using kimono, you should be proficient with scripting. However, if you don't want to do it as your own or using any tool, I suggest you to avail the customized web scraping services currently offered by several companies. Loginworks is a well known name in this industry. Check their web scraping services at:
https://www.loginworks.com/our-services/web-scraping/
It's very nice and informative content for big data, but not for small entrepreneurs. Thanks anyway.
With a little common sense, we can adapt the methods discussed in this article to a small business. I assure you!