One theme I've been concerned with this week is data transparency in the search engine world. Search engines provide information that is critical to optimizing and growing a business on the web, yet barriers to this data currently force many companies to use extraction methods that violate the search engines' terms of service.

Specifically, we're talking about two pieces of information that no large-scale, successful web operation should be without: rankings (the position of their site(s) vs. their competitors') for important keywords, and link data (currently provided most accurately through Yahoo!, but also available through MSN and, in lower-quality formats, from Google).

Why do marketers and businesses need this data so badly? First, let's look at rankings:

  • For large sites in particular, rankings across the board will go up or down based on their actions and the actions of their competition. Any serious company that fails to monitor tweaks to their site, public relations, press and optimization tactics in this way will lose out to competitors who do track this data and, thus, can make intelligent business decisions based on it.
  • Rankings provide a benchmark that helps companies estimate their global reach in the search results and make predictions about whether certain areas of extension or growth make logical sense. If a company must decide how to expand their content, what new keywords to target or even whether they can compete in new markets, the business intelligence that can be extracted from large swaths of ranking data is critical.
  • Rankings can be mapped directly to traffic, allowing companies to consider advertising, extending their reach or forming partnerships (the rough sketch after this list shows how that mapping works).
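
To make that rankings-to-traffic mapping concrete, here's a minimal sketch. The click-through rates by position are placeholder figures invented purely for illustration - substitute whatever CTR curve you trust - and the search volume would come from your own keyword research:

```python
# Rough sketch: estimate monthly visits from a keyword's ranking position.
# The CTR-by-position figures are illustrative placeholders, not measured data.
CTR_BY_POSITION = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
                   6: 0.04, 7: 0.03, 8: 0.03, 9: 0.02, 10: 0.02}

def estimated_visits(monthly_searches, position):
    # Positions beyond the first page are treated as sending no search traffic.
    return int(monthly_searches * CTR_BY_POSITION.get(position, 0.0))

# e.g. a keyword searched 10,000 times a month where you rank #3:
print(estimated_visits(10000, 3))  # -> 1000
```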

And, on the link data side:

  • Temporal link information allows marketers to see what effects certain link building, public relations and press efforts have on a site's link profile. Although some of this data is available through referring links in analytics programs, many folks are much more interested in the links that search engines know about and count, which often include many more than those that pass traffic (and also exclude some that do pass traffic).
  • Link data may provide references for reputation management or tracking of viral campaigns - again, items that analytics programs don't entirely capture.
  • Competitive link data may be of critical importance to many marketers - this information can't be tracked any other way.

I admit it. SEOmoz is a search engine scraper - we do it for our free public tools and our internal research, and we've even considered doing it for clients (though I'm seriously concerned about charging for data that's obtained outside the engines' TOS). Many hundreds of large firms in the search space (including a few that are 10-20X our size) do it, too. Why? Because search engine APIs aren't accurate.

Let's look at each engine's abilities and data sources individually. Since we've got a few hundred thousand points of data (if not more) on each, we're in a good position to make calls about how these systems are working.

Google (all APIs listed here):

  • Search SOAP API - provides ranking results that are massively different from those at almost every datacenter. The information isn't just useless, it's actually harmful, since you'll get a false sense of what's happening with your positions.
  • AJAX Search API - This is really designed to be integrated with your website, and the results can be of good quality for that purpose, but it doesn't do the job of providing good stats reporting (see the sketch after this list).
  • AdSense & AdWords APIs - In all honesty, we haven't played around with these, but the fact that neither will report the correct order of the ads, nor will they show more than 8 ads at a time tells me that if a marketer needed this type of data, the APIs wouldn't work.

Yahoo! (APIs listed here):

  • Search API - Provides ranking information that is a somewhat accurate map to Yahoo!'s actual rankings, but is occasionally so far off-base that it's not reliable. Our data points show a lot more congruity between Yahoo!'s API and its actual results than we see with Google's, but not nearly enough, compared with scraped results, to be valuable to marketers and businesses.
  • Site Explorer API - Shows excellent information as far as the number of pages indexed on a site and the link data that Yahoo! knows about (see the sketch after this list). We've been comparing this information with that from scraped Yahoo! search results (for queries like linkdomain: and site:) and those at the Site Explorer page, and find that there's very little quality difference in the results returned, though the best estimate numbers can still be found by paging to the last page of results.
  • Search Marketing API - I haven't played with this one at all, so I'd love to hear comments from those who have.
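
For those who haven't tried the Site Explorer API, here's a rough sketch of what pulling inbound-link data looks like. The endpoint, parameter names and the app-ID requirement are my assumptions based on Yahoo!'s developer documentation, and "YOUR_APP_ID" is a placeholder, so double-check everything before using it:

```python
# Hypothetical sketch of querying Yahoo!'s Site Explorer API for inbound links.
# Endpoint, parameter names and app-ID handling are assumptions; verify them
# against Yahoo!'s developer docs.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

APP_ID = "YOUR_APP_ID"  # placeholder - register with Yahoo! for a real one

def inbound_links(site, results=50):
    params = urllib.parse.urlencode({
        "appid": APP_ID,
        "query": site,
        "results": results,
        "entire_site": 1,  # links to the whole domain, not just the single URL
    })
    url = "http://search.yahooapis.com/SiteExplorerService/V1/inlinkData?" + params
    with urllib.request.urlopen(url) as resp:
        root = ET.parse(resp).getroot()
    # Each <Result> element carries a <Url> child naming one linking page;
    # matching on local names avoids hard-coding Yahoo!'s XML namespace.
    return [el.text for el in root.iter() if el.tag.split("}")[-1] == "Url"]

for link in inbound_links("http://www.seomoz.org"):
    print(link)
```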

MSN:

  • Doesn't mind scraping as long as you use the RSS results. We do, we love them, and we commend MSN for giving them out - bravo! They've also got a web search SDK program, but we've yet to give it a whirl. The only problem is the MSN estimates, which are so far off as to be useless. The links themselves, though, are useful (a sketch of pulling the RSS results follows below).
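
Here's roughly what consuming those RSS results looks like in practice. The results.aspx URL and the format=rss parameter are how I recall the feed being exposed, so treat them as assumptions and verify against MSN's current documentation:

```python
# Hypothetical sketch of reading MSN's RSS search results and recording each
# result's position. The URL and parameter names are assumptions.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def msn_rss_results(query):
    url = ("http://search.msn.com/results.aspx?format=rss&q="
           + urllib.parse.quote(query))
    with urllib.request.urlopen(url) as resp:
        root = ET.parse(resp).getroot()
    # Standard RSS layout: each <item> under <channel> is one ranked result.
    items = root.findall("./channel/item")
    return [(position, item.findtext("link"))
            for position, item in enumerate(items, start=1)]

for position, link in msn_rss_results("link building"):
    print(position, link)
```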

Ask.com:

  • Though it's somewhat hidden, the XML.Teoma.com page allows for scraping of results, and Ask doesn't seem to mind, though they haven't explicitly said anything. Again, bravo! - the results look solid, accurate and match up against the queries at Ask.com. Now, if Ask would only provide link data...

I know a lot of you are probably asking:

"Rand, if scraping is working, why do you care about the search engines fixing the APIs?"

The straight answer is that scraping hurts the search engines, hurts their users and isn't the most practical way to get the data. Let me give you some examples:

  • Scraped queries have to look as much like real user traffic as possible to avoid detection and banning - thus, they pollute the query data that search engineers use to improve web search.
  • These queries also hit advertisers - inflating the number of "real" impressions that advertisers see and unnaturally lowering their CTRs.
  • They take up search engine resources, and though even the heaviest scraping barely impacts their server loads, it's still an annoyance.

With all these negative elements, and so many positive incentives to have the data, it's clear what's needed - a way for marketers/businesses to get the data they need without hurting the search engines. Here's how they can do it:

  • Provide the search ranking position of a site in the referral string - this works for ranking data, but not for link data, and since Yahoo! (and Google) both send referrals through redirects at times, it wouldn't be a hard piece to add (see the sketch after this list for how a site could read that position).
  • Make the APIs accurate, complete and unlimited.
  • If the last option is too ambitious, the search engines could charge for API queries - anyone who needs the data would be more than happy to pay for it. This might help with quality control, too.
  • For link data - serve up accurate, holistic data in programs like Google Sitemaps and Yahoo! Search Submit (or even Google Analytics). Obviously, you'd only get information about your own site after verifying.
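
To show how little work the referral-string idea would require on the site owner's side, here's a sketch of reading a position out of the referrer. The "rank" parameter name is invented for illustration - no engine actually sends it today - so this is what logging could look like if they did:

```python
# Hypothetical sketch: if an engine appended the result position to its referral
# URL, a site could log its own rankings straight from server traffic. The
# "rank" parameter below is invented for illustration, not a real format.
from urllib.parse import urlparse, parse_qs

def rank_from_referrer(referrer):
    params = parse_qs(urlparse(referrer).query)
    keyword = params.get("q", [""])[0]        # the searcher's query
    rank = params.get("rank", [None])[0]      # the hypothetical position field
    return keyword, int(rank) if rank else None

# Example of a referral string an engine *could* send:
print(rank_from_referrer("http://www.google.com/search?q=seo+tools&rank=4"))
# -> ('seo tools', 4)
```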

I've talked to lots of people at the search engine level about making changes this week (including Jeremy, Priyank, Matt, Adam, Aaron, Brett and more). I can only hope for the best...