A group of students at Stanford has put together a paper (warning! PDF) detailing their estimates of the index sizes of Google, Yahoo! and MSN. These are the figures they came up with:
- Google - 53 billion pages
- Yahoo! - 8.4 billion pages
- MSN - 3.7 billion pages
My opinion is that these numbers are way, way off. Here's how the paper describes part of their methodology:
This is because an 8-digit number very rarely produces more than 1000 hits and Google and Yahoo always report their top 1000 results. So it is possible for us to count the exact number of hits in the entire result set. We compare the estimated hits reported by Google (Yahoo) with the actual hits, and [find] that the reported numbers are a bit larger than the actual number in the case of Google and smaller than the actual number in the case of Yahoo.
Then the team admits in the paper that their data for MSN is inaccurate:
For MSN, however, we were unable to retrieve the actual number of hits returned. This is because MSN Search does not return more than the top 250 results for any query and for some 8-digit numbers there are more than 250 hits that cannot be fetched completely. Our results for MSN Search are thus not accurate.
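To make the comparison in the quoted passages concrete, here is a rough Python sketch of how one might compare reported hit estimates with actual retrieved counts. It is not the paper's code: the function name `reported_to_actual_ratio`, the input shape and the sample numbers are all made up for illustration. The one thing taken from the paper is the retrieval cap (1000 for Google and Yahoo, 250 for MSN): any query that hits the cap is truncated, so its true count is unknown and it gets skipped.

```python
def reported_to_actual_ratio(samples, cap):
    """Ratio of reported hit estimates to actual retrieved hits.

    samples: list of (reported_estimate, retrieved_count) pairs, one per query
             (hypothetical input shape, not from the paper).
    cap:     the most results the engine will return for any query.
    """
    # A query that returns exactly `cap` results may have been truncated,
    # so its real hit count is unknown. Queries with zero hits are skipped too.
    usable = [(reported, actual) for reported, actual in samples
              if 0 < actual < cap]
    if not usable:
        return None
    total_reported = sum(reported for reported, _ in usable)
    total_actual = sum(actual for _, actual in usable)
    return total_reported / total_actual


# Made-up numbers: the third query hits a 250-result cap and is ignored.
example = [(120, 100), (60, 50), (5000, 250)]
ratio = reported_to_actual_ratio(example, cap=250)
```

A ratio above 1 would mean the engine's reported estimates run higher than what can actually be retrieved; below 1 would mean the opposite. With a cap as low as 250, many more queries get thrown out, which is the accuracy problem the paper admits to for MSN.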
If I were to estimate index sizes, I'd try to use a sample of phrases that each produce between 50 and 100 results, collect about 200-300 of these phrases and test the overlap between the engines (% of unique pages, total # of unique pages, duplicates of pages, etc.). It shouldn't be too hard for a university team to do something like this and get much better data than the above. Of course, if I had access to automatic queries to search engines via Stanford's setup, I probably wouldn't be conducting a whole lot of size testing research :)
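For what it's worth, the overlap test I have in mind could look something like the Python sketch below. Everything in it is an assumption for illustration: the function name `overlap_stats`, the nested-dict input and the toy URLs are hypothetical, and it takes for granted that the result URLs have already been collected and normalized. It only covers unique pages and overlap; spotting duplicate pages inside a single engine would need URL or content normalization that isn't shown here.

```python
from collections import Counter


def overlap_stats(results):
    """Summarize page overlap between engines.

    results: dict mapping engine name -> dict mapping phrase -> set of URLs
             (hypothetical input shape; fetching the URLs is not shown).
    """
    # Pool every URL each engine returned across all sampled phrases.
    pages = {engine: set().union(*by_phrase.values())
             for engine, by_phrase in results.items()}

    # Count how many engines returned each URL.
    seen_by = Counter(url for urls in pages.values() for url in urls)

    per_engine = {}
    for engine, urls in pages.items():
        only_here = sum(1 for url in urls if seen_by[url] == 1)
        pct = 100.0 * only_here / len(urls) if urls else 0.0
        per_engine[engine] = {
            "pages": len(urls),
            "unique": only_here,
            "pct_unique": pct,
        }

    return {
        "total_unique_pages": len(seen_by),
        "shared_by_all": sum(1 for n in seen_by.values() if n == len(pages)),
        "per_engine": per_engine,
    }


# Toy data standing in for real result sets.
sample = {
    "google": {"phrase one": {"a", "b", "c"}, "phrase two": {"d"}},
    "yahoo": {"phrase one": {"b", "c"}, "phrase two": {"e"}},
    "msn": {"phrase one": {"c"}},
}
stats = overlap_stats(sample)
```

Keeping each phrase in the 50-100 result range matters here because every engine can then return its complete result set, so nothing is lost to a retrieval cap before the overlap is measured.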
Thanks to Bill for pointing this out.