As much as it pains me to have to give ground to the industry giant, Google appears to still have the edge in index size. Yahoo! claims to have over 19 billion web documents indexed, but in a very good report from earlier this week, Matthew Cheney and Mike Perry from the Univ. of Illinois note that:
Based on the data created from our sample searches, this study concludes that for a random set of words a user can expect, on average, to receive 65% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,034 test cases we ran, only in 16% of the cases (1606) did Yahoo! return more results. In 83.7% of the cases (8399) Google returned more results. In less than 1% of the cases both search engines returned the same number of results.
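As a quick sanity check, the percentages in that quote can be recomputed from the raw counts it gives. This is only a rough sketch: the variable names are mine, and the tie count is not stated in the report excerpt, so I'm assuming it is simply whatever is left over once the two stated counts are subtracted from the total.

```python
# Counts quoted from the study excerpt above.
total_cases = 10_034
yahoo_more = 1_606
google_more = 8_399

# Assumption: every remaining case is a tie (the excerpt only says "less than 1%").
ties = total_cases - yahoo_more - google_more


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(f"Yahoo! returned more: {share(yahoo_more, total_cases)}%")
print(f"Google returned more: {share(google_more, total_cases)}%")
print(f"Ties (assumed remainder): {ties} cases, {share(ties, total_cases)}%")
```

The recomputed shares line up with the 16%, 83.7%, and "less than 1%" figures in the quote, so at least the arithmetic in the excerpt is internally consistent.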
The methodology of the study is reasonably bulletproof, and it would be my guess that rather than a "lie", what has happened here is that Google has purged "duplicate" documents while Yahoo! has counted them towards index size. While Yahoo! is clearly making progress, putting out a claim like this without the ability to back it up via experimentation is probably bad for publicity...
Michael - I'd be interested to hear your specific critiques. While I certainly agree that the process is flawed, the results and methodology used seem accurate enough to me to be considered valuable in a comparison. While Yahoo! very well may have 19.2 billion documents indexed, there are more unique documents at Google - that's what these 10,000 searches tell me.
Michael - all good points. If I were building a space shuttle, I certainly wouldn't rely on data like this. However... none of the items you've pointed out suggests a bias - inaccuracy, certainly, but no inherent bias towards favoring Google or Yahoo!.
The results, however, are so strongly in Google's favor that, unless I'm missing an element of bias that would give Google an advantage, I'd still have to say this is strong evidence, even if it doesn't meet the rigorous standards of scientific proof.
Thanks for your critiques, though. They help to show a lot of folks what's really going on.
I guess a bigger index size doesn't guarantee more relevant results. That comes down to the algo, not the size of the index. Having a huge library and no way to find the book we need is no use. ;-)