I was wrong.
My theory went something like this:
If I were a search engineer, I would want an algorithm to determine my results. I would, however, validate these results with human input for at least the highest-volume search queries. For the very highest volume queries in the world, I would hope, by now, to have got it "right" - and that at least the first page of results output from my algorithm would be exactly what I wanted it to be.
With a corollary that:
If this isn't the case, I would strongly consider hand-editing the top results for these huge volume phrases while I worked on the algorithm in order that my search engine worked as well as possible in the meantime.
There is a big question over what the "right" answer should be for very high volume generic queries that I might come back to another time (for generic queries there can often be far more than 10 pages good enough to be on the first page, and choosing between them requires more knowledge about the searcher than you can possibly have). To be clear here - I'm not talking about which results should be top under the current algorithm, but rather which pages should be top when thinking from scratch like a search engineer.
To test my theory, I decided to look at the search results for poker-related terms. I think poker's been on my mind since my brother (who won an award this week - congratulations, bro) took me to a casino for my birthday so I could lose money...
I picked three phrases of varying search query volume:
- poker
- free online poker
- rakeback (a term related to poker affiliates)
My null hypothesis was that the highest volume phrase ('poker') would be very static through the week. Either the results are actually hand-edited behind the scenes (in which case there is very little chance that they would be edited daily) or the engineers are happy with their algorithm (and, again, trying to think like a search engineer, for a generic search like this, what factors would cause you to change your mind from day to day about the top set of results?).
I have couched a lot of this in scientific language, but I'm not trying to claim my little test was perfect. There are a lot of factors that can spoil it, but taking care to minimise as many of these as I could, below are some charts that show what I found.
These are charts of rankings over time: the x-axis is time (from the 28th October to 6th November this year) and the y-axis is ranking at Google.com (gl=US). Each of the lines (or points in some cases) is a different page (generally different website - there were no examples of different pages off the same site swapping for each other in the results sampled).
I haven't labeled the points and lines because this isn't about whether I see the same results as you or whether they are still ranking (or even about tactics or underhanded techniques). I think the patterns are what is interesting:
Anyway, you can see how wrong I was.
The 'poker' search which I thought would have acted as though it were hand-edited over a short timescale like this (even if it isn't actually) in fact behaved differently to my prediction in two ways:
- The results changed almost entirely on each of the first three days. I still find this hard to believe. As a search engineer, what (in the absence of news, which wasn't in evidence during the course of this week) could cause you to want to change practically the whole set of results for such a high volume search phrase from day to day?
- Even with the same set of results in the latter part of the week, there were some pretty significant movements:
The 'free online poker' search behaved far more like I was expecting for a high volume search phrase. It shows evidence of being algorithmic (the pinpoint result that dropped in on the fourth day subsequently went into free fall and now ranks somewhere in the 60s). I think this shows that it got there via some kind of manipulation (I haven't looked into what, and for the purposes of this analysis, I don't think it matters - I don't think that it came in via a hand edit). Apart from that, the rankings are fairly stable with gradual changes and few surprises.
I like the pattern of the 'rakeback' search results, the serenity of the top three with chaos below. Obviously I wouldn't like it much if I were number 4, but that's a different story. Given the range of insights above, I'm not sure that this graph actually tells us all that much, but since I gathered the data, I thought I'd include it for completeness.
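The patterns described above (near-total change from one day to the next versus gradual drift) can be put into numbers. Below is a minimal sketch, assuming each day's results are stored as an ordered list of URLs. The function names `turnover` and `rank_series`, the day labels, and the URLs are all made up for illustration; this is not the data behind the charts.

```python
from typing import Dict, List


def turnover(prev: List[str], curr: List[str]) -> float:
    """Fraction of today's results that were absent from yesterday's results."""
    if not curr:
        return 0.0
    prev_set = set(prev)
    new_urls = [url for url in curr if url not in prev_set]
    return len(new_urls) / len(curr)


def rank_series(snapshots: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Map each URL to {day: rank}, i.e. the points behind one line on a chart."""
    series: Dict[str, Dict[str, int]] = {}
    for day, results in snapshots.items():
        for position, url in enumerate(results, start=1):
            series.setdefault(url, {})[day] = position
    return series


# Hypothetical snapshots (top 3 only, to keep the example short).
snapshots = {
    "day1": ["a.example", "b.example", "c.example"],
    "day2": ["d.example", "e.example", "a.example"],
    "day3": ["d.example", "a.example", "e.example"],
}

days = list(snapshots)  # dicts keep insertion order, so this is chronological
daily_turnover = [
    turnover(snapshots[d0], snapshots[d1]) for d0, d1 in zip(days, days[1:])
]
```

A turnover close to 1.0 on consecutive days corresponds to the "changed almost entirely" behaviour, while a turnover of 0.0 with shifting ranks in `rank_series` corresponds to the same set of pages moving around.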
So what can we learn from this and feed back into my initial assumptions to correct them and see where we end up? I'd love to hear others' thoughts in the comments, but the things I have come up with are:
- Methodology: I am the first to admit that this is not a scientific study (to understand what is really going on, I would need to see referral data for all the top results - not something I have access to for high volume searches).
- Testing: even if I am correct and results are as they would be if hand-edited for top volume search queries, given the data-driven nature of Google, they would want to test variants to see if they satisfied their users more.
- Spam: perhaps the algorithm is 'nearly right' but still susceptible to attacks such as the one we see in the single data point in the 'free online poker' results.
- News: obviously, when a query deserves freshness (QDF) the results are going to be shaken up regularly. I don't think that is the case in any of these examples, as none of the results coming or going were particularly timely.
I think the premise might be slightly askew because it implies that the search engineer (or another form of limited human input) is the judge of what is right for the search results, when, as QDF indicates, search engineers are more interested in determining what the Internet is talking about, how much, and how so.
"Rakeback" is a great example, because as far as the offline lexicon is concerned, it's barely a word, but the search engines have already analyzed it and concluded that it's part of the Internet lexicon. That would be victory one for a search engineer, the second being the determination of who is the most talked about, who is the most authoritative--in a word, relevancy. With such a niche term, it makes sense that the top three results have established some form of that and are rather static.
"Poker" is a much more common term with a much wider range of search intent. As an off the cuff guess, I would think that search intent could vary by time-of-day and day-of-week, with some sites providing a better search result experience on evenings and weekends, while others would be statistically better during the day. As, "Searches related to: poker," indicates, people often look for instructional sites, entertainment, supplies, and places to play poker online. Given all the variables that can be added to such a broad term, I'd expect the term to provide fluctuating search results, but that might be exactly what the search engineer determines to be best statistically.
Yeah - some good points. I was essentially wanting to test my theory that the search engineer, having built an engine, has 'favourite' searches that determine how well it has performed.
I know I'd find it hard not to do that. It's impossible to test though - this was trying to be the next best thing...
You are you and I am me, and we're both a lot smaller than the Internet (or Google's daily traffic volume).
My guess is that a search engineer would want to be most successful based on traffic response, ad revenue, and some statistically valid breakdown of top 3 clicks, or first page clicks.
Working with this theory, after establishing a baseline in regards to relevancy--which most search engines have--I'd then be looking, algorithmically, at what mix returns the optimal results based on the user reaction. This is seen especially with QDF searches and logged in user searches (last clicked @, shuffling of results based on click, etc.) In the case of personalized search what we see as personal micro data could be applied to many macro users based on similar actions. Of course the baseline influencers would still have the most weight, but optimizing the experience for the user is probably the holy grail.
That's not to say there isn't value in the expert model, and Google admits as much with its launch of Knols; it's just that a search engine is all about the most relevant for the many. So, like you said, this would be impossible for one person to test, but your tests were definitely eye opening and I really got a kick out of the prep and results that came from this post.
An excellent post Will - though I'd probably say that regardless since you linked to my award and the post is all about poker :-p
I find the rakeback results particularly interesting - what is it about those top results which causes them to avoid that fluctuation in the bottom results?
You're right that it's not terribly scientific but I absolutely love this kind of thought experiment where you try and figure out what's going on.
Interesting approach. I'm not sure if you're thinking like a search engineer though, since my own thinking would be "I want an algorithm that requires as little human intervention as possible for both human resource utilisation and accusations of bias reasons." What I did find interesting was the differences between the two term results, one crazily active, the second predictably static, and the third static at the top but below that hopping like a creel of lobster.
That's where I think the real questions come in. I take an economist's view of the Google results: assuming the algorithms are relatively static and accurate, it is human behaviour that causes differentiation. This is simplistic and probably not true, but I find it easier to ascribe fluctuations to greed, sloth, and deception rather than the Omnipotent Engineer theorem (hey, I think I just laid the groundwork for an Evolution/Intelligent Design style debate!).
So why the differences? Poker is remarkably fluid for what should be a generic search term that should have a Wikipedia entry, hobbyist sites, actual goods for sale (chips, tables, decks, etc.). I'm not sure what the poker SEO mafia's interest in the term is, and I'm assuming that interest is quite high due to the movement. I expected Free Online Poker to yo-yo far more as well (are you sure you didn't have them switched? :)
Rakeback is the most interesting in that it can be directly linked to an economic activity, and has little to no traction outside of the small circle who already know what it is. It seems to follow the "Google as economics" theory in that there are stable "top" performers coming first, with a tangled snarl of up and comers and down and outers shifting in and out of place.
I think what it tells us is that we can't assume Google operates in isolation; that their algorithm (and business) always requires examination of the human factor. I notice you didn't take into account what you assumed the behavior of the people striving for those positions would be. I'd like to see maybe a comparison of a "stable" category (say desks or filing cabinets or something) versus another "volatile" one (like RX or online dating).
Thanks for doing this, it's really made me think some interesting things, the most obvious of which is if macroeconomic principles can be applied to Google's search results.
Some interesting points. Since publishing the post I have started cooking dinner (evening here in the UK) so I'll have to be brief... There will, of course be competitor action to consider - but my thinking on the highest volume search (I checked - they aren't switched!) was that regardless of what you did during a week, Google would have their minds made up (algorithmic minds that is) and it would be fairly static. I was very wrong!
Regarding your first point - I would definitely want no human influence in the algorithm in the long run, but I think in the short term, my desire for good results *could* outweigh this (we know hand jobs happen in the results sometimes....).
Thanks for an insightful comment.
I'm not disagreeing with your premise that a stable keyword will be much less liable to be influenced by competitor activity, but by human activity I also mean the larger context of people making web pages, people visiting them, the general scope of human interest and content in that topic. So the shape of "poker" overall could be viewed as the actions of consumers (site visitors) and producers (site owners and related service providers, e.g. SEO). Their interactions would be represented in the "market" (in this case Google, to really stretch the analogy). If your algorithm is that abstract (and in a way, if you're simulating a non-deterministic system it kind of has to be) it could then have no human involvement whatsoever and still display behaviour similar to hand edited results. (Also I'm amazed no one has called you out on "hand jobs" yet, SEOs must be a very puritan bunch)
I'm kind of off on a tangent in that I'm trying to think like a search engineer in the large large sense, which probably has no application to the question at hand (instead being kind of random musing). I'm guessing the Google process is much more manual as you theorize, but it would be interesting if it were possible to code it my way. We could call Google "the invisible hand".
How would one hand edit results for "poker" anyway? What is a relevant result for "poker"? I'm just guessing they leave it to the algo and maybe, just maybe include some more info-heavy results to spice up the SERPS.
As I said, that's a conversation for another day! I might touch on that at some point. I don't think you'd start from scratch to hand-edit, but I could see you taking the algo output and tweaking / binning manipulative sites...
I find this topic very interesting as well - after all, for a generic query like this how do you determine the 'best' results?
Having said that - for a search like this actually it's fairly trivial to see which the biggest brands are and rank them accordingly. That's one of the tools I'd use to hand-edit but it's fairly rudimentary...
If I were Google I might also add in a bit of stirring to cycle the top results. Google must be aware that those who "win" the algo can make a lot of money and so they may then decide to share it out a bit for particularly economically beneficial search results where the quality of the top ranked pages may not actually differ by anything significant.
Such a move would help to obfuscate the fine detail of the algo and hide it from "algo-crackers".
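The "stirring" idea can be sketched in a few lines. This is purely an illustration of the commenter's speculation, not anything Google is known to do: the function name `stir`, the `tolerance` parameter, and the scores are hypothetical. Results whose scores are within `tolerance` of the best score in their group are treated as effectively tied and rotated by one place per day.

```python
from typing import List, Optional, Tuple


def _rotate(group: List[str], day: int) -> List[str]:
    """Rotate a group of near-tied URLs left by `day` places."""
    shift = day % len(group)
    return group[shift:] + group[:shift]


def stir(results: List[Tuple[str, float]], day: int, tolerance: float = 0.01) -> List[str]:
    """Order results by score, cycling near-tied results so each gets a turn on top."""
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    output: List[str] = []
    group: List[str] = []
    group_top: Optional[float] = None
    for url, score in ranked:
        # Close the current group once a score falls too far below its leader.
        if group and group_top is not None and group_top - score > tolerance:
            output.extend(_rotate(group, day))
            group = []
        if not group:
            group_top = score
        group.append(url)
    if group:
        output.extend(_rotate(group, day))
    return output


# Hypothetical relevance scores: the first three are near-ties, the fourth is not.
scored = [("a.example", 0.900), ("b.example", 0.895), ("c.example", 0.893), ("d.example", 0.700)]
```

With these made-up scores, `stir(scored, day=0)` keeps the plain score order, and `stir(scored, day=1)` moves `b.example` to the top while `d.example` stays fourth, since it is not close enough in score to join the rotation.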
I have also noticed that the first page for extremely popular terms changes on a daily basis, whereas less popular terms are more stable.
 Interesting
this is the best search engine blog post i have read all week. thanks a lot, man.
My thoughts are these:
Google is absolutely obsessed with testing. They have also without a doubt started to gather data on user behavior based on how we react to the SERPs. The fluctuations are probably tests of different sites to see if they are indeed relevant to the search query based on user behavior. The stable sites probably indicate those that have already passed the test over a given period of time and will be difficult to 'shake' from the top results.
I would imagine most of the hand editing involves removing sites rather than ranking them, but that's just a hunch.
 I obviously don't have data to back any of this up, but it makes sense from what I know.
Interesting post. In theory, I would think the results would be on par with your hypothesis.
Question: How much, if at all, did your previous searches and web activity/behavior change your future results?
Another factor that comes to mind is the amount of SEO work going on at these particular sites. I guess that's a form of what another poster said about "freshness of content".
Economic value is off for the above searches. I know this industry inside and out. (1st page for "rakeback" until July, now 4th page, but a top 3 rakeback company worldwide) Measuring the top three in economic value and their ranking in the top 3, then top 10: For poker - 2/3; 2/10; f.o.p. - 1/3; 3/10; rakeback - 0/3; 0/10 (These are obviously from my own searches of your test terms.)
What there needs to be is a way to rank by reputation. Oh, I know, that's what the link game is all about. That's what search engines and seo are. But it is so imperfect. Referencing popularity by buzz, especially when you have a community in the niche as perfect as 2+2 (reference Tom's "Nature of Online Communities" post of 10/16), should be the name of the search engine game. Rank forum users by reputation, review their posts, put sites (linked or not) up or down based on what they say. ...n/m, that's ridiculous
You guys sure do love poker at moz. :)
I agree with another poster: If I'm a search engineer, I don't want humans touching anything in my results. Ever.
Anyway, this has been a complete ramble. Thanks for the post and comments. -Byron
Nice premise, but seems like you were not thinking like an engineer :)
If a SERP looks like it could benefit from hand editing, that should be abstracted into patterns, process, algorithms, etc. It scales.
For example, what's the solution if the 4th result for "poker" changes to a page about the wildlife in Africa? A human could catch this, but so does crawling more often, comparing with previous versions, rebuilding the index daily for the top X results, checking via google bar if the query term exists in the loaded page, time spent (via analytics)...
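Two of the automated checks listed above (comparing a page with its previous version, and checking whether the query term still appears on it) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `needs_review` and the `min_similarity` threshold are hypothetical, the page text is assumed to be already extracted as plain strings, and nothing here reflects how any search engine actually does it.

```python
from difflib import SequenceMatcher


def needs_review(query: str, previous_text: str, current_text: str,
                 min_similarity: float = 0.5) -> bool:
    """Flag a ranked page if the query term vanished or the content changed drastically."""
    # Check 1: does the query term still appear on the page at all?
    term_missing = query.lower() not in current_text.lower()
    # Check 2: how similar is the new version to the last crawled version (0.0 to 1.0)?
    similarity = SequenceMatcher(None, previous_text, current_text).ratio()
    return term_missing or similarity < min_similarity


old_page = "Play poker online with friends"
flagged = needs_review("poker", old_page, "Wildlife of Africa: lions and elephants")
unchanged = needs_review("poker", old_page, "Play poker online with your friends")
```

In this example the page that switched to African wildlife is flagged, while the lightly edited poker page is not. A check like this only catches gross changes; it says nothing about the quality of a page that still mentions the term.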
Do you think the fluctuations have anything to do with the fact that the final table of the World Series of Poker was going on Nov 9th & 10th?
Although I haven't included what the sites are deliberately, none of them were news related so I am currently discounting the world series (and the last date I looked at was the 6th Nov).
Will,
Could there be a fresh content factor? They might not be news sites, but they might have a blog that is reflected on the homepage, and it might be updated once in a while in a way that shows up as fluctuations (especially with the World Poker Tour going on as mentioned above).
That's definitely possible, Mert. Yeah. I didn't do a proper look at how the content on the homepages changed during the time...
OMG, not the WPT it was the WSOP Main Event!
I thought this was the key, "for generic queries there can often be far more than 10 pages good enough to be on the first page, and choosing between them requires more knowledge about the searcher than you can possibly have."
An ultra generic search term like poker (or cars or books) gives SE almost no indication of the actual intention behind the search, so I don't think hand editing would be very efficient. Can you imagine how much hair you'd pull out or how many arguments might take place between the editors?
It makes more sense for 'engineers' to open the SERPs for these general terms up and let the algorithm take its best shot. And with user data over time, maybe they can then determine what the most likely intention is when searchers enter these types of generic terms.
I think it's an interesting study (off to make my own chart for our niche). It will be nice to see how we are doing and how steady we are. One issue is that different data centers give different results, etc... but still it's interesting...
 nice post.
Will, I'm sure you have looked into this, but I just wonder: would highly competitive terms tend to behave more unsteadily than other, non-competitive terms? Surely the terms that are most likely to be manipulated must get some type of "special treatment" from the SEs?
I compare this to the stock market, where its value goes up and down according to many variables.
In addition to the comments about poker events going on at the time of the test, what about just day of the week and time of day? It seems that Monday morning searches driven by 'I can't believe I lost so much', as opposed to Friday afternoon searches driven by 'I'm really going to clean up this weekend', might be very different. Of course that implies the assumption that people playing poker have regular jobs and do it for fun on weekends, and that might not be the case at all.