Determining Relevance
When a user submits a query to a search engine, the first thing it must do is determine which pages in the index are related to the query and which are not. Throughout this post, I will refer to this as the "relevance" problem. More formally, we can state it as follows:
Given a search query and a document, compute a relevance score that measures the similarity between the query and document.
The "document" in this context can also refer to things like the title tag, the meta description, incoming anchor text, or anything else that we think might help determine whether the query is related to the page. Practically, a search engine computes a number of relevance scores using different page elements and weights them all to arrive at one final score.
The relevance problem has been extremely well studied in the research community. The first papers go back several decades, and it is still an active area of research. In this post, I focus on the most influential approaches that have stood the test of time.
Relevance vs Ranking
Conceptually, we can separate relevance determination from ranking the relevant documents, even if they are implemented as a single step inside a search engine. In this mental framework, the relevance step first makes a binary (True/False) decision for each page, then the ranking step orders the documents to return to the user. I'll present some data later in this post that vividly illustrates this split and how it relates to different ranking signals.
Query and Document Models
The first hurdle in computing a similarity score is translating the query and document from raw strings into something we can compute with. To do so, we make use of "query models" and "document models." The "models" here are just a fancy way of saying that the strings are represented in some other form that makes computation possible.
The above image illustrates this process for the query "philadelphia phillies" and the Wikipedia page about the Phillies. The final step in computing the similarity score runs the query and document representations through a scoring function.
Query Models
The following image illustrates some different types of query models:
The building blocks at the bottom include things like tokenization (splitting the string into words), word normalization (such as stemming, where common word endings are removed), and spelling correction (if a query contains a misspelled word, the search engine corrects it and returns results for the corrected word).
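As a rough sketch of those building blocks, here is what tokenization plus stemming might look like in Python. It assumes NLTK's Porter stemmer is installed; it is an illustration of the idea, not how any particular search engine implements normalization.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def normalize(text):
    """Lowercase the text, split it into word tokens, and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

# Both queries normalize to the same tokens (roughly ['movi', 'review']),
# so they can be matched against the same documents.
print(normalize("movie reviews"))
print(normalize("movie review"))
```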
Built on top of these building blocks are things like query classification and intent. If the search engine determines that a particular query is time sensitive, it will return news results; if it thinks the query intent is transactional, it will display shopping results.
Finally, at the top of the pyramid are more abstract representations of the query, such as entity extraction or latent topic representations (LDA). Indeed, Google knows that the "philadelphia phillies" are a Major League Baseball team and, since it is baseball season, returns last night's score at the top of the search results (in addition to the knowledge graph on the right).
Document Models
Like query models, there are several different types of document models commonly used in search.
TF-IDF is one of the oldest and most well known approaches; it represents each query and document as a vector and uses some variant of the cosine similarity as the scoring function. A language model encodes some of the statistics of a language, capturing knowledge such as the fact that the phrase "search engine optimization" is much more common than "search engine walking." Language models are used heavily in machine translation and speech recognition, among other applications, and they are also extremely useful in information retrieval. Yet another class of models uses the probability ranking principle, which directly models the probability of relevance given the query and document. Of these, Okapi BM25 has been shown to be particularly effective.
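To make the last of those concrete, here is a minimal sketch of the classic Okapi BM25 scoring function in Python. The function name and the collection statistics it takes as arguments are my own assumptions for illustration; real systems layer many refinements on top of this basic formula.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freqs, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one document against one query with the classic Okapi BM25 formula.

    doc_freqs maps a term to the number of documents in the index containing
    it; num_docs and avg_doc_len describe the collection as a whole.
    """
    term_counts = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = doc_freqs.get(term, 0)
        # Inverse document frequency: rare terms contribute more to the score.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = term_counts[term]
        # Term frequency, saturated by k1 and normalized by document length.
        denom = tf + k1 * (1.0 - b + b * len(doc_tokens) / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / denom
    return score
```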
Correlation study
By now, you are probably wondering if search engines actually use any of these things, and if so, which ones are the most important. To explore this, we designed a correlation study similar to ones we have run in the past (see this for some background on the general approach). In this case, we collected the top 50 results from Google-US for about 14,000 keywords. This resulted in about 600,000 pages that we then crawled and used to compute a number of different similarity scores.
As you can see, the language model approach performed the best with a mean Spearman correlation of 0.10, consistent with results published in the research literature.
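For anyone curious how a number like that is produced, here is a rough sketch of computing a mean Spearman correlation across keywords with SciPy. The data layout is hypothetical; it simply pairs each result's Google position with its similarity score.

```python
from scipy.stats import spearmanr  # assumes SciPy is available

def mean_spearman(serps):
    """serps: one entry per keyword, each a list of (google_position, score)
    pairs for the top 50 results. Returns the mean Spearman correlation
    between the similarity score and search position across keywords."""
    correlations = []
    for results in serps:
        positions = [position for position, _ in results]
        scores = [score for _, score in results]
        # Higher scores should go with better (numerically lower) positions,
        # so negate the position before correlating.
        rho, _ = spearmanr([-p for p in positions], scores)
        correlations.append(rho)
    return sum(correlations) / len(correlations)
```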
If we do some stemming of both the query and document first and recompute, the correlations increase slightly across the board:
This suggests that Google is indeed doing some type of word normalization or stemming in their relevance calculation.
Relevance vs Ranking revisited
Comparing these correlations with Page Authority (an aggregate in-link metric in our Mozscape index) on the same data set, we see a substantial difference:
This begs the question: if these sophisticated similarity scores are so useful, why aren't the correlations higher? The answer lies in the conceptual relevance vs ranking split I discussed earlier.
To convince myself, I constructed an experiment as illustrated below:
To run the experiment, I first took 450 random pages from our dataset stratified across the top 50 results (so that they include nine #1 ranked pages, nine #2 ranked pages, etc.). Then I added the 450 random pages to the top 50 pages in each search result to make one group of 500 pages for each keyword. Since 50 of these pages are in the search result, and 450 are not, 10% of them are relevant to the keyword and 90% are not (the assumption here is that if the page appears in a Google search then it is relevant). Then for each keyword, I collected the Page Authority and Language Model similarity score and sorted by each (the tables in the middle).
Finally, I computed the Precision at 50, which is the percentage of the top 50 results sorted by PA/Language Model score that are actually in the search result. This directly measures the extent to which PA or the Language Model can separate relevant from irrelevant pages. Since 10% of the 500 documents are in the search result, we can achieve a 10% precision by randomly sorting them. This 10% precision is our baseline (bottom gray bars in the image).
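Here is a small sketch of the Precision at 50 calculation in Python; the variable names are my own, and it simply restates the procedure above in code.

```python
def precision_at_k(scored_pages, relevant_urls, k=50):
    """scored_pages: (url, score) pairs for all 500 candidate pages.
    relevant_urls: the 50 URLs that actually appear in the Google result.
    Returns the fraction of the top k pages, sorted by score, that are relevant."""
    relevant = set(relevant_urls)
    ranked = sorted(scored_pages, key=lambda pair: pair[1], reverse=True)
    top_k_urls = [url for url, _ in ranked[:k]]
    return sum(1 for url in top_k_urls if url in relevant) / k
```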
The results are striking. The PA precision is very close to the baseline, which says that it does no better than random at determining relevance, even though it does do a good job at ranking the top 50 once they are known to be relevant. On the other hand, the Language Model precision is close to 100%. Put another way, the Language Model is nearly perfect at determining which of the 500 pages are in the search result, but does a poor job of actually ranking those relevant documents.
Takeaways
This type of query-document similarity scoring is well established in the research literature and underlies every modern information retrieval system. As such, it is fundamental to search and is immune to algorithm changes.
Since search engines use sophisticated query and document models, there is no need to optimize separately for similar keywords. For example, any page targeting "movie reviews" will also target "movie review."
Finally, you can use the conceptual split between relevance and ranking in your workflow. When creating or modifying existing content, first concentrate on making the page relevant to a broad set of related keywords. Then concentrate on increasing the search position.
More Ranking Factors results coming soon
These are the first results we've released from the 2013 Ranking Factors project. As in years past, the project includes both an industry survey and a large correlation study. I'll be presenting the results at MozCon this year (so get your tickets if you haven't already!), and we'll be following it up with a full report sometime later this summer.
To dig deeper
Here are all the slides from my SMX Advanced talk:
I highly recommend the book Introduction to Information Retrieval by Manning et al. It is available to read for free online from their site and provides a comprehensive description of everything discussed in this post (and much, much more). In particular, see Chapters 2, 6, 11 and 12.
Thanks for reading. I look forward to continuing the discussion in the comments below!
Awesome post Matt, I've been thinking about this for a while - I was wondering what you thought about relevance in link building too? Is it important to stick tightly to your niche? And do you think links from diverse sources vs. niche sites could be a ranking factor?
Thanks!
Glad you liked it! Great question, I've wondered the same thing myself. Unfortunately I don't have any answers for you -- if anyone else has some thoughts on it I'd love to hear from you.
At Page One Power we have historically taken great pride in labeling ourselves a "Relevancy First Link Building Firm", and have experienced strong success in our endeavors. Stick tightly to your niche and build human relationships with active community members - don't focus on the links, focus on the people - and the links will come to you. Find domains where your link complements the site structure, meta tags, and page titles - but also benefits humans. These two things align more often than not when you keep relevancy as your main focus. We like to call this "Links for the Betterment of Mankind" because they offer true value to real people.
In my experience, there ARE diverse sources for potential links within any given niche. How do you define diverse sources?
Matt - This article rules. You are a scientist and a gentleman. When I woke up this morning, I did not realize I would be leaving the office with a practical understanding of term frequency-inverse document frequency!
I watched Page One Power's webinar on relevant link building and got some good tips from it.
"Since search engines use sophisticated query and document models, there is no need to optimize separately for similar keywords. For example, any page targeting "movie reviews" will also target "movie review.""
I think you may be incorrect with this takeaway. I still see vast ranking differences between plural-form and singular-form keyword queries. We spent 6 months building links for a plural-form keyword target and saw rankings shift from unranked to the top of page 2 (for the plural-form search), with zero ranking shift for the singular-form search.
I'm also finding this Nicholas. SERPs still differ greatly depending on plurality. That includes everything from "es" to "s". I'd hate to be the guy trying to rank for fungi and fungus.
Thanks for the observations. Spot checking a few SERPs, I do see that Google is returning different results for plural vs singular so you are correct that they don't stem everything down. As I think about this some more I'd speculate that they use some combination of stemming/other normalization and no stemming/no normalization and send everything through a machine learning algorithm (or at least that's what I would do if I were them :-)) I don't know how else to explain the difference in correlations for stemmed vs not stemmed otherwise. If it was random noise in the data then I wouldn't expect them all to increase, but rather some increase and some decrease.
Great Article...thanks. This is some stuff we have noticed (in our company) without the math behind it, of course...nice study!
Incredibly useful post Matt. I tip my Philadelphia Phillies hat to you.
Your post sort of explains why a site can have a high PA and yet not be well ranked, i.e. if a document's language isn't relevant, it will not get into the pile to get ranked. I see this quite a lot with legacy sites, e.g. government, non-profit agencies that have been around for a long time but have not optimized their web pages. Thanks for documenting - the less guesswork, the better!
Great post Matt, I was wondering the same thing mentioned above - what are your thoughts on relevance in link building?
Figuring out exact relevancy can be quite easy, but it can also be quite tough, and sometimes we get confused about how to determine it. Only after reading this post do I have some effective, standard rules for determining relevancy. This is a brilliant post and one of the most effective on this topic; I have read several articles on the subject, but this one is the most useful.
Thank you for including the book recommendation! I always like to see "further reading", especially for the more technical topics like this.
I like the detail in this graphic; it helps me understand SEO better.
Hi Matt,
Is there a connection between this subject and QDF?
Can you elaborate?
thanks!
Hi Matt, maybe you can explain your conclusion a bit more:
"Finally, you can use the conceptual split between relevance and ranking in your workflow. When creating or modifying existing content, first concentrate on making the page relevant to a broad set related keywords. Then concentrate on increasing the search position"
Isn't it generally better practice to have a page focused on one keyword than to make it relevant for broad phrases, since you can have separate pages for the other related keywords, and this would not dilute the other pages you are ranking for? A good example is having separate landing pages for each keyword versus trying to rank for all related keywords on, say, your homepage. Am I missing something?
Wow, this is great.
Deep analysis.
But I find many posts that come from a broad niche (for example, a health topic on a website that is about corruption, or a health topic on a diverse and very broad site) ranking very well (frankly, sitting on the first page).
And most of them come from domains with very high authority (such as Blogspot, filled with a bunch of ads). So I conclude that relevancy matters very little and domain authority may cover for it. Please correct me if I'm wrong.
Anyway, thank you so much for the post.
___________
Here is a sample query. As you can see, blogspot is the king [ :( or :) :D]
https://www.google.com/search?q=cara+menghilangkan+jerawat&pws=0&site=webhp&ei=M6W4UYulDcSUrgfmkoGQBQ&start=0&sa=N&biw=1511&bih=708
Very interesting and thorough post! I wish I was able to see your SMX Advanced talk!
Some very heavy information in there but it's pretty darn interesting too, especially the part on relevance and ranking - had wondered what was going on there.
Very well argued perspective on similarity | relevance | ranking. Thanks!
Thanks for the technical article. Got some deeper insights into how SE works.
Wow! Hard stuff but really interesting. I've always liked learning about how search engines work, specifically how they relate queries and documents, and this post gets deep into the topic.
I'm going to read the suggested book and keep learning more and more.
Thanks for the post, very interesting and useful topic.
Heavy stuff, but useful! Your link to the Information Retrieval book at the end didn't work; maybe their (Stanford) servers are getting stressed.
Just tried the link and it is working for me now. Looks like it was a transient error when you tried it earlier.
Hate to do this... it doesn't "beg the question" https://begthequestion.info/. Pet peeve [prepares to receive flaming death from above]
Great technical info on this (especially the Query Model). Thanks Matt
Hi Matt,
Thanks for sharing this information.
Again thanks
Erinsmith