A good search engine does not attempt to return the pages that best match the input query; a good search engine tries to answer the underlying question. Once you realize this, you'll understand why Google (and other search engines) use a complex algorithm to determine which results to return. The algorithm includes "hard factors" such as the number of backlinks to a page, and perhaps social recommendations through likes and +1s. These are usually external influences. There are also factors on the page itself: the way a page is built and various page elements play a role in the algorithm. Only by analyzing both the on-site and off-site factors can Google determine which pages will answer the question behind the query. To do that, Google has to analyze the text on a page.
In this article I will elaborate on the problems a search engine faces and some possible solutions. By the end of this article we won't have revealed Google's algorithm (unfortunately), but we'll be one step closer to understanding some of the advice we often give as SEOs. There will be some formulas, but do not panic: this article isn't just about those formulas. The article also comes with an Excel file. Oh, and the best thing: I will use some Dutch delights to illustrate the problems.
Behold: croquets are the elongated ones and bitterballen are the round ones ;-)
True OR False
Search engines have evolved tremendously in recent years, but at first they could only deal with Boolean operators. In simple terms, a term was either included in a document or not: something was true or false, 1 or 0. You could also use operators such as AND, OR and NOT to search for documents containing multiple terms or to exclude terms. This sounds fairly simple, but it has some problems. Suppose we have two documents, consisting of the following texts:
Doc1:
“And our restaurant in New York serves croquets and bitterballen.”
Doc2:
“In the Netherlands you retrieve croquets and frikandellen from the wall.”
Oops, almost forgot to show you the frikandellen ;-)
If we were to build a search engine, the first step is tokenization of the text. We want to be able to quickly determine which documents contain a term, which is easier if we put all the tokens in a database. A token is any single term in a text, so how many tokens does Doc1 contain?
The moment you started to answer this question for yourself, you probably thought about the definition of a "term". Actually, in the example "New York" should be recognized as a single term. How we can determine that those two individual words are actually one term is outside the scope of this article, so for the moment we treat each separate word as a separate token. That gives us 10 tokens in Doc1 and 11 tokens in Doc2. To avoid storing duplicate information in our database, we will store types rather than tokens.
Types are the unique tokens in a text. In the example, Doc1 contains the token "and" twice. In this example I ignore the fact that "and" appears once with and once without a capital. As with the determination of a term, there are techniques to determine whether something actually needs to be capitalized. Here, we assume that we can store it without a capital, so that "And" and "and" are the same type.
By storing all the types in the database along with the documents in which we can find them, we're able to search the database using Booleans. The search "croquets" will return both Doc1 and Doc2. The search "croquets AND bitterballen" will return only Doc1. The problem with this method is that you are likely to get too many or too few results. In addition, it lacks the ability to rank the results. If we want to improve our method, we have to determine what we can use other than the presence or absence of a term in a document. Which on-page factors would you use to rank the results if you were Google?
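As a minimal sketch, here is the tokenization and Boolean retrieval described above in Python, using the two example documents. The tokenizer (lowercase, strip punctuation, split on whitespace) is deliberately simplistic; real engines do far more.

```python
import string

docs = {
    "Doc1": "And our restaurant in New York serves croquets and bitterballen.",
    "Doc2": "In the Netherlands you retrieve croquets and frikandellen from the wall.",
}

def tokenize(text):
    # Simplistic: lowercase, strip punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Inverted index: type -> set of documents containing it.
# Wrapping the tokens in set() keeps only the types.
index = {}
for doc_id, text in docs.items():
    for token in set(tokenize(text)):
        index.setdefault(token, set()).add(doc_id)

def search_and(*terms):
    # Boolean AND: intersect the posting sets of all terms.
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search_and("croquets"))                  # both Doc1 and Doc2
print(search_and("croquets", "bitterballen"))  # only Doc1
```

Note that the index stores types, not tokens: Doc1 has 10 tokens but only 9 types, because "And" and "and" collapse into one entry.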
Zone Indexes
A relatively simple method is to use zone indexes. A web page can be divided into different zones: think of the title, description, author and body. By adding a weight to each zone in a document, we're able to calculate a simple score for each document. This is one of the first on-page methods search engines used to determine the subject of a page. Scoring with zone indexes works as follows:
Suppose we add the following weights to each zone:
| Zone | Weight |
| --- | --- |
| title | 0.4 |
| description | 0.1 |
| content | 0.5 |
We perform the following search query:
“croquets AND bitterballen”
And we have a document with the following zones:
| Zone | Content | Boolean | Score |
| --- | --- | --- | --- |
| title | New York Café | 0 | 0 |
| description | Café with delicious croquets and bitterballen | 1 | 0.1 |
| content | Our restaurant in New York serves croquets and bitterballen | 1 | 0.5 |
| **Total** | | | **0.6** |
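The same calculation can be sketched in a few lines of Python. The zone weights and the document are the ones from the example; treating each zone as a simple Boolean match on all query terms is the assumption here.

```python
weights = {"title": 0.4, "description": 0.1, "content": 0.5}

doc = {
    "title": "New York Cafe",
    "description": "Cafe with delicious croquets and bitterballen",
    "content": "Our restaurant in New York serves croquets and bitterballen",
}

def zone_score(document, query_terms):
    # Add a zone's weight if every query term occurs in that zone.
    score = 0.0
    for zone, text in document.items():
        tokens = text.lower().split()
        if all(term in tokens for term in query_terms):
            score += weights[zone]
    return score

print(zone_score(doc, ["croquets", "bitterballen"]))  # 0.6
```

The title scores zero because neither query term appears in "New York Cafe"; a query like "new york" would score 0.9 instead (title plus content, no score for the description).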
Because at some point everyone started abusing the weight assigned to, for example, the description, it became more important for Google to split the body into different zones and assign a different weight to each individual zone within the body.
This is quite difficult, because the web contains a variety of documents with different structures. A machine can interpret an XML document quite easily, but interpreting an HTML document is harder: the structure and tags are much more limited, which makes the analysis more difficult. Of course there will be HTML5 in the near future, and Google supports microformats, but these still have their limitations. For example, if you knew that Google assigns more weight to content within a <content> tag and less to content within a <footer> tag, you'd never use the <footer> tag.
To determine the context of a page, Google has to divide the page into blocks. This way Google can judge which blocks on a page are important and which are not. One of the methods that can be used is the text-to-code ratio. A block that contains much more text than HTML code probably holds the main content of the page. A block that contains many links and much HTML code but little content is probably the menu. This is why choosing the right WYSIWYG editor is very important: some of these editors add a lot of unnecessary HTML code.
The text-to-code ratio is just one of the methods a search engine can use to divide a page into blocks. Bill Slawski talked about identifying blocks earlier this year.
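As an illustration, here is a rough sketch of a text-to-code ratio check using Python's built-in HTML parser. The two snippets are made up, and equating "code" with "all non-visible characters" is a simplification; a real block-segmentation algorithm is far more involved.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the visible text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def text_code_ratio(html):
    # Ratio of visible-text characters to total characters in the fragment.
    parser = TextExtractor()
    parser.feed(html)
    visible = "".join(parser.chunks)
    return len(visible) / len(html) if html else 0.0

menu = '<ul><li><a href="/">Home</a></li><li><a href="/menu">Menu</a></li></ul>'
body = "<p>Our restaurant in New York serves croquets and bitterballen.</p>"

print(round(text_code_ratio(menu), 2))  # low ratio: mostly markup, probably navigation
print(round(text_code_ratio(body), 2))  # high ratio: mostly text, probably main content
```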
The advantage of the zone-index method is that you can calculate a score for each document quite simply. A disadvantage, of course, is that many documents can end up with the same score.
Term frequency
When I asked you to think of on-page factors you would use to determine the relevance of a document, you probably thought about the frequency of the query terms. It is a logical step to give more weight to documents that use the search terms more often.
Some SEO agencies stick to the story that you should use your keywords at a certain percentage of the text. We all know that isn't true, but let me show you why. I'll try to explain it using the following examples. Some formulas will come up, but as I said, it is the outline of the story that matters.
The numbers in the table below are the number of occurrences of each word in a document (also called the term frequency, or tf). So which document has the better score for the query "croquets and bitterballen"?
| | croquets | and | café | bitterballen | Amsterdam | ... |
| --- | --- | --- | --- | --- | --- | --- |
| Doc1 | 8 | 10 | 3 | 2 | 0 | |
| Doc2 | 1 | 20 | 3 | 9 | 2 | |
| DocN | ... | ... | ... | ... | ... | |
| Query | 1 | 1 | 0 | 1 | 0 | |
The score for both documents would be as follows:
score(“croquets and bitterballen”, Doc1) = 8 + 10 + 2 = 20
score(“croquets and bitterballen”, Doc2) = 1 + 20 + 9 = 30
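In Python, using the counts from the table, that scoring is just a sum of term frequencies:

```python
# Term frequencies per document, copied from the table above.
tf = {
    "Doc1": {"croquets": 8, "and": 10, "cafe": 3, "bitterballen": 2, "amsterdam": 0},
    "Doc2": {"croquets": 1, "and": 20, "cafe": 3, "bitterballen": 9, "amsterdam": 2},
}

def tf_score(doc_id, query_terms):
    # Score = sum of the query terms' frequencies in the document.
    return sum(tf[doc_id].get(term, 0) for term in query_terms)

query = ["croquets", "and", "bitterballen"]
print(tf_score("Doc1", query))  # 20
print(tf_score("Doc2", query))  # 30
```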
In this case Document 2 is more closely related to the query. Here the term "and" contributes the most weight, but is that fair? It is a stop word, and we'd like to give it only a little value. We can achieve this by using the inverse document frequency (idf), which is the opposite of the document frequency (df). The document frequency is the number of documents in which a term occurs; the inverse document frequency is, well, the opposite. As the number of documents in which a term occurs grows, its idf shrinks.
You can calculate the idf by dividing the total number of documents in your corpus by the number of documents containing the term, and then taking the logarithm of that quotient: idf(t) = log( N / df(t) ).
Suppose the idf values of our query terms are as follows:
idf(croquets) = 5
idf(and) = 0.01
idf(bitterballen) = 2
Then you get the following scores:
score(“croquets and bitterballen”, Doc1) = 8*5 + 10*0.01 + 2*2 = 44.1
score(“croquets and bitterballen”, Doc2) = 1*5 + 20*0.01 + 9*2 = 23.2
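The same scoring with idf weights applied, as a sketch (the idf values are the ones assumed above, not computed from a real corpus):

```python
idf = {"croquets": 5, "and": 0.01, "bitterballen": 2}

tf = {
    "Doc1": {"croquets": 8, "and": 10, "bitterballen": 2},
    "Doc2": {"croquets": 1, "and": 20, "bitterballen": 9},
}

def tfidf_score(doc_id, query_terms):
    # Each term now contributes tf * idf instead of raw tf.
    return sum(tf[doc_id].get(t, 0) * idf.get(t, 0) for t in query_terms)

query = ["croquets", "and", "bitterballen"]
print(round(tfidf_score("Doc1", query), 1))  # 44.1
print(round(tfidf_score("Doc2", query), 1))  # 23.2
```

The stop word "and" now contributes almost nothing to either score, so Doc1 wins.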
Now Doc1 has the better score. But we haven't taken document length into account. One document can contain much more content than another without being more relevant. A long document gains a higher score quite easily with this method.
Vector model
We can solve this by looking at the cosine similarity between a document and the query. An exact explanation of the theory behind this method is outside the scope of this article, but you can think of it as a kind of harmonic mean between the query terms and the document. I made an Excel file so you can play with it yourself; there is an explanation in the file itself. You need the following metrics:
- Query terms - each separate term in the query.
- Document frequency - how many documents does Google know of that contain the term?
- Term frequency - the frequency of each separate query term in the document (add the Focus Keyword widget made by Sander Tamaëla to your bookmarks; it's very helpful for this part)
Here's an example where I actually used the model. The website had a page designed to rank for "fiets kopen", which is Dutch for "buying a bike". The problem was that the wrong page (the homepage) was ranking for the query.
For the formula we include the previously mentioned inverse document frequency (idf), for which we need the total number of documents in Google's index. Here we assume N = 10.4 billion.
An explanation of the table below:
- tf = term frequency
- df = document frequency
- idf = inverse document frequency
- Wt,q = weight for the term in the query
- Wt,d = weight for the term in the document
- Wf = tf * tf (an intermediate value used to normalize the document weight: Wt,d = sqrt(Wf / sum of Wf over all terms))
- Product = Wt,q * Wt,d
- Score = sum of the products
The main page, which was ranking: https://www.fietsentoko.nl/
| term | tf (q) | df | idf | Wt,q | tf (d) | Wf | Wt,d | Product |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fiets | 1 | 25,500,000 | 3.610493159 | 3.610493159 | 21 | 441 | 0.70711 | 2.55302 |
| Kopen | 1 | 118,000,000 | 2.945151332 | 2.945151332 | 21 | 441 | 0.70711 | 2.08258 |
| **Score** | | | | | | | | **4.6356** |
The page I wanted to rank: https://www.fietsentoko.nl/fietsen/
| term | tf (q) | df | idf | Wt,q | tf (d) | Wf | Wt,d | Product |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fiets | 1 | 25,500,000 | 3.610493159 | 3.610493159 | 22 | 484 | 0.61782 | 2.23063 |
| Kopen | 1 | 118,000,000 | 2.945151332 | 2.945151332 | 28 | 784 | 0.78631 | 2.31584 |
| **Score** | | | | | | | | **4.54647** |
Although the second document contains the query terms more often, its score for the query was lower (higher is better). This was because of the lack of balance between the query terms. Following this calculation, I changed the text on the page: I increased the use of the term "fietsen" and decreased the use of "kopen", which is a more generic term in the search engine and therefore carries less weight. This changed the score as follows:
| term | tf (q) | df | idf | Wt,q | tf (d) | Wf | Wt,d | Product |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fiets | 1 | 25,500,000 | 3.610493159 | 3.610493159 | 28 | 784 | 0.78631 | 2.83897 |
| Kopen | 1 | 118,000,000 | 2.945151332 | 2.945151332 | 22 | 484 | 0.61782 | 1.81960 |
| **Score** | | | | | | | | **4.6586** |
After a few days, Google crawled the page, and the document I had changed started to rank for the term. We can conclude that the number of times you use a term is not necessarily important; what matters is finding the right balance between the terms you want to rank for.
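The scoring behind the three tables above can be sketched as follows: the query weight of a term is its idf, the document weight is the cosine-normalized term frequency (tf divided by the square root of the sum of squared tfs, which is where the Wf = tf² column comes from), and the score is the dot product of the two. The idf values are taken from the tables rather than computed:

```python
import math

def doc_weights(tf_counts):
    # Wt,d = tf / sqrt(sum of tf^2)  (cosine length-normalization)
    norm = math.sqrt(sum(tf * tf for tf in tf_counts.values()))
    return {t: tf / norm for t, tf in tf_counts.items()}

def score(query_idf, tf_counts):
    # Dot product of query weights (idf) and normalized document weights.
    w_d = doc_weights(tf_counts)
    return sum(idf * w_d.get(t, 0.0) for t, idf in query_idf.items())

# idf values as listed in the tables above
query_idf = {"fiets": 3.610493159, "kopen": 2.945151332}

home     = {"fiets": 21, "kopen": 21}  # the homepage, which was ranking
old_page = {"fiets": 22, "kopen": 28}  # the target page before the edit
new_page = {"fiets": 28, "kopen": 22}  # the target page after the edit

for name, counts in [("home", home), ("old", old_page), ("new", new_page)]:
    print(name, round(score(query_idf, counts), 4))
```

The printed scores match the tables (up to rounding): the rebalanced page now outscores the homepage even though the total number of keyword occurrences is unchanged.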
Speed up the process
Performing this calculation for every document that matches the search query costs a lot of processing power. You can fix this by using some static values to determine which documents to calculate the score for. PageRank, for example, is a good static value. If you first calculate the score for the pages that match the query and also have a high PageRank, you have a good chance of finding documents that would end up in the top 10 of the results anyway.
Another possibility is the use of champion lists. For each term, take only the top N documents with the best score for that term. For a multi-term query, you can then intersect those lists to find documents that contain all the query terms and probably score highly. Only if too few documents contain all the terms do you need to search all documents. So you're not going to rank just by having the best vector score; you have to have your static scores right as well.
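A sketch of champion lists (the postings weights here are made-up numbers, purely for illustration):

```python
CHAMPIONS = 2  # keep only the top 2 documents per term

# term -> {document: score for that term}; hypothetical precomputed values
postings = {
    "croquets":     {"Doc1": 0.9, "Doc2": 0.4, "Doc3": 0.7, "Doc4": 0.1},
    "bitterballen": {"Doc1": 0.8, "Doc3": 0.5, "Doc4": 0.6},
}

# Precompute each term's champion list: its top-N documents by score.
champion_lists = {
    term: set(sorted(docs, key=docs.get, reverse=True)[:CHAMPIONS])
    for term, docs in postings.items()
}

def candidates(query_terms):
    # Intersect the champion lists; if the result is too small, a real
    # engine would fall back to the full postings lists.
    lists = [champion_lists.get(t, set()) for t in query_terms]
    return set.intersection(*lists) if lists else set()

print(candidates(["croquets", "bitterballen"]))  # {'Doc1'}
```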
Relevance feedback
Relevance feedback means assigning more or less value to a term in a query, based on the relevance of a document. Using relevance feedback, a search engine can change the user's query without telling the user.
The first step is to determine whether a document is relevant or not. Although there are search engines where you can specify whether a result is relevant, for a long time Google had no such function. Their first attempt was adding the favorite star to the search results; now they are trying it with the Google+ button. If enough people start pushing the button on a certain result, Google will start considering the document relevant for that query.
Another method is to look at the pages that currently rank well; these can be considered relevant. The danger of this method is topic drift: if you're looking for croquets and bitterballen, and the best-ranking pages are all snack bars in Amsterdam, the danger is that you assign value to Amsterdam and end up with only snack bars in Amsterdam in the results.
Another way is simply to use data mining. Google can look at the CTR of different pages: pages with a higher-than-average CTR and a lower-than-average bounce rate can be considered relevant, while pages with a very high bounce rate will be considered irrelevant.
An example of how we can use this data to adjust the query term weights is Rocchio's feedback formula. It comes down to adjusting the value of each term in the query and possibly adding additional query terms. The formula is as follows:

q_new = alpha * q + beta * (average of the relevant document vectors) - gamma * (average of the non-relevant document vectors)

where q is the original query vector.
The table below is a visual representation of this formula. Suppose we apply the following values:
- Query terms: +1 (alpha)
- Relevant terms: +1 (beta)
- Irrelevant terms: -0.5 (gamma)
We have the following query:
“croquets and bitterballen”
The relevance of the following documents is as follows:
Doc1 : relevant
Doc2 : relevant
Doc3 : not relevant
| Terms | Q | Doc1 | Doc2 | Doc3 | Weight in new query |
| --- | --- | --- | --- | --- | --- |
| croquets | 1 | 1 | 1 | 0 | 1 + 1 - 0 = 2 |
| and | 1 | 1 | 0 | 1 | 1 + 0.5 - 0.5 = 1 |
| bitterballen | 1 | 0 | 0 | 0 | 1 + 0 - 0 = 1 |
| café | 0 | 0 | 1 | 0 | 0 + 0.5 - 0 = 0.5 |
| Amsterdam | 0 | 0 | 0 | 1 | 0 + 0 - 0.5 = -0.5, clipped to 0 |
The new query is as follows:
croquets(2) and(1) bitterballen(1) cafe(0.5)
The value for each term is the weight it gets in the new query; we can use those weights in our vector calculations. Although the term Amsterdam was given a score of -0.5, we adjust negative values back to 0 so that we do not exclude terms from the search results. And although café did not appear in the original query, it was added and given a weight in the new query.
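The whole feedback step can be sketched in Python with the same alpha, beta and gamma values and the same three documents; terms are represented as simple presence/absence vectors, as in the table.

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5

query = {"croquets": 1, "and": 1, "bitterballen": 1}

doc1 = {"croquets": 1, "and": 1}    # relevant
doc2 = {"croquets": 1, "cafe": 1}   # relevant
doc3 = {"and": 1, "amsterdam": 1}   # not relevant

relevant, irrelevant = [doc1, doc2], [doc3]

def rocchio(query, relevant, irrelevant):
    # Collect every term seen in the query or any judged document.
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        rel = sum(d.get(t, 0) for d in relevant) / len(relevant)
        irr = sum(d.get(t, 0) for d in irrelevant) / len(irrelevant)
        w = ALPHA * query.get(t, 0) + BETA * rel - GAMMA * irr
        new_query[t] = max(w, 0.0)  # clip negative weights back to 0
    return new_query

for term, w in sorted(rocchio(query, relevant, irrelevant).items()):
    print(term, w)
# amsterdam 0.0, and 1.0, bitterballen 1.0, cafe 0.5, croquets 2.0
```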
If Google uses this kind of relevance feedback, you could look at the pages that already rank for a particular query. By using the same vocabulary, you can ensure that you get the most out of this form of relevance feedback.
Takeaways
In short, we've considered one of the options for assigning a value to a document based on the content of the page. Although the vector model is fairly accurate, it is certainly not the only method of calculating relevance; there are many adjustments to the model, and it remains only one part of the complete algorithm of search engines like Google. We have taken a look at relevance feedback as well. *cough* Panda *cough*. I hope I've given you some insight into the methods a search engine can use other than external factors. Now it's time to discuss this, and to go play with the Excel file :-)
This needs to be promoted to the main blog!
Good article. Thanks for explaining in terms a math-impaired mind can understand. Actually, I only understood 54.678% of it, but that's good for an ADD reader.
I wish I'd paid more attention in calculus and statistics 101 but thankfully, you stayed awake.
Will re-read once brain hemorrhage stops bleeding from my ears.
PL
Great to have these concepts explained in simple terms - Thanks.
It is sometimes easy to forget that there is a big gap between understanding what is good SEO practice and understanding why it works.
I can see this post being very useful to those who are new to SEO and want to try to understand what is going on behind the search query.
First of all I would like to thank you for the Excel file, which saves me time when calculating vector space scores; that's totally awesome. It gave me genuinely useful scores and an idea about my pages. And thanks for the easy-to-follow, readable post about search engine algorithm basics; it's quite helpful for understanding search engines at the root level. Hats off to you "rolfbroer" for your hard work. By the way, Dutch delights are delicious!
Thanks once again.
really nice explanation. thank you for this input.
Very detailed information. I think it should be on the main blog.
Nice post Rolf.
I was actually skeptical when I first saw the headline of this article, thinking "oh great, another SEO trying to get into the actual mathematics behind IR and algorithms," but you've actually done a good job.
Have you read "Google's PageRank and Beyond: The Science of Search Engine Rankings?" It dives into a lot of the math behind PageRank and the linear algebra behind algorithms. It's a good read, I highly recommend it.
Thnx! Just added it to my reading list :)
Great article Rolf,
It shows you've put a lot of time and effort into explaining how it actually works. Asking someone to explain SEO will often result in a briefer and more refutable explanation. Your post offers great leverage in creating an irrefutable and analytic explanation.
Thank you for that!
Kudos from Groningen (the Netherlands) for both a great article and the great analogies / testsubjects. Bitterballen and bicycles, the building blocks and cornerstones of modern Dutch society ; )
Dear author, I would like to thank you for your article about Search Engine Algorithm Basics. I'm Yousef,
a master's student at Al-Quds University. I'm working on a search engine that can match suitable jobs to every person based on their profile. I would like to ask what the best search method is, and how the vector space model and relevance feedback apply. I need your help with relevance feedback search, with a full example.
Thanks
Thank you very much for useful information
Looks confusing but after re-reading your blog post a second time, makes total sense. I suggest if anyone else didn't grasp this after the first read, just give it a second go, it'll clear up a lot of questions you may have. Thanks for sharing :)
Great post, Rolf. A much-needed education on some of the underlying algorithmic principles at the foundation of search. Looking forward to learning some of the deeper concepts, as well.
The Vector Space Model can be explained by harking back to that old bit of maths that most people will have heard: the square of the hypotenuse is equal to the sum of the squares of the other two sides
a^2 = b^2 + c^2
So if you have two points in x,y format, you can work out the distance between those points.
For example, back in the caveman days when there were only two words "ug" and "um", documents were pretty simple, such as "ug ug ug um" and "ug um um um ug um". Those two documents could be represented using 2 dimensional vectors, each vector representing the number of times a word occurs: d1(3,1) and d2(2,4). Say you really wanted to find out about "um um um", so this would be represented as a third vector q(0,3). We can then work out the differences between q(0,3) and d1(3,1) to calculate the hypotenuse - 3.6 in this case - and the same between q and d2 - which is 2.2. d2 is therefore closest, and could be considered as most relevant, to the query.
The clever bit is this doesn't just work in two dimensions, but multiple dimensions, indeed as many dimensions as there are words (and it works much better when there are lots of dimensions).
That's the basics; it then gets more complex as dimensions are warped, dropped and added by modifiers such as idf to down-weight common words.
Interesting article, but it falls somewhat short on content relevancy. Before reading it I thought it would cover all aspects of how search engines work, but you covered only the on-page factors, limited to content. I wish you (or someone else) would publish on how search engines work with regard to link building, and how to deal with various off-page channels such as articles, press releases, bookmarking, etc. If someone has extensive experience or knowledge of these, I'd like to request that they share it.
Because the query (the terms we were looking for) is: croquets AND bitterballen
Neither of those terms are in the Title of the example.
If the query was "New York", then the document would have scored 0.9 (just no score for the description)
My boss just recommended that to me. Well then, let's get to it.
Thanks for the spreadsheet and helping to reveal a few of the gears that operate in the search engines.
what does Wf stand for in the vector model and how does it work??
According to the table,
Wf = tf * tf ; and..
Wt,d = sqrt ( Wf <for each term> / <sigma over all terms>Wf<for each term> ).
Hi Rolf,
Nice post. I really like to read stuff like that!
Do you actually score the pages of your clients, or do you do it as a learning exercise that then influences how you optimize pages?
I just use it as a signal; remember it's just a small part of a very complicated algorithm. So just using the score and adjusting the page to optimize that score won't help in most cases. You have to look at all the metrics.
But if you can explain this to the category managers of your client, they gain a better understanding of how a search engine treats their content (and that keyword stuffing ain't good) ;)
I agree it's not the only metric but I like to get the most out of it and sometimes just that and internal linking achieve a lot! Especially for long tail and less competitive keywords.
With clients I often start by optimizing pages and wait for indexing to see what I can gain just from that. With time you get better and less links are required to get to the top.
Why does the title "New York Cafe" score a zero in your first example?
Interesting post that tries to explain the complex behind-the-scenes of search in simple terms. For those who are interested to know more, I recommend a book titled "Introduction to Information Retrieval" authored by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze available online for FREE at https://nlp.stanford.edu/IR-book/ :) Thanks for the post!
A good read, but I'm having trouble with the Excel file: I get name errors in cells B2, B3, B4
Weird. What version are you using? Maybe I can reproduce and fix it. I made it in Excel on a Mac (Dutch version), so I'm not sure it works in all versions.
No problem with cells B2...B4 using Excel 2007 in compatibility mode on Win XP Pro.
I have 2003 because of legacy apps. I may address the legacy issue and upgrade; I have been meaning to do so for a while.
The problem lies in the fact that the cells I11...I15, I20...I24, and I29...I33 use the IFERROR function, which did not exist in Excel 2003.
The sample spreadsheet presented here is designed to accommodate up to 5 query terms, but is populated with only 3, so that 2 rows of each of the above-named sets of 5 rows have no input values and thus yield errors, the end results of which IFERROR forces to the numeric value 0 (zero.)
Under Excel 2003, the error condition is not so handled, such that cells I16, I25, and I34, which are the cells whose values populate B2...B4, end up containing the error #DIV/0! (Divide by zero.)
There are several techniques that serve as workarounds. Here, the easiest is to populate the unneeded data with insignificantly small values.
If you populate the spreadsheet with data for 5 Query Terms, you should find that all is well. As a quick test, just copy the contents of cells F4...I4 into F5...I5 and F6...I6. Assuming that that goes well, then, fill unused cells as follows:
You should then be set to handle any number of Query Terms from 1 to 5, simply by setting the cells in columns F through I to 1, 0, 0, and 0, respectively, for any unused row.
As for your legacy apps, I can't speak for Excel 2010, but can tell you that I'm running several that were developed for 2003, and are running quite nicely on Excel 2007 in compatibility mode with no problem.
Great post, I love when things get more technical than "get links and make good content".
Even if off page metrics like links & authority and social signals are gaining more and more weight, it's always good to remember the basics about on page factors and their importance in the equation.
Thanks for the knowledge, good to see some mathy posts getting published!
Love your explanation there, using the analogy of takeaway snacks. Must admit I started to glaze over the first time I saw the algebra, but it wasn't as bad the second time round!
not bad at all. well done
Awesome post!
Thanks for the explanation about wordfrequencies and weighting the different parts of a page! Good job!
What are the outcomes (scores) for the website fietsentoko.nl when you assume today's Google index size (8 Dec 2011: 51 billion web pages according to worldwidewebsize.com) instead of the 10.4 billion pages you used?
I'd love to do it myself, but I'm also having problems with the Excel file (Excel 2003, cells B2 to B4).
If I use the 51billion the figures are like this:
5.6121072 (home page, didn't want this one to rank)
5.5160773 (target page - old version)
5.6281853 (target page - new version)
old difference: 0.0230343
new difference: 0.0160780
Still haven't been able to reproduce the error, which version are you using?
P.S. sorry for all the new lines in this post, but the system doesn't allow me to use single enters.
Thanks for your fast reply!
I'm using version Excel 2003 (Dutch) With SP3.
Awesome content, thanks for the share... Rocchio's feedback formula made my brain hurt.
Hi Rolf,
really nice post and very well explained. Maybe you would like to call it Search Engine Algorithms for Dummies?
The only bad thing about the post is that you used "kroketjes en bitterballen", because I miss them even after 30 years living abroad: they are not easy to get here in Spain :-)
A great way to explain how search engines operate. Optimization for search engines and Operation of search engines are two different methods all under one related matter. The concept is similar to knowing how use software on a computer and knowing how a computer works.
Ill probably have to read over the last part of this to fully comprehend.
Thanks for posting!
Makes sense, but really, just seeing the math in a few parts makes me cringe more than seeing Katy Perry matching Michael Jackson in number 1 singles. Anyway, the thing is, with layouts changing all the time and the way Google interprets the weight of the keywords, how long until they change the current weights? Obviously people can go ahead and sculpt a page, but really it doesn't help improve content, and the whole point of a good search engine is to deliver good content, not content weighted for keywords.
In the meantime I'm waiting for something that really changes the search results by including social media more in the algorithm.
Please don't go blind on the maths in my post. Of course they can change the weights. The vector space model in my post was first described back in 1975, so it's not the most advanced metric you can think of. Although the algorithms get more complex, they still want to answer the user's question (query) with the best results. They need content for that, and one of the metrics they can use is the vector space score in my post.
Even if page A has well-weighted keywords in its content (don't forget things like synonyms), Google will still look at other metrics like links and social media to determine whether the content of page A is really that good. But if page B's content has no signals that it matches your query, it simply won't rank for it.
So just see the post as an introduction to simple search engine algorithms. I hope it helps to understand some of the problems a search engine faces. I agree the vector space model doesn't help you make better content for users, but it might help the search engine interpret your content the right way ;)
Have to admit that my brain started to hurt about halfway through, but I persisted, even when I got to "we adjust negative values back to 0"... all I can say is thank heavens for smart guys like you. I am just not brave enough to go and play with the spreadsheet :)
I don't think stop words are relevant for watering down the value of key phrases any more. Search engines are good enough to understand the difference between "a room with a view" and "a room for a view", what with latent semantic indexing and all. I think it was news from 2008 or something that stop words are dead. Or am I missing something here?
Not sure what you're trying to say, but:
If you've read my comment to kateG1298, you know this algorithm was first described back in 1975, so stop words have been dead for the last 35 years, not just since 2008. Maybe Google had a stoplist in the first year(s) because of the lack of server power, but this post isn't really about stop words/stoplists. This post just gives some very basic insights into how a very basic search engine can treat a few of the problems mentioned. But then again, I'm not sure what you're trying to say :)