In my last column, I wrote about how to use term frequency analysis in evaluating your content vs. the competition's. Term frequency (TF) is only one part of the TF-IDF approach to information retrieval. The other part is inverse document frequency (IDF), which is what I plan to discuss today.
Today's post will use an explanation of how IDF works to show you the importance of creating content that has true uniqueness. There are reputation and visibility reasons for doing this, and it's great for users, but there are also SEO benefits.
If you wonder why I am focusing on TF-IDF, consider these words from a Google article from August 2014: "This is the idea of the famous TF-IDF, long used to index web pages." While the way that Google may apply these concepts is far more sophisticated than the simple TF-IDF models I am discussing, we can still learn a lot from understanding the basics of how they work.
What is inverse document frequency?
In simple terms, it's a measure of the rareness of a term. Conceptually, we start by measuring document frequency. It's easiest to illustrate with an example, as follows:
In this example, we see that the word "a" appears in every document in the document set. What this tells us is that it provides no value in telling the documents apart. It's in everything.
Now look at the word "mobilegeddon." It appears in 1,000 of the documents, or one thousandth of one percent of them. Clearly, this term provides a great deal more differentiation for the documents that contain it.
Document frequency measures commonness, and we prefer to measure rareness. The classic way that this is done is with a formula that looks like this:

IDF(term) = log10(total number of documents / number of documents containing the term)
For each term we are looking at, we take the total number of documents in the document set and divide it by the number of documents containing our term. This gives us more of a measure of rareness. However, we don't want the resulting calculation to say that the word "mobilegeddon" is 1,000 times more important in distinguishing a document than the word "boat," as that is too big of a scaling factor.
This is the reason we take the log base 10 of the result, to dampen that calculation. For those of you who are not mathematicians, you can loosely think of the log base 10 of a number as being a count of the number of zeros, i.e., the log base 10 of 1,000,000 is 6, and the log base 10 of 1,000 is 3. So instead of saying that the word "mobilegeddon" is 1,000 times more important, this type of calculation scores it just 3 points higher than "boat" (the log base 10 of 1,000), which is more in line with what makes sense from a search engine perspective.
With this in mind, here are the IDF values for the terms we looked at before:
Now you can see that we are providing the highest score to the term that is the rarest.
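The calculation described above can be sketched in a few lines of Python. The document counts below are assumptions inferred from the figures quoted in the text: 1,000 documents being one thousandth of one percent implies a set of 100 million documents, and the 1,000-to-1 gap implies that "boat" appears in about 1 million of them. They are not the original table, and `idf` is simply a name chosen here for illustration.

```python
import math

# Assumed counts, inferred from the figures quoted in the text;
# they are illustrative, not the author's original table.
TOTAL_DOCS = 100_000_000
doc_freq = {
    "a": 100_000_000,       # appears in every document
    "boat": 1_000_000,      # assumed: 1,000x more common than "mobilegeddon"
    "mobilegeddon": 1_000,  # one thousandth of one percent of the set
}

def idf(term_doc_count, total_docs=TOTAL_DOCS):
    """Inverse document frequency: log10(total docs / docs containing the term)."""
    return math.log10(total_docs / term_doc_count)

scores = {term: idf(count) for term, count in doc_freq.items()}
for term, score in scores.items():
    print(f"{term:>12}: {score:.1f}")
```

Under these assumed counts, "a" scores 0, "boat" scores 2, and "mobilegeddon" scores 5, so the rarest term gets the highest value.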
What does the concept of IDF teach us?
Think about IDF as a measure of uniqueness. It helps search engines identify what it is that makes a given document special. This needs to be much more sophisticated than how often you use a given search term (e.g. keyword density).
Think of it this way: If you are one of 6.78 million web sites that come up for the search query "super bowl 2015," you are dealing with a crowded playing field. Your chances of ranking for this term based on the quality of your content are pretty much zero.
Overall link authority and other signals will be the only way you can rank for a term that competitive. If you are a new site on the landscape, well, perhaps you should chase something else.
That leaves us with the question of what you should target. How about something unique? Even the addition of a simple word like "predictions"—changing our phrase to "super bowl 2015 predictions"—reduces this playing field to 17,800 results.
Clearly, this is dramatically less competitive already. Slicing into this further, the phrase "super bowl 2015 predictions and odds" returns only 26 pages in Google. See where this is going?
What IDF teaches us is the importance of uniqueness in the content we create. Yes, ranking for a rarer phrase will not pay you nearly as much as ranking for the big head term would, but if your business is a new entrant into a very crowded space, you are not going to rank for the big head term anyway.
If you can pick out a smaller number of terms with much less competition and create content around those needs, you can start to rank for these terms and get money flowing into your business. This is because you are making your content more unique by using rarer combinations of terms (leveraging what IDF teaches us).
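To put the result counts quoted above on the same log scale, here is a rough sketch. It treats Google's reported result counts as a stand-in for document frequency, which is an assumption made purely for illustration (those counts are estimates), and `narrowing` is a hypothetical helper name.

```python
import math

# Result counts quoted in the article for each phrase.
result_counts = {
    "super bowl 2015": 6_780_000,
    "super bowl 2015 predictions": 17_800,
    "super bowl 2015 predictions and odds": 26,
}

def narrowing(head_count, phrase_count):
    """Log10 of how many times smaller the field is than for the head term."""
    return math.log10(head_count / phrase_count)

head = result_counts["super bowl 2015"]
rarity = {phrase: narrowing(head, count) for phrase, count in result_counts.items()}
for phrase, value in rarity.items():
    print(f"{value:5.2f}  {phrase}")
```

Each extra unit means roughly ten times fewer competing pages: adding "predictions" shrinks the field by more than two orders of magnitude, and the longest phrase shrinks it by more than five.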
Summary
People who do keyword analysis are often wired to pursue the major head terms directly, simply based on the available keyword search volume. The result from this approach can, in fact, be pretty dismal.
Understanding how inverse document frequency works helps us understand the importance of standing out. Creating content that brings unique angles to the table is often a very potent way to get your SEO strategy kick-started.
Of course, the reasons for creating content that is highly differentiated and unique go far beyond SEO. This is good for your users, and it's good for your reputation, visibility, AND also your SEO.
Hi Slavo,
Thanks for your question. What I was trying to say was that it's not just about slapping the words on the page; you also need to address the user's needs directly related to those words.
An example might help. If you have a page about lamps, and you decided that "ornate lead glass lamps" was a rare term, you could add it to the page, but you should only do so if you sell those types of lamps.
Let me know if that makes sense.
Thanks for the reply, Eric. Totally agree. My point was this: instead of just adding more long tail phrases (option a, which most people use but which brings no results) or creating content that specifically targets those phrases (option b), a better option would be somewhere in between: expand the content and include these rare terms, but do it in more depth (adding a whole additional chapter). And, of course, only if it makes sense for user intent.
A great example would be adding "local SEO in 2015" to a guide about SEO. Though not only throwing the phrase in the copy, but expanding it into a whole separate chapter. Or is it wiser to target this phrase with a different piece of content altogether?
Hi Eric, great concept, but allow me to understand it better by asking a couple of dumb questions.
As you replied to some of the comments, this is more than just adding long tail keywords to the page. What would be a better alternative- adding some additional chapters into the content, or creating content that takes a new angle altogether (new title and all...)?
I use the first approach myself whenever I try to create reviews of products- adding additional chapters that I think are going to be valuable info for the user.
Thanks Eric for writing the IDF part of this scientific series.
I just wanted to know, does the TF-IDF model also apply to Latent Semantic Indexing (LSI) as used by search engines? As per my understanding, LSI tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval.
Hi Umar - I don't believe it does. LSI is something that came in later.
Correct me if I'm wrong here... IDF = measure of uniqueness of a term/phrase based on all indexed instances of that term/phrase. The more unique it is, the more valuable it is to have on the page - which would help the page rank for all related terms or simply get the page indexed?
Hi Oleg - it can help you rank for related terms. However, it's important that you tie this to some real value that you add, as opposed to simply loading the rare words on a page.
Hi Eric, Nice article! What is the right way to tie content to chosen terms? What do engines expect? For example, if in one sentence I list terms "nike", "adidas" and in another sentence I use the term "these brands", will search engines be able to map "these brands" to the aforementioned list of brands?
Thank you very much Eric! This basically explains everything that we did wrong at my previous company (when I was a novice in online marketing), and which was taught to us by a large consulting company. I will forward this blog to my previous co-workers.
Glad it was helpful!
Hello Eric!
Good article.
If I understand correctly, IDF measures how important a keyword's frequency is: Google examines all existing documents and, based on that, calculates the average number of occurrences of a keyword.
This is important when Google indexes your page, but there are other things that matter when it comes to having a good position.
It is not always easy to find a keyword that is not repeated in Google search. Perhaps searching for synonyms would help us.
Thanks for the information.
Thanks for sharing this, Eric!
I was wondering how IDF differs from long tail keywords?
Very interesting slant on a popular topic. You have explained the science behind not just being unique but also being found.
Frankly, it's too technical for basic-level guys like me :)
That's an amazing way to calculate the uniqueness of content, but calculating the IDF of a whole article would need something complex. Is there any way or tool that can help us calculate the IDF on our own? I mean, it would take months to do it manually. Btw your scientific series is always awesome Eric :)
Hello Eric,
Great post, short and informative. IDF does not help us rank well, but it could help us get found by our audience, which is what everyone is looking for.
I like the image on the top, it is telling the definition of uniqueness ;)
Thanks
Hey Eric,
Thanks for the post. Correct me if I'm wrong, but you're basically showing the similarities between inverse document frequency (IDF) and long-tail keywords, and how you can statistically break down what you need to target in a granular form?
I know this is more of a broad outlook on the post, but I wanted to make sure I am understanding the bigger picture.
Cheers,
Hi Justin - One of the main things I was trying to get at is that publishing the same old stuff that everyone else does, or simply copying successful people, is not really a good strategy. You need to bring something new and unique to the table.
However, as I will say to JibbedSEO in a moment (in response to his comment below) the goal is not to throw random rare keywords on your page (see below for the rest).
Gotcha. Back in the day I used to piggyback off what others were doing, but now I try to bring new creativity to the table. Thanks for your clarification (lets me know I'm on the right path).
Also, I like how you break down the keyword search in a statistical manner; like you said, "not to throw random rare keywords on your page". Buggers will stick every once in a while, but your breakdown is built for a better overall foundation.
Thank you for getting back to me
Cheers,
Before I say anything, your first image made me eager to read the full article. It is a really catchy image, and it says everything about the article. I think if we provide the same to customers, business will also grow like this.
Hope everyone takes this post as (I think) it was intended, which is an interesting look at how Google understands and can sort content on the web. I don't think the author was suggesting to add in rare keywords from obsessive Googling.
I do think I am suggesting a bit more than this. I think that IDF teaches us that it's really critical to bring something unique to the table. If you are the 2,137th person trying to rank on some major term, well, good luck! Differentiation is essential. What is it that you do that's unique?
I completely agree though, that this is not meant to spawn some keyword spamming exercise!
Eric, your post and examples are nice for delivering a very basic understanding of the subject. Unfortunately, there are, as I see it, two major flaws in the examples. A) Only a very small percentage of users search with quotes. For those who do search with quotes, your example is somewhat correct if the string is unique, i.e., only one result. If, as is the case most of the time, a user searches for "pink monkeys in fort lauderdale" without quotes, then even if there is only one instance of the phrase "pink monkeys in fort lauderdale" on the web, it is more likely that Google will return an authoritative page about monkeys in Fort Lauderdale that sit on pink chairs. B) This is really just part of A: as I understand it, Google and other search engines put far more weight on individual words than on phrases, and at the single-word level nothing will return just a few results.
nealeg, I don't see how searcher behaviour has anything to do with TF and IDF, which are constants even if everyone on the planet was wiped out. The real challenge is how to use this knowledge in a practical manner, i.e. compute the TF-IDF for every term on a page. For that you will need the count of total pages in Google (about 30 trillion) and a total count of the term in the SERPs. That could be thousands of queries to Google per the study set. If your IP doesn't get blocked, you will get very good numbers. Alternatively, you can get somewhat useful numbers by comparing each term against the total number of words in the given web page.
Hi Nealeg - the use of quotes or not is irrelevant to the use of TF/IDF. I simply used that to have Google help me find pages with the exact phrases. For example, 6.78M pages appear to have the exact phrase "super bowl 2015" (without the quotes) on them.
Of course most users search without "", but the point is that Google places a lot more weight on exact phrases on pages. So if a user searches on "super bowl 2015" (without the ""), Google will weight a page with that exact phrase more than it will a page talking about the 2014 Super Bowl that happened to be written on January 15, 2015 and happens to have the article publication date on the page.
Thank U Eric 4 great post!