It's time to look at your content differently—time to start understanding just how good it really is. I am not simply talking about titles, keyword usage, and meta descriptions. I am talking about the entire page experience. In today's post, I am going to introduce the general concept of content quality analysis, why it should matter to you, and how to use term frequency (TF) analysis to gather ideas on how to improve your content.
TF analysis is usually combined with inverse document frequency analysis (collectively, TF-IDF analysis). TF-IDF analysis has been a staple of information retrieval science for a long time. You can read more about TF-IDF and other search science concepts in Cyrus Shepard's excellent article here.
For purposes of today's post, I am going to show you how you can use TF analysis to get clues as to what Google is valuing in the content of sites that currently outrank you. But first, let's get oriented.
Conceptualizing page quality
Start by asking yourself if your page provides a quality experience to people who visit it. For example, if a search engine sends 100 people to your page, how many of them will be happy? Seventy percent? Thirty percent? Less? What if your competitor's page gets a higher percentage of happy users than yours does? Does that feel like an "uh-oh"?
Let's think about this with a specific example in mind. What if you ran a golf club site, and 100 people came to your page after searching on a phrase like "golf clubs"? What kinds of things might they be looking for?
Here are some things they might want:
- A way to buy golf clubs on your site (they would need to see a shopping cart of some sort).
- The ability to select specific brands, perhaps by links to other pages about those brands of golf clubs.
- Information on how to pick the club that is best for them.
- The ability to select specific types of clubs (drivers, putters, irons, etc.). Again, this may be via links to other pages.
- A site search box.
- Pricing info.
- Info on shipping costs.
- Expert analysis comparing different golf club brands.
- End user reviews of your company so they can determine if they want to do business with you.
- How your return policy works.
- How they can file a complaint.
- Information about your company. Perhaps an "about us" page.
- A link to a privacy policy page.
- Whether or not you have been "in the news" recently.
- Trust symbols that show that you are a reputable organization.
- A way to access pages to buy different products, such as golf balls or tees.
- Information about specific golf courses.
- Tips on how to improve their golf game.
This is really only a partial list, and the specifics for your site can certainly vary from what I laid out above for any number of reasons. So how do you figure out what it is that people really want? You could pull in data from a number of sources. For example, using data from your site search box can be invaluable. You can do user testing on your site. You can conduct surveys. These are all good sources of data.
You can also look at your analytics data to see what pages get visited the most. Just be careful how you use that data. For example, if most of your traffic is from search, this data will be biased by incoming search traffic, and hence what Google chooses to rank. In addition, you may only have a small percentage of the visitors to your site going to your privacy policy, but chances are good that there are significantly more users than that who notice whether or not you have a privacy policy. Many of these will be satisfied just to see that you have one and won't actually go check it out.
Whatever you do, it's worth using many of these methods to determine what users want from the pages of your site and then using the resulting information to improve your overall site experience.
Is Google using this type of info as a ranking factor?
At some level, they clearly are. Clearly Google and Bing have evolved far beyond the initial TF-IDF concepts, but we can still use them to better understand our own content.
The first major indication we had that Google was performing content quality analysis was with the release of the Panda algorithm in February of 2011. More recently, we know that on April 21 Google will release an algorithm that makes the mobile-friendliness of a website a ranking factor. Pure and simple, this algo is about the user experience with a page.
Exactly how Google is performing these measurements is not known, but what we do know is their intent. They want to make their search engine look good, largely because it helps them make more money. Sending users to pages that make them happy will do that. Google has every incentive to improve the quality of their search results in as many ways as they can.
Ultimately, we don't actually know what Google is measuring and using. It may be that the only SEO impact of providing pages that satisfy a very high percentage of users is an indirect one: so many people like your site that it gets written about more, linked to more, shared more on social media, and engaged with more. Google then sees the other signals it uses as ranking factors, and this is why your rankings improve.
But, do I care if the impact is a direct one or an indirect one? Well, NO.
Using TF analysis to evaluate your page
TF-IDF analysis is more about relevance than content quality, but we can still use various precepts from it to help us understand our own content quality. One way to do this is to compare the results of a TF analysis of all the keywords on your page with those pages that currently outrank you in the search results. In this section, I am going to outline the basic concepts for how you can do this. In the next section I will show you a process that you can use with publicly available tools and a spreadsheet.
The simplest form of TF analysis is to count the number of uses of each keyword on a page. However, the problem with that is that a page using a keyword 10 times will be seen as 10 times more valuable than a page that uses a keyword only once. For that reason, we dampen the calculations. I have seen two methods for doing this, as follows:
The first method relies on dividing the number of repetitions of a keyword by the count for the most popular word on the entire page. Basically, what this does is eliminate the inherent advantage that longer documents might otherwise have over shorter ones. The second method dampens the total impact in a different way, by taking the log base 10 of the actual keyword count. Both of these still value incremental uses of a keyword, but dampen the effect substantially. I prefer method 1, but you can use either method for our purposes here.
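To make the two methods concrete, here is a minimal Python sketch; the simple regex tokenizer and the function names are my own, standing in for whatever keyword density tool or spreadsheet you actually use:

```python
import re
from collections import Counter
from math import log10

def term_counts(text):
    # Raw term counts for a page's visible text (a rough stand-in for a keyword density tool).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def tf_method_1(counts):
    # Method 1: divide each keyword's count by the count of the most popular word on the page.
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

def tf_method_2(counts):
    # Method 2: dampen the raw count by taking its log base 10.
    # Note: log10(1) == 0, so a single occurrence scores zero; many TF
    # implementations use 1 + log10(count) to avoid that.
    return {word: log10(count) for word, count in counts.items()}
```

Either way, a word used ten times ends up scoring well under ten times the value of a word used once, which is the whole point of the dampening.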
Once you have the TF calculated for every different keyword found on your page, you can then start to do the same analysis for pages that outrank you for a given search term. If you were to do this for five competing pages, the result might look something like this:
I will show you how to set up the spreadsheet later, but for now, let's do the fun part, which is to figure out how to analyze the results. Here are some of the things to look for:
- Are there any highly related words that all or most of your competitors are using that you don't use at all?
- Are there any such words that you use significantly less, on average, than your competitors?
- Also look for words that you use significantly more than competitors.
You can then tag these words for further analysis. Once you are done, your spreadsheet may now look like this:
To make this fit into the screenshot above and keep it legible, I eliminated some columns you saw in my first spreadsheet. The sample analysis here is for the movie "Woman in Gold". You can see the full spreadsheet of calculations here. Note that we used an automated approach to marking some items as "Low Ratio," "High Ratio," or "All Competitors Have, Client Does Not."
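For what it's worth, that flagging step can also be scripted rather than built into spreadsheet formulas. Here is a rough sketch of the kind of rules involved; the 0.5 and 2.0 thresholds are illustrative guesses of mine, not the values used in the linked spreadsheet:

```python
def flag_term(client_tf, competitor_tfs, low=0.5, high=2.0):
    # client_tf: the client page's dampened TF for one term (0 if the term is absent).
    # competitor_tfs: a list of dampened TFs for the same term, one per competing page.
    competitors_using = sum(1 for tf in competitor_tfs if tf > 0)
    avg_competitor_tf = sum(competitor_tfs) / len(competitor_tfs)

    if client_tf == 0 and competitors_using == len(competitor_tfs):
        return "All Competitors Have, Client Does Not"
    if avg_competitor_tf > 0 and client_tf < low * avg_competitor_tf:
        return "Low Ratio"
    if avg_competitor_tf > 0 and client_tf > high * avg_competitor_tf:
        return "High Ratio"
    return ""

# Example: flag_term(0.02, [0.11, 0.09, 0.14, 0.10, 0.08]) returns "Low Ratio".
```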
None of these flags by themselves have meaning, so you now need to put all of this into context. In our example, the following words probably have no significance at all: "get", "you", "top", "see", "we", "all", "but", and other words of this type. These are just very basic English language words.
But, we can see other things of note relating to the target page (a.k.a. the client page):
- It's missing any mention of actor Ryan Reynolds.
- It's missing any mention of actor Helen Mirren.
- The page has no reviews.
- Words like "family" and "story" are not mentioned.
- "Austrian" and "Maria Altmann" are not used at all.
- The phrase "woman in gold" and the words "billing" and "info" are used proportionally more than they are on the other pages.
Note that the last item is only visible if you open the spreadsheet. The issues above could well be significant, as the lead actors, reviews, and story details are all indications that a page has in-depth content. We see that the competing pages that rank have details of the story, so that's an indication that this is what Google (and users) are looking for. The fact that the main key phrase and the word "billing" are used to a proportionally high degree also makes the page seem a bit spammy.
In fact, if you look at the information closely, you can see that the target page is quite thin in overall content. So much so, that it almost looks like a doorway page. In fact, it looks like it was put together by the movie studio itself, just not very well, as it presents little in the way of a home page experience that would cause it to rank for the name of the movie!
In the many different times I have done an analysis using these methods, I've been able to make many different types of observations about pages. A few of the more interesting ones include:
- A page that had no privacy policy, yet was taking personally identifiable info from users.
- A major lack of important synonyms that would indicate a real depth of available content.
- Comparatively low Domain Authority competitors ranking with in-depth content.
These types of observations are interesting and valuable, but it's important to stress that you shouldn't be overly mechanical about this. The value in this type of analysis is that it gives you a technical way to compare the content on your page with that of your competitors. This type of analysis should be used in combination with other methods that you use for evaluating that same page. I'll address this some more in the summary section below.
How do you execute this for yourself?
The full spreadsheet contains all the formulas, so all you need to do is link in the keyword count data. I have tried this with two different keyword density tools: the one from Searchmetrics and this one from motoricerca.info.
I am not endorsing these tools, and I have no financial interest in either one—they just seemed to work fairly well for the process I outlined above. To provide the data in the right format, please do the following:
- Run all the URLs you are testing through the keyword density tool.
- Copy and paste all the one-word, two-word, and three-word results into a tab on the spreadsheet (or use the scripted alternative sketched after this list).
- Sort them all so you get total word counts aligned by position as I have shown in the linked spreadsheet.
- Set up the formulas as I did in the demo spreadsheet (you can just use the demo spreadsheet).
- Then do your analysis!
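If you prefer to script the counting step rather than copy results out of a keyword density tool, a rough sketch of counting one-, two-, and three-word phrases might look like this (fetching and cleaning the HTML is left out, and the function name is mine):

```python
import re
from collections import Counter

def ngram_counts(text, max_n=3):
    # Count 1-, 2-, and 3-word phrases in a page's visible text.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# counts.most_common(100) then gives the phrases and counts to paste into the
# spreadsheet, one ranking position (URL) per column.
```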
This may sound a bit tedious (and it is), but it has worked very well for us at STC.
Summary
You can also use usability groups and a number of other methods to figure out what users are really looking for on your site. However, what this analysis does is give us a look at what Google has chosen to rank the highest in its search results. Don't treat this as some sort of magic formula where you mechanically tweak the content to get better metrics in this analysis.
Instead, use this as a method for slicing into your content to better see it the way a machine might see it. It can yield some surprising (and wonderful) insights!
If you had to sum up, in one sentence, how to make or generate quality content for Google, what would you say?
Great question! I'd like to second it.
I would say: Make quality content for your human audience, not for Google.
Absolutely right, Samuel Scott...
Totally agree with you on this one.
Samuel's right, quality content for humans pays off in the long term.
I too agree that our focus is to feed the right data to our audience. But I think high-quality content should be developed for our audience with Google's guidelines in mind.
Definitely a user-first world, so generate content for users first. I tried to be clear in the post that I do NOT advocate writing content for search engines. However, we can use search engines, and the process outlined in this post, to help us get more input and knowledge about what we might be missing from our content.
All of this still requires human interpretation and filtering to make sure that we are applying the resulting data in the right way.
I'm finding that longer articles (1200 words) work better.
The whole issue of writing for search engines vs. for humans is an interesting one. Would you rather people actually found and read a crappy article, or that they never found a great article and never read it?
In reality though, if you write a really long article, that tends to make it comprehensive and you usually will cover a lot of really good subtopics, and rank for a variety of things. It's really difficult to write 1200 words of fully optimized SEO content gibberish without actually saying something real!
Websa, if I were your copywriter, I'd ask you to be more pointed, since at present it is impossible to answer in one sentence. Each industry has its own buying cycles, sought-after information, and customer behavior, in addition to "quality" being a retrospective term (regarding sought results).
As far as 'quality content' for Google, I think you would need (at least) one creative person (who is a good composer of content in addition to one who may help you avoid the need to be on first page for competitive terms) and a person who understands G's search engine and has a knack for predicting its evolution. For example, I believe myself to be creative, yet I can't determine whether content is good (enough) for mobile (from a dev standpoint), a page is entirely optimized (from an SEO standpoint), or is even needed as an organic product (from a PPC perspective).
I'm not sure if your question is tongue-in-cheek, yet my reply is completely genuine. Moreover, it depends on a client/marketing team. For example, I've come across a number of clients and (cough) peers who are more than happy to sidestep my writing in exchange for article spinning software. "It's all fair in love and SER," as the saying goes, I guess, but the initial question can't be answered succinctly.
Nice post, Eric. I am based in Turkey. Here people don't want to read so much, so the bounce rate was quite high on my content. What I did was start putting informative videos into my articles. That really helped increase my PA and decrease the bounce rate.
@marianduanet
It's the count of the word that appears most often in the client's text. In this case, the number of times "woman" appears in the text. Other competitors have different words that appear most often; e.g., in position 1's text, the word "and" appears most often.
This analysis strikes me as spot on. It's evident that when we write an article or post for our blogs, we have to do it thinking of the users who are going to read it. We have to contribute something, offer a solution to some problem. But once that's done, there's nothing wrong with giving it another pass using the formulas discussed here. The competition in some niches is so big that any detail can move us up a couple of places or make us disappear from the SERPs.
Eric,
Thank you for this spreadsheet, it is very helpful. I am having a bit of trouble understanding the Pos 1, Pos 2, Pos 3, Pos 4, Pos 5, Num Docs, and Num Comp columns. Where are you figuring out the keyword position? Or am I misinterpreting the spreadsheet? The keyword tools you referenced didn't give any data on positions... I am a fledgling data scientist trying to get a much better understanding of this concept. Thank you!
Joe
You need to pull the data from the keyword density tools one ranking position at a time. Pull the data for position 1 first, then position 2, etc.
Hi Eric!
This definitely looks like a great analysis, but I can't figure out exactly which tab I have to insert my keywords and my competitors' keywords into, or what the VLOOKUP formulas are.
This spreadsheet probably needs more explanation.
I am a bit stuck here.
Hi Slava,
You need to replace the data in the current tab, in columns B, C, D, E and F from row 4 on down with the keyword information you gather from the keyword tools I referenced in the post. You should also update the numbers in B2, C2, D2, E2, and F2 with the value of the most popular keyword for that column. I.e. in B2, put the number of instances of the most popular keyword in column B.
Hope that helps!
Hi Eric, great post! SEO Book has a nice keyword density tool too.
I really like this approach - I hate 'keyword density' as it can result in an over-fixation on keyword occurrence, which doesn't always line up with user experience. We've always looked at varying terms/synonyms etc., but a comparison to competitor content can really yield valuable insights into why, for two pieces of content with similar word length, similar link profiles, etc., one ranks better than the other. Thanks for the spreadsheet link too.
Agreed, I hate the keyword density concept as well. The key here is to use this analysis to stimulate thinking about deficiencies in your content. Consider it a way to generate ideas on how to make it better.
Very good tip about the privacy policy. I've never thought about it, but now I remember noticing the privacy policy on sites, and it has an impact on confidence in the site. A bit ridiculous, because I never read that page; I just assume I know what's written there. So it's important to have that page, and not to use a low-contrast font or in other ways make it practically invisible.
Reading further through your post, at first I thought it was about matching keyword density with competitors, which would be unlikely for you to write or for Moz to publish. So I continued reading, and I'm glad I did. It's an inventive method for finding, through TF analysis, which "part of the story" is missing on my site. Or, to compare it with a TV series, which episode is missing. Someone may discover that they lack an entire season.
Eric, can you describe what the most popular term on the page really is in your Method #1 formula?
I mean, is it a 1-word keyword, or does it have to be something specific?
Ivan - most likely, it's the one-word keyword that is used most often. It does not need to be anything specific.
The frequency analysis, plus content that is targeted for visitors, would make your site rank in the long term.
Interesting, because the US market is usually the leading one for online marketers and SEOs like us in Germany. But this is something we discussed and implemented in our workflows years ago. All the more surprising that you guys are only coming up with it now :o) ROCKIT
Hi Eric,
Nice post! The spreadsheet is just a holy grail for TF-IDF analysis. My only question is about the tools we are using for keyword density. Is the concept of keyword density still valid after Google's Hummingbird update? I think this is one topic to be debated here. If it is, then what is the ideal keyword density? It may seem conflicting at times, as one could easily use this to get better search results. What do you think, Eric?
Hi Amit - this is one of the tricky aspects of this post. I am not really pushing for going back to keyword density analysis. I am pushing for using TF analysis to come up with observations about limitations and deficiencies in your content.
When I run this type of analysis it's quite common for me to find pages that are missing entire concepts that belong within the content. That's the direction you want to go with these.
That is not quite intuitive, but it is certainly a scientific way to measure your content quality and depth. Google's prime focus is to improve the quality of the SERPs by promoting useful information from trusted sources, and if we are still thinking about keyword stuffing and other shortcuts, we won't have a place in its future.
Hi Eric, I like the fact that you have included a caveat at the end about this being a tool to give you insight rather than some magic formula. I am going to have to give it a go, though, as it seems that missing synonyms may be an indication of thin content, along with related keywords that may have been omitted. Asking yourself an honest question about the added value for the user and how in-depth the article actually is (i.e., how much research, and more importantly unique research, went into it) is a similar method, just without the statistics.
Hi Eric,
I think it's great that TF/IDF is getting some play these days. Certainly implied in this article, but the more you understand how crawlers technically work, how TF calculations work and how indexing really works (stemming /weighting etc.) the better off you are when it comes to improving website / content strategy. Nicely done.
Hey Eric,
Thanks a ton for the awesome post. I am an analytical guy who loves to figure out how things work! Your post does exactly that. I'm going to try it for one of my blog posts and let you know how it goes.
Cheers! :)
The content quality analysis process is really inspiring. Nice that I found this on Twitter and stopped here to read the whole article. Thanks.
Moz, I think there's a mass-downvoter on the loose, FYI! :)
I think it is very difficult for Google to determine your content quality with algorithms. Content quality is determined by users. This is my opinion.
Thanks
In addition to adding a privacy policy page, would you include having it be accurate? Such as following Google's privacy policy guidelines when using services such as Analytics.
I am not sure I would use the phrase "accurate." I'd phrase it as being concerned about whether or not the company honors it, but Google has no way to measure that.
I have been using RapidMiner to generate reports for TF-IDF. You can grab competing pages from Google, process them for term frequency and term occurrence, and produce a CSV file with all the data for post-filtering and insights.
A great additional feature is to use the WordNet plugin to find synonyms of words and group them. For example, I am targeting www.seopremo.co.uk searches on my site SEOpremo.co.uk; using this feature you can group and count UK/United Kingdom/Great Britain (for example) in one row for TF-IDF, which helps because it stops you from overdoing it on the synonyms (a rough scripted sketch of this grouping idea appears after this comment).
Additionally, I have built several processes for RapidMiner to perform these tasks.
You can input a group of pages and compare them to another group of pages in order to determine the site-wide TF-IDF of a term.
Using this process, I did manage to get some long-tail top-ten rankings for phrases like "affordable SEO expert UK"; however, the two-term phrases proved more difficult, so I guess it could be offsite factors influencing this.
(As a side note, I recently started trying to target local search for my home page by appending a city name, "Southampton," to the page, but I have yet to run it through these RapidMiner processes, and I do not have a geolocated NAP address for the target location, only for another town in the area.)
You can download RapidMiner Studio for free.
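For anyone who wants to try the synonym-grouping idea described above outside of RapidMiner, here is a rough Python sketch using NLTK's WordNet corpus; the function name is just for illustration, and a WordNet synonym set will be noisier than a hand-curated list:

```python
from collections import Counter
from nltk.corpus import wordnet  # requires nltk.download('wordnet')

def grouped_counts(counts, group_heads):
    # counts: a Counter of term frequencies (include 2- and 3-word phrases so
    # multi-word synonyms like "united kingdom" can match).
    # group_heads: the words to use as group labels, e.g. ["uk", "cheap"].
    grouped = Counter()
    for head in group_heads:
        synonyms = {head}
        for synset in wordnet.synsets(head):
            synonyms.update(lemma.name().lower().replace("_", " ") for lemma in synset.lemmas())
        grouped[head] = sum(counts.get(term, 0) for term in synonyms)
    return grouped
```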
For this, I had been using this and it's great.
https://marketxls.com/technical-indicators/
Hello,
I was wondering if you guys know where the '7' came from in the spreadsheet formula (above the 'client' cell)?
Thank you in advance!
Great post, Eric. What's the difference between "All" and "Comp"? I thought that All would mean five competitors plus the client's data, while Comp was just the five competitors. Not all formulas are in the sheet, so I'm having fun figuring them out. I learnt a lot about Excel as a byproduct.
A great post, thanks Eric.
Given that John Mueller has said that text within hidden areas of a page (i.e. that aren't visible after initial page render), may be devalued or not considered for ranking purposes, would you exclude such text from your analysis?
Though the best advice there might be not to hide important text :)
If your content is not perfect, Google will see you as a professional spammer. When writing content for a phrase like "váy đẹp mùa hè" ("beautiful summer dresses"), you should stick to that topic alone.
Love this article! It focused on the user and Google, not just an SEO outlook. Although this is quite time-consuming and would pose problems for smaller, low-budget companies trying to compete in the online world.
This article gave me so many ideas!
Thanks
I loved this article because it gives the more experienced SEO folks a few things to look at, but the biggest takeaway does not need the formula: it is simply the observation that other sites have deeper content for the search query, e.g. the mention of the actors, reviews, etc. Combine that with the privacy policy type of stuff and you have the basics right there. I hope every SEO student gets this.
Having said that, I'm looking forward to experimenting with the spreadsheet, starting with the formula (I'd like to see the effects of changing to a smaller base TF of, say, 0.25 with the other 0.75 adjusted), and also seeing what effect removing common stop words like "we" and "and" might have.
This post was very informative and did a good job of emphasizing the importance of content for users while also generating content for Google.
Thank you for the great article and information. It is really very useful.
When it comes to content quality, the ONLY things I can think of are:
1) The content should have embedded intro videos on their site.
2) Our domain is barely 1 year old, theirs is 2+ years old.
3) Their front page has more keyword-rich "content", which is literally a bunch of testimonials (in a rolling display box) that repeatedly use their keywords.
4) They have a backlink from their local web design company (I don't have that option since I built our site myself).
#3 is what's driving me crazy. Could it be that I just need to throw more keyword-rich content onto our main page? I feel like that would detract from UX, and I really like our front page as it is, without cluttering it up with testimonials and other content that would be better placed on a separate page.
Hi Sam - I'd be careful with being too artificial with your analysis. We don't want to drift into a pure keyword density approach. A different question is whether or not the general concept of having testimonials is benefiting them. That may be the case. I understand your UX concerns though.
Not knowing your site at all, I would use the observation about the testimonials and consider testing it. You can try some on the home page of your site for a couple of months and see how it does. If you don't see a major benefit, then back off of it and place them elsewhere.
This is what I'd call a real scientific approach to making content valuable for the audience and for Google. Thank you for this resourceful spreadsheet; I would love to try out this method ASAP.
Really solid post here, I like using this approach to help explain content quality. I know I will be trying this spreadsheet out one of these next couple of days, that's for sure.
Having gone through the whole analysis, I am just wondering about the common keywords across different posts and how these can be put together, as it seems hard to compile the data.
Many Thanks
[posted wrong comment -- editors, please remove]
I think your comment time-traveled here from 2005.
(Edit: The comment originally endorsed the importance of keyword density.)
Content is king, and it always will be!!!! :-)
No, the customer is king. We serve at his pleasure, and everything marketers and salespeople do is for him. Always has been, always will be.
Very good article Eric!
Content marketing is a very competitive world, where quality content is what allows a page to earn a good position. Good content is what makes the customer take an interest in your products.
To optimize the content of our website or business, we must take many things into consideration, including keeping the content updated and staying abreast of everything that can affect us. It is rather laborious.
There are tools that help you analyze the content of your blog based on its quality.
I appreciate the information you have provided to us in your article.
Regards,
In one episode of the underrated and cancelled U.S. comedy "The Crazy Ones" -- RIP Robin Williams -- there is a battle between Williams (a creative) and a data scientist (a "quant") over who can create the best advertisement at the agency for a campaign.
The "quant" used all of the data that we expect algorithms and search engines and social networks to have to make an ad that would appeal to that "persona." Williams used his brain's creativity. Guess who won? (Yes, I know it's fictional, and the outcome was decided by the show from the beginning.)
However, I still think Williams would win today. The longer I work in digital marketing, the less I care about Google directly and the more I care about human beings. (Partly because marketing has always been about human beings, and partly because Google is an algorithm that wants to think like a human being, so "marketing to Google" and "marketing to humans" are increasingly becoming the same thing.)
I measure "content" in two ways:
1. Will it make the reader / viewer / listener go "Wow!"? (Specifically, a relevant reader among your target audience.)
2. Will it get the reader to do what I want? (Buy, sign-up, visit, or something else.)
The first goal is intangible; the second is measurable. And I'm not sure the first will ever be measurable.
I agree completely with the notion of making content first for your audience, and not Google. Nothing in this article is meant to suggest anything different than that!
If I were the boss at a publication like The New York Times, National Geographic, or another that is generally thought to have high quality content, I would not be asking my content production team to do anything different.
I think that the impact of their content stands for itself, and if they tried to "tweak the text" so that it fits a mathematical formula, the result would either be a degradation of the product or a cost that is not recovered.
Rewording great writing to meet a formula is really hard, time-consuming and a great way to bust the morale of the writer (which I believe is something worth preserving).
Google has a ton of factors that they use to rank a web page. I am betting that the points awarded for converting a "natural language" document to one that is "formula correct" are quite small - especially when you are guessing at the formula.
So, all of my money is still being bet on the great natural language writer.
EGOL - as I noted in the article, the purpose of the technical analysis is to provide insights and generate ideas. In no way was I suggesting that you do this as a purely mechanical analysis that you acted on without interpreting the results. I think I was quite clear about this in the article.
So I agree with you, the first thing to do is to be a great natural language writer. But, you can use technical analysis to help you better understand what you might be missing that is of value to others.
Hi EGOL,
I'm not an expert writer or editor ... but do you really think that big editorial and news agencies don't analyse their content from a scientific point of view as well?
Don't they have studies on which words trigger engagement, etc.?
Ways to write to trigger feelings?
I doubt that!
Cornel
Cornel,
I agree that the high quality content producers are looking at ways to improve their content and putting a lot of effort into it. They are spending a lot of effort on things like clarity, impact, reading level, and engagement.
If they are comparing the term frequencies of their documents against the term frequencies of their competitors' documents, I believe that they are wasting their time. This is telling everyone to "mimic your competitors." My advice would be "do something superior."
In my opinion, this type of analysis is arbitrary and has nothing to do with the real quality of the content from the editor's perspective, and nothing to do with the quality of the content from the reader's perspective. Put the same effort into kicking up the editorial content instead of trying to take great content, run it through a word counter, and then change pieces of it to match a standard that you don't even know Google is using.
Term frequency and content quality are two different things.
Doing term frequency analysis might be really valuable if you build it into one of those "content spinner" programs or a "mash-up generator". So, I might agree with that as an application for term frequency.
I don't think that it has a place in the evaluation of genuine editorial content.
Very nice post, I really enjoyed reading it. Keep sharing such great content.
Thank you, Eric, for sharing such highly informative and useful information with us. You have done a great job; you have mentioned an 18-point list, and I really appreciate it. I will give all these points my best try on my sites next time. These points can attract users to any site; once users like a site, they will bookmark it for future use.