As SEOs, we often think of the "on-page optimization" process as simply following best practices for placing our targeted keywords (and possibly some variations of them) on the page. My previous blog post about Perfecting Keyword Targeting covers this in some detail. But we also know that search engines aren't nearly naive enough to care only about the individual terms/phrases the user queries. For years, search engines have been doing work with topic modeling (this paper from Berkeley researchers does a nice job exploring the concept as it relates to IR).
While it's challenging as SEOs to know where this work has taken them, we can reasonably assume that the words and phrases used on a page influence its ranking, not just how and where the targeted query term appears.
For those of you who've been following our blog posts about research into this area over the past few months, you know we've hit some stumbling blocks. Initially, we thought the free LDA Labs Tool had an extremely high correlation with Google.com rankings (higher even than most link-based metrics). However, after analyzing results from others who ran tests, we discovered bias in our methodology and went back to the drawing board.
At the PRO Training Seminar in London, Ben Hendrickson shared our latest findings - much more conservative numbers, but consistent with the data and defensible.
As you can see, these numbers are lower than our previous data points, and the true correlation of the tool in its current format (which has been re-engineered) now sits between the "LDA even vs. odds" and "LDA w/ bias" numbers. This makes our version less predictive than, say, # of links or linking root domains, but more predictive than any other on-page factor we analyzed (save features around exact/partial keyword match domains).
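For readers unfamiliar with how correlations like these are calculated, here's a back-of-the-envelope sketch of Spearman rank correlation, the measure typically used in this kind of ranking research. All the numbers below are made up purely for illustration; they are not SEOmoz's actual measurements.

```python
def rank(values):
    """Assign 1-based ranks to values (this toy data has no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def spearman(x, y):
    """Spearman's rho: correlation of the rank orderings of x and y."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Google rank positions (1 = best) for ten hypothetical results,
# and hypothetical on-page topic-model scores for the same pages
positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [0.91, 0.85, 0.88, 0.70, 0.74, 0.66, 0.71, 0.55, 0.60, 0.48]

# Negate the scores so that "higher score predicts a better (lower)
# position" shows up as a positive correlation
rho = spearman(positions, [-s for s in scores])
print(round(rho, 3))  # -> 0.915
```

A rho near 1 would mean the metric almost perfectly predicts rank order; the real-world numbers for any single factor are, of course, far more modest.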
Since this research and the tool were released, people have been asking me, "How do I use this?" Luckily, this video put together by WebProNews is an excellent resource to help answer that.
If you've got more questions about LDA, the tool, topic modeling in general or anything else related, feel free to ask below. We're still very excited about this topic, and while Ben is taking a short sabbatical until January (his first vacation in 3 years), I'm happy to help where I can. We hope that sometime in Q1 of next year, we'll have even more work on our LDA model, better correlations, and recommendations of specific words/phrases that may be adding to or hurting scores.
p.s. For those who have been following the posts closely, you may have noticed that a number of individuals who often don't like the work SEOmoz publishes were particularly vehement in criticizing our LDA research. I don't have much that can address those concerns other than to say - as before, we're still in the early stages of work on this. It's very challenging to do from a coding, mathematics and analysis perspective, so more polished results may still be several months away.
We've been excited to have others analyze our work (which is how we discovered our initial error on the correlation numbers) and we look forward to more people experimenting with it. We do feel, however, that despite the criticisms, we're going to continue conducting and presenting research like this, with similar caveats. If you believe our work is wrong, misguided or non-useful, there's certainly no obligation to use it. We can promise that until we feel very strongly about our research, evidence and the value it provides, we won't be making this a formal part of our Web App, PRO tools or best practices. But, in the meantime, for those who like playing on the cutting edge, we feel it's part of our core values to continue working and sharing in areas like this.
I've been following this LDA stuff since you guys first posted. I took a blog post of mine that has been ranked #4 for a few months and "LDA'd" it by adding in some LDA terms.
The post now ranks #1. Anecdotally it was a big win for LDA in my mind :)
More experiments are in the hopper, but I'm looking at LDA as a new framework for revising on-page. Thanks for all the great work you do, SEOmoz. Keep the data coming!
Very interesting! I would love to hear a bit more about this if you were willing to share! I know that none of this stuff happens in a vacuum but I would be really interested to hear more about this (perhaps a YouMoz post if you're willing :)).
Things I'm most interested in hearing about: did you ramp up other SEO efforts around the same time? Did you run some of the spectrums Rand mentions in the videos, see you were close on all other fronts, and decide to try tweaking LDA?
Thanks for sharing and would love to hear more!
Sam,
It might be worth putting together a YOUmoz post, but for those with instant gratification issues, here's the summary:
1) I'm playing with a tool called SEMScout, which apparently gathers all website content for a term, does a bunch of term frequency stuff and then spits out a list of words and a "theme score."
2) Sort said list by theme score to identify the most important LDA terms
3) Revise the post to include some of the highest theme score LDA terms
4) Wait for re-index
5) Throw party for the #1 result on my search term
Please note this was only one test on one blog post. Your mileage may vary. Offer void in Tennessee.
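I don't know what SEMScout actually does under the hood, but a naive sketch of the kind of "theme score" described in steps 1-2 (how consistently a term shows up across the top-ranking documents for a query) might look like the following. The documents and the scoring rule here are both invented for illustration:

```python
from collections import Counter

# Toy "top-ranking pages" for a query; in practice these would be
# fetched from the live search results
docs = [
    "batman caped crusader gotham city dark knight",
    "batman robin gotham crime fighting",
    "caped crusader batman comics gotham",
]

# A crude "theme score": the number of top documents a term appears in.
# Using set() counts each term once per document.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# Sort terms by theme score, highest first (step 2 in the summary)
theme_scores = sorted(doc_freq.items(), key=lambda kv: -kv[1])
```

A real tool would presumably also filter stopwords and weight terms more cleverly, but the workflow is the same: find terms the top results share, then work the strongest ones into your page.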
I'm with Sam on this one Josh. If you've got more pages you are going to try optimizing based on LDA alone, I'd love to read a YOUmoz on the particulars.
My thumb up is:
Just one note, and this is not your fault: since the LDA tool is not (yet) to be considered an official SEOmoz tool (it's maybe an early-stage Labs tool), the folks at WebProNews could have titled the interview differently, rather than "Determine Content Relevancy with New SEOmoz Tool", because it can lead to misunderstandings; people tend to remember the headlines.
I think LDA will be the top 3 shuffle play in the months/years to come!
I'll admit a lot of this was confusing to me when I first read the post about LDA a while ago but the query examples sure helped quite a bit in reinforcing the idea of topic modeling.
That being said, the use of your LDA tool reminds me of an open source project wherein many bright minds help each other in order to reach a common goal. I'm glad that you have opened this up to the public so that people can test it out. The end result should be great!
Ben Hendrickson's talk on LDA at ProSEO 2010 was a real highlight for me. I wrote afterwards in my blog "Watching his mental machinations trying to explain such a complex subject (as LDA) in simple terms was akin to watching an octopus on roller skates. If he's not already a YouTube sensation, he should be!"
I admire and value SEOmoz for their ambition and courage, of which LDA is just one example.
This makes so much sense that it's hard to criticize it.
After reading the first post about the launch of LDA and the tool, I completely restructured my content to increase the %, and I saw some very good results from doing so. I think what people are missing is that LDA is not the holy grail of ranking well in the SERPs, but another gem on the holy grail that is SEO.
Keep it going SEOmoz :-) your research efforts are appreciated by a large 'somewhat silent' group of SEOs who are rooting for you.
Thanks for all your awesome work. Short question, which might be a stupid question. How do you use the actual tool? You pop in the keyword in the query section and then you put the source code in the document section? Is that correct? Otherwise please explain.
Thanks again!
Sorry - we should have made the tool more evident. You can do a number of things:
I've been interested in latent Dirichlet allocation since my university days, when a particularly nerdy friend briefly discussed the topic over drinks. Oh yes folks, I spent much of my university drinking days talking about topic modelling - my poor, long-suffering Mom may never get a grandbaby....:-)
So, being something of a nerd, I was immediately intrigued when I first read about the research into LDA's Spearman correlation with Google results. While the correlations did seem a little high, I was still proud to be reading cutting-edge research into how search engines figure out what the hell we are talking about. However, initial excitement aside, I have felt let down these past few months. Not by the dramatically overstated correlations, but by the backlash surrounding them.
Some general points:
To compensate for points #1 and 2, the academic world has taken to publishing peer reviewed journals. While many argue that peer review stifles creativity, it is very effective at reducing the amount of bad research that gets published. It has also given rise to some (seemingly) unlikely partnerships between scholars - unlikely partnerships that have yielded tremendous results.
Perhaps this LDA fiasco is a sign that cutting edge research on search needs formal peer review?
From a complete outsider's point of view, the mudslinging surrounding this research has been absolutely disgusting. Many people within the search community have acted like children (or worse). If peer review will prevent this negativity, bring it on!
Incoming newbie questions on LDA and topic modeling:
First of all, would you say LDA is something that is "needed" all across a website? Say you're running a retail website for hundreds and hundreds of products. Then there are a dozen or so categories with descriptive headers to organize the contained products. Now there's also a home page with three or four general descriptions to file everything into a few bigger categories. Would it be a good idea to apply LDA all over the place? Just categories and the home page? Just the home page? Or maybe just product pages, but leave the home page descriptions and category descriptions for user readability and ease of understanding where they are at? I ask because LDA seems complicated and I want to know if I SHOULD take large amounts of time figuring out good applications of LDA for everything, or if I should spend my time in better places.
Second question: let's say I am applying LDA to all three of these general areas - specific product pages, category pages and the home page. Would it be much easier to apply LDA to an individual product, or the much larger, general scope of the category it's contained within? I would think the individual product since it's more... focused. A category description would be meant for a short but detailed description of everything IN the category, so it seems like it would be harder to focus on a "topic model" for something such as a category.
Thank you very much in advance for any help :)
I'm confused by your LDA tool.
Enter phrase:
"caped crusader robin living gotham city"
Relevance to word "batman" ......20%
Erm.
IR? LDA? For those of us who are not immersed in these topics, it would be very helpful to define an acronym the first time you use it. Nevertheless, thanks for all you do for the industry.
gbh - If you read the original post on all of this I think you will get an understanding of what those terms mean.
https://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated
What's LDA? - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
It's a probabilistic version of Latent Semantic Analysis. A method of measuring topical relevancy.
In very simplistic terms, it looks at word/phrase use in a document and gives bonus points (for the topicality of the document with respect to a keyphrase) if the phrases used are associated elsewhere with the keyphrase that defines the topic. Documents are basically considered to be about a small number of topics.
It doesn't take into account distances between keywords or semantics of the document (AFAICT) but just looks, as the Wikipedia article puts it, at the doc as a "bag of words".
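To make the "bag of words" idea concrete, here's a toy scoring function following that description. The mini-corpus and the query (borrowed from the "caped crusader" comment above) are purely illustrative, and real LDA involves far more math than this:

```python
# Toy co-occurrence model built from a hypothetical corpus: for each
# keyword, the set of terms that appear in documents about it
corpus_associations = {
    "batman": {"caped", "crusader", "robin", "gotham", "city", "joker"},
    "pianist": {"piano", "keys", "concert", "chopin", "music"},
}

def topical_bonus(document, keyword):
    """Fraction of the document's words that co-occur with the keyword
    in the corpus. Order and proximity are ignored: a bag of words."""
    words = set(document.lower().split())
    associated = corpus_associations.get(keyword, set())
    return len(words & associated) / len(words) if words else 0.0

# 5 of the 6 words are associated with "batman" in this toy corpus
score = topical_bonus("caped crusader robin living gotham city", "batman")
```

Note that the document never contains the word "batman" at all, yet it still scores highly for it; that is exactly the property that makes topic modeling more interesting than raw keyword matching.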
https://nlp.stanford.edu/pubs/hall-emnlp08.pdf is on topic for those who love linguistics. NLP rules Google!
very informative
The rigor with which you are approaching the subject is comforting. I see a lot of cursory SEO experiments done that lead to wrong conclusions and therefore countless wasted hours in pursuing dubious tactics. This analysis looks promising and I especially await the "recommendations of specific words/phrases that may be adding to or hurting scores". Being able to act upon this info is the other half of the battle.
It's the first time I've seen somebody using LDA for SEO work; it's very novel!
But I wonder whether the LDA tool supports non-English pages, e.g. Chinese?
Thank you very much!
Sadly, right now it's just English, but as we advance the model, I suspect we'll be able to plug in new corpuses (corpi?) from different languages without too much pain.
Corpora ;)... my college Latin classes coming back to me (along with some hugely absurd memories).
Very interesting article. I have one question.
How does it cope with non-English languages? I tried it with my Swedish site and got 99% on one of my pages. This result confuses me.
As Rand wrote above, right now it works just for English, because the corpus the LDA is working with is the English version of Wikipedia.
That doesn't mean the corpus can't contain Swedish (or Italian, or any other language) words/phrases, but not in a critical mass able to produce credible percentages.
That means your 99% is not to be trusted.
Intuitively, I would think that the score for content not in the corpus would be low rather than high. If unknown words raise the score, then how can you differentiate bad content (ie content with jibberish words) from good content?
I'm still confused.
First of all, I admit I'm no genius about the math behind LDA, but, using logic, I think high percentages are plausible if the corpus contains, let's say, 1,000 Swedish words and the text you used for your test includes quite a few of them.
The corpus of Swedish words being very small, just a few occurrences of contextually correlated words in both the corpus and the text will cause a high percentage.
But a false, or at least unreliable, one, because 1,000 words are not a real mirror of the Swedish dictionary.
That is what I think, but surely Rand can give you a more proper answer and correct mine.
I for one really like that y'all are pursuing LDA modeling Rand. It's cutting edge and I look to the mozplex to be just that. Cutting edge.
I remember the rabid detractors when your original LDA post (as well as the subsequent posts) came out and while I didn't agree with their criticisms, I thought it was a class act on the part of the mozplex to allow the dialogue.
So thumbs up for being a class act :-p
Hey Rand the LDA is a POWERFUL weapon in my personal arsenal! Keep up the good work!
-Ian
Dang, I just tested my LDA score and got 93%. Too bad I don't rank for that word. But my feeling is that if I decide to begin link building again, coupled with LDA, I'll have a much easier time ranking now that my site is aging. I like Rand's example of the "pianist". I think writing normally would get a higher LDA score anyway.
Thank you for this post. You did begin talking about topic modelling at Webit Expo in Bulgaria, but this is quite a nice addition to that topic.
Writing content that includes related terms of the targeted keywords is simply intuitive. If search engines don't use it or put much weight on it, they most likely will in the near future. SEOmoz seems to look at the industry and say, "If search engines were searching like humans, how would they look at content and how would they organize it?"
It's this mindset that helps keep you at the top of the game. Thanks for all your hard work.
Thanks for another great tool! Keep up your awesome work (and - to add a critical note - don't neglect fixing bugs / annoyances in the existing tools over new tools and features).
Rand,
What accounts for this apparent drop in correlation?
(Sorry if I am misunderstanding the results or your post, but I didn't see a reason given.)
The earlier correlation was correct according to the tool's output, but we think there was unintentional bias in the way it queried Google results to check evens against odds (I can't explain it as well as Ben can, and he's on leave for a while). Suffice it to say, we're trying to be very careful in our methodology and feel comfortable that the current data is accurate. That said, we think we can likely build a much more predictive version in the future.
The exciting part is that even with this simplistic tool and model, we're seeing numbers better than straight keyword use metrics (like TF*IDF). That would, at least, suggest that what your content says is as important or more important than your keyword usage.
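For reference, a "straight keyword use metric" like TF*IDF only counts occurrences of the query term itself, weighted by how rare the term is across documents. Here is a minimal sketch with toy documents (using a natural-log IDF; real systems vary in the exact formula):

```python
import math

# Toy document collection; in practice this would be a large corpus
docs = [
    "seo keyword ranking links",
    "piano concert music keys",
    "seo links domains anchor",
]

def tf_idf(term, doc, docs):
    """Term frequency in doc, scaled by inverse document frequency."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf
```

Unlike a topic model, this score is zero for any document that never uses the exact term, no matter how on-topic it is, which is exactly the limitation the LDA work aims to get past.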
Rand, is the LDA tool on Moz now?