Many times as SEOs, we think about the "on-page optimization" process as simply following the best practices for placing our targeted keywords (and possibly, some variations of them) on the page. My previous blog post about Perfecting Keyword Targeting covers this in some detail. But, we also know that search engines aren't nearly naive enough to care only about the individual terms/phrases that the user queries. For years, search engines have been doing work with topic modeling (this paper from Berkeley researchers does a nice job exploring the concept as it relates to IR).

While it's challenging for SEOs to know how far the engines have taken this work, we can reasonably assume that the other words and phrases you use on a page influence its ranking, in addition to how and where you use the targeted query term.
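To make the idea concrete, here's a minimal sketch of comparing a query and a page in "topic space" rather than by exact keyword match. It assumes each has already been reduced to a vector of topic weights by some topic model, and it uses cosine similarity as the comparison. The vectors, the three-topic setup and the function name are all made up for illustration; this isn't a description of how any search engine (or our tool) actually scores pages.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical topic weights over three made-up topics.
query_topics = [0.7, 0.2, 0.1]
page_on_topic = [0.6, 0.3, 0.1]      # shares the query's dominant topic
page_off_topic = [0.05, 0.15, 0.8]   # mostly about something else

on_score = cosine_similarity(query_topics, page_on_topic)
off_score = cosine_similarity(query_topics, page_off_topic)
print(on_score > off_score)
```

The point of the sketch is only that a page can score as "relevant" because of its overall mix of words and topics, not just because the exact query term appears on it.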

For those of you who've been following our blog posts about research into this area over the past few months, you know we've hit some stumbling blocks. Initially, we thought the free LDA Labs Tool had an extremely high correlation with Google.com rankings (higher even than most link-based metrics). However, after analyzing results from others who ran tests, we found bias in our own results and went back to the drawing board.

At the PRO Training Seminar in London, Ben Hendrickson shared our latest findings - much more conservative numbers, but consistent with the data and defensible.

[Chart: LDA Correlation, October 2010]

As you can see, these numbers are lower than our previous datapoints, and the true correlation of the tool in its current format (which has been re-engineered) is now between the "LDA even vs. odds" and "LDA w/ bias" numbers. This makes our version less predictive than, say, # of links or linking root domains, but more predictive than any other on-page factor we analyzed (save features around exact/partial keyword match domains).
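For readers who want to run this kind of comparison on their own data, here's a rough sketch of one way to measure how well a score "predicts" rankings: compute a Spearman rank correlation between the score and the ranking position within each results page, then average across pages. This post doesn't spell out the exact method behind the chart, so treat Spearman, the averaging step and every function name below as assumptions for illustration.

```python
from statistics import mean

def ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))

def mean_serp_correlation(serps):
    """Average Spearman correlation over several results pages.

    Each item in `serps` is a list of scores for one query, ordered by
    ranking position (the #1 result first). Positions are negated so that
    a positive correlation means higher scores go with better rankings.
    """
    per_query = []
    for scores in serps:
        negated_positions = [-(p + 1) for p in range(len(scores))]
        per_query.append(spearman(scores, negated_positions))
    return mean(per_query)
```

Feeding in one list of scores per query (hypothetical numbers, ordered by position) gives a single average you can compare across metrics, which is the kind of comparison the chart above makes between LDA and link counts.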

Since this research and the tool were released, people have been asking me "How Do I Use This?" Luckily, this video put together by WebProNews is an excellent resource to help answer that.

 

If you've got more questions about LDA, the tool, topic modeling in general or anything else related, feel free to ask below. We're still very excited about this topic, and while Ben is taking a short sabbatical until January (his first vacation in 3 years), I'm happy to help where I can. We hope that sometime in Q1 of next year, we'll have even more work on our LDA model, better correlations, and recommendations of specific words/phrases that may be adding to or hurting scores.

P.S. Those of you who have been following the posts closely may have noticed that a number of individuals who often don't like the work SEOmoz publishes were particularly vehement in criticizing our LDA research. I don't have much to offer in response to those concerns other than to say that, as before, we're still in the early stages of work on this. It's very challenging from a coding, mathematics and analysis perspective, so more polished results may still be several months away.

We've been excited to have others analyze our work (which is how we discovered our initial error on the correlation numbers) and we look forward to more people experimenting with it. Despite the criticisms, however, we're going to continue conducting and presenting research like this, with similar caveats. If you believe our work is wrong, misguided or not useful, there's certainly no obligation to use it. We can promise that until we feel very strongly about our research, evidence and the value it provides, we won't be making this a formal part of our Web App, PRO tools or best practices. But, in the meantime, for those who like playing on the cutting edge, we feel it's part of our core values to continue working and sharing in areas like this.