How do I recap the SEOmoz PRO Seminar session on Uncovering a Hidden Technique for SEO? The title is so attractive that it produces Pavlovian symptoms as we salivate at the thought of uncovering a hidden SEO treasure. Ben Hendrickson of SEOmoz presented a model that appears to show how Google may be assigning relevance to keyword terms based on context - topical relevance.

Is Latent Dirichlet Allocation (LDA) that hidden jackpot?

1st - LDA is neither new nor something SEOmoz invented. The Information Retrieval model has been around for 7 or 8 years, and IR geeks have talked about it before. There are a number of resources, as well as naysaying, about LDA and Google's possible use of it.

2nd - What is new is SEOmoz's LDA Topics Tool that produces a relevancy score based off a query (search term). It enables one to play with words that may increase a page's relevancy in the eyes of Google. It shows words that help Google determine how relevant the page is to a user's search query.

Game Changer?

Kyle Stone tweeted that the LDA tool is a game changer, and many retweeted.

SEOmoz LDA tool = game changer

Is SEOmoz's LDA tool a game changer? That's yet to be seen. My goal here is to report Ben's research as presented at the Mozinar and how a layman (myself) interprets it. Rand is going to do a follow-up post to explain more.

Why all the hype?

The SEO Challenge

SEOs face the continual challenge of figuring out Google's hidden ranking algorithms. How do we rank higher? Which signals are the most important? We know search engines are "learning models" that attempt to understand the "context" of words. Google has said for years that webmasters should concentrate most on providing good, relevant (contextual) content.

There are ways to rank higher. Is it as easy as 1, 2, 3?

  1. Create quality copy with keyword(s) on the page along with associated anchor text links.
  2. Get good links.
  3. What Ben talked about in this session.

LDA - Topic Modeling & Analysis

Latent Dirichlet Allocation, in layman's terms, translates to "topic modeling." In search geek terms, LDA is the following formula:

LDA Formula

(Did you digest that? Don't worry; Mozzers groaned and laughed at the same time. PLUS: Scientist Hendrickson delivered this session after lunch!)

LDA Simplified - Here is Ben's way of explaining topic modeling:

LDA Formula Simplified

(Okay, I was once proud that I got an A in Logic and Combinatorics - discrete math/set theory. However, that computer science class now feels like basic math compared to this formula.)

It made more sense when Rand Fishkin joined Ben on stage and when Todd Friesen moderated and deciphered during Q&A. (Manuela Sanches of Brazil was sitting next to me and said that Ben's "presentation needed subtitles!")

The objective of LDA, from my deciphering of Greek, is to understand how Google is using semantic contextual analysis combined with other signals, to define topics/concepts. It's how Google analyzes the words on a page to determine the "set" to which a word belongs - how relevant a search query is to pages in its database.

For example: How does Google assign relevance to the word "orange" on a page? They determine orange is related to the fruit set or to the color set by page context.

LDA Defined:

"Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model change in scientific fields over time (Griffiths and Steyver, 2004; Hall, et al. 2008).

A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words."
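To tie that definition back to the "orange" example, here is a toy Python sketch. Everything in it is invented for illustration: the two topics, the word probabilities, and the page mixtures are hand-written assumptions, whereas a real LDA model would learn them from a large corpus. It only shows the core idea that a word's topic depends on the page's topic mixture times how likely each topic is to produce that word.

```python
# Toy illustration only: topics, words, and probabilities are made up.
# Each topic is a probability distribution over words (each sums to 1.0).
TOPICS = {
    "fruit": {"orange": 0.30, "juice": 0.30, "peel": 0.25, "paint": 0.05, "shade": 0.10},
    "color": {"orange": 0.30, "juice": 0.05, "peel": 0.05, "paint": 0.30, "shade": 0.30},
}

def topic_given_word(word, doc_topic_mix):
    """P(topic | word, page) is proportional to P(topic | page) * P(word | topic)."""
    scores = {t: doc_topic_mix[t] * TOPICS[t].get(word, 0.0) for t in TOPICS}
    total = sum(scores.values())
    if total == 0:
        # The word is unknown to every topic, so nothing can be said about it.
        return {t: 0.0 for t in TOPICS}
    return {t: s / total for t, s in scores.items()}

# Each page is a mixture of topics (hypothetical values).
recipe_page = {"fruit": 0.9, "color": 0.1}   # a page mostly about fruit
design_page = {"fruit": 0.2, "color": 0.8}   # a page mostly about color

print(topic_given_word("orange", recipe_page))
print(topic_given_word("orange", design_page))
```

In this sketch the same word, "orange," lands mostly in the fruit topic on the recipe page and mostly in the color topic on the design page, purely because of the surrounding context.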

Bayesian - ah, a term I recognize!! Bayesian spam filtering is a method used to detect spam. It draws off a database and learns the meaning of words. It's "trained" by us when we mark an email as spam. It looks at incoming emails and calculates the probability that the content of an email is contextually spammy.
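For the curious, here is a minimal sketch of how that kind of spam filter can work, written as a tiny naive Bayes classifier in Python. The class name, the training messages, and the add-one smoothing are my own assumptions for illustration; real spam filters are far more elaborate.

```python
import math
from collections import Counter

class TinySpamFilter:
    """Toy naive Bayes classifier; not any real product's implementation."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """'Marking' a message: count its words under the given label."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Prior: smoothed fraction of training messages with this label
            log_p = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing so an unseen word does not zero out the score
                count = self.word_counts[label][word] + 1
                log_p += math.log(count / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Turn the two log scores into a normalized probability of spam
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp_scores["spam"] / sum(exp_scores.values())

# Invented training data standing in for emails a user has marked
spam_filter = TinySpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize click now", "spam")
spam_filter.train("meeting notes for monday", "ham")
spam_filter.train("lunch on monday", "ham")

print(spam_filter.spam_probability("free money"))
print(spam_filter.spam_probability("monday meeting"))
```

With this training data, "free money" scores as likely spam and "monday meeting" as likely not, because of which label each word was seen under during training.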

I found a PowerPoint presentation about Bayesian Inference Techniques by Microsoft Research from 2004 that presents the possibility of using LDA. Go to slide 54 and read:

"Can we build a general-purpose inference engine which automates these procedures?"

Microsoft has been looking at LDA models. Do search engines use it as one of their primary methods?

Ben sampled over 8 million documents with approx. 1,000 queries. He believes Google is using LDA topic modeling to determine (learn) what words mean by their associations with, and relevance to, other words on the page. (Other factors are included.) Ben called the results a "co-occurrence explanation" that uses a "cosine similarity."
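To make "cosine similarity" a little less Greek, here is a small Python sketch. It assumes a query and a page can each be described as a list of topic weights; the three topics and all the numbers are hypothetical and do not come from Ben's data or the SEOmoz tool. The score is the cosine of the angle between the two lists, so vectors pointing in the same direction score close to 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic weights over three made-up topics: [seo, fashion, cooking]
query_topics = [0.9, 0.05, 0.05]
page_on_topic = [0.8, 0.1, 0.1]
page_off_topic = [0.1, 0.8, 0.1]

print(cosine_similarity(query_topics, page_on_topic))
print(cosine_similarity(query_topics, page_off_topic))
```

The on-topic page scores much closer to 1 than the off-topic page, which is the sense in which a page's topics can "match" a query without repeating the exact keyword.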

SEO Takeaway:

  • Results that are higher in Google SERPs, in general, have more topical content.
  • Search engines do APPEAR to apply semantic analysis… when indexing a page and determining the intent of the words on the page.

Rand tweeted an explanation (in 140 x 4) as follows:

Rand's tweets explaining LDA

Dana's LDA Catwalk Metaphor for Topic Modeling:

Imagine the words on your page as walking down the fashion runway in Paris. Your keyword phrase is "dressed" in semantic accessories, words that correlate to and dress up your topic. Associated words bring meaning to and highlight the fashion model's outfit. Adjectives, modifiers and synonyms are like jewelry, hats, and shoes. The combination can transform your base layers (your target terms) from casual or conservative business attire into a sexy night-on-the-town ensemble.

Combinations and permutations of words on a page "dress" your skinny or curvy fashion model. Relevant words provide Google with an image of what she is wearing and the catwalk upon which she struts. LDA refers back to what Google already knows about these "accessories" (words) and their previous association with the topic terms related to fashion.

Enter Topical Ambiguity - I just broke the "rules" for context with the catwalk metaphor by referring to modeling in two contexts on this page:

  • I used "modeling" terms that relate to the "fashion industry" set.
  • The catwalk metaphor is irrelevant content that is off-topic for discussing "LDA topic modeling."

Google Algorithm Exposed?

Ben clearly said that LDA is an ATTEMPT to explain the SERPs. His scenario, a quote from his presentation slides, follows:

One of us needs to implement it so we can:

1) See how it applies to pages
2) See if it helps explain SERPs
One-two-three-not-it.

LDA is not LSI.

There were some tweets claiming SEOmoz was bringing back LSI or snake oil. Ben clarified that LDA is not LSI, which deals more with keyword density. He explained that he is NOT talking about loading keywords on a page but about the relevance of the topics within the page. He said that:

"LSI doesn’t have the same bias toward simple explanations. LSI breaks down as you try to scale up the number of topics."

The LDA tool deals with context, semantic relevancy, not density - in addition to some other random factors. Example:

If SEOmoz has a page all about "SEO" and "tools," and another word on the page is best explained by the SEO topic, then Google would count that related word toward the page's relevance. Meaning, "seo tools" doesn't have to be repeated over and over; the related word would be interpreted by Google as being relevant.

Ben, who appears to have the brain of a search engine, noted that it "appears" LDA is what Google is heading for in the near future. He said (paraphrased):

If they are not doing it, they seem to be doing something that has the same output. They are probably already using it.

Rand deciphered:

It’s a super weird coincidence if Google is not using it.

Are On-Page Signals Stronger than Links?

Are we heading toward more emphasis on on-page topic modeling? I'm not an IR geek, but I do plan to spend more energy focusing on understanding how search engines retrieve information. We are dealing with a semantic Web. LDA may indicate that good old on-page optimization sends stronger signals than links.

SEOmoz's LDA tool attempts to show how relevant content is to a chosen keyword. It computes relevance of queries.

The following shows how relevant SEOmoz's Tools page is, compared with a similar page on Aaron Wall's SEO Book.

seo tools relevance for SEOmoz & SEO Book

The score at the top is an indicator of how relevant the content on that page is according to LDA.

  • Aaron's content is 72%* relevant for the query "seo tools."
  • SEOmoz's tools page is 40%* relevant.

*NOTE: (I inserted the logos.) You can run the same pages and get different results. The results are similar in that SEO Book always scored as more topically relevant, but the percentage varies. Is this the random Monte Carlo algorithm at work? Ben?
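I don't know how the tool computes its score, so treat the following Python sketch as an assumption about sampling-based methods in general, not a description of SEOmoz's code. It shows why any estimate built on random sampling can wobble from run to run while still keeping the same winner. The 0.72 and 0.40 values are borrowed from the example above purely as stand-in "true" scores.

```python
import random

def estimate_share(true_share, samples, seed):
    """Estimate a proportion by random sampling; the result depends on the seed."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples) if rng.random() < true_share)
    return hits / samples

# Each "run" uses different random seeds, like re-running a sampling-based tool.
for run in (1, 2, 3):
    page_a = estimate_share(0.72, samples=500, seed=run)
    page_b = estimate_share(0.40, samples=500, seed=run + 1000)
    print(f"run {run}: page A = {page_a:.2f}, page B = {page_b:.2f}")
```

The exact percentages shift a little with each run, but with enough samples page A stays well ahead of page B, which would be consistent with what I saw.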

Mozinar Question:

"How do we execute this for SEO?"

Ben's Answer:

"I don't actually do SEO. I write code."

That's up to us, the SEOs, to play and test in our Google playground.

Use the tool to decide if you can win with LDA to optimize your on-page signals.

  1. Use the LDA Topics Tool to return words that could be used on a page for a query.
  2. Then determine who is ranking for that term.
  3. Simply write content that is highly on-topic based off the findings you observe.

If you are not performing that well in the SERPs, think about classic on-page optimization. In the example above, rather than putting another instance of "seo tools" on the page, LDA shows there are better ways to tell Google that you are about that topic. The tool provides a way to measure that.

IMPORTANT: There is a threshold at which too many related words will appear spammy. LDA is not something to be used to game Google.

Test the LDA Tool out for yourself, and draw your own conclusions.

***
DISCLAIMER: I'm not claiming this methodology has uncovered hidden SEO treasures. Time, testing and playing around with a new SEOmoz tool while observing the SERPs will reveal the answer. In the meantime, I'm going to dress up my pages and accessorize them with relevant terms that make them dazzle so they look good climbing the Google catwalk.