How do I recap the SEOmoz PRO Seminar session on Uncovering a Hidden Technique for SEO? The title is so attractive that it produces Pavlovian symptoms as we salivate at the thought of uncovering a hidden SEO treasure. Ben Hendrickson of SEOmoz presented a model which appears to show how Google may assign relevance to keyword terms based on context - topical relevance.
Is Latent Dirichlet Allocation (LDA) that hidden jackpot?
1st - LDA is not new, nor is it something SEOmoz invented. The information retrieval model has been around for 7 or 8 years, and IR geeks have discussed it before. There are a number of resources, as well as naysaying, about LDA and Google's possible use of it.
2nd - What is new is SEOmoz's LDA Topics Tool, which produces a relevancy score based on a query (search term). It enables one to play with words that may increase a page's relevancy in the eyes of Google. It shows words that help Google determine how relevant the page is to a user's search query.
Game Changer?
Kyle Stone tweeted that the LDA tool is a game changer, and many retweeted.
Is SEOmoz's LDA tool a game changer? That's yet to be seen. The goal is to report Ben's research as presented at the Mozinar and how a layman (myself) interprets such. Rand is going to do a follow-up post to explain more.
Why all the hype?
The SEO Challenge
SEOs face the continual challenge of figuring out Google's hidden ranking algorithms. How do we rank higher? Which signals are the most important? We know search engines are "learning models" that attempt to understand the "context" of words. Google has said for years that webmasters should concentrate most on providing good, relevant (contextual) content.
There are ways to rank higher. Is it as easy as 1, 2, 3?
- Create quality copy with keyword(s) on the page along with associated anchor text links.
- Get good links.
- What Ben talked about in this session.
LDA - Topic Modeling & Analysis
Latent Dirichlet Allocation, in layman's terms, translates to "topic modeling." In search geek terms, LDA is the following formula:
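The formula on Ben's slide appeared as an image in the original post and is not reproduced here. Assuming it was the standard generative model from Blei et al. (2003), the per-document joint distribution is:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

where \theta is the document's topic mixture (drawn from a Dirichlet prior with parameter \alpha), z_n is the topic assigned to the n-th word, and \beta parameterizes each topic's distribution over words.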
(Did you digest that? Don't worry; Mozzers groaned and laughed at the same time. PLUS: Scientist Hendrickson delivered this session after lunch!)
LDA Simplified - Here is Ben's way of explaining topic modeling:
(Okay, I was once proud that I got an A in Logic and Combinatorics - discrete math/set theory. However, that computer science class now feels like basic math compared to this formula.)
It made more sense when Rand Fishkin joined Ben on stage and when Todd Friesen moderated and deciphered during Q&A. (Manuela Sanches of Brazil was sitting next to me and said that Ben's "presentation needed subtitles!")
The objective of LDA, from my deciphering of Greek, is to understand how Google uses semantic contextual analysis, combined with other signals, to define topics/concepts. It's how Google analyzes the words on a page to determine the "set" to which a word belongs - how relevant a search query is to pages in its database.
For example: How does Google assign relevance to the word "orange" on a page? They determine orange is related to the fruit set or to the color set by page context.
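That disambiguation can be caricatured in a few lines of Python. This is a toy illustration of the idea only, not anything Google or SEOmoz actually runs; the word sets and sample sentences are invented:

```python
# Toy word-sense disambiguation by set co-occurrence. Illustrative only:
# the word sets and sample sentences are invented, not from Google or SEOmoz.
FRUIT_SET = {"juice", "peel", "citrus", "vitamin", "ripe", "grove"}
COLOR_SET = {"paint", "hue", "shade", "bright", "wall", "palette"}

def classify_orange(page_text: str) -> str:
    """Guess which sense of 'orange' a page uses from surrounding words."""
    words = set(page_text.lower().split())
    fruit_hits = len(words & FRUIT_SET)
    color_hits = len(words & COLOR_SET)
    if fruit_hits == color_hits:
        return "ambiguous"
    return "fruit" if fruit_hits > color_hits else "color"

print(classify_orange("fresh orange juice from the citrus grove"))   # fruit
print(classify_orange("a bright orange paint for the kitchen wall")) # color
```

A real system would use far larger, learned word distributions rather than hand-picked sets, but the principle - page context votes for a sense - is the same.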
LDA Defined:
"Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008).
A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words."
Bayesian - ah, a term I recognize!! Bayesian spam filtering is a method used to detect spam. It draws off a database and learns the meaning of words. It's "trained" by us when we mark an email as spam. It looks at incoming emails and calculates the probability that the content of an email is contextually spammy.
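The mechanics described above can be sketched as a miniature naive Bayes classifier. Everything here is an invented toy - the training sentences, the equal 50/50 prior, and the 1e-3 probability floor for unseen words - not any real filter's implementation:

```python
import math

# Miniature naive Bayes spam filter. Everything here is an invented toy:
# the training sentences, the equal 50/50 prior, and the 1e-3 probability
# floor for words never seen in training.
spam_docs = ["win free money now", "free prize offer now"]
ham_docs  = ["meeting agenda for monday", "lunch plans for friday"]

def word_probs(docs):
    """Per-word relative frequency across a set of training documents."""
    words = " ".join(docs).split()
    total = len(words)
    return {w: words.count(w) / total for w in set(words)}

def spam_probability(text, spam_docs, ham_docs):
    """P(spam | text) under a bag-of-words naive Bayes model."""
    p_spam_w = word_probs(spam_docs)
    p_ham_w = word_probs(ham_docs)
    log_spam = log_ham = math.log(0.5)  # equal priors
    for w in text.lower().split():
        log_spam += math.log(p_spam_w.get(w, 1e-3))
        log_ham += math.log(p_ham_w.get(w, 1e-3))
    odds = math.exp(log_spam - log_ham)
    return odds / (1 + odds)

print(round(spam_probability("free money offer now", spam_docs, ham_docs), 3))
```

Marking an email as spam corresponds to appending it to `spam_docs` and recomputing the word frequencies - that is the "training by us" the paragraph describes.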
I found a PowerPoint presentation about Bayesian Inference Techniques by Microsoft Research from 2004 that presents the possibility of using LDA. Go to slide 54 and read:
"Can we build a general-purpose inference engine which automates these procedures?"
Microsoft has been looking at LDA models. Do search engines use it as one of their primary methods?
Ben sampled over 8 million documents with approximately 1,000 queries. He believes Google is using LDA topic modeling to determine (learn) what words mean by their associations with, and relevance to, other words on the page. (Other factors are included.) Ben called the results a "co-occurrence explanation" that uses a "cosine similarity."
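The "cosine similarity" Ben mentioned is easy to show in its simplest bag-of-words form. This is a minimal sketch; real systems would weight terms (e.g. with TF-IDF) rather than use raw counts:

```python
import math
from collections import Counter

# Bag-of-words cosine similarity between two pages' text. Real systems
# would weight terms (e.g. TF-IDF) instead of using raw counts.
def cosine_similarity(doc_a: str, doc_b: str) -> float:
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("seo tools for keyword research",
                        "free seo tools and keyword data"))
```

A score of 1.0 means identical word distributions; 0.0 means no shared vocabulary at all. In an LDA setting the vectors compared would be topic distributions rather than raw word counts.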
SEO Takeaway:
- Results that are higher in Google SERPs, in general, have more topical content.
- Search engines do APPEAR to apply semantic analysis when indexing a page and determining the intent of the words on the page.
Rand tweeted an explanation (in 140 x 4) as follows:
Dana's LDA Catwalk Metaphor for Topic Modeling:
Imagine the words on your page as walking down the fashion runway in Paris. Your keyword phrase is "dressed" in semantic accessories, words that correlate to and dress up your topic. Associated words bring meaning to and highlight the fashion model's outfit. Adjectives, modifiers and synonyms are like jewelry, hats, and shoes. The combination can transform your base layers (your target terms) from casual or conservative business attire into a sexy night-on-the-town ensemble.
Combinations and permutations of words on a page "dress" your skinny or curvy fashion model. Relevant words provide Google with an image of what she is wearing and the catwalk upon which she struts. LDA refers back to what Google already knows about these "accessories" (words) and their previous association with the topic terms related to fashion.
Enter Topical Ambiguity - I just broke the "rules" for context with the catwalk metaphor by referring to modeling in two contexts on this page:
- I used "modeling" terms that relate to the "fashion industry" set.
- The catwalk metaphor is irrelevant content that is off-topic for discussing "LDA topic modeling."
Google Algorithm Exposed?
Ben clearly said that LDA is an ATTEMPT to explain the SERPs. His scenario, a quote from his presentation slides, follows:
One of us needs to implement it so we can:
1) See how it applies to pages
2) See if it helps explain SERPs
One-two-three-not-it.
LDA is not LSI.
There were some tweets claiming SEOmoz was bringing back LSI or snake oil. Ben clarified that LDA is not LSI, which deals more with keyword density. He explained that he is NOT talking about loading keywords on a page but about the relevance of the topics within the page. He said that:
"LSI doesn’t have the same bias toward simple explanations. LSI breaks down as you try to scale up the number of topics."
The LDA tool deals with context, semantic relevancy, not density - in addition to some other random factors. Example:
If SEOmoz has a page all about "SEO" and "tools," and there is another word on the page that can be explained by a word more related to the SEO topic, then the related word would be used. Meaning, "seo tools" doesn't have to be repeated over and over; the related word would be interpreted by Google as being relevant.
Ben, who appears to have the brain of a search engine, noted that it "appears" LDA is what Google is heading for in the near future. He said (paraphrased):
If they are not doing it, they seem to be doing something that has the same output. They are probably already using it.
Rand deciphered:
It’s a super weird coincidence if Google is not using it.
Are On-Page Signals Stronger than Links?
Are we heading toward more emphasis on on-page topic modeling? I'm not an IR geek, but I do plan to spend more energy focusing on understanding how search engines retrieve information. We are dealing with a semantic Web. LDA may indicate that good old on-page optimization sends stronger signals than links.
SEOmoz's LDA tool attempts to show how relevant content is to a chosen keyword. It computes relevance of queries.
The following shows how relevant SEOmoz's Tools page is to a similar page on Aaron Wall's SEO Book.
The score at the top is an indicator of how relevant the content on that page is according to LDA.
- Aaron's content is 72%* relevant for the query "seo tools."
- SEOmoz's tools page is 40%* relevant.
*NOTE: (I inserted the logos.) You can run the same pages and get different results. The results are similar in that SEO Book always scored as more topically relevant, but the percentage varies. Is this the randomness of the Monte Carlo sampling at work? Ben?
Mozinar Question:
"How do we execute this for SEO?"
Ben's Answer:
"I don't actually do SEO. I write code."
That's up to us, the SEOs, to play and test in our Google playground.
Use the tool to decide if you can win with LDA to optimize your on-page signals.
- Use the LDA Topics Tool to return words that could be used on a page for a query.
- Then determine who is ranking for that term.
- Simply write content that is highly on-topic based off the findings you observe.
If you are not performing that well in the SERPs, think about classic on-page optimization. In the example above, rather than putting another instance of "seo tools" on the page, LDA shows there are better ways to tell Google that you are about that topic. The tool provides a way to measure that.
IMPORTANT: There is a threshold at which too many related words will appear as too spammy. LDA is not something to be used to game Google.
Test the LDA Tool out for yourself, and draw your own conclusions.
***
DISCLAIMER: I'm not claiming this methodology has uncovered hidden SEO treasures. Time, testing and playing around with a new SEOmoz tool while observing the SERPs will reveal the answer. In the meantime, I'm going to dress up my pages and accessorize them with relevant terms that make them dazzle so they look good climbing the Google catwalk.
Note to understand my comment: I absolutely do not have a scientific background, and my ability to understand formulas like the ones presented in your post can be compared to that of an ant. But I have deep semiotic (and rhetorical) knowledge from my past studies and work... and I think that LDA could also be explained with the theory of signs and rhetoric.
Now I try to explain my assumption in the most logical way:
Therefore context - in a wider sense - is important. But saying that on-page context is more important than the link graph is probably not totally correct. Maybe the best assumption is to think of LDA as something as important as the link graph, with each complementing the other.
Links present in a different context from the site linked are going to be devalued because of the failed correspondence. But a page with the highest LDA percentile and no link graph confirming its relevancy is also - probably - going to rank worse than a less LDA-optimized page that has a better link profile.
P.S.1: maybe this search for context signs - apart from all the business reasons - can explain why search engines are indexing more kinds of content every day (images, .pdf, .doc, Flash video, and now also SVG... when audio files?).
P.S.2: I hope I was understandable
P.S.3: If not just think that I mixed LDA with LSI and came out with LSD
I agree with you. On page optimization is too easy to game and easy to spam, whereas external links will always carry more authority with the search engines. Although this tool makes me think I need to do a better job with on-page optimization.
That was a post in itself gfiorelli1! Your point about links providing context for LDA is certainly key, and we do know Google values links heavily. There are certainly complex factors involved, and our attempts to explain such are from a layman's perspective.
LSD? Well, some of us in attendance may have felt we were observing and listening to Ben's presentation through such a lens!
I think that was a great summary!
Hi, you are right about the signs and the context. I remember Umberto Eco. I studied him once, when i was in the university trying to learn about fine arts.
It would be interesting now to get a review after the tool has gone online. Is there anybody out here who has used this tool effectively for better on-page optimization?
Very, very interesting Dana. The LDA tool - super insightful. This may be naive of me, but I've always used Google's External Keyword Tool to find relevant keyword phrases to add to my content to make it seem more relevant. Google lists the keyword suggestions in order of relevancy, so it makes sense that the more of these related terms it finds in my copy, the more likely they will judge my targeted keyword as relevant. The SEOmoz LDA tool works using a different formula, but seems to judge how good a job I'm doing at targeting these phrases.
Rack up another metric for on-page optimization, perhaps the most important since the title tag.
This may be naive of me, but I've always used Google's External Keyword Tool to find relevant keyword phrases to add to my content to make it seem more relevant
Rather than being a naive idea, I think that's a great idea Cyrus. As long as you keep the words within the parameters of good copy (i.e., not AdSense-like copy), it sounds like it'd lend itself toward a better LDA score.
I'm planning to post about this much more on Monday night/Tuesday morning (didn't know Dana was writing this up!), but I can say that the LDA tool isn't suggesting terms/phrases that get lots of search volume, but rather words and phrases that are likely "connected" in a vector space model of semantics.
So, for example, when writing about elephants, words like "tusks," "protrusions," "hide," "pachyderm," "africa," "poachers," etc. may not be commonly searched-for, but may be useful to use on the page to provide more value to readers who are interested in learning about elephants AND more relevant to search engines who use vector-space models (probably more complex/advanced than the LDA stuff we've created) to influence their rankings.
It sounds like another great tool for the labs Rand. If anyone could find a way to create a tool that lists related phrases it's your team.
Google wonder wheel does the job of finding related phrases really well.
Would you say that looking at the collocation of words within a search results page would achieve the same? E.g., take the top x results, extract common phrases from those pages, and chances are they are there because Google has deemed those pages relevant to the same topic. Saves on having to create an index of the web and implement the algorithm.
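The shortcut proposed above - mining the top-ranked pages themselves for shared vocabulary instead of modeling the whole web - might look something like this sketch. The page texts are invented, and a real version would strip stop words and look at phrases, not single words:

```python
from collections import Counter

# Sketch of the idea above: mine the top-ranked pages themselves for shared
# vocabulary instead of modeling the whole web. Page texts are invented; a
# real version would strip stop words and look at phrases, not single words.
def shared_terms(top_pages, min_pages=2):
    df = Counter()  # document frequency: how many pages contain each word
    for page in top_pages:
        df.update(set(page.lower().split()))
    return sorted(w for w, n in df.items() if n >= min_pages)

pages = [
    "seo tools for rank tracking and keyword research",
    "free keyword research tools for seo beginners",
    "our seo software includes rank tracking reports",
]
print(shared_terms(pages))
```

Words that recur across several top-ranked pages are exactly the "accessories" the catwalk metaphor describes; terms unique to one page drop out.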
Sounds very logical.
But when I compare the no. 1 result in the SERPs against my page (far from position 1) for a particular keyword, we have a score of
6% (wikipedia) : 62% (mine)
So I think we are still on the way to a semantic web.
Until then, incoming links and trust are the major factors.
But this future vision sounds good for when the link-hunting game is over.
Because it's just a game: the best link collector wins, not the most relevant!
Long live king content .... :)
Ben didn't offer any ideas on how to use it for SEO, so as soon as I was home from the Mozinar I explained the LDA tool to my SEO team as best I could and we brainstormed a number of ways we could apply it to our SEO efforts. It's not as eloquent an explanation as Dana's as to the intricacies and history of LDA, but maybe it can help answer that particular question of how to use it.
Nice article Rebecca. Very well explained. Thanks for the link.
I just recently tried this tool, after reading this article. I noticed a couple of my projects' home pages were ranked on the first page - not in a highly competitive market, but locally, yes. When I looked at the LDA tool, it scored both of my project home pages above 50%.
So I am curious to start tracking this. Thank you for this great eye opener! I've always believed that on-site optimization is just as important as off-site. But it's great to hear the explanation of being topically relevant and not spamming the keywords.
Cheers! :-)
We had a similar idea, and the sore truth is that it still comes down to the number of root domains linking and the variation of anchors.
Awesome stuff. What a way to come back from Labor Day. This is an interesting development not just for content on one's own site, but for links too. This emphasis on relevancy is linked in my mind to the recent kerfuffle about linkbait and infographics (see https://www.davidnaylor.co.uk/infographics-are-here-to-stay.html) and the now notorious Pakistan flood infographic (see: https://thenextcorner.net/pakistan-floodbait-end-infographic/). Infographics are very clever but do they actually make sense half the time? Well, no. I welcome any attempts on Google's part to limit their effectiveness (when they are irrelevant).
If SEOmoz took the "Better Way to Think of it" equation and put it on a t-shirt I'd just have to buy one...maybe two. But I'm a super nerd when it comes to industry t-shirts. Still not a bad idea.
I've known about this for some time and always called it 'thesaurus-based content' before I knew the real term was synonyms. You can use the Google suggest box as well as online synonym tools to work through your content, ensuring that all terminology reinforces the core keyword.
Very interesting concept... I have started using the Wonder Wheel to add relevant keywords. Great tool.
Excellent post - we heard about Latent Semantic Indexing (LSI) a couple of years ago, but this is a far better model. Still, I think that Google may have customised it a bit if they're really looking to emulate human searchers. A couple of hypotheses/questions:
I love this topic. I have been dabbling in topic modeling the last few weeks and LDA is exactly the formula I used. Good to see I chose the right direction. There are two situations I use it in. (1) To crawl a list competitor's high priority pages (for me) and extract the interesting words and topics. (2) I group together keywords and find out what pages are ranking for that group, and run those pages through my LDA script, and extract any useful insights.
I am really into 3D modeling lately as well. Right now I have a prototype of a 3D scatter plot that clusters interesting topics together. It's interactive so you can pan & zoom. It's all browser based too, so no Flash, no plugins, etc. I am doing this because I write code but my role has switched to SEO in recent years. I hate using Excel. On top of this, our office is filled with a nice mix of creative personnel and statistics nuts. Interactive 3D topic modeling should help bridge the gap between those two teams. When I get closer to completion I will probably write a post demonstrating it for the community to use. Not only will it give the user actionable data, but it will also be visually pleasing to the average non-SEO.
My experiences have led me to believe quite strongly in on-page SEO for getting the results I seek in the SERPs. I'm absolutely certain I could obtain higher rankings via this method if I were to write less per page and focus instead on creating more tightly written category silos for my own site. However, since I'm trying to provide prospects with a logical progression of information - quotes backed up by links to parent resources - in order to educate and convert them, I must accept losing some position as the sacrifice. I like the way this tool rates pages, however. As a test, I took your own results (SEOmoz.org), copied and pasted them into an HTML doc, uploaded the text to a brand-new domain, ran the test on the results again, and got almost exactly the same rating (minus two). I then overwrote the page with a blank page so as not to invite Google to place the new site in the SERPs and possibly harm anyone else's ranking.
What's most interesting, however, is that the index page of SEOmoz.org did not contain the search term I chose to investigate even once, and the remaining text in the results also did not seem to pertain very closely to what I was looking for, even though SEOmoz holds the first spot in the SERPs for that keyword phrase. Most interesting. I'll have to continue looking at this. Thanks.
Hey Dana - thanks so much for covering this session. It was certainly exciting to be there and see Ben's research for the first time together with so many folks :-)
On the specifics - I think there might be some inaccuracies above, but that's certainly not your fault. Ben and I have been working on a post (with some help from others) to help clarify the issue as best we can. We're shooting to have that ready on Tuesday. I'm sorry - I didn't know you were planning to post something as well or we could have coordinated!
Thanks again for all your hard work covering what was, I'm sure, a very challenging session (amongst many great, and fast paced presentations).
Thanks Rand. I just had a "duh" moment! Of course, why didn't I think to connect with you knowing you had a post forthcoming? sigh... I strictly had my blogger/recap hat on. Hindsight is 20/20.
However, what may be beneficial out of this is how those of us in attendance understand it. I spoke with close to a dozen attendees, and everyone had a different opinion/understanding of LDA and the tool.
I am surely not alone in looking forward to your and Ben's explanation to enlighten us. Thank YOU!
How uber sweet to be able to comment again!
Thanks for the post Dana. That is some pretty awe inspiring Greek Ben used. I'm fairly certain I would have had to put a helmet on my head to keep it from exploding if I had sat through that presentation. You've explained it nicely and now I know for certain that it's not about the Learning Disabilities Association, the Long Drivers of America or the Lyme Disease Association.
While I was watching the tweet stream come out of #mozinar, I was totally stumped by the "LDA" reference. Every reference said things like "game changer" and "blown away" and I was all...Huh? What? What's LDA??
So instead of just twittering back "Yo, what in the world is LDA?" I decided that I didn't want to look like a total ignoramus (and thus remove any lingering doubts people had) so I went looking for "LDA SEO" on Google. And nothing came up that explained it. PS - I just looked on Google a minute ago and the entire SERP is now dominated by SEOmoz and this topic.
Sorry about the run on paragraph. I had used spacing, but apparently when Javascript is turned off, so is spacing. :(
Key points from Rand,
"We're hopeful that this is the start of learning more about this process and productizing suggestions about it. It remains to be seen whether folks can "improve" their LDA scores according to our models and move up in the rankings, but we should see some of those results soon."
Let's all keep these points in mind.
Consider this recap an interpretation, with some errors, from one who is learning about LDA without testing or applicable understanding. Just like the telephone game where you sit in a circle and pass on a message, meanings get deciphered and translated differently. Thus, I'd suggest we hold off on further comments on this post to allow Rand and Ben to share and explain more in their post next week. Wouldn't it be best to hold the conversation there? Agree?
Can you provide a link to their post here? Thanks!
Great blog post and a good attempt at explaining what can be a very complicated area.
Research shows that users are increasing their average search query length each year in anticipation of the ambiguous results they will receive. So, if LDA isn't one of their primary methods used to establish how relevant a search query is to a particular page in its database, what else could they be using? LSI certainly isn't scalable.
Google offered my old university professor big bucks to go work for them with his knowledge in text mining to improve relevancy in their results. You can bet LDA has been in use for a good few years at least.
Hi Jamie - Ben noted this in his presentation as well - that LDA, at least our model for it, is likely much more simple than what Google actually uses to calculate term/phrase vector models. That said, we're hopeful that this is the start of learning more about this process and productizing suggestions about it. It remains to be seen whether folks can "improve" their LDA scores according to our models and move up in the rankings, but we should see some of those results soon.
One more great post from you, Dana - informative, interesting, and insightful. On-site optimization is always as important as off-site optimization, and this is a great tool to help give ideas on how we should optimize our pages.
Thanks for writing about it. :)
Hey Dana, I just finished reading Rand's post on LDA, and was compelled to come back here to say you did a great job recapping it. I got the exact same takeaways from his post as I did from yours. You go girl!
LDA score for this page for the term "lda score": ~85%
Score for Rand and Ben's follow-up post: ~70% :)
Any plans to bring LDA score into the SEOmoz API?
Well, the takeaway from this must surely be that long, detailed articles about your topic should help you rank better - because in the course of a 1,000-word article, you will naturally have mentioned all the words related to your subject, which in turn means the bots have a better idea of what the page is about.
The second takeaway is that spun content gets knocked down the rankings, especially badly spun stuff.
Unfortunately it's not working with other languages, such as Slovak, that contain special characters. Are you planning on fixing this?
I have been out of SEO for a good 4 years now, and this topic is rekindling my interest in all things search. Finally, I have gotten started on a few experiments based around topicality and have also incorporated tests for topical links, non-topical links, and site/page topicality. Should be good to get baseline learnings in 3 months' time. Now I am in the midst of taking up math/stat classes to get myself up to some serious analysis ;)
SEOmoz has a new member now (not that it matters), and it would be an understatement to say that I will be following this place very closely, perhaps every day.
Thanks Dana for the post.
Other than trial and error, is there some way to get this tool to suggest additional terms, or at least highlight terms that are contributing to the overall score?
Hear, Hear
I have been trial and error-ing for the last hour and not getting too far.
Rand, what I would like is: I give the tool my keyword, and it gives me back my content - then SEOmoz takes on no more PRO members.
To be realistic, a list of dos and don'ts would be good: words that are relevant, and those that may cause ambiguous content.
I can see this being a great tool for ambiguous keywords, other keywords may not benefit so much.
So, basically you are saying: use synonyms when writing stuff?
Plus related keywords and terms.
As mentioned in the example with "The Stones" and Mick Jagger.
Audax666 is correct: related keywords and terms. "Mick Jagger" is not a synonym for "The Stones" but is contextually relevant.
I may have placed too much emphasis on synonyms by bolding such words above. Bottom line: think about classic on-page optimization. You don't want to mix messages. Avoid topical ambiguity as I did (on purpose) by writing about "topic modeling" and "fashion modeling" in the same post.
We do know Google's algorithm is complex and that they do look at context.
Again, we all look forward to Rand's expansion on the application of LDA and the SEOmoz tool. We can continue that conversation there.
Thanks everyone!
Hi Dana,
Good post. My maths is not great, so the explanation that went with it helped a lot - I can just imagine people's faces when the formulas started to appear!
LDA does make a lot of sense when thinking about search, and it will possibly become more relevant as the web becomes more semantic.
I am looking forward to the follow-up post from the SEOmoz team on this, as it's something I am keen to learn more about.
For quick rankings we should add a fourth factor, which isn't new or groundbreaking information, but it deserves to be revisited: keyword relevance in the domain.
At least in the search scene in Norway (and I would assume it's not limited to Norway), I've seen a huge number of domains that are spot-on relevant to the keywords claim top rankings. Even with low LDA relevance (thanks - this was a great tool to prove my point even more strongly), hardly any inbound links, and spammy-looking content, you see these domains pop up everywhere in the rankings.
Keyword relevance in domain has always been an important factor, but I'm suspecting Google has put A LOT more weight on it.
Any thoughts? Other observations?
Correlation between LDA and Google Position
I just did a (very) quick and dirty experiment on data which is close to my heart as the owner of a maternity clothes shop: I took the top 20 Google ranked sites for 'Maternity Clothes', plugged them into to LDA tool and checked the correlation (using Excel's Pearson test).
The result was -0.45 - a middling negative correlation between rank and LDA value, meaning that overall, as Google position number goes up, LDA goes down, but position is also affected by other factors. I think this is a strong result; I didn't expect LDA alone to have as strong a correlation. More data is obviously needed (this is just one test on a small data set with only one key phrase considered), but I think that's an interesting result.
Notes:
A correlation result of +1 would indicate a perfect positive correlation, 0 no correlation, and -1 a perfect negative correlation.
Experiment was done on Google UK
There was one outlier (which I included in the correlation) with a high rank but a very low LDA score, as it has recently stopped trading and is now just a holding page; it's taking a while for it to drop down the rankings.
To get a better picture I need to compare this correlation with others, the obvious being to do a correlation between SERP and domain/page authority.
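For anyone who wants to replicate this outside Excel, Pearson's r is short enough to compute by hand. The rank/LDA pairs below are invented for illustration, not the commenter's actual "maternity clothes" data:

```python
import math

# Pearson's r, the same statistic Excel's PEARSON() computes. The rank/LDA
# pairs below are invented for illustration, not the commenter's actual data.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ranks = [1, 2, 3, 4, 5]        # Google position (1 = top)
lda   = [80, 75, 60, 62, 40]   # hypothetical LDA scores (%)
print(round(pearson(ranks, lda), 2))
```

A negative r here means LDA score falls as the position number rises, i.e. higher-ranked pages score better, matching the direction of the commenter's -0.45 result.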
Thanks for such a great post Dana!
Great post. Using Google tools like the Wonder Wheel, Sets, and the ~keyword search, you can find the words that Google sees as related or relevant.
Thanks for a great post.
For some reason people were not taking on-page optimization seriously, and suddenly, after this LDA thing, they feel the importance of on-page optimization.
I think on-page is the base of any SEO, and lately people were ignoring it by spending more time on link building. But Google and other search engines have maintained a constant importance on on-page factors.
Only after someone performs experiments on this theory can it be proven correct, and finally we can expose the Google algorithm.
I think people were taking on-page seriously - just acknowledging that link building is more important. Building links without having a well-optimised site would be like bailing water out of a boat with holes in it. Either way, you're going to expend a lot of effort and still end up sinking. Without having a stable, seaworthy craft, you're never going to get anywhere.
All LDA does is give us more of an idea of how we should optimise our pages.
Howdy Dana,
Thanks for sharing all this great stuff from the Mozinar! And thanks for giving us some 'context' on LDA. I think all of us SEOs instinctively knew this was happening, but there hasn't been much discussion, per se, about the mechanics of semantic relatedness. Offhand, I can think of a couple of interesting tools that Google might be using to achieve its 'topic model'.
The first is knowledge-based information retrieval, in which databases of information such as Wikipedia are mined for related terms and frequency of occurrences.
The second is entity tags. These would certainly reduce the margin of error when attempting to automatically calculate numerous uses for terms in an almost infinite combination of words and sentences.
Do you think Google might be using these to help their algorithm understand the nuances of human languages?
I heard about Latent Semantic Indexing (LSI) a couple of years ago, but you've done the best job at explaining it. Great post!
I just used the LDA tool to match the term "website design" with a page that is about website design but in a different language (I did not post the link to the page because some might consider it self-promotion). The relevance was less than 1%. The same page with the same keyword, but in the language the page was created in, got a score of 99.981%.
Does that mean that search engines cannot recognize content relevance to a topic if it is in another language? Is the same content in two different languages categorized by search engines as two different topics? SEOs, what's the verdict on this?