Last week at our annual mozinar, Ben Hendrickson gave a talk on a unique methodology for improving SEO. The reception was overwhelming - I've never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker's remarks.
_Ben Hendrickson speaking last Fall at the Distilled/SEOmoz PRO Training London (he'll be returning this year)_
I doubt I can recreate the energy and excitement of the 320-person room that day, but my goal in this post is to explain the concepts of topic modeling and vector space models as they relate to information retrieval, along with the work we've done on LDA (Latent Dirichlet Allocation). I'll also try to explain the relationship and potential applications to the practice of SEO.
A Request: Curiously, prior to the public release of this post and our research, there have been a number of negative remarks and criticisms from several folks in the search community suggesting that LDA (or topic modeling in general) is definitively not used by the search engines. We think there's a lot of evidence to suggest the engines do use these techniques, but we'd be excited to see contradicting evidence presented. If you have such work, please do publish!
The Search Rankings Pie Chart
Many of us are likely familiar with the ranking factors survey SEOmoz conducts every two years (we'll have another one next year, and I expect some exciting/interesting differences). Of course, we know that this aggregation of opinion is likely missing many factors and may over- or under-emphasize the ones it does show.
Here's an illustration I created for a recent presentation to help show the major categories in the overall results:
_This suggests that many SEOs don't ascribe much weight to on-page optimization_
I myself have often felt, from all the metrics, tests and observations of Google's ranking results, that the importance of on-page factors like keyword usage or TF*IDF (explained below) is fairly small. Certainly, I've not observed many results, even in low-competition spaces, where one can simply add a few more repetitions of the keyword, maybe toss in a few synonyms or "related searches," and improve rankings. This experience, which many SEOs I've talked to share, has led me to believe that link-based signals account for an overwhelming majority of how the engines order results.
But, I love to be wrong.
Some of the work we've been doing around topic modeling, specifically using a process called LDA (Latent Dirichlet Allocation), has shown some surprisingly strong results. This has made me (and I think a lot of the folks who attended Ben's talk last Tuesday) question whether it was simply a naive application of the concept of "relevancy" or "keyword usage" that gave us this biased perspective.
Why Search Engines Need Topic Modeling
Some queries are very simple - a search for "wikipedia" is unambiguous, straightforward, and can be effectively returned by even a very basic web search engine. Other searches aren't nearly as simple. Let's look at how engines might order two results - a problem that's simple most of the time, but can become complex depending on the situation.
For complex queries, or when ordering large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. The fact that a page mentions the keyword 4 or 5 times in prominent places, or even includes similar phrases/synonyms, doesn't necessarily mean it's truly relevant to the searcher's query.
Historically, lots of SEOs have put effort into this process, so what we're doing here isn't revolutionary, and topic models, LDA included, have been around for a long time. However, no one in the field, to our knowledge, has made a topic modeling system public or compared its output with Google rankings (to help see how potentially influential these signals might be). The work Ben presented, and the really exciting bit (IMO), is in those numbers.
Term Vector Spaces & Topic Modeling
Term vector spaces, topic modeling and cosine similarity sound like tough concepts, and when Ben first mentioned them on stage, a lot of the attendees (myself included) felt a bit lost. However, Ben (along with Will Critchlow, whose Cambridge mathematics degree came in handy) helped explain these to me, and I'll do my best to replicate that here:
In this imaginary example, every word in the English language is related to either "cat" or "dog," the only topics available. To measure whether a word is more related to "dog," we use a vector space model that creates those relationships mathematically. The illustration above does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle, no closer to "cat" than to "dog." But words like "canine" and "feline" are clearly closer to one than the other, and the degree of the angle in the vector model illustrates this (and gives us a number).
BTW - in an LDA vector space model, topics wouldn't have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs."
Unfortunately, I can't really visualize beyond this step, as it relies on taking the simple model above and scaling it to thousands or millions of topics, each of which would have its own dimension (and anyone who's tried knows that drawing more than 3 dimensions in a blog post is pretty hard). Using this construct, the model can compute the similarity between any word or group of words and the topics it has created. You can learn more about this from Stanford University's posting of Introduction to Information Retrieval, which has a specific section on Vector Space Models.
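To make the cat/dog illustration concrete, here's a tiny sketch of that two-topic world. The vectors below are invented purely for illustration - they're not output from our model - but they show how the cosine of the angle between word vectors turns "closeness to a topic" into a number:

```python
import math

# Toy 2-topic space: each word is a vector of (dog-relatedness, cat-relatedness).
# These numbers are invented for illustration only.
word_vectors = {
    "canine":  (0.9, 0.1),
    "feline":  (0.1, 0.9),
    "bigfoot": (0.5, 0.5),
    "dog":     (1.0, 0.0),
    "cat":     (0.0, 1.0),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(word_vectors["canine"], word_vectors["dog"]))   # ~0.99: very dog-like
print(cosine_similarity(word_vectors["feline"], word_vectors["dog"]))   # ~0.11: not dog-like
print(cosine_similarity(word_vectors["bigfoot"], word_vectors["dog"]))  # ~0.71: right in between
```

A real system does exactly this, except the vectors have thousands or millions of topic dimensions instead of two.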
Correlation of our LDA Results w/ Google.com Rankings
Over the last 10 months, Ben (with help from other SEOmoz team members) has put together a topic modeling system based on a relatively simple implementation of LDA. While it's certainly challenging to do this work, we doubt we're the first SEO-focused organization to do so, though possibly the first to make it publicly available.
When we first started this research, we didn't know what kind of impact LDA/topic modeling might have on search engine rankings. Thus, on completion, we were pretty excited (maybe even ecstatic) to see the following results:
Correlation Between Google.com Rankings and Various Single Metrics
_(the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)_
Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we've shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:
- TF*IDF - the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google's rankings
- Followed IPs - this is our highest-correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we've shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it's valuable in this context to just think about and compare raw link numbers.
- LDA Cosine - this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
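For those curious about the mechanics, the Spearman calculation behind charts like this can be sketched in a few lines. The SERP below is invented for illustration; we correlate the metric against negated position so that a positive coefficient means higher scores accompany better (lower-numbered) rankings:

```python
def spearman(xs, ys):
    """Spearman rank correlation for samples with no tied values."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for pos, i in enumerate(order, start=1):
            r[i] = pos
        return r
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n * n - 1))

# Invented example: LDA-style scores for the pages at positions 1..5.
positions = [1, 2, 3, 4, 5]
lda_scores = [0.8, 0.9, 0.5, 0.6, 0.2]

# Negate positions so "bigger is better" on both axes.
rho = spearman(lda_scores, [-p for p in positions])
print(rho)  # 0.8 on this toy data
```

The real study does this over thousands of queries and averages the coefficients, which is where the standard error bars come from.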
The correlation of the LDA scores with rankings is uncanny. Certainly, it's not a perfect correlation, but that shouldn't be expected given the supposed complexity of Google's ranking algorithm and the many factors therein. But seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps good links are more likely to point to pages that are more "relevant" via a topic model, or some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.
However, given that many SEO best practices (e.g. keywords in title tags, static URLs) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.
The LDA Labs Tool Now Available; Some Recommendations for Testing & Use
We've just recently made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page's content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).
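The tool's internals aren't published here, but as a rough illustration of what an "LDA cosine" score could look like, here's a sketch using scikit-learn. The corpus, the topic count, and the convention of reporting cosine similarity as a percentage are all assumptions for illustration - a real system would train on a far larger crawl:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny invented corpus; a production system would train on millions of documents.
corpus = [
    "dogs are loyal pets and great companions",
    "cats are independent pets that purr",
    "search engines rank pages using links and content",
    "links and anchor text influence search rankings",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit a small LDA model; n_components is the number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

def lda_cosine(page_text, query):
    """Cosine similarity between the topic distributions of a page and a query,
    expressed as a percentage (100% = identical topic mix, 0% = no overlap)."""
    vecs = lda.transform(vectorizer.transform([page_text, query]))
    return 100 * cosine_similarity(vecs[0:1], vecs[1:2])[0, 0]

score = lda_cosine("dogs and cats make wonderful pets", "pets")
print(f"{score:.0f}%")
```

With a corpus this small the topics are noisy, so the exact number isn't meaningful - the point is the shape of the computation: infer a topic distribution for the page and the query, then compare them with cosine similarity.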
When you use the tool, be aware of a few issues:
- Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to fluctuate 1-5% each time you check a page/content block against a query.
- Scores are for English Only
Unfortunately, because our topics are built from a corpus of English language documents, we can't currently provide scores for non-English queries.
- LDA isn't the Whole Picture
Remember that while the average correlation is in the 0.32 range, we shouldn't expect scores for any given set of search results to go in precisely descending order (a correlation of 1.0 would suggest that behavior).
- The Tool Currently Runs Against Google.com in the US Only
You should be able to see the same results the tool extracts from by using a personalization-agnostic search string like https://www.google.com/xhtml?q=my+search&pws=0
- Using Synonyms, "Related Searches" or Wonder Wheel Suggestions May Not Help
Term vector models are more sophisticated representations of "concepts" and "topics." While many SEOs have long recommended using synonyms or adding "related searches" as keywords on their pages, and others have suggested the importance of "topically relevant content," there haven't been great ways to measure these or show their correlation with rankings. The scores you see from the tool are based on a much less naive interpretation of the connections between words than these classic approaches.
- Scores are Relative (20% Might Not Be Bad)
Don't presume that getting a 15% or a 20% is always a terrible result. If the folks ranking in the top 10 all have LDA scores in the 10-20% range, you're likely doing a reasonable job. Some queries simply won't produce results that fit remarkably well with given topics (which could be a weakness of our model or a quirk of the query itself).
- Our Topic Models Don't Currently Use Phrases
Right now, the topics we construct are around single-word concepts. We imagine that the search engines have probably gone above and beyond this into topic modeling that leverages multi-word phrases, too, and we hope to get there someday ourselves.
- Keyword Spamming Might Improve Your LDA Score, But Probably Not Your Rankings
Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 words over and over on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO, and Google is almost certainly sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like (even if the tool's scoring can't do that).
If you're trying to do serious SEO analysis and improvement, my suggested methodology is to build a chart something like this:
SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA (click for larger)
Right now, you can use Keyword Difficulty's export function and then add in some of these metrics manually (though in the future, we're working towards building this type of analysis right into the web app beta).
Once you've got a chart like this, you can get a better sense of what's propping up your competitors' rankings - anchor text, domain authority, or maybe something related to topic-modeling relevancy (which the LDA tool could help with).
Undoubtedly, Google's More Sophisticated than This
While the correlations are high, and the excitement around the tool both inside SEOmoz and from a lot of our members and community is equally high, this is not us "reversing the algorithm." We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings, but it remains to be seen if we can simply improve scores on pages and see them rise in the results.
What's exciting to us isn't that we've found a secret formula (LDA has been written about for years and vector space models have been around for decades), but that we're making a potentially valuable addition to the parts of SEO we've traditionally had little measurement around.
BTW - Thanks to Michael Cottam, who suggested the reference of research work by a number of Googlers on pLDA. There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. Our correlation and testing of the tool's usefulness will show whether a naive implementation can still provide value for optimizing pages.
For those who'd like to investigate more, we've made all of our raw data available here (in XLS format, though you'll need a more sophisticated model to do LDA). If you have interest in digging into this, feel free to email Ben at SEOmoz dot org.
How Do I Explain this to the Boss/Client?
The simplest method I've found is to use an analogy like:
If we want to rank well for "the rolling stones" it's probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates." It's also probably not super smart to use words like "rubies," "emeralds," "gemstones," or the phrase "gathers no moss," as these might confuse search engines (and visitors) as to the topic we're covering.
This tool tries to give a best guess number about how well we're doing on this front vs. other people on the web (or sample blocks of words or content we might want to try). Hopefully, it can help us figure out when we've done something like writing about the Stones but forgetting to mention Keith Richards.
As always, we're looking forward to your feedback and results. We've already had some folks write in to us saying they used the tool to optimize the contents of some pages and seen dramatic rankings boosts. As we know, that might not mean anything about the tool itself or the process, but it certainly has us hoping for great things.
p.s. The next step, obviously, is to produce a tool that can make recommendations on words to add or remove to help improve this score. That's certainly something we're looking into.
p.p.s. We're leaving the Labs LDA tool free for anyone to use for a while, as we'd love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only.
[ERRATA by Ben (sept 16th, 2:00pm PST): The blog post above reports the correlation measurement as 0.32. It should have been 0.17.]
Rand, I'm not one who said LDA hasn't been used by search engines. But I'm probably among the critics of the new tool, I suppose.
I wasn't at the event. I saw the tweets about things supposedly being reverse engineered. I also saw various people who joked that they didn't really understand all that was presented -- though apparently, that didn't prevent them from applauding this big breakthrough.
My perspective is remembering back when WebPosition came out with its page analyser tool, say 1999. On-the-page factors were FAR more important back then than they are today. And I watched all these new SEOs try to use this tool, which was based on data gathered from thousands of actual ranked pages. They'd create "perfect pages" for each search engine which, in the end, often didn't rank at all.
And then they'd get confused. They had these super high scores. Why weren't the pages ranking?
The answer was that even back then, search engines had multiple factors they took into account. None of these tools could deliver perfection in all areas, especially as off-the-page factors accelerated.
So now you've got this new LDA tool, that tells me Dogpile -- ranked #1 on Google for "search engines" -- has a 21% score. What's the amazing correlation there? Or Yahoo has a -2147483648% score. I don't even know what that means. Or WebCrawler, also in the top 10, has a 37% score.
I'm not a math whiz. I don't know equations. But I sure as heck can tell that these scores don't seem to be a predictor of anything. Heck, SEOmoz ranks in the top 10 for "seo," and you've got a 54% score.
I appreciate that you've suggested that this tool be used with other criteria, that you clarify above that things remain to be seen on whether this tool, as well as LDA scores in general, help with anything. Of course, apparently after only being out for a few days, people are reporting to you dramatic ranking boosts. So maybe this is the latest and greatest thing.
Me, though, after watching a long line of tools and trends (theme pyramids. no, LSI! no, LDA), I'm pretty jaded. I tend to think people waste too much time tapping into tools rather than focusing on core fundamentals, such as better content or just getting those dang important links.
Hmm... Did you read the post carefully? It sounds like a number of items were missed in your interpretation. Let's start with correlation - 0.32 is not 1.0. 1.0 would be perfect correlation and would mean that for each Google search result, LDA alone perfectly predicted ranking position - the highest LDA scoring page content with our naive implementation would rank first, the second-highest second, etc.
A correlation of 0.32 is showing that's pretty far off from the case.
You also seem to suggest that the scores have some absolute, fixed value (54% always being better than 25%, etc). For an individual query, that's usually the case, at least as it relates to this single metric. But across queries, the complexity or simplicity of a given query to topic relationship could mean that 25% is quite high for one query/topic pair vs. 54% being low for another.
Maybe an analogy would be best - we've shown in the past, for example, how Google's PageRank score or the # of links pointing to a page has a correlation with rankings in the 0.1-0.2 range. If you saw that, then looked at SERPs and saw a page with 500 links ranking behind a page with 1,000 links, you probably wouldn't think "that correlation number is wrong!" This is a similar situation.
We have published the raw data in XLS format (linked-to above) and we certainly invite anyone to re-create our work. In terms of the tool's usefulness or application to SEO and improving pages - we think it's interesting, given the relatively high correlation (compared to link metrics), to try it out, but we haven't suggested conclusive results. Those who've seen great results so far may have been lucky, may have seen unrelated changes boosting their positions, etc. Controlled experiments in the SEO world are, as you know, rife with problems (which is why we like using the correlation methodology).
A good way to potentially judge the value of the tool for you (and decide on whether you're a fan or critic) might be to try an analysis like this of a keyword phrase you care about, and find an example where it looks like your topic relevance (LDA score) is substantively lower than high-ranked competitors but other metrics are on par. Then try improving the relevance and seeing if it effectively improves rankings. I have yet to do or observe enough cases like this to know whether it works, so I'd consider myself a skeptic too (and tried to make that clear in the post).
Does that make sense? After reading your comment I was worried I'd done a very poor job of explaining things :(
BTW - the negative percent for Yahoo sounds like it might be a bug. We'll definitely look into that. Thanks for the catch!
fyi - negative bug also at work in query for "hotels" url https://www.holidayinn.com
I did read the post. I even downloaded the spreadsheet.
The headline says that LDA and Google's rankings are "remarkably well correlated."
So when you said in your reply to me that a correlation of .32 is "pretty far off" from being perfectly predicting ranking, how do you leap from that to a headline about LDA being "remarkably well correlated?"
As for the scores, yes, I think that when you have a tool that suggests something has 25% relevancy for a particular query, that's indeed the tool giving a fixed value for that query. If this is 25% across all queries related to the core query, that still doesn't seem to help. In either case, it's odd to have something that is "remarkably well correlated" as a factor having such low scores for pages that nonetheless remain top ranked.
The link analogy makes no sense. First, I couldn't even tell from Google what links it was counting. LinkScape might guess at that for me, but that's simply a guess. Assuming you did show me every link that Google counted, I still don't know the weight each of those links is given, any additional weight that might be given based on domain score, the age of those links and how that might factor, the anchor text of those links.
In other words, no, I wouldn't think a page with 500 links outranking one with 1000 links means links aren't an important factor. I'd simply think that Google's got some funky ways of measuring links that I can't fully understand.
Here you've got a tool that you still haven't demonstrated in this post gives any particular strong correlation to ranking. Despite comments about this post supposedly clarifying and explaining everything, I don't see that.
Here's what I see.
1) A chart based on 72 people who themselves are largely offering opinions of what they think Google uses for ranking purposes, drawn from a survey SEOmoz drafted of factors -- which itself may not include some factors that Google actually uses. I think I took this last survey myself. Sometimes, I'd scratch my head over some of the factors that I was asked to rate, IE -- like I didn't think Google would use them. Having said this, I largely agree with the general conclusions of the chart. But despite being a pretty looking chart, there's nothing scientific about it. Heck, there aren't even actual percentages attached to it.
2) A long discussion about topic modeling, which says nothing about LDA correlation. I thought it was long known that search engines have gone beyond the keywords on the page and tried to determine page concepts in many ways. Excite -- and this was 1997, if I recall -- called their particular system ICE.
3) A correlation chart which gives me a raw score of just below .35. Again, I'm not a math whiz. But I know what correlation means in general. If you tell me something correlates with some factor, I expect it to show a lot, often, a high percentage. As you said, this shows the opposite: a low correlation. It's better than some other factors, but it's still not even 50%.
The spreadsheet you provide to back the chart actually does NOT contain the data but rather, as it notes, shows "comparable" results. It also doesn't explain testing criteria such as:
1) Did you correct results for IP targeting?
2) Did you correct results for personalization?
3) Did you randomize the queries across IP addresses, so that Google might not be feeding SEOmoz customized results?
That's just some of it. I don't actually think those factors are that big of a deal, but they aren't even addressed in a post that's supposed to explain these findings.
Perhaps the most important thing to me is that there's also plenty of research and evidence that search engines will ignore some text on a page and perhaps weight other text more strongly. I see nothing that discusses how this study might take that into account.
In the end, I come away what feels like another keyword density analyser tool -- except maybe you could call this a keyword concept analyzer. Here are some top keywords the tool seems to find from top ranking pages. Try adding some of those to your pages, maybe you'll do better.
And heck, maybe you will do better. On site criteria still IS a factor. But it also feels like we've had these types of "what keywords do top pages use" tools before. The difference here is this one's being pitched with a study that says the factor is "remarkably well correlated," when the stat itself you say shows the opposite.
I guess our big disagreement is around what a "remarkable" correlation is. Given that no other single metric we've ever looked at from keyword in the title tag to TF*IDF to # of links or PageRank or # of linking root domains or anchor text concentration, etc. has ever been this high.
Given that Google claims they have 200+ signals in their ranking algorithm, I think a reasonable person would find it fair to call a correlation of 0.32 for a single factor "remarkable." It's certainly much more than we would have anticipated, or than the survey of folks might suggest would be predicted. It's even more remarkable in a comparative sense (with all the previous metric correlations).
With regards to the "how" of grabbing search results, I suppose those critiques are potentially valid, which is why the raw data is available so you could re-do the work. The methodology is linked-to above as well, and matches the system we used for the SMX Advanced presentation you requested. If you find errors in that methodology, we're certainly open to them and would love to improve.
p.s. Your point about the survey being just a survey is reasonable, but I was using it to show SEOs opinions, not search algorithm reality, which I think is an appropriate use. Are you disagreeing with that? You seem to suggest that you also believe (and think many others do) that on-page factors, whether they're around topic modeling or raw keyword usage, are relatively unimportant (FYI - until I see strong evidence that the correlation is more than just correlation and seems to have causation, I'm still in that boat too).
My understanding of correlation is in the sense of multiple regression analysis, wherein the model would be just one of many variables - which would make sense since the Google algo has, as Danny mentions, 200+ ranking signals. A correlation of .32 would essentially mean that 32% of the variation in results can be explained by the model, while the remaining 68% is due to other factors. In such a context, 32% is indeed quite significant, but it's not the entire pie. No single ranking factor will ever have a 1.00 correlation under a MRA scenario - it simply isn't possible.
Is that in the right ballpark as far as your thinking on the correlation stat?
Just to clarify, a correlation of 0.32 means that 10.2% (0.32^2) of variance in result rankings can be explained by the model.
Oh riiiight! Correlation is R, so you have to do the R-squared... forgot about that step. Thank you! That still produces a fairly significant number.
In a search on SEO, 8 out of the 10 results I get have "SEO" in the title.
In a search for books, 9 out of 10 results have the word "books" in the title -- the one that doesn't goes with the singular "book."
In a search for "seattle apartments," every single result has those words, most of them that exact phrase.
I guess that's a factor I find remarkable -- that pages that rank well often have the words someone searched for in their title tags. That's like 70% or more correlation.
Your past studies have never found that words in the title tag have a correlation higher than 32%? What did you find?
Of course, the high number can mean little. At this point, many sites are savvy enough to know that title tag usage was considered important and continue to do it, even if it's other factors that come into play. Which is correct? We don't really know.
On the survey, I guess I'm disagreeing with comments I've read about how this post somehow proves anything. I see comments about what the research supposedly shows. But there's only one hard reference to it, a single number, and not a lot of actual explanation behind that number, to me. The only other figure-based chart in this post is that opinions graph, and I'm just saying it isn't related to the LDA thing.
I appreciate that you're trying to do a scientific study of SEO factors at SEOmoz. I appreciate the incredible difficulty in that, given that there are so many factors out there that can skew any attempt of this. Understanding that you find these scores to be a higher correlation than other figures you've looked at also puts your headline in better perspective.
But my takeaway from this is pretty much LDA probably won't help you that much. There's not even evidence it'll give you an edge. At best, the tool seems like a nice synonym suggestion tool. Maybe adding some additional words to pages will help people. But whether that's because they then match an LDA model better or have something else is uncertain.
We went over our methodology pretty carefully together I thought (both on stage and in comments at Sphinn and on the phone, etc). When we do ranked correlations, it's looking at whether using the keyword in the title (or any other metric/aspect) is well correlated with pages that rank higher rather than lower. Thus, if you say "8/10 pages in the search results has the keyword in the title tag," and that shows the importance of titles, one could just as easily make the statement "8/10 pages in the search results is hosted on Apache vs. IIS, which shows the importance of Apache hosting."
The methodology we use is best described by Ben (and he's done so a couple of times), but I'll try to paraphrase. Basically, we're looking for elements that predict higher rather than lower rankings. The low correlation of keywords in title elements would thus be because pages that use the keyword in the title don't have a strong tendency to outrank those that don't when we look at the top 20 or 50 results. Certainly, because many of those pages do this already (you can see those numbers we made in the follow-up post from the SMX Advanced presentation about prominence of elements in Google and Bing), that's going to mean we need to control for these elements and similarities to get good data.
As far as critiquing the tool before using it or trying to apply it, I think your points about being intellectually rigorous and careful about assumptions would directly contradict your conclusions (since it doesn't sound like you've put in the work to have good data about it). I agree with the concept of being skeptical, but does it not strike you as premature to make those assessments? If you said "I tried the methodology of analysis and improving scores on a few dozen or hundred pages, but saw no discernible results," I'd be much more empathetic of your points about the tool/process's irrelevance.
Rand, you keep referencing your presentation and methodology at SMX Advanced as if, at least to my take, I somehow signed off on what you presented as being factual, correct, well researched and approved.
I didn't do that. To clarify, you were one of the presenters on the Bing versus Google ranking factors sessions. I invited you to present, to show what you wanted to share with the audience. SEOmoz does a lot of studies like this, especially of late, and I thought it would be useful to have you share what you could discover.
That doesn't mean I vetted your methodology, only the general idea that you'd present. As you know, there were some who disputed the methodology of that test, as well as the ability of anyone to test anything in SEO. That's also why I tend to go with panel formats rather than solo speakers. Unless you're absolutely certain about something in SEO, it's sometimes useful to have a variety of viewpoints. And in SEO, there's lots that we're not absolutely certain about.
That's that about your previous study. As for this one, I was trying to understand, within this post, exactly what was found. That was, I thought, the point of this post. When someone talked about the tool on Sphinn last week, you commented that a comprehensive post would be coming about it. I really did look forward to reading it, to better understand what was found and how this tool was supposed to help.
I wanted "SEOmoz looked into X by doing Y and found Z." That's what I assume Ben presented in his talk at your conference last week, which produced some tweets and initial blog posts with references to reverse engineering. That's pretty big news, if you'd really found something that reverse engineered Google's results. I know you weren't saying that yourself. I know in this post, you've seemed to dampen down some of those expectations.
Personally, I didn't come away with what I wanted. I've got a post about a "remarkable" ranking factor and a tool that potentially will help those trying to optimize for that. I wanted to be convinced more about the former and how to better use the tool. I got the how to better use the tool part. I'm still not convinced on the discovery portion.
When things start slipping away from plain English and into the mathematical realm of "ranked correlations," I feel like they are also slipping beyond the ability of anyone to really understand what you're talking about, much less prove or disprove anything.
In particular, Ben's post, "Statistics A Win For SEO," might as well be gibberish to the lay person. Basically, it says to most of us "just trust us, we're doing it right, we do the math right." And yet, we have other people who disagree. We can't really judge either group, lacking the skills.
I've got a little book on my bookshelf called "How To Lie With Statistics." I'm not saying SEOmoz is lying. The point of that book is how statistics can be made to say anything.
If you want to convince us, you need to do it in plain English, and in a more indisputable manner. Without even getting into causation or correlation, I can give you a plain English challenge to your data.
How did you adjust for the fact that Google might use different ranking algorithms for different types of queries?
It's long been discussed that for certain classes of queries, Google might use different ranking algorithms. This study presumably covers one ranking algorithm. But are you certain that was all you looked at -- that all the results you received came from one single ranking system? If not, then the entire basis of the study is flawed.
Again, I don't envy the challenge in trying to do this. I really do appreciate the attempt. But I feel that as each study comes out and gets questioned, the answer is to provide more big mathematical words of reassurance. I just don't find those reassuring.
Overall, the goal seems to be to list the important SEO factors that people should care about. If we're talking on-the-page criteria, are you telling me that your studies do NOT find that the presence of the search terms in the HTML title tag is a good predictor of whether a document will rank well? I know it's not perfect, but LDA is more important than the HTML title tag?
From what Ben says, no -- keywords in the title tag seem to have an extremely low correlation to pages doing well. And yet we have so many people who still, even today, tell you that just by changing their title tag, they improved rankings.
Now that would be a study. Have we really been wasting our time all these years telling people to think about HTML title tags?
Ben, as a matter of fact, using the http:// protocol is a very important factor in doing well in Google web search. If you fail to use it, you largely aren't going to show up. Google web search proper lists web pages. Indeed, perhaps the most important factor for someone who wants to rank well in Google web search is to, yes, have a web site.
In terms of using the tool, I used it as soon as it was tweeted. I used it a couple of more times before this post came out. I used it after this post came out. My critique is that:
1) It's entirely unclear what the heck you're supposed to do with it. Do I try to find a page with a high LDA score, then look at the words extracted, and then insert them into my own page to get a better score? If so...
2) Does it work? My critique is that I see pages that rank well despite the low LDA scores reported for them, so clearly it does NOT work in all cases -- or we wouldn't naturally see this happening.
Nope, I haven't run a battery of tests to explore the tool further. But neither have you. You created a tool, complete with a study, that tells us you see this strong correlation with ranking. Presumably, you'd have had enough time to also test the thing out and see if you really got improved rankings. That's absent.
I'm sorry if this comes across as overly critical. I've just seen too many on-the-page analysis tools over the years that have consumed way too much time for SEOs, plus answered way too many emails from people who write, "Do you know about this tool? I used it, but it didn't work...."
If raw prominence vs. ranked correlations isn't clear, perhaps the best way to describe it is:
Raw prominence - the chance that things that rank well will have this element.
Ranked correlation - how well having an element predicts that a result will rank higher or lower than results that don't have it.
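To make the distinction concrete, here's a toy sketch (the data is invented, and the `spearman` function below is a minimal tie-aware implementation for illustration, not SEOmoz's actual code):

```python
# Toy 10-result SERP, position 1 = top result. 1 means the page has
# the keyword in its title; 0 means it doesn't. Invented data.
has_keyword = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
positions = list(range(1, 11))

# Raw prominence: the share of ranking pages that have the element.
prominence = sum(has_keyword) / len(has_keyword)  # 0.8

def rank_with_ties(xs):
    """Ranks starting at 1, with tied values sharing the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rank_with_ties(a), rank_with_ties(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Ranked correlation: does having the element predict ranking higher?
# Negate position so a bigger value means a better ranking.
corr = spearman(has_keyword, [-p for p in positions])

print(prominence)      # 0.8 -- the element looks "important"
print(round(corr, 2))  # about -0.09 -- it predicts almost nothing
```

Both numbers come from the same ten results: 80% of them have the element, yet knowing a page has it tells you essentially nothing about whether it ranks above or below its neighbors.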
I'm sorry to hear that you think we might be lying with statistics or that Ben's post was meant as a "shut up and trust us." The intent was entirely the opposite. I guess if you feel that the stats discussion, which, I'll be honest, can be tough for me to grasp as well at times, isn't a fair way to talk about the issue, then I'm at a loss for how we can better help. The criticisms leveled were very technical - the only reasonable way to respond is to be equally technical. Some of those critiques were nonsensical (you can see Ben going back and forth in the comments about something that was claimed to be "incalculable," for example). Where we can, we've tried to make a best attempt at plain language. Certainly the criticisms strike me as much more challenging for a layman to parse. I suppose we can always try harder.
I guess I presumed your understanding of our methodology was better and you had more trust in it. Sorry to hear that's not the case. I hadn't realized there was no explicit or implicit trust in the methods when the presentation was requested - strikes me as a bit weird. I'd thought that you'd read the previous work we'd done on ranking correlations and liked it, which is why you asked us to do it.
Not sure where this leaves us, but I guess if you don't trust that the methodology of ranking correlations is sound, then it would indeed be a stretch to ask you to believe that our LDA model is interesting/useful, as the correlation number is what makes it so. It's also frustrating, because the future work to verify the usefulness of this method is to build ranking models with it and then attempt to single out elements to perform a causation analysis. But if your argument is "this is too technical to understand or know whether you're doing anything worthwhile," then we're unlikely to convince you with that work either.
I didn't say you were lying, Rand. Please don't say I said that. I said the opposite: "I'm not saying SEOmoz is lying."
I referenced the book "How To Lie With Statistics." The book wasn't a reference to me thinking SEOmoz was lying. That's just the name of it. It's a good book for anyone without a mathematical or statistics background to read. It's short and very informative and relevant to this discussion. But in case there was a misconception, that's also why I specifically said "I'm not saying SEOmoz is lying."
I didn't say Ben's post was "shut up and trust us." I said Ben's post doesn't clarify much to a lay person, in my opinion. That's a lay person telling you this. My takeaway from that is basically what I described: "just trust us." Not "shut up and trust us," which is a much different thing and attitude.
I'm not suggesting that you're telling anyone to shut up. What I am saying is that a post that relatively few can understand isn't helpful when you're trying to convince people that a study is valid, in the face of criticisms. It's like writing in French when most of your audience doesn't speak that language. We can't follow. Effectively, all we can get is that you're saying to trust you. And we might have people we equally respect saying we should trust them.
I also did not say that I didn't trust you. I said that you kept referencing your presentation at SMX with a tone that seemed to imply I somehow had approved of your methodology. I'm clarifying that I neither approved nor disapproved of it. Rather, I simply invited you to speak and share your viewpoints and findings on the matter.
I'm sorry you thought that meant there was some implied trust that whatever you presented would be correct. The trust was that you'd have something interesting to share and discuss, which I thought you did. Nothing weird about that. But that's far different than the suggestion that just because you presented something, you've got some endorsement that it was fine. Heck, Google presents -- I don't agree with all the things they say.
In the end, I try to have a mix of people with different viewpoints on SEO issues, especially because SEO issues are so open to debate. That's the value in panels.
To recap, what I did say:
To recap, what I did NOT say:
As for where this leaves us, for me, it leaves me with yet another tool for researching potential keywords that you may want to use in your copy, which may or may not be more effective than other tools that are out there.
As for the dilemma of trying to conduct SEO studies but not being able to prove to non-technical people that they are sound, I don't have a solution for you there. Studying SEO factors was hard enough when we had fewer of them, and links played a less important role. To me, this stuff seems almost impossible now -- and potentially dated the moment it goes out. Heck, you said yourself that you find this stuff hard to grasp.
Doesn't really matter, in the end. You put a tool out there. Some people will buy it. Some will have success. Maybe it was the tool that did it for them. Maybe it was something else that they'll apply to the tool, but they'll be happy regardless.
It was much the same thing when people got all into "theme pyramids" before search engines really were looking beyond a page itself to see how "topical" a web site was. People started building out their themes, claiming success, even though as far as I could tell, they were succeeding not because they had a bigger pyramid but because they started focusing on their content more.
Yeah, it always comes back to content, doesn't it?
Anyway, I guess I've said all I can say, and if I keep going, this thread will get down to one line per sentence :)
My point was that using "http://" as one's protocol instead of the alternative of "https://" is unlikely to affect one's rankings positively or negatively, or at least that one shouldn't take the statement that "99% of results in the top 10 of Google are http instead of https" as evidence that we should prefer http over https for SEO considerations.
I can see how my earlier remark was not the clearest.
Rand made all of the points I would, and he made them well. I have little to add besides mentioning the blog post where I explained the reasons for measuring correlation as we do.
https://www.seomoz.org/blog/statistics-a-win-for-seo
For comparing different metrics, it isn't very useful to measure how frequently each occurs in the top 10. Keyword in the title might be 70%, but having the protocol be http is going to be around 99%. This tells us little about how related these features are to ranking high. (Although it can be an interesting way to compare two engines with each other.)
Regarding the specific example of correlation for keyword usage in the title, that was actually included in the data that Rand presented at SMX Advanced and also posted to our blog. It was within the margin of error of zero, and solidly below 0.01, which is well below the 0.32 correlation for LDA scores.
Ben
If you had 2000 results and the top 1000 results had the keyword in the title 99% of the time, and the bottom 1000 had it 98% of the time, then the correlation would be low even if the top 10 had 10 out of 10 keywords in the title. I don't understand why you would be counting how many in the top 10 have the keyword in the title.
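Exactly. A quick script shows how small that correlation comes out (the positions of the exceptions below are invented just so the proportions match this example; the `spearman` function is a minimal tie-aware sketch):

```python
# 2000 ranked results: the top 1000 have the keyword in the title 99%
# of the time, the bottom 1000 have it 98% of the time. The exact
# positions of the exceptions are invented for illustration.
missing = set(range(100, 1001, 100)) | set(range(1050, 2001, 50))
has_keyword = [0 if p in missing else 1 for p in range(1, 2001)]

def rank_with_ties(xs):
    """Ranks starting at 1, with tied values sharing the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rank_with_ties(a), rank_with_ties(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Correlate the element with ranking higher (negate position).
corr = spearman(has_keyword, [-p for p in range(1, 2001)])
print(round(corr, 2))  # ~0.04 -- near zero, despite 98-99% prominence
```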
Going out of order with this reply, as the formatting is making it hard to read/follow!
Your points about having a tough time understanding what we're doing and how are fair. I shouldn't be defensive about those, I should merely try to do a better job explaining them. We'll try to make a video this Friday both on the topic of how we do correlations and more on LDA. Perhaps next time we see each other in person, we can go over it as well and see if I can do a better job that way.
Re: controlling for personalization/geo-targeting/etc - we do personalization-agnostic searches similar to this - https://www.google.com/xhtml?q=my+search&pws=0 - generally we find that these do a reasonable job of having little bias for those issues, but certainly our data and results are thus for non-local-specific and non-personalized results. We also don't include any "universal" style listings like images, maps, etc. We're just looking at the standard web search results. Certainly it's a fair critique to note that these may not match the SERPs an individual searcher might see due to other things Google may be doing.
pws=0 won't remove IP targeting, not from what I've seen.
Do a search for plumber, Rand. I wager you'll see some Seattle-based plumbers in there. I get three Orange County, CA-based ones. These are regular web search results, not local Onebox results. Regular results that are being elevated because they're deemed relevant to my location.
There's no way these things should be showing other than my location. None of them are showing because of my web history, either. I can see the same thing in two other browsers with no history at all.
Google doesn't disclose that this geotargeting is happening, as it used to. You also cannot turn it off, even using the PWS parameter.
That's 30% of the results being influenced based on the geography of the searcher/location of the business, a huge chunk of the regular results -- at least for that local query.
I'm taking a huge thumbs down risk here stepping into what's tantamount to a public "private" conversation but this thought's been nagging at me since the verbal exchange between Rand and Danny began...........
This LDA tool from SEOmoz is free to use, free to experiment with, and free to leave alone. They've offered what I would consider a remarkable amount of data to support their theory, but it's up to us the users to decide to use the tool (or not). What purpose is there in trying to minimize the value of a free tool that may (or may not) improve your SEO?..........
Personally, I trust them and believe in their character and because of that I will "waste" my own time accordingly in playing and experimenting with the LDA tool. And I'll be a thankin' them for its availability...........
Again, apologies for the lack of formatting. JS breaks the page on my machine so I have to turn it off.
I suppose I take a bigger thumbs down risk coming over here and commenting critically about a tool that's free for the moment to use, on a forum with a lot of SEOmoz fans. And I appreciate that I've been allowed to do so, and with some welcome, even if some people don't agree with my comments.
But why bother taking a critical look at a currently free tool? Because until last week, no one was worrying about LDA. Now, you have a number of people in the SEOmoz community starting to buzz about it. And that spreads beyond it, in an industry where people can easily obsess over the smallest of things -- regardless of their importance.
If you're going to obsess, it ought to be for good reason. Otherwise, you're sculpting links only to find it didn't help -- no wait, it did -- no wait, it did until Google changed things -- no way, it still works, even though Google says it doesn't.
You're creating microsites, because you were told microsites are the new thing for SEO. Wait, you're worrying about LSA. Wait, no, it's theme pyramids. Wait, it's three way linking. No, it's the need to outbound link. No, no, no....
For years, I've sat on the other end of this confusion from SEOs chasing the algorithm, running for whatever is put out there, regardless of whether anyone can verify it or not. People get confused, scared, perplexed -- and sadly, in my view, lose themselves in details rather than seeing the bigger picture.
Last week, my jaw pretty much dropped as I watched Mozzers here obsess that their PA or DA scores suddenly dropped. As if those scores meant anything.
For goodness sakes, people would see Google PR scores drop -- actual honest-to-goodness scores from Google, and yet that wouldn't be reflected in any traffic drops. And now SEOmoz's entirely third-party estimates of values that are not used by Google, not used by Bing, get changed -- and people started hitting the panic button.
Something's wrong with that. Something's wrong with people getting excited about a factor that just gets popped out there with frankly not a remarkable amount of data to support anything. You darn well ought to be critically examining this stuff. Time is short. Time should be spent on what matters.
In this thread, I've commented how 30% of the web search results might be locally influenced, with no ability to override that. Fact. Verified. You don't need to argue about regressions or correlations about it. Do a search. Have someone else do the same search in another city. You can see it yourself.
That's a major influence you need to be considering, if you're dealing with local queries. All the OneBoxes and mixed Universal Search results are major influences, too, that you need to be considering. Pushing a few words around chasing an LDA score because we haven't had a fad for a while and this must be The New Thing isn't a good use of time, not without a lot of verification.
That's just my opinion. It comes from watching the space since 1996, and it's colored by all the fads I've seen over that time period, and all the confused people I've had to deal with in the aftermath of those fads. It heavily colors my views.
But you're absolutely right. It is your time to waste. Or not waste. In the end, whatever you're comfortable with, I've got no argument with that.
Dear Danny,
it is exactly your experience that makes everything you say valuable (and your rants against United Airlines customer care enjoyable).
That is why - and I believe I express what other Mozzers think too - I really appreciate your comments here, just as I appreciate how Rand is trying to explain to you (and therefore to me too, as a reader of the blog) what SEOmoz did, is doing and will do with LDA.
I recognize myself in your picture of the average SEO practitioner, as I daily deal with tools that are supposed to help me in my job, and daily weigh whether or not to use every theory about "how to have better results in SERPs." Like you, I'm not a mathematician, and like you I need things explained in plain English (gosh, I'm not even a native English speaker) to avoid misunderstandings. But that is why I like SEOmoz: they offer great tools and try to explain best practices as clearly as they can. And that's why I like you: your experienced knowledge (hey, you were there when BBSs still ruled, at least here in Europe) offers us a vision with perspective... and because I've always had a weakness for devil's advocates.
I know that I know nothing, but there are a few things I do believe when it comes to SEO.
Thanks again for your comments here, and come back soon.
@danny As per usual, Gianluca has said it better than I could. Thanks for the explanation, Danny. Being the point person that others go to with questions and concerns makes your unease with the situation more understandable to me..........
I don't have anyone but me and my clients to worry about. And the only thing I am judged by with my clients is results: Is the site converting more? Everything else (SERP position, Page/Domain/moz rank, LDA scores, etc.) is just what I use on my end. Yes I report them, but neither I nor my client look to them for a performance evaluation. Results are the coin of my realm and they are only measured in sales...........
It speaks really well of you Danny that you did come into SEOmoz's forum to discuss this. For the record, I've given you thumbs up for every comment you made. Peace brother :-p
On the geotargeting - I'm seeing non geo-targeted results when I use google.com/xhtml (https://www.google.com/xhtml?q=plumbers). I see the local results, but our calcs don't include anything but standard web results.
Rand, geotargeted results are different from local listings. [plumbers] doesn't return geotargeted results for me either, but try [toyota dealers], [air conditioning repair] (which might only be geotargeted for me because I am in Florida), or [community college].
I don't see how geo-targeting of search results matters for this.
If the evidence holds for results from a Seattle IP, it will almost certainly hold for a Miami IP. The results differ, and the geo-relative inputs to the algorithms will differ, but I assume the algorithms used are going to be essentially the same.
Ben, in my experience geo-targeting doesn't impact the entire set of the results though. Certain sites will receive a boost because of it, but not all. So no, I don't think the algorithm for geo-targeted search results would be "essentially the same."
I am glad you guys are talking about LDA, because even though it's a pretty old concept, it's still uber fun for most of us geeks out there! But I have to ask: have you considered that maybe your correlations are so high because your model is good at interpreting human behavior rather than understanding Google's algorithm? In other words, doesn't it just prove the long-assumed rule that people link (the real ranking factor) to relevant content? Even by your own chart above, the score is highest for the page that has the most links. Correlation is not causation. But again, I am happy you are pushing folks to think about this stuff.
Hi Joe - yes! That's exactly what I noted in this part of the post above:
But, seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps, good links are more likely to point to pages that are more "relevant" via a topic model or some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.
We don't know if this is causation or just correlation, but the correlation is certainly much higher than any of us anticipated, which makes it very interesting to try and understand why that is.
Rand,
Have you & your team read this post yet?
https://www.seangolliher.com/2010/seo/185/
It suggests that from the data you published you can't confidently interpret whether a high LDA score correlates to a higher rank or not.
If you guys are going to continue releasing statistically driven tools like this, perhaps it would be wise to hire a statistician so you don't risk making a big splash like this about something that's not based on solid stats.
Unfortunately information like the post above is sort of like a newspaper running a correction in small print on page 10 after running a huge incorrect headline. If the post I linked to is correct, I would hope you would do as much work as needed to make sure people like those commenting here know that LDA doesn't actually correlate to rankings.
After visualizing the data, I'm beginning to agree that there may have been some method missteps.
Hey Sean - feel free to ping us if you see anything that looks weird in the methodology. I see that your calculation is of mean LDA by rank. Average relative difference from the mean, or average difference between each result and the one above it, might also be interesting visualizations.
Just saw that this morning. Ben and the team have a couple other projects they're also working on, but we'll certainly try to investigate and respond before the weekend. If we find that there were methodology mistakes, we'll certainly post updates and any corrections.
I disagree with that post.
We compute standard error of the mean. That is included as error bars in the chart used in the post, and you can recompute it from the data I provided.
Standard error of the mean is sufficient to see there is extremely high confidence in the results our post showed. You can use it to see we are confident that the mean Spearman correlation coefficient for LDA is higher than that for TF-IDF and for unique linking IP addresses. You can also, of course, see we are quite confident it is higher than zero.
Using standard error of the mean to show confidence avoids several of the drawbacks of what Sean Golliher says we should have done instead. As he notes, because of a shortage of degrees of freedom for any given SERP, computing confidences of the correlation coefficient for a specific SERP is not very informative. Pooling the paired data points of results across SERPs would make the sample no longer qualify as a "simple random sample." What we did also intuitively fits with what we are trying to show - that conclusions from the computed sample mean generalize to a larger population of queries.
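A sketch of that procedure (the per-query coefficients below are made up for illustration, not SEOmoz's data): compute one Spearman coefficient per SERP, then report the mean across queries and its standard error.

```python
import statistics

# One Spearman coefficient per query's SERP (hypothetical values).
per_query_rho = [0.40, 0.25, 0.35, 0.30, 0.45,
                 0.20, 0.38, 0.27, 0.33, 0.29]

n = len(per_query_rho)
mean_rho = statistics.fmean(per_query_rho)

# Standard error of the mean: sample std dev / sqrt(number of queries).
sem = statistics.stdev(per_query_rho) / n ** 0.5

# Approximate 95% confidence interval for the mean coefficient.
ci_low, ci_high = mean_rho - 1.96 * sem, mean_rho + 1.96 * sem

print(round(mean_rho, 3))  # 0.322
print(round(sem, 3))       # ~0.024
print((round(ci_low, 2), round(ci_high, 2)))
```

The point is that the confidence statement is about the mean across queries, so each query's SERP contributes one independent data point and the "simple random sample" assumption is preserved.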
You can read a complete discussion about why I prefer the methodology we use to measure correlation and show statistical confidence here: https://www.seomoz.org/blog/statistics-a-win-for-seo
In Sean Golliher's post, he links to a post and comments by Dr. E. Garcia. Garcia makes or relies on several positions that are contrary to established mathematics, including:
1) The mean correlation coefficient is uncomputable. He is wrong; it is computable.
2) Spearman's correlation measures any monotonic correlation just as Pearson's correlation does. He is wrong; Pearson's measures only linear correlation.
3) PCA is not a linear method. He is wrong; it is.
In the post I referenced above, I go through these and other incorrect statements in greater detail, and cite/quote them directly.
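Point 2 is easy to check numerically: for a monotonic but non-linear relationship, Spearman's coefficient is exactly 1 while Pearson's is well below it. A minimal sketch (tie handling omitted, since this data has no ties):

```python
def pearson(a, b):
    """Pearson correlation: measures only *linear* association."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks (no ties here)."""
    rank = lambda xs: [sorted(xs).index(x) + 1 for x in xs]
    return pearson(rank(a), rank(b))

x = list(range(1, 11))
y = [2 ** v for v in x]  # monotonic but strongly non-linear

print(spearman(x, y))        # 1.0 -- captures any monotonic relation
print(pearson(x, y) < 0.95)  # True -- penalized for non-linearity
```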
Ben,
Using the Fisher Transformation recommended by Dr. Garcia in his tutorial on correlation coefficients, I recalculated the average correlation coefficient for your data.
The resulting average correlation (r' = 0.35) is still statistically significant, and rather close to the one you found.
I attempted to calculate a confidence interval, but was not sure if the n in the standard error equation should be the number of URLs or the number of queries.
Using the former, the 95% confidence interval is 0.32 to 0.38.
Using the latter, the 95% C.I. is 0.27 to 0.42.
*Note: I replaced instances of (r' = -1) with (r' = -0.999999)
It is nice that the methodology Garcia says we should be using also shows us having statistical significance. As with his prior criticisms of our methodology, I find the criticisms included in the PDF you link to baseless. I also note that the methodology he claims we should be using is quite poor.
Contrary to Garcia's suggestion, replacing instances of 1 with 0.999999 does not mean one will have 99.999999% accuracy in terms of producing an unbiased estimator. Values near 1 have issues, not just the value 1. As values approach 1, there is a lot of upward bias. Given the few degrees of freedom behind the coefficients for each query, we will get non-trivial upward bias in the correlation coefficient. And unfortunately the bias won't be consistent, as it will depend on the number of extreme coefficients in each sample.
Although Garcia claims it is improper to choose not to use the Fisher transform when combining correlation coefficients, I am hardly the first person to note that it can be better not to use the Fisher transform for a given problem. I refer you to page 82 of Meta-Analysis of Correlations, which notes for one problem that "the Fisher z transformation produces an estimate of the mean correlation that is upwardly biased and less accurate than an analysis using untransformed correlations." The authors had posed the question "is the weighted average always better than the simple average?" and used that example as a way of saying "no."
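The upward bias from near-1 values is easy to demonstrate with a small invented example: a single clamped coefficient drags the Fisher-averaged estimate far above the arithmetic mean.

```python
import math

# Hypothetical per-query Spearman coefficients; one query happened to
# give a perfect 1.0, which the Fisher procedure clamps to 0.999999.
rhos = [0.9, 0.5, 1.0, -0.2, 0.7, 0.4]
clamped = [min(max(r, -0.999999), 0.999999) for r in rhos]

arithmetic_mean = sum(rhos) / len(rhos)

# Fisher approach: z = atanh(r), average in z-space, transform back.
zs = [math.atanh(r) for r in clamped]
fisher_mean = math.tanh(sum(zs) / len(zs))

print(round(arithmetic_mean, 2))  # 0.55
print(round(fisher_mean, 2))      # ~0.94 -- atanh(0.999999) ~= 7.25
                                  # dominates the average in z-space
```

Because atanh blows up near 1, one extreme coefficient out of six nearly doubles the "average" correlation, and the size of the distortion depends entirely on how many extreme values a sample happens to contain.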
Furthermore, there are a lot of nice properties that are lost when one stops using the arithmetic mean. For instance, taking the sample mean as we report it as an estimator for the population mean has no bias (where the population is a larger population of queries for which the queries we looked at would be a simple random sample). This is the interpretation of the numbers we report that I prefer, and that I think is most meaningful.
This is another interesting post about the validity of the LDA tool... https://smackdown.blogsblogsblogs.com/2010/09/09/proof-that-the-new-seomoz-tools-is-at-least-half-accurate/
Skitzzo,
Interestingly, with the new revised numbers showing that the current LDA model is about half accurate, the link you shared seems highly psychic. Though I don't think the concept is "bullsh!t" as the link says; being relevant to topics is important. Whether or not this particular tool does that is another story.
Rand,
I'd have to agree that a much more complex model than has been described is being used by the search engines. I mean, when you've got literally thousands of sites, and in turn tens or hundreds of thousands (or even millions) of pages that "could" show up for any single phrase, LDA type models are going to be required to sort it all out for relevance.
Personally, I think it's got just as much to do with on-site cross-page sectional relevance as it does with individual page and cross-site relevance. And it means that people need to do a much better job at refining focus in these things than most in the SEO industry do. Or they can keep doing what they have, which would just make my work easier. :-)
In any case, I hope more people pay attention to the concepts at the core of this, and that they realize the new Moz tool is only really something to get them thinking as much as anything else. Because how many single-word keyword searches take place for sites that most of us optimize for (or should be optimizing for)?
I'm very excited about this LDA discussion precisely because it makes me think from different angles on possible optimizations (not to mention the actionable guidance I need to offer clients).
What makes me cringe ever so slightly is the layman's description of the bottom line. Even though my SEO effectiveness has potentially increased through this additional tool, everyone else will have more cause to believe that they, in fact, know all about that SEO stuff.
"On-page, schmon-page. I'll just get the intern to write up some quality sentences. Bada-boom, Bada-bing! I got your SEO right here."
Good to see you poking about the forums, Alan.
Admittedly, I have sampled the Koolaid. I am a mozzer and will continue to be regardless of other communities (Distillers, FTW) and tools I will adopt in the future (Hello, SecondStep ... from Blueglass ... yeah I'm talking to you! Hurry up and let me into beta, bitches.).
I can appreciate Danny Sullivan's perspective. He's right. Let's not get nutty over here. Yet on the flip side SEOmoz is not the one going over the deep end. It's us (me), the overworked and probably mentally taxed individual who blindly jumps on tools. We want to go clicky-clicky and hope that the tools do most of the heavy lifting. There's nothing wrong with us wanting that. In the end knowing when to use tools and when to grind it out helps separate our consultative price points ;)
Some of the brilliance from SEOmoz is the transparency. Who else is trying this crap and publishing for the world to see? The major SEO players all have their proprietary tools that allow them to whisper to the client, "Mumble mumble mumble, correlation. Mumble mumble mumble, helped clients make money. Mumble mumble mumble, scientific approach." SEOmoz is giving it to us to a certain extent.
Yes, we the knucklehead users will jump on this tool and see how we can work it into our repertoire. Some of us are indeed over-hyping it.
Yet SEOmoz is also correct. In context, these initial steps with the LDA tool show a significant single factor where (if I understand properly) no other single element has shown such a clear correlation to rankings. Who cares that so much is hidden and mashed up among 200+ signals? Before we figure those all out, Google will have unlocked the human genome to breed psychics to replace the data centers. So let's relax a bit and look at what we have.
We potentially have a deep on-page optimization tool that will help excellent content creators be even better. It won't save lives or give Oompa Loompas a home, but it will offer future optimization guidance; and in its current state, LDA is already significant.
Dear Mike,
you've touched a sensitive chord with your comment.
We are, and always will be, mozzers in our souls, because we were mozzers even before we knew the word existed. And I think that's because we see ourselves reflected in the TAGFEE philosophy of SEOmoz.
But, at the same time, that doesn't mean being blind and closed to everything outside the Moz world. Personally, like you, I use tools from several sources (ciao Raven, hello Aaron, hey BlueGlass... I've signed up for the 2nd beta run too... select me this time), and I read all the blogs, online magazines and whatever else can help me do my job better, more easily and more effectively.
Personally, I thumbed up all the comments by Rand and Danny, because I love the kind of discussion - polite but firm - that they are having here about LDA. Hopefully more of these 'disputes' will appear here, because it's from a thesis-antithesis situation that we can form a real synthesis.
So... I just wanted to say this, also as a positive petition to all of us (including myself) to remember the essence of TAGFEE, which is - IMHO - to always discuss and explain your point of view... be enthusiastic if you feel so, or disappointed if not... but always do it here first, in the most useful way.
Post Scriptum for Danny:
Dear Danny, thank you for explaining your point of view, and I hope to read more from you here in the future. It's your fault that many of us are SEOs, because you inspired us. I think that could be taken as a case of causation.
Post Scriptum for Rand:
Dear Rand, thank you for your strenuous effort in explaining scientifically what's behind the Google curtain. It may be a titanic task, but the fact that it's titanic doesn't mean it shouldn't be done.
@Mike One of your best...comments...ever Mike. It manages to put into words what I'd bet a lot of mozzers resonate with. Well, perhaps with the exception of calling the folks at Blueglass female dogs.
@Gianluca I agree with you completely re: spirited discussions here at the moz. It's when I really learn. Having these discussions without flaming though is what I appreciate the most. I too thumbed Danny and Rand up.
Mike, actually there were multiple SEOmoz employees putting some of the most ridiculous hype out there about their LDA tool. So while Rand sort of pulled back from those promises of the moon, the damage has already been done. At least IMO.
Thank you Rand,
For "beginners" like me, SEOmoz has proven to be the most useful online resource to understand SEO because of posts like this. You have yet again taken a complex concept and turned it into an easy to read post that is extremely informative and helpful.
Thanks!
Although I know that Rand edited this post because I had an extra line between the paragraphs, I wanted to clarify that my praise for Rand and SEOmoz was in no way edited by Rand :-p
Thank you Rand for this enlightening post... surely one of the most anticipated.
As you know, I'm an international mozzer, and it is not so common for me to deal with websites making of English its main language.
Therefore, the LDA tool is going to be more of a "new indicator" for me, adding hints about what is probably making one site rank better, than a strategic tool.
Despite this, I've tried a small-scale experiment using a word in Italian ("posizionamento", which is a synonym for SEO in Italian tech language), checking the LDA scores for the websites in the first 10 positions and for a random number of others on the second/third page of the SERPs.
The reasoning behind this experiment is very practical:
What I discovered is that somehow we international mozzers could use the LDA tool as a "signal metric", as we can find some sort of basic rule: a higher LDA % corresponds to better-ranking websites than a lower %. When this is not happening, other factors are in play (PA, Linking Root Domains, Exact Keyword Match in the Anchor Text...).
But beware... it seems to be simply a signal to be used alongside others (PA, Links, Anchor Text) for us non-English SEOs.
Anyway, I will try to run a larger experiment to see if the LDA tool - even though it works only with Google.com and is clearly stated to be based entirely on English - can help us international mozzers with websites in other languages.
Hey gfiorelli,
I just tried the same thing for some German words and results were very similar:
Generally speaking, the higher a site ranked for the term on google.com, the higher its LDA score (of course there are other relevant factors). From this quick first check, the tool appears to work pretty well for words in other languages too - as long as they don't have a (different) meaning in English.
I guess depending on how close the results of your national Google site are to google.com, it could already be helpful. Even without the tool, one should be able to figure out some thematically related terms.
Or can anyone think of a smart way to find out which words google associates with a word? The keyword suggestion tool didn't really produce any relevant results for me...
Same problem here with Google Suggest... not to mention the Wonder Wheel, which is more "related search" centric IMHO (e.g., if you put "Italian Fashion" into the Google.it Wonder Wheel, you get back localized searches: Arezzo, Milano, San Marino, Urbania... logical, because those are places with many Italian fashion outlets).
No, right now the best thing is to spend time literally reading the content of the competition's pages and use epistemological and rhetorical knowledge (i.e., really knowing the language of the web page and the way that language is used).
Great detective work Gianluca. You are posizionamento extraordinaire........
"What I discovered is that somehow we international mozzers could use the LDA tool as a 'signal metric', as we can find some sort of basic rule: higher LDA % corresponds to better ranking websites than lower %. When this is not happening, other factors are playing (PA, Linking Root Domains, Exact Keyword Match in the Anchor Text...)"
This is just plain good advice for those in the US as well G.
gfiorelli and FranktheTank,
Thanks for taking the time to run the first tests on this topic. I'm tempted to try this out in French but would not keep the data (since I'm USA-based and the URL localization codes may not work through the tools, etc.). Overall, it seems likely that results would remain in line, since we're running such simple examples, but hey! It's another signal.
I also love Rand's chart, especially where LDA could be higher on a lower-ranking page. This LDA really cemented some aspects of SEO in my mind. We have to look at and consider the various factors to understand why the winners are winners and others are not.
You have to love it when all you can do is offer the best, reasonable guess then back it up with the only data you have.
EDIT:
In a comment down below, Ben clarifies things quite well and - substantially - disproves my assumption above, explaining how the tool works and how non-English words can be present in the "LDA Tool Thesaurus", but without sufficient specific weight to give relevant or precise results.
Anyway... while waiting to play with a more extended version of the LDA Tool (I know we will have to wait), I will just take notice of the big concept behind it:
Boy, take care of what you write and be marvellous while writing it, as Conte(n)xt rules more than you believe.
Thank you so much for this! I was trying to recount as much of Ben's presentation to my team and failed miserably on the stuff with the charts. This will be an excellent followup for them for an even better understanding. We're already excited about the potential applications!
Really awesome work moz'ers - no matter if it turns out to carry any ranking value or not. The research seems sound and well thought through. Big thumbs up to Ben & co. for doing this work - it's what helps drive SEO towards a more scientific approach and not best-practice conjecture.
I don't care if Garcia admits to his mistakes or not. My objection is that, from what I have read of his work, he draws conclusions without taking into account other mitigating factors. In both LDA and LSI the conclusions are based on the use of (key) words in the data stack and not the importance of those words as designated visually by position, size, and emphasis.
I think we all pretty much agree that Google "weighs" keywords by their position, both in the visible page and in the source code. For any word evaluation program to be effective it would have to include these factors as well as density and synonyms.
Yet another factor is the unnecessary over analysis stuffed with meaningless scientific jargon. (Meaningless to the task at hand).
Our goal is to reverse engineer Google's algorithms to the point of being able to design pages to include the levels of relevance they require. We do not need to know the actual algos they use to determine relevance, only that they do. Experimentation with on page factors will soon point to their needs.
Being an SEO is like being a detective trying to think like a search engine trying to think like a human. And where does this take us? What is the difference between a human's interpretation of a page and a search robot's? The search engine trying to think like a human would not consider the amount of source code. It would not consider anything that the human did not see. It would look at the page from a human angle, considering only the visible, factored on relevance. If you design from this angle, SEO becomes dead easy.
best,
Reg - NBS-SEO.com
I'd just like to add one more comment on this. First, I read Garcia's comments too. I'm a business person, not a scientist, and don't mind saying this; the guy comes across to me as kind of a nut job. I'm sure he's a really smart guy but brains can get in the way of good sense, and when that happens, credibility goes out the window.
I didn't expect the LDA tool to be perfect from the get-go, but uber-mozzers tweeted that Ben reverse-engineered the algo, referred to it as a "game changer", and suggested that people who weren't pro-mozzers were "idiots". That kind of rhetoric invites pretty serious scrutiny, which is what the tool has gotten. Probably more than anything there is a lesson to be learned about when, where, and how to unveil new tools. But then the search gods come along and "instantly" everyone's following another trail. What an interesting time this is.
Ben - I watched your London presentation on the DVD and anticipated your talk in person last week greatly. I personally feel a greater sense of trust toward SEOmoz knowing that they're putting resources into actual science (not that I ever doubted, just sayin'). In a business where the latest gadget is often a new mirror that reflects a different color of smoke, science is more than welcome. You're doing an awesome job and you're great for the SEO community. Keep up the great work.
Hey, can't find the LDA Labs tool. Where can I find it?
Thanks!
It's great to see someone in SEO trying to understand Google from a Computer Science perspective, keep it up guys!
We've recently concluded an extensive research and analysis program in this area of Google SERPs, although we took a different approach of analyzing, testing, testing, and then of course more testing of Google search results. We wrote our own Semantic Analysis Tool to take apart Google results, and it turns out to be a complex mixture! We only concluded this work on August 24, 2010 (3 years of total work) and have not yet published our findings, although we've put a site up at https://searchenginesemantics.com/ as a start.
What we can tell your readers is that Google is not using Latent Semantic Indexing to index web pages.
Keyword density, the oft-pushed route of many SEOs for years, is a myth, and probably always was.
Google Wonder Wheel, Related Searches and Google Suggestions are mostly irrelevant to SEO in this area.
Google is doing Semantics in a BIG way but not the way that the Semantic Web crowd think in terms of Ontologies, Microformats, RDF etc. Google is focused on understanding page and website meaning, which is very much a semantic concept.
Here's an example of our research:
Using the Tilde symbol immediately before a keyword search on Google reveals synonymous terms, e.g. ~SEO Software, but it's pretty useless because all you get back is the term 'search engine optimization software' as a related term that's bolded.
Once you understand that Google uses acronyms as literal terms, i.e. SEO really means 'search engine optimization' to them, you try a different search of '~search engine optimization software' but that also brings back exactly the same unhelpful result.
However, once you understand that Google uses each individual word as a 'Word Class' AND can also interpret the entire phrase once these terms are combined, you try a very different search of ~search ~engine ~optimization ~software and hey presto you get bolded search results (aka related synonymous terms and phrases) which include Search Engine Ranking Software, Search Engine Submission Software, Keyword Tool etc. which have obvious relevance to the search term (this is Step 1 in Analysis).
Google's work is a moving picture, since we see semantic relevance changing for some phrases on a fairly frequent basis. They're getting better at it in some areas but are still hopeless in others: e.g. '~Google' does not know it's a search engine in its own results (lol), but they do know that '~Marlboro' sells cigarettes, '~Apple' sells Macs, '~StateFarm' sells insurance, etc.
It looks as though Google is correlating their word and phrase relationships from their own search queries, then developing algorithmic models to assess webpage accuracy against these relationships. This is probably why they just acquired Metaweb, the biggest semantic classification platform out there, since they get a ready-to-go team of people specializing in word relationship matching.
What Google is doing is very complex, and covers (here we go!):
word disambiguation, bi-gram breakage, phrase-based indexing, keyword proximity and placement rules, phrase proximity and placement rules, related word (synonymous) occurrences, acronyms, aliases, root synonyms, synonyms of synonyms (and then even more levels in some cases), keyword stemming, new rules for word capitalization, and finally, Faceted Search. Google has filed multiple patents in these fields since 2003.
Before you roll your eyes and shout out Say What!! the good news is that the root principles of analysis, interpretation and on-page optimization that will directly assist SEO's are very much alive on Google SERP's and can be quickly learned. On-page optimization is absolutely where it's at; Page Titles and URL naming are important, links do assist, but on-page rules the roost. The key is understanding how to analyze and interpret the results of Google Semantic Analysis, optimization is then a breeze.
Statistical models like LDA can be very useful in analyzing correlation, but what we're really after in SEO is causation. We have our own Bachelor of Computer Science (specializing in Semantics) from Stanford University on our team, and his insights are invaluable, yet our discoveries have mostly centered on what we can actually see Google doing, then developing, testing and proving models around this analysis.
This is our first post on this subject area since we concluded our research, and we've posted here due to the close relation of Ben's work for SEOmoz to ours; it looks like we're all trying to understand the same issue here, i.e. what is Google doing?
This field is the single most important change in the Google algorithm; it powers their search results today, and is the center of much continuing work at Google. It's not going to go away, so we in SEO had better get it, and fast!
Chris Lewis
Founder
SearchEngineSemantic.com
This is getting more to the heart of the matter. We had a client that was a glass repairman who wanted to rank for "windows portland." Of course, all of the top results are about Microsoft and not about broken glass.
After facepalming for five minutes and asking the sales rep what the most popular operating system in the world was, we started looking into this a little more in depth. This was my first exposure to LSI. I doubt very much that Google would rank a site just because it had every synonym or phrase related to "window repair." My take on this whole discussion/pissing match is that it is ridiculous to think that you can rank on the first page of Google by simply doing one thing or another on-site or off-site. The correct answer is (and will continue to be): do both, starting with good research to find your related keywords.
Context matters.
Very interesting indeed. I've just spent the past 2 hours in the office, late night crunching and studying this stuff, comparing to the LDA labs reports, looking at competitor sites, keywords, etc.
My question is: how does the Term Target tool compare with the LDA analysis tool?
For example, for LDA, our client's page scored 49% for "term X", while the competitor scored 77% for the same term. But we scored an A- in the Term Target tool for the term, whereas the competitor scored an F!
Can someone explain why the Term Target would be so low for the term for the competitor, while the LDA score would be so high? This doesn't quite make sense to me.
The LDA keyword extractor seems to ignore titles, alt attributes, that kind of thing. Those are of course emphasized by the Term Target tool, so I bet that's a big source of the disconnect.
Ah, that would explain it. Thanks, gcrocker
As a firm believer in search engine AI, I think there has to be some interpretation of intent as well as content. The tool is representative of only a part of the result scoring. Search has gone way beyond keyword matches, link influence, or keyword densities. Now it is relevance based: relevance as applied to the visual display and code markup.
Relevance also applies to other pages in the silo; regardless of being LSI, LDA, or LSD based, the relevance to the central topic's keywords must be there. It is established with keyword presentation, synonym densities, silo densities, and, in a very small part, PageRank. Nobody seems to consider just HOW Google determines relevance. IMO, built from experience, this is the core of positioning.
My NBS-SEO site ranks #2 on Google for "SEO Information" (quotes or no quotes). The LDA tool tells me I have 46% relevance. Two links down is the Wikipedia entry, which scores 76%.
Taking my b*lls in hand and disagreeing with Danny Sullivan about links counting: the search shows my site at a PR of 4 and Wikipedia at a PR of 7. Yahoo Site Explorer shows the Wikipedia page to have 53,555 inbound links. My NBS has 136.
I am FIRMLY convinced that on-page work is a MAJOR factor in SERPs. best,
Reg
Hey,
This post seems quite old. I was going through the SEOmoz pro tools to see if they still have this tool, and I couldn't find it. Can anyone tell me if the approach of optimizing content based on the LDA methodology is something we still need to focus on?
Thanks, Zuheb.
I'm also interested in knowing this.
The LDA Tool looks useful, but once you've run it for your page and some competitor pages, you're sort of left scratching your head wondering what to do next. The main approach seems to be "go read your competitors' pages and look for what you're missing." But that's a pain.
I felt like a tool letting me compare words between client pages and competitor pages quickly would be a good way to get LDA % higher, so I wrote this LDA Optimization Tool to do just that. Here's an intro on how to use it.
Basically, paste in keywords from your page and your competitors' pages, then it gives you word clouds showing what terms you're missing or need more of. It's fast to take a page from 60% to 90% LDA by applying this. Whether that'll help your rankings is up to Google, though!
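The "compare your terms against competitors' terms" idea behind a tool like this is easy to sketch. Here's a minimal, purely illustrative version in Python (the function names and the sample pages are my own, not from the actual tool): it counts words on each page and surfaces the terms competitors use that your page lacks or underuses.

```python
from collections import Counter
import re

def term_counts(text):
    """Lowercase bag-of-words counts for a page's text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def missing_terms(my_page, competitor_pages, top_n=10):
    """Terms competitors use more than my page does,
    ranked by the size of the gap across all competitors."""
    mine = term_counts(my_page)
    combined = Counter()
    for page in competitor_pages:
        combined += term_counts(page)
    gaps = {t: c - mine.get(t, 0)
            for t, c in combined.items() if c > mine.get(t, 0)}
    return sorted(gaps, key=gaps.get, reverse=True)[:top_n]

me = "rolling stones tour dates and tickets"
rivals = ["rolling stones mick jagger keith richards tour",
          "mick jagger interview rolling stones news"]
print(missing_terms(me, rivals))  # "mick" and "jagger" rank first
```

A real tool would of course weight by more than raw counts, but even this crude gap list points at the same thing: topically related terms your page never mentions.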
Enjoy!
cool. great work!
wow, what a massive subject.
In simple terms, I think LDA can be summarised as: every keyword has related words or phrases, and Google has some tools to help you find them.
Google Wonder Wheel
Google Sets
Does anyone know of any more?
Thanks for the explanation Rand! Are there any plans to make this LDA tool available via API calls?
It is a good idea. There is no compelling business reason not to, and it would be exciting to see what other folks would build with the data.
My brain hurts and I don't know if it's because it's happy or tired. Thanks for the in-depth explanation of LDA. I can't wait to see the tool that suggests additional terms.
Also, posts like these give me two thoughts:
1) Bravo, Google. It continues to amaze me how complicated the search algorithms have to be in order to prevent folks like the SEOmoz team from blowing the whole thing wide open.
2) Awesome work, SEOmoz team.
If you are going to do a white board Friday, maybe you can address some of the variances below. Either that or consider making it a drawing board Friday instead ;)
I just clicked the Compute Relevance button four times in a row without changing anything and saw a spread between 69% and 61%. Comparing to SERPs, I searched for "san diego web developer" (singular) and got the following:
Make the term above plural and the site in the #1 position drops to 20th but the LDA stays basically the same (ugg...plurals!!!). I realize there was talk of +/- and I honestly didn't catch what the .32 was about, maybe that would be good to cover in a training (and to mention on the tool page itself). I also know that my search had a local element so maybe that made a difference (I also saw big variations on branded search). My biggest concern is the rather large difference in analysis of text that doesn't change from one click to the next.
Have you tried using LDA with other ranking factors such as Page Authority via linear regression to see if there is incremental predictive ability?
Sean, I've missed you. And as usual, a good idea. I'd be interested to see results from that. I tried something along those lines rather quickly, although it wasn't done carefully enough to trust the result and so wasn't included in my presentation or Rand's post.
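For anyone curious what Sean's suggestion looks like in practice, here's a small stdlib-only sketch: fit ordinary least squares once with Page Authority alone and once with Page Authority plus LDA, and compare R-squared to see how much the extra variable adds. All the numbers below are made up for illustration; this is not data from the actual study.

```python
def ols_r2(X, y):
    """R^2 of an OLS fit with intercept, via normal equations
    solved by Gaussian elimination (fine for a few features)."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for i in range(k):  # elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p], c[i], c[p] = A[p], A[i], c[p], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for j in range(i, k):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (c[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: Page Authority, LDA score, and a ranking score
pa = [55, 70, 40, 65, 30, 80, 45, 60]
lda = [0.62, 0.81, 0.44, 0.55, 0.38, 0.77, 0.69, 0.50]
rank_score = [6.1, 8.9, 3.9, 6.6, 2.9, 9.4, 5.8, 6.0]

r2_pa_only = ols_r2([[p] for p in pa], rank_score)
r2_both = ols_r2(list(zip(pa, lda)), rank_score)
print(f"PA alone: R^2={r2_pa_only:.3f}; PA + LDA: R^2={r2_both:.3f}")
```

Note the caveat: adding a regressor can never lower in-sample R-squared, so a real test of incremental predictive ability would use held-out data or an adjusted measure, as Ben implies when he says his quick attempt wasn't careful enough to trust.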
I read Dana's post and comments; it would seem a lot of people may be missing the point when suggesting the Wonder Wheel, as related terms can change context instead of strengthening it. For example, for "stones", "rock collector" may be related, but it does not strengthen context; rather, it creates ambiguity. One could be a collector of rock music or a geologist collecting mineral samples, but nonetheless the term is related.
Or am I missing the point?
This whole idea is extremely interesting - especially for the maths geeks out there. I had wondered for a while whether search engines had moved beyond the use of LSI, given the inherent flaws and naivety of that model.
One interesting question, which may need answering by Ben, Will or someone of that ilk: could a similar model be put together using a Zipf-Mandelbrot law rather than simple Bayesian probability? Z-M models language use far more accurately, and so, if it's a tractable problem, using Z-M would be the next logical step.
Unfortunately this is ground a lowly physics graduate is fearful to tread and must be guided by mathemagicians.
What is the difference between LSI and LDA ?
If we want to rank well for "the rolling stones" it's probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates". If i were using LSI, what words would correspond to LDA ?
How does Google's algorithm factor in things like traffic + CTR data, registration and hosting? Have there been any SEOmoz posts regarding this? (If yes, would you please mention the URL?) Or any word from Google?
LSI (or LSA) and LDA are both different approaches to the same problem: topic modeling for concepts in information retrieval. LSI as a methodology isn't quite as scalable (it tops out around 400 topics, while our naive implementation of LDA reaches about 5,000). There are differences in the way they work, too, such that LSI often doesn't bias towards simple explanations (as noted in the post).
There's lots more to read on this at Wikipedia - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation and https://en.wikipedia.org/wiki/Latent_semantic_indexing
As far as click-through-rates and usage data, I'd be surprised to see them be a big factor (simply because they're relatively easy to manipulate and Google's representatives have mentioned this numerous times - this post might be relevant)
Rand, thank you for clarifying the engine behind, and the opportunities of, the LDA tool. Your explanation of the scoring variance and of the fact that scores are relative helped shed a lot of light!! I especially like your SERPs analysis chart incorporating LDA (thumbs up)!
It's also nice to see data that indicates Google really is looking for quality content people want. And the LDA tool's score is a nice measure to enable us to weigh various options of content for a page. As Joe said, SEOmoz is pushing people to think about the importance of relevant content - good old-fashioned on-page optimization, which can and usually does lead to more links, more engagement and more positive signals to Google.
Next is to wrap my head around cosine similarity, but the term vector model really helped!
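For anyone else wrapping their head around cosine similarity: in the term vector model, each document is a vector of word counts, and cosine similarity is just the dot product divided by the vector lengths. A tiny illustrative sketch (sample phrases are my own):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bag-of-words term vectors:
    dot(a, b) / (|a| * |b|). 1.0 = identical direction, 0.0 = no shared terms."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("mick jagger tour dates",
                        "mick jagger keith richards"))  # shares 2 of 4 terms -> 0.5
print(cosine_similarity("seo tools", "gardening tips"))  # no overlap -> 0.0
```

Real systems weight the vectors (e.g. with TF*IDF) rather than using raw counts, but the geometry is exactly this.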
I know this is a little dated but I think this is actually more important than ever. Can you let us know if the LDA tool was disabled or where we can find a similar tool now? I don't see it in the labs.
This really was one of the more interesting moments from the Seattle seminar (along with Ben uncorking a "holy f#$%ing $h!%" as we ran some examples). The correlation is hard to ignore, but in my limited experiments I've struggled to find a way to use this to my advantage.
Nice to see people still talking about term vector theory, I did a YouMoz post on that last year and tumbleweeds blew by. I used to think I was a math nerd, but I've got nothing on the genius of Ben - THIS is math nerddom that gets attention! :)
Nice....haven't heard the term "cosine" since calculus in high school.
I guess a great spin-off from this would be how to up your percentage. The percentage is not just frequency or density, correct?
An interesting article, Rand. I've tried your tool, and got some results that definitely made me think. Like Alan said, that's at least one benefit of such a tool. I'm still not convinced that LDA is the biggest aspect of G's algorithm, but I can imagine it playing a part.
This really is a fantastic post.
I really appreciate/admire your ability to break down fairly technical concepts for those of us with a more biz/marketing background (opposed to CS).
I love your format of presenting the technical aspects (encouraging everyone starting to get lost to hang on) and then explaining how those concepts/findings should be interpreted and used in the SEO process.
Nicely done Rand.
Dirichlet
Di·ri·chlet
/ˌdɪrɪˈkleɪ; Ger. ˌdiriˈkleɪ/
Show Spelled[dir-i-kley; Ger. dee-ree-kley]
From dictionary.com
After reading through the IRW newsletter that I received yesterday morning I continued to read some other interesting articles from the IR Thoughts blog. A few of the articles were published earlier in the year but the most recent one was On SEOs Latest Deceiving Artifact LDA (https://irthoughts.wordpress.com/2010/09/01/on-seos-latest-deceiving-artifact-lda/). The article was particularly relevant to me because I'm familiar with two other articles by the same author, The Keyword Density of Nonsense (https://www.miislita.com/fractals/keyword-density-optimization.html) and Latest SEO Incoherences (https://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/). Later that day, someone I respect forwarded me this article Latent Dirichlet Allocation (LDA) and Google's Rankings are Remarkably Well Correlated https://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated.
There appears to be a discrepancy and my comments about it are this: I'm excited that members of our community are involved in aligning internet marketing to science because I think that lends credibility to an industry that needs it. Thank you Rand and Ben. Most importantly to me, I'm honoured to have heavy thinking scientific minds keep us accountable for our assertions. Thank you Dr. E. Garcia.
-Sean Ruiz (sean.seoinc.com)
The problem I have with Garcia is not that he is sometimes wrong. Everyone makes mistakes. It isn't even that he makes incorrect statements with greater confidence than the frequency of his mistakes suggests he should. We are all human. My problem is that he never admits when he is wrong, even when it is clearly shown. This reaches absurd levels, and folks continue to cite his old discredited posts.
He argued that I made a mistake because the mean correlation coefficient cannot be computed. I carefully refuted that. Then he went on to argue that the mean correlation coefficient is a biased estimator. Something cannot be both uncomputable and a biased estimator; his arguments contradict each other. But even in cases like this, where he takes a new position contrary to his old one, he never admits his old position was wrong, or retracts the personal attacks based on the discredited position.
He also never backs down even when far outside the mathematical consensus. For instance: do both Spearman's and Pearson's seek to measure correlation through a monotone function? Academic consensus says no. Wikipedia says no. Textbooks say no. But Garcia cited a statement to the contrary as being "right" and used that as the basis for his initial criticism of our work, which used mean Spearman's correlation.
(I go through a few other of his incorrect statements here: https://www.seomoz.org/blog/statistics-a-win-for-seo, and also go into more details on the points I've referenced above where I quote/cite him directly.)
"The problem I have with Garcia is not that he is sometimes wrong. Everyone makes mistakes. It isn't even that he makes incorrect statements with greater confidence than the frequency of his mistakes suggests he should."
I'm sorry but the irony of that statement on the SEOmoz blog is just too much for me not to mention it. This site is the king of making bombastic claims and then quietly apologizing when they're proven wrong.
Ben, I'm not a mathematician and unfortunately don't understand the points you're making about Dr. Garcia's statements, but simply citing a post you wrote about your methodology and thinking that "settles" the argument is ridiculous. How about linking to other statisticians who support your points? How about performing some of the other tests that have been suggested to you?
When making the kinds of claims that you guys have, the burden of proof is on you, and in my mind (and apparently several others') you've come nowhere near meeting that burden. Of course that hasn't stopped Rand from saying things like LDA is "Remarkably Well Correlated" to Google's rankings, or Gillian from saying things like you "reverse engineered the Google algorithm."
I did cite a lot of other sources in the post I linked to that rebuts Garcia. I don't think you really want me to go through quoting and linking to them all again here, when you can read where I do so with much greater care. But there you can see:
Garcia claimed one cannot compute a mean correlation coefficient. I showed examples of several papers computing exactly that. Google scholar lists thousands of them.
Garcia held that Spearman's and Pearson's do not differ in that one measures monotonic correlation as opposed to linear. I quoted clear statements in Wikipedia that conflict with that claim.
Garcia claimed it was incorrect to say (as we had) that PCA was a linear method. I quoted several authority sources which used nearly identical language to us, and used the language to make the same point we did with it.
....
I think we are reasonably good at acknowledging our mistakes as we find them. I helped Danny with the math showing critics of his page rank sculpting post were entirely correct, and helped him use that as the basis for a full blog post correction. The alternatives would be to update the post to acknowledge a mistake, or mention the correction in the context of a bigger post. A completely new blog post gets the most attention.
I am sure I will have the opportunity to show I acknowledge a mistake one of these days, and I will do so as completely as Danny did.
....
Rand's statement seems reasonably conservative to me.
I'll respond to Dr. E. Garcia's claim in the post you cite. I am somewhat hesitant to do so, because I think it is likely instead of defending his current claim, he will just move on to a new one. That has been my experience so far, so I worry I am just feeding a troll. But here we go.
His argument in the post you link to is that we are snake oil salesmen for saying there is evidence that Google does something LDA-like, because LDA-like topic modeling algorithms cannot work at web scale. That is a poor argument; here is why:
1) The tool I made applies LDA to scoring relevance, and works for close to any arbitrary URL and query (sorry, non-English languages) on the web. If one programmer who is able to spend part of his time on the problem can come up with that, how could it be beyond Google's capabilities to do something along those lines?
2) The head of Google research in China is Dr. Edward Chang. If one looks at his research, one can see his team there has an implementation of LDA capable of running on thousands of machines. What dataset does Garcia think this is running on, if not the web? Does he think that Chang's code is unable to process a sufficiently large portion of the web to produce scores meaningful to ranking algorithms?
3) We did not say Google is using LDA. We said there was evidence suggesting Google is doing topic modeling that gives somewhat similar results to LDA. Unless Dr. E. Garcia is a lot closer to teams at Google than I think he is, how could he possibly speak to the performance of any topic modeling algorithm they could have come up with?
It is nice that Garcia mentioned LDA before us. But if his contribution was to say that there is no way any topic modeling that is LDA like could be being used by Google because it cannot scale up that large, then I don't think he added much of value.
I'll try to keep this short (as I have no way to make paragraphs), but it sure won't be easy.

First: you have a real gift for explaining complicated things in easy-to-understand ways, Rand. After reading this post, I "get" what you are looking to accomplish with LDA and your tool. And even if it is just a simple implementation, it can still be used in judging a site's SEO (like Gianluca stated above).

Second: your charts are the best! The "Components of Google's Ranking Algorithm" is a great example of a picture being worth a thousand words.

After reading this post, I can totally see why Ben rocked the mozinar and why the excitement level was so high. I've never felt so fortunate as I have lately to be able to be a PRO member of SEOmoz. This has been a real watershed of a year for y'all and I feel uber blessed to be a part of it.
I just joined this year and I've been quite impressed. The kicker, and the reason why SEO has been my favorite profession so far, is that all of this science can only lead us to reasonable guesses. We collect, then look at the clues before nervously making assumptions. There are wizards behind the curtain and they're not letting us in. There's no single yellow brick road to follow either, but all are guarded by the flying monkeys. What an adventure!
Don't see the tool anymore - I use a different one, I will try the moz version when/if it comes back :-)
Nevermind what I wrote above. Hello Instant.
What about news homepages that have a myriad of themes going on in one place? How do you build themes in such a cluttered environment, within just a few lines of text?
Hi there, I'm new. I have been trying to access the LDA tool. Is it still available?
Hi Rand, thanks for sharing some great opportunities that lie within the tool. It is true that the importance given to on-page factors is fairly small. The tool's score is a great help in setting content for pages. The examples showing the search engine's method of ordering results based on content were really good ones. Nice that the tool is currently free for everyone to use.
The tool is a great start and this is a very interesting topic. In the future, I would love to see more from the tool, like suggestions to improve score and a grouping of topics report, like "key" is used in relationship to locks, pianos, and as an adjective - which brings another point:
Do you think Google's algo looks at how a word is used grammatically (as a verb, noun, adjective, adverb, etc.) and then weights results accordingly?
Oh neat! I used LDA for a client last year to help Google index their product catalog. Here's the link: https://vcc-sae.org/topics. In it I explain briefly how I used LDA in combination with the Sphinx search engine.
"Below is an index of all SAE papers and standards. The index was created using LDA (latent Dirichlet allocation). The headlines below are links to abstracts discovered using the top 500 topics selected using LDA. The abstract is selected using the topic key words listed below the headline. The keywords are passed to the sphinx search engine using 'match_mode=any' or SPH_MATCH_ALL and the top matched abstract is the final headline shown below."
It worked surprisingly well; we went from 10% of traffic from search engines to almost 90% in the off-conference season. Now, this was back in 2008 and we haven't updated since, and it's still pulling a good chunk of traffic, down to about 70%.
I tried using your LDA Topic relevance tool and got:

- Query: Michael Jackson, 2nd paragraph of the Wikipedia article on Michael Jackson: 66% relevance
- Query: Michael Jackson, 1st paragraph of the Wikipedia article on Michael Jackson: 75% relevance
- Query: Michael Jackson, 5th-7th paragraphs (more content) of the Wikipedia article on Michael Jackson: 88% relevance
- Query: Michael Jackson, two paragraphs from the Wikipedia article on India: 77% relevance :O

I know this is a beta version of the tool, but what do you think is the problem here?
Ok, maybe I missed this somewhere, but how do you find the approximate number of exact match anchor texts? Do you have to go into OSE manually and do two at a time? If so, this is very time consuming. I think this is a very important metric, so if anyone has any suggestions on an efficient way of gathering this info, please let me know. Thanks!
The matrix from the last picture is great! How will I know when you add the functionality of generating those? I would become a PRO member just for that! :>
Also, where can we read more about people who moved UP in the SERPs thanks to optimizing their topics using the tool?
Rand, another very meaty post. Something I still need to fully get my head around; is it more complex than the Rolling Stones analogy, or would that sum it up?
Also, I've just been trying to use the LDA tool and am not sure if there is a problem with it. Each time I ran it, even with no change to the page or the text pasted into the document field, the results consistently dropped. Am I misunderstanding something? Ask it to assess 100% identical content twice in a row and the results differ. Maybe you can get that fixed so I can start using this.
using firefox 6.0.2
Chris
Hi,
I actually attended this event, and even though it's months later I still laugh at how ... unique this presentation was. I know most of the event was recorded, and I'm wondering if SEOmoz would be willing to post a short "teaser" from it.
To be honest, this was really the highlight of the whole event for me, and I am excited to see what evolves from this. I recall being completely lost for the first half of the presentation, then finally catching on towards the end that what Ben was suggesting is that the age-old maxim of 30% on-site and 70% off-site could be a myth, and that we are massively underestimating the value and potential of on-site optimization.
Any chance we could see a video?
David GS
In many ways the LDA study reminds me of the inductive method of literature study, where divisions, sections and segments back up the upper-level terms to bring coherency. In many ways it's like reverse engineering your way back to a main term. So, knowing that LDA plays a large part in relevancy, how would one find the LDA terms that relate to the main category for which your overall website is attempting to rank? Does this make sense? Suggestions welcome.
What would be an "explanation" for the 6th position?
I understand there would be a lot of different factors influencing that page's position, but could somebody guide me in this regard?
Could it be "high-profile" inbound links (edu, gov, etc)?
Thanks!
PS. As always, very useful post! Love to read SEOMOZ...makes me feel there's still a lot to learn :)
My guess for the seo-usa.org result is that it's part of a diversity algorithm.
Fair enough :)
Cheers!
It is a bit disconcerting that the critics of this don't even attempt to replicate the experiment with their own methodology, considering that SEOmoz kindly makes the raw data available for them to work on. In science, and I live this very closely since my partner and several friends are researchers in different fields, you need to back up your criticisms with your own empirical data. You can't just say "this is rubbish because I say so," suggest a number of random methodologies for the publishers of the study to recreate at their cost, and see if they fail. If the critics are really serious about their rebuttals, they should cite a similar study with a different methodology that refutes what SEOmoz says. Just because Danny, or the Gipsy, or Vandermar says the experiment is flawed, regardless of their reputation in the SEO community, that doesn't prove them right. If they were serious about it, they would do their own research and prove Ben and Rand wrong.
I don't care much about SEOmoz; this must be my third or fourth post here. I use their tools, but I also use other tools. I am pretty much agnostic in terms of SEO, and I also read Dr. Garcia's posts and newsletter. But for me, any criticism without empirical data to support it is simply invalid and more an ego rant than anything else.
Thanks Rand for the links and also for your clear explanations (as ever).
Is there any way to explain the term "topic modelling" to clients in layman's terms?
It's great to see someone applying statistics to SEO in this way. I have a lot of research to do before I can fully understand the implications, but my takeaways from this are:
Thanks for the great post.
Speaking to Joehall's comment, was there a specific corpus collection that you used to fit the model to? It's great to see the work that you've all done in creating this algorithm, and as with any search algorithms, there are always hurdles in overcoming the problem of overfitting.
I will definitely enjoy seeing where this new research, and tool creation will lead SEOmoz in the coming months and years, as I see a huge potential for this to create a secondary version where we see an SEO'er submitting a site, and the tool then returning the possible topics with their associated relevancy. It's also very nice to see some good explanations of the science of information retrieval to the general search marketing audience, great teaching material.
Thank you Rand.
The corpus is 8 million documents from the English version of Wikipedia. We believe this probably has some non-commercial biasing of content, but overall is likely to be reasonably representative.
I have been using related search phrases, as well as varying terms related to the keywords I want to rank for, and it works.
This LDA thing you discussed is another validation of what I have been telling my clients: their content should blend with their niche and the keywords they want to capture.
This is an amazing post; thanks for shedding more light on the topic. While my brain is still hurting, I performed some tests on a Dutch branding website. The results were indeed unreliable for non-English words, as I saw later in a reply by Ben. Still, I will be on the lookout for a version of the tool trained on Dutch words.
May I suggest you place a notice for international users testing non-English websites about the unpredictable results?
@Rand, regarding the idea to produce a keyword recommendations tool: a while back, while doing keyword research and playing with Google's Wonder Wheel, I was contemplating an idea. It would be extremely useful to have a Wonder Wheel on steroids: starting with a seed word and branching out, showing related words and their variants (with counts). In addition to drilling down, you could use any of the variants to continue your exploration. What's so special about the tool? The ability to immediately assess the competition, search volume and even search trends.
p.s. Don’t tell anyone, I recently added this to my bucket list of ideas.
I'm really interested in this post. Even though we believe Google is much more sophisticated, this method is undoubtedly a useful way to improve our quality for better rankings.
Thank you!
Great post.
Excellent bed time reading.
So I don't sound like a total buffoon at work.
How do you pronounce 'Dirichlet'?
Thanks
Great explanation, Rand!
It's not much of a stretch to presume that Google is using LDA (or something like it) not just on the raw page content, but on anchor text from links to the page...content on pages linking TO the target page...etc.
An approach like that would do wonders for devaluing links from spammy directories and increasing the weight of links from solid articles, I'd think.
This is very exciting stuff...looking forward to Ben's next experiments with LDA!
I'd LOVE to see LDA applied to links. It probably is. I find guest blogging to be strikingly helpful with rankings, for example.
How about that, a reference to Piano. That's what I'm talking about. Thank you Rand for yet another great post. I'm trying to let it all soak in.
Fascinating stuff...
Maybe I missed it, but how do you develop the contextually relevant words if they are less obvious? If you're targeting The Rolling Stones, then tour dates, Mick Jagger, etc. make total sense, but what if the relevant words within the topic are less clear? Is there a tool that will generate a word cloud of contextual terms?
This post is one of the most in-depth ever posted on this subject, and although I won't pretend to understand every bit of it, it certainly explains the LDA system as clearly as possible. I think it's clear that the search engines do use topic modelling.
Really appreciate this immense level of rigor and depth of analysis; this is what separates SEOmoz.org from the pack in terms of what it offers professional SEOs. Can't wait to delve into this veritable goldmine of search data and statistical perspectives. Hope I make it out alive! :)
Thanks a lot Rand, very interesting.
I tried the tool for my website seoguru.nl. I rank high on page one for "SEO" in google.nl, yet the tool gives a relevancy of 4% on "SEO" for my homepage. On a test word, "Kralen", I get a relevancy of 98%, but the word "Kralen" is not used on the homepage, so I expected 0%.
That's strange isn't it? What's your take on this? Thanks!
The tool was trained over an English corpus. What it does for other languages will not be good.
Most likely there was a trivial amount of Dutch in the corpus, and it got lumped together in a single topic. That would imply we'll score any Dutch word we know about really highly with any other Dutch word we know about.
We should train the tool over a corpus with multiple languages, or perhaps train different topics for a variety of languages. Until then, the tool will be pretty random for all languages besides English.
Ok, thanks, good to know!
Ok... good to know...
A shame it means I will use LDA less than I'd like to (only 30-40% of my clients also have an important English version of their websites).
That doesn't mean I'll stop playing with it, though.
BUT: the most important thing is the concept behind the tool, which means I will pay even more attention to the quality of the written content on the sites I'm working on.
Here is a Spearman coefficient of 0.35 :
https://en.wikipedia.org/wiki/File:Spearman_fig2.svg
Do you find it conclusive?
Do I find it conclusive of what?
Here are two things I find it conclusive about:
1) LDA, as we implemented it, is not the only factor in determining rankings. 0.32 is far enough away from 1 to see that.
2) LDA, as we implemented it, is non-trivially correlated with rankings. 0.32 is far enough away from 0 to see that.
I do not find it conclusive evidence that Google is doing some sort of topic analysis, but I do find it strong evidence of that. Correlation does not imply causation, but given that 0.32 is higher than what one would measure for the number of linking root domains (or count of links, or a PageRank-like algorithm), I find it strong evidence that Google is probably doing some sort of topic analysis that produces scores somewhat similar to what our tool does. It is hard for me to think of a plausible confounding variable that could produce so much correlation unless it was some sort of topic modeling algorithm.
Do you disagree?
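For readers following this statistical back-and-forth: Spearman's rank correlation, the rho value Ben and the commenter are debating, is just Pearson's correlation applied to the ranks of the data rather than the raw values. Here is a minimal illustrative sketch in plain Python (not the study's actual code):

```python
def rank(values):
    """1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Perfectly monotone data gives a rho of 1.0 and perfectly reversed data gives -1.0, which is why a value of 0.32 can simultaneously be "far from the only factor" and "non-trivially correlated."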
Good work Ben. I am convinced that this is part of the ranking factors Google uses, and your correlation levels show how much weight it is being given: enough to make optimizing each page for the LDA umbrella terms being ranked for necessary, but not enough to rip up a page and start over. You have expressed in an equation what each of us should have been doing in each sentence on each page. Thanks.
It's been a while since I got so excited about something in SEO. I read everything, including external sources and references and rewrote our home page as a test. Now I wait...
One of those posts you have to re-read every year or year and a half :-)
I'd be interested to learn how LDA algorithms deal with ambiguous words or phrases, for example
Bark is the outermost layers of stems and roots of woody plants. Plants with bark include trees, woody vines and shrubs.
and
A bark is a noise most commonly produced by dogs and puppies. Woof is the most common representation in the English language for this sound (especially for large dogs).
are both good candidates for the word "bark".
Would an LDA algorithm bundle together words like "woody, roots, stems" with "dogs, woof, sound" or would it use one set of related words to try to identify one meaning of the word and another set for the other meaning?
"bark" will appear in topics appropriate for its "tree wrapper" meaning and in topics appropriate for its "dog noise" meaning. Which of these will be used to explain the word in a given document is based on what other topics are used to explain the document. As the document has more words like "root," and "leaf" it is more likely to use the topic with tree stuff to explain those words as well as the word "bark." If the document has other words like "bone," "dog," and "collar" it will be more likely to explain all of those along with the word "bark" with a dog stuff topic.
So if you were to search for the word "bark", how would the algorithm calculate relevance across the two meanings? What happens if you have a document that contains something like "the dog urinated on the bark of the tree"? I assume it would score for both meanings, but would those scores be combined to make it more relevant for the word "bark" than the same line with dog substituted with cat? Thanks for explaining... interesting stuff.
It's all new and I'm happy to test and use the new tool, hehe. It will be very useful!
Awesome breakdown of Ben's talk, Rand. I left the SEOMoz Seminar really excited about this, and my team was very interested to hear about it when I returned to the office. This post really helps to explain it, without the use of Ben's admittedly awesome but indecipherable mathematical formulas. Keep up the good work, and I'll be sure to let you know if the tool helps me to rank better for any of our top keywords.
Hm ... there seems to be an error in the LDA tool. If you load data from an external URL and click the "compute relevance" button several times, the score increases with each click. I think this value should be exactly the same for a static query and static textual content. (For future bug tracking: there are minor issues with e.g. ä/ö/ü.)
Best regards,
Karl
Thanks, Rand, for providing great information on LDA. It was good to learn how the search engine might be working, and these examples clearly show that we are one step closer to exposing the Google algorithm.
For the visual learners, I created a bar graph using this data. It shows the average LDA for each ranking:
Mean LDA by Search Result Rank
Okie dokie, then. Keep in mind that I am so not the maths guy. Look, I even put an "s" on it to prove my point (... I'm not European). So let's get into the nitty gritty.
How do I read your graph, especially when I compare it to Rand's seo-serps-ranking? Are these two related but based on different data?
Thanks.
I took the data from the spreadsheet that Rand linked to in this post. I added a column containing each URL's rank in its SERP. Then, I calculated the average LDA for each rank across all of the queries, and plotted it on the graph.
For instance, first-ranked pages had an average LDA of 77.1%, whereas pages ranked eighth had an average LDA of 46.9%.
As for its relation to Rand's SEO-SERPs ranking, I'm not sure to what you are referring.
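The averaging described above is a straightforward group-by computation. A minimal sketch, with made-up (rank, LDA score) rows standing in for the linked spreadsheet's data:

```python
from collections import defaultdict

# Made-up (rank, LDA score) rows standing in for the spreadsheet data.
rows = [(1, 0.75), (2, 0.50), (1, 0.25), (3, 0.50), (2, 1.00), (3, 0.00)]

def mean_lda_by_rank(rows):
    """Average the LDA score of every result that appeared at each rank."""
    totals = defaultdict(lambda: [0.0, 0])
    for position, score in rows:
        totals[position][0] += score
        totals[position][1] += 1
    return {position: s / n for position, (s, n) in sorted(totals.items())}

print(mean_lda_by_rank(rows))  # {1: 0.5, 2: 0.75, 3: 0.25}
```

Plotting those per-rank averages as a bar chart gives exactly the kind of graph described here.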
Interesting chart, thanks!
One of the most interesting posts this year, thanks!
Looking forward to more on this, especially on an international version for other countries and languages.
Truly fascinating stuff!
What happened to the LDA tool mentioned here? No longer available?
Thanks for the post! Useful as always.
My question is about the use of super footers and cascading menus. With these added (sometimes irrelevant) keywords, how could this affect a page's relevancy, if at all?
I used the tool and noticed it picked up all the keywords. I removed the keywords from our footer and my relevancy went up by 5% according to the tool. Which is rather large, right?
Awesome. How can I find closely related keywords?
That's something we don't have a process for today, but as I noted at the end of the post, we hope to in the future. Folks who've been trying it thus far have been primarily using trial-and-error (and benchmarking against pages/content that models well).
A nice tool for that is Google Sets.
I think an important point to be made here is that copywriters actually have to know the subject area thoroughly when writing. Ahem, sorry, Indian outsourcing companies offering to write about heaven, earth and everything in between for $5 an article.
Only when you are knowledgeable within a niche can you use the right related words/synonyms/acronyms/homonyms and avoid the damaging ones.
This might prove to be quite effective against spammers and cheap/spammy outsourced content creation over time.
This is super interesting - thanks for sharing!
Thanks for the insights. While waiting for this promising tool, I'll stick to:
- marketing knowledge (from the marketing and sales teams)
- Google Wonder Wheel
- Google keyword suggestion tool
5. Two years ago Google still recognized a rewrite of a page as a unique one if 55-65% of the text was unique (including word/sentence/synonym order). Is the way Google treats near-duplicate content now close to https://www.copyscape.com/?results ?
6. Do H1-H3 tags still matter at all? What is the optimal amount, in %, to use on a 1000-word page? Same % for keywords in strong tags?
7. Do keywords with commercial intent like "buy" or "shop" get additional penalties if links are earned using that anchor text?
8. What are the top 3 factors for higher rankings (internal, external)? Is the quantity of well-targeted anchor links still the #1 factor?
9. Is it true that a 1-second web page load time will increase rankings soon?
10. After reading "the number of pages that we crawl is roughly proportional to your PageRank" (https://www.stonetemple.com/articles/interview-matt-cutts-012510.shtml), would a PR6 website get 20,000 pages crawled?
Great post and the tool.
I tried it with a document (entering text) and got very interesting results.
Regarding using a URL instead of entering the text, am I the only one who's getting the following error:
"error, got 0 http response for <url> whereas a 200 is required." ?
I am trying with several different domains/pages without HTTP, with HTTP, with/without WWW, ...
Or are there too many requests at the moment, so the server is having a problem?
Hi Rand,
Nice post! The LDA mechanism described above looks very promising from an SEO perspective, and hopefully the tool will benefit keyword rankings and on-page optimization that correlates with Google's rankings.
Could you please describe by how much rankings would improve after implementing this?
Kind Regards!
Howdy Rand,
Thanks for explaining LDA in detail! Do you think Google is using entity tags for term vector spaces and topic modeling?
In your "Pianist" example, surely those pages that don't contain the word pianist in any of the usual places will be so far down the SERPs that any amount of topic modeling would be futile? Are you suggesting that by creating beautifully modeled content for a term such as "pianist" without actually using that term in the page title/content/url etc, would give that page a ranking chance against those that do have the term in prominent places, with potentially rubbish content therein also?
I guess what I'm trying to say is that we can all agree that LDA/LSI plays a part in the algorithm but surely there are many other more important indicators for ranking before we get into such minutiae of detail?
Stunning analysis however from all involved. Where else can we get access to such clear infographics - that's rhetorical! Thanks to all.
I don't think that they were trying to say you can rank highly without the keyword in the text, just that the LDA tool picked the right text.
But as for importance, the correlation, as Rand explained, is higher than any other metric, including backlinks and keyword in the title. Correlation isn't causation, but I am not going to leave the keyword out of the title or stop getting backlinks, and for the same reason I would not ignore LDA. There could be some other cause behind the correlation of any of these metrics, but it stretches the imagination to believe there is.
Wow, I'm glad my time zone's 10 hours ahead of yours. I would NOT have been able to handle this first thing in the morning.
For the same URL and single-word query I got 45% and 62%; isn't that more than the mentioned ±5%? Regardless, it's an awesome article, one of the best I've read this year.
Can you email ben at seomoz dot org with your results? A swing that large wasn't present in our testing, and we'd love to take a deeper look.
Hi Rand,
Could you please give me some suggestions for using the tool effectively?
Thanks and Regards!
Hi,
I have run the tool a few times on the same URL with the same keyword, and the result keeps changing. I started at 70%; 5 minutes later the LDA went up to 84%.
How consistent is the tool?
I just used the LDA tool and it gives 50% relevancy for our main keyword (the site is in second position for this keyword). But for the number one site for the same keyword, the relevancy is 48%. Not much difference, but a huge difference in position. Are there other factors at play, like strength of links or on-page SEO?
Almost definitely. It would be very weird to imagine a search engine that based its rankings off a single metric/variable. As I noted in the post, it might pay to build a chart with multiple relevant metrics to help get a sense for what could be causing the various sites/pages to rank where they do.
First off, great post. Second, great tool.
I just wanted to express shock and awe that you could come up with something like this. I'm still a noob when it comes to SEO, but I have read so much conflicting material about the importance of on-page optimization that I'm lost as to what to really trust. Your LDA scoring tool and informative analysis are an eye opener for those who don't practice on-page SEO at all. It also makes sense to me that search engines are being made "smarter" and might eventually reach a stage where, almost like a person with a sweet tooth, they differentiate between a bag of Skittles and a Louis Vuitton bag with a "Skittles" monogram printed 50 times. Obviously SEs would rank the former over the latter and give candy searchers a much better result PLUS a better user experience when they do click on the (real) Skittles link. Dumb analogy, but it makes sense to me. Please comment if I am missing the picture entirely.
It would be great if having a high LDA score proved higher SERP ranking potential, but I know that's not the end of the 'Neverending Story' of SEO. So I will instead use the tool as a guide to how to think about optimizing my on-page content more for human readers even more than before since that seems to be where SEs are headed.
Dumb question probably, but when analyzing backlinks, would SEs also check that the sites linking to yours have a high LDA score? Or does it not matter, and the guy with the most links -- relevant or otherwise -- wins?
Keep up the good work seomoz people! If I ever get back to the Emerald City, I'd like to say thanks in person!
Great stuff Rand - thanks for publishing.
Do you think there is a role in the algorithms for actual user searches in determining some of the topic matching? A document-based strategy for determining related words could be enhanced by also looking at common search refinements/qualifiers (e.g. piano keys, grand piano, piano stool). These user-generated term relationships could be valuable for understanding the importance of certain terms around a particular word or phrase.
I picked up a keyword: bulk sms
and put that into SEOmoz's LDA computation tool at https://www.seomoz.org/labs/lda
The topmost result on Google SERP-1 (I was logged out of my Google account) showed a relevance score of 69%, but the very last result on Google SERP-1 for the same query showed an LDA-centric relevance of 79% for the given keyword. I had the same experience with other queries as well. Doesn't that suggest SEOmoz's LDA score isn't yet mature enough to be a reliable SEO/on-page-optimization metric?
Not by itself. The tool isn't giving a general purpose relevance score, or even a general on page factor score. It ignores a lot of on page factors, and completely ignores anything related to links.
I do think that today it is a useful data point to look at in combination with other data points, as Rand suggests doing in his post above. But do keep in mind this is new stuff. The tool is still in labs, and there isn't yet the experience to say that adding keywords that cause this tool to score a page higher will make it more likely the page will rank higher. That is suggested by the correlation numbers, it is suggested by the anecdotal evidence that Danny Sullivan mentioned, and there is certainly a rather plausible explanation for why it would. In a few months, I am sure we'll know a lot more about when and how to apply topic modeling considerations to SEO.
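To make the idea of a topic-model relevance score more concrete: one common approach is to infer a topic mixture for the query and for the document, then compare the two mixtures with cosine similarity. The sketch below is a deliberately toy illustration, not the SEOmoz tool's actual method: the two topic-word distributions are hand-picked (real LDA learns them from a large corpus), and the "inference" step is a crude proportional assignment rather than true Bayesian inference.

```python
import math

# Hand-picked toy topic-word distributions (hypothetical; a real LDA
# model would learn these from a large training corpus).
TOPICS = {
    "music":    {"piano": 0.4, "keys": 0.2, "stool": 0.1, "grand": 0.2, "song": 0.1},
    "hardware": {"keys": 0.3, "keyboard": 0.4, "mouse": 0.2, "usb": 0.1},
}

def topic_mixture(words):
    """Crude stand-in for LDA inference: score each topic by the summed
    probability of the observed words, then normalize to a distribution."""
    scores = {t: sum(dist.get(w, 0.0) for w in words) for t, dist in TOPICS.items()}
    total = sum(scores.values()) or 1.0
    return {t: s / total for t, s in scores.items()}

def cosine(a, b):
    """Cosine similarity between two topic mixtures (dicts over TOPICS)."""
    dot = sum(a[t] * b[t] for t in TOPICS)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query_mix = topic_mixture(["piano", "keys"])
doc_mix = topic_mixture(["grand", "piano", "stool", "song"])
print(round(cosine(query_mix, doc_mix), 3))  # → 0.894
```

The point of the sketch is the shape of the computation: the score rewards a document whose overall topic balance matches the query's, rather than counting raw keyword repetitions, which is consistent with Ben's note that the tool ignores links and most other on-page factors.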
Topic modelling methods are used to classify textual data so that it becomes more semantically relevant. There is a promising web company, by the name of Peer39, that has mastered machine learning and content modelling methods to really retrieve exact matches! I have no doubt that the big SEs apply such methods to determine relevancy.
Great post!!!
Thank you so much for making this tool available and for doing all of the research on LDA. This is going to be an immensely powerful tool for the SEO community as a whole.
Oh ... and ...
Best. Post. Ever.
"I've never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker's remarks"
-That's because the audience thought he was done with his presentation.
I don't think anyone had a clue what he was talking about until Rand showed up to save the day (thanks Rand). Maybe next time you should keep Ben locked up in the lab as public speaking is not his cup of tea.
Sorry to hear you didn't like it. Did the post help clarify?
Yes, this post helped quite a bit. Thanks Rand.