In this week's Whiteboard Friday Rand Fishkin and Ben Hendrickson discuss LDA (Latent Dirichlet Allocation) and SEO (Search Engine Optimization). There has been a lot of discussion about the relationship between these two topics lately and this video answers many of the questions people in the community have been asking. It is comprehensive (25 minutes) and uses many easy to understand diagrams and examples to discuss what impact LDA may have on the SEO industry. We look forward to reading your comments below.
Video Transcription
Rand: Howdy, SEOmoz fans. Welcome to another edition of Whiteboard Friday.
Today, I am joined by Ben Hendrickson. Ben?
Ben: Hello. We've met before.
Rand: Have we really?
Ben: I think so.
Rand: So, Ben is our senior scientist here at SEOmoz. He does a lot of our
research work and has been working on some interesting projects.
Lately, we posted about one of those projects and asked for some
feedback and got some great responses. A lot of people are very
passionate, very excited. And some people are a little confused. So,
we wanted to dive deeper with this LDA stuff.
What's LDA, Latent Dirichlet Allocation. We wanted to talk about topic
modeling in general. There was some feedback, right, and I am sure
you saw some of it too, that was like, "I'm not quite sure. You're
saying on-page maybe is more important because of this LDA stuff,
and I always thought on-page just meant keyword density or stuffing
your keywords."
Ben: Yeah. Clearly the words used matter. For any given SERP, a huge number of
     pages aren't going to rank for it because they have nothing to do
     with it; they never use the word at all. Right? I mean,
     Google.com only ranks for a very few things, and it has a ton of links. So, of
     course, the words that are on the page matter.
Rand: But we've always, as an SEO, even when you've done your previous
research, it was sort of like, boy, it sure does look like links are
a whole lot more important than . . .
Ben: Using the keyword in the title tag. Right. Yeah. So this was
     something that actually was very surprising to us, which is why we
     shared it. What was it? It seems like using other words related
     to the query in a very specific way seemed to help a lot.
Right?
Rand: And we were kind of weirded out by that.
Ben: Yeah.
Rand: Or we were at least surprised by that. So, that is why we are sharing
it. So, let's go back in time a little bit and talk about this whole
. . . for people who are kind of going, "I don't understand what you
mean when you say it's more sophisticated than keyword density, or
it's more sophisticated than a normal keyword metric or keyword
usage." Keyword density is just like the percent of times that the
word is used out of all the words in a document.
Ben: Yeah.
Rand: Super simple to game. Kind of useless for IR is my understanding.
Ben: Well, I mean, it gets you a lot of the way. I mean, at least you have
     that word in the document you return to people. But, like your blog
     post earlier in the week showed, there are a lot of basic situations
     where you can't tell which is the better content just by doing this.
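To make the keyword density idea concrete, here is a toy sketch (the function and the example numbers are invented for illustration, not anything from SEOmoz's tools):

```python
# Toy sketch of keyword density: the share of a document's words that
# match the keyword. Function and numbers are invented for illustration.
def keyword_density(document: str, keyword: str) -> float:
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

print(keyword_density("star trek is about star ships", "star"))  # 2 of 6 words
```

Repeating the keyword a few more times pushes the number straight up, which is exactly why the metric is so easy to game.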
Rand: Right. And so, IR folks in the '60s came up with this TF-IDF thing,
      which is essentially looking at whether the terms being used are
      more or less frequent in the corpus as a whole. So, for a library,
      that means looking at all the books in the library, or everything in
      the card catalogue. And now that there are search engines, they look
      at all of the documents on the Web.
Ben: Yeah, right. So, the big intuition here is that they are searching
for multiple words. The word that is rarely ever used is the one
that actually matters the most. So, if you are searching for the
SEOmoz building, a document that includes a building and SEOmoz is
probably very relevant. A document that contains "the building" or
"the SEOmoz" is a lot less relevant. So, the basic story there is
that you are biased against caring about words that are very common.
Rand: Right. So I like your Lady Gaga example where you're like, well,
documents that have Gaga on them are probably way more relevant than
those that just have lady on them, even though lady and Gaga are
both four letter words in the phrase.
Ben: Yeah, exactly.
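The TF-IDF intuition above can be sketched in a few lines. This is a hand-rolled toy over an invented three-document corpus; real IR systems use smoothed and normalized variants:

```python
import math

# Hand-rolled toy TF-IDF over an invented three-document corpus.
# "lady" appears in every document, so its IDF (and weight) is zero;
# the rare word "gaga" is the one that carries the signal.
corpus = [
    "lady gaga concert tickets".split(),
    "the first lady visited the school".split(),
    "a young lady walked by".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                    # term frequency
    df = sum(1 for d in corpus if term in d)           # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0    # inverse document frequency
    return tf * idf

doc = corpus[0]
print(tf_idf("gaga", doc, corpus))  # rare term: positive weight
print(tf_idf("lady", doc, corpus))  # in every document: weight 0.0
```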
Rand: All right, cool. So we evolved to this TF-IDF stuff. And then there
is this like co-occurrence thing, which we talked about on the
SEOmoz blog a long time ago. Co-occurrence is kind of interesting
where we look at, and let me make sure I am getting this right. It
is essentially that, oh well, oftentimes when I see, for example,
Distilled Consulting and building and SEOmoz and building, I find
those frequently together because it turns out that we share offices
with Distilled and we do lots of work together and those kinds of
things. So, maybe a document that has both Distilled and building
and SEOmoz might be more relevant than just the one that just says
SEOmoz.
Ben: Exactly. Right. So, if you are trying to basically figure out if it's
just an offhand reference to it or if it's something that is
actually valid a whole lot, right, the fact that it is using a whole
lot of other words that also occur with the keyword would be a good
indication of that.
Rand: But then topic modeling, I think that even I get a little bit
confused when I think about topic modeling versus co-occurrence,
because it seems like topic modeling is maybe very similar to this.
Ben: Well, this is great because you drew a Venn diagram that shows the
difference really well.
Rand: Right. Super smart of me.
Ben: It's like you kind of knew. So you can imagine that you could have a
whole bunch of words that would have a very high co-occurrence with
Star Trek. Right? You could have documents that talk about gravity,
space, planet, and tachyon. But it still might not be about Star
Trek, even though you've got four words that co-occur a lot with
Star Trek. It could about astronomy. Those are all real things that
exist in the real world, or at least people think they might exist
in the real world in the context of tachyons. But if you have
something that is talking about tachyons and gravity and William
Shatner, that's probably Star Trek. Right?
And so, it's not just the number of words you have that co-occur.
You are actually trying to figure out are these words being used in
the context where they are talking about Star Trek, or are these
words being used in the context of talking about astronomy. The way
we can do this is because in general fewer topics is better. So,
it's possible that we have something that is talking about astronomy
and TV and it happened to use gravity and tachyon and William
Shatner in the context of something else he did. But it's more
likely to just have . . .
Rand: So normally, we might say like, "Okay, I can imagine Google using
this to try and do a couple of things." Right?
Ben: Right.
Rand: For weird queries, where maybe the word Star Trek wasn't used but
they think it might be about that and they think that's what the
person wanted, maybe they would do it. But for ordinary rankings, it
seems like using these words when I'm talking about astronomy or
using these words when I'm talking about Star Trek isn't going to
help me any more than not using them. But then we did this topic
modeling work and we tried to analyze that. Right? So we used a
process called LDA, which maybe we can talk about in a sec. But we
used this process to basically build a model that has all these
different topics.
Ben: Right.
Rand: And essentially, the topics, as I understand them, aren't actually
keywords. They're just like a mathematical representation of a
subject matter. Like you were saying there's probably a cartoon
topic, but it's not like the word occurred necessarily.
Ben: Yeah, right. So, it has actual words in it. Right?
Rand: Yeah.
Ben: You can look at a given topic and you can see all of the words in it
and see how much each word is in it. But no human went by and said
we should make a topic about this to show what words may be put
together. So, if you look at papers, people pretty much refer to
topics by whatever the most common word in it is, which in the case
of cartoon might be cartoon.
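As Ben says, a topic in an LDA model is just a probability distribution over words, conventionally labeled by its most probable word. A toy sketch, with invented probabilities:

```python
# A "topic" in an LDA model is a probability distribution over the whole
# vocabulary, not a keyword. These probabilities are invented; by
# convention a topic gets labeled with its most probable word.
topic = {
    "cartoon": 0.12,
    "transformers": 0.09,
    "megatron": 0.07,
    "optimus": 0.06,
    "gravity": 0.001,
}

def topic_label(topic):
    # Pick the highest-probability word as the human-readable label.
    return max(topic, key=topic.get)

print(topic_label(topic))  # cartoon
```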
Rand: Like I remember one of the early ones we were looking at was
Transformers.
Ben: Yeah, right.
Rand: It was like, oh, well, Optimus Prime and Megatron and Sydney, the
      woman who's in all of the movies now. She came up a lot. Megan
      Fox was in there.
Ben: Is she related to Vanessa Fox.
Rand: I don't think so.
Ben: Okay.
Rand: In fact, I strongly suspect no.
Ben: Okay.
Rand: I'd guess it's a screen name. But so, in any case, you get these
topics. You have these words in them. And then when we say, "Well,
how much does this matter? Like how much does it matter if I am
writing a page about Star Trek and I have lots of links pointing to
me, but I'm not ranking as well as I think I should. Could it be
that maybe I have not included keywords that would tell Google that
I am actually about the topic Star Trek or about related topics?"
Yes. And so, we don't know how important that is. And that's why we
did some work with correlation to try to figure this out.
Ben: Yeah, right. Because, obviously, we don't work at Google.
Rand: We just have to look at the outcome.
Ben: We have to look at the search results and then decide if this seems
like what they are doing. Yeah. So we try to see.
Rand: All right. So, let's talk about that correlation process. So Ben,
we're talking about this correlation thing and a part of me is kind
of going like, as a classic SEO, like non-statistics, math major,
this kind of thing, I kind of go, "Isn't the best way to test
whether this works is to have like two random documents on the Web,
and I'll try putting your LDA stuff to work and see if it raises up
one of them or doesn't raise up the other?" And I can do tests that
way. Like, what's this correlation? Why do I need that? Is that a
better way to do it?
Ben: I mean, they are just different. We've tried doing control tests
where we put the keyword and title tag on one and not the other and
we see which one ranks. But it's very hard to do enough of those to
reach statistical significance. It's pretty easy to set up ten websites
where one is doing stuff one way and the other is doing stuff the
other way. But you end up with like four one way and six the other,
or three one way and seven the other.
Frequently, a lot of these effects aren't that big. Google says there
are hundreds of things that influence SERPs. So even if you try to
control for as many variables as you can to try and make it the same
between these two, there is just a lot of noise in terms of what
actually ranks higher. So it takes a very large amount of work to
make enough samples to say something with statistical confidence.
Rand: And you never know when you might have some weird factor that is
influencing all of them in some weird way.
Ben: Yeah. There is another problem: you are probably looking at really
     tiny pages and little tiny domains, because you are not setting up
     a huge number of large-scale domains to try this out. Right?
Rand: Right.
Ben: So you are going to get an answer. The question is: Is this answer
going to scale up to real pages people care about from my small
pages that have ten links to them? So, it is a very interesting
process, and I actually would be fascinated if people got good
results from it. But we have tried it, and the results have all
kind of been . . .
Rand: Middling at best.
Ben: Middling, yeah.
Rand: There are no good conclusions from anything. So instead, we use this
correlation process. Right?
Ben: Right.
Rand: If I understand your process right, you basically run across not a
dozen or a hundred, but hundreds or thousands, in some cases, of
different search results looking for elements that will predict that
something ranks higher or lower.
Ben: Yeah.
Rand: And so I saw that Danny Sullivan left some great comments in our blog
post about LDA. He said, for example, "Well, you guys said that
correlation with keywords in the title is very low. I don't believe
that at all because, when I look at search results, all the search
results I see almost always have the keyword in the title tag. So,
what are you measuring here that I'm not seeing?"
Ben: Right. The difference is measuring whether a keyword is in the search
     results versus measuring whether it is correlated with appearing
     higher in the search results.
Rand: So if all of these included the keyword Star Trek in the title
element, then what's the ranking correlation of the title element
with the keyword?
Ben: It would be zero. Right?
Rand: Because they are all the same. Versus the raw probability that
      something will appear as a blue link on Google at all?
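That "all the same means zero correlation" point can be shown with a toy example. This is a hand-rolled Pearson correlation on an invented five-result SERP; the SEOmoz study used Spearman's rank correlation, which is Pearson applied to ranks:

```python
# Hand-rolled Pearson correlation; Spearman's rank correlation (what the
# SEOmoz study used) is Pearson applied to ranks. All numbers invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

positions = [1, 2, 3, 4, 5]             # 1 = top of the SERP
keyword_in_title = [1, 1, 1, 1, 1]      # present in every result
linking_domains = [90, 70, 80, 40, 10]  # varies with position

print(pearson(keyword_in_title, positions))  # 0.0: no ranking signal,
                                             # however prominent it is
print(pearson(linking_domains, positions))   # strongly negative: more
                                             # links go with position 1
```

The negative sign on the second number just reflects that position 1 is the top spot, so "more links" lines up with "smaller position number".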
Ben: That's an interesting thing. We computed some data a while ago using
the correlations where we were comparing Bing and Google. It
actually was interesting to see Google tends to have a lot of stuff
with this element. Bing had fewer things with that element. It
actually tells you how the search engine is different. It's
interesting just looking at raw prominence when you are trying to
compare two search engines. But it's not very interesting when you
are trying to compare two features because . . .
Rand: Or when you're trying to figure out what will help you rank well.
Ben: Exactly.
Rand: Okay. So, got you. So what Danny Sullivan is talking about with this
"I see the keyword in the title tag like 70 percent of the time or
more," that's this raw prominence thing.
Ben: Right.
Rand: That's like how many times does it appear in there? But correlation
of a specific feature with ranking higher is essentially looking at
all of these and then saying like, hmm, you know, on an aggregated
basis across hundreds or thousands of search results . . . I think
the study you did for the Google/Bing thing was like 11,000
different search results. Right?
Ben: It took a long time doing each search and writing it down on paper.
Rand: Yeah. I bet it did. You're totally incredible for having done it
manually. So, you look at all of those and then you would say, "Oh,
well, this particular element on average, like having the keyword
exactly match the domain name, the top-level domain like it does
here, boy, that sure looks like it is correlated with ranking much
higher." I think having the keyword in the domain name was one of
the highest correlated single features that we saw.
Ben: Yeah, right.
Rand: And the same thing goes for number of linking root domains, like
      diversity of different link sources that you've got. Like if tons
      and tons of different websites have a link to Amazon, that seems to
      predict, or correlates well with, Amazon doing pretty darn well.
Ben: Right.
Rand: And if I recall, I think correlations for title tags and keyword-
based stuff, with the exception of the domain name, was in the like
0 to 0.1 range. Maybe 0.15, something like that.
Ben: Yeah. In fact, some of them were actually a little bit negative.
Rand: Why would it be negative?
Ben: Because it is quite plausible that if it's in the title, someone put
it there because they would like to rank higher than they actually
do and (_________) a lot of other things and it's just not a very
good page.
Rand: So you're saying, because of keyword stuffing SEOs, there could be a
negative correlation or other conflicts.
Ben: Yeah. Exactly.
Rand: So this on-page stuff, pretty small correlation. Right? So then, we
      looked at things like links. A lot of those were in the 0.2 to 0.3
      range, with 1 being a perfect correlation. So there was like links
      to your domain. That was pretty decent, like 0.24 or 0.23 or
      something like that. Things like page authority, which is a metric
      we calculate, was really quite nicely high. It was almost 0.35,
      0.34, something like that.
Ben: I can't confirm or deny these numbers. I don't remember them off the
top of my head.
Rand: All right. But there are different ranges. Right?
Ben: Yeah.
Rand: So, when we looked at linking stuff, it was almost always better than
on-page stuff.
Ben: Yeah, right. Links seem to be, if you had to develop a search
     algorithm to sort things and you could only pick one kind of
     signal, just looking at links seemed to get you most of the
     way, in terms of anything that we measured.
Rand: So then when we saw this LDA thing at 0.32 something, that seemed
      wacky. That seemed crazy high for an on-page factor, because we had
      never seen anything about the features of the words or
      how you use them, with the exception maybe of the keyword in the
      domain name, that was this high in correlation. So that sort of
      struck us as being very odd, and it's one of the reasons that we
      wrote about it and were excited about it. But let me just throw this
out there. Correlation is not causation. Right? It could be that
maybe domain name is really the thing that is being ranked. But
maybe it's other features. Right? Correlation doesn't necessarily
mean that that is what is causing it.
Ben: Right. And almost certainly our LDA model is not causing it, because
Google doesn't use our LDA model. They're not asking for numbers.
Right? Then almost certainly Google is not going to do LDA like we
have done it. They have not used our corpus. We have a model that is
correlated with Google's results, and it is certainly not causing
Google's results. But the thing is that it is a very high
correlation. So, they are doing something that is somehow producing
results that are correlated with a LDA model. It is hard to imagine
really what that would be, unless it was some sort of topic modeling
or something like looking at the words used on the page.
Rand: So, there's two things that come out of this. One is that, to my
      mind, when I see something that high, and assuming all the numbers
      look right. I know some people gave your numbers a hard time, but
      at least the criticism we have received so far has not made us
      think we have done something wrong.
Ben: Yeah. I spend most of the day running code. But it is quite plausible
that I did something wrong. I'm sure I have. But the specific
complaints people have come up with so far aren't very credible.
But, you know, in the future, it will certainly happen someday.
Rand: I'm sure we are all excited for that day, Ben. Assuming that these
numbers are quite high, doesn't it sort of say like maybe we've been
wrong about this on-page stuff not mattering all that much? Maybe we
should do more on that front, like more investigation, test out the
results, try putting our keywords on the pages in certain ways.
Ben: Well, Google always says to spend time writing good content. Right?
     And that's a little bit hard to apply, but you can interpret it as:
     write content that makes it clear what your topic is by using words
     that are going to eliminate any topic from being (________) except
     for the one that you are trying to rank for. So, I don't know if
     it's that revolutionary. It seems like people have worried a lot
     about their content in the past, and a lot of people say to do so.
Rand: But so people in the past, they talked about things like, oh, we
should use like the Google Wonder Wheel. And we should use related
searches and put those words on our pages. We should use things like
synonyms that we get from the service. Well, how is the LDA stuff
different? Or is it? Like if I just do these things, am I going to
do great over here?
Ben: Well, I mean they are not going to be bad. But you can imagine that
     if you put in a whole bunch of synonyms for tachyon, it's not
     going to actually help clarify whether you're about astronomy or Star
     Trek. Right? Or say you're trying to discuss bark collars and you
     want to clarify that you are talking about dogs, as opposed to the
     stuff that wraps trees. You are not going to want to put in a whole
     bunch of synonyms for collars or barking; that's sort of weird and
     unnatural. You much more want to put in other related words to make
     it clear that you are talking about some sort of bark prevention
     system.
Rand: So, let's talk really briefly about the tool today. It doesn't do
      exactly this. Right? Instead, it gives us a score.
Ben: Yeah.
Rand: All right. Let's look at that.
Ben: Okay.
Rand: Now this LDA score, "tool" might be an overstatement. It's a Labs
      project. You can look and see it. It works. You can put stuff in.
      But we have a lot of really beautiful tools here at SEOmoz, and this
      is not one of them. So, it's not the prettiest thing in the world.
      But it does leverage the topic modeling work, and it uses this
      specific process, LDA, which we think is sort of better than some
      other ones, while not being as good as the sophisticated stuff
      Google does.
Ben: Almost certainly.
Rand: I enter a query up here. Something I want to rank for. I put in some
words here, and it will give me a percent telling me how topically
relevant it thinks this content here is to the word here. And it
will do the same thing like if I enter a URL down here, it will
populate this box with the content from that page.
Ben: Right.
Rand: So this gives me a rough sense. I can play around and see whether
      SEOmoz's LDA tool works, whether LDA scores seem to predict anything,
      so that I can rank better. So, I could look at the top ten results
      and be like, "Wow, I'm winning on links. I think I'm doing a good
      job of keyword usage. But boy, all these other people have much
      higher LDA scores than I do. Maybe I should try increasing that."
      Is that sort of a suggested application here?
Ben: That would seem very reasonable to me. But it is kind of new. No one
     has a huge amount of experience with it. So far, it seems like
     people have said that they've gotten a higher score and it has helped
     them rank, but that's very anecdotal. There's a very plausible
     reason why you would think that would work. But we're kind of
     on the bleeding edge here.
Rand: We're not trying to say that you can definitely enter something in
      here, use this, and boost up the rankings of all of your pages,
      that it will work perfectly or anything like that.
Ben: Yeah, exactly. But it seems very plausible that basically getting a
higher score helps you rank higher. And the tool lets you see
clearly what this kind of topic modeling is going to be able to
figure out. It sort of shows you the kind of connections that Google
certainly will be able to make in figuring out that pizza is related
to food but donkey is not related to food. So you can sort of
explore and see how this stuff works.
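The tool's internals aren't published beyond what is said here, but a plausible sketch of a "topical relevance" score is the similarity between the topic mix inferred for the query and the mix inferred for the page. All vectors below are invented:

```python
import math

# Invented topic vectors for a query and two pages; relevance here is
# cosine similarity between topic mixes. This is a guess at the shape
# of the score, not the tool's actual computation.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

query_topics = [0.8, 0.1, 0.1]   # mostly the "Star Trek" topic
page_topics = [0.7, 0.2, 0.1]    # a similar mix: scores high
other_page = [0.05, 0.05, 0.9]   # mostly "astronomy": scores low

print(cosine(query_topics, page_topics))
print(cosine(query_topics, other_page))
```

The pizza/donkey example above works the same way: a page whose topic mix overlaps the "food" topic scores high against a food query, and one that doesn't scores low.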
Rand: Cool. One weird thing that people have noted and the last point is
that this fluctuates a lot. Oftentimes, when I run it, it will
fluctuate one to five percent change. Like I'll hit go on the same
URL, the same content, the same keyword, and it will change one
percent to five percent. Sometimes it seems like it can go to maybe
seven, eight, or nine percent. A couple of people have reported --
we haven't been able to see them -- rare instances where it is more
than ten percent fluctuation. So, explain to me what is going on
there. What is the sampling that the tool does?
Ben: Right. So there's a very large number of possible ways that you could
     explain the document with topics. It could be about Star Trek. Or it
could be about astronomy and TV shows. There are lots of different
ways that you could explain the different word usages in there. So
we can't actually just try all of them and weight them by the
probability because that would take years to answer anybody. So
instead, we sample them based upon their likelihood and then we
average that. So, if you wanted to figure out are most people going
to vote Democrat or Republican this year, you might sample 100
people and you're going to conclude that 40 percent are going to
vote Democratic this year.
Rand: But then if you sample a different 100 people . . .
Ben: It will be a little bit different. It's in theory possible to come
     back and say 70 percent are going to vote Democratic this year, but
     it doesn't happen that frequently.
Rand: Got you. So you can essentially use this number. If I was really
      interested and wanted to get more precise, I could run it a bunch
      of times, get a bunch of different samples, and average those out.
Ben: Yeah. In the back end, we're doing it a bunch of times for you and
averaging them. So averaging it yourself on the front end as you go
isn't terrible.
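Ben's polling analogy maps directly onto why the score fluctuates from run to run. A small simulation (invented numbers, just to show sampling noise shrinking under averaging):

```python
import random

# Each run of the tool samples possible topic assignments instead of
# enumerating them all, like polling 100 voters instead of everyone.
# Individual polls wander a few points; their average settles down.
random.seed(7)
true_rate = 0.40  # the proportion we are trying to estimate

def one_poll(n=100):
    return sum(random.random() < true_rate for _ in range(n)) / n

runs = [one_poll() for _ in range(10)]
print(min(runs), max(runs))   # single polls fluctuate a few percent
print(sum(runs) / len(runs))  # the average lands near 0.40
```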
Rand: It's just a big use of our bandwidth.
Ben: Oh, yeah. It really helps our numbers of hits to our website.
Rand: Oh, yeah. I'm sure that's all correlated with rankings too.
Ben: I know like unique visitors. What's that?
Rand: All right. Well, Ben, we're excited about this tool. We really
appreciate you doing this research work. It's exciting and
interesting. I think we'll know more in the future, in the months to
come, whether this is really great and applicable for SEO or that it
turns out that maybe it's some other things causing this weird
correlation.
Ben: Absolutely.
Rand: Well, thanks very much for obviously building this and joining us.
And thanks to all of you for watching Whiteboard Friday. We'll see
you again next week.
Ben: This was a long one.
Rand: Very impressed that you watched it. We do appreciate it.
Video transcription by SpeechPad.com
[UPDATE by Ben (Sept 10th, 12:50pm PST): In the video I stated that "specific complaints people have come up with so far aren't very credible." This was directed at the claims, not the people who raised them, and I wish I had used the word "accurate" instead of "credible." My apologies to anyone who was offended. Credible people can say things I disagree with. Indeed, the back and forth over their concerns about the unweighted mean Spearman's rank correlation coefficient has been a useful context to explain exactly why we consider it a better statistic to use than commonly suggested alternatives.
Also, I noticed that Russ Jones did work to reproduce some of our findings. He used a different dataset and different methodology, emphasized good qualifications to keep in mind, and broke out competitive vs non-competitive which we didn't do.]
[ERRATA by Ben (sept 16th, 2:00pm PST): The blog post above reports the correlation measurement as 0.32. It should have been 0.17.]
Hi all, and thanks for the brilliant work you guys do. There is, however, something that I find annoying with your Whiteboard Fridays: I sometimes find it very hard to hear/understand. This seems to be due to the poor quality of the microphone you use as well as the acoustics of the room!
The more technical the subject, the harder it is for me (am I the only one??) to understand what you say. As you know, we all listen to your presentations on our lappies and the speakers are not very good either, so all in all, from source to output, there is quite a loss of quality.
So please, Rand: could you invest some of your hard-earned cash in a couple of higher spec lapel mikes maybe?
This would make it a much more enjoyable experience, especially for us foreigners :-)
Thanks, and keep up the good work!
@Audilo; same here. But it helps a lot to read the transcript along with the video, unless you get (____)
Same here too, quality used to be great until you changed it all. I just get loads of hiss now :(
That's why they've added transcripts :D it's indeed very helpful.
Yes, they have added the transcript, but it is called Whiteboard Friday. I want to see and hear the video.. with the whiteboard.. hell, it might as well be called Transcript Friday in that case
lol very good point!
I consistently have the same problem. Using my head mike / earphones helps. You could try that if you have one.
I fully agree with what Audilo said.
The audio is very difficult to hear, I had mine turned up so loud that when the video ending techno hit me I almost fell out of my chair. The content is great as always but an investment in some lapel mics and maybe even some noise insulation to reduce echo in that room would make these videos much more pleasant for me to watch.
same here audio issues
I like Ben's dance moves. If you put a rap-tune on it, I bet he pulls it off :)
BTW - For anyone with more interest in the technicalities of the LDA model, its advantages over LSI/pLSI/etc., and some of the ways Google might be using it, check out this video - https://www.youtube.com/watch?v=vgqWMGT9haY - from a Google Tech Talk by Amit Gruber. Skip to 11:03 if you'd just like to hear about the LDA stuff.
Gruber worked on some research related to text/topic analysis that was posted about on the Official Google Research blog here.
So... let's see if I understood the message:
Then:
Finally
On a user note... 25 minutes of WBF is quite long, but in this case very much needed. Anyhow, when talking about LDA, what I'd like to see next are real-life cases, as we are still almost in a theoretical phase (or not)?
That was a good summary as usual Gfiorelli.
Bad sound quality this week, can't hear a thing at full volume on my headphones :(
I didn't notice any change
Hi Rick-- we're working to improve the audio for future Whiteboard Fridays. Please stay tuned!
Oh my goodness spam in SEOMOZ! Shocking!
Just a quick note guys: this comment above was written just because there was an incredible spam comment before... not for other reason. So no reason to thumb down our innocent Michael :)
Has been removed. I was surprised. :)
I wasn't quick enough to kill that SPAM, sorry for that!
I was just surprised because this blog is always clean, which I really like. You must be doing a great job. It must be hard when your readers come from different timezones.
I did not mean to make a big thing out of it.
No big deal, we try to keep up on SPAM in the comments since it can really take away from the overall message of the post and comments. Feel free to PM me if you ever see one that I may have missed.
Casey
Great topic, Rand and Ben. By far the best Whiteboard Friday to date. It seems to boil down to what I believe Rand mentioned in an earlier post, which is that Web pages, as far as search engines are concerned, are text based. So it doesn't surprise me that this concept of topic modeling and LDA is an important factor in showing that your Web page (or text document) is more relevant than another.
I look forward to more on this topic and thanks for all of the great insights!
Ben, in your original presentation, you mention that your LDA corpus is 8 million pages from Wikipedia. I wonder if that might skew your results since wikipedia ranks so high for so many long-tail phrases. Have you tried removing them from the results and confirming that you still see relatively high correlation scores?
That is a valid concern. I had not looked at that, but did so just now.
On a dataset that is similarly constructed to what we posted but not identical, the mean coefficient is 0.326347. If I strip out every URL with wikipedia.org in it, I get 0.313532.
So it doesn't make much difference, but there is a slight drop. Wikipedia articles would generally score quite well even if they were not the corpus, so it is not clear to what extent this drop is because of bias versus because we are removing some of the easier-to-identify good results.
That was a very good question.
Good topic. I still believe natural content is the way to go. Manipulating word selection is fine but will it always work?
As it's all about being honest and transparent at SEOmoz, I would also like to join the ranks of those who were not too happy with this week's whiteboard session. I found myself distracted from the actually very interesting content by dance moves and bad audio. Whiteboard Friday is usually one of my 'it's almost weekend, let's watch some clips' treats - not this time..
Other than that, please keep up the good work guys. Don't know what I would do without you!
J
"the keyword startrek"? I think I'm on the holodeck right now, somebody turn this thing off
Hmmm, this page only gets an LDA score of 16% for "bad sound quality". Perhaps a few more people could comment on it?
Joking aside, definitely get some decent microphones, and I'd suggest some soundproofing for the ceiling to improve the acoustics. Love the post regardless. Not many online videos would keep me watching, but your content wins out over sound quality ;)
Thanks Andy.  We are indeed investing in dampening one of our rooms, and improving the overall quality of the sound.  Our new office is noisy!
Agreed!
Though LDA is not going to change a whole lot in terms of how we write, it is quite helpful for us to understand the concept.
I could see the LDA tool being really cool in certain situations, especially if they added the ability to enter a keyword and check the LDA "score" for each of the top 10-20 Google results for the term. Similar to heatmap analytics, I feel like LDA analysis is more of a top-down view of site structure and on-page layout/elements in general.
Not sure if this is the right place but I'll ask anyway.
I have a couple of customers who like to embellish on their product descriptions and use words that, while possibly resonating with their demos, are probably not so good for SEO. Although I've never really had a way to prove it. I'm wondering if the LDA tool can back this up. Here's an example:
"This ain't your Father's Barber Shop. This ain't your Mother's Salon. Don't come here looking for a shoeshine or a pedicure. We're not a chain store more interested in gimmicks than quality of service."
Humans know what the description above is talking about even without the rest of the context, but when the search engines see terms like "shoeshines", "pedicures", and "chain stores", does that bring the relevance to the keyword/phrase "barber shop" down? I claim it does but haven't had a way to prove it. Is that something LDA will tell me?
- Jeff Hancock
Man Jeff, you've got the best example of a perfect use of the LDA tool that I've seen yet. Your client issue would totally be solved using it. Compare the current copy of the site to optimised copy you create that strips out the shoeshine, etc. and I'd bet you'd be able to show your clients better LDA scores for your optimised copy.
See, this is a perfect example of my problem with this study! If Jeff uses SEOmoz's tool to generate an LDA score and tells them improving it will help their SEO, it will be one more case of SEOs selling the latest trend that may or may not have any impact on their rankings at all.
SEOmoz's conclusions to their data have been called into question in multiple places, but even if we accept their conclusions 100%, there's absolutely NO data to suggest an improved LDA will improve your rankings.
Selling this kind of crap is exactly why our industry is viewed in such a skeptical light. This response is also why I wish SEOmoz were more responsible with their studies. If you're going to present yourselves as SEO scientists, follow the scientific policy of peer review, etc before making bombastic claims and turning people like Jeff into possible snake oil salesmen.
Ben - in the spirit of TAGFEE, I'm going to say that I just don't agree with your perspective, but certainly am fine with providing you an outlet to voice it. Nothing I've seen so far suggests to me that we've done something irresponsible. We presented some research at our seminar, some people tweeted about it excitedly (though I think you'll find far worse overhyping for nearly every story that appears in the political/entertainment/technology field), we wrote a blog post that explained what we'd done, invited others to repeat it (several did so, even using other methodologies recommended by critics, and got similar or better correlation results) and provided a free tool.
We're certainly planning to do lots more of this in the future. If this is a process you don't agree with or don't like, you are free not to participate or engage. There is nothing compulsory about our work and our suggestions around every facet of SEO have always been "try this, if it works for you, great, and if not, no worries."
SEOmoz is a private software company. We're not a regulatory board or some officially sanctioned representative of the engines or the SEO field. We love providing material to those who enjoy, appreciate or find value in our work and we empathize with and harbor no ill will towards any who don't. I'd appreciate if you treated the situation proportionally in the future.
Rand, do you follow your mom on Twitter? Or Joanna Lord? Your comment seems to suggest that your employees did nothing to hype this tool when in fact they were some of the most egregious. (You can see my latest post on Skitzzo.com for my favorite examples.)
Also, I would be very interested to see the other replicated studies that you mentioned. Can you maybe include them as related links or something of the sort on one of your LDA posts?
The concept of peer review is very sound. But in order to have peer review there need to be "peers"; that is, other researchers who have done a similar study that disproves the study thesis being scrutinised. So far the complaints about the experiment design and statistical methodology haven't been backed up by a disproving study. It is very good to criticise SEOmoz about their lack of scientific rigor, but without citing a study that disproves theirs, it all sounds like a childish tantrum. Everybody is free to create their own model and prove SEOmoz wrong, and I would be very happy to read and hype the rebuttal if it shows me a more sound conclusion.
I completely disagree with you WeareSkitzoo.
Surely with this information at Jeff's disposal it would be irresponsible of him not to recommend to his client that they investigate the potential of changing the text. Of course he should not state that "this is fact" (after all, not much in SEO is) and roll it out across hundreds of pages, but a couple of copy changes here and there may make a massive difference. Of course, it may not.
I for one will be looking at some of our clients pages and seeing for myself if it impacts on rankings.
Certainly there could be extreme instances where a writer has gone overboard with irrelevant language and potentially confused a search engine about the topic/relevance of the page to the keyword. However, my suspicion (albeit untested given the tool/process' newness) is that most of the time, you'll get better value out of identifying topics and content you may not have covered that searchers are interested in.
Still - great thinking in terms of application. I'd imagine that this could be particularly more likely in news/media pieces, where creative reporters like to use flourish and diversity in their works rather than speaking plainly to a topic. However, I'd hate to see that creativity lost - hopefully it can simply be channeled in a way that's both productive for searchers/engines and still enjoyable to read.
This is great guys. I'm really excited to try LDA out on some of my content. It's a tough set of concepts but you've explained it all very well between Whiteboard Friday and the recent blog posts.
By the way, I love the SEO community. Only here will you hear "Megan Fox???? Is she related to Vanessa Fox?" I'd hang out with Vanessa over Megan any day too, fellas. If that means there's something wrong with us, then at least we're in good company.
Yeah, Megan is hot, but what's up with those thumbs? (Thumbs that look like big toes?) No thanks! lol
Is it me, or is all this stuff a bit over the top? Surely if you just write good content you shouldn't need to worry... or am I missing something?
I mean, it's interesting but I don't think it will change the way I write content.
It might not change the way you write content, but might go a long way to explaining why it doesn't rank as highly.Â
For example, this LDA finding suggests that similes are not a good idea. That is, the words that make up expressions like "like a hot knife through butter" and "as sweet as candy" are likely to make your language less topically relevant (for most topics).
Would you hang on to your similes despite the fact they might be hurting your rankings?
I have made pages in the past that have ranked well before adding any real text - just a few headings and images with alt tags and such. Then when I added 500 to 1000 words they dropped in the rankings. I thought maybe unnatural text or keyword stuffing was to blame, but I now think LDA may have had something to do with it.
Yeah, I have often found this, and it's the opposite in many ways of the things we normally say as SEOs.
Very nice video but I do agree video/audio quality was a little low. One question though, are the correlation stats like the .32 LDA and the link factors percentage posted anywhere? I think this would be a great thing to see.
Very interesting video. It reminds me of something one of my professors had told me about writing papers. She told me you should never imply anything: be very explicit and make sure the reader knows what you are talking about. We shouldn't just assume the engines know what we are talking about because we mentioned a keyword a couple of times. Since the same word can have several completely different meanings, as shown by the bark example in the video, it makes sense that the engines would try to look for other factors to ensure that the results they return are for the correct bark. Looking at the Bing commercials (in the commercials someone mentions a word and everyone around them begins spouting out "search results" that include the word but are unrelated), we can see that this is a thought in the mind of search engines. I still believe that content should be written for users, not engines, but this would make me think about some terms I would use. After all, you can still have natural content but include content for the engines as well. I was also pretty excited to hear how Dirichlet was pronounced.
Hey Rand and Ben,
I really like the idea of boiling down SEO to mathematical concepts when possible, as I'm a real logical and deductive person. I imagine the majority of people who really love this LDA tool are the ones that come from development or math backgrounds, while those who think it doesn't change anything come more from journalism and design backgrounds. I could be wrong though; I'm basing that off just a few opinions I've seen.
Anyway, there are a lot of people saying "this won't change how I write anything" and others claiming "backlinks are still the most important factor". I see two underlying questions that have yet to be really answered. Most likely they will be over time.
1) What makes this different than any other SEO "fad"?
2) How does this change what I am doing or need to do?
Only time and implementation will adequately answer the first question. Although it is doubtful that everyone will ever be satisfied by the answer, even if it shows some concrete evidence. The difference is this is based off mathematical analysis rather than "gut instinct" and "quaint observations". It's not that working off instinct and observations (some would call this experience) is a bad thing. It's just a different way of doing things. Some will prefer basing their actions on statistical principles and others on experience.
We have to wait to see if it's different than other fads. If people implementing it begin to show results, it might begin to calm down some skeptics. I think the bigger question is how we are supposed to use this information. Right now it does nothing other than reinforce what we've known all along - writing good quality content is an important aspect of SEO.
That's not game changing. If I write a well-crafted article about Star Trek, it's most likely going to have a high LDA score for "star trek". If I write a gimmicked article and don't really know what I'm talking about (e.g., who are Captain Kirk and Spock and the Gorn and the Orion slave girls...), it probably won't.
There are probably uses for this tool that we have yet to discover or understand. The biggest potential use for it that I see is the inverse: I submit my document and you tell me what it's about based on your LDA model. Then I can check my Star Trek article, and your LDA tool might say, "hey, I see what you're getting at on the Star Trek front, but it sounds more like you're talking about sickly looking enslaved women; you should shy away from the Orion slave girls and toss in a few more references to Kirk and maybe the Enterprise." That's useful.
As of right now, yeah, it doesn't change how most people do things. Unless they were writing low quality articles. The tool is still in the Labs, after all.
I'm excited to see where it goes from here!
Regards,
Matthias
1.) It isn't an "SEO fad" it is a mathematical means of valuing written context. We are just now openly discussing what application it has to the issue of search rankings.
2.) It doesn't change anything. The need for clear comprehensive writing pre-dates search engines. If you were good at writing copy for search engines last week you will still be good at it today. LDA is just a perspective for looking at your writing vs. your intended meanings.
LDA is a metric, not a tactic.
Re: I imagine the majority of people that really love this LDA tool are the ones that come from development or math backgrounds, where those that think it doesn't change anything are more of a journalist and designer background.
I think I can recognize myself in the pattern you have "intuitively" guessed from reading the comments, not being a math lover (or, better, it's math that has disliked me since I was attending primary school).
But that doesn't mean I'm not intrigued by the concept of LDA and its possible use in On Page Optimization Tools, because any tool that can make my day more productive is very welcome.
The fact is - coming from an editorial world - that topics like "context", "semantics", "signs" and so on were already part of my daily working baggage. If there are proofs that can show us in a mathematical/scientific way that they are important and - as long as further experiments don't say the contrary - have a relatively high correlation with rankings, then they are simply confirming what everybody could infer from the well-known Google suggestion "write relevant content for your readers". And if you come from the editorial world, you know that relevant means not only "important" or "unique", but also well written. That suggestion implies "don't write for us", which means "don't stuff with keywords", which means "write naturally", which leads to "write giving a context to what you want to express". Therefore, context was already there and important; simply, its importance was not explained with formulas.
That is why LDA is ultimately not going to revolutionize my life; it simply confirms to me that I am doing things right and helps me check faster whether my competitors are doing better or worse on that particular factor. Just for this, I welcome LDA and the future LDA tools: they are going to make my job less "painful", not change the way I do my job.
Nice WBF. I agree that it would be much more helpful if the tool suggested related (topical) keywords that would improve your score.
BTW - You missed comment SPAM from "Dan Dees" who slid in a nice little keyword ("SEO") pointing to his site.
Keep up the great work!
I think that on-page factors have been undervalued the past few years. At the end of the day, making a comparison of the LDA of two documents is going to be about as useful as making a comparison of readability scores. It is a good factor to use to measure your copywriter, but it isn't a silver bullet.
We are in an era where search engines really only have two types of factors: things a site can control (content, architecture, etc.), and things they can't (links, citations, competing sites, search behavior). Because Google is the site that spam built they will err on the side of things you don't control; hence penalties for link buyers.
One drawback to obsessing about LDA (just like with KW density) is that it will paint you into a corner of inhuman writing. And, inhuman writing, or overly complicated writing, will lose out to more human friendly (linkable?) writing.
Carlos - we may have some substantive disagreements:
1) Re: Readability scores vs. LDA scores - unless correlation with readability scores is quite high (certainly would be interesting to check, but I have doubts - though, to be fair, I didn't think LDA would be particularly high either), it would seem that, by definition, readability improvements won't have the same impact as LDA improvements. Of course, the correlation could be something else entirely, but I'm struggling to think what that could be if not some form of topic modeling.
2) Re: Inhuman writing - I'd hope not! While density would suggest some very weird overuse of terms, LDA should, primarily, just be suggesting terms/phrases you may not be using that you probably should (or some that you are using that could be confusing engines/visitors about the topic). I suppose it's possible that some writers might go overboard with this, but hopefully, it's more like the example I gave in the previous post - you've been writing about the Rolling Stones, but forgot to include "Keith Richards" - I'd think that would be good for search engines AND visitors.
3) My guess is that search engines don't just observe on-site factors like word usage/topics/keywords/etc. and links, but that social signals, usage signals, searcher behavior patterns (perhaps branded searches) and maybe even manual quality ratings (internal and 3rd party - especially w/ local) are all making their way (or will) into ranking algorithms. Some of this stuff SEOs are good at observing; others are tougher.
Ok I've played around with this labs tool a little and would like to suggest a reversal on how the tool works. Currently, the tool allows you to enter text or a URL to be compared against a specific word (similar to KW density) resulting in a percentage grade.
Instead I would like if the tool allowed the user to enter specific text or a URL and resolve with an observation on what words correlate to the document.
For example, if I enter a CNN story on the trapped miners in Chile (https://www.cnn.com/2010/WORLD/americas/09/12/chile.miners/index.html?hpt=T1) I would get three keyword phrases according to the SEOmoz Term Extractor Tool like:
I’m proposing that the LDA tool extract these terms and quantify the result with percentages.
Now my question is this: Are the percentages currently coming out of the LDA a mix of Google results and computation or pure statistical computations? If it’s purely computational this would account for the growing differential when running the tool again and again for the same document. As more people use the LDA the base numeric is going to change. If LDA was computational plus referential against current Google results you would have proof against the percentages and a significant reason to use this tool. Think of it as a sandbox for testing your content for ranking.
This is definitely on the roadmap, but it's significantly more work to build. Hopefully we'll have something in the near future.
Though everyone says that inbound links are very important, it is the on-page content and optimization that gives you the initial rankings, and with this concept of LDA it is shown that the content should have true quality with great contextual info, even if the keywords are not repeated.
I think it's time you got a couple of wireless mics for the videos... it is a little tiring to listen to you guys because the mic is far from your mouths...
We're working on it! Please expect improved quality for this Friday.
I've been working with the LDA tool to see what wins I can find for clients, and was surprised to see what looks like a keyword density aspect to the tool. Specifically, if I take a block of text from a client site that yields (say) a 71% LDA score with 459 words, then sort the words in the sample set, I still see roughly 71% LDA with those 459 words. (So this affirms that phrases are ignored and only single words are being used.)
But if I test LDA with just the 252 unique words from the set, the LDA score drops to 53%.
This was a surprising result. What's the nature of keyword density as a factor for LDA?
Any natural-looking text (that can be read by humans) will need to have prepositions, articles, pronouns, verbs... a block of text based on "keywords" wouldn't hold much meaning at all... perhaps the tool addresses this fact and that's why your total LDA % goes down?
It looks like the LDA tool removes "stop words" already, leaving only "interesting" words. Run it against a URL and in the box you'll see the words they actually used for LDA. Here's a concrete example of what I'm seeing:
Query: star trek
19-20% with Document: kirk mccoy
17-20% with Document: mccoy kirk
So word order seems irrelevant.
12-14% with Document: mccoy kirk mccoy
Adding more mccoy seems irrelevant, even negative.
53-56% with Document: kirk mccoy kirk
But more kirk = higher LDA. This makes sense, but I don't see this aspect described in the LDA info from SEOmoz so far.
56-60% with Document: kirk kirk mccoy
63-68% with Document: kirk kirk kirk mccoy
3 kirks to one mccoy seems best for LDA %.
34-36% with Document: mccoy kirk mccoy kirk
38-48% with Document: kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy kirk mccoy
98% with Document: kirk kirk kirk kirk kirk kirk mccoy uhura chekov
98% with Document: kirk mccoy uhura chekov
It's like I need more words in the target LDA space, even if they're repeats, but unique words from the LDA space did best.
In a separate test with real content from a web site, it seemed that having a "natural" keyword density helped. Sorted unique words ranked the lowest, doubling all unique words was higher, but a natural keyword density did better.
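The pattern in these experiments is consistent with how a topic-mixture model behaves. A toy sketch (the two-topic numbers below are invented, and the averaging step is a crude stand-in for real LDA inference, not SEOmoz's actual model) shows why repeating a strongly on-topic word can raise the cosine score:

```python
import math

# Hypothetical per-word topic distributions over two topics:
# [P(star-trek topic), P(everything-else topic)]
word_topics = {
    "kirk":  [0.9, 0.1],
    "mccoy": [0.6, 0.4],
}

def topic_mixture(words):
    """Estimate a document's topic mixture as the average of its words'
    topic distributions (a rough stand-in for real LDA inference)."""
    mix = [0.0, 0.0]
    for w in words:
        for i, p in enumerate(word_topics[w]):
            mix[i] += p
    return [m / len(words) for m in mix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

query = [1.0, 0.0]  # pretend "star trek" maps entirely onto the first topic
short = topic_mixture(["kirk", "mccoy"])
stuffed = topic_mixture(["kirk", "kirk", "kirk", "mccoy"])

print(round(cosine(query, short), 3))    # mixture pulled off-topic by "mccoy"
print(round(cosine(query, stuffed), 3))  # extra "kirk" sharpens the mixture
```

Under these assumptions, repeating "kirk" shifts the inferred mixture toward the Star Trek topic and the cosine score goes up, which mirrors the kirk/mccoy observations above without any explicit keyword-density term in the model.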
I think Ben stated that keyword stuffing may give you a higher score but doubts it will give you higher rankings. We have to use a bit of common sense here.
Yet another fantastic post. I was lacking knowledge in this field. Thanks so much for the information you covered. It enlightened =) me, to say the least.
It sounds fine to me - a little quiet but just cranked it up and it's fine. Thanks for the video on a complicated subject - doesn't seem so complicated anymore.
Good WBF, I do understand the issue better now.
I just check pro tool section it look much more organized. Great work seomoz.
Interesting theory, but as benniog said, I don't think I'm going to change the way I write soon. The hardest thing in my opinion is to figure out what words to use to improve your score. The Star Trek example is quite clear, but in real life there are many more subjects to take into account.
In short, an interesting theory which is hard to implement. But I'll keep following the research you guys are doing on this subject. Are there any plans to include foreign languages in this tool?
When I first read 25 minutes, I got scared. I thought I would be dealing with some advanced stats or something. But there was nothing like that. However, I am still not sure how to improve the LDA score. All I get is: write for your audience and not for search engines. Should I copy the words of higher-LDA-scoring pages and use them in my copy?
Maybe simply be aware that the way they write content is probably a factor in why they rank better... but just one among many others.
I agree... I think the point is to realize that context matters, and it is harder to "fake" with keyword stuffing. I cross referenced word usage from a client's site and competitors about 4 months ago and found that my client was missing some very relevant keywords... we included them and found our long tail traffic increased. The long tail seems to be another element of this discussion that perhaps should get more airtime (especially with Google's more recent changes).
Any chance you're going to publish the keywords and weights (are there weights?) of the topics you find? It would be way cool to kind of reverse it and show how you can make the relevance stronger rather than trying to guess what the cosine and topic words are.
It seems like the problem with LDA is the same as with LSI: if you have a large enough sample set you can build more topics, but it's all based on sample size. The same way you run out of data to compute recommendations for algorithms like Facebook's or Netflix's, you can only get a limited number of words back to use, and for most "topics" there may not be enough data in the sample to build out a good list.
Could "content is king" be causing some of this stuff to kick in, because there are more words that Google can look at for these more advanced metrics built off its giant index of content? Natural writing will show what the topic is about (that's the whole point, right?).
We'll certainly try to do more publishing as we get more sophisticated with the tool, process and testing. At one time in the past, Google had a labs tool where you could plug in a site and it would give you the topic/category it felt was most applicable. That could be something for us to rebuild - I was always frustrated they took it away, but my understanding is that much like their link data, they felt there was too much potential for abuse/manipulation by spammers.
Google provides this now as part of AdPlanner: www.google.com/adplanner
Awesome! Hadn't realized it was powered by the same backend, but it definitely looks more beefed up (deeper categories, etc.). Here's an example for SEOmoz.org. Thanks V!
Same here. Very difficult to follow the discussion on the video.
Transcript saves the day.
Ideally a good quality video sound and transcript would be best.
Let's not get distracted from the LDA issue, though.
At least we get to hear/read the LDA discussion, thanks to SEOmoz's transparency.
Ben, I'm confused about how keyword density in documents changes LDA %. Examples:
Keyword: star trek
LDA 20% with Document: kirk mccoy
LDA 67% with Document: kirk kirk kirk mccoy
I don't see this effect described in the LDA videos & papers I've looked at. Is this an artifact of your implementation, or an expected result with LDA?
It's not an artifact per se - I'd think it would be expected. A document or piece of content that mentions the word "Kirk" many times is more likely to be about Star Trek than one that does so only once. The key isn't that the tool perfectly measures your exact usage, but rather that you can see strengths/weaknesses of content blocks from a topic modeling perspective.
Keyword density isn't a metric the engines use and it's not one we use, either. It's just that having 7 words of content where a topically relevant term is employed multiple times is more likely to be "relevant" than a content block that's very small and contains fewer relevant words.
I think this is why KW Density persists as a myth - some SEOs will "improve" their KW density, see their rankings rise, and assume cause and effect. It's a big part of why the myth of this metric is so hard to fight, because technically, sometimes improving it might actually help (but that doesn't mean it's a good way to measure things or an optimization tactic to pursue).
I think I'm trying to scratch the surface of the cosine similarity process. My understanding is that the LDA tool looks at the search phrase words to figure out what topics are related, then looks at the document to see what topics it covers, and then assigns a % relatedness.
What's surprising to me is that if I have "kirk" in a document once, I should have the topics related to that word covered. So more kirks in the document shouldn't cause LDA to go up.
So I'm missing something here. I'm trying to build a tool to help find keyword blind spots from looking at LDA % across sites for a given keyword, but I can't tell whether I can get away with using unique words or if I have to pass through repetitions in order to keep the keyword density natural.
Shout out to whoever made the transcript btw - it's like you just picked the hardest words you could find for them to decipher. Good job :)
Oh man. The best part was when Megan Fox came up and it took a couple of seconds to get her name - Ben could not have been less enthusiastic about her!
Just goes to show how hard these guys work and have no time for petty Transformers movies haha.
LSI, LDA - it all sounds like some kind of new SEO drug :)
I'm a little bit confused about how to improve the LDA score.
Can you guys help me?
The basic concept is to do a better job making the words and phrases describe the keyword/content/query you're targeting. For example, if you are trying to rank well for "The Rolling Stones" but describe them only in passing and focus primarily on the subject matter of a Peanut Butter and Jelly sandwich you ate once at their concert, that may be less topically relevant than including information and words about the band's members, history, records, songs, concerts, style, etc.
While the LDA tool we've built is certainly an imperfect and imprecise model, it may help you to measure the degree to which your content (or that of others) is "topically relevant" to a particular term/phrase.
Firstly, I would say that you shouldn't focus on "improving the score" and should focus on being more topically relevant. There is no evidence to prove that a higher LDA score will increase your rankings. There is evidence that suggests that the score correlates well with the SERPs. EDIT: so, in theory, a higher score will lead to higher rankings.
Use the tool on your site as well as on your competitors'. Cross-reference relevant terms that appear often on your competitors' pages. Assemble a list of words you are not using on a specific topic and try to fluidly add them in. Blog posts may be good places to do this.
This help?
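The cross-referencing step above could be sketched as a simple set difference. Everything here is illustrative: the stopword list is a made-up stub, the example pages are invented, and fetching/parsing real pages is out of scope.

```python
import re

# Hypothetical minimal stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this", "are", "was", "be"}

def terms(text):
    """Lowercase a page's text and keep the non-stopword terms."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}

def missing_terms(my_page, competitor_pages):
    """Words competitors use for the topic that my page never mentions."""
    competitor_terms = set()
    for page in competitor_pages:
        competitor_terms |= terms(page)
    return sorted(competitor_terms - terms(my_page))

mine = "Kirk is the captain in Star Trek."
others = ["Kirk, Spock and McCoy serve aboard the Enterprise.",
          "Star Trek episodes feature Spock and the Enterprise crew."]
print(missing_terms(mine, others))
```

The output is a candidate list to review by hand, not words to stuff in blindly; the point of the advice above is to add only the ones that fit fluidly.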
Tremendously helpful videos.
I understand that synonyms are irrelevant for LDA.
What about "related searches" and the "wonder wheel"? They seem like good tools for this.
Related searches can change context or can be ambiguous. How I am attacking this: I take my keyword, make a list of terms that reinforce the context, then make a list of terms that may cause confusion or be ambiguous. Then I try to use the right list.
When writing for humans we try not to confuse and to state our context; well, we need to do this for machines also, as they can be easily confused and manipulated.
Interesting post about LDA and SEO.
I will be looking into this topic further.
With all due respect to the SEOmoz staff... aren't you guys putting too much focus on this LDA stuff? It almost feels like you are trying to promote yourselves/your tool based on LDA... (sorry, I am a bit candid)
Considering most SEOs are marketers, not mathematicians, I think more focus on natural, good-quality content that is both user and search engine friendly may be a better strategy to grab attention...
other than that you guys rock :)
Hmm... I guess we feel a bit differently about this. LDA seems like an interesting topic modeling system and something SEOs generally aren't familiar with, so it requires a good deal of explanation (and I don't know that we've done a comprehensive job yet, though we're trying).
It's also a bit odd to say we're "selling" ourselves with it - it's not yet part of any of our paid tools. There's the free research and the free tool for testing, but we've yet to productize it in a monetized way. I think we'll be holding off on that until we feel confident that it really is valuable for improving rankings.
If there's things we're doing that are creating this impression, let me know. I certainly don't want this being misconstrued.
Rand, you know as well as I do that the attention being generated by this (as well as links of course) improve your bottom line. Just because this tool is free for now, doesn't mean you're not benefiting from it financially.
Ben - are you suggesting we should be a non-profit? Or act like one? My sense (which, admittedly, could be biased) is that across the SEO market, we provide more free resources (tools, guides, blog posts, research, presentations, etc.) than any other company our size. Granted, this has marketing benefits and is a marketing channel, but is it your opinion that it's an illogical, unprofitable or evil practice?
Rand, I'm not at all saying you should be non-profit or that producing free tools is evil. My comment was in response to your statement "It's also a bit odd to say we're "selling" ourselves with it."
Both you and Ben (in comments elsewhere) have acted like there's no monetary benefit to producing this tool and hyping up the results as a big breakthrough in SEO.
That's simply not the case as all the buzz surrounding this and other studies increases brand awareness, generates inbound links, and generally improves your bottom line.
I have no idea whether this incentive influenced your conclusions in any way (not even close to a stats guy) but I think we as a community should examine the study as we would any other that came from a company with a financial incentive built into the outcome.
That skepticism is warranted is a fair point, but I think you're backing down substantively from the confrontational and negative connotation of your earlier remarks. That's great if it's the case, but then it sounds as though your disagreement/issue is far less systemic (and one on which we agree).
You just sound jealous. Fact is, this brings us one step closer to understanding a system that, each day, becomes more complex. Whether this is a huge step or a half step is yet to be seen. However, complaining that SEOmoz benefits from a job well done when they do their job is just plain dumb.
Rand - promote away, man. You guys spent the money, you spent the time, this is your frikkin' website for crying out loud. Asking for and receiving feedback on a tool you make available is one thing, but to be criticized for capitalizing on your own product, on your own website... that is where I personally would draw the line. Maybe you ought to change your .org to .com. SEOmoz is not a charity, people.
The commenter that Rand responded to criticized them for selling themselves with it, not me. I have no problem with them selling themselves with everything they do; my issue (in this limited sense) is with Rand acting like they have no financial investment in this. They've spent time and money on this, and there's money to be made if they can brand themselves as the company whose tools reverse engineer the Google algo.
Just because it's not a part of the paid tools, doesn't mean they're not "selling themselves with it."
WeRASkitzzo,
Yes, free tools are types of promotions. Yes, SEOmoz is benefiting from the attention. That is obvious. Rand was very upfront about his plans for making this tool part of the PRO package in the future. In a post last week, he wrote: "We're leaving the Labs LDA tool free for anyone to use for a while, as we'd love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only."
I think you are saying that it is disingenuous to act as if SEOmoz has nothing to gain from the discussion, even if the tool is currently free. Fair point, but couldn't that be said for anything they write? They have a blog to promote themselves, by extension every post is promotional (albeit, their style is educational). I would like to think that this community is rational and will skeptically use this tool and determine its usefulness for themselves.
If I were going to design a search engine, I would include topic modeling in it. Who wouldn't? If you want the most relevant results, you have to make sure the results are on the right topic, and with Google Instant, topicality is even more important. The tool isn't perfect, we know that. I for one am interested to see where the LDA tool goes in the future.
I've used the LDA tool on my client's pages and have identified some relevant phrases that we were not using. Including these terms in future blog posts seems like a good strategy for the long tail.
Also, I have no illusions here: SEOmoz is in business to make a profit. That is why they do everything they do. Offering a new tool for free to seek feedback is a wise business decision, as it will improve the tool faster. The tool can be turned into an earning proposition later, once its value is established (if it proves valuable). It is valid to question the motives of anyone who sells solutions to problems you didn't know you had. I do think this discussion is very relevant to SEOs, though. We all know that content matters (keywords, etc.), but I think it is relevant to note that CONTEXT also matters (topicality, etc.).
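The point that topicality matters alongside exact keywords can be illustrated with a crude sketch. This is deliberately not LDA: it is plain cosine similarity over bags of words, used as a stand-in for "how topically close is this page to a topic?", and all of the word lists are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A toy "topic" and two pages, as word counts:
topic = Counter("seo links rankings keywords content".split())
page_on = Counter("content and keywords drive seo rankings".split())
page_off = Counter("my cat enjoys long naps".split())

print(cosine(topic, page_on) > cosine(topic, page_off))  # True
```

A real topic model assigns words probabilities under latent topics rather than comparing raw counts, but the intuition is the same: a page can use a keyword and still score poorly if its surrounding vocabulary belongs to a different topic.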
I think the correlations warrant the hype. To assume that good, well-written content shows the same correlation may be folly, but having said that, maybe it's something to test: do good content writers get good LDA scores and good rankings? I don't know how to test that, as it would mean someone has to judge what is well-written content.
I'm trying to understand what they say, but I have to read the transcript to clearly understand what is going on. Yet when I'm reading the transcript, I miss the drawings and other useful stuff on the whiteboard.
Subtitles, maybe?
I do agree with this post a lot. Thanks for posting. Although I had a little problem with the sound in the video, the transcript helps a lot. I always believe that content is still the best way to build a good reputation online, which in return can bring you tons of visitors, though it's not an overnight success. On the other hand, SEO and other techniques will be an added factor in your site's success, but make sure that you are on the right track, or else everything will be futile.