The search engines are considerably smarter than we often give them credit for, and one of the ways they've become so "intelligent" is through the data provided to them on the billions of web pages they crawl. Today, I'd like to walk through a visual tour of the search engines' process and abilities in the field of semantic analysis and understanding.
Googlebot crawls billions of pages across the web, indexing an amount of text equivalent to thousands of times the size of all the world's libraries combined. With this massive amount of data in its index, Google can start to form assumptions about the incidence and frequency of particular terms & phrases.
One of Google's simplest powers is the ability to calculate the relationships between two or more terms/phrases. In the example above, Google has recognized that Spain & Iberia might be connected semantically. If we recall Dr. Garcia's lessons on term co-occurrence, we can see a simplistic way this might be happening.
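As a rough illustration of what a document-level co-occurrence calculation might look like, here is a minimal sketch using pointwise mutual information over a tiny invented corpus; this is just one plausible association measure, not Google's actual implementation:

```python
from math import log

# Toy corpus: each "document" is the set of terms found on a page.
docs = [
    {"spain", "iberia", "travel", "madrid"},
    {"spain", "iberia", "history"},
    {"spain", "flamenco", "travel"},
    {"portugal", "iberia", "lisbon"},
    {"france", "paris", "travel"},
]

def cooccurrence_score(term_a, term_b, corpus):
    """Pointwise mutual information between two terms, based on
    how often they appear in the same document."""
    n = len(corpus)
    p_a = sum(term_a in d for d in corpus) / n
    p_b = sum(term_b in d for d in corpus) / n
    p_ab = sum(term_a in d and term_b in d for d in corpus) / n
    if p_ab == 0:
        return float("-inf")  # the terms never co-occur
    return log(p_ab / (p_a * p_b))

print(cooccurrence_score("spain", "iberia", docs))  # positive: related
print(cooccurrence_score("spain", "paris", docs))   # -inf: unrelated here
```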
Obviously, Google has even more sophisticated ways of breaking down and analyzing an individual page, or even sections within a page. They could, for example, form tighter connections between words/phrases that frequently appear close to each other in sentences or paragraphs. As these techniques become more refined and advanced, Google could develop something approaching artificial intelligence with regard to semantic connections.
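The proximity idea could be sketched as a sliding window over the text; the window size and the crude stop-word filter here are illustrative assumptions, not a known Google technique:

```python
from collections import Counter

def windowed_pairs(tokens, window=4):
    """Count term pairs appearing within `window` tokens of each
    other; pairs that recur close together hint at tighter semantic
    connections than document-level co-occurrence alone."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if word != other:
                counts[tuple(sorted((word, other)))] += 1
    return counts

text = ("spain and portugal share the iberian peninsula . "
        "madrid is the capital of spain . "
        "the iberian peninsula lies in southern europe .")
tokens = [w for w in text.lower().split() if len(w) > 3]  # crude stop-word filter
print(windowed_pairs(tokens).most_common(3))  # e.g. ('iberian', 'peninsula') twice
```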
Pretty impressive for a robot & a scary, mechanized spider, eh? So how does this apply to the practice of SEO, content authorship, or website building?
There are several hypotheses I've formed about how to optimize based on this knowledge:
- Build a Semantically Intelligent Site Architecture
Since the search engines have some data about what terms are relevant and related to one another, it can't hurt to use the most logical system possible to create an organization chart for your site's content. Usually, common sense works best, but you could always fall back on the co-occurrence calculations if you need to find out whether that chicken stock recipe belongs under "french cooking" or "american classics" (see the sketch after this list).
- Create Documents that Use Relevant Terms/Phrases
If you're targeting the term "mortgages" but most of your content is about rental properties, you might find it valuable to rework that content to connect with more relevant material.
- Get Links from Semantically Relevant Pages
Term co-occurrence can be a great way to find out whether a link from a page about surfing will be semantically beneficial to your page about snowboarding.
- Understand Why That Page Might Be Ranking
Sometimes, when we see a page ranking and run a few checks on the strength of the domain and the links pointing in, we scratch our heads thinking, "How the heck is that ranking above my page?" I've experienced this queasy feeling plenty of times and found, after some careful analysis, that many of the pages pointing to my domain and page weren't nearly as "connected" as the pages linking to my competitor. While links in number and authority are very powerful, there's little doubt that semantic connections and topical relationships play their part, too.
- Get a Sense of What the Future Holds
In a few years, will Google be smart enough to identify link "intent"? Could they have the semantic processing capability to extract psychological cues from the sentences and paragraphs surrounding a link? Would this help them determine link weighting and link trust? Very possibly.
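For the site-architecture point above, here is a hedged sketch of how co-occurrence scores might settle the chicken-stock categorization question; the category signature terms and every number are invented purely for illustration:

```python
# Hypothetical co-occurrence scores (e.g., from a PMI calculation like
# the one sketched earlier) between a page's key terms and the
# signature terms of each candidate category. All values are invented.
category_terms = {
    "french cooking": {"bouillon": 2.1, "mirepoix": 1.8, "consomme": 1.6},
    "american classics": {"casserole": 1.4, "barbecue": 0.9, "gravy": 1.2},
}

page_scores = {"bouillon": 1.9, "mirepoix": 1.5, "gravy": 0.4}

def best_category(page, categories):
    """Assign the page to whichever category's signature terms it
    co-occurs with most strongly."""
    def overlap(cat_terms):
        return sum(min(page.get(t, 0), w) for t, w in cat_terms.items())
    return max(categories, key=lambda c: overlap(categories[c]))

print(best_category(page_scores, category_terms))  # "french cooking"
```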
I don't use a heavy dose of co-occurrence calculation in most of my work, and it's actually rare that the topic comes up in a consulting contract, but I do believe that the more we know about search engines and the more we can see the machine at work when we look at the results, the better we'll be as SEOs.
I'd love to hear if anyone has additional uses for this kind of data or other relevant semantic analysis. One thing I've been wondering about on this topic is how search engines might use the statistical probability of word/phrase occurrence in rankings. Dr. Garcia touches on that more specifically here.
At Fortune Interactive, where I work, we take semantics quite seriously. If you were at SES NY, you may have heard Mike Moran from IBM mention our proprietary software, SEMLogic. In fact, Rand, you blogged about SEMLogic a couple years ago.
As a copywriter, I use a very specific part of the system called Thememaster (whether that name sticks or not remains to be seen). This tool generates reports of words commonly found in the search results for a given keyword or keyphrase.
Mike Marshall, the creator of SEMLogic, recently taught me that the higher the score for this metric, the better. In other words, while you can have "too high" a keyword density (because the search engines don't like keyword stuffing), you can use as many words from the report as you want and not be "penalized."
Having said that, I still choose those words very carefully. Oftentimes the report returns words such as "policy," "rights," and "information," which are often found at the bottom of every web page for legal purposes.
What does that mean? I believe (and I could be terribly wrong) that there is still a lot of room for content writers to determine the relevancy of pages. Yes, use keywords. Yes, use semantics, but also include your own semantics.
SEMLogic offers many other metrics besides the Thememaster tool that our Search Marketing Specialists use. One of them involves inbound linking. I think that takes into consideration semantic analysis, but I'm not completely sure.
But it does tell us which metric appears to be the most important for a given set of results. And that gets into things like Competitive Intelligence and Algorithms (or, as I like to call it, Math). And suddenly I'm thinking of my next Starbucks run.
Thank you for quoting my work. I usually don't post here, but I feel like I should this time since some above don't seem to understand what is/is not semantics or co-occurrence.
There are new advances in the areas of semantics, co-occurrence theory, and brain association activities that most of the above posters are not aware of.
I have sent Rand some complimentary information that might help clear things up regarding these topics. It is up to Rand to follow through on this, if he finds the time.
Regarding semantics, indeed there is room for semantics, not only on the SEO side, but on the search engines' side.
Regards
Dr. Edel Garcia
Dr. Garcia - thanks for the email. I'll try to connect with you on that tonight and, hopefully, share it with the blog readers, too.
Please do Rand. I know I would enjoy the information and I'm sure I wouldn't be the only one.
And thank you, Dr. Garcia, for taking the time to help clear things up. It's greatly appreciated. I look forward to Rand sharing it with the rest of us.
I don't usually go in for 'me too' posts.
But me too!
Even if this isn't a large part (or even a part) of the day-to-day work of many people here, I find this kind of theoretical stuff fascinating.
Semantics matter more than many people know... I think our industry is quite obsessed with "keyphrases" (that is, having a string of keywords in a specific, uniform order). People forget that juxtaposing relevant words and phrases is just as important as stringing them all together and having them in title tags, H1 and H2 headings, etc.
Take a look at the long tail searches your sites have received... I've seen people arrive at some of our sites and our clients' sites with search phrases that incorporate content spanning multiple paragraphs and pages. Simply sticking profitable words together isn't the final destination of content creation in SEO. In fact, that just makes your copy sound bad, which will have people hitting "back" no matter what they've searched for.
I've always believed that as the engines get better at recognizing semantic relationships between words, it's going to allow more people to write their content in a more natural style -- as they should have been doing all along. Write for people and count on the search engines to adequately approximate the understanding those people have when they see your text. They may not be great at it yet, but if you write too much on the side of the SEs' weakness instead of the readers' strengths, you may get rankings, but you won't get conversions.
I have a question about the last of the illustrations: in order to activate their semantic analysis powers, don't the bot and the spider need to touch their power rings together?
I was thinking much the same reading the post. And it's only a good thing. We should always be putting visitors first, and the search engines would like us to do just that. I don't think they're there yet, but in time I think they will get better at recognizing semantic relationships, and natural writing will be the thing for both people and bots.
While current accuracies and abilities may be highly questionable, I think there is no question that this is the way to the future.
But no doubt, just like communication between different people, especially once you add in regional variations, dialects, slang, or even different base languages, this will be a challenge.
When you add this into the mix of everything else, it will give the engines considerable power to cut through the BS, spam, and promotional links, copy, etc.
I think it will be much easier at that point for the SEs to pick out what doesn't belong, even when all of the supposedly "correct" words were used, and, even more important, to identify the better sites even when the words haven't been used, because they understand the underlying meaning of what is being searched for.
This will probably become even more true as average query length continues to extend from 3 to 4 to 5 words or beyond. Maybe we'll even see more results served up based on semantic connections, even when the specific words searched for aren't found in the results.
What it means for SEO will be the wild-card question. Maybe it will make things easier, and sites that focus on developing great, relevant, topical content for users will win out. Then again, maybe it will be even harder to rank smaller or less authoritative sites, as the most dominant sites will own the SERPs for their topics regardless of keywords.
I'd imagine that as semantic indexing and relationship technology becomes better and more prevalent in the algo, authoritative niche sites will be very hard to beat...and Wikipedia will still suck all of the fun out of the SERPs.
When various Wikipedia results rank above people's own fanzines, I'd say this has already happened :(
...and Jane gives us an insight into her personal search history ;)
I was always a Michael Owen fan, actually :) His site is showing up before The Wiki for me...
Not for me - Wikipedia top. Maybe when he's back from injury.
Jane - you do realise that the large proportion (I assume) of your audience who are from the US don't have a clue who we're talking about, right?
Fun eh?!
I have to agree with Michael Martinez that Google's search results reveal little or no semantic analysis. This is for very good reasons. Firstly, any meaningful semantic analysis of a corpus as large and varied as the web is computationally prohibitively expensive; secondly, the jury is still out on the value and usefulness of such an analysis.
Most of the interesting research results in semantic analysis are from relatively small homogeneous collections and we are a long way from perfecting the current algorithms such as LSI to be usable or even useful on the collection of all web pages.
For SEOs, a more interesting line of enquiry might be Google's entry into personal search, because in a sense this transcends semantic analysis. If you analyze the way an individual user uses search terms, and monitor and continuously iterate and refine the results, then you automatically have a kind of individual semantic vocabulary.
The advantage of this approach is that as well as having a computationally simple feedback loop you can possibly achieve a higher degree of relevancy for the user without having to make semantic guesses based on billions of web pages. If you want to make guesses you can simply aggregate the personal search data.
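Purely as a sketch of the kind of feedback loop I mean (a toy re-ranker with invented numbers, not how Google's personalized search actually works), the core idea is computationally trivial:

```python
from collections import defaultdict

class PersonalizedRanker:
    """Toy per-user feedback loop: clicking a result boosts its terms
    in the user's profile, and future rankings are nudged toward
    that accumulated individual vocabulary."""

    def __init__(self):
        self.profile = defaultdict(float)  # term -> learned weight

    def record_click(self, result_terms):
        for term in result_terms:
            self.profile[term] += 1.0

    def rerank(self, results):
        # results: list of (url, base_score, terms); the 0.2 blend
        # factor is an arbitrary choice for this sketch.
        def personalized(result):
            url, base, terms = result
            boost = sum(self.profile[t] for t in terms)
            return base + 0.2 * boost
        return sorted(results, key=personalized, reverse=True)

ranker = PersonalizedRanker()
ranker.record_click({"snowboarding", "gear"})
results = [("a.example", 1.0, {"surfing"}),
           ("b.example", 0.9, {"snowboarding"})]
print(ranker.rerank(results))  # b.example climbs after the click
```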
Not directly related to search engine optimization or even semantics, but Google used this data to enter a machine translation competition a few years back and not only did they win, they blew the competition away.
CNet talks about it and a quick search will get you many more summaries.
Great article and great illustrations. I always think about the huge servers Google needs to store all that information ;)
Coincidence never ceases to amaze me - I've just come out of a presentation on the future of search where we discussed the possibilities of LSI, but also the data the engines could use from social bookmarking sites, etc., which they own.
I may have to tack a link to this article onto the presentation.
Better late than never, I guess - a good way to know if a word does have an LSI partner is to do a ~keyword (yeah, include the little symbol thing). This bolds all the keywords Google finds similar to the original keyword you searched for; however, it's still a search, so you won't always find the related words.
Well, a discussion of what is/is not LSI is given in my tutorial series on SVD and LSI, wherein it is demonstrated that there is no such thing as "LSI-Optimization" and "LSI-Friendly" documents. The many tricks from those firms that claim to provide LSI-based services are debunked. You can start with tutorial #1
https://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
If you want to skip all the theory and math, just go to tutorial #5 of the series and try to grasp the figures. There is a reason why it is not possible for SEOs to game or predict the redistribution of term weights coming from a valid SVD-based LSI algorithm/implementation. It all boils down to a co-occurrence phenomenon taking place in the corpus (the entire collection).
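To see the co-occurrence point concretely, here is a minimal numpy sketch of the SVD step at the heart of LSI, using a toy term-document matrix for illustration only:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
terms = ["spain", "iberia", "madrid", "surfing"]
A = np.array([
    [1, 1, 0],   # spain
    [1, 0, 1],   # iberia
    [0, 1, 1],   # madrid
    [0, 0, 1],   # surfing
], dtype=float)

# Full SVD, then keep only the top k singular values (rank reduction).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_k, 2))
```

Note how the rank-reduced matrix assigns nonzero weight to term-document cells that were zero in the original; that redistribution is driven by co-occurrence patterns across the whole collection, which no single page author controls.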
Hate to promote this, but the upcoming issue of IRWatch - The Newsletter provides a visual side-by-side comparison of LSI-based SEO myths vs facts (without the math).
I'm amazed at how many SEO "experts" out there are perpetuating so many LSI myths through SES, forums, and blogs, while deceiving their prospective clients. It is a shame. I hope one day they will be slapped with a consumer class action lawsuit for fraudulent trade.
Dr. E. Garcia
"a good way to know if a word does have an LSI partner is to do a ~keyword (yeah, include the little symbol thing)"
Well, actually, it is not. The idea that LSI results can be invoked by a query operator needs to stop. This is one of many theories out there that need correction. To quote Dr. Garcia on this matter:
"Many SEOs are misquoting old papers and the focus of that old research. Many of these SEO "experts" don't even know how to do basic SVD decomposition, nor do they understand the how-to steps involved in computing LSI scores. In the process they have stretched such research findings and added a few of their own myths in order to market better whatever they sell. For instance, today one can see some suggesting that to have documents "LSI friendly" one needs to stuff content with synonyms or related terms. This perception is incorrect."
I highly recommend May's issue of IRWatch - The Newsletter: "Demystifying LSI for SEOs"
Thanks, Jose. Query operators are first in my list of LSI myths.
Indeed, there is a lot of speculation and hearsay within the SEO industry regarding LSI. This seo-rant.com article is a good example. It is clear that its author is just passing around myths. The irony of the article is in its title:
"Understanding Latent Semantic Indexing"
The first paragraph of that article is just a collection of myths and nonsense, many of which I have seen around before. Note how LSI is described:
"It compares nearly everything about a page: links; anchor text; meta tags; punctuation; sentence structure; language; img tags; page size - and a lot more. It compares this to sites that have proven themselves to be ‘authority’ sites. So Google knows, based on authority sites, that ‘cat’ and ‘dog’ are animals - because sites that use these keywords will often mention the keyword ‘animal’ when talking about cats or dogs".
Clearly the author does not know how SVD is applied to the IR problem LSI tries to tackle. I haven't seen so much garbage in a single post.
How many other LSI myths can you spot in the rest of the article?
Sorry to sound a bit harsh, but when hearsay like this reaches an SEO blog, forum, or an SES conference, I can understand why IR folks just laugh at SEOs, since most don't really understand how search engines actually work.
This tells me that there is a lot of educational work to do within the search marketing industry.
The funny thing is that, to varying degrees, this kind of hearsay always finds its way into marketing conference talks (SES speakers, SEO events, etc.). Then the next thing one hears is SEOs blogging about this or that talk, repeating the same crap as facts.
Dr. E. Garcia
Dr. Garcia - just a heads up. I've sent you a couple emails over the past 2 weeks, and all have bounced back, saying they were undeliverable...
I got no emails from you. Probably they were marked as spam by my ISP. Last week I had a similar problem with Mike Grehan and a few other friends. To be sure, I placed your address on my "Accept" list. Feel free to try again.
BTW, here is another SEO - Michael Duz - seeing the light and realizing the truth about SEO LSI myths and LSI itself. He has a great post, too: The LSI Myth.
Dr. E. Garcia
I am sure that, given the amount of information Google has about different words and phrases, they can build lots of semantic connections.
I also think they are just at the beginning of this process, and that this is the direction they are going.
So, paying more and more attention to semantics and related words is the future of copywriting... and that was Rand's point too, wasn't it?
I actually focused on sociolinguistics in college. I wonder when the day will come, or the generation rather, that treats search in a more semantic way. Our input is prepared for machines; we frame our language for easy dissection. Search engines are getting better at relating terms, but we're still well away from linguistic intuition.
Something I always think about when I see this topic discussed is, "What could Google do with all this info besides use it for search purposes?" I mean, couldn't an SEO expert with dual Ph.D.s in linguistics and sociology (c'mon, there has to be someone with those credentials out there) process all this data to learn more about how people use language?
Sure, you'd probably study the data all your life and not get very far, but you could recruit and train cute interns to take up where your work left off.
I had a professor in college with credentials quite similar to the above, and she could have had a field day with Google's data. Then she would have had us draw sociolinguistic trees of the data for an entire semester. Luckily, she hadn't discovered SEO or its mountains of data by the time I graduated.
I personally think Google is using semantic analysis to help determine a theme-relevancy quality score when looking at a site's IBL (inbound link) profile. The theme relationship between sites that link to each other does have an impact on the search results; I agree with Michael that the effect is somewhat minimal, but it is there. I also think a properly themed/siloed site can compete against sites that have a higher allinanchor ranking.
I am sure that at some point in the future Google will rely on co-occurrence more strongly. A site wanting to rank for "puppy training tips" may be able to improve its relevancy score by also using semantically relevant phrases such as "dog training tips" or "pet training tips," and this could indirectly affect the site's rankings.
Google AdSense works using this type of predictive intelligence to determine a page's theme relevancy so it can display the most relevant ads possible. I am sure Google uses its co-occurrence statistics to identify and purge MFA (made-for-AdSense) and spammy sites, and it would only make sense to use that same intelligence to help determine the theme relevancy between sites based on linking patterns.
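As a sketch of the kind of theme-relevancy comparison I have in mind (cosine similarity between the term vectors of a page and the pages linking to it; all vectors and domain names here are invented for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

target = {"snowboarding": 5, "mountain": 3, "gear": 2}
linking_pages = {
    "boardshop.example": {"snowboarding": 4, "gear": 3},
    "surfblog.example": {"surfing": 6, "waves": 2},
}

for url, vec in linking_pages.items():
    print(url, round(cosine(target, vec), 2))
# The on-theme link scores high (~0.84); the surfing page scores 0,
# suggesting it would contribute little topical relevance.
```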
That is my two cents worth :)
Good post :) :)
Hadn't come across that Co-Occurrence data stuff before :)
Rand,
What exactly do you mean when you say: "it looked like many of the pages pointing to my domain and page weren't nearly as "connected" as the pages linking to my competitor"? I've experienced instances of frustration just like this before (I believe), when a competitor has very little link weight outside of a closely interlinked network of related websites, yet I've got thousands of quality links coming from everywhere (authorities in the industry included). Is there a point when the sheer amount of good links can overcome a strong network of sites that are all on topic?