For the next few weeks, my blog posts will primarily consist of re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section. You can read more about this project here.
Part I: How Search Engines Operate
The major global search engines include Google, Yahoo!, Microsoft/Live, Baidu, Naver & Ask.com. This guide primarily covers Google, Yahoo!, Microsoft & Ask - the major engines in the United States and other English-language countries. Sadly, we don't have the expertise or experience to offer insight into Baidu (which operates almost exclusively in China) or Naver (Korea's primary search engine).
The search engines have several major goals and functions. These include:
- Crawling and indexing the billions of documents (pages & files) accessible on the Web
- Providing answers to user queries, most frequently through lists of relevant pages
In this section, we'll be walking through the basics of these functions from a non-technical perspective.
Crawling & Indexing
Imagine the World Wide Web as a network of stops in a big city subway system. Each stop is its own unique document (usually a web page, but sometimes a PDF, JPG or other file). The search engines need a way to "crawl" the entire city and find all the stops along the way, so they use the best path available - links:
ABOVE: London's "Tube" Serves as an Apt Analogy for the Journey of Search Engines Across the WWW
In our representation, stops like Embankment, Piccadilly Circus & Moorgate serve as pages, while the lines connecting them (in black & brown) represent the links from those pages to other pages on the web. Once Google (at the bottom) reaches Embankment, it sees the links pointing to Charing Cross, Westminster & Temple and can access any of those "pages."
The link structure of the web serves to bind together all of the pages in existence (or, at least, all those that the engines can access). Through links, search engines' automated robots, called "crawlers" or "spiders" (hence the illustrations above) can reach the many billions of interconnected documents.
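For readers who like to see the idea in code, here is a minimal sketch of link-following as a breadth-first walk over a toy link graph. The station names come from the Tube analogy, but the `LINKS` table and the `crawl` function are made up for illustration; real crawlers are far more involved and this is not how any particular engine does it.

```python
from collections import deque

# Hypothetical link graph: each "page" maps to the pages it links to.
# The station names come from the Tube analogy; the links are invented.
LINKS = {
    "Embankment": ["Charing Cross", "Westminster", "Temple"],
    "Charing Cross": ["Piccadilly Circus"],
    "Westminster": [],
    "Temple": ["Moorgate"],
    "Piccadilly Circus": ["Embankment"],
    "Moorgate": [],
}


def crawl(start, links):
    """Breadth-first crawl: visit a page, then queue every page it links to."""
    seen = {start}          # pages already discovered, so loops don't trap us
    queue = deque([start])  # pages waiting to be visited
    order = []              # pages in the order they were visited
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    print(crawl("Embankment", LINKS))
```

Starting from Embankment, the walk reaches every stop that some chain of links leads to, and a stop with no inbound links would never be found at all.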
Once the engines find these pages, their next job is to parse the code from them and store selected pieces of the pages on massive hard drives, to be recalled when needed in a query. To accomplish the monumental task of holding billions of pages that can be accessed in a fraction of a second, the search engines have constructed massive datacenters, like this one from Google in The Dalles, Oregon:
The NYTimes covered Google's datacenter in The Dalles
These monstrous storage facilities hold thousands of machines processing unimaginably large quantities of information. After all, when a person performs a search at any of the major engines, they demand results instantaneously - even a 3- or 4-second delay can cause dissatisfaction, so the engines work hard to provide answers as fast as possible.
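The "parse and store selected pieces" step is often pictured as an inverted index: a lookup table from each word to the pages that contain it, which is one reason answers can come back so quickly. The sketch below is a toy version under that assumption. The page URLs, the page text, and the `build_index` and `lookup` helpers are all hypothetical, and the engines' real storage formats aren't public.

```python
import re
from collections import defaultdict

# Made-up page text standing in for crawled documents.
PAGES = {
    "usps.example/stamps": "Super hero stamps from the postal service",
    "news.example/ap": "AP news: new stamps feature a super hero",
    "blog.example/wheat": "How farmers judge wheat",
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    index = build_index(PAGES)
    print(sorted(lookup(index, "super hero stamps")))
```

The index is built once, ahead of time, so answering a query only means intersecting a few small sets instead of re-reading every page.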
Retrieval & Rankings
For most searchers, the quest for knowledge begins like this:
And ends with a list of relevant pages on the web, returned in order of "importance." This process requires the search engines to scour their corpus of billions of documents and do two things - first, return only those results that are relevant or useful to the searcher's query, and second, rank those results in order of perceived value (or importance). It is both "relevance" and "importance" that the process of search engine optimization is meant to influence.
To the search engines, relevance means more than simply having a page with the words you searched for prominently displayed. In the early days of the web, search engines didn't go much further than this simplistic step, and found that their results suffered as a consequence. Thus, through iterative evolution, smart engineers at the various engines devised better ways to find valuable results that searchers would appreciate and enjoy. Today, hundreds of factors influence relevance, many of which we'll discuss throughout this guide.
Importance is an equally tough concept to quantify, but search engines must do their best. Currently, the major engines typically interpret importance as popularity - the more popular a site, page or document, the more valuable the information contained therein must be. This assumption has proven fairly successful in practice, as the engines have continued to increase users' satisfaction by using metrics that interpret popularity.
So, when you see a page like this:
You can surmise that the search engine (in this case, Yahoo!) believes that the Super Hero Stamps Page on USPS.com is the most relevant and popular page for the query "super hero stamps," while the AP news article on the topic is less relevant/popular.
Popularity and relevance aren't determined manually (and thank goodness, because those trillions of man-hours would require Earth's entire population as a workforce). Instead, the engines craft careful, mathematical equations - algorithms - to sort the wheat from the chaff and to then rank the wheat in order of tastiness (or however it is that farmers determine wheat's value). These algorithms are often composed of hundreds of components. In the search marketing field, we often refer to them as "ranking factors." For those who are particularly interested, SEOmoz crafted a resource specifically on this subject - Search Engine Ranking Factors (last updated in April of 2007).
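To make the two-step idea concrete (first keep only relevant pages, then order them by importance), here is a deliberately tiny sketch. It uses just two invented factors: the share of query words found on the page, and an inbound-link count standing in for popularity. The `CANDIDATES` data, both factors, and the `relevance` and `rank` functions are assumptions for illustration only; real engines combine hundreds of factors in ways they don't disclose.

```python
# Hypothetical candidate pages: (url, page text, inbound-link count).
CANDIDATES = [
    ("usps.example/stamps", "super hero stamps page", 120),
    ("news.example/ap", "ap article on super hero stamps", 40),
    ("blog.example/wheat", "wheat and chaff", 500),
]


def relevance(query, text):
    """Fraction of query words that appear in the page text (0.0 to 1.0)."""
    words = query.lower().split()
    if not words:
        return 0.0
    page_words = set(text.lower().split())
    return sum(word in page_words for word in words) / len(words)


def rank(query, candidates):
    """Drop irrelevant pages, then sort by relevance and popularity."""
    scored = []
    for url, text, inbound_links in candidates:
        rel = relevance(query, text)
        if rel > 0:  # step one: keep only pages that match the query at all
            scored.append((rel, inbound_links, url))
    # Step two: highest relevance first, with popularity breaking ties.
    scored.sort(reverse=True)
    return [url for _, _, url in scored]


if __name__ == "__main__":
    print(rank("super hero stamps", CANDIDATES))
```

Note that the wheat page has the most inbound links of the three, yet it never appears for the stamps query: popularity only matters once a page has passed the relevance test.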
... and with that, I'm off to bed. Please do share your thoughts in the comments below. Oh yeah - and must-read stuff today would probably include:
- This post on YOUmoz about link juice asks some good questions (I need to link over to YOUmoz more)
- The eMarketer study on Word of Mouth (via Justilien) is worth a look
- The AP reports on global search usage data from ComScore
- Michael Gray had a great interview on advanced link tactics
- And, Aaron had a brilliant one with Eli from BlueHatSEO (another must-read blog)
- Worried about how Google might improperly penalize your site? Googlers respond to concerns about a variety of subjects on Google Groups (Go Susan!)
Whew... this is a lot of work. What have I gotten myself into?
Seems like a great start, Rand. Some specific comments, if you don't mind a little constructive criticism:
Sorry, I've been in editor mode recently, mostly for myself (there I go apologizing again), so I hope that's helpful and not just nitpicky.
Excellent critiques, Pete - I agree it needs a little re-working, particularly with the relevance piece.
As part of my ongoing commitment to avoiding my own work, here's the kind of thing that comes to mind when I think of friendly spiders.
You think that is a happy spider? Look at this happy spider (I feel like I am commenting on digg)
I like the robot spiders! :-)
Bigging up London and the UK in general! Long may this trend last Rand ;-)
Seriously though - seems good to me. Things I'd like to see covered in this section:
Information on how rankings change and what causes that (e.g. different data centres) as well as how rankings change based on location (google.co.uk, google.com, etc.)
I'd also like to see information on how a page gets cached - and what you're seeing when you look at a google cached page.
I'm not sure I 'get' this format - should I be writing those things myself or is this a section for us to comment on what you've written so far?
Tom - good points. I definitely agree that "caching" (and possibly a screenshot of it) belongs in this section. Rankings and data centers - also a good point, and yes, a good time to mention that as well.
Rand, for ease of use you may want to employ notations that link to the Appendices. For example, in the paragraph, "To the search engines, relevance..." you could tag the "In the early days of the web" with a superscript "m" and link it to the myth of keyword stuffing, and address the timeline of keyword spam there.
You could also go through this and text link terms to their definition page, sort of like this online encyclopedia I saw once...
Chuck - excellent idea. I'll see how we can work in those references throughout the document come publication time.
Can maybe I help?
I always think datacenters should be based in the coldest places to cut down on air conditioning.
Just a quick note to say you've got a dead link to https://news.yahoo.com/s/ap/20071009/ap_on_hi_te/worldwide_search in the first para.
Great article by the way!
Although this is my first comment, I have been reading SEOmoz for some time now. I often refer clients and potential clients here as a source for good information on all things SEM, and this article is exactly why. Great job, Rand!
Give us some SEO tips on those Asian giants (Baidu & Naver) plzzz?
Here are some links to get you started:
NHN/Naver
https://www.businessweek.com/magazine/content/06_05/b3969057.htm?chan=tc
https://seoinkorea.blogspot.com/
https://www.promarketingonline.com/Korean-seo.htm
Baidu
https://www.filination.com/blog/2007/03/25/chinese-baidu-search-engine-optimization-seo-in-china/
https://www.selfseo.com/story-19549.php
https://www.baidupro.com/
Cheers Rand:)
Baidu is the main competitor of Google in Asia, so knowing about it may be necessary for some businesses confined to Asia. Please let us know something about Baidu - the Chinese search engine.
Tastiness of wheat! Haha! You make me laugh! :-D
First of all, congratulations for this piece of work you're delivering and thank you, I find it extremely helpful...Other information I'd like to see in this guide is how factors like hosting or domain/subdomain affect regional ranking, and your recommendations when selecting them for a website
This is a great tutorial on how the search engine indexing works.
Thanks for putting it together.
Rand - I think this is a really good start. I would change that line to be a little bit more specific. Search engines are currently indexing all sorts of digital information (web pages, documents, books, news, emails, videos, etc.), but the focus of the guide is on the web pages (is it?).
Thanks Hamlet - will do!
Also being a newbie, I only stumbled on the original beginner’s guide last week. I read it over the weekend and found it very useful so I’m really pleased it’s now being updated – I will follow the posts with interest over the coming weeks.
I found the post clear and helpful, so I suppose being new, if I can understand what’s being written about then the document is meeting its objectives. Being a Brit I particularly like the Tube analogy!
Thanks.
drummerboy9000 and ristfal, glad to hear you are finding the website useful. I know first hand that learning SEO can be a daunting experience. There is a HUGE amount of information on the internet and, at least in my experience, most of it was crap. I found (I openly admit my bias here) that SEOmoz was a great central place to read valuable information and, more importantly, meet and talk with talented and knowledgeable people. Good luck, and let me know if I can do anything to help.
As a newbie these posts are helping me a lot. Thank you!
Just curious, how long did this post take to write? Is it in line with what you expected when you decided to start blogging the new guide?
It's a little slower going than I anticipated - I actually started blogging early last night (11pm) and was at this until about 1:15am - mostly the illustrations.
Rand - Keep up the great work! I know with all that you guys do you must be swamped, but I think this rewrite is going to be so valuable to everyone from experts to newbies, like myself. It's been a while since I've read the original and I can't wait to keep up with your rewrite posts on the blog. I still feel there is so much more for me to learn and I consider you guys my first source for information. Thanks!
Rand, your writing style and use of fun and intelligent analogies are really superb. Looking forward to reading more of these...
Good start! I have to admit, as an American I was slightly distracted by going off to see if those places you listed on your subway line were real :)
Even though I have a good grasp on SEO tactics (at least enough to get the basic ideas), your writing style is great at validating what I THINK I know! I look forward to the next few installments!
Rand,
I have looked over this post three times (at different times in the day), just to make sure I wasn't in an odd mood or something... That being said...
I hate to be the first one not to like something, but please take this in the spirit in which it was meant...
The tube diagram is not quite as clear as it could be, and it confuses me somewhat. This could be due to the fact that the connecting tubes have been faded a bit much in Photoshop.
I had no trouble following the original diagram in the older Beginner's Guide. *cringes at having just criticized Rand Fishkin*
Also, I would perhaps just leave the Google Spider, rather than adding Yahoo. I think two spiders adds too much to the complexity, and might muddle the picture for a beginner.
Please don't think I am being petty.
The absence of the ASK spider is somewhat telling. I'm not sure it even exists in the first place.
Ha ha! No Ask spider - you're mean :) Gary Price is crying right now...
Ok - criticism well taken; I'll try to make the example clearer when I edit it. And please, don't fear calling me out if I'm wrong or if it's not the best it can be - this resource is supposed to help a lot of people and I definitely want it to be the best it can be.
Hmm, I thought the tube analogy was really good...