For the next few weeks, my blog posts will primarily consist of re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section. You can read more about this project here.

Part I: How Search Engines Operate

The major global search engines include Google, Yahoo!, Microsoft/Live, Baidu, Naver & Ask.com. This guide primarily covers Google, Yahoo!, Microsoft & Ask - the major engines in the United States and other English-language countries. Sadly, we don't have the expertise or experience to offer insight into Baidu (which operates almost exclusively in China) or Naver (Korea's primary search engine).

The search engines have several major goals and functions. These include:

  • Crawling and indexing the billions of documents (pages & files) accessible on the Web
  • Providing answers to user queries, most frequently through lists of relevant pages

In this section, we'll be walking through the basics of these functions from a non-technical perspective.

Crawling & Indexing

Imagine the World Wide Web as a network of stops in a big city subway system. Each stop is its own unique document (usually a web page, but sometimes a PDF, JPG or other file). The search engines need a way to "crawl" the entire city and find all the stops along the way, so they use the best path available - links:

Analogy of the London Subway as Links & Web Pages
ABOVE: London's "Tube" Serves as an Apt Analogy for the Journey of Search Engines Across the WWW

In our representation, stops like Embankment, Piccadilly Circus & Moorgate serve as pages, while the lines connecting them (in black & brown) represent the links from those pages to other pages on the web. Once Google (at the bottom) reaches Embankment, it sees the links pointing to Charing Cross, Westminster & Temple and can access any of those "pages."

The link structure of the web serves to bind together all of the pages in existence (or, at least, all those that the engines can access). Through links, search engines' automated robots, called "crawlers" or "spiders" (hence the illustrations above) can reach the many billions of interconnected documents.
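The link-following process described above is, at its core, a graph traversal: start from known pages, collect the links on each one, and visit anything new. Here's a minimal sketch of that idea in Python - the page names and link graph are invented for illustration (borrowing the Tube-stop analogy), and real crawlers deal with billions of actual URLs, politeness rules, and much more.

```python
from collections import deque

# A toy link graph standing in for the web. Page names and their
# outbound links are hypothetical, echoing the subway analogy above.
LINK_GRAPH = {
    "embankment": ["charing-cross", "westminster", "temple"],
    "charing-cross": ["embankment", "piccadilly-circus"],
    "westminster": [],
    "temple": ["moorgate"],
    "piccadilly-circus": [],
    "moorgate": ["embankment"],
}

def crawl(seed):
    """Breadth-first crawl: follow links outward from a seed page,
    visiting each discovered page exactly once."""
    discovered = {seed}
    frontier = deque([seed])
    crawl_order = []
    while frontier:
        page = frontier.popleft()
        crawl_order.append(page)
        for link in LINK_GRAPH.get(page, []):
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)
    return crawl_order

print(crawl("embankment"))
# → ['embankment', 'charing-cross', 'westminster', 'temple',
#    'piccadilly-circus', 'moorgate']
```

Note that the crawler only ever finds pages that something already discovered links to - which is exactly why a page with no inbound links is invisible to the engines.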

Once the engines find these pages, their next job is to parse the code from them and store selected pieces of the pages on massive hard drives, to be recalled when needed for a query. To accomplish the monumental task of holding billions of pages that can be accessed in a fraction of a second, the search engines have constructed massive datacenters, like this one from Google in The Dalles, Oregon:

NY Times Piece on Google's Facility in The Dalles, Oregon
The NYTimes covered Google's datacenter in The Dalles

These monstrous storage facilities hold thousands of machines processing unimaginably large quantities of information. After all, when a person performs a search at any of the major engines, they demand results instantaneously - even a 3- or 4-second delay can cause dissatisfaction, so the engines work hard to provide answers as fast as possible.
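How can "selected pieces" of billions of pages be recalled in a fraction of a second? The classic data structure from the information-retrieval literature is the inverted index: rather than scanning every document for your words, the engine keeps a map from each word to the documents containing it. The sketch below is a bare-bones illustration with made-up documents - actual engine indexes are vastly larger and more sophisticated.

```python
def build_index(documents):
    """Map each word to the set of document ids containing it, so
    query-time lookups avoid scanning every stored document."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query word."""
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results or set()

# Hypothetical mini-corpus for illustration.
docs = {
    1: "super hero stamps from USPS",
    2: "AP news article on super hero stamps",
    3: "london tube map",
}
index = build_index(docs)
print(search(index, "super hero stamps"))  # → {1, 2}
```

The expensive work (parsing and indexing) happens once, at crawl time; answering a query is then just a few fast lookups and set intersections.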

Retrieval & Rankings

For most searchers, the quest for knowledge begins like this:

Yahoo! Query Bar

And ends with a list of relevant pages on the web, returned in order of "importance." This process requires the search engines to scour their corpus of billions of documents and do two things - first, return only those results that are relevant or useful to the searcher's query, and second, rank those results in order of perceived value (or importance). It is both "relevance" and "importance" that the process of search engine optimization is meant to influence.

To the search engines, relevance means more than simply having a page with the words you searched for prominently displayed. In the early days of the web, search engines didn't go much further than this simplistic step, and found that their results suffered as a consequence. Thus, through iterative evolution, smart engineers at the various engines devised better ways to find valuable results that searchers would appreciate and enjoy. Today, hundreds of factors influence relevance, many of which we'll discuss throughout this guide.

Importance is an equally tough concept to quantify, but search engines must do their best. Currently, the major engines typically interpret importance as popularity - the more popular a site, page or document, the more valuable the information contained therein must be. This assumption has proven fairly successful in practice, as the engines have continued to increase users' satisfaction by using metrics that interpret popularity.

So, when you see a page like this:

Super Hero Stamps Results at Yahoo!

You can surmise that the search engine (in this case, Yahoo!) believes that the Super Hero Stamps Page on USPS.com is the most relevant and popular page for the query "super hero stamps," while the AP news article on the topic is less relevant/popular.

Popularity and relevance aren't determined manually (and thank goodness, because those trillions of man-hours would require Earth's entire population as a workforce). Instead, the engines craft careful, mathematical equations - algorithms - to sort the wheat from the chaff and to then rank the wheat in order of tastiness (or however it is that farmers determine wheat's value). These algorithms are often composed of hundreds of components. In the search marketing field, we often refer to them as "ranking factors." For those who are particularly interested, SEOmoz crafted a resource specifically on this subject - Search Engine Ranking Factors (last updated in April of 2007).
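To make the "hundreds of components" idea concrete, one simple way to picture a ranking algorithm is as a weighted combination of per-page factor scores. The factors and weights below are entirely hypothetical - the real engines keep theirs secret, and their math is far more complex - but the sketch shows the shape of the process:

```python
# Hypothetical ranking factors and weights, for illustration only;
# real engine algorithms are proprietary and far more complex.
WEIGHTS = {
    "keyword_relevance": 0.5,
    "link_popularity": 0.3,
    "anchor_text_match": 0.2,
}

def score(factors):
    """Combine per-factor scores (each 0.0-1.0) into one ranking score."""
    return sum(WEIGHTS[f] * factors.get(f, 0.0) for f in WEIGHTS)

def rank(pages):
    """Order candidate pages from highest to lowest overall score."""
    return sorted(pages, key=lambda p: score(pages[p]), reverse=True)

# Invented scores for the "super hero stamps" example above.
pages = {
    "usps.com/superhero-stamps": {"keyword_relevance": 0.9,
                                  "link_popularity": 0.8,
                                  "anchor_text_match": 0.7},
    "news.example/ap-article":   {"keyword_relevance": 0.8,
                                  "link_popularity": 0.4,
                                  "anchor_text_match": 0.3},
}
print(rank(pages))  # → ['usps.com/superhero-stamps', 'news.example/ap-article']
```

Search engine optimization, in this framing, is the practice of improving a page's standing on the factors the engines actually weight.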

... and with that, I'm off to bed. Please do share your thoughts in the comments below. Oh yeah - and must-read stuff today would probably include:

Whew... this is a lot of work. What have I gotten myself into?