For the next few weeks, I'm working on re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section. You can read more about this project here.



Search engines, as we've shown above, are limited in how they crawl the web and interpret content to retrieve and display in the results. In this section of the guide, we'll focus on the specific technical aspects of building (or modifying) web pages so they're optimally structured for search engines and human visitors. This is an excellent part of the guide to share with your programmers, information architects, and designers, so that all parties involved in the site's construction can plan and develop a search-engine friendly site.

Indexable Content

To be listed in the search engines, your content - the material available to visitors of your site - must be in HTML text format. Images, Flash files, Java applets, and other non-text content are virtually invisible to search engine spiders, despite advances in crawling technology. The easiest way to ensure that the words and phrases you display to your visitors are visible to search engines is to place them in the HTML text on the page. However, more advanced methods are available for those who demand greater formatting or visual display styles:

  • Images in gif, jpg, or png format can be assigned "alt attributes" in HTML, providing search engines a text description of the visual content
  • Images can also be shown to visitors as replacements for text by using CSS styles
  • Content contained in Flash or Java plug-ins can be repeated in text on the page
  • Video & Audio content should have an accompanying transcript if the words and phrases used are meant to be indexed by the engines
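The first of these approaches can be sketched in a few lines of markup; the filename and description below are hypothetical examples, not taken from any actual page:

```html
<!-- The alt attribute gives spiders an indexable text description
     of an image they otherwise cannot "see" -->
<img src="seattle-skyline.jpg" alt="Photo of the Seattle skyline at dusk" />

<!-- Text repeated on the page alongside plug-in content keeps the
     words visible to spiders even when the Flash/Java object is not -->
<p>Photo of the Seattle skyline at dusk</p>
```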

Most sites do not have significant problems with indexable content, but double-checking is worthwhile. By using tools like SEO-Browser, a website that lets you see web pages the same way search engine spiders do, you can see what elements of your content are visible and indexable to the engines.

For example, below, I have an image of SEOmoz's homepage:

SEOmoz's Homepage

The visual images of the Seattle skyline and the graphic elements give the page a great look and feel, but let's see what the search engines can access:

SEOmoz Homepage via SEO-Browser

Using the SEO Browser site, we're able to see that to a search engine, SEOmoz's homepage is simply a collection of text and links (which is exactly what we'd want to see).

Now let's check out another favorite site of mine, Orisinal, a clever collection of wonderfully designed, Flash-based games.

Orisinal Homepage

The graphics are great, but there's not a lot of text on the page - it just says "Orisinal games." Perhaps that's all the page needs to rank for?

Orisinal Home via SEO Browser

Uh oh... Via SEO Browser, we can see that the page is a barren wasteland. There's not even text telling us that the page contains the Orisinal games. The site is entirely built in Flash, but sadly, this means that search engines cannot index any of the text content, or even the links to the individual games.

If you're curious about exactly what terms and phrases the search engines can see on a webpage, SEOmoz has a nifty tool called "Term Extractor" that will display words & phrases ordered by frequency. However, it's wise to not only check for text content but to also use a tool like SEO Browser to double-check that the pages you're building are visible to the engines. It's very hard to rank if you don't even appear in the keyword databases :)

Crawlable Link Structures

On an individual page, search engines need to see content in order to list pages in their massive keyword-based indices. They also need to have access to a crawlable link structure - one that lets their spiders browse the pathways of a website - in order to find all of the pages on a website. Hundreds of thousands of sites make the critical mistake of hiding or obfuscating their navigation in ways that search engines cannot access, thus impacting their ability to get pages listed in the search engines' indices. Below, I've illustrated how this problem can happen:

Google's Spider Unable to Crawl Links

In the example above, Google's colorful spider has reached page "A" and sees links to pages "B" and "E." However, even though C & D might be important pages on the site, the spider has no way to reach them (or even know they exist) because no direct, crawlable links point to those pages. As far as Google is concerned, they might as well not exist - great content, good keyword targeting, and smart marketing won't make any difference at all if the spiders can't reach those pages in the first place.

To start, let's take a quick look at the anatomy of a standard HTML link:

Anatomy of a Link

In the above illustration, the "<a" tag indicates the start of a link. Link tags can contain images, text, or other objects, all of which provide a "clickable" area on the page that users can engage to move to another page. This is the original concept of the Internet - "hyperlinks." The link referral location tells the browser (and the search engines) where the link points. In my example, the URL https://www.jonwye.com is referenced. Next, the visible portion of the link, called "anchor text" in the SEO world, describes the page I'm pointing to. In this example, the page pointed to is about custom belts made by my friend from Washington D.C., Jon Wye, so I've used the anchor text "Jon Wye's Custom Designed Belts." The </a> tag closes the link, so that elements later on in the page will not have the link attribute applied to them.
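Written out as markup, the link described above looks like this:

```html
<!-- Opening tag with the referral location (href), followed by the
     visible anchor text, followed by the closing </a> tag -->
<a href="https://www.jonwye.com">Jon Wye's Custom Designed Belts</a>
```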

This is the most basic format of a link - and it is eminently understandable to the search engines. The spiders know that they should add this link to the engine's link graph of the web, use it to calculate query-independent variables (like Google's PageRank), and follow it to index the contents of the referenced page.

Now let's look at some common reasons why pages may not be reachable:

  • Links in Submission-Required Forms
    Forms can include something as basic as a drop-down menu or as complex as a full-blown survey. In either case, search spiders will not attempt to "submit" forms and thus, any content or links that would be accessible via a form are invisible to the engines.
  • Links only accessible through Search
    Although this relates directly to the above warning on forms, it's such a common problem that it bears mentioning. Spiders will not attempt to perform searches to find content, and thus, it's estimated that millions of pages are hidden behind completely inaccessible walls, doomed to anonymity until a spidered page links to them.
  • Links in Un-Parseable JavaScript
    If you use JavaScript for links, you may find that search engines either do not crawl them or give very little weight to the links embedded within. Standard HTML links should replace (or accompany) JavaScript on any page where you'd like spiders to crawl.
  • Links in Flash, Java, or other Plug-Ins
    The links embedded inside the Orisinal site (from our example above) are a perfect illustration of this phenomenon. Although dozens of games are listed and linked to on the Orisinal page, no spider can reach them through the site's link structure, rendering them invisible to the engines (and un-retrievable by searchers performing a query).
  • Links pointing to pages blocked by the Meta Robots tag or Robots.txt
    The Meta Robots tag (described in detail here) and the Robots.txt file (full description here) both allow a site owner to restrict spider access to a page. Just be warned that many a webmaster has unintentionally used these directives as an attempt to block access by rogue bots, only to discover that search engines cease their crawl.
  • Links on pages with many hundreds or thousands of links
    The search engines all have a rough limit of 100 links per page, before they may stop spidering additional pages linked-to from a page. This limit is somewhat flexible, and particularly important pages may have upwards of 150 or even 200 links followed, but in general practice, it's wise to limit the number of links on any given page to 100 or risk losing the ability to have additional pages crawled.
  • Links in Frames or I-Frames
    Technically, links in both frames and I-Frames are crawlable, but both present structural issues for the engines in terms of organization and following. Unless you're an advanced user with a good technical understanding of how search engines index and follow links in frames, it's best to stay away from them as a place to offer links for crawling purposes.
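For the Meta Robots and Robots.txt bullet above, the standard directives look like this; the blocked path is a hypothetical example:

```html
<!-- Meta Robots tag, placed in a page's <head>: asks engines not to
     index this page and not to follow the links on it -->
<meta name="robots" content="noindex, nofollow" />
```

The robots.txt equivalent, a plain-text file placed at the site root, uses lines like `User-agent: *` followed by `Disallow: /private/` to keep spiders out of a directory - which is exactly why a directive aimed at rogue bots can accidentally shut out the major engines too.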

If you avoid these pitfalls, you'll have clean, spiderable HTML links that will allow the spiders easy access to your content pages. Links can have additional attributes applied to them, but the engines ignore nearly all of these, with the important exception of the rel="nofollow" attribute.

Rel="nofollow" can be used with the following syntax:

<a href="https://moz.com" rel="nofollow">Lousy Punks!</a>

In this example, by adding the rel="nofollow" attribute to the link tag, I've told the search engines that I, the site owner, do not want this link to be interpreted as the normal, "editorial vote." Nofollow came about as a method to help stop automated blog comment, guestbook, and link injection spam (read more about the launch here), but has morphed over time into a way of telling the engines to discount any link value that would ordinarily be passed. Links tagged with nofollow are interpreted slightly differently by each of the engines:

  • Google - nofollow'd links carry no weight or impact and are interpreted as HTML text (as though the link did not exist). Google's representatives have said that they will not count those links in their link graph of the web at all.
  • Yahoo! & MSN/Live - Both of these engines say that nofollow'd links do not impact search results or rankings, but may be used by their crawlers as a way to discover new pages. That is to say that while they "may" follow the links, they will not count them as a method for positively impacting rankings.
  • Ask.com  - Ask is unique in its position, claiming that nofollow'd links will not be treated any differently than any other kind of link. It is Ask's public position that their algorithms (based on local, rather than global popularity) are already immune to most of the problems that nofollow is intended to solve.

Keyword Usage & Targeting

We'll have to save this for the next in the series...



* Flash and search engines can work together, but it requires the use of some clever code-replacement technology called sIFR (Scalable Inman Flash Replacement), which can be used to show Flash text to users and HTML to search engines.