For the next few weeks, I'm working on re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section. You can read more about this project here.
Search engines, as we've shown above, are limited in how they crawl the web and interpret content to retrieve and display in the results. In this section of the guide, we'll focus on the specific technical aspects of building (or modifying) web pages so they're optimally structured for search engines and human visitors. This is an excellent part of the guide to share with your programmers, information architects, and designers, so that all parties involved in the site's construction can plan and develop a search-engine-friendly site.
Indexable Content
In order to be listed in the search engines, your content - the material available to visitors of your site - must be in HTML text format. Images, Flash files, Java applets, and other non-text content are virtually invisible to search engine spiders, despite advances in crawling technology. The easiest way to ensure that the words and phrases you display to your visitors are visible to search engines is to place them in the HTML text on the page. However, more advanced methods are available for those who demand greater formatting or visual display styles:
- Images in GIF, JPG, or PNG format can be assigned "alt attributes" in HTML, providing search engines a text description of the visual content
- Images can also be shown to visitors as replacements for text by using CSS styles
- Content contained in Flash or Java plug-ins can be repeated in text on the page
- Video & Audio content should have an accompanying transcript if the words and phrases used are meant to be indexed by the engines
Most sites do not have significant problems with indexable content, but double-checking is worthwhile. By using tools like SEO-Browser, a website that lets you see web pages the same way search engine spiders do, you can see what elements of your content are visible and indexable to the engines.
For example, below, I have an image of SEOmoz's homepage:
The visual images of the Seattle skyline and the graphic elements give the page a great look and feel, but let's see what the search engines can access:
Using the SEO Browser site, we're able to see that to a search engine, SEOmoz's homepage is simply a collection of text and links (which is exactly what we'd want to see).
Now let's check out another favorite site of mine, Orisinal, a clever collection of wonderfully designed, Flash-based games.
The graphics are great, but there's not a lot of text on the page - it just says "Orisinal games." Perhaps that's all the page needs to rank for?
Uh oh... Via SEO Browser, we can see that the page is a barren wasteland. There's not even text telling us that the page contains the Orisinal Games. The site is entirely built in Flash, but sadly, this means that search engines cannot index any of the text content, or even the links to the individual games.
If you're curious about exactly what terms and phrases search engines can see on a webpage, SEOmoz has a nifty tool called "Term Extractor" that will display words & phrases ordered by frequency. However, it's wise to not only check for text content but to also use a tool like SEO Browser to double-check that the pages you're building are visible to the engines. It's very hard to rank if you don't even appear in the keyword databases :)
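As a rough illustration of the idea behind these tools, the Python sketch below approximates what a text-only spider might pull from a page: text nodes and image alt attributes, with script and style contents skipped. The names `VisibleTextExtractor`, `extract_text`, and `term_frequencies`, along with both sample pages, are hypothetical and chosen for this example. Real crawlers and the tools mentioned above are far more sophisticated.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and image alt text, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Alt text is the only part of an image a text-only reader gets.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text a simple text-only reader would find in `html`."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def term_frequencies(text):
    """Count lowercase word occurrences in the extracted text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


# Hypothetical sample pages, for illustration only.
TEXT_PAGE = """
<html><body>
<h1>Custom belts</h1>
<img src="belt.png" alt="Leather belt with printed design">
<p>Custom belts made by hand.</p>
<script>var tracking = "ignored";</script>
</body></html>
"""

FLASH_ONLY_PAGE = """
<html><body>
<object data="games.swf" type="application/x-shockwave-flash"></object>
</body></html>
"""

print(extract_text(TEXT_PAGE))
print(repr(extract_text(FLASH_ONLY_PAGE)))
print(term_frequencies(extract_text(TEXT_PAGE)).most_common(3))
```

The Flash-only page yields an empty string, which is the "barren wasteland" effect described above: with no HTML text or alt attributes, there is nothing for a text-based reader to pick up.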
Crawlable Link Structures
On an individual page, search engines need to see content in order to list pages in their massive keyword-based indices. They also need to have access to a crawlable link structure - one that lets their spiders browse the pathways of a website - in order to find all of the pages on a website. Hundreds of thousands of sites make the critical mistake of hiding or obfuscating their navigation in ways that search engines cannot access, thus impacting their ability to get pages listed in the search engines' indices. Below, I've illustrated how this problem can happen:
In the example above, Google's colorful spider has reached page "A" and sees links to pages "B" and "E." However, even though C & D might be important pages on the site, the spider has no way to reach them (or even know they exist) because no direct, crawlable links point to those pages. As far as Google is concerned, they might as well not exist - great content, good keyword targeting, and smart marketing won't make any difference at all if the spiders can't reach those pages in the first place.
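The situation described above can be modeled as a small graph problem. The sketch below is a simplified assumption about crawling, not how any engine actually works: it treats the site as a dictionary of pages and their crawlable outbound links, then walks it breadth-first from page "A". The `SITE_LINKS` data and the `reachable_pages` function are hypothetical names made up for this illustration.

```python
from collections import deque


def reachable_pages(link_graph, start):
    """Breadth-first traversal: return every page reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: A links to B and E; no crawlable link points to C or D.
SITE_LINKS = {
    "A": ["B", "E"],
    "B": ["A"],
    "C": ["D"],
    "D": [],
    "E": ["A"],
}

crawled = reachable_pages(SITE_LINKS, "A")
orphaned = sorted(set(SITE_LINKS) - crawled)
print(orphaned)  # ['C', 'D']
```

Pages C and D exist on the site, but because the traversal only ever follows links, it never learns about them.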
To start, let's take a quick look at the anatomy of a standard HTML link:
In the above illustration, the "<a" tag indicates the start of a link. Link tags can contain images, text, or other objects, all of which provide a clickable area on the page that users can engage to move to another page. This is the original concept of the web - "hyperlinks." The link referral location tells the browser (and the search engines) where the link points to. In my example, the URL https://www.jonwye.com is referenced. Next, the visible portion of the link for visitors, called "anchor text" in the SEO world, describes the page I'm pointing to. In this example, the page pointed to is about custom belts, made by my friend from Washington D.C., Jon Wye, so I've used the anchor text "Jon Wye's Custom Designed Belts." The </a> tag closes the link, so that elements later on in the page will not have the link attribute applied to them.
This is the most basic format of a link - and it is eminently understandable to the search engines. The spiders know that they should add this link to the engine's link graph of the web, use it to calculate query-independent variables (like Google's PageRank), and follow it to index the contents of the referenced page.
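To make the two parts of a link concrete, here is a minimal Python sketch that pulls the referral location (the href) and the anchor text out of the example link, using the standard library's `html.parser`. The `LinkExtractor` class and `extract_links` function are hypothetical names for this illustration, and the sketch says nothing about how a real engine stores or weighs links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Records (href, anchor text) for each <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    """Return a list of (href, anchor text) pairs found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<a href="https://www.jonwye.com">Jon Wye\'s Custom Designed Belts</a>'
for href, anchor in extract_links(sample):
    print(href, "->", anchor)
```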
Now let's look at some common reasons why pages may not be reachable:
- Links in Submission-Required Forms
Forms can include something as basic as a drop-down menu or as complex as a full-blown survey. In either case, search spiders will not attempt to "submit" forms, and thus any content or links that would be accessible via a form are invisible to the engines.
- Links only accessible through Search
Although this relates directly to the above warning on forms, it's such a common problem that it bears mentioning. Spiders will not attempt to perform searches to find content, and thus, it's estimated that millions of pages are hidden behind completely inaccessible walls, doomed to anonymity until a spidered page links to them.
- Links in Un-Parseable Javascript
If you use Javascript for links, you may find that search engines either do not crawl or give very little weight to the links embedded within. Standard HTML links should replace Javascript (or accompany it) on any page where you'd like spiders to crawl.
- Links in Flash, Java, or other Plug-Ins
The links embedded inside the Orisinal site (from our above example) are a perfect illustration of this phenomenon. Although dozens of games are listed and linked to on the Orisinal page, no spider can reach them through the site's link structure, rendering them invisible to the engines (and un-retrievable by searchers performing a query).
- Links pointing to pages blocked by the Meta Robots tag or Robots.txt
The Meta Robots tag (described in detail here) and the Robots.txt file (full description here) both allow a site owner to restrict spider access to a page. Just be warned that many a webmaster has used these directives in an attempt to block access by rogue bots, only to discover that search engines cease their crawl as well.
- Links on pages with many hundreds or thousands of links
The search engines all have a rough limit of 100 links per page before they may stop spidering additional pages linked to from that page. This limit is somewhat flexible, and particularly important pages may have upwards of 150 or even 200 links followed, but in general practice, it's wise to limit the number of links on any given page to 100 or risk losing the ability to have additional pages crawled.
- Links in Frames or I-Frames
Technically, links in both frames and I-Frames are crawlable, but both present structural issues for the engines in terms of organization and following. Unless you're an advanced user with a good technical understanding of how search engines index and follow links in frames, it's best to stay away from them as a place to offer links for crawling purposes.
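A few of these pitfalls can be checked mechanically. The Python sketch below is a rough self-audit under stated assumptions, not a reproduction of any engine's behavior: it flags hrefs that only work with Javascript, compares the number of remaining links against the rough 100-link guideline from the list above, and uses the standard library's `urllib.robotparser` to test whether a URL is disallowed by a robots.txt file. The names `audit_links`, `is_script_only`, and `blocked_by_robots`, the heuristic for spotting script-only links, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

LINK_LIMIT = 100  # rough per-page guideline from the text, not a hard rule


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> element that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href.strip())


def is_script_only(href):
    """Heuristic (an assumption): hrefs that only make sense with Javascript."""
    return href in ("", "#") or href.lower().startswith("javascript:")


def audit_links(html, link_limit=LINK_LIMIT):
    """Split a page's links into crawlable and script-only, and check the count."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    script_only = [h for h in collector.hrefs if is_script_only(h)]
    crawlable = [h for h in collector.hrefs if not is_script_only(h)]
    return {
        "crawlable": crawlable,
        "script_only": script_only,
        "over_limit": len(crawlable) > link_limit,
    }


def blocked_by_robots(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


page = (
    '<a href="/about">About</a>'
    '<a href="javascript:openMenu()">Menu</a>'
    '<a href="#">Top</a>'
)
robots = "User-agent: *\nDisallow: /private/\n"

print(audit_links(page))
print(blocked_by_robots(robots, "https://example.com/private/report.html"))
print(blocked_by_robots(robots, "https://example.com/public/page.html"))
```

Forms, site search, Flash, and frames are harder to test this way. For those, a text-only view of the page remains the more practical check.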
If you avoid these pitfalls, you'll have clean, spiderable HTML links that will allow the spiders easy access to your content pages. Links can have additional attributes applied to them, but the engines ignore nearly all of these, with the important exception of the rel="nofollow" attribute.
Rel="nofollow" can be used with the following syntax:
<a href="https://moz.com" rel="nofollow">Lousy Punks!</a>
In this example, by adding the rel="nofollow" attribute to the link tag, I've told the search engines that I, the site owner, do not want this link to be interpreted as a normal "editorial vote." Nofollow came about as a method to help stop automated blog comment, guestbook, and link injection spam (read more about the launch here), but has morphed over time into a way of telling the engines to discount any link value that would ordinarily be passed. Links tagged with nofollow are interpreted slightly differently by each of the engines:
- Google - nofollow'd links carry no weight or impact and are interpreted as HTML text (as though the link did not exist). Google's representatives have said that they will not count those links in their link graph of the web at all.
- Yahoo! & MSN/Live - Both of these engines say that nofollow'd links do not impact search results or rankings, but may be used by their crawlers as a way to discover new pages. That is to say that while they "may" follow the links, they will not count them as a method for positively impacting rankings.
- Ask.com - Ask is unique in its position, claiming that nofollow'd links will not be treated any differently than any other kind of link. It is Ask's public position that their algorithms (based on local, rather than global popularity) are already immune to most of the problems that nofollow is intended to solve.
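For checking your own pages, the Python sketch below separates links that carry rel="nofollow" from those that don't. It only reads the markup; what each engine then does with a nofollowed link is described above and is not modeled here. The `NofollowSplitter` class and `split_links` function are hypothetical names, and the sample reuses the two example links from this section.

```python
from html.parser import HTMLParser


class NofollowSplitter(HTMLParser):
    """Sorts <a href> links by whether their rel attribute includes nofollow."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed hrefs, nofollowed hrefs) for the links in `html`."""
    parser = NofollowSplitter()
    parser.feed(html)
    parser.close()
    return parser.followed, parser.nofollowed


sample = (
    '<a href="https://moz.com" rel="nofollow">Lousy Punks!</a> '
    '<a href="https://www.jonwye.com">Jon Wye\'s Custom Designed Belts</a>'
)
followed, nofollowed = split_links(sample)
print(followed)    # ['https://www.jonwye.com']
print(nofollowed)  # ['https://moz.com']
```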
Keyword Usage & Targeting
We'll have to save this for the next in the series...
* Flash and search engines can work together, but it requires a clever text-replacement technique called sIFR (Scalable Inman Flash Replacement), which can be used to show Flash text to users and HTML to search engines.
My compliments on this. I particularly like that you spent time on a graphic that simply and clearly shows the different parts of a link. There is very little good information out there for the beginner and I look forward to the completion of this.
From a "feedthebot" standpoint, I always will mention the Google webmaster guidelines when I see such a post.
Understanding the way a search engine spider sees your webpage clears up the most common problems a website has being seen and understood by search engines.
The Google webmaster guidelines actually suggest going further and using Lynx to see if your website is being understood by spiders correctly. In the Google webmasters help forum I have encountered dozens of websites that were not being seen correctly by spiders, even though via a spider simulator, the site looked fine.
Learn more about the Google guideline that covers this here. (it is a feedthebot page)
Hey Rand,
Quick Brainstorm session here: I think it would be very beneficial to have a short warning section at the beginning of the guide describing why the many Tips, Secrets and Hacks that SEO newbies see won't work. I think bad information is the hardest hurdle for new SEOs to overcome.
Giving some insight into the complexity of the engines' algos (without overwhelming the reader) would help new SEOs steer clear of the many SEO scams that exist. Remind them that the PhDs at the search engines really are smart and that simply reading an expensive (and useless) e-book will not make you millions overnight.
Thanks
Great primer and I believe, very timely. I could be wrong, but there seems to be another wave of new people visiting the site. I like the link breakdown graphic as well and although that talking googlebot gives me the creeps, what with his skinny double jointed legs and all, it's a great representation.
Rand you should talk about the No Script tag and how it can be used to get around some of the link issues that you mentioned. Great summary of the issue.
Excellent suggestion - I'll definitely do that :)
Good Point
Also, is there any importance or effect of the 'title' attribute in the a href tag?
So many people hire web designers who are great in design but don't know anything about SEO. My friend Carolina Bogart just told me about her new client (fashion company).
Her new client made a website that is nothing but Flash to show how perfect they are in the fashion industry... the result was terrible! Pages weren't indexed by search engines, and traffic was almost 0.
Now Carolina is struggling to rewrite the content from Flash images into a web page.
Yes, it's true that so many people hire web designers who are great in design but don't know anything about SEO.
However, most of the clients who approach these web designers have a great design in mind, with a flashy theme here and there or a completely Flash website inspired by other websites. It is not that the web designers don't know anything about SEO at all. Maybe they knew it, but the client wanted "nothing but flash to show how perfect they are in fashion industry".
It is when they want to do SEO or promote their website in search engines that they come to know their website needs major modifications to compete with their competitors' websites, and that it would have been more effective to build only parts of the website in Flash. Here it is mostly (I would say fully) the client who is responsible and who doesn't "know anything about SEO".
Hey Rand.. most of the things you have mentioned here, 70% of SEOs are already aware of (so many blogs and websites shouting the same thing). But the most special thing about this post (in fact all your posts) is the presentation. It's so simple yet so affirming, and that's why I always keep coming back to you. It's like solving a jigsaw puzzle: you have all the pieces (SEO info) but you don't know how to put them together.
BTW I was using linux browsers earlier to see how my site looks to search engines. seo-browser is a nice option too. thanks for the tip :)
keep up the good work..cheers!!
My personal favorite "Lynx-like" viewer is the right-click tool from Yellowpipe. Just download it and then right-click on any page you are viewing, select "Yellowpipe Lynx Viewer Tool" and a popup window will show you not only what the bots see, but will also show which links on the page the bots can see. Very handy.
that's damn handy! thanks.
I like the picture with the spider......
Does anyone know how Google treats pages that are not spiderable from the home page but are contained in the Sitemap files that Google reads?
Thanks for the post Rand and thanks for the seo-browser site.
Great update and can't wait for more!
Wow, I am surprised I am the first one to notice this. Maybe I'm the only one that's learning new things here and needing to click on the links! :-)
Rand, your links here are swapped.
The meta robots tag link goes to a page describing robots.txt file and vice versa.
Also, you briefly mentioned nofollow, and I knew you wrote this last year, so you wouldn't have included what we now have learned about Google and how they handle nofollow. You may want to update it.
:-) Great post! So easy to read! Thank you so much!
Excellent tips on that one, every designer should see this post.
Keep the good tips coming.
Although Flash was not originally indexable, some Flash content is indexed in Google right now, as this query shows: https://www.google.com/search?hl=en&q=filetype%3Aswf I guess that this is due to the sIFR technology, but if somebody can confirm, it would be great.
Hi Rand,
When can we expect to download this new guide in its entirety, in the form of a PDF or Word document?
I'm not sure how much more Rand has to write, but when he's finished with it we'll probably slap it together in a more cohesive format and will notify the community of its launch.
Great update as usual Rand.
I like the spider! It is so cute. Honestly, Rand and other seomozzers must adore google as they always depict it in such a lovely way!
I never knew a site like seo-browser.com existed. Thanks for the tip. =)
Very nice and easily understandable. I'm sending a link to my clients!
Along with the seo browser, here's a spider simulator which I find useful.