The Robots Exclusion Protocol (REP) is a conglomerate of standards that regulate Web robot behavior and search engine indexing. Despite the "Exclusion" in its name, the REP covers mechanisms for inclusion too. The REP consists of the following:
- The original REP from 1994, extended in 1997, which defines crawler directives for robots.txt (see the example after this list). Some search engines support extensions such as URI patterns (wildcards).
- Its extension from 1996, which defines indexer directives (REP tags) for use in the robots meta element, also known as the "robots meta tag." Meanwhile, search engines support additional REP tags delivered via the X-Robots-Tag HTTP header, so webmasters can apply REP tags in the HTTP headers of non-HTML resources like PDF documents or images.
- The Sitemaps protocol from 2005, which defines a procedure for submitting content to search engines in bulk via (XML) sitemaps.
- The microformat rel-nofollow from 2005, which defines how search engines should handle links where the A element's rel attribute contains the value "nofollow." Also known as a link condom.
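For illustration, here is a minimal robots.txt sketch; the paths are hypothetical, and the wildcard line only works with engines that support that proprietary extension:

    # https://example.com/robots.txt
    User-agent: *
    # Original 1994/1997 syntax: simple path-prefix exclusions
    Disallow: /admin/
    Disallow: /cgi-bin/
    # URI pattern (wildcard), a proprietary extension supported by e.g. Google and Yahoo
    Disallow: /*?sessionid=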
It is important to understand the difference between crawler directives and indexer directives. Crawlers don't index or even rank content. Crawlers just fetch files and script outputs from Web servers, feeding a data pool from which indexers pull their fodder.
Crawler directives (robots.txt, sitemaps) suggest to crawlers what they should crawl and what they must not crawl. All major search engines respect these suggestions, but they might interpret the directives slightly differently and/or support home-brewed proprietary syntax. All crawler directives implicitly allow indexing, which means that search indexes can and do list uncrawlable URLs on their SERPs, often with titles and snippets pulled from third-party references.
All indexer directives (REP tags, microformats) require crawling. Unfortunately, there's no such thing as an indexer directive at the site level (yet). That means that in order to comply with an indexer directive, search engines must be allowed to crawl the resource that provides it.
Unlike robots.txt directives, which can be assigned to groups of URIs, indexer directives affect individual resources (URIs) or parts of pages, such as (spanning) HTML elements. That means each and every indexer directive is strictly bound to a particular page or other web object, or to a part of a particular resource (e.g., an HTML element).
Because REP directives relevant to search engine crawling, indexing, and ranking are defined on different levels, search engines have to follow a kind of command hierarchy:
Robots.txt: Located at the web server's root level, this is the gatekeeper for the entire site. In other words, if any other directive conflicts with a statement in robots.txt, robots.txt overrules it. Usually search engines fetch /robots.txt daily and cache its contents, which means changes don't affect crawling instantly. Submitting a sitemap might clear and refresh the robots.txt cache, in which case the search engine should fetch the newest version of the file.
(XML) Sitemaps: Sitemaps are machine-readable URL submission lists in various formats, e.g., XML or plain text. XML sitemaps offer the opportunity to set a couple of URL-specific crawler directives (better described as hints for crawlers), such as a desired crawling priority or a "last modified" timestamp. With video sitemaps in XML format, it's possible to provide search engines with metadata like titles, transcripts, or textual summaries, and so on. Search engines don't crawl sitemap submissions that are restricted by robots.txt statements.
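To illustrate, a minimal XML sitemap sketch with the per-URL crawler hints mentioned above; the URL and values are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/products/widget.htm</loc>
        <lastmod>2008-01-10</lastmod>       <!-- "last modified" timestamp -->
        <changefreq>weekly</changefreq>     <!-- expected change frequency, a hint -->
        <priority>0.8</priority>            <!-- desired crawling priority, a hint -->
      </url>
    </urlset>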
REP tags: Applied to a URI, REP tags (noindex, nofollow, unavailable_after) steer particular tasks of indexers, and in some cases (nosnippet, noarchive, noodp) even query engines at search query runtime. Unlike crawler directives, REP tags are interpreted differently by each search engine. For example, Google wipes out even URL-only listings and ODP references on their SERPs when a resource is tagged with "noindex," whereas Yahoo and MSN sometimes list such external references to forbidden URLs on their SERPs. Since REP tags can be supplied in META elements of X/HTML content as well as in the HTTP headers of any web object, the consensus is that the contents of X-Robots-Tags should overrule conflicting directives found in META elements.
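In the HEAD section of an X/HTML page, a sketch of the meta element form:

    <meta name="robots" content="noindex, noarchive">

And the equivalent as HTTP response headers of any web object, e.g., a PDF (the unavailable_after date format follows Google's announcement; the date itself is just an example):

    X-Robots-Tag: noindex, noarchive
    X-Robots-Tag: unavailable_after: 15-Jan-2009 00:00:00 GMT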
Microformats: Indexer directives put in microformats overrule page-level settings for particular HTML elements. For example, when a page's X-Robots-Tag states "follow" (or simply carries no "nofollow" value), the rel-nofollow directive of a particular A element (link) wins.
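A hypothetical snippet to illustrate that precedence: the page-level directive allows following, yet the single condomized link is not followed or credited:

    <meta name="robots" content="index, follow">
    ...
    <a href="https://example.com/vouched/">followed link</a>
    <a href="https://example.com/not-vouched/" rel="nofollow">condomized link</a>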
Although robots.txt lacks indexer directives, it is possible to set indexer directives for groups of URIs with server-side scripts acting at the site level that apply X-Robots-Tags to requested resources. This method requires programming skills and a good understanding of web servers and the HTTP protocol.
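One common way to do this (a sketch, not the only possible approach) is an Apache configuration with mod_headers enabled, which sets the header for matching resources sitewide; the file extensions and directive values are just examples:

    # httpd.conf or .htaccess; requires mod_headers
    <FilesMatch "\.(pdf|doc)$">
      Header set X-Robots-Tag "noindex, noarchive"
    </FilesMatch>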
For more information, syntax explanations, code examples, tips and tricks, etc., please refer to these links:
- Crawler/indexer directives
- robots.txt
- robots.txt 101
- REP tags in META elements
- X-Robots-Tags
- Microformats
- URL removal / deindexing
- Sitemaps
Sebastian, January 15, 2008
Sebastian, I have no idea why out of every complicated issue on the internet you decided to latch onto Robots.txt files.
But I'm glad you did. Although I believe it to be somewhat insane.
Heh whatever. Awesome post.
Probably because robots.txt is a popular topic at the moment, due to Google's experiments with REP tags for robots.txt. I think that Webmasters should care, because changes that Google can standardize will stick, although they're pretty much weird (IOW not REP compliant) in their current experimental stage.
Geek stuff is "insane" by design. ;)
Thanks!
well done, sebastian. thanks for shedding some light on an important topic. in my experience, strategic use of robots.txt can be a game-changing step in consolidating a site's link equity and putting important pages in a better position to rank.
i always do whatever is necessary to educate clients that times have changed since https://www.robotstxt.org came online.
specifically - it's true that, strictly speaking, robots.txt doesn't allow wildcards. however, both google and yahoo support pattern matching as an extension of the standard:
https://www.google.com/support/webmasters/bin/answer.py?answer=40367&topic=8846
https://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
with that in hand, you can do a *heap* of powerful stuff to help clients rank better.
Heh popular subject? I need to check out your blogroll more then, since I've seen very little about it (Except from you).
Oh well, looks like I'm getting educated none-the-less.
You're right. Popularity is based on the size of the crowd, and the crowd is small in this case.
Sebastian's site is rapidly becoming one of the best resources on the net for many SEO and coding related topics.
Don't just read the robots.txt stuff that he has posted. Go browse the rest of his site. Without delay. Go on...
Thank you for the compliment. :)
It is not an idle compliment. Your recent post on getting URLs outta Google - the good, the popular, and the definitive way was also excellent.
You have a talent for imparting detail without becoming overly verbose, of which I am more than a little jealous.
Sebastian,
nice job tackling a challenging subject. Great seeing additional information on this "ancient" -- at least in Internet years -- but still very important protocol.
Still amazing how many sites haven't implemented this (even one that disallows nothing, just to avoid 404 errors throwing off their analytics), or have implemented it incorrectly . . . such as accidentally disallowing their entire site, or not including blank lines between disallow records (though technically specified, I'm hoping that the robots have become savvy enough to overlook this issue . . . but who knows).
One statement kind of threw me here, so hoping you can clarify...
For example, using mydomain.com/mypage.htm...
if robots.txt says to disallow mypage.htm, then the robots won't crawl that page. So this would be the case where the meta robots tag on that page says to index or follow (which would be unnecessary anyway), as the meta robots tag won't be seen anyway.
but if robots.txt has no related directive, or even an allow directive (which I'd typically not recommend, as it may not be recognized by all crawlers, though if memory serves me correctly it has finally been recognized by all the majors... that didn't used to be the case), and mypage.htm has a noindex meta, then it seems the statement above would say that robots.txt wins and the page would be indexed.
Maybe I misunderstood, but this would not seem correct, as the meta directive should overrule the txt at that point to give greater control at the page level.
This was by design as the protocol, as mentioned, was designed to be exclusionary. Prior to the allow directive or wildcard pattern matching, all one could do was to disallow, therefore, control was based on where you disallowed.... through robots.txt at the site, directory, or file level, or through meta at the file level.
"Disallow" is a crawler directive, "noindex" is an indexer directive. "Disallow" doesn't disallow indexing, just forbids crawling. If you want to make sure that search engines comply to your indexer directives, you must allow crawling.
Currently you can't restrict indexing in robots.txt. Also, the lack of a "Disallow" statement for a particular URI doesn't mean "index it even when the page has a 'nofollow' tag".
Crawlers don't index stuff, and follow only crawler directives. Indexers (and query engines) don't crawl stuff, and follow only indexer directives.
If a page is disallow'ed, robots.txt wins because search engines can't spot the indexer directives on the page.
If you submit a disallow'ed URL via XML sitemap, robots.txt wins and the engines won't crawl it. In theory they could list the URL picked from the sitemap on the SERPs, in fact that happens only when the uncrawlable URL has strong inbound links.
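In other words, with a hypothetical /mypage.htm:

    # robots.txt
    User-agent: *
    Disallow: /mypage.htm    # crawling forbidden

    <!-- in mypage.htm -->
    <meta name="robots" content="noindex">
    <!-- never fetched, so the noindex is never seen; the URL can still show up
         as a reference-only listing if it has strong inbound links -->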
@Sebastian:
rel="nofollow" is a debated format for microformats. The microformats community did not develop it, it was Google's team.See (Specification, Abstract and Open Issues):
https://microformats.org/wiki/rel-nofollow
And this:
https://www.seomoz.org/blog/12-ways-to-keep-your-content-hidden-from-the-search-engines#jtc46076
(also see Rand's 11th point)
[It appears to be that there is something about rel="nofollow" daily here on seomoz :)]
Yep, rel-nofollow in its original shape is debated at Microformats, but due to enough adopters it's a settled de facto standard. BTW, only Google doesn't discover new stuff from condomized links; other engines (Yahoo/MSN) just don't pass reputation, or ignore it totally (Ask).
It seems the CSS lacks support for DL/DT/DD elements.
Thanks for the edits. :)
Sebastian - it frightens me to think what you put online you need to know so much about how to keep it all away from prying robots.... :)
Another great post about robots. Learned more about Robots in the last couple of posts from Sebastian than on any other site previously.
Shaun, the way more interesting side of the REP is steering search engines the other way round.
"video sitemaps"
I was unaware of that. Worth the price of admission. Thanks.
Thumbs up.
I like the idea of mobile sitemaps.
I agree, that's awesome.
What about the robots-nocontent attribute for marking content at the block/element level? Although I think this is only implemented by Yahoo! - plus it doesn't really block content as such - but thought it should probably be worth a mention. Yahoo has a help page on it: https://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-14.html
Only 500 sites on the whole Web have implemented it, probably because it's, well, politely put, unusable.
The idea is neither new nor bad, but Yahoo's implementation turned out to be a miserable failure. Kudos to Yahoo for trying it, but they should have thought about it for more than half a second before the launch. That's why the block/element-level row in the image above mentions neither Yahoo's robots-nocontent class name nor Google's section targeting.
I'm sorry, to explain the flaws I'd need to write a book and I fear the tiny editor can't eat that. Here is at least a rant.
Excellent post, a little high in the clouds for me, but looking forward to checking out your website!!!
In your diagram you talk about page-level use of a nofollow. If a site was already built and I needed to nofollow 3 subpages (such as about us, contact us & shipping info), would you recommend nofollowing all the links on the site pointing to these 3 pages (which is a lot of work), or can it be done at a page level instead?
And if I can do it at a page level, will I encounter any problems versus the link-level change?
The "nofollow" directive tells search engines not to follow links on an entire page (meta element, X-Robots-Tag), respectively particular links (rel-nofollow) on a do-follow'ed page. The "nofollow" is applied to the link destination, not to the page carrying the links. That means nofollow'ing a contact page condomizes all its outgoing links, but has absolutely no impact on incoming links.
Currently you've no other choice than adding a rel-nofollow to all A elements that point to your contact page. That's a shitload of work, and error-prone.
There's an unofficial way to accomplish that at the site level with Google, but I won't recommend it because this method blocks crawling and indexing, besides marking the contact page as a dangling node that doesn't suck PageRank.
I've developed a flexible and safe method suitable to accomplish that, but I don't know whether or not the search engines will implement it (anytime soon). A few SE engineers discuss my draft, that's all I can tell so far.
Sebastian,
This is a great post explaining the technical aspects of handling robots. I will surely mark this one as a reference.
That being said, in a world of SEO 2.0, tags, blogs, universal search, RSS and syndication, the real and main SEO challenge gets more and more technical: avoiding duplicate content and guiding the robots to the places that are important to us. I think in that field the whole industry is lacking some good practical examples, such as: what content do we block? From whom? When? When do we use nosnippet, and where do we use a meta nofollow?
I think some insights to that area may really leverage the good work you started.
OC
Thanks for your post. You really know what you're talking about
Thanks, helped me explain it easily to someone!
The very recent draft for HTML 5 now includes some stuff about the nofollow attribute.
this is a fantastic post, very informative and inclusive!
Sebastian is the robots.txt God. I hope you get your proposed robots.txt directives, Sebastian, because I know you won't sleep until you do :)
Sebastian: I didn't know about your website, but now I have seen it and I have to say that it is great! From now on, I will have that in my frequently visited sites. :-)
thanks, it was a useful post
Fantastic resource, Sebastian, thanks! Rarely do you come across a good guide like this. The information regarding which directives "win" over others is also great.
Thanks Jane. :) Please keep in mind that search engines might change the rules at any time without notice. All REP standards are non-binding recommendations, or rather suggestions. Currently all engines have a different take on the REP; IOW, each SE maintains a proprietary REP implementation. I wouldn't be surprised if, for example, tomorrow MSN or Ask started to support X-Robots-Tags but insanely decided that robots meta elements deserve priority over HTTP headers. Closely monitoring such changes is a good idea.
sometimes you make my head spin with your pamphlets, but that's okay. i love it. this is great stuff sebastian, thanks.
Thanks and sorry for the headaches. I try to simplify things, but sometimes I fail when the topic is somewhat complex.
Excellent post, thanks.
@all Thanks for your thoughts. :)
An awesome post - please sphinn it, mozzers!
Hi Sebastian, great job.
I see you made the avatar switch here at SEOmoz :)
Hi Pat, thanks, and yes, the real red crab looks better.
Thanks, another great guide to expand my knowledge of robots.txt, though it is tough to keep up with the latest trends and work. Anyway, I'll keep reading; I need more caffeine.
Very interesting, I never had really looked into all the inner workings of robots/crawlers.
I did have a question about crawlers and shopping carts. I have a client who was having issues with crawlers putting hundreds of thousands of dollars into his cart and, of course, abandoning it. This messes up our ability to accurately determine what his true cart abandonment rate is. This month we got 55 carts abandoned in one day for $150,000 total.
So I created a robots.txt file and uploaded it. I was pretty sure that I formatted it correctly, but it still didn't work. So I nofollowed the Add to Cart buttons, and we are still getting a burst of cart abandonments once or twice a month.
www.fs4sports.com/robots.txt if someone wants to peek at it and maybe point out something I missed.
You should disallow the checkout URLs too. But that's not the problem: not all Web robots obey robots.txt. Log every HTTP request for the URIs in question, recording IP, host name, user agent and such. Then check these reports frequently for bots and block those from the shopping cart with a server-side script that serves them a 401 or 403 HTTP response code.
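A rough sketch of that last step in Python (the user-agent patterns are made up; you'd build your own list from the logs):

    import re

    # Hypothetical blacklist of user-agent substrings collected from the access logs
    BAD_BOTS = re.compile(r"(bot|crawler|spider|slurp|libwww-perl|wget|curl)", re.I)

    def cart_status_for(user_agent: str) -> int:
        """Return the HTTP status the cart/checkout script should send."""
        if not user_agent or BAD_BOTS.search(user_agent):
            return 403  # forbidden: don't let robots create or modify carts
        return 200      # normal visitors proceed as usual

    # Example: status = cart_status_for(request_headers.get("User-Agent", ""))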