Some of the Internet's most important pages, on many of its most linked-to domains, are blocked by a robots.txt file. Does your website misuse the robots.txt file, too? Find out how search engines really treat robots.txt-blocked files, entertain yourself with a few seriously flawed implementation examples, and learn how to avoid the same mistakes yourself.
The robots.txt protocol was established in 1994 as a way for webmasters to indicate which pages and directories should not be accessed by bots. To this day, respectable bots adhere to the entries in the file... but only to a point.
Your Pages Could Still Show Up in the SERPs
Bots that follow the instructions of the robots.txt file, including Google and the other big guys, won’t index the content of the page but they may still put the page in their index. We’ve all seen these limited listings in the Google SERPs. Below are two examples of pages that have been excluded using the robots.txt file yet still show up in Google.
Cisco Login Page
The Cisco login page highlighted below is blocked in the robots.txt file, but it shows up with a limited listing on the second page of a Google search for 'login'. Note that the Title Tag and URL are included in the listing. The only thing missing is the Meta Description or a snippet of text from the page.
WordPress’s Next Blog Page
One of WordPress.com’s 100 most popular pages (in terms of linking root domains) is www.wordpress.com/next. It is blocked by the robots.txt file, yet it still appears in position four in Google for the query ‘next blog’.
As you can see, adding an entry to the robots.txt file is not an effective way of keeping a page out of Google’s search results pages.
Robots.txt Usage Can Block Inbound Link Effectiveness
The trouble with using the robots.txt file to block search engine indexing is not only that it is quite ineffective, but also that it cuts off your inbound link flow. When you block a page using the robots.txt file, the search engines don't index the contents (OR LINKS!) on the page. This means that if you have inbound links to the page, that link juice cannot flow on to other pages. You create a dead end.
(If this depiction of Googlebot looks familiar, that's because you've seen it before! Thanks Rand.)
Even though the inbound links to the blocked page likely have some benefit to the domain overall, this inbound link value is not being utilized to its fullest potential. You are missing an opportunity to pass some internal link value from the blocked page to more important internal pages.
3 Big Sites with Blocked Opportunity in the Robots.txt File
I've scoured the net looking for the best bloopers possible. Starting with the SEOmoz Top 500 list, I hammered OpenSiteExplorer in search of heart-stopping Top Pages lists like this:
Ouch, Digg. That's a lot of lost link love!
This leads us to our first seriously flawed example of robots.txt use.
#1 - Digg.com
Digg.com used the robots.txt to create as much disadvantage as possible by blocking a page with an astounding 425,000 unique linking root domains, the "Submit to Digg" page.
The good news for Digg is that from the time I started researching for this post to now, they've removed the most harmful entries from their robots.txt file. Since you can't see this example live, I've included Google's latest cache of Digg's robots.txt file and a look at Google's listing for the submit page(s).
As you can see, Google hasn't yet begun indexing the content that Digg.com had previously blocked in the robots.txt.
I would expect Digg to see a nice jump in search traffic following the removal of its most linked-to pages from the robots.txt file. They should probably keep these pages out of the index with the robots meta tag, 'noindex', so as not to flood the engines with redundant content. This move would ensure that they benefit from the link juice without cluttering the search engine indexes.
If you aren't up to speed on the use of noindex, all you have to do is place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex, follow">
Additionally, by adding 'follow' to the tag you are telling the bots that, while they shouldn't index that particular page, they should still follow the links on it. This is usually the best scenario, as it means that the link juice will flow to the followed links on the page. Take, for example, a paginated search results page. You probably don't want that specific page to show up in the search results, as the contents of page 5 of that particular search are going to change from day to day. But by using robots 'noindex, follow', the links to products (or jobs, in this example from Simply Hired) will still be followed and hopefully indexed.
Alternatively, you can use "noindex, nofollow", but that's a mostly pointless endeavor, as you're blocking link juice just as you would with the robots.txt file.
#2 - Blogger.com & Blogspot.com
Blogger and Blogspot, both owned by Google, show us that everyone has room for improvement. The way these two domains are interconnected does not utilize best practices and much link love is lost along the way.
Blogger.com is the brand behind Google's blogging platform, with subdomains hosted at 'yourblog.blogspot.com'. The link juice blockage and robots.txt issue that arises here is that www.blogspot.com is entirely blocked with the robots.txt. As if that wasn't enough, when you try to pull up the home page of Blogspot, you are 302 redirected to Blogger.com.
Note: All subdomains, aside from 'www', are accessible to robots.
A better implementation here would be a straight 301 redirect from the home page of Blogspot.com to the main landing page on Blogger.com. The robots.txt entry should be removed altogether. This small change would unlock the hidden power of more than 4,600 unique linking domains. That is a good chunk of links.
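For reference, a homepage-to-homepage 301 like that is a one-line job on most servers. Here's a rough sketch assuming Apache with mod_alias (the configuration itself is purely hypothetical, not how Google actually serves Blogspot):
-----------------
# Hypothetical rule for the www.blogspot.com host: permanently
# redirect the homepage (and only the homepage) to Blogger.
RedirectMatch 301 ^/$ https://www.blogger.com/
-----------------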
#3 - IBM
IBM has a page with 1001 unique linking domains that is blocked by the robots.txt file. Not only is the page blocked in the robots.txt, but it also does a triple-hop 302 to another location, shown below.
When a popular page is expired or moved, the best solution is usually a 301 redirect to the most suitable final replacement.
Superior Solutions to the Robots.txt
In the big-site examples highlighted above, we've covered some misuses of the robots.txt file, but they don't cover every scenario. Below is a list of effective solutions for keeping content out of the search engine index without leaking link juice.
Noindex
In most cases, the best replacement for robots.txt exclusion is the robots meta tag. By adding 'noindex' and making sure that you DON'T add 'nofollow', your pages will stay out of the search engine results but will pass link value. This is a win/win!
301 Redirect
The robots.txt file is no place to list old worn out pages. If the page has expired (deleted, moved, etc.) don't just block it. Redirect that page using a 301 to the most relevant replacement. Get more information about redirection from the Knowledge Center.
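As a quick sketch (assuming an Apache server and hypothetical URLs), a single retired page can be permanently redirected with one line in your .htaccess file:
-----------------
# 301 an expired page to its closest replacement (example URLs)
Redirect 301 /old-whitepaper.html https://www.yoursite.com/whitepapers/current.html
-----------------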
Canonical Tag
Don't block your duplicate page versions in the robots.txt. Whenever possible, use the canonical tag to keep the extra versions out of the index and to consolidate the link value. Get more information from the Knowledge Center about canonicalization and the use of the rel=canonical tag.
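If you haven't implemented it before, the tag sits in the <head> of each duplicate version and points at the version you want indexed. A minimal sketch, with a hypothetical URL:
<link rel="canonical" href="https://www.yoursite.com/widgets/blue-widget/">
Print versions, session-ID URLs and tracking-parameter variants of that page would all carry the same tag.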
Password Protection
The robots.txt file is not an effective way of keeping confidential information out of the hands of others. If you are making confidential information accessible on the web, password protect it. If you have a login screen, go ahead and add the 'noindex' meta tag to the page. If you expect a lot of inbound links to this page from users, be sure to link to some key internal pages from the login page. This way, you will pass the link juice through.
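On Apache, for instance, basic password protection only takes a few .htaccess directives. A minimal sketch; the .htpasswd path is hypothetical and should live outside your web root:
-----------------
# Require a valid username/password for everything in this directory
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /home/yoursite/.htpasswd
Require valid-user
-----------------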
Effective Robots.txt Usage
The best way to use a robots.txt file is to not use it at all. Well... almost. Use it to indicate that robots have full access to all files on your website and to direct robots to your sitemap.xml file. That’s it.
Your robots.txt file should look like this:
-----------------
User-agent: *
Disallow:
Sitemap: https://www.yoursite.com/sitemap.xml
-----------------
The Bad Bots
Earlier in the post I referred to "bots that follow the instructions of the robots.txt file," which implies that there are bots that don't adhere to the robots.txt at all. So while you're doing a good job of keeping out the good bots, you're doing a horrible job of keeping out the "bad" bots. Additionally, filtering to allow bot access only to Google/Bing isn't recommended, for three reasons:
- The engines change/update bot names frequently (e.g. Bing's bot name changed recently)
- Engines employ multiple types of bots for different types of content (e.g. images, video, mobile, etc.)
- New engines and content discovery technologies getting off the ground (e.g. Blekko, Yandex, etc.) stand even less of a chance when preferences for existing user agents become institutionalized, and search competition is good for the industry.
Competitors
If your competitors are SEO savvy in any way, shape, or form, they're looking at your robots.txt file to see what they can uncover. Let's say you're working on a new redesign, or a whole new product line, and you have a line in your robots.txt file that disallows bots from "indexing" it. If a competitor comes along, checks out the file and sees a directory called "/newproducttest", they've just hit the jackpot! Better to keep that on a staging server, or behind a login. Don't give all your secrets away in this one tiny file.
Handling Non-HTML & System Content
- It isn't necessary to block .js and .css files in your robots.txt. The search engines won't index them, but sometimes they like the ability to analyze them so it is good to keep access open.
- To restrict robot access to non-HTML documents like PDF files, you can use the x-robots tag in the HTTP header; a sketch follows this list. (Thanks to Bill Nordwall for pointing this out in the comments.)
- Images! Every website has background images or images used for styling that you don't want to have indexed. Make sure these images are displayed through CSS rather than the <img> tag as much as possible. This will keep them out of the index without your having to disallow the "/style/images" folder in the robots.txt.
- A good way to determine whether the search engines are even trying to access your non-HTML files is to check your log files for bot activity.
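Here is the x-robots sketch promised above. Assuming Apache with mod_headers enabled, you can attach the directive to every PDF (or any other non-HTML file type) your server delivers:
-----------------
# Keep all PDFs out of the index via the HTTP header
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
-----------------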
More Reading
Both Rand Fishkin & Andy Beard have covered robots.txt misuse in the past. Take note of the publish dates and be careful with both of these posts, though, because they were written before the practice of internal PR sculpting with the nofollow link attribute was discouraged. In other words, these are a little dated but the concept descriptions are solid.
- Rand’s: Don’t Accidentally Block Link Juice with Robots.txt
- Andy’s: SEO Linking Gotchas Even the Pros Make
Action Items
- Pull up your website’s robots.txt file(s). If anything is disallowed, keep reading.
- Check out the Top Pages report in OSE to see how serious your missed opportunity is. This will help you decide how much priority to give this issue compared to your other projects.
- Add the noindex meta tag to pages that you want excluded from the search engine index.
- 301 redirect the pages on your domain that don’t need to exist anymore and were previously excluded using the robots.txt file.
- Add the canonical tag to duplicate pages previously robots.txt’d.
- Get more search traffic.
Happy Optimizing!
(post edited 10/12/10 @ 5:20AM to reflect x-robots protocol for non-html pages)
I suppose a lot of people use the robots.txt file rather than "noindex, follow" in the meta tags to block pages - simply because it is easier to handle. If, for example, a homepage with lots of pages and just one template should block some pages, the faster way would be the robots.txt (without the need to make duplicate templates....). But the fastest isn't always the best way - as you taught us :-).
This is true! As with most things SEO, a little (or a lot) of extra work can really pay off. These are the type of tactics that separate basic optimization from advanced optimization.
Thanks for the examples, Lindsay! It's time to dive into the robots.txt file on a couple of client sites (and my own) to make sure everything is still running smoothly.
This is actually a very useful post and something that is easy (for me at least) to forget or overlook.
It got me thinking though, is there much of a reason to use noindex and nofollow most of the time? Wouldn't noindex, follow be better in almost all cases?
If that became the default use, it would be a lot easier than trying to remember to switch it for specific SEO valuable pages etc.
The only use I can see for meta 'noindex,nofollow' might be a page that you want to keep out of the index and one that is also full of paid links. However, 'noindex,nofollow' is really just like adding a file to the robots.txt in that it blocks link juice flow. The main difference being that the 'noindex,nofollow' tag would actually be effective at keeping a listing out of the SERPs.
An additional advanced note: it isn't necessary to specify 'follow'. 'Follow' is the default and is implied. Your standard entry for pages you want to keep out of the index could simply be:
<meta name="robots" content="noindex">
instead of;
<meta name="robots" content="noindex, follow">
The result is the same.
Doh! :) Good to know about the default... durr.
I know there's some debate on this subject, from an internal PR-flow standpoint, but personally, I think NOINDEX,NOFOLLOW can be handy when you have an architecture where all of the pages past a certain level are either duplicates or pages you definitely don't want indexed. For example, let's say your site looks like:
(1) Home > (2) Search > (3) Product > (4) Product Options > (5) Cart > (6) Checkout
You might not want to index "Product Options," because they'll be near-duplicates, and you don't want to index your shopping cart and checkout pages, so why not cut off the bots at "level" (4) by tagging those pages with NOINDEX,NOFOLLOW. It ends up being a lot cleaner than trying to nofollow all the links, block parameters, etc.
Dr. Pete, even if you don't want those pages indexed, why not allow the links to be followed? Parameters aren't an issue if you use canonical tags & that way you don't lose out on the benefit of any incoming links from people linking to your product pages or something of that nature.
The issue of paid or sponsored links would definitely be a good use for 'noindex,nofollow', but I would imagine that's a fairly rare occurrence for most people.
LOL - I described the whole setup and then never even mentioned the NOFOLLOW aspect. Nice. That's what babies do to your brain.
In that example, since you know that nothing after level (4) should be indexed, I think it can be cleaner to cut off the bots entirely. You're not typically going to be cross-linking those pages in any way useful to search visitors, and you can save the crawlers bandwidth and focus them on your more important pages. You're basically saying "Everything below Level 4 is useless to you - ignore it, and focus on 1-3".
I have to agree with Dr. Pete. Entire sections of a site should be sliced out of the equation. Better to expend more energy on those higher-level pages in this scenario. The extremely minute PR that such deep pages pass isn't necessarily worth the effort.
How about an easier thought process?
If you could envisage ever wanting to use "noindex, nofollow", that is likely a page that should have any juice redirected away to somewhere else using canonical tags or some kind of cloaking/bot herding.
In which case, it shouldn't really matter about the nofollows on the page.
The thing is those pages don't get a tiny amount of PageRank - you might well have a link from every product page even if they are not in sitewide navigation. If you are avoiding using link level nofollow because you don't know what is happening to the juice, and avoiding javascript due to accessibility problems and have raw links, that can add up to a fair amount of juice.
It is nice to get a link to that old post and good that you pointed out about the date.
I am actually going to be partially debunking myself soon as for the last 16 months since the nofollow change was announced I have seen some pretty compelling evidence that we are missing a huge chunk of the equation.
This is what I wrote on Matt's nofollow change post:
Halfdeck, my hope/guess, and this is purely speculation, is that Google uses something like “DDD” Dynamic Domain Dampening.
It is something that is possibly needed to handle hanging/dangling pages effectively anyway, whereby rather than giving this part of the dampening factor to the whole web, it is redistributed within the domain instead.
Note: dynamic domain dampening is just something I came up with for a tweet. For a while now I have been convinced that it might apply to all blocked URLs as well, including robots.txt.
Whilst I can do some testing, and I possibly have the best scripts to do it, it is harder to prove anything now than in the old days of showing that some kind of sculpting improved results for specific ranking goals.
The good news is I am going to be releasing my code soon as open source so other people can blow up their brains on this as well.
This really IS a must-read article... It's this kind of knowledge that separates people who "think" they know SEO from advanced professionals. Now - if I could just do a better job of motivating clients to actually implement such tasks when pointing them out... < sigh >
Can't you just use X-Robots for excluding non-HTML content?
For example: https://pastie.org/1214742
Thanks Bill! You've taught me something.
It is possible to block indexing of non-HTML files by adding the equivalent directive to the HTTP header of a document. Others, read more about the x-robots tag from Google here. Bing also supports it, and describes their policies a little on page 13 of this document (it is a downloadable PDF from Bing).
Thanks again, Bill.
I've updated the post to reflect the x-robots protocol for non-HTML pages and provided a little attribution to you for being the first to point it out. Big thanks, Bill!
No problem - happy to help. Great article!
Whoa! Seriously great tip. I just saw that some of the most powerful pages on a site I'm working on are being wasted. Now for yet another item in my to-do list. :0)
Really? I'm amazed how many sites are doing this without realizing it. To get the most value out of this adjustment, make sure that these top pages that will be changed to 'noindex,follow' also have nice internal links to your most important pages that are in the index. That way you will pass that link value through.
Good luck!
Thanks Lindsay, I am going through one of my websites and found 23 useless pages being indexed... thanks a million, this should move it up from No. 2 to No. 1 on Google.
That sounds like a big win in my book! Congratulations!
I am always amazed how a little mistake like that, when rectified, can make such a big difference. I've got to be honest, it looks like it's also moved its second keyphrase from 7th to 2nd.
Thanks again for a great post.
This is a bit anecdotal (and I'm curious about other SEO's experiences), but I've also found that Robots.txt can be pretty lousy for removing content that's already been indexed. If you have it from the beginning, it tends to work alright (not foolproof, as you said, especially if you get inbound links to those pages). Once your content is indexed, though, adding Robots.txt to remove it is extremely unpredictable.
Robots.txt is completely useless for removing content that's already been indexed. Think about it... you disallow crawling of some portion of your website's content, so the bots don't crawl the content. That's the critical point: the bots don't crawl the content. So if you have implemented meta robots on each page for noindex, nocache, noarchive, etc., the bots never see the instructions because they don't bother to crawl the page. In summary:
DO: Use meta robots to help "deindex" batches of content
DON'T: Use robots.txt in tandem with meta robots to help deindex batches of content
OR: Use GWMT to quickly remove indexed URLs and then add the meta robots to continue to keep the content out of search engine indexes
I think the problem is that a lot of people think Robots.txt is a hatchet that can be applied to instantly hack away already indexed pages. We give so many warnings about mis-using it (and rightly so) that people give it near magical properties. Of course, then they add a bunch of pages to Robots.txt only to find out weeks or months later that little or nothing happened.
Great article Lindsay, just the one I was looking for. I was struggling to find a way to add a # in my URL for a classified site that I had, and this meta noindex was just the answer I needed. Keep up the great work.
Excellent post. I've seen robots.txt inaccurately explained in so many blog posts by "experienced" SEOs, it's like a nasty virus! I try to always leave a comment and clear up the confusion, but now I can just include a link to this post. Thanks.
Exactly! I was thinking the very same thing about answering questions in Q&A. We often get questions about blocking pages, and now we have a nice, succinct post to point them to. Lindsay obviously rocks. :)
It’s amazing how many big names out there are misusing the robots file. I too have been guilty of this in the past but now use the preferred option of the robots Meta tag to noindex pages.
Thanks for the useful explanations as well as the alternative methods available
That's a good reminder post. I certainly need a dose of such posts from time to time, as I am suffering from information overflow. I would like to bust one myth regarding the use of robots.txt: there is no 'Allow:' field in the robots.txt standard (https://www.robotstxt.org/robotstxt.html), yet you can still find webmasters using it. Here is an interesting video from Matt Cutts on 'Can I use robots.txt to optimize Googlebot's crawl?'. Note how Matt reacts :)
Had to wait a week to read it, but the wait was well worth it Lindsay. What a fantastic resource for robots.txt.
It's posts like this that keep reminding me that I am not nearly as advanced in SEO as I thought I was and instead am simply a student with much more to learn. [sigh]
I love the way Google ignores basic SEO best practice. I wonder what sort of SEO capability they have internally looking at their own assets.
Thank you! That cleared most of the questions. I can understand the effort you put in to get this post up. Two thumbs up!!
Great work. I was losing a lot of juice; I was using the meta robots without "..., follow." Thank you, you're beautiful and smart.
When using a robots.txt file, if, for example, I need to disallow a page like the terms & conditions page on an ecommerce shop, because it will come up as duplicate content and that can affect the SEO, then I will only disallow that individual file.
Hi Victoria - Ideally, you want to ensure that your website does not generate duplicate content. The next best solution is to add the meta robots tag, 'noindex,follow'. This ensures that any link value passed into the page will transfer through to other pages and help them rank. Save the link juice!
Hi Lindsay - thanks for all the robots.txt best practice tips. I gave myself 9 out of 10 for my latest implementation :)
I do have a question for you relating to one of your cautionary points: "the practice of internal PR sculpting with the nofollow link attribute (is) discouraged".
What technique would you recommend for internal PR sculpting?
Is it OK to use robots.txt for outgoing links? I use links for price comparison on my site, and for this purpose I have to give links to other sites on every page, so I have added robots.txt entries for every such link. Does Google mind such activity, or do they not care how many times you use it? Please share your knowledge on this.
Slightly late comment!
Would it be sensible to use noindex,follow on a search page?
The page has its own URL (/search.php) but of course the content changes every time.
I use robots.txt on my pages which have duplicate content to avoid any penalties. Those pages don't have any link juice to pass on.
Thanks for the tips! I recently worked with one of my larger clients on using robots.txt to segregate the XML sitemap crawl across two websites. Your tips brought back memories. I appreciate the information.
Some nice examples that are often lacking from articles about robots.txt. I'm going to be a bit pedantic, though, and point out that it is not the title tag in the Cisco results - after all, the page is not indexed, so how could the search engine know what the title is? It probably comes from the text of incoming links.
I strongly agree with you that the best use of robots.txt is almost always "don't bother."
Robots.txt has a very high potential for mistakes and damage to overall SEO, and in the end it is handled as a suggestion even by the major players (Google and Bing), as shown in your post. It can be used to improve crawl efficiency, especially with multiple sitemaps, but I would advise most people to use page-level handling of both follow and index directives.
Insightful post.
One question I am having is whether or not using robots.txt is still useful for trying to prevent affiliate URLs from being indexed. I am aware that rel=canonical can solve this well, but I'd like to know people's thoughts on avoiding the backend by modifying robots.txt with a wildcard like Disallow: *?affid= or something like that.
The trouble with blocking tracking-code variants in the robots.txt is that you are creating a link juice dead end for any value that the inbound affiliate links may be creating for your site. You would be much better off using the canonical tag to consolidate the duplicate page URLs that these links can create.
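To illustrate with hypothetical URLs: every affiliate-tagged variant of a page simply carries a canonical pointing back at the clean URL, so nothing needs to be disallowed and the inbound link value gets consolidated:
<!-- served on https://www.yoursite.com/product/?affid=123 (hypothetical) -->
<link rel="canonical" href="https://www.yoursite.com/product/">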
Lindsay,
Great post! I can't count how many times I have had to tell clients or other people on the web to stop misusing robots.txt and the damages it can do. I love to see the grin on their face when they see the results of removing those pages from robots.txt and adding the meta noindex tag. Thanks for taking the time to write this out!
Well done. A very nice post indeed.
Awesome post - I'll have to review it once and for all hahaha.
Wow, Digg really doesn't get SEO right at all! It feels like it wasn't that long ago when they didn't redirect from "www" to "non www". Awesome example of how to shoot yourself in the foot!
This is an excellent summary with good real-world examples Lindsay. A great reference resource for those who risk making some potentially drastic errors!
Great post, thank you for this. I've always favored robots.txt over meta noindex, and now I can see why that's not always the optimal way to do it. I'll be making meta 'noindex, follow' a best practice for pages I don't want indexed but that may still receive links. Thanks!
- Evan
Let me ask a maybe provocative question.
If robots.txt invites so much misuse, why not simply use it, from the first seconds of a website's life, just to block the "backend" folders (java, scripts, admin...) and to point bots to your sitemap.xml, and rely on the meta robots tag, 301s and canonicals for all the content/frontend-related pages?
I like your thought process. The problem, in my view, is that the search engines like to have permission to view that stuff, especially the js, to ensure nothing funny is going on there. The admin content should be password protected anyway, so why bother?
Thanks Lindsay.
I was thinking about the admin folder and "backend" in general, having in mind the classic robots.txt that comes with a CMS like Joomla, which looks like this:
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
I usually have to retouch it (images, media and, from what you're saying, other stuff).
If what you really want is to keep these pages out of the search engine index, the meta robots tag would be more effective at that. In the name of conserving robot resources on your site, you could probably safely nofollow some of these admin pages as well. I agree that you should be careful with the /images and /media defaults!
Hehe, it's fun to know that the best use of robots.txt is to not use it.
Thanks !
I'm using the robots.txt file to block bots from the script file names, like:
Disallow: /index.php
or if I'm using mod_rewrite like:
RewriteRule ^([a-z\-]*\/[a-z\-0-9]*\.html)$ /winkel.php?url=$1 [L]
I block the script too:
Disallow: /winkel.php
Personally, I think this is a situation where you're much better off with either 301-redirects or the Canonical tag. Otherwise, as Lindsay pointed out, you may be cutting off link-juice to those alternate versions of the pages.
Agreed...I use 301's in this situation too.
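To sketch the 301 approach for the mod_rewrite setup above (hypothetical rules built around the winkel.php pattern from the earlier comment), you can bounce direct requests for the script back to the rewritten URL instead of disallowing it. The redirect needs to sit before the internal rewrite and check THE_REQUEST so it only fires on external requests and doesn't loop:
-----------------
# An external request for /winkel.php?url=foo/bar.html gets a 301 to /foo/bar.html
RewriteCond %{THE_REQUEST} \s/winkel\.php\?url=([^\s&]+) [NC]
RewriteRule ^ /%1? [R=301,L]

# Existing internal rewrite (unchanged)
RewriteRule ^([a-z\-]*\/[a-z\-0-9]*\.html)$ /winkel.php?url=$1 [L]
-----------------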
Till now I hadn't realised this - thanks for such a nice brief. I have started implementing this on all of my high-traffic sites.
Let the link juice flow! Yippee!