As you've probably learned, you can't always rely on search engine spiders to do an effective job when they visit and index your website. Left to their own devices, bots can generate duplicate content, perceive important pages as junk, index content that shouldn't serve as a user entry point, and cause many other issues. There are a number of tools at our disposal that let us make the most of bot activity on a website, such as the meta robots tag, robots.txt, the x-robots-tag, the canonical tag, and others.
Today, I'm covering robot control technique conflicts. In an effort to REALLY get their point across, webmasters will sometimes implement more than one robot control technique to keep the search engines away from a page. Unfortunately, these techniques can sometimes contradict each other: one technique hides the instructions of the other, or link juice is lost.
What happens when a page is disallowed in the robots.txt file and has a noindex meta tag in place? How about a noindex tag and a canonical tag?
Quick Refresher
Before we get into the conflicts, let’s go over each of the main robot access restriction techniques as a refresher.
Meta Robots Tag
The meta robots tag provides page-level instructions for search engine bots. It should be included in the head section of the HTML document and might look like this:
<html>
<head>
<title>Article Print Page</title>
<meta name="ROBOTS" content="NOINDEX" />
</head>
Below are the generally supported commands along with a description of their purpose.
NOINDEX: Prevents the page from being included in the index.
NOFOLLOW: Prevents bots from following the links on a page.
NOARCHIVE: Prevents a cached copy of the page from being available in the search results.
NOSNIPPET: Prevents a description from appearing below the page link in the search results, and prevents caching of the page.
NOODP: Prevents the Open Directory Project (DMOZ.org) description of the page from being displayed in the search results.
NOYDIR: Prevents the Yahoo! Directory title and description for the page from being displayed in the search results.
Canonical Tag
The canonical tag is a page-level tag placed in the HTML head of a webpage. It tells the search engines which URL is the canonical version of the page being displayed. Its purpose is to keep duplicate content out of the search engine index while consolidating your pages' strength into one 'canonical' page.
The code looks like this:
<link rel="canonical" href="https://example.com/quality-wrenches.htm"/>
X-Robots-Tag
Since 2007, Google and other search engines have supported the X-Robots-Tag as a way to inform bots about crawling and indexing preferences via the HTTP headers used to serve a file. The X-Robots-Tag is very useful for controlling the indexation of non-HTML media types such as PDF documents.
As an example, if a page is to be excluded from the search index the directive would look like this:
X-Robots-Tag: noindex
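If the file is served through a script rather than directly by the web server, the header can be set in code. Here's a minimal sketch in PHP (the file path and name are made up for illustration):
<?php
// Send the noindex instruction in the HTTP response headers,
// then stream the PDF itself. Headers must be sent before any output.
header('X-Robots-Tag: noindex');
header('Content-Type: application/pdf');
readfile('/var/www/files/whitepaper.pdf'); // hypothetical file path
?>
Most servers can also attach the header through configuration (Apache's mod_headers, for example), which is usually simpler for static files.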
Robots.txt
Robots.txt allows for some control of search engine robot access to a site; however, it does not guarantee a page won't be crawled and indexed. It should be employed only when necessary, and no robots should be blocked from crawling an area of the site unless there are solid business and SEO reasons to do so. I almost always recommend using the meta robots 'noindex' tag for keeping pages out of the index instead.
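For reference, a disallow entry looks like this (the /print/ directory is just an example):
User-agent: *
Disallow: /print/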
Avoiding Conflicts
It is a bad idea to use any two of the following robot access control methods at once.
- Meta Robots 'noindex'
- Canonical Tag (when pointing to a different URL)
- Robots.txt Disallow
- X-Robots-Tag
In spite of your strong desire to really keep a page out of the search results, one solution is always better than two. Let's take a look at what happens when you have various combinations of robot access control techniques in place for a single URL.
Meta Robots 'noindex' & Canonical Tag
If your goal is to consolidate one URL's link strength into another URL and you don't have any better solutions at your disposal, go with the canonical tag alone. Do not shoot yourself in the foot by also using the meta robots 'noindex' tag. If you use both bot herding techniques, it is probable that the search engines won't find your canonical tag at all. You'll miss out on the link strength reassignment benefit of a canonical tag because the meta robots 'noindex' tag has ensured that the canonical tag won't be seen! Oops.
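To make that concrete, here's a sketch of the head a print page might carry, reusing the example markup from earlier in the post: only the canonical tag, no 'noindex'.
<head>
<title>Article Print Page</title>
<link rel="canonical" href="https://example.com/quality-wrenches.htm" />
</head>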
Meta Robots 'noindex' & X-Robots-Tag 'noindex'
These tags are redundant. I can't see any way that having both in place for the same page would directly cause damage to your SEO. If you can alter the head of a document to implement a meta robots 'noindex', you shouldn't be using the x-robots-tag anyway.
Robots.txt Disallow & Meta Robots 'noindex'
This is the most common conflict I see.
The reason I love the meta robots ‘noindex’ tag is that it is effective at keeping pages out of the index, yet it can still pass value from the no-indexed page to deeper content that is linked from it. This is a win-win and no link love is lost.
The robots.txt disallow entry restricts the search engines from looking at anything on the page (including potentially valuable internal links) but does not keep the page's URL out of the index. What good is that? I once wrote a post on this topic alone.
If both protocols are in place, the robots.txt ensures that the meta robots ‘noindex’ is never seen. You’ll get the effect of a robots.txt disallow entry and miss out on all the meta robots ‘noindex’ goodness.
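To picture the conflict, imagine a setup like this (the path and file name are invented for illustration):
In robots.txt:
User-agent: *
Disallow: /print/
And on /print/article.htm:
<meta name="robots" content="noindex" />
The disallow rule stops bots from ever requesting /print/article.htm, so the meta tag on that page is never read.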
Below I'll take you through a simple example of what happens when these two protocols are implemented together.
Here is a screenshot from the Google SERP for a page that is disallowed in the robots.txt and also has a meta robots 'noindex' in place. The fact that it is in Google's index at all is your first clue of a problem.
Here you can see the meta robots 'noindex' page. Too bad the search engines can't see it.
Here you can see that the entire subdomain is disallowed in the robots.txt, ensuring that useful meta robots 'noindex' tags are never seen.
Assuming mail2web.com is sincere in its desire to keep everything out of the search engines, it would be better off using the meta robots 'noindex' exclusively.
Canonical Tag & X-Robots-Tag 'noindex'
If you can alter the <head> of a document, the x-robots-tag likely isn't your best route for restricting access in the first place. The x-robots-tag works better if you reserve it for non-HTML file types like PDF and JPEG. If you have both of these in place, I'd imagine that the search engines would ignore the canonical tag and fail to reassign link value as hoped.
If you are able to add a canonical tag to a page, you shouldn’t be using an x-robots-tag.
Canonical Tag & Robots.txt Disallow
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen. No link juice passed. Do not pass go. Do not collect $200. Sorry.
X-Robots-Tag 'noindex' & Robots.txt Disallow
Because the x-robots tag exists in the HTTP Response Header, it is possible that these two implementations could intermingle and both be seen by the search engines. However, the statements would be redundant and the robots.txt entry would ensure that no links within the page would be discovered. Once again, we have a bad idea on our hands.
---------------------------------
Bonus Points!
I searched high and low for a live example to share here. I wanted to find a PDF that was both robots.txt disallowed AND noindexed with the x-robots-tag. Sadly, I came up empty handed. I'd have dug around all night, but this post had to go live at some point! Please, I beg you, beat me at my own game.
My process was as follows:
2. Start up your HTTP reader. I use HTTPfox.
3. Call up the robots.txt-disallowed PDF file and check the response headers for an X-Robots-Tag noindex entry (a quick command-line alternative is sketched below).
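If you'd rather skip the browser plugin, checking the response headers from the command line looks something like this (the URL is hypothetical):
curl -I https://www.example.com/docs/whitepaper.pdf
An X-Robots-Tag: noindex line in the output is what you're hunting for.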
Good luck! Let me know when you find one!
----------------------------------
The concept I've been driving at here is fairly straightforward. Don't go overboard with your robot control techniques. Choose the best method for the scenario and back away from the machine. You'll be much better off.
Happy Optimizing!
Good post, will be handy for anyone who wants to learn about SE access restrictions. Easy to read and understand =)
I must admit the award for the best robots.txt file I have ever seen goes to Rishi's website:
https://explicitly.me/robots.txt
Anyone seen a better robots.txt?
Oh yes! I remember this! Dear Google...! :)
Not as sophisticated as the Rishi one, but the SEOmoz one has a cool index in its robots.txt :D https://www.seomoz.org/robots.txt
WOW sir! Robots.txt is getting fancier day by day… Remember the Daily Mail SEO job advertisement in their robots.txt? That was a fabulous idea too…
What's the blank Disallow line achieving? Out of curiosity...
I don't think it's "better," but I do recall coming across a feature including this:
https://www.last.fm/robots.txt
Disallow: /harming/humans
Disallow: /ignoring/human/orders
Disallow: /harm/to/self
Good rules for GoogleBot to follow. :) I wonder how a search engine treats links to a robots.txt?
Haha, that robots.txt file is awesome! Great post too, Lindsay.
What's so funny about robots.txt? It's a must-have file on any website.
Thanks for this - genuinely. It's great to get a refocus - I've had conflicts recently on a client and it's caused issues. But all cleared up thanks to meta tags and a clean robots file! Thanks!
In my opinion, robots.txt is the most efficient way to mess up your entire SEO strategy. And even if you do it right, it still blocks the free flow of trust and PageRank on your site. So I agree that you should try to use other ways of controlling crawler access wherever possible.
To check in audits whether robots.txt does any harm to a site or blocks pages that it should not, I developed a little Firefox add-on. This add-on - roboxt! - shows in the status bar whether the current URL is blocked by robots.txt. The user can also choose in the preferences to mark internal links to blocked URLs and display the total number of blocked links on the current page.
If you are interested in using the add-on, you can download it here at Mozilla: https://addons.mozilla.org/en/firefox/addon/roboxt/ (a new version that is compatible with FF 6 is currently being reviewed by Mozilla and will be available soon). For further information, consult this short manual: https://nikolassv.de/roboxt-en/
Is there an english version of the plugin?
Yes, the plugin is translated and should detect your browser's language automatically. The correct link to the English version of its entry at Mozilla is: https://addons.mozilla.org/en/firefox/addon/roboxt/
(the link in my last comment still points to the de-URL)
Hello Lindsay
It's a great post on robots.txt. One thing I would like to add on the directive list: the correct command to keep the Yahoo! Directory title and description for the page out of the search results is NOYDIR, not NODIR. I hope everyone has it right on their list. Cheers
Lindsay, thanks for making some of the more technical aspects of SEO easy to understand. This post definitely explained some stuff that was "foggy" for me.
Lindsay -
Great post here! I love seeing a good technical SEO post here on the Moz blog.
This post was particularly actionable for me today with a client. This one is bookmarked for sure.
Cheers!
I've never thought about this before... Thanks.
Great post about robots.txt. Really, it will be very useful for SEO. Thank you, keep it up.
Thanks Lindsay. You have covered all technical things.
One more thing: what would happen if you put this entry into your robots.txt file?
User-agent: *
Disallow: /robots.txt
I know this is a crazy idea, but try it :)
Cheers..
Have you tried this yourself? What was the result?
This looks totally absurd. What would that line do?
Is it not simpler to just use the robots.txt?
With noindex you don't prevent Google from crawling your page, so why can't Google find a canonical if you use the two together?
Great post Lindsay :-)
I will add that 9 out of 10 times you should use:
<meta name="robots" content="noindex">
And not
<meta name="robots" content="noindex,nofollow">
Which is unfortunately a tag I see a lot.
Thanks for such an interesting post. I am regular reader of your blog and implement your valuable suggestions.
Great post Lindsay! Anyway, I share the same doubt as Marcoswidung, and a clarification from you on this would be much appreciated. I've often used robots.txt to stop the bot from crawling useless pages (often created by the CMS) to optimise crawling resources. Is this a good idea, or is meta robots noindex always the best option?
Thanks,
Ale
Hey Lindsay,
One issue which I haven't seen discussed, and one which still confuses me, is how to handle websites that are available via both http and https. I usually use PHP with an if/then statement like the one below to check whether https is on, and if so I add meta robots noindex tags. Is this a good strategy?
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
echo '<meta name="robots" content="noindex,nofollow">';
}
Hi Riona
You should typically use a 301 redirect in this case, to redirect users to the http version.
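A minimal sketch along the lines of your snippet (assuming the same $_SERVER variables are available; untested, just to show the idea):
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    // Permanently redirect the https request to its http equivalent
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']);
    exit;
}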
I didn't think about the noindex. Good points. I'll have to bookmark this and check it out again later.
Great post Lindsay... Thanks for sharing very useful information...
Great read Lindsay,
This is all technical stuff that should be implemented on every website, and it will definitely resolve many problems in your SEO work. robots.txt and x-robots tags are vital, and you describe them very well here. Thanks.
A nice refresher on a useful topic, thanks!
I've been using robots.txt to exclude entries from the engines since I don't have the ability to actively edit the headers of my pages (gotta love Wordpress). It's been working well for me so far, though from what you say, it sounds like the meta robots tag would be vastly superior. Oh well, whatcha gonna do?
An excellent, concise post - thanks.
The value of managing robot access effectively cannot be overstated.
Until you've checked what's actually being indexed by Google (especially from a large website) you can easily fail to appreciate the amount of junk that can get indexed and used by search engines - to the detriment of your online objectives and goals.
This knowledge is one of the SEO-related things that very few "normal" people know and therefore separates a professional SEO from the Average Joe who has read one article on SEO.
Therefore you also need to know this if you work with SEO, as this is one of the reasons anyone should hire you.
In the future more and more people will know some basic SEO stuff (titles, descriptions, including keyword in content etc.). But the more technical stuff is not going mainstream anytime soon.
Great walkthrough.
Great description of the indexing process and of avoiding conflicts. I have just tweeted this to my clients and future clients, who will find it a particularly good point of reference.
Good post, though I miss a discussion of the potential use of robots.txt for avoiding spider traps. If your crawl budget is limited, it might be worth stopping spiders from even reaching certain pages and letting them spend more of the budget on pages that are more important from a landing page perspective, despite missing out on some link juice. For example, complex internal search URLs resulting from faceted navigation versus product pages. In those cases, add a parameter to the URLs you do not want crawled and block that parameter in robots.txt, along the lines sketched below. And make sure to noindex those pages as well.
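Something like this, with the parameter name invented just to illustrate (Google honours the * wildcard in robots.txt):
User-agent: *
Disallow: /*nocrawl=1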
I am not a fan of the robots.txt at all. I view it as a link juice brick wall, plus it doesn't even keep pages out of the index. Sure, it is easy to implement but does it really meet the goals of search marketing? Check out this post where I rant at length about the robots.txt and why it is pretty useless. https://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions. Would love to hear your thoughts.
I can't deny that I get a bit hesitant about implementing the so-called hatchet approach on our e-commerce site when reading your posts... But the reason I am considering it is our site's rather complex faceted navigation, which features 10 facets with approx. 10-15 values each. In combination with our catalogue of 4 million+ products, that faceted navigation exposes bots to hundreds of thousands (if not millions, I am not very good at math) of unique and often very complex URLs leading to pages featuring quite thin and in many cases practically identical content. What I am afraid of is that search bots get tired of following those URLs and bail out instead of eating our product pages. I would of course only disallow complex URLs (3+ or so facets) and URLs containing params/param values considered irrelevant from a landing page perspective.
Yes, there might be a few inbound links to those "far out" URLs but if we can make bots crawl important pages instead of useless pages, don't you think that we have gained something? Or do you consider the hatchet approach useless for "spider trap" prevention? You do mention those disallowed URLs still end up in the SE index.
I know there are other methods to avoid spider traps out there, but they require quite a lot of system development on our part.
This topic would make a great blog post and discussion. I'm not going to deny that crawler fatigue can become an issue with gigantic websites. In your case, your website produces a huge number of 'search results' pages. That is a concern all its own, because the search engines love to hate these types of pages, and hundreds of thousands of them are certainly capable of getting you in trouble.
Before you can make a decision about robot access methods for the site, you'd need to make sure that your deeper product pages are all accessible through alternate means.
Good point - I've seen a few people throw everything but the kitchen sink at Google, and a lot of times these tactics actually impede each other and slow down what you're trying to accomplish. It can be frustrating to wait, but doing it wrong can take a whole lot longer than having a little patience.
I'd just add that, while it's not an ideal solution, Google Webmaster Tools parameter blocking is another option, if you're really in a bind. I don't think it's a good, long-term approach, but it's fast. If you do something bad and end up indexing a ton of URL-based duplicates, I sometimes recommend it.
Great post! Personally, I've always used the robots.txt file to exclude files/pages/etc. I have a couple questions for you regarding the robots.txt file:
Thanks everybody for any feedback you can give me.
-REF
Hey Rob,
You will want to use the following:
User-agent: *
Disallow:
What a great little rundown - good stuff yet again. I am a firm believer that the canonical tag is the most underused and neglected tag for something so effective. I wish people would use it more, and correctly :)
Nice post Lindsay :-) very informative and I'm sure it will be linked to by many :-)
I especially liked your take on the Robots.txt Disallow & Meta Robots 'noindex'
This post is exactly what I needed to read. I have felt that the more I researched robots.txt vs meta robots 'noindex', the more confusing I made a rather simple topic. Thanks again.
This issue is more extensive than it looks; no doubt there is always something to learn in SEO. I think this is a complete guide to indexing and restriction techniques. Excellent recommendations.
Lindsay, good write-up. Keep in mind that search engines can choose not to follow your robots.txt or canonical directives. For example, the root of a gov site was entirely disallowed with robots.txt, there were thousands of links pointing to it, and Google decided to index it anyway, for the sake of user experience.
Generally, you can say that robots.txt Disallow & Meta Robots 'noindex' should not be combined, but one can back up the other in such cases.
Cheers!
Nice Post!
Robots.txt is one of the most important tools, used by almost every webmaster, but there is a lot of confusion because the different tags have almost similar functionality (or at least it looks that way). The post defines clear differences between the tags and how to work with them…
The only thing I wanted to add from my experience: understand the scenario and use one tag only… using multiple tags may hurt you instead of saving you…
Overall a great read!
Yes, robots.txt is a good topic. When a search engine crawler comes to your site, it looks for a special file called robots.txt, which tells the search engine spider which web pages of your site should be indexed and which should be ignored. But <meta name="ROBOTS" content="NOINDEX" /> provides a good way too.
Here are some common robots meta tags:
<meta content="NOINDEX" name="ROBOTS"> - Ignore content and follow links
<meta content="NOFOLLOW, INDEX" name="ROBOTS"> - Include content and do not follow links
<meta content="NOINDEX,NOFOLLOW" name="ROBOTS"> - Ignore content and do not follow links
<meta content="INDEX,FOLLOW" name="ROBOTS"> - Include content and follow links
<meta content="NOARCHIVE" name="ROBOTS"> - Do not show a cached link for the page in the search results
<meta content="NOODP" name="ROBOTS"> - The Open Directory Project (ODP) title and description for the page should not be displayed in the search results
<meta content="NOYDIR" name="ROBOTS"> - The Yahoo! Directory title and description for the page should not be displayed in the search results
<meta content="NOSNIPPET" name="ROBOTS"> - Only the title is displayed in the search results, not the description or text snippet for the page