Left to their own devices, search engine spiders will often perceive important pages as junk, index content that shouldn't serve as a user entry point, generate duplicate content, and cause a slew of other issues. Are you doing everything you can to guide bots through your website and make the most of each visit from search engine spiders?
It is a little like child-proofing a home. We use child safety gates to block access to certain rooms, add inserts to electrical outlets to ensure nobody gets electrocuted, and place dangerous items out of reach. At the same time we provide educational, entertaining, and safe items within easy access. You wouldn't open the front door of your unprepared home to a toddler, then pop out for a coffee and hope for the best.
Think of Googlebot as a toddler (if you need a more believable visual, try a really rich and very well-connected toddler). Left to roam the hazards unguided, you'll likely have a mess and some missed potential on your hands. Remove the option to access the troublesome areas of your website, and bots are more likely to focus on the good quality options at hand instead.
Restricting access to junk and hazards while making quality choices easily accessible is an important and often overlooked component of SEO.
Luckily, there are a number of tools that allow us to make the most of bot activity and keep bots out of trouble on our websites. Let's look at the four main robot restriction methods: the Meta Robots Tag, the Robots.txt file, the X-Robots-Tag, and the Canonical Tag. We'll quickly summarize how each method is implemented, cover the pros and cons of each, and provide examples of how each one can be best used.
CANONICAL TAG
The canonical tag is a page-level meta tag that is placed in the HTML header of a web page. It tells the search engines which URL is the canonical version of the page being displayed. Its purpose is to keep duplicate content out of the search engine index while consolidating your pages' strength into one 'canonical' page.
The code looks like this:
<link rel="canonical" href="https://example.com/quality-wrenches.htm"/>
There is a good example of this tag in action over at MyWedding. They used this tag to take care of the tracking parameters that are important to the marketing team. Try this URL - https://www.mywedding.com/?utm_source=whatever-they-want-to-track. Right-click on the page, then view the source. You'll see the rel="canonical" entry on the page.
Pros
- Relatively easy to implement. Your dev group can move on to bigger fish.
- Can be used to source content across domains (see the sketch at the end of this section). This may be a good solution if you have syndication deals in the works but don't want to compromise your own search engine presence.
Cons
- Relatively easy to implement incorrectly (see catastrophic canonicalization)
- Search engine support can be spotty. The tag is a signal more than a command.
- Doesn't correct the core issue.
Example Uses
- There are usually other ways to canonicalize content, but sometimes this is a solid solution given all variables.
- Cindy Krum, a Moz associate, recommends canonical tag use if you run into a sticky situation and your mobile site version is outranking your traditional site.
- If you don't want to track your referral parameters with a cookie, the canonical tag is a good alternative.
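To illustrate the cross-domain use mentioned in the pros above, here is a minimal sketch; the domains and path are hypothetical. The tag sits in the head of the syndicated copy on the partner's site and points back at your original article:
<link rel="canonical" href="https://www.your-original-site.com/quality-wrenches.htm"/>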
ROBOTS.TXT
Robots.txt allows for some control of search engine robot access to a site; however, it does not guarantee a page won't be indexed. It should be employed only when necessary. I generally recommend using the meta robots tag 'noindex' for keeping pages out of the index instead.
Pros
- So easy a monkey could do it.
- Great place to point out XML Sitemap files.
Cons
- So easy a monkey could do it (see Serious Robots.txt Misuse)
- Serves as a link juice block. Search engines are restricted from crawling the page content, so (internal) links aren't followed and can't pass the value they deserve.
Example Uses
- I recommend only using the robots.txt file to show that you have one. It shouldn't really restrict anything, but it serves to point to the XML Sitemaps or an XML Sitemap directory file (see the sketch after this list).
- Check out the SEOmoz robots.txt file. It is fun and useful.
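Here is a minimal sketch of that "restrict nothing, point to the Sitemap" approach; the domain and sitemap location are placeholders:
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml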
META ROBOTS TAG
The Meta robots tag creates page-level instructions for search engine bots. The Meta robots tag should be included in the head section of the HTML document. Here is how the tag should look in your code (see the sketch below).
The Meta Robots Tag is my very favorite option. By using 'noindex', you keep content out of the index, but the search engine spiders will still follow the links and pass the link love.
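A minimal sketch of the tag, placed inside the head element; the 'noindex, follow' combination is the one described above:
<meta name="robots" content="noindex, follow">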
Pros
- Use of 'noindex' keeps a page out of the search index better than other options like a robots.txt file entry.
- As long as you don't use the 'nofollow' tag, link juice can pass. Woot!
- Fine-tune your entries in the SERPs by specifying NOSNIPPET, NOODP, or NOYDIR. (You're getting all fancy on me now!)
Cons
- Many quite smart folks use 'noindex, nofollow' together and miss out on the important link juice flow piece. :(
Example Uses
- Imagine that your log-in page is the most linked to (and powerful) page on your website. You don't want it in the index, but you certainly don't want to add it to the robots.txt file because that is a link juice block.
- Search result sort pages.
- Paginated versions of pages.
X-ROBOTS-TAG
Since 2007, Google and other search engines have supported the X-Robots-Tag as a way to inform the bots about crawling and indexing preferences via the HTTP header used to serve the file. The X-Robots-Tag is very useful for controlling indexation of non-HTML media types such as PDF documents.
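As an example of what that looks like, a PDF served with a noindex instruction would carry the directive in its HTTP response headers, roughly like this (a trimmed-down sketch, not a complete response):
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex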
Pros
- Allows you to control indexation of unusual content like Excel files, PDFs, PPTs, and whatever else you've got hanging around.
Cons
- This kind of unusual content can be troublesome in the first place. Why not publish an HTML version on the web for indexation and offer the secondary file type for download?
Example Uses
- You offer product information on your site in HTML, but your marketing department also wants to make a beautiful PDF version available. You'd add the X-Robots-Tag to the PDFs.
- You have an awesome set of Excel templates that are link bait. If you're bothered by the Excel files outranking your HTML landing pages, you could add noindex to your X-Robots-Tag in the HTTP header (see the sketch after this list).
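On an Apache server with mod_headers enabled, a site-wide rule for those file types could look roughly like this (a sketch under that assumption; the file extensions are just examples):
<FilesMatch "\.(pdf|xls|xlsx|ppt|pptx)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>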
Let's Turn this Ship Back Around
What was all the baby talk you started out with, Lindsay? Oh, that's right. Thanks. In your quest to bot-proof your website, you have a number of tools at your disposal. These differ greatly from those used for baby-proofing, but the end result is the same. Everybody (babies and bots) stays safe, on track, out of trouble, and focused on the most important stuff that is going to make a difference. Instead of baby gates and electric socket protectors, you've got the Meta Robots Tag, Robots.txt files, the X-Robots-Tag, and the Canonical Tag.
In my personal order of preference, I'd go with...
- Meta Robots Tag
- Canonical Tag
- X-Robots-Tag
- Robots.txt file
Your Turn!
I would love, love, love to hear how you use each of the above robot control protocols for effective SEO. Please share your uses and experience in the comments and let the conversation flow.
Happy Optimizing!
Stock Photography by Photoxpress
LOL - "So easy a monkey could do it" is definitely a warning sign, in SEO and in life.
Well... sometimes monkeys do things better than me :)
A nice overview of content indexation and crawling control. We use the robots.txt file to prevent directories and a few inconsequential pages from being seen by the bots. You can see it all here.
I would agree with your opinion on the meta robots tag, especially in the case of one-off pages. I can think of two challenges the meta robots tag presents. First, in the case of a site built on a framework, it can be difficult to get such indexation exceptions onto individual pages. Coming from the Marketing side, it can take a lot of work to get such an update to the top of the Engineering queue. Second, once said exceptions are in place, managing the meta robots tags can become a challenge if they start to add up (unless a management tool is added to the CMS as part of the framework update).
We also use the 'rel canonical' markup for all our pages as we have many, many inbound links with tracking parameters.
Just a couple thoughts I had. All in all, an informative piece. I liked the baby analogy (I have two baby girls).
Working with the robots is always a dangerous thing. I've just seen too many uninformed people deny access to their whole site and then wonder what happened. I guess now I have a good post to send them to.
Bit by bit I have learned to use the meta robots tag, the canonical tag, and the robots.txt file the right way - mainly through useful posts like this one. I have never used the X-Robots-Tag - I have to admit I didn't know it even existed. I have to take a closer look at it...
So do I... I never knew the X-Robots-Tag even existed. Great post, this is truly educational to me!
Thanks for that!
Great article and very timely, as I'm about to launch a redesigned site. I will now go and double-check for those noindex,nofollow boo-boos.
WTH - a thumbs down because I thanked a writer?
I think someone thought it was spam.
I'll give you an extra thumb up just to fix it. =)
Thanks for the article... I have noticed that Google tends to ignore robots.txt if there are too many directories being listed... usually a few are OK, but if you have a long list then Google will ignore them... so the trick is to consolidate all the pages you don't want indexed into as few directories as possible.
And you can hide a nice Easter Egg in your robots.txt tag for those curious enough to look...as they do at https://www.searchenginefriendlyhosting.com/robots.txt
Great post, you have gotten my mind thinking through our current indexing strategy, and now I'm wondering if it is as tidy as it could be.
You are correct, it's a lot like having kids. You think your house is pretty clean and child-proof and then you head over to a friend's for a play date, only to find that your definition of clean and tidy sucks... thanks, I now think my house sucks! ;)
Lindsay, you get a thumbs up just for mentioning the SEOmoz robots.txt file!
Seeing it, I had a great laugh. Thanks.
If you liked the SEOmoz robots.txt, you'll love the https://explicitly.me/robots.txt by Rishi
I have recently written a blog entry about the importance of SEO for start-up businesses and really appreciate the information to make this process more successful. Thanks for the info.
https://takecareof.biz/seo-is-about-people-not-robots/
Great post, thank you!
I use the robots.txt file (I know, shame on me) to block a large number of pages on a large ecommerce site I work on. I'm pretty sure it's the best option in this case because the site architecture uses a generic www.mysite.com/Browse.aspx URL for all pages when certain filtering elements on category pages are clicked.
For instance, if someone wanted to sort a category by manufacturer, the site redirects to /Browse.aspx but keeps the identical page content. This created thousands of duplicate pages with the identical URL - /Browse.aspx!
I used robots.txt to block this URL and soon saw indexed versions of pages dropping from the index. Since then we have seen a considerable increase in long tail keyword traffic.
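The rule in question would look something like this (a sketch based on the URL described above, not the actual file):
User-agent: *
Disallow: /Browse.aspx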
Anyways, I just wanted to share an example of a robots.txt working well. We could not have used the canonical URL tag here because lots of different content had the same URL, not the other way around.
That said, do you think the meta robots tag would be a better solution? If so, why?
Great Article!! The article discusses several great ways of controlling what can be indexed without killing the link juice from those pages. Thanks for all the great info!!
Thanks for the X-Robots suggestion. I have been wondering how to block PDF files from being indexed (besides in the robots.txt file).
Now, I need help with the right .htaccess code to block the indexing of PDF files (all or just certain ones). Any suggestions?
Hello,
I would be interested to hear what the best recommended practice is for handling URLs containing campaign tags. I looked at the HTML suggestions in Webmaster Tools and I realized that a lot of my URLs are flagged as having duplicate title tags because Google has a record of both the 'canonical' URL and URLs tagged with my Google Analytics / Webtrends tracking variables (utm_source, WT.mc_id, gclid, etc...).
I absolutely must maintain those tags otherwise I lose tracking of my PPC campaigns - but obviously they are causing some duplicate content issues for Google.
I read example #2 in the post about the canonical meta tag where it is suggested to 'record the referral' and 301 the tagged URL to the canonical URL. My concern is that this solution doesn't seem to be compatible with the way most analytics packages work. Google Analytics and other tracking technologies use client-side scripting to call the web analytics server once the page has loaded and send information about the current page viewed by the user, grabbing various parameters added to the URL. If I 301 the tagged URL to the canonical URL, the web analytics tag will fire only on the canonical page and it will not be able to grab the campaign tracking variables, which have been removed via the 301.
What would be the recommended solution then?
For now, I have tried going into the parameter handling tab in Google's Webmaster Tools and setting all my campaign variables to be ignored.
Lothaire
What about the basket and checkout pages of an e-commerce site?
Robots.txt or noindex?
Hello Lindsay,
Today, I was searching for a solution to set up Meta Robots NOINDEX, Follow on an e-commerce website. I found one helpful YouTube video from the Google Webmaster Tools help desk and your blog post on SEOmoz!
https://www.youtube.com/watch?v=ZjRGkc__FwQ
https://www.gunholstersunlimited.com/airguns.html#!/p=clear&manufacturer=228&order=name
https://www.knobdeco.com/cabinet-hardware/cabinet-knobs.html?line=edwardian
I have a big question regarding the dynamic pages which are compiled by the Narrow by Search or Shop By sections on my e-commerce website.
Now, I have a strong conclusion regarding rel=canonical: we don't need to implement rel=canonical on those dynamic pages.
So, I can set Meta Robots NOINDEX, Follow on all the pages which I have described above. But, I would like to double confirm before making it happen on the live website.
What do you think about it? Can you please give me more ideas on it?
Oh, nice info for robots.txt. I will include this on my website.
Why aren't there other posts like this out there?!? Unfortunately, not all the SEOmozzers are experts. It is difficult for someone like me, an SEO novice, to really understand how the tags work and how to implement them after reading articles about them. Thank god, there are people like Lindsay who explain technical concepts in a simple manner and then provide examples of how to implement them in a practical way. I think even someone like me will now be able to use such fundamental tags for a positive on-page optimization activity.
Thanks for sharing this post as I found it extremely helpful! I loved your comparison of Google Bot to a Toddler. Hilarious.
Great guide to all the different methods, but I'm still somewhat unsure about which pages I shouldn't be allowing access to on a massive, dynamic site.
Anyone seen any good guides on this?
Thank you for the information about robots.txt and the meta robots tag. I am using a robots.txt file, but I didn't know this much depth about it.
Most forums also have a huge amount of pages and links that should be noindexed and nofollowed.
Robots.txt is old school, and as in jazz, old school is the best!
There are a few limitations that can be "helped" with the X-Robots-Tag.
Other than those two, I don't really see the need for the rest - but this is just me.
Thanks for this... I had just asked a question related to this last week. You have given me more confirmation! Here is the link to the question:
https://www.seomoz.org/q/right-now-i-have-my-categories-as-noindex-should-i-change-them-to-index-or-let-the-individual-pages-retain-all-the-juice
Tony ;~)
Canonical tags have proven very useful on our sites; they eliminate tons of duplicate content issues. The robots.txt file, however, is very tricky to implement and risks important pages not being indexed. Better to keep a clean robots file and use meta robots instead. :)
Nice post.
As an SEO for an e-commerce site, I always have the following pages noindexed:
Privacy, Shipping info, Return policy, Terms...
The list can go on, but you see what I mean :)
Those pages are mostly "useless" and rarely read by users. Hence they shouldn't show up in the SERPs for any search phrases.
Nice list, I would also add the Basket links (add to, view) to that. - Jenni
I use the Meta Robots tag on pages when I put a website up for clients to view before we're done with it. This way it's live and they can access it, but there's no indexation.
Just have to remember to remove it when going live ;)
Hi Lindsay,
Where was this post last night when I needed it!! ;) A bit of a technical question here, since I've heard different opinions from some very intelligent SEOs.
Here's the scenario:
Below is a sample of the URLs you want blocked, but you only want to block /beerbottles/ and anything past it:
To remove the pages from the index should you?:
If that's successful, to block Googlebot from crawling again - should you?:
"To add the * or not to add the *, that is the question"
Thanks!
Dave
Hey Dave:
I just took the liberty of posting your question to the new public Q&A so don't forget to check in there in a bit to see if anyone nailed it.
Cool move GNC
Hi,
Nice article. Since Farmer/Panda we have also been thinking about using noindex/follow more on our website. We are an e-commerce website with 80% affiliate products, meaning large amounts of duplicate content or empty pages. We are gradually making more products unique by writing unique content for them, but this is not something that goes quickly.
Would you recommend noindexing all these pages to create more uniqueness and therefore better rankings for our website? We were hit quite severely when Panda rolled out in the US, mainly on our unique products, because these were the pages which ranked the best.
By the way, we are talking about noindexing thousands and thousands of pages in this process. I already calculated that these "low quality" pages only bring in 3% of our SEO traffic and only 1% of revenue.
The only fear that I have is that Google will "punish" us for suddenly noindexing this large amount of our pages at once...
Great summary of all the possibilities. Personally, I never use the X-Robots-Tag.
Great article! I especially like the analogy of baby proofing your site for a rich and well connected toddler! Think I might use that in future with clients when explaining on-site SEO.
I never knew that the X-Robots header existed. I could definitely have used that a few times!
Thanks for the heads up! Great post.
I appreciate your report, and I agree with your point that many are using "nofollow, noindex" together. I hope your post will help them out...
Hi, I usually use the robots.txt for:
My preferred choice is, whenever possible, to use the Meta Robots tag (for instance, for paginated pages of product categories).
I use the canonical tag in order to avoid product page duplications (for instance, if a product is in multiple categories and those are shown by the CMS in the URLs).
Finally, the classic 301, which is not strictly a way to restrict robot access, but a perfect way to tell a robot where to go (see the sketch below).
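A minimal sketch of that kind of redirect in an Apache .htaccess file (assuming mod_alias is available; the paths and domain are hypothetical):
Redirect 301 /old-product-url https://www.example.com/new-product-url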
We used the robots.txt to block search engines from crawling duplicate pages, and our domain authority went up.
So you have us adding a noindex meta tag to allow link juice to flow, but don't you want to supplement this with a canonical tag if it's a duplicate URL, so that links to the duplicate version get counted for the canonical version?
Excellent article Lindsay. Great analogy between a developing child and bots. I have made the same comparison a few times lately myself to better explain how a web site should develop and why.
Until now, I just watched how my blog was indexed by the search engines without blocking unneeded pages in robots.txt. I do use the All in One SEO Pack plugin, which automatically adds the articles.
Everything indexed smoothly, without problems, until one important article ended up in the index under an address that did not match the original one. The indexed address pointed to the article's editing preview, and so it lost an important property - the exact occurrence of a key phrase in the article's address - which could have put it right in first place.
Conclusion: Do not neglect any of the methods described above.
P.s. I apologize for my bad English.
Needless to say, I go for the Meta Robots tag. This post gave me insight into the X-Robots-Tag. Canonical tags have always been a mystery to me and I decided to give up on them :)
Thanks for the share.
Yes, yes.. Happy Optimizing!
Great article! I use the robots.txt file to exclude CMS core files and the robots meta tag to control content indexation, so it looks like I'm on the right track!
I confess I have rarely used the canonical tag... I try not to duplicate content. However, a website I am managing is using UTM codes to track clicks, and it's something I now need to implement.
I have also never heard of the X-Robots-Tag - but it's a good thing to know about! Personally, I hate finding PDFs and .docs in the SERPs!
Good info on the robots.txt file - just cleared up a whole lot of junk that was getting indexed.
Nice article, gonna fix my robots.txt now.
SEOmoz keeps impressing me again and again :) Their robots.txt was really unexpected :)
I am an Internet marketer, but I don't know much about on-page SEO. This article helped me a lot. Thanks. DPS qualitypointtech.net