Here at SEOmoz, we're usually talking about how to make your content more visible to the search engines. Today, we're taking a different direction. It may seem unusual, but there are plenty of times when content on your website needs to be protected from search indexing and caching. Why?
- Privacy
There are countless reasons to want your content protected from direct search traffic, from private correspondence to alpha products to material behind registration or credential requirements.
- Duplicate Content Issues
If you serve up content in multiple formats (print-friendly pages, Adobe PDF versions, etc.), it's typically preferable to have only a single version showing to the search engines.
- Keyword Cannibalization
We've written a detailed post about how to solve keyword cannibalization, but in some cases, blocking spiders from certain pages or types of pages can help the process and ensure that the most relevant, highest-converting pages rank for the query terms.
- Extraneous Page Creation
There are inherent problems with creating large numbers of pages with little to no content for the search engines. I've covered this before, talking about the page bloat disease and why you should eliminate extraneous pages. Si's post on PageRank also does a good job of showing why low-value pages in the index can cause problems. In many cases, the best practice with purely navigational or very thin-content pages is to block indexing but allow crawling, which we'll discuss below.
- Bandwidth Consumption
Concerns about overuse of bandwidth can inspire some site owners to block search engine activity. This can hamper search traffic unless you're careful about how it's applied, but for those extra-large files that wouldn't pull in search traffic anyway, it can make good sense.
So, if you're trying to keep your material away from those pesky spiders, how do you do it? Actually, there are many, many ways. I've listed a dozen of the most popular below, but there are certainly more. Keep in mind that tools like Moz Pro's Site Crawl can help you see where these techniques are in play (intentionally or not) on your own site; you can check it out with a free trial if you're curious.
- Robots.txt
Possibly the simplest and most direct way to block spiders from accessing a page, the robots.txt file resides at the root of any domain (e.g., www.nytimes.com/robots.txt) and can be used to disallow spider access to pages or directories. More details on how to construct a robots.txt file and the elements within can be found in the Google Sitemaps blog post "Using a Robots.txt File," and Ian McAnerin's Robots.txt Generator Tool can save you the work of creating the file manually. UPDATE: I'm adding a link to Sebastian's excellent post on robots protocols and limitations, which gives a more technical, in-depth look at controlling search engine bot behavior.
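As a quick sketch of the syntax (the directory names below are placeholders, not recommendations), a simple robots.txt might look like this:

  # Keep all compliant crawlers out of the print and member areas
  User-agent: *
  Disallow: /print/
  Disallow: /members/

  # Give one specific crawler its own rule set (it will ignore the * group above)
  User-agent: Googlebot
  Disallow: /staging/

Remember that Disallow is a crawling directive rather than an indexing one - a disallowed URL can still appear in results if other signals point to it, a distinction covered in depth in Sebastian's post linked above.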
- Meta Robots Tag
The Meta Robots tag also enables blocking of spider access at the page level. By employing "noindex," your meta robots tag tells search engines to keep that page's content out of the index. A useful side note: the meta robots tag can be particularly valuable on pages where you'd like search engines to spider and follow the links on the page but refrain from indexing its content. Simply use the syntax <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"> and the engines will follow the links while excluding the content.
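To make that concrete, here's a minimal sketch of where the tag lives (the page title is just a placeholder; lowercase syntax works the same way):

  <head>
    <title>Thin navigation page</title>
    <!-- Keep this page out of the index, but let spiders follow its links -->
    <meta name="robots" content="noindex, follow">
  </head>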
- Iframes
Sometimes, there's a certain piece of content on a webpage (or a persistent piece of content throughout a site) that you'd prefer search engines didn't see. In this event, clever use of iframes can come in handy, as the diagram below illustrates:
The concept is simple: by using an iframe, you can embed content from another URL onto any page of your choosing. By then blocking spider access to the iframe's source URL with robots.txt (which works when the framed content lives on a domain whose robots.txt you control), you ensure that the search engines won't "see" this content on your page. Websites may do this for many reasons, including avoiding duplicate content problems, lessening the page size for search engines, and lowering the number of crawlable links on a page (to help control the flow of link juice).
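Here's a minimal sketch of the pattern, assuming the framed content lives in a hypothetical /noindex/ directory on your own domain:

  <!-- On the page you want indexed: pull the repeated block in via an iframe -->
  <iframe src="/noindex/sidebar-content.html" width="300" height="250"></iframe>

  # In robots.txt: keep spiders away from the framed source
  User-agent: *
  Disallow: /noindex/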
- Text in Images
The major search engines still have very little capacity to read text in images (and the processing power required makes for a severe barrier). Thus, even after this post has been spidered by Google, Yahoo!, and Live, the word below will have 0 results:
Hiding content inside images isn't generally advisable, as it can be impractical on alternative devices (mobile in particular) and inaccessible to users of assistive technologies such as screen readers.
- Java Applets
As with text in images, the content inside Java applets is not easily parsed by the search engines, though using them as a tool to hide text would certainly be a strange choice.
- Forcing Form Submission
Search engines will not submit HTML forms to access the information returned by a search or submission. Thus, if you keep content behind a forced form submission and never link to it externally, your content will remain out of the engines (as the illustration below demonstrates).
The problem, of course, is when content behind forms earns links outside your control, as when bloggers, journalists, or researchers decide to link to the pages in your archives without your knowledge. Thus, while form submission may keep the engines at bay, I'd recommend that anything truly sensitive have additional protection (through robots.txt or meta robots, for example).
- Login/Password Protection
Password protection of any kind will effectively prevent search engines from accessing content, as will any form of human-verification requirement such as a CAPTCHA (a box that requires copying a letter/number combination to gain access). The major engines won't try to guess passwords or bypass these systems.
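For example, on an Apache server, HTTP Basic Authentication is a few lines of .htaccess (the file path and realm name below are placeholders):

  # .htaccess in the directory you want to protect
  AuthType Basic
  AuthName "Members Only"
  AuthUserFile /home/example/.htpasswd
  Require valid-user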
- Blocking/Cloaking by User-Agent
At the server level, it's possible to detect user agents and restrict their access to pages or websites based on their declaration of identity. As an example, if a website detected a rogue bot called twiceler, you might double-check its identity before allowing access.
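A rough sketch of how this looks in Apache config (the bot name is illustrative, and keep in mind that user-agent strings are easily spoofed):

  # Flag requests whose User-Agent contains "twiceler" and deny them
  SetEnvIfNoCase User-Agent "twiceler" bad_bot
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot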
- Blocking/Cloaking by IP Address Range
Similarly, access can be restricted for particular IP addresses or ranges. Most of the major engines crawl from a limited number of IP ranges, making it possible to identify them and restrict access. This technique is, ironically, popular with webmasters who mistakenly assume that search engine spiders are spammers attempting to steal their content, and thus block the IP ranges to restrict access and save bandwidth.
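In Apache, the equivalent sketch for an IP range looks like this (192.0.2.0/24 is a reserved documentation range, standing in for whatever range you've identified):

  # Deny a whole IP range (older Apache 2.2-style access control)
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.0/24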
- URL Removal
A secondary, post-indexing tactic, URL removal is possible at most of the major search engines through verification of your site and the use of the engines' tools. For example, Yahoo! allows you to remove URLs through their Site Explorer system, and Google offers a similar service through Webmaster Central.
- Nofollow Tag
Just barely more useful than the twelfth method listed here, using nofollow technically tells the engines to ignore a particular link. However, as we've shown with several of the other methods, problems can arise if external links point to the URLs in question, exposing them to search engines. My personal recommendation is never to use the nofollow tag as a method to keep spiders away from content - the likelihood is too high that they'll find another way in.
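For reference, the attribute looks like this (the URL is a placeholder) - it's a hint about the link itself, not a lock on the destination page:

  <a href="/private-archive/" rel="nofollow">Archive</a>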
- Writing in Pig Latin
It may come as a surprise to learn that none of the major engines have a Pig Latin translator. Thus, if you'd like to keep some of your content from being seen in a search query, simply encode and publish. :) For example, try searching for the English version of the following phrase and you'll see no results: "Elcomeway otay Eomozsay Istermay Orgelsprockenmay!" (at least, until someone translates it in the comments below).
Hopefully, these tactics help you understand how to hide content from the engines effectively. As always, feel free to chime in with comments, questions, or opinions.
I think it is important to note that if you absolutely don't want a page or content crawled/indexed, etc. - you can't just rely on one or even two of these methods.
I've seen too many people think a "rel=nofollow" or form submit helps, only to have someone from the outside link to the page. And robots.txt can't be relied on completely, since not all search engines follow the Robots Exclusion Protocol the same way. It also seems to be "ignored" at weird and inconvenient times.
And if you publish a full RSS feed - that brings up other issues as well, especially if the content is scraped/archived by someone else.
If you have a lot of content you want readily available on the web (no passwords required), but not available in search engines - it would almost take a full-time anti-SEO initiative employing at least a handful of these techniques just to be safe.
It is one of the mysteries of search engines that as hard as it can be to get some pages indexed, it can be just as tricky to get others never indexed in the first place.
I've never really experimented with content where it was critical that it didn't appear in SEs. Interesting to hear that it's so hard.
I had a 'Disallow: /wp-login.php' declaration on a blog's robots.txt file. Somehow it still got indexed in Google, it has since fallen out, but it took a while.
I don't know why someone doesn't make a default robots.txt file for the WordPress install.
Do a Google search on inurl:wp-login.php and look at all the "page bloat" (and probable leaking link juice). It's only about 500K pages, not even a smidgen when talking about the Internet - but still.
I also would like to have an official definitive list of all the ways Google will first find your page.
A while back my son and I made a little page on Brinkster for him to mess around on. It wasn't a common URL at all (vinnymyster.com). Neither of us linked to it from any other site. So it shouldn't have just been out there, and there seemed to be no need for a robots.txt file or any other exclusions.
A month later I was doing a Google search on his gamer tag (VinnyMyster) and the site popped up in Google. I did a link:vinnymyster.com search and nothing was returned. I went to Yahoo!, and the site was nowhere to be found.
Someone on a forum said that Google might have picked it up because my host or registrar might send out some RSS feed of "new sign-ups," or that Google could put newly registered domain names in a queue.
I think it was because he built the site using almost 100% includes from other sites (he was trying to replicate a MySpace page), and that one of those sites gave some kind of link back to all sites calling it - and Google somehow picked that up.
This was a while ago and I never did quite figure it out.
I guess the point is: you can never know 100% where Google might be finding its incoming links, so make sure you stay on top of it.
I always used to be paranoid about the Google Toolbar finding sites that didn't have any links to them...
Good points, vingold. I have also seen that it can take a long, long time to get content out of the SEs once it's already been indexed. So if you plan on hiding some of your content, it's best to do it right away rather than waiting.
Part of the problem nowadays is with the authority of community sites. It's amazing how quickly a personal page can get picked up just because you linked to it on Facebook, MySpace, LiveJournal, etc.
I've also found that password-protected areas are about the only surefire method.
A "Disallow:" statement in robots.txt doesn't prevent from indexing by design. In fact, "Disallow:/" means "Don't crawl it, but feel free to index" it. That's perfectly compliant to all related Web standards. See my comment above.
Sebastian, excellent! Your post certainly clears a lot of things up.
I need to print it out, read it, let it simmer in my head, rinse and repeat.
It is a very definitive breakdown of the whole no-indexing issue.
So much for what I picked up in the forums.
Rand, your REP methods (robots.txt, robots meta element) need some nitpicking clarification. ;)
Currently we have no indexer directives for robots.txt. Crawler directives like "Disallow:" do not prevent indexing based on third-party signals or even internal links pointing to disallowed content. Also, disallowed URLs waste PageRank. If you want to deindex stuff, you shouldn't block it in robots.txt, because all indexer directives require crawling.
Only Google obeys the "noindex" directive in robots meta tags as well as X-Robots-Tags (I agree with Joost - REP tags in the HTTP header are way sexier than meta elements on the page, and they work with PDFs and other non-HTML content types too). Yahoo! as well as MSN do index references to URLs carrying a "noindex" REP tag.
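For anyone who hasn't used the header version, a minimal sketch for an Apache server with mod_headers (the file match is just an example) would be:

  # Send a noindex directive in the HTTP response header for every PDF served
  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>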
Selfish but relevant link drop:
Getting stuff out of Google - the good, the popular, and the definitive way
Thanks for your time
Sebastian
Excellent points, Sebastian, and no worries on the link drop - we certainly appreciate them when relevant :)
I'll try to update the piece with your notes!
I've submitted additional REP info to YOUmoz.
Thanks
Sebastian
Sebastian - terrific! Perhaps I can simply link over to your post to help guide the more complex aspects of noindex and disallow.
Thanks, that's how I understand it. I was wondering why you mentioned noFollow in the post. :)
Thanks Rand, I'd certainly appreciate the link. :)
By the way, it seems my REP post is stuck in the YOUmoz moderation queue.
Yahoo! Search added support for the X-Robots-Tag directive back at the beginning of December, and to date I have had no experience of them failing to adhere to the tag (and I have just checked my test-beds).
If you have definite examples of Slurp or the Yahoo! index disobeying the Robots Exclusion Protocol headers then I would be interested in seeing them.
Great Stuff there Sebastian. Thanks for the link.
And I think I deserve an extra moz point to find this spelling mistake -
11. the liklihood is too high :)
Sorry, I was a bit late in editing this post.
The search engines may not have Pig Latin translators; however, Google has translated its own search engine into Pig Latin for you:
https://www.google.com/intl/xx-piglatin/
:)
edit: to insert link correctly
There is another method I have seen in use on some websites. They make all the links to these pages unreadable with JavaScript. The links are obfuscated, so that search engines may not even detect that they are links to other pages.
Hey Rand,
Belated comment here as the comment about iframes sparked off an idea that I had to follow through on first!
So thanks for a very timely post, which just may have saved me a very large headache :)
Eight years!!! This info needs updating now. The other thing is, I need a solution for how to avoid unwanted keywords in Google Webmaster Tools.
I wouldn't rely on robots.txt alone. I think using all the preventative measures you can (robots.txt, meta noindex/nocrawl, and rel=nofollow) is the best route.
Thanks Rand,
Again, great post.
I would like to know your trick for using robots.txt to block an in-page (but external-URL) iframe. I'm sure this is not possible, as the robots standard, for obvious reasons, ignores references to external URLs.
Thanks for these 12 ways to keep content hidden from the SEs. This is really great for me as a beginner in SEO. Good work, and keep it up.
Great post. First we learn how to get content into the search engines, and now how to keep content out of their way.
Hello, I know this is an old blog post, but I had a quick question with respect to iframes. I have a persistent piece of content on my site that I do not want Google or any other search engine to index, and I have included that content in an iframe as mentioned in this post. The URL from which the iframe derives its content is set to noindex, nofollow, but the content is still indexed by search engines. Could you please tell me a way to solve this problem? Thanks!
Hey, I'd like a clarification. I have a blog with a directory that I don't want indexed, so I put it in robots.txt, but Google is still showing results for it with a note that the description is not available due to the site's robots.txt. Why is it indexed? Can you help me? Thanks in advance.
Ever since the Phoenicians invented money the matter has had a simple solution: popular demand. The companies that bother you with unsolicited information when using a search engine do so expecting some return for what they pay for the advertising. We make it a custom to preclude doing business with entities that show up during our searches and interfere with our main goal. The search engines providers then might find it beneficial to put the filters themselves in order to protect the interests of their paying clients. Filters that allow searchers to pinpoint relevant data, excluding all others not specifically checked, might be of great benefit to paying advertisers.
Hi there,
Currently we have many unwanted search engines browsing our website https://coolbuck.com. For example,
5.199.133.88 [arni.name]
182.118.22.206 [hn.kd.ny.adsl]
74.215.13.162 [WS1-DSL-74-215-13-162.fuse.net]
5.9.27.74 [5-9-27-74.crawler.sistrix.net]
... and many more.
We want to block them from crawling our website. What should we write in our robots.txt?
Thx.
Peter
You can also save a PDF as an image. It's a PDF that looks like any other PDF except the file size is larger.
Hi,
Great post! I'm new on SEOmoz and the value of some of these posts is just incredible.
I have a more specific question about something we're trying to do. The account settings of our community members are in a jQuery sliding panel which is on top of every page. The problem is that search bots can crawl through it (checked with seo-browser): it is unnecessary and prevents them from seeing the important content of the page.
From this post I would assume that the best way to do that is to use iframes. Am I correct?
I don't believe there is any evidence that supports the claim that rel="nofollow" prevents the crawlers from visiting a resource.
Sarven - if you read my comments, that's exactly what I wrote! Maybe an additional diagram showing this would drive the point home further.
Rand, great info. I have a similar problem, but slightly different.
We upload PDF files that change each year as updated health plans are released. We delete the PDFs from prior years to avoid duplicate content, because plans from year to year are very similar, but Google still has these pages indexed. After we remove them, Google reports that the pages are "not found" in Webmaster Tools. Should I remove these pages on a case-by-case basis through Webmaster Tools, or is there a more efficient way? Thanks Rand! Really appreciate the input.
A question (for you and the Moz community): what about content that's already out there but you want to have removed (especially dynamic content where the removal request isn't feasible)? I've found that the META tags don't always work for content that's already spidered, and even Robots.txt takes weeks to affect the pre-existing index.
Just wondering if one method might be better than others for pages that have previously been indexed (as opposed to entirely new pages).
Deindexing and not-indexing follow the same set of rules. The problem is that with a pretty low crawl frequency, changes in your indexer directives don't take effect instantly. Submit an XML sitemap (with all the URLs you want deindexed) to refresh the robots.txt cache and to force crawling of recently outdated resources. When you assign a "noindex" REP tag to a URL, search engines won't obey it until they've crawled and processed it. Not that all search engines obey "noindex" REP tags...
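As a rough sketch, such a sitemap is just a flat list of the URLs you want recrawled (the URLs below are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- URLs now carrying "noindex" that the engines should revisit -->
    <url><loc>http://www.example.com/outdated-page-1.html</loc></url>
    <url><loc>http://www.example.com/outdated-page-2.html</loc></url>
  </urlset>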
Interesting, thanks. I wondered whether it was an issue of delisting working differently or just a matter of the time it takes to "rediscover" secondary pages. I thought it might be what you suggested, but didn't have any verification.
The Removal Request Tool in the Google Webmaster Console is going to help you here as well.
Google requires you to have followed the appropriate actions for excluding the content first.
As yet Google have not announced that the Removal Request Tool supports the X-Robots-Tag protocol. Details are available from Google at the webmaster help centre.
I'm missing one, although you could see it as part of the robots.txt thing: the X-Robots-Tag. The cool thing about it is that it's even harder for people to detect :)
The x-robots-tag meta is an excellent addition and allows SWF and PDF files, in particular, to be excluded with ease.
The downside is that, to date, only Google and Yahoo! are supporting the tag. Until at least two other engines begin supporting this meta tag I think that it has to remain an addition, rather than an alternative, to other methods.
You know Rand, I was just wondering about methods of exposing myself less. Too often I've been finding myself exposed all over the place, multiple times. I'm sure people are getting sick of me exposing too much of myself and...
*giggles*
In all seriousness - this is a fantastic post. I have a site that's just going through several issues where they are even deduping stuff scrapers stole. It's been a long process.
I've also recommended the iFrames technique to a friend who needed to hide some content but not all from search robots on some pages.
*teeheehee* but everyone has to love the Pig Latin suggestion! And you'd probably rank extremely well (until everyone caught on... damn them ;-) )
There is one more good reason why you should consider hiding some of your content: spammers. If you don't hide it, they can really screw up your mailbox and, yes, your rankings too. Needless to say, nice post Rand!! Cheers!
Great post, thanks Rand!
Why, didn't you know that all the nouveau webmasters don't want you to see their content? At least if you are in the Digg demographic. All those pesky kids steal bandwidth and never click on ads, like Dennis the Menace.
https://whydiggisblocked.com/
If people refuse to understand social media, should they be enlightened?
What about using JavaScript to alter the CSS? This could present slightly different content to customers than to search engines.
Where's the fine line between doing this to game the system and doing it for accessibility, e.g., adding section headings for screen readers, etc.?
Haha! Never heard about that!
Great post - I love having all the techniques listed in one place!
Great info Rand! I'm going to share this on our E-Marketing Performance reading list!
#6 Forcing Form Submission
Really just felt like filler, b/c as you say, while it works, as soon as a link comes bounding in... it doesn't :/
Would have been a stronger piece if you'd just said 11 Ways.
Welcome to SEOmoz Mister Morgelsprocken!
 :P
Thanks for another great post Rand.
Rand, thanks for the Pig Latin tip. Please discuss the difference between distributing link juice and hiding away content. We do as you do and employ nofollows all throughout, say, a comments thread. For instance, on this page, "Reply," "Private Message," "Permalink," and "Add Comment" are all nofollowed. I understand why, and I'm interested in hearing you discuss the different approaches and reasoning behind nofollow vs. hiding content by other methods. THANKS.
Marty - the thing is, nofollow is used to control the flow of PageRank - we wrote about that here. Keeping content away from the engines entirely is more the focus of this post.
Thanks, that's how I understand it.
What about Flash? Is it indexable?
You're right there - Flash certainly isn't indexable. Thankfully, we can get similarly impressive UI using CSS!
I think you may be somewhat overconfident there.
I would not say that Google is incapable of crawling content and following links in SWF files. The Flash spidering test in my public test-bed has been compromised by a third party linking to the destination page, so I am not able to give you a firm example, but relying on Adobe Flash as a method of excluding content would not be a good model.
Shoulda made your test page a little less compelling, huh?
I know they are testing a lot of ways to index content inside Flash. Expect to see real progress soon.
Now Flash can be indexed as well:
https://support.google.com/webmasters/answer/72746?hl=en
Hadn't thought of using an iframe for that - could be handy.
Flash is also an option (similar to images/Java)...
And what about Ajax?
AJAX can be used to exclude content for exactly this purpose.
I provided a news blog solution to a British financial bank where I use this approach to supply users with content on the home page without having that content indexed, thus avoiding duplication issues between the front page and the specific article pages.
Getting the feeling you could only really think of 11 there, Rand?
What did twiceler do to you?
Here is Stephan Spencer weighing in with Matt Cutts: https://blogs.cnet.com/8301-13530_1-9834708-28.html