When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today's Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.
Video transcription
Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we're going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don't want them to go. We're going to talk a little bit about crawl budget and what you should and shouldn't have indexed.
As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and the nofollow tag—though nofollow is a little bit less about controlling bots.
There are a few others that we're going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let's dive into those first few first.
Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn't access, but it doesn't always get respected by Google and Bing. So a lot of folks, when you say, "hey, disallow this," and then you suddenly see those URLs popping up, you're left wondering what's going on. Look—Google and Bing oftentimes think that they just know better. They think that maybe you've made a mistake. They think, "hey, there are a lot of links pointing to this content, there are a lot of people visiting and caring about this content, maybe you didn't intend for us to block it." The more specific you get about an individual URL, the better they usually are about respecting it. The less specific—meaning the more you use wildcards or say "everything behind this entire big directory"—the worse they are about necessarily believing you.
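To make that concrete, here's a hypothetical robots.txt snippet showing both a very specific directive and broad wildcard ones (the paths are made up for illustration):

User-agent: *
# Specific: block one exact URL
Disallow: /old-press-release.html
# Broad: block an entire directory and any URL with a session parameter
Disallow: /staging/
Disallow: /*?sessionid=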
Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it's usually a lot more respected, because it's at an individual-page level; Google and Bing tend to believe you about the meta robots tag.
And then the nofollow tag, that lives on an individual link on a page. It doesn't tell engines where to crawl or not to crawl. All it's saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.
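As a quick syntax reminder, a nofollowed link is just a regular anchor tag with a rel attribute added (hypothetical URL):

<a href="https://example.com/some-page" rel="nofollow">some page</a>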
Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.
What if, for example, we take a page like "blogtest.html" on our domain and we say "all user agents, you are not allowed to crawl blogtest.html." Okay—that's a good way to keep that page away from being crawled, but just because something is not crawled doesn't necessarily mean it won't be in the search results.
So then we have our SEO folks go, "you know what, let's make doubly sure that doesn't show up in search results; we'll put in the meta robots tag:"
<meta name="robots" content="noindex, follow">
So, "noindex, follow" tells the search engine crawler they can follow the links on the page, but they shouldn't index this particular one.
Then, you go and run a search for "blog test" in this case, and everybody on the team's like "What the heck!? WTF? Why am I seeing this page show up in search results?"
The answer is, you told the engines that they couldn't crawl the page, so they didn't. But they are still putting it in the results. They're actually probably not going to include a meta description; they might have something like "we can't include a meta description because of this site's robots.txt file." The reason it's showing up is because they can't see the noindex; all they see is the disallow.
So, if you want something truly removed, unable to be seen in search results, you can't just disallow a crawler. You have to say meta "noindex" and you have to let them crawl it.
So this creates some complications. Robots.txt can be great if we're trying to save crawl bandwidth, but it isn't necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying "Google, don't crawl the www version of twitter.com." What you should be doing is rel canonical-ing or using a 301.
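To illustrate those two options—with example.com standing in for the real domain—the rel canonical goes in the head of each non-preferred (non-www) page, pointing at its www equivalent, e.g. on the homepage:

<link rel="canonical" href="https://www.example.com/" />

And the 301 can be handled at the server level; here's a minimal sketch assuming Apache with mod_rewrite:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]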
Meta robots—that can allow crawling and link-following while disallowing indexation, which is great; the catch is that it still uses crawl budget (the page has to be crawled for the engines to see the tag), but it does keep the index clean.
The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.
Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404'd something or if you've told them not to crawl something but it's still showing up in there, you can manually say "don't do that." There are a few other crawl protocol things that you can do.
And then URL status codes—these are a valid way to do things, but they're going to obviously change what's going on on your pages, too.
If you're not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines "it's back!" 410 is permanent removal.
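For example—assuming an Apache server with mod_alias, and a made-up path—a 410 can be returned like this:

Redirect gone /discontinued-page.html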
301—permanent redirect, we've talked about those here—and 302, temporary redirect.
Now let's jump into a few specific use cases of "what kinds of content should and shouldn't I allow engines to crawl and index" in this next version...
[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]
Four crawling/indexing problems to solve
So we've got these four big problems that I want to talk about as they relate to crawling and indexing.
1. Content that isn't ready yet
The first one here is around, "If I have content whose quality I'm still trying to improve—it's not yet ready for primetime, it's not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I'm rewriting the content and creating unique value on those pages... they're just not ready yet—what should I do with those?"
My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I'd disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.
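A minimal robots.txt sketch for that scenario, assuming the unfinished products live under a made-up /product-drafts/ folder:

User-agent: *
Disallow: /product-drafts/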
If I'm talking about a small quantity—a few dozen, a few hundred pages—well, I'd probably just use the meta robots noindex, and then I'd pull that noindex off of those pages as they are made ready for Google's consumption. And then again, I would probably use the XML sitemap and start submitting those once they're ready.
2. Dealing with duplicate or thin content
What about, "Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?" I've got an example. Let's say I'm an ecommerce shop, I'm selling this nice Star Wars t-shirt which I think is kind of hilarious, so I've got starwarsshirt.html, and it links out to a larger version of an image, and that's an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn't recommend disallowing crawling on these, and I wouldn't recommend noindexing them. What I would do there is a rel canonical.
Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn't see the rel canonical back, so if someone linked to the blue version instead of the default version, now I potentially don't get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta "noindex, follow" on these pages, but I don't really think that's necessary, and again that might interfere with the rel canonical.
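So each color-variant URL would carry a canonical tag in its head pointing back at the default page—something like this, reusing the starwarsshirt.html example (the domain is just a placeholder):

<link rel="canonical" href="https://www.example.com/starwarsshirt.html" />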
3. Passing link equity without appearing in search results
Number three: "If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don't need those appearing in search results—what should I use then?"
What I would say here is, you can use the meta robots to say "don't index the page, but do follow the links that are on that page." That's a pretty nice, handy use case for that.
Do NOT, however, disallow those in robots.txt—many, many folks make this mistake. If you disallow crawling on those pages, Google can't see the noindex, and they don't know that they should follow the links. Granted, as we talked about before, sometimes Google doesn't obey the robots.txt, but you can't rely on that behavior—assume the disallow in robots.txt will prevent them from crawling. So I would say the meta robots "noindex, follow" is the way to do this.
4. Search results-type pages
Finally, fourth, "What should I do with search results-type pages?" Google has said many times that they don't like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.
Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you've got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, "I'm looking for restaurants in Seattle, WA," they'll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that's not just a list of search results, that is more of a landing page than a search results page.
However, that being said, if you've got a long tail of these, or if you'd say, "hey, our internal search engine is really for internal visitors only—it's not useful to have those pages show up in search results, and we don't think we need to make the effort to turn them into category landing pages," then you can use the disallow in robots.txt to prevent those from being crawled.
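For instance, if your internal search results all live under a /search/ path or use a ?q= parameter—your own URL pattern may differ—the robots.txt rules might look like:

User-agent: *
Disallow: /search/
Disallow: /*?q=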
Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren't valuable pages that should be served up and turned into landing pages. If you're sure, then go ahead and disallow all your search results-style pages. You'll see a lot of sites doing this in their robots.txt file.
That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I'll try and tackle those in the comments below.
We'll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!
Great vid Rand, got a bit distracted by that hair though, is that a temporary redirect down the right side of your head?
Well, that's what you get when you tell your barber to "do something cool" in the back... :-)
Such an incredible video for all of us. You have given really great examples for robots.txt and the meta robots tag. Many newbies as well as experienced folks get confused between them, but after going through this awesome stuff, I don't think they would be in any doubt.
This is a very simple, attractive, and easy-to-understand video that helps us grow our website. I will bookmark this post and share it with my staff and friends to enhance their knowledge. Thanks Rand for helping us.
Thrilled to hear it! Let us know if you've got any specific issues or questions (or ask 'em here: https://moz.com/q)
I'm happy you brought up the Twitter robots.txt example - I was confused as to why they would do such a thing, and I'm even more surprised their SEO team (if your assumption is correct about www/non-www) would go this route. Great examples and reminders about which of these options are considered first and have the potential to block the others.
Yeah, not the smartest move on Twitter's part, IMO (although we don't know for sure exactly what their intentions might be, and some have speculated that Google actually asked them to do this as part of the partnership - I'm skeptical, but who knows!?).
This wouldn't force other search engines to try to buy that "firehose" connection like Google, would it?
Well, Google is reportedly looking to hire an SEO manager, so perhaps they don't know better themselves. :)
This is so important to get right. We had a major, albeit temporary, catastrophe on a client site where the web designer had been asked to do upgrades. They had rightfully applied the noindex,nofollow tag on the development URL but then did not change this back when they rolled out the upgrade.
The result was within 24hrs, pages started slipping out the index. Thankfully as we track stats daily we caught this before too much harm was done. Although we caught it within 24hrs, because Google had crawled a ton of the noindexes, it actually took 3-4 days for pages to stop being removed and then to get reindexed.
But it does go to show that the meta robots tag is honoured, and often swiftly, so be warned and only stop search bots if you're sure that's the right thing for your site.
Excellent example Martin - thanks for sharing. I think this makes a strong case for having some sort of crawl monitoring/alerting set up (either Moz Analytics or Onpage.org or the like). Manual runs of something like Screaming Frog can be really useful too.
Definitely. I've only just come across OnPage.org but crawl monitoring is definitely something worth looking into more. (Although until Google sort out their 'indexed URLs' bug I wouldn't put much store by WMT figures, so any external tool is a good idea!)
Yes, that can result in a catastrophic event, but if you are using the noindex tag or blocking pages through robots.txt in bulk, it generates a warning in Search Console (formerly GWMT). So it is better to keep an eye on Search Console on a regular basis.
Great that you guys also always return to the BASICS for our starting members, and also to refresh and get rid of wrong opinions. Thanks to Rand and the whole Moz Team, Michael from Austria
Hey Rand,
As always, you are the BOSS.
You have a gift, it's rare to see people present with such fluency.
1) Do you think it's possible to consider high crawling bandwidth as a negative SEO factor?
2) Could you please elaborate on the best SEO practices for tags & categories?
Hey Yaniv,
That is a great question, and I think I can answer your second one. If I understood Rand correctly, the best practice is to use [noindex, follow] meta tags.
This way we don't have to worry about duplicate content that can arise from this kind of navigation on the site, but Google will still follow the links, which is important for internal link profile optimization.
Super useful WBF Rand. I'd guess 1/3 of my clients have misused robots.txt and noindex,NOfollow vs. noindex,follow.
One quick note about 410 http responses: while it's supposed to indicate to the search engines that a page is gone forever, I have seen a case (6 months ago) where a client had a manual penalty based on pages that were returning 410 (they were created by site users and totally spammy), and in the manual penalty reinclusion request REJECTION, Google was citing those pages that had been returning 410 for several months. It was only when the client changed the DNS to remove the subdomain they were on entirely that the penalty was lifted. I'd have said this was a mistake in how Google's reviewers were handling this, personally. But...that's what I saw, so all be warned!
I think it's worth making the distinction that Webmaster Tools' removal function actually removes pages from the search results --- not Google's index. See John Mu's post here.
If you are dealing with things like duplicate content, it's probably not as effective as other options mentioned above. Thanks for the video!
Good point Ashley! There's a difference between removed from index and removed from search visibility. Thanks for clarifying that :-)
Rand, on a blog site should category pages be noindex, follow?
This post is something like a 'back to basics' but I admit that it's very necessary. Lately I'm seeing a proliferation of posts about this topic, and there's no doubt that the reason is its usefulness and necessity.
It is very important to differentiate between crawling and indexing. I think the problem exposed in the example of blogtest.html is more common than we think, because sometimes we tend to obsess over 'blocking' a page (name it as you like), and we forget that some 'orders' need a prior crawl to be effective.
Great post, Rand. One last thing: I love your T-shirt; very original.
Thanks Sergio! Agree that lots of issues come up due to the fine points between crawling vs. indexing.
First - T-Shirt is amazing...
Second - I found all these complications with the mechanisms controlling crawlers a little bit disturbing, because there are many ways to shoot yourself in the foot using them.
One recent example is a brand new site with approximately 100 pages. The designers put in canonical tags with one "small" issue: all of the canonical tags pointed to the homepage. They just wanted to be sure there wasn't duplicate content. And I saw this 3 months after the site went up.
Sometimes even small changes can break things. WordPress has one setting about crawlability, and SEO plugins (AIO or Yoast) add a few more. One accidental click can remove the entire site from crawling.
Yup - we see this a lot, too. The crawl and indexing options can be powerful, but they also can seriously mess things up if not done properly. On the plus side, that's pretty much a guarantee that there'll always be demand and need for talented SEOs :-)
Yet again the best edition... I was away for a couple of days and couldn't catch up earlier. I was having the same indexing issue: we have a few sites with the noindex tag, but those sites have been indexed by both search engines. Now I have an answer as to why they were indexed.
Thanks Rand for making it clear.
Hello Rand,
Good refresher on some basic concepts as well as technical points about robots. One thing I wish to include: if you disallowed a page in the robots.txt file and expect that it will never appear in Google, then you are a little bit misguided. If Google thinks that your disallowed page is relevant and could help a searcher based on the search query, then it can still appear in search results, with Google taking references from open directory sites like DMOZ.
So, if you really want to keep your page out of the search engine then you have to use NOINDEX, and if you want to prevent search engines from showing their own description taken from a directory, use this code - <meta name="robots" content="noodp" />
Hope it helps to people.
Yup! That's exactly what I noted with my visual in the first part of the video.
Re: the description from the directory - we've seen that work sometimes and not so well other times. Google can pull descriptions from anchor text and other places it seems, and there's no way to stop them from doing that, sadly.
Yes, because Google believes in delivering relevant results only, so it would be either by us or Google itself. Anyway, thanks for the reply :)
Hey Wizard,
Correct, you have to use 'noindex' to completely remove it from the results too. However, I didn't understand your point about "let them crawl it".
As per my experience with robots files, most of the time if a page is disallowed in robots.txt and doesn't have a meta description, Google fetches important or actionable text from that page and shows it within the search results.
For example,
Keyword "open site explorer" shows https://moz.com/researchtools/ose/ (Blocked in Robots.txt) on first rank with the actionable text "It looks like you have JavaScript disabled. JavaScript is required to use Open Site Explorer" at description.
If I put that URL directly into the search bar, it shows me Moz's footer page text "Moz doesn't provide consulting, but here's a list of recommended companies who do!" in the description. The question is how Google shows such results for disallowed URLs.
Hi Vishal - here's a good example from Moz: https://www.google.com/search?q=site%3Amoz.com%2Fa...
You can see that Google's showing a disallowed page from Moz, but since they can't crawl it, they say: "A description for this result is not available because of this site's robots.txt – learn more."
Hello Rand,
I had that WTF moment when I was a beginner, and it took some time for me to overcome it through experiments. Google was recently right when it said that SEOs need to get their hands dirty in order to learn better. This is a really great WBF for people who are struggling with such on-page techniques. We had a set of landing pages hosted on a client's sub-domain, and they were blocked in the robots file. However, they were ranking well in search results for the targeted keyword when searches were done. Those pages were not intended for organic and were landing pages for AdWords. Later we implemented meta noindex and let them be crawled, which helped us move them out of the index. I guess many SEOs will learn a lot from today's WBF.
The best of article..thanks @Rand Fishkin.
I found this article quite helpful. Frankly, I'm a small business owner and knew little about the relationship between search engines and robots tags. I'm seeking new SEO approaches to better serve my business venture. Nevertheless, I found Rand's explanation of when to disallow search engines in the robots.txt file, as well as when to employ meta robots tags in a page header, very helpful. The tutorial on nofollowing links proved helpful as well. I especially enjoyed the content on meta robots and how they live in the headers of individual pages, whereby you can control a single page with a meta robots tag. I understand now how this communicates to the given search engine, whether it be Google or Bing, whether they should keep a page in the index and whether they should follow the links on that page. I'm curious how some of these approaches can be applied to small business sites in terms of enhancing inbound marketing efforts to increase sales? I would love to hear any ideas related to this. Thanks again!
Great article. For search results, I think the use of query string parameters allows great flexibility. Bookmarkable, yet still configurable using Webmaster Tools.
Hi Rand
Great post. Just about to become a full-time Moz customer :-) We just launched a .com version of our site (our original is a .co.za) and didn't rel canonical all the .com pages, which were exact duplicates. Now, after seeing a traffic drop in our .co.za results, we have rel canonicaled all the duplicate .com pages. Is this all we would need to do? Do we just wait now for Google to re-index, and can we expect our rankings to return to normal once this occurs?
Thanks!
Mike
Nice post Rand!
I have one question: I added a robots.txt file to my sub-domain, e.g. "subdomain.mydomain.com/robots.txt", and I disallowed the sub-domain through robots.txt. But still, it's showing in Google search with "A description for this result is not available because of this site's robots.txt – learn more." Why?
It is happening because you have blocked the content through robots.txt which means the URL is indexed but the content of the page is blocked from the bot.
Thanks for the reply Salman!
I have added the following code "User-agent: * Disallow: /" at the following root: "subdomain.mydomain.com/robots.txt". Why has Google indexed this URL, and what should I do to stop the sub-domain URL from being indexed? I don't want my sub-domain indexed in Google search any more.
I have to disagree with the advice given on this part: "If I have content whose quality I'm still improving, that isn't ready for Google, what should I do?" Noindex on a few hundred articles if they are not ready...
I know I won't win many fans when I say this, but to be honest, if your pages are not ready for Google then it's doubtful they are ready for your visitors. If the page is thin or duplicate then you're likely not giving your users the experience you want them to have, and you should put those pages completely on hold.
Furthermore, it's a form of shaping your reputation with Google. I understand the reasons, but at the end of the day, if you follow the principle that whatever is good for your users is good for Google too, then the only time you should ever consider noindex is on sensitive pages.
It's also a very bad idea to noindex "thin" pages. I often see people using "noindex, follow" thinking they will get the juice from any links made to the page, which they won't... Even tag pages should be indexed; this is why we now have canonical links.
My advice is to only ever use noindex for content that was never intended for Google. Using noindex on user-facing pages reminds me of the days people used "nofollow" on internal hrefs in an attempt to manipulate PageRank flow.
Hi Rand, great post. My somewhat late addition:
I think the X-Robots-Tag HTTP header is a rather useful alternative to the meta robots tag:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
While providing similar possibilities, it enables you to address non-HTML Content like PDFs, Images as well. It’s also easy to apply for folders, certain URL patterns or whole domains, if you apply it via webserver configuration. Quite useful if you want to exclude a development URL in an effective and efficient way for example.
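A minimal sketch of how that could be applied via Apache config or .htaccess (assuming mod_headers is enabled, with PDFs as the example file type):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, follow"
</FilesMatch>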
Regards, Tobias
The section on search results is exactly the issue I'm trying to tackle. I've got huge amounts of crawl budget being spent on internal search results pages, but only a few variations of those searches (though it amounts to thousands of pages) are truly valuable.
Thankfully I have a parameter that I can use to identify which ones I truly think are useful or not to external searchers.
examples:
If I understood this correctly, the best bet would be to nofollow links to anything that falls into #4, but also include noindex, follow in the <head> of these pages. For the options 1-3, I should not nofollow links, and I should not include the noindex, follow. I should also not use the robots.txt to block any of these pages (1-4).
This still doesn't solve any problems with crawl budgeting, though, so how can I attempt to accomplish this? There are tons of links out to these pages (almost our entire site is search-controlled), and I'd rather not waste where nothing could be gained.
Finally, any thoughts on the URL parameter controls in GWMT? If interface=ImSuperUseful&param2=xyz also creates a useless page, does that parameter tool in GWMT do anything to help this? Has anybody run any tests on its utility or effectiveness?
Thanks to anyone who can comment
-MB
Great explanation of what is going on there. My simple tools have been SEO plugins (just check the box!) and the Redirection plugin... but I see there is a little more strategy I can now employ when encountering certain situations.
Hey Rand,
Just a little confused about "nofollow" tags. Well, we were exposing all the search page URLs (which are infinite pages in our case) from our site; that means we were exposing URLs with the same content and hence were open to a duplicate content penalty. For instance, a page with search query "mobile" might have absolutely the same content as a search page with search query "mobiles", and we were dynamically linking both of these pages.
So, to avoid this, I have put a "nofollow" on all the places from which these search pages were linked internally (we can't noindex them for some stupid reasons, and even changing the inter-linking logic is a little tricky as of now). Does that tell bots that I don't trust my own site, since I'm nofollowing my own URLs?
What would be the best way to handle this particular issue? Have a look at this site: https://www.askme.com/delhi/search/pizza-hut, and check all the links having "/search/" in them.
If you have 3 pages labeled A, B and C...
A uses "noindex, nofollow" which typically means the Googlebot won't visit B and discover C...
However, this goes on the assumption that Google will respect the nofollow; while this is generally true, it can sometimes ignore the tag completely. The safest way to avoid content being indexed which was never intended for search engines in the first place is to use a noindex on those pages, followed by a rule in the robots.txt as a backup in case something goes wrong with the header responses.
Very nicely described..
Thanks
Hi,
Very nice article. I already knew the use of robots.txt, but didn't know it this deeply; it will help me in working on my site.
Hi Rand,
Great video, thank you for that.
I have a question: I have a customer and his website is like a marketplace. Sometimes different sellers list the same product, so there might be the same products with different SKUs. On the other hand, because every SKU has a different URL, this creates duplicate content. How can I solve this problem? Does canonical solve this problem?
Thank you for your help
Emre Tonguç
This is a great post and answers many questions about how to get rid of a page for a client "the right way" when I run into duplicate content issues, or irrelevant pages that need to be taken care of.
The power of the robots.txt file is something that I think is misused at times, but the meta robots tag really gives another tool for specific pages.
Plus, the new tools in WMT are a great tip for anyone who hasn't been using them yet.
Either way, always a great WBF and also awesome information on how we can all be better SEOs.
Hey Rand,
This edition of WBF certainly brushed up the basics. Thanks for helping me with this insight.
Cheeeers!!
About: 2. Dealing with duplicate or thin content
The color filter does not create duplicate content. Therefore the canonical-tag is wrong here (like using it on paginated pages). Google is recommending "noindex, follow".
Yeah, but their recommendation is wrong. If someone links to the "gray" version of the tshirt page and I want that link to count to my original, rel=canonical is the way to go. That said, if folks are searching for the gray version of the shirt separately from the others, then I want to let that page actually get indexed!
I was wrong and thinking of category filters. For filtered product pages of course, canonical is ok :)
Hi Rand,
Thanks for the video. Interesting distinction between the robots.txt and the robots meta tag.
Question on internal search results - A site currently has search results indexed which I don't believe is the best as far as crawl budget etc is concerned. Is the right course of action to disallow the search results within robots.txt and then use the URL removal tool to remove the search results?
Thanks
Gareth
Great article! I have a question about nofollow: does it still matter in terms of sculpting the flow of PR?
Sort of, but it's such a teeny tiny ranking factor that it's mostly useless (with the exception of a few rare edge cases).
Wow! This is really good information. A Spanish SEO manager like me has a problem with the language, but I can understand it thanks to this graphic video.
Always good to review the basics and make sure you haven't missed out on an essential building block.
One suggestion I do have for the search pages is a more pragmatic approach:
You noindex your search results by default,
Record what searches are being performed on your site,
Once a search reaches a certain amount (I will let you decide what is enough traffic to be valuable) you craft a page that serves those results and allow it to be indexed.
This benefits your users (so long as you keep serving them the same content!!!) as they want this information and you are making it easier for them to get straight to it, it also benefits you as this page that has some value is now more visible.
Hi Richard,
As I'm currently having a very very similar issue, I'd just like to hear a second(in this case your) opinion about internal search.
I have a search term that is looked up by visitors quite often, resulting in the typical /catalogsearch/result/?q=term page.
My question now is: if there already exists a more SEO-friendly landing page focusing on this term, should I 301 redirect this specific search /catalogsearch/result/?q=term to the /category/subcategory/ page, or does that cause any issues?
Thanks a lot in advance
You can either use a 301 or you can rel=canonical if you think some visitors who use the search function would prefer to get the search-results style page.
You already have an answer from Rand (lucky you) but just to confirm I would 301/canonical to the preferred page, it just makes sense to focus your authority onto the one page.
Yes! Great suggestion Richard. Love that methodology for finding which search queries to make into landing pages.
Google does say that they don't like your search result pages in their search results, but whenever I type a slightly tricky local search query, they show a ton of such search result pages in the ads that appear on SERPs. So it seems AdWords ads are not spoiling search quality, but when you want to do it in organic, you are spoiling it.
Yes, I agree with you BrijB. Sometimes it looks like Google wants to push more and more website owners toward PPC.
Hi Rand, great video (as always!). I still am on cloud 9 from being able to touch the holy whiteboard earlier in the week =D.
You mentioned crawl bandwidth, do you know if anyone has done any research or tests on the limits of how many pages Google is able to crawl per day?
Also do you get any penalties towards the number of pages crawled if you have error(s)?
Cheers, Ash
I agree with you Rand; if a search result page provides valuable content to a user who couldn't easily find that content any other way, we need to let Google (Bing and others) index and rank those search result pages...
Hi Rand
We have built several "un-indexable" websites for a business that wants to have an exclusive offer for fidelity card clients only, arriving through a link in a private area of a high-traffic website. Having spent most of my time building search-friendly websites optimized for maximum visibility, the first time I heard this request I wanted to cry.
Anyway, we have used the robots meta with noindex, and absolutely no links on the web, and this has worked perfectly so far - 0 hits through search in over 1 year. It also definitely helped that we built the site on a brand new domain.
You made an interesting caveat about the conflict between robots.txt and robots meta - the txt stopping robots from actually reading the robots meta instructions - I hadn't looked at it that way.
Thanks
The "duplicate/thin content" bit is spot-on. As usual, great post. Fishkin for Prez.
Thanks for the video. So I was already pretty clear on a lot of this.
But I never did quite get rel=canonical when it came out. It sounds like it acts essentially as a 301, but without the actual redirect.
Is that accurate, or am I oversimplifying things?
Yeah, that's pretty much it. It's not as perfectly respected/followed as a 301, but close. More here: https://moz.com/learn/seo/canonicalization
Perfect. Thanks Rand!
What should the step-by-step process be if you are migrating from separate desktop and mobile URLs to responsive? Which should be used to make sure the mobile URLs no longer index in search results: 301 redirects, noindex/nofollow, and/or robots.txt to block mobile? If any of them should be used, when should they be added? Should anything get added before the full responsive migration? Should they be added when the site launches?
You'd want to do pretty much what's done when you redirect one site to another, i.e. redirect each individual m.yourdomain.com page to the right www.yourdomain.com page (rewrite rules can be very helpful here). I wouldn't block crawling to the m-dot URLs or you'll prevent Google from seeing the redirects!
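A rough sketch of such a rewrite rule for Apache, assuming the m-dot and www paths match one-to-one (adjust the mapping if they don't):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^m\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ https://www.yourdomain.com/$1 [R=301,L]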
Hi Rand, very informative video. Which WordPress tags should be indexed or not if every tag contains more than 5 articles?
Hey Rand!
First off, this is great for people like me who don't know much about robots.txt—thanks.
I had a quick question I wanted your opinion on. I submitted a sitemap of about 4,800 URLs through Search Console, but it's only indexing between 94-120. I know today (7/14) there was a bug, but this has been going on for months, and we aren't using any robots or noindex directives that would block them. Any advice?
Again, thanks for the videos; I look forward to them every Friday.
As usual a great WBF and a cool hair style :)
What's your take on improving the indexation of sub-domains? I came across a site which has a sub-domain version, and it's been months since its launch but only a single page has been indexed. Sitemap, robots.txt—everything seems alright. Can we control it in any other way?
Thanks,
If you're having indexation problems it's usually one of four things:
Thanks for your input. It seems 2nd and 3rd points are causing this.
By the way, is there any specific name of your new hair cut? :)
Umar
Hi Rand,
just got a quick question about your point at around 2:40, where you say that just because a page isn't crawled doesn't necessarily mean it won't be in the search results.
Am I getting this right:
If you disallow the page/folder right from the beginning, it shouldn't get indexed (assuming that Google respects your robots.txt settings)
If you disallow the page/folder after some time and it has already been crawled and indexed, then the disallow setting somehow "comes late" and the page could appear in the search results even though you set robots.txt to disallow
Any reply would be much appreciated
Thanks
Sorry about my lack of clarity - it's not that the disallowed page gets indexed, but it can get into search results. Google will show something like a "we can't show a description for this result because of robots.txt" type of message for the description. They don't actually crawl and index the page, but they do index the URL and create a record of it which can appear in search results.
Very useful video again. I also think the X-Robots-Tag HTTP header is quite a useful alternative to the meta robots tag, if I may say so. Once again, I'm quite impressed with the vids, hehe.
Each of these are issues for my clients. I appreciate the validation. Now I need to make a variety of cms tools play nice. Maybe the developers know how to read and will fix their crummy programming after I forward this article!
Rand! Great refresher for me here, enjoyed it as with all WBF's :)
I've got a question following the .css and .js warning that Google Search Console is distributing at present. I'm blocking our /ajax/ files (which load in boring dropdown lists dynamically) using our robots.txt file. This is causing Google to only partially render the page upon a 'Fetch', so I'm concerned I may have to revert this decision and allow crawling...
My question is then, can you rel="noindex" ajax files, .js, .css files etc. just as with 'normal' pages?
Having a mini panic about this as we're doing this to assist UX as these items have no value in the SERPs! :) Thanks!
Hi Daniel, while those files "have no value in the SERPs" they do impact how all of your other pages are crawled and shown in the SERPs. Do not worry about noindexing those files. Let Googlebot crawl them, see them, love them. They are seen as important and essential files to Google. Google Webmaster guidelines explicitly state:
https://support.google.com/webmasters/answer/35769...
If you do block those files, it could have a negative impact on your site's indexation and ranking. I think you are overthinking things. I would not panic because you do not have CSS and JS blocked; I would panic if you did have CSS and JS blocked!
Cheers
Thanks for the quick reply.
I'll remove the /ajax/ line from my robots.txt and allow Google to take them into the index. Shame I can't allow crawling of these assets and still have them noindexed... the files literally are just meaningless lists on their own! :)
Hopefully Google will realise these have no user value and drop the files themselves out of the index after a while.
Certainly not worried about allowing the .js and .css to be crawled - but just wondering if there was a method of noindexing without preventing crawling.
Thanks again for the response - if anyone else has any tips on this I'd be grateful to hear them! Thanks!
If they are meaningless lists, they will probably not rank so I think you are good. Are you concerned that they will outrank the main page?
Something else I just thought of: if those ajax files contain content that is canonical to the page, you can use a canonical HTTP header
https://moz.com/blog/how-to-advanced-relcanonical-...
to show that those are parts of the main page. Not sure if this applies to your setup, but it's another option.
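For reference, the header itself looks something like this (made-up URL):

Link: <https://www.example.com/main-page>; rel="canonical"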
Hi CleverPhD - No, not concerned they will outrank anything. We have around 12,000 pages of real content pages which should always outperform. It was just a case of making that decision for Google and telling it that these were unimportant with regards to indexation. Just looking to make its time on site more efficient! :)
Thanks for your help - much appreciated. This morning we've removed that line from the robots.txt file and will let the big G make its own mind up about these resources! :)
Hey Rand. I have a question about duplicate content for e-commerce websites.
I have different colours for the same product. However, I don't have a "default" page with no colour mentioned to use as the canonical.
In that case, can I say that the pink colour, for instance, is my main page and point the canonical from the other colours to it?
Thanks a lot
Hey Randy, how are you?
Amazing content, by the way. I sometimes draw on your posts to build my answers to the developer tech lead when I have to prove to them why I am requesting changes and updates on the website =) =) ... Good job!
So, I was looking throughout the internet for any info that could confirm the part where you said:
"Google has said many times that they don't like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case."
But I wasn't lucky enough to find any article or content about it. Could you please point me to where Google said that? Any hint will add more XP to my knowledge and my colleagues' here ;)
We have tons of search result pages cached on Google and I definitely would like to give them a solution.
Cheers,
Hi Jonatas,
I found this remark
Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.
in the classic webmaster guidelines and a pretty old blog article by Matt Cutts.
Cheers, Tobias
So if I create a page such as "pumping unit rentals" specifically to rank for that keyword, yet Google decides it wants to use a different page that does not have as much quality content, how can I get Google to switch?
The trouble with noindexing pages is that you can send the crawler into a black hole. If you have really bad pages on your site, either improve them or delete them. What is left is what the crawler should see and what the visitor should see too.
Hi,
If I just disallow the robots from accessing the entire folder instead of the URL, will the page show up in the search results?
The URL can still show up in results. The only way to keep it out entirely is to let Google crawl the page and use the meta robots noindex protocol.
Thanks :)
Hi Rand,
Thanks for this fantastic refresher!
I'm working with a real estate directory website (similar to Zillow and Trulia). In the past they experienced warnings in Google Webmaster Tools due to Googlebot encountering an extremely large number of links. This was occurring due to the almost-endless number of internal search results pages (many of which contain facets) that Google was discovering and attempting to crawl. The solution at the time was to block ALL search results pages in robots.txt and create a separate SEO-friendly directory of property type/location pages which was internally linked to from the footer and within the XML sitemap.
My feeling is that this is not the best solution to this problem and I'd like to propose removing that robots.txt disallow and completely change the way their URLs are structured so that useful pages are contained in subfolders (like property type, location, buy/rent/sold) and all non-search essential parameters (e.g. number of bathrooms, car spaces... etc.) are built as parameters. Rather than blocking these pages to Google, all pages would simply contain canonical tags which only contain the subfolder URLs (i.e. all parameters are stripped from the canonical URL). This solution should enable Google to crawl and honour any links at faceted pages while only prioritising and indexing valuable directory results. I'm hoping it will also make Google's crawl of the site more efficient, saving on bandwidth. The only real drawback is that I would have to setup 301s for all old URLs.
Is this the best solution for this type of website?
Hi Rand,
Great article/video; it's also a good refresher, because it's really important to know how to control search engine crawlers.
However, I have a question on your second example (ecommerce tee shirt and different colors).
Why do you say that using noindex on canonicalized pages might interfere with the rel canonical?
If we don't use the noindex meta tag, it's possible that Google indexes these pages (and also the default version) if they have usage and backlinks.
Thanks for the time you will spend to answer me.
Hi Rand,
Thanks for this post. Duplicate content has been an ongoing issue for us. Our organic rankings are doing pretty good, but we could potentially do better by eliminating our dup content issues.
I'm wondering what you would recommend for the following.
The Moz crawl says that https://www.incipio.com/cases/tablet-cases.html is a duplicate of https://www.incipio.com/cases/tablet-cases/microsof... . There is a rel canonical in place on the https://www.incipio.com/cases/tablet-cases.html page. The canonical is the following: <link rel="canonical" href="https://www.incipio.com/cases/tablet-cases.html" />
Why is the crawl still saying there is a duplicate with the canonical?
Thanks for your help!
Hello Nicole,
If you don't find the answer you are looking for here, you can always ask a question like this to our community in the Moz Q&A Forum! https://moz.com/community/q
Hi Danielle,
I've tried that. I was hoping someone here could answer.
Thanks!
Very useful, thanks Rand! I was wondering about the use of "Crawl-delay" in the robots.txt - under which circumstances would you want to use that?
This video is a total refresher! Thank you, Rand! You are always great and provide useful information :)
Most of all I like point 4 about the internal search engines - it always depends on the website, its direction, business model, and the audience behavior, I think, but in most cases it is really confusing and not a good idea for the search results to come up in Google.
Dido Grigorov
This article is so nice; it gives so much knowledge about the robots.txt file.
Hi Fishkin...
Very nice video and post. I already knew about robots.txt and the robots meta tag, but didn't know these deeper facts. Very interesting for my software company and SEO company; we will definitely use this on client websites.
Hi Rand
Great video and something that I am currently doing a bit of work around myself. Loved the part about product pages and the multiple variants that often come along with them—size, colour, images, etc. I so agree that the best solution here is to use the rel=canonical tag to point to the original source of that information. Although, how about this as another idea: applying rules in the code so that when another colour is selected, the rule rewrites the meta title and meta description by adding the colour to those attributes. So for instance, let's say we are selling Nike (which by the way we don't :-)
www.examplesite.com/nike/test-trainer
Meta title - Nike Test Trainer - example site clothing store
Meta Description - Shop the Nike test Trainers from official stockists example clothing store | Free deliveries on orders over £40.
Then apply a rewrite rule code-side so that when the user, for instance, selects red for that item:
www.examplesite.com/nike/test-trainer/red
Meta Title - Nike Test Trainer in Red - Example Site Clothing Store
Meta Description - Shop the Nike Test Trainers in Red from Official Stockists Example Clothing Store | Free Deliveries on Orders Over £40
and so on and so forth.
The new page would have the rel=canonical tag on this pointing to the original page.
So the question is 2 fold.
1) Is this generating a new page to be indexed, and would it help with long-tail queries?
2) Would the rel=canonical need removing from the new pages for the above to happen, and would it then create duplication issues across the site?
I would love to hear some feedback on this and if anyone has tested the above and what sort of results you had either way.
If you don't find the answer you are looking for here, another great place to ask a question like this is our community Q&A Forum! https://moz.com/community/q
Hello there!
I know I am late on this, but I had a quick comment on #3, passing link equity without appearing in search results. What is missing here is the use of the rel=next and rel=prev tags on those paginated pages. Google recommends using rel next/prev so that they can see the relationship between the series of pages. You can then add the noindex,follow onto the paginated pages (with the exception of page 1, assuming it is a useful landing page with information) so that pages 2-n do not get indexed (or get removed).
I think of the rel=next prev as the opposite of disallowing those pages in robots. Rel next prev helps Google navigate the paginated pages so that it can find what it needs (i.e. the pages that are linked to in the pagination) before the meta robots prevents the paginated pages from being indexed. (I think I made sense there).
Technically, Google would prefer for you to not use meta robots with rel next prev, as they would like to decide what page in the series is most important for the search results, but we use the combo so that we have better control and it works pretty well.
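For reference, on page 2 of a series the head might contain something like this (example URLs):

<link rel="prev" href="https://www.example.com/category?page=1" />
<link rel="next" href="https://www.example.com/category?page=3" />
<meta name="robots" content="noindex, follow" />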
Cheers!
Hey Rand - great post. Most of the post was about how to not have some things crawled. What about the opposite?
What are your thoughts about throwing your sitemap into the robots file? Does that help make the site more crawlable? And what about all those <changefreq> and <priority> values? I've always kind of put my thumb in the air on that one.
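For reference, what I mean is the one-line declaration in robots.txt plus the optional tags inside each sitemap entry (hypothetical URLs):

Sitemap: https://www.example.com/sitemap.xml

and, inside the sitemap itself:

<url>
  <loc>https://www.example.com/some-page/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
</url>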
Such a fantastic feature for every one of us. Many newbies as well as experienced folks get confused between them, but after going through this great stuff, I don't think they would be in any doubt. You have given truly great examples for robots.txt and the meta robots tag.
Thank you
Awesome post:-) I especially like the part about ---> So then we have our SEO folks go, "you know what, let's make doubly sure that doesn't show up in search results; we'll put in the meta robots tag:" <---