One of the things that excites me most about the development of the web is the growth in learning resources. When I went to college in 1998, it was exciting enough to be able to search journals, get access to thousands of dollars' worth of textbooks, and download open source software. These days, technologies like Khan Academy, iTunesU, Treehouse and Codecademy take that to another level.
I've been particularly excited by the possibilities for interactive learning we see coming out of places like Codecademy. It's obviously most suited to learning things that look like programming languages - where computers are naturally good at interpreting the "answer" - which got me thinking about what bits of online marketing look like that.
The kinds of things that computers are designed to interpret in our marketing world are:
- Search queries - particularly those that, like [site:distilled.net -inurl:www], look more like programming constructs than natural language queries
- The on-site part of setting up analytics - setting custom variables and events, adding virtual pageviews, modifying e-commerce tracking, and the like
- Robots.txt syntax and rules
- HTML constructs like links, meta page information, alt attributes, etc.
- Skills like Excel formulae that many of us find a critical part of our day-to-day job
I've been gradually building out Codecademy-style interactive learning environments for all of these things for DistilledU, our online training platform, but most of them are only available to paying members. I thought it would make a nice start to 2013 to pull one of these modules out from behind the paywall and give it away to the SEOmoz community. I picked the robots.txt one because our in-app feedback shows it's one of the modules people have learned the most from.
Also, despite years of experience, I discovered some things I didn't know as I wrote this module (particularly about precedence of different rules and the interaction of wildcards with explicit rules). I'm hoping that it'll be useful to many of you as well - beginners and experts alike.
Interactive guide to Robots.txt
Robots.txt is a plain-text file found in the root of a domain (e.g. www.example.com/robots.txt). It is a widely-acknowledged standard and allows webmasters to control all kinds of automated consumption of their site, not just by search engines.
In addition to reading about the protocol, robots.txt is one of the more accessible areas of SEO since you can access any site's robots.txt. Once you have completed this module, you will find value in making sure you understand the robots.txt files of some large sites (for example Google and Amazon).
For each of the following sections, modify the text in the textareas and see them go green when you get the right answer.
Basic Exclusion
The most common use-case for robots.txt is to block robots from accessing specific pages. The simplest version applies the rule to all robots with a line saying User-agent: *. Subsequent lines contain specific exclusions that work cumulatively, so the code below blocks robots from accessing /secret.html.
Add another rule to block access to /secret2.html in addition to /secret.html.
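For illustration (a sketch - the exact snippet in the interactive box may differ slightly), a file with both exclusions looks like this:
User-agent: *
Disallow: /secret.html
# a second Disallow line adds a further exclusion; rules apply cumulatively
Disallow: /secret2.html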
Exclude Directories
If you end an exclusion directive with a trailing slash ("/") such as Disallow: /private/ then everything within the directory is blocked.
Modify the exclusion rule below to block the folder called secret instead of the page secret.html.
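For illustration, the directory version of the rule looks like this (sketch only):
User-agent: *
# the trailing slash blocks everything inside the folder
Disallow: /secret/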
Allow Specific Paths
In addition to disallowing specific paths, the robots.txt syntax allows for allowing specific paths. Note that allowing robot access is the default state, so if there are no rules in a file, all paths are allowed.
The primary use for the Allow: directive is to over-ride more general Disallow: directives. The precedence rule states that "the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."
We will demonstrate this by modifying the exclusion of the /secret/ folder below with an Allow: rule allowing /secret/not-secret.html. Since this rule is longer, it will take precedence.
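A sketch of the resulting file (illustrative; the wording of the exercise snippet may differ):
User-agent: *
Disallow: /secret/
# the longer (more specific) rule takes precedence for this one page
Allow: /secret/not-secret.html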
Restrict to Specific User Agents
All the directives we have worked with have applied equally to all robots. This is specified by the User-agent: * that begins our commands. By replacing the *, however, we can design rules that only apply to specific named robots.
Replace the * with googlebot in the example below to create a rule that applies only to Google's robot.
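Something like this, for example (the Disallow path here is just a placeholder for whatever the exercise contains):
# this block now applies only to Google's main crawler; other robots ignore it
User-agent: googlebot
Disallow: /secret/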
Add Multiple Blocks
It is possible to have multiple blocks of commands targeting different sets of robots. The robots.txt example below will allow googlebot to access all files except those in the /secret/ directory and will block all other robots from the whole site. Note that because there is a set of directives aimed explicitly at googlebot, googlebot will entirely ignore the directives aimed at all robots. This means you can't build up your exclusions from a base of common exclusions. If you want to target named robots, each block must specify all its own rules.
Add a second block of directives targeting all robots (User-agent: *) that blocks the whole site (Disallow: /). This will create a robots.txt file that blocks the whole site from all robots except googlebot which can crawl any page except those in the /secret/ folder.
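A sketch of the completed file:
User-agent: googlebot
Disallow: /secret/

# every other robot is blocked from the whole site
User-agent: *
Disallow: /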
Use More Specific User Agents
There are occasions when you wish to control the behavior of specific crawlers such as Google's Images crawler differently from the main googlebot. In order to enable this in robots.txt, these crawlers will choose to listen to the most specific user-agent string that applies to them. So, for example, if there is a block of instructions for googlebot and one for googlebot-images then the images crawler will obey the latter set of directives. If there is no specific set of instructions for googlebot-images (or any of the other specialist googlebots) they will obey the regular googlebot directives.
Note that a crawler will only ever obey one set of directives - there is no concept of cumulatively applying directives across groups.
Given the following robots.txt, googlebot-images will obey the googlebot directives (in other words, it will not crawl the /secret/ folder). Modify this so that the instructions for googlebot (and googlebot-news etc.) remain the same but googlebot-images has a specific set of directives meaning that it will not crawl the /secret/ folder or the /copyright/ folder:
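A sketch of a file that satisfies this (illustrative):
User-agent: googlebot
Disallow: /secret/

# googlebot-images obeys only its own block, so /secret/ must be repeated here
User-agent: googlebot-images
Disallow: /secret/
Disallow: /copyright/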
Basic Wildcards
Trailing wildcards (designated with *) are ignored, so Disallow: /private* is the same as Disallow: /private. Wildcards are useful, however, for matching multiple kinds of pages at once. The star character (*) matches 0 or more instances of any valid character (including /, ?, etc.).
For example, Disallow: news*.html blocks:
- news.html
- news1.html
- news1234.html
- newsy.html
- news1234.html?id=1
But does not block:
- newshtml (note the lack of a ".")
- News.html (matches are case-sensitive)
- /directory/news.html
Modify the following pattern to block only pages ending .html in the blog directory instead of the whole blog directory:
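A sketch of the kind of answer being asked for (the $ "ends with" anchor is covered in a later section, so a plain wildcard is enough here):
User-agent: *
# matches any path under /blog/ that contains ".html"
Disallow: /blog/*.html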
Block Certain Parameters
One common use-case of wildcards is to block certain parameters. For example, one way of handling faceted navigation is to block combinations of 4 or more facets. One way to do this is to have your system add a parameter to all combinations of 4+ facets such as ?crawl=no. This would mean for example that the URL for 3 facets might be /facet1/facet2/facet3/ but that when a fourth is added, this becomes /facet1/facet2/facet3/facet4/?crawl=no.
The robots rule that blocks this should look for *crawl=no (not *?crawl=no, because a URL with a query string like ?sort=asc&crawl=no would slip through the latter).
Add a Disallow: rule to the robots.txt below to prevent any pages that contain crawl=no being crawled.
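For example (illustrative sketch):
User-agent: *
# matches crawl=no whether it appears as ?crawl=no or &crawl=no
Disallow: *crawl=no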
Match Whole Filenames
As we saw with folder exclusions (where a pattern like /private/ would match paths of files contained within that folder such as /private/privatefile.html), by default the patterns we specify in robots.txt are happy to match only a portion of the filename and allow anything to come afterwards even without explicit wildcards.
There are times when we want to be able to enforce a pattern matching an entire filename (with or without wildcards). For example, the following robots.txt looks like it prevents jpg files from being crawled but in fact would also prevent a file named explanation-of-.jpg.html from being crawled because that also matches the pattern.
If you want a pattern to match only at the end of the filename, you should end it with a $ sign, which signifies "end of line". For example, modifying an exclusion from Disallow: /private.html to Disallow: /private.html$ would stop the pattern matching /private.html?sort=asc and hence allow that page to be crawled.
Modify the pattern below to exclude actual .jpg files (i.e. those that end with .jpg).
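Something like this (illustrative):
User-agent: *
# the $ anchors the pattern to the end of the URL, so only paths that
# actually end in .jpg are blocked
Disallow: /*.jpg$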
Add an XML Sitemap
The last line in many robots.txt files is a directive specifying the location of the site's XML sitemap. There are many good reasons for including a sitemap for your site and also for listing it in your robots.txt file. You can read more about XML sitemaps here.
You specify your sitemap's location using a directive of the form Sitemap: <path>.
Add a sitemap directive to the following robots.txt for a sitemap called my-sitemap.xml that can be found at https://www.distilled.net/my-sitemap.xml.
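A sketch of the finished file (the Disallow line simply stands in for whatever rules the exercise already contains):
User-agent: *
Disallow: /private/
# sitemap locations should be given as full URLs
Sitemap: https://www.distilled.net/my-sitemap.xml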
Add a Video Sitemap
In fact, you can add multiple XML sitemaps (each on its own line) using this syntax. Go ahead and modify the robots.txt below to also include a video sitemap called my-video-sitemap.xml that can be found at https://www.distilled.net/my-video-sitemap.xml.
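Building on the previous sketch, the result would look something like:
User-agent: *
Disallow: /private/
Sitemap: https://www.distilled.net/my-sitemap.xml
# each additional sitemap goes on its own line
Sitemap: https://www.distilled.net/my-video-sitemap.xml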
What to do if you are stuck on any of these tests
Firstly, there is every chance that I've made a mistake in my JavaScript tests that causes them to fail to grade some correct solutions properly. Sorry if that's the case - I'll try to fix them up if you let me know.
Whether you think you've got the answer right (but the box hasn't gone green) or you are stuck and haven't got a clue how to proceed, please just:
- Check the comments to see if anyone else has had the same issue; if not:
- Leave a comment saying which test you are trying to complete and what your best guess answer is
This will let me help you out as quickly as possible.
Obligatory disclaimers
Please don't use any of the robots.txt snippets above on your own site - they are illustrative only (and some would be a very bad idea). The idea of this post is to teach the general principles about how robots.txt files are interpreted rather than to explain the best ways of using them. For more of the latter, I recommend the following posts:
- How to block content from the search results (pro-tip - don't rely on robots.txt despite my examples above excluding "secret" files and folders)
- Learn more about why you might want to block robots from certain areas of your site
- Avoid accidentally giving conflicting directives with the various different ways of blocking robots
- Read up on some "don'ts" (old but still relevant): robots.txt misuse, accidentally blocking link juice
I hope that you've found something useful in these exercises whether you're a beginner or a pro. I look forward to hearing your feedback in the comments.
Thanks for making the module free! Hope it's a step to more free modules ;-) Joking aside - I really love the way DistilledU is developing. It looks very promising. Codecademy is a great website to look up to and use as an example.
As for robots.txt, I've never really needed the 'advanced' ways since A) all crawlers are welcome and B) I usually handle exclusion of a page/directory on the page itself so as not to accidentally block the whole site!
Very interesting post, especially for beginners, but I have to correct you on a basic/extremely-important piece of code that you have misspelled:
Sitemap: /file.xml <- this is totally wrong
Sitemap: file.xml <- still wrong
Sitemap: https://www.site.com/file.xml <- correct
Just try yourself inside your WMT panel: proof here
Bye =)
Good spot. I'll get that fixed up.
That should be fixed shortly (here and within DistilledU!).
Thanks for pointing it out.
It's fixed now (still need to update the text to make it clear that the new version tests for a sitemap at https://www.distilled.net/my-sitemap.xml).
This tool was great for one of my new account managers, thanks so much Will!
He did mention that while you're making this change, you might wanna make it clear that the next question tests for a sitemap at https://www.distilled.net/my-video-sitemap.xml as well.
@Will
Interesting post. I would like to add...
* Just stating the obvious for the newbies, but a robots.txt block prevents an on-page meta canonical from being crawled, hence if canonicals are important - then robots.txt might not be the best way to block or control crawled and indexed pages (e.g. excluding parameters in GWT and BingWT might be safer).
* Robots.txt is only applied at a sub-domain level, thus a block on www.seomoz.org does not stop stageosev3.seomoz.org or apiwiki.seomoz.org being crawled for example:
https://www.seomoz.org/robots.txt
https://apiwiki.seomoz.org/robots.txt
https://go.seomoz.org/robots.txt
https://stageosev3.seomoz.org/robots.txt
* Personally, for dev servers and news websites - I use both Disallow: / and Noarchive: / in order to prevent internal pages and external backlinks to these pages being indexed (rather than just preventing internal pages from being crawled) - just to be on the safe side.
For example the sub-domain "go.seomoz.org" is disallowed, but backlinks with &aff_id are not Noarchived. Hence a search for "site:go.seomoz.org inurl:aff_id" returns 550 results.
https://www.google.com/search?q=site:go.seomoz.org+inurl:aff_id&num=100&pws=0&as_qdr=all&prmd=imvns&filter=0
* On the PPC side, you did not mention that User-agent: AdsBot-Google does not follow the robots.txt protocol precisely - because it ignores user-agent: * and thus needs to be declared explicitly.
Conversely BingAdbot (called User-agent: Adidxbot) does comply with robots.txt user-agent: *
* Personally I prefer to use "&seo_robots_noindex=true" rather than "&crawl=no"
then
User-agent: *
Disallow: /*?seo_robots_noindex=true
Disallow: /*&seo_robots_noindex=true
e.g. https://www.seomoz.org/pages/search_results#stq={keyword}&placement_category={target}&seo_robots_noindex=true
Or
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: Adidxbot
Disallow:
Allow: /ppc-landing-pages/
* Lastly, you might find this interesting: it is a "robots.txt whitelist" I collated (a technique used by www.facebook.com/robots.txt and www.alexa.com/robots.txt) as a means to block scrapers and reduce bandwidth rather than using a server-side htaccess user-agent block.
https://www.dropbox.com/s/w7wl8k79oz00wuy/robots.txt
Note: "Crawl-delay: 30" within robots.txt can also be used to reduce bandwidth.
Cheers
Phil.
P.S. Going a bit off topic, but the importance of using the "&seo_robots_noindex=true" or "crawl=no" method in URLs increases when lots of dynamic pages are used for PPC (and hence need to be blocked from Google organic search). For example, here is a string I am testing at the moment:
https://www.mydomain.com/[category]/[seed]/[expansion]/[final]/?search={keyword}&device={ifmobile:mobile}&city={lb.city}&postalCode={lb.postalCode}&adtype={adtype}&KW={keyword}&match={matchtype}{ifcontent:c}&distribution={ifsearch:search}{ifcontent:content}&creativeid={creative}&adposition={adposition}&network={network}&placement_category={target}&placement={placement}&ad_param2={param2}&ad_param1={param1}&ad_insertiontext={insertionText}&adwords_producttargetid={adwords_producttargetid}&campaign_exp_aceid={aceid}&seo_robots_noindex=true
Good call on the Adsbot, it's not a well known fact and you just reminded me too.
Do you think that the whitelist technique is useful? Scrapers don't have to respect robots.txt, nor would they let directives stop them ;)
I've been a member of DistilledU since it was in beta (not sure if it still is??) and can say it's fantastic. This is just one example out of loads of great resources and training it provides, as well as all the videos that have recently been made available to DistilledU members. Would highly recommend a visit (and a subscription). Keep up the good work guys.
Thanks Will. The rules are displaying on a single line in your examples, shouldn't they be on separate lines? M
Yes. They were in my final draft post - I'm working to get that fixed up now.
The post is now updated with this change. Sorry for the inconvenience, and thanks for reading!
Great. Was about to say you learn something new every day haha
This is, hands down, the best post I've ever seen on the SEOmoz site. I wish I had Will's coding skills, so I could teach people Excel this way.
Thanks Annie!
I already did the excel one in DistilledU.:)
Hey, thanks very much for this Will. I've been using DistilledU for quite a few months now, and it's been particularly useful for training new staff. However, there is clearly something for the advanced as well with the new modules that are being added. Another useful interactive module I noticed was the "search operators" lesson in the beginners section, where you get quizzed on which "operators" are best for each given situation. Really been impressed with DistilledU for training + learning purposes!
I love Distilled U!
Will, 100 thumbs up post :) It really gives a clear idea of how we can use the robots.txt file in a smart way.
Good post on robots.txt with basics for newbies. But can somebody tell me what's the difference between:
User-agent: *
Disallow: /secret/
and
User-agent: *
Disallow: /secret/*
Thanks in advance.
There's no difference. Wildcards at the end of the line are ignored (or, rather, implied).
Thanks for the answer. If so, I should test it with several common search engines...
Thanks Will, very good interactive guide/tutorial. If all other modules are like this one, $40 a month is a great deal. Will have to give it a try.
Thanks for sharing. Nice summarized tutorial and examples about robots.txt.
Brilliant idea for a post, Will. They say people all learn in different ways, and I'd much rather try an interactive version of something than simply read about it. This is like the Codecademy of SEO! Good work!
How about one on hreflang in the future? I can see that being useful for a lot of folk! :-)
Hey Will, thanks for sharing such a nice post. I believe the wildcard characters are the star (*) and the question mark (?). Is the question mark (?) considered a wildcard in robots.txt?
That's just a perfect way to learn!
I just couldn't resist making my way through - and I have learned some new things about robots.txt.
Thanks for the links at the top. I already knew Codecademy (I am a huge fan, too) - I will check out the others asap.
Wow that functionality is awesome!
Very simple and easy to follow, even for tech challenged users. An excellent share on some of the ways to leverage using the robots.txt file.
Very cool, interactive post.
Thanks Will this is great
One question - if a disallow is added, say Disallow: /tag/ but a load of pages within the tag folder are already indexed, should they fall out over time, or is there a better way to handle this?
Robots.txt doesn't actually do anything with regard to indexation (even pages that have always been blocked with robots.txt can appear in the index - the search engines just don't know much about them).
Depending on your priorities, you could add a meta noindex to those pages, let them get re-crawled and then block them in robots.txt.
It's one of the best posts I've ever read for learning about robots.txt from the basics - it helps everyone understand it easily.
Very informative post. I knew about robots.txt, but I got a detailed explanation here.
Bookmarked.
Dear willcritchlow,
How do I solve this problem on my website?
Ex: My website has the page structure mysite.com/category/a-b-c.html. But now 2 pages appear:
mysite.com/category/login.aspx and mysite.com/category/register.aspx. I think these problems come from my developers. And I tried to use robots.txt like:
User-agent: *
Disallow: /register.aspx
Disallow: /login.aspx
But it's not effective! So please tell me how to solve this problem using robots.txt.
Thanks.
You need something like:
User-agent: *
Disallow: */register.aspx
Disallow: */login.aspx
(assuming you want to block all versions of those pages wherever they sit).
@willcritchlow
Regarding your example:
User-agent: *
Disallow: /secret/
Allow: /secret/not-secret.html
Will all the folders and files that are in the /secret/ folder be excluded (except not-secret.html)?
If not, how can I block robot access to (or noindex) the directories or files in the example below:
/secret/directory1/
/secret/directory2/
/secret/*.*
except /secret/not-secret.html or another directory ex. /secret/allowed/
Thank you in advance.
Correct. Which I think means you know how to do the second part of your question - you simply allow the specific paths you want to allow within the disallowed directory.
I got it :)
Thanks
I like your article & the way you explain robots.txt. It is a nice article & contains genuine data for learning how to create a robots.txt file for a site.
Nice, I am definitely going to return to this post when I need a specific robots.txt command!
Nice idea, but is it broken? As of March 2015 it "lights up green" with any entry. Seems broken to me.
Is there a trick to get it to work??
I know this is an old post, but it ranks well on Google (~7th) for "robots.txt sample"
It looks like some of the formatting on the page has been broken over the years? Specifically, see the "Add an XML Sitemap" section and how most of the text boxes appear to be missing line breaks.
It still helped me out though, Thanks!
In case it helps anyone that lands here, you can test robots.txt rules with example urls here: https://robots-txt-parser.stapps.io/
Thanks so much for this - super helpful as I try to enhance my skill set. The correct answers were not turning green for me - is this functionality still operational? Or was I just completely wrong on everything ...
Hi.. I'm working on a Chinese website, and due to some laws in the country, it shows a splash screen asking users if they are below 18 or 18+. This is causing an issue with the indexing of the site. Google isn't fetching the content of the site; it only shows the splash screen in the indexed data.
If I apply the "Crawl-delay" directive, would this help me? Because by then the user would have selected one of the options.
Is it just me, or are the boxes not turning green anymore?
This is amazing! Can anyone clarify robots.txt for me? What should I do if I want to block a single page whose URL is www.xyz.com/abe-ace?
Hello ,
I have one question. I have a site, for example https://www.abc.com/xyz.html, and this page has pagination. When I go to page 2, the URL shows as https://www.abc.com/xyz.html?p=2, which means Google treats it as duplicate content. How can I prevent this using robots.txt? I know about canonical, but I want to do this using robots.txt. Please help me.
-Ankit
Hi Will! Yes it's so true that robots.txt is a plain-text file found in the root of a domain. It is actually an accepted standard and allows webmasters to control all kinds of automated consumption of their site, and not just by search engines.
Great intro to robots.txt! This answered a lot of my questions, thanks!
Thanks Will, great for referencing.
Great post, but it seems to me that the example entries are pretty mixed up. Under "exclude directories" the example is excluding a single file. Under "allow specific paths" the example is excluding a path. Maybe they are all one step out of sequence or something? Makes it confusing.
They are interactive. You need to enter the correct answers...
Will, thank you for the detailed post about the fundamental things in SEO. But I think many people are faced with a situation where Googlebot treats a disallow in robots.txt only as "don't index the content of this page", but the page URL can still appear in the SERP with the notice "A description for this result is not available because of this site's robots.txt" (or even with link anchors), especially if the page has some inbound links or Google+ mentions. So the best way to prevent page indexing and ranking is a meta noindex instruction in the document head section. Do you agree?
Robots.txt is good when you don't want Google to access a lot of pages and waste crawl bandwidth. Really there is no general best solution, rather the best solution is dependent upon your goals.
Some of the resources I link out to at the end give more general guidance on when robots.txt is a good idea (and what kind of problem it's good for solving).
Also, see Geoff's answer :)
Yes, colleagues. I agree about the importance of goals.
Hey Will,
I'm curious to know if you can add a wildcard between elements in a robots.txt file and not risk blocking everything past the first element. For instance, what would be the impact of having a Disallow element like this?:
Disallow: /folder/*.aspx
Would that only block .aspx elements within /folder/ or would that disallow everything under /folder/?
To expand my question even further, would you be able to block all occurrences of a specific subfolder name regardless of the URL structure if you had a Disallow element like this?:
Disallow: /folder/*/widgets/
Thanks,
Keith
Outstanding post! A basic file that is so flexible and powerful. Thanks so much for sharing great info and examples.
+1 Will, I really enjoyed testing myself with the interactive robots.txt test!! Thank you!
I've got two that aren't turning green:
Add Multiple Blocks:
User-agent: googlebot
Disallow: /secret/

User-agent: *
Disallow: /
Use More Specific User Agents:
User-agent: googlebot
Disallow: /secret/

User-agent: googlebot-images
Disallow: /secret/

User-agent: googlebot-images
Disallow: /copyright/
Hi there - sorry that there weren't line breaks in the text areas initially. Try refreshing the page and having another go - it should be clearer and easier now.
I *think* that might be the root of your issues with the first one.
For the "more specific user agents" test, I think your solution may be technically correct, but I think you have an extra "User-agent: googlebot-images" versus my proposed solution - you can add both the Disallow lines into the same block (as per the very first test).
Let me know if that's not clear or still doesn't work for you.
Thank you for this work; it's a good post for learning robots.txt. You added some good examples.
Must admit, quite impressive!
I will send a couple of my online buddies who are interested in SEO techniques to this post.
Now that you've changed the answers to the two sitemap questions, using /file.xml doesn't work and I had no idea what domain to use. I figured out it was https://www.distilled.net/ eventually but without that, the last two are very confusing to try to answer.
Great post - I wish I knew how to do what you've done to make these interactive questions. That's a pretty cool feature for this type of educational post.
Hi Will, AWESOME post and love the interactivity. Just a quick question, I seem to be stuck on the basic wildcards section. I've come up with: User-agent: * Disallow: *.html /blog/ --- what am I doing wrong? Many thanks!
Quite excellent information on technical SEO. I just have a doubt: what should the robots.txt URL be for this sample website https://example.example1.com ?
Thanks for sharing this post.
https://example.example1.com/robots.txt
Is that what you mean?
Kalu - use https://www.distilled.net
@Will - see my comment about 8 up. Now that the answers are fixed on the last two per Danillo's comment, you need to know the URL for the file to make the last two turn green.
User-agent: *
Disallow: /private/
Sitemap: https://www.distilled.net/my-sitemap.xml
Sitemap: https://www.distilled.net/my-video-sitemap.xml
Without knowing the domain, you can't answer those.
Yep - sorry - I can't edit the post in live and I missed that - getting it fixed up with the editors.
Will, excellent post on robots.txt. The start of the article is very interesting. "Add multiple blocks" and "use more specific user agents" will help a beginner with robots.txt.
I am trying to block a certain directory of pages using:
Disallow: /directory/*/directory2/
Will this still allow the use of pages inside /directory/?
Yes. That blocks things like /directory/anything/directory2/anything. Is that what you mean?
Extraordinary post on robots.txt. Thank you for the detailed post about the essential factors in SEO.
Thanks Will,
This is a good reference for people wanting to learn about Robots.txt. Seems to be back in style again huh?
An oldie but a real goodie.
Hey Will,
Above you show examples for "Disallow: news*.html"
In that section you say it blocks "/news.html" but it doesn't block "/directory/news.html". Can you give me a URL example of when /news.html would be blocked? Does it have to bump up right next to the root as in https://domain.com/news.html ?
Thanks.
~ Joe
Any of the following patterns would block /directory/news.html
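(A few illustrative possibilities - any one of these would match /directory/news.html:)
Disallow: /directory/
Disallow: /directory/news.html
Disallow: /*news.html
Disallow: /*.html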
Hope that helps.
Hi Will,
It definitely helps, but what you wanted to say about
Is that sitemap location /private/?
tks
As another commenter reminded me, sitemap locations need to be full URLs so it should be:
https://www.distilled.net/my-sitemap.xml
Hope that helps.
Hey Untypical! That's really a very good question. Thanks Will for such a nice and very interactive robots.txt 101. I too have a question for you. What should one do to block directories and pages on a subdomain? Should we add a separate robots.txt file for each subdomain on a site?
Yes. They need their own robots file.
This is a nice intro to robots.txt. For further reading and advanced techniques I would recommend Google's Robots.txt Specifications.
Yep - definitely worth going to the source. I linked to that in the paragraph that starts "In addition to reading about the protocol". Thanks
Oops - I missed that. Instead I'll contribute a web-based tool for creating a robots.txt and a tutorial on building a robots.txt using the IIS SEO Toolkit. :-)
This is very basic but really in-depth, and it explains things from the start so that even someone who has no idea what robots.txt is about can learn how to work with it. I love these guides as they are handy and very actionable.
I will add this to my presentation some time so that more and more people can read about this and learn from it.
Thank you Will for this!
Great post for a beginner... but I'm a bit disappointed that there is nothing new for an experienced one :(
@willcritchlow but I felt it was very boring stuff. You went too much in depth.
I disagree with you Mike. I think one of the myths about SEO is that it's all an exciting, magical concoction of chants, spells and potions. While it may not be nearly as sexy (nor does it require a magic wand), understanding the value of a properly configured Robots.txt file is real grunt-work SEO. It's that kind of thing that often makes the biggest difference. This post is totally aligned with my SEO mantra: "It's a matter of time, patience and intelligent work."
Thanks for the post Will. I'm definitely a fan.