Backlink Blindspots: The State of Robots.txt

Comments 57


Cyrus Shepard
Associate

2018-05-22T01:15:46-07:00

Fantastic research!

Have to wonder if this is cause and effect. Many webmasters block robots after finding them in their server logs. Since Ahrefs/Majestic previously crawled more sites up until now, it would make sense that they are blocked more often.

If this is the case, it's easy to speculate that Moz may be added to a proportional # of robots.txt moving forward.

Cost of success :) Thoughts?

Cyrus-Shepard edited 2018-05-22T01:55:30-07:00
9 0

- Tim Soulo
 
 2018-05-22T02:32:36-07:00
 
 exactly my thoughts, Cyrus :)
 
 2 0
 
- Russ Jones
 
 2018-05-22T04:52:06-07:00
 
 A "cost of success" is one way of putting it, but it is also a "threat of existence". Backlink index crawlers must balance politeness with pervasiveness if they want to continue to offer a valuable product to their customers.
 
 Being blocked is a function of time and speed. Crawlers that have been active for a very long time will likely collect some blocks, even if they are very polite. Crawlers that have been active only a short time but are aggressive will collect blocks very quickly. Majestic has been crawling the longest and probably the 2nd most aggressive of the group compared. Ahrefs has been crawling the shortest but with the most aggressive crawlers. Moz has been crawling the 2nd longest but with very polite crawlers.
 
 We are addressing our crawlers as a part of our beta, but I doubt we will modify our politeness much. We will crawl wider rather than deeper in most cases in order to prevent exactly these kinds of problems.
 
 4 0
 - Jean-Christophe Chouinard
 
 2018-05-22T05:24:07-07:00
 
 Since it is highly unlikely that anyone would block all bots from crawling their website, isn't it more than likely that crawlers will use multiple user-agents in order to have a realistic profile of backlinks?
 
 1 0
 
 - Russ Jones
 
 2018-05-22T07:07:47-07:00
 
 Actually, the most common type of robots.txt block we encounter is a blanket dismissal of all bots except major search engines - ie: block all and explicitly allow google, bing, yahoo, baidu, etc.
 
 To my knowledge, Moz, Majestic and Ahrefs all abide by the Robots.txt protocol strictly and do not surf with different UAs just to bypass the rules. This would be highly unethical, IMHO.
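For illustration, the "block all, allow the major search engines" pattern might look like the robots.txt below, checked here with Python's standard urllib.robotparser. The file contents and the example.com URL are made up for this sketch and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow two search engines, block everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["Googlebot", "bingbot", "dotbot", "AhrefsBot", "MJ12bot"])
for agent, allowed in result.items():
    print(agent, "allowed" if allowed else "blocked")
```

A crawler that honours the protocol under its own name falls into the catch-all group here, which is why this pattern hits every backlink crawler equally.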
 
 5 0
 
 - Cyrus Shepard
 
 2018-05-22T09:28:47-07:00
 
 Fair points Russ. I suppose it's all speculation as to why.
 
 And certainly not to distract from your main point, which is that according to the data, Moz currently has the most allowed crawler out of the major tool providers.
 
 1 0
 
elenaglop

2018-06-04T10:27:15-07:00

Great job! Thanks for sharing this with us. It is very important to pay close attention to the robots.txt file; it is the best way to tell Google where to go and where not to.

Thank you for the amazing tips!

4 0

Ankit Mishra

2018-05-22T03:01:43-07:00

Great Insight Russ Jones,

People at Moz are working hard and we see the improvement.

Site owners block these crawlers because they see entries in their server logs from bots that are not part of their target audience, and as a result the bots get blocked from crawling. But I'm sure this will not be the case with Moz.

Cheers.

Aankitmishra edited 2018-05-22T03:05:33-07:00
3 0

Dmitry-Ahrefs

2018-05-22T00:42:12-07:00

Are the Moz stats for DotBot, RogerBot, or both?

3 0

- Andy-Halliday
 
 2018-05-22T03:38:31-07:00
 
 Very good question - that could make a difference
 
 1 0
 
- Russ Jones
 
 2018-05-22T04:53:10-07:00
 
 DotBot.
 
 RogerBot is our site-crawl bot. It is different. You would never see RogerBot in your logs UNLESS you were a MozPro user and used our site crawl tool.
 
 1 0
 
 - David Butler
 
 2018-05-22T17:02:25-07:00
 
 Hi Russ,
 
 This is an interesting one.
 
 Since RogerBot only crawls MozPro user sites, it shouldn't be blocked very often - unless a user really needs to control the RogerBot crawl.
 
 I would love to see the same stats that include DotBot and RogerBot combined - or at least include RogerBot where it has been added as:
 
 User-agent: rogerbot
 Disallow: /
 
 If it's added this way, it's likely that the user had the intention of blocking DotBot instead.
 
 If you do a quick Google search for something like "how to hide backlinks from competitors", there is a surprising amount of incorrect information out there that recommends blocking RogerBot with no mention of DotBot.
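A quick way to see the difference: under the usual robots.txt matching rules, a group addressed to rogerbot says nothing to dotbot. A minimal sketch using Python's standard urllib.robotparser (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above, and nothing else.
parser = RobotFileParser()
parser.parse(["User-agent: rogerbot", "Disallow: /"])

rogerbot_allowed = parser.can_fetch("rogerbot", "https://example.com/")
dotbot_allowed = parser.can_fetch("dotbot", "https://example.com/")

print("rogerbot allowed:", rogerbot_allowed)
print("dotbot allowed:", dotbot_allowed)
```

So a site that only carries this rule is still open to DotBot, whatever the webmaster intended.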
 
 Something to consider when looking at this data...
 
 Cheers,
 
 David
 
 3 0
 
 - Russ Jones
 
 2018-05-23T09:09:39-07:00
 
 I did a quick glance for mentions in robots.txt from a sample.
 
 RogerBot: 909
 
 AhrefsBot: 8423
 
 Dotbot: 6213
 
 Majestic: 15740
 
 So even if you assumed every RogerBot block was a mistaken attempt to block Dotbot, Dotbot would still be the lowest.
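A count like this could be reproduced roughly as follows. This is only a sketch of one way to count mentions (case-insensitive match on User-agent lines, at most one count per file); the exact method behind the numbers above is not stated, and the sample files here are invented.

```python
from collections import Counter


def count_agent_mentions(robots_files, bots):
    """Count how many robots.txt files name each bot on a User-agent line."""
    counts = Counter()
    for text in robots_files:
        mentioned = set()
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            field, sep, value = line.partition(":")
            if sep and field.strip().lower() == "user-agent":
                token = value.strip().lower()
                for bot in bots:
                    if bot.lower() in token:
                        mentioned.add(bot)
        counts.update(mentioned)  # each file counts once per bot
    return counts


# Invented sample files, for illustration only.
files = [
    "User-agent: AhrefsBot\nDisallow: /\n\nUser-agent: ahrefsbot\nCrawl-delay: 5\n",
    "User-agent: dotbot # backlink crawler\nDisallow: /\n",
    "User-agent: *\nDisallow: /private/\n",
]
counts = count_agent_mentions(files, ["AhrefsBot", "dotbot", "rogerbot"])
print(dict(counts))
```

With the figures above, 6213 + 909 = 7122, which is still below AhrefsBot's 8423.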
 
 2 0
 
 - David Butler
 
 2018-05-23T16:20:25-07:00
 
 Nice one! Thanks for the update, Russ!
 
 1 0
 
 - seofreak17
 
 2018-05-25T08:37:58-07:00
 
 I crawled most Quantcast top (US) websites and counted bot mentions, disallow all, allow all, crawl-delay, disallow partial, allow partial in robots.txt:
 
 Bot name: mentions, disallow all, allow all, crawl-delay, disallow partial, allow partial
 
 *: 403417, 13381, 27914, 44290, 328892, 130352
 
 mj12bot: 23325, 12102, 38, 9070, 292, 29
 
 googlebot: 22021, 185, 2862, 1926, 11798, 2490
 
 baiduspider: 15875, 9839, 116, 1207, 1989, 145
 
 bingbot: 13026, 1147, 1649, 8165, 5572, 1474
 
 yandexbot: 11850, 7786, 302, 1396, 1958, 1060
 
 msnbot: 11423, 1366, 557, 6486, 4059, 327
 
 ia_archiver: 11157, 5512, 244, 765, 1255, 92
 
 ahrefsbot: 9829, 7644, 32, 723, 190, 11
 
 slurp: 9668, 1698, 568, 4372, 3887, 228
 
 dotbot: 7188, 5838, 13, 221, 177, 4
 
 twitterbot: 6221, 81, 1268, 234, 458, 764
 
 sogou: 6052, 3799, 21, 66, 1336, 14
 
 spbot: 4901, 4288, 0, 178, 36, 2
 
 semrushbot: 4877, 3369, 33, 275, 101, 9
 
 rogerbot: 4788, 2145, 989, 1325, 1226, 42
 
 blexbot: 4538, 4031, 0, 174, 61, 3
 
 sistrix: 3535, 3302, 1, 10, 13, 0
 
 exabot: 3302, 2636, 11, 87, 54, 10
 
 gigabot: 3200, 2164, 23, 196, 59, 3
 
 bubing: 2796, 2475, 0, 5, 17, 1
 
 sitebot: 2723, 2522, 0, 1, 6, 1
 
 seznambot: 2574, 975, 12, 43, 1303, 17
 
 ccbot: 2560, 2086, 13, 122, 52, 17
 
 ezooms: 2381, 2216, 0, 10, 13, 0
 
 archive.org_bot: 2292, 382, 1076, 731, 1747, 19
 
 megaindex: 2074, 1567, 0, 22, 19, 0
 
 voilabot: 1312, 1219, 5, 31, 23, 1
 
 yetibot: 1274, 19, 0, 7, 1238, 0
 
 seokicks-robot: 1125, 1000, 0, 18, 5, 1
 
 mail.ru_bot: 971, 653, 4, 73, 98, 50
 
 facebookexternalhit: 695, 36, 139, 64, 210, 239
 
 siteexplorer: 639, 587, 0, 3, 4, 0
 
 xovibot: 600, 494, 1, 23, 26, 0
 
 wotbox: 597, 520, 0, 4, 9, 0
 
 linkdexbot: 564, 379, 0, 10, 2, 0
 
 360spider: 534, 381, 8, 23, 34, 7
 
 obot: 534, 319, 0, 0, 0, 0
 
 applebot: 530, 83, 34, 206, 134, 59
 
 vagabondo: 528, 348, 0, 1, 2, 0
 
 searchmetricsbot: 503, 376, 6, 7, 23, 4
 
 meanpathbot: 403, 353, 0, 0, 0, 0
 
 mauibot: 370, 340, 1, 1, 10, 0
 
 linkpadbot: 292, 206, 0, 3, 0, 0
 
 mojeekbot: 277, 214, 0, 3, 1, 0
 
 nerdybot: 244, 196, 0, 2, 0, 0
 
 yacybot: 206, 122, 0, 3, 4, 0
 
 findxbot: 150, 61, 0, 2, 0, 0
 
 extlinksbot: 114, 101, 0, 1, 0, 0
 
 dataprovider: 29, 20, 0, 0, 7, 0
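The column names above suggest a per-bot classification of rules. One possible way to derive such categories from a single robots.txt is sketched below; the category definitions (for example, treating an empty Disallow as "allow all") are assumptions of this sketch and may differ from how the table was actually built.

```python
from collections import defaultdict


def _category(field, value):
    """Map one rule line to a category name, or None if it carries no rule."""
    if field == "crawl-delay":
        return "crawl-delay"
    if field == "disallow":
        if value == "/":
            return "disallow all"
        return "allow all" if value == "" else "disallow partial"
    # field == "allow"
    if value == "/":
        return "allow all"
    return "allow partial" if value else None


def classify_rules(robots_txt):
    """Return {user-agent token: set of categories} for one robots.txt."""
    categories = defaultdict(set)
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            categories[value.lower()].add("mention")
        elif field in ("disallow", "allow", "crawl-delay"):
            in_rules = True
            category = _category(field, value)
            if category:
                for agent in agents:
                    categories[agent].add(category)
    return dict(categories)


# Invented sample file, for illustration only.
sample = (
    "User-agent: mj12bot\nCrawl-delay: 10\n\n"
    "User-agent: ahrefsbot\nUser-agent: dotbot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /admin/\nAllow: /admin/public/\n"
)
result = classify_rules(sample)
print(result)
```

Summing these per-file sets over a list of sites would give one row per bot, in the same shape as the table.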
 
 seofreak17 edited 2018-05-29T09:49:27-07:00
 2 0
 
 
 Russ Jones
 
 2018-05-29T10:09:22-07:00
 
 Awesome!
 
 1 0
 
- Russ Jones
 
 2018-05-22T04:53:26-07:00
 
 DotBot. The Bot we have crawled the web with for a decade now.
 
 1 0
 
Nicholas White

2018-06-04T13:11:12-07:00

Glad to see the Moz Link Explorer leading with the "closest to Googlebot" data. This just made me like the Moz Link Explorer more than I already do. Thanks for sharing your data on this Russ!

2 0

Wes Ory

2018-06-04T08:55:56-07:00

Solid research and comparisons to Ahrefs (what we have typically used) and Majestic. I definitely agree with utilizing multiple link analysis sites to discover the most referring domains, and I was excited to see Moz be the closest to GoogleBot on all 3 data analysis charts you shared.

2 0

Shakil Ahamed

2018-05-26T11:23:19-07:00

Great post. The article is really informative and has a helpful SEO strategy that is very important for an SEO learner. Thanks a lot, Russ Jones, for sharing such an important topic with us.

1 0

Anand k

2018-06-10T21:58:41-07:00

The way people at Moz research and provide so much information on each and every topic is really great. The above blog post has such detailed information about robots.txt and is very useful.

1 0

Silem

2018-05-27T09:20:17-07:00

Er, why not simply only check robots.txt rules for Google bots and follow those instead of the rules set for your own bots?

1 0

- Russ Jones
 
 2018-05-29T10:11:19-07:00
 
 Simple respect for webmasters. Bots can cause problems like slowing down the site or increasing bandwidth costs. If a webmaster says don't crawl, we won't.
 
 2 0
 
 - Luis Alvarez
 
 2018-05-30T06:50:45-07:00
 
 Very respectful behaviour. Nice!
 
 1 0
 
DarkJosevi

2018-05-28T03:32:31-07:00

It is very important to save the robot time. Google rewards pages that are easier for its robot to crawl, so it is crucial to tell it where to go and where not to.

1 0

Bayu Angora

2018-06-10T03:27:44-07:00

I allow everything on my robots.txt

1 0

Javier Lobera

2018-05-29T02:35:23-07:00

Great post, Russ!

It seems very important to pay attention to the state of our website through robots.txt, since this helps us keep "clean" URLs. Many people, even professionals in the sector, do not give it the attention it deserves. We can also assume a small improvement in our rankings, as Google sees a "clean" site and takes that into account in some of its algorithms.

1 0

Carlos Saez

2018-05-25T04:40:07-07:00

With these contributions it is much easier to move forward with our SEO projects. Thank you very much Russ

1 0

shubhangisrivastava

2018-05-24T00:34:06-07:00

Omg, so much information about robots.txt and blocking sites. Just a few seconds ago I only knew how to submit robots.txt to webmaster. Thanks for such valuable info.

1 0

Joseph Garcia

2018-05-22T18:19:27-07:00

Definitely not what I was expecting based on the title of this post. I was expecting to read an article about what your site's robots.txt file should contain based on industry/Google changes. Nonetheless, interesting research; it does the job of lending more credibility to Moz Link Explorer, at least in the short term.

1 0

SalesHacker

2018-05-22T16:42:42-07:00

Russ,

Solid research here man for sure. Forgive me for the brute question - but was this article intended to be a jab at your competitors? As Cyrus pointed out - it's likely that DotBot will eventually get blocked just as much as Ahrefs and Majestic in the future, especially as the crawler becomes more aggressive. Crawl speed obviously has a lot to do with it, and as you mentioned, "crawl politeness" will be enforced. But as an in-house SEO, is there an actionable takeaway from all this?

- Gaetano

1 0

- Sha Menz
 
 2018-05-23T04:31:31-07:00
 
 Hi Gaetano!
 
 I think the most obvious and immediately actionable takeaway for any SEO comes right at the beginning of the post -
 
 "small changes in robots.txt which prevent some bots and allow others ultimately leads to very different results compared to what Google actually sees."
 
 In a world where we all feel justified in demanding greater and greater accuracy from the tools at our disposal, the lesson we need to take from this is that we should think carefully about the effect our robots.txt directives will have on those tools and the data they can give us.
 
 Interesting also to see that the most pervasive method encountered by all of the tools is "a blanket dismissal of all bots except major search engines - ie: block all and explicitly allow google, bing, yahoo, baidu, etc.". Some might say this is the smart option. Others might call it the lazy one.
 
 Certainly if we choose to invest in using specific tools because we value accurate data we should at least be sure we're not blocking them from our own sites. ;)
 
 - Sha
 
 Sha_Menz edited 2018-05-23T17:11:30-07:00
 3 0
 
- Russ Jones
 
 2018-05-23T09:37:35-07:00
 
 Let me start with the easy question: as an in-house SEO, is there an actionable takeaway from all this?
 
 If you want all the links, you have to use all the tools. If you want to use just one tool that is most like Google in terms of constituent pages, then based on both this analysis and the research I posted here (https://moz.com/blog/big-fast-strong-backlink-index-comparisons), Moz is the tool for you.
 
 As for whether it was intended to be a crack against our competitors, I certainly intended to draw a distinction between Moz's index and theirs, and that distinction is positive for Moz. The data is freely available for anyone to test. Just go download the Quantcast Top Million list, grab their robots.txt, and use any one of the many free robots.txt testers on github to see for yourself.
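The test described above can be sketched in a few lines with Python's built-in urllib.robotparser. The sketch assumes the robots.txt files have already been downloaded into a dict keyed by hostname; the hostnames and file contents below are invented, and only the root URL of each site is checked.

```python
from urllib.robotparser import RobotFileParser


def blocked_counts(robots_by_site, bots):
    """Count, per bot, how many sites disallow fetching the site root."""
    counts = {bot: 0 for bot in bots}
    for site, text in robots_by_site.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        for bot in bots:
            if not parser.can_fetch(bot, f"https://{site}/"):
                counts[bot] += 1
    return counts


# Invented sample data standing in for a downloaded top-sites list.
robots_by_site = {
    "a.example": "User-agent: AhrefsBot\nDisallow: /\n",
    "b.example": "User-agent: Googlebot\nDisallow:\n\nUser-agent: *\nDisallow: /\n",
    "c.example": "",  # empty robots.txt: everything allowed
}
counts = blocked_counts(robots_by_site, ["Googlebot", "dotbot", "AhrefsBot", "MJ12bot"])
print(counts)
```

Run over a real list, the gap between each crawler's count and Googlebot's is the kind of difference the post is measuring.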
 
 First, I originally ran this study in July of 2016. I didn't report on it then even though I could have. The ratios were approximately the same then as now.
 
 Second, I think people need to know about the quality of the data they receive. Our competitors brag about their size (and now Moz does too), and we need to be honest about the impact crawling has on webmasters. If Ahrefs or Majestic wants to run the analysis again next year and show that we have caught up, by all means, they can.
 
 I'm sorry if it hurts the feelings of our competitors when I present research that shows our data is better, but I'm not going to hide solid research just for those purposes.
 
 4 0
 
seofreak17

2018-05-22T15:55:32-07:00

Russ, again nice research, but it is only accurate for robots.txt blocks. Many webmasters and hosting providers use temporary IP blocks after a number of requests, or permanent IP blocks for known unwanted crawler IP ranges, at the network or web server software level; neither can be tested for other bots without spoofing IP addresses (if that is possible at all). There are also many websites and hosting providers that block crawler user agents at the network or web server software level, which you can test by spoofing user agents or by comparing homepage request timeouts for your Dotbot user agent against a web browser user agent.

I agree that Moz has mostly crawled nicely so far, which can indeed help, but Moz also used the user agent Ezooms from 2011 till 2014. How much is that user agent blocked, and do you still respect it?
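A user-agent-level block of the kind described here could be probed roughly like this: request the same page with different User-Agent headers and compare the outcomes. This is a minimal sketch using only Python's standard library; the user-agent strings are placeholders rather than the real strings any crawler sends, and it is best pointed at a site you control.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status for url, or None on timeout/connection failure."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 403 when the server rejects this user agent
    except OSError:
        return None  # timeouts and connection errors


def compare_user_agents(url, user_agents, fetch=fetch_status):
    """Return {label: status} for the same url under each user agent."""
    return {label: fetch(url, agent) for label, agent in user_agents.items()}


# Placeholder user-agent strings (assumptions, not real crawler strings).
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "crawler": "Mozilla/5.0 (compatible; ExampleBot/1.0)",
}
```

Calling compare_user_agents("https://example.com/", USER_AGENTS) and getting 200 for the browser but 403 or None for the crawler would hint at a block that robots.txt alone does not reveal.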

1 0

- Dmitry-Ahrefs
 
 2018-05-22T21:54:07-07:00
 
 "I agree that Moz has mostly crawled nicely so far, which can indeed help, but Moz also used the user agent Ezooms from 2011 till 2014. How much is that user agent blocked, and do you still respect it?"
 
 Russ,
 
 Is it true? Could you please share the history of user agents used by Moz crawlers in the past?
 
 1 0
 - Russ Jones
 
 2018-05-23T09:23:34-07:00
 
 I'm looking into Ezooms right now - news to me. We were crawling with DotBot back then, but we did apparently have an alias. I just re-ran the analysis with Ezooms added in, and the trends still stand. If we count everyone blocking RogerBot (our site audit crawler), Ezooms (a temporary alias), or DotBot, Moz still comes in well below Ahrefs and Majestic:
 
 All Moz Crawlers, even RogerBot: 7641
 Ahrefs: 8423
 Majestic: 15740
 
 2 0
 
- Russ Jones
 
 2018-05-23T09:30:10-07:00
 
 It appears Ezooms was an alias. I don't know what % of the crawl was powered by it. I can say this though - even if we assume all RogerBot blocks were intended for the web crawler and not our site audit, and we include the Ezooms bot, and we include Dotbot, Moz is still the least blocked.
 
 Sites from random sample:
 
 All Moz Crawlers, even RogerBot: 7641
 Ahrefs: 8423
 Majestic: 15740
 
 As for "many webmasters and hosting providers...", I think robots.txt is by far the most prevalent method of blocking individual bots. But it would make for interesting research. Perhaps that might be something you might want to do.
 
 1 0
 
 - seofreak17
 
 2018-05-23T18:12:53-07:00
 
 Did you also count Ahrefs old user agent SiteBot used in 2010-2011?
 
 1 0
 
 - Russ Jones
 
 2018-05-29T10:13:21-07:00
 
 No I did not.
 
 1 0
 
 - Tim Soulo
 
 2018-05-24T06:17:59-07:00
 
 Moz: 7641
 Ahrefs: 8423
 Majestic: 15740
 
 Cool! So what you're saying is that Ahrefs got blocked ~10% more than Moz while crawling ~10x faster for the past few years? :)
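For reference, the ratios behind that "~10%" can be checked directly from the three counts quoted above (the function name is just for this sketch):

```python
# Block counts quoted in the thread.
blocks = {"Moz": 7641, "Ahrefs": 8423, "Majestic": 15740}


def relative_to(baseline, counts):
    """Express every count as a multiple of the baseline's count."""
    return {name: count / counts[baseline] for name, count in counts.items()}


ratios = relative_to("Moz", blocks)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x Moz")
```

Ahrefs comes out at roughly 1.10 times Moz's count and Majestic at roughly 2.06 times.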
 
 timsoulo edited 2018-05-24T06:51:25-07:00
 1 0
Andy-Halliday

2018-05-22T03:37:02-07:00

Really interesting article - I would love to know why DailyMail blocks Ahrefs and not Majestic or Moz.

Agree with what Cyrus said above about the other bots being more active, which is why they got blocked - they are more likely to appear in the top 'x' most active bots.

My other theory wouldn't really matter for this, as it's the top 1 million websites, but "black hatters" are more likely to block Ahrefs from their PBNs - it's their preferred tool for link analysis, and they don't want to give their PBN away to the competition.

1 0

- Russ Jones
 
 2018-05-22T04:46:02-07:00
 
 The dailymail.co.uk exclusively blocks Ahrefs. They even do it TWICE in robots.txt...
 
 https://www.dailymail.co.uk/robots.txt
 
 Also, you are right that aggressive crawlers are more likely to trigger a block from a webmaster. This is why crawl politeness is so important. Backlink indexes need to be careful not to burn too many bridges. Moz was able to become comparable in size to Ahrefs and Majestic WITHOUT running into these problems. We have a high level of crawl politeness which forces us to crawl wider rather than deeper sometimes, but in the end it produces a more complete link graph, in my opinion.
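For anyone who wants to check which crawlers a given robots.txt blocks, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text below is a made-up example (it is not the actual dailymail.co.uk file), and `SAMPLE_ROBOTS`, `BOTS` and `blocked_bots` are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration only; not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Crawler tokens mentioned in this thread.
BOTS = ["AhrefsBot", "MJ12bot", "dotbot", "rogerbot", "Googlebot"]

def blocked_bots(robots_txt, bots=BOTS, path="/"):
    """Return the bots that are not allowed to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]

print(blocked_bots(SAMPLE_ROBOTS))
```

This only tests one path per bot. A site-wide survey like the one in the post would need to fetch each site's real file and decide what counts as "blocked", which is a judgment call this sketch does not make.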
 
 rjonesx. edited 2018-05-22T05:17:21-07:00
 3 0
 
 - Tim Soulo
 
 2018-05-22T06:17:35-07:00
 
 > Moz was able to become comparable in size to Ahrefs and Majestic WITHOUT running into these problems.
 
 Russ, don't you think you're misleading people by saying that?
 
 Didn't you increase your crawl speed not too long ago? Meanwhile, Ahrefs is already known to have the second most active bot after Google: https://www.incapsula.com/blog/most-active-good-bots.html
 
 As for "crawl politeness" - believe me, here at Ahrefs we know a thing or two about it. I'm sure in a year from now we'll see if you guys will be able to catch up with our crawl speed/efficiency/politeness ;)
 
 timsoulo edited 2018-05-22T07:00:23-07:00
 2 0
 
 - Russ Jones
 
 2018-05-22T08:28:53-07:00
 
 Thanks for your response.
 
 We have made no adjustments to politeness, and the last time we increased our crawlers was about a year and a half ago. I ran this same analysis about 2 years ago and the ratios are still roughly the same.
 
 I'm sure that Ahrefs knows quite a bit about "crawl politeness" and I didn't mean to imply otherwise. But you are the 2nd or 3rd most pervasive crawler on the web - that comes with both costs and benefits. Today, I outlined the costs, as Ahrefs has certainly touted the benefits (your index size) for many years.
 
 With regard to politeness, it is a gut call - frequency of recrawl, depth of crawl, time between requests, etc. all must be balanced against the threat of being blocked. Indexes have to make a choice. It is similar to whether you decide to keep spam in your index or not. It is a positive if you want to help webmasters find bad links, but a negative both in costs and in presenting users an accurate reflection of what Google might be judging them on.
 
 I'm just trying to open the eyes of our readers to the complexities of crawling the web and how it influences link indexes differently.
 
 6 0
 
salgodelacrisis

2018-05-25T03:36:15-07:00

Very good post. It will undoubtedly help us to know how to track more and better.

1 0

Splashweb

2018-05-23T03:53:22-07:00

For most of my sites I take the approach of blocking all bots and letting through only the ones that might send me some traffic. The problem with the likes of Ahrefs, Majestic, Moz and others is that for a small business I can't justify the cost of using them, and the free offerings are of little use. Hence why would I let their bots hit my servers for so little in return?

1 0

- Russ Jones
 
 2018-05-23T09:39:15-07:00
 
 It is a fair question. I think Majestic still offers Open Access to crawl data for your site if you authenticate that you are the site owner. I think Moz should do something similar.
 
 2 0
 
 - Splashweb
 
 2018-05-23T10:23:25-07:00
 
 Thanks for that Russ I'll check out Majestic.
 
 1 0
 
 - Splashweb
 
 2018-05-24T00:25:11-07:00
 
 The Majestic offering exists for site owners, but it only allows you to see what reports are available, without giving you access to any useful data, so worthless.
 
 I'm surprised that webmasters aren't more proactive on this and don't block all of the main indexing bots, such as Ahrefs, Majestic, Moz and a few others. Their sole purpose is to harvest as much data as possible about your own sites, and the relationships you have with other sites, and then to sell that data back to the owner and, more importantly, to your competitors. Sure, there's added value with all of the reporting, analysis, aggregation etc., so there's a price, if your sites can justify it. But even if I was using any of these services, I would still block all of the others (that my competitors may be using) and even the one I'm using too, assuming that's feasible.
 
 The way things are going I can see the day when even Googlebot should be blocked; after all, what's the point of being listed on Google if that doesn't translate into them providing something in return, i.e. some free traffic.
 
 All of these analytics companies are relying totally on Webmasters allowing them free access to their properties, but increasingly it's time to ask why.
 
 Splashweb edited 2018-05-24T00:25:47-07:00
 1 1
 
 - seofreak17
 
 2018-05-24T07:34:05-07:00
 
 I just checked how often bots hit four small websites by analyzing their log files.
 
 Period: Jan 1, 2018 till May 24, 2018
 
 Website                A     B     C     D
 Web pages              1     1     8    11
 Human hits           576   803  3893  7205
 
 Bot hits               A     B     C     D
 bingbot              249   472   766  1709
 yandexbot            413   238   494  1974
 mj12bot               45    48   404  2159
 semrushbot           299   146   536  1221
 baiduspider           93   194   771   879
 googlebot            133   218   312   932
 dotbot                19    11   344   561
 360spider              0   277   231   169
 sogou web spider     181   301   184     1
 ahrefsbot              0    57   153   246
 spbot                 39    26   134   252
 bubing                 0     2   116   145
 mauibot                0    16   135    99
 linkdexbot             0     0    81   133
 seznambot             24     0   103    26
 seokicks-robot         0     0     0   151
 mail.ru_bot           20    16    28    80
 blexbot                0    26    74    36
 dataprovider          24    20    65    17
 extlinksbot            0     0    87    22
 siteexplorer           0     0    48    38
 slurp                  0     7    32    40
 exabot                 0    12     8    52
 megaindex              2    21    18    20
 netcraft               9    10     7    32
 archive.org_bot        8     1     8    38
 ccbot                  2    10     6    29
 obot                   0     0     0    38
 yetibot                0     7     0    18
 
 I don't think bots have any real impact on small websites in terms of hardware or traffic cost. An average $3-$5 a month cloud hosting plan can handle a hundred thousand requests a day and has a 10-20TB traffic limit.
 
 Also, most bots support Crawl-delay in robots.txt, or you can block pages that are too heavy to load often.
 
 Not being seen by advertisers or buyers, or not being listed in search engines (or their APIs), can have much more impact.
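The kind of log analysis described at the top of this comment could be sketched roughly as below. This is not the script used for the numbers above; it is a minimal illustration that assumes bots can be identified by a substring of the User-Agent field, and the log lines, `BOT_TOKENS` and `count_hits` are hypothetical.

```python
from collections import Counter

# Substrings to look for in each log line; a subset of the bots listed above.
BOT_TOKENS = ["bingbot", "yandexbot", "mj12bot", "semrushbot",
              "googlebot", "dotbot", "ahrefsbot"]

def count_hits(log_lines, tokens=BOT_TOKENS):
    """Tally hits per bot token; lines matching no token count as 'other'."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        # The first token found in the line wins; unmatched lines are 'other'.
        matched = next((t for t in tokens if t in lowered), "other")
        hits[matched] += 1
    return hits

# Made-up access-log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [24/May/2018:07:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot)"',
    '203.0.113.9 - - [24/May/2018:07:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; MJ12bot)"',
    '203.0.113.9 - - [24/May/2018:07:00:09 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; MJ12bot)"',
    '198.51.100.7 - - [24/May/2018:07:01:12 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for token, count in count_hits(SAMPLE_LOG).most_common():
    print(token, count)
```

Note that "other" lumps human visitors together with any bot not in the token list, so it is only a crude stand-in for "human hits".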
 
 seofreak17 edited 2018-05-25T15:04:36-07:00
 3 0
 
 - seofreak17
 
 2018-05-31T18:06:27-07:00
 
 I expect major search engine bots generate more traffic than backlink tool bots, which only request HTML files. Major search engine bots also crawl media files like css, js, jpg, gif, png, bmp, pdf, doc, xls, ods, avi, mp4 etc.
 
 You must ask yourself if you want a link from a website that blocks all major backlink tool bots.
 
 Setting a crawl delay is not always a good idea. A popular e-commerce software provider hosts around 600,000 client websites on only two IP addresses. Most bots limit themselves to a maximum of one request a second to one unique IP; that's a maximum of 86,400 requests a day to one unique IP. Crawling those 600,000 websites 150-200 pages deep can take up to 2 years, and with a crawl delay of 10 even up to 20 years.
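The 2-year and 20-year figures can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes both IP addresses are crawled in parallel, each at one request per crawl-delay interval, and ignores recrawls and retries; `crawl_years` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400

def crawl_years(sites, pages_per_site, ip_count, seconds_between_requests):
    """Estimate the years needed to fetch every page once at the given rate."""
    total_pages = sites * pages_per_site
    # Each IP accepts one request per delay interval; IPs are hit in parallel.
    requests_per_day = ip_count * SECONDS_PER_DAY / seconds_between_requests
    return total_pages / requests_per_day / 365

for delay in (1, 10):
    low = crawl_years(600_000, 150, 2, delay)
    high = crawl_years(600_000, 200, 2, delay)
    print(f"{delay}s between requests: {low:.1f} to {high:.1f} years")
```

Under these assumptions the upper bound lands just under 2 years at one request per second and just under 20 years with a 10-second delay, in line with the estimate above.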
 
 seofreak17 edited 2018-06-01T03:03:18-07:00
 2 0
 
Casey Bryan

2018-05-23T18:00:25-07:00

Some interesting research you have on your hands and it definitely shines a good light on the new Link Explorer tool. Cheers.

1 0

Edson Finotti

2018-05-23T11:28:36-07:00

Keep doing that and crawling more sites! We need better info to make better choices, so we count on you to get it for us! Thanks for the transparency in this article!

1 0

alberthseo

2018-05-23T17:42:35-07:00

Excellent contribution!!!

I think it is important to take these actions into account for SEO, given that few people actually work on SEO properly. I have seen that most only focus on creating content and do not spend time analyzing their pages and the errors they contain.

That is why I consider this information very useful.

Thanks for the input.

1 0

Heather Physioc

2018-05-23T19:41:10-07:00

Pardon if I overlooked this point of clarity - when you say "Moz's robots.txt profile was most similar to that of Google's," does that mean in quantity only, or in actual, specific domains that block or allow the Moz and Google crawlers as well?

HeatherPhysioc edited 2018-05-23T19:42:48-07:00
1 0

- Russ Jones
 
 2018-05-29T10:12:47-07:00
 
 Both statements are true. Relative to Ahrefs and Majestic, DotBot is excluded by fewer sites and we have fewer examples where Google is not blocked and we are.
 
 2 0
 
Agencia_SEO_Micros

2018-06-05T15:03:14-07:00

Ahrefs sucks.... The MOZ tools are much better and more intuitive. I tried the two platforms and finally chose MOZ.

Great Post, Thx Russ.

1 1

