Googlebot Crawl Issue Identification Through Server Logs

Comments 52

Please keep your comments TAGFEE by following the community etiquette.

E-mail me when new comments are posted

Sort by:

Comments are closed on posts more than 30 days old. Got a burning question? Head to our Q&A section to start a new conversation.

Richard Baxter

2012-07-02T04:30:11-07:00

We usually always request a few days of server logs when we carry out a technical site review. There's *so* much value to it.

As you've pointed out - you can find thousands (if not tens of thousands) of requested URLs that never end up visible in the main index. Particularly, GoogleBot's new found love of crawing weird ajax / js URLs.

Proper log file analysis is easy - and you learn more about real SEO in a day or so of deep diving that you do in months of reading blog posts.

8 0

We usually always request a few days of server logs when we carry out a technical site review. There's *so* much value to it. As you've pointed out - you can find thousands (if not tens of thousands) of requested URLs that never end up visible in the main index. Particularly, GoogleBot's new found love of crawing weird ajax / js URLs. Proper log file analysis is easy - and you learn more about real SEO in a day or so of deep diving that you do in months of reading blog posts.
Cancel
- David Sottimano
 
 2012-07-02T04:37:23-07:00
 
 Right on.
 
 And maybe what we commonly assumed may not be correct about Googlebot right? Looking forward to your post soon ;)
 
 1 0
 
 Right on. And maybe what we commonly assumed may not be correct about Googlebot right? Looking forward to your post soon ;)
 Cancel
 - Richard Baxter
 
 2012-07-02T05:06:57-07:00
 
 Working on it!
 
 1 0
 
 Working on it!
 Cancel
 - Gerry White
 
 2014-08-17T15:22:14-07:00
 
 curious - did that post ever appear? Server logs is my current "must research" it is great to start to from other peoples observations!
 
 1 0
 
 curious - did that post ever appear? Server logs is my current "must research" it is great to start to from other peoples observations! 
 Cancel
aakk9999

2012-07-02T07:19:23-07:00

Great post and great idea to trawl the logs to analyse the Crawl Budget :)

I have just few small comments / clarifications:

The image "Request for 1 web page" with multiple arrows beling shown as responses should in fact show multiple requests (for the one page) with the multiple responses. I think you are explaining this below the image with "A request for page-a.html will likely end up with multiple hits because we need to get images, css.. " - I would just like to clarify that "multiple hits" are multiple pairs of "request/response" messages rather than one request message with multiple response messages which is what the picture shows.

The "Keep important pages as close to the root as possible" - this presumably refers to clickpath (from the home page) and not to server file path. I could have file server path 5 levels deep, but if the URL is linked directly from the home page, its click path is only 1 level deep and therefore the page is emphasised within the information structure.

With regards to disallowing cpprcr folder - this can be done at the User-agent: * level since Google Adbot ignores blanket exclusion under "User-agent: *" Unless, of course, you already have a separate section in robots.txt for Googlebot, in which case it should go in there.

And lastly, if you do have separate section for Googlebot (or if you have added a separate section for Googlebot to your robots.txt as per paragraph above) then the javascript /*/json disallow must be added to Googlebot section in robots.txt since if there is Googlebot section in existance, Google will ignore blanket user agent section. (i.e. if you have User-agent: Googlebot in robots.txt then Googlebot will ignore anything under User-agent: * section)

aakk9999 edited 2012-07-02T07:20:28-07:00
4 0

Great post and great idea to trawl the logs to analyse the Crawl Budget :) I have just few small comments / clarifications: The image "Request for 1 web page" with multiple arrows beling shown as responses should in fact show multiple requests (for the one page) with the multiple responses. I think you are explaining this below the image with "A request for page-a.html will likely end up with multiple hits because we need to get images, css.. " - I would just like to clarify that "multiple hits" are multiple pairs of "request/response" messages rather than one request message with multiple response messages which is what the picture shows. The "Keep important pages as close to the root as possible" - this presumably refers to clickpath (from the home page) and not to server file path. I could have file server path 5 levels deep, but if the URL is linked directly from the home page, its click path is only 1 level deep and therefore the page is emphasised within the information structure. With regards to disallowing cpprcr folder - this can be done at the User-agent: * level since Google Adbot ignores blanket exclusion under "User-agent: *" Unless, of course, you already have a separate section in robots.txt for Googlebot, in which case it should go in there. And lastly, if you do have separate section for Googlebot (or if you have added a separate section for Googlebot to your robots.txt as per paragraph above) then the javascript /*/json disallow must be added to Googlebot section in robots.txt since if there is Googlebot section in existance, Google will ignore blanket user agent section. (i.e. if you have User-agent: Googlebot in robots.txt then Googlebot will ignore anything under User-agent: * section) 
Cancel
- David Sottimano
 
 2012-07-02T07:39:55-07:00
 
 Hey,
 
 1) Yes, you're right and I thought about making that clearer. "multiple hits" are multiple pairs of "request/response" messages rather than one request message with multiple response messages which is what the picture shows.
 2) Yes, keeping important pages as close to the root is representative of click depth from the homepage. ex. www.example.com/the/file/path/doesnt/matter/ <==That file path is very long, but it doesn't mean it's necessarily buried in the site. We could in fact link it from the homepage regardless of how many folders you see. Sorry if I confused anyone!
 3) I'll have to try that sometime - I didn't know that. Also, I wanted to be super careful I didn't mess with Adbot, therefore I used the specific user-agent.
 4) Yes, great tip. I had to do quite a bit of testing to find that out, and I did in fact specify this in the Googlebot specific section of the robots.txt file as well.
 Great feedback, I feel like I should pay you or something ;)
 
 1 0
 
 Hey, 1) Yes, you're right and I thought about making that clearer. "multiple hits" are multiple pairs of "request/response" messages rather than one request message with multiple response messages which is what the picture shows. 2) Yes, keeping important pages as close to the root is representative of click depth from the homepage. ex. www.example.com/the/file/path/doesnt/matter/ <==That file path is very long, but it doesn't mean it's necessarily buried in the site. We could in fact link it from the homepage regardless of how many folders you see. Sorry if I confused anyone! 3) I'll have to try that sometime - I didn't know that. Also, I wanted to be super careful I didn't mess with Adbot, therefore I used the specific user-agent. 4) Yes, great tip. I had to do quite a bit of testing to find that out, and I did in fact specify this in the Googlebot specific section of the robots.txt file as well. Great feedback, I feel like I should pay you or something ;) 
 Cancel
Sam Garrity

2012-07-02T07:33:46-07:00

Great post!

Another thing that you should look out for is secure counterpart URLs. Googlebot fetches https://www.example.com and then it might fetch https://www.example.com:443 and that could lead to duplicate content issues. I have noticed this on a lot of sites where I have installed SEO Crawlytics plugin.

4 0

Great post! Another thing that you should look out for is secure counterpart URLs. Googlebot fetches https://www.example.com and then it might fetch https://www.example.com:443 and that could lead to duplicate content issues. I have noticed this on a lot of sites where I have installed SEO Crawlytics plugin. 
Cancel
- David Sottimano
 
 2012-07-02T08:26:32-07:00
 
 Great tip, should have included that. Thanks Yousaf.
 
 1 0
 
 Great tip, should have included that. Thanks Yousaf.
 Cancel
Roque Lage de Llera

2012-07-02T07:59:55-07:00

Thanks for the contribution Dave. Information its reallly valuable. Great post; great author.

3 0

Thanks for the contribution Dave. Information its reallly valuable. Great post; great author.
Cancel
Wonderkidxx

2012-07-02T04:26:40-07:00

I thought that "splunk" was something else before I read it again...

Well done man.

+1

3 0

I thought that "splunk" was something else before I read it again... Well done man. +1
Cancel
- David Sottimano
 
 2012-07-02T04:30:20-07:00
 
 You and everyone else who's heard the name ;) Great product though...
 
 1 0
 
 You and everyone else who's heard the name ;) Great product though...
 Cancel
 - Eric Wu
 
 2012-07-06T13:57:47-07:00
 
 I <3 Splunk! I've been using it for over a year now ... it's totally awesomesauce! I like to keep a macro of the Googlebot IP ranges so I can accurately filter our good bot traffic.
 
 1 0
 
 I <3 Splunk! I've been using it for over a year now ... it's totally awesomesauce! I like to keep a macro of the Googlebot IP ranges so I can accurately filter our good bot traffic.
 Cancel
Joshua Moxon

2012-07-02T05:34:45-07:00

Hi Dave,

thanks for the post, you have demystified server log analysis (I am sure there is more that you can do with it).
Question for you, I have read a fair amount about not blocking user agents in the robot.txt file as that becomes a veil beyond which Google can't see and doesn't know anything about. Also, it does nothing to pass page rank and so you can lose value into a black hole (if I understand things right), is that something that you are concerned about? Any suggestions for addressing it differently or at the root so that you don't have to resort to just blocking it via robots.txt? Am I making to big of a deal out "issues" of using robots.txt in this manner?

Thanks again for walk through. Cheers!

3 0

Hi Dave, thanks for the post, you have demystified server log analysis (I am sure there is more that you can do with it). Question for you, I have read a fair amount about not blocking user agents in the robot.txt file as that becomes a veil beyond which Google can't see and doesn't know anything about. Also, it does nothing to pass page rank and so you can lose value into a black hole (if I understand things right), is that something that you are concerned about? Any suggestions for addressing it differently or at the root so that you don't have to resort to just blocking it via robots.txt? Am I making to big of a deal out "issues" of using robots.txt in this manner? Thanks again for walk through. Cheers!
Cancel
- David Sottimano
 
 2012-07-02T05:58:12-07:00
 
 Hey,
 
 I actually barely scratched the surface of what you can do with server logs, but I tried to keep it short so I didn't lose reader interest.
 
 In regards to your question, I wish I could give you one simple answer. Firstly, there's nothing wrong with blocking Googlebot or any other user-agent in robots.txt - the point is to stop them from crawling pages you don't want / need them to see. There's a bunch of other reasons as well, maybe a specific user-agent is aggressively crawling your site and racking up hosting bills, and you just need it to stop.
 
 As for PageRank, the theory is that Google won't pass PageRank to blocked pages. So, losing value is a concern - but it depends on what you're blocking in the first place. A site like bbc.co.uk actually blocks their /search/ folder - which in my opinion might be a mistake (because Googlebot can't spread the external link juice around the site). In this specific post, I blocked weird PPC pages that weren't linked to, and javascript URLs that were part of forms. There is no problem with that at all.
 
 Ideally, if you make the *perfect* site, you won't need to block anything. Sometimes you just need to, and I think Rand summed it up nicely in his last 2 points in this post https://www.seomoz.org/blog/headsmacking-tip-13-dont-accidentally-block-link-juice-with-robotstxt.
 
 There are so many considerations you need to run through before making blocks, using nofollows and other methods of controlling Googlebot's crawl path. Each scenario has it's own perils, but it shouldn't stop you from making changes.
 
 Does that help?
 
 3 0
 
 Hey, I actually barely scratched the surface of what you can do with server logs, but I tried to keep it short so I didn't lose reader interest. In regards to your question, I wish I could give you one simple answer. Firstly, there's nothing wrong with blocking Googlebot or any other user-agent in robots.txt - the point is to stop them from crawling pages you don't want / need them to see. There's a bunch of other reasons as well, maybe a specific user-agent is aggressively crawling your site and racking up hosting bills, and you just need it to stop. As for PageRank, the theory is that Google won't pass PageRank to blocked pages. So, losing value is a concern - but it depends on what you're blocking in the first place. A site like <a href="https://www.bbc.co.uk/robots.txt" rel="nofollow">bbc.co.uk</a> actually blocks their /search/ folder - which in my opinion might be a mistake (because Googlebot can't spread the external link juice around the site). In this specific post, I blocked weird PPC pages that weren't linked to, and javascript URLs that were part of forms. There is no problem with that at all. Ideally, if you make the *perfect* site, you won't need to block anything. Sometimes you just need to, and I think Rand summed it up nicely in his last 2 points in this post <a href="https://www.seomoz.org/blog/headsmacking-tip-13-dont-accidentally-block-link-juice-with-robotstxt" rel="nofollow">https://www.seomoz.org/blog/headsmacking-tip-13-dont-accidentally-block-link-juice-with-robotstxt</a>. There are so many considerations you need to run through before making blocks, using nofollows and other methods of controlling Googlebot's crawl path. Each scenario has it's own perils, but it shouldn't stop you from making changes. Does that help?
 Cancel
Sha Menz

2012-07-02T05:37:49-07:00

Wow David!

Thank you so much for sharing Splunk in this post.

I've always loved my logs, but you just made my dark fascination for digging into them something I can do in a flash on company time!!

Sha

3 0

Wow David! Thank you so much for sharing Splunk in this post. I've always loved my logs, but you just made my dark fascination for digging into them something I can do in a flash on company time!! Sha 
Cancel
Jay Viks

2012-07-04T06:43:47-07:00

Dave, very good insight on how to dissect Googlebot crawling activity on a website.

I can't describe how useful and important it is to keep an eye on your crawling reports. We were recently saved from a big mishap on one of our online shop because of our crawl reports. The website had:
(i) numerous 301 redirects from URLs which were outdated and still happened to be on the site, and
(ii) crawling of 2 or 3 different query string URLs of a single page thereby creating duplicate page error along with consuming valuable crawl budget.

You won't believe that even the advanced crawling software's that we have employed to monitor our website for all on-page errors were not able to identify these issues for us. As soon as we noticed them in our daily crawl report, we were able to take precautionary action of changing the old URLs to newer ones, putting rel=nofollow for URLs with query strings and placing self-referencing canonical tag on few important pages.

We don't completely rely on server logs. We have our own custom PHP utility which generates a daily crawl activity report for all major search engines. You can play around with the data once you have it in your database. Our script monitors Googlebot, Bing, Yahoo, other major search spiders including mobile bots for mobile websites. (btw, we have seen a rapid increase in sales through mobile website - is it a paradigm shift underway?)

Due to NDA, I am unable to share the utility which we have developed, however sharing here few of the articles that my development team has referred at the time of implementation:
https://stackoverflow.com/questions/916147/how-to-identify-web-crawlers-of-google-yahoo-msn-by-php
https://www.user-agents.org/
https://www.liamdelahunty.com/tips/php_gethostbyaddr_googlebot.php

As it goes, you can't build a bug-free software, and hence its highly recommended to test your implementation thoroughly and keep it updated all times.

Its nice to see the way you have given justice to this complex, but significant area of "crawl" monitoring.

JayViks edited 2012-07-04T06:46:27-07:00
2 0

Dave, very good insight on how to dissect Googlebot crawling activity on a website. I can't describe how useful and important it is to keep an eye on your crawling reports. We were recently saved from a big mishap on one of our online shop because of our crawl reports. The website had: (i) numerous 301 redirects from URLs which were outdated and still happened to be on the site, and (ii) crawling of 2 or 3 different query string URLs of a single page thereby creating duplicate page error along with consuming valuable crawl budget. You won't believe that even the advanced crawling software's that we have employed to monitor our website for all on-page errors were not able to identify these issues for us. As soon as we noticed them in our daily crawl report, we were able to take precautionary action of changing the old URLs to newer ones, putting rel=nofollow for URLs with query strings and placing self-referencing canonical tag on few important pages. We don't completely rely on server logs. We have our own custom PHP utility which generates a daily crawl activity report for all major search engines. You can play around with the data once you have it in your database. Our script monitors Googlebot, Bing, Yahoo, other major search spiders including mobile bots for mobile websites. (btw, we have seen a rapid increase in sales through mobile website - is it a paradigm shift underway?) Due to NDA, I am unable to share the utility which we have developed, however sharing here few of the articles that my development team has referred at the time of implementation: https://stackoverflow.com/questions/916147/how-to-identify-web-crawlers-of-google-yahoo-msn-by-php https://www.user-agents.org/ https://www.liamdelahunty.com/tips/php_gethostbyaddr_googlebot.php As it goes, you can't build a bug-free software, and hence its highly recommended to test your implementation thoroughly and keep it updated all times. Its nice to see the way you have given justice to this complex, but significant area of "crawl" monitoring. 
Cancel
- David Sottimano
 
 2012-07-09T01:33:41-07:00
 
 Thanks for the insightful comment and the resources you left. Sounds like you and your team have managed to completely automate the process, which is incredibly efficient of you.
 
 Thanks for sharing, much appreciated.
 
 1 0
 
 Thanks for the insightful comment and the resources you left. Sounds like you and your team have managed to completely automate the process, which is incredibly efficient of you. Thanks for sharing, much appreciated.
 Cancel
Sasha Zabelin

2012-07-02T12:35:08-07:00

Great post, I regularly check my server log to see the visits and ban the spam looking IPs. Theanks for the tip that it can be used in this way as well.

2 0

Great post, I regularly check my server log to see the visits and ban the spam looking IPs. Theanks for the tip that it can be used in this way as well. 
Cancel
rahul_splunk

2012-07-02T14:53:00-07:00

Hi Dave,

Great post!

One other way to do this is to use the Web Intelligence App (free download on Splunkbase) and use the pre-built reports. There is a pre-built report for bot activity, with the ability to drill down into the raw logs or search on the individual activities.

Disclaimer: I work for Splunk.

2 0

Hi Dave, Great post! One other way to do this is to use the Web Intelligence App (free download on Splunkbase) and use the pre-built reports. There is a pre-built report for bot activity, with the ability to drill down into the raw logs or search on the individual activities. Disclaimer: I work for Splunk.
Cancel
- David Sottimano
 
 2012-07-03T02:12:41-07:00
 
 I was hoping one of you guys would stop by ;) I know I've only scratched the surface of what Splunk can actually do, but thanks for the tip.
 
 In fact, I think you should write a post for SEOs. I'd love to contribute so feel free to private message me.
 
 2 0
 
 I was hoping one of you guys would stop by ;) I know I've only scratched the surface of what Splunk can actually do, but thanks for the tip. In fact, I think you should write a post for SEOs. I'd love to contribute so feel free to private message me. 
 Cancel
Brent Copstead

2012-07-02T11:40:03-07:00

Excellent information Dave. And thank you for providing the screenshot video.

Question: in your logs, are you finding the same crawling issues with Bing or other crawlers?

2 0

Excellent information Dave. And thank you for providing the screenshot video. Question: in your logs, are you finding the same crawling issues with Bing or other crawlers?
Cancel
- David Sottimano
 
 2012-07-02T12:45:34-07:00
 
 Bing who? Just kidding. Catching Bingbot turned out to be harder and definitely crawled less of this particular site, so I didn't actually look close enough.
 
 I've probably made a mistake by not checking Bing, but I definitely saw the same pattern with Yandex.
 
 Np, and thanks for reading.
 
 1 0
 
 Bing who? Just kidding. Catching Bingbot turned out to be harder and definitely crawled less of this particular site, so I didn't actually look close enough. I've probably made a mistake by not checking Bing, but I definitely saw the same pattern with Yandex. Np, and thanks for reading.
 Cancel
Paligap

2012-07-02T06:42:06-07:00

I suspect following through on this might solve an ongoing issue we are having with the indexation of a client site (600K pages, had 260K indexed at one stage, now just 15K). Given the importance of the client, however, I may ask the technical team to report on it before I begin dabbling in things I don't understand completely.

Thanks, Dave, for the advice. It certainly gives me a new avenue to explore with this project. Very much appreciated. Also, good work presenting a tricky topic in such an accessible way, I'm sure that will be a major plus for a lot of people.

2 0

I suspect following through on this might solve an ongoing issue we are having with the indexation of a client site (600K pages, had 260K indexed at one stage, now just 15K). Given the importance of the client, however, I may ask the technical team to report on it before I begin dabbling in things I don't understand completely. Thanks, Dave, for the advice. It certainly gives me a new avenue to explore with this project. Very much appreciated. Also, good work presenting a tricky topic in such an accessible way, I'm sure that will be a major plus for a lot of people.
Cancel
- David Sottimano
 
 2012-07-02T07:27:51-07:00
 
 Thank you for reading and leaving such a nice comment!
 
 2 0
 
 Thank you for reading and leaving such a nice comment!
 Cancel
ClickConsult

2012-07-02T08:51:57-07:00

Great post, we've got someone from the team checking this out now on a client's site. So the extra pages that you got indexed, how much extra traffic organic delivered? If even only a quarter pulled in a single extra visit a month, that's a return of 1,750 visits, all for a couple of quick robots.txt fixes. Wow.

ClickConsult edited 2012-07-02T08:52:21-07:00
2 0

Great post, we've got someone from the team checking this out now on a client's site. So the extra pages that you got indexed, how much extra traffic organic delivered? If even only a quarter pulled in a single extra visit a month, that's a return of 1,750 visits, all for a couple of quick robots.txt fixes. Wow. 
Cancel
- David Sottimano
 
 2012-07-02T12:46:27-07:00
 
 I don't have enough data at the minute, but why don't you bug me on Twitter in 3 weeks time and I'll share some results? @dsottimano
 
 1 0
 
 I don't have enough data at the minute, but why don't you bug me on Twitter in 3 weeks time and I'll share some results? @dsottimano
 Cancel
Igal_Zeifman

2012-08-02T02:13:27-07:00

Thanks Dave!

Interesting post, very useful information. (as always)

I do want to point out that some of recorded Googlebot visits are probably belong to Googlebot impersonators.

Just recently our company (Incapsula) completed a Googlebot case-study (covering a group of 1,000 websites) in which we've discovered a 21% avg. Googlebot impersonation ratio. Some of these were SEO tools (using Googlebot user-agent is a good idea if you want to do this type of crawl) but ~16% were malicious and used for some kind of cyber-attack.
If you are interested, you can find more here: Fake Googlebot study

For this reason I also wanted to share Google official verification method and also say that its currently insufficient as Google dose not keep/update IP ranges. For example there is a Google Image bot that visits from Chinese IP and this can look suspicious (and lead to blockage) in the absence of official information.

1 0

Thanks Dave! Interesting post, very useful information. (as always) I do want to point out that some of recorded Googlebot visits are probably belong to Googlebot impersonators. Just recently our company (Incapsula) completed a Googlebot case-study (covering a group of 1,000 websites) in which we've discovered a 21% avg. Googlebot impersonation ratio. Some of these were SEO tools (using Googlebot user-agent is a good idea if you want to do this type of crawl) but ~16% were malicious and used for some kind of cyber-attack. If you are interested, you can find more here: <a href="https://www.incapsula.com/the-incapsula-blog/item/369-was-that-really-a-google-bot-crawling-my-site" rel="nofollow">Fake Googlebot study</a> For this reason I also wanted to share<a href="https://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553" rel="nofollow"> Google official verification method</a> and also say that its currently insufficient as Google dose not keep/update IP ranges. For example there is a Google Image bot that visits from Chinese IP and this can look suspicious (and lead to blockage) in the absence of official information. 
Cancel
miss_SEO

2012-07-19T02:01:49-07:00

Wonderful post, thank you. I better be checking my server logs now :-)

1 0

Wonderful post, thank you. I better be checking my server logs now :-)
Cancel
ChrisTalin

2012-07-05T04:38:40-07:00

Great post - I'm just scratching the surface of 'technical' SEO and this is just the sort of thing I want to incorporate into my site reviews.

I unfortunately came across a hurdle before being able to dig into the data: I cannot verify the 'clientip' as being one of Googlebots... the IP I have is 10.248.57.207 - MXtoolbox tells me ' 10.248.57.207 is a private IP address' when trying to do a rDNS lookup. Do you have any further information about how to find out what this means and what possible problems I might encounter?

Thanks :)

1 0

Great post - I'm just scratching the surface of 'technical' SEO and this is just the sort of thing I want to incorporate into my site reviews. I unfortunately came across a hurdle before being able to dig into the data: I cannot verify the 'clientip' as being one of Googlebots... the IP I have is 10.248.57.207 - MXtoolbox tells me ' 10.248.57.207 is a private IP address' when trying to do a rDNS lookup. Do you have any further information about how to find out what this means and what possible problems I might encounter? Thanks :)
Cancel
- David Sottimano
 
 2012-07-09T01:37:22-07:00
 
 Probably not the real Googlebot then. Anyone can easily change their User-agent to mimic Gbot, and it looks like this is the case. If that IP appears in your server logs daily, and is causing your site problems (crawling aggressively, consuming too much bandwidth etc..) - I would block it from accessing your site. Once blocked, and you see another similar IP doing the same thing, try guiding that IP into a bot trap.
 
 1 0
 
 Probably not the real Googlebot then. Anyone can easily change their User-agent to mimic Gbot, and it looks like this is the case. If that IP appears in your server logs daily, and is causing your site problems (crawling aggressively, consuming too much bandwidth etc..) - I would block it from accessing your site. Once blocked, and you see another similar IP doing the same thing, try guiding that IP into a bot trap.
 Cancel
Yashwant Naik

2012-07-05T21:48:58-07:00

Hi,

Thanks for such a informative information.

I am having one big problem here. I am facing some unusual problem from Google Bot.
"My subdomain pages are get Indexed in Search Engine but when I check the Cache but it is not CACHED. Showing 404 error."

This is first Time I am getting such wired experience from Google.

Please suggest some solution.

Thanks in advance.

1 0

Hi, Thanks for such a informative information. I am having one big problem here. I am facing some unusual problem from Google Bot. "My subdomain pages are get Indexed in Search Engine but when I check the Cache but it is not CACHED. Showing 404 error." This is first Time I am getting such wired experience from Google. Please suggest some solution. Thanks in advance. 
Cancel
- David Sottimano
 
 2012-07-13T05:33:39-07:00
 
 Indexed doesn't necessarily mean cached ;) I don't know what the rule is, but Google decides whether the page is worthy of being cached. Unfortunately I don't have the fix, but just make sure you don't use the noarchive tag (<meta name="robots" content="noarchive">) and that the page content is not duplicated anywhere else.
 
 Is it that big of a problem if the page is ranking? I'd probably say no and don't worry about it.
 
 DaveSottimano edited 2012-07-13T05:35:45-07:00
 1 0
 
 Indexed doesn't necessarily mean cached ;) I don't know what the rule is, but Google decides whether the page is worthy of being cached. Unfortunately I don't have the fix, but just make sure you don't use the noarchive tag (<meta name="robots" content="noarchive">) and that the page content is not duplicated anywhere else. Is it that big of a problem if the page is ranking? I'd probably say no and don't worry about it.
 Cancel
mr_ian84

2012-08-15T10:38:22-07:00

Analysing your server logs is of vital importance. There are various tools out there. I suggest Apache Logs Viewer. It's free and you can easily filter and sort out the log data easily.

You can also produce some nice reports to hand to your clients/supervisors. More information from:

https://www.apacheviewer.com

1 0

Analysing your server logs is of vital importance. There are various tools out there. I suggest Apache Logs Viewer. It's free and you can easily filter and sort out the log data easily. You can also produce some nice reports to hand to your clients/supervisors. More information from: https://www.apacheviewer.com 
Cancel
Amjath Khan

2016-06-01T02:58:40-07:00

Great post. I'm not sure if something changed of late. If you're running PPC on your site and happen to block the Googlebot like the example above, you'll get notification from Google saying that Google's adsbot are blocked on your site and that your ads cannot be approved even though the robots.txt tester within GSC shows as 'allowed' for adsbot-Google.

So, the solution is

User-agent: Googlebot

Disallow: /*/cppcr/

Disallow: /cppcr

User-agent: AdsBot-Google

User-agent: AdsBot-Google-Mobile

Allow:

Amjath edited 2016-06-01T03:05:32-07:00
1 0

Great post. I'm not sure if something changed of late. If you're running PPC on your site and happen to block the Googlebot like the example above, you'll get notification from Google saying that Google's adsbot are blocked on your site and that your ads cannot be approved even though the robots.txt tester within GSC shows as 'allowed' for adsbot-Google. So, the solution is User-agent: Googlebot Disallow: /*/cppcr/ Disallow: /cppcr User-agent: AdsBot-Google User-agent: AdsBot-Google-Mobile Allow:
Cancel
tgmode

2015-01-30T08:13:54-08:00

this above steps look time consuming and manual... i am using a automated log file tool at botify.com which is half the cost!

1 0

this above steps look time consuming and manual... i am using a automated log file tool at botify.com which is half the cost!
Cancel
SpookSEO

2014-01-01T20:30:29-08:00

Hi David! This is such a very helpful post. Thank you very much for sharing on how to use server logs to see Googlebot. I will definitely try this. I will also be sharing this to my friends.

1 0

 Hi David! This is such a very helpful post. Thank you very much for sharing on how to use server logs to see Googlebot. I will definitely try this. I will also be sharing this to my friends. 
Cancel
Wendy Pauley

2013-05-31T04:11:14-07:00

This is a brilliant guide and is only let down by Splunk being such a pain in the ass! Not the fault of the author of course.

I wonder how many sales they hook with this page https://www.splunk.com/view/pricing/SP-CAAADFV

1 0

This is a brilliant guide and is only let down by Splunk being such a pain in the ass! Not the fault of the author of course. I wonder how many sales they hook with this page <a href="https://www.splunk.com/view/pricing/SP-CAAADFV" rel="nofollow">https://www.splunk.com/view/pricing/SP-CAAADFV</a> 
Cancel
PeterPeterPE

2012-09-02T00:16:59-07:00

Am wondering about how the bot makes up files that aren't on my site? I run a simple 404 snippet that emails me about missing pages, missing images or hack attempts and I am constantly seeing googlebot making pages up.

Is the bot making suggestions on what files should be created for the site based on...? The site I recently built gets a few requests per day for pages that would seem to be a good fit as far as naming goes, but I know my sitemap is clean and definitely those pages do not exist So ... what should I do? Create the pages or do a redirect for a ghost page made up by the bot?

PS. the little 404 app is simple. the ErrorDocument 404 line in htaccess leads to an exact copy of my index page: example: index2.php - it opens when someone attempts to go to a page that does not exist as well if something is missing on an existing page. Any problems I receive an email telling me what file was requested and what the IP address is that requested it. Also good if you are selling ebooks and you want to know when someone has reached your uber-secret download location.

Makes it easier to IPBan those looking for a backdoor - as well, the little app is great for a new build as it instantly sends feedback of anything that is missing. If anyone wants a copy hmu.

PeterPeterPE edited 2012-09-02T00:25:03-07:00
1 0

Am wondering about how the bot makes up files that aren't on my site? I run a simple 404 snippet that emails me about missing pages, missing images or hack attempts and I am constantly seeing googlebot making pages up. Is the bot making suggestions on what files should be created for the site based on...? The site I recently built gets a few requests per day for pages that would seem to be a good fit as far as naming goes, but I know my sitemap is clean and definitely those pages do not exist So ... what should I do? Create the pages or do a redirect for a ghost page made up by the bot? PS. the little 404 app is simple. the ErrorDocument 404 line in htaccess leads to an exact copy of my index page: example: index2.php - it opens when someone attempts to go to a page that does not exist as well if something is missing on an existing page. Any problems I receive an email telling me what file was requested and what the IP address is that requested it. Also good if you are selling ebooks and you want to know when someone has reached your uber-secret download location. Makes it easier to IPBan those looking for a backdoor - as well, the little app is great for a new build as it instantly sends feedback of anything that is missing. If anyone wants a copy hmu.
Cancel
Dharam

2012-07-02T18:33:21-07:00

Dave, Amazing job and thanks for your valuable thoughts on this topic.

1 0

Dave, Amazing job and thanks for your valuable thoughts on this topic. 
Cancel
Andre Nader

2012-07-02T13:51:14-07:00

Hey Dave,

I believe the example you are showing isn't a valid Googlebot IP. 50.56.92.47 Appears to come from Rackspace's servers. I have found that 99% of the time (based on the log activity I have analyzed) Googlebot IP's start with 66.249 so that is an easy way to quickly filter out bots pretending to be Googlebot (or get their IPs to block).

DreNader edited 2012-07-02T13:58:43-07:00
1 0

Hey Dave, I believe the example you are showing isn't a valid Googlebot IP. 50.56.92.47 Appears to come from Rackspace's servers. I have found that 99% of the time (based on the log activity I have analyzed) Googlebot IP's start with 66.249 so that is an easy way to quickly filter out bots pretending to be Googlebot (or get their IPs to block).
Cancel
- David Sottimano
 
 2012-07-03T02:09:16-07:00
 
 Good catch. Actually, you can spoof Googlebot's IPs and it's important to use Google's official method that I've outlined above title: How to confirm what you're seeing is actually Googlebot.
 
 1 0
 
 Good catch. Actually, you can spoof Googlebot's IPs and it's important to use Google's official method that I've outlined above title: How to confirm what you're seeing is actually Googlebot. 
 Cancel
nsauser

2012-07-02T13:13:05-07:00

Dave, sorry if I missed it but what columns did you use for both axis on the graph? And is that a pivot table graph?

Are there any other columns you would recommend plotting for comparison?

Excellent post, thanks!

Nathan

nsauser edited 2012-07-02T13:34:06-07:00
1 0

Dave, sorry if I missed it but what columns did you use for both axis on the graph? And is that a pivot table graph? Are there any other columns you would recommend plotting for comparison? Excellent post, thanks! Nathan
Cancel
- David Sottimano
 
 2012-07-03T02:18:47-07:00
 
 Hey, so the graph above is based on a pivot chart. In my spreadsheet I had a field named "root" whereby it took the first folder in the file path. Example: example.com/cppcr/abcd/efg - the "root" would be "cppcr".
 
 In the pivot chart sheet:
 
 My column label: Useragent (filtered to Googlebot)
 
 Row label: Root
 
 Values: Count of URI
 
 I think this was by far the easiest for me, but there are so many different ways of looking at this. I guess another important graph would be: count of URI on your site, and how many Google actually crawls in a month.
 
 Good luck!
 
 1 0
 
 Hey, so the graph above is based on a pivot chart. In my spreadsheet I had a field named "root" whereby it took the first folder in the file path. Example: example.com/cppcr/abcd/efg - the "root" would be "cppcr". In the pivot chart sheet: My column label: Useragent (filtered to Googlebot) Row label: Root Values: Count of URI I think this was by far the easiest for me, but there are so many different ways of looking at this. I guess another important graph would be: count of URI on your site, and how many Google actually crawls in a month. Good luck!
 Cancel
Tom Wilkinson

2012-07-02T06:57:39-07:00

that sounds very useful for sites with PPC and/or affiliate campaigns.

1 0

that sounds very useful for sites with PPC and/or affiliate campaigns. 
Cancel
Manoj Pallai

2012-07-02T21:04:28-07:00

Great post Dave, how ever I'm going to PPC and affiliate campaign, I hope it must have ++ advantages for me..

Thanks a lot.

Manoj

1 0

Great post Dave, how ever I'm going to PPC and affiliate campaign, I hope it must have ++ advantages for me.. Thanks a lot. Manoj
Cancel
BWIRic

2012-07-03T08:12:59-07:00

Great tips on some good tools - we've been looking at dashboards recently and have been impressed by the power that splunk opens up. Awesome.

1 0

Great tips on some good tools - we've been looking at dashboards recently and have been impressed by the power that splunk opens up. Awesome. 
Cancel
Mohammad Aslam Shaikh

2012-07-03T03:47:58-07:00

Wao very good information for all seo expert Thank you so much Dave

KeriMorgret edited 2012-07-03T04:49:33-07:00
1 0

Wao very good information for all seo expert Thank you so much Dave 
Cancel
Shashi Kumar Gupta

2012-07-03T01:15:31-07:00

Really too much informative post on Googlebot and how it is working and how we can use it and can take the benefits from it at huge level.

Thanks for sharing i always wait your stuffs

1 0

Really too much informative post on Googlebot and how it is working and how we can use it and can take the benefits from it at huge level. Thanks for sharing i always wait your stuffs 
Cancel
Aiden Moor

2012-07-02T23:30:53-07:00

Great to know about this, really Dave your contribution is very valuable and I am thinking to use it ASPS.

1 0

Great to know about this, really Dave your contribution is very valuable and I am thinking to use it ASPS. 
Cancel
Gianluca Fiorelli

2012-07-02T06:38:35-07:00

great post Davide,

i have started quite recently digging into server logs, because i found out it was practically the only way i could follow to really understand why googlebot was indexing errandly important product pages of one of my clients.

here you cite Splunk, but i imagine you worked also with other log analytics tool. if it is so, would you like to tell me what were they?

1 0

great post Davide, i have started quite recently digging into server logs, because i found out it was practically the only way i could follow to really understand why googlebot was indexing errandly important product pages of one of my clients. here you cite Splunk, but i imagine you worked also with other log analytics tool. if it is so, would you like to tell me what were they?
Cancel
- David Sottimano
 
 2012-07-02T06:42:54-07:00
 
 I like the paid version of https://www.weblogexpert.com/ and I've also used https://awstats.sourceforge.net/ before on different hosts (didn't get as detailed as I wanted, but I probably didn't use its full potential). So far, Splunk is my favourite and the quickest.
 
 3 0
 
 I like the paid version of <a href="https://www.weblogexpert.com/" rel="nofollow">https://www.weblogexpert.com/</a> and I've also used <a href="https://awstats.sourceforge.net/" rel="nofollow">https://awstats.sourceforge.net/</a> before on different hosts (didn't get as detailed as I wanted, but I probably didn't use its full potential). So far, Splunk is my favourite and the quickest.
 Cancel
peterzmijewski

2012-07-02T23:20:34-07:00

This blogging seems very informative blog....

1 1

This blogging seems very informative blog.... 
Cancel

Post Analytics

Field Name	Value
IP	50.56.92.47
Date	31/May/2012:12:21:17 +0100
Method	GET
Response Code	404
User-agent	Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
URI_request	/wp-content/themes/esp/help.php
Host	www.example.com

Googlebot Crawl Issue Identification Through Server Logs

Anatomy of a server log hit

Example hit

Googlebot crawl issues you can find with logs

Using server logs to see Googlebot

Step 2: Download & Install Splunk.

Step 3: Adding your server log data to Splunk

Step 4: Only displaying hits containing Googlebot as the user-agent

Step 5: Export to Excel

The Analysis, problem & the fix

The problem

A look into my Excel spreadsheet

How to confirm what you're seeing is actually Googlebot

How did I fix this?

Results

Comments 52

Anatomy of a server log hit

Example hit

Googlebot crawl issues you can find with logs

Using server logs to see Googlebot

Step 2: Download & Install Splunk.

Step 3: Adding your server log data to Splunk

Step 4: Only displaying hits containing Googlebot as the user-agent

Step 5: Export to Excel

The Analysis, problem & the fix

The problem

A look into my Excel spreadsheet

How to confirm what you're seeing is actually Googlebot

How did I fix this?

Results

Comments 52

Log in to Moz

Don't have an account?