This is a follow-up to a post I wrote a few months ago that goes over some of the basics of why server log files are a critical part of your technical SEO toolkit. In this post, I provide more detail around formatting the data in Excel in order to find and analyze Googlebot crawl optimization opportunities.
Before digging into the logs, it’s important to understand the basics of how Googlebot crawls your site. There are three basic factors that Googlebot considers. First is which pages should be crawled. This is determined by factors such as the number of backlinks that point to a page, the internal link structure of the site, the number and strength of the internal links that point to that page, and other internal signals like sitemaps.
Next, Googlebot determines how many pages to crawl. This is commonly referred to as the "crawl budget." Factors that are most likely considered when allocating crawl budget are domain authority and trust, performance, load time, and clean crawl paths (Googlebot getting stuck in your endless faceted search loop costs them money). For much more detail on crawl budget, check out Ian Lurie’s post on the subject.
Finally, the rate of the crawl — how frequently Googlebot comes back — is determined by how often the site is updated, the domain authority, and the freshness of citations, social mentions, and links.
Now, let's take a look at how Googlebot is crawling Moz.com (NOTE: the data I am analyzing is from SEOmoz.org prior to our site migration to Moz.com. Several of the potential issues that I point out below are now solved. Wahoo!). The first step is getting the log data into a workable format. I explained in detail how to do this in my last server log post. However, this time make sure to include the parameters with the URLs so we can analyze funky crawl paths. Just make sure the box below is unchecked when importing your log file.
The first thing that we want to look at is where on the site Googlebot is spending its time and dedicating the most resources. Now that you have exported your log file to a .csv file, you’ll need to do a bit of formatting and cleaning of the data.
1. Save the file with an Excel extension, for example .xlsx.
2. Remove all the columns except for Page/File, Response Code, and User Agent. It should look something like this (formatted as a table, which can be done by selecting your data and pressing Ctrl+L):
3. Isolate Googlebot from other spiders by creating a new column and writing a formula that searches for "Googlebot" in the cells of the third (User Agent) column. Sample formulas for steps 3 through 5 are sketched after this list.
4. Scrub the Page/File column down to the top-level directory so we can later run a pivot table and see which sections Google is crawling the most.
5. Since we left the parameters on the URLs in order to check crawl paths, we'll want to remove them here so that those URLs are grouped into the correct top-level directories in the pivot table analysis. The URL parameter always starts with "?", so that is what we want to search for in Excel. This is a little tricky because Excel uses the question mark character as a wildcard. To indicate to Excel that the question mark is literal, use a preceding tilde, like this: "~?"
6. The data can now be analyzed in a pivot table (data > pivot table). The number associated with the directory is the total number of times Googlebot requested a file in the timeframe of the log, in this case a day.
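To make steps 3 through 5 concrete, here is a rough formula sketch. It assumes the Page/File values are in column A and the User Agent strings are in column C, with data starting in row 2; the column letters are only for illustration, so adjust them to match your own sheet.

- Googlebot flag (step 3), in a new column D: =IF(ISNUMBER(SEARCH("Googlebot",C2)),"Googlebot","Other")
- Parameter removed (step 5), in a new column E: =IF(ISNUMBER(SEARCH("~?",A2)),LEFT(A2,SEARCH("~?",A2)-1),A2)
- Top-level directory (step 4), in a new column F, built from the stripped URL: =IF(ISNUMBER(FIND("/",E2,2)),LEFT(E2,FIND("/",E2,2)),E2)

Fill each formula down the table, then build the pivot table on the directory column and filter it to the rows flagged as Googlebot.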
Is Google allocating crawl budget properly? We can dive deeper into several different pieces of data here:
- Over 70% of Google's crawl budget focuses on three sections, while over 50% goes towards /qa/ and /users/. Moz should look at search referral data from Google Analytics to see how much organic search value these sections provide. If it is disproportionately low, crawl management tactics or on-page optimization improvements should be considered.
- Another potential insight from this data is that /page-strength/, a URL used for posting data for a Moz tool, is being crawled nearly 1,000 times. These crawls are most likely triggered from external links pointing to the results of the Moz tool. The recommendation would be to exclude this directory using robots.txt (a sample rule is sketched below).
- On the other end of the spectrum, it is important to understand the directories that are rarely being crawled. Are there sections being under-crawled? Let’s look at a few of Moz’s:
In this example, the directory /webinars pops out as not getting enough Google attention. In fact, only the top directory is being crawled, while the actual Webinar content pages are being skipped.
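For reference, the /page-strength/ exclusion recommended above is a one-line addition to robots.txt. This is only a sketch of the syntax, not a copy of Moz's actual file:

User-agent: *
Disallow: /page-strength/

As the comments below point out, blocking a crawled URL also means giving up whatever value the links pointing at it would pass, so weigh that trade-off before disallowing anything.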
These are just a few examples of crawl resource issues that can be found in server logs. A few additional issues to look for include:
- Are spiders crawling pages that are excluded by robots.txt?
- Are spiders crawling pages that should be excluded by robots.txt?
- Are certain sections consuming too much bandwidth? What is the ratio of the number of pages crawled in a section to the amount of bandwidth required? (A formula sketch follows this list.)
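If your log export also includes a bytes-transferred column, that bandwidth ratio can be answered with the same kind of formulas used above. This is a sketch that assumes the top-level directory sits in column F and the response size in bytes in column G; the ranges and the "/qa/" example are placeholders.

- Requests for a section: =COUNTIF($F$2:$F$50000,"/qa/")
- Bandwidth for a section: =SUMIF($F$2:$F$50000,"/qa/",$G$2:$G$50000)
- Bytes per crawled page: =SUMIF($F$2:$F$50000,"/qa/",$G$2:$G$50000)/COUNTIF($F$2:$F$50000,"/qa/")

Sections with an unusually high bytes-per-request figure are the ones consuming crawl resources out of proportion to the number of pages they return.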
As a bonus, I have done a screencast of the above process for formatting and analyzing the Googlebot crawl.
In my next post on analyzing log files, I will explain in more detail how to identify duplicate content and look for trends over time. Feel free to share your thoughts and questions in the comments below!
Nice.
To go with this, AJ Kohn just published a killer post on Crawl Budget Optimisation here:
https://www.blindfiveyearold.com/crawl-optimization
Very detailed and informative post. These helpful related post links really help in understanding the topic better.
Excellent post... I'm sending this over to our tech guy right now so he can implement some of these strategies into his plans.
Great post, hey quick question, you said:
"Are spiders crawling pages that are excluded by robots.txt?"
We've actually seen this a lot; what would you recommend if they are? Often before an update, we'll see Google crawling through many of our site sections that have been disallowed by our robots.txt file.
Thanks!
Thank you for sharing this how-to! I'll be adding this to my to-do list.
Hey Tim,
What time frame do you usually review log files in?
Thanks in advance.
Great analysis, but do you *need* log files for this? Couldn't you pull internal and external linking data with Screaming Frog and come to similar conclusions?
Also, why would you block the page strength tool by robots.txt? With so many external links, wouldn't it stand to reason there would be a high number of navigational searches to that page as well?
Hey phantom, your log files are going to tell you the exact pages that Gbot is crawling; Screaming Frog will give you a good idea of link structure.
Good point on /page-strength/. It currently redirects to /tools/, so no point in throwing away those links.
Wow!! It is just amazing to see these strategies. I will definitely implement some of them. Thanks for sharing such useful data. A most interesting post to read and to bookmark. :)
Hi Tim,
The bonus screencast is awesome! I really appreciate the illustration with the text box guides. This step-by-step analysis is vital for seeing an otherwise invisible element that strongly influences site performance.
Thanks for sharing this! Learnt a lot!
I have a question about your "page-strength" example. You say this:
"Another potential insight from this data is that /page-strength/, a URL used for posting data for a Moz tool, is being crawled nearly 1,000 times. These crawls are most likely triggered from external links pointing to the results of the Moz tool. The recommendation would be to exclude this directory using robots.txt."
So you are asking Google not to crawl those sites anymore. If you have links pointing to that section of the site, wouldn't this devalue them?
So how can I get the spider to look at the sections that it's missing?
1) "Disallow" crawling of sections that aren't important - hopefully the spider will spend less time there and then have more "crawl budget" for the sections it's missing
2) Make sure the sections it's missing are being submitted in sitemaps
3) If you're already doing 1 and 2 and it's still not crawling the pages you prefer, look at how your site is laid out and consider making the links to those sections more prevalent and higher up the page. You can try to build external links to those sections as well.
Please correct me, and also forgive me if I am wrong, but by disallowing a page that is important (even if it is less important than one that gets crawled more), are we not doing that page an injustice?
Instead of disallowing, why not try to gain some more good-quality links for the less-crawled but more important pages, as you mentioned in point 3? I think backlinks also play an important role in crawling.
I completely agree with your 2nd and 3rd points.
A website's architecture should be such that your most important pages are easily accessible from all pages. The pages should also be easily crawlable and should not redirect the spider.
One more important thing that has not been suggested is PAGE LOAD TIME. Try to make pages load as fast as possible, not only the less-crawled pages but all pages. By doing this we give spiders extra time so that they can crawl more pages than they do now.
Here are some excellent posts on how and why page load time is important.
15 Tips to Speed Up Your Website
Site Speed - Are You Fast? Does it Matter for SEO?
Regards
Sasha
Wow, timresnik:
This is one of the most interesting posts I've read in a while, and one that doesn't circle around the same SEO topics. Just a question: could I get a penalty for this "Googlebot over-optimization"? :) lol, just kidding. Thx again for sharing this awesome info.
Great post! Question Re. under crawled pages: Do you think Google bots respect sitemap priority settings in this respect? In other words, would increasing the priority in a sitemap xml file for certain directories have any noticeable effect on crawl rate?
Hey, I don't think Google gives much importance to priority in the sitemap; I haven't seen sitemap priority make a difference lately. Secondly, increasing the priority of a less noticeable directory doesn't affect the crawl rate of that particular directory.
Hey Tim, thanks for this. This is a great step-by-step process.
You mention:
"In this example, the directory /webinars pops out as not getting enough Google attention. In fact, only the top directory is being crawled, while the actual Webinar content pages are being skipped."
You say at the beginning of this post that many of the issues you found have been fixed with the migration. Was this issue fixed? Can you tell us why this directory was not being crawled as often as you would like? And, how did Moz fix the issue?
Thanks!
Hey George, thanks for the question. We're pretty sure this has to do with the content improving and the freshness of the section. Prior to relaunch the section was over 6 months stale.
What Can You Do To Get the GoogleBots' Favor?
You absolutely have to keep away from "black hat SEO" strategies. No doubt they appear logical if you examine the general mechanism of Googlebot.
What do you think of Googlebot's role in optimizing your site?
Do you have any techniques or tools for optimizing SEO on mobile?
I love me a good technical post. You helped me discover that Google is crawling a lot of links that aren't that relevant and not crawling some directories that are. (Interesting to see that the bot will follow form buttons to a contact page from product pages around the site but barely make it to some of the folders that get linked on every page but only from the footer.) I'm adjusting our robots.txt and thinking about new strategies for the site layout, too!
Good post, I was missing out on analyzing logs. I have downloaded WebLog Expert after going through your post.
Also, I'm eager to check out your next post on how to identify duplicate content.
Hi Tim, thanks for the great article. The only thing I would recommend is to use Data > Text to Columns instead of the formulas. Especially for this table, Text to Columns might work best.