There's been some talk recently in the SEO industry about 'crawl allowance' - it's not a new concept but Matt Cutts recently talked about it openly with Eric Enge at StoneTemple (and you can see Rand's illustrated guide too). One big question, however, is how you can tell how Google is crawling your site. While there are a variety of different ways of measuring this (log files are one obvious solution), the process I'm outlining in this post can be done with no technical knowledge - all you need is:
- A verified Google webmaster central account
- Google Analytics
- Excel
If you want to go down the log-file route then these two posts from Ian Lurie on how to read log files & analysing log files for SEO might be useful. It's worth pointing out, however, that just because Googlebot crawled a page it doesn't necessarily mean that the page was actually indexed. This might seem weird, but if you've ever looked in log files you'll see that Googlebot will sometimes crawl an insane number of pages, yet it often takes more than one visit before it actually takes a copy of a page and stores it in its cache. That's why I think the method below is actually quite accurate: it uses a combination of URLs receiving at least one visit from Google and pages with internal links as reported by webmaster central. Still, taking your log file data and adding it into the process below as a third data set would make things better (more data = good!).
Anyway, enough theory. Here's a non-technical, step-by-step process to help you understand which pages Google is crawling on your site and compare that to which pages are actually getting traffic.
Step 1 - Download the internal links
Go to webmaster central and navigate to the "internal links" section:
Then, once you're on the internal links page click "download this table":
This will give you the table of pages which Google sees internal links to. Note - for the rest of this post I'm going to be treating this data as an estimate of Google's crawl (see the brief discussion at the top of the post); I feel it's more accurate than using a site: search in Google. It does have some pitfalls, however, since what this report actually tells you is which pages have internal links pointing at them, not which pages Google has crawled. Still, it's not a bad measure of Google's index and only really becomes inaccurate when there are a lot of nofollowed internal links or pages blocked by robots.txt which you still link to.
Step 2 - Grab your landing pages from Google Analytics
This step should be familiar to all of you who have Google Analytics - go into your organic Google traffic report from the last 30 days, display the landing pages and download the data.
Note that you need to add "&limit=50000" into the URL before you hit "export as CSV" to ensure you get as much data as possible. If you have more than 50,000 landing pages then I suggest you either try a shorter date range or a more advanced method (see my reference to log files above).
Step 3 - Put both sets of data in Excel
Now you need to put both of these sets of data into Excel - I find it helpful to put all of the data into the same sheet, but it's not actually necessary. You'll have something like this, with the link data for your URLs from webmaster central on the left and the visits data from Google Analytics on the right:
Step 4 - VLOOKUP ftw
Gogo gadget VLOOKUP! The VLOOKUP function was made for data sets like this: it lets you look up the values from one data set against the other. I advise running the lookup both ways - once against each data set - so we get something like this:
Note that there may be some missing data in here depending on how fresh the content on your site is (probably enough material for a whole separate post), so you should then find and replace '#N/A' with 0.
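If it helps, here's a minimal sketch of what the two lookups might look like, assuming (purely for illustration) that the webmaster central URLs and link counts sit in columns A and B and the Analytics landing pages and visits sit in columns D and E - adjust the ranges and columns to your own layout:
In column C, next to each webmaster central URL: =VLOOKUP(A2,$D$2:$E$50000,2,FALSE) (pulls the organic visits for that URL, or #N/A if Analytics never saw it)
In column F, next to each Analytics landing page: =VLOOKUP(D2,$A$2:$B$50000,2,FALSE) (pulls the internal link count for that page, or #N/A if webmaster central doesn't list it)
Any #N/A results are what the find-and-replace above (or the IFERROR trick mentioned in the comments below) takes care of.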
Step 5 - Categorise your URLs
For the purposes of this post we're not interested in a URL-by-URL approach; we're looking instead at a high-level analysis of what's going on, so we want to categorise our URLs. The more detail you can go into at this step, the better your final output will be. So go ahead and write a rule in Excel to assign a category to each URL. This could be anything from simply following the folder structure to something more complex based on query strings etc. The best way of doing it depends on how your site structure works, so unfortunately I can't write this rule for you. Still, once it's done you should see something like this:
If you're struggling to build an Excel rule for your pages and your site follows a standard site.com/category/sub-category/product URL template, then a really simple categorisation would be to just count the number of '/'s in the URL (see the sketch below). It won't tell you which category the URL belongs to but it will at least give you a basic categorisation of which level the page sits at. I really do think it's worth the effort to a) learn Excel and b) categorise your URLs well. The better the data you can add at this stage, the better your results will be.
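As a rough sketch, assuming your URLs are in column A and using placeholder folder names (swap in whatever your own site structure uses):
=LEN(A2)-LEN(SUBSTITUTE(A2,"/","")) counts the slashes, i.e. roughly which level the page sits at
=IF(ISNUMBER(SEARCH("/search/",A2)),"Search",IF(ISNUMBER(SEARCH("/category/",A2)),"Category","Other")) assigns a named category based on the folder it finds
Fill either formula down the whole column and you've got your categories.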
Step 6 - Pivot table Excel Ninja goodness
Now we need the magic of pivot tables to come to our rescue and give us the aggregated numbers for each category. I suggest you pivot each set of data separately. Your pivot should look something like this for both sets of data:
It's important to note here that for the webmaster central data we're interested in the COUNT of the links (i.e. the number of pages indexed) rather than the SUM (which is Excel's default for numeric fields); for the Analytics data, on the other hand, you do want the SUM of the visits. Doing this for both sets of data will give you something like the following two pivots:
And:
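If you're not comfortable with pivot tables, a rough formula-based alternative (my suggestion rather than part of the original method) is to list your category names in a spare column and use COUNTIF/SUMIF against the categorised data. Assuming, purely for illustration, that the webmaster central URLs have their category in column C, the Analytics landing pages have their visits in column F and their category in column G, and your list of category names starts in J2:
=COUNTIF($C$2:$C$50000,J2) gives the number of pages Google sees links to in that category
=SUMIF($G$2:$G$50000,J2,$F$2:$F$50000) gives the total organic visits for that category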
Step 7 - Combine the two pivots
Now what we want to do is take the count of links from the first pivot (from webmaster central) and the sum of the visits from the second pivot (from Google Analytics), to produce something like this:
Generating the 4 columns on the right is really easy by just looking at the percentages and ratios of the first 3 columns.
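For instance, here's a rough sketch of those calculations, assuming the combined table has category names in column A, the count of crawled/indexed URLs in column B and the organic visits in column C, with the categories running from row 2 to row 10 (adjust the ranges to your own data):
% of crawl: =B2/SUM($B$2:$B$10)
% of traffic: =C2/SUM($C$2:$C$10)
Yield (visits per crawled page): =C2/B2
Format the first two as percentages and fill all three down.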
Conclusions
25% of the crawl allowance accounts for only 2% of the overall organic traffic
So, what should jump out at us from this site is that the 'search' pages and 'other' pages are being crawled quite aggressively - 25% of the overall site crawl between them - yet they only account for 2% of the overall search traffic. In this particular example this might seem like quite a basic thing to highlight; after all, a good SEO will be able to spot search pages being crawled by doing a site review. But being able to back this up with data makes for good management-friendly reports and will also help you gauge the scope of the problem. What this report also highlights is that if your site is maxing out its crawl allowance, then reclaiming that 25% of your crawl allowance from search pages may lead to an increase in the number of pages crawled from your category pages - the pages which pull in good search traffic.
Update: Patrick from Branded3 has just written a post on this very topic - Patrick's approach using separate XML sitemaps for different site sections is well worth a read and complements what I've written about here very nicely.
Very nice, coincides nicely with a blog post I did yesterday about how to use multiple sitemaps (one for each section) to analyse indexation.
https://www.blogstorm.co.uk/using-multiple-sitemaps-to-analyse-indexation-on-large-sites/
(edit: can't seem to add a link in Chrome so you will have to cut & paste!)
Nice post Patrick - that's a neat idea and I'd not thought of doing that. I've added a link to your post above.
Seems crawl issues are on everyone's agenda right now!
Using multiple sitemaps works really well, but it can be painful to set up for larger websites - and it also shows how drastically Google changes its preferences for crawling...
So much of the content seems to be sitting within the secondary index, and for larger sites it's not always feasible to build enough links deep enough to get all the content included.
This webmaster central/Excel post could be useful, but it may also create more questions than answers for many webmasters.
Thanks - That's a great way to keep an eye on indexing issues. Glad to see Excel being used in SEO more and more - there surely is a great potential.
Just an Excel improvement for the #N/A problem, to avoid having to search and replace the #N/A: it's possible to embed the VLOOKUP in an error-catching formula as below. Assuming your original formula is VLOOKUP(blah), simply write instead:
In Office 2003: =IF(ISNA(VLOOKUP(blah)),0,VLOOKUP(blah))
In Office 2007: =IFERROR(VLOOKUP(blah),0)
For those of you who want a simpler solution on WordPress, take a look at this crawl rate tracker plugin. IMHO it's very handy.
Thanks Marco, I think that might be a much easier option!
Great tutorial! I would also love to see this as a video!
Here are a couple of other Excel tips that can help you categorize URIs in the method above very quickly (with the right formulae and/or a template, the process described by Tom only takes a couple of minutes).
Tip #1: If, as suggested, you want to categorize by the depth of the URI (i.e. number of subfolders), you can count the number of forward slash symbols in cell A2 using the following formula:
=LEN(TRIM(A2))-LEN(SUBSTITUTE(TRIM(A2),"/",""))
(the exact number will depend on whether you use a trailing slash, but depth is relative anyway)
Tip #2: If you want to flag whether cell A2 contains "blog", for example, you can use the following formula:
=IF(NOT(ISERR(FIND("blog",A2))),"Blog","")
You can then fill down this formula to have the same applied to all cells in Column A.
If you want to identify multiple categories, one way to proceed is to create similar formulas in as many columns as you need categories. In each column, a slightly different formula will check the presence of one certain word or structure within the URI.
Say you want to split your URIs into 4 categories: /blog, /ugc, /tools and "Other". Assume that your URIs are in column A.
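For instance (illustrative formulas following the same pattern as Tip #2 - adjust them to your own folder names):
B2: =IF(NOT(ISERR(FIND("/blog",A2))),"Blog","")
C2: =IF(NOT(ISERR(FIND("/ugc",A2))),"UGC","")
D2: =IF(NOT(ISERR(FIND("/tools",A2))),"Tools","")
E2: =IF(B2<>"",B2,IF(C2<>"",C2,IF(D2<>"",D2,"Other")))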
When you fill everything down, column E will give you the name of the matching category for all URIs.
Another way to do it is to embed all 4 columns into one "mega-formula" or, much better, to create a custom formula in VBA that cycles through a predefined list of words, checks if the target cell contains one of them and, if so, returns the matching category name. Hope this helps! It would be a shame not to follow Tom's advice just on the assumption that the Excel bit is long-winded and tedious :)
Great points on crawling Tom. I'd also like a video of it :)
Superb. Is there any chance of getting this as a video ^^
For those of you who are just as familiar with VLOOKUP as I was before I read this post (not at all), I think this 10-minute tutorial video explains it quite clearly:
https://www.youtube.com/watch?v=2wHtcct7mCE
I have to admit it... more than an Excel Ninja I must consider myself an Excel Ninja wannabe... and for someone who used to get an F at school in maths that should be considered a firm demonstration of good will.
This is just to let you all know (Tom and everyone who added formulas in the comments) that you have all my gratitude, as you are helping me learn and deal with this part of the job - so necessary, but not my favourite, especially if I have to build the formulas myself.
About sitemaps... using different sitemaps for larger sites (and for all those that have multiple language versions) is something I also suggest, as I've seen how it helped a couple of clients' websites get crawled better by the bots.
I am lost at step 4!
Love your honesty... I'm not sure a video would help me... I need to work to understand this stuff
Oh, that's a nice way - but can I book out the rest of the week to create that table?
Very interesting post, Patrick.
Hi Tom,
Nice one on writing this up. We're just working through it, but got a little confused towards the end. In the bit where you say:
Generating the 4 columns on the right is really easy by just looking at the percentages and ratios of the first 3 columns.
I can see how the yield is calculated, but can't work out how you calculated the % of crawl & % of traffic.
Could be doing something silly here tho. Appreciate your help
Hi Jonathan,
Sorry for glossing over details here.
%crawl = (urls indexed by google from this category) / (sum of all urls indexed by google)
%traffic = (sum traffic for this category) / (sum of all traffic to all pages)
(these are both * 100 to get %s and rounded to display nicely - hence the zeros are actually just really small percentages)
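For example, with made-up numbers: if a category has 250 of the 1,000 URLs Google indexes across the site and gets 100 of the site's 5,000 organic visits, then %crawl = 250/1,000 = 25% and %traffic = 100/5,000 = 2%.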
Does that make sense?
Thanks
Tom
Great post! I wonder if there are any alternatives to Google Webmaster Central and Google Analytics for this site crawl diagnosis?
nice tutorial
Wow great post! Another way to show the dev guys the severity of crawling issues.
Love posts like these!
Thanks for the tutorial. Diagnosing crawling issues for me has always been a bit of a problem. Not because I couldn't do it but because the process has always seemed a bit long-winded (I know it's essential).
Great article! But it's hard when you don't master Excel :-(
If SEOmoz had a guide with these specific Excel formulas, that would be great (OK, maybe I'm asking too much). Anyway, nice approach!
David
Awesome, awesome post, Tom.
Can I pick one nit? My name is 'Lurie' not 'Laurie' :)
But thanks for the shout-out, and the great tip.
Hi Ian, sorry about that! I've fixed the post now.
Hi Tom, very informative post - thanks for sharing!
I have always found GWT's internal links data useful for indexation measurements. However it is very difficult to deal with the data if you have hundreds of thousands of URLs indexed. Excel allows you to paste only 65k rows unfortunately :)
Excel 2007 allows a million :)
OW, OW, OW Tom!! My head hurts just thinking of all the math I'm gonna have to do! Why oh why didn't I pay more attention in Mr. Scattergood's classes!
Many thanks for laying this all out so nicely for us Tom. No matter how much it's gonna hurt my head, I appreciate it. These advanced techniques you so freely share are much appreciated.
Wow. A few questions. Will this work in OpenOffice? I'm having trouble finding any info, aside from importing them, on pivot tables for OpenOffice. I'm also completely unfamiliar with the VLOOKUP function - anyone have any pointers on how to use it in this sense?
Thanks for the information, I can't wait to apply this technique.
It is useful - I should give it a try!
Hi Tom,
Really great tip here, and very timely as well, as I am currently auditing several large websites for indexing and crawl issues. I've been using Xenu LS, Google Webmaster Tools, and Google Analytics to draw analyses and information, and I guess this tip pretty much paints the picture I'm looking for.
However, I am not as proficient with Excel as you guys are and would definitely love to see this in a video - the step-by-step process of achieving the final xls sheet. I'll surely review your post and try it out though.
@Patrick, thanks for sharing your post. Would also be trying it out.
Thanks SEOMoz, Distilled, Patrick.
Jurgen
I like the way you think. Thank you very much for sharing.
Hello everybody. I'm a greenhorn on the SEOmoz blog - this is the first article I've read. Great tips. I will stay in the community.
-- Jen - removed link