One of the frustrations of doing SEO for large websites is the fact that Google makes it very difficult to see more than a small part of the search index. Even in Webmaster Tools, Google's index search is built on the same mechanics as its web search, which only lets you see the first 1,000 pages of any result. Whether you're trying to get pages discovered, struggling with duplicate content, confirming robots.txt changes, or doing advanced index sculpting, that 1,000-page barrier can be extremely limiting when you're dealing with a site with 10,000 or more indexed pages.
So, how can we dig deeper into the index and really see the big picture?
The Tools – Site: and Inurl:
First off, you're going to need a couple of tools. I'll assume that most of you are familiar with Google's "site:" command, which returns the indexed pages from any given domain or subdomain. Let's take our friends here at SEOmoz as an example. Type "site:seomoz.org" into Google's search box, and you'll see something like this:
The other command we'll be using is "inurl:", which, paired with other search terms, restricts the results to only those containing a specific keyword in the URL. Combined with the "site:" command, it makes Google reveal only the indexed pages that contain those URL keywords.
The Tactic – Index Deconstruction
Using our SEOmoz example, how can we find out which pages are included in the roughly 12,000-page index when we can only see those pages 1,000 at a time? Those last three words are the key: we can only see 1,000 pages at a time, but depending on how we construct our searches, they don't have to be the same 1,000 pages. By splitting up our index searches logically, we can break the full index up into manageable chunks. We'll do this by using "inurl:" to force the "site:" command to show us the index through smaller windows.
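To make the windowing idea concrete, here's a minimal Python sketch (the domain and folder names are just illustrative) of how "inurl:" slices one large "site:" query into smaller, separately viewable windows:

```python
# A minimal sketch of the index-windowing tactic: one base "site:" query
# plus one combined "site: ... inurl: ..." query per top-level folder.

def index_windows(domain, folders):
    """Return the base site: query plus one combined query per folder."""
    queries = ["site:" + domain]
    for folder in folders:
        queries.append("site:{} inurl:{}".format(domain, folder))
    return queries

for query in index_windows("seomoz.org", ["blog", "ugc", "articles"]):
    print(query)
```

Each generated query is pasted into Google's search box as-is; the point is that every window returns its own (up to) 1,000 results.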
An Example – Deconstructing SEOmoz
This is one of those techniques that's much easier to illustrate with an example. Let's say that we needed to dig deeply into SEOmoz's 12,000 indexed pages. The first thing that we might do is to take a look at the main navigation to get an idea of the URL/folder structure of the site. Looking at the top-right navigation on SEOmoz, we see the following (I've added the numbers 1-6 - see below):
Other than "Home," the first link goes to the "/blog" folder. That looks promising, so let's try out our combination "site:" and "inurl:" search:
After clicking the "omitted results" link to see the full list, we get 2,430 pages of the index that contain the word "blog." That's a good start, so let's see what we can do with a few more of the major folders (numbered above):
- inurl:blog – 2,430
- inurl:ugc – 712
- inurl:articles – 96
- inurl:tools – 29
- inurl:users – 5,880
- inurl:marketplace – 787
Not bad: with just 6 subfolders, we've accounted for 9,934 pages or over 80% of the index. This, of course, assumes minimal overlap, and the accuracy of Google's numbers may be questionable (I'll discuss some issues with "inurl:" at the end of the post), but it's more than adequate to get the job done.
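If you want to sanity-check the coverage arithmetic, a quick sketch (counts copied from the list above, against the roughly 12,000-page total from the plain "site:" search):

```python
# Rough coverage check: sum the per-folder counts reported by Google
# and compare against the approximate total index size.

counts = {
    "blog": 2430, "ugc": 712, "articles": 96,
    "tools": 29, "users": 5880, "marketplace": 787,
}
total_index = 12000  # rough figure from the plain "site:seomoz.org" search

accounted = sum(counts.values())                          # 9934 pages
coverage_pct = round(100.0 * accounted / total_index, 1)  # ~82.8%
print(accounted, coverage_pct)
```

Keep in mind this ignores any overlap between folders, so treat the percentage as an upper bound on true coverage.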
Now, we're left with a couple of groups, such as (5) that are still greater than 1,000 pages. At this point, you'll have to use some logic and your knowledge of the site in question. As a frequent Moz user, I know that the "users" folder contains all of the user profiles. Digging a little, I can easily find that those profiles all contain "users/view." A new search on "inurl:users/view" reveals 5,810 user profiles, making up almost all of the pages in the "users" folder and almost half of the total index.
An Example – Canonical URLs
Most of the time, we aren't going to be trying to deconstruct the entire Google index for a site, but just need to answer a specific question. Let's take my own company site/blog as an example. Recently, I realized that I had left some loose ends in the code that were revealing both canonical and non-canonical URLs. So, for example, the same blog post might have the following two URLs:
- https://www.usereffect.com/topic/the-last-spam-youll-ever-need
- https://www.usereffect.com/index.php?id=154
I've recently made some code changes to fix the problem, but how do I find out if my fix is working? I simply look for "id" in the URL with a search command like "site:usereffect.com inurl:id". As of this writing, that search only shows 1 result, suggesting that my changes are having the desired effect.
Advanced Inurl Tips
I hope that I've demonstrated just how powerful two relatively simple search tools can be when effectively combined. Before you go out and put this to work, though, a couple of warnings about "inurl:", which has a tendency to misbehave.
First, "inurl:" seems to ignore punctuation, for the most part. A targeted search on the folder "inurl:/blog" returns the same results as "inurl:blog," which is to say that it returns every page that contains "blog" anywhere in the URL. In some cases, this won't be a problem, but you'll have to judge that on a case-by-case basis. Like standard Google search terms, "inurl:" only searches on whole words (but doesn't seem to allow word stems), and you can only use a single word at a time in any given "inurl:" statement.
You can use multiple "inurl:" statements (one for each word) in your search, which are automatically combined with a logical AND. You can also use "-inurl:" to exclude specific URL keywords from any given search. Finally, you can combine "site:", "inurl:" and stand-alone keywords to target indexed pages by URL and content keywords in one statement.
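As a sketch of how these pieces compose (this helper is purely illustrative, not any official API), each "inurl:" statement carries a single word, exclusions are prefixed with "-", and plain keywords ride along unchanged:

```python
# Illustrative query builder composing site:, multiple inurl:/-inurl:
# statements (one word each), and stand-alone content keywords.

def build_query(domain, include=(), exclude=(), keywords=()):
    parts = ["site:" + domain]
    parts += ["inurl:" + word for word in include]
    parts += ["-inurl:" + word for word in exclude]
    parts += list(keywords)
    return " ".join(parts)

print(build_query("seomoz.org", include=["users", "view"]))
```

The multiple include terms are AND-ed by Google automatically, matching the behavior described above.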
Great post Pete - I love digging into what the search engines provide and offering up ways to use them. Other tactics I like along these same lines:
There are other good ones, too, but these generally work well for me.
Oooo! That date range query is nice Rand! I've just been talking about dupe content detection using similar techniques here - hope you like it!
Excellent; thanks. I keep hearing about the date-range query but wasn't clear on how to use it as a URL parameter.
Oh, and BTW, thanks for letting me use SEOmoz as an example. I know it's publicly available information, but some people might still consider it a little invasive. You just happened to be a perfect example for this exercise.
Would a post on all the URL parameters for Google be helpful?
Ooohh...YES YES YES :D
Okay, this hasn't been updated lately, but still pretty on target. Here is a free PDF that Stephan did back when...
Unlocking Google's Hidden Potential
Great queries, but I'm a little confused about the date range query. This might be a stupid question, but just so I'm straight here...You go to google, do your search, and once you have the results you add that juicy little tidbit to the end of the string and that will give you the filter?
thanks for helping out the clueless :P
Exactly: the version Rand gave is just a URL parameter. You can also go to Google, click on "Advanced Search" and then open up the "Date, usage rights, numeric range, and more" link to see the date-search options.
I should've actually bothered to try that one out earlier, so thanks for the nudge :)
cool stuff!
Excellent post Pete. You had me at "cracking." :)
Great post.
Just one thing to be aware of: the total number of search results reported by Google is not accurate for most searches.
They use a distributed BigTable database for their index, and Google's developers keep saying: "BigTable doesn't have counts." That's why the total number of search results changes every time you click "Next"; it's just an estimate.
So if Google reports, say, 975 results on the first page, it doesn't mean that you will be able to get those 975 URLs from Google. On the second page this number may be slightly different, and a few pages later you may find that 342 results is all Google has for you.
P.S. As a bit of shameless self-promotion, I can mention our tool, FirstStop WebSearch, which can be used to automate retrieving large numbers of search results. Some of our clients have found it convenient to use batch searches (using techniques similar to those described in this article) to break Google's 1,000-page barrier.
Although the total search results for a regular query are usually an approximation, I've found the result count for the "site:" command to be reasonably accurate. It's definitely true, though, that as you drill down into smaller counts, the accuracy seems to improve.
It depends ;-)
For site:seomoz.org inurl:ugc the first page said 711 but I could only get 380 results.
For site:seomoz.org inurl:articles the initial estimate was 99 and the final 38.
Nice one Pete! sphunn!
Thanks... you beat me to the punch. I was about to tell everyone that you just posted a great blog entry on finding dupe content with site:, inurl:, and intitle:. File that under "great minds think alike" ;)
Cheers for the shout and the sphinns on my post too.
great post doctor Pete,
I work full time for a very large directory firm in the UK, and we have over 5 million pages in our directory. I understand how hard it is to find out how all the pages are ranking due to the size of the site we have. We currently get over 7 million searches per month but would like more.
We have tried to make it as easy as possible for Google to understand which pages we want ranked and, conversely, which we don't want indexed. We are currently considering using nofollows on all client links so we don't look like we are selling paid listings. As we have many thousands of clients, this needs to be right. Because we have both canonical and non-canonical URLs in the index, it's even harder to understand what's going on.
Thanks for the article; it's given me some ideas. I do appreciate the time taken to write the post.
Wow, and I thought sorting through a 30,000-page index for a client was a challenge. You've really got your work cut out for you with 5,000,000,000 pages, but I suppose the basic challenges are the same.
Pete,
I read your comment before I saw the one above it and I immediately thought - WOW! Five Billion pages! That's like GigaGinormous!
I think you have a few too many extra zeros in there. ;)
Oops, got a little carried away :)
An awesome post. I wish there was an easy way to sort out dupe URLs that pop out with different inurl: queries...
Dr. Pete,
Great post! Especially when initially auditing a site, we'll undergo this type of URL dissection because it helps to illustrate potential siloing and templating, and can reveal where challenges exist, such as much lower indexation of a URL pattern, which may point to crawling issues or pages that are perceived as duplicates.
I tend to prefer the segmentation within a site: query.
Even in cases where the modifier doesn't appear elsewhere within the URL, the above and the inurl: approach will often produce differing counts, so whichever method is used, it's best to pick one and stick with it rather than go back and forth.
The advanced queries can be extremely useful and powerful, but mixing and matching them can be a little challenging. I recommend people play with these on a site they know well to get a comfort level for what works, what works best, and what goes wacky.
On a sidenote: from a methodology standpoint, I have the team append &start=990&filter=0 to the Google URL to jump to the last page and pull in any omitted results. In some brief tests on sites that were small enough (less than 1,000 results), I've found this to be the most accurate method.
Whenever I try a new combination, I try to use a little common sense and see if the results pass the smell test. I was playing with a strange mix of multiple "intitle:" and "inurl:" statements the other day, for example, and the SERPs made absolutely no sense.
I do prefer site: modifier over the inurl: modifier as well when it comes to segmenting folders.
In the inurl:blog example, even results from /some-folder/containing/blog/here get counted, which is not intended.
How would you go about this for Yahoo and MSN?
superb post and cool quick tips for google
hi Guys
I have been looking into this and got stuck at
https://www.google.co.uk/webhp?sourceid=navclient-ff#hl=en&q=site:partyrama.co.uk&start=990&sa=N&fp=ee8e7832ac926f03
Site I am working on is www.partyrama.co.uk
Just did some research on my sites.. and i found some interesting stuff google index special character links.. nice post! thumbs UP!
I think this inurl: thing may help me. My blog is using the 'all in one SEO' plugin and it has decided to give the /tag/ pages (that get indexed) the canonical preference. I want to change all of these to point to the actual post, to avoid duplication.
Can I use the inurl: tool to find all the /tag/ pages, then edit them all individually to change the canonical command?
The blog is Freebiejeebies
Thanks
You can use inurl: or you can just add the virtual directory path right to the site: command, like this:
https://www.google.com/search?q=site:www.freebiejeebiesgadgets.com/tag
Thanks for the help - I didn't know I could do that!
Hey Pete,
I've been out of pocket for a few days but just wanted to quickly add my commendations for a very nice post. The fact that it also elicited a few more helpful query strings and ideas only adds to an already excellent contribution.
I found this one and the better captcha post on your UserEffect site to be both insightful and helpful.
Thanks.
Oooo...I'm having the very same issue with canonical and non-canonical URLs, and my programmer seems confused as to why it's a problem. Do you have an easy fix I can pass along?
Thanks for the great information. Super helpful!
Cool, I use site: a lot, but I'm excited to try out inurl:
read everything you can on this site - start with the free articles and work your way up.
there are a lot of really great insights here so use the search to find your answers too.
good luck, and welcome to the party :)
Really nice post Pete - I don't work with as many large sites as I used to, but I'll be bookmarking this one for when I do, and sending it on to people I know who still are.
dr pete thats the best thing I have read in the last 4 weeks, and trust me, theres been some good stuff going round. awesome.
I wish I had something intelligent to add, but I don't :P
Great post....very helpful!!!!
thank you Pete for all those very helpful insights! great post, especially when working with larger sites, such as e-commerce ones. until now I was using mostly site:+keyword in the query, eg. site:www.seomoz.org rand but your tips will get me better results. thx again!
Great post, Dr. Pete, you are really killin' it this week with your CAPTCHA post and now this.
I don't use site: and inurl: commands nearly as much as I should, and this is a fantastic reminder to do so.
Rand, thanks for the additional advanced operators also.
Good stuff Pete. Congratulations on promotion to the blog.
I'm wondering if there'd be any utility in large sites planning for this when designing their site by including unique identifying strings for sections that they'll want to track. Like you pointed out "blog" could return inurl: results that aren't actually in the blog section, so perhaps adding something more unique to all blog page urls would be useful?
It's absolutely a good idea to plan out a logical URL structure and make links distinctive, but I've toyed around with it a bit, and you have to be careful not to get too cryptic. The priority is still to create URLs that are user-friendly and good for general SEO.
Well done Dr.
As a WordPress user I have to take some of the plugin descriptions at face value for what they do. Those claiming to get rid of WP's duplicate content issues have always been difficult to truly evaluate. These techniques show definite ways to help out, thanks.
The newsletter I do in-house work for writes on the same few subjects over and over again so I use the site: + intitle: + keyphrase or topic to help ensure canonical linking structures are what I'd like them to be as well as keeping an eye out for unintentional cannibalism.
Cheers!
@trontastic