Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation - getting more of their pages included in Google's index so they have the opportunity to rank well. These are, obviously, long tail focused sites that earn the vast majority of their visits from queries that are each searched 5 or fewer times per day. In this post, I'm going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving that number.

First, a quick introduction to a truth that I'm not sure Google's shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven't seen) - that is - the concept that there's an "indexation cap" on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it described firsthand by a Googler to a webmaster. Even then, I didn't feel like the principle was "confirmed," but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google's been more open about recently.

The "indexation cap" makes sense, particularly as the web is growing exponentially in size every few years, often due to the production of spam and more legitimate, but no less index-worthy content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little "unique" content, the more traffic Google would send and thus, abuse was born. As an example, try searching using Google's "last 24 hours" function:

[Screenshot: Google search for the SEOmoz blog post, restricted to the past 24 hours]
Seriously, go have a look; the quantity of "junk" you wouldn't want in your search engine's index is remarkable

Since Tom published the post on Xenu's Link Sleuth last night, Google's already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week, and each of those posts gets copied or re-published a couple hundred times over (as the Xenu example illustrates), that's conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who's tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it's been a dead metric for a long time).
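For the curious, here's the back-of-the-envelope math behind that ~20 billion figure (the 250-copies-per-post multiplier is my own rough assumption, extrapolated from the Xenu example above):

```python
# Rough estimate of yearly web growth from blog content alone.
# The duplication multiplier is an assumption based on the Xenu example above.
posts_per_week = 1_500_000   # Technorati's figure for weekly blog posts
copies_per_post = 250        # scraped/syndicated copies per post (assumption)
weeks_per_year = 52

pages_per_year = posts_per_week * weeks_per_year * copies_per_post
print(f"~{pages_per_year / 1e9:.1f} billion new pages per year")  # ~19.5 billion
```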

So - long story short - Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

The interesting part is that, in the past 3 months, the number of big websites (I'll use that to refer to sites with in excess of 1 million unique pages) we've talked to, helped through Q+A or consulted with that have lost wide swaths of indexation has skyrocketed, and we're not alone. The pattern is usually the same:

  • One morning, you wake up, and 40% of your search traffic is gone with no signal as to what's happened
  • Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
  • Enter the analytics data, showing that rankings for big terms aren't down (or, maybe down a little), but that the long tail has gotten a lot shorter
  • A reconsideration request goes to Google
  • Somewhere between 10 and 40 days later, a message arrives saying:

We've processed your reconsideration request for https://xyz.com.

We received a request from a site owner to reconsider how we index the following site: https://xyz.com

We've now reviewed your site. When we review a site, we check to see if it's in violation of our Webmaster Guidelines. If we don't find any problems, we'll reconsider our indexing of your site. If your site still doesn't appear in our search results, check our Help Center for steps you can take.

  • This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collections of words ever assembled, spurs bouts of cursing and sometimes, tragically, the termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
  • Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
    Exec: "We spent $10 million @#$%ing dollars with you last month and you can't help?"
    AdWords Rep: "I'm sorry. We wish we could help. We just don't have any influence on that side of the business. We don't know anyone there or talk to anyone there."
    Exec: "Get me your boss on the phone. Now."
    Repeat ad nauseam until you reach a level of management commensurate with the spend of the exec's company (or their connections)
    Exec: "Can you get me some answers?"
    AdWords Boss: "They won't tell me much, but apparently they're not keeping as many pages in the index from your site as they were before."
    Exec: "Yeah, we kind figured that part out. Are they going to put us back in."
    AdWords Boss: "My understanding is no."
    Exec: "So what am I supposed to do? We're not going to have money to buy those $10 million in ads next month, you know."
    AdWords Boss: "You might try talking to someone who does SEO."
  • At this point, consultants receive desperate email or phone messages

To help site owners facing these problems, let's examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don't have statistical or quantitative data to back them up at this time):

  1. Importance on the Web's Link Graph
    We've talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It's likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
  2. Backlink Profile of the Domain
    A site's link profile can be evaluated on metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often caught and filtered so as not to provide value).
  3. Trustworthiness of the Domain
    Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
  4. Rate of Growth in Pages vs. Backlinks
    If your site's content is growing dramatically, but you're not earning many new links, this can be a signal to the engine that your content isn't "worthy" of ongoing attention and inclusion.
  5. Depth & Frequency of Linking to Pages on the Domain
    If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they're not particularly keen on the deep content - which is why the index may toss it out.
  6. Content Uniqueness
    Uniqueness is a constantly moving target and hard to nail down, but basically, if you don't have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you're at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and in my experience, they're also much tougher with this analysis on pages and sites that don't earn high quantities of external links to their deep content. I've included a rough sketch of one way to measure uniqueness just after this list.
  7. Visitor, CTR and Usage Data Metrics
    If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
  8. Search Quality Rater Analysis + Manual Spam Reports
    If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).
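On the content uniqueness point (#6), I obviously don't know the exact calculations Google runs, but a minimal sketch of one classic approach - breaking a page's text into overlapping word "shingles" and measuring the Jaccard overlap between two pages - might look like this (the sample pages and the threshold you'd apply are purely hypothetical illustrations):

```python
import re

def shingles(text, k=5):
    """Break a page's visible text into overlapping k-word 'shingles'."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(text_a, text_b, k=5):
    """Share of shingles the two texts have in common (0 = fully unique, 1 = identical)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: compare one of your deep pages against another page on (or off) your site.
page_one = "Xenu's Link Sleuth is a free desktop crawler that helps you find broken links on a site."
page_two = "Xenu's Link Sleuth is a free desktop crawler that helps you find broken links on any site you point it at."
print(f"Similarity: {jaccard_similarity(page_one, page_two):.2f}")
```

If huge swaths of your URLs score close to 1.0 against each other (or against pages elsewhere on the web), you've got very little "unique" content in the sense described above - whatever the specifics of Google's own, far more sophisticated, systems.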

Now let's talk about some leading indicators that can help to show if you're at risk:

  • Deep pages rarely receive external links - if you're producing hundreds or thousands of pages of new content and fewer than "dozens" earn any external link at all, you're in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
  • Deep pages don't appear in Google Alerts - if Google Alerts is consistently passing you by (not reporting your new pages), this can be (but isn't universally) an indication that they're not perceiving your pages as being unique or worthy enough of the main index in the long run.
  • Rate of crawling is slow - if you're updating content, links and launching new pages multiple times per day, and Google's coming by every week, you're likely in trouble (see the log-checking sketch below for a quick way to measure this). XML Sitemaps might help, but it's likely you're going to need to improve some of those factors described above to get into Google's good graces for the long term.
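As a quick gut-check on that last point, your raw access logs can tell you how often Googlebot actually visits. Here's a minimal sketch, assuming a standard combined-format log at a hypothetical path, that counts Googlebot requests and the distinct URLs it fetches each day:

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical path - point this at your real access log (combined log format assumed).
LOG_PATH = "access.log"

# Combined format: IP - - [10/Oct/2008:13:55:36 -0700] "GET /path HTTP/1.1" 200 2326 "referrer" "agent"
LINE_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits_per_day = defaultdict(int)
urls_per_day = defaultdict(set)

with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        day, url, user_agent = match.groups()
        if "Googlebot" in user_agent:
            hits_per_day[day] += 1
            urls_per_day[day].add(url)

for day in sorted(hits_per_day, key=lambda d: datetime.strptime(d, "%d/%b/%Y")):
    print(f"{day}: {hits_per_day[day]} Googlebot requests, {len(urls_per_day[day])} unique URLs")
```

If the daily unique-URL count is a tiny fraction of the pages you're publishing or updating, you're seeing the slow-crawl warning sign firsthand.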

There's no doubt that indexation can be a vexing problem, and one that's tremendously challenging to conquer. When the answer to "how do we get those pages back?" is "make the content better, more unique, stickier and get a good number of diverse domains to link regularly to each of those millions of URLs," there's going to be resistance and a search for easier answers. But, like most things in life, what's worth having is hard to get.

As always, I'm looking forward to your thoughts (and your shared experiences) on this tough issue. I'm also hopeful that, at some point in the future, we'll be able to run some correlations on sites that aren't fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.