Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation - getting more of their pages included in Google's index so they have the opportunity to rank well. These are, obviously, long-tail-focused sites that earn the vast majority of their visits from queries that bring in 5 or fewer searches each day. In this post, I'm going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving that number.
First, a quick introduction to a truth that I'm not sure Google's shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven't seen) - that is - the concept that there's an "indexation cap" on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it firsthand, described by a Googler to a webmaster. Even then, I didn't feel like the principle was "confirmed," but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google's been more open about recently.
The "indexation cap" makes sense, particularly as the web is growing exponentially in size every few years, often due to the production of spam and more legitimate, but no less index-worthy content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little "unique" content, the more traffic Google would send and thus, abuse was born. As an example, try searching using Google's "last 24 hours" function:
Seriously, go have a look; the quantity of "junk" you wouldn't want in your search engine's index is remarkable.
Since Tom published the post on Xenu's Link Sleuth last night, Google's already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week (roughly 78 million per year), and each post spawns copies and mentions at anything like the rate Tom's did, that's conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who's tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it's been a dead metric for a long time).
So - long story short - Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.
The interesting part is that, in the past 3 months, the number of big websites (I'll use that to refer to sites with an excess of 1 million unique pages) we've talked to, helped through Q+A or consulted with that have lost wide swaths of indexation has skyrocketed, and we're not alone. The pattern is usually the same:
- One morning, you wake up, and 40% of your search traffic is gone with no signal as to what's happened
- Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
- Enter the analytics data, showing that rankings for big terms aren't down (or, maybe, are down a little), but that the long tail has gotten a lot shorter
- A reconsideration request goes to Google
- Somewhere between 10 and 40 days later, a message arrives saying:
We've processed your reconsideration request for https://xyz.com.
We received a request from a site owner to reconsider how we index the following site: https://xyz.com
We've now reviewed your site. When we review a site, we check to see if it's in violation of our Webmaster Guidelines. If we don't find any problems, we'll reconsider our indexing of your site. If your site still doesn't appear in our search results, check our Help Center for steps you can take.
- This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing and sometimes, tragically, termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
- Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
Exec: "We spent $10 million @#$%ing dollars with you last month and you can't help?"
AdWords Rep: "I'm sorry. We wish we could help. We just don't have any influence on that side of the business. We don't know anyone there or talk to anyone there."
Exec: "Get me your boss on the phone. Now."
Repeat ad nauseam until you reach a level of management commensurate with the spend of the exec's company (or their connections)
Exec: "Can you get me some answers?"
AdWords Boss: "They won't tell me much, but apparently they're not keeping as many pages in the index from your site as they were before."
Exec: "Yeah, we kind figured that part out. Are they going to put us back in."
AdWords Boss: "My understanding is no."
Exec: "So what am I supposed to do? We're not going to have money to buy those $10 million in ads next month, you know."
AdWords Boss: "You might try talking to someone who does SEO." - At this point, consultants receive desperate email or phone messages
To help site owners facing these problems, let's examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don't have statistical or quantitative data to back them up at this time; a rough sketch of how signals like these might combine appears after the list):
- Importance on the Web's Link Graph
We've talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It's likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
- Backlink Profile of the Domain
A profile of a site's links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often caught and filtered so as not to provide value).
- Trustworthiness of the Domain
Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances of raising the indexation cap likely go up.
- Rate of Growth in Pages vs. Backlinks
If your site's content is growing dramatically, but you're not earning many new links, this can be a signal to the engine that your content isn't "worthy" of ongoing attention and inclusion.
- Depth & Frequency of Linking to Pages on the Domain
If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they're not particularly keen on the deep content - which is why the index may toss it out.
- Content Uniqueness
Uniqueness is a constantly moving target and hard to nail down, but basically, if you don't have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you're at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and they're also, in my experience, much tougher in this analysis on pages and sites that don't earn high quantities of external links to their deep content.
- Visitor, CTR and Usage Data Metrics
If Google sees that clicks to your site frequently result in a click of the back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful for deciding which sites need higher levels of scrutiny.
- Search Quality Rater Analysis + Manual Spam Reports
If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).
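To make the idea concrete, here's a toy Python sketch of how signals like these might be combined into a rough "indexation risk" estimate. To be clear: every weight, threshold and field name here is my own assumption for illustration - nobody outside Google knows the real formula.

```python
# Illustrative only: a toy "indexation risk" score built from the kinds of
# signals discussed above. All weights, thresholds, and inputs are hypothetical.

def indexation_risk(pages_total, pages_with_external_links,
                    linking_root_domains, new_pages_per_month,
                    new_linking_domains_per_month, avg_unique_words_per_page):
    """Return a rough 0-1 risk estimate (higher = more likely to lose indexation)."""
    risk = 0.0

    # Depth & frequency of linking: what share of pages earn any external link?
    deep_link_coverage = pages_with_external_links / max(pages_total, 1)
    if deep_link_coverage < 0.05:
        risk += 0.35

    # Rate of growth in pages vs. backlinks: content far outpacing link acquisition.
    growth_ratio = new_pages_per_month / max(new_linking_domains_per_month, 1)
    if growth_ratio > 100:
        risk += 0.30

    # Importance / backlink profile proxy: very few linking domains for a big site.
    if linking_root_domains < pages_total / 1000:
        risk += 0.20

    # Content uniqueness proxy: thin pages are at risk.
    if avg_unique_words_per_page < 150:
        risk += 0.15

    return min(risk, 1.0)


if __name__ == "__main__":
    # A hypothetical 10-million-page site with a thin link profile scores high.
    print(indexation_risk(
        pages_total=10_000_000,
        pages_with_external_links=180_000,
        linking_root_domains=4_500,
        new_pages_per_month=250_000,
        new_linking_domains_per_month=300,
        avg_unique_words_per_page=90,
    ))
```

Plug in the kind of profile described in the leading indicators below (millions of pages, a few thousand linking domains, thin content) and the score lands near the top of the range - which is exactly the situation the next section warns about.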
Now let's talk about some leading indicators that can help to show if you're at risk:
- Deep pages rarely receive external links - if you're producing hundreds or thousands of pages of new content and fewer than "dozens" earn any external link at all, you're in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
- Deep pages don't appear in Google Alerts - if Google Alerts is consistently passing your new content by (not reporting on it), this can be (but isn't universally) an indication that they're not perceiving your pages as being unique or worthy enough of the main index in the long run.
- Rate of crawling is slow - if you're updating content, links and launching new pages multiple times per day, and Google's coming by every week, you're likely in trouble. XML Sitemaps might help, but it's likely you're going to need to improve some of the factors described above to get back into good graces for the long term (a quick log-based way to watch crawl rate is sketched below).
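To keep an eye on that last indicator, a quick pass over your server logs is often enough. Here's a minimal Python sketch that counts Googlebot hits per day from a standard combined-format access log; the log path, the format, and the user-agent check are assumptions you'll want to adapt to your own setup (and anyone can fake a Googlebot user-agent, so treat the counts as approximate).

```python
# Count Googlebot requests per day from an access log in combined format.
# Log path and format are assumptions; adjust the regex for your server config.

import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path
DATE_RE = re.compile(r'\S+ \S+ \S+ \[(\d{2}/\w{3}/\d{4})')  # e.g. 12/Nov/2009

def googlebot_hits_per_day(log_path=LOG_PATH):
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "Googlebot" not in line:   # user-agent appears in combined-format lines
                continue
            m = DATE_RE.match(line)
            if m:
                hits[m.group(1)] += 1
    return hits

if __name__ == "__main__":
    for day, count in sorted(googlebot_hits_per_day().items()):
        print(day, count)
```

If you're publishing daily and the counts show Googlebot dropping by once a week, that's the warning sign described above.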
There's no doubt that indexation can be a vexing problem, and one that's tremendously challenging to conquer. When the answer to "how do we get those pages back?" is "make the content better, more unique, stickier, and get a good number of diverse domains to link regularly to each of those millions of URLs," there's going to be resistance and a search for easier answers. But, like most things in life, what's worth having is hard to get.
As always, I'm looking forward to your thoughts (and your shared experiences) on this tough issue. I'm also hopeful that, at some point in the future, we'll be able to run some correlations on sites that aren't fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.
This is one of the better posts I have read in a while.
I think larger sites need to do more content cleanup / analysis on a more consistent basis to really avoid this problem. I would recommend running regular reports on each URL's date created, last date modified and incoming links, as well as traffic numbers, etc. (i.e. everything mentioned above). Then, the SEO teams for these larger sites should institute an "aging" policy where content is either archived, removed from the submitted sitemap, or redirected (old URLs pointed to higher-authority, similar-topic URLs).
It really comes down to being a huge signal-to-noise ratio problem, where the noise is URLs without any backlinks, real visitors, etc., and the higher your site can get its signal:noise ratio, the more authority it will seem to have.
Definitely a very challenging problem, because most people take a "publish and forget" approach to producing website content.
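As an illustration of the "aging" report described above, here's a minimal Python sketch that flags stale, unlinked, unvisited URLs as candidates for archiving, redirection, or removal from the sitemap. The CSV columns and the thresholds are hypothetical - substitute whatever your own reporting exports.

```python
# Flag URLs that are old, have no external links, and get negligible traffic.
# Column names ("url", "last_modified", "external_links", "visits_90d") and the
# thresholds below are assumptions for illustration.

import csv
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=365)
TODAY = datetime.utcnow()

def aging_candidates(report_csv="url_report.csv"):
    """Yield URLs with no external links, little traffic, and no recent edits."""
    with open(report_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            last_modified = datetime.strptime(row["last_modified"], "%Y-%m-%d")
            stale = TODAY - last_modified > STALE_AFTER
            unlinked = int(row["external_links"]) == 0
            unvisited = int(row["visits_90d"]) < 10
            if stale and unlinked and unvisited:
                yield row["url"]

if __name__ == "__main__":
    for url in aging_candidates():
        print(url)
```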
Brilliant post Rand! We spend a lot of time explaining this issue to clients. At the end of the day publishers need to focus on a few things to maximize indexation:
1. Quality content (across the whole site)
2. Eliminating dupe content
3. Quality links
MacSeth, the speed of the website is a good point... Efficient code and layout is important, but on huge sites, maybe this is an even more important metric, simply because a large site is more time consuming to index and more intensive on Google's hardware. It's even more important to get deep-level links to big sites to "confirm" the importance and validity of the page and let Google know the page is worth indexing!
I agree with @porly and @MacSeth...
Our large sites have seen this "deindexing" effect within the last year or so, but they were built 3 years ago not accounting for page speed. Amongst the other important factors Rand mentioned, it seems it would be prudent for anyone with "old code" to revamp the site, lose some code weight and where possible, utilize design templates that place main content at the top of the code with nav and other sidebar elements at the bottom of the code - this way at least the spiders see the important content before all the bloat. I can see how critical this is for large sites especially.
Time for a clean up! Great post Rand.
I love the point about: "Rate of Growth in Pages vs. Backlinks"
It's a pretty strong signal to send to a search engine when your site expands from 100,000 indexed pages to let's say... 6 million indexed pages in the space of a few weeks. You're going to attract some attention, right?
And then: Deep pages rarely receive external links
If you're working to expand a site in the way I mentioned above, I'd strongly recommend you take it easy, being sure to address areas of "weakness" in the architecture - large volumes of content with no external links.
One final point - if your development team are making changes to the site, keep an eye on your internal link structure. If you lose a navigational element you could end up orphaning an entire content section. I've seen this happen and it can drastically reduce the length of your long tail if it affects enough pages. I've seen something similar happen, where pages were constantly being orphaned, re-linked to, and orphaned again, depending on whether a particular item was available in the database. I drew a chart of the number of keywords bringing traffic to the site on a daily basis and the results clearly demonstrated instability in the long tail.
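For anyone wanting to reproduce that kind of chart, here's a small Python sketch that counts the distinct keywords sending organic traffic each day from an analytics export. The file layout (date, keyword columns) is an assumption - adapt it to whatever your analytics package produces. A sharp, sustained contraction in the daily keyword count is the long-tail instability described above.

```python
# Count distinct organic keywords per day from a CSV export.
# The file name and column names ("date", "keyword") are assumptions.

import csv
from collections import defaultdict

def daily_keyword_counts(path="organic_keywords.csv"):
    keywords_by_day = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            keywords_by_day[row["date"]].add(row["keyword"])
    return {day: len(kws) for day, kws in sorted(keywords_by_day.items())}

if __name__ == "__main__":
    for day, count in daily_keyword_counts().items():
        # A sudden, lasting drop here usually means deep pages fell out of the
        # index, or an internal-linking change orphaned a content section.
        print(day, count)
```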
Following the indicator "Rate of Growth in Pages versus Backlinks" seems correct, but is it only about dofollow links or also nofollow?
I'm asking this because people tend to abuse nofollow links these days.
Thanks!
Dan
Great post - although it raises a few worrying thoughts!
Particularly useful for Wordpress style blogs where there are so many ways of accessing the same content - time for some housekeeping me thinks!
Thanks - now get some sleep!
Great post
I think there's another factor that can influence the "indexation cap": the amount of internal duplicate content.
IMHO, the higher the number of internal duplicate pages, the lower the indexation cap. Having an efficient structure also means saturating the indexation cap with only relevant pages.
Great Post Rand. Doing SEO for really big sites is a whole different game. And nice dialogue with the Adwords peeps ;)
Great post but I think you "missed" out one important metric: the actual speed of the website (as a whole and individual pages).
I think this is/will also cause a certain drop in indexed pages as well.
Why? Big websites with a LOT of content tend to slow down quite a lot in the process... if that's not fixed, the bots will not really like it and may leave before they get the chance to crawl it.
Absolutely, and only going to increase.
No longer is it simply publish or perish; it's publish originally - minimizing internal duplication and URL bloat, and earning strong external validation - or perish.
I wonder whether the removal of old and generally obsolete pages that have been left out there "not to generate 404 errors" would help "free some space" in Google's index.
It's a good point. I hope Google soon come up with a system that will crawl through and remove several types of sites:
- Sites that are indexed on the first page but require you to log in or even pay for the answer to your question.
- Sites that lead to 404s, server errors or even 403 forbidden.
- Misleading sites that don't explain your issue and are spam orientated (I know this is something Google is constantly working on, but the report system isn't exactly easy in most situations!)
I would prefer them to develop a more useful way of communicating with people that goes a step further than "submit something and we may see it/do something".
The percentage of low-quality and inaccurate results in the SERPs is so big. I don't believe placing a cap on big sites with a lot of (mostly unique) content will solve those issues for Google.
Very good point there. But at least they are doing something. Google has always been a bit of a mysterious figure to most people. There's no real way of contacting them with proper results at the moment.
They rely a little too much on automation.
This is really interesting article - good work Rand!
I think the indexation cap is a good idea - especially for users, as hopefully this will bring back even better quality results and finally start to remove more and more spam.
But...
One thing I have noticed again is the fact that Google is inserting breadcrumb links again (I've just googled "renault" and the second entry down has further links next to the url).
So my point is that even though websites may start to lose content, website owners will start to gain further links in the SERP, which has got to be a plus!
Great review of a question that we're definitely hearing a lot. One clarification - do you think this "cap" is a literal, fixed cap (X pages or X% of pages) or just an artifact of what I'll call "spider fatigue"? I've always thought of it as just a matter of there not being enough link-juice to go around, but it sounds like the Google rep suggested there is a literal, programmatic cap?
Isn't there a way to tell these apart? In one case, Googlebot has a "depth cap" that may be different for different spider entry pages. In the other case, there is a domain-wide cap.
If we do what Rand says and spread link love directly from deep entry pages that may have few internal links from them (e.g., a very popular article from a couple of years ago), we would expect to see the new "neighbor" pages getting indexed, and the total pages indexed would increase. However, if it is a domain-wide cap, we'd expect this strategy to result in the new "neighbor" pages getting indexed at the expense of some other pages somewhere, with the total number staying the same.
In other words, reorganizing your internal links would not improve TrustRank or any of the factors that might boost your fixed cap, but it would allow you to get the most out of a spider that had a "depth cap".
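One rough way to run that experiment: after adding internal links from a popular deep page to a set of "neighbor" URLs, compare Googlebot's crawl activity on that set against an untouched control set of similar pages, straight from the server logs. If the treated pages gain crawls and indexation while the control pages lose them and the total stays flat, a domain-wide cap looks more plausible than a simple depth cap. The URL lists, the log path, and the combined-log assumption in this Python sketch are all hypothetical.

```python
# Compare Googlebot crawl activity on newly internally-linked "treated" pages
# vs. an untouched "control" set, using a combined-format access log.
# URL lists and log path are hypothetical placeholders.

from collections import Counter

TREATED = {"/articles/old-popular-post/", "/widgets/blue-widget-42/"}
CONTROL = {"/widgets/red-widget-17/", "/articles/another-deep-page/"}

def crawl_counts(log_path="access.log"):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "Googlebot" not in line:
                continue
            for url in TREATED | CONTROL:
                if f'"GET {url}' in line:
                    counts[url] += 1
    return counts

if __name__ == "__main__":
    counts = crawl_counts()
    treated_total = sum(counts[u] for u in TREATED)
    control_total = sum(counts[u] for u in CONTROL)
    print("Googlebot hits - treated:", treated_total, "control:", control_total)
```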
Great article, and this is a subject that many of us have thought about a lot over the years. I used to have an e-commerce site with 15,000 products, and even though I worked very hard on the internal and external linking, it still took 5 years to get anywhere near all the pages indexed.
The problem with branded products is that the landing pages will almost always be seen as duplicate content, and only by having a massively popular domain could you get around this.
For me the focus will always now be on fresh content, which means I advise my customers to describe their products themselves instead of copying the standard descriptions in as text. It's a nightmare and hard work but because of the competition now, it's a necessary evil.
Really good post and interesting read, but enough to start me worrying. It looks like the over-commercialisation of the web and search engine results is really going to either push out the little guy who maintains a genuinely interesting website, or leave him findable only through a selection of vertical search options.
One of the best points in the post is to watch the ratio of pages generated vs inbound links those pages attract. You simply can't hire rubbish writers for pennies to create crappy content.
Are the limits different depending on the domain? How will Google differentiate a site with hundreds of writers churning out hundreds of pages of quality content a day versus an aggregator? Would they possibly use the time the content was originally crawled, MozTrust, or other metrics?
Hi Rand,
Thanks for a wonderful insight on such a critical topic. But what I don't understand is why Google wants to have a cap on a newly built site. I have a site, https://www.ekhichdi.com, that has some 300-odd pages, but still just 100 pages are indexed. Meanwhile, some old pages which have been removed are still in the index.
Also, just yesterday Google indexed one of the pages that was created yesterday itself, but it has not indexed pages that are 20 days old.
This was great! I especially loved this part "...soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing...". :-)
Thanks for the post Rand, was looking at one of my sites and noticed a significant drop in pages in the index and was like...WTF? This actually went undetected longer than normal, as I have been enjoying an increase in traffic on my Tier 1 keywords. That said, I want my long tails back. I pretty much figured I needed to add some value to those pages by way of content and link juice. This post confirms that.
BTW - This is one of the best lines you've ever written (regarding the email back from Google): "This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled..."
Freakin' Hilarious! See you in AZ in Jan.
Thanks for the update on this. I wonder if Google, or other search engines, will increase the cap after a couple more years. Because if the trend continues, there will be billions more webpages all over the Internet, and that could put quite a load on their engines.
IF YOU LINKBUILD BY SYNDICATION, read this please
Say you send out articles for republication on another site, with links back to the original. This is a fantastic way of link building, but what happens when you decide to time-limit your articles, marking them for deletion after a year? You'd be setting yourself up to erase all your hard work by having broken inbound links.
Fix this by linking to a major category of your site, like "breaking news," rather than the original article itself, but make sure your articles are indexed first (you'll have to figure out a way to do that).
Or, if someone has created a plugin that deletes articles after a year, then places a 301 redirect on that page to a major category of your site, that would be awesome. In fact, that's a damn good idea for a CMS plugin, not meaning to toot my own horn.
Hope this helps,
Chris
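For what it's worth, here's a minimal sketch of the expiry-redirect idea Chris describes, written as a tiny Flask handler rather than a real CMS plugin. The category URL and the published-date lookup are hypothetical stand-ins for whatever your CMS actually stores.

```python
# Articles older than a year return a 301 to a category page instead of a 404,
# preserving inbound link value from syndicated copies. Illustrative only:
# the lookup table, lifetime, and category URL are assumptions.

from datetime import datetime, timedelta
from flask import Flask, redirect, abort

app = Flask(__name__)

ARTICLE_LIFETIME = timedelta(days=365)
CATEGORY_URL = "/breaking-news/"                          # hypothetical target
published_dates = {"old-story": datetime(2008, 10, 1)}    # hypothetical lookup

@app.route("/articles/<slug>/")
def article(slug):
    published = published_dates.get(slug)
    if published is None:
        abort(404)
    if datetime.utcnow() - published > ARTICLE_LIFETIME:
        # Expired: send crawlers and old inbound links to the category page.
        return redirect(CATEGORY_URL, code=301)
    return f"Article body for {slug}"
```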
While I can sometimes see the benefit of this, in my opinion, it's potentially more dangerous than the problem it seeks to fix. If you get those links pointing to other pages, rather than the original, there's a much higher chance your articles will be considered duplicates than if the copies all point to your originals. I think I'd recommend not deprecating old content without 301'ing it to the new version/location instead.
Bummer I'm late to comment here as it's a great post.
If it's an indexation cap, I guess meta noindex/follow is a good way to help resolve this?
Or is it a 'crawl cap', in which case robots.txt blocking of pages might be better?
Some of the more fancy javascript created links are starting to look more and more attractive!
As a contributing writer for Associated Content, this has been a huge issue on the site. Because we earn pageview bonus money from content written, it is vital that our unique content is indexed regularly to maintain search position. Unfortunately, many pages go unindexed. For this reason, I have found other ways to try and promote my content on a regular basis with link building directly to my content pages. Over time, the cost and time associated with this effort depletes any earnings from pageview revenue. Catch-22, to me.
Great post Rand (good to meet you at Pubcon btw :) )
Do you think the index cap is one of the reasons behind the development of Caffeine?
Reuben
Thanks for taking the time to write this. It's made me stress out a little bit, but in an enriching way.
My main concern is that my upcoming website is going to be quite static on the root domain, which is the highest priority in the sitemap and has the CTAs within. But part of my site is also a blog area, which is where I'm hoping to drive some content to my site from. With Google placing the indexing cap on my site, am I at a significant level of risk when my blog is getting more visitors than the index of my site? If so, then what do you suggest?
For a while I have been looking for the answer to how many pages (as a percentage) Google indexes for a site. I have been running a blog for a year now and I noticed only approx 40% of the total pages are indexed. It alarmed me and I wanted to explore what exactly is wrong. After going through several sources, and finally this post, I can confirm that Google certainly has an indexation cap in place for websites with fewer backlinks. I observed several other sites with low-quality content that had more pages indexed in Google, and found only that they had a good number of backlinks, mostly generated through blogrolls or by distributing free WordPress themes with a link to their blog. Anyway, thanks for the detailed information I was looking for.
Excellent post.
Here I want to know about alternatives to directory submissions - can you help me with this?
thanks
Think SEOmoz could create a simple tool that is a site index checker - helping us easily monitor our index size over time?
Like the Rank Checker but for Index Size.
A fantastic post as usual.
I would imagine that the main factor has got to be internal duplicate content, which I see all the time on clients’ websites.
I think this is a positive step in many ways; fixing the dupe content and improving the site, rather than falling back on just adding a sitemap, will only improve the web, after all.
Mechanical Turk just blew my mind. I can't believe this has been around since 2005 and this is the first I have heard of it.
Artificial, artificial intelligence? I still can't figure out its purpose, but that's not new with the internet.
Nice post Rand...I could really identify with the adwords exec conversation :)
Good post Rand. I'm flattered that 250 people felt kind enough to blog about my post (what? did I miss the point?).
Seriously though, I think this is the single biggest argument for still using nofollow and general link sculpting on a site - reduce crawl wastage and get more of your money pages in the main index.
Does a supplemental index still exist? I'd like to say yes but would love to hear others thoughts on that.
So why not use the meta robots noindex,follow rule? The PageRank will keep flowing, you don't have the PR "evaporation effect" (probably), and you do exactly what Google wants: you endorse an indexation cap, only now you are in charge of the pages indexed.
Yeah - we've seen a specific example where a large site removed all their nofollows thinking that Google was hurting them for it, then lost 35-40% of their search traffic from the tail due to indexation problems. They put back the nofollows, waited a couple weeks, and things were back to normal. I'd say there's still strong evidence that while PageRank may "leak" nofollows can help with controlling the indexation on a site.
Rand once explained it well: 100*100*100 = 1M, and #@(% all the other tags and navigation - that's one of the best pieces of advice ever. Nofollow maybe "drains" link juice, but it seems it's more important for Google to determine the relative PageRank of pages on your website.
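The 100*100*100 point is easy to sanity-check. Here's a tiny, purely illustrative Python snippet showing how many unique pages a flat-ish architecture can expose within a few clicks of the home page, assuming a given number of real content links per page.

```python
# Upper bound on unique pages reachable within `depth` clicks, assuming each
# page devotes `links_per_page` links to real content. Numbers are illustrative.

def reachable_pages(links_per_page=100, depth=3):
    return sum(links_per_page ** d for d in range(1, depth + 1))

if __name__ == "__main__":
    print(reachable_pages())         # 100 + 10,000 + 1,000,000 = 1,010,100
    print(reachable_pages(50, 3))    # a thinner template reaches far fewer pages
```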
I for one can't wait to see what this does to two of my spammy competitors who seem to have untold thousands of pages in order to reap as much long tail as possible.
Excellent post Rand! This has been a topic of debate in our offices for a while now, nice work
Nice post. Good topic too. Wonder if it would affect a 'small' site of 500+ pages. I ask because a site I worked on recently had been revamped, but the indexing was atrocious. This is with 9+ years of being active.
Also, I read somewhere recently that nofollows don't actually work. Any ideas on this?
Cheers,
Lara
Sure thing that Google won't like to keep too many pages around, since in the end it just adds to their spending. Say now you want to play nice and relieve Google a little bit by having fewer pages cached. Anyone have an idea how one could do that?
I currently have 6 million pages in the index, while my sitemap lists only 300k links, and I am really wondering what else Google has stored (we have a lot more pages, but they are basically combinations of filters and hence "no-index").
Nice post Rand
Google's crawl resource is an ongoing challenge. Those machines are running hot!
Excellent post. It does make things a little difficult for WordPress blogs, but I like the overarching thought (once again) that high-quality content is the answer.
Great post - inspired me to register and ask a question...
If I have a site with about 100,000 "quality pages" and over 2 million "poor quality" pages primarily for SEO, am I shooting myself in the foot? Does this article imply that I would be better off deleting the 2 million SEO pages (which bring in about 1k-3k visits per day) to boost the quality of the 100,000 quality pages? And would the net gain be better if I made that decision? Thanks for any thoughts.
I'm not sure I'd phrase it exactly that way - the 100K pages, if they're getting indexed and aren't at risk (they have external links and other positive signals), may not be harmed by the other 2 million. But, to get those 2 million consistently in and earning traffic, you probably need to think about the items above.
Hmm...great piece, Rand.
This quote tho --
"...We've talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It's likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index..."
-- will make me rethink my own opinions of PR...exceptional catch here!!!!
:-)
Jim
"...We've talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It's likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index...
-- will make me rethink my own opinions of PR...exceptional catch here!!!"
Agree with you JVRudnick
While there are several ways websites can mitigate this I think decisions like these on Google's part are a turn for the worse.
I understand the technical reasons to make such decisions (db storage, processors, speed to index, etc...) but when your business is driven by relevancy, as Google's is, removing pages to make life easier for them means they are/may be removing high-relevancy pages for a lot of searches. This seems to be an instance when Google serves themselves and not their customers. I understand why they make these decisions, but I think these small issues are the same type of issues that led Yahoo to its decline.
I think Google assigns an amount of time to spend on a website based on the rank of that website. Then it will crawl pages until it reaches this time limit, so HTML download speed and ranking are important for crawling, and for indexation as well.
Sorry, but I need help.
I have a website, https://www.iexplorevietnam.com - would any of you kindly check for me if it is working well? I built it myself but don't really know how to make it friendly with the Google search engine.
Thank you very much if anyone could check it for me.
HELP!!!!
- Casey Removed Link
Good thing I only have 999,999 pages! Honestly, why would any site be that big?