Hi Mozzers, I've recently had to deal with several indexing problems that a few clients were experiencing. After digging deeper into the problems, I figured I'd write a post for Moz to share my experience so others don't have to spend as much time digging for answers to indexation problems. An indexation problem simply means that your site, or parts of it, are not getting added to the Google index, which means that nobody will ever find that content in the search results.

Identifying Crawling Problems

Start your investigation by simply typing site:yoursite.com into the Google search bar. Does the number of results returned correspond with the number of pages your site has, give or take? If there's a large gap between the number of results and the actual number of pages, there might be trouble in paradise. (Note: the number given by Google is a ballpark figure, not an exact amount.) You can use the SEO Quake plugin to extract a list of URLs that Google has indexed. (Kieran Daly made a short how-to list in the Q&A section on this.)
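
If you have exported that list of indexed URLs, a quick way to see the gap is to compare it against your sitemap. Here's a minimal sketch of that comparison in Python. The function names are my own, and it assumes your sitemap follows the standard sitemaps.org format and that you've already got the indexed URLs as a plain list; treat it as a starting point rather than a finished tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap document (given as a string)."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_index(sitemap_xml, indexed_urls):
    """Return the sitemap URLs that do not appear in the list of indexed URLs.

    Trailing slashes are ignored on both sides, which is an assumption:
    adjust the normalisation if your site treats /page and /page/ differently.
    """
    indexed = {url.strip().rstrip("/") for url in indexed_urls}
    return sorted(
        url for url in sitemap_urls(sitemap_xml)
        if url.rstrip("/") not in indexed
    )
```

Feed `missing_from_index` the contents of your sitemap file and the exported URL list, and whatever comes back is the set of pages worth investigating first.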

The very first thing you should have a look at is your Google Search Console dashboard. Forget about all the other tools available for a second. If Google sees issues with your site, then those are the ones you'll want to address first. If there are issues, the dashboard will show you the error messages. See below for an example. I don't have any issues with my sites at the moment, so I had to find someone else's example screenshot. Thanks in advance, Neil :)

Crawl Errors

The 404 HTTP status code is most likely the one you'll see the most. It means that whatever page the link is pointing to cannot be found. Anything other than a status code of 200 (and perhaps a 301) usually means there's something wrong, and your site might not be working as intended for your visitors. A few great tools to check your server headers are URIvalet.com, the Screaming Frog SEO Spider, and Moz Pro's Site Crawl (take a free trial for the full experience).
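
If you'd rather spot-check a handful of URLs yourself, the rule of thumb above is easy to sketch in Python using only the standard library. The function names and the labels are my own invention, not anything the tools above produce, and the fetch uses a HEAD request without following redirects so that a 301 shows up as a 301. Some servers answer HEAD differently from GET, so treat the result as a hint.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so 3xx codes are reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for a URL (makes a real network request)."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Unfollowed redirects and 4xx/5xx responses arrive here.
        return err.code


def classify_status(code):
    """Label a status code using the rule of thumb: 200 is fine, 301 is probably fine."""
    if code == 200:
        return "ok"
    if code == 301:
        return "permanent redirect (probably fine)"
    if 300 <= code < 400:
        return "other redirect (worth a look)"
    if code == 404:
        return "not found"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unexpected"
```

Calling `classify_status(fetch_status(url))` for each URL in your list gives you a rough triage before you reach for a full crawler.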

Fixing Crawling Errors

Typically these kinds of issues are caused by one or more of the following reasons:

  1. Robots.txt - This text file, which sits in the root of your website's folder, communicates a certain number of guidelines to search engine crawlers. For instance, if your robots.txt file has these two lines in it: "User-agent: *" followed by "Disallow: /", it's basically telling every crawler on the web to take a hike and not crawl ANY of your site's content.
  2. .htaccess - This is an invisible file which also resides in your WWW or public_html folder. You can toggle visibility in most modern text editors and FTP clients. A badly configured .htaccess can do nasty stuff like cause infinite redirect loops, which will never let your site load.
  3. Meta tags - Make sure that the page(s) that aren't getting indexed don't have this meta tag in the source code: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  4. Sitemaps - Your sitemap isn't updating for some reason, and you keep feeding the old/broken one to Webmaster Tools. After you have addressed the issues that were pointed out to you in the Webmaster Tools dashboard, always generate a fresh sitemap and resubmit it.
  5. URL parameters - Within Webmaster Tools there's a section where you can set URL parameters, which tell Google what dynamic links you do not want to get indexed. However, this comes with a warning from Google: "Incorrectly configuring parameters can result in pages from your site being dropped from our index, so we don't recommend you use this tool unless necessary."
  6. You don't have enough PageRank - Matt Cutts revealed in an interview with Eric Enge that the number of pages Google crawls is roughly proportional to your PageRank.
  7. Connectivity or DNS issues - It might happen that for whatever reason Google's spiders cannot reach your server when they try to crawl. Perhaps your host is doing maintenance on their network, or you've just moved your site to a new home, in which case the DNS delegation can stuff up the crawlers' access.
  8. Inherited issues - You might have registered a domain which had a life before you. I've had a client who got a new domain (or so they thought) and did everything by the book. Wrote good content, nailed the on-page stuff, had a few nice incoming links, but Google refused to index them, even though it accepted their sitemap. After some investigating, it turned out that the domain was used several years before that, and part of a big linkspam farm. We had to file a reconsideration request with Google.
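
The first and third causes in the list above (robots.txt and meta tags) are easy to check with a few lines of Python from the standard library. This is only a sketch under a few assumptions: the function names are mine, "Googlebot" is used as an example user agent, and you pass in the robots.txt contents and the page's HTML as strings you've already downloaded. Python's robots.txt parser doesn't interpret every rule exactly the way Google does, so treat a "blocked" result as a prompt to look closer.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


class _RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page carries a robots meta tag that includes noindex."""
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

Run both checks against any page that refuses to show up in the index. If either returns True, you've likely found your culprit without needing to dig any further down the list.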

Some other obvious reasons that your site or pages might not get indexed are that they consist of scraped content, are involved with shady link farm tactics, or simply add zero value to the web in Google's opinion (think thin affiliate landing pages, for example).

Does anyone have anything to add to this post? I think I've covered most of the indexation problems, but there's always someone smarter in the room. (Especially here on Moz!)