Short post tonight as I'm just back from a short trip with Mystery Guest to celebrate our one year anniversary (which was awesome, BTW) and need to get caught up on lots of email.
Let's start with a quick quiz - which of the following statements is true?
- A) My pages are in my XML sitemaps file, so they must be getting crawled
- B) My pages have been crawled, so they must be in the index
- C) My pages are in the index, so they must be able to show for queries
- D) None of the above
If you guessed A, B, or C, congratulations: you're part of a large contingent of folks doing SEO who are (rightfully!) a little confused about how the engines actually handle crawling and indexing. I've created a quick graphic to help out:
The takeaways here aren't tremendous, but they can be valuable to help explain to SEO outsiders why pages may not be drawing traffic even though metrics like appearing in your XML sitemaps, showing in Google Blogsearch queries, or appearing to be crawled in Google Webmaster Tools suggest they should. If you want to determine whether a page (or set of pages) is actually included in the engines' main indices, there are only two definitive ways to know:
- Perform queries that show the page appearing in the results (without having to append &filter=0 to the URL string)
- Check your traffic logs to see if queries are actively sending the page traffic
This is why I love the metric of # of pages that received at least one visit from search engine X each month. If that number is trending in a positive direction, you can at least rest assured the engine is indexing (and holding onto) your pages.
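If you want to compute that metric yourself rather than read it out of an analytics report, here's a minimal sketch of the idea. It assumes standard Apache/Nginx "combined" format access logs; the file path, referrer list, and regex are illustrative assumptions you'd adapt to your own setup, not a finished tool:

```python
import re
from collections import defaultdict

# Assumptions: combined log format, a hypothetical log path, and a crude
# referrer test for the major engines -- adjust all of these to your site.
LOG_PATH = "access.log"
SEARCH_REFERRERS = ("google.", "bing.", "yahoo.")
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}

# ... [10/Oct/2009:13:55:36 -0700] "GET /path HTTP/1.1" 200 2326 "referrer" "user-agent"
LINE_RE = re.compile(
    r'\[\d{2}/(\w{3})/(\d{4}).*?\] "GET (\S+) HTTP/[\d.]+" \d+ \S+ "([^"]*)"')

pages_per_month = defaultdict(set)  # "YYYY-MM" -> set of landing pages

with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        month, year, path, referrer = match.groups()
        if any(engine in referrer for engine in SEARCH_REFERRERS):
            pages_per_month[f"{year}-{MONTHS[month]:02d}"].add(path)

for month, pages in sorted(pages_per_month.items()):
    print(month, len(pages), "pages received at least one search visit")
```

A month-over-month rise in that count is the positive trend I'm describing; a falling count is your early warning that pages are dropping out of the index.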
Comments are strongly encouraged on this topic (particularly since I didn't get to cover it in great detail). Thanks!
Indexing is a strange beast and can sometimes do very funny things. Around a year ago I was looking after a site when its number of indexed pages fell from around 100 to 2 (as shown by Webmaster Tools and site:www.mysite.com). The remaining pages continued to show a PageRank of 5.
The site had a strong (and genuinely natural) set of inbound links from educational institutions and venture capital backers. I checked the site for anything that might have been considered black hat, but it was totally clean; nothing out of the ordinary.
After around a month of me scratching my head, the site was reindexed, and it has remained indexed to this day. The important thing to emphasise here is that we didn't change anything, because there was nothing wrong to change.
Sometimes Google's index does strange things that I doubt you could truly explain unless you had access to Google's internal systems. Has anyone else experienced anything like this?
I have. And believe me, domain authority can only do so much, considering other factors that may affect a particular geographical location: culture, trends, behaviors, etc.
I've never seen this kind of thing happen before. I guess they must have done something on the server that caused indexed pages to drop from 100 to 2. Otherwise, even if you don't update for a long time, it won't change dramatically like this :D
Yes, I am handling a website right now, and the same thing is happening. It is not even ranking for its own URL, when it was a couple of weeks ago, and nothing was changed on the site. Strange...?!
Out of interest, has Google Webmaster Tools flagged anything? It didn't in my case.
Looks like a penalty. Unless you know you're already doing something bad, do a very thorough sweep and see if the site has been hacked and is now hosting malware or spam links.
I'm facing this too and am glad to know that I'm not the only one!
My site https://www.skill-guru.com did not show any keywords when I looked in Google Webmaster Tools.
But it was being mentioned in top search queries, so there were some keywords.
Everything was done by the book, but Google did not pick up any keywords. Strange!!
And now I see it has keywords.
Maybe it was a new site and Google takes some time to establish trust.
I've noticed that Google Webmaster Tools sometimes seems to lag behind what is happening in the search results. If the site is appearing, I'm sure you will start seeing the keywords in Webmaster Tools soon.
It would be interesting to know how long this process takes and if it varies site to site.
The same thing happened on my blog for the first time. On the 21st of Feb, my traffic suddenly dropped by half, and Google's indexed pages dropped from around 3,500 to 2,500. I don't know what happened; I didn't change anything on my blog...
After reading your post, I'm just praying my blog also gets indexed again after a month...
I have a 6 yr old site that was ranked #1-3 in Google for every keyword possible for years.
In November 2008 we were suddenly dropped to page 14 for all keys.
Above us were 13 pages of TOTALLY unrelated pages. I mean hamburger stands, discount tires, women's lingerie, etc. Totally unrelated. All PRs remained the same.
7 days later, we mysteriously popped back to #1 for all keys.
We changed nothing. We asked Google for nothing. Clearly they were messing with the algorithm again.
Great graphical view of indexation - does the job much better than just words. Do you find many clients monitoring these metrics (# pages crawled, # pages indexed & in which index, # pages ranking, # pages delivering traffic)?
Loving mozbot (and his crate - lol). Look forward to seeing more of him.
mozBot's name is Roger. He is awesome.
Wasn't Bing Bot invited to the party? I'd like to see more of BingBot in the future :)
P.S. Congrats Rand on the 1 year anniversary!
I have a level that is not listed on the Levels of Indexation (very cool, by the way).
I was recently contacted by a new client who had developed a website with an SEO firm. Going to Google, typing info:www.example.com, and clicking "find pages from site" showed around 100 listings (both pages on the website and PDFs), and Webmaster Tools showed 72 pages indexed for the same site. Based on my quick search and Webmaster Tools, indexation appeared fine, but you could take an exact sentence from the body copy, add quotes, search for it, and never find the website. After a complete review of the website I found several long sections of keyword stuffing, which I pulled out, and now the site ranks well and a direct pull of content easily finds it. So maybe that falls into section "F", but in my opinion the site was definitely penalized for keyword stuffing, resulting in this situation.
So maybe we need a 0: "Seen by Crawler, Indexed and Penalized".
I agree with the 0- = "Penalized".
I agree, the graph is missing a penalty level.
Nice.
I'd possibly add another layer to your chart, at the top, between the searcher and the result in Google's index, with a page appearing as a result in a cache (and possibly a multi-level cache) for a query term. Many queries are repeated regularly, and it's much less computationally expensive for a search engine to grab results from a cache of queries than perform a search of the index.
For an example, see the process described in "Method and system for query data caching and optimization in a search engine system" (US Patent # 7,467,131).
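To give a rough sense of the mechanics (this is just a generic LRU sketch of the idea, not the system described in the patent), a query-result cache can be as simple as a map from a normalized query string to its result list, with least-recently-used eviction:

```python
from collections import OrderedDict

class QueryResultCache:
    """Tiny LRU cache mapping a normalized query string to its result list."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, query, search_index):
        key = " ".join(query.lower().split())   # crude query normalization
        if key in self._cache:
            self._cache.move_to_end(key)        # mark as recently used
            return self._cache[key]             # cache hit: skip the index entirely
        results = search_index(key)             # cache miss: run the real search
        self._cache[key] = results
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)     # evict the least recently used query
        return results
```

The point for this discussion: a page can be fully indexed yet not appear for a popular query until the cached results for that query expire or get refreshed.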
And, Happy Anniversary to you and Mystery Guest.
Can anyone advise an easy way to measure Rand's "metric of # of pages that received at least one visit from search engine X each month" other than poring through and cross-referencing log files?
There is another comment here that mentions it:
In Google Analytics:
select >Traffic Sources
select > Search Engines
then in the 'dimension' drop-down select > Landing Page
Although shorter, Mozbot looks a lot more sophisticated than Googlebot.
Hi Rand,
This all makes sense and gives me some insight into a strange occurrence with my own blog.
I believed that a 'cached' site means it has been indexed; this is not necessarily the case, as my blog proved to me.
I simply changed my title and description meta tags, and within a few hours Google seemed to have indexed the page, yet the cache data was several days old....
Thanks for the insight
Christopher P West
Great article, thank you!
I'm lucky. For some reason new pages ALWAYS turn up on Google, not a clue why. External links, perhaps. Don't want to jinx it though, so my lips are sealed.
Randy, this is a neat presentation of the state of indexing. It was new to me. Nevertheless, in terms of weighting links (from Google's perspective, as I believe), domain age plays an important role too.
Anyway a curious question: does Google Bot have a name other than 'GoogleBot'? :)
So what happens when one has several social profiles on the net with backlinks? I've experienced that google apparently doesn't index them all even though they're all high PR. Could it be due to duplicate content? (My profiles are all pretty similar in their bio or About me text, albeit I vary the backlinks/anchor texts) But even if they don't get indexed I'm pretty sure they're being crawled so maybe the backlink juice gets counted anyway? Any input on this is appreciated.
After reading Edurank's comment, I guess it's true that a site:url search isn't really a foolproof way of finding out if something's indexed? Because sometimes I find a profile URL by doing a search for the address without the site: parameter, whereas with the site: parameter it doesn't show up. Could that mean the page is only in the secondary index?
Outside of having content in XML files blocked by robots exclusion protocols, why would such URLs not be crawled, or sit in a crawl queue?
Google Webmaster Tools will tell you all about the way Google crawls your site and will show crawl errors. That should help you figure out why you're not being crawled.
Great post. Really appreciate the representation posed by the graph.
Thanks!
I find the Google site command site:www.mysite.com to be very inaccurate. There are big swings in the number of pages being counted, and the peaks and valleys don't correlate at all to traffic increasing or decreasing.
Yahoo seems to be more consistent, but I'd rather get a steady read on how many pages are in the Google index. Any ideas on getting that data?
Indexing is pretty much the only metric I pay attention to these days. Good overview.
But it's far from this simple. There is lots of space between C and F. Things are complicated by the fact that Google never discards any indexed page. They hoard data and are loath to delete anything.
I'd say there is a colossal ghost index which Google references but which doesn't feed the serps. So technically nothing is ever discarded from Google's index; it is hidden from public view by a front-end filter (rather ineptly).
How come Mozbot is so much cuter than Googlebot? :)
This is driving me nuts. I have 38,000 pages on my site and Google has only indexed about 3,000.
My domain isn't a year old yet so I'm wondering if this is a sandbox issue. Other than that I've covered all the basics. I don't have a lot of inbound links but I have some high ranking ones.
The graphic in this article was useful, but I need some tips on how to troubleshoot the problem.
I definitely agree with all the commenters here, great work keep it up dude!!!...
<a href="https://www.kpicapital.com/">KPI</a><a href="https://www.kpicapital.com/">KPI Capital</a>
thanks for the info...<a href="https://www.kpicapital.com/">KPI</a><a href="https://www.kpicapital.com/">KPI Capital</a>
Would it be a fair assumption that once a page has achieved 1 or 2 points of PageRank, it's probably at the top of the hierarchy above, as in "permanent"?
great article - really useful
I checked my webmaster account two days ago and it had 50 "redirect errors"; today it has 140.
Can someone explain why it's giving me these errors?
The site is www.freedomaerospaceinc.com. It didn't have errors for the longest time, and now I guess the crawler is bringing back something not so happy.
Crawling and indexing of a site is also dependent on links (at least for Google, as far as I can remember), based on the changes they made after the Big Daddy infrastructure update, such as number of links, deep linking, etc.
I would guess that falls under the "authority of the domain" point. Then, as @rodney mentioned, Google does what it wants at times; it doesn't necessarily mean drastic changes are needed :)
Thanks for the great graphical explanation. I am doing some experiments right now with a website that has a PR6, creating some new pages and trying things out. This graphic will help my client understand the process, or at least how things should be working; they now understand that getting indexed and appearing in the SERPs are two different things.
I have had this happen too. I also learned to patiently wait and Google catches up with itself.
I think this post shows that what really matters is your search results.
Great image representation of how it works, thanks - it's clear and easy to understand. (Also Mozbot is amazingly cute!) Not sure if I get the lettering system though. School grades that go to F- and 0?
I like the graphics in your posts. I'm a very visual learner. I especially liked the crate for mozbot.
The crate is definitely the centerpiece of this artistic achievement. I wonder what is inside the crate? Perhaps the Ark of the Covenant, Kanye West, or even... cake?!
On a more serious note, I definitely agree with many of the commenters here. Getting indexed is tricky, especially for newer sites and blogs.
Great post Rand !
Any reason why there is A B C D F F- and not A B C D E F?
I imagine this stems from the way the grading system in the U.S. usually works. For whatever reason we don't use the letter "E" when giving grades.
Because E stands for "Enough", and in this case, it is not enough!
Although D stands for "Done"... hmm...
I'd like to see at which point in the graph a formerly well-indexed page fits in after returning from being 'sandboxed'.
Good to see this being acknowledged as so many so-called SEOs seem to ignore it. It's one of the reasons I never blindly add a Google sitemap to a site, only if there's a definite reason for doing so. Even then I'd rather solve the fundamental problem of why pages aren't being indexed through the navigation.
What I've long wanted to do is analyse how (or whether) this can explain a phenomenon I see quite often: sites whose rankings show a regular cycle of peaks and troughs. Usually there's about a 4-5 week cycle where a term will either drop out or see a sudden rise before returning to its approximate average result. I used to think this was some sort of data centre issue, where a set of data was returning because of a data centre being out of step with the rest, but now I suspect it's more to do with particular pages being on the edge of main or supplementary index status. However, there are so many potential variables that it's very difficult to isolate and study.
Would enjoy seeing more discussion of this.
Mozbot is cute!
I bet there's a secret weapon of mass indexation in that crate.
Very clever!
This is a good post for SEOs, but for those newbies trying to wrap their heads around this stuff, having 7 gradations of indexing is a little daunting.
Completely agree with the #of pages that received at least one visit from a search engine each month - really useful to have for large sites that are always growing.
Also agree with Bill Marshall - more discussion on indexing would be great!
Happy Anniversary & Kudos for your commitment to blog!
One line summary:
Link juice is the lifeline of crawl frequency, crawl depth & indexing merit. ( and oh may I add frequency of content change and content uniqueness..)
-AD
So I have an e-commerce client that is stuck in D mode "Stored in a specialized/Vertical Index Only". Any suggestions on how to additionally get them into the main index? We provide Google with a feed to Froogle and have connected the sitemap in the Webmaster tools.
Great, simple, easy-to-understand post! Thanks.
However, where and how would 301s and 302s fit into the list? Does a 302 carry much impact, given that it's a temporary redirect? And for 301s, will the crawler really treat it as a permanent redirect, or will it judge according to the 301 status it has?
Love the graphic!
Bot behaviour data supports what you've posted. I'm tracking bot activity on a number of sites and have noticed that although Google visits us several hundred times a day, behaviour differs by site: for our more established site (over 10 years old), any new pages we post get crawled and immediately indexed, while for the recently launched sites, although bot activity is present, indexing is another matter and only a small percentage of pages are kept in the index.
This type of indexing behaviour is really apparent in newer sites, where the number of indexed pages fluctuates on a daily basis.
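For anyone who wants to replicate this kind of tracking, here's a minimal sketch that counts Googlebot hits and distinct crawled URLs per day from standard combined-format access logs. The log path and regexes are assumptions to adapt to your own servers, and a production version should also verify the bot via reverse DNS, since anyone can fake the user-agent:

```python
import re
from collections import Counter
from datetime import datetime

LOG_PATH = "access.log"      # hypothetical path; Apache/Nginx "combined" format assumed
BOT_SIGNATURE = "Googlebot"  # substring match on the user-agent only (not verified)

DATE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4})")
URL_RE = re.compile(r'"(?:GET|HEAD) (\S+) ')
UA_RE = re.compile(r'"([^"]*)"\s*$')   # last quoted field in the line = user-agent

hits_per_day = Counter()
urls_per_day = {}

with open(LOG_PATH) as log:
    for line in log:
        ua = UA_RE.search(line)
        if not ua or BOT_SIGNATURE not in ua.group(1):
            continue
        date, url = DATE_RE.search(line), URL_RE.search(line)
        if not (date and url):
            continue
        day = datetime.strptime(date.group(1), "%d/%b/%Y").date()
        hits_per_day[day] += 1
        urls_per_day.setdefault(day, set()).add(url.group(1))

for day in sorted(hits_per_day):
    print(day, hits_per_day[day], "hits,", len(urls_per_day[day]), "distinct URLs")
```

Comparing the "distinct URLs crawled" line against the pages that actually show up in the index is exactly where the gap described above becomes visible.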
I sometimes forget how complicated the indexes are. Thanks for re-teaching me.
Does anyone know how to setup the "# of pages that received at least one visit from search engine X" metric in Google Analytics? I'm new to GA and SEO, so please pardon my ignorance.
Hi EndingPop
one way to see this is to go into a report,
select >Traffic Sources
select > Search Engines
then in the 'dimension' drop-down select > Landing Page
You'll get a report titled something like "Search sent 14,282 total visits via 571 landing pages", and of course detail on those visits.
If you're running any pay-per-click campaigns you'll also need to hit the 'Non-Paid' link to filter them out.
Michael
Thanks for your help, stream!
Has anyone tried to relate the # of pages that received at least one visit to the # of pages indexed by Google (using the site: operator)?
Not exactly, but depending on why you want to do that, this might help you anyway....
For a site that's always growing (or shrinking) in pages, # pages yielding organic traffic won't give you the whole picture as far as trending is concerned.
But you can take that metric and turn it into a KPI (key performance indicator) by relating it to the raw number of pages in your site to get % pages yielding organic traffic. Then you can watch the percentage trending to see if your ability to get pages indexed is getting better or worse which is probably a better measure of SEO performance.
Admittedly that's not directly related to the # pages indexed by Google (as you mentioned) but you may find it useful. What exactly was it you were trying to understand by relating those two things?
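To make the idea concrete, here's a tiny sketch with made-up, purely hypothetical numbers showing how the raw count can keep rising while the percentage (the actual KPI) falls:

```python
def pct_pages_with_organic_traffic(pages_with_search_visits, total_pages):
    """KPI: share of a site's pages that earned at least one organic visit."""
    return 100.0 * pages_with_search_visits / total_pages if total_pages else 0.0

# Hypothetical monthly figures: (month, pages with >=1 search visit, total pages live)
history = [("Jan", 480, 1200), ("Feb", 510, 1350), ("Mar", 530, 1500)]

for month, visited, total in history:
    print(f"{month}: {visited} pages drew organic traffic "
          f"({pct_pages_with_organic_traffic(visited, total):.1f}% of the site)")
```

In that example the raw count climbs from 480 to 530 pages, but the percentage drops from 40% to about 35%, which is the signal that indexation isn't keeping pace with publishing.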
I liked the post and the grading system. I don't know why everyone is making comments about the grading; I could tell what you meant.
I know this will knock me even farther from my 100 points, but why the thumbs down on this? My other post got 2 downs too, and I don't get it.
If there is an outlink, is the page indexed later or not? A page with outlinks (pointing to other sites) that gets indexed may have a higher rank, so an internal page without outlinks might be indexed faster.