A few days after new content shows up in Google, it will sometimes flicker out of the SERPs for a few hours. Apparently, this is common knowledge to some SEOs. It is not common knowledge to programmers like me, and I nearly made a tin foil hat in preparation for the googlicopters when I learned that the project I'd worked on for months (Linkscape) had disappeared from all SERPs on the Friday evening of the week it launched. Fortunately, Rand responded to the internal email thread and, just as he predicted, 12 hours later Linkscape was back in Google's SERPs like nothing had happened.
This raised the question, "Why would the engines drop new content out of the search results for a few hours after it has been in the results for a few days?" I don't know, but let me make an educated guess - sometimes there is a brief gap between pages falling out of a smaller (but quicker-to-build) index and a larger (but slower-to-build) index finishing its rebuild with those pages in it.
Not having worked at Google, I have no solid evidence that they have multiple indices, but let me make the case that they probably do. Linkscape currently takes over a month to go from crawling a page to that page appearing in the results. There is some low-hanging fruit to reduce this to more like a couple of weeks, but for the foreseeable future we aren't going to have turnaround times on the order of hours like the engines do, because our index is large enough that it simply takes a lot of computers and a lot of hours to compute. So... how do the engines get around this issue? They could make their indices support random inserts, but that would make them more complex and less efficient. The other option is to have two indices: one that is small and quick to update, and another that is large and slow to update. The small index would hold the difference between what has been crawled and what is in the big index. At query time, they would then need to check both. Of course, they could have more than two sizes of indices, but that doesn't affect the basic point that Google presumably has more than one.
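To make that concrete, here is a minimal sketch of what "check both at query time" might look like. This is entirely hypothetical - the index class, the scoring, and the documents are all made up for illustration, and nothing here claims to be Google's actual design:

```python
# Hypothetical sketch: serve queries from a big, slowly rebuilt index plus a
# small, frequently rebuilt "fresh" index, and merge the results.

class Index:
    def __init__(self, docs):
        # docs maps a query term to a list of (url, score) pairs
        self.docs = docs

    def search(self, term):
        return self.docs.get(term, [])

big_index = Index({"linkscape": [("seomoz.org/linkscape", 0.9)]})       # rebuilt over weeks
small_index = Index({"linkscape": [("blog.example.com/launch", 0.7)]})  # rebuilt in hours

def search(term):
    # Consult both indices and merge, de-duplicating by URL.
    merged = {}
    for url, score in big_index.search(term) + small_index.search(term):
        merged[url] = max(merged.get(url, 0.0), score)
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)

print(search("linkscape"))
```

The point is only that the freshest pages live in whichever index could be rebuilt most recently, so a query has to consult both.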
Google could remove a page from the small index only after it is in the big index, but then it would be in both indices for a while until the small index was rebuilt. This overlap means the small index is larger than necessary, so it can't rebuild as quickly as possible, and so won't be as fresh as possible. So perhaps they try to time it perfectly so there isn't any overlap or any gap. The problem is that as they crawl faster, grow their indices, add complexity to their indexing, or let the intern check in his summer project, it is easy for a small gap to form. So maybe it is just hard to ensure there is never any gap unless one is willing to waste resources by letting the indices overlap.
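As a toy illustration of why the timing is delicate (again, purely hypothetical; the URL, the hours, and the handover order are invented), imagine the new small index, which no longer contains our page, goes live before the new big index that does contain it:

```python
# Toy model of an index handover gap. Everything here is made up.
page = "seomoz.org/linkscape"

# The page is crawled; it lands in the small index first.
small_index = {page}          # quick to rebuild, holds what the big index lacks
big_index = set()             # slow to rebuild

def in_results(url):
    return url in small_index or url in big_index

# Hour 0: the new small index goes live. It was built against the *upcoming*
# big index, so it drops the page on the assumption the big index now has it.
small_index = set()

# Hours 0-12: the new big index is still being pushed out, so queries see neither.
assert not in_results(page)   # the page has vanished from the SERPs

# Hour 12: the new big index (which does contain the page) finally goes live.
big_index = {page}
assert in_results(page)       # ...and it is back, as if nothing happened
```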
Chas (the developer who sits next to me) manages some indices with a large+small model that, for the record, never has gaps. He contributes the fact that his large index starts rebuilding at midnight on Friday, because load is lighter on the weekend. However, his computers are set to GMT, which means it starts at 5PM PST on Friday. Well, it was a bit after 5PM on a Friday when Jane first noticed Linkscape had dropped from Google's SERPs (I received her email at 5:28PM). Google has fewer CPU constraints than Chas, but they do have bandwidth constraints, and bandwidth is exactly what's needed to push new indices out to lots of computers.
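For what it's worth, the timezone arithmetic checks out. A quick check (the date below is just an arbitrary Friday night, with "midnight on Friday" read as the start of Saturday GMT; in October the Pacific zone is on daylight time, so the offset is seven hours):

```python
# Quick check of the handover time (zoneinfo needs Python 3.9+).
# The specific date is arbitrary; any Friday night works.
from datetime import datetime
from zoneinfo import ZoneInfo

rebuild_start = datetime(2008, 10, 4, 0, 0, tzinfo=ZoneInfo("UTC"))  # Saturday 00:00 GMT
print(rebuild_start.astimezone(ZoneInfo("America/Los_Angeles")))
# 2008-10-03 17:00:00-07:00 -> Friday 5PM Pacific (daylight time in October)
```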
So the theory is that Google had two indices that were supposed to go live in the first seconds of the weekend GMT. First was the new large index that added our page. Second was the new small index that dropped our page. Only the small index was on time.
Or, at least, that is the best theory I can come up with. What do you guys think?
p.s. from Rand - This post is Ben Hendrickson's first on SEOmoz. He's been with us nearly a year, working on Linkscape, and before that with Microsoft & his own technology startup project. I'm thrilled to have him contributing to the blog. Hopefully, he'll get a photo up sometime soon :-)
I'll throw in a few ideas.
1.) There is plenty of solid evidence that Google uses multiple indices. About a year ago, you couldn't go anywhere without hearing (from SEOs and from Google) about the Supplemental Index.
2.) Since Google has many (redundant) data centers around the world, they have the ability to reroute traffic away from a particular data center as it is being updated. There should never be a gap of time during which the indexed data for a given URL is completely unavailable.
3.) After Google did their recent PageRank update, there was a brief period where the Toolbar PageRank values went back to how they were before the update. Then the update "reoccurred." In other words, this phenomenon is not limited to the index, but may also include other sets of data (such as PageRank/the link graph).
BTW... my ideas don't all point toward one particular theory. I have multiple personalities, and I gave them all a chance to voice their opinions.
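To put point 2 in code - a completely made-up sketch, not Google's actual setup, with invented data center names and statuses - a front end could simply skip any data center that is mid-update, so users never hit a half-built index:

```python
# Hypothetical sketch: route queries only to data centers that are not
# currently being updated. Names and statuses are invented.
import itertools

datacenters = {
    "dc-east": {"updating": False},
    "dc-west": {"updating": True},   # mid-rebuild, temporarily out of rotation
    "dc-eu":   {"updating": False},
}

available = itertools.cycle(
    [name for name, state in datacenters.items() if not state["updating"]]
)

def route_query(query):
    # Round-robin across whatever is currently serving.
    return f"sending {query!r} to {next(available)}"

print(route_query("linkscape"))
print(route_query("linkscape"))
```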
I was talking to Tedster about this recently, and these links might be useful for further research:
Google File System: https://labs.google.com/papers/gfs.html
There's certainly more than one index, and one of the partitions seems to hold the "freshest results" index, which is migrated to the main index as soon as it can be done.
Two patents come to mind as relevant:
Interleaving Search Results:
https://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&p=1&f=G&l=50&d=PG01&S1=20080140647.PGNR.&OS=dn/20080140647&RS=DN/20080140647
Selectively Searching Partitions of a Database:
https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=7,254,580.PN.&OS=pn/7,254,580&RS=PN/7,254,580
...And I kind of hoped there was a conspiracy where Google took it down to bait the SEOers into identifying themselves by making panicked changes ;-)
An excellent theory, Ben - this makes sense to my non-programmer eyes. Certainly the "fresh page appears then disappears" thing is something I've seen time and time again.
RapidSpammerSEO
Dude, seriously?
Kind of agree with Paul Pedersen above - I think it was originally a Google conspiracy to get rid of SEOs by provoking heart attacks (when you discover your new website, briefly ranking no. 3, has disappeared without a trace).
Since that doesn't scare anybody anymore, I'm terrified of what they'll think up next...
I have noticed this too for quite some time, but did not have an explanation for it. Thanks bhen - with 2 or more indices it makes sense.
This used to drive me nuts; I remember thinking that I was teetering on the edge of a higher- and a lower-ranked database.
But I think you're right about being moved to a slower-to-build index.
Great post. I did notice this a while ago but didn't think about it or look into it a second time. Just thought Google was dancing again. Maybe preparing for the "Dancing with the Celebrity" show. :)
Um...This explains a lot...I appreciate the info. :)
I was sure Google was trippin'. I would write an article, build some links, and within a few hours I would see it ranked on page 1. The next day when I looked for the content, it would be gone. The time it was missing varied from a few hours to a few days, but it typically returned with a good ranking. Thanks for the explanation.
I also see this kind of behavior on Yahoo's SERPs.
I change the titles of some of my pages, and they suddenly appear on the first page. It could be due to the factor you mention, or they may also be tracking CTR, bounce rate, time on page, and some other factors when deciding whether your page belongs among the top ones for that keyword.
Just my 2 cents
Great post, Ben (get yourself a profile picture already, I seem to be saying that a lot recently). Reading your writing is just like hearing you talk - isn't it funny how it's like that for some people. I have no evidence either way on what you describe, but it sounds plausible...
That theory makes sense to me.
Does this thought sound accurate to you: if there is always an active crawl for the small index, and some sort of delayed or slower crawl for the large index, then new links might 'freshen' a page that is already in the large index; but if those links are found within seconds of the switchover between the small and large index, might that cause an overlap of sorts and contribute to the disappearance of new content in the SERPs? Does that make sense? Or is this simply a crawl issue regardless of incoming links during the first few hours/days of the new content?
Lol - I have to give you my disclosure that I'm still trying to learn the SEO ropes and don't claim to know very much SEO, so if the above question is completely incoherent, that's my excuse. ;)
An interesting concept.
Since we all come at this from our own points of view, perspectives and experience I'm not sure we are going to come up with a consensus.
I see your idea of more than one index, with the lag time attributed to a gap in switching over.
I am inclined to go with a combination of Darren and Ben. Something more like multiple indices, set to serve regions that overlap as they rebuild on a timed schedule. Like waves going over the globe, built like an incoming tide: it comes onshore a little farther, then falls back as it goes offline, and an older version holds the place before the new wave comes in.
Data can be fed in and goes to a category, but then that index goes down to rebuild. While it is down, another index overlaps the service area, but it is fed on a different cycle, so it is not as up to date yet. On the next cycle through, it will have caught up with the one it replaces, but not with the newest data that has been scraped.
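A rough sketch of what I mean - the regions, cycle length, and versions below are all made up, just to show how an older copy keeps serving while a newer one rebuilds:

```python
# Rough sketch of the "waves" idea: several indices on staggered rebuild
# cycles, so an older copy keeps serving while a newer one rebuilds.
CYCLE = 6  # hours between one rebuild slot and the next for a given index

indices = [
    {"name": "index-A", "offset": 0, "version": 0},
    {"name": "index-B", "offset": 3, "version": 0},  # staggered by half a cycle
]

def serving(hour):
    """Whichever indices are not in their rebuild slot at this hour."""
    return [idx for idx in indices if (hour - idx["offset"]) % CYCLE != 0]

for hour in range(12):
    for idx in indices:
        # An index entering its rebuild slot is taken out of service for that
        # hour; it returns with a bumped (fresher) version afterwards.
        if (hour - idx["offset"]) % CYCLE == 0:
            idx["version"] += 1
    live = ", ".join(f"{i['name']} v{i['version']}" for i in serving(hour))
    print(f"hour {hour:2d}: serving {live}")
```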
That's my theory. I can't prove it.
BTW... I don't have multiple personalities, but I do hear the voices in Darren's head. I can't say any more. They might hear me.
One more thing. Great first post Ben.
1.) Like waves around the globe? I'm lost. 2.) You're an idiot, #1. 3.) Both of you shut up. This guy can hear us.
You're not the boss of me.
Don't make me stop this car.
I meant to say "wives around the glob", Duh!
Do I have to use the fat crayons?
Mom, Darren is eating the paste again.
I'm not worried about what he's eating; it's what he's smoking that's the issue... :-)
I smoke paste.
and eat weed? Darren you have it all backwards! ;-)
I have a website that shows up in the top 5 one day, then it will go to like 13-15 the next day, and then back again. This has been happening for about a month now and I have no explanation why. Maybe this is related? Maybe not... it doesn't completely disappear.
Excellent post, excellent further thoughts in the comments - a good read!
I would agree. Today I searched for a term that showed 2.5 million results, but the odd thing was it showed only ONE result, and then right after it there was the gooooooogle to view the next pages. A few minutes later it was back to normal. Now that could have been a separate bug, but it could also have been along the lines of what you are saying.
When big changes like that are observed, you probably flipped to another datacentre for some of the searches.
You can see the IP of which datacentre your results are coming from by installing the ShowIP extension for Mozilla Firefox.
Huh? The "freshness factor" has been in Google for some time now. Reasoning unknown really. Maybe to help boost good, accurate content to the top. Not sure, but it is there.
Now, after 2 weeks or so, if that doc/page has not achieved the necessary links to warrant its ranking, it simply falls back to its normal position per the G algo.
This should not be new to anyone dealing with "rank watching".
"Now, after 2 weeks or so, if that doc/page has not achieved the necessary links to warrant its ranking, it simply falls back to its normal position per the G algo." That isn't what he described.