Day 4. I can hear the walls. They all sound like Matt Cutts. I know every language family from around the world. My keyboard mocks me. Sweet merciful weekend, will you never come? Can't stop...can't stop transcribing...

Matt proudly displays his werewolf vs. unicorn shirt, which depicts the silhouettes of a werewolf and a unicorn mid-battle in front of a full moon. Underneath the image is the caption “It’s on now”. Matt says, “We’ve got the unicorn and the werewolf. Mortal enemies since the beginning of time, and it’s on now!”

1. For all the data center watchers out there, should all the results across one Class C IP address block be the same most of the time, except when you’re pushing data, or are they supposed to be different because you’re trying different things on them? And would it make more sense to use the direct IP addresses when reporting issues or problems, or the 41 gfe data center names?

Let’s talk about data centers. Back in the days of the dinosaurs, (here Matt does either an impression of a dinosaur, complete with screeching and clawing, or he just watched some Japanese anime and had an epileptic seizure), when the dinosaurs roamed the earth, you could actually run a search engine off of one computer. Those days are long since gone, unless you have a really, really powerful computer or something very, very small to search over, or you have a Google Search Appliance, I guess.

These days, you pretty much have to have a data center. In the early days of data centers you could just do some sort of "round robin" trick with DNS so that you would always hit different data centers. Google does some very smart stuff with load balancing, some very interesting techniques to try to make sure that different data centers are able to perform well.
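
As a rough illustration of the "round robin" trick Matt mentions, here is a minimal Python sketch. It is not how Google's DNS or load balancing works. The addresses are made-up documentation-range IPs and the `round_robin` helper is hypothetical. It simply hands out the next address in a fixed rotation on every lookup.

```python
from itertools import cycle

# Hypothetical data center addresses, invented for illustration only.
DATA_CENTERS = ["192.0.2.10", "192.0.2.20", "192.0.2.30"]


def round_robin(addresses):
    """Return an endless iterator that rotates through the addresses in order."""
    return cycle(addresses)


resolver = round_robin(DATA_CENTERS)

# Five consecutive "lookups" walk the list and then wrap around to the start.
first_five = [next(resolver) for _ in range(5)]
print(first_five)
```

Each lookup lands on a different data center until the list wraps, which is why plain round robin spreads traffic evenly but knows nothing about how healthy or busy each data center is.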

So, your basic question was this: Should all things on the same Class C IP block be roughly the same? And yes, they should be roughly the same in that they’re typically the same data center, but not always. So, let me give you a couple examples. If one data center has to fail over, or if one data center is out of the rotation, then even if you’re going to one IP address, you can get bounced over to a different data center. Even if it looks like you’re consistently hitting the same data center, behind the scenes, underneath Google's load balancing, you could be hitting a different data center completely. Those situations are somewhat rare, but not that rare. So sometimes, when you see people having debates online on Webmaster World, Data Center Watch, stuff like that, they can actually be seeing different things even if they hit the same IP address.
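
The failover case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ROUTES` table, the data center names, and the `serving_data_center` function are hypothetical and say nothing about Google's actual load balancing. The point is only that the same IP address can be answered by a different data center once the preferred one is out of rotation.

```python
# Hypothetical routing table: one public IP, with data centers in order of preference.
ROUTES = {"192.0.2.10": ["dc-a", "dc-b", "dc-c"]}


def serving_data_center(ip, out_of_rotation):
    """Return the first data center for this IP that is still in rotation."""
    for data_center in ROUTES[ip]:
        if data_center not in out_of_rotation:
            return data_center
    raise RuntimeError("no data center available for " + ip)


# Normal operation: the preferred data center answers.
normal = serving_data_center("192.0.2.10", out_of_rotation=set())

# The preferred data center is pulled from rotation: same IP, different data center.
failed_over = serving_data_center("192.0.2.10", out_of_rotation={"dc-a"})

print(normal, failed_over)
```

Two people querying the same IP at different times could therefore be comparing results from two different data centers without knowing it.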

The other point I wanted to make, and I made this at PubCon Boston, was that the data centers often have a lot of different things. So, whenever there’s a new algorithm update or some other feature that we are trying out, we’ll often try it on one data center first to make sure the quality is what we expect it to be based on evaluation, stuff like that. The data centers do differ according to some very complex, intricate plans so we can try out different things at different data centers. Typically, on one Class C IP address you will usually hit the same data center, but that’s not guaranteed.

Also, at PubCon Boston I showed an example of the sorts of things that are going on in different data centers. It sort of shows how things are a lot more intricate now than they used to be. So, Google does a lot smarter scheduling, and it’s a lot harder for a random person to just look at a data center and reverse engineer or try to guess which way things are going, stuff like that.

As far as the IP address versus the gfe name (which I think exactly me and g1smd know about; nobody else has really bothered to talk about it that much, except maybe on Webmaster World), you can use either the IP address or the two-letter code of a data center because we’re able to map them both back. So, if you tell us one we can tell what the other one is, either way.

In general though, there are probably better ways to spend your time than watching data centers. I think it's a good use of your time to work on your content, and a good use of your time to look whenever there’s something major going on, like a PageRank update, if you really want to. But in general, there's enough stuff going on in different data centers that I would say it's probably not worth checking every single data center every single day. If you're trying to figure out "Okay, how am I going to do, or how have I been doing", it’s probably better to spend a little more time paying attention to your logs, and work backwards based off of that.

2. Is it possible to search for just home pages? I tried doing "-inurl:html" and "-inurl:htm", blah blah blah, PHP, ASP, but that doesn't filter out enough.

That's a really good suggestion, Peter. I hadn’t thought about that. FAST used to offer something like that, but I think all they did was look for a tilde in the URL. I will file that as a feature request, and see if people are willing to prioritize it where they might be able to offer that. My guess is it would be relatively low on the priority list because all the syntax you mentioned, subtracting off a bunch of extensions, would probably work pretty well.
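
If you already have a list of URLs in hand (from your own logs, say), you can filter for home pages directly instead of subtracting extensions one by one. The sketch below is only an illustration and not a Google search feature. It assumes "home page" means the root URL of a site, and the `is_home_page` helper and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_home_page(url):
    """Treat a URL as a home page when it has no path beyond "/" and no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


# Hypothetical URLs for illustration.
urls = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/about.html",
    "http://example.com/index.php",
    "http://example.com/?p=3",
]

home_pages = [url for url in urls if is_home_page(url)]
print(home_pages)
```

Checking the path catches pages with any extension, or none at all, which is where a list of "-inurl:" exclusions tends to leak.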

3. Clarification

Ah, I get to clarify something about strong vs. bold and emphasize vs. italic. There was a previous question where somebody asked whether it was better to use bold or strong, because bold is what everybody used in the old days when the dinosaurs roamed the earth, and strong is what the W3C recommends. At that time last night, I thought that we just barely, barely, barely, in epsilon, preferred bold over strong, and I said, "For the most part, don’t worry about it."

The nice thing is an engineer actually took me to the code where I could actually see it for myself, and Google does treat bold and strong with exactly the same weight. So, thank you for that, Paul, I really appreciate it. In addition, I checked, and he also found code that shows that em, as in emphasize, and italic are treated exactly the same as well. So, there you have it. Go forth and mark up like the W3C would like you to do. Do it, do it semantically well, and don’t worry so much about crufty old tags because Google will score it the same either way.

4. Will we see more kitty posts in the future?

I think we will; in fact, I tried to get my cats in on this shot and to sit still, but they were a little scared of the lights, so we’ll see if I can get them used to it.

5. What are Google SSD, Google Gas, Google RS2, Google Mobile Marketplace, Google Weaver, and other services discovered by Tony Ruscoe?

I think it was very clever of Tony to try to do a dictionary attack against our services check in, but I'm not going to talk about what those services are.

6. [What might be some of the topics] in the Duplicate Content Session of SES?

I gave a little bit of a preview in one of the other sessions, on video, but I think what we will basically talk about, Jerry will be there, a lot of other people will be there, and we will talk about shingling. What I will essentially say is that Google does a lot of duplicate detection from the crawl, all the way down to the very last millisecond, practically when a user sees things. We do stuff that's exact duplicate detection, and we do stuff that is near duplicate detection. We do a pretty good job all the way along the line of trying to beat out dupes and stuff like that.
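
Matt names shingling but does not describe how Google uses it, so the following is only the textbook version of the idea, not Google's implementation. Each document is broken into overlapping runs of words ("shingles"), and two documents are compared by how many shingles they share. The function names, the shingle width of 3, and the sample sentences are all assumptions chosen for illustration.

```python
def shingles(text, width=3):
    """Return the set of overlapping word runs of the given width."""
    words = text.lower().split()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def resemblance(a, b):
    """Jaccard similarity of two shingle sets: shared shingles over all shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_one = "the quick brown fox jumps over the lazy dog"
doc_two = "the quick brown fox jumps over the lazy cat"

# Identical text shares every shingle; a one-word change only disturbs
# the shingles that overlap the changed word.
exact_score = resemblance(shingles(doc_one), shingles(doc_one))
near_score = resemblance(shingles(doc_one), shingles(doc_two))
print(exact_score, near_score)
```

A score of 1.0 corresponds to an exact duplicate, and a high score below 1.0 to a near duplicate. Where to draw the line between "near duplicate" and "different" is a tuning decision that the talk does not cover.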

The best advice I’d give is to make sure that your pages that will have nearly the same content look as different as possible. If they really are truly different content, a lot of people worry about printable versions, or somebody else asked about a .doc Word file compared to an HTML file. Typically, you don't need to worry about that. If you have similar content on different domains, maybe in French and another version in English, you really don't need to worry about that.

Then again, if you do have the exact same content, maybe for a Canadian site and for a .com site, it’s probably just the sort of thing where we’ll detect whichever one looks better to us and just show that, but it wouldn't necessarily trigger any sort of penalty or anything. If you want to avoid it, you can try to make sure that your templates are very, very different. But, in general, if the content is quite similar it’s better just to let us show whichever representation we think is the best, anyway.

7. Does Google index or rank blog sites differently than regular websites?

Not really. Somebody else asked about links from .govs and .edus, and whether links from two-level-deep .govs and .edus, like .gov.pl, were worth the same as .gov, and the fact is we really don’t have much in the way to say, "Oh, this is a link from the ODP, or .gov or .edu, so give that some sort of special boost." It's just that those sites tend to have higher PageRank because more people link to them and reputable people link to them. So, blog sites, there’s not really any distinction, unless you go off the blog search, of course, and then it’s all constrained to blogs. In theory, we could rank them differently, but, for the most part, just the general search, the way it crawls out ends up working out okay.