It's been a big month for false positives and getting caught with spam, and I've never been one to break up a theme. Short post, but an important one that every dev team should be aware of.
The story starts with a smart SEOmoz member, Per Svanström, getting stumped when a perfectly legitimate, white-hat subdirectory with plenty of PageRank dropped out of Google's index:
You can see from the image that the single URL was dropped, but a site:birdstep.com/database query reveals that, in fact, all of those pages are out of the index. Time for some detective work.
Jane & I spent a few minutes trying to puzzle out whether bad links were pointing in or the pages were somehow cloaking or violating the TOS. As we were digging through the backlink profile, we saw that, naturally, the birdstep.com domain was linking to the subdirectory on nearly every page. When we viewed the source code of those pages (for example, the homepage - www.birdstep.com), we saw something strange. Below is the tail end of the source code for their top nav bar:
<li class="menuObject"><a href="https://www.birdstep.com/Corporate/"><img src="/images/menu/Corporate.gif" border="0" alt="Corporate" /></a></li>
<li class="menuObject"><a href="https://www.birdstep.com/Contact-us/"><img src="/images/menu/Contact_us_active.gif" border="0" alt="Contact us" /></a></li>
<li class="menuObject"><a href="https://www.birdstep.com/database/"><img src="/images/menu/Database.gif" border="0" alt="Database" /></a></li>
Looks fine, right? Just a regular menu serving up images as the clickable link. Only problem is...
Notice the navbar? See the missing link? That's where the "database" section should be linked to; only the image is missing. Apparently, it was just a design mistake, so they used a 1x1 pixel gif until they could get it fixed. There are plenty of other visible links in the content body of many pages over to the database section, but that top link in the navbar is invisible - technically violating Google's rules. Despite the fact that plenty of other sites and pages link to the database section legitimately, and Birdstep certainly has no reason or intention to hide that link (other than a miscalculation on pixel width), the whole subdirectory was removed from the index.
Luckily, we caught it, Birdstep has removed the link, and they'll hopefully have the subdirectory re-included in the near future. They also generously gave us permission to discuss the Q+A issue on the blog, which we very much appreciate. I think this serves as a wise warning to developers and designers everywhere - unintentional, white-hat spirited mistakes can be just as dangerous and have just as dire consequences as black hat manipulation. Watch your code!
One more point of interest - in searching around on this issue, I noticed that a Google search for https://www.birdstep.com/database/. (with the added period at the end) brought up this result:
I ran another query on a page I know was removed from the index, and it also yielded a result like the one above (unfortunately, I can't share that page publicly). It's possible that this might help diagnose future pages that are removed for bad behavior and exhibit similar symptoms - definitely not a bad query to have in your arsenal if it really does work consistently.
UPDATE: Looks like although this hidden nav element could be a problem, it wasn't actually the issue coming into play here. The answer was... capital letters and 404 pages being cloaked to Google (an excellent find from John Mueller). Basically, Birdstep's server was using user-agent detection to redirect Googlebot to a 404 error page (obviously not the intentional, "we're cloaking because we want to trick Google" kind, but the "oops, that was dumb" kind). The odd part is that Yahoo! and MSN/Live got it right (and there are plenty of links), but Googlebot was being treated differently.
We didn't notice this initially due to multiple problems - first, just switching your user agent to Googlebot in Firefox won't expose the issue. Neither will using search spider emulators like SEO-Browser. You need to actually telnet to Port 80 (as Matt Cutts notes in the comments). Second, you will see the page in Yahoo! and MSN (making it feel more like a penalty than a crawl issue). I seriously doubt they'll be banned for this - the intent to spam or deceive isn't there - but it's once again a fascinating detective story about the problems a site can have. Big thanks to Matt and to John for their help.
p.s. Removed the bottom part of the original post due to overwhelming feelings of sheepishness.
p.p.s. Dave Naylor has a tool that can help detect this sort of thing (though it wasn't originally intended for that use).
Hi Rand
You might want to try to access that page using a user-agent like "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)".
Not only is the page returning a 404 error page with result code 200, it also has a meta robots with "noindex" on it. I think fixing that would go a long way towards helping us to crawl, index and rank it appropriately :-)
Oh, that's where it went wrong.
The client changed their URL from the old /Database/ to /database/, but I guess this was misconfigured on their end as the 404 page.
The scary part is that the browser doesn't display the 404 but shows the correct page, so it's easy to think it is working.
I will most definitely use the user agent you linked to in the future, as that would have pinpointed this issue right away. Thank you for that.
John - greatly appreciate that! I had changed my user agent (to just GGbot), but that still redirected me properly, and using SEO-Browser even returned the page properly. I definitely need to imitate Google more closely when investigating sites (I had presumed, because Yahoo! and MSN had it, and it had received PageRank, that it must be a penalty - not a 404).
Thanks a ton!
Rand, this is not just a capitalization issue.
If you're going to do a blog post about an issue, please go to the trouble of doing a telnet to port 80 and giving the Googlebot user agent. That way, you've done as much as possible to get down to the metal (you're not coming from a Google IP, but everything else is the same). Plus you'd be able to see information that other tools (wget, browsers, SEO-Browser) don't provide, e.g. if there's a noindex meta directive or a 404 page in the body text.
My takeaway is that there's an opportunity to write a tool that gives this level of information, or to write a blog post about how to fetch a page as Googlebot using telnet or curl. That would help people more in my opinion.
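For example, a rough sketch with curl (-A sets the user agent and -D - dumps the raw response headers to stdout; curl won't follow redirects unless you add -L, so you see each hop's status line and Location header for yourself):

curl -D - -A "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)" https://www.birdstep.com/database/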
Agreed - that sounds like a very good emulation tool to build. I'll ask some folks here at SEOmoz to put it together :)
And yes - I agree that I should have investigated more thoroughly. Not to make excuses, but seeing it in Yahoo! and MSN (and through SEO-Browser / a simple user-agent change) threw me off. Thanks!
If the capitalization is not the issue, what is?
Is there an example someone can provide showing how to do this? It has been ages since I used telnet and I never used it for something like this.
Cheers,
@trontastic
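(For anyone else wondering, the gist is: telnet to the web server on port 80, type an HTTP request by hand, and the raw response comes back untouched. A rough sketch - note that the blank line is required to end the request:

telnet www.birdstep.com 80
GET /database/ HTTP/1.0
Host: www.birdstep.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)

There's a full transcript of exactly this exchange further down in the comments.)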
Whoever is in charge of birdstep.com, you continue to do suboptimal things. When a user tries to fetch https://www.birdstep.com/Database/ they see a normal page. When Google tries to fetch a page, we see a temporary redirect (302) to https://www.birdstep.com/errors/404.html?aspxerrorpath=/default_database.aspx
My take on this: you're shooting yourself in the foot by trying to be too smart. Not only are you putting yourself at higher risk of being removed for cloaking, you've effectively removed yourself from Google entirely already by redirecting to a 404 page.
My advice: in the next day or so, go over your webserver code and remove *absolutely everything* that is doing conditional redirects or serving based on the Googlebot user-agent or IP address. Once you've removed all that junk, your redirect problems will be apparent just by following the links on your own site.
In this case, Xenu Linksleuth may help get you started in finding the many errors in the internal navigation -- once you serve the same content to browsers and bots alike.
Bzzzt. Sorry Rand, not correct. softplus got it above. I got suspicious when I visited www.birdstep.com/database and immediately got sent to a different URL, www.birdstep.com/Database (note the uppercase 'D').
The document that is returned to Google is seriously horked (it claims to be a 404). So this is certainly a problem that birdstep.com created for themselves and can solve themselves.
Rand, would you mind correcting this title or updating the blog post?
I would agree that the 404 is horked, but the redirection happened after the initial post: the client renamed their catalogue from "database" back to the old setting, "Database" (from when the page was indexed and ranked very well), and their CMS automatically adds a 301 from the lowercase to the uppercase, as the system only knows the uppercase version.
I know this is not good and both URLs should of course be handled by the system, but it was added not by the client but by their CMS.
On the other hand, the de-indexing is not related to the invisible menu (if I read your answer correctly), so I guess the initial theory about the exclusion is no longer correct.
So I just want to confirm:
which happened first: the page dropping out of the index, or your client's URL path change?
They changed the URL in their CMS from the capital D to /database/, along with a lot of other recommended changes, and a while after that single pages started to vanish, until after about 2 weeks they were all gone.
I'm also a bit embarrassed that I managed to turn a very interesting post into an error-hunting session on a single page, which of course was not my intention, so I will continue my investigations on this matter elsewhere, out of respect for the blog and all you mozzers.
"On the other hand, the de-index is not related to the invisible menu (if I read your answer correctly) and then I guess the initial thought about the exluction is not correct anymore."
Macaper, it has nothing to do with the menu. It has to do with the site cloaking to Google, but then sending Google to a 404 error page. By the way, that's completely aside from the upper/lowercase database/Database issue.
Hard to keep up with this post as answers keep coming in further up the thread.
Thanks to all the great help and input on this post, the problem is now pinpointed. Cloaking is of course not good and should result in a page not being indexed.
The telnet information was completely new to me, but I will most definitely keep it for checking new sites in the future.
As the cloaking is very much unintentional and, even worse, not visible on the server, it is hard to find the cause of all this; it has obviously not been like this from the start, as that would have gotten the site excluded before the SEO work on the page even began.
Our first check of the site was Google Webmaster Tools, and that flagged everything OK, which was unfortunate, as it made us think the problem was elsewhere.
Another issue is that the site is hosted with a shared web hosting company, so it's hard to get that company to react and start going over their servers to find what is wrong and what is causing this cloaking behavior. But as you write, Matt: make them take away everything. Reset the servers totally and reinstall things, because since neither the client nor the hosting company knows what has made the server cloak, I see no other way to make sure the problem is fixed.
There's a lot of great input in all your answers on this post, and I just wanted to point out, since I read your statements as saying the client is trying to cloak: trust me, they are not. It's just a server glitch, probably from their hosting company, that started all this a while back.
Updated the post, Matt. Thanks for stopping by :)
BTW - Is this something Google will fix? If Yahoo! and Live can index it, and browsers are redirecting, wouldn't Google want to do so also? Seems like there might be a lot of pages excluded if Google can't pick up the capitalization issue.
I think this is some strange server setup rather than a capitalization issue with Google, because the server currently requires an exact match of the URL, as I just found out after digging further based on the input from this post. If you enter the lowercase version it 301s you to the /Database/ version, and the lowercase is treated as a 404 because the system doesn't think it exists.
The same goes, as I just found, for the trailing slash: if you enter the URL without one, it will 301 you to the URL with a trailing slash.
So I think this is all a very strange server setting change that ended up requiring exact matches in the URL, right down to capitalization, thereby adding tons of 301s all over the place.
Strange is one way to put it. :) Please see my comment (here) about an issue that definitely needs to be fixed ASAP.
Edit from Rand - Made Matt's link live and visible.
*** If you enter the lowercase version it 301s you to the /Database/ version, and the lowercase is treated as a 404 because the system doesn't think it exists. ***
Errr. NO.
If the lowercase issues a 301 redirect, then that URL is a 301 redirect. It cannot also be a 404.
A URL returns ONE status code in the HTTP header:
200 - Page OK, here is the content.
301 - permanent redirect to another URL. The browser makes a new HTTP request to fetch the new URL.
302 - temporary redirect to another URL. Use the 301, not the 302.
404 - Page not found. A page full of error text is served at the originally requested URL, and the 404 tells the bot NOT to index the URL or the content. The 404 says nothing about whether the page may come back with real content at some future date.
410 - Page gone. Similar to 404, except that page is NEVER coming back.
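A quick way to see which of these a given URL actually answers with is a HEAD request - a rough sketch with curl (-I sends HEAD and prints just the status line and headers; note that a few servers answer HEAD differently than GET):

curl -I https://www.birdstep.com/database/

At the time of this thread, that showed the 301 to /Database/ discussed above.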
*** So I think this is all a very strange server setting change that ended up requiring exact matches in the URL, right down to capitalization, thereby adding tons of 301s all over the place. ***
Links within the internal navigation should point to the URL that you want to be indexed, in exactly the correct case.
When a user or bot follows an internal link they should not be hitting any kind of internal redirect to get to the content.
The issue should be fixed by making the URL in the link be exactly the same, and exactly the same case, as the actual URL for the content.
If it is not possible to fix the internal scripting to make them both match up, then you could always employ an internal rewrite (that's a rewrite, NOT a redirect) to translate the externally requested URL into an internal facing filepath and filename. In the rewrite, the internal file path and file name are NOT exposed back to the browser.
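On the wire the difference is easy to see (a rough sketch with hypothetical example.com URLs; curl's -sI sends a quiet HEAD request and prints the response headers). A redirect is visible to the client as a 3xx status plus a Location header:

curl -sI https://www.example.com/old-page/
HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/new-page/

An internal rewrite simply answers 200 for the URL that was requested; the internal filepath it maps to is never exposed:

curl -sI https://www.example.com/friendly-page/
HTTP/1.1 200 OK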
This is basic design stuff that I see badly done all the time.
Rand, I have no idea whether birdstep.com is cloaking to Yahoo or Microsoft as well. If not, that would account for why Y/M have the URL. Also, Y/M handle the noindex meta tag differently. Most people prefer how Google handles the noindex meta tag.
Hehe, it is nice to see John called "softplus" (his old Google Webmaster Help group name).
heh!
As the person who didn't notice the missing menu item in the first place, I can only say that Pro membership can end up saving your life. I work with SEO at least 80-100 hours every week, and I know that a lot of the readers here do the same, and I can say for a fact that I wouldn't know what to do without my Pro membership. So anyone not using it, please check it out.
I also want to thank you, Rand, for digging into this. Even though it is a rule-breaker, albeit an unintentional one, I still think that a total exclusion of all the pages is out of proportion compared to the many other pages you find daily that are intentionally breaking the rules.
I don't want to go totally off topic and start talking about all the pages that break the rules intentionally, but as everyone who works with SEO or SEM knows, they are plentiful, and they are still allowed to be indexed and even rank in top positions. Yet one honest mistake like the one we are talking about in this post can get you totally excluded.
Don't get me wrong here: I'm all for white hat SEO, and I would be the first to cry happy tears if all black hat SEO pages were excluded, but I'm starting to lose my trust in Google, because it feels like a total lottery whether your site will get through the "all-seeing eye" of Google or get the full penalty.
Amen to the value of the SEOmoz Pro membership. My blog got banned for about a month due to some viagra injections on my WP blog. I couldn't figure out what the hell was going on, since I didn't see the code anywhere on my site, but Jeff & Rebecca responded to my question in the Q+A section within about 12 hours and had me all squared away.
Even if I only use this feature once in 12 months, $400 a year is a small price to pay for a second opinion from some of the top SEOs in the world on something this important.
Thus, I'm making a point of noting here that Birdstep got their issue solved (or at least diagnosed) thanks to the Q+A section in SEOmoz PRO. I do think we offer a good service, and I really do believe in it; I think I'm just a bit shy about self-promotion.
I don't think you should be shy about self-promotion. It's not really promotion if you are stating a fact! lol. Pro membership does help, and Q+A is one of the best ways to make use of resources not available to most SEOs: a second opinion from a pair of professional eyes.
While eating a pork pie and taking time to catch up on my RSS... I found this.
Personally I would have started at the robots.txt file, in turn finding an XML sitemap, and in turn finding this:
https://www.birdstep.com/upload/sitemap/sitemap-DB.xml
which shows me that the webmaster has made a mistake: the XML links are all lowercase, giving Google a cue as to what to index. Looking at the server headers, things get silly... so fix the URLs, or at least fix your XML sitemap.
DaveN
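(A quick way to eyeball the URLs a sitemap lists - assuming a standard sitemap protocol file with one <loc> element per URL - is to grep for those elements:

curl -s https://www.birdstep.com/upload/sitemap/sitemap-DB.xml | grep "<loc>"

Any mismatch in case between what the sitemap lists and what the server actually 301s to will show up immediately.)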
When you say the headers get silly, what do you mean?
The client is using a CMS that by default capitalizes all names that get used in URLs. Yes, I know it's silly, as you want all lowercase in all URLs, but since the CMS automatically creates capitalized catalogue names in the URL, the sitemap is generated (by the CMS) to reflect the actual structure of the system.
It's even the case that if you enter /database/ into the URL, you get redirected to /Database/ automatically by the system.
But maybe I'm not understanding what you said needed to be changed.
Sorry if my questions and focus on the actual site, rather than the hidden link, took this far off topic.
Per - I think Dave's just suggesting that since the URLs all resolve to /Database/, you should remove the /database/ (small "d") in the sitemap and replace them with Big "D"s.
Having just modified a site like this, I can say the problem is often more difficult than the original designer may have thought at first glance.
The Category Name needs to feed to the URL href, anchor text and title attribute in all of the navigational links and breadcrumb trails. It may also be needed in the title tag of the destination page.
In the href, spaces and underscores are best avoided, and the words are best done all in lower case. Most punctuation, other than slash, hyphen, colon, or dot is also best avoided.
In the anchor text and in the title attribute, words must have spaces between them, and leading capitals are best, and almost any punctuation is allowed (although quotes and brackets may need to be escaped).
Rand, you stated: "trying to puzzle out whether bad links were pointing in or the pages were somehow cloaking or violating the TOS."
Do you really think that having bad links pointing at a legitimate site will get that site deindexed? I've always heard of people worrying about this, but as best I can tell it's another SEO-created bogeyman. I'd love to see any examples or tests that back up the fear of inbound links.
WeRASkitzzo - Love the Carlin icon!
I agree. I thought another shady SEO could make your site drop in the SERPs with bad links, but not necessarily get it dropped from the index completely.
Haha, it's not a Carlin logo, that's a cartoon caricature of me! It's not that clear at this resolution, but it's definitely the first time I've ever been told I look like him.
I'd question whether they can even get you to drop in the SERPs, not to mention get you deindexed.
WeRASkitzzo: it can be done.
Easy way... buy a Yahoo PPC account for the URL you want to dump, buy real estate on known spam/link farms... and then...
Umm... maybe I shouldn't complete this thought...
(whistles innocently)
A lot of people have claimed they know how to do it but, I'm sorry, I just don't buy it. There are more than a few people in the SEO world that would have no qualms about doing such a thing and I find it hard to believe that more cases of this wouldn't be documented.
I mean hell, if it were that easy to do, wouldn't someone have done it to a high profile SEO site like Moz, or SELand, or Matt Cutts' blog?
What I'm saying is I need proof before I buy into this theory.
*** Do you really think that having bad links pointing at a legitimate site ***
Bad links "in" can get "alternative" URLs for the content spidered and indexed, using URLs that do not exist anywhere in the internal site navigation.
There are many things you can do to protect against that happening, but most people don't actually do any of them.
Don't be shy about promoting the PRO section. I'm glad I made the investment and being able to put questions to the team really is worth its weight in gold alone. Having access to all the other goodies is just icing on top.
Is there a reason why the SEOmoz promotional material for the Pro account doesn't mention the Q+A feature? I had no idea!
Must admit the Pro Q&A is an amazing service that I really don't use enough!!
Still, I find the Pro account so useful I've been paying for it personally (not through expenses or my company) since I joined up!
Thanks for sharing another fine real life issue!
But I'm a bit skeptical about the "period technique". Maybe I understood it totally wrong, but most pages I tried to look up with this search query gave me results. Normally, if you open a page with the dot, e.g. https://www.seomoz.org/., you are "redirected" to https://www.seomoz.org/ with a status code 200 (OK). As I understand it, this is normal behavior:
. = this directory
.. = parent directory
Which means www.example.com/. = www.example.com
Also, seeing results like the ones below, I'm curious whether you mean something special about the SERP you mentioned, e.g. that it has only one result, without a description, and with "no omitted results".
Could you clarify how your result differs from typical results like these:
"https://www.seomoz.org/."
https://www.seomoz.org/. (shows the Web 2.0 award, because it seems to be the most relevant page with a dot, but I believe it would otherwise show the same result as the quoted query above)
https://mail.yahoo.com/.
It's interesting to hear about different kinds of penalties, as you never know where one can hit you. I had always been under the impression that hidden content, when perceived as illegitimate, affects the whole site, not just one page... really interesting...
Rand, wrong again dude - Yahoo! didn't get it right:
https://search.yahoo.com/search?p=www.birdstep.com%2FDatabase%2F&ei=UTF-8&fr=moz2
@ann - I thought they were cloaking when I saw it; I didn't want to say the "C" word, but the headers were different depending on whether you were a spider or a human.
It's a ban in my world.
That's odd - it's listed fine in Site Explorer...
But why would you ban them? They clearly did not do this to gain an advantage or with the intent to game Google - those pages were 404'ing! If Google is saying they strongly consider intent when looking at gaming issues, this has to be one where they'd try to help the site, not ban them.
What content was being delivered to Google before this problem was found?
Has that content been recently removed, hence the 404 now?
SOCO can't make a full analysis, as the crime scene has been contaminated.
Dave, try this search instead, it's not the same thing with Yahoo:
[site:www.birdstep.com/Database -rfijbdrefv]
-Michael
Seeing how this issue has not been resolved, I'd like to post my guess at what is happening. You're probably hitting a bug in IIS6, which can cause the web application to crash when the Googlebot visits. It's probably triggering a 500 error, but your custom error page is handling it. You can find out more about it (and get a fix) at:
https://www.kowitz.net/archive/2006/12/11/asp.net-2.0-mozilla-browser-detection-hole.aspx
https://todotnet.com/archive/0001/01/01/7472.aspx
(added: just noticed that Fabio pointed to the same page :-))
"You need to actually be on Port 80 (as Matt Cutts notes in the comments)."
It's less that the cloaking happened on port 80 (every tool was talking to the same port 80 on the webserver) and more that if you telnet'ed to port 80 you'd see the raw dump of exactly what birdstep.com was returning, without following any 301/302-type behavior.
Telnetting to port 80 is handy because you can see things like the raw body text that was returned and the raw server headers that are returned. Things like wget usually just follow the redirect, so you don't see the nitty-gritty details that the web server returned along the way.
Thanks for updating the title/post.
Matt, did they fix that issue? I added a user-agent switcher to my header detector tool, and I'm not seeing what you described:
https://www.bad-neighborhood.com/header_detector.php
It was by user-agent, correct, and not by IP?
mvandemar, it doesn't look like the issue is fixed, because I just checked and it's still doing it for Googlebot. So they are at a minimum still doing something high-risk based on the IP address of Googlebot (added: or they haven't truly turned off the user-agent checking, as John/softplus points out below.)
That's worse than user agent cloaking, isn't it?
Meaning, more dangerous as far as staying indexed goes.
Hi Michael, there's actually a pretty easy way to see what's happening here. All you need is "wget" (open source / free). Just use a command like the following (all on one line):
wget --save-headers -U "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)" https://www.birdstep.com/database/
You will see the redirects happening and it will download the final page where you can see the robots "noindex" meta tag. At the moment, I see a 301 redirect to a 302 redirect to a page called 404.html returning result code 200 (with the "noindex" robots meta tag). It's nothing exotic, has nothing to do with uppercase in URLs, just some strange cloaking to the Googlebot's user agent.
In general, when a URL shows up with no associated information in the index, that means that we know about the URL but either aren't allowed to or just can't show more. That's usually from a robots.txt, a robots/googlebot meta tag, an exotic x-robots meta tag, or because the URL is just not working for us (returning 5xx or 4xx, timing out, etc.).
Leaving a robots "none" or "noindex" meta tag on a site when it's new or has been re-done is actually pretty common (and confusing to new webmasters).
Actually softplus, it almost looks like wget is doing something that's wrong. For some reason it seems to be appending index.html to the end of what it is trying to fetch, which is why you are getting the 404. There is no index.html.
If I just use telnet (which doesn't add anything), I get this:
Microsoft Telnet> open www.birdstep.com 80
Connecting To www.birdstep.com...

GET /database/ HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)

HTTP/1.1 301 Moved Permanently
Connection: close
Date: Thu, 26 Jun 2008 20:43:03 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: /Database/
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 127

Connection to host lost. Press any key to continue...
My tool does the exact same thing, doesn't try to add a default document or anything, and I don't get the 404 error.
(edited to re-format output...)
Ok, never mind, I see it now. It does have to do with case, softplus... the lowercase 301s to the uppercase version, regardless of what user agent you are using, and the uppercase 302s to a 404 page if you have Googlebot as your user agent. My tool is showing it, so no, it's not IP-based delivery. I was just checking the wrong-case URL.
Hi Michael, the "index.html" that wget displays is just the local file name that it defaults to when there is none specified (e.g. for "domain.com/folder/"). It doesn't necessarily mean that the server is using that name; it's just that wget has to use something to save those URLs under locally (where you're running it). It's cool to see you make a tool that helps detect this kind of issue!
It would be interesting to know what information you get if you check https://www.birdstep.com/Database/ (with slash) rather than the lowercase database.
The reason, as I stated above, is that for some very odd reason their CMS (or server settings) requires an exact match on the URL based on the name of the path in their CMS.
In their CMS, at the moment (this is what was renamed since the initial post this afternoon), the database folder is named "Database" with an uppercase D. So when you enter the URL with either a lowercase database or without a trailing slash, the system will automatically redirect the incoming request to the exact match, which is /Database/.
Why it does this I have no idea, as I'm not in charge of the servers (neither is the client, Birdstep), but it's sort of hurtful to read comments like "risky business" and the "what if" answers, as I know that nothing currently acting like cloaking is intentional. It hasn't been like this from the start; someone changed something along the way that started this odd behavior, and I can't seem to find out what it is, as the hosting company isn't responding and hasn't been all day.
The "risky" comment was based on me not seeing what Matt Cutts was seeing, because I misunderstood what he was saying. He made this statement:
In fact you only get the 302 redirect to the 404 page with the uppercase version of the url, and only if you have your user-agent set to Googlebot. That's what threw me, and lead me to think that maybe IP cloaking was involved. I was wrong though.
Really great work on seeing that Michael - we were confused internally about the capitalization issue (hence changing the post twice), so it's nice to see it get sorted.
BTW - You mentioned a tool you're using - if that's public, please feel free to link to it. I'm sure the other SEOs reading the post would appreciate it.
I did, thanks. 10 comments up.
Yup, please add the tool to the post if it goes public!
Your server should redirect an incorrectly-cased request over to the right URL using a 301 redirect (do NOT use a 302 redirect), or else it should directly serve a 404 status code for the incorrect requests.
This ensures that the content can only ever be indexed under one canonical URL.
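A quick way to verify that behavior for both casings (a rough sketch; curl's -s -o /dev/null -w flags discard the body and print just the status code, and adding -I would show the Location header of any redirect):

for u in database Database; do curl -s -o /dev/null -w "$u: %{http_code}\n" "https://www.birdstep.com/$u/"; done

One of the two should answer 200 and the other 301 - never a 302, and never a 200 for both.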
There are several errors in implementation here.
A major one is that your internal links don't point to the correct URL. Users should NEVER hit a redirect when they click an INTERNAL link. The links need to match the real URL.
Another is that you're doing different things for Googlebot and for regular users.
Another is that serving an error page with a "200" status is at best confusing, and at worst, a source of Infinite Duplicate Content.
If you are doing different things for Google, then that is either coded into your script, or into the server configuration files.
Someone needs to own up as to where the problem is, and fix it real soon now.
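On the error-page point above: a URL that doesn't exist should answer with a real 404 status, which is easy to verify (a rough sketch; the path is deliberately nonsense):

curl -s -o /dev/null -w "%{http_code}\n" https://www.birdstep.com/no-such-page-xyz/

A 200 here means the error page is being served with the wrong status code - the soft-404 / infinite duplicate content problem described above.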
So many good answers from you in this entire post that I don't know where to reply, so I will just reply here, at the last one.
I know all about using the 301 (not the 302), using only lowercase, and setting up IIS to handle upper/lowercase the same way Apache does (that's why I don't like IIS).
The problem is that I don't control any of this.
The good news, though, is that the user agent cloaking issue should finally be fixed. The CMS supplier had released a hotfix to sort this out, which the people responsible for the server had failed to implement (yeah, I know, duh).
The issue with the 301 from d to D is still there, but they know about it and are fixing it.
Thank you again, everyone, for your great responses and knowledge, and sorry again for indirectly causing the three renames of this post. Very embarrassing...
I think a lot of people learnt a lot of new things during this investigation... and you got some great advice direct from, not one, but TWO Googlers...
I hope a lot of people learned a lot, because I certainly did. I learned a lot about the CMS the client was using, but I learned even more about the will and effort the Google people put in, making the net a better place by participating in this post, and for that I salute them. I truly salute them for giving their time - where else can you get that but here, at SEOmoz? Thank you all for this great post after all, and thank you, from a humble, simple consultant, for explaining why I should stay so humble.
/M
My developers frequently make changes without my approval, which forces me to thoroughly and constantly quality-check our sites for issues that appear harmless to the not-so-SEO-savvy developers.
I could see how making the image link a single pixel image may have seemed to be the quickest temporary solution to a design problem. I still find it hard to believe that it was truly that harmless, but I may never know.
As for SEOmoz PRO membership, it's great. Promote it like crazy, because it is truly useful. My mentor led me to SEOmoz.org, and it has played a critical role in the success of our company and my personal growth.
@Matt Cutts - Thanks for pointing out the real issue. Regardless of the issue, it appears that SEOmoz led to the solution. The question led to a reasonable assertion by Rand's team, which then caused this post. This post peaked your interest on a great blog that is graced with your presence, and the real issue was revealed. Props to SEOmoz and Matt Cutts!
peaked ---> piqued
:-)
Great sleuthing, Rand.
It's always panic stations when pages suddenly get de-indexed and it takes a cool head to quickly figure out the cause and find a fix. Well done.
BTW, I'd love to see more from you about the advanced query you mentioned at the end. What's the significance of the added period? Think you can write a post in the future once you've had a bit more time to try it out?
Now I am even more confused, because I did some more digging into this after I got the info about the page being a 404.
In the client's CMS environment they have named the initial folder "database", hence the URL https://www.birdstep.com/database/.
When you check with the user agent suggested above, you actually get a 404 response, even though the page works in a browser.
Initially the client had the folder named "Database", and that's when the page was indexed. So the de-indexing started once they renamed their folder to "database".
Also, if I check the user agent above against https://www.birdstep.com/Database I get the page. URLs shouldn't be case sensitive, so why is this one?
Now the client has renamed their folder back to how it was before the pages started to vanish - "Database" - and if I now check the URL with the bot user agent I get exactly the opposite result: for https://www.birdstep.com/Database/ the bot says 404, but if I enter https://www.birdstep.com/database/ I get the page.
This is exactly the reverse of how they have named their folders. Should a folder name actually create case-sensitive URLs?
Just to check whether this was the case, I added another level to the URL to see what the bot thought of it, and I was surprised, as both variations worked: https://www.birdstep.com/database/Support/ and https://www.birdstep.com/database/support/ both returned the page when I checked with the bot.
This is the user agent I used to test the site based on the entry a few comments above.
Macaper, thanks for the link to the user agent.
I think the lesson from this example is simple - stay away from capital letters in URLs.
When using Apache, case matters, so /database and /Database should return different folders (or 404s if one doesn't exist).
You seem to be using IIS, which is a different kettle of fish.
IIS is not case sensitive out of the box (although I believe there are extensions to fix that).
So /database and /Database will point to the same place by default.
I think what you really need to do is review the CMS you are using and decide whether it is causing more problems for you than it's worth.
There seems to be an awful lot of complex funkiness in there - more than necessary :(
*** URLs shouldn't be case sensitive, so why is this one? ***
Oh yes they should.
"Page.html" is a different URL to "page.html" is a different URL to "PAGE.html" is a different URL to "page.HTML".
Only the domain name is case-insensitive, not the folder or file path.
The fact that IIS isn't case-sensitive should be treated as a BUG. It is a major cause of Duplicate Content.
Apache gets it right, right out of the box.
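The duplicate content risk is easy to demonstrate (a rough sketch with a hypothetical page). On a case-insensitive server, every casing answers 200 with the same document, so each variant becomes a separately indexable URL:

curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/page.html
curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/Page.html
curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/PAGE.html

Three 200s for one document means three duplicate URLs; on a case-sensitive server the second and third would normally be 404s.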
totally...
heh - funny
Rand, some time ago I had the same problem with an SEO customer, and he had the same ASP.NET server setup as this site.
I've used this tool to check how Googlebot sees the page:
https://www.smart-it-consulting.com/internet/google/googlebot-spoofer/index.htm
There are two Googlebot options. Choose "Googlebot-Mozilla-2.1", then click submit. The new page will show "Object Moved"; this is because the server returned a 302.
I solved this problem by asking the server support team to fix it using these steps:
https://www.kowitz.net/archive/2006/12/11/asp.net-2.0-mozilla-browser-detection-hole.aspx
I hope this helps.
Fábio Ricotta
Thanks, this was a helpful post. You can fall into the cloaking trap far too easily.
Forgive my ignorance on this subject, but how does this affect affiliate marketing? I.e., many affiliate codes include a 1x1 gif image for tracking. Does including this image violate Google's TOS?
Thanks
It might do, if it is embedded directly into the HTML code.
If it is written out to the browser screen using Javascript from an external file, then the bot will likely never see it.
Does this mean that using CSS display:none on elements like h1s or menu items can hurt? I thought it was quite commonly done on sites that are image-intensive or built entirely with Flash.
Rand, why would I ban them?
It's cloaking. What if - and this is a "what if" - there were two pages:
page one is what the user sees, page two is spam for the engines. And what if, just after that page was banned in Google, the webmaster had removed the spam page, causing the 404 error now? I'm not saying that did happen, it's more of a what if. And what if you ran through more pages with a Google UA and saw some other issues? Not to mention the dupe content issues.
Dave
OK, so this seems to still be causing confusion, and if I'm reading this correctly, then the UPDATES you made to the post, Rand, are still wrong.
What I'm reading (perhaps incorrectly) is that it's not an issue of URL mis-capitalization, but an issue of cloaking - serving something different to Googlebot - and that "something" being served to Googlebot is a 404 with noindex.
Am I right in my interpretation of what John and Matt are saying?
Donna - yep, I had to update a second time because there was an actual cloaking issue happening (it's just not visible unless you check the raw response the way Googlebot would see it).
Great detective work Rand & Jane.
It is unfortunate that shady links can make your site drop in the SERPs... I have seen it happen. I personally would not do it - maybe I have too much of a conscience.
What I have seen is that the site drops in rankings for a week or two and then goes back up.
This technique can be very useful for spotting a URL removed from the index. Thank you for sharing that tip, Rand!
Well it's good to see Google discounting bad links rather than making the whole site suffer :)
For all ye non-believers: I am a very happy Pro member.
Btw, nice Columbo work on that site. Damn Google and their rules 'n' stuff...
Presumably the nav link was the first link to that subdirectory in the markup. Do you think that might have had something to do with the SERP drop?
One question I have wondered about repeatedly is whether it is considered cloaking to mirror a mostly JavaScript/AJAX-driven site with a hard anchor-tag URL version (where the hard links are mostly hidden).
Also, if you disable links with, say, jQuery, is this the same as hiding links on the page?
Any advice would be GREATLY appreciated.
(and to all you Pro-membership fanboys, I'll be jumping on board soon, I just don't have the time right now to use all the tools)