For those of you who haven't heard about it, this is what we saw this morning:
That pointless little bar at the bottom of the screen that we constantly tell people not to worry about had gone from full of green (7/10) to sadly gray overnight. The story at Google was even worse.
The page you see ranking first is the full list of winners and honorable mentions, but it is the "shortened" version. The main page, at https://moz.com/web.2.0, was gone.
To make a long story short: this morning, Rand got in touch with Google and was advised that changing the URL so it doesn't end in ".0" would be a wise decision. Google would prefer not to make an official or public comment, but they did give us permission to share this tidbit. Naturally, we dug deeper, and found that it's not just inadvisable but seemingly impossible to get a URL indexed in Google's engine if it ends with a .0 (similar to how Google won't index file extensions ending in .exe or .tgz).
Whilst there is plenty of evidence that URLs ending in .0 often belong to spam pages (wild guess here, but let's say there are 800,000 or so URLs on the web ending in a ".0" and maybe, oh... I don't know, 0.5% of them are worth indexing - that's only about 4,000 pages), I'm not sure that this is a good metric by which to determine an immediate penalty. Other decent pages hit in a similar way include https://en.wikipedia.org/wiki/Windows_1.0, which enjoys a healthy number of backlinks but won't appear in Google. The page at https://en.wikipedia.org/wiki/Web_2.0 appears in Google's index as https://en.wikipedia.org/wiki/Web_2. None of the URLs which redirect to include the trailing slash are flagged.
Increasingly fascinated, we did some more investigating. What we discovered is that this penalty is indeed limited to the number zero: URLs ending in .n, where "n" is any other number, are not removed. If Google finds a version of the page that resolves with a trailing slash, you'll avoid the penalty. In one instance, a page that resolved with underscores in place of the dot was indexed.
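To pin the pattern down, here's a minimal sketch of the rule as we currently understand it - our own codification, certainly not Google's actual code, and the example.com URL is just a stand-in:

```python
def looks_filtered(url):
    """True if a URL matches the pattern we saw being dropped: a path
    ending in ".0" with no trailing slash. Other trailing numbers
    (.1, .2, ...) mostly looked safe, though the SAML_1.1 example in
    the update below muddies that a little."""
    path = url.split("?")[0].split("#")[0]  # ignore query string and fragment
    return path.endswith(".0")

# Spot checks against the lists below:
assert looks_filtered("https://en.wikipedia.org/wiki/Windows_1.0")  # dropped
assert looks_filtered("https://drupal.org/drupal-5.0")              # dropped
assert not looks_filtered("https://drupal.org/drupal-5.0-beta1")    # indexed
assert not looks_filtered("https://en.wikipedia.org/wiki/Web_2")    # indexed
assert not looks_filtered("https://example.com/web2.0/")            # trailing slash: safe
```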
Below is an assortment of URLs which are indexed in Yahoo! (and many also in Live), but which show no PageRank and do not appear in Google's index. Below those, I've listed very similar pages that are indexed, but which do not end in .0.
Out of Google's Index (but in Yahoo!):
- en.wikipedia.org/wiki/Windows_1.0
- en.wikipedia.org/wiki/Web_2.0
- https://en.wikipedia.org/wiki/Die_Hard_4.0
- drupal.org/drupal-5.0
- keznews.com/3799_Vista_Transformation_Pack_8.0_Final_-_VTP_8.0
- en.wikipedia.org/wiki/BASIC_8.0
- drupal.org/drupal-6.0
- en.opensuse.org/OpenSUSE_11.0
- www.shopping.com/xGS-Illustrator_11.0
- www.mythtv.org/wiki/index.php/Opensuse_11.0
- www.shopping.com/xGS-Suse_9.0
- en.wikipedia.org/wiki/Mac_OS_X_10.0
- en.opensuse.org/Bugs:Most_Annoying_Bugs_10.0

In Google's Index:

- en.wikipedia.org/wiki/Web_2
- drupal.org/drupal-5.0-beta1
- https://keznews.com/3799_Vista_Transformation_Pack_8_0_Final_-_VTP_8_0
- drupal.org/drupal-6.0-beta1
- www.mythtv.org/wiki/index.php/Opensuse_10.3
- www.mythtv.org/wiki/index.php/Opensuse_10.2
- en.opensuse.org/Bugs:Most_Annoying_Bugs_10.3
https://www.fileplanet.com/62709/60000/fileinfo/WinZip-9.0 is not indexed and has no PageRank. Call this duplicate content if you will, but it still shows the same trend in action.
You'll notice some interesting things, such as the fact that en.opensuse.org/Bugs:Most_Annoying_Bugs_10.3 is indexed but en.opensuse.org/Bugs:Most_Annoying_Bugs_10.0 is not.
Quite simply, making sure a page resolves with a slash will avoid this problem. I'm of the opinion that this is a pretty silly thing to penalise for without some sort of human review, but it's important that we pick up on things like this so that we can avoid such "false positive" penalties. Make sure to add "check for URLs ending in .0" to your next checklist for site reviews and please, do share if you've found any other filename extensions that exhibit similar behaviour from any of the engines in the comments.
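If you want to automate that checklist item, a quick pass over an XML sitemap is enough. A rough sketch (the sitemap location is a placeholder):

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

def flag_dot_zero_urls(sitemap_url):
    """Fetch an XML sitemap and return every URL whose path ends in '.0'."""
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    urls = (loc.text.strip() for loc in tree.iter(LOC_TAG) if loc.text)
    return [u for u in urls if re.search(r"\.0$", u.split("?")[0])]

# Hypothetical usage - substitute your own sitemap location:
# for url in flag_dot_zero_urls("https://example.com/sitemap.xml"):
#     print("Review this URL:", url)
```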
UPDATE: en.wikipedia.org/wiki/SAML_1.1 also seems to be suffering from a penalty and it will be useful to go through some more URLs that end in .n to gauge whether or not they're penalised. Most of the examples we saw that didn't involve a zero had not been hit in any way. I'd love to know how extensive this filter really is.
I did a post about this: https://www.mattcutts.com/blog/dont-end-your-urls-with-exe/
Thanks Matt - I went ahead and made it a live link for you. Feel free to link drop here, BTW; they're all nofollow'd. :)
Great job to the SEOmoz team for finding this and Google for fixing it so quickly!! :) Multiple thumbs for the seomoz team!
Thanks Matt, that is great; thanks for that post and clarification.
Everyone else: In regards to this issue, this is the new word from Matt:
Great explanation, Matt - and thanks for jumping in there to explain it further!
Hi Matt,
I just wanted to say you guys missed the filetype:1 search. While the .0 URLs are showing in regular searches, .1 seems to only show up with additional session-ID dynamic URLs attached AFTER the .1. I don't mean to teach Google what to do, but just an FYI. I guess there had better not be a Web 2.1
Nice work! Do you know where the indication of spaminess comes from? Is there a CMS that has a habit of doing this?
I'd suggest that we just upgrade/rename every core release immediately and retrospectively to create indexable URLs.
Web 2.0 => Web 2.1
Drupal 6.0 => Drupal 6.0.1
Die Hard 4.0 => Die Hard 4.1
I found this post so much more useful than many of the recent ones. Well done and thanks for sharing it. As entertaining as the SMX type posts are, I skim them just because my time reading absolutely needs to be ruthlessly targeted these days.
I love it when I can learn one immediately actionable thing each day!
Premium Plug: For those who haven't upgraded to premium membership yet, there are tons of these little gems in the SEO Tips and of course the hundreds of Q&As.
I'd second that - definitely lots of great stuff in premium just like this.
"To make a long story short, this morning, Rand got in touch with Google and was advised that changing the URL so it doesn't end in ".0" would be a wise decision..."
Got in touch with Google... Grrr... Wish I could also call Google for explanations when it messes with my clients' sites. Nevertheless, a useful discovery for all of us. Cheers :>
Seriously... Must be nice to get confirmation on what happened so quickly...
He literally called Google.
Rand: "Hello, Google? We noticed a problem..."
Google: "011101110110010101101001011100100110010000100001"
I would like to point out that this says "Weird!" in binary code. Rebecca ftw!
Heh, yes, I wanted to have Google say, "You don't say!", but it ended up being way too long. :)
01000001111100011 (Good Morning, this is Google, How can I help you !!)
"011101110110010101101001011100100110010000100001"
Heck. Rebecca is an even bigger geek than I thought.
In this picture, behind Christine's head, you will see the device we lovingly call the Matt Phone.
fulldisclosureincaseigetintroubleiamtotallykidding
I would say that any URL that ends with a dot and a short suffix could suffer from the same problem. The dot has a special meaning in file terms, indicating a file extension, which in turn implies a file type. This could change the way a spider, or indeed a browser, interprets a page.
It seems that numbers are going to be the most likely place where a dot is used at the end of a title due to the nature of version numbers but maybe there are other examples.
Putting the slash on the end of the URL certainly helps as this implies a directory instead of a file type so maybe this is the best solution.
I guess another approach would be to replace dots with dashes, but I'm not sure how this would change the effectiveness of a well-crafted SE-friendly URL, e.g. would there be any difference in ranking for "Web 2.0" between the following URLs
I'll second that. Although it's probably not necessary, I make a habit of not using "." in URLs at all, as it's generally reserved for domain names. In the case of "Web 2.0", I've either used "web-20" or "web-2-0". It's not as pretty, but it's definitely safer.
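For what it's worth, a slug helper along those lines might look like this - a sketch, with hyphen-separation as one arbitrary choice and the function name made up:

```python
import re

def slugify(title):
    """Lowercase the title and collapse runs of anything that isn't
    a letter or digit (dots included) into single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

print(slugify("Web 2.0"))       # web-2-0
print(slugify("Drupal 6.0"))    # drupal-6-0
print(slugify("Die Hard 4.0"))  # die-hard-4-0
```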
Third'ed. My rule of thumb is to never use a '.' past the top level domain. Call me ole skool, but it took me years to start naming files with more than 8 characters as well... I think sometimes we get caught up in the limit of what we can do, when instead keeping it basic works more often and on a wider variety of stuff.
I use dots as word separators in URLs all the time. I have never used spaces or underscores, and stopped using hyphens a long time ago. My URLs therefore only have slashes and dots in them.
I have never had a problem, but then again the URL always has a trailing slash if it is for the index file in a folder, or it will have a proper extension like .html, .php, or something similar, if it is for a normal file.
So, for "extension-less" URLs, and where a dot is being used as a word spacer, anything after the final dot after the final slash will be treated as if it were an extension in and of itself.
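You can see that convention in any standard path-parsing routine; Python's splitext, for instance, reads the final dot exactly that way (purely an illustration of the convention, not a claim about Googlebot's internals):

```python
import posixpath

for path in ["/wiki/Windows_1.0", "/web2.0", "/drupal-5.0-beta1", "/web2.0/"]:
    print(path, "->", posixpath.splitext(path))

# /wiki/Windows_1.0 -> ('/wiki/Windows_1', '.0')
# /web2.0 -> ('/web2', '.0')
# /drupal-5.0-beta1 -> ('/drupal-5', '.0-beta1')
# /web2.0/ -> ('/web2.0/', '')
```

Note how both the trailing slash and the -beta1 suffix kill the ".0 extension" reading - which lines up neatly with which pages survived.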
Good catch.
I have always been wary of extension-less URLs. This is yet one more thing to trip you up when using them.
We basically saw that URLs ending in anything but .0 (and maybe .1 as well; the jury is still out on that one) were fine, including underscores and dashes.
Stephen, Chuck and I must all be coders. I have to admit, I automatically think about things like reserved words and characters, so using a dot in a URL just makes me itchy :)
Yeah, look out for wacky file names...Schools can have this problem sometimes as well.
It's sad that that made my Friday morning complete :)
"Stephen, Chuck and I must all be coders"
Guilty as charged!
i was wondering as much and i'm glad to see someone verify! =)
makes sense as to why, when WordPress creates your URLs, they go 20 instead of 2.0.
I wonder if it has to do with how most Linux/Unix systems rotate and store system log files? (or any files set up to do so) ... With a .0, .1, .2 etc. at the end of the file name? Being a sysadmin type that was the first thing that came to mind. Typically logrotate gzips and puts a .gz on anything after .1 or .2, which could explain why the .3 urls are showing up, but the .2 is not.
Wouldn't want a bunch of misplaced log files spamming up the index now ;) Just a thought.
Google is able to spot the MIME type of anything; only a bad developer or someone really lazy (like me) would set the behavior "if it ends in a .\d{1}, it must be a file we don't want, so let's not index it without looking any further".
I mean, come on, it's Google :-)
I would have thought that they would at least inspect the MIME type to see if it is "text/html" or anything else (PDF/Text/DOC) remotely usable before deciding to ditch the content.
I guess they didn't.
Sounds reasonable to me. For it to be something Google won't disclose after such a high-profile case, it has to do with hackers, malware, or a virus. Log files in the index can be and are being used by hackers, so that could very well be it.
Ripp, good point about binary log files. The other common stuff that ends up with .0 or .1 type extensions are UNIX libraries. Do a search for [ld so lib] to see the sorts of dynamic libraries that are binary and often end with a .0 or .1.
Fascinating, and a good catch guys (even if you did have a little help from Google).
Every SEO needs to know this.
How big is it? Large. URLs with numbers at the end = SEO fail
Very interesting, and certainly something that we'll be adding to our standard guidelines.
One thing; you list:
en.wikipedia.org/wiki/SAML_1.1
as being one that suffers from the penalty. Typo, perhaps?
That's what I thought initially but looking at it more closely I'm not so sure:
google cache of the page
serp for saml 1.1
I don't know why, though, since the URL doesn't end in .0
It's not just .0 - it may be other decimal numbers too - for example:
Wikipedia 9.11 page
Google cache
google search on wikipedia "9.11"
How strange, and yet this page:
en.opensuse.org/Bugs:Most_Annoying_Bugs_10.3
IS indexed. How bizarre!
Yes, I was going to say... it's weird that all the .0 URLs are out but many of the .n URLs are in.
Curious.
Hmm.
https://www.wowwiki.com/Patch_1.9.0 - not indexed
https://www.wowwiki.com/Patch_1.9.1 - not indexed
https://www.wowwiki.com/Patch_1.9.2 - indexed
https://www.wowwiki.com/Patch_1.9.3 - indexed
https://www.wowwiki.com/Patch_1.9.4 - indexed
this search query's result is odd (when a number comes before the decimal) - replace the 1 with any number
Why, Jane, why? Why use the "P" word? It isn't a penalty, it's a "feature" :)
nice find Jane, and thanks for letting us know. :)
premium rocks
Wow!
IF
(URL ends in .0)
THEN
{exclude from index}
seems like a completely nuts idea.
I wonder what triggered that line of code to be added to the hundreds of other factors?
Crawler (Googlebot) code doesn't need to be all that sophisticated. When I wrote my crawler I remember putting in a line like that which filtered out extensions like .wmv, .jpg, etc because when the crawler hit those urls it stalled.
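Presumably something along these lines - a toy version of the kind of pre-filter described here, with an illustrative (not authoritative) extension list:

```python
import posixpath
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".exe", ".tgz", ".wmv", ".jpg", ".zip"}

def should_fetch(url):
    """Cheap pre-filter: skip URLs whose apparent extension is on the
    blocklist, without spending a request to learn the real MIME type."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext not in SKIP_EXTENSIONS

print(should_fetch("https://example.com/setup.exe"))  # False - skipped
print(should_fetch("https://example.com/page.html"))  # True - crawled
```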
<possible theory>
this is a carry-over from .exe indexing.. back in an older version of php google didn't index files due to executables that could run on the page and install a trojan.. this was also prior to .asp files being indexed.
try your test with .0000 or .000.
you will find that google indexes those.. anything with a "." and fewer than 3 characters after it looks like a script ending, and is therefore blocked to combat malware.
</possible theory>
Trailing slash questions - and I hope this doesn't take us off-topic.
Let me know if these should be Q&A ?s instead.
Thanks, Jane, for adding value and reminding us to look at our URLs.
Yahoo often strip the trailing slash from the visible text URL on the page, but usually retain it in the HREF part of the link.
If the resource is a folder, or is an index file in a folder, then the canonical form is to include the trailing slash in the URL.
A request for such a resource, and without the trailing slash included should be served a 301 redirect to the canonical with-trailing-slash form.
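As a sketch of that rule in code - a minimal WSGI middleware, with a hypothetical folder set; in practice you'd usually do this at the web-server level:

```python
def add_trailing_slash(app, folder_paths):
    """Wrap a WSGI app so that requests for known folders without a
    trailing slash get a 301 to the canonical with-slash form."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path in folder_paths:  # e.g. {"/web2.0"}
            start_response("301 Moved Permanently",
                           [("Location", path + "/")])
            return [b""]
        return app(environ, start_response)
    return middleware
```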
I found this while I was researching, so only tested the pages that didn't redirect to the trailing slash version. Naturally, the trailing slash versions were indexed just fine in Google.
Very good find.
I am glad it was SEOmoz that found it and was able to get an answer so quickly. I will update my namesake guideline site.
Full credit has to go to the people at Google who looked into it for us, too. For that, we can all be very thankful!
Interesting. Thanks for sharing this information. I already had the problem with a URL ending in a postal code, but I didn't understand why. Now that's clear!
Finally the Web2.0 page is back.
Check this:
https://sphinn.com/story/53219
Congratulations Rand!!
Wow, that's some great detective work there, mozzers! Very interesting as well. Thank you.
This looks like a bug to me; this behavior just doesn't make sense.
looks like a bug to me too... guess they will have to fix it now...
Amazing (yet somehow annoying). Thanks for bringing it up!
These are the small "tidbits" of knowledge that make it critical for a company to have an SEO person on-staff or in a consulting position. There are simply too many "random" google rules that can hurt a business.
Thanks again, and keep digging!
cheers - Ryan
great post, thank you Google for telling Rand about it, and thank you Jane for telling us!
another point to look out for....
The post is in the index, it just isn't cached and doesn't rank.
Link here
Is anyone suggesting that this post isn't indexed? I can't see that...
I meant the web 2.0 awards page, not this blog post.
in that case your link isn't correct... hence the confusion.
could be that the URL doesn't end in dot 0 but hyphen 0.
Which makes it effectively useless. I'm not sure what you're trying to say...
i thought patrickaltoft was pointing to an indexed url that ended in hyphen 0, which i thought would be treated completely different than one ending in dot 0 as you were referring to. maybe i'm confused. :)
{exclude from index}
Errrrr, I meant to say....
{exclude from SERPs}
I think I mentioned this in a sphinn comment or maybe on the DP forums. The fact remains it is a good tip, and I sorta viewed it as spam. The only good that came from this is the tip.
My only question is this: did you find out if Google has protection against any of the other numbers [1-9], so to speak? Should we avoid those also? I request a follow-up if not.
Crazy... Good spot. Why would Google not want this 'officially' released? Seems to me the only people hurt by that obscurity would be innocent bystanders who happen to create .0 extensions accidentally... I can't see that it's a great spam-blocker (i.e. I can't see that there are many spam pages that would rank if only they didn't end .0)
Guess it just got "officially released" by consenting to allow SEOmoz to share.
Webmaster guidelines go viral. Excellent!
Awesome catch and discovery. Don't think I need to say anything more except show my appreciation.
Just an update: it looks like this has been taken care of :)
hmm... 1001000101010101000111100 (i mean it is amazing :P )
I'm just impressed with this; even 4 years after this post, a lot of people don't know about this particularity.
Actually - this has changed. You CAN get URLs with .0 indexed - https://www.google.com/search?hl=en&q=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWindows_1.0&btnG=Google+Search
Rishi's right; Google fixed this very soon after we wrote about it :)
I love it that we're all still subscribed to this post after all these years!
Lol. Especially since I am not sure I knew you that well way back then...
Maybe Moz needs a "Resolution" for posts that are now no longer 100% right... and close off comments...
Amazing find. Interesting Read. :)
Interesting. It is the same kind of find we made back in the day on the v7n forum. Some spammer posted something like s-e-r-g-e-y and Google read it.
https://www.v7n.com/forums/google-forum/58158-g-o-o-g-l-e-c-n-r-e-d-t-h-i-s.html - you can read it here.
This is really interesting information. I cannot believe that you did this much research to find out new information that no one knew about. .02 could have been a URL I used on a potential client website in the future especially with the rise of web 2.0 sites. Thank you and I cannot wait to read more interesting articles from your amazingly written blog.
I love how little nuggets of knowledge like this can still pop up within the community.
Great find, guess it doesn't hurt to have some connections at Google
BUT, wasn't the URL like that for the previous awards as well? And only now the PR and indexing have gone...
The previous years' URL included the /. For some reason we changed it this year... but we'll be changing it back ;)
Excellent find Jane.
Is the restriction only for .0 or for all the decimal values like .1 .2 .3 etc. Please clarify.
Pratheep
www.aztecsoft.com
it's hard to clarify - but if you read the comments above, it's definitely affecting .1 and .0
We can't see the same on any other decimals.
Very interesting article - and it's also surprising that this article's URL didn't end with .0 but -0! I hope Google doesn't penalize for -0 soon :D !
Thanks for sharing your experience and detailed information about this.
Cheers!
P.S: Having issues with the JavaScript WYSIWYG editor on Opera 9.5! Does anyone have a similar issue?
Great info - we are just about to launch our site, and this will be very useful when we do. Thanks for the info.
By the way what else is Google telling you? And how much do I have to pay you for it?
Many thanks for the great info. Adding it to my arsenal ;)
Great job of laying out the research you did Jane! Still though, I am really interested to know why Google doesn't want to release this officially.
Immense. Another one to add to the "basics" list. Didn't know this. Great work. Good, good, good.
Good post, but an even better find!
- Eric
p.s. More evidence that a drop in the green stuff does not necessarily mean a drop in rankings or traffic.
Well the page did disappear from the rankings and thus wasn't getting traffic for terms like "web 2.0" and "web 2.0 awards", but it didn't harm the rest of our site, and it'll be fixed now :)
nice.
Good find! Like Matt pointed out, they are revisiting their decision. I'm just curious why these pages (https://www.google.com/search?hl=en&q=filetype%3A0&safe=off) are not cached even though a cached link appears on them, and the green bar is still gray on all of them.
Curiouser and curiouser - I know I'm new to all this, but I'm not sure why google would want to penalise in this way.
Anyways - thanks for sharing this :)
Now is this a late April joke or what?
I cannot believe a valid URL per http specification would just be the ONLY reason for deindexation and PR removal.
This is just as obscure as a statement like "Don't be evil" from a company like Google forcing a healthy webmaster community into a Nazi-like denunciation pact, AKA negative SEO and Google bowling
LOL, just read Matt's explanation for "disabled" file extensions
https://www.mattcutts.com/blog/dont-end-your-urls-with-exe/
So those URLs will not even be checked via MIME type - they won't be requested at all... so this appears to be due to some optimization in their web crawler to save bandwidth :-)
Thanks Rand for researching this out..for Jane posting it..for Matt Cutts adding in his 2 cents..and for Google for considering making some changes. This tidbit of knowledge benefits us all.
Anyone know how Drupal 6.x outputs X.0?
I don't have much experience with d6 yet, but I guess as always you can use Pathauto to be sure you don't have trouble with such issues.
I added a test listing to my directory after reading this, deliberately ending it in .0
The page was indexed by Google on the same day and ranks no. 1 for the target phrase SEO Companies in Warwickshire
https://www.uksmallbusinessdirectory.co.uk/business-listings.asp?strCompanyName=2.0
as you can see it ends with .0
does the freshbot ignore dropping URLs like this? and is that page likely to get dropped from the index when Google does a proper crawl?
URLs ending in .0 are not filtered now (at least, not in the way they were), but with the old filters, our URL was in the index for a couple of weeks before it was removed. It sure seemed like it was found, indexed, ranked and then noticed and removed, but ours was also a 301 from /web2.0/ to /web2.0. Thus, the page wasn't brand new...
I don't know what happened to brand new pages that ended in .0. Now, they should be in there just fine though.