At the end of last year the website I work on, LocateTV, moved into the cloud with Amazon Web Services (AWS) to take advantage of increased flexibility and reduced running costs. A while after we switched I found that Googlebot was crawling the site almost twice as much as it used to. Looking into it some more, I found that Google had been crawling the site from a subdomain of amazonaws.com.
The problem is, when you start up a server on AWS it automatically gets a public DNS entry which looks a bit like ec2-123-456-789-012.compute-1.amazonaws.com. This means that the server is reachable through this domain as well as the main domain that you have pointed at the same IP address. For us the problem was doubled: we have two web servers behind our main domain, so the whole site was being crawled through two different amazonaws.com subdomains as well as through www.locatetv.com.
Now there were no external links to these AWS subdomains but Google, being a domain registrar, was notified of the new DNS entries and went ahead and indexed loads of pages. All this was creating extra load on our servers and a huge duplicate content problem (which I cleaned up, after quite a bit of trouble - more below).
A pretty big mess.
I thought I'd do some analysis into how many other sites were being affected by this problem. A quick search on Google for site:compute-1.amazonaws.com and site:compute.amazonaws.com reveals almost half a million web pages indexed (the stats from this command are often dodgy, but it gives a sense of the scale of the issue):
My guess is that most of these pages are duplicate content with the site owners having separate DNS entries for their site. Certainly this is the case for the first few sites I checked:
- https://ec2-67-202-8-9.compute-1.amazonaws.com is the same as https://www.broadjam.com
- https://ec2-174-129-207-154.compute-1.amazonaws.com is the same as https://www.elephantdrive.com
- https://ec2-174-129-253-143.compute-1.amazonaws.com is the same as https://boxofficemojo.com
- https://ec2-174-129-197-200.compute-1.amazonaws.com is the same as https://www.promotofan.com
- https://ec2-184-73-226-122.compute-1.amazonaws.com is the same as https://www.adbase.com
For Box Office Mojo, Google is reporting 76,500 pages indexed for the amazonaws.com address. That's a lot of duplicate content in the index. A quick search for something specific like "Fastest Movies to Hit $500 Million at the Box Office" shows duplicates from both domains (plus a secure subdomain and the IP address of one of their servers - oops!):
Whilst I imagine Google would be doing a reasonable job of filtering out the duplicates when it comes to most keywords, it's still pretty bad to have all this duplicate content in the index and all that wasted crawl time.
This is pretty dumb for Google (and other search engines) to be doing. It's pretty easy to work out that both the real domain and the AWS subdomain resolve to the same IP address and that the pages are the same. They could save themselves a whole lot of time crawling URLs that only exist because of a duplicate DNS entry.
Fixing the source of the problem.
As good SEOs we know that we should do whatever we can to make sure that there is only one domain name resolving to a site. There is no way, at the moment, to stop AWS from adding the public DNS entries, so one way to solve this is to make sure that requests arriving via the AWS subdomain are redirected to the main domain. Here is an example of how to do this using Apache mod_rewrite:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]
This can be put either in the httpd.conf file or the .htaccess file and basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com then 301 redirect all URLs to the equivalent URL on www.mydomain.com.
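If you have several web servers, each with its own EC2 hostname (as we do), you could instead match any compute-1.amazonaws.com host with a single condition. Something like this should do it - a sketch I haven't tested in anger, so adjust it for your own setup:

RewriteCond %{HTTP_HOST} \.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]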
This fix quickly stopped Googlebot from crawling our amazonaws.com subdomain addresses, which took considerable load off our servers, but by the time I'd spotted the problem there were thousands of pages indexed. As these pages were probably not doing any harm I thought I'd just let Google find all the 301 redirects and remove the pages from the index. So I waited, and waited, and waited. After a month the number of pages indexed (according to the site: command) was exactly the same. No pages had dropped out of the index.
Cleaning it up.
To help Google along I decided to submit a removal request using Webmaster Tools. I temporarily removed the 301 redirects to allow Google to see my site verification file (obviously it was being redirected to the verification file on my main domain) and then put the redirects back in. I submitted a full site removal request but it was rejected because the domain was not being blocked by robots.txt. Again, this is pretty dumb in my opinion because the whole of the subdomain was being redirected to the correct domain.
As I was a bit annoyed that the removal request would not work in the way I wanted it to, I thought I'd leave Google another month to see if it found the 301 redirects. After at least another month, no pages had dropped out of the index. This backs up my suspicion that Google does a pretty poor job of finding 301 redirects for stuff that isn't in the web's link graph. I have seen this before, where I have changed URLs, updated all internal links to point at the new URLs and redirected the old URLs. Google doesn't seem to go back through its index and re-crawl pages that it hasn't found in its standard web crawl to see if they have been removed or redirected (or if it does, it does it very, very slowly).
Having had no luck with the 301 approach, I decided to switch to using a robots.txt file to block Google. The issue here is that, clearly, I didn't want to edit my main robots.txt to block bots, as that would stop crawling of my main domain. Instead, I created a file called robots-block.txt that contained the usual blocking instructions:
User-agent: *
Disallow: /
I then replaced the redirect entries in my .htaccess file with something like this:
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^robots\.txt$ robots-block.txt [L]
This basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com and the requested path is robots.txt then serve the robots-block.txt file instead. This means I effectively have a different robots.txt file served from this subdomain. Having done this I went back to Webmaster Tools, submitted the site removal request and this time it was accepted. "Hey presto", my duplicate content was gone! For good measure I then replaced the robots.txt mod_rewrite with the original redirect commands to make sure any real users are redirected properly.
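For what it's worth, you could also keep both behaviours in the file at the same time while the removal request is pending - serve the blocking robots.txt on the AWS hostname and 301 everything else. This is a sketch of that idea rather than exactly what I ran (and note that you'd need a similar exception for the Webmaster Tools verification file while you verify the subdomain):

# Serve the blocking robots.txt only when accessed via the EC2 hostname
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^robots\.txt$ robots-block.txt [L]

# Everything else requested via the EC2 hostname gets redirected to the main domain
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]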
Reduce, reuse, recycle.
This was all a bit of a fiddle to sort out and I doubt many webmasters hosting on AWS will have even realised that this is an issue. This is not purely limited to AWS, as a number of other hosting providers also create alternative DNS entries. It is worth finding out what DNS entries are configured for the web server(s) serving a site (this isn't always that easy but you can use your access logs/analytics to get an idea) and then making sure that redirects are in place to the canonical domain. If you need to remove any indexed pages then hopefully you can do something similar to the solution I proposed above.
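One way to get that idea from your access logs (a sketch on my part - the log name and path are just examples, so adapt them to your server) is to ask Apache to record the Host header, which shows exactly which hostname each request arrived on:

# Log format that records the Host header the client asked for
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Host}i\" \"%{User-agent}i\"" host_combined
CustomLog logs/access_log host_combined

Grep the resulting log for anything other than your canonical domain and you will quickly see whether bots are crawling you through an amazonaws.com name (or a stray IP address).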
There are some things that Google could do to help solve this problem:
- Be a bit more intelligent in detecting duplicate domain entries for the same IP address.
- Put some alerts into Webmaster Tools so webmasters know there is a potential issue.
- Get better at re-crawling pages in the index not found in the standard crawl to detect redirects.
- Add support for site removal when a site-wide redirect is in place.
In the meantime, hopefully I've given some actionable advice if this is a problem for you.
Amazing heads up AND solution - great work Stephen!
Thanks Richard. Let's hope Google gets better at detecting this stuff.
Good post thanks,
Had problems in the past with duplicates and had done 301s too, but thankfully they were picked up by Google within a 2-3 week period. I was following the advice in the following post, which applies to a full domain move and is easy to follow: https://www.seomoz.org/blog/seo-guide-how-to-properly-move-domains
Thanks for the comment Tatiana. Glad that you had more success in cleaning up the duplicates. Out of interest, how many pages did you have to clean up?
Unfortunately that guide on moving domains doesn't work for this situation because you can only fill in a "Change of Address form" in Google Webmaster Tools for top level domains. When I access that page for the AWS subdomain I get a message "Restricted to root level domains only".
I did not have to use the change of address either, as we only had to move certain sections. But I found the tip regarding submission of the old sitemap very useful.
We had to clean up around 7k pages, mainly due to our keen developers, who wanted to finish everything early and didn't check what they were doing.
I am not too technical myself, so it was a hard task to get across to them what I meant and needed. Hence I'm working on my knowledge of 301 rules, and found your post very useful.
Is this guy MIB?
Haha. Actually I'm an agent in The Matrix.
That was a good solution.. and I think serving a different robots.txt file could be applied in a number of areas!
Thank you again
Hey, Stephen,
I love it when the post is technical but I actually understand it. (Copywriters are dumber than dirt and APIs only confuse us.)
Now if you could just tell me where the off button is on my new laptop.
Thanks for a new perspective. Very spinnable. Already working it into my pitch.
Neo
Cool - I actually thought the post was a little too technical!
Nice post Stephen!
Really smart solution there with the alternative robots.txt file!
Thanks! Yeh, you've gotta love mod_rewrite - can solve all sorts of problems.
I'm not sold on the Cloud yet, and for me this is more evidence to wait.
Don't get me wrong - I really like the infrastructure at AWS. It really has reduced our running costs compared to a traditional managed hosting environment and also given us a lot of flexibility in terms of being able to start new servers whenever we need one. This doesn't just have to be to increase capacity but also for testing out changes to our site or infrastructure before we deploy them.
Stephen, I'd like to get your feedback on a tool we made to help with AWS costs/pricing:
https://awswatch.spikesource.com/
It's just a prototype now, but let me know if you think it can help you scale costs.
Just throw me on the pile of people who think this is a great post. Thorough, informative and well articulated.
I know you wrote this ages ago but I wanted to ask you a question. I have a client who has an Amazon webstore and all of her product descriptions are duplicated all over the web on these small ecommerce sites/blog sites. I suppose Amazon is doing the syndication, but this poses a huge problem (I assume) of duplicate content issues for the ranking of her main website. How could I fix this issue other than having her rewrite the 6500 product descriptions of her store products? Thanks.
Hey, I have a site hosted on Amazon's neat infrastructure too :-). Just a few comments on this:
- Please do not use a robots.txt with a blanket disallow to deal with duplicate content. Doing that would completely block the ability to see redirects and does not prevent those URLs from showing up in search results. Even without redirects, it's important that we can crawl alternate versions so that we can recognize them as alternates.
- We have a number of tips regarding how to handle cross-domain duplicate content at https://googlewebmastercentral.blogspot.com/2009/12/handling-legitimate-cross-domain.html . Another simple trick to keep it from getting out of hand is to just use full URLs in your site's navigation.
- Keep in mind that AWS instances have dynamic IPs (and public DNS entries) - stopping & restarting one would give the site a new IP unless you're explicitly preventing that. Make sure that the .htaccess rule for the redirect is based on the preferred host name and not one particular alternate version (that would cover the IP address as well as any random host name that is pointing to the same IP address).
- While we may be indexing a lot of content from these alternate sources, we're generally pretty good at picking the "right" versions to show in the search results. There is no need to manually remove those results, just as there is no need to manually remove indexed IP addresses -- if you spot them, just make sure that you handle canonicalization properly and they'll go away on their own, over time. In general, having an alternate version indexed will not cause your site problems, so don't panic, but as with any issue, if you spot it, try to fix it :-).
Hiya. Here are my thoughts on your comments
- Yes, I really didn't want to blanket disallow but the problem was that even after a few months Google had not detected the 301 redirects and so this was the only way Google would allow me to submit the URL removal in WMT. As I said in the post, once the URLs were removed I put the 301 redirects back.
- Some nice tips. The issue with full URLs in navigation is that it won't work in a development system (as it would send you to the main site instead of your dev site).
- Yes - we use Elastic IP addresses to create a fixed external IP address which we use in DNS (i.e. the DNS points to the IP not the AWS subdomain).
-"they'll go away on their own, over time" - well I disagree with this point. Even when I had a 301 redirect in place they didn't drop out of the index even after waiting months for it to happen. And, as I demonstrated in the post, you can have issues if you have multiple alternative versions indexed (see the Box Office Mojo example). The biggest issue for us was the additional load the bot was putting on our servers crawling it from many different URLs.
I just want to reinforce that using the robots.txt disallow to handle duplicate content or canonicalization issues is a really, really bad idea. I would very, very strongly recommend not doing that. You do not solve a duplicate content problem by not letting crawlers access the URLs that you're trying to de-duplicate. At any rate, combining redirects with a robots.txt disallow has no effect: if we can't crawl the URLs, we can't find the redirects. Using the URL removal tool to hide the duplicates does not solve that problem either.
Optimally, a server will be set up properly so that it only responds to requests for the known host names (this is fairly easy to do with the web-servers I've worked with), which would avoid running into this problem. Handling it with proper 301 redirects is the preferred means of solving it if the server was not set up correctly initially -- and it will take quite some time to have all of the obscure, duplicate URLs crawled and the redirects found, that's normal and to be expected.
If this is a problem that you see on a lot of other AWS sites, it might be useful to have a blog post about setting up a server properly using the more popular AWS images. If a site is set up properly from the start, you won't need to do redirects like this, but maybe it's not completely clear how to do that with the existing docs. Do a quick survey of the sites that you see indexed with the wrong canonicals and find out what they're using, then write an awesome blog post about how to set it up right :-).
I agree - I wouldn't use robots.txt to handle duplicate content - I only used it to bulk remove the thousands of duplicate URLs from the index because Google was so very, very slow at finding the 301s that I had put in place and WMT gave me no other option to remove the URLs than having a robots.txt in place.
I agree that ideally the servers would be set up correctly, and I thought that is what we had (we never had the issue before moving to AWS). I think the solution leadegroot suggested above is quite a nice way of solving the problem. I also think having reverse DNS set up correctly for AWS (which is something they've only been able to do recently) would help solve the problem.
As you are a Googler, what did you make of the suggestions I made at the end of the post that Google could implement to help solve this problem? I doubt all webmasters using AWS (or other platforms) are going to be savvy enough to know about this and will have taken standard configurations that don't have the appropriate set-up to canonicalize. They may be experiencing additional server load without knowing why.
Also, do you have a comment on why Google didn't find the 301 redirects for URLs in the index even after months of waiting?
Hi Stephen
This is not really a DNS problem, it's a problem that the server is configured in a way to allow all host names to be accessed through the IP address. While that sometimes makes sense, it does cause perceived problems like this. Usually this is something the server admin would handle (on Apache in the virtual hosts configuration, as far as I know).
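Something along these lines is what I mean - a rough, untested sketch using Apache 2.2-style config, where the host names and paths are just placeholders:

NameVirtualHost *:80

# The first vhost is the default: any Host header not matched below lands here
# and gets sent to the canonical domain
<VirtualHost *:80>
    ServerName catchall.invalid
    Redirect permanent / https://www.mydomain.com/
</VirtualHost>

# The real site only answers to its known host name
<VirtualHost *:80>
    ServerName www.mydomain.com
    DocumentRoot /var/www/mysite
</VirtualHost>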
There are a lot of ways to solve this already, so I don't really think it makes sense to do something special for a situation like this; redirect and move on to the next real problem. Canonicalization issues like this are fairly common (Google has the same on some of the sites), but they generally don't cause any visible problems, so it's not really something worth losing any sleep over :).
Once you see it and set up redirects, we'll see the redirects once we crawl those URLs. If we're still keeping some old URLs indexed after months, then chances are that we're not crawling those URLs very frequently, and accordingly it's not going to cause a problem bandwidth-wise or search-results-wise.
FWIW It looks like your alternate host names are still not redirecting properly, I see secure., www2., https:// and admin. on the first page of a site-query, and they're still returning the same content. Also, your lu.php script seems to be using a 302 redirect to resolve the short URLs. These things all contribute to the duplicate content as well; using the rel=canonical would help if redirecting all non-matches is a problem. (sorry, too much looking at websites makes these things jump out & I'd prefer to mention them when I see them :-)).
What site are you talking about here?
Thanks for this awesome knowledge... hats off!
Is that a white hat or a black hat off?
Great post Stephen, very helpful as I'm trying to get rid of LOTS of dupes in my webstore. I wasn't aware that you can actually request a removal by Google Webmaster Tools. How can you actually do that?
Cheers,
Stefan
If you expand "Site Configuration" from the left navigation then click "Crawler Access" there is a tab called "Remove URL". This you can use to remove individual URLs, entire directories or an entire site.
Is there an issue with speed of indexing here with the solution given?
If you serve a blocking robots.txt to user agents accessing the newly created domain that AWS registers, will the pages of the site still be indexed as quickly?
Of course one doesn't want the AWS registered domain pages to supersede the example.com pages (which they perhaps would by being earlier to enter the index). But also one wants those pages to be indexed as soon as possible.
My question really is whether using a rel=canonical link tag on-page would have achieved the same thing but also optimised the entry of pages into the index.
To spell it out if ec2-123-456-789-012.compute-1.amazonaws.com/page.html was found by googlebot before example.com/page.html then googlebot would see the canonical tag and refer to (and index) the correctly labelled canonical page, no? It might take a lot longer for googlebot to come around and find the page via the example.com domain.
I do what leadegroot suggests on my domains however.
Kind of a related question, but maybe not: if you host your site through Amazon Web Services, does that mean it shares IPs with other sites that use AWS, meaning those sites all appear to be from the same IP to the search engines?
Generally you would get what they call an "Elastic IP Address", which is a fixed and dedicated IP that you can use for your site.
I wouldn't have thought AWS has enough IP addresses to give out a dedicated IP address to every customer, not until IPv6 comes along.
Yup - I'm not sure if they would but I guess it doesn't matter for some applications of cloud services such as CDNs or data services, where a DNS entry will be good enough.
Really, one of the default setups you should use for any site is something like:
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
This will stop a lot of different 'wrong address' problems.
Encountering problems down the line because you haven't done this, or its equivalent, is just a little unprofessional.
I could not implement this rewrite rule because we have multiple web servers and often I will want to directly access a single server to check that it is functioning correctly. If I implemented the rule you suggested then I would always be redirected to the main domain, where I can't guarantee which web server I would be accessing.
It's not hard to exclude a given IP from the rewrite.
True, although that IP can then get indexed - e.g. see what's happening to Box Office Mojo through their IP address:
site:174.129.253.143
I'm guessing this all links back to the DNS issues, as I don't think Google usually indexes a site through both its DNS name and its IP address.
As I said in the post, I think Google could be helping with this problem by being a bit smarter and giving better tools and notifications.
No, not quite what I meant.
I haven't needed to do it myself, so this is untested (and done a bit quickly at this time of the morning), but the general concept is correct:
use:
RewriteCond %{REMOTE_ADDR} !^999\.999\.999\.999
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
and replace the 999... with your IP (use one of the whatsmyip services if you don't know what it is)
i.e. if the visitor is not at IP 999... and the visitor is not on domain www.mydomain.com then redirect the visitor to www.mydomain.com
Then people on your IP are able to reach the site by alternate domains - but all the crawling bots, and the run of the mill visitors, will be corrected to the 'correct' domain.
The most common use of the simple ruleset is to avoid simple www canonicalisation problems, but at the same time it fixes a myriad of problems, like the one you saw, as well as e.g. strangers accidentally pointing their DNS at your server (happens more than you'd think).
It's useful stuff to fix generically rather than specifically :)
Ok - with you now - I didn't realise you meant the user IP instead of the server IP! This seems like a good solution to the problem, particularly if you put it in httpd.conf (because putting it in .htaccess would mean that you couldn't run a copy of the site on a development server without it redirecting all the time).
I get around the devbox thing with the line:
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteCond %{HTTP_HOST} !^mydevbox\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
but it is an overhead :(
Yes, httpd.conf would be an installation-specific way of hitting the problem :)
[edit: oh, those line breaks weren't readable!]