At the end of last year the website I work on, LocateTV, moved into the cloud with Amazon Web Services (AWS) to take advantage of increased flexibility and reduced running costs. A while after we switched I found that Googlebot was crawling the site almost twice as much as it used to. Looking into it some more, I found that Google had been crawling the site from a subdomain of amazonaws.com.
The problem is, when you start up a server on AWS it automatically gets a public DNS entry which looks a bit like ec2-123-456-789-012.compute-1.amazonaws.com. This means that the server is reachable through this domain as well as the main domain that you have pointed at the same IP address. For us the problem was doubled: we have two web servers behind our main domain, so the whole site was being crawled through two different amazonaws.com subdomains as well as through www.locatetv.com.
Now there were no external links to these AWS subdomains but Google, being a domain registrar, was notified of the new DNS entries and went ahead and indexed loads of pages. All this was creating extra load on our servers and a huge duplicate content problem (which I cleaned up, after quite a bit of trouble - more below).
A pretty big mess.
I thought I'd do some analysis into how many other sites were being affected by this problem. A quick search on Google for site:compute-1.amazonaws.com and site:compute.amazonaws.com reveals almost half a million web pages indexed (the stats from this command are often dodgy, but it gives a sense of the scale of the issue):
My guess is that most of these pages are duplicate content with the site owners having separate DNS entries for their site. Certainly this is the case for the first few sites I checked:
- https://ec2-67-202-8-9.compute-1.amazonaws.com is the same as https://www.broadjam.com
- https://ec2-174-129-207-154.compute-1.amazonaws.com is the same as https://www.elephantdrive.com
- https://ec2-174-129-253-143.compute-1.amazonaws.com is the same as https://boxofficemojo.com
- https://ec2-174-129-197-200.compute-1.amazonaws.com is the same as https://www.promotofan.com
- https://ec2-184-73-226-122.compute-1.amazonaws.com is the same as https://www.adbase.com
For Box Office Mojo, Google is reporting 76,500 pages indexed for the amazonaws.com address. That's a lot of duplicate content in the index. A quick search for something specific like "Fastest Movies to Hit $500 Million at the Box Office" shows duplicates from both domains (plus a secure subdomain and the IP address of one of their servers - oops!):
Whilst I imagine Google would be doing a reasonable job of filtering out the duplicates when it comes to most keywords, it's still pretty bad to have all this duplicate content in the index and all that wasted crawl time.
This is pretty dumb for Google (and other search engines) to be doing. It's pretty easy to work out that both the real domain and the AWS subdomain resolve to the same IP address and that the pages are the same. They could save themselves a whole lot of time crawling URLs that only exist because of a duplicate DNS entry.
Fixing the source of the problem.
As good SEOs we know that we should do whatever we can to make sure that there is only one domain name resolving to a site. There is no way, at the moment, to stop AWS from adding the public DNS entries, so one way to solve this is to make sure that requests arriving via the AWS subdomain are redirected to the main domain. Here is an example of how to do this using Apache mod_rewrite:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]
This can be put either in the httpd.conf file or the .htaccess file and basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com then 301 redirect all URLs to the equivalent URL on www.mydomain.com.
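If you have several web servers, each with its own EC2 hostname (as we do), you could instead match any compute-1.amazonaws.com host with a single condition. Something like this should do it - a sketch I haven't tested in anger, so adjust it for your own setup:

RewriteCond %{HTTP_HOST} \.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]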
This fix quickly stopped Googlebot from crawling our amazonaws.com subdomain addresses, which took considerable load off our servers, but by the time I'd spotted the problem there were thousands of pages indexed. As these pages were probably not doing any harm I thought I'd just let Google find all the 301 redirects and remove the pages from the index. So I waited, and waited, and waited. After a month the number of pages indexed (according to the site: command) was exactly the same. No pages had dropped out of the index.
Cleaning it up.
To help Google along I decided to submit a removal request using Webmaster Tools. I temporarily removed the 301 redirects to allow Google to see my site verification file (obviously it was being redirected to the verification file on my main domain) and then put the redirects back in. I submitted a full site removal request but it was rejected because the domain was not being blocked by robots.txt. Again, this is pretty dumb in my opinion because the whole of the subdomain was being redirected to the correct domain.
As I was a bit annoyed that the removal request would not work in the way I wanted it to, I thought I'd leave Google another month to see if it found the 301 redirects. After at least another month, no pages had dropped out of the index. This backs up my suspicion that Google does a pretty poor job of finding 301 redirects for stuff that isn't in the web's link graph. I have seen this before, where I have changed URLs, updated all internal links to point at the new URLs and redirected the old URLs. Google doesn't seem to go back through its index and re-crawl pages that it hasn't found in its standard web crawl to see if they have been removed or redirected (or if it does, it does it very, very slowly).
Having had no luck with the 301 approach, I decided to switch to using a robots.txt file to block Google. The issue here is that, clearly, I didn't want to edit my main robots.txt to block bots, as that would stop crawling of my main domain. Instead, I created a file called robots-block.txt that contained the usual blocking instructions:
User-agent: *
Disallow: /
I then replaced the redirect entries in my .htaccess file with something like this:
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^robots\.txt$ robots-block.txt [L]
This basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com and the requested path is robots.txt then serve the robots-block.txt file instead. This means I effectively have a different robots.txt file served from this subdomain. Having done this I went back to Webmaster Tools, submitted the site removal request and this time it was accepted. "Hey presto", my duplicate content was gone! For good measure I then replaced the robots.txt mod_rewrite with the original redirect commands to make sure any real users are redirected properly.
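For what it's worth, you could also keep both behaviours in the file at the same time while the removal request is pending - serve the blocking robots.txt on the AWS hostname and 301 everything else. This is a sketch of that idea rather than exactly what I ran (and note that you'd need a similar exception for the Webmaster Tools verification file while you verify the subdomain):

# Serve the blocking robots.txt only when accessed via the EC2 hostname
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^robots\.txt$ robots-block.txt [L]

# Everything else requested via the EC2 hostname gets redirected to the main domain
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ https://www.mydomain.com/$1 [R=301,L]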
Reduce, reuse, recycle.
This was all a bit of a fiddle to sort out and I doubt many webmasters hosting on AWS will have even realised that this is an issue. This is not purely limited to AWS, as a number of other hosting providers also create alternative DNS entries. It is worth finding out what DNS entries are configured for the web server(s) serving a site (this isn't always that easy but you can use your access logs/analytics to get an idea) and then making sure that redirects are in place to the canonical domain. If you need to remove any indexed pages then hopefully you can do something similar to the solution I proposed above.
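One way to get that idea from your access logs (a sketch on my part - the log name and path are just examples, so adapt them to your server) is to ask Apache to record the Host header, which shows exactly which hostname each request arrived on:

# Log format that records the Host header the client asked for
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Host}i\" \"%{User-agent}i\"" host_combined
CustomLog logs/access_log host_combined

Grep the resulting log for anything other than your canonical domain and you will quickly see whether bots are crawling you through an amazonaws.com name (or a stray IP address).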
There are some things that Google could do to help solve this problem:
- Be a bit more intelligent in detecting duplicate domain entries for the same IP address.
- Put some alerts into Webmaster Tools so webmasters know there is a potential issue.
- Get better at re-crawling pages in the index not found in the standard crawl to detect redirects.
- Add support for site removal when a site-wide redirect is in place.
In the meantime, hopefully I've given some actionable advice if this is a problem for you.
Amazing heads up AND solution - great work Stephen!
Thanks Richard. Let's hope Google gets better at detecting this stuff.
Good post thanks,
Had problems in the past with duplicates and had done 301s too, but thankfully they were picked up by Google within a 2-3 week period. I was following the advice in the following post, which applies to a full domain move and is easy to follow: https://www.seomoz.org/blog/seo-guide-how-to-properly-move-domains
Thanks for the comment Tatiana. Glad that you had more success in cleaning up the duplicates. Out of interest, how many pages did you have to clean up?
Unfortunately that guide on moving domains doesn't work for this situation because you can only fill in a "Change of Address form" in Google Webmaster Tools for top level domains. When I access that page for the AWS subdomain I get a message "Restricted to root level domains only".
I did not have to use the change of address either, as we only had to move certain sections. But I found the tip regarding submission of the old sitemap very useful.
We had to clean up around 7k pages, mainly due to our keen developers, who wanted to finish everything early and didn't check what they were doing.
I am not too technical myself, so it was a hard task to get across to them what I meant and needed. Hence I'm working on my knowledge of 301 rules, and found your post very useful.
Is this guy MIB?
Haha. Actually I'm an agent in The Matrix.
That was a good solution.. and I think serving a different robots.txt file could be applied in a number of areas!
Thank you again
Hey, Stephen,
I love it when the post is technical but I actually understand it. (Copywriters are dumber than dirt and APIs only confuse us.)
Now if you could just tell me where the off button is on my new laptop.
Thanks for a new perspective. Very spinnable. Already working it into my pitch.
Neo
Cool - I actually thought the post was a little too technical!
Nice post Stephen!
Really smart solution there with the alternative robots.txt file!
Thanks! Yeh, you've gotta love mod_rewrite - can solve all sorts of problems.
I'm not sold on the Cloud yet, and for me this is more evidence to wait.
Don't get me wrong - I really like the infrastructure at AWS. It really has reduced our running costs compared to a traditional managed hosting environment and also given us a lot of flexibility in terms of being able to start new servers whenever we need one. This doesn't just have to be to increase capacity but also for testing out changes to our site or infrastructure before we deploy them.
Stephen, I'd like to get your feedback on a tool we made to help with AWS costs/pricing:
https://awswatch.spikesource.com/
It's just a prototype now, but let me know if you think it can help you scale costs.
Just throw me on the pile of people who think this is a great post. Thorough, informative and well articulated.
I know you wrote this ages ago but I wanted to ask you a question. I have a client who has an Amazon webstore and all of her product descriptions are duplicated all over the web on these small ecommerce sites/blog sites. I suppose Amazon is doing the syndication, but this poses a huge problem (I assume) of duplicate content issues for the ranking of her main website. How could I fix this issue other than having her rewrite the 6500 product descriptions of her store products? Thanks.
Hey, I have a site hosted on Amazon's neat infrastructure too :-). Just a few comments on this:
- Please do not use a robots.txt with a blanket disallow to deal with duplicate content. Doing that would completely block the ability to see redirects and does not prevent those URLs from showing up in search results. Even without redirects, it's important that we can crawl alternate versions so that we can recognize them as alternates.
- We have a number of tips regarding how to handle cross-domain duplicate content at https://googlewebmastercentral.blogspot.com/2009/12/handling-legitimate-cross-domain.html . Another simple trick to keep it from getting out of hand is to just use full URLs in your site's navigation.
- Keep in mind that AWS instances have dynamic IPs (and public DNS entries) - stopping & restarting one would give the site a new IP unless you're explicitly preventing that. Make sure that the .htaccess rule for the redirect is based on the preferred host name and not one particular alternate version (that would cover the IP address as well as any random host name that is pointing to the same IP address).
- While we may be indexing a lot of content from these alternate sources, we're generally pretty good at picking the "right" versions to show in the search results. There is no need to manually remove those results, just as there is no need to manually remove indexed IP addresses -- if you spot them, just make sure that you handle canonicalization properly and they'll go away on their own, over time. In general, having an alternate version indexed will not cause your site problems, so don't panic, but as with any issue, if you spot it, try to fix it :-).
Hiya. Here are my thoughts on your comments
- Yes, I really didn't want to blanket disallow but the problem was that even after a few months Google had not detected the 301 redirects and so this was the only way Google would allow me to submit the URL removal in WMT. As I said in the post, once the URLs were removed I put the 301 redirects back.
- Some nice tips. The issue with full URLs in navigation is that it won't work in a development system (as it would send you to the main site instead of your dev site).
- Yes - we use Elastic IP addresses to create a fixed external IP address which we use in DNS (i.e. the DNS points to the IP not the AWS subdomain).
-"they'll go away on their own, over time" - well I disagree with this point. Even when I had a 301 redirect in place they didn't drop out of the index even after waiting months for it to happen. And, as I demonstrated in the post, you can have issues if you have multiple alternative versions indexed (see the Box Office Mojo example). The biggest issue for us was the additional load the bot was putting on our servers crawling it from many different URLs.
I just want to reinforce that using the robots.txt disallow to handle duplicate content or canonicalization issues is a really, really bad idea. I would very, very strongly recommend not doing that. You do not solve a duplicate content problem by not letting crawlers access the URLs that you're trying to de-duplicate. At any rate, combining redirects with a robots.txt disallow has no effect: if we can't crawl the URLs, we can't find the redirects. Using the URL removal tool to hide the duplicates does not solve that problem either.
Optimally, a server will be set up properly so that it only responds to requests for the known host names (this is fairly easy to do with the web-servers I've worked with), which would avoid running into this problem. Handling it with proper 301 redirects is the preferred means of solving it if the server was not set up correctly initially -- and it will take quite some time to have all of the obscure, duplicate URLs crawled and the redirects found, that's normal and to be expected.
If this is a problem that you see on a lot of other AWS sites, it might be useful to have a blog post about setting up a server properly using the more popular AWS images. If a site is set up properly from the start, you won't need to do redirects like this, but maybe it's not completely clear how to do that with the existing docs. Do a quick survey of the sites that you see indexed with the wrong canonicals and find out what they're using, then write an awesome blog post about how to set it up right :-).
I agree - I wouldn't use robots.txt to handle duplicate content - I only used it to bulk remove the thousands of duplicate URLs from the index because Google was so very, very slow at finding the 301s that I had put in place and WMT gave me no other option to remove the URLs than having a robots.txt in place.
I agree that ideally the servers would be set up correctly, and I thought that is what we had (we never had the issue before moving to AWS). I think the solution leadegroot suggested above is quite a nice way of solving the problem. I also think having reverse DNS set up correctly for AWS (which is something they've only been able to do recently) would help solve the problem.
As you are a Googler, what did you make of the suggestions I made at the end of the post that Google could implement to help solve this problem? I doubt all webmasters using AWS (or other platforms) are going to be savvy enough to know about this and will have taken standard configurations that don't have the appropriate set-up to canonicalize. They may be experiencing additional server load without knowing why.
Also, do you have a comment on why Google didn't find the 301 redirects for URLs in the index even after months of waiting?
Hi Stephen
This is not really a DNS problem, it's a problem that the server is configured in a way to allow all host names to be accessed through the IP address. While that sometimes makes sense, it does cause perceived problems like this. Usually this is something the server admin would handle (on Apache in the virtual hosts configuration, as far as I know).
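Something along these lines is what I mean - a rough, untested sketch using Apache 2.2-style config, where the host names and paths are just placeholders:

NameVirtualHost *:80

# The first vhost is the default: any Host header not matched below lands here
# and gets sent to the canonical domain
<VirtualHost *:80>
    ServerName catchall.invalid
    Redirect permanent / https://www.mydomain.com/
</VirtualHost>

# The real site only answers to its known host name
<VirtualHost *:80>
    ServerName www.mydomain.com
    DocumentRoot /var/www/mysite
</VirtualHost>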
There are a lot of ways to solve this already, so I don't really think it makes sense to do something special for a situation like this; redirect and move on to the next real problem. Canonicalization issues like this are fairly common (Google has the same on some of the sites), but they generally don't cause any visible problems, so it's not really something worth losing any sleep over :).
Once you see it and set up redirects, we'll see the redirects once we crawl those URLs. If we're still keeping some old URLs indexed after months, then chances are that we're not crawling those URLs very frequently, and accordingly it's not going to cause a problem bandwidth-wise or search-results-wise.
FWIW It looks like your alternate host names are still not redirecting properly, I see secure., www2., https:// and admin. on the first page of a site-query, and they're still returning the same content. Also, your lu.php script seems to be using a 302 redirect to resolve the short URLs. These things all contribute to the duplicate content as well; using the rel=canonical would help if redirecting all non-matches is a problem. (sorry, too much looking at websites makes these things jump out & I'd prefer to mention them when I see them :-)).
What site are you talking about here?
Thanks for this awesome knowledge... hats off!
Is that a white hat or a black hat off?
Great post Stephen, very helpful as I'm trying to get rid of LOTS of dupes in my webstore. I wasn't aware that you can actually request a removal by Google Webmaster Tools. How can you actually do that?
Cheers,
Stefan
If you expand "Site Configuration" from the left navigation then click "Crawler Access" there is a tab called "Remove URL". This you can use to remove individual URLs, entire directories or an entire site.
Is there an issue with speed of indexing here with the solution given?
If you serve a blocking robots.txt to user agents accessing the newly created domain that AWS registers, will the pages of the site still be indexed as quickly?
Of course one doesn't want the AWS registered domain pages to supersede the example.com pages (which they perhaps would by being earlier to enter the index). But also one wants those pages to be indexed as soon as possible.
My question really is whether using a rel=canonical link tag on-page would have achieved the same thing but also optimised the entry of pages into the index.
To spell it out if ec2-123-456-789-012.compute-1.amazonaws.com/page.html was found by googlebot before example.com/page.html then googlebot would see the canonical tag and refer to (and index) the correctly labelled canonical page, no? It might take a lot longer for googlebot to come around and find the page via the example.com domain.
I do what leadegroot suggests on my domains however.
Kind of a related question, but maybe not: if you host your site through Amazon Web Services, does that mean it shares IPs with other sites that use AWS, meaning those sites all appear to be from the same IP to the search engines?
Generally you would get what they call an "Elastic IP Address", which is a fixed and dedicated IP that you can use for your site.
I wouldn't have thought AWS has enough IP addresses to give out a dedicated IP address to every customer, not until IPv6 comes along.
Yup - I'm not sure if they would but I guess it doesn't matter for some applications of cloud services such as CDNs or data services, where a DNS entry will be good enough.
Really, one of the default setups you should use for any site is something like:
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
This will stop a lot of different 'wrong address' problems.
Encountering problems down the line because you haven't done this, or its equivalent, is just a little unprofessional.
I could not implement this rewrite rule because we have multiple web servers and often I will want to directly access a single server to check that it is functioning correctly. If I implemented the rule you suggested then I would always be redirected to the main domain, where I can't guarantee which web server I would be accessing.
It's not hard to exclude a given IP from the rewrite.
True, although that IP can then get indexed - e.g. see what's happening to Box Office Mojo through their IP address:
site:174.129.253.143
I'm guessing this all links back to the DNS issues, as I don't think Google usually indexes a site through both its DNS name and its IP address.
As I said in the post, I think Google could be helping with this problem by being a bit smarter and giving better tools and notifications.
No, not quite what I meant.
I haven't needed to do it myself, so this is untested (and done a bit quickly at this time of the morning), but the general concept is correct:
use:
RewriteCond %{REMOTE_ADDR} !^999\.999\.999\.999
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
and replace the 999... with your IP (use one of the whatsmyip services if you don't know what it is)
i.e. if the visitor is not at IP 999... and the visitor is not on domain www.mydomain.com then redirect the visitor to www.mydomain.com
Then people on your IP are able to reach the site by alternate domains - but all the crawling bots, and the run of the mill visitors, will be corrected to the 'correct' domain.
The most common use of the simple ruleset is to avoid simple www canonicalisation problems, but at the same time it fixes a myriad of problems, like the one you saw, as well as e.g. strangers accidentally pointing their DNS at your server (happens more than you'd think).
It's useful stuff to fix generically rather than specifically :)
Ok - with you now - I didn't realise you meant the user IP instead of the server IP! This seems like a good solution to the problem, particularly if you put it in httpd.conf (because putting it in .htaccess would mean that you couldn't run a copy of the site on a development server without it redirecting all the time).
I get around the devbox thing with the line:
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteCond %{HTTP_HOST} !^mydevbox\.com
RewriteRule (.*) https://www.mydomain.com/$1 [R=301,L]
but it is an overhead :(
Yes, httpd.conf would be an installation-specific way of hitting the problem :)
[edit: oh, those line breaks weren't readable!]