Duplicate content has been an SEO issue for quite some time, and even though Google says it keeps getting smarter at figuring out the best page to display in the SERPs from a list of duplicate content pages, and claims it is something to worry about less today than before, the issue still exists. Google gives advice on how to fix it in various places: support threads, employee blogs, webmaster help videos, and many others. Some say simply block your duplicate content pages, some say redirect them. Maybe there is no single rule that best fits all situations, so I decided to enumerate the various ways to fix duplicate content issues and their differences, so you can weigh the advantages and disadvantages yourself and judge which method is best for your specific situation. So let's go ahead and review each one.
Blocking in Robots.txt
This is probably one of the most common suggestions used by many people, including several people from Google. It is also one of the oldest recommendations in the book, and it is probably outdated since there are many other things you can do today.
This does work in eliminating duplicate content. Search engine bots read the robots.txt file, and when it tells them to exclude a URL on the domain, that URL is no longer crawled and indexed. Having said that, the problem with using robots.txt to eliminate duplicate content is that some people may be linking to the excluded page, and those links will no longer contribute to your website's search engine rankings.
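For example, a minimal robots.txt sketch could look like this, where /duplicate-page/ is just a hypothetical placeholder for whatever duplicate URL you want blocked:
User-agent: *
Disallow: /duplicate-page/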
Using the Meta Robots: NoIndex/Follow tag
Another way to eliminate duplicate content is to use the Meta Robots noindex/follow tag:
<meta name="robots" content="noindex,follow" />
The rationale behind using this tag: the noindex value tells search engines not to index the page, eliminating the duplicate content, while the follow value tells search engines to still follow the links found on the page, so link juice is still passed around. The problem is there are still some people who believe this does not work: once a page is noindex, it is most probably automatically nofollow as well. But then again, why were separate follow and nofollow values invented for the robots meta tag if you are not given the power to separate this from index and noindex? Crawled or not, this has to be tested. I believe Rand has taken Google's word for it that this tag works. Searching around for people who tested this with anchor text using unique words, I found Scott McLay from the UK doing some tests. For some reason, I can never be fully satisfied by results and posts from other people, including Matt Cutts' statements sometimes. The only reason I haven't tested this myself for so long is that there are so many other alternatives for fixing duplicate content that I never felt the need to know exactly how search engines treat the noindex/follow tag. But if any readers have done a good test on this, maybe you can publish your results here and explain how you ran your test.
The 301 Redirect
A lot of people in the industry love the 301 redirect for fixing duplicate content, because so many people have tried it out and know it works. It has also been abused in many shady ways, but that's not my topic. So what really happens with a 301 redirect when it comes to duplicate content?
The nice thing about this compared to the two methods above is that we are really sure, based on statements from the respective search engines as well as testing by numerous people (which probably includes you, the reader of this blog), that a link going to a page that 301 redirects will be counted as a link to the destination page of the redirect. This seems like the ultimate fix to all duplicate content issues, but there are also good reasons to use the next methods I will mention.
This blog post is not about how to do 301 redirects, but just in case that is what you were searching for: 301 redirects can be done in the webserver software (Apache, IIS, etc.) or through server-side programming (PHP, ASP/.NET, ColdFusion, JSP, Perl, etc.). A good starter guide for different 301 redirect implementations is the guide by WebConfs.
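Just as a quick illustration and not a definitive implementation, a server-side 301 in plain PHP, placed at the very top of the old page's script and using placeholder URLs, could be as simple as:
// Send the permanent redirect status, point to the new location, and stop the script.
header('HTTP/1.1 301 Moved Permanently');
header('Location: https://www.example.com/new-folder/new-file/');
exit;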
The Canonical Link Tag
The nice thing about the canonical link tag is that search engines behave much the same way they would with a 301 redirect: the duplicate content page is not indexed, only the destination page appears in the search engine index, and all links going to the duplicate content pages are counted as links to the main content page.
<link rel="canonical" href="https://(main content page)" />
If Google treats the canonical link tag in a very similar way to a 301 redirect, the main difference is the user experience. A 301 redirect, well... redirects, while the canonical link tag does not. So you can imagine when this might be better than a 301 redirect: when users may not want to be redirected.
Let's say you are browsing a department store website. A business traveler looking at different traveling bags, who also needs a laptop bag, arrives at a URL like this:
https://www.example.com/travel/luggage/laptop-bags/targus/
Meanwhile, a computer geek who wants a new laptop and a bag to go along with it ends up at a URL like this:
https://www.example.com/electronics/computers/laptops/accessories/laptop-bags/targus/
Let's say these two pages are duplicate content pages on the same department store website. Doing a 301 redirect to fix the problem messes up the user experience: if the traveler's train of thought was to buy different bags and they get 301 redirected to the computers section, they get lost and need extra effort to go back to the luggage. And the geek laptop buyer looking for different accessories would not want to be redirected to the luggage, since he may be looking for more laptop accessories.
Although a canonical link tag does not redirect, you still have to choose which page search engines should display in the search results.
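For example, if the luggage path were picked as the main page, the electronics URL would simply carry this tag (an illustration only; either page could be chosen):
<link rel="canonical" href="https://www.example.com/travel/luggage/laptop-bags/targus/" />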
The Alternate Link Tag
The alternate link tag is very similar to the canonical link tag, although it is used mainly for international or multilingual SEO purposes.
<link rel="alternate" hreflang="en" href="https://www.example.com/path" />
<link rel="alternate" hreflang="en" href="https://www.example.co.uk/path" />
<link rel="alternate" hreflang="en" href="https://www.example.com.au/path" />
The canonical link tag will remove all the other duplicate pages from the index, but with the alternate link tag all pages will still be indexed; it simply helps guide Google in choosing the best result for each individual country version of Google, and eliminates the problems Google may run into when treating the pages as duplicate content.
To sum things up, here is a simple guide on when to use which method for the different cases of duplicate content:
- Alternate Link Tag
  - International pages, multilingual pages, intended for different countries.
- Canonical Link Tag
  - Multiple categories and subcategories with different category paths, but the same content.
    Example:
    https://www.example.com/products/laptops/sony/
    https://www.example.com/products/sony/laptops/
  - Tracking codes and session IDs, mainly because redirection sometimes interferes with the functionality of the tracking codes and sessions.
    Example:
    https://www.example.com/path/file.php?SID=BG47JF448JD6I7TGF439LVFD476
    https://www.example.com/path/file.php?utm_whatever=5uck3rs
    https://www.example.com/path/file.php
  - Different variable orders due to how some CMS platforms are built.
    Example:
    https://www.example.com/path/file.php?var1=x&var2=y
    https://www.example.com/path/file.php?var2=y&var1=x
- 301 Redirect
  - Cases where a redirection does not bother the user experience, such as www and non-www, index files, trailing slashes, and the hosting IP address.
    Example:
    https://www.example.com/
    https://example.com/
    https://www.example.com/index.html
    https://www.example.com
    https://123.123.123.123/
  - Domain changes, and URL changes of pages that no longer exist.
    Example:
    https://www.example.com/old_folder/old_file 301 redirects to https://www.example.com/new-folder/new-file/
    https://www.example.net/ 301 redirects to https://www.example.com/
- Meta Robots NoIndex/Follow
  - Probably the best place to use this is in a list of archived posts, such as a blog, where the content of an individual blog post's permalink is also posted as a duplicate somewhere in the archive view by date, the category view, the author view, tag topic views, or in the pagination of older blog posts from the blog homepage. You cannot really do a 301 redirect or a canonical link tag here, since these pages may have more than one blog post listed and you would have to decide where the 301 redirect should go or where the canonical link tag should point. So I would take my chances with the Meta Robots NoIndex,Follow tag and hope all the links still help.
- Robots.txt
  - I no longer see a need to use robots.txt for duplicate content issues. The natural linking is something too precious to lose. Just use robots.txt to block off content that does not need to be indexed at all, duplicate content or not.
Assuming the canonical tag is handled the same way as a 301 is a bit of a dice roll. In theory yes, but what if the non-canonical URL gets more links than the canonical one? SEs treat it as a recommendation, not the same as a 301. IMHO SEs are like 6-year-old kids: the less you leave open for interpretation, the less likely they are to get it wrong.
Makes sense, and given your time in the industry you are certainly more experienced than I am. So far I have been basing the behavior of canonical tags on actual results. Mainly: 1. When I apply the canonical tag on pages to point to another, the pages disappear from the SERPs. And 2. After replicating this experiment with the same results: https://www.seomoz.org/blog/using-canonical-tag-to-get-more-than-one-anchor-text-value-11283 it seems to me it is behaving very similarly to a 301.
But now that you have posted this comment... it makes me think... if it is treated as a recommendation, then... I could be wrong in my statement, and it just so happened that in my testing and actual results Google followed my recommendations every time. I guess the best answer would be further testing, and if the results are repeatable, then that is the time we can say it is a good theory.
Thanks for the comment; it puts more work on our testing board to really revisit and see how canonical tags have been working.
Would be looking forward to the results of the follow-up test, Benj. Would be interesting to know if Graywolf's recommendation theory is spot on, which I assume it is...
Nice post by the way.
Cheers!
I use the NoIndex/Follow tags on product pages that have duplicate content. I have a large ecommerce store with 8,000 products. All products are very similar, but not all products have active searches. I noindex the product pages, then change the meta robots to index/follow as I write new content. That has been the best way for me to handle 700 products with the same content.
In the meantime, I use category and sub-category pages to rank and drive traffic, as well as content pages and the product pages I have previously changed to unique content.
Nice recap, and I'll be curious to see more testing around the rel-alternate tag. One issue I've found with Robots.txt (besides not passing link-juice) is that it tends to be unreliable for pages that are already indexed. In other words, if you put it in place before a page exists or when you launch a site, Robots.txt will keep that page out of the index pretty well. Once the page is in the index, though, Robots.txt won't always kick it out.
I get it but trying to explain this stuff to my fiance or my mom, that's just plain impossible.
A very noob question:
How are you determining duplicate content?
Thanks.
- Internally, within a website, we normally all know the common things to look at, like www and non-www, index files, variable orders, etc. Basically everything I have listed in the summary above; I test for all of them. And if you are familiar with the site, sometimes you already know the answer to these questions.
- Sometimes at the end of the SERP pages, when you see omitted results, check them out; sometimes the duplicates get filtered out and are all in there. Sometimes even similar content ends up in the omitted results.
- With other websites and other domains, maybe even properties you also own, people copying or scraping your content, or resyndicating your RSS (which you are really sharing with the world, so it is not necessarily a bad thing), you can check for duplicate content by Googling long exact phrases or using tools like https://www.Copyscape.com
Great post, benjarriola! I have bookmarked the page as I am sure I will have to read it some time again in future.
But I have a question: do you really think duplicate content is such a big issue? I assume you probably will not rank as well as you could if you have exactly the same text/content on pages like /url, /url1 and /url2 which were created by mistake, but I doubt Google or any other search engine will penalize my blog if I open categories and archives to Googlebot.
Google now claims to figure out whether feedback was positive or negative (do you remember the 'Christian Audigier glasses story'?). If Google is so clever, why should it penalize the link architecture blogs have by nature? Personally I have categories and archives open. Do you think the pages of my blog will rank better if I close them?
P.S. Sorry if I seem to be rude or something, it's just that your post touched the strings of my heart. :)
P.P.S.: It is nice to be one of the first to comment here, on SEOmoz.org :)
Good question... there are a lot of pages out there where duplicate content exists, and some still seem to rank well even without fixing it.
All I can say about that is... Google already seems smart at determining which is the original among the duplicates and rewards that page accordingly. Although there are also testimonials from other people claiming a page outranked them, or that they disappeared, after a more authoritative or popular blog posted what they originally had on their blog. So Google is smart, but it may still sometimes make mistakes in choosing the original page among the duplicates.
Aside from that, some keywords are more competitive than others. In a less competitive market, and with fewer duplicates, it is less of a problem. But as a preventative, precautionary measure, I'd just fix every case of duplicate content I can fix.
And lastly... on the dark side... I believe it was an experiment by Dan Theis some time ago where multiple free web-based proxies were used to create multiple duplicate content pages to kick something out of the SERPs.
Comprehensive Information Benj!
I've not used alternate in the link tag before, so did some reading.
Here are some clarifying articles on how to correctly implement it for HTML4 and HTML5. Note that both of these articles mention that this attribute may be dependent on sibling attributes.
https://www.w3.org/TR/html5/links.html#rel-alternate
https://blog.whatwg.org/the-road-to-html-5-link-relations#rel-alternate
Thanks repriseaus, and in addition to the resources on the proper standards, here are a few resources where Google has announced how it will treat the tag when used on multilingual/international pages:
Google Webmaster Central Blog
https://googlewebmastercentral.blogspot.com/2010/09/unifying-content-under-multilingual.html
Google Webmaster Tools Help Pages
https://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=189077
Google Webmaster Tools Support Forum
https://www.google.com/support/forum/p/Webmasters/thread?tid=64d086930a0bd2d0&hl=en
I have seen a few SEO bloggers out there saying this can be used for duplicate content across PDFs and HTML pages, and it is actually the proper way to do it, since it is alternate content and that is how the standards were set. But I am not sure if Google is already set to view it in the same way; I have seen no official statement from Google saying the alternate tag works for HTML and PDFs with the same content.
Good post, I liked reading this. But I have one question. How about this:
I have no similar (or identical) posts. All of my posts are completely different. For example:
mydomain.com/my1category/post-title-about-book/
mydomain.com/my1category/another-book-post-title/
mydomain.com/my2category/good-pencil-title/
mydomain.com/my2category/best-post-title-about-pencil/
All four posts above have completely different content, not similar even in one paragraph.
Note:
a) I let my tags and my categories be indexed by search engines.
b) Each post has only one category (no double categories). But one post can have more than one tag.
My questions are:
1) Although the post URLs seem a little similar, will Google treat them as duplicate/similar posts?
2) Do I still need to use canonical?
Hope you could explain and share with me. Thanks for all nice people here.
I spent the last few hours trying to find out how to do a 301 redirect that adds a trailing slash for the home page ONLY. I am trying to redirect https://www.berricle.com to https://www.berricle.com/
I need a solution with minimal resources on the server. Just need it for the home page ONLY.
I tried this in the .htaccess file, but it didn't work.
Redirect 301 https://www.berricle.com https://www.berricle.com/
Please, someone help me with a solution. Thank you sooooooo much.
When I compare 2 sites we own... one is built completely wrong for SEO and the other is better optimized for SEO. The one that is built wrong completely dominates Google (over 1,000 keywords on page 1 of Google) and the other site just doesn't do as well... but it is 3 times the size and both have been online for 5 years. Both sites are article/content sites.
It is just funny how everything is hit or miss with this stuff.
Thanks for nicely explaining how to handle duplicate content and which option suits implementation in different SEO situations, like LSEO, ISEO and country-specific targeting.
I would like to have your advice on something. I was told by an SEO expert hired by one of my clients that a recent drop in rankings for their site was mainly due to a duplicate content problem between pages.
My client sells event and concert tickets in Montreal and other cities across North America. I've recently built city hub pages to promote some specific cities. These pages contain a good-sized paragraph about the city, some featured event links and links to all the events in that city. The SEO expert tells me this conflicts with the event search results page when a client does a search for all events in Montreal, which of course gives the same event listing as the city hub page, but with a different small upper text, different descriptions, and different title and keywords tags. I find it odd that Google would treat those as duplicate content pages and penalise our rankings (losing almost 50% of visits to those pages).
What would be your opinion on this? (Before I go in and do all the changes this SEO expert asks me to.)
Thanks
There is duplicate content and there is also similar content. Sometimes, if a page is very similar to another, even if differences are present, it can be treated as duplicate content too. I guess it boils down to how much is similar and how much is different. One signal to look at, to see if pages are considered duplicate content on your end, is the site: command in Google, which shows all indexed pages. The pages that are very similar end up at the very end of the last page, where it says omitted results. Check your omitted results if you have any; if many pages end up there, they are being treated somewhat as duplicate content.
Hi,
I have a few questions..
1) What if the content is spun? I have seen this happen, and the spun content got better rankings because they were promoting it more aggressively, and because of this my site was ranking well behind theirs.
2) Is it OK to have some part of the content be the same on 2 pages of the same site? E.g. the first paragraph of the Custom Coding page repeated on the Home page.
What about missing-entry redirection? Should we use the canonical link tag?
Let's say a page uses 'id' as a URL parameter to know which product to display, and a redirection to the category page is done if the product cannot be found (deleted or misleading link). Google treats this as duplicate content, but would the canonical link tag fix this?
Hi,
I am asking for your opinion which one of these methods to use(or another one) in the following case:
Our client has a blog section on his site. Currently there are fewer than 20 posts, so frequently, while searching for a specific tag, you get the same content, because 1 or 2 posts often share the same tags, and the SEOmoz Crawl Diagnostics shows these pages as "Duplicate Page Content". What is your advice, what do you recommend?
Thanks in advance, Ilian Iliev
I used rel=canonical for a relatively high-traffic website and I saw a drop in traffic, which I explain on my SEO blog.
I don't think rel=canonical works like a 301. It might tell Google where the parent page is, but I don't think it properly transfers the ranking of the duplicated page to its parent like a 301 does.
Hi, I have a B2C website which sells apparel. My website is designed with sub-directories for multiple languages. I notice that many pages are detected as having duplicate content and titles. For example, in one category there are 10 pages with different products, and Google seems to see duplicate content and only indexes it once. The website was translated into multiple languages, and Google seems to see duplicates there as well.
Can I use the 'Alternate Link Tag', 'Canonical Link Tag' and '301 Redirect' together? And how do I implement them on a B2C website that is targeting multiple countries?
Thanks
Multi-Language: Use Alternate Link Tag.
If one of the languages has duplicate content internally also, then you can use the Canonical Link tag and 301 redirect also.
So on a single page, you can have both Canonical link tag and Alternate link tag but they do not necessarily have to have the same href value.
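As a rough sketch reusing the example URLs from the post above (not from a real site), the UK page could carry both, with the canonical pointing at its own clean URL and the alternates pointing at the country versions:
<link rel="canonical" href="https://www.example.co.uk/path" />
<link rel="alternate" hreflang="en-gb" href="https://www.example.co.uk/path" />
<link rel="alternate" hreflang="en-us" href="https://www.example.com/path" />
<link rel="alternate" hreflang="en-au" href="https://www.example.com.au/path" />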
As for doing these with a 301... it would be a good precautionary measure to have everything in place, but generally I would give more priority to a 301 than to a canonical link tag.
Thanks, can you give an example of a website that also uses these methods?
Hi Ben, and thanks for sharing this very useful note on the canonical tag. We had discussions on FB regarding my online shopping website TheMiniMall.com, its duplicate content issues and the Not Selected Pages issue in Webmaster Tools. I studied your article, but I am not sure whether I should use the canonical tag or Disallow: *Duplicate* in robots.txt for my site. Please assist by providing some more details on this topic.
Thanks
Great post. Kindly suggest: if my Free Trial, Get A Quote, Contact Us and Sign Up pages have the same content, is it necessary to block every single page from Google, or can we use the noindex,follow tag on each page instead? Kindly suggest. If we block these pages from Google, is there any chance of leads decreasing? Waiting for your suggestions.
Don't 301 redirects also cause the page to load slower? If you have too many redirects, like with a legacy site that created 16 duplicate pages, wouldn't 16 301 redirects essentially slow down the page loading time immensely, and drive the organic ranking down in that way?
Great technical SEO report on duplicate content Benj.
Analysing technical points is a little challenging, but it's important for SEO.
I am creating a help desk for both agents and retail customers. My managers are suggesting robots.txt for the agent site to eliminate the duplicate content issue. I want to use canonicals. Since this article is dated, I'm wondering what are your recommendations? Thank you.
What about URL Parameters handled within Webmaster Tools? Would you say parameter handling is as effective or less effective as the measures you've listed above? Would you recommend using both?
Are www.example.com/ and www.example.com/?ref=blog also duplicate pages? How does Googlebot treat bookmarks, e.g. www.example.com/#video? Do they get counted as duplicate content?
And how beneficial are tag pages? My site is new, and example.com/tag1 and example.com/tag2 generate the same content. Is that harmful in terms of duplicate content?
Very useful post, but there's something I've been wondering about.
I've recently chosen my preferred domain to be www. With some of the pages (more so posts) on my website, viewers sometimes get a message similar to the one below:
This has caused a substantial decrease in my site traffic. Is there anything I can do to fix this issue? The pages still exist and the slugs have not changed. All I've done is change the preferred domain.
I doubt this is being caused by the preferred domain setting. And thanks for sharing your site, but on which page do you get this message?
When I view traffic statistics from the backend, it shows me links to the pages people are viewing, and where the visits come from. A lot of the traffic comes from either the Google search engine or its images directory. When I click on some of the Google links (specifically from their images directory) that direct viewers to pages on my site, I get a redirect notice like the one shown above.
I wonder if people see the same redirect notice that I encounter... I noticed a significant decrease in my traffic as well.
What are your thoughts on cross-domain canonicals? I have a client that has three websites in the real estate business. The original website was specific to one area, with a domain name that includes keywords for the area; then they added a new site that covers a large area (statewide) with a domain specific for that region, and finally a nationwide site.
Each higher-level site includes all the content of the lower-level sites (i.e. the content on the city site is included in both the state site and the country site), and the design of the pages is the same, so the content is almost 95% duplicated on those pages. They don't want to merge the sites, since the lowest-level site is generating most of the income and is very well ranked, so they can't risk it.
So the question is: does it make sense to use the canonical tag to indicate that the lower-level site is the original, or just leave all three sites with no canonical tags and let Google sort out the duplicates? Of course, the higher-level sites have additional content not included on the lower-level site.
I hope that made sense.
The cross-domain canonical link tag doesn't seem to work well for me. I would normally just do a 301 redirect instead for cross-domain cases.
The problem with the 301 is the user point of view. In our experience, people don't like being sent to a different website during their session. Within the same website it's fine, they don't even realize they were 301'd, but when you send them to a different site it really creates some confusion for them.
Hello Ben
Great post about redirect, block and canonical tips. But I want to know: in how many ways can we block pages so that search engine bots do not index them?
What I'd like to know is, does the tag pass link juice?
say if i had:
www.example.com/laptops
www.example.co.uk/laptops
www.example.pl/laptops
and then I had the other 2 corresponding alternate tags on each page (which will create an infinite loop of alternating),
and then built links to one of the pages, will this power then be distributed between them all?
thanks for the great summary.
but i'm not altogether clear on how you would implement rel=canonical as recommended in these two instances from your post:
The two cases we bump into are:
1. with a particular cms that generates SID/var1/var2 parameters. The CMS allows us to create the page, and specify the rel statement on that one source page. But the additional versions of the page with the extra parameters are outside our control - we can't specify rel=canonical on the var1 version, but not on the original version, as we only have one version we have access to;
2. utm parameters. Again, these parameters are added after the page was created, and are used for tracking in analytics. the original page that we create, should not have rel=canonical. But the additional URLs with tracking parameters should have rel=canonical.
In practice, how have you carried this out?
cheers
Chris
As long as you know how to code it, there should be no problem at all.
1. with a particular cms that generates SID/var1/var2 parameters. The CMS allows us to create the page, and specify the rel statement on that one source page. But the additional versions of the page with the extra parameters are outside our control - we can't specify rel=canonical on the var1 version, but not on the original version, as we only have one version we have access to;
You will need server-side programming here (PHP, ASP, ASPX, ColdFusion, Perl, JSP, whatever your cart was made in) and some conditional statements. In PHP it would be something like:
if (isset($var1) && isset($var2)) {
    // Both variables exist: output the canonical URL with the variables in the correct preferred order (no SID).
    echo '<link rel="canonical" href="https://www.example.com/' . $var1 . '/' . $var2 . '/" />';
}
What is happening here: if both variables exist, then display the canonical URL with the variables in the correct preferred order. It does not hurt to have the canonical link tag on the page whose URL is already the same as the URL in the canonical link tag, and it will help on the URL that has a different variable order. In the code, the SID is also excluded already.
This is a simplified example; of course it may change depending on the actual code. If you are using POST or GET variables, or if it is so complicated that you do not know where the URL folders are coming from, then use $_SERVER['REQUEST_URI'] to get the URL string and do some tricks with string handling, using functions like strpos, str_replace, preg_replace, strstr and more.
2. utm parameters. Again, these parameters are added after the page was created, and are used for tracking in analytics. the original page that we create, should not have rel=canonical. But the additional URLs with tracking parameters should have rel=canonical.
I'll probably do something like:
$RequestURI = $_SERVER['REQUEST_URI'];
// Strip the ?utm tracking parameters (everything from '?utm' onward), if present.
$RequestURI = str_replace(strstr($RequestURI, '?utm'), '', $RequestURI);
echo '<link rel="canonical" href="https://' . $_SERVER['HTTP_HOST'] . $RequestURI . '" />';
What is happening here: I get the URL string, which is the REQUEST_URI, I look for the ?utm and everything after it using strstr, then I take it all out using str_replace, and then I put the result back into the URL for the canonical tag.
Extra non-related story...
This answers the canonical question specifically. Although I had a nice chat with Jaimie Sirovich aka SEO_Egghead, where he approaches this problem in a different way, probably an even simpler implementation, but it really uses exclusion via robots.txt and adding URL parameters to indicate which page is the duplicate. That is another story and not really an answer to your question, but it is interesting too. And I am sure other people, other readers, may have other solutions I have not heard of yet. Maybe the more aggressive people might even try to 301 everything instead of using the canonical, but... cloak the 301 and show it only to Googlebot.
Benj,
Do you ever find having your articles "flipped" to be an issue? In the past I've had outsourcers flip articles, but not alter them enough... would this create a "duplicate" problem? Just curious. Thanks.
I am not a fan of article flipping or article spinning, although I have many friends who testify that it works. I guess it all depends on the degree or extent of the flipping/spinning. Sometimes it does look truly original, and duplicate content is no longer a problem. Although if you want to maintain readership, you will want to double-check that the articles still read nicely for users. But that is another topic entirely.
Great post, thank you.
I have embraced the rel=canonical tag because it's often the only option I have. My clients are generally small businesses with cheap hosting plans. You can't do a redirect through cPanel and the hosting company won't do it for you.
They will give you "code" to insert by yourself at your own peril. I tried to edit an .htaccess file to redirect the non-www version to the www version and crashed the website.
I'm using rel=canonical to redirect index.html files to the root directory, but I don't see how it could help with redirecting the non-www to the www.
thanks benj for this.
something about 301s creating an infinite loop for my pages, so I ended up using
I didn't know of the other one so I just learned :0
Nice post - just to join in the debate between 301 vs. canonical: I think it totally depends on what you're using it for. For example, it pays to do a 301 on a page that has duplicate content or has 'another version', but you can't do this for a product page, for example. If you have 1 product with 5 different colour variants, then I'd use the canonical tag, pick one of the colour options as the one I want to be seen as the authoritative page, and place the canonical tag on the other colour options.
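As a quick sketch of that setup (with hypothetical URLs), each of the non-chosen colour pages, say /widget-blue/ and /widget-green/, would carry a tag pointing at the chosen colour page:
<link rel="canonical" href="https://www.example.com/widget-red/" />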
None of these methods solves my duplicate content issues.
We sell batteries for cars. One type of battery fits in many cars.
We have these URLs
1: domain.com/car-make1/model1/batteryA -> BatteryA fits in this car, page is optimized for this car make and model and this battery
2: domain.com/car-make2/model2/batteryA -> BatteryA also fits in this car, page is optimized for this car make and model and this battery
3: domain.com/batteryA -> this is the mainpage of this battery
The content of these pages is almost duplicate
So one could say that we need to have a canonical link in 1 & 2 pointing to 3. But then my pages won't show up in the SERPs if someone searches for a battery for a certain car make and model
Well, in my opinion the solution is to optimize the content.
Maybe you could create a paragraph like "battery A is good because... and it is useful for..."
Create some specifications for the products; even if they are similar, explore the different details to create unique content and expand the amount of text.
This will help users see the differences between the batteries and gives them a more detailed product page.
Amazon usually does that: product pages rich with product details.
If you are targeting each of these pages with different target keywords, you might as well optimize them separately. I agree with luizamcalmeida: work on the content, and it is no longer duplicate content.
There's no easy solution here, but there are a few things I'd suggest thinking about:
1. In a proper faceted design you can address this by showing different permutations of products to the bots, targeting from the facet level, not the product level. Each product might have a slightly different set of applicable products, and you can tweak each facet page (not all platforms allow for this, but Endeca and FAST have a solution for this, as do we in our applications and retrofits). Done wrong, faceted design is a spider trap to begin with (see #4), however, so this is not an option for most.
2. What Ben is saying is ultimately correct. There's no way to fix this without authoring at least something. Do #1 or write product content. I think #1 is a little better.
3. In theory rel canonical _could_ sum the content and show the correct page per query. In other words they'd group or collapse the results and show the most relevant one. I doubt they do it, but someday they could — in this way canonical is more powerful than eliminating the result. Of course this only works when the duplication is in small numbers.
4. Canonicalization doesn't work at all when you have many orders of magnitude of useless or duplicate combinations of settings. You'll still need robots.txt for that.
Hope this helps someone. Feel free to drop me a line.
Jaimie Sirovich
SEO Egghead, Inc.
Professional Search Engine Optimization with PHP & ASP.NET (Wrox Press)
Nice article Benj.
I wonder if content generated by auto-translate modules/plugins will be considered duplicate content.
For example
"This is such a wonderful place" is translated to Filipino "Ito ay talagang napakagandang lugar" or in French "C'est un endroit merveilleux"
Will search engines treat the translated versions as duplicate content, or unique content entirely?
Thanks!
Alfred
Duplicate or not, I think the alternate link tag is still good to use since you get to target the respective countries better.
I agree,
Because you can have many pages in many languages that are still the same content. So, in this case, it is better for you to know in which language the page is viewed the most.
thanks Benj for the post. I would vote 301 each and everytime to eliminate thinking on anyone else part (read SEs)
One common type of mirror page is a printable version of a page. The canonical tag seems to me to be the appropriate method for handling this type of mirror page.
That is totally right; I forgot to mention that. Although if the printable version is a PDF... that is another story.
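As a sketch with hypothetical URLs, the printable page, say https://www.example.com/article/print/, would just point back to the main article:
<link rel="canonical" href="https://www.example.com/article/" />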
don't you lose a certain percentage of link juice with 301 redirect?
I find that the canonical tag is really a hit or a miss, especially in cross domain situations.
Nice to note, and this is where I believe a 301 is better: when it is a cross-domain situation.
One of the best posts I've read these days!
I like the canonical and alternate comparison a lot.
I have two clients, and the canonical is very useful for one of them: it's an e-commerce site and we had some problems before with duplicate categories in different places, but with the canonical we solved this.
The alternate link tag I have never used, but reading the post gave me a nice idea of where to apply it; sites with multilingual pages are the perfect example.
Very nice, and keep writing more and more for us!
Just to follow up on the debate: it works. Recently I've done a triple-domain swap with the canonical tag and I've managed to gain rankings for a similar (although not the same) set of keywords.
A 301 would just be killing one website or page, and in terms of user experience it would be bad IMHO.
Regards,
Really nice round up benjarriola. But robots.txt does not keep Google from indexing a page, it only keeps Google from crawling the page when the crawl initiates with your site. So if, for example, there is an outside link to a page or URL that is not crawled per robots.txt commands, the page/URL/content will still be indexed by Google. All it takes is one link and - bam - so much for robots.txt.
True. I totally agree but failed to mention it. Thanks for bringing it up. One added observation: if you block a page in robots.txt but it still gets indexed because of a link, it will not have a good title and there will be no description in the SERPs, but it is still listed.
Much of the post was about duplicate content within a site, but how does each of these methods work when it's content from a different site?
One of my clients has been copying news articles about his company on the company site. I was considering just getting rid of that content (since it's all copied from other sites) and just posting links, but if there's a way to keep the content without getting penalized, I'd go with that. What's the best way?
It cannot be avoided, but there are a few things that can be done:
AFAIK robots.txt doesn't exclude from indexing, only from crawling, making it a bad choice if you want to exclude pages because of duplicate content.
The meta robots noindex is the way to go over robots.txt if you want to exclude.
More info in the SEOmoz post on the Robots Exclusion Protocol.
Great post. I usually use a 301 for most of my websites and the canonical in rare cases. But this is the first time I have come across the alternate link tag. As I am dealing with multilingual SEO sites, I hope this will help a lot.
Thanks for this.
We have AJAX-driven URLs once a user is on our site, and static URLs that we deliver to an "SEO" directory. I want to implement rel=canonical on my site. Do you think I should use the static URL and not the AJAX URL? That's the direction I am heading in. Do you agree?
I would say static URLs. When you say AJAX URLs, I am assuming these are the URLs with the hash/pound/sharp/number sign. It is common SEO knowledge that everything after the hash is not read by search engines; they view the page as if the hash part was not there, since this is really a client-side technology that the browser reads, and server-side scripting cannot read it either.
Although many have reported that Google does read the hash tag, and Google has been playing with its headless browser for some time. https://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html
And many have seen Google differentiate pages with hash tags in the first link priority issue. https://www.seomoz.org/blog/the-first-link-counts-rule-and-the-hash-sign
[Off-topic story: I remember playing around with this experiment before it was announced by SEOmoz. I am not saying I was first. I was actually walking in the parking lot of OMS, saw Rand Fishkin also lost, and had a brief conversation about 'first link'. He told me about the hash tag URL effects on first link, and after a few months... the blog post came out from someone on YouMoz with the test that was done. So I'm not first, my source was still Rand. :) ]
What I do is, even if I have AJAX content and AJAX links, the default static content and links that load are still... well, static. Better to see an example than to explain it. Check https://www.ajaxoptimize.com/
Bookmarked this one... Very clear! Thank you so much!
Nice information about Duplicate content benjarriola, Thanks for your post!
I tend to recommend 301 where possible and canonical elsewhere. I've recently moved whole blog domains using canonical though and it worked well and fast too.
Great post! Thanks for laying out all of the options.
I have used <meta name="robots" content="noindex,follow" /> and it has definitely worked for me. I have staging URLs that aren't linked to anywhere (or anywhere that I can find!) yet they were still indexed - blocking via robots.txt definitely did NOT work! Thanks for the other suggestions, too.
If you were testing whether they are indexed or not, the tag will definitely work. Now the question is: are links going to this page still counted, and are the links on the page itself still passing link juice to where they are going?
Another advantage of using the 301 redirect to clean up duplicate content is that you control where your links point by redirecting to either the www or non-www version (whichever is preferred) of each of your site's pages. There is no point having links pointed at the wrong version of a page, diluting link juice by dividing up where links point. And REALLY no point in having 2 versions of each of your site's pages, creating duplicate content. It's surprising to me how few sites use this easy fix.
Sometimes it is not that easy if you have a large enterprise site of 10,000 pages or more.
Aside from that, there are also links coming from other people, on other websites you do not own and thus cannot control. So this is where all the solutions in the blog post come in. :D
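For what it's worth, a minimal server-side sketch of the www/non-www 301 in PHP, assuming example.com as a placeholder host, could look like this:
if ($_SERVER['HTTP_HOST'] === 'example.com') {
    // Permanently redirect the bare domain to the www version, keeping the requested path.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: https://www.example.com' . $_SERVER['REQUEST_URI']);
    exit;
}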
Impressive post mate
It serves as a good resource for anyone tossing up between the options.