This post has been a long time coming - there's a definite need to address internal duplicate content issues in an intelligent, simple, illustrated manner. I'm hoping that you'll be able to show this to the boss (or, more likely, the engineering team) and help them achieve that critical "AHA!" moment required for all true progressive changes.

Fundamentally at issue is the creation of multiple pages on our own sites that host the same pieces of content. There may be very good reasons to do this - print-friendly pages (although sophisticated CSS can fill this gap), pages that have different formatting or navigation, versions of a page that may be appropriate for multiple areas of the site, or the ubiquitous and ever-problematic pagination of content. Today, I'll be using the most common example - one from the world of blogging.

First off, your typical blog has two big duplicate content issues:

  1. Duplicate content from every blog post appears on the blog's main page
  2. Duplicate content across the paginated blog index pages as they get indexed

Here's issue #1 in visual format:

Blog Post Duplicate Content Image

You can clearly see the copies of every word, sentence and paragraph from the blog home page appearing in the individual posts, creating a natural, but troublesome duplication issue. Which content should the search engines rank? The blog home page probably has more link juice and PageRank overall, but the individual post page is more targeted. The positive part of it is, at least if you're a frequent blogger, that front page content moves down (and eventually, off) the page relatively quickly... but what if you have pagination?

That leads to issue #2:

Duplicate Content Pagination Issue

You said it, Huffy Googlebot! This blog is in worlds of trouble with all the multiple copies of content it's showing you. It gets even worse if Google isn't regularly visiting every paginated page on your blog, because then the copies in the index can multiply into even more versions than you really have. For example, if Googlebot visited on 3 separate days and hit a different paginated page each day, it could easily have 4 copies of a blog post in the index - one on each of the paginated pages, plus the blog post page itself.
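
To make that arithmetic concrete, here's a minimal sketch in Python. Everything in it is made up for illustration (the URLs, the two-posts-per-page setting, the publishing pace); it simply models a crawler that keeps stale snapshots of index pages while a post slides from page 1 down to page 3.

```python
def paginate(posts, per_page):
    """Split a newest-first list of posts into numbered index pages."""
    chunks = range(0, len(posts), per_page)
    return {n + 1: posts[i:i + per_page] for n, i in enumerate(chunks)}


def copies_in_index(days=3, per_page=2, new_posts_per_day=2, target="post-a"):
    """Count the indexed URLs whose stored snapshot contains `target`."""
    posts = [target, "older-post"]          # newest first
    index = {f"/blog/{target}": [target]}   # the post's own page is indexed
    for day in range(1, days + 1):
        pages = paginate(posts, per_page)
        # Assumption: the crawler fetches only one index page per day
        # and keeps whatever it saw there until it comes back.
        index[f"/blog/page/{day}"] = pages.get(day, [])
        # New posts push everything older further down the pagination.
        fresh = [f"day{day}-post{k}" for k in range(new_posts_per_day)]
        posts = fresh + posts
    return sum(target in snapshot for snapshot in index.values())


print(copies_in_index())
```

With these settings the count comes out to 4: three stale index-page snapshots plus the post page, even though the live blog only ever shows the post in two places at once.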

Now the good news - my opinion is that Google, Yahoo! & MSN have all seen this pattern of duplication with blogs so many times that they've probably found relatively good workarounds for it. However, I will say that when SEOmoz switched from having this problem (prior to our February '07 re-launch) to solving it, our traffic from search for old blog posts rose about 25%.

However, not all internal duplicate content issues are in blog structures. You can find similar issues on sites of all shapes and sizes. On many news sites, for example, printer-friendly pages are popular. On forums, there might be several copies of a post accessible through crawlable links. A good number of e-commerce sites have exactly the same product page in different categories, producing different URLs (this is sometimes the worst nightmare of all).

So, how do you fix it? Simple (well, OK, not totally simple):

Dealing with Multiple Copies of a Page

The illustration offers two very good solutions, but you have to know when to use each. I advise the meta noindex tag when you're launching a site and can put it in right from the get-go. For older sites that may have lots of internal and external links pointing to the various versions of content, a 301-redirect is the way to go.
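
For the engineering team, here's a rough sketch of what those two options look like at the response level. It's written in Python purely for illustration, and the URL map and function name are hypothetical; your framework or server config will have its own way of doing both. The `noindex, follow` value asks the engines to keep the page out of the index while still following its links.

```python
# Hypothetical map of duplicate URLs to the version you want ranked.
CANONICAL = {
    "/print/duplicate-content": "/blog/duplicate-content",
    "/widgets/blue-widget": "/products/blue-widget",
}

NOINDEX_TAG = '<meta name="robots" content="noindex, follow">'


def handle_duplicate(path, strategy):
    """Return (status, headers, html) for a known duplicate URL.

    strategy="redirect": 301 to the canonical URL (older, linked-to sites).
    strategy="noindex":  serve the page, but ask engines not to index it.
    Returns None when `path` is not a known duplicate.
    """
    canonical = CANONICAL.get(path)
    if canonical is None:
        return None
    if strategy == "redirect":
        return 301, {"Location": canonical}, ""
    if strategy == "noindex":
        html = f"<html><head>{NOINDEX_TAG}</head><body>...</body></html>"
        return 200, {"Content-Type": "text/html"}, html
    raise ValueError(f"unknown strategy: {strategy}")
```

Calling `handle_duplicate("/print/duplicate-content", "redirect")` sends everyone (and their links) to the original, while `"noindex"` keeps the duplicate available to visitors but out of the index.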

Some SEOs even advise the use of cloaking - and in this case, it's hard to argue that it's unethical or even against the spirit of the engines. The concept is to use a conditional 301-redirect, served only to the search engines, that forwards all link love to the original version and establishes it as the canonical source. Human visitors, meanwhile, can still see the content in print-friendly format (or whatever unique way it's being presented). I believe there's almost always a better workaround than content delivery, but in those rare cases...
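
If you do end up in one of those rare cases, the logic is roughly the following. This is only a sketch: the function names are hypothetical, and the crawler list is an assumption based on substrings that appear in the Google, Yahoo! and MSN crawlers' user-agent strings. User-agent strings are easy to spoof, so treat this as the idea rather than a production check.

```python
# Assumption: substrings found in the major crawlers' user-agent strings.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_search_engine(user_agent):
    """Very rough crawler check based on the User-Agent header alone."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def serve_print_version(user_agent, canonical_url):
    """Crawlers get a 301 to the canonical page; people get the print view."""
    if is_search_engine(user_agent):
        return 301, {"Location": canonical_url}
    return 200, {"Content-Type": "text/html"}
```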

Hopefully, this post will help to alleviate some of your duplicate content concerns and make it just a touch easier to convince your team of how important these modifications are. Don't forget, if you need a basic primer on duplicate content, you can check out my illustrated post on the subject from back in March.