This post has been a long time coming - there's a definite need to address internal duplicate content issues in an intelligent, simple, illustrated manner. I'm hoping that you'll be able to show this to the boss (or, more likely, the engineering team) and help them achieve that critical "AHA!" moment required for all true progressive changes.
Fundamentally at issue is the creation of multiple pages on our own sites that host the same pieces of content. There may be very good reasons to do this - print-friendly pages (although sophisticated CSS can fill this gap), pages that have different formatting or navigation, versions of a page that may be appropriate for multiple areas of the site, or the ever-ubiquitous problem of paginated content. Today, I'll be using the most common example - one from the world of blogging.
First off, your typical blog has two big duplicate content issues:
- Duplicate content from every blog post appears on the blog's main page
- Duplicate content across paginated blog index pages that get indexed
Here's issue #1 in visual format:
You can clearly see the copies of every word, sentence and paragraph from the blog home page appearing in the individual posts, creating a natural, but troublesome duplication issue. Which content should the search engines rank? The blog home page probably has more link juice and PageRank overall, but the individual post page is more targeted. The positive part of it is, at least if you're a frequent blogger, that front page content moves down (and eventually, off) the page relatively quickly... but what if you have pagination?
That leads to issue #2:
You said it, Huffy Googlebot! This blog is in worlds of trouble with all the multiple copies of content they're showing you. It gets even worse if Google isn't regularly visiting every paginated page on your blog, because then the copies can multiply into even more versions than you really have. For example, if Googlebot visited 3 days and hit a paginated page each day, they could easily have 4 copies of a blog post in the index - one on each of the paginated pages, and the blog post page itself.
Now the good news - my opinion is that Google, Yahoo! & MSN have all seen this pattern of duplication with blogs so many times that they've probably found relatively good workarounds for it. However, I will say that when SEOmoz switched from having this problem (prior to our February '07 re-launch) to solving it, our traffic from search for old blog posts rose about 25%.
However, not all internal duplicate content issues live in blog structures. You can find similar issues on sites of all shapes and sizes. On many news sites, for example, printer-friendly pages are popular. On forums, there might be several copies of a post accessible through crawlable links. A good number of e-commerce sites have exactly the same product page in different categories, producing different URLs (this is sometimes the worst nightmare of all).
So, how do you fix it? Simple (well, OK, not totally simple):
The illustration offers two very good solutions, but you have to know when to use each. I advise the meta noindex tag when you're launching a site and can put it in right from the get-go. For older sites that may have lots of internal and external links pointing to the various versions of content, a 301-redirect is the way to go.
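For the record, here's roughly what each fix looks like in practice (the paths below are made-up placeholders, not anyone's real URLs). The noindex option is a single tag in the head of the duplicate version of the page, and the 301 is a one-liner in an Apache .htaccess file - assuming your server is Apache with mod_alias available:

<meta name="robots" content="noindex, follow" />

# .htaccess - send a duplicate (e.g. print-friendly) URL to the canonical version
Redirect 301 /print/some-post.html /some-post/

The "follow" in the meta tag lets the engines keep passing link value through the duplicate page even though it stays out of the index.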
Some SEOs even advise the use of cloaking - and in this case, it's hard to argue that it's unethical or even against the spirit of the engines. The concept is to use a 301-redirect, served conditionally to the search engines, that forwards all link love to the original version and establishes it as the canonical source. Human visitors, meanwhile, can still see the content in print-friendly format (or whatever unique way it's being presented). I believe there's almost always a better workaround than content delivery, but in those rare cases...
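For the curious, a conditional redirect like that is usually implemented with a user-agent check in mod_rewrite. The sketch below is purely illustrative - the bot names and the /print/ path are assumptions, and as noted, think carefully before deploying anything user-agent based:

# .htaccess (Apache mod_rewrite) - illustrative sketch only
RewriteEngine On
# only the listed crawlers get 301'd to the canonical page; human visitors still see the print version
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteRule ^print/(.*)$ /$1 [R=301,L]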
Hopefully, this post will help to alleviate some of your duplicate content concerns and make it just a touch easier to convince your team of how important these modifications are. Don't forget, if you need a basic primer on duplicate content, you can check out my illustrated post on the subject from back in March.
Rand, I hate to see you miss such a great opportunity to link to me :P
My fix for blog / wordpress pages going supplemental :)
Thanks for that Joost, very handy.
I also found this video yesterday whilst looking for the exact same fix for duplicate content!
https://www.wolf-howl.com/video/make-wordpress-search-engine-friendly/
Good link Joost!
Thanks for the link. Thumbs up!
There's another SEOmoz tool idea in there somewhere - a duplicate content detector. I would build one, but it sounds hard, so I thought I'd just suggest it ;)
It is easy to find some examples of dup content, but a thorough site review can be hard - especially for large or complex sites. It's not always obvious the number of different ways you can reach content...
And a question: Do you believe there is any duplicate content issue with very small canonicalisation stuff, e.g.: www.example.com/directory, www.example.com/directory/ and www.example.com/directory/index.html all providing the same content? Seems to me that it would be nice to fix this, but I'm sure the engines are on top of this one. Anyone got any thoughts on that?
Well www.example.com/directory will 301 redirect to www.example.com/directory/ by default so there isn't much issue there. With the /index.html thrown into the mix though I don't think that there is much to worry about getting hit with dup content, but I would say it is diluting the strength of the intended URL though. Plus, this gives users the opportunity to naturally link to either page that they want, which can further dilute the page(s).
www.example.com/directory will only redirect to www.example.com/directory/ if there is an actual directory there.
Some content management systems create things that look like directories that are not actual directories and that can be accessed with or without the trailing slash. They are different URLs.
A slash in Unix is the symbol for a directory (as a backslash is in Windows). For example, the correct URL for a home page is https://example.com/ -- the trailing slash indicates that you are requesting the root directory of example.com. If you forget the trailing slash, the server will add it.
On Apache you can use .htaccess to enforce the inclusion of a slash or removal of the slash.
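For example, a rough .htaccess sketch (Apache mod_rewrite assumed - adapt the patterns to your own URL structure) that collapses /index.html variants and adds a missing trailing slash might look like this:

# .htaccess (Apache mod_rewrite) - sketch only
RewriteEngine On
# collapse /directory/index.html onto /directory/
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]
# add a trailing slash to extensionless URLs that aren't real files
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !\.[a-zA-Z0-9]+$
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]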
I actually have been seeing issues like this a lot with my clients recently. I'm not sure how the SEs deal with it, but I have seen some funky things - like www.example.com/home.html having 0/10 PR but the same page typed in as www.example.com/Home.html having 6/10 PR. I'm not sure if Google sees the two pages as separate or if PR is just case sensitive; either way it seems kind of wacky. Anyone else seen anything like this, and if so, any fixes? Does it even matter?
Technically, those are or at least can be two different pages. On a *nix based server home.html and Home.html could be different content, so could have two different PR ratings. IIS servers don't make a distinction between these.
Letter case in URLs definitely matters with Google. /Home.html is a different URL than /home.html, though as you mention, IIS (Windows) doesn't care. If the site were on a *nix server, one or the other would return a 404 error.
For example:
https://www.google.com/index.html is correct
https://www.google.com/Index.html is a 404 error
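If you do find wrong-case URLs like that already indexed or linked, one low-tech fix on Apache is to 301 each known variant to its canonical form (the paths here are just the example from this thread):

# .htaccess (mod_alias) - Redirect matches the URL path case-sensitively
Redirect 301 /Home.html /home.html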
Here is an often overlooked tip for reducing blog duplicate content.
1. Add category descriptions, headers and other unique information to the category pages and the main (or most powerful) paginated pages.
This is easily accomplished by adding extra templates to your blog's theme (like custom category templates which have a category description etc).
I'll do a follow up blog post and give away some more tips. Nice post!
Very true Solomon - on my blog I've got unique titles and such for pages as well. It's not that easy to do in WordPress, though; you'll have to invest some time :)
Solomon, please don't forget to follow up. I for one am very interested in this topic. :) Thanks.
What I do with blogs to avoid dup. content.
1. Truncate the post to 30-40 words.
2. "Conitinue reading here" - make sure to change the anchor text into something more appropriate (like the post title).
3. Category pages or archive pages? I dont use archive pages. They list the content in it's entirety. I use category pages with message excepts.
From a usability standpoint, having smaller posts reduces scrolls and lets people navigate to the content they want, more quickly.
mytwocents
I agree this is a huge ongoing issue, and while the engines are potentially working with the major blog/CMS platforms to address default setups and the duplicate content they create out of the box, there are still a lot of potential abuse issues for lesser-supported sites and plugins. The practice of tagging and archiving has only increased the issue.
I recall reading a post recently about an end-all-be-all of .htaccess and/or robots.txt files for Wordpress. This supposedly handled categories, archives, feeds, tags, comments, index pages, etc. However, I just spent a half-hour looking for said post and can't find it for the life of me. If I do find it in the near future, I'll post it up.
This would be really helpful... please post it if you find it Roadies.
Thanks.
You owe me a thumbs up. :D
Creating the ultimate wordpress robotstxt file
I would add two items to Rand's article: one to the problem and one to the solution.
As willcritchlow comments: Same content accessed via multiple URLs is a potential issue as well (ie.:/index.html and /; www and non-www). This can be fixed via .htaccess's PermanentRedirect, Google webmaster central and/or some mod_rewrite lines.
In addition to the meta robots "noindex" tag, we can alternatively use the robots.txt disallow directive. This is especially useful for blocking robots' access to RSS feeds. As the feeds are in XML, we cannot use the meta robots tag there.
I wrote a detailed post about this in my blog. I include relevant examples of robots.txt and .htaccess files.
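To make those two additions concrete, here's a rough sketch - the hostname and the WordPress-style feed paths are examples only, not taken from any particular site:

# .htaccess (Apache mod_rewrite) - force the www version of the hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

And in robots.txt, keeping crawlers out of the XML feeds (the wildcard line only works for engines that support pattern matching, like Google and Yahoo!):

User-agent: *
Disallow: /feed/
Disallow: /*/feed/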
Hamlet, nice article.
You bring up robots.txt and how Google supports the wildcard - it's a very true and simple piece of advice that many people can benefit from.
Thanks, Pat. I am glad you found the article useful.
I don't think that all robots should be blocked from the RSS feeds -- at least not the main one(s). I let Google get my main RSS feed and block the other feeds (in WordPress). I create alternate content in the main RSS feed by creating a custom excerpt for each post which becomes the content of the feed. That prevents duplicate content of the home page and it also creates different content on the category pages.
Between your post and this one I think I have dup content solutions well covered...
robots.txt files were developed to cover a lot of these issues, and I am glad they are starting to get the serious attention they deserve.
Now if someone were to create a site that creates and monitors a site's robots.txt file we would have a winner.
AussieWebmaster,
I think I have the idea for doing what you propose.
Such a tool would crawl a website, identify the duplicate content and create the robots.txt file to fix this.
I have the code for the crawler, and creating the robots.txt is not a big deal. I need some time to carefully research the best approach to detect duplicate content.
I will post the crawler to my blog next week.
Does Google offer the same robots no-content attribute as Yahoo? Have not asked about this yet.... though you have reminded me to do that Monday unless someone wants to jump in here.
They do.
Hmm sorry, that's for RSS feeds. They don't for content afaik.
Thanks for the quick response... let's see what we can do about that - there are enough influential people here to help it along.
Personally I don't like the thought of it, and would like to see it documented some more before diving into it... how links within blocks like that are handled, for instance...
I agree. The robots no-content attribute is not a good idea. The average Web site owner will not know how to use it correctly and SEOs will come up with all kinds of absurd explanations about where it should be applied.
A vast number of people still get rel=nofollow wrong (hint: it does not mean "do not follow", it means "do not vouch").
No-content is non-standardized, ambiguous code bloat and should not be used on Web sites. It is the job of the search engines to determine which sections of a page are part of the template and which parts are the content.
In the future microformats may be able to provide more information about sections of Web pages to search engines, but no-content is not a good solution.
Agreed... though all these sorts of tags are nice for our job security, I don't think they make the web a better place...
We should focus on making things simpler... Not harder.
G-Man wrote about Wordpress duplication about half a year ago.
Another problem (not usually in blogging software, but in custom CMSs) is when you split one article or category across multiple pages and link them like this:
- Original URL: example.com/article-name.htm
- Page 2: example.com/article-name/page2.htm
- And a link from page 2 (or page x) back to the first page: example.com/article-name/page1.htm
Notice that the original URL and page 1 now serve duplicate content! You need to avoid that as well.
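One simple belt-and-braces fix in that situation (using the hypothetical URLs from the example above, on an Apache server) is to 301 the page1 URL back onto the original:

# .htaccess (mod_alias)
Redirect 301 /article-name/page1.htm /article-name.htm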
Great point. Never thought to look for this!
Hi Rand!
Thanks for the informative post.
The latest chatter over at the G Webmaster Help forums seems to suggest that dupe content is not a cause of supplemental results. Personally I think that's not true, but regardless...
I've had a bash at making a plugin to fix these issues - I also have there a few quotes about the latest thoughts on the causes / solutions etc - but not quite as pretty as yours.
https://www.utheguru.com/seo_wordpress-wordpress-seo-plugin
I'd love your comments / suggestions about how to improve the plugin.
Ciao,
Matt
Okay - I have an idea as to how to deal with this and I'd like to know your opinion on it.
What if, instead of using the categories (which automatically create duplicate content), we simply didn't use categories at all?
Instead, we could use the text box widget to create our anchor text links with the category names in them.
Each time we write a post that would fall under that subject, we could go to that main page, manually write in a few sentences about the new post, and link from that main page to the new post.
To solve the front page, make it a page that we manually go to and update too - as if we were making a site and not a blog.
This technique could be used to build a site with wordpress- what do you think of it?
I think that in this way we could be certain of it not having duplicate content.
Thankfully, since this article was written, we have the rel=canonical solution which really helps address such headaches.
I tend to agree with Matt that dup content does impact your site.
I've been dealing with duplicate content on pagination of comments. The comments would get paged, but the text of the post would show up on the next page of comments (duplicate content). I came up with an SEO-friendly solution for dealing with hundreds of comments - it's a wordpress plugin called paginated comments.
I know this is not exactly on topic, but it's close enough that those of you struggling with too many comments may appreciate the information.
This and the supporting links make the article mandatory reading.
I would copy and paste it all into one big blog post but would get hammered for duplicate content :)
So maybe I will just link to the articles and show some link love.
Hmm, but what about those occasional links that happen to show up in your articles - do you noindex, nofollow... or noindex, follow your print-friendly pages?
Excellent post, I'm making this mandatory reading for our development team. I was totally wondering how search engines handle the multiple areas where content gets placed. I wonder if having a lot of comments helps, as they are usually not displayed on the home page?
Also, I've seen some blogs only post intros with "read more" links to the full post. Is this another workaround?
In a wordpress structure, if you are using excerpts for the archive pages, will that be enough to get rid of this duplicate content issue? While using excerpts on the category pages, I found some of those pages getting ranked for certain keyword combinations and getting listed in the SERPs, so I am not seeing a serious case for restricting the visibility of such pages.
If I go with redirection, which page is going to replace my old pages? I am not sure a 301 redirect is the solution here... "Noindex" is acceptable and seems to be the best solution.
I could be wrong though...
kichus: I agree with the first part of what you're saying: make them rank for combinations of terms in different blog posts. Just make sure these pages are unique :)
I myself have been thinking of a way to automatically include one keyword from each post in the title of that specific category or archive page... that would probably rule :)
Well, I was looking through this site's robots.txt and noticed you block page and category from being indexed. Just curious why both of these were blocked, and how you expect bots to find deep pages if you don't allow them to view categories or pages?
On a default Wordpress install you have archive and category pages, along with pagination on those pages. Then you have the pagination of the main site. How do you suggest organizing this for bots? Right now I am allowing categories to be indexed and blocking archives along with the paginated main-page files, e.g. blocking site.com/page/2/ but allowing site.com/category/name/ and site.com/category/name/page/2/.
I am trying to find a better way to do this as I am not happy with that setup.
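For comparison, the setup you're describing boils down to a robots.txt along these lines - the date-archive paths assume default year/month permalinks, so treat this as a sketch rather than a drop-in file:

User-agent: *
# block pagination of the main index (site.com/page/2/ etc.)
Disallow: /page/
# block date-based archives
Disallow: /2006/
Disallow: /2007/
# category pages and their pagination (/category/name/page/2/) stay crawlable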
So which did you ultimately end up using for your Wordpress setup? Did you block categories, tags or did you block the pagination links?
Duplication is like an onion, or a bloomin' onion (mmm) if you prefer your onions deep-fried...
...once you get through one layer, you find there is another one underneath - although an endless bloomin' onion would be a good thing, not so much with duplication issues.
Here is another wordpress duplicate content issue...
I have a bunch of pages indexed with referrer metadata... so the URLs look like these:
https://www.mysite.com/?referrer=www.othersite.com
https://www.mysite.com/index.php?referrer=www.othersite.com
And the pages show up with the same content as the index page.
I think you can do this for any wordpress page.. add a question mark after the URL for any page and put in some additional characters... instant duplicate content.
Now that I think about it, this could be used against a wordpress blog to screw with your rankings.
Post a few hundred links to your competitor's site like:
https://www.competitorsite.com/?your-rankings-will-belong-to-me
https://www.competitorsite.com/?have-fun-sorting-this-one-out
I hope someone has a fix for this because I’m vulnerable right now.
If the pages are indexed, make a list of them and then redirect those specific URLs to the correct location with 301 redirects. Then configure your site to send a 404 header when non-existent query strings are requested.
On a site where those URLs haven't been indexed, you could add to robots.txt:
User-agent: *
Disallow: /?
and/or:
Disallow: /*?
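If you'd rather 301 the junk URLs than just block them, a mod_rewrite sketch like this (the parameter name is taken from the examples above and is otherwise an assumption) strips the referrer query string and points everything back at the clean URL:

# .htaccess (Apache mod_rewrite)
RewriteEngine On
RewriteCond %{QUERY_STRING} ^referrer= [NC]
# the trailing ? drops the query string from the redirect target
RewriteRule ^(.*)$ /$1? [R=301,L]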
I have been dealing with a major dupe issue with a client. Their site is basically an aggregation of blogs. Members all have a blog page, and then at the top level there are several indexes of blog posts, e.g. most recent by category, most popular by category, most popular by category today, most popular by category all time, etc. And of course each directory has pagination. I tried noindexing and nofollowing a lot of these pages to create a direct path to the member blogs and main category pages, and if anything it stopped traffic growth dead in its tracks. I am removing them now but looking for some creative ideas as to how to deal with this stuff. Any ideas, Mozzers?
Awesome post. I'm trying to deal with duplicate content right now involving http and https.
Drig,
I don't think http and https will cause duplicate content problems. Search engine crawlers only follow http links, as far as I know.
I added an http to https redirection example to the .htaccess file in my post.
Actually, I didn't think SEs had an issue with crawling or indexing https.
Identity, you are right. I was not up to date on this.
Thanks for the heads up! I'm updating my article to correct this.
I had a client site that had a minor problem with some duplicate http/https pages. Google doesn't seem to have a problem indexing https pages if they're linked to. That doesn't necessarily mean they'll choose to show that one over an identical http version but it's something to consider.
Sending the same content over HTTP and HTTPS can create duplicate content. This often happens on sites that use relative URLs on internal links.
So if you are on https://example.com/page.php and then click on href="/" then you (and spiders) will end up on https://example.com/ instead of http://example.com/ -- two different URLs with the same content.
One solution is to send different robots.txt files for HTTP and HTTPS.
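One common way to do that on Apache is to hand out a second file when robots.txt is requested over SSL - the robots_ssl.txt filename is just a made-up convention, and this assumes HTTPS runs on port 443:

# .htaccess (Apache mod_rewrite)
RewriteEngine On
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]

The robots_ssl.txt file would then simply disallow everything for the HTTPS version of the site.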
More here:
https://www.google.com/support/webmasters/bin/answer.py?answer=35302
https://blogs.msdn.com/livesearch/archive/2006/06/28/649980.aspx
That is the proposed solution in my post.
This is a really great way to illustrate it. This is definitely going to my programmers.