“No one saw the panda uprising coming. One day, they were frolicking in our zoos. The next, they were frolicking in our entrails. They came for the identical twins first, then the gingers, and then the rest of us. I finally trapped one and asked him the question burning in all of our souls – 'Why?!' He just smiled and said ‘You humans all look alike to me.’”
- Sgt. Jericho “Bamboo” Jackson
Ok, maybe we’re starting to get a bit melodramatic about this whole Panda thing. While it’s true that Panda didn’t change everything about SEO, I think it has been a wake-up call about SEO issues we’ve been ignoring for too long.
One of those issues is duplicate content. While duplicate content as an SEO problem has been around for years, the way Google handles it has evolved dramatically and seems to only get more complicated with every update. Panda has upped the ante even more.
So, I thought it was a good time to cover the topic of duplicate content, as it stands in 2011, in depth. This is designed to be a comprehensive resource – a complete discussion of what duplicate content is, how it happens, how to diagnose it, and how to fix it. Maybe we’ll even round up a few rogue pandas along the way.
I. What Is Duplicate Content?
Let’s start with the basics. Duplicate content exists when any two (or more) pages share the same content. If you’re a visual learner, here’s an illustration for you:
Easy enough, right? So, why does such a simple concept cause so much difficulty? One problem is that people often make the mistake of thinking that a “page” is a file or document sitting on their web server. To a crawler (like Googlebot), a page is any unique URL it happens to find, usually through internal or external links. Especially on large, dynamic sites, creating two URLs that land on the same content is surprisingly easy (and often unintentional).
II. Why Do Duplicates Matter?
Duplicate content as an SEO issue was around long before the Panda update, and has taken many forms as the algorithm has changed. Here’s a brief look at some major issues with duplicate content over the years…
The Supplemental Index
In the early days of Google, just indexing the web was a massive computational challenge. To deal with this challenge, some pages that were seen as duplicates or just very low quality were stored in a secondary index called the “supplemental” index. These pages automatically became 2nd-class citizens, from an SEO perspective, and lost any competitive ranking ability.
Around late 2006, Google integrated supplemental results back into the main index, but those results were still often filtered out. You know you’ve hit filtered results anytime you see this warning at the bottom of a Google SERP:
Even though the index was unified, results were still “omitted”, with obvious consequences for SEO. Of course, in many cases, these pages really were duplicates or had very little search value, and the practical SEO impact was negligible, but not always.
The Crawl “Budget”
It’s always tough to talk limits when it comes to Google, because people want to hear an absolute number. There is no absolute crawl budget or fixed number of pages that Google will crawl on a site. There is, however, a point at which Google may give up crawling your site for a while, especially if you keep sending spiders down winding paths.
Although the “budget” isn’t absolute, even for a given site, you can get a sense of Google’s crawl allocation for your site in Google Webmaster Tools (under “Diagnostics” > “Crawl Stats”):
So, what happens when Google hits so many duplicate paths and pages that it gives up for the day? Practically, the pages you want indexed may not get crawled. At best, they probably won’t be crawled as often.
The Indexation “Cap”
Similarly, there’s no set “cap” to how many pages of a site Google will index. There does seem to be a dynamic limit, though, and that limit is relative to the authority of the site. If you fill up your index with useless, duplicate pages, you may push out more important, deeper pages. For example, if you load up on 1000s of internal search results, Google may not index all of your product pages. Many people make the mistake of thinking that more indexed pages is better. I’ve seen too many situations where the opposite was true. All else being equal, bloated indexes dilute your ranking ability.
The Penalty Debate
Long before Panda, a debate would erupt every few months over whether or not there was a duplicate content penalty. While these debates raised valid points, they often focused on semantics – whether or not duplicate content caused a Capital-P Penalty. While I think the conceptual difference between penalties and filters is important, the upshot for a site owner is often the same. If a page isn’t ranking (or even indexed) because of duplicate content, then you’ve got a problem, no matter what you call it.
The Panda Update
Since Panda (starting in February 2011), the impact of duplicate content has become much more severe in some cases. It used to be that duplicate content could only harm that content itself. If you had a duplicate, it might go supplemental or get filtered out. Usually, that was ok. In extreme cases, a large number of duplicates could bloat your index or cause crawl problems and start impacting other pages.
Panda made duplicate content part of a broader quality equation – now, a duplicate content problem can impact your entire site. If you’re hit by Panda, non-duplicate pages may lose ranking power, stop ranking altogether, or even fall out of the index. Duplicate content is no longer an isolated problem.
III. Three Kinds of Duplicates
Before we dive into examples of duplicate content and the tools for dealing with them, I’d like to cover 3 broad categories of duplicates. They are: (1) True Duplicates, (2) Near Duplicates, and (3) Cross-domain Duplicates. I’ll be referencing these 3 main types in the examples later in the post.
(1) True Duplicates
A true duplicate is any page that is 100% identical (in content) to another page. These pages only differ by the URL:
(2) Near Duplicates
A near duplicate differs from another page (or pages) by a very small amount – it could be a block of text, an image, or even the order of the content:
An exact definition of “near” is tough to pin down, but I’ll discuss some examples in detail later.
(3) Cross-domain Duplicates
A cross-domain duplicate occurs when two websites share the same piece of content:
These duplicates could be either “true” or “near” duplicates. Contrary to what some people believe, cross-domain duplicates can be a problem even for legitimate, syndicated content.
IV. Tools for Fixing Duplicates
This may seem out of order, but I want to discuss the tools for dealing with duplicates before I dive into specific examples. That way, I can recommend the appropriate tools to fix each example without confusing anyone.
(1) 404 (Not Found)
Of course, the simplest way to deal with duplicate content is to just remove it and return a 404 error. If the content really has no value to visitors or search, and if it has no significant inbound links or traffic, then total removal is a perfectly valid option.
(2) 301 Redirect
Another way to remove a page is via a 301-redirect. Unlike a 404, the 301 tells visitors (humans and bots) that the page has permanently moved to another location. Human visitors seamlessly arrive at the new page. From an SEO perspective, most of the inbound link authority is also passed to the new page. If your duplicate content has a clear canonical URL, and the duplicate has traffic or inbound links, then a 301-redirect may be a good option.
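On an Apache server, for example, a one-off 301 can be set up in .htaccess with a rule along these lines (the paths and domain here are placeholders, and the exact setup depends on your server):

Redirect 301 /duplicate-page.html http://www.example.com/canonical-page.html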
(3) Robots.txt
Another option is to leave the duplicate content available for human visitors, but block it for search crawlers. The oldest and probably still easiest way to do this is with a robots.txt file (generally located in your root directory). It looks something like this:
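User-agent: *
Disallow: /duplicate-folder/
Disallow: /*?sessionid=

(The folder and parameter names are just placeholders – each Disallow line lists a path or pattern you want to keep crawlers out of. Wildcard patterns like the last line aren't part of the original robots.txt standard, but Google and Bing both support them.)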
One advantage of robots.txt is that it’s relatively easy to block entire folders or even URL parameters. The disadvantage is that it’s an extreme and sometimes unreliable solution. While robots.txt is effective for blocking uncrawled content, it’s not great for removing content already in the index. The major search engines also seem to frown on its overuse, and don’t generally recommend robots.txt for duplicate content.
(4) Meta Robots
You can also control the behavior of search bots at the page level, with a header-level directive known as the “Meta Robots” tag (or sometimes “Meta Noindex”). In its simplest form, the tag looks something like this:
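<meta name="robots" content="noindex, nofollow">

(The tag sits in the page's <head> section.)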
This directive tells search bots not to index this particular page or follow links on it. Anecdotally, I find it a bit more SEO-friendly than Robots.txt, and because the tag can be created dynamically with code, it can often be more flexible.
The other common variant for Meta Robots is the content value “NOINDEX, FOLLOW”, which allows bots to crawl the paths on the page without adding the page to the search index. This can be useful for pages like internal search results, where you may want to block certain variations (I’ll discuss this more later) but still follow the paths to product pages.
One quick note: there is no need to ever add a Meta Robots tag with “INDEX, FOLLOW” to a page. All pages are indexed and followed by default (unless blocked by other means).
(5) Rel=Canonical
In 2009, the search engines banded together to create the Rel=Canonical directive, sometimes called just “Rel-canonical” or the “Canonical Tag”. This allows webmasters to specify a canonical version for any page. The tag goes in the page header (like Meta Robots), and a simple example looks like this:
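<link rel="canonical" href="http://www.example.com/" />

(The href is a placeholder – it should point to whichever URL you consider the canonical version of the page.)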
When search engines arrive on a page with a canonical tag, they attribute the page to the canonical URL, regardless of the URL they used to reach the page. So, for example, if a bot reached the above page using the URL “www.example.com/index.html”, the search engine would not index the additional, non-canonical URL. Typically, it seems that inbound link-juice is also passed through the canonical tag.
It’s important to note that you need to clearly understand what the proper canonical page is for any given website template. Canonicalizing your entire site to just one page or the wrong pages can be catastrophic.
(6) Google URL Removal
In Google Webmaster Tools (GWT), you can request that an individual page (or directory) be manually removed from the index. Click on “Site configuration” > “Crawler access”, and you’ll see a series of 3 tabs. Click on the 3rd tab, “Remove URL”, to get this:
Since this tool only removes one URL or path at a time and is completely at Google’s discretion, it’s usually a last-ditch approach to duplicate content. I just want to be thorough, though, and cover all of your options. An important technical note: you need to 404, Robots.txt block or Meta Noindex the page before requesting removal. Removal via GWT is primarily a last defense when Google is being stubborn.
Update: In the comments, Taylor pointed out that Google lifted the requirement that you have to first block the page to request removal. Removal requests can be done without blocking via other means now, but the removals only last 90 days.
(7) Google Parameter Blocking
You can also use GWT to specify URL parameters that you want Google to ignore (which essentially blocks indexation of pages with those parameters). If you click on “Site Configuration” > “URL parameters”, you’ll get a list something like this:
This list shows URL parameters that Google has detected, as well as the settings for how those parameters should be crawled. Keep in mind that the “Let Googlebot decide” setting doesn’t reflect other blocking tactics, like Robots.txt or Meta Robots. If you click on “Edit”, you’ll get the following options:
Google changed these recently, and I find the new version a bit confusing, but essentially “Yes” means the parameter is important and should be indexed, while “No” means the parameter indicates a duplicate. The GWT tool seems to be effective (and can be fast), but I don’t usually recommend it as a first line of defense. It won’t impact other search engines, and it can’t be read by SEO tools and monitoring software. It could also be modified by Google at any time.
(8) Bing URL Removal
Bing Webmaster Center (BWC) has tools very similar to GWT’s options above. Actually, I think the Bing parameter blocking tool came before Google’s version. To request a URL removal in Bing, click on the “Index” tab and then “Block URLs” > “Block URL and Cache”. You’ll get a pop-up like this:
BWC actually gives you a wider range of options, including blocking a directory and your entire site. Obviously, that last one usually isn’t a good idea.
(9) Bing Parameter Blocking
In the same section of BWC (“Index”), there’s an option called “URL Normalization”. The name implies Bing treats this more like canonicalization, but there’s only one option – “ignore”. Like Google, you get a list of auto-detected parameters and can add or modify them:
As with the GWT tools, I’d consider the Bing versions to be a last resort. Generally, I’d only use these tools if other methods have failed, and one search engine is just giving you grief.
(10) Rel=Prev & Rel=Next
Just this year (September 2011), Google gave us a new tool for fighting a particular form of near-duplicate content – paginated search results. I’ll describe the problem in more detail in the next section, but essentially paginated results are any searches where the results are broken up into chunks, with each chunk (say, 10 results) having its own page/URL.
You can now tell Google how paginated content connects by using a pair of tags much like Rel-Canonical. They’re called Rel-Prev and Rel-Next. Implementation is a bit tricky, but here’s a simple example:
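<link rel="prev" href="http://www.example.com/search?page=2" />
<link rel="next" href="http://www.example.com/search?page=4" />

(Hypothetical URLs – these two tags would sit in the <head> of page 3 of the results.)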
In this example, the search bot has landed on page 3 of search results, so you need two tags: (1) a Rel-Prev pointing to page 2, and (2) a Rel-Next pointing to page 4. Where it gets tricky is that you’re almost always going to have to generate these tags dynamically, as your search results are probably driven by one template.
While initial results suggest these tags do work, they’re not currently honored by Bing, and we really don’t have much data on their effectiveness. I’ll briefly discuss other methods for dealing with paginated content in the next section.
(11) Syndication-Source
Note: It appears that the syndication-source tag was deprecated in June of 2012. Thanks to @WriteonPointSEO for pointing this out in the comments. The update wasn't very well announced, but it appears to be legitimate. I'll leave the section of the post intact, but please understand that this tag probably has no impact currently.
In November of 2010, Google introduced a set of tags for publishers of syndicated content. The Meta Syndication-Source directive can be used to indicate the original source of a republished article, as follows:
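<meta name="syndication-source" content="http://www.example.com/original-article.html">

(The content attribute is a placeholder – it should point to the URL of the original article.)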
Even Google’s own advice on when to use this tag versus a cross-domain canonical tag is a little bit unclear. Google launched this tag as “experimental”, and I’m not sure they’ve publicly announced a status change. It’s something to watch, but don’t rely on it.
Update (11/21/11): For even more confusion, Google has recently added the "standout" tag. This is supposed to be used when you break a news story, but the interplay between it and syndication-source is unclear. Again, I wouldn't rely on these tags for now. Thanks to SEO Workers for pointing this out in the comments.
(12) Internal Linking
It’s important to remember that your best tool for dealing with duplicate content is to not create it in the first place. Granted, that’s not always possible, but if you find yourself having to patch dozens of problems, you may need to re-examine your internal linking structure and site architecture.
When you do correct a duplication problem, such as with a 301-redirect or the canonical tag, it’s also important to make your other site cues reflect that change. It’s amazing how often I see someone set a 301 or canonical to one version of a page, and then continue to link internally to the non-canonical version and fill their XML sitemap with non-canonical URLs. Internal links are strong signals, and sending mixed signals will only cause you problems.
(13) Don’t Do Anything
Finally, you can let the search engines sort it out. This is what Google recommended you do for years, actually. Unfortunately, in my experience, especially for large sites, this is almost always a bad idea. It’s important to note, though, that not all duplicate content is a disaster, and Google certainly can filter some of it out without huge consequences. If you only have a few isolated duplicates floating around, leaving them alone is a perfectly valid option.
(14) Rel="alternate" hreflang="x"
(Added on 04/02/12 - hat tip to @YuriKolovsky). Since this post was published, Google introduced a new way of dealing with translated content and same-language content with regional variations (such as US English vs UK English). Implementation of these tags is complex and very situational, but here's a complete write-up on the hreflang="x" attribute.
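As a bare-bones illustration (the URLs are placeholders, and a real implementation needs the full set of annotations on every version of the page), the tags look something like this:

<link rel="alternate" hreflang="en-us" href="http://www.example.com/us/" />
<link rel="alternate" hreflang="en-gb" href="http://www.example.com/uk/" />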
V. Examples of Duplicate Content
So, now that we’ve worked backwards and sorted out the tools for fixing duplicate content, what does it actually look like in the wild? I’m going to cover a wide range of examples that represent the issues you can expect on a real website. Throughout this section, I’ll reference the solutions listed in Section IV – for example, a reference to a 301-redirect will cite (IV-2).
(1) “www” vs. Non-www
For sitewide duplicate content, this is probably the biggest culprit. Whether you’ve got bad internal paths or have attracted links and social mentions to the wrong URL, you’ve got both the “www” version and non-www (root domain) version of your URLs indexed:
Most of the time, a 301-redirect (IV-2) is your best choice here. This is a common problem, and Google is good about honoring redirects for cases like these.
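If you’re on Apache with mod_rewrite, for instance, a sitewide rule along these lines (with example.com as a placeholder) will push the non-www version over to “www”:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]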
You may also want to set your preferred address in Google Webmaster Tools. Under “Site Configuration” > “Settings”, you should see a section called “Preferred domain”:
There’s a quirk in GWT where, to set a preferred domain, you may have to create GWT profiles for both your “www” and non-www versions of the site. While this is annoying, it won’t cause any harm. If you’re having major canonicalization issues, I’d recommend it. If you’re not, then you can leave well enough alone and let Google determine the preferred domain.
(2) Staging Servers
While much less common than (1), this problem is often also caused by subdomains. In a typical scenario, you’re working on a new site design for a relaunch, your dev team sets up a subdomain with the new site, and they accidentally leave it open to crawlers. What you end up with is two sets of indexed URLs that look something like this:
Your best bet is to prevent this problem before it happens, by blocking the staging site with Robots.txt (IV-3). If you find your staging site indexed, though, you’ll probably need to 301-redirect (IV-2) those pages or Meta Noindex them (IV-4).
(3) Trailing Slashes ("/")
This is a problem people often have questions about, although it's less of an SEO issue than it once was. Technically, in the original HTTP protocol, a URL with a trailing slash and one without it were different URLs. Here's a simple example:
These days, almost all browsers automatically add the trailing slash behind the scenes and resolve both versions the same way. Matt Cutts did a recent video suggesting that Google automatically canonicalizes these URLs in "the vast majority of cases".
(4) Secure (https) Pages
If your site has secure pages (designated by the “https:” protocol), you may find that both secure and non-secure versions are getting indexed. This most frequently happens when navigation links from secure pages – like shopping cart pages – also end up secured, usually due to relative paths, creating variants like this:
Ideally, these problems are solved by the site-architecture itself. In many cases, it’s best to Noindex (IV-4) secure pages – shopping cart and check-out pages have no place in the search index. After the fact, though, your best option is a 301-redirect (IV-2). Be cautious with any sitewide solutions – if you 301-redirect all “https:” pages to their “http:” versions, you could end up removing security entirely. This is a tricky problem to solve and should be handled carefully.
(5) Home-page Duplicates
While problems (1)-(3) can all create home-page duplicates, the home-page has a couple unique problems of its own. The most typical problem is that both the root domain and the actual home-page document name get indexed. For example:
Although this problem can be solved with a 301-redirect (IV-2), it’s often a good idea to put a canonical tag on your home-page (IV-5). Home pages are uniquely afflicted by duplicates, and a proactive canonical tag can prevent a lot of problems.
Of course, it’s important to also be consistent with your internal paths (IV-12). If you want the root version of the URL to be canonical, but then link to “/index.htm” in your navigation, you’re sending mixed signals to Google every time the crawlers visit.
(6) Session IDs
Some websites (especially e-commerce platforms) tag each new visitor with a tracking parameter. On occasion, that parameter ends up in the URL and gets indexed, creating something like this:
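www.example.com/product.php?id=1234
www.example.com/product.php?id=1234&sessionid=5XZ93AB7

(Hypothetical URLs – the second is the same product page with a session ID appended.)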
That example really doesn’t do the problem justice, because in reality you can end up with a duplicate for every single session ID and page combination that gets indexed. Session IDs in the URL can easily add 1000s of duplicate pages to your index.
The best option, if possible on your site/platform, is to remove the session ID from the URL altogether and store it in a cookie. There are very few good reasons to create these URLs, and no reason to let bots crawl them. If that’s not feasible, implementing the canonical tag (IV-5) sitewide is a good bet. If you really get stuck, you can block the parameter in Google Webmaster Tools (IV-7) and Bing Webmaster Central (IV-9).
(7) Affiliate Tracking
This problem looks a lot like (6) and happens when sites provide a tracking variable to their affiliates. This variable is typically appended to landing page URLs, like so:
The damage is usually a bit less extreme than (6), but it can still cause large-scale duplication. The solutions are similar to session IDs. Ideally, you can capture the affiliate ID in a cookie and 301-redirect (IV-2) to the canonical version of the page. Otherwise, you’ll probably either need to use canonical tags (IV-5) or block the affiliate URL parameter.
(8) Duplicate Paths
Having duplicate paths to a page is perfectly fine, but when duplicate paths generate duplicate URLs, then you’ve got a problem. Let’s say a product page can be reached one of 3 ways:
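www.example.com/computers/ipad2
www.example.com/tablets/ipad2
www.example.com/tag/favorites/ipad2

(Hypothetical URLs – the first two come from category navigation, the third from a user-generated tag.)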
Here, the iPad2 product page can be reached by 2 categories and a user-generated tag. User-generated tags are especially problematic, because they can theoretically spawn unlimited versions of a page.
Ideally, these path-based URLs shouldn’t be created at all. However a page is navigated to, it should only have one URL for SEO purposes. Some will argue that including navigation paths in the URL is a positive cue for site visitors, but even as someone with a usability background, I think the cons almost always outweigh the pros here.
If you already have variations indexed, then a 301-redirect (IV-2) or canonical tag (IV-5) are probably your best options. In many cases, implementing the canonical tag will be easier, since there may be too many variations to easily redirect. Long-term, though, you’ll need to re-evaluate your site architecture.
(9) Functional Parameters
Functional parameters are URL parameters that change a page slightly but have no value for search and are essentially duplicates. For example, let’s say that all of your product pages have a printable version, and that version has its own URL:
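www.example.com/product.php?id=1234
www.example.com/product.php?id=1234&print=1

(Hypothetical URLs – the second is the printable version of the same product page.)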
Here, the “print=1” URL variable indicates a printable version, which normally would have the same content but a modified template. Your best bet is to not index these at all, with something like a Meta Noindex (IV-4), but you could also use a canonical tag (IV-5) to consolidate these pages.
(10) International Duplicates
These duplicates occur when you have content for different countries which share the same language, all hosted on the same root domain (it could be subfolders or subdomains). For example, you may have an English version of your product pages for the US, UK, and Australia:
Unfortunately, this one’s a bit tough – in some cases, Google will handle it perfectly well and rank the appropriate content in the appropriate countries. In other cases, even with proper geo-targeting, they won’t. It’s often better to target the language itself than the country, but there are legitimate reasons to split off country-specific content, such as pricing.
If your international content does get treated as duplicate content, there’s no easy answer. If you 301-redirect, you lose the page for visitors. If you use the canonical tag, then Google will only rank one version of the page. The “right” solution can be highly situational and really depends on the risk-reward tradeoff (and the scope of the filter/penalty).
(11) Search Sorts
So far, all of the examples I’ve given have been true duplicates. I’d like to dive into a few examples of “near” duplicates, since that concept is a bit fuzzy. A few common examples pop up with internal search engines, which tend to spin off many variants – sortable results, filters, and paginated results being the most frequent problems.
Search sort duplicates pop up whenever a sort (ascending/descending) creates a separate URL. While the two sorted results are technically different pages, they add no additional value to the search index and contain the same content, just in a different order. URLs might look like:
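www.example.com/search.php?q=ipad&sort=asc
www.example.com/search.php?q=ipad&sort=desc

(Hypothetical URLs – the same result set, just sorted in opposite directions.)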
In most cases, it’s best just to block the sortable versions completely, usually by adding a Meta Noindex (IV-4) selectively to pages called with that parameter. In a pinch, you could block the sort parameter in Google Webmaster Tools (IV-7) and Bing Webmaster Central (IV-9).
(12) Search Filters
Search filters are used to narrow an internal search – it could be price, color, features, etc. Filters are very common on e-commerce sites that sell a wide variety of products. Search filter URLs look a lot like search sorts, in many cases:
The solution here is similar to (11) – don’t index the filters. As long as Google has a clear path to products, indexing every variant usually causes more harm than good.
(13) Search Pagination
Pagination is an easy problem to describe and an incredibly difficult one to solve. Any time you split internal search results into separate pages, you have paginated content. The URLs are easy enough to visualize:
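www.example.com/search.php?q=ipad&page=1
www.example.com/search.php?q=ipad&page=2
www.example.com/search.php?q=ipad&page=3

(Hypothetical URLs – one URL per chunk of the same search.)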
Of course, over 100s of results, one search can easily spin out dozens of near duplicates. While the results themselves differ, many important features of the pages (Titles, Meta Descriptions, Headers, copy, template, etc.) are identical. Add to that the problem that Google isn’t a big fan of “search within search” (having their search results land on your search results pages).
In the past, Google has said to let them sort pagination out – problem is, they haven’t done it very well. Recently, Google introduced Rel=Prev and Rel=Next (IV-10). Initial data suggests these tags work, but we don’t have much data, they’re difficult to implement, and Bing doesn’t currently support them.
You have 3 other, viable options (in my opinion), although how and when they’re viable depends a lot on the situation:
- You can Meta Noindex,Follow pages 2+ of search results. Let Google crawl the paginated content but don’t let them index it.
- You can create a “View All” page that links to all search results at one URL, and let Google auto-detect it. This seems to be Google’s other preferred option.
- You can create a “View All” page and set the canonical tag of paginated results back to that page. This is unofficially endorsed, but the pages aren’t really duplicates in the traditional sense, so some claim it violates the intent of Rel-canonical.
Adam Audette has a recent, in-depth discussion of search pagination that I highly recommend. Pagination for SEO is a very difficult topic and well beyond the scope of this post.
(14) Product Variations
Product variant pages are pages that branch off from the main product page and only differ by one feature or option. For example, you might have a page for each color a product comes in:
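www.example.com/ipad2/black/
www.example.com/ipad2/white/

(Hypothetical, parameter-free URLs – one per color option of the same product.)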
It can be tempting to want to index every color variation, hoping it pops up in search results, but in most cases I think the cons outweigh the pros. If you have a handful of product variations and are talking about dozens of pages, fine. If product variations spin out into 100s or 1000s, though, it’s best to consolidate. Although these pages aren’t technically true duplicates, I think it’s ok to Rel-canonical (IV-5) the options back up to the main product page.
One side note: I purposely used “static” URLs in this example to demonstrate a point. Just because a URL doesn’t have parameters, that doesn’t make it immune to duplication. Static URLs (parameter-free) may look prettier, but they can be duplicates just as easily as dynamic URLs.
(15) Geo-keyword Variations
Once upon a time, “local SEO” meant just copying all of your pages 100s of times, adding a city name to the URL, and swapping out that city in the page copy. It created URLs like these:
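www.example.com/seo-services-chicago
www.example.com/seo-services-boston
www.example.com/seo-services-miami

(Hypothetical URLs – same template, same copy, with only the city name swapped out.)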
In 2011, not only is local SEO a lot more sophisticated, but these pages are almost always going to look like near-duplicates. If you have any chance of ranking, you’re going to need to invest in legitimate, unique content for every geographic region you spin out. If you aren’t willing to make that investment, then don’t create the pages. They’ll probably backfire.
(16) Other “Thin” Content
This isn’t really an example, but I wanted to stop and explain a word we throw around a lot when it comes to content: “thin”. While thin content can mean a variety of things, I think many examples of thin content are near-duplicates like (14) above. Whenever you have pages that vary by only a tiny percentage of content, you risk those pages looking low-value to Google. If those pages are heavy on ads (with more ads than unique content), you’re at even more risk. When too much of your site is thin, it’s time to revisit your content strategy.
(17) Syndicated Content
These last 3 examples all relate to cross-domain content. Here, the URLs don’t really matter – they could be wildly different. Examples (17) and (18) only differ by intent. Syndicated content is any content you use with permission from another site. However you retrieve and integrate it, that content is available on another site (and, often, many sites).
While syndication is legitimate, it’s still likely that one or more copies will get filtered out of search results. You could roll the dice and see what happens (IV-13), but conventional SEO wisdom says that you should link back to the source and probably set up a cross-domain canonical tag (IV-5). A cross-domain canonical looks just like a regular canonical, but with a reference to someone else’s domain.
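For example (with placeholder domains), the syndicated copy on your site would carry something like:

<link rel="canonical" href="http://www.originalpublisher.com/original-article.html" />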
Of course, a cross-domain canonical tag means that, assuming Google honors the tag, your page won’t get indexed or rank. In some cases, that’s fine – you’re using the content for its value to visitors. Practically, I think it depends on the scope. If you occasionally syndicate content to beef up your own offerings but also have plenty of unique material, then link back and leave it alone. If a larger part of your site is syndicated content, then you could find yourself running into trouble. Unfortunately, using the canonical tag (IV-5) means you'll lose the ranking ability of that content, but it could keep you from getting penalized or having Panda-related problems.
(18) Scraped Content
Scraped content is just like syndicated content, except that you didn’t ask permission (and might even be breaking the law). The best solution: QUIT BREAKING THE LAW!
Seriously, no de-duping solution is going to satisfy the scrapers among you, because most solutions will knock your content out of ranking contention. The best you can do is pad the scraped content with as much of your own, unique content as possible.
(19) Cross-ccTLD Duplicates
Finally, it’s possible to run into trouble when you copy same-language content across countries – see example (10) above – even with separate Top-Level Domains (TLDs). Fortunately, this problem is fairly rare, but we see it with English-language content and even with some European languages. For example, I frequently see questions about Dutch content on Dutch and Belgian domains ranking improperly.
Unfortunately, there’s no easy answer here, and most of the solutions aren’t traditional duplicate-content approaches. In most cases, you need to work on your targeting factors and clearly show Google that the domain is tied to the country in question.
VI. Which URL Is Canonical?
I’d like to take a quick detour to discuss an important question – whether you use a 301-redirect or a canonical tag, how do you know which URL is actually canonical? I often see people making a mistake like this:
The problem is that “product.php” is just a template – you’ve now collapsed all of your products down to a single page (that probably doesn’t even display a product). In this case, the canonical version probably includes a parameter, like “id=1234”.
The canonical page isn’t always the simplest version of the URL – it’s the simplest version of the URL that generates UNIQUE content. Let’s say you have these 3 URLs that all generate the same product page:
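www.example.com/product.php?id=1234
www.example.com/product.php?id=1234&print=1
www.example.com/product.php?id=1234&sessionid=5XZ93AB7

(Hypothetical URLs, to match the earlier examples.)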
Two of these versions are essentially duplicates, and the “print” and “session” parameters represent variations on the main product page that should be de-duped. The “id” parameter is essential to the content, though – it determines which product is actually being displayed.
So, consider yourself warned. As much trouble as rampant duplicates can be, bad canonicalization can cause even more damage in some cases. Plan carefully, and make absolutely sure you select the correct canonical versions of your pages before consolidating them.
VII. Tools for Diagnosing Duplicates
So, now that you recognize what duplicate content looks like, how do you go about finding it on your own site? Here are a few tools to get you started – I won’t claim it’s a complete list, but it covers the bases:
(1) Google Webmaster Tools
In Google Webmaster Tools, you can pull up a list of duplicate TITLE tags and Meta Descriptions Google has crawled. While these don’t tell the whole story, they’re a good starting point. Many URL-based duplicates will naturally generate identical Meta data. In your GWT account, go to “Diagnostics” > “HTML Suggestions”, and you’ll see a table like this:
You can click on “Duplicate meta descriptions” and “Duplicate title tags” to pull up a list of the duplicates. This is a great first stop for finding your trouble-spots.
(2) Google’s Site: Command
When you already have a sense of where you might be running into trouble and need to take a deeper dive, Google’s “site:” command is a very powerful and flexible tool. What really makes “site:” powerful is that you can use it in conjunction with other search operators.
Let’s say, for example, that you’re worried about home-page duplicates. To find out if Google has indexed any copies of your home-page, you could use the “site:” command with the “intitle:” operator, like this:
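site:example.com intitle:"Your Home Page Title"

(The domain and title are placeholders – swap in your own root domain and home-page title.)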
Put the title in quotes to capture the full phrase, and always use the root domain (leave off “www”) when making a wide sweep for duplicate content. This will detect both “www” and non-www versions.
Another powerful combination is “site:” plus the “inurl:” operator. You could use this to detect parameters, such as the search-sort problem mentioned above:
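site:example.com inurl:sort

(Again, placeholders – “sort” stands in for whichever parameter you’re checking on.)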
The “inurl:” operator can also detect the protocol used, which is handy for finding out whether any secure (https:) copies of your pages have been indexed:
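site:example.com inurl:https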
You can also combine the “site:” operator with regular search text, to find near-duplicates (such as blocks of repeated content). To search for a block of content across your site, just include it in quotes:
I should also mention that searching for a unique block of content in quotes is a cheap and easy way to find out if people have been scraping your site. Just leave off the “site:” operator and search for a long or unique block entirely in quotes.
Of course, these are just a few examples, but if you really need to dig deep, these simple tools can be used in powerful ways. Ultimately, the best way to tell if you have a duplicate content problem is to see what Google sees.
(3) SEOmoz Campaign Manager
If you’re an SEOmoz PRO member, you have access to some additional tools for spotting duplicates in your Campaigns. In addition to duplicate page titles, the Campaign manager will detect duplicate content on the pages themselves. You can see duplicate pages we’ve detected from the Campaign Overview screen:
Click on the “Duplicate Page Content” link and you’ll not only see a list of potential duplicates, but you’ll get a graph of how your duplicate count has changed over time:
The historical graph can be very useful for determining if any recent changes you’ve made have created (or resolved) duplicate content issues.
Just a technical note, since it comes up a lot in Q&A – Our system currently uses a threshold of 95% to determine whether content is duplicated. This is based on the source code (not the text copy), so the amount of actual duplicate content may vary depending on the code/content ratio.
(4) Your Own Brain
Finally, it’s important to remember to use your own brain. Finding duplicate content often requires some detective work, and over-relying on tools can leave some gaps in what you find. One critical step is to systematically navigate your site to find where duplicates are being created. For example, does your internal search have sorts and filters? Do those sorts and filters get translated into URL variables, and are they crawlable? If they are, you can use the “site:” command to dig deeper. Even finding a handful of trouble spots using your own sleuthing skills can end up revealing 1000s of duplicate pages, in my experience.
I Hope That Covers It
If you’ve made it this far: congratulations – you’re probably as exhausted as I am. I hope that covers everything you’d want to know about the state of duplicate content in 2011, but if not, I’d be happy to answer questions in the comments. Dissenting opinions are welcome, too. Some of these topics, like pagination, are extremely tricky in practice, and there’s often not one “right” answer. Finally, if you liked my panda mini-poster, here’s a link to a larger version of Pandas Take No Prisoners.
Update: Post-publication, a handful of people requested a stand-alone PDF version of the post. You can download it here (22 pages, 560KB).
Woah!
While I was reading this guide:
Thanks Peter for this titanic effort of classifying the different kinds of duplicate content – a topic that seems so easy to understand but gets so confusing at times, as you (and I) have seen from the tons of questions about it in the Q&A.
Nice job Gianluca,
Epic comment to match an epic post!
Sha
LOL - nice :) Imagine what happened while I wrote it.
Lol - delightful analogies Gianluca!
Haha!
Duplicate Content 101 with Dr Pete... You just gave me 723 more reasons to continue educating our clients on the importance of [unique] content.
Thanks for all of the time you spent in the research and in putting this together in an easy-to-understand format, Dr. Pete! I know I'll be using it and sharing it a lot in Q&A. This is one of those posts that will wind up on the most-popular list for sure.
Dr Pete, thank you so much. I've been looking for a thorough post on duplicate content for a while. Also, for those SEO Excel ninjas, the TEXT TO COLUMN command works well for slicing up your internal link data, especially if you're looking for duplicate content. For instance, just change the delimiter to "/" and you'll be able to sort http for https with ease. Thanks for the post Doc.
Dr. Pete,
Wow, this is an incredible resource! This has definitely become one of my bookmarked pages and a post that I'll be referencing to my clients whenever they have any duplicate content concerns.
Something to consider: regarding syndicated content, we found that if the original content is behind a paywall, then the situation actually becomes a little bit different. This was a special exception for one of my clients who wanted to know whether syndicated content was worth the expense. Even though Google had crawled and indexed the original content, my client's content still ranked higher because it wasn't behind a paywall. However, we still recommended that they place a cross-domain canonical on their syndicated content, since it would likely improve the overall authority of the site and demonstrates their trustworthiness to search engines.
Other than that, really appreciated your advice about how to find duplicate content via search operators, especially not using www. Also, I agree that perusing Google Webmaster Tools when analyzing a new site can bring incredible insights. For instance, I would suggest using Google Webmaster Tools to find which sites link back to your site. This has helped me find tons of duplicate content, especially when clients own other domains and build a link network between these almost identical sites because they think it will help them take over the search results.
Overall, sometimes less is more.
Excellent tip about cross-domain duplicate content.
Agreed. It's obvious but useful for someone.
Great tip - I admit I didn't cover syndication in great detail here, and it really is a complex topic. That was one of the biggest challenges of writing the post - at least half the scenarios I discuss have many incarnations and the "right" solution can be very situational.
Hi Stephanie, Dr Pete,
I am doing the SEO for a large hotel price comparison aggregator site which is chock full of syndicated content and has a duplicate content penalty. I was thinking of using noindex, but I like your logic and am going to follow your lead by adding the syndication-source tag to give credit for the hotel descriptions to the booking sites they came from. I am hoping that it will make the site more trustworthy and improve rankings for the pages that do have original content. Do you think it should work?
As I read up on the tag, I noticed that on 2/11/11 Google added a note to the "credit where credit is due" article stating "we’ve updated our system to use rel=canonical instead of syndication-source, if both are specified". This seems to indicate that Google considers rel=canonical to have a similar effect to syndication-source. Following this through, it seems to me that the following scenario will look to Google like an attempt to claim original authorship, and this could be part of the hotel comparison site's current woes. What do you think?
domain.com/hotel-123.html publishing syndicated content from another hotel site but using
Thanks
Ben
Wow Dr Pete!
First I have to say thank you for pressing on and covering the whole range of duplicate content issues and solutions! Half doing the job would have just lead to more confusion.
This post will be an especially outstanding resource for those of us who spend time in Q&A. To be honest, all I want for Christmas is a pdf download link so I can print it for use as a quick reference!
This one was a huge commitment in time & effort. Thank you for being generous enough to do all of that for the community. Very much appreciated.
Sha
@Sha Menz
You are right. Yesterday I added my question on SEOmoz, and today I got a Christmas gift from Dr. Pete. I was going to tweet you my question today, but now it's all covered on one platform with this rock-solid blog post. Duplicate content will be standing at the exit door of the website. :)
You upset my schedule for the day with that long post, but it was worth it :-)
This is one of my favorite blog posts of the year - a real comprehensive guide!
One addition to the tools for fixing duplicates: the X-Robots-Tag, e.g. for PDFs.
You know, I pondered X-robots, but I was really tired :) Seriously, that is an important new tool. I'd like to see a post from someone who's used it a few different ways, because I've only dabbled at this point. It's a bit trickier to set up than most of the traditional solutions.
Pete -
Wow. Just wow. This is an incredible amount of content and I really hope people take the time to digest it and dig in to fix their duplicate content issues. I am constantly amazed at how many sites have www/non-www/index.html/index.aspx issues.
One you forgot to mention, I think: sites done in ASPX will still return a 200 status code regardless of the case (upper or lowercase) in the URL. If you use ASPX on your site, this is something to watch for, and there is a strong argument for using a rel=canonical tag here to deal with it.
Amazing post. I will reference it frequently.
Hey John,
That shiny new badge looks good on you! :)
congrats
Sha
It's been there for a while, but thanks :-)
For my sanity, I try to pretend that .NET doesn't exist :) Someone could write another post this long (maybe two) about weird duplicate content it creates.
I might adopt that as general policy; being an SEO in a .NET agency with an enterprise CMS is slightly challenging to say the least!
It's quite annoying, I agree Pete. Sometimes I wonder if Google might realize this and therefore we don't have to worry about compensating for it. But now we get into "This is just a guess" and "Let Google figure it out", when we know that more often than not they get it wrong.
I can't believe this post has one thumbs down. Please tell me that was an accident. Otherwise, whoever you are, you should be ashamed of yourself. This is hands down one of the most valuable SEO resources online.
First of all, great article Dr. Pete. I think this will be a good guide for solving and preventing duplicate content issues for many of us.
About meta robots: I would like to add that I always prefer NOINDEX, FOLLOW over NOINDEX, NOFOLLOW, because using FOLLOW will not only allow the bots to crawl the links on that specific page, as you write, but I believe it will also pass important link juice (that may be pointing to the page) on to the links on that specific page.
It depends – and as an example of when Noindex, Nofollow is the right choice, here is a real case I had to deal with.
A site I worked on had – due to bad programming – a horrible case of substantially duplicated content (where pages are not 100% identical, but are similar enough overall that Google considers them duplicates). In fact, the filter system they created was automatically generating 64*64 new URLs, 70% of which were paginated. You can imagine the crawl problems Googlebot was having... such that Google itself sent an email to my client's Webmaster Tools profile saying: "Hey dude, with all those URLs you're driving me crazy".
This issue meant that more important pages – category pages included!! – were not crawled regularly, with the result that my client's site was literally ranking in position 2 one day, position 10 the next, position 7 the day after that, and so on for important category-related keywords.
Because of all of this, we finally decided to tell Google not to index any of the faceted navigation (and its related paginated pages), so as not to overload the crawler. The results: rankings are getting stable, and product pages that Googlebot had not been able to find – because it was spending all of its crawl budget – were finally indexed and are ranking.
Thank you Gianluca, for your good input and a great example of a case where it's not the right choice.
You are absolutely right that it can come down to complex situations. I didn't reflect much on crawl budget here, but wanted to highlight the belief that FOLLOW passes link juice! :)
You're welcome...
Generally, I think you're right - FOLLOW is going to be safer than NOFOLLOW. I think the one situation I use NOFOLLOW consistently is if you know you've reached the end of a path. For example, I might Meta NOINDEX,NOFOLLOW shopping cart pages, because everything "below" them should be NOINDEX'ed as well.
Now, you could argue that, given the recursive nature of PageRank calculations, that you're blocking the flow of internal PR back up (to navigation, etc.). I suspect that's negligible and that what you save from crawler fatigue outweighs what you lose in PR-passing, but I can't prove that. These things are nearly impossible to measure precisely.
Really fantastic post Pete - I'm a big fan of the crawl "budget" terminology. It's a concept that is often lost on people, and a great way to explain it. Kudos.
One bit of duplicate content missed - case sensitivity...
It actually shocks me that Google has issues with 'www' vs. non-'www', that Google IS case-sensitive, and that spaces in URLs (%20 vs. +) cause trouble. I am sure a simple algorithm could be written that says if the site is running on Microsoft's stack, then it is not case-sensitive.
My currently preferred duplicate content tool is Dan Sharp's Screaming Frog SEO Spider plus Excel!
Have sent this guide through to everyone!!!
I was torn on both trailing slash issues and case-sensitivity, because they're just so inconsistent these days. While I avoid mixed-case URLs, they sometimes cause no problems. Then, once in a while, boom - a bunch of problems.
This is a case where solid, site-wide canonical tags can prevent issues. I only hesitate to recommend site-wide canonical because so many people implement them wrong.
I would have said (based on Google's advice) that wrong canonical tags wouldn't be too much of a problem... but having seen in practice cases where a canonical tag points to a 404 or similar and those pages end up not being indexed, I am loath to trust Google...! Anything invisible – canonical tags, sitemaps, headers – is more often implemented wrong than right... which is one of the reasons I love the SEOmoz custom crawl!
+1ing the excellent post Dr Pete! I personally can't wait for the day when Google's algorithm can pick up on poor quality spun content that doesn't make any sense and flag it as spam immediately.
I know the rules of search are ever-changing, but for now let's call this what it is: the definitive guide to duplicate content issues. Thanks, Hulk;)
I enjoyed reading this article – thanks for the update. Google Panda is really shaking things up in the field of SEO. Duplicate content should be removed from the site for better performance: avoid it, aim for fresh, quality content, and update regularly – that's what we try to do.
Please update the article and add the point.
(10) International Duplicates hreflang="x"
https://support.google.com/webmasters/bin/answer.py?hl=en&answer=189077
it is supported now, and would save noobs like me some time researching after reading the long and informative article.
Thanks for the reminder. I don't usually add things to older posts, but since this one was definitely intended as a reference, I decided that you're right. I've added hreflang="x" as (14) at the end of the Tools section.
I've had this problem with clients that I do design work for. They provide the content, but a lot of times it turns out to be content just copied from the competition. So I have to work with them to do some original content.
Hey Dr. Pete - Google actually allows you to remove pages that are still accessible via the removal tool in Webmaster tools. I would agree that it's definitely a best practice and redundant to do a URL removal that can still be accessed, but a few months ago they opted to allow it - https://googlewebmastercentral.blogspot.com/2011/05/easier-url-removals-for-site-owners.html.
This is in reference to section IV, number 6.
I would probably only find this useful, however, if I had blocked something like a search results page in the robots.txt and didn't want to wait for Google to go back through and re-index/remove those URLs. Not sure why you'd want to remove a 200 page (outside of testing) that could get re-indexed soon after.
Thanks - I missed that update. I'll add a note to the post ASAP.
This is one of those landmark posts that's going to be referenced for years :D
Dr. Pete, Just asking - how long did it take you to put this together?
Indeed! This is the great example of link worthy content, i must say!
You know, I don't track very well, but maybe 20 hours? I find that's kind of a magic number for me.
Wow. That's called dedication.
+1 for Dr. Pete.
This is the single best practical guide to Panda that I have come across. It was definitely worth the time it took to read and digest. I am glad to report it has done more to ease my mind than to create more stress.
Thanks for all this effort and work you've done, Dr. Pete. And thanks for explaining it so well and making it so easy to understand!!
Fantastic post... pity Google are not doing anything about duplicate content even after they have been informed many times!
I have written a detailed post on the topic and would be interested in others in similar situations? - https://www.my-beautiful-life.com.au/business/google-doing-nothing-about-duplicate-websites/
Wow, in Germany I would say: "Dieser Artikel ist der Hammer" - this article is very impressive. Very detailed and very pro. It really covers most aspects of duplicate content.
I like this very much.
How about mobile sites?
That's a very good question, and definitely an oversight on my part. I'm not a mobile expert, by a long shot, but the common subdomain situation (like "m.example.com") is probably worth adding. Google has gotten better about it, but there are still plenty of people running into trouble.
Fantastic article, but I still think zombies are far more dangerous than Pandas.
When Google unleashes the "Zombie" update in 2012, I know who to blame ;)
Epic Post is EPIC :)
I like section VII (4) where you discuss "Your Own Brain" - often overlooked by many
Will be pointing others to this post for sure ;)
When I was creating multiple websites that were geo-specific, I had to make sure the content was unique so none of the websites would be considered duplicate content by Google.
Comparing hundreds of pages of copy to make sure they are all considerably different is not easy. I used https://www.duplicatecontent.net/ and https://jetchecker.com/, which gave me a percentage of how similar the content is between multiple pages.
Hopefully these resources will help you avoid duplicate content penalties like it did for me!
Thanks Derek- Great resources! I always dig deep into the comments on these articles for THIS very reason.
Cheers for this Pete, certainly a lot of effort's gone into writing it... thankfully we've made good use here of the canonical tag on clients sites (mostly ecommerce) - where they've unintentionally duplicated data - to good effect. :)
Wow Dr. Pete. Just an incredible amount of work baked into this post. Despite the irony that this comment may qualify as a complete duplicate of the others preceding it, I just wanted to thank you on behalf of SEO students everywhere for putting it together.
Fantastic post Dr. Pete. I actually just managed to resolve a lot of duplicate content issues on our site (various types of internal search pages for the most part) with the addition of the noindex tag, so the timing is perfect. This is giving me even more ideas of content that could be removed. I'm looking forward to seeing the results of trimming down our index. Thanks again for all the work that went into this...
@Dr. Pete
I can say that this blog post comes at the right time for me. Yesterday I asked a question in the SEOmoz Q&A section about how to fix an issue with URL parameters. I got a quick answer from Alan Mosley, but my concern is with getting all of the pages associated with search pagination indexed, as follows:
https://www.vistastores.com/table-lamps
https://www.vistastores.com/table-lamps?p=2
https://www.vistastores.com/table-lamps?p=3
https://www.vistastores.com/table-lamps?p=4
https://www.vistastores.com/table-lamps?p=5
Right now, I have blocked all of these pages via robots.txt with the following syntax:
Disallow: /*?p=
Honestly, I am not happy with this solution, but I don't want to go with rel=canonical or noindex,follow either.
Now it's my brain's turn, as per Dr. Pete's recommendation. I have found one approach that I want to share here, which may solve the issue with search pagination.
I checked the search pagination on the Mozilla Add-ons reviews for Google Global:
https://addons.mozilla.org/en-US/firefox/addon/google-global/reviews/?page=1
https://addons.mozilla.org/en-US/firefox/addon/google-global/reviews/?page=2
Both pages are indexed by Google and show up for different search queries with the required snippets from the content.
In this example, the meta information is the same on all of the paginated pages.
Another good example, from the SEO Chat forums:
https://forums.seochat.com/google-optimization-7/the-sticky-for-meta-description-keywords-keyword-density-and-title-199983.html
https://forums.seochat.com/google-optimization-7/the-sticky-for-meta-description-keywords-keyword-density-and-title-199983-2.html
Google still has no issue indexing all of the paginated search pages.
In this example, the title tag is different from the first page's; they added a small variation at the beginning of the title tag.
I double-checked with the SEOmoz tools; neither website added rel=canonical or a NOINDEX,FOLLOW meta tag.
I would like to follow a similar strategy on my ecommerce website, because all of my paginated search pages contain unique products. So why should I give up impressions for long-tail keywords?
I have another reason to trust this method: high-authority, well-branded sites like Mozilla and the SEO Chat forum follow this strategy rather than relying on the canonical tag or meta robots.
Now I'm looking forward to Dr. Pete's input, or ideas from anyone else, because my Webmaster Tools shows 6,751 pages restricted by robots.txt, and I know my website doesn't contain that much duplication. Again, I've used my brain here to drill down into a few websites. Dr. Pete, what do you think about it?
I recall my advice was to use a canonical tag if the pages are in fact duplicates, and to do nothing if they are not. If it is just the titles and descriptions, then I am not sure the work and complexity of altering them for each facet is worth it. Something I should have added is that one would assume you have a landing page that is going to rank better than all of these product pages, and that may be where you should put your time and effort.
You raise an important point - there are sites where the search results are the content. Some directory or affiliate sites, for example, may only link out to outside sites and not have their own "product" layer. In that case, pagination is a trickier issue. Those search pages may be your bread and butter. For most of us, it's the deeper pages that count.
Honestly, I agree with you, but I have a mindset of trying to fix each and every issue that GWT or the SEOmoz tools show me. Duplicate title tags are one of the issues I see in GWT, and that's my main concern. Every micro-observation and edit may help me toward better performance. Thanks again for staying with me on my comment.
I have to agree with you. If it were me, I would do the same; I don't like any loose ends.
Actually, Dr. Pete listed the solution to your question: the use of rel="next"/"prev" in the paginated content.
This is also the solution Google itself recommends in these cases.
There's certainly no official word from Google that says you can't let all paginated search pages be indexed. I find, though, that people wildly overestimate the value of these pages. Landing a visitor on Page 7 of a search has virtually no SEO value, IMO. Page 1 of any major search will have your core category keywords and capture most of the SEO value.
Here's the bigger problem, though - what if those 6,700 pages "wear out" the crawlers to the point that your actual product pages don't get crawled? Now, you're sacrificing high-value, high-conversion pages for low-value internal search pages. Practically, I see this happen too often. I solved this problem for one client long before Panda, and saw their search traffic triple over the next 2 months. Paginated content and other duplicates were keeping Google from crawling their most important content.
I'd also say - in general - that just because a site is high-authority doesn't mean you should take your technical SEO cues from them. Many reputable sites have sub-optimal on-page SEO. Because these sites are high authority, they can sometimes get away with things you can't.
One final word, though - don't use Robots.txt for pagination. You may end up blocking the crawl paths. Meta NOINDEX,FOLLOW will keep the pages out but the crawl paths open.
Again, you can leave them open and see what happens. You may be fine. Most of the time, though, my experience is that controlling your index has significant benefits. Since Panda, those benefits (and the risks of letting your index grow out of control) have only increased.
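If it helps, here's a minimal sketch of that approach (assuming you can edit the &lt;head&gt; of your paginated templates - the URL pattern is just your own example): leave robots.txt alone and add this tag to every paginated page, such as /table-lamps?p=2 and beyond:
<meta name="robots" content="noindex,follow" />
The paginated pages drop out of the index over time, but the links on them still get crawled, so the products they point to keep getting discovered.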
Agreed - I am going to follow a similar approach for my website. I will report back on the status very soon in these comments. Thanks!
I might be missing something, but I'm not sure I am - Why not just add " - Page 2" etc. to the end of the paginated page titles? Would that not solve the duplicate page title issue across the entire site?
Also, if you are disallowing robots to crawl pages with "?p=" in the URL, aren't you then restricting Google from being able to view products within those pages too? Unless they are linked to from elsewhere, anyway.
Seems there's some confusion, because the pages you linked are ecommerce category pages, not search pages, right?
Maybe I'm wrong/confused, don't know! It's late on a Friday so could well be! :P
...P.S. Great post, Pete, will be using this for months/years to come no doubt! :)
Adding "Page 2", etc. to the titles is certainly better than nothing and can help de-duplicate to a small degree, but I find that other issues (like diluting your index) still remain for most sites. Again, it's a matter of scale. If you're talking a few dozen paginated results on a 1000-page index - no problem. If you've got 10,000 indexed pages and half of them are paginated search, then I think you'd have great results pruning that back.
You're correct about the parameter blocking - it's definitely a less optimal solution than Meta NOINDEX,FOLLOW, as it could cut off the bots. It doesn't seem to be all or none - I've used Robots.txt to block pagination parameters, and Google still seemed to crawl to some deeper pages, but in many cases I had other crawl paths in play. Parameter-blocking would probably be a last resort, given the wide array of options available.
Great article! Used it to convince my clients to fix their duplicate content issues!
Great post, although it may also be worth noting that changing a few words in a piece of content does not de-duplicate it! I struggle with this with some people.
Interestingly, with the trailing slashes -- I've been working with two sites recently on the same platform, which unfortunately exposed links both with and without the trailing slash. In one case, GWT reported lots of duplicates, but in the other case it didn't report any problems. Either way, it's still best to fix it at your own end and not trust Google to do it for you!
I'm curious - beyond the GWT errors, did the trailing slashes seem to have any impact on indexing, ranking, etc.? I haven't seen any recent cases of problems, but it's always good to be aware of it.
It's hard to say, I'm afraid, as both sites (completely unrelated sites, but built by the same developer) had just switched to this platform, and there needed to be a lot of other 301 action for various URLs that had changed on one of them, so it wouldn't be a very clean test...
No worries - mostly just curious. I don't want to tell people not to worry if problems are still popping up regularly.
Well, the site as a whole definitely suffered significantly due to the change of platform, but I think it was more likely down to all the URLs and page content changing. Still, it's an easy problem to fix, and I'd always much rather fix it properly than rely on Google to sort it out.
Great post...but something I'm still not clear on re cross domain content duplication...
We have over 20 different country sites to manage, each with a different domain:
e.g. www.moneyboxsaver.co.uk, www.moneyboxsaver.com.au, www.moneyboxsaver.co.nz - by the way, these aren't actually my sites...
Much of the article content we want to create would be relevant in each country. What's the best practice here - can we post the same content on each of these country sites (barring human translation for non-English-speaking countries, which I believe isn't seen as duplication)? Should we take a 'syndicated' view of our own content?
This is an important question. Before the Panda revolution, we had quite a few quotes from Matt Cutts and other Google engineers suggesting that it was fine to duplicate content among same-language countries. Some SEO experts have raised the concern that things are not so clear now. In my opinion, this concern is unfounded.
If you have almost the same content in the same language for different countries, and your client can have a central administration with a single domain, then you can have one domain for all the countries and use IP delivery (without redirection) to adapt the content to each country. However, this only works if the different contents you present (under a unique URL) to the different countries are almost identical. If, on the contrary, the content is very different, then, as you do now, you must use different domains for different countries. Even if the pages are very similar, you might be forced to use different URLs for different countries because of the administrative needs of your client. I believe that this will be fine, and that the SEO experts who say otherwise are just trying to be on the safe side without really understanding what is going on - nobody knows the situation for sure. The best approach is to go with the constraints from your clients and what needs to be presented to the visitors. I believe the quotes from Matt Cutts and others still apply today: duplicate content is fine when the different URLs are targeted toward different countries.
One thing is certain: even if you use different URLs for the same content, as long as the different URLs are for different countries and you do IP delivery, you do not need to worry about dilution of inbound links, because they are considered separately in each country. The point is that, when Googlebot follows the links, it gets redirected like any other user agent. In this way, each instance of Googlebot sees its own version of the links (with their redirections). As far as every individual instance of Googlebot is concerned, there is no duplicate content. It is very unlikely that one instance of Googlebot is going to concern itself with what the other instances see in order to check for duplication of content among them. This is corroborated by the fact that I have read that many websites still rank very well despite duplicating content among same-language countries.
Of course, I am sure that you want to know the opinion of the experts here, as much as I do. I am just a guest.
Hi, we wrote a book 10 years ago. The book is a collection of hundreds of stand-alone articles. We are the original authors and copyright owners. Now we want to revise it with the help of our online community. The plan was to build the site www.keepingpetssafe.com (it is under construction) - check it out.
Then we want to add a blog and blog the content one piece at a time (one article each day for one year) and allow comments. I AM AFRAID, after reading your warnings about duplicate content, that we could be about to implement a losing strategy. The blog would be alive and social, reaching out and involving the community. It would link to the "reference" (main) part of the website, which contains all of the original articles, blogged and not yet blogged. If you think it is possible to use this strategy, do you recommend installing the blog within a single URL such as www.keepingpetssafe.com/blog? Thank you.
So that's your '6000 words and 38 images' post. Awesome. I think any post better than this on duplicate content will be a duplicate of your post. I have no choice but to bookmark this post. Great job Peter.
I find the problem with trailing slashes is not the canonicalization issue itself, but that many sites (WordPress, for example) 301-redirect to the trailing-slash version while their internal links use the non-trailing-slash version, causing an unnecessary 301 redirect and a leak in link juice.
Excellent post, Dr. Pete! In the examples, I think one more you could mention is variable order, like:
example.com?var1=val1&var2=val2
example.com?var2=val2&var1=val1
And for diagnosing duplicate content, you do mention GWT, Search Commands and Moz' Tools which are all good. I guess it is still worth mentioning CopyScape also.
Good call - one great way to make a URL-based duplicate problem even worse is to have a CMS that appends the variables in different orders depending on the path you took. It's amazing how easy it is to spin out 100 copies of a page.
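One way to illustrate the fix (with a made-up URL): pick one parameter order as the "real" version and put the same canonical tag on every permutation, e.g.
<link rel="canonical" href="https://example.com/page?var1=val1&amp;var2=val2" />
(Note the &amp; entity - ampersands inside HTML attributes should be escaped.) That way, however the CMS happens to order the parameters, the signals consolidate on one URL.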
Dr. Pete, I have become a fan of yours. You have explained almost every kind of situation regarding duplicate content, along with the best possible ways to solve those issues. You are truly a doctor in your field of expertise.
Your other post about the catastrophic effects of wrong canonicalization is also highly valuable to me. I got so much info without burning my ATP doing research, just by referring to your posts; it was easy to get thorough knowledge of the topic. Please, please keep posting these kinds of quality posts.
Thanks
Dr. Pete,
This is an amazing resource that I can refer to anyone who needs all the details on duplicate content in one place. Also, thank you for dividing the post into different headings, as this lets me digest a long post easily.
Duplicate content is a major problem if you are dealing with either ecommerce sites or sites that produce massive amounts of content from time to time. I was dealing with a massive amount of duplicate content due to the CMS platform, as it automatically created multiple versions of URLs displaying the same content. At the time, I simply blocked that area of the website through robots.txt; your post gives me the idea that there may be better options for dealing with it.
I will surely look into the matter again and come up with a better strategy!
A Great Resource without any doubt!
Incredible post, Dr Pete. One section in particular should help out my friend's site - I've passed on the info to pass on to his web developers. Well explained and easy to understand, on a subject that has a tendency to be complicated, tricky and irritating all at the same time. Good job sir!
Hi, Dr. Pete. I have a question about Duplicate Content and User-generated Content.
Say I have created a website that allows users to post their own content, like stories or reviews, and say that the users are pretty much copying Wikipedia 90% of the time.
What would my safest strategy be? Hide the user-generated content from Google via robots.txt or "noindex,nofollow"? The problem with this solution is that there would be almost no content left on the website.
I was thinking about putting up a structure like user-generated.mywebsite/content, as opposed to mywebsite/user-generated/content, so that at least the duplicate content problem would be limited to the subdomain. Will this still end up damaging the site's rankings in the long run?
Thanks, and I hope others will find the answer useful.
Unfortunately, there's no magic solution. UGC is great, but if it's all thin/duplicate content, then you're probably better off NOINDEX'ing it than letting thousands of low-quality, duplicate pages get crawled. If all the UGC you're attracting is low quality and scraped, then I think you have to ask what's not working. Maybe you need to aggressively moderate. Maybe you need to only allow certain users (who have participated for a while) to submit content. Maybe you need to reward good users and encourage unique content.
You could isolate it to a sub-domain and potentially protect the root domain, but it still begs the question - why is the content like this? What are you building that's of value (for you, your users, or search users) if so much of the UGC is of this nature? In other words, there may be a better way all around, even aside from SEO.
Thank you for the very quick reply. I am probably worrying over nothing. I need to trust UGC a little more and probably come up with a structure that rewards good users and good content as well as mixing it up with unique content of my own.
I was also searching for this kind of article because, two days back, I made 10 geographical pages on my website, so I am a little bit confused about the content. I added similar content to all of the pages - does that count as duplicate content? In the future I want to make around 1,000 geographic pages, so should I add new content to each page every time, or can I use similar content?
Great post, almost too long, may have been better as a pdf guide.
Anyway, 2 points
1: If you use the canonical url with the trailing slash as you've done, doesn't that make the whole site appear as one page?
IE should it be
<link rel="canonical" href="https://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world" /> or
<link rel="canonical" href="https://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world" >
I would have thought you'd want the engines to index the page and not a directory.
2. Despite what Matt Cutts says, Google is not able to figure out the difference between slashed and non-slashed URLs. My Google Webmaster Tools account is full of messages telling me I've got duplicate content for slashed and non-slashed URLs.
So I'd really like to know how to fix that.
Thanks
There is a downloadable (PDF) version at the very end of the post, for convenience.
(1) That trailing slash is actually the closing slash for the tag itself. Since <link> has no closing tag (i.e. you don't put </link> after it), the "/>" format is technically correct. Honestly, though, it generally works either way.
(2) I've found it rarely makes a big difference these days, but standard practice is to 301 redirect to the preferred version. It's also best to use a consistent version internally - if you use "www.example.com/", then your internal links should reflect that, as well as any canonicalization.
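For what it's worth, here's a rough sketch of the usual fix on an Apache server with mod_rewrite (test it on a staging copy first - your setup may differ). It 301s any URL that isn't a real file to the trailing-slash version:
RewriteEngine On
# Skip requests for actual files (images, CSS, etc.)
RewriteCond %{REQUEST_FILENAME} !-f
# Add a trailing slash to anything that doesn't already end in one
RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
If your preferred version is the non-slash URL, the rule just gets inverted. Either way, make your internal links match whichever version you pick.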
Dr Pete,
thanks for the quick response.
I also found this explanation which may help others
https://webdesign.about.com/od/beginningtutorials/f/why-urls-end-in-slash.htm
Hi Pete,
Your article relates to domains with the same owner.
How can we handle legal duplicate content on competing domains owned by different companies?
Our pharmacy software is used by 150 pharmacies. We supply 80,000 drug descriptions, which results in 150 duplicates of each of 80,000 products. Unique content would mean 12 million unique descriptions - not affordable, and high-risk given pharmacy standards.
We can't use cross-domain canonicalization, because the 150 domains are competitors, and we cannot use any of the single-owner solutions you presented, for the same reason.
How should we deal with duplicate content across competing domains?
Any idea?
Regards, Wolf
I'll be honest - there's no magic bullet. 150 different sites won't all rank equally for 80K products. Without cross-domain canonicalization, syndication-source, or getting them to link back to you, you've got to compete with them head-to-head. That means building a stronger link profile, more authority, and as much unique content as you can muster.
You may also want to consider focusing your energy and your search index. If your site is relatively new or has a weak-to-moderate link profile, don't try to rank all 80K pages at once. Focus on the money-makers and really try to hit 500-1,000 of them hard.
WOW. Very comprehensive piece. Lots answered.
So as an Internet marketer who utilizes article syndication as an inbound linking strategy, you're saying that as long as each duplicate article links back to the original source material (usually via the author resource box) the seo value will remain intact? GREAT article BTW. Thank you! -gene
Linking back properly will help Google understand the true source (and is probably a positive trust signal), but it won't necessarily help you rank as the one syndicating the content. I think it depends a lot on how much syndicated content you use and whether there's enough unique content in the mix to back it up. Google could still devalue these pages if they make up too much of your site.
hello and thank you for the great post!
I have a question. You wrote "While robots.txt is effective for blocking uncrawled content, [...]."
Well, I have a problem with a new website: Google still crawls all of the pages, even the ones blocked in robots.txt. I have about 150 pages not blocked by robots.txt, but yesterday Google crawled 903 pages in one day (and it's increasing regularly)... According to what you wrote in "the crawl budget", that's not good.
Not only does it crawl all of the pages, it also indexes them (I can see them only after I click "repeat the search with the omitted results included").
Actually I posted this problem (is it a problem?) more in details here: https://stackoverflow.com/questions/8440681/why-is-google-crawling-pages-blocked-by-my-robots-txt
If you have time to answer it would be great!
Thanks!
I find Google respects Robots.txt more as a preventive measure (block a folder from being crawled, for example) than as a cure for duplicate content. Once it's indexed, Robots.txt won't always remove it. If you block a large percentage of pages, Google may also start to ignore Robots.txt.
I'd strongly consider a switch to META NOINDEX and/or possibly using the canonical tag, where appropriate. Blocking too much of a site with Robots.txt can get messy and, as you said, just doesn't always work as intended.
Thanks for the outstanding post and ongoing discussion. I didn't see this specifically addressed - on an ecommerce site, what's the best approach for presenting product detail pages that are the same product offered in different formats? For example, a book that is available in paperback, hardcover and ebook formats. We can do some differentiating based on the format, but the title and description fields are likely to be quite similar. Should we Rel-canonical these to one main format?
It's a similar situation to V-14 in the post ("Product Variations"). Honestly, it's very situational - those pages can look thin AND they can have long-tail SEO value, so it's a balancing act. In the case of something like books, I think it's probably best to only index a "parent" page and not all of the formats, for a couple of reasons:
(1) The format pages will probably only vary by a price and a few words, and will look thin, especially multiplied across 100s or 1000s of products.
(2) If you land searchers on the parent page, they have the option of choosing their format. In general, I think it's a better search user experience.
If you're Amazon, you may have the authority and luxury of indexing everything, but that doesn't apply to most sites. I think the focus is generally better for SEO.
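As a rough illustration (the URLs here are made up), each format page would carry a canonical tag pointing at the parent product:
<link rel="canonical" href="https://example.com/books/some-title/" />
So /books/some-title/paperback, /books/some-title/hardcover, and /books/some-title/ebook all consolidate to the parent, which is the page you'd want ranking and landing searchers.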
Hi Friends,
I need help on this. My US-based client is an interior designer and runs his service in multiple places, like Michigan, Ohio, and Florida.
I want to know whether his three websites will be penalized for duplicate content by Google if:
1. 3 Websites like InteriorDesignersMichigan.com, InteriorDesignersOhio.com, InteriorDesignersFlorida.com
2. The look and feel, the website template, and the content text of all the websites are the same.
3. But the city names are changed in the content text.
For example, all three websites have content like:
1. Design Tech is a interior design company Since 1993 for office space in Michigan. We provide our services.....Blah...Blah
2. Design Tech is a interior design company Since 1993 for office space in Ohio. We provide our services.....Blah...Blah
3. Design Tech is a interior design company Since 1993 for office space in Florida. We provide our services.....Blah...Blah
Please advise: will all three websites be penalized for duplicate content?
The short answer is: yes, you are at risk, especially if the sites are linked to each other. With three sites, it's a bit hard to judge how risky it is - you haven't built out dozens of sites. Still, this content is "thin" almost by definition - it's duplicated except for a few keywords, and that's a tactic Google is devaluing more every day.
In most cases, one or two of the sites would just get devalued, or the links between the sites would get discounted. In extreme cases, you could run into trouble with Panda. There was a time when microsites worked. Now, their SEO advantages are limited (and there's some risk). So, the question is - is it worth splitting your efforts three ways? In most cases, in 2012, I don't think it is. I think you would be better off trying to build up unique content for all regions, or focusing on one core set of content and then having a small amount of unique content for each region.
Thanks, Dr. Pete, for your valuable suggestions. My three websites are not linked to each other, but I will still start writing unique content for all three.
Dr. Pete
Excellent article..thank you very much.
I'm now implementing an API with daily updates to import product content from one ecommerce site to another; they have totally separate domains. I am not intending to link the sites or set up 301s or cross-domain canonical tags.
If I surround the imported (duplicate) product data with my own additional unique content - extra products and articles - is it still advisable to block indexation of the imported product content, to avoid penalization of either the feed supplier's URL or my own?
Thanks
Even though I am an SEO newbie and not a native English speaker, I found this article very informative and helpful - thanks so much. However, as I am not a coder, I would have appreciated it even more if you had added very short, basic examples of how to implement your suggestions. For example, coming to the Meta Robots and Canonical Tag paragraphs: let's say I have example.com/hotels.php and I want to add a canonical tag so that Google doesn't index pages such as example.com/hotels.php?city=... - should I add the canonical tag to the hotels.php file? And if at the same time I need to index pages such as example.com/hotels.php?rates=..., will the canonical tag I added for the "city" parameter automatically suggest to Google not to index pages with "rates" and every other search parameter too?
Also, how do I fix the duplicate content problem on WordPress sites, since I only have one file there, index.php, and many potential duplicates (index.php/category/..., index.php/archive/..., index.php/tag/..., etc.)?
Thanks for your patience
Articles keep getting bigger and bigger lately ;-) But I love this one nonetheless.
I was wondering what could go wrong with internal duplicate content when using country-based pages. Google recommends a few methods, one of which is country folders (domain.com/us/, domain.com/uk/ with almost identical content), but you mention that things can go wrong here - do you know the specific details of why that happens?
Thanks for this great post.
Great stuff Dr Pete.
I made the mistake of allowing a staging server to get indexed, and about a month ago the live server lost its rankings. I contacted Google via Webmaster Tools for the development server, and their response was that no manual actions were taken against the site.
In a scenario like this would Google consider any action they take to be algorithmic?
If the duplicate content were removed (it was removed a month ago) could we expect to see any action to be reversed? We haven't seen any improvements in rankings yet.
Is there any way to know with certainty there was a duplicate content action taken? The staging server has no links, no meta information, no traffic, so it would seem odd to me that Google would even view this as a problem.
Any suggestions?
Many thanks
If it's just a filter (and this can be tough to tell), de-indexing the staging server should help relatively quickly. The trick is often to actually get it de-indexed. Monitor those URLs carefully - ranking won't recover until they fall out. It really depends on the scope, what got indexed, and how you removed it.
If it was Panda related, then you may have to wait for a data update, and that can still be a month or so (between data updates).
Finally had a chance to read this post, I really appreciate how descriptive it is. Thanx!
Sir, I should first thank you for all your hard work and interest in writing this great post (or rather, resource).
I am having so many problems with scraper sites, and I recently posted a question in the Google Webmaster forums about it. I'd be grateful if you could help answer my questions.
Say a site affected by Google Panda is site X, a scraper site is site Y, and a site not affected by Panda is site Z.
First question: once site X is hit by Panda, if site Y then copies articles from site X, the articles indexed for site X get outranked by site Y in the results (this is a known problem in Google's algorithm, which Matt Cutts acknowledged when I asked him in a Google+ hangout). I want to know the permanent solution for this. Second question: for the above situation, many people (including people from Google, perhaps in the Webmaster forums) suggested filing a DMCA complaint against site Y to get rid of the outranking problem. But every new article posted on site X gets outranked by site Y, and we have to complain again and again - don't you think this is a burden, and a waste of time? Third question: in one instance, a person reported to Google that site Y was copying articles from site X and asked for those articles - say 20 of them - to be removed. The same thing kept happening, and the scraper site kept winning, so that person finally asked Google to remove site Y entirely. Google's reply was that unless there are at least 100 copied URLs, there is no way to remove the site as a whole from Google search. My question: do we have to wait until the 99th copied article for Google to invent a new solution to this outranking problem?
========
Final question, Dr.: is the "Dr." in your name related to being a medical doctor? Just asking out of curiosity, because I am a doctor in the medical field!
thank you once again !
Unfortunately, there is no permanent solution. Determining the source of content is tricky, and Google is getting it wrong in plenty of situations. It's important to fix your internal problems, of course (including Panda). You've got to build up authority and your link profile, and you've got to get out signals that tell Google your content came first.
If they're flat out stealing, you can take legal action (including DMCA), but it's going to take time and probably money. So, it's always a trade-off of how aggressive you want to get. If it's consistently one site, I think that fight makes more sense.
I'm an experimental psychologist by training, not an MD.
I am facing a lot of problems with Blogspot blogs. For example, take a medium-sized site that was hit by Panda: if its content is copied exactly by Blogspot blogs, they easily outrank the original source, and people who know this could even use it to do harm. Is the power or authority of the main domain (blogspot.com) the reason for the outranking? Is there no other way to solve this problem? If you're interested, I will personally show you one example where I faced scraper attacks from Blogspot blogs six times!
Impressive post, really a must-read because it perfectly summarizes the duplicate content issues.
However, for #10 (International Duplicates) in the examples, there is a really easy solution. I suggest using the new rel="alternate" hreflang markup (check it out at https://googlewebmastercentral.blogspot.com/2011/12/new-markup-for-multilingual-content.html)
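For illustration (the domains here are just placeholders), each regional version of a page would declare the alternates in its &lt;head&gt;:
<link rel="alternate" hreflang="en-GB" href="https://www.example.co.uk/page" />
<link rel="alternate" hreflang="en-AU" href="https://www.example.com.au/page" />
That tells Google which same-language version to show to which country, rather than leaving it to treat them as ordinary duplicates.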
This is by far the most detailed and well-explained post about duplicate content. Thanks so much for this. I hope to read more from you, sir.
Awesome post... I am linking to this from my next article. My readers will love it.
Can someone please lend me a hand?
I had someone redesign a site for a client of mine, and a plugin was used that added ?cbg_tz=0 to the end of all the URLs on the whole site. The plugin was removed, but I want to know how I can make sure these links don't show up in the index anymore, as they still do. I am positive this is why I am not able to get him to show up even in the top 1,000 results for his keyword terms (local terms, low-to-medium competition).
A very complex subject explained in a very clear and logical way. A terrific reference work.
This post is indeed a comprehensive resource on the subject. It provides answers I could not find anywhere else to many questions I had. I still have a few questions left unanswered, in particular about the canonical link element, or canonical link tag (as Matt Cutts calls it in https://www.mattcutts.com/blog/canonical-link-tag/).
One of the things I learned from you in a previous post is that it is in principle possible that <a href="https://www.seomoz.org/blog/6-extreme-canonical-tricks#jtc160924">the noindex signal in the canonical link tag is followed, but the link juice to the target is not passed</a>. In my opinion, it would be terrible if Google did that - very confusing. I would even say "unfair", but nothing is fair or unfair in a world where no laws exist. So I was looking for ways to measure whether or not the link juice is passed.
In Google Webmaster Tools, under "Links to your site", if one clicks on a target page in the site, one obtains a list of all the external pages that link to that target; one can even find out what those external links are. Consider an example: say the page example.com/source.html contains a canonical link tag toward example.com/target.html. Since the former gets de-indexed, only the latter is a possible target in Google Webmaster Tools. Nevertheless, one can see that the external link reported toward example.com/target.html is actually a link toward example.com/source.html. My first question is whether this is a strong indication that the link part of the canonical link element was followed by Google.
Of course, the link juice passed from the external pages to the source (which contains the link element) depends on factors such as the authority of the external pages, the relevance of the anchor text, etc., but that is not the issue here. We are concerned with the link juice passed from the source to the target. The second question is: once we have determined that the link element is (fully) followed, are there additional factors used by Google to determine the link juice that is passed from the source to the target? I have never heard of such factors. For example, I have never heard that Google would analyze the differences between the source and the target to determine the link juice that will be passed, or anything like that. The only thing I have ever seen mentioned is that there must be a small damping factor, as with a 301 redirect, but that is fine and expected.
Unfortunately, it's nearly impossible to tell how much link-juice is being passed by any given link. If you had a ton of links to the canonical source and none to the target, you added the canonical tag, and the target suddenly started ranking, then it's pretty clear link-juice was passed. On the granular level, though, I'm afraid the data just isn't transparent.
I suspect that there are cases where Google devalues or partially ignores a canonical, especially if it seems like you're abusing it (just canonicalizing a ton of pages for their link-juice, even though they have nothing in common). This happens with 301s from time to time. In this case, they might de-index the source page but not pass the link-juice.
Honestly, though, I can't point to a clear example of that happening. If anything, Google is very lenient with canonical tag usage right now. There's some discussion that Bing might be less forgiving. I only suspect it could happen because we've seen it happen with 301s (as people have abused them).
Thank you for the clarification. Yes, I agree the kind of experiments you suggest would be great. I don't have the resources to do that - one needs to be big and have control over many sites that can be used for testing.
Excellent post, thank you!
Hi Dr.Pete,
Great article, thank you! I have a question regarding the implementation of rel="prev"/"next". Do you think the rel=prev/next attributes should be combined with meta robots noindex,follow tags on paginated pages? Google doesn't specify this, but I think it's a fail-safe way to satisfy Google and other search engines that do not support rel=prev/next.
Additionally, I have seen rel=prev/next implemented with self-referencing canonicals on paginated pages. Any thoughts on this? It seems unnecessary.
Would love to hear your thoughts. Thanks!
I'm not a big fan of mixing signals - if it doesn't work, you never know quite why. If you're having problems related to pagination, I'd go with the META NOINDEX. If you're just looking to prevent future problems, I'd give rel=prev/next a shot and let it run by itself. It really depends on the scope and severity.
When you say "self-referencing", do you mean back to Page 1 or to that actual specific page (e.g. page 23) of results? Self-referencing back to the specific page would be the opposite signal - I can't imagine ever wanting to do that.
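Just to sketch the standalone rel=prev/next setup (made-up URLs), page 2 of a series would carry:
<link rel="prev" href="https://example.com/category?p=1" />
<link rel="next" href="https://example.com/category?p=3" />
The first page only gets a "next" tag and the last page only gets a "prev" tag, so the engines can treat the series as one sequence.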
Hi, I asked this before on December 3rd but think it was lost in the other threads and some of the content was removed by the editor. Here I have tried to explain the idea in a different way, hope you can help me...
Hi Stephanie, Dr Pete,
I am doing the SEO for a large hotel price comparison aggregator site which is chock full of syndicated content and has a duplicate content penalty. I was thinking of using noindex on the pages with duplicate content but now thinking of using syndication-source tag to give credit for hotel descriptions to the booking sites that they came from. I am hoping that it will make the site more trustworthy and improve rankings for the pages that do have original content.
As I read up on the syndication-source tag, I noticed that on 2/11/11 Google added a note to their "credit where credit is due" article stating "we've updated our system to use rel=canonical instead of syndication-source, if both are specified". This seems to indicate that Google considers rel=canonical to have a similar effect to syndication-source. Therefore, if a site uses rel=canonical with a link to its own page (implemented so that affiliate links are not indexed) but uses syndicated content, Google might consider this an attempt to claim original authorship of the syndicated content.
What do you think?
Thanks
Ben
Interesting - I see what you're saying, but honestly, syndication-source is still new enough that I haven't seen that combo in play. My guess is that the canonical might overpower the syndication-source tag in this case, although I don't think there'd be any harm in using both (internal canonical and cross-domain syndication-source). Worst case, the cross-domain signal just won't work.
The other option would be to only put the canonical tag on the non-canonical versions (say, pages with tracking IDs) and then put the syndication-source tag but NOT the canonical tag on the canonical version. That would take some coding, though.
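Roughly, that second option would look like this (URLs are hypothetical): the tracking-ID version of a hotel page carries only
<link rel="canonical" href="https://example.com/hotels/grand-hotel/" />
while the canonical version carries only
<meta name="syndication-source" content="https://booking-partner.example.com/grand-hotel" />
pointing at wherever the description originally came from. No guarantees Google honors both signals together, but at least they aren't competing on the same page.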
I believe I've read this post 3 times now and still have to go back and refer to it to remember what 'move' I need to make next! Thank you SO much for this detailed information. I should probably print this to put on my desk beside me for daily reference!
Thanks a lot for this excellent recap, Dr. Pete!
About the international duplicates (10), it may be interesting to add a link towards a recent post written on the Google Webmaster Central Blog about new markup for multilingual content: https://googlewebmastercentral.blogspot.com/2011/12/new-markup-for-multilingual-content.html
Thank you so much. This is really an important addition. There is yet so much more we could say.
It took me 4 days to make enough time to get through this enormous post. WORTH IT. You really put a lot of work into it, thank you.
I especially appreciated all the uses you suggested for the canonical tag. Several were instances I actually hadn't thought of, and wasn't quite sure how to guide the developers of my clients' sites to fix. Very much appreciated, Dr. Pete.
- HP
awesome post - covers just everything - really good job, thank you.
Hi. Great resource! I am trying to figure out where I sit - I think it is cross domain syndication.
Basically, as well as the fresh content I write, I also publish a lot of press releases that are cut and pasted straight into my CMS. I change the title, but that is about all I have time for.
As these are not directly syndicated from another site, I was wondering how best to handle them. I know I am getting heavily hit by Panda over it as well.
At the moment I am setting the site to add noindex to any non-unique page that is more than 5 days old and to remove those pages from the sitemap.xml.
Is there anything else I can do?
Thanks!!
@Dr. Pete
I have implemented every attribute you suggested on my website, but I have a question about pagination and SEO: how do I fix it when search parameters are involved? I'm in a hurry for a reply. :) BTW, thanks for sharing - I recommend all SEO folks implement this on their websites.
1. What is the best approach to solving duplicate content - a "301 Redirect" or "Canonical URLs"?
Peter, this is such a great, in-depth post describing exactly what duplicates are, and it has helped me better understand what duplicate content means to a search engine. I have never seen a post that delivers ideas this well; after reading it, I can more easily avoid duplicate content that is harmful to a site and its rankings.
Fully detailed guide. Congratulations...
Dr. Pete. Love this article. Great job on laying all this out. I know my comment is late to the post, but this page came up recently in one of my searches and I'm trying to provide a client with some backing to my recommendations related specifically to duplicate content found via secure and nonsecure versions of a page. I have a suggestion for a potential modification.
I'd say this section "V. Examples of Duplicate Content" particularly this item "(4) Secure (https) Pages" and the solution mentioned here "In many cases, it’s best to Noindex (IV-4) secure pages – shopping cart and check-out pages have no place in the search index." should now be slightly modified due to the recent announcement of using HTTPS as a ranking signal (https://googlewebmastercentral.blogspot.com/2014/08...)
I'd say that with Google moving towards SSL/HTTPS as a ranking signal, Noindexing shopping cart and check-out pages is still true, but wouldn't go as far as to say now that "it's best to Noindex secure pages". And I know the wording is not all inclusive because it is prefaced with "In many cases", but this could now be misleading to someone who may come across this without the prior knowledge or understanding of HTTPS as a ranking signal (e.g. I recommend a change to a client based on duplicate content from secure and nonsecure pages and send them to this post as a source, but to fix their problem they decide to Noindex secure pages because they haven't heard Google's recent announcement. Sure, I will do what I can to inform them of the announcement, but nonetheless, they may end up "fixing the problem" of duplicate content while creating a new one and potentially affecting their ranking in search in the future if Google uses the factor more heavily).
[fixed hyperlink - km]
Well done - myself and every other reader are now convinced we all need a negative content specialist on our team. SEO sales copy at its best. If only I had the time or budget.
Hi, I'm looking at syndication, I have a parent site and a subject matter specialist site.
I wish to syndicate some articles from the parent to the specialist site.
We use Google CSE to search the sites (one index, which we filter when searching each site using the site: operator).
- If CSE searches for "article about cats" on the subject site, e.g. "site:subject.com article about cats", what will happen - will they get any results?
- Google searches for “article about cats” may only show results on the parent (probably OK)
- Google searches for “specialist article about cats” may only show results on the parent (probably not OK)
My Approach:
1) Ask you....
2) We’ll build the site with the canonical functionality and see what happens (using Google Webmaster Tools)
3) If we hit any of the above problems, we can turn off this feature and hope the duplicates don’t mess up our rankings.
Sorry, I have very little experience with CSE, but I'm not under the impression that canonical tags impact it at all. If you do a cross-domain canonical that should help prevent any duplicate content issues, but you will have to pick which domain should rank. I don't think that will impact your CSE results, but I'm not 100% sure.
Amazing article. I am facing a problem with https. My previous host had shared SSL, so my blog also worked under the https version, and Google has now indexed my blog's https version too, which I feel is creating a duplicate content issue.
My new host does not offer SSL by default, so my site no longer works under https, but Google is still showing https results. Now I am confused about what to do - please help me with this issue.
Even two years later I find some of this useful and relevant when trying to freshen up on a few items. All I can say is Thank you.. two years later.
Thanks for this Dr. Pete! Question about your quote:
Here's the bigger problem, though - what if those 6,700 pages "wear out" the crawlers to the point that your actual product pages don't get crawled. Now, you're sacrificing high-value, high-conversion pages for low-value internal search pages. Practically, I see this happen too often. I solved this problem for one client long before Panda, and saw their search traffic triple over the next 2 months. Paginated content and other duplicates were keeping Google from crawling their most important content.
This hits close to home and right on target for us. Our search result pages far out-index our product pages. Do you by chance have an article anywhere where you talk about how you handled this? This is our situation and we need to make a careful transition to rank our product pages while de-ranking our search pages.
Thanks!!
Unfortunately, pagination can be a very complex topic, and it depends a lot on your situation. I think this resource by Adam Audette is good, and it gets into just how tricky the problem can be:
https://searchengineland.com/five-step-strategy-for-solving-seo-pagination-problems-95494
Google has hinted more and more strongly that they don't think search pages are of value to users (i.e. their searches landing on your searches). On the other hand, major category searches, etc. are often key landing pages for some sites, so it's a balancing act.
I am working on a product (machine) reseller website. Here, each product's specifications and descriptions are the same as on the head office website - I have no way to write unique content for each product, so I have copied the same specifications from the head office website.
I am afraid that, because I have copied all of the specifications from another website, Google may not give priority to my website...
I'm thoroughly confused - please help me.
Nice article, very informative. Really helpful for us, and I really appreciate it.
Great work on the duplicate content issue, Dr. Peter. An awesome guide to solving many content-related problems.
Every day we see changes in internet marketing. Google updates its algorithms to give the best results for users, but sometimes this harms many SEO workers. Google's updates on content optimisation come down hard on fake or duplicate content, and it takes some time to understand these types of updates.
Hi Peter, this is an awesome article for escaping duplicate content problems.
My website has two home pages, "www.xxxx.com" and "www.xxxx.com/index.php", and both URLs have been indexed in Google search. So I will ask the developer team to remove the index.php page or redirect it to the root, because all of the web pages have two URLs, one with the index.php path: "www.xxxx.com/contact" and "www.xxxx.com/index.php/contact".
Still a solid post. Thanks.
Thanks for the information
I am sharing my site content with social sites like Google+ and Facebook. One day I copied the first paragraph of an article and pasted it into Google search, and I saw my Facebook page come up before my site's page. Is that OK? To a normal person it would look like I had copied the content from Facebook. Please advise.
Pretty good overview of the possible duplicate issues. I’d only add internal search and additive filtering to the list.
OMG! This article is very long but very informative. As Esaky mentioned, it's worth printing the whole article to learn it deeply. All the vital things about the SEO Panda update are mentioned here, through which anyone can save their website from being penalized or banned and attain good rankings. I have read this article only once, and now I am going to print it for future reference. Thanks, Dr. Pete.
This article is huge - I will print it for better understanding!
My problem is that the blog's duplicate content is being flagged on search-term and tag pages. For example, I have written 10 posts about design art and added "design art" in the tag field, using a self-hosted WordPress.org website.
I have been relying on all of the valuable information in this post since you wrote it - quite amazing work, by the way! I see that in a forum Google recently said that the syndication-source tag has been deprecated. I was wondering if you had a suggestion for what to use in its place, in a situation where a company syndicates health content to various hospitals. The content is only a portion of the client's page, so the rel=canonical tag won't really work. Is a link to the original source sufficient?
Thanks - I wasn't aware of that, and just found a reference from the Google News team. Would've been nice if they told us that a bit louder, but it does appear to be official. I'll update the post.
Thanks Dr. Pete. With this tag gone, I am not sure what to do for my client except have the content on their customer's site link back to their site. As this syndicated content is only a portion of the page, rather than an entire page, it appears none of the other options will work.
Unfortunately, I don't think even syndication-source was intended for portions of a page - there are really no solid partial-content canonicalization or blocking solutions. The link back probably is your best bet.
First -- THANK YOU -- I am very impressed that you have taken the time to answer all of these questions.
I am developing a site for student apartment renters that I would like to bring to numerous markets. Apartment, sublet, and roommate posts will be user generated and unique to each site. The site will also include hundreds of helpful links (to relevant local resources) that will be mostly unique to each site.
However, the site is relatively copy-heavy and I would like to "duplicate" in numerous places (we have "About This Site" / "About This Page" copy on each page). For example, if the homepage says that our site is "Madison's favorite place to find Madison sublets", this copy (with the appropriate city name) would remain relevant in each market.
Question: Even though the copy is about our site in a given market, and we have large amounts of other unique content, is duplication still a huge no-no?
Thanks again!
Great content. Good work. This was the most informative article to me concerning Duplicate content.
Thank you Dr. Pete!
Just a note that the section where you find your missing & duplicate descriptions and titles, which used to be under "Diagnostics" in GWT, is now under "Optimization" > "HTML Improvements".
My apologies, but I can't answer audit-level questions in blog comments - it's just proving to be too time-consuming, and quick answers often end up being bad answers in situations as complex as these. I'd encourage any SEOmoz members to submit complex questions to Private Q&A here on the site, where we can at least take a closer look at any individual site.
@Dr. Pete.
Today, I'm quite confused by my Product Variations pages. Let me give one example to explain:
https://www.vistastores.com/patio-umbrellas
This is my main product page. I have developed a new URL structure and left-navigation structure to generate new web pages. All pages are open for crawling.
These are my branch pages:
https://www.vistastores.com/patio-umbrellas/shopby/manufacturer-california-umbrella
https://www.vistastores.com/patio-umbrellas/shopby/lift-method-search-manual-lift/manufacturer-california-umbrella
https://www.vistastores.com/patio-umbrellas/shopby/canopy-shape-search-hexagonal
https://www.vistastores.com/patio-umbrellas/shopby/canopy-shape-search-hexagonal/color-search-green
I am very confused between Canonical, NOINDEX, and Robots.txt - I have implemented all of them on my left-navigation section.
Now I need a final solution that I will never have to change. My organic traffic is going down and I'm quite worried. Can you give me an exact solution, so I can implement it on the website without any hesitation?
Hey Doc,
I need a professional diagnosis.
My illness is possible duplicate content. (It's not just a coincidence that I'm here :) )
You've got a very complete and detailed explanation here, and I wouldn't dare ask you to trim it down or dumb it down. But without doing that, I'm a little confused - maybe simply because no example fits my exact situation.
I have about 1,500 product pages. They are pages in WordPress, not posts. (They used to be posts, but I deleted them all and made new pages; I changed themes and it didn't look good as posts.)
Well, here's my site https://funeralparlour.com but that's not going to take you straight to my issue without clicking a few links inside.
Here's my setup.
I sell funeral program templates, in packages. Basic - Regular - Complete.
There are four types of lets say "sub packages".
Tri Fold Brochure - Single Fold Brochure - 4 Page Grad Fold Brochure - 2 Page Grad Fold Brochure
OK, so for the Basic package, we would have the Tri Fold Brochure and a Thank You card.
I'm sure you can guess what the next three Basic packages would be, right?
And so on.. Regular also includes a Bookmark, and the complete also includes a Postcard and Prayer Card.
Now the only other difference is the "Theme" or Design..
Nature designs: I have 37 different designs in that section. So 37 Tri-Fold at Basic - Regular - Complete, 37 Single Fold at Basic - Regular - Complete, etc. 37 x 3 x 4 = 444 different products in that section.
I also have many themes, such as patriotic, hobbies, professions, sports, religious etc.
I don't know exactly how much is duplicate, but most pages are very similar; some text changes, though not all that much, and the pictures change.
I have some pictures indexed and found in Google Images. I spent a lot of time organizing them and changing all the alt tags, file names, etc., to make it look like pretty much every picture is unique - even if some are duplicate pictures, the file names and alt tags are different.
So my question is basically: what do I do? hehe :)
What's my best course of action, aside from rewriting 1,500 product pages to be unique (which would take... well, I wouldn't get it finished this year)?
Here are a few direct links; maybe you can see a bit more clearly what I'm talking about:
https://funeralparlour.com/store/christianity-01-tri-fold-obituary-template-basic-package/
https://funeralparlour.com/store/christianity-01-tri-fold-obituary-template-regular-package
https://funeralparlour.com/store/christianity-01-tri-fold-obituary-template-complete-package
https://funeralparlour.com/store/astronomy-01-4-page-grad-fold-obituary-template-basic-package
https://funeralparlour.com/store/astronomy-01-4-page-grad-fold-obituary-template-regular-package
https://funeralparlour.com/store/astronomy-01-4-page-grad-fold-obituary-template-complete-package
Help me Doc,
It's appreciated.
Sorry about all the links, but it's the easiest way for me to show you my symptoms :)
Cheers
Matt
Thanks for posting this information.
I do have a question about this subject matter. I have noticed that several of my competitors have created a number of pages where the only difference in content is the city and state name, and they are being indexed and are searchable on Google. I attempted to do the same, and though my site has been crawled considerably over the last two weeks, I do not come up in the search results. Am I being penalized by Panda, or am I just being too impatient waiting for positive results?
Thank you.
Very well explained, Dr. Pete. Thanks for sharing such wonderful information. Honestly, this part was missing from my studies on SEO.
Once again Thanks.
Hi, I'm glad I found your good post. I have a question: when I copy a post and cite the reference source, does Google's Panda recognize my post as a copy? My site is www.rajeoon.com (in Persian).
Hi
I'm having trouble with duplicate content on my link exchange pages. As you know:
I have one module file, linkexchange.php, which generates the link exchange pages, and I found that I should use
(10) rel=next, rel=prev from your article. Can I use them in a single PHP file like that?
If Pete is still around: does this affect a client if they build different sites using different IPs and buy URLs that stay linked within themselves? The sport we cover, "soccer," is called different things across the globe, so we have used different URLs to reach each country's users - surely there is no penalty for being smart? What is your advice for the 140+ URLs bought with "soccer" and "football" in them? Should we simply point each to the main company's site, or is there a better way to use them? I am sure it's simply pointing the URLs to the main site, but it never hurts to ask. Thank you for your time - I know how little each of us has in this industry!
So, all 140 URLs can be crawled and resolve to the same content? Yeah, that's definitely going to look thin - cross-link them, and it's going to look like a link network. You could even end up with a fairly large-scale penalty.
Now, if they don't resolve separately, but just redirect to one core domain, that's fine. In Google's eyes, that will be one site. If each one is being crawled/indexed, though, you can create a real mess.
My thought was to use all the URLs we bought as billboards or advertisements with no links to the main company (in fact, no links at all) - just billboard-style pages with web commercials embedded. Would this be a better way to use them? And help me understand: are you saying to just point each URL at another address so the spiders can't crawl it, so the URLs are simply chance landings if a web user happens to type one into the address bar? Thanks for responding - nice of you. We are number one in most soccer searches, but this is getting out of hand; 16 years in the making, the competition is growing, and I would like to maintain my positions.
And pardon me if it was covered in the reading, but how do you know if you are being penalized? Do they communicate with you, or just silently slap a penalty on you?
Hi Dr. Pete
I'm having a problem with duplicate content for my dating website, datetolove.com. If you go to the location link on the site, you will see that the same people's profiles are showcased on the country, state, and city pages, which leads to duplication. If I use a canonical tag on the country page, then I think Google will crawl only the country page and will skip the state and city pages. But I need Google to crawl all three pages without any duplication. Please help me out with this problem - please check the link below.
https://www.datetolove.com/en/locations
Although I am commenting very late, this is the best case study I have ever read regarding the search engine algorithm...
I really honor Dr. Pete.
Our non-profit website www.birdlist.org has been plagued by increasingly unforgiving Google requirements. Created in 1998, ours is one of the oldest bird websites on the net, and we provide a list of birds for every country in the world and for every US state. This is high-demand information for bird watchers. But all our info looks very similar to Google. Some states may only vary by 5 species out of a list of 400.
So we tried to put in some content text, but with so many pages, it is impossible for us to author completely new text for each page. So we used a standard text and varied it a bit with keywords for the state in question. Example: https://www.birdlist.org/checklists_of_the_birds_of_the_united_states/birds_of_kentucky.htm For a while our page would come up, but then it would sink away again. Our most common search is "birds of + state or country".
We still get half a million site visits per year, but we are 50% down from a year ago.
What can we do to improve our scoring again?
I have an interesting issue: one of our products, linked below, is no longer indexed in Google:
https://www.keepitpersonal.co.uk/personalised-swarovski-crystal-heart-vase-engraved-gift-p-836.html
It's an item which many other companies are selling; however, all the competitors are using a thinner description and the same duplicate images as each other. Ours is completely unique, but it's their content that ranks!
I ran Copyscape and noticed that a price comparison site has scraped our page content, and I'm wondering if this is why it is no longer indexed.
https://www.copyscape.com/view.php?o=27146&u=http%3A%2F%2Fwww.shopwiki.co.uk%2Fl%2FPersonalised-Engraved-Horse-Glass-Vase-Gift&t=1351162865&s=http%3A%2F%2Fwww.keepitpersonal.co.uk%2Fpersonalised-swarovski-crystal-heart-vase-engraved-gift-p-836.html&w=54&i=1&r=3
I think scrapers are affecting our ranking. Has anyone else had this issue? Should I contact this site and ask them to remove it?
Thank you so much for your advice on parameter handling. I recently discovered duplicate URLs caused by parameters I am not even linking to on my site, but which were somehow made available to Googlebot by the developers of my site...
Hi Dr Pete
First of all, many thanks for such a nice post on the duplicate content issue.
I need some help regarding the site of one of my clients.
This site lost all its rankings for all keywords last week, and not a single keyword is appearing even in the top 200 search results. How do I determine why this site has been penalized? Was it penalized by Panda, Penguin, or something else?
After some analysis, I also found that the content of some pages of this website was also present on some press release sites. When I searched by putting text blocks from those pages into Google Search, this very site did not appear, but the press release sites containing the same content appeared at the top positions, and my client's website was nowhere to be found.
And another question: is it possible to keep those press releases live as well as that very same content on the website? The client wishes to have the content in both places (on the press release sites and on his website) - is that possible?
I don't know what to do.
How do I find out why the website was penalized?
What should I do to recover from this penalty (or whatever it is), and how do I regain the previous ranking positions in Google search results?
Please guide me, Dr. Pete.
Many Thanks.
Thank you, Dr.Pete! The post is awesome, extremely useful.
Outstanding! Thank you so much for your time putting this together. A great resource.
Wow - that has to be a "Hall of Fame" post - high-level summaries for non-propeller heads and solid detail for the techies we often bump heads with when trying to clean up technical infrastructure issues, of which duplicate content is often nightmarish. Outstanding!
Very good post - but we had been warned about Panda since Caffeine indexing came out. Fresh and unique content was sought, and now Panda almost guarantees it.
Great and clarifying article. You mention Bing - but doesn't Yahoo Site Explorer have any clever tools useful for similar functionality? I am not sure ;)
Like our own Open Site Explorer, YSE is really more of a tool for exploring your link graph. Unfortunately, most of the old Yahoo tools for really digging into your index are slowly going away since the Bing integration. I love YSE, but it's hard to recommend these tools to people, because they may not be around much longer.
Dr. Pete, thank you for taking the 20 or so hours to write this post; it covered something that I have been thinking about a lot recently. I do have a question for you and anyone who is willing to help.
Regarding "near cross-domain duplicates": I love food, and I am in the process of creating a recipe site using WordPress (for fun and experience) to build a list of recipes that all have one ingredient in common (one of my favorites, of course). But I am curious how you think Google looks at recipes compared to other content online. The reason I ask is that, when you think about it, there could be thousands of almost-identical recipes online (spinach dip, for example) with small variations in the ingredients and preparation instructions.
Basically, does Google look at recipes differently than other content online, and if not, how do you avoid getting penalized for duplicate content without having to do extensive research on every recipe posted?
Thank you for any input.
I think it's natural for content to be similar across subject matter areas. We obviously use the same terminology as other SEO blogs, cover some of the same topics, etc. There will inevitably be keyword overlap, sometimes large scale. In that case, it comes down to the usual SEO factors - your on-page targeting, link profile, etc.
If you're all posting the same (or 90%+ identical) recipes, it is a lot trickier. We see this a lot in e-commerce - 500 sites sell a product and all use the manufacturer's description. More and more, those sites are losing ranking ability, and it really comes down to what else you bring to the table. One way or another, you're going to need some unique content going forward.
Thanks a lot for taking the time to answer my question, Dr. Pete, and I completely agree. It was great to get some confirmation on whether I was thinking about things the right way from someone with a bit more - or should I say a lot more - experience than me :). Also, I already have more than a few concepts in mind to make the site stand out at least a little from all the others online, to help bring in new visitors and keep them coming back.
Again, thank you so much for your time.
Great article; however, it's very long. I'd suggest adding a table of contents.
Wow! Thanks for the detailed interesting post on a really complicated issue!
This is an ultimate guide to understand Duplicate Content from Basic to Advance.
A useful, but very long post. I will refer to this again and again.
Damn Pete, this is a serious amount of work. Thanks for pulling this together. Bookmarked.
Great Diagnosis Doctor!
Hi Dr. Pete, I am just a very inexperienced website owner who has seen a dramatic dip in website visitors since mid-October, and it has set me thinking about duplicate content. Having read your article, I think it could possibly be one of two things, but I would welcome your opinion. I sell children's books by linking as an affiliate to Amazon. Could it be those affiliate URLs causing the problem? As far as I know, they only appear on the Amazon landing page. Would Google penalise me for this? Or is it more likely that Google doesn't like the cut-and-pasted book descriptions, which came mainly from another publisher whose books I used to buy regularly? I can, of course, change them to my own descriptions, although clearly it would take a while with 400 books. My logic for using their descriptions was that I was selling their books! Naive, maybe?
It's tough for affiliates these days, and Google hasn't been kind. Your instinct is correct - if you're using Amazon's information/copy and then linking out, Google is going to question whether you're adding any value. It's critical that you get some form of unique content in place to supplement the shared information. It doesn't have to be all 400 at once - start with your Top 20 revenue drivers and see what kind of impact it has.
I never pull duplicate information from Amazon. For one thing, it's not my content. For another, I figured the text would be dinged for duplicate. And besides, if the user can get that description from Amazon why bother with me? I'm not adding anything to the mix.
It's duplicate content on the same site that is a problem. I'm not really convinced that syndication hurts, as long as you ensure your own page is stronger (i.e., has more links) than the syndicated copies.
Thanks Dr. Pete, I'll get going on that content then. Could the URLs themselves, that I link to on Amazon, be causing a problem? They must all look very similar, containing, as they do, my affiliate ID number. Or is that irrelevant?
Purely from a duplicate content standpoint, handling those affiliate URLs is only an issue for Amazon to deal with. However, being an affiliate and linking out for your products does have many of its own SEO challenges.
Thanks for all your help. 10 product descriptions changed....390 to go.......
Any chance you could explain why using an Amazon affiliate link is an SEO problem? (This is only my 2nd post to read here and I know nothing but am working on it.)
It's not really a duplicate or Panda problem, so much as that Google doesn't view affiliates all that positively. In their mind, Amazon is the source, and you're just a copy - you don't have your own product pages, so they see you as less authoritative. So, the trick is to create your own unique, supporting content, and become an authority.
Mad props Dr. Pete! This is great stuff. Thanks and keep 'em coming!
This is a great and extensive reference for duplicate content. Thank you so much for taking the time to put it together. This will be great for a sticky and reference post.
Very nice summary. Many times I was convinced that the elimination of duplicate content significantly improved positions in search engines and website visits.
Dr. Pete you are a stud! I have already implemented some of the techniques you detailed above. Stellar work!
Dr. Pete, what a fabulous article. Thank you so much for putting it together.
Dr. Pete, a great article on this subject. One thing I am still curious about (and maybe I missed this in the other comments or somewhere else): what do we do to protect ourselves when someone else duplicates our content?
For example, we have a customer overseas that buys product from time to time, and I've come to find out they have copied and pasted all of our web content onto their website without asking. Other than asking them to take it down, what's the best way to keep this type of thing from hurting our rankings?
Thanks!
Funny, that came up on Twitter, too. I was approaching the article from the standpoint of sites that are duplicating content (not being duplicated) - the problem of what to do when your content is being scraped is a really tough one. Google will try to find the original source, but they don't always get it right. Unfortunately, without cooperation from the other site, it gets tricky. If things get bad, you can request a DMCA takedown.
WOW!
Eh, guys, finally we have it: "The Definitive Duplicate Content Post". This is an example of solidarity from an SEO expert to all his mates. Thanks, Dr. Pete - obviously you've made a huge effort compiling all this information.
Thanks for your help. Simply, thank you very much.
Think I need a cup of tea or some kind of sports drink to recover now I've read that. Thanks for the info Dr Pete!
What an amazingly detailed guide! Congratulations!
Hi Dr. Pete,
Great article, I will be implementing your advice on our next round of website updates.
Thank you K
Great list of sources of duplication.
This one might be covered under 'duplicate paths', but server load balancing can also be a source of duplication - www1 and www2 versions of the same page.
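If anyone hits that www1/www2 situation, one common fix is a host-level 301 that folds the numbered hosts back into the canonical host. A minimal Apache .htaccess sketch is below; the hostnames are placeholders, and your load-balancer setup may call for a different approach entirely.

# Hypothetical example: 301 any www1/www2/... host back to the canonical www host.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www[0-9]+\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]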
What a great resource about duplicate content! It's sure to be the reference whenever a discussion about duplicate content comes up.
I didn't understand this part. Would you please explain?
" Where it gets tricky is that you’re almost always going to have to generate these tags dynamically, as your search results are probably driven by one template."
Oh, sorry - Here's a longer explanation. The trick with Rel-Prev/Rel-Next is that the tag(s) are different depending on what page of paginated content you're on (2, 3, 4, etc.). Typically, your entire search is driven by one physical page/template, so it's going to take some code to generate the right tags.
Actually, Google even says not to put rel-prev on the first page or rel-next on the last page - that makes perfect sense, but it makes the code even a little trickier. It's not the friendliest solution for CMS users or webmasters without much coding experience.
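To make that concrete, here's a minimal sketch (mine, not from the post) of the kind of template code involved, assuming a PHP-driven listing where $page and $totalPages are already set and pages live at /widgets?page=N. Note how the first page gets no rel-prev and the last page gets no rel-next.

<?php
// Hypothetical pagination tags for a single search/listing template.
// $page and $totalPages are assumed to be set by the surrounding code.
$base = 'https://www.example.com/widgets';

if ($page > 1) {
    // Page 1 is usually just the bare URL, with no ?page=1 parameter.
    $prevUrl = ($page == 2) ? $base : $base . '?page=' . ($page - 1);
    echo '<link rel="prev" href="' . htmlspecialchars($prevUrl) . '" />' . "\n";
}

if ($page < $totalPages) {
    $nextUrl = $base . '?page=' . ($page + 1);
    echo '<link rel="next" href="' . htmlspecialchars($nextUrl) . '" />' . "\n";
}
?>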
Congrats to Dr. Pete - many issues, known and unknown to me, all placed together in one post.
Same as jrcooper asked: how long did it take you to write this post!?
I saw the post get shared on Twitter, and after visiting it, I scrolled to the bottom because it was past 12am. There was a long debate over whether to read it... I went with reading it :)
This is one very useful guide to diagnosing duplicate content, and you illustrated some great examples of issues that I've seen with the clients I've worked with, mostly in the e-commerce space. The biggest issue is probably duplicate paths; we've recommended adding the canonical tag, since some CMSes will generate and publish multiple versions of a product page if it's tagged/categorized multiple times (grrr!). Great to validate my thoughts/suggestions with yours here, Dr. Pete! :)
Hi Pete, for the cross-ccTLD duplicates case (19), have you thought about the rel=alternate hreflang solution? Here it is: https://www.google.com/support/webmasters/bin/answer.py?answer=189077&&hl=en (see the "How does this apply to multi-regional webpages?" part).
I haven't tested it yet, but I'm about to do so.
Has anybody tested out this solution?
Thanks!
I"ve gotta be honest - that's the first I've heard of that one. I admit that international SEO isn't my strong point. I'd love to hear from people who have used it, as well. YOUmoz post, anyone? :)
There is still little information on the Internet about this solution.
It was previously mentioned on SEOmoz (for instance, here: https://www.seomoz.org/blog/duplicate-content-block-redirect-or-canonical), but the problem is that nobody seems to have actually tested it, so we can't really know whether it works or not...
It would be great if we could have some feedback from people who have tried it!
At Distilled, we've analyzed this specific post several times. However, for my own international client, we decided against recommending it. It's very complicated, code-heavy, and even the post itself isn't entirely clear. We just thought it might actually be difficult for Google to implement. Also, Google is pretty good at detecting translated content.
But what about non-translated content? For instance, a website that has exactly the same content on the .uk and .com domains.
The problem is that, although we have declared a preferred domain in Google Webmaster Tools and heavily individualized the title and description tags for our pages, Google still mixes them up from time to time (showing the UK website on Google.com and vice-versa).
Theoretically, I find this solution satisfactory, but I really wonder if it works.
For one, having two domains with the exact same content isn't ideal, because it will be considered duplicate content. However, to get to the main point, I have never seen a live example where hreflang has been implemented. My understanding is that Google isn't consistent in how it handles the tag, either. Overall, I'd be very interested to see if the solution is satisfactory. Please write a YouMoz post on this ;).
Actually, our website is one of the most important French and European websites at the moment.
The reason we have duplicate content on different domains is that we have a very international approach. For CTR reasons, we had to serve the English content for UK users on a .uk domain and the English content for international users on a .com domain. The same goes for French (with domains in France, Belgium, and Switzerland), German, Italian, etc.
We will be testing the hreflang tag in a month or so, only for the UK and COM websites. I will be sure to let people know about it, because I think it can solve many problems if it works.
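For anyone following along, the markup being discussed looks roughly like the snippet below - a sketch based on the Google help page linked above, with placeholder domains. Each English page would carry both annotations, declaring the UK version for British users and the .com version as the general English page.

<!-- On both the .co.uk and .com versions of the page -->
<link rel="alternate" hreflang="en-gb" href="https://www.example.co.uk/page.html" />
<link rel="alternate" hreflang="en" href="https://www.example.com/page.html" />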
What a brilliant post and a great contribution to this subject. I've been struggling for a while with a site that sits on a .com address, where the .com targets the UK and .com/USA targets the US - the only problem is that this really confuses the heck out of Google!
A very useful post for any SEO, thanks Dr. Pete!
Truly fantastic post, Pete!
The next post I'd like to see following on from this would cover how Google detects duplicate content. I.e., how does Google measure the similarity/difference between two pages....do they have a way to remove the "template" elements (header, nav, footer), do they create a hash of the text blocks and compare those....do image filenames get inspected...etc.
Certainly in the Q&A questions I've been tackling here I've had a TON of people desperate to know what to do to avoid getting tagged with duplicate content, when their content really comes from a very common source (i.e. an RSS feed, or an affiliate program, or a manufacturer's set of product descriptions and images etc.).
Of course, the answer to that today would probably not be valid in a month or whenever the next Panda iteration happens :-)
Dr. Pete! ...I found your post more than useful! It's for sure going to be a classic... thanks for your time putting it together, and for sharing, of course :D ...
Read nothing else on the subject - just pin it on your SEO wall!
After a long, well-crafted list, the best comes at the end with "use your brain" - always the best tool! Or not :)
I have a press release site, and I thought about blocking the PR folder in robots.txt, but that would kill all the SEO benefits for me and also for the submitters. So what I did was implement a cross-site canonical tag. So far it's working; some may say it's a hack, but it works.
Another great post Dr Pete, even if I did start reading yesterday and have only just finished...:)
The major point of frustration for me with 'near-duplicate content' is with regard to #15: geo-keyword variations. I have invested a huge amount of time and effort in creating unique location-specific content for our agency site and for multiple client sites, including the use of 'clickable maps' to implement a link structure that doesn't look spammy. I then see competitors using the old 'find and replace' technique (simply replacing the place name hundreds of times) and consistently outranking my sites.
I actually look forward to the day that Google seriously devalues inbound external links, as these seem to have the power to 'override' legitimate SEO work on occasions like these.
Insane post. Thank you kindly DR. Pete.
Thanks Dr Pete for putting so much effort in this post. It's gone straight into my list of top 10 resources.
Thanks for such a fantastic post.
I have a question following a couple of points made above. A client has an e-commerce site and has taken the time to create unique product descriptions for it rather than using the manufacturer's.
My client also sells these products on Amazon, if they use the same original product descriptions from their site on Amazon, is that still going to be viewed as duplicate content?
It could be, yes, especially since Amazon has such massive authority. If you're already set up this way, you should definitely keep tabs on how your product pages rank vs. the Amazon versions.
Thanks Pete much appreciated.
I would say you should run a test over a month or two with a few products using the Amazon approach and a few that are simply on your client's site. That is the best way to find out in the real world. We do tests like this all the time. We also run tests on static vs. dynamic page content and find that static content trumps dynamic content 100% of the time.
Wow. Thanks so much for this content. Super useful and appreciated. LMAO and extra TY from the "visual learners" : )
Thanks, a great article. Can you just confirm:
If I have two micro websites (both pointing to the main website), and I have put the same copy on each but then changed (spun) a number of the words and changed the images on one of the micro sites, would Google pick that up and classify the sites as 'near duplicates', thus giving one of them or both of them a bad ranking?
Thanks!
It depends, but it's highly unlikely that all of them will rank. Google is likely to filter out 1 or more. With 3 sites, the odds you'd be flat-out penalized aren't that high (depending on the sites), but the odds that 3 very similar sites will all rank are very low, unless there's virtually no competition. I think these kinds of microsites, which used to be pretty effective a few years ago (maybe even a couple of years ago) are going to lose ground quickly over the next year.
Great article and definitely one of the most in-depth pieces I've ever read on the topic. I'm still surprised how many people still don't understand how Google treats duplicate content. I'll definitely share this on my blog as it's a question that I get all the time. Thanks again!
Thanks for the great post. OK, unique content is the only way to let search engines know who I am. But how can I convince our customers to create unique content for businesses like screws or turned parts? Following Matt Cutts' advice to become a leader in my niche seems very hard to realize, because my competitor has the same idea with the same content...
I think you have to look at content as an extension of your unique value proposition. What does your client do that sets them apart? It's a broader and critical business question, well beyond SEO. If they can answer it, you've got a basis for content. If they can't, you have bigger problems than SEO.
I'd honestly suggest walking the floor, if you can - talk to the people who work there and find out what energizes them. Someone there is good at their job in a way that's interesting, and they probably don't even know it. Someone is passionate about a product others find boring. If you can understand those people, you've got a starting point.
Wish I could thumb up this post again - because I would.
I haven't seen such a well-prepared post in a long time - one that points to potential problems and offers effective, practical solutions. The duplicate content issue will stay a hot topic for a while. It takes time to clean up the huge amount of duplicate content all over the net, created mostly by spammers and low-quality link providers. But their time is gone. Now I know what "Dr." means when we talk about Dr. Pete. Excellent post - I'll bookmark it and read it once again.
It is an absolute delight to read someone who can actually write about this subject using English that a layperson who knows absolutely nothing about the subject can actually understand. I can't believe it and didn't think it was possible.
I don't know code, don't know SEO, and can barely figure out a stupid keyword. I am as non-techie as they come. I'm a writer who blogs. I think, for me, Panda has been a godsend, because SEO is now moving more toward my territory and things are starting to make a little more sense.
This is my first visit here and, as I said, the 2nd article I've read - both by you. I didn't find this piece long at all. I found it comprehensive, and the length probably necessary in order to deal intelligently with the subject matter. What immediately caught my attention was the whole question of what duplicate content is. This is actually the first article I've read that suggests there is more to it than scraping and copying blocks of text. Not only had I never considered or realized that pages could be duplicated in the way you're discussing, but the question of "near" content, specifically with regard to the organization of content on the page, raised flags. So I have a couple of questions.
I honestly just thought all this Panda hoopla was basically over scraping and plagiarism and things like that, because that's what I've been reading. This is different - way different.
I want to know these things, but clearly I'm not a coder, and much of what you are talking about is way over my head. I'm assuming that these non-content issues, such as URL duplication, etc., were I to find them, would require a technical person to handle.
What helps me is to know about these things and understand what I can do at the level of creating a post to eliminate the possibility of them happening; but I get the feeling that some of what you are talking about may not be something I can control - that it happens when you have a site that does specific things, such as selling items and using a shopping cart.
So my last question would be whether what you are talking about predominantly happens on sites that developers create, or whether it is something that happens if you have a blog site that uses WordPress and a theme such as Headway, and you don't do a lot of fancy stuff or try to change the basic theme structure.
In essence, how much should I be worried?
Oh, and the only thing I've read for months about "thin" content references wordcount. You raise a whole new issue. (#16)
Thanks again for a well-thought-out and well-developed article that even I could understand enough to get through the whole thing. And thanks to everyone commenting for such great interaction and suggestions. (My comment pays homage to your article in length.)
Thanks for the positive feedback :) Regarding your specific questions:
(1) Having a common template is fine, as long as the actual content is unique. The one exception would be "spun" articles - creating dozens or 100s of articles that are very similar and only differ by a few keyphrases. These are often going to look low-value to Google. Sharing a common theme and layout is ok, though - virtually all major blogs do that.
(2) Some variation is fine, and even natural. What I'd worry about is creating multiple pages that target keyword variations, where that keyword variation is the ONLY thing that changes. If you have "/home-tuition", "/home-tuition-classes", and "/home-tuition-courses" as 3 different pages that only differ by those keywords (all other content is identical), that's a low-value tactic. It used to be common, and honestly, it used to work, but now the risks outweigh the benefits.
If those topics are all unique and you have unique things to say about them, that's different. It's fine to target keyword variations across a site (and I'd argue it's good SEO), but you have to have the content to support those variations.
(3) You can have good SEO on a custom site or a WordPress (or other CMS) site. The only issue with CMSes like WordPress is that they sometimes create duplicate URLs and paths or put the same TITLE and META description on every page. There are plug-ins and other ways to control that, though. The template itself normally isn't a problem. The exceptions would be massive templates that take a long time to load or very "heavy" templates with very little unique content. If your template is loaded with images, ads, and plug-ins (social plug-ins, for example), and every article is a short paragraph, your site will look thin. That's a balancing act, though - there's no easy answer.
Wow, I am glad I found this reply before posting. It addresses my question: will the old technique of adding dozens of nearly identical keyword landing pages create a penalty?
It looks like combining close phrases on the same page is now better than creating a page for each keyword phrase variation or modifier.
One question, which might be related to duplicate content: when a website has one URL, rather than unique URLs for each page, yet has many unique pages with unique content, what happens from an SEO perspective?
I am wondering if this is an advantage, since all of the pages affect that URL's PageRank, but I also see it as a disadvantage. Any ideas?
Great information, Dr. Pete. It's an A-Z guide to solving duplicate content issues post-Panda. This will surely help website owners and webmasters who still haven't been able to overcome the Panda effect. Thanks a lot.
What is the best way to find out if there are duplicate content URLs for a given URL on my site? Is it copyscape.com?
Any other tools that can just keep an eye on our entire site (or selected URLs) to see if anyone has copied our content?
CopyScape is more for tracking duplicates across other domains - such as syndicated or scraped content. Our PRO tools will monitor internal duplicates, and Google Webmaster Tools can handle some of that, too.
If you want to know about people copying you, though, CopyScape is still one of the better automated tools. I sometimes do a quick-and-dirty version by just copying blocks of unique text, in quotes, into Google search. It's amazing how fast you can spot a couple dozen scrapers with just a few lines of text from one of your more popular pages.
This is an amazing resource. My head is spinning from all the information. I can't thank you enough for your research and compiling this into a document that I can refer back to.
Thanks Pete. This will be a fantastic document to keep as a reference.
Just a few points: you mentioned that the best solution on search sorts is a meta noindex on that page. I use a canonical tag. Is that just as effective?
Also, one interesting thing I read when Google announced their pagination solution: they said that, through their testing, they found users prefer to have only one page, i.e., no pagination at all. This not only solves all the pagination issues, but also gives the user the best experience, according to Google's "testing".
Purists will say that you shouldn't use canonical on search sorts, because they aren't "true" duplicates, but honestly, I suspect it'll work fine in most cases. I'd be careful with Google's testing findings - they have a bad habit of condensing the entire web into one data point. From what I've seen, having all results (or a lot) on one page CAN help, but it really depends on your audience. I'd definitely A/B test it. It's funny that they say that, but still have a 10-result SERP (for now).
That makes sense about the canonical. Thanks.
Yeah, I'm taking their "testing" results with a grain of salt. This is the easiest solution for them to deal with, hence they're promoting it!
Thanks again, Pete.
Wow, this article rules! It helped me a lot in fixing lots of problems on my Magento webshop! Thanks
What a great resource
Like everybody else ... wow ... excellent job Dr. Pete!
Questions:
1. If part of your on-page content is dynamic syndicated content, where is the cut-off for duplicate and near duplicate? Is there a percentage? 50%? 75%? Is there a clear cut-off?
2. Tags in CMS page headers would have to be implemented on a template-by-template basis. That is, for example, the standout tag would only have to be present on the part of the site with breaking/original news? And not used more than seven times per week?
Again many thanks here!
(1) Unfortunately, not really. I think it depends a bit on your industry, how you're syndicating, and your overall authority (link profile strength, basically). Personally, I wouldn't push past 50% syndicated, unless your whole industry is 100% syndicated.
(2) The "standout" tag hasn't been very well documented at this point, I'm afraid. I don't have much data on how much attention Google pays attention to it. I'm afraid it's going to be like other call-out tags - people will abuse it to the point that Google stops paying attention. If you use it, I'd use it sparingly - they're more likely to take it seriously that way.
This is a gem of a post. It reflects a pretty deep-grained understanding of the issues, and I thank you for the effort you've put into this SEO duplicate-content 101. There was so much info that I may have simply not absorbed the answer, but perhaps you could help me out with this issue. A client's web designer has placed copies of my client's home page on satellite URLs. Same text, same everything. The links from the product category navigation items in the main menu and the in-content pictorial categories link and redirect to the main website's relevant pages, though the anchor text, etc., is now out of date.
I wanted to simply redirect any traffic straight into the main website's content. I wanted to completely remove the duplicated homepages. The designer's judgement is that this will be seen as cloaking by the Big G and incur a different penalty. I disagree based on what I have tried to find out and my own fairly limited experience of working with large sites.
My alternative strategy was to refresh the content on those existing one-page satellite sites and kill the duplication that way, so that the URLs are hosting more targeted content for their keywords.
The designer also controls the hosting and has been really arsey - it took two months and more than 12 phone calls to actually get the conversation - and to be honest, I would advise the clients to move hosts. Unfortunately, they like the cheap hosting on offer, despite the shite customer service.
Would someone detached from the above argument be prepared to give an opinion on the best strategy for de-duping in this situation? I am arguing with a person who has 10 years more mileage than me, but I am convinced the strategy he has given my client is hurting their rankings for selling their goods across the UK.
Many thanks,
Ray
Hi.
You say: "All else being equal, bloated indexes dilute your ranking ability."
That's quite an important factor for me. As each user comes onto my site, their search term (if available) is used to construct a new page for future users of the same search term. Hence, as an example, I have a dynamic page called metal-blue-widgets.htm and another called blue-widgets-metal.htm. They are the same page, or duplicate content.
It's tricky, programmatically, to spot which page should be the canonical one, so that it can be added as a header tag - I have over 500 products. The idea is to provide a page for my customers that just lists the metal blue widgets they want - it saves them having to find them amongst the other stuff, so the purpose is good.
With 6000 of these pages in existence - all providing variations of products, and all indexed - it's very important for me to recognise the truth value of the above statement. Could anyone indicate other material that shows the same?
Regards
Baron
I don't find user-generated search pages to be very valuable, honestly, and they can often spin out of control. I think it's much better to manage your own categories/sub-categories and focus on what's important. Otherwise, you've got thousands of very similar pages competing for attention in the rankings, and Google's not that keen on search pages to begin with.
I am far from being as authoritative as Dr. Pete, but I know the theory, and over the last few months I have read extensively on the subject. Google's algorithm is based on the PageRank algorithm. Nowadays, a lot of additional factors come into play, but PageRank is still very important. Basically, mathematically speaking, diluting your inbound links among many identical pages is bad. I have seen it mentioned over and over by many experts, and you should not have any difficulty finding your own references.
This dilution of inbound links, and also the fact that you use your crawl budget badly, are not considered penalties imposed by Google. They are penalties that you impose on yourself. Before the Panda revolution, Google was very clear that it did not penalize duplicated content. With Panda, this might have changed. In my opinion, this means that you had better find a solution to your duplicate content issue.
However, removing the duplicate content is not always ideal. Often, the content is duplicated for a useful purpose. I am always annoyed when a so-called SEO expert quickly jumps to the conclusion that duplicated or near-duplicate content is a bad structure. On the contrary, it is fundamentally natural. Information progresses through phases of duplication (analysis) and consolidation (synthesis). Near duplication is fundamentally required because there are so many possible expectations from users, and sometimes small details are important. Any fight by Google or SEO experts against near duplication is a waste of time. It is natural and necessary.
That being said, generating the duplicate content automatically in response to a query does not seem to fulfill any useful purpose except forcing many pages with similar content to be indexed separately in Google's index, which is bad. Duplicate content is fine within your site - this is the part that is natural - but you should do everything possible to avoid duplicate content in Google's index, because Google's search pages correspond to the synthesis part; we do not want duplicate content at this level. The canonical link element is a very useful tool for that purpose. I want to know what Dr. Pete thinks about it, but if you find a way to use the canonical tag to present a nice structure to Google with little duplicate content, then you will be fine even with these user-generated search pages.
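One illustrative approach (my own sketch, not something from the thread): if the duplicate URLs differ only in the order of their attribute tokens, you can derive a single stable canonical deterministically by sorting the tokens and point every variant's rel=canonical at that one URL. For example, in PHP:

<?php
// Hypothetical helper: metal-blue-widgets and blue-widgets-metal both
// normalize to blue-metal-widgets, so every variant can declare the
// same canonical URL.
function canonical_slug($slug) {
    $tokens = explode('-', strtolower($slug));
    sort($tokens);
    return implode('-', $tokens);
}

$requested = 'metal-blue-widgets'; // the slug of the page being rendered
$canonicalUrl = 'https://www.example.com/' . canonical_slug($requested) . '.htm';
echo '<link rel="canonical" href="' . htmlspecialchars($canonicalUrl) . '" />';
?>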
I'll agree on the user level that some duplication is perfectly sensible, but I'm mostly talking about removing it from Google's view. For example, you need search pagination if you have a ton of results, in most cases - it's valid, useful and necessary. Google doesn't want that indexed, though, and that's the key.
Actually, it goes deeper than that - usability is not the same as search usability. Paginated search (for example) is perfectly useful for your site visitors. On the other hand, running a search that pulls up Page 17 of one of your internal search results is NOT useful for search visitors. They're going from Google's search results to yours, and that 17th page of results has very little context or meaning. So, I think these are two very different arguments. What's good for users on your site isn't always good for users who arrive on your site via Google.
You are the best!
Hello Dr. Pete,
Here you explain the Panda update and duplicate content really well, but I have a doubt about rel=canonical: in the tag, do we write the URL as "www.example.com" or as "www.example.com/index.html"? I'm confused about which one to use, so I hope you can explain this to me.
jasika marshel, canonical means preferred, so if your preferred URL is www.example.com, you have to add this code at the top of the head section of the non-preferred URL: <link rel="canonical" href="https://www.example.com/"/>. If you have a landing page for which you run a lot of paid campaigns, and which therefore involves a lot of URL tags for tracking purposes, as a precautionary measure you can add:
<link rel="canonical" href="https://www.example.com/landingpage.shtml"/>
Arpitsrivastava is essentially correct - the key is the rel="canonical" attribute in the tag. (Our comment editor originally ate that part; it did it to me, too.) Please see this Google post for the proper syntax:
https://www.google.com/support/webmasters/bin/answer.py?answer=139394
The confusing part is usually where to put the tag. The root ("/") version and "index.html" version of the page are almost always the same actual file/template, so you only need to put the tag in one place, and it'll cover all variations of the home-page (usually).
The trick is if you have a CMS or some kind of sitewide page header - you don't want to add a canonical tag to one template and have it roll out to 100s of pages. The actual implementation can get tricky in practice.
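As a minimal sketch of what that looks like in practice (assuming a PHP header template and placeholder URLs), the trick is to emit the home-page canonical only when the home page is actually the page being rendered:

<?php
// Hypothetical sitewide header include: only print the home-page
// canonical for "/" and "/index.html", not on every page that shares
// this template.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($path === '/' || $path === '/index.html') {
    echo '<link rel="canonical" href="https://www.example.com/" />' . "\n";
}
?>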
Dr Pete,
Great post, but there are certain things that I would like to point out:
www.abc.com/index.html >> 301 redirect >> www.abc.com is not possible, as that would lead to a redirect loop and the home page would go down. The canonical tag is the only way.
For international sites with duplicate content, there are a lot of options, like -
- Sajeet
You're right, in the sense that the "index.html" redirect can loop, and I think Apache servers tend to have trouble with it. It's not impossible, though - there are some ways to get around it, and the rewrite is safer on other platforms. For a home-page, though, I'll agree that canonical is usually a better bet. In addition to being safer/easier, it also scoops up other variants.
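For reference, one common Apache workaround (a sketch with placeholder paths, not something from the post) keys the rule off the client's original request line, so the redirect only fires when a visitor explicitly asks for index.html; the internal DirectoryIndex lookup for "/" never matches, which is what avoids the loop.

# .htaccess sketch: 301 /index.html (and /folder/index.html) to the folder root.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(.*/)?index\.html[\ ?] [NC]
RewriteRule ^(.*/)?index\.html$ /%1 [R=301,L]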
Great article Dr. Pete. But I think some points require clarification.
1. I do not agree with point 4 about using the meta robots directives "noindex,nofollow". The reason is that you create dangling pages or nodes (dead-end pages). The appropriate implementation would be the robots directives "noindex,noarchive,nosnippet,follow".
2. The information you provided about the meta tag "syndication-source" (11) is outdated. Google changed that to "standout".
I think there are times when it's ok and even advantageous to create a dead-end, if the path is naturally a dead-end for spiders. Otherwise, if the paths are complex enough, you could end up with crawl fatigue. Unfortunately, these situations are usually so complex that I've never seen anyone effectively measure one vs. the other. So, we're sometimes left with a difference between two educated guesses.
Do you have a reference that suggests "standout" has replaced "syndication-source"? It was my impression that "standout" was a way to call attention to a small sub-set of news items, not to send a syndication signal. You can "standout" links to other sites, but that isn't a canonical signal (as far as I understand it). I haven't implemented it, so I may be wrong.
About my first point, I was referring to the original PageRank patent, and since there is no evidence that have been modified, I prefer to stick to my point.
About my second point, "standout" replaced the purpose of use of the "syndication-source". To be specific:
Source: https://www.google.com/support/news_pub/bin/answer.py?answer=191283
This is purely my opinion, based on anecdotal evidence. By the original PageRank patent, you're right - since the PageRank calculation is iterative, cutting off a path completely could keep it from looping back up and through a site, theoretically choking off a small amount of PR. I suspect, though, that:
(1) Changes to the Google algorithm over time have modulated the amount of PR passed by navigation elements. So, if the only links on a page other than sitewide links (including navigation) have no value, then the loss of iterative PR is virtually none.
(2) For deep pages, the amount of PR passed back up is small enough that the negative of losing it is smaller than the negative of causing the crawler to go through 100s or 1000s of unnecessary pages. Of course, this is highly situational.
An example where I'd consider using NOINDEX,NOFOLLOW is on a shopping cart page. Every page below it (checkout, for example) is useless to search. So, if the contextual links are useless and the PR-passing power of the sitewide links is modulated, I expect the loss is negligible. Of course, I'd also probably NOFOLLOW the link to the shopping cart itself, so the Meta NOINDEX,NOFOLLOW is really just a backup at that point. By nofollowing the link itself, PR flow is cut much more surgically.
For any given situation, calculating the amount of PR lost vs. the crawler fatigue is something only Google can do (and even they probably couldn't give you those numbers for any given site). So, it's highly speculative. I do agree that, when in doubt, NOINDEX,FOLLOW is going to be safer for most people in most situations.
Thanks for the reference on the changes to syndication-source. I'll dig into that over the weekend and update the post once I understand the distinction better.
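For anyone implementing the noindex,follow vs. noindex,nofollow options debated above, the two variants are just one-line meta tags (a sketch; adjust the directives to your own situation):

<!-- Safer default for most filtered/sorted/paginated pages: keep them out of the index but let link equity flow -->
<meta name="robots" content="noindex, follow" />

<!-- For true dead-ends, like a shopping cart or checkout path -->
<meta name="robots" content="noindex, nofollow" />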
Excellent article. I have recently tried out a new internal linking structure to deal with the issue of targeting multiple local areas. I didn't duplicate content or create lots of city-specific links; what I did was create a linking structure that associated each core subject with its city in a clever (I think) way.
I have yet to see the results, but I plan on making it my first YouMoz post when I do.
The whole structure was designed to avoid the local duplicate content issues you were discussing.
Excellent article.
I am curious to know if you would consider the following 3 listing pages as duplicate content:
https://www.thinkvidya.com/bangalore/home-tuitions-classes
https://www.thinkvidya.com/bangalore/home-tuition
https://www.thinkvidya.com/bangalore/home-tutors
Is it a good strategy to build pages for related search terms, which then show different sets of search results?
I think it depends a bit on the scope, but if you're spinning out the same category (search results, in this case) just to target slight keyword variations, then I would call these near-duplicates. If you do it a couple of times to target your major keywords, it's probably fine. If every category has a handful of variations just to target different keywords, then you're probably going to run into problems.
Great article! This is the only one I'll probably need on duplicate content because you were so thorough. Thank you.
Question: twice now I've had to go up against chiropractor and veterinarian sites that were purchased or rented and came complete with copy, sometimes hundreds of pages of it. This same copy is shared by many, many other chiropractors and veterinarians across the web. It is duplicate copy, but nevertheless those sites rank high. The question is: why? Maybe Google doesn't care? I'm sure there are other factors coming into play here as well, some even a result of all that copy, which must be attracting visits, increasing time on site, lowering bounce rate, etc. Are all those things making up for the duplicate copy? What are your thoughts?
You can see it if you search on "santa cruz chiropractor." In the top 5, only McCollum and Griffin have custom copy.
One of the most helpful tools in the SEOmoz toolkit is the Keyword Difficulty Tool, which allows you to see exactly which metrics are being won and lost by each of the top 10 results for a given keyword term.
To see all of the metrics you need to run a full report and when setting that up, you can also add your own URL for comparison against the Top 10.
Rand gave a detailed explanation on how to use the full reports to see exactly what is influencing the rankings for each site in his post The Best Kept Secret in the SEOmoz Toolset.
Looks like these results are heavily influenced by local search factors.
Sha
As Sha said, there are quite a few factors that can come into play. With a local business, like a chiropractor or veterinarian, local SEO factors (like your Google Places listing, citations, etc.) can definitely play a strong role. I also suspect that, if the templated sites target different regions (and especially if they're smaller sites), Google may overlook it. So, they may be able to push the envelope a little more. On the other hand, I still think that your own unique content is a competitive advantage over time.
Thanks for the insights, Dr. Pete. I do have a question regarding those pesky "near duplicate" pages: In your opinion, how different does content have to be in order to be considered non-duplicate? Would rewording the verbiage between one page and another do the trick, or is a more complex "intervention" needed?
And what about items like a company's "boilerplate" descriptor? As an important part of the brand, there are definite advantages to having it on multiple pages from a brand-identity standpoint, but does it fall into the duplicate-content category?
Look forward to your answer!
That's a really tough one, and it depends a lot on your site as a whole - how many pages are near-duplicates, what kind of authority you have, etc. It also depends on whether you've already been hit by Panda or are just being proactive. It can take significant changes to undo Panda, but you can ease into it if you're just trying to prevent future problems.
I think there's another issue with chunks of copied content, like boilerplate company descriptions - they can cannibalize your own keywords. They might be a weak signal that you want the entire site to rank for terms in that description, but they also confuse Google as to your priorities. I think it's often better to target one page for that content, and stick to something shorter (like a solid tagline) sitewide. In most cases, I strongly suspect visitors ignore it, too - we tend to overestimate the value of our brand content in the eyes of site visitors.
Great insights -- thank you!
Really good stuff, Dr. Pete. I think "Your Own Brain" gets underutilized in many instances. Again, great article.
Many thanks for this brilliant post. I can't remember ever reading such a long blog post. It should appear exactly this way in any SEO book.
Thanks so much for the clarification Dr. Pete. Extremely detailed and helpful article. So many changes with Google and the changes are becoming too frequent. Good to have your article for reference.
Wow! That's what I call a comprehensive article about duplicated content. Everything clear and neat.
Epic Post for Duplicate Content, Thank you.
What he said.
How long did it take you to write this post? A must-read for all SEO strategists.
Wow!!! Just hats off mate.....Thanks for the great info. This is really gonna be a big help. Appreciated.
Cheers!!!
Edit: Just realized....1 thumb down??? SEOmoz surely has a spam....lol!!!
All the collected information under one roof... thanks, good research work.
Great post. I didn't think that having a link to index.htm was different than linking to the main URL. I was noticing some funny business in Google results, where it was showing the full URL including index.htm even though I was linking to the page by the domain only. I will set up a canonical to take care of this.
Now I understand why Google still prefers plain HTML pages: blogs are very complicated and create lots of problems for search engine bots. So why do they say that blogs are SEO-friendly?
What a fantastic post, Dr. Pete.
I have been feeling for a while that duplicate content was going to become a much less isolated issue than it was, so Panda has not been a massive surprise. That said - boy, are the updates coming thick and fast!
The rel-prev and rel-next tags are interesting ones, too. They are something that I will be watching closely. What is the SEOmoz feeling on them - awesome, or a bit of a waste of time?
Of all the great SEO blogs out there, SEOmoz proves to be the most cutting-edge.
So, all the way from the UK, we say 'legends' and 'keep up the good work!'
A number of people asked if we could make a stand-alone PDF version of this post available. That's complete now, and you can download it here. FYI, it's about 22 pages and 560KB.
Dr. Pete, wowee! What a thorough article, and one that explains each issue so very well. This is certainly one that I will be bookmarking and making regular reference to.
Cheers