I'm currently working on re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section. You can read more about this project here.
Canonical & Duplicate Versions of Content
Why Canonical Versions of Content are Critical to SEO
Canonicalization can be a challenging concept to understand (and hard to pronounce - "ca-non-ick-cull-eye-zay-shun"), but it's essential to creating an optimized website. The fundamental problems stem from multiple uses for a single piece of writing - a paragraph or, more often, an entire page of content will appear in multiple locations on a website, or even on multiple websites. For search engines, this presents a conundrum - which version of this content should they show to searchers? In SEO circles, this issue is often referred to as duplicate content - described in greater detail here.
The engines are picky about duplicate versions of a single piece of material. To provide the best searcher experience, they will rarely show multiple, duplicate pieces of content and thus, are forced to choose which version is most likely to be the original (or best).
Canonicalization is the practice of organizing your content in such a way that every unique piece has one and only one URL. By following this process, you can ensure that the search engines will find a singular version of your content and assign it the highest achievable rankings based on your domain strength, trust, relevance, and other factors. If you leave multiple versions of content on a website (or websites), you might end up with a scenario like this:
If, instead, the site owner took those three pages and 301-redirected them (you can read more about how to use 301s here), the search engines would have only one, stronger page to show in the listings from that site:
When multiple pages with the potential to rank well are combined into a single page, they no longer compete with one another and instead create a stronger relevancy and popularity signal overall. This will positively impact the combined page's ability to rank well in the search engines.
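To make that concrete, here's a minimal .htaccess sketch (Apache only; the paths and domain are placeholders, not taken from the diagram above) of three duplicate URLs being 301-redirected to a single canonical page:

# consolidate duplicate versions onto one canonical URL (placeholder paths)
Redirect 301 /canoe-page.html https://www.example.com/canoes/
Redirect 301 /products/canoes.php https://www.example.com/canoes/
Redirect 301 /canada/canoes https://www.example.com/canoes/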
A newer option from the search engines, the "Canonical URL Tag," is another way to reduce instances of duplicate content on a single site and canonicalize to an individual URL.
The tag is part of the HTML header on a web page, the same section where you'd find the title tag and meta description tag. In fact, this tag isn't new, but like nofollow, it simply uses a new rel parameter. For example:
<link rel="canonical" href="https://moz.com/blog" />
This would tell Yahoo!, Live & Google that the page in question should be treated as though it were a copy of the URL moz.com/blog and that all of the link & content metrics the engines apply should technically flow back to that URL.
The Canonical URL tag is similar in many ways to a 301 redirect from an SEO perspective. In essence, you're telling the engines that multiple pages should be considered as one (which a 301 does), without actually redirecting visitors to the new URL (often saving your dev staff considerable heartache). You can read more about implementation and specifics of the tag here.
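As a purely illustrative sketch (the URLs and title are hypothetical, not from any real implementation), the tag sits in the <head> of the duplicate page and points at the version you want the engines to credit:

<!-- head of a duplicate page, e.g. a print-friendly or parameterized URL -->
<head>
<title>Post Title</title>
<meta name="description" content="A short summary of the post." />
<!-- tells the engines to credit the canonical version instead -->
<link rel="canonical" href="https://www.example.com/blog/post-title" />
</head>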
As an example of canonicalization, SEOmoz has worked on several campaigns where two versions of every content page existed: a standard web version and a print-friendly version. In one instance, the publisher's own site linked to both versions, and many external links pointed to both as well (this is a common phenomenon, as bloggers & social media types like to link to print-friendly versions to avoid advertising). We worked to individually 301 re-direct all of the print-friendly versions of the content back to the originals and created a CSS option to show the page in printer-friendly format (on the same URL). This resulted in a boost of more than 20% in search engine traffic within 60 days. Not bad for a project that only required an hour to identify and a few clever rules in the htaccess file to fix.
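Those "clever rules" will vary from site to site, but as a rough, hypothetical sketch (assuming the print versions lived under a /print/ path and the originals under /articles/ - neither path is from the actual campaign), a single mod_alias rule in .htaccess can cover the whole pattern:

# 301 every print-friendly URL back to its original article (paths are placeholders)
RedirectMatch 301 ^/print/(.*)$ https://www.example.com/articles/$1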
Defending Your Rankings Against Scrapers & Spammers
Unfortunately, the web is filled with hundreds of thousands (if not millions) of unscrupulous websites whose business and traffic models depend on plucking the content of other sites and re-using it (sometimes in strangely modified ways) on their own domains. This practice of fetching your content and re-publishing it is called "scraping," and the scrapers make remarkably good earnings by outranking sites for their own content and displaying ads (ironically, often from Google's own AdSense program).
Preventing the scraping itself is often next to impossible, but there are good ways to protect yourself from losing out to these copycats.
First off, when you publish content in any type of feed format - RSS/XML/etc - make sure to ping the major blogging/tracking services (like Google, Technorati, Yahoo!, etc.). You can find instructions for how to ping services like Google and Technorati directly from their sites, or use a service like Pingomatic to automate the process. If your publishing software is custom-built, it's typically wise for the developer(s) to include auto-pinging upon publishing.
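Under the hood, most of these ping services accept the standard weblogUpdates.ping XML-RPC call. As a rough sketch (the endpoint shown is the XML-RPC address Ping-o-Matic has commonly documented - check their site for the current one - and the blog name and URL are placeholders), a manual ping from the command line might look like:

curl http://rpc.pingomatic.com/ -H "Content-Type: text/xml" --data '<?xml version="1.0"?>
<methodCall>
<methodName>weblogUpdates.ping</methodName>
<params>
<param><value>Example Blog</value></param>
<param><value>https://www.example.com/blog/</value></param>
</params>
</methodCall>'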
Next, you can use the scrapers' laziness against them. Most of the scrapers on the web will re-publish content without editing, and thus, by including links back to your site, and the specific post you've authored, you can ensure that the search engines see most of the copies linking back to you (indicating that your source is probably the originator). To do this, you'll need to use absolute, rather than relative, links in your internal linking structure. Thus, rather than linking to your home page using:
<a href="../">Home</a>
You would instead use:
<a href="https://moz.com">Home</a>
This way, when a scraper picks up and copies the content, the link remains pointing to your site.
There are more advanced ways to protect against scraping, but none of them are entirely foolproof. You should expect that the more popular and visible your site gets, the more often you'll find your content scraped and re-published. Many times, you can ignore this problem, but if it gets very severe, and you find the scrapers taking away your rankings and traffic, you may consider using a legal process called a DMCA takedown. Luckily, SEOmoz's own in-house counsel, Sarah Bird, has authored a brilliant piece to help solve just this problem - Four Ways to Enforce Your Copyright: What to Do When Your Online Content is Being Stolen.
As always, comments, corrections, and suggestions are greatly appreciated! I'll try to speed up the guide in the next few days and weeks, so look for a little more "back to basics" blogging. I'll rely on Rebecca, Jane, & the YOUmozzers to keep adding diversity to the mix. :)
p.s. Oh jeez... 3:50am. I really need to start sleeping more.
Fantastic article, love the diagrams! I've been lurking here at SEOmoz for a while now, but I figured I better post and say thanks once in a while! You guys are basically teaching me SEO (I'm an apprentice SEO at a company in the UK) and I really appreciate it. - Henry
While canonicalization is hard for SEOs and novices alike to say five times fast, it's even harder for novices to understand, even if explained five times slow.
There are two ways to look at this, from a "content" perspective and a "URL" perspective. The problem may be that we often talk from a content point of view, when what we are really talking about is URLs.
When you talk with clients who aren't all that knowledgeable, it isn't surprising that they have such a hard time with this, but when you talk with clients who are very web savvy and also have a hard time with this, it makes you start to wonder whether you are both on the same page.
Let's take a typical site's homepage as an example... most would say, "No, that page is unique, that content isn't used anywhere else." This of course is a content focused view, based on duplicating content. But it helps to break it down to the URL level, to show them that:
domain.com and www.domain.com are actually two pages (that just so happen to show the same content), which, with a little explanation, many will get and then implement redirects to either the www or the non-www version.
But we have to take it further, and show them how the "Home" navigation, which leads to www.domain.com/default.asp (or whatever) is also technically another page, as is www.domain.com/default.asp?source=header, as is www.domain.com/default.asp?source=footer, as is www.domain.com/default.asp?source=sitemap.
Laying out these URL variations helps to convey that duplicate content is often as much about URL variations leading to the same "page" as it is about having multiple pages with the same content (or chunks of it).
What should I say, Rand?
Most times I read the SEOmoz blog I find something really useful. As I am new to SEO, it's a great resource for me. This time you shed light on pinging, which I was not aware of at all.
Thanks for always sharing very useful things like this. I am a regular reader of SEOmoz and hardly miss any post.
Thanks to the whole SEOmoz team for making SEOmoz such a great resource for new people like me.
Hey Rand - good call on the scrapers comments. I like Joost's strategy on the subject. There's a heap of problems with dupe content - particularly dealing with proxy sites at the moment. Is any of that covered yet?
with the canonical URLs such as site.com, site.com/index.html, www.site.com and www.site.com/index.html i think it's not only important to 301 them to one URL but to also make sure that all of your links are pointing to the page in the same way. if you are redirecting everything to www.site.com then all of your internal links should point to the page as www.site.com (absolute as rand points out) even though you've set it up to redirect anyways.
though on a side note if your site is ASP on IIS you'll likely not be able to redirect site.com/default.asp to www.site.com as it will cause a loop. if anyone knows how to fix this, by all means let me know. it's been hurting my head for weeks.
Kimber, you're not the only one banging your head against the wall with IIS 301 redirects. I've searched and searched, but keep coming back with either the loop, or VB code so difficult that most people will just give up and allow the dupe content.
Am I being anal for also making sure that index.* redirects back to /? I stress that quite a bit, but really it's only the index page as opposed to the whole non-www to www issue.
yes, i too standardize all of my homepage urls to site.com/ with the trailing slash. i'm not exactly sure why that is, but i do it anyway.
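For Apache users, a minimal .htaccess sketch of both fixes discussed above - host canonicalization and index-page canonicalization - might look like the following (the hostname and file extensions are placeholders, and this does nothing for the IIS loop mentioned earlier):

RewriteEngine On
# host canonicalization: send non-www requests to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
# index-page canonicalization: send /index.php or /index.html back to the root;
# THE_REQUEST is checked so the internal DirectoryIndex subrequest doesn't loop
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\s/index\.(php|html)[\s?] [NC]
RewriteRule ^index\.(php|html)$ https://www.example.com/ [R=301,L]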
"Canonicalization can be a challenging concept to understand (and hard to pronounce - "ca-non-ick-cal-eye-zay-shun"),"
Is it weird that I found this to be the most helpful part of the re-write? ;)
The canonical URL for the root of a domain and for any folder MUST end with a trailing "/".
Never link to "https://www.domain.com" or to "https://www.domain.com/folder" without it.
The correct URLs are "https://www.domain.com/" and "https://www.domain.com/folder/" with the trailing "/" included.
That's direct from the HTTP specs.
Where did you get that?
It's mentioned many times in the various RFCs (I forget the number, but it might be RFC 3986, I think) and in the Apache webserver documentation.
Here?
Under URL Layout - Trailing Slash Problem.
It is relevant to point out that this only applies to Apache web servers.
I wonder if that's true. What do you say, Rand?
I check out a lot of sites, and I notice that they don't redirect index.* back to the root of the domain, but I started doing that probably 2 years ago.
As far as HTTP specs or Apache documentation, whenever there is a question about 301s, g1smd is pretty locked on. Poor guy, I think everyone should chip in and buy him a one page domain with all his knowledge so he doesn't have to keep saying the same stuff over and over. Heh.
https://www.google.com/search?hl=en&lr=&q=g1smd+%2B+301+site:webmasterworld.com&btnG=Search
Hmm. Maybe I ought to see if SMX or SES want the full works on canonical URLs, redirects, and stuff that can screw up your rankings and traffic if you get it wrong. :-)
I don't know if they'd want it, but I'd sure as heck take it. Thanks Ian.
*** a one page domain with all his knowledge ***
Hang on a moment. Are you saying that all my knowledge would only fit on one page?
Cheeky Bugger!!! LOL.
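Rounding out the trailing-slash discussion above: on Apache, mod_dir normally takes care of this on its own, 301-redirecting a slash-less request for a real directory to the slashed URL (DirectorySlash On is the default). As a rough mod_rewrite sketch of the same behaviour, for reference:

RewriteEngine On
# if the request maps to a real directory but lacks a trailing slash, 301 to the slashed URL
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]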
With regards to scrapers: funnily enough, I've had a little experience with scrapers, and what you find is that most people don't get many links for their content, so even if site A pinged and got it indexed first, Google will rank site B above it as long as site B gets a couple more links than site A.
Seeing as most SEO guys are just learning about deep linking, this is a rather easy thing to do. Take a longtail term like "buy kdh-8374 stereo receiver," stick it on some parasite hosting with a lot of domain weight like Blogger, AOL pages, HubPages, Squidoo... and the list goes on. Get 2-3 links to it and all of a sudden your content is making others money. Also, this is very hard to do anything about, because you can't find the person and you can't trace the domain or hosting.
It's the perfect storm for the scraper. Good content, links with anchor text, and a trusted domain. You don't need much more than that to rank.
So as far as I can see it, the only thing you can do to defend against this is get a lot of deep links to each of your pages. Even then you're going to have a hard time. It's just a flaw in Google's algorithm. Life goes on... Happy money making.
Rand,
The canonical example you gave with your company and the three different versions of your content was a good one. But what I am curious to know is: in what type of situation would a website be using the same content on two different pages such that it would need a 301 redirect?
I do know a lot of duplicate content problems can arise from sections on individual pages that have the same content, but in this particular case you wouldn't want to 301 redirect the entire page, as each page has its purpose.
Does anyone have another good example of where you would use the 301 redirect for two pages that have the same content?
Thanks,
BJ
This beginner's guide will grow big, so it seems... Good article!
RIes
Does the advice about print-only pages also apply to pages with CSS style switchers?
It is probably going to come down to how it is handled. If the switching is handled by appending a parameter to the URL, like domain.com/mypage.htm?css=big, then yes, it will be creating duplicate content.
While maybe not ideal, a couple of ways to help limit or protect against that are to nofollow those links or run them through JavaScript, and to use robots.txt to isolate those URLs. But that's not to say that someone won't come along, view the page with the "big" styles, copy the URL, and use it in their blog, thus creating a link to the page.
Even better though might be to use server-side scripting to dynamically change the style instead of relying on URL parameters.
Taken from SEOmoz's own 301 page, if you were using querystrings for your stylesheet links, would this be the solution?
RedirectMatch 301 /index.php(?css=big) https://www.yoursite.com/index.php$1
no
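A note for later readers on that "no": RedirectMatch (mod_alias) matches only the URL path and never sees the query string, and 301ing the styled URL away would also break the style switcher for visitors. If you genuinely needed to consolidate a parameterized URL, the query string has to be tested with mod_rewrite instead; a hypothetical sketch:

RewriteEngine On
# match the css=big parameter (hypothetical) and redirect to the parameter-free URL;
# the trailing "?" on the target strips the query string
RewriteCond %{QUERY_STRING} (^|&)css=big($|&)
RewriteRule ^index\.php$ https://www.yoursite.com/index.php? [R=301,L]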
Thanks for giving such good advice on how to prevent duplicate content. The illustrations helped a lot =)
"We worked to individually 301 re-direct all of the print-friendly versions of the content back to the originals and created a CSS option to show the page in printer-friendly format (on the same URL). This resulted in a boost of more than 20% in search engine traffic within 60 days."
In Joomla CMS, you have the option to activate a "printer friendly" icon to print a page from the site.
Will this have the same impact (as what you have outlined above) as if you had created individual web and print versions manually, or would you actually have to manually create a printer-friendly version and do a 301 re-direct to potentially see that kind of traffic increase?
My interest is this - if I could gain a significant improvement in traffic simply by having two copies (original web version and print version), and re-directing the latter to the former, it would seem like this would be a pretty good standard practice since it's not much work for that kind of potential benefit. Is that a fair statement?
Good stuff Rand. Illustrations helped.
Sean - you're basically fixing a mistake in site architecture, not actually benefiting from the two versions. Note the link I pointed to for Omarinho's question below.
Joomla CMS - I really don't know how it operates, but if there are two URLs for the same content, you'll have a problem. If that print-friendly link just uses Javascript to change the CSS or uses the 'print' command in the browser, it should be fine.
Hi Rand,
Methinks I have found a good example of a canonical version of SEOmoz content:
https://www.seomoz.org/blog
https://www.seomoz.org/blog/
Providing no one linked to the latter it would not be a problem; however, I suspect that is not the case.
Could a 301 be on the cards?
Joomla is chock-full of duplicate content issues.
I find it a complete nightmare. Be very careful.
Great post, Rand! The new beginner's guide is going to be one of the most influential and valuable SEO documents ever created, if not the most.
Just make a sticky note and remember to 301 this post over to the entire beginner's guide once it's finished ;)
Great post and advice, Rand. Using complete URLs on internal links lets you detect the referring link in analytics and point out the scrapers.
This stuff is very important for any SEO to understand.
If you put a 301 on the print versions in order to re-direct to the original pages, how will users be able to see the print versions when they need to print a specific page? A CSS option on the same URL? I don't understand. :-o If you can post a link with an example, it would be great.
Omarinho - the reason you need to 301 is to grab that link juice. Then, you can use a modified CSS stylesheet (see this AListApart article) to make the same URL produce two different looking documents - one for print, one for web.
I get it now. Thank you!
The all-CSS technique is by far the best approach, at least for handling print, mobile, or other versions, because the styling is handled at the browser level, not through different pages or parameters that create potential duplicate content.
Even if you block the print pages from spiders, why add the chance that someone might copy and paste those blocked URLs or, depending on how the pages are handled, why maintain multiple versions of pages at all (granted, using a CMS probably won't make this an issue)?
Nice. I didn't know about that ping thing. Will make sure I do that every time I post to my blog. Thanks Rand :)
Both FeedBurner and WordPress can be set up to automatically ping certain sites every time you blog something new.
Oh really? Lemme check. Do I require a plugin for that?
The basic WordPress install normally handles it for you, I believe.
Great post, Rand! I'm going to share this on our team reading list for 1/22 at the EMP blog!
Thanks Rand for the article. I have a question... For blogs which create duplicate content in archives and categories, how can that be handled? Should we 301 redirect the archive and category pages to the main page of the article?
i just robots.txt those URLs out. i think it's pretty unlikely that anybody is externally linking to your blog's archive or category pages, so there's little reason to 301 them to preserve backlinks. using robots.txt will simply tell the spiders to go away and not crawl or index those pages.
but then again, i could be wrong.
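As a sketch of the approach Kimber describes (the paths are hypothetical, WordPress-style archive and category URLs - adjust to whatever your blog software actually generates):

User-agent: *
# keep spiders out of category and date-based archive pages (placeholder paths)
Disallow: /category/
Disallow: /2008/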
Thanks Kimber for your reply, it sounds logical to me. I'm just confused about what Rand tried to explain regarding having different copies of the same text at different URLs, like Rand mentioned about the main page of the article and a page in printer-friendly format. Can then this page in printer-friendly format be blocked by robots.txt also, as a solution to duplicate content?
"Can then this page in printer-friendly format be blocked by robots.txt also, as a solution to duplicate content?"
You could, but remember in Rand's situation both the printer version and regular version had incoming links. Therefore by 301'ing the printer version he sends all that link juice to the non-printer version.
So you could robots.txt-block the printer version; however, that doesn't prevent people from linking to your (no ads) printer version.
Thanks BradleyT, that's perfectly clear now.
If you use the "media" function within CSS, there is no need to have multiple URLs for the screen and print versions of the page.
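For anyone who hasn't used it, a minimal sketch of that technique (the selectors are placeholders); the print styling lives on the same URL as the page, so no separate print-friendly URL is ever created:

/* in your stylesheet: rules applied only when the page is printed */
@media print {
  #nav, #sidebar, .ads { display: none; } /* hide navigation and ads (placeholder selectors) */
  body { background: #fff; color: #000; }
}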
Any links to details on how to use the robots file for this?
robotstxt.org tells you all about it. it's also recommended that you "test" out your robots.txt in Google Webmaster Tools to make sure you aren't blocking important pages by accident.
Thanks Kimber!!!
Using robots.txt does keep the duplicates out of the index, but those URLs still accumulate PageRank (because you are linking to them from within your site) which is now wasted, because it can't be passed on elsewhere within the site.
so you should also add nofollow to the internal links?
I wouldn't totally rule out what people might link to.
Case in point: in reviewing various URLs where a particular URL structure (a heavy, parameter-based version that led to product pages that had much cleaner URLs) was used for the on-site search function, the first thought was that it was no concern, since spiders couldn't fill out and submit the search form . . . but then I was surprised to see that some of those URLs were showing up in the SERPs.
Doing some more digging and hopping down the backlink trail, the issue became clear . . . bloggers and others who wanted to link to the site would often come to the site and, rather than navigate down through to the specific product page, use the on-site search to find it and link to the resulting page . . . which was the ugly URL page, not the nice, clean, spider- and searcher-friendly URL page.
It isn't always possible, especially on large CMS or ecommerce sites, but I'd always recommend eliminating the potential for URL variations whenever possible.
This can be a bit complicated, but I agree content is king. I've found a lot of info regarding this topic lately.
https://www.massmailsoftware.com/blog/ has some good bits on it, and the others I forget :P
Regardless, it is worth making sure content is continually fresh, easy to find, useful, and sitting on a solid foundation.