If there's one issue that causes more contention, heartache and consulting time than any other (at least, recently), it's duplicate content. This scourge of the modern search engine has origins in the fairly benign realm of standard licensing and the occasional act of plagiarism. Over the last five years, however, spammers in desperate need of content began the now much-reviled process of scraping content from legitimate sources, scrambling the words (through many complex processes) and re-purposing the text to appear on their own pages in the hopes of attracting long tail searches and serving contextual ads (and other, various, nefarious purposes).
Thus, today, we're faced with a world of "duplicate content issues" and "duplicate content penalties." Luckily, my trusty illustrated Googlebot and I are here to help eliminate some of the confusion. But, before we get to the pretty pictures, we need some definitions:
- Unique Content - written by humans, completely different from any other combination of letters, symbols or words on the web and clearly not manipulated through computer text-processing algorithms (like those crazy Markov-chain-employing spam tools).
- Snippets - small chunks of content like quotes that are copied and re-used; these are almost never problematic for search engines, especially when included in a larger document with plenty of unique content.
- Duplicate Content Issues - I typically use this when referring to duplicate content that is not in danger of getting a website penalized, but rather, is simply a copy of an existing page that forces the search engines to choose which version to display in the index.
- Duplicate Content Penalty - When I refer to "penalties," I'm specifically talking about things the search engines do that are worse than simply removing a page from the index.
Now, let's look at the process for Google as it finds duplicate content on the web. In the examples below, I'm making a few assumptions:
- The page with text is assumed to be a page containing duplicate content (not just a snippet, despite the illustration).
- Each page of duplicate content is presumed to be on a separate domain.
- The steps below have been simplified to keep the process as clear as possible. This is almost certainly not the exact way Google operates (but it conveys the effect quite nicely).
There are a few additional subjects about duplicate content that bear mentioning. Many of these trip up webmasters new to the dup content issue, and it's sad that the engines themselves have no formal, disclosed guidelines for folks (although I suppose it does give folks like me a day job). I've written these out, as I most often hear them on the phone and see them in the forums:
Code to Text Ratio: What if my code is huge and the unique HTML elements on the page are very few? Will Google think my pages are all duplicates of one another?
Nope. As Vanessa clearly mentioned in our video together from Chicago, Google doesn't give a hoot about your code; they're interested in the content on your page.
Navigation Elements to Unique Content Ratio: Every page on my site has a huge navbar, lots of header and footer items, but only a little bit of content; will Google think these pages are duplicates?
Nope. Google (and Yahoo! and MSN) have been around the block a few times. They're very familiar with the layout of websites and recognize that permanent structures on all (or many) of a site's pages are quite normal. Instead, they'll pay attention to the "unique" portions of each page and often, largely ignore the rest.
Licensed Content: What should I do if I want to avoid dup content problems, but have licensed content from other web sources to show my visitors?
Use meta name="robots" content="noindex, follow" - place this in the page's head section and the search engines will know that the content isn't for them. It's best to do it this way (in my opinion), because then humans can still visit the page, link to it, and the links on the page will still carry value.
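Here's a minimal sketch of what that looks like in practice - the page and title are just placeholders, but the meta tag is the important bit:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Licensed Article (placeholder title)</title>
    <!-- Tell the engines: don't index this page, but do follow its links so they still carry value -->
    <meta name="robots" content="noindex, follow" />
  </head>
  <body>
    <!-- The licensed content goes here; human visitors can still read it and link to it -->
  </body>
</html>
```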
Content Thieves: How should I deal with sites I find that are copying my content?
If the pages of these sites are in the supplemental index or rank far behind your own pages for any relevant queries, my policy is generally to ignore it. If we tried to fight all the copies of SEOmoz content on the web, we'd have at least two 40-hour-per-week jobs on our hands. Luckily, this is the only domain publishing our content that has enough link strength to rank well for it, and the search engines have placed trust in SEOmoz to publish high-quality, relevant, worthy content.
If, on the other hand, you're a relatively new site, or a site with few inbounds and the scrapers are consistently ranking ahead of you (or someone with a powerful site is stealing your work), you've got some recourse. One option is to file a DMCA infringement request with Google, with Yahoo!, and with MSN. The other is to file legal suit (or threaten such) against the website in question. If the site re-publishing your work has an owner in your country, this latter course of action is probably the wisest first step (I always try to be friendly before I send a letter from the attorneys), as the DMCA motions can take months to go into effect.
Percentage of Duplicate Content: What percent of a page has to be duplicate before I run into dup content penalties and issues?
22.45%. No, seriously, the search engines would never reveal this information because it would compromise their ability to prevent the problem. It's also a near-certainty that the percentage at each engine fluctuates regularly and that there's more than simple direct comparison that goes into dup content detection. If you really need the answer to this question, chances are you plan to do something blackhat with it.
Issues vs. Penalties: How do I know if I'm being penalized for having duplicate content, rather than simply having my pages removed from the index (or put in supplemental)?
Penalties require a good bit of abuse to go into effect, but I've seen it happen, even on domains from respectable brands. The penalties really arise when you start copying hundreds or thousands of pages from other domains and don't have a considerable amount of unique content of your own. It's particularly dangerous with new sites or those that have recently changed ownership. However, no matter whether you've got penalties or just find lots of your pages in supplemental hell, I highly recommend fixing the issue as I've described above.
What are your thoughts on dup content issues? Anything I've neglected or confused?
p.s. Googlebot got a nice upgrade courtesy of my improving illustration skills. I was feeling bad for the poor guy, despite the fact that it's 2:15am and I have conference calls starting at 9am tomorrow.
Ecommerce sites are the worst. 98% of the time when I take on a client who has a large catalog with thousands of items for sale, they all have product descriptions that are copied and pasted from the manufacturer.
Just by having all the pages rewritten to say the same thing using different words, they usually see huge increases in rankings.
Leave the copied-and-pasted junk for the exports to shopping sites, I tell them.
Very true. It's incredible how many people are too lazy to re-write simple sentences and descriptions!
Agreed. It's sad, but I still think many people don't see websites as something they need to work on. I think many still see them as something you do once as cheaply as possible and then you make money.
It's interesting, because no one running a brick-and-mortar store would think it smart to stock every room with the same products, but they seem to think it's ok to do that online.
Jane, in some cases I think it is just laziness, but I think it's more a lack of understanding of the web in general, and it goes to a deeper problem than laziness.
The same person who would hire someone to paint their store often doesn't think they should do the same and hire someone to design their website or rewrite their content.
You're spot on about the re-write, Jeremy. I got hired a while back for an ecommerce site with thousands of products. They were twice the size of their closest competitor, but the competitor's site was coming up more often for common keywords. By simply changing the product description copy from the manufacturer's description to something more unique and informative, we overtook our competition in the rankings and our ROI went through the roof over the following months.
Since I left (about two years ago now), they have not kept up and have reverted to the copy-and-paste method from the manufacturer's catalog. As you can imagine, they have subsequently lost rank to several competitors.
I want to address Stever's point about multiple language versions of a single site that wind up using the first language when visitors go deeper into the site.
We just finished a site audit of a site that has English, Korean, Japanese and Chinese "versions". The entire site is in English, while the other versions have homepages as subdirectories of the main site and then logographic navigation with the English content recycled on the second and third levels. This is a recipe for duplicate content nightmares.
The solution was to use the same approach we employ for any site with lots of boilerplate content (shipping information, warranty, return policy, etc.) embedded in the page. We spec'ed the non-English language pages to use iframe calls against the content management system, so the logographic pages pulled the English-language content on the site client-side. The other idea was to use AJAX to prevent the spiders from being able to crawl the English content within the other languages.
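To make that concrete, a stripped-down sketch of one of those logographic pages might look something like this (the CMS URL and dimensions here are made up for illustration, not the client's actual setup):

```html
<html>
  <body>
    <!-- Native-language (logographic) navigation, header and footer live in this page's own HTML -->

    <!-- The recycled English body copy is pulled client-side from the CMS via an iframe,
         so it is a separate document rather than part of this page's source -->
    <iframe src="http://www.example.com/cms/content/article-123?lang=en"
            width="100%" height="600" frameborder="0"></iframe>
  </body>
</html>
```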
Jonah - I love the i-frame approach. Never considered that before; very creative and a good trick to have in the bag. That's my "new thing I learned today."
Nice reply, Jonah.
I was debating discussing frames (although not iframes - good point!) but the implications of the internal value flow around the different domains and implementing the return link made my head hurt and I cut the post back.
(Plus "poor man's cloaking" can still be surprisingly effective in uncompetitive areas, which detailed pages in other languages often are.)
But still, there is nothing better for user value and search engine results than proper translation.
Stever:
Notice I said built into a content management system. This solution stops scaling for manual coding somewhere under 100 pages.
Your post about multiple language versions is very helpful. Thank you.
like the robot!
So if the content is on the same domain, is it not duplicate? Will it not go to supplemental?
Dupe content is dupe content, whether it's on the same domain, a different domain, or different domains owned by the same person... the engines don't want to serve up results that are all the same, as it lowers the quality, potentially sending someone to multiple sites with the exact same information.
You may not be penalized; it's just that the engine will make a choice as to which page to serve up in the main SERPs.
Rand, this sounds like a strong argument against syndication of a blog. For example, if you have a blog that is not very strong yet and make a homepage for it on one of the syndication sites, then that page might bump the homepage for your blog from the results - or outrank it on many queries. This puts your feed on a very strong site that has millions of links and more authority.
Then we go to the "snippets" that appear on the syndication sites. These could contain long tail phrases or keyword combinations that get a little search volume. Having those on a powerful and authoritative site will cut your traffic - simply because they will outrank you in the SERPs on these long tail queries.
There is no "penalty" here and maybe no "filtering" - they simply outrank you. My bet is that in most cases you get more traffic by outranking them on these long tail queries than you will get by being listed within their content.
EGOL, you caught my attention since I have this issue. My blog gets republished on the iEntry network, which includes WebProNews. WPN has more authority than me, so my articles there generally outrank the same articles on my own site.
They're nice enough to change the title, and I think the meta description ends up different as well. But generally they get a lot of long tail traffic that should really fall to me.
In this case I don't mind, since having my content there (with links back to other posts on my site) seems to help build authority in my site, and other pages have gained in long tail traffic. I also get direct traffic through WPN and do get to spread my brand, or at least my name.
But it's always puzzled me why search engines can't figure out where the content originated. It would take a human being all of about 5 seconds to figure it out, and while I know an algorithm isn't a human, I can think of many ways they too could figure it out.
I think the search engines really could do a better job of determining which is the original content. I think at the moment the heavy emphasis is on the authority of each site.
If their site can get your content on the first page but your site can't lift it higher than the third then you might get more traffic to your own site by syndicating.
This will also expand your brand if your articles are done well enough to promote it.
I was guessing that Google bought Feedburner so that they could identify original content sources through the feeds. Maybe that isn't true.
I love GoogleBot. He's the best teaching aid since The Count.
Now we just need SEOmozzilla to combat GoogleBot in the streets of Mountain View!
Would it be shameful to admit that I've spent the better part of my day obsessed with what weapons GoogleBot should have? For the first one, I was thinking "The Supplementalizer"; he fires it at you and all your content goes into the supplemental index.
Excellent. Go on...
More? Hey, I didn't say it was a productive "better part of my day" :)
Oh, oh!! And the "Sandboxer" and the "De-Indexer" - such a great idea...
Every once in a while, GoogleBot could just pull a knife and say "I'll Cutts you!"
Sorry...
BWA-HA-HA!!! That, my friend, was divinely puntastic.
"My spammy sense is tingling."
Name the weapon that totally invalidates a thing's existence by asking "Did you mean...?"
That's one for us folks over in the UK (and anywhere else that likes their vowels) - I'd call it the 'colorizer'.
If you search for 'colour' on google.co.uk, the top results are all about 'color' and while it doesn't actively say 'did you mean color?' it might as well.
[edit: just remembered the search I did the other day where this annoyed me - it was for feng shui colours - since I happen to know someone who wants to rank for that kind of search in the UK and they slap a great big 'did you mean feng shui colors?' on the top.]
That's how it starts... First Google will try to "fix" how you guys over the pond spell, then Google Maps will start giving you distances in miles instead of kilometers.
I wish they'd do it the other way around: start giving US users metric results, it's about damn time we figured it out.
I can't get behind colour or optimisation, but I am fond of spelling theater theatre.
Don't mind imperial measurements (we like our miles and our pounds of vegetables).
Don't even think about taking away our pints of beer though. The EU wanted to standardise on half litres (or standardize on half liters). That's not a good idea.
I'm a little bit confused about this in light of some related comments on this post at SEOBook. He is saying that in order to get out of supplementals it's a good idea to reduce repetitive site-wide (header/footer/sidebar) content:
But you are saying that it doesn't matter. This is a slightly different problem than the one he's talking about, so maybe it just depends on the context?
Edit: I'm enjoying the robot illustrations too! They made me smile today :)
The sentence you've quoted is the only one in that post of Aaron's that I disagree with. I've never seen any evidence that having unique content closer to the top of the code makes any difference at all. The first version of my site used absolutely positioned layers to make the content the first thing in the <body> and it didn't accomplish anything as far as I could tell.
Aaron's a very smart guy and he might have the data to back up that assertion, but in my experience, I haven't seen issues with bloated code, nor heavy header/footer content (although if it was extreme...)
Rand
I have to side with Aaron on this one. While all of the engines attempt to spider around code and Google does the best job of it, you leave a lot to chance if you assume the engine will treat your content the same when it is buried beneath hundreds of links.
I analyzed a PR 7 site that dates from 1996 with 45,000 unique SKUs, each with about 500 words of hand-written description. This site was all but knocked out of the index for duplicate content. They had 400,000 pages in the index when Big Daddy rolled out and less than 1,000 by the time we got involved. They suffered from multiple sins (duplicate title and description tags on most pages, canonical issues stemming from indexed search result pages).
Still, they would not have been hit nearly so hard if it wasn't for the fact that they have a couple of hundred lines of code, most of it global and directory-contextual navigation, between the body tag and the beginning of the unique content.
Aaron's suggestion also contains another bit of wisdom. If you mix up the order of things in the global navigation, and maybe the anchor text, Google is more likely to count those internal links more than once because the footprint has changed.
So what about this bit in Response to SEO Questions by Rand Fishkin:
Still confused :) I have been reading some other stuff questioning the source order thing web designers seem to take for granted (this time from an accessibility perspective).
Thanks for the link to the usability article. This whole issue (in terms of usability and accessibility) has been on my mind a lot lately, though I've yet to come to any conclusions.
With the SEO part of the issue I've seen arguments on both sides and have yet to convince myself of either. Generally when I code the template for a site I do leave the header, footer and sidebar in the same place in the code, mostly because it's so much easier to maintain, and that benefit is worth more than anything I might lose in having the pages be repetitive.
It would be nice, though, to have some more definitive information about the SEO effect.
But the links will only be, at best, a source of traffic, unless you're suggesting that even a non-indexed page can take the benefit of its backlinks and spread it to the internal pages to which it links. And if that's the case, then the presence of the noindex instruction doesn't really make a difference, does it?
Interesting observation and a good point.
However, couldn't we say that these are two different factors in ranking?
Unlike a nofollow added to the link, I would hope that the SEs would still give the links value towards ranking the target site, even though they are told not to index or follow on the page itself.
But this is hypothesis, not an answer on my part. I can also understand why they may not want to give value after all.
My understanding is that they do give value to those links, although it may not be as high as from indexed pages. The test I saw on this was from a couple years ago, but it still passed at least some ranking weight.
You know, I hadn't really thought of this till now, but it makes me wonder. If the search engines do pass link value through non-indexed pages, then couldn't a spammy tactic be to create 1000's of duplicate pages, add the noindex, but still try to pass even a little link juice by allowing the links to be followed?
I suppose that would be pretty easy to detect, but it still makes me wonder.
I would think though if you've added noindex to dup pages then you're not working to build links into those pages. Also if they aren't showing up in SERPs since they're not in the index they wouldn't get so many natural links either.
I too would hope there would be some value in the links out of the page, but I would expect it to be a little less than for indexed pages.
Sometimes, I feel like the GoogleBot is actually kicking me in the supplementals.
Thanks, Rand; great information. I am having issues, though, with Google penalizing for too great a navigation element to unique content ratio, but I think this is severely compounded by duplicate TITLE and/or META tags. I'm struggling with getting out of the supplemental doghouse for intrasite content.
Hi Rand
This is my first comment here.
I have one question: from Google patent we know that Google knows well the date that a document was released. You can find this on History data - Inception Date
And they say: Google may determine how old each of the pages on a given website is and then determine the average age of pages on the website as a whole.
So if Google does know the origin date of a page/article, shouldn't this be a criterion for separating duplicate content from non-duplicate content pages or sites?
I mean... ok, you know the date when a page "was born", but do you use it to find duplicate content?
It's like this: I have a blog that gets syndicated, and people steal content from me, posting it on their blogs. But if the date of the original content is well known by Google, as the patent says, how come PR, IBLs and other factors decide that the well-ranked site with good PR or links is the original source of the content?
Just asking..
Good to see ur first comment!
Pay special attention to the word "may"; it is not a fact, although it's pretty likely.
What it means is that it is possible that Google may give different weight to pages by looking at when Google first found out about a page, meaning when it was spidered. Say you write an excellent, unique blog post today, but because your site is not well linked to, Google will only find/spider that page 10 days from now. If I know your blog and visit it, I can copy your blog post, post it on my highly-linked-to blog, and your content will be spidered by Google tomorrow on my domain. So in the eyes of Google, I had the original blog post.
That is one explanation; another one is that Google doesn't care who has the original content, as long as the content itself is on a highly-linked-to website - see my example somewhere above. I'm sure there will be other explanations as well. Just don't take any words for the absolute truth; investigate on your own and you just might end up with another theory. You have to remember that nobody, and I mean nobody, knows the absolute truth. Not even about search! ;)
<edit>damn my spelling at 1am!</edit>
Excellent points, tbfpa - I'd totally agree. Google knows they can't always trust dates by themselves, so they'll naturally use other metrics, even when the date data conflicts.
Hi Randfish,
I have made a few experiments on duplicate content over the last 3-4 months. I have read every dup content article, like yours: Google penalizes, removes the duplicated one, how to determine whether an article is a dup or not, etc. And I know the rules. Actually, the SEO rules. But my experiments show me different results. For example: search for "12 Deadly Diseases Cured in the 20th Century" and you will find 3 copies of the same content. One of them is from howstuffworks.com, which is the original; one is on symbianize.com - in the first rank; and the other is my experimental one... (I have deleted it, but Google still shows it). I wonder about one thing: howstuffworks has more PR, a better Alexa rank, more links to this page, and was first published years ago, compared to symbianize.com. Then why is it in second place?! Also, why hasn't Google penalized that site in the 6 months since I discovered it?! Very interesting.
Would like to read your answer.
Sincerely,
Vusal Zeynalov
Thanks tbfpa for your fast comment.
So that means that if I only write content, unique content, not for rankings but for public information (as peculiar documentation to inform people), and I do not have rankings, IBLs or PR... that should suggest to Google that the thief is the author... and not me. Why? Because he has good rankings, IBLs or PR. But that's easy to get.
Google is forcing me to find IBLs, to get rankings or PR, to sustain my content so it is not stolen and so I am credited as the author of those articles?
So if I'm not ranking my site or getting good PR or links, I might have my content stolen. Or appear to steal content written by me from other sites that are well ranked (but they do have my content).
But why does Google have that data about my content if it cannot tell the difference between original content (by the time it was created) and stolen content (the page that is on a well-ranked site)?
That makes me a thief, a thief of my own work. I'm stealing my content? Or others do that?
They do not care about the date the content appears (the source or original content) but about the well-ranking/PR/IBL sites that stole that content (sites that worked for PR, rankings or links).
So in my mind it's something like this: if you do not rank well or you do not have IBLs or PR, you should not create content. Because if you do... you must get rankings, IBLs or PR. But I'm not writing content for SEs but for users. And Google cares more about rankings and PR instead of good content?
How about people who do create good content for users but are stolen from? That means I do not have a chance in the SEs because I do not rank well. But a thief can steal my content and in the eyes of Google he is the creator, 'cause he ranks well.
Just asking again.. :)
Hope I do make sense.
Krumel - I wish I could answer you, but sadly, I can't follow your reasoning here and your sentences are very difficult to understand as well. If you're simply pointing out that SEO and content building on the web is becoming more difficult, I'd agree with you, but I'd still say it's a far cry easier than starting a new newspaper, magazine, radio station or other mass media distribution system. Sure, there are things that are unfair or tough, but nearly everyone on this blog has dealt with those issues and managed to emerge successful in the end.
I am closing my existing domain https://webgeekblog.com completely. It's been penalized by Google for reasons I am unable to fathom.
I am planning to move some of its content to my new website. Does this affect my new domain badly in search results?
The present website will no longer be available, except for an index page saying I have moved to the new website.
I have placed a robots.txt to block search engines from crawling, and placed a removal request with Google!
Please let me know whether I can copy a few articles from my old website to the new website.
thank you!
chandra
If my own site has two URLs to the same content - in other words, if x.com/new and x.com/products/tv/123.html are exactly the same page - does this matter?
Over a year later and still relevant.
Great explanation Rand. Do you think that the Search Engines have improved their handling of this type of information since you wrote this?
In the last six months I have seen indications that either the algorithm has become more robust or that individual verticals have been adjusted for duplicate allowances.
My personal take on it at this point is that in some instances (verts) it is still a fairly strong factor - it may even be most. However, in others, it is less likely to get you slapped into supplemental hell, but it won't help you flow any juice or punch up your search position.
Could it also be that, by increasing their ability to handle a higher volume of pages in their index, the engines have just worried less about the issue?
This is, in my opinion, the power of SEOmoz. An index of searchable information from a trusted source. You really have no idea how great this is.
Great post! Clears up a lot of misunderstanding as far as how the search engines deal with duplicate content.
Elaine
DUI Advice
Hi Randfish
Do you know of any efficient tools that can identify duplicate articles within a repository? I want to run it as a background process where new incoming content will be checked for duplicates against, let's say, 100,000 articles available in my repository.
I'm looking for a well-performing and easily integratable tool.
I think the illustrations are the best part! Keep them coming. They break it down so anyone can understand. Maybe my next tattoo will be that sweet google bot drawing...ha!
Interesting post, but neither it nor the comments answered my primary concern.
If I have 3 articles, all unique and written by me, and I publish these to 50 article sites, am I generating 147 pages of duplicate content?
Would this be punished by Google? Or would the 3 articles that are seen as the originals be fine, and just the IBLs from the remaining 147 be counted? Or would they not be counted since it is dupe content?
Hopefully someone may be kind enough to answer...
Thanks in advance!
At last, my question is: is article syndication useful or not?
If useful, then how?
If not, then why?
I found your post really useful. Duplicate content is always an issue that can get you penalized by Google. But no one yet knows exactly how the algorithm works, I guess? Any more info on it?
Hi Friends,
I need help on this. My US-based client is an interior designer and runs his service in multiple places like Michigan, Ohio and Florida.
I want to know whether his 3 websites will be penalised for duplicate content by Google if he has:
1. 3 Websites like InteriorDesignersMichigan.com, InteriorDesignersOhio.com, InteriorDesignersFlorida.com
2. The look and feel, the website template and the content text of all the websites are the same.
3. But the city names are changed in the content text.
For example, all three websites have content like:
1. Design Tech is an interior design company since 1993 for office space in Michigan. We provide our services.....Blah...Blah
2. Design Tech is an interior design company since 1993 for office space in Ohio. We provide our services.....Blah...Blah
3. Design Tech is an interior design company since 1993 for office space in Florida. We provide our services.....Blah...Blah
Please suggest whether all three websites will be penalised for duplicate content.
Thx for the info and illustrations. We are writing some content that targets separate cities in the same state and have manually rewritten the content 3 times and it's around 88-91% unique. We are hoping this works fine and is not flagged.
In fewer words: to fix dupe issues I should add a "noindex, follow" tag to all the pages containing duplicate content. Also, if I replace dupe text with original text in some existing pages (and also change the topic of the page), is that sufficient for Google to recognize the new pages just by submitting them via Webmaster Tools?
[link removed]
This debate is going to rage on I am sure.
What is the risk of using duplicate content in articles to build links? We have been discussing this at our office recently, and we are finding that there doesn't seem to be much in the way of examples or case studies.
Anyone out there have any actual "proof" and not just anecdotes?
Here's a pickle, so to speak. A client has a localized website for a prominent plumbing business in Colorado (ranking well in the local market search) and is expanding down into Texas.
Can we take the website and update all the "Colorado" text to be "Texas" without a duplicate content penalty?
So, everything would be the same except for the instances of location. Time saver or SEO Russian roulette?
And Rand, you generate such great conversation on here. Thanks man.
We own an online store and recently we opened a similar store on eBay with the hope of improving sales through different sales channels - perfectly legitimate. Though we only have a small percentage of our items listed on eBay, is there any probability of being penalized for duplicate content because we use the same product name, description and pictures? We are thinking of opening a similar store elsewhere for the same reason - no different than a brick-and-mortar store opening multiple locations. I should mention that none of the stores directly or indirectly links to the others.
It's been a few years since this post, but there are still dup content questions out there.
Apple, for example, uses the same content for various versions of English sites (America, Canada, Australia). Also, it does not have country-specific domains; it just uses .com, .com/ca, .com/au. Is this dup content bad? Will Google.com rankings be affected because of .com/ca getting ranked in Google.ca?
Any thoughts?
Thanks
PS - Search "apple imac" in Google.ca. It's pretty funny, results for all three countries are returned. So much for Google choosing one and throwing out the rest.
I have researched this more on my own and have come across a great post from Duncan Morris, entitled Why Apple isn't UK enough for Google. Check it out if you're interested.
I have a question on this - yes, I am a total beginner and apologize for my lack of knowledge in advance!!!
I have a website being built. There are three h1 tags and text boxes at the bottom of the index page. I just took a look at the new pages being designed, and the graphic designer has put the same three boxes with the same titles at the bottom of every single page. It looks great and the flow of the website is nice, but will this give me too much duplicate content?
Thanks for any help/advice thrown my way.
Similar to a poster below, we have a website that sells about 200 all-natural soap products (soaphope.com) and we also sell the exact same portfolio of products on eBay. We use similar descriptions of our products in both channels. Is Google smart enough to know that this is not spam and not penalize the content as duplicative?
Also - eBay page headers are generated by them, not the seller - so if someone wanted to tell Google to ignore their eBay pages, is there even a way to do that?
Very interesting post. GoogleBot is awesome!
In my experience the single element that is of importance to get out of supplemental hell due to duplicate content is the quality of backlinks.
I have several hotel sites that use the content provided by the mother company. In fact, I use CNAME records to make it look like the content is on a subdomain, but the content is 100% the same as that found on hundreds of other websites.
At first those sites suffered a lot from supp hell, but once I received some quality backlinks (maybe just one...), most of those pages came out of supp hell and went into the main index. And I must add again that the content is 100% the same as on all the other sites.
This is evidence to me that Google doesn't determine an original at all, they just include all pages which have enough quality backlinks.
You can take a look here https://www.google.com/search?q=%22The+171-room+Banff+Rocky+Mountain+Resort+%22is+nestled+at+the+base+of+Rundle+and+Cascade+Mountains+in+Banff+National+Park,+just+four+kilometers+from&num=100&hl=en&filter=0
Grrr, because the above line is too long and therefore doesn't show all of it, do a search for "The 171-room Banff Rocky Mountain Resort is nestled at the base of Rundle"
Pay special attention to the results with subdomains; they are 100% exactly the same. This example doesn't show hundreds of exactly-the-same pages, but it does show a dozen of them, and all of them are in the main index.
tbfpa - you can use the WYSIWYG editor to make long links into anchor text :)
I've seen this as well.
One thing that has worried me in the past related to this is publishing full blog feeds vs. snippets. If you use a full-feed format, you essentially give anyone the ability to duplicate your entire site. In the event that a more authoritative site does it, you could potentially lose your blog into the supp index. Luckily, every instance I have actually seen of this so far was from worthless scraper sites, but looking at highly competitive business topics, I could see this becoming a potential problem.
As far as the robot goes, I think you should get rid of that flower thing, and incorporate it into your logo.
Yeah, the post was okay, but more importantly you've quoted the best line from the best book ever.
(Okay, so the post was really good and I've been really impressed with the post quality lately. I mean, you guys always have good posts, but this feels like an SEOmoz Renaissance!)
Did you like my Scott & Zelda references too? I know a lot of people who don't get Gatsby, but it always spoke to me.
Thanks for the reminder.... I forgot to mention, along with a great post, any post that can bring F. Scott into it is tops in my books... Matt needs to figure out a way to give posts a "two thumbs up."
Gatsby is great, but This Side of Paradise was his absolute greatest work and, personally, my all-time favorite. They even did a movie version of Gatsby, but I've always thought that Paradise, if done right, could make an incredible movie as well.
I read way too fast and completely missed the obvious reference. It's been a while since I've read Gatsby though, even if you did point it out with fscott.com. Here's a reference about F. Scott, and you can have fun trying to discover where it's from.
"You've been through all of F. Scott Fitzgerald's books. You're very well read, it's well known."
For a hint check my profile. The author of the reference is there. If you're familiar with the author you'll recognize where the reference is from easily.
By the way Rand Googlebot is looking much better this time around.
I'm lovin the use of images lately Rand, really spicin it up. Is someone getting content ready for his book??
22.45%? Everyone knows it's 19.67% ;)
Nice Googlebot makeover, he's totally styling. You'll of course have to dress him up for St. Pat's now!
This was a great recap of something that is definitely still a challenge to explain. I think it is important to address the issue of dupe content on your own site as well as across multiple sites, both because of the misconception that two sites must be better than one and because of the dupe content issues across sites.
Hi Rand - another fantastic post! The plagiarism issue is certainly a frustrating one. I was amazed to find one of my recent blog posts appearing on around 15 other sites within 24 hours of posting. Everything had been scraped and chopped up into a load of gobbledegook mush. Of course, the pages had Adsense all over them. I don't have to worry about the pages ranking higher than mine however it's bloody annoying. I think a polite email may be in order...
Polite as in dropping a friendly DMCA request in their inbox?
You knows it! ;)
Hi Rand,
Good post :) but I always thought Google also took domain age into account when determining dominance vs. dup content
Great post.
GoogleBot rocks!
Duplicate content problems are hard to explain to clients. But it's even harder to make them understand that having multiple domains that are exactly the same website is a bad thing.
What do you recommend for multiple language sites? These sometimes display the same content but start off with different homepages (e.g. .com English, .pt Portuguese...). Should there be a one-language-one-domain principle, or 301s?
Hi Carfeu,
I was having the same question. I've got some sites in Dutch on a .be and a .nl domain. People from the Netherlands dislike Belgian sites and vice versa, so duplicate content is an easy solution for me. We don't have the people or the time to rewrite content for 2 sites.
With you on the GoogleBot, seomoz should go official with the GoogleBot character as our mascot.
Carfeu, I'm not rand, but I deal with this quite frequently.
Firstly, what you describe is pretty usual, but the sites do not often have duplicate content on the same domains. Often there will be a "starter" set of pages on the country domain in the native language (which do not then run into the duplicate problem or the "search pages from..." problem). Then the sites will link into the main (usually English-language pages) section for more detailed information where the site owner has not wanted to translate.
(Note that here the different languages are eligible for different directory listings and links from other sites, so also are feeding external whateveryouwanttocallitthesedays back into the main domain. This is also a minor but important point against Rand's recent theory of "keep it all on one site".)
Secondly, DE raises an issue against what I said above. This is an interesting conundrum where a particular country is important and it is worth hitting duplicate content issues. Here the duplicate is worth it because it will attract all the people who only search with the country restriction on and who otherwise would not see the pages on the "foreign" domain. So here it may well be worth having the duplicate content on both a .de and an .at. I have some figures somewhere - from an old WMW thread, I think - where proportions came out at ca. 15-20% in Europe who do that.
The language needs to be clarified somehow to fit what are two very different penalties (in my opinion)...
- Cross-Site Duplicate Content: i.e., I am syndicating an article I wrote to 100 sites, all of which are older, better, more visited and more popular than my site.
- Same-Site Duplicate Content: i.e., I forgot to redirect my non-www to www (a quick sketch of that fix follows below).
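For the same-site flavor, the usual fix is a blanket 301. A minimal sketch, assuming Apache with mod_rewrite and "example.com" standing in for the real domain:

```apache
RewriteEngine On
# Send every non-www request to the www version with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```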
Russ
Thanks for the informative post Rand. I write content for several plant blogs every day, and well...run into several of these problems every day! It's hard enough to come up with original content after your posts get into a higher range, let alone keep track of who's stealing it (and trust me I know people are stealing it). But I certainly can't say that I'm not a little guilty of re-arranging a few words in a post from an abandoned blog to generate traffic to mine (just a little guilty though) ;)
Hi Rand
Really good post, and I love your illustration skills!
Question, that may be silly.
If the website with the dup content gets rid of it, and Google comes back, crawls, and finds no dup content, how does the website try to get its "trust" back from the SEs - is it possible?
I'd say it's been a well-spent-staying-up-to-2:15-a.m. time. Thank you for the article.
How about duplicate content residing on the same domain but with multiple URLs? Let's say an article in English is being linked from the same website using 6 different URLs. This is not ideal, of course, but how bad is this?
Also, if Google is smart enough to determine page headers, footers, and other redundant parts, is it smart enough to determine if a site has 2 versions simply because one caters to a different country/territory, but has almost 90% of the same content/pages?
Thanks again.
Great Content Rand! I know I am late getting the post.
My only question is: can you receive a penalty for having duplicate content within your own site? Or are these pages simply added to the supplementals?
My example would be: https://www.usalarm.com/home-security/
Every City & State page has duplicate content.
Thanks!
I absolutely love Googlebot as you interpret him. Makes him look so cute! I also love the article. So easy to duplicate without realizing it.
I am currently trying to see if I can take advantage of blogs scraping my content. What I do is simple: when I post something, I make sure that I have several internal links to pages of my site that are hard to get external links for. When the content gets syndicated by splogs, the post gets changed slightly, which avoids any big dup content issues, and at the same time I build deep links for my site. Obviously the links I get aren't of the highest quality, but they are within content and from a relevant-theme page.
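A trimmed-down example of what I mean (the URLs are made up; the point is just that the in-post links travel with the content when it gets scraped):

```html
<p>
  As I mentioned in my
  <a href="http://www.example.com/guides/deep-page-that-needs-links/">earlier guide</a>,
  ... the rest of the post text continues here ...
</p>
<!-- When a splog scrapes this post wholesale, the copy keeps this link,
     giving the deep page another in-content, on-theme link -->
```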
At PubCon Vegas I had a fairly long chat with Adam Lasnik on the topic of dup content.
He explained to me that Google has a few levels of duplicate content.
He was very adamant that it is indeed better to be in the supp index than out of the index entirely (which I definitely agree with). His claim was that being in the supplemental index just means that you need to add a little bit of value to your page in order to get out - be that in trusted backlinks, a customer review of a product, a more unique description, etc. Come up with a way to add value to the end user experience and Google will give you love.
I agree with everyone... another great post... awesome content here recently.
Rand, I like the charts and illustrations you have been adding to your posts. They help reinforce the main topic of the article.
Google bot is cool... will Yahoo Slurp make an appearance? Maybe we can get a bot battle going between the two. :)
Suddenly picturing a Jabba the Hutt type creature.
Let the battles begin!
Maybe instead of one big bot Slurp should be a bunch of little ones?
Like an ant colony all working towards the same goal...
I'm not sure they are that organized.
He explains it so even I can understand it.
Thanks Rand.
I benefit from all my duplicate content
How many blogs recently discussed Feedburner now providing Google Reader stats?
Today I have also seen proof that links are becoming less and less relevant.