Scripting SEO: 5 Panda-Fighting Tricks for Large Sites

Comments 73

AjayYadav_InboundMarketer

2011-12-28T05:26:08-08:00

Thanks, Corey, for the idea of using the Levenshtein distance to fix duplicate content problems. I applied the solutions above to my site, which runs on PHP, but I have another site that runs on ASP. What would be a smart way to solve duplicate content problems for sites running on IIS?

Thanks!!

AjayYadav_InboundMarketer edited 2011-12-28T05:26:26-08:00
- Corey Northcutt
 
 2011-12-28T07:40:51-08:00
 
 Sure Ajay, the tool will work just as well in other languages. Here's an ASP Levenshtein distance function that Google shot back at me:
 
 https://snipplr.com/view/9094/levenshtein-distance/
 
 I haven't tested this one yet, so you may want to experiment with a few others (as noted above, PHP's default levenshtein() does not give the best results). Worst case, PHP runs surprisingly well on IIS too, or you can even create a database stored procedure (note the Levenshtein function and the second "helper" function that produces a 0-10 ratio), though that option is much harder on your system resources:
 
 https://www.artfulsoftware.com/infotree/queries.php#552
 
 - Corey Northcutt
 
 2011-12-28T07:56:31-08:00
 
 And here's another gorgeous-looking option that uses C# (in case that was ASP.NET :) )
 
 https://www.java2s.com/Open-Source/ASP.NET/Validation/nvigorate/Nvigorate/Common/Levenshtein.cs.htm
 
 - AjayYadav_InboundMarketer
 
 2011-12-28T19:48:42-08:00
 
 Thanks, Corey, for the immediate and helpful response. Thumbs up for you! :)
 
- mark2012
 
 2012-01-25T08:09:34-08:00
 
 IIS7 has some great features to deal with this: IIS extensions for URL rewriting, plus lots more.
 
EGOL

2011-12-30T11:24:00-08:00

Wow!

This was a great post. Lots of tips and implementation details.

Nice! Thanks!

Giuseppe Pastore

2011-12-29T13:39:44-08:00

When I read this kind of article, I always regret not having a developer background. This is the kind of article I'll never be able to write, but knowing there are such generous people in the SEOmoz community who share their knowledge makes me feel less disappointed.

Thanks, Corey, really a great post, even for a non-code geek like me.

- hyderali_
 
 2011-12-29T23:31:42-08:00
 
 Like you, I too regret not having a developer background :(
 
 Yes, we can't write such articles, but I'm very thankful to Corey and others who post about coding on SEOmoz. We can discuss these posts with our internal developers, who can sort out the issues on our site.
 
 Thanks Corey :)
 
 - Corey Northcutt
 
 2011-12-30T07:26:48-08:00
 
 Glad to help, guys! I had read so many articles on analysis that something tackling a few big implementation issues seemed overdue. :)
 
- Holly Wade
 
 2012-01-05T17:19:39-08:00
 
 I agree. Sometimes I feel like I need to not only be a writer and SEO but also a designer, developer, etc. With all the changes in the world of Google and Social Networking it seems that more and more is expected of SEOs all the time.
 
Dave Colgate

2011-12-28T03:20:11-08:00

Really interesting to get such a technical perspective sometimes. These little tidbits add to our arsenal as SEOs, and I can see them helping us get stuff done more efficiently. Keep them coming! Great post!

Steven Mapes

2011-12-31T05:39:52-08:00

As someone who is more technical developer than SEO, it's great to read a more technical post on here. There are lots of things that you can do at a technical level to improve SEO, from canonicalisation to schema.org implementations, etc. I might have to write some articles on it myself in the new year.

- Corey Northcutt
 
 2011-12-31T08:28:54-08:00
 
 Will look forward to reading them!
 
SanketPatel

2011-12-27T11:48:04-08:00

Really nice article from a programmer's point of view. I have just forwarded this blog post link to my PHP developer; he might be interested in it.

- Corey Northcutt
 
 2011-12-27T15:47:41-08:00
 
 Glad you liked it!
 
tstolber1

2012-01-01T10:18:01-08:00

Great post, a lot of practical advice backed up with code and detailed explanations! I'm sure you have probably done this already - how about a script that does all this in one shot - maybe you need to specify a few parameters or have a config file to go with it, but you could have quite an effective - fixmydatabase.php type script there.

- Corey Northcutt
 
 2012-01-01T14:09:42-08:00
 
 It's a good idea. I do have some maintenance scripts that use each of these methods (though #1 and #2 are intensive enough that they seem to need to be run on their own). The big issue is that everyone's situation is a bit different, so the ideal application of this stuff could be as well.
 
- tstolber1
 
 2012-01-02T18:13:33-08:00
 
 Yep, for sure it would be difficult to make something that worked across the board, but maybe some generic checking script or parameterised function that would allow easy customization or integration.
 
 Maybe someone that is really proficient at PHP could even create a class that allowed API type access.
 
 Just throwing it out there; I like PHP and automation. I am using automation in my SEO process more and more. Having historically been quite against it, I am now seeing some areas where it makes sense.
 
Pascal Landau

2011-12-31T09:53:59-08:00

Levenshtein is IMHO not meant to be an algorithm for detecting duplicates. I would rather look into shingles. There's a fantastic paper by Andrei Z. Broder covering that topic: https://clair.si.umich.edu/si767/papers/Week03/Similarity/nearduplicate_broder.pdf

Though not as easy as using Levenshtein (there's no built-in function AFAIK), it is much more accurate.

- Corey Northcutt
 
 2011-12-31T10:50:39-08:00
 
 This looks really interesting, thanks for the share. While Levenshtein worked pretty much perfectly when I needed it, it's not without limitations (e.g., in a scenario where we might want to look for smaller strings of duplicate content within much larger bodies of text). Have you implemented shingles for this purpose in the past?
 
 1 0
 
 This looks really interesting, thanks for the share. While Levenshtein worked pretty much perfectly when I needed it, it's not without limitations (ie. in a scenario where we might want to look for smaller strings of duplicate content within much larger bodies of text). Have you implemented Shingles for this purpose in the past?
 Cancel
 - Pascal Landau
 
 2012-01-01T07:32:19-08:00
 
 I'm using it to measure the degree of uniqueness for the articles generated by my article spinner (e.g. re-generate an article if it's too close to one that was generated beforehand). The algorithm itself is pretty straightforward:
 
 1. normalize the text (lowercase, remove special chars, etc. - see https://www.miislita.com/information-retrieval-tutorial/indexing.html for some further ideas ;))
 
 2. take all 3-word-shingles
 
 3. get a fingerprint of each shingle (I'm using rabin https://en.wikipedia.org/wiki/Rabin_fingerprint )
 
 4. pairwise comparison of each article you want to check
 
 Levenshtein fails pretty hard at near-duplicates (real-world example: a huge portion of quoted text but only a small fraction of unique text on the same page). I haven't used shingles for duplicate detection on "the same website" so far, but I have some ideas for a neat little application that can do something like this :)
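 
 The four steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, not the commenter's actual implementation. CRC32 stands in for the Rabin fingerprint, the set-overlap (Jaccard) score is one common way to do the pairwise comparison, and the names `shingle_fingerprints`, `resemblance`, `near_duplicates` and the 0.5 threshold are all hypothetical.
 
```python
import re
import zlib
from itertools import combinations


def shingle_fingerprints(text, size=3):
    """Return the set of fingerprints for every `size`-word shingle in text."""
    # Step 1: normalize (lowercase, replace anything but letters/digits/spaces).
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    # Step 2: slide a window of `size` consecutive words across the text.
    shingles = (" ".join(words[i:i + size]) for i in range(len(words) - size + 1))
    # Step 3: fingerprint each shingle (CRC32 here, an assumption replacing Rabin).
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}


def resemblance(text_a, text_b, size=3):
    """Share of shingles the two texts have in common: 0.0 = none, 1.0 = all."""
    a = shingle_fingerprints(text_a, size)
    b = shingle_fingerprints(text_b, size)
    if not a and not b:
        # Both texts are shorter than one shingle, so there is nothing to compare.
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(articles, threshold=0.5):
    """Step 4: compare every pair and keep those at or above the threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(articles), 2):
        score = resemblance(a, b)
        if score >= threshold:
            matches.append((i, j, score))
    return matches
```
 
 Because the score depends on shared word sequences rather than on total edits, a short unique passage wrapped around a long quoted block still scores as highly similar to the quoted source. The threshold would need tuning against real articles.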
 
 Hirnhamster edited 2012-01-01T07:32:56-08:00
 - Corey Northcutt
 
 2012-01-01T11:24:33-08:00
 
 Looks simple and sound enough. Would love to see that application if you get to creating it.
 
 - scr4lyfseo
 
 2012-01-17T01:46:27-08:00
 
 Hey Guys,
 
 Great info! I have a question that I wanted to ask before I say something stupid ;)....
 
 1) What exactly is considered a 3-word-shingle and is it worth investing the time to learn if I just want to compare two articles for uniqueness?
 
 I could not really figure out how Corey is using Levenshtein. I assume you are passing in two full articles and using the distance returned for them as a decision point on uniqueness. That seems extremely simple! Assuming this is the case, on a 500-word article, what would you assume to be unique enough to get around Panda ;)?
 
 I have been building a very advanced article spinner for years. I can put articles into Copyscape and I never go above 10%. I am talking about spinning 1000+ well-written articles with less than 10% similarity according to Copyscape. The problem is that I am going to go broke doing article comparisons to determine when I need to re-spin ;).....
 
 I would love to share ideas and strategies at any time. Feel free to contact me directly or to go through the forum. Either way works for me. I really enjoy this stuff. It is like a challenge from Google! So far, everything has worked great for me but I have to kill this copyscape cost for checking duplicate content. So, any open source or free comparison checks are obviously well received over here....
 
 One comment: I noticed that no one seems to be doing any grammar checks. Is there a reason that you are skipping grammar? Unless this is achieved through the spell checker above...
 
 At the last Pubcon I attended in Vegas, there was a clear message that Google hired 1000 human reviewers to sniff out garbage. While article spinners can sometimes be very good, a good human reviewer can probably spot their output. I realize this is fairly far-fetched and probably worth the risk; I am just curious what you think.
 
 For some interesting info.... I have a flow on my site that sells products. It simply takes a feed from Commission Junction and dumps out the products. The website itself was so strong that I ranked for many products and was making some nice money. One day, I lost 5K visitors. BOOM. GONE! I had no idea what happened. I looked at everything on my site (a national directory of vendors) and I could not find any dip in my traffic. Finally, I found that the specific product pages where I show the products from the feed had killed me. All traffic from those pages DISAPPEARED overnight. I realize this is probably a Panda update. I am very excited because now I get to use my article spinner to see how good it is ;). If I can get my traffic back, I am probably in business for a few years until they come up with something else....
 
 Would love to hear your insight and feedback into this.
 
 I hope I did not ask too much. I am really into this stuff and the challenge of outsmarting google seems to be an addiction now. I hope my questions don't make me sound stupid. I know more than meets the eye. I just have never heard of shingles other than a disease that someone gets!
 
 Thanks for your post. Amazing info!
 
 
 Corey Northcutt
 
 2012-01-17T13:49:09-08:00
 
 Hey thanks.
 
 I'll leave the shingles question to Hirnhamster as I haven't implemented it.
 
 Regarding Levenshtein, you're basically right-on. No doubt you can spin an article into having a higher Levenshtein ratio than my threshold (a pretty conservative 10%; even at 20% it was pretty dead-on, but since I was also automating deletions, I didn't want to push my luck). It gets thrown off further when one of the two pieces of content being compared is significantly longer or shorter than the other (handling that appears to be the greatest advantage of shingles). It's worked great for me, but definitely may not for a number of scenarios. In all, however, I bet you could come up with more accurate similarity metrics using Levenshtein than with tools like Copyscape.
 
 Regarding grammar checking, you're definitely right; it's important and well worth consideration. I have a few more functions beyond the contraction fix above. This was originally going to be a series, but I backed off a bit when I got to writing and thought I'd just put some of my best utilities forward. The best approach that I have found is to just think about one grammar-related fix at a time. So far as I'm aware, there is no all-in-one open source grammar correction utility out there (though if there is, someone please chime in!).
 
 Regarding your situation in general, I have read stories on the black hat forums that a number of autoblogs/spinners are still absolutely thriving after all the Panda updates, in spite of the punishment dealt to a number of more white hat sites (maybe Google should have set one person loose to download the dozen or so mainstream article spinners and reverse-engineer them instead :) ). Sooner or later, you'd think that they'd have to catch on. Black hat/spinning still strikes me as a very short-term "take the money and run" art form.
 
 CoreyNorthcutt edited 2012-01-17T13:54:29-08:00
 
 scr4lyfseo
 
 2012-01-18T12:58:01-08:00
 
 Corey,
 
 What did you mean here:
 
 "(pretty conservative 10%... even at 20% it was pretty dead-on, but since I was also automating deletions, I didn't want to push my luck)."
 
 Are you referring to copyscape results?
 
 I am mostly trying to understand distances and where you can assume two articles are unique. I ran two 500-word articles through Levenshtein and they returned 1450 changes required to make them unique. I realize that the smaller the number the better, but is there a threshold that you look for in terms of changes required to make two things equal?
 
 Unless you are implying the following:
 
 Assume a 500-word article.
 
 10% would be 50. If Levenshtein suggests 50 or fewer changes to make the two articles equal, they are too similar?
 
 Just looking to understand how to utilize the values.
 
 I don't like to think of my article spinner as black hat. I consider it a very sophisticated bulk writer ;).
 
 Thanks
 
 
 Corey Northcutt
 
 2012-01-18T14:17:33-08:00
 
 EDIT: I somehow wrote out a big response and got only a quotation? Noooo!
 
 "Are you referring to copyscape results?"
 
 With that bit, I was actually referring to my own trials. I don't use Copyscape. It seems like more of a marketing tool for article spinners than anything; I've never seen a spun article fail to pass it.
 
 "I don't like to think of my article spinner as black hat. I consider it a very sophisticated bulk writer ;)."
 
 Hey whatever works for you. I actually spend about as much time reading the black hat boards as I do snowy white hat resources like SEOMoz, SEL, and SEW; lots of fascinating things people are experimenting with that most white hats would never dare attempt. I just gravitate towards white hat in professional practice. :)
 
 CoreyNorthcutt edited 2012-01-18T14:22:59-08:00
 
 scr4lyfseo
 
 2012-01-18T16:51:25-08:00
 
 Any chance to get an idea on this?
 
 I am mostly trying to understand distances and where you can assume two articles are unique. I ran two 500-word articles through Levenshtein and they returned 1450 changes required to make them unique. I realize that the smaller the number the better, but is there a threshold that you look for in terms of changes required to make two things equal?
 
 I have looked everywhere and cannot understand at what point I can consider two articles to be unique given the results from Levenshtein!
 
 Can you help?
 
 
 Corey Northcutt
 
 2012-01-19T13:54:46-08:00
 
 That seems wrong. Are you using the PHP levenshteinDistance2()? That should never be greater than 100. Some functions give you a raw number that then needs to be converted to a ratio/percentage by a second function, like the MySQL example. I would just work off of those converted numbers.
 
 Truly unique articles actually haven't seemed to return greater than 50 / .5 for me, but I keep my threshold kind of conservative because smaller data sets seem to throw it off (i.e., the few that might slip by with just 5 words).
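 
 To make the raw-count versus percentage distinction concrete, here is a minimal sketch in Python rather than PHP. It is not the levenshteinDistance2() from the post, which is not shown in this thread. Dividing by the longer string's length is only one common normalization, and `difference_percent` is a hypothetical name.
 
```python
def levenshtein(a, b):
    """Raw edit distance: each insertion, deletion or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def difference_percent(a, b):
    """Edit distance scaled to 0-100 by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # raw count of character edits
print(round(difference_percent("kitten", "sitting"), 1))
```
 
 A raw count such as 1450 says little until it is compared with the length of the text, since a character-level function counts edits per character, not per word. Scaling it this way puts every pair on the same 0-100 footing, so a single threshold can be applied regardless of article length.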
 
TechieGeek

2011-12-31T10:55:37-08:00

Does anyone know of any WordPress plugin that can spell check all previous posts? I tried looking for one, but I can only find plugins that proofread your post before you publish it.

- Corey Northcutt
 
 2012-01-01T14:34:53-08:00
 
 I'm not aware of one (this was all made to work with a custom application), but provided that a good one doesn't exist, a killer tool could be polished in about an hour:
 
 i.) Select all articles from the database
 
 ii.) Trim out punctuation and use explode() to break out the individual words with a space character as the delimiter
 
 iii.) Check using the pspell function and echo what didn't return exact matches
 
 Done.
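 
 The three steps can be sketched like this, in Python rather than PHP. It is an illustration under stated assumptions, not the tool described above: a plain set of known words stands in for the pspell dictionary, a dict of id-to-text stands in for the database query, and `misspelled_words` and `audit` are hypothetical names.
 
```python
import re


def misspelled_words(article, dictionary):
    """Return the words not found in the dictionary, in order, without repeats."""
    # Step ii: replace punctuation (apostrophes are kept) and split on whitespace.
    cleaned = re.sub(r"[^\w\s']", " ", article)
    unknown = []
    for word in cleaned.split():
        key = word.lower()
        # Step iii: anything the dictionary does not recognize gets reported.
        if key not in dictionary and key not in unknown:
            unknown.append(key)
    return unknown


def audit(articles, dictionary):
    """Map each article id to its unrecognized words, skipping clean articles."""
    report = {}
    # Step i: `articles` plays the role of the rows selected from the database.
    for article_id, body in articles.items():
        unknown = misspelled_words(body, dictionary)
        if unknown:
            report[article_id] = unknown
    return report


if __name__ == "__main__":
    known = {"the", "cat", "sat", "on", "mat"}  # stand-in dictionary (assumption)
    posts = {1: "The cat sat on the mat.", 2: "Teh cat satt on the mat!"}
    for post_id, words in audit(posts, known).items():
        print(post_id, ", ".join(words))
```
 
 Reporting rather than auto-correcting keeps a human in the loop, which matters because brand names and jargon will show up as unknown words.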
 
 - Andrew Gloyns
 
 2012-01-11T00:40:26-08:00
 
 Corey - this is ace! I passed your above comment to our Dev and he's fixed tonnes of typos and spelling errors on our site, Cheers :)
 
 - Corey Northcutt
 
 2012-01-11T11:15:46-08:00
 
 Awesome!
 
Dwijayas

2011-12-31T06:00:54-08:00

First, congratulations on your promotion. Thank you for the detail in your article. Some of your tips apply to me, but the problem is that I run my blog on Blogger, so for number 3 I need more guidance from you. Hope you can help me, thanks.

- Corey Northcutt
 
 2011-12-31T08:22:17-08:00
 
 Thanks. Blogger should take care of #3 for you (if you 'View Source' on one of your pages and CTRL+F for 'canonical' you can make sure).
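 
 The manual 'View Source' check can also be scripted. Below is a minimal Python sketch, assuming you already have the page's HTML as a string; the class name `CanonicalFinder`, the helper `find_canonicals`, and the sample URL are all made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, so test membership.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html):
    """Return a list of canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Hypothetical page source; in practice this would be fetched from your blog.
sample = (
    '<html><head><link rel="canonical" '
    'href="https://example.blogspot.com/2011/12/post.html"></head></html>'
)
print(find_canonicals(sample))
```

 An empty list means the page declares no canonical URL; more than one entry is also worth a look, since conflicting canonicals are ambiguous.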
 
 CoreyNorthcutt edited 2011-12-31T08:27:20-08:00
 2 0
 
MoreWebsiteTraffic

2011-12-30T07:25:25-08:00

I actually didn't understand some of the technical things in here but looking at the facts they do make sense. And I am sending the link to my programmer in a short while.

2 0

howtoclaimwhiplash

2012-01-01T16:46:38-08:00

I didn't know that was possible, Levenshtein Distance sounds like a clever trick to clean up duplicate content reasonably quickly.

2 0

Joe Robison

2012-01-02T08:56:58-08:00

This stuff looks cool, definitely some good scripts to add to the arsenal.

2 0

usef4u

2012-01-07T00:39:44-08:00

Thanks for the post, already liked it. It can save valuable time. Is there a .NET version of the spell check script?

2 0

- Corey Northcutt
 
 2012-01-07T10:03:18-08:00
 
 Glad you enjoyed it Usef. Google returned this .NET variation for me:
 
 https://aspell-net.sourceforge.net/
 
 As in my response to Ajay above re: Levenshtein functions, I haven't actually tested aspell.net. It's worth a shot though, and worst case scenario, PHP does run quite well on IIS platforms. It could still be used with pspell for the sole purpose of auditing your database.
 
 1 0
 
simmy

2012-01-09T01:31:51-08:00

Hi Rand,

Have been reading your posts on a regular basis. They are really very helpful.

I have one problem that I haven't been able to solve. My own automotive websites dropped when Google Panda was updated.

I would be very thankful if you could help me get those sites back into the search engines.

Waiting for your response.

2 0

- Corey Northcutt
 
 2012-01-09T12:09:05-08:00
 
 Actually Rand didn't write this one. I hope you don't mind my commenting instead. :)
 
 Unfortunately, search engines are complex enough that there is no single instant piece of advice that I think anyone can give you. But depending on when your site dropped (i.e. if it was with Panda, and which iteration of Panda), you can learn a lot.
 
 Panda on the whole is geared towards on-page site quality (content/code you control). Though keep in mind that it's likely affecting Google's perception of the quality of sites that link to you as well, so in that sense, your off-page SEO could somewhat be fair game as well. Cleaning up duplicate content and overall quality, and nipping out content that's too "thin" and brief has been big. There's been a lot of discussion of other factors as well, everything from bounce rates to grammar (Office 95 could identify good English, you can be sure that Google in 2012 can too).
 
 With Panda 2.5 especially, sites that lacked rich media seemed to lose out, based on my reading of the "panda 2.5 biggest winners and losers" articles that were going around in October and the trends that I saw across our clients' sites (although that's just my interpretation). Your best bet is to evaluate your site against every theory that's white hat, supported by solid evidence, and most importantly, written by someone who seems to have a clue. There are also some nice tools out there to help you along with the basics if you're new at this (SEOmoz's toolset could definitely be a nice start).
 
 CoreyNorthcutt edited 2012-01-09T12:18:41-08:00
 1 0
 
Woj Kwasi

2012-01-11T20:00:59-08:00

Like the tip on Levenshtein distance analysis - and not just because I like saying the name Levenshtein ;)

2 0

Holly Wade

2012-01-05T17:21:04-08:00

Thanks for the article. I don't understand all of it but I appreciate the breakdown and the solutions. I hadn't thought about some of these concerns in the past. Thanks for your help.

2 0

Ram Babu

2012-01-05T03:28:55-08:00

Great, so we have to be confident in our content: what Google wants from us is unique, well-written content that users love. Duplicate content matters now because Panda runs alongside Google to filter out all this waste and keep search quality high!

2 0

Sean McGown

2011-12-30T06:11:33-08:00

Corey, would it be possible to replace the faux smart quotes with their HTML character codes (e.g., &#8220;) instead?

2 0

- Corey Northcutt
 
 2011-12-30T07:23:04-08:00
 
 Sure, the HTML character code could be a slightly better option (although I personally prefer the non-tilted version). What I really try to avoid is situations where the raw character is converted on the page, and you just see those "square" icons (which I'm sure most webmasters have undoubtedly seen before), or □. I've also seen them rendered in a wide variety of other weird ways in different scenarios. Depending on your database and page character settings, you may be immune from this particular issue, but I like to trim them out all the same for a variety of reasons.
 
 I've also heard a few SEOs suggest using real quotation marks anytime that you quote in order to avoid possible duplicate content issues. Logically, the idea seems like a pretty sound way to identify a 3rd party, and could even fit in the "if I were Google" column, but I don't know if that one is confirmed beyond a theory. So in this way, it could just be one of my unconfirmed SEO superstitions for the time being; I still personally prefer to keep my sites this way.
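 
 The quote clean-up discussed in this reply can be sketched like this. It is a minimal Python illustration, not the script from the post: the mapping covers only the four common curly quote characters, and the name `straighten_quotes` is made up for the example.

```python
# Map the four common "smart" quote characters to their plain ASCII forms.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(_TABLE)


print(straighten_quotes("\u201cPanda\u201d isn\u2019t going away"))
```

 If you would rather emit HTML character codes, as suggested in the question, the same table could map each character to a string such as `&#8220;` instead.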
 
 1 0
 
Bryant A. Tutterow

2012-01-04T15:24:31-08:00

Excellent scripting examples, thanks for the insight!

2 0

Atiqa Arshad

2012-01-02T05:03:08-08:00

I was really surprised to read this article. You made some great points about SEO, and this article should be shared with other SEOs as well as fellow programmers.

Thanks for sharing this article with the community.

2 0

scalax

2011-12-30T01:29:13-08:00

Really really nice blog post. Every one of the 5 points will be useful to me. I can only say THANKS!

2 0

TahirLiaqat

2011-12-29T13:44:48-08:00

Thanks for sharing this useful information. I have a duplicate content issue on one client's website and keep suggesting that the company sort it out, but unfortunately I'm still waiting; it's been just two and a half years. Wish me luck :)

2 0

Tom Anthony
Associate

2011-12-29T13:56:42-08:00

Awesome first post, Corey. Especially love the use of Levenshtein for duplicate articles. :)

2 0

Hiren Vaghela

2011-12-28T02:37:45-08:00

Great read, Corey. Cleaning up the technical side is one of the best ways to optimize a website; SEO and technical work correlate with each other. Canonical tags have a big impact in many cases, such as when you have implemented pagination or dynamic URLs. The Levenshtein code is a good one for sure.

Recently I have seen many wrong URLs generated through various parameters in my Webmaster Tools account. I have removed most of them, but some are still there, and the parameters are session_id, p_id, etc. It would be great if you could share your feedback on this.

These points are really useful for me as well as for many techy guys here.

Hirenvaghela edited 2011-12-28T02:40:30-08:00
2 0

- Corey Northcutt
 
 2011-12-28T07:47:31-08:00
 
 In an ideal world, we'd get those session IDs out of your URLs entirely, but if the content that Google should see isn't dependent on those GET variables, I'd absolutely play with the function above. It should wipe away what sounds like a pretty significant issue of duplicate content for you.
 
 You can also do a "site:yourdomain.com" search to see how many duplicate versions of your pages are really getting indexed. This can be more important than one might assume. As much as Google always seems to claim that they're getting better at recognizing these sorts of things on their own, I still regularly have clients approach me who seem shocked when we show them 50+ versions of their homepage in Google's index.
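 
 To illustrate the idea of ignoring GET variables that don't change the content, here is a minimal Python sketch. It is not the PHP function from the post: the helper name `canonical_url` and the example URL are made up, and the ignored parameter names are simply the two mentioned in the comment above, on the assumption that neither one affects what the page shows.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the comment above; adjust for your own site.
# Assumption: none of these change the content that Google should see.
IGNORED_PARAMS = {"session_id", "p_id"}


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Return url with the ignored query parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/shoes?color=red&session_id=abc123&p_id=9"))
```

 The result is the value you would place in the page's canonical tag, so that every session-specific variant points back to one URL.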
 
 2 0
 
 - Hiren Vaghela
 
 2011-12-28T22:02:33-08:00
 
 Thanks Corey for the quick response. I have done the exercise you suggested. I found a few URLs and am either 301 redirecting them or removing them through Webmaster Tools.
 
 1 0
 
Simon Dalley

2011-12-27T23:42:49-08:00

Good points there - I'll get that over to the development team to run through those 5 steps, let's see what we can come up with! Hopefully if I'm doing my job properly not many issues :$

2 0

Kentaro Roy

2011-12-27T13:59:30-08:00

Thanks for the explanations on multiple "levels", it's really useful for when you're working with multiple people with various areas of expertise.

2 0

- Corey Northcutt
 
 2011-12-27T15:48:36-08:00
 
 Thanks, was hoping that would result in a clearer message.
 
 1 0
 
Scott Bartell

2011-12-27T19:03:02-08:00

Those are some great ideas, thanks for sharing!

2 0

webdevcloud

2011-12-30T01:38:15-08:00

Thanks for some practical solution tips. I could surely use at least couple of these.

2 0

Michael Gonzalez

2011-12-29T14:48:37-08:00

Oh Panda how I love you, thanks for the post. Duplicate content is never going away and that's the problem.

2 0

Ben Pines

2011-12-29T22:37:29-08:00

the spell check script link doesn't work.

great guide though!

2 0

- Corey Northcutt
 
 2011-12-30T07:29:17-08:00
 
 Woops, looks like it's just the wrong link. This is the one you want: https://pastebin.com/aVPnJSK3
 
 That's really just to show you how it works, though, you'll want to download pspell before that will actually do anything.
 
 1 0
 
 - Keri Morgret
 
 2011-12-30T11:24:02-08:00
 
 Link is fixed!
 
 2 0
 
 - Corey Northcutt
 
 2011-12-31T08:20:31-08:00
 
 thanks keri!
 
 1 0
 
Moosa Hemani

2011-12-29T22:37:40-08:00

Spell Check the data base with this tiny little smart shit... this is super amazing mate! I have played a bit with bulk canonical but this spell check thing seriously deserves some time to play with it!

Great work!

2 0

- Corey Northcutt
 
 2011-12-30T07:35:39-08:00
 
 tiny little smart shit = a much better title for this post
 
 Thanks for the kind words, hope something here helps!
 
 1 0
 
hyderali_

2011-12-30T01:01:11-08:00

Thanks Corey for this stupendous post.

Like I said, I'm not into scripting and all that, but I'm very thankful to you for the post. I love the way you showed the canonical issue. I passed this post to my developer and am looking to implement the same on my client's site.

Thanks a lot.

2 0

Dubs

2011-12-29T14:49:30-08:00

Great post Corey! Some very interesting programming for taking care of those pesky duplicate content articles! Thank you for sharing your developer knowledge, as it gets the wheels turning for other tools and methods for SEO!

2 0

Franck NLEMBA

2011-12-29T22:01:19-08:00

Oh! Congrats! I hadn't heard about this stuff before. To me that's completely new! I will need some high-level technical help to test it properly.

2 0

CrowdFinch

2011-12-29T20:56:16-08:00

Those programming tactics are useful for coping with Panda updates. Thanks Corey!

2 0

Brian Greenberg

2011-12-29T21:38:08-08:00

Brilliant :) Appreciate the high level tips. I also think author tags in the new micro data format will help sites in the long run.

2 0

ninjamarketer

2012-01-04T22:48:28-08:00

Good technical advice, though not really geared towards fighting web content spam or Panda. Will definitely implement a few of these.

1 0

Kamran Shafi

2011-12-28T02:03:31-08:00

Excellent article, explaining multi-utilities at different levels from programming point of view.

2 1

