We've just changed the way we detect duplicate or near-duplicate web pages in our custom crawler to better serve you. Our previous code produced good results, but it could fall apart on large crawls (ones larger than about 85,000 pages), and it could take an excessively long time (sometimes on the order of weeks) to finish.
Now that the change is live, you’ll see some great improvements and a few changes:
- Results will come in faster (up to an hour faster on small crawls and literally days faster on larger crawls)
- More accurate duplicate removal, resulting in fewer duplicates in your crawl results
This post provides a high-level look at the motivations behind our decision to change the way our custom crawl detects duplicate and near-duplicate web pages. Enjoy!
Improving our page similarity measurement
The heuristic we currently use to measure the similarity between two pages is called fingerprints. It works by turning each page into a vector of 128 64-bit integers in such a way that duplicate or near-duplicate pages produce identical, or nearly identical, vectors. The difference between a pair of pages is proportional to the number of corresponding entries in the two vectors that differ.
The faster heuristic we've switched to is called a similarity hash, or simhash for short. A simhash is a single 64-bit unsigned integer, again calculated in such a way that duplicate or near-duplicate pages produce simhash values which are identical, or nearly so. The difference between pages is proportional to the number of bits that differ between the two numbers.
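To make the two difference measures concrete, here's a minimal sketch of how each one could be computed. The function names are just for illustration, not our actual internals:

```python
def fingerprint_difference(fp_a, fp_b):
    """Count corresponding entries that differ between two 128-entry fingerprint vectors."""
    return sum(1 for a, b in zip(fp_a, fp_b) if a != b)  # ranges from 0 to 128

def simhash_difference(sh_a, sh_b):
    """Count bits that differ between two 64-bit simhash values (their Hamming distance)."""
    return bin(sh_a ^ sh_b).count("1")                    # ranges from 0 to 64
```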
The problem: avoid false duplicates
The problem is that these two measures are very different: one is a vector of 128 values, while the other is a single value. Because of this, the two heuristics won't necessarily agree about how different any given pair of pages is. And with a single crawl potentially containing over a million pages, that's an awful lot of numbers we need to compare in order to determine the best possible threshold value for the new heuristic.
Specifically, we need to set the heuristic threshold to detect as many duplicates and near-duplicates as possible, while minimizing the number of false duplicates. It is more important to absolutely minimize the number of page pairs falsely flagged as duplicates, so that we never remove a page as a duplicate unless it actually *is* a duplicate. This means we need to be able to detect page pairs where (see the sketch just after this list):
- The two pages are not actually duplicates or near-duplicates,
- The current fingerprints heuristic correctly views them as different, but
- The simhash heuristic incorrectly views them as similar.
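In code, the dangerous case looks something like the following sketch. The names and thresholds are placeholders, not our production values, and a pair flagged here still has to be inspected by hand to decide whether the pages really are duplicates:

```python
def risky_pair(fp_diff, sh_diff, fp_threshold, candidate_sh_threshold):
    """Fingerprints call the pair different, but simhash would call it a duplicate.

    Purely illustrative: this only narrows the list down to the risky cases;
    whether the two pages are *actually* duplicates must be checked manually.
    """
    return fp_diff > fp_threshold and sh_diff <= candidate_sh_threshold
```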
We're being incredibly careful about this to avoid the most negative customer experience we can anticipate: a behind-the-scenes change to our duplicate detection heuristic causing a sudden rash of incorrect "duplicate page" errors to appear for no apparent reason.
The solution: visualizing the data
A decision involving this many numeric quantities is a classic case where data visualization can help. Our SEOmoz data scientist, Matt Peters, suggested that the best way to normalize these two very different measures of page content was to focus on how they measured the difference between existing pages. Taking that to heart, I decided on the following approach:
- Sample about 10 million pairs of pages from about 25 crawls selected at random.
- For each pair of pages sampled, plot their difference as measured by the legacy fingerprints heuristic on the horizontal axis (0 to 128), and their difference as measured by simhash on the vertical axis (0 to 64). (A rough sketch of this step follows the list.)
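Roughly speaking, the sampling step boils down to something like the sketch below. It reuses the illustrative difference helpers from earlier; the `pages` argument is assumed to be a list of (fingerprint, simhash) tuples, which is not how our crawler actually stores things:

```python
import random

def sample_difference_points(pages, n_pairs):
    """Sample random page pairs and return (fingerprint_diff, simhash_diff) points."""
    points = []
    for _ in range(n_pairs):
        (fp_a, sh_a), (fp_b, sh_b) = random.sample(pages, 2)
        points.append((fingerprint_difference(fp_a, fp_b),  # 0 to 128, horizontal axis
                       simhash_difference(sh_a, sh_b)))     # 0 to 64, vertical axis
    return points
```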
The plot resulting from this approach looks like this:
Immediately, a problem is obvious: this image gives no sense of the data's central tendency (or lack thereof). If more than one page pair has the same difference as measured by both legacy fingerprints and simhash, the plotting software simply places a second red dot precisely atop the first one. And so on for the third, fourth, hundredth, and possibly thousandth identical data point.
One way to address this problem is to color the dots differently depending on how many page pairs they represent. So what happens if we select the color using a light wavelength that corresponds to the number of times we draw a point on the same spot? This tactic gives us a plot with red (a long wavelength) indicating the most data points, down through orange, yellow, green, blue, and violet (really, magenta on this scale) representing only one or two values:
How disappointing! That's almost no change at all. However, if you look carefully, you can see a few blue dots in that sea of magenta, and most importantly, the lower-leftmost dot is red, representing the highest count of all. What's happening is that the red dot represents a count so much higher than all the other counts that most of the colors between it and the ones representing the lowest counts end up unused.
The solution is to assign colors in such a way that most of the colors end up being used for coding the lower counts, and to assign progressively fewer colors as counts increase. Or, in mathematical terms, to assign colors based on a logarithmic scale rather than a linear one. If we do that, we end up with the following:
Now we're getting somewhere! As expected, there is a central tendency in the data, even though it's pretty broad. One thing that's immediately evident is that, although in theory the difference measured by simhash can go to a maximum of 64, in practice it rarely gets much higher than 46 (roughly three-fourths of the maximum). In contrast, using the fingerprints difference, many pages reach the maximum possible difference of 128 (witness all the red and orange dots along the right side of the graphic). Keep in mind that those red and orange dots represent really big counts, because the color scale is logarithmic.
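For the curious, here's roughly what producing a log-scaled density plot like this looks like in code. We actually generated the graphics with Gnuplot (see the comments below); this matplotlib version is just an illustrative sketch:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def plot_difference_density(points):
    """Plot fingerprint difference vs. simhash difference with log-scaled colors.

    `points` is a list of (fingerprint_diff, simhash_diff) tuples, e.g. the
    output of the sample_difference_points() sketch above.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # One bin per possible difference value; LogNorm spends most of the color
    # range on the low counts, which is the logarithmic trick described above.
    plt.hist2d(xs, ys, bins=[129, 65], range=[[0, 128], [0, 64]],
               norm=LogNorm(), cmap="rainbow")
    plt.xlabel("fingerprints difference (0 to 128)")
    plt.ylabel("simhash difference (0 to 64)")
    plt.colorbar(label="number of page pairs (log scale)")
    plt.show()
```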
Where we have to be most careful is along the bottom edge of the plot. That region represents simhash values which indicate pairs of pages that are quite similar. If two pages are not, in fact, similar, yet simhash measures them as similar where fingerprints saw a significant difference, that is precisely the sort of negative customer experience we are trying to avoid. One potential trouble spot is circled below:
The circled dot represents a pair of pages which are actually quite different, yet which simhash thinks are quite similar. (The dot to the left and even further below turns out to not be a problem: it represents a pair of nearly duplicate pages that the old heuristic missed!)
The vertical position of the troublesome dot represents a simhash difference of 6 (6 corresponding bits in the two 64-bit simhash values differ). It's not the only such case, either: pairs like this come up from time to time. It happens in 1% or less of crawls, but it does happen. If we choose a simhash difference threshold of 6 (matching the threshold we currently have defined for the legacy fingerprints), there will be false positives.
Picking a threshold
Thankfully, 6 seems to be a border case. Above 6 bits of difference, the chance of a false positive increases. Below 6, I was unable to find any such pathological cases, and I examined thousands of crawls trying to find one. So I chose a difference threshold of 5 for simhash-based duplicate detection. That results in a situation represented by the final graphic:
Here we have lines drawn to represent the two difference thresholds. Everything to the left of the vertical line represents what the current code would report as duplicate. Everything below the horizontal line represents what the simhash code will report. Keeping in mind the logarithmic color scale and the red dot in the lower-left corner, we see that the number of page pairs where the two heuristics agree about similarity outweighs the number of page pairs where they disagree.
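In terms of bookkeeping, the final graphic boils down to tallying which side of each threshold line every pair falls on. Here's a hedged sketch of that tally, using the thresholds discussed above (6 for the legacy fingerprints, 5 for simhash):

```python
from collections import Counter

def quadrant_counts(points, fp_threshold=6, sh_threshold=5):
    """Count how often the two heuristics agree or disagree about duplication."""
    tally = Counter()
    for fp_diff, sh_diff in points:
        fp_dup = fp_diff <= fp_threshold   # left of the vertical line
        sh_dup = sh_diff <= sh_threshold   # below the horizontal line
        if fp_dup and sh_dup:
            tally["both heuristics say duplicate"] += 1
        elif fp_dup:
            tally["fingerprints only (upper-left quadrant)"] += 1
        elif sh_dup:
            tally["simhash only (lower-right quadrant)"] += 1
        else:
            tally["both say different"] += 1
    return tally
```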
Note that there are still things in the "false positive" (lower right) quadrant. It turns out that those pairs tend not to differ much from the pairs where the two measures agree, or, for that matter, from the false negative pairs in the upper left quadrant. In other words, with the chosen thresholds, both simhash and the legacy fingerprints miss seeing some true near-duplicates.
The visible results
With this threshold decision, false negatives outnumber false positives. This meets our goal of minimizing the number of false positives, even at the cost of incurring false negatives. Note that the "false positives" in the lower-right quadrant are actually quite similar to each other, and therefore would more accurately be described as false negatives of the legacy fingerprints heuristic rather than false positives of the simhash heuristic.
The most visible aspects of the change to customers are two-fold:
1. Fewer duplicate page errors: a general decrease in the number of reported duplicate page errors. However, it bears pointing out that:
- We may still miss some near-duplicates. As with the current heuristic, only a subset of the near-duplicate pages is reported.
- Completely identical pages will still be reported. Two pages that are completely identical will have the same simhash value, and thus a difference of zero as measured by the simhash heuristic.
2. Speed, speed, speed: The simhash heuristic detects duplicates and near-duplicates approximately 30 times faster than the legacy fingerprints code. This means that soon, no crawl will spend more than a day working its way through post-crawl processing, which will facilitate significantly faster delivery of results for large crawls.
I hope this post provides some meaningful insight into these changes. I look forward to hearing your thoughts in the comments below.
I LOVE how SEOmoz's engineers blog about how they approach problems, work as a team to solve them, and then disclose the fun details. This is a great example of how two disciplines come together as a team to solve one problem and then relate it back to us: the users.
GREAT JOB!!!!
Thanks for all your comments! Yes, this is quite technical. The reasons for using such elaborate logic for detecting duplicate pages are twofold:
1. Duplicate web pages are seldom exact duplicates. They often have date/time stamps embedded someplace, hidden form variables with different, dynamically-generated values that change on each page request, etc. If we just compared the HTML we got as two giant strings, we would miss a lot of duplicates. So we have to detect pages that are almost duplicates, as well as pages that are 100%, byte-for-byte, exact duplicates.
2. Duplicate reporting is a lot of work (for each page in a crawl, we basically have to compare it to all other pages). That amount of work scales as the square of the number of pages on a site (i.e. double the number of pages, and the number of comparisons we must perform goes up by a factor of four). So minimizing this work is also very important.
Simhashes are a way of dealing with needs 1 and 2 above.
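To make both points concrete, here is a tiny made-up example: two pages that differ only in an embedded timestamp fail a byte-for-byte comparison even though they are near-duplicates, and a naive all-pairs comparison count really does grow with the square of the page count:

```python
page_a = "<html><body>Our product page. Generated at 2013-02-07 10:15:01</body></html>"
page_b = "<html><body>Our product page. Generated at 2013-02-07 10:15:02</body></html>"
print(page_a == page_b)   # False: one changed character defeats exact matching

def pairwise_comparisons(n_pages):
    """Comparisons needed if every page is checked against every other page."""
    return n_pages * (n_pages - 1) // 2

print(pairwise_comparisons(10000))  # 49,995,000
print(pairwise_comparisons(20000))  # 199,990,000 -- double the pages, roughly 4x the work
```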
This is awesome David. Thanks for taking the time to explain it to us.
As much as I hate comments that ask a lot of dumb questions, nonetheless I have one :)
Does the use of a proper canonical tag on a duplicate page cancel out a duplicate content warning?
Again, thanks for the great visualizations.
Yes, using canonical tags will result in duplicate errors being suppressed. If one page refers to another as its canonical, then that pair will not be reported as duplicates. Also, if two pages both refer to the same third page as their canonical, then they will not be reported as duplicates of each other, either.
(This is not a new feature; the custom crawler has recognized canonical tags for some time now.)
That doesn't work for me the way you describe: the SEOmoz bot marked the following URLs as duplicate pages even though I have defined a canonical tag:
https://www.virtualsheetmusic.com/score/PatrickCollectionFlPf.html?tab=mp3
https://www.virtualsheetmusic.com/score/PatrickCollectionFlPf.html?tab=pdf
I have also opened a thread about this issue (linked below), though I haven't gotten good answers so far:
https://www.seomoz.org/q/why-does-seomoz-bot-see-duplicate-pages-despite-i-am-using-the-canonical-tag
I enjoyed this post. Thanks David.
Does the way you calculate simhashes (or fingerprints for that matter) simply work on the HTML as a huge string or are you considering the structure of the marked-up content to work out what changes are significant and what aren't (e.g. ignoring comments in the HTML or changes to CSS only)?
What we do is strip out all of the HTML and only look at the text on the page when performing near-duplicate comparisons.
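For anyone wondering what "stripping out the HTML" can look like in practice, here is a minimal standard-library sketch (not our production code) that pulls out just the visible text before any similarity comparison is done:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only a page's text content, skipping tags plus script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of a page as a single whitespace-normalized string."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)
```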
It looks like non-standard research, David. But what is the practical use of this?
The practical use for people who use SEOmoz is a faster, more accurate crawl. You get your results faster. That's it. David is telling us how the SEOmoz engineering team actually made the crawl faster.
Chiming in on Chris_Le's response. The use of this post is to give our community members some statistical insight as to why we've changed the way our custom crawl detects duplicate and near-duplicate pages. Our engineering team has a ton of knowledge to share, and we think it's important to give our readers a chance to look at the data we use to make decisions on the backend. Our hope is that the more you know, the better your overall experience with our tools will be!
I read: 1. Fewer duplicate page errors.... and 2. Speed, speed, speed....
It came in way at the end, I have trouble paying attention also. Otherwise impressive post, that's some detailed data!
Hey David, just want to thank you and the rest of your team for being so open with the efforts you put into making the tools we use better. Also, you did a great job explaining these recent updates in terms easier to understand than I would normally expect for a topic like this. Thank you!
Is anyone else scratching their head?
Cool improvements though.
faster and better is good. But, yes, definitely scratching my head!
I think some of the comments are missing the point here. Even if you don't go past the first two dot points in the post, you can see that SEOmoz has made some good improvements to the way the report on duplicates. Whether you read the rest of the post is obviously up to you, but it's great to see these improvements and the transparency behind them.
Loved the detail in this. Thanks - most useful.
Did you use R for this?
No, I used Python scripts (the same language the custom crawler is mostly written in) to process the data and Gnuplot to generate the colored-dot graphics.
Thanks very much for this detailed and technical update. The devil is in the detail, and it's refreshing to see that you're making these kind of improvements.
That's really great info, but most of it was too technical for normal marketers to understand. Still, thanks for the detailed explanation of how the crawl process was made faster.
Hi David! Thanks for taking the time to explain this to us. You and your team are the best. I guess transparency is really important. That is why people doing SEO work like reading your articles: you are so transparent. Anyways, I guess SEOmoz has made some good improvements to the way they report on duplicates.
David, thanks for the awesome description. Completely agreed. Still not sure about the real use for this, though.
Very technical update. Thanks
A little too technical for my blood. But as longs as it's faster and better, it sounds good. Thanks.
Great tips for the SEO guys.
Thank you so much for sharing these tips with us.
This right here is probably the single greatest reason I'm going to hold onto my Pro subscription.
Well done guys! oh and my eyes hurt thanks to the shiny dotted images :p
As a color blind guy these charts hurt my eyes :-)
I think the charts look very pretty indeed! I do sympathise with the color blindness though.
I heard that you guys are working on a major update to your crawl engine that will enable deeper crawls that will uncover more external links. SEOmoz consistently omits recent high value links that Majestic SEO uncovers within a couple of days at most. Any news on any forthcoming crawl engine updates of this nature?
Hey! We are constantly trying to improve our crawlers so that our Mozscape index (which powers Open Site Explorer) crawls deeper and fresher pages. That work is in progress right now; however, the crawler that David works on is a different crawler that only crawls for our PRO campaigns. But we do hope to have lots of exciting changes this year!
Thanks,
Carin
That would be great. One of the SEOmoz reps I spoke to a couple of months back mentioned that a new engine capable of deeper crawls for links was in development, so I hope that comes to fruition sooner rather than later. Thanks
Hello there! Good job indeed.
I have just one question: have you already incorporated these changes into your crawler?
I am just wondering because yesterday I saw the data from your last crawl, for one of the websites I am currently monitoring, and I could see a substantial decrease in duplicate pages detected compared to the previous week.
Yes, the changes have already been incorporated, and that could well explain the decrease in reported duplicates you are seeing.
In my experience, simhash is a pretty crappy similarity measure. I would suggest thinking of it as a sort of first-pass filter that helps you reduce the amount of data you need to process. Because of the false positives, you'll almost certainly need to use other, more computationally intensive similarity measures on the smaller set of data eventually. If you expose this for people to use, then IMO relying only on simhash will cause them to simply bang their collective heads against the wall.
Good luck with this direction you're taking, as Google and Bing have pretty much hired (or co-opted with funding) all the Ph.D's in this very critical area of near-duplicate detection!