We've just changed the way we detect duplicate or near-duplicate web pages in our custom crawler to better serve you. Our previous code produced good results, but it could fall apart on large crawls (ones larger than about 85,000 pages), and it could take an excessively long time (sometimes on the order of weeks) to finish.
Now that the change is live, you’ll see some great improvements and a few changes:
- Results will come in faster (up to an hour faster on small crawls and literally days faster on larger crawls)
- More accurate duplicate removal, resulting in fewer duplicates in your crawl results
This post provides a high-level look at the motivations behind our decision to change the way our custom crawl detects duplicate and near-duplicate web pages. Enjoy!
Improving our page similarity measurement
The heuristic we currently use to measure the similarity between two pages is called fingerprints. It works by turning each page into a vector of 128 64-bit integers in such a way that duplicate or near-duplicate pages produce identical, or nearly identical, vectors. The difference between a pair of pages is proportional to the number of corresponding entries in the two vectors that differ.
The faster heuristic we've switched to is called a similarity hash, or simhash for short. A simhash is a single 64-bit unsigned integer, again calculated in such a way that duplicate or near-duplicate pages produce simhash values which are identical, or nearly so. The difference between pages is proportional to the number of bits that differ between the two numbers.
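To make the two difference measures concrete, here's a minimal sketch of how each one could be computed. The function names are just for illustration, not our actual internals:

```python
def fingerprint_difference(fp_a, fp_b):
    """Count corresponding entries that differ between two 128-entry fingerprint vectors."""
    return sum(1 for a, b in zip(fp_a, fp_b) if a != b)  # ranges from 0 to 128

def simhash_difference(sh_a, sh_b):
    """Count bits that differ between two 64-bit simhash values (their Hamming distance)."""
    return bin(sh_a ^ sh_b).count("1")                    # ranges from 0 to 64
```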
The problem: avoid false duplicates
The problem is that these two measures are very different: one is a vector of 128 values, while the other is a single value. Because of this, the two heuristics won't necessarily agree about how different any given pair of pages is. And with a single crawl potentially containing over a million pages, that's an awful lot of numbers we need to compare in order to determine the best possible threshold value for the new heuristic.
Specifically, we need to set the heuristic threshold to detect as many duplicates and near-duplicates as possible, while minimizing the number of false duplicates. It is more important to absolutely minimize the number of page pairs falsely flagged as duplicates, so that we never remove a page as a duplicate unless it actually *is* a duplicate. This means we need to be able to detect page pairs where (see the sketch just after this list):
- The two pages are not actually duplicates or near-duplicates,
- The current fingerprints heuristic correctly views them as different, but
- The simhash heuristic incorrectly views them as similar.
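In code, the dangerous case looks something like the following sketch. The names and thresholds are placeholders, not our production values, and a pair flagged here still has to be inspected by hand to decide whether the pages really are duplicates:

```python
def risky_pair(fp_diff, sh_diff, fp_threshold, candidate_sh_threshold):
    """Fingerprints call the pair different, but simhash would call it a duplicate.

    Purely illustrative: this only narrows the list down to the risky cases;
    whether the two pages are *actually* duplicates must be checked manually.
    """
    return fp_diff > fp_threshold and sh_diff <= candidate_sh_threshold
```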
We're being incredibly careful about this to avoid the most negative customer experience we can anticipate: a behind-the-scenes change to our duplicate detection heuristic causing a sudden rash of incorrect "duplicate page" errors to appear for no apparent reason.
The solution: visualizing the data
A decision involving this many numeric quantities is a classic case where data visualization can help. Our SEOmoz data scientist, Matt Peters, suggested that the best way to normalize these two very different measures of page content was to focus on how they measured the difference between existing pages. Taking that to heart, I decided on the following approach:
- Sample about 10 million pairs of pages from about 25 crawls selected at random.
- For each pair of pages sampled, plot their difference as measured by the legacy fingerprints heuristic on the horizontal axis (0 to 128), and their difference as measured by simhash on the vertical axis (0 to 64). (A rough sketch of this step follows the list.)
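Roughly speaking, the sampling step boils down to something like the sketch below. It reuses the illustrative difference helpers from earlier; the `pages` argument is assumed to be a list of (fingerprint, simhash) tuples, which is not how our crawler actually stores things:

```python
import random

def sample_difference_points(pages, n_pairs):
    """Sample random page pairs and return (fingerprint_diff, simhash_diff) points."""
    points = []
    for _ in range(n_pairs):
        (fp_a, sh_a), (fp_b, sh_b) = random.sample(pages, 2)
        points.append((fingerprint_difference(fp_a, fp_b),  # 0 to 128, horizontal axis
                       simhash_difference(sh_a, sh_b)))     # 0 to 64, vertical axis
    return points
```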
The plot resulting from this approach looks like this:
Immediately, a problem is obvious: this image gives no sense of the data's central tendency (or lack thereof). If more than one page pair has the same difference as measured by both legacy fingerprints and simhash, the plotting software simply places a second red dot precisely atop the first one. And so on for the third, fourth, hundredth, and possibly thousandth identical data point.
One way to address this problem is to color the dots differently depending on how many page pairs they represent. So what happens if we select the color using a light wavelength that corresponds to the number of times we draw a point on the same spot? This tactic gives us a plot with red (a long wavelength) indicating the most data points, down through orange, yellow, green, blue, and violet (really, magenta on this scale) representing only one or two values:
How disappointing! That's almost no change at all. However, if you look carefully, you can see a few blue dots in that sea of magenta, and most importantly, the lower-leftmost dot is red, representing the highest count of all. What's happening is that the red dot represents a count so much higher than all the other counts that most of the colors between it and the ones representing the lowest counts end up unused.
The solution is to assign colors in such a way that most of the colors end up being used for coding the lower counts, and to assign progressively fewer colors as counts increase. Or, in mathematical terms, to assign colors based on a logarithmic scale rather than a linear one. If we do that, we end up with the following:
Now we're getting somewhere! As expected, there is a central tendency in the data, even though it's pretty broad. One thing that's immediately evident is that, although in theory the difference measured by simhash can go to a maximum of 64, in practice it rarely gets much higher than 46 (roughly three-fourths of the maximum). In contrast, using the fingerprints difference, many pages reach the maximum possible difference of 128 (witness all the red and orange dots along the right side of the graphic). Keep in mind that those red and orange dots represent really big counts, because the color scale is logarithmic.
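For the curious, here's roughly what producing a log-scaled density plot like this looks like in code. We actually generated the graphics with Gnuplot (see the comments below); this matplotlib version is just an illustrative sketch:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def plot_difference_density(points):
    """Plot fingerprint difference vs. simhash difference with log-scaled colors.

    `points` is a list of (fingerprint_diff, simhash_diff) tuples, e.g. the
    output of the sample_difference_points() sketch above.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # One bin per possible difference value; LogNorm spends most of the color
    # range on the low counts, which is the logarithmic trick described above.
    plt.hist2d(xs, ys, bins=[129, 65], range=[[0, 128], [0, 64]],
               norm=LogNorm(), cmap="rainbow")
    plt.xlabel("fingerprints difference (0 to 128)")
    plt.ylabel("simhash difference (0 to 64)")
    plt.colorbar(label="number of page pairs (log scale)")
    plt.show()
```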
Where we have to be most careful is along the bottom edge of the plot. That region represents simhash values which indicate pairs of pages that are quite similar. If two pages are not, in fact, similar, yet simhash measures them as similar where fingerprints saw a significant difference, that is precisely the sort of negative customer experience we are trying to avoid. One potential trouble spot is circled below:
The circled dot represents a pair of pages which are actually quite different, yet which simhash thinks are quite similar. (The dot to the left and even further below turns out to not be a problem: it represents a pair of nearly duplicate pages that the old heuristic missed!)
The vertical position of the troublesome dot represents a simhash difference of 6 (6 corresponding bits in the two 64-bit simhash values differ). It's not the only such case, either: pairs like this come up from time to time. It happens in 1% or less of crawls, but it does happen. If we choose a simhash difference threshold of 6 (matching the threshold we currently have defined for the legacy fingerprints), there will be false positives.
Picking a threshold
Thankfully, 6 seems to be a border case. Above 6 bits of difference, the chance of a false positive increases. Below 6, I was unable to find any such pathological cases, and I examined thousands of crawls trying to find one. So I chose a difference threshold of 5 for simhash-based duplicate detection. That results in a situation represented by the final graphic:
Here we have lines drawn to represent the two difference thresholds. Everything to the left of the vertical line represents what the current code would report as duplicate. Everything below the horizontal line represents what the simhash code will report. Keeping in mind the logarithmic color scale and the red dot in the lower-left corner, we see that the number of page pairs where the two heuristics agree about similarity outweighs the number of page pairs where they disagree.
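In terms of bookkeeping, the final graphic boils down to tallying which side of each threshold line every pair falls on. Here's a hedged sketch of that tally, using the thresholds discussed above (6 for the legacy fingerprints, 5 for simhash):

```python
from collections import Counter

def quadrant_counts(points, fp_threshold=6, sh_threshold=5):
    """Count how often the two heuristics agree or disagree about duplication."""
    tally = Counter()
    for fp_diff, sh_diff in points:
        fp_dup = fp_diff <= fp_threshold   # left of the vertical line
        sh_dup = sh_diff <= sh_threshold   # below the horizontal line
        if fp_dup and sh_dup:
            tally["both heuristics say duplicate"] += 1
        elif fp_dup:
            tally["fingerprints only (upper-left quadrant)"] += 1
        elif sh_dup:
            tally["simhash only (lower-right quadrant)"] += 1
        else:
            tally["both say different"] += 1
    return tally
```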
Note that there are still things in the "false positive" (lower right) quadrant. It turns out that those pairs tend not to differ much from the pairs where the two measures agree, or, for that matter, from the false negative pairs in the upper left quadrant. In other words, with the chosen thresholds, both simhash and the legacy fingerprints miss seeing some true near-duplicates.
The visible results
With this threshold decision, false negatives outnumber false positives. This meets our goal of minimizing the number of false positives, even at the cost of incurring false negatives. Note that the "false positives" in the lower-right quadrant are actually quite similar to each other, and therefore would more accurately be described as false negatives of the legacy fingerprints heuristic rather than false positives of the simhash heuristic.
The most visible aspects of the change to customers are two-fold:
1. Fewer duplicate page errors: a general decrease in the number of reported duplicate page errors. However, it bears pointing out that:
- We may still miss some near-duplicates. As with the current heuristic, only a subset of the near-duplicate pages is reported.
- Completely identical pages will still be reported. Two pages that are completely identical will have the same simhash value, and thus a difference of zero as measured by the simhash heuristic.
2. Speed, speed, speed: The simhash heuristic detects duplicates and near-duplicates approximately 30 times faster than the legacy fingerprints code. This means that soon, no crawl will spend more than a day working its way through post-crawl processing, which will facilitate significantly faster delivery of results for large crawls.
I hope this post provides some meaningful insight into these changes. I look forward to hearing your thoughts in the comments below.
I LOVE how SEOmoz's engineers blog about how they approach problems, work as a team to solve them, and then disclose the fun details. This is a great example of how two disciplines come together as a team to solve one problem and then relate it back to us: the users.
GREAT JOB!!!!
Thanks for all your comments! Yes, this is quite technical. The reasons for using such elaborate logic for detecting duplicate pages are twofold:
1. Duplicate web pages are seldom exact duplicates. They often have date/time stamps embedded someplace, hidden form variables with different, dynamically-generated values that change on each page request, etc. If we just compared the HTML we got as two giant strings, we would miss a lot of duplicates. So we have to detect pages that are almost duplicates, as well as pages that are 100%, byte-for-byte, exact duplicates.
2. Duplicate reporting is a lot of work (for each page in a crawl, we basically have to compare it to all other pages). That amount of work scales as the square of the number of pages on a site (i.e. double the number of pages, and the number of comparisons we must perform goes up by a factor of four). So minimizing this work is also very important.
Simhashes are a way of dealing with needs 1 and 2 above.
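To make both points concrete, here is a tiny made-up example: two pages that differ only in an embedded timestamp fail a byte-for-byte comparison even though they are near-duplicates, and a naive all-pairs comparison count really does grow with the square of the page count:

```python
page_a = "<html><body>Our product page. Generated at 2013-02-07 10:15:01</body></html>"
page_b = "<html><body>Our product page. Generated at 2013-02-07 10:15:02</body></html>"
print(page_a == page_b)   # False: one changed character defeats exact matching

def pairwise_comparisons(n_pages):
    """Comparisons needed if every page is checked against every other page."""
    return n_pages * (n_pages - 1) // 2

print(pairwise_comparisons(10000))  # 49,995,000
print(pairwise_comparisons(20000))  # 199,990,000 -- double the pages, roughly 4x the work
```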
This is awesome David. Thanks for taking the time to explain it to us.
As much as I hate comments that ask a lot of dumb questions, nonetheless I have one :)
Does the use of a proper canonical tag on a duplicate page cancel out a duplicate content warning?
Again, thanks for the great visualizations.
Yes, using canonical tags will result in duplicate errors being suppressed. If one page refers to another as its canonical, then that pair will not be reported as duplicates. Also, if two pages both refer to the same third page as their canonical, then they will not be reported as duplicates of each other, either.
(This is not a new feature; the custom crawler has recognized canonical tags for some time now.)
That doesn't work for me the way you describe: the SEOmoz bot marked the following URLs as duplicate pages even though I have defined a canonical tag:
https://www.virtualsheetmusic.com/score/PatrickCollectionFlPf.html?tab=mp3
https://www.virtualsheetmusic.com/score/PatrickCollectionFlPf.html?tab=pdf
I have also opened a thread about this issue (linked below), though I haven't gotten good answers so far:
https://www.seomoz.org/q/why-does-seomoz-bot-see-duplicate-pages-despite-i-am-using-the-canonical-tag
I enjoyed this post. Thanks David.
Does the way you calculate simhashes (or fingerprints for that matter) simply work on the HTML as a huge string or are you considering the structure of the marked-up content to work out what changes are significant and what aren't (e.g. ignoring comments in the HTML or changes to CSS only)?
What we do is strip out all of the HTML and only look at the text on the page when performing near-duplicate comparisons.
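For anyone wondering what "stripping out the HTML" can look like in practice, here is a minimal standard-library sketch (not our production code) that pulls out just the visible text before any similarity comparison is done:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only a page's text content, skipping tags plus script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of a page as a single whitespace-normalized string."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)
```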
It looks like non-standard research, David. But what is the practical use of this?
The practical use for people who use SEOmoz is a faster, more accurate crawl. You get your results faster. That's it. David is telling us how the SEOmoz engineering team actually made the crawl faster.
Chiming in on Chris_Le's response. The use of this post is to give our community members some statistical insight as to why we've changed the way our custom crawl detects duplicate and near-duplicate pages. Our engineering team has a ton of knowledge to share, and we think it's important to give our readers a chance to look at the data we use to make decisions on the backend. Our hope is that the more you know, the better your overall experience with our tools will be!
I read: 1. Fewer duplicate page errors.... and 2. Speed, speed, speed....
It came in way at the end, I have trouble paying attention also. Otherwise impressive post, that's some detailed data!
Hey David, just want to thank you and the rest of your team for being so open with the efforts you put into making the tools we use better. Also, you did a great job explaining these recent updates in terms easier to understand than I would normally expect for a topic like this. Thank you!
Is anyone else scratching their head?
Cool improvements though.
faster and better is good. But, yes, definitely scratching my head!
I think some of the comments are missing the point here. Even if you don't go past the first two dot points in the post, you can see that SEOmoz has made some good improvements to the way the report on duplicates. Whether you read the rest of the post is obviously up to you, but it's great to see these improvements and the transparency behind them.
Loved the detail in this. Thanks - most useful.
Did you use R for this?
No, I used Python scripts (the same language the custom crawler is mostly written in) to process the data and Gnuplot to generate the colored-dot graphics.
Thanks very much for this detailed and technical update. The devil is in the detail, and it's refreshing to see that you're making these kind of improvements.
That's really great info, but most of it was too technical for normal marketers to understand. Still, thanks for the detailed explanation of how the crawl process was made faster.
Hi David! Thanks for taking the time to explain this to us. You and your team are the best. I guess transparency is really important. That is why people doing SEO work like reading your articles: you are so transparent. Anyways, I guess SEOmoz has made some good improvements to the way they report on duplicates.
David, thanks for the awesome description. Completely agreed. Still not sure about the real use for this, though.
Very technical update. Thanks
A little too technical for my blood. But as longs as it's faster and better, it sounds good. Thanks.
Great tips for the SEO guys.
Thank you so much for sharing these tips with us.
This right here is probably the single greatest reason I'm going to hold onto my Pro subscription.
Well done guys! oh and my eyes hurt thanks to the shiny dotted images :p
As a color blind guy these charts hurt my eyes :-)
I think the charts look very pretty indeed! I do sympathise with the color blindness though.
I heard that you guys are working on a major update to your crawl engine that will enable deeper crawls that will uncover more external links. SEOmoz consistently omits recent high value links that Majestic SEO uncovers within a couple of days at most. Any news on any forthcoming crawl engine updates of this nature?
Hey! We are constantly trying to improve our crawlers so that our Mozscape index (which powers Open Site Explorer) crawls deeper and fresher pages. That work is in progress right now; however, the crawler that David works on is a different crawler that only crawls for our PRO campaigns. But we do hope to have lots of exciting changes this year!
Thanks,
Carin
That would be great. One of the SEOmoz reps I spoke to a couple of months back mentioned that a new engine capable of deeper crawls for links was in development, so I hope that comes to fruition sooner rather than later. Thanks
Hello there! Good job indeed.
I have just one question: have you already incorporated these changes into your crawler?
I am just wondering because yesterday I saw the data from your last crawl, for one of the websites I am currently monitoring, and I could see a substantial decrease in duplicate pages detected compared to the previous week.
Yes, the changes have already been incorporated, and that could well explain the decrease in reported duplicates you are seeing.
In my experience, simhash is a pretty crappy similarity measure. I would suggest thinking of it as a sort of first-pass filter that helps you reduce the amount of data you need to process. Because of the false positives, you'll almost certainly need to use other, more computationally intensive similarity measures on the smaller set of data eventually. If you expose this for people to use, then IMO relying only on simhash will cause them to simply bang their collective heads against the wall.
Good luck with this direction you're taking, as Google and Bing have pretty much hired (or co-opted with funding) all the Ph.D's in this very critical area of near-duplicate detection!