After watching Nate Buggia speak a few weeks ago about Live's Webmaster Tools, I was struck by his statistic about the number of domains on the web. He suggested that there are 78 million domains. There's certainly room for disagreement about this number (don't forget, Google has one trillion web pages ;) ), but I bet he's in the right ballpark. If that's right, could we manually review all of them?
Sure, 78 million domains is big. But not that big. A few months ago, while investigating spam, Danny reviewed 500 fairly randomly chosen domains in a matter of hours. And I think he did a great job of it, too. That's a good foundation, but could we scale that up and review millions of domains?
I see a few challenges here. The biggest is probably just getting that list of Live's 78 million domains. Next, you're going to need a lot of manual reviewers. But if you're Live (or some other search engine), you've already got that list and a large contract labor force. Too bad for the rest of us.
I suppose if you're clever you might be able to do this through Alexa's Web Information Service and Amazon Mechanical Turk. Taking a look at the Mechanical Turk pricing, it looks like you could pay one cent for every domain (or maybe for each block of a few dozen domains). So we're probably talking about tens or hundreds of thousands of dollars. But that's pocket change for Google. And Google has plenty of remote offices with lots of search quality engineers. In fact, they say, "Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways."
So let's say a single person can review 1,000 domains in a single day. And let's say you've got 1,000 reviewers working on this problem. That tells me that 78 days later you've got all the relevant domains on the internet reviewed. That's less than 10% of Google's workforce, and less than 2% of Microsoft's. Of course you could do it with fewer people if you pre-filtered some of those domains, or took longer than three months to do it. If Google, Yahoo!, and Live haven't already done this... well, I can't imagine that they haven't done at least part of this by now.
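For anyone who wants to poke at those assumptions, here's the same back-of-envelope math as a few lines of Python. The numbers are the guesses above, not measured data:

```python
# Back-of-envelope estimate: how long to review every domain once?
total_domains = 78_000_000            # Live's 78 million domain figure
domains_per_reviewer_per_day = 1_000  # assumed reviewer throughput
reviewers = 1_000                     # assumed size of the review team

days = total_domains / (domains_per_reviewer_per_day * reviewers)
print(f"{days:.0f} days")             # 78 days
```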
How It's Feasible to Manually Review All Domains
This would be a sweet job for someone.
I don't think I'd be eligible though - as soon as something like miniclip.com came up in my list, I'd be all "Neat - Penguin cricket! I'd better manually review this for a few hours."
lol I can actually picture you really doing that.
So it could be done.
As long as no one changes anything.
Since that's what I'm doing most of every day (and I assume I'm not the only one), I doubt it.
Not to say that a lot of actual eyeballs are getting on a huge number of pages. I get your point.
Definitely food for thought.
I suspect you could go a long way toward getting a sense for a domain in a single snapshot, by browsing a few pages. Even if some of the content changes over a matter of months, I would bet the basic nature of the domain is unlikely to change. And that kind of change might be easier to detect algorithmically too.
But fair argument. Content change does make this trickier.
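As a toy illustration of what "detect it algorithmically" might look like: compare the vocabulary of an old and a new snapshot of a page and flag low overlap for re-review. The threshold and the sample text below are made up for the example:

```python
# Toy check: has the "basic nature" of a page changed between two crawl snapshots?
import re

def word_set(text):
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def looks_like_new_site(old_text, new_text, threshold=0.3):
    old_words, new_words = word_set(old_text), word_set(new_text)
    if not old_words or not new_words:
        return True
    jaccard = len(old_words & new_words) / len(old_words | new_words)
    return jaccard < threshold   # little vocabulary overlap: re-queue for review

old = "Penguin cricket and other free flash games for kids"
new = "Cheap pills online no prescription fast shipping"
print(looks_like_new_site(old, new))   # True: worth a fresh human look
```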
I agree with you. I also see your point in the post.
I know that Google would love for us to think that it is all strictly the program working the algorithm on every website, in every vertical. But so many verticals have a unique twist, as Rand and others have pointed out with duplicate material in a vertical like song lyrics. So it follows that humans are at least tweaking the numbers, if not actually eyeballing the content.
You are absolutely right about the basic scheme or theme staying the same; otherwise the whole cache-over-time approach would leave nothing to trust.
I was just thinking the elephant was too big (eyeballs on every website in a matter of months), but when you break it down into bite-sized chunks, it IS feasible. (I am getting sleepy, sorry.) Maybe if we run this by Shor and his Chinese friends... he'd probably get it done in a few hours :)
The closest anyone has got is ODP, with several million sites listed.
However, you are making one classic error. A domain is not a site, and a site is not necessarily one domain.
There are tens of thousands of domains that host hundreds to thousands of "sites" per domain... not just the free pages your ISP allocates to their customers, but various store fronts, blog platforms, forum hosts, etc.
That is, exampleshopping.com/goodstore is a different "site" to exampleshopping.com/crappystore, so you cannot just score the domain, or all of the other sites on that domain get hit for something that is out of their control. It is obviously much easier to spot when the demarcation is sub-domain based, but there are tens of thousands of examples where it is folder based.
That's a great point. A site: search for Google's own Blogspot returns 274 million results, and I have reason to believe that *ahem* some of them may be spam ;)
And the manual review process is further complicated when you have a generally very reputable site publishing individual spam pages in sub-folders, targeted at high-value advertising areas, like forbes.com did.
You are totally correct. Diving into the different facets of every domain becomes increasingly difficult. However, knowing at least something about exampleshopping.com in general (perhaps that it hosts a wide variety of content, and that we should be careful about generalizing from one content area to another) IS useful, I think. For instance, we might have algorithms that deeply analyze some content areas of sites, and we need to know when their deep-but-narrow analysis generalizes and when it doesn't.
I agree that when you consider all the subdomains, such as the ones on WordPress etc., there is much more work involved.
It's easy to analyze small sites, but huge ones with a lot of subdomains would be a bit more difficult.
Interesting and thought-provoking. But doesn't sound like a job I want :(
Within minutes I'd be thinking up ways to let computers handle such a boring and repetitive job, bringing us back to square uno...
Often this kind of task needs a lot of labeled data (e.g. human review) to prime the process. You are correct: as you do human review, you'll figure out how to automate it, take what you have labeled, feed it into automation, and get results. It IS a cycle. Human review -> automation -> new, deeper human review -> better automation -> ...
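As a rough sketch of that cycle (not how any search engine actually does it), something like the following captures the loop. The extract_features and human_label functions here are stand-ins for real crawl features and a real review queue:

```python
# A minimal sketch of the human review -> automation -> deeper review cycle.
# extract_features() and human_label() are placeholders for illustration only.
import random

from sklearn.linear_model import LogisticRegression

def extract_features(domain):
    # Placeholder: a real system would use content, link, and hosting features.
    return [hash(domain) % 100 / 100.0]

def human_label(domain):
    # Placeholder for a manual reviewer's spam / not-spam judgment.
    return random.random() < 0.2

unlabeled = [f"domain{i}.example" for i in range(10_000)]
labeled = {}
suspicious = []

for round_num in range(3):
    # 1. Humans review a batch: random at first, model-flagged afterwards.
    batch = suspicious[:500] if suspicious else random.sample(unlabeled, 500)
    for d in batch:
        labeled[d] = human_label(d)
        unlabeled.remove(d)

    # 2. Train an automated classifier on everything labeled so far.
    X = [extract_features(d) for d in labeled]
    y = [labeled[d] for d in labeled]
    model = LogisticRegression().fit(X, y)

    # 3. Rank the remaining domains by predicted spam probability; the most
    #    suspicious go back to the human reviewers in the next round.
    scores = model.predict_proba([extract_features(d) for d in unlabeled])[:, 1]
    suspicious = [d for _, d in sorted(zip(scores, unlabeled), reverse=True)]
```

The point is just the loop structure: each pass of human labels makes the automated filter better at deciding what deserves human attention next.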
To me, this manually-review-all-domains thing is pretty much pointless. But let's pretend someone really wants to do it.
1,000 domains a day doesn't sound feasible. If one spends only a minute per domain (you need to click the URL and wait for the site to load, and some sites are slow and some "turkers" still have slow dial-up connections; then one needs to make an educated decision and somehow mark the domain), it would take almost 17 hours to review 1,000 domains.
OK, let's say 500 domains a day with an 8-hour working day (without breaks). I still don't think so. The first dozen domains will be fast, but then concentration decreases and productivity gets worse with every reviewed domain. So I'd say 200 domains a day is feasible.
So 78 x 5 = 390 days (the original 78 days, times five, since each reviewer now handles 200 domains a day instead of 1,000). More than a year.
Another issue: not all sites are in English, so you need to find people who speak ALL of those languages.
Regarding Mechanical Turk: one cent is the smallest reward that you pay to a worker, and you pay roughly another cent to Amazon, which doubles the expense. And the "block of a few dozen domains" for one cent would hardly work, since it's not a single action (you need each domain reviewed individually).
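Plugging those revised numbers into the same kind of back-of-envelope script (everything here is an assumption from the comments above, not real pricing data):

```python
# Revised time estimate with reviewer fatigue taken into account.
total_domains = 78_000_000
domains_per_reviewer_per_day = 200        # down from 1,000
reviewers = 1_000

days = total_domains / (domains_per_reviewer_per_day * reviewers)
print(f"{days:.0f} days")                 # 390 days, i.e. more than a year

# Mechanical Turk cost: one cent to the worker plus roughly one cent to Amazon.
cost_per_domain = 0.01 + 0.01
print(f"${total_domains * cost_per_domain:,.0f}")   # about $1,560,000
```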
Anyway, the article is thought provoking.
Another thing to consider: people's websites can be their livelihood, and an incorrect review by cheap labor crunching through 1,000 domains a day may very well result in the site owner's family not eating next week.
You would ideally want domains reviewed by 3 independent people to rule out human error and produce accurate and consistent results; simply "getting through 78 million domains" is pointless if the results are poor quality.
That would up the reviews to 3 x 78 million, and by the time you finish you would have to start over. The web is so dynamic that today's legit site is tomorrow's spam domain, due to new owners or an array of factors.
Are domains the metric to review by? Often spam is something slipped in on a reputable domain. Other times a domain that uses spam tactics doesn't look that spammy itself. What to do in these cases?
That's a great question. Perhaps spam review isn't the only reason you'd be looking at domains. But even if it were, I have strong reasons to believe that many, many domains on the internet are 100% spam. I believe that search engines do a pretty good job of rooting these out, so we never see them.
But even for domains that get spam slipped in, it might be reasonable to expect to discover followable comment links or other user-generated content very quickly. The domain might not be spam, but it's an indication that the links from that domain's pages need to be double-checked, perhaps algorithmically and/or with deeper manual review.
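As a sketch of that kind of quick check, something like this would list the followable outbound links on a page. The URL is hypothetical, and a real check would scope it to comment areas rather than any external link:

```python
# Rough sketch: list outbound links on a page that carry no rel="nofollow".
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def followable_external_links(url):
    host = urlparse(url).netloc
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        rel = a.get("rel") or []
        target = urlparse(a["href"]).netloc
        # External link with no nofollow hint: worth a closer look.
        if target and target != host and "nofollow" not in rel:
            links.append(a["href"])
    return links

# print(followable_external_links("https://example.com/some-blog-post/"))
```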
I agree that the greater percentage of spam is going to have a 100% domain-to-spam relationship, but often the spam SEOs have to deal with doesn't necessarily equate to C!@L1$.info; it's the way someone is acquiring links, or where those links are placed (or hidden) on a page; basically what Google would dub 'manipulative link building'. When it gets to this level, the domains involved are usually rather clean.
I loved your presentation at SEOmoz' advanced seminar, and this blog entry ties in well with advancing the measurement of neighborhoods/domains, but I think there's another network of sorts out there built on mostly clean domains. Excellent point as well on applying different levels of trust/authority to UGC on a trusted domain. Such a metric could also be measured by the section's link profile: How many UGC links point to bad neighborhoods? How often are UGC edits made? How many good neighborhood UGC links agree with what I (the SE) know as good neighborhoods? And so on. Maybe your next analysis can go in this direction: Link measurements on trusted UGC domains.
Why would you want to review all these domains?
Valid point, but what difference would it make to the approach of an SEO? I mean, what's the catch?
Great question: where's the beef? How does this help SEOs?
I think it's always interesting to think about things like a search engineer, to consider the resources (computational, and human) they've got at their disposal.
But what specific things can you take home? Here is another piece of evidence that solid, natural content is the key. Keyword density doesn't matter. Stuffing meta tags pales in comparison to compelling articles, images, and other human-centric media. You should look for links that serve as editorial endorsements. And you should use paid, or otherwise non-editorial, links sparingly and in targeted ways.
If you've got a client that is pushing you towards SEO tactics you aren't comfortable with, you might suggest that search engines may have a closer, more-human eye on their site than they think.
Hmm... but Google doesn't even have a proper spam reporting system in place. You hardly get a notification or anything that tells you what happened to your spam report. Was any action taken? I think Google could save much more time and money if it put a more sophisticated spam reporting system in place for editorial discretion. Knowledgeable internet users would be more than happy to do this job for Google.
Nick,
Interesting post. I have seen evidence of companies doing similar things on campus. College students are great recruits for this kind of thing because they are generally pretty intelligent and desperate for money.
Most recently, RIM (Blackberry) came around and wanted students to read the contents of tens of thousands of blog posts to identify whether the content was related to their product or a competitor's product, and whether it was favorable or not. I thought about signing up for it at $15/hr and outsourcing it to Mechanical Turk.
It would have been free money.
I think an important thing to notice in regard to the 'changing content' is the fact that Google or any other search company does not have to start from scratch.
They already have mined data from our websites for years. There must be some way to use this data to simplify the process of having to check websites on a daily basis. Maybe even some human/computer hybrid option?
I'm totally with you about human/computer hybrid. I actually have doubts that every site (or much, much less likely every page) is human reviewed on a regular basis. But you could use an algo to identify suspicious or interesting sites, human review those, use that data to tune more algos. I worked with a professor for a long time who was very fond of the human-in-the-loop model and I'm inclined to agree.
And doing so would require only fractions of the workforce described in the original post!
This reminds me of Douglas Adams' book "Life, the Universe and Everything". It has a part about an alien named Wowbagger the Infinitely Prolonged. He was a man with a purpose: to insult the Universe. That is, he would insult everybody in it. Individually, personally, one by one, and (this was the thing he really decided to grit his teeth over) in alphabetical order. When people protested to him, as they sometimes had done, that the plan was not merely misguided but actually impossible because of the number of people being born and dying all the time, he would merely fix them with a steely look and say, "A man can dream, can't he?"
This reminds me of my days as an eBay seller. eBay was plagued with fraudulent listings, and even more listings that weren't fraudulent but were listed incorrectly (wrong category, links to off-eBay sites, etc.)
All of these errors meant the eBay experience was often unpleasant for buyers. This is one of the main reasons that many people never come back to eBay.
Yet in all my years there, TPTB insisted that there were too many listings to be manually reviewed. They relied on regular eBay members to report listing violations. I had friends that literally reported a few hundred listings a day.
Looking at your analysis, it's really inconceivable that eBay couldn't hire the necessary staff to be able to review each listing. I think they just didn't want to have to pay anyone for it.
eBay (and Google) can put a lot of automated processes in place to try and stop spam, but human eyes are often the best (and sometimes only) way to find problems.
There are three things that make it a little harder to scale than this article suggests:
1) The number of classification variables
2) The complexity of each defined classification
3) The scale of each domain
The problem is not necessarily in how many domains there are to review, but what you want to know about each domain.
If you have a single question, and it's simple, like "Does the homepage use a mustard yellow background?", then this will be relatively easy.
Of course, questions that simple can be answered by robots. The really juicy stuff is something more like, "Which vertical does this domain fall under?"
Even this seemingly simple question has LOTS of built-in assumptions. It assumes you have a fairly comprehensive vertical hierarchy already defined, and universally understood. ODP is pretty comprehensive, but even they left some stuff out.
Even if you already have all of those things, the methodology is still key. Unless you define the steps every reviewer must take, you're probably really asking, "For this domain's homepage content that is above the fold, how would you classify the vertical?" Most reviewers would probably not look at anything more.
There's more... at $30/hour, you're looking at a cost of $2.34M per variable (assuming one variable/review). Since all of this would be entirely new meta-data for the search engine, they'd have no way to verify the quality of the work unless they did another review.
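For what it's worth, the $2.34M figure seems to assume a reviewer pace of about 1,000 domains per hour; here's the arithmetic with that assumption made explicit (both numbers are guesses, not published rates):

```python
# Cost per classification variable under the comment's implied assumptions.
total_domains = 78_000_000
hourly_rate = 30.0          # dollars per reviewer-hour
domains_per_hour = 1_000    # pace implied by the $2.34M figure above

reviewer_hours = total_domains / domains_per_hour
print(f"${reviewer_hours * hourly_rate:,.0f} per variable")   # $2,340,000
```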
Almost any way you do it, this task would get out of hand pretty quickly.
Since we already know Google is manually reviewing some sites, then it stands to reason they are pre-processing the sites in order to minimize the task of reviewing.
You bring up lots of points. Glad people are thinking about this critically ;)
About defining rules and how well human review can do, there's actually a really great paper about exactly that:
This paper on building a web spam reference collection, by Castillo et al., dives into detail about human review of sites. A lot of my belief in human review stems from this and our internal validation of some of their methods in my spam detection post (cited above in this post).
Boggles my mind! I think sly-grrr's point about the constantly changing, organic nature of the web throws a giant wrench into the works. It would mean that the manual review was actually at least 78 days behind, and that's a long time for a change to go unrecognized.
Thetjo above points out that we can leverage what we've learned from the past towards the present/future. Maybe we do it once, and discover that we need to keep a close eye on some certain sites (which are influential, but also do some shady business). Those might be the sites (a smaller subset of the 78 million) which we really need to manually review.
Thought-provoking article. I'm suspicious of the number too; 78 million? Awkward.
But the task of validating/analysing them, while not easy, is a possible one.
Doing it manually means periodic re-checks and compartmentalizing the entire job, but yes, it's possible. Why not?
Gulp! Something hard to digest as I sit down to read my first post in the morning. Are we saying that Big Brother actually has the capacity and resources to do it? If that's the case, I don't know what else they wouldn't be able to do.