After watching Nate Buggia speak a few weeks ago about Live's Webmaster Tools, I was struck by his statistic about the number of domains on the web. He suggested that there are 78 million domains. There's certainly room for disagreement about this number (don't forget, Google has one trillion web pages ;) ), but I bet he's in the right ballpark. If that's right, could we manually review all of them?
Sure, 78 million domains is big. But not that big. A few months ago, while investigating spam, Danny reviewed 500 fairly randomly chosen domains in a matter of hours. And I think he did a great job of it, too. That's a good foundation, but could we scale that up and review millions of domains?
I see a few challenges here. The biggest is probably just getting that list of Live's 78 million domains. Next, you're going to need a lot of manual reviewers. But if you're Live (or some other search engine), you've already got that list and a large contract labor force. Too bad for the rest of us.
I suppose if you're clever you might be able to do this through Alexa's Web Information Service and Amazon Mechanical Turk. Taking a look at the Mechanical Turk pricing, it looks like you could pay one cent for every domain (or maybe for each block of a few dozen domains). So we're probably talking about tens or hundreds of thousands of dollars. But that's pocket change for Google. And Google has plenty of remote offices with lots of search quality engineers. In fact, they say, "Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways."
So let's say a single person can review 1,000 domains in a single day. And let's say you've got 1,000 reviewers working on this problem. That tells me that 78 days later you've got all the relevant domains on the internet reviewed. That's less than 10% of Google's workforce, and less than 2% of Microsoft's. Of course you could do it with fewer people if you pre-filtered some of those domains, or took longer than three months to do it. If Google, Yahoo!, and Live haven't already done this... well, I can't imagine that they haven't done at least part of this by now.
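For anyone who wants to poke at those assumptions, here's the same back-of-envelope math as a few lines of Python. The numbers are the guesses above, not measured data:

```python
# Back-of-envelope estimate: how long to review every domain once?
total_domains = 78_000_000            # Live's 78 million domain figure
domains_per_reviewer_per_day = 1_000  # assumed reviewer throughput
reviewers = 1_000                     # assumed size of the review team

days = total_domains / (domains_per_reviewer_per_day * reviewers)
print(f"{days:.0f} days")             # 78 days
```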
How It's Feasible to Manually Review All Domains
This would be a sweet job for someone.
I don't think I'd be eligible though - as soon as something like miniclip.com came up in my list, I'd be all "Neat - Penguin cricket! I'd better manually review this for a few hours."
lol I can actually picture you really doing that.
So it could be done.
As long as no one changes anything.
Since that's what I'm doing most of every day (and I assume I'm not the only one), I doubt it.
Not to say that a lot of actual eyeballs are getting on a huge number of pages. I get your point.
Definitely food for thought.
I suspect you could go a long way toward getting a sense for a domain in a single snapshot, by browsing a few pages. Even if some of the content changes over a matter of months, I would bet the basic nature of the domain is unlikely to change. And that kind of change might be easier to detect algorithmically too.
But fair argument. Content change does make this trickier.
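As a toy illustration of what "detect it algorithmically" might look like: compare the vocabulary of an old and a new snapshot of a page and flag low overlap for re-review. The threshold and the sample text below are made up for the example:

```python
# Toy check: has the "basic nature" of a page changed between two crawl snapshots?
import re

def word_set(text):
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def looks_like_new_site(old_text, new_text, threshold=0.3):
    old_words, new_words = word_set(old_text), word_set(new_text)
    if not old_words or not new_words:
        return True
    jaccard = len(old_words & new_words) / len(old_words | new_words)
    return jaccard < threshold   # little vocabulary overlap: re-queue for review

old = "Penguin cricket and other free flash games for kids"
new = "Cheap pills online no prescription fast shipping"
print(looks_like_new_site(old, new))   # True: worth a fresh human look
```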
I agree with you. I also see your point in the post.
I know that Google would love for us to think that it is all strictly the program working the algorithm on every website, in every vertical. But so many verticals have a unique twist, as Rand and others have pointed out with duplicate material in a vertical like song lyrics. So it follows that humans are at least tweaking the numbers, if not actually eyeballing the content.
You are absolutely right about the basic scheme or theme staying the same; otherwise the whole cache-over-time approach would leave nothing to trust.
I was just thinking the elephant was too big (eyeballs on every website in a matter of months), but when you break it down into bite-sized chunks, it IS feasible. (I am getting sleepy, sorry.) Maybe if we run this by Shor and his Chinese friends... he'd probably get it done in a few hours :)
The closest anyone has got is ODP, with several million sites listed.
However, you are making one classic error. A domain is not a site, and a site is not necessarily one domain.
There are tens of thousands of domains that host hundreds to thousands of "sites" per domain... not just the free pages your ISP allocates to their customers, but various store fronts, blog platforms, forum hosts, etc.
That is, exampleshopping.com/goodstore is a different "site" to exampleshopping.com/crappystore, so you cannot just score the domain, or all of the other sites on that domain get hit for something that is out of their control. It is obviously much easier to spot when the demarcation is sub-domain based, but there are tens of thousands of examples where it is folder based.
That's a great point. A site: search for Google's own Blogspot returns 274 million results, and I have reason to believe that *ahem* some of them may be spam ;)
And the manual review process is further complicated when you have a generally very reputable site publishing individual spam pages in sub-folders, targeted at high-value advertising areas, like forbes.com did.
You are totally correct. Diving into the different facets of every domain becomes increasingly difficult. However, knowing at least something about exampleshopping.com in general (perhaps that it hosts a wide variety of content, and that we should be careful about generalizing from one content area to another) IS useful, I think. For instance, we might have algorithms that deeply analyze some content areas of sites, and we need to know when their deep-but-narrow analysis generalizes and when it doesn't.
I agree that when you consider all the subdomains, such as the ones on WordPress etc., there is much more work involved.
It's easy to analyze small sites, but huge ones with a lot of subdomains would be a bit more difficult.
Interesting and thought-provoking. But doesn't sound like a job I want :(
Within minutes I'd be thinking up ways to let computers handle such a boring and repetitive job, bringing us back to square uno...
Often this kind of task needs a lot of labeled data (e.g. human review) to prime the process. You are correct: as you do human review, you'll figure out how to automate it, take what you have labeled, feed it into automation, and get results. It IS a cycle. Human review -> automation -> new, deeper human review -> better automation -> ...
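As a rough sketch of that cycle (not how any search engine actually does it), something like the following captures the loop. The extract_features and human_label functions here are stand-ins for real crawl features and a real review queue:

```python
# A minimal sketch of the human review -> automation -> deeper review cycle.
# extract_features() and human_label() are placeholders for illustration only.
import random

from sklearn.linear_model import LogisticRegression

def extract_features(domain):
    # Placeholder: a real system would use content, link, and hosting features.
    return [hash(domain) % 100 / 100.0]

def human_label(domain):
    # Placeholder for a manual reviewer's spam / not-spam judgment.
    return random.random() < 0.2

unlabeled = [f"domain{i}.example" for i in range(10_000)]
labeled = {}
suspicious = []

for round_num in range(3):
    # 1. Humans review a batch: random at first, model-flagged afterwards.
    batch = suspicious[:500] if suspicious else random.sample(unlabeled, 500)
    for d in batch:
        labeled[d] = human_label(d)
        unlabeled.remove(d)

    # 2. Train an automated classifier on everything labeled so far.
    X = [extract_features(d) for d in labeled]
    y = [labeled[d] for d in labeled]
    model = LogisticRegression().fit(X, y)

    # 3. Rank the remaining domains by predicted spam probability; the most
    #    suspicious go back to the human reviewers in the next round.
    scores = model.predict_proba([extract_features(d) for d in unlabeled])[:, 1]
    suspicious = [d for _, d in sorted(zip(scores, unlabeled), reverse=True)]
```

The point is just the loop structure: each pass of human labels makes the automated filter better at deciding what deserves human attention next.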
To me, this manually-review-all-domains thing is pretty much pointless. But let's pretend someone really wants to do it.
1,000 domains a day doesn't sound feasible. If one spends only a minute per domain (you need to click the URL and wait for the site to load, and some sites are slow and some "turkers" still have slow dial-up connections; then one needs to make an educated decision and somehow mark the domain), it would take almost 17 hours to review 1,000 domains.
OK, let's say 500 domains a day with an 8-hour working day (without breaks). I still don't think so. The first dozen domains will be fast, but then concentration decreases and productivity gets worse with every reviewed domain. So I'd say 200 domains a day is feasible.
So 78 x 5 = 390 days (the original 78 days, times five, since each reviewer now handles 200 domains a day instead of 1,000). More than a year.
Another issue: not all sites are in English, so you need to find people who speak ALL of those languages.
Regarding Mechanical Turk: one cent is the smallest reward that you pay to a worker, and you pay roughly another cent to Amazon, which doubles the expense. And the "block of a few dozen domains" for one cent would hardly work, since it's not a single action (you need each domain reviewed individually).
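Plugging those revised numbers into the same kind of back-of-envelope script (everything here is an assumption from the comments above, not real pricing data):

```python
# Revised time estimate with reviewer fatigue taken into account.
total_domains = 78_000_000
domains_per_reviewer_per_day = 200        # down from 1,000
reviewers = 1_000

days = total_domains / (domains_per_reviewer_per_day * reviewers)
print(f"{days:.0f} days")                 # 390 days, i.e. more than a year

# Mechanical Turk cost: one cent to the worker plus roughly one cent to Amazon.
cost_per_domain = 0.01 + 0.01
print(f"${total_domains * cost_per_domain:,.0f}")   # about $1,560,000
```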
Anyway, the article is thought provoking.
Another thing to consider: people's websites can be their livelihood, and an incorrect review by cheap labor crunching through 1,000 domains a day may very well result in the site owner's family not eating next week.
You would ideally want domains reviewed by 3 independent people to rule out human error and produce accurate and consistent results; simply "getting through 78 million domains" is pointless if the results are poor quality.
That would up the reviews to 3 x 78 million, and by the time you finish you would have to start over. The web is so dynamic that today's legit site is tomorrow's spam domain, due to new owners or an array of factors.
Are domains the metric to review by? Often spam is something slipped in on a reputable domain. Other times a domain that uses spam tactics doesn't look that spammy itself. What to do in these cases?
That's a great question. Perhaps spam review isn't the only reason you'd be looking at domains. But even if it were, I have strong reasons to believe that many, many domains on the internet are 100% spam. I believe that search engines do a pretty good job of rooting these out, so we never see them.
But even for domains that get spam slipped in, it might be reasonable to expect to discover followable comment links or other user-generated content very quickly. The domain might not be spam, but it's an indication that the links from that domain's pages need to be double-checked, perhaps algorithmically and/or with deeper manual review.
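As a sketch of that kind of quick check, something like this would list the followable outbound links on a page. The URL is hypothetical, and a real check would scope it to comment areas rather than any external link:

```python
# Rough sketch: list outbound links on a page that carry no rel="nofollow".
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def followable_external_links(url):
    host = urlparse(url).netloc
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        rel = a.get("rel") or []
        target = urlparse(a["href"]).netloc
        # External link with no nofollow hint: worth a closer look.
        if target and target != host and "nofollow" not in rel:
            links.append(a["href"])
    return links

# print(followable_external_links("https://example.com/some-blog-post/"))
```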
I agree that the greater percentage of spam is going to have a 100% domain-to-spam relationship, but often the spam SEOs have to deal with doesn't necessarily equate to C!@L1$.info; it's the way someone is acquiring links, or where those links are placed (or hidden) on a page; basically what Google would dub 'manipulative link building'. When it gets to this level, the domains involved are usually rather clean.
I loved your presentation at SEOmoz' advanced seminar, and this blog entry ties in well with advancing the measurement of neighborhoods/domains, but I think there's another network of sorts out there built on mostly clean domains. Excellent point as well on applying different levels of trust/authority to UGC on a trusted domain. Such a metric could also be measured by the section's link profile: How many UGC links point to bad neighborhoods? How often are UGC edits made? How many good neighborhood UGC links agree with what I (the SE) know as good neighborhoods? And so on. Maybe your next analysis can go in this direction: Link measurements on trusted UGC domains.
Why would you want to review all these domains?
Valid point, but what difference would it make to the approach of an SEO? I mean, what's the catch?
Great question: where's the beef? How does this help SEOs?
I think it's always interesting to think about things like a search engineer, to consider the resources (computational, and human) they've got at their disposal.
But what specific things can you take home? Here is another piece of evidence that solid, natural content is the key. Keyword density doesn't matter. Stuffing meta tags pales in comparison to compelling articles, images, and other human-centric media. You should look for links that serve as editorial endorsements. And you should use paid, or otherwise non-editorial, links sparingly and in targeted ways.
If you've got a client that is pushing you towards SEO tactics you aren't comfortable with, you might suggest that search engines may have a closer, more-human eye on their site than they think.
Hmm... but Google doesn't even have a proper spam reporting system in place. You hardly get a notification or anything that tells you what happened to your spam report. Was any action taken? I think Google could save much more time and money if it put a more sophisticated spam reporting system in place for editorial discretion. Knowledgeable internet users would be more than happy to do this job for Google.
Nick,
Interesting post. I have seen evidence of companies doing similar things on campus. College students are great recruits for this kind of thing because they are generally pretty intelligent and desperate for money.
Most recently, RIM (Blackberry) came around and wanted students to read the contents of tens of thousands of blog posts to identify whether the content was related to their product or a competitor's product, and whether it was favorable or not. I thought about signing up for it at $15/hr and outsourcing it to Mechanical Turk.
It would have been free money.
I think an important thing to notice in regard to the 'changing content' is the fact that Google or any other search company does not have to start from scratch.
They already have mined data from our websites for years. There must be some way to use this data to simplify the process of having to check websites on a daily basis. Maybe even some human/computer hybrid option?
I'm totally with you about human/computer hybrid. I actually have doubts that every site (or much, much less likely every page) is human reviewed on a regular basis. But you could use an algo to identify suspicious or interesting sites, human review those, use that data to tune more algos. I worked with a professor for a long time who was very fond of the human-in-the-loop model and I'm inclined to agree.
And doing so would require only fractions of the workforce described in the original post!
This reminds me of Douglas Adams' book "Life, the Universe and Everything". It has a part about an alien named Wowbagger the Infinitely Prolonged. He was a man with a purpose: to insult the Universe. That is, he would insult everybody in it. Individually, personally, one by one, and (this was the thing he really decided to grit his teeth over) in alphabetical order. When people protested to him, as they sometimes had done, that the plan was not merely misguided but actually impossible because of the number of people being born and dying all the time, he would merely fix them with a steely look and say, "A man can dream, can't he?"
This reminds me of my days as an eBay seller. eBay was plagued with fraudulent listings, and even more listings that weren't fraudulent but were listed incorrectly (wrong category, links to off-eBay sites, etc.)
All of these errors meant the eBay experience was often unpleasant for buyers. This is one of the main reasons that many people never come back to eBay.
Yet in all my years there, TPTB insisted that there were too many listings to be manually reviewed. They relied on regular eBay members to report listing violations. I had friends that literally reported a few hundred listings a day.
Looking at your analysis, it's really inconceivable that eBay couldn't hire the necessary staff to be able to review each listing. I think they just didn't want to have to pay anyone for it.
eBay (and Google) can put a lot of automated processes in place to try and stop spam, but human eyes are often the best (and sometimes only) way to find problems.
There are three things that make it a little harder to scale than this article suggests:
1) The number of classification variables
2) The complexity of each defined classification
3) The scale of each domain
The problem is not necessarily in how many domains there are to review, but what you want to know about each domain.
If you have a single question, and it's simple, like "Does the homepage use a mustard yellow background?", then this will be relatively easy.
Of course, questions that simple can be answered by robots. The really juicy stuff is something more like, "Which vertical does this domain fall under?"
Even this seemingly simple question has LOTS of built-in assumptions. It assumes you have a fairly comprehensive vertical hierarchy already defined, and universally understood. ODP is pretty comprehensive, but even they left some stuff out.
Even if you already have all of those things, the methodology is still key. Unless you define the steps every reviewer must take, you're probably really asking, "For this domain's homepage content that is above the fold, how would you classify the vertical?" Most reviewers would probably not look at anything more.
There's more... at $30/hour, you're looking at a cost of $2.34M per variable (assuming one variable/review). Since all of this would be entirely new meta-data for the search engine, they'd have no way to verify the quality of the work unless they did another review.
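For what it's worth, the $2.34M figure seems to assume a reviewer pace of about 1,000 domains per hour; here's the arithmetic with that assumption made explicit (both numbers are guesses, not published rates):

```python
# Cost per classification variable under the comment's implied assumptions.
total_domains = 78_000_000
hourly_rate = 30.0          # dollars per reviewer-hour
domains_per_hour = 1_000    # pace implied by the $2.34M figure above

reviewer_hours = total_domains / domains_per_hour
print(f"${reviewer_hours * hourly_rate:,.0f} per variable")   # $2,340,000
```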
Almost any way you do it, this task would get out of hand pretty quickly.
Since we already know Google is manually reviewing some sites, then it stands to reason they are pre-processing the sites in order to minimize the task of reviewing.
You bring up lots of points. Glad people are thinking about this critically ;)
About defining rules and how well human review can do, there's actually a really great paper about exactly that:
This paper on building a web spam reference collection, by Castillo et al., dives into detail about human review of sites. A lot of my belief in human review stems from this and our internal validation of some of their methods in my spam detection post (cited above in this post).
Boggles my mind! I think sly-grrr's point about the constantly changing, organic nature of the web throws a giant wrench into the works. It would mean that the manual review was actually at least 78 days behind, and that's a long time for a change to go unrecognized.
Thetjo above points out that we can leverage what we've learned from the past towards the present/future. Maybe we do it once, and discover that we need to keep a close eye on some certain sites (which are influential, but also do some shady business). Those might be the sites (a smaller subset of the 78 million) which we really need to manually review.
Thought-provoking article. I'm suspicious of the number too; 78 million? Awkward.
But the task of validating/analysing them, while not easy, is a possible one.
Doing it manually means periodic re-checks and compartmentalizing the entire job, but yes, it's possible. Why not?
Gulp! Something hard to digest as I sit down to read my first post in the morning. Are we saying that Big Brother actually has the capacity and resources to do it? If that's the case, I don't know what else they wouldn't be able to do.