I was inspired by Xan Porter's recent post on Evaluating Web Page Quality. Search engines have their own evaluation signals, but as they try to improve the quality of their results, it's only natural for them to pursue more human metrics and refine those results as far as possible. The research she points to - Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents - offers up a terrific structure for how humans might evaluate documents:
Intrinsic Features:
- How accurate is the information presented?
- How biased or unbiased is the data?
- How believable is the content?
- How credible is the source?
Contextual Features:
- Is the information relevant to the user's query?
- Does the information add value to the subject?
- Is the work recent enough to be of value?
- Is the source thorough in its presentation?
- What amount of information is provided?
Representational Features:
- Can the material be interpreted in different ways?
- How easy or difficult is the material for a user to understand?
- Does the document state the information concisely?
- Is the source consistent?
Accessibility Features:
- Is the document accessible?
- Does the content present security risks?
Of these, which can search engines currently measure? I'd guess that of the above, they have signals about many, but only solid metrics for accessibility, security risks and, possibly, readability... There's a long way to go.
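To make the readability point concrete, here's a rough sketch of one long-standing readability metric, the Flesch Reading Ease score. This is purely an illustration: nothing here suggests any search engine uses this formula, and the function names and the vowel-group syllable counter are my own assumptions (the syllable count is a crude heuristic, not a linguistic one).

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease(
    "Probabilistic retrieval methodologies necessitate comprehensive evaluation."
)
print(simple > dense)
```

Short sentences with short words score high and long, polysyllabic ones score low, which is about as far as a formula like this goes. It says nothing about accuracy, bias or credibility.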
"What amount of information is provided?"
In my mind at least, this should be re-worded: "Is there too little information?"
Or is it possible that too much information is a bad thing?
>>Or is it possible that too much information is a bad thing?
A lot of Wikipedia articles contain too much useless information (since everyone wants to leave their mark), as well as many "scientific" books - Ph.D. theses with 500+ pages, full of redundancies. Less is often more. Be concise. Measuring content in kilobytes is definitely "a bad thing".
"... as well as many "scientific" books - Ph.D. theses with 500+ pages, full of redundancies."
To us, maybe. But what seems redundant to us, and to some degree to the search engines, may well be just the details a fellow Ph.D. was looking for.
I'd say the topic is going to play an enormous part in this. What with scientific content being at the upper end of the verbiage scale...
The problem with many Wikipedia articles is that they read as if they were written by some kid with ADD. Articles often lump together random facts that don't logically flow into each other.