As a note for those of us who would like to see more informed, more hypothesis-driven feature selection as the main driver of language analysis for information access, I propose distinguishing between two types of information access task:
Simple tasks, where the reason for attempting computational analysis of text is that the scale of the task is daunting. Information retrieval is such a task. It is rational to meet this type of task by simplifying the collection, e.g. by reducing texts to bags of content words.
Difficult tasks, where the reason for attempting computational analysis of text is that the task is difficult for human assessors, and computational methods might help uncover features which are difficult to detect. Authorship attribution is such a task; novelty detection, hedge detection, and attitude analysis are others. This sort of task needs new features, new ways of aggregating features, and new evaluation mechanisms. These are the fun tasks.
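To make the simplification mentioned for simple tasks concrete, here is a minimal sketch of reducing a text to a bag of content words. The stop list, the tokenization rule, and the function name bag_of_content_words are assumptions chosen for illustration, not a description of any particular retrieval system.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only;
# a practical system would use a much larger one.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "by", "for", "in", "is",
    "it", "of", "or", "that", "the", "this", "to", "with",
})


def bag_of_content_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens,
    drop stop words, and count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


bag = bag_of_content_words(
    "The retrieval of text is a task of scale, and the scale is daunting."
)
print(bag.most_common())
```

This representation discards word order and function words entirely, which is the kind of simplification that suits a task whose main obstacle is scale.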