Bounds of gold standards for sentiment analysis experiments

I keep running experiments on sentiment analysis, the process where texts or sentences
are categorised as positive, negative, or neutral. Those who know me know that I like to
hold forth about how simplistic and impractical many of the starting points for that whole endeavour are. It is still both fun and informative to try.

In practical applications, sentiment analysis is most often based on purely lexical features:
the presence or absence of attitudinally loaded terms. Such models are improved either by improving the lexicon itself or by improving the handling of its terms. In the first case one adds terms to (or removes them from) the lexicon, which of course has predictable effects on coverage, recall, and precision. In the second one models how those terms are used in context, for instance by introducing negation handling or something else constructionally relevant.
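To make the purely lexical approach concrete, here is a minimal sketch in Python. It is an illustration and not the setup used in my experiments: the tiny word lists, the function name classify, and the rule that a negator flips only the term immediately after it are all assumptions chosen for brevity.

```python
import re

# Hypothetical toy lexicon; a real resource would hold thousands of terms.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}
NEGATORS = {"not", "no", "never"}


def classify(text, positive=POSITIVE, negative=NEGATIVE, negators=NEGATORS):
    """Label text as 'positive', 'negative', or 'neutral' from lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        # +1 for a positive term, -1 for a negative term, 0 otherwise.
        polarity = (token in positive) - (token in negative)
        # Assumed negation rule: a negator directly before a term flips it.
        if polarity and i > 0 and tokens[i - 1] in negators:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    # No lexicon hits, or hits that cancel out, both end up as neutral.
    return "neutral"


if __name__ == "__main__":
    for sentence in ["I love this film", "This is not good", "The train leaves at noon"]:
        print(sentence, "->", classify(sentence))
```

Both kinds of improvement map onto this sketch: changing the lexicon means editing the word sets, while changing the handling means editing the rule inside the loop. Passing an empty set of negators switches negation handling off, which makes it easy to compare the two variants on the same data.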

I found that experiments in this direction tended to give less than satisfactory results, and this working report on bounds of lexical sentiment analysis is part of an effort to understand why. I show that the results one can get from experimentation on a gold standard are very much bounded by the lexical resource; that the gold standards differ considerably; and that potential experimental gains are relatively small.