Category Archives: for the record

verbs and their occurrence frequencies

Words, concepts, and usefulness of texts

Most of the tools built for practical analysis of texts are built for topical analysis and are based on lexical statistics. In its simplest form, this means that if a term is observed in a text, that is a notable event to some extent. How notable it is can be estimated from previous experience of how often that term has been observed. The term can then be understood to refer to some concept or topic of interest with some level of likelihood or certainty. How likely it is that the term in question refers to the concept at hand can again be estimated from previous experience. The difference is that concepts are not observable other than by virtue of linking the terms observed in the text to the effect texts have on their readers. What that effect is can be studied in various ways, to see whether readers are happy to peruse a given text given the task or situation they are in. Much research effort has been put into formalising and experimenting with those two sources of experience, into representing concepts or topics in various ways, and into establishing whether a text has the desired effect or not.

There is more to text than topic.

But almost the entire effort in those studies has been based on topical and referential analysis of text. There is more to text than topic. There are numerous reasons why one might be interested in these other factors, but since most tools are built along the principles above and heavily optimised to make use of previous experience about word occurrence, conceptual content, and topical usefulness of texts, there are precious few tools to sort out texts by other criteria. Many tasks are only to some extent topical in nature and have largely to do with other systems of language than the immediately referential, such as tracing power relations expressed in texts, understanding whose perspective is being reported in texts, or modelling change over time in attitude towards some societal phenomenon. Topical tools are not optimised for this purpose, and indeed even systematically suppress non-topical and non-referential features of text.

Verbs carry much of this information

Attitude, tense, mood, aspect, argumentation, stance of author, audience design assumptions, and implicature are all harboured in the text, overlaid on the patterns which are used to carry the topical and referential meaning.
A central category of interest for this sort of information in the linguistic signal is the verb. Verb phrases are the backbone of utterances. They themselves refer to a process, event, or state of the world; organise the referential expressions of a clause into argument structures; carry information to indicate participants’ roles in a clause; indicate in various ways temporality, aspectuality, and modality; and have attributes for manner, likelihood, veracity, and other such characteristics of an utterance.

Verbs and their occurrence statistics are typically disregarded

For topical analyses, verbs usually wash out in the statistics: for a model focussed on the specifics of the conceptual content of a text, verbs are too broadly distributed to be useful. There are more nouns than verbs. Verbs occur repeatedly, and less burstily than nouns. This post and some following posts are intended to demonstrate some of the differences of interest and relevance between noun phrases and verb phrases.
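One way to make the burstiness claim concrete is to compare, per part of speech, how many times a term occurs in a document given that it occurs there at all. Here is a minimal sketch, assuming a toy POS-tagged corpus represented as lists of (token, tag) pairs; the tokens and tags are invented for illustration:

```python
from collections import Counter, defaultdict

def burstiness_by_pos(documents):
    """For each part of speech, average how many times a term occurs
    in a document, given that it occurs in that document at all.
    Bursty terms (typically nouns) cluster their occurrences."""
    occurrences = Counter()    # (pos, term) -> total occurrences
    doc_frequency = Counter()  # (pos, term) -> number of documents containing term
    for doc in documents:
        seen = set()
        for term, pos in doc:
            occurrences[(pos, term)] += 1
            seen.add((pos, term))
        doc_frequency.update(seen)  # each pair counted once per document
    per_pos = defaultdict(list)
    for (pos, _term), occ in occurrences.items():
        per_pos[pos].append(occ / doc_frequency[(pos, _term)])
    return {pos: sum(rates) / len(rates) for pos, rates in per_pos.items()}

# toy hand-tagged example: "dog" is bursty, "run" is not
docs = [
    [("run", "V"), ("dog", "N"), ("dog", "N"), ("dog", "N")],
    [("run", "V"), ("cat", "N")],
]
print(burstiness_by_pos(docs))  # → {'V': 1.0, 'N': 2.0}
```

On real data one would of course run this over a tagged corpus rather than toy lists, and more refined burstiness measures exist; this only illustrates the kind of statistic at issue.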

Descriptive statistics

The following data are taken from two years of 1990s newsprint previously used for information retrieval research in various shared tasks. The collection consists of 170 255 documents, 5 726 822 utterances, and 72 339 348 words.

                                   Nouns        Adjectives   Verbs
Number of observations             24 349 780   5 200 970    6 730 550
Number of different observations   210 797      24 494       7 656

The basis for many information retrieval algorithms is the notion of collection frequency, or idf.¹ How widely dispersed an item is across documents is a fair measure of how useful it is for picking out interesting items: “and” is less useful than “very”, which is less useful than “oscillator”. Verbs are more widely dispersed than other lexical classes. The tables below give some basic statistics. The first table shows how many verbs, nouns, or adjectives occur in over 100, 200, 500, or 1000 of the 170 255 documents in the collection these statistics are taken from. The second table shows how many verbs, nouns, or adjectives occur more than twice in that number of documents.

Terms occurring in more than this many documents:

          >100            >200            >500            >1000
N    5.61%  11835    3.75%   7912    2.05%   4320    1.25%   2641
V   27.91%   2137   20.08%   1537   11.89%    910    7.77%    595
A   15.00%   3674    9.48%   2322    4.79%   1174    2.84%    696

Terms occurring more than twice in this many documents:

          >100            >200            >500            >1000
N    2.02%   4249    1.28%   2697    0.63%   1337    0.36%    754
V    6.49%    497    4.44%    340    2.44%    187    1.48%    113
A    2.80%    686    1.81%    444    0.91%    223    0.52%    127

This shows that a large proportion of the relatively few verbs are dispersed very widely, and consequently are weighted as less interesting by most if not all topic modelling algorithms (certainly every algorithm I have ever had reason to implement uses dispersal statistics as one of the most basic assessments of the relevance of a term). However, they might well be interesting for other analyses, in spite of not being topically specific! I will return to this.
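The dispersal weighting in question is easy to state in code. Here is a minimal sketch of idf computed over a toy collection of tokenised documents (the documents are invented for illustration): terms that occur in every document get weight zero, terms confined to few documents get high weight.

```python
import math
from collections import Counter

def idf_table(documents):
    """Inverse document frequency: terms dispersed across many
    documents get low weight, narrowly distributed terms high weight."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / n) for term, n in df.items()}

docs = [
    "and the oscillator drifted".split(),
    "and the very same circuit".split(),
    "and a very stable oscillator".split(),
]
weights = idf_table(docs)
# "and" occurs in all three documents, so its idf is log(3/3) = 0.0;
# "drifted" occurs in only one, so it gets the highest weight.
```

A widely dispersed verb behaves here exactly like “and”: its document frequency is high, so its idf, and hence its presumed topical interest, is low.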

1. Sparck Jones, Karen. “A statistical interpretation of term specificity and its application in retrieval.” Journal of Documentation 28, no. 1 (1972): 11–21.


Everything before “But”

I have long wondered what effect the term “but” and others like it have on sentences such as “The restaurant serves nice food but the service is awful.” In general, it would seem that what comes after “but” trumps what comes before. Here is a little study confirming that this is indeed the case, but also that it varies a bit depending on the polarity of the target sentence and on the gold standard corpus under investigation.
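The intuition can be sketched as a small variant of lexicon-based scoring, where material after the last “but” decides the polarity. This is only a toy illustration, not the actual experimental setup; the lexicon entries are invented:

```python
# toy attitudinal lexicon; entries are illustrative only
LEXICON = {"nice": 1, "awful": -1, "amazing": 1, "dreadful": -1}

def naive_score(tokens):
    """Sum lexicon polarities over the whole sentence."""
    return sum(LEXICON.get(t, 0) for t in tokens)

def but_aware_score(tokens):
    """Let the final conjunct decide: if the sentence contains 'but',
    score only what follows the last occurrence of 'but'."""
    if "but" in tokens:
        last_but = len(tokens) - 1 - tokens[::-1].index("but")
        tokens = tokens[last_but + 1:]
    return naive_score(tokens)

sentence = "the restaurant serves nice food but the service is awful".split()
print(naive_score(sentence), but_aware_score(sentence))  # → 0 -1
```

The naive score cancels out to neutral, while the but-aware score recovers the negative verdict; a softer variant would down-weight rather than discard the first conjunct.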

Information structure, topic-comment, topicalisation, and sentiment analysis

I have run some experiments over the past year to figure out what happens if we weight attitudinal items differentially according to their position in the utterance, thinking that topicalisation and the vanilla topic–comment structure might have something to say about which items are the most interesting. Yes, they do, but how to make use of it is not entirely clear.

A brief report of those experiments is here.

Bounds of gold standards for sentiment analysis experiments

I keep running experiments on sentiment analysis, the process whereby texts or sentences are categorised as positive, negative, or neutral. Those who know me know that I like to hold forth about how simplistic and impractical many of the starting points for that whole endeavour are. It is still both fun and informative to try.

In practical application, sentiment analysis is most often based on purely lexical features – the presence or absence of attitudinally loaded terms. Improving such analysis models is done either by improving the lexicon itself or by improving the handling of such features: either one adds (or removes) terms from the lexicon, which of course has predictable effects on coverage, recall, and precision; or one comes to understand better the way those terms are used, for instance by introducing negation handling or something else constructionally relevant.
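As a minimal sketch of the second kind of improvement, here is a toy lexicon-based classifier with simple negation handling: a lexicon hit shortly after a negator has its polarity flipped. The lexicon and negator list are invented for illustration:

```python
LEXICON = {"good": 1, "great": 1, "bad": -1, "boring": -1}  # illustrative entries
NEGATORS = {"not", "never", "no"}

def classify(tokens, window=2):
    """Lexicon lookup with simple negation handling: a lexicon hit
    preceded by a negator within `window` tokens is flipped."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if polarity and any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("the plot was not good".split()))    # negation flips "good" → negative
print(classify("a great and moving film".split()))  # → positive
```

Everything beyond the lexicon lookup itself – the negator list, the window size – is exactly the sort of constructionally relevant handling referred to above, and each such choice changes coverage and precision in measurable ways.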

I found that experiments in this direction tended to give less than satisfactory results, and this working report on bounds of lexical sentiment analysis is part of an effort to understand why. I show that the results one can get from experimentation on a gold standard are very much bounded by the lexical resource; that the gold standards differ considerably; and that potential experimental gains are relatively small.
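One simple way to see the bound imposed by the lexical resource is to measure what fraction of gold standard sentences contain any lexicon term at all: sentences with no hits can only ever be labelled neutral, whatever clever handling one adds. A hedged sketch, with an invented lexicon and invented gold sentences:

```python
LEXICON = {"good", "great", "bad", "awful"}  # illustrative attitudinal terms

def lexical_coverage(gold_sentences):
    """Fraction of gold standard sentences containing at least one
    lexicon term: a hard ceiling on what lexicon-based analysis can do."""
    hits = sum(1 for s in gold_sentences if LEXICON & set(s.lower().split()))
    return hits / len(gold_sentences)

gold = [
    "A great little film",
    "I want those two hours back",  # attitudinal, but no lexicon hit
    "The acting was awful",
    "Released in 1992",
]
print(lexical_coverage(gold))  # → 0.5
```

Comparing this ceiling across gold standards makes their differences, and the smallness of the attainable experimental gains, immediately visible.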


The adjective decadent is included in several lists of negative terms compiled for sentiment analysis systems, but how is the term actually used? I looked at real-life data from customer reviews and news.

  • … Eating one is good. Eating two is great. Eating three is decadent, and awesome on an
    empty stomach. Eat four and you start to feel sick. …
  • Love this amazing fusion dessert. Sounds exotic and looks soooo decadent. …
  • A tailored take on the label’s romantic aesthetic, Elie Saab’s crepe jumpsuit is a
    decadent choice for evening events
  • So let’s all raise a glass and hope we get many more decades of this decadent local staple
  • Dense and wickedly chocolatey, the decadent dessert is best shared for greater enjoyment.
  • I’m surrounded by soft, glowing candles while enjoying a glass of rich red wine and a box
    of decadent dark chocolate
  • If you’re feeling decadent, put a pinch of crumbled bacon or a couple of sun-dried
    tomatoes in an egg white omelet
  • Come join us for some more decadent daytime disco and house partying
  • … and many more examples

Almost none were negative. The most frequent topic was chocolate.

Talk at NLP lunch

Today, I gave a hastily put-together talk on my current experimentation with attitudinal adjectival expressions at the NLP group lunch. Fun, but somewhat rhapsodic, since the experiments are as yet incomplete! (I will update when I have more to tell.)

|-----not small----------------|
           |-------------not large-----------|
                                      not ill^
|-----------not healthy----------------------O

One of the things I wanted to stress is the importance of thinking about the utility of introducing linguistic sophistication for practical information system purposes. I posited three levels of use cases for large-scale text analysis:

  1. what are they talking about?
  2. how are they talking about it?
  3. what are they saying about it?

Slides are here.

A digital bookshelf: original work on recommender systems

I spent the better part of the 1989–90 academic year at Columbia University, as a visiting graduate student in the NLP group headed by Kathy McKeown. I had recently begun my graduate studies, and my idea was to work on statistical models to improve human–computer interaction. I had heard of neural networks, read the recently published PDP book and worked through its examples (there was a 5 1/4″ floppy disk included!), and went to a summer school on connectionist models and neural architectures organised by Boston University in Nashua, NH, taught by Stephen Grossberg, Robert Hecht-Nielsen, and some others (the school was very focussed on the ART architecture). I built a connectionist crossword puzzle generator in Prolog which, given a lexicon, almost managed to build a crossword puzzle.

That year was fruitful in several ways. My most important task for myself was setting up experiments on recommender systems. I collected data through a questionnaire, and I ran statistical experiments on .newsrc files on the systems I had access to. I visited Bellcore in Morristown, where my mentor Don Walker invited me to give a talk to his lab, which I believe included Will Hill, who later worked on this sort of thing. I had by then written a paper on the “Digital Bookshelf”, which was promptly rejected by the 1990 INTERACT reviewers because they held that building a recommender system would interfere with users’ privacy and integrity.

When I came back to Stockholm I wrote a tech report describing the idea. Later, when I worked at SICS, I wrote a more complete report, but I only published it in 1994: I brought it to that year’s SIGCHI and distributed it to several friends there. Martin Svensson, one of my colleagues at SICS, later picked up similar thoughts and wrote his dissertation on social recommendation systems, and by then I had started to regret that I had not worked more on developing the ideas further! I blame those 1990 INTERACT reviewers! (I probably should not have opened that discussion: I had included a section on privacy aspects in the paper.)

  • The 1990 Tech Report: Jussi Karlgren. 1990. An Algebra for Recommendations. The Systems Development and Artificial Intelligence Laboratory. Working Paper No 179. Department of Computer and Systems Sciences. KTH Royal Institute of Technology and Stockholm University.
  • The 1994 Tech Report: Jussi Karlgren. 1994. Document Behaviour Based on User Behavior—A Recommendation Algebra. Tech report T94:04. Swedish Institute of Computer Science. (Or here, if that link breaks.)