I came back to TREC for the first time since 1999 (CLEF started then, and that took over most of my attention). This year, I was one of the organisers of the Podcast Challenge, which involved retrieval and summarisation of data from the 100 000 Podcast Data set we released for this purpose. We expect great things to happen with this data set: in some ways, the current state of the speech and podcast analysis field is quite similar to what text and social media analysis was in the mid nineties! The scale of the data set is daunting, the features we want to work with are not quite settled (we expect to see great results from peering into the audio, something no one did this year), the use case we aim for is not entirely determined, and the medium of podcasts is in its infancy and is likely to develop and change rapidly in the coming years! It’ll be right exciting to be here to see what comes next!
I was honoured to be asked to participate as a discussion facilitator for the 35th conference on IT and Law, held in digital form on November 11-12. There are a number of interesting questions to do with how law meets autonomous decision making: how can responsibility be distributed when the operator of a system has less expertise than previously was typical? How can we address the question of invisible harm, when automatic decision making systematically causes some slight disadvantage to some of us? How can we act on the principles of editability and transparency in the face of information imbalance? This discussion was not the last word on these issues!
This is the first year I did not travel to a CLEF meeting. Participation over a video link does work, and some of the interesting presentations came across quite well, but the full experience was somewhat curtailed. Hoping to see the world back on rails again next year!
At this year’s CLEF (http://clef2019.clef-initiative.eu/) in Lugano I presented a poster on How Lexical Gold Standards Have Effects On The Usefulness Of Text Analysis Tools For Digital Scholarship, presented work on detecting signs of eating disorders in social media posts done by my student Elena Fano as her master’s thesis, and, at the newly instituted industrial session, discussed thresholds for adopting systematic evaluation schemes in operational settings. This last presentation was largely based on the chapter on these sorts of things in the new CLEF book.
I was honoured to be asked up to Uppsala to give a talk on what I have been up to under the title “Utterance spaces — how to represent lexical items, constructions, and contextual data in a unified vector space”. I tried to be provocative, but provocatively enough the audience seemed to mostly agree.
I was honoured to be invited to UC Davis, a short train ride from Stanford, by Raul Aranovich to give a talk on “Hyperdimensional computing for human data meets the squinting linguist” or “Explicitly encoded high-dimensional semantic spaces used for authorship profiling” at the linguistics department there! Slides are here.
I gave my first webinar ever, for the ACM, at the invitation of Rose Paradis. The title was An encoding model for hypothesis driven research on large heterogeneous data streams (OR “The squinting linguist meets hyperdimensional computing”).
Giving a webinar was a strange experience: talking to an audience of more than a thousand people but not seeing them in the room. It will take some time to get used to this sort of thing! Slides are here and the talk itself is published by the ACM on video, which is a bit strange since it is mostly audio (plus the slides, of course).
Words, concepts, and usefulness of texts
Most of the tools built for practical analysis of texts are built for topical analysis and are based on lexical statistics. This means, in its simplest form, that if a term is observed in text, it is a notable event to some extent. How notable it is can be estimated from previous experience of how often that term has been observed. That term can then be understood to refer to some concept or topic of interest with some level of likelihood or certainty. How likely it is that the term in question refers to the concept at hand can again be estimated from previous experience. The difference is that concepts are not observable other than by virtue of linking the terms observed in the text to the effect texts have on their readers. What that effect is can be studied in various ways to see whether readers are happy or not to peruse some given text given the task or situation they are in. Much research effort has been put into formalisation and experimentation with those two sources of experience, into representing concepts or topics in various ways, and into establishing whether a text has the desired effect or not.
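To make the idea of "how notable" concrete, here is a minimal sketch in Python. It assumes that notability can be approximated as surprisal, the negative log probability of a term under add-one smoothed background counts. That choice, the toy background text, and the function name `surprisal` are illustrative assumptions, not a description of any particular tool.

```python
import math
from collections import Counter

def surprisal(term, background_counts, total, vocab_size):
    """Bits of surprise at observing `term`, given background experience.

    Assumption: add-one smoothing, so unseen terms get a small
    non-zero probability instead of an infinite surprisal.
    """
    p = (background_counts.get(term, 0) + 1) / (total + vocab_size)
    return -math.log2(p)

# Toy "previous experience": a tiny background collection.
background = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(background)
total = sum(counts.values())
vocab_size = len(counts) + 1  # one extra slot reserved for unseen terms

for term in ["the", "sat", "oscillator"]:
    print(term, round(surprisal(term, counts, total, vocab_size), 2))
```

The more often a term has been seen before, the lower its score. A term never seen in the background gets the highest score this smoothing allows.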
There is more to text than topic.
But almost the entire effort in those studies has been based on topical and referential analysis of text. There is more to text than topic. There are numerous reasons why one might be interested in these other factors, but since most tools are built along the principles above and heavily optimised to make use of previous experience about word occurrence, conceptual content, and topical usefulness of texts, there are precious few tools to sort out texts by other criteria. Many tasks are only to some extent topical in nature and have largely to do with other systems of language than the immediately referential, such as tracing power relations expressed in texts, understanding whose perspective is being reported in texts, or modelling change over time in attitude towards some societal phenomenon. Topical tools are not optimised for this purpose, and indeed even systematically suppress non-topical and non-referential features of text.
Verbs carry much of this information
Attitude, tense, mood, aspect, argumentation, stance of author, audience design assumptions, and implicature are all harboured in the text, overlaid on the patterns which are used to carry the topical and referential meaning.
A central category of interest for this sort of information in the linguistic signal is the verb. Verb phrases are the backbone of utterances. They themselves refer to a process, event, or state of the world; organise the referential expressions of a clause into argument structures; carry information to indicate participants’ roles in a clause; indicate in various ways temporality, aspectuality, and modality; and have attributes for manner, likelihood, veracity, and other such characteristics of an utterance.
Verbs and their occurrence statistics are typically disregarded
For topical analyses, verbs usually wash out in the statistics: for a model focussed on the specifics of the conceptual content of a text, verbs are too broadly distributed to be useful. There are more nouns than verbs. Verbs occur repeatedly and less burstily than nouns. This post and some following posts are intended to demonstrate some of the differences of interest and relevance between noun phrases and verb phrases.
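One simple way to put a number on "burstily" is the average number of occurrences of a term per document in which it occurs at all. The sketch below assumes that measure (collection frequency divided by document frequency) and whitespace tokenisation. Both are illustrative choices, and the toy documents are invented for the example.

```python
from collections import Counter

def burstiness(docs):
    """Map each term to its mean occurrences per document containing it.

    Assumption: burstiness = collection frequency / document frequency.
    A value of 1.0 means the term never repeats within a document.
    """
    cf = Counter()  # total occurrences across the collection
    df = Counter()  # number of documents containing the term
    for doc in docs:
        tokens = doc.lower().split()
        cf.update(tokens)
        df.update(set(tokens))
    return {term: cf[term] / df[term] for term in cf}

docs = [
    "the oscillator drives the second oscillator and the third oscillator",
    "she said the committee met",
    "he said the vote failed",
    "they said nothing",
]
scores = burstiness(docs)
print(scores["oscillator"], scores["said"])
```

In this toy collection the noun clusters in a single document, while the verb is spread thinly over several. That is the pattern the paragraph above describes.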
The following data are taken from two years of 1990s newsprint previously used for information retrieval research in various shared tasks. The collection consists of 170 255 documents, 5 726 822 utterances, and 72 339 348 words.
|Number of observations||24 349 780||5 200 970||6 730 550|
|Number of different observations||210 797||24 494||7 656|
The basis for many information retrieval algorithms is the notion of collection frequency or idf.1 How widely dispersed an item is across documents is a fair measure of how useful it is for picking out interesting items: “and” is less useful than “very”, which is less useful than “oscillator”. Verbs are more widely dispersed than other lexical classes. The table below gives some basic statistics. The first half of the table shows how many verbs, nouns, or adjectives occur in over 100, 200, 500, or 1000 of the 170 355 documents in the collection these statistics are taken from. The second half of the table shows how many verbs, nouns, or adjectives occur more than twice in that number of documents.
This shows that a large proportion of the relatively few verbs are dispersed very widely and consequently weighted to be less interesting by most if not all topic modelling algorithms (certainly every algorithm I have ever had reason to implement uses dispersal statistics as one of the most basic assessments of relevance of a term). However, they might well be interesting for other analyses, in spite of not being topically specific! I will return to this.
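The dispersal statistic discussed above can be sketched in a few lines of Python. The sketch assumes whitespace tokenisation and one common form of idf, log(N / df); real systems use several variants. The toy documents and the helper names `document_frequencies`, `idf_table`, and `count_widely_dispersed` are hypothetical and not taken from the collection described here.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def idf_table(df, n_docs):
    """Assumed idf variant: log(N / df). Widely dispersed terms score low."""
    return {term: math.log(n_docs / n) for term, n in df.items()}

def count_widely_dispersed(df, threshold):
    """How many distinct terms occur in more than `threshold` documents."""
    return sum(1 for n in df.values() if n > threshold)

docs = [
    "bread and butter and jam",
    "salt and pepper",
    "a very cold and very wet day",
    "the oscillator and the very stable clock",
]
df = document_frequencies(docs)
idf = idf_table(df, len(docs))

for term in ["and", "very", "oscillator"]:
    print(term, df[term], round(idf[term], 3))
print(count_widely_dispersed(df, 1))
```

Running `count_widely_dispersed` separately over the verbs, nouns, and adjectives of a tagged collection, at thresholds such as 100, 200, 500, and 1000, would give counts of the kind the table summarises.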
1. Sparck Jones, Karen. “A statistical interpretation of term specificity and its application in retrieval.” Journal of Documentation 28, no. 1 (1972): 11-21.