At the upcoming LREC conference I will present some of the results from Tvärsök, a project to evaluate the effects of various morphological analysis tools on retrieval results. The project was a cooperation between Euroling Ab, SICS, and CST.
One of the most interesting research findings was inspired by my recent role as academic opponent at the public defense of Kimmo Kettunen's PhD dissertation. He studied generative morphologies applied to information retrieval and found that the nine most frequent morphological cases for Finnish nouns suffice to model most of what is needed for indexing and query processing. The table below shows that those nine cases are in fact distributed non-uniformly over relevant and non-relevant documents: locative cases are less likely to occur in topically relevant documents.
Search term case distribution in relevant and non-relevant texts (the most divergent values marked in bold; χ² = 70.155; df = 2; p < 0.005)
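For readers who want to see how a statistic of this kind is computed, here is a minimal sketch of a Pearson chi-square test on a contingency table. The counts are hypothetical and are not the project's data. The 2×3 layout (relevant versus non-relevant documents, three illustrative case groups) is only an assumption that happens to be consistent with df = 2, and the function names are mine.

```python
import math


def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df2(stat):
    """Upper-tail p-value. This closed form holds only for df = 2."""
    return math.exp(-stat / 2)


# HYPOTHETICAL counts, for illustration only.
# Rows: relevant, non-relevant documents.
# Columns: three illustrative groups of cases (an assumed grouping).
counts = [
    [120, 40, 40],
    [300, 200, 100],
]

stat, df = chi_square(counts)
print(f"chi2 = {stat:.3f}, df = {df}, p = {p_value_df2(stat):.3g}")
```

With two degrees of freedom the p-value has the simple closed form exp(-χ²/2), so a statistic as large as the one reported in the caption lies far below the 0.005 threshold.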
The LREC paper, “Experiments to investigate the connection between case distribution and topical relevance of search terms in an information retrieval setting”, is co-authored with Hercules Dalianis from Euroling and Bart Jongejan from CST. It will soon be available in the eprints archive and is also presented on the Euroling blog.