Linda Eriksson: Better Aggregation of Features in Text

On this day Linda Eriksson defended her M.Sc. thesis. She tested whether the difficult task of authorship attribution might be improved by aggregating textual features sequentially rather than by averaging them over a text. The hypothesis, based on experiments made by Gunnar Eriksson and myself a few years ago, is that author traits and habits might be distinguishable as patterns over a text. Sentence length is an example. Typically it is studied by taking the average sentence length of a text: one author might tend toward short sentences, another toward long ones (over 20 words). Well and fine, but this averaging hides information. This study instead investigated whether a pattern of sentence lengths over a window of up to five sentences might be specific to an author. Two authors might have the same average but very different distributions: one might have a varied distribution (7, 8, 20, 20, 5), another a more consistent one (10, 13, 12, 15, 10). The results were not entirely conclusive, but did show an encouraging difference between authors. This direction needs more study!
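To make the sentence-length example concrete, here is a minimal Python sketch using the two five-sentence sequences quoted above. It is only an illustration, not the method used in the thesis: the function name `length_windows` is hypothetical, and the standard deviation is used here merely as one simple way to show that two sequences with the same average can differ in shape.

```python
from statistics import mean, pstdev


def length_windows(lengths, size=5):
    """Return every run of `size` consecutive sentence lengths, in order.

    Hypothetical helper for illustration; the thesis may have built
    its windows differently.
    """
    return [tuple(lengths[i:i + size]) for i in range(len(lengths) - size + 1)]


# The two example sequences from the text (words per sentence).
varied = [7, 8, 20, 20, 5]
steady = [10, 13, 12, 15, 10]

# Averaging makes the two authors look identical ...
print(mean(varied), mean(steady))

# ... while the spread within the window tells them apart.
print(pstdev(varied), pstdev(steady))

# Sequential aggregation keeps the order of lengths, here with windows of three.
print(length_windows(varied, size=3))
```

Both sequences average 12 words per sentence, so a feature based on the mean alone cannot separate them, whereas the ordered windows preserve the rhythm of short and long sentences.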