Amaru Cuba Gyllensten: Automatic Lexicon Extraction on Random Index Word Spaces using Small Seed Lexica

Today, Amaru Cuba Gyllensten, who has been doing his graduation project at Gavagai, presented and defended his M.Sc. thesis in Computer Science at KTH. He has worked on finding correspondences, or isomorphisms, between two semantic spaces trained on comparable corpora, potentially in two different languages. The idea is to select a small number of base vectors, project the source semantic space into a lower-dimensional space, and then project up again into the target space. If successful, this would yield a procedure for extracting bilingual lexica from such comparable semantic spaces. Unfortunately, Amaru was able to show that this does not work: the work was well done but gave a negative result. Being optimistic, we still believe some aspects of the approach can be salvaged.
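
To make the projection idea concrete, here is a minimal sketch in Python with NumPy. It is not the method from the thesis. It is one simple way such a seed-based mapping could be set up, under the assumption that each word is re-described by its cosine similarity to the seed words of its own space, and that source and target words are then matched by comparing those seed coordinates. The function names (`unit_rows`, `seed_coordinates`, `extract_lexicon`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def unit_rows(m):
    """Scale every row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def seed_coordinates(space, seed_rows):
    """Describe each word by its cosine similarity to the k seed words of its space."""
    space = unit_rows(space)
    return space @ space[seed_rows].T

def extract_lexicon(source, target, seed_pairs):
    """For each source row, return the target row with the most similar seed coordinates.

    seed_pairs is a list of (source_row, target_row) index pairs: the seed lexicon.
    """
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    src_low = unit_rows(seed_coordinates(source, src_seeds))
    tgt_low = unit_rows(seed_coordinates(target, tgt_seeds))
    return np.argmax(src_low @ tgt_low.T, axis=1)

# Toy data (an assumption, not real corpora): the target space is an exact
# rotation of the source space with its rows shuffled.
rng = np.random.default_rng(0)
n_words, dim, n_seeds = 50, 20, 10
source = rng.standard_normal((n_words, dim))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
perm = rng.permutation(n_words)
target = (source @ rotation)[perm]

expected = np.argsort(perm)  # target row that corresponds to each source row
seed_pairs = [(i, int(expected[i])) for i in range(n_seeds)]
predicted = extract_lexicon(source, target, seed_pairs)
print("accuracy on toy data:", np.mean(predicted == expected))
```

The toy target space is an exact rotation of the source space, so the seed coordinates agree by construction. Spaces trained on real comparable corpora offer no such guarantee, which is consistent with the negative result reported here.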