On the 8th of June, Sara Stymne, from the Department of Linguistics and Philology at Uppsala University, presented her work in Umeå on learning dependency parsers from heterogeneous treebanks.
We thank her warmly for a very interesting talk!
Abstract: I will discuss how dependency parsing can be improved for languages with more than one available treebank. I will start by introducing the task of dependency parsing and describing a state-of-the-art parsing algorithm based on recurrent neural networks. I will then go on to describe the Universal Dependencies resources, where it is common for several treebanks to be available for a given language, differing in language variant, domain, genre, and annotation style. For data-driven dependency parsers, performance typically improves with more training data. The divergence between different treebanks, however, can make it difficult to take advantage of all available data. The simple approach of concatenating all available treebanks often does not work well, although it can be improved by fine-tuning. In this talk, I will focus on how the concept of language embeddings, which have been used to represent different languages in cross-lingual systems, can be extended to the monolingual case, where we use a similar framework at the treebank level, in the form of treebank embeddings. I will show that this framework leads to large improvements for several languages. I will also briefly discuss initial work on extending the framework to combine several corpora for one language with corpora for related languages.
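To give a flavour of the treebank-embedding idea mentioned in the abstract: each training treebank gets a small learned vector that is concatenated to every word representation, so a single parser can be trained on the concatenation of all treebanks while still being told which annotation style each sentence comes from. The sketch below is a minimal illustration, not the implementation from the talk; the treebank names, dimensions, and function names are assumptions, and real word embeddings and the parser's BiLSTM encoder are stood in for by random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

WORD_DIM = 100      # dimensionality of word embeddings (illustrative)
TREEBANK_DIM = 12   # dimensionality of treebank embeddings (illustrative)

# One embedding per treebank of the same language, e.g. two Swedish
# UD treebanks; these vectors would be learned jointly with the parser.
treebanks = ["sv_talbanken", "sv_lines"]
treebank_emb = {tb: rng.normal(size=TREEBANK_DIM) for tb in treebanks}

def encode_sentence(word_vectors, treebank_id):
    """Concatenate the treebank embedding to every word vector.

    In a real parser, the augmented matrix would feed into the
    BiLSTM encoder; here we simply return it.
    """
    tb_vec = treebank_emb[treebank_id]
    return np.hstack([word_vectors, np.tile(tb_vec, (len(word_vectors), 1))])

# A 3-word sentence, with random vectors standing in for word embeddings.
sent = rng.normal(size=(3, WORD_DIM))
augmented = encode_sentence(sent, "sv_talbanken")
print(augmented.shape)  # each word now carries WORD_DIM + TREEBANK_DIM features
```

At parsing time, the treebank embedding also lets the user choose which annotation style the output should follow, simply by selecting which treebank vector to attach.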