POS-tagging a bilingual parallel corpus: methods and challenges

Irene Doval
Keywords: Multilingual resources, parallel corpus, corpus annotation, POS tagging, tagset, corpus building


Abstract – This paper reviews the author’s experience of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fictional texts. The work is part of an ongoing effort to annotate the corpus with part-of-speech information, and the paper discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly on fictional data; on the other, pre-existing annotation schemes are all language-specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified the major error patterns.

Author Biography

Irene Doval
University of Santiago de Compostela / Spain