POS-tagging a bilingual parallel corpus: Methods and challenges
Abstract – This paper reviews the author’s experience of tokenizing and POS-tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fictional texts. The work is part of an ongoing effort to annotate the corpus with part-of-speech information, and the study discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly on fictional data; on the other, the pre-existing annotation schemes are all language-specific. To improve accuracy during post-editing, the author has developed a common tagset for both languages and identified the major error patterns.
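The idea of a common tagset and of error-pattern identification can be illustrated with a minimal sketch. It assumes STTS-style tags for German and TreeTagger-style tags for Spanish as input; the coarse categories, the two mapping tables, and the helper names `to_common` and `confusion_counts` are hypothetical and are not the tagset or tooling actually developed for PaGeS.

```python
from collections import Counter

# Hypothetical coarse categories; the corpus's actual common tagset is not reproduced here.
DE_TO_COMMON = {
    "NN": "NOUN", "NE": "NOUN",        # common and proper nouns
    "ADJA": "ADJ", "ADJD": "ADJ",      # attributive and predicative adjectives
    "VVFIN": "VERB", "VVINF": "VERB",  # finite and infinitive full verbs
    "ART": "DET", "APPR": "ADP",       # articles, prepositions
}
ES_TO_COMMON = {
    "NC": "NOUN", "NP": "NOUN",
    "ADJ": "ADJ",
    "VLfin": "VERB", "VLinf": "VERB",
    "ART": "DET", "PREP": "ADP",
}


def to_common(tagged, mapping):
    """Map (token, tag) pairs to the common tagset; unmapped tags become 'X'."""
    return [(token, mapping.get(tag, "X")) for token, tag in tagged]


def confusion_counts(gold, predicted):
    """Count (gold_tag, predicted_tag) mismatches to surface frequent error patterns."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    return Counter(
        (g_tag, p_tag)
        for (_, g_tag), (_, p_tag) in zip(gold, predicted)
        if g_tag != p_tag
    )


# Usage: the same sentence in both languages ends up with comparable tags.
german = [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN")]
spanish = [("El", "ART"), ("perro", "NC"), ("duerme", "VLfin")]
german_common = to_common(german, DE_TO_COMMON)
spanish_common = to_common(spanish, ES_TO_COMMON)
```

Once both sides share one set of categories, tagger output can be compared against a manually corrected sample with `confusion_counts`, and the most frequent mismatched pairs indicate where post-editing effort is best spent.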