The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora

Anna Čermáková; Jarmo Jantunen; Tommi Jauhiainen; John Kirk; Michal Křen; Marc Kupietz; Elaine Uí Dhonnchadha

doi:10.32714/ricl.09.01.06

Authors

Anna Čermáková Charles University in Prague https://orcid.org/0000-0001-8597-520X
Jarmo Jantunen University of Jyväskylä https://orcid.org/0000-0002-2600-5382
Tommi Jauhiainen University of Helsinki https://orcid.org/0000-0002-6474-3570
John Kirk University of Vienna https://orcid.org/0000-0002-7792-3191
Michal Křen Charles University https://orcid.org/0000-0002-8492-263X
Marc Kupietz Institut für Deutsche Sprache, Mannheim https://orcid.org/0000-0001-8997-8256
Elaine Uí Dhonnchadha Trinity College Dublin https://orcid.org/0000-0003-3448-4288

DOI:

https://doi.org/10.32714/ricl.09.01.06

Keywords:

ICC corpus, contrastive linguistics, comparable corpus, ICE corpus, data sustainability, copyright

Abstract

This paper reports on the efforts of twelve national teams to build the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which serves as the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, the ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.

ISSN: 2243-4712
