Challenges of combining structured and unstructured data in corpus development

Keywords: structured data, unstructured data, metadata, rich data, corpus annotation, corpus design


Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video and audio, as well as with structured metadata, poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process; these discussions are summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help sustain the conversation on the theoretical and methodological challenges of corpus compilation.





How to Cite
Säily, T., & Tyrkkö, J. (2021). Challenges of combining structured and unstructured data in corpus development. Research in Corpus Linguistics, 9(1), i-viii.