Introduction: Innovation in spoken corpus linguistics

Keywords: spoken corpora, corpus design, corpus construction, transcription, representativeness

Abstract

Over the decades, technological advancements have substantially improved the efficiency and scope of spoken corpus compilation, but many challenges remain, both practical and theoretical, that constrain 1) the quality of spoken corpus data, 2) the scale at which spoken corpora can be compiled, and 3) the authenticity with which spoken language is represented in textual form. This special issue presents eight studies that address contemporary innovations in spoken corpus design, data collection, processing, and analysis, covering a range of speech contexts and varieties. The studies focus on registers including online workplace meetings, casual conversation, oral histories, oral proficiency interviews, and YouTube vlogs. Innovations include the integration of automated transcription tools, multimodal annotation schemes, creative participant recruitment methods, and developments in natural language processing (NLP). Three contributions offer critical reconceptualisations of traditional approaches to spoken corpus design, proposing strategies to improve the authenticity of spoken corpora.

Published

2024-09-17

How to Cite

Love, R. (2024). Introduction: Innovation in spoken corpus linguistics. Research in Corpus Linguistics, 12(2), i–viii. https://doi.org/10.32714/ricl.12.02.01