Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2

Keywords: audio-to-text alignment, anonymisation, corpus compilation, spoken corpora, prosody, Praat

Abstract

This article describes key challenges of preparing and releasing audio material for spoken data and proposes solutions to them. We draw on our experience of compiling the new London–Lund Corpus 2 (LLC-2), whose transcripts are released together with the audio files. Making the audio material publicly available required careful consideration of how best to 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was achieved by inserting timestamps in front of speaker turns at the transcription stage; as we show in the article, these timestamps may later serve as a valuable complement to more robust automatic segmentation. Second, anonymisation was carried out by means of a Praat script, which replaced all personal information with a sound that made the lexical content incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus: it allows users to extend the corpus data in line with their own research interests and thus broadens the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.
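
As a rough illustration of how timestamps placed in front of speaker turns can be turned into coarse audio segments, consider the minimal Python sketch below. The turn format `[HH:MM:SS] SPEAKER: text`, the `Turn` class and the `parse_turns` function are assumptions made for this example only; they are not the LLC-2 markup or the authors' tooling. Each turn is assumed to end where the next one starts, which ignores overlapping speech and pauses.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical turn format: "[HH:MM:SS] SPEAKER: text".
# The actual LLC-2 transcription markup may differ.
TURN_PATTERN = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(\w+):\s*(.*)$")


@dataclass
class Turn:
    start: int            # seconds from the start of the recording
    end: Optional[int]    # start of the next turn; None for the last turn
    speaker: str
    text: str


def parse_turns(lines: List[str]) -> List[Turn]:
    """Convert timestamped transcript lines into coarse audio segments."""
    turns: List[Turn] = []
    for line in lines:
        match = TURN_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that carry no turn-initial timestamp
        hours, minutes, seconds, speaker, text = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        turns.append(Turn(start, None, speaker, text))
    # Approximate each turn's end by the start of the following turn.
    for current, following in zip(turns, turns[1:]):
        current.end = following.start
    return turns


if __name__ == "__main__":
    sample = [
        "[00:00:05] A: so how was the trip",
        "[00:00:09] B: oh it was lovely",
    ]
    for turn in parse_turns(sample):
        print(turn)
```

Segments of this kind are only approximate; in a workflow like the one described above, they could at most narrow the search window for an automatic segmentation tool rather than replace it.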

Published
2021-06-07
How to Cite
Põldvere, N., Frid, J., Johansson, V., & Paradis, C. (2021). Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2. Research in Corpus Linguistics, 9(1), 35–62. https://doi.org/10.32714/ricl.09.01.04