Corpus as a slice of life: Representing naturally occurring language and its speakers
Abstract
Discourse is subject to numerous forces that shape its form. One force that is often underestimated is the interactional dynamic among interlocutors. In devising the criteria that inform data selection for a corpus of spoken discourse, designers may prioritize the collection of spontaneous discourse and overlook the fact that this type of discourse can still display artificial interactional dynamics. We propose an approach to spoken corpus compilation that aims to preserve naturally occurring interactional dynamics by making the representation of participants’ lives the focus of the corpus. Through the analysis of speech events collected in different projects, we demonstrate the advantages of sourcing naturally occurring discourse over spontaneous data. We then discuss a series of practices that we implemented in different contexts to ensure the collection of naturally occurring data. We argue that this framework yields corpora that are representative not only of a language, but also of the lives of its users.
Copyright (c) 2024 Research in Corpus Linguistics
This work is licensed under a Creative Commons Attribution 4.0 International License.