Addressing comparability and retrieval issues in conversation corpora: A case study on the Spoken British National Corpora (1994 and 2014), using the past perfect
Abstract
This paper addresses issues in comparison and analysis of conversation corpora. We focus on the demographically-sampled spoken portions of the British National Corpora (BNC), representing British English in 1994 and 2014, for the purposes of studying recent language change and sociolinguistic variation. Issues of comparability and representativeness of the two BNCs have been raised before (see Love 2020), with several measures taken to ensure backwards compatibility of the Spoken BNC2014 with its 1994 counterpart. However, we believe further considerations and solutions merit attention, relating to sampling, transcription, annotation, and corpus querying. The BNClab subcorpus (Brezina et al. 2018a), a sociolinguistic judgment sample derived from the parent BNCs, provides a very promising basis for analysis, although arguably its mixed geographical representativeness affects cross-time comparability. To address this, we make some proposals for modifying the BNClab subcorpus to improve comparability. Then, we use the modified sample to address issues in retrieval and quantification of grammatical constructions in the spoken BNCs, namely a) determining an appropriate frequency metric, b) retrieving a comprehensive but manageable set of examples from ‘messy’ spoken data, and c) handling transcription inaccuracies. Finally, we discuss the case study findings and wider methodological implications for users of these corpora.
References
Anderwald, Lieselotte. 2002. Negation in Non-standard British English: Gaps, Regularizations and Asymmetries. New York: Routledge.
Atkinson, Will. 2015. Class. Cambridge: Polity Press.
Axelsson, Karin. 2018. Canonical tag questions in contemporary British English. In Vaclav Brezina, Robbie Love and Karin Aijmer eds., 96–119.
Baker, Paul. 2023. A year to remember? Introducing the BE21 corpus and exploring recent part of speech tag change in British English. International Journal of Corpus Linguistics 28/3: 407–429.
Ball, Catherine. 1994. Automated text analysis: Cautionary tales. Literary and Linguistic Computing 9: 295–302.
Beal, Joan. 2010. An Introduction to Regional Englishes: Dialect Variation in England. Edinburgh: Edinburgh University Press.
Biber, Douglas. 1993. Representativeness in corpus design. Literary and Linguistic Computing 8/4: 243–257.
Bowie, Jill, Sean Wallis and Sebastian Aarts. 2013. The perfect in spoken British English. In Sebastian Aarts, Joanne Close, Geoffrey Leech and Sean Wallis eds. The Verb Phrase in English: Investigating Recent Language Change with Corpora. Cambridge: Cambridge University Press, 318–352.
Brezina, Vaclav and Miriam Meyerhoff. 2014. Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics 19/1: 1–28.
Brezina, Vaclav and William Platt. 2024. #LancsBox X. http://lancsbox.lancs.ac.uk/ (accessed 5 May 2023.)
Brezina, Vaclav, Dana Gablasova and Susan Reichelt. 2018a. BNClab. http://corpora.lancs.ac.uk/bnclab (accessed 5 May 2023.)
Brezina, Vaclav, Robbie Love and Karin Aijmer eds. 2018b. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014. New York: Routledge.
Brezina, Vaclav, Abi Hawtin and Tony McEnery. 2021. The written British National Corpus 2014 – design and comparability. Text and Talk 41/5–6: 595–615.
Burnard, Lou. 2007. Reference Guide for the British National Corpus (XML edition). http://www.natcorp.ox.ac.uk/docs/URG/ (accessed 5 May 2023.)
Coleman, John, Mark Liberman, Greg Kochanski, Lou Burnard and Jiahong Yuan. 2011. Mining a year of speech. In Proceedings from the Workshop of New Tools and Methods for Very-Large-Scale Phonetics Research, 16–19. http://www.phon.ox.ac.uk/jcoleman/MiningVLSP.pdf (accessed 5 May 2023.)
Crowdy, Steve. 1993. Spoken corpus design. Literary and Linguistic Computing 8/4: 259–265.
Curry, Niall, Robbie Love and Olivia Goodman. 2022. Adverbs on the move: Investigating publisher application of corpus research on recent language change to ELT coursebook development. Corpora 17/1: 1–38.
Declerck, Renaat. 2006. The Grammar of the English Verb Phrase. Volume 1: The Grammar of the English Tense System: A Comprehensive Analysis. Berlin: Mouton de Gruyter.
Denison, David. 1993. English Historical Syntax: Verbal Constructions. London: Longman.
Depraetere, Ilse and Chad Langford. 2019. Advanced English Grammar: A Linguistic Approach. London: Bloomsbury.
Egbert, Jesse, Douglas Biber and Bethany Gray. 2022. Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness. Cambridge: Cambridge University Press.
Gablasova, Dana, Vaclav Brezina and Tony McEnery. 2017. Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning 67/1: 130–154.
Garside, Roger and Nicholas Smith. 1997. A hybrid grammatical tagger: CLAWS4. In Roger Garside, Geoffrey Leech and Anthony McEnery eds., 102–121.
Garside, Roger, Geoffrey Leech and Anthony McEnery eds. 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
Hardie, Andrew. 2012. CQPweb – Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17/3: 380–409.
Hofland, Knut, Anne Lindebjerg and Jørg Thunestvedt. 1999. ICAME Collection of English Language Corpora. Bergen: The HIT Centre.
Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee and Ylva Berglund-Prytz. 2008. Corpus linguistics with BNCweb – A Practical Guide. Frankfurt: Peter Lang.
Hoffmann, Sebastian and Sabine Arndt-Lappe. 2021. Better data for more researchers: Using the audio features of BNCweb. ICAME Journal 45: 125–154.
Horvath, Barbara. 2013. Ways of observing: Studying the interplay of social and linguistic variation. In Christine Mallinson, Becky Childs and Gerard Van Herk eds. Data Collection in Sociolinguistics: Methods and Applications. New York: Routledge. https://doi.org/10.4324/9780203136065
Huddleston, Rodney and Geoffrey Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Ishihara, Noriko. 2003. “I wish I would have known!”: The usage of would have in past counterfactual if- and wish-clauses. Issues in Applied Linguistics 14/1: 21–48.
Jucker, Andreas, Gerold Schneider, Irma Taavitsainen and Barb Breustedt. 2008. Fishing for compliments: Precision and recall in corpus-linguistic compliment research. In Andreas Jucker and Irma Taavitsainen eds. Speech Acts in the History of English. Amsterdam: John Benjamins, 273–294.
Lavandera, Beatriz. 1978. Where does the sociolinguistic variable stop? Language in Society 7/2: 171–182.
Leech, Geoffrey and Nicholas Smith. 2000. Manual to Accompany the British National Corpus (Version 2) with Improved Word-class Tagging. https://ucrel.lancs.ac.uk/bnc2/bnc2postag_manual.htm (accessed 5 May 2023.)
Leech, Geoffrey and Nicholas Smith. 2005. Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB. ICAME Journal 29: 83–98.
Leech, Geoffrey, Marianne Hundt, Christian Mair and Nicholas Smith. 2009. Change in Contemporary English: A Grammatical Study. Cambridge: Cambridge University Press.
Love, Robbie. 2020. Overcoming Challenges in Corpus Construction: The Spoken British National Corpus 2014. New York: Routledge.
Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina and Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics 22/3: 319–344.
McEnery, Tony, Robbie Love and Vaclav Brezina. 2017. Introduction: Compiling and analysing the Spoken British National Corpus 2014. International Journal of Corpus Linguistics 22/3: 311–318.
Milroy, Lesley and Matthew Gordon. 2003. Sociolinguistics: Methods and Interpretation. Oxford: Blackwell.
Mindt, Dieter. 2000. An Empirical Grammar of the English Verb System. Berlin: Cornelsen.
Reichelt, Susan. 2021. Recent developments of the pragmatic markers kind of and sort of in spoken British English. English Language & Linguistics 25/3: 563–580.
Rühlemann, Christoph. 2007. Conversation in Context: A Corpus-driven Approach. London: Continuum.
Sankoff, David. 2005. Problems of representativeness. In Ulrich Ammon, Norbert Dittmar, Klaus Mattheier and Peter Trudgill eds. Sociolinguistics: An International Handbook of the Science of Language and Society. Berlin: Walter de Gruyter, 998–1002.
Sankoff, David and Susan Laberge. 1978. The linguistic market and the statistical explanation of variability. In David Sankoff ed. Linguistic Variation: Models and Methods. New York: Academic Press, 239–250.
Schilling-Estes, Natalie. 2007. Sociolinguistic fieldwork. In Robert Bayley and Ceil Lucas eds. Sociolinguistic Variation: Theories, Methods, and Applications. Cambridge: Cambridge University Press, 165–190.
Smith, Nicholas. 1997. Improving a tagger. In Roger Garside, Geoffrey Leech and Anthony McEnery eds., 137–150.
Smith, Nicholas and Cathleen Waters. 2019. Variation and change in a specialized register: A comparison of random and sociolinguistic sampling outcomes in Desert Island Discs. International Journal of Corpus Linguistics 24/2: 169–201.
Sönning, Lukas and Manfred Krug. 2022. Comparing study designs and down-sampling strategies in corpus analysis: The importance of speaker metadata in the BNCs of 1994 and 2014. In Ole Schützler and Julia Schlüter eds. Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge: Cambridge University Press, 127–160.
Tagliamonte, Sali. 2006. Analysing Sociolinguistic Variation. Cambridge: Cambridge University Press.
Timmis, Ivor. 2005. Towards a framework for teaching spoken grammar. ELT Journal 59/2: 117–125.
Trask, R.L. 1993. A Dictionary of Grammatical Terms in Linguistics. New York: Routledge.
Yao, Xinyue and Peter Collins. 2013. Recent change in non-present perfect constructions in British and American English. Corpora 8/1: 115–135.
Copyright (c) 2024 Research in Corpus Linguistics
This work is licensed under a Creative Commons Attribution 4.0 International License.