Building LANA-CASE, a spoken corpus of American English conversation: Challenges and innovations in corpus compilation

Elizabeth Hanks; Isobelle Clarke; Gavin Brookes; Vaclav Brezina; Paul Baker; Randi Reppen; Douglas Biber; Tove Larsson; Jesse Egbert; Tony McEnery; Raffaella Bottini

doi:10.32714/ricl.12.02.03

Authors

Elizabeth Hanks Northern Arizona University https://orcid.org/0000-0001-9987-4485
Isobelle Clarke Lancaster University https://orcid.org/0000-0001-5541-6327
Gavin Brookes Lancaster University https://orcid.org/0000-0003-0726-2559
Vaclav Brezina Lancaster University https://orcid.org/0000-0002-1613-6100
Paul Baker Lancaster University https://orcid.org/0000-0001-6743-4020
Randi Reppen Northern Arizona University https://orcid.org/0000-0001-5657-9195
Douglas Biber Northern Arizona University https://orcid.org/0000-0002-7024-505X
Tove Larsson Northern Arizona University https://orcid.org/0000-0002-0489-2697
Jesse Egbert Northern Arizona University https://orcid.org/0000-0002-3751-2865
Tony McEnery Lancaster University https://orcid.org/0000-0002-8425-6403
Raffaella Bottini Lancaster University https://orcid.org/0000-0003-2142-2184

DOI:

https://doi.org/10.32714/ricl.12.02.03

Keywords:

spoken corpora, conversation, corpus compilation, LANA-CASE

Abstract

The Lancaster-Northern Arizona Corpus of Spoken American English (LANA-CASE) is a collaborative project between Lancaster University and Northern Arizona University to create a publicly available, large-scale corpus of American English conversation. In this article, we describe the design of LANA-CASE in terms of the challenges that have arisen and how these have been addressed – including decisions related to operationalizing the domain, sampling the data, recruiting participants, and selecting instruments for data collection. In addressing these challenges, we were able to draw on and further develop strategies established in the creation of other spoken corpora (including the British English counterpart to LANA-CASE, the Spoken British National Corpus 2014) as well as to implement recent theoretical and technical innovations related to each step. We hope that this discussion can inform future projects focused on the design and construction of spoken corpora.

Published

2024-02-29 — Updated on 2025-02-02

Versions

2025-02-02 (2)
2024-02-29 (1)

How to Cite

Hanks, E., Clarke, I., Brookes, G., Brezina, V., Baker, P., Reppen, R., … Bottini, R. (2025). Building LANA-CASE, a spoken corpus of American English conversation: Challenges and innovations in corpus compilation. Research in Corpus Linguistics, 12(2), 24–44. https://doi.org/10.32714/ricl.12.02.03 (Original work published February 29, 2024)

Issue

Vol. 12 No. 2 (2024): Special Issue "Innovations in the compilation and analysis of spoken corpora"

Section

Articles

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Submission of your paper to this journal implies that the paper is not under submission for publication elsewhere. Material that has been previously copyrighted, published, or accepted for publication will not be considered for publication in this journal. Submission of a manuscript is interpreted as a statement of certification that no part of the manuscript is copyrighted by any other publisher or under review by any other formal publication. By submitting your manuscript to us, you agree to these copyright guidelines. It is your responsibility to ensure that your manuscript does not cause any copyright infringement, defamation, or other problems.

Submitted papers are assumed to contain no proprietary material unprotected by patent or patent application; responsibility for technical content and for protection of proprietary material rests solely with the author(s) and their organizations and is not the responsibility of the journal or its editorial staff. The main author is responsible for ensuring that the article has been seen and approved by all the other authors. It is the responsibility of the author to obtain all necessary copyright release permissions for the use of any copyrighted materials in the manuscript prior to submission.

Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution (CC BY) License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.

Article submission implies author agreement with this policy.

ISSN: 2243-4712