Compiling a corpus of African American Language from oral histories

Keywords: African American English, oral history, automatic speech recognition, natural language processing, corpus linguistics, morphosyntax

Abstract

African American Language (AAL) is a marginalized variety of American English that has been understudied due to a lack of accessible data. This lack of data has made it difficult to research language in African American communities and has been shown to cause emerging technologies such as Automatic Speech Recognition (ASR) to perform worse for African American speakers. To address this gap, the Joel Buchanan Archive of African American Oral History (JBA) at the University of Florida is being compiled into a time-aligned and linguistically annotated corpus. Using Natural Language Processing (NLP) techniques, this project will automatically time-align spoken data with transcripts and automatically tag AAL features. Challenges in transcription and time-alignment have arisen in the effort to represent AAL morphosyntactic and phonetic structure accurately. Two linguistic studies illustrate how the African American Corpus from Oral Histories improves our understanding of this lesser-studied variety.

Published
2024-04-25
How to Cite
Moeller, S., Davis, A., Previlon, W., Bottini, M., & Tang, K. (2024). Compiling a corpus of African American Language from oral histories. Research in Corpus Linguistics, 12(2), 45–79. https://doi.org/10.32714/ricl.12.02.04