Recent trends in corpus design and reporting: A methodological synthesis

Keywords: sampling; corpus design; methodological synthesis; methodological reporting practices; representativeness


Methodological design is a central issue for researchers in corpus linguistics. To better understand trends in the types of corpora used in corpus linguistics research articles and in the reporting of important aspects of corpus design, this study analyzes 709 descriptions of corpora from research published in corpus journals between 2010 and 2019. Each article was manually coded by two trained coders for aspects of corpus design, such as the population definition, sampling method, and sample size. Additionally, the study identifies missing information in corpus reporting. Our results show trends in corpus design, such as an increased use of spoken corpora. We also observe some robust sampling methodology and slight improvements in reporting practices over time. Overall, the corpora observed in the data are highly diverse in characteristics such as size. However, our results also show widespread underreporting of important corpus design choices and features, such as sampling methods or the number of texts, even in newly constructed corpora. We therefore offer suggestions for authors, reviewers, and editors on ways to improve reporting practices in empirical corpus linguistics studies.


How to Cite
Hashimoto, B., & Nelson, K. (2024). Recent trends in corpus design and reporting: A methodological synthesis. Research in Corpus Linguistics, 12(1), 59–88.