Recent trends in corpus design and reporting: A methodological synthesis
DOI:
https://doi.org/10.32714/ricl.12.01.03
Keywords:
sampling; corpus design; methodological synthesis; methodological reporting practices; representativeness
Abstract
Methodological design is a central issue for researchers in corpus linguistics. To better understand trends in the reporting of important aspects of corpus design and in the types of corpora used in corpus linguistics research articles, this study analyzes 709 descriptions of corpora from research published in corpus journals between 2010 and 2019. Each article was manually coded by two trained coders for aspects of corpus design, such as the population definition, sampling method, and sample size. Additionally, the study identifies missing information in corpus reporting. Our results show trends in corpus design, such as an increased use of spoken corpora. We also observe some robust sampling methodology and slight improvements in reporting practices over time. Overall, the corpora observed in the data are highly diverse, for example in size. However, our results also show widespread underreporting of important corpus design choices and features, such as sampling methods or the number of texts, even in newly constructed corpora. We therefore provide suggestions for authors, reviewers, and editors on ways to improve reporting practices in empirical corpus linguistics studies.
Versions
- 2025-02-02 (2)
- 2024-04-19 (1)
License
Copyright (c) 2024 Research in Corpus Linguistics

This work is licensed under a Creative Commons Attribution 4.0 International License.