A new approach to (key) keywords analysis: Using frequency, and now also dispersion

Keywords: keyness, dispersion, frequency, association, Clinton-Trump Corpus, British National Corpus


A widely used method in corpus-linguistic approaches to discourse analysis, register/text type/genre analysis, and educational/curriculum questions is keywords analysis, a simple statistical method that aims to identify words that are key to, i.e. characteristic of, certain discourses, text types, or topic domains. The vast majority of keywords analyses have relied on the same statistical measure that most collocation studies use, the log-likelihood ratio, which is computed from frequencies of occurrence in the two corpora under consideration. In a recent paper, Egbert and Biber (2019) advocated a different approach, one that computes log-likelihood ratios for word types based on the range of their distribution (the number of texts in which they occur) rather than their frequencies in the target and reference corpora. In this paper, I argue that their approach is a most welcome addition to keywords analysis but can still be profitably extended by utilizing both frequency and dispersion for keyness computations. I present a new two-dimensional approach to keyness and exemplify it on the basis of the Clinton-Trump Corpus and the British National Corpus.
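
To make the contrast between frequency-based and range-based keyness concrete, the following is a minimal sketch of the standard log-likelihood ratio (G2) for a 2x2 table. It is not the two-dimensional measure proposed in the paper. The function name, its parameters, and all counts are assumptions invented for illustration. The same function serves both variants: passing token counts gives frequency-based keyness, and passing counts of texts gives range-based keyness.

```python
import math


def log_likelihood_ratio(k_target, n_target, k_ref, n_ref):
    """Return G2 for a 2x2 table of (word, other) by (target, reference).

    k_* = observed count of the word; n_* = total units in that corpus.
    Units are tokens for frequency keyness or texts for range keyness.
    """
    observed = [
        [k_target, n_target - k_target],
        [k_ref, n_ref - k_ref],
    ]
    total = n_target + n_ref
    row_sums = [n_target, n_ref]
    col_sums = [k_target + k_ref, total - (k_target + k_ref)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = observed[i][j]
            # Expected count if the word were equally likely in both corpora
            exp = row_sums[i] * col_sums[j] / total
            if obs > 0:  # a zero cell contributes nothing to the sum
                g2 += obs * math.log(obs / exp)
    return 2.0 * g2


# Hypothetical counts, invented purely for illustration.
freq_keyness = log_likelihood_ratio(150, 100_000, 300, 1_000_000)  # tokens
range_keyness = log_likelihood_ratio(40, 200, 120, 4_000)          # texts
```

G2 itself is unsigned, so whether a word is key for the target or for the reference corpus has to be read off by comparing the two relative frequencies (or relative ranges) directly.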





Altman, Douglas G. and Patrick Royston. 2006. The cost of dichotomising continuous variables. BMJ 332/7549: 1080.

Baker, Paul. 2004. Querying keywords: Questions in difference, frequency, and sense in keyword analysis. Journal of English Linguistics 32/4: 346–359.

Biber, Douglas. 1988. Variation across Speech and Writing. Cambridge: Cambridge University Press.

Biber, Douglas and Jesse Egbert. 2018. Register Variation Online. Cambridge: Cambridge University Press.

Brown, David. 2016. Clinton-Trump Corpus. http://www.thegrammarlab.com/?nor-portfolio=corpus-of-presidential-speeches-cops-and-a-clintontrump-corpus

Burch, Brent, Jesse Egbert and Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3/2: 189–216.

Coxhead, Averil. 2000. A new academic word list. TESOL Quarterly 34/2: 213–238.

Cumberland, Phillippa M., Gabriela Czanner, Catey Bunce, Caroline J. Doré, Nick Freemantle and Marta García-Fiñana. 2014. Ophthalmic statistics note: The perils of dichotomising continuous variables. British Journal of Ophthalmology 98/6: 841–843.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19/1: 61–74.

Egbert, Jesse and Douglas Biber. 2019. Incorporating text dispersion into keyword analyses. Corpora 14/1: 77–104.

Gries, Stefan Th. 2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1/2: 277–294.

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13/4: 403–437.

Gries, Stefan Th. 2010. Dispersions and adjusted frequencies in corpora: Further explorations. In Stefan Th. Gries, Stefanie Wulff and Mark Davies eds. Corpus Linguistic Applications: Current Studies, New Directions. Amsterdam: Rodopi, 197–212.

Gries, Stefan Th. 2016. Quantitative Corpus Linguistics with R. New York: Routledge.

Gries, Stefan Th. 2018. Towards a Unified Tupleization of Corpus Linguistics. Invited plenary talk at the 56th Annual Meeting of the Association for Computational Linguistics. Georgia State University.

Gries, Stefan Th. 2019a. Ten Lectures on Corpus-linguistic Approaches: Applications for Usage-based and Psycholinguistic Research. Leiden: Brill.

Gries, Stefan Th. 2019b. 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics 24/3: 385–412.

Gries, Stefan Th. 2021. Analyzing dispersion. In Magali Paquot and Stefan Th. Gries eds. Practical Handbook of Corpus Linguistics. Berlin: Springer.

Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1/2: 263–275.

Lijffijt, Jefrey and Stefan Th. Gries. 2012. Correction to “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics 17/1: 147–149.

R Core Team. 2020. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org.

Scott, Mike. 1997. PC analysis of key words – and key words. System 25/2: 233–245.

Scott, Mike and Christopher Tribble. 2006. Textual Patterns: Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins.

Tribble, Christopher. 2002. Small corpora and teaching writing: Towards a corpus-informed pedagogy of writing. In Mohsen Ghadessy, Alex Henry and Robert L. Roseberry eds. Small Corpus Studies and ELT: Theory and Practice. Amsterdam: John Benjamins, 381–408.

Xiao, Zhonghua and Anthony McEnery. 2005. Two approaches to genre analysis: Three genres in Modern American English. Journal of English Linguistics 33/1: 62–82.

How to Cite
Gries, S. T. (2021). A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics, 9(2), 1-33. https://doi.org/10.32714/ricl.09.02.02