A new approach to (key) keywords analysis: Using frequency, and now also dispersion
DOI:
https://doi.org/10.32714/ricl.09.02.02
Keywords:
keyness, dispersion, frequency, association, Clinton-Trump Corpus, British National Corpus
Abstract
A widely used method in corpus-linguistic approaches to discourse analysis, register/text type/genre analysis, and educational/curriculum questions is keywords analysis, a simple statistical method that aims to identify words that are key to, i.e. characteristic of, certain discourses, text types, or topic domains. The vast majority of keywords analyses have relied on the same statistical measure that most collocation studies use, the log-likelihood ratio, which is computed from frequencies of occurrence in the two corpora under consideration. In a recent paper, Egbert and Biber (2019) advocated a different approach, one that computes log-likelihood ratios for word types based on the range of their distribution (the number of texts in which they occur) rather than their frequencies in the target and reference corpora. In this paper, I argue that their approach is a most welcome addition to keywords analysis but can still be profitably extended by utilizing both frequency and dispersion for keyness computations. I present a new two-dimensional approach to keyness and exemplify it on the basis of the Clinton-Trump Corpus and the British National Corpus.
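To make the two ingredients mentioned in the abstract concrete, the following minimal Python sketch computes a log-likelihood ratio (G²) for one word type twice: once from token frequencies and once from text range (the number of texts containing the word). It is an illustration under stated assumptions, not the measure proposed in the paper: the function names, the sign convention (positive when the word is relatively more common in the target corpus), and all counts are hypothetical.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a 2x2 table.

    a: occurrences of the word in the target corpus
    b: occurrences of the word in the reference corpus
    c: all other units in the target corpus
    d: all other units in the reference corpus
    """
    total = a + b + c + d
    word_total, other_total = a + b, c + d
    target_total, ref_total = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        word_total * target_total / total,
        word_total * ref_total / total,
        other_total * target_total / total,
        other_total * ref_total / total,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def signed_keyness(hits_target, size_target, hits_ref, size_ref):
    """G2, made negative when the word is relatively rarer in the target.

    For frequency-based keyness, hits are token counts and sizes are corpus
    sizes in tokens. For range-based keyness, hits are the number of texts
    containing the word and sizes are the number of texts in each corpus.
    """
    g2 = log_likelihood(
        hits_target, hits_ref, size_target - hits_target, size_ref - hits_ref
    )
    higher_in_target = hits_target / size_target >= hits_ref / size_ref
    return g2 if higher_in_target else -g2


# Hypothetical toy counts for a single word type (not taken from any corpus).
frequency_key = signed_keyness(50, 10_000, 10, 10_000)  # token counts
range_key = signed_keyness(12, 100, 9, 100)             # text counts
two_dimensional = (frequency_key, range_key)
```

Keeping the two values as a pair rather than collapsing them into one score reflects the general idea of treating frequency and dispersion as separate dimensions; how the paper itself quantifies each dimension should be taken from the article, not from this sketch.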
References
Altman, Douglas G. and Patrick Royston. 2006. The cost of dichotomising continuous variables. BMJ 332/7549: 1080.
Baker, Paul. 2004. Querying keywords: Questions in difference, frequency, and sense in keyword analysis. Journal of English Linguistics 32/4: 346–359.
Biber, Douglas. 1988. Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, Douglas and Jesse Egbert. 2018. Register Variation Online. Cambridge: Cambridge University Press.
Brown, David. 2016. Clinton-Trump Corpus. http://www.thegrammarlab.com/?nor-portfolio=corpus-of-presidential-speeches-cops-and-a-clintontrump-corpus
Burch, Brent, Jesse Egbert and Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3/2: 189–216.
Coxhead, Averil. 2000. A new academic word list. TESOL Quarterly 34/2: 213–238.
Cumberland, Phillippa M., Gabriela Czanner, Catey Bunce, Caroline J. Doré, Nick Freemantle and Marta García-Fiñana. 2014. Ophthalmic statistics note: The perils of dichotomising continuous variables. British Journal of Ophthalmology 98/6: 841–843.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19/1: 61–74.
Egbert, Jesse and Douglas Biber. 2019. Incorporating text dispersion into keyword analyses. Corpora 14/1: 77–104.
Gries, Stefan Th. 2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1/2: 277–294.
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13/4: 403–437.
Gries, Stefan Th. 2010. Dispersions and adjusted frequencies in corpora: Further explorations. In Stefan Th. Gries, Stefanie Wulff and Mark Davies eds. Corpus Linguistic Applications: Current Studies, New Directions. Amsterdam: Rodopi, 197–212.
Gries, Stefan Th. 2016. Quantitative Corpus Linguistics with R. New York: Routledge.
Gries, Stefan Th. 2018. Towards a Unified Tupleization of Corpus Linguistics. Invited plenary talk at the 56th Annual Meeting of the Association for Computational Linguistics. Georgia State University.
Gries, Stefan Th. 2019a. Ten Lectures on Corpus-linguistic Approaches: Applications for Usage-based and Psycholinguistic Research. Leiden: Brill.
Gries, Stefan Th. 2019b. 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics 24/3: 385–412.
Gries, Stefan Th. 2021. Analyzing dispersion. In Magali Paquot and Stefan Th. Gries eds. Practical Handbook of Corpus Linguistics. Berlin: Springer.
Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1/2: 263–275.
Lijffijt, Jefrey and Stefan Th. Gries. 2012. Correction to “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics 17/1: 147–149.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org.
Scott, Mike. 1997. PC analysis of key words – and key words. System 25/2: 233–245.
Scott, Mike and Christopher Tribble. 2006. Textual Patterns: Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins.
Tribble, Christopher. 2002. Small corpora and teaching writing: Towards a corpus-informed pedagogy of writing. In Mohsen Ghadessy, Alex Henry and Robert L. Roseberry eds. Small Corpus Studies and ELT: Theory and Practice. Amsterdam: John Benjamins, 381–408.
Xiao, Zhonghua and Anthony McEnery. 2005. Two approaches to genre analysis: Three genres in Modern American English. Journal of English Linguistics 33/1: 62–82.
License
Submission of your paper to this journal implies that the paper is not under submission for publication elsewhere. Material which has been previously copyrighted, published, or accepted for publication will not be considered for publication in this journal. Submission of a manuscript is interpreted as a statement of certification that no part of the manuscript is copyrighted by any other publisher or under review by any other formal publication. By submitting your manuscript to us, you agree to these copyright guidelines. It is your responsibility to ensure that your manuscript does not cause any copyright infringement, defamation, or other problems.
Submitted papers are assumed to contain no proprietary material unprotected by patent or patent application; responsibility for technical content and for protection of proprietary material rests solely with the author(s) and their organizations and is not the responsibility of the journal or its editorial staff. The main author is responsible for ensuring that the article has been seen and approved by all the other authors. It is the responsibility of the author to obtain all necessary copyright release permissions for the use of any copyrighted materials in the manuscript prior to the submission.
Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under the Creative Commons Attribution (CC BY) License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Article submission implies author agreement with this policy.