The Multi-Feature Tagger of English (MFTE): Rationale, description and evaluation

Authors

Elen Le Foll, Muhammad Shakir

DOI:

https://doi.org/10.32714/ricl.13.02.03

Keywords:

software, multivariable analysis, multivariate analysis, open source, corpus linguistics, corpus tool, multi-dimensional analysis, Python

Abstract

The Multi-Feature Tagger of English (MFTE) provides a transparent and easily adaptable open-source tool for multivariable analyses of English corpora. Designed to contribute to the greater reproducibility, transparency, and accessibility of multivariable corpus studies, it comes with a simple GUI and is available both as a richly annotated Python script and as an executable file. In this article, we detail its features and how they are operationalised. The default tagset comprises 74 lexico-grammatical features, ranging from attributive adjectives and progressives to tag questions and emoticons. An optional extended tagset covers more than 70 additional features, including many semantic features, such as human nouns and verbs of causation. We evaluate the accuracy of the MFTE on a sample of 60 texts from the BNC2014 and COCA, and report precision and recall metrics for all the features of the simple tagset. We outline how the use of a well-documented, open-source tool can contribute to improving the reproducibility and replicability of multivariable studies of English.
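
As an illustration of the per-feature precision and recall metrics mentioned above, the following minimal Python sketch shows one way such figures could be computed from tagger output that has been manually checked token by token. It is not the authors' evaluation script: the function name, the one-tag-per-token input format, and the feature labels in the toy sample are assumptions made purely for illustration.

```python
from collections import Counter


def precision_recall(pairs):
    """Compute per-feature precision and recall.

    `pairs` is an iterable of (tagger_label, gold_label) tuples, one per
    token. This input format is an assumption for illustration only.
    """
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for predicted, gold in pairs:
        if predicted == gold:
            true_pos[gold] += 1
        else:
            # The tagger assigned `predicted` wrongly and missed `gold`.
            false_pos[predicted] += 1
            false_neg[gold] += 1

    results = {}
    for feature in set(true_pos) | set(false_pos) | set(false_neg):
        assigned = true_pos[feature] + false_pos[feature]  # tagger said "feature"
        actual = true_pos[feature] + false_neg[feature]    # annotator said "feature"
        precision = true_pos[feature] / assigned if assigned else None
        recall = true_pos[feature] / actual if actual else None
        results[feature] = (precision, recall)
    return results


# Hypothetical toy sample; the labels are placeholders, not actual MFTE tags.
sample = [
    ("progressive", "progressive"),
    ("progressive", "attributive_adjective"),  # tagger error
    ("attributive_adjective", "attributive_adjective"),
    ("attributive_adjective", "attributive_adjective"),
    ("emoticon", "emoticon"),
]
scores = precision_recall(sample)
for feature, (precision, recall) in sorted(scores.items()):
    print(feature, precision, recall)
```

In this toy sample, one of the two tokens tagged as progressive is wrong, so precision for that feature is 0.5 while its recall is 1.0; the misclassified token in turn lowers the recall of the attributive adjective feature.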

Published

2024-11-30

How to Cite

Le Foll, E., & Shakir, M. (2024). The Multi-Feature Tagger of English (MFTE): Rationale, description and evaluation. Research in Corpus Linguistics, 13(2), 63–93. https://doi.org/10.32714/ricl.13.02.03