The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)

Lassi Saario; Tanja Säily; Samuli Kaislaniemi; Terttu Nevalainen

doi:10.32714/ricl.09.01.07

Authors

Lassi Saario University of Helsinki https://orcid.org/0000-0002-5936-7996
Tanja Säily University of Helsinki https://orcid.org/0000-0003-4407-8929
Samuli Kaislaniemi University of Eastern Finland https://orcid.org/0000-0002-3596-1341
Terttu Nevalainen University of Helsinki https://orcid.org/0000-0003-3088-4903

DOI:

https://doi.org/10.32714/ricl.09.01.07

Keywords:

corpus annotation, corpus markup, spelling normalisation, TEI-XML, part-of-speech tagging, Late Modern English

Abstract

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.

References

ARCHER = A Representative Corpus of Historical English Registers. 1990–1993/2002/2007/2010/2013. Originally compiled under the supervision of Douglas Biber and Edward Finegan at Northern Arizona University and University of Southern California; modified and expanded by subsequent members of a consortium of universities. https://www.projects.alc.manchester.ac.uk/archer/ (25 February, 2020.)

Baron, Alistair. 2011a. VARD 2. Computer program. http://ucrel.lancs.ac.uk/vard/ (25 February, 2020.)

Baron, Alistair. 2011b. Dealing with Spelling Variation in Early Modern English Texts. Lancaster: Lancaster University dissertation. https://eprints.lancs.ac.uk/id/eprint/84887/ (25 February, 2020.)

BNC = The British National Corpus, version 3 (BNC XML edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk (25 February, 2020.)

CEEC-400 = Corpora of Early English Correspondence. 2020. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Jukka Keränen, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/ (19 June, 2021.)

CEECE = Corpus of Early English Correspondence Extension. 2012. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/ (19 June, 2021.)

CLAWS. Computer program. Developed by UCREL at Lancaster University. http://ucrel.lancs.ac.uk/claws/ (25 February, 2020.)

Corcoran, Paul E. 1974. COCOA: A FORTRAN program for concordance and word-count processing of natural language texts. Behavior Research Methods & Instrumentation 6/6: 566.

Davies, Mark. 2019. Corpus-based studies of lexical and semantic variation: The importance of both corpus size and corpus design. In Carla Suhr, Terttu Nevalainen and Irma Taavitsainen eds. From Data to Evidence in English Language Research (Language and Computers 83). Leiden: Brill, 66–87.

Fligelstone, Steve, Mike Pacey and Paul Rayson. 1997. How to generalize the task of annotation. In Roger Garside, Geoffrey Leech and Anthony McEnery eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman, 122–136. http://ucrel.lancs.ac.uk/papers/CAB_CH08.pdf (25 February, 2020.)

Hardie, Andrew. 2012. CQPweb – Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17/3: 380–409.

Hardie, Andrew. 2014. Modest XML for corpora: Not a standard, but a suggestion. ICAME Journal 38: 73–103.

HC = The Helsinki Corpus of English Texts. 1991. Compiled by Matti Rissanen (Project leader), Merja Kytö (Project secretary); Leena Kahlas-Tarkka, Matti Kilpiö (Old English); Saara Nevanlinna, Irma Taavitsainen (Middle English); Terttu Nevalainen, Helena Raumolin-Brunberg (Early Modern English). Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/HelsinkiCorpus/ (19 June, 2021.)

Hiltunen, Turo, Joe McVeigh and Tanja Säily. 2017. How to turn linguistic data into evidence? In Turo Hiltunen, Joe McVeigh and Tanja Säily eds. Big and Rich Data in English Corpus Linguistics: Methods and Explorations (Studies in Variation, Contacts and Change in English 19). Helsinki: VARIENG. https://varieng.helsinki.fi/series/volumes/19/introduction.html (19 June, 2021.)

Hiltunen, Turo and Jukka Tyrkkö. 2013. Tagging Early Modern English Medical Texts (1500–1700). Presentation at The First Corpus Analysis with Noise in the Signal Workshop (CANS 2013), 22 July, Lancaster University, UK. http://ucrel.lancs.ac.uk/cans2013/abstracts/Hiltunen%20Tyrkk%C3%B6.pdf (25 February, 2020.)

Hoffmann, Sebastian. 2005. Grammaticalization and English Complex Prepositions: A Corpus-based Study. London: Routledge.

Huddleston, Rodney and Geoffrey K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.

Hundt, Marianne ed. 2014. Late Modern English Syntax. Cambridge: Cambridge University Press.

Hundt, Marianne and Geoffrey Leech. 2012. “Small is beautiful”: On the value of standard reference corpora for observing recent grammatical change. In Terttu Nevalainen and Elizabeth C. Traugott eds. The Oxford Handbook of the History of English. Oxford: Oxford University Press, 175–188.

Kaislaniemi, Samuli. 2018. The Corpus of Early English Correspondence Extension (CEECE). In Terttu Nevalainen et al. eds., 45–59.

Kaislaniemi, Samuli, Mel Evans, Teo Juvonen and Anni Sairio. 2017. ‘A graphic system which leads its own linguistic life’? Epistolary spelling in English, 1400–1800. In Tanja Säily et al. eds., 187–214.

Kroch, Anthony, Ann Taylor and Beatrice Santorini. 2000. The Penn-Helsinki Parsed Corpus of Middle English. Department of Linguistics: University of Pennsylvania.

Kroch, Anthony, Beatrice Santorini and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English. Department of Linguistics: University of Pennsylvania.

Kytö, Merja. 1996. Manual to the Diachronic Part of The Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts (third edition). Helsinki: Department of English, University of Helsinki. http://clu.uni.no/icame/manuals/HC/INDEX.HTM (25 February, 2020.)

Lu, Xiaofei. 2014. Computational Methods for Corpus Annotation and Analysis. New York: Springer.

Marttila, Ville. 2011. Helsinki Corpus TEI XML Edition Documentation. Helsinki: VARIENG. https://helsinkicorpus.arts.gla.ac.uk/display.py?fs=100&what=manual (25 February, 2020.)

Marttila, Ville. 2014. Creating Digital Editions for Corpus Linguistics: The Case of Potage Dyvers, a Family of Six Middle English Recipe Collections. Helsinki: University of Helsinki dissertation. http://urn.fi/URN:ISBN:978-951-51-0060-3 (25 February, 2020.)

Nevalainen, Terttu, Minna Palander-Collin and Tanja Säily eds. 2018. Patterns of Change in 18th-century English: A Sociolinguistic Approach. Amsterdam: John Benjamins.

Nurmi, Arja ed. 1998. Manual for the Corpus of Early English Correspondence Sampler, CEECS. Helsinki: Department of English, University of Helsinki. http://korpus.uib.no/icame/manuals/CEECS/ (25 February, 2020.)

PCEEC = Parsed Corpus of Early English Correspondence. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. http://hdl.handle.net/20.500.12024/2510 (25 February, 2020.)

Rayson, Paul, Dawn Archer, Alistair Baron, Jonathan Culpeper and Nicholas Smith. 2007. Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Matthew Davies, Paul Rayson, Susan Hunston and Pernilla Danielsson eds. Proceedings of Corpus Linguistics 2007, 27–30 July, University of Birmingham, UK, article 192. http://ucrel.lancs.ac.uk/publications/CL2007/ (25 February, 2020.)

Rissanen, Matti. 1989. Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16–19.

Rodríguez-Puente, Paula, Cristina Blanco-García and Iván Tamaredo. 2019. Mark-up and annotation in the Corpus of Historical English Law Reports (CHELAR): Potential for historical genre analysis. Journal of the Spanish Association of Anglo-American Studies 41/2: 63–84.

Russell, D. B. 1965. COCOA: A Word-Count and Concordance Generator. http://www.chilton-computing.org.uk/acl/applications/cocoa/p001.htm (25 February, 2020.)

Saario, Lassi. 2020. Conversion of the CEEC-400 into XML. A Manual to Accompany the XML Edition. Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/xml_doc.html (19 June, 2021.)

Saario, Lassi. 2021. XmlConverter. A Java Application to Process the File Format of the Corpora of Early English Correspondence. Helsinki: VARIENG. https://version.helsinki.fi/ceec/ceec-tools/XmlConverter (19 June, 2021.)

Saario, Lassi and Tanja Säily. 2020. POS Tagging the CEECE. A Manual to Accompany the Tagged Corpus of Early English Correspondence (TCEECE). Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/tceece_doc.html (19 June, 2021.)

Säily, Tanja, Terttu Nevalainen and Harri Siirtola. 2011. Variation in noun and pronoun frequencies in a sociohistorical corpus of English. Literary and Linguistic Computing 26/2: 167–188.

Säily, Tanja, Turo Vartiainen and Harri Siirtola. 2017. Exploring part-of-speech frequencies in a sociohistorical corpus of English. In Tanja Säily et al. eds., 23–52.

Säily, Tanja, Arja Nurmi, Minna Palander-Collin and Anita Auer eds. 2017. Exploring Future Paths for Historical Sociolinguistics. Amsterdam: John Benjamins.

Sairio, Anni, Samuli Kaislaniemi, Anna Merikallio and Terttu Nevalainen. 2018. Charting orthographical reliability in a corpus of English historical letters. ICAME Journal 42/1: 79–96.

SCEEC = Standardised-spelling Corpora of Early English Correspondence. 2012. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Jukka Keränen, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio. Standardised by Mikko Hakala, Minna Palander-Collin and Minna Nevala. Department of English / Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/ (19 June, 2021.)

Schneider, Gerold, Marianne Hundt and Rahel Oppliger. 2016. Part-of-speech in historical corpora: Tagger evaluation and ensemble systems on ARCHER. In Stefanie Dipper, Friedrich Neubarth and Heike Zinsmeister eds. Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) (Bochumer Linguistische Arbeitsberichte 16). Bochum: Ruhr-Universität Bochum, 256–264. https://www.linguistics.rub.de/konvens16/proceedings.html (25 February, 2020.)

TCEECE = Tagged Corpus of Early English Correspondence Extension. 2020. Annotated by Lassi Saario and Tanja Säily. Spelling standardised by Mikko Hakala, Minna Palander-Collin, Minna Nevala, Emanuela Costea, Anne Kingma and Anna-Lina Wallraff. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/ (19 June, 2021.)

TEI Consortium, eds. 2020. Guidelines for Electronic Text Encoding and Interchange. Last updated on 13 February, 2020. http://www.tei-c.org/P5/ (25 February, 2020.)

ISSN: 2243-4712
