Corpus Linguistics and Eighteenth Century Collections Online (ECCO)

Keywords: Eighteenth Century Collections Online (ECCO), English Short-Title Catalogue (ESTC), metadata, Optical Character Recognition (OCR), eighteenth-century studies, bibliographic data science

Abstract

Eighteenth Century Collections Online (ECCO) is the most comprehensive machine-readable dataset of eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a range of problems. This paper offers a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We also analyse the role of the substantial number of reprints and new editions in the data, and discuss genres and estimates of Optical Character Recognition (OCR) quality. We conclude that, while ECCO is a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism, and we highlight key aspects to take into account when considering its possible uses.

Published
2021-04-27
How to Cite
Tolonen, M., Mäkelä, E., Ijaz, A., & Lahti, L. (2021). Corpus Linguistics and Eighteenth Century Collections Online (ECCO). Research in Corpus Linguistics, 9(1), 19-34. https://doi.org/10.32714/ricl.09.01.03