Generating linguistically relevant metadata for the Royal Society Corpus

Keywords: corpus building and extension, specialized diachronic corpora, written scientific English discourse, Royal Society Corpus, register-based metadata

Abstract

This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion of the specific challenges in building substantial diachronic corpora intended for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, the integration of original metadata from the data providers (JSTOR and the Royal Society); second, the derivation of additional linguistically relevant metadata regarding text structure and situational context (register).

Published
2021-01-04
How to Cite
Menzel, K., Knappen, J., & Teich, E. (2021). Generating linguistically relevant metadata for the Royal Society Corpus. Research in Corpus Linguistics, 9(1), 1-18. https://doi.org/10.32714/ricl.09.01.02