Research in Corpus Linguistics https://ricl.aelinco.es/index.php/ricl <p style="text-align: justify;"><em><strong>Research in Corpus Linguistics</strong></em> (<em>RiCL</em>, ISSN 2243-4712) is a peer-reviewed international scholarly journal that publishes contributions containing empirical analyses of data from different languages and from different theoretical perspectives and frameworks, with the goal of improving our knowledge about the linguistic theoretical background of a language, a language family or any type of cross-linguistic phenomenon, construction or assumption. <em>RiCL</em> invites original, previously unpublished research articles, reports on corpus development, and book reviews in the field of Corpus Linguistics. The journal also considers the publication of special issues on specific topics, the editing of which may be offered to leading scholars in the field.</p> AELINCO (Spanish Association for Corpus Linguistics) en-US Research in Corpus Linguistics 2243-4712 <p><a href="https://ricl.aelinco.es/index.php/ricl/copyright-notice" target="_blank" rel="noopener">Copyright notice</a></p> Challenges of combining structured and unstructured data in corpus development https://ricl.aelinco.es/index.php/ricl/article/view/198 <p>Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of <em>Research in Corpus Linguistics</em>. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. 
Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.</p> Tanja Säily Jukka Tyrkkö Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-29 2021-06-29 9 1 i viii 10.32714/ricl.09.01.01 Generating linguistically relevant metadata for the Royal Society Corpus https://ricl.aelinco.es/index.php/ricl/article/view/158 <p>This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion about the specific challenges in building substantial diachronic corpora intended to be used for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, integration of original metadata from the data providers (JSTOR and the Royal Society); second, derivation of additional linguistically relevant metadata regarding text structure and situational context (register).</p> Katrin Menzel Jörg Knappen Elke Teich Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-01-04 2021-01-04 9 1 1 18 10.32714/ricl.09.01.02 Corpus Linguistics and Eighteenth Century Collections Online (ECCO) https://ricl.aelinco.es/index.php/ricl/article/view/161 <p>Eighteenth Century Collections Online (ECCO) is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. 
It plays a crucial role in studies of eighteenth-century language and it has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of different problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its countries and languages of publication. We also analyse the role of the substantial number of reprints and new editions in the data, and discuss genres and estimates of Optical Character Recognition (OCR) quality. Our conclusion is that whereas ECCO provides a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism. We highlight key aspects that need to be taken into account when considering its possible uses.</p> Mikko Tolonen Eetu Mäkelä Ali Ijaz Leo Lahti Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-04-27 2021-04-27 9 1 19 34 10.32714/ricl.09.01.03 Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2 https://ricl.aelinco.es/index.php/ricl/article/view/157 <p>This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new <em>London–Lund Corpus 2</em> (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a <em>Praat</em> script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.</p> Nele Põldvere Johan Frid Victoria Johansson Carita Paradis Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-07 2021-06-07 9 1 35 62 10.32714/ricl.09.01.04 Multimodal meaning making: The annotation of nonverbal elements in multimodal corpus transcription https://ricl.aelinco.es/index.php/ricl/article/view/154 <p>The article discusses how to integrate annotation for nonverbal elements (NVE) from multimodal raw data as part of a standardized corpus transcription. We argue that it is essential to include multimodal elements when investigating conversational data, and that in order to integrate these elements, a structured approach to complex multimodal data is needed. We discuss how to formulate a structured corpus-suitable standard syntax and taxonomy for nonverbal features such as gesture, facial expressions, and physical stance, and how to integrate it into a corpus. Using corpus examples, the article describes the development of a robust annotation system for spoken language in the corpus of <em>Video-mediated English as a Lingua Franca Conversations</em> (ViMELF 2018) and illustrates how the system can be used for the study of spoken discourse. The system takes into account previous research on multimodality, transcribes salient nonverbal features in a concise manner, and uses a standard syntax. 
While such an approach introduces a degree of subjectivity through the criteria of salience and conciseness, the system also offers considerable advantages: it is versatile and adaptable, flexible enough to work with a wide range of multimodal data, and it allows both quantitative and qualitative research on the pragmatics of interaction.</p> Marie-Louise Brunner Stefan Diemer Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-07 2021-06-07 9 1 63 88 The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora https://ricl.aelinco.es/index.php/ricl/article/view/155 <p class="JLLS-Abstract-text" style="text-indent: 0cm; line-height: normal; tab-stops: 10.5pt 1.0cm 39.7pt 72.0pt; margin: 0cm 0cm 0cm 1.0cm;"><span lang="EN-US">This paper reports on the efforts of twelve national teams in building the <em>International Comparable Corpus</em> (ICC; <a href="https://korpus.cz/icc">https://korpus.cz/icc</a>), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to the idea of re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the <em>International Corpus of English</em> (ICE). As such, ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. 
A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.</span></p> Anna Čermáková Jarmo Jantunen Tommi Jauhiainen John Kirk Michal Křen Marc Kupietz Elaine Uí Dhonnchadha Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-18 2021-06-18 9 1 89 103 10.32714/ricl.09.01.06 The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE) https://ricl.aelinco.es/index.php/ricl/article/view/156 <p>This paper discusses the process of part-of-speech tagging the <em>Corpus of Early English Correspondence Extension</em> (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. 
The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.</p> Lassi Saario Tanja Säily Samuli Kaislaniemi Terttu Nevalainen Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-29 2021-06-29 9 1 104 131 10.32714/ricl.09.01.07 How to prepare the video component of the Diachronic Corpus of Political Speeches for multimodal analysis https://ricl.aelinco.es/index.php/ricl/article/view/160 <p>The Diachronic Corpus of Political Speeches (DCPS) is a collection of 1500 full-length political speeches in English. It includes speeches delivered worldwide by English-speaking politicians in various settings, between 1545 and 2013. Enriched with semi-automatic morphosyntactic annotations and with discourse-pragmatic manual annotations, the DCPS is designed to achieve maximum representativeness and balance, preserve detailed metadata, and enable corpus-based studies of syntactic, semantic and discourse-pragmatic variation and change on political corpora.</p> <p>For speeches given from 1950 on, video-recordings of the original delivery are often retrievable online. This opens up avenues of research in multimodal linguistics, in which studies on the integration of speech and gesture in the construction of meaning can include analyses of recurrent gestures and of multimodal constructions. 
This article discusses the issues at stake in preparing the video-recorded component of the DCPS for linguistic multimodal analysis, namely the exploitability of recordings, the segmentation and alignment of transcriptions, the annotation of gesture forms and functions in the software ELAN and the quantity of available gesture data.</p> Camille Debras Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-07-06 2021-07-06 9 1 132 151 10.32714/ricl.09.01.08 Review of Fuster-Márquez, Miguel, Carmen Gregori-Signes & José Santaemilia Ruiz eds. 2020. Multiperspectives in Analysis and Corpus Design. Granada: Comares. ISBN: 978-8-413-69009-4 https://ricl.aelinco.es/index.php/ricl/article/view/182 Moisés Almela Sánchez Copyright (c) 2021 Research in Corpus Linguistics http://creativecommons.org/licenses/by/4.0 2021-06-29 2021-06-29 9 1 152 159 10.32714/ricl.09.01.09