Research in Corpus Linguistics

Review of Martín Arista, Javier and Ojanguren López, Ana Elvira. 2025. Structuring Lexical Data and Digitising Dictionaries. Grammatical Theory, Language Processing and Databases in Historical Linguistics. Leiden: Brill. 412 pp. ISBN: 978-90-04-70266-0.

2025-05-08T22:54:49+02:00

Review of Ljubica Leone. Composite Predicates in Late Modern English (Routledge Focus on Linguistics). 2024. London: Routledge. ISBN 978-1-032-52488-7

2025-11-16T10:11:19+01:00

Exploring noun lexical diversity and noun phrase complexity in Spanish email writing at B1 and C1 levels

2024-11-29T12:59:00+01:00

Research on noun phrase use in EFL writing has mainly focused on linguistic complexity and accuracy, lexical richness, and phraseological competence. However, the relationship between the lexical diversity of nouns and the syntactic complexity of the noun phrases in which these nouns appear remains underexplored. To address this gap, this paper examines the lexical diversity of head nouns in noun phrases within a sample of emails written by L1 Spanish EFL learners at B1 and C1 proficiency levels, taken from the FineDesc Learner Corpus. The analysis considers both the lexical diversity of nouns and the syntactic complexity of the noun phrases they head. The findings reveal: a) a narrower range of nouns at the B1 level compared to the C1 level; b) a low percentage of nouns from both levels, based on the English Vocabulary Profile; and c) differences in NP complexity between the two proficiency levels (B1 and C1), depending on whether the head nouns are concrete or abstract. The paper underscores the importance of combining different complexity measures (namely, lexical diversity and NP complexity analyses) to gain a more comprehensive understanding of learners’ use of noun phrases.
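
As a rough illustration of what measuring the lexical diversity of head nouns can involve, the sketch below computes a simple type-token ratio (distinct nouns divided by total nouns). The abstract does not say which diversity measure the paper uses, so the choice of type-token ratio is an assumption, and the noun lists and the function name `type_token_ratio` are hypothetical.

```python
def type_token_ratio(head_nouns):
    """Return distinct nouns / total nouns, ignoring case.

    Assumption: a plain type-token ratio stands in for whatever
    lexical diversity measure the paper actually applies.
    """
    tokens = [noun.lower() for noun in head_nouns]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical head nouns, invented for illustration only.
b1_nouns = ["email", "email", "friend", "party", "party", "party"]
c1_nouns = ["email", "invitation", "colleague", "venue", "schedule", "venue"]

b1_ttr = type_token_ratio(b1_nouns)  # 3 types over 6 tokens
c1_ttr = type_token_ratio(c1_nouns)  # 5 types over 6 tokens
print(f"B1 TTR: {b1_ttr:.2f}, C1 TTR: {c1_ttr:.2f}")
```

A plain type-token ratio is sensitive to sample length, so comparisons of this kind are only meaningful when the samples are of similar size or when a length-corrected measure is used instead.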

The creation of the Indonesian TreeTagger for use in LancsBox and CQPweb

2024-11-28T13:10:31+01:00

TreeTagger is a multilingual tagger capable of performing headword and POS tagging. However, before the completion of this project, Indonesian was not supported. Thus, corpus query systems employing TreeTagger as a subsystem, such as CQPweb v.3.3.10 and LancsBox v.5, were incapable of annotating Indonesian texts. This context leads to the following research aims: 1) to develop Indonesian language support for TreeTagger, 2) to evaluate its performance, and 3) to integrate the support into two popular corpus query systems, namely CQPweb and LancsBox, and to demonstrate its functionalities. The research procedure can be concisely summarised as follows: training, annotation and evaluation, and incorporation. A pre-annotated corpus and lexicon were used in the training process. Headwords for the lexicon and corpus were semi-automatically added using MorphInd, augmented with expert revisions. The training produced an Indonesian TreeTagger parameter file, whose accuracy for POS and headword annotation was 96 per cent and 91 per cent, respectively. The parameter file has been incorporated into LancsBox v.6 and CQPweb 3.3.11, enabling support for the Indonesian language.

The Multi-Feature Tagger of English (MFTE): Rationale, description and evaluation

2024-11-30T19:08:41+01:00

The Multi-Feature Tagger of English (MFTE) provides a transparent and easily adaptable open-source tool for multivariable analyses of English corpora. Designed to contribute to the greater reproducibility, transparency, and accessibility of multivariable corpus studies, it comes with a simple GUI and is available both as a richly annotated Python script and as an executable file. In this article, we detail its features and how they are operationalised. The default tagset comprises 74 lexico-grammatical features, ranging from attributive adjectives and progressives to tag questions and emoticons. An optional extended tagset covers more than 70 additional features, including many semantic features, such as human nouns and verbs of causation. We evaluate the accuracy of the MFTE on a sample of 60 texts from the BNC2014 and COCA, and report precision and recall metrics for all the features of the simple tagset. We outline how the use of a well-documented, open-source tool can contribute to improving the reproducibility and replicability of multivariable studies of English.
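
For readers unfamiliar with per-feature precision and recall, the sketch below shows one way such metrics can be computed from token-aligned gold and automatic tags. It is not the MFTE evaluation procedure, which the abstract does not describe in detail; the tag labels (`ATTR_ADJ`, `PROG`, `OTHER`), the data and the function name `precision_recall` are all hypothetical.

```python
def precision_recall(gold, predicted, feature):
    """Return (precision, recall) for one feature over aligned tag lists.

    Assumption: gold and predicted hold one tag per token, in the same order.
    """
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == feature and p == feature)
    false_pos = sum(1 for g, p in pairs if g != feature and p == feature)
    false_neg = sum(1 for g, p in pairs if g == feature and p != feature)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical tags, invented for illustration; not real MFTE labels or output.
gold_tags = ["ATTR_ADJ", "PROG", "ATTR_ADJ", "OTHER", "PROG"]
auto_tags = ["ATTR_ADJ", "PROG", "OTHER", "ATTR_ADJ", "PROG"]

adj_scores = precision_recall(gold_tags, auto_tags, "ATTR_ADJ")
prog_scores = precision_recall(gold_tags, auto_tags, "PROG")
print("ATTR_ADJ:", adj_scores)
print("PROG:", prog_scores)
```

Precision is the share of the tagger's assignments of a feature that are correct, and recall is the share of the gold-standard instances that the tagger finds; reporting both per feature shows where a tagger over- or under-assigns a tag.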

Same, same, but erm sort of different? Comparing three kinds of fluencemes across Australian, British, Canadian, and New Zealand English

2025-01-26T19:26:01+01:00

Although L1-English fluency has been extensively studied from many angles, few contrastive studies examine whether fluency develops similarly or differently across L1-varieties while taking sociolinguistic variation into consideration. This paper aims to close this research gap and examines the use of three core strategies of fluency (or fluencemes), i.e. discourse markers, filled pauses and unfilled pauses, across Australian, British, Canadian, and New Zealand English. These fluencemes were extracted and manually disambiguated from the private conversation sections of the respective components of the International Corpus of English (ICE-AUS, ICE-GB, ICE-CAN, and ICE-NZ). The data were normalised per speaker and linked with the sociobiographic metadata of the speakers. Analysis using random forests revealed a consistent fluenceme distribution across the four varieties, with unfilled pauses being the most common, followed by discourse markers, and then filled pauses. This pattern suggests a ‘common fluenceme core’ among L1-English varieties. The influence of sociolinguistic variables (gender, age, education, and occupation) was modest across varieties and exhibited diverse trends. Male speakers tend to use filled pauses more frequently and unfilled pauses less frequently than female speakers. Increasing age did not significantly affect the frequency of these strategies; however, older speakers tend to use discourse markers less frequently. Both education and occupation showed a slight positive correlation with overall fluency.
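
Normalising counts per speaker, as mentioned above, can be sketched as follows. The abstract does not state the normalisation base, so the rate per 1,000 words used here is an assumption, and the speaker IDs, counts and the function name `rate_per_thousand` are hypothetical.

```python
def rate_per_thousand(count, words):
    """Return occurrences per 1,000 words for a single speaker.

    Assumption: 1,000 words is the normalisation base; the study's
    actual base is not given in the abstract.
    """
    if words <= 0:
        raise ValueError("a speaker needs a positive word count")
    return count * 1000 / words


# Hypothetical per-speaker totals, invented for illustration only.
speakers = {
    "S1": {"words": 2000, "filled_pauses": 30},
    "S2": {"words": 500, "filled_pauses": 10},
}

rates = {
    speaker: rate_per_thousand(data["filled_pauses"], data["words"])
    for speaker, data in speakers.items()
}
print(rates)
```

In this invented example the second speaker has the lower raw count but the higher relative frequency, which is why comparisons across speakers who contribute different amounts of talk are usually made on normalised rates.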

The Construction Complexity Calculator (ConPlex): A tool for calculating Nelson’s (2024) construction-based complexity measure

2025-02-04T08:51:22+01:00

The current study aims to increase the accessibility of Nelson’s (2024) recently suggested construction-based complexity measure by providing a tool that can calculate the measure for single or multiple texts. To validate the tool, complexity scores for the International Corpus Network of Asian Learners of English corpus (ICNALE) were compared with Nelson’s (2024) results. In addition, complexity scores were calculated for a new dataset, the Common European Framework of Reference English Listening Corpus (CEFR), along with the MERLIN corpus, which includes learner writing samples from learners of Czech, German, and Italian. Complexity scores generally increased across CEFR levels in all of the datasets. However, the complexity scores in the current study tend to be higher than in the original study due to differences in the sentence splitting approach. The sentence tokenisation method used is deemed to be more appropriate, and it may be concluded that the Construction Complexity Calculator (ConPlex) tool accurately calculates Nelson’s measure. It is hoped that the tool will allow researchers to calculate the complexity of constructions at the text level for a wide range of research purposes.

The language of evaluation and stance in crowdfunding project proposals

2025-01-18T18:34:50+01:00

Today, digital crowdfunding platforms allow researchers to increasingly use digital resources to reach and engage diversified audiences, making scientific content accessible to everyone. This paper explores how evaluation in text contributes information relevant to understanding how scientists use language to express their expert opinions of scientific research and their attitudes about the value of their projects. Starting from the compilation and analysis of a corpus of 50 science projects from Experiment.com, evaluative stance expressions in this work were classified according to Biber’s (2004) taxonomy into the following stance categories: verbs, adverbs, adjectives and nouns. Subsequently, genre analysis was applied to identify the discourse functions of these evaluative words in each rhetorical section of the project proposals. Results show that the analysed crowdfunding proposals are rich in stance verbs (52.65%) and, to a lesser extent, stance adjectives (23.52%), serving to express values of effort, improvement and diligence in the proposed projects, as well as judgement regarding experiments and ‘Lab Notes’ updates, respectively. These results can be useful for both theoretical advancement and pedagogical purposes, that is, to apply scientists’ findings to digital communication teaching and learning.

Spanish EFL learners' use of contrastive linking adverbials across three CEFR levels and gender influence

2025-02-04T15:58:35+01:00

The Common European Framework of Reference for Languages (2001) and its Companion Volume (2020) emphasize the importance of linking expressions for pragmatic competence. Research on contrastive devices has long attracted scholarly interest; however, (pseudo)longitudinal studies across different levels and the question of whether gender affects learners’ written production in this respect have been neglected. This study aims to address this gap by analyzing how Spanish EFL learners at different levels express contrast and whether gender impacts their use of concessive expressions. Surprisingly, lower-level (B1) users show a wide range of expressions similar to higher-level users, while those at the B2 level tend to avoid "risky" options. Interestingly, gender does not significantly influence learners' use of connectors in this corpus, contradicting earlier findings that suggested female learners use more connectors than males.

A corpus-based study on the transitive uses of English physiological verbs

2025-01-15T18:31:47+01:00

This paper examines the transitivity potential of a group of English unergative verbs that denote physiological processes, a syntactico-semantic verbal class which has not received enough attention in the literature. Through a qualitative corpus-based analysis of 26 verbs conducted on the COCA and BNC corpora, it will be shown that the degree of transitivity of this verbal class is higher than stated in previous studies since, in addition to the cognate object construction (And burp the same garlic burps), the substance object alternation (I was breathing garlic over her), and the resultative construction (He yawned open his mouth), they have been documented in seven other transitive patterns in which they increase their valency with the addition of a non-canonical direct object: x’s way constructions (They crapped their way out?), reaction object constructions (Emma hiccups a yes), caused-motion constructions (They’d laugh me straight out of the door), preposition drop alternations (He shit the rug), the understood body-part object alternation (He snuffled his nose along his arm and sleeve), away constructions (Hatch yawned away another hour), and causative patterns (Always burp your baby when feeding time is over).