EusTimeML: A mark-up language for temporal information in Basque

We present EusTimeML, a mark-up language for temporal information in texts written in Basque. It is compliant with the TimeML specifications, while offering some adapted attributes and attribute values in order to represent the language-specific features of Basque. In particular, alterations have been carried out for verb tense, aspect and modality coding, as well as for time expression and signal annotation. EusTimeML also provides a major extension to the existing TimeML schemes, since the attributes and values for factuality annotation have been added to the existing temporal information annotation scheme. EusTimeML has been used to annotate the EusTimeBank Corpus, the news and history narratives corpus that has been used as the gold standard in temporal information processing in Basque.


INTRODUCTION
Natural Language Processing (NLP) aims at a deep understanding of text; once morphosyntactic analysis had been largely mastered, the focus shifted to semantic and discourse information. Temporal information is an integral part of those areas, as it conveys what is narrated in a text and provides the information needed to arrange narratives along a temporal axis. This information is of utmost relevance to the development of automatic systems that benefit from knowing the chronological ordering of events in texts, such as chronology creation (Bauer et al. 2015), event prediction (Radinsky and Horvitz 2013) and event forecasting systems (Kawai et al. 2010), among others.
Specifically, temporal information conveys the information of what happens (events narrated) and the times in which they happen (time expressions), as well as the temporal relations (simultaneity, precedence, etc.) between them. For example, in the sentence in (1), one can learn that there was a toilet paper theft (event) last month (time expression) after (temporal relation) there were shortages (event).
(1) Last month, armed robbers stole pallets of toilet paper in Hong Kong following panic-buying induced shortages.
That temporal information is collected in corpora that are annotated following structured formats, e.g., the eXtensible Markup Language (XML), which make the information in the texts machine-readable. Mark-up languages provide a set of tags to classify the different elements in the text, as well as a set of attributes to describe the relevant linguistic features of those elements.
For the annotation of temporal information in Basque, we have created EusTimeML, a TimeML-compliant mark-up scheme (Pustejovsky et al. 2003a). It provides tags for events, time expressions and the relations that hold between them in XML format. As Figure 1 shows, some text strings have been assigned a tag (in green) since those are the elements in text that express temporal information. Additionally, a set of attributes (in purple) represents the main information (in pink) those strings convey.

Figure 1: A text annotated following the EusTimeML mark-up language (simplified annotation)

The text in Figure 1 is part of the EusTimeBank Corpus (Altuna et al. under revision a) which, in turn, has been used to train and evaluate temporal information processing tools. The Basque language has a long tradition of linguistic analysis and automatic processing (Alegria and Sarasola 2017) and integrating temporal information processing in the Basque processing pipeline (Otegi et al. 2016) has been the major motivation of this work.
This paper is structured as follows. We revisit the most relevant work on temporal information mark-up languages in Section 2. In Section 3, we present the basic features of TimeML, and in Section 4 we describe the most relevant linguistic features of Basque and the adaptations of TimeML that we have instituted to accommodate those features.
We discuss the strengths and weaknesses of EusTimeML in Section 5, and we conclude our work in Section 6.

BACKGROUND
Temporal information processing has attracted the interest of NLP scholars over the last two decades and has experienced a substantial boost since the creation of TimeML (Pustejovsky et al. 2003a). In fact, ever since the creation of TimeML, resource generation efforts and system evaluation competitions have multiplied. TimeML has been adapted to multiple languages, tasks and domains, and corpora annotated with TimeML schemes have increased in number.
The first temporal information mark-up languages (Mani and Wilson 2000; Ferro et al. 2003) only dealt with time expressions, for which the TIMEX and TIMEX2 tags, respectively, were created. These two tags also offered a set of basic attributes to code the main information expressed by time expressions, such as the normalised value and the granularity of the time expression. TimeML (Pustejovsky et al. 2003a), instead, made a qualitative leap in temporal information annotation, as this mark-up language offered tags for all the elements taking part in the expression of temporal information (see Section 3).
Nonetheless, TimeML is not the only mark-up language that has been developed to address temporal information. TEMANTEX (Wonsever et al. 2015) merges event annotation and factuality annotation. In the mark-up language developed for the NewsReader project (Minard et al. 2016), in turn, temporal information is tagged as in TimeML, but causality relations and entity co-reference are also considered. PLIMEX while it offers a much richer annotation for intrasentential temporal relations.

TIMEML
The TimeML mark-up language was specifically created to annotate events, time expressions and the temporal relations between them in text.
For that, the following set of tags was defined, one for each element concerning temporal information or type of relation:
• <EVENT> for events: actions and situations that happen or occur, as in (2).
(2) Numerous conspiracies have <EVENT>appeared</EVENT> since the <EVENT>outbreak</EVENT>.
• <TIMEX3> for temporal expressions that convey date, time, duration or set information, as in (3).
• <SIGNAL> for sections of text, most commonly function words, that indicate the type of relation among temporal objects, as in (4).
• <TLINK> for temporal relations between two events, two time expressions or an event (in bold) and a time expression (in italics), as in (5).
(7) Ms Mengyun apologised (ei1), saying (ei2) she was "just trying (ei3) to introduce (ei4) the life of local people".
<SLINK eventInstanceID="ei2" relatedToEvent="ei3" relType="EVIDENTIAL"/>
These tags contained a set of attributes that coded or normalised the temporal information conveying features of the temporal objects and relations. Table 1 presents the attributes in TimeML for event features. In this case, four types of attributes can be identified according to the type of information they represent: i) event ID (eid) and event instance ID (eiid) offer identification information; ii) class (class) offers event classification information; iii) tense (tense) and aspect (aspect) offer temporal information; and, finally iv) part-of-speech (pos), polarity (polarity) and modality
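As a rough illustration of how annotations of this kind can be consumed, the following Python sketch reads the events and the subordination link of example (7) with the standard library. The wrapping <TimeML> root element and the placement of eiid values directly on <EVENT> are simplifying assumptions made for this sketch; only the <EVENT> tag and the <SLINK> attributes are taken from the examples above.

```python
import xml.etree.ElementTree as ET

# Sentence (7) wrapped in a hypothetical <TimeML> root element.
# Putting eiid directly on <EVENT> is a simplification for this sketch.
FRAGMENT = (
    '<TimeML>Ms Mengyun <EVENT eiid="ei1">apologised</EVENT>, '
    '<EVENT eiid="ei2">saying</EVENT> she was "just '
    '<EVENT eiid="ei3">trying</EVENT> to '
    '<EVENT eiid="ei4">introduce</EVENT> the life of local people". '
    '<SLINK eventInstanceID="ei2" relatedToEvent="ei3" '
    'relType="EVIDENTIAL"/></TimeML>'
)


def read_events(root):
    """Map each event instance ID to the text span of its <EVENT> tag."""
    return {ev.get("eiid"): ev.text for ev in root.iter("EVENT")}


def read_slinks(root, events):
    """Return (source text, relation type, target text) for each <SLINK>."""
    return [
        (
            events[link.get("eventInstanceID")],
            link.get("relType"),
            events[link.get("relatedToEvent")],
        )
        for link in root.iter("SLINK")
    ]


root = ET.fromstring(FRAGMENT)
events = read_events(root)
slinks = read_slinks(root, events)
for source, rel_type, target in slinks:
    print(f"{source} --{rel_type}--> {target}")
```

Resolving link endpoints through an ID-to-text dictionary keeps the relation tags independent of where the events occur in the sentence, which is the reason the relation tags refer to IDs rather than to text offsets.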

Language-specific issues
Basque is a non-Indo-European language isolate, and thus it does not share many of the linguistic features of its neighbouring languages. In particular, many of its morphosyntactic features differ from the features in neighbouring languages and, hence, specific research for processing Basque is usually needed, as choices made for other languages cannot be applied straightforwardly.
For example, Basque is a highly agglutinative language in which information commonly expressed by prepositions in neighbouring languages is expressed by a rich set of postpositions attached to lemmas, as can be seen in the sentence in (8). This feature is extremely relevant in temporal information processing as lemmas accompanied by spatio-temporal declension cases are very frequent in temporal information expressions.
'Rescuers will stop rescue operations from sunset to sunrise.'
In (8) there are two time expressions: iluntzetik 'from sunset' and egunsentira 'to sunrise'. Iluntze and egunsenti mean 'sunset' and 'sunrise', respectively, while the suffix -etik expresses the ablative case and -ra represents the allative case.
Verbal conjugation also represents a major difference between Basque and other languages. In Basque, there is a short list of single-word verb forms, typically to express punctual aspect, whereas most of the tensed verb forms are periphrastic. The lexical meaning of the verb and aspect are expressed in the main verb, while the auxiliary verb expresses tense and agreement with the persons taking part in the event, as well as mood and modality.
Looking at the sentence in (8) again, one may notice that the verb etengo dituzte ('will stop') also shows the rich morphology of Basque. The suffix -go expresses the future aspect of the verb, and the auxiliary dituzte carries the present tense (d-) as well as agreement with the object (erreskate-operazioak.3PL), -it- and -z-, and with the subject (sorosleek.3PL), -te.
As just mentioned, the future meaning of a verb is considered an aspectual value in Basque, whereas in many European languages future events are expressed by the future tense. This makes it possible to understand the Basque verbal tense as a bidimensional present-past feature (Table 2), and verbal aspect as a perfect-future feature (Table 3).

Adaptations to TimeML
Although the TimeML mark-up language is considered to be a standard for temporal information annotation, each version contains subtle variations to address language or task-specific issues. In EusTimeML, some of the attribute values have been modified to accommodate the analysis of Basque grammar.

Time expression and signal annotation
As introduced in Section 4.1, time expressions often take spatio-temporal postpositions, and both elements commonly appear as a single token. Those elements are given separate tags in TimeML-styled schemes: one for the time expression (<TIMEX3>) and one for the function word (normally a preposition) expressing a temporal relation (<SIGNAL>). In the case of Basque, instead, as EusTimeML respects token-level annotation, we decided to annotate the whole word as a time expression, since we believe that the relational information of the postposition can always be recovered from the morphosyntactic parsing.
Nevertheless, free postpositions are also possible in Basque and, in those cases, we decided to assign them a signal tag, as the tags for free postpositions do not interfere with any other tags present in a text. As a consequence, the time expression and signal information annotation according to EusTimeML is represented as in examples (9-10).
A similar decision was made for the annotation of events and signals. More precisely, as only main verbs are given the event tag, the auxiliaries of the periphrastic forms may get signal tags when they contain a temporal postposition, since there is no overlapping tag as in (11).
'Rescuers will stop rescue operations from <TIMEX3>sunset</TIMEX3> to <TIMEX3>sunrise</TIMEX3>'
As can be seen in (10), the free postposition arte has been assigned a SIGNAL tag. The ablative -tik and the allative -ra in (9), and the allative -era in (10), instead, are part of the TIMEX3 tag to which they are attached. Nevertheless, it should be noted that, in (11), the auxiliary zutenean contains the locative suffix -ean, but as there is no conflict with any other tags, the token has been assigned a SIGNAL tag according to EusTimeML.
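The token-level decision described in this section can be sketched as a small rule in Python. The Token class and its three boolean flags are hypothetical stand-ins for information that would come from a morphosyntactic analyser; the word forms are the ones discussed above, listed in isolation and not as a sentence.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    # Hypothetical flags standing in for the output of a morphosyntactic analyser.
    is_time_expression: bool = False          # e.g. iluntzetik, egunsentira
    is_free_postposition: bool = False        # e.g. arte
    is_aux_with_temporal_suffix: bool = False  # e.g. zutenean


def tag_token(token):
    """Assign at most one tag per token, following the rule described above."""
    if token.is_time_expression:
        # A bound postposition stays inside the time expression tag.
        return "TIMEX3"
    if token.is_free_postposition or token.is_aux_with_temporal_suffix:
        # No other tag competes for this token, so it can carry a signal tag.
        return "SIGNAL"
    return None


tokens = [
    Token("iluntzetik", is_time_expression=True),
    Token("egunsentira", is_time_expression=True),
    Token("arte", is_free_postposition=True),
    Token("zutenean", is_aux_with_temporal_suffix=True),
    Token("sorosleek"),
]
tags = [(t.form, tag_token(t)) for t in tokens]
for form, tag in tags:
    print(form, tag)
```

Checking the time-expression flag first reflects the priority stated in the text: when a temporal postposition is attached to a time lemma, the whole word is a time expression and no separate signal is produced.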

Aspect and tense annotation
The fact that the future is represented by aspect in Basque has led us to define an ad hoc set of values for aspect and tense. As in other TimeML-styled schemes, verbal aspect is expressed by the aspect attribute and verb tense is represented through the tense attribute. The values each attribute can take, and the contexts in which they apply, are summarised in Table 4. As a consequence, for the sentence in (8), the event etengo dituzte ('will stop') would be assigned the aspect and tense values illustrated in (12), since this is a future verb form. Erreskate-operazioak ('rescue operations'), instead, is assigned NONE as the value for both aspect and tense, as it is expressed by a noun phrase and the form carries no aspect or tense marks.

Modality annotation
The annotation of modality information has been tackled in various ways. In the case of EusTimeML, we have followed the Basque grammar tradition, in which the modal verb ahal izan/ezin izan (possibility) and the semi-modal verbs behar izan (need or obligation) and nahi izan (desire) are considered. Taking this into account, the AHAL, BEHAR and NAHI values have been created for the modality attribute, and we have also used the NONE value for the cases in which no modality is expressed. The NONE value is the one assigned to the events in (13), as they do not convey any modality information.
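The mapping from modal verbs to modality values described above can be written as a simple lookup. This is a minimal sketch: the function name is hypothetical, and mapping ezin izan to AHAL is an assumption drawn from the text grouping it with ahal izan as the possibility modal.

```python
# Modality values named in the text: AHAL, BEHAR, NAHI and NONE.
MODALITY_BY_VERB = {
    "ahal izan": "AHAL",    # possibility
    "ezin izan": "AHAL",    # assumption: treated as the same modal as ahal izan
    "behar izan": "BEHAR",  # need or obligation
    "nahi izan": "NAHI",    # desire
}


def modality_value(modal_verb=None):
    """Return the modality attribute value for an event's (semi-)modal verb.

    Events without a modal verb, or with an unlisted one, get NONE.
    """
    if modal_verb is None:
        return "NONE"
    return MODALITY_BY_VERB.get(modal_verb.lower(), "NONE")


print(modality_value("behar izan"))
print(modality_value())
```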

Extensions to TimeML: Factuality
The main difference between EusTimeML and other TimeML-styled mark-up schemes lies in the factuality annotation added to EusTimeML (Altuna et al. 2018a). Factuality annotation has been closely related to TimeML in works such as Saurí (2008), but EusTimeML is the first TimeML-styled scheme that integrates it. For example, verb aspect and tense, as well as the time expressions related to the events, condition the factuality values of the events, and some subordination relations (evidential, factive or counterfactive, among others) may evidence the factuality value of the subordinated event.
As our final goal is building timelines, factuality information will help us discern between events that effectively do occur and that should, as a consequence, appear on a timeline, events that have not happened, and events that may happen in the future. For this reason, we have opted for a factuality scheme in which we classify events as facts, counterfacts, or non-factual events when possible.
In EusTimeML, factuality information is coded through a set of event attributes. These attributes are polarity (defined also in TimeML), certainty, factuality itself, and specialCases. These attributes and their values are illustrated in Table 5. We have represented the factuality information of the events in (8) as shown in example (14).
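To make the role of factuality in timeline building more concrete, the following Python sketch keeps only the events classified as facts. The Event class, the toy event IDs and the label strings FACTUAL, COUNTERFACTUAL and NON_FACTUAL are placeholders chosen for this sketch; the actual attribute values are those of Table 5.

```python
from dataclasses import dataclass


@dataclass
class Event:
    eid: str
    # Placeholder labels for the fact / counterfact / non-factual distinction;
    # they are not the attribute values defined in the scheme.
    factuality: str


def timeline_candidates(events):
    """Keep the events that did occur and should therefore appear on a timeline."""
    return [ev for ev in events if ev.factuality == "FACTUAL"]


# Toy events with invented IDs, for illustration only.
toy_events = [
    Event("e1", "FACTUAL"),
    Event("e2", "COUNTERFACTUAL"),
    Event("e3", "NON_FACTUAL"),
]
kept = [ev.eid for ev in timeline_candidates(toy_events)]
print(kept)
```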

Final EusTimeML definition and usage
Taking into account the decisions we made, the attributes and values for event annotation in EusTimeML are presented in Table 6. The remaining tags preserve the original TimeML attributes and the only differences in annotation are the ones presented in Section 4.2.1. As a consequence, annotations following EusTimeML remain easily transferable and comparable to other annotations carried out following any of the TimeML-styled schemes.

Event attributes    Values
Event ID (eid)      e<integer>

The mark-up language described in the preceding sections has been used for the annotation of EusTimeBank, the gold standard corpus for temporal information in Basque. EusTimeBank is a 92-document corpus (23,000 tokens) made up of 86 news documents and 6 historical narratives. The corpus has been used for the training and evaluation of bTime (Salaberri 2017) and EusHeidelTime (Altuna et al. 2017). Additionally, the annotated documents obtained by those tools have been used as input for KroniXa (Altuna et al. under revision b), a tool to build timelines from Basque texts.
News and history texts are especially rich in temporal information, as they commonly narrate past events and offer the necessary information to arrange the events along the temporal axis. Hence, their narrative nature makes these texts an interesting basis for timeline generation. For this reason, a timeline dataset for the evaluation of KroniXa has been created from EusTimeBank (Altuna et al. 2019).

DISCUSSION
The creation of EusTimeML has been the first step towards automatic temporal information extraction from Basque texts. In order to be able to compare the Basque annotated corpora and the results obtained by NLP tools for Basque with the NLP resources for other languages, comparable annotation schemes and evaluation measures should be adopted. Hence, as TimeML schemes are widely used in English, Spanish and French, building the TimeML-compliant EusTimeML has been a convenient option.
The decisions on EusTimeML have been validated by means of a set of manual annotation efforts (Altuna et al. 2014, 2018a, 2018b; Altuna 2018), in which inter-annotator agreement has been measured. Manual annotation analysis has shown that EusTimeML annotation guidelines are unambiguous for most of the elements, but we must note that event classification has been a major source of disagreement, as annotators have considered some event classes to be virtually indiscernible in some contexts. The discussions after the agreement assessment have led to a wide consensus on EusTimeML, and a consistent set of annotation guidelines has been produced.
As our final goal is generating timelines based on the temporal information contained in texts, we have paid special attention to similar work based on TimeML annotations. In fact, the suitability of TimeML to encode temporal information for timeline building has been called into question. Ning et al. (2018) argue that the scarcity of intrasentential temporal relations heavily affects the event-event ordering. This opinion is shared by Derczynski et al. (2013), as they proposed TimeML-Dense, although timeline building was not their final goal. Laparra et al. (2017) also acknowledge the data sparsity problem that TimeML annotations pose for timeline building. They thus propose assigning the same time tag to all events a certain entity takes part in if they share the same tense, as a way to increase the number of anchored events. This partially solves the lack of temporal relations between events in the text. In Altuna et al. (under revision b) we have also found that some time expressions can have more than one correct normalised value in TimeML, which causes unnecessary time expression ordering problems, as simultaneous events can be incorrectly placed at two different time points. For example, the quarters of the year may get different normalised values depending on whether they are referred to as quarters of a natural year or of a fiscal year.
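The anchoring heuristic attributed above to Laparra et al. (2017) can be sketched as follows. This is only one possible reading of that one-sentence description and not their implementation: the dictionary keys, the toy data and the choice to propagate an anchor only when an (entity, tense) group has exactly one known anchor are all assumptions made for this sketch.

```python
from collections import defaultdict


def propagate_anchors(events):
    """Copy a time anchor to unanchored events sharing entity and tense.

    events: list of dicts with the (hypothetical) keys 'id', 'entity',
    'tense' and 'anchor', where 'anchor' is a time ID or None.
    Returns a new list; the input is left unchanged.
    """
    known = defaultdict(set)
    for ev in events:
        if ev["anchor"] is not None:
            known[(ev["entity"], ev["tense"])].add(ev["anchor"])

    result = []
    for ev in events:
        anchor = ev["anchor"]
        candidates = known.get((ev["entity"], ev["tense"]), set())
        # Assumption: propagate only when the group has a single known anchor.
        if anchor is None and len(candidates) == 1:
            anchor = next(iter(candidates))
        result.append({**ev, "anchor": anchor})
    return result


# Toy data for illustration only.
toy = [
    {"id": "e1", "entity": "robbers", "tense": "PAST", "anchor": "t1"},
    {"id": "e2", "entity": "robbers", "tense": "PAST", "anchor": None},
    {"id": "e3", "entity": "robbers", "tense": "PRESENT", "anchor": None},
]
anchored = propagate_anchors(toy)
print([(ev["id"], ev["anchor"]) for ev in anchored])
```

Under this reading, e2 inherits the anchor of e1 because both involve the same entity and share the same tense, while e3 stays unanchored because its tense differs.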
Nonetheless, we consider that EusTimeML still offers sufficient information for timeline building. It should be taken into account that, even if bTime can only deal with a restricted set of temporal relations, experiments with KroniXa have shown very promising results, as a third of the events are correctly placed in the timelines.
Other authors have also highlighted some points in which TimeML struggles to properly encode temporal information. Ehrmann and Hagège (2009) noted that TimeML neither offered precise guidelines for time expression classification nor a clear distinction between characterisation and reference calculation annotations. According to them, a time expression such as 2 days before yesterday should be considered a date, and 2 days should be used to calculate its reference; TimeML proposes to annotate a duration (2 days) and a date (yesterday), instead. This same concern is shared by Bethard (2013) who proposes a scheme (SLATE) that allows machine-learning calculations.
Along the same lines, Laparra et al. (2018) identified the incapacity of TimeML to annotate compositional time expressions such as Saturdays since March 6, in which a set of dates is bounded by a determined time point. Event annotation through TimeML has also been a matter of discussion among scholars. For example, as Leeuwenberg and Moens (2019) point out, event durations cannot be explicitly tagged through TimeML, as no scheme for marking the durative (or punctual) nature of the events is provided. In spite of these flaws, TimeML is still the most widely used mark-up language for temporal information annotation.

EusTimeML addresses the need for a temporal information mark-up language for Basque that can deal with its language-specific features. Nevertheless, even if it contains some modifications, it is largely comparable to other TimeML-styled schemes. Adding factuality information has contributed to enlarging the amount of relevant information for timeline generation, which is our final goal.
In fact, EusTimeML has been the first step towards temporal information processing in Basque, as it is the mark-up language used to annotate EusTimeBank, the corpus on which the EusHeidelTime and bTime tools for temporal information extraction and normalisation were developed. Furthermore, documents annotated following EusTimeML have also been used to generate timelines for the evaluation of KroniXa.
EusTimeML is now ready to use, although its customisability still allows for improvements and expansions. Addressing duration anchoring and increasing the amount of intrasentential temporal relations should be a goal for the TimeML community.