EusTimeML: A mark-up language for temporal information in Basque

Keywords: temporal information processing, Basque, mark-up language, annotation, TimeML

Abstract

We present EusTimeML, a mark-up language for temporal information in texts written in Basque. It complies with the TimeML specifications while adapting some attributes and attribute values to represent the language-specific features of Basque. In particular, the coding of verb tense, aspect and modality has been altered, as has the annotation of time expressions and signals. EusTimeML also extends the existing TimeML schemes substantially by adding attributes and values for factuality annotation to the temporal information annotation scheme. EusTimeML has been used to annotate the EusTimeBank Corpus, a corpus of news and history narratives that has been used as the gold standard in temporal information processing in Basque.

Published
2020-05-10
How to Cite
Altuna, B., Aranzabe, M. J., & Díaz de Ilarraza, A. (2020). EusTimeML: A mark-up language for temporal information in Basque. Research in Corpus Linguistics, 8(1), 86-104. https://doi.org/10.32714/ricl.08.01.06