The FGLOCTweet Corpus: An English tweet-based corpus for fine-grained location-detection tasks

Keywords: location detection, locative references, fine-grained locations, tweets, corpus for training and evaluating models

Abstract

Location detection in social-media microtexts, the identification of locative references in text data, is an important natural language processing task in emergency contexts. Spatial information obtained from texts is essential for understanding where an incident happened, where people need help and which areas have been affected. This information contributes to situational awareness in emergencies and can be passed on to emergency responders and competent authorities so that they can act as quickly as possible. Annotated text data are necessary for building and evaluating location-detection systems. The problem is that corpora of tweets for location-detection tasks are either lacking or, at best, annotated only with coarse-grained location types (e.g. cities, towns, countries and some buildings). To bridge this gap, we present the Fine-Grained LOCation Tweet Corpus (FGLOCTweet Corpus), a semi-automatically annotated English tweet-based corpus for fine-grained location-detection tasks. It covers fine-grained locative references (i.e. geopolitical entities, natural landforms, points of interest and traffic ways) together with their surrounding locative markers (i.e. direction, distance, movement or time). The corpus includes annotated tweet data for training and evaluation purposes, which can be used to advance research in location detection, as well as in the study of the linguistic representation of place or of the microtext genre of social media.


Published
2022-01-11
How to Cite
Nicolás José Fernández-Martínez. (2022). The FGLOCTweet Corpus: An English tweet-based corpus for fine-grained location-detection tasks. Research in Corpus Linguistics, 10(1), 117-133. https://doi.org/10.32714/ricl.10.01.06