The creation of the Indonesian TreeTagger for use in LancsBox and CQPweb

Authors

Prihantoro

DOI:

https://doi.org/10.32714/ricl.13.02.02

Keywords:

TreeTagger, CQPweb, LancsBox, Indonesian, annotation

Abstract

TreeTagger is a multilingual tagger capable of performing headword and POS tagging. Before this project, however, it did not support Indonesian. Corpus query systems that employ TreeTagger as a subsystem, such as CQPweb v.3.3.10 and LancsBox v.5, were therefore unable to annotate Indonesian texts. This gap motivates three research aims: 1) to develop Indonesian language support for TreeTagger, 2) to evaluate its performance, and 3) to integrate the support into two popular corpus query systems, namely CQPweb and LancsBox, and demonstrate its functionality. The research procedure comprised three stages: training, annotation and evaluation, and incorporation. A pre-annotated corpus and lexicon were used in training. Headwords for the lexicon and corpus were added semi-automatically using MorphInd and then revised by experts. The training produced an Indonesian TreeTagger parameter file, whose accuracy for POS and headword annotation was 96 per cent and 91 per cent, respectively. The parameter file has been incorporated into LancsBox v.6 and CQPweb v.3.3.11, enabling support for Indonesian.
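
The evaluation step summarised in the abstract can be illustrated with a minimal sketch. It assumes that gold-standard and tagger output are both available as tab-separated lines of token, POS tag and headword (the layout TreeTagger prints when token and lemma output are enabled) and that accuracy is counted per token. The function names, the sample sentence and its tags are hypothetical and are not taken from the article, whose evaluation procedure may differ.

```python
def parse_tagged(text):
    """Parse 'token<TAB>POS<TAB>headword' lines into (token, pos, headword) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        token, pos, headword = line.split("\t")
        rows.append((token, pos, headword))
    return rows


def accuracy(gold, predicted):
    """Return (POS accuracy, headword accuracy) as fractions of all tokens."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    for g, p in zip(gold, predicted):
        if g[0] != p[0]:
            raise ValueError(f"token mismatch: {g[0]!r} vs {p[0]!r}")
    total = len(gold)
    pos_hits = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    headword_hits = sum(g[2] == p[2] for g, p in zip(gold, predicted))
    return pos_hits / total, headword_hits / total


# Illustrative data only: the tags and headwords are assumptions, not article data.
GOLD = "Saya\tPRP\tsaya\nmembaca\tVB\tbaca\nbuku\tNN\tbuku\nitu\tPR\titu\n"
# The hypothetical prediction has one wrong headword (membaca) and one wrong POS tag (itu).
PREDICTED = "Saya\tPRP\tsaya\nmembaca\tVB\tmembaca\nbuku\tNN\tbuku\nitu\tNN\titu\n"

if __name__ == "__main__":
    pos_acc, headword_acc = accuracy(parse_tagged(GOLD), parse_tagged(PREDICTED))
    print(f"POS accuracy: {pos_acc:.2%}, headword accuracy: {headword_acc:.2%}")
```

Scoring POS tags and headwords separately mirrors the way the abstract reports two distinct accuracy figures.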

Published

2024-11-27

How to Cite

Prihantoro. (2024). The creation of the Indonesian TreeTagger for use in LancsBox and CQPweb. Research in Corpus Linguistics, 13(2), 35–62. https://doi.org/10.32714/ricl.13.02.02