The creation of the Indonesian TreeTagger for use in LancsBox and CQPweb
DOI:
https://doi.org/10.32714/ricl.13.02.02
Keywords:
TreeTagger, CQPweb, LancsBox, Indonesian, annotation
Abstract
TreeTagger is a multilingual tagger capable of performing headword and POS tagging. However, before the completion of this project, it did not support Indonesian. Thus, corpus query systems that employ TreeTagger as a subsystem, such as CQPweb v.3.3.10 and LancsBox v.5, were incapable of annotating Indonesian texts. This gap motivates three research aims: 1) to develop Indonesian language support for TreeTagger, 2) to evaluate its performance, and 3) to integrate the support into two popular corpus query systems, namely CQPweb and LancsBox, and demonstrate its functionalities. The research procedure can be summarised in three stages: training, annotation and evaluation, and incorporation. A pre-annotated corpus and lexicon were used in the training process. Headwords for the lexicon and corpus were added semi-automatically using MorphInd and then revised by experts. The training produced an Indonesian TreeTagger parameter file, whose accuracy for POS and headword annotation was 96 per cent and 91 per cent, respectively. The parameter file has been incorporated into LancsBox v.6 and CQPweb v.3.3.11, enabling support for the Indonesian language.
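As a rough illustration of the evaluation stage, the sketch below computes POS and headword accuracy by comparing tagger output with a gold standard, token by token. It is not the authors' evaluation script: it assumes both files hold one token per line in three tab-separated columns (token, POS tag, headword), and the function names, sample sentence, tags and headwords are all hypothetical.

```python
def read_annotations(text):
    """Parse 'token<TAB>pos<TAB>headword' lines into a list of tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        token, pos, headword = line.split("\t")
        rows.append((token, pos, headword))
    return rows


def accuracy(gold, predicted):
    """Return (pos_accuracy, headword_accuracy) over aligned token rows."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to evaluate")
    pos_hits = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    headword_hits = sum(g[2] == p[2] for g, p in zip(gold, predicted))
    return pos_hits / len(gold), headword_hits / len(gold)


# Hypothetical four-token sample; tags and headwords are illustrative only.
GOLD = "Dia\tPRP\tdia\nmembaca\tVB\tbaca\nbuku\tNN\tbuku\n.\tZ\t.\n"
PRED = "Dia\tPRP\tdia\nmembaca\tVB\tmembaca\nbuku\tNN\tbuku\n.\tZ\t.\n"

pos_acc, headword_acc = accuracy(read_annotations(GOLD), read_annotations(PRED))
print(f"POS accuracy: {pos_acc:.2f}, headword accuracy: {headword_acc:.2f}")
```

In this toy sample every POS tag matches but one headword does not, so the two scores differ. The same pattern, in which headword accuracy trails POS accuracy, appears in the figures reported above.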
References
Anthony, Laurence. 2024. AntConc v.4.2.4 [Software]. https://www.laurenceanthony.net/software/antconc/
Bahasa, Pusat. 2005. Kamus Besar Bahasa Indonesia. Jakarta: Badan Pengembangan dan Pembinaan Bahasa.
Bird, Steven, Edward Loper and Ewan Klein. 2019. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly. https://www.nltk.org/book/
Brezina, Vaclav and William Platt. 2024. #LancsBox X. https://lancsbox.lancs.ac.uk/
Brezina, Vaclav, Pierre Weill-Tessier and Tony McEnery. 2018. #LancsBox v.4.x [Software]. Lancaster. http://corpora.lancs.ac.uk/lancsbox
Brezina, Vaclav, Pierre Weill-Tessier and Tony McEnery. 2020. #LancsBox v.5.x [Software]. Lancaster. http://corpora.lancs.ac.uk/lancsbox
Can, Burcu, Ahmet Üstün and Murathan Kurfalı. 2017. Turkish PoS tagging by reducing sparsity with morpheme tags in small datasets. arXiv. https://doi.org/10.48550/arXiv.1703.03200
Chung, Siaw-Fong and Meng-Hsien Shih. 2019. An Annotated News Corpus of Malaysian Malay. https://doi.org/10.15026/94451
Cohn, Abigail C. and Maya Ravindranath. 2014. Local languages in Indonesia: Language maintenance or language shift? Linguistik Indonesia 32/2: 131–148.
Davies, Mark. 2024. English Corpora. [Corpora]. https://www.english-corpora.org
Denistia, Karlina. 2023. Databases on the Indonesian prefixes PE- and PEN. Journal of Language and Literature 23/1: 13–24.
Dinakaramani, Arawinda, Fam Rashel, Andry Luthfi and Ruli Manurung. 2014. Designing an Indonesian part of speech tagset and a manually tagged Indonesian corpus. Proceedings of the International Conference on Asian Language Processing. Kuching: Institute of Electrical and Electronic Engineers, 66–69. https://doi.org/10.1109/IALP.2014.6973519.
Eberhard, David M., Gary Francis Simons and Charles D. Fennig eds. 2022. Ethnologue: Languages of Asia. Dallas: SIL International.
Fu, Sihui, Nankai Lin, Gangqin Zhu and Shengyi Jiang. 2018. Towards Indonesian part-of-speech tagging: Corpus and models. http://lrec-conf.org/workshops/lrec2018/W34/pdf/3_W34.pdf
Gomide, Andressa. 2020. Corpus Linguistics Software: Understanding Their Usages and Delivering Two New Tools. Lancaster: Lancaster University dissertation.
Hardie, Andrew. 2012. CQPweb — Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17/3: 380–409.
Hardie, Andrew. 2023. CQPweb Lancaster. Lancaster. https://cqpweb.lancs.ac.uk/
Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel. 2014. The Sketch Engine: Ten years on. Lexicography 1/1: 7–36.
Lanin, Ivan, Romi Hardiyanto and Arthur Purnama. 2019. Kateglo Dataset v1.00.20131128. https://datahub.io/aps2201/kateglo_scrape#resource-kateglo_scrape_zip
Larasati, Septina Dian, Vladislav Kuboň and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Cerstin Mahlow and Michael Piotrowski eds. Systems and Frameworks for Computational Morphology. Berlin: Springer, 119–129.
Librian, Andi. 2016. Sastrawi [Software]. https://github.com/sastrawi/sastrawi.
Lun, Wong Wei, Mazura Mastura Muhammad, Warid Mihat, Muhammad Syafiq Ya Shak, Mairas Abdul Rahman and Prihantoro Prihantoro. 2023. Vocabulary index as a sustainable resource for teaching extended writing in the post-pandemic era. World Journal of English Language 13/3: 181. https://doi.org/10.5430/wjel.v13n3p181.
Maulana, Aditya and Ade Romadhony. 2021. Domain adaptation for part-of-speech tagging of Indonesian text using affix information. Procedia Computer Science 179: 640–647.
Park, Youngmin and Jungyun Seo. 2015. Joint model of Korean part-of-speech tagging and dependency parsing with partial tagged corpus. International Journal of Knowledge Engineering-IACSIT 1/1: 49–53.
Pisceldo, Femphy, Rahmad Mahendra, Ruli Manurung and I Wayan Arka. 2008. A two-level morphological analyser for the Indonesian language. In Nicola Stokes and David Powers eds. Proceedings of the Australasian Language Technology Association Workshop 2008. Hobart: Australian Language Technology Association, 142–150.
Prihantoro, Prihantoro. 2021a. An Automatic Morphological Analysis System for Indonesian. Lancaster: Lancaster University dissertation.
Prihantoro, Prihantoro. 2021b. An evaluation of MorphInd’s morphological annotation scheme for Indonesian. Corpora 16/2: 287–299.
Prihantoro, Prihantoro. 2022a. Buku Referensi Pengantar Linguistik Korpus: Lensa Digital Data Bahasa. Semarang: Undip Press.
Prihantoro, Prihantoro. 2022b. SANTI-Morf dictionaries. Lexicography 9/2: 175–193.
Quasthoff, Uwe, Dirk Goldhahn and Thomas Eckart. 2014. Building large resources for text mining: The Leipzig corpora collection. In Chris Biemann and Alexander Mehler eds. Theory and Applications of Natural Language Processing. Cham: Springer, 3–24.
Rashel, Fam. 2016. Manually Tagged Indonesian Corpus Data. GitHub. https://github.com/famrashel/idn-tagged-corpus/tree/a0c7a7409a31f2e6a3103778f2621d222525c450
Rashel, Fam, Andry Luthfi, Arawinda Dinakaramani and Ruli Manurung. 2014. Building an Indonesian rule-based part-of-speech tagger. Proceedings of the International Conference on Asian Language Processing. Kuching: Institute of Electrical and Electronic Engineers, 70–73.
Schmid, Helmut. 1999. Improvements in part-of-speech tagging with an application to German. In Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann and David Yarowsky eds. Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology. Dordrecht: Springer, 13–25.
Schmid, Helmut. 2024. TreeTagger: A POS Tagger for Many Languages [Software]. https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Scott, Mike. 2024. WordSmith v.9.0. Stroud: Lexical Analysis Software. https://www.lexically.net/wordsmith/
Thavareesan, Sajeetha and Sinnathamby Mahesan. 2020. Word embedding-based part of speech tagging in Tamil texts. Proceedings of the International Conference on Industrial and Information Systems. Rupnagar: Institute of Electrical and Electronic Engineers, 478–482.
Vasiliev, Yuli. 2020. Natural Language Processing with Python and spaCy: A Practical Introduction. San Francisco: No Starch Press.
Voutilainen, Atro. 1999. A short history of tagging. In Hans van Halteren ed. Syntactic Wordclass Tagging. Dordrecht: Springer, 9–21.
Wicaksono, Alfan Farizki and Ayu Purwarianti. 2010. HMM based part-of-speech tagger for Bahasa Indonesia. Proceedings of the 4th International Malay and Indonesian Language Workshop. Jakarta: Computer Science.
License
Copyright (c) 2024 Research in Corpus Linguistics

This work is licensed under a Creative Commons Attribution 4.0 International License.
Submission of your paper to this journal implies that the paper is not under submission for publication elsewhere. Material which has been previously copyrighted, published, or accepted for publication will not be considered for publication in this journal. Submission of a manuscript is interpreted as a statement of certification that no part of the manuscript is copyrighted by any other publisher or is under review by any other formal publication. By submitting your manuscript to us, you agree to these copyright guidelines. It is your responsibility to ensure that your manuscript does not cause any copyright infringements, defamation, or other problems.
Submitted papers are assumed to contain no proprietary material unprotected by patent or patent application; responsibility for technical content and for protection of proprietary material rests solely with the author(s) and their organizations and is not the responsibility of the journal or its editorial staff. The main author is responsible for ensuring that the article has been seen and approved by all the other authors. It is the responsibility of the author to obtain all necessary copyright release permissions for the use of any copyrighted materials in the manuscript prior to the submission.
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution (CC BY) License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Article submission implies author agreement with this policy.