Tag your Classical Armenian text
You can find more information on this specific model here: https://zenodo.org/records/14056139
The models were trained for lemmatization, POS tagging, and morphological analysis of Classical Armenian using the Universal Dependencies corpus (09/2024 release), comprising 82,557 wordforms (63,271 for training, 8,697 for validation, 10,591 for testing). The dataset consists primarily of the Classical Armenian Gospels, but the models perform well on both in-domain and out-of-domain tests (refer to the linked publication). For the training dataset (annotation guidelines and data), see:
For each task, we report accuracy, with the F1 micro-average score in parentheses:
task_name | all | ambiguous-tokens | known-tokens | unknown-tokens |
---|---|---|---|---|
aspect | 0.9972 (0.9914) | 0.9542 (0.916) | 0.998 (0.9939) | 0.8986 (0.8628) |
case | 0.9773 (0.9458) | 0.9431 (0.9172) | 0.9785 (0.9482) | 0.8384 (0.748) |
deixis | 0.9965 (0.9594) | 0.969 (0.5134) | 0.9966 (0.961) | 0.9873 (0.2484) |
lemma | 0.9961 (0.9269) | 0.9824 (0.7625) | 0.9977 (0.9882) | 0.8067 (0.6169) |
mood | 0.9972 (0.9795) | 0.9588 (0.9398) | 0.9979 (0.9835) | 0.9081 (0.824) |
number | 0.988 (0.9852) | 0.9435 (0.9445) | 0.9887 (0.9862) | 0.9017 (0.8541) |
numtype | 0.9925 (0.2491) | 0.7273 (0.4211) | 0.9925 (0.2491) | 0.9968 (0.4992) |
person | 0.9952 (0.9849) | 0.9272 (0.8999) | 0.9958 (0.9864) | 0.9287 (0.8691) |
pos | 0.9965 (0.9889) | 0.9948 (0.9904) | 0.9978 (0.9933) | 0.8447 (0.4692) |
prontype | 0.9948 (0.8617) | 0.9708 (0.7985) | 0.9949 (0.862) | 0.9873 (0.2987) |
tense | 0.996 (0.9874) | 0.9593 (0.9282) | 0.9967 (0.9895) | 0.9144 (0.8373) |
verbform | 0.9972 (0.983) | 0.9585 (0.8622) | 0.9977 (0.9863) | 0.9445 (0.8736) |
voice | 0.9927 (0.8278) | 0.9223 (0.885) | 0.9934 (0.8298) | 0.9144 (0.8671) |
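Each cell in the table above packs two numbers into one string: accuracy first, then the F1 score in parentheses. The following is a minimal sketch of how such rows could be split into numeric pairs for further analysis. The helper names `parse_cell` and `parse_row` are hypothetical and are not part of Pie or Pie Extended.

```python
import re

# Matches a cell such as "0.9972 (0.9914)": accuracy, then F1 in parentheses.
CELL = re.compile(r"^\s*([0-9.]+)\s*\(([0-9.]+)\)\s*$")

# Column names taken from the table header.
COLUMNS = ("all", "ambiguous-tokens", "known-tokens", "unknown-tokens")


def parse_cell(cell):
    """Return (accuracy, f1) as floats for one table cell (hypothetical helper)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"Unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def parse_row(line):
    """Return (task_name, {column: (accuracy, f1)}) for one table row."""
    # Drop optional leading/trailing pipes, then split on the remaining ones.
    parts = [part.strip() for part in line.strip().strip("|").split("|")]
    task, cells = parts[0], parts[1:]
    return task, {col: parse_cell(cell) for col, cell in zip(COLUMNS, cells)}


row = "lemma | 0.9961 (0.9269) | 0.9824 (0.7625) | 0.9977 (0.9882) | 0.8067 (0.6169) |"
task, scores = parse_row(row)
accuracy, f1 = scores["unknown-tokens"]
print(task, accuracy, f1)
```

Keeping the two scores as a tuple per column makes it straightforward to compare, for example, accuracy on known versus unknown tokens across tasks.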
Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research: this includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapper built to handle pre- and post-processing.
For each model, a bibliography and potentially other citable works, such as models and datasets, are given.
@software{thibault_clerice_2020_3883590,
  author = {Clérice, Thibault},
  title = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month = jun,
  year = 2020,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.3883589},
  url = {https://doi.org/10.5281/zenodo.3883589}
}

@inproceedings{manjavacas-etal-2019-improving,
  title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
  author = "Manjavacas, Enrique and K{\'a}d{\'a}r, {\'A}kos and Kestemont, Mike",
  booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
  month = jun,
  year = "2019",
  address = "Minneapolis, Minnesota",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/N19-1153",
  doi = "10.18653/v1/N19-1153",
  pages = "1493--1503",
}
@inproceedings{vidal-gorene-etal-2024-cross,
  title = "Cross-Dialectal Transfer and Zero-Shot Learning for {A}rmenian Varieties: A Comparative Analysis of {RNN}s, Transformers and {LLM}s",
  author = "Vidal-Gor{\`e}ne, Chahan and Tomeh, Nadi and Khurshudyan, Victoria",
  editor = {H{\"a}m{\"a}l{\"a}inen, Mika and {\"O}hman, Emily and Miyagawa, So and Alnajjar, Khalid and Bizzoni, Yuri},
  booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
  month = nov,
  year = "2024",
  address = "Miami, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.nlp4dh-1.42",
  pages = "438--449",
}

@misc{vidal_gorene_2024_14056139,
  author = {Vidal-Gorène, Chahan and Tomeh, Nadi and Khurshudyan, Victoria},
  title = {Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Classical Armenian},
  month = nov,
  year = 2024,
  publisher = {Zenodo},
  version = {1.0.0},
  doi = {10.5281/zenodo.14056139},
  url = {https://doi.org/10.5281/zenodo.14056139}
}

@inproceedings{vidal2020lemmatization,
  title = {Lemmatization and POS-tagging process by using joint learning approach. Experimental results on Classical Armenian, Old Georgian, and Syriac},
  author = {Vidal-Gorène, Chahan and Kindt, Bastien},
  booktitle = {Proceedings of LT4HALA 2020-1st Workshop on Language Technologies for Historical and Ancient Languages},
  pages = {22--27},
  year = {2020}
}
This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:
Please refer to the following sources: