Tag your Eastern Armenian text
You can find more information on this specific model here: https://zenodo.org/records/14059437
The models were trained for lemmatization, POS tagging, and morphological analysis of Eastern Armenian on the Universal Dependencies corpus (09/2024 release), which comprises 52,950 wordforms (42,337 for training, 5,370 for validation, 5,243 for testing). The sentences come from a range of document types: blogs, fiction, grammar examples, legal texts, news, and nonfiction. Input data must be pre-tokenized. For the training dataset (annotation guidelines and data), see:
For each task, we report accuracy followed by the micro-averaged F1 score in parentheses, computed on all test tokens and on the ambiguous, known, and unknown subsets:
task_name | all | ambiguous-tokens | known-tokens | unknown-tokens |
---|---|---|---|---|
abbr | 0.997 (0.8622) | 0.8864 (0.7991) | 0.997 (0.8631) | 0.9986 (0.7497) |
adptype | 0.9916 (0.6512) | 0.8246 (0.7106) | 0.9915 (0.6526) | 0.9945 (0.3324) |
animacy | 0.9588 (0.92) | 0.8949 (0.8311) | 0.966 (0.9327) | 0.733 (0.6866) |
aspect | 0.9909 (0.538) | 0.9727 (0.7321) | 0.993 (0.6421) | 0.9245 (0.4644) |
case | 0.9714 (0.9624) | 0.9167 (0.8801) | 0.977 (0.9672) | 0.792 (0.7241) |
definite | 0.9738 (0.9664) | 0.9076 (0.8619) | 0.9782 (0.9713) | 0.8346 (0.8481) |
degree | 0.9491 (0.2435) | 0.2378 (0.0961) | 0.9482 (0.2434) | 0.9794 (0.4948) |
lemma | 0.9909 (0.9502) | 0.9434 (0.7085) | 0.996 (0.9917) | 0.8298 (0.6842) |
mood | 0.9942 (0.8632) | 0.9762 (0.847) | 0.9959 (0.8962) | 0.9382 (0.6227) |
nametype | 0.9809 (0.359) | 0.8 (0.1778) | 0.9823 (0.3639) | 0.9369 (0.2956) |
number | 0.9619 (0.7737) | 0.9301 (0.8898) | 0.9664 (0.782) | 0.8181 (0.6069) |
number[psor] | 0.9954 (0.3326) | 0.2 (0.1667) | 0.996 (0.3327) | 0.9753 (0.3292) |
numform | 0.991 (0.3975) | 0.7115 (0.4157) | 0.9909 (0.3974) | 0.9952 (0.4994) |
numtype | 0.9968 (0.6147) | 0.8571 (0.8381) | 0.997 (0.623) | 0.9904 (0.6035) |
person | 0.9936 (0.9532) | 0.986 (0.9641) | 0.9954 (0.9702) | 0.9362 (0.66) |
person[psor] | 0.9949 (0.2494) | 0.2 (0.1667) | 0.9955 (0.2494) | 0.9753 (0.2469) |
polarity | 0.9785 (0.9558) | 0.9318 (0.9015) | 0.9799 (0.9588) | 0.9348 (0.8632) |
pos | 0.9911 (0.9881) | 0.9885 (0.9794) | 0.9959 (0.9928) | 0.8387 (0.5319) |
poss | 0.9959 (0.9283) | 0.8213 (0.7389) | 0.9958 (0.9289) | 0.9986 (0.4997) |
prontype | 0.995 (0.9162) | 0.9101 (0.8508) | 0.9951 (0.917) | 0.9918 (0.363) |
subcat | 0.9805 (0.9232) | 0.8349 (0.836) | 0.984 (0.9351) | 0.8696 (0.7021) |
tense | 0.9942 (0.9726) | 0.9767 (0.4651) | 0.995 (0.9776) | 0.9671 (0.7785) |
verbform | 0.985 (0.7933) | 0.9565 (0.7614) | 0.9869 (0.7964) | 0.9266 (0.7046) |
voice | 0.9753 (0.5995) | 0.8213 (0.807) | 0.9787 (0.6107) | 0.8655 (0.5098) |
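Each cell in the table packs two numbers, accuracy and F1, which makes it hard to spot tasks where the two diverge. The following is a minimal sketch, assuming a few rows are copied from the table as plain text; the names `parse_row` and `parse_table` are illustrative helpers and are not part of pie-extended. It splits each cell into an (accuracy, F1) pair and ranks tasks by the gap between the two on the full test set.

```python
import re

# A few rows copied verbatim from the table above.
TABLE = """\
degree | 0.9491 (0.2435) | 0.2378 (0.0961) | 0.9482 (0.2434) | 0.9794 (0.4948) |
lemma | 0.9909 (0.9502) | 0.9434 (0.7085) | 0.996 (0.9917) | 0.8298 (0.6842) |
pos | 0.9911 (0.9881) | 0.9885 (0.9794) | 0.9959 (0.9928) | 0.8387 (0.5319) |
"""

# Column order of the table, after the task name.
SUBSETS = ("all", "ambiguous-tokens", "known-tokens", "unknown-tokens")

# One cell looks like "0.9491 (0.2435)": accuracy, then F1 in parentheses.
CELL = re.compile(r"^\s*([0-9.]+)\s*\(([0-9.]+)\)\s*$")


def parse_row(line):
    """Split one table row into (task, {subset: (accuracy, f1)})."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    task, values = cells[0], cells[1:]
    scores = {}
    for subset, cell in zip(SUBSETS, values):
        match = CELL.match(cell)
        if match is None:
            raise ValueError(f"unexpected cell format: {cell!r}")
        scores[subset] = (float(match.group(1)), float(match.group(2)))
    return task, scores


def parse_table(text):
    """Parse every non-empty row of the table text."""
    return dict(parse_row(line) for line in text.splitlines() if line.strip())


scores = parse_table(TABLE)

# Gap between accuracy and F1 on the full test set, largest first.
gaps = sorted(
    ((task, round(s["all"][0] - s["all"][1], 4)) for task, s in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for task, gap in gaps:
    print(f"{task}: accuracy - F1 = {gap}")
```

A large gap, as for `degree`, suggests that high accuracy is driven by a dominant value while rarer values are predicted less reliably, so both numbers are worth checking before relying on a given feature.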
Please remember that corpus creation and software engineering are valid research, and cite these resources when you use this lemmatizer in your own work: this includes the original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapper built to handle pre- and post-processing.
For each model, a bibliography is given, along with other citable works where available, such as models and datasets.
@software{thibault_clerice_2020_3883590,
  author = {Clérice, Thibault},
  title = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month = jun,
  year = 2020,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.3883589},
  url = {https://doi.org/10.5281/zenodo.3883589}
}

@inproceedings{manjavacas-etal-2019-improving,
  title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
  author = "Manjavacas, Enrique and K{\'a}d{\'a}r, {\'A}kos and Kestemont, Mike",
  booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
  month = jun,
  year = "2019",
  address = "Minneapolis, Minnesota",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/N19-1153",
  doi = "10.18653/v1/N19-1153",
  pages = "1493--1503"
}
@inproceedings{vidal-gorene-etal-2024-cross,
  title = "Cross-Dialectal Transfer and Zero-Shot Learning for {A}rmenian Varieties: A Comparative Analysis of {RNN}s, Transformers and {LLM}s",
  author = "Vidal-Gor{\`e}ne, Chahan and Tomeh, Nadi and Khurshudyan, Victoria",
  editor = {H{\"a}m{\"a}l{\"a}inen, Mika and {\"O}hman, Emily and Miyagawa, So and Alnajjar, Khalid and Bizzoni, Yuri},
  booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
  month = nov,
  year = "2024",
  address = "Miami, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.nlp4dh-1.42",
  pages = "438--449"
}

@misc{vidal_gorene_2024_14059437,
  author = {Vidal-Gorène, Chahan and Tomeh, Nadi and Khurshudyan, Victoria},
  title = {Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Eastern Armenian},
  month = nov,
  year = 2024,
  publisher = {Zenodo},
  version = {1.0.0},
  doi = {10.5281/zenodo.14059437},
  url = {https://doi.org/10.5281/zenodo.14059437}
}

@inproceedings{vidal2020recycling,
  title = {Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing},
  author = {Vidal-Gor{\`e}ne, Chahan and Khurshudyan, Victoria and Donab{\'e}dian-Demopoulos, Ana{\"\i}d},
  booktitle = {Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects},
  pages = {90--101},
  year = {2020}
}