Tag your Eastern Armenian text

You can find more information on this specific model here: https://zenodo.org/records/14059437

Information about the model

The models were trained for lemmatization, POS-tagging, and morphological analysis of Eastern Armenian using the Universal Dependencies corpus (09/2024 release), comprising 52,950 wordforms (42,337 for training, 5,370 for validation, 5,243 for testing). Sentences cover diverse documents: blog, fiction, grammar examples, legal, news, and nonfiction. The input data should be pre-tokenized. For the training dataset (annotation guidelines and data), see:

Yavrumyan Marat and ArmTDP team: Universal Dependencies
Yavrumyan, M.M., Danielyan, A.S. (2020). "Universal Dependencies and the Armenian Treebank." Herald of the Social Sciences (2). 231-244 (in Armenian)
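
Since the model expects pre-tokenized input, the text has to be split into tokens before tagging. The snippet below is a minimal sketch of one way to do that, not the tokenization used to build the training data. The function names (`pretokenize`, `to_model_input`), the set of punctuation marks treated as separate tokens, and the choice of space-separated output are all assumptions made for illustration; adapt them to the conventions of the Universal Dependencies treebank and to the input format your tagging tool expects.

```python
import re

# Assumption: these punctuation marks are emitted as separate tokens.
# Marks written inside a word (such as the Armenian question mark) are
# deliberately left attached to the word, since splitting would break it.
PUNCT = "։,՝«»()"

# A token is either a run of non-space, non-punctuation characters,
# or a single punctuation mark from PUNCT.
TOKEN_RE = re.compile(rf"[^\s{re.escape(PUNCT)}]+|[{re.escape(PUNCT)}]")


def pretokenize(text: str) -> list[str]:
    """Split raw text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def to_model_input(text: str) -> str:
    """Join tokens with single spaces (assumed input format, check your tool)."""
    return " ".join(pretokenize(text))


if __name__ == "__main__":
    print(pretokenize("Ես գնում եմ տուն։"))
    print(to_model_input("«Բարև», ասաց նա։"))
```

A rule-based splitter like this is only a starting point: clitics, abbreviations, and numbers may need handling that matches the annotation guidelines referenced above.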

The development was part of the ANR project ANR-21-CE38-0006 "DALiH - Digitizing Armenian Linguistic Heritage", led by Victoria Khurshudyan (Inalco, SeDyL, CNRS, IRD), with initial contributions from Calfa. The models are also available on Zenodo: https://zenodo.org/records/14059437

Model Evaluation on UD test dataset

For each task, every cell reports the accuracy followed by the F1 micro-average score in parentheses:

task_name	all	ambiguous-tokens	known-tokens	unknown-tokens
abbr	0.997 (0.8622)	0.8864 (0.7991)	0.997 (0.8631)	0.9986 (0.7497)
adptype	0.9916 (0.6512)	0.8246 (0.7106)	0.9915 (0.6526)	0.9945 (0.3324)
animacy	0.9588 (0.92)	0.8949 (0.8311)	0.966 (0.9327)	0.733 (0.6866)
aspect	0.9909 (0.538)	0.9727 (0.7321)	0.993 (0.6421)	0.9245 (0.4644)
case	0.9714 (0.9624)	0.9167 (0.8801)	0.977 (0.9672)	0.792 (0.7241)
definite	0.9738 (0.9664)	0.9076 (0.8619)	0.9782 (0.9713)	0.8346 (0.8481)
degree	0.9491 (0.2435)	0.2378 (0.0961)	0.9482 (0.2434)	0.9794 (0.4948)
lemma	0.9909 (0.9502)	0.9434 (0.7085)	0.996 (0.9917)	0.8298 (0.6842)
mood	0.9942 (0.8632)	0.9762 (0.847)	0.9959 (0.8962)	0.9382 (0.6227)
nametype	0.9809 (0.359)	0.8 (0.1778)	0.9823 (0.3639)	0.9369 (0.2956)
number	0.9619 (0.7737)	0.9301 (0.8898)	0.9664 (0.782)	0.8181 (0.6069)
number[psor]	0.9954 (0.3326)	0.2 (0.1667)	0.996 (0.3327)	0.9753 (0.3292)
numform	0.991 (0.3975)	0.7115 (0.4157)	0.9909 (0.3974)	0.9952 (0.4994)
numtype	0.9968 (0.6147)	0.8571 (0.8381)	0.997 (0.623)	0.9904 (0.6035)
person	0.9936 (0.9532)	0.986 (0.9641)	0.9954 (0.9702)	0.9362 (0.66)
person[psor]	0.9949 (0.2494)	0.2 (0.1667)	0.9955 (0.2494)	0.9753 (0.2469)
polarity	0.9785 (0.9558)	0.9318 (0.9015)	0.9799 (0.9588)	0.9348 (0.8632)
pos	0.9911 (0.9881)	0.9885 (0.9794)	0.9959 (0.9928)	0.8387 (0.5319)
poss	0.9959 (0.9283)	0.8213 (0.7389)	0.9958 (0.9289)	0.9986 (0.4997)
prontype	0.995 (0.9162)	0.9101 (0.8508)	0.9951 (0.917)	0.9918 (0.363)
subcat	0.9805 (0.9232)	0.8349 (0.836)	0.984 (0.9351)	0.8696 (0.7021)
tense	0.9942 (0.9726)	0.9767 (0.4651)	0.995 (0.9776)	0.9671 (0.7785)
verbform	0.985 (0.7933)	0.9565 (0.7614)	0.9869 (0.7964)	0.9266 (0.7046)
voice	0.9753 (0.5995)	0.8213 (0.807)	0.9787 (0.6107)	0.8655 (0.5098)
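
To work with these scores programmatically, each cell has to be split into its two numbers. The following is a small sketch that assumes the table is stored as tab-separated text exactly as laid out above, with cells of the form `accuracy (F1)`. The names `Score`, `parse_cell`, and `parse_row` are hypothetical helpers introduced here, not part of any released tooling.

```python
import re
from typing import NamedTuple


class Score(NamedTuple):
    accuracy: float
    f1: float


# Column names as they appear in the table header (after task_name).
COLUMNS = ("all", "ambiguous-tokens", "known-tokens", "unknown-tokens")

# Matches a cell such as "0.9911 (0.9881)".
CELL_RE = re.compile(r"^\s*([0-9.]+)\s*\(([0-9.]+)\)\s*$")


def parse_cell(cell: str) -> Score:
    """Turn 'accuracy (F1)' into a Score; raise if the cell is malformed."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return Score(float(match.group(1)), float(match.group(2)))


def parse_row(line: str) -> tuple[str, dict[str, Score]]:
    """Parse one tab-separated table row into (task name, scores per column)."""
    task, *cells = line.rstrip("\n").split("\t")
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} score cells, got {len(cells)}")
    return task, dict(zip(COLUMNS, map(parse_cell, cells)))


if __name__ == "__main__":
    row = "pos\t0.9911 (0.9881)\t0.9885 (0.9794)\t0.9959 (0.9928)\t0.8387 (0.5319)"
    task, scores = parse_row(row)
    print(task, scores["unknown-tokens"])
```

Keeping accuracy and F1 as separate fields makes it easy to compare, for instance, how each task degrades from known to unknown tokens.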

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research: this includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár, as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other potentially citable works such as models and datasets.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",
}

@inproceedings{vidal-gorene-etal-2024-cross,
    title = "Cross-Dialectal Transfer and Zero-Shot Learning for {A}rmenian Varieties: A Comparative Analysis of {RNN}s, Transformers and {LLM}s",
    author = "Vidal-Gor{\`e}ne, Chahan  and
      Tomeh, Nadi  and
      Khurshudyan, Victoria",
    editor = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      {\"O}hman, Emily  and
      Miyagawa, So  and
      Alnajjar, Khalid  and
      Bizzoni, Yuri},
    booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.nlp4dh-1.42",
    pages = "438--449",
}

@misc{vidal_gorene_2024_14059437,
  author       = {Vidal-Gorène, Chahan and
                  Tomeh, Nadi and
                  Khurshudyan, Victoria},
  title        = {Pie Model for Lemmatization, POS Tagging, and
                   Morphological Analysis of Eastern Armenian},
  month        = nov,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.14059437},
  url          = {https://doi.org/10.5281/zenodo.14059437}
}

@inproceedings{vidal2020recycling,
  title={Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing},
  author={Vidal-Gor{\`e}ne, Chahan and Khurshudyan, Victoria and Donab{\'e}dian-Demopoulos, Ana{\"\i}d},
  booktitle={Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects},
  pages={90--101},
  year={2020}
}

Bibliography

This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. To cite them:

E. Manjavacas, Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493-1503.
Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537

Please refer to the following sources:

Clérice, T. (2020). Pie Extended, an extension for Pie with pre-processing and post-processing. Zenodo. https://doi.org/10.5281/zenodo.3883589
Manjavacas, E., & Kestemont, M. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
Manjavacas, E., Kádár, Á., & Kestemont, M. (2019). Improving lemmatization of non-standard languages with joint learning. arXiv preprint arXiv:1903.06939.
Vidal-Gorène, C., Khurshudyan, V., & Donabédian-Demopoulos, A. (2020). Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 90-101).
Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (pp. 438–449), Miami, USA. Association for Computational Linguistics. https://aclanthology.org/2024.nlp4dh-1.42
Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Eastern Armenian (Version 1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14059437
Yavrumyan, M. M., Danielyan, A.S. (2020). "Universal Dependencies and the Armenian Treebank." Herald of the Social Sciences (2). 231-244 (in Armenian)