Tag your Western Armenian text

You can find more information on this specific model here: https://zenodo.org/records/14060082


Information about the model

The models were trained for lemmatization, POS-tagging, and morphological analysis of Western Armenian using the Universal Dependencies corpus (09/2024 release), comprising 124,230 wordforms (96,948 for training, 13,615 for validation, 13,667 for testing). Sentences cover diverse sources: blog, fiction, news, nonfiction, reviews, social, spoken, web, wiki. Input data must be pre-tokenized. For the training dataset (annotation guidelines and data), see:

The development was part of the ANR project ANR-21-CE38-0006 "DALiH - Digitizing Armenian Linguistic Heritage", led by Victoria Khurshudyan (Inalco, SeDyL, CNRS, IRD), with initial contributions from Calfa. The models are also available on Zenodo; more information on this specific model is given at https://zenodo.org/records/14060082.
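Since input data must be pre-tokenized, the following is a minimal sketch of what a pre-tokenization step could look like. Everything in it is an assumption made for illustration: the `pretokenize` and `to_token_lines` helpers are hypothetical, the punctuation set is a rough guess, and the one-token-per-line output is only one possible layout. It does not reproduce the tokenization rules of the Universal Dependencies corpus, which should be followed for best results.

```python
import re

# Assumption: these marks are split off as separate tokens. The real
# UD Western Armenian guidelines may treat some of them differently.
PUNCT = "։՝,.:;«»()"
TOKEN_RE = re.compile(rf"[^\s{re.escape(PUNCT)}]+|[{re.escape(PUNCT)}]")


def pretokenize(text: str) -> list[str]:
    """Split text on whitespace and detach the punctuation marks in PUNCT."""
    return TOKEN_RE.findall(text)


def to_token_lines(text: str) -> str:
    """Render the tokens one per line (a hypothetical layout, not a documented format)."""
    return "\n".join(pretokenize(text))


if __name__ == "__main__":
    sample = "Ես տուն կ՚երթամ։"
    print(to_token_lines(sample))
```

The Armenian apostrophe (՚) is deliberately left inside the word in this sketch; whether it should split a token is a decision for the annotation guidelines, not for this helper.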

Model Evaluation on UD test dataset

For each task, we report accuracy followed by the F1 micro-average score in parentheses:

task_name all ambiguous-tokens known-tokens unknown-tokens
abbr 0.9977 (0.8399) 0.7045 (0.5866) 0.998 (0.8563) 0.987 (0.4967)
adptype 0.9969 (0.9628) 0.9147 (0.9216) 0.9968 (0.9631) 0.9982 (0.333)
animacy 0.9851 (0.9655) 0.9586 (0.9258) 0.99 (0.977) 0.7918 (0.7638)
aspect 0.9982 (0.8154) 0.9893 (0.9566) 0.9987 (0.8496) 0.9768 (0.7511)
case 0.9902 (0.9388) 0.9636 (0.9342) 0.993 (0.9555) 0.8812 (0.5959)
connegative 0.9969 (0.4992) 0.5073 (0.3366) 0.9969 (0.4992) 0.9953 (0.4988)
definite 0.9756 (0.9679) 0.9156 (0.864) 0.9781 (0.9707) 0.8783 (0.8836)
degree 0.9632 (0.1962) 0.2944 (0.1516) 0.963 (0.2453) 0.9696 (0.2461)
deixis 0.9979 (0.9506) 0.8254 (0.7873) 0.9979 (0.9523) 0.9964 (0.2496)
hyph 0.9987 (0.9116) 0.9716 (0.9049) 0.9988 (0.9158) 0.9971 (0.4993)
lemma 0.9879 (0.9098) 0.9173 (0.5108) 0.991 (0.9413) 0.865 (0.7453)
mood 0.9964 (0.6131) 0.9744 (0.7198) 0.9968 (0.6152) 0.9833 (0.8434)
number 0.9895 (0.9777) 0.9626 (0.9172) 0.9931 (0.9847) 0.8505 (0.7682)
numform 0.9974 (0.876) 0.8033 (0.597) 0.9975 (0.879) 0.9953 (0.6234)
numtype 0.9967 (0.5264) 0.7854 (0.3376) 0.9967 (0.5264) 0.9964 (0.5552)
person 0.9978 (0.9899) 0.9675 (0.9414) 0.9981 (0.9919) 0.9862 (0.9073)
person[psor] 0.9919 (0.249) 0.5312 (0.2313) 0.9921 (0.249) 0.983 (0.2479)
polarity 0.9969 (0.9928) 0.9804 (0.9687) 0.9973 (0.9936) 0.9776 (0.9602)
polite 0.9991 (0.4998) 0.6014 (0.3755) 0.9991 (0.4998) 0.9982 (0.4995)
pos 0.9933 (0.9897) 0.9874 (0.9611) 0.9967 (0.9949) 0.8606 (0.5483)
poss 0.9974 (0.9541) 0.8635 (0.7017) 0.9974 (0.9551) 0.9975 (0.4994)
prontype 0.9949 (0.9189) 0.882 (0.742) 0.995 (0.9211) 0.9873 (0.317)
reflex 0.9957 (0.871) 0.6424 (0.537) 0.9956 (0.8716) 0.9989 (0.4997)
style 0.9918 (0.1423) 0.4598 (0.126) 0.9921 (0.1423) 0.9801 (0.1414)
subcat 0.9957 (0.9825) 0.9008 (0.8952) 0.9969 (0.9874) 0.9493 (0.875)
tense 0.9979 (0.9913) 0.9833 (0.7189) 0.998 (0.9921) 0.9913 (0.9613)
typo 0.9985 (0.4996) 0.8889 (0.4706) 0.9986 (0.4996) 0.9975 (0.4994)
verbform 0.9956 (0.7889) 0.9731 (0.9462) 0.9961 (0.9874) 0.9783 (0.7649)
voice 0.99 (0.6858) 0.8393 (0.7731) 0.9917 (0.693) 0.9232 (0.4915)
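The gap between accuracy and the parenthesized F1 score varies considerably from task to task (for example, degree: 0.9632 against 0.1962 on all tokens). A short sketch of how the rows above could be parsed and ranked by that gap is given below. The `parse_row` helper and the column names used as dictionary keys are illustrative assumptions taken from the table header, and only three rows are copied in to keep the example short.

```python
import re

# One "accuracy (f1)" pair, e.g. "0.9632 (0.1962)"
PAIR_RE = re.compile(r"(\d*\.\d+) \((\d*\.\d+)\)")
COLUMNS = ("all", "ambiguous-tokens", "known-tokens", "unknown-tokens")

# Three rows copied verbatim from the table above
ROWS = [
    "degree 0.9632 (0.1962) 0.2944 (0.1516) 0.963 (0.2453) 0.9696 (0.2461)",
    "pos 0.9933 (0.9897) 0.9874 (0.9611) 0.9967 (0.9949) 0.8606 (0.5483)",
    "lemma 0.9879 (0.9098) 0.9173 (0.5108) 0.991 (0.9413) 0.865 (0.7453)",
]


def parse_row(line: str) -> tuple[str, dict[str, tuple[float, float]]]:
    """Return (task, {column: (accuracy, f1)}) for one table row."""
    task, rest = line.split(" ", 1)
    pairs = [(float(acc), float(f1)) for acc, f1 in PAIR_RE.findall(rest)]
    return task, dict(zip(COLUMNS, pairs))


parsed = [parse_row(row) for row in ROWS]

# Rank tasks by the accuracy/F1 gap on the "all" column, largest first
ranked = sorted(
    ((task, round(scores["all"][0] - scores["all"][1], 4)) for task, scores in parsed),
    key=lambda item: item[1],
    reverse=True,
)

if __name__ == "__main__":
    for task, gap in ranked:
        print(f"{task}: accuracy - F1 = {gap}")
```

A large gap is a hint to check the F1 score, and not accuracy alone, before relying on a given feature.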

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research. This includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár, as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography and other citable works, such as models and datasets, are given.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}

@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",
}

@inproceedings{vidal-gorene-etal-2024-cross,
    title = "Cross-Dialectal Transfer and Zero-Shot Learning for {A}rmenian Varieties: A Comparative Analysis of {RNN}s, Transformers and {LLM}s",
    author = "Vidal-Gor{\`e}ne, Chahan  and
      Tomeh, Nadi  and
      Khurshudyan, Victoria",
    editor = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      {\"O}hman, Emily  and
      Miyagawa, So  and
      Alnajjar, Khalid  and
      Bizzoni, Yuri},
    booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.nlp4dh-1.42",
    pages = "438--449",
}

@misc{vidal_gorene_2024_14060082,
  author       = {Vidal-Gorène, Chahan and
                  Tomeh, Nadi and
                  Khurshudyan, Victoria},
  title        = {Pie Model for Lemmatization, POS Tagging, and
                   Morphological Analysis of Western Armenian},
  month        = nov,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.14060082},
  url          = {https://doi.org/10.5281/zenodo.14060082}
}

@inproceedings{vidal2020recycling,
  title={Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing},
  author={Vidal-Gor{\`e}ne, Chahan and Khurshudyan, Victoria and Donab{\'e}dian-Demopoulos, Ana{\"\i}d},
  booktitle={Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects},
  pages={90--101},
  year={2020}
}

Bibliography

This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493-1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Please refer to the following sources:

    • Clérice, T. (2020). Pie Extended, an extension for Pie with pre-processing and post-processing. Zenodo. https://doi.org/10.5281/zenodo.3883589
    • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
    • Kharatyan L., & Kocharov P. (2024). Development of Linguistic Annotation Toolkit for Classical Armenian in SpaCy, Stanza, and UDPipe. In Proceedings of The First Workshop on Data-driven Approaches to Ancient Languages (DAAL 2024).
    • Manjavacas, E., Kádár, Á., & Kestemont, M. (2019). Improving lemmatization of non-standard languages with joint learning. arXiv preprint arXiv:1903.06939.
    • Vidal-Gorène, C., Khurshudyan, V., & Donabédian-Demopoulos, A. (2020). Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 90-101).
    • Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (pp. 438-449), Miami, USA. Association for Computational Linguistics. https://aclanthology.org/2024.nlp4dh-1.42
    • Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Western Armenian. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, EMNLP 2024 (version 1.0.0, pp. 438-449). Zenodo. https://doi.org/10.5281/zenodo.14060082
    • Yavrumyan M. M. (2019). "Universal Dependencies for Armenian." International Conference on Digital Armenian, Inalco, Paris.