Tag your Classical Armenian text

You can find more information on this specific model here: https://zenodo.org/records/14056139

Information about the model

The models were trained for lemmatization, POS-tagging, and morphological analysis of Classical Armenian using the Universal Dependencies corpus (09/2024 release), comprising 82,557 wordforms (63,271 for training, 8,697 for validation, 10,591 for testing). The dataset consists primarily of the Classical Armenian Gospels, but the models show strong performance on both in-domain and out-of-domain tests (refer to the linked publication). For the training dataset (annotation guidelines and data), see:

Kocharov Petr, Kharatyan Lilit, Universal Dependencies

The development was part of the ANR project ANR-21-CE38-0006 "DALiH - Digitizing Armenian Linguistic Heritage", led by Victoria Khurshudyan (Inalco, SeDyL, CNRS, IRD), with initial contributions from Calfa and GREgORI. The models are also available on Zenodo.

Model Evaluation on UD test dataset

For each task, the table reports accuracy followed by the F1 micro-average score in parentheses:

task_name	all	ambiguous-tokens	known-tokens	unknown-tokens
aspect	0.9972 (0.9914)	0.9542 (0.916)	0.998 (0.9939)	0.8986 (0.8628)
case	0.9773 (0.9458)	0.9431 (0.9172)	0.9785 (0.9482)	0.8384 (0.748)
deixis	0.9965 (0.9594)	0.969 (0.5134)	0.9966 (0.961)	0.9873 (0.2484)
lemma	0.9961 (0.9269)	0.9824 (0.7625)	0.9977 (0.9882)	0.8067 (0.6169)
mood	0.9972 (0.9795)	0.9588 (0.9398)	0.9979 (0.9835)	0.9081 (0.824)
number	0.988 (0.9852)	0.9435 (0.9445)	0.9887 (0.9862)	0.9017 (0.8541)
numtype	0.9925 (0.2491)	0.7273 (0.4211)	0.9925 (0.2491)	0.9968 (0.4992)
person	0.9952 (0.9849)	0.9272 (0.8999)	0.9958 (0.9864)	0.9287 (0.8691)
pos	0.9965 (0.9889)	0.9948 (0.9904)	0.9978 (0.9933)	0.8447 (0.4692)
prontype	0.9948 (0.8617)	0.9708 (0.7985)	0.9949 (0.862)	0.9873 (0.2987)
tense	0.996 (0.9874)	0.9593 (0.9282)	0.9967 (0.9895)	0.9144 (0.8373)
verbform	0.9972 (0.983)	0.9585 (0.8622)	0.9977 (0.9863)	0.9445 (0.8736)
voice	0.9927 (0.8278)	0.9223 (0.885)	0.9934 (0.8298)	0.9144 (0.8671)
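The cells above pack two numbers into one field, which is awkward to sort or plot. The following is a minimal sketch, not part of the model's tooling, of how the table could be read into Python; it assumes the table is available as tab-separated text with cells of the form `accuracy (F1)`, and the names `TABLE`, `CELL` and `parse_scores` are hypothetical helpers introduced for illustration only.

```python
import re
from typing import Dict, Tuple

# Header and two rows copied from the evaluation table above (tab-separated).
TABLE = """task_name\tall\tambiguous-tokens\tknown-tokens\tunknown-tokens
lemma\t0.9961 (0.9269)\t0.9824 (0.7625)\t0.9977 (0.9882)\t0.8067 (0.6169)
pos\t0.9965 (0.9889)\t0.9948 (0.9904)\t0.9978 (0.9933)\t0.8447 (0.4692)"""

# One cell looks like "0.9961 (0.9269)": accuracy, then F1 in parentheses.
CELL = re.compile(r"^\s*([0-9.]+)\s*\(([0-9.]+)\)\s*$")


def parse_scores(table: str) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Map task name -> token subset -> (accuracy, f1)."""
    lines = table.strip().splitlines()
    subsets = lines[0].split("\t")[1:]  # drop the "task_name" column label
    scores: Dict[str, Dict[str, Tuple[float, float]]] = {}
    for line in lines[1:]:
        task, *cells = line.split("\t")
        row: Dict[str, Tuple[float, float]] = {}
        for subset, cell in zip(subsets, cells):
            match = CELL.match(cell)
            if match is None:
                raise ValueError(f"unexpected cell format: {cell!r}")
            row[subset] = (float(match.group(1)), float(match.group(2)))
        scores[task] = row
    return scores


scores = parse_scores(TABLE)
accuracy, f1 = scores["lemma"]["unknown-tokens"]
print(f"lemma on unknown tokens: accuracy={accuracy}, F1={f1}")
```

With the scores in a dictionary, it becomes straightforward to compare, for instance, known and unknown tokens for a given task.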

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer for your research: this includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapping built to handle pre- and post-processing.

For each model, a bibliography and other potentially citable works, such as models and datasets, are given.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}

@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",
}

@inproceedings{vidal-gorene-etal-2024-cross,
    title = "Cross-Dialectal Transfer and Zero-Shot Learning for {A}rmenian Varieties: A Comparative Analysis of {RNN}s, Transformers and {LLM}s",
    author = "Vidal-Gor{\`e}ne, Chahan  and
      Tomeh, Nadi  and
      Khurshudyan, Victoria",
    editor = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      {\"O}hman, Emily  and
      Miyagawa, So  and
      Alnajjar, Khalid  and
      Bizzoni, Yuri},
    booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.nlp4dh-1.42",
    pages = "438--449",
}

@misc{vidal_gorene_2024_14056139,
  author       = {Vidal-Gorène, Chahan and
                  Tomeh, Nadi and
                  Khurshudyan, Victoria},
  title        = {Pie Model for Lemmatization, POS Tagging, and
                   Morphological Analysis of Classical Armenian},
  month        = nov,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.14056139},
  url          = {https://doi.org/10.5281/zenodo.14056139}
}

@inproceedings{vidal2020lemmatization,
  title = {Lemmatization and POS-tagging process by using joint learning approach. Experimental results on Classical Armenian, Old Georgian, and Syriac},
  author = {Vidal-Gorène, Chahan and Kindt, Bastien},
  booktitle = {Proceedings of LT4HALA 2020-1st Workshop on Language Technologies for Historical and Ancient Languages},
  pages = {22--27},
  year = {2020}
}

Bibliography

This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:

E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493--1503.
Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537

Please refer to the following sources:

Clérice, T. (2020). Pie Extended, an extension for Pie with pre-processing and post-processing. Zenodo. https://doi.org/10.5281/zenodo.3883589
Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
Kharatyan L., & Kocharov P. (2024). Development of Linguistic Annotation Toolkit for Classical Armenian in SpaCy, Stanza, and UDPipe. In Proceedings of The First Workshop on Data-driven Approaches to Ancient Languages (DAAL 2024).
Manjavacas, E., Kádár, Á., & Kestemont, M. (2019). Improving lemmatization of non-standard languages with joint learning. arXiv preprint arXiv:1903.06939.
Vidal-Gorène, C., & Kindt, B. (2020). Lemmatization and POS-tagging process by using joint learning approach. Experimental results on Classical Armenian, Old Georgian, and Syriac. In Proceedings of LT4HALA 2020-1st Workshop on Language Technologies for Historical and Ancient Languages (pp. 22-27).
Vidal-Gorène, C., Khurshudyan, V., & Donabédian-Demopoulos, A. (2020). Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 90-101).
Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (pp. 438–449), Miami, USA. Association for Computational Linguistics. https://aclanthology.org/2024.nlp4dh-1.42
Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024). Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Classical Armenian. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities - EMNLP 2024 (1.0.0, pp. 438-449). Zenodo. https://doi.org/10.5281/zenodo.14056139