Tag your Old French text

You can find more information on this specific model here: https://github.com/chartes/deucalion-model-af

Cite with the following

Corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your work: this includes the original research by E. Manjavacas, M. Kestemont and Á. Kádár, as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other citable works where available, such as models and datasets.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",}
@software{clerice_thibault_2019_3237455,
  author       = {Clérice, Thibault and
                  Camps, Jean-Baptiste and
                  Pinche, Ariane},
  title        = {Deucalion, Modèle Ancien Francais (0.2.0)},
  month        = jun,
  year         = 2019,
  publisher    = {Zenodo},
  version      = {v0.2.0},
  doi          = {10.5281/zenodo.3237455},
  url          = {https://doi.org/10.5281/zenodo.3237455}
}

Information

The model was trained on the following corpora:

  • Geste: un corpus de chansons de geste, dir. Jean-Baptiste Camps, avec la collab. d'Elena Albarran, Alice Cochet & Lucence Ing, Paris, 2016-…, http://github.com/Jean-Baptiste-Camps/Geste.
  • Édition nativement numérique du recueil hagiographique "Li Seint Confessor" de Wauchier de Denain d'après le manuscrit 412 de la Bibliothèque nationale de France, éd. Ariane Pinche, Lyon, en cours (only the Dialogues and La vie de Saint Martin are used for now; the data remain closed until the PhD thesis is published).
  • Les Institutes de Justinien en français, éd. F. Olivier-Martin (1935), éd. revue par F. Duval, lemmatisée par F. Duval et L. Ing. Paris, 2018.
  • Chrétien de Troyes: Cligès, Erec, Lancelot, Perceval, Yvain -- Manuscrit P (BnF Fr. 794), éd. P. Kunstmann (2009), annotation revue par J.B. Camps et L. Ing (2017).

The annotations are made according to the following reference lists:

  • lemma: Adolf Tobler et Erhard Friedrich Lommatzsch, Altfranzösisches Wörterbuch: édition électronique, éd. Peter Blumenthal et Achim Stein, Stuttgart, F. Steiner, 2002.
  • POS and morph: Sophie Prévost, Céline Guillot, Alexei Lavrentiev et Serge Heiden, Jeu d’étiquettes morphosyntaxiques CATTEX2009-max, Lyon, 2013, http://bfm.ens-lyon.fr/IMG/pdf/Cattex2009_2.0.pdf.

More information on the annotation practice can be found in the wiki of the Geste corpus: https://github.com/Jean-Baptiste-Camps/Geste/wiki.

Sample from annotation:

form    lemma   POS     morph
G'      je      PROper  PERS.=1|NOMB.=s|GENRE=m|CAS=n
irai    aler    VERcjg  MODE=ind|TEMPS=fut|PERS.=1|NOMB.=s
sor     sor2    PRE     MORPH=empty
eus     il      PROper  PERS.=3|NOMB.=p|GENRE=m|CAS=i
por     por2    PRE     MORPH=empty
lor     lor2    DETpos  PERS.=3|NOMB.=p|GENRE=f|CAS=r
terres  terre   NOMcom  NOMB.=p|GENRE=f|CAS=r
saisir  saisir  VERinf  MORPH=empty
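
The sample rows above can be read programmatically. The following is a minimal sketch, not part of the lemmatizer itself: it assumes the output is whitespace-separated with the four columns shown (form, lemma, POS, morph), that no field contains a space, and that `MORPH=empty` means "no morphological features". The function names `parse_morph` and `parse_rows` are hypothetical helpers introduced here for illustration.

```python
from typing import Dict, List


def parse_morph(morph: str) -> Dict[str, str]:
    """Split a morph string such as 'MODE=ind|TEMPS=fut' into a dict.

    Assumption: 'MORPH=empty' marks a token without morphological features,
    so it is mapped to an empty dict.
    """
    if morph == "MORPH=empty":
        return {}
    return dict(feature.split("=", 1) for feature in morph.split("|"))


def parse_rows(text: str) -> List[Dict[str, object]]:
    """Parse annotated output (header line + one token per line)."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()  # expected: form, lemma, POS, morph
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        row["morph"] = parse_morph(row["morph"])
        rows.append(row)
    return rows


# First three tokens of the sample, written here as tab-separated values.
SAMPLE = """form\tlemma\tPOS\tmorph
G'\tje\tPROper\tPERS.=1|NOMB.=s|GENRE=m|CAS=n
irai\taler\tVERcjg\tMODE=ind|TEMPS=fut|PERS.=1|NOMB.=s
sor\tsor2\tPRE\tMORPH=empty
"""

rows = parse_rows(SAMPLE)
print(rows[1]["lemma"], rows[1]["morph"]["TEMPS"])  # prints: aler fut
```

Keeping the feature names exactly as they appear in the output (including the trailing dot in `PERS.` and `NOMB.`) avoids guessing at a normalization that the tagset does not define.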

Bibliography

This lemmatizer is provided to you thanks to the data of the corpora listed below, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493--1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Geste: un corpus de chansons de geste, dir. Jean-Baptiste Camps, avec la collab. d'Elena Albarran, Alice Cochet & Lucence Ing, Paris, 2016-…, http://github.com/Jean-Baptiste-Camps/Geste.
  • Édition nativement numérique du recueil hagiographique "Li Seint Confessor" de Wauchier de Denain d'après le manuscrit 412 de la Bibliothèque nationale de France, éd. Ariane Pinche, Lyon, en cours (only the Dialogues and La vie de Saint Martin are used for now; the data remain closed until the PhD thesis is published).
  • Les Institutes de Justinien en français, éd. F. Olivier-Martin (1935), éd. revue par F. Duval, lemmatisée par F. Duval et L. Ing. Paris, 2018.
  • Chrétien de Troyes: Cligès, Erec, Lancelot, Perceval, Yvain -- Manuscrit P (BnF Fr. 794), éd. P. Kunstmann (2009), annotation revue par J.B. Camps et L. Ing (2017).