Tag your (Early) Modern French text

You can find more information on this specific model here: https://github.com/e-ditiones/LEM17

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your own work. This includes the original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other citable works such as models and datasets where available.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",}
@software{clerice_thibault_2019_3237455,
  author       = {Gabay, Simon and
                  Clérice, Thibault and
                  Camps, Jean-Baptiste and
                  Tanguy, Jean-Baptiste and
                  Gille-Levenson, Matthias},
  title        = {Deucalion, Modèle Français moderne (0.1.0)},
  month        = jun,
  year         = 2020,
  publisher    = {GitHub},
  version      = {v1.0.0},
  url          = {https://github.com/e-ditiones/LEM17/releases/tag/v1}
}

Information

Accuracies are the following (v.1) for lemmas:

Orig   16     17     18     19     20     All
Drama  94.73  97.42  97.47  98.56  97.86  97.25
Varia  96.23  98.09  98.27  98.23  97.46  97.66
Both   95.51  97.76  97.88  98.39  97.66  97.46

Norm   16     17     18     19     20     All
Drama  97.36  98.41  98.51  98.56  97.86  98.15
Varia  98.00  98.40  98.54  98.23  97.46  98.13
Both   97.69  98.40  98.53  98.39  97.66  98.14

Accuracies are the following (v.1) for POS:

Orig   16     17     18     19     20     All
Drama  90.34  94.47  94.64  95.03  93.71  93.69
Varia  89.85  93.44  95.98  92.24  94.03  93.14
Both   90.08  93.95  95.33  93.65  93.87  93.41

Norm   16     17     18     19     20     All
Drama  93.69  95.75  95.61  95.03  93.71  94.76
Varia  92.52  94.81  95.98  92.24  94.03  93.94
Both   93.08  95.28  95.80  93.65  93.87  94.35
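To read the tables side by side, the short Python sketch below computes the difference between the "Norm" and "Orig" lemma accuracies on the "Both" rows. It assumes that "Orig" and "Norm" refer to original and normalised test data and that the column labels 16 to 20 are centuries; the variable and function names are illustrative and not part of any released tool.

```python
# Column labels as printed in the accuracy tables (assumed to be centuries).
COLUMNS = ["16", "17", "18", "19", "20", "All"]

# "Both" rows of the lemma tables, copied from the figures above.
LEMMA_ORIG_BOTH = [95.51, 97.76, 97.88, 98.39, 97.66, 97.46]
LEMMA_NORM_BOTH = [97.69, 98.40, 98.53, 98.39, 97.66, 98.14]


def gains(orig, norm):
    """Return the accuracy difference (norm - orig) per column, in points."""
    return {col: round(n - o, 2) for col, o, n in zip(COLUMNS, orig, norm)}


lemma_gain = gains(LEMMA_ORIG_BOTH, LEMMA_NORM_BOTH)
for column, gain in lemma_gain.items():
    print(f"{column:>3}: +{gain:.2f}")
```

On these figures the difference is largest in the 16 column and zero in the 19 and 20 columns.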

The model is trained on normalised (i.e. "translated" into contemporary French) and non-normalised transcriptions.

The model was trained on the following corpora:

  • CornMol for lemma + POS. Cf. Florian Cafiero, Jean-Baptiste Camps. Why Molière most likely did write his plays. Science Advances, American Association for the Advancement of Science (AAAS), 2019, 5 (11), pp.eaax5489. 10.1126/sciadv.aax5489.
  • FranText OA for lemmas. The original corpus has been heavily corrected. Cf. FranText, https://www.frantext.fr/.
  • Presto gold for lemma + POS. The original corpus has been heavily corrected. Cf. Presto project, http://presto.ens-lyon.fr/.
  • Presto max for lemmas. The original corpus has been heavily corrected. Cf. Presto project, http://presto.ens-lyon.fr/.

The annotations are made according to the following reference lists:

  • lemma: Sascha Diwersy, Achille Falaise, Marie-Hélène Lay, Gilles Souvay, Ressources et méthodes pour l’analyse diachronique, Langages 2017/2 (N° 206), pp. 21-24: https://www.cairn.info/revue-langages-2017-2-page-21.htm.
  • POS and morph: Sophie Prévost, Céline Guillot, Alexei Lavrentiev et Serge Heiden, Jeu d’étiquettes morphosyntaxiques CATTEX2009-max, Lyon, 2013, http://bfm.ens-lyon.fr/IMG/pdf/Cattex2009_2.0.pdf.

More information on the annotation practice can be found in Simon Gabay, Jean-Baptiste Camps, Thibault Clérice, Manuel d'annotation linguistique pour le français moderne (XVIe-XVIIIe siècles), 2020: https://hal.archives-ouvertes.fr/hal-02571190.

Sample from annotation:

form        lemma       POS     morph
Pour        pour        PRE     MORPH=empty
moi         je          PROper  PERS.=1|NOMB.=s|GENRE=x|CAS=i
je          je          PROper  PERS.=1|NOMB.=s|GENRE=x|CAS=n
suis        être        VERcjg  MODE=ind|TEMPS=pst|PERS.=1|NOMB.=s
toûjours    toujours    ADVgen  MORPH=empty
ici         ici         ADVgen  MORPH=empty
,           ,           PONfbl  MORPH=empty
où          que         PROrel  NOMB.=x|GENRE=x|CAS=i
,           ,           PONfbl  MORPH=empty
à           à           PRE     MORPH=empty
des         un          DETndf  NOMB.=p|GENRE=m
rumatismes  rhumatisme  NOMcom  NOMB.=p|GENRE=f
près        près        ADVgen  MORPH=empty
,           ,           PONfbl  MORPH=empty
je          je          PROper  PERS.=1|NOMB.=s|GENRE=x|CAS=n
me          je          PROper  PERS.=1|NOMB.=s|GENRE=x|CAS=r
suis        être        VERcjg  MODE=ind|TEMPS=pst|PERS.=1|NOMB.=s
assez       assez       ADVgen  MORPH=empty
bien        bien        ADVgen  MORPH=empty
porté       porter      VERppe  NOMB.=s|GENRE=m
.           .           PONfrt  MORPH=empty
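The morph column packs several features into one string, separated by "|", with "MORPH=empty" standing for a token without morphological features. The following Python sketch shows one way to read rows in the layout of the sample above. It assumes whitespace-separated columns with no spaces inside a field; the Token type and the function names are hypothetical helpers, not part of Pie Extended.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    morph: Dict[str, str]


def parse_morph(raw: str) -> Dict[str, str]:
    """Split 'PERS.=1|NOMB.=s' into a dict; 'MORPH=empty' gives an empty dict."""
    if raw == "MORPH=empty":
        return {}
    return dict(feature.split("=", 1) for feature in raw.split("|"))


def parse_sample(text: str) -> List[Token]:
    """Parse lines with four whitespace-separated columns: form, lemma, POS, morph."""
    tokens = []
    for line in text.strip().splitlines():
        form, lemma, pos, morph = line.split()
        tokens.append(Token(form, lemma, pos, parse_morph(morph)))
    return tokens


# Three rows taken from the annotation sample.
SAMPLE = """\
je  je  PROper  PERS.=1|NOMB.=s|GENRE=x|CAS=n
suis  être  VERcjg  MODE=ind|TEMPS=pst|PERS.=1|NOMB.=s
toûjours  toujours  ADVgen  MORPH=empty
"""

tokens = parse_sample(SAMPLE)
for token in tokens:
    print(token.form, token.lemma, token.pos, token.morph)
```

Feature names are kept exactly as they appear in the annotation, including the trailing dot in "PERS." and "NOMB.".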

Several versions of the FREEM model have been released:

  • v0.1: beta release.
  • v1.0: first release, with a major correction of the lemma dataset (but not the POS) and the addition of the full morphology of tokens (mood, tense, gender…).

Bibliography

This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Special issue on "Natural Language Processing and Ancient Languages", 2019, pp. 1493-1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Simon Gabay, Thibault Clérice, Jean-Baptiste Camps, Jean-Baptiste Tanguy, Matthias Gille-Levenson, « Standardizing linguistic data: method and tools for annotating (pre-orthographic) French », DDH20: Data and Digital Humanities 2020, 15-17 Oct 2020, Hammamet (Tunisia), doi: 10.1145/3423603.3423996.