Tag your Modern French text

You can find more information on this specific model here: https://arxiv.org/abs/2005.07505

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research. This includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other citable works such as models and datasets where they exist.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",}
@misc{camps2020corpus,
      title={Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre}, 
      author={Jean-Baptiste Camps and 
              Simon Gabay and 
              Paul Fièvre and 
              Thibault Clérice and 
              Florian Cafiero},
      year={2020},
      eprint={2005.07505},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2005.07505},
      primaryClass={cs.CL}
}

Information

The model is trained on transcriptions with modernised spelling.

The model was trained on the following corpora:

  • Théâtre classique (Fièvre, 2007) samples, annotated with lemma and POS by J.-B. Camps, Fl. Cafiero and S. Gabay (cf. F. Cafiero and J.-B. Camps, 2019).
  • FranText OA for lemmas (ATILF and Université de Lorraine, 1998-2018). The original corpus has been subject to some corrections (see Camps et al., 2020).

The annotations are made according to the following reference lists:

  • lemma: Morphalou (Romary L., Salmon-Alt S. & Francopoulo G. (2004), “Standards going concrete: From LMF to Morphalou”, Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries – ElectricDict’04 (Geneva, Switzerland), Stroudsburg (PA), 22-28).
  • POS and morph: Sophie Prévost, Céline Guillot, Alexei Lavrentiev and Serge Heiden, Jeu d’étiquettes morphosyntaxiques CATTEX2009-max, Lyon, 2013, http://bfm.ens-lyon.fr/IMG/pdf/Cattex2009_2.0.pdf.

More information on the annotation practice can be found in Simon Gabay, Jean-Baptiste Camps and Thibault Clérice, Manuel d'annotation linguistique pour le français moderne (XVIe-XVIIIe siècles), 2020: https://hal.archives-ouvertes.fr/hal-02571190.

Sample from annotation:


token    lemma     POS     morph                               treated
Il       il        PROimp  PERS.=3|NOMB.=s|GENRE=m|CAS=n       Il
faut     falloir   VERcjg  MODE=ind|TEMPS=pst|PERS.=3|NOMB.=s  faut
que      que       CONsub  MORPH=empty                         que
ce       ce        DETdem  NOMB.=s|GENRE=m                     ce
matin    matin     NOMcom  NOMB.=s|GENRE=m                     matin
,        ,         PONfbl  MORPH=empty                         ,
à        à         PRE     MORPH=empty                         à
force    force     NOMcom  NOMB.=s|GENRE=f                     force
de       de        PRE     MORPH=empty                         de
trop     trop      ADVgen  MORPH=empty                         trop
boire    boire     VERinf  MORPH=empty                         boire
,        ,         PONfbl  MORPH=empty                         ,
Il       il        PROper  PERS.=3|NOMB.=s|GENRE=m|CAS=n       Il
se       se        PROper  PERS.=3|NOMB.=s|CAS=r               se
soit     être      VERcjg  MODE=sub|TEMPS=pst|PERS.=3|NOMB.=s  soit
troublé  troubler  VERppe  NOMB.=s|GENRE=m                     troublé
le       le        DETdef  NOMB.=s|GENRE=m                     le
cerveau  cerveau   NOMcom  NOMB.=s|GENRE=m                     cerveau
.        .         PONfrt  MORPH=empty                         .
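
The sample above has five columns (token, lemma, POS, morph, treated), and the morph column packs several features into one string separated by "|". As a minimal sketch, the annotations could be loaded in Python as shown below. This assumes the tagger output is a tab-separated file with that header row, which is not stated in this document. The helper names parse_morph and read_annotations are hypothetical and are not part of the tagger.

```python
import csv
import io

# A few rows copied from the sample above, written with explicit tabs.
# Assumption: the real output file is tab-separated with this header row.
SAMPLE = (
    "token\tlemma\tPOS\tmorph\ttreated\n"
    "Il\til\tPROimp\tPERS.=3|NOMB.=s|GENRE=m|CAS=n\tIl\n"
    "faut\tfalloir\tVERcjg\tMODE=ind|TEMPS=pst|PERS.=3|NOMB.=s\tfaut\n"
    "que\tque\tCONsub\tMORPH=empty\tque\n"
)


def parse_morph(morph):
    """Split 'PERS.=3|NOMB.=s' into a dict; 'MORPH=empty' gives an empty dict."""
    if morph == "MORPH=empty":
        return {}
    return dict(feature.split("=", 1) for feature in morph.split("|"))


def read_annotations(handle):
    """Read tab-separated annotation rows and parse the morph column of each."""
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["morph"] = parse_morph(row["morph"])
        rows.append(row)
    return rows


rows = read_annotations(io.StringIO(SAMPLE))
for row in rows:
    print(row["token"], row["lemma"], row["POS"], row["morph"])
```

To read a real output file, the io.StringIO(SAMPLE) argument can be replaced by an open file handle, for example open(path, encoding="utf-8", newline=""), where path points to the tagged file.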

Bibliography

This lemmatizer is provided to you thanks to the data of the corpora cited below, the software of Enrique Manjavacas and Mike Kestemont and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493-1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • J.-B. Camps, S. Gabay, P. Fièvre, Th. Clérice and Fl. Cafiero, 2020. Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre, https://arxiv.org/abs/2005.07505.
  • F. Cafiero and J.-B. Camps, 2019. Why Molière most likely did write his plays. Science Advances, 5 (11), eaax5489. DOI: 10.1126/sciadv.aax5489.
  • P. Fièvre, 2008. Théâtre classique, http://www.theatre-classique.fr.
  • ATILF-CNRS and Université de Lorraine, 1998-2018. Base textuelle frantext: Démonstration, https://www.frantext.fr/repository/frantext-demo/.