Tag your Latin text
You can find more information on this specific model here: https://github.com/PonteIneptique/latin-lasla-models
Note: the model is currently being fine-tuned as part of my PhD. I will fill in this section once that is done.
The training set is roughly 1.5M tokens, the dev set roughly 10k tokens, and the test set 169,822 tokens. These counts exclude punctuation, as the LASLA data lack punctuation.
token[Caesarque]
== lemma[Caesar界que]
token[C]
== lemma[Gaius]
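The two examples above suggest that an enclitic is kept in the lemma, joined to its host by the character "界", while an abbreviation such as "C" is resolved to its full form. As a minimal sketch, assuming that "界" always acts as a separator between the parts of a lemma (an assumption drawn only from the Caesarque example, not a documented guarantee), the output could be post-processed like this. The helper name split_lemma is hypothetical and is not part of the lemmatizer.

```python
from typing import List

# Assumption: "界" separates a host lemma from its enclitic,
# as in token[Caesarque] == lemma[Caesar界que].
CLITIC_SEPARATOR = "界"


def split_lemma(lemma: str) -> List[str]:
    """Split a lemma string on the assumed clitic separator.

    Hypothetical helper for illustration; empty parts are dropped.
    """
    return [part for part in lemma.split(CLITIC_SEPARATOR) if part]


print(split_lemma("Caesar界que"))  # ['Caesar', 'que']
print(split_lemma("Gaius"))        # ['Gaius']
```

A lemma without the separator comes back as a one-element list, so the same code path handles both plain tokens and tokens carrying an enclitic.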
Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research: this includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár, as well as the software wrapper built to handle pre- and post-processing.
For each model, a bibliography is given, along with other citable works where available, such as models and datasets.
@software{thibault_clerice_2020_3883590,
  author    = {Clérice, Thibault},
  title     = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month     = jun,
  year      = 2020,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.3883589},
  url       = {https://doi.org/10.5281/zenodo.3883589}
}

@inproceedings{manjavacas-etal-2019-improving,
  title     = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
  author    = "Manjavacas, Enrique and K{\'a}d{\'a}r, {\'A}kos and Kestemont, Mike",
  booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
  month     = jun,
  year      = "2019",
  address   = "Minneapolis, Minnesota",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/N19-1153",
  doi       = "10.18653/v1/N19-1153",
  pages     = "1493--1503",
}
@software{thibault_clerice_2020_4043059,
  author    = {Thibault Clérice},
  title     = {Deucalion Latin Lemmatizer},
  month     = sep,
  year      = 2020,
  publisher = {Zenodo},
  version   = {0.0.3},
  doi       = {10.5281/zenodo.4043059},
  url       = {https://doi.org/10.5281/zenodo.4043059}
}

@phdthesis{phdthesis,
  author = {Thibault Clérice},
  title  = {Dire la sexualité en latin classique et tardif : "une étude lexicographique par apprentissage profond"},
  school = {École Doctorale 3LA, Laboratoire HISOMA, Université Lyon 3},
  year   = "2017-(En cours)"
}
This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them: