A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin

Rüdiger Gleim; Steffen Eger; Alexander Mehler; Tolga Uslu; Wahed Hemati; Andy Lücking; Alexander Henlein; Sven Kahlsdorf; Armin Hoenen

doi:10.15398/jlm.v7i1.205

Authors

Rüdiger Gleim, Goethe Universität Frankfurt
Steffen Eger, Technische Universität Darmstadt, https://orcid.org/0000-0003-4663-8336
Alexander Mehler, Goethe Universität Frankfurt
Tolga Uslu, Goethe Universität Frankfurt
Wahed Hemati, Goethe Universität Frankfurt
Andy Lücking, Goethe Universität Frankfurt, https://orcid.org/0000-0002-5070-2233
Alexander Henlein, Goethe Universität Frankfurt
Sven Kahlsdorf, Goethe Universität Frankfurt
Armin Hoenen, Goethe Universität Frankfurt

Keywords:

morphological tagging, lemmatization, morphologically rich languages, NLP evaluation modelling

Abstract

The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap that models the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are compared with older methods and implementations for coarse-grained POS tagging as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy, and at what cost in training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques.

Published

2019-07-10

How to Cite

Gleim, R., Eger, S., Mehler, A., Uslu, T., Hemati, W., Lücking, A., Henlein, A., Kahlsdorf, S., & Hoenen, A. (2019). A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin. Journal of Language Modelling, 7(1), 1–52. https://doi.org/10.15398/jlm.v7i1.205