Practitioner’s view: A comparison and a survey of lemmatization and morphological tagging in German and Latin


Rüdiger Gleim, Goethe Universität Frankfurt, Germany
Steffen Eger, Technische Universität Darmstadt, Germany
Alexander Mehler, Goethe Universität Frankfurt, Germany
Tolga Uslu, Goethe Universität Frankfurt, Germany
Wahed Hemati, Goethe Universität Frankfurt, Germany
Andy Lücking, Goethe Universität Frankfurt, Germany
Alexander Henlein, Goethe Universität Frankfurt, Germany
Sven Kahlsdorf, Goethe Universität Frankfurt, Germany
Armin Hoenen, Goethe Universität Frankfurt, Germany

Abstract


This paper addresses the challenge of POS tagging and lemmatization in morphologically rich languages, using German and Latin as examples. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on the question of what a practitioner can expect when using state-of-the-art solutions. Moreover, we contrast these with old(er) methods and implementations for coarse-grained POS tagging as well as fine-grained (morphological) POS tagging (which also includes tagging of case, number, mood, etc.). We examine to what degree recent efforts in tagger development pay off in improved accuracies, and at what cost in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluations are particularly insightful because the distribution of the data that a user tags will typically differ from the distribution on which the tagger was trained. Furthermore, we compare pipeline tagging vs. a tagging approach that acknowledges dependencies between inflectional categories. Finally, two lemmatization techniques are evaluated.

Keywords


morphological tagging; lemmatization; morphologically rich languages; NLP evaluation modelling

DOI: http://dx.doi.org/10.15398/jlm.v7i1.205

ISSN of the paper edition: 2299-856X