How to embed noncrossing trees in Universal Dependencies treebanks in a low-complexity regular language



bounded stack, coding morphisms, context-free grammars, dependencies, finite-state, parsing, state complexity, treebanks


A recently proposed balanced-bracket encoding (Yli-Jyrä and GómezRodríguez 2017) has given us a way to embed all noncrossing dependency graphs into the string space and to formulate their exact arcfactored inference problem (Kuhlmann and Johnsson 2015) as the best string problem in a dynamically constructed and weighted unambiguous context-free grammar. The current work improves the encoding and makes it shallower by omitting redundant brackets from it. The streamlined encoding gives rise to a bounded-depth subset approximation that is represented by a small finite-state automaton. When bounded to 7 levels of balanced brackets, the automaton has 762 states and represents a strict superset of more than 99.9999% of the noncrossing trees available in Universal Dependencies 2.4 (Nivre et al. 2019). In addition, it strictly contains all 15-vertex noncrossing digraphs. When bounded to 4 levels and 90 states, the automaton still captures 99.2% of all noncrossing trees in the reference dataset. The approach is flexible and extensible towards unrestricted graphs, and it suggests tight finite-state bounds for dependency parsing, and for the main existing parsing methods.


Full article

Author Biography

Anssi Mikael Yli-Jyrä, University of Helsinki

The author has MA in Computational Linguistics and Computer Science and has received his PhD in Language Technology in 2005 from the University of Helsinki (UH).  He was awarded the title of docent in 2006 in Language Technology and he works as Academy/University Researcher in the Department of Digital Humanities in UH, working in the area of natural language processing and software verification.



How to Cite

Yli-Jyrä, A. M. (2019). How to embed noncrossing trees in Universal Dependencies treebanks in a low-complexity regular language. Journal of Language Modelling, 7(2), 177–232.