Data-Oriented Parsing with Discontinuous Constituents and Function Tags

Statistical parsers are effective but are typically limited to producing projective dependencies or constituents. On the other hand, linguistically rich parsers recognize non-local relations and analyze both form and function phenomena but rely on extensive manual grammar engineering. We combine advantages of the two by building a statistical parser that produces richer analyses. We investigate new techniques to implement treebank-based parsers that allow for discontinuous constituents. We present two systems. One system is based on a Linear Context-Free Rewriting System (lcfrs), while using a Probabilistic Discontinuous Tree-Substitution Grammar (pdtsg) to improve disambiguation performance. Another system encodes discontinuities in the labels of phrase-structure trees, allowing for efficient context-free grammar parsing. The two systems demonstrate that tree fragments as used in tree-substitution grammar improve disambiguation performance while capturing non-local relations on an as-needed basis. Additionally, we present results for models that produce function tags, resulting in a more linguistically adequate model of the data. We report substantial accuracy improvements in discontinuous parsing for German, English, and Dutch, including results on spoken Dutch.


introduction
Probabilistic algorithms for parsing and disambiguation select the most probable analysis for a given sentence in accordance with a certain probability distribution. A fundamental property of such algorithms is thus the definition of the space of possible sentence structures that constitutes the domain of the probability distribution. Modern statistical parsers are often automatically derived from corpora of syntactically annotated sentences ("treebanks"). In this case, the "linguistic backbone" of the probabilistic grammar naturally depends on the convention for encoding syntactic structure that was used in annotating the corpus.
When different parsing and disambiguation algorithms are applied to the same treebank, their relative accuracies can be objectively assessed if the treebank is split into a training set (that is used to induce a grammar and its probabilities) and a test set (that provides a "gold standard" to assess the performance of the system). This is common practice now. In many cases, however, the linguistic significance of these evaluations may be questioned, since the test sets consist of phrase-structure trees, i.e., part-whole structures where all parts are contiguous chunks. Non-local syntactic relations are not represented in these trees; utterances in which such relations occur are therefore skipped or incorrectly annotated.
For certain practical applications this restriction may be harmless, but from a linguistic (and cognitive) viewpoint it cannot be defended. Since Chomsky's transformational-generative grammar, there have been many proposals for formal grammars with a less narrow scope. Some of these formalisms have been employed to annotate large corpora; in principle, they can thus be used in treebank grammars extracted from these corpora.
The Penn treebank, for instance, enriches its phrase-structure representations with "empty constituents" that share an index with the constituent that, from a transformational perspective, would be analyzed as originating in that position. Most grammars based on the Penn treebank ignore this information, but it was used by, e.g., Johnson (2002), Dienes and Dubey (2003), and Gabbard et al. (2006).
Another perspective on non-local syntactic dependencies generalizes the notion of a "syntactic constituent," in that it allows "discontinuous constituent structures," where a non-terminal node dominates a lexical yield that consists of different non-contiguous parts (McCawley 1982). Several German and Dutch treebanks have been annotated in terms of discontinuous constituency, and some statistical parsers have been developed that use these treebanks. Also, phrase structures with co-indexed traces can be converted into discontinuous constituent structures; the Penn treebank can therefore be transformed and used in the discontinuous constituency approach (Evang and Kallmeyer 2011). Figure 1 shows an example of a tree with discontinuous constituents.
Figure 1: [...] Beek et al. 2002). PPART is a discontinuous constituent (indicated with crossing branches) due to its extraposed NP object. Part-of-speech tags: VNW=pronoun, N=noun, WW=verb, BW=adverb. The tags also contain additional morphological features not shown here that distinguish personal pronouns from others, auxiliary verbs from main verbs, etc.
It is an annotation choice to employ discontinuous constituents; some treebanks elect not to model non-local phenomena, while others may choose different mechanisms. For example, two German treebanks employ discontinuous constituents (Skut et al. 1997; Brants et al. 2002), while another German treebank does not (Telljohann et al. 2004, 2012). The annotation scheme of the latter treebank lacks information expressed in the former two. For instance, it cannot encode the heads of non-local modifiers; with discontinuous constituents, a modifier is a sibling of its head, regardless of their configuration. On the other hand, the co-indexed traces of the Penn treebank provide more information than discontinuous constituents, because they assume that constituents have been moved from somewhere else in the tree and encode the original position. Discontinuous constituents describe surface structure without making such assumptions. Some phenomena that can be analyzed with discontinuous constituents are extraposition, topicalization, scrambling, and parentheticals; cf. Maier et al. (2014) for an overview of such phenomena in German.
The notion of discontinuous constituents in annotation is useful to bridge the gap between the information represented in constituency and dependency structures. Constituency structures capture the hierarchical structure of phrases, which is useful for identifying re-usable elements; discontinuous constituents extend this to allow for arbitrary non-local relations that may arise due to such phenomena as extraposition and free word order. There is a close relation of discontinuous constituency to non-projectivity in dependency structures (Maier and Lichte 2011). Compare Figure 2, which shows a dependency structure for the constituency tree in Figure 1.
Note that in this dependency structure, the edge labels are grammatical functions present in the original treebank, while the constituent labels in Figure 1 are syntactic categories. The dependency structure encodes the non-local relations within the discontinuous constituent. On the other hand, it does not represent the hierarchical grouping given by the NP and PPART constituents. By encoding both hierarchical and non-local information, trees with discontinuous constituents combine the advantages of constituency and dependency structures. We will also come back to grammatical function labels.
This paper is concerned with treebank-based parsing algorithms that accept discontinuous constituents. It takes as its point of departure work by Kallmeyer and Maier (2010, 2013) that represents discontinuous structures in terms of a string-rewriting version of Linear Context-Free Rewriting Systems (Section 3.1). In addition, we employ Tree-Substitution Grammar (tsg). We make the following contributions:
1. We discuss the notions of competence and performance in (computational) linguistics (Section 2). We argue that instead of focussing on the search for the formal (competence) grammar with the right capacity for natural language, we can consider performance aspects such as cognitive limitations and pruning strategies.
2. We show that Tree-Substitution Grammar can be applied to discontinuous constituents (Section 3.2) and that it is possible, using a transformation, to parse with a Tree-Substitution Grammar without having to write a separate parser for this formalism (Section 4.2).
3. We induce a tree-substitution grammar from a treebank (Section 5) using a method called Double-dop (Sangati and Zuidema 2011). This method extracts a set of recurring tree fragments. We show that compared to another method which implicitly works with all possible fragments, this explicit method offers advantages in both accuracy and efficiency (Section 4.2.1, Section 9).
4. Fragments make it possible to treat discontinuous constituency as a statistical phenomenon within an encompassing context-free framework (Section 4.1, Section 7); this yields a considerable efficiency improvement without hurting accuracy (Section 9).
5. Finally, we present an evaluation on three languages. We employ manual state splits from previous work for improved performance (Section 8) and discuss methods and results for grammars that produce function tags in addition to phrasal labels (Section 8.3).
This work explores parsing discontinuous constituents with Linear Context-Free Rewriting Systems and Context-Free Grammar, as well as with and without the use of tree fragments through tree substitution. Figure 3 gives an overview of these systems and how they are combined in a coarse-to-fine pipeline (cf. Section 6.4).
Figure 3: The systems explored in this work

2 the division of labor between competence and performance
Traditionally, two aspects of language cognition have been distinguished: competence and performance (Chomsky 1965). Linguistic competence comprises a language user's "knowledge of language," usually described as a system of rules, while linguistic performance includes the details of the user's production and comprehension behavior. For a computational model, its syntactic competence defines the set of possible sentences that it can process in principle, and the structures it may assign to them, while its performance includes such aspects as disambiguation using occurrence frequencies of grammatical constructions. Thus, the choice of a formalism to describe the system's competence grammar depends on one's decisions on how syntax should be formalized. Regular and context-free grammars have been argued to be too limited (Chomsky 1956; Shieber 1985), while richer alternatives (context-sensitive and beyond) are considered too powerful to allow for an efficient computational implementation; this applies to Transformational Grammar (Peters and Ritchie 1973), Lexical-Functional Grammar, and Head-Driven Phrase Structure Grammar (Trautwein 1995). We may therefore wish to strike a balance and find a grammar formalism that is just powerful enough to describe the syntax of natural language. Joshi (1985) proposes Mildly Context-Sensitive grammars, which are beyond context-free, but avoid the computational complexity that comes with the full class of context-sensitive grammars.
The first formalism developed in this framework was Tree-Adjoining Grammar (tag; Joshi 1985). There has been work on automatic extraction of tree-adjoining grammars from corpora (Chiang 2000; Xia et al. 2001; Kaeshammer and Demberg 2012), and formal extensions such as multi-component tag (Weir 1988; Schuler et al. 2000; Kallmeyer 2009). Linear Context-Free Rewriting Systems (lcfrss), as employed in the work reported below, are instances of Mildly Context-Sensitive grammar. lcfrs appears to be a lingua franca among mildly context-sensitive formalisms, since several formalisms have been shown to be equivalent to it (Vijay-Shanker and Weir 1994).
Irrespective of whether one accepts the competence-performance dichotomy, a practical natural language system needs to deal with phenomena that depend on world knowledge reflected in language use (e.g., the fact that in "eat pizza with a fork", with a fork is prototypically related to eat rather than to pizza). This has led to a statistical turn in computational linguistics, in which models are directly induced from treebanks (Scha 1990; Charniak 1996; Bod et al. 2003; Geman and Johnson 2004). If the end goal is to make an adequate model of language performance, there is actually no need to have a competence grammar which is 'just right.' Instead, we might reduce some of the formal complexity by encoding it in statistical patterns. Concretely, we can opt for a grammar formalism that deliberately overgenerates, and count on grammatical analyses having a higher probability of being selected during disambiguation. This operationalizes the idea of there being a spectrum between ungrammaticality, markedness, and felicity. In Section 4.1 we introduce an approximation of lcfrs that makes it possible to produce discontinuous constituents in cubic time using a context-free grammar, by encoding information in non-terminal labels.
A probabilistic variant of the resulting grammar makes stronger independence assumptions than the equivalent lcfrs, but as a component in a larger statistical system this does not have to pose a problem.
In the debate about the context-freeness of language, cross-serial dependencies have played an important role (Huybregts 1976; Bresnan et al. 1982; Shieber 1985). Ojeda (1988) gives an account using discontinuous constituents; cf. Figure 4. In Section 4.1 we show how such analyses may be produced by an overgenerating context-free grammar. This is an instance of the more general idea of approximating rich formal models in formally weaker but statistically richer models, i.e., descriptive aspects of language that can be handled as a performance rather than a competence problem. Another instance of this is constituted by the various restricted versions of tag, whose string languages form a proper subset of those of lcfrs. Restricted variants of tag that generate context-free string languages are Tree-Insertion Grammar (Schabes and Waters 1995; Hoogweg 2003; Yamangil and Shieber 2012), and off-spine tag (Swanson et al. 2013); tsg is an even more restricted variant of tag in which the adjunction operation is removed altogether. These results suggest that there is a trade-off to be made in the choice of formalism. While on the one hand Mild Context-Sensitivity already aims to limit formal complexity to precisely what is needed for adequate linguistic description, a practical, statistical implementation presents further opportunities for constraining complexity.
Another performance aspect of language relevant for computational linguistics is pruning. While normally considered an implementation aspect made necessary by practical hardware limitations, finding linguistically and psychologically plausible shortcuts in language processing forms an interesting research question. Schuler et al. (2010) present a parser with human-like memory constraints based on a finite-state model. Although Roark et al. (2012) are not concerned with cognitive plausibility, they also work with finite-state methods and show that cfg parsing can be done in quadratic or even linear time with finite-state pruning methods.
As a specific example of a cognitive limitation relevant to parsing algorithms, consider center embedding. Karlsson (2007) reports from a corpus study that center embeddings only occur up to depth 3 in written language, and up to depth 2 in spoken language. If a statistical parser would take such cognitive limitations into account, many implausible analyses could be ruled out from the outset. More generally, it is worthwhile to strive for an explicit performance model that incorporates such cognitive and computational limitations as first class citizens.
In this work we do not go all the way to a finite-state model, but we do show that the non-local relations expressed in discontinuous constituents can be expressed in a context-free grammar model. We start with a mildly context-sensitive grammar formalism to parse discontinuous constituents, augmented with tree substitution. We then show that an approximation with context-free grammar is possible and effective. We find that the reduced independence assumptions and larger contexts taken into account as a result of tree substitution make it possible to capture non-local relations without going beyond context-free. Tree substitution thus increases the capabilities of the performance side without increasing the complexity of the competence side. A performance phenomenon that is modeled by this is that non-local relations are only faithfully produced as far as observed in the data.

grammar formalisms
In this section we describe two formalisms related to discontinuous constituents: (string-rewriting) Linear Context-Free Rewriting Systems and Discontinuous Tree-Substitution Grammar.
(String-rewriting) Linear Context-Free Rewriting Systems (lcfrs; Vijay-Shanker et al. 1987) can produce trees with discontinuous constituents. An lcfrs generalizes cfg by allowing non-terminals to rewrite tuples of strings instead of just single, contiguous strings. This property makes lcfrs suitable for directly parsing discontinuous constituents (Kallmeyer and Maier 2010, 2013), as well as non-projective dependencies (Kuhlmann and Satta 2009; Kuhlmann 2013).
A tree-substitution grammar (tsg) provides a generalization of context-free grammar (cfg) that operates with larger chunks than just single grammar productions. A probabilistic tsg can be seen as a pcfg in which several productions may be applied at once, capturing structural relations between those productions. Tree-substitution grammars have numerous applications. They can be used for statistical parsing, such as with Data-Oriented Parsing (dop; Scha 1990; Bod 1992; Bod et al. 2003; Bansal and Klein 2010; Sangati and Zuidema 2011) and Bayesian tsgs (O'Donnell et al. 2009; Post and Gildea 2009; Cohn et al. 2009, 2010; Shindo et al. 2012). Other applications include grammaticality judgements (Post 2011), multi-word expression identification (Green et al. 2011), stylometry (Bergsma et al. 2012; van Cranenburgh 2012b), and native language detection (Swanson and Charniak 2012).
Before defining these formalisms, we first define the tree structures they operate on. The notion of a "discontinuous tree" stems from a long linguistic tradition (Pike 1943, Sections 4.12-14; Wells 1947, Sections 55-62; McCawley 1982). It generalizes the usual notion of a phrase-structure tree in that it allows a non-terminal node to dominate a lexical span that consists of non-contiguous chunks. In our interpretation of this idea, it results in three formal differences:
1. A non-terminal with non-contiguous daughters does not have a non-arbitrary place in the left-to-right order with respect to its sibling nodes. Therefore, it is not obvious anymore that the left-to-right order of the terminals is to be described in terms of their occurrence in a tree with totally ordered branches. Instead, we employ trees with unordered branches, while every node is augmented with an explicit representation of its (ordered) yield.
2. An "ordinary" (totally ordered) tree has a contiguous string of leaf nodes as its yield. When we allow discontinuities, this property still applies to the (totally lexicalized) complete trees of complete sentences. But for tree fragments, it fails; their yields may contain gaps. In the general case, the yield of a discontinuous tree is thus a tuple of strings.
3. Extracting a fragment from a tree now consists of two steps:
(a) Extracting a connected subset of nodes, and
(b) Updating the yield tuples of the nodes. In the yield tuple of every non-terminal leaf node, every element (a contiguous chunk of words) is replaced by a terminal variable. This replacement is percolated up the tree, to the yield tuples of all nodes. Different occurrences of the same word carry a unique index, to allow for the percolation to proceed correctly.
We now proceed to give a more formal definition of our notion of a discontinuous tree.
Definition 1. A discontinuous syntactic tree is a rooted, unordered tree. Each node consists of a label and a yield. A yield is a tuple of strings composed of lexical items; the tuple of strings denotes a subsequence of the yield at the root of the tree. We write $\langle a\ b \rangle$ to denote a yield consisting of the contiguous sequence of lexical items 'a' and 'b', while $\langle a\ b, c \rangle$ denotes a yield containing 'a b' followed by 'c' with an intervening gap. Given a node $X$,
• the yield of $X$ is composed of the terminals in the yields of the children of $X$;
• conversely, the yield of each child of $X$ is a subsequence of the yield of $X$;
• the yields of siblings do not overlap.
Figure 5 shows a tree according to this definition in which discontinuities are visualized with crossing branches as before. The same tree is rendered in Figure 6, without crossing branches, to highlight the fact that the information about discontinuities is encoded in the yields of the tree nodes.
Definition 2. An incomplete tree is a discontinuous tree in which the yields may contain variables $v_n$ with $n \in \mathbb{N}$ in addition to lexical items. Variables stand in for any contiguous string of lexical items. An incomplete tree contains 2 or more nodes, or a single node with only lexical items in its yield. A node without children whose yield consists solely of variables is called a substitution site.
An incomplete tree may be derived from an extracted tree fragment. The tree fragment may contain variables for substrings which needed to be distinguished in other parts of the tree, but only occur contiguously in the fragment. We reduce these strings of contiguous variables to single variables; i.e., we abstract fragments from their original context by reducing strings of variables that appear contiguously across the fragment into single variables (e.g. Figure 7).
The fan-out of a non-terminal node equals the number of terminals in its yield that are not directly preceded by another terminal in the same yield; i.e., the number of contiguous substrings (components) of which the yield consists. From here on we denote the fan-out of a discontinuous non-terminal with a subscript that is part of its label.
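The fan-out definition can be made concrete with a minimal sketch. It assumes, purely for illustration, that a yield is represented as a set of integer word positions; the function names `components` and `fan_out` are hypothetical and not taken from the paper's implementation.

```python
def components(positions):
    """Group word positions into maximal runs of consecutive integers."""
    runs = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)   # extends the current component
        else:
            runs.append([pos])     # a gap: start a new component
    return runs


def fan_out(positions):
    """Fan-out = number of contiguous components in the yield."""
    return len(components(positions))


# A yield of the shape <a b, c> from Definition 1,
# with assumed positions a=0, b=1, c=3 (position 2 is the gap).
print(components({0, 1, 3}))  # [[0, 1], [3]]
print(fan_out({0, 1, 3}))     # 2
```

A position starts a new component exactly when its predecessor is absent from the yield, which mirrors the "not directly preceded by another terminal" wording above.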

Linear Context-Free Rewriting Systems
String-rewriting lcfrs can be seen as the discontinuous counterpart of cfg, and its probabilistic variant can be used to articulate a discontinuous treebank grammar. lcfrs productions differ from cfg productions in that they generate for a given non-terminal one or more strings at a time in potentially non-adjacent positions in the sentence. The number of these positions, the measure of discontinuity in a constituent, is called the fan-out. A cfg is an lcfrs with a maximum fan-out of 1. Together with the number of non-terminals on the right-hand side, the fan-out defines a hierarchy of grammars with increasing complexity, of which cfg is the simplest case. In this paper we use the simple rcg notation (Boullier 1998) for lcfrs. We focus on string-rewriting lcfrs and use the tree produced as a side-effect of a string's derivation as its syntactic analysis. It is possible to define an lcfrs that rewrites trees or graphs; however, the formalisms used in this paper are all expressible as string-rewriting lcfrss.
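Definition 3. A string-rewriting lcfrs is a tuple $G = \langle N, T, V, P, S \rangle$. $N$ and $T$ are disjoint finite sets of non-terminals and terminals, respectively. A function $\varphi : N \to \{1, 2, \ldots\}$ specifies the unique fan-out for every non-terminal symbol. $V$ is a finite set of variables; we refer to the variables as $x^i_j$ with $i, j \in \mathbb{N}$. $S$ is the distinguished start symbol with $S \in N$ and $\varphi(S) = 1$. $P$ is a finite set of productions, of the form:
$$A(\alpha_1, \ldots, \alpha_{\varphi(A)}) \to B_1(x^1_1, \ldots, x^1_{\varphi(B_1)}) \cdots B_r(x^r_1, \ldots, x^r_{\varphi(B_r)})$$
for $r \ge 0$, where $A, B_1, \ldots, B_r \in N$, each $x^i_j \in V$ for $1 \le i \le r$, $1 \le j \le \varphi(B_i)$, and $\alpha_j \in (T \cup V)^+$ for $1 \le j \le \varphi(A)$. Observe that a component $\alpha_j$ is a concatenation of one or more terminals and variables.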
The rank r refers to the number of non-terminals on the righthand side of a production, while the fan-out φ of a non-terminal refers to the number of components it covers. A rank of zero implies a lexical production; in that case the right-hand side (rhs) is notated as ϵ implying no new non-terminals are produced (not to be confused with generating the empty string), and the left-hand side (lhs) argument is composed only of terminals.
Productions must be linear and non-erasing: if a variable occurs in a production, it occurs exactly once on the lhs, and exactly once on the rhs. A production is monotone if for any two variables $x_1$ and $x_2$ occurring in a non-terminal on the rhs, $x_1$ precedes $x_2$ on the lhs iff $x_1$ precedes $x_2$ on the rhs. Due to our method of grammar extraction from treebanks (cf. Section 3.1.1 below), all productions in this work are monotone and, except in some examples, at most binary ($r \le 2$); lexical productions ($r = 0$) have fan-out 1 and introduce only a single terminal.
A production is instantiated when its variables are bound to spans such that for each component α j of the lhs, the concatenation of the strings that its terminals and bound variables point to forms a contiguous, non-overlapping span in the input. In the remainder we will notate discontinuous non-terminals with a subscript indicating their fan-out.
When a sentence is parsed by an lcfrs, its derivation tree (Boullier 1998, Section 3.3; Kallmeyer 2010, pp. 115-117) is a discontinuous tree. Conversely, given a set of discontinuous trees, a set of productions can be extracted that generate those trees.
In a probabilistic lcfrs (plcfrs), each production is associated with a probability, and the probability of a derivation is the product of the probabilities of its productions. Analogously to a pcfg, a plcfrs may be induced from a treebank by using relative frequencies as probabilities (Maier and Søgaard 2008).
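Relative-frequency estimation can be sketched as follows. This is an illustrative sketch, not the paper's code: productions are represented as hypothetical `(lhs, production)` pairs, and the observed productions (including `VP_2(a, b) -> VB(a) NP(b)`) are invented for the example.

```python
from collections import Counter
from math import prod


def relative_frequencies(productions):
    """P(production) = count(production) / count(productions with same lhs)."""
    counts = Counter(productions)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {p: n / lhs_totals[p[0]] for p, n in counts.items()}


# Invented production occurrences, as if read off a tiny treebank.
observed = [
    ("S", "S(abc) -> NP(b) VP_2(a, c)"),
    ("VP_2", "VP_2(a, b) -> VB(a) PRT(b)"),
    ("VP_2", "VP_2(a, b) -> VB(a) PRT(b)"),
    ("VP_2", "VP_2(a, b) -> VB(a) NP(b)"),
]
probs = relative_frequencies(observed)

# Probability of a derivation: product of the probabilities of its productions.
derivation = [observed[0], observed[1]]
p_derivation = prod(probs[p] for p in derivation)
print(p_derivation)
```

Because counts are normalized per left-hand side, the probabilities of all productions that share a left-hand side sum to 1, as in a pcfg.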
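Definition 4. The language of an lcfrs $G$ is defined as follows (Kallmeyer and Maier 2013, pp. 92-93):
1. For every $A \in N$, we define the yield of $A$, $\mathrm{yield}_G(A)$, as follows:
(a) For every production $A(t) \to \epsilon$ with $t \in T$, $\langle t \rangle \in \mathrm{yield}_G(A)$.
(b) For every production $A(\alpha_1, \ldots, \alpha_{\varphi(A)}) \to B_1(x^1_1, \ldots, x^1_{\varphi(B_1)}) \cdots B_r(x^r_1, \ldots, x^r_{\varphi(B_r)})$ and all tuples $\tau_1 \in \mathrm{yield}_G(B_1), \ldots, \tau_r \in \mathrm{yield}_G(B_r)$: $\langle f(\alpha_1), \ldots, f(\alpha_{\varphi(A)}) \rangle \in \mathrm{yield}_G(A)$, where $f$ is the composition function of the production: it leaves terminals unchanged and replaces each variable $x^i_j$ by the $j$-th component of $\tau_i$.
(c) Nothing else is in $\mathrm{yield}_G(A)$.
2. The language of $G$ is then $L(G) = \mathrm{yield}_G(S)$.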
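The way a production composes the yields of its right-hand side can be illustrated with a small sketch. The encoding is an assumption made for this example only: a variable is a pair `(i, j)` meaning "component `j` of the `i`-th right-hand-side non-terminal" (0-based), any other symbol is a terminal, and the words are invented.

```python
def compose(lhs_components, taus):
    """Apply a production's composition function to the yields `taus`.

    lhs_components: one sequence of symbols per lhs component.
    taus: one yield per rhs non-terminal; a yield is a tuple of
          components, each a tuple of words.
    """
    result = []
    for component in lhs_components:
        words = []
        for sym in component:
            if isinstance(sym, tuple):   # variable (i, j)
                i, j = sym
                words.extend(taus[i][j])
            else:                        # terminal: copied unchanged
                words.append(sym)
        result.append(tuple(words))
    return tuple(result)


# VP_2(a, b) -> VB(a) PRT(b), with invented lexical yields.
vp = compose([[(0, 0)], [(1, 0)]], [(("wake",),), (("up",),)])
print(vp)  # (('wake',), ('up',))

# S(abc) -> NP(b) VP_2(a, c): the NP lands in the gap of the VP.
s = compose([[(1, 0), (0, 0), (1, 1)]], [(("her",),), vp])
print(s)   # (('wake', 'her', 'up'),)
```

The start symbol has fan-out 1, so the final yield is a 1-tuple holding a single contiguous string.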

3.1.1 Extracting LCFRS productions from trees
lcfrs productions may be induced from a discontinuous tree, using a procedure described in Maier and Søgaard (2008). We extend this procedure to handle substitution sites, i.e., non-terminals with only variable terminals in their yield, but no lexical items; such nodes occur in tree fragments extracted from a treebank. The procedure is as follows: Given a discontinuous tree, we extract a grammar production for each non-leaf non-terminal node. The label of the node forms the lhs non-terminal, and the labels of the nodes immediately dominated by it form the rhs non-terminals. The arguments of each rhs non-terminal are based on their yield tuples. Adjacent variables in the yield of the rhs non-terminals are collapsed into single variables and replaced on both lhs and rhs. Consider the tree fragment in Figure 7, which gives the following lcfrs production: [...]
Pre-terminals yield a production with their terminal as a direct argument to the pre-terminal, and an empty rhs. Substitution sites in a tree only appear on the rhs of extracted productions, since it is not known what they will expand to. See Figure 8 for examples of lcfrs productions extracted from a discontinuous tree.
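The extraction step for a single non-leaf node can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of Maier and Søgaard (2008) verbatim: nodes carry sets of word positions, the `Node` class and all function names are hypothetical, variables are named `x<i>_<j>`, and pre-terminals are left out. Working with maximal contiguous runs of each child has the same effect as collapsing adjacent variables.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    positions: frozenset              # word positions covered by the node
    children: list = field(default_factory=list)


def runs(positions):
    """Maximal runs of consecutive positions (the yield components)."""
    out = []
    for pos in sorted(positions):
        if out and pos == out[-1][-1] + 1:
            out[-1].append(pos)
        else:
            out.append([pos])
    return out


def labelled(node):
    """Mark the fan-out in the label when it is greater than 1."""
    n = len(runs(node.positions))
    return node.label if n == 1 else f"{node.label}_{n}"


def extract_production(node):
    starts = {}   # first position of a child component -> (variable, last position)
    rhs = []
    for i, child in enumerate(node.children, 1):
        variables = []
        for j, run in enumerate(runs(child.positions), 1):
            var = f"x{i}_{j}"
            starts[run[0]] = (var, run[-1])
            variables.append(var)
        rhs.append((labelled(child), tuple(variables)))
    lhs_args = []
    for run in runs(node.positions):
        component, pos = [], run[0]
        while pos <= run[-1]:         # walk the component left to right
            var, last = starts[pos]
            component.append(var)
            pos = last + 1
        lhs_args.append(tuple(component))
    return (labelled(node), tuple(lhs_args)), tuple(rhs)


def render(production):
    (lhs, args), rhs = production
    left = f"{lhs}({', '.join(' '.join(c) for c in args)})"
    right = " ".join(f"{label}({', '.join(vs)})" for label, vs in rhs)
    return f"{left} -> {right}"


np_ = Node("NP", frozenset({1}))
vp = Node("VP", frozenset({0, 2}))
s = Node("S", frozenset({0, 1, 2}), [np_, vp])
print(render(extract_production(s)))
# S(x2_1 x1_1 x2_2) -> NP(x1_1) VP_2(x2_1, x2_2)
```

The sketch relies on the properties in Definition 1: the children's yields do not overlap and together make up the parent's yield, so every parent component can be tiled by child components.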

Discontinuous Tree-Substitution Grammar
We now employ string-rewriting lcfrs, introduced in the previous section, to replace the cfg foundation of tsgs. Note that the resulting formalism directly rewrites elementary trees with discontinuous constituents, making it an instantiation of the more general notion of a tree-rewriting lcfrs. Tree-rewriting lcfrss are more general because they allow other rewriting operations besides substitution. However, since we limit the operations in the formalism to substitution, it remains possible to specify a direct mapping to a string-rewriting grammar, as we shall see in the next section.
Figure 8: The lcfrs $G = \langle N, T, V, P, S \rangle$ extracted from the tree in Figure 5. $N = \{\text{SMAIN}, \text{PPART}, \text{NP}, \text{VNW}, \text{N}, \text{WW}, \text{BW}\}$; $T = \{\text{Dat}, \text{had}, \text{uitgevonden}, \text{werkwoord}, \text{ze}, \text{zelf}\}$; $\varphi = \{\text{SMAIN} : 1, \text{PPART} : 2, \text{NP} : 1, \text{VNW} : 1, \text{N} : 1, \text{WW} : 1, \text{BW} : 1\}$; $P = \{[\ldots], \text{WW}(\text{uitgevonden}) \to \epsilon\}$
As noted before, a tsg can be seen as a tag without the adjunction operation. A discontinuous tsg may be related to a special case of set-local multi-component tag (Weir 1988; Kallmeyer 2009). A multi-component tag is able to specify constraints that require particular elementary trees to apply together; this mechanism can be used to generate the non-local elements of discontinuous constituents.
The following definitions are based on the definition for continuous tsg in Sima'an (1997).
Definition 5. A probabilistic, discontinuous tsg (pdtsg) is a tuple $\langle N, T, V, S, \mathcal{F}, P \rangle$, where $N$ and $T$ are disjoint finite sets that denote the set of non-terminal and terminal symbols, respectively; $V$ is a finite set of variables; $S$ denotes the start non-terminal; and $\mathcal{F}$ is a finite set of elementary trees. For all trees in $\mathcal{F}$ it holds that for each non-terminal, there is a unique fan-out; this induces a function $\varphi \subset N \times \{1, 2, \ldots\}$ with $\varphi(A)$ being the unique fan-out of $A \in N$. For convenience, we abbreviate $\varphi(\mathrm{root}(t))$ for a tree $t$ as $\varphi(t)$. The function $P$ assigns a value $0 < P(t) \le 1$ (probability) to each elementary tree $t$ such that for every non-terminal $A \in N$, the probabilities of all elementary trees whose root node is labelled $A$ sum to 1. The tuple $\langle N, T, V, S, \mathcal{F} \rangle$ of a given pdtsg $\langle N, T, V, S, \mathcal{F}, P \rangle$ is called the dtsg underlying the pdtsg.
Definition 6. Substitution: The substitution $A \circ B$ is defined iff the label of the left-most substitution site of $A$ equals the label of the root node of $B$. The left-most substitution site of an incomplete tree $A$ is the leaf node containing the first occurrence of a variable in the yield of the root of $A$. When defined, the result of $A \circ B$ equals a copy of the tree $A$ with $B$ substituted for the left-most substitution site of $A$. In the yield argument of $A$, each variable terminal is replaced with the corresponding component of one or more contiguous terminals from $B$. For example, given $\mathrm{yield}(A) = \langle l_1\ v_2, l_4 \rangle$ and $\mathrm{yield}(B) = \langle l_2\ l_3 \rangle$ where $l_n$ is a lexical terminal and $v_n$ a variable, $\mathrm{yield}(A \circ B) = \langle l_1\ l_2\ l_3, l_4 \rangle$.
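The effect of substitution on yields can be sketched in a few lines. The sketch covers only the yield bookkeeping, not the tree structure, and it assumes that the variables of the substitution site are passed in explicitly, in order; the function name `substitute_yield` is hypothetical.

```python
def substitute_yield(yield_a, site_vars, yield_b):
    """Replace the k-th variable of the substitution site in yield_a
    with the k-th component of yield_b."""
    if len(site_vars) != len(yield_b):
        raise ValueError("fan-out of the site and of B must match")
    binding = dict(zip(site_vars, yield_b))
    return tuple(
        tuple(word for sym in component for word in binding.get(sym, (sym,)))
        for component in yield_a
    )


# The example from Definition 6: <l1 v2, l4> combined with <l2 l3>.
yield_a = (("l1", "v2"), ("l4",))
yield_b = (("l2", "l3"),)
print(substitute_yield(yield_a, ["v2"], yield_b))
# (('l1', 'l2', 'l3'), ('l4',))
```

For a discontinuous substitution site, `site_vars` would list one variable per component, and each would be bound to the matching component of the substituted tree's yield.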
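Definition 7. A left-most derivation $d$ is a sequence of zero or more substitutions $T = (\ldots(f_1 \circ f_2) \circ \ldots) \circ f_m$, where $f_1, \ldots, f_m \in \mathcal{F}$ (the set of elementary trees), $\mathrm{root}(T) = \mathrm{root}(f_1) = S$, $\varphi(T) = 1$ and $T$ contains no substitution sites. The probability $P(d)$ is defined as:
$$P(d) = P(f_1) \cdot \ldots \cdot P(f_m) = \prod_{i=1}^{m} P(f_i)$$
See Figure 9 for an example.
Figure 9: A discontinuous tree-substitution derivation of the tree in Figure 1. Note that in the first fragment, which has a discontinuous substitution site, the destination for the discontinuous spans is marked in advance, shown with variables ($v_n$) as placeholders.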
Definition 8. A parse is any tree which is the result of a derivation. A parse can have various derivations. Given the set $D(T)$ of derivations yielding parse $T$, the probability of $T$ is defined as $\sum_{d \in D(T)} P(d)$.

4 grammar transformations
cfg, lcfrs, and dtsg can be seen as natural extensions of each other. This makes it possible to define transformations that help to make parsing more efficient. Specifically, we define simplified versions of these grammars that can be parsed efficiently, while their productions or labels map back to the original grammar.

4.1 A CFG approximation of discontinuous LCFRS parsing
Barthélemy et al. (2001) introduced a technique to guide the parsing of a range concatenation grammar (rcg) by a grammar with a lower parsing complexity. Van Cranenburgh (2012a) applies this idea to probabilistic lcfrs parsing and extends the method to prune unlikely constituents in addition to filtering impossible constituents.
The approximation can be formulated as a tree transformation instead of a grammar transformation. The tree transformation by Boyd (2007) encodes discontinuities in the labels of tree nodes. The resulting trees can be used to induce a pcfg that can be viewed as an approximation to the corresponding plcfrs grammar of the original, discontinuous treebank. We will call this a Split-pcfg.
Definition 9. A Split-pcfg is a pcfg induced from a treebank transformed by the method of Boyd (2007); that is, discontinuous constituents have been split into several non-terminals, such that each new non-terminal covers a single contiguous component of the yield of the discontinuous constituent. Given a discontinuous non-terminal $X_n$ in the original treebank, the new non-terminals will be labelled $X^{*m}_n$, with $m$ the index of the component, s.t. $1 \le m \le n$.
For example:

lcfrs productions:
$\mathrm{S}(abc) \rightarrow \mathrm{NP}(b)\ \mathrm{VP}_2(a, c)$
$\mathrm{VP}_2(a, b) \rightarrow \mathrm{VB}(a)\ \mathrm{PRT}(b)$

cfg approximation:
$\mathrm{S} \rightarrow \mathrm{VP}^{*1}_2\ \mathrm{NP}\ \mathrm{VP}^{*2}_2$
$\mathrm{VP}^{*1}_2 \rightarrow \mathrm{VB}$
$\mathrm{VP}^{*2}_2 \rightarrow \mathrm{PRT}$

In a post-processing step, pcfg derivations are converted to discontinuous trees by merging siblings marked with '*'. This approximation overgenerates compared to the respective lcfrs, i.e., it licenses a superset of the derivations of the respective lcfrs. For example, a component $\mathrm{VP}^{*1}_2$ may be generated without generating its counterpart $\mathrm{VP}^{*2}_2$; such derivations can be filtered in post-processing. Furthermore, two components $\mathrm{VP}^{*1}_2$ and $\mathrm{VP}^{*2}_2$ may be generated which were extracted from different discontinuous constituents, such that their combination could not be generated by the lcfrs.$^{4}$

Another problem would occur when productions contain discontinuous constituents with the same label; the following two productions map to the same productions in the cfg approximation:

$\mathrm{VP}(adc\,be) \rightarrow \mathrm{VP}_2(a, b)\ \mathrm{CNJ}(c)\ \mathrm{VP}_2(d, e)$

However, such productions do not occur in any of the treebanks used in this work.

The increased independence assumptions due to rewriting discontinuous components separately are more problematic, especially with nested discontinuous constituents. They necessitate the use of non-local statistical information to select the most likely structures, for instance by turning to tree-substitution grammar (cf. Section 2 above). (Note that the issue is not as problematic when the approximation is only used as a source of pruning information.)
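The labelling scheme of Definition 9 can be sketched in a few lines. This is an illustrative sketch, not the transformation code of Boyd (2007) or of the parser: constituents are assumed to be given as a label plus the set of terminal indices they dominate, and the label format `X*m_n` and the function names are assumptions chosen to mirror the notation above.

```python
def contiguous_components(indices):
    """Group terminal indices into maximal runs of consecutive positions."""
    runs = []
    for i in sorted(indices):
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)   # extends the current contiguous run
        else:
            runs.append([i])     # a gap starts a new component
    return runs


def split_labels(label, indices):
    """Return (new_label, component) pairs, one per contiguous component.

    Continuous constituents keep their label; a constituent with n > 1
    components yields labels of the form label*m_n for m = 1..n.
    """
    runs = contiguous_components(indices)
    n = len(runs)
    if n == 1:
        return [(label, runs[0])]
    return [(f"{label}*{m}_{n}", run) for m, run in enumerate(runs, 1)]


# "Wake your friend up": the VP dominates tokens 0 and 3, the NP tokens 1-2.
vp_parts = split_labels("VP", {0, 3})
np_parts = split_labels("NP", {1, 2})
```

Merging the components back in post-processing amounts to grouping sibling nodes whose labels share the same base label and component count.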
As a specific example of the transformation, consider the case of cross-serial dependencies. Figure 10 shows the parse tree for the example sentence from the previous section, along with the grammar productions for it, before and after applying the cfg approximation of lcfrs. Note that in the approximation, the second level of INF nodes may be rewritten separately, and a context-free grammar cannot place the non-local constraint that each transitive verb should be paired with a direct object. On the other hand, through the use of tree substitution, an elementary tree may capture the whole construction of two verbs cross-serially depending on two objects, and the model needs only to prefer an analysis with this elementary tree. Once an elementary tree contains the whole construction, it no longer matters whether its internal nodes contain discontinuous constituents or indexed node labels, and the complexity of discontinuous rewriting is weakened to a statistical regularity. A phenomenon which cannot be captured in this representation, not even with the help of tree-substitution, is recursive synchronous rewriting. Although this phenomenon is rare, it does occur in treebanks.

Figure 10: Cross-serial dependencies in Dutch expressed with discontinuous constituents (top); and the same parse tree, after discontinuities have been encoded in node labels (bottom).

4 A reviewer points out that if discontinuous rewriting is seen as synchronous rewriting (synchronous cfgs are equivalent to lcfrss with fan-out 2), the split transformation is analogous to taking out the synchronicity.

TSG compression
Using grammar transformations, it is possible to parse with a tsg without having to represent elementary trees in the chart explicitly, and to work instead with a parser for the base grammar underlying the tsg (typically a cfg, in our case an lcfrs).
In this section we present such a transformation for an arbitrary discontinuous tsg to a string-rewriting lcfrs. We first look at well-established strategies for reducing a continuous tsg to a cfg, and then show that these carry over to the discontinuous case. Previous work was based on probabilistic tsg without discontinuity; this special case of pdtsg is referred to as ptsg.

4.2.1
Compressing PTSG to PCFG

Goodman (2003) gives a reduction to a pcfg for the special case of a ptsg based on all fragments from a given treebank and their frequencies. This reduction is stochastically equivalent to an all-fragments ptsg after the summation of probabilities from equivalent derivations; however, it does not admit parsing with tsgs consisting of arbitrary sets of elementary trees or assuming arbitrary probability models. Perhaps counter-intuitively, restrictions on the set of fragments increase the size of Goodman's reduction (e.g., depth restriction, Goodman 2003, p. 134). While Goodman (2003) gives instantiations of his reduction with various probability models, the limitation is that probability assignments of fragments have to be expressible as a composition of the weights of the productions in each fragment. Since each production in the reduction participates in numerous implicit fragments, it is not possible to adjust the probability of an individual fragment without affecting related fragments. We leave Goodman's reduction aside for now, because we would prefer a more general method.

A naive way to convert any tsg is to decorate each internal node of its elementary trees with a globally unique number, which can be removed from derivations in a post-processing step. Each elementary tree then contributes one or more grammar productions, and because of the unique labels, elementary trees will always be derived as a whole. However, this conversion results in a large number of non-terminals, which are essentially 'inert': they never participate in substitution but deterministically rewrite to the rest of their elementary tree.

A more compact transformation is used in Sangati and Zuidema (2011), which can be applied to arbitrary ptsgs, but adds a minimal number of new non-terminal nodes. Internal nodes are removed from elementary trees, yielding a flattened tree of depth 1.
Each flattened tree is then converted to a grammar production. Each production and original fragment is stored in a backtransform table. This table makes it possible to restore the original fragments of a derivation built from flattened productions. Whenever two fragments would map to the same flattened production, a unary node with a unique identifier is added to disambiguate them. The weight associated with an elementary tree carries over to the first production it produces; the rest of the productions are assigned a weight of 1.
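The flattening step and the backtransform table can be sketched for the continuous case as follows. This is a simplified illustration under stated assumptions, not the implementation of Sangati and Zuidema (2011): trees are `(label, children)` tuples, a node with an empty child list is a substitution site, terminals are strings, and the identifier format `label@n` is hypothetical.

```python
def frontier(tree):
    """Frontier of an elementary tree: quoted terminals and site labels."""
    label, children = tree
    if not children:                 # empty child list: substitution site
        return [label]
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(repr(child))  # quote terminals to tell them from labels
        else:
            out.extend(frontier(child))
    return out


def flatten_fragments(fragments):
    """Map (tree, weight) pairs to flattened productions.

    Returns (productions, backtransform): production -> weight, and
    first production of a fragment -> original elementary tree.
    """
    productions, backtransform = {}, {}
    for uid, (tree, weight) in enumerate(fragments):
        lhs, rhs = tree[0], tuple(frontier(tree))
        if (lhs, rhs) in backtransform:
            # Two fragments flatten identically: insert a unary node with
            # a unique identifier; later productions get weight 1.
            unique = f"{lhs}@{uid}"
            first = (lhs, (unique,))
            productions[(unique, rhs)] = 1.0
        else:
            first = (lhs, rhs)
        productions[first] = weight   # weight goes to the first production
        backtransform[first] = tree
    return productions, backtransform


deep = ("S", [("NP", []), ("VP", [("VB", ["wake"]), ("NP", [])])])
flat = ("S", [("NP", []), ("VB", ["wake"]), ("NP", [])])
prods, table = flatten_fragments([(deep, 0.5), (flat, 0.25)])
```

Both fragments share the frontier `NP 'wake' NP`, so the second one is routed through the unary symbol `S@1`, and the table recovers which original tree each derivation step stands for.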

4.2.2
Compressing PDTSG to PLCFRS

The transformation defined by Sangati and Zuidema (2011) assumes that a sequence of productions can be read off from a syntactic tree, such as a standard phrase-structure tree that can be converted into a sequence of context-free grammar productions. Using the method for inducing lcfrs productions from syntactic trees given in Section 4.2.1, we can apply the same tsg transformation to discontinuous trees as well.
Due to the design of the parser we will use, it is desirable to have grammar productions in binarized form, and to separate phrasal and lexical productions. We therefore binarize the flattened trees with a left-factored binarization that adds unique identifiers to every intermediate node introduced by the binarization. In order to separate phrasal and lexical productions, a new pos tag is introduced for each terminal, which selects for that specific terminal. A sequence of productions is then read off from the transformed tree. The unique identifier in the first production is used to look up the original elementary tree in the backtransform table.$^{5}$

Figure 11 illustrates the transformation of a discontinuous tsg. The middle column shows the productions after transforming each elementary tree. The rightmost column shows how relative frequencies can be used as weights, where $f$ is the frequency of the elementary tree in the treebank, and $f'$ is the frequency mass of elementary trees with the same root label. Note that the productions for the first elementary tree contain no discontinuity, because the discontinuous internal node is eliminated. Conversely, the transformation may also introduce more discontinuity, due to the binarization (but cf. Section 8.1 below). Figure 12 presents an overview of the methods of grammar induction presented thus far, as well as the approach for finding recurring fragments that will be introduced in the next section.

Figure 11: Transforming a discontinuous tree-substitution grammar into an lcfrs and backtransform table. The elementary trees are extracted from the tree in Figure 1 with labels abbreviated. The first production of each fragment is used as an index to the backtransform table so that the original fragments in derivations can be reconstructed.

inducing a tsg from a treebank
In Data-Oriented Parsing the grammar is implicit in the treebank itself, and in principle all possible fragments from its trees can be used to derive new sentences. Grammar induction is therefore conceptually simple (even though the grammar may be very large), as there is no training or learning involved. This maximizes re-use of previous experience.
The use of all possible fragments allows for multiple derivations of the same tree; this spurious ambiguity is seen as a virtue in dop, because it combines the specificity of larger fragments and the smoothing of smaller fragments. This is in contrast to parsimonious approaches which decompose each tree in the training corpus into a sequence of fragments representing a single derivation.

Extracting recurring fragments
Representing all possible fragments of a treebank is not feasible, since the number of fragments is exponential in terms of the number of nodes. A practical solution is to define a subset. A method called Double-dop (2dop; Sangati and Zuidema 2011) implements this without compromising on the principle of data-orientation. It restricts the fragment set to recurring fragments, i.e., fragments that occur in at least two different contexts. These are found by considering every pair of trees and extracting the largest tree fragments they have in common. It is feasible to do this exhaustively for the whole treebank. This is in contrast to the sampling of fragments in earlier dop models (Bod 2001) and Bayesian tsgs. Since the space of fragments is enormous (that is, exponential in terms of sentence length), it stands to reason that a sampling approach will not discover all relevant fragments in a reasonable time frame.

Sangati et al. (2010) present a tree-kernel method for extracting maximal recurring fragments that operates in quadratic time in terms of the number of nodes in the treebank. A faster version of this method was presented in van Cranenburgh (2014), which uses a linear average time tree kernel, and introduces the ability to handle discontinuous trees. We obtain a further increase in speed by implementing an inverted index with a compressed bitmap (Chambi et al. 2015).

Discontinuous fragments
The aforementioned fragment extraction algorithms can be adapted to support trees with discontinuous constituents. Instead of implementing a new version with data structures for discontinuous trees following Definitions 1 and 2, we apply a representation that makes it possible to add discontinuous trees as a special case.
In the representation, leaf nodes are decorated with indices indicating their ordering. Just as in Figure 6, a discontinuous tree may be represented as a continuous tree, as long as information about the yield is encoded somehow. We do this by storing indices as leaf nodes, which denote an ordering and refer to a separate list of tokens. This makes it possible to use the same data structures as for continuous trees, as long as the child nodes are kept in a canonical order (induced from the order of the lowest index of each child).
Indices are used not only to keep track of the order of lexical nodes in a fragment, but also for that of the contribution of substitution sites. This is necessary in order to preserve the configuration of the yield in the original sentence. When leaf nodes are compared, the indices stand in for the token at the sentence position referred to. After a fragment is extracted, any indices need to be canonicalized. The indices originate from the original sentence, but need to be decoupled from this original context. This process is analogous to how lcfrs productions are read off from a tree with discontinuous constituents, in which contiguous intervals of indices are replaced by variables.
The canonicalization of fragments is achieved in three steps, as defined in the pseudocode of Algorithm 1; Figure 13 illustrates the process. In the examples, substitution sites have spans denoted with inclusive start:end intervals, as extracted from the original parse tree, which are reduced to variables denoting contiguous spans whose relation to the other spans is reflected by their indices.
Input: A tree fragment t with indexed terminals w_i or intervals ⟨i:j, …⟩ as leaves (0 ≤ i < j < n)
Output: A tree fragment with modified indices.
 1: k ← the smallest index in t
 2: subtract k from each index in t
 3: for all intervals I = ⟨i:j, …⟩ of the substitution sites in t
 4:     for all i:j ∈ I
 5:         replace i:j with i
 6:         subtract j − i from all indices k s.t. k > j
 7: for all indices i in t
 8:     if the indices i + 1 and i + 2 are not in t
 9:         k ← the smallest index in t s.t. k > i
10:         subtract k − i from all indices y s.t. y > i

Figure 13: Canonicalization of fragments extracted from parse trees. These sample fragments have been extracted from the tree in Figure 1. The fragments are visualized here as discontinuous tree structures, but since the discontinuities are encoded in the indices of the yield, they can be represented in a standard bracketing format as used by the fragment extractor.
1. Translate indices so that they start at 0; e.g.:

We will refer to the combination of Double-dop with discontinuous constituents as Disco-2dop. When recurring fragments are extracted from the Tiger treebank (cf. Section 8.1), we find that 10.4% of fragment types contain a discontinuous node (root, internal, or substitution site). This can be contrasted with the observation that 30% of sentences in the Tiger treebank contain one or more discontinuous constituents, and that 20.9% of production types in the plcfrs treebank grammar of Tiger contain a discontinuous non-terminal. On the other hand, when occurrence frequencies are taken into account, both the fragments and productions with discontinuities account for around 6.5% of the total frequency mass.

6
parsing with plcfrs and pdtsg

After extracting fragments by means of the method of Section 5, we augment the set of fragments with all depth 1 fragments, in order to preserve complete coverage of the training set trees. Since depth 1 fragments are equivalent to single grammar productions, this ensures strong equivalence between the tsg and the respective treebank grammar.$^{6}$ We then apply the grammar transformation (cf. Section 4.2.1) to turn the fragments into productions. Productions corresponding to fragments are assigned a probability based on the relative frequency of the respective fragment; productions introduced by the transformation are given a probability of 1. For an example, please refer back to Figure 11.
We parse with the transformed grammar using the disco-dop parser (van Cranenburgh et al. 2011; van Cranenburgh 2012a). This is an agenda-based parser for plcfrs based on the algorithm in Maier (2010, 2013), extended to produce n-best derivations (Huang and Chiang 2005) and exploit coarse-to-fine pruning (Charniak et al. 2006).
Parsing with lcfrs can be done with a weighted deduction system and an agenda-based parser. The deduction steps are given in Figure 14; for the pseudo-code of the parser see Algorithm 2, which is an extended version of the parser in Maier (2010, 2013) that obtains the complete parse forest as opposed to just the Viterbi derivation.
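The control structure of such an agenda-based parser can be sketched for the simpler continuous case. The following is a minimal sketch, assuming a binarized pcfg without unary productions and items of the form `(label, start, end)`; the discontinuous parser instead uses items with several spans and the deduction steps of Figure 14. The function name `agenda_parse` and the grammar encoding are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def agenda_parse(words, lexical, binary):
    """Best-first (Viterbi) parsing that also records all edges.

    lexical: {word: [(tag, prob), ...]}
    binary:  [(lhs, left, right, prob), ...]
    Returns (chart, forest): Viterbi probability and edges per item.
    """
    agenda = []                  # heap of (-probability, item)
    best = {}                    # best score enqueued so far per item
    chart = {}                   # finished items with Viterbi probability
    forest = defaultdict(list)   # item -> list of incoming edges

    def enqueue(item, score, edge):
        forest[item].append(edge)            # always record the edge
        if item not in chart and score > best.get(item, 0.0):
            best[item] = score               # new or improved agenda item
            heapq.heappush(agenda, (-score, item))

    for i, word in enumerate(words):
        for tag, prob in lexical.get(word, []):
            enqueue((tag, i, i + 1), prob, (word,))

    while agenda:
        negscore, item = heapq.heappop(agenda)
        if item in chart:
            continue                         # stale entry with lower score
        score = -negscore
        chart[item] = score
        label, start, end = item
        for lhs, left, right, prob in binary:
            for other, oscore in list(chart.items()):
                olabel, ostart, oend = other
                if label == left and olabel == right and ostart == end:
                    enqueue((lhs, start, oend), prob * score * oscore,
                            (item, other))
                if label == right and olabel == left and oend == start:
                    enqueue((lhs, ostart, end), prob * oscore * score,
                            (other, item))
    return chart, dict(forest)


lexicon = {"she": [("NP", 0.5)], "wakes": [("VB", 1.0)], "him": [("NP", 0.5)]}
rules = [("S", "NP", "VP", 1.0), ("VP", "VB", "NP", 1.0)]
chart, forest = agenda_parse("she wakes him".split(), lexicon, rules)
```

Because edges are recorded even for items that are already finished, the forest contains all derivations licensed by the chart rather than only the Viterbi derivation. The inner loop scans the whole chart for brevity; a practical parser would index items by label and boundary.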
In Section 6.1 we describe the probabilistic instantiation of dtsg and the criterion to select the best parse. Section 6.2 describes how derivations from the compressed tsg are converted back into trees composed of the full elementary trees. Section 6.4 describes how coarse-to-fine pruning is employed to make parsing efficient.
Algorithm 2 A probabilistic agenda-based parser for lcfrs.
Input: A sentence w_1 ⋯ w_n, a grammar
Output: A chart with Viterbi probabilities, a parse forest.
 1: initialize agenda with all possible pos tags for input
 2: while agenda not empty
 3:     ⟨I, x⟩ ← pop item with best score on agenda
 4:     add ⟨I, x⟩ to chart
 5:     for all ⟨I′, z⟩ that can be deduced from ⟨I, x⟩ and items in chart
 6:         if I′ ∉ agenda ∪ chart
 7:             enqueue ⟨I′, z⟩ in agenda
 8:         else if I′ ∈ agenda ∧ z > score for I′ in agenda
 9:             update weight of I′ in agenda to z
10:         add edge for I′ to forest

6.1

Probabilities and disambiguation
Our probabilistic model uses the relative frequency estimate (rfe), which has shown good results with the Double-dop model (Sangati and Zuidema 2011). The relative frequency of a fragment is the number of its occurrences, divided by the total number of occurrences of fragments with the same root node. In dop many derivations may produce the same parse tree, and it has been shown that approximating the most probable parse, which considers all derivations for a tree, yields better results than the most probable derivation (Bod 1995). To select a parse tree from a derivation forest, we compute tree probabilities on the basis of the 10,000 most probable dop derivations, and select the tree with the largest probability. Although the algorithm of Huang and Chiang (2005) makes it possible to extract the exact k-best derivations from a derivation forest, we apply pruning while building the forest, so the derivations obtained are the k-best of the pruned forest rather than of the full grammar.
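Both the relative frequency estimate and the approximation of the most probable parse reduce to simple bookkeeping. The sketch below assumes fragments are identified by a `(root_label, fragment_id)` key and that the k-best list is available as `(tree, probability)` pairs; these data layouts and function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def relative_frequencies(fragment_counts):
    """Divide each fragment count by the total count for its root label."""
    totals = Counter()
    for (root, _), count in fragment_counts.items():
        totals[root] += count
    return {key: count / totals[key[0]]
            for key, count in fragment_counts.items()}


def most_probable_parse(kbest):
    """Sum derivation probabilities per tree and return the best tree.

    kbest: iterable of (tree, derivation_probability) pairs, where
    several derivations may yield the same tree.
    """
    tree_prob = defaultdict(float)
    for tree, prob in kbest:
        tree_prob[tree] += prob
    return max(tree_prob.items(), key=lambda pair: pair[1])


counts = {("NP", "f1"): 3, ("NP", "f2"): 1, ("VP", "f3"): 2}
rfe = relative_frequencies(counts)

# T1 has the single most probable derivation, but T2 wins once the
# probabilities of its two derivations are summed.
derivations = [("T1", 0.3), ("T2", 0.2), ("T2", 0.15)]
best_tree, best_prob = most_probable_parse(derivations)
```

The toy k-best list shows why the most probable parse can differ from the tree of the most probable derivation.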

Reconstructing derivations
After a derivation forest is obtained and a list of k-best derivations has been produced, the backtransform is applied to these derivations to recover their internal structure. This proceeds by doing a depth-first traversal of the derivations, and expanding each non-intermediate$^{7}$ node into a template of the original fragment. These templates are stored in a backtransform table indexed by the first binarized production of the fragment in question. The template fragment has its substitution sites marked, which are filled with values obtained by recursively expanding the children of the current constituent.

Efficient discontinuous parsing
We review several strategies for making discontinuous parsing efficient. As noted by Levy (2005, p. 138), the intrinsic challenge of discontinuous constituents is that a parser will generate a large number of potential discontinuous spans.

Outside estimates
Outside estimates (also known as context-summary estimates and figures-of-merit) are computed offline for a given grammar. During parsing they provide an estimate of the outside probability for a given constituent, i.e., the probability of a complete derivation with that constituent divided by the probability of the constituent. The estimate can be used to prioritize items in the agenda. Estimates were first introduced for discontinuous lcfrs parsing in Maier (2010, 2013). Their estimates are only applied up to sentences of 30 words. Beyond 30 words the table grows too large.
7 An intermediate node is a node introduced by the binarization.
A different estimate is given by Angelov and Ljunglöf (2014), who succeed in parsing longer sentences and providing an A* estimate, which is guaranteed to find the best derivation.

6.3.2
Non-projective dependency conversion

Hall and Nivre (2008), Versley (2014), and Fernández-González and Martins (2015) apply a reversible dependency conversion to the Tiger treebank, and use a non-projective dependency parser to parse with the converted treebank. The method has the advantage of being fast due to the greedy nature of the arc-eager transition-based dependency parser that is employed. The parser copes with non-projectivity by reordering tokens during parsing. Experiments are reported on the full Tiger treebank without length restrictions.

Reducing fan-out
The most direct way of reducing the complexity of lcfrs parsing is to reduce the fan-out of the grammar. Maier et al. (2012) introduce a linguistically motivated reduction of the fan-outs of the Negra and Penn treebanks to fan-out 2 (up to a single gap per constituent). This enables parsing of sentences of up to length 40.
Nederhof and Vogler (2014) introduce a method of synchronous parsing with an lcfrs and a definite clause grammar. A parameter allows the fan-out (and thus parsing complexity) of the lcfrs to be reduced. Experiments are reported on sentences of up to 30 words on a small section of the Tiger treebank.

6.3.4
Coarse-to-fine pruning

We will focus on coarse-to-fine pruning, introduced in Charniak et al. (2006) and applied to discontinuous parsing by van Cranenburgh (2012a), who reports parsing results on the Negra treebank without length restrictions. Compared to the previous methods, this method does not change the grammar, but adds several new grammars to be used as preprocessing steps. Compared to the outside estimates, this method exploits sentence-specific information, since pruning information is collected during parsing with the coarser grammars. Pauls and Klein (2009) present a comparison of coarse-to-fine and (hierarchical A*) outside estimates, and conclude that except when near-optimality is required, coarse-to-fine is more effective as it prunes a larger number of unlikely constituents.
A similar observation is obtained from a comparison of the discontinuous coarse-to-fine method and the outside estimates of Angelov and Ljunglöf (2014): coarse-to-fine is faster with longer sentences (30 words and up), at the cost of not always producing the most likely derivation (Ljunglöf, personal communication).

Coarse-to-fine pipeline
In order to tame the complexity of lcfrs and dop, we apply coarse-to-fine pruning. Different grammars are used in sequence, each being an overgenerating approximation of the next. That is, a coarse grammar will generate a larger set of constituents than a fine grammar. Parsing with a coarser grammar is more efficient, and all constituents which can be ruled out as improbable with a coarser grammar can be discarded as candidates when parsing with the next grammar. A constituent is ruled improbable if it does not appear in the k-best derivations of a parse forest. We use the same setup as in van Cranenburgh (2012a); namely, we parse in three stages, using three different grammars:

1. Split-pcfg: A cfg approximation of the discontinuous treebank grammar; rewrites spans of discontinuous constituents independently.
2. plcfrs: The discontinuous treebank grammar; rewrites discontinuous constituents in a single operation. A discontinuous span $X_n\langle x_1, \ldots, x_n \rangle$ is added to the chart only if all of $X^{*m}_n\langle x_m \rangle$ with $1 \le m \le n$ are part of the k-best derivations of the chart of the previous stage.
3. Disco-dop: The discontinuous dop grammar; uses tree fragments instead of individual productions from the treebank. A discontinuous span $X_n\langle x_1, \ldots, x_n \rangle$ is added to the chart only if $X_n\langle x_1, \ldots, x_n \rangle$ is part of the k-best derivations of the chart of the previous stage, or if $X_n$ is an intermediate symbol introduced by the tsg compression.
The first stage is necessary because without pruning, the plcfrs generates too many discontinuous spans, the majority of which are improbable or not even part of a complete derivation. The second stage is not necessary for efficiency but gives slightly better accuracy on discontinuous constituents.
For example, while parsing the sentence "Wake your friend up," the discontinuous VP "Wake … up" may be produced in the plcfrs stage. Before allowing this constituent to enter into the agenda and the chart, the chart of the previous stage is consulted to see if the two discontinuous components "Wake" and "up" were part of the k-best derivations. In the dop stage, multiple elementary trees may be headed by this discontinuous constituent, and again they are only allowed on the chart if the previous stage produced the constituent as part of its k-best derivations.
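The check performed in the plcfrs stage can be sketched as a lookup in a whitelist collected from the k-best derivations of the Split-pcfg stage. The whitelist layout (a set of `(label, span)` pairs), the label format `X*m_n`, and the treatment of continuous constituents by the same lookup are assumptions made for this illustration.

```python
def allowed_in_plcfrs(label, spans, whitelist):
    """Decide whether an item may enter the plcfrs chart.

    label:     non-terminal without split marker, e.g. "VP"
    spans:     tuple of (start, end) pairs, one per component
    whitelist: set of (label, span) items from the k-best derivations
               of the Split-pcfg stage
    """
    n = len(spans)
    if n == 1:
        return (label, spans[0]) in whitelist
    # Every component X*m_n must have survived in the coarse stage.
    return all((f"{label}*{m}_{n}", span) in whitelist
               for m, span in enumerate(spans, 1))


# "Wake your friend up": components "Wake" (0,1) and "up" (3,4).
coarse_items = {("VP*1_2", (0, 1)), ("VP*2_2", (3, 4)), ("NP", (1, 3))}
ok = allowed_in_plcfrs("VP", ((0, 1), (3, 4)), coarse_items)
blocked = allowed_in_plcfrs("VP", ((0, 1), (2, 4)), coarse_items)
```

In the dop stage the lookup is simpler, since the whitelist then contains complete discontinuous items from the plcfrs chart.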
The initial values for k are 10,000 and 50 for the plcfrs and dop grammar respectively. These values are chosen to be able to directly compare the new approach with the results in van Cranenburgh (2012a). However, experimenting with a higher value for k for the dop stage has been shown to yield improved performance. In other coarse-to-fine work the pruning criterion is based on a posterior threshold (e.g., Charniak et al. 2006; Bansal and Klein 2010); the k-best approach has the advantage that it does not require the computation of inside and outside probabilities.
For the initial pcfg stage, we apply beam search as in Collins (1999). The highest scoring item in each cell is tracked and only items up to 10,000 times less probable are allowed to enter the chart.
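The beam criterion can be written down directly. The sketch below applies it to a finished cell after the fact, whereas a parser would track the best score while the cell is being filled; the cell layout (a dictionary from labels to probabilities) and the function name are assumptions.

```python
def prune_cell(cell, factor=10_000):
    """Keep items at most `factor` times less probable than the best item.

    cell: {label: probability} for one span of the chart.
    """
    if not cell:
        return {}
    best = max(cell.values())
    return {label: prob for label, prob in cell.items()
            if prob * factor >= best}


cell = {"NP": 0.5, "VP": 0.5 / 9_000, "ADJP": 0.5 / 20_000}
kept = prune_cell(cell)
```

With the default factor, the VP item is retained because it is 9,000 times less probable than the best item, while the ADJP item falls outside the beam.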
Experiments and results are described in Sections 8-9.

7
discontinuity without lcfrs

The idea up to now has been to generate discontinuous constituents using formal rewrite operations of lcfrs. It should be noted, however, that the pcfg approximation used in the pruning stage reproduces discontinuities using information derived from the non-terminal labels. Instead of using this technique only as a crutch for pruning, it can also be combined with the use of fragments to obtain a pipeline that runs in cubic time. While the cfg approximation increases the independence assumptions for discontinuous constituents, the use of large fragments in the dop approach can mitigate this increase. To create the cfg approximation of the discontinuous treebank grammar, the treebank is transformed by splitting discontinuous constituents into several non-terminal nodes (as explained in Section 4.1), after which grammar productions are extracted. This last step can also be replaced with fragment extraction to obtain a dop grammar from the transformed treebank. We shall refer to this alternative approach as 'Split-2dop.' The coarse-to-fine pipeline is now as follows:

1. Split-pcfg: A treebank grammar based on the cfg approximation of discontinuous constituents; rewrites spans of discontinuous constituents independently.
2. Split-2dop grammar: tree fragments based on the same transformed treebank as above.
Since every discontinuous non-terminal is split up into a new non-terminal for each of its spans, the independence assumptions for that non-terminal in a probabilistic grammar are increased. While this representation is not sufficient to express the full range of nested discontinuous configurations, it appears adequate for the linguistic phenomena in the treebanks used in this work, since their trees can be unambiguously transformed back and forth into this representation. Moreover, the machinery of Data-Oriented Parsing mitigates the increase in independence assumptions through the use of large fragments. We can therefore parse using a dop model with a context-free grammar as the symbolic backbone, and still recover discontinuous constituents.

8

experimental setup
In this section we describe the experimental setup for benchmarking our discontinuous Double-dop implementations on several discontinuous treebanks.

Treebanks and preprocessing
We evaluate on three languages: for German, we use the Negra (Skut et al. 1997) and Tiger (Brants et al. 2002) treebanks; for English, we use a discontinuous version of the Penn treebank (Evang and Kallmeyer 2011); and for Dutch, we use the Lassy (Van Noord 2009) and cgn (van der Wouden et al. 2002) treebanks; cf. Table 1. Negra and Tiger contain discontinuous annotations by design, as a strategy to cope with the relatively free word order of German. The discontinuous Penn treebank consists of the wsj section in which traces have been converted to discontinuous constituents; we use the version used in Evang and Kallmeyer (2011, Sections 5.1-5.2) without restrictions on the transformations. The Lassy treebank is referred to as a dependency treebank but when discontinuity is allowed it can be directly interpreted as a constituency treebank. The Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus; van der Wouden et al. 2002) is a Dutch spoken language corpus with the same syntactic annotations. We use the syntactically annotated sentences from the Netherlands (i.e., without the Flemish part) of up to 100 tokens. The train-dev-test splits we employ are as commonly used for the Penn treebank: sec. 2-21, sec. 24, sec. 23, respectively. For Negra we use the one defined in Dubey and Keller (2003). For Tiger we follow Hall and Nivre (2008) who define sections 0-9 where sentence i belongs to section i mod 10, sec. 0 is used as test, sec. 1 as development, and 2-9 as training.
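The Tiger split described above is a simple function of the sentence number. The sketch below assumes that sentences are identified by the integer `i` referred to in the text; whether numbering starts at 0 or 1 depends on the treebank release and is not specified here. The function name is hypothetical.

```python
def tiger_split(sentence_ids):
    """Assign sentence numbers to test/dev/train by section i mod 10."""
    split = {"test": [], "dev": [], "train": []}
    for i in sentence_ids:
        section = i % 10
        if section == 0:
            split["test"].append(i)    # section 0
        elif section == 1:
            split["dev"].append(i)     # section 1
        else:
            split["train"].append(i)   # sections 2-9
    return split


split = tiger_split(range(1, 21))
```

For twenty consecutive sentences this yields two test sentences, two development sentences, and sixteen training sentences, i.e., an 10/10/80 percent division.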
When parsing the Tiger test set, the development set is added to the training set as well; while this is not customary, it ensures the results are comparable with Hall and Nivre (2008). The same split is applied to the cgn treebank but with a single training set. For Lassy the split is our own.$^{8}$

For purposes of training we apply heuristics for head assignment (Klein and Manning 2003) and binarize the trees in the training sets head-outward with $h = 1$, $v = 1$ markovization; i.e., n-ary nodes are factored into nodes specifying an immediate sibling and parent. Note that for lcfrs, a binarization may increase the fan-out, and thus the complexity of parsing. It is possible to select the binarization in such a way as to minimize this complexity (Gildea 2010). However, experiments show that this increase in fan-out does not actually occur, regardless of the binarization strategy (van Cranenburgh 2012a). Head-outward means that constituents are binarized in a right-factored manner up until the head child, after which the rest of the binarization continues in a left-factored manner.
We add fan-out markers to guarantee unique fan-outs for non-terminal labels, e.g., $\{\mathrm{VP}, \mathrm{VP}_2, \mathrm{VP}_3, \ldots\}$, which are removed again for evaluation.
For the Dutch and German treebanks, punctuation is not part of the syntactic annotations. This causes spurious discontinuities, as the punctuation interrupts the constituents dominating its surrounding tokens. Additionally, punctuation provides a signal for constituent boundaries, and it is useful to incorporate it as part of the rest of the phrase structures. We use the method described in van Cranenburgh (2012a): punctuation is attached to the highest constituent that contains a neighbor to its right. With this strategy there is no increase in the amount of discontinuity with respect to a version of the treebank with punctuation removed. The CGN treebank contains spoken language phenomena, including disfluencies such as interjections and repeated words. In preprocessing, we treat these as if they were punctuation tokens; i.e., they are moved to an appropriate constituent (as defined above) and are ignored in the evaluation.
The complexity of parsing with a binarized lcfrs is $O(n^{3\varphi})$ with $\varphi$ the highest fan-out of the non-terminals in the grammar (Seki et al. 1991). For a given grammar, it is possible to give a tighter upper bound on the complexity of parsing. Given the unique fan-outs of non-terminals in a grammar, the number of operations it takes to apply a production is the sum of the fan-outs in the production (Gildea 2010):

The complexity of parsing with a grammar is then the maximum value of this measure for productions in the grammar. In our experiments we find a worst-case time complexity of $O(n^9)$ for parsing with the dop grammars extracted from Negra and wsj. The following sentence from Negra contributes a grammar production with complexity 9. The production is from the VP of vorgeworfen; bracketed words are from other constituents, indicating the discontinuities: (

The complexities for Tiger and Lassy are $O(n^{10})$ and $O(n^{12})$ respectively, due to a handful of anomalous sentences; by discarding these sentences, a grammar with a complexity of $O(n^9)$ can be obtained with no or negligible effect on accuracy.
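The complexity measure described in prose above, the sum of the fan-outs in a production maximized over the grammar, can be computed as in the following sketch. The grammar encoding (each production as a tuple of fan-outs, left-hand side first) and the function names are assumptions for illustration.

```python
def production_complexity(fanouts):
    """Sum of the fan-outs of the LHS and RHS non-terminals."""
    return sum(fanouts)


def grammar_complexity(productions):
    """Exponent c such that parsing with the grammar is O(n^c)."""
    return max(production_complexity(f) for f in productions)


# S(abc) -> NP(b) VP_2(a, c): fan-outs 1, 1, 2
# VP_2(a, b) -> VB(a) PRT(b): fan-outs 2, 1, 1
toy = [(1, 1, 2), (2, 1, 1)]
exponent = grammar_complexity(toy)

# A continuous binary production has fan-outs (1, 1, 1), which recovers
# the familiar cubic bound for context-free parsing.
cfg_exponent = grammar_complexity([(1, 1, 1)])
```

Under this measure a production in which all three non-terminals have fan-out 3 has complexity 9, which equals the general bound $3\varphi$ for $\varphi = 3$.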

Unknown words
In initial experiments the parser is trained and evaluated on gold standard part-of-speech tags, as in previous experiments on discontinuous parsing. Later we show results when tags are assigned automatically with a simple unknown word model, based on the Stanford parser (Klein and Manning 2003). An open class threshold σ determines which tags are considered open class tags; tags that rewrite more than σ words are considered open class tags, and words they rewrite are open class words. Open class words in the training set that do not occur more than 4 times are replaced with signatures based on a list of features; words in the test set which are not part of the known words from the training set are replaced with similar signatures. The features are defined in the Stanford parser as Model 4, which is relatively language independent; cf.
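The open class threshold and the signature replacement can be sketched as follows. This is an illustration under stated assumptions: the corpus is a flat list of `(word, tag)` pairs, "words" are counted as distinct word types per tag, and the `signature` function uses a few made-up features rather than the actual Model 4 feature list of the Stanford parser.

```python
from collections import Counter, defaultdict


def open_class_tags(tagged_corpus, sigma):
    """Tags that rewrite more than sigma distinct words are open class."""
    words_per_tag = defaultdict(set)
    for word, tag in tagged_corpus:
        words_per_tag[tag].add(word)
    return {tag for tag, words in words_per_tag.items() if len(words) > sigma}


def signature(word):
    """Illustrative signature; NOT the Stanford Model 4 feature list."""
    sig = "_UNK"
    if word[:1].isupper():
        sig += "-CAP"
    if any(ch.isdigit() for ch in word):
        sig += "-NUM"
    if "-" in word:
        sig += "-DASH"
    return sig


def replace_rare_words(tagged_corpus, sigma, max_count=4):
    """Training set: replace open class words seen at most max_count times."""
    open_tags = open_class_tags(tagged_corpus, sigma)
    counts = Counter(word for word, _ in tagged_corpus)
    return [
        (signature(word), tag)
        if tag in open_tags and counts[word] <= max_count
        else (word, tag)
        for word, tag in tagged_corpus
    ]


def replace_unknown_words(words, known_words):
    """Test set: replace words not seen in training with a signature."""
    return [w if w in known_words else signature(w) for w in words]


corpus = [("the", "DT")] * 10 + [("Dog", "NN"), ("cat", "NN"), ("x-ray", "NN")]
train = replace_rare_words(corpus, sigma=2)
```

In this toy corpus only NN rewrites more than two distinct words, so the three rare nouns are replaced by signatures while the determiner is kept.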

Function tags
We investigated two methods of having the parser produce function tags in addition to the usual phrase labels. The first method is to train a separate discriminative classifier that adds function tags to parse trees in a post-processing step. This approach is introduced in Blaheta and Charniak (2000). We employed their feature set. Another approach is to simply append the function tags to the non-terminal labels, resulting in, e.g., NP-SBJ and NP-OBJ for subject and object noun phrases. While this approach introduces sparsity and may affect the performance without function tags, we found this approach to perform best and therefore report results with this approach. Gabbard et al. (2006) and Fraser et al. (2013) use this approach as well. Compared to the classifier approach, it does not require any tuning, and the resulting model is fully generative. We apply this to the Tiger, wsj, and Lassy treebanks.
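The label-appending approach amounts to a simple tree transformation that can be undone after parsing. The sketch below assumes a hypothetical tree encoding as nested tuples and assumes that base labels contain no hyphen; neither is taken from the parser's actual data structures.

```python
def add_function_tags(tree):
    """Append function tags to labels.

    Input nodes are (label, function_tag_or_None, children_or_word);
    output nodes are (label, children_or_word).
    """
    label, func, rest = tree
    new_label = f"{label}-{func}" if func else label
    if isinstance(rest, str):  # preterminal dominating a word
        return (new_label, rest)
    return (new_label, [add_function_tags(child) for child in rest])


def strip_function_tags(tree):
    """Remove appended function tags (assumes base labels have no hyphen)."""
    label, rest = tree
    base = label.split("-", 1)[0]
    if isinstance(rest, str):
        return (base, rest)
    return (base, [strip_function_tags(child) for child in rest])


tree = ("S", None, [
    ("NP", "SBJ", [("PRP", None, "She")]),
    ("VP", None, [
        ("VBD", None, "saw"),
        ("NP", "OBJ", [("PRP", None, "it")]),
    ]),
])
tagged = add_function_tags(tree)
```

Stripping the tags again yields the plain phrase labels, which is what allows evaluation both with and without function tags from the same parser output.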
The Penn treebank differs from the German and Dutch treebanks with respect to function tags. The Penn treebank only has function tags on selected non-terminals (never on preterminals) and each non-terminal may have several function tags from four possible categories; whereas the German and Dutch treebanks have a single function tag on most non-terminals. The tag set also differs considerably: the Penn treebank has 20 function tags, Lassy has 31, and Tiger has 43.

8.4 Treebank refinements
We apply a set of manual treebank refinements based on previous work. In order to compare the results on Negra with previous work, we do not apply the state splits when working with gold standard pos tags.
For Dutch and German we split the pos tags for the sentence-ending punctuation '.!?'. For all treebanks we add the feature 'year' to the preterminal label of tokens with numbers in the range 1900-2040, and replace the token with 1970. Other numbers are replaced with 000.
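The number normalization can be sketched as follows. The regular expression for what counts as a number and the `/year` suffix on the preterminal label are assumptions for illustration; the text above only specifies the range, the feature name, and the replacement tokens.

```python
import re

# Assumption: a number is a run of digits, optionally with '.' or ',' groups.
NUMBER = re.compile(r"^\d+([.,]\d+)*$")


def normalize_number(token, tag):
    """Map year-like numbers to 1970 (marking the tag) and others to 000."""
    if not NUMBER.match(token):
        return token, tag
    if token.isdigit() and 1900 <= int(token) <= 2040:
        return "1970", tag + "/year"  # hypothetical label syntax
    return "000", tag


print(normalize_number("1984", "CARD"))  # ('1970', 'CARD/year')
print(normalize_number("42", "CARD"))    # ('000', 'CARD')
```

Collapsing all numbers into two tokens removes a large class of rare words from the lexicon while keeping the distinction between years and other quantities.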

Tiger
For Tiger we apply the refinements described in Fraser et al. (2013).
Since the Negra treebank is only partially annotated with morphological information, we do not apply these refinements to that treebank.

WSJ
We follow the treebank refinements of Klein and Manning (2003) for the Wall Street Journal section of the Penn treebank.

Lassy
The Lassy treebank contains fine-grained part-of-speech tags with morphological features. It is possible to use the full part-of-speech tags as the preterminal labels, but this introduces sparsity. We select a subset of features to add to the preterminal labels:
• nouns: proper/regular;
• verbs: auxiliary/main, finite/infinite;
• conjunctions: coordinating/subordinating;
• pronouns: personal/demonstrative;
• pre- vs. postposition.
Additionally, we percolate the feature identifying finite and infinite verbs to the parents and grandparents of the verb.
For each multi-word unit (MWU), we append the label of its head child. This helps distinguish MWUs as being nominal, verbal, prepositional, or otherwise. The last two transformations are based on those for Tiger. Unary NPs are added for single nouns and pronouns in sentential, prepositional and infinitival constituents. For conjuncts, the function tag of the parent is copied. Both transformations can be reversed.
Since the cgn treebank uses a different syntax for the fine-grained pos tags, we do not apply these refinements to that treebank.

8.5 Metrics
We employ the exact match and Parseval measures (Black et al. 1992) as evaluation metrics. Both are based on bracketings that identify the label and yield of each constituent. The exact match is the proportion of sentences in which all labelled bracketings are correct. The Parseval measures consist of the precision, recall, and F-measure of the correct labelled bracketings averaged across the treebank. Since the pos accuracy is crucial to the performance of a parser and neither of the previous metrics reflects it, we also report the proportion of correct pos tags.
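These measures can be sketched for bracketings with possibly discontinuous yields as follows. The sketch assumes that each tree has already been reduced to a list of `(label, yield)` pairs, with the yield as a frozenset of token indices, and that scores are micro-averaged over the treebank; it does not implement the discounting of root nodes and punctuation.

```python
from collections import Counter


def parseval(gold_trees, parsed_trees):
    """Labelled precision, recall, F1 (micro-averaged), and exact match.

    Each tree is a list of (label, yield) pairs; a yield is a frozenset of
    token indices, so discontinuous constituents are scored all-or-nothing.
    Assumes a non-empty treebank.
    """
    matched = n_gold = n_parsed = exact = 0
    for gold, parsed in zip(gold_trees, parsed_trees):
        gold_counts, parsed_counts = Counter(gold), Counter(parsed)
        matched += sum((gold_counts & parsed_counts).values())
        n_gold += sum(gold_counts.values())
        n_parsed += sum(parsed_counts.values())
        exact += gold_counts == parsed_counts
    precision = matched / n_parsed if n_parsed else 0.0
    recall = matched / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, exact / len(gold_trees)


# Toy example: the gold VP is discontinuous (tokens 2 and 4, skipping 3).
gold = [[("NP", frozenset({0, 1})), ("VP", frozenset({2, 4}))]]
parsed = [[("NP", frozenset({0, 1})), ("VP", frozenset({2, 3, 4}))]]
print(parseval(gold, parsed))  # (0.5, 0.5, 0.5, 0.0)
```

In the toy example the parsed VP differs from the gold VP by a single token, yet it receives no credit, since a bracketing counts only if its label and complete yield match.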
We use the evaluation parameters typically used with EVALB on the Penn treebank. Namely, the root node and punctuation are not counted towards the score (similar to COLLINS.prm, except that we discount all punctuation, including brackets). Counting the root node as a constituent should not be done because it is not part of the corpus annotation and the parser is able to generate it without doing any work; when the root node is counted it inflates the F-score by several percentage points. Punctuation should be ignored because in the original annotation of the Dutch and German treebanks, punctuation is attached directly under the root node instead of as part of constituents. Punctuation can be re-attached using heuristics for the purposes of parsing, but evaluation should not be affected by this.
It is not possible to directly compare evaluation results from discontinuous parsing to existing state-of-the-art parsers that do not produce discontinuous constituents, since parses without discontinuous constituents contain a different set of bracketings; cf. Figure 15, which compares discontinuous bracketings to the bracketings extracted from a tree in which discontinuity has been resolved by attaching non-head siblings higher in the tree, as used in work on parsing Negra. Compared to an evaluation of bracketings without discontinuous constituents, an evaluation including discontinuous bracketings is more stringent. This is because bracketings are scored in an all-or-nothing manner, and a discontinuous bracketing includes non-local elements that would be scored separately when discontinuity is removed in a preprocessing step.

For function tags we use two metrics:

1. The non-null metric of Blaheta and Charniak (2000), which is the F-score of function tags on all correctly parsed bracketings. Since the German and Dutch treebanks include function tags on pre-terminals, we also include function tags on correctly tagged words in this metric.
2. A combined F-measure on bracketings of the form ⟨C, F, span⟩, where C is a syntactic category and F a function tag.

9 Evaluation

This section presents an evaluation on three languages, and with respect to the use of function tags, tree fragments, pruning, and probabilities.

9.1 Main results on three languages

Table 3 lists the results for discontinuous parsing of three Germanic languages, with unknown word models. The cited works by Kallmeyer and Maier (2013) and Evang and Kallmeyer (2011) also use lcfrs for discontinuity but employ a treebank grammar with relative frequencies of productions. Hall and Nivre (2008), Versley (2014), and Fernández-González and Martins (2015) use a conversion to dependencies discussed in Section 6.3.2. For English and German our results improve upon the best known discontinuous constituency parsing results. The new system achieves a 16% relative error reduction over the previous best result for discontinuous parsing on sentences of length ≤ 40 in the Negra test set. In terms of efficiency, the Disco-2dop model is more than three times as fast as the dop reduction, taking about three hours instead of ten on a single core. The grammar is also more compact: the Disco-2dop grammar is only a third the size of that of the dop reduction, at 6 MB versus 18 MB compressed size.

Table 3 also includes results from van Cranenburgh and Bod (2013), who neither add function tags to non-terminal labels nor apply the extensive treebank refinements described in Sections 8.3-8.4. Although the refinements and some of the function tags would be expected to improve performance, the rest of the function tags increase sparsity, and consequently the resulting F-scores are slightly lower; this tradeoff seems justified in order to get parse trees with function tags.

The results on cgn show a surprisingly high exact match score. This is due to a large number of interjection utterances, e.g., "uhm."; since such sentences only consist of a root node and pos tags, the bracketing F1-score is not affected by this.

Table 4 reports an evaluation including function tags. For these three treebanks, the models reproduce most of the information in the original treebank. The following parts are not yet incorporated. The German and Dutch treebanks contain additional lexical information consisting of lemmas and morphological features. These could be added to the non-terminal labels of the model or obtained from an external pos tagger. Lastly, some non-terminals have multiple parents; these occur in the German and Dutch treebanks and are referred to as secondary edges.

All-fragments vs. recurring fragments
Table 3: Discontinuous parsing of three Germanic languages. POS is the part-of-speech tagging accuracy; F1 is the labelled bracketing F1-score; EX is the exact match score. Results marked with * use gold standard pos tags; those marked with † do not discount the root node and punctuation. NB: Kallmeyer and Maier (2013) and Evang and Kallmeyer (2011) use a different test set and length restriction. 'vanCraBod2013' refers to van Cranenburgh and Bod (2013), and 'FeMa2015' to Fernández-González and Martins (2015).

The original Disco-dop model (van Cranenburgh et al. 2011) is based on an all-fragments model, while Disco-2dop is based on recurring fragments. Table 5 compares previous results of Disco-dop to the new Disco-2dop implementation. The second column shows the accuracy for different values of k, i.e., the number of coarse derivations that determine the allowed labelled spans for the fine stage. While increasing this value did not yield improvements using the dop reduction, with Disco-2dop there is a substantial improvement in performance, with k = 5000 yielding the best score among the handful of values tested. Figure 16 shows the average time spent in each stage using the latter model on wsj. The average time to parse a sentence (≤ 40 words) for this grammar is 7.7 seconds. Efficiency could be improved significantly by improving the pcfg parser using better chart representations such as packed parse forests and bit vectors (Schmid 2004
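The role of k in the coarse-to-fine pipeline can be illustrated with a sketch of the pruning step: the labelled spans in the k best coarse derivations form a whitelist, and the fine stage only considers items whose coarse projection is on it. The data layout and the `@` convention for mapping fine labels to coarse labels are hypothetical; continuous spans are shown as `(start, end)` pairs, and a discontinuous span could be represented by a tuple of such pairs.

```python
def span_whitelist(kbest_derivations):
    """Collect the labelled spans occurring in the k best coarse derivations."""
    allowed = set()
    for derivation in kbest_derivations:
        allowed.update(derivation)
    return allowed


def coarse_label(fine_label):
    """Project a fine label onto its coarse label (hypothetical '@' suffix)."""
    return fine_label.split("@", 1)[0]


def is_allowed(fine_label, span, whitelist):
    """An item survives pruning if its coarse projection is whitelisted."""
    return (coarse_label(fine_label), span) in whitelist


# Two toy coarse derivations (k = 2), each a list of (label, span) items.
kbest = [
    [("S", (0, 3)), ("NP", (0, 1)), ("VP", (1, 3))],
    [("S", (0, 3)), ("NP", (0, 2)), ("VP", (2, 3))],
]
whitelist = span_whitelist(kbest)
print(is_allowed("NP@7", (0, 2), whitelist))  # True
print(is_allowed("VP@3", (0, 2), whitelist))  # False
```

A larger k enlarges the whitelist, so the fine grammar can choose among more analyses at the cost of more parsing time.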

Effects of pruning
The effects of pruning can be further investigated by comparing different levels of pruning. We first parse the sentences in the Negra development set that are up to 30 words long with a plcfrs treebank grammar, with k = 10,000 and without pruning. Out of 897 sentences, the Viterbi derivation is pruned on only 14 occasions, while the pruned version is about 300 times faster. Table 6 shows results for different levels of pruning on sentences of all lengths. For sentences of all lengths it is not feasible to parse with the unpruned plcfrs. However, we can compare the items in the parse forest after pruning and the best derivation to the gold tree from the treebank. From the various measures, it can be concluded that the pruning has a large effect on speed and the number of items in the resulting parse forest, while having only a small effect on the quality of the parse (forest).

Without LCFRS

Table 3 shows that the Disco-2dop and Split-2dop techniques have comparable performance, demonstrating that the complexity of lcfrs parsing can be avoided. Table 7 shows the performance in each step of the coarse-to-fine pipelines, with and without lcfrs. Surprisingly, the use of a formalism that explicitly models discontinuity as an operation does not give any improvement over a simpler model in which discontinuities are only modeled probabilistically by encoding them into labels and fragments. This demonstrates that given the use of tree fragments, discontinuous rewriting through lcfrs comes at a high computational cost without a clear benefit over cfg.

Pipeline                                 F1     EX%
Split-pcfg (no lcfrs, no tsg)            65.8   28.0
Split-pcfg ⇒ plcfrs (no tsg)             65.9   28.6
Split-pcfg ⇒ plcfrs ⇒ 2dop               77.7   41.5
Split-pcfg ⇒ Split-2dop (no lcfrs)       78.1   42.0

The role of probabilities

From the results it is clear that a probabilistic tree-substitution grammar is able to provide much better results than a simple treebank grammar. However, it is not obvious whether the improvement is specifically due to the more fine-grained statistics (i.e., frequencies of more specific events), or generally because of the use of larger chunks. A serendipitous discovery during development of the parser provides insight into this: during an experiment, the frequencies of fragments were accidentally permuted and assigned to different fragments, but the resulting decrease in performance was surprisingly low, from 77.7 to 74.1 F1, suggesting that most of the improvement over the 65.9 F1 score of the plcfrs treebank grammar comes from memorizing larger chunks, as opposed to statistical reckoning.
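The manipulation behind this observation, keeping the set of fragments fixed while reassigning their frequencies, can be mimicked as follows. This is only an illustration of the idea, with a hypothetical mapping from fragments to counts; it is not the code in which the accidental permutation occurred.

```python
import random


def permute_frequencies(fragment_counts, seed=0):
    """Reassign the observed counts to fragments in a random order.

    The set of fragments and the multiset of counts are unchanged; only
    the pairing between them is shuffled.
    """
    fragments = list(fragment_counts)
    counts = list(fragment_counts.values())
    random.Random(seed).shuffle(counts)
    return dict(zip(fragments, counts))


# Hypothetical fragment counts, written as bracketed strings.
counts = {"(NP (DT the) NN)": 120, "(VP VB NP)": 45, "(S NP VP)": 300}
shuffled = permute_frequencies(counts)
```

Since the grammar still contains exactly the same fragments, any accuracy that remains after such a shuffle cannot be attributed to the specific frequency estimates.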

Previous work
Earlier work on recovering empty categories and their antecedents in the Penn treebank (Johnson 2002; Levy and Manning 2004; Gabbard et al. 2006; Schmid 2006; Cai et al. 2011) has recovered non-local dependencies by producing the traces and co-indexation as in the original annotation. If the results include both traces and antecedents (which holds for all but the last work cited), the conversion to discontinuous constituents of Evang and Kallmeyer (2011) could be applied to obtain a discontinuous F-score. Since this would require access to the original parser output, we have not pursued this. As explained in Section 8.5, it is not possible to directly compare the results to existing parsers that do not produce discontinuous constituents. However, the F-measures do give a rough measure, since the majority of constituents are not discontinuous.
For English, there is a result with 2dop by Sangati and Zuidema (2011) with an F1 score of 87.9. This difference can be attributed to the absence of discontinuous bracketings, as well as their use of the Maximum Constituents Parse instead of the Most Probable Parse; the former optimizes the F-measure instead of the exact match score. Shindo et al. (2012) achieve an F1 score of 92.9 with a Bayesian tsg that uses symbol refinement through latent variables (i.e., automatic state splitting).
For German, the best results without discontinuity and no length restriction are F1 scores of 84.2 for Negra (Petrov 2010) and 76.8 for Tiger (Fraser et al. 2013; note that this result employs a different train-dev-test split than the one in this work).

10 Conclusion

We have shown how to parse with discontinuous tree-substitution grammars and presented a practical implementation. We employ a fragment extraction method that finds recurring structures in treebanks efficiently, and supports discontinuous treebanks. This enables a data-oriented parsing implementation that employs a compact, efficient, and accurate model for discontinuous parsing in a generative model that improves upon previous results for this task.
Surprisingly, it turns out that the formal power of lcfrs is not necessary to describe discontinuity, since equivalent results can be obtained with a probabilistic tree-substitution grammar in which nonlocal relations are encoded in the non-terminal labels. In other words, it is feasible to produce discontinuous constituents without invoking mild context-sensitivity.
We have presented parsing results on three languages. Compared to previous work on statistical parsing, our models are linguistically richer. In addition to discontinuous constituents, our models also reproduce function tags from the treebank. While there have been previous results on reproducing non-local relations or function tags, this work reproduces both using models derived straightforwardly from treebanks, while exploiting ready-made treebank transformations for improved performance.
The source code of the parser used in this work is available at https://github.com/andreasvc/disco-dop.
Acknowledgments

We are grateful to Kilian Evang for supplying the discontinuous Penn treebank, to the reviewers for detailed comments, and to Dave Carter and Adam Przepiórkowski for copy-editing suggestions. This work is supported by the Computational Humanities Program of the Royal Netherlands Academy of Arts and Sciences, as part of The Riddle of Literary Quality.