The expected sum of edge lengths in planar linearizations of trees. Theory and applications

Dependency trees have proven to be a very successful model to represent the syntactic structure of sentences of human languages. In these structures, vertices are words and edges connect syntactically-dependent words. The tendency of these dependencies to be short has been demonstrated using random baselines for the sum of the lengths of the edges or its variants. A ubiquitous baseline is the expected sum in projective orderings (wherein edges do not cross and the root word of the sentence is not covered by any edge), which can be computed in time $O(n)$. Here we focus on a weaker formal constraint, namely planarity. In the theoretical domain, we present a characterization of planarity that, given a sentence, yields either the number of planar permutations or an efficient algorithm to generate uniformly random planar permutations of the words. We also show the relationship between the expected sum in planar arrangements and the expected sum in projective arrangements. In the domain of applications, we derive an $O(n)$-time algorithm to calculate the expected value of the sum of edge lengths. We also apply this research to a parallel corpus and find that the gap between actual dependency distance and the random baseline reduces as the strength of the formal constraint on dependency structures increases, suggesting that formal constraints absorb part of the dependency distance minimization effect. Our research paves the way for replicating past research on dependency distance minimization using random planar linearizations as random baseline.


Introduction
A successful representation of the structure of a sentence in natural language is a (labeled) graph indicating the syntactic relationships between words together with the encoding of the words' order. In such a graph, the edge labels indicate the type of syntactic relationship between the words. Such a combination of graph and linear ordering, as in Figure 1, is known as a syntactic dependency structure (Nivre, 2006). When the graph (1) is well-formed, namely, weakly connected, (2) is acyclic, that is, there are no cycles in the graph, (3) is single-headed, that is, every node has a single head (except for the root node), and (4) has only one root node (one node with no head), then it is called a syntactic dependency tree (Nivre, 2006). There exist formal constraints that are often imposed on dependency structures. One such constraint is projectivity: a dependency structure is projective if, for every vertex $v$, all vertices reachable from $v$ in the underlying graph form a continuous substring within the sentence (Kuhlmann and Nivre, 2006) and the root word of the sentence (the root of the underlying syntactic dependency structure) is never covered (as in Figure 1(a)). Another formal constraint is planarity, a generalization of projectivity where the root is allowed to be covered by one or more of the edges (as in Figure 1(b)). Figure 1(c) shows a sentence that is neither projective nor planar.
In this article, we study statistical properties of syntactic dependency structures under the planarity constraint. Such structures are represented in this article as a pair consisting of a (free or rooted) tree and a linear arrangement of its vertices. Free trees are denoted as $T = (V, E)$, and rooted trees as $T^r = (V, E; r)$, where $V$ is the set of vertices, $E$ the set of edges, and $r \in V$ denotes the root vertex. Unless stated otherwise, $n = |V|$, that is, $n$ denotes the number of vertices, which is equal to the number of words in the sentence. A linear arrangement $\pi$ (also called embedding) of a tree is a (bijective) function $\pi : V \to \{1, \dots, n\}$ that maps every vertex $u$ of a tree to a unique position in $\{1, \dots, n\}$, which is denoted by $\pi(u)$. Projectivity, as well as planarity, can be alternatively defined on linear arrangements using the concept of edge crossing. We say that any two (undirected) edges $\{s, t\}$, $\{u, v\}$ cross if the positions of their vertices interleave. More formally, assume, without loss of generality, that $\pi(s) < \pi(t)$, $\pi(u) < \pi(v)$ and $\pi(s) < \pi(u)$. Then, edges $\{s, t\}$, $\{u, v\}$ cross in the linear ordering defined by $\pi$ if $\pi(s) < \pi(u) < \pi(t) < \pi(v)$.¹ We denote the total number of edge crossings in an arrangement $\pi$ as $C_\pi(T)$. Then, an arrangement $\pi$ of a rooted tree $T^r$ is planar if $C_\pi(T^r) = 0$ and is projective if (a) it is planar and (b) the root of the tree is not covered, that is, there is no edge $\{s, t\}$ such that $\pi(s) < \pi(r) < \pi(t)$ or $\pi(t) < \pi(r) < \pi(s)$. Planarity is a relaxation of projectivity where the root can be covered (Kuhlmann and Nivre, 2006; Sleator and Temperley, 1993). Planar arrangements are also known in the literature as one-page book embeddings (Bernhart and Kainen, 1979).

Figure 1: Examples of sentences with their syntactic dependency structures; arc labels indicate dependency distance (in words) between linked words. The rectangles denote the root word in each sentence. a) A projective dependency tree (adapted from Groß and Osborne, 2009). b) Planar (but not projective) syntactic dependency structure (adapted from Groß and Osborne, 2009). c) Non-projective and non-planar syntactic dependency structure (adapted from Nivre, 2009). Sentences appearing in the figure: "A hearing is scheduled on the issue today", "Someone arrived with red hair", "You brought your dog".
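The crossing-based definitions of planarity and projectivity can be checked mechanically. The following is a minimal sketch, assuming an arrangement is stored as a Python dict from vertex to position and edges as pairs of vertices; the function names are illustrative and are not taken from the Linear Arrangement Library.

```python
from itertools import combinations

def crossings(edges, pi):
    """Count the pairs of edges whose endpoint positions interleave in pi."""
    # Represent every edge by the (left, right) positions of its endpoints.
    spans = [tuple(sorted((pi[u], pi[v]))) for u, v in edges]
    count = 0
    for (a, b), (c, d) in combinations(spans, 2):
        if a > c:  # make (a, b) the edge that starts first
            a, b, c, d = c, d, a, b
        if a < c < b < d:
            count += 1
    return count

def is_planar(edges, pi):
    """An arrangement is planar when no two edges cross."""
    return crossings(edges, pi) == 0

def is_projective(edges, pi, root):
    """Projective = planar and no edge covers the root."""
    root_covered = any(
        min(pi[s], pi[t]) < pi[root] < max(pi[s], pi[t]) for s, t in edges
    )
    return is_planar(edges, pi) and not root_covered

# Toy example (hypothetical tree): path 0-1-2 rooted at 0.
path = [(0, 1), (1, 2)]
covered_root = {1: 1, 0: 2, 2: 3}   # edge {1, 2} spans positions 1..3 over the root
print(is_planar(path, covered_root), is_projective(path, covered_root, root=0))
```

In this toy arrangement the root sits under the edge {1, 2}, so the arrangement is planar but not projective, mirroring the situation of Figure 1(b).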
In this article, the main object of study is the expectation of the sum of edge lengths (or syntactic dependency distances) in planar arrangements of free trees. The length of an edge connecting two syntactically related words $u$ and $v$, also known as dependency distance, is usually² defined as the number of intervening words between $u$ and $v$ in the sentence plus 1 (Figure 1). It is defined mathematically as

$$\delta_{uv}(\pi) = |\pi(u) - \pi(v)|.$$

We define the total sum of edge lengths in $\pi$ as

$$D_\pi(T) = \sum_{uv \in E} \delta_{uv}(\pi). \tag{1}$$

Close attention has been paid to this metric in modern linguistic research since its causal relationship with cognitive cost was first put forward, to the best of our knowledge, by Hudson (1995). The main causal argument is that the longer the dependency, the greater the memory burden arising from decay of activation and interference (Hudson, 1995; Liu et al., 2017). A number of studies have exposed the general tendency in languages to reduce $D$, the total sum of edge lengths, a reflection of a potentially universal cognitive force known as the Dependency Distance Minimization principle (DDm) (Ferrer-i-Cancho, 2004; Ferrer-i-Cancho, Gómez-Rodríguez, Esteban, and Alemany-Puig, 2022; Futrell, Mahowald, and Gibson, 2015; Liu, 2008; Liu et al., 2017). As an example of such cognitive cost, consider the sentences in Figures 2(a) and 2(b). It is not surprising that the latter is preferred over the former due to its smaller total sum of edge lengths (Morrill, 2000), the former's being $D = 18$ and the latter's being $D = 12$.
Statistical evidence of the DDm principle has been provided showing that dependency distances are smaller than expected by chance in syntactic dependency treebanks (Ferrer-i-Cancho, 2004; Ferrer-i-Cancho et al., 2022; Futrell et al., 2015; Gildea and Temperley, 2010; Kramer, 2021; Liu, 2008; Liu et al., 2017; Park and Levy, 2009). Typically, the random baseline is defined as a random shuffling of the words of a sentence. To the best of our knowledge, the first known instance of such an approach is due to Ferrer-i-Cancho (2004), who established the DDm principle by comparing the average real $D(T)$ of sentences against its expected value in a uniformly random permutation of their words. More formally, Ferrer-i-Cancho (2004) calculated the expected value of $D(T)$ when the words of the sentence are shuffled uniformly at random (u.a.r.), that is, when all $n!$ permutations are equally likely. This value is denoted here as $E[D(T)]$. Ferrer-i-Cancho (2004) found that

$$E[D(T)] = \frac{n^2 - 1}{3}. \tag{2}$$

In spite of the simplicity of Equation 2, the majority of researchers have used as random baseline the expected sum of edge lengths conditioned to projective arrangements (Futrell et al., 2015; Gildea and Temperley, 2010; Kramer, 2021; Park and Levy, 2009; Temperley, 2008), which we denote here as $E_{pr}[D(T^r)]$. However, this baseline has been computed approximately via random sampling of projective arrangements. For these reasons, a formula to calculate the exact value of $E_{pr}[D(T^r)]$ in linear time was derived by Alemany-Puig and Ferrer-i-Cancho (2022),

$$E_{pr}[D(T^r)] = \frac{1}{6}\left(-1 + \sum_{u \in V} s_r(u)\,(2 d_r(u) + 1)\right), \tag{3}$$

where $s_r(u)$ denotes the size (in vertices) of the subtree of $T^r$ rooted at $u$, and $d_r(u)$ is the out-degree of $u$ in $T^r$.

In spite of its extensive use, the projective random baseline has some limitations. First, the percentage of non-projective sentences in languages ranges between 18.2 and 26.4 (Gómez-Rodríguez, 2016) or between 6.8 and 36.4 (Gómez-Rodríguez and Nivre, 2010) (see also Havelka, 2007). The limited coverage of projectivity raises the question of whether the projective baseline should be used for sentences that are not projective, as is customary in research on dependency distance minimization. In addition, projectivity per se implies a reduction in dependency distances, which raises the question of whether that rather strong constraint may mask the effect of the dependency distance minimization principle under investigation (Gómez-Rodríguez, Christiansen, and Ferrer-i-Cancho, 2022). Here we aim to make a step forward by considering planarity, a generalization of projectivity, so as to increase the coverage of real sentences and reduce the bias towards dependency minimization in the random baseline. The percentage of non-planar sentences in languages ranges between 14.3 and 20.0 (Ferrer-i-Cancho, Gómez-Rodríguez, and Esteban, 2018) or between 5.3 and 31 (Gómez-Rodríguez and Nivre, 2010). The latter range is consistent with earlier estimates (Havelka, 2007).
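The two baselines discussed above can be evaluated exactly on small trees. The sketch below assumes that the unconstrained expectation is $(n^2-1)/3$ and that the projective expectation has the form $\frac{1}{6}\left(-1 + \sum_{u} s_r(u)(2d_r(u)+1)\right)$; the function names and the `children` dictionary (vertex to list of children) are illustrative choices, not the authors' interface.

```python
from fractions import Fraction

def sum_edge_lengths(edges, pi):
    """D for one arrangement pi (dict: vertex -> position)."""
    return sum(abs(pi[u] - pi[v]) for u, v in edges)

def expected_D_unconstrained(n):
    """Expected D over all n! orderings, assumed to be (n^2 - 1)/3."""
    return Fraction(n * n - 1, 3)

def expected_D_projective(children, root):
    """Expected D over projective arrangements of the tree rooted at `root`.

    Assumed formula: (1/6) * (-1 + sum_u s_r(u) * (2 * d_r(u) + 1)).
    """
    # Pre-order traversal: every vertex appears before its descendants.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    size, total = {}, 0
    # Reverse pre-order visits children before their parent.
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children[u])
        total += size[u] * (2 * len(children[u]) + 1)
    return Fraction(total - 1, 6)

# Three-vertex path rooted at one end: 0 -> 1 -> 2.
print(expected_D_projective({0: [1], 1: [2], 2: []}, root=0))
print(expected_D_unconstrained(3))
```

For this path the projective baseline is smaller than the unconstrained one, which illustrates why projectivity by itself already shortens dependencies.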
This article is part of a research program on the statistical properties of $D(T)$ under constraints on the possible linear arrangements (Alemany-Puig, Esteban, and Ferrer-i-Cancho, 2022; Alemany-Puig and Ferrer-i-Cancho, 2022; Ferrer-i-Cancho, 2019). The remainder of the article is divided into two main parts: theory (Section 2) and applications (Section 3).
The theory part (Section 2) is structured as follows. In Section 2.1, we introduce notation used throughout that part. In Section 2.2, we first present a characterization of planar arrangements so as to identify their underlying structure, which we apply to count their number for a given free tree and, later on in Section 2.3, to generate them u.a.r. by means of a novel $O(n)$-time algorithm. In Section 2.4, we use said characterization to prove the main result of the article, namely that the expectation of $D(T)$ in planar arrangements can be calculated from the expectation in projective arrangements, as the following theorem indicates.
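Theorem 1.1. Given a free tree $T = (V, E)$,

$$E_{pl}[D(T)] = \frac{1}{n} \sum_{u \in V} E^{\diamond}_{pr}[D(T^u)] \tag{4}$$

$$\phantom{E_{pl}[D(T)]} = \frac{(n-1)(n-2)}{6n} + \frac{1}{n} \sum_{u \in V} E_{pr}[D(T^u)], \tag{5}$$

where $E^{\diamond}_{pr}[D(T^u)]$ is the expected value of $D(T^u)$ in uniformly random projective arrangements $\pi$ of $T^u$ such that $\pi(u) = 1$, and $E_{pr}[D(T^u)]$ (Equation 3) is the expected value of $D(T^u)$ in uniformly random projective arrangements of $T^u$, the free tree $T$ rooted at $u$.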
Table 1 summarizes the theoretical results obtained in previous articles and those presented in this article. The applications part (Section 3) is structured as follows. In Section 3.1, we apply Theorem 1.1 to derive an $O(n)$-time algorithm to calculate $E_{pl}[D(T)]$. Since Alemany-Puig and Ferrer-i-Cancho (2022) showed that $E_{pr}[D(T^r)]$ can be evaluated in time $O(n)$, Equation 5 naturally leads to an $O(n^2)$-time algorithm if it is evaluated 'as is'. However, we devise an $O(n)$-time algorithm to calculate $E_{pl}[D(T)]$. In Section 3.2, we apply this and previous research on the projective case (Alemany-Puig and Ferrer-i-Cancho, 2022) to a parallel syntactic dependency treebank. We find that the gap between the actual dependency distance and that of the random baseline reduces as the strength of the formal constraint on dependency structures chosen for the random baseline increases, suggesting that formal constraints absorb part of the dependency distance minimization effect.
Finally, in Section 4, we review all the findings and make suggestions for future research. From this point onwards, the article is organized to ease reading by readers of distinct profiles. Readers interested in the analysis of syntactic dependency treebanks can jump directly to Section 3.2. Readers interested in the algorithm for computing $E_{pl}[D(T)]$ can jump directly to Section 3.1, after reading Section 2.1. Readers whose primary interest is applying the algorithms have ready-to-use code: both methods to generate planar arrangements (Section 2.3) and the $O(n)$-time calculation of $E_{pl}[D(T)]$ (Section 3.1) are freely available in the Linear Arrangement Library³ (Alemany-Puig, Esteban, and Ferrer-i-Cancho, 2021).

Definitions and notation
We use $u, v, w, z$ to denote vertices, $r$ to always denote a root vertex, and $i, j, k, p, q$ to denote integers. The edges of a free tree are undirected, and denoted as $\{u, v\} = uv$; those of rooted trees are directed, denoted as $(u, v)$, and oriented away from $r$ towards the leaves.
Let $\Gamma(u)$ denote the set of neighbors of $u \in V$ in the free tree $T$, and let $\Gamma_r(u)$ denote the out-neighbors (also, children) of $u \in V$ in $T^r$. Notice that $\Gamma_r(u) \subseteq \Gamma(u)$, with equality if, and only if, $u = r$. Let $d_r(u) = |\Gamma_r(u)|$ denote the out-degree of vertex $u$ of a rooted tree $T^r$, and let $d(u) = |\Gamma(u)|$ denote the degree of $u$ in a free tree $T$. Notice that $d_r(u) = d(u) - 1$ when $u \neq r$ and $d_r(r) = d(r)$. Furthermore, we denote the subtree rooted at $v$ with respect to root $u$ as $T^u_v$ (obviously $T^r_r = T^r$), and its size as $s_u(v) = |V(T^u_v)|$ (Figure 3). We call this directional size (Alemany-Puig et al., 2022; Hochberg and Stallmann, 2003). Note that $s_u(u) = n$ and that $s_u(v) + s_v(u) = n$ for every edge $uv \in E$.

Table 1: Summary of the main mathematical results for increasing constraints on linear orders. Results for the unconstrained and projective cases are borrowed from previous research (Ferrer-i-Cancho, 2004 and Alemany-Puig and Ferrer-i-Cancho, 2022, respectively). Results for the planar case are a contribution of the present article. $N_{pr}(T^r)$, $N_{pl}(T)$ and $N(T)$ denote the number of distinct projective, planar and unconstrained linear arrangements, respectively, of a rooted tree $T^r$ or of a free tree $T$. $E_{pr}[\delta_{uv}]$, $E_{pl}[\delta_{uv}]$ and $E[\delta_{uv}]$ denote the expected length of an edge in a random linear arrangement for the projective, planar and unconstrained cases, respectively. $E_{pr}[\delta_{uv} \mid s]$ is the expected value of $\delta_{uv}$ conditioned to having vertex $s$ as root of the tree. In $E_{pr}[\delta_{uv}]$ the root is vertex $r$.

As in previous research, we also decompose an edge $(r, u)$ in a projective arrangement $\pi$ into two parts: its anchor and its coanchor, as in Figure 4 (Alemany-Puig and Ferrer-i-Cancho, 2022; Chung, 1984; Shiloach, 1979). Informally, $\alpha_{ru}(\pi)$ is the number of vertices in $\pi$ covered by $(r, u)$ in the segment of $T^r_u$ including vertex $u$ (Figure 4); similarly, $\beta_{ru}(\pi)$ is the number of vertices of $\pi$ covered by $(r, u)$ in segments that fall between $r$ and $u$ (Figure 4). The length of an edge connecting $r$ with $u$ can be expressed with the formula

$$\delta_{ru}(\pi) = \alpha_{ru}(\pi) + \beta_{ru}(\pi),$$

where $\alpha_{ru}(\pi)$ is the length of the anchor and $\beta_{ru}(\pi)$ is the length of the coanchor. The length of the anchor and coanchor can be formally defined as

$$\alpha_{ru}(\pi) = |\pi(u) - \pi(z)| + 1, \qquad \beta_{ru}(\pi) = |\pi(z) - \pi(r)| - 1,$$

where $z$ is the vertex of $T^r_u$ closest to $r$ in $\pi$ (Figure 4). The same notation with $\pi$ omitted, $\alpha_{ru}$ and $\beta_{ru}$, denotes random variables. Furthermore, it will be useful to define the operator $\diamond$, which we use to condition expected values and constrain sets of arrangements of a rooted tree, in both cases to arrangements $\pi$ where (only) the root is fixed at the leftmost position of $\pi$. For instance, if $S$ is a set of arrangements $\pi$ of a rooted tree $T^r$ then $S^{\diamond} = \{\pi \in S : \pi(r) = 1\}$. Finally, in this article we consider that two arrangements $\pi$ and $\pi'$ of the same tree $T$ are different if there is (at least) one vertex $u$ for which $\pi(u) \neq \pi'(u)$.

Counting planar arrangements
It is well known that the number of unconstrained arrangements of an $n$-vertex tree is $n!$. This is true given that arrangements are simply permutations, and unconstrained arrangements are not subject to any particular constraint, thus all vertex orderings are possible. Building on the fact that projective arrangements span over contiguous intervals (Kuhlmann and Nivre, 2006), Alemany-Puig and Ferrer-i-Cancho (2022) studied the expected value of the random variable $D(T^r)$ in such arrangements by defining, as usual, a set of segments $\Phi_u$ associated to each vertex $u$, consisting of the segments associated to the subtrees $T^r_{u_1}, \dots, T^r_{u_p}$ and $u$. A segment of a rooted tree $T^r_u$ is a segment within the linear ordering containing all vertices of $T^r_u$, an interval of length $s_r(u)$ whose starting and ending positions are unknown until the whole tree is fully linearized; thus, a segment is a movable set of vertices within the linear ordering (Alemany-Puig and Ferrer-i-Cancho, 2022). For a vertex $u$, the set $\Phi_u$ is constructed from vertex $u$'s segment and the segments of its children $\Gamma_r(u) = \{u_1, \dots, u_p\}$ (Figure 5). Decomposing every vertex and its segments from the root to the leaves linearizes $T^r$ into a projective arrangement (Figure 5). This characterization led to a straightforward derivation of the total number of projective arrangements of a rooted tree $T^r$ (Table 1),

$$N_{pr}(T^r) = \prod_{u \in V} (d_r(u) + 1)!. \tag{6}$$

Figure 5: a) A rooted tree $T^r$ where $\Gamma(r) = \{r_1, \dots, r_p\}$ are the $p$ children of $r$. The subtree $T^r_{r_1}$ has been circled for clarity. b) An example of a permutation of the segments in $\Phi_r$ associated to the root. c) An example of a permutation of the segments in $\Phi_{r_1}$ associated to $r_1$, the segment at the leftmost position in the example in (b). The dash-dotted edges in (b) and in (c) represent the same edge of the tree. In (b) and (c), respectively, $r$ and $r_1$ are segments of length 1.

Using the structure of segments summarized above, we present a characterization of planar arrangements of free trees which helps to devise a method to generate planar arrangements u.a.r. (Section 2.3.3) and to prove Theorem 1.1 (Section 2.4). To this aim, we define $P^{\diamond}_{pr}(T^r)$ as the set of projective arrangements of a rooted tree $T^r$ such that $\pi(r) = 1$, and denote its size as $N^{\diamond}_{pr}(T^r) = |P^{\diamond}_{pr}(T^r)|$. Notice that when a vertex $u$ is fixed to the leftmost position, the planar arrangements in $P^{\diamond}_{pr}(T^u)$ are obtained by arranging the subtrees $T^u_v$, $v \in \Gamma(u)$, projectively to the right of $u$ in the linear arrangement. It is important to bear in mind that the operator $\diamond$ only fixes the root vertex $r$ to the leftmost position of the arrangement: the other vertices can be placed freely as long as the result is projective.
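Proposition 1. The number of planar arrangements of an $n$-vertex free tree $T = (V, E)$ is

$$N_{pl}(T) = n \prod_{u \in V} d(u)!. \tag{7}$$

Proof. Given a free tree $T$, and any two distinct vertices $u, v$, it holds that $P^{\diamond}_{pr}(T^u) \cap P^{\diamond}_{pr}(T^v) = \emptyset$ because the vertices in the first positions are different. This lets us partition $P_{pl}(T)$ into the non-empty pairwise-disjoint sets $P^{\diamond}_{pr}(T^u)$ and see that

$$N_{pl}(T) = \sum_{u \in V} N^{\diamond}_{pr}(T^u).$$

It is easy to see that

$$N^{\diamond}_{pr}(T^u) = d(u)! \prod_{v \in \Gamma(u)} N_{pr}(T^u_v) = d(u)! \prod_{v \in V \setminus \{u\}} (d_u(v) + 1)! = \prod_{v \in V} d(v)!.$$

We used Equation 6 in the second equality; the third follows from $d_u(v) + 1 = d(v)$ for $v \neq u$. Notice that $N_{pl}(T) = n\, N^{\diamond}_{pr}(T^u)$ for any $u \in V$, since the value $N^{\diamond}_{pr}(T^u)$ does not depend on the root vertex $u$. Therefore, Equation 7 follows immediately.

There are at least as many planar arrangements of a free tree $T$ as projective arrangements of any 'rooting' $T^r$ of $T$, formally $N_{pl}(T) \geq N_{pr}(T^r)$. We can see this by noticing that, when given a 'rooting' of $T$ at $r$,

$$N_{pr}(T^r) = (d(r) + 1) \prod_{u \in V} d(u)! \leq n \prod_{u \in V} d(u)! = N_{pl}(T),$$

with equality when $T$ is a star tree⁴ and $r$ is its vertex of highest degree.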
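The count of planar arrangements can be sanity-checked by exhaustive enumeration on small trees. The sketch below assumes the closed form $N_{pl}(T) = n \prod_{u} d(u)!$ and compares it against a brute-force count; vertices are assumed to be labelled $0, \dots, n-1$ and both function names are illustrative.

```python
from itertools import combinations, permutations
from math import factorial, prod

def count_planar_formula(n, edges):
    """Closed form assumed here: n times the product of the factorials of the degrees."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return n * prod(factorial(d) for d in degree)

def count_planar_brute_force(n, edges):
    """Enumerate all n! orderings and keep those without edge crossings."""
    count = 0
    for position in permutations(range(n)):  # position[u] is the 0-based slot of u
        spans = [tuple(sorted((position[u], position[v]))) for u, v in edges]
        crossing = any(
            a < c < b < d or c < a < d < b
            for (a, b), (c, d) in combinations(spans, 2)
        )
        if not crossing:
            count += 1
    return count

path4 = [(0, 1), (1, 2), (2, 3)]
print(count_planar_formula(4, path4), count_planar_brute_force(4, path4))
```

The brute-force routine takes time proportional to $n!$ and is only meant as a check for trees with a handful of vertices.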

Generating arrangements uniformly at random
Arrangements can be generated freely, that is, by imposing no constraint on the possible orderings, where all the $n!$ possible orderings are equally likely, or by imposing some constraint on the possible orderings. Generating unconstrained arrangements is straightforward: it is well known that a permutation of $n$ elements can be generated u.a.r. in time $O(n)$ (Cormen, Leiserson, Rivest, and Stein, 2001). It can be done as follows.

Assume we are given a set of $n$ vertices, say $V = \{u_1, \dots, u_n\}$, and let $i = 1$. Repeat the following steps $n$ times:
1. Select u.a.r. a vertex from $V$; the vertex is chosen with probability $1/(n - i + 1)$. Let $u_i$ be said vertex.
2. Place $u_i$ in the arrangement at position $i$, that is, let $\pi(u_i) = i$.
3. Remove $u_i$ from $V$ and increment $i$.

The product of all probabilities of vertex choice gives that the probability of producing a certain linear arrangement is $\prod_{i=1}^{n} \frac{1}{n - i + 1} = \frac{1}{n!}$; thus the arrangement is constructed uniformly at random. Since the removal of a vertex from the set and the uniformly random choice of a vertex can both be implemented in constant time (using arrays), the running time is $O(n)$.
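The procedure for unconstrained arrangements can be sketched as follows. This is a minimal illustration that assumes vertices are hashable Python objects; constant-time removal is obtained by swapping the chosen vertex with the last element of an array before popping it. The function name is hypothetical.

```python
import random

def random_arrangement(vertices, rng=random):
    """Return a dict vertex -> position (1..n) chosen uniformly at random."""
    pool = list(vertices)          # vertices not yet placed
    pi = {}
    for i in range(1, len(pool) + 1):
        # Each of the n - i + 1 remaining vertices is chosen with equal probability.
        j = rng.randrange(len(pool))
        # Swap the chosen vertex to the end so that removal takes constant time.
        pool[j], pool[-1] = pool[-1], pool[j]
        pi[pool.pop()] = i
    return pi

print(random_arrangement(["You", "brought", "your", "dog"]))
```

In practice `random.shuffle` provides the same guarantee; the explicit loop is kept here only to mirror the steps one by one.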
When constraints are involved, projectivity is often the preferred choice (Futrell et al., 2015; Gildea and Temperley, 2007; Liu, 2008). First, we present an $O(n)$-time procedure to generate projective arrangements u.a.r. (Section 2.3.1) and review methods used in past research (Section 2.3.2). Then we present a novel $O(n)$-time procedure to generate planar arrangements u.a.r. (Section 2.3.3), which in turn involves the generation of random projective arrangements of a subtree.
Algorithm 2.1. Function Random Projective Arrangement($T^r$). Input: $T^r$, a rooted tree. Output: a projective arrangement $\pi$ of $T^r$ chosen u.a.r.

Algorithm 2.2: Generating projective arrangements u.a.r. of a subtree. Function Random Projective Arrangement Subtree($T^r$, $u$, $p$, $\pi$). Input: $T^r$, a rooted tree; $u$, any vertex of $T^r$; $p$, the starting position to arrange the vertices of $T^r_u$; $\pi$, partially constructed without $T^r_u$. Output: $\pi$, partially constructed with $T^r_u$.

Generating projective arrangements
The method we will present in detail here was outlined first by Futrell et al. (2015). Here we borrow from recent theoretical research summarized above (Alemany-Puig and Ferrer-i-Cancho, 2022) to derive a detailed algorithm to generate projective arrangements and prove its correctness.

In order to generate projective arrangements u.a.r., simply make random permutations of a vertex $u$ and its children $\Gamma_r(u)$, that is, choose one of the possible $(d_r(u) + 1)!$ permutations u.a.r. Algorithm 2.1 formalizes this brief description. The proof that Algorithm 2.1 produces projective arrangements of a rooted tree $T^r$ u.a.r. is simple. The first call takes the root and its dependents and produces a uniformly random permutation with probability $1/(d(r) + 1)!$. Subsequent recursive calls (in Algorithm 2.2) produce the corresponding permutations each with its respective uniform probability, hence the probability of producing a particular permutation is the product of individual probabilities. Using Equation 6, we easily obtain that the probability of producing a certain projective arrangement is

$$\prod_{u \in V} \frac{1}{(d_r(u) + 1)!} = \frac{1}{N_{pr}(T^r)}.$$
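The idea of shuffling every vertex together with the segments of its children can be sketched in a few lines. The code below is an illustrative reading of that description, not a transcription of Algorithms 2.1 and 2.2; it assumes the rooted tree is given as a hypothetical `children` dictionary and replaces recursion with an explicit stack so that deep trees do not hit Python's recursion limit.

```python
import random

def random_projective_arrangement(children, root, rng=random):
    """Sample a projective arrangement (dict vertex -> position in 1..n).

    children: dict mapping every vertex to the list of its children.
    """
    pi = {}
    next_position = 1
    # Stack items are either a single vertex to place or a subtree to expand.
    stack = [("subtree", root)]
    while stack:
        kind, u = stack.pop()
        if kind == "vertex":
            pi[u] = next_position
            next_position += 1
        else:
            # Shuffle u and the segments of its children: (d_r(u) + 1)! outcomes.
            items = [("vertex", u)] + [("subtree", c) for c in children[u]]
            rng.shuffle(items)
            # Push in reverse so that the first item is laid out first.
            stack.extend(reversed(items))
    return pi

tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
print(random_projective_arrangement(tree, root=0))
```

Because a subtree is fully expanded before the next sibling segment is touched, every subtree occupies a contiguous interval, which is the defining property of the segments described earlier.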

Generation of projective arrangements in past research
Algorithm 2.1 is equivalent to the "fully random" method used by Futrell et al. (2015), as witnessed by the implementation of their code available on GitHub,⁵ in particular in file cliqs/mindep.py⁶ (function randlin_projective). Notice that Futrell et al. (2015) outline (though vaguely) that a projective arrangement is generated randomly by "Starting at the root node of a dependency tree, collecting the head word and its dependents and order them randomly". Futrell et al. (2015) present their method to generate random projective arrangements as though it were the same as that by Gildea and Temperley (2007, 2010), who introduced a method to generate random linearizations of a tree which consists of "choosing a random branching direction for each dependent of each head,⁷ and - in the case of multiple dependents on the same side - randomly ordering them in relation to the head" (Gildea and Temperley, 2010). However, Futrell et al. (2015) do not actually implement Gildea & Temperley's method, as witnessed by their code. Critically, Gildea & Temperley's method does not produce uniformly random linearizations, as we show with a counterexample.
Consider a star tree rooted at its hub. Let $X$ be a random variable for the position of the root in a random projective linear arrangement ($1 \leq X \leq n$). We have $P(X = x) = 1/n$ for all $x \in [1, n]$, therefore $X$ follows a uniform distribution and hence $E[X] = (n + 1)/2$ and $V[X] = (n^2 - 1)/12$ (Mitzenmacher and Upfal, 2017). Let $X'$ be a random variable for the position of the root according to Gildea & Temperley's method. It is easy to see that $X' - 1$ follows a binomial distribution with parameters $n - 1$ and $1/2$. Namely, $E[X'] = (n + 1)/2$ and $V[X'] = (n - 1)/4$. Therefore, the variance in a truly uniformly random projective linear arrangement is $\Theta(n^2)$ while Gildea & Temperley's method results in $\Theta(n)$, a much smaller dispersion. As $n \to \infty$, $X' - 1$ converges to a Gaussian distribution.
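The gap between the two dispersions can be made concrete with exact arithmetic. The sketch below builds both distributions of the root position for a star tree, taking as given that the uniform method yields $P(X = x) = 1/n$ and that $X' - 1$ is binomial with parameters $n - 1$ and $1/2$; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def mean_and_variance(probabilities):
    """Exact mean and variance of a finite distribution {value: probability}."""
    mean = sum(x * p for x, p in probabilities.items())
    variance = sum((x - mean) ** 2 * p for x, p in probabilities.items())
    return mean, variance

def root_position_uniform(n):
    """Root position under uniformly random projective arrangements of a star."""
    return {x: Fraction(1, n) for x in range(1, n + 1)}

def root_position_random_branching(n):
    """Root position when each of the n - 1 leaves picks a side with probability 1/2."""
    return {x: Fraction(comb(n - 1, x - 1), 2 ** (n - 1)) for x in range(1, n + 1)}

for n in (9, 33):
    print(n, mean_and_variance(root_position_uniform(n)),
          mean_and_variance(root_position_random_branching(n)))
```

Both distributions share the same mean, so the difference between the baselines lies entirely in how tightly the root is concentrated around the center of the sentence.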
Gildea & Temperley's method was introduced as a random baseline for the distance between syntactically related words in languages and has been used with that purpose (Gildea and Temperley, 2007, 2010; Temperley and Gildea, 2018). Interestingly, the minimum baseline, namely, the minimum sum of dependency distances, results from placing the root at the center (Chung, 1984; Shiloach, 1979). The example above shows that Gildea & Temperley's baseline tends to put the root at the center of the linear arrangement with higher probability than the truly uniform baseline. That behavior casts doubts on the power of that random baseline to investigate dependency distance minimization in languages, since it tends to place the root at the center of the sentence, as expected from an optimal placement under projectivity (Alemany-Puig et al., 2021; Gildea and Temperley, 2007), and does so with much lower dispersion around the center than in truly uniformly random linearizations.

Generating planar arrangements
Proposition 1 leads to a method to generate planar arrangements u.a.r. for any free tree $T$. The method we propose is detailed in Algorithm 2.3.

Algorithm 2.3. Function Random Planar Arrangement($T$). Input: $T$, a free tree. Output: a planar arrangement $\pi$ of $T$ chosen u.a.r.

It is easy to see that Algorithm 2.3 has time complexity $O(n)$. Now we show that it generates planar arrangements uniformly at random. Firstly, choose a vertex, say $u \in V$, u.a.r., and place it at one of the arrangement's ends, say, the leftmost position; this vertex acts as a root for $T$. Secondly, choose u.a.r. one of the $d(u)!$ permutations of the segments of the subtrees $T^u_v$. Lastly, recursively choose u.a.r. a projective linearization of every subtree $T^u_v$ for $v \in \Gamma(u)$ (Algorithm 2.2). These steps generate a planar arrangement u.a.r. since the probability of producing a certain planar arrangement following these steps is, then,

$$\frac{1}{n} \cdot \frac{1}{d(u)!} \prod_{v \in \Gamma(u)} \frac{1}{N_{pr}(T^u_v)} = \frac{1}{n\, N^{\diamond}_{pr}(T^u)} = \frac{1}{N_{pl}(T)}.$$

The equalities follow from Proposition 1.
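The three steps just described (random first vertex, random order of its subtrees, random projective arrangement of each subtree) can be sketched as follows. This is a hedged illustration rather than the published Algorithm 2.3: the free tree is assumed to be given as a hypothetical adjacency dictionary `adj`, and an explicit stack stands in for recursion.

```python
import random

def random_planar_arrangement(adj, rng=random):
    """Sample a planar arrangement (dict vertex -> position in 1..n).

    adj: dict mapping every vertex of a free tree to the list of its neighbours.
    """
    # Step 1: a uniformly random vertex takes the leftmost position.
    first = rng.choice(list(adj))
    pi = {first: 1}
    next_position = 2
    # Step 2: uniformly random order of the subtrees hanging from `first`.
    segments = [("subtree", v, first) for v in adj[first]]
    rng.shuffle(segments)
    stack = list(reversed(segments))
    # Step 3: every subtree is arranged projectively, with its own root placed freely.
    while stack:
        kind, u, parent = stack.pop()
        if kind == "vertex":
            pi[u] = next_position
            next_position += 1
        else:
            items = [("vertex", u, parent)]
            items += [("subtree", c, u) for c in adj[u] if c != parent]
            rng.shuffle(items)
            stack.extend(reversed(items))
    return pi

path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(random_planar_arrangement(path4))
```

Only the first vertex is pinned to position 1; inside each subtree the local root is shuffled together with its children, which matches the remark that the $\diamond$ operator constrains the root alone.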

Expected sum of edge lengths
In this section we derive an arithmetic expression for $E_{pl}[D(T)]$. First, we prove Theorem 1.1. To this aim, we define $E^{\diamond}_{pr}[\alpha_{uv} \mid r]$ as the expected value of $\alpha_{uv}$ conditioned to the projective arrangements $\pi$ of $T^r$ such that $\pi(r) = 1$; we define $E^{\diamond}_{pr}[\beta_{uv} \mid r]$ likewise. The root is specified as a parameter of the expected value because we want to be able to use various roots. In the following proofs we rely heavily on Linearity of Expectation (Mitzenmacher and Upfal, 2017, Theorem 2.1) and the Law of Total Expectation (Mitzenmacher and Upfal, 2017, Lemma 2.5).
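Proof of Theorem 1.1. We first prove Equation 4. By the Law of Total Expectation,

$$E_{pl}[D(T)] = \sum_{u \in V} E_{pl}[D(T) \mid \pi(u) = 1]\; P_{pl}(\pi(u) = 1).$$

Notice, quite simply, that

$$E_{pl}[D(T) \mid \pi(u) = 1] = E^{\diamond}_{pr}[D(T^u)],$$

that is, the expected value of $D$ conditioned to planar arrangements of $T$ such that vertex $u$ is fixed at the leftmost position, $E_{pl}[D(T) \mid \pi(u) = 1]$, is equal to the expected value of $D$ conditioned to projective arrangements of $T^u$ such that vertex $u$ is fixed at the leftmost position, which is denoted as $E^{\diamond}_{pr}[D(T^u)]$. By noticing, given a fixed vertex $u$, that $P_{pl}(\pi(u) = 1) = 1/n$, which is the proportion of planar arrangements of $T$ in which $\pi(u) = 1$ (Proposition 1), Equation 4 follows immediately. Notice that Equation 4 expresses the expected value of $D$ conditioned to planar arrangements of a free tree $T$ as the average of the expected values of $D$ conditioned to projective arrangements of $T^u$ (for all $u \in V$) such that the root is fixed at the leftmost position.

Now we aim to write $E^{\diamond}_{pr}[D(T^u)]$ as a function of $E_{pr}[D(T^u)]$. We start by decomposing $E^{\diamond}_{pr}[D(T^u)]$ into a summation of expected values of the individual edge lengths, and group the edges of every subtree $T^u_v$ of $T^u$ (where $uv$ is a (directed) edge of the tree) into one single expected value for each subtree and leave the edges incident to the root $u$ in the same summation as follows:

$$E^{\diamond}_{pr}[D(T^u)] = \sum_{v \in \Gamma(u)} E^{\diamond}_{pr}[\delta_{uv} \mid u] + \sum_{v \in \Gamma(u)} E_{pr}[D(T^u_v)].$$

It is important to notice that we did not write $E^{\diamond}_{pr}[D(T^u_v)]$ in the summation above since the conditioning imposed by the operator $\diamond$ in $E^{\diamond}_{pr}[D(T^u)]$ only applies to the root $u$. The roots of the subtrees can be placed freely in the arrangement as long as the result is projective. Now we decompose all (directed) edges $uv$ of $T^u$ in the first summation into anchor and coanchor, and we get

$$E^{\diamond}_{pr}[D(T^u)] = \sum_{v \in \Gamma(u)} E^{\diamond}_{pr}[\alpha_{uv} + \beta_{uv} \mid u] + \sum_{v \in \Gamma(u)} E_{pr}[D(T^u_v)].$$

Although the root $u$ is clear in this context, we have made it explicit in $E^{\diamond}_{pr}[\alpha_{uv} + \beta_{uv} \mid u]$ so as to be able to keep track of it in the following derivations. By linearity of expectation,

$$E^{\diamond}_{pr}[\alpha_{uv} + \beta_{uv} \mid u] = E^{\diamond}_{pr}[\alpha_{uv} \mid u] + E^{\diamond}_{pr}[\beta_{uv} \mid u].$$

Now, notice that the length of the anchor of any given directed edge $(u, v)$, where $u$ is the head and $v$ is the dependent, is invariant to the position of $u$, that is, it only changes if we change the position of $v$ within its interval. Therefore, fixing the head to the leftmost position of the arrangement (or any position outside the segment of $v$) does not affect the value of $E^{\diamond}_{pr}[\alpha_{uv} \mid u]$ and we simply have that

$$E^{\diamond}_{pr}[\alpha_{uv} \mid u] = E_{pr}[\alpha_{uv} \mid u].$$

The next step is to find the value of $E^{\diamond}_{pr}[\beta_{uv} \mid u]$. Notice now that the length of the coanchor of any directed edge $(u, v)$ is affected by the position of the head $u$ and, as such, $E^{\diamond}_{pr}[\beta_{uv} \mid u]$ need not be exactly equal to $E_{pr}[\beta_{uv} \mid u]$. The derivation is found in the Appendix since it is merely an adaptation of the proof by Alemany-Puig and Ferrer-i-Cancho (2022, Lemma 1); it gives

$$E^{\diamond}_{pr}[\beta_{uv} \mid u] = \frac{n - s_u(v) - 1}{2} = \frac{3}{2}\, E_{pr}[\beta_{uv} \mid u].$$

Thus,

$$E^{\diamond}_{pr}[D(T^u)] = \sum_{v \in \Gamma(u)} \left( E^{\diamond}_{pr}[\alpha_{uv} \mid u] + E^{\diamond}_{pr}[\beta_{uv} \mid u] \right) + \sum_{v \in \Gamma(u)} E_{pr}[D(T^u_v)]$$

$$= \sum_{v \in \Gamma(u)} \left( E_{pr}[\alpha_{uv} \mid u] + \frac{3}{2} E_{pr}[\beta_{uv} \mid u] \right) + \sum_{v \in \Gamma(u)} E_{pr}[D(T^u_v)]$$

$$= E_{pr}[D(T^u)] + \frac{1}{2} \sum_{v \in \Gamma(u)} E_{pr}[\beta_{uv} \mid u]. \tag{8}$$

In the third equality we have used the identity by Alemany-Puig and Ferrer-i-Cancho (2022, Equation 28), which states that in a rooted tree $T^r$

$$E_{pr}[D(T^r)] = \sum_{v \in \Gamma_r(r)} \left( E_{pr}[\alpha_{rv}] + E_{pr}[\beta_{rv}] + E_{pr}[D(T^r_v)] \right).$$

In this equation, we have not specified the expected values as being conditioned by the root $r$ since this is clear from the context. Plugging Equation 8 into Equation 4 we get

$$E_{pl}[D(T)] = \frac{1}{n} \sum_{u \in V} E_{pr}[D(T^u)] + \frac{1}{2n} \sum_{u \in V} \sum_{v \in \Gamma(u)} E_{pr}[\beta_{uv} \mid u]. \tag{9}$$

We can use the following result by Alemany-Puig and Ferrer-i-Cancho (2022, Equation 16),

$$E_{pr}[\beta_{uv} \mid u] = \frac{n - s_u(v) - 1}{3},$$

to further simplify Equation 9 and, after proving that

$$\sum_{u \in V} \sum_{v \in \Gamma(u)} (n - s_u(v) - 1) = \sum_{u \in V} (n - 1)(d(u) - 1) = (n - 1)(n - 2),$$

we obtain

$$\frac{1}{2n} \sum_{u \in V} \sum_{v \in \Gamma(u)} E_{pr}[\beta_{uv} \mid u] = \frac{(n - 1)(n - 2)}{6n}.$$

Hence Equation 5.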
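For the sake of comprehensiveness, we also provide an arithmetic expression for the expected length of an edge $uv$ of a free tree in uniformly random planar arrangements. To this aim, we further define $E_{pl}[\delta_{uv} \mid r]$ to be the expected value of the length of edge $uv \in E(T)$ when the vertex $r \in V(T)$ is fixed to the leftmost position in planar arrangements of $T$. Similarly, given a rooting of $T$ at $r$, we define $E^{\diamond}_{pr}[\delta_{uv} \mid r]$ to be the expected value of the length of edge $uv \in E(T^r)$ when vertex $r$ acts as the root of the tree and it is fixed to the leftmost position in projective arrangements of $T^r$. The root vertex $r$ may be one of vertices $u$, $v$ or none of the two. In the expected value $E^{\diamond}_{pr}[\delta_{uv} \mid r]$ we assume that the edge $uv$ is directed from $u$ to $v$ in accordance with the orientation defined by the root vertex $r$. Therefore, when $r$ is neither $u$ nor $v$, the vertex of edge $uv$ closest to $r$ is always vertex $u$, and the farthest is always vertex $v$.

Lemma 2.1. Given a free tree $T = (V, E)$, for any $uv \in E$ it holds that

$$E_{pl}[\delta_{uv}] = 1 + \frac{1}{n} \sum_{r \in V \setminus \{u, v\}} E_{pr}[\delta_{uv} \mid r], \tag{11}$$

where (Alemany-Puig and Ferrer-i-Cancho, 2022)

$$E_{pr}[\delta_{uv} \mid r] = \frac{2 s_r(u) + s_r(v) + 1}{6}.$$

Proof. Following the characterization of planar arrangements described in Section 2.2, we have that $P_{pl}(\pi(r) = 1) = 1/n$. Then, applying the Law of Total Expectation,

$$E_{pl}[\delta_{uv}] = \frac{1}{n} \sum_{r \in V} E_{pl}[\delta_{uv} \mid r] = \frac{1}{n} \sum_{r \in V} E^{\diamond}_{pr}[\delta_{uv} \mid r]. \tag{13}$$

When $r \notin \{u, v\}$, the edge $uv$ lies inside a subtree whose vertices are placed freely, so fixing $r$ does not affect its length and

$$E^{\diamond}_{pr}[\delta_{uv} \mid r] = E_{pr}[\delta_{uv} \mid r]. \tag{14}$$

When $r \in \{u, v\}$, by linearity of expectation, $E^{\diamond}_{pr}[\delta_{uv} \mid r] = E^{\diamond}_{pr}[\alpha_{uv} \mid r] + E^{\diamond}_{pr}[\beta_{uv} \mid r]$. By denoting $\bar{r}$ the only vertex in $\{u, v\} \setminus \{r\}$, then

$$E^{\diamond}_{pr}[\alpha_{uv} \mid r] = E_{pr}[\alpha_{uv} \mid r] = \frac{s_r(\bar{r}) + 1}{2}. \tag{15}$$

Equation 15 relies on the fact that in a rooted tree $T^r$, the expected length of the anchor of an edge incident to the root, say $rw \in E(T^r)$, is given by $E_{pr}[\alpha_{rw} \mid r] = (s_r(w) + 1)/2$ (Alemany-Puig and Ferrer-i-Cancho, 2022). An arithmetic expression for $E^{\diamond}_{pr}[\beta_{uv} \mid r]$ can be found by modifying the proof of Alemany-Puig and Ferrer-i-Cancho (2022, Lemma 1). Then, as before, we get (see Appendix)

$$E^{\diamond}_{pr}[\beta_{uv} \mid r] = \frac{n - s_r(\bar{r}) - 1}{2}. \tag{16}$$

Therefore, by adding Equations 15 and 16 we obtain

$$E^{\diamond}_{pr}[\delta_{uv} \mid r] = \frac{n}{2}. \tag{17}$$

Equation 11 follows immediately after inserting Equations 17 and 14 in Equation 13.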

Applications

A linear-time algorithm to compute E pl [D(T )]
Here we consider algorithms of increasing efficiency. First, since E pr [D(T u )] can be calculated in O(n)-time for any n-vertex rooted tree T u (Alemany-Puig and Ferrer-i-Cancho, 2022, Theorem 1), the evaluation 'as is' of Equation 5 leads to an O(n^2)-time algorithm.

Second, we could calculate the value E pr [D(T u )] for all u ∈ V in O(n)-time and O(n)-space with the following procedure:
1. Compute s u (v) for every edge uv ∈ E (Alemany-Puig et al., 2022);
2. Choose an arbitrary vertex w;
3. Compute E pr [D(T w )] (Alemany-Puig and Ferrer-i-Cancho, 2022); and, finally,
4. Perform a Breadth First Search (BFS) traversal of T starting at w. In this traversal, when going from vertex u to vertex v, the value of E pr [D(T v )] is calculated by applying the precomputed value of E pr [D(T u )] and the difference ∆ between the two values.

We can obtain a formula for this difference by manipulating Equation 3. Notice that the value of ∆ can be computed in constant time for any two vertices u and v (here we are interested in the value of ∆ for pairs of adjacent vertices) and, crucially, without knowledge of either E pr [D(T u )] or E pr [D(T v )]. Hence, once E pr [D(T u )] is known, E pr [D(T v )] for any v ∈ Γ(u) can be calculated in constant time.

Third, we propose an alternative that is also O(n)-time yet simpler and faster in practice, based on Proposition 2.
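The first (quadratic) approach can be sketched in Python as follows. The sketch assumes that Equation 3 has the form E pr [D(T r )] = (−1 + Σ v s r (v)(2 d r (v) + 1))/6, where d r (v) is the out-degree of v in T r , and that Equation 5 reads E pl [D(T )] = (n − 1)(n − 2)/(6n) + (1/n) Σ u E pr [D(T u )]. Both forms are assumptions of this sketch and should be checked against the equations themselves; the adjacency-dictionary input and the function names are hypothetical.

```python
from collections import deque
from fractions import Fraction


def expected_D_projective(adj, root):
    """E_pr[D(T^root)] under the ASSUMED form of Equation 3 (exact rational)."""
    # BFS from the root to obtain parents and a top-down vertex order.
    parent = {root: None}
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                queue.append(v)
    # Subtree sizes s_root(v), accumulated bottom-up.
    size = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    total = 0
    for u in adj:
        # Out-degree in the rooted tree: every non-root vertex loses its parent edge.
        out_degree = len(adj[u]) - (0 if u == root else 1)
        total += size[u] * (2 * out_degree + 1)
    return Fraction(total - 1, 6)


def expected_D_planar_quadratic(adj):
    """E_pl[D(T)] by evaluating the ASSUMED form of Equation 5 'as is': O(n^2) time."""
    n = len(adj)
    total = sum(expected_D_projective(adj, u) for u in adj)
    return Fraction((n - 1) * (n - 2), 6 * n) + total / n


# Path on 4 vertices, given as adjacency lists.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(expected_D_planar_quadratic(path4))
```

Exact rational arithmetic is used so that the values can be compared directly against an exhaustive enumeration of planar arrangements on small trees.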
Proposition 2. Given a free tree T = (V, E),
Proof. Here we simplify the summation in Equation 5, which becomes (Alemany-Puig and Ferrer-i-Cancho, 2022) Now we simplify f (T ) by first replacing the term d u (v) by d(v) after the necessary transformations so that we can swap the order of the summations afterwards, that is, with

Figure 6: Proof of Proposition 2. The value s u (v) is the same for all vertices of T v w , denoted as {u 1 , . . ., u k } in the figure and the proof.
In the preceding derivation, the second equality holds due to d u (v) = d(v) − 1 for v ≠ u; in the third and fourth steps, we apply the Handshaking lemma. These lead to It remains to simplify Equations 20 and 21. We start by changing the order of the summations in Equation 20, and continue simplifying the inner summation. Consider a fixed v ∈ V . We have that u∈V s u (v) (1) The summation (1) adds up the size of all subtrees T w v with respect to a 'moving' root u. In the first equality we have simply taken out the case s u (u). To understand the second equality, focus for now on a single subtree T v w such that wv ∈ E. The summation (2) contains summands that correspond to all the vertices in T v w , say vertices u 1 , . . ., u k (assume, w.l.o.g., that w = u k ). These summands are s u1 (v), . . ., s u k (v), which are all equal to s w (v) (Figure 6). Moreover, there are s v (w) vertices in T v w , thus k = s v (w), and this holds for all w ∈ Γ(v), hence the equality. Finally, thanks to the identity We use the result in Equation 23 to simplify Equation 21, By combining Equations 24 and 25 into Equation 22 and, after some effort, we obtain which leads directly to Equation 18.
Lemma 3.1. For any given free tree T , Algorithm 3.1 calculates E pl [D(T )] in time and space O(n).
Proof. The pseudocode to calculate E pl [D(T )] based on Proposition 2 is given in Algorithm 3.1. This algorithm first calculates s u (v) for all edges uv ∈ E of the given tree T in O(n) time using the pseudocode by Alemany-Puig et al. (2022, Algorithm 2.1). Then it uses these values to calculate the sums of s v (u)^2 for every vertex v ∈ V . Such sums are then used to evaluate Equation 18, hence calculating E pl [D(T )].

Algorithm 3.1: Function compute_expected_planar(T ). Input: T , a free tree. Output: E pl [D(T )].
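The steps of the proof can be sketched in Python as follows. This is not the authors' Algorithm 3.1: it is a minimal illustration that assumes Equation 18 takes the form E pl [D(T )] = [(n − 1)(3n^2 + 2n − 2) − Σ v (2d(v) − 1) Σ u∈Γ(v) s v (u)^2]/(6n), which is the closed form one obtains by expanding Equation 5 with the projective formula of Alemany-Puig and Ferrer-i-Cancho (2022) under the usual reading of both. The adjacency-dictionary input and the function name are hypothetical, and the subtree sizes are computed with a plain BFS rather than with the cited Algorithm 2.1.

```python
from fractions import Fraction


def expected_D_planar_linear(adj):
    """E_pl[D(T)] in O(n) time, under the ASSUMED closed form of Equation 18."""
    n = len(adj)
    if n <= 1:
        return Fraction(0)
    # Root the tree at an arbitrary vertex and compute a BFS order.
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for u in order:  # the list grows while we iterate, which yields a BFS
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                order.append(v)
    # below[u] = number of vertices in the subtree of u (w.r.t. the chosen root).
    below = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            below[parent[u]] += below[u]
    # For every vertex v, sum s_v(u)^2 over its neighbours u, weighted by 2 d(v) - 1.
    weighted = 0
    for v in adj:
        squares = 0
        for u in adj[v]:
            # s_v(u): size of the component containing u after removing edge uv.
            s = below[u] if parent[u] == v else n - below[v]
            squares += s * s
        weighted += (2 * len(adj[v]) - 1) * squares
    return Fraction((n - 1) * (3 * n * n + 2 * n - 2) - weighted, 6 * n)


star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(expected_D_planar_linear(star4))
```

Each edge is visited a constant number of times, so both time and space are linear in n. For a star every arrangement is planar, so the result must coincide with the unconstrained expectation (n^2 − 1)/3, which gives a quick sanity check.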

A simple application
Let E ≥1 [D(T )] be the expected value of the sum of edge lengths conditioned to arrangements π such that C π (T ) ≥ 1, that is, arrangements in which the number of edge crossings is at least 1. An immediate consequence of Lemma 3.1 is that E ≥1 [D(T )] can be computed easily, as the following corollary states.
Corollary 3. For any free tree T , E ≥1 [D(T )] can be computed in time and space O(n) thanks to the fact that with
Proof. Due to the Law of Total Expectation, and hence Equation 26. N pl (T ) can be computed in O(n)-time with Equation 6 and E pl [D(T )] can be computed in time and space O(n) (Lemma 3.1). Hence all the components in the r.h.s. of Equation 26 can be computed in time and space O(n).
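The computation behind the corollary can be sketched in Python as follows. The sketch assumes that the Law of Total Expectation yields E ≥1 [D(T )] = (n! E [D(T )] − N pl (T ) E pl [D(T )])/(n! − N pl (T )), that E [D(T )] = (n^2 − 1)/3, and that Equation 6 reads N pl (T ) = n Π v d(v)!. These forms are assumptions of this sketch; the function name and its parameters (the degree sequence and a precomputed value of E pl [D(T )]) are hypothetical.

```python
from fractions import Fraction
from math import factorial


def expected_D_with_crossings(degrees, e_pl):
    """E_{>=1}[D(T)] from the degree sequence and a precomputed E_pl[D(T)].

    Returns None when every arrangement of the tree is planar, in which case
    the conditional expectation is undefined.
    """
    n = len(degrees)
    n_all = factorial(n)
    # ASSUMED form of Equation 6: N_pl(T) = n * prod_v d(v)!
    n_pl = n
    for d in degrees:
        n_pl *= factorial(d)
    if n_pl == n_all:
        return None
    # Expectation over all n! arrangements, ASSUMED to be (n^2 - 1)/3.
    e_all = Fraction(n * n - 1, 3)
    # Law of Total Expectation solved for the non-planar part.
    return (n_all * e_all - n_pl * Fraction(e_pl)) / (n_all - n_pl)


# Path on 4 vertices: degrees 1, 2, 2, 1 and E_pl[D(T)] = 19/4.
print(expected_D_with_crossings([1, 2, 2, 1], Fraction(19, 4)))
```

Python integers have arbitrary precision, so n! poses no overflow problem here, although a practical implementation for long sentences might prefer to work with the ratio N pl (T )/n! directly.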

Real syntactic dependency distances versus random baselines
Evidence that dependency distances are smaller than expected by chance can be obtained using random baselines of varying strength:
• None, E [D(T )], the expectation of D(T ) in unconstrained random linear arrangements (Ferrer-i-Cancho, 2004),
• Planarity, E pl [D(T )], the expectation of D(T ) in planar random linear arrangements (this article),
• Projectivity, E pr [D(T r )], the expectation of D(T ) in projective random linear arrangements (Alemany-Puig and Ferrer-i-Cancho, 2022; Gildea and Temperley, 2007).
This raises the question of what the most appropriate baseline for research on dependency distance minimization would be. E pr [D(T r )] is by far the most widely used random baseline (Futrell et al., 2015; Gildea and Temperley, 2007; Liu, 2008; Park and Levy, 2009). Since planarity is a weaker condition than projectivity, E pl [D(T )] implies a gain in coverage. Accordingly, there are more planar sentences than projective sentences in real texts (Gómez-Rodríguez and Nivre, 2010; Havelka, 2007, Table 1) and also in artificially-generated syntactic dependency structures (Gómez-Rodríguez et al., 2022, Figure 2). However, surprisingly, E pl [D(T )] has never been used in research on the principle of dependency distance minimization. Here we aim to test the hypothesis that formal constraints mask the effects of the principle, a hypothesis that has already been confirmed on artificially-generated syntactic dependency structures (Gómez-Rodríguez et al., 2022).
Since dependency distance naturally grows with sentence length (Ferrer-i-Cancho et al., 2022; Ferrer-i-Cancho and Liu, 2014) and the manifestation of the principle depends on sentence length (the statistical bias towards shorter distances may disappear or become a bias in the opposite direction in short sentences; Ferrer-i-Cancho and Gómez-Rodríguez, 2021; Ferrer-i-Cancho et al., 2022), we compare the actual dependency distances against the values predicted by the baselines in sentences of the same length. Given the natural growth of dependency distance as sentence length increases (Ferrer-i-Cancho et al., 2022; Ferrer-i-Cancho and Liu, 2014), we measure, for each sentence, the average dependency distance, namely ⟨d⟩ = D(T )/(n − 1), instead of the raw total sum D(T ) (a sentence of n vertices has n − 1 syntactic dependencies when the structure is a tree).

Data and methods
As real datasets, we use the Parallel Universal Dependencies 2.6 collection (Zeman, Nivre, Abrams, Ackermann, et al., 2020). To control for annotation style, we consider two versions of the collection: the collection with its original content-head annotation (PUD) and its transformation into Surface-Syntactic Universal Dependencies 2.6 (hereafter PSUD). By doing so, we cover two major competing annotation styles (Gerdes, Guillaume, Kahane, and Perrier, 2018).
We borrow the preprocessing methods from previous research (Ferrer-i-Cancho et al., 2022). The main features of the preprocessing are that nodes that are punctuation marks are removed and that the corpus remains fully parallel after the removal (Ferrer-i-Cancho et al., 2022). The preprocessed data is freely available as ancillary materials on the Linear Arrangement Library website.

With respect to previous accounts (Ferrer-i-Cancho et al., 2018; Gómez-Rodríguez and Nivre, 2010; Havelka, 2007), our collections exhibit some remarkable statistical differences. First, the proportion of projective and planar sentences is higher, especially in PUD, where the proportion of non-projective or non-planar sentences does not exceed 10% in most cases (Tables 2 and 3). This proportion increases in PSUD and, in two exceptional languages, Chinese and Hindi, it becomes larger than 50% (Table 3). Second, the difference between the proportion of non-projective and non-planar sentences is smaller than in previous reports (Gómez-Rodríguez and Nivre, 2010; Havelka, 2007). Having said that, notice that our collections are fully parallel, and special care has been taken to keep annotation consistent across languages.
Given a formal constraint '*' (none, planarity or projectivity) and a sentence length n,
1. We calculate D(T r ) for each T r and also calculate the expected sum of edge lengths under constraint '*' (none, Equation 2; planarity, Equation 5; projectivity, Equation 3).
2. Then, for each sentence, we divide each by n − 1, to produce the mean length of its dependencies, ⟨d * ⟩, and the expected mean length of its dependencies under constraint '*', E [⟨d * ⟩].
3. Finally, we compute the average ⟨d * ⟩ and the average E [⟨d * ⟩] over all sentences of length n satisfying constraint '*'.
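The three steps above can be sketched in Python for the unconstrained case. The sketch assumes that each sentence is given as a list of head positions in the style of the CoNLL-U HEAD column (1-based, with 0 marking the root), and that Equation 2 reads E [D(T )] = (n^2 − 1)/3, so that the expected mean dependency distance is (n + 1)/3. The function names are hypothetical; the planar and projective baselines would replace unconstrained_baseline_mean with the corresponding expectation divided by n − 1.

```python
from collections import defaultdict
from fractions import Fraction


def mean_dependency_distance(heads):
    """<d> = D(T)/(n - 1) for a sentence with n >= 2 words.

    heads[i] is the 1-based position of the head of word i + 1; 0 marks the root.
    """
    n = len(heads)
    total = sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)
    return Fraction(total, n - 1)


def unconstrained_baseline_mean(n):
    """Expected <d> without constraints, ASSUMING E[D(T)] = (n^2 - 1)/3."""
    return Fraction(n + 1, 3)


def average_by_length(sentences):
    """Average <d> over all sentences of the same length n."""
    groups = defaultdict(list)
    for heads in sentences:
        if len(heads) >= 2:  # <d> is undefined for one-word sentences
            groups[len(heads)].append(mean_dependency_distance(heads))
    return {n: sum(values) / len(values) for n, values in groups.items()}


# Two toy three-word sentences: root in the middle, and root first.
toy = [[2, 0, 2], [0, 1, 1]]
print(average_by_length(toy), unconstrained_baseline_mean(3))
```

Restricting the input to the sentences that satisfy a given constraint before calling average_by_length reproduces the per-constraint averages described in the third step.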

Results
Figures 7 and 8 show the scaling of mean dependency distance as a function of sentence length in real sentences and in their corresponding random baselines.Concerning the random baselines (dashed lines), we find that the stronger the formal constraint on syntactic dependency structures the lower the value of the random baseline.
In contrast, the actual mean dependency distance (solid lines) is practically the same independently of the formal constraint (none, planarity and projectivity). This is due to the fact that the proportion of sentences that are lost by imposing some formal constraint is small in the PUD and PSUD collections. The overwhelming majority of sentences are planar and the proportion of planar sentences that are not projective is really small (Tables 2 and 3). Thus, selecting sentences satisfying a certain formal constraint has a negligible impact on the estimation of mean dependency distance.
Concerning the relationship between the actual mean dependency distance and the random baselines, we find that the average ⟨d⟩ is below the average value of the random baselines for sufficiently large n in all languages.The only exception is Turkish, where the actual average ⟨d⟩ is just slightly below the average of the projective baseline (Figures 7 and 8).
These findings are consistent between PUD and PSUD, in spite of their differences in proportions of projective and planar sentences commented above.

Conclusions and future work

Theory
In Section 2.2, we have characterized planar arrangements of a given free tree T using the concept of segment (Alemany-Puig and Ferrer-i-Cancho, 2022). Employing said characterization, we have shown that the number of planar arrangements of a free tree depends on its degree sequence (Proposition 1), in a similar way to the number of projective arrangements of a rooted tree (Alemany-Puig and Ferrer-i-Cancho, 2022). Moreover, we have given a procedure to generate u.a.r. planar arrangements of a given free tree in Section 2.3 (Algorithm 2.3) which can be easily adapted to generate such arrangements exhaustively. Interestingly, our algorithm to generate planar arrangements is based on the generation of projective arrangements of a rooted subtree. For the sake of completeness, we have detailed a procedure to generate u.a.r. projective arrangements of a given rooted tree (Algorithm 2.1).

Applications
The identification of the underlying structure of planar arrangements has led us to derive an arithmetic expression, in Section 2.4, for E pl [D(T )] (Theorem 1.1) from which we devised an O(n)-time algorithm to calculate such value (Proposition 2, Algorithm 3.1).
In Section 3, we have applied the theory developed so far to investigate the effect of formal constraints of increasing strength (none, planarity, projectivity) in a parallel collection and reported two main findings. First, the average dependency distance in real sentences remains practically the same as the strength of the formal constraint increases. We believe that this result stems from the high proportion of planar sentences (and the very low proportion of planar sentences that are not projective) of the PUD collection. Higher proportions of non-planar sentences have been reported in other collections (Gómez-Rodríguez and Ferrer-i-Cancho, 2017). Second, the random baseline tends to have a smaller value under stronger formal constraints. Critically, this phenomenon indicates that the strength of the dependency distance minimization effect depends on the choice of the formal constraint for the random baseline. As these formal constraints may be a side-effect of dependency distance minimization (Ferrer-i-Cancho, 2006; Gómez-Rodríguez et al., 2022; Gómez-Rodríguez and Ferrer-i-Cancho, 2017; Yadav, Husain, and Futrell, 2022), this phenomenon suggests that
1. Formal constraints absorb part of the dependency distance minimization effect.
2. A fairer evaluation of the actual degree of optimization of dependency distances, or a more accurate measurement of the power of the effect of dependency distance minimization, requires considering not only the magnitude of the effect with respect to some random baseline but also the formal constraint, as the latter may hide part of the dependency distance minimization effect.
In past research on syntactic dependency distance minimization, E pr [D(T r )] has been the most widely used random baseline (Futrell et al., 2015; Gildea and Temperley, 2007; Liu, 2008; Park and Levy, 2009). However, projectivity has a lower coverage than planarity in real sentences (Gómez-Rodríguez and Nivre, 2010; Havelka, 2007). Projectivity is at risk of underestimating the strength of the dependency distance minimization principle (Ferrer-i-Cancho, 2004) because of the significant reduction in the value of the random baseline (Figures 7 and 8) or the reduction of the actual dependency distances (Gómez-Rodríguez et al., 2022, Figure 2) that it introduces. Thanks to the research in this article, we have paved the way for replicating past research replacing E pr [D(T r )] with E pl [D(T )].

Future work
Planarity is a relaxation of projectivity but future work should address the problem of the expected value of D(T ) in classes of formal constraints with even more coverage (Ferrer-i-Cancho et al., 2018).A promising step is the investigation of E ≤k [D(T )], the expected value of D(T ) conditioned to arrangements π such that C π (T ) ≤ k, that is, in arrangements such that the number of edge crossings is at most k.Notice that E ≤0 [D(T )] = E pl [D(T )].In real languages, the average number of crossings ranges between 0.40 and 0.62 (Ferrer-i-Cancho et al., 2018), suggesting that E ≤k [D(T )] with k = 1 or a small k would suffice.

A Derivation of E ⋄ pr [β uv | u]
Here we derive the expected length of the coanchor of a (directed) edge uv ∈ E(T u ) in uniformly random projective arrangements of T u conditioned to π(u) = 1. Following Alemany-Puig and Ferrer-i-Cancho (2022), we decompose the length of the coanchor of the (directed) edge uv, β uv , as the sum of the lengths of the segments in-between u and v (Figure 4). Here we use k uv to denote the number of segments in-between u and v, and φ (i) uv to denote the size of the ith segment, yielding (Alemany-Puig and Ferrer-i-Cancho, 2022), By the Law of Total Expectation, we have that where E ⋄ pr [β uv | u, k uv = k] is the expectation of β uv given that u is the root of the tree (fixed at the leftmost position), and that u and v are separated by k segments, and P ⋄ pr (k uv = k | u) is the probability that u and v are separated by k intermediate segments, both in uniformly random projective arrangements π conditioned to π(u) = 1, both conditioned to the root of the tree being vertex u. On the one hand, Notice that this is the same result as that obtained by Alemany-Puig and Ferrer-i-Cancho (2022). Lastly, the proportion of arrangements in which the segment of v is at position k uv + 1 equals (d(u) − 1)!, therefore,

Figure 2: Examples of sentences with their syntactic dependency structures; arc labels indicate dependency distance. The rectangles denote the root word in each sentence. Examples adapted from Morrill (2000). The sums of edge lengths are D = 18 for a) and D = 12 for b).

Figure 7: The scaling of ⟨d⟩, the mean dependency distance of a sentence as a function of sentence length (n) for languages in the PUD collection for formal constraints of increasing strength: none (blue), planarity (green) and projectivity (red). Lines indicate the average value over all sentences of the same length. Solid lines are used for real sentences and dashed lines are used for the corresponding random baseline. Solid lines overlap so much that only one of them can be seen in most cases.

Figure 8: The scaling of ⟨d⟩, the mean dependency distance of a sentence as a function of sentence length (n) for languages in the PSUD collection for formal constraints of increasing strength. Format is the same as in Figure 7. Again, solid lines overlap so much that only one of them can be seen in most cases.

Table 2: Proportion (%) of projective and planar sentences in the PUD collection.

Table 3: Proportion (%) of projective and planar sentences in the PSUD collection.