A cognitively plausible model for grammar induction

abstract

Of course, much disagreement remains about the correct competence theories and the characterization of the range of theories. The characterization of the range of allowable grammars, which can be thought of as a reference machine into which individual grammars are written, is often referred to as Universal Grammar (UG).¹ Starting with UG, the child reaches a particular grammar through exposure to a linguistic environment. As pointed out by Chomsky (1965), this view assigns a central role to learnability in investigating UG: a linguistic theory must specify a range of grammars that can be attained using the cognitive machinery and data available to the child. Moreover, UG can provide an evaluation metric that allows the child to compare potential grammars given the data. In its original formulation, this evaluation metric was stated in terms of simplicity, a notion that, though defined with respect to a concrete UG, is also often seen as a cognition-general (CG) principle.
One might hope, then, that TL theories of competence and CG theories of learning would have a close relationship: that theories of UG would map onto theories of learning through an evaluation metric, and that theories of learning would restrict the choice of UG. In practice, however, the evaluation metric has been largely abandoned, and the two domains have never succeeded in constraining one another. Worse, TL and CG approaches have grown to be considered mutually incompatible. There are various aspects to this ostensible incompatibility, such as whether linguistic knowledge involves structured, rule-like representations, whether probabilities play a role, and so on. Perhaps most fundamental among the perceived differences is how the two approaches view learning. TL, following a more hopeful beginning, has adopted a deeply skeptical stance that rejects the possibility of any meaningful learning and relegates most of the linguistic ability of adults to the innate component, and often to UG itself (that is, to the reference machine). CG, on the other hand, tends to be confident of learning and skeptical of the innate component (and especially of UG). The perceived incompatibility between TL and CG has led over the years to a growing divide between the two disciplines.
1 Elsewhere in the literature, UG is sometimes used to refer to the range of possible grammars (rather than to its intensional characterization as a reference machine), and sometimes it is used to refer to the combination of the range of possible grammars and the learning mechanism. Here UG will refer strictly to the reference machine. The term UG has sometimes been associated with approaches that assume a substantial innate component. Here I will use it neutrally: this paper makes no claims as to the correct theory of UG.
Over the past decade or so, the Bayesian program for cognition and the closely related framework of Minimum Description-Length (MDL) have brought the two disciplines closer by articulating CG views that can integrate probabilistic reasoning with structured, symbolic representations. In the other direction, proposals such as Marcus (2000) and Yang (2004, 2010) offer TL perspectives that connect with CG approaches to learning. But a sizable gap remains: even CG-oriented TL proposals such as those of Marcus and Yang still question the ability of general learning mechanisms to generalize correctly from the data, embracing instead restrictive theories of the innate component; and even TL-oriented CG proposals such as Goldsmith (2001), Dowman (2007), Foraker et al. (2009), and Perfors et al. (2011) still emphasize the power of general-purpose learning mechanisms and question whether the innate component should be quite as rich as TL would have it.
This paper has two goals. First, I wish to explain why the skepticism in both directions is misguided. In particular, I will explain why linguists believe in a complex innate component, including a nontrivial UG, even in the face of powerful statistical learners. I will do this by presenting two kinds of evidence that linguists rely on that have nothing to do with questions of learnability in principle. I will also explain why many cognitive scientists are confident that learning is a real possibility, despite the arguments against learning in the TL literature. My second goal is to offer a TL view that treats the learnable and the innate as mutually supportive rather than conflicting. The ability of CG mechanisms to learn, on this view, is interpreted not as a reason to reduce the innate component (though it will be a reason to bring back into consideration theories that leave much to be learned) but rather as a way to extract nuanced predictions from competing theories of that component.
I start, in Section 2, by reviewing the history of the divide between TL and CG, focusing first on the roots of TL pessimism regarding learning (Section 2.1) and then on CG optimism regarding the same (Section 2.2). In Section 3 I evaluate the two positions and argue that TL was wrong to dismiss learning but right to emphasize potentially restrictive UGs, while CG was right to emphasize learning but wrong to dismiss potentially restrictive UGs (Section 3.1). In Section 3.2 I explain how the co-existence of rich UGs and meaningful learning is not only possible but in fact a good state of affairs, one that allows us to revive the old hope of mutual collaboration from the early days of generative grammar. In Section 3.3 I explain how any fully explicit theory of UG provides us with a CG learner, specifically a Minimum Description-Length (MDL) learner, and that this provides both a starting point for the study of learning and a basis for comparing competing theories of UG. Section 4 illustrates this mapping from UG to MDL learner using a simple UG and a couple of toy examples. Section 5 concludes.

In an influential paper, Gold (1967) introduced a learning paradigm, identification in the limit (iitl), and proved that learning of this kind is impossible even in seemingly simple cases. In iitl, a learner g is presented with a sequence (or text) T of elements from a language L, where L is known to be taken from a set C of candidate languages. After each new element in T is presented, g guesses a language in C. If after a certain point all of g's guesses are the same correct guess (in this case, L), we will say that g has identified L in the limit from T. If g can identify in the limit any $L \in C$ based on any fair text in L (that is, a text in L in which every $w \in L$ appears at some point, and in which nothing appears that is not in L), we will say that g identifies C in the limit. If such a g exists, we will say that C is identifiable in the limit.

Certain simple families of languages are iitl. For example, the set of all finite languages over a finite alphabet $\Sigma$ is iitl: if g guesses at each point the language that is the union of all the elements in T that have been encountered so far, it will always identify the source language in the limit. Similarly, any C that can be written as $\{L_i \mid i \geq 1\}$, where $L_i \subset L_{i+1}$ for all i, is iitl: g can identify C in the limit by always guessing the minimal $L_i$ that contains all the elements in T that have been encountered so far.

Changing these families of languages only slightly makes them not iitl. For example, adding a single infinite language to the set of all finite languages makes the set not iitl. In the second, more general example, adding $L_\infty = \bigcup_i L_i$ to C makes the result (as well as any set that contains it) not iitl. To see why, assume to the contrary that $C' = C \cup \{L_\infty\}$ is iitl. Let g be a learner that identifies $C'$ in the limit. We can construct a text T that starts as a text in $L_1$ up until the first point where g guesses $L_1$ (such a point exists by assumption), continues as a text in $L_2$ up until the first following point where g guesses $L_2$, then continues as a text in $L_3$ until g guesses $L_3$, and so on. The result is a text in $L_\infty$, but g makes infinitely many different guesses and so never converges on a correct answer, contrary to assumption.
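The two positive cases above can be made concrete with a minimal sketch. It assumes strings are Python `str` values and that membership in each language of the nested family is available through a hypothetical oracle `in_language(i, w)`; the function names are illustrative and do not come from Gold (1967). Both learners assume a fair text, so every presented string belongs to the target language.

```python
from typing import Callable, FrozenSet, Iterable, Iterator


def finite_language_learner(text: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """After each element of the text, guess the set of all strings seen so far."""
    seen = set()
    for w in text:
        seen.add(w)
        yield frozenset(seen)


def chain_learner(
    text: Iterable[str], in_language: Callable[[int, str], bool]
) -> Iterator[int]:
    """After each element, guess the least i such that L_i contains every string seen.

    in_language(i, w) is a hypothetical membership oracle for L_i.
    Because L_i is a subset of L_{i+1}, the guessed index never decreases.
    """
    i = 1
    for w in text:
        # Assumes a fair text: w belongs to some L_i, so this loop terminates.
        while not in_language(i, w):
            i += 1
        yield i


if __name__ == "__main__":
    # Finite-language case: the guess stabilizes once every string has appeared.
    print(list(finite_language_learner(["ab", "b", "ab"])))

    # Nested family (an assumption for illustration): L_i = {a^k : 1 <= k <= i}.
    def bounded_as(i: int, w: str) -> bool:
        return set(w) == {"a"} and len(w) <= i

    print(list(chain_learner(["a", "aaa", "aa"], bounded_as)))
```

Neither learner can be extended to the enlarged family containing the infinite union: the construction in the text defeats any learner by switching to a larger language each time a guess is made.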
Gold's setting rules out learning even in intuitively very simple families of languages, like the set of all regular languages.² For theoretical linguists, this has confirmed a growing skepticism (already discussed explicitly in Chomsky 1965, pp. 56-58) about the role of learning in linguistic competence. The skepticism was grounded in a general sense that learning is hard and that the data available to the child are insufficient. Gold's results can be seen as providing formal justification for this skepticism: assuming iitl is an appropriate model for language learning in humans, the set of possible languages must be severely restricted. Osherson et al. (1984) formulate further assumptions about human learning that, if correct, would entail an even more restrictive version of UG in which the task of the learner is reduced to choosing from a finite set of candidate languages. Examples of linguistic approaches that adopt the finite version of UG are the Principles and Parameters framework of Generative Grammar (P&P; Chomsky 1981) and Optimality Theory (OT; Prince and Smolensky 1993).
2 A full characterization of when a family of languages is iitl is provided by Angluin (1980). Algorithms that guarantee iitl for various classes of languages include Angluin (1982), Koshiba et al. (1997), Clark and Eyraud (2007), Heinz (2010), and Yoshinaka (2011). Note that arguments such as Gold's show that, under the relevant assumptions, no learner can succeed. This is a stronger result than showing that a particular learner cannot succeed (such as the problem identified by Braine 1971, Baker 1979, and Dell 1981 for the specific evaluation metric of Chomsky and Halle 1968).
It is worth noting that, while a restricted enough UG addresses the theoretical problem of iitl, even the finite version does not guarantee an easy task in practice, since a finite space can still be dauntingly large. In the P&P framework, for example, there are $2^n$ settings, where n is the number of parameters (on the standard assumption that parameters are binary), and in OT there are $n!$ different constraint rankings, where n is the number of constraints. Noise and cognitive limitations further complicate the task. See Clark and Roberts (1993), Gibson and Wexler (1994), Niyogi and Berwick (1996), and Yang (2002) for attempts to tackle the practical issues of acquisition within P&P and Tesar and Smolensky (1998), Boersma and Hayes (2001), and Magri (2013) for a similar discussion within OT.
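To give a feel for how quickly these finite spaces grow, the following short sketch evaluates the two counts mentioned above: 2 to the power n for n binary parameters, and n factorial for n ranked constraints. The values of n used here are purely illustrative and are not claims about the actual number of parameters or constraints in any linguistic proposal.

```python
from math import factorial


def pp_settings(n_parameters: int) -> int:
    """Number of grammars with n binary parameters (2^n)."""
    return 2 ** n_parameters


def ot_rankings(n_constraints: int) -> int:
    """Number of total rankings of n constraints (n!)."""
    return factorial(n_constraints)


if __name__ == "__main__":
    # Illustrative sizes only.
    for n in (10, 20, 30):
        print(f"n={n}: P&P settings={pp_settings(n)}, OT rankings={ot_rankings(n)}")
```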

2.1.2
Poverty of the stimulus
Much of the disagreement between TL and CG has centered on a form of argument known as the argument from the poverty of the stimulus (POS), involving some property P that humans demonstrate in their language in spite of apparently insufficient support for P in the data.
To cite a well-known (and highly controversial) example, English-speaking children will form a yes/no question by fronting the structurally highest auxiliary rather than the leftmost one, thus forming the yes/no interrogative version of The monkey that is jumping can sing by asking Can the monkey that is jumping sing? rather than *Is the monkey that jumping can sing? (where * marks ungrammaticality). They do so, it appears, despite hearing only simpler yes/no questions such as Is the monkey jumping? (from The monkey is jumping) and Can the monkey sing? (from The monkey can sing), where structurally highest and leftmost amount to the same thing. This has been taken to show that the innate component ensures this choice by making available structure-dependent generalizations but not rules that depend on linear order. See Berwick et al. (2011) and Clark and Lappin (2011), as well as references therein, for discussion.
While the form of POS arguments is clear enough, it is often difficult to establish any particular POS argument for humans in practice, even in a simple case such as the one just mentioned.³ For example, how can we determine just what kind of evidence would suffice to make the relevant choice empirically? Could there be indirect sources of information that would predispose the child against forming ordering-based generalizations? And how sure are we that we know exactly what data the subjects have encountered over those few years prior to the experiment? Some progress has been made on these questions (see Legate and Yang (2002), Lidz et al. (2003), Yang (2010), and Hsu and Chater (2010) for thoughts on quantifying the information available to the child; see Crain and Pietroski (2002) for how POS can be constructed from developmental stages in which children exhibit very specific linguistic knowledge that is incompatible with their ambient language but compatible with other natural languages; and see Wilson (2006) for an experimental paradigm designed to test the child's generalization beyond the data in POS situations), but the core weakness of relying on what we think can be learned and what we think the child hears, two questions that can be prohibitively difficult to answer, remains.⁴

2.1.3
Richness of the stimulus
If children can be shown to systematically not demonstrate a property P in their language despite an adequate amount of evidence supporting P in the input, we can conclude that this failure is due to the innate component. We can term such evidence an argument from the richness of the stimulus (ROS).⁵ For example, Peña et al. (2002) have shown that, while humans are capable of extracting abstract dependencies within words, they fail on this task when it is combined with a segmentation task (a task that subjects perform well on, both on its own and when combined with the task of extracting word-internal dependencies). Similarly, Moreton (2008) has shown that humans are significantly better at learning certain phonological dependencies (specifically, dependencies relating the height of the vowels in two adjacent syllables) than other phonological dependencies (dependencies relating the height of a vowel to the voicedness of the following consonant and dependencies relating the voicedness of consonants in two adjacent syllables), even though the two patterns are equally prominent perceptually and are both abundantly represented in the input. One must ensure, of course, that prior exposure has not biased the subjects against observing the relevant patterns. This, however, is considerably easier in practice than the reverse task, essential to POS, of ensuring that a certain pattern is never attested in the data. And as the above examples show (see Bonatti et al. (2005), Endress et al. (2007), Endress and Mehler (2010), Becker et al. (2011), and Hunter and Lidz (2013), among others, for further evidence of this kind), ROS lends itself to the design of controlled experiments that can inform us about what humans fail to learn.

Typology
Perhaps the most common source for enrichments of the innate component comes from the routine TL task of examining individual languages and comparing the results across a range of languages.If language after language shows the same property P (which can be an absolute universal, such as "Has nouns" or an implicational universal, such as "If demonstratives and adjectives precede the noun, then demonstratives precede adjectives"), we can sometimes conclude that P is due to the innate component.
As usual, caution is needed: for some properties, other sources, such as communication pressure, might be responsible rather than the innate component. Examples include P = "Verbs have a small number of arguments" and P = "Has vowels". More interestingly, P may arise not through any direct benefit to the speakers but as a property that enhances the transmission of language between generations of speakers. See Kirby (2000, 2002), Kirby et al. (2004), and Smith et al. (2003), as well as Niyogi and Berwick (1997, 2009). Less frequently, P can be explained away by appealing to historical accident.⁶
6 Controlling completely for historical accident is quite challenging in practice, but the emergence of the Nicaraguan Sign Language (Senghas et al. 2004) and of the Al-Sayyid Bedouin Sign Language (Sandler et al. 2005) provide an approximation. In non-human species it is sometimes possible to explore typological questions in lab settings that control in full for historical accident, as shown by the work of Feher et al. (2009) on the emergence of typical song patterns in zebra finches over several generations, starting from birds grown in isolation.
But in many cases, P has little if anything to recommend it in terms of communication efficiency and other functionalist criteria. Suppose, to take a syntactic example discovered by Ross (1967), that I heard you say that you saw Max and some lady at the party last night, but I don't know the identity of the lady in question. I could use a roundabout inquiry such as I heard you saw Max and some lady at the party; can you tell me which lady?, or I could use a paraphrase such as Which lady did you see Max with ___ at the party?, where the conjunction and in the original sentence is replaced with the preposition with. But what I cannot do, in English or in any other known language, is use the standard way to form a question and say *Which lady did you see Max and ___ at the party last night?, despite its obvious usefulness for the conversation (P in this case could be "Does not allow a question to target a single conjunct"). To cite a different example, discovered by Horn (1972), no natural language has a connective corresponding to NAND (= not and) or a quantificational determiner corresponding to NALL (= not all), despite the usefulness of these concepts in everyday life (as well as in artificial settings).⁷ In such cases, it seems reasonable to ensure P through the innate component.⁸, ⁹

2.2
CG: Optimism about learning

2.2.1
The probabilistic turn
Other work, both theoretical and experimental, supports a less restrictive view on learning than the TL view. First, as has often been observed, some of Gold's assumptions do not seem to match the situation of the human language learner. In particular, the learner in iitl is expected to guess perfectly based on any fair text in the target language. No provision is made for discounting (or excluding completely) texts that are in some sense deviant, and no guess that is less than perfect counts. In acquisition, on the other hand, it is far from obvious that all sequences of inputs are equally good, and learning may well count as successful even if the child ends up having somewhat different judgments from its parents' about various sentences.¹⁰ Relaxing this requirement, as has been done in the probabilistic settings of Horning (1969) and others, yields notions of learning that are often much more inclusive than iitl. Horning's setting involves the same form of text presentation as Gold's, but the texts are generated by taking independent, identically distributed samples from the strings generated by a probabilistic context-free grammar (PCFG), and the criterion for learning is modified. On these assumptions, the set of languages generated by PCFGs is learnable, even though the set of languages generated by Context-Free Grammars (CFGs) is not iitl.
Horning's results, and those of later probabilistic developments such as Wexler and Culicover (1980), Osherson et al. (1986), Angluin (1988), Kapur (1991), and Chater and Vitányi (2007), can be seen as evidence that a probabilistic approach is both more natural and more successful than iitl.¹¹ Experimental data about specific learning tasks has provided empirical evidence for the role of statistics in learning, as well as further clarification of the requirements for a successful theory of learning in humans. One example is the segmentation experiments of Saffran et al. (1996), who showed that infants can reliably segment an artificially-generated input after a short exposure.¹² Since the only cues for segmentation in these experiments are statistical, we can conclude that a learner must be able to make use of statistical regularities in the input. In addition, these results show that a model for human learning should succeed even with unsegmented input.¹³ Finally, the success of the babies in learning after such a brief exposure provides a preliminary quantitative measure of the performance of the learner. Further evidence that humans are skillful statistical learners comes from Sobel et al. (2004) and Griffiths and Tenenbaum (2006), among others, who demonstrate the sensitivity of humans (both children and adults) to statistical information.
7 See Horn (2011) and Katzir and Singh (2013) for discussion of the general context of this typological fact.
8 Evans and Levinson (2009) and Levinson and Evans (2010) have made the remarkable claim that language universals do not exist. They do not discuss the cases of Ross (1967) and Horn (1972) mentioned above. See the commentaries following Evans and Levinson (2009), as well as Abels and Neeleman (2010), Crain et al. (2010), Reuland and Everaert (2010), Harbour (2011), and Matthewson (2012), among others, for additional problems with Evans and Levinson's claim.
9 The discussion in this subsection is framed as one about absolute properties. See Tily and Jaeger (2011) and Piantadosi and Gibson (2013) for discussion of the challenges of obtaining a large enough sample to establish such universals statistically. In addition to absolute universals, quantitative typological evidence offers a rich source of information for TL, though using this information is still difficult at present. See Sauerland and Bobaljik (2013) for an interesting example.
10 A different aspect of iitl that could be changed with significant consequences for learnability is the assumption that the learner is only exposed to positive evidence. If the learner is exposed both to positive and to negative evidence (for example, as a sequence of strings paired with a grammaticality judgment), many more families of languages become learnable, including families that might be of potential linguistic interest. (Intuitively, the reason negative evidence helps is that it breaks all the subset relations between the languages in C; see Gold 1967 for discussion.) Unfortunately, infants do not seem to have access to anything like systematic negative evidence (Brown and Hanlon 1970; Marcus 1993).
11 Care must be taken, however, in interpreting positive results about such models from the perspective of language acquisition. The original result of Horning (1969) applies to (unambiguous) PCFGs, a class of grammars that is not a realistic model of natural languages. Osherson et al. (1986) prove that a much broader class of languages can be identified with probability one from a similar form of text presentation (that is, through independent identically distributed draws from the language; see Clark 2001 for further extension). However, this result requires knowing the possible distributions. If this assumption is replaced by more realistic requirements, the classes of languages that can be identified become considerably more limited, as shown by Angluin (1988) and Pitt (1989). In fact, if the child is required to perform distribution-free learning with probability one, the classes of languages that are identifiable revert to those that are Gold-identifiable. See Niyogi (2006) and Clark and Lappin (2011) for further discussion.
12 Other examples include the tasks of categorization, the learning of phonotactics, and the induction of grammatical rules.
13 Removing the segmentation marks in the text makes the learning problem harder. For example, the family C = {{a}, {aa}} is trivial to learn from a segmented text but impossible to learn from an unsegmented text. Both Gold and Horning require the input to be segmented.

2.2.2
Task-specific approaches
Experimental results about learning tasks, of the kind mentioned above, have sometimes inspired task-specific (but domain-general) learning models: relatively simple mechanisms, usually sensitive to statistics, that form part of a CG toolkit. For example, the results of Saffran et al., as well as those of subsequent experiments within the paradigm, have been taken to show that humans can employ certain segmentation techniques. One mechanism, based on Harris (1955) and suggested as the mechanism behind the infant segmentation data by Aslin et al. (1998), involves the tracking of transitional probabilities between syllables. Transitions tend to be more restrictive within words than across words, so segmentation can proceed by finding drops in transitional probability. Different task-specific models of segmentation have been offered by Brent and Cartwright (1996), Christiansen et al. (1998), Brent (1999), Mattys et al. (1999), Johnson and Jusczyk (2001), Venkataraman (2001), and Batchelder (2002), among others. Other task-specific (but potentially domain-general) learning mechanisms that have been proposed in the literature include mechanisms for processing identity relations (Endress et al. 2007) and positional relations (Endress and Mehler 2009).¹⁴
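The idea of segmenting at drops in transitional probability can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the proposals cited above: the input is already divided into syllables, probabilities are estimated from the same stream that is being segmented, and a boundary is posited wherever the transitional probability falls below a fixed, hand-chosen threshold (rather than, say, at local minima). The three trisyllabic "words" in the usage example are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def transitional_probabilities(
    syllables: Sequence[str],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(next = b | current = a) from adjacent syllable pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor.
    first_counts = Counter(syllables[:-1])
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}


def segment(syllables: Sequence[str], threshold: float) -> List[str]:
    """Insert a word boundary wherever the transitional probability is below threshold."""
    if not syllables:
        return []
    tp = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp[(a, b)] < threshold:  # low predictability: hypothesize a boundary
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words


if __name__ == "__main__":
    # Made-up words; within-word transitions are fully predictable, while
    # transitions across words are not.
    a, b, c = ["bi", "da", "ku"], ["pa", "do", "ti"], ["go", "la", "bu"]
    stream = a + b + c + a + c + b + a
    print(segment(stream, threshold=0.75))
```

In this toy stream every within-word transition has probability 1 and every across-word transition has probability 0.5, so any threshold between the two recovers the words; real child-directed speech is far less clean.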

Prediction and description length
Another CG approach, one that is radically different from the task-specific approach and the one I will try to support in this paper, is the idea of learning everything at once, with particular learning tasks (such as segmentation, categorization, syntactic learning, and so on) arising as by-products of a very general learning process. Here a principled approach is provided by the theory of prediction developed by Solomonoff (1964).¹⁵ Simplifying, we consider all the different hypotheses about the data, each treated as a computer program that outputs the data, and we evaluate each hypothesis according to its length. The learner bases its guesses about the continuation of the input on a weighted sum of all the hypotheses compatible with the observations so far, with shorter hypotheses receiving higher weights. Recently, this approach has been proposed by Chater and Vitányi (2007) and Hsu et al. (2011) as a useful abstraction (a form of ideal learning) for evaluating certain claims about the learnability of natural language.
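The length-weighted prediction just described can be illustrated with a toy sketch. It is not Solomonoff's construction: the hypothesis set here is a small, hand-picked finite list rather than all programs, each "program" is simply a Python function from a position to a symbol, and the lengths in bits are invented numbers. The only part taken from the description above is the weighting scheme, in which each hypothesis compatible with the observations contributes a weight of 2 to the power of minus its length.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# A hypothesis is (length in bits, program), where program(i) gives the i-th symbol.
Hypothesis = Tuple[int, Callable[[int], str]]


def predict_next(
    observed: Sequence[str], hypotheses: List[Hypothesis]
) -> Dict[str, float]:
    """Return a distribution over the next symbol, weighting each compatible
    hypothesis by 2 ** -length."""
    weights: Dict[str, float] = defaultdict(float)
    for length, program in hypotheses:
        # Keep only hypotheses that reproduce everything observed so far.
        if all(program(i) == s for i, s in enumerate(observed)):
            weights[program(len(observed))] += 2.0 ** -length
    total = sum(weights.values())
    return {symbol: w / total for symbol, w in weights.items()} if total else {}


if __name__ == "__main__":
    # Invented hypotheses and lengths, for illustration only.
    hypotheses = [
        (3, lambda i: "a"),                             # "all a": ruled out by the data
        (5, lambda i: "ab"[i % 2]),                     # "alternate a and b"
        (12, lambda i: "ababb"[i] if i < 5 else "b"),   # "abab, then b forever"
    ]
    print(predict_next("abab", hypotheses))
```

With these made-up lengths, the short alternation hypothesis dominates the prediction even though a longer compatible hypothesis disagrees with it.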
While fully general and mathematically sound, ideal learning as originally formalized is not cognitively plausible, nor is it meant to be.In its pure form, ideal learning is not even computable (though see Solomonoff 2008 for thoughts on how to address this concern).
14 See Endress et al. (2009) and Endress and Bonatti (2013) for further discussion of such mechanisms and qualifications of their generality.
15 Related notions were developed by Kolmogorov (1965) and Chaitin (1966).See Li and Vitányi (1997) for discussion.Learning of this kind is guaranteed to minimize errors in a certain sense, as shown by Solomonoff (1978) and Chater and Vitányi (2007).
Another challenge to making Chater and Vitányi's model cognitively plausible is that it is stated with respect to a very broad UG: in its original form, a Turing-complete UG (which is the source of the noncomputability). If we wish to take into account arguments for a more restrictive innate component, such as the arguments from ROS and from the typology, we should re-state Chater and Vitányi's model in terms of more limited UGs. Restricting the set of hypotheses can both ensure computability and make the model work with linguistically realistic UGs, but the computations required to derive the predictions in a Solomonoff-based ideal learner such as Chater and Vitányi's can still be prohibitively complex.
The approximation to Kolmogorov Complexity known as Minimum Description-Length (MDL; Rissanen 1978) offers a way to overcome the difficulties of ideal learning while maintaining both the weighting of hypotheses according to their length and the idea of general learning, with particular tasks falling out as by-products.¹⁶ In MDL, and in the closely related Bayesian framework, the hypothesis space is restricted, and the search aims at finding a single hypothesis that minimizes the total description length (or, in the Bayesian framework, a hypothesis that maximizes the posterior probability). MDL has been used for grammar induction in the works of Berwick (1982), Rissanen and Ristad (1994), Stolcke (1994), Brent and Cartwright (1996), Chen (1996), Grünwald (1996), de Marcken (1996), Osborne and Briscoe (1997), Brent (1999), Clark (2001), Goldsmith (2001), Onnis et al. (2002), Zuidema (2003), Dowman (2007), Chang (2008), and Rasin and Katzir (2013), among others. In Section 3.3 I will suggest that MDL arises as a natural criterion for the evaluation of grammars given the data (and thus as a natural CG learning mechanism) from the commitment to an explicit UG made in TL.
16 See also the closely related approach known as Minimum Message Length (MML; Wallace and Boulton 1968). An approach related to MDL and MML is the search for a grammar (usually a context-free grammar) that generates the input data as its only possible output. The problem of finding such a grammar, the so-called shortest grammar problem, has its roots in Lempel and Ziv (1976) and has been studied by Nevill-Manning and Witten (1997), Kieffer and Yang (2000), Charikar et al. (2005), and Dębowski (2011), among others.
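The MDL criterion of minimizing total description length can be sketched as a two-part cost: the bits needed to state a hypothesis plus the bits needed to encode the data given that hypothesis. The sketch below is a deliberately crude illustration and not the encoding scheme of any of the works cited above. It assumes that a hypothesis is just a lexicon, that each lexicon entry costs a fixed, hypothetical 5 bits per symbol (plus an end marker), that the data are encoded with empirical unigram probabilities, and that the cost of stating those probabilities is ignored. Candidate tokenizations are supplied by hand rather than searched for.

```python
import math
from collections import Counter
from typing import Iterable, Sequence

BITS_PER_SYMBOL = 5  # hypothetical fixed cost per letter in the lexicon


def grammar_bits(lexicon: Iterable[str]) -> float:
    """Cost of writing down the lexicon: each entry's letters plus an end marker."""
    return sum((len(entry) + 1) * BITS_PER_SYMBOL for entry in lexicon)


def data_bits(tokens: Sequence[str]) -> float:
    """Cost of the corpus given the lexicon, using empirical unigram probabilities."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(-math.log2(counts[t] / total) for t in tokens)


def description_length(tokens: Sequence[str]) -> float:
    """Two-part cost: lexicon size plus encoded data size."""
    return grammar_bits(set(tokens)) + data_bits(tokens)


if __name__ == "__main__":
    corpus = "thedogthecat" * 20
    as_letters = list(corpus)                      # hypothesis 1: lexicon of letters
    as_words = ["the", "dog", "the", "cat"] * 20   # hypothesis 2: lexicon of words
    for name, tokens in (("letters", as_letters), ("words", as_words)):
        print(name, round(description_length(tokens), 1))
```

Under these assumptions the word lexicon yields the shorter total description, which is the sense in which a task such as segmentation can fall out of a general preference for compact descriptions.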

A rich UG and the possibility of learning both exist
As we saw, TL has good reasons to assume a nontrivial UG: while iitl seems inapplicable to the condition of the child, and while POS arguments are susceptible to successful learning models, ROS and typological arguments do not depend on learnability in principle. Indeed, the better the general-purpose mechanisms that one can assume, the more surprising both failures to learn and systematic typological patterns become. At the same time, the CG models of learning are clearly very much an option. None of the arguments against learning in principle holds, and it seems that humans are quite good at learning statistical distributions (as shown by Sobel et al. 2004 and Griffiths and Tenenbaum 2006, among others).
Assuming that (almost) everything is innate or that (almost) everything is learned was perhaps convenient at one point as a working hypothesis: if we already have an elaborate innate component, we might hope that we could do without a sophisticated learning mechanism, and vice versa. But a rich innate component and a powerful CG mechanism are not logically incompatible, and it is worth noting that the state of the art in each project still leaves a significant amount of work for the other. At the very least, then, the two respective research projects should continue to co-exist: TL should keep studying the innate component focusing on ROS and typological evidence, perhaps showing more caution with POS arguments than it did before; and CG should keep studying what humans can learn and how, perhaps showing a better appreciation for the role of innateness in shaping adult linguistic abilities.
But there is also a more interesting option, one that allows a tighter collaboration between the two research projects and that enables discoveries in one to translate into tools for the other. This option, a hope from the early days of generative grammar, was made possible by the advent of the Bayesian program for cognition and of the closely related MDL framework, both of which allow the integration of structured representations and probabilistic reasoning. I will sketch an outline of this option immediately below.

Combining innateness with general learning
Practitioners of TL often find themselves with two different hypotheses, call them $F_1$ and $F_2$, that seem equally capable of explaining the observed linguistic phenomena. $F_1$ and $F_2$ might come from entirely different theoretical frameworks, such as Combinatory Categorial Grammar and Minimalism for syntax or Optimality Theory and SPE for phonology, or they may constitute two refinements of the same broad framework. This has led to what Steedman and Baldridge (2011) have called a crisis in syntactic theory (though a similar problem arises in other subfields of TL, such as phonology and semantics): modern TL proposals are often meaningfully different in their essentials and yet comparably successful in accounting for the linguistic judgments of adult speakers. In order to choose between them, we need to look elsewhere.
One important source of evidence of this kind is the mapping from theories of competence to theories of processing, mediated by the competence hypothesis articulated by Miller and Chomsky (1963) and Chomsky (1965).This mapping has been used to argue for Lexical-Functional Grammar (over transformational grammars) by Bresnan and Kaplan (1982); for the flexible constituents endorsed by categorial grammars (over the rigid constituency of most other formalisms) by Steedman (1989); and for quantifier-raising (over in situ incorporation of quantifiers) by Hackl et al. (2012).I would like to suggest that combining CG with TL might provide another source of evidence of this kind, with a suitable mapping of UGs to CG learners (in Section 3.3 below I will argue that such a mapping is available by default through the principle of MDL).The shape of possible experiments to distinguish between F 1 and F 2 is as follows.Suppose one finds two properties, P 1 and P 2 , that some languages have but some do not -so that learning will be involved -and that can co-exist in the same language.To take a phonological example, P 1 might be that a voiceless consonant like /p/ is aspirated in the beginning of a syllable while a voiced consonant like /b/ is not (as in English: [p h ]at vs. [b]at; note that this is a choice of English: Hindi can aspirate both /p/ and /b/, while French aspirates neither), and P 2 might be that vowels are lengthened before a voiced consonant but not before a voiceless consonant (again as in English: t[a:]b vs. t[a]p; again, this is a choice of English: French, for example, shows no such lengthening).In a syntactic example, P 1 might be that a subject can be dropped (as in Italian, but not in English) and P 2 might be that questions are marked by overt dislocation (again, as in Italian, but not in Japanese).
Given a CG mechanism M that seems cognitively plausible, we can now obtain two combinations, M + F 1 and M + F 2 , and each combination can be run on a realistic corpus of child-directed speech.While F 1 and F 2 might both be capable of representing both P 1 and P 2 , there might be a significant difference in how well the combinations M + F 1 and M + F 2 can learn the two and the order in which they do so.If this is the case, we now have a criterion for choosing between F 1 and F 2 : whichever provides a better match with data from actual child language acquisition will receive support.Since M was proposed as a general-purpose learning mechanism and was not tailor made to handle either F 1 or F 2 , such evidence can be taken seriously.
Experiments of this kind require researchers in each project to pay closer attention to work done in the other project than has usually been the case.Still, I think that they are a more productive -and, given current understanding, a more sensible -direction for future work on language and learning than further attempts to determine whether language is more innate than learned or vice versa.

An argument for MDL
I have tried to show why TL and CG can and should have a much closer relationship than they currently enjoy. In this section I will provide an argument that any explicit theory of UG already comes with the evaluation metric (or objective function) that forms the central component of a CG learner. Specifically, I will show how any explicit theory of UG translates into an MDL evaluation metric that allows the child to compare different possible hypotheses within the hypothesis space defined by UG. If correct, the discussion below points to bare MDL as our starting point in studying learning and as the linguist's M for comparing contenders for the correct theory of UG.
A theory of UG provides a set of possible grammars. Any of these can be the grammar of a competent speaker, who stores that grammar in memory and uses it to obtain an opinion about data. At the very least, then, assuming a theory of UG T with a set of possible grammars 𝒢 commits us to the following assumptions:
1. A competent adult speaker has a grammar, G ∈ 𝒢.
2. G is stored in memory.
3. G is used to parse inputs.
In order to make learning possible, we must allow a learner who currently represents G to also consider at least one other grammar G′ and to switch from G to G′ under certain conditions.¹⁷ Of the very few properties that we can rely on to compare the two grammars in the general case, total storage space is a natural candidate, and one that accords well with the intuition behind MDL, which equates learning with compression. I therefore add the following two assumptions:
4. During language learning, a second grammar, G′ ∈ 𝒢, can be stored in memory and used to parse the input.
5. The memory size used to store G and its parse of the input can be compared to the memory size used to store G′ and its parse of the input.
These assumptions amount to little more than saying that grammars can be used for parsing and that the overall description length of two grammars can be compared. My claim is that these assumptions already provide the language learner with an inherent learning mechanism: given an input D, the language learner searches through the set of possible grammars for the grammar G for which the encoding of G (as defined by T) and of D (using G) is the shortest. By relying only on what the theory of UG under consideration is already committed to, this bare MDL learner offers a natural starting point for the study of learnability: alternatives in which the learner ignores the freely available MDL criterion and relies on some other mechanism instead should only be pursued given evidence that the MDL null hypothesis is incorrect.¹⁸,¹⁹ The argument for bare MDL as the null hypothesis can be taken to support approaches in the literature that use MDL for learning, such as the works mentioned in Section 2.2.3, and in particular works such as de Marcken (1996) and Rasin and Katzir (2013) that use MDL not simply as a convenient heuristic but as the sole principle that maps an explicit UG to an evaluation metric.²⁰ Moreover, as mentioned in the introduction and discussed further in Section 3.2, the generality of the mapping from UGs to learners provides a framework in which theories of UG can be compared with respect to their predictions about learning.
17 Strictly speaking, maintaining more than one grammar is not always necessary. In particular, the learners proposed by Angluin (1982), Koshiba et al. (1997), Clark and Eyraud (2007), and Heinz (2010) all operate by considering just one grammar at a time and updating it as input comes along. All these learners, however, assume elaborate mechanisms for growing a grammar, usually tailor-made for the specific UGs they are designed to handle, that go well beyond the basic commitment to an explicit UG.
18 To date, the literature has provided very little that bears directly on the empirical question of whether children use MDL as a criterion for comparing hypotheses during learning. On the other hand, several works have provided arguments (often in conflicting directions) regarding a possible role for description length more broadly in the learning process. In particular, Feldman (2000), extending the results of Shepard et al. (1961), provides evidence for the cognitive relevance of MDL by showing that description length is correlated with learning difficulty in concept learning (see also Feldman 2006 and Goodman et al. 2008). In the same vein, Moreton and Pater (2012a,b) review the literature on artificial grammar learning in phonology and conclude that description length is a central factor determining learning difficulty in this domain. On the other hand, Kurtz et al. (2013) point to a more nuanced pattern of difficulty in concept learning, and Moreton et al. (2014) provide evidence for correlating difficulty with factors other than description length, both in phonological learning and in concept learning. I will not attempt to relate such results about learning difficulty to the question of what evaluation criterion is used by the learner.
19 Heinz and Idsardi (2013) note a lack of correlation between the complexity of finite-state machines for capturing certain patterns and potentially relevant language classes to which these patterns correspond. Based on this, Heinz and Idsardi suggest that MDL is not an appropriate learning criterion in phonology. Note, however, that the complexity of a grammar is only one part of the MDL criterion: the size of the description of the data given the grammar is just as important as the size of the grammar itself, and without taking it into account it is generally not possible to draw conclusions about the adequacy of the criterion. In addition, Heinz and Idsardi discuss the length of very specific representations (namely, the finite-state machines they use to describe the relevant patterns), and these representations do not correspond to any of the main grammatical formalisms for phonology. Given different representations, grammar size can change. Finally, it is hard to see how the possible correlation of language families with the description length for the best grammar (with or without taking the data into account) is a relevant consideration. The question is whether, given an appropriate representation scheme, the grammar that yields the shortest description in any particular situation is also the one that humans arrive at.
20 For de Marcken (1996) MDL is a substitute for Structural Risk Minimization, but it is still the sole contributor to the actual evaluation metric used by the learner. While de Marcken's focus is different from that of the present work (in particular, his emphasis on a specific representational framework that he develops can obscure the general applicability of MDL as an immediate CG learning criterion for any explicit UG), his work provides a particularly clear example of how pure MDL can fit in with a linguistically motivated UG.

Encoding
To see how the mapping from theories of UG to bare MDL learners works, let us consider a naive theory of UG. This theory, call it T₁, allows any CFG to be represented by listing all the rules in some order, with a category #, which is not one of the terminals or nonterminals in the grammar, serving as a separator. Since T₁ only allows CFGs, it can list each rule unambiguously as the left-hand side followed by the list of the categories on the right-hand side.²¹ T₁ marks the end of the grammar with an additional separator. For example, the grammar below will be listed as ABA#ABC#A#BCD#…#EFG##:
A → B A
A → B C
A → ε
B → C D
…
E → F G
21 This particular choice of encoding individual rules would change in extensions of the learner beyond CFG, but the general point will not be affected. As long as the grammar can be stored and used for parsing, it can be encoded, and the encoding can be used in an MDL learner.
We still need to specify how T₁ encodes the categories in the list. Sticking to simple-minded (and deliberately suboptimal) choices, we will use a fixed code-length scheme for the different categories, where each category will be encoded using k = ⌈lg(|Categories| + 1)⌉ bits:

[Table of category codes; only the entry G = 111 is recoverable.]
The number of bits per category, k, will have to be represented as well. We can do this by starting the code with a sequence of k 0's followed by a single 1, and by agreeing to treat 0ᵏ (a sequence of k 0's) as #. Encoding the grammar above, then, will be 000…
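The listing and the fixed-length scheme described above can be illustrated with a minimal Python sketch. It is not the implementation used for the simulations: the function names are hypothetical, the toy grammar contains only the four rules that are spelled out in the listing string (the elided rules are left out), and the assignment of binary codes to categories in alphabetical order is an assumption made for the example.

```python
from math import ceil, log2

SEP = "#"  # separator category; assumed not to occur in the grammar itself


def listing(rules):
    """List each rule as its left-hand side followed by its right-hand side,
    closing every rule with '#' and the whole grammar with one more '#'."""
    symbols = []
    for lhs, rhs in rules:
        symbols.extend([lhs, *rhs, SEP])
    symbols.append(SEP)
    return symbols


def encode_grammar(rules):
    """Return the bit string for a grammar under the fixed-length scheme."""
    symbols = listing(rules)
    categories = sorted(set(symbols) - {SEP})
    k = ceil(log2(len(categories) + 1))  # bits per category, '#' included
    # '#' receives the all-zero code; the other categories receive 1, 2, ...
    # in binary. The alphabetical order is an assumption of this sketch.
    code = {SEP: "0" * k}
    for i, cat in enumerate(categories, start=1):
        code[cat] = format(i, f"0{k}b")
    header = "0" * k + "1"  # announces k: k zeros followed by a single 1
    return header + "".join(code[s] for s in symbols)


# Toy grammar (hypothetical): A -> B A, A -> B C, A -> epsilon, B -> C D
toy = [("A", ["B", "A"]), ("A", ["B", "C"]), ("A", []), ("B", ["C", "D"])]
print("".join(listing(toy)))     # the listing string
print(len(encode_grammar(toy)))  # total number of bits, header included
```

With four categories plus the separator, k = 3, so the toy grammar costs a 4-bit header plus 3 bits for each of the 15 listed symbols.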

Search
Using the UG specified above as T₁, we can now take some input D and search for the grammar that minimizes the total description length of G and of the encoding of D given G. Any grammar G₀ that parses the input can serve as an initial hypothesis for the search. Moreover, G₀ provides a trivial upper bound on the size of the search, since the total description length provided by the target grammar is at most as large as that provided by G₀. For T₁, there is a very simple grammar that is guaranteed to parse D and can serve as G₀. This grammar is what I will refer to as the concatenation grammar for Σ, where Σ is the alphabet in which D is written. If Σ = {σ₁, …, σₙ}, the concatenation grammar for Σ is defined as follows:
The concatenation grammar for Σ makes all texts of a certain length written in Σ equally easy to describe. It treats all symbols in all positions in D as equally good and therefore fails to capture any regularity other than the alphabet in which D is written. Consequently, it is only a good hypothesis for a random or near-random text. However, since it parses D it can serve as an initial hypothesis, and it provides an initial upper bound on the total description length using the target grammar.
Still, the bound provided by the concatenation grammar is huge, ruling out an exhaustive search. A greedy search is not likely to succeed, due to various local optima along the way. To address this problem, the search in the simulations below relies on Simulated Annealing (SA; Kirkpatrick et al. 1983), though I wish to emphasize that I am not trying to model the search procedure in humans, and my only claims concern the definition of the objective function, stated in terms of total description length. Indeed, it is quite possible that, even if they use the MDL criterion, humans will turn out to be incapable of exploring the search space effectively. If that is the case, the search component could make the learner, and with it the entire innate component, considerably more restrictive than suggested by the representational abilities of UG and by the MDL criterion.²²
SA proceeds by comparing a current hypothesis to one of its neighbors, chosen at random, in terms of goodness, which in the present case is the total description length. That is, when a current hypothesis G is compared to a neighbor G′ whose total description length is shorter, the search switches to G′. Otherwise, the choice of whether to switch to G′ is made probabilistically and depends both on how much worse G′ is and on a temperature parameter. The higher the temperature, the more likely the search is to switch from G to its bad neighbor G′. Similarly, the closer G and G′ are in terms of overall description length, the more likely the search is to switch to G′. The temperature is initially set to a relatively high value, and it is gradually lowered as the search progresses, making the search increasingly greedy. The search ends when the temperature descends below a certain threshold.
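The search just described can be sketched in a few lines of Python. The text does not give the exact acceptance probability or cooling schedule, so the sketch assumes the standard Metropolis rule exp(-Δ/temperature) and a geometric cooling schedule; the default values for the cooling rate and the stopping threshold, as well as the function names, are hypothetical.

```python
import math
import random


def accept(current_len, neighbor_len, temperature, rng=random):
    """Decide whether to switch to the neighbor.
    A neighbor that is not longer is always accepted; a longer one is accepted
    with probability exp(-delta / temperature) (assumed Metropolis rule)."""
    delta = neighbor_len - current_len
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(initial, energy, neighbor, t0=15.0, cooling=0.999,
           threshold=0.01, rng=random):
    """Run simulated annealing and return the final hypothesis and its energy.
    energy(h) is the total description length of hypothesis h;
    neighbor(h, rng) returns a random neighbor of h."""
    current, current_len = initial, energy(initial)
    temperature = t0
    while temperature > threshold:  # stop once the temperature is low enough
        candidate = neighbor(current, rng)
        candidate_len = energy(candidate)
        if accept(current_len, candidate_len, temperature, rng):
            current, current_len = candidate, candidate_len
        temperature *= cooling  # gradual (geometric) cooling, an assumption
    return current, current_len
```

Because the acceptance probability for a worse neighbor shrinks both with the size of the difference and with falling temperature, the search behaves more and more like a greedy search as it proceeds.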
For any grammar G, the neighbor grammar G′ is generated as a variant of G in which one of the changes in the following list occurs:
1. An element, possibly a new nonterminal, is added to one of the rules.
2. An element is deleted from one of the rules.
3. A new rule of the form X → ε is created for some category X.
4. A nonterminal in the right-hand side of a rule is replaced with its expansion according to some rule in the grammar.
5. A nonterminal X in the right-hand side of a rule is replaced with a new nonterminal Y, and a unit rule Y → X is added to the grammar.
The modification is chosen according to a uniform distribution over possible changes.All decisions in a given modification are made randomly as well (category for insertion, positions for insertion or deletion, etc.).
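The five kinds of change can be illustrated with the following Python sketch. It rests on several assumptions that are not in the text: grammars are represented as lists of (left-hand side, right-hand side tuple) pairs, additions and deletions only touch right-hand sides, the category X in change 3 is an existing or fresh nonterminal, fresh nonterminals are named N0, N1, and so on, and a change that cannot apply to the current grammar leaves it unchanged.

```python
import random


def fresh_nonterminal(rules):
    """Return a nonterminal name not yet used in the grammar (hypothetical naming)."""
    used = {lhs for lhs, _ in rules} | {s for _, rhs in rules for s in rhs}
    i = 0
    while f"N{i}" in used:
        i += 1
    return f"N{i}"


def neighbor(rules, terminals, rng=random):
    """Return a copy of `rules` with one randomly chosen modification applied."""
    rules = list(rules)  # the input list itself is left untouched
    nonterminals = sorted({lhs for lhs, _ in rules})
    # right-hand-side positions that hold a nonterminal (for changes 4 and 5)
    nt_sites = [(i, j) for i, (_, rhs) in enumerate(rules)
                for j, s in enumerate(rhs) if s in nonterminals]
    change = rng.randrange(5)  # uniform choice among the five kinds of change

    if change == 0 and rules:  # 1. add an element, possibly a new nonterminal
        i = rng.randrange(len(rules))
        lhs, rhs = rules[i]
        pool = nonterminals + sorted(terminals) + [fresh_nonterminal(rules)]
        j = rng.randrange(len(rhs) + 1)
        rules[i] = (lhs, rhs[:j] + (rng.choice(pool),) + rhs[j:])
    elif change == 1:  # 2. delete an element
        sites = [(i, j) for i, (_, rhs) in enumerate(rules)
                 for j in range(len(rhs))]
        if sites:
            i, j = rng.choice(sites)
            lhs, rhs = rules[i]
            rules[i] = (lhs, rhs[:j] + rhs[j + 1:])
    elif change == 2:  # 3. create a rule X -> epsilon
        rules.append((rng.choice(nonterminals + [fresh_nonterminal(rules)]), ()))
    elif change == 3 and nt_sites:  # 4. replace a nonterminal with an expansion
        i, j = rng.choice(nt_sites)
        lhs, rhs = rules[i]
        expansion = rng.choice([r for left, r in rules if left == rhs[j]])
        rules[i] = (lhs, rhs[:j] + expansion + rhs[j + 1:])
    elif change == 4 and nt_sites:  # 5. replace X with new Y and add Y -> X
        i, j = rng.choice(nt_sites)
        lhs, rhs = rules[i]
        new = fresh_nonterminal(rules)
        rules[i] = (lhs, rhs[:j] + (new,) + rhs[j + 1:])
        rules.append((new, (rhs[j],)))
    return rules
```

A function of this shape can be passed, with the terminals fixed, as the neighbor generator of any annealing loop that takes a hypothesis and a random number generator.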

Results
In Section 4.1 above we saw the specification of T₁, a simple-minded CFG UG, and in Section 4.2 we saw the details of a search procedure that turns the MDL evaluation metric induced by T₁ into a learner. In this section we will see the results of running this learner on two extremely simple data sets: one that is the concatenation of words from an artificial lexicon and another that involves palindromes. Both tasks are loosely based on patterns that arise in natural language. The concatenation data set requires that the learner address the challenge of segmenting the input, a challenge solved by human learners, who are exposed to inputs that are for the most part unsegmented. The palindrome data set requires that the learner address the challenge of acquiring center embedding, a common pattern in natural languages. Despite this loose correspondence with natural language, the goal of the present section is not the realistic modeling of learning in humans (both T₁ and the data sets are far too simplistic to be informative in this respect) but rather to show how a bare MDL learner induced by an explicit UG operates, and how the representational abilities of the UG in question guide the search for the best hypothesis given the data.

Segmentation
The first data set is based on the one described by Saffran et al. (1996). In Saffran et al.'s experiment, a text was generated by the random concatenation of elements from an artificial vocabulary consisting of the items pabiku, golatu, daropi, tibudo. This text was turned into speech using a synthesizer that produced a stream of speech with flat intonation and no word breaks. Eight-month-old infants were exposed to this stream, and after two minutes (= 180 words = 1080 segments) they were able to distinguish between words (e.g. pabiku) and non-word sequences that appear in the text (e.g. bikuda).²³
Here are sample snapshots from the learning process using an input that is only 300 segments long (compared to 1080 in the original experiment), using an initial temperature of 15 and a maximum grammar-length of 200 bits. The first step, as explained above, is a concatenation grammar, which captures no regularities:²⁴
Grammar length: 126, Encoding length: 1200, Energy: 1326.0
After a thousand steps, we already have ro from daropi, la and go from golatu, and ku from pabiku:
Grammar length: 192, Encoding length: 1023, Energy: 1215.0
As we proceed, more and more parts of the underlying vocabulary are discovered. Here, at the final step, we have all the words:
Grammar length: 97, Encoding length: 100, Energy: 197.0
The results presented above show rules that correspond straightforwardly to the lexicon that was used to generate the input and thus reflect the correct segmentation of the input, based on its statistical regularities. Crucially, though, the theory of UG presented as T₁ in Section 4.1 is not aware of the tasks of segmentation and lexicon induction, and it does not represent probabilities in its rules. Consequently, the bare MDL learner for T₁ is not aware of these notions either. It arrives at the correct segmentation as a by-product of its general search for the best grammar given the input.
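An unsegmented input of the kind used in this simulation can be generated with the short Python sketch below. The function names are hypothetical, each letter is taken to stand for one segment, and words are drawn uniformly at random, which is an assumption about what "random concatenation" amounts to. The helper flat_cost is only a back-of-the-envelope calculation: a fixed-length code over the 12 distinct segments costs 4 bits per segment, which agrees with the encoding length of 1200 reported for the initial hypothesis, though the sketch does not reproduce the learner's own encoding.

```python
import random
from math import ceil, log2

LEXICON = ["pabiku", "golatu", "daropi", "tibudo"]  # vocabulary from the text


def make_stream(n_segments, rng=random):
    """Concatenate randomly chosen words, with no word breaks, until the
    stream holds n_segments segments."""
    words = []
    while 6 * len(words) < n_segments:  # every word has six segments
        words.append(rng.choice(LEXICON))
    return "".join(words)[:n_segments]


def flat_cost(n_segments, alphabet_size):
    """Bits needed when every segment gets the same fixed-length code."""
    return n_segments * ceil(log2(alphabet_size))


stream = make_stream(300, random.Random(0))
alphabet = set("".join(LEXICON))
print(stream[:30])
print(len(alphabet), flat_cost(len(stream), len(alphabet)))
```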

Palindromes
For our second simulation, along the lines of Horning's paradigm, we will use an input that exhibits nested dependencies. Such dependencies are common in natural language: they are present in the nesting of object-extracted relative clauses in English, for example, as well as in the basic structure of verb-argument dependencies in German clauses. It has been suggested by Fitch and Hauser (2004) that humans acquire such patterns in experiments of artificial-language learning, though the experiment and the claim remain controversial (see Perruchet and Rey 2005, among others).²⁵
In the nesting data set I will use a segmented input. We can specify the learner's goal when presented with a segmented input sequence to be the minimization of the sum of the grammar length and the sum of the encoding lengths for each element in the sequence.²⁶ At least in simple cases, the learner successfully identifies the generating grammar from an input presented in this way. Following are several snapshots from a run on an input that consists of 200 even-length palindromes over the alphabet Σ = {a, b, c} (the sequence reported here starts as cccabaccabaccc, cbbc, bccccccb, aa, aabbaa, …; for performance purposes, the learner cannot see past the first 25 characters of each element in the sequence):
G₄₂₀₀ is already the correct grammar (371 is the arbitrary category label of what would usually be written as S). Similar results were obtained with other simple CFGs, such as aⁿbⁿ.

5 Discussion
I set out to bring TL theories of UG and CG theories of learning into closer contact. I reviewed some of the central arguments within each discipline for and against rich UGs and for and against learning, concluding that linguists' notions of rich UGs are well-founded, but that cognition-general learning approaches are viable as well. Contrary to what is often suggested in the literature, I argued that the two can and should co-exist and support each other. Specifically, I used the observation that any theory of UG provides a learning criterion (the total memory space used to store a grammar and its encoding of the input) that supports an MDL evaluation metric that can serve as the central component of a CG learner. This mapping from theories of UG to learners maintains a minimal ontological commitment: the learner for a particular theory of UG uses only what that theory already requires to account for linguistic competence in adults. I suggested that such learners should be our null hypothesis regarding the child's learning mechanism, and that furthermore, the mapping from theories of UG to learners provides a framework for comparing theories of UG.

The length of encoding G will be |G| ≈ k · [∑_{r∈G} |r| + 1]. As for determining the encoding of the data, D, given G, T₁ first groups rules by their left-hand side, and then enumerates the expansions. Suppose G provides the following parse for D: T = [A [B …] [C …]]. T₁ encodes this parse by traversing the tree in preorder, concatenating the code for each expansion choice given the left-hand side: C(T) = C(A) C(A → BC | A) C(… | B) … C(… | C) … In cases of ambiguity, T₁ takes the shortest encoding.
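The preorder encoding of a parse can be sketched in Python as follows. The sketch makes assumptions that go beyond the text: each expansion choice is written with a fixed-length code of ⌈lg n⌉ bits, where n is the number of expansions listed for the left-hand side (and no bits when there is only one), the code C(A) for the root category is omitted, and the tree representation and function names are hypothetical.

```python
from math import ceil, log2


def choice_bits(n_options):
    """Bits for a fixed-length code over n_options alternatives (0 if only one)."""
    return ceil(log2(n_options)) if n_options > 1 else 0


def encode_parse(tree, expansions):
    """Encode a parse by preorder traversal.
    tree: (category, index of the chosen expansion, list of subtrees);
          terminals are plain strings and contribute no bits of their own.
    expansions: dict from each nonterminal to its list of right-hand sides."""
    if isinstance(tree, str):
        return ""
    category, choice, children = tree
    width = choice_bits(len(expansions[category]))
    bits = format(choice, f"0{width}b") if width else ""
    # the node's own choice comes first, then its subtrees from left to right
    return bits + "".join(encode_parse(child, expansions) for child in children)


# Hypothetical palindrome grammar: S -> a S a | b S b | epsilon
expansions = {"S": [("a", "S", "a"), ("b", "S", "b"), ()]}
# Parse of "abba": S -> a S a, then S -> b S b, then S -> epsilon
parse = ("S", 0, ["a", ("S", 1, ["b", ("S", 2, []), "b"]), "a"])
print(encode_parse(parse, expansions))
```

With three expansions for S, each choice costs 2 bits, so the three choices in this parse of abba take 6 bits in total.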