Modelling a subregular bias in phonological learning with Recurrent Neural Networks

ABSTRACT


INTRODUCTION
Formal Language Theory (FLT; Chomsky 1956) describes how complex a pattern is in terms of the computational machinery needed to represent it. The framework was originally designed to demonstrate that natural language syntax was more complex than the set of Regular patterns (i.e., those that could be represented using finite state machines). However, Johnson (1972) showed that all known phonological mappings could be considered, at most, Regular (see also Kaplan and Kay 1994). Recent work has supported this finding, arguing that phonological learning must be categorically limited to patterns that can be characterized as Subregular (i.e., belonging to specific classes of patterns that can be represented with less expressive power than that of a finite state machine; Heinz 2010; Heinz and Idsardi 2011). One piece of evidence for this hypothesis is a series of experimental results that show humans being biased against learning certain patterns that seem to be too complex according to FLT-based metrics (Finley and Badecker 2008; Lai 2015; Finley 2017; Avcu 2018).
For example, Finley and Badecker (2008) showed that their participants were biased against learning Majority Rule Harmony (also known as Majority Rules; Lombardi 1999; Bakovic 2000), an unattested phonological process that is more complex than the set of Regular mappings. Later experimental work went on to show that people were also biased against learning some Subregular patterns (Lai 2015; Avcu 2018; McMullin and Hansson 2019), providing evidence that the phonological grammar might be limited to even simpler levels of the FLT hierarchy, such as those that can be characterized as Strictly Local and Tier-based Strictly Local (TSL). The former level of complexity includes any pattern that bans a finite set of substrings from occurring in a word, while the latter does so over a tier of segments (i.e., certain segments can be ignored by the pattern).
An example of a Strictly Local pattern that commonly occurs in natural language is the restriction banning voiceless sounds after nasals (henceforth *NC̥; Pater 1999). This pattern is Strictly Local since it bans any word containing one of the finite set of strings that result from combining all nasals with all voiceless sounds (e.g. [nt], [np], [mt], [mp], etc.). TSL patterns are also common in phonology and are typically called harmony (see Rose and Walker 2011 for an overview), since many of them cause a subset of segments in a word to agree in their value for some feature. For example, Navajo contains a harmony pattern in which the sibilants in a word must agree in their value for the feature [anterior] (Sapir and Hoijer 1967). This means that on the sibilant tier, the strings [sʃ] and [ʃs] are banned, since [s] is [+anterior] but [ʃ] is [−anterior]. Any sounds that are not sibilants are irrelevant to the pattern. A word like *[saʃ] would not be allowed, since its sibilant tier would exclude [a] and only include the banned sequence *[sʃ]. Figure 1 shows the full Subregular Hierarchy and where each of these two types of patterns is located in it.
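The two definitions can be made concrete with a minimal sketch. It assumes that each segment is written as a single character and that banned substrings are listed explicitly; the function names and the toy nasal and stop inventories are illustrative and do not come from the paper.

```python
from typing import Iterable, Set


def violates_strictly_local(word: str, banned: Set[str]) -> bool:
    """Return True if any banned substring occurs contiguously in the word."""
    return any(b in word for b in banned)


def project_tier(word: str, tier: Iterable[str]) -> str:
    """Keep only the segments that belong to the tier, preserving their order."""
    tier_set = set(tier)
    return "".join(seg for seg in word if seg in tier_set)


def violates_tsl(word: str, tier: Iterable[str], banned: Set[str]) -> bool:
    """Apply the Strictly Local check to the tier projection of the word."""
    return violates_strictly_local(project_tier(word, tier), banned)


# A *NC̥-style restriction over a toy inventory (hypothetical choice of segments).
NASAL_VOICELESS = {n + c for n in "nm" for c in "ptk"}

# Sibilant harmony: [s] and [ʃ] may not be adjacent on the sibilant tier.
SIBILANTS = "sʃ"
DISHARMONIC = {"sʃ", "ʃs"}
```

Under these assumptions, `violates_tsl("saʃ", SIBILANTS, DISHARMONIC)` holds because the projection of the word onto the sibilant tier is the banned string, whereas the same word passes the plain Strictly Local check because [s] and [ʃ] are not adjacent in the full string.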
While a considerable amount of work has been done to explain phonological typology and learning in terms of these FLT-based criteria, little work has been done to computationally model the experimental results that support a bias for Subregular patterns. 3 Here, I will show that the biases observed in past FLT-related experiments can emerge from the learning process of a relatively generic learner, namely a sequence-to-sequence neural network, which has the expressive power to represent both Subregular and Supraregular patterns (Siegelmann 1999). Since the network has no explicit, FLT-related biases built into its architecture, this provides evidence that such a bias may not need to be added to theories of phonological acquisition.
3 Note that most of the literature involving FLT and learning (e.g., Chandlee et al. 2015; Jardine and Heinz 2016, among others) does not have an explicit hypothesis for how such learning algorithms can be used to make predictions for artificial language learning experiments. Instead, such work tends to focus on whether formally defined classes of languages are learnable at all, given certain kinds of training data.
The paper is structured as follows: Section 2 introduces the neural network model that I will be using, Section 3 focuses on simulating experimental results regarding Majority Rule Harmony (Lombardi 1999; Bakovic 1999; Finley and Badecker 2008), Section 4 focuses on doing the same for experiments that involve First-Last Assimilation (Lai 2015; Avcu 2018), and Section 5 concludes.

MODELLING PHONOLOGICAL LEARNING WITH NEURAL NETWORKS
Neural networks have been used to model linguistic patterns since at least Rumelhart and McClelland (1986) and were quickly applied to the domain of phonology by Touretzky (1989) and Touretzky and Wheeler (1990). Hare (1990) first used recurrent neural networks (Jordan 1986; Elman 1990) to capture Hungarian vowel harmony, demonstrating that this architecture could be particularly useful for learning phonological mappings. Recurrent neural networks treat a stimulus as being made up of multiple timesteps, each of which the model processes separately. At each timestep, the model has connections that lead to the output layer and to the next step in time. These connections that feed into future timesteps are called recurrent and give the model a kind of memory as it walks through the full stimulus. This is illustrated in Figure 2 for Hungarian vowel harmony.
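To illustrate how recurrent connections provide memory, the following is a minimal sketch of a simple recurrent layer. It assumes an Elman-style update with tanh units; the weight names and the one-unit toy network are hypothetical, and this is not the architecture used by Hare (1990) or in the simulations below.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rnn_step(x: Vector, h_prev: Vector, w_in: Matrix, w_rec: Matrix, bias: Vector) -> Vector:
    """One timestep of a simple (Elman-style) recurrent layer with tanh units."""
    h_new = []
    for j in range(len(bias)):
        total = bias[j]
        total += sum(w * xi for w, xi in zip(w_in[j], x))        # input connections
        total += sum(w * hk for w, hk in zip(w_rec[j], h_prev))  # recurrent connections
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs: List[Vector], w_in: Matrix, w_rec: Matrix, bias: Vector) -> List[Vector]:
    """Process a sequence of feature vectors, returning the hidden state at each timestep."""
    h = [0.0] * len(bias)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# A one-unit toy network: the second input is all zeros, so any activity at
# the second timestep must have been carried over by the recurrent connection.
with_memory = run_rnn([[1.0], [0.0]], w_in=[[1.0]], w_rec=[[1.0]], bias=[0.0])
without_memory = run_rnn([[1.0], [0.0]], w_in=[[1.0]], w_rec=[[0.0]], bias=[0.0])
```

In the toy example, the hidden unit stays active at the second timestep only when the recurrent weight is non-zero, which is the sense in which these connections act as memory.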
Figure 2: Illustration of a recurrent neural network. Circles represent the hidden recurrent layer processing each timestep, black arrows represent groups of connections, grey lines represent the internal structure of the layer, and IPA symbols represent feature vectors corresponding to each segment.
The use of such simple recurrent networks was later expanded to model other phonological phenomena, such as voicing assimilation (Gasser and Lee 1992) and phonotactic learning biases (Doucette 2017). However, these simple networks have been critiqued for their inability to generalise in a human-like way (Gasser 1993; Marcus et al. 1999) and for being too myopic (Alderete and Tupper 2018), since they have no ability to look ahead in their input sequence. There are a number of other reasons to suspect that simple recurrent networks would not be able to handle the full wealth of phonological phenomena, such as their dependency on input and output lengths being equal (Sutskever et al. 2014).
Most of these issues are solved by the neural network architecture used in this paper, sequence-to-sequence networks (henceforth Seq2Seq; Sutskever et al. 2014). Seq2Seq networks were originally designed for machine translation and are meant to handle the fact that different languages often use different numbers of words to express the same idea. For example, a sentence like "No, I am your father" could be translated to Spanish as "No, soy tu padre," which has one fewer word. Seq2Seq networks deal with this by processing sequences in the input with a recurrent network called the encoder, which is connected to a separate network, called the decoder, via its hidden layer connections. This processed data is then unpacked by the decoder into an output sequence whose length is independent of the length of the input.
This design also makes Seq2Seq networks well suited for modelling morphological and phonological patterns (e.g., Kirov and Cotterell 2018; Prickett et al. 2018; Prickett 2019), since these often involve mapping between forms of different lengths. For the simulations presented in this paper, words are represented as sequences of sounds, where sounds are vectors of real-numbered features that range from 0 to 1. In the input, which represents the underlying form, standard phonological features are used (like [voice] or [back]), with 0 and 1 corresponding to [−] and [+], respectively. In the output, which represents the surface representation, the network has a binary classifier for each feature that gives the model's estimated probability for how likely that feature is to have a positive value, given the underlying representation (UR) in its input. This is illustrated in Figure 3 using the same Hungarian example as above, with the feature vectors in the [...]
[...] (Bengio et al. 1994), which can prohibit a network from learning long-distance dependencies. While none of the patterns I investigate have dependencies that are long enough to be affected by this phenomenon, GRU units are relatively standard in the Seq2Seq literature and I leave it to future work to see whether they are necessary for capturing the results presented here. Similarly, in all of my simulations, the network's weights were optimized using Adam (Kingma and Ba 2015), a standard algorithm for training neural networks, but one that is likely not necessary to produce the results that I observed. The loss function used for optimization was the sum of binary cross entropy over all of the binary feature classifiers in the output, and weight updates were made after seeing each word in training (i.e. batch sizes were equal to 1, sometimes called online learning in the phonological literature).
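The loss described above can be sketched as follows for a single word. The sketch assumes that the network's output is a list of per-segment lists of probabilities, one per feature classifier, and that the target surface form uses 0 and 1 for [−] and [+]; the function name and the clipping constant `eps` are illustrative choices, not details from the paper.

```python
import math
from typing import Sequence


def summed_bce(predicted: Sequence[Sequence[float]],
               target: Sequence[Sequence[int]],
               eps: float = 1e-12) -> float:
    """Sum of binary cross entropy over every feature of every segment in one word."""
    total = 0.0
    for seg_probs, seg_target in zip(predicted, target):
        for p, t in zip(seg_probs, seg_target):
            # Clip to avoid log(0); eps is an implementation detail of this sketch.
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total
```

With a batch size of 1, a weight update would follow each call of such a function on a single word, rather than an average over many words.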
A final aspect of the model's architecture worth noting is attention (Bahdanau et al. 2015). This gives the model's decoder additional access to information from the input sequence by allowing it to see the encoder's hidden-state activations. Attention has been shown to encourage human-like generalization in Seq2Seq networks (Nelson et al. 2020). Some pilot simulations without attention suggested that it helped the model generalise better in the simulations presented here.

Background
Majority Rule Harmony is a pattern predicted by some constraint-based theories of assimilation in which the number of segments in a word's underlying representation (UR) with a particular feature value determines what the value of that feature will be throughout the surface representation (SR) of the word (Lombardi 1999; Bakovic 1999). Since Majority Rule requires a potentially unbounded amount of memory (i.e. enough memory to keep track of the quantities for each feature value), it cannot be represented with a finite state transducer and is more complex than the set of Regular functions (Heinz and Lai 2013). 4 Finley and Badecker (2008) tested whether humans were biased against Majority Rule. They did this by training participants on a language that was ambiguous between Majority Rule Harmony and a more standard, attested harmony pattern (henceforth Attested Harmony), in which the value of the relevant feature in the SR was determined by the value of that feature in either the leftmost or rightmost segment of the UR (see Rose and Walker 2011, for more on the kinds of harmony patterns that are common in natural language). Directional harmony mappings like this are Subregular, since determining how a vowel will surface only depends on local information in the input and [...]
4 Since TSL only defines a set of languages (i.e. phonotactic restrictions on SRs) and not a set of functions (i.e. UR→SR mappings), standard harmony patterns (when represented as transformations) are Output Tier-based Strictly Local.
Participants in the experiment were exposed to stimuli meant to represent underlying forms like /kupoki/, with both [+back] and [−back] vowels present in a single word. Crucially, the minority vowel (/i/ in this case, since it is [−back] while /o/ and /u/ are both [+back]) always occurred on the same side of the word in training.
After being given each "underlying" form, participants would then be exposed to a stimulus representing the "surface" form it mapped to (e.g., [kupoku] for the example above). The mapping /kupoki/→[kupoku] could then be analysed by the participants in two ways: either Attested Harmony, where the [back] value of the final vowel changed because the leftmost vowel in the word was [+back], or Majority Rule Harmony, where the word-final /i/ changed because the majority of vowels in the underlying form were [+back]. After being exposed to a number of these ambiguous mappings, participants were asked to choose between mappings that were unambiguous between Majority Rule and Attested Harmony. For example, they might be given /kupeki/ and need to choose between mapping it to [kupoku] (the Attested Harmony candidate) or [kipeki] (the Majority Rule candidate). If participants chose between the options at chance, it would suggest that they had no preference for either pattern. However, if they chose one significantly more often than the other, it would suggest that they were biased toward learning that pattern. Finley and Badecker (2008) found that their participants were significantly more likely to generalise in a way that adhered to Attested Harmony. That is, when choosing to either apply an Attested Harmony or Majority Rule mapping to items that were unambiguous between the two patterns, participants only applied the latter in approximately 20% of trials. This suggests that in the face of ambiguous training, the participants learned the Attested Harmony pattern, which Finley and Badecker (2008) interpreted as evidence of a bias against learning Supraregular patterns like Majority Rule.
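The two competing analyses can be written out as functions over strings. This is a minimal sketch that assumes a toy four-vowel inventory in which /i/ pairs with /u/ and /e/ pairs with /o/ (consistent with the examples above), takes the leftmost vowel as the trigger for Attested Harmony, and leaves ties under Majority Rule unchanged; these choices and all names are illustrative rather than taken from the original experiment.

```python
# Hypothetical front/back vowel pairings for the toy inventory.
BACK_OF = {"i": "u", "e": "o", "u": "u", "o": "o"}
FRONT_OF = {"u": "i", "o": "e", "i": "i", "e": "e"}
BACK_VOWELS = {"u", "o"}


def harmonize(word: str, back: bool) -> str:
    """Make every vowel in the word [+back] or [-back]; consonants are unchanged."""
    table = BACK_OF if back else FRONT_OF
    return "".join(table.get(seg, seg) for seg in word)


def vowels(word: str) -> list:
    return [seg for seg in word if seg in BACK_OF]


def attested_harmony(ur: str) -> str:
    """Directional harmony: the leftmost vowel decides the [back] value."""
    vs = vowels(ur)
    if not vs:
        return ur
    return harmonize(ur, vs[0] in BACK_VOWELS)


def majority_rule_harmony(ur: str) -> str:
    """Majority Rule: the more frequent [back] value in the UR decides."""
    vs = vowels(ur)
    n_back = sum(v in BACK_VOWELS for v in vs)
    n_front = len(vs) - n_back
    if n_back == n_front:
        return ur  # ties are left unspecified in this sketch
    return harmonize(ur, n_back > n_front)
```

Both functions map /kupoki/ to [kupoku], which is what makes the training data ambiguous, while /kupeki/ separates them. Note that `majority_rule_harmony` must count over the whole word, whereas `attested_harmony` only needs to remember one vowel.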

Simulations
To see whether the behaviour observed by Finley and Badecker (2008) is mirrored by a Seq2Seq network, I simulated their experiment using the architecture described in Section 2. The model was exposed to the same types of training data that Finley and Badecker (2008) gave their participants, which was ambiguous between Majority Rule and Attested Harmony. Since only the vowels were relevant to the patterns in this experiment, all consonants were removed. Other than this difference, the model was exposed to the same underlying and surface forms that the experiment participants were given. These are shown in Table 1 and the features used in all the simulations presented in this subsection are shown in Table 2.
All simulations consisted of 15 repetitions using this training data, with randomly initialized weights at the start of learning, and 300 full passes through the training data (i.e., 300 epochs). At each epoch, the [...] Table 3).
The conditional probability that the model assigned to each choice, given a particular UR, was calculated using Equation 2, based on Luce (1959), where pr(UR_i → SR_j) is found using Equation 1, and where f_ij stands for feature j in segment s_i of the relevant SR.
Results for these forced choice estimates were averaged over stimulus types and repetitions, and these averages are shown for each epoch in Figure 4. Figure 5 gives the 50th epoch in more detail, for results that are more visually comparable to the ones presented by Finley and Badecker (2008).
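Equations 1 and 2 are not reproduced in this text, so the following is only a sketch of one plausible reading. It assumes that the probability of a UR mapping to an SR is the product, over all segments and features, of the probability the model assigns to each feature value of that SR, and that the forced-choice probability is the Luce (1959) ratio of the two candidates' probabilities. The function names are hypothetical.

```python
from typing import Sequence


def sr_probability(feature_probs: Sequence[Sequence[float]],
                   sr_features: Sequence[Sequence[int]]) -> float:
    """Assumed form of Equation 1: product over segments i and features j.

    feature_probs[i][j] is the model's probability that feature j of segment i
    is positive; sr_features[i][j] is the 0/1 value of that feature in the SR.
    """
    prob = 1.0
    for seg_probs, seg_feats in zip(feature_probs, sr_features):
        for p, f in zip(seg_probs, seg_feats):
            prob *= p if f == 1 else 1.0 - p
    return prob


def luce_choice(p_a: float, p_b: float) -> float:
    """Assumed form of Equation 2: probability of choosing candidate a over b."""
    return p_a / (p_a + p_b)
```

For instance, if the model outputs probabilities 0.9 and 0.2 for the two features of a one-segment word, a candidate with values (1, 0) receives 0.72, a candidate with values (0, 0) receives 0.08, and the forced-choice probability of the first candidate is 0.9.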
These results show that throughout learning, the model prefers choices that are consistent with Attested Harmony, even though it has been trained on data that is ambiguous between the two patterns. This difference reaches statistical significance for a range of epochs (including the 50th epoch), meaning that the bias in humans observed by Finley and Badecker (2008) can be captured by the model.
To further test the model's biases with regard to Majority Rule Harmony, I also ran a simulation that does not correspond to Finley and Badecker's (2008) experiment. Rather than using a generalization-based design, in this simulation, multiple, unambiguous languages were used in training. Additional data points were added to the training data in Table 1 to disambiguate the two patterns of interest. The data for unambiguous versions of Majority Rule Harmony and Attested Harmony are shown in Tables 4 and 5.
The model was trained on these unambiguous versions of Attested Harmony and Majority Rule and the cross entropy and accuracy were recorded at each epoch. Accuracy was estimated by feeding the model each of the underlying forms in the training data, sampling from the probabilities it produced in the output to create surface forms, and finding the proportion of those surface forms that were perfectly produced in that epoch's sample. The learning curves created from these results (averaged over 15 repetitions) are shown in Figure 6.
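The accuracy estimate described above can be sketched as follows, assuming that each output is a list of per-segment feature probabilities and that each feature value is sampled independently as a Bernoulli draw. The function names are hypothetical, and the random number generator is passed in so that runs can be made reproducible.

```python
import random
from typing import List, Sequence


def sample_surface_form(feature_probs: Sequence[Sequence[float]],
                        rng: random.Random) -> List[List[int]]:
    """Sample a 0/1 value for each feature of each segment from the model's probabilities."""
    return [[1 if rng.random() < p else 0 for p in seg] for seg in feature_probs]


def sampled_accuracy(outputs: Sequence[Sequence[Sequence[float]]],
                     targets: Sequence[List[List[int]]],
                     rng: random.Random) -> float:
    """Proportion of words whose sampled surface form matches the target on every feature."""
    correct = sum(
        sample_surface_form(probs, rng) == target
        for probs, target in zip(outputs, targets)
    )
    return correct / len(targets)
```

Because a word only counts as correct when every sampled feature matches, this measure is stricter than per-feature accuracy, and a model that outputs probabilities near 0.5 will score close to zero on it.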
These results show that for small portions of the learning curve, Attested Harmony's average accuracy is marginally higher than Majority Rule's, but this difference is not a reliable one. There also seems to be a small, statistically marginal difference between the loss curves for the two patterns, but this effect is even less consistent throughout learning. Assuming that the small, artificial languages used here adequately represented each of the languages, this suggests that if the model does have a bias for Subregular patterns in its learning from unambiguous data, the effect size of this bias is too small to see in just 15 repetitions.
Figure 6: [...] (2008). Chance performance for the plot on the right would be considerably lower than 0.1, since the model assigns probabilities to each feature value in each segment. Coloured regions show 95% confidence intervals.

Background
First-Last Assimilation is a hypothetical phonotactic restriction in which the first and last segment of a word must agree in some feature value, while the intervening sounds are ignored (Lai 2015). For example, if the feature that needed to agree was [anterior], the word [saʃas] would be allowed, but the word *[saʃaʃ] would not be. Lai (2015) argued that there are reasonable diachronic origins for such a pattern, since the beginning and end of a word are perceptually salient positions. She went on to argue that the absence of such a pattern in the phonological typology could be due to its FLT-based complexity.
While First-Last Assimilation is Subregular, it belongs to the Locally Testable region, which is more complex than TSL, in terms of the logic needed to define the crucial parts of the pattern. That is, sets of sequences are necessary to describe words banned by First-Last Assimilation (i.e. "words with either [#s] and [ʃ#] or [#ʃ] and [s#] are banned"), which is never true for TSL patterns.
Two studies have shown that people have biases against learning First-Last Assimilation. Lai (2015) trained participants on either a standard sibilant harmony pattern (henceforth, Attested Harmony) or First-Last Assimilation by having them listen to and then repeat words adhering to the pattern they were assigned to. In the testing phase of the experiment, participants were asked to judge which word was more likely to belong to the language they were trained on in three types of forced choice:
i. a choice between a word that was allowed in both patterns (e.g. [sasakas], denoted as FL/AH below) and a word that was only allowed in First-Last Assimilation (e.g. [saʃakas], denoted as FL/*AH below),
ii. a choice between a word that was allowed in both patterns and a word that was banned by both (e.g. [sasakaʃ], denoted as *FL/*AH below),
iii. a choice between a word that was only allowed in First-Last Assimilation and one that was banned by both.
Participants who learned an Attested Harmony pattern would be expected to choose words that were allowed by both patterns when presented with choices (i) and (ii), but should choose at random for choice (iii). This is because choice (iii) forces participants to choose between two words that are both banned by the Attested Harmony pattern. Participants who learned a First-Last Assimilation pattern would be expected to choose at chance for choice (i), since both choices are grammatical according to First-Last Assimilation. For choice (ii), they would be expected to choose words that are allowed by both patterns, and for choice (iii) they should choose the words that are only allowed by First-Last Assimilation.
Figure 7: [...] (2015). The x-axis shows which pattern participants were trained on. Type labels are mine, with "FL" standing for First-Last Assimilation, "AH" for Attested Harmony, and "*" indicating an option not being allowed in a given pattern. Note that Lai (2015) used the term "Standard Harmony"/"SH" for the pattern I'm calling "Attested Harmony"/"AH".
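The item types used in these forced choices can be derived mechanically from the two restrictions. The sketch below assumes one character per segment, non-empty words, and that First-Last Assimilation only constrains words whose first and last segments are both sibilants; the function names are illustrative.

```python
SIBILANTS = {"s", "ʃ"}


def obeys_attested_harmony(word: str) -> bool:
    """All sibilants in the word must be identical, i.e. agree in [anterior]."""
    return len({seg for seg in word if seg in SIBILANTS}) <= 1


def obeys_first_last(word: str) -> bool:
    """The first and last segments must agree; everything in between is ignored."""
    first, last = word[0], word[-1]
    if first in SIBILANTS and last in SIBILANTS:
        return first == last
    return True  # assumption: no restriction unless both edges are sibilants


def item_type(word: str) -> str:
    """Label a word with the FL/AH notation used for the forced-choice items."""
    fl = "FL" if obeys_first_last(word) else "*FL"
    ah = "AH" if obeys_attested_harmony(word) else "*AH"
    return f"{fl}/{ah}"
```

Applied to the examples above, `item_type` returns `FL/AH` for [sasakas], `FL/*AH` for [saʃakas], and `*FL/*AH` for [sasakaʃ]. For words that begin and end in a sibilant, any word obeying Attested Harmony also obeys First-Last Assimilation, which is why no `*FL/AH` type appears in the design.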
However, participants trained on First-Last Assimilation in Lai's (2015) experiment did not behave as expected. Her results (reproduced in Figure 7) showed that participants in both language conditions behaved as if they had learned Attested Harmony.
Specifically, when presented with choices (i) and (ii), participants in both conditions chose items that were grammatical in both languages significantly more than chance, showing that they preferred items in which Attested Harmony was not violated. However, when presented with choice (iii), participants performed at chance, demonstrating that they had no preference between items that violated First-Last Assimilation and those that did not. This shows that they failed to learn First-Last Assimilation when trained on the pattern, and instead learned the Attested Harmony pattern. These results are what one would expect if there were a categorical restriction banning the acquisition of phonological patterns that are more complex than TSL.
Avcu (2018) ran another artificial language learning experiment to test for a bias against First-Last Assimilation. Participants received the same training as Lai's (2015) study; however, in testing, they were asked to make a different kind of choice. Instead of choosing between two words, participants judged whether they thought each test stimulus (some of which followed the pattern from training and some of which did not) belonged to the language they had just learned. This allowed Avcu (2018) to analyse participant responses using Signal Detection Theory (Green and Swets 1966) and provided a measure of how sensitive individuals were to whether a word belonged to the language they were assigned. The results showed that participants in both language conditions were better than chance at performing this discrimination task, but that those who learned Attested Harmony performed significantly better. Since Avcu's (2018) participants were less successful at learning First-Last Assimilation than its more standard counterpart, these results also support the idea of a bias for patterns that are simpler according to FLT.
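A common sensitivity measure in Signal Detection Theory is d′, the difference between the z-transformed hit rate and false-alarm rate. The sketch below shows how such a measure could be computed from yes/no judgements; the log-linear correction for extreme rates is an assumption of this sketch, and the text above does not specify which measure or correction Avcu (2018) used.

```python
from statistics import NormalDist


def d_prime(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to counts, 1 to totals) keeps the rates
    away from 0 and 1, where the z-transform would be infinite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

A value of 0 corresponds to chance discrimination between words that do and do not belong to the trained language, and larger values indicate greater sensitivity.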

Simulations
To see if an explicit, FLT-related bias is needed to capture the results that Lai (2015) and Avcu (2018) observed in human learning, I ran a simulation using a Seq2Seq network. 7 The training and testing data that the model received were identical to the stimuli used by Lai (2015), except that all vowels were removed from the model's representations (as they were irrelevant to the patterns of interest). Since Lai's (2015) participants were not exposed to the underlying forms for any of the stimuli, all training and testing data for the model assumed that underlying forms were identical to their corresponding surface forms (see Prince and Tesar 2004, for a similar approach to phonotactic learning). While this data represents an identity mapping, the fact that neural networks cannot perfectly learn such a mapping (Tupper and Shahriari 2016) means that the model must learn alternative ways to optimize its objective function, such as acquiring the phonotactic patterns present in the language (see Kurtz 2007, for a similar approach using a different neural network architecture). The [...] Tables 6 and 7, respectively. Additionally, the features used to represent the segments in both patterns are shown in Table 8. Simulations consisted of 15 repetitions in each language condition, with randomly initialized weights at the start of learning, and 300 passes through the full data set. At each epoch of training, the model's cross entropy and accuracy were measured. Accuracy was estimated by feeding the model each of the forms in the training data as input, sampling from the probabilities it produced in its output to create surface forms, and finding the proportion of those surface forms that matched their input in that epoch's sample. Learning curves showing both of these metrics are given in Figure 8.
Figure 8: [...] (2015). Chance performance for the plot on the right would be considerably lower than 0.1, since the model assigns probabilities to each feature value in each segment. Coloured regions show 95% confidence intervals.
7 Thanks to a reviewer for pointing me toward similar work in the domain of syntax: Ravfogel et al. (2019) show that when a neural network is trained on data that is ambiguous between an agreement pattern analogous to First-Last Assimilation and a pattern that involves more local agreement, it generalises in a way that suggests it learned the latter.

The curves in Figure 8 show that Attested Harmony is learned consistently faster than First-Last Assimilation. This difference is significant for considerable portions of learning in both the model's loss and accuracy. These results are most comparable to those reported by Avcu (2018), since the model's performance is higher than chance for both patterns, but significantly better for Attested Harmony.
To compare the model's learning to the results in Lai (2015), the network was given a forced-choice task similar to the one described in Section 3.2, with the test data given in Table 9.
Since the patterns here were phonotactic (rather than mappings), there was no shared UR between the two choices. That is, the conditional probability that the model assigned to each choice was just a normalized probability for each of the two SRs mapping to themselves, as shown in Equation 3.
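Equation 3 is not reproduced in this text. Based on the description above, one plausible form, written here as an assumption with SR_a and SR_b standing for the two words in a forced choice, is:

```latex
\[
p(\mathrm{SR}_a \mid \{\mathrm{SR}_a, \mathrm{SR}_b\}) =
\frac{pr(\mathrm{SR}_a \rightarrow \mathrm{SR}_a)}
     {pr(\mathrm{SR}_a \rightarrow \mathrm{SR}_a) + pr(\mathrm{SR}_b \rightarrow \mathrm{SR}_b)}
\]
```

Under this reading, a value of 0.5 corresponds to the model reproducing both words equally well, which is the sense in which it would be choosing at chance.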
The relevant conditional probabilities were averaged over stimulus types and repetitions, and are shown in Figure 9 and Figure 10 for the model that was trained on First-Last Assimilation and the model that was trained on Attested Harmony, respectively.
These results show that the Seq2Seq model, like the human participants in Lai (2015), behaved in a way that was consistent with Attested Harmony, even when trained on data that unambiguously followed the First-Last Assimilation pattern. That is, regardless of the model's training data, it chose at chance between words that were banned by Attested Harmony, even when one of those words adhered to First-Last Assimilation (with the only exception to this behaviour being a small number of epochs in the Attested Harmony condition). This is shown in the results for choice (iii). By itself, this only shows that the model did not learn First-Last Assimilation. However, choices (i) and (ii) both show that the models acquired Attested Harmony, since words adhering to this pattern are consistently given more probability than words banned by it for most of the acquisition process. To show these results in a way that is more visually comparable to the results reported in Lai (2015), the model's estimates for the 100th epoch in each language, which was a relatively representative point in each language's learning curve, are shown in Figure 11.
Figure 11: Forced choice probabilities for the 100th epoch of training in both the First-Last Assimilation language and the Attested Harmony language. The dashed line shows chance and the error bars show 95% confidence intervals. As in Figure 7, "FL" stands for First-Last Assimilation, "AH" stands for Attested Harmony, and "*" indicates an option not being allowed in a given pattern.

5.1 Why can the Seq2Seq network capture these biases?
In this paper, I showed that the apparent FLT-related bias observed in past artificial language learning experiments could be modeled by a recurrent neural network with no FLT-based restrictions built into its architecture. But the question of why these biases exist has not been addressed. One reason for the model's bias against Majority Rule Harmony could be its inability to count. Weiss et al. (2018) showed that GRU units, like the ones used in the hidden layer of the neural network I tested, prohibit a model from acquiring the ability to count (as opposed to simple recurrent networks and networks with LSTM units, which were able to learn counting-based patterns). Since Majority Rule Harmony requires counting the occurrences of a particular feature value in the input, this could explain the model's preference for learning an Attested Harmony pattern in the face of ambiguous data.
Another relevant factor is the locality bias (sometimes also called "sequentiality"; Battaglia et al. 2018) present in all recurrent network architectures. This is a bias for patterns that involve local dependencies, originating from the fact that recurrent connections have a finite amount of memory with which to store information across time. Past results on syntactic patterns have shown that this bias can cause RNNs to learn a local agreement pattern when given ambiguous evidence between that and a non-local one (Ravfogel et al. 2019). Similarly, McCoy et al. (2020) showed that Seq2Seq neural networks similar to the one used here were more likely to learn syntactic patterns that depended on linear order, which typically involves more local dependencies, than patterns that depended on hierarchical structure, which typically involves longer distance dependencies. Since First-Last Assimilation also involves non-local dependencies (i.e. two arbitrarily distant first and last segments), the network could have struggled to keep track of the relevant feature values in its recurrent connections when acquiring that pattern. 9

Future work
This paper has shown that three experiments that found evidence supporting an FLT-based bias in humans (Finley and Badecker 2008; Lai 2015; Avcu 2018) can be simulated using a Seq2Seq recurrent neural network. Future work should continue to explore the phonological learning biases present in both humans and computational models. For example, one phonological pattern that was not discussed here but which the literature has discussed in detail is Sour Grapes Harmony (Bakovic 2000; Wilson 2003). Sour Grapes is identical to Standard Harmony, except when a segment that blocks the harmony process is present in a word. When this happens, any changes that would have occurred up to the blocker are prevented from occurring at all. Like First-Last Assimilation and Majority Rule, Sour Grapes is unattested in natural language and more complex than the Tier-based Strictly Local region of the Subregular Hierarchy (O'Hara and Smith 2019; Lamont 2019b).
9 The difference between local and non-local dependencies has been thoroughly explored in the statistical learning literature as well (e.g., Newport and Aslin 2004), and simulations of such statistical learning experiments with RNNs have been performed (see, e.g., Farkaš 2008). I leave exploring the relationship between these experiments and those that have been used to support FLT-based biases in phonology to future work.
Another avenue for future work is using more realistic artificial languages. In all of the experiments simulated here, word length was kept constant. When testing the effects of formal complexity on human learning, generalization to novel lengths has been shown to be crucial in understanding human bias (Westphal-Fitch et al. 2018). Further research that makes use of variable lengths in its training and testing data could shed light on whether humans display an FLT-based bias under these more realistic conditions.
Researchers should also explore how the predictions about human learnability made by FLT and neural networks differ. For example, certain Context-Sensitive patterns are easier for neural networks and humans to learn than corresponding Context-Free patterns (Li et al. 2013; Westphal-Fitch et al. 2018), despite the fact that Context-Sensitive is more complex according to FLT. Exploring whether mismatches like this occur in phonological patterns could shed more light on how psychologically real FLT-based complexity is.
Understanding better why the neural network is able to capture these results and what representations it learns while doing so is another important next step. While the interpretability of recurrent networks has primarily been explored in the context of syntactic patterns and language modelling (see, e.g., Alishahi et al. 2019, for a review), some recent work on phonological patterns has shown promising results in this direction (Nelson et al. 2020; Smith et al. 2021) and these techniques could likely be applied to the networks used here.
Finally, a number of choices about the model I used were made somewhat arbitrarily: the number of hidden states in each layer, the use of GRU instead of a different kind of recurrent layer in the model, the use of attention, et cetera. Changing any one of these would likely have an effect on the model's ability to capture the experiment results investigated in Section 3 and Section 4, and I leave exploring the consequences of such changes to future work.

5.3 The relationship between FLT and other complexity metrics
The Subregular Hierarchy is not the only way of measuring complexity that has been used in phonological research. Feature counting (Chomsky and Halle 1968), Minimum Description Length (Rasin and Katzir 2016), and various other methods (e.g. Moreton et al. 2017) have been used to characterize the complexity of phonological patterns. While these other methods are related to FLT, they are not perfectly correlated with it. For example, a feature-counting complexity metric would find a pattern banning all voiced sounds at the end of words (i.e., *[+voice]#) to be simpler than a pattern banning voiced, velar stops in that context (i.e., *[+voice, Dorsal]#). However, according to FLT, these patterns would both be Strictly Local, with no difference in complexity. Exploring the relationship between FLT and these other metrics is outside the scope of this paper; however, future work should investigate what formalizations of complexity best predict both human behavior and linguistic typology (see, e.g., Moreton and Pater 2012).

Conclusions
Past work has explained phonological typology using an explicit, categorical restriction that prohibits the acquisition of patterns that are too complex according to the Subregular Hierarchy. Evidence for this hypothesis includes a series of experiments that showed humans being affected by an apparent FLT-based bias in an artificial language learning context (Finley and Badecker 2008; Lai 2015; Avcu 2018). The results in this paper challenge the idea that a categorical, explicit bias like this is needed to capture phonological learning, since a Seq2Seq neural network with the expressive power to represent Supraregular patterns was able to capture these experimental results. While FLT can be useful for describing phonological typology, these results suggest that an explicit FLT-based bias may not be needed in models of phonological learning.