Aligning Speech and Co-speech Gesture in a Constraint-based Grammar

This paper concerns the form-meaning mapping of communicative actions consisting of speech and improvised co-speech gestures. Based on the findings of previous cognitive and computational approaches, we advance a new theory in which this form-meaning mapping is analysed in a constraint-based grammar. Motivated by observations in naturally occurring examples, we propose several construction rules, which use linguistic form, gesture form and their relative timing to constrain the derivation of a single speech-gesture syntax tree, from which a meaning representation can be composed via standard methods for semantic composition. The paper further reports on implementing these speech-gesture construction rules within the English Resource Grammar (Flickinger 2000). Since gestural form often underspecifies its meaning, the logical formulae that are composed via syntax are underspecified so that current models of the semantics/pragmatics interface support the range of possible interpretations of the speech-gesture act in its context of use.

Figure 1: Gesture Depicting Mixing Mud, example (1)
For instance, in utterance (1),¹ extracted from a conversation where the speaker is describing installing drywall (Loehr 2004),² the speaker performs a circular movement with the right hand over the left palm (see Figure 1) along with the spoken utterance. Both the speech and the hand movement are relevant for the conveyed meaning of mixing mud, and both are produced and perceived as a coherent idea unit (McNeill 1992).
(1) So he mixes [N mud] …
¹ We adopt the following conventions in utterance transcriptions: the part of the speech signal that is simultaneous with the expressive phase of the gesture, the so-called stroke, is underlined. We include words that start or end at midpoint in relation to the gesture phase boundaries. The pitch accented words are shown in square brackets with the accent type in the left corner: PN (pre-nuclear), NN (non-nuclear) and N (nuclear).
² For this and for all subsequent examples that are cited as Loehr (2004), we are grateful to Daniel Loehr who kindly provided us with an annotated corpus of speech and co-speech gesture. We used this corpus to study depicting gestures.
In this article, we analyse signals like (1), in which the hand is spontaneously used to convey meaning in tandem with speech. In the literature, these hand signals are known as co-speech gesture, co-verbal gesture or gesticulation (e.g., Kendon 1972). In depicting/referential gestures, the form of the hands visually characterises a salient feature of the referent. The depiction could be iconic (McNeill 1992) (e.g., in (1) the hands perform a rotating movement to depict the mud being mixed), or metaphoric (McNeill 1992) (e.g., a rotating hand while saying "This was a long, boring process" can designate an iterative process). In deixis/pointing gestures, the hand points to a region in space so as to identify the referent's location in Euclidean space. The pointing can be concrete (McNeill 1992), as when pointing to something that's physically present in the communicative situation. It can also be abstract (McNeill 1992): the referent is a virtually created object in the gesture space just in front of the speaker, and its location in the gesture space constrains its physical location; e.g., a speaker, while describing her apartment that's on the other side of town, extends her right hand to the right periphery while saying "The bedroom is on the right". Formless flicks of the hand, beating the time along with the rhythm of the speech, are known as beats. The current analysis focusses on depicting and pointing co-speech gestures.
We adhere to current theories of gesture (Cassell et al. 1999; Lascarides and Stone 2009a; Pfeiffer et al. 2013), in that we assume that co-speech gesture can affect the truth-conditional content of the speech-and-gesture action. Both deictic gestures and iconic representations say something about the world and as such they have propositional content; this extends to pictorial representations as well (Abusch 2014; Grzankowski 2015).
Our paper contributes to the existing approaches to integrating the contents of speech and co-speech gesture in a single semantic unit (McNeill 1992; Kendon 2004; Bavelas and Chovil 2006; Engle 2000; Giorgolo 2012) in that we explore the coordination patterns of the two modalities, we formalise them within an integrated grammar and we spell out the gesture's semantic contributions to the proposition that is conveyed by the speech-gesture action. The main challenges are twofold: on the one hand, the gesture signal is massively ambiguous (Lascarides and Stone 2009a); on the other, the speech-gesture integration is not a free-for-all, in that the form of the speech-gesture action rules out certain interpretations of it, whatever its context of use. To illustrate gesture's ambiguity, consider again the hand movement in (1). Taken out of its speech context, this gesture could be a depiction of a circular movement (e.g., the turning of a wheel), or it could refer to the object being rotated (e.g., the wheel itself), or it could refer to an iterative process. It is only via context that gesture receives a specific meaning: the content conveyed by the rotating movement while saying "He mixes mud" is distinct from that while saying "It's a huge, long boring process".
The form of a deictic gesture is also imprecise on the region pointed out by the hand and what is being designated (Kühnlein et al. 2002): when pointing in the direction of a book with an extended index finger, does the deictic gesture identify the physical object book, the book's content, or the location of the book (e.g., the table)?
This ambiguity notwithstanding, the form of the gesture, abstracted away from its context of use, conveys some meaning, no matter how incomplete it might be. A depicting gesture, by the definition of iconicity, must support a perceptual resemblance between the gesture's form and its denotation (Kendon 2004; Kopp et al. 2007): i.e., the gesture's movement, hand shape etc. visualise qualitative characteristics of the referent. Deixis, on the other hand, indexes spatial reference in Euclidean space by projecting the hand to a region that is proximal or distal in relation to the speaker's location (e.g., Levinson 1983). Through deictic gestures, people anchor the referents in their utterances to the physical context (Kaplan 1989). This difference between depicting gestures and deictic gestures is accounted for in how we model the form-meaning mapping, and we also support the analysis of gestures that are both deictic and depictive simultaneously (and so inherit the characteristics of both gestural types).

Outline
This article is structured as follows: in Section 2, we discuss the ambiguous form-meaning mappings of the speech-and-gesture signal, assuming a coherence-based pragmatic theory. In Section 3, we introduce examples to motivate a grammar-based approach to co-speech gesture. We then proceed with a discussion of related work and our distinct contribution (Section 4). In Section 5, we discuss how to formally represent gesture form and map this form to (underspecified) meaning. In Section 6, we propose domain-independent grammar rules which are based on the empirically extracted generalisations. Section 7 reports on the grammar implementation and evaluation.
2 Ambiguous Form-Meaning Mapping
There is a balance to be struck between constraining the mapping from form to meaning and ensuring that existing pragmatic theories will support inferring the context-specific interpretations from the underspecified meanings derived only from form. The aim of this section is to use examples of speech-gesture actions to motivate one way of striking that balance. We first introduce an existing coherence-based model of pragmatics, which we assume underlies the inferences from the meaning that's derived from form alone to a preferred pragmatic interpretation in context. We then use this to motivate speech-gesture attachment ambiguities by illustrating how each syntax tree supports a different interpretation of the speech-and-gesture action, given the assumed pragmatics model. We also argue that licensed attachments are constrained, despite the multiple ways co-speech gestures can relate to speech.

Pragmatic Theory Background
In this paper, we assume a coherence-based model of the semantics/pragmatics interface as discussed in the literature of discourse interpretation (e.g., Hobbs 1985; Kehler 2002). The main principle of a coherence-based pragmatic theory is that discourse content is dependent on coherence relations (e.g., Elaboration, Explanation, Contrast, Contiguity) which link the meaning of its segments together. Identifying coherence relations is a defeasible process, informed by the compositional and lexical semantics of the units and contextual information such as real-world knowledge. For instance, the pragmatic interpretation of the discourse in (2) involves the following contents: Max fell, John pushed Max, and the latter explains the former (so the pushing caused the falling and hence preceded it).
(2) Max fell. John pushed him.
Using the notation of Segmented Discourse Representation Theory (SDRT, Asher and Lascarides 2003), as shown in (3), this is represented as a rooted hierarchical set of labels (each label corresponds to a discourse segment), with each label associated with some content: $\pi_1$ is associated with the content that the event $e_1$ of Max m falling happened before now; segment $\pi_2$ with the content that the event $e_2$ of John j pushing x, where x is identical to m, happened before now; and the (root) segment $\pi_0$ stipulates that $\pi_2$ explains $\pi_1$ (in other words, the content of $\pi_2$ explains why the content of $\pi_1$ is true).
(3) $\pi_0 : \mathit{Explanation}(\pi_1, \pi_2)$
    $\pi_1 : \mathit{fall}(e_1, m) \wedge e_1 < \mathit{now}$
    $\pi_2 : \mathit{push}(e_2, j, x) \wedge x = m \wedge e_2 < \mathit{now}$
The linguistic grammar doesn't identify the antecedent m to the pronoun x. Rather, "him" introduces an underspecified equality condition between the newly introduced referent x and some antecedent, written x = ?. Generally, (disambiguated) linguistic form yields an Underspecified Logical Form (ULF), because syntax on its own does not fully resolve all semantic and anaphoric ambiguities. Similarly, the grammar does not introduce the Explanation relation between the segments. Rather, identifying this coherence relation and the antecedent m to x (thereby replacing x = ? with x = m in the logical form of the discourse) is achieved via commonsense reasoning, using the ULFs of the clauses as premises. Moreover, the assumption that $\pi_2$ is coherently related to $\pi_1$ is what makes m an available antecedent for x.
Following Lascarides and Stone (2009a), we assume that gestures are elementary discourse units (that is, segments at the leaves of the hierarchical discourse structure); so interpreting gesture involves inferring coherence relation(s) between it and other speech units and gesture units. Furthermore, Lascarides and Stone (2009a) stipulate that co-speech gesture must be coherently related to its synchronous speech, and it can be related to other units as well. The main aim of this paper is to model this necessary connection between co-speech gesture and its synchronous speech. In line with theories of dynamic semantics and discourse interpretation (Hobbs 1985; Kehler 2002; Asher and Lascarides 2003), we further assume that there are constraints on which antecedents are available for resolving the anaphoric elements of the current discourse unit. In speech-only discourse, antecedents to anaphora in the discourse unit $\pi$ must be introduced in $\pi$ itself or in a unit $\pi'$ that $\pi$ is coherently related to. Following Lascarides and Stone (2009a), we carry over these constraints to gesture: i.e., all individuals that are a part of the pragmatic interpretation of a gesture behave like anaphoric expressions, in that they must bind via a bridging relation to an available antecedent (Asher and Lascarides 1998). Thus inferring a pragmatic interpretation of gesture is dependent on inferring how it coherently connects to available speech unit(s).
The meaning representations that we derive from the form of a sentence with co-speech gesture must respect the above constraints on interpretation. To achieve this, we make the choices of speech and gesture integration, which we formally express by attachments in the syntax tree, determine the speech phrase that the gesture is coherently related to. This in turn affects which referents, introduced in speech, are available antecedents for resolving the underspecified gesture meaning (given just its form). Lascarides and Stone (2009a) observe additional constraints on antecedents for resolving gesture interpretation; constraints that we assume here. Specifically, they claim that the antecedent for resolving gesture can be introduced by a gesture or a linguistic discourse unit, but antecedents for resolving linguistic anaphora cannot be introduced by depicting gestures. This doesn't apply to deixis: a linguistic anaphor can co-refer with a referent that's pointed at. For instance, when a person points at a knife and says "It's sharp", it is perfectly acceptable for "it" to refer to the knife introduced by the deictic gesture. In contrast, when a person says "He cut the cake" and makes a 'cutting' gesture with a vertically flat palm to depict the instrument used for cutting, it is rather unnatural to continue this discourse with "It was sharp" where "it" refers to the knife introduced by the iconic gesture.
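The availability constraints just described can be illustrated with a minimal sketch. It is not part of the grammar or of SDRT itself: the `Unit` class, the `kind` labels and the function `available_antecedents` are hypothetical names introduced here, and the sketch assumes that the set of coherently related units has already been inferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A discourse unit (hypothetical encoding, for illustration only)."""
    label: str
    kind: str             # "speech", "depicting" or "deictic"
    referents: frozenset  # referents introduced by this unit


def available_antecedents(unit, related_units):
    """Referents that may resolve an anaphoric element of `unit`.

    Antecedents must be introduced in the unit itself or in a unit it is
    coherently related to; depicting gestures do not supply antecedents
    for linguistic anaphora, whereas deictic gestures do.
    """
    candidates = set(unit.referents)
    for other in related_units:
        if unit.kind == "speech" and other.kind == "depicting":
            continue
        candidates |= other.referents
    return candidates


# Pointing at a knife while saying "It's sharp".
pointing = Unit("pi_g1", "deictic", frozenset({"knife"}))
its_sharp = Unit("pi_s1", "speech", frozenset())

# "He cut the cake" with a cutting gesture, followed by "It was sharp".
he_cut = Unit("pi_s2", "speech", frozenset({"he", "cake"}))
cutting = Unit("pi_g2", "depicting", frozenset({"knife"}))
it_was_sharp = Unit("pi_s3", "speech", frozenset())
```

Under these assumptions, `available_antecedents(its_sharp, [pointing])` contains the knife, whereas `available_antecedents(it_was_sharp, [he_cut, cutting])` does not, mirroring the contrast between the deictic and the iconic case.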
By drawing on standard methods from formal linguistics, our goal is to make the analysis of a discourse featuring co-speech gestures compatible with the analysis of purely linguistic discourse. Given the fact that we are adopting a coherence-based theory, the pragmatic interpretation of co-speech gesture is dependent on the content of the linguistic signal it is coherently related to. With this in mind, we introduce the notion of speech-gesture alignment to roughly designate: (i) that speech and gesture are coherently related; and (ii) that resolving the (underspecified) semantics of gesture to a specific interpretation and inferring a coherence relation are logically co-dependent tasks. We shall refine the notion of alignment in Section 3.3 after a discussion of how linguistic form and gestural form, including their relative timings, constrain the alignment configurations. In the next section, we illustrate the various ways in which a gesture can be interpreted in context.

Ambiguous Form-Meaning Mapping
Syntactic attachment ambiguities and semantic scope ambiguities are ubiquitous in grammars. For instance, there is the non-unique choice for attaching the PP in "John saw the man with the telescope". And there's the non-unique semantic scope of the quantifier in "every dog probably did not walk": "probably" semantically outscopes the negation, which outscopes "walk", but the quantifier "every dog" may outscope "probably", or have narrow scope relative to "probably" but outscope the negation, or have narrow scope relative to the negation. Most grammars have to handle semantic scope ambiguity in the absence of syntactic ambiguity. So syntax derives a ULF that underspecifies semantic scope.
We will now argue that the range of plausible pragmatic interpretations of co-speech gesture can likewise be analysed via a non-unique choice of attachment of the co-speech gesture to speech and a non-unique way of resolving scope in the ULF that gets composed via such attachments. In essence, these sources of ambiguity familiar from linguistics can also capture ambiguities in co-speech gestures. In Section 3.1, we will then argue that not only can one model co-speech gesture ambiguity this way, but one should.
We use a slight modification of example (1), namely (4), to discuss the ambiguous form-meaning mapping of depicting gestures. Its plausible pragmatic interpretations are presented in SDRT notation, except that we ignore tense and presupposition, and (following the English Resource Grammar (ERG, Flickinger 2000)) events are not existentially bound.
Intuitively, one of the possible denotations of the circular hand movement is paraphrasable as "the mud is going round in horizontal circles". This interpretation is regimented in the LF in (5), which features an Elaboration relation between the speech content mud(x) (labelled $\pi_s$) and the gesture content labelled $\pi_g$: a horizontal rotating event $e'$ over a substance $x'$ that is made equal to the 'mud' referent x introduced in $\pi_s$. So the speech-gesture action conveys "John mixes mud, (specifically) the mud that is going round". Like (2), this LF consists of a hierarchical structure of coherently related segments.
(5) $\pi_s : \mathit{mud}(x)$
The constraints on anaphoric reference imposed by the discourse structure in (5) license using x as an antecedent for specifying the content of $\pi_g$ (Asher and Lascarides 2003; Lascarides and Stone 2009b): x is available because it's 'introduced' by the predication mud(x) (or more precisely, using HPSG terminology, x is the semantic index of mud(x), its first argument which introduces a noun variable), and mud(x) is a part of $\pi_s$, to which $\pi_g$ is coherently related.
Further, this LF represents one way of resolving the underspecified semantic scope of the ULF that you would get by attaching the gesture to the NP "mud" in the syntax tree. Specifically, following the standard approach to semantic composition (Sag and Wasow 1999; Copestake et al. 2001), assume the semantic component of the construction rule that attaches gesture to a linguistic unit introduces an (underspecified) coherence relation (here resolved to Elaboration) between the gesture and the predications in that linguistic unit, but the ULF so derived underspecifies the relative scope of this (underspecified) coherence relation and the quantifiers in the linguistic unit. Then the ULF derived by attaching the gesture to the NP "mud" would force the coherence relation to outscope the predicate mud(x) but it won't outscope the predicates mix(e, j, x) or john(j). (5) is a fully specific logical form that is licensed by this ULF. Here, ∃x must outscope the coherence relation because free occurrences of x are forbidden (Copestake et al. 2005).
An alternative pragmatic interpretation of the co-speech gesture in (4) is that it depicts the event of mud going round as a result of the mixing. A formal rendition of this interpretation is given in (6).
This interpretation can be derived by attaching the gesture to a linguistic unit whose timing is (again) not equal to the timing of the gesture (though they temporally overlap), and then resolving the ULF that results from this attachment to a fully specific logical form. Here, (6) can be derived from the ULF you get by attaching the gesture to the VP "mixes mud": this attachment forces $\pi_s$ to include the predication mix(e, j, x). Consequently, the quantifier ∃x can now have narrower scope than the coherence relation, as shown. This contrasts with attachment to the NP "mud": this attachment ruled out mix(e, j, x), and hence also ∃x, from being within the scope of the coherence relation. Further, since the predication john(j) in (6) isn't a part of $\pi_s$, j is not available for resolving the content of $\pi_g$.
The particular linguistic grammar that we use in this paper to analyse co-speech gesture, specifically the ERG (Flickinger 2000), makes the ULF generated by VP attachment the same as that derived by S attachment. For example, the adverbial in "Probably John mixed mud" and "John probably mixed mud" attaches to the S and VP nodes respectively, but in both cases the ULF forces the modal introduced by "probably" to outscope mix(e, j, x) and it underspecifies whether it also outscopes john(j) and/or mud(x), or not. Thus (6) is also derivable from the ULF you get by attaching the gesture to the S node. An alternative fully scoped form of this ULF corresponds to a further plausible interpretation of the gesture:
Unlike (5) and (6), john(j) is now outscoped by the coherence relation; so j is available for resolving the content of $\pi_g$. As before, the choice of antecedents for specifying the content of $\pi_g$ interacts with the choice of coherence relation: here, the coherence relation is Depiction and the overall content is roughly paraphrasable as another free adjunct: "As he was making it go round, John was mixing mud".
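The interaction between attachment site and scope resolution described above can be summarised in a small sketch. It is a simplification under stated assumptions: the table `ULF_CONSTRAINTS` and the function `speech_segment_readings` are hypothetical, quantifiers are not modelled, and each reading is reduced to the set of speech predications that end up inside the segment related to the gesture.

```python
from itertools import combinations

# For each attachment site: predications the coherence relation must
# outscope ("must"), and predications whose scope relative to the
# relation is left open by the ULF ("may"). Illustrative values only,
# read off the discussion of "John mixes mud".
ULF_CONSTRAINTS = {
    "NP": {"must": ("mud(x)",), "may": ()},
    "VP": {"must": ("mix(e, j, x)",), "may": ("john(j)", "mud(x)")},
}
# In the ERG, S attachment yields the same ULF as VP attachment.
ULF_CONSTRAINTS["S"] = ULF_CONSTRAINTS["VP"]


def speech_segment_readings(site):
    """Enumerate the predication sets the speech segment may contain."""
    spec = ULF_CONSTRAINTS[site]
    readings = []
    for size in range(len(spec["may"]) + 1):
        for extra in combinations(spec["may"], size):
            readings.append(frozenset(spec["must"]) | frozenset(extra))
    return readings
```

With these values, NP attachment yields a single reading containing only mud(x), while VP and S attachment each yield four, including the one with mix(e, j, x) and mud(x) but not john(j), which corresponds to the interpretation in (6).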
The interpretations in (5), (6) and (7) all feature identity between a referent introduced by the co-speech gesture and a referent introduced by speech. However, in (8) the gesture does not denote a salient property of the referents introduced in speech: instead, it qualifies the speech act of questioning (signalled by a rising intonation). A rough paraphrase of the meaning of the multimodal action in (8) would be "Are you telling me that John mixes mud?". Interpreting the gesture in this metaphorical way (see the LF in (9)), and inferring a Metatalk relation (Polanyi 1985) whose semantics is defined in terms of the speech act rather than the domain-level content, would be supported via an attachment of the co-speech gesture to the S node.
(9) $\pi_s : \mathit{question}(\exists x(\mathit{john}(j) \wedge \mathit{mud}(x) \wedge \mathit{mix}(e, j, x)))$
While the attachments we've proposed deviate from McNeill's (1992) claim that co-speech gesture is semantically related to its temporally simultaneous speech phrase, we remain agnostic about his claims (and those of others) about the underlying production processes (e.g., McNeill's claim that decisions about which contents are expressed in which channel stem from a single (complex) thought).
… and co-speech gesture. This serves as qualitative evidence for: (a) encoding the constraints on speech-gesture alignment within a grammar (rather than entirely via pragmatics); and in particular (b) suitably constraining the application of construction rules of the kind we described in the prior section. The examples we use as evidence include both constructed examples (to illustrate our judgements about ill-formedness) and examples extracted from existing corpora.

Speech-Gesture Alignment and Prosody
We begin with the constructed example (10), which reflects intuitions of native speakers about multimodal grammaticality.
The speaker puts his hand to the ear to imitate holding a receiver.
Intuitively, it seems anomalous to perform the gesture along the unaccented "called", even though the gesturing hand is shaped as holding a receiver and can thus be associated with calling. This anomaly would not arise if the gesture was performed along the whole utterance (or a part of it) which, importantly, includes the prosodically prominent element "mother": e.g., "mother called" or "your mother called". As suggested by Mark Steedman (personal communication), gestures exhibit contrastive properties in analogy to those conveyed by pitch accents. If this is so, then it's not surprising if a co-speech gesture is well-formed only if, unlike (10), it temporally overlaps with a contrastive component that's signalled via prosodic prominence (this is not to say that gesture performance is driven by prosody, but rather that their performances are mutually constraining). Further, a pragmatic interpretation where the gesture depicts calling must be sourced in a syntactic derivation where the gesture is aligned with a linguistic unit that includes "called": prosody constrains the gesture to be aligned with a phrase that includes "mother", but the event of calling is available to its interpretation only if it aligns with a phrase that includes "called" as well. Thus, just like with purely linguistic discourse, considerations about plausible pragmatic interpretations can serve to resolve syntactic ambiguities that are licensed by the construction rules in the grammar. Further, this strong relationship in (10) between the performance of the gesture and prosody is in line with the empirical findings of Giorgolo and Verstraten (2008), who isolated prosody as the parameter that influences the perception of multimodal well-formedness vs. multimodal ill-formedness.
Figure 2: Gesture Depicting "greasy", example (11) (Kendon 2004)
Considering that form (here, prosody) constrains what part of the speech signal a co-speech gesture can align with, we define alignment as a constraint on grammaticality. Ungrammatical (and hence misaligned) speech and co-speech gestures comprise cases where the timing of co-speech gesture relative to the timing of speech does not validate any construction rule in the grammar by which speech and gesture may be combined; and our aim is to ensure that such constraints on the construction rules match native speakers' judgements about ill-formedness.
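As a rough illustration of this prosodic constraint, the following sketch checks whether a gesture stroke temporally overlaps at least one pitch-accented word. The `Word` class, the function names and all timings are hypothetical values chosen for illustration; the check covers only the prosodic precondition and is not the construction rule itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float    # onset in seconds (invented timing)
    end: float      # offset in seconds (invented timing)
    accented: bool  # True if the word carries a pitch accent


def overlaps(word, stroke_start, stroke_end):
    """True if the word and the stroke share some stretch of time."""
    return word.start < stroke_end and stroke_start < word.end


def stroke_is_well_formed(words, stroke_start, stroke_end):
    """True if the stroke overlaps at least one accented word."""
    return any(
        w.accented and overlaps(w, stroke_start, stroke_end) for w in words
    )


# "Your mother called", with the accent on "mother".
utterance = [
    Word("your", 0.00, 0.20, False),
    Word("mother", 0.20, 0.60, True),
    Word("called", 0.60, 1.00, False),
]
```

A stroke confined to "called" (say, 0.65 to 0.95 seconds) fails the check, whereas a stroke spanning "mother called" passes it.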

Speech-Gesture Alignment and Syntax
To illustrate that linguistic syntax influences decisions about which phrase a co-speech gesture semantically aligns with, consider utterance (11), where the speaker is discussing new owners of a factory finding it filthy. Along with "greasy…", the speaker's hands spread out to the left and right periphery (Figure 2) so as to designate some spatial extent, some closed area being made greasy (Kendon 2004).
(11) First of all they made [pause 0.1 sec] everything [ N * gre]asy in the whole room place.
Consider how moving the timing of this gesture affects its meaning. If the gesture onset was moved a few milliseconds earlier so that it happened along "made everything greasy" or if it was held further so as to span "made everything greasy in the whole room", this would not change the interpretation of it: it still designates an enclosed area that's greasy. This interpretation would also remain unchanged if the primary pitch accent were on "everything" rather than "greasy", and the gesture temporally coincided with "everything". However, the gesture cannot receive this interpretation if it temporally coincides only with the subject NP "they" (which in turn would need to be accented for the speech-gesture action to be well-formed): now it designates a spatial referent for "they" in the gestural space, and cannot qualify the spatial extent of greasiness. These variations suggest that a gesture that temporally coincides with "they" can only semantically align with "they", but a gesture temporally coinciding with any element in a VP can semantically align with the VP, sub-portions of the VP containing the temporally coinciding words, and with the whole clause.
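The generalisation at the end of the preceding paragraph can be sketched as a function from a constituent tree and the words overlapped by the gesture to the candidate attachment sites. The bracketing of the simplified clause "they made everything greasy", the category labels and the function names are assumptions made for illustration; this is not the rule set implemented in the grammar.

```python
def collect(tree, start, in_vp, out):
    """Append (label, first, end, in_vp) for each phrase; end is exclusive."""
    label, children = tree[0], tree[1:]
    in_vp = in_vp or label == "VP"
    end = start
    for child in children:
        if isinstance(child, str):
            end += 1  # a word occupies one position
        else:
            end = collect(child, end, in_vp, out)
    out.append((label, start, end, in_vp))
    return end


def attachment_sites(tree, first, last):
    """Labels of phrases a gesture overlapping words first..last may attach to."""
    phrases = []
    collect(tree, 0, False, phrases)
    covering = sorted(
        (p for p in phrases if p[1] <= first and last < p[2]),
        key=lambda p: p[2] - p[1],
    )
    smallest = covering[0]
    if smallest[3]:
        # VP-internal overlap: every covering phrase, up to the clause.
        return [p[0] for p in covering]
    # Otherwise only the smallest phrase covering the overlapped words.
    return [smallest[0]]


# Hypothetical bracketing; words are indexed from 0.
TREE = ("S", ("NP", "they"),
        ("VP", ("V", "made"), ("NP", "everything"), ("AP", "greasy")))
```

Here a gesture overlapping only "greasy" (word 3) may attach to the AP, the VP or the S node, while a gesture overlapping only "they" (word 0) may attach to the subject NP alone.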
A special class of deictic gestures behaves differently with regard to the semantic effects of prosody and timing, however. In (12) from the annotated AMI corpus (Carletta 2007), the deictic gesture is performed along with the prominent "Thank you" but its denotation binds to that of the NP "the mouse". The alternative interpretation where the gesture signal and the speech signal are bound through a causal relationship (i.e., handing the mouse is the reason for thanking the addressee) is not possible, since it's clear in context that "Thank you" is related to what came in the previous discourse (i.e., projecting the presentation in slide show mode in response to the speaker's request).

Speaker's right hand is loosely open, index finger is loosely extended, pointing at the computer mouse
In (13) (again from the AMI corpus), the deixis happens along the nuclear accent "said", but it identifies the individual that resolves the pronoun "she" coming from speech.
(13) And a as she [N said], it's an environmentally friendly uh material
The speaker extends her arm with a loosely open palm towards the participant seated diagonally from the speaker.
In these examples, the gesture would fail to map to the intended meaning if the grammar were to license attaching a co-speech gesture only to its temporally simultaneous linguistic phrase.
Based on Lascarides and Stone (2009a), we formalise the location of the pointing hand with the constant $\vec{c}$; this marks the physical location of the tip of the index finger. This combines with the features of the pointing hand (the hand shape, the orientation of the palm and fingers, and the hand movement) to determine the spatial region $\vec{p}$ that's designated by the gesture; e.g., a stroke with an extended index finger will make $\vec{p}$ a line (or a cone) that starts at $\vec{c}$ and continues in the direction of the index finger. Abstract deixis identifies referents that are not physically salient in the communicative situation. To account for this inequality between the gestured space and actual denotation, Lascarides and Stone (2009a) use the function $v$ to map the physical space $\vec{p}$ designated by the gesture to the space $v(\vec{p})$ it denotes (and they claim that the value of $v$ is pragmatically determined). Essentially, $\vec{p}$ is not equal to $v(\vec{p})$ in cases where the referent introduced in the gesture space is not physically present. Conversely, $\vec{p}$ equals $v(\vec{p})$ when the referent introduced by the gesture is at the physical coordinates identified in the gesture space.
With this in mind, we observed in all the annotated corpora we examined⁴ that the temporal/prosodic mismatch occurred only in cases where the visible space $\vec{p}$ designated by the gesture was equal to the space $v(\vec{p})$ it denoted, i.e., the function $v$ that maps the space identified by gesture to the actually denoted space resolves to equality. So we shall capture this finding in the grammar via a construction rule that allows gesture to align with a spoken word that is not prosodically marked and/or that doesn't temporally overlap with the gesture, but only if the deictic referent is physically located at the exact coordinates identified by the pointing hand.
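The licensing condition just stated can be paraphrased as a simple check. This is a sketch only: the function name, its parameters and the encoding of regions as coordinate tuples are hypothetical, and since the value of the mapping from designated to denoted space is pragmatically determined, it is passed in here as a given.

```python
def deictic_attachment_licensed(overlaps_word, word_is_prominent, p, v_of_p):
    """True if a deictic gesture may align with the given spoken word.

    p      -- region designated by the pointing hand (hypothetical encoding)
    v_of_p -- region the gesture denotes, i.e. the value of v(p)
    """
    if overlaps_word and word_is_prominent:
        return True  # ordinary case: timing and prosody license alignment
    # Relaxed case: only when the referent is physically located at the
    # region that the pointing hand identifies, i.e. v(p) equals p.
    return p == v_of_p


# Concrete deixis: the referent is where the hand points.
mouse_region = (0.4, 0.1, 0.0)
# Abstract deixis: the gestured region stands in for a location elsewhere.
gesture_space_region = (0.5, 0.0, 0.2)
denoted_elsewhere = (12.0, 3.0, 0.0)
```

On this sketch, a point at a physically present object may align with a word that it does not overlap, while an abstract point may not.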
Bearing in mind that we are restricting our study and analysis to only those gestures that temporally overlap with speech (i.e., co-speech gestures), these examples provide evidence that their semantic alignment depends on the syntax and prosody of the speech signal, as well as the relative timing of the gesture and speech. This motivates encoding the constraints on alignment within a grammar, for this is where information about syntactic constituency is expressed. The alternative approach would be to infer speech-gesture alignment at the pragmatic level, via the commonsense reasoning that resides there for inferring which discourse units are coherently connected to which other units. But this alternative is incompatible with existing and well-established assumptions about the interface between syntax, semantics and pragmatics. For instance, our discussion of example (11) showed that the temporal relationship between the subject NP/VP boundary and the gesture profoundly affects the possible interpretations. To capture this fact, pragmatics would need access to the syntax of the speech. However, there is no formal model of pragmatics that supports that kind of architecture, without pragmatics being fully integrated into the grammar itself along the lines of Dynamic Syntax (Kempson et al. 2000).
⁴ To study depicting gestures, we used a 165-second collection of four recorded meetings, annotated for gesture events and intonation events in the ToBI framework (Loehr 2004). To study deictic gestures, we used two multimodal corpora: a 5.53 min recording from the Talkbank Data, and observation IS1008c, speaker C from the AMI corpus (Carletta 2006).
In contrast to the non-modular approach of Dynamic Syntax, we aim to maintain a conservative, well-established and modularised interface between syntax, semantics and pragmatics, so that implementations of our grammar can be supported by standard methods for computing discourse meanings (e.g., statistical discourse parsers, Afantenos et al. 2015).
Accordingly, we will develop a speech-gesture grammar using standard techniques for syntactic derivation and semantic composition, where the constraints on attaching co-speech gesture to a linguistic constituent are defined in terms of relative timing, prosody and linguistic syntax.
The examples we've discussed so far motivate allowing attachments of gesture to linguistic constituents whose timing is not identical to the timing of the gesture; we saw in Section 2.2 that making alignment equivalent to temporal simultaneity would under-generate the range of plausible pragmatic interpretations. Rather, the choices of attachment, and hence ultimately the choices of what the gesture means, are determined by the prosodic properties and constituent boundaries of the speech signal as well as relative timing.

Speech-Gesture Alignment
Given our assumptions about constrained inference in pragmatics, and also given our observations of how form affects the speech-gesture interaction, we now refine the notion of alignment as follows:
Definition 3.1. Speech-Gesture Alignment. Our choice of which speech phrase a gesture (stroke) can align with is guided by the following factors: (i) the final interpretation of the gesture in its specific context of use; (ii) the speech phrase whose content is semantically related to that of the gesture given the value of (i); and (iii) the syntactic structure that, with standard semantic composition rules, would yield a ulf supporting (i) and hence also (ii).
The derivation of the single speech-gesture syntactic structure, which is constrained by the prosody of the temporally overlapping speech signal, is achieved within the grammar. This definition encompasses both form (introduced in clause (iii)) and meaning (all three clauses). We capture semantic alignment of speech and gesture via attachment in a single syntax derivation tree, because, as shown, syntax (among other things) governs semantic alignment. If there is a choice as to which phrase a co-speech gesture can align to, then this is modelled via a combination of structural (i.e., attachment) ambiguity and semantic scope ambiguity that's licensed by the ulf so derived. The semantic effects of alignment are thus captured using standard methods of semantic composition on the derivation tree. Given the theory of pragmatics we aim to support, the construction rules combining speech and a depicting gesture introduce an (underspecified) semantic relation vis_rel(s, g) (visualising relation) between the content g of the depicting gesture and the content s of the speech constituent to which the gesture attaches, which captures the fact that speech and gesture are coherently connected (Lascarides and Stone 2009a). The (underspecified) relation that's introduced by the construction rules that combine deixis and speech is deictic_rel(s, g) (Lascarides and Stone 2009a). The resolution of these underspecified relations to a pragmatically preferred and specific value happens externally to the grammar at the semantics/pragmatics interface.⁷ In Section 6 we discuss the formal framework and in Section 7 the implementation in hpsg.

Previous Work and Contribution
This paper aims to demonstrate that informal observations about the relationship between speech-gesture form and meaning can be regimented formally, using standard techniques from linguistics. In particular, we use standard techniques for deriving logical form from a syntax tree within a grammar, while ensuring that the meaning representations so derived comply with the requirements imposed by existing formal models of pragmatics.
The idea of integrating speech and gesture within a grammar is by no means new, with several such proposals established over the past 20 years (see, inter alia, Johnston 1998a,b, Kühnlein et al. 2002, Paggio and Navarretta 2009, Giorgolo and Asudeh 2011). Further, the "constituent structure" of gesture, as well as its syntactic function for the integration within the language, has also been a matter of research (see Fricke 2008, Müller and Ladewig 2013). And the construction of meaning across speech and gesture has been the subject of analysis within construction grammars (Steen 2013).
But there are a few main differences between this prior work and our approach. First, we claim that the speech phrase that gesture aligns with is not determined uniquely by when the gesture was performed. Whilst the time feature matters, we also constrain alignment via prosody and syntactic notions such as headedness. Further, in contrast to these prior grammars, we aim for a domain independent analysis, and so we must fully capture all linguistically licensed semantic alignments between speech and co-speech gesture, rather than only those that are plausible in the chosen domain of application. The other main difference lies in the semantic component of the grammar. In particular, we draw on recent advances in deriving an Underspecified Logical Formula (ulf), which allows the grammar developer to capture semantic ambiguity in the absence of syntactic ambiguity. The above grammatical approaches all assume that every semantic ambiguity corresponds to a syntactic ambiguity.
There are previous semantic analyses of gesture (Lücking et al. 2006b; Lascarides and Stone 2009a) that assume a grammar produces an underspecified meaning representation: these theories focus on how contextual information contributes to mapping the underspecified meaning that's derived from form into a fully specific and pragmatically preferred interpretation. Our work contributes to this by providing a grammar framework that produces the form-meaning mappings they assume. In doing so, we not only capture informal observations about gestural ambiguity, but our formal model uses well-established methods from linguistics to produce a meaning representation that is compliant with current models for multimodal processing at the semantics/pragmatics interface.
To achieve that, we perform two dependent tasks: first, we extract generalisations from the existing literature and from our own observations in annotated multimodal corpora about the syntactic and semantic well-formedness of speech-gesture signals; second, we use the extracted generalisations to define a precise grammar that models the form of the speech, the form of the gesture and the form of their combination, producing ulfs of speech and gesture using standard methods of syntactic derivation and semantic composition from linguistics. We also demonstrate that the grammar can be implemented by extending an existing linguistic grammar.

Mapping Gesture Form to Meaning

Modelling Gesture Form
One major difference between speech and gesture is how the meaning gets derived from the form of the signal. Gestures are 'global' and 'synthetic' (McNeill 1992), i.e., the meanings of the various features of a gesture's form (such as the direction of the movement, the hand shape, the location of the hands, etc.) determine the meaning of the gesture as a whole. This is unlike the semantic compositionality via natural language syntax. Following previous work (Kopp et al. 2004, Lascarides and Stone 2006, Hahn and Rieser 2010, inter alia), we regiment this difference by using Typed Feature Structures (tfs) since they support a non-hierarchical representation of the distinct aspects of the gesture's form. The gesture type designates its category: e.g., depict-literal for literally depicting gestures (Figure 3) and deictic-abstract for abstract deixis (Figure 4), of the kind exhibited in (14).
The feature-value pairs of a depicting gesture capture every aspect of the form of the hand that (potentially) contributes to its meaning: the hand shape, the orientation of the palm and fingers, the location of the hand relative to the speaker's torso and the hand movement. With deictic gestures, the shape of the hand determines the region of space that is identified by the pointing hand: e.g., an extended index finger identifies a line or a cone that starts from the tip of the index finger; with a vertical open hand, the designated region is a plane. Recording the form of the pointing hand is essential, because prior work shows that it is significant for interpreting its meaning in context (Kendon 2004): e.g., an extended index finger typically singles out an individuated object while a vertical open hand typically denotes a class of objects rather than an individuated object, or it serves a pragmatic function such as offering the floor or citing someone else's contribution to the discourse. The hand location of a deictic gesture is represented via the constant $\vec{c}$.
This, combined with the deixis form features, determines the region $\vec{p}$ actually marked by the gesture.
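The flat, non-hierarchical feature structure for a depicting gesture described above can be illustrated with a small sketch. This is not the paper's tfs encoding: the class name `DepictingGestureForm` and its field names are hypothetical, and the values are simply read off the predicate names that the paper gives for the mixing gesture in (18).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DepictingGestureForm:
    """Flat record of a depicting gesture's form (hypothetical encoding)."""
    gesture_type: str    # category of the gesture, e.g. "depict-literal"
    hand_shape: str
    palm_orient: str
    finger_orient: str
    hand_location: str
    hand_movement: str

    def feature_value_pairs(self):
        """Return the form features as (feature, value) pairs, without the type."""
        return [(f.name, getattr(self, f.name))
                for f in fields(self) if f.name != "gesture_type"]


# Values assumed from the predicate names in (18) for the mixing gesture.
mixing_gesture = DepictingGestureForm(
    gesture_type="depict-literal",
    hand_shape="bent",
    palm_orient="towards_down",
    finger_orient="towards_down",
    hand_location="lower_periphery",
    hand_movement="circular",
)
```

Because no feature is embedded under another, every feature-value pair is equally available as a potential contributor to the gesture's meaning, which is the property the tfs representation is chosen for.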

Modelling Meaning
As we've already highlighted, a well-established method for handling cases where form does not fully determine meaning is semantic underspecification. All frameworks for semantic underspecification-e.g., Quasi-Logical Form (Alshawi 1992), Underspecified Discourse Representation Theory (Reyle 1993), the Constraint Language for Lambda Structures (Egg et al. 2001), Hole Semantics (Bos 2004), Minimal Recursion Semantics (Copestake et al. 2005), Regular Tree Grammars (Koller et al. 2008)-construct from a fully disambiguated form an abstract representation of meaning that can resolve to several distinct specific messages in context, rather than deriving those specific representations from syntax directly, and assuming a syntactic ambiguity for every semantic ambiguity. Technically, the ulf derived by syntax partially describes the form of a fully specific logical form, which in turn represents a context-specific interpretation which can be evaluated against a model or the actual situation at hand.
To map the form of the gesture to an underspecified meaning representation, we use the underspecification formalism of Robust Minimal Recursion Semantics (rmrs, Copestake 2007)-a factorised version of erg's semantic framework, Minimal Recursion Semantics (mrs, Copestake et al. 2005). rmrs was originally developed to support the integration of deep and shallow processing. Modelling gesture is somewhat akin to shallow processing in that one has to handle the large degree of underspecificity.
To illustrate it, consider the mrs for "every dog chased some cat" in (15). Here, the semantic scope ambiguities are captured by the so-called qeq ($=_q$) constraints, which allow for two alternative fully scoped formulas.
(15)
\[
l_1 : \mathit{every}(x_0, h_3, h_1) \quad l_{11} : \mathit{dog}(x_1) \quad l_2 : \mathit{some}(y_0, h_4, h_2) \quad l_{21} : \mathit{cat}(y_1) \quad l_3 : \mathit{chase}(e_1, x_2, y_3) \quad h_3 =_q l_{11},\; h_4 =_q l_{21}
\]
While mrs underspecifies scope, it still requires a fully specified predicate-argument structure. However, neither shallow language processors nor gestural form on their own can fully determine a unique predicate-argument structure. Refining mrs to rmrs solves this. One simply produces a highly factorised representation of each elementary predication: each one is equipped with its own unique anchor ($a$), which serves as a locus for specifying the predicate's arguments; equations (e.g., $x_0 = x_1 = x_2$) are also added to express unifiability between variables. So (16) is a notational variant of (15).
(17)
\[
l_1 : a_1 : \mathit{every}(x_0) \quad l_{11} : a_{11} : \mathit{dog}(x_1) \quad l_2 : a_2 : \mathit{some}(y_0) \quad l_{21} : a_{21} : \mathit{cat}(y_1) \quad l_3 : a_3 : \mathit{chase}(e_1)
\]
Semantic composition with rmrs follows the semantic algebra of Copestake et al. (2001): the predications and qeq constraints on the mother are accumulated from those in the daughters, and the semantic head daughter has its 'hook' (roughly equivalent to a λ-term) replaced by the semantic index of the non-head.
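The factorised representation described above can be sketched as data: each elementary predication carries its own label, anchor and a single variable, while qeq constraints and variable equations are stored separately. The names `EP` and `VarEqualities` are hypothetical, and the union-find bookkeeping is just one convenient way to record equations such as $x_0 = x_1 = x_2$; it is not taken from the rmrs literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EP:
    """Factorised elementary predication: label, anchor, predicate, one variable."""
    label: str
    anchor: str
    pred: str
    arg0: str


class VarEqualities:
    """Union-find over variable names, recording equations between variables."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def equate(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def same(self, a, b):
        return self.find(a) == self.find(b)


# The predications of (17), with the qeq constraints of (15) kept apart.
eps = [
    EP("l1", "a1", "every", "x0"),
    EP("l11", "a11", "dog", "x1"),
    EP("l2", "a2", "some", "y0"),
    EP("l21", "a21", "cat", "y1"),
    EP("l3", "a3", "chase", "e1"),
]
qeqs = [("h3", "l11"), ("h4", "l21")]

# The example equation from the text: x0 = x1 = x2.
equalities = VarEqualities()
equalities.equate("x0", "x1")
equalities.equate("x1", "x2")
```

Keeping arguments and equations outside the predications is what lets a processor, or a gesture, contribute a predicate without committing to its full argument structure.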

Depicting Gestures
Following Lascarides and Stone (2009a), mapping the form of a depicting gesture to its meaning involves mapping each feature-value pair in the tfs representing its form to an rmrs-based underspecified predication: the ulf of the gesture from Figure 3 is shown in (18).
(18)
\[
\begin{array}{l}
l_0 : a_0 : [\,](h) \\
l_1 : a_1 : \mathit{hand\_shape\_bent}(i_1) \\
l_2 : a_2 : \mathit{palm\_orient\_towards\_down}(i_2) \\
l_3 : a_3 : \mathit{finger\_orient\_towards\_down}(i_3) \\
l_4 : a_4 : \mathit{hand\_location\_lower\_periphery}(i_4) \\
l_5 : a_5 : \mathit{hand\_movement\_circular}(i_5) \\
h =_q l_n \text{ where } 1 \le n \le 5
\end{array}
\]
Each predicate has a label, an anchor, and a semantic index, as is standard in rmrs. Since a predication mapped from depicting gesture could resolve in context to an event $e$ or an individual $x$, its semantic index is a metavariable $i$ that generalises over $e$ or $x$. The predicate symbols underspecify the particular constructor and its arity in the lf. For instance, a feature-value pair like hand-movement circular would map to $l_1 : a_1 : \mathit{hand\_movement\_circular}(i)$. Resolving these predicates happens outside the grammar as a byproduct of discourse processing (Lascarides and Stone 2009a). In particular, each underspecified predicate (such as $\mathit{hand\_movement\_circular}(i)$) is a root to a type hierarchy of increasingly specific predications of content. This is roughly analogous to constructing a specific lexical meaning out of a polysemous lexical entry (Copestake and Briscoe 1995), but here the type hierarchy captures constraints on interpretation that are imposed by the requirement for iconicity, i.e., a resemblance between the form of the gesture and its meaning. This type hierarchy is designed so that a circular hand movement can never resolve to, say, a rectangular concept. To illustrate the idea, in Section 2.2 we claimed that one of the interpretations of the circular hand movement in (1) was the mud being mixed. This is achieved by resolving $\mathit{hand\_movement\_circular}(i)$ to a conjunction of predications, $\mathit{substance}(x') \wedge \mathit{rotate}(e', x')$, which is a node in the type hierarchy that's rooted at $\mathit{hand\_movement\_circular}(i)$, and is featured in (5).
In an alternative interpretation this hand movement is a depiction of the mixing event from the agent's viewpoint: i.e., the underspecified predicate $\mathit{hand\_movement\_circular}(i)$ can resolve to the three-place predicate $\mathit{rotate}(e', j', x')$, featured in (7).
Further, recall from Section 2.1 the constraint that an individual that is introduced in a depicting gesture can't be an antecedent to a pronoun in speech. Lascarides and Stone (2009a) regiment this constraint by introducing the scopal operator [ ]: all predicates mapped from depicting gesture fall within its scope (via the scopal condition $h =_q l_n$), and the dynamic semantics Lascarides and Stone assign to [ ] ensures that co-reference across the modalities is suitably constrained.
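The mapping from form to ulf described in this section can be sketched as a simple loop over feature-value pairs. The function name `gesture_form_to_ulf`, the tuple layout and the `RESOLUTIONS` table are hypothetical; the table lists only the two resolutions of the circular movement that the text mentions, and stands in for the much richer type hierarchy assumed by Lascarides and Stone (2009a).

```python
def gesture_form_to_ulf(feature_values):
    """Map (feature, value) pairs to underspecified predications.

    Returns (predications, qeqs). Each predication is a tuple
    (label, anchor, predicate, variable); each qeq is (hole, label).
    """
    # l0 : a0 : [ ](h) is the scopal operator that outscopes the gesture content.
    predications = [("l0", "a0", "[ ]", "h")]
    qeqs = []
    for n, (feature, value) in enumerate(feature_values, start=1):
        predications.append((f"l{n}", f"a{n}", f"{feature}_{value}", f"i{n}"))
        qeqs.append(("h", f"l{n}"))  # h =q l_n
    return predications, qeqs


# Toy stand-in for the type hierarchy: only resolutions named in the text.
RESOLUTIONS = {
    "hand_movement_circular": [
        "substance(x') & rotate(e', x')",  # the mud being mixed
        "rotate(e', j', x')",              # mixing from the agent's viewpoint
    ],
}

form = [
    ("hand_shape", "bent"),
    ("palm_orient", "towards_down"),
    ("finger_orient", "towards_down"),
    ("hand_location", "lower_periphery"),
    ("hand_movement", "circular"),
]
ulf_predications, ulf_qeqs = gesture_form_to_ulf(form)
```

The grammar would stop at `ulf_predications`; choosing among the entries of `RESOLUTIONS` belongs to discourse processing, outside the grammar.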

Deictic Gestures
The mapping of deixis form to a ulf captures the fact that deixis provides the spatial reference of an individual or event in the physical space $\vec{p}$ (the complete rmrs logical form mapped from the gesture in Figure 4 is shown in (19)). This is formalised by the two-place predicate $l_{21} : a_2 : \mathit{sp\_ref}(i_1)$, $l_{21} : a_2 : \mathit{ARG1}(v(\vec{p}))$, whose first argument is the underspecified variable $i_1$, and whose second argument ARG1, linked through the anchor $a_2$, is the actually denoted space $v(\vec{p})$, with $v$ being the function that maps the gesture space to the space in denotation (recall the discussion in Section 3.2). The ulf is only a partial description of the resolved lf: e.g., resolving the underspecified referent $i_1$ to an object $x$ and inferring a relation between the deixis denotation and the speech denotation is a matter of pragmatic reasoning. Note how in the prior interpretation of $\mathit{hand\_movement\_circular}(i)$, $i$ resolves to an individual $x$, but here it resolves to an event $e$. To capture how the form of the pointing hand affects its meaning, we map each deixis feature-value pair to a two-place predicate, with the first argument being an event variable ($e_0 \ldots e_n$) and the second argument arg1 being the referent identified by the pointing signal ($i_0 \ldots i_n$). This formalisation is similar to the treatment of non-scopal modification in the English Resource Grammar (erg, Flickinger 2000): a deictic predication (as mapped from form) is a two-place predication whose second argument arg1 is equated with the semantic index of the modified predication, obtained by equating $i_0 = i_1 = i_2 = i_3 = i_4 = i_5 = i_6$, and whose label is equated with the label of the modified predication, obtained via $l_{21} = l_{22} = l_{23} = l_{24} = l_{25} = l_{26}$. For consistency with erg, where individuals are all bound by quantifiers, we use the deictic_q quantifier to quantify over the spatial referent $i_1$.

Prosodic Word and Gesture Alignment
We begin with the straightforward case where gesture aligns with a single lexical item:
Construction Rule 6.1. Situated Prosodic Word Constraint. A depicting or deictic gesture can attach to a spoken word w of a spoken utterance if (a.) there is an overlap between the temporal performance of the gesture stroke and w; and (b.) w bears a nuclear or a pre-nuclear pitch accent.
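As a minimal sketch, the two conditions of Construction Rule 6.1 can be read as a Boolean check over time intervals and accent labels. The names `Interval`, `Word` and `licenses_word_attachment` are hypothetical, as are the time stamps, which are invented for illustration and not taken from the corpus; the accent labels follow the transcription conventions (N for nuclear, PN for pre-nuclear).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    start: float  # seconds
    end: float

    def overlaps(self, other: "Interval") -> bool:
        """True if the two intervals share some stretch of time."""
        return self.start < other.end and other.start < self.end


@dataclass(frozen=True)
class Word:
    form: str
    time: Interval
    accent: Optional[str] = None  # "N", "PN", "NN" or None


def licenses_word_attachment(stroke: Interval, word: Word) -> bool:
    """Rule 6.1: (a) stroke overlaps the word; (b) nuclear or pre-nuclear accent."""
    return stroke.overlaps(word.time) and word.accent in {"N", "PN"}


# Invented timings for "mixes [N mud]" in example (1).
mixes = Word("mixes", Interval(0.40, 0.75))
mud = Word("mud", Interval(0.75, 1.10), accent="N")
stroke = Interval(0.55, 1.00)
```

Under these assumed timings the stroke overlaps both words, but only "mud" satisfies condition (b), so only "mud" is a licensed attachment site for this rule.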
We represent the multimodal rules as phrase structure rules equipped with the following information (Figure 5): the speech daughter s-dtr and the gesture daughter g-dtr each introduce a time feature, a synsem|cat feature which captures its syntactic category (note that for gestures, this information includes the form feature-value pairs, discussed in Section 5.1) and a synsem|cont feature which captures its (underspecified) semantic contribution. The speech daughter also introduces a phon feature which captures the phonological information. The construction rule introduces a feature overlap whose values are re-entrant with values in the temporal components of the daughters; and also a time feature which is the union of the speech daughter's value and the gesture daughter's value. In so doing, we follow previous work where timing is used as a constraint on the integration (Johnston et al. 1997). As is standardly done in erg, the semantic contribution of the construction rule is captured within ccont: here, a depicting gesture introduces an underspecified relation vis_rel between the main label of the gesture semantics and the main label of the semantics of the spoken phrase; the underspecified relation introduced by deixis is deictic_rel between the semantic index of the speech daughter and the semantic index of the gesture daughter. Multimodal integration happens via unification of these features. Given the different form-meaning mappings of depicting vs. deictic gestures, we will now provide separate analyses for the two gesture types.
Figure 5: hpsg-based formalisation of the Situated Prosodic Word Constraint aligning gesture and a spoken word

Situated Prosodic Word Constraint and Depicting Gesture
To illustrate how the Situated Prosodic Word Constraint works with depicting gestures, consider again example (1). The nuclear accent is on the rightmost word "mud", which licenses an attachment of the gesture to it using Construction Rule 6.1. The derivation, which attaches the gesture to "mud", is shown in Figure 6. The prosodic phon and syntactic cat information of the speech head daughter gets propagated to the mother node. We do not propagate the gesture form features to the mother node since we do not need to access gesture form any further. The timing of the situated utterance is recorded in the mother's time value. This information is necessary in case the (situated) word aligns with another gesture.
The semantic composition follows the standard English Resource Grammar (erg) process, namely: the individual semantic formulae are decorated with a global label ($h_1$) which demonstrates the derivation of a single lf. Each formula is also augmented with a hook containing the ltop label (equated to the label of the main predication) and the semantic index. The ltop of the predicate contributed by the speech daughter, $l_6 : a_6 : \mathit{\_mud\_n\_1}(x_1)$, is $l_6$ and the index is $x_1$. The ltop of the gesture daughter is equated to the label of the modality, $l_0$. Regarding the gesture semantic index, the gesture lf is too underspecified to know which of the semantic predications will resolve to the main variable and hence at this stage we have no information as to which is the semantic index of the formula. We therefore use $i_{1-5}$ as a shorter notation for a disjunction of co-indexations to reflect the fact that the underspecified variable $i_1 \ldots i_5$ of each gesture predicate could potentially resolve to the main variable: event $e$ or individual $x$.
Note that the semantic representation cont of the situated utterance features the underspecified relation vis_rel between the top label $l_6$ of the speech daughter and the top label $l_0$ of the gesture daughter to designate that the speech and gesture are coherently connected. In rmrs, labels denote the scopal position of an elementary predication. We therefore code the arguments of vis_rel as s-lbl and g-lbl to designate that their values are labels of spoken and gestural predications, respectively. As illustrated in Section 2.1, vis_rel is resolvable at the semantics/pragmatics interface to a specific value (e.g., Depiction, Elaboration) that is dependent on resolving the gestural denotation. Here, the attachment to "mud" would support an interpretation where the gesture designates some substance and the fact that it was going round, which in turn would resolve vis_rel to Elaboration, as featured in the lf in (5). The truth conditional contribution of the gesture will thus ultimately be roughly analogous to an appositive or a non-restrictive relative clause modifying the noun. Note that given constraints on reference at the semantics/pragmatics interface, this attachment blocks the gesture referring to anything that is bridging related to "mixes" or "he". The cont of the mother is obtained by equating the top of the mother to the top of the daughters. The relations (abbreviated as rl) of the situated phrase are equal to the append of the predications of the gesture daughter $G_{sem}$ and the speech daughter $N_{sem}$, and also vis_rel. Further, vis_rel introduces a multimodal argument m-arg which serves as a semantic index of the integrated speech-gesture signal (the hook's index is therefore equated to the index of m-arg, $x_2$), and so it can be taken as an argument by any external predicate.
Here, for instance, the verb "mix" would take two arguments: arg1, corresponding to the subject, would be identified with arg0 of "he", and arg2, corresponding to the object, would be identified with m-arg of the situated word, consisting of "mud" and the gesture.
Figure 6: Derivation Tree for Depicting Gesture and the N "mud"
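The composition step just described (append the daughters' predications, add vis_rel, and expose m-arg as the index of the situated word) can be sketched as follows. The function name `compose_situated_word` and the dictionary layout are hypothetical simplifications; in particular, the sketch does not model unification or the treatment of ltop in an actual hpsg implementation.

```python
def compose_situated_word(speech_rels, speech_ltop, gesture_rels, gesture_ltop, m_arg):
    """Build a simplified cont for a situated word (speech plus depicting gesture)."""
    # The construction rule contributes vis_rel between the two top labels.
    vis_rel = {
        "pred": "vis_rel",
        "S-LBL": speech_ltop,
        "G-LBL": gesture_ltop,
        "M-ARG": m_arg,
    }
    return {
        # The hook's index is the multimodal argument introduced by vis_rel.
        "hook": {"index": m_arg},
        # Relations: gesture predications, speech predications, then vis_rel.
        "rels": list(gesture_rels) + list(speech_rels) + [vis_rel],
    }


gesture_rels = [
    "l0:a0:[ ](h)",
    "l1:a1:hand_shape_bent(i1)",
    "l2:a2:palm_orient_towards_down(i2)",
    "l3:a3:finger_orient_towards_down(i3)",
    "l4:a4:hand_location_lower_periphery(i4)",
    "l5:a5:hand_movement_circular(i5)",
]
speech_rels = ["l6:a6:_mud_n_1(x1)"]

situated_mud = compose_situated_word(speech_rels, "l6", gesture_rels, "l0", "x2")
```

A verb such as "mix" would then identify its arg2 with `situated_mud["hook"]["index"]`, which is how the combined speech-gesture unit becomes available as an argument to an external predicate.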

Situated Prosodic Word Constraint and Deictic Gesture
We illustrate the syntactic derivation and the semantic composition for deixis and a spoken word using utterance (14). The derivation tree is shown in Figure 7. The Situated Prosodic Word Constraint licenses an attachment of the deictic gesture to the verb "enter": it is marked by a pre-nuclear accent, and it temporally overlaps the gesture.
The semantic composition proceeds in the same way as with depicting gestures. Since the gesture semantics features a quantifier (deictic_q), the local top of the gesture is distinct from the label of the quantifier. The semantic index is the underspecified variable $i_1$ bound by sp_ref. In composition, the deixis semantic predicates (as shown in (19)) append to the semantic predicate $V_{sem}$ of the speech daughter: $l_4 : a_9 : \mathit{\_enter\_v\_1}(e_5)$, $l_4 : a_9 : \mathit{ARG1}(u_1)$, $l_4 : a_9 : \mathit{ARG2}(u_2)$. In so doing, the underspecified semantic index $i_1$ of the deixis unifies with the semantic index $e_5$ of the speech, and so the underspecified gesture variable $i_1$ of $\mathit{sp\_ref}(i_1)$ resolves to an event ($e_7$).
Figure 7: Derivation Tree for Deictic Gesture and the V "enter"
Like depicting gestures, deictic gestures are connected in semantics to their aligned speech via an (underspecified) relation. The construction rule therefore introduces the underspecified relation $\mathit{deictic\_rel}(e_5, e_7)$ between the semantic index $e_5$ of the speech predication and the semantic index $e_7$ of the deictic gesture. Pragmatics must then resolve this relation to a specific value: one possible resolution would be VirtualCounterpart, i.e., the deictic gesture denotes a virtual counterpart of the coordinates of entering the apartment door. Similarly to the treatment of non-scopal modification in language, this relation shares the same label as the speech head daughter since it further restricts the referent introduced by the gesture. Informally, the gesture here functions as an appositive in language and a rough linguistic paraphrase is "the entering event, the event at the coordinates pointed at".

Speech Phrase and Gesture Alignment
One of our central claims is that ambiguities as to which speech phrase a co-speech gesture aligns with are best modelled as attachment ambiguities within the grammar. As we demonstrated in Section 2.2, the relative timing of speech and gesture is not the only constraint on using such construction rules; also, temporal constraints should be weaker than simultaneity, contrary to McNeill (1992). Rather, we argued that the gesture should temporally overlap with its aligned speech (if it didn't, then by definition it wouldn't be co-speech gesture!) and furthermore temporally overlap with an accented element in the (aligned) speech unit. Thus a single utterance such as (1) or (14) can licence different speech-gesture alignments, each of them supporting a distinct range of plausible pragmatic interpretations in accordance with constraints on reference (see Section 2.1). Likewise, it is perfectly acceptable for the gesture in (1) to be performed only while uttering the accented word "mud", and still interpret the gesture in all the ways proposed in Section 2.2. In this section we provide the formal methodology of how to arrive at these interpretations.
As proposed in Section 2.2, we introduce construction rules that allow a gesture to align with an entire constituent (that is, a head combined with its arguments), in contrast to Rule 6.1, which aligns gesture with a (temporally overlapping, accented) word. From a descriptive perspective, the inclusion of more context into the speech aligned with gesture is grounded in the "synthetic" nature of gesture versus the "analytic" nature of the spoken words (McNeill 2005). For instance, in example (1) the information about the direction of the mixing event (i.e., clockwise, downwards) and the manner of performing the mixing action (i.e., using the entire hand) are denoted by a single visual performance and by several linearly ordered lexical items ("mixes", "mud"). For the purposes of a multimodal grammar it is essential to distinguish between temporal synchrony and alignment: whereas the former is a quantitative measurement of when the two modalities happen, the latter is a qualitative, linguistic notion pertaining to the syntax tree of speech and gesture and the meaning representation it corresponds to. By setting apart these two notions, we also ensure that the physical termination of the gesture does not enable attachment to a midpoint of a speech constituent.
With all this in mind, we now define the construction rule that allows a gesture to attach to a constituent larger than a single prosodic word:
Construction Rule 6.2. Situated Spoken Phrase Constraint. A depicting or deictic gesture can attach to any of the higher projections in the derivation tree of the nuclear/pre-nuclear accented element that also form a syntactic and/or prosodic constituent xp, whatever its syntactic label, if there is an overlap between the temporal performance of the gesture stroke and xp.
The attachment of the gesture to any projection in the tree would allow for saturating the head with its selected arguments before the attachment takes place. This means that the attachments are licensed at each saturation step. In this way, we account for the fact that gesture can co-refer to any or all of these arguments in the fully resolved pragmatic interpretation. Note also that Rule 6.2 uses 'syntactic and/or prosodic constituent' to refer to any phrase of a hierarchical organisation: prosodic or syntactic. Assuming an analysis where there is no isomorphism between syntax and prosody, this flexibility is necessary whenever there are mismatches between prosodic structure and syntactic structure.⁸
Since the attachments of depicting gesture to a speech phrase are analogous to the attachments of deixis to the speech phrase, we illustrate the possible attachments using the depicting gesture in utterance (1). Recall from Section 2.2 that the resolved lfs for this speech-gesture action featured coherence relations between: (i) the NP's denotation and the 'rotating' gesture, and (ii) the VP's (or S's) denotation and the 'rotating' gesture. We discussed (i) in the previous section and we therefore forego any further details about it. Given Construction Rule 6.2, interpretation (ii) is supported as follows: attach the gesture to the VP "mixes mud" (or to the S "he mixes mud"). In both cases, the gesture stroke temporally overlaps the nuclear prominent "mud", and so the gesture can attach to its VP projection or S projection. Both of these attachments force the gesture to qualify "mixes" (for the second argument to the underspecified coherence relation that's introduced by the construction rule must outscope $\mathit{mix}(e, y, x)$). They underspecify, however, the relative scope of the coherence relation with respect to the predications $\mathit{mud}(x)$ and $\mathit{pron}(y)$.
If these resolve to being within the scope of the coherence relation, then the resolved interpretation of the gesture can co-refer to he and to the mud; if not, it can't.
Further to this, we claimed that utterance (10) was ill-formed since the gesture was performed along a non-accented item in an all-rheme utterance. Having introduced the construction rules 6.1 and 6.2, we are now in a position to account for the utterance's ill-formedness: the form of (10) doesn't meet the constraints for either of our construction rules. On the other hand, if the gesture was performed in a way that temporally overlaps the prosodic word "mother", then the rules we've proposed license attachments to the N "mother", the NP "your mother" and even to the S "your mother called".
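The attachment options that Rule 6.2 licenses can be sketched as a walk up the derivation tree from an accented word that the stroke overlaps. This is a simplification under stated assumptions: "higher projections" are taken to be all ancestors of the accented word, only syntactic constituents are modelled (not prosodic ones), and the tree, the labels and the time stamps are invented for illustration. The names `Node`, `overlaps` and `attachment_sites` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Node:
    label: str
    span: Span
    accent: Optional[str] = None  # "N" or "PN" on accented words
    children: List["Node"] = field(default_factory=list)


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def attachment_sites(root: Node, stroke: Span) -> List[str]:
    """Labels of phrases above an accented, stroke-overlapping word (Rule 6.2)."""
    sites: List[str] = []

    def visit(node: Node, ancestors: List[Node]) -> None:
        is_word = not node.children
        if is_word and node.accent in {"N", "PN"} and overlaps(node.span, stroke):
            # Walk upwards: lowest projection first, root last.
            for phrase in reversed(ancestors):
                if overlaps(phrase.span, stroke) and phrase.label not in sites:
                    sites.append(phrase.label)
        for child in node.children:
            visit(child, ancestors + [node])

    visit(root, [])
    return sites


# "he mixes [N mud]" with invented timings.
mud = Node("N", (0.7, 1.1), accent="N")
tree = Node("S", (0.0, 1.1), children=[
    Node("PRON", (0.0, 0.3)),
    Node("VP", (0.3, 1.1), children=[
        Node("V", (0.3, 0.7)),
        Node("NP", (0.7, 1.1), children=[mud]),
    ]),
])
```

With a stroke of `(0.5, 1.0)`, the sketch yields the NP, VP and S attachments discussed above; a stroke that overlaps only an unaccented word, as in the ill-formed (10), yields no attachment site at all.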

Spoken Word and Gesture Alignment: Temporal and Prosodic Relaxation
The two construction rules we've proposed allow a co-speech gesture to align with a prosodic word or with a constituent that contains prosodic element(s) that overlap the temporal performance of the gesture. These constructions, however, are not sufficient as they do not reflect an important finding from our data. We used examples (12) and (13) to illustrate that when the referent of the deictic gesture is visually salient, the deictic gesture does not necessarily overlap a prosodically prominent word and/or temporally overlap the semantically related word. The following rule takes this into account.

Construction Rule 6.3. Deictic Prosodic Word with Defeasible Constraint. The constraints on temporal overlap in 6.1 and 6.2 are defeasible, i.e., a deictic gesture attaches to a word that is not prosodically prominent and/or whose temporal performance is adjacent to that of the deictic stroke if: (a.) the mapping $v$ from gestured space $\vec{p}$ to space in denotation $v(\vec{p})$ resolves to equality; and (b.) the temporal performance of the gesture overlaps (some portion of) the spoken utterance containing the word.
This temporal/prosodic relaxation rule integrates a defeasible constraint with a view to producing lfs that in context resolve to the intended meaning. As attested by (13),⁹ the relaxation of this constraint depends on the salience of co-present individuals and it is thus necessary only in utterances where the gesture denotation is physically present in the visible space, i.e., there is an equality between the physical space that the hand points at and the gesture referent. This rule accounts for the fact that certain characteristics of the context (i.e., salience of the individual pointed at) are required for the rule to apply. Otherwise, the interpretation could be infelicitous. Similar issues occur with deictic expressions and other referential expressions which require a salient individual in context for the utterance to be felicitous (see Lücking et al. 2006a).
Note also that this rule constrains the alignment to temporal overlap between (some portion of) the utterance and the gesture. This means that the grammar does not handle gestures performed either before or after the temporal performance of the utterance since anything beyond the clausal level is a matter of relating discourse units. For instance, while the temporal overlap between the gesture and the speech signal in (13) takes care of aligning the gesture and the semantically related element-i.e., "she" in (13)-the gesture in (12) does not overlap any portion of the utterance containing "mouse" and hence the grammar rule cannot attach the gesture to the noun "mouse". Similarly to relating purely linguistic discourse segments, relating the ges-9 Many more examples can be found in the AMI corpus.
[ 34 ] ture in (12) with the noun "mouse" is a matter of discourse processing that lies outwith the scope of the (syntactic) grammar.
With this constraint in mind, let us examine the possible derivations of utterance (13). The Situated Prosodic Word 6.1 would license attachments to the temporally overlapping prosodically prominent "said". Although syntactically well-formed, this attachment would not produce the contextually preferred (and the most intuitive) interpretation: namely, an identity between the gesture referent and the speech referent. An alternative attachment is provided by Construction Rule 6.3: the deictic gesture may attach to "she" thereby providing an interpretation where the gesture denotation is identical to the denotation of the pronoun "she".
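The choice between the two attachments can be illustrated with a minimal Python sketch. It is not the grammar implementation: the class names, the `referent_visible` flag (standing in for condition (a), that the mapping v resolves to equality) and the time stamps are assumptions made for illustration, and Construction Rule 6.2 is not modelled. The sketch also simplifies 6.3 by letting every word of an overlapped utterance count as a candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    start: float
    end: float
    prominent: bool  # True if the word carries a pitch accent


@dataclass(frozen=True)
class DeicticGesture:
    start: float
    end: float
    # Assumption: stands in for condition (a), i.e. the pointed-at
    # space is identical to the space occupied by the referent.
    referent_visible: bool


def overlaps(a_start, a_end, b_start, b_end):
    """True if the two intervals share some portion of time."""
    return a_start < b_end and b_start < a_end


def attachment_candidates(words, gesture):
    """Return (word form, licensing rule) pairs for a deictic gesture."""
    if not words:
        return []
    utt_start = min(w.start for w in words)
    utt_end = max(w.end for w in words)
    # Conditions (a) and (b) of the defeasible rule 6.3.
    relaxed = gesture.referent_visible and overlaps(
        gesture.start, gesture.end, utt_start, utt_end
    )
    candidates = []
    for w in words:
        strict = w.prominent and overlaps(
            w.start, w.end, gesture.start, gesture.end
        )
        if strict:
            candidates.append((w.form, "6.1"))
        elif relaxed:
            candidates.append((w.form, "6.3"))
    return candidates


# Hypothetical timings loosely modelled on the discussion of (13).
words = [Word("she", 0.0, 0.2, False), Word("said", 0.2, 0.5, True)]
pointing = DeicticGesture(0.25, 0.5, referent_visible=True)
print(attachment_candidates(words, pointing))
```

With these hypothetical values, both "said" (via 6.1) and "she" (via 6.3) are candidates. A gesture performed entirely outside the utterance yields no candidates, which mirrors the treatment of (12) as a matter for discourse processing.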

implementation and evaluation
The main challenge for the grammar implementation stems from the non-linear input of speech-and-gesture actions. Existing grammar engineering platforms for unification-based grammars typically only parse linearly ordered strings, and so they do not handle multimodal signals whose input comes from separate channels connected through temporal relations. Also, these parsing platforms do not support quantitative comparison operations over the time stamps of the input tokens. Such comparisons are essential for our grammar, since temporal overlap constrains the choices of attachment.
To solve this, we pre-processed the XML-based Feature Structure (FS) input so that overlapping time values were 'translated' into identical start and end edges of the speech token and the gesture token as follows:
    <edge source="v0" target="v1"> <fs type="speech_token">
    <edge source="v0" target="v1"> <fs type="gesture_token">
This pre-processing step is sufficient since the only temporal relation required by the grammar is overlap, an abstraction over more fine-grained relations between speech (S) and gesture (G) such as (precedence(start(S), start(G)) ∧ identity(end(S), end(G))).
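The translation from time stamps to chart edges might be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the pre-processing script used for the grammar: the function name `assign_edges`, the tuple-based token format and the time stamps are hypothetical. Speech tokens are assumed to be listed in utterance order without overlapping one another.

```python
def assign_edges(speech, gesture):
    """Map time-stamped tokens onto chart vertices.

    speech:  list of (form, start, end) tuples in utterance order.
    gesture: (start, end) tuple for the gesture stroke.
    Returns the speech edges and the gesture edge; the gesture edge is
    None if the gesture overlaps no speech token.
    """
    # Speech tokens occupy consecutive chart edges v0-v1, v1-v2, ...
    speech_edges = [
        (form, f"v{i}", f"v{i + 1}") for i, (form, _, _) in enumerate(speech)
    ]
    g_start, g_end = gesture
    # Indices of the speech tokens that temporally overlap the gesture.
    hit = [
        i for i, (_, s, e) in enumerate(speech) if s < g_end and g_start < e
    ]
    if not hit:
        return speech_edges, None
    # The gesture inherits the outer edges of the tokens it overlaps,
    # so overlap in time becomes identity of chart edges.
    return speech_edges, (f"v{hit[0]}", f"v{hit[-1] + 1}")


# Hypothetical time stamps for a fragment of (1).
speech = [("mixes", 0.0, 0.4), ("mud", 0.4, 0.9)]
edges, gesture_edge = assign_edges(speech, (0.5, 0.8))
print(edges)         # [('mixes', 'v0', 'v1'), ('mud', 'v1', 'v2')]
print(gesture_edge)  # ('v1', 'v2')
```

Because only overlap matters to the grammar, the sketch discards the exact time values once the edges are assigned.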
The linking of a gesture to its temporally overlapping speech segment happens prior to parsing via chart-mapping rules (Adolphs et al. 2008), which involve re-writing chart items into FSs. The gesture-unary-rule (Figure 8) rewrites an input (I) speech token in the context (C) of a gesture token into a combined speech+gesture token where the +gest and +pros values of the speech and gesture tokens are copied onto the output (O). The +pros attribute contains prosodic information and the +gest attribute is a feature-structure representation. The +position constraint restricts the position of the I, O and C items to an overlap (@), i.e., the edge markers of the gesture token should be identical to those of the speech token, and also identical to those of the speech+gesture token. This chart-mapping rule recognises the gesture token overlapping the speech token and records this by "augmenting" the speech token with the gesture feature-values.
Gestures overlapping more than one speech token were handled by further chart-mapping rules that distributed the gestural information onto multiple speech tokens within the temporal span of the gesture. So a gesture overlapping, say, three speech tokens would get split into three gesture tokens. Then, the gesture-unary-rule was applied so as to instantiate a speech+gesture token for each speech token temporally overlapping the gesture. The result of this chart-mapping operation is multiple gesture-marked speech tokens whose span is identical to the span of the gesture.
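The combined effect of splitting the gesture and applying the unary rule can be approximated in a few lines of Python. This is a sketch under assumptions rather than the chart-mapping rules themselves: tokens are plain dictionaries, vertices are integers, the function name `distribute_gesture` is hypothetical, and the keys `pros` and `gest` merely echo the +pros and +gest attributes.

```python
def distribute_gesture(speech_tokens, gesture_token):
    """Copy the gesture value onto each speech token inside its span.

    speech_tokens: list of dicts with 'form', 'source', 'target', 'pros'.
    gesture_token: dict with 'source', 'target', 'gest'.
    Returns new tokens; those outside the gesture span get gest=None.
    """
    out = []
    for tok in speech_tokens:
        inside = (
            gesture_token["source"] <= tok["source"]
            and tok["target"] <= gesture_token["target"]
        )
        # A token inside the span becomes a speech+gesture token that
        # keeps its own prosody and edges and gains the gesture value.
        out.append({**tok, "gest": gesture_token["gest"] if inside else None})
    return out


# Hypothetical chart for "he mixes mud" with a gesture over v1-v3.
tokens = [
    {"form": "he", "source": 0, "target": 1, "pros": "unmarked"},
    {"form": "mixes", "source": 1, "target": 2, "pros": "unmarked"},
    {"form": "mud", "source": 2, "target": 3, "pros": "p-word"},
]
gesture = {"source": 1, "target": 3, "gest": "depicting"}
marked = [t for t in distribute_gesture(tokens, gesture) if t["gest"]]
print([t["form"] for t in marked])  # ['mixes', 'mud']
```

The gesture-marked tokens jointly cover exactly the span of the gesture, which is the property the chart-mapping step is described as guaranteeing.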
A separate rule was also required for concrete deixis to account for the permitted precedence and sequence relations between the speech token and the concrete deictic gesture token. This rule (which we omit for the sake of space) remains neutral about the positional (and hence temporal) relation between the gesture token and the speech token, thus allowing a gesture token of type deictic-concrete to attach to each speech token from the input chart.
In the grammar, we extended the ERG word and phrase rules with prosodic and gestural information where the +pros and +gest features of the input token are identified with the pros and gest of the word and/or lexical phrase in the grammar. We then added a gesture lexical rule (Figure 9) which projects a gesture daughter to a complex gesture-marked entity for which both the pros and gest features are appropriate. In line with Definition 6.1, this rule constrains pros to a prosodically prominent word of type p-word, thereby preventing a gesture from plugging into a prosodically unmarked word. The gesture-form value is a supertype over the distinct gesture types, depicting and deictic. The gest feature of the mother is of type no-gesture to block any further recursive instantiation of this rule. The gesture_lexrule is inherited by a lexical rule specific to depicting gestures, and by a lexical rule specific to deictic gestures. In this way, we can encode the semantic contribution of depicting gestures, which is different from the semantic contribution of deixis. For the sake of space, Figure 10 presents only the depicting_lexrule. The semantic information contributed by the rule is encoded within c-cont.
The rule introduces an underspecified vis_rel between the main label #dltop of the spoken sign (via the hcons constraints) and the main label #glbl of the gesture semantics (via the hcons constraints). Note that these two arguments are in a geq (greater or equal) constraint. This means that vis_rel can operate over any projection of the speech word; e.g., attaching the gesture to "mud" in (1) means that the relation is not restricted to the EPs contributed by "mud" but can also be over the EPs of a higher projection. Here, the implemented analysis differs from the theoretical one in that we formalise in semantics the gesture attachment ambiguities as per the Situated Spoken Phrase Constraint: that is, vis_rel can operate over any projection of the gesture-marked sign. The gesture's semantics is a bag of EPs, all of which are outscoped by the gestural modality [ ]. The rule therefore introduces in rels a label (here #larg1) for an EP which is in qeq constraints with [ ]. The instantiation of the particular EPs comes from the gestural lexical entry. In the real implementation, the number of these labels corresponds to the number of features.
The evaluation was performed in the tradition of testing wide-coverage grammars, by means of a manually crafted test suite (Oepen et al. 1997). We created a test suite covering different gesture types, prosody and the following linguistic phenomena: intransitivity, transitivity, complex NPs, modification, negation and coordination. The test set contained 471 speech-gesture items (71.5% well-formed and 28.5% ill-formed) covering the full range of prosodic (prosodic markedness and unmarkedness) and gesture (the span of depicting/deictic gesture and its temporal relation to the prosodically marked elements) permutations. The gestural vocabulary was limited since a larger gesture lexicon has no effect on the performance. To test the grammar, we used the [incr tsdb()] competence and performance tool (Oepen 2001), which enables batch processing of test items and which creates a coverage profile of the test set (see Table 1). The values are as follows: the left column separates the items per aggregation criterion, the length of test items (note that the length here does not correspond to the actual length of tokens in each test item, since the tool also counts the XML tags); the next column shows the number of test items per aggregate; then we have the number of grammatical items, the average length of test item, the average number of lexical items, the average number of distinct analyses and the total coverage.
We manually verified the coverage. While the grammar successfully parses all well-formed examples, the inclusion of a separate chart-mapping rule for concrete deixis results in overgeneration. We believe that the alternative method of enforcing strict precedence or strict sequence is too restrictive with respect to the possible interpretations supported by the distinct attachment configurations.
Finally, we also verified that the newly introduced rules did not change the coverage or increase the ambiguity of the existing broad-coverage grammar. We therefore ran both the ERG grammar and the gesture grammar on the ERG test suite. The results shown in Table 2 were generated by both the ERG grammar and by the grammar

conclusions
The work presented here advances a new theory in which the form-meaning mapping of speech-gesture actions was analysed using well-established methods from linguistics such as constraint-based syntactic derivation and semantic composition. In particular, we captured the mapping of form of speech-gesture actions to their meanings within a constraint-based grammar: the construction rules were inspired by examining real data and were further implemented within a wide-coverage grammar for English. The highly ambiguous gesture form was captured using underspecified semantics, which allowed us to account for the range of specific interpretations that a given gesture can take in its context of use. The ambiguities notwithstanding, we demonstrated that the speech-gesture attachments are constrained by the form of the speech signal, thus showing that the difference in ambiguity between linguistic input and gesture input is more a matter of degree than a difference in kind.