Interfacing Language, Spatial Perception and Cognition in Type Theory with Records

We argue that computational modelling of perception, action, language, and cognition introduces several requirements on a formal semantic theory and its practical implementations. Using examples of semantic representations of spatial descriptions, we show how Type Theory with Records (TTR) satisfies these requirements. TTR's grounding of truth in agent-relative judgements is crucial here, but in practice it comes with a computational cost. We argue that the number of type judgements an agent has to make can be minimised by incorporating a cognitive notion of judgement that is driven by perceptual attention.

(discussed in Section 3), where low level perceptual processing is carried out in an entirely different system from that used for semantic processing. An advantage we claim for our approach is that it facilitates the construction of types which have components relating to both low level and high level processing.
In Section 2, we give an overview of the problem area before describing some of the approaches that have been taken in Section 3. We then give a brief intuitive account of the tools we are using from TTR in Section 4 and give some examples of how this relates to understanding spatial descriptions (as our focus is on knowledge representation) in Section 5. Finally, in Section 6, we offer some conclusions and perspectives for future work. An implementation of the examples in this paper is available at https://github.com/GU-CLASP/pyttr/blob/master/lspc.ipynb.

Computational modelling of spatial language
We approach the study of spatial descriptions from the perspective of the computational models for situated agents that we have implemented so far: the typical problems encountered and the ad-hoc solutions taken when representing multi-sourced information. Spatial language is central for situated agents, as these must resolve its meaning and reference against visual scenes when involved in conversations with humans. In such conversations humans would use locational information to identify objects (the chair to the left of the table), describe directed action (pick up the red cube near the green one) or give route instructions (go down this corridor nearly towards its end and then take the second door to your right). However, interfacing language and perception is not only the domain of applications that involve language-based interaction with humans. There is an emerging trend in robotics where information represented in language is used to assist visual search (Sjöö 2011; Kunze et al. 2014). Robots are typically equipped with several sensors that allow the creation of perceptual representations at different levels of abstraction. Creating and classifying all representations all the time is therefore a computationally expensive task.
In the domain of visual object recognition, a system would have to employ all image classifiers on every observation it makes, even if most of these classifiers would not yield a match in these situations. For example, a robot in a corridor might be applying classifiers that would recognise objects found in a kitchen. Having background knowledge about the likely distribution of objects would allow it to prioritise certain classifications. The ontology capturing this knowledge may be static or dynamically built through interaction (Dobnik and Kelleher 2016). In the latter case humans programme the robot through language (Lauria et al. 2002).
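The prioritisation idea above can be sketched as a simple re-ranking of classifiers by a prior over object-room co-occurrence; the room names, object labels, and probabilities below are invented purely for illustration:

```python
# Sketch: prioritise image classifiers using a prior over object-room
# co-occurrence, so a robot in a corridor does not waste effort running
# kitchen-object classifiers. All names and numbers are invented.

# P(object | room): background knowledge, static or learned through interaction
object_priors = {
    "corridor": {"door": 0.5, "fire_extinguisher": 0.3, "kettle": 0.01},
    "kitchen":  {"kettle": 0.4, "cup": 0.35, "door": 0.1},
}

def prioritise_classifiers(room, classifiers, threshold=0.05):
    """Return the classifiers ordered by prior, skipping unlikely ones."""
    priors = object_priors.get(room, {})
    ranked = sorted(classifiers, key=lambda c: priors.get(c, 0.0), reverse=True)
    return [c for c in ranked if priors.get(c, 0.0) >= threshold]

print(prioritise_classifiers("corridor", ["kettle", "door", "fire_extinguisher"]))
# door is tried first, fire_extinguisher second; kettle is dropped
```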
Cross-disciplinary research has shown that spatial language is dependent on several contextual factors that are part of an agent's interaction with the environment through perception and with other agents through dialogue, for example the geometrical arrangement of the scene (Regier and Carlson 2001), the type of objects referred to and their interaction (Coventry et al. 2001; Dobnik and Kelleher 2013, 2014), visual and discourse salience of objects (Kelleher et al. 2005), alignment in dialogue (Watson et al. 2004; Dobnik et al. 2015), and gesture (Tutton 2013), among others.
The geometrical arrangement of scenes is captured in spatial templates or potential fields. These can be determined experimentally by placing the target object in various locations around the landmark object and asking participants to judge whether a particular spatial relation holds (Logan and Sadler 1996; Dobnik and Åstbom 2017). The semantics of spatial templates may be approximated by functions (Gapp 1994a,b) or expressed as a general function with trainable parameters, as in the case of the Attentional Vector Sum (AVS) model (Regier and Carlson 2001). Figure 1 shows a spatial template for the description in front of relating a table and a chair. Spatial templates capture the gradience of the semantics of spatial descriptions in terms of angles and distances from the location and the orientation of the landmark object. There are regions where native speakers would judge that the relation holds to a high degree, for example for the placement of chairs A and D, regions where the relation holds to a lesser degree, the placement of chairs C and E, and regions where it does not hold at all, the placement of chair F. A particular scene may be matched by several spatial descriptions. Spatial templates are far from being fixed or universally applicable. In addition to angle and distance, several contextual parameters can be incorporated, for example the presence of distractor objects (Costello and Kelleher 2006) or object occlusion (Kelleher et al. 2011), or the function itself can be learned from a dataset of perceptual observations and descriptions as a classifier (Roy 2002; Dobnik 2009).
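A potential field of the kind just described can be approximated, for illustration, by a function that combines angular deviation from the landmark's front axis with distance decay. The Gaussian-times-decay form and its parameters below are a simplification we assume for the sketch, not the AVS model itself:

```python
import math

def in_front_applicability(target, landmark, landmark_facing_deg,
                           angle_sigma=45.0, dist_scale=2.0):
    """Graded applicability of 'in front of' in [0, 1].

    target, landmark: (x, y) positions; landmark_facing_deg: the direction
    the landmark faces, in degrees. Gaussian on angular deviation times
    exponential distance decay -- an illustrative simplification.
    """
    dx, dy = target[0] - landmark[0], target[1] - landmark[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0
    bearing = math.degrees(math.atan2(dy, dx))
    # smallest signed angular difference, in [-180, 180]
    dev = abs((bearing - landmark_facing_deg + 180) % 360 - 180)
    angular = math.exp(-(dev ** 2) / (2 * angle_sigma ** 2))
    decay = math.exp(-dist / dist_scale)
    return angular * decay

# A chair directly along the table's facing axis scores higher than
# one off to the side at the same distance.
print(in_front_applicability((1, 0), (0, 0), 0.0))   # high
print(in_front_applicability((0, 1), (0, 0), 0.0))   # lower: 90 degrees off axis
```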
Scene geometry is not the only meaning component of spatial descriptions. Spatial relations also express non-geometric aspects of how we view the relation between the landmark and the target objects. For example, a description such as Alex is at her desk might not only mean that Alex is proximal to her desk. Instead, we might interpret the description as meaning that she is sitting in her chair facing a computer screen and working. In the literature, such aspects of meaning are known as functional aspects (Coventry and Garrod 2005) because they depend on the function of the interacting objects: what are they used for, how do they interact with each other, and how can they be manipulated? In order to understand the interaction of objects, one needs to observe what happens in scenes over time. Coventry et al. (2005) model functional aspects of meaning as dynamic-kinematic routines captured by several stacked recurrent neural networks that take both visual and language input data. Modelling different takes on the scene and the relations into which the target and the landmark objects enter leads to the development of qualitative spatial ontologies (Bateman et al. 2010) and logics (Zwarts and Winter 2000; Cohn and Renz 2008) which are similar to Allen's interval algebra for temporal reasoning (Allen 1983).
Spatial descriptions are also sensitive to the changing linguistic context that arises in linguistic interaction. One such example is the coordination of referring expressions (Garrod and Doherty 1994). Projective spatial descriptions such as to the left of and behind require setting a perspective or frame of reference (FoR), which can be modelled as a set of three orthogonal axes fixed at some point (the location of the landmark object) and oriented in a direction determined by the viewpoint (Maillat 2003). The viewpoint can be any conversational participant or object in the scene that has an identifiable front and back (for this reason such FoR assignment is known as relative FoR), which introduces considerable referential ambiguity of projective spatial descriptions. Alternatively, a scene can also be described from a global bird's-eye perspective, e.g. north of, in which case we talk about extrinsic FoR assignment. The FoR may be specified overtly, as in from your point of view, but frequently it is omitted and its resolution is left to the dynamics of conversation, among other things.

Figure 2: Assignment of FoR in dialogue. Katie: "Please tell me, where is the darker box?"
Figure 2 shows a virtual scene involving a conversational partner, Katie, facing us at the opposite side of the room. What FoR would we use to continue the conversation? How would the FoR be assigned over several utterances and conversational role changes? Would conversational partners align with a particular FoR or would they tend to change it frequently, and what are the conditions licensing such change? What other factors in addition to linguistic conversation contribute to the assignment of FoR? Can a system learn from human assignments of FoR and successfully demonstrate its knowledge in a new conversation with a human? We investigate the strategies of FoR assignment in dialogue, both restricted and free, in (Dobnik et al. 2014) and (Dobnik et al. 2015) respectively.
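One way to make the geometry of FoR assignment concrete is to rotate the target's coordinates into the viewpoint's frame. The 2D sketch below handles the simplified case where the viewpoint is also the origin of the axes; the classification thresholds are our own assumption:

```python
import math

def relative_direction(target, viewpoint_pos, viewpoint_facing_deg):
    """Classify target as 'front'/'behind'/'left'/'right' relative to a
    viewpoint (a conversational participant or object with a front).
    A FoR is modelled as orthogonal axes translated to the viewpoint
    and rotated to its orientation; we use only the 2D ground plane."""
    dx = target[0] - viewpoint_pos[0]
    dy = target[1] - viewpoint_pos[1]
    theta = math.radians(viewpoint_facing_deg)
    fx = dx * math.cos(theta) + dy * math.sin(theta)    # along the facing axis
    fy = -dx * math.sin(theta) + dy * math.cos(theta)   # lateral (left) axis
    return ("front" if fx >= abs(fy) else
            "behind" if -fx >= abs(fy) else
            "left" if fy > 0 else "right")
```

Running this with two different viewpoints shows the ambiguity discussed above: with Katie facing us, the same box that is "left" in her FoR is "right" in ours.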
The preceding discussion demonstrates that the semantics of spatial descriptions involves meaning representations at three distinct levels, none of which has so far been captured in a single representational framework that could be employed with situated conversational agents. (i) Geometric representations involve grounding symbols in perceptual observations (Harnad 1990), (ii) integration of functional knowledge involves lexical and compositional semantics, and (iii) FoR assignment involves both of the previous steps and the pragmatics of conversation. Modelling the semantics of spatial descriptions thus raises several open questions. How is an agent able to determine the sense and reference (Frege 1948) of spatial descriptions? The former relates to what components of lexical meaning are involved and the latter to how expressions relate to contextual features arising from perceptual and discourse contexts. A model of grounding is required: how are perceptual and conceptual domains bridged (reference) and how is information from contextual features fused into bundles of meaning representations (sense)? The resulting framework should possess sufficient formal accuracy and expressiveness for modelling human language and reasoning, capturing notions such as logical entailment, scoping properties, underspecification, hierarchical organisation of meaning and structure, compositionality of structure for words, sentences and utterances, recursion, feature unification, and others. The framework should also include a learning theory concerning how an agent is able to adapt or learn its representations in new physical and conversational contexts (Cooper et al. 2015; Dobnik and Kelleher 2016).

A classical view of vision and language
Figure 3 shows a typical approach to modelling language and vision. We start by building a model of the perceptual scene which captures its geometrical representation. In this example, the robot starts with a SLAM map (Dissanayake et al. 2001), which contains clouds of points in 3-dimensional coordinate space. The perceptual model is then connected to a formal conceptual representation of the scene, which in this example is expressed in first-order logic. An important and challenging issue here is to find a mapping from a reasonably accurate geometric representation of a scene with continuous parameters (locations in the coordinate space and angles of orientation) to the cognitive categories that are reflected in language. The mapping between two such domains thus results in vagueness. The formal representation is then mapped to the linguistic expression. The mappings between the layers are typically learned from datasets of collected observations with machine learning. For example, in (Dobnik 2009) we learn classifiers that map representations from SLAM maps to words, thus skipping an intermediate representational layer. Matuszek et al. (2012a) present a method where the intermediate semantic representation is also included: linguistic expressions are grounded in compositional semantic forms which are in turn grounded in perception. Finally, natural language needs to be grounded not only in perception but also in the robotic control language (Matuszek et al. 2012b).

Figure 3: Grounding language in perception. "The newspaper is on the table."

Model-theoretic Montague semantics
Classical model-theoretic or Montague semantics uses higher order logic (Montague 1974; Dowty et al. 1981; Blackburn and Bos 2005; Bird et al. 2009), which provides the required formal accuracy and expressiveness of a representation system. It accounts for how meaning representations of words are composed, in the form of higher order functions, to form meaning representations of sentences. The functional composition of constituents allows us to translate between sentence constituent structure and its logical representation, as shown in Figure 4. The final logical forms of spatial prepositions are slightly more complicated than presented in this example, and due to their context dependency a single description or utterance (surface form) may resolve to several representations, as discussed in (Miller and Johnson-Laird 1976; Herskovits 1986), for example on(x, y)_1: object(x) ∧ object(y) ∧ supports(y, x) ∧ contiguous(surface(x), surface(y)) and on(x, y)_2: object(x) ∧ object(y) ∧ contiguous(boundary(x), y). However, dealing with these two issues separately, we are able to derive their compositional representation along the same lines as in Figure 4.
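The functional composition of Figure 4 can be mimicked with Python closures, using strings as a toy stand-in for logical forms; the constant names n1 and t1 are invented for the example:

```python
# Illustrative Montague-style composition with Python closures.
# Meanings build first-order logical forms as strings; this mirrors
# the functional composition of constituents, not the paper's code.

on = lambda y: lambda x: f"on({x},{y})"    # preposition: λy.λx.on(x,y)
the = lambda noun: noun                     # toy determiner: picks out a constant
newspaper, table = "n1", "t1"               # invented individual constants

pp = on(the(table))        # the PP 'on the table': λx.on(x, t1)
s = pp(the(newspaper))     # the sentence: on(n1, t1)
print(s)                   # -> on(n1,t1)
```

The same composition scheme would apply to either sense on(x, y)_1 or on(x, y)_2; only the body of the lambda changes.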
In model-theoretic semantics an expression's reference is determined by an assignment, a valuation function between linguistic strings and entities (or sets of tuples of entities) in a model. The model is agent-external and fixed. The valuation returns true if an entity or a relation between entities denoted by an expression can be found in the model; otherwise it returns false. While it would be possible to represent the referential semantics of on in a model by listing the set of all coordinates of locations where this spatial description applies, this referential representation of meaning is cumbersome, as the model would have to include an assignment for every scale, for every spatial relation, for every pair of objects. Since angles and distances in a coordinate system are continuous measures, such sets would be infinite. The model also does not straightforwardly represent the gradience and vagueness of spatial descriptions. To do that, one would have to resort to the notion of possible worlds (Lassiter 2011), which introduces further computational complexity (for discussion see Cooper et al. 2015, Section 1.1, p. 3ff).
As discussed earlier, both the vagueness and the gradience of spatial language are captured in computational models as spatial templates or potential fields. While spatial templates can be thought of as referential overlays of regions induced experimentally (as a set of points where participants consider a particular spatial relation to apply), potential fields capture the notion that such regions can be generalised as functions. However, as argued in (Lappin 2013), these functions do not represent objects in a model (the extensions or referential meaning of these descriptions) but rather capture their sense or intension, specifying in what ways a description relates to perceptual observations. Knowing this function, we can check whether a particular spatial relation associated with the function applies to a particular pair of objects, and to what degree. Applying a function between perceptual observations and words to represent the meaning of those words is also known as grounding the words in perception (Harnad 1990).
The model-theoretic approach to semantics assumes that a model is derived through some external process and therefore pre-given, that it is complete, and that it represents a state of affairs at a particular temporal snapshot (Fagin et al. 1995). In practice, however, complete models may be rarely observable and we must deal with partial models. We must also account for the fact that we may incrementally observe more and more of the world: we have to update the model with new observations, and sometimes even correct representations that we have already built in the light of new evidence. Finally, the world is not static itself, as new objects and events continually come into existence. Imagine a robot (and indeed such robots were used in the early days of robotics) with a pre-programmed static model of the world. Every minute change in the world would render it useless, as there would be a discrepancy between its representation of the world and the actual world. Modern robotic models used in localisation and map building are incrementally learned or updated over time by taking into account the robot's perceptual observations and motion, and the errors associated with both (Dissanayake et al. 2001). An important consequence of this is that the model of the world a robot builds is individual to that particular robot's life-span and experience. Two robots experiencing the same world will have slightly different models. Of course, the more they experience the world, the more similar their models will be. It is conceivable that humans learn meanings in the same way. However, in doing so they are equipped with yet another tool to overcome individual inconsistencies in their models: they can use linguistic dialogue interaction to resolve such inconsistencies in the form of repair (Pickering and Garrod 2004). In robotics, several models that explore learning language through interaction have been built (Steels and Belpaeme 2005; Skočaj et al. 2011; Ivaldi et al. 2014), also in relation to spatial cognition (Steels and Loetzsch 2009). We describe a system for the modelling of semantic concept learning through dialogue interaction in (Dobnik and de Graaf 2017).

Models used in robotics
Several systems have been proposed for building situated conversational agents, but none of them captures all of the requirements discussed in Section 2. For example, semiotic schemas (Roy 2005) represent the lexical meaning of words as directed graphs composed of nodes that, in turn, represent sequences of perceptual observations and classification events, as shown in Figure 5. The meaning/sense of an object is defined in terms of what can be experienced with the sensors and actuators of a robot. The reference is determined by embedding a semiotic schema with the actual sensory readings. For example, a cup can be experienced and classified either through visual or haptic modalities. The location of the sensory readings determines the location of the object. Semiotic schemas represent a very attractive model of the grounded lexical semantics of words, but how such schemas compose to form larger linguistic structures is left unaccounted for.
Quite frequently, grounded representations are arranged into layers. This is related to the fact that in practical applications several distinctive sub-systems are used that are stacked into a pipeline. For example, in the layered approach of (Kruijff et al. 2007), summarised here in Figure 6, the lowest level consists of a feature map which directly relates to laser sensors. Here, features are sets of points which can be connected to lines which represent walls. The next level is a navigation graph. As the robot moves around space, it creates nodes. If the robot can move directly between two nodes, a connection is made and, on the basis of several such connections, a navigation graph is created. Groups of nodes may be identified whereby two groups are only connected through a single node in each group. Such nodes are gateway nodes and indicate passages between different areas, or doors. From such a topology of nodes, a topological map can be hypothesised such that it identifies enclosed spaces, corridors, kitchens, and rooms. The information about the spaces can be further augmented with linguistic information from the ontology, for example what objects are found in kitchens. In this approach one needs to design interfaces between representational levels in the pipeline. Most frequently, representations and operations at each level are distinct from each other. A question we would like to explore is whether representations at different levels can be generalised by taking inspiration from the way humans assign, learn, and reason with meaning. A unified meaning representation would allow interactions between modalities that are required in modelling human cognition but are difficult to implement in a layered pipeline architecture.
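The gateway-node idea can be sketched as follows: a connection whose removal splits the navigation graph into two areas marks its endpoints as gateways. The naive bridge-finding below is illustrative, not Kruijff et al.'s implementation:

```python
from collections import defaultdict

def gateway_nodes(edges):
    """Find nodes incident to 'bridge' connections: removing such a
    connection splits the navigation graph into two areas, so its
    endpoints act as gateways (e.g. doorways). Naive O(E*(V+E)) sketch."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def connected_without(skip):
        nodes = list(adj)
        seen, stack = {nodes[0]}, [nodes[0]]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if {n, m} == set(skip) or m in seen:
                    continue
                seen.add(m)
                stack.append(m)
        return len(seen) == len(nodes)

    gates = set()
    for edge in edges:
        if not connected_without(edge):
            gates.update(edge)
    return gates

# Two rooms of nodes joined by a single doorway connection 2-3:
rooms = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(gateway_nodes(rooms))  # {2, 3}
```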

Type Theory with Records (TTR)
Type Theory with Records (TTR) (Cooper 2012) builds on the tradition of classical formal semantics (and therefore captures the notion of compositionality) but at the same time, drawing on insights from situation semantics, addresses the outstanding questions related to perception discussed in the preceding paragraphs. It starts from the idea that information is founded on our ability to perceive and classify the world, that is, to perceive or judge objects and situations as being of types. All information can be represented as types (Figure 7), which makes type assignment an abstract theory of cognition and perception. Having a single representational layer allows information fusion between perception, conceptual knowledge and linguistic communication, which is an important requirement for modelling spatial descriptions.

Figure 7: A unified view: types all over the place (types of sensor events, types of spatial situations, types of utterance events)
Types are intensional; that is, there can be distinct types which have identical extensions. For example, the type of situations in which an object, a, is to the left of another object, b, in symbols left(a,b), can have exactly the same witnesses as the type of situations in which b is to the right of a, right(b,a), without requiring that the two types be identical. For some more discussion of the intensional nature of types in TTR see (Cooper 2017). This allows us to relate linguistic propositions to types, the so-called propositions as types dictum, which is standard in type theories deriving from the original work of Martin-Löf (Martin-Löf 1984; Nordström et al. 1990). The notion of truth is linked to judgements that an object a is of type T (a : T). As in standard Martin-Löf type theories, a type is true just in case it has some witness. Thus, the type of situations left(a,b) is true just in case there is some situation where a is to the left of b.
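A minimal sketch of agent-relative judgement and the witness-based notion of truth (deliberately simplistic, and not pyttr's actual API) might look like this:

```python
# Sketch: agent-relative type judgement (a : T). Types are intensional:
# left(a,b) and right(b,a) are distinct objects even if they end up
# classifying exactly the same situations.

class PType:
    """A ptype constructed from a predicate and its arguments."""
    def __init__(self, pred, args):
        self.pred, self.args = pred, args
    def __repr__(self):
        return f"{self.pred}({','.join(self.args)})"

class Agent:
    def __init__(self):
        self.witnesses = {}              # repr(T) -> set of judged witnesses
    def judge(self, situation, T):       # the agent records: situation : T
        self.witnesses.setdefault(repr(T), set()).add(situation)
    def true(self, T):                   # a type is true iff it has a witness
        return bool(self.witnesses.get(repr(T)))

left_ab = PType("left", ["a", "b"])
right_ba = PType("right", ["b", "a"])
agent = Agent()
agent.judge("s1", left_ab)
print(agent.true(left_ab))    # True: s1 witnesses left(a,b)
print(agent.true(right_ba))   # False until the agent makes that judgement
```

The point of the sketch is that truth is relative to the judgements a particular agent has made, not to a fixed external model.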
We can furthermore seek to operationalise types as computable functions (Lappin 2013) or classifiers (Larsson 2015), rather than associating them with sets of witnesses as in the standard definition of TTR (Cooper in prep, 2012). Under this view, we can consider an agent to have access to a particular type inventory as a resource. Different agents can have access to different type resources, which can be dynamically revised, both by learning new types and by modifying the witness conditions in terms of classifiers, which can change as the result of the agent's experience of new situations (Dobnik et al. 2013; Larsson 2015). In order for communication between agents to be possible, they must converge on sufficiently similar type resources. This convergence is in part enabled by the fact that the agents exist in similar environments and have similar perceptual apparatus for classifying features of the environment. But in addition it is important that the agents be able to use language to communicate with each other about their classification of features in the environment. For example, an agent may receive linguistic information which provides a classification at variance with that given by its perceptual apparatus or, in linguistic communication between agents, corrective feedback might be used to express a variance in judgement between two agents. This is perhaps a novel view in linguistic semantics and computational linguistics, but it relates to a standard view in mobile robotics (Dissanayake et al. 2001) where a map of an environment is constructed dynamically as a robot moves around in it, and features are constructed on the basis of clouds of points in 3D space where the robot's sensors indicate that something is present. In our terms, this would correspond to recognising the physical presence of an object and assigning a particular type to it.
In such a learning scenario, it is natural to consider the role of probabilistic judgements, that is, the judgement that an object a is of type T with probability p, instead of the standard categorical judgements to be found in type theory. For a proposal of how this might be incorporated into TTR see (Cooper et al. 2015). This means that an agent can determine a degree of belief that a particular situation is of a particular type. For example, the probability that a situation is to be classified as one where an umbrella is over a person may vary with respect to both the geometric configuration and the degree to which the umbrella is protecting the person from rain (Coventry et al. 2001).
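As an illustration of such a graded judgement, the toy function below combines a geometric score with a functional cue for "the umbrella is over the person". The feature choice and weights are invented for the sketch, not taken from Coventry et al. or Cooper et al.:

```python
import math

def p_over(umbrella_angle_deg, protects_from_rain):
    """Toy probabilistic judgement p(s : over(umbrella, person)).

    umbrella_angle_deg: deviation from vertical alignment (geometry);
    protects_from_rain: whether the umbrella fulfils its function.
    Weights (sigma=30 degrees, 0.9 vs 0.3) are invented."""
    geom = math.exp(-(umbrella_angle_deg ** 2) / (2 * 30.0 ** 2))
    func = 0.9 if protects_from_rain else 0.3
    return geom * func

print(p_over(0, True))    # canonical configuration: high degree of belief
print(p_over(40, True))   # tilted but still protecting: degraded
print(p_over(0, False))   # geometrically aligned but not protecting: low
```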
In contrast to the classical Montagovian semantic framework, which employs a variant of the simple theory of types, TTR introduces an extended set of basic types (for example, Ind and Real, which correspond to the basic conceptual categories of individuals and real numbers). It is also a rich type system which, in addition to basic types, contains complex types constructed from types and other objects, among them ptypes constructed from predicates and their arguments, such as left(a,b), and record types, such as

    [ x : Ind
      y : Ind
      e : left(x,y) ]

whose witnesses would be any record with three fields labelled x, y and e, respectively (and possibly more fields with other labels), such that the x-field contains an object a of type Ind, the y-field contains an object b of type Ind and the e-field contains an object of type left(a,b). For a detailed characterisation of record types in TTR see (Cooper in prep, 2012). Record types in TTR are used to model, among other things, lexical content and dialogue information states. For our present purposes, the structured nature of record types allows us to combine in a single object the kind of multi-source information needed for robotics and the modelling of spatial descriptions, representing a bridge between what might be thought of in other approaches as the sub-symbolic domain of perception and the symbolic domain of high level conceptual analysis.
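Witness checking for such a record type can be sketched with dictionaries standing in for records. The dependency of the e-field on the values of x and y is flattened into a fixed ptype here, a simplification relative to TTR proper, and none of this is pyttr's actual API:

```python
# Sketch of record-type witness checking: a record (here a dict) is a
# witness of a record type if each required field contains an object of
# the required type; extra fields are allowed.

Ind = "Ind"

def is_a(obj, T, judgements):
    """Agent-relative judgement obj : T against a set of accepted
    basic/ptype judgements."""
    return (obj, T) in judgements

def witness(record, record_type, judgements):
    return all(label in record and is_a(record[label], T, judgements)
               for label, T in record_type.items())

# [ x : Ind, y : Ind, e : left(a,b) ] with the dependency pre-resolved
LeftType = {"x": Ind, "y": Ind, "e": "left(a,b)"}
judgements = {("a", Ind), ("b", Ind), ("s1", "left(a,b)")}

r = {"x": "a", "y": "b", "e": "s1", "colour": "red"}  # extra field is allowed
print(witness(r, LeftType, judgements))               # True
print(witness({"x": "a", "y": "b"}, LeftType, judgements))  # False: no e-field
```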
The structured nature of record types in TTR allows the representation of several kinds of formal structural relations, which has implications for inference over representations containing multi-sourced information. Record types (and the corresponding records) can be compared with each other. For example, if the record type Left contains all the fields required by the record type Relation, then Left ⊑ Relation, where ⊑ denotes the subtype relation (Cooper 2012, p. 295). Similarly, record types allow the identification of dependencies using dependent types. Notation like target(x) within the context of a record type is an abbreviation for a tuple of objects 〈λv:Ind . target(v), 〈x〉〉 where the first element is a dependent type, a function mapping objects to a type, and the second element is a sequence of paths to the arguments of this function within a record type. Finally, both Ind and target(x) are component types of the record types Relation and Left, which means that the latter types are representations of the thematic relations between individuals and properties found in language (Lin and Murphy 2001; Estes et al. 2011).
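The subtype check itself can be sketched for flat record types: T1 ⊑ T2 whenever T1 supplies every field T2 demands, so every witness of T1 is also a witness of T2. The field inventories below are hypothetical stand-ins for the Left and Relation types of the example:

```python
# Sketch of the record subtype relation for flat record types with
# atomic field types. Real TTR also recurses into nested record types
# and dependent fields, which this sketch omits.

def subtype(T1, T2):
    """T1 is a subtype of T2 (T1 ⊑ T2): every field of T2 occurs in T1
    with the same type, so T1's witnesses are all witnesses of T2."""
    return all(label in T1 and T1[label] == ftype
               for label, ftype in T2.items())

Relation = {"x": "Ind", "y": "Ind"}
Left     = {"x": "Ind", "y": "Ind", "e": "left(x,y)"}

print(subtype(Left, Relation))   # True: Left ⊑ Relation
print(subtype(Relation, Left))   # False: Relation lacks the e-field
```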

Types of spatial descriptions
In the remainder of the paper, we discuss how our empirical investigations of learning geometric meanings of spatial descriptions with situated robots (Dobnik 2009; Dobnik and de Graaf 2017), learning functional meanings of prepositions from collections of image descriptions (Dobnik and Kelleher 2013, 2014), and modelling of reference frame assignment in conversation (Dobnik et al. 2014, 2015) can be captured in the TTR framework.
The idea is that TTR can be seen as an abstract model of cognition and perception (Cooper 2012, in prep) which can be used to model both the linguistic behaviour of humans and perception based on sensor readings in artificial agents. It is important to note that robots have different perceptual apparatus than humans, in both the number and the nature of their sensors. It follows that their sensors will give rise to different types of information at the lowest sensory level. However, these sensory types can be related to types corresponding to concepts which are similar enough to the conceptual types internalised by humans to allow communication between the two. Nevertheless, the type system an agent can acquire is constrained by the agent's perceptual apparatus. We cannot, for example, expect an agent incapable of colour perception to successfully make judgements about the colour concepts available to a human, however much we may talk to the agent or train it on objects of different colours. It simply does not have the required sensors and classifiers to distinguish the appropriate situations.
There are two main aspects of theoretical interest in the approach we suggest:

1. The notion of judgement from type theory can be used to model both the kind of low level perceptual discrimination carried out by classifiers in robotic systems and the high level conceptual classification, including the truth of propositions, which is important for linguistic semantics. Thus, it offers the possibility of a unified approach to both.
2. Given the kind of structured types that are proposed in a system like TTR, it is not only possible to express relations between the low level and high level types, but even to have a low level perceptual type and a high level conceptual type as components within a single type, and even to have one type depend on the other. This gives a very different perspective on the cognitive makeup of situated agents than that given by the kind of layered approaches discussed in Section 3, where the different layers involve entirely different systems.
In the next section we will give examples which illustrate this.

Types of objects
Figure 8 shows an example of bridging between perceptual and conceptual domains for object recognition.
Step 1 shows a record of type PointMap which is produced by SLAM (for details see (Dobnik et al. 2013)). The type PointMap is a subtype of a type that represents a list of records containing three real numbers, modelling points in three-dimensional space. A point map is a list (or a set) of points that a robot is tracking in space. TTR allows function types, one of which is exemplified by the object detection function in Step 2. This function maps an object of type PointMap to a type that represents a set of records specifying (1) the reg(ion) occupied by the object (a sub-pointmap) and (2) a property which is modelled as a pfun that maps an individual to a type, in this case a ptype or predicate type. The purpose of this function type is to associate a perceptual object with some property, thus pairing two kinds of information. The property functions take objects of type Ind (individuals) to types of individuals having some property.
The target record type of the main function type does not yet constrain any individuals that this property could be assigned to, nor does this record type correspond to a situation. In Step 3 we introduce an individuation function which takes records of associated perceptual objects and properties and yields a type of situation involving an individual located at a certain location and having this property. This type therefore represents a cognitive take on a situation. In this example, the mappings between the types are modelled with functions, but in practice (some) associations would be learned. For example, Harnad (1990) argues that grounding, associating perceptual and conceptual domains, can only be accomplished through classification. In (Dobnik 2009), decision tree and Naïve Bayes classifiers are learned to classify between point clouds and spatial descriptions. Here, the associating function that the classifier has learned is in the domain of the hypothesis space of each learning algorithm and is therefore quite complex. Larsson (2015) introduces a perceptron model to TTR, and Cooper et al. (2015) give the type system, including function types, a Bayesian interpretation. The latter allows direct propagation of Bayesian probabilistic beliefs between the types, while the observed type probabilities can be trained on the basis of the agent's observations.
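Steps 1-3 can be sketched end-to-end: a point map (Step 1), a detection function pairing regions with property functions (Step 2), and an individuation function yielding a type of situation (Step 3). The grid clustering and the object property below stand in for learned classifiers and are invented for illustration:

```python
# Sketch of the grounding pipeline of Figure 8, with dicts standing in
# for TTR records. Not pyttr's actual API.

def detect_objects(point_map, cell=1.0):
    """Step 2: group 3D points into regions by a crude grid clustering
    and pair each region (reg) with a property function (pfun)."""
    clusters = {}
    for p in point_map:
        key = (round(p["x"] / cell), round(p["y"] / cell))
        clusters.setdefault(key, []).append(p)
    return [{"reg": pts, "pfun": lambda ind: f"object({ind})"}
            for pts in clusters.values()]

def individuate(detection, ind):
    """Step 3: from a (region, property) pair, yield a type of situation
    in which individual `ind` has the property at that location."""
    return {"x": ind,
            "loc": detection["reg"],
            "e": detection["pfun"](ind)}

point_map = [{"x": 0.1, "y": 0.2, "z": 0.0},   # Step 1: SLAM point map
             {"x": 0.2, "y": 0.1, "z": 0.1},
             {"x": 5.0, "y": 5.1, "z": 0.0}]
dets = detect_objects(point_map)
print(len(dets))                       # 2 regions detected
print(individuate(dets[0], "a")["e"])  # object(a)
```

In a real system the clustering and the pfun would be replaced by learned classifiers of the kind discussed above (decision trees, Naïve Bayes, perceptrons).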

Types of spatial situations
Spatial descriptions, e.g. over and above, are sensitive to classes of interacting objects, and the contribution of such functional world-knowledge versus geometric knowledge to the semantics differs from one spatial preposition to another (Coventry et al. 2001; Coventry and Garrod 2005; Coventry et al. 2005). While previous work attempted to determine the contribution of each modality experimentally, Dobnik and Kelleher (2013, 2014) extract functional information from a large corpus of text describing images. Image descriptions are constrained by the properties of the visual scene shown in the image, both perceptual (the geometric arrangement of the scene) and functional (the nature and interaction of the objects shown there). Both kinds of information will be reflected in the text describing the image, in the particular choice of descriptions that annotators used. Building lexical models of word co-occurrence thus allows us to capture functional interactions between prepositions and targets and landmarks. In (Dobnik and Kelleher 2013) we capture the strength of association between a preposition and different target-landmark pairs with the log-likelihood ratio. In (Dobnik and Kelleher 2014), we generalise the types of targets and landmarks of a particular spatial preposition by ascending the WordNet hierarchy (Fellbaum 1998). This allows us to generate patterns of prepositional use such as the following: person.n.01 under tree.n.01, shirt.n.01 under sweater.n.01, and person.n.01 under body of water.n.01. Labels such as person.n.01 indicate the labels given to the generalised synsets in the WordNet hierarchy. The patterns indicate types of spatial situations that the under relation applies to. Importantly, each of these patterns corresponds to quite a different arrangement of target and landmark objects, and without such functional knowledge it would be difficult to capture a single spatial template that would not over-generate. The functional knowledge represented in these types thus constrains sub-sets of spatial situations for which individual spatial templates can be learned.
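The log-likelihood ratio used here is Dunning's G² statistic over a 2×2 contingency table of co-occurrence counts. A minimal sketch (the counts below are hypothetical, not corpus values):

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = co-occurrences of the preposition with a target-landmark pair,
    k12/k21 = occurrences of one without the other, k22 = neither."""
    def s(*ks):  # sum of k*log(k), with 0*log(0) taken as 0
        return sum(k * math.log(k) for k in ks if k > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (s(k11, k12, k21, k22)       # cell counts
                - s(k11 + k12, k21 + k22)   # row marginals
                - s(k11 + k21, k12 + k22)   # column marginals
                + s(n))                     # grand total

# Independent counts give a score of (near) zero; a skewed table scores high.
assert abs(llr(10, 10, 10, 10)) < 1e-9
assert llr(30, 70, 120, 9780) > 0.0   # hypothetical 'under' + (person, tree) counts
```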
Figure 9(a) shows a TTR function that maps ontological knowledge from one ontological category to another. This is a similar function to pfun in the object detection function shown in Figure 8. Figure 9(b) shows how associative reasoning is captured in TTR: having the meaning postulate in Figure 9(a), an agent can conclude that a situation s of the first type (the left-hand side of the if-then rule) requires that there is also a situation of the second type (the right-hand side of the same rule). Each type of situation representing a spatial pattern involves a different interplay of geometric and conceptual knowledge, spanning the domain of point clouds and that of "logical" individuals. Figure 10 shows the conceptual constraints on the target and landmark objects, limiting top-down a subset of spatial situations over which individual types of spatial relations are built. Hence, the resulting spatial template spatial-template-under1 is a distinct ptype classifier from spatial-template-under2. In the generation step, the function in Figure 10 takes account of the conceptual properties of objects that could be obtained by computing relevant hypernyms, such as person and furniture, and an associated spatial template that relates the point clouds associated with them. It then generates a type of situation which involves a conceptual spatial relation between individuals.
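The if-then reasoning with a meaning postulate can be sketched as follows (plain Python, not the pyttr encoding; the situation representation and predicate names are illustrative): if a situation is of the antecedent type (y supports x and their surfaces are contiguous), the agent may judge it to also be of the consequent type on(x, y).

```python
def supports(s, y, x):
    """Antecedent condition: y supports x in situation s."""
    return (y, x) in s.get("supports", set())

def contiguous(s, x, y):
    """Antecedent condition: the surfaces of x and y are contiguous."""
    return {x, y} in [set(p) for p in s.get("contiguous", [])]

def judge_on(s, x, y):
    """If-then rule: a situation of the antecedent type (supports and
    contiguous) is also a situation of the consequent type on(x, y)."""
    return supports(s, y, x) and contiguous(s, x, y)

situation = {"supports": {("table", "cup")}, "contiguous": [("cup", "table")]}
assert judge_on(situation, "cup", "table")       # the cup is on the table
assert not judge_on(situation, "table", "cup")   # the rule is directional
```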

Types of dialogue information states
TTR can also be used to model dialogue by representing types of information states (IS). Agents in conversation align with the primed frame of reference (FoR) and continue to use it (Dobnik et al. 2014). However, such alignment is only local and depends on the nature of the dialogue that the agents are engaged in and on other contextual factors of the conversation, such as the perceptual properties of the scene or the task that the agents are performing (Dobnik et al. 2015). Dobnik et al. (2014) study the properties of local FoR alignment over several turns of conversation in a constrained environment (Figure 2). The experiment captures participants' understanding of the agreed FoR and therefore alignment. In Game 1, a virtual conversational partner generates an unambiguous description that refers to only one of the objects. The participant must then click on that object. Here, the system primes the participant for a particular FoR. In Game 2, the system generates an ambiguous description which may refer to several objects. Again, the participant must click on one of the target objects, but this time they must decide on a particular FoR assignment. Will this be aligned with the previous turn pair, or will they assume a new strategy? Game 3 is identical to Game 2 and tests whether the priming from Game 1 persists over several games. In Game 4, the speaker-hearer roles reverse: the system selects an object and the participant must describe it using a particular FoR assignment. The role of this game is to test whether priming persists when the conversational roles change.
The preceding interaction is formalised as a probabilistic model of FoR assignment over several local turns of conversation. This model is then applied in a generation experiment. Here, the system makes assumptions about the human conversational partner and tries to align with them to the extent captured in the previous experiment. In Game 1, the system chooses an object and a human primes the system by generating an unambiguous description. In Game 2, a human selects a box and the system generates a description using its FoR model. The human then confirms whether the description is a good one. Game 3 is identical to Game 2. In Game 4, a human chooses a box. The system asks the user to describe it and also generates a description for itself. The human description is then compared with the system-generated description. The results show good agreement between the humans and the system (≥ 82.76% for Game 4).
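The core of such a model can be sketched as a probabilistic choice between aligning with the previously primed FoR and falling back on an egocentric one. The probabilities below are hypothetical placeholders, not the values estimated in the experiments:

```python
import random

P_ALIGN = 0.7   # hypothetical: probability of keeping the primed FoR
P_EGO = 0.8     # hypothetical: probability of an egocentric FoR otherwise

def assign_for(previous_for=None, rng=random.random):
    """Assign a FoR anchor for the next description: align locally with
    the previous turn if possible, otherwise default towards egocentric."""
    if previous_for is not None and rng() < P_ALIGN:
        return previous_for              # local alignment with the prior turn
    return "speaker" if rng() < P_EGO else "addressee"

# A speaker initiating conversation has no prior FoR to align with and
# therefore tends to be egocentric (deterministic rng for illustration):
fors = [assign_for(None, rng=lambda: 0.5) for _ in range(3)]
assert fors == ["speaker", "speaker", "speaker"]
```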
The model of FoR assignment predicts, for example, that speakers initiating a conversation tend to be egocentric. Figure 11 shows two types of information states (ISs). When Alex is planning the utterance The chair is to the left of the table, her information state would be of the type shown in (a). Information states represent information that is private to the agent, and information that the agent believes is a part of the common ground with another conversational participant, or shared. In the shared part of the IS in (a), there is a pointer to the object in focus. The object is stored in the private part of the IS, as each agent builds its own objects. Σi is a type returned by an individuation function on the basis of the point map that the agent has constructed. The agent also has a private belief that they are one of the objects and a belief that two particular objects are in the left relation. Crucially, at this stage, the FoR origin is assigned to the object corresponding to the individual having this IS. A double arrow ⇑2 indicates that the path refers to the container type of which the current type is a dependent type; the superscript indicates the depth of embedding. Notation such as label=value : Type, as in for-origin=objs.o0.a : Ind, represents singleton types, where the value stands in a manifest field.
The model of FoR assignment also predicts that hearers assume that speakers are egocentric. Figure 11(b) shows Sam's IS accommodating Alex's utterance. After Alex has made an utterance, the shared part of the IS is expanded through accommodation. There is information about the latest move: the speaker and the content of the move. Since Sam is the hearer of the utterance, he assumes that the FoR is identical to that of the speaker of the previous utterance, as predicted by our probabilistic model. In this example, we assume that the agents use identical labels for objects. However, it is not necessary, or indeed possible, that they have identified the same objects. In future work, we plan to investigate how agents resolve such differences using language, in particular what mechanisms of clarification and repair are used in such cases (Purver et al. 2003).
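The accommodation step can be sketched as follows (plain Python with simplified field names, not the record types of Figure 11): each agent keeps a private and a shared part, and the hearer's shared part is expanded with the latest move and the speaker-anchored FoR.

```python
# Alex's IS while planning "The chair is to the left of the table":
alex_is = {
    "private": {"objs": {"o0": "chair", "o1": "table", "a": "Alex"},
                "beliefs": [("left", "o0", "o1")],
                "for_origin": "a"},          # egocentric FoR origin
    "shared": {"in_focus": "o0"},
}

def accommodate(hearer_is, speaker, content):
    """Expand the hearer's shared IS with the latest move; the hearer
    assumes the FoR is that of the speaker of the utterance."""
    new = dict(hearer_is)
    new["shared"] = dict(hearer_is["shared"],
                         latest_move={"speaker": speaker, "content": content},
                         for_origin=speaker)
    return new

sam_is = accommodate({"private": {}, "shared": {}}, "Alex",
                     ("left", "o0", "o1"))
assert sam_is["shared"]["for_origin"] == "Alex"
```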

conclusion
In this paper, we outlined an application of type theory to natural language semantics in the framework of Type Theory with Records (TTR), which allows us to relate semantics to action, perception, and cognition. We used TTR to represent different components of the analysis of spatial descriptions. TTR is naturally suited to this task as it treats meaning as being based on perception and interaction. Perception and conceptual reasoning can be related within one unified approach. The framework also points to similarities between linguistic and non-linguistic learning. We will be testing practical implementations of TTR with situated agents in our forthcoming work, based on the framework described in (Dobnik and de Graaf 2017). The expressiveness of the type-theoretic framework is associated with a high computational cost. In order to make the framework computationally more tractable, we are investigating mechanisms of attention from psychological research which allow us to contextually restrict the type judgements a situated agent has to make (Dobnik and Kelleher 2016).
One aspect of spatial meaning which we have not discussed in this paper is the gradability of types like left(a,b). For example, a would be judged to be left of b with a high probability if the two objects were close to each other. However, the probability of this judgement would decrease if a is much closer to the observer than b. This suggests exploring the use of probabilistic judgements in TTR, as described in (Cooper et al. 2015), in our future work.

[Figure captions: Figure 1: The chair is in front of the desk. Figure 5: A simplified semiotic schema. Figure 6: A layered approach. Figure 8: From the perceptual to the conceptual domain. Figure 9: Representing functional knowledge, including the meaning postulate ∀x, y[supports(y, x) ∧ contiguous(surface(x), surface(y)) → on1(x, y)]. Figure 10: Spatial templates sensitive to object function. Figure 11: Types of dialogue information states.]

acknowledgements
This paper was supported in part by the project Networks and Types (Vetenskapsrådet/Swedish Research Council, project VR 2013-4873).