

LU4R: Adaptive Spoken Language Understanding for Robots

2017, Italian Journal of Computational Linguistics

IJCoL, 3-1 | 2017 (Emerging Topics at the Third Italian Conference on Computational Linguistics and EVALITA 2016), pages 59-76. DOI: 10.4000/ijcol.432. © 2017 Associazione Italiana di Linguistica Computazionale.

Andrea Vanzo (Sapienza Università di Roma), Danilo Croce (Università di Roma, Tor Vergata), Roberto Basili (Università di Roma, Tor Vergata), Daniele Nardi (Sapienza Università di Roma)

Andrea Vanzo and Daniele Nardi: Dept. of Computer, Control and Management Engineering "Antonio Ruberti", Via Ariosto 25, 00185 Rome, Italy. E-mail: vanzo@diag.uniroma1.it, nardi@diag.uniroma1.it. Danilo Croce and Roberto Basili: Dept. of Enterprise Engineering, Via del Politecnico 1, 00133 Rome, Italy. E-mail: croce@info.uniroma2.it, basili@info.uniroma2.it.

Service robots are expected to operate in specific environments, where the presence of humans plays a key role. It is thus essential to enable natural and effective communication between humans and robots. One of the main features of such robotic platforms is the ability to react to spoken commands. This requires a comprehensive understanding of the user utterance in order to trigger the robot reaction. Moreover, the correct interpretation of linguistic interactions depends on physical, cognitive and language-dependent aspects related to the environment. In this work, we present the latest version of LU4R - adaptive spoken Language Understanding 4 Robots, a Spoken Language Understanding framework for the semantic interpretation of robotic commands that is sensitive to the operational environment. The overall system is designed according to a Client/Server architecture, so that it can be easily deployed on a wide range of robotic platforms. Moreover, an improved version of HuRIC - the Human-Robot Interaction Corpus - is presented; its main novelty is the extension to commands expressed in Italian. In order to demonstrate the effectiveness of the system, we also present empirical results in both English and Italian computed over the new HuRIC resource.

1. Introduction

One of the most challenging issues that Service Robotics has faced in recent years is the need for high-level interaction and collaboration between humans and robots. In such a context, human language is one of the most natural means of communication, thanks to its expressiveness and flexibility. However, effective natural language communication between humans and robots is challenging, also because of the different cognitive abilities involved in the interaction. In fact, for a robot to react to a simple command like "take the pillow on the couch", a number of implicit assumptions must be met. First, at least two entities, a pillow and a couch, must exist in the environment and the speaker must be aware of such entities. Accordingly, the robot must have access to an inner representation of the objects, e.g. an explicit map of the environment.
Second, mappings from lexical references to real world entities must be developed or made available. In this respect, the Grounding process (Harnad 1990) links symbols (e.g. words) to the corresponding perceptual information. Hence, robot interactions need to be grounded, as meaning depends on the state of the physical world and the interpretation crucially interplays with perception, as pointed out by psycholinguistic theories (Tanenhaus et al. 1995). The integration of perceptual information derived from the robot's sensors with an ontologically motivated description of the world has been adopted as an augmented representation of the environment, in the so-called semantic maps (Nüchter and Hertzberg 2008). In these maps, the existence of real world objects can be associated with lexical information, in the form of entity names given by a knowledge engineer or spoken by a user for a pointed object, as in Human-Augmented Mapping (Diosi, Taylor, and Kleeman 2005; Gemignani et al. 2016).

While Spoken Language Understanding (SLU) for Interactive Robotics has mostly been carried out over evidence specific to the linguistic level alone (see, for example, (Chen and Mooney 2011; Matuszek et al. 2012)), we argue that this process should be context-aware, in the sense that both the user and the robot live in, and make references to, a shared environment. For example, in the above command, "taking" is the intended action whenever a pillow is actually on the couch, so that "the pillow on the couch" refers to a single argument. On the contrary, the command may refer to a "bringing" action when no pillow is on the couch, and "the pillow" and "on the couch" correspond to different semantic roles. We are interested in an approach to the interpretation of spoken robotic commands that is consistent with (i) the world (with all the entities composing it), (ii) the Robotic Platform (with all its inner representations and capabilities), and (iii) the linguistic information derived from the user's utterance. We foster here the approach presented in (Bastianelli et al. 2016a), where a machine learning method for Spoken Language Understanding forces the interpretations to be consistent with the environment: this is obtained by extending the linguistic evidence that can be extracted from the uttered commands with perceptual evidence directly derived from the semantic map of the robot. In particular, the interpretation process is modeled as a sequence labeling problem, where the final labeler is trained by applying Structured Learning methods over realistic commands expressed in domestic environments, as in (Bastianelli et al. 2017). The resulting interpretations adhere to Frame Semantics (Fillmore 1985): this well-established theory provides a strong linguistic foundation for the overall process while broadening its applicability, as it is independent of the wide variety of existing robotic platforms. Such methodologies have been implemented in a free and ready-to-use framework, presented here, whose name is LU4R - an adaptive spoken Language Understanding framework for (4) Robots.
LU4R is entirely coded in Java and, thanks to its Client/Server architectural design, it is completely decoupled from the robot, enabling easy and fast deployment on virtually any platform (LU4R can be downloaded at http://sag.art.uniroma2.it/lu4r.html).

As the aforementioned approaches rely on realistic data, in this work we also present an extended version of HuRIC, the Human Robot Interaction Corpus, originally introduced in (Bastianelli et al. 2014). HuRIC is a collection of realistic spoken commands that users might express towards generic service robots. In this resource, each sentence is labeled with morpho-syntactic and syntactic information (e.g. dependency relations, POS tags, ...), along with its correct interpretation in terms of semantic frames (Baker, Fillmore, and Lowe 1998). We present here a new version of HuRIC that has been enhanced in terms of (i) the number of annotated sentences in English and (ii) a brand new section, where Italian commands have been added (and aligned) to the already existing English counterparts. To the best of our knowledge, this is the first dataset of spoken robotic commands in Italian (the extended version of HuRIC will be released at http://sag.art.uniroma2.it/huric.html). The extended version of HuRIC supports a larger and more significant evaluation of LU4R, which highlights its robustness across the investigated languages. Specifically, we observed very good performance for both languages, and the outcomes are encouraging for the deployment of LU4R (and the underlying methods and psycholinguistic assumptions) in realistic applications.

The rest of the paper is structured as follows. Section 2 provides a short survey of existing approaches to SLU in Human-Robot Interaction. Section 3 describes the semantic analysis process that represents the core of the LU4R system. In Section 4, an architectural description of the entire system is provided, as well as an overall introduction to its integration with a generic robot. Section 5 describes the new release of HuRIC, while in Section 6 we demonstrate the applicability of the proposed system to the interpretation of commands in English and Italian, by reporting our experimental results. Finally, Section 7 draws the conclusions.

2. Related Work

In Robotics, Spoken Language Understanding (SLU) has usually been addressed by following two orthogonal approaches: grammar-based and data-driven. Grammar-based systems for speech recognition model language phenomena through the definition of grammars. Moreover, they provide mechanisms to enrich the syntactic structure with semantic information, to build a semantic representation during the transcription process (Bos 2002; Bos and Oka 2007). In (Bastianelli et al. 2016b), SLU supporting manifold robotic tasks is performed jointly with speech recognition, through the definition of ad-hoc grammars. This is possible thanks to the Speech Recognition Grammar Specification (http://www.w3.org/TR/speech-grammar/), which allows semantic attachments to be injected directly within the grammar specification. Other approaches are based on formal languages, as in (Kruijff et al. 2007), where Combinatory Categorial Grammars (CCG) are applied to spoken dialogues in the context of Human-Augmented Mapping, or exploit template-based algorithms (see (Perera and Veloso 2015)) to extract a semantic interpretation of robotic commands from the corresponding syntactic trees.

Data-driven methods have also been applied to SLU for robotic applications.
Examples are (MacMahon, Stankiewicz, and Kuipers 2006) and (Chen and Mooney 2011), where the parsing of route instructions is addressed as a Statistical Machine Translation task between the human language and a synthesized robot language. The same approach is applied in (Matuszek, Fox, and Koscher 2010) to learn a translation model between natural language and formal descriptions of paths. A probabilistic CCG is used in (Matuszek et al. 2012) to map natural language navigational instructions into robot-executable commands. The same problem is faced in (Kollar et al. 2010; Duvallet, Kollar, and Stentz 2013), where Spatial Description Clauses are parsed from sentences through sequence labeling approaches. In (Tellex et al. 2011), the authors address natural language instructions about motion and grasping, which are mapped into Generalized Grounding Graphs (G3). In (Fasola and Matarić 2013a, 2013b), SLU for pick-and-place instructions is performed through a Bayesian classifier trained over a specific corpus. In (Misra et al. 2016), the authors define a probabilistic approach to ground natural language instructions within a changing environment.

2.1 Contribution

On the one hand, LU4R embodies most of the capabilities, in terms of linguistic generalization, characterizing the data-driven approaches presented above. On the other hand, it introduces several novelties that are missing in the existing literature. First, the interpretation is performed and provided in terms of semantic frames, according to the Frame Semantics theory (Fillmore 1985). Hence, the resulting logic form representing the meaning of a command is supported by a robust linguistic theory. Moreover, as both the proposed semantic parsing approach and the nature of such a theory are domain-independent, the development of an SLU system for other domains will depend mostly on the existence of training data. Second, the interpretation process is context-dependent whenever additional knowledge derived from perception discriminates among multiple possible interpretations.

3. The Language Understanding Cascade

A command interpretation system for a robotic platform must produce interpretations of user utterances. In this paper, the understanding process is based on the theory of Frame Semantics (Fillmore 1985); in this way, we aim to give a linguistic and cognitive basis to the interpretations. In particular, we consider the formalization promoted in the FrameNet project (Baker, Fillmore, and Lowe 1998), where actions expressed in user utterances can be modeled as semantic frames. Each frame represents a micro-theory about a real world situation, e.g. the actions of bringing or motion. Such micro-theories encode all the relevant information needed for their correct interpretation. This information is represented in FrameNet via the so-called frame elements, whose role is to specify the entities participating in a frame, e.g. the THEME frame element represents the object that is taken in a bringing action.
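To make the representation concrete, the following minimal Java sketch shows one possible way to encode a frame-based interpretation as a plain data structure. The class and field names are purely illustrative assumptions and do not correspond to the actual LU4R output formats described in Section 4.

```java
import java.util.List;

/** Illustrative container for a frame evoked by a command (not the actual LU4R data model). */
record FrameInstance(String frameName, String lexicalUnit, List<FrameElement> elements) { }

/** A frame element (semantic role) together with the span of text that fills it. */
record FrameElement(String role, String span) { }

class FrameExample {
    public static void main(String[] args) {
        // "bring the pillow on the couch" interpreted as a Bringing frame
        FrameInstance bringing = new FrameInstance(
                "Bringing",
                "bring",
                List.of(new FrameElement("THEME", "the pillow"),
                        new FrameElement("GOAL", "on the couch")));
        System.out.println(bringing);
    }
}
```

A structure of this kind could then be serialized into one of the output formats actually exposed by LU4R (e.g. XDG or AMR, see Section 4.2).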
As an example, let us consider the following sentence: "bring the pillow on the couch" ("porta il cuscino sul divano", in Italian). This sentence can be intended as a command whose effect is to instruct a robot that, in order to achieve the task, has to: (i) move towards a pillow, (ii) pick it up, (iii) move to the couch and, finally, (iv) release the object on the couch. The language understanding cascade should produce its FrameNet-annotated version, that is:

[bring]Bringing [the pillow]THEME [on the couch]GOAL (1)

or, whenever the command is expressed in Italian:

[porta]Bringing [il cuscino]THEME [sul divano]GOAL (2)

Semantic frames can thus provide a cognitively sound bridge between the actions expressed in the language and the implementation of such actions in the robot world, namely plans and behaviors.

Figure 1: The SLU cascade. ASR hypotheses and perceived entities feed the Morpho-Syntactic Analysis, Re-ranking, Action Detection, Argument Identification and Argument Classification modules, which produce the final interpretation.

Figure 2: Example of a dependency graph and POS tags associated with "bring the pillow on the couch".

The whole SLU process has been designed as a cascade of reusable components, as shown in Figure 1. As we deal with vocal commands, their (possibly multiple) hypothesized transcriptions derived from an Automatic Speech Recognition (ASR) engine constitute the input of this process. The cascade is composed of four modules, whose final output is the interpretation of an utterance, to be used to implement the corresponding robotic actions. First, Morpho-syntactic and syntactic analysis is performed over the available utterance transcriptions by applying morphological analysis, Part-of-Speech tagging and syntactic parsing. In particular, dependency trees are extracted from the sentence as well as POS tags, as shown in Figure 2. Then, if more than one transcription hypothesis is available, the Re-ranking module can be activated to compute a new ranking of the hypotheses, in order to get the best transcription out of the initial ranking. This module is realized through a learn-to-rank approach, where a Support Vector Machine exploiting a combination of linguistic kernels is applied, according to (Basili et al. 2013). Third, the best transcription is the input of the Action Detection (AD) component: the frames evoked in a sentence are detected, along with the corresponding evoking words, the so-called lexical units. Considering the recurring example, the AD step should produce the following interpretation: [bring]Bringing the pillow on the couch. The final step is the Argument Labeling, where a set of frame elements is retrieved for each frame. This process is realized in two sub-steps. First, the Argument Identification (AI) finds the spans of all the possible frame elements, producing the following form: [bring]Bringing [the pillow] [on the couch]. Then, the Argument Classification (AC) assigns the suitable label (i.e. the frame element) to each span, thus returning the final tagging shown in Example 1. The AD, AI and AC steps are modeled as a sequential labeling task, as in (Bastianelli et al. 2016a).
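Since AD, AI and AC are all cast as sequence labeling, each step can be read as the assignment of one label per token. The sketch below illustrates this view on the running example with IOB-style tags; the tag inventory and encoding are illustrative assumptions, not the exact label sets used inside LU4R.

```java
import java.util.List;

/** Illustrative per-token labeling of "bring the pillow on the couch" (hypothetical tag sets). */
class SequenceLabelingExample {
    public static void main(String[] args) {
        List<String> tokens = List.of("bring", "the", "pillow", "on", "the", "couch");

        // Action Detection: mark the lexical unit that evokes the frame.
        List<String> ad = List.of("B-Bringing", "O", "O", "O", "O", "O");

        // Argument Identification: mark argument spans, without assigning roles yet.
        List<String> ai = List.of("O", "B-ARG", "I-ARG", "B-ARG", "I-ARG", "I-ARG");

        // Argument Classification: assign a frame element to each identified span.
        List<String> ac = List.of("O", "B-THEME", "I-THEME", "B-GOAL", "I-GOAL", "I-GOAL");

        for (int i = 0; i < tokens.size(); i++) {
            System.out.printf("%-8s %-12s %-8s %-10s%n",
                    tokens.get(i), ad.get(i), ai.get(i), ac.get(i));
        }
    }
}
```

Under such an encoding, a single decoding pass per step yields the spans and labels of Example 1.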
The Markovian formulation of a structured SVM proposed in (Altun, Tsochantaridis, and Hofmann 2003), known as SVMhmm, is applied to implement the sequential labeler. In general, this learning algorithm combines a local discriminative model, which estimates the individual observation probabilities of a sequence, with a global generative approach that retrieves the most likely sequence, i.e. the tags that best explain the whole sequence. In other words, given an input sequence x = (x1, ..., xl) ∈ X of feature vectors, SVMhmm learns a model isomorphic to a k-order Hidden Markov Model in order to associate x with a sequence of labels y = (y1, ..., yl) ∈ Y. A sentence s is here intended as a sequence of words wi, each modeled through a feature vector xi and associated with a dedicated label yi, specifically designed for each interpretation step: in any case, the features combine linguistic evidence from the target sentence with features derived from the semantic map (when available), in order to synthesize information about the existence and position of entities around the robot, as discussed in more detail in (Bastianelli et al. 2016a). During training, the SVM algorithm associates words with step-specific labels: linear kernel functions are applied to different types of features, ranging from linguistic to perception-based features, and linear combinations of kernels are used to integrate independent properties. At classification time, given a sentence s = (w1, ..., w|s|), SVMhmm efficiently predicts the tag sequence y = (y1, ..., y|s|) using a Viterbi-like decoding algorithm.

Notice that both the re-ranking and the semantic parsing phases can be realized in two different settings, depending on the type of features adopted in the labeling process. It is thus possible to rely on linguistic information alone to solve the given task, or also on perceptual knowledge coming from a semantic map. In the first case, which we call the basic setting, the information used to solve the task comes from linguistic inputs, such as the sentence itself or external linguistic resources. These models correspond to the methods discussed in (Bastianelli et al. 2017; Basili et al. 2013). In the second case, the simple setting, perceptual information is made available to the chain and a context-aware interpretation is triggered, as in (Bastianelli et al. 2016a). Such perceptual knowledge is mainly exploited through a linguistic grounding mechanism. This lexically-driven grounding is estimated through distances between fillers (i.e. argument heads) and entity names. Such a semantic distance integrates metrics over word vector representations and phonetic similarity. Word semantic vectors are here acquired through corpus analysis, as in Distributional Lexical Semantics paradigms (Turney and Pantel 2010). They allow referential elements, such as lexical fillers (e.g. couch), to be mapped to entities (e.g. a sofa), thus modeling synonymy or co-hyponymy. Conversely, phonetic similarities act as smoothing factors against possible ASR transcription errors (e.g. pitcher vs. picture), allowing the system to cope with actual spoken language. Once links between fillers and entities have been activated, the sequential labeler is made sensitive to additional features that inject perceptual information both in the learning and in the tagging process, e.g. the presence/absence of the referred objects in the environment. As a side effect, the above mechanism provides the robot with a set of linguistically-motivated groundings that can potentially be used for any further grounding process. This information can be crucial for the correct interpretation of ambiguous commands, which depends on the specific environmental setting in which the robot operates.
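As an illustration of this lexically-driven grounding, the sketch below combines a cosine similarity over word vectors with a normalized edit distance standing in for phonetic similarity. The combination weights, the toy vectors and the use of edit distance are assumptions made for the example; they are not the exact metrics adopted inside LU4R.

```java
import java.util.Map;

/** Toy filler-to-entity matching: distributional similarity smoothed by a string/phonetic term. */
class GroundingSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Normalized Levenshtein similarity, used here as a crude stand-in for phonetic similarity. */
    static double stringSimilarity(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[s.length()][t.length()] / Math.max(s.length(), t.length());
    }

    /** Linear combination of the two evidences (weights are illustrative). */
    static double groundingScore(String filler, double[] fillerVec, String entity, double[] entityVec) {
        return 0.7 * cosine(fillerVec, entityVec) + 0.3 * stringSimilarity(filler, entity);
    }

    public static void main(String[] args) {
        // Tiny hand-made vectors, only for demonstration purposes.
        Map<String, double[]> vectors = Map.of(
                "couch", new double[]{0.9, 0.1, 0.3},
                "sofa",  new double[]{0.85, 0.15, 0.35},
                "table", new double[]{0.1, 0.9, 0.2});
        System.out.println(groundingScore("couch", vectors.get("couch"), "sofa", vectors.get("sofa")));
        System.out.println(groundingScore("couch", vectors.get("couch"), "table", vectors.get("table")));
    }
}
```

A score of this kind can then be thresholded to decide whether a filler such as couch should be linked to an entity named sofa.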
A straightforward example is the command "bring the pillow on the couch in the living room". Such a sentence may have two different interpretations, according to the configuration of the environment. In fact, whenever the couch is located in the living room, the goal of the Bringing action is the couch and the interpretation will be:

[bring]Bringing [the pillow]THEME [on the couch in the living room]GOAL

Conversely, if the couch is outside the living room, then probably the pillow is already on the couch. Hence, the interpretation of the sentence will be different, due to different argument spans, and the living room becomes the goal of the Bringing action:

[bring]Bringing [the pillow on the couch]THEME [in the living room]GOAL

Such ambiguities are largely cross-lingual: the same phenomenon can be observed in the corresponding Italian command "porta il cuscino sul divano in sala da pranzo". However, the proposed approach is robust across languages, as the disambiguation of the interpretation depends only on the configuration of the environment and not on the target language. Additional details about the purely linguistic approach can be found in (Bastianelli et al. 2017), whereas (Bastianelli et al. 2016a) provides a detailed description of the context-aware SLU process.

4. LU4R - adaptive spoken Language Understanding 4 Robots

The architecture of the LU4R system considers two main actors, as shown in Figure 3: the Robotic Platform and the LU4R chain (or LU4R), where the processing cascade of the latter component has been introduced in the previous Section.

Figure 3: The LU4R architecture. The Robotic Platform (Client) hosts the LU4R Android app (ASR), the LU4R ROS interface (SLU Orchestrator), the Grounding component and a Support Knowledge Base (Domain Model, Semantic Map, User Model, Platform Model); it exchanges the spoken command hypotheses, the perceived entities and the resulting interpretation with the LU4R chain (Server).

The Client/Server communication schema between LU4R and the robot ensures independence from the Robotic Platform, in order to maximize re-usability and integration in heterogeneous robotic settings. The SLU process exhibits semantic capabilities (e.g. disambiguation, predicate detection or grounding into robotic actions and environments) that are designed to be general enough to be representative of a large set of application scenarios. Clearly, an interpretation must be produced even when no information about the domain/environment is available, i.e. in a scenario involving a blind but speaking robot, or when the actions the robot can perform are not made explicit. This is the case when the command "take the pillow on the couch" is not paired with any additional information and the ambiguity with respect to the evoked frame, i.e. Taking vs. Bringing, cannot be resolved. At the same time, LU4R makes available methods to specialize its semantic interpretation process to individual situations where more information is available about goals, the environment and the robot capabilities. These methods are expected to support the optimization of the core SLU process against a specific interactive robotics setting, in a cost-effective manner. In fact, whenever more information about the environment perceived by the robot (e.g. a semantic map) or about its capabilities is provided, the interpretation of a command can be improved by exploiting a more focused scope. That is the case, for instance, whenever the sentence "take the pillow on the couch" is provided along with information about the presence and possible positions of a pillow and a couch.
In order to better describe the different operating modalities of LU4R, some assumptions about the Robotic Platform must be made explicit: this allows us to precisely establish the functionalities and resources that the robot needs to provide in order to unlock the more complex processes. This information is used to express the experience that the robot is able to share with the user (i.e. the perceptual knowledge about the environment where the linguistic communication occurs, together with some lexical information and properties of the objects in the environment) and some level of awareness about its own capabilities (e.g. the primitive actions that the robot is able to perform, given its hardware components). In the following, each component of the architecture in Figure 3 is discussed and analyzed.

4.1 The Robotic Platform

The LU4R system contemplates a generic Robotic Platform, whose task, domain and physical setting are not necessarily specified. In order to make the SLU process independent of these specific aspects, we assume that the platform requires, at least, the following modules:

- an Automatic Speech Recognition (ASR) system;
- an SLU Orchestrator;
- a Grounding and Command Execution Engine;
- a Physical Robot.

In developing LU4R, we implemented both the ASR system and a simple SLU Orchestrator. The ASR is realized by the LU4R Android app, exploiting the Android environment, whereas the SLU Orchestrator is implemented as a ROS node, through the LU4R ROS interface. Additionally, the optional Support Knowledge Base component is expected to maintain and provide the contextual information discussed above. While a full discussion of the Robotic Platform is out of the scope of this work, all the other components are briefly summarized hereafter.

LU4R Android app. An ASR engine transcribes a spoken utterance into one or more transcriptions. In the latest release, the ASR is performed through an ad-hoc Android application, the LU4R Android app (Fig. 4). It relies on the official Google ASR API (http://goo.gl/4ZkdU) and offers valuable performance for an off-the-shelf solution. The main requirement of this solution is that the device hosting the software must feature an Internet connection, in order to provide transcriptions of the spoken utterances. The app can be deployed on both Android smartphones and tablets. In the latter case, even though the communication protocol remains the same, the tablet will be part of the robotic platform. The tablet can be provided with a directional condenser microphone and speakers.

Figure 4: The LU4R Android app.

The communication with the entire system is realized through TCP sockets. In this setting, the LU4R Android app implements a TCP Client, feeding LU4R with lists of hypotheses through a middle layer. To this end, the LU4R ROS interface has been integrated in the loop, acting as the TCP Server. Once a new sentence is uttered by the user, this component outputs a list of hypothesized transcriptions, which are forwarded to the LU4R ROS interface.

LU4R ROS interface. The LU4R ROS interface implements a TCP Server for the LU4R Android app, here coded as a ROS node waiting for Client requests.
Once a new request is received (a list of transcriptions for a given spoken sentence), this module is in charge of extracting the perceived entities from a structured representation of the environment (here, a sub-component of the Support Knowledge Base) and of sending the list of hypothesized transcriptions to LU4R along with the list of perceived entities. The communication protocol requires the serialization of such information into two different JSON objects. However, in order to obtain the desired interpretation, only the list of transcriptions is mandatory. In fact, even though environmental information is essential for the perception-driven chain, whenever it is not provided the chain operates in a blind setting. Moreover, this module has been decoupled from the LU4R chain as it can be employed for other purposes, such as tele-operating the robot by means of a virtual joypad coded into the Android app (Fig. 4). This component, mediating between the LU4R Android app, the LU4R Chain and the Robotic Platform, is provided along with the LU4R system, so that robustness in the communication is guaranteed. Hence, the robotic developers are in charge of: (i) the deployment of the ROS node into the target Robotic System; (ii) the definition of the policies for the acquisition of perceptual knowledge; and (iii) the manipulation of the structure representing the interpretation returned by the LU4R Chain. Even though this module is actually a TCP Server for the LU4R Android app, it also represents the Client interface toward the LU4R Chain.

Grounding and Command Execution. Even though the grounding process is placed at the end of the loop, it is discussed here as it is a component of the Robotic Platform. In fact, this process has been completely decoupled from the SLU process, as it may involve perception capabilities and information unavailable to LU4R or, in general, outside the linguistic dimension. Nevertheless, this situation can be partially compensated by defining mechanisms to exchange some of the grounding information with the linguistic reasoning component. The grounding carried out by the robot is triggered by a logical form expressing one or more actions through logic predicates, which potentially correspond to specific frames. The output of LU4R embodies the produced logic form: the latter exposes the recognized actions, which are then linked to specific robotic operations (primitive actions or plans). Correspondingly, the predicate arguments (e.g. the objects and locations involved in the targeted action) are detected and linked to the objects/entities of the current environment. A fully grounded command is obtained through the complete instantiation of the robot action (or plan) and its final execution.

4.2 The LU4R Chain

The LU4R component implements the language understanding cascade described in Section 3. It realizes the SLU service as a black-box component, so that the complexity of each inner sub-task is hidden from the user. It is entirely coded in Java and released as a single Jar file. Morpho-syntactic and syntactic analysis is realized through the Stanford CoreNLP suite (Manning et al. 2014) when English is the targeted language, and through the Chaos parser (Basili and Zanzotto 2002) for Italian commands. Conversely, the SVMhmm algorithm for the three steps of the semantic analysis (namely, Action Detection, Argument Identification and Argument Classification) is implemented through the KeLP framework (Filice et al. 2015).
The LU4R Chain is a service that can be invoked through HTTP communication. Its implementation is realized through a server that keeps listening for natural language sentences and outputs an interpretation for them. The communication between the client of the service (the Robotic Platform) and the LU4R Chain is described in this Section. The LU4R Chain requires an initialization phase, where the process is run and initialized, followed by a service phase, where LU4R is ready to receive requests. The initialization phase corresponds to creating an instance of the chain, among the configurations defined in the previous Section, i.e. either basic or simple. The basic setting does not contemplate perceptual knowledge during the interpretation process. Conversely, the simple configuration relies on perceptual information, enabling a context-sensitive interpretation of the command at the predicate level. During the initialization, a specific output format can also be chosen among the available ones. For example, xdg is the default output format, where the interpretation is given as an eXtended Dependency Graph (XDG), an XML-compliant container (see (Basili and Zanzotto 2002)). In the amr format, the interpretation is given in the Abstract Meaning Representation (see (Banarescu et al. 2013)). Finally, cfr (Command Frame Representation) is a format for the predicates (frames) produced by the chain, defined in (Schneider et al. 2014) in the context of the RoCKIn competition. The language parameter selects the operating language of LU4R. At the moment, only en (English) and it (Italian) versions are supported.

Once the service has been initialized, it is possible to start requesting the interpretation of user utterances. The server waits for messages carrying the utterance transcriptions to be parsed. Each sentence here corresponds to a speech recognition hypothesis. Hence, it can be paired with the corresponding transcription confidence score, useful in the re-ranking phase. The body of the message must then contain the list of hypotheses encoded as a JSON array, called hypotheses, where each entry is a transcription paired with a confidence. Additionally, when the simple configuration is selected, the input can include the list of entities populating the environment in which the robot is operating (e.g. the names of rooms, furniture and objects in the rooms), again encoded as a JSON array. Regardless of the representation of the environment adopted by the robot, this environment-dependent interpretation process requires the following information for each entity "perceived" by the robot (a sketch of the resulting request follows the list):

- the type of each entity: it reflects the class to which each specific entity belongs (e.g. it is an object, such as a table, book or pillow, or a location, such as living_room or kitchen);
- the preferredLexicalReference used to refer to a class of objects: it is crucial in order to enable a linguistic grounding between the commands uttered by the user and the entities within the environment. These labels are expected to be provided by the engineer initializing the robot. For example, an entity of the class couch can be referred to by the string sofa. If no label is given, it is derived from the name of the corresponding class, so that couch can be used to refer to the objects of the class couch;
- in case the engineer provides more than one label, these can be specified through alternativeLexicalReference, as a list of alternative namings for a given entity;
- the position of each entity, which is essential to determine shallow spatial relations between entities (e.g. whether two objects are near to or far from each other). To this end, each entity is associated with its corresponding coordinates in the world, in terms of planar coordinates (x, y), elevation (z) and an angle for the orientation.
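The following Java sketch illustrates what such a request might look like, with the hypotheses and the perceived entities built as JSON text blocks and posted to a locally running LU4R service. The endpoint path and port, the exact field names for confidence and coordinates, and the wrapping of the two arrays into a single request body are assumptions made for illustration; only hypotheses, type, preferredLexicalReference and alternativeLexicalReference are named as such in the text above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative client for the LU4R HTTP service (endpoint and some field names are assumptions). */
class Lu4rClientSketch {
    public static void main(String[] args) throws Exception {
        // List of ASR hypotheses: each entry pairs a transcription with a confidence score.
        String hypotheses = """
            [ {"transcription": "take the pillow on the couch", "confidence": 0.87},
              {"transcription": "take the pillow on the coach", "confidence": 0.10} ]""";

        // Perceived entities from the semantic map (only needed in the "simple" configuration).
        String entities = """
            [ {"type": "pillow", "preferredLexicalReference": "pillow",
               "x": 2.1, "y": 0.4, "z": 0.5, "angle": 0.0},
              {"type": "couch", "preferredLexicalReference": "sofa",
               "alternativeLexicalReference": ["couch"],
               "x": 2.3, "y": 0.6, "z": 0.0, "angle": 1.57} ]""";

        String body = "{ \"hypotheses\": " + hypotheses + ", \"entities\": " + entities + " }";

        // Hypothetical local deployment of the LU4R server.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9090/service/nlu"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // interpretation in the chosen output format (e.g. xdg)
    }
}
```

In the basic configuration the entities array can simply be omitted, and the chain falls back to the blind interpretation described above.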
5. HuRIC 2.0: a multilingual corpus of robotic commands

The computational paradigms adopted in LU4R are based on machine learning techniques and depend strictly on the availability of training data. In order to properly train and test our framework, we are developing a collection of datasets that together form the Human-Robot Interaction Corpus (HuRIC), formerly presented in (Bastianelli et al. 2014) and available at http://sag.art.uniroma2.it/huric (the download page also contains a detailed description of the release format). HuRIC is based on Frame Semantics and captures cognitive information about situations and events expressed in sentences. Differently from other corpora for Spoken Language Understanding in Human-Robot Interaction, it is neither system- nor robot-dependent, both with respect to the kind of sentences and with respect to the adopted formalism. HuRIC contains information strictly related to Natural Language Semantics and is decoupled from specific systems. The corpus exploits different situations representing possible commands given to a robot in a house environment. HuRIC is composed of different subsets, characterized by different orders of complexity and designed to stress a possible architecture in different ways. Each dataset includes a set of audio files representing robot commands, paired with the correct transcriptions.

Table 1: HuRIC, some statistics.
                           English   Italian
Number of examples           656       214
Number of frames              18        14
Number of predicates         767       241
Number of roles               34        27
Predicates per sentence     1.17      1.13
Sentences per frame        36.44     15.29
Roles per sentence          2.04      1.83

Table 2: Distribution of frames and frame elements in the English dataset. Each frame is reported with its number of occurrences, followed by the counts of its frame elements.
Motion (143): THEME 23, GOAL 129, DIRECTION 9, PATH 9, MANNER 4, DISTANCE 1, AREA 2, SOURCE 1
Locating (90): PHENOMENON 89, GROUND 34, COGNIZER 10, PURPOSE 5, MANNER 2
Change_direction (11): THEME 1, DIRECTION 11, ANGLE 3, SPEED 1
Placing (52): THEME 52, GOAL 51, AGENT 7, AREA 1
Being_located (38): THEME 38, LOCATION 34, PLACE 1
Perception_active (6): PHENOMENON 6, MANNER 1
Bringing (153): THEME 153, GOAL 95, AGENT 39, BENEFICIARY 56, SOURCE 18, MANNER 1, AREA 1
Cotheme (39): COTHEME 39, SPEED 1, MANNER 9, THEME 4, PATH 1, GOAL 8, AREA 1
Inspecting (29): GROUND 28, DESIRED_STATE 9, INSPECTOR 5, UNWANTED_ENTITY 2
Taking (80): AGENT 8, THEME 80, SOURCE 16, PURPOSE 2
Arriving (12): GOAL 11, PATH 5, MANNER 1, THEME 1
Giving (10): RECIPIENT 10, THEME 10, DONOR 4, REASON 1
Closure (19): CONTAINER_PORTAL 8, AGENT 7, CONTAINING_OBJECT 11, DEGREE 2
Attaching (11): ITEM 6, GOAL 11, ITEMS 1
Being_in_category (11): ITEM 11, CATEGORY 11
Change_operational_state (49): AGENT 17, DEVICE 49, OPERATIONAL_STATE 43
Releasing (9): THEME 9, GOAL 5
Manipulation (5): ENTITY 5

Each sentence is then annotated with lemmas, POS tags, dependency trees (at the time of writing, the dependency trees associated with the Italian sentences are still under validation) and Frame Semantics.
Semantic frames and frame elements are used to represent the meaning of commands since, in our view, they reflect the actions a robot can accomplish in a home environment. In this way, HuRIC can potentially be used to train all the modules of the processing chain presented in Section 4. With respect to the previous releases, and in order to cover further robotic actions, the release of LU4R required an extension of HuRIC in terms of new frames, such as Change_direction, and, in general, of frame elements: at the moment the English subset of HuRIC contains 656 sentences. Most importantly, we extended HuRIC with a first set of 214 commands in Italian. Almost all Italian sentences are translations of the original commands in English and the corpus also keeps the alignment between those sentences. We believe these alignments will support further research in other areas, such as Machine Translation. The number of annotated sentences, the number of frames and further statistics are reported in Table 1. Detailed statistics about the number of sentences for each frame and frame element are reported in Tables 2 and 3 for the English and Italian subsets, respectively.

Table 3: Distribution of frames and frame elements in the Italian dataset. Each frame is reported with its number of occurrences, followed by the counts of its frame elements.
Motion (32): GOAL 28, MANNER 1, THEME 3, PATH 2, SOURCE 1
Bringing (59): THEME 60, GOAL 26, BENEFICIARY 31, SOURCE 8
Closure (10): CONTAINING_OBJECT 5, CONTAINER_PORTAL 6, DEGREE 1
Taking (22): THEME 22, SOURCE 8
Releasing (8): THEME 8, PLACE 3
Locating (27): MANNER 2, PHENOMENON 27, GROUND 6, PURPOSE 1, DIRECTION 1
Cotheme (13): COTHEME 13, GOAL 5, MANNER 6
Inspecting (4): GROUND 2, UNWANTED_ENTITY 2, INSTRUMENT 1, DESIRED_STATE_OF_AFFAIRS 2
Placing (18): THEME 18, GOAL 17, AREA 1
Giving (7): RECIPIENT 6, THEME 7, DONOR 1
Being_located (14): LOCATION 14, THEME 12
Change_operational_state (14): DEVICE 14
Change_direction (9): DIRECTION 9, ANGLE 3, SPEED 1
Being_in_category (4): ITEM 4, CATEGORY 4

The current release of HuRIC is provided with a novel XML-based format, whose extension is hrc. For each command we can store: (i) the whole sentence, (ii) the list of the tokens composing it, along with the corresponding lemma and POS tag, (iii) the dependency relations among tokens, and (iv) the semantics, expressed in terms of frames and frame elements.

6. Experimental Evaluation

In order to provide evidence about the effectiveness of the proposed solution, we report here an evaluation of the interpretation of robotic commands in two languages, i.e. English and Italian, w.r.t. the basic setting. Tables 4 and 5 show the results obtained over the new version of the Human-Robot Interaction Corpus (HuRIC), presented in Section 5. The experiments have been performed on both languages, as HuRIC provides commands in both English and Italian. The results, expressed in terms of Precision, Recall and F1 measure, focus on the semantic interpretation process, in particular on the Action Detection (AD), Argument Identification (AI) and Argument Classification (AC) steps. The F1 scores measure the quality of a specific module: while in the AD step the F1 refers to the ability to extract the correct frame(s) (i.e. the robot action(s) expressed by the user) evoked by a sentence, in the AI step it evaluates the correctness of the predicted argument spans. Finally, in the AC step the F1 measures the accuracy of the classification of individual arguments. The experiments have been performed in a random split setting, over 5 iterations. During each iteration, the dataset is shuffled and split into three subsets, containing 70%, 10% and 20% of the data, used as training, tuning and testing set, respectively. In this respect, Tables 4 and 5 also show the standard deviations among the different iterations. We tested each sub-module in isolation, feeding each step with the gold information provided by the previous step in the chain. Moreover, the evaluation has been carried out considering the correct transcriptions, i.e. without the errors introduced by the Automatic Speech Recognition system. The results over both datasets refer to the basic setting of LU4R, that is the configuration in which only linguistic information is exploited.
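The evaluation protocol just described (five random 70/10/20 splits, with means and standard deviations computed over the iterations) can be summarized by the sketch below; the random seed, the container types and the aggregation code are illustrative, and the training call is only a placeholder for the actual SVMhmm pipeline.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of the random-split evaluation: 5 iterations, 70/10/20 train/tune/test, mean and std of F1. */
class RandomSplitEvaluationSketch {

    /** Placeholder for training and evaluating one step of the chain (AD, AI or AC). */
    static double trainAndScoreF1(List<String> train, List<String> tune, List<String> test) {
        return 0.0; // a real implementation would train SVMhmm on 'train' and compute F1 on 'test'
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>(List.of(/* annotated HuRIC sentences */));
        List<Double> scores = new ArrayList<>();
        Random rng = new Random(42); // illustrative seed

        for (int iteration = 0; iteration < 5; iteration++) {
            Collections.shuffle(corpus, rng);
            int trainEnd = (int) (corpus.size() * 0.7);
            int tuneEnd = (int) (corpus.size() * 0.8);
            List<String> train = corpus.subList(0, trainEnd);
            List<String> tune = corpus.subList(trainEnd, tuneEnd);
            List<String> test = corpus.subList(tuneEnd, corpus.size());
            scores.add(trainAndScoreF1(train, tune, test));
        }

        double mean = scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = scores.stream().mapToDouble(f -> (f - mean) * (f - mean)).average().orElse(0.0);
        System.out.printf("F1 = %.2f ± %.2f%n", mean, Math.sqrt(variance));
    }
}
```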
Table 4: Results on the English dataset.
       Precision         Recall            F1-Measure
AD     95.14% ± 1.73     95.02% ± 0.37     95.07% ± 0.93
AI     89.95% ± 2.28     89.63% ± 2.00     89.78% ± 2.05
AC     92.15% ± 1.51     92.15% ± 1.51     92.15% ± 1.51

The results for the commands in English (Table 4) are encouraging for the application of LU4R in realistic scenarios. In fact, the F1 is higher than 95% in the recognition of the semantic predicates used to express the intended actions (AD). The system is also able to recognize the involved entities (AC) with high accuracy, with an F1 higher than 92%. This result is remarkable considering the complexity of the task: the classifier is able to cope with a high level of uncertainty, as the number of possible semantic roles is sizable, i.e. 34 in total. The most challenging task is the recognition of the spans composing a single frame element (AI), where the F1 settles just under 90% (89.78%). One of the most frequent errors concerns the ambiguity of the verb "take". In fact, as explained in the previous sections, the interpretation of such a verb may differ (i.e. either Bringing or Taking) depending on the configuration of the environment. As the basic setting does not rely on any kind of perceptual knowledge, the system is not able to correctly discriminate between them. Hence, the resulting interpretation is more likely to be wrong, as it does not reflect the semantics motivated by the environment. In terms of F1 measure, this issue mainly affects the recognition of the argument spans (AI), rather than the ability to identify the action(s) (AD), as for each (possibly) wrong frame there can be two or more (possibly) wrong arguments. For example, the sentence "take the pillow on the couch" will probably be recognized as a Taking action, even though it is labeled as Bringing, i.e. the pillow and the couch are supposed to be far apart in the environment. While the AD step receives just one penalty for the wrongly recognized action, the AI step is penalized twice, as two arguments were expected by the gold standard annotation, i.e. the pillow as THEME and the couch as GOAL, instead of one, i.e. the pillow on the couch as a single THEME argument. Preliminary experiments in the perception-driven setting seem to show that, whenever such knowledge is injected into the learning process, the system is able to mitigate the error rate on these phenomena.
In addition, the small values of the standard deviation suggest that the system is rather stable across the different iterations of the experiment and that the results do not depend on specific splits of the dataset.

Table 5: Results on the Italian dataset.
       Precision         Recall            F1-Measure
AD     93.59% ± 2.81     88.63% ± 4.25     91.01% ± 3.21
AI     82.80% ± 1.38     82.50% ± 3.47     82.64% ± 2.34
AC     89.93% ± 3.83     89.93% ± 3.83     89.93% ± 3.83

The experiments over the Italian dataset reflect the observations made for the English setting. The system is able to recognize actions (AD) with an F1 measure of 91.01%. Again, the good performance here suggests that the process of recognizing the intended action(s) is reliable enough to be applicable in real scenarios. As in the English setting, the most challenging step is Argument Identification, where the F1 measure does not exceed 83% (82.64%). The results are promising when compared to the actual size of the dataset: the classifiers are trained on just 80% of the entire Italian dataset, i.e. about 170 sentences on average. Though lower, the accuracy in recognizing the involved entities (AC) is in line with the English experiments, with an F1 score of 89.93%. It seems plausible that the gap in performance and standard deviations with respect to the results on the English dataset is mainly due to the reduced size of the Italian one. When looking at the errors, we observe again that the introduction of perceptual information can be beneficial for the overall task, especially for the AI step. In fact, the command "porta il bicchiere sul tavolo in cucina" (i.e. "bring the glass on the table in the kitchen") cannot be correctly interpreted without information about the involved entities, as two different interpretations are plausible. The intended action corresponds to Bringing in both cases; nevertheless, the involved roles are substantially different: whenever the referred table (tavolo) is inside the kitchen (cucina), the table itself represents the goal of the action, whereas if the glass (bicchiere) is on the table, the latter is probably outside the kitchen, which is, instead, the goal of the action. Hence, the lack of perceptual evidence can play a key role in producing these mis-classifications. Though the F1 measures are not directly comparable, as they capture different phenomena, this behavior could explain the larger gap between AD and AI results observed in the Italian experiment (8.37%) with respect to the English one (5.29%). Notice that this phenomenon does not affect the AC task since, as mentioned above, during the experimental evaluation each step is fed with gold standard annotations. At the time of writing, we are pairing each sentence from both subsets of HuRIC with semantic maps, in order to design proper systematic evaluations also for the simple, i.e. context-aware, setting.

7. Conclusions

In this paper, we presented a comprehensive framework for the robust implementation of natural language interfaces for Human-Robot Interaction (HRI). It is specifically designed for the automatic interpretation of spoken commands towards robots in domestic environments.
The solution proposed here relies on Frame Semantics and supports a structured learning approach to language processing, able to map individual sentence transcriptions to meaningful commands. A hybrid discriminative and generative learning method is proposed to map the interpretation process into a cascade of sentence annotation tasks. The overall framework and the individual algorithms have been implemented in LU4R, a free and ready-to-use Java processing chain, designed for the cost-effective and rapid deployment of language interfaces on a wide range of robotic platforms. By implementing the approach presented in (Bastianelli et al. 2016a), LU4R's command interpretation is made dependent on the robot's environment: in fact, the adopted training annotations not only express linguistic evidence from the source utterances, but also account for specific perceptual knowledge derived from a reference map. In this way, the aspects of the semantic map that are useful for interpretation are expressed via feature modeling within the structured learning mechanism. Such perceptual knowledge is derived from a semantically-enriched implementation of the robot map (i.e. its semantic map): it expresses information about the existence and position of the entities surrounding the robot; as this is also available to the user, this information is crucial to disambiguate predicates and role assignments.

The machine learning processes inside LU4R have been trained by using an extended version of HuRIC, the Human Robot Interaction Corpus. This corpus, originally composed of examples in English, now contains a subset of examples in Italian: on the one hand, this novel corpus supports the development of LU4R for the Italian language; most of all, it will support research on natural language interfaces for robots in that language. The empirical results obtained by LU4R for both languages are quite impressive (about 90% F1 in almost all the evaluations). This confirms (i) the effectiveness of the proposed processing chain and (ii) the applicability of the same approach to different languages.

Further effort is required to extend HuRIC with additional sentences, in order to consider a wider range of robotic actions. We are currently working to include the semantic maps associated with each individual sentence: this will support a systematic evaluation of the interpretation process enhanced with perceptual information. Future research will also focus on the extension of the methodology proposed in (Bastianelli et al. 2016a), e.g. by considering spatial relations between entities in the environment or their physical characteristics, such as their color. In the medium/long term, we believe that LU4R will support further and more challenging research topics in the context of HRI, such as interactive question answering or dialogue with robots.

References

Altun, Yasemin, Ioannis Tsochantaridis, and Thomas Hofmann. 2003. Hidden Markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington D.C., USA, August 21-24.

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 86–90, Montreal, Quebec, Canada, August 10-14.
Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August.

Basili, Roberto, Emanuele Bastianelli, Giuseppe Castellucci, Daniele Nardi, and Vittorio Perera. 2013. Kernel-based discriminative re-ranking for spoken command understanding in HRI. In AI*IA 2013: Advances in Artificial Intelligence. Springer International, pages 169–180.

Basili, Roberto and Fabio Massimo Zanzotto. 2002. Parsing engineering and empirical robustness. Natural Language Engineering, 8(3):97–120, June.

Bastianelli, Emanuele, Giuseppe Castellucci, Danilo Croce, Roberto Basili, and Daniele Nardi. 2014. HuRIC: a Human Robot Interaction Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 26-31.

Bastianelli, Emanuele, Giuseppe Castellucci, Danilo Croce, Roberto Basili, and Daniele Nardi. 2017. Structured learning for spoken language understanding in human-robot interaction. The International Journal of Robotics Research, 36(5-7):660–683.

Bastianelli, Emanuele, Danilo Croce, Andrea Vanzo, Roberto Basili, and Daniele Nardi. 2016a. A discriminative approach to grounded spoken language understanding in interactive robotics. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), New York, New York, USA, 9-15 July.

Bastianelli, Emanuele, Daniele Nardi, Luigia Carlucci Aiello, Fabrizio Giacomelli, and Nicolamaria Manes. 2016b. Speaky for robots: The development of vocal interfaces for robotic applications. Applied Intelligence, 44(1):43–66, January.

Bos, Johan. 2002. Compilation of unification grammars with compositional semantics to speech recognition packages. In Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), volume 1, pages 1–7, Taipei, Taiwan, 26-30 August. Association for Computational Linguistics.

Bos, Johan and Tetsushi Oka. 2007. A spoken language interface with a mobile robot. Artificial Life and Robotics, 11(1):42–47.

Chen, David L. and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th Conference on Artificial Intelligence (AAAI-11), pages 859–865, San Francisco, California, USA, August 7-11.

Diosi, Albert, Geoffrey R. Taylor, and Lindsay Kleeman. 2005. Interactive SLAM using laser and advanced sonar. In Proceedings of the 2005 International Conference on Robotics and Automation, pages 1103–1108, Barcelona, Spain, April 18-22.

Duvallet, Felix, Thomas Kollar, and Anthony Stentz. 2013. Imitation learning for natural language direction following through unknown environments. In 2013 IEEE International Conference on Robotics and Automation, pages 1047–1053, Karlsruhe, Germany, May 6-10.

Fasola, Juan and Maja J. Matarić. 2013a. Using semantic fields to model dynamic spatial relations in a robot architecture for natural language instruction of service robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 143–150, Tokyo, Japan, November 3-7.

Fasola, Juan and Maja J. Matarić. 2013b. Using spatial semantic and pragmatic fields to interpret natural language pick-and-place instructions for a mobile service robot. In Social Robotics: 5th International Conference, ICSR 2013, Bristol, UK, October 27-29, 2013, Proceedings. Springer International Publishing, pages 501–510.
Filice, Simone, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015): System Demonstrations, Beijing, China, 26-31 July.

Fillmore, Charles J. 1985. Frames and the semantics of understanding. Quaderni di Semantica, 6(2):222–254.

Gemignani, Guglielmo, Roberto Capobianco, Emanuele Bastianelli, Domenico Daniele Bloisi, Luca Iocchi, and Daniele Nardi. 2016. Living with robots. Robotics and Autonomous Systems, 78(C):1–16, April.

Harnad, Stevan. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.

Kollar, Thomas, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI '10), pages 259–266, Osaka, Japan, March 2-5.

Kruijff, Geert-Jan M., H. Zender, P. Jensfelt, and Henrik I. Christensen. 2007. Situated dialogue and spatial organization: What, where... and why? International Journal of Advanced Robotic Systems, 4(2).

MacMahon, Matt, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI '06), volume 2, pages 1475–1482. AAAI Press.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 55–60, Baltimore, Maryland, USA, June 22-27.

Matuszek, Cynthia, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI '10), pages 251–258, Osaka, Japan, March 2-5. IEEE Press.

Matuszek, Cynthia, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In Jaydev P. Desai, Gregory Dudek, Oussama Khatib, and Vijay Kumar, editors, Experimental Robotics: The 13th International Symposium on Experimental Robotics, volume 88 of Springer Tracts in Advanced Robotics, pages 403–415. Springer.

Misra, Dipendra K., Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell Me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3):281–300.

Nüchter, Andreas and Joachim Hertzberg. 2008. Towards semantic maps for mobile robots. Robotics and Autonomous Systems, 56(11):915–926.

Perera, Vittorio and Manuela M. Veloso. 2015. Handling complex commands as service robot task requests. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 1177–1183, Buenos Aires, Argentina, 25-31 July.
Schneider, Sven, Frederik Hegger, Aamir Ahmad, Iman Awaad, Francesco Amigoni, Jakob Berghofer, Rainer Bischoff, Andrea Bonarini, Rhama Dwiputra, Giulio Fontana, Luca Iocchi, Gerhard Kraetzschmar, Pedro Lima, Matteo Matteucci, Daniele Nardi, and Viola Schiaffonati. 2014. The RoCKIn@Home challenge. In Proceedings of the 41st International Symposium on Robotics (ISR/Robotik 2014), pages 1–7, Munich, Germany, June 2-3.

Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information during spoken language comprehension. Science, 268:1632–1634.

Tellex, Stefanie, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 34(4):64–76.

Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, January.