A Survey on Vector Representations of Meaning
Abstract
Over the past years, distributed semantic representations have proved to be effective
and flexible keepers of prior knowledge to be integrated into downstream applications. This
survey focuses on the representation of meaning. We start from the theoretical background
behind word vector space models and highlight one of their major limitations: the meaning
conflation deficiency, which arises from representing a word with all its possible meanings as
a single vector. Then, we explain how this deficiency can be addressed through a transition
from the word level to the more fine-grained level of word senses (in the broader acceptation of the term)
as a method for modelling unambiguous lexical meaning. We present a comprehensive
overview of the wide range of techniques in the two main branches of sense representation,
i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation
procedures and applications for this type of representation, and provides an analysis of
four of its important aspects: interpretability, sense granularity, adaptability to different
domains and compositionality.
1. Introduction
Recently, neural network based approaches which process massive amounts of textual data to
embed words’ semantics into low-dimensional vectors, the so-called word embeddings, have
garnered a lot of attention (Mikolov, Chen, Corrado, & Dean, 2013a; Pennington, Socher, &
Manning, 2014). Word embeddings have demonstrated their effectiveness in storing valuable
syntactic and semantic information (Mikolov, Yih, & Zweig, 2013d). In fact, they have
been shown to be beneficial to many Natural Language Processing (NLP) tasks, mainly
due to their generalization power (Goldberg, 2016). A wide range of applications have
reported improvements upon integrating word embeddings, including machine translation
(Zou, Socher, Cer, & Manning, 2013), syntactic parsing (Weiss, Alberti, Collins, & Petrov,
2015), text classification (Kim, 2014) and question answering (Bordes, Chopra, & Weston,
2014), to name a few.
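As a toy illustration of the kind of semantic regularity word embeddings capture, the sketch below runs the classic king − man + woman analogy over hand-crafted 3-dimensional vectors. The words, dimensions and values are invented for illustration; real embeddings are learned from corpora and have hundreds of dimensions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-crafted 3-d "embeddings" (dimensions: royalty, gender, plant-ness).
vectors = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "tulip": np.array([0.0,  0.0, 1.0]),
}

# The classic analogy: king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
```

With these toy values the offset vector lands exactly on queen, which is the intuition behind the analogy results reported for real embedding spaces.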
However, despite their flexibility and success in capturing semantic properties of words,
the effectiveness of word embeddings is generally hampered by an important limitation
which we will refer to as meaning conflation deficiency: the inability to discriminate among
Camacho-Collados & Pilehvar
different meanings of a word. A word can have one meaning (monosemous) or multiple
meanings (ambiguous). For instance, the noun nail can refer to two different meanings
depending on the context: a part of the finger or a metallic object. Hence, the noun nail
is said to be ambiguous1 . Each individual meaning of an ambiguous word is called a word
sense and a lexical resource that lists different meanings (senses) of words is usually referred
to as a sense inventory.2 While most words in general sense inventories (e.g., WordNet) are
monosemous3, frequent words tend to have more senses, according to the Principle of
Economical Versatility of Words (Zipf, 1949). Therefore, accurately capturing the semantics
of ambiguous words plays a crucial role in the language understanding of NLP systems.
In order to deal with the meaning conflation deficiency, a number of approaches have
attempted to model individual word senses. In this survey we have tried to synthesize
the most relevant works on sense representation learning. The main distinction among these
approaches lies in how they model meaning and where they obtain it from. Unsupervised models
directly learn word senses from text corpora, while knowledge-based techniques exploit the
sense inventories of lexical resources as their main source for representing meanings. In this
survey we cover these two classes of techniques for learning distributed semantic representa-
tions of meaning, including evaluation procedures and an analysis of their main properties.
While the survey is intended to be as extensive as possible, given the breadth of the topics
reviewed, some areas may not have received sufficient coverage to be fully self-contained.
For such cases we provide relevant pointers for readers interested in learning more
about the topic. Given the wide audience that this survey is intended to reach, we have tried
to make it as accessible as possible. To this end, technical details are sometimes not
spelled out in full; instead, we convey the intuition behind them.
The remainder of this survey is structured as follows. First, in Section 2 we provide a
theoretical background for word senses: what they are, why modeling them may be useful,
and the main paradigms for representing them. Then, in Section 3 we describe unsupervised sense vector modeling
techniques which learn directly from text corpora, while in Section 4 the representations
linked to lexical resources are explained. Common evaluation procedures and benchmarks
are presented in Section 5 and the applications in downstream tasks in Section 6. Finally,
we present an analysis and comparison between unsupervised and knowledge-based repre-
sentations in Section 7 and the main conclusions and future challenges in Section 8.
2. Background
This section provides theoretical foundations which support the move from the word level to the
more fine-grained level of word senses and concepts. First, we provide the background to
vector space models, particularly for word representation learning (Section 2.1). Then, we
1. Nail can also refer to a unit of cloth measurement (generally a sixteenth of a yard) or even be used as a
verb.
2. In order to obtain the list of possible word senses of a target word, lexicographers tend to first collect
occurrences of the words from corpora and then manually cluster them semantically and based on their
contexts, i.e., concordance (Kilgarriff, 1997). Given this procedure, Kilgarriff (1997) suggested that
word senses, as defined by sense inventories in NLP, should not be construed as objects but rather as
abstractions over clusters of word usages.
3. For instance, around 83% of the 155K words in WordNet 3.0 are listed as monosemous (see Section 4.1
for more information on lexical resources).
explain some of the main deficiencies of word representations which led to the development
of sense modeling techniques (Section 2.2) and describe the main paradigms for representing
senses (Section 2.3). In Section 2.4 we present a brief historical background of the related
task of word sense disambiguation. Finally, we explain the notation followed throughout
the survey (Section 2.5).
2.1 Word Representation Learning
Word representation learning has been one of the main research areas in Semantics since the
beginnings of NLP. We first introduce the main theories behind word representation learning
based on vector space models (Section 2.1.1) and then move to the emerging theories for
learning word embeddings (Section 2.1.2).
2.1.1 Vector Space Models
One of the most prominent methodologies for word representation learning is based on Vec-
tor Space Models (VSM), which is supported by research in human cognition (Landauer
& Dumais, 1997; Gärdenfors, 2004). The earliest VSM applied in NLP considered a doc-
ument as a vector whose dimensions were the whole vocabulary (Salton, Wong, & Yang,
1975). Weights of individual dimensions were initially computed based on word frequencies
within the document. Different weight computation metrics have been explored, but mainly
based on frequencies or normalized frequencies (Salton & McGill, 1983). This methodology
has been successfully refined and applied to various NLP applications such as information
retrieval (Lee, Chuang, & Seamons, 1997), text classification (Soucy & Mineau, 2005), or
sentiment analysis (Turney, 2002), to name a few. Turney and Pantel (2010) provide a
comprehensive overview of VSM and their applications.
The document-based VSM has also been extended to other lexical items like words. In
this case a word is generally represented as a point in a vector space. A word-based vector
has been traditionally constructed based on the normalized frequencies of the co-occurring
words in a corpus (Lund & Burgess, 1996), by following the initial theories of Harris (1954).
The main idea behind word VSMs is that words that occur in similar contexts should be close in
the vector space (and, therefore, have similar semantics). Figure 1 shows an example of a word
VSM where this underlying proximity axiom is clearly highlighted.
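The construction described above can be sketched end-to-end. The snippet below builds sentence-level co-occurrence vectors from a four-sentence toy corpus (an invented example) and shows that words sharing contexts end up closer under cosine similarity:

```python
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied and stocks rose",
]

# Build a symmetric word-by-word co-occurrence matrix (sentence as context).
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = [[0] * len(vocab) for _ in vocab]
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                counts[index[w]][index[c]] += 1

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def sim(a, b):
    return cosine(counts[index[a]], counts[index[b]])
# Words from the same "topic" ("cat"/"dog", "stocks"/"market") end up closer.
```

Note how the function word the dominates the raw counts; this is precisely why the frequency-based weighting schemes mentioned above are applied in practice.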
Vector-based representations have established their effectiveness in NLP tasks such as
information extraction (Laender, Ribeiro-Neto, da Silva, & Teixeira, 2002), semantic role
labeling (Erk, 2007), word similarity (Radinsky, Agichtein, Gabrilovich, & Markovitch,
2011), word sense disambiguation (Navigli, 2009) or spelling correction (Jones & Martin,
1997), inter alia. One of the main drawbacks of the conventional VSM approaches is the
high dimensionality of the produced vectors. Since the dimensions correspond to words
in the vocabulary, this number could easily reach hundreds of thousands or even millions,
depending on the underlying corpus. A common approach for dimensionality reduction
makes use of the Singular Value Decomposition (SVD), also known as Latent Semantic
Analysis (Hofmann, 2001; Landauer & Dooley, 2002, LSA). In addition, recent models also
leverage neural networks to directly learn low-dimensional word representations. These
models are introduced in the following section.
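The SVD-based reduction can be sketched in a few lines. Given a small word-by-context count matrix (hand-made here for illustration), keeping only the top k singular values yields dense low-dimensional word vectors that preserve the similarity structure:

```python
import numpy as np

# A small word-by-context count matrix (rows: words, columns: context words);
# the words and counts are invented. In practice this matrix has
# vocabulary-sized dimensions, which is what SVD compresses.
words = ["cat", "dog", "stocks", "market"]
M = np.array([
    [4.0, 2.0, 0.0, 0.0],   # cat
    [3.0, 1.0, 0.0, 0.0],   # dog
    [0.0, 0.0, 3.0, 2.0],   # stocks
    [0.0, 0.0, 2.0, 4.0],   # market
])

# LSA-style truncated SVD: keep only the k largest singular values.
k = 2
U, S, Vt = np.linalg.svd(M, full_matrices=False)
reduced = U[:, :k] * S[:k]   # dense k-dimensional word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
# Similar words ("cat"/"dog") stay close after reduction; unrelated ones do not.
```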
Figure 1: Subset of a sample word vector space reduced to two dimensions using t-SNE
(Maaten & Hinton, 2008). In a semantic space, words with similar meanings tend
to appear in the proximity of each other, as highlighted by these word clusters
(delimited by the red dashed lines) associated with big cats, birds and plants.
$$E = -\log\, p(w_t \mid W_t) \qquad (1)$$
where $w_t$ is the target word and $W_t = w_{t-n}, \ldots, w_t, \ldots, w_{t+n}$ represents the sequence of words
in context. Figure 2 shows a simplification of the general architecture of the CBOW and
Skip-gram models of Word2vec. The architecture consists of input, hidden and output
layers. The input layer has the size of the word vocabulary and encodes the context as a
combination of one-hot vector representations of surrounding words of a given target word.
The output layer has the same size as the input layer and contains a one-hot vector of
the target word during the training phase. The Skip-gram model is similar to the CBOW
model but in this case the goal is to predict the words in the surrounding context given
the target word, rather than predicting the target word itself. Interestingly, Levy and
Figure 2: Learning architecture of the CBOW and Skip-gram models of Word2vec (Mikolov
et al., 2013a).
Goldberg (2014b) proved that Skip-gram can be in fact viewed as an implicit factorization
of a Pointwise Mutual Information (PMI) co-occurrence matrix.
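This connection can be made concrete by computing the PMI matrix directly. The sketch below derives PMI (and its positive variant, PPMI) from toy word-context pair counts; the words and counts are invented:

```python
import math
from collections import Counter

# Toy (word, context) pair counts, as extracted with a sliding window.
pairs = Counter({
    ("cat", "pet"): 8, ("dog", "pet"): 6,
    ("cat", "meow"): 4, ("dog", "bark"): 5,
    ("stocks", "market"): 9, ("stocks", "pet"): 1,
})

total = sum(pairs.values())
w_count, c_count = Counter(), Counter()
for (w, c), n in pairs.items():
    w_count[w] += n
    c_count[c] += n

def pmi(w, c):
    """log p(w, c) / (p(w) p(c)): positive when w and c co-occur more
    often than chance, negative when less often."""
    n_wc = pairs[(w, c)]
    if n_wc == 0:
        return float("-inf")
    return math.log(n_wc * total / (w_count[w] * c_count[c]))

# PPMI clips negative values to zero, a common weighting for count VSMs.
ppmi = {wc: max(0.0, pmi(*wc)) for wc in pairs}
```

Skip-gram with negative sampling implicitly factorizes a matrix of this kind, shifted by the log of the number of negative samples.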
Another prominent word embedding architecture is GloVe (Pennington et al., 2014),
which combines global matrix factorization and local context window methods through a
bilinear regression model. In recent years more complex approaches that attempt to im-
prove the quality of word embeddings have been proposed, including models exploiting
dependency parse-trees (Levy & Goldberg, 2014a) or symmetric patterns (Schwartz, Re-
ichart, & Rappoport, 2015), leveraging subword units (Wieting, Bansal, Gimpel, & Livescu,
2016; Bojanowski, Grave, Joulin, & Mikolov, 2017), representing words as probability distri-
butions (Vilnis & McCallum, 2015; Athiwaratkun & Wilson, 2017; Athiwaratkun, Wilson,
& Anandkumar, 2018), learning word embeddings in multilingual vector spaces (Conneau,
Lample, Ranzato, Denoyer, & Jégou, 2018; Artetxe, Labaka, & Agirre, 2018), or exploiting
knowledge resources (more details about this type in Section 4.2).4
4. For a more comprehensive overview on word embeddings and their current challenges, please refer to the
work of Ruder (2017).
to two different senses of mouse, i.e., rodent and computer input device. See Figure 3 for an
illustration.5 Moreover, the conflation deficiency violates the triangle inequality of Euclidean
spaces, which can reduce the effectiveness of word space models (Tversky & Gati, 1982).
In order to alleviate this deficiency, a new direction of research has emerged over the past
years, which tries to directly model individual meanings of words. In this survey we focus
on this new branch of research, which has some similarities and peculiarities with respect
to word representation learning.
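A toy numeric illustration of the conflation deficiency, with invented 2-dimensional sense vectors: averaging the rodent and device senses of mouse into one vector makes it similar to both rat and keyboard, even though the latter two are unrelated.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sense vectors (dimensions: animal-ness, technology-ness).
rat = np.array([1.0, 0.0])
keyboard = np.array([0.0, 1.0])
mouse_animal = np.array([0.9, 0.1])
mouse_device = np.array([0.1, 0.9])

# A single word vector conflates both senses into (roughly) their average.
mouse = (mouse_animal + mouse_device) / 2

# The conflated vector is fairly similar to BOTH rat and keyboard, although
# rat and keyboard are themselves unrelated: the similarity structure
# induced by the single vector cannot be trusted transitively.
```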
2.5 Notation
Throughout this survey we use the following notation. Words will be referred to as w while
senses will be written as s. Concepts, entities and relations will be referred to as c, e and
r, respectively. Following previous work (Navigli, 2009), we use the following interpretable
expression for senses as well: $word^n_p$ is the $n$th sense of word with part of speech $p$. As
for synsets as represented in a sense inventory we will use $y$.9 A semantic network will be
generally represented as $N$. In order to refer to vectors, we will add the vector symbol on
top of each item: for instance, $\vec{w}$ and $\vec{s}$ will refer to the vectors of the word $w$ and sense
$s$, respectively.
Throughout this survey we use sense representation as an umbrella term covering all
vector representations (including embeddings) of meaning beyond the word level; more
specifically, it denotes the vector representation of a word associated with a specific
meaning10 (e.g., bank with its financial meaning), irrespective of whether that meaning comes
from a pre-defined sense inventory, and of whether it refers to a concept (e.g., banana) or an
entity (e.g., France).
The context-group discrimination of Schütze (1998) is one of the pioneering works in sense
representation. The approach was an attempt at automatic word sense disambiguation in
order to address the knowledge-acquisition bottleneck for sense annotated data (Gale et al.,
1992) and reliance on external resources. The basic idea of context-group discrimination
is to automatically induce senses from contextual similarity, computed by clustering the
contexts in which an ambiguous word occurs. Specifically, each context C of an ambigu-
ous word w is represented as a context vector ~vC , computed as the centroid of its content
words’ vectors ~vc (c ∈ C). Context vectors are computed for each word in a given corpus
and then clustered into a predetermined number of clusters (context groups) using the Ex-
pectation Maximization algorithm (Dempster, Laird, & Rubin, 1977, EM). Context groups
for the word are taken as representations for different senses of the word. Despite its sim-
plicity, the clustering-based approach of Schütze (1998) constitutes the basis for many of
the subsequent techniques, which mainly differed in their representation of context or the
underlying clustering algorithm. Figure 4 depicts the general procedure followed by the
two-stage unsupervised sense representation techniques.
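The two-stage procedure can be sketched as follows. The snippet uses hypothetical 2-dimensional word vectors and a plain k-means clusterer in place of the EM algorithm used by Schütze (1998); everything else follows the centroid-then-cluster recipe described above.

```python
import numpy as np

# Hypothetical 2-d vectors for content words (animal-ness, tech-ness).
word_vecs = {
    "cheese": np.array([0.9, 0.1]), "cat":    np.array([1.0, 0.0]),
    "trap":   np.array([0.8, 0.2]), "click":  np.array([0.1, 0.9]),
    "screen": np.array([0.0, 1.0]), "cursor": np.array([0.1, 1.0]),
}

# Contexts in which the ambiguous word "mouse" occurs.
contexts = [["cat", "cheese"], ["trap", "cheese"],
            ["click", "cursor"], ["screen", "cursor"]]

# Stage 1: represent each context as the centroid of its content words.
centroids = np.array([np.mean([word_vecs[w] for w in ctx], axis=0)
                      for ctx in contexts])

# Stage 2: cluster the context vectors; each cluster (context group) is
# taken as one induced sense of "mouse". Plain k-means stands in for EM.
def kmeans(X, k, iters=20):
    centers = X[:k].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(centroids, k=2)   # e.g. rodent contexts vs. device contexts
```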
Given its requirement for computing independent representations for all individual con-
texts of a given word, the context-group discrimination approach is not easily scalable to
large corpora. Reisinger and Mooney (2010) addressed this by directly clustering the con-
texts, represented as feature vectors of unigrams, instead of modeling contexts as vectors.
The approach can be considered as the first new-generation sense representation technique,
which is often referred to as multi-prototype. In this specific work, contexts were clustered
using Mixtures of von Mises-Fisher distributions (movMF) algorithm. The algorithm is
similar to k-means but permits controlling the semantic breadth using a per-cluster con-
centration parameter which would better model skewed distributions of cluster sizes.
The clustering-based approach to sense representation suffers from the limitation that clus-
tering and sense representation are done independently from each other and, as a result,
the two stages do not take advantage of their inherent similarities. The introduction of
embedding models was one of the most revolutionary changes to vector space models of
word meaning. As a closely related field, sense representations did not remain unaffected.
Many researchers have proposed various extensions of the Skip-gram model (Mikolov et al.,
2013a) that enable the capture of sense-specific distinctions. A major limitation
of the two-stage models is their computational expensiveness11. In contrast, thanks to the
efficiency of embedding algorithms and their unified nature (as opposed to the two-phase
nature of more conventional techniques), these joint models are efficient to train. Hence,
many of the recent techniques have relied on embedding models as their base framework.
Neelakantan et al. (2014) were the first to propose a multi-prototype extension of the
Skip-gram model. Their model, called Multiple-Sense Skip-Gram (MSSG), is similar to
earlier work in that it represents the context of a word as the centroid of its words’ vectors
and clusters them to form the target word’s sense representations. However, the fundamental
difference is that clustering and sense embedding learning are performed jointly. During
training, the intended sense for each word is dynamically selected as the closest sense to
the context and weights are updated only for that sense. In a concurrent work, Tian, Dai,
Bian, Gao, Zhang, Chen, and Liu (2014) proposed a Skip-gram based sense representation
technique that significantly reduced the number of parameters with respect to the model of
Huang et al. (2012). In this case, word embeddings in the Skip-gram model are replaced
with a finite mixture model in which each mixture corresponds to a prototype of the word.
The EM algorithm was adopted for the training of this multi-prototype Skip-gram model.
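The "select the closest sense, update only that sense" step at the heart of MSSG-style joint training can be sketched as follows. This is a drastic simplification: real models update senses through the Skip-gram objective rather than by direct interpolation, and the vectors here are random toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

K, dim = 2, 4        # a fixed number of senses per word, as assumed by MSSG
senses = rng.normal(size=(K, dim))   # sense vectors for one ambiguous word

def closest_sense(context_vec):
    """Dynamically select the sense closest to the context representation."""
    return int(np.argmax(senses @ context_vec))

def update(context_vec, lr=0.5):
    """Select the intended sense, then update only that sense's vector
    (toy interpolation update in place of the Skip-gram gradient step)."""
    k = closest_sense(context_vec)
    senses[k] += lr * (context_vec - senses[k])
    return k

# Two recurring context types, e.g. 'rodent' vs 'device' contexts of "mouse".
ctx_a = np.array([1.0, 0.0, 0.0, 0.0])
ctx_b = np.array([0.0, 1.0, 0.0, 0.0])
for _ in range(10):
    update(ctx_a)
    update(ctx_b)
```

After a few passes, the two context types come to select different senses, which is the joint "cluster while learning" behaviour that distinguishes MSSG from the two-stage pipeline.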
11. For instance, the model of Huang et al. (2012) took around one week to learn sense embeddings for a
6,000 subset of the 100,000 vocabulary on a corpus of one billion tokens (Neelakantan et al., 2014).
# Senses      2    3    4    5    6    7    8    9   10   11   12  >12
Nouns       22%  17%  14%  13%   9%   7%   4%   4%   3%   3%   1%   3%
Verbs       15%  16%  14%  13%   9%   7%   5%   4%   4%   3%   1%   9%
Adjectives  23%  19%  15%  12%   8%   5%   2%   3%   3%   1%   2%   6%

Table 1: Distribution of words per number of senses in the SemCor dataset (words with
frequency < 10 were pruned).
Liu, Liu, Chua, and Sun (2015b) argued that the above techniques are limited in that
they consider only the local context of a word for inducing its sense representations. To
address this limitation, they proposed Topical Word Embeddings (TWE) in which each word
is allowed to have different embeddings under different topics, where topics are computed
globally using latent topic modelling (Blei, Ng, & Jordan, 2003). Three variants of the
model were proposed: (1) TWE-1, which regards each topic as a pseudo word, and learns
topic embeddings and word embeddings separately; (2) TWE-2, which considers each word-
topic as a pseudo word, and learns topical word embeddings directly; and (3) TWE-3, which
assigns distinct embeddings for each word and each topic and builds the embedding of each
word-topic pair by concatenating the corresponding word and topic embeddings. Various
extensions of the TWE model have been proposed. The Neural Tensor Skip-gram (NTSG)
model (Liu et al., 2015a) applies the same idea of topic modeling for sense representation
but introduces a tensor to better learn the interactions between words and topics. Another
extension is MSWE (Nguyen, Nguyen, Modi, Thater, & Pinkal, 2017), which argues that
multiple senses might be triggered for a word in a given context and replaces the selection
of the most suitable sense in TWE by a mixture of weights that reflect different association
degrees of the word to multiple senses in the context.
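The pseudo-word trick used by TWE-1 is easy to sketch: once each token has a topic assignment (from LDA in the original work; hard-coded below), every (word, topic) pair is rewritten as a distinct vocabulary item.

```python
# Topic assignments are hard-coded stand-ins for LDA output.
tokens = ["bank", "river", "bank", "loan"]
topics = [0, 0, 1, 1]   # e.g. topic 0 = nature, topic 1 = finance

# TWE-1: each (word, topic) pair becomes a pseudo word, so any standard
# word-embedding toolkit will learn one vector per topical usage.
pseudo_corpus = [f"{w}#{t}" for w, t in zip(tokens, topics)]
# "bank#0" and "bank#1" are now distinct vocabulary items.
```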
These joint unsupervised models, however, suffer from two limitations. First, for ease of
implementation, most unsupervised sense representation techniques assume a fixed number
of senses per word. This assumption is far from realistic: words have a highly variable
number of senses, from one (monosemous) to dozens. In a given sense inventory, most
words are usually monosemous. For instance, around 80% of words in WordNet
3.0 are monosemous, with fewer than 5% having more than three senses. However, ambiguous
words tend to occur more frequently in real text, which slightly smooths the highly skewed
distribution of words across polysemy. Table 1 shows the distribution of word types by
their number of senses in SemCor (Miller et al., 1993), one of the largest available sense-
annotated datasets which comprises around 235,000 semantic annotations for thousands of
words. The skewed distribution clearly shows that word types tend to have a varying number
of senses in a natural text, as also discussed in other studies (Piantadosi, 2014; Bennett,
Baldwin, Lau, McCarthy, & Bond, 2016; Pasini & Navigli, 2018).
Second, a common strand of most unsupervised models is that they extend the Skip-
gram model by replacing the conditioning of a word to its context (as in the original model)
with an additional conditioning on the intended senses. However, the context words in
these models are not disambiguated. Hence, a sense embedding is conditioned on the word
embeddings of its context.
In the following we review some of the approaches that are directly targeted at addressing
these two limitations of the joint unsupervised models described above:
2. Pure sense-based models. Ideally, a model should capture the dependency between
sense choices in order to address the ambiguity of context words. Qiu et al. (2016)
addressed this problem by proposing a pure sense-based model. The model also ex-
pands the disambiguation context from a small window (as done in the previous works)
to the whole sentence. MUSE (Lee & Chen, 2017) is another Skip-gram extension
that proposes pure sense representations using reinforcement learning. Thanks to a
linear-time sense sequence decoding module, the approach provides a more efficient
way of searching for sense combinations.
Figure 5: A general illustration of contextualized word embeddings and how they are in-
tegrated in NLP models (Main system in the figure). A language modelling
component is responsible for analyzing the context of the target word (cell in
the figure) and generating its dynamic embedding. Unlike (context-independent)
word embeddings, which have static representations, contextualized embeddings
have dynamic representations that are sensitive to their context.
The sequence tagger of Li and McCallum (2005) is one of the pioneering works that
employ contextualized representations. The model infers context sensitive latent variables
for each word based on a soft word clustering and integrates them, as additional features, to
a CRF sequence tagger. Since 2011, with the introduction of word embeddings (Collobert
et al., 2011; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013c) and the efficacy of neural
networks, and in light of the meaning conflation deficiency of word embeddings, context-
sensitive models have once again garnered research attention. Emerging solutions mainly
aim at addressing the application limitations of unsupervised techniques; hence, they are
generally characterized by their ease of integration into downstream applications. Con-
text2vec (Melamud, Goldberger, & Dagan, 2016) is one of the earliest and most prominent
proposals in the new branch of contextualized representations. The model represents the
context of a target word by extracting the output embedding of a multi-layer perceptron
built on top of a bi-directional LSTM language model. Context2vec constitutes the basis
for many of the subsequent works.
Figure 5 provides a high-level illustration of the integration of contextualized word
embeddings into an NLP model. At training time, for each word (e.g., cell in the
figure) in a given input text, the language model unit is responsible for analyzing the context
(usually using recurrent neural networks) and adjusting the target word’s representation by
contextualising (adapting) it to the context. These context-sensitive embeddings are in fact
the internal states of a deep recurrent neural network, either in a monolingual language
modelling setting (Peters, Ammar, Bhagavatula, & Power, 2017; Peters, Neumann, Iyyer,
Gardner, Clark, Lee, & Zettlemoyer, 2018) or a bilingual translation configuration (McCann,
Bradbury, Xiong, & Socher, 2017). The training of contextualized embeddings is carried
out as a pre-training stage, independently from the main task on a large unlabeled or
differently-labeled text corpus. At test time, a word’s contextualized embedding is
usually concatenated with its static embedding and fed to the main model (Peters et al.,
2018).
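The test-time integration described above can be sketched as follows; the "language model" here is a stand-in that simply averages context vectors, whereas real systems use the internal states of deep bidirectional LSTMs, and the words and values are invented.

```python
import numpy as np

# Hypothetical 2-d static embeddings.
static = {
    "cell":    np.array([0.3, 0.7]),
    "biology": np.array([0.9, 0.1]),
    "prison":  np.array([0.1, 0.9]),
}

def contextual_embedding(context):
    """Stand-in for the language-model unit: averages the static vectors of
    the context words (real systems use deep bi-LSTM internal states)."""
    return np.mean([static[w] for w in context if w in static], axis=0)

def embed(word, context):
    # Test-time integration: the static embedding is concatenated with the
    # context-sensitive one before being fed to the main model.
    return np.concatenate([static[word], contextual_embedding(context)])

v_bio = embed("cell", ["biology"])     # "cell" in a biology context
v_prison = embed("cell", ["prison"])   # "cell" in a prison context
# The static half is identical; the contextual half differs per usage.
```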
The TagLM model of Peters et al. (2017) is a recent example of this branch which trains
a multi-layer bidirectional LSTM (Hochreiter & Schmidhuber, 1997) language model on
monolingual texts. The prominent ELMo (Embeddings from Language Models) technique
(Peters et al., 2018) is similar in principle with the exception that some weights are shared
between the two directions of the language modeling unit. The Context Vectors (CoVe)
model of McCann et al. (2017) similarly computes contextualized representations using a
two-layer bidirectional LSTM network, but in the machine translation setting. CoVe vectors
are pre-trained using an LSTM encoder from an attentional sequence-to-sequence machine
translation model.12
of knowledge resources for improving word vectors (Section 4.2). Finally, we will focus on
the construction of knowledge-based representations of senses (Section 4.3) and concepts or
entities (Section 4.4).
Knowledge resources exist in many flavors. In this section we give an overview of knowl-
edge resources that are mostly used for sense and concept representation learning. The
nature of knowledge resources varies with respect to several factors. Knowledge resources can
be broadly split into two general categories: expert-made and collaboratively-constructed.
Each type has its own advantages and limitations. Expert-made resources (e.g., WordNet)
feature accurate lexicographic information such as textual definitions, examples and seman-
tic relations between concepts. On the other hand, collaboratively-constructed resources
(e.g., Wikipedia or Wiktionary) provide features such as encyclopedic information, wider
coverage, multilinguality and up-to-dateness.14
In the following we describe some of the most important resources in lexical semantics
that are used for representation learning, namely WordNet (Section 4.1.1), Wikipedia and
related efforts (Section 4.1.2), and mergers of different resources such as BabelNet and
ConceptNet (Section 4.1.3).
4.1.1 WordNet
14. In addition to these two types of resource, another recent branch is investigating the automatic con-
struction of knowledge resources (particularly WordNet-like) from scratch (Khodak, Risteski, Fellbaum,
& Arora, 2017; Ustalov, Panchenko, & Biemann, 2017). However, these output resources are not yet
used in practice, and they have been shown to generally lack recall (Neale, 2018).
The main difference between ConceptNet and BabelNet lies in their main semantic units: ConceptNet models words whereas BabelNet
uses WordNet-style synsets.
$$\sum_{i=1}^{|V|} \left( \alpha_i \left\lVert \vec{w}_i - \hat{\vec{w}}_i \right\rVert + \sum_{(w_i, w_j) \in N} \beta_{i,j} \left\lVert \vec{w}_i - \vec{w}_j \right\rVert \right) \qquad (2)$$
where $|V|$ represents the size of the vocabulary, $N$ is the input semantic network represented
as a set of word pairs, $\vec{w}_i$ and $\vec{w}_j$ correspond to word embeddings in the pre-trained model,
$\alpha_i$ and $\beta_{i,j}$ are adjustable control values, and $\hat{\vec{w}}_i$ represents the output word embedding.
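A minimal implementation of this idea, following the well-known iterative retrofitting update scheme of Faruqui et al. (2015), is sketched below on invented 2-dimensional vectors, with the control values simply set to 1:

```python
import numpy as np

# Hypothetical pre-trained 2-d embeddings.
pretrained = {
    "happy":  np.array([1.0, 0.0]),
    "joyful": np.array([0.0, 1.0]),   # related to "happy", yet far from it
    "table":  np.array([-1.0, 0.0]),
}
edges = [("happy", "joyful")]          # semantic network N as word pairs

neighbors = {w: [] for w in pretrained}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Iterative retrofitting-style updates: each output vector is pulled toward
# its pre-trained vector and toward its neighbors in the semantic network.
alpha, beta = 1.0, 1.0
out = {w: v.copy() for w, v in pretrained.items()}
for _ in range(20):
    for w in out:
        ns = neighbors[w]
        if ns:
            out[w] = (alpha * pretrained[w] + beta * sum(out[n] for n in ns)) \
                     / (alpha + beta * len(ns))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
# "happy" and "joyful" are pulled together, while "table" (no network
# neighbors) keeps its pre-trained vector.
```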
Building upon retrofitting, Speer and Lowry-Duda (2017) exploited the multilingual re-
lational information of ConceptNet for constructing embeddings on a multilingual space,
and Lengerich, Maas, and Potts (2017) generalized retrofitting methods by explicitly mod-
eling pairwise relations. Other similar approaches are those by Pilehvar and Collier (2017)
and Goikoetxea, Soroa, and Agirre (2015), which analyze the structure of semantic networks
via Personalized Page Rank (Haveliwala, 2002) for extending the coverage and quality of
pre-trained word embeddings, respectively. Finally, Bollegala, Alsuhaibani, Maehara, and
Kawarabayashi (2016) modified the loss function of a given word embedding model to learn
vector representations by simultaneously exploiting cues from both co-occurrences and se-
mantic networks.
Recently, a new branch that focuses on specializing word embeddings for specific ap-
plications has emerged. For instance, Kiela, Hill, and Clark (2015) investigated two vari-
ants of retrofitting to specialize word embeddings for similarity or relatedness, and Mrksic,
Vulić, Séaghdha, Leviant, Reichart, Gašić, Korhonen, and Young (2017) specialized word
15. FrameNet (Baker, Fillmore, & Lowe, 1998), WordNet and PPDB (Ganitkevitch, Van Durme, & Callison-
Burch, 2013) are used in their experiments.
embeddings for semantic similarity and dialogue state tracking by exploiting a number of
monolingual and cross-lingual linguistic constraints (e.g., synonymy and antonymy) from
resources such as PPDB and BabelNet.
In fact, as shown in this last work, knowledge resources also play an important role in
the construction of multilingual vector spaces. The use of external resources avoids the need
to compile large parallel corpora, which have traditionally been the main source
for learning cross-lingual word embeddings in the literature (Upadhyay, Faruqui, Dyer, &
Roth, 2016; Ruder, Vulić, & Søgaard, 2017). These alternative models for learning cross-
lingual embeddings exploit knowledge from lexical resources such as WordNet or BabelNet
(Mrksic et al., 2017; Goikoetxea, Soroa, & Agirre, 2018), bilingual dictionaries (Mikolov, Le,
& Sutskever, 2013b; Ammar, Mulcaire, Tsvetkov, Lample, Dyer, & Smith, 2016; Artetxe,
Labaka, & Agirre, 2016; Doval, Camacho-Collados, Espinosa-Anke, & Schockaert, 2018) or
comparable corpora extracted from Wikipedia (Vulić & Moens, 2015).
Chen, Xu, He, and Wang (2015) exploited a convolutional neural network architecture for
initializing sense embeddings using textual definitions from lexical resources. Then, these
initialized sense embeddings are fed into a variant of the Multi-sense Skip-gram Model
of Neelakantan et al. (2014) (see Section 3.1) for learning knowledge-based sense embed-
dings. Finally, in Yang and Mao (2016) word sense embeddings are learned by exploiting
an adapted Lesk16 algorithm (Vasilescu, Langlais, & Lapalme, 2004) over short contexts of
word pairs.
A different line of research has experimented with the graph structure of lexical resources
for learning knowledge-based sense representations. As explained in Section 4.1, many of the
existing lexical resources can be viewed as semantic networks in which nodes are concepts
and edges represent the relations among concepts. Semantic networks constitute suitable
knowledge resources for disambiguating large amounts of text (Agirre et al., 2014; Moro
et al., 2014). Therefore, a straightforward method to learn sense representations would be to
automatically disambiguate text corpora and apply a word representation learning method
on the resulting sense-annotated text (Iacobacci, Pilehvar, & Navigli, 2015). Following this
direction, Mancini, Camacho-Collados, Iacobacci, and Navigli (2017) proposed a shallow
graph-based disambiguation procedure and modified the objective functions of Word2vec
in order to simultaneously learn word and sense embeddings in a shared vector space. The
objective function is in essence similar to the objective function proposed by Chen et al.
(2014) explained before, which also learns both word and sense embeddings in the last step
of the learning process.
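The "disambiguate, then embed" pipeline described above can be sketched as follows. The one-rule disambiguator here is a hypothetical stand-in for a real WSD system; the resulting sense-tagged corpus would then be fed to any standard word-embedding learner (e.g. a skip-gram model), which treats each sense identifier as a separate vocabulary item:

```python
# Sketch of the "disambiguate, then embed" pipeline: once ambiguous tokens are
# replaced by sense identifiers, a standard word-embedding learner treats each
# sense as its own vocabulary item. The disambiguation step is a toy stand-in.
def annotate_corpus(sentences, disambiguate):
    tagged = []
    for sent in sentences:
        tokens = sent.split()
        tagged.append([disambiguate(tok, tokens) for tok in tokens])
    return tagged

# Hypothetical one-rule disambiguator, for illustration only.
def toy_wsd(token, context):
    if token == "bank":
        return "bank_2n" if "river" in context else "bank_1n"
    return token

corpus = ["the bank of the river", "the bank approved the loan"]
print(annotate_corpus(corpus, toy_wsd))
# → [['the', 'bank_2n', 'of', 'the', 'river'],
#    ['the', 'bank_1n', 'approved', 'the', 'loan']]
```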
Similarly to the post-processing of word embeddings by using knowledge resources (see
Section 4.2), recent works have made use of pre-trained word embeddings not only for
improving them but also de-conflating them into senses. Approaches that post-process
pre-trained word embeddings for learning sense embeddings are listed below:
1. One way to obtain sense representations from a semantic network is to directly apply
the Personalized PageRank algorithm (Haveliwala, 2002), as done by Pilehvar and
Navigli (2015). The algorithm carries out a set of random graph walks to compute
a vector representation for each WordNet synset (node in the network). Using a
similar random walk-based procedure, Pilehvar and Collier (2016) extracted for each
WordNet word sense a set of sense biasing words. Based on these, they put forward
an approach, called DeConf, which takes a pre-trained word embedding space as
input and adds a set of sense embeddings (as defined by WordNet) to the same
space. DeConf achieves this by pushing a word’s embedding in the space to the
region occupied by its corresponding sense biasing words (for a specific sense of the
word). Figure 7 shows the word digit and its induced hand and number senses in the
vector space.
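The random-walk component of these approaches can be illustrated with a minimal Personalized PageRank over a toy graph; the four-node chain, damping factor and iteration count below are illustrative choices, not those of the original works, which run the walks over the full WordNet graph:

```python
import numpy as np

# Minimal Personalized PageRank: a random walk that restarts at a seed synset;
# its stationary distribution (one dimension per node) serves as the seed's
# vector representation. Toy 4-node graph for illustration only.
def personalized_pagerank(adjacency, seed, alpha=0.85, iters=100):
    A = np.asarray(adjacency, dtype=float)
    M = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    restart = np.zeros(A.shape[0])
    restart[seed] = 1.0                    # walks always restart at the seed
    v = restart.copy()
    for _ in range(iters):
        v = alpha * (M @ v) + (1 - alpha) * restart
    return v

# Toy undirected chain of four synsets: 0 - 1 - 2 - 3, seeded at synset 0.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
vec = personalized_pagerank(adj, seed=0)
# Most of the probability mass stays near the seed; the farthest node gets least.
```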
16. The original Lesk algorithm (Lesk, 1986) and its variants exploit the similarity between textual definitions
and a target word’s context for disambiguation.
17. See Section 4.2 for more information on retrofitting.
Camacho-Collados & Pilehvar
Figure 7: A mixed semantic space of words and word senses. DeConf (Pilehvar & Collier,
2016) introduces two new points in the word embedding space, for the mathemat-
ical and body part senses of the word digit, resulting in the mixed space.
2. A second approach extends the objective of Word2vec to include senses within the
learning process. The training objective is optimized using EM.
3. Johansson and Pina (2015) post-processed pre-trained word embeddings through an
optimization formulation with two main constraints: polysemous word embeddings
can be decomposed as combinations of their corresponding sense embeddings and
sense embeddings should be close to their neighbours in the semantic network. A
Swedish semantic network, SALDO (Borin, Forsberg, & Lönngren, 2013), was used
in their experiments, although their approach may be directly extensible to different
semantic networks as well.
4. Finally, AutoExtend (Rothe & Schütze, 2015) is another method using pre-trained
word embeddings as input. In this case, they put forward an autoencoder architecture
based on two main constraints: a word vector corresponds to the sum of its sense
vectors and a synset to the sum of its lexicalizations (senses). For example, the
vector of the word crane would correspond to the sum of the vectors for its senses
crane 1n , crane 2n and crane 1v (using WordNet as reference). Similarly, the vector of the
synset defined as “arrange for and reserve (something for someone else) in advance” in
WordNet would be equal to the sum of the vectors of its corresponding senses reserve,
hold and book. Equation 3 displays these constraints mathematically:
\vec{w} = \sum_{i=1}^{n} \vec{s}_i \, ; \qquad \vec{y} = \sum_{j=1}^{m} \vec{s}_j \,, \qquad (3)
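As a rough illustration of these two constraints on toy vectors (the sense labels and random vectors below are placeholders, not trained embeddings):

```python
import numpy as np

# Illustration of the two AutoExtend constraints on toy vectors: a word vector
# is the sum of its sense vectors, and a synset vector is the sum of the sense
# vectors of its lexicalizations. All vectors are random placeholders.
rng = np.random.default_rng(42)
sense = {s: rng.standard_normal(5)
         for s in ("crane_1n", "crane_2n", "crane_1v",
                   "reserve_1v", "hold_1v", "book_1v")}

# Word constraint: crane = crane_1n + crane_2n + crane_1v
word_crane = sense["crane_1n"] + sense["crane_2n"] + sense["crane_1v"]

# Synset constraint: the "reserve in advance" synset is the sum of its senses
# reserve, hold and book.
synset_reserve = sense["reserve_1v"] + sense["hold_1v"] + sense["book_1v"]
```

AutoExtend treats these sums as constraints of an autoencoder and solves for the unknown sense vectors given the observed word vectors; the snippet only shows the direction of the constraints themselves.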
1. TransH (Wang, Zhang, Feng, & Chen, 2014b) is a similar model that improves the
relational mapping by dealing with specific properties present in the knowledge
graph.
2. Lin, Liu, Sun, Liu, and Zhu (2015) proposed to learn embeddings of entities and
relations in separate spaces (TransR).
3. Ji, He, Xu, Liu, and Zhao (2015) introduced a dynamic mapping for each entity-
relation pair in separated spaces (TransD).
4. Luo, Wang, Wang, and Guo (2015) put forward a two-stage architecture using pre-
trained word embeddings for initialization.
5. A unified learning framework that generalizes TransE and NTN (Socher, Perelygin,
Wu, Chuang, Manning, Ng, & Potts, 2013) was presented by Yang, Yih, He, Gao,
and Deng (2015).
6. Finally, Ebisu and Ichise (2018) discussed the regularization issues of TransE and
proposed TorusE, which introduces a new regularization method that solves these
problems.
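The translational assumption underlying TransE, which the variants above refine, can be sketched as follows (toy two-dimensional vectors; the score shown is the negative L2-distance form):

```python
import numpy as np

# Sketch of TransE's translation-based scoring: for a plausible triple
# (head, relation, tail) the model assumes h + r ≈ t, so plausibility can be
# scored as the negative distance ||h + r - t||. Vectors here are toy values.
def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)

h = np.array([1.0, 0.0])        # head entity
r = np.array([0.0, 1.0])        # relation vector
t_true = np.array([1.0, 1.0])   # tail satisfying h + r = t exactly
t_false = np.array([3.0, -2.0]) # tail of an implausible triple

# A perfect translation scores 0; worse tails score more negatively.
print(transe_score(h, r, t_true), transe_score(h, r, t_false))
```

Training then pushes scores of observed triples above those of corrupted (negative) triples by a margin; the variants listed above mainly change how entities and relations are projected before this comparison.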
18. Given an incomplete knowledge base as input, the knowledge base completion task consists of predicting
relations which were missing in the original resource.
Figure 8: From a knowledge graph to entity and relation embeddings. Illustration idea is
based on the slides of Weston and Bordes (2014).
In addition to techniques that entirely rely on the information available in knowledge bases,
there are models that combine cues from both knowledge bases and text corpora into the
same representation. Given its semi-structured nature and the textual content it provides,
Wikipedia has been the main source for this kind of representation. While most approaches
make use of Wikipedia-annotated corpora as their main source to learn representations
for Wikipedia concepts and entities (Wang, Zhang, Feng, & Chen, 2014a; Sherkat &
Milios, 2017; Cao, Huang, Ji, Chen, & Li, 2017), the combination of knowledge from heterogeneous
resources like Wikipedia and WordNet has also been explored (Camacho-Collados, Pilehvar, & Navigli, 2016).21
19. A Poincaré ball is a hyperbolic space in which all points are inside the unit disk.
20. WordNet is used as the reference taxonomy in the original work.
Given their hybrid nature, these models can easily be used in textual applications as
well. A straightforward application is word or named entity disambiguation, for which the
embeddings can be used to initialize the embedding layer of a neural network architecture
(Fang, Zhang, Wang, Chen, & Li, 2016; Eshel, Cohen, Radinsky, Markovitch,
Yamada, & Levy, 2017) or used directly as a knowledge-based disambiguation system ex-
ploiting semantic similarity (Camacho-Collados et al., 2016).
5. Evaluation
In this section we present the most common evaluation benchmarks for assessing the quality
of meaning representations. Depending on their nature, evaluation procedures are generally
divided into intrinsic (Section 5.1) and extrinsic (Section 5.2).
where S_{w_i} is the set of all senses of w_i and \vec{s}_i represents the vector representation
of sense s_i. Another strategy, known as AvgSim, simply averages the pairwise similarities
21. The combination of Wikipedia and WordNet relies on the multilingual mapping provided by BabelNet
(see Section 4.1.3 for more information about BabelNet).
of all possible senses of w1 and w2 . Cosine similarity (cos) is the most prominent metric for
computing the similarity between sense vectors.
In all these benchmarks, words are paired in isolation. However, we know that for a
specific meaning of an ambiguous word to be triggered, the word needs to appear in partic-
ular contexts. In fact, Kilgarriff (1997) argued that representing a word with a fixed set of
senses may not be the best way for modelling word senses but instead, word senses should be
defined according to a given context. To this end, Huang et al. (2012) presented a different
kind of similarity dataset in which words are provided with their corresponding contexts.
The task consists of assessing the similarity of two words by taking into consideration the
contexts in which they occur. The dataset is known as Stanford Contextual Word Simi-
larity (SCWS) and has been established as one of the main intrinsic evaluations for sense
representations. A pre-disambiguation step is required to leverage sense representations in
this task. Simple similarity measures such as MaxSimC or AvgSimC are generally utilized.
Unlike MaxSim and AvgSim, MaxSimC and AvgSimC take the context of the target word
into account. First, the confidence for selecting the most appropriate sense within the sen-
tence is computed (e.g., by computing the average of word embeddings from the context
and selecting the sense which is closest to the average context vector in terms of cosine
similarity). Then, the final score corresponds to the similarity between the selected senses
(i.e., MaxSimC ) or to a weighted average among all senses (i.e., AvgSimC ).
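These similarity measures can be sketched as follows, assuming toy sense and word vectors (the sense labels are hypothetical; real systems use trained embeddings):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sense and context-word vectors, for illustration only.
SENSES = {
    "bank":  [np.array([1.0, 0.0]),    # financial sense
              np.array([0.0, 1.0])],   # river sense
    "money": [np.array([0.9, 0.1])],
}
WORDS = {"loan": np.array([0.8, 0.2]), "deposit": np.array([0.9, 0.3])}

def max_sim(w1, w2):
    """MaxSim: similarity of the closest pair of senses."""
    return max(cos(s1, s2) for s1 in SENSES[w1] for s2 in SENSES[w2])

def avg_sim(w1, w2):
    """AvgSim: average similarity over all sense pairs."""
    sims = [cos(s1, s2) for s1 in SENSES[w1] for s2 in SENSES[w2]]
    return sum(sims) / len(sims)

def max_sim_c(w1, ctx1, w2, ctx2):
    """MaxSimC: select each word's sense via its averaged context vector,
    then compare only the two selected senses."""
    def select(word, ctx):
        ctx_vec = np.mean([WORDS[w] for w in ctx if w in WORDS], axis=0)
        return max(SENSES[word], key=lambda s: cos(s, ctx_vec))
    return cos(select(w1, ctx1), select(w2, ctx2))
```

On this toy setup, MaxSim compares the financial sense of bank directly with money, whereas AvgSim is dragged down by the irrelevant river sense; MaxSimC with a financial context ("loan", "deposit") likewise selects the financial sense before comparing.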
However, even though sense representations have generally outperformed word-based
models on this dataset, the simple strategies used to disambiguate the input text may not
have been optimal. In fact, it has been recently shown that the improvements of sense-
based models in word similarity tasks using AvgSim may not be due to accurate meaning
modeling but to related artifacts such as sub-sampling, which had not been controlled for
(Dubossarsky, Grossman, & Weinshall, 2018). This is in line with a recent study analyzing
how well sense and contextualized representations capture meaning in context (Pilehvar
& Camacho-Collados, 2018). The binary classification task proposed in this analysis consists
of deciding whether the occurrences of a target word in two different contexts correspond
to the same meaning or not. The results showed how recent sense22 and contextualized
representation techniques fail at accurately distinguishing meanings in context, performing
only slightly better than a simple baseline, while significantly lagging behind the human
inter-rater agreement of the dataset.23
Finally, in addition to these tasks, there exist other intrinsic evaluation procedures such
as synonymy selection (Landauer & Dumais, 1997; Turney, 2001; Jarmasz & Szpakowicz,
2003; Reisinger & Mooney, 2010), outlier detection (Camacho-Collados & Navigli, 2016;
Blair, Merhav, & Barry, 2016; Stanovsky & Hopkins, 2018) or sense clustering (Snow,
Prakash, Jurafsky, & Ng, 2007; Dandala, Hokamp, Mihalcea, & Bunescu, 2013). Bakarov
(2018) provides a more comprehensive overview of intrinsic evaluation benchmarks.
22. Similarly to the MaxSimC technique, sense representations were evaluated by retrieving the closest sense
embedding to the context-based vector, computed by averaging its word embeddings.
23. Another study which took the hypernymy detection task as test bed for their experiments (Vyas &
Carpuat, 2017) came to similar conclusions.
6. Applications
As mentioned in Section 5 and throughout the survey, one of the main goals of research
in meaning representations is to enable effective integration of these knowledge carriers
into downstream applications. Unlike word representations (and more specifically embeddings),
sense representations are still in their infancy in this regard. This is partly because their
integration is not immediate: it generally requires an additional word sense disambiguation
or induction step. However, as with word embeddings, sense representations can in principle
be applied to a wide range of applications.
The integration of sense representations into downstream applications is not a new trend.
Since the nineties, many heterogeneous efforts have emerged in this direction for important
text-based applications, with varying degrees of success. Information retrieval was one
of the first applications in which the integration of word senses was investigated. In one
of the earlier attempts, Schütze and Pedersen (1995) showed how document-query similarity
based on word senses could lead to considerable improvements with respect to word-based
models.
Another classic task which has witnessed recurring efforts to incorporate sense-level
information is Machine Translation (MT). Since a word may have different translations
depending on its intended meaning in a context, sense identification has traditionally been
believed to have the potential to improve word-based MT models. Carpuat and Wu analyzed
the impact of WSD on the performance of standard MT systems of the time (Carpuat & Wu,
2005, 2007a, 2007b). The studies were inconclusive, but generally reflected the difficulty
of successfully integrating semantically-grounded models into an MT pipeline. This was also
(Nickel & Kiela, 2017), or visual object discovery (Young, Kunze, Basile, Cabrio, Hawes, &
Caputo, 2017).
7. Analysis
This section provides an analysis and comparison of knowledge-based and unsupervised
representation techniques, highlighting the advantages and limitations of each, while sug-
gesting the settings and scenarios for which each technique is suited. We focus on four
important aspects: interpretability (Section 7.1), adaptability to different domains (Section
7.2), sense granularity (Section 7.3), and compositionality (Section 7.4).
7.1 Interpretability
One of the main reasons behind moving from the word to the sense level is the semantically-
grounded nature of word senses, which may enable better interpretability. In this particular
aspect, however, there is a considerable difference between unsupervised and knowledge-based
models. Unsupervised models learn senses directly from text corpora, which results
in model-specific sense interpretations. These induced senses do not necessarily correspond
to human notions of sense distinctions, and are often not easily distinguishable. For this reason,
methods have been proposed to improve the interpretability of unsupervised sense represen-
tations, either by extracting their hypernyms or their visual representations (i.e., an image
illustrating a specific meaning) (Panchenko et al., 2017b) or by mapping the induced senses
to external sense inventories (Panchenko, 2016).
7.2 Adaptability to Different Domains
One feature which has been praised in word embeddings is their adaptability to general
and specialized domains (Goldberg, 2016). In this respect, unsupervised models have
a theoretical advantage over their knowledge-based counterparts, as they are able to directly
induce senses from a given text corpus. This gives them the chance to adapt
their sense distinctions to the domain at hand and to the given task. On the
contrary, knowledge-based systems generally learn representations for all senses given by a
sense inventory; hence, they are unable to specialize their sense distinctions to the domain
or adapt their granularity to the task.
Knowledge-enhanced approaches like those proposed by Mancini et al. (2017) or Fang
et al. (2016), which directly learn from text corpora, may partially alleviate this limitation
of knowledge-based models. However, the senses should still be present in the semantic
network used as input for the model. In other words, knowledge-based approaches are not
able to learn new senses, which may be an important limitation in some specific domains
and tasks. Moreover, the accurate representation of certain domains would require suitable
knowledge resources, which might not be available for specialized domains or low-resource
languages.
7.3 Sense Granularity
A sense inventory may list a few dozen different senses for words such as run, play and get.
Words with multiple senses (i.e., ambiguous words) are generally classified into two categories:
polysemes and homonyms. Polysemous words have multiple related meanings. For instance,
the word mark can refer to a "distinguishing symbol" as well as a "visible indication made
on a surface". In this case the distinction between the two senses is said to be fine-grained,
as the two meanings are difficult to tear apart. Homonymous words25 have meanings
that are completely unrelated. For instance, consider the geological and financial institution senses
25. According to the Cambridge Dictionary, a homonym is “a word that sounds the same (homophone) or
is spelled the same (homograph) as another word but has a different meaning”. Given that NLP focuses
on written forms, a homonym in this context usually refers to the latter condition, i.e., homographs with
different meanings.
of the word bank26. This is a case of a coarse-grained sense distinction, as
these two meanings of bank are clearly different.
In general, the fine granularity of some sense inventories has long been a point of
contention in NLP (Kilgarriff, 1997; Navigli, 2009; Hovy, Navigli, & Ponzetto, 2013). It
has been pointed out that sense distinctions in WordNet might be too fine-grained to be
useful for many NLP applications (Navigli, 2006; Snow et al., 2007; Hovy et al., 2013). For
instance, WordNet 3.0 (see Section 4.1.1) lists 41 different senses for the verb run. However,
most of these senses are translated to either correr or operar in Spanish. Therefore, a mul-
tilingual task such as machine translation might not benefit from the additional distinctions
provided by the sense inventory. In fact, a merging of these fine-grained distinctions into
more coarse-grained classes (referred to as supersenses in WordNet) has been shown to be
beneficial in various downstream applications (Flekova & Gurevych, 2016; Pilehvar et al.,
2017).
This discussion is also relevant for unsupervised techniques. Learning senses dynamically,
instead of fixing the number of senses for all words, has been shown to provide a more
realistic distribution of senses (see Section 3.1.2). Moreover, there have been discussions
about whether all occurrences of words can be effectively partitioned into senses (Kilgarriff,
1997; Hanks, 2000; Kilgarriff, 2007), leading to a new scheme in which the meanings of a
word are described in a graded fashion (Erk et al., 2009; McCarthy et al., 2016). While
this scheme is not covered in this survey, it has been shown that a graded scale for assessing
senses may correlate better with how humans perceive different meanings. Although they do
not draw exactly the same conclusions, these findings also relate to the criticisms of the
fine granularity of current sense inventories, which has been shown to be harmful in certain
downstream applications.
7.4 Compositionality
Compositional methods model the semantics of a complex expression based on the meanings
of its constituents (e.g., words). Typically, constituent words are represented by their
word vectors, with all their meanings conflated. However, for an ambiguous word in an
expression, usually only a single meaning is triggered and the other senses are irrelevant.
Therefore, pinpointing the meaning of a word in its given context may be a reasonable basis
for compositionality. This can be crucial in applications such as information retrieval, in which query
ambiguity can be an issue (Allan & Raghavan, 2002; Di Marco & Navigli, 2013).
Different works have tried to introduce sense representations in the context of compo-
sitionality (Köper & im Walde, 2017; Kober, Weeds, Wilkie, Reffin, & Weir, 2017), with
different degrees of success. The main idea is to select the intended sense of a word and only
introduce that specific meaning into the composition, either through context-based sense
induction (Thater, Fürstenau, & Pinkal, 2011), exemplar-based representation (Reddy, Kla-
paftis, McCarthy, & Manandhar, 2011), or with the help of external resources, such as
WordNet (Gamallo & Pereira-Fariña, 2017). An example of the first type of approach can
be found in Cheng and Kartsaklis (2015), who proposed a recurrent neural network in which
word embeddings are split into multiple sense vectors. The network was applied to paraphrase
detection with positive results.
26. The distinction between homonyms and polysemes can sometimes be subtle. For instance, research in
historical linguistics has shown that the two meanings of the word bank could have been related to each
other earlier in the Italian language, since bankers used to do their business on the riverbanks.
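The select-then-compose idea can be sketched with a toy additive composer (not the recurrent architecture of Cheng and Kartsaklis; all vectors and sense labels below are illustrative):

```python
import numpy as np

# A minimal sketch of sense-aware composition: each ambiguous word is
# disambiguated against the averaged sentence vector, and only the selected
# sense vectors enter the (here, additive) composition. Toy vectors only.
def compose(tokens, word_vecs, sense_vecs):
    ctx = np.mean([word_vecs[t] for t in tokens], axis=0)
    parts = []
    for t in tokens:
        if t in sense_vecs:   # ambiguous word: pick the context-closest sense
            parts.append(max(sense_vecs[t], key=lambda s: float(s @ ctx)))
        else:                 # unambiguous word: use its word vector directly
            parts.append(word_vecs[t])
    return np.sum(parts, axis=0)

word_vecs = {"river": np.array([0.0, 1.0]), "bank": np.array([0.5, 0.5])}
sense_vecs = {"bank": [np.array([1.0, 0.0]),    # financial sense
                       np.array([0.0, 1.0])]}   # river sense
phrase = compose(["river", "bank"], word_vecs, sense_vecs)
# The river context selects the river sense of bank before composing.
```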
In general, sense distinction models in the context of compositionality have often
been evaluated on generic benchmarks, such as paraphrase detection. Despite the
potential benefit in tasks such as question answering and information retrieval, there have
been no attempts at integrating sense representations as components of neural compositional
models.
8. Conclusions
In this survey we have presented an extensive overview of semantically-grounded models for
constructing distributed representations of meaning. Word embeddings have been shown to
provide interesting semantic properties that can be applied to most language applications.
However, these models tend to conflate different meanings into a single representation.
Therefore, an accurate distinction of senses is often required for a deep understanding of
lexical meaning. To this end, in this article we discuss models that learn representations for
senses which are either directly induced from text corpora (i.e., unsupervised) or defined
by external sense inventories (i.e., knowledge-based).
Some of these models have already proved effective in practice, but there is still much
room for improvement. For example, even though semantically-grounded information is cap-
tured (to different degrees) by almost all models, common-sense reasoning has not yet been
deeply explored. Also, most of these models have been tested on English only, whereas
only a few have proposed models for other languages or attempted multilinguality. Fi-
nally, the integration of these theoretical models into downstream applications is the next
step forward, as it is not yet clear what the best integration strategy would be, and if a
pre-disambiguation step is necessary. For instance, approaches such as the contextualized
embeddings of Peters et al. (2018) have shown a new possible direction in which senses are
learned dynamically for each context, without the need for an explicit pre-disambiguation
step.
Although not exactly distributed representations of meaning, modelling relations in
a flexible way is another possible avenue for future work. Relations are generally
modelled in works targeting knowledge base completion. Moreover, a recent line of research
has focused on improving relation embeddings with the help of text corpora (Toutanova,
Chen, Pantel, Poon, Choudhury, & Gamon, 2015; Jameel, Bouraoui, & Schockaert, 2018;
Espinosa-Anke & Schockaert, 2018), which paves the way for new approaches integrating
these relations into downstream text applications.
From this perspective, the definition of sense and the choice of the right paradigm remain
open questions. Do senses need to be discrete? Should they be tied to a knowledge
resource or sense inventory? Should they be learned dynamically depending on the
context? Despite the many studies on this topic, these questions remain largely unexplored.
As also explained in our analysis, some approaches are more suited to certain
applications or domains, without any clear general conclusion. These open questions
are certainly still relevant and encourage further research on distributed representations of
meaning, with many areas yet to be explored.
Acknowledgments
The authors wish to thank the anonymous reviewers for their comments which helped im-
prove the overall quality of this survey. The research of Jose Camacho-Collados is supported
by ERC Starting Grant 637277.
References
Agirre, E., de Lacalle, O. L., & Soroa, A. (2014). Random walks for knowledge-based word
sense disambiguation. Computational Linguistics, 40 (1), 57–84.
Allan, J., & Raghavan, H. (2002). Using part-of-speech patterns to reduce query ambiguity.
In Proceedings of the 25th annual international ACM SIGIR conference on Research
and development in information retrieval, pp. 307–314, Tampere, Finland.
Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016).
Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings of
word embeddings while preserving monolingual invariance. In Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294.
Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully
unsupervised cross-lingual mappings of word embeddings. In Proceedings of ACL, pp.
789–798.
Athiwaratkun, B., & Wilson, A. (2017). Multimodal word distributions. In Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), Vol. 1, pp. 1645–1656.
Athiwaratkun, B., Wilson, A., & Anandkumar, A. (2018). Probabilistic FastText for multi-
sense word embeddings. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 1–11. Association for
Computational Linguistics.
Bakarov, A. (2018). A survey of word embeddings evaluation methods. arXiv preprint
arXiv:1801.09536.
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley Framenet Project. In
Proceedings of the 17th international conference on Computational linguistics-Volume
1, pp. 86–90. Association for Computational Linguistics.
Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for Word Sense Disam-
biguation using WordNet. In Proceedings of the Third International Conference on
Computational Linguistics and Intelligent Text Processing, CICLing’02, pp. 136–145,
Mexico City, Mexico.
Bansal, M., Denero, J., & Lin, D. (2012). Unsupervised translation sense clustering. In
Proceedings of the 2012 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pp.
773–782, Stroudsburg, PA, USA. Association for Computational Linguistics.
Bartunov, S., Kondrashkin, D., Osokin, A., & Vetrov, D. (2016). Breaking sticks and ambi-
guities with adaptive skip-gram. In Proceedings of the 19th International Conference
on Artificial Intelligence and Statistics, Vol. 51 of Proceedings of Machine Learning
Research, pp. 130–138, Cadiz, Spain. PMLR.
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Lan-
guage Model. The Journal of Machine Learning Research, 3, 1137–1155.
Bennett, A., Baldwin, T., Lau, J. H., McCarthy, D., & Bond, F. (2016). Lexsemtm: A
semantic dataset based on all-words unsupervised sense distribution learning. In Pro-
ceedings of ACL, pp. 1513–1524.
Biemann, C. (2006). Chinese whispers: an efficient graph clustering algorithm and its appli-
cation to natural language processing problems. In Proceedings of the first workshop
on graph based methods for natural language processing, pp. 73–80. Association for
Computational Linguistics.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S.
(2009). DBpedia - A crystallization point for the Web of Data. Web Semantics: science,
services and agents on the world wide web, 7 (3), 154–165.
Blair, P., Merhav, Y., & Barry, J. (2016). Automated generation of multilingual clusters
for the evaluation of distributed representations. arXiv preprint arXiv:1611.01547.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal
of Machine Learning Research, 3, 993–1022.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with
subword information. Transactions of the Association of Computational Linguistics,
5 (1), 135–146.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: a collab-
oratively created graph database for structuring human knowledge. In Proceedings
of the 2008 ACM SIGMOD international conference on Management of data, pp.
1247–1250. ACM.
Bollegala, D., Alsuhaibani, M., Maehara, T., & Kawarabayashi, K.-i. (2016). Joint word
representation learning using a corpus and a semantic lexicon. In AAAI, pp. 2690–
2696.
Bond, F., & Foster, R. (2013). Linking and extending an open multilingual Wordnet. In
ACL (1), pp. 1352–1362.
Bordes, A., Chopra, S., & Weston, J. (2014). Question answering with subgraph embeddings.
In EMNLP.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating
embeddings for modeling multi-relational data. In Advances in Neural Information
Processing Systems, pp. 2787–2795.
Borin, L., Forsberg, M., & Lönngren, L. (2013). SALDO: A touch of yin to WordNet's yang.
Language resources and evaluation, 47 (4), 1191–1211.
Cai, H., Zheng, V. W., & Chang, K. (2018). A comprehensive survey of graph embedding:
problems, techniques and applications. IEEE Transactions on Knowledge and Data
Engineering.
Camacho-Collados, J., & Navigli, R. (2016). Find the word that does not belong: A frame-
work for an intrinsic evaluation of word vector representations. In Proceedings of the
1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 43–50.
Camacho-Collados, J., Pilehvar, M. T., Collier, N., & Navigli, R. (2017). Semeval-2017 task
2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th
International Workshop on Semantic Evaluation (SemEval-2017), pp. 15–26.
Camacho-Collados, J., Pilehvar, M. T., & Navigli, R. (2016). Nasari: Integrating explicit
knowledge and corpus statistics for a multilingual representation of concepts and
entities. Artificial Intelligence, 240, 36–64.
Cao, Y., Huang, L., Ji, H., Chen, X., & Li, J. (2017). Bridge text and knowledge by learning
multi-prototype entity mention embedding. In Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp.
1623–1633.
Carpuat, M., & Wu, D. (2005). Word sense disambiguation vs. statistical machine trans-
lation. In Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, pp. 387–394. Association for Computational Linguistics.
Carpuat, M., & Wu, D. (2007a). How phrase sense disambiguation outperforms word sense
disambiguation for statistical machine translation. Proceedings of TMI, 43–52.
Carpuat, M., & Wu, D. (2007b). Improving statistical machine translation using word sense
disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Meth-
ods in Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL).
Chaplot, D. S., & Salakhutdinov, R. (2018). Knowledge-based word sense disambiguation
using topic models. In Proceedings of AAAI.
Chen, T., Xu, R., He, Y., & Wang, X. (2015). Improving distributed representation of word
sense via WordNet gloss composition and context clustering. In Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing – Short Papers, pp.
15–20, Beijing, China.
Chen, X., Liu, Z., & Sun, M. (2014). A unified model for word sense representation and
disambiguation. In Proceedings of EMNLP, pp. 1025–1035, Doha, Qatar.
Cheng, J., & Kartsaklis, D. (2015). Syntax-aware multi-sense word embeddings for deep
compositional models of meaning. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pp. 1531–1542. Association for Computa-
tional Linguistics.
Chiu, B., Korhonen, A., & Pyysalo, S. (2016). Intrinsic evaluation of word vectors fails to
predict extrinsic performance. In Proceedings of the ACL Workshop on Evaluating
Vector Space Representations for NLP, Berlin, Germany.
Cocos, A., Apidianaki, M., & Callison-Burch, C. (2016). Word sense filtering improves
embedding-based lexical substitution. In Proceedings of the 1st Workshop on Sense,
Concept and Entity Representations and their Applications, pp. 99–104. Association
for Computational Linguistics.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of ICML, pp. 160–167.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning
Research, 12, 2493–2537.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2018). Word translation
without parallel data. In Proceedings of ICLR.
Dandala, B., Hokamp, C., Mihalcea, R., & Bunescu, R. C. (2013). Sense clustering using
Wikipedia. In Proceedings of Recent Advances in Natural Language Processing, pp.
164–171, Hissar, Bulgaria.
Delli Bovi, C., Camacho-Collados, J., Raganato, A., & Navigli, R. (2017). EuroSense: Auto-
matic harvesting of multilingual sense annotations from parallel text. In Proceedings
of ACL, Vol. 2, pp. 594–600.
Delli Bovi, C., Espinosa-Anke, L., & Navigli, R. (2015). Knowledge base unification via sense
embeddings and disambiguation. In Proceedings of EMNLP, pp. 726–736. Association
for Computational Linguistics.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society: Series B
(Methodological), 39 (1), 1–38.
Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with
graph-based word sense induction. Computational Linguistics, 39 (3), 709–754.
Doval, Y., Camacho-Collados, J., Espinosa-Anke, L., & Schockaert, S. (2018). Improving
cross-lingual word embeddings by meeting in the middle. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing.
Dubossarsky, H., Grossman, E., & Weinshall, D. (2018). Coming to your senses: on controls
and evaluation sets in polysemy research. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, Brussels, Belgium.
Ebisu, T., & Ichise, R. (2018). TorusE: Knowledge graph embedding on a Lie group. In
Proceedings of the AAAI Conference on Artificial Intelligence.
Erk, K. (2007). A simple, similarity-based model for selectional preferences. In Proceedings
of ACL, Prague, Czech Republic.
Erk, K. (2012). Vector space models of word meaning and phrase meaning: A survey.
Language and Linguistics Compass, 6 (10), 635–653.
Erk, K., McCarthy, D., & Gaylord, N. (2009). Investigations on word senses and word
usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing of
the AFNLP, pp. 10–18. Association for Computational Linguistics.
A Survey on Vector Representations of Meaning
Erk, K., & Padó, S. (2008). A structured vector space model for word meaning in context. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pp. 897–906.
Eshel, Y., Cohen, N., Radinsky, K., Markovitch, S., Yamada, I., & Levy, O. (2017). Named
entity disambiguation for noisy text. In Proceedings of the 21st Conference on Com-
putational Natural Language Learning (CoNLL 2017), pp. 58–68.
Espinosa-Anke, L., Camacho-Collados, J., Delli Bovi, C., & Saggion, H. (2016). Supervised
distributional hypernym discovery via domain adaptation. In Proceedings of EMNLP,
pp. 424–435.
Espinosa-Anke, L., & Schockaert, S. (2018). Seven: Augmenting word embeddings with
unsupervised relation vectors. In Proceedings of the 27th International Conference on
Computational Linguistics, pp. 2653–2665.
Ettinger, A., Resnik, P., & Carpuat, M. (2016). Retrofitting sense-specific word vectors using
parallel text. In Proceedings of NAACL-HLT, pp. 1378–1383, San Diego, California.
Fang, W., Zhang, J., Wang, D., Chen, Z., & Li, M. (2016). Entity disambiguation by
knowledge and text jointly embedding. In Proceedings of CoNLL, pp. 260–269.
Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting
word vectors to semantic lexicons. In Proceedings of NAACL, pp. 1606–1615.
Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of
word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on
Evaluating Vector-Space Representations for NLP, pp. 30–35. Association for Com-
putational Linguistics.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin,
E. (2002). Placing search in context: The concept revisited. ACM Transactions on
Information Systems, 20 (1), 116–131.
Flekova, L., & Gurevych, I. (2016). Supersense embeddings: A unified model for supersense
interpretation, prediction, and utilization. In Proceedings of ACL.
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-
based explicit semantic analysis. In Proceedings of IJCAI, pp. 1606–1611, Hyderabad,
India.
Gale, W. A., Church, K., & Yarowsky, D. (1992). A method for disambiguating word senses
in a corpus. Computers and the Humanities, 26, 415–439.
Gamallo, P., & Pereira-Fariña, M. (2017). Compositional semantics using feature-based
models from WordNet. In Proceedings of the 1st Workshop on Sense, Concept and
Entity Representations and their Applications, pp. 1–11, Valencia, Spain. Association
for Computational Linguistics.
Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). PPDB: The paraphrase
database. In Proceedings of NAACL-HLT, pp. 758–764.
Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT press.
Goikoetxea, J., Soroa, A., & Agirre, E. (2015). Random walks and neural network language
models on knowledge bases. In Proceedings of NAACL, pp. 1434–1439.
Goikoetxea, J., Soroa, A., & Agirre, E. (2018). Bilingual embeddings with random walks
over multilingual wordnets. Knowledge-Based Systems.
Goldberg, Y. (2016). A primer on neural network models for natural language processing.
Journal of Artificial Intelligence Research, 57, 345–420.
Grover, A., & Leskovec, J. (2016). Node2Vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, KDD ’16, pp. 855–864, New York, NY, USA.
Guo, J., Che, W., Wang, H., & Liu, T. (2014). Learning sense-specific word embeddings by
exploiting bilingual resources. In Proceedings of COLING, pp. 497–507.
Hanks, P. (2000). Do word meanings exist? Computers and the Humanities, 34 (1-2),
205–215.
Harris, Z. (1954). Distributional structure. Word, 10, 146–162.
Haveliwala, T. H. (2002). Topic-sensitive PageRank. In Proceedings of the 11th International
Conference on World Wide Web, pp. 517–526, Hawaii, USA.
Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models
with (genuine) similarity estimation. Computational Linguistics, 41 (4), 665–695.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9 (8), 1735–1780.
Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Ma-
chine Learning, 42 (1), 177–196.
Hovy, E. H., Navigli, R., & Ponzetto, S. P. (2013). Collaboratively built semi-structured
content and Artificial Intelligence: The story so far. Artificial Intelligence, 194, 2–27.
Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word represen-
tations via global context and multiple word prototypes. In Proceedings of ACL, pp.
873–882, Jeju Island, Korea.
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2015). Sensembed: Learning sense embeddings
for word and relational similarity. In Proceedings of ACL, pp. 95–105, Beijing, China.
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for word sense disambigua-
tion: An evaluation study. In Proceedings of ACL, pp. 897–907, Berlin, Germany.
Ide, N., Erjavec, T., & Tufis, D. (2002). Sense discrimination with parallel corpora. In
Proceedings of ACL-02 Workshop on WSD: Recent Successes and Future Directions,
pp. 54–60, Philadelphia, USA.
Jameel, S., Bouraoui, Z., & Schockaert, S. (2018). Unsupervised learning of distributional
relation vectors. In Proceedings of ACL, Melbourne, Australia.
Jarmasz, M., & Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. In Pro-
ceedings of Recent Advances in Natural Language Processing, pp. 212–219, Borovets,
Bulgaria.
Jauhar, S. K., Dyer, C., & Hovy, E. (2015). Ontologically grounded multi-sense repre-
sentation learning for semantic vector space models. In Proceedings of NAACL, pp.
683–693, Denver, Colorado.
Ji, G., He, S., Xu, L., Liu, K., & Zhao, J. (2015). Knowledge graph embedding via dy-
namic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), Vol. 1, pp. 687–696.
Johansson, R., & Pina, L. N. (2015). Embedding a semantic network in a word space. In
Proceedings of NAACL, pp. 1428–1433, Denver, Colorado.
Jones, M. P., & Martin, J. H. (1997). Contextual spelling correction using latent seman-
tic analysis. In Proceedings of the Fifth Conference on Applied Natural Language
Processing, ANLC ’97, pp. 166–173.
Kartsaklis, D., Pilehvar, M. T., & Collier, N. (2018). Mapping text to knowledge graph
entities using multi-sense LSTMs. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, Brussels, Belgium.
Khodak, M., Risteski, A., Fellbaum, C., & Arora, S. (2017). Automated WordNet construc-
tion using word embeddings. In Proceedings of the 1st Workshop on Sense, Concept
and Entity Representations and their Applications, pp. 12–23.
Kiela, D., Hill, F., & Clark, S. (2015). Specializing word embeddings for similarity or
relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, pp. 2044–2048.
Kilgarriff, A. (1997). “I don’t believe in word senses”. Computers and the Humanities,
31 (2), 91–113.
Kilgarriff, A. (2007). Word senses. In Word Sense Disambiguation, pp. 29–46. Springer.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings
of EMNLP, pp. 1746–1751, Doha, Qatar.
Kober, T., Weeds, J., Wilkie, J., Reffin, J., & Weir, D. (2017). One representation per
word - does it make sense for composition? In Proceedings of the 1st Workshop on
Sense, Concept and Entity Representations and their Applications, pp. 79–90, Valen-
cia, Spain. Association for Computational Linguistics.
Köper, M., & Schulte im Walde, S. (2017). Applying multi-sense embeddings for German
verbs to determine semantic relatedness and to detect non-literal language. In Proceedings
of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, Vol. 2, pp. 535–542.
Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A brief
survey of web data extraction tools. SIGMOD Rec., 31 (2), 84–93.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent
semantic analysis theory of acquisition, induction, and representation of knowledge.
Psychological Review, 104 (2), 211.
Landauer, T., & Dooley, S. (2002). Latent semantic analysis: theory, method and applica-
tion. In Proceedings of CSCL, pp. 742–743.
Lee, D. L., Chuang, H., & Seamons, K. (1997). Document ranking and the vector-space
model. IEEE software, 14 (2), 67–75.
Lee, G.-H., & Chen, Y.-N. (2017). MUSE: Modularizing unsupervised sense embeddings. In
Proceedings of EMNLP, Copenhagen, Denmark.
Lengerich, B. J., Maas, A. L., & Potts, C. (2017). Retrofitting distributional embeddings
to knowledge graphs with functional relations. arXiv preprint arXiv:1708.00112.
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How
to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual Conference
on Systems Documentation, Toronto, Ontario, Canada, pp. 24–26.
Leviant, I., & Reichart, R. (2015). Separated by an un-common language: Towards judgment
language informed vector space modeling. arXiv preprint arXiv:1508.00106.
Levy, O., & Goldberg, Y. (2014a). Dependency-based word embeddings. In ACL, pp.
302–308.
Levy, O., & Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization.
In Advances in neural information processing systems, pp. 2177–2185.
Li, J., & Jurafsky, D. (2015). Do multi-sense embeddings improve natural language under-
standing? In Proceedings of EMNLP, pp. 683–693, Lisbon, Portugal.
Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic
models. In Proceedings of the 20th National Conference on Artificial Intelligence -
Volume 2, pp. 813–818. AAAI Press.
Lieto, A., Radicioni, D., Rho, V., & Mensa, E. (2017). Towards a unifying framework for
conceptual representation and reasoning in cognitive systems. Intelligenza Artificiale,
11 (2), 139–153.
Lin, Y., Liu, Z., Sun, M., Liu, Y., & Zhu, X. (2015). Learning entity and relation embeddings
for knowledge graph completion. In Proceedings of AAAI, pp. 2181–2187.
Liu, F., Lu, H., & Neubig, G. (2018). Handling homographs in neural machine translation.
In Proceedings of NAACL, New Orleans, LA, USA.
Liu, P., Qiu, X., & Huang, X. (2015a). Learning context-sensitive word embeddings with
neural tensor skip-gram model. In Proceedings of the 24th International Conference
on Artificial Intelligence, pp. 1284–1290.
Liu, Y., Liu, Z., Chua, T.-S., & Sun, M. (2015b). Topical word embeddings. In Proceedings
of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2418–2424.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical
co-occurrence. Behavior Research Methods, Instruments, & Computers, 28 (2), 203–
208.
Luo, F., Liu, T., Xia, Q., Chang, B., & Sui, Z. (2018). Incorporating glosses into neural word
sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 2473–2482.
Luo, Y., Wang, Q., Wang, B., & Guo, L. (2015). Context-dependent knowledge graph
embedding. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, pp. 1656–1661.
Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine
Learning Research, 9, 2579–2605.
Mallery, J. C. (1988). Thinking about foreign policy: Finding an appropriate role for
artificially intelligent computers. Ph.D. thesis, M.I.T. Political Science Department,
Cambridge, MA.
Mancini, M., Camacho-Collados, J., Iacobacci, I., & Navigli, R. (2017). Embedding words
and senses together via joint knowledge-enhanced training. In Proceedings of CoNLL,
pp. 100–111, Vancouver, Canada.
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: Con-
textualized word vectors. In Advances in Neural Information Processing Systems 30,
pp. 6294–6305. Curran Associates, Inc.
McCarthy, D., Apidianaki, M., & Erk, K. (2016). Word sense clustering and clusterability.
Computational Linguistics.
McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A.,
Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D., et al. (2012). Interchanging
lexical resources on the semantic web. Language Resources and Evaluation, 46 (4),
701–719.
Melamud, O., Goldberger, J., & Dagan, I. (2016). context2vec: Learning generic context
embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference
on Computational Natural Language Learning, pp. 51–61, Berlin, Germany.
Meyerson, A. (2001). Online facility location. In Proceedings of the 42nd IEEE Symposium
on Foundations of Computer Science, pp. 426–432, Washington, DC, USA. IEEE
Computer Society.
Mihalcea, R., & Csomai, A. (2007). Wikify! Linking documents to encyclopedic knowledge.
In Proceedings of the Sixteenth ACM Conference on Information and Knowledge
Management, pp. 233–242, Lisbon, Portugal.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word
representations in vector space. CoRR, abs/1301.3781.
Mikolov, T., Le, Q. V., & Sutskever, I. (2013b). Exploiting similarities among languages
for machine translation. arXiv preprint arXiv:1309.4168.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed
representations of words and phrases and their compositionality. In Advances in
neural information processing systems, pp. 3111–3119.
Mikolov, T., Yih, W.-t., & Zweig, G. (2013d). Linguistic regularities in continuous space
word representations. In HLT-NAACL, pp. 746–751.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38 (11), 39–41.
Miller, G. A., Leacock, C., Tengi, R., & Bunker, R. (1993). A semantic concordance. In
Proceedings of the 3rd DARPA Workshop on Human Language Technology, pp. 303–
308, Plainsboro, N.J.
Moro, A., Raganato, A., & Navigli, R. (2014). Entity Linking meets Word Sense Disam-
biguation: a Unified Approach. Transactions of the Association for Computational
Linguistics (TACL), 2, 231–244.
Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., Korhonen, A.,
& Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces
using Monolingual and Cross-Lingual Constraints. Transactions of the Association
for Computational Linguistics (TACL).
Navigli, R. (2006). Meaningful clustering of senses helps boost Word Sense Disambiguation
performance. In Proceedings of COLING-ACL, pp. 105–112, Sydney, Australia.
Navigli, R. (2009). Word Sense Disambiguation: A survey. ACM Computing Surveys, 41 (2),
1–69.
Navigli, R., & Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and
application of a wide-coverage multilingual semantic network. Artificial Intelligence,
193, 217–250.
Neale, S. (2018). A Survey on Automatically-Constructed WordNets and their Evaluation:
Lexical and Word Embedding-based Approaches. In Calzolari, N., Choukri, K.,
Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J.,
Mazo, H., Moreno, A., Odijk, J., Piperidis, S., & Tokunaga, T. (Eds.), Proceedings of
the Eleventh International Conference on Language Resources and Evaluation (LREC
2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2014). Efficient non-parametric
estimation of multiple embeddings per word in vector space. In Proceedings of
EMNLP, pp. 1059–1069, Doha, Qatar.
Nguyen, D. Q., Nguyen, D. Q., Modi, A., Thater, S., & Pinkal, M. (2017). A mixture model
for learning multi-sense word embeddings. In Proceedings of *SEM 2017.
Nguyen, D. Q. (2017). An overview of embedding models of entities and relationships for
knowledge base completion. arXiv preprint arXiv:1703.08098.
Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representa-
tions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., & Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30, pp.
6341–6350. Curran Associates, Inc.
Niemann, E., & Gurevych, I. (2011). The people’s web meets linguistic knowledge: au-
tomatic sense alignment of Wikipedia and WordNet. In Proceedings of the Ninth
International Conference on Computational Semantics, pp. 205–214.
Nieto Piña, L., & Johansson, R. (2015). A simple and efficient method to generate word sense
representations. In Proceedings of Recent Advances in Natural Language Processing,
pp. 465–472, Hissar, Bulgaria.
Otegi, A., Aranberri, N., Branco, A., Hajic, J., Neale, S., Osenova, P., Pereira, R., Popel,
M., Silva, J., Simov, K., & Agirre, E. (2016). QTLeap WSD/NED Corpora: Semantic
Annotation of Parallel Corpora in Six Languages. In Proc. of LREC, pp. 3023–3030.
Panchenko, A. (2016). Best of both worlds: Making word sense embeddings interpretable.
In Proceedings of LREC, pp. 2649–2655.
Panchenko, A., Faralli, S., Ponzetto, S. P., & Biemann, C. (2017a). Using linked disam-
biguated distributional networks for word sense disambiguation. In Proceedings of the
1st Workshop on Sense, Concept and Entity Representations and their Applications,
pp. 72–78.
Panchenko, A., Ruppert, E., Faralli, S., Ponzetto, S. P., & Biemann, C. (2017b). Un-
supervised does not mean uninterpretable: The case for word sense induction and
disambiguation. In Proceedings of EACL, pp. 86–98.
Pasini, T., & Navigli, R. (2018). Two knowledge-based methods for high-performance sense
distribution learning. In Proceedings of AAAI, New Orleans, United States.
Pelevina, M., Arefyev, N., Biemann, C., & Panchenko, A. (2016). Making sense of word
embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP,
pp. 174–183.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word
representation. In Proceedings of EMNLP, pp. 1532–1543.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social repre-
sentations. In Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’14, pp. 701–710.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L.
(2018). Deep contextualized word representations. In Proceedings of NAACL, New
Orleans, LA, USA.
Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence
tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1756–
1765. Association for Computational Linguistics.
Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review
and future directions. Psychonomic bulletin & review, 21 (5), 1112–1130.
Pilehvar, M. T., & Camacho-Collados, J. (2018). WiC: 10,000 example pairs for evaluating
context-sensitive representations. arXiv preprint arXiv:1808.09121.
Pilehvar, M. T., Camacho-Collados, J., Navigli, R., & Collier, N. (2017). Towards a Seamless
Integration of Word Senses into Downstream NLP Applications. In Proceedings of
ACL, Vancouver, Canada.
Pilehvar, M. T., & Collier, N. (2016). De-conflated semantic representations. In Proceedings
of EMNLP, pp. 1680–1690, Austin, TX.
Pilehvar, M. T., & Collier, N. (2017). Inducing embeddings for rare and unseen words by
leveraging lexical resources. In Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 2, Short Papers,
pp. 388–393, Valencia, Spain. Association for Computational Linguistics.
Pilehvar, M. T., & Navigli, R. (2014). A robust approach to aligning heterogeneous lexical
resources. In Proceedings of ACL, pp. 468–478.
Pilehvar, M. T., & Navigli, R. (2015). From senses to texts: An all-in-one graph-based
approach for measuring semantic similarity. Artificial Intelligence, 228, 95–128.
Pratt, L. Y. (1993). Discriminability-based transfer between neural networks. In Advances
in Neural Information Processing Systems 5, pp. 204–211.
Qiu, L., Tu, K., & Yu, Y. (2016). Context-dependent sense embedding. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, pp. 183–191.
Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011). A word at a time:
Computing word relatedness using temporal semantic analysis. In Proceedings of the
20th International Conference on World Wide Web, WWW ’11, pp. 337–346.
Raganato, A., Camacho-Collados, J., & Navigli, R. (2017a). Word sense disambiguation:
A unified evaluation framework and empirical comparison. In Proceedings of EACL,
pp. 99–110, Valencia, Spain.
Raganato, A., Delli Bovi, C., & Navigli, R. (2017b). Neural sequence learning models
for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pp. 1156–1167.
Reddy, S., Klapaftis, I. P., McCarthy, D., & Manandhar, S. (2011). Dynamic and static
prototype vectors for semantic composition. In Fifth International Joint Conference
on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November
8-13, 2011, pp. 705–713.
Reisinger, J., & Mooney, R. J. (2010). Multi-prototype vector-space models of word mean-
ing. In Proceedings of ACL, pp. 109–117.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy.
In Proceedings of IJCAI, pp. 448–453.
Rothe, S., & Schütze, H. (2015). Autoextend: Extending word embeddings to embeddings
for synsets and lexemes. In Proceedings of ACL, pp. 1793–1803, Beijing, China.
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Commu-
nications of the ACM, 8 (10), 627–633.
Ruder, S. (2017). On word embeddings, part 1. URL: http://ruder.io/word-embeddings-2017/
(visited on 1/04/2018).
Ruder, S., Vulić, I., & Søgaard, A. (2017). A survey of cross-lingual word embedding models.
arXiv preprint arXiv:1706.04902.
Salant, S., & Berant, J. (2018). Contextualized word representations for reading compre-
hension. In Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 2
(Short Papers), pp. 554–559, New Orleans, Louisiana.
Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-
Hill, New York.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing.
Communications of the ACM, 18 (11), 613–620.
Tripodi, R., & Pelillo, M. (2017). A game-theoretic approach to word sense disambiguation.
Computational Linguistics, 43 (1), 31–70.
Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., & Dyer, C. (2015). Evaluation of word
vector representations by subspace alignment. In Proceedings of EMNLP (2), pp.
2049–2054, Lisbon, Portugal.
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general
method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, pp. 384–394, Uppsala, Sweden.
Turney, P. D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In
Proceedings of the 12th European Conference on Machine Learning, pp. 491–502.
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsu-
pervised classification of reviews. In Proceedings of ACL, pp. 417–424.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Tversky, A., & Gati, I. (1982). Similarity, separability, and the triangle inequality. Psycho-
logical Review, 89 (2), 123.
Upadhyay, S., Chang, K.-W., Zou, J., Taddy, M., & Kalai, A. (2017). Beyond bilingual:
Multi-sense word embeddings using multilingual context. In Proceedings of the 2nd
Workshop on Representation Learning for NLP, Vancouver, Canada.
Upadhyay, S., Faruqui, M., Dyer, C., & Roth, D. (2016). Cross-lingual models of word
embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp.
1661–1670.
Ustalov, D., Panchenko, A., & Biemann, C. (2017). Watset: Automatic induction of synsets
from a graph of synonyms. In Proceedings of the 55th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1579–1590.
Van de Cruys, T., Poibeau, T., & Korhonen, A. (2011). Latent vector weighting for word
meaning in context. In Proceedings of the 2011 Conference on Empirical Methods in
Natural Language Processing, pp. 1012–1022, Edinburgh, Scotland, UK.
Vasilescu, F., Langlais, P., & Lapalme, G. (2004). Evaluating variants of the Lesk approach
for disambiguating words. In Proceedings of LREC.
Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. In
Proceedings of ICLR.
Vrandečić, D. (2012). Wikidata: A New Platform for Collaborative Data Collection. In
Proceedings of WWW, pp. 1063–1064.
Vu, T., & Parker, D. S. (2016). K-embeddings: Learning conceptual embeddings for words
using context. In Proceedings of NAACL-HLT, pp. 1262–1267.
Vulić, I., & Moens, M.-F. (2015). Bilingual word embeddings from non-parallel document-
aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing.
Zeman, D., Hajič, J., Popel, M., Potthast, M., Straka, M., Ginter, F., Nivre, J., & Petrov,
S. (2018). CoNLL 2018 shared task: Multilingual parsing from raw text to universal
dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing
from Raw Text to Universal Dependencies, pp. 1–21, Brussels, Belgium. Association
for Computational Linguistics.
Zhong, Z., & Ng, H. T. (2010). It Makes Sense: A wide-coverage Word Sense Disambiguation
system for free text. In Proceedings of the ACL System Demonstrations, pp. 78–83,
Uppsala, Sweden.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley,
Cambridge, MA.
Zou, W. Y., Socher, R., Cer, D. M., & Manning, C. D. (2013). Bilingual word embeddings for
phrase-based machine translation. In Proceedings of EMNLP, pp. 1393–1398, Seattle,
USA.