An Empirical Study of Incorporating Syntactic Constraints Into BERT-based Location Metonymy Resolution
669–692
doi:10.1017/S135132492200033X
ARTICLE
(Received 22 November 2020; revised 11 June 2022; accepted 13 June 2022; first published online 1 August 2022)
Abstract
Metonymy resolution (MR) is a challenging task in the field of natural language processing. The task of
MR aims to identify the metonymic usage of a word that employs an entity name to refer to another target
entity. Recent BERT-based methods yield state-of-the-art performances. However, they neither make full
use of the entity information nor explicitly consider syntactic structure. In contrast, in this paper, we argue
that the metonymic process should be completed in a collaborative manner, relying on both lexical seman-
tics and syntactic structure (syntax). This paper proposes a novel approach to enhancing BERT-based MR
models with hard and soft syntactic constraints by using different types of convolutional neural networks
to model dependency parse trees. Experimental results on benchmark datasets (e.g., RELOCAR, SEMEVAL
2007 and WIMCOR) confirm that incorporating syntactic information into fine-tuned pre-trained language
models benefits MR tasks.
1. Introduction
Metonymy is a type of figurative language that is pervasive in literature and in our daily conversa-
tion. It is commonly used to refer to an entity by using another entity closely associated with that
entity (Lakoff and Johnson 1980; Lakoff 1987; Fass 1988; Lakoff 1991, 1993; Pustejovsky 1991). For
example, the following two text snippets show a word with literal usage and metonymic usage:
© The Author(s), 2022. Published by Cambridge University Press.
translation (Kamei and Wakao 1992), relation extraction (RE) (Chan and Roth 2011) and geo-
graphical parsing (Monteiro, Davis, and Fonseca 2016; Gritta et al. 2017; Li et al. 2020). While
other types of metonymies exist, in this paper, we are only interested in a specific type of conven-
tional (regular) metonymy, namely, location metonymy. The task of location metonymy resolution
(Markert and Nissim 2002; Gritta et al. 2017; Li et al. 2020) consists of classifying a location name
within a given sentence as metonymic or literal.
Although many named entity recognition (NER) systems and word sense disambiguation
(WSD) systems exist, these systems generally do not explicitly handle metonymies. NER systems
only identify entity names from a sentence, but they are not able to recognise whether a word is
used metonymically. Existing WSD systems only determine which fixed ‘sense’ (interpretation)
of a word is activated from a closed set of interpretations, whereas metonymy interpretation is an
open problem: they cannot infer a metonymic reading of a word outside the dictionary. Lakoff
and Johnson (1980) and Fass (1988) found that metonymic expressions mainly fell into several
fixed patterns, most of which were quite regular. Therefore, recent methods for MR are mainly
structured into two phases (Markert and Nissim 2002): metonymy detection^a and metonymy
interpretation (Nissim and Markert 2003). Metonymy detection attempts first to distinguish the
usage of entity names between metonymic and literal. Then, metonymy interpretation determines
which fine-grained metonymic pattern it involves such as place-for-people or place-for-event. The
difference between metonymy detection and metonymy interpretation can be seen as from a
coarse-grained (binary, metonymic or literal) to fine-grained (a particular type of metonymic
expression) classification (Mathews and Strube 2020).
In computational linguistics, conventional feature-based methods for location MR (Nissim and
Markert 2003; Farkas et al. 2007; Markert and Nissim 2007, 2009; Brun, Ehrmann, and Jacquet
2007; Nastase and Strube 2009; Nastase et al. 2012) rely heavily on handcrafted features derived
from either linguistic resources or off-the-shelf taggers and dependency parsers. These methods
struggle with the problem of data sparsity and heavy feature engineering. Later, deep neural net-
work (DNN) models (Mikolov et al. 2013; Gritta et al. 2017; Mathews and Strube 2020) become
mainstream in handling various NLP tasks, including MR. These models have better performances
since they take more contextual information into account. Although DNN models provide a
giant leap forward compared to feature-based methods, training high-performance DNN mod-
els requires large-scale and high-quality datasets. However, existing datasets for MR are rather
small because the cost of collecting and annotating datasets is very expensive and unaffordable.
This situation raises a need to transfer the knowledge from existing large-scale datasets. Recently,
pre-trained language models (PLMs), especially BERT (Devlin et al. 2019), have shown superior
performance on various NLP downstream applications (Sun, Huang, and Qiu 2019; Qu et al.
2019; Lin et al. 2019b). The main advantage of PLMs is that they do not need to be trained from
scratch. When applying PLMs to a specific dataset, only some additional fine-tuning is required,
which is much cheaper. Benefiting from being pre-trained on large-scale data with
self-supervised learning objectives, PLMs can efficiently capture the syntax and semantics in the
text (Tang et al. 2018; Jawahar, Sagot, and Seddah 2019). Therefore, it is natural to adopt BERT to
generate entity representations for MR tasks.
However, directly adopting BERT into MR tasks might encounter problems. While BERT has
a strong advantage in modelling lexical semantics and generates informative token embeddings,
BERT has difficulty in fully modelling complete syntactic structures, as it might need deeper
layers to capture long-distance dependencies (Tang et al. 2018; Zhang, Qi, and Manning 2018;
Jawahar et al. 2019). Given the sentence ‘He later went to manage Malaysia for one year’, BERT
tends to focus more on the former verb ‘went’ and ignore the latter verb ‘manage’, which might
a Metonymy detection is also called metonymy recognition by Nissim and Markert (2003).
Figure 1. An example illustrating that syntactic information helps metonymy resolution. ‘Malaysia’ is metonymically used.
The words in deeper blue have higher weights in BERT’s attention. Since the sentence contains two verbs, the metonymy is
difficult to infer. While ‘manage’ strongly suggests a metonymic reading, BERT does not attend to that verb: the attention
weight for ‘went’ is higher than that for ‘manage’.
lead to an incorrect prediction of the MR label for ‘Malaysia’.^b As shown in Figure 1, dependency
parse trees that convey rich structural information might help to recognise the metonymic usage.
Therefore, syntactic knowledge is necessary for improving BERT-based MR models.
Previous studies (Nissim and Markert 2003; Nastase and Strube 2009; Nastase et al. 2012) sug-
gested that syntax was a strong hint in constructing metonymy routes. Both the lexical semantics
and the syntactic structure (specifically, dependency relations) jointly assisted in recognising novel
readings of a word. In a metonymic sentence, the target entity's fixed usage is deliberately violated
in order to introduce a novel metonymic reading, which was traditionally treated as a syntactico-
semantic violation (Hobbs and Martin 1987; Pustejovsky 1991; Chan and Roth 2011). Generally,
an entity is an argument to at least one predicate, so there exist explicit syntactic restrictions
between the entity and the predicate. In other words, the inference of metonymic reading primarily relies on
the selectional preferences of verbs (Fass 1988). As shown in Figure 1, ‘Malaysia’ refers to the
national football team of Malaysia. The verbs and dependency arcs among verbs (coloured in a
dark colour) were a strong clue to that metonymy, while other words (coloured in grey) had less
contribution. This motivated us to explore an interesting question: can jointly leveraging lexical
semantics and syntactic information bring benefits to MR?
As a part of ongoing interest in introducing prior syntactic knowledge into DNNs and PLMs,
this paper investigates different ways to incorporate hard and soft syntactic constraints into BERT-
based location MR models, following the idea that syntactic structure, alongside lexical semantics, is potentially helpful for MR.
Firstly, we employ an entity-aware BERT encoder to obtain entity representations. To force the
model to focus on the target entity for prediction, we leverage explicit entity location informa-
tion by inserting special entity markers before and after the target entity of the input sentence.
Then, to take advantage of relevant dependencies and eliminate the noise of irrelevant chunks, we
adopt two kinds of graph convolutional neural networks to impose hard and soft syntactic con-
straints on BERT representations in appropriate ways. Finally, the model selectively aggregates
syntactic and semantic features to be helpful for MR inference. As a result, the proposed approach
shows state-of-the-art (SOTA) performances on several MR benchmark datasets. To the best of
our knowledge, this work is the first attempt to integrate syntactic knowledge and contextualised
embeddings (BERT) for MR in an end-to-end deep learning framework.
b Although we know that ‘Malaysia’ is metonymically used, the resolution of the metonymy here is unclear without further
contextual information. Possible resolutions include as the national football team of Malaysia, as a department of a multi-
national business.
for MR tasks directly. Given the vast range of entities in the world, it is impossible to learn all
entity mentions. To address data sparsity and force the model to make predictions based only on
context, Li et al. (2020) proposed a word masking approach based on BERT by replacing all target
entity names with an [X] token during training and inference. The masking approach substantially
outperformed existing methods over a broad range of datasets.
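The masking idea can be sketched as follows; the whitespace tokenisation and the `mask_target_entity` helper are illustrative simplifications, not the authors' exact implementation:

```python
def mask_target_entity(tokens, span, mask_token="[X]"):
    """Replace the target entity span (start, end inclusive) with a single mask
    token, forcing the model to predict from context alone."""
    start, end = span
    return tokens[:start] + [mask_token] + tokens[end + 1:]

# 'Malaysia' is the target entity at position 5
tokens = "He later went to manage Malaysia for one year".split()
masked = mask_target_entity(tokens, (5, 5))
```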
Despite their successes, they did not investigate the role of syntax and how syntax affects MR.
However, identifying the metonymic usage of an entity should collaboratively rely on both the
entity and the syntax. The above issue motivated us to concentrate on modelling dependency
associations among words that may be potentially helpful for MR to enrich BERT representations.
3. Related work
Since entity names are often used in a metonymic manner, MR has a strong connection with other
NLP tasks such as WSD and RE. These tasks share similar pre-processing techniques and neural
network architectures in utilising syntactic information (Joshi and Penstein-Rosé 2009; Li et al.
2014; Peng et al. 2017; Zhang et al. 2018; Fu, Li, and Ma 2019). Integrating dependency relations
with DNN models has shown promising results for various NLP tasks (Joshi and Penstein-Rosé
2009; Li et al. 2014; Peng et al. 2017; Zhang et al. 2018; Fu et al. 2019). However, the effect of
dependency integration for neural-based MR models is still not well understood, and only limited
progress has been made so far.
With recent advances in RE (Zhang et al. 2018; Wu and He 2019; Guo, Zhang, and Lu 2019),
we investigate the use of dependency integration for MR. Our first concern is the integration
approach, whether directly concatenating dependency embeddings with token embeddings or
imposing dependency relations using a graph model is more appropriate. Extensive works have
discussed this issue, and most of them treated dependency relations as features. For example,
Kambhatla (2004) trained a statistical classifier for RE by combining various lexical, syntactic and
semantic features derived from the text in the early data pre-processing stage. Zhang, Zhang,
and Su (2006) studied embedding syntactic structure features in a parse tree to help RE. As a
result, those models were sensitive to linguistic variations, which prevented further applying the
dependency integration approach.
Recent research employs graph-based models to integrate DNNs and dependency parse trees.
A variety of hard pruning strategies relying on pre-defined rules have been proposed to distil
dependency information that improves the performance of RE. For example, Xu et al. (2015)
used the shortest dependency path between the entities in the entire tree. Liu et al. (2015) com-
bined the shortest dependency path between the target entities using a recursive neural network
and attached the subtrees to the shortest path with a convolutional neural network. To leverage
hierarchy information in dependency parse trees, Miwa and Bansal (2016) performed bottom-
up or top-down computations along the parse tree or the subtree below the lowest common
ancestor (LCA) of the entities. Zhang et al. (2018) pruned words except for the immediate ones
around the shortest path, given that those words might hold vital information to hint at the rela-
tion between two target entities. They applied a graph convolutional network (GCN) to model the
dominant dependency tree structures. Although these hard pruning methods remove irrelevant
relations efficiently, some useful information may also be eliminated. To resolve the above
conflicts, Guo et al. (2019) proposed a soft pruning method called AGGCN (attention-guided
graph convolutional network), a model that pools information over dependency trees by using
GCN. They transform original dependency trees into fully connected edge-weighted graphs, bal-
ancing the weights of dependency relations between including and excluding information. Note
that dependency-guided approaches, such as Zhang et al. (2018) and Guo et al. (2019), worked on
the RE task. To the best of our knowledge, we are the first to incorporate syntactic constraints into
BERT-based models for MR.
Figure 2. Illustration of the architecture of the proposed model with syntactic integration. It can distinguish metonymic
usage of the entity name ‘Malaysia’ given the enriched representation by incorporating hard and soft syntactic constraints
using GCN and AGCN blocks. In this model, both the context and entity semantics are considered to resolve metonymies.
4. Proposal
The task addressed in this paper is MR. Given an entity name E within a sentence S, MR predicts
whether E involves a metonymic or literal usage. The critical insight of this paper is that incorpo-
rating syntactic constraints may help BERT-based MR. As shown in Figure 1, the closest governing
verb in the dependency parse tree plays a dominant role in resolving metonymies. Therefore, we
consider both lexical semantics and syntactic structure essential for identifying metonymies.
Figure 2 illustrates the overall architecture of the proposed model. We propose an end-to-
end neural-based approach for MR tasks and train the model based on recent advances in PLMs.
Since BERT has shown superior performance on various NLP tasks, we employ BERT as an input
encoder to produce tokenwise semantic representations by passing the input sentences through
the BERT encoder. To enrich these tokenwise representations with syntactic knowledge given
dependency parse trees, we propose two ways to incorporate syntactic constraints using different
types of GCNs, namely, non-attentive GCN and attentive GCN (AGCN). We first perform
dependency parsing for input sentences to extract corresponding dependency parse trees and then
convert those parse trees into dependency adjacency matrices. Then, we use the GCN to encode
dependency adjacency matrices explicitly. However, vanilla GCNs represent the adjacency edges
among nodes using hard 0 and 1 labels. To learn these weights, following Guo et al. (2019), we
adopt the self-attention mechanism (Vaswani et al. 2017) upon GCNs to tune the weights. As a
result, the final representations contain rich syntactic knowledge, and lexical semantics serve to
make predictions.
4.1 Entity-aware BERT encoder
Following R-BERT (Wu and He 2019), which enriches BERT with entity information for relation classification, we insert special [ENT] indicators before
and after the entity nominal. This simple approach lets BERT easily locate the target entity position.
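The marker insertion step can be sketched as below; the function name and span convention are illustrative, assuming a pre-tokenised sentence with a known entity span:

```python
def insert_entity_markers(tokens, span, marker="[ENT]"):
    """Insert a marker token immediately before and after the target entity span
    (start, end inclusive) so the encoder can locate the entity."""
    start, end = span
    return tokens[:start] + [marker] + tokens[start:end + 1] + [marker] + tokens[end + 1:]

tokens = "He later went to manage Malaysia for one year".split()
marked = insert_entity_markers(tokens, (5, 5))
```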
For each $h_x^0$ at index $x$, we concatenate the initial token embedding with the positional and segment embeddings as follows:

$$h_x^0 = \mathrm{concat}[S_x^{tok};\, S_x^{pos};\, S_x^{seg}] \quad (1)$$
After going through $N$ successive transformer encoder blocks, the encoder generates the entity-aware
BERT representation at the $x$-th position, denoted $h_x^N$, as follows:

$$h_x^N = \mathrm{BERT}(h_x^0) \quad (2)$$
4.2 Alignment
BERT applies WordPiece Tokenizer (a particular type of subword tokenizer) to further segment
words into word pieces, for example, from ‘played’ to [‘play’, ‘##ed’]. However, dependency pars-
ing relies on words and hence does not execute further segmentation. Thus, we need to align
BERT’s tokens against the input words and restore word representations by adopting the average
pooling operation on BERT’s token representations. Assume $h_x, \ldots, h_y$ are the BERT representations
of the tokens ($x$ and $y$ denote the start and end indices of the token sequence); we obtain
the embedding $\tilde{h}_i$ of the $i$-th word by^d:

$$\tilde{h}_i = \frac{1}{y-x+1} \sum_{t=x}^{y} h_t \quad (3)$$
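A minimal sketch of this average-pooling alignment, with toy 2-dimensional vectors standing in for BERT token representations:

```python
import numpy as np

def align_wordpieces(token_reprs, word_spans):
    """Recover word-level vectors from wordpiece vectors by average pooling;
    word_spans[i] is the (start, end) token index pair (inclusive) of word i."""
    return np.stack([token_reprs[s:e + 1].mean(axis=0) for s, e in word_spans])

# toy example: word 0 = one piece, word 1 = two pieces ('play', '##ed')
token_reprs = np.array([[1.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
word_reprs = align_wordpieces(token_reprs, [(0, 0), (1, 2)])
```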
ϕ is a soft attention function, such as additive (Bahdanau, Cho, and Bengio 2015), general dot-
product (Luong, Pham, and Manning 2015) or scaled dot-product (Vaswani et al. 2017) attention.
Therefore, the attention-guided layer composes both the attentive and nonattentive modules. We
use the scaled dot-product attention in our model for efficiency.
d Let $H = [h_1, \ldots, h_n]$ denote a sentence. A more effective approach is to construct a mapping matrix $M$ to project BERT’s
token representations $H$ into the full sentence representations, that is, word representations $\tilde{H}$. The projection matrix
$M \in \mathbb{R}^{m \times n}$ records the transformation from the original words to the subwords and serves as a router to restore word-wise
representations as $\tilde{H} = HM^T$, where $m$ and $n$ denote the length of the input sentence in words and in tokens after tokenisation, respectively.
where W (l) represents the weight matrix, b(l) denotes the bias vector and ρ is an activation func-
tion. h̃(l−1) and h̃(l) are the hidden states in the prior and current layers, respectively. Each node
gathers and aggregates information from its neighbouring nodes during graph convolution.
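A single graph-convolution step can be sketched as below; the self-loops, degree normalisation and ReLU activation are common choices assumed for illustration, not details stated in the text:

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One GCN layer: each node aggregates its neighbours' features through the
    0/1 adjacency matrix A (plus a self-loop), then a linear map and ReLU."""
    A_hat = A + np.eye(A.shape[0])                     # self-loops keep each node's own feature
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)  # normalise by node degree
    return np.maximum(0.0, A_norm @ H @ W + b)

# path graph 0-1-2 with identity weights: node 0 averages itself and node 1
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
out = gcn_layer(H, A, np.eye(2), np.zeros(2))
```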
$$\tilde{A}^{(k)} = \mathrm{softmax}\!\left(\frac{QW_i^Q \times (KW_i^K)^T}{\sqrt{d}}\right) \quad (6)$$
where $Q$ and $K$ are the query and the key in multi-head attention, respectively; both are equal to
the input representation $\tilde{H}$ (i.e., the output of the last module), $d$ denotes the dimension of
$\tilde{H}$, $W_i^Q$ and $W_i^K$ are learnable parameters in $\mathbb{R}^{d \times d}$, and $\tilde{A}^{(k)}$ is the $k$-th attention-guided
adjacency matrix corresponding to the $k$-th head. Thus, we can replace the hard matrix $A$ in the
previous equation with the soft attention matrix $\tilde{A}^{(k)}$. The dependency relations, especially the
indirect, multi-hop ones, are modelled by the multi-head mechanism.
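The attention-guided adjacency of Equation (6) can be sketched for one head as below, assuming row vectors and learned projection matrices; note every row of the result is a probability distribution, i.e., a fully connected weighted graph:

```python
import numpy as np

def attention_adjacency(H, WQ, WK):
    """Soft adjacency for one head: softmax((H WQ)(H WK)^T / sqrt(d)) replaces
    the hard 0/1 dependency matrix with dense, learned edge weights."""
    d = H.shape[1]
    scores = (H @ WQ) @ (H @ WK).T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # row-wise stable softmax
    return e / e.sum(axis=1, keepdims=True)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A_soft = attention_adjacency(H, np.eye(2), np.eye(2))
```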
Figure 3. Densely connected structure. The sublayers are densely connected to fuse structures.
$$g_j^{(l)} = \mathrm{concat}[x_j;\, \tilde{h}_j^{(1)};\, \ldots;\, \tilde{h}_j^{(l-1)}] \quad (7)$$

where $x_j$ is the initial representation output by the alignment layer, and $\tilde{h}_j^{(1)}, \ldots, \tilde{h}_j^{(l-1)}$ denote
the concatenated representations produced by the preceding sublayers. In addition, the dimension
of representations in these sublayers shrinks to improve parameter efficiency, that is, $d_{hidden} =
d/L$, where $L$ is the number of sublayers and $d$ is the input dimension; with three sublayers and an
input dimension of 768, $d_{hidden} = d/L = 256$. The layer outputs a fresh representation of dimension
768 ($256 \times 3$) by concatenating all the sublayer outputs. Thus, the layer conserves considerable
information at a low computational cost and helps the weight gradually flow to the
determining token. $N$ densely connected layers are constructed to process the $N$ adjacency matrices
produced by the attention-guided layers, where $N$ denotes the number of heads. The GCN
computation for each sublayer should be modified to adapt the multi-head attention as follows:
$$\tilde{h}_{k,i}^{(l)} = \rho\!\left(\sum_{j=1}^{n} \tilde{A}_{ij}^{(k)} W_k^{(l)} g_j^{(l)} + b_k^{(l)}\right) \quad (8)$$
where $k$ represents the $k$-th head, and $W_k^{(l)}$ and $b_k^{(l)}$ are the learnable weights and bias, respectively,
which are selected by $k$ and associated with the attention-guided adjacency matrix $\tilde{A}^{(k)}$.
where $\tilde{h}_{out} \in \mathbb{R}^d$ is the aggregated representation of the $N$ heads, and $W_{out}$ and $b_{out}$ are the
weights and biases learned during training.
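The per-head GCN computation of Equation (8) and the head aggregation can be sketched together as follows. This is a simplification: the densely connected input $g_j^{(l)}$ is reduced to a plain input matrix `X`, and ReLU stands in for the unspecified activation ρ:

```python
import numpy as np

def multi_head_gcn(X, A_heads, Ws, bs, W_out, b_out):
    """Run one GCN sublayer per attention head over its soft adjacency matrix,
    concatenate the N head outputs, and project back to the model dimension."""
    heads = [np.maximum(0.0, A @ X @ W + b) for A, W, b in zip(A_heads, Ws, bs)]
    return np.concatenate(heads, axis=1) @ W_out + b_out

n, d, n_heads, d_hidden = 3, 4, 2, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
A_heads = [np.full((n, n), 1.0 / n)] * n_heads  # uniform soft adjacencies (toy)
Ws = [rng.normal(size=(d, d_hidden)) for _ in range(n_heads)]
bs = [np.zeros(d_hidden) for _ in range(n_heads)]
out = multi_head_gcn(X, A_heads, Ws, bs,
                     rng.normal(size=(n_heads * d_hidden, d)), np.zeros(d))
```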
4.4 Classifier
This layer maps the final hidden state sequence H to the class metonymic or literal. The represen-
tation Hi corresponds to the token ti . Specifically, H0 denotes ‘[CLS]’ at the head of the subword
sequence after tokenisation, which serves as the pooled embedding to represent the aggregate
sequence.
Suppose that h̃x , . . . , h̃y are the word representations against the entity span E outputted by
the syntactic integration layer. x and y represent the start and end index of the words in the entity
span, respectively. We apply an operation of average pooling to obtain the final entity encoding:
For classification, we concatenate $H_0$ and $H_e$, applying two fully connected layers
with activation. Then, we apply a softmax layer to make the final prediction. The learning objective
is to predict the metonymic or literal class for an entity within a given sentence:

$$H_{final} = \rho(W^{*}\, \rho(W\, \mathrm{concat}[H_0; H_e] + b) + b^{*}) \quad (12a)$$

$$\hat{\gamma} = \arg\max_{r} \frac{\exp(H_{final}^{(r)})}{\sum_{r'} \exp(H_{final}^{(r')})} \quad (12b)$$

where $\hat{\gamma}$ refers to a class type in the metonymy type set $\Gamma$, $W \in \mathbb{R}^{d \times 2d}$, $W^{*} \in \mathbb{R}^{r \times d}$, $|\Gamma|$ is the
number of classification types, and $d$ is the dimension of the hidden representation vector. While
number of classification types, and d is the dimension of the hidden representation vector. While
there are only two classes in this task, this approach can generalise to multiple classes.
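The classifier head of Equations (12a)-(12b) can be sketched as below; ReLU stands in for the unspecified activation ρ, and the softmax is returned as class probabilities rather than the argmax:

```python
import numpy as np

def classify(H0, He, W, b, W_star, b_star):
    """Concatenate the [CLS] encoding H0 and the entity encoding He, pass them
    through two fully connected layers with ReLU, and softmax over the classes."""
    hidden = np.maximum(0.0, W @ np.concatenate([H0, He]) + b)  # W in R^{d x 2d}
    logits = W_star @ hidden + b_star                           # W* in R^{r x d}
    e = np.exp(logits - logits.max())
    return e / e.sum()  # probabilities of metonymic vs. literal

d = 4
rng = np.random.default_rng(1)
probs = classify(rng.normal(size=d), rng.normal(size=d),
                 rng.normal(size=(d, 2 * d)), np.zeros(d),
                 rng.normal(size=(2, d)), np.zeros(2))
```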
5. Experiments
5.1 Datasets
We conducted our experiments mainly on three publicly available benchmarks: two small size
location metonymy datasets, SEMEVAL (Markert and Nissim 2007) and RELOCAR (Gritta et al.
2017), and a large size dataset, WIMCOR (Mathews and Strube 2020). SEMEVAL and RELOCAR
are created to evaluate the capability of a classifier to distinguish literal (geographical territories
and political entities), metonymic (place-for-people, place-for-product, place-for-event, capital-
for-government or place-for-organisation) and mixed (metonymic and literal frames invoked
simultaneously or are unable to distinguish) location mentions.
SEMEVAL: The SEMEVAL dataset^e focuses on locations retrieved from the British National
Corpus. The distribution of categories in the SEMEVAL dataset is approximately 80% literal, 18%
metonymic and 2% mixed to simulate the natural distribution of location metonymy. Therefore,
a literal default tag already provides 80% precision. Although it contains finer-grained labels of
metonymic patterns, such as place-for-people, place-for-event or place-for-product, we use only
coarse-level labels of metonymy or literal in the experiment. Our experiment excluded the mixed
class since it accounts for only 2% of the data. Finally, the dataset comprises training (910 samples)
and testing (888 samples) partitions.
RELOCAR: The RELOCAR dataset^f was collected using sample data from Wikipedia’s Random
Article API. The data distribution of RELOCAR classes (literal, metonymic and mixed) is approxi-
mately 49%, 49% and 2%, respectively. We excluded mixed-class instances. The processed dataset
contains 1026 training and 982 testing instances and has a better label balance, which avoids the
bias that sub-sampling the majority class would otherwise introduce.
e http://web.eecs.umich.edu/~mihalcea/affectivetext/#resources.
f https://github.com/milangritta/Minimalist-Location-Metonymy-Resolution/tree/master/data.
WIMCOR: The above datasets are limited in size. We also conduct our experiments on a large
harvested corpus of location metonymy called WIMCOR.^g WIMCOR comprises a variety
of location names, such as names of towns (e.g., ‘Bath’), cities (e.g., ‘Freiburg’) and states (e.g.,
‘Texas’). The average sentence length in WIMCOR is 80 tokens per sentence. While the samples
in WIMCOR are annotated with coarse-grained, medium-grained and fine-grained labels, only
the coarse labels (binary, i.e., metonymic or literal) are used in our experiments. The training set
contains 92,563 literal instances and 31,037 metonymic instances.
5.2 Setup
5.2.1 Data pre-processing
This section introduces the way to obtain dependency relation matrices. We performed depen-
dency parsing using the spaCy parser^h and transformed all dependency trees (one parse tree per
sentence) into symmetric adjacency matrices, ignoring the dependency directions and types for
simplicity. In preliminary work, we conducted experiments using asymmetric matrices, but we
did not observe any improvements.
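The conversion from a parse tree to a symmetric adjacency matrix can be sketched as below. The arc list is shown pre-extracted; in practice it would come from a parser such as spaCy (one `(token.head.i, token.i)` pair per token):

```python
import numpy as np

def arcs_to_adjacency(arcs, n):
    """Turn dependency arcs (head_index, child_index) into a symmetric 0/1
    adjacency matrix, discarding arc directions and labels for simplicity."""
    A = np.zeros((n, n))
    for head, child in arcs:
        A[head, child] = A[child, head] = 1.0
    return A

# toy 3-token sentence whose root (token 1) governs tokens 0 and 2
A = arcs_to_adjacency([(1, 0), (1, 2)], 3)
```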
For BERT variants, we followed Devlin et al. (2019) and used the tokenizer in BERT to seg-
ment words into word pieces as discussed in Section 4.1. We inserted the special [ENT] indicator
before and after the entity spans as Wu and He (2019) did for E-BERT experiments. To adapt the
sequence length distribution corresponding to each dataset, we set the max sequence length to 256
for SEMEVAL, and 128 for RELOCAR and WIMCOR.
5.3 Results
5.3.1 Models
We compared our proposed method with different MR methods to evaluate it. The task of loca-
tion MR is to detect the locations with literal reading only and ignore all other possible readings.
Following Gritta, Pilehvar and Collier (2020), we classify the entity phrase as either literal or
metonymic. The baseline models used in our experiments are listed below.
SVM+Wiki: SVM+Wiki is the previous SOTA statistical model. It applies SVM with Wikipedia’s
network of categories and articles, enabling the model to automatically discover new relations and
their instances.
LSTM and BiLSTM: LSTM is one of the most powerful dynamic classifiers publicly known
(Sundermeyer, Schlüter, and Ney 2012). Thanks to the featured memory function of remember-
ing the last hidden states, it achieves decent results and is widely used on various NLP tasks (Gao
et al. 2018; Si et al. 2019). Moreover, BiLSTM improves the token representation by being aware of
the context in both directions (Hochreiter and Schmidhuber 1997), making true contextual
reasoning available. Additionally, two kinds of representations, GloVe (Pennington, Socher and
Manning 2014) and ELMo, are tested separately to ensure model reliability.
Paragraph, Immediate and PreWin: Three models, Paragraph, Immediate and PreWin, are built
upon BiLSTM models. They simultaneously encode tokens into word vectors and dependency
relation labels into one-hot vectors (generally 5–10 tokens selected from the left and right of the
entity work best). The three models differ in the manner of token picking. Immediate-x chooses
the x words to the immediate right and left of the entity as input to the model (Collobert
et al. 2011; Mesnil et al. 2013; Mikolov et al. 2013; Baroni, Dinu, and Kruszewski 2014); for example,
Immediate-5/10 takes the 5/10 words to the immediate right and left of the entity.
The Paragraph model extends the Immediate model by taking more words (50 words)
from each entity’s side as the input. PreWin selects the words near the local predicate to eliminate
long-distance noise in the input.
PreWin (BERT) is the reimplementation of the PreWin system with BERT embeddings as the
input. Instead of deploying BERT as a classifier, we replace the original GloVe embeddings with
BERT embeddings used in the PreWin model and initialise word embeddings using BERT embed-
dings. Word embeddings are combined by summing subword embeddings to generate GloVe-like
word embeddings.
BERT, +AUG, +MASK: Three BERT-based MR models are described in Li et al. (2020). The
vanilla BERT model (Devlin et al. 2019) can be directly used to detect metonymies by performing
sentential classification. BERT encodes the input tokens into distributed vector representations
after fine-tuning over datasets. BERT+AUG is fine-tuned with data augmentation (Li et al. 2020).
This method generates new samples by randomly substituting the target entity nominal with one
from all the extracted target words. BERT+MASK fine-tunes the BERT model with target word
masking that replaces the input target word with the single token [ENT] during training and
evaluation.
E-BERT (sent) and E-BERT (sent+ent): Entity-aware BERT, namely, E-BERT, enriches the
semantic representations by incorporating the entity information. The input to the E-BERT (sent)
model is slightly different from the original dataset, where we inserted [ENT] markers before and
Figure 4. Variants of the architecture for extracting entity and sentence representations from the deep Transformers net-
work. (a) A model with the standard input and with sentence output at the position of [CLS]; (b) a model using the
entity-marked input and with the sentence (i.e., [CLS]) and the entity outputs; (c) and (d) two models with the entity-marked
input and with the sentence (i.e., [CLS]) and the entity outputs using GCN and AGCN, respectively.
after the entity spans, making BERT aware of the entity position. The E-BERT (sent) model repre-
sents the sentence using the encoding at the [CLS] position. The E-BERT (sent+ent) model shares
the same network structure as the R-BERT model (Wu and He 2019) for RE, but it depends on
a sole entity. Concretely, this variation concatenates the target entity’s sentential encoding and
corresponding encoding.
E-BERT+GCN: This model applies a hard pruning strategy using GCN computation to integrate
syntactic information into BERT representations. The input sentences are inserted with the [ENT]
label before and after the metonymic and literal entity span.
E-BERT+AGCN: We build the fully attentive system E-BERT+AGCN based on E-BERT+GCN.
The attention-guided layer in E-BERT+AGCN employs a soft attention mechanism to assign
proper weights to all dependencies. Figure 4 illustrates all BERT variants used in this paper,
including BERT, E-BERT, E-BERT+GCN and E-BERT+AGCN.
Table 1. Statistics of the datasets used in this paper. The table describes the number
of identical entities and the number of overlapping entities in the training and test
sets. The table includes sentence length, entity position and the number of verbs per
sentence
NER features. Furthermore, we compared E-BERT+AGCN with two baseline models: E-BERT
(the entity-aware BERT model without syntactic integration) and E-BERT+GCN (imposing hard
syntactic constraints with GCN).
Moreover, the experiment on E-BERT+GCN shows accuracy increases of 0.3% and
0.2% over E-BERT (sent+ent) on the SEMEVAL and RELOCAR datasets, respectively.
GCN improves performance by capturing useful information from syntax. Applying the hard
pruning behaviour of Immediate 5 to E-BERT+GCN has little effect, which shows that pruning
graphs crudely may be counterproductive. E-BERT+AGCN obtains improvements of 0.7%
and 0.2% on the SEMEVAL and RELOCAR datasets, respectively, compared with E-BERT+GCN.
Therefore, introducing a multi-head attention mechanism that assists GCNs in information
aggregation seems successful. The standard deviation of E-BERT+AGCN is also lower than
E-BERT+GCN, indicating a more robust model performance. Our approach effectively incor-
porates soft-dependency constraints into MR models by pruning irrelevant information and
emphasising dominant relations concerning indicators.
We also report F1 scores for literal class and metonymic class separately. RELOCAR is a class-
balanced dataset with literal and metonymic independently accounting for 50% of all examples
in the training dataset. The F1 score of RELOCAR is relatively higher than that of the SEMEVAL
dataset due to the shorter sentence length. In the RELOCAR rows, the F1 of both classes shows a
slight improvement over the baseline E-BERT and E-BERT+GCN, although the standard deviations
are relatively high. Conversely, SEMEVAL serves as a benchmark with literal and metonymic
accounting for 80% and 20%. The imbalance causes a lack of metonymic evidence, making the
model learning process insufficient. As reflected in Table 2, earlier models, such as LSTM, have an
inferior F1 performance on the metonymic class compared with the literal class. The considerable
performance gap of 3.4% and 4.0% in F1-M between BERT and E-BERT+AGCN shows that E-
BERT+AGCN is more powerful in capturing syntactic clues to solve the sample limitation. To
summarise, E-BERT+AGCN achieves the highest F1 scores for both SEMEVAL and RELOCAR
and is able to adapt to various class distributions in the dataset.
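Per-class F1, the metric used throughout this comparison, can be computed with a few lines of plain Python; this is a generic sketch (the paper does not publish its scoring code), with hypothetical label strings 'L' and 'M':

```python
def f1_per_class(gold, pred, label):
    """F1 for a single class, guarding against the zero-division
    cases that an imbalanced class distribution can trigger."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A toy 80/20 split: one error on each class
gold = ["L", "L", "L", "L", "M"]
pred = ["L", "L", "L", "M", "M"]
```

On this toy split, `f1_per_class(gold, pred, "L")` is 6/7 (about 0.857) while `f1_per_class(gold, pred, "M")` is 2/3 (about 0.667): a single confusion costs the minority class far more, which is exactly the SEMEVAL pattern discussed above.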
Table 2. The overall F1 and accuracy scores on the SEMEVAL and RELOCAR datasets. ‘L’ and ‘M’ denote literal
and metonymic classes. +NER+POS means integrating both NER and POS features with the baseline model.
In general, E-BERT+AGCN obtains the best results. The boldface denotes the best results and ‘↑’ means sta-
tistically significant improvement over the baseline (BERT+MASK, Li et al. 2020) with p-value ≤ 0.05. † and ‡
are the results reported in the previous works of Gritta et al. (2017) and Li et al. (2020), respectively. Since the
datasets are slightly different, we re-implement systems of Li et al. (2020) and report the new results labelled
by ∗
Model                 SEMEVAL                      RELOCAR
                      F1-L   F1-M   Acc (std)      F1-L   F1-M   Acc (std)
LSTM (GloVe)          85.2   28.7   72.6 (1.48)    78.4   78.4   78.4 (0.91)
BiLSTM (GloVe)        83.2   37.4   75.4 (1.72)    82.9   83.0   82.9 (0.85)
BiLSTM (ELMo)         91.9   54.7   86.3 (0.45)    90.0   90.1   90.0 (0.40)
PreWin (GloVe)†       90.6   57.3   83.1 (0.64)    84.4   84.8   83.6 (0.71)
+MASK∗ (SOTA)         93.0   63.3   88.2 (0.61)    95.3   95.4   95.3 (0.41)
E-BERT (sent)         93.5   60.0   87.6 (0.55)    94.0   94.0   94.0 (0.58)
E-BERT (sent+ent)     93.2   66.0   88.8 (0.63)    95.2   95.3   95.3 (0.44)
+GCN (Immediate-5)    93.6↑  65.7↑  89.0 (0.50)↑   95.3   95.4   95.4 (0.44)
In addition, to verify the effectiveness of our model on a larger dataset, we run the experiment on the WIMCOR dataset; Table 3 gives the results. Though the increase is not substantial in terms of accuracy or F1, our model yields a 0.2 percentage point improvement over E-BERT, which is meaningful given that the WIMCOR test set contains 41,200 instances.
Table 4. Cross-domain accuracy, precision, recall and F1 scores. The best results are indicated in boldface
Figure 5. Comparison on the RELOCAR and the SEMEVAL datasets w.r.t. different sentence lengths. E-BERT+AGCN signifi-
cantly outperforms E-BERT and E-BERT+GCN on the SEMEVAL dataset when sentence length is longer than 70.
syntactic relations. As Figure 5 shows, the accuracy of BERT drops as the sentence length grows. In this case, a dependency-based model is better suited to handling long-distance relations while reducing computational complexity. E-BERT+AGCN alleviates this performance degradation and outperforms the two baselines in all buckets, and the improvement becomes more significant as the sentence length increases (≥ 30 on RELOCAR and ≥ 66 on SEMEVAL). The results
in Figure 5 confirm that E-BERT+AGCN produces better entity and contextual representations
for MR, especially for longer sentences.
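The length-bucketed comparison in Figure 5 amounts to grouping test examples by sentence length and computing accuracy per bucket. A minimal sketch, assuming examples arrive as `(length, correct)` pairs and a hypothetical fixed bucket width:

```python
from collections import defaultdict

def accuracy_by_length(examples, bucket_width=10):
    """Group (sentence_length, correct) pairs into fixed-width
    length buckets and report accuracy per bucket."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for length, correct in examples:
        bucket = (length // bucket_width) * bucket_width
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

For instance, `accuracy_by_length([(5, True), (7, False), (25, True)])` returns `{0: 0.5, 20: 1.0}`; plotting such dictionaries for each model yields curves of the kind shown in Figure 5.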
Figure 6. Visualisation of attention matrices (better viewed in colour): (a) averaged attention weights in the E-BERT encoder (E-BERT); (b) averaged attention weights in the E-BERT encoder (E-BERT+AGCN); (c) attention weights in the non-attention modules (E-BERT+GCN and E-BERT+AGCN); (d) averaged attention weights in the attention modules (E-BERT+AGCN). Panels (a) and (b) show the effect of incorporating syntactic constraints on BERT; (c) and (d) contrast soft attention weights with hard ones. Compared to Figure 1, (a)∼(d) illustrate that incorporating syntactic information forces the model to focus on the neighbours of the target word in the dependency graph.
Furthermore, the sentence in the example can be divided into the main clause 'He later went to manage Malaysia' and the prepositional phrase 'for one year'. The main clause contains the predicate and the entity that dominate the MR inference. Conventional methods, however, assign the same weight to the modifier relation between 'one' and 'year' as to other, irrelevant connections, which introduces considerable noise into feature extraction.
As shown in Figure 6(b), the prepositional phrase 'for one year' is irrelevant to the MR task. Although dependency relations exist within the prepositional phrase, their weights are relatively low compared with those in the main clause, which includes the verb and its dependent words. With the multi-head attention mechanism, the model is freed from fixed pruning rules and flexibly learns the connections among tokens.
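A fixed pruning rule such as Immediate-5 can be read as keeping only tokens within a set number of hops of the target in the dependency graph and discarding everything else. The sketch below illustrates such a hop-based hard mask, assuming an unlabelled symmetric adjacency matrix (the exact pruning rule used in the paper may differ):

```python
import numpy as np
from collections import deque

def immediate_k_mask(A, target, k):
    """Hard pruning: keep only tokens within k dependency hops of
    the target token; all other tokens are masked out entirely."""
    n = A.shape[0]
    dist = {target: 0}
    queue = deque([target])
    while queue:                       # breadth-first search
        u = queue.popleft()
        if dist[u] == k:               # do not expand beyond k hops
            continue
        for v in range(n):
            if A[u, v] and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    mask = np.zeros(n)
    mask[list(dist)] = 1.0             # 1 = kept, 0 = pruned
    return mask
```

A soft constraint replaces this 0/1 mask with learned attention weights, so distant but relevant tokens are down-weighted rather than discarded outright.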
The syntactic component (GCN Block) first selects relevant syntactic features efficiently given
the hard dependency adjacency matrix (see Figure 6(c)). Then, the attention-guided layer learns
the soft attention matrix. To demonstrate the superiority of soft dependency relations, we use
Figure 6(d) to visualise the attention weights of the attention-guided layer. Unlike the attention in
the BERT encoding layer, the attention-guided layer’s attention matrix reflects more information
Table 5. Samples for error analysis. Bold denotes the target entities for MR. ‘Label’ refers to the class label of the target entity,
followed by the correctness of the predictions of E-BERT and E-BERT+AGCN models
ID  Sentence                                                              Label  E-BERT  E-BERT+AGCN
S1  Her personal bests in the event are 1.92 metres outdoors              MET    ×       ×
    (Marseille 2015) and 1.93 metres indoors (Budapest 2015)
S2  Engaged in very long range strategic bombing missions to enemy        MET    ×
    military, industrial and transportation, were Italy, France,
    Germany, Austria, Hungary, Romania, and Yugoslavia
S4  The LP had advance orders of a half million and sold another half     MET    ×
    million by September 1965, making it the second album to sell a
    million copies in the United Kingdom, after the soundtrack to the
    1958 film South Pacific
S5  After spending three years in London on board the prison hulk         MET    ×       ×
    Newgate, Hutchinson was transported to Australia on the
    Hillsborough, sometimes referred to as the 'Fever Ship' since some
    ninety-five of the three hundred convicts aboard died from typhoid
    fever brought aboard from the prison hulks
about dependency relationships. The GCN tends to trust information from all one-hop neighbours in the dependency graph while overlooking more distant ones. In contrast, the AGCN uses multi-head attention to jointly attend to different representation subspaces, reducing this information loss.
6. Error analysis
In most cases shown in Table 5, E-BERT+AGCN makes correct predictions. However, several typical issues remain unsolved. We discuss three types of such errors here.
trace the predicate and made an incorrect prediction even though it was aware of syntactic
structural knowledge.
Acknowledgement. We thank the anonymous reviewers for their valuable comments. This work was supported by
Shanghai Science and Technology Young Talents Sailing Program 21YF1413900, Fundamental Research Funds for the
Central Universities (43800-20101-222340) and in part by the National Natural Science Foundation of China under grants
91746203, 61991410 and the National Key R&D Program of China under grant 2018AAA0102804.
References
Bahdanau D., Cho K. and Bengio Y. (2015). Neural machine translation by jointly learning to align and translate. In
Proceedings of the 3rd International Conference on Learning Representations.
Baroni M., Dinu G. and Kruszewski G. (2014). Don’t count, predict! a systematic comparison of context-counting vs.
context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pp. 238–247.
Brun C., Ehrmann M. and Jacquet G. (2007). XRCE-M: A hybrid system for named entity metonymy resolution. In
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational
Linguistics, pp. 488–491.
Chan Y.S. and Roth D. (2011). Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA.
Association for Computational Linguistics, pp. 551–560.
Collobert R., Weston J., Bottou L., Karlen M., Kavukcuoglu K. and Kuksa P.P. (2011). Natural language processing
(almost) from scratch. Journal of Machine Learning Research 12, 2493–2537.
Devlin J., Chang M.-W., Lee K. and Toutanova K. (2019). BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics,
pp. 4171–4186.
Farkas R., Simon E., Szarvas G. and Varga D. (2007). Gyder: Maxent metonymy resolution. In Proceedings of the 4th
International Workshop on Semantic Evaluations, pp. 161–164.
Fass D. (1988). Metonymy and metaphor: What’s the difference? In Coling Budapest 1988 Volume 1: International Conference
on Computational Linguistics. International Committee on Computational Linguistics, pp. 177–181.
Fu T.-J., Li P.-H. and Ma W.-Y. (2019). GraphRel: Modeling text as relational graphs for joint entity and relation extraction.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational
Linguistics, pp. 1409–1418.
Fundel K., Küffner R. and Zimmer R. (2007). RelEx—relation extraction using dependency parse trees. Bioinformatics 23(3),
365–371.
Gao G., Choi E., Choi Y. and Zettlemoyer L. (2018). Neural metaphor detection in context. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 607–613.
Glavaš G. and Vulić I. (2021). Is supervised syntactic parsing beneficial for language understanding tasks? an empirical inves-
tigation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, Online. Association for Computational Linguistics, pp. 3090–3104.
Gritta M., Pilehvar M.T. and Collier N. (2020). A pragmatic guide to geoparsing evaluation toponyms, named entity
recognition and pragmatics. Lang Resources & Evaluation 54, 683–712.
Gritta M., Pilehvar M.T., Limsopatham N. and Collier N. (2017). Vancouver welcomes you! minimalist location metonymy
resolution. In Proceedings of Annual Meeting of the Association for Computational Linguistics, vol. 1. Association for
Computational Linguistics, pp. 1248–1259.
Guo Z., Zhang Y. and Lu W. (2019). Attention guided graph convolutional networks for relation extraction. In Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics,
pp. 241–251.
Hobbs J.R. and Martin P. (1987). Local pragmatics. Technical report, SRI International, Menlo Park, CA: Artificial Intelligence
Center, pp. 520–523.
Hochreiter S. and Schmidhuber J. (1997). Long short-term memory. Neural Computation 9(8), 1735–1780.
Huang G., Liu Z., Van Der Maaten L. and Weinberger K.Q. (2017). Densely connected convolutional networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
Janda L.A. (2011). Metonymy in word-formation. Cognitive Linguistics 22(2), 359–392.
Jawahar G., Sagot B. and Seddah D. (2019). What does BERT learn about the structure of language? In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics,
pp. 3651–3657.
Joshi M. and Penstein-Rosé C. (2009). Generalizing dependency features for opinion mining. In Proceedings of the ACL-
IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pp. 313–316.
Kambhatla N. (2004). Combining lexical, syntactic, and semantic features with maximum entropy models for extract-
ing relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, vol. 22. Association for
Computational Linguistics.
Kamei S.-i. and Wakao T. (1992). Metonymy: Reassessment, survey of acceptability, and its treatment in a machine transla-
tion system. In Proceedings of the 30th Annual Meeting on Association for Computational Linguistics (ACL’92). Association
for Computational Linguistics, pp. 309–311.
Kipf T.N. and Welling M. (2017). Semi-supervised classification with graph convolutional networks. In 5th International
Conference on Learning Representations.
Kövecses Z. and Radden G. (1998). Metonymy: Developing a cognitive linguistic view. Cognitive Linguistics 9(1), 37–78.
Lakoff G. (1987). Image metaphors. Metaphor and Symbol 2(3), 219–222.
Lakoff G. (1991). Metaphor and war: The metaphor system used to justify war in the Gulf. Peace Research 23(2/3), 25–32.
Lakoff G. (1993). The Contemporary Theory of Metaphor. In Ortony A (ed), Metaphor and Thought. Cambridge, UK:
Cambridge University Press, pp. 202–251.
Lakoff G. and Johnson M. (1980). Conceptual metaphor in everyday language. The Journal of Philosophy 77(8), 453–486.
Li D., Wei F., Tan C., Tang D. and Ke X. (2014). Adaptive recursive neural network for target-dependent twitter sentiment
classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers), pp. 49–54.
Li H., Vasardani M., Tomko M. and Baldwin T. (2020). Target word masking for location metonymy resolution. In
Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational
Linguistics, pp. 3696–3707.
Lin B.Y., Chen X., Chen J. and Ren X. (2019a). KagNet: Knowledge-aware graph networks for commonsense reasoning. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp. 2822–2832.
Lin C., Miller T., Dligach D., Bethard S. and Savova G. (2019b). A BERT-based universal model for both within-and cross-
sentence clinical temporal relation extraction. In Proceedings of the 2nd Clinical Natural Language Processing Workshop,
pp. 65–71.
Liu Y., Wei F., Li S., Ji H., Zhou M. and Wang H. (2015). A dependency-based neural network for relation classifica-
tion. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics,
pp. 285–290.
Luong T., Pham H. and Manning C.D. (2015). Effective approaches to attention-based neural machine translation. In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal. Association
for Computational Linguistics, pp. 1412–1421.
Markert K. and Nissim M. (2002). Metonymy resolution as a classification task. In Proceedings of the 2002 Conference
on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics,
pp. 204–213.
Markert K. and Nissim M. (2007). SemEval-2007 task 08: Metonymy resolution at SemEval-2007. In Proceedings of the 4th
International Workshop on Semantic Evaluations. Association for Computational Linguistics, pp. 36–41.
Markert K. and Nissim M. (2009). Data and models for metonymy resolution. Lang Resources & Evaluation 43, 123–138.
Mathews K.A. and Strube M. (2020). A large harvested corpus of location metonymy. In Proceedings of the 12th Language
Resources and Evaluation Conference, Marseille, France. European Language Resources Association, pp. 5678–5687.
Mesnil G., He X., Deng L. and Bengio Y. (2013). Investigation of recurrent-neural-network architectures and learn-
ing methods for spoken language understanding. In INTERSPEECH 2013, 14th Annual Conference of the International
Speech Communication Association, Lyon, France, August 25–29, 2013. International Speech Communication Association,
pp. 3771–3775.
Mihaylov T. and Frank A. (2018). Knowledgeable reader: Enhancing cloze-style reading comprehension with external com-
monsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers). Association for Computational Linguistics, pp. 821–832.
Mikolov T., Sutskever I., Chen K., Corrado G.S. and Dean J. (2013). Distributed representations of words and phrases and
their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
Miwa M. and Bansal M. (2016). End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for
Computational Linguistics, pp. 1105–1116.
Monteiro B.R., Davis C.A. and Fonseca F. (2016). A survey on the geographic scope of textual documents. Computers &
Geosciences 96, 23–34.
Nastase V., Judea A., Markert K. and Strube M. (2012). Local and global context for supervised and unsupervised
metonymy resolution. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, pp. 183–193.
Nastase V. and Strube M. (2009). Combining collocations, lexical and encyclopedic knowledge for metonymy resolution. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 910–918.
Nastase V. and Strube M. (2013). Transforming wikipedia into a large scale multilingual concept network. Artificial
Intelligence 194, 62–85.
Nissim M. and Markert K. (2003). Syntactic features and word similarity for supervised metonymy resolution. In Proceedings
of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics,
pp. 56–63.
Peng N., Poon H., Quirk C., Toutanova K. and Yih W.-t. (2017). Cross-sentence n-ary relation extraction with graph
LSTMs. Transactions of the Association for Computational Linguistics 5, 101–115.
Pennington J., Socher R. and Manning C.D. (2014). GloVe: Global vectors for word representation. In Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics,
pp. 1532–1543.
Peters M., Neumann M., Iyyer M., Gardner M., Clark C., Lee K. and Zettlemoyer L. (2018). Deep contextualized word rep-
resentations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics,
pp. 2227–2237.
Peters M.E., Ammar W., Bhagavatula C. and Power R. (2017). Semi-supervised sequence tagging with bidirectional lan-
guage models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 Volume
1: Long Papers. Association for Computational Linguistics, pp. 1756–1765.
Piñango M.M., Zhang M., Foster-Hanson E., Negishi M., Lacadie C. and Constable R.T. (2017). Metonymy as refer-
ential dependency: Psycholinguistic and neurolinguistic arguments for a unified linguistic treatment. Cognitive Science
41(S2), 351–378.
Pustejovsky J. (1991). The generative lexicon. Computational Linguistics 17(4), 409–441.
Qu C., Yang L., Qiu M., Croft W.B., Zhang Y. and Iyyer M. (2019). BERT with history answer embedding for conversational
question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 1133–1136.
Rajpurkar P., Zhang J., Lopyrev K. and Liang P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, pp. 2383–2392.
Shibata T., Kawahara D. and Kurohashi S. (2016). Neural network-based model for Japanese predicate argument struc-
ture analysis. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). Association for Computational Linguistics, pp. 1235–1244.
Si C., Chen W., Wang W., Wang L. and Tan T. (2019). An attention enhanced graph convolutional LSTM network for
skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1227–1236.
Socher R., Perelygin A., Wu J., Chuang J., Manning C.D., Ng A.Y. and Potts C. (2013). Recursive deep models for seman-
tic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, pp. 1631–1642.
Sun C., Huang L. and Qiu X. (2019). Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary
sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics,
pp. 380–385.
Sundermeyer M., Schlüter R. and Ney H. (2012). LSTM neural networks for language modeling. In Thirteenth Annual
Conference of the International Speech Communication Association. International Speech Communication Association,
pp. 194–198.
Tang G., Müller M., Gonzales A.R. and Sennrich R. (2018). Why self-attention? A targeted evaluation of neural machine
translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, pp. 4263–4272.
Tian Y., Chen G., Song Y. and Wan X. (2021). Dependency-driven relation extraction with attentive graph convolu-
tional networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for
Computational Linguistics, pp. 4458–4471.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł. and Polosukhin I. (2017). Attention is
all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Wu S. and He Y. (2019). Enriching pre-trained language model with entity information for relation classification. In
Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2361–2364.
Xu Y., Mou L., Li G., Chen Y., Peng H. and Jin Z. (2015). Classifying relations via long short term memory networks along
shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, pp. 1785–1794.
Yang A., Wang Q., Liu J., Liu K., Lyu Y., Wu H., She Q. and Li S. (2019). Enhancing pre-trained language representations
with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Association for Computational Linguistics, pp. 2346–2357.
Zarcone A., Utt J. and Padó S. (2012). Modeling covert event retrieval in logical metonymy: Probabilistic and distribu-
tional accounts. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012),
Montréal, Canada. Association for Computational Linguistics, pp. 70–79.
Zhang M., Zhang J. and Su J. (2006). Exploring syntactic features for relation extraction using a convolution tree ker-
nel. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. Association for
Computational Linguistics, pp. 288–295.
Zhang Y., Qi P. and Manning C.D. (2018). Graph convolution over pruned dependency trees improves relation extraction.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, pp. 2205–2215.
Zhang Z., Han X., Liu Z., Jiang X., Sun M. and Liu Q. (2019). ERNIE: Enhanced language representation with informa-
tive entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for
Computational Linguistics, pp. 1441–1451.
Figure A.1. Comparison of F1 scores w.r.t. training data size on the RELOCAR dataset. We train the model with different
percentages {20, 40, 60, 80, 100} of the training dataset.
substantially outperforms E-BERT, and the performance gap between E-BERT+AGCN and E-
BERT is always larger than 0.4%. The observation suggests that the E-BERT+AGCN model has
better generalisation than E-BERT, especially for small datasets.
Cite this article: Wang H, Du S, Zheng X and Meng L (2023). An empirical study of incorporating syntactic con-
straints into BERT-based location metonymy resolution. Natural Language Engineering 29, 669–692. https://doi.org/
10.1017/S135132492200033X