Towards Understanding Large-Scale Discourse Structures in Pre-Trained and Fine-Tuned Language Models


Patrick Huber and Giuseppe Carenini
Department of Computer Science
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
{huberpat, carenini}@cs.ubc.ca

Abstract

With a growing number of BERTology works analyzing different components of pre-trained language models, we extend this line of research through an in-depth analysis of discourse information in pre-trained and fine-tuned language models. We move beyond prior work along three dimensions: First, we describe a novel approach to infer discourse structures from arbitrarily long documents. Second, we propose a new type of analysis to explore where and how accurately intrinsic discourse is captured in the BERT and BART models. Finally, we assess how similar the generated structures are to a variety of baselines as well as their distributions within and between models.

1 Introduction

Transformer-based machine learning models are an integral part of many recent improvements in Natural Language Processing (NLP). With their rise spearheaded by Vaswani et al. (2017), the pre-training/fine-tuning paradigm has gradually replaced previous approaches based on architecture engineering, with transformer models such as BERT (Devlin et al., 2019), BART (Lewis et al., 2020), RoBERTa (Liu et al., 2019) and others delivering state-of-the-art performance on a wide variety of tasks. Besides their strong empirical results on most real-world problems, such as summarization (Zhang et al., 2020; Xiao et al., 2021a), question answering (Joshi et al., 2020; Oğuz et al., 2021) and sentiment analysis (Adhikari et al., 2019; Yang et al., 2019), uncovering what kind of linguistic knowledge is captured by this new type of pre-trained language model (PLM) has become a prominent question in itself. As part of this line of research, called BERTology (Rogers et al., 2020), researchers explore the amount of linguistic understanding encapsulated in PLMs, exposed through either external probing tasks (Raganato and Tiedemann, 2018; Zhu et al., 2020; Koto et al., 2021a) or unsupervised methods (Wu et al., 2020; Pandia et al., 2021). Previous work thereby focuses either on analyzing syntactic structures (e.g., Hewitt and Manning (2019); Wu et al. (2020)), relations (Papanikolaou et al., 2019), ontologies (Michael et al., 2020) or, to a more limited extent, discourse-related behaviour (Zhu et al., 2020; Koto et al., 2021a; Pandia et al., 2021).

Generally speaking, while most previous BERTology work has focused on either sentence-level phenomena or connections between adjacent sentences, large-scale semantic and pragmatic structures (oftentimes represented as discourse trees or graphs) have been less explored. These structures (e.g., discourse trees) play a fundamental role in expressing the intent of multi-sentential documents and, not surprisingly, have been shown to benefit many NLP tasks such as summarization (Gerani et al., 2019), sentiment analysis (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015) and text classification (Ji and Smith, 2017).

With multiple different theories of discourse proposed in the past, the RST discourse theory (Mann and Thompson, 1988) and the lexicalized discourse grammar (Webber et al., 2003) (underlying PDTB (Prasad et al., 2008)) have received the most attention. While both theories propose tree-like structures, the PDTB framework postulates partial trees up to the between-sentence level, while RST-style discourse structures consist of a single rooted tree covering whole documents, comprising: (1) the tree structure, combining clause-like sentence fragments (Elementary Discourse Units, short: EDUs) into a discourse constituency tree, (2) nuclearity, assigning every tree branch primary (Nucleus) or peripheral (Satellite) importance in its local context, and (3) relations, defining the type of connection holding between siblings in the tree.

Given the importance of large-scale discourse structures, we extend the area of BERTology research with novel insights regarding the amount of intrinsic discourse information captured in established PLMs.
More specifically, we aim to better understand to what extent RST-style discourse information is stored as latent trees in encoder self-attention matrices.¹ While we focus on the RST formalism in this work, our presented methods are theory-agnostic and, hence, applicable to discourse structures in a broader sense, including other tree-based theories, such as the lexicalized discourse grammar. Our contributions in this paper are:

(1) A novel approach to extract discourse information from arbitrarily long documents with standard transformer models, inherently limited by their input size. This is a non-trivial issue, which has been mostly by-passed in previous work through the use of proxy tasks like connective prediction, relation classification, sentence ordering, EDU segmentation, cloze story tests and others.

(2) An exploration of discourse information locality across pre-trained and fine-tuned language models, finding that discourse structures are consistently captured in a fixed subset of self-attention heads.

(3) An in-depth analysis of the discourse quality in pre-trained language models and their fine-tuned extensions. We compare constituency and dependency structures of 2 PLMs fine-tuned on 4 tasks and 7 fine-tuning datasets to gold-standard discourse trees, finding that the captured discourse structures outperform simple baselines by a large margin, even showing superior performance compared to distantly supervised models.

(4) A similarity analysis between PLM-inferred discourse trees and supervised, distantly supervised and simple baselines. We reveal that PLM constituency discourse trees align relatively well with previously proposed supervised models, but also capture complementary information.

(5) A detailed look at information redundancy in self-attention heads to better understand the structural overlap between self-attention matrices and models. Our results indicate that similar discourse information is consistently captured in the same heads, even across fine-tuning tasks.

¹ Please note that we focus on discourse structure and nuclearity here, leaving relation classification for future work.

2 Related Work

At the base of our work are two of the most popular and frequently used PLMs: BERT (Devlin et al., 2019) and BART (Lewis et al., 2020). We choose these two popular approaches in our study due to their complementary nature (encoder-only vs. encoder-decoder) and based on previous work by Zhu et al. (2020) and Koto et al. (2021a), showing the effectiveness of BERT and BART models for discourse-related tasks.

Our work is further related to the field of discourse parsing. With a rich history of traditional machine learning models (e.g., Hernault et al. (2010); Ji and Eisenstein (2014); Joty et al. (2015); Wang et al. (2017), inter alia), recent approaches have slowly shifted to successfully incorporate a variety of PLMs into the process of discourse prediction, such as ELMo embeddings (Kobayashi et al., 2019), XLNet (Nguyen et al., 2021), BERT (Koto et al., 2021b), RoBERTa (Guz et al., 2020) and SpanBERT (Guz and Carenini, 2020). Despite these works showing the usefulness of PLMs for discourse parsing, all of them cast the task as a "local" problem, using only partial information through the shift-reduce framework (Guz et al., 2020; Guz and Carenini, 2020), natural document breaks (e.g., paragraphs, Kobayashi et al. (2020)) or by framing the task as an inter-EDU sequence labelling problem on partial documents (Koto et al., 2021b). However, we believe that the true benefit of discourse information emerges when complete documents are considered, leading us to propose a new approach to connect PLMs and discourse structures in a "global" manner, superseding the local proxy-tasks with a new methodology to explore arbitrarily long documents.

Aiming to better understand what information is captured in PLMs, the line of BERTology research has recently emerged (Rogers et al., 2020), with early work mostly focusing on the syntactic capacity of PLMs (Hewitt and Manning, 2019; Jawahar et al., 2019; Kim et al., 2020), in parts also exploring the internal workings of transformer-based models (e.g., self-attention matrices (Raganato and Tiedemann, 2018; Mareček and Rosa, 2019)). More recent work started to explore the alignment of PLMs with discourse information, encoding semantic and pragmatic knowledge. Along those lines, Wu et al. (2020) present a parameter-free probing task for both syntax and discourse. With their tree inference approach being computationally expensive and limited to the exploration of the outputs of the BERT model, we significantly extend this line of research by exploring the internal self-attention matrices of PLMs with a more computationally feasible approach.
More traditionally, Zhu et al. (2020) use 24 hand-crafted rhetorical features to execute three different supervised probing tasks, showing promising performance of the BERT model. Similarly, Pandia et al. (2021) aim to infer pragmatics through the prediction of discourse connectives by analyzing the model inputs and outputs, and Koto et al. (2021a) analyze discourse in seven PLMs through seven supervised probing tasks, finding that BART and BERT contain the most information related to discourse. In contrast to the approach taken by both Zhu et al. (2020) and Koto et al. (2021a), we use an unsupervised methodology to test the amount of discourse information stored in PLMs (which can also conveniently be used to infer discourse structures for new and unseen documents) and extend the work by Pandia et al. (2021) by taking a closer look at the internal workings of the self-attention component. Looking at prior work analyzing the amount of discourse information in PLMs, structures are solely explored through the use of proxy tasks, such as connective prediction (Pandia et al., 2021), relation classification (Kurfalı and Östling, 2021), and others (Koto et al., 2021a). However, despite the difficulties of encoding arbitrarily long documents, we believe that to systematically explore the relationship between PLMs and discourse, considering complete documents is imperative. Along these lines, recent work has started to tackle the inherent input-length limitation of general transformer models through additional recurrence in the Transformer-XL model (Dai et al., 2019), compression modules (Rae et al., 2020) or sparse patterns (e.g., as in the Reformer (Kitaev et al., 2020), BigBird (Zaheer et al., 2020), and Longformer (Beltagy et al., 2020) models). While all these approaches to extend the maximum document length of transformer-based models are important to create more globally inspired models, the document-length limitation is still practically and theoretically in place, with models being limited to a fixed number of pre-defined tokens. Furthermore, with many proposed systems still based on more established PLMs (e.g., BERT) and with no single dominant solution for the general problem of the input-length limitation yet, we believe that, even with the restriction being actively tackled, an in-depth analysis of traditional PLMs with discourse is highly valuable to establish a solid understanding of the amount of semantic and pragmatic information captured.

Besides the described BERTology work, we are encouraged to explore fine-tuned extensions of standard PLMs by previous work showing the benefit of discourse parsing for many downstream tasks, such as summarization (Gerani et al., 2019), sentiment analysis (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015) and text classification (Ji and Smith, 2017). Conversely, we recently showed promising results when inferring discourse structures from related downstream tasks, such as sentiment analysis (Huber and Carenini, 2020) and summarization (Xiao et al., 2021b). Given this bidirectional synergy between discourse and the mentioned downstream tasks, we move beyond traditional experiments focusing on standard PLMs and additionally explore discourse structures of PLMs fine-tuned on a variety of auxiliary tasks.

3 Discourse Extraction Method

With PLMs rather well analyzed with respect to their syntactic capabilities, large-scale discourse structures have been less explored. One reason for this is the input-length constraint of transformer models.
While this is generally not prohibitive for intra-sentence syntactic structures (e.g., as presented in Wu et al. (2020)), it does heavily influence large-scale discourse structures, operating on complete (potentially long) documents. Overcoming this limitation is non-trivial, since traditional transformer-based models only allow for fixed, short inputs.

Aiming to systematically explore the ability of PLMs to capture discourse, we investigate a novel way to effectively extract discourse structures from the self-attention component of the BERT and BART models. We thereby extend our previously proposed tree-generation methodology (Xiao et al., 2021b) to support the input-length constraints of standard PLMs using a sliding-window approach in combination with matrix frequency normalization and an EDU aggregation method. Figure 1 visualizes the complete process on a small-scale example with 3 EDUs and 7 sub-word embeddings.

Figure 1: Small-scale example of the discourse extraction approach. Purple=EDUs, green=sub-word embeddings, red=input slices of size t_max, orange=PLM, blue=self-attention values, grey-scale=frequency count.

The Tree Generation Procedure we previously proposed in Xiao et al. (2021b) explores a two-stage approach to obtain discourse structures from a transformer model, by-passing the input-length constraint. Using the intuition that the self-attention score between any two EDUs is an indicator of their semantic/pragmatic relatedness, influencing their distance in a projective discourse tree, they use the CKY dynamic programming approach (Jurafsky and Martin, 2014) to generate constituency trees based on the internal self-attention of the transformer model. To generate dependency trees, we apply the same intuition used to infer discourse trees with the Eisner algorithm (Eisner, 1996). Since we explore the discourse information captured in standard PLMs, we cannot directly transfer our two-stage approach in Xiao et al. (2021b), which first encodes individual EDUs using BERT and subsequently feeds the dense representations into a fixed-size transformer model. Instead, we propose a new method to overcome the length limitation of the transformer model.²

² For more information on the general tree-generation approach using the Eisner algorithm we refer interested readers to Xiao et al. (2021b).
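To make the CKY step above concrete, the following is a minimal, illustrative sketch of how a binary constituency tree over EDUs could be derived from an EDU-level attention matrix with CKY-style dynamic programming. The split score used here (average bidirectional attention between the two candidate child spans) is our own simplifying assumption for illustration, not necessarily the exact objective of Xiao et al. (2021b); dependency trees would analogously be obtained with the Eisner algorithm over the same scores.

```python
import numpy as np

def cky_constituency_tree(attn):
    """Build an unlabeled binary constituency tree over n EDUs from an
    n x n EDU-level self-attention matrix via CKY-style dynamic programming.

    NOTE: the split score (average bidirectional attention between the two
    candidate child spans) is an illustrative assumption only."""
    n = attn.shape[0]
    sym = (attn + attn.T) / 2.0                 # bidirectional scores
    best = np.zeros((n, n))                     # best[i][j]: score of span i..j (inclusive)
    split = -np.ones((n, n), dtype=int)         # best split point for each span
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for k in range(i, j):               # split into i..k and k+1..j
                link = sym[i:k + 1, k + 1:j + 1].mean()
                score = best[i][k] + best[k + 1][j] + link
                if split[i][j] < 0 or score > best[i][j]:
                    best[i][j], split[i][j] = score, k

    def backtrack(i, j):
        if i == j:
            return i                            # leaf = single EDU index
        k = split[i][j]
        return (backtrack(i, k), backtrack(k + 1, j))

    return backtrack(0, n - 1)

# toy example with 4 EDUs and a random attention matrix
print(cky_constituency_tree(np.random.rand(4, 4)))
```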
The Sliding-Window Approach is at the core of our new methodology to overcome the input-length constraint. We first tokenize arbitrarily long documents with n EDUs E = {e_1, ..., e_n} into the respective sequence of m sub-word tokens T = {t_1, ..., t_m} with n ≪ m, according to the PLM tokenization method (WordPiece for BERT, Byte-Pair-Encoding for BART), as shown at the top of Figure 1. Using the sliding-window approach, we subdivide the m sub-word tokens into sequences of maximum input length t_max, defined by the PLM (t_max = 512 for BERT, t_max = 1024 for BART). Using a stride of 1, we generate (m − t_max) + 1 sliding windows W, feed them into the PLM, and extract the resulting t_max × t_max partial square self-attention matrices (M_P in Figure 1) for a specific self-attention head.³

³ We omit the self-attention indexes for better readability.

The Frequency Normalization Method allows us to combine the partially overlapping self-attention matrices M_P into a single document-level matrix M_D of size m × m. To this end, we combine multiple overlapping windows, generated due to the stride size of 1, by adding up the self-attention cells, while keeping track of the number of overlaps in a separate m × m frequency matrix M_F. We then divide M_D by the frequency matrix M_F to generate a frequency-normalized self-attention matrix M_A (see bottom of Figure 1).

The EDU Aggregation is the final processing step to obtain the document-level self-attention matrix. In this step, the m sub-word tokens T = {t_1, ..., t_m} are aggregated back into n EDUs E = {e_1, ..., e_n} by computing the average bidirectional self-attention score between any two EDUs in M_A. For example, in Figure 1, we aggregate the scores in cells M_A[0:1, 5:6] and M_A[5:6, 0:1] to compute the final output of cell [0, 2] (purple matrix in Figure 1). This way, we obtain the average bidirectional self-attention score between EDU_1 and EDU_3. We use the resulting n × n matrix as the input to the CKY/Eisner discourse tree generation methods.
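The following sketch illustrates the three steps just described (sliding windows with stride 1, frequency normalization, and EDU aggregation) in plain numpy. It is a simplified reading of the method, not the authors' released code: `window_attn` stands in for a call to the PLM with attention outputs enabled, special tokens are ignored, and the toy sizes are chosen only to keep the example runnable.

```python
import numpy as np

def document_attention(window_attn, m, t_max):
    """Frequency-normalized document-level self-attention matrix M_A (m x m).

    `window_attn(s)` is assumed to return the t_max x t_max self-attention
    matrix of one chosen head for the sub-word window starting at position s
    (in practice obtained from the PLM with attention outputs enabled)."""
    m_d = np.zeros((m, m))             # accumulated attention (M_D)
    m_f = np.zeros((m, m))             # overlap counts (M_F)
    for s in range(m - t_max + 1):     # stride 1 -> (m - t_max) + 1 windows
        a = window_attn(s)             # partial matrix M_P for this window
        m_d[s:s + t_max, s:s + t_max] += a
        m_f[s:s + t_max, s:s + t_max] += 1.0
    return np.divide(m_d, m_f, out=np.zeros_like(m_d), where=m_f > 0)  # M_A

def edu_aggregate(m_a, edu_spans):
    """Average bidirectional sub-word attention between every EDU pair,
    yielding the n x n matrix fed to the CKY/Eisner tree builders.
    `edu_spans` maps each EDU to its (start, end) token range (end exclusive)."""
    n = len(edu_spans)
    agg = np.zeros((n, n))
    for i, (si, ei) in enumerate(edu_spans):
        for j, (sj, ej) in enumerate(edu_spans):
            block = np.concatenate([m_a[si:ei, sj:ej].ravel(),
                                    m_a[sj:ej, si:ei].ravel()])
            agg[i, j] = block.mean()
    return agg

# toy usage: 12 sub-word tokens, window size 8, 3 EDUs of 4 tokens each
rng = np.random.default_rng(0)
m, t_max = 12, 8
m_a = document_attention(lambda s: rng.random((t_max, t_max)), m, t_max)
edu_matrix = edu_aggregate(m_a, [(0, 4), (4, 8), (8, 12)])
print(edu_matrix.shape)   # (3, 3)
```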
4 Experimental Setup

4.1 Pre-Trained Models
We select the BERT-base (110 million parameters) and BART-large (406 million parameters) models for our experiments. We choose these models for their diverse objectives (encoder-only vs. encoder-decoder), their popularity for diverse fine-tuning tasks, and their prior successful exploration in regards to discourse information (Zhu et al., 2020; Koto et al., 2021a). For the BART-large model, we limit our analysis to the encoder, as motivated in Koto et al. (2021a), leaving experiments with the decoder and cross-attention for future work.

4.2 Fine-Tuning Tasks and Datasets
We explore the BERT model fine-tuned on two classification tasks, namely sentiment analysis and natural language inference (NLI). For our analysis on BART, we select the abstractive summarization and question-answering tasks. Table 1 summarizes the 7 datasets used to fine-tune PLMs in this work, along with their underlying tasks and domains.⁴

⁴ We exclusively analyze published models provided on the huggingface platform, further specified in Appendix A.

Dataset          Task                  Domain
IMDB (2014)      Sentiment             Movie Reviews
Yelp (2015)      Sentiment             Reviews
SST-2 (2013)     Sentiment             Movie Reviews
MNLI (2018)      NLI                   Range of Genres
CNN-DM (2016)    Summarization         News
XSUM (2018)      Summarization         News
SQuAD (2016)     Question-Answering    Wikipedia

Table 1: The seven fine-tuning datasets used in this work along with the underlying tasks and domains.

4.3 Evaluation Treebanks
RST-DT (Carlson et al., 2002) is the largest English RST-style discourse treebank, containing 385 Wall-Street-Journal articles annotated with full constituency discourse trees. To generate additional dependency trees, we apply the conversion algorithm proposed in Li et al. (2014).
GUM (Zeldes, 2017) is a steadily growing treebank of richly annotated texts. In the current version 7.3, the dataset contains 168 documents from 12 genres, annotated with full RST-style constituency and dependency discourse trees.
All evaluations shown in this paper are executed on the 38 and 20 documents in the RST-DT and GUM test-sets, respectively, to be comparable with previous baselines and supervised models. A similarly-sized validation-set is used, where mentioned, to determine the best-performing self-attention head.

4.4 Baselines and Evaluation Metrics
Simple Baselines: We compare the inferred constituency trees against right- and left-branching structures. For dependency trees, we evaluate against simple chain and inverse-chain structures.
Distantly Supervised Baselines: We compare the results obtained in this paper against our previous approach presented in Xiao et al. (2021b), using similar CKY and Eisner tree-generation methods to infer constituency and dependency tree structures from a summarization model trained on the CNN-DM and New York Times (NYT) corpora (referred to as SumCNN-DM and SumNYT).⁵

⁵ www.github.com/Wendy-Xiao/summ_guided_disco_parser

Supervised Baseline: We select the popular Two-Stage discourse parser (Wang et al., 2017) as our supervised baseline, due to its strong performance, available model checkpoints and code⁶, as well as its traditional architecture. We use the published Two-Stage parser checkpoint on RST-DT (from here on called Two-StageRST-DT) and re-train the discourse parser on GUM (Two-StageGUM). We convert the generated constituency structures into dependency trees following Li et al. (2014).

⁶ www.github.com/yizhongw/StageDP

Evaluation Metrics: We apply the original parseval score to compare discourse constituency structures with gold-standard treebanks, as argued for in Morey et al. (2017). To evaluate the generated dependency structures, we use the Unlabeled Attachment Score (UAS).
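As a rough illustration of these two measures, the sketch below treats a constituency tree as a set of EDU spans and a dependency tree as an array of head indices. The official original-parseval implementation is more involved (e.g., micro-averaged over the full treebank), so this is only a hedged approximation with hypothetical function names.

```python
def span_score(pred_spans, gold_spans):
    """Unlabeled constituency overlap in the spirit of original parseval:
    fraction of gold spans (as (start, end) EDU index pairs) also predicted.
    Simplified single-document sketch of the corpus-level metric."""
    gold = set(gold_spans)
    return len(gold & set(pred_spans)) / len(gold)

def uas(pred_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of EDUs whose predicted head
    (parent index, -1 for the root) matches the gold dependency head."""
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)

# toy example with 4 EDUs
print(span_score([(0, 1), (0, 3), (2, 3)], [(0, 1), (2, 3), (0, 3)]))  # 1.0
print(uas([-1, 0, 1, 1], [-1, 0, 0, 1]))                               # 0.75
```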
5 Experimental Results

5.1 Discourse Locality
Our discourse tree generation approach described in Section 3 directly uses self-attention matrices to generate discourse trees.
The standard BERT model contains 144 such self-attention matrices (12 layers with 12 self-attention heads each), all of which potentially encode discourse structures. For the BART model, this number is even higher, consisting of 12 layers with 16 self-attention heads each. With prior work suggesting the locality of discourse information in PLMs (e.g., Raganato and Tiedemann (2018); Mareček and Rosa (2019); Xiao et al. (2021b)), we analyze every self-attention matrix individually to gain a better understanding of their alignment with discourse information.

Besides investigating standard PLMs, we also explore the robustness of discourse information across fine-tuning tasks. We believe that this is an important step to better understand whether the captured discourse information is general and robust, or whether it is "re-learned" from scratch for downstream tasks. To the best of our knowledge, no previous analysis of this kind has been performed in the literature.

To this end, Figure 2 shows the constituency and dependency structure overlap of the discourse trees generated from individual self-attention heads with the gold-standard tree structures of the GUM dataset.⁷ The heatmaps clearly show that constituency discourse structures are mostly captured in higher layers, while dependency structures are more evenly distributed across layers. Comparing the patterns between models, we find that, despite being fine-tuned on different downstream tasks, the discourse information is consistently encoded in the same self-attention heads. Even though the best-performing self-attention matrix is not consistent, discourse information is clearly captured in a "local" subset of self-attention heads across all presented fine-tuning tasks. This plausibly suggests that the discourse information in pre-trained BERT and BART models is robust and general, requiring only minor adjustments depending on the fine-tuning task.

⁷ The analysis on RST-DT shows similar trends and can be found in Appendix B.

Figure 2: Constituency (top) and dependency (bottom) discourse tree evaluation of BERT (a) and BART (b) models on GUM. Purple=high score, blue=low score. Left-to-right: self-attention heads; top-to-bottom: high layers to low layers. + indicates the fine-tuning dataset. (a) BERT: PLM, +IMDB, +Yelp, +SST-2, +MNLI. (b) BART: PLM, +CNN-DM, +XSUM, +SQuAD.

5.2 Discourse Quality
We now focus on assessing the discourse information captured in the single best-performing self-attention head. In Table 2, we compare the discourse structure quality of pre-trained and fine-tuned PLMs in the context of supervised models, distantly supervised approaches and simple baselines. We show the oracle-picked best head on the test-set, analyzing the upper bound for the potential performance of PLMs on RST-style discourse structures. This is not a realistic scenario, as the best-performing head is generally not known a priori. Hence, we also explore the performance when using a small-scale validation set to pick the best-performing self-attention matrix. In this more realistic scenario for discourse parsing, we find that scores on average drop by 1.55 points for BERT and 1.33% for BART compared to the oracle-picked performance of a single self-attention matrix. We show detailed results of this degradation in Appendix C.⁸ Our results in Table 2 are separated into three sub-tables, showing the results for BERT, BART and baseline models on the RST-DT and GUM treebanks, respectively. In the BERT and BART sub-tables, we further annotate each performance with ↑, •, ↓, indicating the relative performance to the standard pre-trained model as superior, equal, or inferior.

⁸ For a more detailed analysis of the min., mean, median and max. self-attention performances see Appendix D.

Model               RST-DT Span   RST-DT UAS   GUM Span   GUM UAS
BERT
rand. init          ↓ 25.5        ↓ 13.3       ↓ 23.2     ↓ 12.4
PLM                 • 35.7        • 45.3       • 33.0     • 45.2
+ IMDB              ↓ 35.4        ↓ 42.8       • 33.0     ↓ 43.3
+ Yelp              ↓ 34.7        ↓ 42.3       ↓ 32.6     ↓ 43.7
+ SST-2             ↓ 35.5        ↓ 42.9       ↓ 32.6     ↓ 43.5
+ MNLI              ↓ 34.8        ↓ 41.8       ↓ 32.4     ↓ 43.3
BART
rand. init          ↓ 25.3        ↓ 12.5       ↓ 23.2     ↓ 12.2
PLM                 • 39.1        • 41.7       • 31.8     • 41.8
+ CNN-DM            ↑ 40.9        ↑ 44.3       ↑ 32.7     ↑ 42.8
+ XSUM              ↑ 40.1        ↑ 41.9       ↑ 32.1     ↓ 39.9
+ SQuAD             ↑ 40.1        ↑ 43.2       ↓ 31.3     ↓ 40.7
Baselines
RB / Chain          9.3           40.4         9.4        41.7
LB / Chain-1        7.5           12.7         1.5        12.2
SumCNN-DM           21.4          20.5         17.6       15.8
SumNYT              24.0          15.7         18.2       12.6
Two-StageRST-DT     72.0          71.2         54.0       54.5
Two-StageGUM        65.4          61.7         58.6       56.7

Table 2: Original parseval (Span) and Unlabelled Attachment Score (UAS) of the single best-performing self-attention matrix of the BERT and BART models compared with baselines and previous work. ↑, •, ↓ indicate better, same, worse performance compared to the PLM. "rand. init"=randomly initialized transformer model of similar architecture as the PLM, RB=Right-Branching, LB=Left-Branching, Chain-1=Inverse chain.
Taking a look at the top sub-table (BERT), we find that, as expected, the randomly initialized transformer model achieves the worst performance. Fine-tuned models perform equal to or worse than the standard PLM. Despite the inferior results of the fine-tuned models, the drop is rather small, with the sentiment analysis models consistently outperforming NLI. This seems reasonable, given that the sentiment analysis objective is intuitively more aligned with discourse structures (e.g., long-form reviews with potentially complex rhetorical structures) than the between-sentence NLI task, which does not involve multi-sentential text.

In the center sub-table (BART), a different trend emerges. While the worst-performing model is still (as expected) the randomly initialized system, fine-tuned models mostly outperform the standard PLM. Interestingly, the model fine-tuned on the CNN-DM corpus consistently outperforms the BART baseline, while the XSUM model performs better on all but the GUM dependency structure evaluation. On one hand, the superior performance of both summarization models on the RST-DT dataset seems reasonable, given that the fine-tuning datasets and the evaluation treebank are both in the news domain. The strong results of the CNN-DM model on the GUM treebank, yet inferior performance of XSUM, potentially hint towards dependency discourse structures being less prominent when fine-tuning on the extreme summarization task, compared to the longer summaries in the CNN-DM corpus. The question-answering task, evaluated through the SQuAD fine-tuned model, underperforms the standard PLM on GUM; however, it reaches superior performance on RST-DT. Since the SQuAD corpus is a subset of Wikipedia articles, more aligned with news articles than the 12 genres in GUM, we believe the stronger performance on RST-DT (i.e., news articles) is again reasonable, yet shows weaker generalization capabilities across domains (i.e., on the GUM corpus). Interestingly, the question-answering task seems more aligned with dependency than constituency trees, in line with what would be expected from a factoid-style question-answering model, focusing on important entities rather than global constituency structures.

Directly comparing the BERT and BART models, the former performs better on three out of four metrics. At the same time, fine-tuning hurts the performance for BERT, however, it improves the BART models. Plausibly, these seemingly unintuitive results may be caused by the following co-occurring circumstances: (1) The inferior performance of BART can potentially be attributed to the decoder component capturing parts of the discourse structures, as well as the larger number of self-attention heads "diluting" the discourse information. (2) The different trends regarding fine-tuned models might be directly influenced by the input-length limitation of 512 (BERT) and 1024 (BART) sub-word tokens during the fine-tuning stage, hampering the ability to capture long-distance semantic and pragmatic relationships. This, in turn, limits the amount of discourse information captured, even for document-level datasets (e.g., Yelp, CNN-DM, SQuAD). With this restriction being more prominent in BERT, it potentially explains the comparably low performance of the fine-tuned models.

Finally, the bottom sub-table puts our results in the context of previously proposed supervised and distantly-supervised models, as well as simple baselines. Compared to simple right- and left-branching trees (Span), the PLM-based models reach clearly superior performance. Looking at the chain/inverse-chain structures (UAS), the improvements are generally lower; however, the vast majority still outperforms the baseline. Comparing the first two sub-tables against completely supervised methods (Two-StageRST-DT, Two-StageGUM), the BERT- and BART-based models are, unsurprisingly, inferior. Lastly, compared to the distantly supervised SumCNN-DM and SumNYT models, the PLM-based discourse performance shows clear improvements over the 6-layer, 8-head standard transformer.
5.3 Discourse Similarity
Further exploring what kind of discourse information is captured in the PLM self-attention matrices, we directly compare the emergent discourse structures with trees inferred from existing discourse parsers and with simple baselines. This way, we aim to better understand if the information encapsulated in PLMs is complementary to existing methods, or if the PLMs solely capture trivial discourse phenomena and simple biases (e.g., resemble right-branching constituency trees). Since the GUM dataset contains a more diverse set of test documents (12 genres) than the RST-DT corpus (exclusively news articles), we perform our experiments from here on only on the GUM treebank.

Figure 3 shows the micro-average structural overlap of discourse constituency (left) and dependency (right) trees between the PLM-generated discourse structures and existing methods, baselines, as well as gold-standard trees. Noticeably, the generated constituency trees (on the left) are most aligned with the structures predicted by supervised discourse parsers, showing only minimal overlap with simple structures (i.e., right- and left-branching trees). Taking a closer look at the generated dependency structures presented on the right side of Figure 3, the alignment between PLM-inferred discourse trees and the simple chain structure is predominant, suggesting a potential weakness in regards to the discourse exposed by the Eisner algorithm in the BERT and BART models. Not surprisingly, the highest overlap between PLM-generated trees and the chain structure occurs when fine-tuning on the CNN-DM dataset, well known to contain a strong lead bias (Xing et al., 2021).

Figure 3: PLM discourse constituency (left) and dependency (right) structure overlap with baselines and gold trees (e.g., BERT ↔ Two-Stage (RST-DT)) according to the original parseval and UAS metrics.

To better understand if the PLM-based constituency structures are complementary to existing, supervised discourse parsers, we further analyze the correctly predicted overlap. More specifically, we compute the intersection between PLM-generated structures and gold-standard trees, as well as between previously proposed models and the gold standard. Subsequently, we intersect the two resulting sets (e.g., BERT ∩ Gold Trees ↔ Two-Stage (RST-DT) ∩ Gold Trees). This way, we explore if the correctly predicted PLM discourse structures are a subset of the trees correctly predicted by supervised approaches, or if complementary discourse information is captured. We find that > 20% and > 16% of the correctly predicted constituency and dependency structures of our PLM discourse inference approach are not captured by supervised models, making the exploration of ensemble methods a promising future avenue. A detailed version of Fig. 3, as well as more specific results regarding the correctly predicted overlap of discourse structures, are shown in Appendix E.
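A small sketch of the two comparisons used in this subsection, under the simplifying assumption that every tree is represented as a set of EDU spans: the micro-average structural overlap between two systems, and the fraction of correctly predicted PLM spans that a second (e.g., supervised) parser does not also predict correctly. Function names and the toy data are ours.

```python
def micro_overlap(doc_spans_a, doc_spans_b):
    """Micro-average structural overlap between two systems: matching spans
    pooled over all documents, divided by the pooled span count of system B."""
    match = sum(len(set(a) & set(b)) for a, b in zip(doc_spans_a, doc_spans_b))
    total = sum(len(b) for b in doc_spans_b)
    return match / total

def complementary_fraction(pred_spans, other_spans, gold_spans):
    """Share of spans correctly predicted by the PLM-based approach that the
    other parser does not predict correctly:
    |(pred & gold) - (other & gold)| / |pred & gold|."""
    pred_correct = set(pred_spans) & set(gold_spans)
    other_correct = set(other_spans) & set(gold_spans)
    return len(pred_correct - other_correct) / max(len(pred_correct), 1)

# toy example on a single document with 4 gold spans
gold = [(0, 3), (0, 1), (2, 3), (4, 5)]
plm = [(0, 3), (2, 3), (3, 5)]
sup = [(0, 3), (0, 1), (4, 5)]
print(micro_overlap([plm], [gold]))            # 0.5: two of four gold spans matched
print(complementary_fraction(plm, sup, gold))  # 0.5: (2, 3) is correct for the PLM only
```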
5.4 Discourse Redundancy
Up to this point, our quantitative analysis of the ability of PLMs to capture discourse information has been limited to the single best-performing head. However, looking at individual models, the discourse performance distribution in Figure 2 suggests that a larger subset of self-attention heads performs similarly well (i.e., there are several dark purple cells in each heatmap). This leads to the interesting question whether the information captured in different, top-performing self-attention heads is redundant or complementary. Similarly, Figure 2 indicates that the same heads perform well across different fine-tuning tasks, leading to the question whether the discourse structures captured in a single self-attention matrix of different fine-tuned models are consistent, or vary depending on the underlying task. Hence, we take a detailed look at the similarity of model self-attention heads in regards to their alignment with discourse information and explore whether (1) the top-performing heads h_i, ..., h_k of a specific model m_m capture redundant discourse structures, and whether (2) a specific head h_i across different models m_m, ..., m_o captures similar discourse information. Specifically, we pick the top 10 best-performing self-attention matrices of each model, remove self-attention heads that do not appear in at least two models (since no comparisons can be made), and compare the generated discourse structures in a nested aggregation approach.

Figure 4 shows a small-scale example of our nested visualization methodology. For the self-attention head-aligned approach (Figure 4 (a)), high similarity values (calculated as the micro-average structural overlap) along the diagonal (grey cells) would be expected if the same head h_i encodes consistent discourse information across different fine-tuning tasks and datasets. Inversely, the model-aligned matrix (Figure 4 (b)) should show high values along the diagonal if different heads h_i, ..., h_k in the same model m_k capture redundant discourse information. Besides the visual inspection methodology presented in Figure 4, we also compare aggregated similarities between the same head (=Head) against different heads (≠Head) and between the same model (=Model) against different models (≠Model) (i.e., grey cells (=) and white cells (≠) in Figure 4 (a) and (b)). In order to assess the statistical significance of the resulting differences in the underlying distributions, we compute a two-sided, independent t-test between same/different models and same/different heads.⁹

⁹ Prior to running the t-test we confirm similar variance and the assumption of normal distribution (Shapiro-Wilk test).

Figure 4: Nested aggregation approach for discourse similarity. (a) Head-aligned: grey cells contain the same head, white cells indicate different heads. (b) Model-aligned: grey cells contain the same model, white cells indicate different models. Column indices equal row indices.

The resulting redundancy evaluations for BERT are presented in Figure 5.¹⁰ It appears that the same self-attention heads h_i consistently encode similar discourse information across models, indicated by: (1) high similarities (yellow) along the diagonal in heatmaps I&III and (2) the statistically significant difference in distributions at the bottom of Figure 5 (a) and (b). However, different self-attention heads h_i, ..., h_k of the same model m_m encode different discourse information (heatmaps II&IV). While the trend is stronger for constituency tree structures, there is a single dependency self-attention head which does generally not align well between models and heads (purple line in heatmap III). Plausibly, this specific self-attention head encodes fine-tuning-task-specific discourse information, making it a prime candidate for further investigations in future work. Furthermore, the similarity patterns observed in Figure 5 (a) and (b) point towards an opportunity to combine model self-attention heads in future work to improve the discourse inference performance compared to the scores shown in Table 2, where each self-attention head was assessed individually.

¹⁰ Evaluations for BART can be found in Appendix F.

Figure 5: BERT self-attention similarities on GUM. (a) Constituency Similarity, (b) Dependency Similarity. Top: visual analysis of head-aligned (I&III) and model-aligned (II&IV) heatmaps. Yellow=high structural overlap, purple=low structural overlap. Bottom: aggregated similarity of same heads, same models, different heads and different models, showing the min, max and quartiles of the underlying distribution. *Significantly better than the respective ≠Head/≠Model performance with p-value < 0.05.

6 Conclusions

In this paper, we extend the line of BERTology work by focusing on the important, yet less explored, alignment of pre-trained and fine-tuned PLMs with large-scale discourse structures. We propose a novel approach to infer discourse information for arbitrarily long documents. In our experiments, we find that the captured discourse information is consistently local and general, even across a collection of fine-tuning tasks. We compare the inferred discourse trees with supervised, distantly supervised and simple baselines to explore the structural overlap, finding that constituency discourse trees align well with supervised models, however, they also contain complementary discourse information. Lastly, we individually explore self-attention matrices to analyze the information redundancy. We find that similar discourse information is consistently captured in the same heads.

In the future, we intend to explore additional discourse inference strategies based on the insights we gained in this analysis. Specifically, we want to explore more sophisticated methods to extract a single discourse tree from multiple self-attention matrices, rather than only using the single best-performing head. Further, we want to investigate the relationship between supervised discourse parsers and PLM-generated discourse trees and, more long-term, we plan to analyze PLMs with enhanced input-length limitations.
Acknowledgements

We thank the anonymous reviewers and the UBC NLP group for their insightful comments and suggestions. This research was supported by the Language & Speech Innovation Lab of Cloud BU, Huawei Technologies Co., Ltd and the Natural Sciences and Engineering Research Council of Canada (NSERC). We thank the Natural Sciences and Engineering Research Council of Canada (NSERC/CRSNG) for its support.

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Docbert: Bert for document classification. arXiv preprint arXiv:1904.08398.
Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218, Lisbon, Portugal. Association for Computational Linguistics.
Lynn Carlson, Mary Ellen Okurowski, and Daniel Marcu. 2002. RST discourse treebank. Linguistic Data Consortium, University of Pennsylvania.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 193–202, New York, NY, USA. Association for Computing Machinery.
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
Shima Gerani, Giuseppe Carenini, and Raymond T. Ng. 2019. Modeling content and structure for abstractive review summarization. Computer Speech & Language, 53:302–331.
Grigorii Guz and Giuseppe Carenini. 2020. Coreference for discourse parsing: A neural approach. In Proceedings of the First Workshop on Computational Approaches to Discourse, pages 160–167, Online. Association for Computational Linguistics.
Grigorii Guz, Patrick Huber, and Giuseppe Carenini. 2020. Unleashing the power of neural discourse parsers - a context and structure aware approach using large scale pretraining. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3794–3805, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. 2010. Hilda: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3).
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
Alexander Hogenboom, Flavius Frasincar, Franciska de Jong, and Uzay Kaymak. 2015. Using rhetorical structure in sentiment analysis. Commun. ACM, 58(7):69–77.
Patrick Huber and Giuseppe Carenini. 2020. MEGA RST discourse treebanks with structure and nuclearity from scalable distant sentiment supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7442–7457, Online. Association for Computational Linguistics.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.
Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24, Baltimore, Maryland. Association for Computational Linguistics.
Yangfeng Ji and Noah A. Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 996–1005, Vancouver, Canada. Association for Computational Linguistics.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.
Dan Jurafsky and James H Martin. 2014. Speech and language processing, volume 3. Pearson London.
Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In International Conference on Learning Representations.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.
Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2020. Top-down RST parsing utilizing granularity levels in documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8099–8106.
Naoki Kobayashi, Tsutomu Hirao, Kengo Nakamura, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5797–5802, Hong Kong, China. Association for Computational Linguistics.
Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2021a. Discourse probing of pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3849–3864, Online. Association for Computational Linguistics.
Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2021b. Top-down discourse parsing via sequence labelling. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 715–726, Online. Association for Computational Linguistics.
Murathan Kurfalı and Robert Östling. 2021. Probing multilingual language models for discourse. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 8–19, Online. Association for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie Li. 2014. Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25–35, Baltimore, Maryland. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.
David Mareček and Rudolf Rosa. 2019. From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 263–275, Florence, Italy. Association for Computational Linguistics.
Julian Michael, Jan A. Botha, and Ian Tenney. 2020. Asking without telling: Exploring latent ontologies in contextual representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6792–6812, Online. Association for Computational Linguistics.
Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1319–1324, Copenhagen, Denmark. Association for Computational Linguistics.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
Bita Nejat, Giuseppe Carenini, and Raymond Ng. 2017. Exploring joint neural model for sentence level discourse parsing and sentiment analysis. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 289–298, Saarbrücken, Germany. Association for Computational Linguistics.
Thanh-Tung Nguyen, Xuan-Phi Nguyen, Shafiq Joty, and Xiaoli Li. 2021. RST parsing from scratch. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1613–1625, Online. Association for Computational Linguistics.
Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, et al. 2021. Domain-matched pre-training tasks for dense retrieval. arXiv preprint arXiv:2107.13602.
Lalchand Pandia, Yan Cong, and Allyson Ettinger. 2021. Pragmatic competence of pre-trained language models through the lens of discourse connectives. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 367–379, Online. Association for Computational Linguistics.
Yannis Papanikolaou, Ian Roberts, and Andrea Pierleoni. 2019. Deep bidirectional transformers for relation extraction without supervision. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 67–75, Hong Kong, China. Association for Computational Linguistics.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.
Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Yizhong Wang, Sujian Li, and Houfeng Wang. 2017. A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 184–188, Vancouver, Canada. Association for Computational Linguistics.
Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics, 29(4):545–587.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.
Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2021a. Primer: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499.
Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2021b. Predicting discourse trees from transformer-based neural summarizers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4139–4152, Online. Association for Computational Linguistics.
Linzi Xing, Wen Xiao, and Giuseppe Carenini. 2021. Demoting the lead bias in news summarization via alternating adversarial learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 948–954, Online. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Manzil Zaheer, Guru Prashanth Guruganesh, Avi Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Minh Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Mahmoud El Houssieny Ahmed. 2020. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems.
Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 649–657, Cambridge, MA, USA. MIT Press.
Zining Zhu, Chuer Pan, Mohamed Abdalla, and Frank Rudzicz. 2020. Examining the rhetorical capacities of neural language models. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 16–32, Online. Association for Computational Linguistics.
A Huggingface Models
We investigate 7 fine-tuned BERT and BART models from the huggingface model library, as well as the
two pre-trained models. The model names and links are provided in Table 3.

Pre-Trained Fine-Tuned Link


BERT-base – https://huggingface.co/bert-base-uncased
BERT-base IMDB https://huggingface.co/textattack/bert-base-uncased-imdb
BERT-base Yelp https://huggingface.co/fabriceyhc/bert-base-uncased-yelp_polarity
BERT-base SST-2 https://huggingface.co/textattack/bert-base-uncased-SST-2
BERT-base MNLI https://huggingface.co/textattack/bert-base-uncased-MNLI
BART-large – https://huggingface.co/facebook/bart-large
BART-large CNN-DM https://huggingface.co/facebook/bart-large-cnn
BART-large XSUM https://huggingface.co/facebook/bart-large-xsum
BART-large SQuAD https://huggingface.co/valhalla/bart-large-finetuned-squadv1

Table 3: Huggingface pre-trained and fine-tuned model links.
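For reference, the following sketch shows one way to load a checkpoint from Table 3 with the huggingface transformers library and access its per-head self-attention matrices, as needed for the analyses in this paper. The chosen checkpoint, layer and head indices are placeholders, not the specific settings used by the authors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any checkpoint from Table 3 can be substituted here.
name = "textattack/bert-base-uncased-imdb"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)
model.eval()

text = "The movie starts slowly. However, the ending is fantastic."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
# For the BART checkpoints, the encoder self-attentions are instead
# available under outputs.encoder_attentions.
layer, head = 10, 3                      # illustrative indices only
attn = outputs.attentions[layer][0, head]
print(attn.shape)                        # (seq_len, seq_len)
```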

B Test-Set Results on RST-DT and GUM

(a) BERT: PLM, +IMDB, +Yelp, +MNLI, +SST-2

(b) BART: PLM, +CNN-DM, +XSUM, +SQuAD

Figure 6: Constituency (top) and dependency (bottom) discourse tree evaluation of BERT (a) and BART (b) models
on RST-DT (test). Purple=high score, blue=low score. + indicates fine-tuning dataset.

(a) BERT: PLM, +IMDB, +Yelp, +MNLI, +SST-2

(b) BART: PLM, +CNN-DM, +XSUM, +SQuAD

Figure 7: Constituency (top) and dependency (bottom) discourse tree evaluation of BERT (a) and BART (b) models
on GUM (test). Purple=high score, blue=low score. + indicates fine-tuning dataset.

C Oracle-picked self-attention head compared to validation-picked matrix

Model               RST-DT Span    RST-DT UAS     GUM Span       GUM UAS
BERT
rand. init 25.5 (-0.0) 13.3 (-0.0) 23.2 (-0.0) 12.4 (-0.0)
PLM 35.7 (-1.6) 45.3 (-4.9) 33.0 (-0.4) 45.2 (-0.0)
+ IMDB 35.4 (-1.8) 42.8 (-2.4) 33.0 (-3.8) 43.3 (-0.1)
+ Yelp 34.7 (-1.0) 42.3 (-1.9) 32.6 (-3.6) 43.7 (-0.0)
+ SST-2 35.5 (-1.9) 42.9 (-2.5) 32.6 (-0.3) 43.5 (-0.9)
+ MNLI 34.8 (-1.7) 41.8 (-1.4) 32.4 (-0.3) 43.3 (-0.5)
BART
rand. init 25.3 (-0.0) 12.5 (-0.0) 23.2 (-0.0) 12.2 (-0.0)
PLM 39.1 (-0.4) 41.7 (-2.7) 31.8 (-0.3) 41.8 (-0.0)
+ CNN-DM 40.9 (-0.0) 44.3 (-4.0) 32.7 (-0.3) 42.8 (-0.7)
+ XSUM 40.1 (-0.9) 41.9 (-3.4) 32.1 (-1.7) 39.9 (-0.0)
+ SQuAD 40.1 (-0.0) 43.2 (-4.6) 31.3 (-2.1) 40.7 (-0.1)
Baselines
Right-Branch/Chain 9.3 40.4 9.4 41.7
Left-Branch/Chain-1 7.5 12.7 1.5 12.2
SumCNN-DM (2021b) 21.4 20.5 17.6 15.8
SumNYT (2021b) 24.0 15.7 18.2 12.6
Two-StageRST-DT (2017) 72.0 71.2 54.0 54.5
Two-StageGUM 65.4 61.7 58.6 56.7

Table 4: Original parseval (Span) and Unlabelled Attachment Score (UAS) of the single best performing oracle
self-attention matrix and validation-set picked head (in brackets) of the BERT and BART models compared with
baselines and previous work. “rand. init"=Randomly initialized transformer model of similar architecture as the
PLM.

D Detailed Self-Attention Statistics

Model            Span: Min  Med  Mean  Max     Eisner: Min  Med  Mean  Max
RST-DT
rand. init 21.7 23.4 23.4 25.5 7.5 10.3 10.3 13.3
PLM 19.3 27.0 27.4 35.7 6.6 17.4 21.6 45.3
+ IMDB 19.7 26.9 27.2 35.4 6.6 16.9 21.3 42.8
+ YELP 20.2 26.6 26.9 34.7 7.0 16.5 21.0 42.3
+ SST-2 19.5 27.3 27.7 35.5 7.3 17.6 21.9 42.9
+ MNLI 18.5 26.9 27.1 34.8 6.9 17.5 21.5 41.8
GUM
rand. init 18.6 21.0 21.0 23.2 7.9 10.1 10.1 12.4
PLM 17.8 24.2 24.3 32.6 6.7 16.0 21.2 45.2
+ IMDB 18.1 23.8 24.1 32.7 6.1 15.9 21.0 43.3
+ YELP 18.6 24.0 23.9 32.3 7.0 15.8 20.7 43.7
+ SST-2 18.2 24.6 24.7 32.3 6.5 16.5 21.6 43.5
+ MNLI 17.4 23.9 24.2 32.1 6.8 16.6 21.3 43.3

Table 5: Minimum, median, mean and maximum performance of the self-attention matrices on RST-DT and GUM
for the BERT model.

Model            Span: Min  Med  Mean  Max     Eisner: Min  Med  Mean  Max
RST-DT
rand. init 20.3 23.3 23.3 25.3 8.5 10.6 10.6 12.5
PLM 20.3 28.3 28.5 39.1 4.1 15.8 19.2 41.7
+ CNN-DM 20.5 28.6 28.7 40.9 3.6 15.2 19.2 44.3
+ XSUM 20.2 27.6 28.3 40.1 4.8 14.8 18.7 41.9
+ SQuAD 20.5 27.6 28.2 40.1 2.8 14.8 18.8 43.2
GUM
rand. init 18.6 21.0 21.0 23.2 8.0 10.2 10.2 12.2
PLM 16.7 23.4 23.8 31.5 2.6 15.2 18.7 41.8
+ CNN-DM 15.9 23.7 24.1 32.4 3.7 14.7 18.9 42.8
+ XSUM 16.4 23.2 23.9 31.8 3.0 14.1 18.1 39.9
+ SQuAD 16.1 23.4 23.8 31.0 2.4 14.8 18.3 40.7

Table 6: Minimum, median, mean and maximum performance of the self-attention matrices on RST-DT and GUM
for the BART model.

E Details of Structural Discourse Similarity

Figure 8: Detailed PLM discourse constituency (left) and dependency (right) structure overlap with baselines and
gold trees according to the original parseval and UAS metrics.

Figure 9: Detailed PLM discourse constituency (left) and dependency (right) structure performance of intersection
with gold trees (e.g., BERT ∩ Gold Trees ↔ Two-Stage (RST-DT) ∩ Gold Trees) according to the original parseval
and UAS metrics.

F Intra- and Inter-Model Self-Attention Comparison

(a) BERT constituency tree similarity on GUM (=Head*)    (b) BERT dependency tree similarity on GUM (=Head*)
(c) BART constituency tree similarity on GUM (=Head*, =Model*)    (d) BART dependency tree similarity on GUM (=Head*, =Model*)

Figure 10: Top: Visual analysis of sorted heatmaps. Yellow=high score, purple=low score.
Bottom: Aggregated similarity of same heads, same models, different heads and different models. *=Head/=Model
significantly better than ̸=Head/̸=Model performance with p-value < 0.05.
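A minimal sketch of the significance procedure referenced in this caption (and described in Section 5.4, footnote 9): a Shapiro-Wilk normality check followed by a two-sided, independent t-test between same-head and different-head similarity values. The arrays below are random placeholders, not the paper's actual scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder similarity values; in the actual analysis these are the
# structural-overlap scores of same-head vs. different-head cell pairs.
same_head = rng.normal(0.80, 0.05, size=40)
diff_head = rng.normal(0.70, 0.05, size=120)

# Footnote 9: confirm approximate normality (Shapiro-Wilk) before the t-test.
for label, sample in [("=Head", same_head), ("!=Head", diff_head)]:
    w, p = stats.shapiro(sample)
    print(f"Shapiro-Wilk {label}: W={w:.3f}, p={p:.3f}")

# Two-sided, independent t-test between the two groups.
t, p = stats.ttest_ind(same_head, diff_head, equal_var=True)
print(f"t={t:.3f}, p={p:.4f}, significant={p < 0.05}")
```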

