Huber and Carenini - 2022 - Towards Understanding Large-Scale Discourse Structures in Pre-Trained and Fine-Tuned Language Models
Dataset          Task                 Domain
IMDB (2014)      Sentiment            Movie Reviews
Yelp (2015)      Sentiment            Reviews
SST-2 (2013)     Sentiment            Movie Reviews
MNLI (2018)      NLI                  Range of Genres
CNN-DM (2016)    Summarization        News
XSUM (2018)      Summarization        News
SQuAD (2016)     Question-Answering   Wikipedia

Table 1: The seven fine-tuning datasets used in this work along with the underlying tasks and domains.

[Figure panel (a): BERT (PLM, +IMDB, +Yelp, +SST-2, +MNLI)]
model contains 144 of those self-attention matrices (12 layers, 12 self-attention heads each), all of which potentially encode discourse structures. For the BART model, this number is even higher, consisting of 12 layers with 16 self-attention heads each. With prior work suggesting the locality of discourse information in PLMs (e.g., Raganato and Tiedemann (2018); Mareček and Rosa (2019); Xiao et al. (2021b)), we analyze every self-attention matrix individually to gain a better understanding of their alignment with discourse information.
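The scale of this search space is easy to expose with off-the-shelf tooling. The following is a minimal sketch, assuming the Hugging Face transformers library and a generic BERT checkpoint (both our assumptions, not the authors' released code), of how all per-head attention matrices can be read out for analysis; BART models expose encoder and decoder attentions separately.

```python
# Minimal sketch (assumed tooling, not the paper's code): read out every
# per-head self-attention matrix of a BERT-style model.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint; fine-tuned models work analogously
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "An example document whose EDU segmentation is assumed to be given elsewhere."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one entry per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = torch.stack(outputs.attentions).squeeze(1)  # (num_layers, num_heads, seq_len, seq_len)
num_layers, num_heads = attn.shape[:2]
print(f"{num_layers} layers x {num_heads} heads = {num_layers * num_heads} self-attention matrices")
```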
Besides investigating standard PLMs, we also explore the robustness of discourse information across fine-tuning tasks. We believe that this is an important step towards better understanding whether the captured discourse information is general and robust, or whether it is "re-learned" from scratch for downstream tasks. To the best of our knowledge, no previous analysis of this kind has been performed in the literature. To this end, Figure 2 shows the constituency and dependency structure overlap of the discourse trees generated from individual self-attention heads with the gold-standard tree structures of the GUM dataset (the analysis on RST-DT shows similar trends and can be found in Appendix B). The heatmaps clearly show that constituency discourse structures are mostly captured in higher layers, while dependency structures are more evenly distributed across layers. Comparing the patterns between models, we find that, despite being fine-tuned on different downstream tasks, the discourse information is consistently encoded in the same self-attention heads. Even though the best-performing self-attention matrix is not consistent, discourse information is clearly captured in a "local" subset of self-attention heads across all presented fine-tuning tasks. This plausibly suggests that the discourse information in pre-trained BERT and BART models is robust and general, requiring only minor adjustments depending on the fine-tuning task.
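The paper derives discourse trees from these head-level attention scores. As a deliberately simplified stand-in, the hedged sketch below aggregates a token-level attention matrix into an EDU-by-EDU matrix and reads off a dependency structure by greedy head selection; the helper names, the averaging scheme, and the greedy decoder are our assumptions for illustration only, while the reported dependency results rely on proper projective decoding (the Eisner algorithm, cf. Appendix D).

```python
# Hedged sketch: turning one head's token-level attention matrix into a simple
# discourse dependency structure over EDUs. Names and the greedy decoder are
# illustrative assumptions, not the paper's exact procedure.
import numpy as np

def aggregate_to_edus(attn, edu_spans):
    """Average a (seq_len x seq_len) token attention matrix into an (n x n) EDU matrix.

    edu_spans: list of (start, end) token index ranges, one per EDU (end exclusive).
    """
    n = len(edu_spans)
    edu_attn = np.zeros((n, n))
    for i, (si, ei) in enumerate(edu_spans):
        for j, (sj, ej) in enumerate(edu_spans):
            edu_attn[i, j] = attn[si:ei, sj:ej].mean()
    return edu_attn

def greedy_dependency_heads(edu_attn):
    """Attach every EDU to the EDU it attends to most (EDU 0 is taken as root here)."""
    scores = edu_attn.copy()
    np.fill_diagonal(scores, -np.inf)  # an EDU cannot be its own head
    heads = [-1]                       # -1 marks the root
    for i in range(1, len(scores)):
        heads.append(int(scores[i].argmax()))
    return heads

# Hypothetical usage, given attn from one (layer, head) and gold EDU spans:
# edu_attn = aggregate_to_edus(attn[layer, head].numpy(), edu_spans)
# pred_heads = greedy_dependency_heads(edu_attn)
```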
5.2 Discourse Quality

                         RST-DT              GUM
Model                 Span      UAS       Span      UAS
BERT
  rand. init        ↓ 25.5   ↓ 13.3     ↓ 23.2   ↓ 12.4
  PLM               • 35.7   • 45.3     • 33.0   • 45.2
  + IMDB            ↓ 35.4   ↓ 42.8     • 33.0   ↓ 43.3
  + Yelp            ↓ 34.7   ↓ 42.3     ↓ 32.6   ↓ 43.7
  + SST-2           ↓ 35.5   ↓ 42.9     ↓ 32.6   ↓ 43.5
  + MNLI            ↓ 34.8   ↓ 41.8     ↓ 32.4   ↓ 43.3
BART
  rand. init        ↓ 25.3   ↓ 12.5     ↓ 23.2   ↓ 12.2
  PLM               • 39.1   • 41.7     • 31.8   • 41.8
  + CNN-DM          ↑ 40.9   ↑ 44.3     ↑ 32.7   ↑ 42.8
  + XSUM            ↑ 40.1   ↑ 41.9     ↑ 32.1   ↓ 39.9
  + SQuAD           ↑ 40.1   ↑ 43.2     ↓ 31.3   ↓ 40.7
Baselines
  RB / Chain           9.3     40.4       9.4     41.7
  LB / Chain⁻¹         7.5     12.7       1.5     12.2
  Sum (CNN-DM)        21.4     20.5      17.6     15.8
  Sum (NYT)           24.0     15.7      18.2     12.6
  Two-Stage (RST-DT)  72.0     71.2      54.0     54.5
  Two-Stage (GUM)     65.4     61.7      58.6     56.7

Table 2: Original parseval (Span) and Unlabelled Attachment Score (UAS) of the single best-performing self-attention matrix of the BERT and BART models, compared with baselines and previous work. ↑, •, ↓ indicate better, equal, or worse performance compared to the PLM. "rand. init" = randomly initialized transformer model of similar architecture as the PLM, RB = right-branching, LB = left-branching, Chain⁻¹ = inverse chain.

We now focus on assessing the discourse information captured in the single best-performing self-attention head. In Table 2, we compare the discourse structure quality of pre-trained and fine-tuned PLMs in the context of supervised models, distantly supervised approaches, and simple baselines. We show the oracle-picked best head on the test set, analyzing the upper bound for the potential performance of PLMs on RST-style discourse structures. This is not a realistic scenario, as the best-performing head is generally not known a priori. Hence, we also explore the performance when using a small-scale validation set to pick the best-performing self-attention matrix. In this more realistic scenario for discourse parsing, we find that scores on average drop by 1.55 points for BERT and 1.33 points for BART compared to the oracle-picked performance of a single self-attention matrix. We show detailed results of this degradation in Appendix C (for a more detailed analysis of the min., mean, median and max. self-attention performances, see Appendix D). Our results in Table 2 are separated into three sub-tables, showing the results for BERT, BART, and baseline models, each evaluated on the RST-DT and GUM treebanks. In the BERT and BART sub-tables, we further annotate each performance with ↑, •, ↓, indicating the relative performance to the standard pre-trained model as superior, equal, or inferior.
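To make the oracle-versus-validation distinction concrete, a minimal sketch of the two selection schemes could look as follows; the precomputed score grids and function name are hypothetical, not part of the paper's code.

```python
# Hedged sketch of oracle-picked vs. validation-picked head selection.
# dev_scores / test_scores are hypothetical (num_layers, num_heads) arrays holding,
# e.g., the Span or UAS score of the tree decoded from each self-attention head.
import numpy as np

def oracle_vs_validation(dev_scores, test_scores):
    oracle = test_scores.max()  # upper bound: best head chosen directly on the test set
    layer, head = np.unravel_index(dev_scores.argmax(), dev_scores.shape)
    validated = test_scores[layer, head]  # realistic: head chosen on a small validation set
    return {
        "oracle": float(oracle),
        "validation_picked": float(validated),
        "degradation": float(oracle - validated),
    }
```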
Taking a look at the top sub-table (BERT), we find that, as expected, the randomly initialized transformer model achieves the worst performance. Fine-tuned models perform equal to or worse than the standard PLM. Despite the inferior results of the fine-tuned models, the drop is rather small, with the sentiment analysis models consistently outperforming NLI. This seems reasonable, given that the sentiment analysis objective is intuitively more aligned with discourse structures (e.g., long-form reviews with potentially complex rhetorical structures) than the between-sentence NLI task, which does not involve multi-sentential text.

Figure 3: PLM discourse constituency (left) and dependency (right) structure overlap with baselines and gold trees (e.g., BERT ↔ Two-Stage (RST-DT)) according to the original parseval and UAS metrics.

In the center sub-table (BART), a different trend emerges. While the worst-performing model is still (as expected) the randomly initialized system, fine-tuned models mostly outperform the standard PLM. Interestingly, the model fine-tuned on the CNN-DM corpus consistently outperforms the BART baseline, while the XSUM model performs better on all but the GUM dependency structure evaluation. On one hand, the superior performance of both summarization models on the RST-DT dataset seems reasonable, given that the fine-tuning datasets and the evaluation treebank are both in the news domain. The strong results of the CNN-DM model on the GUM treebank, yet inferior performance of XSUM, potentially hint towards dependency discourse structures being less prominent when fine-tuning on the extreme summarization task, compared to the longer summaries in the CNN-DM corpus. The question-answering model fine-tuned on SQuAD underperforms the standard PLM on GUM, but reaches superior performance on RST-DT. Since the SQuAD corpus is a subset of Wikipedia articles, more aligned with news articles than the 12 genres in GUM, we believe the stronger performance on RST-DT (i.e., news articles) is again reasonable, yet shows weaker generalization capabilities across domains (i.e., on the GUM corpus). Interestingly, the question-answering task seems more aligned with dependency than constituency trees, in line with what would be expected from a factoid-style question-answering model focusing on important entities rather than global constituency structures.

Directly comparing the BERT and BART models, the former performs better on three out of four metrics. At the same time, fine-tuning hurts the performance of BERT but improves the BART models. Plausibly, these seemingly unintuitive results may be caused by the following co-occurring circumstances: (1) The inferior performance of BART can potentially be attributed to the decoder component capturing parts of the discourse structures, as well as the larger number of self-attention heads "diluting" the discourse information. (2) The different trends regarding fine-tuned models might be directly influenced by the input-length limitation of 512 (BERT) and 1024 (BART) subword tokens during the fine-tuning stage, hampering the ability to capture long-distance semantic and pragmatic relationships. This, in turn, limits the amount of discourse information captured, even for document-level datasets (e.g., Yelp, CNN-DM, SQuAD). With this restriction being more prominent in BERT, it potentially explains the comparably low performance of the fine-tuned models.

Finally, the bottom sub-table puts our results in the context of previously proposed supervised and distantly supervised models, as well as simple baselines. Compared to simple right- and left-branching trees (Span), the PLM-based models reach clearly superior performance. Looking at the chain/inverse-chain structures (UAS), the improvements are generally lower; however, the vast majority still outperforms the baseline. Compared to the completely supervised methods (Two-Stage (RST-DT), Two-Stage (GUM)), the BERT- and BART-based models are, unsurprisingly, inferior. Lastly, compared to the distantly supervised Sum (CNN-DM) and Sum (NYT) models, the PLM-based discourse performance shows clear improvements over the 6-layer, 8-head standard transformer.
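For reference, the chain and inverse-chain baselines and the UAS metric used throughout Table 2 can be written down in a few lines. The sketch below is our own paraphrase of these standard definitions (with head index -1 marking the root), not code from the paper.

```python
# Hedged sketch of the UAS metric and the chain / inverse-chain dependency baselines.
def uas(pred_heads, gold_heads):
    """Unlabelled Attachment Score: percentage of EDUs whose predicted head matches gold."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)

def chain_heads(n_edus):
    """Chain baseline: every EDU attaches to its left neighbour; the first EDU is the root."""
    return [-1] + list(range(n_edus - 1))

def inverse_chain_heads(n_edus):
    """Inverse chain: every EDU attaches to its right neighbour; the last EDU is the root."""
    return list(range(1, n_edus)) + [-1]
```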
[Figure panels: (a) Head-aligned, (b) Model-aligned]

References
Yangfeng Ji and Noah A. Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 996–1005, Vancouver, Canada. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.

Dan Jurafsky and James H. Martin. 2014. Speech and Language Processing, volume 3. Pearson London.

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In International Conference on Learning Representations.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2020. Top-down RST parsing utilizing granularity levels in documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8099–8106.

Naoki Kobayashi, Tsutomu Hirao, Kengo Nakamura, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5797–5802, Hong Kong, China. Association for Computational Linguistics.

Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2021a. Discourse probing of pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3849–3864, Online. Association for Computational Linguistics.

Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2021b. Top-down discourse parsing via sequence labelling. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 715–726, Online. Association for Computational Linguistics.

Murathan Kurfalı and Robert Östling. 2021. Probing multilingual language models for discourse. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 8–19, Online. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie Li. 2014. Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25–35, Baltimore, Maryland. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

David Mareček and Rudolf Rosa. 2019. From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 263–275, Florence, Italy. Association for Computational Linguistics.

Julian Michael, Jan A. Botha, and Ian Tenney. 2020. Asking without telling: Exploring latent ontologies in contextual representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6792–6812, Online. Association for Computational Linguistics.

Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1319–1324, Copenhagen, Denmark. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.

Bita Nejat, Giuseppe Carenini, and Raymond Ng. 2017. Exploring joint neural model for sentence level discourse parsing and sentiment analysis. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 289–298, Saarbrücken, Germany. Association for Computational Linguistics.

Thanh-Tung Nguyen, Xuan-Phi Nguyen, Shafiq Joty, and Xiaoli Li. 2021. RST parsing from scratch. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1613–1625, Online. Association for Computational Linguistics.

Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, et al. 2021. Domain-matched pre-training tasks for dense retrieval. arXiv preprint arXiv:2107.13602.

Lalchand Pandia, Yan Cong, and Allyson Ettinger. 2021. Pragmatic competence of pre-trained language models through the lens of discourse connectives. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 367–379, Online. Association for Computational Linguistics.

Yannis Papanikolaou, Ian Roberts, and Andrea Pierleoni. 2019. Deep bidirectional transformers for relation extraction without supervision. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 67–75, Hong Kong, China. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Yizhong Wang, Sujian Li, and Houfeng Wang. 2017. A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 184–188, Vancouver, Canada. Association for Computational Linguistics.

Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics, 29(4):545–587.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.

Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2021a. PRIMER: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499.

Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2021b. Predicting discourse trees from transformer-based neural summarizers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4139–4152, Online. Association for Computational Linguistics.

Linzi Xing, Wen Xiao, and Giuseppe Carenini. 2021. Demoting the lead bias in news summarization via alternating adversarial learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 948–954, Online. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
A Huggingface Models
We investigate seven fine-tuned BERT and BART models from the Huggingface model library, as well as the two pre-trained models. The model names and links are provided in Table 3.
Figure 6: Constituency (top) and dependency (bottom) discourse tree evaluation of BERT (a) and BART (b) models
on RST-DT (test). Purple=high score, blue=low score. + indicates fine-tuning dataset.
[Figure 7, panel (a): BERT (PLM, +IMDB, +Yelp, +MNLI, +SST-2)]
Figure 7: Constituency (top) and dependency (bottom) discourse tree evaluation of BERT (a) and BART (b) models
on GUM (test). Purple=high score, blue=low score. + indicates fine-tuning dataset.
C Oracle-picked self-attention head compared to validation-picked matrix
Model                RST-DT Span     RST-DT UAS      GUM Span        GUM UAS
BERT
rand. init 25.5 (-0.0) 13.3 (-0.0) 23.2 (-0.0) 12.4 (-0.0)
PLM 35.7 (-1.6) 45.3 (-4.9) 33.0 (-0.4) 45.2 (-0.0)
+ IMDB 35.4 (-1.8) 42.8 (-2.4) 33.0 (-3.8) 43.3 (-0.1)
+ Yelp 34.7 (-1.0) 42.3 (-1.9) 32.6 (-3.6) 43.7 (-0.0)
+ SST-2 35.5 (-1.9) 42.9 (-2.5) 32.6 (-0.3) 43.5 (-0.9)
+ MNLI 34.8 (-1.7) 41.8 (-1.4) 32.4 (-0.3) 43.3 (-0.5)
BART
rand. init 25.3 (-0.0) 12.5 (-0.0) 23.2 (-0.0) 12.2 (-0.0)
PLM 39.1 (-0.4) 41.7 (-2.7) 31.8 (-0.3) 41.8 (-0.0)
+ CNN-DM 40.9 (-0.0) 44.3 (-4.0) 32.7 (-0.3) 42.8 (-0.7)
+ XSUM 40.1 (-0.9) 41.9 (-3.4) 32.1 (-1.7) 39.9 (-0.0)
+ SQuAD 40.1 (-0.0) 43.2 (-4.6) 31.3 (-2.1) 40.7 (-0.1)
Baselines
Right-Branch / Chain         9.3          40.4            9.4            41.7
Left-Branch / Chain⁻¹        7.5          12.7            1.5            12.2
Sum (CNN-DM) (2021b)        21.4          20.5           17.6            15.8
Sum (NYT) (2021b)           24.0          15.7           18.2            12.6
Two-Stage (RST-DT) (2017)   72.0          71.2           54.0            54.5
Two-Stage (GUM)             65.4          61.7           58.6            56.7
Table 4: Original parseval (Span) and Unlabelled Attachment Score (UAS) of the single best performing oracle
self-attention matrix and validation-set picked head (in brackets) of the BERT and BART models compared with
baselines and previous work. “rand. init"=Randomly initialized transformer model of similar architecture as the
PLM.
D Detailed Self-Attention Statistics
                          Span                              Eisner
Model             Min    Med    Mean    Max        Min    Med    Mean    Max
RST-DT
rand. init 21.7 23.4 23.4 25.5 7.5 10.3 10.3 13.3
PLM 19.3 27.0 27.4 35.7 6.6 17.4 21.6 45.3
+ IMDB 19.7 26.9 27.2 35.4 6.6 16.9 21.3 42.8
+ YELP 20.2 26.6 26.9 34.7 7.0 16.5 21.0 42.3
+ SST-2 19.5 27.3 27.7 35.5 7.3 17.6 21.9 42.9
+ MNLI 18.5 26.9 27.1 34.8 6.9 17.5 21.5 41.8
GUM
rand. init 18.6 21.0 21.0 23.2 7.9 10.1 10.1 12.4
PLM 17.8 24.2 24.3 32.6 6.7 16.0 21.2 45.2
+ IMDB 18.1 23.8 24.1 32.7 6.1 15.9 21.0 43.3
+ YELP 18.6 24.0 23.9 32.3 7.0 15.8 20.7 43.7
+ SST-2 18.2 24.6 24.7 32.3 6.5 16.5 21.6 43.5
+ MNLI 17.4 23.9 24.2 32.1 6.8 16.6 21.3 43.3
Table 5: Minimum, median, mean and maximum performance of the self-attention matrices on RST-DT and GUM
for the BERT model.
                          Span                              Eisner
Model             Min    Med    Mean    Max        Min    Med    Mean    Max
RST-DT
rand. init 20.3 23.3 23.3 25.3 8.5 10.6 10.6 12.5
PLM 20.3 28.3 28.5 39.1 4.1 15.8 19.2 41.7
+ CNN-DM 20.5 28.6 28.7 40.9 3.6 15.2 19.2 44.3
+ XSUM 20.2 27.6 28.3 40.1 4.8 14.8 18.7 41.9
+ SQuAD 20.5 27.6 28.2 40.1 2.8 14.8 18.8 43.2
GUM
rand. init 18.6 21.0 21.0 23.2 8.0 10.2 10.2 12.2
PLM 16.7 23.4 23.8 31.5 2.6 15.2 18.7 41.8
+ CNN-DM 15.9 23.7 24.1 32.4 3.7 14.7 18.9 42.8
+ XSUM 16.4 23.2 23.9 31.8 3.0 14.1 18.1 39.9
+ SQuAD 16.1 23.4 23.8 31.0 2.4 14.8 18.3 40.7
Table 6: Minimum, median, mean and maximum performance of the self-attention matrices on RST-DT and GUM
for the BART model.
E Details of Structural Discourse Similarity
Figure 8: Detailed PLM discourse constituency (left) and dependency (right) structure overlap with baselines and
gold trees according to the original parseval and UAS metrics.
Figure 9: Detailed PLM discourse constituency (left) and dependency (right) structure performance of intersection
with gold trees (e.g., BERT ∩ Gold Trees ↔ Two-Stage (RST-DT) ∩ Gold Trees) according to the original parseval
and UAS metrics.
F Intra- and Inter-Model Self-Attention Comparison
[Figure 10 panels: (a) BERT constituency tree similarity on GUM; (b) BERT dependency tree similarity on GUM; (c) BART constituency tree similarity on GUM; (d) BART dependency tree similarity on GUM. Each panel shows heatmaps sorted by heads (left) and by models (right), with aggregated similarities for =Head, ≠Head, =Model, and ≠Model.]

Figure 10: Top: Visual analysis of sorted heatmaps. Yellow=high score, purple=low score. Bottom: Aggregated similarity of same heads, same models, different heads and different models. *: =Head/=Model significantly better than ≠Head/≠Model performance with p-value < 0.05.
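As a rough illustration of the aggregation behind this figure, the hedged sketch below groups pairwise tree-similarity scores into same-head and different-head populations and compares them with a rank-based test. The similarity dictionary and the choice of the Mann-Whitney U test are our assumptions, since the caption only states the significance level.

```python
# Hedged sketch: comparing same-head vs. different-head tree similarities.
# sim is a hypothetical dict mapping ((model_a, head_a), (model_b, head_b)) pairs
# to a tree-similarity score in [0, 1]; the significance test is an assumption.
from scipy.stats import mannwhitneyu

def same_vs_diff_head(sim):
    same, diff = [], []
    for ((_, head_a), (_, head_b)), score in sim.items():
        (same if head_a == head_b else diff).append(score)
    stat, p_value = mannwhitneyu(same, diff, alternative="greater")
    return sum(same) / len(same), sum(diff) / len(diff), p_value
```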