A Discourse-Aware Attention Model For Abstractive Summarization of Long Documents - SUMMARY
Abstract
Our approach consists of a new hierarchical encoder that models the discourse structure
of a document, and an attentive discourse-aware decoder to generate the summary.
1. Introduction
Our decoder attends to different discourse sections, allowing the model to more
accurately represent important information from the source and yielding a better
context vector.
We also introduce two large-scale datasets of long and structured scientific papers
obtained from arXiv and PubMed.
2. Background
Attentive Decoding
The attention mechanism maps the decoder state $h_{t-1}^{(d)}$ and the encoder states $h_i^{(e)}$ to a context vector $c_t$. Incorporating this context vector at each decoding timestep (attentive decoding) has proven effective in seq2seq models:
$$c_t = \sum_{i=1}^{N} \alpha_i^{(t)} h_i^{(e)}$$

where the attention weights $\alpha_i^{(t)}$ are computed by a scoring function built on the affine map $\mathrm{linear}(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$.
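For concreteness, here is a minimal NumPy sketch of attentive decoding with an additive (tanh) scoring function built on the linear map above. The tanh form, the projection vector $v$, and all names and shapes are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def linear(x1, x2, w1, w2, b):
    # linear(x1, x2) = w1 x1 + w2 x2 + b, as defined above
    return w1 @ x1 + w2 @ x2 + b

def context_vector(h_dec_prev, h_enc, w1, w2, b, v):
    # score each encoder state h_i against the previous decoder state,
    # normalize with softmax, then c_t = sum_i alpha_i^(t) h_i^(e)
    scores = np.array([v @ np.tanh(linear(h_dec_prev, h_i, w1, w2, b))
                       for h_i in h_enc])
    alpha = softmax(scores)            # attention weights over N source positions
    return alpha @ h_enc, alpha        # context vector c_t and the weights

rng = np.random.default_rng(0)
N, d = 6, 4                            # 6 encoder states of hidden size 4
h_enc = rng.normal(size=(N, d))
h_dec = rng.normal(size=d)
w1, w2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b, v = np.zeros(d), rng.normal(size=d)
c_t, alpha = context_vector(h_dec, h_enc, w1, w2, b, v)
print(c_t.shape, round(alpha.sum(), 6))   # (4,) 1.0
```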
3. Model
Hierarchical Encoder
We first encode each discourse section and then encode the document:
$$h_j^{(s)} = \mathrm{RNN}_{\mathrm{sec}}\big(x_{(j,1)}, \dots, x_{(j,M)}\big)$$

$$d = \mathrm{RNN}_{\mathrm{doc}}\big(\{h_1^{(s)}, \dots, h_N^{(s)}\}\big)$$
Discourse-Aware Decoder
At each decoding timestep, in addition to the words in the document, we also attend to
the relevant discourse section. Then we use the discourse-related information to modify
the word-level attention function.
$$c_t = \sum_{j=1}^{N} \sum_{i=1}^{M} \alpha_{(j,i)}^{(t)} h_{(j,i)}^{(e)}$$
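One way to realize this, sketched below in NumPy, is to compute section-level weights $\beta_j$ and use them to rescale the word-level scores before a joint normalization, so that $\alpha_{(j,i)}^{(t)}$ sums to one over all words in all sections. The dot-product scoring here is a simplification; the paper learns its scoring functions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def discourse_attention(h_dec, h_words, h_secs):
    # h_words: list of (M_j, d) word states per section; h_secs: (N, d) section states
    beta = softmax(h_secs @ h_dec)               # section-level weights beta_j
    scores = [h @ h_dec for h in h_words]        # word-level scores per section
    m = max(s.max() for s in scores)             # shared shift for numerical stability
    scaled = [beta[j] * np.exp(scores[j] - m)    # rescale word scores by beta_j
              for j in range(len(h_words))]
    Z = sum(s.sum() for s in scaled)             # joint normalizer over all (j, i)
    alpha = [s / Z for s in scaled]              # alpha_(j,i)^(t), sums to 1 overall
    c_t = sum(a @ h for a, h in zip(alpha, h_words))   # c_t = sum_j sum_i alpha * h
    return c_t, alpha

rng = np.random.default_rng(0)
d = 4
h_words = [rng.normal(size=(3, d)), rng.normal(size=(5, d))]   # two sections
h_secs = rng.normal(size=(2, d))
h_dec = rng.normal(size=d)
c_t, alpha = discourse_attention(h_dec, h_words, h_secs)
print(c_t.shape, round(sum(a.sum() for a in alpha), 6))        # (4,) 1.0
```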
At each timestep $t$, the decoder state $h_t^{(d)}$ and the context vector $c_t$ are used to estimate the probability of the next word $y_t$:

$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}\big(V^{\top} \mathrm{linear}(h_t^{(d)}, c_t)\big)$$
The decoder can also copy words directly from the source: $p_g$ is the probability of generating a word from the vocabulary and $p_c$ is the probability of copying a word from the source,

$$p_c(y_t = x_l \mid y_{1:t-1}) = \sum_{(j,i)\,:\,x_{(j,i)} = x_l} \alpha_{(j,i)}^{(t)}$$
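The final output distribution mixes the two modes. The sketch below assumes a pointer-generator-style mixture with a scalar gate $p_g$ (here passed in directly); the exact gating parameterization is an assumption, not the paper's.

```python
import numpy as np

def final_distribution(p_vocab, alpha_flat, src_ids, p_gen):
    # p(y_t) = p_g * p_vocab(y_t) + (1 - p_g) * sum_{(j,i): x_(j,i) = y_t} alpha_(j,i)
    p = p_gen * p_vocab
    p_copy = np.zeros_like(p_vocab)
    np.add.at(p_copy, src_ids, alpha_flat)   # scatter attention mass onto source token ids
    return p + (1.0 - p_gen) * p_copy

vocab_size = 10
p_vocab = np.full(vocab_size, 1.0 / vocab_size)   # uniform vocab distribution for the demo
alpha = np.array([0.5, 0.3, 0.2])                 # attention over 3 source words
src_ids = np.array([4, 4, 7])                     # token ids; word 4 appears twice
p = final_distribution(p_vocab, alpha, src_ids, p_gen=0.6)
print(round(p.sum(), 6), round(p[4], 6))          # 1.0 0.38 (0.06 gen + 0.32 copy)
```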
Decoding Coverage
In long sequences, neural generation models tend to repeat phrases: the softmax
layer predicts the same phrase multiple times over multiple timesteps. To avoid
repeatedly attending to the same source positions, we track attention coverage
with a coverage vector:
$$\mathrm{cov}_{(j,i)}^{(t)} = \sum_{k=0}^{t-1} \alpha_{(j,i)}^{(k)}$$
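A minimal sketch of coverage tracking: the coverage vector accumulates past attention weights and is fed back to penalize positions that already received attention. The penalty term with weight w_cov is a schematic assumption about how coverage enters the score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, T = 5, 4                        # 5 source positions, 4 decoding steps
coverage = np.zeros(N)             # cov^(t) = sum of alpha^(k) for k < t
w_cov = 1.0                        # hypothetical coverage penalty weight

for t in range(T):
    base_scores = rng.normal(size=N)
    # penalize positions that already received attention mass
    alpha = softmax(base_scores - w_cov * coverage)
    coverage += alpha              # accumulate this step's attention weights
print(round(coverage.sum(), 6))    # 4.0: one unit of attention per decoding step
```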
4. Related Work
5. Data
6. Experiments
7. Conclusion and Future Work
May 18, 2023
Chu Dinh Duc