
A Discourse-Aware Attention Model For Abstractive Summarization of Long Documents - SUMMARY


A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

Abstract
Our approach consists of a new hierarchical encoder that models the discourse structure
of a document, and an attentive discourse-aware decoder to generate the summary.
1. Introduction
Our decoder attends to different discourse sections, allowing the model to more accurately represent important information from the source and yielding a better context vector.
We also introduce two large-scale datasets of long and structured scientific papers
obtained from arXiv and PubMed.
2. Background
Attentive Decoding
The attention mechanism maps the decoder state $h_{t-1}^{(d)}$ and the encoder states $h_i^{(e)}$ to a context vector $c_t$. Incorporating this context vector at each decoding timestep (attentive decoding) has proven effective in seq2seq models:

$$c_t = \sum_{i=1}^{N} \alpha_i^{(t)} h_i^{(e)}$$

where $\alpha_i^{(t)}$ are the attention weights, calculated as follows:


$$\alpha_i^{(t)} = \mathrm{softmax}\left(\mathrm{score}\left(h_i^{(e)}, h_{t-1}^{(d)}\right)\right)$$

$$\mathrm{score}\left(h_i^{(e)}, h_{t-1}^{(d)}\right) = v_a^{T} \tanh\left(\mathrm{linear}\left(h_i^{(e)}, h_{t-1}^{(d)}\right)\right)$$

$$\mathrm{linear}(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$$
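Below is a minimal PyTorch sketch of this additive attention. The class name, argument names, and dimensions are assumptions, not the authors' code; the concatenation-based `nn.Linear` plays the role of $\mathrm{linear}(x_1, x_2)$ and `v_a` is the learned score vector.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention sketch: score(h_i, h_dec) = v_a^T tanh(linear(h_i, h_dec))."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        # linear(x1, x2) = w1 x1 + w2 x2 + b, implemented over the concatenation [x1; x2]
        self.linear = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (N, enc_dim) encoder states h_i^(e); dec_state: (dec_dim,) h_{t-1}^(d)
        n = enc_states.size(0)
        dec_rep = dec_state.unsqueeze(0).expand(n, -1)
        scores = self.v_a(torch.tanh(self.linear(torch.cat([enc_states, dec_rep], dim=-1)))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                 # attention weights alpha_i^(t)
        c_t = (alpha.unsqueeze(-1) * enc_states).sum(dim=0)  # context vector c_t
        return c_t, alpha
```

For example, `AdditiveAttention(512, 512, 256)(torch.randn(10, 512), torch.randn(512))` returns a context vector and the weights over the 10 encoder states.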

3. Model
Hierarchical Encoder
We first encode each discourse section and then encode the document:
$$d = \mathrm{RNN}_{doc}\left(\left\{h_1^{(s)}, \ldots, h_N^{(s)}\right\}\right)$$

$$h_j^{(s)} = \mathrm{RNN}_{sec}\left(x_{(j,1)}, \ldots, x_{(j,M)}\right)$$

where $N$ is the number of sections and $M$ is the maximum section length.


The parameters of 𝑅𝑁𝑁𝑠𝑒𝑐 are shared for all the discourse sections.
We use a single-layer bidirectional LSTM for both $\mathrm{RNN}_{doc}$ and $\mathrm{RNN}_{sec}$, combining the forward and backward states as

$$h = \mathrm{relu}\left(W\left[\overrightarrow{h}, \overleftarrow{h}\right] + b\right)$$
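A sketch of the hierarchical encoder under assumed dimensions (PyTorch, not the authors' implementation): one `RNN_sec` module encodes each section's word embeddings (so its parameters are shared across sections), `RNN_doc` runs over the resulting section vectors, and forward/backward states are merged with the ReLU projection above. Taking the last word state as the section vector and reusing one projection layer at both levels are simplifying assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch: RNN_sec over the words of each section, RNN_doc over the section vectors."""

    def __init__(self, emb_dim, hid_dim):
        super().__init__()
        # a single RNN_sec module, so its parameters are shared across all sections
        self.rnn_sec = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.rnn_doc = nn.LSTM(hid_dim, hid_dim, bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * hid_dim, hid_dim)  # W[h_fwd, h_bwd] + b (reused at both levels)

    def forward(self, sections):
        # sections: (N, M, emb_dim) -- N sections, each padded to M word embeddings
        word_states, _ = self.rnn_sec(sections)                    # (N, M, 2*hid_dim)
        word_states = torch.relu(self.combine(word_states))        # h = relu(W[h_fwd, h_bwd] + b)
        sec_vecs = word_states[:, -1, :]                           # h_j^(s): last word state of section j (assumption)
        doc_states, _ = self.rnn_doc(sec_vecs.unsqueeze(0))        # d: RNN_doc over {h_1^(s), ..., h_N^(s)}
        doc_states = torch.relu(self.combine(doc_states)).squeeze(0)
        return word_states, sec_vecs, doc_states
```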

Discourse-Aware Decoder
At each decoding timestep, in addition to the words in the document, we also attend to
the relevant discourse section. Then we use the discourse-related information to modify
the word-level attention function.
$$c_t = \sum_{j=1}^{N} \sum_{i=1}^{M} \alpha_{(j,i)}^{(t)} h_{(j,i)}^{(e)}$$

$$\alpha_{(j,i)}^{(t)} = \mathrm{softmax}\left(\beta_j^{(t)} \cdot \mathrm{score}\left(h_{(j,i)}^{(e)}, h_{t-1}^{(d)}\right)\right)$$

$$\beta_j^{(t)} = \mathrm{softmax}\left(\mathrm{score}\left(h_j^{(s)}, h_{t-1}^{(d)}\right)\right)$$
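Assuming a `score` function like the additive scorer sketched earlier (returning unnormalized scores for a batch of states), the discourse-aware weighting could look like the following: the section weights $\beta_j^{(t)}$ scale the word-level scores before a single softmax over all source positions. Names and shapes are assumptions, not the authors' code.

```python
import torch

def discourse_aware_attention(word_states, sec_states, dec_state, score):
    """Sketch of beta-scaled word attention; `score` maps (states, dec_state) -> scores (assumed)."""
    # word_states: (N, M, H) encoder word states h_(j,i)^(e)
    # sec_states:  (N, H)    section states h_j^(s)
    # dec_state:   (H,)      decoder state h_{t-1}^(d)
    n, m, h = word_states.shape
    beta = torch.softmax(score(sec_states, dec_state), dim=0)          # (N,) section weights beta_j^(t)
    word_scores = score(word_states.reshape(n * m, h), dec_state).view(n, m)
    # softmax over all (j, i), with each word score scaled by its section weight
    alpha = torch.softmax((beta.unsqueeze(-1) * word_scores).flatten(), dim=0).view(n, m)
    c_t = (alpha.unsqueeze(-1) * word_states).sum(dim=(0, 1))          # context vector c_t
    return c_t, alpha, beta
```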

At each timestep $t$, the decoder state $h_t^{(d)}$ and the context vector $c_t$ are used to estimate the probability of the next word $y_t$:

$$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}\left(V^{T} \cdot \mathrm{linear}\left(h_t^{(d)}, c_t\right)\right)$$

where V is a vocabulary weight matrix.
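A small sketch of this output layer with assumed sizes: the decoder state and the context vector are combined linearly and projected through a vocabulary matrix $V$.

```python
import torch
import torch.nn as nn

hid_dim, vocab_size = 256, 50_000                     # assumed sizes
out_linear = nn.Linear(2 * hid_dim, hid_dim)          # linear(h_t, c_t) over the concatenation
V = nn.Linear(hid_dim, vocab_size, bias=False)        # vocabulary weight matrix V

h_t = torch.randn(hid_dim)                            # decoder state h_t^(d)
c_t = torch.randn(hid_dim)                            # context vector from the attention above
p_vocab = torch.softmax(V(out_linear(torch.cat([h_t, c_t]))), dim=-1)   # p(y_t | y_1:t-1)
```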


Copying from source
We address the problem of unknown-token prediction by allowing the model to occasionally copy words directly from the source instead of generating a new token.
We add an additional binary variable $z_t$ to the decoder, indicating whether to generate a word from the vocabulary ($z_t = 0$) or copy a word from the source ($z_t = 1$). The probability is learned during training according to the following equation:
$$p(z_t = 1 \mid y_{1:t-1}) = \sigma\left(\mathrm{linear}\left(h_t^{(d)}, c_t, x'_t\right)\right)$$

where $x'_t$ is the decoder input at timestep $t$.
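A sketch of this copy switch with assumed sizes: a sigmoid over a linear combination of the decoder state, the context vector, and the decoder input gives $p(z_t = 1)$, the probability of copying rather than generating.

```python
import torch
import torch.nn as nn

hid_dim, emb_dim = 256, 128                              # assumed sizes
switch = nn.Linear(2 * hid_dim + emb_dim, 1)             # linear(h_t, c_t, x'_t) over the concatenation

h_t, c_t = torch.randn(hid_dim), torch.randn(hid_dim)    # decoder state and context vector
x_prime_t = torch.randn(emb_dim)                         # decoder input at timestep t
p_copy = torch.sigmoid(switch(torch.cat([h_t, c_t, x_prime_t])))   # p(z_t = 1 | y_1:t-1)
```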


Then the next word 𝑦𝑡 is generated according to

$$p(y_t \mid y_{1:t-1}) = \sum_{z \in \{0,1\}} p(y_t, z_t = z \mid y_{1:t-1})$$

The joint probability is decomposed as


$$p(y_t, z_t = 1 \mid y_{1:t-1}) = p_c(y_t \mid y_{1:t-1})\, p(z_t = 1 \mid y_{1:t-1})$$

$$p(y_t, z_t = 0 \mid y_{1:t-1}) = p_g(y_t \mid y_{1:t-1})\, p(z_t = 0 \mid y_{1:t-1})$$

where $p_g$ is the probability of generating a word and $p_c$ is the probability of copying a word from the source:

$$p_c(y_t = x_l \mid y_{1:t-1}) = \sum_{(j,i):\, x_{(j,i)} = x_l} \alpha_{(j,i)}^{(t)}$$
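A sketch of how the copy distribution and the final mixture might be computed (shapes, names, and the integer source-id tensor are assumptions): attention mass is scattered onto vocabulary ids to form $p_c$, then the generate and copy distributions are mixed with the switch probability, which carries out the marginalization over $z_t$.

```python
import torch

def copy_generate_mixture(p_vocab, alpha, src_ids, p_copy, vocab_size):
    # p_vocab: (vocab_size,) generation distribution p_g
    # alpha:   (N, M)        discourse-aware attention weights alpha_(j,i)^(t)
    # src_ids: (N, M)        vocabulary id of each source word x_(j,i), as a long tensor
    p_c = torch.zeros(vocab_size)
    p_c.index_add_(0, src_ids.flatten(), alpha.flatten())   # p_c(y_t = x_l) = sum of alpha over matching words
    return p_copy * p_c + (1.0 - p_copy) * p_vocab          # marginalize over z_t in {0, 1}
```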

Decoding Coverage
In long sequences, neural generation models tend to repeat phrases, with the softmax layer predicting the same phrase over multiple timesteps. We track attention coverage to avoid repeatedly attending to the same source positions, using a coverage vector:
$$\mathrm{cov}_{(j,i)}^{(t)} = \sum_{k=0}^{t-1} \alpha_{(j,i)}^{(k)}$$

We incorporate the decoder coverage as an additional input to the attention function


$$\alpha_{(j,i)}^{(t)} = \mathrm{softmax}\left(\beta_j^{(t)} \cdot \mathrm{score}\left(h_{(j,i)}^{(e)}, \mathrm{cov}_{(j,i)}^{(t)}, h_{t-1}^{(d)}\right)\right)$$
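One way to wire coverage into the scorer, as a sketch under assumed shapes (treating coverage as an extra input feature to the additive score, which is an implementation assumption): the running sum of past attention weights is concatenated with the encoder and decoder states before scoring, and updated with each step's attention.

```python
import torch
import torch.nn as nn

class CoverageAttention(nn.Module):
    """Word-level additive attention with a coverage input (a sketch, not the authors' code)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.linear = nn.Linear(enc_dim + dec_dim + 1, attn_dim)   # extra slot for cov_(j,i)^(t)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, word_states, dec_state, beta, coverage):
        # word_states: (N, M, enc_dim), dec_state: (dec_dim,), beta: (N,), coverage: (N, M)
        n, m, _ = word_states.shape
        dec_rep = dec_state.expand(n, m, -1)
        feats = torch.cat([word_states, coverage.unsqueeze(-1), dec_rep], dim=-1)
        scores = self.v_a(torch.tanh(self.linear(feats))).squeeze(-1)           # (N, M)
        alpha = torch.softmax((beta.unsqueeze(-1) * scores).flatten(), 0).view(n, m)
        new_coverage = coverage + alpha                                         # accumulate attention for step t+1
        return alpha, new_coverage
```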

4. Related Work
5. Data
6. Experiments
7. Conclusion and Future Work
May 18, 2023
Chu Dinh Duc
