Unsupervised Text Summarization Using Sentence Embeddings
Kushal Chauhan
Aug 6, 2018 · 13 min read
Humans are generally quite good at this task as we have the capacity to understand the
meaning of a text document and extract salient features to summarize the documents using
our own words. However, automatic methods for text summarization are crucial in today’s
world where there is an over-abundance of data and lack of manpower as well as time to
interpret the data. There are many reasons why Automatic Text Summarization is useful.
Summarization approaches are usually categorized along the following lines.
Based on input type:
1. Single Document, where the input length is short. Many of the early summarization
systems dealt with single-document summarization.
Based on purpose:
1. Generic, where the model makes no assumptions about the domain or content of the
text to be summarized and treats all inputs as homogeneous. The majority of the work
that has been done revolves around generic summarization.
2. Query-based, where the summary only contains information which answers natural-
language questions about the input text.
Based on output type:
1. Extractive, where important sentences are selected from the input text to form a
summary. Most summarization approaches today are extractive in nature.
2. Abstractive, where the model forms its own phrases and sentences to offer a more
coherent summary, like what a human would generate. This approach is definitely
more appealing, but much more difficult than extractive summarization.
My Task
The task was to perform Text Summarization on emails in languages such as English,
Danish, French, etc. using Python. Most publicly available datasets for text summarization
are for long documents and articles. As the structure of long documents and articles
significantly differs from that of short emails, models trained with supervised methods may
suffer from poor domain adaptation. Therefore, I chose to explore unsupervised methods
for unbiased prediction of summaries.
Now, let us try to understand the various steps which constitute the model pipeline. Here
are some sample emails (in English, Danish, and Italian) that we will work with:
Hi Jane,
Thank you for keeping me updated on this issue. I'm happy to hear that
the issue got resolved after all and you can now use the app in its full
functionality again.
Also many thanks for your suggestions. We hope to improve this feature in
the future.
In case you experience any further problems with the app, please don't
hesitate to contact me again.
Best regards,
John Doe
Customer Support
Hei
Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå
overført til årlig fakturering.
I morgen vil dere motta faktura for hosting og drift av nettbutikk for
perioden 05.03.2018-05.03.2019.
Ta gjerne kontakt om dere har spørsmål.
Ciao John,
Grazie mille per averci contattato! Apprezziamo molto che abbiate trovato
il tempo per inviarci i vostri commenti e siamo lieti che vi piaccia
l'App.
Cordiali saluti,
Jane Doe
Customer Support
As one can see, the salutation and signature lines at the beginning and end of an email
contribute no value for the task of summary generation. It is therefore necessary to remove
these lines, which we know shouldn't influence the summary. This makes for a simpler
input on which a model can perform better.
As salutation and signature lines vary from email to email and from one language to
another, removing them requires matching against regular expressions. For this module I
used a slightly modified version of the code found in the Mailgun Talon GitHub repository,
extended to support other languages. The module also removes newline characters. A
shorter version of the code goes like this:
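The gist with the full code did not survive the export; below is a hypothetical, much-shortened sketch of what such a clean() can look like. The salutation and sign-off patterns are illustrative only, and far less thorough than Talon's actual regular expressions:

```python
import re

# Illustrative patterns only; the real module covers many more
# phrases and languages.
SALUTATION = re.compile(r'^(hi|hello|hey|dear|hei|ciao)\b', re.IGNORECASE)
SIGNOFF = re.compile(r'^(best regards|kind regards|cheers|sincerely|'
                     r'cordiali saluti|med vennlig hilsen)\b',
                     re.IGNORECASE)

def clean(email):
    """Drop salutation lines, everything after a sign-off, and newlines."""
    lines = [l.strip() for l in email.strip().splitlines()]
    body = []
    for line in lines:
        if SALUTATION.match(line):
            continue          # skip greeting lines
        if SIGNOFF.match(line):
            break             # signature starts here; drop the rest
        if line:
            body.append(line)
    return ' '.join(body)     # joining also removes newline characters
```

Running it on the English email above collapses the body into a single salutation-free, signature-free paragraph.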
Instead of modifying the code to create your own clean(), you can also use Talon's
built-in signature extraction (talon.signature.bruteforce.extract_signature) directly.
The cleaned English email:
Thank you for keeping me updated on this issue. I'm happy to hear that
the issue got resolved after all and you can now use the app in its full
functionality again. Also many thanks for your suggestions. We hope to
improve this feature in the future. In case you experience any further
problems with the app, please don't hesitate to contact me again.
The cleaned Danish email:
Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå
overført til årlig fakturering. I morgen vil dere motta faktura for
hosting og drift av nettbutikk for perioden 05.03.2018-05.03.2019. Ta
gjerne kontakt om dere har spørsmål.
The cleaned Italian email:
Grazie mille per averci contattato! Apprezziamo molto che abbiate trovato
il tempo per inviarci i vostri commenti e siamo lieti che vi piaccia
l'App. Sentitevi liberi di parlare di con i vostri amici o di sostenerci
lasciando una recensione nell'App Store.
Once we are done with the preprocessing step, we can proceed to explore the rest of the
summarization pipeline.
The Skip-Gram Word2Vec Model is trained to predict surrounding words given an input
word.
For sentence embeddings, one easy way is to take a weighted sum of the word vectors for
the words contained in the sentence. We take a weighted sum because frequently occurring
words such as 'and', 'to' and 'the' provide little or no information about the sentence,
whereas rarely occurring words, which are unique to a few sentences, have much more
representative power. Hence, we take the weights to be inversely related to the frequency
of word occurrence. This method is described in detail in this paper.
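The weighting scheme can be sketched in a few lines of plain Python. This is an illustration of the idea only (word frequencies are estimated from the corpus itself, and `word_vectors` is assumed to be a word-to-vector mapping such as one loaded from fastText); the paper's full method additionally removes the first principal component:

```python
from collections import Counter

def sentence_embedding(sentences, word_vectors, a=1e-3):
    """Weighted average of word vectors, with per-word weight
    a / (a + p(w)), i.e. inversely related to word frequency."""
    # Estimate word probabilities p(w) from the corpus itself.
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    dim = len(next(iter(word_vectors.values())))
    embeddings = []
    for s in sentences:
        words = [w for w in s.split() if w in word_vectors]
        vec = [0.0] * dim
        for w in words:
            weight = a / (a + counts[w] / total)
            for i in range(dim):
                vec[i] += weight * word_vectors[w][i]
        if words:
            vec = [v / len(words) for v in vec]
        embeddings.append(vec)
    return embeddings
```

Frequent words like 'the' receive tiny weights, so the rare, content-bearing words dominate each sentence vector.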
However, these unsupervised methods do not take the sequence of words in the sentence
into account. This may incur undesirable losses in model performance. To overcome this, I
chose instead to train a Skip-Thoughts sentence encoder in a self-supervised manner
(predicting neighboring sentences) using Wikipedia dumps as training data. The
Skip-Thoughts model consists of two parts:
1. Encoder Network: The encoder is a GRU-RNN which generates a fixed-length vector
representation h(i) for each input sentence S(i).
2. Decoder Network: The decoder network takes this vector representation h(i) as input
and tries to generate two sentences — S(i-1) and S(i+1), which could occur before and
after the input sentence respectively. Separate decoders are implemented for generation
of previous and next sentences, both being GRU-RNNs. The vector
representation h(i) acts as the initial hidden state for the GRUs of the decoder networks.
Given a sentence (the grey dots), the model attempts to predict the preceding sentence (red
dots) and the next sentence (green dots). Image
Source: https://arxiv.org/pdf/1506.06726.pdf
For implementation, I have used code made open-source by the author of the skip-thoughts
paper. It is written in Theano and can be found here. The task of obtaining embeddings for
every sentence in an email can be done in a few lines of code:
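With the pre-trained model files downloaded, encoding sentences follows the usage shown in the repository's README (a sketch; it requires Theano and the model files on disk):

```python
# Sketch based on the README of github.com/ryankiros/skip-thoughts.
import skipthoughts

# The cleaned sentences of one email.
sentences = ["Thank you for keeping me updated on this issue.",
             "Also many thanks for your suggestions."]

model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
# One 4800-dimensional vector per sentence
# (2400-d uni-skip concatenated with 2400-d bi-skip).
encoded = encoder.encode(sentences)
```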
Skip-Thoughts Encoder-Decoder Architecture
Step-5: Clustering
After generating sentence embeddings for each sentence in an email, the approach is
to cluster these embeddings in high-dimensional vector space into a pre-defined number of
clusters. The number of clusters is equal to the desired number of sentences in the
summary. I chose the number of sentences in the summary to be the square root of the
total number of sentences in the email. One could also set it to, say, 30% of the total
number of sentences. Here's the code that can do the clustering for you:
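The code block itself is missing from this export; the step presumably amounts to scikit-learn's KMeans with k set by the square-root rule. A sketch (the function name is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(encoded):
    """Cluster sentence embeddings into sqrt(n) clusters,
    one cluster per desired summary sentence."""
    n_clusters = int(np.ceil(len(encoded) ** 0.5))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(encoded)
    return kmeans
```

For a 9-sentence email this yields 3 clusters, and hence a 3-sentence summary.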
Step-6: Summarization
Each cluster of sentence embeddings can be interpreted as a set of semantically similar
sentences whose meaning can be expressed by just one candidate sentence in the summary.
The candidate sentence is chosen to be the sentence whose vector representation is closest
to the cluster center. Candidate sentences corresponding to each cluster are then ordered to
form a summary for an email. The order of the candidate sentences in the summary is
determined by the positions of the sentences in their corresponding clusters in the original
email. For example, a candidate sentence is chosen as the first line in the summary if most
of the sentences that lie in its cluster occur at the beginning of the email. The following lines
of code implements this:
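The exported code block is missing; the logic described above can be sketched with NumPy as follows. The name `summarize` is mine, and ordering candidates by their cluster's average position is one plausible reading of the description:

```python
import numpy as np

def summarize(sentences, encoded, centers):
    """For each cluster, pick the sentence closest to the cluster
    center, then order the picks by the average original position
    of the sentences in that cluster."""
    encoded = np.asarray(encoded)
    labels = np.array([np.argmin(np.linalg.norm(centers - e, axis=1))
                       for e in encoded])
    picks = []
    for c in range(len(centers)):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # Candidate: the member embedding closest to the center.
        dists = np.linalg.norm(encoded[members] - centers[c], axis=1)
        candidate = members[np.argmin(dists)]
        order = members.mean()  # average position of the cluster
        picks.append((order, sentences[candidate]))
    return ' '.join(s for _, s in sorted(picks))
```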
As this method essentially extracts some candidate sentences from the text to form a
summary, it is known as Extractive Summarization.
Sample summaries obtained for the above emails are down below:
I'm happy to hear that the issue got resolved after all and you can now
use the app in its full functionality again. Also many thanks for your
suggestions. In case you experience any further problems with the app,
please don't hesitate to contact me again.
For the Danish email:
Grunnet manglende dekning på deres kort for månedlig trekk, blir dere nå
overført til årlig fakturering. I morgen vil dere motta faktura for
hosting og drift av nettbutikk for perioden 05.03.2018-05.03.2019. Ta
gjerne kontakt om dere har spørsmål.
Training
Pre-trained models were available for encoding English sentences (see the repository for
more details). For Danish sentences however, the skip-thought model had to be trained.
The data was taken from Danish Wikipedia dumps, which you can get here. The .bz2 archive
was extracted and the resultant .xml was parsed to strip the markup so that only the plain
text remained. There are many tools available for parsing Wikipedia dumps, and none of
them is perfect; they can also take a lot of time, depending on the approach used for
parsing. I used the tool from here, which isn't the best available but is free and does the
job in a reasonable amount of time. Simple pre-processing, such as removing newlines, was
then performed on the plain text. This produced a large amount of training data on which
the skip-thoughts model could train for days.
The training data thus generated consisted of 2,712,935 Danish sentences from Wikipedia
articles. The training process also requires pre-trained Word2Vec word vectors. For this, I
used Facebook's pretrained fastText vectors for Danish (just the wiki.da.vec file and
not wiki.da.bin, hence without the vocabulary-expansion feature). The pre-trained vectors
had a vocabulary of 312,956 words. As these word vectors were also trained on Danish
Wikipedia, out-of-vocabulary words were quite rare. The training code used is also
available in the repository.
Implementation Details
Below is a simplified version of the module which supports only English emails, but
implements all of the steps mentioned above and works surprisingly well. The module
along with instructions on how to run it are present in this GitHub repository for your
reference. Feel free to fork and modify the code as you wish!
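Since that code block did not survive the export either, here is a self-contained, stdlib-only sketch of how the steps fit together, with a pluggable `embed` function (in practice the skip-thoughts encoder; a toy embedding is enough to exercise the plumbing):

```python
import math
import re

def summarize_text(text, embed, iters=20):
    """End-to-end extractive pipeline: split into sentences, embed,
    run k-means with k = ceil(sqrt(n)), pick the sentence closest to
    each centroid, and order picks by their cluster's average
    position in the original text."""
    sents = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    vecs = [embed(s) for s in sents]
    k = max(1, math.ceil(len(sents) ** 0.5))

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Plain k-means with deterministic, spread-out initialization.
    centers = [vecs[i * len(vecs) // k] for i in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(v, centers[c]))
                  for v in vecs]
        for c in range(k):
            members = [vecs[i] for i, l in enumerate(labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]

    labels = [min(range(k), key=lambda c: dist(v, centers[c])) for v in vecs]
    picks = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            best = min(members, key=lambda i: dist(vecs[i], centers[c]))
            picks.append((sum(members) / len(members), best))
    return ' '.join(sents[i] for _, i in sorted(picks))
```

With sentence length as a toy embedding, the four-sentence text below clusters into short and long sentences and keeps one representative of each, in original order.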
Results
As you can see, this method fares much better when the email consists of several
sentences rather than just 2–3. For a three-sentence email, the summary will consist of
two sentences (the square root of three, rounded up), which shouldn't be the case.
Moreover, those three sentences may convey entirely different things, and omitting
information from any one of them is undesirable. For this very reason, extractive
methods are normally not preferred for summarizing short inputs; supervised Seq2Seq
models are better suited for that task. In this case, however, emails are generally
longer, and extractive methods work surprisingly well.
One disadvantage of using skip-thought vectors is that the model can take a long time to
train. Although acceptable results were obtained after 2–3 days of training, the Danish
Skip-Thoughts model was trained for about a week. The cost fluctuated a lot during
iterations since it was normalized by sentence length.
To see how well the Skip-Thoughts model works, we can look at most similar pairs of
sentences in our dataset:
I can assure you that our developers are already aware of the issue and
are trying to solve it as soon as possible.
AND
I have already forwarded your problem report to our developers and they
will now investigate this issue with the login page in further detail in
order to detect the source of this problem.
--------------------------------------------------------------------
I am very sorry to hear that.
AND
We sincerely apologize for the inconvenience caused.
--------------------------------------------------------------------
Therefore, I would kindly ask you to tell me which operating system you
are using the app on.
AND
Can you specify which device you are using as well as the Android or iOS
version it currently has installed?
As is evident from above, the model works surprisingly well and can flag similar sentences
even when the sentences largely differ in length and the vocabulary used is entirely
different.
Possible Improvements
The approach presented here works pretty well, but is not perfect. There are many
improvements that could be made by increasing the model's complexity.
Infrastructure Setup
All of the above experiments were performed on an n1-highmem-8 Google Cloud instance
with an octa-core Intel(R) Xeon(R) CPU, 52 GB of RAM, and an Nvidia Tesla K80 GPU.
A special thanks to my mentor Rahul Kumar for his advice and useful suggestions along the
way, without whom this would not have been possible. I also owe a debt of gratitude
to Jatana.ai for giving me this wonderful opportunity and the necessary resources to
accomplish it.