Description
Automatic summarization aims to extract and present the most important content of an information source to the user. Two types of summaries are generally produced: an extract, i.e., a summary containing text segments copied from the input, and an abstract, i.e., a summary consisting of text segments that are not present in the input.
One difficulty of summary evaluation is that it involves human judgments of different quality criteria such as coherence, readability and content. There is no single correct summary, and a system may output a good summary that is quite different from a human reference summary (the same problem arises for machine translation, speech synthesis, etc.).
Approach
Traditionally, summarization evaluation compares system output summaries with sentences previously selected by human assessors or judges. The basic idea is that automatic evaluation should correlate with human assessment.
Two main methods are used for evaluating text summarization. Intrinsic evaluation compares machine-generated summaries with human-generated summaries and is considered system-focused evaluation. Extrinsic evaluation measures the performance of summaries in downstream tasks and is considered task-specific evaluation.
Both methods require significant human resources, relying on key-sentence (or sentence-fragment) mark-up and human-generated summaries for the source documents. Summarization evaluation measures provide a score that can be used to rank and compare different summaries of a document.
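The correlation with human assessment mentioned above is typically checked at the system level: each system receives an average human judgment and an average automatic score over the same document set, and the two rankings are compared. Below is a minimal sketch, assuming hypothetical score lists and using standard Pearson and Spearman coefficients; the numbers are purely illustrative.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical system-level scores averaged over a common document set.
human_scores  = [4.2, 3.1, 3.8, 2.5, 4.6]       # e.g. mean human content judgments
metric_scores = [0.41, 0.29, 0.35, 0.22, 0.44]  # e.g. mean automatic metric scores

r, _ = pearsonr(human_scores, metric_scores)     # linear correlation
rho, _ = spearmanr(human_scores, metric_scores)  # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A measure is considered useful to the extent that such correlations are high and stable across evaluation campaigns.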
Measures
Sentence precision/recall based evaluation (see the first sketch after this list)
Content similarity measures
ROUGE (Lin, 2004), cosine similarity, n-gram overlap, LSI (Latent Semantic Indexing), etc. (a simplified n-gram overlap computation is sketched after this list)
Sentence Rank
Utility measures
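For extract summaries, sentence precision/recall reduces to set overlap between the sentences selected by the system and those marked by a human judge. A minimal sketch, assuming sentences are identified by their indices in the source document (the function name and inputs are illustrative):

```python
def sentence_precision_recall(system_ids, reference_ids):
    """Precision, recall and F1 over extracted sentence identifiers."""
    overlap = len(system_ids & reference_ids)
    precision = overlap / len(system_ids) if system_ids else 0.0
    recall = overlap / len(reference_ids) if reference_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the system extracted sentences 1, 4 and 7; the judge marked 1, 2, 4 and 9.
print(sentence_precision_recall({1, 4, 7}, {1, 2, 4, 9}))  # ≈ (0.667, 0.5, 0.571)
```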
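Content similarity measures compare the words or n-grams of a candidate summary with those of one or more reference summaries. The sketch below shows the simplest single-reference form of ROUGE-N (Lin, 2004), i.e. the proportion of reference n-grams also found in the candidate; whitespace tokenization and the example sentences are assumptions for illustration, not part of the official ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matches / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     "the cat was sitting on the mat"))  # 0.5
```

Cosine similarity and LSI-based measures follow the same idea but compare vector representations of the texts rather than raw n-gram counts.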
Projects
Ongoing
NTCIR (NII Test Collection for IR Systems) includes Text Summarization tasks, e.g. MuST (Multimodal Summarization for Trend Information) at NTCIR-7.
Past
TAC (Text Analysis Conference): Recognizing Textual Entailment (RTE), Summarization, etc.
TIPSTER: see the TIPSTER Text Summarization Evaluation (SUMMAC).
TIDES (Translingual Information Detection, Extraction and Summarization) included several evaluation projects:
- Information Retrieval: HARD (High Accuracy Retrieval from Documents).
- Information Detection: TDT (Topic Detection and Tracking).
- Information Extraction: ACE (Automatic Content Extraction).
- Summarization: DUC (Document Understanding Conference). DUC has moved to the Text Analysis Conference (TAC).
CHIL (Computers in the Human Interaction Loop) included a Text Summarization task.
GERAF (Guide pour l’Evaluation des Résumés Automatiques Français): Guide for the Evaluation of Automatic Summarization in French.
Events
Past
ACL-IJCNLP 2009 Workshop: Language Generation and Summarisation
TAC 2009 Workshop (Text Analysis Conference).
RANLP 2009
CLIAWS3 (3rd Workshop on Cross Lingual Information Access)
Multi-source, Multilingual Information Extraction and Summarization Workshop at RANLP 2007
TSC-3 (Text Summarization Challenge) at NTCIR-4
Text Summarization Branches Out Workshop at ACL 2004
DUC 2003 (HLT-NAACL Text Summarization Workshop)
Tools
N/A
LRs
N/A
References
Bibliography
- Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25-26.