Query-focused multi-document summarization: Automatic data annotations and supervised learning approaches

Authors:

Sadid A. Hasan

Natural Language Engineering, Volume 18, Issue 1

Pages 109 - 145

https://doi.org/10.1017/S1351324911000167

Published: 01 January 2012

Abstract

In this paper, we apply several supervised learning techniques to build query-focused multi-document summarization systems, where the task is to produce automatic summaries in response to a given query or a specific information request stated by the user. Supervised training requires a large amount of labeled data, and labeling that data manually is expensive and time-consuming. Automatic labeling can remedy this problem. We employ five automatic annotation techniques to build extracts from human abstracts: ROUGE, Basic Element overlap, a syntactic similarity measure, a semantic similarity measure, and the Extended String Subsequence Kernel. The supervised methods we use are Support Vector Machines, Conditional Random Fields, Hidden Markov Models, Maximum Entropy, and two ensemble-based approaches. Across our experiments, we analyze how the automatic labeling methods affect the performance of the supervised methods. To our knowledge, no other study has investigated and compared in depth the effects of different automatic annotation techniques on different supervised learning approaches in the domain of query-focused multi-document summarization.
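
The idea of building extracts from human abstracts can be illustrated with a minimal sketch. It is not the procedure used in the paper, whose details are not given in the abstract: it assumes a simplified ROUGE-1-style unigram recall in place of the ROUGE toolkit, a naive regular-expression tokenizer, and a fixed number of positive labels per document set. The names `tokenize`, `unigram_recall`, `label_sentences`, and `top_n` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate, reference):
    """ROUGE-1-style recall: clipped unigram overlap divided by reference length."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())


def label_sentences(sentences, abstract, top_n=2):
    """Label the top_n sentences closest to the abstract +1 and the rest -1."""
    scores = [unigram_recall(s, abstract) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    positive = set(ranked[:top_n])
    return [1 if i in positive else -1 for i in range(len(sentences))]


if __name__ == "__main__":
    # Toy document sentences and a toy human abstract (illustrative data only).
    sentences = [
        "The storm flooded the coastal town.",
        "Officials ordered an evacuation of the town.",
        "A local bakery won an award.",
    ]
    abstract = "A storm flooded the town and officials ordered an evacuation."
    print(label_sentences(sentences, abstract, top_n=2))  # prints [1, 1, -1]
```

The resulting +1/-1 labels could then serve as training targets for a sentence classifier such as a Support Vector Machine. Swapping `unigram_recall` for another similarity function would correspond to trying a different annotation technique.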

Index Terms
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

Natural Language Engineering, Volume 18, Issue 1

January 2012

150 pages

ISSN: 1351-3249

Publisher

Cambridge University Press

United States

Qualifiers

Article

Cited By

Li W, Zhuge H (2021). Abstractive Multi-Document Summarization Based on Semantic Link Network. IEEE Transactions on Knowledge and Data Engineering 33:1 (43-54). DOI: 10.1109/TKDE.2019.2922957. Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.1109/TKDE.2019.2922957
Filice S, Cohen N, Carmel D (2020). Voice-based Reformulation of Community Answers. Proceedings of The Web Conference 2020 (2885-2891). DOI: 10.1145/3366423.3380053. Online publication date: 20-Apr-2020
https://dl.acm.org/doi/10.1145/3366423.3380053
Cohan A, Goharian N (2018). Scientific document summarization via citation contextualization and scientific discourse. International Journal on Digital Libraries 19:2-3 (287-303). DOI: 10.1007/s00799-017-0216-8. Online publication date: 1-Sep-2018
https://dl.acm.org/doi/10.1007/s00799-017-0216-8
Wang W, Li Z, Wang J, Zheng Z (2017). How far we can go with extractive text summarization? Heuristic methods to obtain near upper bounds. Expert Systems with Applications: An International Journal 90:C (439-463). DOI: 10.1016/j.eswa.2017.08.040. Online publication date: 30-Dec-2017
https://dl.acm.org/doi/10.1016/j.eswa.2017.08.040
Gambhir M, Gupta V (2017). Recent automatic text summarization techniques. Artificial Intelligence Review 47:1 (1-66). DOI: 10.1007/s10462-016-9475-9. Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1007/s10462-016-9475-9
Chali Y, Hasan S (2015). Towards topic-to-question generation. Computational Linguistics 41:1 (1-20). DOI: 10.1162/COLI_a_00206. Online publication date: 1-Mar-2015
https://dl.acm.org/doi/10.1162/COLI_a_00206
Chali Y, Hasan S, Mojahid M (2015). A reinforcement learning formulation to the complex question answering problem. Information Processing and Management: an International Journal 51:3 (252-272). DOI: 10.1016/j.ipm.2015.01.002. Online publication date: 1-May-2015
https://dl.acm.org/doi/10.1016/j.ipm.2015.01.002
Alguliyev R, Aliguliyev R, Isazade N (2015). An unsupervised approach to generating generic summaries of documents. Applied Soft Computing 34:C (236-250). DOI: 10.1016/j.asoc.2015.04.050. Online publication date: 1-Sep-2015
https://dl.acm.org/doi/10.1016/j.asoc.2015.04.050
