Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining


Volume 12

June 2016

Authors:
ChengXiang Zhai, University of Illinois at Urbana-Champaign
Sean Massung, University of Illinois at Urbana-Champaign

Publisher:

Association for Computing Machinery and Morgan & Claypool

ISBN: 978-1-970001-17-4

DOI: https://doi.org/10.1145/2915031

Published: 23 June 2016

Pages:

1286

Appears In:

ACM Books


Abstract

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media (such as blog articles, forum posts, product reviews, and tweets). This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (thus are relatively easy for computers to handle), text has less explicit structure, requiring computer processing toward understanding of the content encoded in text. The current technology of natural language processing has not yet reached a point to enable a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they can assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users. This book covers the major concepts, techniques, and ideas in information retrieval and text data mining from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of information retrieval and text mining to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. This book can be used as a textbook for computer science undergraduates and graduates, library and information scientists, or as a reference book for practitioners working on relevant problems in managing and analyzing text data.

Chapters

Preface

Chapter 1: Introduction

Chapter 2: Background

Chapter 3: Text Data Understanding

Chapter 4: MeTA: A Unified Toolkit for Text Data Management and Analysis

Chapter 5: Overview of Text Data Access

Chapter 6: Retrieval Models

Chapter 7: Feedback

Chapter 8: Search Engine Implementation

Chapter 9: Search Engine Evaluation

Chapter 10: Web Search

Chapter 11: Recommender Systems

Chapter 12: Overview of Text Data Analysis

Chapter 13: Word Association Mining

Chapter 14: Text Clustering

Chapter 15: Text Categorization

Chapter 16: Text Summarization

Chapter 17: Topic Analysis

Chapter 18: Opinion Mining and Sentiment Analysis

Chapter 19: Joint Analysis of Text and Structured Data

Chapter 20: Toward A Unified System for Text Management and Analysis

Appendixes

References

Index

References

C. C. Aggarwal. 2015. Data Mining - The Textbook. Springer. DOI: 10.1007/978-3-319-14142-8.Google Scholar
C. C. Aggarwal and C. Zhai, editors. 2012. Mining Text Data. Springer. DOI: 10.1007/978-1-4614-3223-4.Google Scholar
J. Allen. 1995. Natural Language Understanding. 2nd ed. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA.Google Scholar
G. Amati and C. J. Van Rijsbergen. October 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389. DOI: 10.1145/582415.582416.Google ScholarDigital Library
A. U. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. 2009. On smoothing and inference for topic models. In UAI 2009, Proc. of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pp. 27–34.Google Scholar
R. A. Baeza-Yates and B. A. Ribeiro-Neto. 2011. Modern Information Retrieval - the concepts and technology behind search. 2nd ed. Pearson Education Ltd., Harlow, UK. http://www.mir2ed.org/.Google Scholar
Y. Bar-Hillel, The Present Status of Automatic Translation of Languages, in Advances in Computers, vol. 1 (1960), pp. 91–163.Google Scholar
R. Belew. 2008. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press.Google Scholar
N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35(12):29–38. DOI: 10.1145/138859.138861.Google ScholarDigital Library
C. M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ.Google Scholar
D. M. Blei, A. Y. Ng, and M. I. Jordan. March 2003. Latent Dirichlet Allocation. J. of Mach. Learn. Res., 3:993–1022.Google Scholar
J. S. Breese, D. Heckerman, and C. Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, Morgan Kaufmann Publishers Inc. pp. 43–52, San Francisco, CA. http://dl.acm.org/citation.cfm?id=2074094.2074100.Google Scholar
P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-based N-gram Models of Natural Language. Comput. Linguist., 18(4):467–479.Google Scholar
C. Buckley. 1994. Automatic query expansion using smart: Trec 3. In Proc. of The third Text REtrieval Conference (TREC-3, pp. 69–80.Google Scholar
S. Büttcher, C. Clarke, and G. V. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press.Google ScholarDigital Library
F. Cacheda, V. Carneiro, D. Fernández, and V. Formoso. 2011. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Trans. Web, 5(1):2:1–2:33. DOI: 10.1145/1921591.1921593.Google ScholarDigital Library
C. Campbell and Y. Ying. 2011. Learning with Support Vector Machines. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers. DOI: 10.2200/S00324ED1V01Y201102AIM010.Google Scholar
J. Carbonell and J. Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98,ACM, pp. 335–336, New York. DOI: 10.1145/290941.291025Google ScholarDigital Library
C.-C. Chang and C.-J. Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27.Google ScholarDigital Library
J. Chang, S. Gerrish, C. Wang, J. L. Boyd-graber, and D. M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. In Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, Curran Associates, Inc. 22, pp. 288–296.Google Scholar
K. W. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Comput. Linguist., 16(1):22–29. http://dl.acm.org/citation.cfm?id=89086.89095.Google Scholar
T. Cover and J. Thomas. 1991. Elements of Information Theory. New York: Wiley. DOI: 10.1002/047174882XGoogle ScholarCross Ref
B. Croft, D. Metzler, and T. Strohman. 2009. Search Engines: Information Retrieval in Practice, 1st ed., Addison-Wesley Publishing Company.Google ScholarDigital Library
D. Das and A. F. T. Martins. 2007. A Survey on Automatic Text Summarization. Technical report, Literature Survey for the Language and Statistics II course at Carnegie Mellon University.Google Scholar
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9:1871–1874.Google ScholarDigital Library
H. Fang, T. Tao, and C. Zhai. 2004. A formal study of information retrieval heuristics. In Proc. of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, ACM, pp. 49–56, New York. DOI: 10.1145/1008992.1009004.Google ScholarDigital Library
H. Fang, T. Tao, and C. Zhai. April 2011. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29(2):7:1–7:42. DOI: 10.1145/1961209.1961210.Google ScholarDigital Library
R. Feldman and J. Sanger. 2007. The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.Google Scholar
E. A. Fox, M. A. Gon„alves, and R. Shen. 2012. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: 10.2200/S00434ED1V01Y201207ICR022.Google Scholar
W. B. Frakes and R. A. Baeza-Yates, editors. 1992. Information Retrieval: Data Structures & Algorithms. Prentice-Hall,Google ScholarDigital Library
K. Ganesan, C. Zhai, and J. Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proc. of the 23rd International Conference on Computational Linguistics, COLING '10, Association for Computational Linguistics, pp. 340–348, Stroudsburg, PA.Google Scholar
K. Ganesan, C. Zhai, and E. Viegas. 2012. Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions. In Proc. of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pages 869–878. DOI: 10.1145/2187836.2187954Google ScholarDigital Library
J. Gantz, and D. Reinsel. 2012. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, IDC Report, December, 2012.Google Scholar
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 1995. Bayesian Data Analysis. Chapman & Hall.Google Scholar
S.Ghemawat, H. Gobioff, and S.-T. Leung. 2003. The Google file system. In Proc. of the nineteenth ACM symposium on Operating systems principles (SOSP '03). ACM, New York, 29–43.Google Scholar
M. A. Gon„alves, E. A. Fox, L. T. Watson, and N. A. Kipp. 2004. Streams, structures, spaces, scenarios, societies (5s): A formal model for digital libraries. ACM Trans. Inf. Syst., 22(2):270–312. DOI: 10.1145/984321.984325.Google ScholarDigital Library
D. A. Grossman and O. Frieder. Kluwer, 2004. Information Retrieval - Algorithms and Heuristics, Second Edition, vol. 15 of The Kluwer International Series on Information Retrieval. DOI: 10.1007/978-1-4020-3005-5.Google Scholar
G. Hamerly and C. Elkan. 2003. Learning the k in k-means. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pp. 281–288.Google Scholar
J. Han. 2005. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA.Google Scholar
D. Harman. 2011. Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: < 10.1145/215206.215351Google Scholar
M. A. Hearst. 2009. Search User Interfaces. 1st ed. Cambridge University Press, New York.Google Scholar
J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Trans. Inf. Syst., 22(1):5–53. DOI: 10.1145/963770.963772Google ScholarDigital Library
J. L. Hodges and E. L. Lehmann. 1970. Basic Concepts of Probability and Statistics. Holden Day, San Francisco.Google Scholar
T. Hofmann. 1999. Probabilistic Latent Semantic Analysis. In Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, Morgan Kaufmann Publishers Inc., pp. 289–296, San Francisco, CA. DOI: 10.1145/312624.312649Google ScholarDigital Library
A. Huang. 2008. Similarity Measures for Text Document Clustering. In Proc. of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pages 49–56.Google Scholar
F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.Google Scholar
J. Jiang. 2012. Information extraction from text, In Charu C. Aggarwal and ChengXiang Zhai (Eds.), Mining Text Data, Springer, pp. 11–41.Google Scholar
S. Jiang and C. Zhai. 2014. Random walks on adjacency graphs for mining lexical relations from big text data. In 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, pages 549–554. DOI: 10.1109/BigData.2014.7004272.Google ScholarCross Ref
Y. Jo and A. H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, ACM, pp. 815–824, New York. DOI: 10.1145/1935826.1935932.Google ScholarDigital Library
T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst., 25(2). DOI: 10.1145/1229179.1229181.Google ScholarDigital Library
D. Jurafsky and J. H. Martin. 2009. Speech and Language Processing. 2nd ed. Prentice-Hall, Inc., Upper Saddle River, NJ.Google Scholar
D. Kelly. 2009. Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2):1–224. DOI: 10.1561/1500000012Google ScholarDigital Library
D. Kelly and J. Teevan. 2003. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28. DOI: 10.1145/959258.959260.Google ScholarDigital Library
H. D. Kim, M. Castellanos, M. Hsu, C. Zhai, T. Rietz, and D. Diermeier. 2013. Mining causal topics in text data: iterative topic modeling with time series feedback. In Proc. of the 22nd ACM international conference on Conference on information and knowledge management, CIKM '13, ACM pages 885–890, New York, NY. DOI: 10.1145/2505515.2505612.Google ScholarDigital Library
J. M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632. DOI: 10.1145/324133.324140.Google ScholarDigital Library
J. M. Kleinberg. 2002. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada], pp. 446–453. http://papers.nips.cc/paper/2340-an-impossibility-theorem-for-clustering.Google Scholar
D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press.Google Scholar
J. Lafferty and C. Zhai. 2003. Probabilistic relevance models based on document and query generation. In W. Bruce Croft and John Lafferty, editors, Language Modeling and Information Retrieval. Kluwer Academic Publishers. DOI: 10.1007/978-94-017-0171-6\_1Google Scholar
D. Lin. 1999. Automatic identification of non-compositional phrases. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL '99, Association for Computational Linguistics, pages 317–324, Stroudsburg, PA. DOI: 10.3115/1034678.1034730.Google ScholarDigital Library
J.Lin and C. Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers. DOI: 10.2200/S00274ED1V01Y201006HLT007.Google Scholar
Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. DOI: 10.2200/S00416ED1V01Y201204HLT016.Google Scholar
T.-Y. Liu. 2009. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331. DOI: 10.1561/1500000016.Google Scholar
Y. Lv and C. Zhai. 2009. A comparative study of methods for estimating query language models with pseudo feedback. In Proc. of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, ACM, pp. 1895–1898, New York. DOI: 10.1145/1645953.1646259.Google ScholarCross Ref
Y. Lv and C. Zhai. 2010. Positional relevance model for pseudo-relevance feedback. In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, ACM, pages 579–586, New York. DOI: 10.1145/1835449.1835546.Google ScholarDigital Library
Y. Lv and C. Zhai. 2011. Lower-bounding Term Frequency Normalization. In Proc. of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pp. 7–16. DOI: 10.1145/2063576.2063584Google ScholarDigital Library
P. Lyman, H. R. Varian, K. Swearingen, P. Charles, N. Good, L.L. Jordan, and J. Pal. 2003. How much information? http://www2.sims.berkeley.edu/research/projects/how-much-info-2003.Google Scholar
C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.Google Scholar
C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York.Google Scholar
M. E. Maron and J. L. Kuhns. 1960. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7:216–244. DOI: 10.1145/321033.321035Google ScholarDigital Library
S. Massung and C. Zhai. 2015. SyntacticDiff: Operator-Based Transformation for Comparative Text Mining. In Proc. of the 3rd IEEE International Conference on Big Data, pp. 571–580.Google Scholar
S. Massung and C. Zhai. 2016. Non-Native Text Analysis: A Survey. The Journal of Natural Language Engineering, 22(2):163–186. DOI: 10.1017/S1351324915000303Google ScholarCross Ref
S. Massung, C. Zhai, and J.Hockenmaier. 2013. Structural Parse Tree Features for Text Representation. In IEEE Seventh International Conference on Semantic Computing, pp. 9–13. DOI: 10.1109/ICSC.2013.13Google ScholarDigital Library
J. D. McAuliffe and D. M. Blei. 2008. Supervised topic models. In J.C. Platt, D. Koller, Y. Singer, and S.T. Roweis, eds., Advances in Neural Information Processing Systems 20, pages 121–128. Curran Associates, Inc.Google Scholar
G. J. McLachlan and T. Krishnan. 2008. The EM algorithm and extensions. 2nd ed. Wiley Series in Probability and Statistics. Hoboken, NJ., Wiley. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&SRT=YOP&IKT=1016&IKT=1016&TRM=ppn+52983362X&sourceid=fbw_bibsonomy. DOI: 10.1002/9780470191613Google Scholar
Q. Mei. 2009. Contextual text mining. Ph.D. Dissertation, University of Illinois at Urbana-Champaign.Google Scholar
Q. Mei and C. Zhai. 2006. A mixture model for contextual text mining. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, ACM, pp. 649–655, New York. DOI: 10.1145/1150402.1150482.Google ScholarDigital Library
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. 2006. Generating semantic annotations for frequent patterns with context analysis. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, ACM, pp. 337–346, New York. DOI: 10.1145/1150402.1150441.Google ScholarDigital Library
Q. Mei, C. Liu, H. Su, and C. Zhai. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proc.of the 15th international conference on World Wide Web (WWW '06). ACM. New York, 533–542. DOI: 10.1145/1135777.1135857.Google ScholarDigital Library
Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. 2007a. Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proc. of the 16th International Conference on World Wide Web, WWW '07, ACM, pp. 171–180, New York. DOI: 10.1145/1242572.1242596.Google ScholarCross Ref
Q. Mei, X. Shen, and C. Zhai. 2007b. Automatic labeling of multinomial topic models. In Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, August 12-15, 2007, pp. 490–499. DOI: 10.1145/1281192.1281246.Google ScholarDigital Library
Q. Mei, D. Cai, D. Zhang, and C. Zhai. 2008. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, ACM, pp. 101–110, New York. DOI: 10.1145/1367497.1367512.Google ScholarDigital Library
T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048. http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.Google Scholar
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, NV, pp. 3111–3119.Google Scholar
T. M. Mitchell. 1997. Machine learning. McGraw Hill Series in Computer Science. McGraw-Hill.Google Scholar
M.-F. Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context (The Information Retrieval Series). Springer-Verlag New York, Inc., Secaucus, NJ. DOI: 10.1007/978-1-4020-4993-4.Google ScholarCross Ref
I. J. Myung. 2003. Tutorial on maximum likelihood estimation. J. Math. Psychol., 47(1):90–100. DOI: 10.1016/S0022-2496(02)00028-7.Google ScholarDigital Library
A. Nenkova and K. McKeown. 2012. A survey of text summarization techniques. In Charu C. Aggarwal and C. Zhai, eds, Mining Text Data, pp. 43–76. Springer US. DOI: 10.1007/978-1-4614-3223-4_3.Google ScholarCross Ref
L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.Google Scholar
B. Pang and L. Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135. DOI: 10.1561/1500000011Google ScholarDigital Library
J. M. Ponte and W. B. Croft. 1998. A language modeling approach to information retrieval. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, ACM, pp. 275–281, New York, NY. DOI: 10.1145/290941.291008.Google ScholarDigital Library
J. R. Quinlan. 1986. Induction of Decision Trees. Machine Learning, 1(1):81–106. DOI: 10.1007/BF00116251.Google ScholarCross Ref
D. R. Radev, H. Jing, M. Styś, and D. Tam. 2004. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919––938. DOI: 10.1016/j.ipm.2003.10.006.Google Scholar
D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. 2009. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP '09, Association for Computational Linguistics, pages 248–256, Stroudsburg, PA.Google ScholarDigital Library
E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York.Google Scholar
F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. 2010. Recommender Systems Handbook. 1st ed. Springer-Verlag New York, Inc. DOI: 10.1007/978-0-387-85820-3Google Scholar
C. J. Van Rijsbergen. 1979. Information Retrieval. 2nd ed. Butterworth-Heinemann, Newton, MA.Google Scholar
S. Robertson and K. Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146.Google ScholarCross Ref
S. E. Robertson. 1997. Readings in Information Retrieval. In The Probability Ranking Principle in IR, San Francisco, CA, Morgan Kaufmann Publishers Inc. pp. 281–286.Google Scholar
S. Robertson and H. Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr., 3(4):333–389. DOI: 10.1561/1500000019.Google Scholar
S. Robertson, H. Zaragoza, and M. Taylor. 2004. Simple BM25 Extension to Multiple Weighted Fields. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pp. 42–49. DOI: 10.1145/1031171.1031181Google ScholarDigital Library
C. Roe. 2012. The growth of unstructured data: what to do with all those zettabytes? http://www.dataversity.net/the-growth-of-unstructured-data-what-are-we-going-to-do-with-all-those-zettabytes/.Google Scholar
R. Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here. In Proceedings of the IEEE.Google ScholarCross Ref
G. Salton. 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley.Google Scholar
G. Salton and M. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.Google Scholar
G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620.Google Scholar
G. Salton and C. Buckley. 1990. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288–297.Google ScholarCross Ref
M. Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4):247–375.Google Scholar
M. Sanderson and W. B. Croft. 2012. The history of information retrieval research. Proc. of the IEEE, 100(Centennial-Issue):1444–1451, 2012. DOI: 10.1109/JPROC.2012.2189916.Google ScholarCross Ref
S. Sarawagi. 2008. Information extraction. Found. Trends databases, 1(3):261–377. DOI: 10.1561/1900000003.Google Scholar
F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47. DOI: 10.1145/505282.505283.Google ScholarDigital Library
G. Shani and A. Gunawardana. 2011. Evaluating Recommendation Systems. In Recommender Systems Handbook, 2nd ed., pp. 257–297. Springer, New York, NY. DOI: 10.1007/978-0-387-85820-3_8.Google Scholar
F. Silvestri. 2010. Mining query logs: Turning search usage data into knowledge. Found. Trends Inf. Retr., 4:1–174. DOI: 10.1561/1500000013Google ScholarDigital Library
A. Singhal, C. Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96,ACM, pp. 21–29, New York. DOI: 10.1145/243199.243206.Google ScholarDigital Library
N. Smith. 2010. Text-driven forecasting. http://www.cs.cmu.edu/\~nasmith/papers/smith.whitepaper10.pdf.Google Scholar
Mark D. Smucker, James Allan, and Ben Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In Proc. of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, ACM, pp. 623–632, New York. DOI: 10.1145/1321440.1321528.Google ScholarDigital Library
K. Sparck Jones and P. Willett, eds. 1997. Readings in Information Retrieval. San Francisco, CA, Morgan Kaufmann Publishers Inc.Google Scholar
N. Spirin and J. Han. May 2012. Survey on Web Spam Detection: Principles and Algorithms. SIGKDD Explor. Newsl., 13(2):50–64. DOI: 10.1145/2207243.2207252.Google ScholarDigital Library
E. Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol., 60(3):538–556. DOI: 10.1002/asi.v60:3Google ScholarCross Ref
M. Steinbach, G. Karypis, and V. Kumar. 2000. A comparison of document clustering techniques. In KDD Workshop on Text Mining.Google Scholar
J. Steinberger and K. Jezek. 2009. Evaluation measures for text summarization. Computing and Informatics, 28(2):251–275.Google Scholar
M. Steyvers and T. Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.Google Scholar
Y. Sun and J. Han. 2012. Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool Publishers. DOI: 10.2200/S00433ED1V01Y201207DMK005.Google Scholar
I. Titov and R. McDonald. 2008. Modeling online reviews with multi-grain topic models. In Proc. of the 17th International Conference on World Wide Web, WWW '08, ACM, pp. 111–120, New York. DOI: 10.1145/1367497.1367513.Google ScholarDigital Library
H. Turtle and W. B. Croft. 1990. Inference networks for document retrieval. In Proc. of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '90, ACM, pp. 1–24, New York. DOI: 10.1145/96749.98006.Google ScholarDigital Library
Princeton University. 2010. About wordnet. http://wordnet.princeton.edu.Google Scholar
C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths.Google Scholar
H. Wang, Yue Lu, and C. Zhai. 2010. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach. In Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, ACM, pp. 783–792, New York. DOI: 10.1145/1835804.1835903.Google ScholarDigital Library
H. Wang, Y. Lu, and C. Zhai. 2011. Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, ACM, pp. 618–626, New York. DOI: 10.1145/2020408.2020505.Google ScholarDigital Library
J. Weizenbaum. 1966. ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun. ACM, 9(1):36–45. DOI: 10.1145/265153.365168.
J. S. Whissell and C. L. A. Clarke. 2013. Effective Measures for Inter-document Similarity. In Proc. of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, ACM, pp. 1361–1370, New York. DOI: 10.1145/2505515.2505526.
R. W. White and R. A. Roth. 2009. Exploratory Search: Beyond the Query-Response Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: 10.2200/S00174ED1V01Y200901ICR003.
R. W. White, B. Kules, S. M. Drucker, and m.c. schraefel. 2006. Introduction. Commun. ACM, 49(4):36–39. DOI: 10.1145/1121949.1121978.
I. H. Witten, A. Moffat, and T. C. Bell. 1999. Managing Gigabytes (2nd Ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., San Francisco, CA.
C. F. J. Wu. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103.
J. Xu and W. B. Croft. 1996. Query expansion using local and global document analysis. In Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, ACM, pp. 4–11, New York. DOI: 10.1145/243199.243202.
Y. Yang. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1:67–88.
C. Zhai. 1997. Exploiting context to identify lexical atoms—a statistical view of linguistic context. In Proc. of the International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pp. 119–129. Rio de Janeiro, Brazil.
C. Zhai. 2008. Statistical Language Models for Information Retrieval. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. DOI: 10.2200/S00158ED1V01Y200811HLT001.
C. Zhai and J. Lafferty. 2001. Model-based Feedback in the Language Modeling Approach to Information Retrieval. In Proc. of the Tenth International Conference on Information and Knowledge Management, CIKM '01, ACM, pp. 403–410, New York. DOI: 10.1145/502585.502654.
C. Zhai and J. Lafferty. 2004. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM Trans. Inf. Syst., 22(2):179–214.
C. Zhai, P. Jansen, E. Stoica, N. Grot, and D. A. Evans. 1998. Threshold Calibration in CLARIT Adaptive Filtering. In Proc. of Seventh Text REtrieval Conference (TREC-7), pp. 149–156.
C. Zhai, P. Jansen, and D. A. Evans. 2000. Exploration of a heuristic approach to threshold learning in adaptive filtering. In SIGIR, ACM, pp. 360–362. DOI: 10.1145/345508.345652.
C. Zhai, A. Velivelli, and B. Yu. 2004. A cross-collection mixture model for comparative text mining. In Proc. of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, ACM, pp. 743–748, New York. DOI: 10.1145/1014052.1014150.
D. Zhang, C. Zhai, J. Han, A. Srivastava, and N. Oza. 2009. Topic modeling for OLAP on multidimensional text databases: topic cube and its applications. Stat. Anal. Data Min., 2(5–6):378–395. DOI: 10.1002/sam.v2.5/6.
J. Zhu, A. Ahmed, and E. P. Xing. 2009. MedLDA: Maximum margin supervised topic models for regression and classification. In Proc. of the 26th Annual International Conference on Machine Learning, ICML '09, ACM, pp. 1257–1264, New York. DOI: 10.1145/1553374.1553535.
G. K. Zipf. 1949. Human Behavior and the Principle of Least-Effort. Cambridge, MA, Addison-Wesley.

Contributors

ChengXiang Zhai
University of Illinois Urbana-Champaign
Sean Massung
University of Illinois Urbana-Champaign

Reviews

Reviewer: Fernando Berzal

An old rule of thumb suggests that 90 percent of all potentially relevant business information is in unstructured form. Hence, it is no surprise that many mathematically ill-defined problems associated with text analysis have attracted a lot of attention from data mining researchers. Text data management is a more mature field, and its associated text data access problems are tackled with the help of information retrieval techniques, as the popularity of web search engines attests. Zhai and Massung have managed to write a very readable introduction to both fields and their state of the art in 500 pages. After the usual introductory chapters, which include some background information and a very cursory mention of natural language processing (NLP) techniques, they delve into text data access methods, also known as information retrieval. Here, they discuss basic techniques such as ranking documents in response to a user query. They gently introduce retrieval models and the rationale behind them until they logically reach state-of-the-art vector space models, namely pivoted-length normalization and the Okapi BM25 ranking function. They also cover probabilistic models and, by clever use of analogies with the heuristic models, clearly explain the query likelihood retrieval model and the smoothing methods often used with it. Their discussion is not only theoretical, since they also cover practical issues associated with the implementation of information retrieval systems and, as you may expect, web search engines as the most prominent example of information retrieval systems nowadays. Their analysis of web search includes crawling, indexing, and link analysis, with the usual description of Google's PageRank and Kleinberg's HITS.
The information retrieval half of this book is completed with short chapters on feedback (that is, how to take into account a user's actions to improve information retrieval results) and recommender systems, which provide relevant information to the user in "push" mode (in contrast to the "pull" mode of search and browsing, when the user initiates the requests). The second half of Zhai and Massung's textbook focuses on text mining, or "text analysis," to use the authors' preferred term. Word association mining, text clustering, text categorization, text summarization, topic modeling, opinion mining, and sentiment analysis are the main text mining problems studied in this second half of the book. Many of the discussed techniques are unavoidably application-specific, hence the authors' emphasis on the importance of feature engineering for solving problems such as text categorization, sentiment analysis, or text-based prediction. Their coverage of different problems is not without stark contrasts. For instance, a 60-page guided tour on probabilistic topic modeling, where probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are excruciatingly dissected, is followed by a shallow overview chapter on opinion mining and sentiment analysis. In this short chapter, text data is regarded as data generated by humans acting as subjective sensors, which enables mining knowledge about the human observer who generated the text data. The subjective content of text data is then analyzed using techniques such as ordinal logistic regression or latent aspect rating analysis (LARA), proposed by the first author in two KDD papers [1,2]. The text mining half of the book ends with a 30-page survey chapter on the joint analysis of text and structured data, which is a requirement in many real-world applications. In fact, non-text data can enrich text analysis, whereas text data can help interpret non-text data (for example, pattern annotation).
Three example techniques illustrate how topic analysis can be combined with non-text data in different domains: the use of different views in contextual PLSA, the network supervised topic model in NetPLSA (for the joint analysis of text and social network data), and iterative causal topic modeling for the analysis of text associated with time series. The book's final chapter is a short position paper where the authors advocate for integrated software frameworks that support both text management (that is, information retrieval) and text analysis (that is, text mining). It can be read as a broad-brush sketch of the essentials for future unified systems. In general terms, the authors typically provide verbose descriptions of the reasons behind the design of specific techniques, with numerical examples and illustrative figures from the slides of two massive open online courses (MOOCs) offered by the first author on Coursera. They also provide specific sections that describe in detail the proper way to evaluate each kind of technique, a key factor to be taken into account when applying the discussed techniques in practice. The book, however, is not always self-contained, since its broad scope in a limited number of pages entails an unavoidable depth/breadth tradeoff. Most basic techniques can be implemented just by following the instructions and guidelines in the text, although interested readers might need to resort to the bibliographic references if they want to gain a thorough understanding of the many advanced techniques. Fortunately, the authors include some bibliographic notes and very selective suggestions for further reading at the end of each chapter, instead of the encyclopedic collection of references common in many other textbooks.
Although readers will not find detailed coverage of NLP techniques and some chapters might seem lacking in depth, advanced undergraduate students might find this book to be a valuable reference for getting acquainted with both information retrieval and text mining in a single volume, a worthwhile achievement for a 500-page textbook.

Online Computing Reviews Service

Reviewer: H. Van Dyke Parunak

One of the most rapidly growing sources of data, natural-language text, is also one of the most difficult to analyze. Computerized understanding of natural language was among the earliest anticipated benefits of artificial intelligence (AI), but it has proven extraordinarily challenging. This volume offers a selective introduction to the state of the art of computerized analysis of text. As befits the subtitle, "a practical introduction ...," it situates the techniques it explains in the context of a systems view that emphasizes how natural-language processing (NLP) can be applied in real applications. Chapter 1 introduces the overall framework, distinguishing analysis of the text from various organizational processes (including search, filtering, categorization, summarization, topic analysis, information extraction, clustering, and visualization) that support the two main objectives of retrieval operations and data mining. With the exception of information extraction and visualization, the book discusses each of these operations. Chapter 2 provides an overview of mathematical background in probability and statistics, information theory, and machine learning. Chapter 3 reviews the history of NLP and text data understanding. Most of the book is limited to a bag-of-words model, though this chapter acknowledges more sophisticated techniques. Chapter 4 introduces the authors' modern text analysis (MeTA) toolkit for text data management and analysis, encouraging readers to download the open-source C++-based system and use it in examples and exercises promised later in the text. This promise of a hands-on learning experience is only partly fulfilled. Few exercises, and even fewer examples in the body of the text, actually say anything about MeTA. 
Most of the exercises that do mention it do not use it to illustrate a particular text-analytic function, but ask the user either to look to see how MeTA implements a given text-analytic function, or to extend MeTA to do something discussed in the text. Both kinds of task require the reader to delve into the source code of MeTA rather than use the functionality of the package, and thus assume a level of knowledge about MeTA well beyond anything in the text. These exercises might be useful in the context of a class where the instructor is already acquainted with the internal design and implementation of MeTA. Some other toolkits are mentioned, but there is no reference to other, important ones, such as MALLET from the University of Massachusetts at Amherst. After these four introductory chapters, the rest of the book has three parts: seven chapters devoted to accessing textual data, eight to analyzing it, and one final chapter fleshing out an overall architecture for unified text management and analysis. The chapters on accessing data discuss retrieval models, how the information retrieval system gets feedback from the user, implementation and evaluation of search engines, a special chapter on web-based search, and recommender systems. Most chapters are about 20 pages long (the median chapter length for the book is 18 pages), but the chapter on retrieval models is 46 pages long. The extra detail is useful, given the importance of this theme, but it is uneven compared with the rest of the book. The selection of retrieval methods to discuss is not clear. Early in the chapter, the authors identify "four major models that are generally regarded as state of the art: pivoted length normalization, Okapi BM25, query likelihood, and PL2." However, the rest of the chapter mentions PL2 only in passing, focusing instead on two forms of smoothing for query likelihood, JM smoothing and Dirichlet prior smoothing. 
The chapter does not discuss two very important issues in the area of retrieval: van Rijsbergen's work on The geometry of information retrieval [1], and the particular challenges posed by comparing vectors in high-dimensional spaces, which characterize most keyword-based retrieval methods. The text analysis chapters discuss word association mining, text clustering, categorization and summarization, topic analysis, opinion mining and sentiment analysis, and the joint analysis of text and structured data. Again, the level of detail is uneven. The median chapter length in this section is 24 pages, but the chapter on topic analysis occupies 60 pages. Again, the theme is an important one, but the level of detail appears to be out of balance with the rest of the book. The book includes exercises with each chapter, appendices giving further details on mathematical methods mentioned earlier in the book (Bayesian statistics, expectation maximization, and KL-divergence and Dirichlet prior smoothing), copious references, and an index. The references usefully include the page numbers on which they are cited, but there is some irregularity. For example, van Rijsbergen's important volume [1] is listed twice in the references, once alphabetized under R, and again under V.

Online Computing Reviews Service

Reviewer: Xiannong Meng

Zhai and Massung's new book Text data management and analysis provides a fresh new look at the areas of text retrieval, text mining, and text management. Traditionally, these three areas are separate, each with a rich collection of research literature and textbooks. Zhai and Massung masterfully weave the contents of these areas together and present students and scholars with a unified view of "everything text," including a piece of software, MeTA, which is developed by the authors for a variety of text analysis and management tasks. Because of the large scope of the contents, the authors chose to concentrate on the breadth, not the depth, of the knowledge area in this 500-plus-page textbook. The primary audience is upper-level undergraduate or first-year graduate students. The book contains 20 chapters that are divided into four parts and a few appendices. The first part reviews tools that are needed for the tasks, including probability and statistics, natural language understanding, and the installation and use of the MeTA software. The second part contains the major parts of a traditional information retrieval study. The subjects covered in this part are text retrieval, vector space, and probabilistic models; feedback models; search engine implementation and evaluation; search over the web; and recommendation systems. The third part mainly deals with various text mining related topics, such as word association mining, text clustering, topic analysis, and opinion mining. The fourth part is a summary of the authors' views about a unified framework for text analysis and management. There are three appendices that describe some common statistical tools: the Bayesian model, the expectation-maximization model, and KL-divergence and Dirichlet prior smoothing. Each chapter ends with a collection of exercises (about ten in each), which allow readers to assess how well they have learned the content.
The exercises with the authors' software tool MeTA are spread throughout the book. The authors used this book in one of their (400-level) undergraduate courses and in two massive open online courses (MOOCs), all at the University of Illinois at Urbana-Champaign. Because text analysis and management are such important fields, it is a very good idea to seek ways to teach the topics at the undergraduate or early graduate level. The authors' approach of unifying text information retrieval and text mining is very refreshing and worth noting. In particular, the authors provided a programming tool that students can use as they learn the course materials. But I think challenges from two aspects remain. One issue is that the mathematical tools needed for text mining are typically out of reach for undergraduate computer science students. It is common practice in undergraduate data mining courses to use packages such as R or Weka to hide the details of statistical analysis. The second challenge is the amount of information covered in the book. It is a great idea to establish a unified framework as the book does. And in keeping the book, and thus the courses using this book, to a manageable size, I agree it is a very good idea to keep a broad view of the topics, without going into depth. But the number of topics covered in the book is vast. It will be a real challenge to use it in undergraduate courses. One may just have to cover selected topics in a typical semester. Regardless, this is a very good attempt to unify two important areas, text retrieval and text mining, for a society in which text analysis is becoming increasingly critical. The book also shows the depth and the breadth of the knowledge of the authors.

Online Computing Reviews Service
