Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
Published: 23 June 2016
Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media (such as blog articles, forum posts, product reviews, and tweets). This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (thus are relatively easy for computers to handle), text has less explicit structure, requiring computer processing toward understanding of the content encoded in text. The current technology of natural language processing has not yet reached a point to enable a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they can assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users. This book covers the major concepts, techniques, and ideas in information retrieval and text data mining from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of information retrieval and text mining to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. This book can be used as a textbook for computer science undergraduates and graduates, library and information scientists, or as a reference book for practitioners working on relevant problems in managing and analyzing text data.


  • University of Illinois Urbana-Champaign
  • University of Illinois Urbana-Champaign


Fernando Berzal

An old rule of thumb suggests that 90 percent of all potentially relevant business information is in unstructured form. Hence, it is no surprise that many mathematically ill-defined problems associated with text analysis have attracted a lot of attention from data mining researchers. Text data management is a more mature field, and its associated text data access problems are tackled with the help of information retrieval techniques, as the popularity of web search engines attests. Zhai and Massung have managed to write a very readable introduction to both fields and their state of the art in 500 pages. After the usual introductory chapters, which include some background information and a very cursory mention of natural language processing (NLP) techniques, they delve into text data access methods, also known as information retrieval. Here, they discuss basic techniques such as ranking documents in response to a user query. They gently introduce retrieval models and the rationale behind them until they logically reach state-of-the-art vector space models, namely pivoted-length normalization and the Okapi BM25 ranking function. They also cover probabilistic models and, by clever use of analogies with the heuristic models, clearly explain the query likelihood retrieval model and the smoothing methods often used with it. Their discussion is not only theoretical, since they also cover practical issues associated with the implementation of information retrieval systems and, as you may expect, web search engines as the most prominent example of information retrieval systems nowadays. Their analysis of web search includes crawling, indexing, and link analysis, with the usual description of Google's PageRank and Kleinberg's HITS.
The information retrieval half of this book is completed with short chapters on feedback (that is, how to take into account a user's actions to improve information retrieval results) and recommender systems, which provide relevant information to the user in "push" mode (in contrast to the "pull" mode of search and browsing, when the user initiates the requests). The second half of Zhai and Massung's textbook focuses on text mining, or "text analysis," to use the authors' preferred term. Word association mining, text clustering, text categorization, text summarization, topic modeling, opinion mining, and sentiment analysis are the main text mining problems studied in this second half of the book. Many of the discussed techniques are unavoidably application-specific, hence the authors' emphasis on the importance of feature engineering for solving problems such as text categorization, sentiment analysis, or text-based prediction. Their coverage of different problems is not without stark contrasts. For instance, a 60-page guided tour on probabilistic topic modeling, where probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are excruciatingly dissected, is followed by a shallow overview chapter on opinion mining and sentiment analysis. In this short chapter, text data is regarded as data generated from humans as subjective sensors, which enables mining knowledge about the human observer who generated the text data. The subjective content of text data is then analyzed using techniques such as ordinal logistic regression or latent aspect rating analysis (LARA), proposed by the first author in two KDD papers [1,2]. The text mining half of the book ends with a 30-page survey chapter on the joint analysis of text and structured data, which is a requirement in many real-world applications. In fact, non-text data can enrich text analysis, whereas text data can help interpret non-text data (for example, pattern annotation).
Three example techniques illustrate how topic analysis can be combined with non-text data in different domains: the use of different views in contextual PLSA, the network supervised topic model in NetPLSA (for the joint analysis of text and social network data), and iterative causal topic modeling for the analysis of text associated with time series. The book's final chapter is a short position paper where the authors advocate for integrated software frameworks that support both text management (that is, information retrieval) and text analysis (that is, text mining). It can be read as a broad-brush outline of the essentials for future unified systems. In general terms, the authors typically provide verbose descriptions of the reasons behind the design of specific techniques, with numerical examples and illustrative figures from the slides of two massive open online courses (MOOCs) offered by the first author on Coursera. They also provide specific sections that describe in detail the proper way to evaluate every different kind of technique, a key factor to be taken into account when applying the discussed techniques in practice. The book, however, is not always self-contained, since its broad scope in a limited number of pages entails an unavoidable depth/breadth tradeoff. Most basic techniques can be implemented just by following the instructions and guidelines in the text, although interested readers might need to resort to the bibliographic references if they want to gain a thorough understanding of the many advanced techniques. Fortunately, the authors include some bibliographic notes and very selective suggestions for further reading at the end of each chapter, instead of the encyclopedic collection of references common in many other textbooks.
Although readers will not find detailed coverage of NLP techniques and some chapters might seem lacking in depth, advanced undergraduate students might find this book to be a valuable reference for getting acquainted with both information retrieval and text mining in a single volume, a worthwhile achievement for a 500-page textbook. Online Computing Reviews Service
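
For readers unfamiliar with the Okapi BM25 ranking function mentioned in this review, the following is a minimal sketch of how such a scorer might look. It is not taken from the book or from MeTA: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the "+1 inside the logarithm" IDF variant are all assumptions chosen for illustration, and other formulations of BM25 exist.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query.

    Illustrative sketch only: k1, b, and the IDF variant are assumed,
    commonly used choices, not values prescribed by the book.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # IDF variant that stays non-negative even for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    ["text", "mining", "text"],
    ["web", "search"],
    ["text", "retrieval", "search"],
]
scores = bm25_scores(["text", "mining"], docs)
```

In this toy collection the first document matches both query terms and so receives the highest score, while the second matches neither and scores zero.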

H. Van Dyke Parunak

One of the most rapidly growing sources of data, natural-language text, is also one of the most difficult to analyze. Computerized understanding of natural language was among the earliest anticipated benefits of artificial intelligence (AI), but it has proven extraordinarily challenging. This volume offers a selective introduction to the state of the art of computerized analysis of text. As befits the subtitle, "a practical introduction ...," it situates the techniques it explains in the context of a systems view that emphasizes how natural-language processing (NLP) can be applied in real applications. Chapter 1 introduces the overall framework, distinguishing analysis of the text from various organizational processes (including search, filtering, categorization, summarization, topic analysis, information extraction, clustering, and visualization) that support the two main objectives of retrieval operations and data mining. With the exception of information extraction and visualization, the book discusses each of these operations. Chapter 2 provides an overview of mathematical background in probability and statistics, information theory, and machine learning. Chapter 3 reviews the history of NLP and text data understanding. Most of the book is limited to a bag-of-words model, though this chapter acknowledges more sophisticated techniques. Chapter 4 introduces the authors' modern text analysis (MeTA) toolkit for text data management and analysis, encouraging readers to download the open-source C++-based system and use it in examples and exercises promised later in the text. This promise of a hands-on learning experience is only partly fulfilled. Few exercises, and even fewer examples in the body of the text, actually say anything about MeTA. 
Most of the exercises that do mention it do not use it to illustrate a particular text-analytic function, but ask the user either to look to see how MeTA implements a given text-analytic function, or to extend MeTA to do something discussed in the text. Both kinds of task require the reader to delve into the source code of MeTA rather than use the functionality of the package, and thus assume a level of knowledge about MeTA well beyond anything in the text. These exercises might be useful in the context of a class where the instructor is already acquainted with the internal design and implementation of MeTA. Some other toolkits are mentioned, but there is no reference to other, important ones, such as MALLET from the University of Massachusetts at Amherst. After these four introductory chapters, the rest of the book has three parts: seven chapters devoted to accessing textual data, eight to analyzing it, and one final chapter fleshing out an overall architecture for unified text management and analysis. The chapters on accessing data discuss retrieval models, how the information retrieval system gets feedback from the user, implementation and evaluation of search engines, a special chapter on web-based search, and recommender systems. Most chapters are about 20 pages long (the median chapter length for the book is 18 pages), but the chapter on retrieval models is 46 pages long. The extra detail is useful, given the importance of this theme, but it is uneven compared with the rest of the book. The selection of retrieval methods to discuss is not clear. Early in the chapter, the authors identify "four major models that are generally regarded as state of the art: pivoted length normalization, Okapi BM25, query likelihood, and PL2." However, the rest of the chapter mentions PL2 only in passing, focusing instead on two forms of smoothing for query likelihood, JM smoothing and Dirichlet prior smoothing. 
The chapter does not discuss two very important issues in the area of retrieval, van Rijsbergen's work on The geometry of information retrieval [1], and the particular challenges posed by comparing vectors in high-dimensional spaces, which characterize most keyword-based retrieval methods. The text analysis chapters discuss word association mining, text clustering, categorization and summarization, topic analysis, opinion mining and sentiment analysis, and the joint analysis of text and structured data. Again, the level of detail is uneven. The median chapter length in this section is 24 pages, but the chapter on topic analysis occupies 60 pages. Again, the theme is an important one, but the level of detail appears to be out of balance with the rest of the book. The book includes exercises with each chapter, appendices giving further details on mathematical methods mentioned earlier in the book (Bayesian statistics, expectation maximization, and KL-divergence and Dirichlet prior smoothing), copious references, and an index. The references usefully include the page numbers on which they are cited, but there is some irregularity. For example, van Rijsbergen's important volume [1] is listed twice in the references, once alphabetized under R, and again under V. Online Computing Reviews Service
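
The two smoothing methods for query likelihood named in this review, JM (Jelinek-Mercer) smoothing and Dirichlet prior smoothing, can be illustrated with a short sketch. The code below is not from the book or from MeTA; the function name `query_log_likelihood`, the parameter defaults `lam=0.1` and `mu=2000.0`, and the convention that `lam` is the weight on the collection model are assumptions made for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, method="dirichlet",
                         lam=0.1, mu=2000.0):
    """Log-probability of a tokenized query under a smoothed document model.

    Illustrative sketch only. `collection` is the concatenation of all
    document tokens; `lam` (assumed here to weight the collection model)
    and `mu` are hypothetical defaults.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)

    total = 0.0
    for w in query:
        p_coll = coll_tf[w] / coll_len  # background (collection) model
        if method == "jm":
            # Fixed linear interpolation of document and collection models.
            p_ml = doc_tf[w] / doc_len if doc_len else 0.0
            p = (1.0 - lam) * p_ml + lam * p_coll
        elif method == "dirichlet":
            # Collection model acts as mu pseudo-counts, so longer
            # documents are smoothed less.
            p = (doc_tf[w] + mu * p_coll) / (doc_len + mu)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        if p == 0.0:
            # Word unseen in the whole collection: smoothing cannot help.
            return float("-inf")
        total += math.log(p)
    return total


doc_a = ["text", "mining", "text", "data"]
doc_b = ["web", "search", "engine", "data"]
collection = doc_a + doc_b
query = ["text", "data"]

jm_a = query_log_likelihood(query, doc_a, collection, method="jm")
jm_b = query_log_likelihood(query, doc_b, collection, method="jm")
dir_a = query_log_likelihood(query, doc_a, collection, method="dirichlet")
dir_b = query_log_likelihood(query, doc_b, collection, method="dirichlet")
```

Both methods keep the score of `doc_b` finite even though it never contains the word "text", which is the point of smoothing; they differ in whether the amount of smoothing is fixed (JM) or shrinks as the document gets longer (Dirichlet prior).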

Xiannong Meng

Zhai and Massung's new book Text data management and analysis provides a fresh new look at the areas of text retrieval, text mining, and text management. Traditionally, these three areas are separate, each with a rich collection of research literature and textbooks. Zhai and Massung masterfully weave the contents of these areas together and present students and scholars with a unified view of "everything text," including a piece of software, MeTA, which is developed by the authors for a variety of text analysis and management tasks. Because of the large scope of the contents, the authors chose to concentrate on the breadth, not the depth, of the knowledge area in this 500-plus-page textbook. The primary audience is upper-level undergraduate or first-year graduate students. The book contains 20 chapters that are divided into four parts and a few appendices. The first part reviews tools that are needed for the tasks, including probability and statistics, natural language understanding, and the installation and use of the MeTA software. The second part contains the major parts of a traditional information retrieval study. The subjects covered in this part are text retrieval, vector space, and probabilistic models; feedback models; search engine implementation and evaluation; search over the web; and recommendation systems. The third part mainly deals with various text mining related topics, such as word association mining, text clustering, topic analysis, and opinion mining. The fourth part is a summary of the authors' views about a unified framework for text analysis and management. There are three appendices that describe some common statistics tools, the Bayesian model, the expectation-maximization model, and KL-divergence and Dirichlet prior smoothing. Each chapter ends with a collection of exercises (about ten in each), which allow readers to assess how well they have learned the content.
The exercises with the authors' software tool MeTA are spread throughout the book. The authors used this book in one of their (400-level) undergraduate courses and in two massive open online courses (MOOCs), all at the University of Illinois at Urbana-Champaign. Because text analysis and management are such important fields, it is a very good idea to seek ways to teach the topics at the undergraduate or early graduate level. The authors' approach of unifying text information retrieval and text mining is very refreshing and worth noting. In particular, the authors provided a programming tool that students can use as they learn the course materials. But I think challenges from two aspects remain. One issue is that the mathematics tools needed for text mining are typically out of reach for undergraduate computer science students. It is common practice in undergraduate data mining courses to use packages such as R or Weka to hide the details of statistical analysis. The second challenge is the amount of information covered in the book. It is a great idea to establish a unified framework as the book does. And in keeping the book, and thus the courses using this book, to a manageable size, I agree it is a very good idea to keep a broad view of the topics, without going into depth. But the number of topics covered in the book is vast. It will be a real challenge to use it in undergraduate courses. One may just have to cover selected topics in a typical semester. Regardless, this is a very good attempt to unify two important areas, text retrieval and text mining, for a society in which text analysis is becoming increasingly critical. The book also shows the depth and the breadth of the knowledge of the authors. Online Computing Reviews Service

