Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
June 2016
Publisher: Association for Computing Machinery and Morgan & Claypool
ISBN: 978-1-970001-17-4
Published: 23 June 2016
Pages: 1286
Appears In: ACM Books
Abstract

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media (such as blog articles, forum posts, product reviews, and tweets). This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (and thus are relatively easy for computers to handle), text has less explicit structure, so a computer must process it to understand the content it encodes. Current natural language processing technology has not yet reached the point where a computer can precisely understand natural language text, but a wide range of statistical and heuristic approaches to the management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed, and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding, from a large collection of text data, the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they can assist users in analyzing patterns in text data to extract and discover actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users. This book covers the major concepts, techniques, and ideas in information retrieval and text data mining from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of information retrieval and text mining to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. This book can be used as a textbook for computer science undergraduates and graduates and for library and information scientists, or as a reference book for practitioners working on relevant problems in managing and analyzing text data.

References

  1. C. C. Aggarwal. 2015. Data Mining - The Textbook. Springer. DOI: 10.1007/978-3-319-14142-8.Google ScholarGoogle Scholar
  2. C. C. Aggarwal and C. Zhai, editors. 2012. Mining Text Data. Springer. DOI: 10.1007/978-1-4614-3223-4.Google ScholarGoogle Scholar
  3. J. Allen. 1995. Natural Language Understanding. 2nd ed. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA.Google ScholarGoogle Scholar
  4. G. Amati and C. J. Van Rijsbergen. October 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389. DOI: 10.1145/582415.582416.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. U. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. 2009. On smoothing and inference for topic models. In UAI 2009, Proc. of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pp. 27–34.Google ScholarGoogle Scholar
  6. R. A. Baeza-Yates and B. A. Ribeiro-Neto. 2011. Modern Information Retrieval - the concepts and technology behind search. 2nd ed. Pearson Education Ltd., Harlow, UK. http://www.mir2ed.org/.Google ScholarGoogle Scholar
  7. Y. Bar-Hillel, The Present Status of Automatic Translation of Languages, in Advances in Computers, vol. 1 (1960), pp. 91–163.Google ScholarGoogle Scholar
  8. R. Belew. 2008. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press.Google ScholarGoogle Scholar
  9. N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35(12):29–38. DOI: 10.1145/138859.138861.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ.Google ScholarGoogle Scholar
  11. D. M. Blei, A. Y. Ng, and M. I. Jordan. March 2003. Latent Dirichlet Allocation. J. of Mach. Learn. Res., 3:993–1022.Google ScholarGoogle Scholar
  12. J. S. Breese, D. Heckerman, and C. Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, Morgan Kaufmann Publishers Inc. pp. 43–52, San Francisco, CA. http://dl.acm.org/citation.cfm?id=2074094.2074100.Google ScholarGoogle Scholar
  13. P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-based N-gram Models of Natural Language. Comput. Linguist., 18(4):467–479.Google ScholarGoogle Scholar
  14. C. Buckley. 1994. Automatic query expansion using smart: Trec 3. In Proc. of The third Text REtrieval Conference (TREC-3, pp. 69–80.Google ScholarGoogle Scholar
  15. S. Büttcher, C. Clarke, and G. V. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. F. Cacheda, V. Carneiro, D. Fernández, and V. Formoso. 2011. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Trans. Web, 5(1):2:1–2:33. DOI: 10.1145/1921591.1921593.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Campbell and Y. Ying. 2011. Learning with Support Vector Machines. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers. DOI: 10.2200/S00324ED1V01Y201102AIM010.Google ScholarGoogle Scholar
  18. J. Carbonell and J. Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98,ACM, pp. 335–336, New York. DOI: 10.1145/290941.291025Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C.-C. Chang and C.-J. Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Chang, S. Gerrish, C. Wang, J. L. Boyd-graber, and D. M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. In Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, Curran Associates, Inc. 22, pp. 288–296.Google ScholarGoogle Scholar
  21. K. W. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Comput. Linguist., 16(1):22–29. http://dl.acm.org/citation.cfm?id=89086.89095.Google ScholarGoogle Scholar
  22. T. Cover and J. Thomas. 1991. Elements of Information Theory. New York: Wiley. DOI: 10.1002/047174882XGoogle ScholarGoogle ScholarCross RefCross Ref
  23. B. Croft, D. Metzler, and T. Strohman. 2009. Search Engines: Information Retrieval in Practice, 1st ed., Addison-Wesley Publishing Company.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Das and A. F. T. Martins. 2007. A Survey on Automatic Text Summarization. Technical report, Literature Survey for the Language and Statistics II course at Carnegie Mellon University.Google ScholarGoogle Scholar
  25. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9:1871–1874.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. H. Fang, T. Tao, and C. Zhai. 2004. A formal study of information retrieval heuristics. In Proc. of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, ACM, pp. 49–56, New York. DOI: 10.1145/1008992.1009004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. H. Fang, T. Tao, and C. Zhai. April 2011. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29(2):7:1–7:42. DOI: 10.1145/1961209.1961210.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. Feldman and J. Sanger. 2007. The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.Google ScholarGoogle Scholar
  29. E. A. Fox, M. A. Gon„alves, and R. Shen. 2012. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: 10.2200/S00434ED1V01Y201207ICR022.Google ScholarGoogle Scholar
  30. W. B. Frakes and R. A. Baeza-Yates, editors. 1992. Information Retrieval: Data Structures & Algorithms. Prentice-Hall,Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. K. Ganesan, C. Zhai, and J. Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proc. of the 23rd International Conference on Computational Linguistics, COLING '10, Association for Computational Linguistics, pp. 340–348, Stroudsburg, PA.Google ScholarGoogle Scholar
  32. K. Ganesan, C. Zhai, and E. Viegas. 2012. Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions. In Proc. of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pages 869–878. DOI: 10.1145/2187836.2187954Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Gantz, and D. Reinsel. 2012. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, IDC Report, December, 2012.Google ScholarGoogle Scholar
  34. A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 1995. Bayesian Data Analysis. Chapman & Hall.Google ScholarGoogle Scholar
  35. S.Ghemawat, H. Gobioff, and S.-T. Leung. 2003. The Google file system. In Proc. of the nineteenth ACM symposium on Operating systems principles (SOSP '03). ACM, New York, 29–43.Google ScholarGoogle Scholar
  36. M. A. Gon„alves, E. A. Fox, L. T. Watson, and N. A. Kipp. 2004. Streams, structures, spaces, scenarios, societies (5s): A formal model for digital libraries. ACM Trans. Inf. Syst., 22(2):270–312. DOI: 10.1145/984321.984325.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. A. Grossman and O. Frieder. Kluwer, 2004. Information Retrieval - Algorithms and Heuristics, Second Edition, vol. 15 of The Kluwer International Series on Information Retrieval. DOI: 10.1007/978-1-4020-3005-5.Google ScholarGoogle Scholar
  38. G. Hamerly and C. Elkan. 2003. Learning the k in k-means. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pp. 281–288.Google ScholarGoogle Scholar
  39. J. Han. 2005. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA.Google ScholarGoogle Scholar
  40. D. Harman. 2011. Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: < 10.1145/215206.215351Google ScholarGoogle Scholar
  41. M. A. Hearst. 2009. Search User Interfaces. 1st ed. Cambridge University Press, New York.Google ScholarGoogle Scholar
  42. J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Trans. Inf. Syst., 22(1):5–53. DOI: 10.1145/963770.963772Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. J. L. Hodges and E. L. Lehmann. 1970. Basic Concepts of Probability and Statistics. Holden Day, San Francisco.Google ScholarGoogle Scholar
  44. T. Hofmann. 1999. Probabilistic Latent Semantic Analysis. In Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, Morgan Kaufmann Publishers Inc., pp. 289–296, San Francisco, CA. DOI: 10.1145/312624.312649Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. A. Huang. 2008. Similarity Measures for Text Document Clustering. In Proc. of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pages 49–56.Google ScholarGoogle Scholar
  46. F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.Google ScholarGoogle Scholar
  47. J. Jiang. 2012. Information extraction from text, In Charu C. Aggarwal and ChengXiang Zhai (Eds.), Mining Text Data, Springer, pp. 11–41.Google ScholarGoogle Scholar
  48. S. Jiang and C. Zhai. 2014. Random walks on adjacency graphs for mining lexical relations from big text data. In 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, pages 549–554. DOI: 10.1109/BigData.2014.7004272.Google ScholarGoogle ScholarCross RefCross Ref
  49. Y. Jo and A. H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, ACM, pp. 815–824, New York. DOI: 10.1145/1935826.1935932.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst., 25(2). DOI: 10.1145/1229179.1229181.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. D. Jurafsky and J. H. Martin. 2009. Speech and Language Processing. 2nd ed. Prentice-Hall, Inc., Upper Saddle River, NJ.Google ScholarGoogle Scholar
  52. D. Kelly. 2009. Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2):1–224. DOI: 10.1561/1500000012Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. D. Kelly and J. Teevan. 2003. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28. DOI: 10.1145/959258.959260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. H. D. Kim, M. Castellanos, M. Hsu, C. Zhai, T. Rietz, and D. Diermeier. 2013. Mining causal topics in text data: iterative topic modeling with time series feedback. In Proc. of the 22nd ACM international conference on Conference on information and knowledge management, CIKM '13, ACM pages 885–890, New York, NY. DOI: 10.1145/2505515.2505612.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. J. M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632. DOI: 10.1145/324133.324140.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. J. M. Kleinberg. 2002. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada], pp. 446–453. http://papers.nips.cc/paper/2340-an-impossibility-theorem-for-clustering.Google ScholarGoogle Scholar
  57. D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press.Google ScholarGoogle Scholar
  58. J. Lafferty and C. Zhai. 2003. Probabilistic relevance models based on document and query generation. In W. Bruce Croft and John Lafferty, editors, Language Modeling and Information Retrieval. Kluwer Academic Publishers. DOI: 10.1007/978-94-017-0171-6\_1Google ScholarGoogle Scholar
  59. D. Lin. 1999. Automatic identification of non-compositional phrases. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL '99, Association for Computational Linguistics, pages 317–324, Stroudsburg, PA. DOI: 10.3115/1034678.1034730.Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. J.Lin and C. Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers. DOI: 10.2200/S00274ED1V01Y201006HLT007.Google ScholarGoogle Scholar
  61. Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. DOI: 10.2200/S00416ED1V01Y201204HLT016.Google ScholarGoogle Scholar
  62. T.-Y. Liu. 2009. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331. DOI: 10.1561/1500000016.Google ScholarGoogle Scholar
  63. Y. Lv and C. Zhai. 2009. A comparative study of methods for estimating query language models with pseudo feedback. In Proc. of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, ACM, pp. 1895–1898, New York. DOI: 10.1145/1645953.1646259.Google ScholarGoogle ScholarCross RefCross Ref
  64. Y. Lv and C. Zhai. 2010. Positional relevance model for pseudo-relevance feedback. In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, ACM, pages 579–586, New York. DOI: 10.1145/1835449.1835546.Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Y. Lv and C. Zhai. 2011. Lower-bounding Term Frequency Normalization. In Proc. of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pp. 7–16. DOI: 10.1145/2063576.2063584Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. P. Lyman, H. R. Varian, K. Swearingen, P. Charles, N. Good, L.L. Jordan, and J. Pal. 2003. How much information? http://www2.sims.berkeley.edu/research/projects/how-much-info-2003.Google ScholarGoogle Scholar
  67. C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.Google ScholarGoogle Scholar
  68. C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York.Google ScholarGoogle Scholar
  69. M. E. Maron and J. L. Kuhns. 1960. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7:216–244. DOI: 10.1145/321033.321035Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. S. Massung and C. Zhai. 2015. SyntacticDiff: Operator-Based Transformation for Comparative Text Mining. In Proc. of the 3rd IEEE International Conference on Big Data, pp. 571–580.Google ScholarGoogle Scholar
  71. S. Massung and C. Zhai. 2016. Non-Native Text Analysis: A Survey. The Journal of Natural Language Engineering, 22(2):163–186. DOI: 10.1017/S1351324915000303Google ScholarGoogle ScholarCross RefCross Ref
  72. S. Massung, C. Zhai, and J.Hockenmaier. 2013. Structural Parse Tree Features for Text Representation. In IEEE Seventh International Conference on Semantic Computing, pp. 9–13. DOI: 10.1109/ICSC.2013.13Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. J. D. McAuliffe and D. M. Blei. 2008. Supervised topic models. In J.C. Platt, D. Koller, Y. Singer, and S.T. Roweis, eds., Advances in Neural Information Processing Systems 20, pages 121–128. Curran Associates, Inc.Google ScholarGoogle Scholar
  74. G. J. McLachlan and T. Krishnan. 2008. The EM algorithm and extensions. 2nd ed. Wiley Series in Probability and Statistics. Hoboken, NJ., Wiley. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&SRT=YOP&IKT=1016&IKT=1016&TRM=ppn+52983362X&sourceid=fbw_bibsonomy. DOI: 10.1002/9780470191613Google ScholarGoogle Scholar
  75. Q. Mei. 2009. Contextual text mining. Ph.D. Dissertation, University of Illinois at Urbana-Champaign.Google ScholarGoogle Scholar
  76. Q. Mei and C. Zhai. 2006. A mixture model for contextual text mining. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, ACM, pp. 649–655, New York. DOI: 10.1145/1150402.1150482.Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. 2006. Generating semantic annotations for frequent patterns with context analysis. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, ACM, pp. 337–346, New York. DOI: 10.1145/1150402.1150441.Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Q. Mei, C. Liu, H. Su, and C. Zhai. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proc.of the 15th international conference on World Wide Web (WWW '06). ACM. New York, 533–542. DOI: 10.1145/1135777.1135857.Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. 2007a. Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proc. of the 16th International Conference on World Wide Web, WWW '07, ACM, pp. 171–180, New York. DOI: 10.1145/1242572.1242596.Google ScholarGoogle ScholarCross RefCross Ref
  80. Q. Mei, X. Shen, and C. Zhai. 2007b. Automatic labeling of multinomial topic models. In Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, August 12-15, 2007, pp. 490–499. DOI: 10.1145/1281192.1281246.Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Q. Mei, D. Cai, D. Zhang, and C. Zhai. 2008. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, ACM, pp. 101–110, New York. DOI: 10.1145/1367497.1367512.Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048. http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.Google ScholarGoogle Scholar
  83. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, NV, pp. 3111–3119.Google ScholarGoogle Scholar
  84. T. M. Mitchell. 1997. Machine learning. McGraw Hill Series in Computer Science. McGraw-Hill.Google ScholarGoogle Scholar
  85. M.-F. Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context (The Information Retrieval Series). Springer-Verlag New York, Inc., Secaucus, NJ. DOI: 10.1007/978-1-4020-4993-4.Google ScholarGoogle ScholarCross RefCross Ref
  86. I. J. Myung. 2003. Tutorial on maximum likelihood estimation. J. Math. Psychol., 47(1):90–100. DOI: 10.1016/S0022-2496(02)00028-7.Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. A. Nenkova and K. McKeown. 2012. A survey of text summarization techniques. In Charu C. Aggarwal and C. Zhai, eds, Mining Text Data, pp. 43–76. Springer US. DOI: 10.1007/978-1-4614-3223-4_3.Google ScholarGoogle ScholarCross RefCross Ref
  88. L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.Google ScholarGoogle Scholar
  89. B. Pang and L. Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135. DOI: 10.1561/1500000011Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. J. M. Ponte and W. B. Croft. 1998. A language modeling approach to information retrieval. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, ACM, pp. 275–281, New York, NY. DOI: 10.1145/290941.291008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. J. R. Quinlan. 1986. Induction of Decision Trees. Machine Learning, 1(1):81–106. DOI: 10.1007/BF00116251.Google ScholarGoogle ScholarCross RefCross Ref
  92. D. R. Radev, H. Jing, M. Styś, and D. Tam. 2004. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919––938. DOI: 10.1016/j.ipm.2003.10.006.Google ScholarGoogle Scholar
  93. D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. 2009. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP '09, Association for Computational Linguistics, pages 248–256, Stroudsburg, PA.Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York.Google ScholarGoogle Scholar
  95. F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. 2010. Recommender Systems Handbook. 1st ed. Springer-Verlag New York, Inc. DOI: 10.1007/978-0-387-85820-3Google ScholarGoogle Scholar
  96. C. J. Van Rijsbergen. 1979. Information Retrieval. 2nd ed. Butterworth-Heinemann, Newton, MA.Google ScholarGoogle Scholar
  97. S. Robertson and K. Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146.Google ScholarGoogle ScholarCross RefCross Ref
  98. S. E. Robertson. 1997. Readings in Information Retrieval. In The Probability Ranking Principle in IR, San Francisco, CA, Morgan Kaufmann Publishers Inc. pp. 281–286.Google ScholarGoogle Scholar
  99. S. Robertson and H. Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr., 3(4):333–389. DOI: 10.1561/1500000019.Google ScholarGoogle Scholar
  100. S. Robertson, H. Zaragoza, and M. Taylor. 2004. Simple BM25 Extension to Multiple Weighted Fields. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pp. 42–49. DOI: 10.1145/1031171.1031181Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. C. Roe. 2012. The growth of unstructured data: what to do with all those zettabytes? http://www.dataversity.net/the-growth-of-unstructured-data-what-are-we-going-to-do-with-all-those-zettabytes/.Google ScholarGoogle Scholar
  102. R. Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here. In Proceedings of the IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  103. G. Salton. 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley.Google ScholarGoogle Scholar
  104. G. Salton and M. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.Google ScholarGoogle Scholar
  105. G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620.Google ScholarGoogle Scholar
  106. G. Salton and C. Buckley. 1990. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288–297.Google ScholarGoogle ScholarCross RefCross Ref
  107. M. Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4):247–375.Google ScholarGoogle Scholar
  108. M. Sanderson and W. B. Croft. 2012. The history of information retrieval research. Proc. of the IEEE, 100(Centennial-Issue):1444–1451, 2012. DOI: 10.1109/JPROC.2012.2189916.Google ScholarGoogle ScholarCross RefCross Ref
  109. S. Sarawagi. 2008. Information extraction. Found. Trends databases, 1(3):261–377. DOI: 10.1561/1900000003.Google ScholarGoogle Scholar
  110. F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47. DOI: 10.1145/505282.505283.Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. G. Shani and A. Gunawardana. 2011. Evaluating Recommendation Systems. In Recommender Systems Handbook, 2nd ed., pp. 257–297. Springer, New York, NY. DOI: 10.1007/978-0-387-85820-3_8.Google ScholarGoogle Scholar
  112. F. Silvestri. 2010. Mining query logs: Turning search usage data into knowledge. Found. Trends Inf. Retr., 4:1–174. DOI: 10.1561/1500000013Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. A. Singhal, C. Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96,ACM, pp. 21–29, New York. DOI: 10.1145/243199.243206.Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. N. Smith. 2010. Text-driven forecasting. http://www.cs.cmu.edu/\~nasmith/papers/smith.whitepaper10.pdf.Google ScholarGoogle Scholar
  115. Mark D. Smucker, James Allan, and Ben Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In Proc. of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, ACM, pp. 623–632, New York. DOI: 10.1145/1321440.1321528.Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. K. Sparck Jones and P. Willett, eds. 1997. Readings in Information Retrieval. San Francisco, CA, Morgan Kaufmann Publishers Inc.Google ScholarGoogle Scholar
  117. N. Spirin and J. Han. May 2012. Survey on Web Spam Detection: Principles and Algorithms. SIGKDD Explor. Newsl., 13(2):50–64. DOI: 10.1145/2207243.2207252.Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. E. Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol., 60(3):538–556. DOI: 10.1002/asi.v60:3Google ScholarGoogle ScholarCross RefCross Ref
  119. M. Steinbach, G. Karypis, and V. Kumar. 2000. A comparison of document clustering techniques. In KDD Workshop on Text Mining.Google ScholarGoogle Scholar
  120. J. Steinberger and K. Jezek. 2009. Evaluation measures for text summarization. Computing and Informatics, 28(2):251–275.Google ScholarGoogle Scholar
  121. M. Steyvers and T. Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.Google ScholarGoogle Scholar
  122. Y. Sun and J. Han. 2012. Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool Publishers. DOI: 10.2200/S00433ED1V01Y201207DMK005.
  123. I. Titov and R. McDonald. 2008. Modeling online reviews with multi-grain topic models. In Proc. of the 17th International Conference on World Wide Web, WWW '08, ACM, pp. 111–120, New York. DOI: 10.1145/1367497.1367513.
  124. H. Turtle and W. B. Croft. 1990. Inference networks for document retrieval. In Proc. of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '90, ACM, pp. 1–24, New York. DOI: 10.1145/96749.98006.
  125. Princeton University. 2010. About wordnet. http://wordnet.princeton.edu.
  126. C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths.
  127. H. Wang, Yue Lu, and C. Zhai. 2010. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach. In Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, ACM, pp. 783–792, New York. DOI: 10.1145/1835804.1835903.
  128. H. Wang, Y. Lu, and C. Zhai. 2011. Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, ACM, pp. 618–626, New York. DOI: 10.1145/2020408.2020505.
  129. J. Weizenbaum. 1966. ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine, Communications of the ACM 9 (1): 36–45, DOI: 10.1145/265153.365168.
  130. J. S. Whissell and C. L. A. Clarke. 2013. Effective Measures for Inter-document Similarity. In Proc. of the 22nd ACM International Conference on Conference on Information & Knowledge Management, CIKM '13, ACM, pp. 1361–1370, New York. DOI: 10.1145/2505515.2505526.
  131. R. W. White and R. A. Roth. 2009. Exploratory Search: Beyond the Query-Response Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers. DOI: 10.2200/S00174ED1V01Y200901ICR003.
  132. R. W. White, B. Kules, S. M. Drucker, and m.c. schraefel. 2006. Introduction. Commun. ACM, 49(4):36–39. DOI: 10.1145/1121949.1121978.
  133. I. H. Witten, A. Moffat, and T. C. Bell. 1999. Managing Gigabytes (2nd Ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., San Francisco, CA.
  134. C. F. J. Wu. 1983. On the convergence properties of the EM algorithm. Ann. of Stat., 95–103.
  135. J. Xu and W. B. Croft. 1996. Query expansion using local and global document analysis. In Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, ACM, pp. 4–11, New York. DOI: 10.1145/243199.243202.
  136. Y. Yang. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1:67–88.
  137. C. Zhai. 1997. Exploiting context to identify lexical atoms—a statistical view of linguistic context. In Proc. of the International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pp. 119–129. Rio de Janeiro, Brazil.
  138. C. Zhai. 2008. Statistical Language Models for Information Retrieval. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. DOI: 10.2200/S00158ED1V01Y200811HLT001.
  139. C. Zhai and J. Lafferty. 2001. Model-based Feedback in the Language Modeling Approach to Information Retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, ACM, pp. 403–410, New York. DOI: 10.1145/502585.502654.
  140. C. Zhai and J. Lafferty. 2004. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM Trans. Inf. Syst., 22(2):179–214.
  141. C. Zhai, P. Jansen, E. Stoica, N. Grot, and D. A. Evans. 1998. Threshold Calibration in CLARIT Adaptive Filtering. In Proc. of Seventh Text REtrieval Conference (TREC-7), pp. 149–156.
  142. C. Zhai, P. Jansen, and D. A. Evans. 2000. Exploration of a heuristic approach to threshold learning in adaptive filtering. In SIGIR, ACM, pp. 360–362. DOI: 10.1145/345508.345652.
  143. C. Zhai, A. Velivelli, and B. Yu. 2004. A cross-collection mixture model for comparative text mining. In Proc. of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, ACM, pp. 743–748, New York. DOI: 10.1145/1014052.1014150.
  144. D. Zhang, C. Zhai, J. Han, A. Srivastava, and N. Oza. 2009. Topic modeling for OLAP on multidimensional text databases: topic cube and its applications. Stat. Anal. Data Min. 2, 5–6 (December 2009), 378–395. DOI: 10.1002/sam.v2.5/6.
  145. J. Zhu, A. Ahmed, and E. P. Xing. 2009. Medlda: Maximum margin supervised topic models for regression and classification. In Proc. of the 26th Annual International Conference on Machine Learning, ICML '09, ACM, pp. 1257–1264, New York. DOI: 10.1145/1553374.1553535.
  146. G. K. Zipf. 1949. Human Behavior and the Principle of Least-Effort. Cambridge, MA, Addison-Wesley.

Cited By

  1. Karousos N, Vorvilas G, Pantazi D and Verykios V (2024). A Hybrid Text Summarization Technique of Student Open-Ended Responses to Online Educational Surveys, Electronics, 10.3390/electronics13183722, 13:18, (3722)
  2. Xiong S, Tian W, Si H, Zhang G and Shi L (2024). A Survey of the Applications of Text Mining for the Food Domain, Algorithms, 10.3390/a17050176, 17:5, (176)
  3. Hu Z, Ma H, Xiong J, Gao P and Divakaran P Convergence or Divergence: A Computational Text Analysis of Stakeholder Concerns on Manufacturing Upgrading in China, IEEE Transactions on Engineering Management, 10.1109/TEM.2022.3159344, 71, (1285-1295)
  4. Zhang W, Yan R and Yuan L How Generative AI Was Mentioned in Social Media and Academic Field? A Text Mining Based on Internet Text Data, IEEE Access, 10.1109/ACCESS.2024.3379010, 12, (43940-43947)
  5. Tzirides A (2024). Artificial Intelligence Integration in Translingual Language Learning: Enhancing Communication and Digital Literacy Trust and Inclusion in AI-Mediated Education, 10.1007/978-3-031-64487-0_12, (261-286),
  6. Cope B and Kalantzis M (2024). On Cyber-Social Learning: A Critique of Artificial Intelligence in Education Trust and Inclusion in AI-Mediated Education, 10.1007/978-3-031-64487-0_1, (3-34),
  7. Kassimi M, Abdellatif H and Essayad A (2024). Mono-Lingual Search Engine: Combining Keywords with Context for Semantic Search Engine Advances in Intelligent System and Smart Technologies, 10.1007/978-3-031-47672-3_34, (353-363),
  8. Phan H, Vinh N and Huu N (2023). An Efficient System for Personal Information Search in Cyberspace Using Facial Recognition Technology 2023 12th International Conference on Control, Automation and Information Sciences (ICCAIS), 10.1109/ICCAIS59597.2023.10382379, 979-8-3503-2878-3, (566-571)
  9. OKATAN B and ÇAM H (2023). Analysis of customer reviews for digital banking applications with text mining methods / Metin madenciliği yöntemleri ile dijital bankacılık uygulamalarına yönelik müşteri yorumlarının analizi, Gümüşhane Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 10.17714/gumusfenbil.1361431
  10. Tao D, Hu R, Zhang D, Laber J, Lapsley A, Kwan T, Rathke L, Rundensteiner E and Feng H (2023). A Novel Foodborne Illness Detection and Web Application Tool Based on Social Media, Foods, 10.3390/foods12142769, 12:14, (2769)
  11. ÇULLU B and OKURSOY A (2023). Kargo Firmalarının Hizmet Kalitesinin Metin Madenciliği İle İncelenmesi / Investigation of Cargo Companies' Service Quality Using Text Mining, Anadolu Üniversitesi Sosyal Bilimler Dergisi, 10.18037/ausbd.1205507, 23:2, (399-422)
  12. Alhoori H, Fox E, Frommholz I, Liu H, Coupette C, Rieck B, Ghosal T and Wu J (2023). Who can Submit an Excellent Review for this Manuscript in the Next 30 Days? - Peer Reviewing in the Age of Overload 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 10.1109/JCDL57899.2023.00077, 979-8-3503-9931-8, (319-320)
  13. Reid A (2023). Closing the Affordable Housing Gap: Identifying the Barriers Hindering the Sustainable Design and Construction of Affordable Homes, Sustainability, 10.3390/su15118754, 15:11, (8754)
  14. Nakamura Y, Nagaoka T, Kitagawa T, Inoki M and Honiden S (2023). Understanding Support Method for Requirements Specification Using Description Status Based on Page Trend 2023 8th International Conference on Information and Network Technologies (ICINT), 10.1109/ICINT58947.2023.00016, 979-8-3503-0145-8, (43-48)
  15. VORVILAS G, LIAPIS A, KOROVESIS A, AGGELOPOULOU D, KAROUSOS N and EFSTATHOPOULOS E (2023). CONDUCTING REMOTE ELECTRONIC EXAMINATIONS IN DISTANCE HIGHER EDUCATION: STUDENTS’ PERCEPTIONS, Turkish Online Journal of Distance Education, 10.17718/tojde.971889, 24:2, (167-182)
  16. Kline S (2023). CGScholar Promoting Next-Generation Learning Environments Through CGScholar, 10.4018/978-1-6684-5124-3.ch011, (206-229)
  17. Olteanu A, Cernian A and Gâgă S (2022). Leveraging Machine Learning and Semi-Structured Information to Identify Political Views from Social Media Posts, Applied Sciences, 10.3390/app122412962, 12:24, (12962)
  18. Purwandari K and Nurlaila I (2022). Sequential Topic Modelling: A Case Study on One Health Conversation on Twitter 2022 5th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 10.1109/ISRITI56927.2022.10052987, 978-1-6654-5512-1, (457-461)
  19. López J and Cuadrado J (2021). An efficient and scalable search engine for models, Software and Systems Modeling, 10.1007/s10270-021-00960-4, 21:5, (1715-1737), Online publication date: 1-Oct-2022.
  20. Amies A (2022). Machine Learning Approaches with Multilingual Bibliographic, Quotation, and Terminology Databases for the Study of the Chinese Buddhist Canon 2022 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC), 10.23919/PNC56605.2022.9982732, 978-9-8695-3174-0, (1-7)
  21. Zuo E, Aysa A, Muhammat M, Zhao Y, Chen B and Ubul K (2022). A food safety prescreening method with domain-specific information using online reviews, Journal of Consumer Protection and Food Safety, 10.1007/s00003-022-01367-z
  22. Zehtab G and Basiri A (2022). Employees Turnover Rate with Pivoted Length Normalization 2022 27th International Computer Conference, Computer Society of Iran (CSICC), 10.1109/CSICC55295.2022.9780489, 978-1-6654-8027-7, (1-4)
  23. Liu X, Wang J, Rui X, Zhang J and Sun G (2022). Application of GIS Technology-Supported Cross Media Fusion Method Based on Deep Learning in Landscape Performance Evaluation, Computational Intelligence and Neuroscience, 2022, Online publication date: 1-Jan-2022.
  24. (2022). Bibliography Storage Systems, 10.1016/B978-0-32-390796-5.00023-1, (641-693),
  25. Thomasian A (2022). Structured, unstructured, and diverse databases Storage Systems, 10.1016/B978-0-32-390796-5.00018-8, (493-563),
  26. Sirajzade J, Bouvry P and Schommer C (2022). Deep Mining Covid-19 Literature Applied Informatics, 10.1007/978-3-031-19647-8_9, (121-133),
  27. Aggarwal C (2022). An Introduction to Text Analytics Machine Learning for Text, 10.1007/978-3-030-96623-2_1, (1-17),
  28. BÜYÜKEKE A and ÖZSOY T (2021). A text mining analysis of customer evaluations in terms of gastronomy tourism / Gastronomi turizmi açısından müşteri değerlendirmelerinin metin madenciliği ile analizi, Balıkesir Üniversitesi Sosyal Bilimler Enstitüsü Dergisi, 10.31795/baunsobed.1025204, 24:46-1, (1295-1312)
  29. BUDAK V (2021). Geçici Bilgi İhtiyacının Giderilme Sürecinde Kullanıcı Okuma Davranışlarının İncelenmesi, Turk Kutuphaneciligi - Turkish Librarianship, 10.24146/tk.955630, 35:4, (1-18)
  30. Trinko D, Porter E, Dunckley J, Bradley T and Coburn T (2021). Combining Ad Hoc Text Mining and Descriptive Analytics to Investigate Public EV Charging Prices in the United States, Energies, 10.3390/en14175240, 14:17, (5240)
  31. Gholamian S and Ward P (2021). On the Naturalness and Localness of Software Logs 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 10.1109/MSR52588.2021.00028, 978-1-7281-8710-5, (155-166)
  32. Liapis A, Vorvilas G, Korovesis A, Aggelopoulou D, Karousos N and Efstathopoulos E (2021). Evaluating the remote examination process applied by the Hellenic Open University (HOU) during COVID-19 pandemic: Students’ opinions 2021 IEEE Global Engineering Education Conference (EDUCON), 10.1109/EDUCON46332.2021.9454107, 978-1-7281-8478-4, (924-927)
  33. Parlina A, Ramli K and Murfi H (2021). Exposing Emerging Trends in Smart Sustainable City Research Using Deep Autoencoders-Based Fuzzy C-Means, Sustainability, 10.3390/su13052876, 13:5, (2876)
  34. Mahbub S, Pardede E, Kayes A and Chaudhry S (2021). Detection of Harassment Type of Cyberbullying, Security and Communication Networks, 2021, Online publication date: 1-Jan-2021.
  35. Kim J, On B and Lee I High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models, IEEE Access, 10.1109/ACCESS.2021.3086586, 9, (85240-85254)
  36. Singh K, Dorendro A, Devi H and Mahanta A (2021). Analysis of Changing Trends in Textual Data Representation Recent Trends in Image Processing and Pattern Recognition, 10.1007/978-981-16-0507-9_21, (237-251),
  37. Bachmaier P (2021). Text Mining: Durchführung einer Sentiment Analysis mit SAP HANA Data Science, 10.1007/978-3-658-33403-1_16, (259-275),
  38. Hoßfeld H (2021). Text Mining in der Organisationsforschung Handbuch Empirische Organisationsforschung, 10.1007/978-3-658-08580-3_35-1, (1-23),
  39. Seyler D, Li L and Zhai C (2020). Semantic Text Analysis for Detection of Compromised Accounts on Social Networks 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 10.1109/ASONAM49781.2020.9381432, 978-1-7281-1056-1, (417-424)
  40. Sim J, Miller P and Swarup S (2020). Tweeting the High Line Life: A Social Media Lens on Urban Green Spaces, Sustainability, 10.3390/su12218895, 12:21, (8895)
  41. Shah A, Yan X, Khan S, Khurrum W and Khan Q (2020). A multi-modal approach to predict the strength of doctor–patient relationships, Multimedia Tools and Applications, 10.1007/s11042-020-09596-w
  42. Moreno-Guerrero A, López-Belmonte J, Marín-Marín J and Soler-Costa R (2020). Scientific Development of Educational Artificial Intelligence in Web of Science, Future Internet, 10.3390/fi12080124, 12:8, (124)
  43. Hendrickx I, Voets T, van Dyk P and Kool R (2020). Using text mining techniques to identify healthcare providers with patient safety problems: an exploratory study (Preprint), Journal of Medical Internet Research, 10.2196/19064
  44. Fraj M, Hajkacem M and Essoussi N (2020). Self-Organizing Map for Multi-view Text Clustering Big Data Analytics and Knowledge Discovery, 10.1007/978-3-030-59065-9_30, (396-408),
  45. Al-Ash H, Putri M, Mursanto P and Bustamam A (2019). Ensemble Learning Approach on Indonesian Fake News Classification 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS), 10.1109/ICICoS48119.2019.8982409, 978-1-7281-4610-2, (1-6)
  46. Milne G, Villarroel Ordenes F and Kaplan B (2019). Mindful consumption: Three consumer segment views, Australasian Marketing Journal (AMJ), 10.1016/j.ausmj.2019.09.003, Online publication date: 1-Sep-2019.
  47. Labhishetty S, Bhavya, Pei K, Boughoula A and Zhai C Web of Slides Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale, (1-4)
  48. Hu M and Pavao-Zuckerman M (2019). Literature Review of Net Zero and Resilience Research of the Urban Environment: A Citation Analysis Using Big Data, Energies, 10.3390/en12081539, 12:8, (1539)
  49. Husáková M (2019). Ontology-Based Conceptualisation of Text Mining Practice Areas for Education Computational Collective Intelligence, 10.1007/978-3-030-28374-2_46, (533-542),
  50. Karmaker Santu S, Geigle C, Ferguson D, Cope W, Kalantzis M, Searsmith D and Zhai C (2018). SOFSAT, ACM SIGKDD Explorations Newsletter, 20:2, (21-30), Online publication date: 11-Dec-2018.
  51. Gupta D and Berberich K GYANI Proceedings of the 27th ACM International Conference on Information and Knowledge Management, (487-496)
  52. Lee G and Sun A Seed-driven Document Ranking for Systematic Reviews in Evidence-Based Medicine The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, (455-464)
  53. Wu P and Lin K (2018). Unstructured big data analytics for retrieving e-commerce logistics knowledge, Telematics and Informatics, 10.1016/j.tele.2017.11.004, 35:1, (237-244), Online publication date: 1-Apr-2018.
  54. Castillo E, Cervantes O, Vilariño D, Pinto D, Singh V, Villavicencio A, Mayr-Schlegel P and Stamatatos E (2018). Author profiling using a graph enrichment approach, Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 34:5, (3003-3014), Online publication date: 1-Jan-2018.
  55. Balog K (2018). Term-Based Models for Entity Ranking Entity-Oriented Search, 10.1007/978-3-319-93935-3_3, (57-99),
  56. Balog K (2018). Meet the Data Entity-Oriented Search, 10.1007/978-3-319-93935-3_2, (25-53),
  57. Lommatzsch A (2018). A Next Generation Chatbot-Framework for the Public Administration Innovations for Community Services, 10.1007/978-3-319-93408-2_10, (127-141),
  58. Correia A, Teodoro M and Lobo V (2018). Statistical Methods for Word Association in Text Mining Recent Studies on Risk Analysis and Statistical Modeling, 10.1007/978-3-319-76605-8_27, (375-384),
  59. Aggarwal C (2018). Machine Learning for Text: An Introduction Machine Learning for Text, 10.1007/978-3-319-73531-3_1, (1-16),
  60. Wang S, Giridhar P, Wang H, Kaplan L, Pham T, Yener A and Abdelzaher T StoryLine Proceedings of the Second International Conference on Internet-of-Things Design and Implementation, (83-93)
  61. Albishre K, Li Y and Xu Y Effective pseudo-relevance for Microblog retrieval Proceedings of the Australasian Computer Science Week Multiconference, (1-6)
  62. Golani N, Khandelwal I and Tripathy B (2017). Hybrid Intelligent Techniques in Text Mining and Analysis of Social Networks and Media Data Hybrid Intelligence for Social Networks, 10.1007/978-3-319-65139-2_1, (1-24),
  63. Correia A and Gonçalves A (2017). Topics Discovery in Text Mining Recent Advances in Information Systems and Technologies, 10.1007/978-3-319-56535-4_25, (251-256),
Contributors
  • University of Illinois Urbana-Champaign
  • University of Illinois Urbana-Champaign

Reviews

Fernando Berzal

An old rule of thumb suggests that 90 percent of all potentially relevant business information is in unstructured form. Hence, it is no surprise that many mathematically ill-defined problems associated with text analysis have attracted a lot of attention from data mining researchers. Text data management is a more mature field, and its associated text data access problems are tackled with the help of information retrieval techniques, as the popularity of web search engines attests. Zhai and Massung have managed to write a very readable introduction to both fields and their state of the art in 500 pages. After the usual introductory chapters, which include some background information and a very cursory mention of natural language processing (NLP) techniques, they delve into text data access methods, also known as information retrieval. Here, they discuss basic techniques such as ranking documents in response to a user query. They gently introduce retrieval models and the rationale behind them until they logically reach state-of-the-art vector space models, namely pivoted-length normalization and the Okapi BM25 ranking function. They also cover probabilistic models and, by clever use of analogies with the heuristic models, clearly explain the query likelihood retrieval model and the smoothing methods often used with it. Their discussion is not only theoretical, since they also cover practical issues associated with the implementation of information retrieval systems and, as you may expect, web search engines as the most prominent example of information retrieval systems nowadays. Their analysis of web search includes crawling, indexing, and link analysis, with the usual description of Google's PageRank and Kleinberg's HITS. 
The information retrieval half of this book is completed with short chapters on feedback (that is, how to take into account a user's actions to improve information retrieval results) and recommender systems, which provide relevant information to the user in "push" mode (in contrast to the "pull" mode of search and browsing, when the user initiates the requests). The second half of Zhai and Massung's textbook focuses on text mining, "text analysis" using the authors' preferred term. Word association mining, text clustering, text categorization, text summarization, topic modeling, opinion mining, and sentiment analysis are the main text mining problems studied in this second half of the book. Many of the discussed techniques are unavoidably application-specific, hence the authors' emphasis on the importance of feature engineering for solving problems such as text categorization, sentiment analysis, or text-based prediction. Their coverage of different problems is not without stark contrasts. For instance, a 60-page guided tour on probabilistic topic modeling, where probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are excruciatingly dissected, is followed by a shallow overview chapter on opinion mining and sentiment analysis. In this short chapter, text data is regarded as data generated from humans as subjective sensors, which enables mining knowledge about the human observer who generated the text data. The subjective content of text data is then analyzed using techniques such as ordinal logistic regression or latent aspect rating analysis (LARA), proposed by the first author in two KDD papers [1,2]. The text mining half of the book ends with a 30-page survey chapter on the joint analysis of text and structured data, which is a requirement in many real-world applications. In fact, non-text data can enrich text analysis, whereas text data can help interpret non-text data (for example, pattern annotation). 
Three example techniques illustrate how topic analysis can be combined with non-text data in different domains: the use of different views in contextual PLSA, the network supervised topic model in NetPLSA (for the joint analysis of text and social network data), and iterative causal topic modeling for the analysis of text associated with time series. The book's final chapter is a short position paper where the authors advocate for integrated software frameworks that support both text management (that is, information retrieval) and text analysis (that is, text mining). It can be read as a broad-brush overview of the essentials for future unified systems. In general terms, the authors typically provide verbose descriptions of the reasons behind the design of specific techniques, with numerical examples and illustrative figures from the slides of two massive open online courses (MOOCs) offered by the first author on Coursera. They also provide specific sections that describe in detail the proper way to evaluate every different kind of technique, a key factor to be taken into account when applying the discussed techniques in practice. The book, however, is not always self-contained, since its broad scope in a limited number of pages entails an unavoidable depth/breadth tradeoff. Most basic techniques can be implemented just by following the instructions and guidelines in the text, although interested readers might need to resort to the bibliographic references if they want to gain a thorough understanding of the many advanced techniques. Fortunately, the authors include some bibliographic notes and very selective suggestions for further reading at the end of each chapter, instead of the encyclopedic collection of references common in many other textbooks. 
Although readers will not find detailed coverage of NLP techniques and some chapters might seem lacking in depth, advanced undergraduate students might find this book to be a valuable reference for getting acquainted with both information retrieval and text mining in a single volume, a worthwhile achievement for a 500-page textbook. Online Computing Reviews Service

H. Van Dyke Parunak

One of the most rapidly growing sources of data, natural-language text, is also one of the most difficult to analyze. Computerized understanding of natural language was among the earliest anticipated benefits of artificial intelligence (AI), but it has proven extraordinarily challenging. This volume offers a selective introduction to the state of the art of computerized analysis of text. As befits the subtitle, "a practical introduction ...," it situates the techniques it explains in the context of a systems view that emphasizes how natural-language processing (NLP) can be applied in real applications. Chapter 1 introduces the overall framework, distinguishing analysis of the text from various organizational processes (including search, filtering, categorization, summarization, topic analysis, information extraction, clustering, and visualization) that support the two main objectives of retrieval operations and data mining. With the exception of information extraction and visualization, the book discusses each of these operations. Chapter 2 provides an overview of mathematical background in probability and statistics, information theory, and machine learning. Chapter 3 reviews the history of NLP and text data understanding. Most of the book is limited to a bag-of-words model, though this chapter acknowledges more sophisticated techniques. Chapter 4 introduces the authors' modern text analysis (MeTA) toolkit for text data management and analysis, encouraging readers to download the open-source C++-based system and use it in examples and exercises promised later in the text. This promise of a hands-on learning experience is only partly fulfilled. Few exercises, and even fewer examples in the body of the text, actually say anything about MeTA. 
Most of the exercises that do mention it do not use it to illustrate a particular text-analytic function, but ask the user either to look to see how MeTA implements a given text-analytic function, or to extend MeTA to do something discussed in the text. Both kinds of task require the reader to delve into the source code of MeTA rather than use the functionality of the package, and thus assume a level of knowledge about MeTA well beyond anything in the text. These exercises might be useful in the context of a class where the instructor is already acquainted with the internal design and implementation of MeTA. Some other toolkits are mentioned, but there is no reference to other, important ones, such as MALLET from the University of Massachusetts at Amherst. After these four introductory chapters, the rest of the book has three parts: seven chapters devoted to accessing textual data, eight to analyzing it, and one final chapter fleshing out an overall architecture for unified text management and analysis. The chapters on accessing data discuss retrieval models, how the information retrieval system gets feedback from the user, implementation and evaluation of search engines, a special chapter on web-based search, and recommender systems. Most chapters are about 20 pages long (the median chapter length for the book is 18 pages), but the chapter on retrieval models is 46 pages long. The extra detail is useful, given the importance of this theme, but it is uneven compared with the rest of the book. The selection of retrieval methods to discuss is not clear. Early in the chapter, the authors identify "four major models that are generally regarded as state of the art: pivoted length normalization, Okapi BM25, query likelihood, and PL2." However, the rest of the chapter mentions PL2 only in passing, focusing instead on two forms of smoothing for query likelihood, JM smoothing and Dirichlet prior smoothing. 
The chapter does not discuss two very important issues in the area of retrieval, van Rijsbergen's work on The geometry of information retrieval [1], and the particular challenges posed by comparing vectors in high-dimensional spaces, which characterize most keyword-based retrieval methods. The text analysis chapters discuss word association mining, text clustering, categorization and summarization, topic analysis, opinion mining and sentiment analysis, and the joint analysis of text and structured data. Again, the level of detail is uneven. The median chapter length in this section is 24 pages, but the chapter on topic analysis occupies 60 pages. Again, the theme is an important one, but the level of detail appears to be out of balance with the rest of the book. The book includes exercises with each chapter, appendices giving further details on mathematical methods mentioned earlier in the book (Bayesian statistics, expectation maximization, and KL-divergence and Dirichlet prior smoothing), copious references, and an index. The references usefully include the page numbers on which they are cited, but there is some irregularity. For example, van Rijsbergen's important volume [1] is listed twice in the references, once alphabetized under R, and again under V. Online Computing Reviews Service

Xiannong Meng

Zhai and Massung's new book Text data management and analysis provides a fresh new look at the areas of text retrieval, text mining, and text management. Traditionally, these three areas are separate, each with a rich collection of research literature and textbooks. Zhai and Massung masterfully weave the contents of these areas together and present students and scholars with a unified view of "everything text," including a piece of software, MeTA, which is developed by the authors for a variety of text analysis and management tasks. Because of the large scope of the contents, the authors chose to concentrate on the breadth, not the depth, of the knowledge area in this 500-plus-page textbook. The primary audience is upper-level undergraduate or first-year graduate students. The book contains 20 chapters that are divided into four parts and a few appendices. The first part reviews tools that are needed for the tasks, including probability and statistics, natural language understanding, and the installation and use of the MeTA software. The second part contains the major parts of a traditional information retrieval study. The subjects covered in this part are text retrieval, vector space, and probabilistic models; feedback models; search engine implementation and evaluation; search over the web; and recommendation systems. The third part mainly deals with various text mining related topics, such as word association mining, text clusters, topic analysis, and opinion mining. The fourth part is a summary of the authors' views about a unified framework for text analysis and management. There are three appendices that describe some common statistics tools, the Bayesian model, the expectation-maximization model, and KL-divergence and Dirichlet prior smoothing. Each chapter ends with a collection of exercises (about ten in each), which allow readers to assess how well they have learned the content. 
The exercises with the authors' software tool MeTA are spread throughout the book. The authors used this book in one of their (400-level) undergraduate courses and in two massive open online courses (MOOCs), all at the University of Illinois at Urbana-Champaign. Because text analysis and management are such important fields, it is a very good idea to seek ways to teach the topics at the undergraduate or early graduate level. The authors' approach of unifying text information retrieval and text mining is very refreshing and worth noting. In particular, the authors provided a programming tool that students can use as they learn the course materials. But I think challenges from two aspects remain. One issue is that the mathematics tools needed for text mining are typically out of reach for undergraduate computer science students. It is common practice in undergraduate data mining courses to use packages such as R or Weka to hide the details of statistical analysis. The second challenge is the amount of information covered in the book. It is a great idea to establish a unified framework as the book does. And in keeping the book, and thus the courses using this book, to a manageable size, I agree it is a very good idea to keep a broad view of the topics, without going into depth. But the number of topics covered in the book is vast. It will be a real challenge to use it in undergraduate courses. One may just have to cover selected topics in a typical semester. Regardless, this is a very good attempt to unify two important areas, text retrieval and text mining, for a society in which text analysis is becoming increasingly critical. The book also shows the depth and the breadth of the knowledge of the authors. Online Computing Reviews Service


Recommendations