DOI: 10.1007/978-3-642-35236-2_62

On text preprocessing for opinion mining outside of laboratory environments

Published: 04 December 2012

Abstract

Opinion mining applies scientific methods to find, extract, and systematically analyze subjective information. When opinion mining is used to analyze content on the Web, challenges arise that usually do not occur in laboratory environments, where prepared and preprocessed texts are used. This paper discusses preprocessing approaches that help cope with the problems that emerge when sentiment analysis is applied in real-world situations. After outlining the identified shortcomings and presenting a general process model for opinion mining, it discusses promising solutions for language identification, content extraction, and dealing with Internet slang.


Published In

AMT'12: Proceedings of the 8th International Conference on Active Media Technology
December 2012, 667 pages
ISBN: 9783642352355
Editors: Runhe Huang, Ali A. Ghorbani, Gabriella Pasi, Takahira Yamaguchi, Neil Y. Yen

Sponsors

• IEEE Computational Intelligence Society
• Web Intelligence Consortium
• Lecture Notes in Computer Science of Springer

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. content extraction
2. internet slang
3. language detection
4. opinion mining
5. sentiment analysis
6. text mining
7. web analytics
