research-article

Learning under Feature Drifts in Textual Streams

Authors:

Damianos P. Melidis,

Myra Spiliopoulou,

Eirini Ntoutsi

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management

Pages 527 - 536

https://doi.org/10.1145/3269206.3271717

Published: 17 October 2018

Abstract

Huge amounts of textual streams are generated nowadays, especially in social networks like Twitter and Facebook. As the discussion topics and user opinions on those topics change drastically with time, those streams undergo changes in data distribution, leading to changes in the concept to be learned, a phenomenon called concept drift. One particular type of drift that has not yet attracted much attention is feature drift, i.e., changes in the features that are relevant for the learning task at hand. In this work, we propose an approach for handling feature drifts in textual streams. Our approach integrates i) an ensemble-based mechanism to accurately predict the feature/word values for the next time-point, taking into account that different features might be subject to different temporal trends, and ii) a sketch-based feature space maintenance mechanism that allows for memory-bounded maintenance of the feature space over the stream. Experiments with textual streams from the sentiment analysis, email preference and spam detection domains demonstrate that our approach achieves significantly better or competitive performance compared to baselines.
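The two mechanisms named in the abstract can be illustrated with a small sketch. The abstract does not say which forecasters or which sketch structure the approach uses, so everything below is an assumption made for illustration: a Space-Saving style counter stands in for the memory-bounded feature space, and three simple forecasters (last value, moving average, exponential smoothing) are combined with weights inversely proportional to their past one-step-ahead error. All names (`SpaceSaving`, `ensemble_forecast`, `FORECASTERS`) are hypothetical.

```python
class SpaceSaving:
    """Space-Saving style sketch (assumed here, not confirmed by the
    abstract): tracks at most `capacity` word counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, word):
        if word in self.counts:
            self.counts[word] += 1
        elif len(self.counts) < self.capacity:
            self.counts[word] = 1
        else:
            # Evict the word with the smallest count; the newcomer inherits
            # that count plus one, so its count may be an overestimate.
            victim = min(self.counts, key=self.counts.get)
            self.counts[word] = self.counts.pop(victim) + 1


def last_value(history):
    """Predict that the next value equals the most recent one."""
    return history[-1]


def moving_average(history, window=3):
    """Predict the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exp_smoothing(history, alpha=0.5):
    """Predict the simple exponentially smoothed level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Illustrative ensemble members; each may suit a different temporal trend.
FORECASTERS = (last_value, moving_average, exp_smoothing)


def ensemble_forecast(history):
    """Predict the next value of one word's series (non-empty `history`).

    Each forecaster is scored by its accumulated absolute one-step-ahead
    error on the past values; predictions are averaged with weights
    inversely proportional to that error.
    """
    errors = [0.0] * len(FORECASTERS)
    for t in range(1, len(history)):
        for i, forecaster in enumerate(FORECASTERS):
            errors[i] += abs(forecaster(history[:t]) - history[t])
    weights = [1.0 / (1e-9 + e) for e in errors]
    predictions = [forecaster(history) for forecaster in FORECASTERS]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
```

In such a setup, each time window of the stream would be tokenized and fed word by word into the sketch, and the per-window count of every tracked word would form the `history` passed to `ensemble_forecast`. Weighting by past error lets a word with a stable frequency and a word with a rising frequency favor different forecasters, which is one simple way to reflect that features follow different temporal trends.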



Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management

October 2018

2362 pages

ISBN: 9781450360142

DOI: 10.1145/3269206

General Chair: Alfredo Cuzzocrea (University of Trieste, Italy)

Program Chairs: James Allan (University of Massachusetts, USA); Norman Paton (University of Manchester, United Kingdom); Divesh Srivastava (AT&T Labs Research, USA); Rakesh Agrawal (Data Insights Lab, USA); Andrei Broder (Google Research, USA); Mohammed Zaki (Rensselaer Polytechnic Institute, USA); Selcuk Candan (Arizona State University, USA); Alexandros Labrinidis (University of Pittsburgh, USA); Assaf Schuster (Technion, Israel); Haixun Wang (Google Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '18

CIKM '18: The 27th ACM International Conference on Information and Knowledge Management

October 22 - 26, 2018

Torino, Italy

Acceptance Rates

CIKM '18 Paper Acceptance Rate: 147 of 826 submissions, 18%

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

Cited By

Soleymanian M, Mashayekhi H, Rahimi M (2024). An incremental clustering algorithm based on semantic concepts. Knowledge and Information Systems 66:6, 3303-3335. Online publication date: 1-Jun-2024.
https://dl.acm.org/doi/10.1007/s10115-024-02063-0
de Lima A, Batista M, Barddal J, Sanches D, de Oliveira L (2022). Assessing Batch and Online Learning for Delivery in Full and On Time Predictions. 2022 International Joint Conference on Neural Networks (IJCNN), 1-9. Online publication date: 18-Jul-2022.
https://doi.org/10.1109/IJCNN55064.2022.9892386
Patil M, Kumar S, Kumar S, Garg M (2021). Concept Drift Detection for Social Media: A Survey. 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), 12-16. Online publication date: 17-Dec-2021.
https://doi.org/10.1109/ICAC3N53548.2021.9725548
Abolfazli A, Ntoutsi E (2020). Drift-Aware Multi-Memory Model for Imbalanced Data Streams. 2020 IEEE International Conference on Big Data (Big Data), 878-885. Online publication date: 10-Dec-2020.
https://doi.org/10.1109/BigData50022.2020.9378101
Beyer C, Unnikrishnan V, Brüggemann R, Toulouse V, Omar H, Ntoutsi E, Spiliopoulou M (2020). Resource management for model learning at entity level. Annals of Telecommunications. Online publication date: 29-Aug-2020.
https://doi.org/10.1007/s12243-020-00800-4
Zhang W, Ntoutsi E (2019). FAHT. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 1480-1486. Online publication date: 10-Aug-2019.
https://dl.acm.org/doi/10.5555/3367032.3367242
Li Y, Cheng Y (2019). Streaming Feature Selection for Multi-Label Data with Dynamic Sliding Windows and Feature Repulsion Loss. Entropy 21:12, 1151. Online publication date: 25-Nov-2019.
https://doi.org/10.3390/e21121151
Beyer C, Unnikrishnan V, Niemann U, Matuszyk P, Ntoutsi E, Spiliopoulou M, Hung C, Papadopoulos G (2019). Exploiting entity information for stream classification over a stream of reviews. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 564-573. Online publication date: 8-Apr-2019.
https://dl.acm.org/doi/10.1145/3297280.3297333
Iosifidis V, Ntoutsi E (2019). Sentiment analysis on big sparse data streams with limited labels. Knowledge and Information Systems. Online publication date: 17-Aug-2019.
https://doi.org/10.1007/s10115-019-01392-9
