Abstract
Data streams are dynamic, with frequent distributional changes. In this paper, we propose a statistical approach to detecting distributional shifts in multi-dimensional data streams. We use relative entropy, also known as the Kullback-Leibler distance, to measure the statistical distance between two distributions. In the context of a multi-dimensional data stream, the distributions are generated by data from two sliding windows. We maintain a sample of the data from the stream inside the windows to build the distributions.
Our algorithm is streaming, nonparametric, and requires no distributional or model assumptions. It employs the statistical theory of hypothesis testing and bootstrapping to determine whether the distributions are statistically different. We provide a full suite of experiments on synthetic data to validate the method and demonstrate its effectiveness on data from real-life applications.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Aggarwal, C.C.: A framework for diagnosing changes in evolving data streams. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 575–586 (2003)
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)
Chawathe, S.S., Abiteboul, S., Widom, J.: Representing and querying changes in semistructured data. In: ICDE 1998, pp. 4–13 (1998)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons, Inc., Chichester (1991)
Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, London (1981)
Dasu, T., Krishnan, S., Venkatasubramanian, S., Yi, K.: An information-theoretic approach to detecting changes in multi-dimensional data streams. In: Interface 2006 (2006)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall, Boca Raton (1993)
Fleiss, J.L., Levin, B., Paik, M.: Statistical Methods for Rates and Proportions, 3rd edn. John Wiley and Sons, New York (2003)
Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.-Y.: A framework for measuring differences in data characteristics. pp. 126–137 (1999)
Gutman, M.: Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 35, 401–408 (1989)
Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: KDD, pp. 97–106 (2001)
Johnson, D., Gruner, C.: Information-theoretic analysis of neural coding. Journal of Computational Neuroscience 10, 47–69 (2001)
Keogh, E., Lonardi, S., Chiu, B.Y.: Finding surprising patterns in a time series database in linear time and space. In: KDD, pp. 550–556 (2002)
Kifer, D., Ben-David, S., Gehrke, J.: Detecting changes in data streams. In: Proceedings of the 30th International Conference on Very Large Databases, pp. 180–191 (2004)
Kleinberg, J.: Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7(4), 373–397 (2003)
Krichevsky, R.E., Trofimov, V.K.: The performance of universal encoding. IEEE Trans. Inf. Theory 27, 199–207 (1981)
Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL, pp. 183–190 (1993)
Pietra, S.D., Pietra, V.D., Lafferty, J.: Inducing features of random fields. IEEE Trans. Pattern Analysis and Machine Intelligence 19, 380–393 (1995)
Shibata, R.: Bootstrap estimate of Kullback-Liebler information for model selection. Statistica Sinica 7, 375–394 (1997)
Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-dimensional data. In: ACM SIGKDD 2007, pp. 667–676 (2007)
Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., Gunopulos, D.: Online outlier detection in sensor data using non-parametric models. In: VLDB 2006, pp. 187–198 (2006)
Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 37–57 (1985)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dasu, T., Krishnan, S., Lin, D., Venkatasubramanian, S., Yi, K. (2009). Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams . In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, JF. (eds) Advances in Intelligent Data Analysis VIII. IDA 2009. Lecture Notes in Computer Science, vol 5772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03915-7_3
Download citation
DOI: https://doi.org/10.1007/978-3-642-03915-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03914-0
Online ISBN: 978-3-642-03915-7
eBook Packages: Computer ScienceComputer Science (R0)