AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases

Jia-Yu Pan¹⁹,
Hiroyuki Kitagawa²⁰,
Christos Faloutsos¹⁹ &
Masafumi Hamamoto²⁰

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

For discovering hidden (latent) variables in real-world, non-Gaussian data streams or in an n-dimensional cloud of data points, SVD suffers from its orthogonality constraint. Our proposed method, “AutoSplit”, finds features that are mutually independent and can therefore discover non-orthogonal features. Thus, it (a) finds more meaningful hidden variables and features, (b) leads easily to clustering and segmentation, (c) surprisingly, scales linearly with the database size, and (d) can also operate in on-line, single-pass mode. We also propose “Clustering-AutoSplit”, which extends feature discovery to multiple sets of features (bases) and leads to clean clustering. Experiments on multiple real-world data sets show that our method has all of the properties above and outperforms the state-of-the-art SVD.
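
The orthogonality constraint mentioned in the abstract can be illustrated with a small synthetic sketch. It is not the AutoSplit algorithm: the data, the 45-degree basis `A`, and the function names are hypothetical, and a generic kurtosis-maximizing rotation search on whitened 2-D data stands in for "finding mutually independent features". The point is only that SVD directions are always orthogonal, while a search for independent directions can recover a non-orthogonal basis.

```python
import numpy as np

def unit_rows(M):
    """Scale each row of M to unit length."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def compare_svd_and_independent_directions(n=20000, seed=0):
    rng = np.random.default_rng(seed)
    # Two independent, non-Gaussian (Laplace) hidden variables (assumed data).
    S = rng.laplace(size=(n, 2))
    # Hypothetical non-orthogonal basis: rows are directions 45 degrees apart.
    A = np.array([[1.0, 0.0],
                  [np.cos(np.pi / 4), np.sin(np.pi / 4)]])
    X = S @ A
    X = X - X.mean(axis=0)

    # SVD: the rows of Vt are orthonormal by construction.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Whitened data (identity sample covariance); what remains is a rotation.
    Z = U * np.sqrt(n)

    def contrast(theta):
        c, si = np.cos(theta), np.sin(theta)
        R = np.array([[c, -si], [si, c]])
        Y = Z @ R.T
        # Sum of absolute excess kurtosis: largest when the columns of Y
        # are the independent sources rather than mixtures of them.
        return np.abs(np.mean(Y ** 4, axis=0) - 3.0).sum(), R

    thetas = np.linspace(0.0, np.pi / 2, 901)
    best_R = max((contrast(t) for t in thetas), key=lambda p: p[0])[1]

    # Undo the whitening to express the estimated basis in data space.
    A_hat = best_R @ np.diag(s / np.sqrt(n)) @ Vt
    return unit_rows(A), Vt, unit_rows(A_hat)

true_dirs, svd_dirs, ind_dirs = compare_svd_and_independent_directions()
print("cosine between SVD directions:        ", abs(svd_dirs[0] @ svd_dirs[1]))
print("cosine between independent directions:", abs(ind_dirs[0] @ ind_dirs[1]))
print("cosine between true directions:       ", abs(true_dirs[0] @ true_dirs[1]))
```

The estimated directions are defined only up to sign and ordering, so any comparison with the true basis should use absolute cosines and allow the rows to be permuted.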

Supported in part by the Japan-U.S. Cooperative Science Program of JSPS; grants from JSPS and MEXT (#15017207, #15300027); NSF No. IRI-9817496, IIS-9988876, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, EF-0331657; the Pennsylvania Infrastructure Technology Alliance No. 22-901-0001; DARPA No. N66001-00-1-8936; and donations from Intel and Northrop-Grumman.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Jia-Yu Pan & Christos Faloutsos
University of Tsukuba, Tennohdai, Tsukuba, Ibaraki, 305-8573, Japan
Hiroyuki Kitagawa & Masafumi Hamamoto

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

About this paper

Cite this paper

Pan, JY., Kitagawa, H., Faloutsos, C., Hamamoto, M. (2004). AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_62

DOI: https://doi.org/10.1007/978-3-540-24775-3_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
