Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2020408.2020424acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Model order selection for boolean matrix factorization

Published: 21 August 2011 Publication History

Abstract

Matrix factorizations---where a given data matrix is approximated by a product of two or more factor matrices---are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the `model order selection problem' of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices.
Boolean matrix factorization (BMF)---where data, factors, and matrix product are Boolean---has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available.
In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.
We formulate the description length function for BMF in general---making it applicable for any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.

References

[1]
R. Belohlavek and V. Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comp. Sys. Sci., 76(1):3--20, 2010.
[2]
R. Cattell. The scree test for the number of factors. Multivar. Behav. Res., 1:245--276, 1966.
[3]
V. Chandola and V. Kumar. Summarization -- compressing data into an informative representation. Knowl. Inf. Syst., 12(3):355--378, 2007.
[4]
R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Trans. Inform. Theory, 51(4):1523--1545, 2005.
[5]
T. Cover and J. Thomas. Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
[6]
C. T. dos S. Dias and W. J. Krzanowski. Model selection and cross validation in additive main effect and multiplicative interaction models. Crop Sci., 43:865--873, 2003.
[7]
S. M. Embleton and E. S. Wheeler. Finnish dialect atlas for quantitative studies. J. Quant. Ling., 4(1--3):99--102, 1997.
[8]
C. Faloutsos and V. Megalooikonomou. On data mining, compression and Kolmogorov complexity. Data Min. Knowl. Disc., 15:3--20, 2007.
[9]
U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. UAI'93, pages 1022--1027, 1993.
[10]
M. Fortelius et al. Neogene of the old world database of fossil mammals (NOW), 2003. http://www.helsinki.fi/science/now/.
[11]
G. C. Garriga, E. Junttila, and H. Mannila. Banded structure in binary matrices. In Proc. KDD'08, pages 292--300, 2008.
[12]
F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In Proc. DS'04, pages 278--289, 2004.
[13]
G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 1996.
[14]
P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[15]
H. Heikinheimo, J. Vreeken, A. Siebes, and H. Mannila. Low-entropy set selection. In Proc. SDM'09, 2009.
[16]
E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proc. KDD'04, pages 206--215, 2004.
[17]
K.-N. Kontonasios and T. De Bie. An information-theoretic approach to finding noisy tiles in binary databases. In Proc. SDM'10, 2010.
[18]
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, 1993.
[19]
H. Lu, J. Vaidya, and V. Atluri. Optimal Boolean matrix decomposition: Application to role engineering. In Proc. ICDE'08, pages 297--306, 2008.
[20]
P. Miettinen. Matrix Decomposition Methods for Data Mining: Computational Complexity and Algorithms. PhD thesis, University of Helsinki, 2009.
[21]
P. Miettinen. Sparse Boolean matrix factorizations. In Proc. ICDM'10, pages 935--940, 2010.
[22]
P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348--1362, 2008.
[23]
T. P. Minka. Automatic choice of dimensionality for PCA. In Proc. NIPS'00, pages 598--604, 2001.
[24]
A. J. Mitchell-Jones, G. Amori, W. Bogdanowicz, B. Krystufek, P. J. H. Reijnders, F. Spitzenberger, M. Stubbe, J. B. M. Thissen, V. Vohralik, and J. Zima. The Atlas of European Mammals. Academic Press, 1999.
[25]
S. D. Monson, N. J. Pullman, and R. Rees. A survey of clique and biclique coverings and factorizations of (0,1)-matrices. Bulletin of the ICA, 14:17--86, 1995.
[26]
S. Myllykangas, J. Himberg, T. Böhling, B. Nagy, J. Hollmén, and S. Knuutila. DNA copy number amplification profiling of human neoplasms. Oncogene, 25(55), 2006.
[27]
D. S. Nau, G. Markowsky, M. A. Woodbury, and D. B. Amos. A mathematical analysis of human leukocyte antigen serology. Math. Biosci., 40:243--270, 1978.
[28]
A. B. Owen and P. O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann. Appl. Stat., 3(2):564--594, Jun 2009.
[29]
P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111--126, 1994.
[30]
V. Pestov. An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2--3):204--213, 2008.
[31]
J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465--471, 1978.
[32]
M. Schmidt, O. Winther, and L. Hansen. Bayesian non-negative matrix factorization. In Proc. ICA'09, volume 5411 of LNCS, pages 540--547, 2009.
[33]
A. Streich, M. Frank, D. Basin, and J. Buhmann. Multi-assignment clustering for Boolean data. In Proc. ICML'09, pages 969--976, 2009.
[34]
N. Tatti, T. Mielikainen, A. Gionis, and H. Mannila. What is the dimension of your binary data? In Proc. ICDM'06, pages 603--612, 2006.
[35]
N. Tatti and J. Vreeken. Finding good itemsets by packing data. In Proc. ICDM'08, pages 588--597, 2008.
[36]
J. Vaidya, V. Atluri, and Q. Guo. The role mining problem: Finding a minimal descriptive set of roles. In Proc. SACMAT, pages 175--184, 2007.
[37]
J. Vreeken and A. Siebes. Filling in the blanks -- Krimp minimisation for missing data. In Proc. ICDM'08, pages 1067--1072, 2008.
[38]
J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov., 19(2):176--193, 2010. 10.1007/s10618-010-0202-x.
[39]
C. Wallace. Statistical and inductive inference by minimum message length. Springer-Verlag, 2005.
[40]
J. Wang and G. Karypis. On efficiently summarizing categorical databases. Knowl. Inf. Syst., 9(1):19--37, 2006.
[41]
K. Yeomans and P. Golder. The Guttman--Kaiser criterion as a predictor of the number of common factors. The Statistician, 31(3):221--229, 1982.
[42]
M. Zhu and A. Ghodsi. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comp. Stat. & Data Anal., 51(2):918--930, 2006.

Cited By

View all
  • (2024)Bias-aware boolean matrix factorization using disentangled representation learningProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702845(3630-3642)Online publication date: 15-Jul-2024
  • (2023)Parameterized Complexity of Feature Selection for Categorical Data ClusteringACM Transactions on Computation Theory10.1145/360479715:3-4(1-24)Online publication date: 1-Sep-2023
  • (2023)Boolean matrix factorization for symmetric binary variablesKnowledge-Based Systems10.1016/j.knosys.2023.110944279:COnline publication date: 4-Nov-2023
  • Show More Cited By

Index Terms

  1. Model order selection for boolean matrix factorization

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2011
    1446 pages
    ISBN:9781450308137
    DOI:10.1145/2020408
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2011

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. boolean matrix factorizations
    2. matrix decompositions
    3. matrix factorizations
    4. minimum description length principle
    5. model order selection
    6. model selection

    Qualifiers

    • Research-article

    Conference

    KDD '11
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Upcoming Conference

    KDD '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)28
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 14 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Bias-aware boolean matrix factorization using disentangled representation learningProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702845(3630-3642)Online publication date: 15-Jul-2024
    • (2023)Parameterized Complexity of Feature Selection for Categorical Data ClusteringACM Transactions on Computation Theory10.1145/360479715:3-4(1-24)Online publication date: 1-Sep-2023
    • (2023)Boolean matrix factorization for symmetric binary variablesKnowledge-Based Systems10.1016/j.knosys.2023.110944279:COnline publication date: 4-Nov-2023
    • (2022)Approximate Logic Synthesis Using Boolean Matrix FactorizationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.305460341:1(15-28)Online publication date: Jan-2022
    • (2022)Boolean and $\mathbb{F}_{p}$-Matrix Factorization: From Theory to Practice2022 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN55064.2022.9892947(1-8)Online publication date: 18-Jul-2022
    • (2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
    • (2022)Design Space Exploration ToolsApproximate Computing Techniques10.1007/978-3-030-94705-7_8(215-259)Online publication date: 3-Jan-2022
    • (2021)Binary matrix factorization on special purpose hardwarePLOS ONE10.1371/journal.pone.026125016:12(e0261250)Online publication date: 16-Dec-2021
    • (2021)Model order selection for approximate Boolean matrix factorization problem▪Knowledge-Based Systems10.1016/j.knosys.2021.107184227:COnline publication date: 5-Sep-2021
    • (2020)Summarizing Complex Graphical Models of Multiple Chronic Conditions Using the Second Eigenvalue of Graph Laplacian: Algorithm Development and ValidationJMIR Medical Informatics10.2196/163728:6(e16372)Online publication date: 17-Jun-2020
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media