research-article

Model order selection for boolean matrix factorization

Authors:

Pauli Miettinen,

Jilles VreekenAuthors Info & Claims

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 51 - 59

https://doi.org/10.1145/2020408.2020424

Published: 21 August 2011 Publication History

Abstract

Matrix factorizations---where a given data matrix is approximated by a product of two or more factor matrices---are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the `model order selection problem' of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices.

Boolean matrix factorization (BMF)---where data, factors, and matrix product are Boolean---has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available.

In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.

We formulate the description length function for BMF in general---making it applicable for any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.

References

[1]

R. Belohlavek and V. Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comp. Sys. Sci., 76(1):3--20, 2010.

Digital Library

[2]

R. Cattell. The scree test for the number of factors. Multivar. Behav. Res., 1:245--276, 1966.

[3]

V. Chandola and V. Kumar. Summarization -- compressing data into an informative representation. Knowl. Inf. Syst., 12(3):355--378, 2007.

Digital Library

[4]

R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Trans. Inform. Theory, 51(4):1523--1545, 2005.

Digital Library

[5]

T. Cover and J. Thomas. Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.

Digital Library

[6]

C. T. dos S. Dias and W. J. Krzanowski. Model selection and cross validation in additive main effect and multiplicative interaction models. Crop Sci., 43:865--873, 2003.

[7]

S. M. Embleton and E. S. Wheeler. Finnish dialect atlas for quantitative studies. J. Quant. Ling., 4(1--3):99--102, 1997.

[8]

C. Faloutsos and V. Megalooikonomou. On data mining, compression and Kolmogorov complexity. Data Min. Knowl. Disc., 15:3--20, 2007.

Digital Library

[9]

U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. UAI'93, pages 1022--1027, 1993.

[10]

M. Fortelius et al. Neogene of the old world database of fossil mammals (NOW), 2003. http://www.helsinki.fi/science/now/.

[11]

G. C. Garriga, E. Junttila, and H. Mannila. Banded structure in binary matrices. In Proc. KDD'08, pages 292--300, 2008.

Digital Library

[12]

F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In Proc. DS'04, pages 278--289, 2004.

[13]

G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 1996.

[14]

P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

Digital Library

[15]

H. Heikinheimo, J. Vreeken, A. Siebes, and H. Mannila. Low-entropy set selection. In Proc. SDM'09, 2009.

[16]

E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proc. KDD'04, pages 206--215, 2004.

Digital Library

[17]

K.-N. Kontonasios and T. De Bie. An information-theoretic approach to finding noisy tiles in binary databases. In Proc. SDM'10, 2010.

[18]

M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, 1993.

Digital Library

[19]

H. Lu, J. Vaidya, and V. Atluri. Optimal Boolean matrix decomposition: Application to role engineering. In Proc. ICDE'08, pages 297--306, 2008.

Digital Library

[20]

P. Miettinen. Matrix Decomposition Methods for Data Mining: Computational Complexity and Algorithms. PhD thesis, University of Helsinki, 2009.

[21]

P. Miettinen. Sparse Boolean matrix factorizations. In Proc. ICDM'10, pages 935--940, 2010.

Digital Library

[22]

P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348--1362, 2008.

Digital Library

[23]

T. P. Minka. Automatic choice of dimensionality for PCA. In Proc. NIPS'00, pages 598--604, 2001.

[24]

A. J. Mitchell-Jones, G. Amori, W. Bogdanowicz, B. Krystufek, P. J. H. Reijnders, F. Spitzenberger, M. Stubbe, J. B. M. Thissen, V. Vohralik, and J. Zima. The Atlas of European Mammals. Academic Press, 1999.

[25]

S. D. Monson, N. J. Pullman, and R. Rees. A survey of clique and biclique coverings and factorizations of (0,1)-matrices. Bulletin of the ICA, 14:17--86, 1995.

[26]

S. Myllykangas, J. Himberg, T. Böhling, B. Nagy, J. Hollmén, and S. Knuutila. DNA copy number amplification profiling of human neoplasms. Oncogene, 25(55), 2006.

[27]

D. S. Nau, G. Markowsky, M. A. Woodbury, and D. B. Amos. A mathematical analysis of human leukocyte antigen serology. Math. Biosci., 40:243--270, 1978.

[28]

A. B. Owen and P. O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann. Appl. Stat., 3(2):564--594, Jun 2009.

[29]

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111--126, 1994.

[30]

V. Pestov. An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2--3):204--213, 2008.

Digital Library

[31]

J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465--471, 1978.

Digital Library

[32]

M. Schmidt, O. Winther, and L. Hansen. Bayesian non-negative matrix factorization. In Proc. ICA'09, volume 5411 of LNCS, pages 540--547, 2009.

Digital Library

[33]

A. Streich, M. Frank, D. Basin, and J. Buhmann. Multi-assignment clustering for Boolean data. In Proc. ICML'09, pages 969--976, 2009.

Digital Library

[34]

N. Tatti, T. Mielikainen, A. Gionis, and H. Mannila. What is the dimension of your binary data? In Proc. ICDM'06, pages 603--612, 2006.

Digital Library

[35]

N. Tatti and J. Vreeken. Finding good itemsets by packing data. In Proc. ICDM'08, pages 588--597, 2008.

Digital Library

[36]

J. Vaidya, V. Atluri, and Q. Guo. The role mining problem: Finding a minimal descriptive set of roles. In Proc. SACMAT, pages 175--184, 2007.

Digital Library

[37]

J. Vreeken and A. Siebes. Filling in the blanks -- Krimp minimisation for missing data. In Proc. ICDM'08, pages 1067--1072, 2008.

Digital Library

[38]

J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov., 19(2):176--193, 2010. 10.1007/s10618-010-0202-x.

Digital Library

[39]

C. Wallace. Statistical and inductive inference by minimum message length. Springer-Verlag, 2005.

Digital Library

[40]

J. Wang and G. Karypis. On efficiently summarizing categorical databases. Knowl. Inf. Syst., 9(1):19--37, 2006.

Digital Library

[41]

K. Yeomans and P. Golder. The Guttman--Kaiser criterion as a predictor of the number of common factors. The Statistician, 31(3):221--229, 1982.

[42]

M. Zhu and A. Ghodsi. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comp. Stat. & Data Anal., 51(2):918--930, 2006.

Digital Library

Cited By

Wang XWang JZhao TWang YZhang NZang YCao SZhang CKiyavash NMooij J(2024)Bias-aware boolean matrix factorization using disentangled representation learningProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702845(3630-3642)Online publication date: 15-Jul-2024
https://dl.acm.org/doi/10.5555/3702676.3702845
Bandyapadhyay SFomin FGolovach PSimonov K(2023)Parameterized Complexity of Feature Selection for Categorical Data ClusteringACM Transactions on Computation Theory10.1145/360479715:3-4(1-24)Online publication date: 1-Sep-2023
https://dl.acm.org/doi/10.1145/3604797
Konecny JTrnecka M(2023)Boolean matrix factorization for symmetric binary variablesKnowledge-Based Systems10.1016/j.knosys.2023.110944279:COnline publication date: 4-Nov-2023
https://dl.acm.org/doi/10.1016/j.knosys.2023.110944
Show More Cited By

Index Terms

Model order selection for boolean matrix factorization
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, ...
Sparse Boolean Matrix Factorizations
ICDM '10: Proceedings of the 2010 IEEE International Conference on Data Mining

Matrix factorizations are commonly used methods in data mining. When the input data is Boolean, replacing the standard matrix multiplication with Boolean matrix multiplication can yield more intuitive results. Unfortunately, finding a good Boolean ...
Interpretable nonnegative matrix decompositions
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

A matrix decomposition expresses a matrix as a product of at least two factor matrices. Equivalently, it expresses each column of the input matrix as a linear combination of the columns in the first factor matrix. The interpretability of the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2011

1446 pages

ISBN:9781450308137

DOI:10.1145/2020408

General Chair:
Chid Apte
IBM Research
,
Program Chairs:
Joydeep Ghosh
UT Austin
,
Padhraic Smyth
UC Irvine

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

KDD '11

Sponsor:

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2011

California, San Diego, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Sponsor:
sigkdd
sigkdd

The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 3 - 7, 2025

Toronto , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

57
Total Citations
View Citations
846
Total Downloads

Downloads (Last 12 months)28
Downloads (Last 6 weeks)8

Reflects downloads up to 14 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wang XWang JZhao TWang YZhang NZang YCao SZhang CKiyavash NMooij J(2024)Bias-aware boolean matrix factorization using disentangled representation learningProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702845(3630-3642)Online publication date: 15-Jul-2024
https://dl.acm.org/doi/10.5555/3702676.3702845
Bandyapadhyay SFomin FGolovach PSimonov K(2023)Parameterized Complexity of Feature Selection for Categorical Data ClusteringACM Transactions on Computation Theory10.1145/360479715:3-4(1-24)Online publication date: 1-Sep-2023
https://dl.acm.org/doi/10.1145/3604797
Konecny JTrnecka M(2023)Boolean matrix factorization for symmetric binary variablesKnowledge-Based Systems10.1016/j.knosys.2023.110944279:COnline publication date: 4-Nov-2023
https://dl.acm.org/doi/10.1016/j.knosys.2023.110944
Ma JHashemi SReda S(2022)Approximate Logic Synthesis Using Boolean Matrix FactorizationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.305460341:1(15-28)Online publication date: Jan-2022
https://doi.org/10.1109/TCAD.2021.3054603
Fomin FPanolan FPatil ATanveer A(2022)Boolean and $\mathbb{F}_{p}$-Matrix Factorization: From Theory to Practice2022 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN55064.2022.9892947(1-8)Online publication date: 18-Jul-2022
https://doi.org/10.1109/IJCNN55064.2022.9892947
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Barbareschi MBarone SMazzocca NMoriconi A(2022)Design Space Exploration ToolsApproximate Computing Techniques10.1007/978-3-030-94705-7_8(215-259)Online publication date: 3-Jan-2022
https://doi.org/10.1007/978-3-030-94705-7_8
Malik OUshijima-Mwesigwa HRoy AMandal AGhosh I(2021)Binary matrix factorization on special purpose hardwarePLOS ONE10.1371/journal.pone.026125016:12(e0261250)Online publication date: 16-Dec-2021
https://doi.org/10.1371/journal.pone.0261250
Trnecka MTrneckova M(2021)Model order selection for approximate Boolean matrix factorization problem▪Knowledge-Based Systems10.1016/j.knosys.2021.107184227:COnline publication date: 5-Sep-2021
https://dl.acm.org/doi/10.1016/j.knosys.2021.107184
Faruqui SAlaeddini AChang MShirinkam SJaramillo CNajafiRad PWang JPugh M(2020)Summarizing Complex Graphical Models of Multiple Chronic Conditions Using the Second Eigenvalue of Graph Laplacian: Algorithm Development and ValidationJMIR Medical Informatics10.2196/163728:6(e16372)Online publication date: 17-Jun-2020
https://doi.org/10.2196/16372
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten