research-article

Algorithm for low-variance biclusters to identify coregulation modules in sequencing datasets

Authors:

Raj Bhatnagar

BIOKDD '11: Proceedings of the Tenth International Workshop on Data Mining in Bioinformatics

Article No.: 1, Pages 1 - 10

https://doi.org/10.1145/2003351.2003352

Published: 21 August 2011

Abstract

High-throughput sequencing (ChIP-Seq) data record binding events together with their candidate binding locations and strengths, after which the locations of the resulting peaks are interpreted. Recent methods tend to summarize all ChIP-Seq peaks detected within a limited region upstream and downstream of each gene into a single real-valued score that quantifies the probability of regulation in that region. Applying subspace clustering (or biclustering) techniques to these scores can reveal important knowledge, such as potential co-regulation or co-factor mechanisms. An ideal bicluster contains a subset of genes and a subset of transcription factors (TFs) such that the cell values in the bicluster are distributed around a mean value with low variance. Such biclusters indicate TF sets that regulate gene sets with nearly the same probability values. However, most existing biclustering algorithms can neither enforce variance as a strict limit on the values contained in a bicluster nor use variance as the guiding metric when searching for desirable biclusters. This paper presents an algorithm whose search spaces are defined by lattices containing all overlapping biclusters and whose guiding metric is a bound on variance values. The algorithm is shown to be an efficient and effective method for discovering possibly overlapping biclusters under predefined variance bounds. We present the algorithm and its results on synthetic, ChIP-Seq, and motif datasets, and compare them with the results obtained by other algorithms to demonstrate its effectiveness.
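To make the variance criterion concrete, the following is a minimal sketch of checking whether a candidate bicluster meets a variance bound. It is not the paper's lattice-based search: the function names, the toy score matrix, and the threshold of 0.01 are all assumptions chosen for illustration, and population variance is assumed as the measure.

```python
from statistics import pvariance

def bicluster_variance(scores, genes, tfs):
    """Population variance of the cells picked out by row indices `genes`
    and column indices `tfs` (hypothetical helper, not from the paper)."""
    cells = [scores[g][t] for g in genes for t in tfs]
    return pvariance(cells)

def satisfies_variance_bound(scores, genes, tfs, max_variance):
    """True if the selected submatrix stays within the variance bound."""
    return bicluster_variance(scores, genes, tfs) <= max_variance

# Toy gene-by-TF matrix of regulation scores (made-up values).
scores = [
    [0.90, 0.88, 0.10],
    [0.91, 0.89, 0.75],
    [0.20, 0.15, 0.80],
]

# Genes 0-1 against TFs 0-1 cluster tightly around one mean value.
tight = satisfies_variance_bound(scores, [0, 1], [0, 1], max_variance=0.01)

# The full matrix mixes high and low scores, so it violates the same bound.
loose = satisfies_variance_bound(scores, [0, 1, 2], [0, 1, 2], max_variance=0.01)
```

A check like this only evaluates one candidate. The abstract's contribution is the search over all overlapping candidates, which this sketch does not attempt.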



Published In

BIOKDD '11: Proceedings of the Tenth International Workshop on Data Mining in Bioinformatics

August 2011

47 pages

ISBN:9781450308397

DOI:10.1145/2003351

General Chairs:
Mohammed Zaki, Rensselaer Polytechnic Institute, Troy, NY
Jake Chen, Indiana University School of Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, IN

Program Chairs:
Mohammad Al Hasan, Indiana University-Purdue University, Indianapolis (IUPUI), Indianapolis, IN
Jun (Luke) Huan, University of Kansas, Lawrence, KS

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '11

Sponsor:

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21, 2011

San Diego, California

Acceptance Rates

Overall Acceptance Rate 7 of 16 submissions, 44%
