Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1006209.1006215acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
Article

Parallel algorithms for mining frequent structural motifs in scientific data

Published: 26 June 2004 Publication History

Abstract

Discovery of important substructures from molecules is an important data mining problem. The basic motivation is that the structure of a molecule has a role to play in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules.Recently, we have developed a general purpose suite of algorithms -- the MotifMiner Toolkit -- that can mine for structural motifs in a wide area of biomolecular datasets. While the algorithms have proven to be extremely useful in their ability to identify novel substructures, the algorithms themselves are quite time consuming. There are two reasons for this: i) inherently the algorithm suffers from the curse of subgraph isomorphism; and ii) handling noise effects (e.g. protein structure data) results in a significant slowdown.To address this problem in this paper we propose parallelization strategies in a cluster environment for the above algorithms. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we are able to obtain good speedup on moderate sized clusters.

References

[1]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In 20th VLDB Conf., September 1994.
[2]
C. Borgelt and M. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In IEEE International Conference on Data Mining, Dec 2002.
[3]
Douglas Burdick, Manuel Calimlim, and Johannes Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In IEEE International Conference on Data Engineering, 2001.
[4]
L. P. Chew, D. Huttenlocher, K. Kedem, and J. Kleinberg. Fast detection of common geometric substructures in proteins. RECOMB, 1999.
[5]
M. Coatney, S. Mehta, A. Choy, S. Barr, S. Parthasarathy, R. Machiraju, and J. Wilkins. Defect detection in silicon and alloys. In IEEE Workshop on Visualization in Bioinformatics and Cheminformatics, 2002.
[6]
M. Coatney and S. Parthasarathy. Motifminer: A general toolkit for efficiently identifying common substructures in molecules. In IEEE International Conference on Bioinformatics and Bioengineering (to appear), 2003.
[7]
M. Coatney and S. Parthasarathy. Motifminer: Efficient discovery of common substructures in biochemical molecules. In Knowledge and Information Systems, to appear, 2003.
[8]
D. J. Cook, L. B. Holder, S. Su, R. Maglothin, and I. Jonyer. Structural mining of molecular biology data. IEEE Engineering in Medicine and Biology, 20(4):67--74, 2001.
[9]
L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In The Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998.
[10]
D. J. Cook and et al. L. B. Holder. Approaches to parallel and distributed computing. Journal of Parallel and Distributed Computing, 2001.
[11]
S. Djoko, D. Cook, and L. Holder. Analyzing the benefits of domain knowledge in substructure discovery. In The First ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1995.
[12]
U. Fayyad et al. Mining Scientific Data. In Communications of the ACM, pages 51--57, 1996.
[13]
H. M. Grindley, P. J. Artymiuk, D. W. Rice, and P. Willett. Identification of tertiary resemblence in proteins using a maximal common subgraph isomorphism algorithm. J. of Mol. Biol., 229(3):707--721, 1993.
[14]
E. Han, G. Karypis, and V. Kumar. Data mining for turbulent flows. In Data mining for scientific and engineering applications, pages 239--256, 2001.
[15]
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
[16]
M. Jiang, T.-S. Choy, S. Mehta, M. Coatney, S. Barr, K. Hazzard, D. Richie, S. Parthasarathy, R. Machiraju, D. S. Thompson, J. Wilkins, and B. Gatlin. Feature Mining Algorithms for Scientific Data. In Proceedings of SIAM Data Mining Conference, May 2003.
[17]
I. Jonassen, I. Eidhammer, D. Conklin, and W. Taylor. Structure motif discovery and mining the pdb. In German Conference on Bioinformatics, 2000.
[18]
C. Kamath. On Mining Scientific Datasets. In et al R. L. Grossman, editor, Data Mining for Scientific and Engineering Applications, pages 1--21. Kluwer Academic Publishers, 2001.
[19]
G. Karypis, M. Deshpande, and M Kuramochi. Automated approaches for classifying structures. In BIOKDD02: Workshop on Data Mining in Bioinfomatics (with SIGKDD02 Conf.), July 2002.
[20]
J. Kim, E. Moriyama, C. Warr, P. Clyne, and J. Carlson. Identification of novel multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics, 2002.
[21]
R. King, A. Karwath, A. Clare, and L. Dehaspe. Genome scale prediction of protein functional class from sequence using data mining. In The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[22]
S. Kirshner, I. V. Cadez, P. Smyth, and C. Kamath. Learning to classify galaxy shapes using the em algorithm. In ICPR, 2002.
[23]
I. Koch, T. Lengauer, and E. Wanke. An algorithm for finding maximal common subtopologies in a set of protein structures. J. of Comp. Biol., 3(2):289--306, 1996.
[24]
M Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE International Conference on Data Mining, Nov 2001.
[25]
M Kuramochi and G. Karypis. Discovering frequent geometric subgraphs. In IEEE International Conference on Data Mining, Dec 2002.
[26]
Y. Lamdan and H. Wolfson. Geometric hashing: a general and efficient model-based recognition scheme. In ICCV, 1988.
[27]
H. Li and S. Parthasarathy. Automatically deriving multi-level protein structures through data mining. In HiPC Conference Workshop on Bioinformatics and Computational Biology, Hyderabad, India, 2001.
[28]
R. Machiraju, S. Parthasarathy, D. S. Thompson, J. Wikins, B. Gatlin, T. S. Choy, D. Richie, M. Jiang, S. Mehta, M. Coatney, S. Barr, and K. Hazzard. Mining Complex Evolutionary Phenomena. In NSF NGDM Workshop, 2002.
[29]
R. Machiraju, S. Parthasarathy, D. S. Thompson, J. Wikins, B. Gatlin, T. S. Choy, D. Richie, M. Jiang, S. Mehta, M. Coatney, S. Barr, and K. Hazzard. Mining Complex Evolutionary Phenomena. In H. Kargupta et al., editor, Data Mining for Scientific and Engineering Applications. MIT Press, 2003.
[30]
E. M. Mitchell, P. J. Artymiuk, D. W. Rice, and P. Willett. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol., 212:151--166, 1990.
[31]
W. Pan, J. Lin, and C. Le. Model-based cluster analysis of microarray gene-expression data. Genome Biology, 2002.
[32]
S. Parthasarathy and M. Coatney. Efficient discovery of common substructures in macromolecules. In IEEE International Conference on Data Mining, 2002.
[33]
L. De Raedt and S. Kramer. The level-wise version space algorithm and its application to molecular fragment finding. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.
[34]
M. Twa, S. Parthasarathy, T. Rosche, and M. Bullmer. Decision tree classification of spatial data patterns from videokeratography using zernike polynomials. SIAM International Conference on Data Mining, 2003.
[35]
J. T. L. Wang, B. A. Shapiro, D. Shasha, K. Zhang, and C.-Y. Chang. Automated discovery of active motifs in multiple rna secondary structures. In International Conference on Knowledge Discovery and Data Mining, 1996.
[36]
X. Wang, J. Wang, D. Shasha, B. Shapiro, S. Dikshitulu, I. Rigoutsos, and K. Zhang. Automated discovery of active motifs in three dimensional molecules. In The Third ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1997.
[37]
X. Wang, J. T. L. Wang, D. Shasha, B. A. Shapiro, I. Rigoutsos, and K. Zhang. Finding patterns in three-dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transactions on Knowledge and Data Engineering, 14(4):731--749, jul/aug 2002.
[38]
H. Wolfson and I. Rigoutsos. Geometric hashing: An overview. In In IEEE Computational Science and Engineering, Oct 1997.
[39]
X. Yan and J. Han. gspan: Graph based substructure pattern mining. In IEEE International Conference on Data Mining, Dec 2002.
[40]
X. Zheng and T. Chan. Chemical genomics: A systematic approach in biological research and drug discovery. Current Issues in Molecular Biology, 2002.

Cited By

View all
  • (2020)Motif discovery algorithms in static and temporal networks: A surveyJournal of Complex Networks10.1093/comnet/cnaa0318:4Online publication date: 16-Dec-2020
  • (2018)RStreamProceedings of the 13th USENIX conference on Operating Systems Design and Implementation10.5555/3291168.3291225(763-782)Online publication date: 8-Oct-2018
  • (2017)Data mining in distributed environment: a surveyWIREs Data Mining and Knowledge Discovery10.1002/widm.12167:6Online publication date: 18-Jul-2017
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '04: Proceedings of the 18th annual international conference on Supercomputing
June 2004
360 pages
ISBN:1581138393
DOI:10.1145/1006209
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 June 2004

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. atomset
  2. backward pruning
  3. bitonic partitioning
  4. motif
  5. recursive fuzzy hashing
  6. self-adaptive expansion

Qualifiers

  • Article

Conference

ICS04
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)1
Reflects downloads up to 25 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2020)Motif discovery algorithms in static and temporal networks: A surveyJournal of Complex Networks10.1093/comnet/cnaa0318:4Online publication date: 16-Dec-2020
  • (2018)RStreamProceedings of the 13th USENIX conference on Operating Systems Design and Implementation10.5555/3291168.3291225(763-782)Online publication date: 8-Oct-2018
  • (2017)Data mining in distributed environment: a surveyWIREs Data Mining and Knowledge Discovery10.1002/widm.12167:6Online publication date: 18-Jul-2017
  • (2015)ArabesqueProceedings of the 25th Symposium on Operating Systems Principles10.1145/2815400.2815410(425-440)Online publication date: 4-Oct-2015
  • (2014)G-TriesData Mining and Knowledge Discovery10.1007/s10618-013-0303-428:2(337-377)Online publication date: 1-Mar-2014
  • (2014)Big Data Frequent Pattern MiningFrequent Pattern Mining10.1007/978-3-319-07821-2_10(225-259)Online publication date: 30-Aug-2014
  • (2013)Graph-Based Substructure Pattern Mining Using CUDA Dynamic ParallelismProceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning --- IDEAL 2013 - Volume 820610.1007/978-3-642-41278-3_42(342-349)Online publication date: 20-Oct-2013
  • (2012)Parallel discovery of network motifsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2011.08.00772:2(144-154)Online publication date: 1-Feb-2012
  • (2011)A Parallel Algorithm for Counting Subgraphs in Complex NetworksBiomedical Engineering Systems and Technologies10.1007/978-3-642-18472-7_30(380-393)Online publication date: 2011
  • (2010)A Survey of Graph Mining Techniques for Biological DatasetsManaging and Mining Graph Data10.1007/978-1-4419-6045-0_18(547-580)Online publication date: 18-Jan-2010
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media