Abstract
Nearly two decades of research in Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations, along with regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era of very large machine-generated scientific and industrial datasets, accompanied by programs that provide ready access to complex relational information in machine-readable forms (ontologies, parsers, and so on). Besides the usual issues of ease of use, ILP is now confronted with questions of implementation. We are concerned here with two of these, namely: can an ILP system construct models efficiently when (a) dataset sizes are too large to fit in the memory of a single machine; and (b) search spaces become prohibitively large to explore using a single machine? In this paper, we examine the applicability to ILP of a popular distributed computing approach that provides a uniform way of performing data- and task-parallel computations. The MapReduce programming model allows, in principle, very large numbers of processors to be used without any special understanding of the underlying hardware or software. Specifically, we show how the MapReduce approach can be used to perform the coverage test that is at the heart of many ILP systems, and to perform the multiple searches required by the greedy set-covering algorithm used by some popular ILP systems. Our principal findings with synthetic and real-world datasets, for both data and task parallelism, are these: (a) ignoring overheads, the time to perform the computations concurrently increases with the size of the dataset for data parallelism and with the size of the search space for task parallelism; for data parallelism this increase is roughly in proportion to the increase in dataset size; (b) if a MapReduce implementation is used as part of an ILP system, then benefits can only be expected above some minimal dataset size for data parallelism, and above some minimal search-space size for task parallelism; and (c) the MapReduce approach appears better suited to exploiting data parallelism in ILP.
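The coverage test lends itself naturally to the MapReduce pattern: mappers check candidate clauses against the examples in their data partition, and a reducer sums the per-clause counts. The following is an illustrative in-memory sketch, not the authors' implementation; all names are hypothetical, and the stand-in predicate takes the place of the logical coverage (subsumption/derivation) test an ILP system would actually run.

```python
from collections import defaultdict

def covers(clause, example):
    # Stand-in coverage test: here a "clause" is just a predicate over
    # examples; a real ILP system would run a logical derivation test.
    return clause["test"](example)

def map_phase(clauses, examples):
    # Mapper: emit (clause_id, 1) for every example the clause covers.
    # `examples` is the data partition handled by one worker.
    for example in examples:
        for clause in clauses:
            if covers(clause, example):
                yield clause["id"], 1

def reduce_phase(pairs):
    # Reducer: sum the counts grouped by clause id.
    counts = defaultdict(int)
    for clause_id, count in pairs:
        counts[clause_id] += count
    return dict(counts)

clauses = [
    {"id": "even", "test": lambda x: x % 2 == 0},
    {"id": "big",  "test": lambda x: x > 5},
]
examples = range(10)
print(reduce_phase(map_phase(clauses, examples)))  # {'even': 5, 'big': 4}
```

In a distributed run, the map phase is executed independently on each data partition, so the total coverage count is obtained without any partition ever needing to hold the full dataset in memory.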
Additional information
A. Srinivasan is also an Adjunct Professor at the School of Computer Science and Engineering, University of New South Wales; a Visiting Professor at the Computing Laboratory, Oxford.
Cite this article
Srinivasan, A., Faruquie, T.A. & Joshi, S. Data and task parallelism in ILP using MapReduce. Mach Learn 86, 141–168 (2012). https://doi.org/10.1007/s10994-011-5245-8