Abstract
Nearly two decades of research in Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations, along with regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era of very large machine-generated scientific and industrial datasets, accompanied by programs that provide ready access to complex relational information in machine-readable forms (ontologies, parsers, and so on). Besides the usual issues of ease of use, ILP is now confronted with questions of implementation. We are concerned here with two of these, namely: can an ILP system construct models efficiently when (a) dataset sizes are too large to fit in the memory of a single machine; and (b) search spaces become prohibitively large to explore using a single machine? In this paper, we examine the applicability to ILP of a popular distributed computing approach that provides a uniform way of performing data- and task-parallel computations. The MapReduce programming model allows, in principle, very large numbers of processors to be used without any special understanding of the underlying hardware or software. Specifically, we show how the MapReduce approach can be used to perform the coverage test that is at the heart of many ILP systems, and to perform the multiple searches required by the greedy set-covering algorithm used by some popular ILP systems. Our principal findings with synthetic and real-world datasets, for both data and task parallelism, are these: (a) ignoring overheads, the time to perform the computations concurrently increases with the size of the dataset for data parallelism and with the size of the search space for task parallelism; for data parallelism this increase is roughly in proportion to the increase in dataset size; (b) if a MapReduce implementation is used as part of an ILP system, then benefits can only be expected above some minimal dataset size for data parallelism, and above some minimal search-space size for task parallelism; and (c) the MapReduce approach appears better suited to exploiting data parallelism in ILP.
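The coverage test lends itself naturally to the MapReduce pattern: mappers check candidate clauses against the examples in their data partition, and a reducer sums the per-clause counts. The following is an illustrative in-memory sketch, not the authors' implementation; all names are hypothetical, and the stand-in predicate takes the place of the logical coverage (subsumption/derivation) test an ILP system would actually run.

```python
from collections import defaultdict

def covers(clause, example):
    # Stand-in coverage test: here a "clause" is just a predicate over
    # examples; a real ILP system would run a logical derivation test.
    return clause["test"](example)

def map_phase(clauses, examples):
    # Mapper: emit (clause_id, 1) for every example the clause covers.
    # `examples` is the data partition handled by one worker.
    for example in examples:
        for clause in clauses:
            if covers(clause, example):
                yield clause["id"], 1

def reduce_phase(pairs):
    # Reducer: sum the counts grouped by clause id.
    counts = defaultdict(int)
    for clause_id, count in pairs:
        counts[clause_id] += count
    return dict(counts)

clauses = [
    {"id": "even", "test": lambda x: x % 2 == 0},
    {"id": "big",  "test": lambda x: x > 5},
]
examples = range(10)
print(reduce_phase(map_phase(clauses, examples)))  # {'even': 5, 'big': 4}
```

In a distributed run, the map phase is executed independently on each data partition, so the total coverage count is obtained without any partition ever needing to hold the full dataset in memory.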
Additional information
A. Srinivasan is also an Adjunct Professor at the School of Computer Science and Engineering, University of New South Wales; a Visiting Professor at the Computing Laboratory, Oxford.
Cite this article
Srinivasan, A., Faruquie, T.A. & Joshi, S. Data and task parallelism in ILP using MapReduce. Mach Learn 86, 141–168 (2012). https://doi.org/10.1007/s10994-011-5245-8