DOI: 10.1145/1401890.1402000
research-article

Data mining using high performance data clouds: experimental studies using Sector and Sphere

Published: 24 August 2008

Abstract

We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.



Information & Contributors

Information

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN:9781605581934
DOI:10.1145/1401890
  • General Chair: Ying Li
  • Program Chairs: Bing Liu, Sunita Sarawagi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud computing
  2. distributed data mining
  3. high performance data mining

Qualifiers

  • Research-article

Conference

KDD08

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 15
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 24 Nov 2024

Citations

Cited By

  • (2023) Technological Prospects of Cloud Computing in Web Mining: Recent Trends and Opportunities. International Journal of Engineering Technology and Management Sciences 7:1 (98-104). DOI: 10.46647/ijetms.2023.v07i01.017. Online publication date: 28-Feb-2023
  • (2022) Privacy-Preserving Multi-party Neural Network Learning Over Incomplete Data. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (1216-1221). DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00191. Online publication date: Dec-2022
  • (2022) A Taxonomy on Strategic Viewpoint and Insight Towards Multi-Cloud Environments. Computational Vision and Bio-Inspired Computing (713-719). DOI: 10.1007/978-981-16-9573-5_51. Online publication date: 31-Mar-2022
  • (2022) Introduction of Data Center. Data Center Networking (3-24). DOI: 10.1007/978-981-16-9368-7_1. Online publication date: 24-Feb-2022
  • (2021) Data mining in Cloud Computing. 2021 5th International Conference on Computing Methodologies and Communication (ICCMC) (71-78). DOI: 10.1109/ICCMC51019.2021.9418489. Online publication date: 8-Apr-2021
  • (2021) Privacy-Preserving Cloud-Aided Broad Learning System. Computers & Security (102503). DOI: 10.1016/j.cose.2021.102503. Online publication date: Oct-2021
  • (2021) Assessing Teacher’s Performance Evaluation and Prediction Model Using Cloud Computing Over Multi-dimensional Dataset. Wireless Personal Communications. DOI: 10.1007/s11277-021-08394-3. Online publication date: 8-Apr-2021
  • (2020) A comparative study of Distributed Large Scale Data Mining Algorithms. BSSS Journal of Computer. DOI: 10.51767/jc1102. Online publication date: 25-May-2020
  • (2019) A Comprehensive Survey on Cloud Data Mining (CDM) Frameworks and Algorithms. ACM Computing Surveys 52:5 (1-62). DOI: 10.1145/3349265. Online publication date: 13-Sep-2019
  • (2019) Survey of Data Locality in Apache Hadoop. 2019 IEEE International Conference on Big Data, Cloud Computing, Data Science & Engineering (BCD) (46-53). DOI: 10.1109/BCD.2019.8885148. Online publication date: May-2019
