
2a Intro To Cluster Computing


Modified from Mining of Massive Datasets (Stanford)

[Diagram: "classical" data mining (machine learning, statistics) runs on a single machine with one CPU, its memory, and a local disk.]
20+ billion web pages x 20 KB each = 400+ TB
One computer reads 30-35 MB/sec from disk
 ~4 months to read the web
 ~1,000 hard drives to store the web
Takes even more to do something useful with the data!
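A quick back-of-the-envelope check of the numbers above, as a Python sketch (the 20 KB page size and ~35 MB/s read rate are the slide's figures; the 500 GB drive size is an assumed round number):

pages = 20e9          # 20+ billion web pages
page_size = 20e3      # 20 KB per page
disk_rate = 35e6      # ~30-35 MB/s sequential read from one disk
drive_size = 500e9    # assumed ~500 GB per hard drive (illustrative)

total_bytes = pages * page_size
print(total_bytes / 1e12, "TB")                                  # 400 TB
print(total_bytes / disk_rate / 86400 / 30, "months to read")    # ~4 months on one machine
print(total_bytes / drive_size, "hard drives to store it")       # ~800, on the order of 1,000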
Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them
[Diagram: cluster architecture. Each rack contains 16-64 nodes (CPU, memory, disk) connected by a rack switch; 1 Gbps between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.]

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Large-scale computing for data mining problems on commodity hardware
Challenges:
 Machines fail:
 One server may stay up 3 years (~1,000 days)
 If you have 1,000 servers, expect to lose one per day
 People estimated Google had ~1M machines in 2011
 1,000 machines fail every day! (see the sketch after this list)
 How to store data persistently and keep it available if nodes can fail?
 How to deal with node failures during long-running computation?
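A rough sketch of the failure arithmetic, assuming a ~1,000-day mean lifetime per machine and independent failures (the 1,000-machine job size below is hypothetical):

machines = 1_000_000
mtbf_days = 1_000                          # one server stays up ~3 years
daily_failure_rate = 1 / mtbf_days

print(machines * daily_failure_rate, "expected failures per day")     # ~1,000

# Chance that a 1,000-machine, one-day job sees at least one machine die:
print(1 - (1 - daily_failure_rate) ** 1_000)                          # ~0.63 -- failures are the norm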
Challenges:
 Network bottleneck
 Network bandwidth = 1 Gbps
 Moving 10 TB takes approximately 1 day (checked in the sketch below)
 Distributed programming is hard!
 Need a simple model that hides most of the complexity
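A one-line check of that transfer time (1 Gbps is the slide's figure; protocol overhead is ignored):

data_bits = 10e12 * 8          # 10 TB in bits
link_bps = 1e9                 # 1 Gbps network
print(data_bits / link_bps / 3600, "hours")   # ~22 hours, roughly a day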
Issue: machine failure in the cluster computer
Idea: store files multiple times for reliability

Issue: copying data over a network takes time
Idea: bring computation close to the data

Issue: complex programming for distributed systems
Map-Reduce addresses all of these problems
 Google’s computational / data-manipulation model
 An elegant way to work with big data
 Storage infrastructure – a distributed file system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce
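To make the programming model concrete, here is a toy single-machine word count written in the Map-Reduce style (a sketch only; real systems such as Hadoop run the map and reduce phases in parallel across the cluster):

from collections import defaultdict

def map_fn(document):
    # Map: emit (word, 1) for every word in the input document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]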

Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common

Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: chunks C0-C5 and D0-D1 of two files spread across chunk servers 1..N, with each chunk stored on two different servers.]

Bring computation directly to the data!
 Chunk servers also serve as compute servers
Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64 MB
 Each chunk replicated (usually 2x or 3x)
 Try to keep replicas in different racks
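A minimal sketch of those two policies, fixed-size chunking and rack-aware replica placement (the 64 MB chunk size, 3x replication, and two-rack cluster below are illustrative choices, not GFS or HDFS code):

CHUNK_SIZE = 64 * 2**20        # 64 MB, the upper end of the 16-64 MB range
REPLICAS = 3                   # 3x replication

def split_into_chunks(data):
    # Split a file's bytes into contiguous fixed-size chunks.
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def place_replicas(chunk_id, racks):
    # Choose REPLICAS chunk servers, rotating over racks so that
    # copies land in different racks whenever possible.
    placement = []
    rack_names = sorted(racks)
    for i in range(REPLICAS):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        free = [s for s in racks[rack] if s not in placement]
        if free:
            placement.append(free[0])
    return placement

cluster = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"]}   # hypothetical cluster
print(place_replicas(0, cluster))   # ['cs1', 'cs3', 'cs2'] -- replicas span both racks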

Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
Client library for file access
 Talks to master to find chunk servers
 Connects directly to chunk servers to access data
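A sketch of that read path in the same toy style (the master metadata map and chunk-server lookup below are hypothetical, not the GFS or HDFS client API):

def read_file(path, master, chunk_servers):
    # 1. Ask the master which chunks make up the file and where replicas live.
    #    `master` maps path -> list of (chunk_id, [server names]); metadata only.
    data = b""
    for chunk_id, replicas in master[path]:
        # 2. Fetch the chunk bytes directly from any replica that has it,
        #    bypassing the master for the actual data transfer.
        for server in replicas:
            chunk = chunk_servers[server].get(chunk_id)
            if chunk is not None:
                data += chunk
                break
    return data

# Hypothetical metadata and chunk stores for one small file.
master = {"/web/crawl.txt": [(0, ["cs1", "cs3"]), (1, ["cs2", "cs4"])]}
chunk_servers = {
    "cs1": {0: b"first chunk "}, "cs3": {0: b"first chunk "},
    "cs2": {1: b"second chunk"}, "cs4": {1: b"second chunk"},
}
print(read_file("/web/crawl.txt", master, chunk_servers))   # b'first chunk second chunk'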
