Administering Spark
Patrick Wendell
Databricks
Outline
Spark components
Cluster managers
Hardware & configuration
Linking with Spark
Monitoring and measuring
Spark application
Driver program: a Java program that creates a SparkContext; decides when to launch tasks on which executor
Executors: worker processes that execute tasks and store data
Cluster manager: grants executors to a Spark application
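A minimal driver sketch tying these pieces together (Spark 0.8-era Scala API; the master URL, app name, and paths are placeholders, not from the talk):

import org.apache.spark.SparkContext

object MyApp {
  def main(args: Array[String]) {
    // The driver creates the SparkContext, which asks the
    // cluster manager for executors
    val sc = new SparkContext(
      "spark://master-host:7077",  // cluster manager URL (placeholder)
      "MyApp",                     // application name
      "/path/to/spark",            // Spark home on worker nodes
      Seq("target/my-app.jar"))    // JARs shipped to executors
    // Each action (here, count) launches tasks on the executors
    println(sc.textFile("hdfs://namenode/data.txt").count())
    sc.stop()
  }
}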
Types of Applications
Long lived/shared applications
Shark, Spark Streaming, Job Server (Ooyala)
May do multi-user scheduling within their allocation from the cluster manager
Short lived applications
Standalone apps, shell sessions
Outline
Spark components
Cluster managers
Hardware & configuration
Linking with Spark
Monitoring and measuring
Cluster Managers
Several ways to deploy Spark
1. Standalone mode (on-site)
2. Standalone mode (EC2)
3. YARN
4. Mesos
5. SIMR [not covered in this talk]
Standalone Mode
Bundled with Spark
Standalone Mode
1. (Optional) describe amount of
resources in conf/spark-env.sh
- SPARK_WORKER_CORES
- SPARK_WORKER_MEMORY
2. List slaves in conf/slaves
3. Copy configuration to slaves
4. Start/stop using
./bin/start-all.sh and ./bin/stop-all.sh
Standalone Mode
Some support for inter-application
scheduling
Set spark.cores.max to limit # of
cores each application can use
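For example (a sketch in 0.8's system-property style, set before the SparkContext is created; the host and value are placeholders):

import org.apache.spark.SparkContext

// Cap this application at 4 cores across the standalone cluster
System.setProperty("spark.cores.max", "4")
val sc = new SparkContext("spark://master-host:7077", "CappedApp")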
EC2 Deployment
Launcher bundled with Spark
Create cluster in 5 minutes
Sizes cluster for any EC2
instance type and # of nodes
Used widely by Spark team for
internal testing
EC2 Deployment
./spark-ec2
-t [instance-type]
-k [key-name]
-i [path-to-key-file]
-s [num-slaves]
-r [ec2-region]
--spot-price=[spot-price]
launch [cluster-name]
EC2 Deployment
Creates:
Spark Standalone cluster at
<ec2-master>:8080
HDFS cluster at
<ec2-master>:50070
MapReduce cluster at
<ec2-master>:50030
Apache Mesos
General-purpose cluster manager that
can run Spark, Hadoop MR, MPI, etc
Simply pass mesos://<master-url> to
SparkContext
Optional: set spark.executor.uri to a
pre-built Spark package in HDFS,
created by make-distribution.sh
Coarse-grained mode:
Apps get static CPU and memory allocations
Better predictability and latency, possibly at the cost of utilization
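A sketch of the Mesos setup described above (URLs and paths are placeholders):

import org.apache.spark.SparkContext

// Executors fetch this pre-built package (output of make-distribution.sh)
System.setProperty("spark.executor.uri", "hdfs://namenode/spark-0.8.0.tar.gz")
// Opt in to coarse-grained mode (fine-grained is the default)
System.setProperty("spark.mesos.coarse", "true")
val sc = new SparkContext("mesos://mesos-master:5050", "MesosApp")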
Hadoop YARN
In Spark 0.8.0:
Runs standalone apps only, launching driver
inside YARN cluster
YARN 0.23 to 2.0.x
Coming in 0.8.1:
Interactive shell
YARN 2.2.x support
Support for hosting Spark JAR in HDFS
YARN Steps
See the YARN launch instructions in the docs linked below
More Info
http://spark.incubator.apache.org/docs/latest/cluster-overview.html
Detailed docs for each of standalone mode, Mesos, YARN, and EC2
Outline
Spark components
Cluster managers
Hardware & configuration
Linking with Spark
Monitoring and measuring
Local Disks
Spark uses disk for writing
shuffle data and paging out
RDDs
Ideally have several disks per
node in JBOD configuration
Set spark.local.dir with comma-separated disk locations
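For example (a sketch; the mount points are placeholders, and on a cluster this property is typically set per worker, e.g. via SPARK_JAVA_OPTS in conf/spark-env.sh):

// Spread shuffle files and paged-out RDD blocks across three JBOD disks
System.setProperty("spark.local.dir",
  "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")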
Memory
Recommend 8GB heap and up
Generally, more is better
For massive (>200GB) heaps you
may want to increase # of
executors per node (see
SPARK_WORKER_INSTANCES)
Network/CPU
For in-memory workloads,
network and CPU are often the
bottleneck
Ideally use 10Gb Ethernet
Spark works well on machines with many cores, since tasks run in parallel
Environment-related configs
spark.executor.memory
How much memory to request from the cluster manager
spark.local.dir
Where Spark stores shuffle files
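For example (a sketch in the 0.8 property style; the value is illustrative and must be set before the SparkContext is created):

// Ask the cluster manager for 4 GB of heap per executor
System.setProperty("spark.executor.memory", "4g")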
Outline
Cluster components
Deployment options
Hardware & configuration
Linking with Spark
Monitoring and measuring
Created using Maven or sbt assembly
Hadoop Versions
Distribution  Release  Hadoop version (for linking)
CDH           4.X.X    2.0.0-mr1-cdh4.X.X (MRv1) or 2.0.0-cdh4.X.X (YARN)
CDH           3uX      0.20.2-cdh3uX
HDP           1.3      1.2.0
HDP           1.2      1.1.2
HDP           1.1      1.0.3
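An sbt sketch of linking an application against Spark (the spark-core coordinates are for the 0.8.0 release; the hadoop-client version shown is just an example, match it to the table above):

// build.sbt sketch (versions are examples)
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-incubating"
// Match hadoop-client to your cluster, per the table above
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.2.0"
// CDH artifacts are served from the Cloudera repository
resolvers += "Cloudera Repository" at
  "https://repository.cloudera.com/artifactory/cloudera-repos/"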
Outline
Spark components
Cluster managers
Hardware & configuration
Linking with Spark
Monitoring and measuring
Monitoring
Cluster Manager UI
Executor Logs
Spark Driver Logs
Application Web UI
Spark Metrics
Cluster Manager UI
Standalone mode: <master>:8080
Mesos, YARN have their own UIs
Executor Logs
Stored by cluster manager on each
worker
Default location in standalone mode:
/path/to/spark/work
Application Web UI
http://spark-application-host:4040
(or use spark.ui.port to configure the port)
For executor / task / stage / memory
status, etc
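For example (a sketch; the port value is arbitrary):

// Serve the application UI on 4041 instead of the default 4040
System.setProperty("spark.ui.port", "4041")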
UI screenshots: Executors page, Environment page, stage information, task breakdown
App UI Features
Stages show where each operation
originated in code
All tables sortable by task length,
locations, etc
Metrics
Configurable metrics based on Coda Hale's Metrics library
Many Spark components can report metrics (driver, executor, application)
Outputs: CSV, Ganglia, JMX, and a REST/JSON servlet
Metrics
More details:
http://spark.incubator.apache.org/docs/latest/monitoring.html
More Information
Official docs:
http://spark.incubator.apache.org/docs/latest