# Apache Spark with HDFS cluster within Kubernetes
As the description says, this repository provides an Apache Spark cluster with HDFS running within Kubernetes. Although it bundles the Intel HiBench benchmark suite for testing CPU, I/O, and network usage across the following categories, the cluster can also run as a regular one:

- Micro
- Machine Learning
- Websearch
To build, you can just execute the `build.sh` file:

```shell
$ ./build.sh
```
To create and prepare the cluster, run:

```shell
$ ./scripts/init-cluster.sh <WORKLOAD> <BENCHMARK> <INPUT_SIZE>
```

Where:

- `WORKLOAD` represents a workload from HiBench
- `BENCHMARK` represents the benchmark to run
- `INPUT_SIZE` sets the size of the input data for the benchmark
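The scripts themselves define the accepted values. As a sketch, a pre-flight check like the one below can guard the call; the `validate_args` helper and the echoed message are illustrative, not part of the repository:

```shell
#!/bin/sh
# Hypothetical pre-flight check for init-cluster.sh: verifies that the
# three required arguments (WORKLOAD, BENCHMARK, INPUT_SIZE) are present.
validate_args() {
    [ "$#" -eq 3 ]
}

if validate_args "$@"; then
    # Argument values here are placeholders; consult HiBench for valid names.
    echo "init: workload=$1 benchmark=$2 input_size=$3"
    # ./scripts/init-cluster.sh "$1" "$2" "$3"
else
    echo "usage: $0 <WORKLOAD> <BENCHMARK> <INPUT_SIZE>" >&2
fi
```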
To run a HiBench benchmark, execute:

```shell
$ ./scripts/run.sh <WORKLOAD> <BENCHMARK>
```

The report will be saved in the base directory as `hibench.report`.
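Once `hibench.report` exists, each row can be summarized with standard shell tools. The column layout below is an assumption (it can vary between HiBench versions), and the sample rows are fabricated purely for illustration:

```shell
# Assumed hibench.report layout (check your HiBench version):
#   1:Type 2:Date 3:Time 4:Input_data_size 5:Duration(s) 6:Throughput(bytes/s) 7:Throughput/node
cat > hibench.report <<'EOF'
Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkWordcount 2024-01-01 10:00:00 328511 30.5 10770 10770
ScalaSparkSort 2024-01-01 10:05:00 328511 45.0 7300 7300
EOF

# Print each benchmark's name and duration, skipping the header line
awk 'NR > 1 { printf "%s took %ss\n", $1, $5 }' hibench.report
```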
- k8s-bigdata currently uses the Apache Spark 2.4 with Hadoop 2.7 binary
- No need to register each datanode manually: Kubernetes will create a datanode for each node registered in the cluster
- You can specify the node where the namenode, resourcemanager, and historyserver will be launched by assigning the label `type=master`. If you are new to Kubernetes, type `kubectl label nodes YOUR_NODE type=master`
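Presumably, the `type=master` label is matched by a `nodeSelector` in the master pod's spec. A minimal sketch of how such a selector looks (the pod name and image below are illustrative, not the repository's actual manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: namenode                       # illustrative name
spec:
  nodeSelector:
    type: master                       # schedule only on nodes labeled type=master
  containers:
    - name: namenode
      image: example/hadoop-namenode   # illustrative image
```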
- 🗴 Support data streaming frameworks such as Apache Kafka
- ✓ Switch static to dynamic environment variables for containers (avoid rebuilding on every change in the `./hadoop/base/hadoop.env` file)
- ✓ Implement a configuration parser to run HiBench without changing `run.sh`
- ✓ Implement a solution to change the size of input data for HiBench benchmarks without accessing the namenode pod directly
Based on locality of reference, HiBench, the Hadoop Namenode, and the Spark Master run as processes within the same container. Also, HiBench needs the Hadoop and Spark directories to be located in the namenode pod.