1 Introduction

With the explosive growth of data volume in many application fields, the demand for large-scale computation and storage capacity has greatly increased. Under such circumstances, the well-known Hadoop system [1, 2] has been developed and adopted by many enterprises. Generally, building a Spark and Hadoop cluster requires at least two physical machines, for the NameNode and the DataNode, to enable data partitioning, data replication and data distribution. Hence, for Map/Reduce programming and testing purposes, a Hadoop system based on the master-slave architecture is highly recommended. In addition to Hadoop, Spark [3] is a newly emerging big data platform that supports in-memory computing and has been shown to be potentially 10–100 times faster than Hadoop. However, building a physical Hadoop or Spark system may be difficult for entry-level beginners without a strong IT background. Furthermore, Hadoop runs on the Linux operating system, which can also be an obstacle for application users who are only familiar with Windows.

Given these considerations, this study uses the VirtualBox [4] virtualization software to deploy a multi-node big data VM platform, which runs a Spark and Hadoop cluster in multi-node (1 name node + 3 data nodes) virtual machines (VMs) under the Windows operating system. Although the (1+3) multi-node VMs share the resources (CPU, memory and hard disk) of one physical machine, developers and users can still perform and simulate big data computing in the (1+3) multi-node virtual environment of the Spark and Hadoop cluster. In addition, we also provide a solution that eliminates the problems of an unstable HBase service and superfluous data.

Through virtualization, the core components of the big data platform, including Hadoop, Spark, HDFS, HBase, ZooKeeper and the BigData Software Platform within the Liferay [5] IDE (Eclipse [6]), are all integrated into this platform. Spark is built on top of the Hadoop system, with YARN handling resource management. Specifically, this big data multi-VM platform can be deployed simply and easily by extracting its VM image files into VirtualBox under the Windows operating system. Therefore, the big data multi-VM platform can be a very helpful software platform for big data training, deployment and testing of Spark development and Map/Reduce programming, especially for beginners.

In addition, big data developers can integrate the backend computing by implementing specific portlets such as Job Submission, Job Status and other application portlets for their own purposes. For big data portal development and Map/Reduce programming, this big data multi-VM platform helps developers reduce application complexity and development time while improving application performance. It also offers a well-designed environment in the fields of big data computing and e-Commerce for Java developers to design, implement, configure and deploy applications. To serve the presentation tier, the business tier and the database tier at the same time, the platform adopts an enterprise architecture built on an enterprise portal engine and a Java EE application server, which makes it robust, stable and advanced.

2 Platform Architecture

As shown in Fig. 1, this platform uses Oracle VirtualBox to construct a multi-node Spark and Hadoop system comprising the development tier, the middleware tier and the system tier. The software adopted in each tier is listed in Table 1. In this architecture, the development tier contains four modules: Liferay Portal development, Hadoop Map/Reduce samples, the big data App and the Hadoop Library API Generator. The middleware tier contains the multi-node Hadoop cluster on which developers run their Map/Reduce source code during development. Developers can also use Scala or Java to access the Spark cluster within this tier. Users do not have to worry about the complexity of building the Spark and Hadoop cluster; instead, they can concentrate on big data development for their specific applications [7]. Furthermore, repeated long-running tests have shown the provided big data multi-VM platform to be stable and robust.

Fig. 1. Architecture of Big Data (1+3) multi-VM platform

Table 1. Software architecture of Big Data (1+3) Multi-Node VM Platform

To deploy the big data multi-VM platform, users first install the VirtualBox software on a Windows-based PC. Then, the multi-node VM image files, which contain NameNode, DataNode1, DataNode2 and DataNode3, are loaded one by one, and VirtualBox automatically builds the corresponding virtual machines of the multi-node big data VM platform, as shown in Fig. 2. All the relevant software and tools come pre-installed; the host names and fixed IP addresses are listed in Table 2.

Fig. 2. Big Data multi-VM platform with Spark and Hadoop cluster

Table 2. Hosts and IP of big data multi-VM platform

Basically, these multi-node virtual machines run the highly stable Ubuntu Linux operating system, on which Hadoop, Spark, the Hadoop database HBase, Scala development and MapReduce programming tools, the Java development tools (Eclipse), and ZooKeeper (a distributed application coordination tool) have been pre-installed and pre-configured, so users can immediately start using the platform for their own applications. In general, importing the Big Data Multi-VM Platform can be completed within 45 min, which greatly reduces the difficulty of constructing a Spark and Hadoop system.

Developers can perform programming, submit jobs and analyze data on this big data VM platform just as on a physical Spark and Hadoop cluster. Besides, the platform also provides important sample code and readme instructions for Hadoop Map/Reduce programming and Liferay portal development, which users can study and modify. The platform serves not only development but also education and training purposes. Hence, it is a good starting point for beginners with science or engineering backgrounds to step into the big data world. With this Big Data multi-VM Platform, a powerful four-node Spark and Hadoop system is readily at hand.
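As a flavor of the kind of Map/Reduce sample bundled with the platform, the classic word count can be sketched in plain Python (a conceptual sketch of the map, shuffle/sort and reduce phases only; it does not use the Hadoop API, and the function names are ours):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts per word."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data hadoop", "big data spark"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)   # -> {'big': 2, 'data': 2, 'hadoop': 1, 'spark': 1}
```

In a real Hadoop job, the map and reduce functions run on different cluster nodes, and the framework performs the shuffle/sort step between them.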

As shown in Fig. 3, the usage procedure of the Big Data (1+3) Multi-Node VM Platform is as follows.

Fig. 3. Usage procedure of Big Data multi-VM platform

  1. Start up the virtual machines (1 NameNode and 3 DataNodes)

  2. Start up the Spark and Hadoop cluster

  3. Start the BigData Software Platform through the Eclipse IDE

  4. Design, implement and test deep learning or big data applications

  5. Deploy the executable applications to the physical Spark and Hadoop cluster

One thing worth noting is that the common method of building the Big Data Multi-VM Platform is to create a single NameNode VM with the Spark and Hadoop cluster, and then clone this VM into DataNode1, DataNode2 and DataNode3, respectively. This procedure works, but problems may occur because of residual dirty data within DataNode1, DataNode2 and DataNode3. For example, the HBase service may become unstable without any identifiable root cause. To avoid these problems, we designed an alternative method to build a stable multi-node big data VM platform. The procedure consists of the following steps:

  1. Create a clean VM with Ubuntu Linux

  2. Clone this VM into three additional VMs

  3. Configure the fixed IPs and create an SSH key on the NameNode, then distribute the key to DataNode1, DataNode2 and DataNode3

  4. Manually and carefully build up the Spark and Hadoop cluster with ZooKeeper and HBase

Finally, a stable Big Data Multi-VM Platform can be obtained. As shown in Table 3, the new multi-VM platform is also more lightweight than the old one.

Table 3. Stable version of (1+3) VM Hadoop

3 AI Application and Benchmark

3.1 AI Application

Deep learning is the area of artificial intelligence where the real magic is happening right now. Traditional computers, while very fast, have not been very smart: they have no ability to learn from their mistakes and must be given precise instructions to carry out any task. Deep learning involves building artificial neural networks that attempt to mimic the way human brains sort and process information. The "deep" in deep learning signifies the use of many layers of neural networks stacked on top of each other. This configuration is known as a deep neural network, and its complexity allows it to process data to a more thorough and refined degree than earlier AI technologies. Recently, deep learning applications have shown impressive results across a wide variety of domains [8]. However, training neural networks is relatively time-consuming, even on a single GPU-equipped machine. Fortunately, setting up a distributed environment can greatly accelerate the training process and handle larger datasets. The big data multi-VM platform enables rapid prototyping of distributed deep learning for AI developers by using the cutting-edge TensorFlowOnSpark (TFoS) framework [9].

TensorFlowOnSpark integrates Apache Hadoop and Apache Spark with the deep learning framework TensorFlow [10]. Most existing distributed deep learning frameworks require setting up a separate additional cluster for deep learning, whereas TensorFlowOnSpark does not. As shown in Fig. 4, separate clusters not only require copying data back and forth between clusters but also introduce unwanted system complexity and end-to-end learning latency. In contrast, as shown in Fig. 5, TensorFlowOnSpark runs distributed deep learning on the same Spark cluster, which enables faster learning.

Fig. 4. ML pipeline with multiple programs on separated clusters

Fig. 5. TFoS for deep learning on Spark clusters

Data ingestion is the first step in harnessing the power of Hadoop, and choosing a proper ingestion method can accelerate training significantly. In this study, we benchmark the two types of data ingestion that TensorFlowOnSpark offers, InputMode.SPARK and InputMode.TENSORFLOW, with the MNIST handwritten digit dataset [11] on both the big data multi-VM platform and the physical server. InputMode.SPARK sends Spark RDDs into the TensorFlow nodes; it eases integration between Spark and TensorFlow when an existing data pipeline already generates data via Spark. InputMode.TENSORFLOW, on the other hand, takes advantage of TensorFlow's QueueRunners mechanism to read data directly from HDFS files, so Spark is not involved in data access. Besides, for each type of data ingestion, we benchmark both the TFRecords and CSV input formats, which are the standard TensorFlow format and the comma-separated value format, respectively.
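The two ingestion paths can be sketched in plain Python (a simplified illustration, not the actual TensorFlowOnSpark API: `feed_via_spark` stands in for the RDD-to-queue hand-off of InputMode.SPARK, while `read_direct` stands in for a worker reading records straight from storage as in InputMode.TENSORFLOW; lists stand in for RDD partitions and HDFS files):

```python
import queue
import threading

def feed_via_spark(partitions, batch_queue):
    """InputMode.SPARK-style path: a Spark-side feeder pushes the
    records of each RDD partition into a queue that the TensorFlow
    worker consumes."""
    for partition in partitions:
        for record in partition:
            batch_queue.put(record)
    batch_queue.put(None)              # sentinel: no more data

def read_direct(files):
    """InputMode.TENSORFLOW-style path: the worker reads records
    straight from (HDFS) files; Spark is not involved."""
    for f in files:
        yield from f

# Tiny demonstration with in-memory stand-ins for RDDs and HDFS files.
rdd_partitions = [[1, 2], [3, 4]]
hdfs_files = [[1, 2], [3, 4]]

q = queue.Queue()
threading.Thread(target=feed_via_spark, args=(rdd_partitions, q)).start()
spark_fed = []
while (item := q.get()) is not None:
    spark_fed.append(item)

direct = list(read_direct(hdfs_files))
print(spark_fed, direct)               # both paths yield the same records
```

The extra queue hop in the first path is where Spark hands data to TensorFlow; in the second path the worker bypasses Spark entirely.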

The results are illustrated in Fig. 6, showing that InputMode.SPARK is 3.3 times faster than InputMode.TENSORFLOW on the big data VM platform and even 6.1 times faster on the physical server. Furthermore, for every data ingestion mode tested in our experiments, TFRecords always outperforms CSV.

Fig. 6. Comparative experiment results of distributed deep learning

In the following benchmark, we fix the combination of InputMode.SPARK and TFRecords and delve deeper into the effect of tuning the number of executors and the number of data nodes on the big data VM platform. The TensorFlowOnSpark architecture uses only one core (task) per executor to ease debugging and Spark/YARN configuration. Nonetheless, setting multiple executors per node is not a problem as long as the physical server provides enough memory and CPU resources. Accordingly, we benchmark 5 cases within acceptable resource limits, as shown in Table 4. Note that at least two executors must be configured, because one of them must act as a parameter server (ps) that maintains and updates the model weights while the other performs the main computations, such as reading data and computing gradients. In general, when the number of executors is greater than two, we configure only one ps task and set all the others to perform parallel computations.
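The executor split described above can be expressed as a small helper (a sketch; the function name and its validation are ours, not part of Spark or TensorFlowOnSpark):

```python
def worker_count(num_executors: int, num_ps: int = 1) -> int:
    """Number of executors left for the actual parallel computation
    after reserving `num_ps` of them as parameter servers."""
    if num_executors < 2:
        raise ValueError("need at least 2 executors: 1 ps + 1 worker")
    if num_ps >= num_executors:
        raise ValueError("at least one executor must remain a worker")
    return num_executors - num_ps

# With 3 executors and the usual single ps, 2 executors compute gradients.
print(worker_count(3))   # -> 2
```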

Table 4. The experimental setup of VMs

Under the same number of data nodes, comparing case 1 to case 2 or case 3 to case 4, setting a greater number of executors yields better performance. As a whole, adding more executors to a Spark job allows each worker to process more records, which accelerates the training process. In addition, comparing case 2 to case 5 or case 1 to case 3, with the same number of executors, adding more data nodes negatively affects performance. Overall, the performance degradation on VM clusters may be due to virtualization overhead. For example, running 4 VMs on a single physical machine means that 4 operating systems, 4 namenode/datanode services and 4 sets of other VM services are running simultaneously. Obviously, the system overhead of 4 VMs is much heavier than that of 2 VMs, which reduces the available computing capacity.

3.2 Benchmark

Finally, this study benchmarks different types of deep learning systems, including the multi-node big data VM platform, a physical standalone system and a physical small-cluster system. In terms of hardware, we conduct the experiments as follows:

  • CPU only: single-node, single-CPU (non-distributed)

  • GPU only: single-node, single-GPU (non-distributed)

  • SPARK+CPU: 3 data nodes, single-CPU-per-node

  • SPARK+GPU: 3 data nodes, single-GPU-per-node

  • VMs+vCPUs: 3 data-node VMs on a single physical node

The hardware configuration of each physical node is as follows:

  • CPU(s): 4

  • RAM: 24 GB

  • Graphics card: Quadro K620 (see Table 5 for details)

    Table 5. The specifications of Nvidia Quadro K620

In general, fast training on large-scale datasets is a crucial requirement for a distributed computing system. In this experiment, the dataset used for the benchmark is the Dogs vs. Cats dataset from Kaggle [12], which is a larger-scale dataset than MNIST. The results of testing the different types of deep learning systems are shown in Fig. 7.

Fig. 7. Deep learning training performance on physical servers and VMs

Comparing the non-distributed CPU-only and GPU-only cases, an entry-level graphics card achieves nearly a 4x speedup. On top of that, the distributed TensorFlowOnSpark framework runs another two times faster than the non-distributed systems. Note that we configured only 3 data nodes and one executor per node in this example; therefore, 2 data nodes execute the computations while 1 data node updates the weights, which is the main reason why the 3-data-node setup achieves only a 2x speedup. Additionally, distributed computing with GPUs turns out to be almost 8 times faster than the non-distributed CPU system and even 28 times faster than the distributed VM environment.
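The speedup reasoning above can be checked with a few lines of arithmetic (illustrative only: the training times below are hypothetical stand-ins chosen to reproduce the reported ratios, not measured values):

```python
def speedup(t_baseline: float, t_system: float) -> float:
    """How many times faster a system is than the baseline."""
    return t_baseline / t_system

# Hypothetical training times (arbitrary units) matching the reported ratios.
t_cpu = 8.0                 # non-distributed, CPU only (baseline)
t_gpu = t_cpu / 4           # entry-level GPU: ~4x over CPU
t_spark_cpu = t_cpu / 2     # 3 data nodes, but only 2 of them compute
t_spark_gpu = t_cpu / 8     # distributed GPUs: ~8x over CPU

print(speedup(t_cpu, t_gpu))        # -> 4.0
print(speedup(t_cpu, t_spark_cpu))  # -> 2.0
print(speedup(t_cpu, t_spark_gpu))  # -> 8.0
```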

Next, with the same distributed computing configuration, the physical small-cluster system is 6.8 times faster than the VM distributed system. The main reason is that the big data VM platform runs as a guest operating system on Windows, so it shares CPU, memory and hard disk resources with the host Windows operating system, which reduces its computational capacity.

In brief, InputMode.SPARK data ingestion with the TFRecords input format achieves the best performance under the TensorFlowOnSpark framework, and properly tuning the number of executors can decrease the total training time. Although the computational efficiency of the Big Data multi-VM Platform is not as good as that of the physical small-cluster system, users are encouraged to use it for distributed deep learning programming, compilation, testing and running. Therefore, the big data VM platform is an ideal platform for the preparatory stages of big data application development.

3.3 Demonstration

During the paper presentation, the first author will give a demonstration of the proposed Big Data Multi-VM Platform system to show its capability and applicability.

4 Concluding Remarks

The present study has focused on the development of the big data multi-VM platform and on benchmarking different types of deep learning systems, including the multi-node big data VM platform, a physical standalone system and a physical small-cluster system. The conclusions and suggestions are provided below:

  A. This study utilizes VirtualBox virtualization technology to construct the big data multi-VM platform and produce its four VM image files for quick installation. During the design and implementation phase, the system resources of a physical multi-node Spark and Hadoop cluster may be occupied and shared by multiple developers, which causes a major impact when other users are running jobs on the system. The big data multi-VM platform therefore provides a well-designed personal development environment where developers can design, implement and test their deep learning applications before running them on the physical multi-node Spark and Hadoop cluster, saving system resources and waiting time. The multi-VM platform behaves just like a real physical Spark and Hadoop cluster, so users can quickly move into Spark and Hadoop Map/Reduce programming and portal development to speed up building their big data solutions.

  B. The big data multi-VM platform enables rapid prototyping of distributed deep learning for AI developers through the cutting-edge TensorFlowOnSpark (TFoS) framework. In this study, we benchmarked the two types of data ingestion that TensorFlowOnSpark offers, InputMode.SPARK and InputMode.TENSORFLOW, with the MNIST handwritten digit dataset on both the big data multi-VM platform and the physical server.

  C. InputMode.SPARK data ingestion with the TFRecords input format achieves the best performance under the TensorFlowOnSpark framework, and properly tuning the number of executors can decrease the total training time. Although the computational efficiency of the Big Data multi-VM Platform is not as good as that of the physical small-cluster system, users can still use the platform for distributed deep learning programming, compilation, testing and running. Therefore, the big data VM platform is an ideal platform for big data application development.

  D. To speed up the development of big data applications for worldwide users, the proposed big data VM platform is ready for download [13]. We hope it can benefit all users in the big data computing community.

5 Future Work

Since most Spark and Hadoop systems are still operated through a command-line interface, they may be inconvenient for beginners. Therefore, future work will focus on big data portal development using Liferay Portal, which provides a user-friendly interface for accessing the Spark and Hadoop systems in multiple virtual machines. This would improve users' working efficiency in Spark and Hadoop management and applications.

When the Spark and Hadoop cluster is started up several times, the file size of each VM image increases rapidly, and the root cause is still hard to identify. As a temporary workaround, the current solution is to export each VM image file immediately after the deep learning application software has been installed and configured. Further studies are needed to address this issue.

In addition, Spark supports in-memory computing and has been shown to be potentially 10–100 times faster than Hadoop. Future research will focus on enabling deep learning and streaming applications with Spark on the big data multi-VM platform. Techniques such as GPU access within VMs would also be worth developing in the future. This would give users more choices for their big data development and applications.