1 Introduction

With the explosive growth of data volume in many application fields, the demand for large-scale computation and storage capacity has greatly increased. Under such circumstances, the well-known Hadoop system [1, 2] has been developed and adopted by many enterprises. Generally, building a Spark and Hadoop cluster requires at least two physical machines, for the NameNode and the DataNode, to enable data partitioning, data replication and data distribution. Hence, for Map/Reduce programming and testing purposes, a Hadoop system based on the master-slave architecture is highly recommended. In addition to Hadoop, Spark [3] is a newly emerging big data platform that supports in-memory computing and has been shown to be potentially 10–100 times faster than Hadoop. However, building a physical Hadoop or Spark system may be difficult for entry-level beginners without a strong IT background. Furthermore, Hadoop runs on the Linux operating system, which can also be an obstacle for application users who are only familiar with Windows.

Given these considerations, this study uses the VirtualBox [4] virtualization software to deploy a multi-node big data VM platform, which runs a Spark and Hadoop cluster in multi-node (1 name node + 3 data nodes) virtual machines (VMs) under the Windows operating system. Although the (1+3) multi-node VMs share the resources (CPU, memory and hard disk) of one physical machine, developers and users can still perform and simulate big data computing in the (1+3) multi-node virtual environment of the Spark and Hadoop cluster. In addition, we also provide a solution that eliminates the problems of an unstable HBase service and superfluous data.

Through virtualization, the core components of the big data platform, including Hadoop, Spark, HDFS, HBase, ZooKeeper and the BigData Software Platform within the Liferay [5] IDE (Eclipse [6]), are all integrated into this platform. Spark is built on top of the Hadoop system, with YARN handling resource management. Specifically, this big data multi-VM platform can be deployed simply and easily by extracting its VM image files into VirtualBox under the Windows operating system. Therefore, the big data multi-VM platform can be a very helpful software platform for big data training, deployment and testing of Spark development and Map/Reduce programming, especially for beginners.

In addition, big data developers can integrate the backend computing by implementing specific portlets such as Job Submission, Job Status and other application portlets for their own purposes. For big data portal development and Map/Reduce programming, this big data multi-VM platform helps developers reduce application complexity and development time while improving application performance. It also offers a well-designed environment in the fields of big data computing and e-Commerce for Java developers to design, implement, configure and deploy applications. To serve the presentation tier, the business tier and the database tier at the same time, the platform adopts an enterprise architecture built on an enterprise portal engine and a Java EE application server, which makes it robust, stable and advanced.

2 Platform Architecture

As shown in Fig. 1, this platform uses Oracle VirtualBox to construct a multi-node Spark and Hadoop system comprising the development tier, the middleware tier and the system tier. The software adopted in each tier is listed in Table 1. In this architecture, the development tier contains four modules: Liferay Portal development, Hadoop Map/Reduce samples, the big data App and the Hadoop Library API Generator. The middleware tier contains the multi-node Hadoop cluster on which developers run their Map/Reduce source code during development. Developers can also use Scala or Java to access the Spark cluster within this tier. Users do not have to worry about the complexity of building the Spark and Hadoop cluster; instead, they can concentrate on big data development for their specific applications [7]. Furthermore, repeated long-running tests have shown the provided big data multi-VM platform to be stable and robust.

Fig. 1. Architecture of Big Data (1+3) multi-VM platform

Table 1. Software architecture of Big Data (1+3) Multi-Node VM Platform

To deploy the big data multi-VM platform, users first install the VirtualBox software on a Windows-based PC. Then, the multi-node VM image files, which contain NameNode, DataNode1, DataNode2 and DataNode3, are loaded one by one, and VirtualBox automatically builds the corresponding virtual machines of the multi-node big data VM platform, as shown in Fig. 2. All the relevant software and tools come pre-installed; the host names and fixed IP addresses are listed in Table 2.

Fig. 2. Big Data multi-VM platform with Spark and Hadoop cluster

Table 2. Hosts and IP of big data multi-VM platform

Basically, these multi-node virtual machines run the highly stable Ubuntu Linux operating system, on which Hadoop, Spark, the Hadoop database HBase, Scala development and MapReduce programming tools, the Java development tools (Eclipse), and ZooKeeper (a distributed application coordination tool) have been pre-installed and pre-configured, so users can immediately start using the platform for their own applications. In general, importing the Big Data Multi-VM Platform can be completed within 45 min, which greatly reduces the difficulty of constructing a Spark and Hadoop system.

Developers can perform programming, submit jobs and analyze data on this big data VM platform just as on a physical Spark and Hadoop cluster. Besides, the platform also provides important sample code and readme instructions for Hadoop Map/Reduce programming and Liferay portal development, which users can study and modify. The platform serves not only development but also education and training purposes. Hence, it is a good starting point for beginners with science or engineering backgrounds to step into the big data world. With this Big Data multi-VM Platform, a powerful four-node Spark and Hadoop system is readily at hand.
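As a flavor of the kind of Map/Reduce sample bundled with the platform, the classic word count can be sketched in plain Python (a conceptual sketch of the map, shuffle/sort and reduce phases only; it does not use the Hadoop API, and the function names are ours):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts per word."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data hadoop", "big data spark"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)   # -> {'big': 2, 'data': 2, 'hadoop': 1, 'spark': 1}
```

In a real Hadoop job, the map and reduce functions run on different cluster nodes, and the framework performs the shuffle/sort step between them.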

As shown in Fig. 3, the usage procedure of the Big Data (1+3) Multi-Node VM Platform is as follows.

Fig. 3. Usage procedure of Big Data multi-VM platform

  1. Start up the virtual machines (1 NameNode and 3 DataNodes)

  2. Start up the Spark and Hadoop cluster

  3. Start the BigData Software Platform through the Eclipse IDE

  4. Design, implement and test deep learning or big data applications

  5. Deploy the executable applications to the physical Spark and Hadoop cluster

One thing worth noting is that the common method of building the Big Data Multi-VM Platform is to create a single NameNode VM with the Spark and Hadoop cluster, and then clone this VM into DataNode1, DataNode2 and DataNode3, respectively. This procedure works, but problems may occur because of residual dirty data within DataNode1, DataNode2 and DataNode3. For example, the HBase service may become unstable without any identifiable root cause. To avoid these problems, we designed an alternative method to build a stable multi-node big data VM platform. The procedure consists of the following steps:

  1. Create a clean VM with Ubuntu Linux

  2. Clone this VM into three additional VMs

  3. Configure the fixed IPs and create an SSH key on the NameNode, then distribute the key to DataNode1, DataNode2 and DataNode3

  4. Manually and carefully build up the Spark and Hadoop cluster with ZooKeeper and HBase

Finally, a stable Big Data Multi-VM Platform can be obtained. As shown in Table 3, the new multi-VM platform is also more lightweight than the old one.

Table 3. Stable version of (1+3) VM Hadoop

3 AI Application and Benchmark

3.1 AI Application

Deep learning is the area of artificial intelligence where the real magic is happening right now. Traditional computers, while very fast, have not been very smart: they have no ability to learn from their mistakes and must be given precise instructions to carry out any task. Deep learning involves building artificial neural networks that attempt to mimic the way human brains sort and process information. The "deep" in deep learning signifies the use of many layers of neural networks stacked on top of each other. This configuration is known as a deep neural network, and its complexity allows it to process data to a more thorough and refined degree than earlier AI technologies. Recently, deep learning applications have shown impressive results across a wide variety of domains [8]. However, training neural networks is relatively time-consuming, even on a single GPU-equipped machine. Fortunately, setting up a distributed environment can greatly accelerate the training process and handle larger datasets. The big data multi-VM platform enables rapid prototyping of distributed deep learning for AI developers by using the cutting-edge TensorFlowOnSpark (TFoS) framework [9].

TensorFlowOnSpark integrates Apache Hadoop and Apache Spark with the deep learning framework TensorFlow [10]. Most existing distributed deep learning frameworks require setting up a separate additional cluster for deep learning, whereas TensorFlowOnSpark does not. As shown in Fig. 4, separate clusters not only require copying data back and forth between clusters but also introduce unwanted system complexity and end-to-end learning latency. In contrast, as shown in Fig. 5, TensorFlowOnSpark runs distributed deep learning on the same Spark cluster, which enables faster learning.

Fig. 4. ML pipeline with multiple programs on separated clusters

Fig. 5. TFoS for deep learning on Spark clusters

Data ingestion is the first step in harnessing the power of Hadoop, and choosing a proper ingestion method can accelerate training significantly. In this study, we benchmark the two types of data ingestion that TensorFlowOnSpark offers, InputMode.SPARK and InputMode.TENSORFLOW, with the MNIST handwritten digit dataset [11] on both the big data multi-VM platform and the physical server. InputMode.SPARK sends Spark RDDs into the TensorFlow nodes; it eases integration between Spark and TensorFlow when an existing data pipeline already generates data via Spark. InputMode.TENSORFLOW, on the other hand, takes advantage of TensorFlow's QueueRunners mechanism to read data directly from HDFS files, so Spark is not involved in data access. Besides, for each type of data ingestion, we benchmark both the TFRecords and CSV input formats, which are the standard TensorFlow format and the comma-separated value format, respectively.
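The two ingestion paths can be sketched in plain Python (a simplified illustration, not the actual TensorFlowOnSpark API: `feed_via_spark` stands in for the RDD-to-queue hand-off of InputMode.SPARK, while `read_direct` stands in for a worker reading records straight from storage as in InputMode.TENSORFLOW; lists stand in for RDD partitions and HDFS files):

```python
import queue
import threading

def feed_via_spark(partitions, batch_queue):
    """InputMode.SPARK-style path: a Spark-side feeder pushes the
    records of each RDD partition into a queue that the TensorFlow
    worker consumes."""
    for partition in partitions:
        for record in partition:
            batch_queue.put(record)
    batch_queue.put(None)              # sentinel: no more data

def read_direct(files):
    """InputMode.TENSORFLOW-style path: the worker reads records
    straight from (HDFS) files; Spark is not involved."""
    for f in files:
        yield from f

# Tiny demonstration with in-memory stand-ins for RDDs and HDFS files.
rdd_partitions = [[1, 2], [3, 4]]
hdfs_files = [[1, 2], [3, 4]]

q = queue.Queue()
threading.Thread(target=feed_via_spark, args=(rdd_partitions, q)).start()
spark_fed = []
while (item := q.get()) is not None:
    spark_fed.append(item)

direct = list(read_direct(hdfs_files))
print(spark_fed, direct)               # both paths yield the same records
```

The extra queue hop in the first path is where Spark hands data to TensorFlow; in the second path the worker bypasses Spark entirely.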

The results are illustrated in Fig. 6, showing that InputMode.SPARK is 3.3 times faster than InputMode.TENSORFLOW on the big data VM platform and even 6.1 times faster on the physical server. Furthermore, for every data ingestion mode tested in our experiments, TFRecords always outperforms CSV.

Fig. 6. Comparative experiment results of distributed deep learning

In the following benchmark, we fix the combination of InputMode.SPARK and TFRecords and delve deeper into the effect of tuning the number of executors and the number of data nodes on the big data VM platform. The TensorFlowOnSpark architecture uses only one core (task) per executor to ease debugging and Spark/YARN configuration. Nonetheless, setting multiple executors per node is not a problem as long as the physical server provides enough memory and CPU resources. Accordingly, we benchmark 5 cases within acceptable resource limits, as shown in Table 4. Note that at least two executors must be configured, because one of them must act as a parameter server (ps) that maintains and updates the model weights while the other performs the main computations, such as reading data and computing gradients. In general, when the number of executors is greater than two, we configure only one ps task and set all the others to perform parallel computations.
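The executor split described above can be expressed as a small helper (a sketch; the function name and its validation are ours, not part of Spark or TensorFlowOnSpark):

```python
def worker_count(num_executors: int, num_ps: int = 1) -> int:
    """Number of executors left for the actual parallel computation
    after reserving `num_ps` of them as parameter servers."""
    if num_executors < 2:
        raise ValueError("need at least 2 executors: 1 ps + 1 worker")
    if num_ps >= num_executors:
        raise ValueError("at least one executor must remain a worker")
    return num_executors - num_ps

# With 3 executors and the usual single ps, 2 executors compute gradients.
print(worker_count(3))   # -> 2
```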

Table 4. The experimental setup of VMs

Under the same number of data nodes, comparing case 1 to case 2 or case 3 to case 4, setting a greater number of executors yields better performance. As a whole, adding more executors to a Spark job allows each worker to process more records, which accelerates the training process. In addition, comparing case 2 to case 5 or case 1 to case 3, with the same number of executors, adding more data nodes negatively affects performance. Overall, the performance degradation on VM clusters may be due to virtualization overhead. For example, running 4 VMs on a single physical machine means that 4 operating systems, 4 namenode/datanode services and 4 sets of other VM services are running simultaneously. Obviously, the system overhead of 4 VMs is much heavier than that of 2 VMs, which reduces the available computing capacity.

3.2 Benchmark

Finally, this study benchmarks different types of deep learning systems, including the multi-node big data VM platform, a physical standalone system and a physical small-cluster system. In terms of hardware, we conduct the experiments as follows:

  • CPU only: single-node, single-CPU (non-distributed)

  • GPU only: single-node, single-GPU (non-distributed)

  • SPARK+CPU: 3 data nodes, single-CPU-per-node

  • SPARK+GPU: 3 data nodes, single-GPU-per-node

  • VMs+vCPUs: 3 data-node VMs on a single physical node

The hardware configuration of each physical node is as follows:

  • CPU(s): 4

  • RAM: 24 GB

  • Graphics card: Quadro K620 (see Table 5 for details)

    Table 5. The specifications of Nvidia Quadro K620

In general, fast training on large-scale datasets is a crucial requirement for a distributed computing system. In this experiment, the dataset used for the benchmark is the Dogs vs. Cats dataset from Kaggle [12], which is a larger-scale dataset than MNIST. The results of testing the different types of deep learning systems are shown in Fig. 7.

Fig. 7. Deep learning training performance on physical servers and VMs

Comparing the non-distributed CPU-only and GPU-only cases, an entry-level graphics card achieves nearly a 4x speedup. On top of that, the distributed TensorFlowOnSpark framework runs another two times faster than the non-distributed systems. Note that we configured only 3 data nodes and one executor per node in this example; therefore, 2 data nodes execute the computations while 1 data node updates the weights, which is the main reason why the 3-data-node setup achieves only a 2x speedup. Additionally, distributed computing with GPUs turns out to be almost 8 times faster than the non-distributed CPU system and even 28 times faster than the distributed VM environment.
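The speedup reasoning above can be checked with a few lines of arithmetic (illustrative only: the training times below are hypothetical stand-ins chosen to reproduce the reported ratios, not measured values):

```python
def speedup(t_baseline: float, t_system: float) -> float:
    """How many times faster a system is than the baseline."""
    return t_baseline / t_system

# Hypothetical training times (arbitrary units) matching the reported ratios.
t_cpu = 8.0                 # non-distributed, CPU only (baseline)
t_gpu = t_cpu / 4           # entry-level GPU: ~4x over CPU
t_spark_cpu = t_cpu / 2     # 3 data nodes, but only 2 of them compute
t_spark_gpu = t_cpu / 8     # distributed GPUs: ~8x over CPU

print(speedup(t_cpu, t_gpu))        # -> 4.0
print(speedup(t_cpu, t_spark_cpu))  # -> 2.0
print(speedup(t_cpu, t_spark_gpu))  # -> 8.0
```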

Next, with the same distributed computing configuration, the physical small-cluster system is 6.8 times faster than the VM distributed system. The main reason is that the big data VM platform runs as a guest operating system on Windows, so it shares CPU, memory and hard disk resources with the host Windows operating system, which reduces its computational capacity.

In brief, InputMode.SPARK data ingestion with the TFRecords input format achieves the best performance under the TensorFlowOnSpark framework, and properly tuning the number of executors can decrease the total training time. Although the computational efficiency of the Big Data multi-VM Platform is not as good as that of the physical small-cluster system, users are encouraged to use it for distributed deep learning programming, compilation, testing and running. Therefore, the big data VM platform is an ideal platform for the preparatory stages of big data application development.

3.3 Demonstration

During the paper presentation, the first author will give a demonstration of the proposed Big Data Multi-VM Platform system to show its capability and applicability.

4 Concluding Remarks

The present study has focused on the development of the big data multi-VM platform and on benchmarking different types of deep learning systems, including the multi-node big data VM platform, a physical standalone system and a physical small-cluster system. The conclusions and suggestions are provided below:

  A. This study utilizes VirtualBox virtualization technology to construct the big data multi-VM platform and produce its four VM image files for quick installation. During the design and implementation phase, the system resources of a physical multi-node Spark and Hadoop cluster may be occupied and shared by multiple developers, which causes a major impact when other users are running jobs on the system. The big data multi-VM platform therefore provides a well-designed personal development environment where developers can design, implement and test their deep learning applications before running them on the physical multi-node Spark and Hadoop cluster, saving system resources and waiting time. The multi-VM platform behaves just like a real physical Spark and Hadoop cluster, so users can quickly move into Spark and Hadoop Map/Reduce programming and portal development to speed up building their big data solutions.

  B. The big data multi-VM platform enables rapid prototyping of distributed deep learning for AI developers through the cutting-edge TensorFlowOnSpark (TFoS) framework. In this study, we benchmarked the two types of data ingestion that TensorFlowOnSpark offers, InputMode.SPARK and InputMode.TENSORFLOW, with the MNIST handwritten digit dataset on both the big data multi-VM platform and the physical server.

  C. InputMode.SPARK data ingestion with the TFRecords input format achieves the best performance under the TensorFlowOnSpark framework, and properly tuning the number of executors can decrease the total training time. Although the computational efficiency of the Big Data multi-VM Platform is not as good as that of the physical small-cluster system, users can still use the platform for distributed deep learning programming, compilation, testing and running. Therefore, the big data VM platform is an ideal platform for big data application development.

  D. To speed up the development of big data applications for worldwide users, the proposed big data VM platform is ready for download [13]. We hope it can benefit all users in the big data computing community.

5 Future Work

Since most Spark and Hadoop systems are still operated through a command-line interface, they may be inconvenient for beginners. Therefore, future work will focus on big data portal development using Liferay Portal, which provides a user-friendly interface for accessing the Spark and Hadoop systems in multiple virtual machines. This would improve users' working efficiency in Spark and Hadoop management and applications.

When the Spark and Hadoop cluster is started up several times, the file size of each VM image increases rapidly, and the root cause is still hard to identify. As a temporary workaround, the current solution is to export each VM image file immediately after the deep learning application software has been installed and configured. Further studies are needed to address this issue.

In addition, Spark supports in-memory computing and has been shown to be potentially 10–100 times faster than Hadoop. Future research will focus on enabling deep learning and streaming applications with Spark on the big data multi-VM platform. Techniques such as GPU access within VMs would also be worth developing in the future. This would give users more choices for their big data development and applications.