
1 Introduction

Container technology is developing rapidly with support from industry players such as Google and Alibaba, and is being widely used in large-scale production environments [1, 2]. Container technology, also known as operating-system-level virtualization, allows multiple isolated user-space instances to share the same operating system kernel and uses cgroups to control resource usage on the host. This provides functionality similar to a virtual machine (VM) but with a lighter footprint. Traditional virtualization solutions (e.g., hypervisors) need to interpose on various privileged operations (e.g., page-table lookups) and use roundabout techniques to infer resource usage (e.g., ballooning). As a result, hypervisors are heavyweight, with slow boot times as well as high run-time overheads.

Fast startup and a small memory footprint are two outstanding features of containers: a container can launch an application in less than a second while consuming very few resources. Compared with virtual machines, adopting containers not only improves the performance of applications, but also allows hosts to sustain more applications simultaneously [3]. So far, a number of container products have been released to the market, such as LXC (Linux Containers) [4], Docker [5], rkt (Rocket) [6], and OpenVZ [7]. Docker [5] is a popular runtime system for managing Linux containers, providing both management tools and a simple file format. Docker technology introduces lightweight virtual machines on top of Linux containers, called Docker containers. Major cloud platforms, including Amazon Web Services [8] and Google Compute Engine [9], are also beginning to provide public container services for developers to deploy applications in the cloud. Undoubtedly, the emergence of container technology has changed the direction of the cloud computing market.

Deep learning is a new area of machine learning research. It applies artificial neural networks (ANNs) with more than one hidden layer to learning tasks. Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been widely applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics, where they have produced results comparable to, and in some cases superior to, human experts. TensorFlow [10] is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow was originally developed by researchers and engineers on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural network research, but the system is general enough to be applicable in a wide variety of other domains as well.
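
To make the data flow graph abstraction concrete, the following minimal sketch (assuming a recent TensorFlow release with eager execution enabled by default) builds a tiny graph in which a matmul node consumes two constant tensors flowing along the edges:

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])      # 1x2 tensor carried on a graph edge
w = tf.constant([[3.0], [4.0]])    # 2x1 tensor
y = tf.matmul(a, w)                # matmul operation node, yields [[11.0]]

print(y)                           # prints the resulting tensor value
```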

There is already a considerable body of work on the performance of different applications running in containers [11,12,13,14]. However, the fault behavior and interference of container-based clouds are still not well understood. In fact, reliability is very important: faults that are not handled in time not only disrupt the runtime system, but also cause serious economic losses. For example, on February 28th 2017, Amazon Web Services, the popular hosting and storage platform used by a huge range of companies, experienced a four-hour S3 service interruption in the Northern Virginia (US-EAST-1) Region, and the outage then quickly spread to other online service providers that rely on the S3 service [15]. This failure led to substantial compensation for users, because cloud computing service providers usually sign a Service Level Agreement (SLA) with customers. For example, a 99.99% availability requirement means that the service must be available 99.99% of the time over the whole year; if downtime exceeds 0.01%, compensation is required.

Although the reliability of cloud systems is important, the problem is not easy to solve, mainly because of: (1) Large scale. A typical data center involves more than 100,000 servers and 10,000 switches, and more nodes usually mean a higher probability of failure; (2) Complex application structure. Web search, e-commerce and other typical cloud programs have complex interactive behaviors. For example, an Amazon page request involves interactions with hundreds of components [16], and an error in any one component can lead to application anomalies; (3) Diverse causes. Resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures and external attacks can all cause anomalies or failures in cloud systems. Cloud systems are more prone to performance anomalies than traditional platforms [17].

In this paper, we focus on the reliability of container-based clouds. We first propose a fault injection framework for container-based cloud systems. We build a Docker container environment with the TensorFlow deep learning framework installed, and develop four typical attack programs, i.e., CPU attack, memory attack, disk attack and DDoS attack. Then, we observe fault behaviors and interference phenomena by injecting the attack programs into containers running artificial intelligence (AI) applications (CNN, RNN, BRNN and DRNN). We also design fault detection models based on the quantile regression method to detect potential faults in containers.

Our main contributions are:

  • We develop a fault injection tool as well as four typical attack programs for container-based cloud systems. The fault programs include CPU attack, memory attack, disk attack and network attack. The purpose of fault injection is to simulate abnormal behavior of containers.

  • We investigate the fault behaviors and interferences between multiple containers. We focus on four mainstream AI applications: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Bidirectional Recurrent Neural Network (BRNN) and Dynamic Recurrent Neural Network (DRNN).

  • We design fault detection models for container-based cloud systems. The models are based on the quantile regression method. Part of the experimental data is used for training the models. Experimental results show that the proposed fault detection models can effectively detect the injected faults with more than 60% Precision, more than 90% Recall and nearly 100% Accuracy.

Fig. 1. Fault injection framework.

2 Fault Injection Framework

2.1 Fault Injection Framework

AI Applications in Target Systems. Our fault injection framework is mainly designed for container-based cloud systems (see Fig. 1), but it can easily be extended to other environments. We use Docker to set up a target testing system and create several containers. The applications running in the containers are up-to-date AI applications based on the TensorFlow framework. We mainly focus on four types of typical AI applications:

  • Convolutional Neural Network (CNN): a class of deep, feed-forward artificial neural networks that has been applied to analyzing visual imagery. Convolutional networks were inspired by biological processes: the connectivity pattern between neurons resembles the organization of the animal visual cortex. They have applications in image and video recognition, recommender systems and natural language processing (see the minimal code sketch after this list).

  • Recurrent Neural Network (RNN): a class of artificial neural networks in which connections between units form a directed cycle. This allows them to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs, which makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

  • Bidirectional Recurrent Neural Network (BRNN): introduced to increase the amount of input information available to the network. BRNNs do not require their input data to be fixed, and their future input information is reachable from the current state. The basic idea of BRNNs is to connect two hidden layers of opposite directions to the same output, so that the output layer can get information from past and future states.

  • Dynamic Recurrent Neural Network (DRNN): architecturally contains a large number of both feedforward and feedback synaptic connections between the neural units in the network, and is mathematically a complex dynamic system described by a set of parameterized nonlinear differential or difference equations. DRNNs have been proven capable of processing the time-varying spatio-temporal information represented by state-space trajectories.
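
As a concrete illustration of the kind of workload we run in the containers, the sketch below defines a minimal CNN with tf.keras. It is only an assumed, simplified stand-in for the actual benchmark scripts; the layer sizes, the 28x28x1 input shape and the training data are illustrative.

```python
import tensorflow as tf

# Minimal CNN for small grayscale images (e.g., MNIST-style 28x28x1 inputs).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=1)  # x_train/y_train would be the benchmark dataset
```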

Attack Repository. In order to assess the fault behaviors and interferences of container-based cloud systems from different dimensions, we develop four types of attack programs. Specifically, we use the C language to develop the CPU, memory, disk and network attack programs (a simplified sketch of the CPU and memory attacks follows the list).

  • CPU attack program: keeps CPU doing a lot of calculations and consuming a majority of CPU resources, which is used to simulate a competition scenario of CPU resources with other neighbor containers.

  • Memory attack program: continuously allocates and consumes memory resources, which is used to simulate a competition scenario of memory resources with other neighbor containers.

  • DiskIO attack program: uses the IOzone benchmark to continuously read and write the disk, consuming almost all disk bandwidth resources, which is used to simulate a competition scenario of disk bandwidth resources with other neighbor containers.

  • Network attack program: uses a third-party DDoS client to initiate a large number of invalid connections to the server, consuming all possible network bandwidth resources, which is used to simulate a competition scenario of network bandwidth resources with other neighbor containers.
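
The attack programs themselves are written in C; the sketch below is only a hedged Python equivalent showing the idea behind the CPU and memory attacks (the duration and chunk-size parameters are illustrative, not the values used in our experiments).

```python
import time

def cpu_attack(duration_s=60):
    """Busy-loop on arithmetic to consume CPU cycles and compete with neighbor containers."""
    end, x = time.time() + duration_s, 0
    while time.time() < end:
        x = (x * 31 + 7) % 1000003   # pointless work that keeps a core fully busy

def memory_attack(chunk_mb=64, duration_s=60):
    """Continuously allocate memory chunks and keep references so nothing is freed."""
    hoard, end = [], time.time() + duration_s
    while time.time() < end:
        hoard.append(bytearray(chunk_mb * 1024 * 1024))
        time.sleep(0.1)
```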

Fault Injection Controller. The fault injection controller is responsible for triggering, injecting and activating attack programs. In most cases, the attack programs do not execute immediately; they are triggered by user-defined conditions, such as a start time. All attack programs are selected from the attack repository, configured by users, and finally injected into the target location, such as a container or the host. The controlling parameters include the attack type, injection location, injection strategy, attack duration and other parameters.
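
A minimal sketch of the controller logic is shown below, assuming the attack binaries are available inside the target container (the /attacks path, class name and fields are hypothetical; only the docker exec invocation is standard).

```python
import subprocess
import time
from dataclasses import dataclass

@dataclass
class InjectionConfig:
    attack_type: str       # "cpu" | "memory" | "disk" | "network"
    target_container: str  # Docker container name or ID
    start_time: float      # epoch seconds at which to trigger the attack
    duration_s: int        # how long the attack should run

def inject(cfg: InjectionConfig):
    # Wait for the user-defined trigger condition (here: a start time).
    time.sleep(max(0.0, cfg.start_time - time.time()))
    # Launch the selected attack program inside the target container in the background.
    subprocess.run(["docker", "exec", "-d", cfg.target_container,
                    f"/attacks/{cfg.attack_type}_attack", str(cfg.duration_s)],
                   check=True)
```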

Monitor. The monitor is responsible for monitoring and collecting information about the target systems. It also checks whether an attack program has been activated and how it affects the system. The monitored information includes fault feedback data, container status information and performance data of the applications running in containers.
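
The sketch below illustrates one way the monitor can sample per-container resource usage, using the standard docker stats command (the sampling approach and field selection are our assumptions; the format placeholders are standard Docker template fields).

```python
import subprocess

def sample_container_stats():
    """Return one [name, cpu%, mem usage, net I/O, block I/O] record per running container."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         "{{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.NetIO}},{{.BlockIO}}"],
        capture_output=True, text=True, check=True).stdout
    return [line.split(",") for line in out.strip().splitlines()]
```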

Fig. 2. The DDoS attack injection method.

2.2 DDoS Injection Method

Figure 2 shows the DDoS attack method in containers. We installed Apache Tomcat in the containers on host A, which provides a web service through port 8080. Host B can access the Tomcat service through port 8000 on host A, where port 8000 is mapped to port 8080 in the container.

We use the torshammer tool [18] to simulate the DDoS attack. Torshammer is a slow-rate DDoS attack tool that is efficient and very disruptive on most Apache servers. Like regular volumetric DDoS attacks, slow-rate DDoS attacks exhaust web server resources, but they do so by generating a large number of connections and holding them open as long as possible. More technically, torshammer uses the classic form of slow POST attack, generating HTTP POST requests and connections that are held for around 1000–30000 s. Instead of leveraging huge attack bandwidth or a large number of HTTP requests per second, slow-rate DDoS attacks simply exhaust the number of concurrent connections that Apache servers can handle. We control the attack traffic by adjusting the number of attack threads. We use the dstat tool [19] to collect the send and receive traffic.
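
The fragment below is not torshammer's code; it is a hedged illustration of the slow POST mechanism it relies on: open a connection, declare a large Content-Length, and then trickle the body one byte at a time so the connection stays occupied. The host name and port simply mirror the mapping in Fig. 2.

```python
import socket
import time

def slow_post_connection(host="hostA", port=8000, hold_s=1000):
    """Hold one HTTP POST connection open for a long time by sending the body very slowly."""
    s = socket.create_connection((host, port))
    s.sendall(b"POST / HTTP/1.1\r\nHost: " + host.encode() +
              b"\r\nContent-Length: 1000000\r\n\r\n")
    end = time.time() + hold_s
    while time.time() < end:
        s.sendall(b"x")      # one byte at a time keeps the connection alive
        time.sleep(10)
    s.close()
```

An attacker launches many such connections in parallel (one per attack thread) until the server's connection pool is exhausted.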

3 Fault Behavior Analysis

In this section, we investigate fault behaviors and interferences by injecting different types of attacks into containers. We compare the performance of four up-to-date AI applications (CNN, RNN, BRNN and DRNN) before and after injecting these attacks.

Fig. 3. Fault interference of four AI applications running in containers after injecting CPU attack.

Fig. 4. Fault interference of four AI applications running in containers after injecting memory attack.

Fig. 5. Fault interference of four AI applications running in containers after injecting disk attack.

3.1 Fault Interference Analysis

Figures 3, 4 and 5 show the fault interference of the AI applications after injecting CPU, memory and disk attacks. It is obvious that the performance of the AI applications is affected to varying degrees. For the CPU attack, the running time of CNN, RNN, BRNN and DRNN is extended by 65.49%, 66.67%, 79.10%, and 44.89% respectively, compared with the normal execution time. For the memory attack, the running time is extended by 123.17%, 216.52%, 223.16%, and 93.31% respectively. For the disk attack, the running time is extended by 360.66%, 549.01%, 1045.43%, and 399.09% respectively.

From the experimental results, we observe: (i) From the application perspective, all four AI applications are very sensitive to the attacks. BRNN is the most sensitive, while DRNN is the least sensitive. This is because BRNNs connect two hidden layers of opposite directions to the same output, which requires more resources to process the workload. (ii) From the attack perspective, both the memory and disk attacks cause very serious degradation of application performance, and the disk attack causes the most serious interference. This is because the applications run on a distributed container cluster and need frequent data transfers, exchanges, reads and writes; if disk bandwidth or memory space runs out, application performance is inevitably affected. (iii) From the container perspective, the fault isolation of containers is still insufficient, although it is much better than that of traditional hypervisors such as Xen or KVM. CPU isolation is the best, memory isolation comes second, and disk isolation is the worst.

Fig. 6. Fault behavior of Tomcat web service after injecting DDoS attack in containers.

3.2 DDoS Attack Behavior Analysis

We further study DDoS attack behaviors in containers. The DDoS attack is one of the most common network attacks in today's cloud systems. Figure 6 shows the fault behaviors when triggering the DDoS attack under different numbers of attack threads. We inject the DDoS attack at the 63rd time point.

From the figure, we observe: (i) When the number of attack threads is set to 1, there is no obvious change in either send or receive traffic, indicating that the attack is similar or equivalent to an access by a normal user. When the number of attack threads is set to 256, both send and receive traffic suddenly increase to a very high level, more than 30 times higher than normal traffic, and the web server consequently denies service to other normal users. (ii) Analyzing the normal traffic, we find that even normal traffic fluctuates considerably, which makes it very challenging to accurately detect potential faults.

4 Fault Detection Models

In this section, we design several fault detection models based on the quantile regression method. Quantile regression is an optimization-based statistical method for modeling the effects of factors on arbitrary quantiles, and it has recently been used successfully for performance studies in computer experiments [20]. We use this theory for anomaly detection. Quantile regression extends traditional regression: traditional regression models the mean of the response variable, while quantile regression builds a model for any given quantile (in the probabilistic sense), such as the median or the 99th percentile. Quantile regression has been successfully applied in various areas, such as ecology, healthcare and financial economics.
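
The sketch below shows the basic idea with statsmodels (an assumed implementation choice; the synthetic data is purely illustrative): the same linear predictor is fitted to the conditional median and to the conditional 90th percentile, rather than to the mean as in ordinary least squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + rng.normal(scale=1 + 0.3 * x)   # noise spread grows with x

X = sm.add_constant(x)
median_fit = sm.QuantReg(y, X).fit(q=0.5)     # conditional median (tau = 0.5)
upper_fit  = sm.QuantReg(y, X).fit(q=0.9)     # conditional 90th percentile (tau = 0.9)
print(median_fit.params, upper_fit.params)
```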

4.1 Metrics

The main objective of the fault detection method is to maximize the Precision and Recall of the proposed detection technique. The four common performance parameters for these objectives are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) which are defined as follows:

True Positive (TP): This occurs when the detection model raises a true alert on actually anomalous traffic.

True Negative (TN): This occurs when no anomalous activity is taking place in the system, and the detection model correctly raises no alarm.

False Positive (FP): This occurs when the detection model erroneously raises an alarm on legitimate activity in the system.

False Negative (FN): This occurs when the detection model fails to detect anomalous activity taking place in the system.

To better understand the relationship between these four counts, we give the confusion matrix in Table 1. We further define Precision, Recall and Accuracy to quantify detection effectiveness.

Table 1. Confusion matrix for TP, TN, FP and FN.

Precision indicates the proportion of raised alerts that correspond to actual faults, i.e., the correctness of the predictions:

$$\begin{aligned} {\text {Precision}} = \dfrac{TP}{TP + FP} \end{aligned}$$
(1)

Recall represents the percentage of actual faults that the model detects among all fault occurrences, i.e., the completeness of the predictions:

$$\begin{aligned} {\text {Recall}} = \dfrac{TP}{TP + FN} \end{aligned}$$
(2)

Accuracy indicates the percentage of all events that the model classifies correctly:

$$\begin{aligned} {\text {Accuracy}} = \dfrac{TP + TN}{TP + FP + FN + TN} \end{aligned}$$
(3)
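
The three metrics follow directly from the four counts; the helper below computes them, and the example numbers (3 true positives, 2 false positives, 18 true negatives, no false negatives over 23 samples) are chosen only to illustrate how a 60% Precision can coexist with 100% Recall and roughly 91% Accuracy.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute Precision, Recall and Accuracy as in Eqs. (1)-(3)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

print(detection_metrics(tp=3, tn=18, fp=2, fn=0))  # (0.6, 1.0, 0.913...)
```
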
Fig. 7. Fault detection result for four AI applications: CNN, RNN, BRNN and DRNN.

4.2 Detection Models for CPU, Memory and Disk Attacks

Figure 7 shows the fault detection results for the four AI applications (CNN, RNN, BRNN and DRNN) running on the TensorFlow framework in containers. We use a tau value of 0.1 as the lower boundary of the fitting model and a tau value of 0.9 as the upper boundary. All points within this interval are considered normal; the remaining points are considered abnormal. In the figure, sequence points 5, 10 and 15 are where the faults (CPU, memory and disk faults) are injected. Our quantile regression based detection model finds that these points fall outside the interval between the upper and lower boundaries of the fitting model, so the method detects the injected faults. Tables 2, 3, 4 and 5 show the four basic detection counts TP, FP, FN and TN, from which we calculate the Precision, Recall and Accuracy of the detection models in Table 6.
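
A hedged sketch of this detection rule is given below (again assuming statsmodels; variable names are illustrative): the 0.1 and 0.9 quantile models are fitted over the monitored metric, and any observation falling outside the resulting band is flagged as a potential fault.

```python
import numpy as np
import statsmodels.api as sm

def quantile_band(t, y, low=0.1, high=0.9):
    """Fit lower/upper quantile regressions of metric y over time index t."""
    X = sm.add_constant(np.asarray(t, dtype=float))
    lower = sm.QuantReg(y, X).fit(q=low).predict(X)
    upper = sm.QuantReg(y, X).fit(q=high).predict(X)
    return lower, upper

def detect_faults(t, y):
    """Return indices of observations outside the [tau=0.1, tau=0.9] band."""
    y = np.asarray(y, dtype=float)
    lower, upper = quantile_band(t, y)
    return [i for i in range(len(y)) if y[i] < lower[i] or y[i] > upper[i]]
```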

Table 2. TP, TN, FP and FN for CNN.
Table 3. TP, TN, FP and FN for RNN.
Table 4. TP, TN, FP and FN for BRNN.
Table 5. TP, TN, FP and FN for DRNN.
Table 6. The Precision, Recall and Accuracy of the detection models for different AI applications.

From the above experimental results, it can be seen that the quantile regression based fault detection method can effectively detect potential faults. For the CNN and RNN fault detection models, the Precision is 60%, the Recall is 100%, and the Accuracy is 91.30%. For the BRNN and DRNN fault detection models, the Precision is 75%, the Recall is 100%, and the Accuracy is 95.65%. The Precision is relatively low because the number of injected fault points in the experiment is very small (3 points), which limits the detection Precision. The Accuracy, however, is nearly 100% because the number of sampled points is relatively large (23 points). We plan to increase the testing scale in the next step of our work.

4.3 Detection Models for DDoS Attack

The DDoS fault detection model needs two sets of data, namely the receive traffic and send traffic of the Apache web server. In the experiment, we use the dstat tool to collect traffic information. Traffic data is a typical time series, making it ideal for processing with time series models, so we apply the quantile regression method to the traffic data as a time series. We calculate three groups of fitted values, named fit1, fit2 and fit3, with tau set to 0.1, 0.5 and 0.8 respectively. The tau value of 0.1 indicates the lower boundary of the receive or send traffic, 0.5 indicates the median of the traffic, and 0.8 indicates the upper boundary. Based on the trained model, we can detect potential DDoS attacks.
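
The sketch below shows one possible way to collect and load this traffic data with dstat (the one-second interval, sample count and CSV parsing are assumptions; the exact CSV layout can differ across dstat versions).

```python
import csv
import subprocess

def collect_traffic(seconds=120, path="net.csv"):
    """Record per-second network recv/send bytes with dstat and return the two series."""
    # -n: network counters; --output: also write the samples to a CSV file.
    subprocess.run(["dstat", "-n", "--output", path, "1", str(seconds)], check=True)
    recv, send = [], []
    with open(path) as f:
        rows = list(csv.reader(f))
    # Skip dstat's preamble: data starts after the "recv","send" column header row.
    start = next(i for i, r in enumerate(rows) if r and r[0] == "recv") + 1
    for r in rows[start:]:
        if len(r) >= 2:
            recv.append(float(r[0]))
            send.append(float(r[1]))
    return recv, send
```

The two series can then be fed to the quantile regression models described above to obtain fit1, fit2 and fit3.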

Fig. 8. Fault detection result for DDoS attacks.

Figure 8(a) and (b) show the detection results for the receive and send traffic of the Apache Tomcat server under DDoS attacks. In the figure, we inject DDoS attacks at points 10, 20, 30, 40, 50, 60, 70, 80, and 90. All traffic points inside the confidence interval of the fitting model are classified as normal traffic, while points outside it indicate potential DDoS attacks. From the figure, we find that the traffic points under attack clearly fall outside the confidence interval, which means the fitting model fits the normal traffic data well and can be used to accurately detect potential DDoS attacks.

Table 7. TP, TN, FP and FN for the recv traffic.
Table 8. TP, TN, FP and FN for the send traffic.
Table 9. The Precision, Recall and Accuracy of the detection model for DDoS recv traffic and send traffic.

Table 9 shows the Precision, Recall and Accuracy of the detection models for the DDoS receive and send traffic, based on the TP, TN, FP and FN results shown in Tables 7 and 8. For the receive traffic detection model, the Precision, Recall and Accuracy are all 100%. For the send traffic detection model, the Precision is 90%, the Recall is 100% and the Accuracy is 98.94%, indicating that our detection models perform very well.

5 Related Work

Recently, there has been increasing interest in the performance and reliability of containers.

Performance. Felter et al. [3] explored the performance of traditional virtual machine (VM) deployments and compared them with the use of Linux containers. They used KVM as a representative hypervisor and Docker as a container manager. The results showed that containers achieve equal or better performance than VMs in most cases, and that both VMs and containers require tuning to support I/O-intensive applications. Ruan et al. [21] conducted a series of experiments to measure performance differences between application containers and system containers and found that system containers are more suitable for sustaining I/O-bound workloads than application containers. Ye and Ji [13] proposed a performance model to predict the performance of applications running in containers. Higgins et al. [12] evaluated the suitability of Docker containers as a runtime for high performance parallel execution. Their findings suggest that containers can be used to tailor the runtime environment for an MPI application without compromising performance. Yu et al. [14] designed a flexible container-based tuning system (FlexTuner) that allows users to create a farm of lightweight virtual machines (containers) on host machines; they mainly focus on the impact of network topology and routing algorithms on the communication performance of big data applications. Harter et al. [11] proposed Slacker, a new Docker storage driver optimized for fast container startup, with which Docker workers quickly provision container storage using backend clones and minimize startup latency by lazily fetching container data.

Reliability. Duffield et al. [22] proposed rule-based anomaly detection on IP networks that correlates packet and flow level information. Cherkasova et al. [23] presented an integrated framework that uses regression-based transaction models and application performance signatures to detect anomalous application behavior. Sharma et al. [24] used Auto-Regressive models and a time-invariant-relationships based approach to detect faults. Pannu et al. [25] presented an adaptive anomaly detection framework that can self-adapt by learning from observed anomalies at runtime. Tan et al. presented two anomaly prediction systems, PREPARE [26] and ALERT [27], that integrate online anomaly prediction, learning-based cause inference, and predictive prevention actuation to minimize the performance anomaly penalty without human intervention. They also investigated the anomalous behavior of three datasets [28]. Bronevetsky et al. [29] designed a novel technique that combines classification algorithms with information on the abnormality of application behavior to improve detection. Gu et al. [30] developed an attack detection system, LEAPS, based on supervised statistical learning to classify benign and malicious system events. In addition, Fu and Xu [31] implemented a failure prediction framework, PREdictor, that explores correlations among failures and forecasts the time-between-failure of future instances. Nguyen et al. [32] presented a black-box online fault localization system called FChain that can pinpoint faulty components immediately after a performance anomaly is detected. Most recently, Arnautov et al. [33] described SCONE, a secure container mechanism for Docker that uses the SGX trusted execution support of Intel CPUs to protect container processes from outside attacks.

In comparison, we study the fault behaviors and interferences of artificial intelligence applications on Docker containers, and propose fault detection models based on the quantile regression method to accurately detect faults in containers.

6 Conclusion

In this paper, we studied the reliability of container-based clouds. We first proposed a fault injection framework for container-based cloud systems. We built a Docker container environment with the TensorFlow deep learning framework installed, and developed four typical attack programs, i.e., CPU attack, memory attack, disk attack and DDoS attack. Then, we injected the attack programs into containers running typical artificial intelligence applications (CNN, RNN, BRNN and DRNN) to observe fault behaviors and interference phenomena. After that, we designed fault detection models based on the quantile regression method to detect potential faults in containers. Experimental results show that the proposed fault detection models can effectively detect the injected faults with more than 60% Precision, more than 90% Recall and nearly 100% Accuracy. Note that, although we verified that the quantile regression based fault detection model can efficiently detect potential faults, the experimental scale is still small. In the future, we plan to expand the experimental scale and continue to test our approach in more complex environments.