An uncertainty-managing batch relevance-based approach to
network anomaly detection
Gianni D’Angelo (a), Francesco Palmieri (b,∗), Salvatore Rampone (a), Massimo Ficco (b)

(a) Department of Sciences and Technologies, Sannio University, Via dei Mulini 59A, I-82100 Benevento, Italy.
(b) Department of Industrial and Information Engineering, Second University of Naples, Via Roma 29, I-81031 Aversa (CE), Italy.
Abstract
The main aim of network anomaly detection is effectively spotting hostile events within the traffic pattern associated with network operations, by distinguishing them from normal activities. This can be accomplished either by acquiring a priori knowledge about every kind of hostile behavior that can potentially affect the network (which is practically impossible) or, more easily, by building a model that is general enough to describe the normal network behavior and to detect violations from it. Earlier detection frameworks were only able to distinguish already known phenomena within traffic data, by using pre-trained models that match specific events against pre-classified chains of traffic patterns. Alternatively, more recent statistics-based approaches detect outliers with respect to a statistical idealization of normal network behavior. Clearly, while the former approach cannot detect previously unknown phenomena (zero-day attacks), the latter has limited effectiveness, since it cannot be aware of anomalous behaviors that do not generate significant changes in traffic volumes. On the other hand, machine learning allows the development of adaptive, non-parametric detection strategies based on “understanding” the network dynamics: a proper training phase provides a more precise knowledge about normal or anomalous phenomena, so that any kind of behavior observed on the network can be classified and handled more effectively. Accordingly, we present a new anomaly detection strategy based on supervised machine learning, and more precisely on a batch relevance-based fuzzified learning algorithm known as U-BRAIN, aiming at understanding, through inductive inference, the specific laws and rules governing normal or abnormal network traffic, in order to reliably model its operating dynamics. This proposal appears to be promising both in terms of identification accuracy and in terms of robustness/flexibility when coping with uncertainty in the detection/classification process, as verified through extensive evaluation experiments.
Keywords: Network Anomaly Detection, Machine Learning, Supervised Classification,
Fuzzy-based techniques, Inductive Inference.
∗ Corresponding author.
Email addresses: dangelo@unisannio.it (Gianni D’Angelo), francesco.palmieri@unina.it (Francesco Palmieri), rampone@unisannio.it (Salvatore Rampone), massimo.ficco@unina2.it (Massimo Ficco)
Preprint submitted to Journal of Applied Soft Computing
March 7, 2015
1. Introduction
Together with the astonishing deployment of network technologies and the consequent increase in traffic volumes, the importance of network misuse detection and prevention frameworks is proportionally growing in almost all modern organizations, in order to protect the most strategic resources from both external and internal threats. In this scenario, the task of identifying and categorizing network anomalies essentially consists in determining all the circumstances in which the network traffic pattern deviates from its normal behavior, which in turn depends on multiple elements and considerations associated with the activities taking place every day on the network.
However, the main difficulty related to a really effective detection is associated with the continuous evolution of anomalous phenomena, due to the emergence of new previously unknown
attacks, so that achieving a precise, stable and exhaustive definition of anomalous behavior, encompassing all the possible hostile events that can occur on a real network, is practically impossible. Nevertheless, detection systems must not be limited by the a priori knowledge of a specific
set of anomalous traffic templates or be conditioned by a large number of complex operating
parameters (e.g., traffic statistic distributions and alarm thresholds), and hence have to be able to
recognize and directly classify any previously unknown phenomenon that can be experienced
on the network. As a consequence, the ultimate goal of modern anomaly detection systems is
behaving in an adaptive way in order to flag, in “real time”, all the deviations from a model that is
built dynamically and in an incremental way by capturing the concept of normality in network
operations according to a learning by example strategy. These new systems, overcoming the
known limitations of the more traditional ones based on pattern detection and statistical analysis,
are empowered by flexible machine learning techniques.
Accordingly, we propose a novel anomaly detection strategy, particularly suitable for IP networks, based on supervised machine learning, and more specifically on a batch relevance-based
fuzzified learning algorithm known as U-BRAIN.
This strategy aims at understanding the processes that originate the traffic data, by deriving
the specific laws and rules governing it, in order to reliably model its underlying dynamics. This
is accomplished by performing inductive inference (or better, generalization) on traffic observations, based on some empirical pre-classified “experiential” (training) data representing incomplete information about the occurrence of specific phenomena describing normal or anomalous
network activities. In addition, the adopted learning scheme allows a certain degree of uncertainty in the whole detection process, making the resulting framework more solid and flexible in managing the large variety and complexity of real traffic phenomena.
We evaluated the effectiveness of the presented detection framework within a widely known
test case scenario, in order to make the achieved results comparable with those of other proposals available in the literature. These results demonstrated a quite satisfactory identification accuracy, placing our strategy among the most promising state-of-the-art proposals.
2. Background and Related Work
Network anomaly detection has gained great attention in security research, with about 40 years of experience available in the literature. The first approach to automatic detection was proposed in [1], followed by a large number of contributions exploring many other solutions and
proposals [2][3][4].
The earliest and most traditional detection approaches, mainly aiming at spotting intrusion activities, work by matching specific traffic patterns, gathered from the packets under observation, against a list of predefined signatures, each associated with a known attack or hostile/anomalous behavior. Some well-known examples are SNORT [5] and BRO [6]. While ensuring very
good response times and a quite satisfactory degree of effectiveness in case of previously known
menaces, these approaches are almost totally clueless in the presence of new (zero-day) attacks,
or when, due to minor modifications in its behavior, an already known attack does not closely
match the associated signatures. In both cases, new up-to-date signatures must be generated and added to the list as soon as more detailed information about the hostile behavior becomes available. Unfortunately, this implies human intervention, and hence too much time to ensure a real-time response.
On the other hand, other very common detection systems are based on a statistical idealization of the network behavior and process the traffic observations through statistical-analysis techniques, flagging the outliers as anomalous events. The most significant examples are NIDES (Next-Generation Intrusion Detection Expert System) [7], a hybrid system providing a statistical analysis engine, and SPADE (Statistical Packet Anomaly Detection Engine) [8], a statistical detection system based on determining anomaly scores, available as a plug-in for SNORT. However, while straightforward and robust in their formulation (they do not require prior knowledge of the security menaces, nor do they need packet inspection), statistical detection approaches may prove too simplistic in their basic assumptions and hence scarcely reliable in their results. In fact, being based only on the statistical properties of the involved traffic flows, these approaches are too sensitive to the normality assumption, and really effective only against specific phenomena that imply
significant variations in the statistical properties of the network traffic (Volume-based Attacks).
More precisely, such detection techniques have to be based on extremely accurate statistical distributions that describe the traffic under observation. Unfortunately, modeling network traffic, typically characterized by an inhomogeneous usage pattern, by using only pure statistical methods may turn out to be a poor choice in terms of real effectiveness. Furthermore, these solutions cannot be aware of hostile activities that only affect the packet contents (such as stack smashing or other kinds of malicious code exploiting system/service vulnerabilities) or are explicitly conceived to be indistinguishable from regular user activities (e.g., low-rate DoS attacks).
As an alternative that may prove extremely effective in coping with the above challenges, machine learning provides a fully automated detection system with the ability to learn by example which anomalous events occur in the observed traffic, also improving its performance over time with experience, as more examples (or training data) describing normal or anomalous behaviors are provided in its knowledge base. In this way, the detection function,
which is essentially a binary classifier working on the normal and anomalous traffic classes, is inferred from the aforementioned training data. Such data consist of a set of pre-classified traffic
samples. In supervised learning, each sample is a pair consisting of an input object (typically a
vector of traffic features) and a desired output value (the class value) also called the supervisory
signal. The inferred classifier should assign the right output class value to any valid input sample.
This implies that the learning paradigm should be reasonably capable of generalizing from the knowledge contained in the training data to previously unseen situations.
The use of machine learning in anomaly detection, with the development of generalization capabilities from past experiences for classifying future data as normal or anomalous, has been exploited in many proposals [9], based on neural networks [10][11], SVMs [12] and data mining techniques [13][14]. These approaches can be further subdivided into generative or discriminative ones. A typical generative approach (e.g., [15]) constructs a model by starting only from normal
training examples, then evaluates test instances in order to assess how well they fit such a
model. As an example, the ideas presented in [16] explore different machine learning techniques
to construct detection models from past behavior. On the other hand, discriminative techniques
(e.g., [12]), attempt to understand the difference between the normal and anomalous instance
classes. A learning approach for reproducing packet level alerts for anomaly detection at the
flow level has been presented in [17]. Several approaches rely on clustering techniques, such
as ADWICE [18], performing unsupervised detection based on a fast incremental clustering
technique. K-Means+ID3 [19], instead, is a supervised learning approach combining k-Means
clustering and the ID3 decision trees in order to classify anomalous and normal network activities. Regarding the use of tree-based structures, the DGSOT+SVM [20] scheme is an iterative
approach leveraging the dynamic generation of self-organizing hierarchical trees together with
SVMs to be trained on the tree nodes, where support vectors are used as the basic knowledge to
control the tree growth.
Other approaches introduced the concept of uncertainty within the learning strategy. Notably,
a detection solution based on weighted fuzzy matching over frequent episode rules is presented
in [21]. In addition, the approach presented in [22] applied fuzzy logic-based rules to audit data
in order to classify it as normal or anomalous.
Starting from the above experience, our machine-learning based detection solution combines
the strength of rule-based systems and the flexibility of fuzzy logic to reliably understand the
fundamental properties of the network traffic in order to rapidly flag the occurrence of abnormal
events.
3. A fuzzy rule-based detection strategy
The basic idea is building a formal model that expresses the relations between all the fundamental variables involved in the traffic dynamics, and hence “understands” the notions of normal and anomalous behavior, from the available experience, by learning the characteristics of the corresponding traffic classes and expressing them into laws and rules that are general enough to determine whether any unseen instance belongs to one class or the other. Obviously, the overall detection quality strongly depends on the accuracy and generality of the above model, and hence on the completeness of the training data on which its “knowledge” about normal and anomalous phenomena is built.
Starting from the previous considerations, we modeled our anomaly detection strategy according to a supervised machine learning scheme, specifically conceived for learning disjunctive
normal form (DNF) [23] boolean formulas from partial truth tables, possibly with uncertain values or missing information bits. Such formulas, determined by using the U-BRAIN (Uncertainty-managing Batch Relevance-based Artificial Intelligence) algorithm [24], describe the correlation
rules between attribute conditions and class labels, modeling the normal and anomalous traffic
profiles through boolean predicates on the observation attributes. Since the U-BRAIN algorithm
works on boolean data, the above attribute values have to be quantized, in order to represent them
in rules by using binary strings.
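Since the quantization step is only described at a high level here, the following is a minimal sketch of how a continuous traffic attribute could be mapped to a binary string of boolean literals; the equal-width binning, bit width, and value ranges are illustrative assumptions, not the encoding actually used in the paper.

```python
# Hypothetical sketch: quantize a continuous traffic attribute into a
# fixed-width binary string, so that each bit can appear as a boolean
# literal in U-BRAIN rules. Equal-width binning is an illustrative
# assumption only.

def quantize(value, lo, hi, bits=3):
    """Map a value in [lo, hi] to an equal-width bin index, as a bit string."""
    n_bins = 2 ** bits
    clamped = max(lo, min(hi, value))            # clamp out-of-range values
    idx = min(int((clamped - lo) / (hi - lo) * n_bins), n_bins - 1)
    return format(idx, "0{}b".format(bits))

# Example: encode a packet size in [0, 1500] bytes with 3 bits (8 bins).
print(quantize(64, 0, 1500))    # -> 000 (small packet, lowest bin)
print(quantize(1400, 0, 1500))  # -> 111 (near-MTU packet, highest bin)
```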
The aggregation of all the determined correlation rules defines the behavior of an inductively
learned classifier that is able to analyze the deviation from normal traffic profiles and the proximity to the known anomalous ones, as determined from the historical observations available in the
training set. Clearly, by relying on a supervised approach where both the phenomena of interest are known in the training set, such a classifier should potentially be able to achieve better detection performance with respect to semi-supervised and unsupervised solutions, since more information
is available in its training knowledge base. However, the imbalance in the amount of training
data representing the two classes, together with the presence of some noise in pre-classified samples, can significantly affect the final accuracy of the detection results. In particular, we have to
consider that anomalies are rather infrequent when compared with the whole amount of observations, so the available data used to model the concept of anomaly may be insufficient or lack representativeness. This may imply some accuracy problems (usually resulting in false alarms
or missed detections). We faced these problems by improving the flexibility of rules through the
introduction of fuzzy logic. In fact, allowing a certain degree of uncertainty at the binary classifier level makes the resulting detection strategy significantly stronger both in terms of flexibility
and ability to cope with the variety and complexity of network traffic phenomena. In this way the
detection approach considers a wider range of implications when making its decisions, resulting
in a more effective and powerful approach in the presence of incomplete training data.
3.1. Initial knowledge construction
A fundamental preliminary task in a supervised network anomaly detection process is the initial “knowledge construction”, where the most significant network utilization patterns, describing the fundamental traffic dynamics, should be expressed in terms of specific features gathered from
traffic observations. This is a very slow and complex activity that implies collecting and preclassifying network traffic observations over a sufficiently long period of time, in order to build a
quite complete training set, reliably describing the historical knowledge of the network behavior
or “baseline”. The training data will thus consist of a parametric network traffic model that can
be viewed as an approximation of reality, where a limited set of parameters is available in form
of the most discriminant traffic features that are able to describe the traffic behavior. Such traffic
model can be realized by looking at different dimensions of observation such as inter-arrival times
of packet transmission and reception events, together with information about packet sizes, flags,
source and destination addresses, ports/services involved, by also attempting to use the memory of the recent past to identify persistent events like end-to-end connections (traffic flows). These
observations are represented as a time series, that is, a sequence of scalar samples measured
at uniformly spaced time intervals. More formally, the training set T = (t1, t2, . . . , tN) consists of N samples, each structured as a d-dimensional vector of traffic features ti = (f1, f2, . . . , fd). The training set is joined with a one-dimensional feature set C = (c1, c2, . . . , cN) associated with the supervisory signal, where ci represents the class (i.e., anomalous, not anomalous) to which each sample ti belongs. In this way, the knowledge base of the classifier is described by a set of rules that will be “inferred” from the aforementioned collection of pre-classified training samples, where p of them are associated with normal traffic observations and q with anomalous ones, with p ≫ q.
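The training-set layout described above can be pictured with a toy fragment (the feature names and values are invented for illustration):

```python
# Toy fragment of the training-set layout: N samples, each a d-dimensional
# feature vector t_i, paired with a supervisory class label c_i. Feature
# names and values are invented for illustration.
samples = [
    # (duration, src_bytes, srv_count)
    (0.2, 181, 9),
    (0.1, 239, 19),
    (0.0, 0, 511),       # flood-like observation
]
labels = ["normal", "normal", "anomalous"]

p = labels.count("normal")     # normal observations
q = labels.count("anomalous")  # anomalous observations
assert p > q                   # class imbalance typical of anomaly detection
print(p, q)                    # -> 2 1
```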
3.2. The U-BRAIN algorithm
U-BRAIN is a machine learning algorithm able to infer explicitly the laws that govern a
process from examples. In its latest version, U-BRAIN can also act on incomplete data. Originally, the algorithm, named BRAIN [25], was conceived for recognizing the borders of coding
regions in human DNA. The algorithm was based on Michalski’s STAR technique [26], on
the candidate-elimination method introduced by Mitchell [27], and on the work of Haussler [28].
The BRAIN algorithm was later extended [24] to treat data with missing bits by using fuzzy sets.
The resulting U-BRAIN algorithm keeps the computational complexity of the original one. The great versatility that characterizes it makes U-BRAIN applicable in every industrial and scientific
field in which there are data to be analyzed, such as the financial world, the aviation industry, and the biomedical domain. U-BRAIN models a process by a formula able to forecast the
as well as the biomedical scope. U-BRAIN models a process by a formula able to forecast the
future process behavior. Specifically, the algorithm builds Boolean formulae F of n literals xi
(i = {1, . . . , n}) in DNF form, made up of disjunctions of conjunctive terms, starting from a set T
of training data. The data (instances) in T are divided into two classes, named positive and negative, respectively modeled by the n-sized vectors u~i with i = {1, . . . , p} and v~j with j = {1, . . . , q},
representing the instances to be classified. Each element u_{i,k} or v_{j,k}, with k = {1, . . . , n}, can assume values belonging to the set {0, 1, 1/2}, respectively associated with positive, negative and uncertain
instances. The conjunctive terms of the formula are carried out in an iterative way by two nested loops (see the Algorithm 1 schema). The inner cycle refers to the selection of the literals of each
formula term, while the outer one is devoted to the terms themselves. In order to build a formula
consistent with the given data, U-BRAIN compares each given positive instance with each negative one and builds a family of fuzzy sets of conditions that must be satisfied by at least one of
the positive instances and violated by all the negative ones formally defined as:
    S_{i,j} = { x_k : u_{i,k} > v_{j,k} ∨ u_{i,k} = v_{j,k} = 1/2 } ∪ { x̄_k : u_{i,k} < v_{j,k} ∨ u_{i,k} = v_{j,k} = 1/2 }        (1)
In other words, the k-th literal belongs to the set S_{i,j} if the elements in position k, belonging to the i-th positive instance (u_{i,k}) and to the j-th negative instance (v_{j,k}), are different or both equal to 1/2.
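Under the definition in Eq. (1), the construction of one condition set can be sketched as follows (a simplified crisp illustration: the fuzzy membership grades of [24] are omitted, and negated literals are represented as tagged tuples):

```python
# Simplified crisp sketch of Eq. (1): the positive literal x_k (tagged
# "pos") enters S_ij when u_ik > v_jk or both bits equal 1/2; the negated
# literal (tagged "neg") enters when u_ik < v_jk or both equal 1/2.
def build_S(u, v):
    s = set()
    for k, (uk, vk) in enumerate(zip(u, v)):
        both_half = (uk == 0.5 and vk == 0.5)
        if uk > vk or both_half:
            s.add(("pos", k))
        if uk < vk or both_half:
            s.add(("neg", k))
    return s

u = [1, 0, 0.5]  # positive instance
v = [0, 0, 0.5]  # negative instance
print(sorted(build_S(u, v)))  # -> [('neg', 2), ('pos', 0), ('pos', 2)]
```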
Starting from these sets S_{i,j}, the algorithm determines, for each literal x_k belonging to them, a set of coefficients R_{i,j}, R_i and R, called relevances, forming a probability distribution, where:

    R_{i,j}(x_k) = μ_{i,j}(x_k) / #(S_{i,j})

    R_i(x_k) = (1/q) Σ_{j=1..q} R_{i,j}(x_k)        (2)

    R(x_k) = (1/p) Σ_{i=1..p} R_i(x_k)

where μ_{i,j} is the membership function of the set S_{i,j} (see [24]) and #(s) is the fuzzy cardinality of a subset s of a set S, defined as #(s) = Σ_{x∈S} μ_s(x).
This allows the selection of the literals according to a maximum-probability greedy criterion (the literal having the maximum relevance value is selected). The goal of such a greedy selection is simultaneously covering the maximum number of positive instances with the minimum possible number of literals. Each time a literal is chosen, the condition sets S_{i,j}, and the corresponding probability
distribution, are updated by erasing the sets containing the literal itself. The inner cycle is then
repeated and the term is completed when there are no more elements in the sets of conditions.
Then the new term is added to the formula and, in the outer cycle, the positive instances satisfying the term are erased. Then, the inner cycle starts again on the remaining data. The algorithm
ends when there are no more data to treat. The algorithm has two biases: the instance set must be self-consistent, which means that an instance cannot belong to both classes, and no duplicated
instances are allowed. In fact, it may happen that the initial set of training instances contains
redundant information. This may be due to repeated instances present from the beginning of the
process or resulting from a reduction step, whose task is limiting the presence of missing bits,
by recovering them where possible. Such redundancy is automatically removed by keeping each instance just once and deleting all the repetitions, in order to avoid consistency violations that can
halt the process.
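The repetition-deletion step amounts to keeping each instance exactly once; a minimal sketch:

```python
# Sketch of the repetition-deletion step: keep each training instance
# exactly once (first occurrence wins) so that duplicates cannot later
# trigger a consistency violation and halt the process.
def dedup(instances):
    seen, unique = set(), []
    for inst in instances:
        key = tuple(inst)
        if key not in seen:
            seen.add(key)
            unique.append(inst)
    return unique

print(dedup([[1, 0, 0.5], [1, 0, 0.5], [0, 1, 1]]))
# -> [[1, 0, 0.5], [0, 1, 1]]
```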
Algorithm 1 The U-BRAIN schema
Require: p > 0, q > 0, T = {u_1, . . . , u_p, v_1, . . . , v_q}
 1: F ← ∅                                  {Initialize F}
 2: while there are positive instances u_i ∈ T do
 3:   Uncertainty Reduction
 4:   Repetition Deletion
 5:   m ← ∅                                {Initialize term m}
 6:   Build S_{i,j} sets from T
 7:   while there are elements in S_{i,j} do
 8:     Compute R_{i,j} for {x_k, x̄_k}, k = {1, . . . , n}
 9:     Compute R_i for {x_k, x̄_k}, k = {1, . . . , n}
10:     Compute R for {x_k, x̄_k}, k = {1, . . . , n}
11:     Choose literal x with max relevance R
12:     m ← m ∪ {x}                        {Update term m}
13:     Update S_{i,j} sets
14:   end while
15:   F ← F ∪ {m}                          {Add term m to F}
16:   Update positive instances u_i ∈ T, i = {1, . . . , p}
17:   Update negative instances v_j ∈ T, j = {1, . . . , q}
18:   Check consistency
19: end while
20: return F
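To make the relevance-driven greedy step of the inner loop concrete, the following simplified sketch computes crisp relevances (taking μ = 1 for every literal in a set, so the fuzzy cardinality #(S) reduces to the plain set size) and picks the literal with maximum relevance, erasing the condition sets it covers; it is an illustration of the selection rule, not the full algorithm.

```python
from collections import defaultdict

# Crisp simplification of Eq. (2): with mu = 1 for every literal,
# R(x) = (1/(p*q)) * sum over all (i,j) pairs of 1/|S_ij| for each
# literal x contained in S_ij.
def relevances(S_sets, p, q):
    R = defaultdict(float)
    for s in S_sets.values():
        for lit in s:
            R[lit] += 1.0 / (len(s) * p * q)
    return R

def greedy_pick(S_sets, p, q):
    R = relevances(S_sets, p, q)
    best = max(R, key=R.get)  # literal with maximum relevance
    # Erase every condition set already covered by the chosen literal.
    remaining = {key: s for key, s in S_sets.items() if best not in s}
    return best, remaining

# Toy run: one positive instance (p=1), two negatives (q=2). "x1"
# appears in both sets, so it gets the highest relevance.
S = {(0, 0): {"x1", "x2"}, (0, 1): {"x1"}}
best, remaining = greedy_pick(S, p=1, q=2)
print(best, remaining)  # -> x1 {}
```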
3.3. U-BRAIN Algorithm Complexity and its Parallel Implementation
The overall algorithm time complexity is O(n^5) and its space complexity is O(n^3) for large n
(where n is the number of variables). So, processing large amounts of data with a single computer
may be prohibitive in terms of both space and time needed. Of course, such a complexity refers only to the training phase, where the set of classification rules is initially built from the training data. Once these rules are available, the detection activity is extremely simple and fast, and
hence can be performed in real-time by operating on-line on live network traffic, by potentially
including the resulting classifier within a next-generation firewall or IDS system providing limited processing capabilities. The whole operating scenario is depicted in Figure 1, where all the off-line activities, associated with the U-BRAIN framework, are reported in a grey box, whereas the other ones can be managed on-line within the context of a real-time detection system.
However, in order to keep the classifier up-to-date with the evolving network traffic dynamics,
periodical re-training is necessary. Due to the above computing and space demands in the training phase, such activity can be performed off-line and independently from the running classifier,
by possibly distributing its load among multiple cooperating runtime machines that provide High-Performance Computing (HPC) capabilities. Accordingly, the main goal
of a recent work [29] has been the construction of a new version of the U-BRAIN algorithm
relying on HPC capabilities, in order to overcome the limits related to its high computational
Figure 1: Detection system architecture
complexity. The new version, using more clever mathematical and programming solutions, such as dynamic programming and a new relevance representation, allows the algorithm implementation on parallel computing systems with reduced communication costs. To this aim, an effective
distributed computing and storage architecture has been specifically designed according to a principle of high temporal and spatial locality [30]. The algorithm has been implemented by
using a Single Program Multiple Data (SPMD) [31] technique together with a Message-Passing
Programming paradigm. Overall, the results obtained on standard data sets show that the parallel
version is up to 30 times faster than the serial one. Moreover, by increasing the problem size at a constant number of processors, the average speed-up further increases. Recently, the U-BRAIN parallel version has been used in the diagnosis of aerospace structure defects. In such a context, the algorithm proved to be effective as a defect classifier in non-destructive testing [32].
4. Performance Evaluation
In evaluating the performance of the proposed detection strategy, our main aim was making our results comparable with alternative approaches already available in the literature. Unfortunately, this is not immediate, given the lack of a generally recognized benchmark for assessing and validating
anomaly detection solutions. In fact, most of the publicly available data sets and taxonomies that
can be used for benchmarking anomaly detection systems are generally known to be error-prone
and of limited significance in their results.
However, while strongly criticized [33] as being excessively “artificial” for the usage of
synthetic simulated background data not containing the noise that characterizes real traffic, the
KDD’99 dataset [34], originally produced from DARPA 98 Lincoln Labs, is widely recognized
as the most used publicly available labeled dataset for comparing network anomaly detection
systems, as it can be appreciated from a huge number of research works. It starts from several
weeks of raw traffic data collected within a local-area network (LAN) simulating a typical U.S.
Air Force LAN, in which 24 attack types have been identified and labeled, and all the data flow
activity has been summarized into connections by using the Bro IDS [6], which extracted 41 distinct features for each connection. Observations are partitioned into two subsets representing
respectively the training and the testing set, where the raw training data yielded about 5 million
connection records and the test data around 2 million. The test set also includes specific
attack types not present in the training data (that can be considered as zero-day attacks), in order to assess the classifier’s generalization capability. In order to avoid some of the known consistency
problems of the KDD’99 data set [35] we used its preprocessed version known as NSL-KDD [36]
which removed inconsistencies and redundancy in the training set, so that the performance of the classifiers will not be perturbed or biased by methods that exhibit better detection rates
on frequent records. Although the NSL-KDD data set is not a perfect representative of a real
network scenario (it still suffers from some of the weaknesses reported in [37]), we believe it can
be considered the best available compromise for comparing/benchmarking different anomaly detection methods.
4.1. Feature Reduction
First of all, we are interested in investigating the relevance of the 41 features available in
KDD’99 samples with respect to dataset labels. In order to ensure a satisfactory degree of performance, mainly in terms of responsiveness, the number of features constituting each sample
should be kept as small as possible, by diversifying the features and making them discriminative as well as mutually independent. This is due to the fact that a group of highly independent
features is more effective in discriminating between the two classes than any of them taken individually.
Several techniques can be used for determining the best and most relevant features that can
be used for building robust learning models. Their goal is significantly improving the detection
performance by eliminating irrelevant and redundant features from the available data and hence
reducing the overall problem dimension. This transforms our data representation into a shorter, more compact and, hopefully, more predictive one, keeping only the really useful features from the original feature set.
The simplest approach for selecting the really necessary features is trying to extract the most specific properties of the traffic under observation, by properly mining the pre-classified feature vectors in the training set. This implies searching the whole feature space, one component at a time,
for the subset of features that is most likely to best characterize the normal and anomalous traffic
classes, by ranking each individual feature according to the aforementioned considerations and
identifying and removing useless features that may adversely affect the detection performance.
4.1.1. Ranking
Ranking may be based on several criteria that are able to quantify the relevance of a given
feature, and hence its role in determining the class label. The most used criterion for this purpose is information gain, based on the feature’s entropy, and defined as:
    IG(f_k, c_i) = P(f_k, c_i) log [ P(f_k, c_i) / (P(f_k) P(c_i)) ] + P(f̄_k, c_i) log [ P(f̄_k, c_i) / (P(f̄_k) P(c_i)) ]        (3)
where the P(·) are occurrence probabilities in S and Σ_i IG(f_k, c_i) = H(C) − H(C|S). When the feature f_k is relevant, and hence necessary for an accurate detection of class c_i, the associated entropies will assume values close to 0 and thus the information gain will consequently be close to 1.
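As a numerical illustration of the information-gain term in Eq. (3), the contribution can be computed from toy joint probabilities (the values below are invented, not KDD’99 estimates):

```python
import math

# Sketch of the information-gain term in Eq. (3) from toy joint
# probabilities; real values would be estimated from training counts.
def info_gain(p_f_c, p_f, p_c, base=2):
    """IG contribution of feature f for class c.

    p_f_c : P(f, c)  joint probability of feature present and class c
    p_f   : P(f)     marginal probability of the feature
    p_c   : P(c)     marginal probability of the class
    """
    p_nf_c = p_c - p_f_c          # P(not f, c)
    p_nf = 1.0 - p_f              # P(not f)
    ig = 0.0
    if p_f_c > 0:
        ig += p_f_c * math.log(p_f_c / (p_f * p_c), base)
    if p_nf_c > 0:
        ig += p_nf_c * math.log(p_nf_c / (p_nf * p_c), base)
    return ig

# Perfectly predictive feature: f occurs exactly when class c does.
print(info_gain(p_f_c=0.5, p_f=0.5, p_c=0.5))  # -> 0.5
```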
Also the well-known χ² test can be used as a good quality metric to evaluate the predictive properties of a specific feature, by computing the χ² statistic for each combination of feature f_k and class c_i:
    χ²(f_k, c_i) = N (P(f_k, c_i) P(f̄_k, c̄_i) − P(f_k, c̄_i) P(f̄_k, c_i))² / ( P(f_k) P(f̄_k) P(c_i) P(c̄_i) )        (4)
where N is the total number of observations in the training set.
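Since each P(·) in Eq. (4) is a count divided by N, the statistic can equivalently be evaluated from a 2×2 contingency table of raw counts; a small sketch with toy numbers (not KDD’99 counts):

```python
# Sketch of Eq. (4): chi-square statistic for one (feature, class) pair,
# computed from a 2x2 contingency table of raw counts.
#   a = #(f, c), b = #(f, not c), c_ = #(not f, c), d = #(not f, not c)
def chi_square(a, b, c_, d):
    n = a + b + c_ + d
    num = n * (a * d - b * c_) ** 2
    den = (a + b) * (c_ + d) * (a + c_) * (b + d)
    return num / den if den else 0.0

# Feature present mostly together with the anomalous class -> large value.
print(chi_square(a=90, b=10, c_=10, d=90))  # -> 128.0
```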
We obtained the same (and hence highly consistent) results from both the Information Gain and χ² tests on the whole training set, resulting in the ranking reported in Table 1.
Table 1: Features ranking results

Order  Ranking Score  Feature                        Order  Ranking Score  Feature
1      0.81620015     src bytes                      21     0.08826684     dst host srv rerror rate
2      0.6715649      service                        22     0.06263911     protocol type
3      0.63304893     dst bytes                      23     0.05674069     rerror rate
4      0.51938822     flag                           24     0.05217739     dst host rerror rate
5      0.51868903     diff srv rate                  25     0.05156684     srv rerror rate
6      0.50984907     same srv rate                  26     0.03672469     duration
7      0.4759185      dst host srv count             27     0.01155446     hot
8      0.43821385     dst host same srv rate         28     0.00960996     wrong fragment
9      0.41091066     dst host diff srv rate         29     0.00650408     num compromised
10     0.40596111     dst host serror rate           30     0.00371697     num root
11     0.40475154     logged in                      31     0.00215482     num access files
12     0.39806695     dst host srv serror rate       32     0.00116837     is guest login
13     0.39273998     serror rate                    33     0.00091826     num file creations
14     0.38358863     count                          34     0.00051404     su attempted
15     0.37912493     srv serror rate                35     0.00032406     root shell
16     0.27083688     dst host srv diff host rate    36     0.00010426     num shells
17     0.19803814     dst host count                 37     0.00006037     num failed logins
18     0.18887561     dst host same src port rate    38     0.00003811     land
19     0.14155352     srv diff host rate             39     0.00000717     is host login
20     0.09424551     srv count                      40     0              num outbound cmds
                                                     41     0              urgent
4.1.2. Feature Selection
The top k features (a given percentage of the original set) can be chosen after ordering the
whole feature set according to the ranking score. However, feature quality may vary significantly,
so performing feature selection based only on the ranking order can potentially introduce some
low-discriminating features into the set. Alternatively, the features can be selected by introducing
a threshold on the ranking score (e.g., 0.01 for Information Gain) and including only the features
whose score exceeds it. However, there is a more time-consuming but more effective alternative:
exploring the predictive power of groups of features by using tree-based classification algorithms
such as C4.5 [38]. This may result in an automatic feature selection strategy based on an initial
mining phase operating on a sufficiently large quantity of pre-classified data that, starting from a
reasonably complete set of features, removes the least discriminating ones one by one, in ranking
order, while verifying that the classification accuracy does not fall below a predetermined value.
Accordingly, we progressively removed one feature at a time in successive J48-based classifications, as long as the relative absolute error ε remains under the 1% threshold. ε has been measured
as the total absolute error relative to what the error would have been if each prediction o_i had
been the average \bar{t} of the target values t_i, so that:

\epsilon = \frac{\sum_{i=1}^{n} |t_i − o_i|}{\sum_{i=1}^{n} |t_i − \bar{t}|}    (5)
The whole selection process results in only 6 necessary features (the first six in Table 1), as can
be seen from the chart in Figure 2, which implies a significant reduction in the overall space and
runtime complexity of both the training and testing phases.
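The error measure and the elimination loop can be sketched as follows; `evaluate` (standing in for a J48 training/evaluation run) and the surrounding scaffolding are assumptions for illustration, not the authors' exact procedure:

```python
def relative_absolute_error(targets, predictions):
    """Relative absolute error (Eq. 5): total absolute prediction error,
    normalised by the error of always predicting the target mean."""
    t_bar = sum(targets) / len(targets)
    num = sum(abs(t - o) for t, o in zip(targets, predictions))
    den = sum(abs(t - t_bar) for t in targets)
    return num / den

def backward_select(features_by_rank, evaluate, threshold=0.01):
    """Drop the lowest-ranked feature while the resulting error stays
    under the threshold; `evaluate(features)` is assumed to train a
    classifier (e.g. J48) on that subset and return its relative
    absolute error on held-out data."""
    kept = list(features_by_rank)            # best-ranked feature first
    while len(kept) > 1 and evaluate(kept[:-1]) < threshold:
        kept.pop()                           # remove least discriminating
    return kept
```

With a suitable `evaluate`, the loop stops at the smallest feature subset whose error still satisfies the acceptance threshold, mirroring the six-feature result above.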
Figure 2: Feature selection results (relative absolute error vs. number of features, with the 1% acceptance threshold marked)
Quantization has been performed on all six attributes (the first two are nominal/discrete,
the others numeric/continuous) in order to transform them into binary data strings, as required by
the U-BRAIN algorithm. Since the service and flag attributes comprise 70 and 11 distinct values
respectively, we used 7 bits to represent the former and 4 bits for the latter. Analogously,
by considering their minimum and maximum values, we have chosen 12 bits for src bytes and
dst bytes, while 8 bits were devoted to same srv rate and diff srv rate. So, 52 bits are necessary
to represent each instance, where 51 bits are dedicated to the feature space and one bit is required
for the label representing the target class (0=normal, 1=anomalous). Each connection has been
labeled as either associated to normal activity or as a network anomaly, regardless of the specific
attack type reported in the KDD dataset. The individual instance layout is presented in Table 2.
Table 2: Generic instance layout

Attribute      Description                              Bits
service        destination network service              7
flags          connection attributes and error status   4
src bytes      bytes from source to destination         12
dst bytes      bytes from destination to source         12
same srv rate  % of connections to the same service     8
diff srv rate  % of connections to different services   8
class          label                                    1
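As an illustration of this packing, the sketch below (a hypothetical helper, assuming each numeric field has already been quantized into its bit range) concatenates the seven fields of Table 2 into a 52-bit string:

```python
def encode_instance(service_idx, flag_idx, src_bytes, dst_bytes,
                    same_srv_rate, diff_srv_rate, label):
    """Pack the six selected features plus the class label into the
    52-bit binary string layout of Table 2 (widths 7+4+12+12+8+8+1).
    Each argument is assumed to be an integer already scaled into the
    range representable by its field."""
    fields = [(service_idx, 7), (flag_idx, 4),
              (src_bytes, 12), (dst_bytes, 12),
              (same_srv_rate, 8), (diff_srv_rate, 8), (label, 1)]
    bits = ""
    for value, width in fields:
        assert 0 <= value < (1 << width), "value out of field range"
        bits += format(value, f"0{width}b")  # zero-padded binary field
    return bits
```

The 51 feature bits come first and the trailing bit carries the class label (0=normal, 1=anomalous), matching the layout of Table 2.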
4.2. Training and Testing
In order to assess the potential of the proposed strategy for detecting network anomalies,
we provided a proof of concept using U-BRAIN to find rules and formulae for distinguishing
between anomalous (intrusions, attacks) and normal connections/events. U-BRAIN has been
properly trained on the NSL-KDD training set and its real effectiveness has been subsequently
verified on the corresponding testing set. However, due to the U-BRAIN consistency bias, a data
pre-processing/cleaning step has been necessary to remove the inconsistencies in both training
and testing data. In particular, all the new duplicates introduced by feature reduction (the instances passed from 41 to 6 attributes) have been removed. The resulting dataset is composed
of 21178 training instances (8757 positive and 12421 negative) and 6032 testing instances (2116
positive and 3916 negative), as reported in Table 3.
Table 3: The NSL-KDD data set after preprocessing

          Training Set                       Testing Set
Positives   Negatives   Total     Positives   Negatives   Total
8757        12421       21178     2116        3916        6032
First of all, 10-fold cross-validation has been performed on the entire training set, in order to
ensure the consistency of the results and guarantee better reliability for the whole process. Accordingly, all
the training data have been randomly divided into 10 disjoint subsets (folds), each containing
approximately the same number of instances (see Table 4).
In each experiment, nine folds have been used as training data, while the remaining fold
is used for validation. This process has been repeated 10 times, once for each different choice of the
Table 4: 10-fold cross-validation instance breakdown

        Training Data                        Test Data
Test    Positives   Negatives   Total        Positives   Negatives   Total
1       7843        11218       19061        914         1203        2117
2       7870        11191       19061        887         1230        2117
3       7831        11230       19061        926         1191        2117
4       7865        11196       19061        892         1225        2117
5       7883        11178       19061        874         1243        2117
6       7840        11221       19061        917         1200        2117
7       7852        11209       19061        905         1212        2117
8       7886        11175       19061        871         1246        2117
9       7972        11089       19061        785         1332        2117
10      7971        11082       19053        786         1339        2125
validation fold. The results are reported in Table 5, summarizing how well the model assigns the
correct class value to the test instances.
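The fold construction described above can be sketched as follows (a minimal illustration under assumed seeding; the actual partition in Table 4 was produced by the authors' own tooling, so the fold sizes here only approximate it):

```python
import random

def ten_fold_splits(instances, k=10, seed=0):
    """Randomly partition the instances into k disjoint folds of
    (approximately) equal size; each fold serves once as validation
    data while the remaining k-1 folds are used for training."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # disjoint index folds
    for held_out in range(k):
        train = [instances[i]
                 for f in folds[:held_out] + folds[held_out + 1:]
                 for i in f]
        valid = [instances[i] for i in folds[held_out]]
        yield train, valid
```

With the 21178 preprocessed training instances this yields ten train/validation pairs whose validation parts are disjoint and jointly cover the whole set, as in the experiments above.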
Table 5: 10-fold cross-validation classification performance

        Confusion matrix                                                                    F1        Correlation   ROC     Error
Test    TP     FP    TN     FN    Accuracy   Sensitivity   Specificity   Precision   Measure   Coefficient   Area    Ratio
1       905    9     1194   9     0.991      0.990         0.993         0.990       0.990     0.983         0.991   0.009
2       880    14    1216   7     0.990      0.992         0.989         0.984       0.988     0.980         0.990   0.010
3       913    14    1177   13    0.987      0.986         0.988         0.985       0.985     0.974         0.987   0.013
4       880    19    1206   12    0.985      0.987         0.984         0.979       0.983     0.970         0.986   0.015
5       866    18    1225   8     0.988      0.991         0.986         0.980       0.985     0.975         0.988   0.012
6       905    11    1189   12    0.989      0.987         0.991         0.988       0.987     0.978         0.989   0.011
7       894    23    1189   11    0.984      0.988         0.981         0.975       0.981     0.967         0.984   0.016
8       856    21    1225   15    0.983      0.983         0.983         0.976       0.979     0.965         0.983   0.017
9       775    23    1309   10    0.984      0.987         0.983         0.971       0.979     0.967         0.985   0.016
10      778    19    1320   8     0.987      0.990         0.986         0.976       0.983     0.973         0.988   0.013
Mean    865    17    1225   11    0.987      0.988         0.986         0.980       0.984     0.973         0.987   0.013
In detail, the first four columns after the test number represent the confusion matrix in
terms of True Positives (TP: correct detections), False Positives (FP: false alarms), True
Negatives (TN: correct silence), and False Negatives (FN: missed detections). The following
columns report respectively:
• Accuracy = (TP+TN)/(TP+TN+FP+FN), that is the portion of correctly classified instances.
• Sensitivity = TP/(TP+FN), also called True Positive Rate, that is the portion of positive instances
which are correctly identified as positives by the classifier.
• Specificity = TN/(TN+FP), also called True Negative Rate, that refers to the classifier's ability
to identify negative results.
• Precision = TP/(TP+FP), that is the portion of retrieved instances that are relevant.
• F1-Measure = 2·TP/(2·TP+FP+FN), that is the harmonic mean of precision and sensitivity, hence
gathering both metrics into a single value. Evaluating precision and sensitivity jointly
is very useful, since it is quite easy to optimize one of the metrics
at the expense of the other.
• Matthews Correlation Coefficient [39] = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), correlating the
observed and predicted binary classifications by simultaneously considering true and false
positives and negatives. It can assume a value between −1 and +1, where +1 represents a
perfect prediction, 0 a prediction no better than random, and −1 total disagreement
between prediction and observation.
• ROC Area, or Area under the ROC curve (plotting the detection rate against the false positive
rate), summarizes the total accuracy of the detector in a way that accounts for both the
gains in True Positive Rate and the losses in False Positive Rate. More precisely, it represents the probability that the classifier ranks a randomly selected positive instance
higher than a randomly selected negative one, reflecting the inherent difficulty of the detection task in the presence of noise. Accordingly, with a ROC Area value of 0.9, for example,
there is a 90% probability that, given two randomly selected alternatives, the correct one
will be identified, so that a ROC Area value of 1 is associated with perfect detection while a
value of 0.5 implies completely random results.
Finally, the last column reports the error ratio of the per-fold classification process.
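These definitions can be restated compactly in code; the sketch below computes the listed metrics directly from the four confusion-matrix counts (a straightforward transcription of the formulas above, not the authors' own evaluation code):

```python
import math

def detection_metrics(tp, fp, tn, fn):
    """Quality metrics of a binary detector, computed from the
    confusion-matrix counts TP, FP, TN, FN."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "precision":   tp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) /
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "error_ratio": (fp + fn) / total,
    }
```

Applied to the testing-set counts of Table 6 (TP=1890, FP=128, TN=3788, FN=226), it reproduces an accuracy of about 0.941 and a Matthews coefficient of about 0.870.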
Once the classification formulae have been built from the entire training set, the overall detection performance has been evaluated on the testing set. The results are reported in Table 6,
where we can observe a significant classification accuracy associated with a very high precision in
identifying traffic that is not affected by anomalies. This is also confirmed by the
F1-Measure results, strongly reflecting the excellent combined performance in precision
and sensitivity. In addition, both the Matthews Correlation Coefficient and the area under the
ROC curve report respectively a very satisfactory score (0.87) and a value (0.93) quite near to a
perfect prediction. The whole experiment also resulted in a very limited error ratio. We
can argue that the errors observable from the confusion matrix are mostly due to specific
kinds of non-noisy attacks present in the testing set (e.g., buffer overflow) that, due to their inherent structure in terms of traffic pattern, may be indistinguishable from normal activity and hence
lead to contradictory results. However, this cannot be considered a real problem for a network
anomaly detection system, since the results show a good efficiency in identifying traffic patterns
that can be considered "normal", and we are essentially interested in distinguishing the occurrence of suspicious events (and eventually flagging them for further analysis) deviating from the
"normal" traffic behavior.
Table 6: Classification performance on the testing set

Confusion matrix                                                                 F1        Correlation   ROC     Error
TP      FP     TN      FN     Accuracy   Sensitivity   Specificity   Precision   Measure   Coefficient   Area    Ratio
1890    128    3788    226    0.941      0.893         0.967         0.936       0.914     0.870         0.930   0.059
In addition, in order to evaluate our anomaly detection strategy also on real traffic data, we
tested it against a packet trace collected at the Federico II University of Napoli, containing all
the (anonymized) traffic incoming from a 1 Gbps link to the Internet of the engineering campus, collected on March 24, 2009 from 00:00 to 19:08. Only incoming traffic has been collected,
for traffic cleaning and purity reasons (filtering undesired anomalous events is much easier when
the expected traffic pattern is known). The trace contains several anomalous events simulated
through distributed SYN floods and port-scanning attacks occurring at various times (see Table 7), properly chosen as representative of most of the anomalous traffic patterns that can be
observed on a border connection to the Internet (inbound distributed denial of service attacks,
bandwidth floods, single and multiple scans). The features of interest have been extracted by
using the CAIDA CoralReef toolset.
Table 7: The simulated attacks in the real traffic trace

Start Time    Duration    Attack       Packet rate
01:15         60s         SYN flood    500/s
02:15         300s        SYN flood    500/s
03:15         600s        SYN flood    500/s
04:15         60s         SYN flood    250/s
05:15         300s        SYN flood    250/s
06:15         600s        SYN flood    250/s
13:15         600s        portscan     250/s
The classification results reported in Table ??, related to the 10-fold cross-validation, demonstrate the effectiveness of the proposed approach also on real data.
4.3. Results Comparison
4.3.1. Classification Stage
In order to obtain a clearer view of the quality of the above results in terms of classification performance, we compared them with those obtained on the same data set by using several
of the most common and effective supervised classification techniques, whose implementation
details are widely known. These techniques have been explicitly chosen to cover the fundamental categories of approaches available in the literature, ranging from rule-based methods (in our
case C4.5/J48 [40][41]) to neural networks (with a Multilayer Perceptron (MLP) [42]), Support
Vector Machines (SVM) [43][44], and Bayesian networks (by using a Naive-Bayes classifier [45]).
As comparison metrics, in addition to the more traditional accuracy and precision, we used
the Matthews correlation coefficient, which is considered to be the most effective quality measure
in binary classification, and in particular in anomaly detection, where the number of normal instances greatly surpasses the number of anomalous ones. In fact, such a coefficient is significantly
more stable in the presence of very different class sizes than other widely used metrics such as the area
under the ROC curve. It also achieves an optimal balance of the types of errors over the two
classes. As can be easily appreciated from Figure 3, as well as from the detailed performance
results reported in Table 8, the U-BRAIN-based strategy outperforms all the other methods on
the NSL-KDD data.
4.3.2. Application Stage
In order to compare the efficiency and effectiveness of the proposed technique with other
known anomaly detection approaches, not only in terms of classification performance, but also
Table 8: Classification performance comparison results

              Accuracy   Sensitivity   Specificity   Precision   Correlation
U-Brain       0.941      0.893         0.967         0.936       0.870
J48           0.858      0.962         0.779         0.767       0.736
SVM           0.795      0.966         0.664         0.685       0.639
MLP           0.768      0.905         0.664         0.671       0.571
Naive Bayes   0.698      0.977         0.487         0.591       0.508

Figure 3: Classification performance comparison histogram (accuracy, precision, and correlation for each classifier)
at the final application stage, we performed a tentative comparison of the final accuracy of
the results achieved on the same dataset by our scheme and several distinguished (supervised
and unsupervised) anomaly detection methods, such as the k-Nearest Neighbor outlier mining
algorithm (K-NN) [46], fixed-width clustering (Cluster) [47], ADWICE [18], fpMAFIA [48],
SVM+DGSOT [20], K-Means+ID3 [19], HDG-clustering [49] and RPCPAI [50], whose ROC
graphs and Area measures are available in the literature. The area under the ROC curve has been
chosen for comparison since it is one of the most standardized accuracy metrics for anomaly
detection algorithms and consequently has been widely used in the literature, so that many performance results referring to the KDD dataset are publicly available.
Based on the comparison results reported in Figure 4, we can assert that the proposed detection method can compete with the other available techniques, since its detection performance is
located exactly on the average of all the examined ROC results, represented by the dotted line in
the aforementioned Figure.
Figure 4: Application performance comparison histogram (area under the ROC curve for fpMAFIA, K-NN, U-Brain, Fixed Cluster, ADWICE, DGSOT+SVM, K-means+ID3, RPCPAI and HDG Cluster, with the average value marked)
5. Conclusions
Identifying anomalous events is one of the best ways to discover many existing malfunctions
and handle most of the security and performance problems that may occur in modern networks.
Hence, the availability of reliable detection devices and strategies becomes a fundamental prerequisite for next generation network-empowered infrastructures. We presented a new supervised
machine learning approach to anomaly detection, whose goal is understanding the dynamics and
behaviors characterizing network traffic in order to generate a set of rules and criteria that can
be used to effectively discriminate anomalous events from the normal traffic flow. Such an approach
couples the capability of inferring rigid decisional structures, represented as boolean formulas,
from incomplete sample observations, with the flexibility introduced by a fuzzy-based uncertainty management strategy. This allows the detection engine to easily adapt to the very different
kinds of phenomena that can be experienced on a real network. The results of the experimental
evaluation demonstrate the ability to successfully handle very different kinds of events/phenomena within the context of a quite difficult selection of training and testing data, commonly used
for assessing detection approaches.
However, it must be considered that the detection capability directly depends on the accuracy
of the self-learnt rules describing the traffic model, so that if the training data are not sufficiently
complete and realistic, i.e., they do not reflect all the aspects characterizing the real network
traffic, the risk of false positives and/or negatives increases. This introduces the need for an
incremental knowledge construction and refinement process, implemented within the context of a
continuous supervised re-training mechanism, managed through human-driven results validation.
Furthermore, while rule construction is a computationally heavy task that has to be managed
off-line, on a periodic basis, by relying on properly crafted parallel computing environments, the
on-line detection activity, based on the above rules, is extremely simple and effective in terms of
performance and can be easily implemented in most of the next-generation security equipment
available on modern networks.
Finally, it should be considered that, by relying on a native rule-based detection strategy,
where the inferred rules have the main goal of reliably describing the model (ideally kept up-to-date through periodic re-training) that represents network traffic, this approach
is potentially more effective against previously unknown phenomena and robust against obfuscation mechanisms.
References
[1] J. P. Anderson, Computer security threat monitoring and surveillance, Tech. rep., Computer Security Division of
the Information Technology Laboratory, National Institute of Standards and Technology (1980).
[2] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Comput. Surv. 41 (3) (2009) 15:1–15:58.
[3] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Macia-Fernandez, E. Vazquez, Anomaly-based network intrusion detection:
Techniques, systems and challenges, Computers & Security 28 (1-2) (2009) 18–28.
[4] A. Patcha, J.-M. Park, An overview of anomaly detection techniques: Existing solutions and latest technological
trends, Computer Networks 51 (12) (2007) 3448–3470.
[5] M. Roesch, et al., Snort: Lightweight intrusion detection for networks, in: Proceedings of the 13th USENIX
Conference on System Administration, Vol. 99, 1999, pp. 229–238.
[6] V. Paxson, Bro: a system for detecting network intruders in real-time, Computer networks 31 (23) (1999) 2435–
2463.
[7] D. Anderson, T. F. Lunt, H. Javitz, A. Tamaru, A. Valdes, et al., Detecting unusual program behavior using the
statistical component of the Next-generation Intrusion Detection Expert System (NIDES), SRI International, Computer Science Laboratory, USA SRI-CSL-95-06, 1995.
[8] S. Staniford, J. A. Hoagland, J. M. McAlerney, Practical automated detection of stealthy portscans, Journal of
Computer Security 10 (1) (2002) 105–136.
[9] T. Lane, C. E. Brodley, An application of machine learning to anomaly detection, in: Proceedings of the 20th
National Information Systems Security Conference, Vol. 377, Baltimore, USA, 1997, pp. 366–380.
[10] A. K. Ghosh, A. Schwartzbard, A study in using neural networks for anomaly and misuse detection., in: USENIX
Security, 1999.
[11] V. N. Dao, V. Vemuri, A performance comparison of different back propagation neural networks methods in computer network intrusion detection, Differential Equations and Dynamical Systems 10 (1&2) (2002) 201–214.
[12] S. Mukkamala, G. Janoski, A. Sung, Intrusion detection using neural networks and support vector machines, in:
Neural Networks, 2002. IJCNN’02. Proceedings of the 2002 International Joint Conference on, Vol. 2, IEEE, 2002,
pp. 1702–1707.
[13] W. Lee, S. J. Stolfo, K. W. Mok, Mining in a data-flow environment: Experience in network intrusion detection, in:
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM,
1999, pp. 114–124.
[14] W. Lee, S. J. Stolfo, K. W. Mok, A data mining framework for building intrusion detection models, in: Security
and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, IEEE, 1999, pp. 120–132.
[15] C. Warrender, S. Forrest, B. Pearlmutter, Detecting intrusions using system calls: Alternative data models, in:
Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, IEEE, 1999, pp. 133–145.
[16] P. K. Chan, M. V. Mahoney, M. H. Arshad, Learning rules and clusters for anomaly detection in network traffic, in:
Managing Cyber Threats, Springer, 2005, pp. 81–99.
[17] N. Duffield, P. Haffner, B. Krishnamurthy, H. Ringberg, Rule-based anomaly detection on ip flows, in: INFOCOM
2009, IEEE, IEEE, 2009, pp. 424–432.
[18] K. Burbeck, S. Nadjm-Tehrani, Adwice–anomaly detection with real-time incremental clustering, in: Information
Security and Cryptology–ICISC 2004, Springer, 2005, pp. 407–424.
[19] S. R. Gaddam, V. V. Phoha, K. S. Balagani, K-means+ id3: A novel method for supervised anomaly detection
by cascading k-means clustering and id3 decision tree learning methods, Knowledge and Data Engineering, IEEE
Transactions on 19 (3) (2007) 345–354.
[20] L. Khan, M. Awad, B. Thuraisingham, A new intrusion detection system using support vector machines and hierarchical clustering, The VLDB Journal: The International Journal on Very Large Data Bases 16 (4) (2007) 507–521.
[21] D. peng Chen, X. song Zhang, Internet anomaly detection with weighted fuzzy matching over frequent episode
rules, in: Apperceiving Computing and Intelligence Analysis, 2008. ICACIA 2008. International Conference on,
2008, pp. 299–302. doi:10.1109/ICACIA.2008.4770028.
[22] J. E. Dickerson, J. A. Dickerson, Fuzzy network profiling for intrusion detection, in: Fuzzy Information Processing
Society, 2000. NAFIPS. 19th International Conference of the North American, IEEE, 2000, pp. 301–306.
[23] E. Mendelson, Introduction to mathematical logic, CRC press, 1997.
[24] S. Rampone, C. Russo, A fuzzified BRAIN algorithm for learning DNF from incomplete data, Electronic Journal
of Applied Statistical Analysis (EJASA) 5 (2) (2012) 256–270.
[25] S. Rampone, Recognition of splice junctions on DNA sequences by BRAIN learning algorithm., Bioinformatics
14 (8) (1998) 676–684.
[26] R. S. Michalski, A theory and methodology of inductive learning, Springer, 1983.
[27] T. M. Mitchell, Generalization as search, Artificial intelligence 18 (2) (1982) 203–226.
[28] D. Haussler, Quantifying inductive bias: Ai learning algorithms and valiant’s learning framework, Artificial intelligence 36 (2) (1988) 177–221.
[29] G. D’Angelo, S. Rampone, Towards a hpc-oriented parallel implementation of a learning algorithm for bioinformatics applications, BMC Bioinformatics 15 (5) (2014) 1–15.
[30] J. S. Vitter, External memory algorithms and data structures: Dealing with massive data, ACM Computing surveys
(CsUR) 33 (2) (2001) 209–271.
[31] F. Darema, The spmd model: Past, present and future, in: Recent Advances in Parallel Virtual Machine and
Message Passing Interface, Springer, 2001, pp. 1–1.
[32] G. D’Angelo, S. Rampone, Diagnosis of aerospace structure defects by a hpc implemented soft computing algorithm, in: Proceedings of the IEEE International Workshop on Metrology for Aerospace, Benevento, Italy, 2014,
pp. 408–412.
[33] S. T. Brugger, J. Chow, An assessment of the DARPA IDS Evaluation Dataset using Snort, Tech. rep., University
of California, Davis, Department of Computer Science (2007).
[34] DARPA, KDD Cup 1999 data set, available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[35] M. Tavallaee, E. Bagheri, W. Lu, A.-A. Ghorbani, A detailed analysis of the KDD CUP 99 data set, in: Proceedings
of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications 2009, 2009.
[36] M. Tavallaee, Nsl-kdd dataset, http://www.iscx.ca/NSL-KDD.
[37] J. McHugh, Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system
evaluations as performed by lincoln laboratory, ACM transactions on Information and system Security 3 (4) (2000)
262–294.
[38] R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification, John Wiley & Sons, 1999.
[39] D. M. Powers, Evaluation: from precision, recall and f-measure to roc, informedness, markedness & correlation,
Journal of Machine Learning Technologies 2 (1) (2011) 37–63.
[40] G. H. John, Robust linear discriminant trees, in: Learning from Data, Springer, 1996, pp. 375–385.
[41] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[42] M. Augusteijn, B. Folkert, Neural network classification and novelty detection, International Journal of Remote
Sensing 23 (14) (2002) 2891–2902.
[43] I. Steinwart, D. R. Hush, C. Scovel, A classification framework for anomaly detection, in: Journal of Machine
Learning Research, 2005, pp. 211–232.
[44] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems
and Technology (TIST) 2 (3) (2011) 27.
[45] G. H. John, P. Langley, Estimating continuous distributions in bayesian classifiers, in: Eleventh Conference on
Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, 1995, pp. 338–345.
[46] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, A geometric framework for unsupervised anomaly detection,
in: Applications of data mining in computer security, Springer, 2002, pp. 77–101.
[47] J. Oldmeadow, S. Ravinutala, C. Leckie, Adaptive clustering for network intrusion detection, in: Advances in
Knowledge Discovery and Data Mining, Springer, 2004, pp. 255–259.
[48] K. Leung, C. Leckie, Unsupervised anomaly detection in network intrusion detection using clusters, in: Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38, Australian Computer Society,
Inc., 2005, pp. 333–342.
[49] C.-F. Tsai, C.-C. Yen, Unsupervised anomaly detection using hdg-clustering algorithm, in: Neural Information
Processing, Springer, 2008, pp. 356–365.
[50] M. Mohammadi, A. Akbari, B. Raahemi, B. Nassersharif, H. Asgharian, A fast anomaly detection system using
probabilistic artificial immune algorithm capable of learning new attacks, Evolutionary Intelligence 6 (3) (2014)
135–156.