Web Application Firewall: CS499: B.Tech Project Final Report
Web Application Firewall: CS499: B.Tech Project Final Report
Web Application Firewall: CS499: B.Tech Project Final Report
by
Namit Gupta (Y3188)
Abakash Saikia (Y3349)
submitted to
Department of Computer Science and Engineering
Indian Institute Of Technology, Kanpur
~~,;;( -
,
"
Certificate
Certified that the work titled "Web Application Firewall" by Namit Gupta and Abakil-':ih
.., Saikiahasbeendoneundermy supervisionand the work hasnot beensubmittedelsewher~
for a degree.
~'y.
Dr. Dheeraj Sanghi
~1tf(DT
-,.,
,
Abstract
Today applications are becoming the prime target for cyber attacks. A recent re-
search showed that approximately 70% of all successful web attacks exploit application
vulnerabilities and there is no shortage of vulnerabilities to go after, all of them re-
quire some skill to exploit. While traditional firewalls have blocked packets effectively
at the network layer, they are ineffective against attacks which point to application
weaknesses. Web application firewalls detect application deviation and whether sen-
sitive data, such as account information or credit card number, is being hacked and
can take suitable action accordingly. Moreover, Intrusions pose a serious threat to the
computer world and must be dealt with correctly. Everyday some new kind of attack
comes into existence and pose a threat to the application security. Current signature
based methods and machine learning algorithms cannot detect such intrusions as they
rely on training of labeled data.
We divided the implementation of the Web Application Firewall into five different
parts. When a HTTP request comes, it is initially parsed into useful chunks of infor-
mation, which are then normalized to some standard format. Some built-in-checks are
performed on the normalized form which is followed by checking of user-defined rules.
Logging is done intelligently on every request. Unsupervised learning techniques over
unlabeled data are used to detect different types of intrusions while maintaining a low
false positive rate. Using heuristics and a learning engine, an effort towards averting
Zero Day Attacks is made.
Introduction
Related Work
Technical Details
Deployment Scenario
WAF sits between the insecure internet and the web server. The request from a client to
the backend server is analyzed at the WAF. The safe requests are then sent to the server
while the malicious ones are being dropped there itself. WAF would be very generic irre-
spective of the backend server which can be either a database or an office workstation. It
can be shown as a context diagram in the form as given in Figure 1.
Figure 1: Deployment Scenario
Implementation Details
The design of the WAF divided into individual components goes as follows:
Request ⇒RequestLine
∗ ((GeneralHeader
|RequestHeader
|EntityHeader)CRLF )
CRLF
[M essageBody]
The request line is of the form of:
The above rule filters the request seeing the IP address of the client and the file requested
by him/her. So, if the IP equals 203.200.95.130 or the requested filename is “/bin/sh”, then
it will simply block the packet.
(e) Logging
Logging forms an important part of any web application. Here, logs are maintained both
sides, from client to server and back from server to client. A total of 18 fields like protocol,
source/destination IP address, source/destination port number, method, response status,
time, etc is being logged. Thereafter, log file is being trained using the clustering algorithms
which helps in detecting an intrusion into the system.
Process Flow
A normal HTTP packet when it arrives is first stopped at the proxy server and given to
the firewall. The information in the packet is being checked against the basic rules in the
configuration file. If it is considered to be a safe packet, then it is given to the backend
server otherwise the connection is closed right there. The server responds in a normal fash-
ion and provides the client with the required page. Logs are maintained both ways, on the
arrival of a packet and on way back too. The flow can be shown in the form of a diagram
as shown in Figure 3.
Clustering Algorithms
The clustering algorithms use the dataset to form clusters and detect intrusions. Al-
most all of them work on the following two assumptions:
1. The normal instances have similar properties and occur close together to form one
single cluster while anomalies are far apart from them.
2. The number of normal instances are exceedingly large than the number of anomalies
i.e. normal instances constitute around 95-98 % of the total data while anomalies
constitute the rest.
Figure 3: Process Flow
The next two algorithms which we are going to discuss next have been tested on the
KDD Cup 1999 dataset. It consists of around 5 million data instances, each of which is a
41-dimensional vector obtained from raw network data. The fields are duration, protocol
type, number of bytes transferred, flag indicating normal or error status of the connection
etc. Each connection was labeled as either normal or as exactly one kind of attack.
As system designed must work for instances coming from an arbitrary distribution, data
instances must be normalized in order to form clusters. The normalization can be done by
calculating the average and standard deviation vectors as:
N
1 X
avg[j] = datai [j]
N
i=1
N
1 X
std dev[j] = ( (datai [j] − avg[j])2 )1/2
N −1
i=1
data[j] − avg[j]
new data[j] =
std dev[j]
The euclidean distance is given by the root of the squared distance between two feature
vectors.
v
uN
uX
euc dist = t (ai − bi )2
i=1
We use a simple variant of single-linkage clustering. The algorithm starts with empty
set of clusters. For each new instance, it computes the distance between it and the centers
of clusters formed so far. The cluster with the shortest distance is calculated and if it is
less than some fixed distance W , it is assigned to that cluster. Otherwise, a new cluster is
made with it being the center. It can be given as:
2. Get a data instance from the dataset. If S is empty, then this data instance becomes
the defining distance and add it to S. Otherwise, find the cluster whose center is
closest to it.
3. If this distance is smaller than the the cluster width W , then add the instance to that
cluster. Otherwise, create a new cluster for it and add the cluster to S.
When applied on the KDD-99 dataset, Leonid Portnoy gives a result of around 65%.
K-Means
K-Means is a typical clustering algorithm. It partitions the dataset into k clusters according
to the following steps:
1. Choose k instances randomly from the dataset and make them initial centers for the
clustering space.
2. Take each instance from the dataset and assign it to the closest cluster.
The difference it has with the Leonid Portnoy algorithm is choosing of k, the initial num-
ber of clusters, in the first case and W , the cluster width, in the latter one. When applied
on the KDD-99 dataset, K-means gives a result of around 80%.
As already stated, our dataset consists of a set of 18 vectors which we get from the
contents of a HTTP packet. Some of the vectors are protocol, source IP address, destina-
tion IP address, source port number, destination port number, method, host, payload size,
response status and time.
In our assumption, we have mentioned that a large number of instances are normal while
the rest are anomalies. So, we label some percentage N of the clusters containing the largest
number of instances as normal, rest being anomalies. After having trained the dataset by
our algorithms, a new instance is put into some cluster using the formulae and the cluster is
checked against a normal or an anomalous one. A voting mechanism is used and if both the
algorithms classify the new instance as an anomaly, then only we consider it as an anomaly
otherwise not.
Integration
We have to integrate the Web Application Firewall with the proxy server so that the
packets which are coming from the client side can actually be stopped on matching of rules.
For this, we have used an open source proxy server, Wcol. It is a multi process proxy system
and can handle many connections at the same time.
The basic working mechanism of Wcol integrated with WAF is:
A socket connection is setup between the client and the proxy server. Whenever a client
wishes to access some page from the backend server, it issues a GET request to the proxy
server. For each accepted socket connection, a new instance of method Reception() is forked.
So, we may have more than one processes running at the same instant. The packet infor-
mation is then stored in a structure and rule matching is done. If it comes out to be a safe
packet, the proxy server then fetches the page from the backend server and sends it back to
the client. Otherwise, the connection is closed there itself and the client is denied access.
When the ratio between the short time average and the long term average exceeds a
certain threshold, the detector informs that this may be a Zero Day Attack.
Results
Acknowledgements
Apart from our efforts, the success of this project depends largely on the encourage-
ment and motivation of many others.
We would like to show our greatest appreciation to Dr. Dheeraj Sanghi. Without his
guidance and support, this project would not have been materialized. A special mention
to the Juniper people whose guidance was also vital for the success of the project. We are
grateful for their support and help.
References
[1] Ivan Ristic et al., “Web Application Firewall Evaluation Criteria”, Web Application
Security Consortium, 2006, http://www.webappsec.org/projects/wafec/.
[3] Jeffrey R. Williams et al., “The Ten Most Critical Web Application Se-
curity Vulnerabilities”, The Open Web Application Security Project, 2004,
http://www.owasp.org/index.php/Category:OWASP Top Ten Project.
[4] Leonid Portnoy, “Intrusion Detection with unlabeled data using clustering”, Proceed-
ings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001),
Philadelphia, PA: November 5-8, 2001.
[5] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy and Salvatore Stolfo,
“A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions
in Unlabeled Data” Data Mining for Security Applications, Kluwer 2002.
[6] Guan Y., Ghorbani A., and Belacel N., “Y-Means: A Clustering Method for Intrusion
Detection”, Canadian Conference on Electrical and Computer Engineering, May 2003.
[7] Guan Y., Ghorbani A., and Belacel N., “K-Means+: An Autonomous Clustering Al-
gorithm”, Technical Report TR04-164, University of New Brunswick, 2004.
[8] Tapas Kunango, Nathan S. Netanyahu and Angela Y. Wu, “An Efficient K-Means
Clustering Algorithm: Analysis and Implementation”, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.