Sample Mini Project Report

DETECTING SPAM BOTS ON SOCIAL NETWORK
A PROJECT REPORT
Submitted by
AFRAAH MARIAM S (111516104003)

DEEPIKA K (111516104025)
DHIVYASHREE S (111516104031)
in partial fulfillment for the award of the degree
of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE ENGINEERING
R.M.D. ENGINEERING COLLEGE

(An Autonomous Institution)
KAVARAIPETTAI - 601206
ANNA UNIVERSITY: CHENNAI – 600 025
MAY – 2021
ANNA UNIVERSITY: CHENNAI – 600 025
BONAFIDE CERTIFICATE
Certified that this project report titled “DETECTING SPAM BOTS ON
SOCIAL NETWORK” is a bonafide work of AFRAAH MARIAM S
(111516104003), DEEPIKA K (111516104025) and DHIVYASHREE S
(111516104031), who carried out the work under my supervision.
SIGNATURE
Dr. P. Ezhumalai, B.E., M.Tech., Ph.D.,
HEAD OF THE DEPARTMENT,
Department of Computer Science Engineering,
RMD Engineering College
R.S.M Nagar
Kavaraipettai – 601 206.

SIGNATURE
Dr. A. Gnanasekar, AMIE., M.Tech., Ph.D.,
ASSOCIATE PROFESSOR,
Department of Computer Science Engineering,
RMD Engineering College
R.S.M Nagar
Kavaraipettai – 601 206.
VIVA-VOCE EXAMINATION
The Viva-Voce Examination of the following students, who have
submitted this project work, was held on………………
AFRAAH MARIAM S (111517104003)

DEEPIKA K (111517104025)
DHIVYASHREE S (111517104031)
INTERNAL EXAMINER EXTERNAL EXAMINER
ACKNOWLEDGEMENT
We take this opportunity to put on record our sincere thanks to all who
enlightened our path towards the successful completion of this project. At the very
outset, we thank the Almighty for his abundant blessings on us.
It is our greatest privilege to convey our thanks to Chairman
MR. R. S. MUNIRATHINAM and Vice-Chairman MR. R. M. KISHORE B.E.,
MBA., for having provided us with all required facilities to complete our project
without hurdles.
We take this opportunity to give profound and heartfelt thanks to the Dean-
Research Dr. K.SIVARAM B.E., M. Tech., Ph.D., and Dean-Academic
Dr.K. K. THYAGHARAJAN B.E., M.E., Ph.D., for their continuous support in the
successful completion of this project.
It is our greatest privilege to convey our thanks to Principal
Dr. N.ANBUCHEZHIAN M.S., MB.A., M.E., Ph.D., for having provided us
with all required facilities to complete our project without hurdles.
We express our sincere gratitude to our Head of Department of
Computer Science Engineering Dr. P.EZHUMALAI B.E., M.Tech, Ph.D., for his
advice and encouragement in all stages of our project work.
It is our pleasant duty to deliver hearty thanks to
our project guide, Dr. A.Gnanasekar, AMIE., M.Tech., Ph.D, Associate
Professor, Computer Science Engineering, for his extraordinary efforts, continuous guidance and
suggestions, without which this project would not have become a reality.
We owe our sincere thanks to the COMPUTER SCIENCE DEPARTMENT STAFF of
R.M.D ENGINEERING COLLEGE for their constant support and encouragement
towards the completion of this project.
ABSTRACT
The presence of bots has been felt in many aspects of social media. Twitter, one
example of social media, has especially felt the impact, with bots accounting for a
large portion of its users. These bots have been used for malicious tasks such as
spreading false information about political candidates and inflating the perceived
popularity of celebrities. Furthermore, these bots can change the results of common
analyses performed on social media. With the significant increase in the volume,
velocity, and variety of user data (e.g., user generated data) in online social networks,
there have been attempts to design new ways of collecting and analyzing such big
data. For example, social bots have been used to perform automated analytical
services and provide users with improved quality of service. However, malicious
social bots have also been used to disseminate false information (e.g., fake news), and
this can result in real-world consequences. Therefore, detecting and removing
malicious social bots in online social networks is crucial. Most existing detection
methods for malicious social bots analyze the quantitative features of their behavior.
These features are easily imitated by social bots, thereby resulting in low accuracy of
the analysis. A novel method of detecting malicious social bots, including both
feature selection based on the transition probability of clickstream sequences and
semi-supervised clustering, is presented in this paper. This method not only analyzes the
transition probability of user behavior clickstreams but also considers the time feature of
behavior. Findings from our experiments on real online social network platforms
demonstrate that the detection method based on the transition probability of user behavior
clickstreams increases the detection accuracy for different types of malicious social bots
by an average of 12.8%, in comparison to the detection method
based on quantitative analysis of user behavior.

TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 CHAPTER 1: INTRODUCTION
1.1 GENERAL
1.2 OBJECTIVE
1.3 EXISTING SYSTEM
1.3.1 DRAWBACKS IN EXISTING SYSTEM
1.3.2 LITERATURE SURVEY
1.4 PROPOSED SYSTEM
1.4.1 ARCHITECTURE OF SPAMBOT DETECTION
1.4.2 ADVANTAGES IN PROPOSED SYSTEM
2 CHAPTER 2: PROJECT DESCRIPTION
2.1 GENERAL
2.2 MODULES
2.2.1 DATA COLLECTION
2.2.2 EXPERIMENTAL DESIGN
2.2.3 MALICIOUS SOCIAL BOTS DETECTION
2.3 SYSTEM TECHNIQUES
3 CHAPTER 3: REQUIREMENTS
3.1 GENERAL
3.2 HARDWARE REQUIREMENTS
3.3 SOFTWARE REQUIREMENTS
4 CHAPTER 4: SYSTEM DESIGN
4.1 GENERAL
4.2 USE CASE DIAGRAM FOR SPAM BOTS DETECTION
4.3 CLASS DIAGRAM FOR SPAM BOTS DETECTION
4.4 ACTIVITY DIAGRAM FOR SPAM BOTS DETECTION
4.5 SEQUENCE DIAGRAM FOR SPAM BOTS DETECTION
4.6 DATA FLOW DIAGRAM FOR SPAM BOTS DETECTION
4.7 ER DIAGRAM FOR SPAM BOTS DETECTION
4.8 SYSTEM ARCHITECTURE OF SPAM BOTS DETECTION
5 CHAPTER 5: SOFTWARE SPECIFICATION
5.1 GENERAL
5.2 FEATURES OF PYTHON
5.2.1 OBJECTIVES OF PYTHON
5.2.2 HISTORY OF PYTHON
5.2.3 OBJECT ORIENTED LANGUAGE
5.3 GETTING PYTHON
5.4 FIRST PYTHON PROGRAM
5.5 SOFTWARE DESCRIPTION
5.5.1 LIBRARIES
5.5.2 DEVELOPMENT
6 CHAPTER 6: IMPLEMENTATION
6.1 GENERAL
6.2 DATA INTEGRATION TECHNIQUE
6.2.1 DATA COLLECTION
6.2.2 TRANSITION PROBABILITY
7 CHAPTER 7: SNAPSHOTS
7.1 VARIOUS SNAPSHOTS
8 CHAPTER 8: SOFTWARE TESTING
8.1 GENERAL
8.2 DEVELOPING METHODOLOGIES
8.2.1 TYPES OF TESTING
8.2.2 BUILD THE TEST PLAN
9 CHAPTER 9: CONCLUSION
9.1 CONCLUSION
9.2 FUTURE ENHANCEMENTS
LIST OF FIGURES
FIGURE NO. NAME OF THE FIGURE
1.4.1 ARCHITECTURE OF SPAMBOT DETECTION
4.2 USE CASE DIAGRAM FOR SPAM BOTS DETECTION
4.3 CLASS DIAGRAM FOR SPAM BOTS DETECTION
4.4 ACTIVITY DIAGRAM FOR SPAM BOTS DETECTION
4.5 SEQUENCE DIAGRAM FOR SPAM BOTS DETECTION
4.6 DATA FLOW DIAGRAM FOR SPAM BOTS DETECTION
4.7 ER DIAGRAM FOR SPAM BOTS DETECTION
4.8 SYSTEM ARCHITECTURE FOR SPAM BOTS DETECTION
7.1.1 DATA CLEANING
7.1.2 DATA PROCESSING
7.1.3 FEATURE SELECTION
7.1.4 SEMI-SUPERVISED CLUSTERING
7.1.5 NORMAL USER SET AND SOCIAL BOT SET
7.1.6 RESULT EVALUATION
7.1.7 PREDICTION RECALL
7.1.8 PREDICTION PRECISION
7.1.9 PREDICTION ACCURACY
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
BEDM Behaviour Enhanced Deep Model
OSN Online Social Network
GUI Graphical User Interface
ABE Attribute-Based Encryption
PyPI Python Package Index
PEP Python Enhancement Proposal
MIME Multipurpose Internet Mail Extension
HTTP Hypertext Transfer Protocol
HCL Hardware Compatibility List
GPL General Public License
URL Uniform Resource Locator
PHP Hypertext Preprocessor
WWW World Wide Web
CHAPTER 1 - INTRODUCTION
GENERAL:
In online social networks, social bots are social accounts controlled by automated
programs that can perform corresponding operations based on a set of procedures. The
increasing use of mobile devices (e.g., Android and iOS devices) also contributed to an increase
in the frequency and nature of user interaction via social networks. It is evidenced by the
significant volume, velocity and variety of data generated from the large online social network
user base. Social bots have been widely deployed to enhance the quality and efficiency of
collecting and analyzing data from social network services. For example, the social bot SF
QuakeBot is designed to generate earthquake reports in the San Francisco Bay, and it can analyze
earthquake related information in social networks in real-time. However, public opinion about
social networks and massive user data can also be mined or disseminated for malicious or
nefarious purposes. In online social networks, automated social bots do not represent the real
desires and intentions of normal human beings, so they are usually regarded as malicious.
For example, some fake social bot accounts are created to imitate the profile of a normal user,
steal user data and compromise privacy, disseminate malicious or fake information, post
malicious comments, promote certain political or ideological agendas and propaganda,
and influence the stock market and other societal and economic markets. Such activities can
adversely impact the security and stability of social networking platforms. In previous research,
various methods were used to protect the security of online social networks. User behavior is the
most direct manifestation of user intent, as different users have different habits, preferences, and
online behavior (e.g., the way one clicks or types, as well as the speed of typing). In other words,
we may be able to mine and analyze information hidden in a user's online behavior to profile and
identify different users. However, we also need to be conscious of situational factors that may
play a role in changing a user's online behavior. In other words, user behavior is dynamic and its
environment is constantly changing; this includes both the externally observable environment
(e.g., environment and behavior) of the application context and the hidden environment in user
information.
OBJECTIVE:
In order to distinguish social bots from normal users accurately, detect malicious social
bots, and reduce the harm they cause, we need to acquire and analyze the social
situations of user behavior and understand the differences between malicious social bots
and normal users in dynamic behavior. Specifically, in this paper, we aim to detect malicious
social bots on social network platforms in real-time, by
(1) proposing the transition probability features between user clickstreams based on the social
situation analytics; and
(2) designing an algorithm for detecting malicious social bots based on spatiotemporal features.
EXISTING SYSTEM:
Recent statistics show that more than 50% of Twitter accounts are not human users.
Social network administrators are well aware of these harmful activities and try to delete these
users using their suspension/removal systems. By one estimate 28% of accounts created in 2008
and half of the accounts created in 2014 have been suspended by Twitter. What is not well taken
care of is the role of bots in facilitating these malicious activities. In one study, 145,000 accounts
survived for months without detection.
Today, 16% of spammers on Twitter are bots. Although social network spam detection
approaches are still insufficient, bot detection in social networks has received wide attention
from the research community in recent years. Botnets become widespread in wired and wireless
networks. In particular, bots in a botnet are able to cooperate towards a common malicious
purpose. In recent years, social bots have become very popular in social networks, and they can
imitate human activities in social networks. They are also programmed to work together to fulfill
the prescribed tasks. There are a wide range of methods (e.g., sophisticated techniques and tools
that may be associated with nation states and state-sponsored actors) used by some users with
malicious or nefarious intent as well as social bots. For example, in order to imitate the features
of human users successfully, social bots may `crawl' for words and pictures from online social
networks to complete fabricated user profiles and so on. Semi-social bots between humans and
social bots have also reportedly emerged in social networks, which are highly complex social
bots that bear the characteristics of human behaviour and social bot behaviour. The automated
procedure for semi-social bots is generally activated by humans, and the subsequent actions are
automatically performed by social bots. This process further increases the uncertainty of the
operation time of social bots.
Social bots are becoming more intelligent; they can imitate human behaviour more
easily and are therefore hard to detect. In existing literature, social bots are generally
detected using machine learning-based approaches, such as BotOrNot, released in 2014 for
detecting bots on Twitter. In BotOrNot, a random forest model is trained on the
historical social information of normal user and social bot accounts. Based on six classes of
features (i.e., network, user, friends, time, content and emotion), this model distinguishes normal
users from social bots. Morstatter et al. proposed a heuristic supervised BoostOR model that
increases the recall rate in detecting malicious bots, using the proportion of forwarded tweets to
published tweets on Twitter, the mean length of tweets, URLs, and the forwarding interval. Wang et al.
constructed a semi-supervised clickstream similarity graph model for user behaviour to detect
abnormal accounts in Renren.
A supervised machine learning method was proposed that first identifies active, passive
and inactive users according to the social interactions between Twitter users, and then
identifies social bots on the basis of age, location and other static features of these users,
together with dynamic characteristics such as the interacting person, interaction content and
interaction theme. A time act model, namely Act-M, was constructed focusing
on the timing of user behaviour activities; it can be used to determine the interval
between different behaviours of social media users and thereby detect malicious users accurately.
Researchers have also focused on detecting semi-social bots. For example, a management framework
relying on an entropy component, a spam detection component, an account attribute component, and
a decision maker was proposed by Chu et al. In this approach, Naive Bayes is adopted to categorize
Twitter accounts into humans, social bots, or semi-social bots. Previous studies have
also shown that quantitative features such as friends, fans, forwarders, and tweets can be used in
feature selection. Supervised learning can be effective in detecting social bots;
however, it requires annotation of and training on large amounts of data.
Tagging data requires time and manpower, and is generally unsuitable for the big data social
networking environment.
DRAWBACKS IN EXISTING SYSTEM:

● Social bots are generally more intelligent and they can more easily imitate human
behavior, and they cannot be easily detected.
● It is not possible to determine which cluster is normal and which cluster is abnormal.
● Social network spam detection approaches are still insufficient.
LITERATURE SURVEY:
1. Cai et al. [1] regard social bots as the most common kind of malware on social
platforms. Bots can produce fake messages, spread rumours, and even manipulate public opinion.
Recently, massive numbers of social bots have been created and spread widely on social
platforms, with negative effects on public and netizen security. Bot detection aims to
distinguish bots from humans and has attracted increasing attention in recent years. The
authors propose a behavior enhanced deep model (BeDM) for bot detection. The proposed
model regards user content as temporal text data instead of plain text to extract latent temporal
patterns. Moreover, BeDM fuses content information and behavior information using deep
learning methods. According to the authors, this is the first attempt to apply deep neural
networks to bot detection. Experiments on a real-world dataset collected from Twitter
demonstrate the effectiveness of the proposed model.
2. Ting-Kai Huang et al. [2] describe recommendation systems, which are developed to
match consumers with products that meet their varied needs and tastes in order to
enhance user satisfaction and loyalty. The popularity of personalized recommendation systems
has increased in recent years, and they have been applied in several areas including movies,
songs, books, news, friend recommendations on social media, travel products, and other
products in general.
Collaborative Filtering methods are widely used in recommendation systems. The collaborative
filtering method is divided into neighborhood-based and model-based. In this study, matrix
factorization is implemented which is part of model-based that learns latent factors for each user
and item and uses them to make rating predictions. The method will be trained using stochastic
gradient descent and optimization of regularization hyperparameter. In the end, neighbourhood-
based collaborative filtering and matrix factorization with different values of regularization
hyperparameter will be compared. The results show that the matrix factorization method is better
than the item-based collaborative filtering method, and better still when the regularization
hyperparameter is tuned, achieving the lowest RMSE score. The functions used are available
from GraphLab, and the MovieLens 100k data set is used to build the recommendation systems.
3. Fred Morstatter et al. [3] observe that the presence of bots has been felt in many aspects of
social media. Twitter, one example of social media, has especially felt the impact, with bots
accounting for a large portion of its users. These bots have been used for malicious tasks such as
spreading false information about political candidates and inflating the perceived popularity of
celebrities. Furthermore, these bots can change the results of common analyses performed on
social media. It is important that researchers and practitioners have tools in their arsenal to
remove them. Approaches exist to remove bots; however, they focus on precision when evaluating
their models, at the cost of recall. This means that while these approaches are almost always
correct in the bots they delete, they ultimately delete very few, so many bots remain. The
authors propose a model which increases the recall in detecting bots, allowing a researcher to
delete more bots. The model is evaluated on two real-world social media datasets and shows
that the detection algorithm removes more bots from a dataset than current approaches.
4. Zhiyong Zhang et al. [4] study attribute-based encryption (ABE), in which a user is
identified with the help of some attributes, which are used for encryption and decryption of
the data. Current techniques based on attribute-based encryption have the problem that if a user's
access structure includes a considerable amount of attribute information labeled as Don’t Care,
then the encryption pairing operation has low calculation efficiency and the cipher text contains
redundant information. The authors propose a hierarchical multi-authority attribute-based
encryption on prime order groups to tackle these problems. The encryption technique has a
polycentric attribute authorization system based on an AND gate access structure, with a unified
attribute index established by each attribute authority throughout the system, to form a binary
tree, i.e., attribute access tree. The state value of the parent node can be determined by the state
of its child node in an attribute access tree. The attribute-based encryption established in this
manner is theoretically proven to effectively decrease the calculation amount for decryption and
compress the redundant information in the cipher text as much as possible. The
encryption technique has theoretical and practical significance for systems based on “large
universe” constructions.
5. Yadong Zhou et al [5], established that online social networks gradually integrate
financial capabilities by enabling the usage of real and virtual currency. They serve as new
platforms to host a variety of business activities such as online promotion events, where users
can possibly get virtual currency as rewards by participating in such events. Both OSNs and
business partners are significantly concerned when attackers instrument a set of accounts to
collect virtual currency from these events, which make these events ineffective and result in
significant financial loss. It becomes of great importance to proactively detect these malicious
accounts before the online promotion activities and subsequently decrease their priority to be
rewarded. The authors propose a novel system, namely ProGuard, to accomplish this
objective by systematically integrating features that characterize accounts from three
perspectives: their general behaviors, their recharging patterns, and the usage of their
currency. Extensive experiments were performed on data collected from Tencent QQ, a
global leading OSN with built-in financial management activities. The experimental results
demonstrate that the system can achieve a high detection rate of 96.67% at a very low
false positive rate of 0.3%.
PROPOSED SYSTEM:
In this paper, we aim to detect malicious social bots on social network platforms in
real-time, by (1) proposing the transition probability features between user clickstreams based on
the social situation analytics; and (2) designing an algorithm for detecting malicious social bots
based on spatiotemporal features. In order to better detect malicious social bots in online social
networks, we analyze user behavior features and identify transition probability features between
user clickstreams. Based on the transition probability features and time interval features, a
semi-supervised social bots detection method based on space-time features is proposed.
ARCHITECTURE OF SPAMBOT DETECTION:
Figure 1.4.1 Architecture of spambot detection

ADVANTAGES IN PROPOSED SYSTEM:
● We evaluate user behavior features and select the transition probability of user behaviour
on the basis of general behaviour characteristics.
● We then analyze and classify situation-aware user behaviors in social networks using our
proposed semi-supervised clustering detection method.
● This allows us to promptly detect malicious social bots using only a small number of
tagged users.
CHAPTER 2 - PROJECT DESCRIPTION
GENERAL
Most existing detection methods for malicious social bots analyze the
quantitative features of their behavior. These features are easily imitated by social bots, thereby
resulting in low accuracy of the analysis. A novel method of detecting malicious social bots,
including both feature selection based on the transition probability of clickstream sequences and
semi-supervised clustering, is presented in this paper. This method not only analyzes the transition
probability of user behavior clickstreams but also considers the time feature of behavior.
Findings from our experiments on real online social network platforms demonstrate that the
detection method based on the transition probability of user behavior clickstreams increases the
detection accuracy for different types of malicious social bots by
an average of 12.8%, in comparison to the detection method based on quantitative analysis of
user behavior.
MODULES
● DATA COLLECTION
● EXPERIMENTAL DESIGN
● MALICIOUS SOCIAL BOTS DETECTION
2.2.1 DATA COLLECTION
The CyVOD platform comprises the website platform and Android and iOS applications.
On CyVOD, user clickstream behavior is captured through buried points (event-tracking code
embedded in the pages and applications), and the clickstream data is collected server-side. In a
realistic environment, for your own website, you can use the buried-point technique to get the
corresponding data; for other websites, you need to get
the data by working with the website or by calling the corresponding API (if provided).
EXPERIMENTAL DESIGN
Three types of malicious social bots are considered: social bots that perform a single task,
malicious social bots that coordinate to perform
tasks, and malicious social bots that perform mixed tasks. For example, a user can perform two
or more of the actions of liking, commenting, sharing and so on. For a social bot that produces
malicious likes, the value of P(play, like) (the transition probability that ‘‘the current click
event is play and the next click event is like’’) would be high and the values of the other
transition probability features would be small or zero.
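The transition probability described above can be illustrated with a minimal sketch. It assumes that P(a, b) is estimated as the number of times event b directly follows event a, divided by the number of transitions that leave a; the report does not state the exact normalization, so this is an assumption. The function name and the sample clickstream are hypothetical.

```python
from collections import Counter

def transition_probabilities(clicks):
    """Estimate P(a, b): the share of transitions leaving event a that go to event b."""
    pair_counts = Counter(zip(clicks, clicks[1:]))   # counts of consecutive (a, b) pairs
    source_counts = Counter(clicks[:-1])             # how often each event has a successor
    return {
        (a, b): count / source_counts[a]
        for (a, b), count in pair_counts.items()
    }

# Hypothetical clickstream of a bot that mostly plays a video and then likes it.
bot_clicks = ["play", "like", "play", "like", "play", "like", "play", "comment"]
probs = transition_probabilities(bot_clicks)
print(probs[("play", "like")])  # 0.75
```

For such a like-oriented bot, P(play, like) dominates the transitions that start from play, which is the pattern the paragraph above describes.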
Data processing:
Some data are selected randomly from the normal user set and the social bots set to be labeled.
A normal user account is labeled as 1, and a social bot account is labeled as −1. These labeled
seed users define the categories of the clusters.
Feature selection:
In the spatial dimension: according to the main functions of the CyVOD platform, we
select the transition probability features related to the play-back function: P(play, play), P(play,
like), P(play, feedback), P(play, comment), P(play, share) and P(play, more); in the time
dimension: we obtain the inter-arrival times (IATs). Only these features are selected because, if
all transition probability matrices of user behavior were constructed, the extremely large data
size and sparse matrices would increase the difficulty of detection.
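As a rough illustration of the selected features, the sketch below builds one feature vector per user from the six play-related transition probabilities plus a single time feature. Summarizing the IATs by their mean is an assumption made here for brevity; the report does not say how the IATs enter the feature vector. The function name, the event tuple layout and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean

# Target events of the six play-related transition features named in the text.
TARGETS = ["play", "like", "feedback", "comment", "share", "more"]

def feature_vector(events):
    """events: list of (timestamp_seconds, event_name) tuples sorted by time."""
    names = [name for _, name in events]
    times = [t for t, _ in events]
    pairs = Counter(zip(names, names[1:]))
    from_play = sum(n for (a, _), n in pairs.items() if a == "play")
    # Spatial dimension: P(play, x) for each target event x.
    probs = [pairs[("play", b)] / from_play if from_play else 0.0 for b in TARGETS]
    # Time dimension: inter-arrival times between consecutive clicks.
    iats = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_iat = mean(iats) if iats else 0.0
    return probs + [mean_iat]

events = [(0, "play"), (2, "like"), (4, "play"), (6, "like"), (8, "play"), (10, "share")]
features = feature_vector(events)
```

Restricting the vector to seven entries per user keeps the data dense, in contrast to a full transition matrix over every pair of events.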
MALICIOUS SOCIAL BOTS DETECTION:
● DATA CLEANING
● DATA PROCESSING
● FEATURE SELECTION
● SEMI SUPERVISED CLUSTERING
● OBTAIN NORMAL USER SET AND SOCIAL BOT SET
● RESULT EVALUATION
Data cleaning:
Records of users with very few clicks must be removed, so that wrong data are eliminated,
accurate transition probabilities between clickstreams are obtained, and errors in the
transition probabilities caused by too little data are avoided.
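A minimal sketch of this cleaning step, assuming it amounts to dropping users whose clickstreams are shorter than a threshold. The threshold value `min_clicks=5`, the function name and the sample data are hypothetical; the report does not specify them.

```python
def clean_clickstreams(streams, min_clicks=5):
    """Keep only users with at least min_clicks click events (threshold is an assumption)."""
    return {user: clicks for user, clicks in streams.items() if len(clicks) >= min_clicks}

streams = {
    "user_a": ["play", "like", "play", "share", "play", "comment"],
    "user_b": ["play", "like"],  # too few clicks to estimate transition probabilities
}
cleaned = clean_clickstreams(streams)
print(sorted(cleaned))  # ['user_a']
```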
Semi-supervised clustering method:
First, the initial centers of two clusters are determined by labeled seed users. Then,
unlabeled data are used to iterate and optimize the clustering results constantly.
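The two steps above resemble a seeded variant of k-means with two clusters. The sketch below is one possible reading, not necessarily the algorithm used in this project: it assumes Euclidean distance, keeps the labeled seeds fixed in their clusters, and stops when assignments no longer change. All names and the sample vectors are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def seeded_two_means(seeds, unlabeled, iterations=20):
    """seeds maps a label (1 = normal user, -1 = social bot) to its labeled feature vectors."""
    # Step 1: initial cluster centers come from the labeled seed users.
    centers = {label: centroid(points) for label, points in seeds.items()}
    assignment = []
    for _ in range(iterations):
        # Step 2: assign every unlabeled user to the nearest center.
        new_assignment = [
            min(centers, key=lambda label: math.dist(point, centers[label]))
            for point in unlabeled
        ]
        if new_assignment == assignment:
            break  # converged: no user changed cluster
        assignment = new_assignment
        # Step 3: recompute centers from seeds plus currently assigned users.
        for label in centers:
            members = seeds[label] + [p for p, a in zip(unlabeled, assignment) if a == label]
            centers[label] = centroid(members)
    return assignment, centers

# Hypothetical two-dimensional features: [P(play, like), mean IAT in seconds].
seeds = {1: [[0.1, 5.0]], -1: [[0.9, 0.5]]}
unlabeled = [[0.2, 4.0], [0.8, 1.0], [0.15, 6.0]]
assignment, centers = seeded_two_means(seeds, unlabeled)
print(assignment)  # [1, -1, 1]
```

Because the features have different scales, a real implementation would likely normalize them before computing distances.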
Obtain the normal user set and social bots set:
The normal user set and the social bots set are finally obtained from the detection result.
Result evaluation:
We evaluate results based on three different metrics: Precision, Recall, and F1 Score
(F1 is the harmonic mean of Precision and Recall,
$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$).
In addition, we use Accuracy as a metric and compare it with the SVM
algorithm to verify the efficiency of the method. Accuracy is the ratio of the number of
samples correctly classified by the classifier to the total number of samples.
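The four metrics can be computed as in the sketch below. Treating the social bot label (−1) as the positive class is an assumption; the function name and the sample labels are hypothetical.

```python
def evaluate(y_true, y_pred, positive=-1):
    """Return (precision, recall, f1, accuracy) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # bots correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # normal users flagged as bots
    fn = sum(t == positive and p != positive for t, p in pairs)  # bots that were missed
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, f1, accuracy

# Hypothetical labels: 1 = normal user, -1 = social bot.
y_true = [-1, -1, -1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, 1, -1, 1, 1, 1, 1]
precision, recall, f1, accuracy = evaluate(y_true, y_pred)
print(accuracy)  # 0.75
```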
SYSTEM TECHNIQUES:
Technique: Data Integration Techniques
Data Integration is the combination of technical and business processes used to combine
data from disparate sources into meaningful and valuable information. The process of Data
Integration is about taking data from many disparate sources (such as files, various databases,
mainframes etc.,) and combining that data to provide a unified view of the data for business
intelligence. Data integration is needed when a business decides to implement a new application
and migrate its data from the legacy systems into the new application. It becomes even more
important in cases of company mergers, where two companies merge and need to
consolidate their applications. One of the most commonly known uses of data integration is
building a data warehouse for an enterprise which enables a business to have a unified view of
their data for analysis and business intelligence (BI) needs.
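In the context of this project, data integration could mean combining click records from the website and the mobile applications into one time-ordered clickstream per user. The sketch below illustrates that idea only; the record fields (`user_id`, `time`, `event`), the function name and the sample logs are hypothetical.

```python
from collections import defaultdict

def integrate(*sources):
    """Merge click records from several sources into one sorted event list per user."""
    unified = defaultdict(list)
    for source in sources:
        for record in source:
            unified[record["user_id"]].append((record["time"], record["event"]))
    # Sort each user's events by timestamp to obtain a single clickstream.
    return {user: sorted(events) for user, events in unified.items()}

web_log = [{"user_id": "u1", "time": 10, "event": "play"},
           {"user_id": "u2", "time": 12, "event": "play"}]
app_log = [{"user_id": "u1", "time": 5, "event": "like"}]
clickstreams = integrate(web_log, app_log)
print(clickstreams["u1"])  # [(5, 'like'), (10, 'play')]
```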
CHAPTER 3 - REQUIREMENTS ENGINEERING
GENERAL
To be used efficiently, all computer software needs certain hardware components or
other software resources to be present on a computer. These prerequisites are known as
(computer) system requirements and are often used as a guideline as opposed to an absolute rule.
Most software defines two sets of system requirements: minimum and recommended. With
increasing demand for higher processing power and resources in newer versions of software,
system requirements tend to increase over time. Industry analysts suggest that this trend plays a
bigger part in driving upgrades to existing computer systems than technological advancements.
The requirements fall into two categories:
1. Hardware Requirements.
2. Software Requirements.
HARDWARE REQUIREMENTS
The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list (HCL), especially in case
of operating systems. An HCL lists tested, compatible and sometimes incompatible hardware
devices for a particular operating system or application. The following subsections discuss the
various aspects of hardware requirements
Hardware requirements for present project:
System : Intel Pentium
Hard Disk : 120 GB
Monitor : 15" LED
Input Devices : Keyboard, Mouse
RAM : 2 GB
SOFTWARE REQUIREMENTS
Software Requirements deal with defining software resource requirements and
prerequisites that need to be installed on a computer to provide optimal functioning of an
application. These requirements or pre-requisites are generally not included in the software
installation package and need to be installed separately before the software is installed.
Software requirements for present project:
Operating system : Windows 7
Coding Language : Python
Tool : Python
Database : SQL
CHAPTER 4 - DESIGN ENGINEERING
GENERAL
Design Engineering deals with the various UML [Unified Modeling language] diagrams
for the implementation of projects. Design is a meaningful engineering representation of a thing
that is to be built. Software design is a process through which the requirements are translated into
representation of the software. Design is the place where quality is rendered in software
engineering. Design is the means to accurately translate customer requirements into finished
products.
USE CASE DIAGRAM FOR SPAM BOTS DETECTION:
A Use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor.
Figure 4.2 Use Case diagram for spambot detection

EXPLANATION:
A use case is a methodology used in system analysis to identify, clarify, and organize
system requirements. In this context, the term "system" refers to something being developed or
operated, such as a mail-order product sales and service Website. Use case diagrams are
employed in UML (Unified Modeling Language), a standard notation for the modeling of
real-world objects and systems.
CLASS DIAGRAM FOR SPAM BOTS DETECTION:
Figure 4.3 Class diagram for spambot detection

EXPLANATION:
Class diagram is an illustration of the relationships and source code dependencies among
classes in the Unified Modeling Language (UML). In this context, a class defines the methods
and variables in an object, which is a specific entity in a program or the unit of code representing
that entity. Class diagrams are useful in all forms of object-oriented programming (OOP). The
concept is several years old but has been refined as OOP modeling paradigms have
evolved.
ACTIVITY DIAGRAM FOR SPAM BOTS DETECTION:
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modeling Language,
activity diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.
Figure 4.4 Activity diagram for spambot detection

EXPLANATION:
Activity diagram is another important diagram in UML to describe the dynamic aspects
of the system. Activity diagram is a flowchart to represent the flow from one activity to another
activity. The activity can be described as an operation of the system. The control flow is drawn
from one operation to another. This flow can be sequential, branched, or concurrent. Activity
diagrams deal with all types of flow control by using different elements such as fork, join, etc.
SEQUENCE DIAGRAM FOR SPAM BOTS DETECTION:
Figure 4.5 Sequence diagram for spambot detection
EXPLANATION:
A sequence diagram shows object interactions arranged in time sequence. It depicts the
objects and classes involved in the scenario and the sequence of messages exchanged between
the objects needed to carry out the functionality of the scenario. Sequence diagrams are typically
associated with use case realizations in the Logical View of the system under development.
Sequence diagrams are sometimes called event diagrams or event scenarios. A sequence diagram
shows, as parallel vertical lines, different processes or objects that live simultaneously, and, as
horizontal arrows, the messages exchanged between them, in the order in which they occur. This
allows the specification of simple runtime scenarios in a graphical manner. Messages, written
with horizontal arrows with the message name written above them, display interaction. Solid
arrowheads represent synchronous calls, open arrowheads represent asynchronous messages, and
dashed lines represent reply messages. If a caller sends a synchronous message, it must wait until
the message is done, such as invoking a subroutine. If a caller sends an asynchronous message, it
can continue processing and doesn’t have to wait for a response. Asynchronous calls are present
in multithreaded applications, event-driven applications and in message-oriented middleware.
Activation boxes, or method-call boxes, are opaque rectangles drawn on top of lifelines to
represent that processes are being performed in response to the message.
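The difference between a synchronous call and an asynchronous message can be sketched in Python with the standard asyncio module. This is only an illustration of the two message kinds described above; the function names classify_sync and log_event are hypothetical and are not part of the project code.

```python
import asyncio


def classify_sync(user_id):
    # Synchronous call: the caller waits here until the result is returned.
    return f"user {user_id}: checked"


async def log_event(log, message):
    # Asynchronous message: the sender schedules it and carries on.
    await asyncio.sleep(0)
    log.append(message)


async def main():
    log = []
    # The caller sends the asynchronous message and does not wait for it.
    task = asyncio.create_task(log_event(log, "request received"))
    # The synchronous call completes before the caller moves on.
    log.append(classify_sync(7))
    # Only now does the caller wait for the asynchronous work to finish.
    await task
    return log


events = asyncio.run(main())
print(events)
```

The synchronous result is appended first because the scheduled task only gets to run once the caller reaches an await.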
DATA FLOW DIAGRAM FOR SPAM BOTS DETECTION:
Figure 4.6 Data flow diagram for spambot detection
EXPLANATION:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be
used to represent a system in terms of the input data to the system, the processing carried out
on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components. These components are the system process, the data used by the
process, an external entity that interacts with the system and the information flows in the system.
3. DFD shows how the information moves through the system and how it is modified by
a series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be
partitioned into levels that represent increasing information flow and functional detail.
ER DIAGRAM FOR SPAM BOTS DETECTION:
Figure 4.7 ER diagram for spambot detection
An entity-relationship diagram (ERD) is a data modeling technique that graphically
illustrates an information system's entities and the relationships between those entities. An ERD
is a conceptual and representational model of data used to represent the entity framework
infrastructure. For each data flow, at least one of the endpoints (source and / or destination) must
exist in a process. The refined representation of a process can be done in another data-flow
diagram, which subdivides this process into sub-processes.
SYSTEM ARCHITECTURE FOR SPAM BOTS DETECTION:
Figure 4.8 System architecture for spambot detection
EXPLANATION:
System architecture is the conceptual model that defines the structure, behavior, and
more views of a system. An architecture description is a formal description and representation of
a system, organized in a way that supports reasoning about the structures and behaviors of the
system. A system architecture can consist of system components and the sub-systems developed,
that will work together to implement the overall system. There have been efforts to formalize
languages to describe system architecture; collectively these are called architecture description
languages (ADL).
CHAPTER 5 - SOFTWARE SPECIFICATION
DEVELOPMENT TOOLS
GENERAL
This chapter describes the programming language and the tools used in the development of the
project. The project is implemented in Python.
FEATURES OF PYTHON
Python's features include −

● Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.
● Easy-to-read − Python code is more clearly defined and visible to the eyes.
● Easy-to-maintain − Python's source code is fairly easy-to-maintain.
● A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
● Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
● Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
● Extendable − You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
● Databases − Python provides interfaces to all major commercial databases.
● GUI Programming − Python supports GUI applications that can be created and ported
to many system calls, libraries and windows systems, such as Windows MFC, Macintosh,
and the X Window system of Unix.
● Scalable − Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a long list of other good features. A few
are listed below −
● It supports functional and structured programming methods as well as OOP.
● It can be used as a scripting language or can be compiled to byte-code for building large
applications.
● It provides very high-level dynamic data types and supports dynamic type checking.
● It supports automatic garbage collection.
● It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
OBJECTIVES OF PYTHON
Why Do Software Developers Choose Python?
Python is available on a wide variety of platforms, including Linux and Mac OS X. Python
is a high-level, interpreted, interactive and object-oriented scripting language. Python is
designed to be highly readable. It uses English keywords frequently whereas other languages
use punctuation, and it has fewer syntactic constructions than other languages.
● Python is Interpreted − Python is processed at runtime by the interpreter. You
do not need to compile your program before executing it. This is similar to PERL
and PHP.
● Python is Interactive − You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.
● Python is Object-Oriented − Python supports Object-Oriented style or technique
of programming that encapsulates code within objects.
● Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple text
processing to WWW browsers to games.
HISTORY OF PYTHON:
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands. Python is
derived from many other languages, including ABC, Modula-3, C, C++, Algol-68, SmallTalk,
and Unix shell and other scripting languages.
Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).
Python is now maintained by a core development team, although Guido van Rossum still holds
a vital role in directing its progress.
OBJECT ORIENTED LANGUAGE
To be an object-oriented language, a language must support at least the following four
characteristics.
1. Inheritance: It is the process of creating new classes that extend existing classes,
reusing their behavior and adding features as needed.
2. Encapsulation: It is the mechanism of combining data and the code that operates on it
into a single unit, hiding the internal details behind an abstraction.
3. Polymorphism: As the name suggests (one name, multiple forms), polymorphism is the
way of providing different functionality through functions that have the same name, selected
by the signatures of the methods.
4. Dynamic binding: Sometimes the specific types of objects are not known while writing
the code. Dynamic binding resolves which implementation to call from the object's specific
type at runtime.
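The four characteristics can be illustrated with a small Python sketch. The classes Account and BotAccount are hypothetical and are not taken from the project code. Python does not select methods by signature, so polymorphism is shown here through method overriding instead.

```python
class Account:
    """Encapsulation: the handle and the event counter live with their methods."""

    def __init__(self, handle):
        self.handle = handle
        self._events = 0  # leading underscore marks internal state by convention

    def record_event(self):
        self._events += 1

    def describe(self):
        return f"{self.handle}: account with {self._events} events"


class BotAccount(Account):  # Inheritance: reuses __init__ and record_event
    def describe(self):  # Polymorphism: same method name, different behavior
        return f"{self.handle}: suspected bot with {self._events} events"


accounts = [Account("alice"), BotAccount("spam4u")]
for account in accounts:
    account.record_event()

# Dynamic binding: which describe() runs is decided from each object's
# actual type at runtime, not from how the list was declared.
descriptions = [account.describe() for account in accounts]
print(descriptions)
```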
GETTING PYTHON:
The most up-to-date source code, binaries, documentation, news, etc., are
available on the official website of Python: https://www.python.org
Windows Installation
Here are the steps to install Python on Windows machines.
● Open a Web browser and go to https://www.python.org/downloads/.
● Follow the link for the Windows installer python-XYZ.msi file where XYZ is the
version you need to install.
● To use this installer python-XYZ.msi, the Windows system must support
Microsoft Installer 2.0. Save the installer file to your local machine and then run it to find out if
your machine supports MSI.
● Run the downloaded file. This brings up the Python install wizard, which is really
easy to use. Just accept the default settings, wait until the install is finished, and you are done.
The Python language has many similarities to Perl, C, and Java. However, there are
some definite differences between the languages.
FIRST PYTHON PROGRAM:
Let us execute programs in different modes of programming.
Interactive Mode Programming
Invoking the interpreter without passing a script file as a parameter brings up the
following prompt −
$ python
Python 2.4.3 (#1, Nov 11 2010, 13:34:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Type the following text at the Python prompt and press Enter −
>>> print("Hello, Python!")
In Python 3, print is a function and the parentheses are required. Python 2, such as the
version 2.4.3 session shown above, also accepts the statement form print "Hello, Python!".
Either way, this produces the following result −
Hello, Python!
Script Mode Programming
Invoking the interpreter with a script parameter begins execution of the script and
continues until the script is finished. When the script is finished, the interpreter is no longer
active. Let us write a simple Python program in a script. Python files have extension .py. Type the
following source code in a test.py file −
print("Hello, Python!")
We assume that you have a Python interpreter set in the PATH variable. Now, try to run
this program as follows −
$ python test.py
This produces the following result −
Hello, Python!
Flask Framework:
Flask is a web application framework written in Python. It was developed by Armin Ronacher,
who leads an international group of Python enthusiasts named Pocoo. Flask is based on the
Werkzeug WSGI toolkit and the Jinja2 template engine. Both are Pocoo projects.
The HTTP protocol is the foundation of data communication on the World Wide Web. Different
methods of data retrieval from a specified URL are defined in this protocol.
SOFTWARE DESCRIPTION:
Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional
programming and aspect-oriented programming. Many other paradigms are supported via
extensions, including design by contract and logic programming.
Python uses dynamic typing, and a combination of reference counting and a cycle-
detecting garbage collector for memory management. It also features dynamic name resolution
(late binding), which binds method and variable names during program execution.
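Dynamic typing and late binding can be seen in a short sketch. The Detector class and its label method are hypothetical names chosen only for this illustration.

```python
# Dynamic typing: a name is not tied to one type and can be rebound.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__


class Detector:
    def label(self):
        return "normal"


detector = Detector()
before = detector.label()

# Late binding: the method name is looked up when the call executes,
# so replacing the attribute on the class changes later calls.
Detector.label = lambda self: "bot"
after = detector.label()

print(first_type, second_type, before, after)
```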
Python's design offers some support for functional programming in the Lisp tradition. It
has filter, map, and reduce functions (in Python 3, reduce is provided by the functools
module), as well as list comprehensions, dictionaries, sets and generator expressions. The
standard library has two modules (itertools and functools) that implement functional tools
borrowed from Haskell and Standard ML.
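The functional tools named above can be tried on a small list. The clicks data is made up for this sketch and does not come from the project data set.

```python
from functools import reduce
from itertools import accumulate

clicks = [3, 0, 7, 2, 0, 5]  # hypothetical clicks per session

active = list(filter(lambda n: n > 0, clicks))  # keep non-empty sessions
doubled = list(map(lambda n: n * 2, active))    # transform every element
total = reduce(lambda a, b: a + b, active)      # fold the list to one value
squares = [n * n for n in active]               # list comprehension
running = list(accumulate(active))              # itertools: running totals
lazy_total = sum(n for n in clicks if n > 0)    # generator expression

print(active, doubled, total, squares, running, lazy_total)
```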
The language's core philosophy is summarized in the document The Zen of Python (PEP
20), which includes aphorisms such as:
● Beautiful is better than ugly.
● Explicit is better than implicit.
● Simple is better than complex.
● Complex is better than complicated.
● Readability counts.
Rather than having all of its functionality built into its core, Python was designed to be
highly extensible. This compact modularity has made it particularly popular as a means of
adding programmable interfaces to existing applications. Van Rossum's vision of a small core
language with a large standard library and easily extensible interpreter stemmed from his
frustrations with ABC, which espoused the opposite approach.
Python strives for a simpler, less-cluttered syntax and grammar while giving developers a
choice in their coding methodology. In contrast to Perl's "there is more than one way to do it"
motto, Python embraces a "there should be one—and preferably only one—obvious way to do it"
design philosophy. Alex Martelli, a Fellow at the Python Software Foundation and Python book
author, writes that "To describe something as 'clever' is not considered a compliment in the
Python culture."
Python's developers strive to avoid premature optimization, and reject patches to non-
critical parts of the CPython reference implementation that would offer marginal increases in
speed at the cost of clarity. When speed is important, a Python programmer can move time-
critical functions to extension modules written in languages such as C, or use PyPy, a just-in-time
compiler. Cython is also available, which translates a Python script into C and makes direct C-
level API calls into the Python interpreter.
An important goal of Python's developers is keeping it fun to use. This is reflected in the
language's name as a tribute to the British comedy group Monty Python and in occasionally
playful approaches to tutorials and reference materials, such as examples that refer to spam and
eggs (from a famous Monty Python sketch) instead of the standard foo and bar.
A common neologism in the Python community is pythonic, which can have a wide
range of meanings related to program style. To say that code is pythonic is to say that it uses
Python idioms well, that it is natural or shows fluency in the language, that it conforms with
Python's minimalist philosophy and emphasis on readability. In contrast, code that is difficult to
understand or reads like a rough transcription from another programming language is called
unpythonic.
Users and admirers of Python, especially those considered knowledgeable or
experienced, are often referred to as Pythonistas.
LIBRARIES:
Python's large standard library, commonly cited as one of its greatest strengths, provides
tools suited to many tasks. For Internet-facing applications, many standard formats and protocols
such as MIME and HTTP are supported. It includes modules for creating graphical user
interfaces, connecting to relational databases, generating pseudorandom numbers, arithmetic
with arbitrary precision decimals, manipulating regular expressions, and unit testing.
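A few of the standard-library areas listed above can be exercised without installing anything. The input strings below are invented for this sketch.

```python
import re
from decimal import Decimal
from email.message import EmailMessage

# Decimal arithmetic avoids binary floating-point rounding.
price = Decimal("0.10") + Decimal("0.20")

# Regular expressions: pull the handles out of a message.
handles = re.findall(r"@(\w+)", "ping @alice and @bot_42")

# MIME support: build a plain-text e-mail message.
msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("hello")
content_type = msg.get_content_type()

print(price, handles, content_type)
```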
Some parts of the standard library are covered by specifications, but most modules are
not. They are specified by their code, internal documentation, and test suites (if supplied).
However, because most of the standard library is cross-platform Python code, only a few
modules need altering or rewriting for variant implementations.
As of March 2018, the Python Package Index (PyPI), the official repository for third-
party Python software, contains over 130,000 packages with a wide range of functionality,
including:
● Graphical user interfaces
● Web frameworks
● Multimedia
● Databases
● Networking
● Test frameworks
● Automation
● Web scraping
● Documentation
● System administration
● Scientific computing
● Text processing
DEVELOPMENT:
Python's development is conducted largely through the Python Enhancement Proposal
(PEP) process, the primary mechanism for proposing major new features, collecting community
input on issues and documenting Python design decisions. Python coding style is covered in PEP
8. Outstanding PEPs are reviewed and commented on by the Python community and the steering
council. Enhancement of the language corresponds with development of the CPython reference
implementation. The mailing list python-dev is the primary forum for the language's
development. Specific issues are discussed in the Roundup bug tracker maintained at python.org.
Development originally took place on a self-hosted source-code repository running Mercurial,
until Python moved to GitHub in January 2017.
CPython's public releases come in three types, distinguished by which part of the version
number is incremented:
● Backward-incompatible versions, where code is expected to break and needs to be
manually ported. The first part of the version number is incremented. These releases
happen infrequently; for example, version 3.0 was released 8 years after 2.0.
● Major or "feature" releases, about every 18 months, are largely compatible but introduce
new features. The second part of the version number is incremented. Each major version is
supported by bug fixes for several years after its release.
● Bugfix releases, which introduce no new features, occur about every 3 months and are
made when a sufficient number of bugs have been fixed upstream since the last release.
Security vulnerabilities are also patched in these releases. The third and final part of the
version number is incremented.
Many alpha, beta, and release-candidates are also released as previews and for
testing before final releases. Although there is a rough schedule for each release, they are often
delayed if the code is not ready. Python's development team monitors the state of the code by
running the large unit test suite during development, and using the BuildBot continuous
integration system.
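The three-part version scheme can be made concrete with a small helper. The function release_type is a hypothetical illustration and assumes plain three-part numeric version strings.

```python
def release_type(old, new):
    """Classify an upgrade by the first version component that changed."""
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    names = ("backward-incompatible", "feature", "bugfix")
    for name, before, after in zip(names, old_parts, new_parts):
        if before != after:
            return name
    return "same version"


kinds = [
    release_type("2.7.18", "3.0.0"),
    release_type("3.8.10", "3.9.0"),
    release_type("3.9.0", "3.9.1"),
]
print(kinds)
```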
The community of Python developers has also contributed over 86,000 software modules
(as of 20 August 2016) to the Python Package Index (PyPI), the official repository of third-party
Python libraries.
The major academic conference on Python is PyCon. There are also special Python
mentoring programmes, such as PyLadies.
CHAPTER 6 - IMPLEMENTATION
GENERAL
In this chapter, the coding part is implemented using Eclipse. The code listings below
cover the following stages:
● DATA INTEGRATION TECHNIQUES
● DATA SET COLLECTION
● TRANSITION PROBABILITY
DATA SET COLLECTION:
import datetime
import json
import random
import time


def _record(id1, now, timestamp, event, bb):
    # One click event. "IP" carries the session number within the user's data.
    return {"id": id1, "time": str(now), "timestamp": timestamp,
            "event": event, "IP": bb, "location": "IOS"}


def _write_sessions(numMsgs, id1, eventlist, first_event=None):
    # Shared helper: writes numMsgs simulated sessions for user id1 to
    # dataset.txt as a single JSON array.
    records = []
    for counter in range(0, numMsgs):
        bb = counter + 1
        p1 = random.randrange(1, 6)  # 1 to 5 follow-up events per session
        now = datetime.datetime.now()
        timestamp = int(time.time() * 1000.0)
        if counter == numMsgs - 1:
            now = now + datetime.timedelta(hours=0, minutes=1, seconds=10)
        else:
            now = now + datetime.timedelta(hours=0, minutes=3, seconds=20)
        if first_event is not None:
            records.append(_record(id1, now, timestamp, first_event, bb))
        for counter1 in range(0, p1):
            now = now + datetime.timedelta(hours=0, minutes=counter1, seconds=20)
            records.append(_record(id1, now, timestamp, eventlist[counter1], bb))
    with open("dataset.txt", "w") as f:
        json.dump(records, f)


def datasetcollect(numMsgs, id1):
    # Normal user: every session opens with a "play" event.
    eventlist = ["like", "feedback", "comment", "share", "more"]
    _write_sessions(numMsgs, id1, eventlist, first_event="play")


def botcollect(numMsgs, id1):
    # Bot user. The extracted listing of this function is incomplete; the body
    # is reconstructed by analogy with datasetcollect, without the opening
    # "play" event, which the surviving lines do not write.
    eventlist = ["start", "like", "feedback", "comment", "share"]
    _write_sessions(numMsgs, id1, eventlist)
TRANSITION PROBABILITY:
import datasetcollection as clickstream
import pandas as pd


def clean(df1):
    # Data cleaning: drop every session (IP value) that has at most one event.
    for nn in range(1, 51):
        m = df1[df1['IP'] == nn]
        if m.shape[0] <= 1:
            df1.drop(m.index, inplace=True)
    return df1


# Users 1 to 19 are simulated normal users (class "0").
clickstream.datasetcollect(50, 1)
df = pd.read_json('dataset.txt')
for gg in range(2, 20):
    clickstream.datasetcollect(50, gg)
    df1 = clean(pd.read_json('dataset.txt'))
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df = pd.concat([df, df1], ignore_index=True)
lenn = df.shape[0]
gh = ["0"] * lenn

# Users 20 to 25 are simulated bots (class "1").
for gg in range(20, 26):
    clickstream.botcollect(50, gg)
    df1 = clean(pd.read_json('dataset.txt'))
    df = pd.concat([df, df1], ignore_index=True)
lenn1 = df.shape[0] - lenn
gh.extend(["1"] * lenn1)

# After data cleaning
print(df)
df['class'] = gh
# After data processing
print(df)
# Transition Probability Calculation
# Feature Selection
# P(play;like), P(play;feedback), P(play;comment), P(play;share) and P(play;more)
like = []
feedback = []
comment = []
share = []
more = []
userid = []
seid = []
label = []
for j in range(1, 26):
    userclickstream = df[df['id'] == j]
    for nn in range(1, 51):
        # One feature row per (user, session). The extracted listing does not
        # show where the counters are reset; here they restart for every session.
        m = userclickstream[userclickstream['IP'] == nn]
        kks = m['class'].values
        # Session label: the class of its events, or 1 when the session was
        # removed during data cleaning.
        label.append(kks[-1] if len(kks) else 1)
        kk = m['event'].values
        cc = len(kk)
        probs = {"like": 0, "feedback": 0, "comment": 0, "share": 0, "more": 0}
        for jk in kk:
            if jk in probs:
                # Sessions that open with "play" spread the probability over
                # the follow-up events; all other sessions get probability 1.
                probs[jk] = 1 / (cc - 1) if kk[0] == "play" else 1
        like.append(probs["like"])
        feedback.append(probs["feedback"])
        comment.append(probs["comment"])
        share.append(probs["share"])
        more.append(probs["more"])
        userid.append(j)
        seid.append(nn)
import numpy as np
newDF = pd.DataFrame()
newDF['userid'] = userid
newDF['seid'] = seid
newDF['like'] = like
newDF['share'] = share
newDF['more'] = more
newDF['comment'] = comment
newDF['feedback'] = feedback
newDF['class'] = label
bb = newDF.drop(['userid', 'seid'], axis=1)
# Mean feature vector of each labelled class. The class column is dropped
# first because recent pandas versions cannot average a column of strings.
m1 = bb[bb['class'] == "0"]
cls = np.array(m1.drop(columns='class').mean(axis=0))
m2 = bb[bb['class'] == "1"]
cls2 = np.array(m2.drop(columns='class').mean(axis=0))
test = bb.drop(columns='class').values
cc = [cls.tolist(), cls2.tolist()]
print(test)
# center point calculation
import math


def euc_dist(a, b):
    # Euclidean distance between two feature vectors.
    total = 0
    for i, j in zip(a, b):
        total = total + (i - j) * (i - j)
    return math.sqrt(total)


def cal_dist(centroids, data):
    # Distance from every centroid to every data point.
    c_dist = []
    for i in centroids:
        temp = []
        for j in data:
            temp.append(euc_dist(i, j))
        c_dist.append(temp)
    return c_dist


def perf_clustering(k, dist_table):
    # Assign each data point to the cluster of its nearest centroid.
    clusters = []
    for i in range(k):
        clusters.append([])
    for i in range(len(dist_table[0])):
        d = []
        for j in range(len(dist_table)):
            d.append(dist_table[j][i])
        clusters[d.index(min(d))].append(i)
    return clusters
# distance Calculation:
centroids = cc
result = {}
cluster_mem = []
distance_table = cal_dist(centroids, test)
cluster_table = perf_clustering(2, distance_table)
cluster_mem.append(cluster_table)
for i in range(len(centroids)):
    for j in range(len(cluster_table[i])):
        result[cluster_table[i][j]] = i
l = list(result.items())  # convert the given dict into a list
l.sort()
re = []
for k, v in l:
    re.append(v)
print(re)
def majority_per_user(values, normal, sessions=50):
    # Each user owns a block of 50 sessions. The user is normal (0) when more
    # than half of the sessions in the block are normal, otherwise a bot (1).
    out = []
    for hj in range(0, len(values), sessions):
        block = values[hj:hj + sessions]
        count = sum(1 for v in block if v == normal)
        count1 = len(block) - count
        out.append(0 if count > count1 else 1)
    return out


clabel = majority_per_user(label, "0")
print(clabel)
predict = majority_per_user(re, 0)
print(predict)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

predictions = predict
test_y = clabel
print("Accuracy :: ", accuracy_score(test_y, predictions))
print("Precision :: ", precision_score(test_y, predictions, average="macro"))
print("Recall :: ", recall_score(test_y, predictions, average="macro"))
# scikit-learn expects the true labels first, then the predictions.
score = f1_score(test_y, predictions, average='binary')
print('F-Measure :: %.3f' % score)
for ccoun, jj in enumerate(predictions, start=1):
    if jj == 0:
        print("User id " + str(ccoun) + " : Normal User")
    else:
        print("User id " + str(ccoun) + " : Malicious Bot User")
import numpy as np
import matplotlib.pyplot as plt

plt.rcdefaults()
objects = ('Mixed feature', 'Transition Probability')
y_pos = np.arange(len(objects))
performance = [96, 92]
# The listing gives a single set of bar values; each chart below redraws
# them under its own axis label and title.
for metric in ('Accuracy', 'Precision', 'Recall'):
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects)
    plt.ylabel(metric + ' %')
    plt.title('Prediction ' + metric)
    plt.show()
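The feature rule and the nearest-centroid step used in the listings above can be tried in isolation with the following sketch. It is a simplified illustration, not the project code: the helper names session_features and nearest_centroid are hypothetical, the two centroids are made-up values, and math.dist requires Python 3.8 or later.

```python
import math

ACTIONS = ("like", "share", "more", "comment", "feedback")


def session_features(events):
    """One value per action: 1/(n-1) if the session opens with "play", else 1."""
    if len(events) > 1 and events[0] == "play":
        weight = 1 / (len(events) - 1)
    else:
        weight = 1
    seen = set(events)
    return [weight if action in seen else 0 for action in ACTIONS]


def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Made-up centroids: index 0 stands for normal users, index 1 for bots.
centroids = [[0.4, 0.2, 0.1, 0.3, 0.4], [1, 1, 0, 1, 1]]

normal = session_features(["play", "like", "feedback"])
bot = session_features(["start", "like", "feedback", "comment"])
print(normal, nearest_centroid(normal, centroids))
print(bot, nearest_centroid(bot, centroids))
```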
CHAPTER 7 - SNAPSHOTS
7.1 VARIOUS SNAPSHOTS
(i) DATA CLEANING:
Figure 7.1.1 Data cleaning
(ii) DATA PROCESSING:
Figure 7.1.2 Data processing

(iii) FEATURE SELECTION:
Figure 7.1.3 Feature selection
(iv) SEMI-SUPERVISED CLUSTERING METHOD:
Figure 7.1.4 Semi-supervised clustering

(v) NORMAL USER SET AND SOCIAL BOT SET:
Figure 7.1.5 Normal user set and social bot set
(vi) RESULT EVALUATION:
Figure 7.1.6 Result evaluation

(a) PREDICTION RECALL:
Figure 7.1.7 Prediction recall
(b) PREDICTION PRECISION:
Figure 7.1.8 Prediction precision

(c) PREDICTION ACCURACY:
Figure 7.1.9 Prediction accuracy

CHAPTER 8 - SOFTWARE TESTING
GENERAL
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
DEVELOPING METHODOLOGIES
The test process is initiated by developing a comprehensive plan to test the general
functionality and special features on a variety of platform combinations. Strict quality control
procedures are used.
The process verifies that the application meets the requirements specified in the system
requirements document and is bug free. The types of testing below are the considerations
used to develop the testing methodology.
TYPES OF TESTING
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic
is functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application. It is done after the completion of an individual unit, before integration.
This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined inputs
and expected results.
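As an illustration, a unit test for a distance helper such as the euc_dist function from the implementation chapter could look like the sketch below. The function is restated in a compact form here so that the example is self-contained, and the test cases are assumptions chosen for clarity rather than the tests actually run for this project.

```python
import math
import unittest


def euc_dist(a, b):
    # Euclidean distance between two equally long vectors.
    return math.sqrt(sum((i - j) ** 2 for i, j in zip(a, b)))


class EucDistTest(unittest.TestCase):
    def test_known_triangle(self):
        # Clearly defined input and expected result: the 3-4-5 triangle.
        self.assertEqual(euc_dist([0, 0], [3, 4]), 5.0)

    def test_identical_points(self):
        self.assertEqual(euc_dist([1, 2, 3], [1, 2, 3]), 0.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(EucDistTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```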
Integration testing
Integration tests are designed to test integrated software components to determine
if they actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
Functional testing
Functional tests provide systematic demonstrations that functions tested are available
as specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions,
or special test cases. In addition, systematic coverage of business process flows, data fields,
predefined processes, and successive processes must be considered for testing.
Before functional testing is complete, additional tests are identified and the effective value of
current tests is determined.
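The valid-input and invalid-input items above can be sketched as a small functional check. The function parse_event and its rule are hypothetical; the set of event names is taken from the data set collection listing.

```python
VALID_EVENTS = {"play", "like", "feedback", "comment", "share", "more"}


def parse_event(record):
    """Hypothetical input check: accept known events, reject everything else."""
    event = record.get("event")
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    return event


# Valid input: an identified class of valid input must be accepted.
accepted = parse_event({"id": 1, "event": "like"})

# Invalid input: an identified class of invalid input must be rejected.
try:
    parse_event({"id": 1, "event": "teleport"})
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```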
System Test
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing is
the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is purposeful. It is
used to test areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like most other kinds
of tests, must be written from a definitive source document, such as a specification or requirements
document.
It is testing in which the software under test is treated as a black box: you cannot “see”
into it. The test provides inputs and checks the outputs without considering how the software
works.
Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
● All field entries must work properly.
● Pages must be activated from the identified link.
● The entry screen, messages and responses must not be delayed.
Features to be tested
● Verify that the entries are of the correct format.
● No duplicate entries should be allowed.
● All links should take the user to the correct page.
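The first two features, correct entry format and no duplicates, can be checked automatically as well as manually. The sketch below is illustrative only: the `EntryStore` class, the `rejects` helper and the e-mail-like entry format are assumptions, since the report does not specify the field formats.

```python
import re

# Assumed format for illustration: a simple name@domain.tld e-mail shape.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class EntryStore:
    """Hypothetical in-memory store used only to illustrate the checks."""

    def __init__(self):
        self._entries = set()

    def add(self, email):
        """Add an entry; raise ValueError on a bad format or a duplicate."""
        if not EMAIL_PATTERN.fullmatch(email):
            raise ValueError("invalid format: " + email)
        key = email.lower()  # treat entries that differ only in case as equal
        if key in self._entries:
            raise ValueError("duplicate entry: " + email)
        self._entries.add(key)


def rejects(store, email):
    """Return True if the store refuses the entry."""
    try:
        store.add(email)
    except ValueError:
        return True
    return False


store = EntryStore()
assert not rejects(store, "user@example.com")  # correct format, first entry
assert rejects(store, "user@example.com")      # duplicate
assert rejects(store, "USER@example.com")      # duplicate ignoring case
assert rejects(store, "not-an-email")          # wrong format
```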
Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to expose failures caused by interface
defects.
The task of the integration test is to check that components or software applications (for example,
components in a software system or, one step up, software applications at the company level)
interact without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional requirements.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
BUILD THE TEST PLAN

A Test Plan is a detailed document that describes the test strategy, objectives, schedule,
estimation, deliverables, and resources required to perform testing for a software product. The test
plan helps us determine the effort needed to validate the quality of the application under test. It
serves as a blueprint for conducting software testing activities as a defined process, which is
closely monitored and controlled by the test manager.
Follow the seven steps below to create a test plan as per IEEE 829:
1. Analyze the product
2. Design the Test Strategy
3. Define the Test Objectives
4. Define Test Criteria, Resource Planning
5. Plan Test Environment
6. Schedule & Estimation
7. Determine Test Deliverables
Any project can be divided into units that can each be examined in detail. A testing strategy is
then carried out for each of these units.
CHAPTER 9 - CONCLUSION
CONCLUSION:
In this project, we proposed a novel method to accurately detect malicious social bots in
online social networks. Experiments showed that the transition probability between user
clickstreams, based on social situation analytics, can be used to detect malicious social bots in
online social platforms accurately. In future research, additional behaviors of malicious social
bots will be considered, and the proposed detection approach will be extended and
optimized to identify the specific intentions and purposes of a broader range of malicious social
bots.
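To make the central feature concrete, the sketch below estimates first-order transition probabilities between consecutive clickstream actions by counting adjacent pairs. It is a simplified illustration under stated assumptions, not the implementation used in this project: the function name `transition_probabilities` and the action labels are made up for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next action | current action) from clickstream sequences."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        # Walk over each pair of consecutive actions in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            pair_counts[current][nxt] += 1
    probs = {}
    for current, counter in pair_counts.items():
        total = sum(counter.values())
        # Normalise the counts so that each row sums to 1.
        probs[current] = {nxt: n / total for nxt, n in counter.items()}
    return probs


# Made-up clickstreams for one user; action names are illustrative only.
clicks = [
    ["login", "view", "like", "view", "logout"],
    ["login", "view", "share", "logout"],
]
probs = transition_probabilities(clicks)
print(probs["view"])
```

Each row of the resulting table can serve as part of a per-user feature vector, which a clustering or classification step could then compare across accounts.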
FUTURE ENHANCEMENT:
As future directions, we plan:
i) To develop an app that can detect malicious bots, with options to choose from the various social
network applications available.
ii) To add a feature for blocking or reporting spam bots when they are detected.
REFERENCES
[1] C. Cai, L. Li, and D. Zeng, "Behavior enhanced deep bot detection in social media,"
in Proc. IEEE Int. Conf. Intell. Secur. Inform. (ISI), Beijing, China, Jul. 2017, pp. 128-130.
[2] T.-K. Huang, M. S. Rahman, H. V. Madhyastha, M. Faloutsos, and B. Ribeiro, "An
analysis of socware cascades in online social networks," in Proc. 22nd Int. Conf. World Wide
Web, Rio de Janeiro, Brazil, 2013, pp. 619-630.
[3] F. Morstatter, L. Wu, T. H. Nazer, K. M. Carley, and H. Liu, "A new approach to bot
detection: Striking the balance between precision and recall," in Proc. IEEE/ACM Int. Conf. Adv.
Social Netw. Anal. Mining, San Francisco, CA, USA, Aug. 2016, pp. 533-540.
[4] Z. Zhang, C. Li, B. B. Gupta, and D. Niu, "Efficient compressed ciphertext length
scheme using multi-authority CP-ABE for hierarchical attributes," IEEE Access, vol. 6, pp.
38273-38284, 2018. doi:10.1109/ACCESS.2018.2854600.
[5] Y. Zhou et al., "ProGuard: Detecting malicious accounts in social-network-based
online promotions," IEEE Access, vol. 5, pp. 1990-1999, 2017.

