DBSCAN is one of the best-known algorithms in the field of density-based clustering, although its applicability to large datasets is generally disputed because of its high computational complexity. The aim of this work is to propose DBSCOUT: a new, parallel, Spark-based procedure designed solely for anomaly detection, consistent with the DBSCAN definition and suitable for the big data context. From a theoretical standpoint, the algorithm is characterized by a worst-case performance bound that depends linearly on the size of the dataset; in practical tests, it outperforms available solutions in terms of both result quality and overall scalability as the data grow large.
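To make the notion of an outlier under the DBSCAN definition concrete, the following is a minimal brute-force sketch in Java. It is not the DBSCOUT algorithm and does not reflect the code in this repository: it compares every pair of points, so its cost is quadratic rather than linear. The class and method names are hypothetical, and it assumes Euclidean distance and the common convention that a point counts itself among its neighbors.

```java
import java.util.ArrayList;
import java.util.List;

public class NoiseByDefinition {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of the points that DBSCAN would label as noise.
    static List<Integer> outliers(double[][] points, double eps, int minPts) {
        int n = points.length;
        boolean[] core = new boolean[n];
        // Pass 1: a point is a core point if at least minPts points
        // (itself included, by assumption) lie within distance eps of it.
        for (int i = 0; i < n; i++) {
            int neighbors = 0;
            for (int j = 0; j < n; j++) {
                if (distance(points[i], points[j]) <= eps) {
                    neighbors++;
                }
            }
            core[i] = neighbors >= minPts;
        }
        // Pass 2: a non-core point is an outlier if no core point lies within eps.
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (core[i]) {
                continue;
            }
            boolean nearCore = false;
            for (int j = 0; j < n && !nearCore; j++) {
                nearCore = core[j] && distance(points[i], points[j]) <= eps;
            }
            if (!nearCore) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0.0, 0.0}, {0.1, 0.0}, {0.0, 0.1}, {0.1, 0.1}, // dense cluster
            {5.0, 5.0}                                      // isolated point
        };
        System.out.println(outliers(points, 0.5, 3)); // prints [4]
    }
}
```

The two nested loops are what make this naive approach impractical on large datasets, which is the cost the linear worst-case bound described above is meant to avoid.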
The repository is organized in the following folders:
- `algcomp`: scripts to run reference algorithms for result comparison.
- `data`: data generation scripts.
- `dbscout`: the code of DBSCOUT.
- `utils`: Spark-based utility scripts.
- `visualization`: result visualization scripts.
The code for DBSCOUT is stored in the `dbscout` folder, in the form of a Maven project. Compile using `mvn package` and run either through `java` or `spark-submit`.
Supported command-line options include:
- `--algClass`: the algorithm class (required). Available options:
  - `dbscout.algorithm.sequential.SequentialOutlierDetector`: the base version of the sequential algorithm.
  - `dbscout.algorithm.parallel.ParallelOutlierDetector`: the base version of the parallel algorithm.
  - `dbscout.algorithm.parallel.BroadcastOutlierDetector`: the broadcast-join optimized version of the parallel algorithm.
  - `dbscout.algorithm.parallel.GroupedOutlierDetector`: the grouping-before-join optimized version of the parallel algorithm.
- `--dim`: the data dimensions (required).
- `--eps`: the value of the epsilon parameter (required).
- `--inputPath`: the input path (required).
- `--minPts`: the value of the minPts parameter (required).
- `--numPart`: the number of data partitions.
- `--outputPath`: the output path (required).
- `--stats`: print dataset statistics.
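
As an illustration of how the required options fit together, here is a small Java sketch that assembles a `spark-submit` command line. It is only a sketch under stated assumptions: the helper class `DbscoutCommand`, the jar name `target/dbscout.jar`, the input and output paths, and the parameter values are all hypothetical, and the `--option value` spacing is assumed rather than confirmed by this README. It also assumes the packaged jar declares its entry point in the manifest; if it does not, `spark-submit` would additionally need its `--class` option, and this README does not name the main class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DbscoutCommand {

    // Assembles a spark-submit command line from the required options
    // listed above. The "--option value" spacing is an assumption.
    static List<String> build(String jarPath, String algClass, int dim,
                              double eps, int minPts,
                              String inputPath, String outputPath) {
        List<String> cmd = new ArrayList<>(Arrays.asList("spark-submit", jarPath));
        cmd.addAll(Arrays.asList("--algClass", algClass));
        cmd.addAll(Arrays.asList("--dim", String.valueOf(dim)));
        cmd.addAll(Arrays.asList("--eps", String.valueOf(eps)));
        cmd.addAll(Arrays.asList("--minPts", String.valueOf(minPts)));
        cmd.addAll(Arrays.asList("--inputPath", inputPath));
        cmd.addAll(Arrays.asList("--outputPath", outputPath));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
            "target/dbscout.jar",   // hypothetical jar name
            "dbscout.algorithm.parallel.ParallelOutlierDetector",
            2, 0.5, 10,             // illustrative dim, eps, minPts
            "data/input.csv",       // hypothetical input path
            "results/outliers");    // hypothetical output path
        // Print the command so it can be inspected before being run.
        System.out.println(String.join(" ", cmd));
    }
}
```

The optional `--numPart` and `--stats` options are left out here; they could be appended to the list in the same way.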
M. Corain, P. Garza and A. Asudeh, "DBSCOUT: A Density-based Method for Scalable Outlier Detection in Very Large Datasets", 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, pp. 37-48, doi: 10.1109/ICDE51399.2021.00011.
M. Corain. "A Density-based Method for Scalable Outlier Detection in Large Datasets". Master's thesis, Politecnico di Torino and University of Illinois at Chicago, May 2020.