CHAPTER 1
INTRODUCTION
1.1 OVERVIEW:
The objective of this system is to discover clustered components in a high-dimensional data space, where each dimension denotes an attribute, using an algorithm called "projective clustering based on the k-means algorithm". Each attribute contains a set of values, one from each data point. The algorithm that we propose does not presume any distribution on each individual dimension of the input data. Furthermore, there is no restriction imposed on the size of the clusters or the number of relevant dimensions of each cluster. A projected cluster should have a significant number of selected (i.e., relevant) dimensions with high relevance, in which a large number of points are close to each other.
The projected clustering problem is to identify a set of clusters and their relevant dimensions such that intra-cluster similarity is maximized while inter-cluster similarity is minimized. The system takes as input multi-dimensional data in Excel format (.xls), the maximum number of components in an individual dimension, and the number of nearest elements. It generates the scale parameter and shape parameter by maximizing the likelihood function with the EM algorithm. The cluster centers are initialized, a membership matrix is generated, and the result is tested for convergence. If the data points in a cluster are not relevant (i.e., there is too much sparseness among the data points), the cluster is split into further groups and checked again. Based on these scale parameters, the code length is calculated and a matrix is generated. Outlier handling is then performed to remove irrelevant data points from the data space: the data points that lie outside or far away from the data clusters are identified and removed from the dataset. After removing the outliers, the projected clusters are computed by applying a distance function within each individual cluster.
1.2 OBJECTIVES:
1.3 PROJECT PLAN:
The project "Clustering High Dimensional Data" is created using Java as the front end. The Eclipse Java IDE is used for developing the tool. The backend intermediate files are stored as xls spreadsheet files. The Windows 7 operating system was used during the development of the project.
The project takes as input an Excel spreadsheet containing different kinds of data values about people. The output decides the optimal number of clusters. Intermediate results contain the sparseness degree values, the clustering produced by the k-means algorithm, and the Bayesian information criterion value used to fix the optimal number of clusters.
CHAPTER 2
LITERATURE SUMMARY
Rong et al. discussed Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a density-based clustering algorithm, in 2004. It performs clustering by growing high-density regions and can find clusters of arbitrary shape. DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbor points within distance eps of the starting point. If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively. If the number of neighbors is less than minPts, the point is marked as noise. Once a cluster is fully expanded (all points within reach have been visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
DBSCAN does not require the number of clusters in the data to be known a priori, as opposed to k-means. It can find arbitrarily shaped clusters, including clusters completely surrounded by (but not connected to) a different cluster. Due to the minPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. The algorithm has a notion of noise, and it is mostly insensitive to the ordering of the points in the database. On the other hand, DBSCAN does not handle data sets with widely varying densities well.
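To make the expansion procedure above concrete, the following is a minimal Java sketch of DBSCAN, assuming an in-memory array of points, a plain Euclidean distance, and no spatial index; the class and method names are illustrative and the code is not taken from any of the cited works.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal DBSCAN sketch: labels[i] = cluster id, 0 = unvisited, -1 = noise.
public class DbscanSketch {

    public static int[] dbscan(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];      // 0 means "not yet visited"
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (labels[p] != 0) continue;           // already processed
            List<Integer> neighbors = regionQuery(points, p, eps);
            if (neighbors.size() < minPts) {
                labels[p] = -1;                     // noise (may later become a border point)
                continue;
            }
            clusterId++;
            labels[p] = clusterId;
            // Expand the cluster by visiting every density-reachable point.
            for (int i = 0; i < neighbors.size(); i++) {
                int q = neighbors.get(i);
                if (labels[q] == -1) labels[q] = clusterId;   // former noise becomes a border point
                if (labels[q] != 0) continue;                 // already assigned
                labels[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) neighbors.addAll(qNeighbors); // core point: grow frontier
            }
        }
        return labels;
    }

    // All indices within distance eps of point p (brute force, O(N) per query).
    private static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            double sum = 0;
            for (int d = 0; d < pts[p].length; d++) {
                double diff = pts[p][d] - pts[i][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= eps) result.add(i);
        }
        return result;
    }
}
```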
certain input patterns. Initially, the weights and the learning rate are set. The input vectors to be clustered are presented to the network. Once an input vector is given, the winner unit is calculated from the current weights, either by the Euclidean distance method or by the sum-of-products method. Based on the winner-unit selection, the weights of that particular winner unit are updated. An epoch is said to be completed once all the input vectors have been presented to the network. By updating the learning rate, several epochs of training may be performed.
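As a rough illustration of the winner-unit step described above, the sketch below selects the winner by Euclidean distance and moves only that unit's weights toward the presented input vector. The class name, the single-epoch structure, and the fixed learning rate are assumptions for illustration, not details from the cited work.

```java
// Illustrative winner-take-all update for one training epoch.
// 'weights[u]' is the weight vector of unit u; 'inputs[v]' is input vector v.
public class WinnerUnitSketch {

    public static void trainEpoch(double[][] weights, double[][] inputs, double learningRate) {
        for (double[] x : inputs) {
            int winner = findWinner(weights, x);
            // Move only the winner's weights toward the presented input vector.
            for (int d = 0; d < x.length; d++) {
                weights[winner][d] += learningRate * (x[d] - weights[winner][d]);
            }
        }
    }

    // Winner = unit whose weight vector has the smallest Euclidean distance to x.
    private static int findWinner(double[][] weights, double[] x) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int u = 0; u < weights.length; u++) {
            double sum = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - weights[u][d];
                sum += diff * diff;
            }
            if (sum < bestDist) { bestDist = sum; best = u; }
        }
        return best;
    }
}
```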
Yip et al. presented HARP, a hierarchical subspace clustering approach with automatic relevant dimension selection. HARP is based on the assumption that two objects are likely
to belong to the same cluster if they are very similar to each other along many dimensions.
Clusters are allowed to merge only if they are similar enough in a number of dimensions, where
the minimum similarity and the minimum number of similar dimensions are controlled by two
internal threshold parameters. Due to the hierarchical nature, the algorithm is intrinsically slow.
Also, if the number of relevant dimensions per cluster is extremely low, the accuracy of HARP
may drop as the basic assumption will become less valid due to the presence of a large amount of
noise values in the data set. A dimension receives an index value close to the maximum value
(one) if the local variance is extremely small, which means the projections form an excellent
signature for identifying the cluster members. Alternatively, if the local variance is only as large
as the global variance, the dimension will receive an index value of zero.
number of natural clusters is smaller than the maximum number of clusters, it will return all the discovered clusters (there may be fewer than the maximum number of such clusters). In other cases, it will return the top-ranked clusters up to the maximum number. Therefore, an inaccurate estimation of this parameter will not affect the accuracy of the clustering output.
It produces identical results irrespective of the order in which the input records are
presented, and it does not presume any canonical distribution for the input data. Empirical
evaluation shows that CLIQUE scales linearly with the number of input records, and has good
scalability as the number of dimensions (attributes) in the data or the highest dimension in which
clusters are embedded is increased.
C.M. Procopiuc et al. proposed a fast projective clustering algorithm, "Monte Carlo Algorithm for Fast Projective Clustering," in 2002. This Monte Carlo algorithm computes projective clusters iteratively. During each iteration, an approximation of an optimal cluster over the current set of points is computed. The termination criterion can be defined in more than one way, e.g., a certain percentage of the points have been clustered, or a user-specified number of clusters have been computed. By contrast to partitioning methods, the user need not specify the number of clusters k unless he wants to. This allows more flexibility in tuning the algorithm to the particular application that uses it. One particularly desirable property of this method is that it is accurate even when the cluster sizes vary significantly (in terms of number of
points). Many partitioning methods rely on random sampling to compute an initial partition, so small clusters may not be represented in the sample; as a result, their points are either assigned to other clusters or declared outliers. A greedy method is employed in this algorithm, which computes each cluster in turn. Its accuracy depends on finding a good definition for an optimal projective cluster. It proves highly accurate and stable for various types of data on which partitioning algorithms are not always successful.
The naive k-means algorithm partitions the dataset into k subsets such that each subset contains a center, and the points in a given subset are closer to that center than to any other center. The algorithm keeps track of the centroids of the subsets and proceeds in simple iterations. The initial partitioning is randomly generated, that is, the centroids are randomly initialized to some points in the region of the space. In each iteration step, a new set of centroids is generated from the existing set of centroids following two very simple steps.
(i) Partition the points based on the centroids C(i), that is, find the centroids to which
each of the points in the dataset belongs. The points are partitioned based on the Euclidean
distance from the centroids.
(ii) Set the new centroid of each subset to be the mean of all the points assigned to that subset. The algorithm is said to have converged when recomputing the partitions does not result in a change in the partitioning. For configurations where no point is equidistant to more than one center, this convergence condition can always be reached. This convergence property, along with its simplicity, adds to the attractiveness of the k-means algorithm.
The k-means algorithm needs to perform a large number of "nearest-neighbour" queries for the points in the dataset. If the data is d-dimensional and there are N points in the dataset, the cost of a single iteration is O(kdN). Sometimes the convergence of the centroids (i.e., C(i) and C(i+1) being identical) takes several iterations, and in the last several iterations the centroids move very little. Since running these expensive iterations many more times might not be efficient, we need a measure of convergence of the centroids so that we can stop the iterations when the convergence criteria are met; the most widely accepted such measure is the distortion, i.e., the sum of squared distances between the points and their assigned centroids.
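The two iteration steps above can be written compactly. The following is a minimal Java sketch of the naive k-means loop under the assumptions stated in the text (Euclidean distance, centroids initialized to randomly chosen data points, iteration until the assignment stops changing); the class and method names are illustrative and this is not the project's actual code.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal k-means sketch: O(k*d*N) work per iteration, as discussed above.
public class KMeansSketch {

    public static int[] cluster(double[][] data, int k, long seed) {
        int n = data.length, d = data[0].length;
        Random rnd = new Random(seed);
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = data[rnd.nextInt(n)].clone();   // init centroids to random data points
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);                     // force at least one assignment pass
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step (i): assign each point to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = data[i][j] - centers[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // Step (ii): recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
                }   // empty clusters keep their previous center in this sketch
            }
        }
        return assignment;
    }
}
```

Each pass over the data performs k distance computations of cost O(d) for each of the N points, which matches the O(kdN) per-iteration cost noted above.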
CHAPTER 3
SYSTEM ARCHITECTURE
[System architecture diagram: applying the k-means algorithm]
3.1 Finding nearest neighbours
Program gets the excel spread sheet values as input. The following works are carried out in
finding nearest neighbors.
The main processing unit of the system is estimating sparseness degree, used to know the
data points graphically. Specifically Estimating sparseness degree includes intializing of centers
to sets of nearest neighbors and measuring distance from center to remaining data point
attributes.
Using the k-means algorithm, the data points are clustered into the maximum number of clusters that the user wants to consider. Centers for this process are initialized from the data points. The process is repeated to find optimum centers, using a distance function.
For each set of clusters, the Bayesian information criterion (BIC) value is found. The number of clusters with the lowest BIC value is taken as the optimum number of clusters. The BIC value is computed using the gamma function and logarithmic values.
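As an illustration of the nearest-neighbor step, the sketch below computes a per-point, per-dimension sparseness degree as the average distance from a value to its k nearest values in the same dimension. This reading of the description above is an assumption (the project's exact formula may differ), and the class name is hypothetical.

```java
import java.util.Arrays;

// Sketch: per-dimension sparseness degree estimated from k nearest neighbors.
// Assumption: the sparseness degree of point i in dimension j is the average
// distance from data[i][j] to its k nearest values in dimension j. A large
// value means the point lies in a sparse (low-density) region of that dimension.
public class SparsenessDegreeSketch {

    public static double[][] sparsenessDegrees(double[][] data, int k) {
        int n = data.length, dims = data[0].length;
        double[][] degrees = new double[n][dims];
        for (int j = 0; j < dims; j++) {
            for (int i = 0; i < n; i++) {
                // Distances from point i to every other point along dimension j.
                double[] dist = new double[n - 1];
                int idx = 0;
                for (int p = 0; p < n; p++) {
                    if (p == i) continue;
                    dist[idx++] = Math.abs(data[i][j] - data[p][j]);
                }
                Arrays.sort(dist);
                // Average distance to the k nearest values in this dimension.
                int m = Math.min(k, dist.length);
                double sum = 0;
                for (int q = 0; q < m; q++) sum += dist[q];
                degrees[i][j] = sum / m;
            }
        }
        return degrees;
    }
}
```

Sorting each dimension once, as mentioned in the work log, would avoid the brute-force distance computations used in this sketch.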
3.5 Data flow diagrams
[Data flow diagrams. Level 0: the high-dimensional data space is passed to PDF estimation for all data points (1), which yields the optimum number of clusters. Level 1: compute the sparseness degree (1.1), normalize the sparseness degree values, and split the data into clustered sets (1.2) using the centers. Level 2: find nearest neighbors (1.1.1), find centers for the data sets (1.1.2), find the sparseness degree (1.1.3), and store the nearest neighbors and centers; identify centers for the dataset (1.2.1), split the data into clusters, and generate clusters (1.2.3); apply the maximum likelihood function (1.3.1) and compute the BIC value (1.3.2) to obtain the optimal number of clusters.]
CHAPTER 4
4.1 INTRODUCTION
In this phase, the system "Clustering High Dimensional Data" implements three of its six modules: estimating the sparseness degree, the clustering process, and the Bayesian information criterion for finding the optimal number of clusters.
The main processing unit of the system estimates the sparseness degree, which is used to understand how the data points are distributed. Specifically, estimating the sparseness degree includes initializing centers to sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes.
Module 1
4.3 Clustering Process
The main aim of the clustering process is to divide the given data points into the required number of smaller groups. Using the k-means algorithm, the data points are clustered into the maximum number of clusters that the user wants to consider. Centers for this process are initialized from the data points. The process is repeated to find optimum centers, using a distance function.
Module 2:
4.4 Bayesian Information criterion
The main aim of this module, the Bayesian information criterion, is to compute the optimal number of clusters. For each set of clusters, the Bayesian information criterion (BIC) value is found. The number of clusters with the lowest BIC value is taken as the optimum number of clusters. The BIC value is computed using the gamma function and logarithmic values.
Module 3:
CHAPTER 5
The system gets the input data set values from the Excel spreadsheet. The following work is carried out in finding the nearest neighbors. The main processing unit of the system estimates the sparseness degree, which is used to understand how the data points are distributed. Specifically, estimating the sparseness degree includes initializing centers to sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes, along with sorting the values in each dimension and performing comparisons to estimate the nearest values.
Key Steps
Step3: Finding no. of nearest neighbors for each magnitude value in each dimension.
Fig 5.1: Procedure for finding the sparseness degree of the data points in each dimension
5.1.2 Clustering:
In this approach, the data set points are clustered into the maximum number of clustered sections to be considered when classifying the optimal number of clusters. Centers for this process are initialized from the data points, as in the k-means algorithm, and the process is repeated for every number of clusters. Each set of clusters is formed by the k-means algorithm by evaluating distance measures. Finally, the clustered data sets are the result of the system.
Clustering process
Input: maximum no.of clusters, dataset
Key Steps
Step2: Repeat the following process for m=1 to the maximum value given as input.
Step4: Initialize the clusters at random positions generated by a random function.
Step5: Find the Euclidean distance from each data point to each cluster center.
Step6: Allocate each data point to the group whose center is at minimum distance from it.
Step8: Repeat the process until the current cluster centers are equal to the previous cluster centers (a minimal sketch of this stopping test is given after this list).
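The stopping condition in Step 8 can be implemented as an element-wise comparison of the previous and current center matrices, as in the minimal sketch below; the tolerance parameter is an assumption (an exact equality test also works once the assignments stop changing).

```java
// Sketch of the Step 8 stopping test: centers are considered unchanged when
// every coordinate moved by less than a small tolerance (tolerance is an assumption).
public final class ConvergenceCheck {

    public static boolean centersUnchanged(double[][] previous, double[][] current, double tol) {
        for (int c = 0; c < previous.length; c++) {
            for (int d = 0; d < previous[c].length; d++) {
                if (Math.abs(previous[c][d] - current[c][d]) > tol) {
                    return false;   // at least one center still moved noticeably
                }
            }
        }
        return true;                // all centers stable: stop iterating
    }
}
```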
5.1.3 Finding Optimal no.of clusters
The clustered data sets are given as input to this module. They are subjected to the Expectation Maximization (EM) algorithm to find parameters such as the scale factor (α) and shape factor (β). Based on these parameters, the Bayesian information criterion is evaluated using the gamma function (Γ) and likelihood functions. The number of clusters for which the Bayesian information criterion value is lowest is considered optimal.
Key Steps
Step1: Estimate the scale factor and shape factor for each group of data points.
Step4: Estimate the maximum likelihood function for all clusterings m=1 to the maximum number of clusters.
Step5: Estimate the Bayesian information criterion for all values m=1 to the maximum number of clusters (a hedged sketch is given below).
Step6: Consider as optimal the number of clusters for which the BIC value is minimum among all.
Output: scale factor, shape factor, BIC and optimal count of clusters
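The snippet below is a hedged sketch of Steps 4-6: it evaluates a gamma log-likelihood from given shape (β) and scale (α) values and plugs it into the standard form BIC = -2 ln L + p ln n. The use of Apache Commons Math for the log-gamma function, the choice of p = 2 parameters per cluster, and the data being scored are all assumptions for illustration; the project's exact likelihood and penalty terms may differ.

```java
import org.apache.commons.math3.special.Gamma;   // assumed dependency for logGamma

// Hedged sketch: score one candidate clustering with BIC = -2 ln L + p ln n,
// where the per-cluster likelihood is a gamma density over sparseness degrees.
// shape[c] (β) and scale[c] (α) are assumed to come from the EM / maximum
// likelihood step; the free-parameter count p is illustrative.
public class BicSketch {

    // Log-density of a gamma distribution with the given shape and scale at x > 0.
    static double logGammaDensity(double x, double shape, double scale) {
        return (shape - 1) * Math.log(x) - x / scale
                - Gamma.logGamma(shape) - shape * Math.log(scale);
    }

    // clusters[c] holds the sparseness-degree values assigned to cluster c.
    public static double bic(double[][] clusters, double[] shape, double[] scale) {
        double logLikelihood = 0;
        int n = 0;
        for (int c = 0; c < clusters.length; c++) {
            for (double x : clusters[c]) {
                logLikelihood += logGammaDensity(x, shape[c], scale[c]);
                n++;
            }
        }
        int p = 2 * clusters.length;                 // two parameters (α, β) per cluster
        return -2 * logLikelihood + p * Math.log(n); // smaller BIC = better model
    }
}
```

The clustering m with the smallest returned value over m = 1 to the maximum number of clusters is then taken as the optimal number of clusters, as in Step 6.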
5.2 Results
The main processing unit of the system estimates the sparseness degree, which is used to understand how the data points are distributed. Specifically, estimating the sparseness degree includes initializing centers to sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes.
Fig 5.5: Shows the sparseness degree values in each dimension
Fig 5.6: Shows the sparseness degree values in each dimension
5.2.2 Module 2 - Clustering into datasets
Using the k-means algorithm, the data points are clustered into the maximum number of clusters that the user wants to consider. Centers for this process are initialized from the data points. The process is repeated to find optimum centers, using a distance function.
Fig 5.8: clustered data sets
5.2.3 Module 3 - Bayesian Information criterion:
For each set of clusters, the Bayesian information criterion (BIC) value is found. The number of clusters with the lowest BIC value is taken as the optimum number of clusters. The BIC value is computed using the gamma function and logarithmic values.
Output: scale factor, shape factor, BIC and optimal count of clusters
Fig 5.10: scale factor, shape factor & BIC values for clustered set m=1
Fig 5.11: scale factor, shape factor & BIC values for clustered set m=2
Fig 5.12: scale factor, shape factor & BIC values for clustered set m=3
Fig 5.13: scale factor, shape factor & BIC values for clustered set m=4
5.3 TEST PLAN
5.3.1 Test case description
The Test Plan is derived from the Functional Specifications, and detailed Design Specifications. The
Test Plan identifies the details of the test approach, identifying the associated test case areas within the
specific product for this release cycle.
Break the product down into distinct parts and identify features of the product that
are to be tested.
To find the expected output of the module
DESCRIPTION AND THE EXPECTED RESULTS OF EACH TEST CASE
TEST CASE ID #1
TEST CASE FIELDS DETAILS
TEST CASE ID #2
TEST CASE FIELDS DETAILS
TEST CASE ID: TEST CASE NAME 2: Compute the sparseness degree for all data points in each dimension
ACTUAL RESULT: Sparseness degree values for each data point in each dimension
EXPECTED RESULT: Sparseness degree values for each data point in each dimension
INFERENCE: VALID
TABLE 5.3: TEST CASE 2
TEST CASE ID #3
TEST CASE FIELDS DETAILS
TEST CASE ID #4
TEST CASE FIELDS DETAILS
TEST CASE ID: TEST CASE NAME 4: Split the data set into clusters
ACTUAL RESULT: The data is split into the corresponding clustered data sets
EXPECTED RESULT: The data is split into the corresponding clustered data sets
INFERENCE: VALID
TABLE 5.5: TEST CASE 4
TEST CASE ID #5
TEST CASE FIELDS DETAILS
TEST CASE ID: TEST CASE NAME 5: Compute BIC values for each number of clusters m=1 to the maximum number of clusters
ACTUAL RESULT: Obtained BIC values for each number of clusters
EXPECTED RESULT: Obtained BIC values for each number of clusters
INFERENCE: VALID
TABLE 5.6: TEST CASE 5
5.4 Performance analysis
5.4.1 Estimation of Sparseness degree
Graphical representation of sparseness degree of data points in each dimension
5.4.2 Clustering Process
Performance evaluation of clustering process is given below
Clustering accuracy = (number of obtained clusters) / (total number of clusters given).
Figure 5.21 shows the graph of clustering accuracy versus test case number.
5.4.3 Evaluating optimal no.of clusters
The optimal count of clusters is determined from the BIC values for each number of clusters m = 1 to the maximum: the clustering with the lowest BIC value is chosen.
In the diagram, the optimal number of clusters is 3, which has the lowest BIC value among the 5 clustered data sets.
The BIC value for each clustered set is evaluated from the scale factor and shape factor parameters.
CHAPTER 6
6.1 Conclusion
The modules for estimating the sparseness degree, the clustering process, and the Bayesian information criterion have been coded successfully.
In the next phase, the work to be carried out is finding outliers (data points that are not relevant to the existing data points in a cluster) and removing them from the dataset using the Jaccard distance, i.e., minimizing the inter-component sparseness and maximizing the intra-component sparseness to obtain effective projected clustering results. Finally, relevant clustered data sets with a high sparseness degree (maximized intra-cluster sparseness degree) will be produced.
REFERENCES
1. Mohamed Bouguessa and Shengrui Wang, "Mining Projected Clusters in High-Dimensional Spaces," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 507-522, Apr. 2009.
2. Haojun Sun, Shengrui Wang, and Qingshan Jiang, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters," Pattern Recognition, vol. 37, pp. 2027-2037, 2004.
3. E.K.K. Ng, A.W. Fu, and R.C. Wong, "Projective Clustering by Histograms," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 3, pp. 369-383, Mar. 2005.
4. Anne Patrikainen and Marina Meila, "Comparing Subspace Clusterings," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 7, pp. 902-916, July 2006.
5. F. Angiulli and C. Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 369-383, Feb. 2005.
6. K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, Nov. 2004.
7. C.C. Aggarwal and P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 210-225, Mar./Apr. 2002.
8. C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali, "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. ACM SIGMOD '02, pp. 418-427, 2002.
9. C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, and J.S. Park, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD '99, pp. 61-72, 1999.
APPENDIX A
Implementation Work Sheet
Project Name : Clustering High Dimensional data
Module Number :1
Module Name : Finding Sparseness degree of all data points in each dimension
Status : Finished
Aim:
Sparseness degree:
The sparseness degree of a set of data points denotes how closely packed or dispersed they are. If the sparseness degree value is high, the data points are assumed to be dispersed over a wide region (low density). If the sparseness degree value is low, the data points are close together (high density).
Ideas to implement:
Key Steps to be followed:
Step3: Finding no. of nearest neighbors for each magnitude value in each dimension.
Work log
Reference
Implementation Work Sheet
Project Name : Clustering High Dimensional Data
Module Number :2
Status : Finished
Aim:
Implementation Details:
Key Steps to follow
Step2: Repeat the following process for m=1 to the maximum value given as input.
Step4: Initialize the clusters at random positions generated by a random function.
Step5: Find the Euclidean distance from each data point to each cluster center.
Step6: Allocate each data point to the group whose center is at minimum distance from it.
Step8: Repeat the process until the current cluster centers are equal to the previous cluster centers.
Work done:
DATE Work done
06.09.2010 Reading description about module 2
07.09.2010 Writing code for the k-means algorithm for a small set of values
09.09.2010 Generalizing algorithm for large set of values using constraints
13.09.2010 Implementing Random function for initialization of clusters
15.09.2010 Changing conditions for repeating loop in algorithm
17.09.2010 Got proper results for k-means algorithm
Table A.2: workdone for review 2
Reference
Implementation Work Sheet
Project Name : Clustering High dimensional data
Module Number :3
Status : Finished
Aim
To estimate the Bayesian Information Criterion (BIC) value for each m from 1 to the maximum number of clusters.
Process:
Based on the BIC value, we can find the optimal number of clusters: the one for which the BIC value is lowest. To calculate the BIC value, we have to calculate parameters for each group of values, namely the scale factor (α) and shape factor (β).
Through these parameters α and β of each group, the maximum likelihood function (L_m) is calculated for m = 1 to the maximum number of clusters.
Ideas to implement:
1. Estimate the parameters for each group (a simple moment-based sketch is given after this list).
2. The parameters are the scale factor (α) and shape factor (β).
3. Estimate the maximum likelihood function.
4. Calculate the Bayesian information criterion.
5. The optimum number of clusters, for which the BIC value is lowest, is produced as output.
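As a hedged illustration of items 1 and 2, the snippet below estimates a gamma distribution's shape (β) and scale (α) for one group of values by the method of moments (shape = mean²/variance, scale = variance/mean). The project itself uses EM / maximum-likelihood estimation, so this closed-form estimate should be read only as a simple approximation or initial guess.

```java
// Hedged sketch: moment-based estimates of a gamma distribution's parameters
// for one group of (positive) values. The project uses EM / maximum likelihood;
// these closed-form estimates are only a simple approximation or initial guess.
public class GammaMomentEstimates {

    public static double[] estimate(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;

        double variance = 0;
        for (double v : values) variance += (v - mean) * (v - mean);
        variance /= values.length;

        double shape = (mean * mean) / variance;   // β estimate
        double scale = variance / mean;            // α estimate
        return new double[] { shape, scale };
    }
}
```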
Key Steps to follow
Step1: Estimate the scale factor and shape factor for each group of data points.
Step4: Estimate the maximum likelihood function for all clusterings m=1 to the maximum number of clusters.
Step5: Estimate the Bayesian information criterion for all values m=1 to the maximum number of clusters.
Step6: Consider as optimal the number of clusters for which the BIC value is minimum among all.
Work done
20.09.2010 Reading detailed description about module 3
22.09.2010 Trace out the results for finding scale factor and shape factor
24.09.2010 Writing code for shape factor and scale factor
27.09.2010 Analyzing Gamma function
29.09.2010 Writing code for Gamma function
30.10.2010 Testing results of Gamma function
01.10.2010 Writing code for finding Maximum likely hood function
04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value
05.10.2010 Integrating all these coding parts and testing results
06.10.2010 Documentation for second Review
08.10.2010 Review 2
Table A.3: workdone for review 2
Reference
Total Work done
DATE Work Done
02.08.2010 Gathering Input data
04.08.2010 Finding the properties of multi dimensional space
06.08.2010 Refreshing Basic concepts of java
10.08.2010 Gathering tools required, Eclipse, java1.6 etc.,
12.08.2010 Preparing Detailed design of the system
13.08.2010 Changing modifications to detailed design
16.08.2010 Preparing Report for 1st review
18.08.2010 Write code for sorting two dimensional array using bubble sort
20.08.2010 Reading nearest neighbors concepts
23.08.2010 Tracing out the implementation of sparseness degree through nearest neighbors
24.08.2010 Review 1
26.08.2010 Writing code for finding nearest neighbors
30.08.2010 Changing modifications to code
31.08.2010 Writing code for calculating sparseness degree
02.09.2010 Normalizing sparseness degree values
03.09.2010 Generating graphs for sparseness degree values for each dimension
06.09.2010 Reading description about module 2
07.09.2010 Writing code for the k-means algorithm for a small set of values
09.09.2010 Generalizing algorithm for large set of values using constraints
13.09.2010 Implementing Random function for initialization of clusters
15.09.2010 Changing conditions for repeating loop in algorithm
17.09.2010 Got proper results for k-means algorithm
20.09.2010 Reading detailed description about module 3
22.09.2010 Trace out the results for finding scale factor and shape factor
24.09.2010 Writing code for shape factor and scale factor
27.09.2010 Analyzing Gamma function
29.09.2010 Writing code for Gamma function
30.10.2010 Testing results of Gamma function
01.10.2010 Writing code for finding Maximum likely hood function
04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value
05.10.2010 Integrating all these coding parts and testing results
06.10.2010 Documentation for second Review
08.10.2010 Review 2
14.10.2010 Searching for real-time data of a hospital
19.10.2010 Mapping out data to existing code
25.10.2010 Documentation
01.11.2010 Rectifying errors
04.11.2010 Doing test cases
07.11.2010 Final report for phase -1
22.11.2010 Modifications to the document
24.11.2010 Final rough draft
Table A.4: total workdone for phase1
APPENDIX - B
SOFTWARE REQUIREMENT SPECIFICATION
1 INTRODUCTION
This is the Software Requirements Specification (SRS) for "Clustering High Dimensional Data Space". The purpose of this document is to present a detailed description of the system. It will explain the purpose and features of the system, the interfaces of the system, what the system will do, the constraints under which it must operate, and how the system will react to external stimuli.
1.1 Purpose
Clustering supports tasks such as pattern recognition and trend analysis. Clustering a high-dimensional data space is a complex task due to the presence of multiple dimensions.
1.2 Scope
The project focuses on generating optimized cluster groups based on relevance analysis, together with the set of outliers that are irrelevant to the other data points in the data space. Projected clustering based on the k-means algorithm helps the system group the relevant data points into components.
1.3 Overview
This section outlines the overall description of the system and the specific requirements of the software, including both the functional and non-functional requirements. The SRS contains the following, in order.
1.3.1 Overall Description of the Product
This section contains the overall description, which includes sub-sections depicting: computation of the sparseness degree for each dimension; splitting the data into clusters as per the given count and rearranging them into sub-clusters if data points are irrelevant; PDF estimation; detection of dense regions; outlier handling, which removes irrelevant data points from the data space; and discovery of the projected clusters.
1.3.2 Specific Requirements
This section of the SRS covers the external interface requirements, followed by the functional requirements, the requirements by feature of the product, the performance requirements, and the design constraints.
It also explains the "utility" factors of the product, namely correctness, efficiency, and responsiveness.
2 OVERALL DESCRIPTION
[Fig B.1: Brief structure of the system — high-dimensional data space, database, PDF for the sparseness degree, clusters.]
2.2 Product Functions
Outlier handling
2.4 Constraints
3 SPECIFIC REQUIREMENTS
Minimum requirements:
RAM: 512 MB
HDD: 80 GB
Processor: Intel P4
The product is built using Java and runs on the Windows operating system. The Eclipse IDE is used for development, and the intermediate results are stored as xls spreadsheet files.
3.2 Software Product Features
3.2.1 Computation of Sparseness degree
Introduction/Purpose of feature
To find the sparseness degree of data points in each dimension of high
dimensional data space.
Stimulus/Response
Stimulus: The user provides the high-dimensional database table, the number of clusters, and the number of dimensions.
Response: Y_ij (the sparseness degree) of each dimension in the data space.
3.3 Functional Requirements
3.3.3.1.2 Use-Case 2
Use-Case name - Split the data into clusters
Actor - Developer.
Pre-condition – Initiate the cluster groups and centers.
Post-condition – The data points in each cluster should be relevant.
3.3.3.1.3 Use-Case 3
Use-Case name - PDF estimation
Actor - Developer.
Pre-condition – The number of clusters and the cluster centers will be given.
Post-condition – The scale factor and shape factor are calculated to analyse the dense regions
and sparse regions in the high dimensional data space.
3.3.4 Non Functional Requirements
3.3.4.1 Security Requirements
3.3.4.2 Reliability
The system yields high-precision results for the given input high-dimensional data space.
3.3.4.3 Efficiency
The system provides higher efficiency in terms of finding the projected clusters from
the data points of high dimensional data space.
3.3.4.4 Correctness
The system prioritizes the clusters and evaluates the centers optimally and provides the
Optimized projected clusters intended by the user