
DATA MINING UNIT 2

CLASSIFICATION:
Classification in data mining refers to the process of categorizing data into predefined
classes or categories based on their features. The primary goal is to build a model that
can accurately predict the class labels of unseen instances.

In classification, the input data typically consists of a set of features or attributes, and
each instance is assigned to one of several possible classes. The classification model
learns patterns and relationships between the input features and class labels from the
training data. Once trained, the model can be used to classify new, unseen instances into
the appropriate classes based on their feature values.
The process of classification involves several steps:
Data Collection: Gathering a dataset containing instances with associated class labels.

Data Preprocessing: Cleaning and preparing the data for analysis, including handling
missing values, normalization, and feature selection or extraction.

Training Phase: Using a machine learning algorithm to build a classification model from
the labeled training data. The model learns to distinguish between different classes based
on the patterns present in the input features.
Evaluation: Assessing the performance of the trained model using evaluation metrics
such as accuracy, precision, recall, or F1-score. This step helps in selecting the best-
performing model and tuning its parameters.

Testing Phase: Applying the trained model to unseen data to predict their class labels
and evaluating its performance on this test set.
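
As a minimal sketch of this workflow, the example below uses the scikit-learn library and its built-in Iris dataset; the library, dataset, model choice, and split ratio are illustrative assumptions, not part of the notes themselves.

# Classification workflow sketch, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data collection: a labeled dataset (Iris is used here only as an example).
X, y = load_iris(return_X_y=True)

# Split into training and test sets, then preprocess (scale) the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Training phase: fit a classifier on the labeled training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Testing phase / evaluation: predict class labels for unseen data and measure accuracy.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))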

Decision Tree Induction in Data Mining


Decision tree induction is a common technique in data mining that is used to generate a
predictive model from a dataset. This technique involves constructing a tree-like
structure, where each internal node represents a test on an attribute, each branch
represents the outcome of the test, and each leaf node represents a prediction. The goal
of decision tree induction is to build a model that can accurately predict the outcome of a
given event, based on the values of the attributes in the dataset.
To build a decision tree, the algorithm first selects the attribute that best splits the data
into distinct classes. This is typically done using a measure of impurity, such as entropy or
the Gini index, which measures the degree of disorder in the data. The algorithm then
repeats this process for each branch of the tree, splitting the data into smaller and
smaller subsets until all of the data is classified.
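
As a rough sketch of how these impurity measures are computed, the plain-Python functions below take the class proportions of a node as input (an assumed interface for illustration):

import math

# Impurity measures used to choose the best split (sketch; inputs are class proportions).
def entropy(proportions):
    # Entropy: -sum(p * log2(p)) over the classes present in the node.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def gini(proportions):
    # Gini index: 1 - sum(p^2); 0 means the node is pure.
    return 1 - sum(p * p for p in proportions)

# Example: a node with 60% of records in one class and 40% in the other.
print(entropy([0.6, 0.4]))  # about 0.971
print(gini([0.6, 0.4]))     # 0.48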

Advantages of Decision Tree Induction


Easy to understand and interpret: Decision trees are a visual and intuitive model that
can be easily understood by both experts and non-experts.
Handle both numerical and categorical data: Decision trees can handle a mix of
numerical and categorical data, which makes them suitable for many different types of
datasets.

Can handle large amounts of data: Decision trees can handle large amounts of data and
can be updated with new data as it becomes available.
Can be used for both classification and regression tasks: Decision trees can be used
for both classification, where the goal is to predict a discrete outcome, and regression,
where the goal is to predict a continuous outcome.

Disadvantages of Decision Tree Induction


Prone to overfitting: Decision trees can become too complex and may not generalize
well to new data. This can lead to poor performance on unseen data.

Sensitive to small changes in the data: Decision trees can be sensitive to small changes
in the data, and a small change in the data can result in a significantly different tree.

Biased towards attributes with many levels: Splitting criteria such as information gain can be biased towards attributes with many distinct values, which may produce splits that do not generalize well.

BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem.

Bayesian classifiers have been found to be comparable in performance with decision tree and selected neural network classifiers. They have also exhibited high accuracy and speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naïve."
Bayes' theorem states that the probability of a hypothesis H given some observed evidence E is proportional to the likelihood of the evidence given the hypothesis, multiplied by the prior probability of the hypothesis, as shown below:

P(H|E) = P(E|H) P(H) / P(E)

where P(H|E) is the posterior probability of H given E, P(E|H) is the likelihood of the evidence given the hypothesis, P(H) is the prior probability of H, and P(E) is the probability of the evidence.
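
A minimal naïve Bayesian classification sketch using scikit-learn's GaussianNB; the library, dataset, and split are illustrative assumptions.

# Naive Bayes sketch, assuming scikit-learn; Iris is only an example dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()          # assumes class-conditional independence of attributes
model.fit(X_train, y_train)

# predict_proba returns class membership probabilities for each test tuple.
print(model.predict_proba(X_test[:3]))
print(model.score(X_test, y_test))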

RULE BASED CLASSIFICATION


Rule-based classification in data mining is a technique in which class decisions are made based on a set of IF-THEN rules. In other words, it is a classification approach governed by a collection of IF-THEN rules. We write an IF-THEN rule as:
“IF condition THEN conclusion.”

IF-THEN Rule
To define the IF-THEN rule, we can split it into two parts:
Rule Antecedent: This is the "IF condition" part of the rule, present on the LHS (left-hand side). The antecedent can have one or more attribute conditions, combined with the logical AND operator.
Rule Consequent: This is present on the rule's RHS (right-hand side). The rule consequent consists of the class prediction.
Assessment of Rule

In rule-based classification in data mining, there are two factors based on which we can assess the rules. These are:
Coverage of Rule: The fraction of the records which satisfy the antecedent conditions of
a particular rule is called the coverage of that rule.
We can calculate this by dividing the number of records that satisfy the rule's antecedent (n1) by the total number of records (n).
Coverage(R) = n1/n

Accuracy of a rule: Of the records that satisfy the antecedent conditions, the fraction that also match the consequent (the predicted class) is called the accuracy of that rule.
We can calculate this by dividing the number of records that satisfy both the antecedent and the consequent (n2) by the number of records that satisfy the antecedent (n1).
Accuracy(R) = n2/n1

Generally, we convert these values into percentages by multiplying them by 100, which makes them easier to interpret.
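
A small sketch of how coverage and accuracy could be computed in plain Python; the rule, attribute names, and records below are made up purely for illustration.

# Coverage and accuracy of a rule, computed over a list of records (illustrative data).
records = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
]

# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
antecedent = lambda r: r["age"] == "youth" and r["student"] == "yes"
consequent = lambda r: r["buys_computer"] == "yes"

n  = len(records)
n1 = sum(1 for r in records if antecedent(r))                    # records covered by R
n2 = sum(1 for r in records if antecedent(r) and consequent(r))  # covered and correct

print("Coverage(R) =", n1 / n)    # 2/4 = 0.5
print("Accuracy(R) =", n2 / n1)   # 1/2 = 0.5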

Model Evaluation and Selection:

Model evaluation is the process of assessing how well a machine learning model
performs on unseen data. It helps you determine if the model generalizes well and can
make accurate predictions on new data points. There are various metrics and techniques
to evaluate models, depending on the type of problem you’re solving (classification,
regression, etc.).

Here are some common evaluation metrics for different types of problems:

1. Classification Metrics:

• Accuracy: The proportion of correctly classified instances out of the total instances.
• Precision: The proportion of true positives out of the total predicted positives.
• Recall (Sensitivity): The proportion of true positives out of the total actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between
the two.
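
A brief sketch of computing these classification metrics with scikit-learn; the label vectors below are made-up examples.

# Classification metrics sketch, assuming scikit-learn; y_true and y_pred are toy values.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))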

2. Regression Metrics:

• Mean Absolute Error (MAE): The average of the absolute differences between the
predicted and actual values.
• Mean Squared Error (MSE): The average of the squared differences between the
predicted and actual values.
• Root Mean Squared Error (RMSE): The square root of the mean squared error, which
is more sensitive to large errors than MAE.
• R-squared: The proportion of the variance in the dependent variable that is
predictable from the independent variables, ranging from 0 to 1.
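
And similarly for the regression metrics, again with toy values and assuming scikit-learn:

# Regression metrics sketch; the true and predicted values are toy examples.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5                 # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(mae, mse, rmse, r2)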

Model selection is the process of choosing the best model from a set of candidates
based on their performance on a validation set. It’s essential because different models
may perform better on different types of data, and there’s often a trade-off between
model complexity and performance. Here are some common techniques for model
selection:

1. Train-Test Split:

The simplest method is to split your data into a training set and a testing set. Train each
candidate model on the training set and evaluate their performance on the testing set.
Choose the model with the best performance on the testing set.

2. K-Fold Cross-Validation:
K-Fold Cross-Validation is a more robust method that divides your data into k equal-sized
folds. For each fold, use it as the testing set while using the remaining k-1 folds as the
training set. Train and evaluate each candidate model k times and calculate the average
performance across all folds. Choose the model with the best average performance.
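
A sketch of k-fold cross-validation with scikit-learn; the model and dataset are placeholders.

# K-fold cross-validation sketch, assuming scikit-learn; Iris and the classifier are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5 splits the data into 5 folds; each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())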

3. Grid Search and Randomized Search:

These methods are used to optimize hyperparameters of a model. In Grid Search, you
define a set of possible values for each hyperparameter, and the algorithm tries every
possible combination. In Randomized Search, the algorithm samples a random
combination of hyperparameters from a specified distribution. Both methods can be
combined with cross-validation for more accurate results.
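
A sketch of hyperparameter search combined with cross-validation using scikit-learn's GridSearchCV; the model and parameter grid are arbitrary examples.

# Grid search sketch, assuming scikit-learn; the parameter grid is only an example.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination with 5-fold CV
search.fit(X, y)

print(search.best_params_, search.best_score_)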

Here’s an outline of the model evaluation and selection process:

1. Split your dataset into training and validation sets (or use cross-validation).
2. Train each candidate model on the training set (or k-1 folds in cross-validation).
3. Evaluate each model's performance on the validation set (or the held-out fold in cross-validation) using appropriate evaluation metrics.
4. Compare the models' performance and select the best one for your problem.
5. Train the chosen model on the entire dataset and use it to make predictions on new data.

Techniques to Improve Classification Accuracy:


There are several techniques you can use to improve the accuracy of classification
machine learning models. Here are a few:

Feature Engineering: Feature engineering is the process of selecting and transforming the input data to create new features that are more informative for the model. This can help the model better distinguish between different classes.

Hyperparameter Tuning: Hyperparameters are the settings of the model that are not
learned from the data, such as the learning rate or regularization strength. Tuning these
hyperparameters can improve the performance of the model.

Ensembling: Ensembling involves combining multiple models to improve the overall performance. This can be done by averaging the predictions of several models or by using more advanced techniques such as bagging or boosting (see the sketch after this list).

Data Augmentation: Data augmentation involves generating new training examples by applying random transformations to the existing data. This can help the model learn to be more robust to variations in the input.

Regularization: Regularization is a technique used to prevent overfitting by adding constraints to the model. This can be done by adding a penalty term to the loss function, or by using techniques such as dropout.
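
As one concrete illustration of the ensembling idea, the sketch below compares a single decision tree against a bagged ensemble of trees (a random forest) using scikit-learn; the dataset and estimator settings are illustrative assumptions.

# Ensembling sketch, assuming scikit-learn; Iris and the estimator choices are placeholders.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # bagged ensemble of trees

print("Single tree  :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Random forest:", cross_val_score(forest, X, y, cv=5).mean())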

KNN
K-Nearest Neighbors (KNN) is a simple and widely used algorithm for classification and regression tasks in machine learning. It is a non-parametric, instance-based method for supervised learning, where the input consists of labeled data points.
Here's a brief overview of how KNN works:

Training: KNN stores all available cases and their class labels (for classification) or the
real values (for regression).
Prediction: For a new data point, KNN calculates the distance between that point and all
other points in the dataset. The distance metric could be Euclidean distance, Manhattan
distance, or other distance measures depending on the problem.
Selection of K: The algorithm then selects the K nearest data points (K is a predefined
constant, usually odd to avoid ties) and determines the majority class among them (for
classification) or calculates the average of the values (for regression).
Classification or Regression: The algorithm assigns the new data point to the majority
class or predicts its value based on the average of the values of its K nearest neighbors.
KNN is simple to understand and implement, making it a good starting point for
classification problems, especially when you have a small dataset. However, it can be
computationally expensive as it needs to compute distances between the new point and
all other points in the dataset during prediction, especially in high-dimensional spaces.
Additionally, choosing the right value of K and the appropriate distance metric is crucial
for the algorithm's performance.
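
A minimal KNN classification sketch with scikit-learn; the value of K, the distance metric, and the dataset are illustrative choices.

# KNN sketch, assuming scikit-learn; K and the metric are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# K = 5 neighbors, Euclidean distance (the default Minkowski metric with p=2).
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)         # "training" only stores the data points

print(knn.score(X_test, y_test))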

SVM
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification, regression, and outlier detection tasks. It's particularly well-suited for
classification of complex datasets with a clear margin of separation.
Here's how SVM works for classification:

Data Representation: SVM represents each data item as a point in n-dimensional space
(where n is the number of features) with the value of each feature being the value of a
particular coordinate.
Finding the Hyperplane: SVM finds the hyperplane that best separates the classes. The
hyperplane is the decision boundary that maximizes the margin between the closest data
points of different classes. These closest data points are called support vectors.
Margin Maximization: SVM aims to maximize the margin, which is the distance between
the hyperplane and the nearest data point(s) from each class. By maximizing the margin,
SVM ensures better generalization to unseen data.
Kernel Trick: In cases where the data is not linearly separable, SVM can transform the
input space into a higher-dimensional feature space using a kernel function. This allows
SVM to find a linear hyperplane in the transformed space, effectively solving non-linear
classification problems.

Classification: To classify a new data point, SVM determines which side of the
hyperplane the point falls on. If it falls on one side, it's classified as one class; if it falls on
the other side, it's classified as the other class.
SVM has several advantages, including its effectiveness in high-dimensional spaces, its
ability to handle datasets with many features, and its versatility in choosing different
kernel functions for different tasks. However, SVM can be sensitive to the choice of the
kernel and the regularization parameter. It also doesn't directly provide probability
estimates, although they can be calculated using techniques like Platt scaling or cross-
validation.
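
A brief SVM sketch with scikit-learn; the RBF kernel and the value of C are arbitrary examples.

# SVM sketch, assuming scikit-learn; the kernel and C value are arbitrary examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# An RBF kernel lets the classifier handle data that is not linearly separable;
# probability=True enables Platt-scaled probability estimates.
svm = SVC(kernel="rbf", C=1.0, probability=True)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))
print(svm.predict_proba(X_test[:3]))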

ANN
Artificial Neural Networks (ANNs) are a class of machine learning models inspired by the
structure and function of biological neural networks in the human brain. ANNs consist of
interconnected nodes, called neurons, organized in layers. Each neuron receives input
signals, processes them, and produces an output signal. ANNs are capable of learning
complex patterns and relationships in data, making them suitable for various tasks such
as classification, regression, clustering, and pattern recognition.
Here's a basic overview of how ANNs work:

Input Layer: This layer consists of neurons that receive input data. Each neuron
corresponds to one feature in the input data.

Hidden Layers: These layers are located between the input and output layers. Each
neuron in a hidden layer receives input from the neurons in the previous layer, processes
the information, and passes the result to the neurons in the next layer. Deep neural
networks have multiple hidden layers.

Output Layer: This layer produces the final output of the network. The number of neurons
in the output layer depends on the type of task the network is designed for. For example,
for binary classification, there is typically one neuron representing the probability of
belonging to one class, while for multi-class classification, there is one neuron per class.

Activation Functions: Neurons in each layer are associated with activation functions,
which introduce non-linearities into the network. Common activation functions include
the sigmoid, tanh, ReLU (Rectified Linear Unit), and softmax functions.
Weights and Bias: Each connection between neurons is associated with a weight, which
determines the strength of the connection. Additionally, each neuron has an associated
bias, which allows the network to capture more complex relationships.
Training: During the training phase, the network adjusts its weights and biases based on the input data and the desired output. Gradients are typically computed with backpropagation, and the weights are updated using optimization algorithms such as gradient descent or variants like stochastic gradient descent.
Prediction: Once the network is trained, it can be used to make predictions on new,
unseen data by passing the input data through the network and obtaining the output from
the output layer.

ANNs have gained popularity due to their ability to automatically learn features from raw
data, eliminating the need for manual feature engineering in many cases. However, they
can be computationally expensive to train, especially for large datasets and complex
architectures. Additionally, they may require careful tuning of hyperparameters to achieve
optimal performance.
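
A minimal sketch of a feed-forward neural network classifier using scikit-learn's MLPClassifier; the hidden layer sizes, activation function, and dataset are illustrative choices.

# ANN sketch, assuming scikit-learn; hidden layer sizes and activation are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# Scale inputs; gradient-based training is sensitive to feature scale.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers with ReLU activation, trained with a stochastic gradient-based optimizer.
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu", max_iter=1000, random_state=3)
mlp.fit(X_train, y_train)

print(mlp.score(X_test, y_test))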
