November 2010)
(Example)
Looking at the scatterplot above, we can visualise the task of classification as
essentially seeking a line or curve that splits the space of the feature variables
income and balance, so that the model suggests a separation between the
orange and the blue points. This separation can then be used to classify a future
point for which it is unknown whether it is orange or blue. Note that
there is overlap between the areas of the orange and blue points, implying that
there will be some error in this separation; in other words, some blue points will
be classified as orange and vice versa. The line or curve that we seek to
separate the feature space is typically a function of the features and is called
a discriminant function.
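As a minimal sketch of a linear discriminant, the example below fits scikit-learn's LogisticRegression on synthetic two-feature data; the feature names "income" and "balance", the group means, and all numbers here are illustrative assumptions, not the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two synthetic, partially overlapping groups in (income, balance) space
blue = rng.normal(loc=[40.0, 1.0], scale=[10.0, 0.5], size=(100, 2))
orange = rng.normal(loc=[60.0, 2.0], scale=[10.0, 0.5], size=(100, 2))
X = np.vstack([blue, orange])
y = np.array([0] * 100 + [1] * 100)  # 0 = blue, 1 = orange

clf = LogisticRegression().fit(X, y)
# The discriminant function here is linear: w1*income + w2*balance + b = 0
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print(f"decision boundary: {w1:.3f}*income + {w2:.3f}*balance + {b:.3f} = 0")
# Overlap between the groups means accuracy < 1: some points are misclassified
print("training accuracy:", clf.score(X, y))
```

Because the two groups overlap, the training accuracy is below 1, which is exactly the classification error described above.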
-- proposed plans for the modelling exercises
Based on my observations, I propose the following three tasks, which can
be implemented on one or more of the chosen real-world datasets.
1. Unsupervised learning: where the problem consists of identifying
homogeneous population groups or applying dimension reduction techniques,
which can then be used in the context of the empirical application.
2. Regression: where the problem consists of continuous target variable(s).
3. Classification: where the problem consists of categorical target variable(s).
Linear Regression: This is one of the simplest and most commonly used
regression algorithms. It assumes a linear relationship between the dependent
and independent variables.
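A minimal sketch of ordinary least squares with scikit-learn; the data here are synthetic and purely illustrative (generated so that the true relationship is roughly y = 3x + 2).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
# True relationship y = 3x + 2, plus a small amount of noise
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X, y)
# The fitted slope and intercept should be close to the true values 3 and 2
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```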
Lasso and Ridge regression are both regularization techniques used in linear
regression to prevent overfitting and improve the model's generalization
performance.
Ridge Regression (L2 Regularization): Ridge regression is a regularized linear
regression algorithm that adds a penalty term, proportional to the sum of the
squared coefficients, to the linear regression objective function to prevent
overfitting.
Lasso Regression (L1 Regularization): Similar to ridge regression, lasso
regression is a regularized linear regression algorithm, but it uses the sum of
the absolute values of the coefficients as the penalty term. Because this penalty
can shrink coefficients exactly to zero, it can be useful for feature selection.
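The contrast between the two penalties can be sketched as follows, on synthetic data where only two of ten features actually matter; lasso tends to drive the irrelevant coefficients exactly to zero, which is why it can act as a feature selector. The alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks all coefficients but keeps them nonzero;
# lasso sets the irrelevant ones exactly to zero
print("ridge nonzero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```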
-- any other
Because the main focus of this coursework is unsupervised learning, I will
focus on PCA (Principal Component Analysis). This can be done in Python with
the scikit-learn library, importing StandardScaler and PCA. After fitting, I will
obtain the explained variance and the explained variance ratio of the principal
components. I will then create a graph with the principal components on the
x-axis and the explained variance ratio on the y-axis (a scree plot). After that, I
will create a scatter plot of the first two PCA components, whose pattern
should reveal the structure of the data.
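The workflow above can be sketched as follows with StandardScaler and PCA; the synthetic data and the number of components are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated synthetic features, so that a few components dominate
base = rng.normal(size=(100, 2))
noise = 0.1 * rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

X_scaled = StandardScaler().fit_transform(X)  # standardise before PCA
pca = PCA(n_components=5).fit(X_scaled)
scores = pca.transform(X_scaled)  # principal component scores

print("explained variance:", pca.explained_variance_)
print("explained variance ratio:", pca.explained_variance_ratio_)
# Scree plot: x-axis = component index, y-axis = explained variance ratio
# Scatter plot of components: scores[:, 0] against scores[:, 1]
```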
As we learned from the subject guide and classes, I will also use the elbow
method to find the optimal k. There are several ways to determine the k value;
common methods include the elbow method: observe the within-cluster sum of
squared errors at different k values and select the inflection point (elbow
point) as the k value.
Silhouette coefficient: calculate the silhouette coefficient for different k
values and select the k value corresponding to the largest silhouette
coefficient. Cross-validation: divide the dataset into a training set and a
test set, cluster the training set, compute the corresponding evaluation
indexes on the test set, and select the best k value. The elbow method can
be visualised by plotting k on the x-axis against the sum of squared
distances on the y-axis.
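The elbow method and the silhouette coefficient can both be sketched with scikit-learn's KMeans; the three synthetic clusters and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters
centres = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centres])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X, km.labels_)

# Elbow method: plot k against inertias and look for the inflection point.
# Silhouette method: pick the k with the largest silhouette coefficient.
print("inertia by k:", inertias)
print("best k by silhouette:", max(silhouettes, key=silhouettes.get))
```

On this toy data both methods should agree on k = 3, matching the three generated clusters.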
One remaining question: I would like to apply feature selection; how can I
apply it in this case?