Quoted from https://www.kaggle.com/c/otto-group-product-classification-challenge/data
Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.
There are nine categories for all products. Each target category represents one of our most important product categories (like fashion, electronics, etc.). The products for the training and testing sets are selected randomly.
### Summary of the data fields
- id - an anonymous id unique to a product
- feat_1, feat_2, ..., feat_93 - the various features of a product
- target - the class of a product
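For reference, a minimal sketch of how the data might be loaded (the file name `train.csv` follows the Kaggle download and is an assumption; adjust the path as needed):

```r
# Read the Kaggle training file; file name/path is an assumption
data <- read.csv("train.csv")
data$target <- as.factor(data$target)   # make sure the class label is a factor
dim(data)                               # id + 93 features + target = 95 columns
```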
### Distribution of the class variable
This helps us understand our data and the possible class imbalance that may pose a problem for classification.
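A quick sketch of how this distribution can be inspected, assuming the data frame `data` loaded above:

```r
# Class counts and a quick bar plot of the target distribution
table(data$target)
barplot(table(data$target), las = 2, main = "Distribution of the class variable")
```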
### Pre-processing
Before the data was used, we removed the first variable "id", as it carries no information for the classification task and might interfere with the accuracy of the model.
```r
# Drop the "id" column; it is only a row identifier
data <- data[, -1]
```
We divided our dataset into testing and training sets in a 3:7 ratio for most of the algorithms:
```r
# Hold out 30% of the rows as a test set; the rest is used for training
ind   <- sample(1:nrow(data), floor(nrow(data) * 0.3))
test  <- data[ind, ]
train <- data[-ind, ]
```
Alternatively, in tree.R the training set is down-sampled so that the classes are balanced:
```r
library(caret)  # downSample() comes from the caret package

ind   <- sample(1:nrow(data), floor(nrow(data) * 0.3))
test  <- data[ind, ]
train <- data[-ind, ]
# yname keeps the class column named "target" (caret's default is "Class")
train <- downSample(x = train[, -ncol(train)], y = train$target, yname = "target")
```
Based on the random forest model (in tree.R), we are able to rank the top 10 features by importance.
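As a sketch, assuming a fitted randomForest model `rf_fit` trained with `importance = TRUE` (the exact object name in tree.R may differ):

```r
library(randomForest)

# Mean decrease in Gini for every feature (type = 2), sorted descending
imp   <- importance(rf_fit, type = 2)
top10 <- head(sort(imp[, 1], decreasing = TRUE), 10)
top10                              # names and scores of the 10 most important features

varImpPlot(rf_fit, n.var = 10)     # visual check of the same ranking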
Our code is divided into three R files; a rough sketch of how each model might be fit follows the list.
- ANN.R - Artificial neural network (Library used: nnet)
- naiveBayes.R - Naive Bayes model (Library used: klaR)
- tree.R - Decision tree and random forest (Library used: randomForest, tree, ISLR)
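A rough illustration of how the three models might be fit on the `train` set (the hyper-parameters below are placeholders, not the values used in the repository):

```r
library(nnet)           # ANN.R
library(klaR)           # naiveBayes.R
library(randomForest)   # tree.R

# Single-hidden-layer network; size/decay/maxit are illustrative values
nn_fit <- nnet(target ~ ., data = train, size = 5, decay = 0.1, maxit = 200)

# Naive Bayes classifier from klaR
nb_fit <- NaiveBayes(target ~ ., data = train)

# Random forest; importance = TRUE enables the feature ranking shown earlier
rf_fit <- randomForest(target ~ ., data = train, ntree = 100, importance = TRUE)
```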
The confusion matrix is plotted in each of the files. To compare the algorithms, we look at the multi-class area under the curve (AUC); a sketch of how these metrics might be computed is given after the results table.
Algorithm | Multi-class area under the curve | Accuracy |
---|---|---|
ANN | 0.81 | 0.7234 |
Naive Bayes | 0.72 | 0.6552 |
Tree | 0.82 | 0.8348 |
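For reference, a sketch of how the confusion matrix and multi-class AUC above might be computed for the random forest; the pROC package is an assumption, as it is not among the libraries listed earlier:

```r
library(pROC)  # assumed package for the multi-class AUC (Hand & Till)

# Predictions on the held-out test set
pred_class <- predict(rf_fit, test)                     # predicted classes
pred_prob  <- predict(rf_fit, test, type = "prob")      # class probability matrix

# Confusion matrix and overall accuracy
cm <- table(Predicted = pred_class, Actual = test$target)
cm
sum(diag(cm)) / sum(cm)

# Multi-class AUC from the probability matrix
mroc <- multiclass.roc(test$target, pred_prob)
mroc$auc
```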
Based on these results, the random forest performs best on our dataset, with both the highest accuracy and the highest multi-class AUC.
Classifiers behave differently because their underlying assumptions differ. For instance, neural networks tend to struggle with sparse data such as these count features.
Naive Bayes, on the other hand, assumes that the features are conditionally independent of each other given the class. For products, one feature will clearly be correlated with other features, so the relatively low multi-class AUC (0.72) of Naive Bayes is not surprising.
Random forests generally outperform a single decision tree, particularly on larger datasets, because of their ensemble approach; the drawback is that they are computationally more expensive.