
Microsoft R PreProcessing


20171023_Microsoft_PreProcessing

Objective:

In this session, you will learn the data pre-processing, aggregation, and manipulation
techniques used before moving on to data modeling.

Key takeaways:

• Handling missing values
• Discretization and standardization
• Dummy variables
• Splitting data into train and test

Exercise: Please write R code to do the following tasks:

1) Clear the environment and set the working directory.
2) Read the data from the URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
into the R environment. Name this data frame “adult”.
3) Change the attribute names to “age”, “workclass”, “fnlwgt”, “education”, “education-num”,
“marital-status”, “occupation”, “relationship”, “race”, “gender”, “capital-gain”, “capital-loss”,
“hours-per-week”, “native-country”, “profits”.
4) Find out which rows contain " ?" in the “workclass”, “occupation” and “native-country” attributes
and replace those values with NA. Check the data summary to confirm that every " ?" has been replaced by NA.
5) Using central imputation, impute the NA values in the data frame. Check that all NA values have
been imputed.
6) Split the data frame into two data frames: one containing only the numeric attributes and the
other containing only the categorical attributes.
7) Standardize the numeric attributes, and discretize them using “equalwidth” and “equalfreq”.
Compare the resulting tables to see the difference between the two methods.
8) Create dummy variables for the “race” attribute in the categorical data.
9) Create a new data frame by combining the standardized numeric data, the categorical data (with
“race” removed) and the dummy variables created for “race”.
10) Split the data into a 60% train set and a 40% test set.
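
One possible way to approach tasks 1 to 10 is sketched below in base R only. It is a minimal sketch under stated assumptions, not the course's reference solution. A small made-up sample stands in for the download so that the code runs offline, and helper names such as pick, central_value, train_set and test_set are hypothetical. The option names “equalwidth” and “equalfreq” match the discretize() function of the infotheo package, and the DMwR package provides centralImputation(); whether the exercise intends those packages is an assumption, so cut(), quantile(), median() and table() are used here instead.

```r
# Step 1: clear the environment. setwd() is left out because the path is machine-specific.
rm(list = ls())

# Step 2 (ASSUMPTION): a toy sample with made-up values replaces the download.
# For the real file, read.csv(<URL from task 2>, header = FALSE,
# stringsAsFactors = FALSE) should give the same 15-column layout.
set.seed(1)
n <- 50
pick <- function(values) sample(values, n, replace = TRUE)  # hypothetical helper
adult <- data.frame(
  pick(17:90), pick(c(" Private", " State-gov", " ?")), pick(20000:400000),
  pick(c(" Bachelors", " HS-grad")), pick(1:16),
  pick(c(" Married-civ-spouse", " Never-married")),
  pick(c(" Sales", " Tech-support", " ?")), pick(c(" Husband", " Not-in-family")),
  pick(c(" White", " Black", " Other")), pick(c(" Male", " Female")),
  pick(c(0, 0, 2174)), pick(c(0, 0, 1902)), pick(20:60),
  pick(c(" United-States", " India", " ?")), pick(c(" <=50K", " >50K")),
  stringsAsFactors = FALSE
)

# Step 3: rename the attributes.
names(adult) <- c("age", "workclass", "fnlwgt", "education", "education-num",
                  "marital-status", "occupation", "relationship", "race", "gender",
                  "capital-gain", "capital-loss", "hours-per-week",
                  "native-country", "profits")

# Step 4: locate rows holding " ?" and replace those values with NA.
qcols <- c("workclass", "occupation", "native-country")
rows_with_q <- which(rowSums(adult[qcols] == " ?") > 0)
adult[qcols] <- lapply(adult[qcols], function(x) replace(x, x == " ?", NA))
summary(adult)

# Step 5: central imputation (median for numeric, most frequent value otherwise).
central_value <- function(x) {
  if (is.numeric(x)) median(x, na.rm = TRUE) else names(which.max(table(x)))
}
adult[] <- lapply(adult, function(x) replace(x, is.na(x), central_value(x)))
sum(is.na(adult))  # number of NA values left after imputation

# Step 6: separate numeric and categorical attributes.
is_num <- sapply(adult, is.numeric)
num_data <- adult[, is_num]
cat_data <- adult[, !is_num]

# Step 7: standardize (zero mean, unit standard deviation), then discretize
# "age" into 4 bins by equal width and by equal frequency.
std_data <- as.data.frame(scale(num_data))
age_width <- cut(num_data$age, breaks = 4)
age_freq <- cut(num_data$age, include.lowest = TRUE,
                breaks = unique(quantile(num_data$age, probs = 0:4 / 4)))
table(age_width)  # bins span equal ranges, counts vary
table(age_freq)   # counts are roughly equal, ranges vary

# Step 8: one indicator column per level of "race".
race_dummies <- as.data.frame(model.matrix(~ race - 1, data = cat_data))

# Step 9: combine standardized numerics, categoricals without "race", and dummies.
final_data <- cbind(std_data, cat_data[, names(cat_data) != "race"], race_dummies)

# Step 10: random 60% / 40% split.
train_rows <- sample(nrow(final_data), size = round(0.6 * nrow(final_data)))
train_set <- final_data[train_rows, ]
test_set <- final_data[-train_rows, ]
```

The leading space in values such as " ?" is kept on purpose, because the task text itself searches for " ?" with that space. Only “age” is discretized here to keep the sketch short; the same two cut() calls can be applied to any other numeric attribute.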

Inspire…Educate…Transform.
