Feature Engineering
Knowledge Discovery and Data Mining 1

Roman Kern

ISDS, TU Graz

2018-10-25

Big picture: KDDM

Figure: The knowledge discovery process (preprocessing, model) builds on mathematical tools (probability theory, linear algebra, information theory, statistical inference) and on infrastructure (hardware & programming).

Outline

1 Information Theory

2 Introduction

3 Feature Value Processing

4 Feature Engineering for Text Mining

5 Feature Selection


Recap
Review of the preprocessing phase

Recap - Feature Extraction

Examples of features:
Images → colours, textures, contours, ...
Signals → frequency, phase, samples, spectrum, ...
Time series → ticks, trends, self-similarities, ...
Biomed → DNA sequences, genes, ...
Text → words, POS tags, grammatical dependencies, ...

Features encode these properties in a way suitable for a chosen algorithm

Recap - Feature Extraction

What is Part-of-Speech?
The process of assigning word classes to the words within a sentence
For example
Car → noun
Writing → noun or verb
Grow → verb
From → preposition

Open vs closed word classes


Prepositions (closed, e.g. “of”, “to”, “in”)
Verbs (open, e.g. to “google”)

Recap - Feature Extraction

Main approaches for POS tagging
Rule-based: ENGTWOL tagger
Transformation-based: Brill tagger
Stochastic: HMM tagger

Recap - Feature Extraction

Sequence Tagging - Simplifications


argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})

→ Not feasible in practice

Limited horizon & independence assumption:
P(t_{1,n}) ≈ P(t_n | t_{n-1}) P(t_{n-1} | t_{n-2}) ... P(t_2 | t_1) = ∏_{i=1..n} P(t_i | t_{i-1})
Words only depend on tags: P(w_{1,n} | t_{1,n}) ≈ ∏_{i=1..n} P(w_i | t_i)
The final equation is:
t̂_{1,n} = argmax_{t_{1,n}} ∏_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})
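
A minimal sketch of how this factorised score can be evaluated and maximised by brute force over all tag sequences (toy tagset and probability tables, all hypothetical; a real tagger would use the Viterbi algorithm instead):

from itertools import product

# Hypothetical toy transition and emission probabilities (all values assumed)
tags = ["DET", "NOUN", "VERB"]
p_trans = {("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6}
p_emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.4, ("VERB", "barks"): 0.3}

def score(words, tag_seq):
    # product of P(w_i | t_i) * P(t_i | t_{i-1}), starting from a sentence-start symbol
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tag_seq):
        p *= p_emit.get((t, w), 1e-6) * p_trans.get((prev, t), 1e-6)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(tags, repeat=len(words)), key=lambda ts: score(words, ts))
print(best)  # ('DET', 'NOUN', 'VERB')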

Recap - Feature Extraction
Hidden Markov Models

Needs three matrices as input: 𝐴 (transition, POS ↦ POS), 𝐵 (emission, POS ↦ Word), 𝜋 (initial probabilities, POS)
Recap - Feature Extraction

Probability estimation for tagging


How do we get such probabilities?
→ With supervised tagging we can simply use Maximum Likelihood
Estimation (MLE) and use counts (C) from a reference corpus
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(w_i, t_i) / C(t_i)
Smoothing
To account for unseen words
Lidstone smoothing: (C(t_{i-1}, t_i) + λ) / (C(t_{i-1}) + λ V(t_{i-1}, t))
Need to estimate 𝜆, e.g. by held-out data (development data set)
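
A small sketch of estimating the transition probabilities by MLE with Lidstone smoothing from a toy tagged corpus (hypothetical data; V is taken here as the size of the tagset, which is an assumption):

from collections import Counter

# Hypothetical toy reference corpus of (word, tag) sentences
corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tags = ["<s>"] + [t for _, t in sentence]
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

tagset = {t for s in corpus for _, t in s}

def p_trans(cur, prev, lam=0.1):
    # Lidstone-smoothed MLE: (C(prev, cur) + lam) / (C(prev) + lam * |tagset|)
    return (bigram_counts[(prev, cur)] + lam) / (unigram_counts[prev] + lam * len(tagset))

print(p_trans("NOUN", "DET"))  # high, since DET is always followed by NOUN in the toy corpus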


Information Theory
Review of information theory

Entropy

What is Entropy?
Let 𝑋 be a discrete random variable with alphabet 𝒳 and probability
mass function 𝑝(𝑥)
𝑝(𝑥) = 𝑃 𝑟{𝑋 = 𝑥}, 𝑥 ∈ 𝒳
The entropy of a variable X is defined as
𝐻(𝑋) = − ∑𝑥∈𝒳 𝑝(𝑥)𝑙𝑜𝑔2 𝑝(𝑥)
... entropy is a measure of the information content of a variable (in bits)
Note 1: By convention 0𝑙𝑜𝑔2 0 = 0
Note 2: Entropy is the lower bound on the average number of yes/no questions to guess the
state of a variable.
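
A short sketch of computing the entropy of a discrete distribution given as a list of probabilities (toy values):

from math import log2

def entropy(probs):
    # H(X) = -sum p(x) log2 p(x), with the convention 0*log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform over four values)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (the skewed example below)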

Entropy

Entropy Example 1/2


For example, let 𝒳 = {𝐴, 𝐵, 𝐶, 𝐷}
... each with the same probability of 1/4
One can encode each of the values with 2 bits
e.g., 𝐴 = 00, 𝐵 = 01, 𝐶 = 10, 𝐷 = 11

Entropy

Entropy Example 2/2


What if the probabilities are not evenly distributed?
e.g., 𝐴 = 1/2, 𝐵 = 1/4, 𝐶 = 1/8, 𝐷 = 1/8
One only needs 1.75 bits on average to encode a value
e.g., 𝐴 = 0, 𝐵 = 10, 𝐶 = 110, 𝐷 = 111
As one expects to see 𝐴 in 50% of all cases

Entropy

What is Entropy?
Entropy is a measure for uncertainty
High entropy → uniform distribution
Histogram of the frequencies would be even
Values are hard to predict
Low entropy → peaks and valleys in the distribution
Histogram of the frequencies would have spikes
Values are easier to predict
Entropy is always non-negative
The entropy is always less than or equal to the logarithm of the
alphabet size

Joint Entropy

What is Joint Entropy?


The joint entropy of a pair of discrete random variables (𝑋, 𝑌 ) with
joint probability mass function 𝑝(𝑥, 𝑦) is defined by
𝐻(𝑋, 𝑌 ) = − ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦)𝑙𝑜𝑔2 𝑝(𝑥, 𝑦)
... can be generalised to cover more than 2 variables
𝐻(𝑋, 𝑌 ) = 𝐻(𝑋) + 𝐻(𝑌 ), if 𝑋 and 𝑌 are independent from each
other

Conditional Entropy

What is Conditional Entropy?


The conditional entropy of 𝑌 given 𝑋 is defined as:
𝐻(𝑌 |𝑋) = − ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦)𝑙𝑜𝑔2 𝑝(𝑦|𝑥)
... how much uncertainty is left, once X is known
Connection between joint entropy and conditional entropy:
𝐻(𝑌 |𝑋) = 𝐻(𝑋, 𝑌 ) − 𝐻(𝑋)

Conditional Entropy

What is Specific Conditional Entropy?


𝐻(𝑌 |𝑋 = 𝑥) - the specific conditional entropy of 𝑌 given a
specific value 𝑥 of 𝑋
e.g. if 𝐻(𝑌 |𝑋 = 𝑥) = 0, then observing 𝑋 = 𝑥 removes all uncertainty about 𝑌
𝐻(𝑌 |𝑋) = ∑𝑥∈𝒳 𝑝(𝑥)𝐻(𝑌 |𝑋 = 𝑥)
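
A brief sketch computing H(Y|X) from a joint distribution given as a dictionary (toy numbers):

from math import log2

# Hypothetical joint distribution p(x, y)
p_xy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

def cond_entropy(p_xy):
    # H(Y|X) = -sum p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x)
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(cond_entropy(p_xy))  # uncertainty left in Y once X is known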

Information Gain

What is Information Gain?


𝐼𝐺(𝑌 |𝑋) = 𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)
... how much is the uncertainty of 𝑌 reduced, once 𝑋 is known
or: one has to transmit 𝑌
How many bits on average would be saved if both ends of the line knew 𝑋?

Information Gain

What is Relative Information Gain?


𝑅𝐼𝐺(𝑌 |𝑋) = (𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)) / 𝐻(𝑌 )
One has to transmit 𝑌
What fraction of bits on average would be saved if both ends of the line knew 𝑋?

Mutual Information

What is Mutual Information?


The mutual information between random variables 𝑋 and 𝑌 with
joint probability mass function 𝑝(𝑥, 𝑦) and marginal probability mass
functions 𝑝(𝑥) and 𝑝(𝑦) is defined as
𝐼(𝑋; 𝑌 ) = ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦) log2 (𝑝(𝑥, 𝑦) / (𝑝(𝑥)𝑝(𝑦)))
The mutual information is a measure of the amount of information
that one random variable contains about another random variable
𝐼(𝑋; 𝑌 ) = 0, if 𝑋 and 𝑌 are independent from each other
Conditional mutual information:
𝐼(𝑋; 𝑌 |𝑍) = 𝐻(𝑋|𝑍) − 𝐻(𝑋|𝑌 , 𝑍)
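
A small sketch of mutual information computed from a joint distribution table (toy numbers):

from math import log2

# Hypothetical joint distribution p(x, y) over two binary variables
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X; Y) = sum p(x, y) log2( p(x, y) / (p(x) p(y)) )
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(mi)  # > 0, since X and Y are clearly not independent here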

Pointwise Mutual Information

What is Pointwise Mutual Information?


The pointwise mutual information is defined as
𝑝𝑚𝑖(𝑋 = 𝑥; 𝑌 = 𝑦) = 𝑖(𝑥; 𝑦) = log2 (𝑝(𝑥, 𝑦) / (𝑝(𝑥)𝑝(𝑦)))
Can then be normalised to:
𝑝𝑚𝑖_norm(𝑋 = 𝑥; 𝑌 = 𝑦) = 𝑝𝑚𝑖(𝑋 = 𝑥; 𝑌 = 𝑦) / (−log2 𝑝(𝑥, 𝑦))

Example: For two binary variables:


y = false y = true
x = false 𝑝(¬𝑥, ¬𝑦) 𝑝(¬𝑥, 𝑦)
x = true 𝑝(𝑥, ¬𝑦) 𝑝(𝑥, 𝑦)
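
A minimal sketch of pmi and its normalised variant for one cell of such a table (toy probabilities):

from math import log2

# Toy marginal and joint probabilities for the cell x = true, y = true
p_x, p_y, p_xy = 0.3, 0.4, 0.2

pmi = log2(p_xy / (p_x * p_y))   # > 0: x and y co-occur more often than by chance
pmi_norm = pmi / -log2(p_xy)     # normalised into the range [-1, 1]
print(pmi, pmi_norm)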

Relative Entropy

What is Relative Entropy?


The relative entropy between two probability mass functions 𝑝(𝑥)
and 𝑞(𝑥) is defined by:
𝐷(𝑝||𝑞) = ∑𝑥∈𝒳 𝑝(𝑥) log2 (𝑝(𝑥) / 𝑞(𝑥))
... also called Kullback-Leibler distance
The relative entropy is always non-negative and zero if and only if
𝑝=𝑞
Connection between mutual information and relative entropy:
𝐼(𝑋; 𝑌 ) = 𝐷(𝑝(𝑥, 𝑦)||𝑝(𝑥)𝑝(𝑦))
Note: By convention 0 log2 (0/0) = 0, 0 log2 (0/𝑞) = 0 and 𝑝 log2 (𝑝/0) = ∞
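
A short sketch of the relative entropy between two discrete distributions given as aligned probability lists (toy values):

from math import log2

def kl_divergence(p, q):
    # D(p||q) = sum p(x) log2( p(x) / q(x) ), with 0*log2(0/q) = 0
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p, q))  # 0.25 bits; it is 0 only if p == q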

Entropy Overview

Figure: Overview of entropy, joint entropy, conditional entropy and mutual information (source: Wikipedia)
Markov Chains

Markov chains
Random variables 𝑋, 𝑌 , 𝑍 are said to form a Markov chain in that order
(denoted 𝑋 → 𝑌 → 𝑍) if the conditional distribution of Z depends only
on Y and is conditionally independent of X, i.e. if the joint probability mass
function can be written as:

𝑝(𝑥, 𝑦, 𝑧) = 𝑝(𝑥)𝑝(𝑦|𝑥)𝑝(𝑧|𝑦)

𝑋 → 𝑌 → 𝑍 if and only if 𝑋 and 𝑍 are conditionally independent given 𝑌 .

Markov chains and information theory


If 𝑋 → 𝑌 → 𝑍 then 𝐼(𝑋; 𝑌 ) ≥ 𝐼(𝑋; 𝑍)


Introduction
What are features & feature engineering

Introduction

What is feature engineering?


The act of injecting knowledge into a machine learning model.

What are features?


The items that represent this knowledge in a form suitable for machine
learning algorithms.

What is a machine learning model?


The model represents the output of the learning process (knowledge
representation)
Note: there is no formal definition of feature engineering

Introduction

Tasks of feature engineering


1 Understand the properties of the task - how they might interact with
the strengths and limitations of the model
2 Experimental work - test expectations and find out what actually
works

Note: The exploration vs. experimental work characterises many data science scenarios

Introduction

Process of feature engineering


Remove unnecessary features
Remove redundant features
Create new features
Combine existing features
Transform features
Use features from the context
Integrate external sources
Modify feature types
e.g. from binary to numeric
Modify feature values
Note: depending on the task and the algorithms the results might differ

Feature Engineering Goals

Goals
The task also depends on the goal of feature engineering:
1 If the goal is to get the best prediction accuracy
2 ... or an explainable model

Feature Engineering Terminology

Important Terms
Feature Set Set of features used for a task
Feature Space High-dimensional space spanned by the features (range of
the feature values)
Instance Single assignment of features and values (an example)

Introduction - Example

Figure: Features to predict which type of contact lens is most appropriate (none,
soft, hard)
Introduction - Example

Figure: Relation between the features and the contact lens type
Introduction - Example

Figure: Features to decide whether to play or not based on the weather


Introduction

Features are assigned to instances


No dependencies/relationships between instances (in practical
applications)
Thus relationships need to be flattened out (denormalisation)
... by modelling features
Complex features need to be “simplified”
... by creating aggregate, compound features, e.g. by using averages
Typically features have no fixed sequence (hence the term set)
... by creating features that express sequence information


Feature Value Processing


Operations upon feature values...

Feature Processing

Feature binarisation
Threshold numerical values to get boolean values
Needed as some algorithms just take boolean features as input
Feature discretization
Convert continuous features to discrete features
Equal sized partitions? Equal interval partitions?
Feature value transformation
Scaling of values
Move the centre
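
A minimal sketch (NumPy, toy values) of binarising a feature by a threshold and discretising it into equal-interval partitions:

import numpy as np

values = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.55])

binary = (values > 0.5).astype(int)                      # binarisation via a threshold
edges = np.linspace(values.min(), values.max(), num=4)   # 3 equal-interval partitions
discrete = np.digitize(values, edges[1:-1])              # bin index per value
print(binary, discrete)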

Feature Normalisation

Normalise the value of features


For example, most of the features are in the range [0..1],
... but one ranges [-1000 .. +1000]
Classifiers like SVM struggle with this (other algorithms do not need
this step, e.g. decision trees)
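
A short sketch of min-max scaling a feature column into [0..1] (NumPy, toy values):

import numpy as np

x = np.array([-1000.0, -250.0, 0.0, 500.0, 1000.0])
x_scaled = (x - x.min()) / (x.max() - x.min())  # min-max normalisation into [0, 1]
print(x_scaled)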

Feature Weighting

Given numeric or binary features


... encode their impact into the feature value
Can be seen as a prior of a feature
e.g. “term weighting” to separate words with a grammatical
function from words with a semantic function


Feature Engineering for Text Mining
Tactics when dealing with text

Contextual Features

Bigram Features
When working with single words as features, often the sequence
information is lost
... but this could potentially be a source of information
→ introduce new feature as a combination of two adjacent words

Contextual Features

Example Bigram Features


Example: The quick brown fox jumps over the lazy dog
Unigram features: brown, dog, fox, lazy, ...
Bigram features: brown_fox, fox_jumps, lazy_dog, over_the, ...
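
A tiny sketch of deriving unigram and bigram features from a tokenised sentence:

tokens = "the quick brown fox jumps over the lazy dog".split()

unigrams = set(tokens)
bigrams = {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}  # adjacent word pairs
print(sorted(bigrams))  # ['brown_fox', 'fox_jumps', 'jumps_over', ...]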

Contextual Features

n-grams
Bigrams can be extended for more than two words
→ n-grams
Can be extended to allow gaps between words (skip n-grams)

Contextual Features

Character n-grams
n-grams can be created not only on words, but on characters as well
e.g. The quick brown fox jumps over the lazy dog
Character tri-grams: the, qui, uic, ick, bro, row, own, ...
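
A short sketch of extracting character tri-grams per token:

tokens = "the quick brown fox".split()

char_trigrams = [tok[i:i + 3] for tok in tokens for i in range(len(tok) - 2)]
print(char_trigrams)  # ['the', 'qui', 'uic', 'ick', 'bro', 'row', 'own', 'fox']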

External Sources

Integrate external sources


Integrate evidence from external sources
e.g. WordNet for semantic relations
Example: The quick brown fox jumps over the lazy dog
Features: the, quick, brown, fox, canine, canid, jumps, ...
Added canine and canid from the hypernyms found in WordNet

External Sources

Figure: WordNet entry for the word fox, the first sense contains the hypernyms
canine and canid.

Feature Selection
Less is more - sometimes...

Feature Selection

Feature engineering can be classified into two use cases


Modelling for prediction accuracy
Default, if the goal is to have a productive system
... with optimal performance
Modelling for explanations
When the model should be easy to interpret
... one can acquire better knowledge of the problem
... and then improve the feature engineering task

Feature Selection

If a model uses fewer features


... it is easier to interpret
... it will generalise better (less risk of overfitting)
... it will be faster (more efficient)

Feature Selection

Curse of Dimensionality
The problem of having too many features
More features make the model more expressive
but not all of the features are relevant
The higher the dimensionality, the higher the chances of spurious
features

Feature Selection

Approaches towards reducing the complexity


Feature selection
Regularisation

Feature Selection

Feature selection
Approach: select the subset of all features without redundant or
irrelevant features
Set-of-all-subsets problem → NP-hard
Need to find more practical approaches
Unsupervised, e.g. heuristics
Supervised, e.g. using a training data set

Feature Selection

Simple approach → use heuristics


Black & white lists
... the list either contains features that should not be used
... or is an exclusive list of allowed features

Feature Selection

Example for black list


Stop-word list for textual features
... list of frequent words that carry little semantics
e.g. the, you, again, can, ...
Advantage: simple, yet effective
Disadvantage: some may carry semantics, e.g. when used in phrases or named
entities (“The The”), or as homonyms (can.v vs. can.n)

Feature Selection
Unsupervised approach
Unsupervised ranked feature selection
Scoring function to rank the features according to their importance
... then just use the top 5% (10%, 25%, ...)
e.g. for textual data use the frequency of words within a reference
corpus
Feature Count Freq.
the 3,032,573 0.879
in 2,919,623 0.846
a 2,903,352 0.841
of 2,888,379 0.837
is 2,639,282 0.765
and 2,634,096 0.763
⋮ ⋮ ⋮
with 1,703,251 0.494

Table: Top 50 words within Wikipedia; the top-ranked word (the) occurs in
88% of all instances.
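
A brief sketch of unsupervised ranked selection by document frequency over a toy corpus (hypothetical data; the cut-off of 25% is an arbitrary choice):

from collections import Counter

docs = [["the", "quick", "brown", "fox"],
        ["the", "lazy", "dog"],
        ["a", "quick", "dog"]]

doc_freq = Counter(word for doc in docs for word in set(doc))
ranked = [w for w, _ in doc_freq.most_common()]
top = ranked[:max(1, len(ranked) // 4)]  # keep e.g. the top 25% of the features
print(top)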
Feature Selection

Supervised approaches
Filter approaches
Wrapper approaches

Feature Selection

Supervised approaches
Filter approaches
Compute some measure for estimating the ability to discriminate
between classes
Typically measure feature weight and select the best n features →
supervised ranked feature selection
Problems:
Redundant features (correlated features will all have similar weights)
Dependent features (some features may only be important in
combination)

Feature Selection - Information Gain

Information gain as ranking function


Recall: 𝐼𝐺(𝑌 |𝑋) = 𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)
Select features by IG
1 Compute the IG for each feature
2 Rank the features based on IG
3 Select the top-k features

Example
Features on contact lenses
Ranked attributes:
0.5488 4 tear-prod-rate
0.377 3 astigmatism
0.0395 2 spectacle-prescrip
0.0394 1 age
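
A compact sketch of ranking categorical features by information gain over a toy labelled data set (hypothetical data, self-contained entropy helper):

from collections import Counter
from math import log2

# Hypothetical toy instances: feature dictionaries plus a class label
data = [({"astigmatism": "no",  "tear_rate": "reduced"}, "none"),
        ({"astigmatism": "no",  "tear_rate": "normal"},  "soft"),
        ({"astigmatism": "yes", "tear_rate": "normal"},  "hard"),
        ({"astigmatism": "yes", "tear_rate": "reduced"}, "none")]

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(data, feature):
    ig = entropy([y for _, y in data])                       # H(Y)
    for value in {x[feature] for x, _ in data}:
        subset = [y for x, y in data if x[feature] == value]
        ig -= len(subset) / len(data) * entropy(subset)      # minus H(Y|X)
    return ig

ranking = sorted(["astigmatism", "tear_rate"],
                 key=lambda f: info_gain(data, f), reverse=True)
print(ranking)  # tear_rate separates the 'none' class perfectly, so it ranks first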

Feature Selection

Supervised approaches
Wrapper approaches
Search through the space of all possible feature subsets
Each search subset is tried out with a learning algorithm

Feature Selection - Wrapper Approach

Wrapper approach
General algorithm:
1 Initial subset selection
2 Try a subset with a learner
3 Modify the feature subset
4 Rerun the learner
5 Measure the difference
6 GOTO 2
Advantages: combination of features, ignore redundant/irrelevant
features
Disadvantage: computationally intensive
2 basic ways for i) initial subset selection, ii) modification of subset:
forward selection and backward elimination
Feature Selection - Wrapper Approach

Forward Selection
Start with empty set
Add each feature not in set
Pick the one with the highest increase
Stop if there is no increase

Backward Elimination
Start with full feature set
Try to remove features
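
A minimal sketch of greedy forward selection, assuming a generic evaluate(features) scoring function (e.g. the cross-validated accuracy of a learner restricted to those features); both the function and the candidate features below are hypothetical:

def forward_selection(candidates, evaluate):
    selected, best_score = [], float("-inf")
    while True:
        # try adding each feature not yet in the set and keep the best addition
        scored = [(evaluate(selected + [f]), f) for f in candidates if f not in selected]
        if not scored:
            break
        score, feature = max(scored)
        if score <= best_score:  # stop if there is no increase
            break
        selected.append(feature)
        best_score = score
    return selected

# Usage with a toy scoring function that prefers subsets containing "b" and "c"
print(forward_selection(["a", "b", "c"], lambda fs: len(set(fs) & {"b", "c"})))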

Feature Interactions

From the field of statistics


e.g., remove redundant features based on correlation
e.g., Principal Component Analysis (PCA)
Non-trivial relationship between variables (features)
Link to Information Theory, e.g., Interaction Information

Regularisation

Basic idea: Introduce a penalty for complexity of a model


The more features, the higher the complexity
e.g., if the number of features exceeds the number of observations
Typically the regularizer is integrated into the cost function (or loss
function)
Example (negative log-likelihood): 𝑐𝑜𝑠𝑡(𝑓) = −𝑙(𝑓) + 𝑟𝑒𝑔𝑢𝑙𝑎𝑟𝑖𝑧𝑒𝑟(𝑓)

e.g. 𝐿0 ... the number of non-zero feature weights, 𝐿1 ... the sum of the
absolute feature weights, ...
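
A small sketch of an L1-regularised cost function for a linear model (NumPy; the squared-error loss and the weighting factor lam are assumptions made for brevity):

import numpy as np

def cost(w, X, y, lam=0.1):
    # loss term plus an L1 penalty that pushes irrelevant feature weights towards zero
    loss = np.mean((X @ w - y) ** 2)
    return loss + lam * np.sum(np.abs(w))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.array([1.0, 0.0]), X, y))  # perfect fit, so only the penalty of 0.1 remains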

Feature Transformation

Some algorithms can only solve certain problems


e.g. Perceptrons apply only to linearly separable data
XOR problem for single layer perceptrons
→ two main approaches: i) non-linear transformation of the features,
ii) more sophisticated algorithms

Feature Transformation

Feature transformation
Map features into high-dimensional space
Create more features
The more features, the higher the dimensionality
The higher the dimensionality, the higher the chances that the
problem is linearly separable
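
A brief sketch of the XOR case: the two original features are not linearly separable, but adding a simple non-linear combination (the product x1·x2) as a third feature makes them separable (toy data):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                 # XOR labels

x1x2 = (X[:, 0] * X[:, 1]).reshape(-1, 1)  # new non-linear feature
X_high = np.hstack([X, x1x2])              # three-dimensional feature space
# in the new space the linear rule x1 + x2 - 2*(x1*x2) > 0.5 separates the classes
print((X_high @ np.array([1, 1, -2]) > 0.5).astype(int))  # [0 1 1 0], matching y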

Feature Transformation

Left: original features, which cannot be separated by a single-layer perceptron;
Right: features transformed into a higher-dimensional space, which is linearly separable
Feature Transformation

Kernel trick
Some algorithms employ a scalar product of the features (e.g. SVMs)
Transform into higher dimensionality “on-the-fly”
... by introducing a (kernel) function
Original: < 𝑥, 𝑦 >, with kernel function: 𝜑(𝑥, 𝑦)
A number of different well-known kernel functions exist (e.g. the Gaussian
kernel)
... which often require parameters (to tune)
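
A short sketch of a Gaussian (RBF) kernel replacing the plain scalar product; the bandwidth parameter gamma is one of the parameters to tune (value chosen arbitrarily):

import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2), an implicit scalar product in a
    # transformed (infinite-dimensional) feature space
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(np.dot(x, y), gaussian_kernel(x, y))  # plain scalar product vs. kernel value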


Thank You!
Next up: Data Matrices

Further information
http://www.cs.cmu.edu/~awm/tutorials
http://www.icg.isy.liu.se/courses/infotheory/lect1.pdf
http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf
http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl104/html/feature_engineering.pdf
http://www.ke.tu-darmstadt.de/lehre/archiv/ss06/web-mining/wm-features.pdf
