Feature Engineering
Knowledge Discovery and Data Mining 1

Roman Kern

ISDS, TU Graz

2018-10-25

Big picture: KDDM

Figure: The knowledge discovery process (preprocessing, model) builds on mathematical tools (probability theory, linear algebra, information theory, statistical inference) and on infrastructure (hardware & programming).

Outline

1 Information Theory

2 Introduction

3 Feature Value Processing

4 Feature Engineering for Text Mining

5 Feature Selection


Recap
Review of the preprocessing phase

Recap - Feature Extraction

Examples of features:
Images → colours, textures, contours, ...
Signals → frequency, phase, samples, spectrum, ...
Time series → ticks, trends, self-similarities, ...
Biomed → DNA sequences, genes, ...
Text → words, POS tags, grammatical dependencies, ...

Features encode these properties in a way suitable for a chosen algorithm

Recap - Feature Extraction

What is Part-of-Speech?
The process of assigning word classes to the words within a sentence
For example
Car → noun
Writing → noun or verb
Grow → verb
From → preposition

Open vs closed word classes


Prepositions (closed, e.g. “of”, “to”, “in”)
Verbs (open, e.g. to “google”)

Recap - Feature Extraction

Main approaches for POS tagging
Rule-based: ENGTWOL tagger
Transformation-based: Brill tagger
Stochastic: HMM tagger

Recap - Feature Extraction

Sequence Tagging - Simplifications


argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})

→ Not feasible in practice

Limited horizon & independence assumption:
P(t_{1,n}) ≈ P(t_n | t_{n-1}) P(t_{n-1} | t_{n-2}) ... P(t_2 | t_1) = ∏_{i=1..n} P(t_i | t_{i-1})
Words only depend on tags: P(w_{1,n} | t_{1,n}) ≈ ∏_{i=1..n} P(w_i | t_i)
The final equation is:
t̂_{1,n} = argmax_{t_{1,n}} ∏_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})
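
A minimal sketch of how this factorised score can be evaluated and maximised by brute force over all tag sequences (toy tagset and probability tables, all hypothetical; a real tagger would use the Viterbi algorithm instead):

from itertools import product

# Hypothetical toy transition and emission probabilities (all values assumed)
tags = ["DET", "NOUN", "VERB"]
p_trans = {("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6}
p_emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.4, ("VERB", "barks"): 0.3}

def score(words, tag_seq):
    # product of P(w_i | t_i) * P(t_i | t_{i-1}), starting from a sentence-start symbol
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tag_seq):
        p *= p_emit.get((t, w), 1e-6) * p_trans.get((prev, t), 1e-6)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(tags, repeat=len(words)), key=lambda ts: score(words, ts))
print(best)  # ('DET', 'NOUN', 'VERB')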

Recap - Feature Extraction
Hidden Markov Models

Needs three matrices as input: 𝐴 (transition, POS ↦ POS), 𝐵 (emission, POS ↦ Word), 𝜋 (initial probabilities, POS)
Recap - Feature Extraction

Probability estimation for tagging


How do we get such probabilities?
→ With supervised tagging we can simply use Maximum Likelihood
Estimation (MLE) and use counts (C) from a reference corpus
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(w_i, t_i) / C(t_i)
Smoothing
To account for unseen words
Lidstone smoothing: (C(t_{i-1}, t_i) + λ) / (C(t_{i-1}) + λ V(t_{i-1}, t))
Need to estimate 𝜆, e.g. by held-out data (development data set)
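
A small sketch of estimating the transition probabilities by MLE with Lidstone smoothing from a toy tagged corpus (hypothetical data; V is taken here as the size of the tagset, which is an assumption):

from collections import Counter

# Hypothetical toy reference corpus of (word, tag) sentences
corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tags = ["<s>"] + [t for _, t in sentence]
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

tagset = {t for s in corpus for _, t in s}

def p_trans(cur, prev, lam=0.1):
    # Lidstone-smoothed MLE: (C(prev, cur) + lam) / (C(prev) + lam * |tagset|)
    return (bigram_counts[(prev, cur)] + lam) / (unigram_counts[prev] + lam * len(tagset))

print(p_trans("NOUN", "DET"))  # high, since DET is always followed by NOUN in the toy corpus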


Information Theory
Review of information theory

Entropy

What is Entropy?
Let 𝑋 be a discrete random variable with alphabet 𝒳 and probability
mass function 𝑝(𝑥)
𝑝(𝑥) = 𝑃 𝑟{𝑋 = 𝑥}, 𝑥 ∈ 𝒳
The entropy of a variable X is defined as
𝐻(𝑋) = − ∑𝑥∈𝒳 𝑝(𝑥)𝑙𝑜𝑔2 𝑝(𝑥)
... entropy is a measure of the information content of a variable (in bits)
Note 1: By convention 0𝑙𝑜𝑔2 0 = 0
Note 2: Entropy is the lower bound on the average number of yes/no questions to guess the
state of a variable.
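
A short sketch of computing the entropy of a discrete distribution given as a list of probabilities (toy values):

from math import log2

def entropy(probs):
    # H(X) = -sum p(x) log2 p(x), with the convention 0*log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform over four values)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (the skewed example below)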

Entropy

Entropy Example 1/2


For example, let 𝒳 = {𝐴, 𝐵, 𝐶, 𝐷}
... each with the same probability of 1/4
One can encode each of the values with 2 bits
e.g., 𝐴 = 00, 𝐵 = 01, 𝐶 = 10, 𝐷 = 11

Entropy

Entropy Example 2/2


What if the probabilities are not evenly distributed?
e.g., 𝐴 = 1/2, 𝐵 = 1/4, 𝐶 = 1/8, 𝐷 = 1/8
One only needs 1.75 bits on average to encode a value
e.g., 𝐴 = 0, 𝐵 = 10, 𝐶 = 110, 𝐷 = 111
As one expects to see 𝐴 in 50% of all cases

Entropy

What is Entropy?
Entropy is a measure for uncertainty
High entropy → uniform distribution
Histogram of the frequencies would be even
Values are hard to predict
Low entropy → peaks and valleys in the distribution
Histogram of the frequencies would have spikes
Values are easier to predict
Entropy is always non-negative
The entropy is always less than or equal to the logarithm of the
alphabet size

Joint Entropy

What is Joint Entropy?


The joint entropy of a pair of discrete random variables (𝑋, 𝑌 ) with
joint probability mass function 𝑝(𝑥, 𝑦) is defined by
𝐻(𝑋, 𝑌 ) = − ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦)𝑙𝑜𝑔2 𝑝(𝑥, 𝑦)
... can be generalised to cover more than 2 variables
𝐻(𝑋, 𝑌 ) = 𝐻(𝑋) + 𝐻(𝑌 ), if 𝑋 and 𝑌 are independent from each
other

Conditional Entropy

What is Conditional Entropy?


The conditional entropy of 𝑌 given 𝑋 is defined as:
𝐻(𝑌 |𝑋) = − ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦)𝑙𝑜𝑔2 𝑝(𝑦|𝑥)
... how much uncertainty is left, once X is known
Connection between joint entropy and conditional entropy:
𝐻(𝑌 |𝑋) = 𝐻(𝑋, 𝑌 ) − 𝐻(𝑋)

Conditional Entropy

What is Specific Conditional Entropy?


𝐻(𝑌 |𝑋 = 𝑥) - the specific conditional entropy of 𝑌 given a
specific value 𝑥 of 𝑋
e.g. if 𝐻(𝑌 |𝑋 = 𝑥) = 0, then observing 𝑋 = 𝑥 removes all uncertainty about 𝑌
𝐻(𝑌 |𝑋) = ∑𝑥∈𝒳 𝑝(𝑥)𝐻(𝑌 |𝑋 = 𝑥)
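
A brief sketch computing H(Y|X) from a joint distribution given as a dictionary (toy numbers):

from math import log2

# Hypothetical joint distribution p(x, y)
p_xy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

def cond_entropy(p_xy):
    # H(Y|X) = -sum p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x)
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(cond_entropy(p_xy))  # uncertainty left in Y once X is known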

Information Gain

What is Information Gain?


𝐼𝐺(𝑌 |𝑋) = 𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)
... how much is the uncertainty of 𝑌 reduced, once 𝑋 is known
or: one has to transmit 𝑌
How many bits on average would be saved if both ends of the line knew 𝑋?

Information Gain

What is Relative Information Gain?


𝑅𝐼𝐺(𝑌 |𝑋) = (𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)) / 𝐻(𝑌 )
One has to transmit 𝑌
What fraction of bits on average would be saved if both ends of the line knew 𝑋?

Mutual Information

What is Mutual Information?


The mutual information between random variables 𝑋 and 𝑌 with
joint probability mass function 𝑝(𝑥, 𝑦) and marginal probability mass
functions 𝑝(𝑥) and 𝑝(𝑦) is defined as
𝐼(𝑋; 𝑌 ) = ∑𝑥∈𝒳 ∑𝑦∈𝒴 𝑝(𝑥, 𝑦) log2 (𝑝(𝑥, 𝑦) / (𝑝(𝑥)𝑝(𝑦)))
The mutual information is a measure of the amount of information
that one random variable contains about another random variable
𝐼(𝑋; 𝑌 ) = 0, if 𝑋 and 𝑌 are independent from each other
Conditional mutual information:
𝐼(𝑋; 𝑌 |𝑍) = 𝐻(𝑋|𝑍) − 𝐻(𝑋|𝑌 , 𝑍)
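
A small sketch of mutual information computed from a joint distribution table (toy numbers):

from math import log2

# Hypothetical joint distribution p(x, y) over two binary variables
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X; Y) = sum p(x, y) log2( p(x, y) / (p(x) p(y)) )
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(mi)  # > 0, since X and Y are clearly not independent here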

Pointwise Mutual Information

What is Pointwise Mutual Information?


The pointwise mutual information is defined as
𝑝𝑚𝑖(𝑋 = 𝑥; 𝑌 = 𝑦) = 𝑖(𝑥; 𝑦) = log2 (𝑝(𝑥, 𝑦) / (𝑝(𝑥)𝑝(𝑦)))
Can then be normalised to:
𝑝𝑚𝑖_norm(𝑋 = 𝑥; 𝑌 = 𝑦) = 𝑝𝑚𝑖(𝑋 = 𝑥; 𝑌 = 𝑦) / (−log2 𝑝(𝑥, 𝑦))

Example: For two binary variables:


y = false y = true
x = false 𝑝(¬𝑥, ¬𝑦) 𝑝(¬𝑥, 𝑦)
x = true 𝑝(𝑥, ¬𝑦) 𝑝(𝑥, 𝑦)
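
A minimal sketch of pmi and its normalised variant for one cell of such a table (toy probabilities):

from math import log2

# Toy marginal and joint probabilities for the cell x = true, y = true
p_x, p_y, p_xy = 0.3, 0.4, 0.2

pmi = log2(p_xy / (p_x * p_y))   # > 0: x and y co-occur more often than by chance
pmi_norm = pmi / -log2(p_xy)     # normalised into the range [-1, 1]
print(pmi, pmi_norm)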

Relative Entropy

What is Relative Entropy?


The relative entropy between two probability mass functions 𝑝(𝑥)
and 𝑞(𝑥) is defined by:
𝐷(𝑝||𝑞) = ∑𝑥∈𝒳 𝑝(𝑥) log2 (𝑝(𝑥) / 𝑞(𝑥))
... also called Kullback-Leibler distance
The relative entropy is always non-negative and zero if and only if
𝑝=𝑞
Connection between mutual information and relative entropy:
𝐼(𝑋; 𝑌 ) = 𝐷(𝑝(𝑥, 𝑦)||𝑝(𝑥)𝑝(𝑦))
Note: By convention 0 log2 (0/0) = 0, 0 log2 (0/𝑞) = 0 and 𝑝 log2 (𝑝/0) = ∞
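
A short sketch of the relative entropy between two discrete distributions given as aligned probability lists (toy values):

from math import log2

def kl_divergence(p, q):
    # D(p||q) = sum p(x) log2( p(x) / q(x) ), with 0*log2(0/q) = 0
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p, q))  # 0.25 bits; it is 0 only if p == q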

Entropy Overview

Figure: Overview of entropy, joint entropy, conditional entropy and mutual information (source: Wikipedia)
Markov Chains

Markov chains
Random variables 𝑋, 𝑌 , 𝑍 are said to form a Markov chain in that order
(denoted 𝑋 → 𝑌 → 𝑍) if the conditional distribution of Z depends only
on Y and is conditionally independent of X, i.e. if the joint probability mass
function can be written as:

𝑝(𝑥, 𝑦, 𝑧) = 𝑝(𝑥)𝑝(𝑦|𝑥)𝑝(𝑧|𝑦)

𝑋 → 𝑌 → 𝑍 if and only if 𝑋 and 𝑍 are conditionally independent given 𝑌 .

Markov chains and information theory


If 𝑋 → 𝑌 → 𝑍 then 𝐼(𝑋; 𝑌 ) ≥ 𝐼(𝑋; 𝑍)


Introduction
What are features & feature engineering

Introduction

What is feature engineering?


The act of injecting knowledge into a machine learning model.

What are features?


The items that represent this knowledge in a form suitable for machine
learning algorithms.

What is a machine learning model?


The model represents the output of the learning process (knowledge
representation)
Note: there is no formal definition of feature engineering

Introduction

Tasks of feature engineering


1 Understand the properties of the task - how they might interact with
the strengths and limitations of the model
2 Experimental work - test expectations and find out what actually
works

Note: The exploration vs. experimental work characterises many data science scenarios

Introduction

Process of feature engineering


Remove unnecessary features
Remove redundant features
Create new features
Combine existing features
Transform features
Use features from the context
Integrate external sources
Modify feature types
e.g. from binary to numeric
Modify feature values
Note: depending on the task and the algorithms the results might differ

Feature Engineering Goals

Goals
The task also depends on the goal of feature engineering:
1 If the goal is to get the best prediction accuracy
2 ... or an explainable model

Feature Engineering Terminology

Important Terms
Feature Set Set of features used for a task
Feature Space High-dimensional space spanned by the features (range of
the feature values)
Instance Single assignment of features and values (an example)

Introduction - Example

Figure: Features to predict which type of contact lens is most appropriate (none,
soft, hard)
Introduction - Example

Figure: Relation between the features and the contact lens type
Introduction - Example

Figure: Features to decide whether to play or not based on the weather


Introduction

Features are assigned to instances


No dependencies/relationships between instances (in practical
applications)
Thus relationships need to be flattened out (denormalisation)
... by modelling features
Complex features need to be “simplified”
... by creating aggregate, compound features, e.g. by using averages
Typically features have no fixed sequence (hence the term set)
... by creating features that express sequence information


Feature Value Processing


Operations upon feature values...

Feature Processing

Feature binarisation
Threshold numerical values to get boolean values
Needed as some algorithms just take boolean features as input
Feature discretization
Convert continuous features to discrete features
Equal sized partitions? Equal interval partitions?
Feature value transformation
Scaling of values
Move the centre
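
A minimal sketch (NumPy, toy values) of binarising a feature by a threshold and discretising it into equal-interval partitions:

import numpy as np

values = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.55])

binary = (values > 0.5).astype(int)                      # binarisation via a threshold
edges = np.linspace(values.min(), values.max(), num=4)   # 3 equal-interval partitions
discrete = np.digitize(values, edges[1:-1])              # bin index per value
print(binary, discrete)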

Feature Normalisation

Normalise the value of features


For example, most of the features are in the range [0..1],
... but one ranges [-1000 .. +1000]
Classifiers like SVM struggle with this (other algorithms do not need
this step, e.g. decision trees)
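
A short sketch of min-max scaling a feature column into [0..1] (NumPy, toy values):

import numpy as np

x = np.array([-1000.0, -250.0, 0.0, 500.0, 1000.0])
x_scaled = (x - x.min()) / (x.max() - x.min())  # min-max normalisation into [0, 1]
print(x_scaled)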

Feature Weighting

Given numeric or binary features


... encode their impact into the feature value
Can be seen as a prior of a feature
e.g. “term weighting” to separate words with a grammatical
function from words with a semantic function


Feature Engineering for Text Mining
Tactics when dealing with text

Contextual Features

Bigram Features
When working with single words as features, often the sequence
information is lost
... but this could potentially be a source of information
→ introduce new feature as a combination of two adjacent words

Contextual Features

Example Bigram Features


Example: The quick brown fox jumps over the lazy dog
Unigram features: brown, dog, fox, lazy, ...
Bigram features: brown_fox, fox_jumps, lazy_dog, over_the, ...
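
A tiny sketch of deriving unigram and bigram features from a tokenised sentence:

tokens = "the quick brown fox jumps over the lazy dog".split()

unigrams = set(tokens)
bigrams = {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}  # adjacent word pairs
print(sorted(bigrams))  # ['brown_fox', 'fox_jumps', 'jumps_over', ...]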

Contextual Features

n-grams
Bigrams can be extended for more than two words
→ n-grams
Can be extended to allow gaps between words (skip n-grams)

Contextual Features

Character n-grams
n-grams can be created not only on words, but on characters as well
e.g. The quick brown fox jumps over the lazy dog
Character tri-grams: the, qui, uic, ick, bro, row, own, ...
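
A short sketch of extracting character tri-grams per token:

tokens = "the quick brown fox".split()

char_trigrams = [tok[i:i + 3] for tok in tokens for i in range(len(tok) - 2)]
print(char_trigrams)  # ['the', 'qui', 'uic', 'ick', 'bro', 'row', 'own', 'fox']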

External Sources

Integrate external sources


Integrate evidence from external sources
e.g. WordNet for semantic relations
Example: The quick brown fox jumps over the lazy dog
Features: the, quick, brown, fox, canine, canid, jumps, ...
Added canine and canid from the hypernyms found in WordNet

External Sources

Figure: WordNet entry for the word fox, the first sense contains the hypernyms
canine and canid.

Feature Selection
Less is more - sometimes...

Feature Selection

Feature engineering can be classified into two use cases


Modelling for prediction accuracy
Default, if the goal is to have a productive system
... with optimal performance
Modelling for explanations
When the model should be easy to interpret
... one can acquire better knowledge of the problem
... and then improve the feature engineering task

Feature Selection

If a model uses fewer features


... it is easier to interpret
... it will generalise better (less risk of overfitting)
... it will be faster (more efficient)

Feature Selection

Curse of Dimensionality
The problem of having too many features
More features make the model more expressive
but not all of the features are relevant
The higher the dimensionality, the higher the chances of spurious
features

Feature Selection

Approaches towards reducing the complexity


Feature selection
Regularisation

Feature Selection

Feature selection
Approach: select the subset of all features without redundant or
irrelevant features
Set-of-all-subsets problem → NP-hard
Need to find more practical approaches
Unsupervised, e.g. heuristics
Supervised, e.g. using a training data set

Feature Selection

Simple approach → use heuristics


Black & white lists
... the list either contains features that should not be used
... or is an exclusive list of allowed features

Feature Selection

Example for black list


Stop-word list for textual features
... list of frequent words that carry little semantics
e.g. the, you, again, can, ...
Advantage: simple, yet effective
Disadvantage: some may carry semantics, e.g. when used in phrases or named
entities (“The The”), or as homonyms (can.v vs. can.n)

Feature Selection
Unsupervised approach
Unsupervised ranked feature selection
Scoring function to rank the features according to their importance
... then just use the top 5% (10%, 25%, ...)
e.g. for textual data use the frequency of words within a reference
corpus
Feature Count Freq.
the 3,032,573 0.879
in 2,919,623 0.846
a 2,903,352 0.841
of 2,888,379 0.837
is 2,639,282 0.765
and 2,634,096 0.763
⋮ ⋮ ⋮
with 1,703,251 0.494

Table: Top 50 words within Wikipedia; the top-ranked word (the) occurs in
88% of all instances.
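
A brief sketch of unsupervised ranked selection by document frequency over a toy corpus (hypothetical data; the cut-off of 25% is an arbitrary choice):

from collections import Counter

docs = [["the", "quick", "brown", "fox"],
        ["the", "lazy", "dog"],
        ["a", "quick", "dog"]]

doc_freq = Counter(word for doc in docs for word in set(doc))
ranked = [w for w, _ in doc_freq.most_common()]
top = ranked[:max(1, len(ranked) // 4)]  # keep e.g. the top 25% of the features
print(top)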
Feature Selection

Supervised approaches
Filter approaches
Wrapper approaches

Feature Selection

Supervised approaches
Filter approaches
Compute some measure for estimating the ability to discriminate
between classes
Typically measure feature weight and select the best n features →
supervised ranked feature selection
Problems:
Redundant features (correlated features will all have similar weights)
Dependent features (some features may only be important in
combination)

Feature Selection - Information Gain

Information gain as ranking function


Recall: 𝐼𝐺(𝑌 |𝑋) = 𝐻(𝑌 ) − 𝐻(𝑌 |𝑋)
Select features by IG
1 Compute the IG for each feature
2 Rank the features based on IG
3 Select the top-k features

Example
Features on contact lenses
Ranked attributes:
0.5488 4 tear-prod-rate
0.377 3 astigmatism
0.0395 2 spectacle-prescrip
0.0394 1 age
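
A compact sketch of ranking categorical features by information gain over a toy labelled data set (hypothetical data, self-contained entropy helper):

from collections import Counter
from math import log2

# Hypothetical toy instances: feature dictionaries plus a class label
data = [({"astigmatism": "no",  "tear_rate": "reduced"}, "none"),
        ({"astigmatism": "no",  "tear_rate": "normal"},  "soft"),
        ({"astigmatism": "yes", "tear_rate": "normal"},  "hard"),
        ({"astigmatism": "yes", "tear_rate": "reduced"}, "none")]

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(data, feature):
    ig = entropy([y for _, y in data])                       # H(Y)
    for value in {x[feature] for x, _ in data}:
        subset = [y for x, y in data if x[feature] == value]
        ig -= len(subset) / len(data) * entropy(subset)      # minus H(Y|X)
    return ig

ranking = sorted(["astigmatism", "tear_rate"],
                 key=lambda f: info_gain(data, f), reverse=True)
print(ranking)  # tear_rate separates the 'none' class perfectly, so it ranks first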

Feature Selection

Supervised approaches
Wrapper approaches
Search through the space of all possible feature subsets
Each search subset is tried out with a learning algorithm

Feature Selection - Wrapper Approach

Wrapper approach
General algorithm:
1 Initial subset selection
2 Try a subset with a learner
3 Modify the feature subset
4 Rerun the learner
5 Measure the difference
6 GOTO 2
Advantages: combination of features, ignore redundant/irrelevant
features
Disadvantage: computationally intensive
2 basic ways for i) initial subset selection, ii) modification of subset:
forward selection and backward elimination
Feature Selection - Wrapper Approach

Forward Selection
Start with empty set
Add each feature not in set
Pick the one with the highest increase
Stop if there is no increase

Backward Elimination
Start with full feature set
Try to remove features
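
A minimal sketch of greedy forward selection, assuming a generic evaluate(features) scoring function (e.g. the cross-validated accuracy of a learner restricted to those features); both the function and the candidate features below are hypothetical:

def forward_selection(candidates, evaluate):
    selected, best_score = [], float("-inf")
    while True:
        # try adding each feature not yet in the set and keep the best addition
        scored = [(evaluate(selected + [f]), f) for f in candidates if f not in selected]
        if not scored:
            break
        score, feature = max(scored)
        if score <= best_score:  # stop if there is no increase
            break
        selected.append(feature)
        best_score = score
    return selected

# Usage with a toy scoring function that prefers subsets containing "b" and "c"
print(forward_selection(["a", "b", "c"], lambda fs: len(set(fs) & {"b", "c"})))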

Feature Interactions

From the field of statistics


e.g., remove redundant features based on correlation
e.g., Principal Component Analysis (PCA)
Non-trivial relationship between variables (features)
Link to Information Theory, e.g., Interaction Information

Regularisation

Basic idea: Introduce a penalty for complexity of a model


The more features, the higher the complexity
e.g., if the number of features exceeds the number of observations
Typically the regularizer is integrated into the cost function (or loss
function)
Example (negative log-likelihood): 𝑐𝑜𝑠𝑡(𝑓) = −𝑙(𝑓) + 𝑟𝑒𝑔𝑢𝑙𝑎𝑟𝑖𝑧𝑒𝑟(𝑓)

e.g. 𝐿0 ... the number of non-zero feature weights, 𝐿1 ... the sum of the
absolute feature weights, ...
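
A small sketch of an L1-regularised cost function for a linear model (NumPy; the squared-error loss and the weighting factor lam are assumptions made for brevity):

import numpy as np

def cost(w, X, y, lam=0.1):
    # loss term plus an L1 penalty that pushes irrelevant feature weights towards zero
    loss = np.mean((X @ w - y) ** 2)
    return loss + lam * np.sum(np.abs(w))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.array([1.0, 0.0]), X, y))  # perfect fit, so only the penalty of 0.1 remains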

Feature Transformation

Some algorithms can only solve certain problems


e.g. Perceptrons apply only to linearly separable data
XOR problem for single layer perceptrons
→ two main approaches: i) non-linear transformation of the features,
ii) more sophisticated algorithms

Feature Transformation

Feature transformation
Map features into high-dimensional space
Create more features
The more features, the higher the dimensionality
The higher the dimensionality, the higher the chances that the
problem is linearly separable
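
A brief sketch of the XOR case: the two original features are not linearly separable, but adding a simple non-linear combination (the product x1·x2) as a third feature makes them separable (toy data):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                 # XOR labels

x1x2 = (X[:, 0] * X[:, 1]).reshape(-1, 1)  # new non-linear feature
X_high = np.hstack([X, x1x2])              # three-dimensional feature space
# in the new space the linear rule x1 + x2 - 2*(x1*x2) > 0.5 separates the classes
print((X_high @ np.array([1, 1, -2]) > 0.5).astype(int))  # [0 1 1 0], matching y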

Feature Transformation

Left: original features, which cannot be separated by a single-layer perceptron;
Right: features transformed into a higher-dimensional space, which is linearly separable
Feature Transformation

Kernel trick
Some algorithms employ a scalar product of the features (e.g. SVMs)
Transform into higher dimensionality “on-the-fly”
... by introducing a (kernel) function
Original: < 𝑥, 𝑦 >, with kernel function: 𝜑(𝑥, 𝑦)
A number of different well-known kernel functions exist (e.g. the Gaussian
kernel)
... which often require parameters (to tune)
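
A short sketch of a Gaussian (RBF) kernel replacing the plain scalar product; the bandwidth parameter gamma is one of the parameters to tune (value chosen arbitrarily):

import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2), an implicit scalar product in a
    # transformed (infinite-dimensional) feature space
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(np.dot(x, y), gaussian_kernel(x, y))  # plain scalar product vs. kernel value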


Thank You!
Next up: Data Matrices

Further information
http://www.cs.cmu.edu/~awm/tutorials
http://www.icg.isy.liu.se/courses/infotheory/lect1.pdf
http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf
http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl104/html/feature_engineering.pdf
http://www.ke.tu-darmstadt.de/lehre/archiv/ss06/web-mining/wm-features.pdf
