
CS464 Ch1 Intro Fall2020


CS464: Introduction to Machine Learning
Bilkent University
CS464
•  An undergraduate-level introductory course that aims to give a broad overview of many concepts and algorithms in machine learning.

•  The goal is to provide students with a deep understanding of the subject matter and the skills to apply these concepts to real-world problems.
Prerequisites
•  Basic algorithms, data structures

•  Basic probability and statistics

•  Basic linear algebra

•  Good programming skills


Term Projects
•  “Photobomb Detector”

•  “Facial Expression Recognition”

•  “Predicting the Stack Exchange Questions’ Category”

•  Some ideas from the Stanford ML course:
–  http://cs229.stanford.edu/proj2019aut/index.html
TO DO
•  Enroll on Moodle
•  Read the syllabus:
–  https://stars.bilkent.edu.tr/syllabus/view/CS/464/
•  Read information:
–  http://ciceklab.cs.bilkent.edu.tr/ercumentcicek/cs-464-introduction-to-machine-learning-fall-2020/
•  Start talking to your classmates about project ideas
and start establishing teams:
–  Groups of 5 people from the same section.
Data-rich Era
•  The information volume grows exponentially
–  Most data will never be seen by humans
•  Information complexity is also increasing greatly
–  Most data cannot be comprehended by humans directly
•  In the computationally enabled, data-rich era, machine learning is an essential component of the new scientific toolkit
Genome Cost
Genome Data Size
Learning from Data
•  Many applications require gaining insights from massive,
noisy data sets
•  Commercial applications
–  Speech recognition
•  https://youtu.be/2U1ImKsecB8
–  Consumer data (online advertising, viral marketing,…)
–  Health records (emergency rehospitalization,…)
•  Security related applications
–  Spam filtering, fraud detection, …
–  Surveillance …
•  Science
–  Biology, astronomy, neuroscience, geology, …
–  Social science, economics
Spam or Ham
[Figure: example "ham" and "spam" e-mail messages]
Text Classification
[Figure: classifying incoming e-mail as spam or ham]
Reading Zip Codes
•  Classifying handwritten digits from binary images
Recommendation Systems
Statistical Machine Translation
Recognizing Faces

Figure taken from: https://github.com/deepinsight/insightface


Semantic Segmentation

Figure taken from: https://github.com/facebookresearch/detectron2


Image/Video Captioning

Automatically captioned: “Two pizzas sitting on top of a stove top oven”


Autopilot

Figure taken from: https://www.tesla.com/autopilot


Clustering Gene Expressions

Verhaak et al., Cancer Cell, 2010.


Analyzing fMRI Data

Mitchell et al., Science, 2009.

•  Predict activation patterns for nouns


Speech Recognition

https://youtu.be/D5VN56jQMWM
IBM Watson

•  IBM Watson won Jeopardy! in February 2011


AlphaGo

•  AlphaGo beat a team of five leading Go players


Health
Many Other Applications
•  Information Retrieval
•  Online advertising
•  Finance
•  Fraud detection
•  Economics
•  Social Sciences
Growth of Machine Learning
•  Machine learning is the preferred approach to
–  Speech recognition, natural language processing
–  Computer vision
–  Medical outcomes analysis
–  Robot control
–  Computational biology
–  Sensor networks
•  This trend is accelerating
–  Big data
–  Improved machine learning algorithms
–  Faster computers
–  Good open-source software
Why “Learn” ?
•  Learning is useful when:
–  Human expertise does not exist (navigating on Mars),
–  Humans are unable to explain their expertise (speech
recognition, object recognition)
–  Solution needs to be adapted to particular cases
(personalization)
Experience = data
•  Humans learn to improve their skills through practice
•  The goal of machine learning is to write algorithms that improve with experience
–  Experience is encoded as examples, or data
–  Autonomous learning must gain from experience
•  Where does training data come from?
–  Human annotation
–  Automatic collection
Objectives of Machine Learning
•  Algorithms:
–  design of efficient, accurate, and general learning algorithms to
deal with large-scale problems
–  make accurate predictions (unseen examples)
–  handle a variety of different learning problems

•  Theoretical questions:
–  What can be learned? Under what conditions?
–  What learning guarantees can be given?
–  What can we say about the inherent ease or difficulty of
learning problems?
Machine Learning?
•  Role of Probability and Statistics: Inference from a sample
•  Role of Computer Science: Efficient algorithms to
–  solve the optimization problem
–  represent and evaluate the model for inference
Definitions and Terminology
•  Example:
–  an object or instance in the data
•  Features (denoted by X):
–  the set of attributes, often represented as a vector, associated with an example, e.g., height and weight for gender prediction
•  Labels (Y):
–  in classification, the category associated with an object
–  in regression, real-valued numbers
Definitions and Terminology
•  Training set: data used for training the learning algorithm
•  Test set: data used for evaluating the learned model
Feature Extraction and Mapping
•  How you represent your data as a set of features is very important.
Example: Recognizing Flowers
Three types of Iris flowers: Setosa, Versicolor, Virginica
[Figure: example photos of the three Iris species]
Flower Recognition: Features
[Figure: flower features plotted for Setosa, Versicolor, and Virginica]
Example: A Digit Recognizer
•  Imagine you are asked to write a program to recognize handwritten digits
•  Now let's use the terminology to define the problem

Terminology Example
Classification task: given a handwritten digit, classify it into {0, 1, …, 9}
[Figure: an example image of the digit 3]
Digit Recognizer
•  Each handwritten digit image is an example (instance, or object) in your data
Labels (Space of outputs)
•  The labels are the ten digits, 0 through 9
[Figure: an example handwritten digit; the label of this example is 6]


Features (Space of Inputs)
We could divide the image into a 16x16 grid of pixels and encode each pixel intensity as a number.
[Figure: the 16x16 grid is mapped to a 256x1 feature vector for the example, e.g., (0, …, 1.2, 3.2, 4.5, 1.2, …, 0)]
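As a rough illustration of this mapping, here is a minimal sketch in Python/NumPy; the pixel values are made up, not taken from the slide's figure:

import numpy as np

# An illustrative 16x16 grayscale image; the intensity values are made up
image = np.random.default_rng(0).random((16, 16))

# Flatten the 16x16 grid (row by row) into a 256x1 feature vector
feature_vector = image.reshape(256, 1)

print(feature_vector.shape)  # (256, 1)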
Feature Mapping
•  How you represent your data as features is very important.
Training Set/Test Set
[Figure: the data is split into a training set (with labels) and a test set (with labels); the two sets should not overlap]

Terminology Example: Digit Recognizer
[Figure: handwritten digit images split into a training set and a test set, each with their labels]
•  Build the classifier using the features and the labels in the training set
•  Predict the labels of unseen examples from their features
•  Compare your predictions with the actual labels
How would you classify?
[Figure: a new handwritten digit image; predict its label]
Nearest Neighbor Classifiers
•  Basic idea:
–  If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test sample to the training samples, then choose the k "nearest" samples]
Nearest Neighbor Classifiers
[Figure: an unknown record plotted among the training samples]
k-Nearest Neighbor (kNN)
•  Assumes all instances are points in n-dimensional space
•  A distance measure is needed to determine the "closeness" of instances
•  Classify an instance by finding its nearest neighbors and picking the most popular class among the neighbors
1-Nearest Neighbor Classifier Algorithm
•  The training data is D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
•  Learning algorithm:
–  Store the training examples
•  Prediction algorithm:
–  To classify a new example x_new, find the training example (x_i, y_i) that is nearest to x_new
–  Predict the class y_new = y_i
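A minimal sketch of this prediction algorithm, generalized to k neighbors with Euclidean distance and a majority vote (the toy data and the function name are illustrative, not part of the course material; NumPy arrays are assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=1):
    # Distance from x_new to every stored training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors (k=1 gives 1-NN)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features per example, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.1, 3.9]), k=3))  # -> 1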


k-NN Classifier
•  To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
[Figure: training points and a new point plotted in the (X1, X2) feature space]
k-NN Classifier
[Figures: decision boundaries of the 1-, 3-, and 5-nearest neighbor classifiers (k = 1, 3, 5)]
Example: Handwritten Digits
•  16x16 bitmaps
•  8-bit grayscale
•  Euclidean distance over pixels

Accuracy:
–  7-NN: ~95.2%
–  SVM: ~95.8%
–  Humans: ~97.5%
Performance Measures
•  For binary classification, predictions fall into four groups:
–  True positives (TP)
–  True negatives (TN)
–  False positives (FP)
–  False negatives (FN)
•  Precision (selectivity)
–  TP / (TP + FP)
•  Recall (sensitivity, TPR)
–  TP / (TP + FN)
•  Specificity
–  TN / (TN + FP)
•  FPR?
•  There is often a trade-off between precision and recall
–  Receiver Operating Characteristic (ROC) curves are used to accurately assess performance in relation to this trade-off
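For concreteness, a small sketch computing these measures directly from the four counts (the function name and the example counts are illustrative):

def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)     # of predicted positives, the fraction that are correct
    recall = tp / (tp + fn)        # sensitivity / true positive rate (TPR)
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return precision, recall, specificity, fpr

# Example counts: 40 TP, 10 FP, 5 FN, 45 TN
print(binary_metrics(tp=40, fp=10, fn=5, tn=45))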
Performance Measures

Source: Wikipedia
Relationship to Modeling
Functional Representation
•  Variable X is related to variable Y, and we would like to learn this relationship from the data
•  Example: X = [weight, blood glucose, …] and Y is whether the person will have diabetes or not
•  We assume there is a relationship between X and Y:
–  certain values of X are unlikely to co-occur with "low risk", and some other values of X are unlikely to co-occur with "high risk"

•  This relationship is encapsulated by P(X, Y)


Formulation of Learning
Multitude of Learning Frameworks
•  Presence/absence of labeled data
Multitude of Learning Frameworks
•  Type of Labels:
Multitude of Learning Frameworks
•  Problems also differ in the protocol for obtaining
data
–  Passive Learning
–  Active Learning
•  and in assumptions in data
–  Batch
–  Online
Examples of Regression Problems
•  Predict tomorrow’s stock market price given current
market conditions and other possible side information.
•  Predict the age of a viewer watching a given video on
YouTube.
•  Predict the location in 3D space of a robot arm end
effector, given control signals (torques) sent to its
various motors.
•  Predict the amount of prostate-specific antigen (PSA) in
the body as a function of a number of different clinical
measurements.
•  Predict the temperature at any location inside a building
using weather data, time, door sensors, etc.
Linear Regression
y = wᵀx + ε, where y is the outcome variable (real-valued label), w is the coefficient vector, x is the feature vector, and ε is the residual error
–  Optimization problem: given instances of x and y, compute the coefficient vector that minimizes the residual error
–  Common assumption: the residual error has a Gaussian (normal) distribution
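A minimal sketch of this optimization with ordinary least squares in NumPy (the synthetic data and the true coefficients are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 1 + Gaussian residual error
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept becomes part of the coefficient vector
X_aug = np.hstack([X, np.ones((100, 1))])

# Least squares: the coefficient vector minimizing the squared residual error
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w)  # approximately [2, -3, 1]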
What Can Go Wrong?
Overfitting
[Figures: the same data fit with a 14-degree polynomial and a 20-degree polynomial]
•  Which model is more likely to generalize?
–  If we try to fit the model to the training data too accurately, we will end up with an overly complex model that treats every bit of variation in the data as information
•  Occam's Razor
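A rough sketch of the same phenomenon with polynomial fits (the degrees, data, and noise level are illustrative, not the slide's figures):

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 14):
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)  # least-squares fit to the training data
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    print(degree, round(test_mse, 4))  # the higher degree tends to fit noise and generalize worse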
Overfitting Example: k-NN
[Figures: training data; MAP estimate for K=1; MAP estimate for K=5]
•  What happens if we grow K further?
•  How about when K = N (the number of samples)?
Overfitting vs. Underfitting
[Figure: misclassification rate of K-NN on training and test samples as a function of K; to assess overfitting, reserve a portion of the samples as a test set]
Bias and Variance
Model Selection
(Do not confuse this K with the K in K-NN!)
•  K-fold Cross-Validation
–  Randomly partition the training samples into K groups
–  Repeat for i = 1 to K:
•  Reserve the i-th group as the test set
•  Train on the remaining K-1 groups, i.e., (K-1)N/K samples
–  Results in one prediction per sample
•  Use a measure of classification accuracy to assess performance
–  Repeat the above procedure multiple times
•  Report the mean and variance of the performance figures across these multiple runs
•  K = 5 is usually a good choice
–  K = N => Leave-One-Out Cross-Validation
•  Why repeat multiple times? Why variance?
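A minimal sketch of this procedure (NumPy arrays are assumed; the train_and_predict argument is a hypothetical stand-in for any classifier's train-then-predict step, not something defined in the course):

import numpy as np

def k_fold_cv(X, y, train_and_predict, K=5, seed=0):
    # Randomly partition the sample indices into K groups
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, K)
    correct = 0
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        # Train on the remaining K-1 groups, then predict the held-out group
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        correct += np.sum(y_pred == y[test_idx])
    return correct / len(y)  # one prediction per sample -> overall accuracy

Repeating this with different random seeds and reporting the mean and variance of the accuracy shows how stable the performance estimate is.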
Curse of Dimensionality
•  The space grows exponentially with the number of
dimensions (features)
–  If the number of training samples stays constant as the
number of features goes up, the data gets sparser
–  Distance/locality-based classifiers (e.g., k-NN) are particularly vulnerable to the curse of dimensionality
Breaking The Curse
•  Feature Selection
–  Explicitly aim at reducing the number of features by trying and evaluating models with different combinations of a few features
•  Computationally difficult since the number of combinations is
exponential
–  A parsimonious (less complex) model is always more
desirable for interpretability and generalizability
•  Dimensionality Reduction
–  Pre-process the data to obtain lower-dimensional
representations by discovering latent patterns
•  Principal Component Analysis (PCA)/ Singular Value
Decomposition (SVD)
Dimensionality Reduction via PCA
[Figure: a 2D point cloud with its mean and the first and second principal components]
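A rough sketch of PCA computed via SVD, along the lines of the PCA/SVD bullet on the previous slide (assuming the data is a NumPy matrix with one sample per row; names and sizes are illustrative):

import numpy as np

def pca(X, num_components=2):
    # Center the data at its mean (the "Mean" point in the figure)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by the amount of variance they explain
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]
    # Project onto the top components -> lower-dimensional representation
    return X_centered @ components.T, components

X = np.random.default_rng(2).normal(size=(200, 10))  # 200 samples, 10 features
Z, components = pca(X, num_components=2)
print(Z.shape)  # (200, 2)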
Interpretability vs. Accuracy
•  Data mining: Find interpretable patterns
–  Emphasis on understanding
•  Machine learning: Make accurate predictions
–  Emphasis on utility
More Data - Cleverer Algorithm
•  The lack of a sufficient number of examples is usually the biggest challenge in machine learning
–  Algorithmic sophistication can address this problem only
to a certain extent, since increasing complexity can
result in overfitting
•  The classifier should get better as it sees more
examples
–  If not, can we say it is a good algorithm?
•  Extreme case: Deep Learning (Convolutional
Neural Networks)
–  Works poorly with few samples
–  Works great if you have millions of samples
AI vs ML vs DL
•  Nice blog post by NVIDIA:
–  https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What will be Covered?
•  This class should give you the basic foundation for applying machine learning
•  Supervised learning
–  Bayesian learning
–  Linear models for regression and classification
–  Instance-based learning
–  Support vector machines
–  Decision Trees
–  Ensemble models
–  Deep Learning
•  Unsupervised learning
–  Clustering
–  Dimensionality reduction
•  Model selection and evaluation
•  Sequential Models
–  Hidden Markov Models
•  Additional topics (if time permits)
