
CS464 Ch1 Intro Fall2020


CS464: Introduction to Machine Learning
Bilkent University
CS464
•  An undergraduate-level introductory course that aims to give a broad overview of many concepts and algorithms in machine learning.

•  The goal is to provide students with a deep understanding of the subject matter and the skills to apply these concepts to real-world problems.
Prerequisites
•  Basic algorithms, data structures

•  Basic probability and statistics

•  Basic linear algebra

•  Good programming skills


Term Projects
•  “Photobomb Detector”

•  “Facial Expression Recognition”

•  “Predicting the Stack Exchange Questions’ Category”

•  Some ideas from the Stanford ML course:
–  http://cs229.stanford.edu/proj2019aut/index.html
TO DO
•  Enroll on Moodle
•  Read the syllabus:
–  https://stars.bilkent.edu.tr/syllabus/view/CS/464/
•  Read information:
–  http://ciceklab.cs.bilkent.edu.tr/ercumentcicek/cs-464-introduction-to-machine-learning-fall-2020/
•  Start talking to your classmates about project ideas
and start establishing teams:
–  Groups of 5 people from the same section.
Data-rich Era
•  The information volume grows exponentially
–  Most data will never be seen by humans
•  Information complexity is also increasing greatly
–  Most data cannot be comprehended by humans directly
•  In the computationally enabled, data-rich era, machine learning is an essential component of the new scientific toolkit
Genome Cost
Genome Data Size
Learning from Data
•  Many applications require gaining insights from massive,
noisy data sets
•  Commercial applications
–  Speech recognition
•  https://youtu.be/2U1ImKsecB8
–  Consumer data (online advertising, viral marketing,…)
–  Health records (emergency rehospitalization,…)
•  Security related applications
–  Spam filtering, fraud detection, …
–  Surveillance …
•  Science
–  Biology, astronomy, neuroscience, geology, …
–  Social science, economics
Spam or Ham
[Figure: example "ham" and "spam" e-mail messages]
Text Classification
[Figure: classifying incoming e-mail as spam or ham]
Reading Zip Codes
•  Classifying handwritten digits from binary images
Recommendation Systems
Statistical Machine Translation
Recognizing Faces

Figure taken from: https://github.com/deepinsight/insightface


Semantic Segmentation

Figure taken from: https://github.com/facebookresearch/detectron2


Image/Video Captioning

Automatically captioned: “Two pizzas sitting on top of a stove top oven”


Autopilot

Figure taken from: https://www.tesla.com/autopilot


Clustering Gene Expressions

Verhaak et al., Cancer Cell, 2010.


Analyzing fMRI Data

Mitchell et al., Science, 2009.

•  Predict activation patterns for nouns


Speech Recognition

https://youtu.be/D5VN56jQMWM
IBM Watson

•  IBM Watson won Jeopardy! in February 2011


AlphaGo

•  AlphaGo beat a team of five leading Go players


Health
Many Other Applications
•  Information Retrieval
•  Online advertising
•  Finance
•  Fraud detection
•  Economics
•  Social Sciences
Growth of Machine Learning
•  Machine learning is the preferred approach to
–  Speech recognition, natural language processing
–  Computer vision
–  Medical outcomes analysis
–  Robot control
–  Computational biology
–  Sensor networks
•  This trend is accelerating
–  Big data
–  Improved machine learning algorithms
–  Faster computers
–  Good open-source software
Why “Learn” ?
•  Learning is useful when:
–  Human expertise does not exist (navigating on Mars),
–  Humans are unable to explain their expertise (speech
recognition, object recognition)
–  Solution needs to be adapted to particular cases
(personalization)
Experience = data
•  Humans learn to improve their skills through practice
•  The goal of machine learning is to write algorithms that improve with experience
–  Experience is encoded as examples, or data
–  Autonomous learning must gain from experience
•  Where does training data come from?
–  Human annotation
–  Automatic collection
Objectives of Machine Learning
•  Algorithms:
–  design of efficient, accurate, and general learning algorithms to
deal with large-scale problems
–  make accurate predictions (unseen examples)
–  handle a variety of different learning problems

•  Theoretical questions:
–  What can be learned? Under what conditions?
–  What learning guarantees can be given?
–  What can we say about the inherent ease or difficulty of
learning problems?
Machine Learning?
•  Role of Probability and Statistics: Inference from a sample
•  Role of Computer Science: Efficient algorithms to
–  solve the optimization problem
–  represent and evaluate the model for inference
Definitions and Terminology
•  Example:
–  an object or instance in the data
•  Features (denoted by X):
–  the set of attributes, often represented as a vector, associated with an example, e.g., height and weight for gender prediction
•  Labels (Y):
–  in classification, the category associated with an object
–  in regression, real-valued numbers
Definitions and Terminology
•  Training set: data used for training the learning algorithm
•  Test set: data used for evaluating the learned model
Feature Extraction and Mapping
•  How you represent your data as a set of features is very important.
Example: Recognizing Flowers
Three types of Iris flowers: Setosa, Versicolor, Virginica
[Figure: example photos of the three Iris species]
Flower Recognition: Features
[Figure: flower features plotted for Setosa, Versicolor, and Virginica]
Example: A Digit Recognizer
•  Imagine you are asked to write a program to recognize handwritten digits
•  Now let's use the terminology to define the problem

Terminology Example
Classification task: given a handwritten digit, classify it into {0, 1, …, 9}
[Figure: an example image of the digit 3]
Digit Recognizer
•  Each handwritten digit image is an example (instance, or object) in your data
Labels (Space of outputs)
•  The labels are the ten digits, 0 through 9
[Figure: an example handwritten digit; the label of this example is 6]


Features (Space of Inputs)
We could divide the image into a 16x16 grid of pixels and encode each pixel intensity as a number.
[Figure: the 16x16 grid is mapped to a 256x1 feature vector for the example, e.g., (0, …, 1.2, 3.2, 4.5, 1.2, …, 0)]
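As a rough illustration of this mapping, here is a minimal sketch in Python/NumPy; the pixel values are made up, not taken from the slide's figure:

import numpy as np

# An illustrative 16x16 grayscale image; the intensity values are made up
image = np.random.default_rng(0).random((16, 16))

# Flatten the 16x16 grid (row by row) into a 256x1 feature vector
feature_vector = image.reshape(256, 1)

print(feature_vector.shape)  # (256, 1)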
Feature Mapping
•  How you represent your data as features is very important.
Training Set/Test Set
[Figure: the data is split into a training set (with labels) and a test set (with labels); the two sets should not overlap]

Terminology Example: Digit Recognizer
[Figure: handwritten digit images split into a training set and a test set, each with their labels]
•  Build the classifier using the features and the labels in the training set
•  Predict the labels of unseen examples from their features
•  Compare your predictions with the actual labels
How would you classify?
[Figure: a new handwritten digit image; predict its label]
Nearest Neighbor Classifiers
•  Basic idea:
–  If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test sample to the training samples, then choose the k "nearest" samples]
Nearest Neighbor Classifiers
[Figure: an unknown record plotted among the training samples]
k-Nearest Neighbor (kNN)
•  Assumes all instances are points in n-dimensional space
•  A distance measure is needed to determine the "closeness" of instances
•  Classify an instance by finding its nearest neighbors and picking the most popular class among the neighbors
1-Nearest Neighbor Classifier Algorithm
•  The training data is D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
•  Learning algorithm:
–  Store the training examples
•  Prediction algorithm:
–  To classify a new example x_new, find the training example (x_i, y_i) that is nearest to x_new
–  Predict the class y_new = y_i
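A minimal sketch of this prediction algorithm, generalized to k neighbors with Euclidean distance and a majority vote (the toy data and the function name are illustrative, not part of the course material; NumPy arrays are assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=1):
    # Distance from x_new to every stored training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors (k=1 gives 1-NN)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features per example, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.1, 3.9]), k=3))  # -> 1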


k-NN Classifier
•  To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
[Figure: training points and a new point plotted in the (X1, X2) feature space]
k-NN Classifier
[Figures: decision boundaries of the 1-, 3-, and 5-nearest neighbor classifiers (k = 1, 3, 5)]
Example: Handwritten Digits
•  16x16 bitmaps
•  8-bit grayscale
•  Euclidean distance over pixels

Accuracy:
–  7-NN: ~95.2%
–  SVM: ~95.8%
–  Humans: ~97.5%
Performance Measures
•  For binary classification, predictions fall into four groups:
–  True positives (TP)
–  True negatives (TN)
–  False positives (FP)
–  False negatives (FN)
•  Precision (selectivity)
–  TP / (TP + FP)
•  Recall (sensitivity, TPR)
–  TP / (TP + FN)
•  Specificity
–  TN / (TN + FP)
•  FPR?
•  There is often a trade-off between precision and recall
–  Receiver Operating Characteristic (ROC) curves are used to accurately assess performance in relation to this trade-off
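For concreteness, a small sketch computing these measures directly from the four counts (the function name and the example counts are illustrative):

def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)     # of predicted positives, the fraction that are correct
    recall = tp / (tp + fn)        # sensitivity / true positive rate (TPR)
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return precision, recall, specificity, fpr

# Example counts: 40 TP, 10 FP, 5 FN, 45 TN
print(binary_metrics(tp=40, fp=10, fn=5, tn=45))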
Performance Measures

Source: Wikipedia
Relationship to Modeling
Functional Representation
•  Variable X is related to variable Y, and we would like to learn this relationship from the data
•  Example: X = [weight, blood glucose, …] and Y is whether the person will have diabetes or not
•  We assume there is a relationship between X and Y:
–  certain values of X are unlikely to co-occur with "low risk", and some other values of X are unlikely to co-occur with "high risk"

•  This relationship is encapsulated by P(X, Y)


Formulation of Learning
Multitude of Learning Frameworks
•  Presence/absence of labeled data
Multitude of Learning Frameworks
•  Type of Labels:
Multitude of Learning Frameworks
•  Problems also differ in the protocol for obtaining
data
–  Passive Learning
–  Active Learning
•  and in assumptions in data
–  Batch
–  Online
Examples of Regression Problems
•  Predict tomorrow’s stock market price given current
market conditions and other possible side information.
•  Predict the age of a viewer watching a given video on
YouTube.
•  Predict the location in 3D space of a robot arm end
effector, given control signals (torques) sent to its
various motors.
•  Predict the amount of prostate-specific antigen (PSA) in
the body as a function of a number of different clinical
measurements.
•  Predict the temperature at any location inside a building
using weather data, time, door sensors, etc.
Linear Regression
y = wᵀx + ε, where y is the outcome variable (real-valued label), w is the coefficient vector, x is the feature vector, and ε is the residual error
–  Optimization problem: given instances of x and y, compute the coefficient vector that minimizes the residual error
–  Common assumption: the residual error has a Gaussian (normal) distribution
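A minimal sketch of this optimization with ordinary least squares in NumPy (the synthetic data and the true coefficients are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 1 + Gaussian residual error
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept becomes part of the coefficient vector
X_aug = np.hstack([X, np.ones((100, 1))])

# Least squares: the coefficient vector minimizing the squared residual error
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w)  # approximately [2, -3, 1]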
What Can Go Wrong?
Overfitting
[Figures: the same data fit with a 14-degree polynomial and a 20-degree polynomial]
•  Which model is more likely to generalize?
–  If we try to fit the model to the training data too accurately, we will end up with an overly complex model that treats every bit of variation in the data as information
•  Occam's Razor
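A rough sketch of the same phenomenon with polynomial fits (the degrees, data, and noise level are illustrative, not the slide's figures):

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 14):
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)  # least-squares fit to the training data
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    print(degree, round(test_mse, 4))  # the higher degree tends to fit noise and generalize worse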
Overfitting Example: k-NN
[Figures: training data; MAP estimate for K=1; MAP estimate for K=5]
•  What happens if we grow K further?
•  How about when K = N (the number of samples)?
Overfitting vs. Underfitting
[Figure: misclassification rate of K-NN on training and test samples as a function of K; to assess overfitting, reserve a portion of the samples as a test set]
Bias and Variance
Model Selection
(Do not confuse this K with the K in K-NN!)
•  K-fold Cross-Validation
–  Randomly partition the training samples into K groups
–  Repeat for i = 1 to K:
•  Reserve the i-th group as the test set
•  Train on the remaining K-1 groups, i.e., (K-1)N/K samples
–  Results in one prediction per sample
•  Use a measure of classification accuracy to assess performance
–  Repeat the above procedure multiple times
•  Report the mean and variance of the performance figures across these multiple runs
•  K = 5 is usually a good choice
–  K = N => Leave-One-Out Cross-Validation
•  Why repeat multiple times? Why variance?
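A minimal sketch of this procedure (NumPy arrays are assumed; the train_and_predict argument is a hypothetical stand-in for any classifier's train-then-predict step, not something defined in the course):

import numpy as np

def k_fold_cv(X, y, train_and_predict, K=5, seed=0):
    # Randomly partition the sample indices into K groups
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, K)
    correct = 0
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        # Train on the remaining K-1 groups, then predict the held-out group
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        correct += np.sum(y_pred == y[test_idx])
    return correct / len(y)  # one prediction per sample -> overall accuracy

Repeating this with different random seeds and reporting the mean and variance of the accuracy shows how stable the performance estimate is.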
Curse of Dimensionality
•  The space grows exponentially with the number of
dimensions (features)
–  If the number of training samples stays constant as the
number of features goes up, the data gets sparser
–  Distance/locality-based classifiers (e.g., k-NN) are particularly vulnerable to the curse of dimensionality
Breaking The Curse
•  Feature Selection
–  Explicitly aim at reducing the number of features by trying and evaluating models with different combinations of a few features
•  Computationally difficult since the number of combinations is
exponential
–  A parsimonious (less complex) model is always more
desirable for interpretability and generalizability
•  Dimensionality Reduction
–  Pre-process the data to obtain lower-dimensional
representations by discovering latent patterns
•  Principal Component Analysis (PCA)/ Singular Value
Decomposition (SVD)
Dimensionality Reduction via PCA
[Figure: a 2D point cloud with its mean and the first and second principal components]
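A rough sketch of PCA computed via SVD, along the lines of the PCA/SVD bullet on the previous slide (assuming the data is a NumPy matrix with one sample per row; names and sizes are illustrative):

import numpy as np

def pca(X, num_components=2):
    # Center the data at its mean (the "Mean" point in the figure)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by the amount of variance they explain
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]
    # Project onto the top components -> lower-dimensional representation
    return X_centered @ components.T, components

X = np.random.default_rng(2).normal(size=(200, 10))  # 200 samples, 10 features
Z, components = pca(X, num_components=2)
print(Z.shape)  # (200, 2)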
Interpretability vs. Accuracy
•  Data mining: Find interpretable patterns
–  Emphasis on understanding
•  Machine learning: Make accurate predictions
–  Emphasis on utility
More Data - Cleverer Algorithm
•  The lack of a sufficient number of examples is usually the biggest challenge in machine learning
–  Algorithmic sophistication can address this problem only
to a certain extent, since increasing complexity can
result in overfitting
•  The classifier should get better as it sees more
examples
–  If not, can we say it is a good algorithm?
•  Extreme case: Deep Learning (Convolutional
Neural Networks)
–  Works poorly with few samples
–  Works great if you have millions of samples
AI vs ML vs DL
•  Nice blog post by NVIDIA:
–  https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What will be Covered?
•  This class should give you the basic foundation for applying machine learning
•  Supervised learning
–  Bayesian learning
–  Linear models for regression and classification
–  Instance-based learning
–  Support vector machines
–  Decision Trees
–  Ensemble models
–  Deep Learning
•  Unsupervised learning
–  Clustering
–  Dimensionality reduction
•  Model selection and evaluation
•  Sequential Models
–  Hidden Markov Models
•  Additional topics (if time permits)
