Deep Neural Network (DNN)


Deep Neural Network(DNN)

Dr. Vaishnaw G. Kale


Associate Professor
School of Computer Science
Engineering & Applications
D.Y. Patil International
University, Pune
About the Course

Name of the subject: Deep Neural Network

Course Code: CSE 601 -- TC3 (AIM601)
Programme: B.Tech
Year of study: Third Year
Semester: VI
Specialization: Common for AI, DS, CS Track
Academic Year: 2023-24
Syllabus

Module-II: Neural Networks and Kernel Methods

Neural Networks: Feed-forward Network Functions, Error Backpropagation, The


Hessian Matrix, Regularization in Neural Networks, Mixture Density Networks,
Bayesian Neural Networks,

Kernel Methods: Constructing Kernels, Radial Basis Function Networks, Gaussian


Processes, Gaussian processes for regression, Learning the hyperparameters.
ANN-Single Layer Perceptron(SLP)
● A single layer perceptron (SLP) is a feed-forward network based on a threshold transfer
function.
● SLP is the simplest type of artificial neural network and can only classify linearly
separable cases with a binary target (1, 0).
ANN-Single Layer Perceptron(SLP)
● The single layer perceptron does not have a priori knowledge, so the initial weights are
assigned randomly.
● SLP sums all the weighted inputs and if the sum is above the threshold (some
predetermined value), SLP is said to be activated (output=1).

● The input values are presented to the perceptron, and if the predicted output is the same as
the desired output, then the performance is considered satisfactory and no changes to the
weights are made.
● However, if the output does not match the desired output, then the weights need to be
changed to reduce the error.
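● A minimal sketch of this learning rule in Python/NumPy (the learning rate, epoch count and the AND example are illustrative, not from the slides):

import numpy as np

def train_slp(X, y, epochs=20, lr=0.1, threshold=0.0):
    """Train a single-layer perceptron on binary targets (0/1)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])      # initial weights assigned randomly
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # activated (output = 1) if the weighted sum exceeds the threshold
            output = 1 if np.dot(w, xi) + b > threshold else 0
            error = target - output
            # no change when the prediction matches the desired output
            w += lr * error * xi
            b += lr * error
    return w, b

# Example: logical AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_slp(X, y)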
ANN-Single Layer Perceptron(SLP)
● Because the SLP is a linear classifier, if the cases are not linearly separable the learning
process will never reach a point where all the cases are classified properly.
● The most famous example of the inability of perceptron to solve problems with linearly
non-separable cases is the XOR problem.

● However, a multi-layer perceptron using the backpropagation algorithm can successfully


classify the XOR data.
ANN-Multi Layer Perceptron(MLP)

● A multi-layer perceptron (MLP) has the same structure as a single layer perceptron, with
one or more hidden layers added.
● The backpropagation algorithm consists of two phases:
a) the forward phase where the activations are propagated from the input to the output
layer, and
b) the backward phase, where the error between the actual output and the desired target
value at the output layer is propagated backwards in order to modify the
weights and bias values.
ANN-Multi Layer Perceptron
a) Forward Propagation
● Propagate inputs by adding all the weighted inputs and then computing outputs using
sigmoid threshold.
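● A minimal sketch of this forward phase for one hidden layer (shapes and names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Propagate inputs: weighted sums followed by sigmoid activations."""
    z1 = W1 @ x + b1        # weighted sum at the hidden layer
    a1 = sigmoid(z1)        # hidden activations
    z2 = W2 @ a1 + b2       # weighted sum at the output layer
    a2 = sigmoid(z2)        # network output
    return a1, a2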
ANN-Multi Layer Perceptron
b) Backward Propagation
● Propagates the errors backward by apportioning them to each unit according to the
amount of this error the unit is responsible for.
ANN-Multi Layer Perceptron
Error in Backward Propagation

1) Cost Function
● It is a function that measures the performance of a model for any given data.
● Cost Function quantifies the error between predicted values and expected values and
presents it in the form of a single real number.
● After making a hypothesis with initial parameters, we calculate the Cost function.
● And with a goal to reduce the cost function, we modify the parameters by using the Gradient
descent algorithm over the given data.
ANN-Multi Layer Perceptron
2) Gradient Descent
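● A minimal sketch of gradient descent on a mean-squared-error cost for a linear model (a simplified stand-in for the network parameters; names and the learning rate are illustrative):

import numpy as np

def gradient_descent(X, y, lr=0.01, steps=1000):
    """Repeatedly move the parameters against the gradient of the cost."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        pred = X @ w
        error = pred - y
        cost = np.mean(error ** 2)          # cost function: a single real number
        grad = 2 * X.T @ error / len(y)     # gradient of the cost w.r.t. w
        w -= lr * grad                      # update parameters to reduce the cost
    return w, cost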
ANN-Multi Layer Perceptron
Activation Function

● The main purpose of the activation function is to convert the weighted sum of input
signals of a neuron into the output signal.
● This output signal then serves as input to the next layer.
● Any activation function should be differentiable since we use a backpropagation
mechanism to reduce the error and update the weights accordingly.
ANN-Multi Layer Perceptron
Sigmoid

1. Ranges between 0 and 1.


2. A small change in x would result in a large change in y.
3. Usually used in the output layer of binary classification.

Tanh

1. Ranges between -1 and 1.


2. Output values are centered around zero.
3. Usually used in hidden layers.

RELU (Rectified Linear Unit)

1. Outputs max(0, x), so values range from 0 upwards.


2. Computationally inexpensive compared to sigmoid and tanh functions.
3. Default function for hidden layers.
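● The three activations above, written as a small NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1), centered around zero

def relu(x):
    return np.maximum(0.0, x)         # output is max(0, x)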
ANN-Multi Layer Perceptron

Softmax

● Softmax function is often described as a combination of multiple sigmoids.

● We know that sigmoid returns values between 0 and 1, which can be treated as

probabilities of a data point belonging to a particular class.

● Thus sigmoid is widely used for binary classification problems.

● The softmax function can be used for multiclass classification problems.

● This function returns the probability for a datapoint belonging to each individual class.
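● A minimal, numerically stable softmax sketch; the outputs sum to 1 and can be read as class probabilities:

import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

probs = softmax(np.array([2.0, 1.0, 0.1]))   # probabilities summing to 1.0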
The Hessian Matrix

Understanding Partial
Derivatives
The Hessian Matrix

● In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial


derivatives of a scalar-valued function.
● It describes the local curvature of a function of many variables.
● Hessian matrices belong to a class of mathematical structures that involve second order
derivatives.
● They are often used in machine learning and data science algorithms for optimizing a
function of interest.
● The Hessian matrix is a mathematical tool used to calculate the curvature of a function at
a certain point in space.
● The Hessian is nothing more than the gradient of the gradient: a matrix of second partial derivatives.
The Hessian Matrix

Several Applications

1. Second-derivative test:
● For a convex function, the eigenvalues of the Hessian matrix define its local/global
optima.
2. In optimization:
● Used in large-scale optimization.
3. To find inflection points.
4. To classify critical points based on the nature of the gradient.
The Hessian Matrix

● The formula for Hessian Matrix is given as
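● For a scalar-valued function f(x1, …, xn), the entry in row i and column j is the second partial derivative ∂²f/∂xi∂xj. For two variables (x, y) this is the 2×2 matrix:

H(f) = | ∂²f/∂x²     ∂²f/∂x∂y |
       | ∂²f/∂y∂x    ∂²f/∂y²  |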


The Hessian Matrix

Two things to notice here:

● This only makes sense for scalar-valued functions.
● The object H(f) is no ordinary matrix: it is a matrix with functions as entries. In other words, it is meant to be evaluated at some point (x0, y0, …).
● The word "Hessian" also sometimes refers to the determinant of this matrix, instead of
to the matrix itself.
The Hessian Matrix
Example

● Calculate the Hessian matrix at the point (1,0) of the following multivariable function:

● First of all, we have to compute the first order partial derivatives of the function:

● Once we know the first derivatives, we calculate all the second order partial derivatives
of the function:
The Hessian Matrix

● Now we can find the Hessian matrix using the formula for 2×2 matrices.
● So the Hessian matrix evaluated at the point (1,0) is:
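● As an illustration, the same computation for a hypothetical function f(x, y) = x²·y + y³ (not the function from the example above) can be done with SymPy:

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3                      # hypothetical example function

H = sp.hessian(f, (x, y))                # matrix of second partial derivatives
# H = [[2*y, 2*x], [2*x, 6*y]]
H_at_point = H.subs({x: 1, y: 0})        # evaluate at the point (1, 0)
# H_at_point = [[0, 2], [2, 0]]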
Regularization in Neural Network
● Neural networks can learn to represent complex relationships between network inputs and
outputs.
● This representational power helps them perform better than traditional machine learning
algorithms in computer vision and natural language processing tasks.
● However, one of the challenges associated with training neural networks is overfitting.
● When a neural network overfits on the training dataset, it learns an overly complex representation
that models the training dataset too well.
● As a result, it performs exceptionally well on the training dataset but generalizes poorly to unseen
test data.
● Regularization techniques help improve a neural network’s generalization ability by reducing
overfitting.
● They do this by minimizing needless complexity and exposing the network to more diverse data.
Regularization in Neural Network

Common regularization techniques:

● Early stopping
● L1 and L2 regularization
● Data augmentation
● Addition of noise
● Dropout
Regularization in Neural Network
1) Early stopping
● Early stopping is one of the simplest and most intuitive regularization techniques.
● It involves stopping the training of the neural network at an earlier epoch.
● As you train the neural network over many epochs, the training error decreases.
● If the training error becomes too low and reaches arbitrarily close to zero, then the network is
sure to overfit on the training dataset.
● Such a neural network is a high variance model that performs badly on test data that it has
never seen before despite its near-perfect performance on the training samples.
● Therefore, heuristically, if we can prevent the training loss from becoming arbitrarily low, the
model is less likely to overfit on the training dataset, and will generalize better.
● So how do we do it in practice?
Regularization in Neural Network

You can monitor one of the following:

● The change in metrics such as validation error and validation accuracy

(Figures: error change and early stopping; monitoring the validation accuracy for early stopping)
Regularization in Neural Network

You can monitor one of the following:

● The change in the weight vector


● Another way to know when to stop is to monitor the change in weights of the network.
● Let Wt and Wt-k denote the weight vectors at epochs t and t-k respectively.
● One option is to stop when the norm of the change ‖Wt − Wt-k‖ is small; a better approach is to compute the change in individual components of the weight vector.
● If the maximum change (across all components) is less than ϵ, we can conclude that the
weights are not changing significantly, so we can stop the training of the neural network.
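● A minimal sketch of early stopping in practice, monitoring the validation error (the first signal mentioned above); the patience value, train_one_epoch, validation_error and the model's get/set_weights methods are illustrative placeholders:

def fit_with_early_stopping(model, data, max_epochs=200, patience=5):
    """Stop training when the validation error stops improving."""
    best_val, epochs_without_improvement = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, data.train)           # placeholder training step
        val_err = validation_error(model, data.val)  # placeholder metric
        if val_err < best_val:
            best_val, epochs_without_improvement = val_err, 0
            best_weights = model.get_weights()       # remember the best model so far
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # validation error no longer improves
    model.set_weights(best_weights)
    return model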
Regularization in Neural Network
2) Data Augmentations

● Data augmentation is a regularization technique that helps a neural network generalize better by
exposing it to a more diverse set of training examples.
● As deep neural networks require a large training dataset, data augmentation is also helpful when
we have insufficient data to train a neural network.
● Let’s take the example of image data augmentation. Suppose we have a dataset with N training
examples across C classes.
● We can apply certain transformations to these N images to construct a larger dataset.
Regularization in Neural Network

2) Data Augmentations

● What is a valid transformation? Any operation that does not alter the original label is
a valid transformation.
● For example, a panda is a panda–whether it’s facing right or left, located near the
center of the image or one of the corners.
● We can apply any label-invariant transformation to perform data augmentation such as
a) Color space transformations such as change of pixel intensities
b) Rotation and mirroring
c) Noise injection, distortion, and blurring
Regularization in Neural Network
2) Data Augmentations

● Beyond this, there is also a newer image augmentation technique called Mixup (sketched below).
● Mixup is a regularization technique that uses a convex combination of existing inputs to
augment the dataset.
● A few other approaches to data augmentation include Cutout, CutMix, and AugMix.
● Cutout involves the random removal of portions of an input image during training.
● CutMix replaces the removed sections with parts of another image.
● AugMix is a regularization technique that makes a neural network robust to distribution
change.
● AugMix performs a series of transformations on the same image, and then uses a
composition of these transformed images to get the resultant image.
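● A minimal sketch of Mixup as described above: a convex combination of two existing inputs and their one-hot labels, with the mixing weight drawn from a Beta distribution (the alpha value is illustrative):

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two training examples."""
    lam = np.random.beta(alpha, alpha)   # mixing coefficient in [0, 1]
    x_mix = lam * x1 + (1 - lam) * x2    # blended input (e.g. two images)
    y_mix = lam * y1 + (1 - lam) * y2    # blended one-hot labels
    return x_mix, y_mix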
Regularization in Neural Network
2) Data Augmentations

AugMix
Regularization in Neural Network
3) L1 and L2 Regularization

a) Lp norm
● In general, Lp norms (for p >= 1) penalize larger weights. They force the norm of the weight
vector to stay sufficiently small.
● The Lp norm of a vector x in n-dimensional space is given by:

b) L1 norm
● When p = 1, we get the L1 norm, the sum of the absolute values of the components of the vector:
Regularization in Neural Network
3) L1 and L2 Regularization

c) L2 Norm

● When p=2, we get L2 norm, the Euclidean distance of the point from the origin in n-
dimensional vector space:
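● Written out, the three norms referenced above are:

‖x‖p = ( |x1|^p + |x2|^p + … + |xn|^p )^(1/p)
‖x‖1 = |x1| + |x2| + … + |xn|
‖x‖2 = √( x1² + x2² + … + xn² )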
Regularization in Neural Network
4) Addition of Noise

● Another regularization approach is the addition of noise. You can add noise to the input, the
output labels, or the gradients of the neural network.
a) Adding noise to the input data
● When using the sum of squares loss function, adding Gaussian noise to the inputs is equivalent to L2 regularization.
● To understand this equivalence, let’s assume a simple network with input layer and weights:
Regularization in Neural Network

b) Adding noise to the output

● Adding noise to the output labels prevents the network from memorizing the training dataset by
introducing perturbations in the output labels. Here are some techniques:

i) DisturbLabel

ii) Label smoothing

c) Adding noise to the Gradient

● If the gradient vector at step t is gt, we update it by adding noise with zero mean and a time-dependent variance.
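● A commonly used annealing schedule, consistent with the η and γ values quoted below, is:

gt ← gt + N(0, σt²),   with σt² = η / (1 + t)^γ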

● Here η is chosen from the set {0.01, 0.3, 1.0} and γ is set to 0.55. The variance of the added noise decreases as training proceeds.
Regularization in Neural Network

5) Dropout

● Dropout involves dropping neurons in the hidden layers and (optionally) the input layer.
● During training, each neuron is assigned a “dropout” probability, like 0.5.
● With a dropout of 0.5, there’s a 50% chance of each neuron participating in training within each
training batch.
● This results in slightly different network architecture for each batch.
● It is equivalent to training different neural networks on different subsets of the training data.
Regularization in Neural Network
5) Dropout

Neural Network with dropout


A simple Neural Network
Regularization in Neural Network

5) Dropout

● The weight matrix is initialized once at the beginning of the training.


● In general, for the k-th batch, backpropagation occurs only along paths through the neurons present for that batch.
● Meaning only the weights corresponding to neurons that are present get updated.
● At test time, all the neurons are present in the network.
● So how do we account for the dropout applied during training? At test time we weight each neuron's output by the probability p, the fraction of the time the neuron was present during training.
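● A minimal sketch of dropout at training time and the corresponding scaling by p at test time, as described above (NumPy, names illustrative):

import numpy as np

def dropout_forward(a, p_keep=0.5, training=True):
    """Apply dropout to a layer's activations `a`."""
    if training:
        mask = (np.random.rand(*a.shape) < p_keep)  # each neuron kept with probability p
        return a * mask                              # dropped neurons output 0 for this batch
    # at test time all neurons are present, so scale outputs by p
    return a * p_keep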
Mixture Density Networks
● Mixture Density Networks are built from two components – a Neural Network and a Mixture
Model.
● The Neural Network can be any valid architecture which takes in the input and converts into a
set of learned features.
● Now, let’s take a look at the Mixture Model.
● The Mixture Model is a model of probability distributions built up as a weighted sum of
simpler distributions. More formally, it models a probability density function (pdf) as a
mixture of m pdfs indexed by j, with mixing weights:
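● In a standard notation (the symbol names are illustrative), with mixing weights αj that sum to one:

p(t | x) = Σ_{j=1..m} αj φj(t | x)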

where the parameters of each component distribution describe its shape and location.
Mixture Density Networks

● By using the Gaussian kernel in the above equation it becomes:
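● In a standard form, with means μj(x) and variances σj²(x) produced by the network (symbol names illustrative):

p(t | x) = Σ_{j=1..m} αj(x) · N( t | μj(x), σj²(x) )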

Mixture Density Network: The output of a neural network parameterizes a Gaussian mixture
model.
Mixture Density Networks
● Few restrictions and ways to implement the Mixture Density Networks
1. The mixing coefficients are probabilities and have to be less than one and sum to unity.
This can be easily achieved by passing the corresponding network outputs through a Softmax
layer.
2. The variances should be strictly positive.
3. The center parameters represent location parameters and can be taken directly as the raw
outputs of the mean neurons.
● The network is trained end-to-end using standard backpropagation.
● For that, the loss function we are minimizing is the negative log-likelihood, which is
equivalent to maximum likelihood estimation.
● We already know the form of the likelihood, so it is just a matter of computing it and
minimizing the negative log-likelihood.
Bayesian Neural Network(BNN)

● Deep learning models tend to overfit and have trouble establishing the uncertainty of their predictions.
● Bayesian Neural Networks are a specific type of neural
network trained in the light of the Bayesian paradigm,
capable of quantifying the uncertainty associated with the
underlying processes.
● This approach can be helpful in applications where model
failures are especially costly, e.g., autonomous driving,
medical diagnosis or finance.
Bayesian Neural Network(BNN)
Stochastic Neural Network

● Among the different types of artificial neural networks, stochastic neural networks have
proven to be one of the most generic and flexible ways to address these limitations of ANNs.
● By introducing stochastic components into the network — giving the network either a
stochastic activation or stochastic weights — it can simulate multiple possible models of
parameters θ with an associated probability distribution p(θ).
● By comparing these multiple predictions, it is possible to obtain a better idea of
uncertainties.
● If the different models agree, then the uncertainty is low.
● If they disagree, then the uncertainty is high.
Bayesian Neural Network(BNN)
1. Goal — SNN focuses on optimization while BNN
focuses on marginalization. Optimization finds
one optimal point to represent a weight, while
marginalization treats each weight as a random variable
and finds its distribution.
2. Estimate — The estimate of the parameters for an SNN
would be the maximum likelihood estimate (MLE),
while for a BNN the estimate would rather be the
maximum a posteriori (MAP) estimate or the predictive
distribution.
3. Method — Basically, an SNN uses differentiation
to find the optimal value, for example via gradient descent. In
a BNN, since the sophisticated integrals are hard to
determine, researchers usually rely
on Markov Chain Monte Carlo (MCMC) or
Variational Inference.
Bayesian Neural Network(BNN)

Design of BNN

The procedure to design a BNN can be divided into the following steps:

● Choice of a functional model: y = Φ(x) (architecture).


● Choice of a stochastic model: p(θ) for model parametrization and p(y|x, θ) for
confidence of the model.
Bayesian Neural Network(BNN)

Functional model

● To train the network, we want to obtain the posterior distribution over the parameters, given our data D = {Dx, Dy}
(training inputs and labels, respectively):
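● By Bayes' theorem (standard form):

p(θ | D) = p(Dy | Dx, θ) · p(θ) / p(Dy | Dx)

where the denominator p(Dy | Dx) = ∫ p(Dy | Dx, θ′) p(θ′) dθ′ is the evidence.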

● Due to the complexity of the posterior, especially the evidence integral term, computing
this in a standard way is in general intractable.
Bayesian Neural Network(BNN)

Functional model

● When dealing with predictions, it is interesting to compute the marginal p(y|x, D) to


quantify our model's uncertainty:
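● In standard form:

p(y | x, D) = ∫ p(y | x, θ) · p(θ | D) dθ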

● To do this, we rely on techniques such as Markov Chain Monte Carlo (MCMC) and
Variational Inference, which are able to evaluate these integrals in different ways.
● In the case of MCMC approach, a large set of weights θ is sampled from the posterior
and used to compute a series of possible outputs y.
Bayesian Neural Network(BNN)
Variational Inference

● Theoretically we don’t need a learning phase for the BNNs since we can sample the
posterior and obtain its estimators.
● The probabilities P(D|H) and P(H) are given by the stochastic model, but computing the
integral for the evidence can be quite a challenge.
● Even when we are able to do this, directly sampling the posterior is difficult due to the
high dimensionality of the sampling space.
● To deal with this, Variational Inference is one of the most popular methods.
● Bayesian neural networks address overfitting by modeling uncertainty in the weights.
Moreover, they can be trained using standard neural network tooling with an algorithm called
stochastic variational inference.
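● In variational inference, a parametric distribution qφ(θ) is fitted to approximate the posterior by maximizing the evidence lower bound (ELBO); a standard way to write it is:

ELBO(φ) = E_{qφ(θ)}[ log p(D | θ) ] − KL( qφ(θ) ‖ p(θ) )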
Bayesian Neural Network(BNN)

Functional model

To obtain an estimator ŷ for the output y, we have to consider two different situations:
1) When performing regression, we use the mean of the sampled outputs as the estimator and their covariance as the uncertainty.
2) When performing classification, the average model will give the relative probability
of each class, which by itself can be considered as an estimate of uncertainty. The final
prediction ŷ is the most likely class.
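● A minimal sketch of the regression case: sample several parameter settings from the (approximate) posterior and use the mean of the resulting predictions as ŷ and their spread as the uncertainty (sample_parameters and predict are placeholders):

import numpy as np

def bnn_predict(x, n_samples=100):
    """Monte Carlo estimate of the predictive mean and uncertainty."""
    outputs = []
    for _ in range(n_samples):
        theta = sample_parameters()        # placeholder: draw weights from the posterior
        outputs.append(predict(x, theta))  # placeholder: forward pass with sampled weights
    outputs = np.stack(outputs)
    y_hat = outputs.mean(axis=0)           # estimator: mean over sampled models
    uncertainty = outputs.std(axis=0)      # uncertainty: spread of the sampled predictions
    return y_hat, uncertainty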
Bayesian Neural Network(BNN)

Stochastic Model

a) BNNs with weights as stochastic variables b) BNNs with activations as stochastic variables.
Bayesian Neural Network(BNN)

(Figure: a) workflow, b) training, c) using a BNN for predictions)


Bayesian Neural Network(BNN)

● In short, to obtain estimates in a model through BNNs, we need a stochastic model


containing our priors (and for the variational case, a variational posterior), a functional
model for the network and the training data.
● From there, training is done using one of the available sampling methods, so that we
can infer the posterior for the parameters.
● From the posterior, we can evaluate the reliability of our model by obtaining its marginal
predictive distribution, with an estimator given by the mean of the sampled models and uncertainty
given by their covariance matrix.
● BNNs have been used in many fields to quantify uncertainty, e.g., in computer vision,
network traffic monitoring, aviation, civil engineering, hydrology, astronomy,
electronics, and medicine
Bayesian Neural Network(BNN)

Advantages
● They are more robust and able to generalize better than other neural networks.
● They can quantify the uncertainty in their predictive output.
● They can be used for many practical applications.

Disadvantages
● They can be more complicated to train than other neural networks, and require
knowledge of the fields of probability and statistics.
● They can be slower to converge than other neural networks and often require more
data. Since the weights of the network are distributions instead of single values, more
data is required to estimate the weights accurately.
Kernel
● In the realm of machine learning, kernels play a pivotal role, especially in algorithms
designed for classification and regression tasks.
● A kernel transforms non-linear relationships into a linear format, making them accessible to
algorithms that traditionally only handle linear data.
● Kernels achieve this without the computational intensity of mapping data to higher
dimensions explicitly.
● Their efficiency and effectiveness in revealing hidden patterns make them a cornerstone in
modern machine learning.
● At its most fundamental level, a kernel is a relatively straightforward function that operates
on two vectors from the input space, commonly referred to as the X space.
● The primary role of this function is to return a scalar value, but the fascinating aspect of this
process lies in what this scalar represents and how it is computed.
Kernel

● This scalar is, in essence, the dot product of the two input vectors.
● However, it's not computed in the original space of these vectors.
● Instead, it's as if this dot product is calculated in a much higher-dimensional space,
known as the Z space.
● This is where the kernel's true power and elegance come into play.
● It manages to convey how close or similar these two vectors are in the Z space without
the computational overhead of actually mapping the vectors to this higher-dimensional
space and calculating their dot product there.
● It allows you to glean the necessary information about the vectors in this more complex
space without having to access the space directly.
Kernel

● This approach is particularly useful in SVMs, where understanding the relationship


and position of vectors in a higher-dimensional space is crucial for classification
tasks.
Constructing Kernel

● Constructing kernels is a fundamental concept in machine learning, particularly in methods


such as Support Vector Machines (SVM) and Relevance Vector Machines (RVM).
● Kernels are functions that compute the similarity between pairs of data points in a high-
dimensional space.

Definition of Kernel

● A kernel is a function K(x,y) that computes the inner product of the transformed feature
vectors Φ(x) and Φ(y) in a higher-dimensional space without explicitly calculating the
transformation.
● Mathematically, a function K is a valid kernel if it corresponds to an inner product in some
feature space.
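● A small sketch illustrating this definition: for the polynomial kernel K(x, y) = (xᵀy)² on 2-D inputs, the kernel value equals the inner product of the explicit feature maps Φ(x) = (x1², √2·x1·x2, x2²):

import numpy as np

def K(x, y):
    return np.dot(x, y) ** 2                       # kernel computed in the input space

def phi(x):
    # explicit feature map into the higher-dimensional Z space
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), np.dot(phi(x), phi(y)))             # both equal 1.0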
Constructing Kernel

● Commonly Used Kernels (see the examples below)

● Custom Kernel

Depending on the problem, we might design custom kernels that capture domain-specific
relationships.
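● Some standard examples of commonly used kernels (the constants c, d, σ, a and b are kernel parameters):

Linear kernel:      K(x, y) = xᵀy
Polynomial kernel:  K(x, y) = (xᵀy + c)^d
RBF (Gaussian):     K(x, y) = exp( −‖x − y‖² / (2σ²) )
Sigmoid kernel:     K(x, y) = tanh( a·xᵀy + b )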
Constructing Kernel

Mercer's Theorem

● Mercer's theorem provides conditions that ensure a function K is a valid kernel.


● It states that K is a valid kernel if and only if the kernel matrix Kij​=K(xi​,xj​) is positive semi-
definite for any set of inputs x1​,x2​,…,xn.

Kernel Composition

● Kernels can be combined through addition or multiplication to form new kernels.


● For example, if K1 and K2 are valid kernels, then K(x,y) = aK1(x,y) + bK2(x,y) and
K(x,y) = K1(x,y)K2(x,y) are also valid kernels, where a and b are positive constants.
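● A small numerical sketch of Mercer's positive semi-definiteness condition for an RBF kernel matrix (the gamma value is illustrative):

import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Kij = exp(-gamma * ||xi - xj||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X = np.random.rand(20, 3)
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)         # eigenvalues of the symmetric kernel matrix
print(eigvals.min() >= -1e-10)          # True: positive semi-definite up to round-off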
Constructing Kernel
Cross-validated hyperparameter tuning

● Many kernels have hyperparameters (e.g., σ in the RBF kernel, d in the polynomial kernel).
● These hyperparameters can be tuned using cross-validation to find the values that optimize
model performance.
● The use of kernel tricks involves implicitly mapping data into higher-dimensional spaces.
● This can be computationally more efficient than explicitly computing the transformations.
● Kernel functions encapsulate these transformations.
● Incorporating domain knowledge can help design custom kernels that capture specific
relationships or structures in the data.
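● A minimal sketch of cross-validated tuning of the RBF kernel's hyperparameters with scikit-learn (the grid values are illustrative, and X_train, y_train are assumed to exist):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],            # regularization strength
    'gamma': [0.01, 0.1, 1.0],    # RBF kernel width parameter
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_train, y_train)       # X_train, y_train assumed to be defined
print(search.best_params_)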
Choosing Kernel
● Choosing the right kernel for a machine learning task is a critical decision that can
significantly impact the performance of the model.
● The selection process involves understanding both the nature of the data and the specific
requirements of the task at hand.
● Firstly, it's important to consider the distribution and structure of the data. If the data is
linearly separable, a linear kernel may be sufficient.
● However, for more complex, non-linear data, a polynomial or radial basis function (RBF)
kernel might be more appropriate.
● Lastly, domain knowledge can play a significant role in kernel selection.
● Understanding the underlying phenomena or patterns in the data can guide the choice of the
kernel.
Radial Basis Function Network(RBFN)

● They are a special type of feed-forward neural network that uses radial basis functions as activation
functions.
● It has an input layer, a hidden layer, and an output layer and is mostly used for classification,
regression, and time-series prediction.
● Radial-basis function networks are distinguished from other neural networks by their
universal approximation capability and fast learning speed.
● The main advantage of the RBF network is that it has only one hidden layer and uses the
radial basis function as the activation function.
● These functions are very powerful in approximation.
● Radial basis function (RBF) networks are a commonly used type of artificial neural network
for function approximation problems.
Radial Basis Function Network(RBFN)
Working

● RBFNs perform classification by measuring the input’s similarity to examples from the
training set.
● RBFNs have an input vector that feeds to the input layer. They have a layer of RBF neurons.
● The function finds the weighted sum of the inputs, and the output layer has one node per
category or class of data.
● The neurons in the hidden layer contain the Gaussian transfer functions, which have outputs
that are inversely proportional to the distance from the neuron’s center.
● The network’s output is a linear combination of the input’s radial-basis functions and the
neuron’s parameters.
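● A minimal sketch of this forward pass: Gaussian similarities to the neuron centers, followed by a weighted sum per output category (centers, widths and weights are illustrative placeholders):

import numpy as np

def rbfn_forward(x, centers, betas, W, b):
    """RBF network: Gaussian hidden layer followed by a linear output layer."""
    # activation of each RBF neuron: 1 at its center, decaying with distance
    dists_sq = ((centers - x) ** 2).sum(axis=1)
    phi = np.exp(-betas * dists_sq)
    # each output node takes a weighted sum of the RBF activations
    scores = W @ phi + b
    return scores            # classify by picking the highest-scoring category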
Radial Basis Function Network(RBFN)

Working
Radial Basis Function Network(RBFN)
1) Input Vector:
● The input vector is the n-dimensional vector you are trying to classify.
● The entire input vector is presented to each of the RBF neurons.
2) RBF Neurons:
● Each RBF neuron stores a ‘prototype’ vector, just one of the vectors from the training set.
● Each RBF neuron compares the input vector to its prototype and outputs a value between 0 and 1
which is a measure of similarity.
● If the input is equal to the prototype, the output of the RBF neuron will be 1.
● As the distance between the input and the prototype increases, the response decreases exponentially
towards 0. The shape of the RBF neuron's response is a bell curve.
● The prototype vector is also often called the “center” of a neuron since it is the value at the center
of the bell curve.
Radial Basis Function Network(RBFN)
3) Outputs :

● The output of the network consists of a set of nodes, one per category that we are trying to
classify.
● Each output node computes a sort of score for the associated category.
● Typically, a classification decision is made by assigning the input to the category with the
highest score.
● The score is computed by taking a weighted sum of the activation values from every RBF
neuron.
● By weighted sum, we mean that an output node associates a weight value with each of the
RBF neurons, and multiplies the neuron’s activation by this weight before adding it to the
total response.
Radial Basis Function Network(RBFN)

3) Outputs :

● Because each output node is computing the score for a different category, every output node
has its own set of weights.
● The output node will typically give a positive weight to the RBF neurons that belong to its
category, and a negative weight to the others.
● Each RBF neuron computes a measure of the similarity between the input and its prototype
vector (taken from the training set).
● Input vectors which are more similar to the prototype return a result closer to 1.
● There are different possible choices of similarity functions, but the most popular is based on
the Gaussian.
Radial Basis Function Network(RBFN)

3) Outputs :

● The equation for a Gaussian with a one-dimensional input.
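● Written here without the normalizing constant, as is common for RBF activations:

φ(x) = exp( −(x − μ)² / (2σ²) )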


● Where x is the input, mu is the mean, and sigma is the standard deviation.
● The radial basis function is a function whose value depends only on the distance from the
origin; its value is a real number.
● The alternate forms of radial basis functions are defined as the distance from another point
denoted C, called the center.
● The main advantage of the RBF network is that it has only one hidden layer and it uses the
radial basis function as the activation function.
● These functions are very powerful in approximation.
Learning The Hyperparameters

Hyperparameters:

● Hyperparameters are the variables which determine the network structure (e.g.,
number of hidden units) and the variables which determine how the network is trained
(e.g., learning rate).
● Hyperparameters are set before training (before optimizing the weights and biases).
Learning The Hyperparameters
Hyperparameters related to the neural network:

1) Number of Hidden layers

● Many hidden units within a layer, combined with regularization techniques, can increase accuracy.
● A smaller number of units may cause underfitting.

2) Dropout

● It is a regularization technique to avoid overfitting, thus increasing generalization power.
● Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good
starting point.
● A probability that is too low has minimal effect, and a value that is too high results in under-learning by the
network.
● You are likely to get better performance when dropout is used on a larger network, giving
the model more of an opportunity to learn independent representations.
Learning The Hyperparameters

3) Network Weight Initialization

● Ideally, it may be better to use different weight initialization schemes according to the
activation function used on each layer.

4) Activation Function

● Activation functions are used to introduce nonlinearity to models, which allows deep
learning models to learn nonlinear prediction boundaries.
● Generally, the rectifier activation function (ReLU) is the most popular.
● Sigmoid is used in the output layer while making binary predictions.
● Softmax is used in the output layer while making multi-class predictions.
Learning The Hyperparameters

Hyperparameters related to training:

1) Learning rate

● The learning rate defines how quickly a


network updates its parameters.
● Low learning rate slows down the learning
process but converges smoothly.
● Larger learning rate speeds up the learning but
may not converge.
● Usually a decaying Learning rate is preferred.
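● One common decaying schedule, as a sketch (the initial rate and decay constant are illustrative):

def decayed_learning_rate(initial_lr=0.1, decay=0.01, epoch=0):
    """Inverse-time decay: the learning rate shrinks as training proceeds."""
    return initial_lr / (1.0 + decay * epoch)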
Learning The Hyperparameters
2) Momentum

● Momentum helps to know the direction of the next step with the knowledge of the previous
steps.
● It helps to prevent oscillations.
● A typical choice of momentum is between 0.5 and 0.9.

3) Number of epochs

● Number of epochs is the number of times the whole training data is shown to the network while
training.
● Increase the number of epochs until the validation accuracy starts decreasing even when
training accuracy is increasing (overfitting).
Learning The Hyperparameters

4) Batch Size

● Mini-batch size is the number of samples given to the network after which a parameter
update happens.
● A good default for batch size might be 32. Also try 64, 128, 256, and so on.

Methods used to find out hyper parameters

1) Manual Search

2) Grid Search

3) Random Search

4) Bayesian Optimization
Gaussian Process Regression(GPR)

● Gaussian Process Regression (GPR) is a powerful and flexible non-parametric regression


technique used in machine learning and statistics.
● It is particularly useful when dealing with problems involving continuous data, where the
relationship between input variables and output is not explicitly known or can be complex.
● GPR is a Bayesian approach that can model uncertainty in predictions, making it a valuable tool
for various applications, including optimization, time series forecasting, and more.
Gaussian Process Regression(GPR)

Gaussian Process

● A Gaussian Process (GP) is a non-parametric, probabilistic model used in statistics and machine
learning for regression, classification, and uncertainty quantification.
● It describes a collection of random variables, any finite number of which have a joint
Gaussian distribution.
● GPs are a versatile and effective technique for modeling intricate relationships in data
and producing forecasts with related uncertainty.
Gaussian Process Regression(GPR)
Characteristics of Gaussian Processes:

● Non-Parametric Nature: GPs can adjust to the complexity of the data because they do not
rely on a fixed number of model parameters.
● Probabilistic Predictions: The uncertainty of GP predictions can be quantified because GPs deliver
predictions as probability distributions.
● Interpolation and Smoothing: GPs are useful for noisy or irregularly sampled data
because they are good at smoothing noisy data and interpolating between data points.
● Marginalization of Hyperparameters: GPs can marginalize over hyperparameters, eliminating the
requirement for explicit hyperparameter tweaking and making the model
simpler.
Gaussian Process Regression(GPR)

Steps in Gaussian Process Regression

● Data Collection: Gather the input-output data pairs for your regression problem.
● Choose a Kernel Function: Select an appropriate covariance function (kernel) that
suits your problem. The choice of kernel influences the shape of the functions that GPR
can model.
● Parameter Optimization: Estimate the hyperparameters of the kernel function by
maximizing the likelihood of the data. This can be done using optimization techniques
like gradient descent.
● Prediction: Given a new input, use the trained GPR model to make predictions. GPR
provides both the predicted mean and the associated uncertainty (variance).
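● A minimal sketch of these steps with scikit-learn (the kernel choice and data are illustrative):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# 1) data collection: noisy samples of an unknown function
X = np.random.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)

# 2) choose a kernel; 3) its hyperparameters are optimized by maximizing the likelihood in fit()
kernel = ConstantKernel() * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, n_restarts_optimizer=5)
gpr.fit(X, y)

# 4) prediction: mean and uncertainty (standard deviation) for new inputs
X_new = np.linspace(0, 5, 100).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_new, return_std=True)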
Gaussian Process Regression(GPR)

● Stock Price Prediction: GPR can be used to model and predict stock prices, taking
into account the volatility and uncertainty in financial markets.
● Computer Experiments: GPR is useful in optimizing complex simulations by
modeling the input-output relationships and identifying the most influential
parameters.
● Anomaly Detection: GPR can be applied to anomaly detection, where it identifies
unusual patterns in time series data by capturing normal data distributions.
Gaussian Process Regression(GPR)
Advantages of GPR
● GPR provides a probabilistic framework for regression, which means it not only gives point estimates but also provides uncertainty estimates for predictions.
● It is highly flexible and can capture complex relationships in the data.
● GPR can be adapted to various applications, including time series forecasting, optimization, and Bayesian optimization.

Challenges of GPR
● GPR can be computationally expensive when dealing with large datasets, as the inversion of a covariance matrix is required.
● The choice of the kernel function and its hyperparameters can significantly impact the model’s performance.
Thank you!
