Deep Neural Network (DNN)
● The input values are presented to the perceptron, and if the predicted output is the same as
the desired output, then the performance is considered satisfactory and no changes to the
weights are made.
● However, if the output does not match the desired output, then the weights need to be
changed to reduce the error.
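The update rule described above can be sketched as follows. This is a minimal illustration, not the slides' own code: the learning rate, the step threshold, and the AND-gate data are assumptions chosen so that the example is linearly separable and converges.

```python
def step(z):
    """Threshold activation: fire only for a strictly positive input."""
    return 1 if z > 0 else 0


def train_perceptron(samples, lr=1.0, epochs=20):
    """Perceptron learning: change weights only when the output is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            predicted = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - predicted
            if error != 0:  # output differs from the desired output
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # every case classified correctly: stop
            break
    return weights, bias


# Assumed toy data: the logical AND function (linearly separable).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
predictions = [
    step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    for inputs, _ in AND_DATA
]
```

Replacing the targets with those of XOR would leave at least one case misclassified no matter how many epochs are run.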
ANN-Single Layer Perceptron(SLP)
● Because an SLP is a linear classifier, if the cases are not linearly separable the learning
process will never reach a point where all the cases are classified correctly.
● The most famous example of the inability of perceptron to solve problems with linearly
non-separable cases is the XOR problem.
● A multi-layer perceptron (MLP) has the same structure as a single-layer perceptron, with
one or more hidden layers added between the input and the output layer.
● The backpropagation algorithm consists of two phases:
a) the forward phase where the activations are propagated from the input to the output
layer, and
b) the backward phase, where the error between the actual output and the desired (target)
value at the output layer is propagated backwards in order to modify the
weights and bias values.
ANN-Multi Layer Perceptron
a) Forward Propagation
● Propagate the inputs forward: each unit sums its weighted inputs (plus a bias) and computes
its output by applying a sigmoid activation.
b) Backward Propagation
● Propagates the errors backward by apportioning them to each unit according to the
amount of this error the unit is responsible for.
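The forward and backward phases can be sketched for a deliberately tiny network with one input, one hidden unit and one output unit. The squared-error loss, the parameter values, and the function names are assumptions made for illustration. The analytic gradients from the chain rule are compared against finite differences as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward phase: propagate activations from input to output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden activation
    y = sigmoid(w2 * h + b2)  # network output
    return h, y


def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Backward phase: apportion the output error to each parameter."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - target) * y * (1.0 - y)          # error at the output unit
    delta_hidden = delta_out * w2 * h * (1.0 - h)     # share of the hidden unit
    return [delta_hidden * x, delta_hidden, delta_out * h, delta_out]


params = [0.5, -0.3, 0.8, 0.1]  # assumed values for w1, b1, w2, b2
x, target = 1.5, 1.0
grads = backward(params, x, target)

# Numerical gradients by central differences, for comparison only.
eps = 1e-6
numeric = []
for i in range(len(params)):
    plus = list(params)
    minus = list(params)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss(plus, x, target) - loss(minus, x, target)) / (2 * eps))
```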
Error in Backward Propagation
1) Cost Function
● It is a function that measures the performance of a model for any given data.
● Cost Function quantifies the error between predicted values and expected values and
presents it in the form of a single real number.
● After making a hypothesis with initial parameters, we calculate the Cost function.
● And with a goal to reduce the cost function, we modify the parameters by using the Gradient
descent algorithm over the given data.
2) Gradient Descent
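As a rough sketch of gradient descent on a cost function, the example below fits a line to a few points by repeatedly stepping against the gradient of the mean squared error. The data, learning rate, and step count are assumptions chosen for illustration.

```python
def mse(w, b, xs, ys):
    """Cost function: mean squared error as a single real number."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Start from initial parameters and move against the cost gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Assumed toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = gradient_descent(xs, ys)
final_cost = mse(w, b, xs, ys)
```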
Activation Function
● The main purpose of the activation function is to convert the weighted sum of input
signals of a neuron into the output signal.
● And this output signal is served as input to the next layer.
● An activation function should be differentiable (at least almost everywhere), since
backpropagation uses its derivative to reduce the error and update the weights accordingly.
Sigmoid
Tanh
Softmax
● We know that sigmoid returns values between 0 and 1, which can be treated as probabilities.
● Softmax extends this to several classes: it returns the probability of a datapoint belonging
to each individual class.
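The three activation functions named above can be written in a few lines. This is a minimal sketch. Subtracting the maximum score inside softmax is a common numerical-stability step and is an addition for illustration, as are the example scores.

```python
import math


def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the interval (-1, 1)."""
    return math.tanh(z)


def softmax(scores):
    """Turns a list of scores into class probabilities that sum to 1."""
    shift = max(scores)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # assumed example scores for three classes
```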
The Hessian Matrix
Understanding Partial Derivatives
Several Applications
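1. Second-derivative test:
● At a critical point, the eigenvalues of the Hessian matrix determine its nature: all positive
gives a local minimum, all negative a local maximum, and mixed signs a saddle point. For a
convex function the Hessian is positive semidefinite everywhere, so any local minimum is
also a global minimum.
2. In optimization:
● Used in large-scale optimization (for example, in Newton-type second-order methods).
3. To find inflection points.
4. To classify critical points, that is, points where the gradient is zero.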
● Calculate the Hessian matrix at the point (1,0) of the following multivariable function:
● First of all, we have to compute the first order partial derivatives of the function:
● Once we know the first derivatives, we calculate all the second order partial derivatives
of the function:
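Since the function from the slide is not reproduced here, the sketch below uses a hypothetical stand-in, f(x, y) = x^3 + 2xy + y^2, to illustrate the same procedure at the point (1, 0). Its first-order partial derivatives are f_x = 3x^2 + 2y and f_y = 2x + 2y, and its second-order partial derivatives are f_xx = 6x, f_xy = f_yx = 2 and f_yy = 2, so the Hessian at (1, 0) is [[6, 2], [2, 2]]. The code checks this numerically with central differences.

```python
def f(x, y):
    # Hypothetical example function, NOT the one from the original slide.
    return x ** 3 + 2 * x * y + y ** 2


def numerical_hessian(func, x, y, h=1e-4):
    """Approximate the 2x2 Hessian of func at (x, y) by central differences."""
    f_xx = (func(x + h, y) - 2 * func(x, y) + func(x - h, y)) / h ** 2
    f_yy = (func(x, y + h) - 2 * func(x, y) + func(x, y - h)) / h ** 2
    f_xy = (
        func(x + h, y + h) - func(x + h, y - h)
        - func(x - h, y + h) + func(x - h, y - h)
    ) / (4 * h ** 2)
    return [[f_xx, f_xy], [f_xy, f_yy]]


hessian = numerical_hessian(f, 1.0, 0.0)
analytic = [[6.0, 2.0], [2.0, 2.0]]  # from the second-order partial derivatives
```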
● Early stopping
● L1 and L2 regularization
● Data augmentation
● Addition of noise
● Dropout
Regularization in Neural Network
1) Early stopping
● Early stopping is one of the simplest and most intuitive regularization techniques.
● It involves stopping the training of the neural network at an earlier epoch.
● As you train the neural network over many epochs, the training error decreases.
● If the training error becomes too low and gets arbitrarily close to zero, the network is
very likely to overfit the training dataset.
● Such a neural network is a high variance model that performs badly on test data that it has
never seen before despite its near-perfect performance on the training samples.
● Therefore, heuristically, if we can prevent the training loss from becoming arbitrarily low, the
model is less likely to overfit on the training dataset, and will generalize better.
● So how do we do it in practice?
[Figures: Error Change and Early Stopping; Monitoring the Validation Accuracy for Early Stopping]
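One common way to do this in practice is to monitor a validation metric and stop once it has not improved for a fixed number of epochs. The sketch below assumes a `patience` parameter and a made-up list of validation losses. Neither comes from the slides.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: stop training
    return best_epoch


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.45]
best = early_stopping_epoch(val_losses)
```

With these numbers training stops after three epochs without improvement, so the later loss of 0.45 is never reached. This is the usual trade-off when choosing the patience.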
● Data augmentation is a regularization technique that helps a neural network generalize better by
exposing it to a more diverse set of training examples.
● As deep neural networks require a large training dataset, data augmentation is also helpful when
we have insufficient data to train a neural network.
● Let’s take the example of image data augmentation. Suppose we have a dataset with N training
examples across C classes.
● We can apply certain transformations to these N images to construct a larger dataset.
2) Data Augmentations
● What is a valid transformation? Any operation that does not alter the original label is
a valid transformation.
● For example, a panda is a panda–whether it’s facing right or left, located near the
center of the image or one of the corners.
● We can apply any label-invariant transformation to perform data augmentation such as
a) Color space transformations such as change of pixel intensities
b) Rotation and mirroring
c) Noise injection, distortion, and blurring
● Beyond these, there is also a newer image augmentation technique called Mixup.
● Mixup is a regularization technique that uses a convex combination of existing inputs to
augment the dataset.
● A few other approaches to data augmentation include Cutout, CutMix, and AugMix.
● Cutout involves the random removal of portions of an input image during training.
● CutMix replaces the removed sections with parts of another image.
● AugMix is a regularization technique that makes a neural network robust to distribution
change.
● AugMix performs a series of transformations on the same image, and then uses a
composition of these transformed images to get the resultant image.
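The convex combination used by Mixup can be sketched as below. Inputs are shown as short lists rather than images, labels are one-hot vectors, and the mixing weight is drawn from a Beta distribution. The value 0.2 for its parameter and the helper name `mixup` are assumptions for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two examples and their one-hot labels with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Deterministic illustration with a fixed mixing weight.
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)

# During training the weight is typically sampled anew for each pair.
rng = random.Random(0)
lam = rng.betavariate(0.2, 0.2)  # assumed Beta(alpha, alpha) with alpha = 0.2
x_rand, y_rand = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam)
```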
[Figure: AugMix]
3) L1 and L2 Regularization
a) Lp norm
● In general, Lp norms (for p ≥ 1) penalize larger weights. They force the norm of the weight
vector to stay sufficiently small.
● The Lp norm of a vector x in n-dimensional space is given by:
$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
b) L1 norm
● When p=1, we get the L1 norm, the sum of the absolute values of the components of the vector:
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$
c) L2 norm
● When p=2, we get the L2 norm, the Euclidean distance of the point from the origin in
n-dimensional vector space:
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
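A short sketch of the norms above, and of how a norm-based penalty is typically added to a loss. The penalty strength `lam`, the squared form of the L2 penalty, and the example weights are assumptions for illustration.

```python
def lp_norm(vector, p):
    """Lp norm: p-th root of the sum of absolute values raised to the power p."""
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)


def l2_penalized_loss(data_loss, weights, lam):
    """Data loss plus lam times the squared L2 norm of the weights."""
    return data_loss + lam * sum(w * w for w in weights)


weights = [3.0, -4.0]                  # assumed example weight vector
l1 = lp_norm(weights, 1)               # |3| + |-4|
l2 = lp_norm(weights, 2)               # sqrt(3^2 + 4^2)
total = l2_penalized_loss(1.0, weights, lam=0.01)
```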
4) Addition of Noise
● Another regularization approach is the addition of noise. You can add noise to the input, the
output labels, or the gradients of the neural network.
a) Adding noise to the input data
● When using the sum-of-squares loss function, adding Gaussian noise to the inputs is equivalent
(in expectation, and exactly so for a linear model) to L2 regularization.
● To understand this equivalence, let’s assume a simple network with input layer and weights:
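One way to see the equivalence, as a sketch under the assumption of a purely linear model y = wᵀx with target t and independent zero-mean Gaussian input noise of variance σ² (the symbols are chosen here for illustration):

```latex
\begin{align}
\tilde{x} &= x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I) \\
\mathbb{E}_{\epsilon}\!\left[ (w^\top \tilde{x} - t)^2 \right]
  &= (w^\top x - t)^2
   + 2\,(w^\top x - t)\, w^\top \mathbb{E}[\epsilon]
   + \mathbb{E}\!\left[ (w^\top \epsilon)^2 \right] \\
  &= (w^\top x - t)^2 + \sigma^2 \, \|w\|_2^2
\end{align}
```

The middle term vanishes because the noise has zero mean, and the last term is exactly an L2 penalty on the weights with strength σ².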
b) Adding noise to the output labels
● Adding noise to the output labels prevents the network from memorizing the training dataset by
introducing perturbations in the output labels. Here are some techniques:
i) DisturbLabel
c) Adding noise to the gradients
● If the gradient vector at step t is g_t, we update it by adding Gaussian noise with zero mean
and variance σ_t²:
$$g_t \leftarrow g_t + \mathcal{N}(0, \sigma_t^2), \qquad \sigma_t^2 = \frac{\eta}{(1+t)^{\gamma}}$$
● Here η is chosen from the set {0.01, 0.3, 1.0} and γ is set to 0.55. The variance of the noise
added decreases as the training proceeds.
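A minimal sketch of annealed gradient noise, assuming the commonly used schedule in which the variance at step t is η / (1 + t)^γ. The function names and the example gradient are hypothetical.

```python
import random


def noise_variance(step, eta=0.3, gamma=0.55):
    """Variance of the added noise; it shrinks as the step count grows."""
    return eta / (1.0 + step) ** gamma


def noisy_gradient(gradient, step, rng, eta=0.3, gamma=0.55):
    """Add zero-mean Gaussian noise to every component of the gradient."""
    std = noise_variance(step, eta, gamma) ** 0.5
    return [g + rng.gauss(0.0, std) for g in gradient]


rng = random.Random(0)
perturbed = noisy_gradient([0.5, -1.2], step=10, rng=rng)  # assumed gradient
```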
5) Dropout
● Dropout involves dropping neurons in the hidden layers and (optionally) the input layer.
● During training, each neuron is assigned a “dropout” probability, like 0.5.
● With a dropout of 0.5, there’s a 50% chance of each neuron participating in training within each
training batch.
● This results in slightly different network architecture for each batch.
● This is roughly equivalent to training different neural networks on different subsets of the
training data.
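A sketch of dropout applied to one layer's activations. It uses the "inverted" variant, in which surviving activations are rescaled during training so that nothing changes at test time. That choice, and the example values, are assumptions rather than something stated in the slides.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)  # no neurons are dropped at test time
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(42)
hidden = [1.0] * 10                      # assumed hidden-layer activations
dropped = dropout(hidden, 0.5, rng)      # a different mask for every batch
at_test_time = dropout(hidden, 0.5, rng, training=False)
```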
Here, the parameters of the distribution describe its shape and location.
Mixture Density Networks
Mixture Density Network: The output of a neural network parameterizes a Gaussian mixture
model.
● A few restrictions, and ways to enforce them when implementing Mixture Density Networks:
1. The mixing coefficients (π) are probabilities: each must lie between 0 and 1, and together
they must sum to one. This is easily achieved by passing the corresponding network outputs
through a Softmax layer.
2. The variance (σ²) must be strictly positive.
3. The center parameters (μ) are location parameters and can be taken directly from the raw
(unactivated) outputs of the mean neurons.
● The network is trained end-to-end using standard backpropagation.
● For that, the loss function we minimize is the negative log-likelihood; minimizing it is
equivalent to maximum likelihood estimation.
● We already know the form of the mixture density, so it is just a matter of computing it and
minimizing the negative log-likelihood.
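The constraints and the loss can be sketched for a one-dimensional target with two mixture components. The raw network outputs are made-up numbers, and using an exponential to keep the spread positive is one common choice assumed here. It is applied to the standard deviation, which also keeps the variance positive.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mdn_parameters(raw_pi, raw_sigma, raw_mu):
    """Map raw network outputs to valid mixture parameters."""
    pi = softmax(raw_pi)                       # in (0, 1) and summing to one
    sigma = [math.exp(s) for s in raw_sigma]   # strictly positive
    mu = list(raw_mu)                          # raw outputs used as-is
    return pi, sigma, mu


def gaussian_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mdn_nll(y, pi, sigma, mu):
    """Negative log-likelihood of y under the Gaussian mixture."""
    likelihood = sum(p * gaussian_pdf(y, m, s) for p, s, m in zip(pi, sigma, mu))
    return -math.log(likelihood)


# Hypothetical raw outputs for two components centered at -1 and +1.
pi, sigma, mu = mdn_parameters([0.0, 0.0], [0.0, 0.0], [-1.0, 1.0])
nll_near = mdn_nll(1.0, pi, sigma, mu)   # target close to a component center
nll_far = mdn_nll(5.0, pi, sigma, mu)    # target far from both centers
```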
Bayesian Neural Network(BNN)
● Among the different types of artificial neural networks, stochastic neural networks have
proven to be one of the most generic and flexible ways to overcome the inability of standard
ANNs to express uncertainty.
● By introducing stochastic components into the network — giving the network either a
stochastic activation or stochastic weights — it can simulate multiple possible models of
parameters θ with an associated probability distribution p(θ).
● By comparing these multiple predictions, it is possible to obtain a better idea of
uncertainties.
● If the different models agree, then the uncertainty is low.
● If they disagree, then the uncertainty is high.
1. Goal: A standard neural network (SNN) focuses on optimization, while a BNN focuses on
marginalization. Optimization finds one optimal point to represent a weight, while
marginalization treats each weight as a random variable and finds its distribution.
2. Estimate: The parameter estimates for an SNN are maximum likelihood estimators (MLE),
while for a BNN the estimate is maximum a posteriori (MAP) or a predictive distribution.
3. Method: An SNN uses differentiation-based methods such as gradient descent to find the
optimal value. In a BNN, since the integrals involved are hard to evaluate, researchers
typically rely on Markov Chain Monte Carlo (MCMC), Variational…
Design of BNN
Functional model
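● The goal is to obtain the posterior distribution over the parameters θ, given our data
D = {Dx, Dy} with training inputs and labels, respectively. By Bayes' theorem:
$$p(\theta \mid D) = \frac{p(D_y \mid D_x, \theta)\, p(\theta)}{\int p(D_y \mid D_x, \theta')\, p(\theta')\, d\theta'}$$
● Due to the complexity of the posterior, especially because of the evidence integral in the
denominator, computing it in a standard way is in general intractable.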
● To do this, we rely on techniques such as Markov Chain Monte Carlo (MCMC) and
Variational Inference, which approximate these integrals in different ways.
● In the case of MCMC approach, a large set of weights θ is sampled from the posterior
and used to compute a series of possible outputs y.
Variational Inference
● Theoretically we don’t need a learning phase for the BNNs since we can sample the
posterior and obtain its estimators.
● The probabilities P(D|H) and P(H) are given by the stochastic model, but computing the
integral for the evidence can be quite a challenge.
● Even when we are able to do this, directly sampling the posterior is difficult due to the
high dimensionality of the sampling space.
● To deal with this, variational inference is one of the popular methods.
● Bayesian neural networks address overfitting by modeling uncertainty in the weights.
In addition, they can be trained with standard neural network tools using an algorithm
called stochastic variational inference.
To obtain an estimator ŷ for the output y, we have to consider two different situations:
1) When performing regression, we use the average of the sampled outputs as the estimator ŷ
and their spread (for example, their covariance) as the uncertainty.
2) When performing classification, the average model will give the relative probability
of each class, which by itself can be considered as an estimate of uncertainty. The final
prediction ŷ is the most likely class.
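For the regression case, the idea of averaging over sampled models can be sketched as follows. The one-parameter model and the Gaussian used as a stand-in for posterior samples are hypothetical. In a real BNN the samples would come from MCMC or a variational approximation.

```python
import random
import statistics


def predict(theta, x):
    # Hypothetical one-parameter model standing in for the network.
    return theta * x


rng = random.Random(0)
# ASSUMPTION: stand-in for weights sampled from the posterior p(theta | D).
theta_samples = [rng.gauss(2.0, 0.1) for _ in range(1000)]

x_new = 3.0
outputs = [predict(theta, x_new) for theta in theta_samples]

y_hat = statistics.mean(outputs)         # estimator: average of the outputs
uncertainty = statistics.stdev(outputs)  # disagreement between the models
```

If the sampled models agree, the standard deviation is small. If they disagree, it is large.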
Stochastic Model
a) BNNs with weights as stochastic variables
b) BNNs with activations as stochastic variables
Advantages
● They can be more robust and generalize better than other neural networks.
● They can quantify the uncertainty in their predictive output.
● They can be used for many practical applications.
Disadvantages
● They can be more complicated to train than other neural networks, and require
knowledge of the fields of probability and statistics.
● They can be slower to converge than other neural networks and often require more
data. Since the weights of the network are distributions instead of single values, more
data is required to estimate the weights accurately.
Kernel
● In machine learning, kernels play a central role, especially in algorithms
designed for classification and regression tasks.
● A kernel lets non-linear relationships be handled as linear ones in a transformed feature
space, making them accessible to algorithms that otherwise only handle linear structure.
● Kernels achieve this without the computational intensity of mapping data to higher
dimensions explicitly.
● Their efficiency and effectiveness in revealing hidden patterns make them a cornerstone in
modern machine learning.
● At its most fundamental level, a kernel is a relatively straightforward function that operates
on two vectors from the input space, commonly referred to as the X space.
● The primary role of this function is to return a scalar value, but the fascinating aspect of this
process lies in what this scalar represents and how it is computed.
● This scalar is, in essence, the dot product of the two input vectors.
● However, it's not computed in the original space of these vectors.
● Instead, it's as if this dot product is calculated in a much higher-dimensional space,
known as the Z space.
● This is where the kernel's true power and elegance come into play.
● It manages to convey how close or similar these two vectors are in the Z space without
the computational overhead of actually mapping the vectors to this higher-dimensional
space and calculating their dot product there.
● It allows you to glean the necessary information about the vectors in this more complex
space without having to access the space directly.
Definition of Kernel
● A kernel is a function K(x,y) that computes the inner product of the transformed feature
vectors Φ(x) and Φ(y) in a higher-dimensional space without explicitly calculating the
transformation.
● Mathematically, a function K is a valid kernel if it corresponds to an inner product in some
feature space.
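A small sketch of the definition above, using the degree-2 polynomial kernel K(x, y) = (x · y)^2 on two-dimensional inputs. Its feature map Φ(x) = (x1², √2·x1·x2, x2²) is written out explicitly only to show that both routes give the same number. The example vectors are arbitrary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit map into the higher-dimensional space (3-D here)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Same inner product, computed without leaving the input space."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, 4.0)
explicit = dot(feature_map(x), feature_map(y))  # transform, then dot product
implicit = poly_kernel(x, y)                    # kernel evaluation only
```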
Constructing Kernel
● Custom Kernel
Depending on the problem, we might design custom kernels that capture domain-specific
relationships.
Mercer's Theorem
Kernel Composition
● Many kernels have hyperparameters (e.g., σ in the RBF kernel, d in the polynomial kernel).
● These hyperparameters can be tuned using cross-validation to find the values that optimize
model performance.
● The kernel trick implicitly maps data into higher-dimensional spaces.
● This can be computationally more efficient than explicitly computing the transformations.
● Kernel functions encapsulate these transformations.
● Incorporating domain knowledge can help design custom kernels that capture specific
relationships or structures in the data.
Choosing Kernel
● Choosing the right kernel for a machine learning task is a critical decision that can
significantly impact the performance of the model.
● The selection process involves understanding both the nature of the data and the specific
requirements of the task at hand.
● Firstly, it's important to consider the distribution and structure of the data. If the data is
linearly separable, a linear kernel may be sufficient.
● However, for more complex, non-linear data, a polynomial or radial basis function (RBF)
kernel might be more appropriate.
● Lastly, domain knowledge can play a significant role in kernel selection.
● Understanding the underlying phenomena or patterns in the data can guide the choice of the
kernel.
Radial Basis Function Network(RBFN)
● RBFNs are a special type of feedforward neural network that uses radial basis functions as
activation functions.
● An RBFN has an input layer, a hidden layer, and an output layer, and is mostly used for
classification, regression, and time-series prediction.
● Radial basis function networks are distinguished from other neural networks by their
universal approximation capability and fast learning speed.
● The main advantage of the RBF network is that it has only one hidden layer, which uses the
radial basis function as the activation function.
● These functions are very powerful for approximation.
● RBF networks are a commonly used type of artificial neural network for function
approximation problems.
● RBFNs perform classification by measuring the input’s similarity to examples from the
training set.
● RBFNs have an input vector that feeds into the input layer, followed by a layer of RBF neurons.
● The output layer computes a weighted sum of the RBF neurons' outputs and has one node per
category or class of data.
● The neurons in the hidden layer contain Gaussian transfer functions, whose outputs decrease
as the distance from the neuron's center increases.
● The network's output is a linear combination of the radial basis functions of the input,
weighted by the output-layer parameters.
1) Input Vector:
● The input vector is the n-dimensional vector you are trying to classify.
● The entire input vector is shown to each of the RBF neurons.
2) RBF Neurons:
● Each RBF neuron stores a ‘prototype’ vector, just one of the vectors from the training set.
● Each RBF neuron compares the input vector to its prototype and outputs a value between 0 and 1
which is a measure of similarity.
● If the input is equal to the prototype, the output of the RBF neuron will be 1.
● As the distance between the input and the prototype increases, the response decreases exponentially
towards 0. The shape of the response of the RBF neuron is a bell curve.
● The prototype vector is also often called the “center” of a neuron since it is the value at the center
of the bell curve.
3) Outputs :
● The output of the network consists of a set of nodes, one per category that we are trying to
classify.
● Each output node computes a sort of score for the associated category.
● Typically, a classification decision is made by assigning the input to the category with the
highest score.
● The score is computed by taking a weighted sum of the activation values from every RBF
neuron.
● By weighted sum, we mean that an output node associates a weight value with each of the
RBF neurons, and multiplies the neuron’s activation by this weight before adding it to the
total response.
● Because each output node is computing the score for a different category, every output node
has its own set of weights.
● The output node will typically give a positive weight to the RBF neurons that belong to its
category, and a negative weight to the others.
● Each RBF neuron computes a measure of the similarity between the input and its prototype
vector (taken from the training set).
● Input vectors which are more similar to the prototype return a result closer to 1.
● There are different possible choices of similarity functions, but the most popular is based on
the Gaussian.
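The three stages described above (input vector, RBF neurons, output nodes) can be sketched as a forward pass. The prototypes, the width parameter `beta`, and the output weights are made-up values for a two-class toy problem, not trained ones.

```python
import math


def rbf_activation(x, prototype, beta):
    """Gaussian similarity: 1 at the prototype, decaying towards 0 with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-beta * sq_dist)


def rbfn_scores(x, prototypes, beta, output_weights):
    """One score per category: weighted sum of all RBF neuron activations."""
    activations = [rbf_activation(x, p, beta) for p in prototypes]
    return [
        sum(w * a for w, a in zip(weights, activations))
        for weights in output_weights
    ]


# Hypothetical network: one prototype per class.
prototypes = [(0.0, 0.0), (1.0, 1.0)]
# Each output node weights its own class positively and the other negatively.
output_weights = [[1.0, -1.0], [-1.0, 1.0]]

scores = rbfn_scores((0.9, 1.1), prototypes, 1.0, output_weights)
predicted = scores.index(max(scores))  # category with the highest score
```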
Hyperparameters:
● Hyperparameters are the variables that determine the network structure (e.g., number of
hidden units) and the variables that determine how the network is trained
(e.g., learning rate).
● Hyperparameters are set before training (before optimizing the weights and biases).
Learning The Hyperparameters
Hyperparameters related to neural networks:
1) Number of Hidden Units
● Many hidden units within a layer, combined with regularization techniques, can increase accuracy.
● A smaller number of units may cause underfitting.
2) Dropout
● It is a regularization technique to avoid overfitting, thus increasing generalization power.
● Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good
starting point.
● A probability too low has minimal effect and value too high results in under-learning by the
network.
● You are likely to get better performance when dropout is used on a larger network, giving
the model more of an opportunity to learn independent representations.
3) Weight Initialization
● Ideally, it may be better to use different weight initialization schemes according to the
activation function used on each layer.
4) Activation Function
● Activation functions are used to introduce nonlinearity to models, which allows deep
learning models to learn nonlinear prediction boundaries.
● Generally, the rectifier activation function(ReLU) is the most popular.
● Sigmoid is used in the output layer while making binary predictions.
● Softmax is used in the output layer while making multi-class predictions.
1) Learning rate
2) Momentum
● Momentum helps determine the direction of the next step using knowledge of the previous
steps.
● It helps to prevent oscillations.
● A typical choice of momentum is between 0.5 and 0.9.
3) Number of epochs
● Number of epochs is the number of times the whole training data is shown to the network while
training.
● Increase the number of epochs until the validation accuracy starts decreasing even when
training accuracy is increasing (overfitting).
4) Batch Size
● Mini-batch size is the number of training samples given to the network before each
parameter update.
● A good default for batch size might be 32. Also try 64, 128, 256, and so on.
Methods for searching hyperparameters:
1) Manual Search
2) Grid Search
3) Random Search
4) Bayesian Optimization
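Grid search and random search can be contrasted with a short sketch. The `validation_score` function is a hypothetical stand-in for "train the network with these settings and evaluate it on validation data". Its shape, the candidate values, and the sampling ranges are assumptions.

```python
import itertools
import random


def validation_score(learning_rate, batch_size):
    # Hypothetical stand-in for a real train-and-validate run;
    # by construction it peaks at learning_rate=0.01, batch_size=64.
    return -((learning_rate - 0.01) ** 2) * 1e4 - abs(batch_size - 64) / 64.0


def grid_search(learning_rates, batch_sizes):
    """Evaluate every combination of the listed values."""
    grid = itertools.product(learning_rates, batch_sizes)
    return max(grid, key=lambda cfg: validation_score(*cfg))


def random_search(n_trials, rng):
    """Evaluate randomly sampled configurations instead of a fixed grid."""
    candidates = [
        (10 ** rng.uniform(-4, -1), rng.choice([32, 64, 128, 256]))
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda cfg: validation_score(*cfg))


best_grid = grid_search([0.001, 0.01, 0.1], [32, 64, 128, 256])
best_random = random_search(20, random.Random(0))
```

Sampling the learning rate on a log scale is a common choice, since useful values often span several orders of magnitude.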
Gaussian Process Regression(GPR)
Gaussian Process
● Non-Parametric Nature: GPs can adjust to the complexity of the data because they do not
rely on a fixed number of model parameters.
● Probabilistic Predictions: The uncertainty of GP predictions can be quantified because
predictions are delivered as probability distributions.
● Interpolation and Smoothing: GPs are useful for noisy or irregularly sampled data
because they are good at smoothing noisy data and interpolating between data points.
● Marginalization of Hyperparameters: GPs can marginalize over hyperparameters, which
reduces the need for explicit hyperparameter tuning and makes the model simpler to use.
● Data Collection: Gather the input-output data pairs for your regression problem.
● Choose a Kernel Function: Select an appropriate covariance function (kernel) that
suits your problem. The choice of kernel influences the shape of the functions that GPR
can model.
● Parameter Optimization: Estimate the hyperparameters of the kernel function by
maximizing the marginal likelihood of the data. This can be done using gradient-based
optimization techniques.
● Prediction: Given a new input, use the trained GPR model to make predictions. GPR
provides both the predicted mean and the associated uncertainty (variance).
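The prediction step can be sketched for one-dimensional inputs with an RBF kernel. The kernel hyperparameters are fixed by hand here rather than optimized, the small linear solver is included only to keep the example self-contained, and the training data are made up.

```python
import math


def rbf(a, b, length_scale=1.0):
    """RBF (squared-exponential) covariance between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def gpr_predict(train_x, train_y, query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a single query point."""
    n = len(train_x)
    K = [
        [rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    k_star = [rbf(x, query) for x in train_x]
    alpha = solve(K, train_y)                 # K^-1 y
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                      # K^-1 k_star
    variance = rbf(query, query) - sum(k * w for k, w in zip(k_star, v))
    return mean, variance


# Hypothetical training data.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 0.0]
mean_at_train, var_at_train = gpr_predict(train_x, train_y, 1.0)
mean_far, var_far = gpr_predict(train_x, train_y, 10.0)
```

Near a training point the predicted variance is close to zero. Far from the data the prediction falls back to the prior mean, and the variance returns to the prior variance.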
● Stock Price Prediction: GPR can be used to model and predict stock prices, taking
into account the volatility and uncertainty in financial markets.
● Computer Experiments: GPR is useful in optimizing complex simulations by
modeling the input-output relationships and identifying the most influential
parameters.
● Anomaly Detection: GPR can be applied to anomaly detection, where it identifies
unusual patterns in time series data by capturing normal data distributions.
Advantages of GPR
● GPR provides a probabilistic framework for regression, which means it not only gives
point estimates but also provides uncertainty estimates for predictions.
● It is highly flexible and can capture complex relationships in the data.
● GPR can be adapted to various applications, including time series forecasting,
optimization, and Bayesian optimization.
Challenges of GPR
● GPR can be computationally expensive when dealing with large datasets, as the inversion
of a covariance matrix is required.
● The choice of the kernel function and its hyperparameters can significantly impact the
model’s performance.
Thank you!