Adaptive Linear Neuron Using Linear (Identity) Activation Function With Batch Gradient Method
ADALINE (Adaptive Linear Neuron or later Adaptive Linear Element) is an early single-layer
artificial neural network and the name of the physical device that implemented this network.
It was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford
University in 1960.
It is based on the McCulloch–Pitts neuron and consists of weights, a bias and a summation function.
The difference between Adaline and the standard (McCulloch–Pitts) perceptron is that in the learning
phase, the weights are adjusted according to the weighted sum of the inputs (the net).
In the standard perceptron, the net is passed to the activation (transfer) function and the function's
output is used for adjusting the weights.
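To make this contrast concrete, here is a minimal sketch of both single-sample update rules side by side. The function names and the single-sample form are illustrative, not from any particular library:

```python
import numpy as np

def perceptron_update(w, x, y, rate=0.01):
    # Perceptron: the error is computed from the thresholded (unit-step) output
    net = np.dot(x, w[1:]) + w[0]
    prediction = 1 if net >= 0.0 else -1
    error = y - prediction          # zero whenever the sample is classified correctly
    w[1:] += rate * error * x
    w[0] += rate * error
    return w

def adaline_update(w, x, y, rate=0.01):
    # Adaline: the error is computed from the raw net input (identity activation)
    net = np.dot(x, w[1:]) + w[0]
    error = y - net                 # continuous-valued error
    w[1:] += rate * error * x
    w[0] += rate * error
    return w
```

Note that for a correctly classified sample the perceptron makes no update at all, while Adaline keeps adjusting the weights as long as the net input differs from the target.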
The Adaline learning rule is, in fact, the stochastic gradient descent update for linear regression.
In this tutorial, we'll learn about another type of single-layer neural network (still a
perceptron-style model) called Adaline (Adaptive Linear Neuron), whose learning rule is also
known as the Widrow-Hoff rule.
The key difference between the Adaline rule and Rosenblatt's perceptron is that the weights are
updated based on a linear activation function rather than a unit step function as in the
perceptron model.
(Figure: Perceptron vs. Adaptive linear neuron architectures)
The difference is that we're going to use the continuous valued output from the linear activation
function to compute the model error and update the weights, rather than the binary class labels.
Artificial neurons
The perceptron algorithm enables the model to automatically learn the optimal weight coefficients,
which are then multiplied with the input features in order to decide whether a neuron
fires or not.
In supervised learning and classification, such an algorithm can then be used to predict whether a
sample belongs to one class or the other.
In the binary perceptron classifier, we refer to our two classes as 1 (positive class)
and -1 (negative class).
In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step
function as the activation function.
The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from
a multilayer perceptron. As a linear classifier, the single-layer perceptron is the
simplest feedforward neural network.
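The Heaviside-step decision described above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def heaviside_predict(X, w, bias):
    # Fire (+1) when the net input is non-negative, otherwise output -1
    net = np.dot(X, w) + bias
    return np.where(net >= 0.0, 1, -1)
```

For example, with weights (0.5, 0.5) and zero bias, the sample (1, 1) lands on the positive side of the decision boundary and (-2, 0) on the negative side.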
Cost function
One of the most critical tasks in supervised machine learning is to minimize a cost function.
In the case of the Adaptive Linear Neuron, we can define the cost function J to learn the weights as
the Sum of Squared Errors (SSE) between the calculated outcome and the true class label:
J(w) = (1/2) Σ_i (y_i - output_i)^2
Because this cost function is convex, we can use a simple but powerful optimization algorithm
called gradient descent to find the weights that minimize it and classify the samples in the Iris
dataset.
The weight update is calculated based on all samples in the training set (instead of updating
the weights incrementally after each sample): we take a step in the opposite direction of the
gradient computed from the whole training set. This is why the approach is referred to as batch
gradient descent.
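A single batch gradient descent step can be sketched as below. The tiny regression problem (y = 2x) and the function name are illustrative assumptions, chosen so the cost can actually reach zero:

```python
import numpy as np

def batch_gradient_step(w, X, y, rate):
    # One batch gradient descent step over the whole training set
    output = np.dot(X, w[1:]) + w[0]   # net input for every sample at once
    errors = y - output
    # Gradient of J(w) = 0.5 * sum(errors**2) with respect to the weights
    w[1:] += rate * X.T.dot(errors)
    w[0] += rate * errors.sum()
    cost = (errors ** 2).sum() / 2.0
    return w, cost

# Tiny illustrative regression problem: y = 2 * x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

w = np.zeros(2)          # w[0] is the bias unit
for _ in range(500):
    w, cost = batch_gradient_step(w, X, y, rate=0.05)
```

After enough epochs, w approaches (0, 2) and the cost approaches zero, since the targets are exactly linear in x.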
Implementation - Adaptive Linear Neuron
Since the perceptron rule and the Adaptive Linear Neuron are very similar, we can take
the perceptron implementation that we defined earlier and change the fit method so that the
weights are updated by minimizing the cost function via gradient descent.
import numpy as np

class AdaptiveLinearNeuron(object):
    def __init__(self, rate=0.01, niter=10):
        self.rate = rate
        self.niter = niter

    def fit(self, X, y):
        # weights (weight[0] is the bias unit)
        self.weight = np.zeros(1 + X.shape[1])
        # Cost function value in each epoch
        self.cost = []
        for i in range(self.niter):
            output = self.net_input(X)
            errors = y - output
            self.weight[1:] += self.rate * X.T.dot(errors)
            self.weight[0] += self.rate * errors.sum()
            cost = (errors ** 2).sum() / 2.0
            self.cost.append(cost)
        return self

    def net_input(self, X):
        # Weighted sum of inputs plus bias
        return np.dot(X, self.weight[1:]) + self.weight[0]

    def activation(self, X):
        # Linear (identity) activation
        return self.net_input(X)

    def predict(self, X):
        # Class label after applying the unit step
        return np.where(self.activation(X) >= 0.0, 1, -1)
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                 header=None)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values
Gradient descent
As we can see in the resulting cost function plots below, we have two different types of issues.
The left one shows what could happen if we choose a learning rate that is too large. Instead of
minimizing the cost function, the error becomes larger in every epoch because we overshoot the
global minimum.
On the other hand, the cost decreases for the plot on the right side, but the chosen learning
rate η=0.0001 is so small that the algorithm would require a very large number of epochs to
converge.
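Both failure modes can be reproduced in a self-contained sketch. The synthetic, unscaled two-feature data below stands in for the Iris features; the helper function and dataset are my own illustrative assumptions, not part of the original code:

```python
import numpy as np

def fit_adaline(X, y, rate, niter):
    # Batch-gradient Adaline fit; returns the cost recorded at each epoch
    w = np.zeros(1 + X.shape[1])
    cost = []
    for _ in range(niter):
        errors = y - (np.dot(X, w[1:]) + w[0])
        w[1:] += rate * X.T.dot(errors)
        w[0] += rate * errors.sum()
        cost.append((errors ** 2).sum() / 2.0)
    return cost

rng = np.random.default_rng(0)
X = rng.normal(scale=5.0, size=(100, 2))        # unscaled features
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1, -1)

too_large = fit_adaline(X, y, rate=0.01, niter=10)     # cost grows: overshooting
very_small = fit_adaline(X, y, rate=0.0001, niter=10)  # cost shrinks, but slowly
```

Plotting `too_large` and `very_small` against the epoch number reproduces the two cost curves discussed above.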
The following figure demonstrates how we change the value of a particular weight parameter to
minimize the cost function J (left). The figure on the right illustrates what happens if we
choose a learning rate that is too large: we overshoot the global minimum.
Feature scaling
Feature scaling is a method used to standardize the range of independent variables or features of
data. In data processing, it is also known as data normalization and is generally performed
during the data preprocessing step.
Gradient descent is one of the many algorithms that benefit from feature scaling.
Here, we will use a feature scaling method called standardization, which gives our data the
property of a standard normal distribution.
In machine learning, we can handle various types of data, e.g. audio signals and pixel values for
image data, and this data can include multiple dimensions.
Feature standardization makes the values of each feature in the data have zero mean (by
subtracting the mean in the numerator) and unit variance.
This method is widely used for normalization in many machine learning algorithms
(e.g., support vector machines, logistic regression, and neural networks).
This is typically done by calculating standard scores.
The general method of calculation is to determine the distribution mean and standard deviation
for each feature. Next we subtract the mean from each feature. Then we divide the values (mean
is already subtracted) of each feature by its standard deviation.
- from Feature scaling
X_std = np.copy(X)
X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
After the standardization, we will train the Adaline model again, this time using the larger
learning rate of η=0.01:
Here is our new code for the two pictures above:
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values
# standardize
X_std = np.copy(X)
X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
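The effect of standardization can be verified in a self-contained sketch. As before, the synthetic data and the helper function are illustrative assumptions standing in for the Iris features: after scaling, the same η=0.01 that previously diverged now makes the cost fall steadily:

```python
import numpy as np

def fit_adaline(X, y, rate, niter):
    # Batch-gradient Adaline fit; returns the cost recorded at each epoch
    w = np.zeros(1 + X.shape[1])
    cost = []
    for _ in range(niter):
        errors = y - (np.dot(X, w[1:]) + w[0])
        w[1:] += rate * X.T.dot(errors)
        w[0] += rate * errors.sum()
        cost.append((errors ** 2).sum() / 2.0)
    return cost

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=5.0, size=(100, 2))   # raw, unscaled features
y = np.where(X[:, 0] + X[:, 1] > 20.0, 1, -1)

# Standardize each feature: zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cost = fit_adaline(X_std, y, rate=0.01, niter=15)    # now converges at η=0.01
```

Standardization shrinks and evens out the curvature of the cost surface along each weight axis, which is why a learning rate that overshot on raw features becomes usable.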