Optimizing a Routing Algorithm Based on Hopfield
Neural Networks for Graphic Processing Units
Carmelo J. A. Bastos-Filho, Marcos A. C. Oliveira Junior, Dennis R. C. Silva, and Robson A. Santana
Polytechnic School of Pernambuco,
University of Pernambuco,
Recife 50720-001, Pernambuco, Brazil
e-mail: carmelofilho@ieee.org
Abstract—Although some interesting routing algorithms based
on HNN were already proposed, they are slower when compared
to other routing algorithms. Since HNN are inherently parallel,
they are suitable for parallel implementations, such as Graphic
Processing Units (GPU). In this paper we propose a fast routing
algorithm based on Hopfield Neural Networks (HNN) for GPU,
considering some implementation issues. We analyzed the memory bottlenecks, the complexity of the HNN and how the kernel
functions should be implemented. We performed simulations
for five different variations of the routing algorithm for two
communication network topologies. We achieved speed-ups up
to 55 when compared to the simplest version implemented in
GPU and up to 40 when compared to the CPU version. These
new results suggest that it is possible to use the HNN for routing
in real networks.
I. I NTRODUCTION
Routing algorithms have been intensively discussed in the
scientific community, mainly because the routing process
impacts drastically on the performance of communication
networks. An ideal routing algorithm comprises finding the
best path between the source and the destination nodes,
enabling high quality transmission, avoiding penalties caused
by physical layer impairments and reserving resources for
future requests. There are different ways to determine a route.
Some algorithms determine the routes based on the shortest
path (SP) [1], the smallest delay [2], the highest Signal to
Noise Ratio [3], the best load distribution [4], among others.
A flexible routing algorithm has to calculate the path in
real time considering the current available resources and the
physical layer impairments. Some Computational Intelligence
techniques have been considered in order to provide this
ability. Among them, we can cite: Artificial Neural Networks
(ANN) [2] [4], Ant Colony Optimization [5], Genetic Algorithms [6]. ANN are suitable for solving the routing problem
problem since they present high computational speed and
distributed processing capability [2].
Some ANN approaches were already considered for solving
the routing problem. Hopfield and Tank described an ANN
with feedback connections and showed that this approach can
solve the Traveling Salesman Problem. Nowadays, this ANN
configuration is known as Hopfield neural networks (HNN).
Rauch and Winarske [7] were the first to apply HNN to find
the shortest path nodes in a communication network. After
that, Ali and Kamoun [2] proposed a HNN variation in which
the weight matrix just carries convergence information. The
link cost and the network topology information are inserted
in external bias inputs. Bastos-Filho et al. [8] proposed to
use a discrete and simple finite difference equation to update
the inputs of the neurons in order to simplify and accelerate
the convergence of the HNN. Schuler et al. [9] optimized the
difference equation proposed by Bastos-Filho et al. by using
a computational intelligence technique called Particle Swarm
Optimization. Recently, Kojic et al. [10] and Bastos-Filho et
al. [11] proposed routing algorithms for optical networks based
on HNN.
The average time to find the path between the source and
the destination node (Tpath ) is a very important attribute of
a routing algorithm. The Dijkstra algorithm can determine a
shortest path in some dozens µs. Although Bastos-Filho et.
al. [11] have optimized the HNN model proposed by BastosFilho et al. [8] to minimize Tpath , the time to determine the
path is still high. It means that the current HNN based routing
algorithms can obtain good network performance, but Tpath
is still high when compared to other algorithms.
In the recent years, the use of Graphic Processing Units
(GPUs) have been proposed for many scientific applications.
The GPU parallel floating point processing capacity allows
one to obtain high speed-ups. Nevertheless, there are some
aspects that should be considered to adapt an application
to run on these platforms, such as memory allocation and
communication between blocks. Furthermore, hardware limitations of the GPUs limit the use for general purposes. Naive
implementations neglecting theses issues may lead to a poor
performance.
One of the major benefits of neural networks is the parallel
processing capacity. In a HNN, each neuron can process
individually and exchange information by a fully connected
network. Therefore, HNN are naturally suitable for a parallel
implementation and the execution time may be drastically
reduced if the HNN are implemented on a parallel platform. In
this paper we present a parallel HNN-based routing algorithm
for GPUs. We discuss some important issues regarding the
implementation to improve the performance.
This paper is organized as follows: in the next section we
briefly review the current best HNN model to solve the routing
problem. In Section III, we introduce some basic concepts
of the NVIDIA CUDA architecture and GPU computing. We
present our contribution and the simulation results in Section
IV and V, respectively. In the last section, we present our
conclusions and suggest some future works.
The HNN synaptic matrix Txi,yj is described as [2].
Txi,yj = µ4 δxy δij − µ3 δxy − µ3 δij + µ3 δjx + µ3 δiy , (4)
II. H OPFIELD N EURAL N ETWORKS F OR ROUTING IN
C OMMUNICATIONS N ETWORKS
∀ (x 6= i) , ∀ (y 6= j) .
The neurons are the processing elements. Every output of
each neuron is connected to the input of all the other neurons
via synaptic weights. Each link in the communication network
between two adjacent nodes is associated to one neuron [2].
For example, a link from node x to node i refers to the neuron
xi. The output of a neuron Vxi depends on the input Uxi and is
evaluated by (1). The input of each neuron corresponds to the
sum of all the outputs of the other neurons yj weighted by a
synaptic weights (matrix Txi,yj ) plus an external bias Ixi . The
parameter λ determines the computation time to convergence
and the correctness of the algorithm. Higher values for the
parameter λ make the logistic function tends to a step function.
where µ3 is a HNN parameter and is used to guarantee the
convergence.
Therefore, if the system is stable in Liapunov sense, then
iterations lead to smaller output changes. As a consequence,
after some iterations the HNN converges. Schuler et al. used
the difference and finite equation described in equation (5)
to update the neurons. By using this equation, Schuler et al.
obtained a lower number of iterations to reach the convergence
and a lower number of errors finding the paths.
1
1 + e−λUxi
∀(x, i) ∈ N̄ X N̄ /x 6= i
Vxi =
Uxi [n + 1] = Uxi [n] − AUxi [n − 1]+
n
n X
X
B
Txi,yj Vyj [n] + CIxi .
(5)
y=1 j=1
j6=y
(1)
If every link in the network has a nonnegative cost associated Cij , the goal of the HNN is to find the path that minimizes
the cost from a source node s to a destination node d. Thus, the
HNN should indicate a directed path as an ordered sequence of
nodes connecting s to d. The path that provides minimum cost
is defined as Lsd . In most cases, the cost matrix is symmetric
(Cxi = Cix ). The elements Cii are nulls because on node
cannot be connected to itself.
The matrix ρxi defines if the link xi exists in the communication network topology. ρxi is defined as:
1, if the link xi does not exist;
ρxi =
(2)
0, otherwise.
The HNN converges when the variation of every output
values between consecutive iterations (∆Vxi ) are below a
threshold, i.e., ∆Vth < 10E −5 . After that, a new matrix (Yxi )
is evaluated. If the output has a value greater than 0.5, Yxi is
adjusted to “1”, otherwise Yxi is adjusted to “0”. If Yxi = 1,
the link xi ∈ Lsd , else the link does not belong to the shortest
path Lsd .
Beside this, each neuron is externally excited by input bias
(Ixi ). Ixi are used to include the link costs and the network
topology information as shown below:
µ1
µ2
ρxi (1 − δxd δis )
Ixi = − Cxi (1 − δxd δis ) −
2
2
(3)
µ4
µ5
−
+
δxd δis , ∀ (x 6= i) ,
2
2
where δ is the Kronecker function and µ1, µ2, µ4 and µ5
are HNN parameters. µ1 minimizes the total cost, µ2 prevents
nonexistent links from being included in the chosen path, µ4
avoids the convergence to a unstable state and µ5 is introduced
to ensure that the source and the destination nodes belong to
the solution.
The pseudocode for the HNN-based routing algorithm is
shown in Algorithm 1. Uxi is initialized with low and random
values (−0, 0002 ≤ Uxi ≤ 0, 0002) in order to speed-up the
convergence faster.
Algorithm 1 Pseudocode of the routing algorithm using
Hopfield neural networks.
1: Receive parameters (including Cxi );
2: Determine Txi,yj ;
3: Evaluate ρxi ;
4: Receive source and destination;
5: Evaluate Ixi ;
6: Insert initial noise in Uxi ;
7: Evaluate Vxi (Threshold of Uxi );
8: while ∆Vxi < ∆Vth do
9:
Store Uxi and Vxi ;
10:
Upgrade the neurons inputs using (5);
11:
Upgrade the neurons outputs using (1);
12:
Evaluate ∆Vxi ;
13: end while
14: Determine Yxi (binarization of Vxi );
15: Get path from Yxi ;
III. GPU C OMPUTING AND CUDA A RCHITECTURE
GPUs were traditionally designed for image and graphic
processing on computers. The use of GPU for general purpose
computing was inspired by its ability to process thousands
of threads simultaneously. GPUs became popular in recent
years through a NVIDIA technology called CUDA (Compute
Unified Device Architecture). CUDA enables programmers to
develop softwares to solve complex computational problems
by automatically distributing these threads to many-core simple processors.
The CUDA GPUs have a single-instruction multiple-thread
architecture (SIMT) [12], where the same instruction set is
executed on different processors at the same time. This architecture presents less overhead in parallel computing, which is
suitable to intensive and repetitive computation.
The CUDA parallel programming model allows one to split
the main problem in many sub-problems that can be executed
independently in parallel. Each one can be decomposed in
many other modules that may have their operations performed
cooperatively in parallel. Actually, each sub-problem is equivalent to a block of threads and each thread is a module. The
function that is performed to solve a sub-problem is called
kernel function. When a kernel function is invoked, it will run
on each thread in parallel within the corresponding block.
Each thread executing a kernel function is identified by its
thread identifier. One can access it within the kernel through
built-in variables provided by the CUDA API [13].
The main bottleneck in the CUDA architecture is the data
transferring between the host (CPU) and the device (GPU).
This type of operation harms the performance.
Threads within a block can cooperate by sharing data
through a dedicated memory, but they need to synchronize
the memory access in order to avoid data dependence errors.
A barrier placed on the code allows a thread to wait for
the other cooperative threads. They guarantee the correctness
of the algorithm running on the GPU, but influence on the
performance.
Each thread has a private local memory and each block
of threads has a shared memory that can be accessed by all
threads inside the block. Moreover, all threads can access
the same global memory. These memories follow a memory
hierarchy: the fastest one is the local memory and the slowest
is the global memory; the smallest one is the local memory
and the largest is the global memory.
Random number generators are widely used in computational intelligence techniques. They are naturally sequential,
since they are based on states. They must be carefully chosen.
GPU-based random numbers generators are discussed by [14]
and [15]. An approach CPU-free for generating random numbers on demand is presented by [16]. It is important to notice
that the quality of the random number generator may impact
on the performance of the algorithm.
IV. GPU- BASED HNN M ODEL
The HNN-based routing algorithm has an inhenrent parallel
behavior since all operations can be executed individually
because each one has an independent input. The first step to
the adapt the algorithm to a parallel platform is to identify
the bottlenecks. The main bottleneck is the update process
of the neurons described from the 9th to the 12th line of
the algorithm 1. This process only ends when a predefined
threshold is reached. In the following subsections, we present
how we adapted the algorithm to the parallel platform. First,
we show how we call the GPU functions from the host and
then we describe the GPU functions themselves.
A. The Host Code
The host code has the duty to call all the kernel functions.
It is the slowest part of the implementation, since it does not
run in the GPU and can not be implemented in parallel. Thus,
one should avoid intensive operations at this point and move
as much operations as possible to the kernel functions.
In our first implementation, we developed the naive model
described in the Algorithm 2. In each kernel function call
(shown in italic), the parameters inside the operands <<>>
describe how many threads will run inside a block.
Algorithm 2 Pseudocode of the first version of the Host Code
for the HNN-routing algorithm for GPU.
1: set-weights <<nodes*nodes>>;
2: initialize <<nodes*nodes>>;
3: converged = false;
4: while !convergedHost do
5:
iteration <<nodes*nodes>>;
6:
convergedHost = copy-from-device (converged);
7: end while
8: Determine Yxi <<nodes*nodes>>;
Once the hardware limits the maximum number of threads
running on a block, the evaluations of the inputs and the
outputs of the neurons must be split in blocks. As consequence,
the convergence test must be performed in the host. Obviously,
this approach harm the performance, since the blocks can
not be executed in parallel and the host and the device must
establish a communication during the process.
Actually, there is no correct way to get rid of this issue.
However, we may assume that there will not be needed
more than an exactly number of threads running inside the
block. Thus, there will be only one block of threads in the
GPU. Assuming this, we could improve the implementation
performance. This second version is shown in Algorithm 3.
Algorithm 3 Pseudocode of the second version of the Host
Code for the HNN-routing algorithm for GPU.
1: initialize <<nodes*nodes>>;
2: iteration <<nodes*nodes>>;
3: Determine Yxi <<nodes*nodes>>;
The kernel function iterations runs in the GPU until the
threshold is reached. By doing this, no communication between host and device is required during this process. Thus,
it is probably the best approach for the host code.
B. The Device Code
The functions called by the host code to run in the GPU are
known as kernel functions. These functions run in each thread
in parallel. The main functions of the algorithm are the ones
used to execute all the iterative process until the convergence.
They are: the iteration() and iterations() functions. Both of
them call a device function named calculate-sum() to evaluate
the sum used to update the input of the neurons as shown in
the Equation (5).
In the following subsections, the kernel functions and the
device function calculate-sum are presented. Moreover, we
present how we reduced the complexity of the update process
and how we avoid some memory bottleneck issues.
1) The Kernel Functions: The initial version of the host
code calls the iteration function described in the Algorithm 4.
It uses a global variable called converged that it is copied by
the host code from the device to know whether the threshold
is reached.
Algorithm 4 The kernel function iteration called by the first
version of the host code.
1: sum = calculate-sum;
2: update-input-and-output(sum);
3: synchronization-barrier;
4: if threshold is reached then
5:
converged = false;
6: end if
The second version of the host code calls the function
iterations and it is described in Algorithm 5. It runs all
iterations of the algorithm until the threshold is reached. It
also has a variable converged but it is in the shared memory
space. This variable can be viewed and modified by all threads
inside the block.
Algorithm 5 The kernel function iteration called by the
second version of the host code.
1: converged = false;
2: while (!converged) do
3:
sum = calculate-sum;
4:
update-input-and-output(sum);
5:
converged = true;
6:
synchronization-barrier;
7:
if threshold is reached then
8:
converged = false;
9:
end if
10: end while
2) Reduction of the Complexity of the Update Equation:
The function calculate-sum is described in the Algorithm 6.
The ty and tx variables are the thread identifiers. weights
is a 4th -dimensional matrix and refers to the synaptic matrix.
outputOld and output are 2th -dimensional matrices and refer
to the past and the current values for Vxi . This description
is the simplest way to implement the Equation (5). It has
complexity O(n2 ) and as it is used by each neuron, the full
algorithm has complexity O(n3 ).
Nevertheless, if the weights are analyzed as they are generated, the algorithm may be described as in Algorithm 7.
This approach is O(n) and it might be implemented in a
CPU sequential version as well.
3) The Memory Bottleneck: There is a memory hierarchy
in the GPU. The closer to the GPU is the memory, the faster
the memory access is. Therefore, closer memories should be
used when it is possible.
In the iteration and the iterations kernel functions, the
global memory is accessed constantly. Higher performance can
Algorithm 6 The first version of the device function calculatesum.
1: for k = 0 to number-of-nodes do
2:
for l = 0 to number-of-nodes do
3:
if k != l then
4:
sum += weights[ty][tx][k][l] * outputOld[k][l];
5:
end if
6:
end for
7: end for
Algorithm 7 The optimized version of the device function
calculate-sum.
1: for l = 0 to number-of-nodes do
2:
if l != ty then
3:
sum += weights[ty][tx][ty][l] * output[ty][l];
4:
if l != tx then
5:
sum += weights[ty][tx][l][tx] * output[l,tx];
6:
sum += weights[ty][tx][l][ty] * output[l][ty];
7:
end if
8:
end if
9:
if l != tx then
10:
sum += weights[ty][tx][tx][l] * output[tx][l];
11:
end if
12: end for
be achieved if a faster shared memory is used. It is possible
to store some information in the block shared memory, such
as the input and output matrices. Unfortunately, the shared
memory is limited and it is not possible to store the synaptic
weights matrix, once it is a 4-dimensional matrix.
However, the weights may be generated on demand instead
generating all of them for after-accessing through the global
memory. The code is almost the same, the only difference is
that a device function named weight is called for each time
it is necessary to use a synaptic weight. One should notice
that it generates more floating-point operations, but reduces
the number of access to the memory.
Algorithm 8 The optimized version of the device function
calculate-sum using a function to calculate the weight on
demand.
1: for l = 0 to number-of-nodes do
2:
if l != ty then
3:
sum += weight(ty,tx,ty,l) * output[ty][l];
4:
if l != tx then
5:
sum += weight(ty,tx,l,tx) * output[l,tx];
6:
sum += weight(ty,tx,l,ty) * output[l][ty];
7:
end if
8:
end if
9:
if l != tx then
10:
sum += weight(ty,tx,tx,l) * output[tx][l];
11:
end if
12: end for
V. S IMULATION S ETUP AND R ESULTS
The several versions detailed in the previous Section IV
were implemented on the CUDA platform. We performed
some simulations in order to evaluate the performance of
these different versions. The experiments were executed on an
Intel Core Quad 2.40GHz computer with a NVIDIA GeForce
9800GTX+. We performed simulations for two different networks. First, with the famous National Science Foundation
Network (NSFNET) with 14 nodes and then with a smaller
one with 6 nodes. The networks are depicted respectively in
Figure 1 and in Figure 2.
Fig. 1.
NSFNET network topology.
TABLE I
AVERAGE E XECUTION T IME TO D EFINE THE ROUTE USING THE CPU
V ERSION AND AND ALL THE GPU V ERSIONS .
Algorithm Version
NSFNET (ms)
6-nodes Network (ms)
CPU
73.461
2.598
GPU A
100.29
22.91
GPU B
98.2
12.69
GPU C
31.71
3.98
GPU D
3.92
0.51
GPU E
1.83
0.48
achieved speed-up is higher for the smaller communication
network. In the third version (GPU–C), all threads run inside
an unique block, thus it is completely parallel. This alteration
leads to a speed-up higher than three for both networks.
The fourth and higher speed-up was reached by using the
shared memory inside the GPU (GPU–D). Faster memory
access impacts drastically on the performance. Once it is not
possible to place the matrix of the synaptic weights in the
memory of the block, the synaptic weights are generated on
demand in the GPU–E version. Surprisingly, it is faster to
generate on demand the synaptic weights than to store them
for after-accessing.
VI. C ONCLUSIONS
Fig. 2.
Six-nodes network topology.
We used the following HNN parameters: µ1 = 950.0, µ2 =
2500.0, µ3 = 2500.0, µ4 = 475.0, µ5 = 2500.0, A = 0.001,
B = 0.001 and C = 0.001.
We analyzed five different GPU versions for the HNNbased routing algorithm: GPU A – the initial model with the
convergence test in the host code and without any use of the
shared memory; GPU B – the GPU A version updated with
a complexity reduction on the neurons update process; GPU
C – a updated version where all iterations run in the GPU
until the threshold is reached; GPU D – similar to GPU C,
but using the shared memory to store the input and the output
matrices; and GPU E – similar to GPU D, but the synaptic
weights are generated on-demand.
Each version was called 10,000 times to find a route for
randomly chosen pair of nodes for each network. The average
execution times are shown in Table I.
One can observe huge speed-ups obtained through the HNN
implementation versions in both networks. The initial version
(GPU–A) has an execution time slower than the CPU version,
although it is parallel.
In the second version (GPU–B), after an analysis on the
neurons update process, the first speed-up is reached. the
Despite the algorithm for routing networks using Hopfield
Neural Networks implemented on CPUs is slower than other
approaches for routing, Hopfield Neural Networks have a
parallel behavior that allows faster implementations on parallel
platforms.
We presented in this paper a fast Hopfield Neural Networks
based algorithm for routing in communications networks suitable for Graphic Processing Units. We also analyzed different
versions to show that some aspects must be carefully considered for GPU architectures in order to avoid bottlenecks
and even worse performances than sequential approaches. We
achieved a speed-up of 55 for the NSFNet communication
network topology when compared to a simple approach for
GPU.
The high speed-ups shows that HNN are suitable to be
implemented in GPUs, whereas the total time to find routes
suggest that it is possible to use our approach in real scenarios.
ACKNOWLEDGMENT
The authors would like to thank FACEPE, CNPq, UPE and
POLI (Escola Politécnica de Pernambuco).
R EFERENCES
[1] E. Dijkstra, “A note on two problems in connection with graphs,”
Numerische Mathematik, vol. 1, pp. 269–271, 1959.
[2] M. K. M. Ali and F. Kamoun, “Neural networks for shortest path
computation and routing in computer networks,” Neural Networks, IEEE
Transactions on, vol. 4, no. 6, pp. 941–954, Nov 1993.
[3] H. Pereira, D. Chaves, C. J. A. Bastos-Filho, and J. Martins-Filho, “Osnr
model to consider physical layer impairments in transparent optical
networks,” Photonic Network Communications, vol. 18(2), pp. 137–149,
2008. [Online]. Available: http://dx.doi.org/10.1007/s11107-008-0178-2
[4] N. Kojic, I. Reljin, and B. Reljin, “Neural network for finding optimal
path in packet-switched network,” in Neural Network Applications in
Electrical Engineering, 2004. NEUREL 2004. 2004 7th Seminar on,
Sept. 2004, pp. 91–96.
[5] G. Di Caro and M. Dorigo, “Antnet: Distributed stigmergetic control for
communications networks,” Journal of Artificial Intelligence Research,
vol. 9, pp. 317–365, 1998.
[6] S. H. Ngo, X. Jiang, and S. Horiguchi, “Adaptive routing and wavelength
assignment using ant-based algorithm,” in Networks, 2004. (ICON
2004). Proceedings. 12th IEEE International Conference on, vol. 2, Nov.
2004, pp. 482–486 vol.2.
[7] H. E. Rauch and T. Winarske, “Neural networks for routing communication traffic,” Control Systems Magazine, IEEE, vol. 8, no. 2, pp. 26–31,
Apr 1988.
[8] C. J. A. Bastos-Filho, R. A. Santana, and A. L. I. Oliveira, “A novel
approach for a routing algorithm based on a discrete time hopfield neural
network,” in Foundations of Computational Intelligence, 2007. FOCI
2007. IEEE Symposium on, April 2007, pp. 363–369.
[9] W. H. Schuler, C. J. A. Bastos-Filho, and A. L. I. Oliveira, “A novel
hybrid training method for hopfield neural networks applied to routing
in communications networks,” Int. J. Hybrid Intell. Syst., vol. 6, no. 1,
pp. 27–39, 2009.
[10] N. S. Kojic, I. S. Reljin, and B. D. Reljin, “Routing in optical networks
by using neural network,” in Neural Network Applications in Electrical
Engineering, 2006. NEUREL 2006. 8th Seminar on, Sept. 2006, pp.
65–68.
[11] C. J. A. Bastos-Filho, R. A. Santana, D. R. C. Silva, J. F. MartinsFilho, and D. A. R. Chaves, “Hopfield neural networks for routing in
all-optical networks,” in Transparent Optical Networks, 2010. ICTON
’10. 12th International Conference on, 27 2010-july 1 2010.
[12] NVIDIA, NVIDIA CUDA Programming Guide 3.1. NVIDIA Corporation, 2010.
[13] ——, CUDA C Best Practices Guide 3.2. NVIDIA Corporation, 2010.
[14] H. Nguyen, Gpu gems 3. Addison-Wesley Professional, 2007.
[15] L. D. R. Thomas, Howes, and W. Luk, “A comparison of CPUs, GPUs,
FPGAs, and massively parallel processor arrays for random number
generation,” in Proceeding of the ACM/SIGDA international symposium
on Field programmable gate arrays. ACM, 2009, pp. 63–72.
[16] C. Bastos-Filho, M. Oliveira Junior, D. Nascimento, and A. Ramos,
“Impact of the random number generator quality on particle swarm
optimization algorithm running on graphic processor units,” Hybrid
Intelligent Systems, 2010. HIS ’10. Tenth International Conference on,
pp. 85–90, 2010.