Visual Robot Homing using Sarsa(λ), Whole Image Measure, and
Radial Basis Function.
Abdulrahman Altahhan, Kevin Burn, Stefan Wermter
Hybrid Intelligent Systems Research Group,
School of Computing and Technology,
University of Sunderland, SR6 0DD UK www.his.sunderland.ac.uk
abdulrahman.altahhan@sunderland.ac.uk.
Abstract—This paper describes a model for visual homing. It uses Sarsa(λ) as its learning algorithm, combined with the Jeffrey Divergence Measure (JDM) as a way of terminating the task and augmenting the reward signal. The visual features are taken to be the histogram differences between the current view and the stored views of the goal location, computed for all RGB channels. A radial basis function layer acts on those histograms to provide input for the linear function approximator. An on-policy online Sarsa(λ) method was used to train three linear neural networks, one for each action, to approximate the action-value function with the aid of eligibility traces. The resultant networks are trained to perform visual robot homing, where they achieved good results in finding a goal location. This work demonstrates that visual homing based on reinforcement learning and radial basis functions has high potential for learning local navigation tasks.
I. INTRODUCTION
A skill which plays an integral role in achieving robot autonomy is the ability to learn to operate in a priori unknown environments [1]. Visual homing is the act of finding a goal location by comparing the image currently viewed with stored ‘snapshot’ images (normally taken while the animal or robot is heading away from its home location). Visual navigation is the act of navigating from one location to another in the environment, as efficiently as possible. In this paper we present a model for visual homing, which can also be used in local navigation, using reinforcement learning (RL from now on) and an online snapshot comparison technique. This snapshot comparison facilitates online learning and execution in a priori unknown environments to reach a goal location1.
Robotics borrows several concepts from animal homing
and navigation strategies described in the biological
literature [2, 3]. While visual homing and visual navigation are related, they have been kept fairly separate, both because visual homing is more directly inspired by biology and because visual navigation is more general than visual homing. Nevertheless, navigation can be accomplished more directly by using local homing strategies to reach some location, without directly building a map or using a model of the environment dynamics. The limitation is that a strategy learned for navigating home is bound to that particular location. Therefore, if the robot needs to navigate to a different location, it must be trained to do so.

1 Note: goal location and home location will be used interchangeably in this paper.
We argue that our model can also be used for general navigation tasks, because it can operate in any environment and requires no additional effort beyond showing the robot, online or offline, its goal location and then letting it train.
Algorithms based on the snapshot model [3] propose
various strategies for finding features within images and
establishing correspondence between them in order to
determine a home direction. Block matching, for example,
takes a block of pixels from one image and searches for the
best matching block in another image within a fixed search
radius [4]. The degree of match between blocks is usually judged by the Sum of Squared Differences (SSD) or some other local correlation measure [5]. In our model we take a more effective approach by comparing histogram bins through a Radial Basis Function layer, using only images taken around the home and nothing more.
Reinforcement Learning has been used previously in
robotics navigation and control problems. Several of the
models that used it are inspired by biological findings, e.g.
[6]. Although successful, some of those models lack generality and/or practicality, and some are restricted to their environment. The model proposed by [7], for example, depends heavily on object recognition of a landmark in the environment to achieve the task. We have addressed this issue in our model by avoiding object recognition and instead using a whole-image measure technique to quantify the dissimilarity between the current and goal views, in order to identify whether the robot has reached the goal location (with the desired orientation). This is possible with no prior knowledge of, or constraints on, those images. By adding this advantage to the learning robustness and generality of RL, coupled with visual states and rewards, the model achieves a high level of robustness, generality, and applicability.
While environment dynamics or map building may be necessary for more complex or interactive forms of navigation or localization, visual homing based on model-free learning can offer an adaptive form of local homing. Although the immediate execution of model-based navigation systems can be successful [8, 9], RL techniques have the advantage of model-free systems, i.e. no knowledge is needed prior to operating the robot; it learns the best policy for the environment dynamics. While the idea of using snapshots for robot localization is not new [10], visual homing based on reinforcement learning with a radial basis input layer and a whole-image measure is the novel contribution of this paper.
We begin by presenting an overview of our reinforcement learning context and the Markov Decision Process (MDP) framework, followed by the Temporal Difference (TD) learning algorithm for continuous state spaces. This is followed by a detailed description of our model, demonstrating its generality and simplicity of execution. Then we present empirical results of a robot reaching a goal location visually in a simulation environment.
II. BACKGROUND OF REINFORCEMENT LEARNING
Reinforcement learning concerns the problem of learning to predict the sum of rewards an agent receives while interacting with its environment, in order to execute a task optimally [11]. Instead of being given examples of the desired behaviour, the learning agent must find out, using feedback from its environment and gradual explorative actions, how best to act to execute the task. Usually this feedback is a minimal signal of reward or punishment induced in some way in the environment. This signal is called the reinforcement signal.
In any environment there exists a set of states that represents the situations the agent can face (or recognize). Those states define the state space, denoted by S, which can be finite, or infinite and continuous. The actions are the simple activities the agent is able to perform in a certain state; the set of those actions defines the action space A. The actions can also be finite or infinite in number. The environment normally reacts or responds to any action taken by the agent by returning a signal indicating, or reinforcing, how good or bad this action was for the task; this is called the reward signal or the reinforcement signal. The dynamics of an environment are the set of probability distributions that characterize its internal properties, mainly the state transition function and the reward function.
The state transition function is a probability distribution defined on the state space that specifies the probability of moving from state s at time t to another state s' at time t+1 after applying action a:

$P_{ss'}^{a} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$
The reward function is defined as the expected reward returned by the environment for each state transition after applying a certain action:

$\mathcal{R}_{ss'}^{a} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$

where $r_{t+1}$ is the actual reward returned by the environment and fully observed by the agent.
As with most reinforcement learning work, we restrict ourselves to Markovian environments. A Markov decision process (MDP) is defined by a tuple $(S, A, P_{ss'}^{a}, \mathcal{R}_{ss'}^{a}, \gamma)$, where $\gamma \in [0, 1]$ is a discount rate parameter, and where the Markov property is satisfied [9, 11]. A trajectory of experience is a sequence $s_1, a_1, r_2, s_2, a_2, r_3, \ldots$ where the agent in $s_1$ takes action $a_1$, then receives reward $r_2$ and transitions to $s_2$ before taking $a_2$, etc.
A policy π specifies (probabilistically or deterministically) the action to be taken in each state:

$\pi : S \times A \to [0,1], \qquad \sum_{a} \pi(s, a) = 1$

where $\pi(s, a)$ is the probability of selecting action a when the agent is in state s. A deterministic policy is a mapping between states and actions, $\pi : S \to A$. The ultimate goal of
reinforcement learning methods (algorithms) is to learn an
optimum policy that, when followed, maximizes the
accumulated rewards expected to be gained by the agent
during interaction with its environment. This is normally achieved by estimating the expected sum in some form, since a model of the environment is usually not available and is undesirable as a requirement. Even in methods that assume the environment dynamics to be known, such as Dynamic Programming methods, the expectation still needs to be estimated due to the bootstrapping characteristic of such methods. By bootstrapping we mean building on one's own initial estimate to reach a better estimate, closer to the true value [11].
The discounted sum of rewards at time step t is called the return $R_t$, where:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ (1)
The expected accumulated reward for a certain policy π can be expressed in two forms: the value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s, a)$. A value function for a policy is defined as $V^{\pi}(s) : S \to \mathbb{R}$. $V^{\pi}$ specifies the expected return (sum of rewards $r_t$) from the starting state s onwards. Obviously each policy has a different value function, hence the superscript.

$V^{\pi}(s) = E_{\pi}\left[ R_t \mid s_t = s \right] = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$ (2)
The central idea of RL is to learn an estimate of the value function of the adopted policy from the interaction between the agent and its environment; in other words, to predict the value function of the agent's MDP policy. An essential property of the value function can be deduced from the intrinsic recursion it possesses (since $R_t = r_{t+1} + \gamma R_{t+1}$):

$V^{\pi}(s) = E_{\pi}\left[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s \right] = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ \mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s') \right]$ (3)

The action-value function is defined as $Q^{\pi}(s, a) : S \times A \to \mathbb{R}$, where:

$Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]$ (4)
For clarity, we present below the main results for the value function, and then shift to the action-value function when presenting our model.
III. TOWARDS OUR MODEL
Our work uses techniques developed for the problem of online on-policy evaluation, where an approximate action-value function is maintained and improved after each time step of following the policy. In particular we are interested in a linear Q-function approximator that uses Temporal Difference (TD) learning [12], since TD learning can be guaranteed to converge with any linear function approximator and a suitable step size [13]. For the continuous case and non-linear function approximation, convergence is not guaranteed [14], although some models have been presented with good results [15].
In this work we focus on presenting a model that learns an
approximation of a policy’s action-value function from
sample trajectories of experience following that policy. A
method for solving this problem is a core component of our
visual robot homing model. In particular, maintaining an
online estimate of the Q-function can be combined with
generalized policy improvement (GPI) to learn a controller
[11].
For a particular value function $V_{\theta}$, let the TD error at time t be defined as:

$\delta_t(V_{\theta}) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) = \text{prediction}_{t+1} - \text{prediction}_t$ (5)

Then $E_t[\delta_t(V^{\pi})] = 0$; that is, the mean TD error for the policy's true value function must be zero. We are interested in approximating $V^{\pi}$ using a linear function approximator. In particular, suppose we have a function which gives a feature representation of the state space, $\phi : S \to \mathbb{R}^n$. We are interested in an approximate value function of the form $V_{\theta} = \phi(s)^T \theta$, where $\theta \in \mathbb{R}^n$ are the parameters of the value function.
Because the policy’s true value function may not be in our
space of linear functions, we want to find a set of parameters
that approximates the true function. One possible approach
is to use the observed TD error on sample trajectories of
experience to guide the approximation.
The standard one-step TD method for value function approximation is TD(0). The basic idea of TD(0) is to adjust the predicted value of a state to reduce the TD error. Given some new experience tuple $(s_t, a_t, r_{t+1}, s_{t+1})$, the update with linear function approximation is:

$\theta_{t+1} = \theta_t + \alpha_t u_t(\theta_t)$ (6)

$u_t(\theta_t) = \delta_t(V_{\theta})\, \phi(s_t)$ (7)

where $V_{\theta}$ is the estimated value function with respect to $\theta_t$ and $\alpha_t$ is the learning rate. The vector $u_t(\theta_t)$ is like a gradient estimate that specifies how to change the predicted value of $s_t$ to reduce the observed TD error. We will call $u_t(\theta_t)$ the TD update at time t. After updating the parameter vector, the experience tuple is removed from memory.
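To make the update concrete, the following minimal Python sketch (variable names are hypothetical; this is not the paper's original implementation) applies equations (5)-(7) to a single experience tuple:

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, gamma=1.0, alpha=0.1):
    """One TD(0) update for a linear value function V_theta(s) = phi(s)^T theta.

    phi_s, phi_s_next: feature vectors of s_t and s_{t+1}.
    Implements the TD error of eq. (5) and the update of eqs. (6)-(7).
    """
    v_s = phi_s @ theta                        # current prediction V(s_t)
    v_s_next = phi_s_next @ theta              # bootstrapped prediction V(s_{t+1})
    delta = reward + gamma * v_s_next - v_s    # TD error, eq. (5)
    return theta + alpha * delta * phi_s       # eqs. (6)-(7)
```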
IV. THE PROPOSED VISUAL HOMING SARSA MODEL
In this section we describe the proposed model. From the simplest perspective, any reinforcement learning model (or any MDP model in general) consists of elements and experience gained about those elements. The environment dynamics encoded in the tuple $(S, A, P_{ss'}^{a}, \mathcal{R}_{ss'}^{a}, \gamma)$ describe the basic elements of the model, while the interaction between the robot and the environment constitutes the gained experience. This experience is normally encoded in the learning parameters using some learning method that mainly learns a value function. For control, an optimal policy $\pi^*$ can be reached through policy improvement. We first describe the main elements, then the learning rules and algorithm, and conclude this section with the overall model structure.
A. Basic Elements of the System, the State Space:
Since we are considering visual homing, it is natural to choose vision as the main medium for distinguishing between different situations. Hence, we assume that the image at each time step represents the current state, and the state space S is the set of all images that can possibly be taken at any location (with a specific orientation) in the environment. This complex state space has two problems. First, each state is of high dimensionality, i.e. each state is represented by a large number of pixel components. Second, the state space is huge and a policy cannot be learned directly for each state. Instead, a feature representation of the states is used to reduce the high dimensionality of the image state space and to gain the advantages of coding [16].
This feature representation of the state space is assumed to preserve the distinctiveness of states; hence it can reduce the high-dimensionality problem, but we are still faced with the intractability problem. Therefore, a generalization technique is needed in order to accommodate the intractability of the state space. More precisely, generalization is needed in order to approximate the value of a state that has never been visited before, through previous visits to similar states. A natural way to do so is to use a function approximation technique such as a neural network.
We would like these features to encode implicitly how different the current image view is from those of the goal. This visual cue should guide the process of finding the goal location. The problem is that this approach does not give a direct distance indication. We do not assume that the goal location is always in the robot's field of view, but by comparing the current view with the goal views we combine the properties of distinctiveness, distance and orientation in one representation.
B. Defining the Goal Location:
Since the home location can be approached from different directions, the way it is represented should accommodate this fact. Therefore, a home (or goal) location is defined by m snapshots called the stored views. These few snapshots (normally m ≥ 3) of the home location are taken at the very start, each from a fixed distance but from a different angle. The distance should be compatible with the scale of the environment and the characteristics of the home location. This allows for the highest distinctiveness of the location without losing information or involving unneeded information. These snapshots are the only requirement for the system to learn to reach its home location starting from any position in the environment (including positions from which it cannot see the home, i.e. the robot should be able to reach a hidden goal location).
C. The Feature Vectors:
We take a histogram of each channel of the current view and compare it with those of the stored views through a radial basis function (RBF) layer. This gives us the feature space representation $\Phi : S \to \mathbb{R}^n$ (8), which is used with the Sarsa(λ) algorithm, as we shall see later.

$\phi_i\left(s_t(c, j)\right) = \exp\left( - \frac{\left\| h_i\left(s_t(c)\right) - h_i\left(v(c, j)\right) \right\|^2}{2\sigma_i^2} \right)$ (8)

The index t denotes the time step, j stands for the jth stored view, and c is the index of the channel, where we used the RGB representation of images. So $v(c, j)$ is channel c of the jth stored view, $h_i(v(c, j))$ is bin i of the histogram of $v(c, j)$, and $h_i(s_t(c))$ is bin i of channel c of the current view at time t. Of course the number of bins affects the performance of this measure and hence the model, and will be studied in the experimental section.
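As an illustration, a minimal Python sketch of this feature computation might look as follows (names such as `current_hists` and `stored_hists` are hypothetical; the widths σ are taken as given here, while the model estimates them incrementally, see equation (19) below):

```python
import numpy as np

def rbf_features(current_hists, stored_hists, sigma):
    """Equation (8): one Gaussian RBF feature per (channel, stored view, bin).

    current_hists: array (C, B)    - per-channel histograms of the current view.
    stored_hists:  array (C, m, B) - histograms of the m stored goal views.
    sigma:         array (C, m, B) - per-feature widths sigma_i.
    Returns a flat feature vector phi(s) of length n = C * m * B.
    """
    C, m, B = stored_hists.shape
    diff = current_hists[:, None, :] - stored_hists    # broadcast over the m views
    phi = np.exp(-(diff ** 2) / (2.0 * sigma ** 2))    # Gaussian RBF per bin
    return phi.reshape(C * m * B)
```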
D. The Action Space:
The set of actions is A = [Left_Forward, Right_Forward, Go_Forward], where the two differential wheel speeds were set to fixed values so that we have a countable set of actions.
E. Dissimilarity Measure and the Termination Condition:
We need a way to determine how close the current position is to the goal location; this is done by measuring how dissimilar the current view is to each stored view of the goal location. One can use any of the dissimilarity measures discussed extensively in the information retrieval field [10, 17]. In particular we are interested in the Jeffrey Divergence Measure, given by (9):

$JDM(H, K) = \sum_i \left( h_i \log \frac{2 h_i}{h_i + k_i} + k_i \log \frac{2 k_i}{h_i + k_i} \right)$ (9)

where H and K are the two images to be compared, and $h_i$ and $k_i$ are the numbers of elements belonging to bin i of the histograms of H and K, respectively. Fig. 1 (a) shows a sample view from the robot's camera; part (b) shows the changes in the JDM measurements when turning away from this location.
Fig. 1. Example of the JDM behaviour relative to the robot's rotational motion: (a) sample camera view; (b) Jeffrey divergence plotted against rotation away from the location.
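For concreteness, a small Python sketch of the measure in (9) follows (a hypothetical helper, assuming the histograms are non-negative arrays of equal length, with the convention 0 · log 0 = 0 for empty bins):

```python
import numpy as np

def jeffrey_divergence(h, k):
    """Equation (9): Jeffrey divergence between two histograms h and k."""
    h = np.asarray(h, dtype=float)
    k = np.asarray(k, dtype=float)
    total = 0.0
    for hi, ki, si in zip(h, k, h + k):
        if hi > 0:                          # 0 * log(0) is taken as 0
            total += hi * np.log(2 * hi / si)
        if ki > 0:
            total += ki * np.log(2 * ki / si)
    return total
```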
The JDM has been used successfully with an omni-directional camera to perform robot localization [10]. We used a normal camera, however, to be able to distinguish the robot's orientation, which is crucial to our navigation task. This avoids the orientation-insensitivity of an omni-directional camera, which is desirable for localization but undesirable for navigation.
We denote by $JDM_t(c, j)$ the Jeffrey Divergence Measure between the current view and stored view j for channel c, and by $JDM_t(j)$ the average dissimilarity between the current view and stored view j over all the channels:

$JDM_t(j) = \frac{1}{C} \sum_{c} JDM_t(c, j)$ (10)

We set our termination state to be any current view for which one of its $JDM_t(c, j)$ values with the m stored views is less than a certain threshold ψ, i.e. a view that matches ‘well’ with one of the goal views:

$\min_{c, j}\left( JDM_t(c, j) \right) < \psi \;\Rightarrow\; \text{Terminate Episode}$

The way to set this environment-scale-specific threshold is discussed in the experimental section.
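A sketch of this termination test and the per-view average of (10), building on the `jeffrey_divergence` helper above (the threshold `psi` and the histogram arrays, shaped as in the earlier sketch, are assumptions):

```python
def average_jdm(current_hists, stored_hists):
    """Equation (10): mean JDM over the C channels, one value per stored view."""
    C, m = stored_hists.shape[0], stored_hists.shape[1]
    return [sum(jeffrey_divergence(current_hists[c], stored_hists[c][j])
                for c in range(C)) / C
            for j in range(m)]

def reached_goal(current_hists, stored_hists, psi):
    """Terminate when the best per-channel, per-view match falls below psi."""
    C, m = stored_hists.shape[0], stored_hists.shape[1]
    best = min(jeffrey_divergence(current_hists[c], stored_hists[c][j])
               for c in range(C) for j in range(m))
    return best < psi
```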
F. The Reward Function:
The reward function $\mathcal{R}_{ss'}^{a}$ consists of three parts:
- The main part is the cost, which is set to -1 for each step taken by the robot without reaching the home location (i.e. without reaching a termination state).
The reward signal can be augmented by two further signals to ensure higher performance, although the model works regardless of their involvement. These are:
- The approaching-the-goal reward: the maximum reduction in dissimilarity between the current step and the previous step. If this difference is decreasing, the robot is actually moving in the right direction towards the home location, while if it is increasing, the opposite holds. We call this the differential dissimilarity signal, defined as:

$\Delta JDM_t = \max_{j}\left( JDM_{t-1}(j) - JDM_t(j) \right)$ (11)

- The position signal: the inverse of the current dissimilarity, $1 / JDM_t$. Thus, as the current location differs less from the home location, this reward increases.

$r_t = \text{cost} + \Delta JDM_t + \frac{1}{JDM_t}$ (12)

Of course, the latter two reward components are only included if the dissimilarities of both steps fall under a certain threshold ψ′, to ensure that the robot is approaching the home location. This threshold is environment-scale-specific, and is introduced merely to enhance the performance.
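The following Python sketch combines these parts under one minimal reading of (11)-(12); `psi_prime` and the per-view averages `jdm_prev`/`jdm_curr` from (10) are assumed inputs, and "the current dissimilarity" is interpreted as the best (smallest) match over the stored views, a detail the paper leaves implicit:

```python
def step_reward(jdm_prev, jdm_curr, psi_prime, cost=-1.0):
    """Equations (11)-(12): per-step cost plus optional shaping terms.

    jdm_prev, jdm_curr: JDM_{t-1}(j) and JDM_t(j) for each stored view j.
    The shaping terms apply only when both steps are already close to the
    goal (dissimilarities below psi_prime), as described in the text.
    """
    reward = cost  # -1 for every step that does not reach the goal
    if min(jdm_prev) < psi_prime and min(jdm_curr) < psi_prime:
        delta = max(p - c for p, c in zip(jdm_prev, jdm_curr))  # eq. (11)
        reward += delta + 1.0 / min(jdm_curr)                   # eq. (12)
    return reward
```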
The overall structure of the model is shown in Fig. 2.

Fig. 2. The various components of the proposed model. The current image (the current state s) and the m stored view images are each reduced to per-channel histograms (the B × m histogram bins per channel serve as features); an RBF layer compares each feature of the current view with the reference views to produce the C × B × m feature vector φ(s); the parameters θ provide the Q-function approximation $Q(\phi(s_t), a_t)$; and the policy π(s, a) drives the robot control, with the JDM of the current view providing the reward r.

G. The Eligibility Trace:
An eligibility trace constitutes a mechanism for temporal credit assignment. It marks the memory parameters associated with an action as eligible for undergoing learning changes [11]. In our application, the eligibility trace for action a is the discounted sum of the feature vectors of the images the robot has seen so far after applying this action. The eligibility trace for any other action, which has not been taken in the current state, is simply its previous trace discounted, i.e. those actions become less responsible for the credit:

$e_t(a) \leftarrow \begin{cases} \gamma\lambda\, e_{t-1}(a) + \phi(s_t) & \text{if } a = a_t \\ \gamma\lambda\, e_{t-1}(a) & \text{otherwise} \end{cases}$ (13)

where λ is the decay rate of the eligibility traces e.

H. The Learning Method:
What remains is the learning algorithm. Our algorithm is an on-policy bootstrapping Sarsa(λ) [11] with linear approximation of the action-value function Q. Sarsa(λ) is an algorithm that uses TD(λ) for control. It learns online through interaction with the simulation software, which feeds it the robot's visual sensor readings. The algorithm, coded as a controller, returns the chosen action to be taken by the robot, and updates its policy by updating the set of parameters used to approximate the action-value function Q. Three linear networks are used to approximate the action-value function, one for each of the three actions:

$\theta(a^{(i)}) = \left( \theta_1^{a^{(i)}}, \ldots, \theta_i^{a^{(i)}}, \ldots, \theta_n^{a^{(i)}} \right), \quad i = 1, \ldots, |A|$

The current image is passed through an RBF layer which gives the feature vector $\phi(s_t) = (\phi_1, \ldots, \phi_i, \ldots, \phi_n)$. The robot was left to run through several episodes. After each episode the learning rate was decreased, and the policy was improved further through GPI. The overall algorithm is that of the Sarsa(λ) control algorithm [11] and is summarized in Fig. 3.

Initialization: $\theta_0(a_i) = 1$, $i = 1 : |A|$; $a_0 \leftarrow 2$
Repeat (for each episode):
  $e_0(a_i) = 0$, $i = 1 : |A|$
  $s_0 \leftarrow$ initial robot view; $t \leftarrow 1$
  Generate $a_0$ by sampling the probability $\pi(\phi(s_0), a)$
  Repeat (for each step of the episode):
    Take action $a_t$; observe $r_{t+1}$ and $\phi(s_{t+1})$
    Generate $a_{t+1}$ by sampling the probability $\pi(\phi(s_{t+1}), a)$
    $\delta_t \leftarrow r_{t+1} + \gamma\, \phi(s_{t+1})^T \theta(a_{t+1}) - \phi(s_t)^T \theta(a_t)$
    $e_t(a) \leftarrow \gamma\lambda\, e_{t-1}(a) + \phi(s_t)$ if $a = a_t$; $e_t(a) \leftarrow \gamma\lambda\, e_{t-1}(a)$ otherwise
    $\theta(a_t) \leftarrow \theta(a_t) + \alpha_{ep} \cdot e_t(a_t) \cdot \delta_t$
    $\phi(s_t) \leftarrow \phi(s_{t+1})$; $a_t \leftarrow a_{t+1}$
  until $\min_j \left( JDM_t(j) \right) < \psi$
until episode == final_episode

Fig. 3. Linear on-policy gradient-descent Sarsa(λ) control with RBF features: linear action-value function approximation with policy improvement. The approximate Q is implicitly a function of θ.

The learning rate was the same as that used by Boyan [18]:

$\alpha_{ep} = \alpha_0 \cdot \frac{n_0(\text{final\_episode}) + 1}{n_0(\text{final\_episode}) + \text{episode}}$ (14)
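A compact Python sketch of one episode of this control loop follows, as a simplified reading of Fig. 3; the environment interface `env`, the feature function, and the policy sampler are hypothetical placeholders, and the per-episode learning rate α_ep of (14) is passed in as `alpha`:

```python
import numpy as np

def run_episode(env, features, sample_action, theta,
                gamma=1.0, lam=0.8, alpha=0.1):
    """One episode of linear Sarsa(lambda) with per-action weights (Fig. 3).

    theta: array (num_actions, n) - one linear network per action.
    features(obs) -> feature vector phi of length n (the RBF layer).
    sample_action(theta, phi) -> action index drawn from pi(phi(s), a).
    """
    e = np.zeros_like(theta)            # eligibility traces, eq. (13)
    phi = features(env.reset())
    a = sample_action(theta, phi)
    done = False
    while not done:
        obs, r, done = env.step(a)      # observe r_{t+1} and s_{t+1}
        phi_next = features(obs)
        a_next = sample_action(theta, phi_next)
        target = 0.0 if done else phi_next @ theta[a_next]
        delta = r + gamma * target - phi @ theta[a]   # TD error delta_t
        e *= gamma * lam                # decay every action's trace
        e[a] += phi                     # accumulate for the taken action
        theta[a] += alpha * delta * e[a]  # Fig. 3 updates the taken action's net
        phi, a = phi_next, a_next
    return theta
```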
I. The Policy Used to Generate Actions:
A combination of an ε-greedy policy and a Gibbs soft-max policy [11] is used to pick an action and strike a balance between exploration and exploitation. Using the ε-greedy probability allows exploration to be increased as needed, by initially setting ε to a high value and then decreasing it over the episodes. The Gibbs soft-max probability is
episodes. Gibbs soft-max probability,
Gibbs ( a ( i ) , φ ( s t )) =
[
exp φ ( s t ) T ⋅ θ ( a ( i ) )
]
,
A
∑ exp [φ ( s
t
) T ⋅ θ ( a ( j ))
learning the value function for almost the same states that
are going to be encountered in the future.
(15)
]
j =1
This helped to increase the chance of picking the action with the highest value when the difference between its value and those of the remaining actions is large, i.e. it increased the chance of picking the action with the highest Q-value when the robot is confident that it is the right one.
The ε-greedy probability is:

$\Pr(a, \phi(s_t)) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|} & \text{if } a = \arg\max_i \left[ \phi(s_t)^T \theta(a^{(i)}) \right] \\ \dfrac{\varepsilon}{|A|} & \text{otherwise} \end{cases}$ (16)

and the combined policy is:

$\pi_{\varepsilon + Gibbs}(a_i, \phi(s_t)) = Gibbs(a_i, \phi(s_t)) + \Pr(a_i, \phi(s_t))$ (17)
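A small Python sketch of this combined action-selection rule, covering equations (15)-(17); the summed scores are renormalized before sampling, which is an assumption the paper leaves implicit:

```python
import numpy as np

def sample_action(theta, phi, epsilon):
    """Equations (15)-(17): mixture of Gibbs soft-max and epsilon-greedy."""
    q = theta @ phi                              # Q(phi(s), a) for each action
    z = np.exp(q - q.max())                      # stabilized soft-max, eq. (15)
    gibbs = z / z.sum()
    greedy = np.full(len(q), epsilon / len(q))   # eq. (16)
    greedy[np.argmax(q)] += 1.0 - epsilon
    p = gibbs + greedy                           # eq. (17)
    return np.random.choice(len(q), p=p / p.sum())
```

Note the subtraction of `q.max()` before exponentiating: the paper observes that this exponential form can blow up if its exponents are not kept in check, and this shift is the standard numerical safeguard.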
J. The Neural Network Layers:
From a neural network point of view, when considering the RBF layer together with the competitive layer, one can see that this architecture is similar to a Probabilistic Neural Network (PNN) that calculates the probability of picking a certain action conditioned on the given goal. We will call the neural network used in our model the RBF-Q-D network (and algorithm), because we use an RBF layer for feature extraction followed by a linear layer with the Sarsa(λ) algorithm and the dissimilarity measure. Fig. 2 shows a simplification of our model with its layers.
K. The Linear Networks and Feature Dimensions:
The parameters have the same dimension as the feature space, which is $n = C \times B \times m$, where C = 3 is the number of channels, B is the number of bins per image and m is the number of stored views of the goal location. Since we use RGB images with values in the range [0, 255] for each pixel, the dimension of the feature space is given by:

$n \approx C \times \frac{256}{b} \times m$ (18)

where b is the bin size. Different bin sizes give different dimensions, which in turn give a different number of approximation parameters θ. The equality is not exact because the precise number of bins is $B = \mathrm{round}(256 / b) + 1$.

Note that the $\sigma_i$ of the features was chosen through continuous update over the feature vectors collected in all the time steps so far:

$\sigma_i = \left( \frac{1}{T - 1} \sum_{t=1}^{T} \left\| h_i\left(s_{t+1}(c, j)\right) - h_i\left(v(c, j)\right) \right\|^2 \right)^{1/2}$ (19)

This allowed a better incremental estimate of each feature's variance and hence better performance. After enough exploration of the environment this value is almost stable and changes to it are minimal. It has been observed that the variance of this internal parameter drops to a negligible value after episode 10, which means that results are reliable for episode > 10 and that the neural networks are learning the value function for almost the same states that are going to be encountered in the future.

L. Important Enhancements and Limitations:
One problem of ‘unnecessary wandering’ remains. It is mainly caused by successive conflicting positive and negative rewards given by the environment as the robot approaches the goal and then wanders around it without reaching it. A severe punishment was therefore applied in the particular case where the robot goes from a positive reward to a negative punishment in two successive steps.

V. EXPERIMENTAL RESULTS

The model was applied using a simulated Khepera [19] robot in the Webots™ [20] simulation software. The Khepera is a miniature real robot, 70 mm in diameter and 30 mm in height, provided with 8 infra-red sensors for reactive behaviour, as well as a colour camera extension.
A 1.15 × 1 m² simulated environment was used as a test bed for our model. The task is to learn to navigate from any location in the environment to a home location (without using any specific object or landmark). For training, the robot always starts from the same location, where it cannot see the target location, and the end state is the target location.
Fig. 4. Snapshots of the realistic simulated environment, showing the target locations and the Khepera robot in its starting location.
Fig. 4 shows the environment used. A cone, a ball and a TV are included to add more texture to the goal location, i.e. to enrich it and make it different from the other locations in the environment. We re-emphasize that no object recognition technique was used, only the JDM. The controller was written as a combination of C++ code and Matlab Engine code.
The robot starts by taking (m = 3) snapshots of the goal location. It then goes through a specific number (500) of episodes. The robot starts with a random policy, and finishes an episode when it reaches the desired location.
A. The Practical Settings of the Model Parameters:
For our application we chose the feature space parameters to be b = 3 and m = 3, hence $n = 3 \times (\mathrm{round}(256/3) + 1) \times 3 = 774$. λ was set to 0.8, based on studies [11, 21] that identify the range [0.7, 0.8] as the performance peak of TD(λ) learning. The discount constant was set to γ = 1, i.e. there is no discounting through time. ψ and ψ′ are purely empirical and were set to 1.7 and 2, respectively.
B. Setting the Exploitation vs. Exploration Balance:
Since the action space is finite, and to avoid fluctuation and overshooting in the robot's behaviour, low wheel speeds were adopted for these actions. This in turn required setting the exploration to a relatively high rate (almost 50%) during the early episodes. It was then dropped gradually over the episodes, in order to make sure that most of the potential paths were sufficiently visited. Setting exploration high also helps to decrease the number of episodes needed before reaching an acceptable performance. This explains the exponential appearance of the different learning curves discussed below.
The model's performance was studied with a small number of stored views (m = 3) to show the robustness of the model. One can enhance accuracy by increasing the dimension of the feature space, but at a trade-off in speed of convergence and execution. The most natural way to increase the state space dimension is to increase the number of histogram bins considered. However, to concentrate on the pure effect of changing m and eliminate the increase in state dimension due to the increase in m (18), one can set m = b and then change both m and b together. This fixes the dimension of the feature space, and consequently the size of the approximator, and shows the actual effect of changing the number of views m.
C. Studying the Model Performance:
Fig. 5 shows the effect of learning, averaged over 8 trials, each with 500 episodes. All of the trials converged successfully. Divergence occurred only when the learning rate was set to a high value, or when exploration was decreased too quickly. The reason we needed a low learning rate is that we use a Gibbs probability distribution for the exploration/exploitation balance; this exponentially formed probability can quickly go to infinity if care is not taken when assigning its exponents. The fact that we have a relatively large state space dimension was the major factor in this situation.
Part (a) shows the most important aspect of any reinforcement learning model: the return values of each episode, converging towards the optimum. After all, the main purpose of an RL-based model is to optimally increase the sum of the received rewards. The return values (mostly negative) increased naturally through the episodes due to the improvement taking place from episode to episode. This is done by improving the adopted policy implicitly, by moving to better estimates and decreasing exploration from one episode to the next. The accuracy of the action-value function estimates increases gradually and iteratively through the learning parameters θ.
Fig. 5 (b) shows the decrease in the number of steps needed to achieve the task. This decrease is in accordance with part (a), and is due to the cost part of the reward function. In fact, we decreased the difference between ψ and ψ′ so that the other two parts of the reward formula had minimal effect on the model's convergence. Fig. 5 (c) depicts the changes in the learning parameters themselves, where i stands for the component index of the learning vector.
Most importantly, all three parts have an exponential-like shape, showing the high speed of convergence this model reaches. This is highly desirable in a reinforcement learning model, given the dominant convergence-slowness problem of such models [11]. In fact, one major contribution of this work is the high performance reached with little experience using a complex visual input.
Fig. 5. Learning curves averaged over 8 trials: (a) return per episode; (b) steps per episode; (c) learning parameters.
Fig. 6 depicts performance and internal parameter illustrations. Fig. 6 (a) shows the learning rate decrease through the episodes, which was used throughout the trials. Part (b) shows the decrease enforced on the exploration rate ε, while part (c) shows the overall percentages of explorative and exploitative actions. Routes taken by the robot in three episodes (early, middle, and final) of one of the trials are shown in parts (d)-(f).
VI. DISCUSSION AND CONCLUSION
We have tackled policy improvement for Sarsa(λ) systems combined with the JDM and RBFs. This is novel relative to models introduced in the literature, due to the way we applied reinforcement learning using neuro-dynamic programming methods like Sarsa(λ). Below we state some of the advantages of this model:

Fig. 6. Learning performance and a sample route for a sample trial (parts (a)-(f)).
1) Simplicity of learning: the robot can learn to perform its visual navigation task in a very simple way, without a long process of map building.
2) Limited storage of information is required, in the form of m stored views.
3) No pre-processing or manual processing is required. No a priori knowledge about the environment is needed, i.e. no landmarks are needed in those views.
4) An important advantage of our model over explicit model-based MDP approaches is that robot abduction (the kidnapped-robot problem) is handled directly, i.e. the robot can find its way and recover after it has been displaced from its current position and put in a totally different position.
To raise the differentiability of the views, however, they should be rich in colours etc. (i.e. contain a good amount of information).
Through the learning robustness and generality of RL, coupled with visual states and rewards, the system achieved a high level of robustness, generality, and applicability. This combination tentatively proved to work very well for our navigation problem.
Future work includes carrying out more extensive experiments on our model by trying different configurations using (18), both in terms of more stored views as well as different bin sizes and different environments. Future work can also include using an off-policy method instead of the on-policy one, to accommodate the two behaviour layers used by the agent.
REFERENCES
[1] U. Nehmzow, Mobile Robotics: A Practical Introduction. Springer-Verlag, 2000.
[2] A. M. Anderson, "A model for landmark learning in the honey-bee,"
Journal of Comparative Physiology A vol. 114, pp. 335-355, 1977.
[3] B. A. Cartwright and T. S. Collett, "Landmark maps for honeybees,"
Biological Cybernetics, vol. 57, pp. 85-93, 1987.
[4] A. Vardy and F. Oppacher, "A scale invariant local image descriptor for visual homing," in Biomimetic Neural Learning for Intelligent Robots, G. Palm and S. Wermter, Eds. Springer, 2005.
[5] M. Szenher, "Visual Homing with Learned Goal Distance
Information," presented at Proceedings of the 3rd International
Symposium on Autonomous Minirobots for Research and
Edutainment (AMiRE 2005), Awara-Spa, Fukui, Japan, 2005.
[6] D. Sheynikhovich, R. Chavarriaga, T. Strosslin, and W. Gerstner,
"Spatial Representation and Navigation in a Bio-inspired Robot," in
Biomimetic Neural Learning for Intelligent Robots, S. Wermter, M.
Elshaw, and G. Palm, Eds.: Springer, 2005, pp. 245-265.
[7] C. Weber, D. Muse, M. Elshaw, and S. Wermter, "A camera-direction
dependent visual-motor coordinate transformation for a visually
guided neural robot," Knowledge-Based Systems, Science Direct,
Elsevier vol. 19, pp. 348–355, 2006.
[8] S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, and H. Durrant-Whyte, "Simultaneous localization and mapping with sparse extended information filters," International Journal of Robotics Research, vol. 23, pp. 693–716, 2004.
[9] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge,
Massachusetts; London, England: The MIT Press, 2005.
[10] I. Ulrich and I. Nourbakhsh, "Appearance-Based Place Recognition
for Topological Localization," presented at IEEE International
Conference on Robotics and Automation, San Francisco, CA, 2000.
[11] R. S. Sutton and A. Barto, Reinforcement Learning, an introduction.
Cambridge, Massachusetts: MIT Press, 1998.
[12] R. S. Sutton, "Learning to predict by the methods of temporal
differences," Machine Learning, vol. 3, pp. 9–44, 1988.
[13] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference
learning with function approximation," IEEE Transactions on
Automatic Control vol. 42, pp. 674–690, 1997.
[14] J. A. Boyan, "Technical update: Least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.
[15] C. Gaskett, D. Wettergreen, and A. Zelinsky, "Q-Learning in
Continuous State and Action Spaces," presented at Australian Joint
Conference on Artificial Intelligence, Australia, 1999.
[16] P. Stone, R. S. Sutton, and G. Kuhlmann, "Reinforcement learning for
robocup soccer keepaway," International Society for Adaptive
Behavior vol. 13, pp. 165–188, 2005.
[17] Y. Rubner et al., "The Earth Mover's Distance as a Metric for Image Retrieval," International Journal of Computer Vision, vol. 40, pp. 99-121, 2000.
[18] J. A. Boyan, "Least-squares temporal difference learning," in Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, 1999.
[19] D. Floreano and F. Mondada, "Hardware solutions for evolutionary
robotics," presented at First European Workshop on Evolutionary
Robotics, Berlin, 1998.
[20] O. Michel, " Webots: Professional Mobile Robot Simulation,"
International Journal of Advanced Robotic Systems, vol. 1, pp. 39-42,
2004.
[21] L. C. Baird, "Residual Algorithms: Reinforcement Learning with Function Approximation," presented at the Twelfth International Conference on Machine Learning, San Francisco, CA, 1995.