Nothing Special   »   [go: up one dir, main page]

Ecient Neighbor Searching in Nonlinear Time Series Analysis

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

Ecient neighbor searching in nonlinear time

series analysis
Thomas Schreiber
Department of Theoretical Physics, University of Wuppertal,
D{42097 Wuppertal
July 18, 1996

We want to encourage the use of fast algorithms to nd nearest neighbors


in k{dimensional space. We review methods which are particularly useful for
the study of time series data from chaotic systems. As an example, a simple
box{assisted method and possible re nements are described in some detail. The
eciency of the method is compared to the naive approach and to a multidi-
mensional tree for some exemplary data sets.

1 Introduction
Finding nearest neighbors in k{dimensional space is a task encountered in many
data processing problems. In the context of time{series analysis e.g. it occurs if
one is interested in local properties in a reconstructed phase space. Examples are
predictions, noise reduction or Lyapunov exponent estimates based on local ts
to the dynamics, or the calculation of dimension estimates. Other applications in
physics include simulations of molecular dynamics with nite range interactions,
where a box{oriented approach is used called \link{cell algorithm."[Fincham &
Heyes, 1985, Form et al., 1992]
As long as only small sets (say n < 1000 points) are evaluated, neighbors
can be found in a straightforward way by computing the n2 =2 distances between
all pairs of points. However, numerical simulations and to an increasing degree
experiments are able to provide much larger amounts of data. With increasing
data sets ecient handling becomes more important.
Neighbor searching and related problems of computational geometry have
been extensively studied in computing science, with a rich literature covering
both theoretical and practical issues. General references include [Sedgewick,
1990, Preparata & Shamos, 1985, Gonnet & Baeza{Yates, 1991, Mehlhorn,
1984]. In particular, the tree{like data structures are studied in [Omohumdro,

1
1987, Bentley, 1980, Bentley, 1990], and the bucket (or box) based methods
in [Noga & Allison, 1985, Devroye, 1986, Asano et al., 1985].
Although considerable expertise is required to nd and implement an opti-
mal algorithm, we want to demonstrate in this paper that with relatively little
e ort a substantial factor in eciency can be gained. The use of any intelligent
algorithm can result in reducing CPU time (or increasing the maximal amount
of data which can be handled with reasonable resources) by orders of magnitude,
compared to which the di erences among these methods and the gain through
re nements of an existing algorithm are rather marginal. Thus we give a simple
and general algorithm which is worth the e ort even for sets of only moderate
size.
The box{assisted algorithm given here has been heuristically developed in
the context of time series analysis [Grassberger, 1990, Grassberger et al., 1991].
Similar procedures are proposed for this purpose in [Theiler, 1987, Kostelich
& Yorke, 1988], while the k{d{tree (Sec. 2) seems to be the most popular
approach [Bingham & Kot, 1989, Farmer & Sidorowich, 1988]. In Sec. 3 we
describe a very simple version of a box{assisted algorithm for nding all points
closer than a given distance . In Sec. 4 we describe how it can be improved
by using linked lists. To illustrate the usefulness of this data structure, a very
fast sorting method is presented. Furthermore we describe how to modify the
basic algorithm in order to nd a given number of neighbors rather than a
neighborhood of xed diameter. In Sec. 5 we will discuss some examples of the
performance of the algorithms described. For comparison we give results ob-
tained using the k{d{tree algorithm described in [Bingham & Kot, 1989], which
represents a similar level of sophistication.
Both the box{assisted and the tree implementation used were chosen rather
for simplicity than for optimality. The reader who wants to go beyond this will
nd some suggestions in Sec. 6.

2 The classical approach: multidimensional trees


How to nd nearest neighbors in k{dimensional space? The textbook answer is
to use multidimensional trees. From the point of view of computing theory they
have the advantage that they can be proven to have an asymptotic number of
operations proportional to n logn for a set of n points, which is the best possible
performance for sets of arbitrary distribution.
We will see later that there exist algorithms which are faster (the number
of operations is asymptotically proportional to n) provided the set is not too
singularly distributed. Nevertheless, since trees are very widely applicable and
accordingly very popular, we want to recall the basic idea. The algorithm is
described in more detail in [Bingham & Kot, 1989].
Neighbor searching consists of two steps, rst a data base is built and then
this base is scanned in an ecient way to extract the required points. In this

2
section we want to use a tree{like structure as a data base. At each branching
point, the remaining data set is split into two branches according to the result
of a comparison. Say we have n vectors of Cartesian coordinates in k dimensi-
ons. Since there is no ordering relation in k > 1 dimensions we use one of the k
coordinates at each level of the tree to determine into which branch a point falls.
We cyclically use all the coordinates. The method is illustrated in Fig. 1. Since
we will have n branching points and on average log2 n levels, the construction
of the tree requires O(n logn) operations.
Once the tree has been built up, we can now use it to nd all points closer
than  to a given reference point x0 . Instead of computing the distances to all
the other points in the set we can now exclude branches which fall completely
outside an {cube around x0 .

3 Box methods: The basic algorithm


Consider a set of n vectors xi in k dimensions, for simplicity rescaled to fall into
the unit cube. For each xi we want to determine all neighbors closer than ,1
or, strictly speaking, determine the set of indices
Ui () = fj : kxj ? xi k < g : (1)
This is preferable for technical reasons since we do not want to move the whole
vectors around, in particular not if they are obtained as delay coordinates from
a scalar time{series.
The idea of box{assisted methods is the following. Divide the phase space
into a grid of boxes of side length . Then each point falls into one of these
boxes. All its neighbors closer than  have to lie in either the same box or one
of the adjacent boxes. Thus in two dimensions only 9 boxes have to be searched
for possible neighbors, i.e. for  < 1=3 we have to compute less distances than
in the simple approach.
The technical problem is how to put the points into the boxes without wast-
ing memory. How to provide the right amount of memory for each box without
knowing beforehand how many points it will contain? The conceptually sim-
plest solution is to obtain this information by rst lling a histogram which tells
how many points fall into which box. With this information one can divide an
array of n elements so that each box is assigned a contiguous section of memory
exactly as large as to hold all the points in the box (see Fig. 2). For each box
one has to store the position where this section starts. The end of the section
is given by the starting position for the next box. Once the indices of all the
points are stored in the section corresponding to the appropriate box, scanning
a box simply means scanning its section of the array.
We suggest the following implementation.
1 Throughout this paper we use the sup norm. Neighbors in the Euclidean or the L1 norms
are also neighbors in sup norm.

3
1. To save memory we use two{dimensional boxes regardless of the dimen-
sion k, unless the set itself is of very high dimension. Neighbors in two
dimensions are the candidates for neighbors in k dimensions. The two
least correlated coordinates promise most information in two dimensions.
2. Provide an array BOX of size n  n. If n > 1 we waste memory. If n < 1
we have to \wrap around" the set since the boxes do not cover the whole
area. Thus we take the coordinates fx(jk) mod (n)g instead of fx(jk)g to
determine into which box a point falls.
3. Provide a linear array POINTS with n elements to contain the indices of
all points.
4. Use BOX to make a histogram of how many points fall into each box.
5. Replace the histogram in BOX by an accumulated histogram, going through
BOX row by row, say from left to right, and sum the number of points to be
stored so far. Now each element of BOX tells where the section of POINTS
corresponding to this box ends.
6. For every point determine into which box it falls, say (I,J). Store the
index of this point at the location on POINTS given by the current value
of BOX(I,J). Count down BOX(I,J) one step.
7. When all points are inserted, the elements of BOX tell where the corre-
sponding sections of POINTS start.
8. In order to scan a box, just scan the corresponding section of POINTS. It
ends where the section of the next box begins. Each candidate is tested
whether it is a neighbor in the desired norm in k dimensions.
The resulting FORTRAN program is given as algorithm 1. It consists of two
subroutines. The rst one, BOXIT, scatters the points into a square array BOX
and the appropriate sections of array POINTS. The second one, FIND, nds Ui ()
for a given index i. The indices of these points are stored in array FOUND.
The method works most eciently if the expectation value of the number of
points per box is O(1). For small box size  this can be achieved by providing
roughly as many boxes as there are points. Thus we need 2n storage locations
in addition to the data. For 216 points or less, the arrays BOX and POINTS can
be 2{byte integers.

4 Linked lists and sorting


The above algorithm can be slightly accelerated by the use of linked lists [Knuth,
1973] instead of linear arrays. We can save the time needed for forming and ac-
cumulating the histogram, while the storage needed is exactly the same. With

4
linked lists points can be added to the data base at any time, a useful fea-
ture for real{time applications. Thus it is also possible to compress lling of
boxes and retrieval of neighbors into one sweep through the data, as was done
in [Grassberger, 1990].
A one{directional linked list is a data structure in which each element con-
tains a data item (or a pointer to it) and a pointer to the next element in the
list. A pointer to a null location indicates the end of the list. To access the list
we have to know where its rst element is stored. A new element is inserted
into the list by redirecting the pointer of the preceding element and letting the
new element point to the following one (see Fig. 3). Note that we do not want
to move the data items (the phase space vectors xi ) around. We can save the
storage for the pointers to these items by exploiting the one{to{one correspon-
dence between list elements and data items: list elements and data items have
the same index.
To illustrate the usefulness of this data structure let us for a moment think
about the problem of ranking (or sorting) n numbers xi 2 [0; 1]. To begin with,
one could implement the method of straight insertion. First the list has only
one element, x1. Now the other numbers are added to the list consecutively.
Each new number is either inserted between a smaller and a larger element,
added at the end if it is larger than the last element, or added at the beginning
if it is smaller than the rst element. The number of comparisons required by
this procedure is / n2.
A box{assisted technique can be used to accelerate the algorithm to an
operation count / n. The idea is to guess from the size of xi where it has to
be inserted. We divide the interval into n boxes and store for each box where
the section of the list corresponding to this value starts. Inserting xi we have
to determine into which box it falls and scan only through the corresponding
section of the list. For non{singularly distributed points the sections have a
mean length of n=m. Sorting n points with the help of m / n boxes yields
an algorithm requiring / n operations. For non{uniformly distributed points,
the operation count is / n2 =md2 , where d2 is the correlation dimension of the
distribution.
Algorithm 2 shows a FORTRAN subroutine for ranking NMAX real numbers
stored in the array X, and storing their ranks in the array LIST. It does this by
means of an auxiliary array BOX of size NBOX+1. After calling the subroutine,
LIST(I) contains the rank of X(I), which is de ned as f1 + the number of X(J)
less than X(I)g. For equal values the time order is taken. It is easy to modify
this algorithm so that it gives ordered linked lists as output.
Ranking the 50; 000 points in the data set (iii) with NBOX = NMAX/2 takes 0.48
sec. of CPU time on a SPARCstation 1+ (average value of several runs). The
fastest known comparison{based algorithm, quicksort (as implemented in [Press
et al., 1988]), takes 1.07 sec, slower by a factor of more than 2. For more evenly
distributed data, this factor can increase up to 4.
In the neighbor search problem in k dimensions we use the same idea, only

5
now the lists starting at each box do not have to be sorted. The lists correspond-
ing to each box all end with a null pointer to mark where to stop scanning.
Ecient neighbor searching with a xed grid of boxes requires the box size
to be equal to the radius of the neighborhoods we want to nd. However, if
one wants to nd a given number K of nearest neighbors (neighborhoods of
xed mass), the procedure has to be modi ed. It is quite inecient to choose
a small box size  and then scan more and more boxes until at least K neigh-
bors are found. We rather suggest a slight modi cation of a scheme applied
in [Schreiber & Grassberger, 1991] in the context of noise reduction. It consists
of the following steps:
1. Start with a very ne partition of box length .
2. For all points nd the neighbors within a distance less than .
3. Some of the points will have less than K neighbors with this resolution .
These points are marked for the next sweep.
4. Coarsen the partition by a linear factor > 1,  = . With = 21=d ,
0

where d is the dimension of the set, one roughly doubles the mass of the
neighborhoods. A new data base is created containing all points in boxes
of size  .
0

5. Now only for the marked points nd all the neighbors within a distance
less than  .0

6. Coarsen the partition until enough neighbors have been found for all
points.
Note that we rst satisfy the requirement of K neighbors for points in dense
regions. Points in sparse regions are served last. If one needs to nd neighbor-
hoods for central points according to a given order, the method fails. In this
case the algorithm of the last section can be modi ed by mapping the linear
array POINTS to the square array BOX via a space lling Peano curve [Simmons,
1963]. Then boxes of di erent sizes 2l ; l = 0; 1; : : : map to contiguous sections
of the array. Although this can be implemented eciently, we prefer the use of
trees for this problem.

5 Results and comparison


In this section we will give some typical results for the CPU time required by the
methods described above. How long it takes to nd nearest neighbors depends
on the number of points n, the dimension of the set, the embedding dimension
and either the radius  of the neighborhoods or the minimal number of neighbors
required. Rather than scanning through a four{dimensional parameter space we

6
will concentrate on three data sets, (i) a data set consisting of 1,000 iterations
of the Ikeda map, (ii) 10,000 iterates of the same map, and (iii) 50,000 sampless
taken at one site of a coupled map lattice which is in the state of space{time
chaos, representing a high dimensional example.
We embed the rst two samples in 4 dimensions and the third in 3, 5 and
10 dimensions. All CPU times were obtained on a SPARCstation 1+ but even
their relative magnitudes will be slightly machine dependent.
We implemented a k{d tree as described in [Bingham & Kot, 1989], also using
FORTRAN for comparison. If a language supporting pointers and recursive
function calls (e.g. C) is used, it is about as easy to implement as the algorithm
we present. (It will usually not be faster than the FORTRAN implementation
though). For possible improvements of this algorithm see Sec. 6.
Even for the short set (i) the box method pays compared to the naive ap-
proach below  = 1=3 where both take about 7s. Only for very small  < 1=8
the tree is faster than the naive method, while it is consistently a factor of two
slower than the box method. The time to nd 50 neighbors is 11s for the naive,
6s for the tree and 4s for the box methods.
The results for the longer ikeda sample (n=10,000) are given in Table 1.
Three di erent values of the xed diameter  were chosen: a quite small value
(1/32 of the total attractor size), a typical value below which scaling could be
expected (1/16), and a larger value (1/8). Also with this data set the box
method was somewhat faster than the tree.
For the high{dimensional data in set (iii), the naive approach would take
several hours of CPU time due to the large number of points. On this set
the box method is fastest for xed size neighborhoods below 1/16 of the linear
extension of the phase space, where it takes about 60s to nd all neighbors. For
larger  the tree wins slightly.
For xed mass neighborhoods the situation depends on the embedding di-
mension. While for the three dimensional embedding the two algorithms were
about equally fast (e.g., nding 50 neighbors takes 760s), for the high embedding
dimensions the tree is clearly faster (by a factor of 1.5{2), even if a grid of boxes
in up to 5 dimensions is applied. In our method each box requires a storage
location even if it is empty. Thus the dimension of the grid can only be increased
under coarsening of the resolution, unless one has a lot of memory to spend.
Circumventing this problem is addressed in the following section.

6 Possible improvements
The programs we applied in the above sections are very basic implementations
of more general concepts. In this section we give some pointers to the literature
which develops this further.
The problem of nding nearest neighbors is related to a whole range of
problems of computational geometry [Gonnet & Baeza{Yates, 1991, Mehlhorn,

7
1984, Bentley, 1990, Asano et al., 1985], problems in which a data base has
to be built from a set of points in some geometric space. After that, certain
queries have to be supported by this data base. One wants to devise algorithms
optimal in terms of the size of storage and number of operations needed for a
given task. The asymptotic scaling of the operation count with the number of
points n is called the complexity (more speci c, the computational complexity)
of an algorithm. In the class of problems under consideration the total operation
count comprises the e ort to built the data base from the unstructured data
and the e ort to perform a query.
For the naive method of nding neighbors the e ort to build the data base
is actually zero since queries are answered using the raw data. The latter takes
O(n) operations for each query. With the simple k{d tree we used, the data base
(i.e. the tree) can be built in O(n logn) operations on average and a query costs
O(logn) operations. To ll the boxes in our box{algorithm always takes O(n)
operations but the query time depends on how many boxes we provided. We
can trade storage for time and obtain optimal behavior for O(n) boxes, where
we expect to nd O(1) points to scan through.
Which algorithm is optimal depends on how many queries will be performed.
For in sample applications like noise reduction, dimensions etc. the number of
queries is equal to the number of data points. This can be di erent in forecast-
ing. If only a few forecasts are asked for, there is no point in applying a fast
algorithm at all. If many forecasts will be asked from a limited data base it will
pay to optimize accessability of the data base even if the setup of the data base
might be slower.
6.1 Re nements for tree methods
For tree methods several improvements have been proposed [Bentley, 1990,
Sproull, 1991]. For all but the simplest implementation of trees it is highly rec-
ommended to use a programming language supporting structured data types,
pointers and dynamical storage allocation.2 The simplest k{d tree recursively
splits the available space at each node into two halves, the splitting line at level
J being parallel to the (J mod k){th coordinate axis.
Balancing: To get optimalquery times, the tree should be balanced, i.e. branches
of the same level should contain about equally many points. However, in
more than one dimensions it is hard to construct a set which would lead
to a grossly unbalanced tree. This is di erent from the situation in one
dimension, where a simple sorting algorithm based on a binary tree could
perform very badly on presorted data.
2 On some systems the last feature is formally supported but very slow. One can avoid its
use as long as the number of nodes etc. is predictable from the number of data points, which
is the case for all algorithms mentioned here.

8
Bucketing: A more relevant improvement is obtained by limiting the spatial
resolution of the tree. Thus a leaf of the tree no more corresponds to a
single data point but to a whole bucket of points close to the last node.
This in itself reduces the query time to O(logm) where m is the number
of buckets. A further improvement is gained by providing a pointer to the
parent node for each node within the tree structure. The thereby possible
bottom up queries take only O(1) operations.
Principal cuts: While cuts parallel to the coordinate axes are natural if neigh-
borhoods are de ned in the sup norm, queries in the euclidean norm can
be accelerated by cuts along principal axes of the distribution of points. Of
course, this requires more work in constructing the data base. See [Sproull,
1991] for details.
6.2 Re nements for box methods
Clever high{dimensional meshes: As we have seen working on data set D,
the algorithm given in this paper works best on not too high{dimensi-
onal problems. The reason is that in order to cover a high{dimensional
space with O(n) boxes of size  we have to wrap around the set several
times. Eventually only a small fraction of the points we nd in a box will
be real neighbors, most will be wrapped images of far away points. The
box{assisted method proposed by Theiler [Theiler, 1987] circumvents this
problem. An arbitrarily ne grid of boxes can be provided by storing only
a table of the non{empty boxes. Thus to scan a neighboring box one has
to look up the location of this box in the table. If the table has been
organized by a lexicographical sort, queries can be done with O(log n)
operations. This and the higher (O(n logn)) cost for constructing the
data base pays mainly in higher dimensions.
Adaptive buckets: The sorting algorithm we gave performs best on fairly
uniform sets. For singularly distributed sets the use of multiple keys or
adaptive buckets will be necessary (see [Noga & Allison, 1985]).
6.3 Alternatives for special problems
The algorithms described so far can be used as general purpose routines. How-
ever, one often has some additional knowledge about the data or one has to
deal with a simpler task. For some cases the algorithms given above can be
accelerated, one might even think of implementing a special purpose algorithm
for the case at hand.
For some problems one does not really need the set of neighbors itself but
only the number of neighbors within a given maximal distance. With the algo-
rithms described so far this can accelerate queries if we can access the number

9
of points in each box or, using trees, store at each node the size of the subtree
starting at this node.
One frequently deals with data of low signi cant resolution, either because
the data is given in coarsely discretized form or only part of the given informa-
tion is relevant due to noise. In this case it can be possible to form a histogram
at the resolution of the data and use this histogram to count neighbors. If the
number of distinguishable states is too high to provide bins for all of them, one
can again store only the nonempty bins.
An ecient algorithm for nite resolution data can be obtained as a special
case of the box{algorithm by Theiler [Theiler, 1987] mentioned above. If one
box is provided for each distinguishable state, both the setup of the data base
and the retrieval of neighbors can be done very fast.
On a wide range of problems even the simple tree{ and box{assisted meth-
ods are both found to be quite ecient, the di erences being rather marginal.
In some cases ( nding xed mass neighborhoods in sets of high dimension) tree
methods are faster, in other cases (sets of lower dimensions, xed radius neigh-
borhoods) box methods win. The di erences are in no case strong enough to
favour one of the methods in general.
However, we feel that the barrier of implementation of any fast neighbor
search method is quite low with the simple algorithm given here, which we
found much easier to implement (and to modify) than even the simple multidi-
mensional tree [Bingham & Kot, 1989] we used for comparison.

Acknowledgements
I am grateful to Holger Kantz and Peter Grassberger who closely accompanied
the work presented in this paper. Neil Gershenfeld drew my attention to alter-
native approaches for nite resolution data. The author receives a grant within
the framework of the SCIENCE programme of the Commission of the European
Communities under contract no B/SC1*-900557.

References
[Asano et al., 1985] T. Asano, M. Edahiro, H. Imai, M. Iri and K. Murota,
Practical Use of Bucketing Techniques in Computational Geometry, Com-
putational Geometry, G. T. Toussaint, ed., Elsevier (1985).
[Bentley, 1980] J. L. Bentley, Multidimensional Divide{and{Conquer, Commu-
nications of the ACM, 23, 214 (1980).
[Bentley, 1990] J. L. Bentley, K-d Trees for Semidynamic Point Sets, Sixth An-
nual ACM Symposium on Computational Geometry, 91, San Francisco,
(1990).

10
[Bingham & Kot, 1989] S. Bingham and M. Kot, Multidimensional trees, range
searching, and a correlation dimension algorithm of reduced complexity
Phys. Lett. A 140, 327 (1989).
[Devroye, 1986] L. Devroye, Lecture Notes on Bucket Algorithms, Progress in
Computer Science no. 6, Birkhauser, Boston, (1986).
[Farmer & Sidorowich, 1988] J.D. Farmer and J. Sidorowich, Exploiting Chaos
to Predict the Future and Reduce Noise, in \Evolution, Learning and Cog-
nition", Y.C. Lee, ed., World Scienti c, (1988).
[Fincham & Heyes, 1985] D. Fincham and D. M. Heyes, in \Dynamical Process
in Condensed Matter { Advances in Chemical Physics," vol. LXIII, M. W.
Evans, ed., Wiley, New York, (1985).
[Form et al., 1992] W. Form, N. Ito, and G. A. Kohring, Vectorized and paral-
lelized algorithms for multi{million particle MD{simulation, to appear in
Int. Jour. Mod. Phys. C, (1992).
[Gonnet & Baeza{Yates, 1991] G. H. Gonnet and R. Baeza{Yates, Handbook of
Algorithms and Data Structures, in Pascal and C, Addison{Wesley, Wok-
ingham (1991).
[Grassberger, 1990] P. Grassberger, An optimized box{assisted algorithm for
fractal dimensions, Phys. Lett. A 148, 63 (1990).
[Grassberger et al., 1991] P. Grassberger, T. Schreiber and C. Scha rath, Non{
linear time sequence analysis, Int. J.Bifurcation and Chaos 1, 521 (1991).
[Knuth, 1973] D. E. Knuth, The art of computer programming, Vol 1. Fun-
damental algorithms and Vol 3., Sorting and Searching, Addison{Wesley,
Reading, (1973).
[Kostelich & Yorke, 1988] E. J. Kostelich and J. A. Yorke, Noise reduction in
dynamical systems, Phys. Rev. A 38, 1649 (1988).
[Mehlhorn, 1984] K. Mehlhorn, Data Structures and Algorithms 3: Multi{
dimensional Searching and Computational Geometry, Springer (1984).
[Noga & Allison, 1985] M. T. Noga and D. C. S. Allison, Sorting in Linear
Expected Time, Bit 25 (1985) 451{465.
[Omohumdro, 1987] S.M. Omohumdro, Ecient Algorithms with Neural Net-
work Behavior, Complex Syst. 1, 273 (1987).
[Preparata & Shamos, 1985] F. P. Preparata and M. I. Shamos, Computational
Geometry, an Introduction, Springer, New York (1985).

11
[Press et al., 1988] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vet-
terling, Numerical Recipes, Cambridge Univ. Press, (1988).
[Sedgewick, 1990] R. Sedgewick, Algorithms in C, Addison{Wesley, Reading,
(1990).
[Schreiber & Grassberger, 1991] T. Schreiber and P. Grassberger, A simple
noise{reduction method for real data, Phys. Lett. A 160, 411 (1991).
[Simmons, 1963] G. F. Simmons, Introduction to Topology and Modern Analy-
sis, McGraw{Hill, Tokyo, (1963).
[Sproull, 1991] R. F. Sproull, Re nements to nearest{neighbor searching in k-d
trees, Algorithmica 6 (1991) 579.
[Theiler, 1987] J. Theiler, Ecient algorithm for estimating the correlation di-
mension from a set of discrete points, Phys. Rev. A 36, 4456 (1987).

12
Figure captions
1. How the plane is divided into branches of a tree.
2. How points within each box are stored in sections of an array.
3. How a new element is inserted into an existing list.

Table caption
1. CPU times (seconds) for data set (ii), n=10,000

13
Fig. 1

x
x
x
x x
x

14
Fig. 2

array POINTS
YH
H
H I
@ 6
BM 

 ?

HH@ B 
? etc.
H@ B 
?

array BOX

15
Fig. 3

- - AA - - - - - null



?

16
task box tree naive
xed radius 489
 = 1=32 5.6 9.8
 = 1=16 19.9 21.0
 = 1=8 72.2 75.9
xed mass m = 50 45 64 675

17
Algorithm 1
SUBROUTINE BOXIT(NMAX,X,K,EPS,BOX,POINTS)

This routine sets up the box data base for neighbor search in K+1 dimensional delay co-
ordinates. Give a REAL array X containing NMAX points and the box size EPS. The data
base is stored in arrays BOX and POINTS. The bitwise .AND.IM is used to to take an integer
modulo IM+1

PARAMETER(IM=127) ! IM must be 2**l-1 for integer l


REAL X(NMAX)
INTEGER BOX(0:IM,0:IM), POINTS(NMAX)

EPSINV=1./EPS

DO 1000 I1=0,IM
DO 1000 I1=0,IM
1000 BOX(I1,I2)=0

DO 2000 N=1,NMAX-K ! make histogram


I1=INT(X(N)*EPSINV).AND.IM
I2=INT(X(N+K)*EPSINV).AND.IM
2000 BOX(I1,I2)=BOX(I1,I2)+1

DO 3000 I1=0,IM ! accumulate it


DO 3100 I2=1,IM
3100 BOX(I1,I2)=BOX(I1,I2)+BOX(I1,I2-1)
3000 IF(I1.LT.IM) BOX(I1+1,0)=BOX(I1+1,0)+BOX(I1,IM)

DO 4000 N=1,NMAX-K ! fill boxes


I1=INT(X(N)*EPSINV).AND.IM
I2=INT(X(N+K)*EPSINV).AND.IM
POINTS(BOX(I1,I2))=N
4000 BOX(I1,I2)=BOX(I1,I2)-1
END

18
SUBROUTINE FIND(NMAX,X,N,K,EPS,BOX,POINTS,FOUND,NFOUND)

Given the data base set up by BOXIT, this routine nds all neighbors closer than EPS for
the point with index N. The indices of the NFOUND neighbors found are stored in INTEGER
array FOUND

PARAMETER(IM=127) ! same value as in BOXIT


REAL X(NMAX)
INTEGER BOX(0:IM,0:IM), POINTS(NMAX), FOUND(NMAX)

EPSINV=1./EPS
NFOUND=0
I1=INT(X(N)*EPSINV).AND.IM
I2=INT(X(N+K)*EPSINV).AND.IM

DO 1000 J2=I2-1,I2+1 ! search neighboring boxes


DO 1000 J1=I1-1,I1+1
L1=J1.AND.IM
L2=J2.AND.IM
IF(L2.LT.IM) THEN ! determine end of section
IEND=BOX(L1,L2+1)
ELSE IF(L1.LT.IM) THEN
IEND=BOX(L1+1,0)
ELSE
IEND=NMAX-K
ENDIF

DO 1100 ISCAN=BOX(L1,L2)+1,IEND ! scan box


IPOINT=POINTS(ISCAN)
DO 1110 M=0,K ! sup norm in k+1 D
1110 IF(ABS(X(N+M)-X(IPOINT+M)).GE.EPS) GOTO 1100
NFOUND=NFOUND+1 ! it's a neighbor
FOUND(NFOUND)=IPOINT
1100 CONTINUE
1000 CONTINUE
END

19
Algorithm 2
SUBROUTINE RANK(XMAX,NMAX,X,LIST)

Rank NMAX points given in X in the inteval [0,XMAX]. Ranks are stored in array LIST. It
is assumed that in a composed logical expression containing .OR. or .AND. the rst part
is evaluated rst. Depending on the result, the second part should not be evaluated. For
other compilers the IF statements containing .OR. or .AND. have to be split into two
statements.

PARAMETER(NBOX=100000)
REAL X(NMAX)
INTEGER LIST(NMAX),BOX(0:NBOX)

NL=MIN(NBOX,NMAX/2)
SC=(NL-1)/XMAX
DO 1000 I=0,NL
1000 BOX(I)=0

DO 2000 N=1,NMAX
XN=X(N)
I=INT(XN*SC)
IP=BOX(I)
IF ((IP.EQ.0).OR.(XN.LE.X(IP))) THEN
BOX(I)=N
ELSE
2 IPP=IP
IP=LIST(IP)
IF ((IP.GT.0).AND.(XN.GT.X(IP))) GOTO 2
LIST(IPP)=N
ENDIF
2000 LIST(N)=IP

N=0
DO 3000 I=0,NL
IP=BOX(I)
3 IF (IP.GT.0) THEN
N=N+1
IPP=IP
IP=LIST(IP)
LIST(IPP)=N
GOTO 3
ENDIF
3000 CONTINUE
END

20

You might also like