Open AccessArticle

Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree

Sichao Tang

^1,2,

Yuchen Zhao

^1,*,

Hengyi Lv

¹,

Ming Sun

¹,

Yang Feng

¹ and

Zeshu Zhang

Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China

University of Chinese Academy of Sciences, Beijing 100049, China

Author to whom correspondence should be addressed.

Sensors 2024, 24(23), 7430; https://doi.org/10.3390/s24237430

Submission received: 5 October 2024 / Revised: 11 November 2024 / Accepted: 19 November 2024 / Published: 21 November 2024

(This article belongs to the Section Optical Sensors)

Download

Browse Figures

Figure 1
Schematic diagram of the human retina model and corresponding event camera pixel circuit. "> Figure 2
(a) We consider the light intensity change signals received by the corresponding pixels as computational elements in the time domain. (b) From the statistical results, it can be seen that the ON polarity ratio varies randomly over the time index. "> Figure 3
This graph represents the time span changes of each event cuboid processed by our algorithm. "> Figure 4
This figure illustrates the time surface of events in the original event stream. For clarity, only the x–t components are shown. Red crosses represent non-main events, and blue dots represent main events. (a) In the time surface described in [<a href="#B50-sensors-24-07430" class="html-bibr">50</a>] (corresponding to Formula (24)), only the occurrence frequency of the nearest events around the main event is considered. Consequently, non-main events with disruptive effects may have significant weight. (b) The local memory time surface corresponding to Formula (26) considers the influence weight of historical events within the current spatiotemporal window. This approach reduces the ratio of non-main events involved in the time surface calculation, better capturing the true dynamics of the event stream. (c) By spatially averaging the time surfaces of all events in adjacent cells, the time surface corresponding to Formula (29) can be further regularized. Due to the spatiotemporal regularization, the influence of non-main events is almost completely suppressed. "> Figure 5
Schematic of the Gromov–Wasserstein Event Discrepancy between the original event stream and the event representation results. "> Figure 6
Illustration of the grid positions corresponding to non-zero entropy values. "> Figure 7
Grayscale images and 3D event stream diagrams for three captured scenarios: (a) Grayscale illustration of the corresponding scenarios; (b) 3D event stream illustration of the corresponding scenarios. "> Figure 7 Cont.
Grayscale images and 3D event stream diagrams for three captured scenarios: (a) Grayscale illustration of the corresponding scenarios; (b) 3D event stream illustration of the corresponding scenarios. "> Figure 8
The variation of the value of <math display="inline"><semantics> <mrow> <msub> <mrow> <mi>GWED</mi> </mrow> <mi mathvariant="normal">N</mi> </msub> </mrow> </semantics></math> corresponding to each algorithm with different numbers of event samples. "> Figure 9
Illustration of the event stream processing results for Scene A by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours. "> Figure 10
APED data obtained from the event stream processing results for Scene A by different algorithms. "> Figure 11
Illustration of the event stream processing results for Scene B by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours. "> Figure 12
APED data obtained from the event stream processing results for Scene B by different algorithms. "> Figure 13
Illustration of the event stream processing results for Scene C by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours. "> Figure 14
APED data obtained from the event stream processing results for Scene C by different algorithms. ">

Versions Notes

Abstract

Event cameras, as bio-inspired visual sensors, offer significant advantages in their high dynamic range and high temporal resolution for visual tasks. These capabilities enable efficient and reliable motion estimation even in the most complex scenes. However, these advantages come with certain trade-offs. For instance, current event-based vision sensors have low spatial resolution, and the process of event representation can result in varying degrees of data redundancy and incompleteness. Additionally, due to the inherent characteristics of event stream data, they cannot be utilized directly; pre-processing steps such as slicing and frame compression are required. Currently, various pre-processing algorithms exist for slicing and compressing event streams. However, these methods fall short when dealing with multiple subjects moving at different and varying speeds within the event stream, potentially exacerbating the inherent deficiencies of the event information flow. To address this longstanding issue, we propose a novel and efficient Asynchronous Spike Dynamic Metric and Slicing algorithm (ASDMS). ASDMS adaptively segments the event stream into fragments of varying lengths based on the spatiotemporal structure and polarity attributes of the events. Moreover, we introduce a new Adaptive Spatiotemporal Subject Surface Compensation algorithm (ASSSC). ASSSC compensates for missing motion information in the event stream and removes redundant information, thereby achieving better performance and effectiveness in event stream segmentation compared to existing event representation algorithms. Additionally, after compressing the processed results into frame images, the imaging quality is significantly improved. Finally, we propose a new evaluation metric, the Actual Performance Efficiency Discrepancy (APED), which combines actual distortion rate and event information entropy to quantify and compare the effectiveness of our method against other existing event representation methods. The final experimental results demonstrate that our event representation method outperforms existing approaches and addresses the shortcomings of current methods in handling event streams with multiple entities moving at varying speeds simultaneously.

Keywords:

event cameras; slicing methods; event representations

1. Introduction

Event cameras (such as DVS [1], DAVIS [2], and ATIS [3]) are bio-inspired visual sensors, with their internal structure illustrated in Figure 1. Unlike traditional cameras, event cameras do not capture images at a fixed frame rate. Instead, each pixel independently and asynchronously responds to changes in intensity within the environment. Due to their asynchronous nature, similar to the biological retina, they can accurately and efficiently capture motion information in natural scenes [4,5], particularly movements caused by dynamic objects [6,7]. This asynchronous characteristic makes event cameras well-suited for a wide range of applications, including target tracking, robotics, motion estimation, autonomous vehicles, and virtual reality. As depicted in Figure 1, the structure of an event camera enables each pixel to generate an “on” event, which includes the pixel’s coordinates and the current timestamp, when there is an increase in luminance at the corresponding location. Conversely, if the luminance decreases, the event camera records an “off” event with the pixel’s coordinates and the current timestamp. This binary event-based output requires specialized processing techniques to reconstruct meaningful visual information from the asynchronous event data.

Specifically, a spike will be fired from a pixel

u_{n} = [x_{n}, y_{n}]

at the time

t_{n}

when the intensity change reaches a firing threshold

C_{th}

, and the definition is as follows.

Δ lnL ≐ lnL (u_{n}, t_{n}) - lnL (u_{n}, t_{n} - {Δ t}_{n}) = p_{n} C_{th} .

(1)

The spike is defined as

e_{n} = (x_{n}, y_{n}, t_{n}, p_{n})

{Δ t}_{n}

is the time since the last spike at the same pixel, and the polarity

p_{n} \in {1, - 1}

respectively represents ON or OFF spikes. In event cameras, a sequence of ordered spike firing timestamps for each pixel can be defined as a spike train

A = {t_{n} \in Γ : n = 1, \dots, N}

, and its mathematical expression can be presented as follows:

A (t) = \sum_{n = 1}^{N} p_{n} δ (t - t_{n}),

(2)

where

N

is the number of spikes in a single pixel during the time interval

Γ

, and

δ (\cdot)

refers to the Dirac delta function, with

δ (t) = 0 \forall x \neq 0

and

\int δ (t) dt = 1

. Similarly, the asynchronous spike stream

S = {x_{n}, y_{n}, t_{n} \in Γ_{s} : n = 1, \dots, N}

, which represents the event information, generated by pixels in the spatiotemporal interval

Γ_{s}

can be divided into event cuboids [8,9]

s \in S

. Its corresponding mathematical formula is as follows.

S (x, y, t) = \sum_{n = 1}^{N} p_{n} δ (x - x_{n}, y - y_{n}, t - t_{n}) .

(3)

The output data of an event camera consist of a series of events that encode the time, location, and polarity information of brightness changes at specific pixel points. While this provides an advantage for event cameras, it also means that each event individually carries very little information about the scene. Therefore, before interpreting or utilizing the information from the event stream, we need to aggregate the data for further processing. In the field of event-based vision, existing representation methods can be broadly classified into two categories: sparse representation and dense representation. Sparse representation methods [10,11,12,13] preserve the sparsity of events, but current hardware and subsequent data processing algorithms are not mature or targeted enough. Consequently, event data preprocessed using these methods cannot be extended to more complex tasks. Specifically, the sparse event data required by asynchronous pulse neural networks [10,11,13] are limited by the lack of specialized hardware and efficient backpropagation algorithms. Moreover, point cloud encoders [14,15,16] are essential for handling the spatiotemporal characteristics of event data, but they incur high computational costs and result in significant noise. Graph neural networks [17,18,19,20,21,22] offer considerable scalability and can achieve high performance in most vision tasks, but their accuracy in event-based vision is still inferior to dense methods.

Conversely, dense representation methods [23,24,25,26] provide better performance, as they can integrate with existing advanced machine learning algorithms and neural network architectures. Early dense representation methods mostly converted events into histograms [23], time surfaces [24], or a combination of both, followed by standard image processing algorithm models for subsequent processing. However, these methods typically used only a few channels, capturing only low-dimensional representations of the events. Further research attempted to capture more event information by computing higher-order moments [27] or stacking multiple time windows of events [25]. Nonetheless, these methods inherently rely on fixed time windows for event stacking, leading to issues when event rates become too high or too low. To address these shortcomings, event count-based stacking methods [26] were developed. Concurrently, a biologically inspired method introduced the use of Time-Ordered Recent Event (TORE) [28], which aggregates events into queues. However, their efficiency was relatively poor, similar to existing voxel grid [25] representation methods. Recently, Nam et al. proposed a more efficient representation method [29], which divides events into multiple overlapping windows, halving the number of events at each stage and enhancing robustness in processing data from different motion scenes.

However, due to the inherent sparsity and asynchronicity of event cameras, classical computer vision algorithms, although effective, are not fully applicable. To leverage the spatiotemporal structural characteristics and polarity attributes of the event stream, we propose a Dynamic Asynchronous Data Metric and Slicing algorithm (ASDMS). This algorithm first divides the event stream into main and non-main event streams. The “main” event stream consists of events with higher significance based on certain criteria, while the “non-main” stream contains the remaining events. ASDMS then segments these streams into event cuboids, which are spatiotemporal volumes of a specified size, based on a default time span. The polarity of events within each event cuboid is used as the basis for calculating overall density, and this new event density is introduced into the time window as a reference to dynamically adjust the length of the event cuboids. After adjusting the lengths of the event cuboids in both event streams, the cuboids are correspondingly matched and merged into a new, modified event stream. The ASDMS algorithm, upon adjusting the cuboid spans and outputting the new event stream, may cause loss and the redundancy of events within each cuboid. To address this, we define a new Adaptive Spatiotemporal Subject Surface Compensation algorithm (ASSSC). ASSSC uses the main event stream as a base to minimize the impact of redundant, low-correlation non-main events surrounding the main spatial area in the new event stream. It then selects highly correlated events to compensate for the missing main events within the cuboids. This approach not only effectively addresses the issue of poor slicing performance when dealing with multiple objects moving at significantly different speeds, as mentioned at the end of article [30], but also retains more information in the preprocessed data.

Lastly, we need to quantitatively evaluate our method against existing methods to provide an intuitive comparison. Few papers have studied the performance of event stream data tasks. Although [31,32] presented small-scale studies on selecting event representations, only Gehrig et al. [33] conducted a large-scale investigation of event representations by studying various inputs across multiple tasks. They explored the advantages of splitting polarity and incorporating timestamps into event representations. However, single-task studies still have high computational performance requirements, limiting the number of representations that can be compared. Consequently, their study did not cover many event representations, particularly those considering different window sizes as in later works [26,29] or more advanced aggregation and measurement as in [27]. Moreover, existing comparison methods do not adequately quantify the degree of distortion in the final results due to the deletion of crucial distinguishing features during event representation, nor the loss of the main information-carrying parts. To address these shortcomings, we propose a method for quickly comparing event representations, bypassing the need for pre-training neural networks. This method combines prior knowledge of Gromov–Wasserstein Discrepancy [34] and information entropy [35], directly calculating the Actual Performance Efficiency Discrepancy (APED) between the raw events and their representations. A lower APED indicates a lower actual distortion rate and higher retention of main information-carrying events. Our contributions include the following three main points:

We propose a Dynamic Asynchronous Data Metric and Slicing algorithm (ASDMS) that dynamically adjusts the slicing span of events based on the spatiotemporal structure and polarity information of the event stream;
We introduce an Adaptive Spatiotemporal Subject Surface Compensation algorithm (ASSSC) that repairs the main information-carrying parts of the new event stream after slicing based on the correlation between main and overall events, removing redundant events in the spatiotemporal correlation area;
We propose a new evaluation metric, Actual Performance Efficiency Discrepancy (APED), which quantifies the effectiveness of each representation method in handling the primary information-carrying events in the event stream.

2. Materials and Methods

2.1. Asynchronous Spike Dynamic Metric and Slicing Algorithm

Before performing the slicing operation, we use the previously proposed MPAR method [36] to divide the original event stream into main event stream

S_{i}

and non-main event stream

S_{j}

based on overall correlation. This allows the subsequent slicing process to obtain dynamic parameters that better represent the interrelationship between the data. In the following algorithm process, we will measure and segment the event stream from the perspectives of spatiotemporal structure and polarity attributes.

Nowadays, some researchers have introduced computational structures to provide a theoretical foundation for measuring asynchronous event stream data [37,38,39]. Based on these results, we have equipped the previously mentioned asynchronous spike data stream

S

, which represents event information, with a new metric

d

and proposed a new complete and separable metric space

(S, d)

. When

S

is a compact subset equipped with Euclidean distance,

S

is also referred to as a locally finite configuration. Therefore,

S (x, y, t) = \sum_{n = 1}^{N} p_{n} δ (x - x_{n}, y - y_{n}, t - t_{n}), 0 < N < \infty

, and this denotes a measurable mapping from some probability space

[Φ, Ω, θ]

to a measurable space. If

N < \infty

, then

S

represents marked spatiotemporal points (MSPs) in a finite space. Probability space

[Φ, Ω, θ]

is the mathematical model of a random experiment, where sample space

Φ

is the set of all possible outcomes.

Ω

is the

σ

-algebra of subsets of the sample space. When

0 ≪ Ω ≪ 1

θ

represents the probability measure.

Quantifying asynchronous spikes is a challenging process due to the lack of standard algebraic operations. Kernel methods [37,40,41] offer a general framework for measuring spike sequences. These methods can extend linear modeling in the input space to nonlinear modeling, specifically mapping abstract objects to Hilbert space. This characteristic provides a new approach to addressing key issues in the quantification process. Therefore, we choose to use Gaussian kernel-based methods to measure the distance between asynchronous spikes in MSPs. We let

s_{i}

and

s_{j}

be two event cuboids in the spatiotemporal interval

Γ_{s}

, respectively, and the inner product is introduced to measure the distance between asynchronous spikes in a Hilbert space in Formula (4).

∥ s_{i} - s_{j} ∥ ≜ \sqrt{κ (s_{i}, s_{i}) + κ (s_{j}, s_{j}) - 2 κ (s_{i}, s_{j})},

(4)

where

κ (s_{i}, s_{j})

is the inner product of

s_{i}

and

s_{j}

When analyzing the polarity attributes of event streams, we found that existing spike sequence measurement methods [40,42,43,44] rarely utilize polarity attributes. Therefore, we adopted the approach of describing pulse data using conditional probability density functions as seen in [37,45]. As shown in Figure 2a, we selected three representative pixel coordinates from the event camera as the origin of the spike flow’s vertical axis. The distance between spike trains is calculated using a single neuron spike train metric and increases with the pixel index. The measurement results, as shown in Figure 2b, indicate that the ON and OFF polarity ratios vary randomly. Consequently, we can use the polarity attribute as a prior probability distribution in spike metrics, describing MSPs using a conditional intensity function. This approach integrates polarity attributes as a key calculation point in the algorithm process.

The expression of the conditional strength function is as follows:

λ (x, y, t, p ∣ H_{t}) = \frac{f (x, y, t, p ∣ H_{t})}{1 - F (x, y, t ∣ H_{t})},

(5)

H_{t} = {e_{n} \in Γ_{s} ∣ t_{n} < t},

(6)

where

F (x, y, t ∣ H_{t})

is the cumulative distribution function, and

f (x, y, t, p ∣ H_{t})

is the conditional probability density function in spiking history

H_{t}

Next, by applying Bayes’ theorem to derive Formula (5), we obtain the following new expression for the conditional intensity function.

\begin{array}{r} λ (x, y, t, p ∣ H_{t}) & = \frac{f (x, y, t ∣ H_{t})}{1 - F (x, y, t ∣ H_{t})} \cdot f (p ∣ H_{t}, x, y, t) \end{array} = λ (x, y, t ∣ H_{t}) \cdot f (p ∣ H_{t}, x, y, t),

(7)

where

\sum_{p \in {1, - 1}} \iint_{Γ_{s}} λ^{2} (x, y, t, p ∣ H_{t}) dxdydt < \infty

; therefore,

λ (x, y, t, p ∣ H_{t})

is an element of

L_{2} (Γ_{s})

space.

Next, we choose to use the smoothing function

h (x, y, t)

to capture the spatiotemporal structure and use 3D convolution to convert discrete spikes into continuous intensity functions. It is similar to the conditional strength function in history time

H_{t}

without considering the polarity, and it is computed as follows:

\begin{array}{r} λ (x, y, t ∣ H_{t}) & = s (x, y, t) * h (x, y, t) \end{array} = \sum_{n = 1}^{N} h (x - x_{n}, y - y_{n}, t - t_{n}) .

(8)

Then,

f (p ∣ H_{t}, x, y, t)

can be calculated based on the polarity probability distribution in history time

H_{t}

, and it is modeled as:

f (p ∣ H_{t}, x, y, t) = \frac{Nu {e_{n} \in Γ_{s} ∣ p_{n} = p, x_{n} < x, y_{n} < y, t_{n} < t}}{Nu {e_{n} \in Γ_{s}}},

(9)

where Nu{} represents the counting result of event numbers in the spatiotemporal interval

Γ_{s}

. Additionally, for any two event cuboids

s_{i}

and

s_{j}

, the inner product

κ (s_{i}, s_{j})

can be given by Formula (10).

\begin{matrix} κ (s_{i}, s_{j}) & = {λ_{s_{i}} (x, y, t, p ∣ H_{t}), λ_{s_{j}} (x, y, t, p ∣ H_{t})}_{L_{2} (Γ_{s})} \\ = \sum_{p \in {- 1, 1}} \int \int \int_{Γ_{s}} (λ_{s_{i}} \cdot λ_{s_{j}}) dxdydt . \end{matrix}

(10)

The formulas substituted into (7) and (8) can transform Formula (10) into the following form:

\begin{array}{r} κ (s_{i}, s_{j}) = & \sum_{p \in {- 1, 1}} \sum_{m = 1}^{N_{i}} \sum_{n = 1}^{N_{j}} \int \int \int_{Γ_{s}} f_{s} (p ∣ H_{t}, x, y, t) \cdot H (x, y, t) dxdydt \end{array},

(11)

f_{s} (p ∣ H_{t}, x, y, t) = f_{s_{i}} (p ∣ H_{t}, x, y, t) \cdot f_{s_{j}} (p ∣ H_{t}, x, y, t),

(12)

H (x, y, t) = h (x - x_{m}^{(i)}, y - y_{m}^{(i)}, t - t_{m}^{(i)}) \cdot h (x - x_{n}^{(j)}, y - y_{n}^{(j)}, t - t_{n}^{(j)}),

(13)

where

N_{i}

and

N_{j}

are the number of events in event cuboids

s_{i}

and

s_{j}

For two spikes of events

e_{m}^{(i)}

and

e_{n}^{(j)}

in event cuboids

s_{i}

and

s_{j}

, respectively, the inner product between two spikes can be represented as follows:

\begin{array}{r} κ (e_{m}^{(i)}, e_{n}^{(j)}) = \int \int \int_{Γ_{s}} H (x, y, t) & dxdydt \end{array} .

(14)

When processing the calculation of inner product, we use Gaussian kernel as the smoothing function and define it as follows:

h (x, y, t) = \frac{e^{- \frac{x^{2}}{2 σ_{x}^{2}}}}{l_{x}} \cdot \frac{e^{- \frac{y^{2}}{2 σ_{y}^{2}}}}{l_{y}} \cdot \frac{e^{- \frac{t^{2}}{2 σ_{t}^{2}}}}{l_{t}},

(15)

where

σ_{x}, σ_{y}

, and

σ_{t}

are the standard deviation parameters of the Gaussian kernel, and

l_{x} = {(\sqrt{π} σ_{x})}^{1 / 2}, l_{y} =

{(\sqrt{π} σ_{y})}^{1 / 2}

, and

l_{t} = {(\sqrt{π} σ_{t})}^{1 / 2}

. So, (14) can be re-expressed as follows:

κ (e_{m}^{(i)}, e_{n}^{(j)}) = e^{- \frac{{(x_{m}^{(i)} - x_{n}^{(j)})}^{2}}{4 σ_{x}^{2}} - \frac{{(y_{m}^{(i)} - y_{n}^{(j)})}^{2}}{4 σ_{y}^{2}} - \frac{{(t_{m}^{(i)} - t_{n}^{(j)})}^{2}}{4 σ_{t}^{2}}} .

(16)

In summary, the inner product between two event cuboids can be further rewritten as follows:

κ (s_{i}, s_{j}) = \sum_{m = 1}^{N_{i}} \sum_{n = 1}^{N_{j}} κ (e_{m}^{(i)}, e_{n}^{(j)}) \cdot \sum_{p \in {- 1, 1}} f_{s_{i}} (p ∣ H_{t}, x, y, t) \cdot f_{s_{j}} (p ∣ H_{t}, x, y, t),

(17)

R (s_{i}, s_{j}) = \sum_{p \in {1, - 1}} f_{s_{i}} (p ∣ H_{t}, x, y, t) f_{s_{j}} (p ∣ H_{t}, x, y, t),

(18)

Here,

R (s_{i}, s_{j})

represents the relative polarity static in the spike cube. Generally, the optimal kernel parameter

θ = {σ_{x}, σ_{y}, σ_{t}}

of the three-dimensional Gaussian kernel function involved in Formula (15) can be continuously optimized by minimizing the fitting error [46]. To obtain more accurate calculation results, we refer to the parameter structure of compression ratio [40,47] and the definition of PSNR [48,49], defining PSNR as the performance score

p_{s}

. The performance score might measure the algorithm’s efficiency in maintaining information fidelity, like minimizing motion blur, preserving object details, or reducing data redundancy. It could also be a comparative measure, assessing the balance between computational cost and accuracy of the event data’s representation for specific tasks, like detection, tracking, or segmentation. Based on this, we calculate the correlation coefficient between the spike metric distance

d

and the performance score.

Therefore, the error function

J (θ)

corresponding to the optimal kernel function is defined as follows:

J (θ) = \sum_{i \in R} \sum_{j \in D} {∥ d (S_{i}, S_{j}, θ) - f_{b} (p_{s} (S_{i}, S_{j}), b) ∥}^{2} + γ ∥ b ∥ .

(19)

where

d

is the distance between two spike streams. R and D are, respectively, the corresponding sets.

f_{b}

is a polynomial function of degree

b

, which is able to fit curve between the distance

d

and performance score

p_{s}

γ

is a hyperparameter that weights the relative contribution of the norm penalty term.

Next, as shown in Formulas (20) and (21), we use the gradient method to iterate the kernel parameters in order to minimize the error.

θ^{(n + 1)} = θ^{(n)} - η \cdot \nabla \frac{\partial J}{\partial θ} .

(20)

θ^{*} = \underset{θ}{argmin} J (θ) .

(21)

The learning rate

η

is a hyperparameter that determines how we adjust the loss gradient. We set the initial values for the learning rate

η

and the maximum number of iterations

N_{o}

and use

ε

to represent a positive number approaching infinity. We then fit a polynomial function

f_{b} (p_{s} (S_{i}, S_{j}, θ^{(0)}), b)

between the distance and performance score, using Formula (17) to update the kernel parameter

θ^{(n)}

, where

n

represents the iteration step, initially set to 0. The updating process stops when

J (θ) < ε

n \geq N_{o}

, at which point we have achieved the minimum error. Otherwise, we increment

n

by one and continue iterating. Next, we need to calculate the distance relationship between the event cuboid and its corresponding event stream, with the detailed process described in Algorithm 1, where learning rate

η

is a hyperparameter that determines how much we adjust

θ

with respect to the loss gradient.

Algorithm 1. Asynchronous Spike Dynamic Metric and Slicing.

Input : Two spike event streams S_{i} {and S}_{j}

Output : The distance ∥ S_{i} - S_{j} ∥ between two streams

1 Divide streams S_{i} {and S}_{j} {into K spike cuboids s}_{i}^{k} {and s}_{j}^{k}

2 For k =1, 2 …, K do

3 Calculate the f_{s_{i}}^{k} {and f}_{s_{j}}^{k} based on formula (8) and polarity of (s_{i}^{k}, s_{j}^{k})

4 Compute the relative polarity statics R (s_{i}, s_{j}) based on Formula (16)

5 Obtain inner product κ (s_{i}, s_{j}) of (s_{i}^{k}, s_{j}^{k}) based on Formula (15)

6 Obtain the distance ∥ s_{i}^{k} - s_{j}^{k} ∥ of spike cuboids based on Formula (4)

7 Obtain the value of ∥ S_{i} - S_{j} ∥ by accumulating ∥ s_{i}^{k} - s_{j}^{k} ∥

8 End for

9 Return ∥ S_{i} - S_{j} ∥

The obtained calculation results are substituted into the following adaptive time span calculation formula:

Δ {t_{i}}^{*} = \underset{Δ t_{i}}{\arg \min} | \frac{∥ S_{i} ∥}{∥ S_{i} - S_{j} ∥} - \frac{θ^{(i)} (Δ t_{i - 1} - Δ t_{i})}{| θ^{(i - 1)} - θ^{(i)} |} Nu {e_{n} \in Γ_{s}} |,

(22)

Δ {t_{j}}^{*} = \underset{Δ t_{j}}{\arg \min} | \frac{∥ S_{j} ∥}{∥ S_{i} - S_{j} ∥} - \frac{θ^{(j)} (Δ t_{j - 1} - Δ t_{j})}{| θ^{(j - 1)} - θ^{(j)} |} Nu {e_{n} \in Γ_{s}} |,

(23)

where

Δ t

represents the time span of the event cuboid in microseconds. This allows us to segment a continuous event stream into intervals of varying lengths based on time labels. These intervals can be discontinuous or overlapping. The dynamically sliced event stream is illustrated in Figure 3. Here, the time label interval is the base time divided by the time window, with a label assigned every 5000 microseconds. After re-slicing both the main and non-main event streams, the event cuboids in both streams are aligned and merged based on the time labels. In this way, we minimize the occurrence of redundancy in each event cuboid after being compressed into frames. However, as a result, we also lose some subject-related events to a certain extent. In the next section, we will compensate for the subjects appearing in all cuboids based on the two processed event streams.

2.2. Adaptive Spatiotemporal Subject Surface Compensation Algorithm

To describe the influence of different events on each other over time, we refer to the definitions of event relationships in [50,51], introducing the concept of the time surface to describe the local spatiotemporal surroundings of related events. The time surface can be transformed into a local spatial operator acting on the i-th event in the main event stream, defined as follows:

T_{s_{i}} (z, q) = {\begin{array}{l} e^{- \frac{t_{i} - t (x_{i} + z, y_{i} + z, q)}{τ}} & {if p}_{i} = q \\ 0 & otherwise \end{array},

(24)

(z, q) \in {[- ρ, ρ]}^{2} \times {- 1, 1},

(25)

where

ρ

is the radius of the spatial neighborhood used to calculate the time surface, and

t (x_{i} + z, y_{i} + z, q)

represents the timestamp of the last event with polarity

q

received from pixel location

(x_{i} + z, y_{i} + z)

τ

means a decay factor that gives less weight to events farther in the past. Intuitively, the time surface encodes the asynchronous information in a neighborhood of events, providing temporal and spatial relationships between events.

In order to construct the required feature event representation, we need to first generalize the time plane proposed above. As shown in Figure 4a, the time surface of Formula (24) only uses the time

t (x_{i} + z, y_{i} + z, q)

of the last event received near the pixel on the time surface, resulting in the descriptor being overly sensitive to minor changes in non-main events or the original event stream.

To address this shortcoming, we use historical events within the time window

Δ t

adjusted by the ASDMS algorithm from the previous section to calculate the time surface. More precisely, we define a local memory time surface as follows:

T_{s_{i}} (z, q) = {\begin{array}{l} \sum_{s_{j} \in N_{(z, q)} (s_{i})} e^{- \frac{t_{i} - t_{j}}{τ}} & {if p}_{i} = q \\ 0 & otherwise \end{array},

(26)

N_{(z, q)} (s_{i}) = {s_{j} : x_{j} = x_{i} + z, y_{j} = y_{i} + z, t_{j} \in (t_{i} - Δ t, t_{i}), p_{j} = q} .

(27)

As shown in Figure 4b, Formula (26) more effectively mitigates the impact of non-main events and minor variations in the event stream, while robustly describing the actual dynamics occurring in the captured scene. Next, we refer to the approach in [52] for obtaining invariance in speed and contrast, using the local memory time surface as a basic operator and grouping adjacent pixels into cells of equal size and quantity

{C_{l}}_{l = 1}^{L}

. For each cell, we sum the time surface components of the events within it into a histogram as follows:

h_{C}^{0} (z, p) = \sum_{s_{i} \in C} T_{s_{i}} (z, p),

(28)

where

s_{i} \in C

represents the pixel coordinates of the current event within the cell range. Since the event camera sensor generates more events for high-contrast objects compared to low-contrast objects, we normalize

h^{0}

by the number of events within the spatiotemporal window to make the descriptor of the cell more contrastive. The histogram definition of the event group cells is as follows:

h_{C} (z, p) = \frac{1}{| C |} h_{C}^{0} (z, p) = \frac{1}{| C |} \sum_{s_{i} \in C} T_{s_{i}} (z, p) .

(29)

The schematic of the cell histogram is shown in Figure 4c. This approach regularizes space and time, reducing the influence of non-main events on the main events in the new event stream. Next, we need to compensate for the loss of main events containing important information due to the change in time span in the previous algorithm.

The events within time segment

C

and cell

C {x, y, t}

are represented by triplet 66. Referring to [52], we use a warp field

Φ (x, y, t - t_{i}) : (x, y, t) \to (x^{'}, y^{'}, t)

to represent the planar displacement that maps events at time

t

to their positions at time

t_{i}

. The goal is to find a corresponding motion compensation warp field

Φ : R^{3} \to R^{3}

that ensures the events used to supplement the event stream have the most appropriate density when projected onto the image plane. We define these primary compensation events as follows:

C^{'} = Π {Φ (U)} = Π {Φ (x, y, t - t_{i})} = {x^{'}, y^{'}, 0},

(30)

\forall {x, y, t} \in C,

(31)

where

Π

represents the time projection function that projects primary compensation events along the time axis, and

Π

can be used to reduce the data’s dimensionality:

R^{3} \to R^{2}

. Regarding the geometric properties presented by the event stream, the warp field encodes information for each event. Thus, we use two discrete mappings that encode the intensity and temporal attributes of the event stream to represent the available data in

C

. These are denoted as

Υ

and

Ψ

To calculate the event density

D

of the primary compensation events

C^{'}

, we discretize the image plane into pixel regions of specified size. We use the symbol

(m, n) \in N^{2}

to represent integer pixel coordinates and

(x^{'}, y^{'}, t) \in R^{3}

to represent the coordinates of events after displacement. Each projected event in

C^{'}

is mapped to a discrete pixel, with the total number of events mapped to that pixel recorded as its value as follows:

ξ_{mn} = {{x^{'}, y^{'}, t} : {x^{'}, y^{'}, 0} \in C^{'}, m = x^{'}, n = y^{'}} .

(32)

The above formula represents the event trajectory, which signifies a set of events along the time axis projected onto pixel

(m, n)

after applying the warp field

Φ

. We define the event count image as follows:

Υ_{mn} = | ξ_{mn} | .

(33)

The formula for calculating the cell compensation event density

D

is as follows:

D = \frac{| C^{'} |}{Υ_{C} \cdot h_{C} (z, p)},

(34)

where

Υ_{C}

represents the number of pixels within cell

C

that have at least one event mapped to them. However, during the projection operation, events generated by different subject edges may be projected onto the same pixel, counteracting our goal of reducing interfering events. To alleviate this issue, we refer to the approach in [53], utilizing a time representation image derived from timestamp information to aid in the primary compensation process. The expression is as follows:

Ψ_{mn} = \frac{1}{Υ_{mn}} \sum t : t \in ξ_{mn} .

(35)

Ψ

is a discrete plane where each pixel contains the event timestamps mapped to it by the warp field

Φ

. By calculating the average value of the timestamps, we can use all available events to improve the fidelity of the main event stream. The method in [53] considers only the latest timestamp, but since the signal-to-noise ratio of events depends on the average illumination, this method does not perform well under low light conditions. Additionally,

Ψ

follows the three-dimensional structure of the event cloud, and its gradient provides a global measure of error in the primary compensation process, which can be minimized as follows:

E = \sum ∥ G [m, n] ∥ = \sum (G_{x}^{2} [m, n] + G_{y}^{2} [m, n]),

(36)

where

G_{x} [m, n]

and

G_{y} [m, n]

represent the local spatial gradients of

Ψ

along the X and Y axes, respectively, calculated using the Sobel operator. Given that Formula (36) considers the global error of the event cloud within the warp field, we can decompose it into the following equations, assuming rigid body motion:

g_{x} = \frac{\sum G_{x} [m, n]}{Υ_{C} \cdot h_{C} (z, p)},

(37)

g_{y} = \frac{\sum G_{y} [m, n]}{Υ_{C} \cdot h_{C} (z, p)},

(38)

g_{t} = \frac{\sum (G_{x} [m, n], G_{y} [m, n]) \cdot (m, n)}{Υ_{C} \cdot h_{C} (z, p)},

(39)

g_{θ} = \frac{\sum (G_{x} [m, n], G_{y} [m, n]) \times (m, n)}{Υ_{C} \cdot h_{C} (z, p)} .

(40)

The above four equations correspond to the local errors

{E_{x}, E_{y}, E_{z}, θ}

in the displacement, scaling, and 2D rotation compensation processes on the image plane. We use a global main event model

ℳ^{G} = {E_{x}, E_{y}, E_{z}, θ}

containing error parameters to describe the corresponding global warp field

Φ^{G} (x, y, t)

, yielding the transformation as follows:

[\begin{array}{l} x^{'} \\ y^{'} \end{array}] = [\begin{array}{l} x \\ y \end{array}] - t * [[\begin{array}{l} E_{x} \\ E_{y} \end{array}] + (E_{t} + 1) * | \begin{matrix} \cos θ & - \sin θ \\ \sin θ & \cos θ \end{matrix} | \cdot [\begin{array}{l} x \\ y \end{array}] - [\begin{array}{l} x \\ y \end{array}]] .

(41)

This transformation converts the coordinates

{x, y, t}

in the event stream to

{x^{'}, y^{'}, t}

, resulting in the expression of the warp field, with timestamps remaining unchanged. Based on the proposed four-parameter model, we estimate and minimize the motion range of the main event stream using an error function defined on the time representation image. By maximizing the main event density on the event count image, we refine the global activity area of the main events and identify and track a subset of non-main events with a high probability of correlation to the main events, thus achieving main event stream completion and motion compensation.

The entire main event completion process involves first using Formula (41) to calculate the initial warp field expression for each cell

C

, then using Formula (35) to generate the corresponding time representation image

Ψ

. We compute the model

ℳ^{G}

and its corresponding gradient using the initial model

ℳ_{0}^{G}

and Formulas (37)–(40) and update the range of the warp field of

ℳ^{G}

through gradient descent, thus updating the region of the primary compensation events

C^{'}

. Next, we maximize the main event density

D

on the event count image

Υ

, refining the global activity area of the main events and further adjusting the region of the primary compensation events

C^{'}

. It is noteworthy that

Υ

and

Ψ

both represent local measures of the deviation between primary and non-main event regions in the event stream but are based on different data sources: event rate and event timestamps. The detailed process is as Algorithm 2, where the discrete parameter

k

represents the event camera pixel size of 0.3,

q

is an adjustable precision parameter, and

D

represents the initial value of the cell compensation event density, set to 0.

Algorithm 2. Adaptive Spatiotemporal Subject Surface Compensation.

Input : Initial model ℳ_{0}^{G}

and initial cell

C

of the main events,

k

q

Output: Cell

C^{'}

and density

D

of main compensation events
1 Obtain main compensation events

C^{'}

based on

Formula (1)

, ℳ_{0}^{G},

and

C

2 Obtain time representation image

Ψ

based on

Formula (5)

, C^{'},

and

k

3 Obtain ℳ^{G}

based on

Formulas (7 - 9)

, ℳ_{0}^{G},

and

Ψ

4 While {∥ ℳ^{G} - ℳ_{0}^{G} ∥}_{2} > q

5 Update main compensation events

C^{'}

based on

Formula (1)

, ℳ_{0}^{G},

and

C

6 Update time representation image

Ψ

based on

Formula (5)

, C^{'},

and

k

7 Update model ℳ^{G}

based on

Formulas (7 - 9)

, ℳ^{G},

and

Ψ

8 End
9 Obtain event count image

Υ

based on

Formula (3)

, C^{'},

and

k

10 Obtain event density

D^{'}

based on

Formula (4)

and

Υ

11 While ∥ D^{'} - D ∥ > q

12 Assign the value of

D

D^{'}

13 Obtain event density

D

based on

Formula (4),

C

, k,

and ℳ^{G}

If D^{'} < D

do
15

Update ℳ^{G}

and

C^{'}

16 Update event count image

Υ

based on

Formula (3)

, C^{'},

and

k

17        End
18   End
19   Return

C^{'}

D

After obtaining the area range

C^{'}

and the final event density

D

of the primary compensation events, we can determine the range of events in the original event stream that can supplement the main events in the new event stream. In this way, we further reduce the impact of non-main events with interfering effects in the new event stream and compensate for any main events that might have been removed in the previous section. Finally, by projecting the events within each event cuboid’s time span into the pixel plane in the form of an event count image, we complete the entire event stream representation process. Next, we need to quantitatively evaluate the processing results and compare the effectiveness of our algorithm with existing mainstream and latest event representation methods.

2.3. Actual Performance Efficiency Discrepancy

To evaluate the distortion degree of the final results obtained by various representation methods, we refer to the prior knowledge on event representation discrepancies in [52] and propose a new metric called Gromov–Wasserstein Event Discrepancy (GWED) based on events. For comparison purposes, we project the event streams processed by each method onto a plane using event count images. This approach ensures that each event stream possesses features indexed by the horizontal position of integer-valued pixels in the count image

Υ

, expressed as follows:

f_{x} = Υ (x) \in R^{N_{f}},

(42)

ℱ = {f_{x}}_{x \in Λ} .

(43)

Here,

Λ

represents the image domain, and

| Λ | = N_{f}

represents the number of pixels in the image domain.

ℱ

denotes the set of features for each group of positions. When converting the original event representation, some important event features are inevitably removed from the event stream, leading to distortion. Therefore, we measure the actual damage to the event stream by examining the relationship between the feature set and the event stream.

Figure 5 provides an overview schematic of GWED. We first measure the similarity between each set of events and their corresponding representations by constructing a soft correspondence between event

s_{m}

and the corresponding feature

f_{x_{n}}

, defining it as

T_{mn}

. This transport relationship effectively captures the features corresponding to each event and identifies the distortion and destruction information of the original event set. We set the total weight of the events to 1, meaning each event has a weight of

1 / N_{e}

. These event weights are transferred to the output features in the transformed representation, which also need a total weight of 1. Thus, each feature has a weight of

T_{mn}

. To satisfy this construction, the corresponding equations

\sum_{m} T_{mn} = 1 / N_{f}

and

\sum_{n} T_{mn} = 1 / N_{e}

must hold true.

In the next step, we evaluate the distortion introduced in the aforementioned transport relationship by using the pairwise similarity of events and features, which reflects the damage to actual events during the event representation process, thereby obtaining the corresponding distortion rate. Let the features corresponding to a pair of events,

s_{m}

and

s_{k},

f_{x_{n}}

and

f_{x_{l}}

, respectively. Their similarity ratings are as follows:

C_{mk}^{s} = C^{s} (s_{m}, s_{k}),

(44)

C_{nl}^{f} = C^{f} (f_{x_{n}}, f_{x_{l}}) .

(45)

Then, using the difference in the previously proposed similarity scores as a measure of distortion, we define the metric for each pair of events and features as follows:

L_{mnkl} = T_{mn} T_{kl} ℒ (C_{mk}^{s}, C_{nl}^{f}),

(46)

where

ℒ

represents the difference measurement between

C^{s} (s_{m}, s_{k})

and

C_{nl}^{f}

. In this process, we use the Gaussian Radial Basis Function (RBF) [54] and Kullback–Leibler (KL) divergence as the basis for calculating features and similarities related to events and images. The calculation process is as follows:

C_{mk}^{s} = e^{\frac{- {∥ s_{m} - s_{k} ∥}^{2}}{2 σ_{s}^{2}}},

(47)

C_{nl}^{f} = e^{\frac{- {∥ f_{x_{n}} - f_{x_{1}} ∥}^{2}}{2 σ_{f}^{2}}},

(48)

σ_{s}^{2} = {mean}_{m < n} {∥ s_{m} - s_{n} ∥}^{2},

(49)

σ_{f}^{2} = {mean}_{m < n} {∥ f_{x_{m}} - f_{x_{n}} ∥}^{2},

(50)

ℒ (C_{mk}^{s}, C_{nl}^{f}) = C_{mk}^{s} \log (C_{mk}^{s} / C_{nl}^{f}) .

(51)

By normalizing the distances between events and feature pairs with the variance of the event stream data, we ensure that the similarity scores are robust to data dimensions and the number of samples in the source and target domains. Summing over all possible pairs of events and features gives us the transport cost during the event representation process.

L (T; ℰ, ℱ) = \sum_{m, n, k, l} L_{mnkl} = \sum_{m, n, k, l} T_{mn} T_{kl} ℒ (C_{mk}^{s}, C_{nl}^{f}) .

(52)

Next, we minimize the transport relationship

T

to obtain the GWED as follows:

GWED = L (ℰ, ℱ) = \min_{T} \sum_{m, n, k, l} T_{mn} T_{kl} ℒ (C_{mk}^{s}, C_{nl}^{f}) .

(53)

Referring to the approach in [55], since the above metric is defined based on the events within a single event cuboid, we can average it over multiple samples to obtain

{GWED}_{N}

as follows:

{GWED}_{N} = \frac{1}{N} \sum_{m} L (ℰ_{m}, ℱ_{m}),

(54)

Which represents the average distortion rate from the original event stream to the event representation process. We then refer to [35] to use confidence interval-related knowledge to calculate the retention rate of main events in the final distorted result. In this work, we use Non-Zero Grid Entropy (NZGE) as the metric. To calculate the NZGE value for the main event area in the generated count image, we first count the area corresponding to main events within a single event cuboid and divide this area into

m \times n

cells of size

C

as before. Then, we calculate an image entropy for each cell to construct an entropy map, as shown in Figure 6.

The calculation process for the NZGE value within each event cuboid is as follows:

{NZGE}_{i} = \frac{1}{n_{i}^{C}} \sum_{x = 1}^{m} \sum_{y = 1}^{n} {entropy}^{i}_{x, y},

(55)

{entropy}_{x, y} = - \sum_{z = 0}^{255} p_{z}^{x, y} {\log p}_{z}^{x, y} .

(56)

n_{i}^{C}

represents the number of cells with non-zero entropy.

{entropy}^{i}_{x, y}

represents the image entropy of the region at the x row and y column.

p_{z}^{x, y}

represents the probability that a pixel in the corresponding image region at the x row and y column has a grayscale value of z, with the grayscale range being 0 to 255. We can use the NZGE value to keep the primary contour part of the final result within a reasonable range, which requires a corresponding confidence interval. To calculate the upper and lower bounds of the confidence interval, let the collection of nn NZGE values be

N

, with the corresponding mean and variance being

\bar{N}

and

S_{N}^{2}

. Assuming the appropriate NZGE values are independent and normally distributed, the set of observations for the normal distribution is

N

. Here, we define a pivotal quantity

g

as follows:

g = \frac{\bar{N} - μ}{S_{N} / \sqrt{n}} .

(57)

As a result,

g

follows the

t

-distribution

t (n - 1)

with

n - 1

degrees of freedom. We can calculate the confidence interval for the NZGE value as follows:

[α, β] = [\bar{N} - | g_{0.025} | \frac{S_{N}}{\sqrt{n}}, \bar{N} + | g_{0.025} | \frac{S_{N}}{\sqrt{n}}] .

(58)

When the NZGE value lies within the confidence interval, the primary information in the event representation image obtained from the corresponding event cuboid is well-preserved. The closer the NZGE value is to the median, the better the preservation. Therefore, we define the difference between the NZGE and the median as

{NZGE}_{D}

{NZGE}_{D} = | NZGE - μ | .

(59)

The smaller this value, the better the corresponding representation method preserves the main events. Combining this with the previously obtained

{GWED}_{N}

, we obtain the new metric for retention efficiency discrepancy:

APED = {GWED}_{N} \cdot {NZGE}_{D} .

(60)

This value represents the treatment strategies for actual events and main events in the event stream by various methods. The lower this metric, the lower the actual distortion rate of the corresponding event representation method, and the higher the retention rate of main events carrying major information.

3. Results

During the experimental phase, we used the DAVIS346 event camera to capture and establish a dataset for comparative experiments in various motion scenarios. The performance of this camera is sufficient to meet our needs for image acquisition and metric computation in complex motion scenes. For the experiments, we selected three scenarios to capture and compare the performance of different algorithms. These include two complex scenes with stationary objects and moving cameras, as well as one scene with multiple subjects moving at different speeds relative to the camera. The grayscale images and 3D event stream diagrams of these three scenarios are shown in Figure 7.

The algorithms involved in the experiments were implemented on the MATLAB 2022b platform. The computer used to run the algorithm programs was equipped with an Intel(R) Core(TM) i7-11800H @ 2.30 GHz processor, 16 GB of RAM, and an NVIDIA GeForce RTX 3060 6 GB GPU. The operating system was Windows 11. Before conducting the experiments, we chose to extract the events within the first 5000 ms of each event stream for comparative analysis. This is because the value of

{GWED}_{N}

in the event representation process can exhibit significant fluctuations when the number of event samples is small.

Figure 8 shows the variation of the value of

{GWED}_{N}

corresponding to each algorithm with different numbers of event samples. In the comparison experiment for Scene A, as the number of samples increases, the value of

{GWED}_{N}

for each algorithm changes. Therefore, we extracted a 5000 ms segment from the original event streams of each scene as the basis for our experiments, randomly selecting 100,000 events for calculating this metric. This ensures that the value of

{GWED}_{N}

falls within a stable range that accurately reflects the distortion rate for each algorithm, providing sufficient experimental data to validate the performance comparison results. After this value stabilizes, the lower the corresponding algorithm, the better the performance. The algorithms used for comparison are the widely-used Voxel Grid [25], TORE [28], MDES [29], and ATSLTD [35].

First, we conducted comparative experiments on the event representation effectiveness of each algorithm for Scene A and Scene B. These scenes involve a drone equipped with an event camera capturing complex real-world environments while moving slowly at high altitudes. Figure 9 and Figure 10 show the results and corresponding APED values for each algorithm in Scene A, while Figure 11 and Figure 12 present the results for Scene B.

From the comparative results of the algorithms in these two scenes, it is evident that when the objects being captured have diverse compositions, complex textures, and numerous elements, the TORE and ATSLTD algorithms occasionally miss important information in the event presenting images. In contrast, the ATSLTD and Voxel Grid algorithms sometimes produce overly redundant results, failing to reflect textures and directional information. Our method, however, produces event-presenting images that fully capture the information of the objects without motion blur.

Next, Figure 13 and Figure 14 show the results and corresponding APED values for each algorithm in Scene C, where there are multiple subjects moving at different speeds relative to the camera.

From Figure 13, it is clear that when there are multiple subjects moving at different speeds in the scene, the existing mainstream algorithms cannot fully represent the information of all subjects. Specifically, in Scene C, where three people are walking with the middle person moving significantly slower than the other two, the other algorithms used in the comparison experiment tend to lose part of the slower subject’s contour and texture information while ensuring the completeness of the other two subjects’ information. Our algorithm, on the other hand, can retain the information of all subjects, including the slower one, resulting in higher completeness of the slower subject’s information in the final event representation image.

Finally, we summarize the average APED values of each algorithm across the three comparison experiment scenes in Table 1.

Based on the data in this table, we can quantitatively evaluate the overall performance of each algorithm. It is evident that our algorithm significantly outperforms others in avoiding data distortion and redundancy while better preserving the main events that carry the main information.

4. Discussion

The experiments presented in this paper demonstrate that our event representation algorithm maintains excellent overall performance and effective representation when handling varying complexities and multiple subjects. In processing the two components of the APED metric, GWED_N and

{NZGE}_{D}

, we observed that the former could not reliably reflect the performance and distortion rate of each algorithm with smaller sample sizes. Therefore, we selected a large random sample size to obtain stable metric values. For the latter component, we utilized a continuous segment of event stream data as an experimental basis, recognizing that event cameras encounter many unpredictable situations during actual operation. The ability to handle these scenarios needs to be incorporated into the comprehensive evaluation.

The demonstration images produced by various algorithms also indicate that under complex motion conditions or frequently changing scenes, some algorithms exhibit motion blur, contour dragging, ghosting, or information loss. In contrast, our algorithm maintains the integrity of scene information without redundancy across various scenarios. Additionally, experiments corresponding to Scene C reveal that when subjects move at significantly different speeds, most current algorithms cannot fully preserve the information completeness of all subjects. They often sacrifice information from slower-moving subjects to ensure a better representation of faster-moving ones. Our algorithm optimizes overall representation by compensating for the information of slower-moving subjects while preserving the performance of faster-moving subjects. The final APED value statistics demonstrate that our algorithm excels in both distortion minimization and main event retention.

5. Conclusions

This paper introduces a novel event data measurement and slicing algorithm, along with an event data completion and optimization algorithm, addressing the challenges and limitations of existing event representation methods when processing event stream data. Our approach preserves information completeness in complex scenes with diverse elements while minimizing redundant events and reducing motion blur, ghosting, and contour dragging. Additionally, it resolves the issue of maintaining information completeness for subjects moving at different speeds within a scene. Current event representation methods use differences in movement speeds between subjects to intuitively present speed-related information. However, when applying these representations to recognition and detection tasks, these speed differences can hinder further experiments. This issue does not arise with traditional images, where motion blur from high-speed movement can impede advanced experiments. Our algorithm combines the strengths of event cameras, which excel in capturing dynamic objects, with those of traditional cameras, which are better at capturing static objects. It ensures the completeness of high-speed object information without redundancy while fully representing the information of slower-moving objects through compensation. This approach reduces errors in recognition tasks caused by significant differences in individual movement speeds. Future research will focus on better complementing the missing information in event images with traditional images or integrating the two to achieve a more comprehensive representation, thereby attaining full environmental perception.

Author Contributions

All authors were involved in the formulation of the problem and the design of the methodology; S.T. designed the experiment and wrote the manuscript; Y.F. constructed the datasets; Z.Z. and M.S. analyzed the accuracy of the experimental data; Y.Z. and H.L. reviewed and guided the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Industrial Technology Research and Development of Jilin Province (2023C031-6).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
Brandli, C.; Berner, R.; Yang, M.; Liu, S.-C.; Delbruck, T. A 240 × 180 130 db 3 µs latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits 2014, 49, 2333–2341. [Google Scholar] [CrossRef]
Posch, C.; Matolin, D.; Wohlgenannt, R. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J. Solid-State Circuits 2010, 46, 259–275. [Google Scholar] [CrossRef]
Oyster, C. The analysis of image motion by the rabbit retina. J. Physiol. 1968, 199, 613–635. [Google Scholar] [CrossRef]
Murphy-Baum, B.L.; Awatramani, G.B. An old neuron learns new tricks: Redefining motion processing in the primate retina. Neuron 2018, 97, 1205–1207. [Google Scholar] [CrossRef]
Ölveczky, B.P.; Baccus, S.A.; Meister, M. Segregation of object and background motion in the retina. Nature 2003, 423, 401–408. [Google Scholar] [CrossRef]
Wild, B. How does the brain tell self-motion from object motion? J. Neurosci. 2018, 38, 3875–3877. [Google Scholar] [CrossRef]
Ghosh, R.; Gupta, A.; Nakagawa, A.; Soares, A.; Thakor, N. Spatiotemporal filtering for event-based action recognition. arXiv 2019, arXiv:1903.07067. [Google Scholar]
Ghosh, R.; Gupta, A.; Tang, S.; Soares, A.; Thakor, N. Spatiotemporal feature learning for event-based vision. arXiv 2019, arXiv:1903.06923. [Google Scholar]
Orchard, G.; Meyer, C.; Etienne-Cummings, R.; Posch, C.; Thakor, N.; Benosman, R. HFirst: A temporal approach to object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2028–2040. [Google Scholar] [CrossRef]
Lee, J.H.; Delbruck, T.; Pfeiffer, M. Training deep spiking neural networks using backpropagation. Front. Neurosci. 2016, 10, 508. [Google Scholar] [CrossRef]
Zhao, B.; Ding, R.; Chen, S.; Linares-Barranco, B.; Tang, H. Feedforward categorization on AER motion events using cortex-like features in a spiking neural network. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 1963–1978. [Google Scholar] [CrossRef] [PubMed]
Pérez-Carrasco, J.A.; Zhao, B.; Serrano, C.; Acha, B.; Serrano-Gotarredona, T.; Chen, S.; Linares-Barranco, B. Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing—Application to feedforward ConvNets. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2706–2719. [Google Scholar] [CrossRef] [PubMed]
Sekikawa, Y.; Hara, K.; Saito, H. Eventnet: Asynchronous recursive event processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3887–3896. [Google Scholar]
Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Fan, H.; Yu, X.; Ding, Y.; Yang, Y.; Kankanhalli, M. Pstnet: Point spatio-temporal convolution on point cloud sequences. arXiv 2022, arXiv:2205.13713. [Google Scholar]
Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13884–13893. [Google Scholar]
Schaefer, S.; Gehrig, D.; Scaramuzza, D. Aegnn: Asynchronous event-based graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12371–12381. [Google Scholar]
Bi, Y.; Chadha, A.; Abbas, A.; Bourtsoulatze, E.; Andreopoulos, Y. Graph-based spatio-temporal feature learning for neuromorphic vision sensing. IEEE Trans. Image Process. 2020, 29, 9084–9098. [Google Scholar] [CrossRef]
Bi, Y.; Chadha, A.; Abbas, A.; Bourtsoulatze, E.; Andreopoulos, Y. Graph-based object classification for neuromorphic vision sensing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October 2019; pp. 491–501. [Google Scholar]
Mondal, A.; Giraldo, J.H.; Bouwmans, T.; Chowdhury, A.S. Moving object detection for event-based vision using graph spectral clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 876–884. [Google Scholar]
Deng, Y.; Chen, H.; Liu, H.; Li, Y. A voxel graph cnn for object classification with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1172–1181. [Google Scholar]
Maqueda, A.I.; Loquercio, A.; Gallego, G.; García, N.; Scaramuzza, D. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5419–5427. [Google Scholar]
Sironi, A.; Brambilla, M.; Bourdis, N.; Lagorce, X.; Benosman, R. HATS: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1731–1740. [Google Scholar]
Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 989–997. [Google Scholar]
Wang, L.; Ho, Y.-S.; Yoon, K.-J. Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10081–10090. [Google Scholar]
Alonso, I.; Murillo, A.C. EV-SegNet: Semantic segmentation for event-based cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1624–1633. [Google Scholar]
Baldwin, R.W.; Liu, R.; Almatrafi, M.; Asari, V.; Hirakawa, K. Time-ordered recent event (tore) volumes for event cameras. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2519–2532. [Google Scholar] [CrossRef]
Nam, Y.; Mostafavi, M.; Yoon, K.-J.; Choi, J. Stereo depth from events cameras: Concentrate and focus on the future. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6114–6123. [Google Scholar]
Zhang, Y.; Zhao, Y.; Lv, H.; Feng, Y.; Liu, H.; Han, C. Adaptive slicing method of the spatiotemporal event stream obtained from a dynamic vision sensor. Sensors 2022, 22, 2614. [Google Scholar] [CrossRef]
Perot, E.; De Tournemire, P.; Nitti, D.; Masci, J.; Sironi, A. Learning to detect objects with a 1 megapixel event camera. Adv. Neural Inf. Process. Syst. 2020, 33, 16639–16652. [Google Scholar]
Kim, J.; Bae, J.; Park, G.; Zhang, D.; Kim, Y.M. N-imagenet: Towards robust, fine-grained object recognition with event cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2146–2156. [Google Scholar]
Gehrig, D.; Loquercio, A.; Derpanis, K.G.; Scaramuzza, D. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5633–5643. [Google Scholar]
Zubic, N.; Gehrig, D.; Gehrig, M.; Scaramuzza, D. From Chaos Comes Order: Ordering Event Representations for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
Chen, H.; Wu, Q.; Liang, Y.; Gao, X.; Wang, H. Asynchronous tracking-by-detection on adaptive time surfaces for event-based object tracking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 473–481. [Google Scholar]
Tang, S.; Lv, H.; Zhao, Y.; Feng, Y.; Liu, H.; Bi, G. Denoising method based on salient region recognition for the spatiotemporal event stream. Sensors 2023, 23, 6655. [Google Scholar] [CrossRef]
Park, I.M.; Seth, S.; Paiva, A.R.; Li, L.; Principe, J.C. Kernel methods on spike train space for neuroscience: A tutorial. IEEE Signal Process. Mag. 2013, 30, 149–160. [Google Scholar] [CrossRef]
González, J.A.; Rodríguez-Cortés, F.J.; Cronie, O.; Mateu, J. Spatio-temporal point process statistics: A review. Spat. Stat. 2016, 18, 505–544. [Google Scholar] [CrossRef]
Teixeira, R.F.; Leite, N.J. A new framework for quality assessment of high-resolution fingerprint images. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1905–1917. [Google Scholar] [CrossRef]
Dong, S.; Bi, Z.; Tian, Y.; Huang, T. Spike coding for dynamic vision sensor in intelligent driving. IEEE Internet Things J. 2018, 6, 60–71. [Google Scholar] [CrossRef]
Paiva, A.R.; Park, I.; Principe, J.C. A reproducing kernel Hilbert space framework for spike train signal processing. Neural Comput. 2009, 21, 424–449. [Google Scholar] [CrossRef] [PubMed]
Tezuka, T. Multineuron spike train analysis with R-convolution linear combination kernel. Neural Netw. 2018, 102, 67–77. [Google Scholar] [CrossRef]
Houghton, C.; Sen, K. A new multineuron spike train metric. Neural Comput. 2008, 20, 1495–1511. [Google Scholar] [CrossRef]
Brockmeier, A.J.; Choi, J.S.; Kriminger, E.G.; Francis, J.T.; Principe, J.C. Neural decoding with kernel-based metric learning. Neural Comput. 2014, 26, 1080–1107. [Google Scholar] [CrossRef]
Li, J.; Li, J.; Zhu, L.; Xiang, X.; Huang, T.; Tian, Y. Asynchronous spatio-temporal memory network for continuous event-based object detection. IEEE Trans. Image Process. 2022, 31, 2975–2987. [Google Scholar] [CrossRef]
Gönen, M.; Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011, 12, 2211–2268. [Google Scholar]
Fu, Y.; Li, J.; Dong, S.; Tian, Y.; Huang, T. Spike coding: Towards lossy compression for dynamic vision sensor. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; p. 572. [Google Scholar]
Scheerlinck, C.; Barnes, N.; Mahony, R. Continuous-time intensity estimation using event cameras. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 308–324. [Google Scholar]
Rebecq, H.; Ranftl, R.; Koltun, V.; Scaramuzza, D. Events-to-video: Bringing modern computer vision to event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3857–3866. [Google Scholar]
Lagorce, X.; Orchard, G.; Galluppi, F.; Shi, B.E.; Benosman, R.B. Hots: A hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1346–1359. [Google Scholar] [CrossRef] [PubMed]
Marchisio, A.; Shafique, M. Embedded Neuromorphic Using Intel’s Loihi Processor. In Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Software Optimizations and Hardware/Software Codesign; Springer: Berlin/Heidelberg, Germany, 2023; pp. 137–172. [Google Scholar]
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. EV-FlowNet: Self-supervised optical flow estimation for event-based cameras. arXiv 2018, arXiv:1802.06898. [Google Scholar]
Said, S.; Bombrun, L.; Berthoumieu, Y.; Manton, J.H. Riemannian Gaussian distributions on the space of symmetric positive definite matrices. IEEE Trans. Inf. Theory 2017, 63, 2153–2170. [Google Scholar] [CrossRef]
Peyré, G.; Cuturi, M.; Solomon, J. Gromov-wasserstein averaging of kernel and distance matrices. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 2664–2672. [Google Scholar]

Figure 1. Schematic diagram of the human retina model and corresponding event camera pixel circuit.

Figure 2. (a) We consider the light intensity change signals received by the corresponding pixels as computational elements in the time domain. (b) From the statistical results, it can be seen that the ON polarity ratio varies randomly over the time index.

Figure 3. This graph represents the time span changes of each event cuboid processed by our algorithm.

Figure 4. This figure illustrates the time surface of events in the original event stream. For clarity, only the x–t components are shown. Red crosses represent non-main events, and blue dots represent main events. (a) In the time surface described in [50] (corresponding to Formula (24)), only the occurrence frequency of the nearest events around the main event is considered. Consequently, non-main events with disruptive effects may have significant weight. (b) The local memory time surface corresponding to Formula (26) considers the influence weight of historical events within the current spatiotemporal window. This approach reduces the ratio of non-main events involved in the time surface calculation, better capturing the true dynamics of the event stream. (c) By spatially averaging the time surfaces of all events in adjacent cells, the time surface corresponding to Formula (29) can be further regularized. Due to the spatiotemporal regularization, the influence of non-main events is almost completely suppressed.

Figure 5. Schematic of the Gromov–Wasserstein Event Discrepancy between the original event stream and the event representation results.

Figure 6. Illustration of the grid positions corresponding to non-zero entropy values.

Figure 7. Grayscale images and 3D event stream diagrams for three captured scenarios: (a) Grayscale illustration of the corresponding scenarios; (b) 3D event stream illustration of the corresponding scenarios.

Figure 8. The variation of the value of

{GWED}_{N}

corresponding to each algorithm with different numbers of event samples.

Figure 8. The variation of the value of

{GWED}_{N}

corresponding to each algorithm with different numbers of event samples.

Figure 9. Illustration of the event stream processing results for Scene A by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours.

Figure 10. APED data obtained from the event stream processing results for Scene A by different algorithms.

Figure 11. Illustration of the event stream processing results for Scene B by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours.

Figure 12. APED data obtained from the event stream processing results for Scene B by different algorithms.

Figure 13. Illustration of the event stream processing results for Scene C by different algorithms: (a) TORE; (b) ATSLTD; (c) Voxel Grid; (d) MDES; (e) Ours.

Figure 14. APED data obtained from the event stream processing results for Scene C by different algorithms.

Table 1. APED values of event stream data processed by different event representation algorithms in three captured scenarios.

Method	Actual Performance Efficiency Discrepancy
Method	Scene A	Scene B	Scene C
TORE	0.11575	0.11906	0.04819
ATSLTD	0.07921	0.06497	0.03415
Voxel Grid	0.05566	0.03737	0.02832
MDES	0.05356	0.03086	0.01336
Ours	0.02403	0.01896	0.00596

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tang, S.; Zhao, Y.; Lv, H.; Sun, M.; Feng, Y.; Zhang, Z. Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree. Sensors 2024, 24, 7430. https://doi.org/10.3390/s24237430

AMA Style

Tang S, Zhao Y, Lv H, Sun M, Feng Y, Zhang Z. Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree. Sensors. 2024; 24(23):7430. https://doi.org/10.3390/s24237430

Chicago/Turabian Style

Tang, Sichao, Yuchen Zhao, Hengyi Lv, Ming Sun, Yang Feng, and Zeshu Zhang. 2024. "Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree" Sensors 24, no. 23: 7430. https://doi.org/10.3390/s24237430

APA Style

Tang, S., Zhao, Y., Lv, H., Sun, M., Feng, Y., & Zhang, Z. (2024). Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree. Sensors, 24(23), 7430. https://doi.org/10.3390/s24237430

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree

Abstract

1. Introduction

2. Materials and Methods

2.1. Asynchronous Spike Dynamic Metric and Slicing Algorithm

2.2. Adaptive Spatiotemporal Subject Surface Compensation Algorithm

2.3. Actual Performance Efficiency Discrepancy

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI