1. Introduction
Recent years have seen the design of increasingly advanced computer vision algorithms to support a wide range of critical tasks in a plethora of application areas. These algorithms are often responsible for making decisive decisions in situations where failure would lead to serious consequences. In References [1,2,3], for example, vision-based systems are used for the inspection of pipeline infrastructures. In particular, in the first work, the authors presented a method for subsea pipeline corrosion estimation that uses the colour information of corroded pipes. To manage the degraded underwater images, the authors developed ad-hoc pre-processing algorithms for image restoration and enhancement. Differently, in the second and third works, the authors focused on large infrastructures, such as sewers and waterworks. Both works proposed an anomaly detector based on unsupervised machine learning techniques, which, unlike other competitors in this specific field, does not require annotated samples for training the detection models. Correct evaluations by the algorithms reported above can provide countless advantages in terms of maintenance and hazard determination. Even in hand and body rehabilitation, as shown in References [4,5,6], the last few years have seen the proliferation of vision-based systems able to provide measurements and predictions about the degree of recovery of skills lost by patients affected by strokes and degenerative diseases. The first work focused on a hand skeleton model together with pose estimation and tracking algorithms, both to estimate palm and finger poses and to track their movements during a rehabilitation exercise. Similarly, but using a body skeleton model, the second and third works used a customized long short-term memory (LSTM) model to analyze the movements and limitations of patients' bodies, treating a full-body rehabilitation exercise like a long action over time. Hand and body skeleton models, in addition to allowing the transposition of patients' virtual avatars into immersive environments, also allow therapists to better detect joints that require more exercise, thus optimizing and customizing the recovery task.
Without doubt, one of the application areas in which computer vision has had the greatest impact in the last ten years is active video surveillance, that is, automatic systems able to replace human operators in complex tasks, such as intrusion detection [7,8], event recognition [9,10], target identification [11,12], and many others. In Reference [13], for example, the authors proposed a reinforcement learning approach to train a deep neural network to find optimal patrolling strategies for Unmanned Aerial Vehicle (UAV) visual coverage tasks. Unlike other works in this application area, their method explicitly considers different coverage requirements expressed as relevance maps. References [14,15,16], instead, even if they use pixel-based computer vision techniques, that is, low-level processing, are able to achieve remarkable results in terms of novelty recognition and change detection, thus providing a step forward in the field of aerial monitoring. Moving to stationary and Pan-Tilt-Zoom (PTZ) cameras, very recent works, such as that reported in Reference [17], have shown that, even in challenging application fields, robust systems can be implemented and applied, for security contexts, in everyday life. In particular, the authors proposed a continuous-learning framework for context-aware activity recognition from unlabelled video. In their innovative approach, the authors formulated an information-theoretic active learning technique that utilizes contextual information among activities and objects. Other interesting current examples are presented in References [18,19]. In the first, a novel end-to-end partially supervised deep learning approach for video anomaly detection and localization using only normal samples is presented. In the second, instead, the authors propose an abnormal event detection hybrid modulation method via feature expectation subgraph calibrating classification in video surveillance scenes. Both works have proven to provide remarkable results in terms of robustness and accuracy.
Active surveillance systems, in real contexts, are composed of different algorithms, each of which collaborates with the others for the achievement of a specific target. For example, algorithms for separating background from foreground, for example, References [20,21,22], are often used as a pre-processing stage, both to detect the objects of interest in the scene and to have a reference model of the background and its variations over time. Another example is tracking algorithms, for example, References [10,23], which are used to analyze moving objects. Among these collaborative algorithms, person re-identification ones, for example, References [24,25], play a key role, especially in security, protection, and prevention areas. In fact, being able to establish a person's identity and to verify the presence of that person in other locations hours or even weeks before is a fundamental step for the areas reported above. Anyway, despite the efforts of many computer vision researchers in this application area, the person re-identification task still presents several largely unsolved problems. Most person re-identification methods, in fact, are based on visual features extracted from images to model a person's appearance [26,27]. This leads to a first class of problems since, as is known, visual features have many weaknesses, including sensitivity to illumination changes, shadows, direction of light, and many others. Another class of problems regards background clutter [28,29] and occlusions [30,31], which tend, in uncontrolled environments, to lower system performance in terms of accuracy. A final class of problems, very important from a practical point of view, refers to long-term re-identification and camouflage [32,33]. Many systems, based totally or in part on visual features, are not able to re-identify persons under the two issues reported above, thus limiting the use of these systems in real contexts.
In this paper, a meta-feature based LSTM hashing model for person re-identification is presented. The proposed method takes inspiration from some of our recent experiences in using 2D/3D skeleton-based features and LSTM models to recognize hand gestures [34], body actions [35], and body affects [36] in long video sequences. Unlike these, the 2D skeleton model is used, in this paper, to generate biometric features referred to movement, gait, and bone proportions of the body. Compared to the current literature, beyond the originality of the overall pipeline and features, the proposed method presents several novelties suitably studied to address, at least in part, the classes of problems reported above, especially in uncontrolled and crowded environments. First, the method is fully based on features extracted from the 2D skeleton models present in a scene. The algorithm used to extract the models, reported in Reference [37], has proven to be highly reliable in terms of accuracy, even for multi-person estimation. The algorithm uses each input image only to produce the 2D locations of anatomical keypoints for each person in the image. This means that all the aspects, for example, illumination changes, direction of light, and many others, that can influence re-identification by systems based on the visual appearance of people can, in the proposed method, be overlooked, since the images are only used to generate 2D skeletons. Second, the proposed pipeline does not use visual features, but only features derived by analysing the 2D skeleton joints. These features, through the use of the LSTM model, are designed to capture both the dynamic correlations of different body parts during movements and gaits and the bone proportions of relevant body parts. The term meta-features stems from the fact that our features are conceived to generalize the body in terms of movements, gaits, and proportions. Thanks to this abstraction, long-term re-identification and some kinds of camouflage can be better managed. Third, thanks to the two final layers of the proposed network, which implement the bodyprint hash through binary coding, the representative space of our method can be considered practically unlimited, thus providing a tool to potentially label each and every human being in the world. The proposed method was tested on three challenging datasets designed for person re-identification in video sequences, that is, iLIDS-VID [38], Person Re-ID (PRID) 2011 [39], and Motion Analysis and Re-identification Set (MARS) [40], showing remarkable results, compared with key works of the current literature, in terms of re-identification rank. Summarizing, the contributions of the paper with respect to both the present open issues and the current state-of-the-art in terms of pipelines and models can be outlined as follows:
- Concerning the open issues of the person re-identification field, the proposed method, which is not based on visual appearance, can be considered a concrete support in uncontrolled crowded spaces, especially in long-term re-identification, where people usually change clothes and some aspects of their visual appearance over time. In addition, the use of RGB cameras allows the method to be used in both indoor and outdoor environments, overcoming, thanks to the use of 2D skeletons, well-known problems related to scene analysis, including illumination changes, shadows, direction of light, background clutter, and occlusions.
- Regarding the overall pipeline and model, the approach proposed in this paper presents different novelties. First, even if some features are inspired by selected works of the current literature on recognizing hand gestures, body actions, and body affects, other features are completely new, as is their joint use. Second, for the first time in the literature, an LSTM hashing model for person re-identification is used. This model was conceived not only to exploit the dynamic patterns of body movements, but also to provide, via the last two layers, a mechanism by which to label millions of people thanks to binary coding properties.
The rest of the paper is structured as follows. In Section 2, a concise literature review focused on person re-identification methods based on video sequence processing is presented. In Section 3, the entire methodology is described, including the proposed LSTM hashing model and meta-features. In Section 4, the three benchmark datasets and comparative results with other literature works are reported. Finally, Section 5 concludes the paper.
2. Related Work
In this section, selected works that treat the person re-identification task on the basis of video sequences by deep learning techniques are presented and discussed. The same works are then reported in the experimental tests, that is, Section 4, for a full comparison.
A first remarkable work is reported in Reference [40], where the authors, first and foremost, introduced the Motion Analysis and Re-identification Set (MARS) dataset, a very large collection of challenging video sequences that, moreover, was also used for part of the experimental tests in the present paper. Beyond the dataset, the authors also reported an extensive evaluation of state-of-the-art methods, including a customized one based on Convolutional Neural Networks (CNNs). The latter, supported by the Cross-view Quadratic Discriminant Analysis (XQDA) metric learning scheme [41] on the iLIDS-VID and PRID 2011 datasets, and by Multiple Query (MQ) only on the MARS dataset, was shown to outperform several competitive approaches, thus demonstrating a good generalization ability. The work proposed in Reference [42], instead, adopts a Recurrent Neural Network (RNN) architecture in which features are extracted from each frame by using a CNN. The latter incorporates a recurrent final layer that allows information to flow between time-steps. To provide an overall appearance feature for the complete sequence, the features from all time-steps are then combined by using temporal pooling. The authors conducted experiments on the iLIDS-VID and PRID 2011 datasets, obtaining very significant results on both. Two other competitors of the work proposed in this paper are described in References [43,44]. In the first, the authors presented an end-to-end deep neural network architecture that integrates a temporal attention model to selectively focus on the discriminative frames and a spatial recurrent model to exploit the contextual information when measuring the similarity. In the second, the authors described a deep architecture with jointly attentive spatial-temporal pooling, enabling a joint learning of the representations of the inputs as well as their similarity measurement. Their method extends standard RNN-CNNs by decomposing pooling into two steps: a spatial pooling on the feature map from the CNN and an attentive temporal pooling on the output of the RNN. Both works were tested on iLIDS-VID, PRID 2011, and MARS, obtaining outstanding results compared with different works of the current literature. The authors of Reference [45] proposed a method for extracting a global representation of subjects through the several frames composing a video. In particular, the method attends to human body part appearance and motion simultaneously and aggregates the extracted features via the vector of locally aggregated descriptors (VLAD) [46] aggregator. By considering the adversarial learning approach, in Reference [47] the authors presented a deep few-shot adversarial learning method to produce effective video representations for video-based person re-identification, using few labelled training paired videos. In detail, the method is based on Variational Recurrent Neural Networks (VRNNs) [48], which can capture temporal dynamics by mapping video sequences into latent variables. Another approach is to consider the walking cycle of the subjects, such as the one presented in Reference [49], where the authors proposed a super-pixel tracking method for extracting motion information, used to select the best walking cycle through an unsupervised method. A last competitor is reported in Reference [50], where a deep attention-based Siamese model to jointly learn spatio-temporal expressive video representations and similarity metrics is presented. In their approach, the authors embed visual attention into convolutional activations from local regions to dynamically encode spatial context priors and capture the relevant patterns for propagation through the network. This work was also compared on the three mentioned benchmark datasets, showing remarkable results.
3. Methodology: LSTM Hashing Model and Meta-Features
In this section, the proposed architecture for person re-identification, based on skeleton meta-features derived from RGB videos and bodyprint LSTM hashing, is presented. A scheme summarizing the architecture is shown in Figure 1. In detail, starting from an RGB video, the OpenPose [37] library is first used to obtain skeleton joint positions. This representation is then fed to the feature extraction module, where meta-features analysing movements both locally (i.e., in a frame-by-frame fashion) and globally (e.g., averaging over the whole motion) are generated, fully characterizing a person. Subsequently, the meta-features are given as input to the bodyprint LSTM hashing module. In this last component, local meta-features are first analysed via a single LSTM layer. The LSTM representation of the local meta-features is then concatenated to the global meta-features and finally fed to two dense layers, so that a bodyprint hash is generated through binary coding.
3.1. Skeleton Joint Generation
The first and foremost step for a video-based person re-identification system using skeleton information is the skeleton joint generation phase. At this stage, the well-known OpenPose library is exploited so that the main joints of a human skeleton can be extracted from RGB video sequences. In detail, this library leverages a multi-stage CNN to generate limb Part Affinity Fields (PAFs), a set of 2D vector fields encoding the location and orientation of limbs over the image domain. By exploiting these fields, OpenPose is able to produce accurate skeletons from RGB frames and track them inside RGB videos. The extensive list of skeleton joints, representing body-foot position estimations, is depicted in Figure 2. As can be seen, up to 25 joints are described for a skeleton, where a joint is defined by its (x, y) coordinates inside the RGB frame. The identified joints correspond to: nose (0), neck (1), right/left shoulder, elbow, wrist (2–7), hips middle point (8), right/left hip, knee, ankle, eye, ear, big toe, small toes, and heel (9–24). While joint positions alone do not provide useful information, due to their strict correlation to the video they are extracted from, they can still be used to generate a detailed description of body movements via the feature extraction module.
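As an illustration only, the following minimal Python sketch shows one way the output of this stage can be organized for the subsequent feature extraction module; the array layout and the joint-index constants are assumptions based on the BODY_25-style ordering listed above, not code from the original implementation.

```python
import numpy as np

# One skeleton per frame: 25 joints, each with (x, y) pixel coordinates.
# A whole video sequence is therefore a (T, 25, 2) array, with T = number of frames.
NOSE, NECK, MID_HIP = 0, 1, 8
R_SHOULDER, R_ELBOW, R_WRIST = 2, 3, 4
L_SHOULDER, L_ELBOW, L_WRIST = 5, 6, 7
R_HIP, R_KNEE, R_ANKLE = 9, 10, 11
L_HIP, L_KNEE, L_ANKLE = 12, 13, 14
R_EYE, L_EYE, R_EAR, L_EAR = 15, 16, 17, 18
L_BIG_TOE, L_SMALL_TOE, L_HEEL = 19, 20, 21
R_BIG_TOE, R_SMALL_TOE, R_HEEL = 22, 23, 24

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two (x, y) joints."""
    return float(np.linalg.norm(a - b))

# Example: a dummy 50-frame sequence with random joint positions.
sequence = np.random.rand(50, 25, 2) * np.array([1920.0, 1080.0])
print(sequence.shape)  # (50, 25, 2)
```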
3.2. Feature Extraction
In this module, the skeleton joint positions identified via OpenPose are used to generate a detailed description of full-body movements. Skeleton joint positions can be exploited to produce meaningful features able to describe, for example, gait, using the lower half of the body as shown by several gait re-identification works [51,52], or body emotions, leveraging features extracted from both the lower and upper body halves [36]. Motivated by the encouraging results already obtained via hand-crafted skeleton features in References [34,35,36], two meta-feature groups built from skeleton joint positions are proposed in this work to analyze the whole body: a local set and a global set of meta-features.
Concerning the local set, a frame-by-frame analysis of body movements is provided via the upper and lower body openness; frontal and sideways head tilt; lower, upper, head, and composite body wobble; left and right limbs relative positions; cross-limb distance and ratio; head, arms, and legs location of activity; as well as lower, upper, and full-body convex triangulation meta-features, for a total of 21 local descriptors per frame. Through these features, gait cues and relevant changes associated with the entire body structure can be captured as they happen.
Regarding the global set, a condensed description of body characteristics and movements is specified via the head, arms, and legs relative movement; limp; as well as feet, legs, chest, arms, and head bone extension meta-features, for a total of 19 global descriptors. Through these features, biases towards specific body parts, which might become apparent during a walk, and detailed body conformations, describing, for example, possible limb length discrepancies, can be depicted.
3.2.1. Body Openness
Body Openness is employed as devised in a previous study [36]. This meta-feature describes both lower and upper body openness. The former is computed using the ratio between the ankles-hips distance and the left/right knees distance, indicating whether a person has an open lower body (i.e., open legs) or bent legs. Similarly, the upper body openness is calculated using the ratio between the left/right elbows distance and the neck-hips distance, thus capturing a broadened-out chest. Intuitively, low lower-body openness values depict bent legs, corresponding to a crouched position; mid-range values correspond to straight yet open legs; while high values denote straight but closed legs (i.e., a standing position). Comparably, low upper-body openness values indicate a broadened-out chest; mid-range values correspond to straight arms and torso; while high values depict open elbows and a bent torso. Formally, given a video sequence S, these quantities are computed for each frame using the Euclidean distance: the lower openness is the distance between the average ankle position (OpenPose joints 11 and 14) and the hips middle point (joint 8), divided by the distance between the left and right knees (joints 13 and 10); the upper openness is the distance between the left and right elbows (joints 6 and 3), divided by the distance between the neck (joint 1) and the hips middle point. Summarizing, Body Openness is a local meta-feature describing 2 quantities, that is, lower and upper body half openness.
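Since the original equations are not reproduced here, the following is a plausible numpy sketch of the two openness ratios as described in the text; the function name, the small constant guarding against division by zero, and the exact formulation are illustrative assumptions.

```python
import numpy as np

MID_HIP, NECK = 8, 1
L_KNEE, R_KNEE = 13, 10
L_ELBOW, R_ELBOW = 6, 3
L_ANKLE, R_ANKLE = 14, 11

def body_openness(frame: np.ndarray) -> tuple[float, float]:
    """frame: (25, 2) array of joint coordinates for a single frame."""
    d = lambda a, b: np.linalg.norm(frame[a] - frame[b])
    avg_ankles = (frame[L_ANKLE] + frame[R_ANKLE]) / 2.0
    # Lower openness: ankles-hips distance over left/right knee distance.
    bo_low = np.linalg.norm(avg_ankles - frame[MID_HIP]) / (d(L_KNEE, R_KNEE) + 1e-6)
    # Upper openness: left/right elbow distance over neck-hips distance.
    bo_up = d(L_ELBOW, R_ELBOW) / (d(NECK, MID_HIP) + 1e-6)
    return float(bo_low), float(bo_up)
```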
3.2.2. Head Tilt
Head Tilt reports the degree of frontal and sideways head tilt. The former is calculated exploiting the angle between the neck-hips and nose-neck axes. The latter is computed using the angle between the nose-neck and left-right eye axes. Intuitively, for positive frontal tilt values, the head is facing downward, while for negative values the head is facing upward. Similarly, for positive sideways tilt values the head is tilted to the right side, while for negative values there is a tilt to the left side. Formally, given a video sequence S, these measures are calculated for each frame from the slopes of the involved axes, where the nose, neck, hips middle point, left eye, and right eye correspond to OpenPose joints 0, 1, 8, 16, and 15, respectively, and the slope m of a given axis is computed as m = (y2 - y1)/(x2 - x1), where x and y indicate the coordinates of the two joints defining the axis. Summarizing, Head Tilt is a local meta-feature indicating 2 measures, that is, frontal and sideways head tilt.
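A possible numpy sketch of the two head tilt measures, derived from the textual description (axis slopes combined through the standard angle-between-lines formula), is given below; the sign conventions and helper names are assumptions.

```python
import numpy as np

NOSE, NECK, MID_HIP, L_EYE, R_EYE = 0, 1, 8, 16, 15

def slope(p1: np.ndarray, p2: np.ndarray) -> float:
    """Slope m = (y2 - y1) / (x2 - x1) of the axis through two joints."""
    return float((p2[1] - p1[1]) / (p2[0] - p1[0] + 1e-6))

def head_tilt(frame: np.ndarray) -> tuple[float, float]:
    """Frontal and sideways head tilt as signed angles (radians) between axes."""
    neck_hips = slope(frame[MID_HIP], frame[NECK])
    nose_neck = slope(frame[NECK], frame[NOSE])
    eye_axis = slope(frame[R_EYE], frame[L_EYE])
    # Angle between two axes, from their slopes.
    angle = lambda m1, m2: float(np.arctan((m1 - m2) / (1.0 + m1 * m2 + 1e-6)))
    ht_front = angle(neck_hips, nose_neck)  # > 0: facing downward (assumed convention)
    ht_side = angle(nose_neck, eye_axis)    # > 0: tilted to the right (assumed convention)
    return ht_front, ht_side
```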
3.2.3. Body Wobble
Body Wobble describes whether a person has an unsteady head, upper, or lower body part. Moreover, this meta-feature is also employed to indicate the composite wobbling degree of a person, obtained by accounting for the head and the upper and lower halves of the body. Similarly to the head tilt, the angles between the neck-hips axis and either the left-right eye, shoulder, or hip axes are exploited to compute these values. The composite body wobble is then computed by averaging the absolute values of the head, upper, and lower body wobble, to depict the general degree of body wobble. Intuitively, the head, upper, and lower body wobble describe toward which direction the corresponding body part is wobbling during a motion, while the composite value characterizes the movement by capturing possible peculiar walks (e.g., a drunk person usually wobbles more than a sober one). Formally, given a video sequence S, the head, upper, and lower body wobble are computed for each frame from the angles between the neck-hips axis and the left-right eye, shoulder, and hip axes, where the left eye, right eye, neck, hips middle point, left shoulder, right shoulder, left hip, and right hip correspond to OpenPose joints 16, 15, 1, 8, 5, 2, 12, and 9, respectively. Finally, the composite body wobble is derived by averaging the absolute values of the other three body wobble meta-features. Summarizing, Body Wobble is a local meta-feature denoting 4 values, that is, lower, upper, head, and composite body wobble.
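The following sketch illustrates one plausible way to compute the four wobble values from the axes and joints listed above; the formulation details are assumptions.

```python
import numpy as np

L_EYE, R_EYE, NECK, MID_HIP = 16, 15, 1, 8
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 2, 12, 9

def axis_angle(frame, a, b, c, d):
    """Signed angle between the axis through joints a-b and the axis through c-d."""
    m = lambda p, q: (frame[q][1] - frame[p][1]) / (frame[q][0] - frame[p][0] + 1e-6)
    m1, m2 = m(a, b), m(c, d)
    return float(np.arctan((m1 - m2) / (1.0 + m1 * m2 + 1e-6)))

def body_wobble(frame):
    """Lower, upper, head, and composite wobble for a single (25, 2) frame."""
    bw_head = axis_angle(frame, MID_HIP, NECK, R_EYE, L_EYE)
    bw_up = axis_angle(frame, MID_HIP, NECK, R_SHOULDER, L_SHOULDER)
    bw_low = axis_angle(frame, MID_HIP, NECK, R_HIP, L_HIP)
    # Composite wobble: average of the absolute head/upper/lower wobbles.
    bw_comp = (abs(bw_head) + abs(bw_up) + abs(bw_low)) / 3.0
    return bw_low, bw_up, bw_head, bw_comp
```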
3.2.4. Limbs Relative Position
Limbs Relative Position indicates the opposing arm and leg relative position with respect to the neck-hips axis (i.e., a vertical axis). This meta-feature is defined for the left arm/right leg and right arm/left leg pairs. Intuitively, an arm and the opposing leg tend to be synchronised when walking and oscillate together. Thus, by computing the difference between opposing limbs and the vertical neck-hips axis (i.e., their relative position), it is possible to describe whether the synchronous oscillation is happening or not. Formally, given a video sequence S, the relative positions are computed for each frame using the neck and hips middle point (OpenPose joints 1 and 8) to define the vertical axis, and the average positions of the left arm, right arm, left leg, and right leg, computed using joints (5, 6, 7), (2, 3, 4), (12, 13, 14), and (9, 10, 11), respectively. Summarizing, Limbs Relative Position is a local meta-feature depicting 2 measures, that is, the left-arm/right-leg and right-arm/left-leg relative positions.
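Since the exact formula is not reported here, the sketch below shows one plausible reading of this meta-feature, namely the difference between the perpendicular offsets of the two opposing limbs from the neck-hips axis; this interpretation, like all names, is an assumption.

```python
import numpy as np

NECK, MID_HIP = 1, 8
L_ARM, R_ARM = (5, 6, 7), (2, 3, 4)
L_LEG, R_LEG = (12, 13, 14), (9, 10, 11)

def signed_offset(point, axis_a, axis_b):
    """Signed perpendicular offset of a point from the axis through axis_a-axis_b."""
    v = axis_b - axis_a
    n = np.array([-v[1], v[0]]) / (np.linalg.norm(v) + 1e-6)  # unit normal to the axis
    return float(np.dot(point - axis_a, n))

def limbs_relative_position(frame):
    """Left-arm/right-leg and right-arm/left-leg positions relative to the neck-hips axis."""
    avg = lambda idx: frame[list(idx)].mean(axis=0)
    off = lambda p: signed_offset(p, frame[NECK], frame[MID_HIP])
    # One plausible formulation: difference between the offsets of opposing limbs.
    lrp_lr = off(avg(L_ARM)) - off(avg(R_LEG))
    lrp_rl = off(avg(R_ARM)) - off(avg(L_LEG))
    return lrp_lr, lrp_rl
```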
3.2.5. Cross-Limb Distance and Ratio
Cross-Limb Distance and Ratio denotes the cross distance between the left arm/right leg and right arm/left leg pairs, as well as their ratio. Similarly to the Limbs Relative Position meta-feature, it represents the synchronised oscillation of opposite limbs, although using only the average limb positions and their ratio instead of a reference axis. Intuitively, low cross-limb distances indicate a synchronous cross-limb oscillation, while high values denote an irregular movement. Concerning the ratio, the closer its value is to 1, the more synchronised is the oscillation. Indeed, through these meta-features, possible peculiar movements can be grasped during a motion. For example, arms held behind the back while walking would result in low synchronous oscillation and could be captured via low cross-limb distance and ratio values. Formally, given a video sequence S, the cross-limb distances are computed for each frame as the Euclidean distances between the average positions of the left arm, right arm, left leg, and right leg, computed using joints (5, 6, 7), (2, 3, 4), (12, 13, 14), and (9, 10, 11), respectively; the cross-limb distance ratio is then calculated as the ratio between the two cross distances. Summarizing, Cross-Limb Distance and Ratio is a local meta-feature describing 3 values, that is, the left-arm/right-leg and right-arm/left-leg cross-limb distances as well as their ratio.
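A minimal numpy sketch of the cross-limb distances and their ratio, following the textual description, is given below; the names and the guard against division by zero are illustrative.

```python
import numpy as np

L_ARM, R_ARM = (5, 6, 7), (2, 3, 4)
L_LEG, R_LEG = (12, 13, 14), (9, 10, 11)

def cross_limb(frame):
    """Cross distances between opposing limbs and their ratio, for one (25, 2) frame."""
    avg = lambda idx: frame[list(idx)].mean(axis=0)
    cl_lr = float(np.linalg.norm(avg(L_ARM) - avg(R_LEG)))  # left arm / right leg
    cl_rl = float(np.linalg.norm(avg(R_ARM) - avg(L_LEG)))  # right arm / left leg
    cl_ratio = cl_lr / (cl_rl + 1e-6)  # close to 1 => synchronised oscillation
    return cl_lr, cl_rl, cl_ratio
```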
3.2.6. Location of Activity
Location of Activity quantifies the movement of each body component and is computed for the head, left/right arms, and left/right legs. Intuitively, these meta-features analyse two consecutive frames to capture position changes and directly measure the movement of a given component, so that they can determine which part is being moved the most during the motion. For example, a person with arms flailing about would result in high left- and right-arm values. Formally, given a video sequence S, the set of components to be analysed, and the set containing all joints of a given body part, the location of activity of each component is computed for each pair of consecutive frames from the Euclidean distance travelled by its joints, where the head, left arm, right arm, left leg, and right leg components are composed of joints (0, 1, 15, 16, 17, 18), (5, 6, 7), (2, 3, 4), (12, 13, 14), and (9, 10, 11), respectively. Summarizing, Location of Activity is a local meta-feature computing 5 values, that is, head, arms, and legs location of activity.
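The sketch below illustrates a plausible frame-pair computation of the location of activity, using the average joint displacement of each component; the aggregation by mean is an assumption.

```python
import numpy as np

COMPONENTS = {
    "head": (0, 1, 15, 16, 17, 18),
    "left_arm": (5, 6, 7),
    "right_arm": (2, 3, 4),
    "left_leg": (12, 13, 14),
    "right_leg": (9, 10, 11),
}

def location_of_activity(prev_frame, curr_frame):
    """Per-component movement between two consecutive (25, 2) frames."""
    loa = {}
    for name, joints in COMPONENTS.items():
        idx = list(joints)
        # Average Euclidean displacement of the component joints.
        disp = np.linalg.norm(curr_frame[idx] - prev_frame[idx], axis=1)
        loa[name] = float(disp.mean())
    return loa
```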
3.2.7. Body Convex Triangulation
Body Convex Triangulation depicts the centre of mass distribution by analysing triangle structures built between the neck and wrist joints; the middle hips and ankle joints; as well as the head and feet joints. These meta-features represent the upper, lower, and full-body mass distribution, respectively. Intuitively, by analysing the difference between the ratios of the triangle inner angles, it is possible to describe whether a person is leaning left, leaning right, or has a balanced mass distribution (i.e., positive, negative, and 0 values, respectively) in relation to the upper, lower, and full-body triangles observed during a motion. Formally, given a triangle with vertices A, B, and C, the corresponding inner angles are first computed via the dot product and the magnitudes of the vectors connecting the vertices. Then, given a video sequence S, the difference between the ratios of the angles adjacent to the triangle base and the non-adjacent one is used to compute the lower, upper, and full-body values for each frame, where the lower-body triangle is built on the left ankle, right ankle, and hips middle point (OpenPose joints 14, 11, and 8); the upper-body triangle on the left wrist, right wrist, and neck (joints 7, 4, and 1); and the full-body triangle on the left heel, right heel, and nose (joints 21, 24, and 0). Summarizing, Body Convex Triangulation is a local meta-feature denoting 3 quantities, that is, lower, upper, and full-body convex triangulation.
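One plausible implementation of the convex triangulation measure is sketched below, reading the value as the difference between the two base-angle/apex-angle ratios; this reading, like the names, is an assumption since the original equation is not reproduced here.

```python
import numpy as np

def inner_angle(a, b, c):
    """Angle at vertex a of triangle (a, b, c), via the dot product."""
    v1, v2 = b - a, c - a
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))

# (left base vertex, right base vertex, apex) for lower, upper, and full-body triangles.
TRIANGLES = {"lower": (14, 11, 8), "upper": (7, 4, 1), "full": (21, 24, 0)}

def body_convex_triangulation(frame):
    bct = {}
    for name, (left, right, apex) in TRIANGLES.items():
        a_left = inner_angle(frame[left], frame[right], frame[apex])
        a_right = inner_angle(frame[right], frame[left], frame[apex])
        a_apex = inner_angle(frame[apex], frame[left], frame[right])
        # Difference of the two base-angle/apex-angle ratios:
        # > 0 leaning one way, < 0 the other, ~0 balanced mass distribution.
        bct[name] = (a_left - a_right) / (a_apex + 1e-6)
    return bct
```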
3.2.8. Relative Movement
Relative Movement describes the amount of change a given body part has with respect to the whole body, as conceived in Reference [36]. This meta-feature is computed for the head, left/right arm, and left/right leg. Intuitively, it first analyses position and velocity changes for each body component, then computes the ratio between a single body part position (or velocity) variation amount and the sum of the position (or velocity) changes of all body components. Formally, given a video sequence S, the set of components to be analysed, and the set containing all joints of a given body part, the average change of each component over the entire recording S is first computed, where the head, left arm, right arm, left leg, and right leg components are composed of joints (0, 1, 15, 16, 17, 18), (5, 6, 7), (2, 3, 4), (12, 13, 14), and (9, 10, 11), respectively. The relative movement of a component is then derived as the ratio between its average change and the sum of the average changes of all components. Summarizing, Relative Movement is a global meta-feature computed for both position and velocity changes of the head, arms, and legs, thus resulting in 10 distinct values describing the entire recording S.
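A sketch of the position-change variant of the relative movement is shown below (the velocity variant follows analogously by scaling displacements by the frame rate); the aggregation choices are assumptions.

```python
import numpy as np

COMPONENTS = {
    "head": (0, 1, 15, 16, 17, 18),
    "left_arm": (5, 6, 7),
    "right_arm": (2, 3, 4),
    "left_leg": (12, 13, 14),
    "right_leg": (9, 10, 11),
}

def relative_movement(sequence):
    """sequence: (T, 25, 2). Returns per-component position-change ratios."""
    changes = {}
    for name, joints in COMPONENTS.items():
        idx = list(joints)
        # Average displacement of the component joints over consecutive frames.
        step = np.linalg.norm(np.diff(sequence[:, idx, :], axis=0), axis=2)
        changes[name] = float(step.mean())
    total = sum(changes.values()) + 1e-6
    # Ratio between a single component change and the sum over all components.
    return {name: c / total for name, c in changes.items()}
```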
3.2.9. Limp
Limp (L) denotes whether or not a leg is moved less than the other one when walking. This meta-feature is calculated using the average velocity difference between the left and right legs over the entire video sequence. Intuitively, a limping person generally has a much lower velocity in one of the legs. Thus, this aspect can be captured by exploiting L, where a limp in the right or left leg is denoted via a negative or positive L value, respectively. Formally, given a video sequence S, the Limp L is computed from the joint sets of the left and right legs, defined by OpenPose joints (12, 13, 14) and (9, 10, 11), respectively, where the joint velocity can be easily computed using position variations and the video frames per second (FPS). Summarizing, L is a single-value global meta-feature indicating limp over the whole video sequence S.
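A minimal sketch of the limp value as the average leg-speed difference, with a sign convention consistent with the description above, is given below; the frame rate and the sign ordering are assumptions.

```python
import numpy as np

L_LEG, R_LEG = (12, 13, 14), (9, 10, 11)

def limp(sequence, fps: float = 25.0) -> float:
    """sequence: (T, 25, 2). Assumed convention: > 0 limp in the left leg, < 0 in the right."""
    def avg_speed(joints):
        idx = list(joints)
        # Joint velocities from position variations and the video frame rate.
        v = np.linalg.norm(np.diff(sequence[:, idx, :], axis=0), axis=2) * fps
        return float(v.mean())
    # Average velocity difference between the two legs over the whole sequence.
    return avg_speed(R_LEG) - avg_speed(L_LEG)
```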
3.2.10. Bones Extension
Bones Extension describes the left/right foot, left/right leg, chest, left/right arm, and head bone extensions. Intuitively, these meta-features provide a bone size estimation by exploiting the maximum distance between the end-points of a bone over the entire video sequence. Formally, given a video sequence S, the set of limb bones B, and the sets of joints describing the bones, each bone extension is computed by taking, over the whole sequence, the maximum Euclidean distance between adjacent joints of the given bone, where the left foot, right foot, left leg, right leg, chest, left arm, right arm, and head are defined by the OpenPose joint sets (19, 21), (22, 24), (12, 13, 14), (9, 10, 11), (1, 8), (5, 6, 7), (2, 3, 4), and (0, 1), respectively. Summarizing, Bones Extension is a global meta-feature defining bone length over feet, legs, chest, arms, and head, for a total of 8 distinct values over the entire recording S.
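A plausible sketch of the bone extension computation is given below, where multi-joint chains are handled by summing the per-segment maxima observed over the sequence; this chain handling is an assumption.

```python
import numpy as np

BONES = {
    "left_foot": (19, 21), "right_foot": (22, 24),
    "left_leg": (12, 13, 14), "right_leg": (9, 10, 11),
    "chest": (1, 8),
    "left_arm": (5, 6, 7), "right_arm": (2, 3, 4),
    "head": (0, 1),
}

def bones_extension(sequence):
    """sequence: (T, 25, 2). Maximum extension of each bone chain over the whole video."""
    ext = {}
    for name, chain in BONES.items():
        length = 0.0
        for a, b in zip(chain[:-1], chain[1:]):  # adjacent joints of the bone
            # Maximum observed segment length over all frames.
            seg = np.linalg.norm(sequence[:, a, :] - sequence[:, b, :], axis=1)
            length += float(seg.max())
        ext[name] = length
    return ext
```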
3.3. Bodyprint LSTM Hashing
The last step of the proposed framework is the construction of bodyprints through the binary coding technique. In this last module, the local meta-feature set is analysed through an LSTM so that the time variations of local meta-features can be fully captured. The LSTM output is then concatenated to the global meta-feature set, so that time-invariant body characteristics are merged together with time-varying ones. Finally, two dense layers are employed to implement the binary coding technique, so that a unique representation of a given person is built, ultimately allowing re-identification of that person.
3.3.1. LSTM
All local meta-features are generated for each frame of the input video sequence containing skeleton joint positions. While the proposed features provide a detailed local description of the motion (i.e., for each specific frame), they do not account for possible time correlations between different frames of the input sequence. To fully exploit this information, a single-layer LSTM network was chosen due to its inherent ability to handle input sequences [53]. This network leverages forget gates and peep-hole connections so that non-relevant information is gradually ignored, thus improving its ability to correlate both close and distant information in a given sequence. Formally, a generic LSTM cell at time t is described by the input gate, forget gate, output gate, cell activation, and hidden vectors; the weights connecting the various components to the input; the diagonal weights of the peep-hole connections; and the biases associated with the input gate, forget gate, output gate, and cell, where the Hadamard product is used for vector multiplication. To conclude, the LSTM output, that is, the last hidden state summarizing the motion characteristics of a T-length video sequence, is concatenated to the global meta-features, consequently producing a vector, whose size is the LSTM hidden state size plus the 19 global descriptors, representing a body motion.
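For illustration, a minimal PyTorch sketch of this stage is shown below; note that torch.nn.LSTM does not implement peep-hole connections, so this is only an approximation of the cell described above, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class BodyMotionEncoder(nn.Module):
    """Single-layer LSTM over local meta-features, concatenated with global ones.
    Approximation only: nn.LSTM lacks the peep-hole connections described in the paper."""

    def __init__(self, n_local: int = 21, n_global: int = 19, hidden: int = 200):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_local, hidden_size=hidden, batch_first=True)

    def forward(self, local_seq: torch.Tensor, global_feats: torch.Tensor) -> torch.Tensor:
        # local_seq: (batch, T, 21), global_feats: (batch, 19)
        _, (h_n, _) = self.lstm(local_seq)
        last_hidden = h_n[-1]  # (batch, hidden), summarises the T-length motion
        return torch.cat([last_hidden, global_feats], dim=1)  # (batch, hidden + 19)

# Example with the shapes reported in Section 4.1.
encoder = BodyMotionEncoder(hidden=200)
z = encoder(torch.randn(256, 50, 21), torch.randn(256, 19))
print(z.shape)  # torch.Size([256, 219])
```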
3.3.2. Bodyprint Hashing
The body motion characterization can be transformed into a unique identifier via binary coding hashing, so that a bodyprint is ultimately built and used for person re-identification. Following the approach in Reference [54], Supervised Deep Hashing (SDH) with a relaxed loss function is exploited to obtain a bodyprint binary code, produced by feeding the concatenated vector to two dense layers. The first layer is used to merge the concatenation of global and local meta-features, while the second one is used to obtain a binary code of dimension k. Intuitively, through SDH, similar body descriptions should result in similar codes and vice-versa, ultimately enabling a bodyprint to be used in re-identification since it is associated with a specific person.
The key aspect of this SDH approach is the relaxed loss function, where the Euclidean distance between the hashes of two samples is used in conjunction with a regularizer to relax the binary constraint. This relaxation is necessary because the sign function, usually used to obtain binary codes, leads to a discontinuous, non-differentiable problem that cannot be treated via back-propagation. Formally, given two input sequences and their corresponding bodyprints, a similarity label indicates whether the sequences are similar (i.e., they are derived from the same person) or not. The resulting relaxed loss for the two sequences is composed of three terms, involving the L1- and L2-norms, an all-ones vector, the element-wise absolute value operation, a similarity threshold m, and a parameter modulating the regularizer weight. In detail, the first term penalises sequences generated from the same person but mapped to different bodyprints; the second term punishes sequences of different persons encoded to close binary codes, according to the threshold m; while the third term is the regularizer exploited to relax the binary constraint. Supposing there are N pairs randomly selected from the training sequences, the loss function to be minimised is the sum of the relaxed losses over all pairs. While this function can be optimised via the back-propagation algorithm with the mini-batch gradient descent method, the subgradients of both the max and absolute value operations are non-differentiable at certain points. Thus, as described in Reference [54], the partial derivatives at those points are assigned fixed values and are computed accordingly for the three terms of Equation (32). To conclude, bodyprints can be computed by applying the sign function to the output of the last dense layer, and this final representation is then exploited for person re-identification, as extensively shown in the experimental section.
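The sketch below approximates the described hashing head and relaxed pairwise loss in PyTorch; the margin, the regularizer weight, the layer sizes, and all names are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class BodyprintHead(nn.Module):
    """Two dense layers producing a k-dimensional relaxed binary code (bodyprint)."""
    def __init__(self, in_dim: int = 219, k: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, in_dim)
        self.fc2 = nn.Linear(in_dim, k)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))  # relaxed (real-valued) code

def relaxed_hashing_loss(b1, b2, same_person, m=2.0, lam=0.01):
    """Pairwise relaxed loss: pull same-person codes together, push different
    ones at least m apart, and regularise codes towards +/-1."""
    d2 = ((b1 - b2) ** 2).sum(dim=1)                        # squared Euclidean distance
    pull = same_person * d2                                 # term 1: same person, far codes
    push = (1 - same_person) * torch.clamp(m - d2, min=0)   # term 2: different person, close codes
    reg = (b1.abs() - 1).abs().sum(dim=1) + (b2.abs() - 1).abs().sum(dim=1)  # term 3: regularizer
    return (0.5 * pull + 0.5 * push + lam * reg).mean()

# Usage sketch: training pair loss and inference-time bodyprints.
head = BodyprintHead()
x1, x2 = torch.randn(4, 219), torch.randn(4, 219)
loss = relaxed_hashing_loss(head(x1), head(x2), torch.tensor([1.0, 0.0, 1.0, 0.0]))
bodyprints = torch.sign(head(x1))  # binary codes used for re-identification
```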
4. Results and Discussion
In this section, the results obtained with the proposed method are discussed and compared with other state-of-the-art approaches. The comparisons are performed on three public video re-identification datasets, which are discussed below.
4.1. Datasets and Settings
Concerning the datasets, the experiments were performed on iLIDS-VID [38], PRID 2011 [39], and MARS [40]. The iLIDS-VID dataset comprises 600 image sequences belonging to 300 people, acquired by two non-overlapping cameras. Each sequence has a number of frames ranging from 23 to 192, with an average of 73. The PRID 2011 dataset consists of 400 image sequences belonging to 200 people, acquired by two adjacent cameras. Each sequence has a number of frames ranging from 5 to 675, with an average of 100. Finally, the MARS dataset consists of 1261 identities acquired by 2 to 6 cameras. The entire dataset contains 20,715 tracklets, of which 3248 are distractors. The chosen datasets are challenging for the following reasons. iLIDS-VID presents clothing similarities, occlusions, cluttered backgrounds, and variations across camera views. The main challenges of PRID 2011 are the lighting and, as for iLIDS-VID, the variations across camera views. MARS, instead, presents the above-mentioned challenges in addition to the distractors.
Concerning the model, it was implemented in PyTorch and the following parameters were used. For the LSTM input, a tensor with shape (256, 50, 21) was employed, where 256 is the batch size, 50 is the number of frames used for each subject, and 21 is the number of local meta-features. For the LSTM output, a tensor with shape (256, s) is instead used, where s is the LSTM hidden state size. The value of s was chosen empirically during the training of the model. As depicted in Figure 3, for iLIDS-VID and PRID 2011, that is, the datasets with the smaller number of identities, we used 200 as the value for s, to obtain better results than with s = 100 and to avoid the overfitting obtained with s = 500. Instead, for the MARS dataset, s was set to 500 due to the high number of identities. A value of s higher than 500 led to a high training time with a negligible accuracy improvement. In relation to the dense layers, the first one had an input size of s + 19, namely 119, 219, and 519, while the second dense layer used a dimension k, that is, the size of the bodyprint representation. Regarding the tensor precision, since the used meta-features comprise ratios, the tensor data type was set as a 32-bit float. For the relaxation parameter and the similarity threshold m, the latter was set to 2. The model was trained on an NVidia RTX 2080 GPU for 1000 epochs with a learning rate of 0.001.
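For reference, the settings reported above can be summarized in a hypothetical configuration dictionary; the key names are illustrative and not taken from the original code base.

```python
# Hypothetical summary of the reported training configuration.
CONFIG = {
    "batch_size": 256,
    "frames_per_subject": 50,
    "local_meta_features": 21,
    "global_meta_features": 19,
    "lstm_hidden_size": {"iLIDS-VID": 200, "PRID 2011": 200, "MARS": 500},
    "first_dense_input_sizes": [119, 219, 519],   # s + 19 for s in {100, 200, 500}
    "bodyprint_bits": [16, 32, 64],               # second dense layer dimension k
    "similarity_threshold_m": 2,
    "dtype": "float32",
    "epochs": 1000,
    "learning_rate": 1e-3,
}
```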
4.2. Experiment Analysis
In Table 1, the results obtained with the proposed method are compared with current key works of the state-of-the-art. In particular, the comparison was performed with deep network based methods, namely, RNN [42], CNN + XQDA + MQ [40], Spatial and Temporal RNN (SPRNN) [43], Attentive Spatial-Temporal Pooling Network (ASTPN) [44], Deep Siamese Network (DSAN) [50], PersonVLAD + XQDA [45], VRNN + KissME [47], and Superpixel-Based Temporally Aligned Representation (STAR) [49]. Regarding the proposed method, 5 different versions were used for comparison. Bodyprint_local uses only local features, Bodyprint_global uses only global features, while Bodyprint_k, with k in {16, 32, 64}, uses both local and global features but with different sizes of the bodyprint hashing. By first analysing the local-only and global-only versions of bodyprint, it is possible to observe that, for the iLIDS-VID dataset, performances are consistent with the state-of-the-art. For the PRID 2011 and MARS datasets, instead, there is a noticeable drop in performance. This can be associated with the higher number of identities and with the fact that the proposed model was designed to use local and global features synergistically. In general, the local-only version of the model performs better than the global-only one. This is ascribable to the fact that, thanks to their granularity, the local features have a better descriptive power, while the global features can turn out to be similar for different subjects. By considering, instead, the full bodyprint model, starting from the 16-bit hashing vector, the obtained ranking can overcome many state-of-the-art works.
Concerning iLIDS-VID and the state-of-the-art algorithm with the best rank-1 performance on it, that is, PersonVLAD, Bodyprint16 performs worse. However, with higher bodyprint vector sizes, that is, 32 and 64 bits, the performance of Bodyprint32 is in line with PersonVLAD, and there is an improvement when using the 64-bit bodyprint version. Moreover, by considering the second best algorithm, that is, ASTPN, the gain obtained with the 32- and 64-bit bodyprint vectors is 8.3% and 11.4%, respectively, which is an impressive result. For the PRID 2011 dataset, the Bodyprint16 and Bodyprint32 rank-1 results are in line with the other methods. In detail, Bodyprint16 is below SPRNN (i.e., the second best performing rank-1 method on this dataset), while Bodyprint32 is above it. For rank-5 and rank-20, both Bodyprint16 and Bodyprint32 are slightly below the SPRNN results. Regarding Bodyprint64, it is the best algorithm at rank-5, while it achieves the third best result at rank-1 and rank-20, for which PersonVLAD has the best results. Finally, considering the MARS dataset, at rank-1 Bodyprint16 and Bodyprint32 are in line with the state-of-the-art, while Bodyprint64 substantially outperforms the other literature works. Conversely, at rank-5, Bodyprint64 is the second best method after PersonVLAD. At rank-20, instead, Bodyprint64 is in line with the other works. Although the method performs generally well, there are some limitations that can influence the accuracy. These limitations are discussed in the next section.
4.3. Limitations
The proposed model presents two major limitations: occlusions and static frames (i.e., only one frame per subject). These two situations strongly influence the extraction and computation of features and meta-features, leading to worse model performance. Regarding occlusions, for the global features, the average value is lowered in proportion to the number of occluded frames. For the local features, instead, there are two cases. The first case occurs when the lower part of a subject is occluded, hence only the upper local features are available; in this case, 9 local features can be extracted. On the contrary, in the second case the upper part of the subject is occluded, allowing the extraction of the lower local features only; in this case, 5 local features are available. Although in the first case there are 4 more local features, the performance in both cases is almost identical. Since the proposed model has been designed to consider the whole body of a subject, some global features cannot be computed in case of occlusions, contributing to the lowering of the performance. Concerning static frames, all the meta-features that need more than one frame to be computed are set to 0. This means that those features lose their descriptive potential and, as for occlusions, there is a drop in performance. In detail, when used with static frames or on sequences with many occlusions, the rank-1 performance drops considerably.
A final remark must be made on the quality of the analysed images. Since the proposed method strongly relies on the OpenPose framework, an important requirement is that the frames containing the subjects must not be too small in terms of spatial resolution. Otherwise, OpenPose will not extract all the skeleton parts, leading to the same problems encountered with occlusions. The same does not hold for data acquired with depth cameras, since the skeleton is directly provided and not estimated from RGB images. Anyway, considering the very good results of the proposed system, and considering also that it is not based on visual features, thus overcoming a wide range of drawbacks of such systems, we can conclude that the system proposed in this paper can be considered a real contribution to the scientific community on this topic.