Keywords

1 Introduction

Human motion analysis has been a hot research topic in multimedia research discipline, and is widely applied in athletic training, medical diagnostic test, video games and security monitoring, etc. [1,2,3,4,5,6]. Deep mining on human motion sequence is demanded in real world applications to meet actual needs (e.g. movement standardness evaluation in athletic training) [2, 7]. However, since the temporality and complexity of human motion, analyzing the whole sequence directly is a tough issue. Moreover, the motions of different classes will influence each other, which causes the decrease of analysis accuracy [8]. Hence, segmenting the human motion sequences can effectively facilitate the analysis of human motions and promote the analysis result of the sequences.

As a fundamental work in human motion sequence analysis, segmentation can facilitate the operations on human motion clips like clustering, classification, etc. [11, 12]. Effective and accurate segmentations on human motion sequences are of great significance in understanding the semantic information of human motion sequences. Researches on human motion sequences have made great progress. Researchers segment the human motion sequences from different perspectives to analyze the temporal property of the sequence. For instance, considering the correlation of the same class motions, clustering methods are applied in human motion sequence segmentation [13, 14]. On the other hand, based on the differences of motions from different classes, classification methods are introduced to human motions segmentations [15]. Some researchers segment the human motion sequences by detecting transition frames of the sequences. Dian et al. [16] proposed a kernel algorithm to detect the transition frames as well as the repetitive frames simultaneously of a motion sequence.

Apart from aforementioned segmentation methods, some methods realize human motions sequence segmentation by considering the property of each frame and analyze the physical properties [9, 10] of human motions. For example, Principle Component Analysis (PCA) approach to human motion segmentation employs the intrinsic dimension to realize segmentation, and assigns a cut when the intrinsic dimensionality of a local model of the motion suddenly increases [17]. Probability Principle Component Analysis (PPCA) to human motion segmentation analyzes the distributions of motions of the sequence to detect the segmentation points. That is PPCA approach places a cut when the distribution of poses is observed to change [17]. Time series-Warp Metric Curvature Segmentation (TS-WMCS) constructs a curvature like descriptor to evaluate the changes of human motion sequence, and detects the transition point when the local curvature of a frame changes dramatically [8].

In general, the key issue in human motion segmentation is to detect the segmentation points in the sequences. Note that, human motions will change a lot near the segmentation points [8, 17], this paper proposes a hashing based state variation segmentation (HBSVS) method for human motion sequence segmentation. HBSVS researches the human motion state variation by introducing the state machine theory to the human motion segmentation discipline and regarding the motion of each frame as a state. The states of motions from the same class will not change dramatically on the time series, but the variations of motion states will boom near the segmentation points for the appearance of new motion classes. Moreover, HBSVS employs a hashing representation method [18] to describe the state of each frame, and fully utilizes the ability of hashing methods in representing data similarities.

The main contributions of this paper can be summarized as follows.

Firstly, construct a state variation descriptor to represent the change of human motion.

Secondly, employ a data similarity based hashing method to describe the motion state. To our knowledge, it is the first time to utilize hashing method to realize human motion segmentation.

Thirdly, realize human motion segmentation from the perspective of motion state variation.

2 Related Works

2.1 PCA and PPCA Approaches to Human Motion Sequence Segmentation

PCA and PPCA approaches to human motion segmentation all focus on the motion poses, this section will briefly introduce the PCA and PPCA approaches. Given human motion sequence \(X = {\left[ {{x_1}, \cdots ,{x_n}} \right] ^T} \in {R^{n \times D}}\) with n frames and D dimensions. For each frame \({x_i}\), construct the neighborhood \({N_i}\) for \({x_i}\) with \(k-NN\) method.

PCA based human motion segmentation method assumes that the error between neighborhood \({X_i}\) of \({x_i}\) and it’s projection \({X'_i}\) will not vary largely even if frames belong to different motion classes. However, the error will be large for the frames in the transition clips. The error e can be expressed as following Eq. (1).

$$\begin{aligned} e = \sum \limits _{i = 1}^n {||{X_i} - X'_i|{|^2}} \end{aligned}$$
(1)

The PCA approach to human motion sequence segmentation works well if motions sequences of different classes have clear transition clips. However, the error will not change significantly for the continuity of real world motions, which causes the incomplete detection of PCA approach in finding all transition clips [8].

PPCA based approach to human montion segmentation utilizes the Probability Principle Component Analysis to evaluate the distribution of motion sequences. PPCA approach extends the traditional PCA approach with Gaussian distribution, and constructs the correlation covariance matrix C to represent the relations of different motions. The correlation covariance matrix C can be utilized to compute the average Mahalanobis distance H of the neighborhood of frame K, which can evaluate how likely are motion frames \(K\,+\,1\) through \(K\,+\,T\) to belong to the Gaussian distribution [17]. H can be caluculated by Eq.(2).

$$\begin{aligned} H = \frac{1}{T}\sum \limits _{i = K + 1}^{K + m} {{{({x_i} - \overline{x} )}^T}{C^{ - 1}}({x_i} - \overline{x} )} \end{aligned}$$
(2)

Thereinto, K is the number of the start frame and m is the length of a subsequence. PPCA based human motion sequence segmentation method captures the correlation in the motion of different joint angles as well as the variance of all joint angles [17]. However, the Gaussian distribution of the motions is too strong for real world applications, which will lower the segmentation accuracy of PPCA segmentation approach. Besides, the PCA and PPCA based human motion sequence segmentation methods are based on angles, which is hard to extend to real world applications [8].

2.2 Time Series-Warp Metric Curvature Segmentation (TS-WMCS) Algorithm

This section mainly introduces the Time series-Warp Metric Curvature Segmentation (TS-WMCS) algorithm to human motion segmentations. TS-WMCS utilizes a curvature like descriptor to depict the change degree of human motion sequences. Given human motion sequence \(X = {\left[ {{x_1}, \cdots ,{x_n}} \right] ^T} \in {\mathbb {R}^{n \times D}}\) with n frames. For each frame \({x_i}\), construct the neighborhood \({N_i}\) of \({x_i}\) with \(k-NN\) method. Then, construct the angle between the samples and the tangent space in each neighborhood. The warp degree of the data can be described by the angle between \({x_j}\) and its orthogonal projection, where \({x_j}\) is the neighbor of \({x_i}\). \({\alpha _{ij}} = {\alpha _j}\left( {{Q_i}} \right) \), where \({\alpha _{{{ij}}}} \in \left[ {0,\pi /2} \right] \). The local low dimensional space can be expanded by \({Q_i}\), and \({Q_i}\) can be obtained by the optimization equation Eq. (3),

$$\begin{aligned} \mathop {\arg \max }\limits _{{Q_i}} Tr\left( {Q_i^T{X_i}ZX_i^T{Q_i}} \right) \;s.t.\;Q_i^T{Q_i} = I \end{aligned}$$
(3)

where \({X_i}\) is the neighborhood of \({x_i}\) and Z is the normalization matrix of \({{X_i}X_i^T}\). Then, the curvature-like descriptor can be expressed as Eq. (4).

(4)

The transition points in the human motion sequences are those samples whose curvature-like descriptors are high. Apart from the curvature-like descriptor segmentation, TS-WMCS algorithm also reduces the dimension of original motion sequences, and utilizes the low dimensional temporal feature curves to assist the segmentation. For each neighborhood, compute the low dimensional embedding \({{\varTheta _{{k_i}}}}\) of \(X_i\), and then add affinity projection \(L_i\) for each \({{\varTheta _{{k_i}}}}\) of \(X_i\) to map the low dimensional embedding to the global embedding \(T = [{\tau _1},{\tau _2}, \cdots ,{\tau _n}]^T \in {R^{n \times d}}\). Hence, the low dimensional embedding T can be obtained by Eq. (5).

$$\begin{aligned} \mathop {\min }\limits _{{\tau _i},{L_i}} \sum \limits _{i = 1}^n {{c_i}{{\left\| {\left( {{T_{{k_i}}} - {{\tau }_i}e_{{k_i}}^T} \right) - {L_i}{\varTheta _{{k_i}}}} \right\| }^2}} \;s.t.\; T{T^T} = {I_d} \end{aligned}$$
(5)

where \(I_d\) is a \(d-by-d\) identity matrix, and \({e_{{k_i}}}\) is a vector with all ones.

TS-WMCS algorithm utilizes the curvature-like descriptor to segment the human motion sequences, which can effectively depict the changes of human motions. However, a certain move in complex motion sequences (e.g. “white crane spreads its wing” in tai chi) will contain motions which are quite different from each other. TS-WMCS algorithm can not yield satisfying result on these kind of motion sequences [8]. Moreover, TS-WMCS algorithm employ the low dimensional temporal feature curves to assist the segmentation, which do not have clear physical significance.

3 Hashing Based State Variation Segmentation Approach for Human Motion Segmentation

In this section, we will introduce the proposed Hashing based state variation segmentation (HBSVS) approach to human motion sequence segmentation. HBSVS constructs the state variation to describe the changing degree of motions in the motion sequences. Note that, motions of the same class changes slightly while motions of different classes change greatly. Hence, we construct the state variation to fully reveal the change of motions, and details are in Sect. 3.2. Considering the outstanding ability of hashing methods in mapping similar samples to the same hashing codes (hash bucket), we utilize hashing representation of each frame to represent the state of the motion. Here, we employ the widely utilized hashing method Locality Sensitive Hashing(LSH) [18] to represent the status. Detailed introduction of LSH is in Sect. 3.1.

3.1 Local Sensitive Hashing (LSH)

As a typical hashing method, local sensitive hashing (LSH) is widely adopted in various applications (e.g. image retrieval, video surveillance, etc.) [18, 19, 23, 24] with its high computational efficiency and satisfying performance.

LSH aims at constructing hashing functions to project similar samples to hamming space with smaller hamming distances. As for dissimilar samples, the constructed hashing functions should divide them apart in hamming space. After the hashing projection, the original data are mapped to a hashing table, and samples are aligned to different hashing buckets of the hashing table. Samples in the same bucket are more likely to be neighbors in the original space. In other words, original data set are divided into many subsets with the hashing functions, and similar samples or neighbors are mapped to the same subset. Hence, the original data are represented with a number of binary codes, and each binary representation can be regarded as a hashing bucket. Based on the binary representations, computation cost on the samples are reduced effectively.

Given samples x, y, and hashing function \(h(\cdot )\). To realize the goals of LSH, the hashing functions should satisfy the following constraints.

$$\begin{aligned} {\left\{ \begin{array}{l} if\;d(x,y) \le {d_1},\;then\;P(h(x) = h(y)) \ge {p_1}\\ if\;d(x,y) \ge {d_2},\;then\;P(h(x) = h(y)) \le {p_2} \end{array} \right. } \end{aligned}$$
(6)

Thereinto, \(d(\cdot )\) denotes the hamming distance of two binary representations and \({d_1}<{d_2}\), P(A) represents the probability of the occurrence of event A and \({p_1}>{p_2}\). If \(h(\cdot )\) satisfies the aforementioned constraints, then \(h(\cdot )\) can be regarded as \(({d_1},{d_2},{p_1},{p_2})-\)sensitive. To satisfy the constraints, we adopt the hashing functions adopted in [18], and utilize random projection to map the original data [18].

Random projection is powerful method in pattern recognition [19,20,21], which has been widely applied in dimensional reduction [20] and data clustering [21]. Based on random projection, similar data can be mapped closer in low-dimensional space, and will be encoded with similar hashing codes. On the other hand, dissimilar data will be projected apart either in the low-dimensional space or in hamming space. The low dimensional embedding \(Y = {\left[ {{y_1}, \cdots ,{y_n}} \right] ^T} \in {\mathbb {R}^{n \times d}}\) of X can be obtained as \(Y=X \times P\), where P is a random projection matrix like [19,20,21] and d denotes the dimension of Y. In general, the binary codes \(B = {\left[ {{b_1}, \cdots ,{b_n}} \right] ^T} \in {\left\{ {1, - 1} \right\} ^{n \times d}}\) of original data X can be obtained by LSH as follows:

$$\begin{aligned} {B_{ij}} = \left\{ {\begin{array}{*{20}{c}} {1,\;\;\;\;{y_{ij}} > 0}\\ { - 1,\;\;{y_{ij}} < 0} \end{array}} \right. \end{aligned}$$
(7)

where i is the index of the frame in a human motion sequence, j indicates the bit of binary code \({b_i}\).

3.2 Hashing Based State Variation Segmentation (HBSVS)

This section mainly introduces the way to construct state descriptor by LSH, and the way to evaluate the motion variation by HBSVS. Given human motion sequence X, for each frame \({x_i}\) we can obtain the corresponding hashing representation \({b_i} \in {\left\{ {1,0} \right\} ^c}\), and c is the length of binary representation.

The degree of motion changes reflects the correlation of human motions. The changes of motions from the same class will not vary largely on continuous frames of a sequence. However, the motions will change dramatically when motions of a new class appear. Based on this observation, we manage construct a motion change degree descriptor, which can effectively reflect the variation of human motions. Note that, hamming distance is always utilized to metric the similarity between two binary representations [18], and similar samples always have smaller hamming distance. According to the definition of hamming distance, binary representations of similar samples will have more identical binary bits. Hence, we utilize the binary bits to represent the states of human motions, which can maintain the fact that motions from the same class have more identical states.

Fig. 1.
figure 1

The sketch of HBSVS in constructing the state variation

Based on the hashing method, the state collection can be constructed as \(S=\{{s_1},{s_2},\cdots ,{s_c}\}\), and each state represents a bit of the binary representation. For the ith frame of the total motion sequence, we can construct state \({S_i}\) for \({x_i}\), and \({S_i}=\{{s_p}|{b_{ip}}=1,0<p<c\}\). Likely, the state for next frame \({x_{i+1}}\) can be represented by \({S_{i+1}}=\{{s_q}|{b_{(i+1)q}}=1,0<q<c\}\). From \({x_i}\) to \({x_{i+1}}\) on time series, states in \({S_i}=\{{s_p}|{b_{ip}}=1,0<p<c\}\) are activated, and the state change of two adjacent frames can be represented by \({V_i}=\{{s_p \rightarrow s_q}|{s_p}\in {S_i},{s_q}\in {S_{i+1}}\}\). Details can be captured from Fig. 1.

In human motion sequence, motions have the property of continuity as time goes by, especially for the motions from the same class. However, the initialized state collection S is not enough in representing the changing sate of a certain motion. Hence, we add the state change \({V_i}\) to the state collection. At each moment, the referred original states \(s_i\in S\) are activated as well as the changing state \({V_i}\). According to the state of next frame, we can construct new state changes. As for the repeated motions or similar motions in a sequence, the state changes have been constructed and will not construct much state changes. As is shown in Fig. 1, the change from ‘010110’ to ‘100110’ in gray rectangles has appeared, and HBSVS will not construct new state variation. Based on the mechanism, the state change variation can be measured by the following equation.

$$\begin{aligned} {M_i} = \frac{{{V_{i(new )}}}}{{{S_{i(activated )}}}} \end{aligned}$$
(8)

where \({{V_{i(new)}}}\) is the newly constructed state variation, \({{S_{i(acativate)}}}\) is the activated state in the ith frame, including the states of \((i-1)th\) frame and the referred changing state, as is shown in Fig. 1. Based on the state change variation measurement \(M_i\), we can effectively evaluate the change of motions on time series. As for the motions from the same class, since the state changes have been constructed at the start frames of a motion class. The number of newly constructed state variations will not be large. Nevertheless, when motions of a new class appears, many new state changes will be constructed. The main reason is that the states of motions from new class are quite different from that of the old class. Based on the state change variation \({M_i}\), we set a threshold \(\xi \) for to filter the significant state variation change. If \({M_i}>\xi \), add \({M_i}\) to \(Clips = \left\{ {{c_1}, \cdots ,{c_m}} \right\} \in {\mathbb {R}^{1 \times m}}\). Clips is the collection of transition clips.

Algorithm 4.1.

Input: Train data X

Output: Transition points Clips

1. Initialize the random projection matrix P for LSH.

2. Compute the low dimensional embedding Y for motion sequence X

   as \(Y=X \times P\).

3. Encode the binary representation for each frame \({x_i}\) with Eq. (7).

4. Generate the motion state variation \({M_i}\) for the motion sequence.

5. If \({M_i}>\xi \)

      Add \({M_i}\) to Clips;

   End

6. Output Clips.

4 Experiment Results

To fully validate the effectiveness of HBSVS method in segmenting human motion sequences, we conduct experiments on CMU motion capture database like many human motion segmentation methods [8, 13, 14, 16, 17]. Apart from the skeleton data of CMU motion capture database, we also conduct experiment on a real world video dataset UT-Interaction standard dataset [22]. Three methods are adopted as comparison methods to evaluate the segmentation performance of HBSVS, which are PCA, PPCA and TS-WMCS algorithms to human motion segmentation.

In order to verify the validity of the proposed algorithm in this paper, we employ two protocols utilized in [8, 17] to evaluate the performances of the proposed algorithm and the comparison algorithms. The protocols include precision \({P_p}=Z/Q\) and recall \({P_r} = Z/N\), where Z is the number of the segment points located in the right segment clips computed by the algorithm, Q is the total segments detected by the algorithms, and N is the number of the right segment clips. To comprehensively evaluate the performance of the segmentation methods, we also adopt the F-Measure as one of our protocols. \(F={P_p}*{P_r}*2/({P_p}+{P_r})\)

4.1 Experiments on CMU Motion Capture Database

CMU motion capture database contains common human motions of 144 subjects, which are represented with skeletons of 31 joints. In our experiment, we randomly select the motion sequences of trail 1 to trail 8 in subject 15 to evaluate the segmentation performance of the proposed method. The selected sequences range from 5524 frames to 22948 frames, and mainly contain daily behaviors (e.g. wall, wander, and wash window, etc.). Like in [8], we unify the coordinate of 31 joints in the sequences to a global coordinate of 93 dimensions.

HBSVS utilizes the state variation to evaluate the changes of human motion sequences, which can effectively detect the changes in motion sequences and avoid detecting the inner changes of motions from the same class. As is shown in Table 1, HBSVS outperforms PCA and PPCA approach significantly in both precision and recall. When compared to TS-WMCS, the segmentation recall of HBSVS is lower. The main reason is that HBSVS aims to provide more accurate cuts for the human motion sequence segmentation, and tries to reduce the meaningless cuts. Note that the segmentation precision of HBSVS is much higher than TS-WMCS, and the total cuts of TS-WMCS are about 6 times over that of HBSVS. Taken together, F-Measure of HBSVS is better than the other three segmentation methods, which validates the effectiveness of HBSVS.

Figure 2 represents the segmentation results on the \(12\_04\) motion sequence of the CMU database, which is a Tai Chi motion sequence. Motions of the same class in Tai Chi changes dramatically. However, we can find that HBSVS achieves satisfying segmentation results according to Fig. 2 and effectively reduces the meaningless cuts compared to other segmentation methods. The main reason is that HBSVS adpots hashing method to represent the state of human motions and focuses on the changing process, which can effectively reduce the influence of dramatic changes in a particular motion class. On the other hand, TS-WMCS, PCA and PPCA approach to human motion sequence segmentation only pay attentions to the change of a particular frame and neglect the influences of presented motions.

Table 1. Comparisons of precisions and recalls of 8 segmentation methods on CMU motion database
Fig. 2.
figure 2

Segmentation results on the 12_04 motion sequence of the CMU Database (TaiChi)

4.2 Experiment on UT-Interaction Standard Dataset

The UT-Interaction standard dataset contains videos of continuous executions of 6 classes of human-human interactions, including shake-hands, point, hug, push, kick and punch. All the videos in UT-Interaction standard dataset are around 1 min. 8 videos of the subset1 are selected in our segmentaion experiments.

Note that the continuity of different motions are high in real world motions, which influences the accuracy in cutting the motion sequences. HBSVS utilizes the state variation to evaluate the changes of human motion sequences, which can effectively detect the changes in motion sequences. From Table 2 and Fig. 3 we can draw the conclusion that HBSVS outperforms the other segmentaion methods both in accuracy and recall.

TS-WMCS utilizes the curvature-like descriptor to realize motion sequence segmentation, which can not yield satisfying result in complex situations. PCA and PPCA only focuses on the motion data, which neglect the physical property of motion sequences. Hence, PCA and PPCA can not achieve satisfying result on real-world applications. As is shown in Table 2 and Fig. 3, HBSVS outperforms the other comparison methods.

Fig. 3.
figure 3

Example of segmentation result on UT-Interaction standard video dataset

Table 2. Comparisons of precisions and recalls of 4 segmentation methods on UT-Interaction standard dataset

5 Conclusion

This paper proposes a new algorithm for human motion segmentation named hashing based state variation segmentation (HBSVS), which analyzes the human motion sequences by the motion changing degree. The proposed method employs the hashing method to construct state descriptors to reveal the motion states in the sequences. Based on the motion state, HBSVS focuses on the changing process of the human motions and constructs the state variation evaluation criterion. The constructed evaluation criterion can effectively reveal the changing points in the human motion sequences and avoid meaningless cuts in the motion sequence segmentation tasks. Experimental results validate the effectiveness of the proposed method in detecting the changing points compared to several state-of-the-art methods.