
1 Introduction

Video, as a carrier of information, is becoming increasingly popular and is widely used in many fields. Faced with massive volumes of video data, how to find the required video resources quickly and accurately is a current research hotspot. Shot segmentation is a key step in video analysis: the subsequent keyframe extraction and video summary generation depend mainly on its accuracy.

In recent years, researchers have proposed many shot segmentation methods, most of which use the difference between frames as the segmentation measure. The pixel-difference method of literature [1] is easy to understand and implement, but it is sensitive to camera motion, intensity changes and noise. For this reason, adaptive shot segmentation methods robust to camera motion and to illumination intensity are proposed in literature [2] and literature [3], respectively; their disadvantage is a high computational cost. To reduce this cost, a shot segmentation method based on a bag of visual words is proposed in literature [4], but it cannot detect gradual shots well. Literature [5] sets two thresholds, one for detecting gradual transitions and the other for detecting abrupt transitions. It detects gradual shots well, but still has shortcomings in detecting dissolve shots. To solve this problem, a method for detecting fade-in, fade-out and dissolve transitions based on the combination of double contrast and black-and-white frame detection is proposed in [6]. This method can segment gradual shots effectively, but it is very sensitive to illumination changes and fast object motion.

In addition, the above methods require a threshold to be determined in advance. However, owing to the diversity of shot types, it is difficult to set an appropriate threshold for shot switching. If shot switching in video data is regarded as a statistically meaningful state transition, changepoint detection can be applied to shot segmentation. Bayesian online changepoint detection [7] describes data transitions [8] by estimating the probability that incoming data points belong to the distribution associated with the run length since the previous changepoint. Although previous studies have applied changepoint methods to human-computer interaction systems [9], joint motion models [10], and physiological changes in medicine [11], changepoint detection has not yet been applied to shot segmentation. Therefore, on the basis of multi-feature fusion, this paper applies Bayesian online changepoint detection to the shot segmentation of video data and proposes a new shot segmentation method.

2 Multi-feature Fusion

2.1 HSV Quantitative Color Feature

The HSV color space histogram is selected as the frame description feature for shot segmentation. HSV is a color model that conforms to human visual characteristics. For storage efficiency, the feature is quantized at unequal intervals. First, the H, S and V components are divided into 8, 3 and 3 levels, respectively. Then the three color components are combined into a single quantity L according to formula (1), so that the image color is divided into 72 levels [12].

$$ L = 9H + 3S + V $$
(1)

The value of L ranges over \( 0,1, \cdots ,71 \), and the one-dimensional histogram of the image is obtained by counting the L values of its pixels. Denote the histogram feature vector of the ith frame as \( f_{i} = (a_{i1} ,a_{i2} , \cdots ,a_{il} ) \), where \( a_{ij} \in [0,1] \) is a normalized proportion, \( i = 1,2, \cdots ,n \), \( j = 1,2, \cdots ,l \), \( n \) is the total number of frames in the video, and \( l = 72 \) is the dimension of the histogram vector. The normalization formula is as follows:

$$ a_{ij} = A_{ij} /\sum\limits_{j = 1}^{l} {A_{ij} } $$
(2)

where \( A_{ij} \) is the number of pixels of the ith frame whose L value is \( j - 1 \).
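As an illustration, the following Python sketch computes this 72-bin feature for a single frame; the uniform bin boundaries and the OpenCV value ranges are assumptions here, since [12] quantizes at unequal intervals.

```python
import cv2
import numpy as np

def hsv_histogram(frame_bgr: np.ndarray) -> np.ndarray:
    """Quantized 72-bin HSV histogram of one frame, per formulas (1)-(2).

    Uniform splits into 8/3/3 levels are assumed; the paper quantizes
    at unequal intervals following [12].
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # OpenCV ranges: H in [0, 179], S and V in [0, 255].
    hq = np.minimum(h // 23, 7)   # 8 hue levels, 0..7
    sq = np.minimum(s // 86, 2)   # 3 saturation levels, 0..2
    vq = np.minimum(v // 86, 2)   # 3 value levels, 0..2
    L = 9 * hq.astype(np.int64) + 3 * sq + vq   # formula (1), L in 0..71
    hist = np.bincount(L.ravel(), minlength=72).astype(np.float64)
    return hist / hist.sum()      # formula (2), normalized proportions a_ij
```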

2.2 The MEP Feature

To segment shots more accurately, three further features are extracted: the color mean (M), the image information entropy (E), and the number of eigenvalues greater than 1 (P) in the image correlation coefficient matrix used by Principal Component Analysis (PCA). These three features are combined linearly to obtain the final MEP comprehensive feature.

Assuming that the resolution of a frame is \( N_{w} \times N_{h} \) and that an arbitrary pixel of the ith frame is \( (x_{p} ,y_{q} ) \), the pixel mean of the ith frame [13] is:

$$ m_{i} = \frac{1}{{N_{w} \times N_{h} }}\sum\limits_{p = 1}^{{N_{w} }} {\sum\limits_{q = 1}^{{N_{h} }} {f_{i}^{'} (x_{p} ,y_{q} )} } $$
(3)

where \( f_{i}^{'} (x_{p} ,y_{q} ) \) represents the pixel value at the pixel coordinate point \( (x_{p} ,y_{q} ) \) in the ith frame.

Image information entropy [14] describes the average amount of information in an image. In this paper, the color image is first converted to a grayscale image, and the information entropy of the grayscale image is then calculated by formula (4). The information entropy of the ith frame is:

$$ e_{i} = - \sum\limits_{j = 1}^{N} {p_{ij} } \log p_{ij} $$
(4)

where \( p_{ij} \) is the proportion of pixels of the ith frame whose gray value is \( j - 1 \), \( j = 1,2, \cdots ,N \), with \( N = 256 \) gray levels, and \( e_{i} \) is the information entropy of the ith frame.

PCA [15] transforms a set of possibly correlated variables into a set of linearly independent variables by an orthogonal transformation. In this paper, image information is characterized by the number of eigenvalues greater than 1 in the correlation coefficient matrix of PCA. First, the grayscale image matrix is standardized, i.e., the mean value is subtracted from each element, giving \( x_{jk} ,j = 1,2, \cdots N_{w} ,k = 1,2, \cdots N_{h} \). Second, the correlation coefficient matrix of the standardized data \( X = (x_{jk} )_{{N_{w} \times N_{h} }} \) is calculated. Then the eigenvalues of the correlation coefficient matrix are computed by the Jacobi method. Finally, the number of eigenvalues greater than 1 is counted and recorded for the ith frame as \( p_{i} \).

After the pixel mean \( m_{i} \), the information entropy \( e_{i} \) and the eigenvalue count \( p_{i} \) of the ith frame are calculated, the three quantities are normalized. The normalized features are denoted \( M_{i} ,E_{i} ,P_{i} \), and the MEP comprehensive feature of the ith frame is \( F_{i} = (M_{i} ,E_{i} ,P_{i} ) \).
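A minimal per-frame sketch of this computation is given below; `np.linalg.eigvalsh` stands in for the Jacobi iteration, the natural logarithm is assumed in (4), and the normalization across the whole video is left to the caller.

```python
import numpy as np

def mep_features(gray: np.ndarray) -> tuple:
    """Un-normalized MEP components of one uint8 grayscale frame."""
    # Formula (3): pixel mean over the N_w x N_h frame.
    m = float(gray.mean())
    # Formula (4): entropy of the 256-level gray histogram.
    p = np.bincount(gray.ravel(), minlength=256) / gray.size
    nz = p[p > 0]                       # 0 log 0 is taken as 0
    e = float(-(nz * np.log(nz)).sum())
    # Eigenvalues of the correlation coefficient matrix of the image
    # columns; constant columns would give NaNs, so they are zeroed.
    corr = np.nan_to_num(np.corrcoef(gray.astype(np.float64), rowvar=False))
    eig = np.linalg.eigvalsh(corr)      # replaces the Jacobi iteration
    return m, e, int((eig > 1).sum())
```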

2.3 Similarity Measure

The Euclidean distance is used to measure the similarity between two image feature vectors. Taking the HSV quantized color feature as an example, given the feature vectors \( f_{i} ,f_{i + 1} \) of two adjacent frames, the similarity between them is calculated as:

$$ s_{i}^{HSV} (f_{i} ,f_{i + 1} ) = \sqrt {\sum\limits_{j = 1}^{l} {(a_{ij} - a_{(i + 1)j} )^{2} } } $$
(5)

Thus the HSV color similarity sequence of adjacent frames is \( S_{HSV} = [s_{1}^{HSV} ,s_{2}^{HSV} , \cdots s_{n - 1}^{HSV} ] \). Similarly, the MEP similarity sequence of adjacent frames is \( S_{MEP} = [s_{1}^{MEP} ,s_{2}^{MEP} , \cdots s_{n - 1}^{MEP} ] \). The comprehensive inter-frame similarity is then constructed as:

$$ S = S_{HSV} + S_{MEP} $$
(6)

where the plus sign denotes the element-wise sum of the two vectors:

$$ S = [s_{1} ,s_{2} , \cdots s_{n - 1} ],\,s_{i} = s_{i}^{HSV} + s_{i}^{MEP} ,\,i = 1,2, \cdots n - 1 $$
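Given the per-frame features stacked row-wise, this similarity sequence reduces to a few vectorized lines, sketched below under the assumption that the inputs are already normalized.

```python
import numpy as np

def similarity_sequence(hsv_hists: np.ndarray, mep_feats: np.ndarray) -> np.ndarray:
    """Comprehensive adjacent-frame similarities s_1..s_{n-1}.

    hsv_hists: (n, 72) normalized histograms f_i.
    mep_feats: (n, 3) normalized MEP feature vectors F_i.
    """
    s_hsv = np.linalg.norm(np.diff(hsv_hists, axis=0), axis=1)  # formula (5)
    s_mep = np.linalg.norm(np.diff(mep_feats, axis=0), axis=1)
    return s_hsv + s_mep                                        # formula (6)
```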

3 Bayesian Online Changepoint Detection for Shots Segmentation

3.1 Bayesian Online Changepoint Detection

The objective is to separate different shots using the features \( s_{1} ,s_{2} , \cdots s_{n - 1} \) (denoted \( s_{1:n - 1} \)); the boundary between two shots is called a changepoint. To determine these shot boundaries, the run length method [8] is used. It is based on Bayes' theorem and assumes that the changepoints are generated by a random process and that the data between changepoints are independent and identically distributed. If the run length \( r_{t} \) falls to zero, a changepoint has occurred; otherwise, the run length increases by 1. Assuming that the predictive distribution \( P(s_{t + 1} \left| {r_{t} ,s_{t}^{(r)} } \right.) \) can be calculated given the run length \( r_{t} \), the posterior distribution \( P(r_{t} \left| {s_{1:t} } \right.) \) can be marginalized over the current run length to obtain the marginal predictive distribution \( P(s_{t + 1} \left| {s_{1:t} } \right.) \), as shown in formula (7):

$$ P(s_{t + 1} \left| {s_{1:t} } \right.) = \sum\limits_{{r_{t} }} {P(s_{t + 1} \left| {r_{t} ,s_{t}^{(r)} } \right.)} P(r_{t} \left| {s_{1:t} } \right.) $$
(7)

and the posterior distribution is:

$$ P(r_{t} \left| {s_{1:t} } \right.) = \frac{{P(r_{t} ,s_{1:t} )}}{{P(s_{1:t} )}} $$
(8)

\( s_{t}^{(r)} \) denotes the data associated with run length \( r_{t} \). Furthermore, to obtain \( P(r_{t} ,s_{1:t} ) \), the joint distribution \( P(r_{ti} ,s_{1:t} ) \) over the candidate run lengths \( r_{ti} \in r_{t} , \, i = 1,2, \cdots t \) is estimated iteratively; after normalization, the run length distribution satisfies \( \sum\limits_{i = 1}^{t} {P(r_{ti} \left| {s_{1:t} } \right.)} = 1 \).

The run length distribution is then maximized at each step. If the most probable run length is \( r_{t} = 0 \) (i.e., the \( i = t \) element of the distribution has the highest probability), a changepoint has occurred and a shot boundary is detected; otherwise, no changepoint has occurred, no shot boundary is detected, and the run length grows to \( r_{t} = r_{t - 1} + 1 \). The run length distribution can be expressed as:

$$ P(r_{ti} \left| {s_{1:t} } \right.) = \frac{{P(r_{ti} ,s_{1:t} )}}{{\sum\limits_{{r_{ti} }} {P(r_{ti} ,s_{1:t} )} }} $$
(9)

The joint distribution \( P(r_{ti} ,s_{1:t} ) \) of run length \( r_{t} \) at time t and observation data \( s_{1:t} \) can be updated recursively online according to formula (10):

$$ \begin{aligned} P(r_{ti} ,s_{1:t} ) & = \sum\limits_{{r_{(t - 1)i} }} {P(r_{ti} ,r_{(t - 1)i} ,s_{1:t} )} \\ & = \sum\limits_{{r_{(t - 1)i} }} {P(r_{ti} ,s_{t} \left| {r_{(t - 1)i} ,s_{1:t - 1} } \right.)} P(r_{(t - 1)i} ,s_{1:t - 1} ) \\ & = \sum\limits_{{r_{(t - 1)i} }} {P(r_{ti} \left| {r_{(t - 1)i} } \right.)} P(s_{t} \left| {r_{(t - 1)i} ,s_{t - 1}^{(r)} } \right.)P(r_{(t - 1)i} ,s_{1:t - 1} ) \\ \end{aligned} $$
(10)

where

$$ P(r_{ti} \left| {r_{(t - 1)i} } \right.) = \left\{ {\begin{array}{*{20}l} {H(r_{(t - 1)i} + 1) ,} \hfill & {r_{ti} = 0} \hfill \\ {1 - H(r_{(t - 1)i} + 1),} \hfill & {r_{ti} = r_{(t - 1)i} + 1} \hfill \\ {0,} \hfill & {others} \hfill \\ \end{array} } \right. $$
(11)

\( H( \cdot ) \) is the hazard function, usually taken to be a constant. The predictive distribution \( P(s_{t} \left| {r_{(t - 1)i} ,s_{1:t - 1}^{{}} } \right.) \) depends only on the latest data \( s_{t - 1}^{(r)} \) and can be obtained by marginalizing the parameters of a conjugate exponential-family model [16].

3.2 Shot Segmentation

For the comprehensive similarity sequence \( s_{1} ,s_{2} , \cdots s_{n - 1} \) of adjacent frames, the Bayesian online changepoint detection algorithm is employed to detect shot boundaries. A Student's t-distribution is used as the predictive distribution. The specific algorithm is as follows.

Step 1. Initialize the parameters \( \alpha_{1}^{(0)} ,\beta_{1}^{(0)} ,\kappa_{1}^{(0)} ,u_{1}^{(0)} \); the corresponding initial mean is \( \mu_{1}^{(0)} = u_{1}^{(0)} \), and the initial variance \( \sigma^{2} \) and degrees of freedom \( \upsilon \) are:

$$ \sigma_{1}^{2\left( 0 \right)} = \frac{{\beta_{1}^{\left( 0 \right)} (\kappa_{1}^{\left( 0 \right)} + 1)}}{{\alpha_{1}^{\left( 0 \right)} \kappa_{1}^{\left( 0 \right)} }}\,,\upsilon_{1}^{\left( 0 \right)} = 2\alpha_{1}^{\left( 0 \right)} $$
(12)

and the run length distribution is initialized as \( P(r_{0i} ) = 1 \);

Step 2. Read in the data \( s_{t} \) and use the t-distribution to calculate the prediction function:

$$ \begin{aligned} \xi_{t}^{(r)} & = P\left( {s_{t} \left| {\mu_{t}^{(r)} } \right.,\sigma_{t}^{2\left( r \right)} ,\upsilon_{t}^{(r)} } \right) \\ & = \frac{{\Gamma (\frac{{\upsilon_{t}^{(r)} + 1}}{2})}}{{\sqrt {\upsilon_{t}^{(r)} \pi \sigma_{t}^{2\left( r \right)} }\,\Gamma (\frac{{\upsilon_{t}^{(r)} }}{2})}}\left( {1 + \frac{{\left( {s_{t} - \mu_{t}^{(r)} } \right)^{2} }}{{\upsilon_{t}^{(r)} \sigma_{t}^{2\left( r \right)} }}} \right)^{{ - \frac{{\upsilon_{t}^{(r)} + 1}}{2}}} \\ \end{aligned} $$
(13)

where \( \Gamma \) is the Gamma function.

Step 3. For \( i = 1 \) to \( t - 1 \), calculate the growth probability:

$$ P(r_{ti} ,s_{1:t} ) = P(r_{(t - 1)i} ,s_{1:t - 1} )\xi_{t}^{(r)} (1 - H) $$
(14)

The hazard function is assumed to be constant, \( H = \lambda^{ - 1} \), where \( \lambda \) is a time-scale parameter; \( \lambda = 250 \) is usually taken.

Step 4. Calculate the changepoint probability (i.e. shot boundary probability):

$$ P(r_{tt} ,s_{1:t} ) = \sum\limits_{{r_{(t - 1)i} }} {P(r_{(t - 1)i} ,s_{1:t - 1} )\xi_{t}^{(r)} H} $$
(15)

Step 5. Calculate run length distribution:

$$ P(r_{ti} \left| {s_{1:t} } \right.) = \frac{{P(r_{ti} ,s_{1:t} )}}{{\sum\limits_{{r_{ti} }} {P(r_{ti} ,s_{1:t} )} }} $$
(16)

Step 6. Update the parameters \( \alpha ,\beta ,\kappa ,u \), the mean \( \mu \), the variance \( \sigma^{2} \) and the degrees of freedom \( \upsilon \) as follows:

$$ \begin{array}{*{20}l} {\alpha_{t + 1}^{{\left( {r + 1} \right)}} = \alpha_{t}^{\left( r \right)} + 0.5} \hfill \\ {\beta_{t + 1}^{{\left( {r + 1} \right)}} = \beta_{t}^{\left( r \right)} + \frac{{\kappa_{t}^{\left( r \right)} \left( {s_{t}^{\left( r \right)} - \mu_{t}^{\left( r \right)} } \right)^{2} }}{{2\left( {\kappa_{t}^{\left( r \right)} + 1} \right)}}} \hfill \\ {\kappa_{t + 1}^{{\left( {r + 1} \right)}} = \kappa_{t}^{\left( r \right)} + 1} \hfill \\ {u_{t + 1}^{{\left( {r + 1} \right)}} = \mu_{t + 1}^{(r + 1)} = \frac{{\kappa_{t}^{\left( r \right)} \mu_{t}^{\left( r \right)} + s_{t} }}{{\kappa_{t}^{\left( r \right)} + 1}}} \hfill \\ {\sigma_{t + 1}^{{2\left( {r + 1} \right)}} = \frac{{\beta_{t + 1}^{{\left( {r + 1} \right)}} \left( {\kappa_{t + 1}^{{\left( {r + 1} \right)}} + 1} \right)}}{{\alpha_{t + 1}^{{\left( {r + 1} \right)}} \kappa_{t + 1}^{{\left( {r + 1} \right)}} }}} \hfill \\ {\upsilon_{t + 1}^{{\left( {r + 1} \right)}} = 2\alpha_{t + 1}^{{\left( {r + 1} \right)}} } \hfill \\ \end{array} $$
(17)

Step 7. If \( t = \mathop {\arg \hbox{max} }\limits_{i} P(r_{ti} \left| {s_{1:t} } \right.) \), a changepoint has occurred and the run length is reset to 0; otherwise, the run length grows to \( r_{t} = r_{t - 1} + 1 \);

Step 8. Repeat Steps 2–7 until no more data arrive. The positions of the returned changepoints are the locations where shot segmentation occurs in the video.
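A minimal Python sketch of Steps 1–8 follows. It stores run length 0 at array index 0 (whereas the paper indexes the changepoint entry as i = t) and omits numerical refinements such as log-space computation and run-length pruning; the default parameters are those selected in Sect. 4.

```python
import numpy as np
from scipy.stats import t as student_t

def bocd_shot_boundaries(s, alpha0=0.1, beta0=0.01, kappa0=0.1, mu0=0.5, lam=250.0):
    """Sketch of Bayesian online changepoint detection over the
    similarity sequence s_1..s_{n-1}, returning declared boundaries."""
    H = 1.0 / lam                           # constant hazard, formula (11)
    # Sufficient statistics, one entry per candidate run length.
    alpha = np.array([alpha0]); beta = np.array([beta0])
    kappa = np.array([kappa0]); mu = np.array([mu0])
    R = np.array([1.0])                     # Step 1: P(r_0) = 1
    boundaries = []
    for t_idx, x in enumerate(s, start=1):
        # Step 2: Student-t predictive (13), with sigma^2 and nu from (12).
        scale = np.sqrt(beta * (kappa + 1) / (alpha * kappa))
        xi = student_t.pdf(x, df=2 * alpha, loc=mu, scale=scale)
        # Step 3: growth probabilities (14); Step 4: changepoint mass (15).
        growth = R * xi * (1 - H)
        cp = (R * xi * H).sum()
        # Step 5: normalized run length distribution (16).
        R = np.append(cp, growth)
        R /= R.sum()
        # Step 7: declare a boundary when run length 0 is most probable.
        if R.argmax() == 0:
            boundaries.append(t_idx)
        # Step 6: parameter updates (17); run length 0 restarts from the prior.
        beta_new = beta + kappa * (x - mu) ** 2 / (2 * (kappa + 1))
        mu_new = (kappa * mu + x) / (kappa + 1)
        alpha = np.append(alpha0, alpha + 0.5)
        beta = np.append(beta0, beta_new)
        kappa = np.append(kappa0, kappa + 1)
        mu = np.append(mu0, mu_new)
    return boundaries
```

Called on the comprehensive similarity sequence of Sect. 2.3, this returns the positions at which shot boundaries are declared.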

4 Experimental Results and Analysis

To verify the effectiveness of the proposed shot segmentation method, 50 videos are selected as experimental material. The video types include movies, news, advertisements, sports, historical films, animation, documentaries, man-made videos, animal videos and educational films. All videos were downloaded from the Internet in different formats and converted with MATLAB to AVI format at a resolution of \( 320 \times 240 \) and 30 frames per second. The Recall (R), Precision (P) and \( F_{1} \) measure (\( F_{1} \)) [17] are used as criteria to evaluate the shot detection performance of the method. They are defined as follows.

$$ R = \frac{{N_{c} }}{{N_{c} + N_{m} }} \times 100\% $$
(18)
$$ P = \frac{{N_{c} }}{{N_{c} + N_{f} }} \times 100\% $$
(19)
$$ F_{1} = \frac{2 \times R \times P}{R + P} \times 100\% $$
(20)

With the manually identified shot boundaries as the reference, \( N_{c} \) is the number of shot boundaries correctly detected by the method, \( N_{m} \) is the number of missed shot boundaries, and \( N_{f} \) is the number of falsely detected shot boundaries.
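The helper below is a direct transcription of formulas (18)–(20), returning fractions rather than percentages.

```python
def evaluate(n_correct: int, n_missed: int, n_false: int):
    """Recall, Precision and F1 from boundary counts N_c, N_m, N_f."""
    r = n_correct / (n_correct + n_missed)   # formula (18)
    p = n_correct / (n_correct + n_false)    # formula (19)
    f1 = 2 * r * p / (r + p)                 # formula (20)
    return r, p, f1
```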

When the Bayesian online changepoint detection method is used to segment the shots of a video, the initialization parameters are selected as follows, taking video 02, which contains both abrupt and gradual shots, as an example. To avoid a zero denominator in the variance update formula (12), let \( \alpha_{1}^{(0)} = \kappa_{1}^{(0)} = 0.1 \) and \( u_{1}^{(0)} = 0 \). Figure 1 shows the \( F_{1} \) value for \( \beta_{1}^{(0)} \in \{ 0.002,0.01,0.02,0.04,0.07,0.1,0.14,0.17\} \) when \( \alpha_{1}^{(0)} ,\kappa_{1}^{(0)} ,u_{1}^{(0)} \) are fixed.

Fig. 1. The value of \( F_{1} \) corresponding to different values of \( \beta_{1}^{(0)} \) at fixed \( \alpha_{1}^{(0)} ,\kappa_{1}^{(0)} ,u_{1}^{(0)} \)

As can be seen from Fig. 1, the value of \( F_{1} \) is higher in the \( (0,0.02] \) interval. Figure 2 shows the \( F_{1} \) value for \( u_{1}^{(0)} \in \{ 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7\} \) when \( \alpha_{1}^{(0)} = \kappa_{1}^{(0)} = 0.1 \) and \( \beta_{1}^{(0)} = 0.01 \).

Fig. 2. The value of \( F_{1} \) corresponding to different values of \( u_{1}^{(0)} \) at fixed \( \alpha_{1}^{(0)} ,\kappa_{1}^{(0)} ,\beta_{1}^{(0)} \)

As can be seen from Fig. 2, the value of \( F_{1} \) is higher in the \( [0.2,0.7] \) interval. Figure 3 shows the \( F_{1} \) value for \( \alpha_{1}^{(0)} \in \{ 0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7\} \) when \( \beta_{1}^{(0)} = 0.01 \), \( \kappa_{1}^{(0)} = 0.1 \) and \( u_{1}^{(0)} = 0.5 \).

Fig. 3. The value of \( F_{1} \) corresponding to different values of \( \alpha_{1}^{(0)} \) at fixed \( u_{1}^{(0)} ,\kappa_{1}^{(0)} ,\beta_{1}^{(0)} \)

As can be seen from Fig. 3, the value of \( F_{1} \) is higher in the \( [0.01,0.7] \) interval. Figure 4 shows the \( F_{1} \) value for \( \kappa_{1}^{(0)} \in \{ 0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8\} \) when \( \alpha_{1}^{(0)} = 0.01 \), \( \beta_{1}^{(0)} = 0.1 \) and \( u_{1}^{(0)} = 0.5 \). As can be seen from Fig. 4, the value of \( F_{1} \) is higher in the \( [0.01,0.4] \) interval.

Fig. 4. The value of \( F_{1} \) corresponding to different values of \( \kappa_{1}^{(0)} \) at fixed \( u_{1}^{(0)} ,\alpha_{1}^{(0)} ,\beta_{1}^{(0)} \)

Based on these observations, the initial parameters adopted in this paper for Bayesian online changepoint detection are \( \alpha_{1}^{(0)} = 0.1 \), \( \beta_{1}^{(0)} = 0.01 \), \( \kappa_{1}^{(0)} = 0.1 \) and \( u_{1}^{(0)} = 0.5 \). The first six videos are taken as examples below.

Table 1 gives the basic information of the first six videos.

Table 1. Basic information of the first six videos.

To illustrate the effectiveness of the proposed method, Fig. 5 shows the run length of video 01 obtained by the Bayesian online changepoint detection algorithm; positions where the run length drops to zero correspond to the positions where shot segmentation occurs. As Fig. 5 shows, Bayesian online changepoint detection is able to distinguish different shots.

Fig. 5. Run length of the video 01 segment

Taking the first six videos as examples, our method is compared with the classical Otsu algorithm [17], the method of literature [3], and the combination of the HSV feature alone with Bayesian changepoint detection. The experimental results are shown in Table 2.

Table 2. Comparison of the results of our method with those of other literature methods

Video segments 01, 03 and 04 contain only abrupt shots, and the accuracy of all four methods is relatively high: the average \( F_{1} \) of the Otsu algorithm is 0.90, that of the literature [3] method is 0.98, that of the HSV-plus-Bayesian method is 0.97, and that of our method is 0.95. For video containing only abrupt shots, therefore, our method outperforms only the Otsu algorithm and is slightly inferior to the literature [3] method and to the HSV-plus-Bayesian combination.

Video segments 02, 05 and 06 contain more gradual shots, and the picture changes between shots are more intense, so the Recall, Precision and \( F_{1} \) values all decrease. The average \( F_{1} \) of the Otsu algorithm is 0.80, that of the literature [3] method is 0.83, that of the HSV-plus-Bayesian method is 0.85, and that of our method is 0.92. For video with a large number of gradual shots, our method is superior to the other three algorithms; in particular, for gradual transition detection its false detection rate is clearly lower.

Table 3 gives the average Recall, average Precision and average \( F_{1} \) over all 50 videos (01–50). As can be seen from Table 3, the average \( F_{1} \) value of the Otsu algorithm is 0.72, that of the literature [3] method is 0.74, that of the HSV-plus-Bayesian method is 0.82, and that of our method is 0.85. The average \( F_{1} \) of our method is thus 0.13 higher than that of the Otsu algorithm, 0.11 higher than that of the literature [3] method, and 0.03 higher than that of the HSV-plus-Bayesian combination.

Table 3. Average Recall rate, average Precision rate and average \( F_{1} \) of different methods

5 Conclusion

In this paper, a new shot segmentation method is proposed. Bayesian online changepoint detection is applied to shot segmentation, which realizes online segmentation and avoids setting a traditional shot boundary threshold. To improve the accuracy of shot segmentation, a new MEP feature is proposed and fused with the HSV quantized color feature. Video sequences of different types are used to evaluate the proposed method. The experimental results show that it is superior to the classical Otsu-based adaptive threshold method, the literature [3] method, and the combination of the HSV feature with Bayesian changepoint detection. The next step is to apply this shot segmentation method to video summary generation, so as to achieve online video summarization.