Keywords

1 Introduction and Related Work

Choosing appropriate performance measures plays an important role in developing effective information retrieval and classification systems. Common figures include the false positive and false negative rates, the precision and recall, and the F-measure which can all assess the accuracy of a prediction by comparing the predicted labels with given ground-truth labels. However, in applications such as information retrieval, it is often important to assess not only the accuracy of the predicted labels, but also that of a complete ranking of the samples. In classification, too, it is often preferable to evaluate the prediction accuracy at various trade-offs of precision and recall, to ensure coverage of multiple operating points. For both these needs, the average precision (a discretised version of the area under the precision-recall curve) offers a very informative performance measure.

Amongst the various flavours of classification, sequential labeling, or tagging, refers to the classification of each of the measurements in a sequence. It is a very important task in a variety of fields including video analysis, bioinformatics, financial time series and natural language processing [8]. Unlike the classification of independent samples, the typical sequential labeling algorithms such as Viterbi (including their n-best versions [7]) do not provide multiple predictions at varying trade-offs of precision and recall, and therefore the computation of their average precision is not trivial.

In the literature, a number of papers have addressed the average precision as a performance measure in the case of independent samples. For instance, [5] has studied the statistical behaviour of the average precision in the presence of relevance judgements. Yilmaz and Aslam in [15] have proposed an approximation of the average precision in retrieval systems with incomplete and imperfect judgements. Morgan et al. in [6] have proposed an algorithm for learning the weights of a search query with maximum average precision. Notably, Joachims et al. in [16] have proposed a learning algorithm that can efficiently train a support vector machine (SVM) under an average precision loss. However, all this work only considers independent and identically distributed (i.i.d.) samples, while very little work to date has addressed the average precision in sequential labeling and structured prediction. In [9], Rosenfeld et al. have proposed an algorithm for training structural SVM under the average precision loss. However, their algorithm assumes that the structured output variables can be ranked in a total order relationship which is generally restrictive.

For the above reasons, we propose a training algorithm that can train structural SVM for sequential labeling under an average precision loss. Our assumptions are very general and do not require ranking of the output space. The core component of our training algorithm is an inference procedure that returns sequential predictions at multiple levels of recall. The same inference procedure can also be used at test time, making it possible to evaluate the average precision of sequential labeling algorithms and to compare it with that of i.i.d. classifiers.

Experiments have been conducted over two challenging sequential datasets: the TUM Kitchen and the CMU-MMAC activity datasets [1, 11]. The results, reported in terms of average precision, show that the proposed method remarkably outperforms other performing classifiers such as standard SVM and structural SVM trained with conventional losses.

2 Background

2.1 Average Precision

The average precision (AP) is a de-facto standard evaluation in the computer vision community since the popular PASCAL VOC challenges [2]. It is defined as the average of the precision at various levels of recall and is a discretised version of the area under the precision-recall curve (AUC). The AP is a very informative measure since it assesses the classification performance at different trade-offs of precision and recall, reflecting a variety of operating conditions. Its formal definition is:

$$\begin{aligned} AP = \frac{1}{R} \sum _{r} \;\;\; p_{@r} \end{aligned}$$
(1)

where \(p_{@r}\) is the precision at level of recall r, and R is the number of levels. The recall ranges between 0 and 1, typically in 0.1 steps. At its turn, the precision at a chosen value of recall, \(p_{@r}\), is defined as:

$$\begin{aligned} p_{@r} = \frac{\; TP}{\; TP + FP} \quad s.t. \quad \frac{TP}{TP + FN} = r \end{aligned}$$
(2)

where TP, FP and FN are the number of true positives, the number of false negatives and the number of false positives, respectively, computed from the classification contingency table of the predicted and ground-truth labels.

In general, the precision tends to decrease as r grows. However, it is not a monotonically non-increasing function of r. To ensure monotonicity of the summand, Everingham et al. in [3] modified the definition of the AP as:

$$\begin{aligned} AP = \frac{1}{R} \sum _{r} \;\;\; \max _{l = 0 \ldots r} \; p_{@l} \end{aligned}$$
(3)

This way of computing the average precision has become commonplace in the computer vision and machine learning communities and it is therefore adopted in our experiments. However, the algorithm we describe in Sect. 3 can be used interchangeably for either (1) or (3). Given that the AP is bounded between 0 and 1, a natural definition for an AP-based loss is \(\varDelta _{AP} = 1 - AP\).

2.2 Sequential Labeling

Sequential labeling predicts a sequence of class labels, \(y = (y_1,\ldots ,y_t,\ldots ,y_T)\), from a given measurement sequence, \(x=(x_1,,\ldots ,x_t,\ldots ,x_T)\), where \(x_t\) is a feature vector at sequence position t and \(y_t\) is a corresponding discrete label, \(y_t \in {1 \ldots M}\). In many cases, it is not restrictive to assume that \(y_t\) is a binary label (1: positive class; 0: negative class), obtaining multi-class classification from a combination of binary classifiers. Therefore, in the following we focus on the binary case. The most widespread model for sequential labeling is the hidden Markov model (HMM) which is a probabilistic graphical model factorising the joint probability of the labels and the measurements. By restricting the model to the exponential family of distributions and expressing the probability in a logarithmic scale, the score of an HMM can be represented as a generalised linear model:

(4)

where \(w_{init}\) are the first-frame parameters, \(w_{tran}\) are the transition parameters, \(w_{em}\) are the emission parameters, and functions \(f(y_1)\), \(f(y_{t-1}, y_{t})\) and \(f(x_t, y_t)\) are arbitrary feature functions of their respective arguments. The inference problem for this model consists of determining the best class sequence for a given measurement sequence:

$$\begin{aligned} \bar{y} = \mathop {\mathrm{argmax}}\limits _{y} \; w^\top \phi (x,y) \end{aligned}$$
(5)

This problem can be efficiently solved in O(T) time by the well-known Viterbi algorithm operating in a logarithmic scale [8].

2.3 Structural SVM

SVM has been extended from independent (measurement, label) pairs to the prediction of structured labels, i.e. multiple labels that have mutual dependencies in the form of sequences, trees and graphs and that co-depend on multiple measurements [10, 12]. Given a set of N training instances \(\{x^i,y^i\},\ i=1,\ldots ,N\), structural SVM finds the optimal model’s parameter vector w by solving the following convex optimisation problem:

$$\begin{aligned} \begin{aligned}&\min _{w,\xi } \; \frac{1}{2} \Vert w\Vert ^2 + C \sum _{i=1}^N\xi ^i \quad s.t. \\&w^\top \phi (x^i,y^i) - w^\top \phi (x^i,y) \ge \varDelta (y^i,y) - \xi ^i, \\&\qquad i = 1 \dots N, \;\;\; \forall y \in \mathcal {Y} \end{aligned} \end{aligned}$$
(6)

As usual, term \(\sum _{i=1}^N\xi ^i\) places an upper bound over the total training error, while term \(\Vert w\Vert ^2\) regularises the solution to encourage generalisation. Parameter C is an arbitrary, positive coefficient that balances these two terms. In the constraints, function \(\phi (x,y)\) is a feature function that computes structured features from the pair \(\{x,y\}\) such that \(w^\top \phi (x,y)\) can assign a score to the pair. The constraint for labeling \(y = y^i\) guarantees that \(\xi ^i \ge 0\), and \(\varDelta (y^i,y)\) is the chosen, arbitrary loss function.

The problem with Eq. (6) is that the size of the constraint set, \(\mathcal {Y}\), is exponential in the number of of the output variables and it is therefore impossible to satisfy the full constraint set. However, [12] has shown that it is possible to find \(\epsilon \)-correct solutions with a constraint subset of polynomial size, consisting of only the “most violated” constraint for each sample, i.e. the labeling with the highest sum of score and loss:

$$\begin{aligned} \begin{aligned}&\xi ^i = \max _{y} (-w^\top \phi (x^i,y^i) + w^\top \phi (x^i,y) + \varDelta (y^i,y)) \\&\rightarrow \bar{y}^i = \mathop {\mathrm{argmax}}\limits _{y} (w^\top \phi (x^i,y) + \varDelta (y^i,y)) \end{aligned} \end{aligned}$$
(7)

This problem is commonly referred to as “loss-augmented inference” due to its resemblance to the usual inference of Eq. (5) and is the main step of structural SVM.

3 Training and Testing Sequential Labeling with the AP Loss

The loss functions used for training structural SVM commonly include the 0–1 loss and the Hamming loss. Under these losses, the loss-augmented inference can still be computed by a conventional Viterbi algorithm with adjusted weights. Instead, training with the average precision cannot be approached in the same way since it requires predicting either a ranking or multiple labelings. For this reason, we propose a different formulation of the structural SVM primal problem:

$$\begin{aligned} \begin{aligned}&\min _{w,\xi } \; \frac{1}{2} \Vert w\Vert ^2 + C \sum _{i=1}^N\xi ^i \quad s.t. \\&w^\top \phi (x^i,y^i) - \frac{1}{R} \sum _{r} w^\top \phi (x^i,y^{[r]})\\&\quad \ge \varDelta _{AP}(y^i,y^{[0]}, \ldots y^{[1]}) - \xi ^i, \xi ^i \ge 0, \;\; \;i = 1 \dots N,\\&\quad r = 0, 0.1, \dots 1, \;\;\; \forall y^{[0]} \ldots y^{[1]} \in \mathcal {Y}_0 \times \ldots \times \mathcal {Y}_1 \end{aligned} \end{aligned}$$
(8)

The constraints in Eq. (8) state that the score assigned to the ground-truth labeling, \(y^i\), must be greater than or equal to the average score of any set of R labelings at the appropriate levels of recall by at least their average precision loss. In this way, we retain the structural SVM principle of imposing a margin between the ground truth and the prediction that is equal to the chosen loss, while we constrain all the predictions at the prescribed levels of recall. At the same time, we cannot ensure that the hinge loss \(\xi ^i\) is an upper bound for \(\varDelta _{AP}(y^i,y^{[0]}, \ldots y^{[1]})\), and therefore the minimisation of the loss over the training set is only heuristic.

For Eq. (8), the loss-augmented inference becomes:

$$\begin{aligned} \begin{aligned}&\bar{y}^{[0]} \ldots \bar{y}^{[1]} = \mathop {\mathrm{argmax}}\limits _{y^{[0]} \ldots y^{[1]}} \left( \frac{1}{R} \sum _{r} w^\top \phi (x^i,y^{[r]}) + \varDelta _{AP}(y^i,y^{[0]}, \ldots y^{[1]}) \right) \\&= \mathop {\mathrm{argmax}}\limits _{y^{[0]} \ldots y^{[1]}} \left( \frac{1}{R} \sum _{r} w^\top \phi (x^i,y^{[r]}) + \frac{1}{R} \sum _{r} \varDelta _{p_{@r}} (y^i,y^{[r]}) \right) \\&= \mathop {\mathrm{argmax}}\limits _{y^{[0]}} (w^\top \phi (x^i,y^{[0]}) + \varDelta _{p_{@0}} (y^i,y^{[0]})), \ldots \mathop {\mathrm{argmax}}\limits _{y^{[1]}} (w^\top \phi (x^i,y^{[1]}) + \varDelta _{p_{@1}} (y^i,y^{[1]})) \\ \end{aligned} \end{aligned}$$
(9)

where we have made use of the definition of average precision from Eq. (1). Equation (9) shows an important property: that the R most violating labelings can be found independently of each other using the precision loss at the required level of recall. This is the key property for the algorithm we propose in the following sub-section.

3.1 Inference and Loss-Augmented Inference

Once the model is trained, testing it to report its AP requires, once again, the ability to produce a set of R predictions at the required levels of recall. Therefore, the key problems for both training and testing can be summed up, respectively, as:

$$\begin{aligned}&\mathop {\mathrm{argmax}}\limits _{y^{[r]}} \; (w^\top \phi (x^i,y^{[r]}) + \varDelta _{p_{@r}} (y^i,y^{[r]}))\end{aligned}$$
(10)
$$\begin{aligned}&\mathop {\mathrm{argmax}}\limits _{y^{[r]}} \; w^\top \phi (x^i,y^{[r]}) \end{aligned}$$
(11)

The algorithm we propose hereafter works interchangeably for both Eqs. (10) and (11), and also for the modified AP loss of Eq. (3). Given any ground-truth label sequence, \(y^i\), the degrees of freedom of the precision loss are only the number of false positives, FP, and false negatives, FN. By making a prediction in left-to-right order along the sequence, the running values of FP and FN can only increment or remain unchanged. We can thus still approach the solution of Eq. (10) by dynamic programming, extending the state of a partial solution to include: (a) the ground-truth label of the current frame, \(y_t\), as in conventional Viterbi; (b) the number of false positives, FP, in sub-sequence \(y_{1:t}\); and (c) the number of false negatives, FN, in sub-sequence \(y_{1:t}\). We use notation \(\psi (FP, FN, y_t)\) to indicate the \(y_{1:t}\) sub-sequence with the highest score for the given extended state, and \(s(\psi )\) for its score. The generic induction step is as follows: at any time step, t, a partial solution is obtained by extending two of the partial solutions of time \(t - 1\) with the current prediction, \(y_t\), and correspondingly incrementing either FP or FN if the prediction is incorrect, or neither if correct. After the final time step, T, Eq. (10) is computed over the stored sequences and the argmax returned. Algorithm 1 describes the solution formally.

figure a

4 Experiments

The proposed approach has been evaluated on two challenging datasets of human activities, TUM Kitchen and CMU Multimodal Activity (CMU-MMAC). Descriptions and results for these two datasets are reported in the following sub-sections. The compared algorithms include: (a) the proposed method based on the AP loss; (b) structural SVM using the common 0–1 loss and Hamming loss, and (c) a baseline offered by a standard SVM that classifies each frame separately. For SVM training, we have used constant \(C = 0.1\) (based on a preliminary cross-validation), the RBF kernel (for non-linearity), and, for SSVM, convergence threshold \(\epsilon = 0.01\) (default). For the AP loss, given the greater computational complexity of the loss-augmented inference (approximately quadratic for sequences with sparse positives), we decode each sequence in sub-sequences of 300 frames each. To develop the software, we have used the \(SVM^{struct}\) package and its MATLAB wrapper [4, 13]. All experiments have been performed on a PC with an Intel i7 2.4GHz CPU with 8 GB RAM.

4.1 Results on the TUM Kitchen Dataset

The TUM Kitchen dataset is a collection of activity sequences recorded in a kitchen equipped with multiple sensors  [11]. In the kitchen environment, various subjects were asked to set a table in different ways, performing 9 actions, Reaching, TakingSomething, Carrying, LoweringAnObject, ReleasingGrasp, OpeningADoor, ClosingADoor, OpeningADrawer and ClosingADrawer. For our experiments, we have chosen to use the motion capture data from the left and right hands. These data consist of 19 sequences for each hand, each ranging in length between 1, 000 and 6, 000 measurements. The first 6 sequences were used for training and the remaining for testing. Each measurement is a 45-D vector of 3D body joint locations. Figure 1a shows a scene from this dataset.

Table 1. Comparison of the average precision over the TUM Kitchen dataset. SVM: standard SVM baseline; 0–1 loss and Hamming loss: structural SVM with conventional loss functions; AP loss: proposed technique.

Table 1 reports the results for activity recognition from the left and right hand sequences. The table shows that the mean of the AP over the nine classes is the highest for the proposed technique, with an improvement of 4.2 % points over the runner-up for both the left and right hand sequences. In addition, the proposed technique reports the highest average precision in all the classes with the left hand sequences, and in 8 classes out of 9 with the right hand sequences. In addition, the average precision of the proposed technique is about double that of the standard SVM baseline that does not leverage sequentiality.

Fig. 1.
figure 1

Sample frames from (a) the TUM Kitchen dataset and (b) the CMU-MMAC dataset.

4.2 Results on the CMU Multimodal Activity Dataset

The CMU Multimodal Activity (CMU-MMAC) dataset contains multimodal measurements of the activities of 55 subjects preparing 5 different recipes: “brownies”, a salad, a pizza, a sandwich and scrambled eggs [1]. For our experiments, we have chosen to use the video clips of the 12 subjects preparing brownies from a dry mix box. The actions performed by the subjects are very realistic and are divided over 14 basic activities. The length of the 12 video clips ranges from 8, 000 to 20, 000 frames. For the experiments, we have used the first 8 videos for training and the remaining 4 for testing. For the feature vector of each frame, we have first extracted dense SIFT features at a 32-pixel step and used k-means with 32 clusters to generate a codebook. Then, the descriptors of each frame have been encoded into a 4, 096-D VLAD vector [14]. Figure 1b displays a scene from this dataset, showing that the kitchen environment and camera view are significantly different from TUM’s.

Table 2 reports the results for activity recognition over this dataset. The table shows that the mean of the AP is the highest for the proposed technique, with an improvement of 5.7 % points over the runner-up. In addition, the proposed technique reports the highest average precision for 12 classes out of 14, and more than doubles the SVM baseline.

Table 2. Comparison of the average precision over the CMU-MMAC dataset. SVM: standard SVM baseline; 0–1 loss and Hamming loss: structural SVM with conventional loss functions; AP loss: proposed technique.

5 Conclusion

The average precision has become a reference evaluation measure for its ability to assess performance at multiple operating points. However, the typical sequential labeling algorithms such as Viterbi do not allow the computation of the average precision. For this reason, in this paper, we have proposed an inference procedure that infers a set of predictions at multiple levels of recall and allows measuring the average precision of a sequential classifier. In addition, we have proposed a structural SVM training algorithm for sequential labeling that minimises an average precision loss. Experiments conducted over two challenging activity datasets - TUM Kitchen and CMU-MMAC - have shown that the proposed approach significantly outperforms all of the other compared techniques and more than doubles the performance of a baseline. Moreover, while we have only focused on sequential labeling in this paper, the proposed approach could readily be employed for more general structures such as trees and graphs.