Real time anomaly detection and categorisation

Alexander T. M. Fisch¹,
Lawrence Bardwell² &
Idris A. Eckley ORCID: orcid.org/0000-0001-9957-2460²

3119 Accesses
2 Citations
1 Altmetric
Explore all metrics

Abstract

The ability to quickly and accurately detect anomalous structure within data sequences is an inference challenge of growing importance. This work extends recently proposed post-hoc (offline) anomaly detection methodology to the sequential setting. The resultant procedure is capable of real-time analysis and categorisation between baseline and two forms of anomalous structure: point and collective anomalies. Various theoretical properties of the procedure are derived. These, together with an extensive simulation study, highlight that the average run length to false alarm and the average detection delay of the proposed online algorithm are very close to that of the offline version. Experiments on simulated and real data are provided to demonstrate the benefits of the proposed method.

A Methodological Overview on Anomaly Detection

Integrated Clustering and Anomaly Detection (INCAD) for Streaming Data

Anomaly Detection: A Survey

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The detection of anomalies in time series has received considerable attention in both the statistics (Chen and Liu 1993) and machine learning (Chandola et al. 2009) literature. This is no surprise given the broad range of applications, from fraud detection (Ferdousi and Maeda 2006) to fault detection (Theissler 2017; Zhao et al. 2018), that this area lends itself to. In recent years, the proliferation of sensors within the internet of things (IoT) has led to the emergence of real time detection of anomalies in streaming (high frequency) data as an important new challenge.

Anomalies can be classified in a number of different ways (Chandola et al. 2009). In this work, following the definitions of Fisch et al. (2022a), we distinguish between point and collective anomalies. Point anomalies, also known as outliers, global anomalies or contextual anomalies (Chandola et al. 2009), are single observations that are anomalous with regards to their local or global data context. Conversely, collective anomalies, also known as abnormal regions (Bardwell and Fearnhead 2017), or epidemic changepoints (Bruce and Jennie 1985), are sequences of contiguous observations which are not necessarily anomalous when compared to either their local or the global data context, but together form an anomalous pattern. Figure 1 provides several examples. In this paper, collective anomalies and epidemic changepoints are used interchangeably.

The epidemic changepoint model assumes that data follows some baseline, or typical distribution, everywhere except for some anomalous time windows during which it follows another distribution. The detection of epidemic changes in mean was first studied by Bruce and Jennie (1985) with applications to epidemiology. Since then, research in this area has been driven by various applications including the detection of copy number variants in DNA (Bardwell and Fearnhead 2017; Jeng et al. 2013; Olshen et al. 2004) and the analysis of brain imaging data (Aston and Kirch 2012; Stoehr et al. 2019). In particular, much of the pertinent literature has concentrated on the epidemic change in mean setting. See Yao (1993) and Gut and Steinebach (2005) for details.

More recently, the detection of joint epidemic changes in mean and variance as well as point anomalies was considered by Fisch et al. (2022a). In parallel, there has also been some work on detecting anomalies within the online setting. Gut and Steinebach (2005) consider the problem of detecting epidemic changepoints sequentially while Wang et al. (2011) and Ahmad et al. (2017) propose methods for the online detection of point anomalies. Other recent contributions include the MOSUM work by Kirch and collaborators (see, e.g., Eichinger and Kirch (2018)) that permits the online detection of changepoints, and is thus able to identify collective anomalies. Various outlier-robust Kalman filter approaches have also been proposed in recent years. See for example Ting et al. (2007), Agamennoni et al. (2011), Ruckdeschel et al. (2014), Chang (2014) and references therein. Fearnhead and Rigaill (2019) have introduced a computationally efficient changepoint detection approach that is robust to the presence of anomalies. Similarly the OSTAD package, developed by Iturria et al. (2020), can be used online to identify contextual anomalies.

The main contribution of this paper is to extend the offline Collective And Point Anomaly (CAPA) algorithm of Fisch et al. (2022a) to the online setting to detect both collective and point anomalies in streaming data, formalising early heuristic ideas appearing in Bezahaf et al. (2019). We call this algorithm Sequential-CAPA (SCAPA). We explore various practical aspects of working in this online setting, including (i) computational and storage costs, (ii) the typical (baseline) parameter estimation and (iii) penalty selection. In addition, we provide various theoretical and empirical guarantees to the resulting costs and accuracy trade-offs that arise from our focus on the online setting.

The article is organised as follows. In Sect. 2 we introduce the literature on offline detection of anomalous time series regions, particularly focusing on the recently proposed CAPA approach. Sect. 3 proceeds to extend this methodology to the online setting, introducing the Sequential Collective and Point Anomaly (SCAPA) algorithm. Theoretical properties of the proposed methodology are investigated in Sect. 4. Further results, together with a set of simulation studies is given in Sect. 5, indicating how these can be used to inform practitioners on how to select the hyper-parameters of SCAPA. Finally, we apply SCAPA to the monitoring of a sensor on a publically available, industrial machine-level data in Sect. 6. All proofs can be found in the supplementary material.

2 Background

CAPA, introduced by Fisch et al. (2022a), seeks to jointly detect and distinguish between point and collective anomalies within an offline, univariate time series setting. The heart of the approach is founded upon an epidemic changepoint model. To this end, consider a stochastic process $x_t \sim {\mathcal {D}}(\theta (t))$, drawn from some distribution, ${\mathcal {D}}$, indexed by a set of model parameters, $\theta (t)$. Collective anomalies can then be modelled as epidemic changes of the set of parameters $\theta (t)$. I.e., time windows in which $\theta (t)$ deviates from the typical, and potentially unknown, set of parameters $\theta _0$. Formally,

$$\begin{aligned} \theta (t) = {\left\{ \begin{array}{ll} \theta _1 &{} s_1< t \le e_1 \\ &{}\vdots \\ \theta _K &{} s_K < t \le e_K \\ \theta _0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Here K denotes the number of collective anomalies, while $s_i$, $e_i$, and $\theta _i$ correspond to the start point, end point and the unknown parameter(s) of the ith collective anomaly respectively.

The number and locations of collective anomalies are estimated by choosing $K, (s_1,e_1) , \ldots , (s_k,e_K)$, and $\theta _0$ such that they minimise the penalised cost

$$\begin{aligned} \sum _{t \notin \cup [s_i+1,e_i] } {\mathcal {C}}(x_t,\theta _0) + \sum _{j=1}^{K} \left[ \min _{\theta _j} \left( \sum _{t=s_j +1}^{e_j} {\mathcal {C}}(x_t,\theta _j) \right) + \beta _C \right] . \end{aligned}$$

(2.1)

${\mathcal {C}}(\cdot ,\cdot )$ is a cost function, e.g., twice the negative log-likelihood, and $\beta _C$ is a penalty term for introducing a collective anomaly, which seeks to prevent overfitting. A minimum segment length, l, can be imposed by adding the constraint $e_k - s_k \ge l$ for $k = 1,2, \ldots ,K$, if collective anomalies of interest are assumed to be of length at least $l \ge 1$.

Minimising the cost function (2.1) exactly by solving a dynamic programme like the PELT method (Killick et al. 2012) is not possible. This is because the parameter of the typical distribution, $\theta _0$, is shared across segments, and introduces dependence. Fisch et al. (2022a) suggest removing this dependence in $\theta _0$ by obtaining a robust estimate ${\hat{\theta }}_0$ over the whole data and then minimising

$$\begin{aligned} \sum _{t \notin \cup [s_i+1,e_i] } {\mathcal {C}}(x_t,{\hat{\theta }}_0) + \sum _{j=1}^{K} \left[ \min _{\theta _j} \left( \sum _{t=s_j +1}^{e_j} {\mathcal {C}}(x_t,\theta _j) \right) + \beta _C \right] , \end{aligned}$$

(2.2)

as an approximation to (2.1) over just the number and location of collective anomalies. The main focus of Fisch et al. (2022a) was on the case where anomalies are characterised by an atypical mean and or variance. In this case, the authors suggest minimising

subject to a minimum segment length l of at least 2. The above expression arises from setting the cost function to twice the negative log-likelihood of the Gaussian. The robust estimates for mean and variance, ${\hat{\mu }}_0$ and ${\hat{\sigma }}_0$, can be obtained from the median and the inter-quartile range.

The main weakness of the above penalised cost is that point anomalies will be fitted as collective anomalies in a segment of length l. To remedy this, point anomalies are modelled as epidemic changes of length one in variance (only). The set of point anomalies is denoted as O. To infer both collective and point anomalies we minimise

(2.3)

with respect to $K, (s_1,e_1) , \ldots , (s_k,e_K)$, and O, subject to the constraint $e_k - s_k \ge l \ge 2$ for $k = 1,2, \ldots , K$. Here, $\beta _O$ corresponds to a penalty for a point anomaly.

The CAPA algorithm then minimises the cost in (2.3) by solving the dynamic programme

taking $C(0) = 0$.

In practice, as is common in many time series settings, some form of pre-processing of the series may be required to ensure it is of a suitable form for the CAPA framework. For example, some form of deseasonalisation may be appropriate.

3 Sequential CAPA

We now introduce our Sequential CAPA procedure. In extending CAPA to the online setting three main challenges arise. Specifically, any approach developed should be mindful of the following: (i) that the computational and storage cost of the dynamic programme increase with time; (ii) the typical (baseline) parameters have to be learned online and (iii) penalty selection. We address each of these three challenges in turn, proposing solutions in the following sections, prior to formally introducing the SCAPA algorithm in Sect. 3.3.

3.1 Increasing Computational and Storage Cost

As noted in Sect. 2, CAPA infers collective and point anomalies by solving a set of dynamic programme recursions. However both the computational cost of each recursion, and the storage cost, increase linearly in the total number of observations. This is unsuitable for the online setting in which both storage and computational resources are finite.

In practice, this problem can be surmounted by imposing a maximum length m for collective anomalies. This can be achieved by adding the set of constraints

$$\begin{aligned} e_i - s_i \le m \forall i = 1,2, \ldots ,K \end{aligned}$$

(3.1)

to the optimisation problem in equation (2.3). The resulting problem can then be solved using the following dynamic programme

As a consequence of restriction (3.1), each recursion only requires a finite number of calculations. Moreover, only a finite number of the optimal costs, C(t), need to be stored in memory. The practical implications of this additional constraint are likely to be limited. Within this setting collective anomalies encompassing fewer than m observations will be detected as before. However, for those scenarios where an anomaly encompasses more than m observations, these will be fitted as a succession of collective anomalies each of length less than m, provided that their signal strength (cf Sect. 5.1 for a definition) is large enough. As one might anticipate, within this setting long anomalous segments with low signal strength would not be detectable any more as a result of the approximation.

3.2 Sequential estimation of the typical parameters

As described in Sect. 2, the dynamic programme used by CAPA requires robust estimates of the set of typical parameters $\theta _0 = (\mu _0,\sigma _0)$. Fisch et al. (2022a) estimate $\mu _0$ and $\sigma _0$ on the full data using the median and inter-quartile range respectively. In an online setting, however, these quantiles have to be learnt as the data is observed.

A range of methods have been proposed that aim to estimate the cumulative distribution function (CDF) of the data sequentially and use it to estimate quantiles. For example, Tierney (1983) proposed a method based on techniques from Stochastic Approximation (SA) to estimate the $\alpha $th quantile $x_{(\alpha )}$ of an unknown distribution function. Moreover, Tierney (1983) also established that, in the i.i.d. setting, the resulting sequential estimates ${\hat{x}}_{(\alpha ),n} \rightarrow x_{(\alpha )}$ almost surely as the number of observations $n \rightarrow \infty $. Under the same assumptions, they also showed that $\sqrt{n}( {\hat{x}}_{(\alpha ),n} - x_{(\alpha )})$ converges in distribution to a Normal distribution. These consistency results are important for an online implementation of CAPA, as Fisch et al. (2022a) showed that the consistency of CAPA requires the robustly estimated mean and variance to be within $O_p\left( \sqrt{\frac{\log (n)}{n}}\right) $ of the true typical mean and variance.

The memory required to obtain the SA-estimate is finite and small. Moreover, the standard errors of the SA-estimate and sample quantiles are close even for relatively small sample sizes, as can be seen from Fig. 2. Further, we note that these estimates tend to be considerably more accurate than those of other commonly used methods such as the quantile filter (Justusson 1981) and the $p^2$-algorithm (Jain and Chlamtac 1985). This is due to the fact that the quantile filter is not consistent, and that the $p^2$-algorithm is not robust with respect to outliers, thus losing a critical property of quantile estimators.

Pseudo-code for the SA-based method is given in Algorithm 1. Using a burn in period to stabilise the quantile estimates is recommended, as even the exact order statistics take some time to initially converge. SA-based methods can also be used to calculate other important statistics in an online fashion. For example, Sharia (2010) applied SA-techniques to learn auto-regressive parameters sequentially. Such estimators can be used to inflate the penalties used to account for deviations from the i.i.d. assumptions. This is discussed in more detail in Sect. 5.

3.3 Penalty selection

We now turn to the important question of penalty selection. In the offline setting, penalties are typically chosen to control false positives under the null hypothesis. For example, Fisch et al. (2022a) suggested using penalties

$$\begin{aligned} \beta _C(a,\lambda ) = 2\frac{a}{a-1} \left( 1 + \lambda + \sqrt{2\lambda } \right) ,\quad \beta _O(\lambda ) = 2\lambda , \end{aligned}$$

(3.2)

indexed by a single parameter, $\lambda $, for CAPA when considering the change in mean and variance setting. Here, the penalty for collective anomalies, $\beta _C$, depends on the length a of the putative collective anomaly. The motivation for these penalties is to ensure that the estimates for the number of collective anomalies and the set of point anomalies, ${\hat{K}}$ and ${\hat{O}}$, satisfy

$$\begin{aligned} \Pr ( {\hat{K}} = 0 , {\hat{O}} = \emptyset ) \ge 1 - C_1ne^{-\lambda } - C_2(ne^{-\lambda })^2, \end{aligned}$$

(3.3)

under the null hypothesis that no point or collective anomaly is present in the data. Consequently, setting $\lambda = \log (n)$ asymptotically controls the number of false positives of a time series of length n.

In the online setting, however, the concept of the length of a time series does not exist. Consequently, fixed constants are used for the penalties instead. This means that, unless the errors are bounded, false positives will be observed eventually. In common with Lorden (1971) and Pollak (1985), we suggest choosing $\lambda $ to be as small as possible, to maximise power against anomalies, whilst maintaining the average run length (ARL), the average time between false positives, at an acceptable level. Practical guidance on the choice of $\lambda $ can be taken from Proposition 1, which provides an asymptotic result for the relationship between the log-ARL and $\lambda $, under a certain model form. This relationship is empirically verified for other models using simulations in Sect. 5.

Sequential Collective And Point Anomaly Given the above solutions to the three identified challenges, we are able to extend CAPA to an online setting. We call the resultant approach Sequential Collective And Point Anomaly (SCAPA). The basic steps of the algorithm are as follows: When an observation comes in, it is used to update the sequential estimates of the typical parameters. The observation is then standardised using the typical mean and variance $(\mu _0, \sigma ^2_0)$, before being passed to the finite horizon dynamic programme. Detailed pseudocode can be found in Algorithm 2 of the supplementary material

The sequential nature of SCAPA’s analysis is displayed in Fig. 3 across three plots, each representing the output of the analysis at different time points. Note how a collective anomaly is detected, initially, as a sequence of point anomalies until the number of observations equals the minimum segment length.

4 Theory

We now turn to consider the theoretical properties of SCAPA. In particular, we investigate the average run length (ARL) and the average detection delay (ADD). Here, the ARL corresponds to the expected number of baseline datapoints SCAPA processes before detecting a false positive. Conversely, the ADD corresponds to the expected number of observations between the onset of a collective anomaly and the time at which a collective anomaly is first detected. We will place a particular emphasis on the effects of the maximum segment length, m on the ADD, as the results following from that analysis provide practical guidance on how to choose m. Proofs of all propositions presented below may be found in the Appendix.

For simplicity of exposition, we will restrict our attention to the change in mean setting, in which the penalised cost is

$$\begin{aligned}&\sum _{t \notin \cup [s_i+1,e_i] \cup O }\left( \frac{x_t-{\mu }_0}{{\sigma }_0} \right) ^2 + \sum _{t \in O} \left[ 0 + \beta _O \right] \\&\qquad + \sum _{j=1}^{K} \left[ \sum _{t=s_j + 1}^{e_j} \left( \frac{ x_t - {\bar{x}}_{(s_j + 1):e_j}}{{\sigma }_0} \right) ^2 + \beta _{C} \right] \end{aligned}$$

Recalling the penalty selection approach outlined in Sect. 3.3. In this setting, the ARL of SCAPA can be related to the penalty constant, $\lambda $, via the following result:

Proposition 1

Assume we observe a data sequence with typical mean, $\mu _0$, and the typical variance, $\sigma _0^2$, both known. Then the ARL of SCAPA on i.i.d. $N(\mu _0,\sigma _0^2)$-distributed observations $x_1,x_2,\ldots $ then satisfies

$$\begin{aligned} \log (ARL) \sim \lambda /2 \end{aligned}$$

as $\lambda \rightarrow \infty $.

As a consequence of the above, the probability of false alarm is proportional to $\exp (-\lambda /2)$. As discussed in the previous section, this can be used to inform the choice of penalty in practice if an acceptable probability of false alarm is given. For comparison, we also provide additional simulation results for the log-average run length in a selection of Laplace and t-distributed settings (see Fig. 4).

We now turn to investigate the effects of the maximum segment length, m, on the ADD. To simplify the exposition of these results, we assume that the collective anomaly begins at time $\tau = 0$. Formally, consider the series

$$\begin{aligned} x_1, x_2, \ldots {\mathop {\sim }\limits ^{i.i.d.}} N(\mu ,1) \end{aligned}$$

(4.1)

and assume that the typical mean, $\mu _0$, is equal to 0 and known. For a maximum segment length m, we then define $ADD_m$ to be the ADD of SCAPA with a maximum segment length m. Additionally, we define $ADD_\infty $ to be the ADD of SCAPA without maximum segment length. The following proposition shows that imposing a maximum segment length does not affect the ADD, provided that the maximum segment length increases at a rate faster than the penalty.

Proposition 2

Let $x_1,x_2,.. $ follow the distribution specified in (4.1). Moreover, let the known baseline mean and variance be 0 and 1 respectively. Then, if $m > \frac{\lambda }{\mu ^2}(1+\epsilon )$ for some $\epsilon >0$,

$$\begin{aligned} ADD_m = ADD_\infty + o(1) \end{aligned}$$

as $\lambda \rightarrow \infty $.

Given Proposition 2, it is natural to consider what happens in the converse setting. I.e. what happens if the maximum segment length increases at a slower rate than the penalty.

Proposition 3

Let $x_1,x_2,.. $ follow the distribution specified in (4.1). Moreover, let the known typical mean and variance be 0 and 1 respectively. Then, if $1 \le m < \lambda ^{1-\epsilon }$ for some $\epsilon >0$

$$\begin{aligned} \log (ADD_m) \sim \lambda /2 \end{aligned}$$

as $\lambda \rightarrow \infty $.

In other words, the log-ADD has the same exponential rate as the log-ARL on non-anomalous data.

As previously discussed, limits on the number of possible interventions often determine a tolerable probability of false alarm in practice. Proposition 1 therefore provides a mechanism to determine a suitable penalty constant $\lambda $. Further, Propositions 2 and 3 can be used to help inform an appropriate choice of maximum segment length, m. Specifically m should be at least of magnitude $\frac{\lambda }{\mu ^2}$, where $\mu $ is the smallest change in mean of interest to ensure power.

5 Simulation study

We now turn to examine the performance of SCAPA in various simulated settings. We start by considering the case where a single collective anomaly is present to evaluate SCAPA via its ARL and ADD performance in Sect. 5.1. The effect of auto-correlation is also examined. This is followed by a comparison with CAPA on time series containing multiple anomalies in Sect. 5.2.

5.1 A single anomaly

Prior to describing our first simulation scenario, we begin by noting that the ARL and ADD are functions of $\beta _C$ and $\beta _O$. Further, as we have seen in equation (3.2) these are a function of a single parameter $\lambda $. The aim of our simulation study, therefore, is to inform the choice of $\lambda $ that gives a suitable ARL/ADD trade off. In particular, ceteri paribus, a weaker change gives rise to a larger delay than a stronger change. In other words, we must control for the strength of change when investigating the ADD. To do so, we take the definition of signal strength from Fisch et al. (2022a).

For a collective anomaly with mean $\mu $ and variance $\sigma ^2$ the strength, $\Delta $, of a change is defined as

$$\begin{aligned}&\Delta = \log \left( 1 + \frac{1}{2}\Delta _\sigma ^2 + \frac{1}{4}\Delta _\mu ^2 \right) \;\;\;\;\;\; \;\;\;\;\;\; \Delta _\mu ^2 = \frac{ (\mu _0 - \mu )^2 }{\sigma _0\sigma } \\&\Delta _\sigma ^2 = \frac{\sigma _0}{\sigma } + \frac{\sigma }{\sigma _0} - 2. \end{aligned}$$

Here, $\mu _0$ and $\sigma _0$ are the parameters of the typical distribution, while $\Delta _\mu $ and $\Delta _\sigma $ denote the strengths of the change in mean and variance respectively.

To simplify the simulations, we assume that the standard deviation remains unaffected by collective anomalies, i.e. $\sigma = \sigma _0$. Without loss of generality, we then set $\mu _0=0$ and $\sigma _0 = 1$. Consequently, the strength of the change only depends on the mean, $\mu $, of the collective anomaly and is given by

$$\begin{aligned} \Delta = \log \left( 1 + \frac{\mu ^2}{4} \right) . \end{aligned}$$

(5.1)

We investigate a number of differing strengths $\Delta = \{0.05,0.1,0.2\}$ corresponding to mean changes of $\mu = \{0.45,0.65,0.94\}$.

In all the simulations reported below we set the minimum segment length to be $l=2$, the maximum segment length to be $m=1000$ and used a burn-in period of $n_0= 1000$ time points. To estimate the ARL, data from the typical regime was simulated and SCAPA ran until the first anomaly was (erroneously) detected. To estimate the ADD, $n_0$ observations were simulated from the typical regime followed by simulated observations from a distribution with an altered mean. We ran SCAPA on this data and calculated the detection delay as being the number of observations after $n_0$ when the anomaly was detected.

5.1.1 Case 1: IID gaussian errors

For our initial simulations, we simulated from the assumed model with standard Gaussian errors. Figure 5 depicts the log-ARL over a range of values for the penalties (3.2) indexed by the parameter $\lambda $ as in (3.2) along with a bootstrapped 95% confidence interval. Similarly, Fig. 6 shows the relationship between $\lambda $ and the ADD over a range of different values for the mean change of the collective anomaly.

Note that the log-ARL increases linearly with $\lambda $. This structure is reminiscent of the theoretical exponential relationship between $\lambda $ and the ARL derived by Cao and Xie (2017), even though these results were derived for known pre and post change behaviour.

Similarly, the ADD, increases linearly in $\lambda $, as can be seen in Fig. 6. This is consistent with Proposition 2.

5.1.2 Case 2: temporal dependence

Whilst the i.i.d. data setting is appealing theoretically, many observed time series are not independent (in time). Instead many data sequences display serial auto-correlation. To assess the robustness of SCAPA to temporal dependence we simulated an AR(1) error process as the typical distribution, $x_t$, with standard normal errors $\epsilon _t$,

$$\begin{aligned} x_t = \phi x_{t-1} + e_t. \end{aligned}$$

This process was simulated for a range of values of $\phi $. As can be seen in Fig. 7, the presence of auto-correlation in the residuals leads to the spurious detection of collective anomalies at a higher rate than for independent residuals. This is due to the fact that the cost functions of Sect. 2 assumed i.i.d. data. However, Bardwell et al. (2019) gave some empirical evidence that changepoints could be recovered even when auto-correlation is present by applying a correction or inflation factor to the penalty. This factor is the sum of the auto-correlation function for the residuals from $-\infty $ to $\infty $. This is equal to the long run variance, $(1+\phi )/(1-\phi )$, for the AR(1) model. A similar correction exists for MA processes. We repeated the simulations using this correction.

The results in Fig. 9 show that the log-ARL of SCAPA with appropriately inflated penalties is almost identical to that of the i.i.d. case. On the other hand, the ADD now depends on the auto-correlation due to the inflated penalty (see Fig. 10).

Performing this correction requires knowledge of the AR(1) parameter $\phi $. In these simulations, the estimate of $\phi $, ${\hat{\phi }}$, was estimated from the burn-in period using a robust M-estimator for the lag-1 autocorrelation of Rocke (1996) from the R package robust (Wang et al. 2017).

5.2 Multiple anomalies

A natural comparison to make when investigating an online method is to compare its performance to its offline counterpart. We therefore compare SCAPA and CAPA for the detection of multiple anomalies using ROC curves in this section. In addition to this, we also compare SCAPA to a robust changepoint detection algorithm proposed by (Fearnhead and Rigaill 2019), which we refer to as RPOP. We also compare to the MOSUM implementation provided by (Meier et al. 2021), an online though not robust changepoint detection algorithm.

To this end, we simulated time series with a total length of 10,000 observations with a number of point and collective anomalies. The length of stay for the typical state and for collective anomalies were sampled from a $\text {NB}(5,0.01)$ distribution and a $\text {NB}(5,0.03)$ distribution respectively. Observations in the typical state were sampled from an N(0, 1) distribution, while observations from the kth collective anomaly were sampled from an $N(\mu _k,\sigma _k^2)$ distribution, where $\mu _1,\ldots ,\mu _K \sim N(0,2^2)$ and $\sigma _1,\ldots \sigma _K \sim \Gamma (1,1)$. Point anomalies occurred in the typical state independently with probability $p = 0.01$ and were drawn from a t-distribution with 2 degrees of freedom.

The ROC curve resulting from this simulation can be found in Fig. 11, alongside an example time series in Fig. 12 shown segmented by both CAPA and SCAPA. As expected CAPA, which has access to the whole data, outperforms SCAPA. However, the gap in performance accuracy is small, in particular for typical choices of $\lambda $ as described in Sect. 4.

5.3 CUSUM comparison

A natural comparison that can be made to assess SCAPA’s simulated performance is with the widely used online change point detection method CUSUM (Page 1954). Both methods use a test statistic based on the log-likelihood ratio and can be configured with known typical mean and variance. The difference between the two methods is that in SCAPA, collective and point anomalies are detected jointly with separate penalties whereas CUSUM is not designed to be robust to point anomalies. In our simulations, the CUSUM approach is implemented by setting the penalty for point anomalies in SCAPA to an arbitrarily large value ($\beta _O = 10^{12}$) so that no point anomalies are detected.

Table 1 Labelled anomalies from the NAB obtained from https://github.com/numenta/NAB/blob/master/labels/combined_windows.json along with the time it was detected (in bold)

Full size table

The data was simulated in a similar way to that of Sect. 5.2 with the difference being that 20% of points in the typical state were point anomalies, simulated from a t-distribution with degree of freedom $\nu \in \{2,5,10\}$. The ROC curve resulting from this simulation can be found in Fig. 13, alongside an example time series in Fig. 14 shown segmented by both methods. Between them, these plots highlight SCAPA’s ability to distinguish explicitly between point anomalies and collective anomalies, whereas an approach such as CUSUM suffers. Note in particular that, as expected, CUSUM gives a higher number of false positives than SCAPA when point anomalies are from distributions with heavier tails.

6 Machine temperature data

The Numenta Anomaly Benchmark (NAB) (Lavin and Ahmad 2015; Ahmad et al. 2017) provides a number of data sets that can be used to compare different anomaly detection approaches. The data can be obtained from https://github.com/numenta/NAB.

One example consists of heat sensor data from an internal component of a large industrial machine. The data is displayed in Fig. 15. There are $n = 22,695$ observations spanning 2nd December 2013 - 19th February 2014 sampled every five minutes.

Lavin and Ahmad (2015) use an initial, or burn-in, period to allow their algorithms to learn about the data. In line with their approach, we set the burn in period to be the first 15% of the data (2nd December 2013 until the 14th December 2013, as shown by the blue shaded area in Fig. 16). We used the burn in to obtain a robust M-estimator for the lag-1 autocorrelation of the observations after standardisation by the sequential mean and variance estimate. Using the robust M estimator of Rocke (1996) from the R package robust (Wang et al. 2017), we obtained an autocorrelation estimate ${\hat{\phi }}=0.974$. In line with the approach taken in Sect. 5.1.2, we therefore set the penalties to:

$$\begin{aligned} \beta _C = 2 \times \frac{1 + {\hat{\phi }} }{1 - {\hat{\phi }} } \times \log (n),\quad \beta _O = 2 \times \frac{1 + {\hat{\phi }} }{1 - {\hat{\phi }} } \times \log (n). \end{aligned}$$

Figure 16 shows the three anomalies SCAPA detected shaded in red. These corresponded to a set of hand labelled anomalous regions given by an engineer working on the machine shown by the dashed vertical lines. The positions of these are given in Table 1. It should be noted that the data labels in the NAB consist of anomalous periods, rather than points. However, all approaches applied to the data as part of the NAB only return points of anomalous behaviour, highlighting SCAPA’s potential to provide new insights into the data.

Table 2 Collective anomalies found using CAPA

Full size table

The detection of the more subtle second anomaly in a timely fashion is important as this was claimed in the NAB literature to be the cause of the catastrophic system failure (third anomaly). We can see that the time at which SCAPA first detected it in Table 1. If users of the system deemed this to be too long of a delay the penalties used above could be decreased, however, as noted elsewhere in this paper this would increase the frequency of false alarms.

References

Agamennoni, G., Nieto, J.I., Nebot, E.M.: An Outlier-robust Kalman Filter. In 2011 IEEE International Conference on Robotics and Automation, pages 1551–1558. IEEE (2011)
Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised Real-time Anomaly Detection for Streaming Data. Neurocomputing, 262, 134 – 147. Online Real-Time Learning Strategies for Data Streams (2017)
Aston, J.A.D., Kirch, C.: Evaluating Stationarity Via Change-point Alternatives with Applications to Fmri Data. Ann. Appl. Stat. 6(4), 1906–1948 (2012)
Article MathSciNet Google Scholar
Bardwell, L., Fearnhead, P.: Bayesian Detection of Abnormal Segments in Multiple Time Series. Bayesian Anal. 12(1), 193–218 (2017)
Article MathSciNet Google Scholar
Bardwell, L., Fearnhead, P., Eckley, I.A., Smith, S., Spott, M.: Most Recent Changepoint Detection in Panel Data. Technometrics 61(1), 88–98 (2019)
Article MathSciNet Google Scholar
Bezahaf, M., Hernandez, M.P., Bardwell, L., Davies, E., Broadbent, M., King, D., Hutchison, D.: Self-generated Intent-based System. In 2019 10th International Conference on Networks of the Future (NoF), 138–140 (2019)
Bruce, L., Jennie, K.: The Cusum Test of Homogeneity with an Application in Spontaneous Abortion Epidemiology. Stat. Med. 4(4), 469–488 (1985)
Article Google Scholar
Cao, Y., Xie, Y.: Robust Sequential Change-point Detection by Convex Optimization. 2017 IEEE International Symposium on Information Theory (ISIT), 1287–1291 (2017)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Comput. Surv. 41(3), 15:1-15:58 (2009)
Article Google Scholar
Chang, G.: Robust Kalman Filtering Based on Mahalanobis Distance as Outlier Judging Criterion. J. Geodesy 88(4), 391–401 (2014)
Article Google Scholar
Chen, C., Liu, L.-M.: Joint Estimation of Model Parameters and Outlier Effects in Time Series. J. Am. Stat. Assoc. 88(421), 284–297 (1993)
MATH Google Scholar
Eichinger, B., Kirch, C.: A MOSUM Procedure for the Estimation of Multiple Random Change Points. Bernoulli 24(1), 526–564 (2018)
Article MathSciNet Google Scholar
Fearnhead, P., Rigaill, G.: Changepoint Detection in the Presence of Outliers. J. Am. Stat. Assoc. 114(525), 169–183 (2019)
Article MathSciNet Google Scholar
Ferdousi, Z., Maeda, A.: Unsupervised Outlier Detection in Time Series Data. 22nd International Conference on Data Engineering Workshops (ICDEW’06), x121–x121 (2006)
Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: A Linear Time Method for the Detection of Point and Collective Anomalies. Stat. Anal. Data Min. (2022a). https://doi.org/10.1002/sam.11586
Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: Subset Multivariate Collective and Point Anomaly Detection. J. Comput. Graph. Stat. 31(2), 574–585 (2022b)
Gut, A., Steinebach, J.: A Two-step Sequential Procedure for Detecting an Epidemic Change. Extremes 8(4), 311–326 (2005)
Article MathSciNet Google Scholar
Iturria, A., Carrasco, J., Charramendieta, S., Conde, A., Herrera, F.: otsad: A Package for Online Time-series Anomaly Detectors. Neurocomputing 374, 49–53 (2020)
Article Google Scholar
Jain, R., Chlamtac, I.: The $p^2$ Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations. Commun. ACM 28(10), 1076–1085 (1985)
Article Google Scholar
Jeng, X.J., Cai, T.T., Li, H.: Simultaneous Discovery of Rare and Common Segment Variants. Biometrika 100(1), 157–172 (2013)
Article MathSciNet Google Scholar
Justusson, B.I.: Median Filtering: Statistical Properties, 161–196. Springer, Berlin Heidelberg, Berlin, Heidelberg (1981)
Google Scholar
Killick, R., Fearnhead, P., Eckley, I.A.: Optimal Detection of Changepoints with a Linear Computational Cost. J. Am. Stat. Assoc. 107(500), 1590–1598 (2012)
Lavin, A., Ahmad, S.: Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark. IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 38–44 (2015)
Lorden, G.: Procedures for Reacting to a Change in Distribution. Ann. Math. Statist. 42(6), 1897–1908 (1971)
Article MathSciNet Google Scholar
Meier, A., Kirch, C., Cho, H.: MOSUM: A Package for Moving Sums in Change-point Analysis. J. Stat. Softw. 97(8), 1–42 (2021)
Article Google Scholar
Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular Binary Segmentation for the Analysis of Array-based dna Copy Number Data. Biostatistics 5(4), 557–572 (2004)
Article Google Scholar
Page, E.S.: Continuous Inspection Schemes. Biometrika 41(1/2), 100–115 (1954)
Article MathSciNet Google Scholar
Pollak, M.: Optimal Detection of a Change in Distribution. Ann. Statist. 13(1), 206–227 (1985)
Article MathSciNet Google Scholar
Rocke, D.M.: Robustness Properties of S-estimators of Multivariate Location and Shape in High Dimension. Ann. Stat. 24(3), 1327–1345 (1996)
Article MathSciNet Google Scholar
Ruckdeschel, P., Spangl, B., Pupashenko, D.: Robust Kalman Tracking and Smoothing with Propagating and Non-propagating Outliers. Stat. Pap. 55(1), 93–123 (2014)
Article MathSciNet Google Scholar
Sharia, T.: Efficient On-line Estimation of Autoregressive Parameters. Math. Methods Statist. 19(2), 163–186 (2010)
Article MathSciNet Google Scholar
Stoehr, C., Aston, J.A.D., Kirch, C.: Detecting Changes in the Covariance Structure of Functional Time Series with Application to fMRI Data. arXiv e-prints, arXiv:1903.00288 (2019)
Theissler, A.: Detecting Known and Unknown Faults in Automotive Systems Using Ensemble-based Anomaly Detection. Knowl.-Based Syst. 123, 163–173 (2017)
Article Google Scholar
Tierney, L.: A Space-efficient Recursive Procedure for Estimating a Quantile of an Unknown Distribution. SIAM J. Sci. Stat. Comput. 4(4), 706–711 (1983)
Article MathSciNet Google Scholar
Ting, J.-A., Theodorou, E., Schaal, S.: Learning An Outlier-robust Kalman Filter. In European Conference on Machine Learning, 748–756. Springer (2007)
Wang, C., Viswanathan, K., Choudur, L., Talwar, V., Satterfield, W., Schwan, K.: Statistical Techniques for Online Anomaly Detection in Data Centers. In 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops, 385–392 (2011)
Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., Zivot, E., Rocke, D., Martin, D., Maechler, M., Konis., K.: robust: Port of the S+ “Robust Library”. R package version 0.4-18 (2017)
Yao, Q.: Tests for Change-points with Epidemic Alternatives. Biometrika 80(1), 179–191 (1993)
Article MathSciNet Google Scholar
Zhao, H., Liu, H., Hu, W., Yan, X.: Anomaly Detection and Fault Analysis of Wind Turbine Components Based on Deep Learning Network. Renewable Energy 127, 825–834 (2018)
Article Google Scholar

Download references

Acknowledgements

This work was supported by EPSRC grant numbers EP/N031938/1 (StatScale), EP/R004935/1 (NG-CDI) and EP/L015692/1 (STOR-i). Fisch also gratefully acknowledges EPSRC (EP/S515127/1) and British Telecommunications plc (BT) for providing financial support for his PhD via an Industrial CASE award. Finally, the authors thank David Yearling, Trevor Burbridge, Stephen Cassidy, and Kjeld Jensen in BT Research for helpful discussions while this work was being undertaken.

Author information

Authors and Affiliations

STOR-i Centre for Doctoral Training, Lancaster University, LA1 4YF, Lancaster, UK
Alexander T. M. Fisch
Department of Mathematics and Statistics, Lancaster University, LA1 4YF, Lancaster, UK
Lawrence Bardwell & Idris A. Eckley

Authors

Alexander T. M. Fisch
View author publications
You can also search for this author in PubMed Google Scholar
Lawrence Bardwell
View author publications
You can also search for this author in PubMed Google Scholar
Idris A. Eckley
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Idris A. Eckley.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 340 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Fisch, A.T.M., Bardwell, L. & Eckley, I.A. Real time anomaly detection and categorisation. Stat Comput 32, 55 (2022). https://doi.org/10.1007/s11222-022-10112-3

Download citation

Received: 08 September 2020
Accepted: 29 January 2022
Published: 24 June 2022
DOI: https://doi.org/10.1007/s11222-022-10112-3

Real time anomaly detection and categorisation

Abstract

Similar content being viewed by others

A Methodological Overview on Anomaly Detection