
Signal Decomposition

Using Masked Proximal Operators


Bennet E. Meyers Stephen P. Boyd
April 25, 2022

Abstract
We consider the well-studied problem of decomposing a vector time series signal
into components with different characteristics, such as smooth, periodic, nonnegative,
or sparse. We propose a simple and general framework in which the components are
defined by loss functions (which include constraints), and the signal decomposition
is carried out by minimizing the sum of losses of the components (subject to the
constraints). When each loss function is the negative log-likelihood of a density for
the signal component, our method coincides with maximum a posteriori probability
(MAP) estimation; but it also includes many other interesting cases. We give two
distributed optimization methods for computing the decomposition, which find the
optimal decomposition when the component class loss functions are convex, and are
good heuristics when they are not. Both methods require only the masked proximal
operator of each of the component loss functions, a generalization of the well-known
proximal operator that handles missing entries in its argument. Both methods are
distributed, i.e., handle each component separately. We derive tractable methods for
evaluating the masked proximal operators of some loss functions that, to our knowledge,
have not appeared in the literature.

Contents

1 Introduction
2 Signal decomposition
  2.1 Signal decomposition into components
  2.2 Component classes
  2.3 Signal decomposition problem
  2.4 Statistical interpretation
  2.5 Optimality and stationarity conditions
  2.6 Signal class parameters
  2.7 Validation
  2.8 Data pre-processing
  2.9 Simple example
3 Related work
4 Solution methods
  4.1 Masked proximal operator
  4.2 Block coordinate descent algorithm
  4.3 ADMM algorithm
  4.4 Hybrid algorithms
5 Component class attributes
  5.1 Separability
  5.2 Time-invariance
  5.3 Convex quadratic
  5.4 Common term
6 Component class examples
  6.1 Time-separable classes
  6.2 Time-invariant classes
  6.3 Fitting component class losses
7 Examples
  7.1 Mauna Loa CO2 measurements
  7.2 RFK bridge traffic
  7.3 Outage detection in a photovoltaic combiner box
A SD-ADMM algorithm derivation
1 Introduction
The decomposition of a time series signal into components is an age old problem, with
many different approaches proposed, including traditional filtering and smoothing, seasonal-
trend decomposition, Fourier and other decompositions, PCA and newer variants such as
nonnegative matrix factorization, various statistical methods, and many heuristic methods.
It is believed that ancient Babylonian mathematicians used harmonic analysis to understand
astronomical observations as collections of ‘periodic phenomena’ [Neu69]. (We discuss prior
work in more detail in §3.)
We formulate the problem of decomposing a time series signal into components as an
optimization problem, where components are described by their loss functions. Once the
component class loss functions are chosen, we minimize the total loss subject to replicating
the given signal with the components. We give a simple unified algorithm for carrying
out this decomposition, which is guaranteed to find the globally optimal decomposition
when the loss functions are all convex, and is a good heuristic when they are not. The
method accesses the component loss functions only through a modified proximal operator
interface, which takes into account that some data in the original signal may be missing.
The method is distributed, in that each component class is handled separately, with the
algorithm coordinating them.

Handling of missing data. Our method is designed to handle missing data in the orig-
inal signal to be decomposed, a common situation in many practical settings. The signal
components in the decomposition, however, do not have any missing data; by summing the
components in the decomposition, we obtain a guess or estimate of the missing values in
the original signal. This means that signal decomposition can be used as a sophisticated
method for guessing or imputing or interpolating missing or unknown entries in a signal.
This allows us to carry out a kind of validation or self-consistency check on a decomposition,
by pretending that some known entries are missing, and comparing the imputed values to
the known ones.

Expressivity and interpretability. By describing a signal component class via its loss
function, we can handle more complex component classes than traditional simple ones such
as a periodic signal, a trend, a smooth signal, and so on. For example we can define a signal
component class that consists of periodic, smooth, and nonnegative signals, or piecewise
constant signals that have no more than some specified number of jumps. The resulting
decomposition is always interpretable, since we specify the component classes.

Outline. We describe our signal decomposition framework in §2, where we pose the signal
decomposition problem as an optimization problem, concluding with an illustrative simple
example in §2.9. In §3 we cover related and previous work and methods. Our distributed
methods for the signal decomposition problem are described in §4. The next two sections
concern loss functions for signal component classes: general attributes are described in §5 and

some example classes in §6. The topic of how to fit component class losses given archetypal
examples is discussed in §6.3. We conclude the monograph with examples using real data:
Weekly CO2 measurements at Mauna Loa in §7.1, hourly traffic over a New York bridge in
§7.2, and 1-minute power output for a group (fleet) of seven photo-voltaic (PV) installations
in §7.3.

Software. Our paper is accompanied by an open-source software implementation called OSD, short for ‘Optimization(-based) Signal Decomposition’, available at
https://github.com/cvxgrp/signal-decomposition.

2 Signal decomposition
2.1 Signal decomposition into components
Vector time series signal with missing entries. Consider a vector time series or signal, possibly with missing entries, y_1, ..., y_T ∈ (R ∪ {?})^p. We denote the ith entry of y_t as (y_t)_i = y_{t,i}. The value ? denotes a missing entry in the signal; we say that entry y_{t,i} is known if y_{t,i} ∈ R, and unknown if y_{t,i} = ?. We define K as the set of indices corresponding to known values, i.e., K = {(t, i) | y_{t,i} ∈ R}. We define U as the set of indices corresponding to unknown or missing values, i.e., U = {(t, i) | y_{t,i} = ?}. We represent the signal compactly as a T × p matrix y ∈ (R ∪ {?})^{T×p}, with rows y_1^T, ..., y_T^T.

The mask operator. Let q = |K| be the total number of known entries in y, with q ≤ Tp. We introduce the mask operator M : (R ∪ {?})^{T×p} → R^q, which simply lists the entries of its argument that are in K in a vector, in some known order. We will also use its adjoint M∗, which takes a vector in R^q and places its entries into a T × p matrix, in the correct order, with the other entries zero. Note that while the original signal y can have missing entries, the vector My does not. We also observe that for any z ∈ R^{T×p}, M∗Mz is z with the entries in U replaced with zeros.
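As a concrete (and entirely illustrative) numpy sketch of M and M∗, with missing entries encoded as NaN — an encoding choice of ours, not something the text prescribes:

```python
import numpy as np

# Hypothetical example: a 4 x 2 signal with two missing entries, encoded as NaN.
y = np.array([[1.0, 2.0],
              [np.nan, 0.5],
              [3.0, np.nan],
              [4.0, 1.0]])

known = ~np.isnan(y)          # K, the set of known indices, as a boolean T x p mask
q = known.sum()               # number of known entries

def mask(z, known):
    """M: list the known entries of a T x p array as a vector in R^q."""
    return z[known]

def mask_adjoint(v, known):
    """M*: place a vector in R^q back into a T x p array, zeros elsewhere."""
    out = np.zeros(known.shape)
    out[known] = v
    return out

# M*M z equals z with the unknown entries replaced by zeros.
z = np.arange(8.0).reshape(4, 2)
assert np.allclose(mask_adjoint(mask(z, known), known), np.where(known, z, 0.0))
```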

Signal decomposition. We will model the given signal y as a sum (or decomposition) of
K components x1 , . . . , xK ∈ RT ×p ,

yt,i = (x1 )t,i + · · · + (xK )t,i , (t, i) ∈ K.

We refer to this constraint, that the sum of the components matches the given signal at its
known values, as the consistency constraint, which can be expressed as

My = Mx1 + · · · + MxK . (1)

Note that the components x^1, ..., x^K do not have missing values. Indeed, we can interpret the values

ŷ_{t,i} = x^1_{t,i} + ··· + x^K_{t,i},    (t, i) ∈ U,    (2)

as estimates of the missing values in the original signal y. (This will be the basis of a
validation method described later.)

2.2 Component classes


The K components are characterized by functions φ_k : R^{T×p} → R ∪ {∞}, k = 1, ..., K. We interpret φ_k(x) as the loss of, or implausibility of, taking x^k = x. We will see later that in some cases we can interpret the classes statistically, with φ_k(x) the negative log-likelihood of x for signal class k. Roughly speaking, the smaller φ_k(x) is, the more plausible x is. Infinite
values of φk (x) are used to encode constraints on components. We refer to x as feasible for
component class k if φk (x) < ∞, and we refer to {x | φk (x) < ∞} as the set of feasible
signals for component class k. When a component class takes on the value ∞ for some x,
we say that it contains or encodes constraints; when φk does not take on the value ∞, we
say the component class has no constraints, or has full domain. We will assume that every
component class has at least one feasible point, i.e., a point with finite loss.
We will see many examples of component class losses later, but for now we mention a
few simple examples.

Mean-square small class. One simple component class has the mean-square loss
φ(x) = (1/(Tp)) ∑_{t,i} (x_{t,i})^2 = (1/(Tp)) ‖x‖_F^2,    (3)

where ‖·‖_F denotes the Frobenius norm, the square root of the sum of squares of the entries.
(To lighten the notation, we drop the subscript k when describing a general component class.)
All signals are feasible for this class; roughly speaking, smaller signals are more plausible
than larger signals. We call this the component class of mean-square small signals.
We will assume that the first class is always mean-square small, with loss function (3).
We interpret x1 as a residual in the approximation

y ≈ x2 + · · · + xK ,

and φ1 (x1 ) as the mean-square error.

Mean-square smooth class. The component class of mean-square smooth signals has
loss
φ(x) = (1/((T − 1)p)) ∑_{t=1}^{T−1} ‖x_{t+1} − x_t‖_2^2,    (4)
the mean-square value of the first difference. Here too all signals are feasible, but smooth
ones, i.e., ones with small mean-square first difference, are more plausible.

Boolean signal class. As one more simple example, consider the component class with
loss function

φ(x) = { 0   x_{t,i} ∈ {0, 1} for all t, i
       { ∞   otherwise.                        (5)

This component class consists only of constraints, specifically that each entry is either 0 or 1. It has a finite number, 2^{Tp}, of feasible signals, with no difference in plausibility among them. We refer to this class as the Boolean component class.
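These three example losses are easy to express in code. The following numpy sketch (our own illustration, not the interface of the OSD package) writes each loss as a plain function of a T × p array, with +∞ encoding a constraint violation.

```python
import numpy as np

def mean_square_small(x):
    """Mean-square small loss (3): (1/(Tp)) * ||x||_F^2."""
    T, p = x.shape
    return np.sum(x ** 2) / (T * p)

def mean_square_smooth(x):
    """Mean-square smooth loss (4): mean-square value of the first difference."""
    T, p = x.shape
    return np.sum(np.diff(x, axis=0) ** 2) / ((T - 1) * p)

def boolean_loss(x):
    """Boolean loss (5): 0 if every entry is 0 or 1, +inf otherwise."""
    ok = np.all(np.isclose(x, 0.0) | np.isclose(x, 1.0))
    return 0.0 if ok else np.inf

x = np.zeros((10, 2))
print(mean_square_small(x), mean_square_smooth(x), boolean_loss(x))  # 0.0 0.0 0.0
```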

2.3 Signal decomposition problem


We will estimate the components x1 , . . . , xK by solving the optimization problem

minimize    φ_1(x^1) + ··· + φ_K(x^K)
subject to  My = Mx^1 + ··· + Mx^K,          (6)

with variables x1 , . . . , xK . We refer to this problem as the signal decomposition (SD) problem.
Roughly speaking, we decompose the given signal y into components so as to minimize the
total implausibility.
We observe that the entries of the mean-square small component x^1 with indices in U do not appear in the constraints, so their optimal value is zero, i.e., x^1 = M∗Mx^1. It follows that φ_1(x^1) = (1/(Tp)) ‖Mx^1‖_2^2. We can now eliminate x^1, and express the SD problem as the unconstrained problem

minimize    (1/(Tp)) ‖My − Mx^2 − ··· − Mx^K‖_2^2 + φ_2(x^2) + ··· + φ_K(x^K),    (7)

with variables x^2, ..., x^K. From a solution of this problem we can recover an optimal x^1 for (6) from the residual in the first term, as x^1 = M∗(My − Mx^2 − ··· − Mx^K).

Solving the signal decomposition problem. If the class losses φk are all convex func-
tions, the SD problem (6) is convex, and can be efficiently solved globally [BV09]. In other
cases it can be very hard to find a globally optimal solution, and we settle for an approx-
imate solution. In §4 we will describe two methods that solve the SD problem when it is
convex (and has a solution), and approximately solve it when it is not. The first method
is based on block coordinate descent (BCD) [BT13, Wri15], and the second is based on
ADMM [BPC+ 11], an operator splitting method. Both methods handle each of the compo-
nent classes separately, using the masked proximal operators of the loss functions (described
in §4.1). This gives a very convenient software architecture, and makes it easy to modify the method or extend it to new component classes.

Existence and uniqueness of decomposition. With our assumption that the first com-
ponent class is mean-square small, and all other component classes contain at least one signal
with finite loss, the SD problem is always feasible. But it need not have a solution, or when it
does, a unique solution. For example, consider K = 2 with a mean-square small component

and a Boolean component. If the (t, i) entry in y is unknown, then x^2_{t,i} can be either 0 or 1,
without affecting feasibility or the objective. The uniqueness of specific instances of the SD
problem (particularly when K = 2) has been studied extensively [MT14, DH01]. (See §3 for
a longer discussion.)

2.4 Statistical interpretation


We can give our losses a simple statistical interpretation in some cases, which conversely can
be used to suggest class losses. Suppose that φ is continuous on its domain, with
Z = ∫ exp(−φ(x)) dx < ∞.

(The integration is with respect to Lebesgue measure.) We associate with this component class the density

p(x) = (1/Z) exp(−φ(x)).
Thus, φ(x) is a constant plus the negative log-likelihood of x under this density, a stan-
dard statistical measure of implausibility. Convex loss functions correspond to log-concave
densities.
As an example, with the mean-square loss φ(x) = (1/(2Tp)) ‖x‖_F^2 (note the additional factor of two in the denominator), the associated density is Gaussian, with the entries of x IID N(0, 1). As another example, the mean-square smooth component class with loss (4) has Z = ∞, so we cannot associate it with a density.
When all component classes have Z < ∞, we can interpret the SD problem statistically.
Suppose x1 , . . . , xK are independent random variables with densities p1 , . . . , pK . Then the SD
objective is a constant plus the negative log-likelihood of the decomposition with x1 , . . . , xK ,
and the SD decomposition is the maximum a posteriori probability (MAP) decomposition
of the observed signal y.

2.5 Optimality and stationarity conditions


Here we give optimality or stationarity conditions for the SD problem for some special but
common cases. In all cases, the conditions include primal feasibility (1), i.e., consistency,
and a second condition, dual feasibility, which has a form that depends on the properties of
the losses.

Differentiable losses. We first suppose that the losses are differentiable. The dual feasibility condition is that there exists a ν ∈ R^q for which

∇φ_k(x^k) = M∗ν,    k = 1, ..., K,

where ν is a dual variable or Lagrange multiplier associated with the consistency constraint (1). In words: the gradients of the losses all agree, and are zero in the unknown

entries. If all losses are convex, this condition together with primal feasibility are the neces-
sary and sufficient optimality conditions for the SD problem. If the losses are not all convex,
then this condition together with primal feasibility are stationarity conditions; they hold for
any optimal decomposition, but there can be non-optimal points that also satisfy them.
Since φ_1(x) = (1/(Tp)) ‖x‖_F^2, we have ∇φ_1(x) = (2/(Tp))x. The dual conditions can then be written as

x^1 = M∗Mx^1,    ∇φ_k(x^k) = (2/(Tp)) x^1,    k = 2, ..., K,    (8)

i.e., the gradients of the component class losses all equal the mean-square residual, scaled by 2/(Tp) in the known entries, and are zero in the unknown entries. These are also the conditions under which the gradients of the objective in the unconstrained SD problem formulation (7) with respect to x^2, ..., x^K are all zero.

Convex losses. If the losses are convex but not differentiable, we replace the gradients in (8) with subgradients, to obtain

x^1 = M∗Mx^1,    g^k = (2/(Tp)) x^1,    g^k ∈ ∂φ_k(x^k),    k = 2, ..., K,    (9)

where ∂φ_k(x^k) is the subdifferential of φ_k at x^k. These conditions, together with primal feasibility, are optimality conditions for the SD problem.

Other cases. When the losses are neither convex nor differentiable, the stationarity con-
ditions can be very complex, with the gradients in (8) or subgradients in (9) substituted
with some appropriate generalized gradients.

2.6 Signal class parameters


The component class losses φk can also have parameters associated with them. When we
need to refer to the parameters, we write φk (xk ) as φk (xk ; θk ), where θk ∈ Θk , the set of
allowable parameters. These parameters are fixed whenever we solve the SD problem, but it
is common to solve the SD problem for several values of the parameters, and choose one that
works well (e.g., using a validation method described later). The role of the parameters θk
will be made clear when we look at examples. For now, though, we mention a few common
examples.

Weight or scaling parameters. It is very common for a parameter to scale a fixed


function, i.e., φ(x; θ) = θ`(x), θ ∈ Θ = R++ , the set of positive numbers. (Of course we can
have additional parameters as well.) In this case we interpret the parameters as weights that
scale the relative implausibility of the component classes. We will use the more traditional
symbol λ to denote scale factors in loss functions, with the understanding that they are part
of the parameter θ.

Value and constraint parameters. Parameters are often used to specify constant values
that appear in the loss function. For example we can generalize the Boolean loss function,
which constrains the entries of x to take on values in {0, 1}, to one where the entries of x
take on values in a finite set {θ1 , . . . , θM }, where θi ∈ R, i.e.,

φ(x) = { 0   x_{t,i} ∈ {θ_1, ..., θ_M} for all t, i
       { ∞   otherwise.                               (10)

In this case, the parameters give the values that the entries of x are allowed to take on. As
another example, consider a loss function that constrains the entries of x to lie in the interval
[θ1 , θ2 ] (with θ1 ≤ θ2 ). Here the parameters set the lower and upper limits on the entries of
x.

Basis. Another common use of parameters is to specify a basis for the component, as in

φ(x) = { 0   x = θz for some z ∈ R^{d×p}
       { ∞   otherwise,                     (11)

where θ ∈ RT ×d , and z ∈ Rd×p . This component class requires each column of x, i.e.,
the scalar time series associated with an entry of x, to be a linear combination of the basis
(scalar) signals given by the columns of θ (sometimes referred to as a dictionary). The
entries of z give the coefficients of the linear combinations; for example, the first column of
x is z11 θ1 + · · · + z1d θd , where θi is the ith column of θ, i.e., the ith basis signal.

2.7 Validation
We can validate, or at least check consistency of, our choice of the component classes and
their parameter values. To do this, we select (typically randomly) some entries of y that are
known, denoted T ⊂ K (for ‘test’), and replace them with the value ?. A typical choice of the
number of test entries is a fraction of the known entries, such as 20%. We then carry out the
decomposition by solving the SD problem, using the entries K \ T of y. This decomposition
gives us estimates or guesses of the entries of y in T , given by (2). Finally, we check these
estimates against the true values of y, for example by evaluating the mean-square test error
(1/(|T|p)) ∑_{(t,i)∈T} (y_{t,i} − ŷ_{t,i})^2.

A more stable estimate of test error can be found by evaluating the mean-square test error
for multiple test sets T (1) , . . . , T (M ) , each with the same number of entries, and averaging
these to obtain a final mean-square error.
It is reasonable to prefer a model (i.e., choice of component classes and their parameters)
that results in small test error, compared to another model with higher test error. The
out-of-sample validation method described above can be used to guide the choice of the
component classes and parameters that define an SD model.
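A schematic of this validation procedure in numpy; `decompose` is a hypothetical stand-in for any SD solver that returns the list of components x^1, ..., x^K, and missing entries are encoded as NaN (both are our own assumptions, not part of the paper's software).

```python
import numpy as np

def validation_mse(y, decompose, test_frac=0.2, seed=0):
    """Hold out a random fraction of the known entries, decompose, and
    score the imputed values against the held-out ones."""
    rng = np.random.default_rng(seed)
    known = ~np.isnan(y)
    # Randomly select the test entries T from the known set K.
    test = known & (rng.random(y.shape) < test_frac)
    y_train = y.copy()
    y_train[test] = np.nan                # pretend these entries are missing
    components = decompose(y_train)       # list of K arrays x^1, ..., x^K
    y_hat = sum(components)               # estimate of y, including missing entries
    # Mean-square error over the held-out entries.
    return np.mean((y[test] - y_hat[test]) ** 2)
```

Averaging this quantity over several random seeds gives the more stable multi-test-set estimate described above.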

Validating with non-unique decompositions. We note that the basic validation method
fails when the SD problem has multiple solutions, or more precisely, when multiple optimal
signal decompositions correspond to different values of ybt,i for (t, i) ∈ T . One simple work-
around is to regard the multiple solutions as each providing an estimate of the missing entry,
and to evaluate the test loss using the best of these estimates. For example, suppose the
second component class is Boolean, so (x^2)_{t,i} can have the value 0 or 1 for (t, i) ∈ T. We judge the error using

(1/(|T|p)) ∑_{(t,i)∈T} min_{x^2_{t,i} ∈ {0,1}} (y_{t,i} − ŷ_{t,i})^2.

Parameter search. As is standard in machine learning and data fitting, it is common


to carry out multiple decompositions with the same loss functions but different parameters,
and validate each of these choices on one or more test sets, as described above. We then
choose as our final parameter values the ones corresponding to the lowest achieved test error. As in
machine learning and data fitting, the final decomposition is then fit with all known data,
using the parameter values found in the parameter search.

2.8 Data pre-processing


As in other data processing problems, pre-processing the raw data is often useful, leading to
better results or interpretability.

Standardization. The most basic pre-processing is to standardize the entries of y, with a


scale and offset for each component that results in the average value being around zero and
the standard deviation around one. In some cases, for example when the entries of y are all
measured in the same physical units, it can be more appropriate to use the same scaling for
all components of y.

Log transform. If the data are all positive and vary over a large range of values, a log
transform of the raw data can be appropriate. Roughly speaking, this means that we care
about relative or fractional deviations, as opposed to absolute errors in the raw data, e.g.,
we consider the values 10 and 11 to be as close as the values 1000 and 1100. With a log
transform, the signal decomposition has an interpretation as a multiplicative decomposition
(in the raw data), as opposed to an additive decomposition. If we denote the raw data as ỹ and the transformed data as y = log ỹ (entrywise), and our decomposition is y = x^1 + ··· + x^K, in terms of the raw data we have

ỹ_{t,i} = x̃^1_{t,i} ··· x̃^K_{t,i},    (t, i) ∈ K,

where x̃^i = exp x^i (entrywise), i = 1, ..., K. The signals x̃^i can be thought of as multiplicative factors.

2.9 Simple example
In this section we give a simple synthetic example to illustrate the idea.

Signal decomposition model. We construct an SD problem with p = 1 (i.e., a scalar


signal), T = 500, and K = 3 component classes: mean-square small, mean-square second-
order smooth, and a scaled Boolean. For mean-square small we use loss function (3), and
for mean-square second order smooth we use the loss
φ_2(x) = (θ_1/((T − 2)p)) ∑_{t=2}^{T−1} (x_{t+1} − 2x_t + x_{t−1})^2,    (12)

where θ1 is a positive weight parameter. For the Boolean component class, we require that all
entries of x are in {0, θ2 }, where θ2 is another positive parameter. Our SD problem contains
two signal class parameters, θ1 and θ2 . Since φ3 is not convex, the SD problem is not convex.
(Nevertheless our method does a good job at approximately solving it.)

Data generation. We generate a signal y of length T = 500 as a sum of three ‘true’ signal components, one that is Gaussian noise, one that is smooth, and one that is Boolean, i.e., takes on only two values. The first signal, denoted x̃^1 ∈ R^T, has IID entries N(0, 0.1^2). The second true component is the quasiperiodic signal with three frequencies

x̃^2_t = ∑_{j=1}^{3} a_j cos(ω_j t + δ_j),    t = 1, ..., T,

where a_j > 0 are the amplitudes, ω_j > 0 are the frequencies, and δ_j ∈ [0, 2π] are the phases, all chosen randomly. The last true component signal x̃^3 has the form

x̃^3_t = { θ̃_2   ∑_{j=1}^{3} a′_j cos(ω′_j t + δ′_j) ≥ 0
        { 0     ∑_{j=1}^{3} a′_j cos(ω′_j t + δ′_j) < 0,

for t = 1, ..., T, where a′_j, ω′_j, and δ′_j are a different set of amplitudes, frequencies, and phases, also chosen randomly. We construct our signal as

y = x̃^1 + x̃^2 + x̃^3,

with θ̃_2 (the ‘true’ value of θ_2) chosen randomly. The signal y and the three true components x̃^i are shown in figure 1. The data in this example have no missing entries.
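A rough numpy recreation of this data-generation recipe (the specific ranges used for the random amplitudes, frequencies, phases, and θ̃_2 are our own guesses; the text only says they are chosen randomly):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
t = np.arange(1, T + 1)

# x1: IID Gaussian noise, N(0, 0.1^2).
x1 = 0.1 * rng.standard_normal(T)

def quasiperiodic(rng, t):
    # Three random amplitudes, frequencies, and phases (ranges are assumptions).
    a = rng.uniform(0.2, 1.0, 3)
    w = rng.uniform(0.01, 0.1, 3)
    d = rng.uniform(0, 2 * np.pi, 3)
    return sum(a[j] * np.cos(w[j] * t + d[j]) for j in range(3))

# x2: smooth quasiperiodic component with three frequencies.
x2 = quasiperiodic(rng, t)

# x3: Boolean component, equal to theta2_true where another quasiperiodic
# signal is nonnegative, and 0 elsewhere.
theta2_true = rng.uniform(0.1, 2.0)
x3 = np.where(quasiperiodic(rng, t) >= 0, theta2_true, 0.0)

y = x1 + x2 + x3
```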

Parameter search. We use our method (described in detail later) to approximately solve
the SD problem for a grid of 21 values of θ_1, logarithmically spaced between 10^{−1} and 10^{6},
and 21 values of θ2 , linearly spaced between 0.1 and 2.0, for a total of 441 different values
of the parameters. For each of these, we evaluate the test error using 10 random selections

of the test set as described above.

Figure 1: Synthetic data for simple signal decomposition example.

Figure 2: Validation mean-square test error as a function of the parameters θ1 and θ2.

Thus all together we solved 4410 instances of the SD
problem, which took about 13 minutes on a 2016 MacBook Pro. Each SD problem took
about 0.17 seconds to solve. (We solved the problems sequentially, but the computation is
embarrassingly parallel and could have been carried out faster using more processors.)
The mean-square test error for the parameter grid search is shown as a heat map in
figure 2. We use the final values θ1 = 320, θ2 = 0.765, which achieved the smallest mean-
square test error. Having chosen θ1 and θ2 , we approximately solve the SD problem one final
time, using all the data.
There is some discrepancy between the value we find, θ_2 = 0.765, and the true value used to generate the data, θ̃_2 = 0.7816, due to the discreteness of the grid search. (In a real application, we might do a secondary, refined grid search of values near the best ones found in this crude grid search.)

Final decomposition. The final decomposition is shown in figure 3. Evidently the de-
composition is quite good. The Boolean component is exactly reconstructed, aside from the
slight discrepancy in its amplitude. The smooth component is also well reconstructed, with an RMS (root mean-square) error of about 0.04.

Figure 3: Signal decomposition for simple example. The top three plots show the true components and the estimated components. The bottom plot shows the original signal y and x^2 + x^3, i.e., our decomposition without the residual component x^1.
3 Related work
Here we discuss a wide variety of methods that relate to the SD problem, some of which are
quite old. Many methods described below do not explicitly form an optimization problem,
and when they do, it need not be the same as our SD problem. Others, however, have the
exact same formulation, i.e., minimizing a sum of loss functions for signal component classes
subject to their sum matching an observed or given signal at known entries.

Regression. Least-squares linear regression is a particular instance of the SD problem,


with two component classes, a mean-square small component, and a component defined
by a basis (11), with the basis components the regressors or features. This SD problem
instance admits a well-known, closed form solution [BV18, §12]. The idea of solving an
over-determined system of linear equations by minimizing the sum of the squares of the
errors was proposed independently by the mathematicians Carl Friedrich Gauss and Adrien-
Marie Legendre around the beginning of the 19th century. Statistical justifications for this
fitting procedure were subsequently provided by Gauss, Laplace, Cauchy, and Thiele, among
others [Far01].

Robust regression. Robust regression covers a variety of techniques to reduce model


variance in the presence of data ‘outliers,’ which is a term without a precise definition
but can be thought of as data points that are not well explained by a linear regression
model. Common methods include Huber regression [Hub64, Hub81], Theil-Sen estima-
tion [The50, Sen68], and RANSAC [FB81], which are included in the popular Python pack-
age, scikit-learn [PVG+ 11]. In the SD framework, the residual component class used in
linear regression is substituted with an alternative penalty function that is less sensitive to
outliers. The penalty function formulation is discussed in detail in [BV09, §6.1 and §6.4].
Interestingly, the idea of minimizing the sum of absolute errors in an over-determined system
of equations actually predates the development of least-squares minimization, having been
proposed in the mid-18th century by Roger Joseph Boscovich [FB81]. In the SD framework,
robust regression is modeled using two residual classes, one the standard mean-square small,
and the other a loss function that grows slowly for large values, like the average absolute loss
φ(x) = (1/(Tp)) ‖x‖_1 = (1/(Tp)) ∑_{t=1}^{T} ∑_{i=1}^{p} |x_{t,i}|.    (13)

(This same loss is used as a convex heuristic for a sparse signal, i.e., one with many entries
zero.)

Regularized regression. Regularized regression, also known as penalized regression or


shrinkage methods, is a family of estimators that introduce an additional penalty term
on coefficients of a linear regression problem. Well known examples include ridge regres-
sion [Tik63, Phi62, HK70], lasso regression [Tib96], and elastic-net regression [ZH05]. An

overview of different regularizer functions for regression is given in [BV09, §6.3], and a review
of other regressor selection methods is given in [HTF13, Ch. 3–4]. In the SD framework,
regularized regression is modeled by extending the basis class (11) with an additional loss
term on the internal variable, as in
φ(x) = { ℓ(z)   x = θz for some z ∈ R^{d×p}
       { ∞      otherwise.

Isotonic regression. In isotonic (or monotonic) regression we fit a given signal with a
non-decreasing (or non-increasing) signal [BC90, WWM01]. This is a particular instance
of the SD problem, with p = 1 and K = 2 component classes: a sum-of-squares small
residual and a monotone component, which has a loss function that is zero if its argument is
non-decreasing and infinite otherwise. An efficient algorithm, with complexity linear in T, is included in scikit-learn [PVG+ 11]. A detailed discussion of a linear time algorithm and
the connection to projection operators is given in [GW84].

Trend filtering. Trend filtering, also called signal smoothing, is the process of estimat-
ing a slowly varying trend from a scalar time series that includes rapidly varying noise. In
many cases this is also a special case of SD, with a mean-square residual component and a
component that is slowly varying, for example, with a mean-square second difference loss
function. Trend filtering has been employed in a wide variety of applications and settings,
including astrophysics [Tit85], geophysics [BC02, Blo92, BN92], social sciences [Lev04], bi-
ology [LS94], medicine [GL92], image processing [TK93], macroeconomics [HP97, Sin88],
and financial time series analysis [Tsa05, §11]. Many specific trend filtering methods have been proposed, with moving-average filtering [Osb95] and Hodrick-Prescott (HP) filtering [HP97, Les61] being two of the most well known. More recently, Kim et al. have
proposed `1 trend filtering [KKBG09], which uses as component loss function the `1 norm
of the second difference, which tends to result in piecewise affine signals (see §6.2 for an
example). A Bayesian interpretation of trend filtering and signal denoising is presented in
[Tit85, TK93, BS93, CP11].

Seasonal-trend decomposition. Seasonal-trend decomposition was originally motivated


by the analysis of economic data which tend to have strong seasonality; this method is ar-
guably what most people think of when they hear the term “time series decomposition,”
having first been proposed in the 1920s as a natural extension of moving average smooth-
ing [And27]. Seasonal-trend decomposition is the only one presented in the chapter on time
series decomposition in Hyndman and Athanasopoulos [HA18, §6]. A popular algorithm
that implements a specific method for seasonal-trend decomposition is STL [CCMT90], with
packages available for Python, R, and Matlab [STLa, STLb, STLc].
STL can be considered a specific case of the SD problem, with a scalar signal y and K = 3
component classes, i.e., seasonal, trend, and residual. However, STL does not formulate the
method as an optimization problem and uses an iterative heuristic to form the estimates of
the components.

Traditional frequency domain filtering. Traditional EE-style filtering (e.g., [OS10])
can be interpreted as a form of SD. For example, low pass filtering decomposes a signal into
a smooth component (the filter output) and a small, rapidly varying component (the residual,
or difference of the original signal and the low pass signal). This can often be represented as
SD with two components, a residual and a smooth or low-pass component with appropriate
time-invariant quadratic loss function. A traditional filter bank can be interpreted as giving
a decomposition of a signal into multiple components, each one corresponding to a different
region (or sub-band) in the spectrum of the signal.

Sparse signal recovery. Sparse signal recovery is concerned with finding sparse repre-
sentations of signals with respect to some known (typically over-complete) basis. The use
of (convex) optimization to solve sparse signal recovery problems has a long history with
many proposed approaches, and there are some very nice overviews available in [WM22,
MCD+ 14, TF11]. These methods have historically been applied to the problem of data com-
pression, such as the JPEG and JPEG2000 standards [Mal09, BCDH10]. These methods
are all related to the regularized linear inverse problem [CRPW10],

minimize    f(x)
subject to  Ax = y,          (14)

where the matrix A and the vector y are problem data, and f is some ‘complexity measure’
that encourages sparseness. A common variant is to relax the equality constraint,

minimize  ‖Ax − y‖_2^2 + λf(x).    (15)

When f (x) = kxk1 , (14) is known as basis pursuit or compressed sensing, and (15) is the
lasso, which we encountered in the previous paragraph. The geometry of these and related
problems, specifically in the case where f (x) = kxk1 , has been extensively analyzed to deter-
mine when sparse signals are recoverable in [ALMT14]. The matrix A generally represents
the data generation process, either derived from known measurements or, in the case of
dictionary methods, derived from pre-defined, parameterized waveforms, like sinusoids or
wavelets. With dictionary learning methods the matrix A is fit to the data as well [TF11].
When A is introduced as a decision variable, problems (14) and (15) are no longer convex, but there are well-established methods for approximately solving problems of this form [UHZB16].

Matrix completion. In the basic formulation of this problem, we seek a low rank matrix X which matches a known matrix M at a set of known indices [CR09]. A closely related problem is (robust) principal component pursuit, in which an observed matrix is decomposed into a low-rank component and a sparse component [CLMW11, WM22].

Convex demixing. Convex demixing has a long history [MT14, §7.1], beginning in the
geophysics community in the 1970s [CM73, TBM79]. It refers to the task of identifying

two (or sometimes more) ‘structured signals,’ given only the sum of the two signals and information about their structures [MCD+ 14, MT14]. The standard formulation for convex demixing is

minimize    f(x) + λg(z)
subject to  x + z = y,          (16)
where x and z are the decision variables, y is the observed signal, and λ is a regularization
parameter. This is evidently a two-class, convex SD problem. In this formulation, the focus
tends to be on demixing signals that are sparse in various senses. A classic example is the
‘spikes and sines problem’, which shows up in a variety of applications including astron-
omy, image inpainting, and speech enhancement in signal processing [SMF10, DH01]. More
generally, these types of problems include demixing two signals that are sparse in mutually
incoherent bases, decoding spread-spectrum transmissions in the presence of impulsive (i.e.,
sparse) errors, and removing sparse corruptions from a low-rank matrix. Problem (16) has
been deeply studied in many contexts, and much of the existing work has focused on find-
ing solution methods and analyzing recovery bounds (i.e., uniqueness) when f and g are
various sparsity-inducing matrix norms [CRPW10, Bac10]. A three-operator extension of
(16)—where one operator is a smooth, nonconvex function and the other two operators are
convex functions—is studied in [YMS21]. These are instances of our signal decomposition
problem.

Contextually supervised source separation (CSSS). This is an optimization-based


framework for solving signal decomposition problems, in which the signal components are
assumed to be roughly correlated with known basis vectors [WK13], and is very similar in
many ways to the method presented in this paper. CSSS is extensible, allowing for different
loss terms on the linear representations, component estimates, and linear fit coefficients. The
SD formulation proposed in this paper is a further generalization of contextually supervised source separation, and the solution methods proposed in §4 handle all instances of contextually supervised source separation, which form a subset of SD problems.

Infimal convolution. The infimal convolution of functions f_i : R^n → R, i = 1, ..., K, denoted f_1 □ ··· □ f_K, is defined as

(f_1 □ ··· □ f_K)(v) = inf { f_1(x_1) + ··· + f_K(x_K) | v = x_1 + ··· + x_K },
as described (for convex functions) in [Roc70, §16] and [PB14, §3.1]. The case of nonconvex
functions was considered in [PR96]. We see that the SD problem, with no missing data, is
the problem of evaluating the infimal convolution of our component loss functions, on the
given signal y.

Proximal operator. The proximal operator of a function f arises often in optimization,


and is the basis of our solution methods. The details are given below, but we note here that
evaluating a proximal operator of the function f is an SD problem (again, with no missing
data) with a mean-square loss and the loss f .

4 Solution methods
In this section we describe two related methods for solving the SD problem (when it is
convex), and approximately solving it (when it is not convex). Both rely on the masked
proximal operators of the component class losses.

4.1 Masked proximal operator


Recall that the proximal operator [Mor62, PB14] of φ_k is defined as

prox_{φ_k}(v) = argmin_x ( φ_k(x) + (ρ/2) ‖x − v‖_F^2 )
             = argmin_x ( φ_k(x) + (ρ/2) ∑_{t,i} (x_{t,i} − v_{t,i})^2 ),

where ρ is a positive parameter, and v ∈ R^{T×p}. When φ_k is convex, the function minimized is strictly convex, so there is a unique argmin. When φ_k is not convex, there can be multiple argmins; we simply choose one.
The masked proximal operator is defined as

mprox_{φ_k}(v) = argmin_x ( φ_k(x) + (ρ/2) ‖M(x − v)‖_2^2 )
              = argmin_x ( φ_k(x) + (ρ/2) ∑_{(t,i)∈K} (x_{t,i} − v_{t,i})^2 ).

Roughly speaking, it is the proximal operator, with the norm term only taken over known
entries. (The masked proximal operator depends on K, but we suppress this dependency to
keep the notation lighter.) The function minimized in the masked proximal operator need
not have a unique minimizer, even when φk is convex. In this case, we simply pick one.
When the function φk takes on the value ∞ (i.e., encodes constraints), the point x =
mproxφk (v) is feasible, i.e., satisfies φk (x) < ∞. We also note that mproxφk (v) does not
depend on vt,i for (t, i) ∈ U, so we have
mproxφk (v) = mproxφk (M∗ Mv). (17)
When there are no unknown entries, i.e., U = ∅, the masked proximal operator reduces
to the standard proximal operator. There is another simple connection between the proximal
operator and the masked proximal operator. Starting with a loss function φ, we define the
function

φ̃(z) = inf{ φ(M∗Mz + u) | Mu = 0 },

which is, roughly speaking, the original loss function where we minimize over the unknown entries in y. If φ is convex, so is φ̃, since it is its partial minimization [BV09, §3.2.5]. The masked proximal operator is then

mprox_φ(v) = prox_{φ̃}(v),

the proximal operator of the partially minimized loss function.
For many component loss functions we can work out the masked proximal operator
analytically. In many other cases we can compute it with reasonable cost, often linear in T ,
the length of the signals. The monographs [PB14, §6] and [BPC+ 11] discuss the calculation
of proximal operators in depth and list many well known results. Many closed form proximal
operators are listed in the appendix of [CP11]. Many of these have straightforward extensions
to the masked proximal operator.
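As a concrete example of an analytical masked proximal operator, consider the mean-square small loss (3): the minimizer is zero on the unknown entries, and each known entry solves a scalar quadratic, giving x_{t,i} = ρ v_{t,i} / (2/(Tp) + ρ). A minimal numpy sketch (our own code; the boolean array `known` encodes K):

```python
import numpy as np

def mprox_mean_square_small(v, known, rho, Tp):
    """Masked prox of phi(x) = (1/(Tp)) ||x||_F^2.

    For (t,i) not in K the minimizer is 0; for (t,i) in K it solves
    (2/(Tp)) x + rho (x - v) = 0, i.e., x = rho v / (2/(Tp) + rho)."""
    x = np.zeros(v.shape)
    x[known] = rho * v[known] / (2.0 / Tp + rho)
    return x

# Quick check against brute-force scalar minimization on one known entry.
T, p = 6, 2
rho = 2.0 / (T * p)
v = np.arange(T * p, dtype=float).reshape(T, p)
known = np.ones((T, p), dtype=bool)
known[0, 0] = False
x = mprox_mean_square_small(v, known, rho, T * p)
grid = np.linspace(-10, 10, 100001)
obj = grid ** 2 / (T * p) + rho / 2 * (grid - v[1, 1]) ** 2
assert abs(x[1, 1] - grid[np.argmin(obj)]) < 1e-3
assert x[0, 0] == 0.0
```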
As a final generalization, we introduce the weighted proximal operator, which we define as

wprox_{φ_k}(v) = argmin_x ( φ_k(x) + (ρ/2) ∑_{(t,i)∈K} w_{t,i} (x_{t,i} − v_{t,i})^2 ),

with nonnegative weights wt,i ∈ R+ for all (t, i) ∈ K. The weighted proximal operator arises
in the evaluation of certain masked proximal operators, as discussed in §5.3 and §5.4. When
all the weights are one, the weighted proximal operator coincides with the masked proximal
operator.

Proximal operator as SD problem. We note that the proximal operator itself can be
seen as a simple instance of an SD problem, with v playing the role of y, and components
x and v − x, with associated loss functions φk and (ρ/2)k · k2F , respectively. The masked
proximal operator is the version of this signal decomposition problem with missing entries
in v.
Thus, evaluating the masked proximal operator is the same as solving a simple SD prob-
lem with two components, one of which is scaled mean-square small. Our algorithms, de-
scribed below, solve (or approximately solve) the general SD problem by iteratively solving
these simple two component SD problems for each component.

Surrogate gradient. When φ is convex, the optimality condition for evaluating the
masked proximal operator x = mproxφ (v) tells us that

g = ρM∗ M(v − x) ∈ ∂φ(x), (18)

where ∂φ(x) is the subdifferential (set of all subgradients) of φ at x. So evaluating the


masked proximal operator at a point v automatically gives us a subgradient of the loss at
the image point x = mproxφ (v). When φ is not convex, we can interpret g in (18) as a
surrogate gradient.

Stopping criterion. In both of our algorithms, x2 , . . . , xK are found by evaluating the


loss function masked proximal operators, i.e.,

xk = mproxφk (v k ), k = 2, . . . , K,

for some v k . (The particular v k used to find xk depend on which algorithm is used, but each
of them satisfies v k = M∗ Mv k , i.e., they are zero in the unknown entries of y.) We define
x1 = M∗ M(y − x2 − · · · − xK ), so x1 , . . . , xK are feasible and x1 = M∗ Mx1 .
We combine (9) with (18) and define the optimality residual r as

r = ( (1/(K−1)) ∑_{k=2}^{K} ‖ ρM∗M(v^k − x^k) − (2/(Tp)) x^1 ‖_F^2 )^{1/2},    (19)

which can be written as

r = ( (1/(K−1)) ∑_{k=2}^{K} ‖ M( ρ(v^k − x^k) − (2/(Tp)) x^1 ) ‖_2^2 )^{1/2}.

When r = 0 and the losses are convex, x1 , . . . , xK are optimal.


Both of our algorithms use the standard stopping criterion

r ≤ ε^{abs} + ε^{rel} ‖(2/(Tp)) x^1‖_F = ε^{abs} + ε^{rel} ‖(2/(Tp)) Mx^1‖_2,    (20)

where ε^{abs} and ε^{rel} are specified positive absolute and relative tolerances.
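A sketch of how (19) and (20) might be computed with numpy (our own code, not the OSD implementation; `vs` and `xs` hold v^2, ..., v^K and x^2, ..., x^K, and `known` is the boolean mask of K):

```python
import numpy as np

def optimality_residual(vs, xs, x1, known, rho, Tp):
    """Residual (19): RMS over k = 2,...,K of the masked mismatch between
    rho*M*M(v^k - x^k) and (2/(Tp)) x^1."""
    terms = []
    for v, x in zip(vs, xs):
        diff = rho * (v - x) - (2.0 / Tp) * x1
        terms.append(np.sum(diff[known] ** 2))
    return np.sqrt(np.mean(terms))

def stopped(r, x1, known, Tp, eps_abs=1e-5, eps_rel=1e-5):
    """Stopping criterion (20)."""
    return r <= eps_abs + eps_rel * np.linalg.norm((2.0 / Tp) * x1[known])
```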

4.2 Block coordinate descent algorithm


The BCD algorithm repeatedly minimizes the objective in (7),

(1/(Tp)) ‖My − Mx^2 − ··· − Mx^K‖_2^2 + φ_2(x^2) + ··· + φ_K(x^K),

over a single (matrix) variable x^k, holding the other variables fixed. Minimizing the objective over x^k, with x^i fixed for i ≠ k, is the same as evaluating the masked proximal operator of φ_k:

x^k = mprox_{φ_k}( y − ∑_{i≠k} x^i ),

with parameter ρ = 2/(T p). (Note that the masked proximal operator does not depend on
the entries of its argument that are unknown in y.) There are many choices for the sequence
in which we minimize over the variables, but we will use the simplest round-robin method,
updating x2 , then x3 , and on to xK , and then back to x2 again. This gives the SD-BCD
algorithm described below, with superscript j on the variables denoting iteration number,
where an iteration consists of one cycle of (successively) minimizing over x2 , . . . , xK .

Algorithm 4.1 Block coordinate descent algorithm for SD problem (SD-BCD)

Initialize. Set (x^k)^0, k = 2, ..., K, as some initial estimates.
for iteration j = 0, 1, ...
    for component class k = 2, ..., K
        Update a component using the masked proximal operator:
        (x^k)^{j+1} = mprox_{φ_k}( y − ∑_{i<k} (x^i)^{j+1} − ∑_{i>k} (x^i)^j ).

In SD-BCD we use the most recently updated value for the other components, in Gauss-
Seidel fashion. Since we fix ρ = 2/(T p), this algorithm contains no parameters to tune.
Note that SD-BCD accesses the component class loss functions only through their masked
proximal operators; in particular we never evaluate φk or its derivatives.
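A compact numpy sketch of SD-BCD (our own illustration, not the OSD implementation; each entry of `mproxes` is assumed to be a masked proximal operator for one of the classes k = 2, ..., K, with ρ = 2/(Tp) built in, accepting a T × p array whose unknown entries are marked with NaN):

```python
import numpy as np

def sd_bcd(y, mproxes, n_iter=100):
    """Block coordinate descent (SD-BCD) sketch.

    y       : T x p array with NaN marking the unknown entries
    mproxes : masked proximal operators for classes 2, ..., K
    Returns [x^1, x^2, ..., x^K].
    """
    known = ~np.isnan(y)
    y0 = np.where(known, y, 0.0)                  # M*M y
    xs = [np.zeros(y.shape) for _ in mproxes]     # x^2, ..., x^K
    for _ in range(n_iter):
        for k in range(len(xs)):
            others = sum(xs[i] for i in range(len(xs)) if i != k)
            arg = np.where(known, y0 - others, np.nan)
            xs[k] = mproxes[k](arg)               # round-robin Gauss-Seidel update
    x1 = np.where(known, y0 - sum(xs), 0.0)       # residual component x^1 = M*M(y - sum)
    return [x1] + xs
```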

Stopping criterion. We evaluate the stopping criterion (20) at the end of each iteration,
using x1 = M∗ M(y − x2 − · · · − xK ) and v k the argument of the proximal operator in
SD-BCD.

Convergence. SD-BCD is evidently a descent algorithm, i.e., the objective is nonincreas-


ing in each iteration. (In fact, it is nonincreasing after each update of one of the components.)
Well known simple examples show that block coordinate descent need not converge to an
optimal point even when the objective is convex. There is a large body of literature on
the convergence of block coordinate descent type methods. Some recent review papers include [Wri15, BT13, RT14], and a classic textbook that addresses the topic is [Ber16, §3.7]. These convergence proofs often rely on randomly permuting the block update order, but we have found this has no practical effect on the convergence of SD-BCD. None of the cited literature exactly proves the convergence of the algorithm presented here, so we give a simple proof that any fixed point of SD-BCD must be optimal when the losses are all convex. When
one or more loss functions are not convex, the algorithm may (and often does) converge to
a non-optimal stationary point.

Fixed point of SD-BCD. Here we show that if x2 , . . . , xK are a fixed point of SD-BCD,
and the losses are all convex, then the decomposition is optimal. If these variables are a
fixed point, then for k = 2, ..., K,

x^k = mprox_{φ_k}( y − ∑_{i≥2, i≠k} x^i ).

From these and (18) we find that for k = 2, ..., K,

g^k = (2/(Tp)) M∗M( y − ∑_{i≥2, i≠k} x^i − x^k )
    = (2/(Tp)) M∗M( y − ∑_{i=2}^{K} x^i )
    = (2/(Tp)) x^1
    ∈ ∂φ_k(x^k),

where in the third line we use x^1 = M∗M(y − ∑_{i=2}^{K} x^i). This is the optimality condition (9).

4.3 ADMM algorithm


Here we introduce an operator splitting method for the SD problem. The particular operator
splitting method we use is the alternating directions method of multipliers (ADMM) [GM75,
GM76, BPC+ 11]. The ADMM algorithm we develop for the SD problem is closely related
to the sharing problem [BPC+ 11, §7.3] and the optimal exchange problem [BPC+ 11, §7.3.2],
but not the same. The algorithm uses a scaled dual variable u ∈ Rq , and we denote iteration
number with the superscript j.

Algorithm 4.2 ADMM for SD problem (SD-ADMM)

Initialize. Set u^0 = 0 ∈ R^q, and (x^k)^0 ∈ R^{T×p}, k = 1, ..., K, as some initial estimates.
for iteration j = 0, 1, ...
    1. Evaluate masked proximal operators of component classes in parallel.
       (x^k)^{j+1} = mprox_{φ_k}( (x^k)^j − 2M∗u^j ),    k = 1, ..., K.
    2. Dual update.
       u^{j+1} = u^j + (1/K) ( ∑_{k=1}^{K} M(x^k)^{j+1} − My ).

A detailed derivation of this algorithm is given in appendix §A. Unlike BCD, SD-ADMM
is not a descent method. It is also not a feasible method: the iterates satisfy the consistency
constraint My = Mx1 + · · · + MxK only in the limit.

Interpretations. From the dual update, we see that uj is the running sum of the residual
in the consistency constraint, scaled by 1/K; this term is used in the argument of the masked
proximal operator to drive xk to optimality.
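A matching numpy sketch of SD-ADMM (again our own illustration; `mproxes` holds masked proximal operators for all K classes, each accepting a T × p array and ignoring its unknown entries, as in (17), and missing entries of y are marked with NaN):

```python
import numpy as np

def sd_admm(y, mproxes, n_iter=200):
    """SD-ADMM sketch following Algorithm 4.2."""
    known = ~np.isnan(y)
    My = y[known]                                  # masked data, a vector in R^q
    K = len(mproxes)
    xs = [np.zeros(y.shape) for _ in range(K)]     # x^1, ..., x^K
    u = np.zeros(My.shape)                         # scaled dual variable in R^q
    for _ in range(n_iter):
        # Step 1: masked proximal operators, independent across classes.
        for k in range(K):
            arg = xs[k].copy()
            arg[known] -= 2.0 * u                  # (x^k)^j - 2 M* u^j
            xs[k] = mproxes[k](arg)
        # Step 2: dual update with the scaled consistency residual.
        u = u + (sum(x[known] for x in xs) - My) / K
    return xs
```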

Convergence with convex losses. When φk are all convex, SD-ADMM converges to a
solution, and ρuj converges to an optimal dual variable ν [BPC+ 11, §3.2]. In particular, the
consistency constraint (1) holds asymptotically.

Convergence with nonconvex losses. When any of the loss functions is nonconvex,
there are no convergence guarantees at all. The ADMM algorithm need not converge, and if
it converges it need not converge to a solution of the SD problem. But it has been observed
in practice that ADMM, when applied to nonconvex problems, often converges to a useful
value, which in our case is a useful signal decomposition; see, e.g., [BPC+ 11, §9].

Stopping criterion and final decomposition. The consistency constraint generally


does not hold for the iterates. To obtain a decomposition that satisfies the consistency
constraint, we can simply absorb the residual in the consistency constraint into x1 to obtain
a feasible signal decomposition. We can then evaluate the residual in (20), with v k the
arguments of the proximal operators in step 1 of SD-ADMM.

Choice of ρ. When the problem is convex, SD-ADMM converges to a solution for any
positive value of the algorithm parameter ρ, although the practical convergence speed can
be affected by the choice of ρ. The natural value ρ = 2/(Tp) seems to give good performance
in practice. When the problem is not convex, the choice of ρ is more critical, and can affect
whether or not the algorithm converges, and when it converges, the decomposition found.
For such problems too, the natural choice ρ = 2/(T p) seems to often give good results,
although we have found that scaling this value can improve the practical convergence for
some nonconvex problems. We take ρ = 2η/(T p), with η in the range between 0.5 and 2.

4.4 Hybrid algorithms


Comparison of SD-BCD and SD-ADMM. For convex SD problems, SD-BCD often
outperforms SD-ADMM, but not by much. For nonconvex SD problems, we have found
that SD-ADMM often outperforms SD-BCD in the quality of the decomposition found.
Specifically, SD-BCD often ends up converging to a poor local minimum, whereas SD-
ADMM is able to find a much better (lower objective) decomposition. On the other hand, for
nonconvex SD problems, one or two iterations of SD-BCD, starting from the decomposition
found by SD-ADMM, can lead to a modest improvement in the objective value found. (These
iterations cannot increase the objective, since SD-BCD is a descent method.)

Hybrid methods. A reasonable strategy, and the default in our implementation, is to use
SD-BCD if the SD problem is convex. If the SD problem is nonconvex, the default uses SD-
ADMM (with scale factor η = 0.7) until convergence, and then follows this with SD-BCD,
again run until convergence (quite often, but not always, only a few iterations). This hybrid
method seems to work well on a wide variety of SD problems.

Table 1: Summary of numerical examples

Name      Section   K   T         p   Size (KTp)   q         Convex
simple    §2.9      3   500       1   1,500        500       no
CO2       §7.1      3   2,459     1   7,377        2,441     yes
traffic   §7.2      5   105,552   1   527,760      101,761   yes
PV        §7.3      5   20,212    7   707,420      135,899   no

Figure 4: Residual versus iteration number for the 4 numerical examples given in §2.9, §7.1, §7.2, and §7.3 respectively.

Numerical examples. In this paper we consider four numerical examples, summarized in


table 1. They include convex and nonconvex problems, and range from small to large, with
the SD problem in PV having over 700,000 variables. We use these examples to illustrate the
convergence of the hybrid algorithm. In figure 4 we plot the residual (19) versus iteration
number for these four problems.
We see rapid and monotonic convergence for problems CO2 and traffic, which are
convex. For simple and PV, which are nonconvex, we can see the switch to SD-BCD at the
end, with a sharp reduction in residual in simple in just a few iterations, and a smoother
reduction of residual over 12 iterations in PV. None of the examples requires more than 100
iterations to converge.

5 Component class attributes
In this section we describe some very basic attributes that component class losses can have.

5.1 Separability
A component class loss function can be separable across time, or entries, or both.

Time-separable losses. A function φ : R^{T×p} → R ∪ {∞} is separable across time if it has the form

φ(x) = ∑_{t=1}^{T} ℓ_t(x_t)

for some functions ℓ_t : R^p → R ∪ {∞}, t = 1, ..., T. It is common for the loss functions to not depend on t, in which case we say it is time-invariant. A simple example is the mean-square loss (3), with ℓ_t(x_t) = (1/(Tp)) ‖x_t‖_2^2 for all t.

Entry-separable losses. A component class function φ is separable across entries if it has the form

φ(x) = ∑_{i=1}^{p} ℓ_i(x̆_i)

for some functions ℓ_i : R^T → R ∪ {∞}, i = 1, ..., p, where x̆_i is the ith column of x (which can be interpreted as a scalar time series), the ith entry of the vector time series {x_t}. Here too it is common for the loss function to not depend on i, in which case we say it is symmetric (in the entries of x_t). The mean-square loss (3) is symmetric (in addition to being time-separable).

Separability and proximal operators. Separability reduces the complexity of evaluating the masked proximal operator. For example if φ is separable across time, say, φ(x) = ∑_t ℓ_t(x_t), its masked proximal operator is

mprox_φ(v) = [ mprox_{ℓ_1}(v_1)^T
               ⋮
               mprox_{ℓ_T}(v_T)^T ],

i.e., we can evaluate the masked proximal operator in parallel for each time t = 1, ..., T.
(Note the masked proximal operator for t depends on the missing data for that time period.)
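A sketch of this row-wise evaluation (our own; `prox_t` is an assumed per-time masked prox that receives one row of v and the corresponding row of the known mask):

```python
import numpy as np

def mprox_time_separable(v, known, prox_t):
    """Evaluate the masked prox of a time-separable loss row by row.

    prox_t(v_t, known_t) returns the masked prox of the slice loss at one
    time step; the rows could equally well be processed in parallel."""
    x = np.empty(v.shape)
    for t in range(v.shape[0]):
        x[t] = prox_t(v[t], known[t])
    return x

# Example: a time-separable mean-square loss with rho = 1 (an assumption);
# each row's prox is zero on unknown entries and shrinks the known ones by 1/2.
prox_row = lambda vt, kt: np.where(kt, vt / 2.0, 0.0)
v = np.ones((5, 3))
known = np.ones((5, 3), dtype=bool)
print(mprox_time_separable(v, known, prox_row))
```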

5.2 Time-invariance
Time-invariance or shift-invariance is another important attribute. We let M < T denote
the memory of the loss function φ. We say φ is time-invariant if it has the form
φ(x) = ∑_{t=1}^{T−M+1} ℓ(x_{t:t+M−1}),

where x_{t:t+M−1} is the M × p slice of x that includes rows t, ..., t + M − 1, and ℓ : R^{M×p} → R ∪ {∞} is the slice loss. Thus, a time-invariant component class loss is the sum of the slice loss, applied to all M-long slices of its argument. With this definition, a time-separable time-invariant loss is a special case of time-invariance, with memory M = 1.
The second-order mean-square smooth loss (12) is a simple example of a time-invariant component class loss, with M = 3. As another example, consider the class of P-periodic signals, with loss

φ(x) = { 0   x_{t+P} = x_t, t = 1, ..., T − P,
       { ∞   otherwise,                           (21)
which has memory M = P + 1.

5.3 Convex quadratic


A loss is convex quadratic if it has the form

φ(x) = { (1/2) x_:^T P x_: + q^T x_: + r    A x_: = b
       { ∞                                   otherwise,        (22)

where x_: ∈ R^{Tp} is a vector representation of x (and x_:^T is its transpose), P ∈ R^{Tp×Tp} is symmetric positive semidefinite, q ∈ R^{Tp}, r ∈ R, A ∈ R^{L×Tp}, and b ∈ R^L. Thus φ is convex
quadratic, with some equality constraints. We have already encountered a few examples of
convex quadratic loss functions, such as mean-square small and mean-square smooth.
As a more interesting example, consider the P -periodic smooth loss, defined as
φ(x) = (1/(Pp)) ( ‖x_2 − x_1‖_2^2 + ··· + ‖x_P − x_{P−1}‖_2^2 + ‖x_1 − x_P‖_2^2 ),    (23)
provided x is P -periodic, i.e., xt+P = xt for t = 1, . . . , T − P , and φ(x) = ∞, otherwise.
This is the same as the P -periodic loss (21), with mean-square smoothness, taken circularly.

Masked proximal operator of convex quadratic loss. The masked proximal oper-
ator of a convex quadratic loss function can be efficiently evaluated; more precisely, after
the first evaluation, subsequent evaluations can be carried out more efficiently. Evaluating
the masked proximal operator involves minimizing a convex quadratic function subject to
equality constraints, which in turn can be done by solving a set of linear equations, the
KKT (Karush-Kuhn-Tucker) equations [BV18, §16]. If we cache the factorization used to
solve this set of linear equations (e.g., the LDLT factorization of the coefficient matrix),
subsequent evaluations require only the so-called back-solve step, and not the factorization.
This idea is often exploited in ADMM; see [BPC+ 11, §4.2].
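    A minimal sketch of this caching pattern is given below, for a convex quadratic loss of
the form (22) with the variable stored as a vector and a fixed mask. The class name and
interface are ours, and a generic LU factorization stands in for the LDL^T factorization
mentioned above.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    class QuadMaskedProx:
        """Masked prox of phi(x) = 0.5*x'Px + q'x + r subject to Ax = b,
        for a fixed mask. The KKT matrix is factored once in __init__;
        every later call performs only a back-solve."""

        def __init__(self, P, q, A, b, known, rho):
            n = P.shape[0]
            m = A.shape[0]
            self.q, self.b, self.rho, self.n = q, b, rho, n
            self.d = known.astype(float)              # 1 on known entries, 0 elsewhere
            K = np.zeros((n + m, n + m))
            K[:n, :n] = P + rho * np.diag(self.d)     # quadratic term plus masked penalty
            K[:n, n:] = A.T
            K[n:, :n] = A
            self.factor = lu_factor(K)                # cached factorization

        def __call__(self, v):
            # the right-hand side changes with v; the factorization does not
            rhs = np.concatenate([self.rho * self.d * np.nan_to_num(v) - self.q, self.b])
            return lu_solve(self.factor, rhs)[:self.n]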

Weighted proximal operator. In evaluating the masked proximal operators of certain


convex quadratic loss functions, it can be more computationally efficient to evaluate a related
weighted proximal operator. This is seen commonly with loss functions that are P-periodic.
In this case, the solution to the masked proximal operator may be found by evaluating a
smaller weighted proximal operator. Specifically, the weighted proximal operator is evaluated
over a variable z ∈ R^{P×p}, representing a single period of the component. The input to this smaller
proximal operator is the original input, averaged across periods, using only the available
data, i.e., the entries in K. The weights are defined as the number of known entries used in
each averaging operation, divided by the total possible number of entries. (Some additional
care must be taken here when the signal length is not an even multiple of the period
length.)

5.4 Common term


Another common attribute of a component class is when it represents a common term across
the entries of the signal. The loss has the form

    φ(x) = { φ̃(z)   x_t = z_t 1, t = 1, . . . , T
           { ∞       otherwise,                                        (24)

where φ̃ : R^T → R ∪ {∞} is a loss function for a scalar signal. Roughly speaking, this
component class requires that all entries of x (i.e., its columns) are the same, and uses
a scalar-valued signal loss function on the common column. If φ̃ is separable, then φ is
separable across time.
    The proximal operator of such a φ is readily found in terms of the proximal operator of
φ̃. It is

    prox_φ(v) = prox_{φ̃}((1/p)v1)1^T .
In words: to evaluate the proximal operator for a common term loss function, we first average
the columns of v, then apply the proximal operator of φ, e and finally broadcast the result to
all columns.
    The masked proximal operator is a bit more complex. Each row can have a different
number of entries in the known set, so the average across columns must be taken with
respect to the number of known entries in the row instead of the number of columns. However,
to make use of the scalar formulation φ̃(z), we must invoke the weighted proximal operator,

    mprox_φ(v) = argmin_x ( φ̃(z) + (ρ/2) ∑_{(t,i)∈K} (x_{t,i} − v_{t,i})^2 ),  subject to x_t = z_t 1
               = wprox_{φ̃}(ravg(v))1^T ,

where ravg : (R ∪ {?})T ×p → (R ∪ {?})T is the row-wise average of the matrix v, over only
the known entries. (If a row has no known entries, the function returns ? for that time index.)
The weights are the number of known entries used in each averaging operation, divided by
the total possible number of entries.
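    A small numpy sketch of this recipe follows; the weighted proximal operator of φ̃ is
assumed to be supplied by the caller (here as a function wprox_scalar taking the averaged
data, the weights, and ρ), and rows with no known entries simply receive weight zero.

    import numpy as np

    def mprox_common_term(v, known, wprox_scalar, rho):
        """Masked prox of a common-term loss (24).
        v            : (T, p) array; unknown entries may be arbitrary or NaN
        known        : (T, p) boolean mask of known entries
        wprox_scalar : weighted prox of the scalar loss, wprox_scalar(r, w, rho),
                       with r, w of shape (T,); entries with w == 0 carry no data term
        """
        T, p = v.shape
        counts = known.sum(axis=1)                          # known entries per row
        w = counts / p                                      # weights
        totals = np.where(known, v, 0.0).sum(axis=1)
        r = np.divide(totals, counts, out=np.zeros(T), where=counts > 0)  # ravg(v)
        z = wprox_scalar(r, w, rho)                         # common scalar signal
        return np.tile(z[:, None], (1, p))                  # broadcast to all columns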

6 Component class examples


There is a wide variety of useful component classes; in this section we describe some typical
examples. In most cases the proximal operator of the loss is well known, and we do not give
it; we refer the reader to other resources, such as [CP11, PB14, BPC+ 11]. When the loss
function is convex, but an analytical method to evaluate the proximal operator is not known,
we can always fall back on a numerical method, e.g., using CVXPY [DB16, AVDB18]. In a
few cases where we believe our method of evaluating the proximal operator is new, we give
a short description of the method.

6.1 Time-separable classes


Time-separable classes are given by the loss functions ℓ_t on R^p. We have already seen the
mean-square small class, with loss ℓ_t(u) = (1/(Tp))‖u‖_2^2, and the finite set class, which requires
that x_t be one of a given set of values. We mention a few other examples in this section.

Value constraint component classes. As an extension of the finite value class, we


require that x_t ∈ S_t, where S_t ⊂ R^p is some given set. If the S_t are all convex, we have a
convex loss function. Simple convex examples include the nonnegative component class,
with S_t = R^p_+, and the vector interval signal class, with S_t = {u | x_t^min ≤ u ≤ x_t^max}, where
the inequality is elementwise and x_t^min and x_t^max are given lower and upper limits on the
entries of the signal (which can be parameters). In addition to the constraint x_t ∈ S_t, we
can add a nonzero penalty function of x_t to the objective.

Mean-square close entries. The loss


    ℓ(u) = (1/p) ∑_{i=1}^p (u_i − µ)^2,    µ = (1/p) ∑_{i=1}^p u_i,             (25)

which is the variance of the entries of the vector u, defines the mean-square close entries class.
If we scale this class by a very large weight, this gives an approximation of the common term
class (24) (with φ̃ = 0), in which the entries of the signal must be the same for each t.

Robust losses. We can modify the sum of squares loss so the component class can in-
clude signals with occasional outliers, using so-called robust losses, which grow linearly for
large arguments, when they are convex, or sub-linearly when they are not. One well-known
example is the Huber loss, defined as

    ℓ(u) = ∑_{i=1}^p H(u_i),    H(a) = { a^2              |a| ≤ M
                                        { M(2|a| − M)     |a| > M,

where M > 0 is a parameter [BV09, §6.1.2]. An example of a nonconvex robust loss is the
log Huber loss,

    ℓ(u) = ∑_{i=1}^p H̃(u_i),    H̃(a) = { a^2                          |a| ≤ M
                                          { M^2(1 + 2 log(|a|/M))       |a| > M.

Convex sparsity inducing losses. The loss function ℓ(u) = ‖u‖_2 (note that this norm
is not squared) leads to vector-sparse (also called block sparse) component signals, i.e., ones
for which for many values of t, we have x_t = 0. In machine learning this is referred to as the
group lasso [HTF13, §3.8.4]. With this loss, we typically find that when x_t ≠ 0, all its entries
are nonzero. The loss function ℓ(u) = ‖u‖_1, which gives the sum-absolute small component class, tends to
yield signals that are component-wise sparse, i.e., for many values of (t, i), we have x_{t,i} = 0.

Non-convex sparsity inducing losses. The most obvious one is the cardinality or num-
ber of nonzeros loss, with `(u) being the number of nonzero entries in u (or, in the vector
version, 0 if u = 0 and 1 otherwise). In this case the overall loss φ(x) is the number of
nonzero values of xt,i . A variation is to limit the number of nonzeros to some given number,
say, r, which gives the r-sparse signal component class.
These losses are nonconvex, but have well-known analytic expressions for their proximal
operators. For example when the loss is the number of nonzero entries in x, the proximal
operator is so-called hard thresholding [BPC+ 11, §9.1.1],
    prox_φ(v)_{t,i} = { 0          |v_{t,i}| ≤ √(2/ρ)
                      { v_{t,i}     |v_{t,i}| > √(2/ρ),      t = 1, . . . , T,  i = 1, . . . , p.
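For concreteness, a short numpy sketch of this hard thresholding operation as a masked
proximal operator; the helper name is ours, and unknown entries (which carry no data
term) are set to zero.

    import numpy as np

    def mprox_card(v, rho, known=None):
        """Hard thresholding: masked prox of the number-of-nonzeros loss.
        Unknown entries carry no data term, so setting them to zero is optimal."""
        x = np.where(np.abs(v) > np.sqrt(2.0 / rho), v, 0.0)
        if known is not None:
            x = np.where(known, x, 0.0)
        return x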

Quantile small. The quantile loss [KB78, KH01] is a variation on the ℓ_1 loss ‖u‖_1, that
allows positive and negative values to be treated differently:

    ℓ(u) = ∑_{i=1}^p ( |u_i| + (2τ − 1)u_i ),                          (26)

where τ ∈ (0, 1) is a parameter. For τ = 0.5, this class simplifies to sum-absolute small. (Its
proximal operator is given in [PB14, §2.2, §6.5.2].)

6.2 Time-invariant classes


Any time-separable loss for which ℓ_t does not depend on t is time-invariant. We give a few
other examples here.

Index-dependent offset. In the common term class (24), the entries of signals are the
same. The index-dependent offset class is analogous: Its signals are different for different
indexes, but the same over time. It is given by φ(x) = 0 if for some z, xt = z for all t, where
z ∈ Rp , and ∞ otherwise. Of course we can add a penalty on z. This loss is time-invariant,
with a memory of one.

Higher order mean-square smooth component classes. We have already mentioned


the mean-square smooth class which uses the first-order difference (4), and its extension to
the second-order difference (12). Higher order mean-square smooth classes use higher order
differences.

Mean-absolute smooth. Replacing the mean-square penalty in the mean-square first-order
smooth class with an average-absolute penalty yields components whose signal entries are
typically piecewise constant. With the second-order difference,

    φ(x) = (1/((T − 2)p)) ∑_{t=1}^{T−2} ‖x_t − 2x_{t+1} + x_{t+2}‖_1,           (27)

we obtain a class whose entries are typically piecewise linear. (This is discussed under the
name ℓ_1-trend filtering in §3.)

Periodic. The component class of signals with period P has loss function

    φ(x) = { 0   x_{t+P} = x_t, t = 1, . . . , T − P,
           { ∞   otherwise.                                            (28)

We can also express this using a basis.
    To this constraint we can add a loss function such as mean-square small or mean-square
smooth, to obtain, for example, the component class of P-periodic mean-square smooth
signals. (In this case the differences are computed in a circular fashion.)

Quasi-periodic. A variation on the periodic signal class does not require strict periodicity,
but allows some variation period to period, with a penalty for the variation. The simplest version
uses the quadratic loss function

    φ(x) = ∑_{t=1}^{T−P} ‖x_{t+P} − x_t‖_2^2,                          (29)

the sum of squares of the differences in signal values that are one period apart. Variations
include adding a smoothness term, or replacing the sum of squares with a sum of norms,
which tends to give intervals of time where the signal is exactly periodic.

Composite classes. Components may be combined to generate more complex loss func-
tions. An example that we will use later has time entries that are smooth (12) and peri-
odic (28) and entries that are mean-square close (25),

    φ(x; λ_1, λ_2) = { λ_1 ℓ_1(x) + λ_2 ℓ_2(x)    x_{t+P} = x_t, t = 1, . . . , T − P,
                     { ∞                           otherwise,                          (30)

where

    ℓ_1(x) = (1/((T − 2)p)) ∑_{t=1}^{T−2} ‖x_t − 2x_{t+1} + x_{t+2}‖_2^2,
    ℓ_2(x) = (1/p) ∑_{t=1}^T ∑_{i=1}^p (x_{t,i} − µ_t)^2,    µ_t = (1/p) ∑_{i=1}^p x_{t,i}.

This composite example is convex quadratic (§5.3).

Monotone non-decreasing. The monotone nondecreasing loss is

    φ(x) = { 0   x_{t+1,i} ≥ x_{t,i}, t = 1, . . . , T − 1, i = 1, . . . , p
           { ∞   otherwise.

It is used in monotone or isotonic regression, typically to represent something like cumulative
wear, that does not decrease over time. This loss is a constraint, but we can add an additional
term such as mean-square smoothness.

Markov. The Markov class is, roughly speaking, an extension of the finite set class (10)
that includes costs for the different values, as well as transitions between them. It is specified
by some distinct values θ_1, . . . , θ_M ∈ R^p, a transition cost matrix C ∈ R_+^{M×M}, and state cost
vector c ∈ R_+^M. Like the finite set component class, the loss is ∞ unless for each t, we have
x_t ∈ {θ_1, . . . , θ_M}. We write this as x_t = θ_{s_t}, where we interpret s_t ∈ {1, . . . , M} as the state
at time t. When this holds, we define

    φ(x) = ∑_{t=1}^T c_{s_t} + ∑_{t=2}^T C_{s_t, s_{t−1}}.

The first term is the state cost, and the second is the cost of the state transitions.
    This component class gets its name from a statistical interpretation in terms of a Markov
chain. If the state s_t is a Markov chain with states {1, . . . , M} and transition probabilities
π_{ij} = Prob(s_t = i | s_{t−1} = j), then with c = 0 and C_{i,j} = −log π_{ij}, the loss is the negative
log-likelihood, up to a constant.
The proximal operator of this component loss function can be efficiently evaluated using
standard dynamic programming. We create a graph with M T nodes, with each node corre-
sponding to one state at one time. All nodes at time t are connected to all nodes at time t−1
and t + 1, so there are (T − 1)M^2 edges. Let v be the signal for which we wish to evaluate
the proximal operator. At each node we attach the cost c_s + (ρ/2)‖v_t − θ_s‖_2^2, and on each edge
from state s at time t − 1 to state s' at t we attach the cost C_{s',s}. Then φ(x) + (ρ/2)‖v − x‖_F^2
is exactly the path cost through this graph. We can minimize this over s_1, . . . , s_T using
dynamic programming to find the shortest path. The cost is O(T M^3) flops, which is linear
in the signal length T.
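    The following numpy sketch implements this dynamic program for the (unmasked)
proximal operator; it follows the notation above, with the transition cost indexed as
C[s, s_prev] to match the loss definition, and masking would change only the per-node costs.

    import numpy as np

    def prox_markov(v, theta, c, C, rho):
        """Prox of the Markov-class loss via dynamic programming on the trellis.
        v     : (T, p) input signal
        theta : (M, p) allowed values
        c     : (M,)  state costs
        C     : (M, M) transition costs, indexed C[s, s_prev]
        rho   : proximal parameter
        """
        T = v.shape[0]
        # node cost at (t, s): state cost plus proximal penalty to v_t
        node = c[None, :] + (rho / 2) * ((v[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        cost = node[0].copy()                    # best cost of paths ending in each state
        back = np.zeros_like(node, dtype=int)    # backpointers
        for t in range(1, T):
            trans = cost[None, :] + C            # trans[s, s_prev]
            back[t] = trans.argmin(axis=1)
            cost = node[t] + trans.min(axis=1)
        # recover the optimal state sequence, then the signal
        s = np.empty(T, dtype=int)
        s[-1] = cost.argmin()
        for t in range(T - 1, 0, -1):
            s[t - 1] = back[t, s[t]]
        return theta[s]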

Single jump. As a variation on the Markov component class we describe the single jump
component class. We describe it for a scalar signal, i.e., p = 1; it is extended to vector signals
with a loss that is separable across entries. The loss function is

    φ(x) = { 1   x = (0_τ, a1_{T−τ})
           { 0   x = 0
           { ∞   otherwise,                                            (31)

for some (jump magnitude) a ≠ 0 and some (jump time) τ ∈ {1, . . . , T}. Roughly speaking,
feasible signals start at zero and either stay zero, or jump once, at a time τ, to the value
a. The cost is zero if x is zero, and one if it does jump. This loss function is evidently
nonconvex.
    Its proximal operator is readily evaluated directly, by evaluating

    (ρ/2)‖x − v‖_2^2 + φ(x)

for all feasible x. For x = 0 we have the value (ρ/2)‖v‖_2^2. For a jump at time τ, the value
of a that minimizes the cost above is simply the average of v_t over the times after the jump. This value
and the cost are readily computed recursively, so the proximal operator can be evaluated in
time linear in T. This method extends readily to the masked proximal operator.
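    A numpy sketch of this recursive evaluation for the unmasked, scalar case is given below;
suffix sums over the candidate jump times give the optimal jump value and cost in one
backward pass.

    import numpy as np

    def prox_single_jump(v, rho):
        """Prox of the single-jump loss (31) for a scalar signal v in R^T."""
        T = len(v)
        total_sq = float(np.sum(v ** 2))
        best_cost = (rho / 2) * total_sq          # cost of the x = 0 candidate
        best_x = np.zeros(T)
        suffix_sum = 0.0
        suffix_sq = 0.0
        # jump time tau: first tau entries zero, remaining T - tau entries equal to a
        for tau in range(T - 1, 0, -1):
            suffix_sum += v[tau]
            suffix_sq += v[tau] ** 2
            n = T - tau
            a = suffix_sum / n                    # optimal jump magnitude for this tau
            cost = 1 + (rho / 2) * (total_sq - n * a ** 2)
            if cost < best_cost:
                best_cost = cost
                best_x = np.concatenate([np.zeros(tau), np.full(n, a)])
        return best_x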

6.3 Fitting component class losses


In the discussion above we specify component classes directly in terms of the loss function.
We mention here that it is also possible to fit a component class loss from examples of signals
in that class, assuming they are available.
One simple method is based on the statistical interpretation given in §2.4. Given a
collection of example signals, we fit a statistical model, for example a Gaussian distribution
N(µ, Σ) with an appropriate mean µ ∈ R^{Tp} and covariance Σ ∈ R^{Tp×Tp}. We use as loss
for this component class the convex quadratic φ(x) = (x_: − µ)^T Σ^{−1} (x_: − µ), which is the
negative log-likelihood, up to a scale factor and constant. If we fit a statistical model for
each component of the signals we obtain an entry-separable loss; if we fit a common model
for the entries of the signal, we obtain an entry-separable symmetric loss. We can fit a
time-invariant loss by creating a common statistical model of all M -long slices of our signal
examples, and using the negative log-likelihood as the slice loss.
Another elementary method for fitting a loss to example signals uses the singular value
decomposition (SVD) or generalized low-rank model [UHZB16] to find a set of archetype
signals a_1, . . . , a_r ∈ R^{T×p}, for which each of our examples is close to a linear combination of
them. We then use the basis loss function

    φ(x) = { 0   x = z_1 a_1 + · · · + z_r a_r for some z ∈ R^r
           { ∞   otherwise.                                            (32)

(As a variation on this, we can find a set of (scalar) archetypes in R^T for which each
component of our examples is close to a linear combination of them, as in (11).) A soft
version of the basis loss is the loss function

    φ(x) = min_z ‖x − z_1 a_1 − · · · − z_r a_r‖_F^2,                  (33)

which has full domain. (It can also be thought of as a combination of two classes: the basis
class, and a mean-square small residual class.)
    The soft basis model can be used to fit a time-invariant loss. We use the SVD to find a
set of archetypes or basis for which each M-long slice of each example is close to a linear
combination, and then use the soft basis loss (33) as our slice loss.
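    A minimal sketch of this SVD-based construction for scalar example signals (p = 1) is
below; the helper names are ours, and slices or vector signals can be flattened into rows
before calling it.

    import numpy as np

    def fit_archetypes(examples, r):
        """Fit r archetype signals from example signals via the SVD.
        examples : (N, T) array, one scalar example signal per row
        returns  : (T, r) array whose columns are the archetypes a_1, ..., a_r
        """
        _, _, Vt = np.linalg.svd(examples, full_matrices=False)
        return Vt[:r].T

    def soft_basis_loss(x, A):
        """Soft basis loss (33): squared distance of x from the span of the archetypes."""
        z, *_ = np.linalg.lstsq(A, x, rcond=None)
        return float(np.sum((x - A @ z) ** 2))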

7 Examples
7.1 Mauna Loa CO2 measurements
An example often used to demonstrate seasonal-trend decomposition is atmospheric carbon
dioxide (CO2), which has both a strong seasonal component and an underlying trend. These
data were utilized in the original STL paper [CCMT90] as well as the documentation for var-
ious implementations of STL [STLa]. In this section we compare the Python implementation
of STL in the statsmodels package to an SD formulation of the problem of decomposing
measurements of atmospheric CO2 into seasonal, trend, and residual components.

Data set. The weekly average CO2 measured at Mauna Loa, HI from May 1974 through
June 2021, available online from the National Oceanic and Atmospheric Administration
Global Monitoring Laboratory [TK], is shown in figure 5. The data set is a scalar signal of
length 2459 with 18 missing entries. In our notation, y1 , . . . , yT ∈ R ∪ {?}, with T = 2459,
and |U| = 18.

Decomposition using STL. We use the implementation in statsmodels (v0.12.2) with


default settings and period=52. We note that while the original STL paper describes how to
handle missing data, this particular software implementation cannot handle missing values,
so we used simple linear interpolation to fill the missing values before running the algorithm.
The resulting decomposition is shown in figure 6, using the conventional names for the
components. Interestingly, the “seasonal” component in this estimation is not periodic; it
almost repeats each year but with some variation.


Figure 5: Atmospheric CO2 data obtained from NOAA, which shows clear seasonal and trend
components.

Decomposition using SD. We form an SD problem with p = 1, T = 2459, and K = 3,


with component classes mean-square small (3), second-order-difference small (12), and a
quasi-periodic signal with period 52 (29). All the component classes are convex, so this SD
problem is convex. This problem has two parameters λ2 and λ3 , associated with the weights
on the second and third loss functions respectively. We found that λ2 = 104 and λ3 = 1 give
good results, although better parameter values could be found using a validation procedure.
The resulting decomposition is shown in figure 7.
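    For illustration (this is not the code used to produce the figures), the whole problem can
be written as a small convex program in CVXPY; the per-term normalizations follow the
losses cited above, so the exact interplay between those normalizations and the quoted
weights should be treated as an assumption of this sketch.

    import numpy as np
    import cvxpy as cp

    def co2_sd(y, known, P=52, lam2=1e4, lam3=1.0):
        """y: length-T data vector; known: length-T boolean mask of known entries."""
        T = len(y)
        idx = np.flatnonzero(known)
        x2 = cp.Variable(T)                      # trend component
        x3 = cp.Variable(T)                      # quasi-periodic seasonal component
        resid = y[idx] - x2[idx] - x3[idx]       # residual component, on known entries only
        obj = (
            cp.sum_squares(resid) / T                           # mean-square small (3)
            + lam2 * cp.sum_squares(cp.diff(x2, 2)) / (T - 2)   # second-order smooth (12)
            + lam3 * cp.sum_squares(x3[P:] - x3[:-P])           # quasi-periodic (29)
        )
        cp.Problem(cp.Minimize(obj)).solve()
        return x2.value, x3.value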

Comparison. The decompositions found using STL and SD, shown in figures 6 and 7, are
nearly identical. The RMS deviation between trend estimates is 7.52 × 10^{−2}, about 0.02% of
the average measured value. The RMS deviation between seasonal estimates is 8.79 × 10^{−2}.
While STL is based on a heuristic algorithm, SD is based on solving a convex optimization
problem (for our particular choice of loss functions).


Figure 6: Decomposition of the CO2 data into residual, trend, and seasonal components, using
STL.


Figure 7: Decomposition of the CO2 data into residual, trend, and seasonal components, using
SD. It is nearly identical to the decomposition found by STL, shown in figure 6.


Figure 8: Hourly vehicle counts for the outbound Manhattan toll plaza of the Robert F. Kennedy
bridge, with hour of day on the y-axis and days on the x-axis. White pixels represent missing
values.

7.2 RFK bridge traffic


This example illustrates how the concept of seasonal-trend decomposition can be extended
in the SD framework to handle more complex analyses with additional components. Traffic
volume is measured with sensors embedded in the roadways that count the number of cars
that pass in each hour; from these data, summary statistics such as “Annual Average Daily
Traffic” and “Peak Hour Volume” are derived [Sah].

Data set. The hourly outbound vehicle count for the Manhattan toll plaza on the Robert
F. Kennedy Bridge in New York City from January 1, 2010 through August 28, 2021 is
shown in figure 8 as a heat map, with the hour of day shown vertically and the day shown
horizontally, and missing entries shown in white. Daily and seasonal variations can be seen,
along with the effects of COVID-19. A single week of data is shown in figure 9, where daily
variation, and the weekend effect, are evident. The data set is made available online by the
New York Metropolitan Transportation Authority (MTA) [MTA].
The data y is scalar (i.e., p = 1), with T = 102192 (24 hours per day, 4258 days). We
take the natural logarithm of the data, using the convention log 0 = ?. With these unknown
entries, plus those that are unknown in the original data set, we have |U| = 3767. Thus our
decomposition is multiplicative; the exponentiated components are multiplied to obtain an estimate of the original counts.

SD problem formulation. We form an SD problem with K = 5 components. The resid-


ual component is mean-square small (3), as in previous examples. The second component
is the weekly baseline, which is the smooth-periodic cost given in (23) with P = 168 and a
weight parameter λ2 . The third component is the yearly seasonal correction, which is also
smooth-periodic (23) with P = 8760 and weight parameter λ3 , and the additional constraint


Figure 9: One week of hourly vehicle counts in May 2010.

that the sum over each period must be equal to zero. The fourth component is the long-
term trend, modeled as piecewise linear with the `1 second difference loss (27), with weight
parameter λ4 and the additional constraint that the first value of x4 must be equal to zero.
The fifth and final component is a sparse daily outlier, defined as

    φ_5(x) = { λ_5 ‖x‖_1   x ∈ D
             { ∞            otherwise,                                 (34)

where D is the set of signals that are constant over each day. All the component class losses
are convex, so this SD problem is convex with parameters λ2 , λ3 , λ4 , and λ5 .

Results. We solve the SD problem using parameter values

    λ_2 = 10^{−1},   λ_3 = 5 × 10^5,   λ_4 = 2 × 10^5,   λ_5 = 1,

selected by hand to provide good results. The decomposition yields components that have
vastly different timescales.
    By exponentiating our component estimates, x̃^k = exp(x^k), we recover a multiplicative
model of the traffic count data. The residual component x̃^1 is centered around 1, with 90%
of the residuals in the interval [0.74, 1.30], shown in figure 10. This means that in any given
hour, our decomposition predicts traffic typically within around ±30%.
    Figure 11 shows one week of the (periodic) weekly baseline. We see many of the phe-
nomena present in figure 9, such as reduced traffic over the weekend, daily variation, a
small increase from Monday to Friday, and a commute rush hour on weekdays.
    Figure 12 shows component x̃^3, the seasonal correction factor, which varies from around
−9% to +7%, with the peak in summer and the low point in late January and early February.


Figure 10: Cumulative distribution function of the multiplicative residual x̃^1_{t,i} for (t, i) ∈ K. The
gray dashed lines indicate the 5th and 95th percentiles. 90% of the residuals are between 0.74 and
1.30.

Figure 11: Weekly baseline signal x̃^2, shown for a single week (168 values).

Figure 12: Seasonal adjustment x̃^3, shown for a single year (8760 values).

    Figure 13 shows the long-term trend x̃^4. The component x^4 is piecewise-linear with a small
number of breakpoints, so x̃^4 is piecewise exponential, with a small number of breakpoints,
shown as red dots in the plot. We can see a slight increase in traffic over the first 10 years
followed by a precipitous drop in traffic due to COVID-19 in early 2020, coinciding with
the mandatory lockdown implemented by the New York state government on March 22,
2020 [Cuo].
    The final component x^5 is sparse, which means that x̃^5, shown in figure 14, mostly takes on
the value one. This component identifies 42 days (out of 4258) as outliers, with multiplicative
corrections ranging from around 0.2 (i.e., one fifth the normal traffic on that day) to around
twice the normal traffic on that (one) day. All but two of the outliers represent a decrease
in traffic on that day. Many of the detected outlier days are weather related, with some
notable examples being various blizzards including February 10, 2010 [Blia], December 27,
2010 [Blib], January 27, 2015 [Blic], and February 1, 2021 [Blid]. About 9 outlier days are
associated with reduced traffic during the COVID-19 lockdown event in early 2020. Figure 15
highlights the detection of Hurricane Irene in August of 2011 [AC13], with x̃^5 < 1 during
the hurricane.
The two positive outlier days occur on May 6 and 10, 2018. The authors could find no
explanation for the very high measured traffic on those days in the archives of the New York
Times and the New York Post. It is possible that sensors were simply malfunctioning on
those two days.

Figure 13: Long-term trend multiplicative factor x̃^4. This trend is piecewise exponential, with
breakpoints shown as red dots.

Figure 14: Daily outlier component x̃^5.

Figure 15: Decomposition components for two weeks in August 2011 (336 values). Hurricane
Irene hit New York City on August 27 and 28, greatly reducing traffic on those days, clearly seen
as outliers in x̃^5.


Figure 16: Raw PV combiner box data, shown for two days in August 2017.

7.3 Outage detection in a photovoltaic combiner box


Data set. We consider a set of 7 measurements of real power from inside a photovoltaic
(PV) combiner box [Fra18], corresponding to 7 strings of series-connected PV modules that
are joined in parallel. These data are from PV strings forming the canopy at the NIST
campus in Maryland [BCD17]. Detailed documentation of the PV systems at this site,
including system designs, meteorological station information, and site layout, is also avail-
able [Boy15]. The canopy has multiple roof orientations, so the constituent strings have
similar but different power curves, depending on the specific geometry of each string.
The raw data consist of the power output of each of the 7 PV strings, measured each
minute over a month (August 2016), organized into a matrix with each column corresponding
to a single string and each row a particular minute. The raw data contain some missing
entries. The power output of each string depends on available sunlight, weather conditions,
soiling accumulation, string geometry, and local shade patterns. Two days of string power
output are shown in figure 16.

Data pre-processing. We first eliminate all data points corresponding to night time and
early morning and evening, when string powers are zero or very small. We removed data
between 5:40pm and 6:49am. (These times were found as the times when the whole system
was producing less than 10% of system capacity.) Thus each day consists of 652 one minute
measurements. Next we scale each of the 7 string powers (columns) so that the 95th percentile
is one. This gives each column an approximate range of about 0.1 to 1.3.
Finally we take the log of each power output value, resulting in columns with a range of
about -2.3 to 0.25. Carrying out signal decomposition on this log signal gives us a multiplica-
tive decomposition, which makes sense for this application. (For example, a cloud passing
between the sun and the string gives a percentage reduction in power.) The final data is a
signal y with T = 20212, p = 7, and |U| = 6402.
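    A numpy sketch of this pre-processing is given below, assuming the nighttime rows have
already been removed and using NaN as a stand-in for the missing-value symbol ?.

    import numpy as np

    def preprocess_pv(power):
        """power: (T, 7) array of daytime string powers (night rows already removed).
        Scales each column so its 95th percentile is one, then takes logs,
        mapping non-positive or missing values to NaN."""
        power = np.asarray(power, dtype=float)
        scale = np.nanpercentile(power, 95, axis=0)     # per-string 95th percentile
        scaled = power / scale
        y = np.full_like(scaled, np.nan)
        ok = np.isfinite(scaled) & (scaled > 0)
        y[ok] = np.log(scaled[ok])
        return y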


Figure 17: PV combiner box data after pre-processing, with simulated outages. The onset times
of the simulated outages are shown as vertical red lines.

Outage simulation. We modify this real data to include some simulated faults or out-
ages, where some part of each PV string no longer generates power. This is modeled as a
(multiplicative) reduction in power output, from the time of failure to the end of the data.
We simulated these faults for strings 2, 5, and 6, with onset times

T2 = 12132, T5 = 16573, T6 = 6063,

and power reduction factors

f2 = −7%, f5 = −10%, f6 = −12.5%,

chosen randomly. (These are realistic values.) The modified data is shown in figure 17, with
vertical red lines indicating the onset of the outages in strings 2, 5, and 6. These
power reductions can be seen in the plot, but would likely be hard to spot by eye.

SD problem formulation. We form an SD problem with K = 5 components. Our signal


decomposition models string output as the product of a mean-square small residual (3), a
clear sky signal, a common daily correction term, a common cloud/weather term, and a
failure term. The clear sky component is modeled as the composite class that is smooth and
periodic in time and close in entries (30) (with a small modification to remove the smoothness
penalty across day boundaries). This component has two parameters, one for the smoothness
term and one for the variance across entries, λ2a and λ2b , respectively. The third component

45
is a daily scale adjustment that is constant across columns, and constant over each day,
meant to capture day-to-day macro-scale changes in atmospheric conditions that affect all
strings, such as precipitable water and aerosol optical depth [Ine08]. The fourth component
is also constant across the columns and has a quantile loss function (26). This models a
common cloud-loss term between the strings, assumed to be equal because the strings are so
close to each other and are experiencing the same local weather. The fourth component has
two parameters, the quantile term, τ , which we set to be 0.65, and a weight, λ4 . The third
and fourth components make use of the common term formulation (24). The fifth component
is our failure detector. This component uses the single jump class (31), constrained to only
have negative jumps, with each column treated independently. The fifth component also has
a weight parameter, λ5 . Since our failures are simulated, we know exactly when the onsets
are, and what the values are, which we can compare to our estimated failure component.

Results. We solve the SD problem with hand-selected weights,

    λ_{2a} = 5 × 10^4/(Tp),   λ_{2b} = 5 × 10^{−5}/(Tp),   λ_4 = 2/(Tp),   λ_5 = 10/(Tp),

giving us estimates of x^1, . . . , x^5. Our estimates of the components are x̃^k = exp x^k. We
interpret x̃^1, x̃^3, x̃^4, x̃^5 as multiplicative components, and we interpret x̃^2 as the baseline clear
sky values, normalized. It takes approximately 15 seconds to run the SD-ADMM algorithm
to convergence on a 2016 MacBook Pro, with no parallelization of the proximal operator
evaluations. A segment of the decomposition is shown in figure 18, highlighting 5 days of
data for string 2, including the time of an estimated failure.
    The residual term x̃^1 is shown as a histogram in figure 19. The residual is centered at 1
and has a standard deviation of 0.082. 95% of the entries in the known set have residuals in
the range of [0.85, 1.15], i.e., ±15%.
    The clear sky component x̃^2 is shown in figure 20. We plot two days of this periodic
component to illustrate the discontinuities in values between adjacent days. We see that the
clear sky estimates for the strings are smooth in time, and vary a bit between strings.
    The common daily scale factor x̃^3, shown in figure 21(a), is constant over each day and
across columns. This can be thought of as how much the clear sky signals need to be scaled to
recreate any given day, and all strings must agree on the factor. Days with significant
cloud cover tend to have much smaller scale factors, while clearer days tend to vary by about
10–15%.
    The common weather term x̃^4, shown in figure 21(b), is also constant across columns,
and it captures the effects of local weather, particularly attenuation by clouds. This term
is typically a loss, that is, x̃^4 < 1. We chose the value of the quantile parameter τ = 0.65
through hand-tuning, selecting a value that gave good agreement between the clear sky
component and the measured data on periods without significant cloud impacts. While
having a weather correction term that is larger than about 1.5 does not make much physical
sense (see, for example, [ICC16]), we observe that this factor is applied to the combination
of components 2 and 3, the clear sky component and the daily scale factor. In fact, we see

Figure 18: Components x̃^k for string 2 over 5 days.

Figure 19: Histogram of the residual term x̃^1 for all entries of the known set K.

Figure 20: The clear sky component x̃^2, with two days shown.

Figure 21: (a) Top, the common daily scale factor x̃^3. (b) Bottom, the common weather component
x̃^4. Only one column of each component is plotted as all columns are equal to each other.

Figure 22: Failure component, shown as the percentage 100 × (1 − x̃^5)%. The dashed lines show
the actual simulated failures.

Table 2: Outage detection results

    string   metric         actual   predicted
    2        amount (%)     -7       -6.24
    2        time (days)    18.60    18.61
    5        amount (%)     -10      -9.10
    5        time (days)    25.42    25.60
    6        amount (%)     -12.5    -12.44
    6        time (days)    9.30     9.30

the larger values in x̃^4 exactly on the days that are highly cloudy and use very small daily
scale factors.
    The failure component x̃^5 correctly identifies the failures as appearing in only
strings 2, 5, and 6, as depicted in figure 22, which shows the predicted and real failure onset
times and amounts. The estimated failure time is 5 minutes late for string 2, about 2 hours
late for string 5, and exactly correct for string 6; for all three strings, the loss was detected
within the same day that the failure occurred. We can also see that the estimated failure
amounts are quite good.

Acknowledgments
This material is based on work supported by the U.S. Department of Energy’s Office of
Energy Efficiency and Renewable Energy (EERE) under the Solar Energy Technologies Office
Award Number 38529. Stephen Boyd’s work was funded in part by the AI Chip Center for

50
Emerging Smart Systems (ACCESS).
The authors thank Joel Tropp for useful suggestions on an early draft of this paper.

References
[AC13] L. Avila and J. Cangialosi. Tropical cyclone report for hurricane Irene
(AL092011). NOAA National Hurricane Center, April 2013. online: https:
//www.nhc.noaa.gov/data/tcr/AL092011_Irene.pdf.

[ALMT14] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge:


phase transitions in convex programs with random data. Information and In-
ference, 3(3):224–294, sep 2014.

[And27] O. Anderson. On the logic of the decomposition of statistical series into separate
components. Journal of the Royal Statistical Society, 90(3):548–569, 1927.

[AVDB18] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system


for convex optimization problems. Journal of Control and Decision, 5(1):42–60,
2018.

[Bac10] F. Bach. Structured sparsity-inducing norms through submodular functions.


In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, edi-
tors, Advances in Neural Information Processing Systems, volume 23. Curran
Associates, Inc., 2010.

[BC90] M. Best and N. Chakravarti. Active set algorithms for isotonic regression; a
unifying framework. Mathematical Programming, 47(1-3):425–439, may 1990.

[BC02] R. Baillie and S.-K. Chung. Modeling and forecasting from trend-stationary
long memory models with applications to climatology. International Journal of
Forecasting, 18(2):215–226, apr 2002.

[BCD17] M. Boyd, T. Chen, and B. Doughert. NIST Campus Photovoltaic (PV) Arrays
and Weather Station Data Sets. National Institute of Standards and Technology,
2017. [Data set]. https://doi.org/10.18434/M3S67G.

[BCDH10] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based com-


pressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001,
apr 2010.

[Ber16] D. Bertsekas. Nonlinear Programming: Third Edition. Athena Scientific,


Nashua, NH, 2016.

[Blia] February 9–10, 2010 North American blizzard. Wikipedia. https:


//en.wikipedia.org/w/index.php?title=February_9%E2%80%9310,_2010_
North_American_blizzard&oldid=1030398339. Accessed: 2021-09-01.

[Blib] December 26–27th 2010 blizzard. National Weather Service. online: https:
//www.weather.gov/okx/storm12262010. Accessed: 2021-09-01.

[Blic] January 26–27 2015 blizzard. National Weather Service. online: https://www.
weather.gov/okx/Blizzard_01262715. Accessed: 2021-09-01.

[Blid] January 31–February 2 2021 winter storm. National Weather Service.


online: https://www.weather.gov/okx/WinterStormJan31_Feb22021. Ac-
cessed: 2021-09-01.

[Blo92] P. Bloomfield. Trends in global temperature. Climatic Change, 21(1):1–16, may


1992.

[BN92] P. Bloomfield and D. Nychka. Climate spectra and detecting climate change.
Climatic Change, 21(3):275–287, jul 1992.

[Boy15] M. Boyd. High-speed monitoring of multiple grid-connected photovoltaic ar-


ray configurations. NIST Technical Note 1896, 2015. http://dx.doi.org/10.
6028/NIST.TN.1896.

[BPC+ 11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimiza-
tion and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[BS93] C. Bouman and K. Sauer. A generalized Gaussian image model for edge-
preserving MAP estimation. IEEE Transactions on Image Processing, 2(3):296–
310, 1993.

[BT13] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent


type methods. SIAM Journal on Optimization, 23(4):2037–2060, jan 2013.

[BV09] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University


Press, 2009.

[BV18] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra. Cam-


bridge university press, 2018.

[CCMT90] R. Cleveland, W. Cleveland, J. McRae, and I. Terpenning. STL: A seasonal-


trend decomposition procedure based on loess (with discussion). Journal of
Official Statistics, 6(1):3–73, 1990.

[CLMW11] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis?
Journal of the ACM, 58(3):1–37, may 2011.

[CM73] J. Claerbout and F. Muir. Robust modeling with erratic data. Geo-
physics, 38(5):826–844, oct 1973.

[CP11] P. Combettes and J. Pesquet. Proximal splitting methods in signal processing.


Springer Optimization and Its Applications, 49:185–212, 2011.

[CR09] E. J. Candès and B. Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717–772, dec 2009.

[CRPW10] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex ge-


ometry of linear inverse problems. Foundations of Computational Mathematics,
12(6):805–849, dec 2010.

[Cuo] A. Cuomo. Governor Cuomo signs the ‘New York State on PAUSE’ executive
order, March 20, 2020. New York Governor’s Press Office. archived at https:
//web.archive.org/web/20200328191630/https://www.governor.ny.gov/
news/governor-cuomo-signs-new-york-state-pause-executive-order.
Accessed: 2021-08-31.

[DB16] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for


convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[DH01] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposi-
tion. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.

[Far01] R. Farebrother. Adrien-Marie Legendre. In C. Heyde, E. Seneta, P. Crépel,


S. Fienberg, and J. Gani, editors, Statisticians of the Centuries, pages 101–104.
Springer, New York, NY, 2001.

[FB81] M. Fischler and R. Bolles. Random sample consensus. Communications of the


ACM, 24(6):381–395, jun 1981.

[Fra18] E. Franklin. Solar Photovoltaic (PV) System Components. The University of


Arizona College of Agriculture & Life Sciences, pages 1–8, May 2018.

[GL92] S. Greenland and M. Longnecker. Methods for trend estimation from summa-
rized dose-response data, with applications to meta-analysis. American Journal
of Epidemiology, 135(11):1301–1309, jun 1992.

[GM75] R. Glowinski and A. Marroco. Sur l’approximation, par elements finis d’ordre
un, et la resolution, par penalisation-dualité, d’une classe de problems de Dirich-
let non lineares. Revue Française d’Automatique, Informatique, et Recherche
Opérationelle, 9(R-2):41–76, 1975.

[GM76] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear varia-
tional problems via finite element approximation. Computers and Mathematics
with Applications, 2(1):17–40, 1976.

[GW84] S. Grotzinger and C. Witzgall. Projections onto order simplexes. Applied Math-
ematics & Optimization, 12(1):247–270, oct 1984.

[HA18] R. Hyndman and G. Athanasopoulos. Forecasting: principles and practice.


OTexts: Melbourne, Australia, 2018.

[HK70] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1):55–67, 1970.

[HP97] R. Hodrick and E. Prescott. Postwar U.S. business cycles: An empirical inves-
tigation. Journal of Money, Credit and Banking, 29(1):1, feb 1997.

[HTF13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:


Data Mining, Inference, and Prediction. Springer Science & Business Media,
2013.

[Hub64] P. Huber. Robust estimation of a location parameter. The Annals of Mathe-


matical Statistics, 35(1):73–101, mar 1964.

[Hub81] P. Huber. Robust statistics, volume 523. John Wiley & Sons, Hoboken, NJ,
1981.

[ICC16] R. Inman, Y. Chu, and C. Coimbra. Cloud enhancement of global horizontal


irradiance in California and Hawaii. Solar Energy, 130:128–138, Jun 2016.

[Ine08] P. Ineichen. A broadband simplified version of the Solis clear sky model. Solar
Energy, 82(8):758–762, 2008.

[KB78] R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46(1):33, jan


1978.

[KH01] R. Koenker and K. F. Hallock. Quantile regression. Journal of Economic Per-


spectives, 15(4):143–156, nov 2001.

[KKBG09] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. `1 Trend Filtering. SIAM
Review, 51(2):339–360, 2009.

[Les61] C. Leser. A simple method of trend construction. Journal of the Royal Statistical
Society: Series B (Methodological), 23(1):91–107, jan 1961.

[Lev04] S. Levitt. Understanding why crime fell in the 1990s: Four factors that explain
the decline and six that do not. Journal of Economic Perspectives, 18(1):163–
190, feb 2004.

[LS94] W. Link and F. Sauer. Estimating equations estimates of trends. Bird Popula-
tions, 2:23–32, 1994.

[Mal09] S. Mallat. A Wavelet Tour of Signal Processing. Elsevier, 2009.

[MCD+ 14] M. B. McCoy, V. Cevher, Q. T. Dinh, A. Asaei, and L. Baldassarre. Convexity


in source separation: Models, geometry, and algorithms. IEEE Signal Processing
Magazine, 31(3):87–95, may 2014.

[Mor62] J.-J. Moreau. Fonctions convexes duales et points proximaux dans un espace
hilbertien. Reports of the Paris Academy of Sciences, Series A, 255:2897–2899,
1962.

[MT14] M. McCoy and J. Tropp. Sharp recovery bounds for convex demix-
ing, with applications. Foundations of Computational Mathematics, 14(3):503–
567, jun 2014.

[MTA] Hourly traffic on Metropolitan Transportation Authority (MTA) bridges and


tunnels. NY Open Data. online: https://data.ny.gov/Transportation/
Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga. Ac-
cessed: 2021-08-31.

[Neu69] O. Neugebauer. The Exact Sciences in Antiquity. Acta historica scientiarum


naturalium et medicinalium. Dover Publications, 1969.

[OS10] A.V. Oppenheim and R.W. Schafer. Discrete-time Signal Processing. Prentice-
Hall signal processing series. Pearson, United Kingdom, 2010.

[Osb95] D. Osborn. Moving average detrending and the analysis of business cycles.
Oxford Bulletin of Economics and Statistics, 57(4):547–558, nov 1995.

[PB14] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Opti-
mization, 1(3):127–239, 2014.

[Phi62] D. Phillips. A technique for the numerical solution of certain integral equations
of the first kind. Journal of the ACM (JACM), 9(1):84–97, 1962.

[PR96] R. Poliquin and R. Rockafellar. Prox-regular functions in variational analysis.


Transactions of the American Mathematical Society, 348(5):1805–1838, 1996.

[PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,


M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.

[Roc70] R. Rockafellar. Convex Analysis. Princeton university press, 1970.

[RT14] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate


descent methods for minimizing a composite function. Mathematical Program-
ming, 144(1-2):1–38, apr 2014.

[Sah] R. Sah. Caltrans traffic census program. California Department of Transporta-


tion. online: https://dot.ca.gov/programs/traffic-operations/census.
Accessed: 2021-08-16.

[Sen68] P. Sen. Estimates of the regression coefficient based on Kendall’s Tau. Journal
of the American Statistical Association, 63(324):1379–1389, 1968.

[Sin88] K. Singleton. Econometric issues in the analysis of equilibrium business cycle


models. Journal of Monetary Economics, 21(2-3):361–386, 1988.

[SMF10] J.-L. Starck, F. Murtagh, and J. M. Fadili. Sparse image and signal process-
ing: wavelets, curvelets, morphological diversity. Cambridge university press,
Cambridge, UK, 2010.

[STLa] Seasonal-trend decomposition using LOESS (STL). https://


www.statsmodels.org/dev/examples/notebooks/generated/stl_
decomposition.html. Accessed: 2021-06-23.

[STLb] STL: Seasonal decomposition of time series by loess. https://www.


rdocumentation.org/packages/stats/versions/3.6.2/topics/stl. Ac-
cessed: 2021-06-23.

[STLc] Time series decomposition. https://www.mathworks.com/help/econ/


detrending.html. Accessed: 2021-06-23.

[TBM79] H. L. Taylor, S. C. Banks, and J. F. McCoy. Deconvolution with the `1 norm.


GEOPHYSICS, 44(1):39–52, jan 1979.

[TF11] I. Tošić and P. Frossard. Dictionary learning. IEEE Signal Processing Magazine,
28(2):27–38, 2011.

[The50] H. Theil. A rank-invariant method of linear and polynomial regression analysis.


Proceedings of the Royal Netherlands Academy of Sciences, 53:Part I: 386–392,
Part II: 521–525, Part III: 1397–1412, 1950.

[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[Tik63] A. Tikonov. Solution of incorrectly formulated problems and the regularization


method. Soviet Math., 4:1035–1038, 1963.

[Tit85] D. Titterington. General structure of regularization procedures in image recon-


struction. Astronomy and Astrophysics, 144(2):381–387, 1985.

[TK] P. Tans and R. Keeling. Mauna Loa CO2 weekly mean and historical com-
parisons. NOAA Global Monitoring Laboratory, Earth System Research Lab-
oratories. online: https://gml.noaa.gov/ccgg/trends/data.html. Accessed:
2021-07-10.

[TK93] A. Thompson and J. Kay. On some Bayesian choices of regularization parameter


in image restoration. Inverse Problems, 9(6):749–761, 1993.

[Tsa05] R. Tsay. Analysis of Financial Time Series. Wiley Series in Probability and
Statistics. John Wiley & Sons, Inc., Hoboken, NJ, USA, 2005.

[UHZB16] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models.
Foundations and Trends in Machine Learning, 9(1):1–118, 2016.

[WK13] M. Wytock and J. Kolter. Contextually Supervised Source Separation with


Application to Energy Disaggregation. Twenty-Eighth AAAI Conference on
Artificial Intelligence, pages 1–10, 2013.

[WM22] J. Wright and Y. Ma. High-dimensional data analysis with low-dimensional


models: Principles, computation, and applications. Cambridge University Press,
New York, NY, 2022.

[Wri15] S. Wright. Coordinate descent algorithms. Mathematical Programming,


151(1):3–34, 2015.

[WWM01] W. Wu, M. Woodroofe, and G. Mentz. Isotonic regression: Another look at the
changepoint problem. Biometrika, 88(3):793–804, 2001.

[YMS21] A. Yurtsever, Varun M., and S. Sra. Three operator splitting with a nonconvex
loss function. mar 2021.

[ZH05] H. Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the royal statistical society, series B (statistical methodology),
67(2):301–320, 2005.

A SD-ADMM algorithm derivation
To derive an ADMM algorithm for SD, we introduce new variables z 1 , . . . , z K ∈ Rq and
reformulate the SD problem (7) as

    minimize    φ_1(x^1) + · · · + φ_K(x^K)
    subject to  Mx^k − z^k = 0,   k = 1, . . . , K,
                My = z^1 + · · · + z^K.

We let g denote the indicator function of the last constraint,



    g(z^1, . . . , z^K) = { 0   My = z^1 + · · · + z^K
                          { ∞   otherwise,

so the SD problem can be expressed as

    minimize    φ_1(x^1) + · · · + φ_K(x^K) + g(z^1, . . . , z^K)
    subject to  Mx^k − z^k = 0,   k = 1, . . . , K.                     (35)

We write this in compact form as

    minimize    φ(x) + g(z)
    subject to  Mx^k − z^k = 0,   k = 1, . . . , K,                     (36)

where x = (x^1, . . . , x^K), z = (z^1, . . . , z^K), and φ(x) = φ_1(x^1) + · · · + φ_K(x^K). We are now
ready to derive the ADMM algorithm.
    We form the augmented Lagrangian, with parameter ρ > 0,

    L_ρ(x, z, λ) = φ(x) + g(z) + ∑_{k=1}^K ( (λ^k)^T (Mx^k − z^k) + (ρ/2)‖Mx^k − z^k‖_2^2 )
                 = φ(x) + g(z) + (ρ/2) ∑_{k=1}^K ( ‖r^k + u^k‖_2^2 − ‖u^k‖_2^2 ),

where r^k = Mx^k − z^k are the residuals, λ^k are the dual variables, and u^k = (1/ρ)λ^k are the
so-called scaled dual variables [BPC+ 11, §3.1.1].
    Iteration j of ADMM consists of three steps:

    x^{j+1}     = argmin_x L_ρ(x, z^j, u^j)
    z^{j+1}     = argmin_z L_ρ(x^{j+1}, z, u^j)
    (u^k)^{j+1} = (u^k)^j + M(x^k)^{j+1} − (z^k)^{j+1},   k = 1, . . . , K,

which we refer to as the x-update, z-update, and u-update, respectively.

    We now work out and simplify these steps. Since L_ρ is separable in x^k, we can minimize
over x^k separately in the x-update to obtain

    (x^k)^{j+1} = argmin_{x^k} ( φ_k(x^k) + (ρ/2)‖Mx^k − (z^k)^j + (u^k)^j‖_2^2 ),   k = 1, . . . , K.   (37)

The z-update can be written as

    z^{j+1} = Π(M(x^1)^{j+1} + (u^1)^j, . . . , M(x^K)^{j+1} + (u^K)^j),

where Π is the projection onto the domain of g, i.e., the constraint My = z^1 + · · · + z^K. To
simplify notation, let a^k = M(x^k)^{j+1} + (u^k)^j. The z-update can be written as

    (z^k)^{j+1} = a^k + (1/K)(My − a^1 − · · · − a^K)
                = M(x^k)^{j+1} + (u^k)^j + (1/K)(My − a^1 − · · · − a^K).
Now consider the u-update. Plugging in the new z-update above, we get

    (u^k)^{j+1} = −(1/K)(My − a^1 − · · · − a^K).

The righthand side does not depend on k, which means that all (u^k)^{j+1} are the same and
can be denoted as u^{j+1}. (This simplification is not unexpected since the original problem
has only one dual variable, which is a vector in R^q.) With this simplification, the u-update
(now for just one scaled dual variable u ∈ R^q) becomes

    u^{j+1} = u^j + (1/K) ( ∑_{k=1}^K M(x^k)^{j+1} − My ).

    Substituting u^j for (u^k)^j in the z-update, we get

    (z^k)^{j+1} = M(x^k)^{j+1} − u^{j+1}.

Substituting (z^k)^j = M(x^k)^j − u^j into the original x-update (37) above, we obtain

    (x^k)^{j+1} = argmin_{x^k} ( φ_k(x^k) + (ρ/2)‖Mx^k − M(x^k)^j + 2u^j‖_2^2 )
                = argmin_{x^k} ( φ_k(x^k) + (ρ/2)‖M(x^k − (x^k)^j + 2M^*u^j)‖_2^2 )
                = mprox_{φ_k}((x^k)^j − 2M^*u^j),

for k = 1, . . . , K. (We use (17) in the second line.)
    We now see that the variables z^k have dropped out, and we arrive at the final set of
ADMM iterations

    (x^k)^{j+1} = mprox_{φ_k}((x^k)^j − 2M^*u^j),   k = 1, . . . , K,
    u^{j+1}     = u^j + (1/K) ( ∑_{k=1}^K M(x^k)^{j+1} − My ).
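    A compact numpy sketch of these final iterations is given below. It assumes M is the
selection of the known entries (so M^* scatters a vector in R^q back into a T × p array with
zeros elsewhere), and that each masked proximal operator is supplied as a function of its
argument, with ρ fixed inside it.

    import numpy as np

    def sd_admm(y, known, mprox_list, num_iter=100):
        """y          : (T, p) data array (values at unknown entries are ignored)
           known      : (T, p) boolean mask of the known set K
           mprox_list : K masked proximal operators, each mapping a (T, p) array
                        to a (T, p) array"""
        K = len(mprox_list)
        x = [np.zeros_like(y, dtype=float) for _ in range(K)]
        u = np.zeros(int(known.sum()))              # scaled dual variable in R^q
        My = y[known]
        for _ in range(num_iter):
            # x-update: masked prox at x^k - 2 M* u
            Mstar_u = np.zeros_like(y, dtype=float)
            Mstar_u[known] = u
            x = [mprox(xk - 2 * Mstar_u) for mprox, xk in zip(mprox_list, x)]
            # u-update: add the average consistency residual
            u = u + (sum(xk[known] for xk in x) - My) / K
        return x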
