Springer Optimization and Its Applications  137

Yurii Nesterov

Lectures
on Convex
Optimization
Second Edition
Springer Optimization and Its Applications

Volume 137

Managing Editor
Panos M. Pardalos (University of Florida)

Editor-Combinatorial Optimization
Ding-Zhu Du (University of Texas at Dallas)

Advisory Board
J. Birge (University of Chicago)
S. Butenko (Texas A & M University)
F. Giannessi (University of Pisa)
S. Rebennack (Karlsruhe Institute of Technology)
T. Terlaky (Lehigh University)
Y. Ye (Stanford University)

Aims and Scope


Optimization has been expanding in all directions at an astonishing rate during the
last few decades. New algorithmic and theoretical techniques have been developed,
the diffusion into other disciplines has proceeded at a rapid pace, and our knowledge
of all aspects of the field has grown even more profound. At the same time, one of
the most striking trends in optimization is the constantly increasing emphasis on the
interdisciplinary nature of the field. Optimization has been a basic tool in all areas
of applied mathematics, engineering, medicine, economics and other sciences.
The series Springer Optimization and Its Applications publishes undergraduate
and graduate textbooks, monographs and state-of-the-art expository works that
focus on algorithms for solving optimization problems and also study applications
involving such problems. Some of the topics covered include nonlinear optimization
(convex and nonconvex), network flow problems, stochastic optimization, optimal
control, discrete optimization, multi-objective programming, description of soft-
ware packages, approximation techniques and heuristic approaches.
More information about this series at http://www.springer.com/series/7393
Yurii Nesterov

Lectures on Convex
Optimization

Second Edition

Yurii Nesterov
CORE/INMA
Catholic University of Louvain
Louvain-la-Neuve, Belgium

ISSN 1931-6828 ISSN 1931-6836 (electronic)


Springer Optimization and Its Applications
ISBN 978-3-319-91577-7 ISBN 978-3-319-91578-4 (eBook)
https://doi.org/10.1007/978-3-319-91578-4

Library of Congress Control Number: 2018949149

Mathematics Subject Classification (2010): 49M15, 49M29, 49N15, 65K05, 65K10, 90C25, 90C30,
90C46, 90C51, 90C52, 90C60

© Springer Nature Switzerland AG 2004, 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Svetlana
Preface

The idea of writing this book came from the editors of Springer, who suggested that
the author should think about a renewal of the book
Introductory Lectures on Convex Optimization: Basic Course,
which was published by Kluwer in 2003 [39]. In fact, the main part of this book
was written in the period 1997–1998, so its material is at least twenty years old. For
such a lively field as Convex Optimization, this is indeed a long time.
However, having started to work with the text, the author realized very quickly
that this modest goal was simply unreachable. The main idea of [39] was to
present a short one-semester course (12 lectures) on Convex Optimization, which
reflected the main algorithmic achievements in the field at the time. Therefore,
some important notions and ideas, especially related to all kinds of Duality Theory,
were eliminated from the contents without any remorse. In some sense, [39] still
remains the minimal course representing the basic concepts of algorithmic Convex
Optimization. Any enlargements to this text would require difficult explanations
as to why the selected material is more important than the many other interesting
candidates which have been left on the shelf.
Thus, the author came to a hard decision to write a new book, which includes
all of the material of [39], along with the most important advances in the field
during the last two decades. From the chronological point of view, this book
covers the period up to the year 2012.1 Therefore, the newer results on random
coordinate descent methods and universal methods, complexity results on zero-
order algorithms and methods for solving huge-scale problems are still missing.
However, in our opinion, these very interesting topics have not yet matured enough
for a monographic presentation, especially in the form of lectures.
From the methodological point of view, the main novelty of this book consists
in the wide presence of duality. Now the reader can see the story from both sides,

1 Well, just for consistency, we added the results from several last-minute publications, which are important for the topics discussed in the book.


primal and dual. As compared to [39], the size of the book is doubled, which looks
to be a reasonable price to pay for a comprehensive presentation. Clearly, this book
is now too big to be taught in one semester. However, it fits well into a two-semester
term. Alternatively, different parts of it can be used in diverse educational programs
on modern optimization. We discuss possible variants at the end of the Introduction.
In this book we include three topics, which are new to the monographic
literature.
• The smoothing technique. This approach has completely changed our under-
standing of complexity of nonsmooth optimization problems, which arise in
the vast majority of applications. It is based on the algorithmic possibility
of approximating a non-differentiable convex function by a smooth one, and
minimizing the new objective by Fast Gradient Methods. As compared with
standard subgradient methods, the complexity of each iteration of the new
schemes does not change. However, the number of iterations of the new schemes
becomes proportional to the square root of the number required by the standard
methods. Since in practice these numbers are usually of the order
of many thousands, or even millions, the gain in computational time becomes
spectacular.
• Global complexity bounds for second-order methods. Second-order methods,
and their most famous representative, Newton's Method, are among the oldest
schemes in Numerical Analysis. However, their global complexity analysis has
only recently been carried out, after the discovery of the Cubic Regularization
of Newton’s Method. For this new variant of the classical scheme, we can write
down the global complexity bounds for different problem classes. Consequently,
we can now compare global efficiency of different second-order methods and
develop accelerated schemes. A completely new feature of these methods is the
accumulation of some model of the objective function during the minimization
process. At the same time, we can derive for them lower complexity bounds and
develop optimal second-order methods. Similar modifications can be made for
methods solving systems of nonlinear equations.
• Optimization in relative scale. The standard way of defining an approximate
solution of an optimization problem consists in introducing absolute accuracy.
However, in many engineering applications, it is natural to measure the quality
of solution in a relative scale (percent). To adjust minimization methods toward
this goal, we introduce a special model of objective function and apply efficient
preprocessing algorithms for computing an appropriate metric, compatible with
the topology of the objective. As a result, we get very efficient optimization
methods with a weak dependence of their complexity bounds on the size of the input
data.
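The first two items above can be summarized by standard complexity bounds (the exact constants and assumptions appear in the corresponding chapters; the asymptotic orders below are the well-known ones for nonsmooth convex minimization and for the convex case of the Cubic Newton Method):

```latex
% Iteration counts guaranteeing f(x_N) - f^* \le \varepsilon for nonsmooth
% convex problems: standard subgradient scheme vs. smoothing combined with
% Fast Gradient Methods, and the global rate of the Cubic Regularization
% of Newton's Method in the convex case.
N_{\mathrm{subgrad}} = O\!\left(\frac{1}{\varepsilon^{2}}\right),
\qquad
N_{\mathrm{smooth}} = O\!\left(\frac{1}{\varepsilon}\right),
\qquad
f(x_k) - f^{*} = O\!\left(\frac{1}{k^{2}}\right)\quad\text{(Cubic Newton)}.
```

Note that $N_{\mathrm{smooth}} \sim \sqrt{N_{\mathrm{subgrad}}}$, which is exactly the square-root gain described in the first bullet point.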
We hope that this book will be useful for a wide audience, including students
with mathematical, economical, and engineering specializations, practitioners of
different fields, and researchers in Optimization Theory, Operations Research, and
Computer Science. The main lesson of the development of our field in the last few
decades is that efficient optimization methods can be developed only by intelligently

employing the structure of particular instances of problems. In order to do this, it is
always useful to look at successful examples. We believe that this book will provide
the interested reader with a great deal of information of this type.

Louvain-la-Neuve, Belgium Yurii Nesterov


January 2018
Acknowledgements

Throughout my scientific career, I have had the extraordinary opportunity of being
able to have regular scientific discussions with Arkady Nemirovsky. His remarkable
mathematical intuition and profound mathematical culture helped me enormously
in my scientific research. Boris Polyak has remained my scientific adviser starting
from the time of my PhD, for almost four decades. His scientific longevity has set
a very stimulating example. I am very thankful to my colleagues A. d’Aspremont,
A. Antipin, V. Blondel, O. Burdakov, C. Cartis, F. Glineur, C. Gonzaga, R. Freund,
A. Juditsky, H.-J. Lüthi, B. Mordukhovich, M. Overton, R. Polyak, V. Protasov,
J. Renegar, P. Richtarik, R. Sepulchre, K. Scheinberg, A. Shapiro, S. Shpirko, Y.
Smeers, L. Tuncel, P. Vandooren, J.-Ph. Vial, and R. Weismantel for our regular
scientific discussions resulting from time to time in a joint paper. In the recent years,
my contact with young researchers P. Dvurechensky, N. Doikov, A. Gasnikov, G.
Grapiglia, R. Hildebrand, A. Rodomanov, and V. Shikhman has been very interesting
and stimulating. At the same time, I am convinced that the excellent conditions
for research provided to me by the Université Catholique de Louvain (UCL) are a result
of continuous support (over several decades!) from the patriarchs of UCL, Jacques
Dreze, Michele Gevers, and Laurence Wolsey. To all these people, I express my
sincere gratitude.
The contents of this book have already been presented in several educational
courses. I am very thankful to C. Helmberg, R. Freund, B. Legat, J. Renegar, H.
Sendov, A. Tits, M. Todd, L. Tuncel, and P. Weiss for reporting to me a number
of misprints in [39]. In the period 2011–2017 I had the very useful opportunity of
presenting some parts of the new material in several advanced courses on Modern
Convex Optimization at different universities over the world (University of Liege,
ENSAE (ParisTech), University of Vienna, Max Planck Institute (Saarbrucken),
FIM (ETH Zurich), Ecole Polytechnique, Higher School of Economics (Moscow),
Korea Advanced Institute of Science and Technology (Daejeon), Chinese Academy of
Sciences (Beijing)). I am very thankful to all these people and institutions for their
interest in my research.
Finally, only the patience and professionalism of Springer editors Anne-Kathrin
Birchley-Brun and Rémi Lodh have made the publication of this book possible.

Introduction

Optimization problems arise naturally in many different fields. Very often, at some
point we get a craving to arrange things in the best possible way. This intention,
converted into a mathematical formulation, becomes an optimization problem of
a certain type. Depending on the field of interest, it could be an optimal design
problem, an optimal control problem, an optimal location problem, an optimal
diet problem, etc. However, the next step, consisting in finding a solution to the
mathematical model, is far from being trivial. At first glance, everything looks very
simple: many commercial optimization packages are easily available and any user
can get a “solution” to the model just by clicking at an icon on the desktop of a
personal computer. However, the question is, what do we actually get? How much
can we trust the answer?
One of the goals of this course is to show that, despite their easy availability, the
proposed “solutions” of general optimization problems very often cannot satisfy the
expectations of a naive user. In our opinion, the main fact, which should be known
to any person dealing with optimization models, is that in general, optimization
problems are unsolvable. This statement, which is usually missing in standard
optimization courses, is very important for understanding optimization theory and
the logic of its development in the past and in the future.
In many practical applications, the process of creating a model can take a lot
of time and effort. Therefore, the researchers should have a clear understanding
of the properties of the model they are constructing. At the stage of modelling,
many different ideas can be applied to represent a real-life situation, and it is
absolutely necessary to understand the computational consequences of each step
in this process. Very often, we have to choose between a “perfect” model, which we
cannot solve,2 and a “sketchy” model, which can be solved for sure. What is better?
In fact, computational practice provides us with an answer. Up to now, the most
widespread optimization models have been the models of Linear Optimization. It is
very unlikely that such models can describe our nonlinear world very well. Hence,

2 More precisely, which we can only try to solve.


the main reason for their popularity is that practitioners prefer to deal with solvable
models. Of course, very often the linear approximations are poor. However, usually
it is possible to predict the consequences of such a choice and make a correction in
interpretation of the obtained solution. This is much better than trying to solve an
overcomplicated model without any guarantee of success.
Another goal of this course consists in discussing numerical methods for solvable
nonlinear models, namely the problems of Convex Optimization. The development
of Convex Optimization in the last decades has been very rapid and exciting. Now
it consists of several competing branches, each of which has some strong and
some weak points. We will discuss their features in detail, taking into account
the historical aspect. More precisely, we will try to understand the internal logic
of the development of each branch of the field. Up to now, the main results of
these developments could only be found in specialized journals. However, in our
opinion, many of these theoretical achievements are ready to be understood by
the final users: computer scientists, industrial engineers, economists, and students
of different specializations. We hope that this book will be interesting even for
experts in optimization theory since it contains many results which have never been
published in a monograph.
In this book, we will try to convince the reader that, in order to work with
optimization formulations successfully, it is necessary to be aware of some theory,
which explains what we can and what we cannot do with optimization problems.
The elements of this simple theory can be found in almost every chapter of the
first part of the book, dealing with the standard black-box model of the objective
function. We will see that Black-Box Convex Optimization is an excellent example
of a comprehensive application theory, which is simple, easy to learn and which
can be very useful in practical applications. On the other hand, in the second part
of the book, we will see how much we can gain from a proper use of a problem’s
structure. This enormous increase of our abilities does not discard the results of the
first part. On the contrary, most of the achievements in Structural Optimization are
firmly supported by the fundamental methods of Black-Box Convex Optimization.
In this book, we discuss the most efficient modern optimization schemes and
establish for them global efficiency bounds. Our presentation is self-contained; we
prove all necessary results. Nevertheless, the proofs and reasoning should not be a
problem, even for a second-year undergraduate student.
The structure of the book is as follows. It consists of seven relatively independent
chapters. Each chapter includes three or four sections. Most of them correspond
approximately to a two-hour lecture. Thus, the contents of the book can be directly
used for a standard two-semester course on Convex Optimization. Of course,
different subsets of the chapters can be useful for a smaller course.
The whole content is divided into two parts. Part I, which includes Chaps. 1–4,
contains all the material related to the Black-Box model of an optimization problem. In
this framework, additional information on the given problem can be obtained only
by request, which corresponds to a particular set of values of the decision variables.
Typically, the result of this request is either the value of the objective function, or

this value and the gradient, etc. This framework is the most advanced part of Convex
Optimization Theory.
Chapter 1 is devoted to general optimization problems. In Sect. 1.1, we intro-
duce the terminology, the notions of oracle, black box, functional model of
an optimization problem and the complexity of general iterative schemes. We
prove that global optimization problems are “unsolvable” and discuss the main
features of different fields of optimization theory. In Sect. 1.2, we discuss two main
local unconstrained minimization schemes: the gradient method and Newton's
method. We establish their local rates of convergence and discuss the possible
difficulties (divergence, convergence to a saddle point). In Sect. 1.3, we compare
the formal structures of the gradient method and Newton's method. This analysis leads
to the idea of a variable metric. We describe quasi-Newton methods and conjugate
gradient schemes. We conclude this section with an analysis of different methods
for constrained minimization: Lagrangian relaxation with a certificate for global
optimality, the penalty function method, and the barrier approach.
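As a minimal illustration of the two local schemes compared in these sections (this sketch is not code from the book; the quadratic objective, the matrix `A`, and the constant step size 1/L are illustrative assumptions), one can contrast the gradient method's iterative progress with Newton's one-step solution on a strongly convex quadratic, where the minimizer of f(x) = ½xᵀAx − bᵀx solves Ax = b:

```python
import numpy as np

def gradient_method(A, b, x0, n_iters=500):
    """Gradient method x_{k+1} = x_k - (1/L) * grad f(x_k),
    where L is the Lipschitz constant of the gradient (largest
    eigenvalue of the Hessian A)."""
    L = np.linalg.eigvalsh(A).max()
    x = x0.copy()
    for _ in range(n_iters):
        grad = A @ x - b          # gradient of f(x) = 0.5 x'Ax - b'x
        x = x - grad / L
    return x

def newton_method(A, b, x0):
    """Newton's method; for a quadratic, the Hessian is A and one
    step lands exactly on the minimizer."""
    grad = A @ x0 - b
    return x0 - np.linalg.solve(A, grad)

# Illustrative problem data (assumed, not from the book).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite
b = np.array([1.0, 1.0])
x0 = np.zeros(2)

x_gd = gradient_method(A, b, x0)
x_nt = newton_method(A, b, x0)
x_star = np.linalg.solve(A, b)   # exact minimizer for comparison
```

The gradient method converges linearly at a rate governed by the condition number of A, while Newton's method exploits second-order information and, on this quadratic, terminates in a single step — the contrast that motivates the variable-metric idea of Sect. 1.3.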
In Chap. 2, we consider methods of smooth convex optimization. In Sect. 2.1,
we analyze the main reason for difficulties encountered in the previous chapter.
From this analysis, we derive two good functional classes, the classes of smooth
convex and smooth strongly convex functions. For corresponding unconstrained
minimization problems, we establish the lower complexity bounds. We conclude
this section with an analysis of a gradient scheme, which demonstrates that this
method is not optimal. The optimal schemes for smooth convex minimization
problems, so-called Fast Gradient Methods, are discussed in Sect. 2.2. We start
by presenting a special technique for convergence analysis, based on estimating
sequences. Initially, it is introduced for problems of Unconstrained Minimization.
After that, we introduce convex sets and define a notion of gradient mapping
for a problem with simple set constraints. We show that the gradient mapping
can formally replace a gradient step in the optimization schemes. In Sect. 2.3,
we discuss more complicated problems, which involve several smooth convex
functions, namely, the minimax problem and the constrained minimization problem.
For both problems we use a notion of gradient mapping and present the optimal
schemes.
Chapter 3 is devoted to the theory of nonsmooth convex optimization. Since we do
not assume that the reader has a background in Convex Analysis, the chapter begins
with Sect. 3.1, which contains a compact presentation of all the necessary facts.
The final goal of this section is to justify the rules for computing the subgradients
of a convex function. At the same time, we also discuss optimality conditions,
Fenchel duality and Lagrange multipliers. At the end of the section, we prove
several minimax theorems and explain the basic notions justifying the primal-dual
optimization schemes. This is the biggest section in the book and it can serve as a
basis for a mini-course on Convex Analysis.
The next Sect. 3.2 starts from the lower complexity bounds for nonsmooth
optimization problems. After that, we present a general scheme for the complexity
analysis of the corresponding methods. We use this scheme in order to establish a
convergence rate for the simplest subgradient method and for its switching variant,

treating the problems with functional constraints. For the latter scheme, we justify
the possibility of approximating optimal Lagrange multipliers. In the remaining part
of the section, we consider the two most important finite-dimensional methods: the
center-of-gravity method and the ellipsoid method. At the end, we briefly discuss
some other cutting plane schemes. Section 3.3 is devoted to the minimization
schemes which employ a piecewise-linear model of a convex function. We describe
Kelley’s method and show that it can be extremely slow. After that, we introduce
the so-called Level Method. We justify its efficiency estimates for unconstrained
minimization problems and for problems with functional constraints.
Part I is concluded by Chap. 4, devoted to a global complexity analysis of second-
order methods. In Sect. 4.1, we introduce cubic regularization of the Newton method
and study its properties. We show that the auxiliary optimization problem in this
scheme can be efficiently solved even if the Hessian of the objective function is not
positive semidefinite. We study global and local convergence of the Cubic Newton
Method in convex and non-convex cases. In Sect. 4.2, we show that this method can
be accelerated using the estimating sequences technique.
In Sect. 4.3, we derive lower complexity bounds for second-order methods and
present a conceptual optimal scheme. At each iteration of this method, it is necessary
to perform a potentially expensive search procedure. Therefore, we conclude that the
problem of constructing an efficient optimal second-order scheme remains open.
In the last Sect. 4.4, we consider a modification of the standard Gauss-Newton
method for solving systems of nonlinear equations. This modification is also based
on an overestimating principle as applied to the norm of the residual of the system.
Both global and local convergence results are justified.
In Part II, we include results related to Structural Optimization. In this frame-
work, we have direct access to the elements of optimization problems. We can work
with the input data at the preliminary stage, and modify it, if necessary, to make
the problem simpler. We show that such a freedom can significantly increase our
computational abilities. Very often, we are able to get optimization methods which
go far beyond the limits prescribed by the lower complexity bounds of Black-Box
Optimization Theory.
In the first chapter of this part, Chap. 5, we present theoretical foundations
for polynomial-time interior-point methods. In Sect. 5.1, we discuss a certain
contradiction in the Black Box concept as applied to a convex optimization model.
We introduce a barrier model of an optimization problem, which is based on the
notion of a self-concordant function. For such functions, the second-order oracle is
not local. Moreover, they can easily be minimized by the standard Newton’s method.
We study the properties of these functions and their dual counterparts.
In the next Sect. 5.2, we study the complexity of minimization of self-concordant
functions by different variants of Newton’s method. The efficiency of direct
minimization is compared with that of a path-following scheme, and it is proved
that the latter method is much better.
In Sect. 5.3, we introduce self-concordant barriers, a subclass of standard self-
concordant functions, which is suitable for sequential unconstrained minimization

schemes. We study the properties of such barriers and prove the efficiency estimate
of the path-following scheme.
In Sect. 5.4, we consider several examples of optimization problems, for which
we can construct a self-concordant barrier. Consequently, these problems can
be solved by a polynomial-time path-following scheme. We consider linear and
quadratic optimization problems, problems of semidefinite optimization, separable
optimization and geometrical optimization, problems with extremal ellipsoids, and
problems of approximation in ℓp-norms. A special subsection is devoted to a
general technique for constructing self-concordant barriers for particular convex
sets, which is provided with several application examples. We conclude Chap. 5
with a comparative analysis of performance of an interior-point scheme with a
nonsmooth optimization method as applied to a particular problem class.
In Chap. 6, we present different approaches based on the direct use of a primal-
dual model of the objective function. First of all, we study a possibility of
approximating nonsmooth functions by smooth functions. In the previous chapters,
it was shown that in the Black-Box framework smooth optimization problems are
much easier than nonsmooth problems. However, any non-differentiable function
can be approximated with arbitrary accuracy by a differentiable function. We pay for
the better quality of approximation by a higher curvature of the smooth function. In
Sect. 6.1, we show how to balance the accuracy of approximation and its curvature
in an optimal way. As a result, we develop a technique for creating computable
smoothed versions of non-differentiable functions and minimizing them by Fast
Gradient Methods described in Chap. 2. The number of iterations of the resulting
methods is proportional to the square root of the number of iterations of the standard
subgradient scheme. At the same time, the complexity of each iteration does not
change. In Sect. 6.2, we show that this technique can also be used in a symmetric
primal-dual form. In the next Sect. 6.3, we give an example of application of the
smoothing technique to the problems of Semidefinite Programming.
This chapter concludes with Sect. 6.4, where we analyze methods based on
minimization of a local model of the objective function. Our optimization problem
has a composite objective function equipped with a linear optimization oracle.
For this problem, we justify global complexity bounds for two versions of the
Conditional Gradient method (the Frank–Wolfe algorithm). It is shown that these
methods can compute approximations of the primal-dual problem. At the end of
this section, we analyze a new version of the Trust-Region second-order method,
for which we obtain the worst-case global complexity guarantee.
In the last Chap. 7, we collect optimization methods which are able to solve
problems with a certain relative accuracy. Indeed, in many applications, it is difficult
to relate the number of iterations of an optimization scheme with a desired accuracy
of the solution since the corresponding inequality contains unknown parameters
(Lipschitz constants, distance to the optimum). However, in many cases the required
level of relative accuracy is quite understandable. For developing methods which
compute solutions with relative accuracy, we need to employ internal structure of
the problem. In this chapter, we start from problems of minimizing homogeneous
objective functions over a convex set separated from the origin (Sect. 7.1). The

availability of a subdifferential of this function at zero provides us with a good
metric, which can be used in optimization schemes and in the smoothing technique.
If this subdifferential is polyhedral, then the metric can be computed by a cheap
preliminary rounding process (Sect. 7.2).
In the next Sect. 7.3, we present a barrier subgradient method, which computes
an approximate maximum of a positive convex function with a certain relative
accuracy. We show how to apply this method for solving problems of fractional
covering, maximal concurrent flow, semidefinite relaxation, online optimization,
portfolio management, and others.
We conclude this chapter with Sect. 7.4.1, where we study the possibility of
finding good relative approximations to a special class of convex functions, which
we call strictly positive. For these functions, it is possible to introduce a new notion
of mixed accuracy (absolute/relative) and develop a quasi-Newton scheme for its
efficient approximation. We derive global complexity bounds for this method and
show that they are monotone in the dimension of the problem. This means that
small dimensions always help.
The book concludes with Bibliographical Comments and an Appendix, where we
analyze efficiency of some methods for solving auxiliary optimization problems.
Let us conclude this Introduction by describing some possible combinations
of chapters suitable for a course. The most classical one-semester course can be
composed by Chaps. 1, 2, 3, and 5. It corresponds more or less to the contents
of monograph [39]. The only difference is that in the present book Sect. 3.1 is
much bigger and it will be reasonable to restrict the student’s attention only to the
necessary parts. Chapter 3 can be replaced by Chap. 4, which will yield a course
devoted only to differentiable optimization.
All three chapters of Part II are completely independent. At the same time,
they can be unified in an advanced one-semester course on Modern Convex
Optimization.
Contents

Part I Black-Box Optimization


1 Nonlinear Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 3
1.1 The World of Nonlinear Optimization.. . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 3
1.1.1 General Formulation of the Problem .. . . . .. . . . . . . . . . . . . . . . . . . . 3
1.1.2 Performance of Numerical Methods . . . . . .. . . . . . . . . . . . . . . . . . . . 7
1.1.3 Complexity Bounds for Global Optimization.. . . . . . . . . . . . . . . . 10
1.1.4 Identity Cards of the Fields . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 15
1.2 Local Methods in Unconstrained Minimization . .. . . . . . . . . . . . . . . . . . . . 17
1.2.1 Relaxation and Approximation .. . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 18
1.2.2 Classes of Differentiable Functions . . . . . . .. . . . . . . . . . . . . . . . . . . . 23
1.2.3 The Gradient Method . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 28
1.2.4 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 35
1.3 First-Order Methods in Nonlinear Optimization ... . . . . . . . . . . . . . . . . . . . 40
1.3.1 The Gradient Method and Newton’s Method: What Is
Different? .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 40
1.3.2 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 45
1.3.3 Constrained Minimization . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 50
2 Smooth Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
2.1 Minimization of Smooth Functions . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
2.1.1 Smooth Convex Functions .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
2.1.2 Lower Complexity Bounds for F_L^{∞,1}(Rn ) . . . . . .. . . . . . . . . . . . . . . . . . . 69
2.1.3 Strongly Convex Functions .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 73
2.1.4 Lower Complexity Bounds for S_{μ,L}^{∞,1}(Rn ) . . .. . . . . . . . . . . . . . . . . . . 77
2.1.5 The Gradient Method . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 80
2.2 Optimal Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 82
2.2.1 Estimating Sequences .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 83
2.2.2 Decreasing the Norm of the Gradient . . . . .. . . . . . . . . . . . . . . . . . . . 97
2.2.3 Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 101
2.2.4 The Gradient Mapping .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 112
2.2.5 Minimization over Simple Sets . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 114

xix
xx Contents

2.3 The Minimization Problem with Smooth Components .. . . . . . . . . . . . . . 117


2.3.1 The Minimax Problem . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 117
2.3.2 Gradient Mapping .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 120
2.3.3 Minimization Methods for the Minimax Problem .. . . . . . . . . . . 123
2.3.4 Optimization with Functional Constraints . . . . . . . . . . . . . . . . . . . . 127
2.3.5 The Method for Constrained Minimization .. . . . . . . . . . . . . . . . . . 131
3 Nonsmooth Convex Optimization .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 139
3.1 General Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 139
3.1.1 Motivation and Definitions . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 140
3.1.2 Operations with Convex Functions .. . . . . . .. . . . . . . . . . . . . . . . . . . . 147
3.1.3 Continuity and Differentiability . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 157
3.1.4 Separation Theorems.. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 159
3.1.5 Subgradients .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 162
3.1.6 Computing Subgradients.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 167
3.1.7 Optimality Conditions . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 176
3.1.8 Minimax Theorems . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 188
3.1.9 Basic Elements of Primal-Dual Methods .. . . . . . . . . . . . . . . . . . . . 191
3.2 Methods of Nonsmooth Minimization . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 194
3.2.1 General Lower Complexity Bounds .. . . . . .. . . . . . . . . . . . . . . . . . . . 194
3.2.2 Estimating Quality of Approximate Solutions .. . . . . . . . . . . . . . . 198
3.2.3 The Subgradient Method.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 201
3.2.4 Minimization with Functional Constraints .. . . . . . . . . . . . . . . . . . . 204
3.2.5 Approximating the Optimal Lagrange Multipliers . . . . . . . . . . . 206
3.2.6 Strongly Convex Functions .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 209
3.2.7 Complexity Bounds in Finite Dimension .. . . . . . . . . . . . . . . . . . . . 214
3.2.8 Cutting Plane Schemes . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 217
3.3 Methods with Complete Data . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 225
3.3.1 Nonsmooth Models of the Objective Function . . . . . . . . . . . . . . . 225
3.3.2 Kelley’s Method .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 226
3.3.3 The Level Method .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 229
3.3.4 Constrained Minimization . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 233
4 Second-Order Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 241
4.1 Cubic Regularization of Newton’s Method . . . . . . .. . . . . . . . . . . . . . . . . . . . 241
4.1.1 Cubic Regularization of Quadratic Approximation . . . . . . . . . . 241
4.1.2 General Convergence Results . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 246
4.1.3 Global Efficiency Bounds on Specific Problem Classes . . . . . 251
4.1.4 Implementation Issues . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 262
4.1.5 Global Complexity Bounds .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 268
4.2 Accelerated Cubic Newton.. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 270
4.2.1 Real Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 270
4.2.2 Uniformly Convex Functions .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 274
4.2.3 Cubic Regularization of Newton Iteration . . . . . . . . . . . . . . . . . . . . 278
4.2.4 An Accelerated Scheme . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 281
4.2.5 Global Non-degeneracy for Second-Order Schemes . . . . . . . . . 285

4.2.6 Minimizing Strongly Convex Functions . .. . . . . . . . . . . . . . . . . . . . 288


4.2.7 False Acceleration.. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 290
4.2.8 Decreasing the Norm of the Gradient . . . . .. . . . . . . . . . . . . . . . . . . . 291
4.2.9 Complexity of Non-degenerate Problems .. . . . . . . . . . . . . . . . . . . . 293
4.3 Optimal Second-Order Methods .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 294
4.3.1 Lower Complexity Bounds . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 294
4.3.2 A Conceptual Optimal Scheme .. . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 299
4.3.3 Complexity of the Search Procedure .. . . . .. . . . . . . . . . . . . . . . . . . . 304
4.4 The Modified Gauss–Newton Method . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 305
4.4.1 Quadratic Regularization of the Gauss–Newton Iterate .. . . . . 305
4.4.2 The Modified Gauss–Newton Process . . . .. . . . . . . . . . . . . . . . . . . . 312
4.4.3 Global Rate of Convergence .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 314
4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 319

Part II Structural Optimization


5 Polynomial-Time Interior-Point Methods . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 325
5.1 Self-concordant Functions . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 325
5.1.1 The Black Box Concept in Convex Optimization . . . . . . . . . . . . 325
5.1.2 What Does the Newton’s Method Actually Do? .. . . . . . . . . . . . . 328
5.1.3 Definition of Self-concordant Functions . .. . . . . . . . . . . . . . . . . . . . 330
5.1.4 Main Inequalities .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 337
5.1.5 Self-Concordance and Fenchel Duality . . .. . . . . . . . . . . . . . . . . . . . 346
5.2 Minimizing Self-concordant Functions . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 353
5.2.1 Local Convergence of Newton’s Methods . . . . . . . . . . . . . . . . . . . . 353
5.2.2 Path-Following Scheme .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 358
5.2.3 Minimizing Strongly Convex Functions . .. . . . . . . . . . . . . . . . . . . . 363
5.3 Self-concordant Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 367
5.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 367
5.3.2 Definition of a Self-concordant Barrier . . .. . . . . . . . . . . . . . . . . . . . 369
5.3.3 Main Inequalities .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 375
5.3.4 The Path-Following Scheme .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 378
5.3.5 Finding the Analytic Center . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 382
5.3.6 Problems with Functional Constraints . . . .. . . . . . . . . . . . . . . . . . . . 385
5.4 Applications to Problems with Explicit Structure . . . . . . . . . . . . . . . . . . . . 388
5.4.1 Lower Bounds for the Parameter of a Self-concordant
Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 389
5.4.2 Upper Bound: Universal Barrier and Polar Set . . . . . . . . . . . . . . . 390
5.4.3 Linear and Quadratic Optimization . . . . . . .. . . . . . . . . . . . . . . . . . . . 391
5.4.4 Semidefinite Optimization .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 395
5.4.5 Extremal Ellipsoids . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 400
5.4.6 Constructing Self-concordant Barriers for Convex Sets. . . . . . 403
5.4.7 Examples of Self-concordant Barriers . . . .. . . . . . . . . . . . . . . . . . . . 406
5.4.8 Separable Optimization . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 414
5.4.9 Choice of Minimization Scheme . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 417

6 The Primal-Dual Model of an Objective Function . . .. . . . . . . . . . . . . . . . . . . . 423


6.1 Smoothing for an Explicit Model of an Objective Function.. . . . . . . . . 423
6.1.1 Smooth Approximations of Non-differentiable Functions . . . 424
6.1.2 The Minimax Model of an Objective Function .. . . . . . . . . . . . . . 427
6.1.3 The Fast Gradient Method for Composite Minimization.. . . . 430
6.1.4 Application Examples.. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 433
6.1.5 Implementation Issues . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 445
6.2 An Excessive Gap Technique for Non-smooth Convex
Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 446
6.2.1 Primal-Dual Problem Structure.. . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 446
6.2.2 An Excessive Gap Condition . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 448
6.2.3 Convergence Analysis . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 452
6.2.4 Minimizing Strongly Convex Functions . .. . . . . . . . . . . . . . . . . . . . 454
6.3 The Smoothing Technique in Semidefinite Optimization . . . . . . . . . . . . 460
6.3.1 Smooth Symmetric Functions of Eigenvalues . . . . . . . . . . . . . . . . 460
6.3.2 Minimizing the Maximal Eigenvalue of the Symmetric
Matrix .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 466
6.4 Minimizing the Local Model of an Objective Function . . . . . . . . . . . . . . 468
6.4.1 A Linear Optimization Oracle .. . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 468
6.4.2 The Method of Conditional Gradients with Composite
Objective.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 470
6.4.3 Conditional Gradients with Contraction . .. . . . . . . . . . . . . . . . . . . . 474
6.4.4 Computing the Primal-Dual Solution . . . . .. . . . . . . . . . . . . . . . . . . . 479
6.4.5 Strong Convexity of the Composite Term.. . . . . . . . . . . . . . . . . . . . 481
6.4.6 Minimizing the Second-Order Model .. . . .. . . . . . . . . . . . . . . . . . . . 483
7 Optimization in Relative Scale . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 489
7.1 Homogeneous Models of an Objective Function .. . . . . . . . . . . . . . . . . . . . 489
7.1.1 The Conic Unconstrained Minimization Problem .. . . . . . . . . . . 490
7.1.2 The Subgradient Approximation Scheme .. . . . . . . . . . . . . . . . . . . . 496
7.1.3 Direct Use of the Problem Structure . . . . . .. . . . . . . . . . . . . . . . . . . . 498
7.1.4 Application Examples.. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 504
7.2 Rounding of Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 511
7.2.1 Computing Rounding Ellipsoids . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 511
7.2.2 Minimizing the Maximal Absolute Value of Linear
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 527
7.2.3 Bilinear Matrix Games with Non-negative Coefficients . . . . . 532
7.2.4 Minimizing the Spectral Radius of Symmetric Matrices .. . . . 535
7.3 Barrier Subgradient Method . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 539
7.3.1 Smoothing by a Self-Concordant Barrier .. . . . . . . . . . . . . . . . . . . . 539
7.3.2 The Barrier Subgradient Scheme . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 544
7.3.3 Maximizing Positive Concave Functions .. . . . . . . . . . . . . . . . . . . . 548
7.3.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 551
7.3.5 Online Optimization as an Alternative to Stochastic
Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 555

7.4 Optimization with Mixed Accuracy . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 561


7.4.1 Strictly Positive Functions .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 561
7.4.2 The Quasi-Newton Method .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 564
7.4.3 Interpretation of Approximate Solutions ... . . . . . . . . . . . . . . . . . . . 567

A Solving Some Auxiliary Optimization Problems . . . . .. . . . . . . . . . . . . . . . . . . . 571


A.1 Newton’s Method for Univariate Minimization .. .. . . . . . . . . . . . . . . . . . . . 571
A.2 Barrier Projection onto a Simplex . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 573

Bibliographical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 577

References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 581

Index . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 585
Part I
Black-Box Optimization
Chapter 1
Nonlinear Optimization

In this chapter, we introduce the main notations and concepts used in Continuous
Optimization. The first theoretical results are related to Complexity Analysis of
the problems of Global Optimization. For these problems, we start with a very
pessimistic lower performance guarantee. It implies that for any method there
exists an optimization problem in Rn which needs at least O((1/ε)^n) computations
of the function values in order to approximate its global solution up to accuracy
ε. Therefore, in the next section we pass to local optimization, and consider two
main methods, the Gradient Method and the Newton Method. For both of them,
we establish some local rates of convergence. In the last section, we present some
standard methods in General Nonlinear Optimization: the conjugate gradient meth-
ods, quasi-Newton methods, theory of Lagrangian relaxation, barrier methods and
penalty function methods. For some of them, we prove global convergence results.

1.1 The World of Nonlinear Optimization

(General formulation of the problem; Important examples; Black box and iterative methods;
Analytical and arithmetical complexity; The Uniform Grid Method; Lower complexity
bounds; Lower bounds for global optimization; Identity cards of the fields.)

1.1.1 General Formulation of the Problem

Let us start by fixing the mathematical form of our main problem and the standard
terminology. Let x be an n-dimensional real vector:

x = (x (1), . . . , x (n) )T ∈ Rn ,

© Springer Nature Switzerland AG 2018


Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_1

and f0 (·), . . . , fm (·) be some real-valued functions defined on a set Q ⊆ Rn . In this


book, we consider different variants of the following general minimization problem:

min f0 (x),

s.t. fj (x) & 0, j = 1 . . . m, (1.1.1)

x ∈ Q,

where the sign & can be ≤, ≥, or =.


We call f0 (·) the objective function of our problem, the vector function

f (x) = (f1 (x), . . . , fm (x))T

is called the vector of functional constraints, the set Q is called the basic feasible
set, and the set

F = {x ∈ Q | fj (x) ≤ 0, j = 1 . . . m}

is called the (entire) feasible set of problem (1.1.1). It is just a convention to consider
minimization problems. Instead, we could consider maximization problems with the
objective function −f0 (·).
There exists a natural classification of the types of minimization problems.
• Constrained problems: F ⊂ Rn .
• Unconstrained problems: F = Rn .1
• Smooth problems: all fj (·) are differentiable.
• Nonsmooth problems: there are several nondifferentiable components fk (·).
• Linearly constrained problems: the functional constraints are affine:


fj (x) = ∑_{i=1}^{n} a_j^{(i)} x^{(i)} + bj ≡ ⟨aj , x⟩ + bj ,   j = 1 . . . m,

(here ⟨·, ·⟩ stands for the inner (or scalar) product in Rn : ⟨a, x⟩ = a^T x), and Q is
a polyhedron. If f0 (·) is also affine, then (1.1.1) is a linear optimization problem.
If f0 (·) is quadratic, then (1.1.1) is a quadratic optimization problem. If all the
functions f0 (·), ·, fm (·) are quadratic, then this is a quadratically constrained
quadratic problem.
There is also a classification based on properties of the feasible set.

1 Sometimes, problems with a “simple” basic feasible set Q and no functional constraints are also

treated as “unconstrained” problems. In this case, we need to know how to solve some auxiliary
optimization problems over the set Q in a closed form.

• Problem (1.1.1) is called feasible if F ≠ ∅.
• Problem (1.1.1) is called strictly feasible if there exists an x ∈ Q such that
fj (x) < 0 (or > 0) for all inequality constraints and fj (x) = 0 for all equality
constraints. (Slater condition.)
Finally, we distinguish different types of solutions to (1.1.1):
• A point x ∗ ∈ F is called the optimal global solution to (1.1.1) if f0 (x ∗ ) ≤ f0 (x)
for all x ∈ F (global minimum). In this case, f0 (x ∗ ) is called the (global) optimal
value of the problem.
• A point x ∗ ∈ F is called a local solution to (1.1.1) if there exists a set Fˆ ⊆ F
such that x ∗ ∈ intFˆ and f0 (x ∗ ) ≤ f0 (x) for all x ∈ Fˆ (local minimum). If
f0 (x ∗ ) < f0 (x) for all x ∈ Fˆ \ {x ∗ }, then x ∗ is called a strict (or isolated) local
minimum.
Let us consider now several examples representing the main sources of optimiza-
tion problems.
Example 1.1.1 Let x (1), . . . , x (n) be our design variables. Then we can fix some
functional characteristics of our decision vector x: f0 (x), . . . , fm (x). For example,
we can consider a price of the project, amount of required resources, reliability of
the system, etc. We fix the most important characteristic, f0 (x), as our objective.
For all others, we impose some bounds: aj ≤ fj (x) ≤ bj . Thus, we come to the
problem

min f0 (x),
x∈Q

s.t. aj ≤ fj (x) ≤ bj , j = 1 . . . m,

where Q stands for the structural constraints like nonnegativity, boundedness of


some variables, etc.

Example 1.1.2 Let our initial problem be as follows:

Find x ∈ Rn such that fj (x) = aj , j = 1 . . . m, (1.1.2)

where aj ∈ R, j = 1, . . . , m. Then we can consider the problem


m
minn (fj (x) − aj )2 ,
x∈R
j =1

perhaps even with some additional constraints on x. If the optimal value of the latter
problem is zero, we conclude that our initial problem (1.1.2) has a solution.
Note that in Nonlinear Analysis the problem (1.1.2) is almost universal. It covers
ordinary differential equations, partial differential equations, problems arising in
Game Theory, and many others. 
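To make the reformulation concrete, here is a small sketch in Python with a hypothetical two-variable system chosen only for illustration; plain gradient descent stands in for whatever minimization method one prefers:

```python
# Hypothetical system f_1(x) = x1^2 + x2 - 2 = 0, f_2(x) = x1 + x2^2 - 2 = 0,
# which has a solution at (1, 1).  We minimize the sum of squared residuals;
# a zero optimal value certifies that the original system (1.1.2) is solvable.
def phi(x1, x2):
    r1 = x1**2 + x2 - 2.0
    r2 = x1 + x2**2 - 2.0
    return r1**2 + r2**2

def grad_phi(x1, x2):
    r1 = x1**2 + x2 - 2.0
    r2 = x1 + x2**2 - 2.0
    # Chain rule: d(r^2)/dx = 2 r dr/dx, summed over both residuals.
    return (2*r1*2*x1 + 2*r2, 2*r1 + 2*r2*2*x2)

# Plain gradient descent with a fixed step (any minimization scheme would do).
x1, x2 = 0.0, 0.0
for _ in range(5000):
    g1, g2 = grad_phi(x1, x2)
    x1, x2 = x1 - 0.05 * g1, x2 - 0.05 * g2

print(round(phi(x1, x2), 8), round(x1, 3), round(x2, 3))  # → 0.0 1.0 1.0
```

Since the optimal value found is (numerically) zero, we conclude that this particular system has a solution, exactly as described above.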

Example 1.1.3 Sometimes our decision variables x (1) , . . . , x (n) must be integer.
This can be described by the following constraint:

sin(πx (i) ) = 0, i = 1 . . . n.

Thus, we can also treat integer optimization problems:

min f0 (x),
x∈Q

s.t. aj ≤ fj (x) ≤ bj , j = 1 . . . m,

sin(πx (i) ) = 0, i = 1 . . . n. 

Looking at these examples, we can easily understand the optimism of the


pioneers of Nonlinear Optimization, which can be easily seen in the papers of the
1950s and 1960s. Our first impression should be, of course, as follows:

Nonlinear Optimization is a very important and promising applica-


tion theory. It covers almost ALL needs of Operations Research and
Numerical Analysis.

However, by looking at the same list of examples, especially at Examples 1.1.2


and 1.1.3, a more experienced (or suspicious) reader could come to the following
conjecture.

In general, optimization problems should be UNSOLVABLE (?)

Indeed, from our real-life experience, it is difficult to believe in the existence of a


universal tool which is able to solve all problems in the world.
However, suspicions are not the legal instruments of science. It is a question of
personal taste how much we can trust them. Therefore, it was definitely one of the
most important events in Optimization Theory when, in the middle of the 1970s, this
conjecture was proved in a strict mathematical sense. This proof is so important and
simple that we cannot avoid it in our course. But first of all, we should introduce a
special language which is required for speaking about such things.

1.1.2 Performance of Numerical Methods

Let us imagine the following situation. We are going to solve a problem P , and we
know that there exist many different numerical methods for doing so. Of course,
we want to find a scheme which is the best for our P . However, it appears that
we are looking for something which does not exist. In fact, maybe it does, but it is
definitely not recommended to ask the winner for help. Indeed, consider a method
for solving problem (1.1.1), which does nothing except report that x ∗ = 0. Of
course, this method does not work properly for any problems except those which
have the optimal solution exactly at the origin, in which case the “performance” of
this method is unbeatable.
Hence, we cannot speak about the best method for a particular problem P , but
we can do so for a class of problems 𝒫 ∋ P . Indeed, numerical methods are usually
developed to solve many different problems with similar characteristics. Thus, the
performance of a method M on the whole class P can be a natural measure of its
efficiency.
Since we are going to speak about the performance of M on a class P, we
should assume that the method M does not have complete information about a
particular problem P .

The known (to a numerical scheme) “part” of problem P is called


the model of the problem.

We denote the model by Σ. Usually the model consists of the formulation of the
problem, description of classes of functional components, etc.
In order to recognize the problem P (and solve it), the method should be able
to collect specific information about P . It is convenient to describe the process of
collecting this data via the notion of an oracle. An oracle O is just a unit which
answers the successive questions of the methods. The method M is trying to solve
the problem P by collecting and handling the answers.
In general, each problem can be described by different models. Moreover, for
each problem we can develop different types of oracles. But let us fix Σ and O. In
this case, it is natural to define the performance of M on (Σ, O) as its performance
on the worst Pw from (Σ, O). Note that this Pw can be bad only for M .
Further, what is the performance of M on P ? Let us start from an intuitive
definition.

The performance of M on P is the total amount of computational


effort required by method M to solve the problem P .

In this definition, there are two additional notions to be specified. First of all, what
does “to solve the problem” mean? In some situations it could mean finding an exact
solution. However, in many areas of Numerical Analysis this is impossible (and in
Optimization this is definitely the case). Therefore, we accept a relaxed goal.

Solving the problem means finding an approximate solution to P


with some accuracy ε > 0.

Again, the meaning of the expression “with some accuracy ε > 0” is very important
for our definitions. However, it is too early to speak about this now. We just introduce
the notation T for a stopping criterion. Its meaning will always be clear for
particular problem classes. Now we have a formal description of the problem class:

P ≡ (Σ, O, T ).

In order to solve a problem P from P, we apply to it an iterative process, which


is a natural form of any method which works with an oracle.

General Iterative Scheme

Input: Starting point x0 and accuracy ε > 0.


Initialization. Set k = 0, I−1 = ∅. Here k is the iteration
counter and Ik is the accumulated informational set.

Main loop:
1. Call oracle O at point xk .
2. Update the informational set: Ik = Ik−1 ∪ {(xk , O(xk ))}.
3. Apply the rules of method M to Ik and generate a new point
xk+1 .
4. Check criterion T . If yes then form an output x̄. Otherwise set
k := k + 1 and go to Step 1.

(1.1.3)
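The scheme (1.1.3) can be sketched in code as follows. The names `oracle`, `rules` and `stop` are hypothetical placeholders for its three ingredients, and the toy instance below is only an illustration, not a method from this book:

```python
# A minimal sketch of the General Iterative Scheme (1.1.3).  `oracle(x)`
# returns the answer O(x_k), `rules` maps the accumulated informational
# set I_k to the next test point x_{k+1}, and `stop` implements the
# stopping criterion T.
def iterative_scheme(oracle, rules, stop, x0, max_iter=1000):
    info = []                      # I_{-1} is empty
    x = x0
    for _ in range(max_iter):
        answer = oracle(x)         # Step 1: call the oracle at x_k
        info.append((x, answer))   # Step 2: add (x_k, O(x_k)) to I_k
        x_next = rules(info)       # Step 3: generate x_{k+1}
        if stop(info):             # Step 4: check the criterion T
            break
        x = x_next
    # Form the output: the test point with the best recorded function value.
    return min(info, key=lambda pair: pair[1][0])[0]

# Toy instance: a first-order oracle for f(x) = (x - 0.7)^2 and a
# gradient-step rule; the scheme stops once the last gradient is small.
f, df = lambda x: (x - 0.7) ** 2, lambda x: 2.0 * (x - 0.7)
x_bar = iterative_scheme(
    oracle=lambda x: (f(x), df(x)),
    rules=lambda info: info[-1][0] - 0.25 * info[-1][1][1],
    stop=lambda info: abs(info[-1][1][1]) < 1e-6,
    x0=0.0)
print(round(x_bar, 4))  # → 0.7
```

The separation between the three callables mirrors the separation between oracle, method and stopping criterion in the text: the loop itself never looks inside the problem.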

Now we can specify the meaning of computational effort in our definition of


performance. In the scheme (1.1.3), we can see two potentially expensive steps. The
first one is Step 1, where we call the oracle. The second one is Step 3, where we form
the new test point. Thus, we can introduce two measures of complexity of problem
P for method M :

Analytical complexity: The number of calls of the oracle which is


necessary to solve problem P up to accuracy ε.
Arithmetical complexity: The total number of arithmetic operations
(including the work of oracle and work of method), which is
necessary for solving problem P up to accuracy ε.

Comparing the notions of analytical and arithmetical complexity, we can see that the
second one is more realistic. However usually, for a particular method M as applied
to problem P , arithmetical complexity can be easily obtained from the analytical
complexity and complexity of the oracle. Therefore, in Part I of this course we
speak mainly about bounds on the analytical complexity for some problem classes.
Arithmetical complexity will be treated in Part II, where we consider methods of
Structural Optimization.
There is one standard assumption on the oracle which allows us to obtain
the majority of results on analytical complexity for optimization schemes. This
assumption, called the Local Black Box Concept, is as follows.

Local Black Box

1. The only information available for the numerical scheme is the


answer of the oracle.
2. The oracle is local: A small variation of the problem far enough
from the test point x, which is compatible with the description of
the problem class, does not change the answer at x.

This concept is very useful in the complexity analysis. Of course, its first part looks
like an artificial wall between the method and the oracle. It seems natural to give
methods full access to the internal structure of the problem. However, we will see
that for problems with a complicated or implicit structure this access is almost
useless. For more simple problems it could help. We will see this in the second
part of this book.

To conclude the section, let us mention that the standard formulation (1.1.1)
is called a functional model of optimization problems. Usually, for such models
the standard assumptions are related to the level of smoothness of functional
components. According to the degree of smoothness we can apply different types of
oracle:
• Zero-order oracle: returns the function value f (x).
• First-order oracle: returns the function value f (x) and the gradient ∇f (x).
• Second-order oracle: returns f (x), ∇f (x), and the Hessian ∇ 2 f (x).
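For a toy one-dimensional function, the three types of oracle can be sketched as follows (the derivatives are computed by hand, and the example is purely illustrative):

```python
# Zero-, first- and second-order oracles for f(x) = x^2 + 3x.
f = lambda x: x**2 + 3*x

def zero_order(x):
    return f(x)                   # function value only

def first_order(x):
    return f(x), 2*x + 3          # value and gradient

def second_order(x):
    return f(x), 2*x + 3, 2.0     # value, gradient and (constant) Hessian

print(second_order(1.0))  # → (4.0, 5.0, 2.0)
```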

1.1.3 Complexity Bounds for Global Optimization

Let us try to apply the formal language of the previous section to a particular
problem class. Consider the following problem:

min_{x∈Bn } f (x).   (1.1.4)

In our terminology, this is a constrained minimization problem with no functional


constraints. The basic feasible set of this problem is Bn , an n-dimensional box in
Rn :

Bn = {x ∈ Rn | 0 ≤ x (i) ≤ 1, i = 1 . . . n}.

Let us measure distances in Rn by the ℓ∞ -norm:

‖x‖(∞) = max_{1≤i≤n} |x (i) |.

Assume that, with respect to this norm,

the objective function f (·) : Rn → R is Lipschitz continuous on


Bn :

| f (x) − f (y) | ≤ L ‖x − y‖(∞) ∀x, y ∈ Bn ,

with some constant L (Lipschitz constant).

(1.1.5)
Let us consider a very simple method for solving (1.1.4), which is called the
Uniform Grid Method. This method G (p) has one integer input parameter p ≥ 1.

Method G (p)

1. Form p^n points

xα = ((2i1 − 1)/(2p), (2i2 − 1)/(2p), . . . , (2in − 1)/(2p))^T ,

where α ≡ (i1 , . . . , in ) ∈ {1, . . . , p}n .

2. Among all points xα , find the point x̄ with the minimal value of
the objective function.

3. The pair (x̄, f (x̄)) is the output of the method.

(1.1.6)
Thus, this method forms a uniform grid of the test points inside the box Bn ,
computes the best value of the objective over this grid, and returns this value as
an approximate solution to problem (1.1.4). In our terminology, this is a zero-order
iterative method without any influence from the accumulated information on the
sequence of test points. Let us find its efficiency estimate.
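Before deriving it, here is a direct sketch of the scheme (1.1.6) in Python; the test function is an arbitrary ℓ∞-Lipschitz example with L = 2, not one taken from the book:

```python
import itertools

# A direct implementation of method G(p) from (1.1.6).  The test function
# f(x) = |x1 - 0.3| + |x2 - 0.6| is Lipschitz continuous in the l_inf-norm
# with constant L = 2, and its global minimal value over B_2 is f* = 0.
def grid_method(f, n, p):
    best_x, best_val = None, float("inf")
    for alpha in itertools.product(range(1, p + 1), repeat=n):  # p^n points
        x = tuple((2 * i - 1) / (2 * p) for i in alpha)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

f = lambda x: abs(x[0] - 0.3) + abs(x[1] - 0.6)
x_bar, f_bar = grid_method(f, n=2, p=50)
print(round(f_bar, 4))  # within the bound L/(2p) = 0.02 of Theorem 1.1.1 below
```

Note that the method indeed ignores all accumulated information: every one of the p^n oracle calls is made regardless of the answers to the previous ones.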
Theorem 1.1.1 Let f ∗ be a global optimal value of problem (1.1.4). Then

f (x̄) − f ∗ ≤ L/(2p).

Proof For a multi-index α = (i1 , . . . , in ), define

Xα = {x ∈ Rn : ‖x − xα ‖(∞) ≤ 1/(2p)}.

Clearly, ∪_{α∈{1,...,p}^n} Xα = Bn .
Let x ∗ be a global solution of our problem. Then there exists a multi-index α ∗
such that x ∗ ∈ Xα ∗ . Note that ‖x ∗ − xα ∗ ‖(∞) ≤ 1/(2p). Therefore,

f (x̄) − f (x ∗ ) ≤ f (xα ∗ ) − f (x ∗ ) ≤ L/(2p), where the last inequality follows from (1.1.5). 


Let us conclude with the definition of our problem class. We fix our goal as
follows:

Find x̄ ∈ Bn : f (x̄) − f ∗ ≤ ε. (1.1.7)

Then we immediately get the following result.


Corollary 1.1.1 The analytical complexity of problem class (1.1.4), (1.1.5), (1.1.7)
for method G is at most

A (G ) = (⌊L/(2ε)⌋ + 1)^n ,

(here and in the sequel, ⌊a⌋ is the integer part of a ∈ R).


Proof Take p = ⌊L/(2ε)⌋ + 1. Then p ≥ L/(2ε), and, in view of Theorem 1.1.1, we have
f (x̄) − f ∗ ≤ L/(2p) ≤ ε. Note that we need to call the oracle at p^n points. 
L

Thus, A (G ) justifies an upper complexity bound for our problem class.
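To get a feeling for this bound, we can evaluate it for some sample values of L, ε and n. The values below are illustrative, not prescribed by the book (they are chosen so that L/(2ε) is exact in floating-point arithmetic); they show how quickly the estimate grows with the dimension:

```python
# The upper bound of Corollary 1.1.1: A(G) = (floor(L/(2*eps)) + 1)^n,
# evaluated for illustrative values L = 2, eps = 0.0625.
def upper_bound(L, eps, n):
    return (int(L / (2 * eps)) + 1) ** n

for n in (1, 2, 5, 10):
    print(n, upper_bound(2.0, 0.0625, n))
# The number of oracle calls grows exponentially with the dimension n:
# 1 17
# 2 289
# 5 1419857
# 10 2015993900449
```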
This result is quite informative. However, we still have some questions. Firstly,
it may happen that our proof is too rough and the real performance of method G (p)
is much better. Secondly, we still cannot be sure that G (p) is a reasonable method
for solving (1.1.4). There could exist other schemes with much higher performance.
In order to answer these questions, we need to derive lower complexity bounds
for the problem class (1.1.4), (1.1.5), (1.1.7). The main features of such bounds are
as follows.
• They are based on the Black Box Concept.
• These bounds are valid for all reasonable iterative schemes. Thus, they provide
us with a lower estimate for the analytical complexity of the problem class.
• Very often such bounds employ the idea of a resisting oracle.
For us, only the concept of a resisting oracle is new. Therefore, let us present it
in more detail.
A resisting oracle tries to create the worst possible problem for each particular
method. It starts from an “empty” function and it tries to answer each call of
the method in the worst possible way. However, the answers must be compatible
with the previous answers and with the description of the problem class. Then, after
termination of the method it is possible to reconstruct a problem which perfectly
fits the final informational set accumulated by the algorithm. Moreover, if we run
the method on this newborn problem, it will reproduce the same sequence of test
points since it will have the same sequence of answers from the oracle.
Let us show how this works for problem (1.1.4). Consider the class of problems
P∞ defined as follows.
Model :  min_{x∈Bn} f(x), where f(·) is ℓ∞-Lipschitz continuous on Bn.

Oracle :  Zero-order Local Black Box.

Approximate solution :  Find x̄ ∈ Bn : f(x̄) − f∗ < ε.

Theorem 1.1.2 For ε < ½L, the analytical complexity of the problem class P∞ is at
least ⌊L/(2ε)⌋^n calls of the oracle.

Proof Let p = ⌊L/(2ε)⌋ (≥ 1). Assume that there exists a method which needs N <
p^n calls of the oracle to solve any problem from P∞. Let us apply this method to the
following resisting strategy:

Return f(x) = 0 at any test point x.

Therefore this method can find only a point x̄ ∈ Bn with f(x̄) = 0.

However, since N < p^n, there exists a multi-index α̂ such that there were no
test points in the box Xα̂ (see the notation of Theorem 1.1.1). Define x∗ = xα̂, and
consider the function

f̄(x) = min{ 0, L‖x − x∗‖∞ − ε }.

Clearly, this function is ℓ∞-Lipschitz continuous with constant L, and its global
optimal value is −ε. Moreover, f̄(·) differs from zero only inside the box Xα̂. Thus,
f̄(·) is equal to zero at all test points of our method.

Since the accuracy of the output of our method is ε, we come to the following
conclusion:

If the number of calls of the oracle is less than p^n, then the accuracy of the result cannot be
better than ε.

Thus, the desired statement is proved. □
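The construction in this proof is easy to check numerically. The sketch below uses hypothetical parameters (the values of L, ε, n and the choice of the unvisited box are ours): it builds f̄ and confirms that its global minimal value is −ε while it vanishes outside the box Xα̂, consistently with the zero answers of the resisting oracle.

```python
# Hypothetical parameters (not from the text): L = 2, eps = 0.25, n = 2.
L, eps, n = 2.0, 0.25, 2
p = int(L / (2 * eps))                    # p = 4, so a p**n box partition

# Suppose the box with multi-index (1, 1) received no test points;
# its center is x_star = ((2*1 - 1)/(2p), ..., (2*1 - 1)/(2p)).
x_star = ((2 * 1 - 1) / (2 * p),) * n     # = (0.125, 0.125)

def f_bar(x):
    """Worst-case function: min{0, L*||x - x_star||_inf - eps}."""
    dist = max(abs(x[i] - x_star[i]) for i in range(n))
    return min(0.0, L * dist - eps)

# f_bar is zero whenever ||x - x_star||_inf >= eps/L = 1/(2p),
# i.e. everywhere outside the unvisited box, and equals -eps at x_star.
```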
Now we can say much more about the performance of the Uniform Grid Method.
Let us compare its efficiency estimate with the lower bound:

G :  ( ⌊L/(2ε)⌋ + 1 )^n    ⇔    Lower bound :  ⌊L/(2ε)⌋^n .
If ε ≤ O(L/n), then the lower and upper bounds coincide up to an absolute constant
multiplicative factor. This means that, for such a level of accuracy, G(·) is optimal for
the problem class P∞.
At the same time, Theorem 1.1.2 supports our initial claim that the general
optimization problems are unsolvable. Let us look at the following illustrative
example.
Example 1.1.4 Consider the problem class P∞ defined by the following parame-
ters:

L = 2,  n = 10,  ε = 0.01.
Note that the size of these problems is very small and we ask only for a moderate
1% accuracy.
The lower complexity bound for this class is ⌊L/(2ε)⌋^n calls of the oracle. Let us
compute this value for our example.

Lower bound : 10^20 calls of the oracle
Oracle complexity : at least n arithmetic operations (a.o.)
Total complexity : 10^21 a.o.
Processor performance : 10^6 a.o. per second
Total time : 10^15 s
One year : less than 3.2 · 10^7 s

We need : 31,250,000 years
This estimate is so disappointing that we cannot maintain any hope that such
problems may become solvable in the future. Let us just play with the parameters of
the problem class.
• If we change n to n + 1, then the estimate is multiplied by one hundred. Thus,
for n = 11 our lower bound is valid for a much more powerful computer.
• On the contrary, if we multiply ε by two, we reduce the complexity by a factor
of a thousand. For example, if ε = 8%, then we need only two weeks.² □
2 We keep this calculation unchanged from the first version of this book [39]. In this example,
the processor performance corresponds to a Sun Station, which was the most powerful personal
computer at the beginning of the 1990s. Now, after twenty-five years of intensive progress in the
abilities of hardware, modern personal computers have reached a speed level of 10^8 a.o. per second.
Thus indeed, our time estimate remains valid for n = 11.
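The arithmetic of Example 1.1.4 is easy to reproduce (the constants 10^6 a.o. per second and 3.2 · 10^7 s per year are taken from the table above):

```python
# Reproducing the estimates of Example 1.1.4.
L, n, eps = 2, 10, 0.01

calls = round(L / (2 * eps)) ** n     # lower bound: (L/(2*eps))^n oracle calls
ops = calls * n                       # at least n a.o. per oracle call
seconds = ops / 10**6                 # at 10^6 a.o. per second
years = seconds / (3.2 * 10**7)       # one year < 3.2 * 10^7 s
# years comes out near 3.1 * 10^7, i.e. the 31,250,000 years of the text
```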
We should note that the lower complexity bounds for problems with smooth
functions, or for high-order methods, are not much better than the bound of
Theorem 1.1.2. This can be proved using the same arguments and we leave the
proof as an exercise for the reader. Comparison of the above results with the upper
bounds for NP-hard problems, which are considered as classical examples of very
difficult problems in Combinatorial Optimization, is also quite disappointing. To
find the exact solution, the hardest combinatorial problems need only 2^n a.o.!
To conclude this section, let us compare our observations with some other fields
of Numerical Analysis. It is well known that the uniform grid approach is a standard
tool in many domains. For example, if we need to compute numerically the value of
the integral of a univariate function

S = ∫_0^1 f(x) dx,

the standard way to proceed is to form the discrete sum

SN = (1/N) Σ_{i=1}^{N} f(xi),   xi = i/N,   i = 1, …, N.

If f(·) is Lipschitz continuous with constant L, then this value is a good
approximation to S:

N = L/ε  ⇒  |S − SN| ≤ ε.
Note that in our terminology this is exactly a uniform grid approach. Moreover,
this is a standard way for approximating integrals. The reason why it works here is
related to the dimension of the problem. For integration, the standard dimensions
are very small (up to three). However, in Optimization, sometimes we need to solve
problems with several million variables.
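A minimal sketch of this quadrature rule (the integrand |x − 0.3| is a hypothetical example with L = 1):

```python
def riemann_sum(f, N):
    """S_N = (1/N) * sum_{i=1}^{N} f(i/N), the uniform grid sum."""
    return sum(f(i / N) for i in range(1, N + 1)) / N

# f(x) = |x - 0.3| is Lipschitz continuous with L = 1, and its
# exact integral over [0, 1] is 0.3**2/2 + 0.7**2/2 = 0.29.
f = lambda x: abs(x - 0.3)
eps = 1e-3
N = round(1 / eps)                  # N = L/eps grid points
S_N = riemann_sum(f, N)             # then |S - S_N| <= eps
```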

1.1.4 Identity Cards of the Fields

After the pessimistic results of the previous section, we should try to find a
reasonable target in the theoretical analysis of optimization schemes. It seems that
everything is clear with general Global Optimization. However, maybe the goals
of this field are too ambitious? In some practical problems could we be satisfied
by much less “optimal” solutions? Or, are there some interesting problem classes
which are not as dangerous as the class of general continuous functions?
In fact, each of these questions can be answered in different ways, each of which
defines the style of research (or rules of the game) in different fields of Nonlinear
Optimization. If we try to classify these fields, we can easily see that they differ one
from another in the following aspects:
• Goals of the methods.
• Classes of functional components.
• Description of the oracle.
These aspects naturally define the list of desired properties of the optimization
methods. Let us present the “identity cards” of the fields which we are going to
consider in this book.

1. General Global Optimization (Sect. 1.1)

• Goals: Find a global minimum.


• Functional class: Continuous functions.
• Oracle: 0–1–2 order Black Box.
• Desired properties: Convergence to a global minimum.
• Features: From a theoretical point of view, this game is too short.
• Problem sizes: Sometimes, we can solve problems with many
variables. No guarantee of success even for small problems.
• History: Starts from 1955. Several local peaks of interest related
to new heuristic ideas (simulated annealing, genetic algorithms).

2. General Nonlinear Optimization (Sects. 1.2, 1.3)

• Goals: Find a local minimum.


• Functional class: Differentiable functions.
• Oracle: First- and second-order Black Box.
• Desired properties: Fast convergence to a local minimum.
• Features: Variability of approaches. Most widespread software.
The goals are not always acceptable and reachable.
• Problem sizes: Up to several thousand variables.
• History: Starts from 1955. Peak period: 1965 – 1985. Theoretical
activity now is rather low.
3. Black Box Convex Optimization (Chaps. 2, 3, and 4)

• Goals: Find a global minimum.


• Functional class: Convex sets and functions.
• Oracle: First- and second-order Black Box.
• Desired properties: Convergence to a global minimum. The rate
of convergence may depend on dimension.
• Features: Very interesting and rich complexity theory. Efficient
practical methods. The problem class is sometimes restrictive.
• Problem sizes: Several thousand variables for the second-order
methods, and several million for the first-order schemes.
• History: Starts from 1970. Peak period: 1975–1985. Theoretical
activity now is high due to the interest in Structural Optimization
and global complexity analysis of second-order methods (2006).

4. Structural Optimization (Part II)

• Goals: Find a global minimum.


• Functional class: Simple convex sets and functions with explicit
minimax structure.
• Oracle: Second-order Black Box for special barrier functions
(Chap. 5), and modified first-order Black Box (Chaps. 6, 7).
• Desired properties: Fast convergence to a global minimum. The
rate of convergence depends on the structure of the problem.
• Features: A very new and promising theory rejecting the Black
Box Concept. The problem class is practically the same as in
Convex Optimization.
• Problem sizes: Sometimes up to several million variables.
• History: Starts from 1984. Peak period: 1990–2000 for Interior-
Point Methods. The first accelerated first-order method for
problems with explicit structure was developed in 2005. Very
high theoretical activity right now.

1.2 Local Methods in Unconstrained Minimization

(Relaxation and approximation; Necessary optimality conditions; Sufficient optimality


conditions; The class of differentiable functions; The class of twice differentiable functions;
The Gradient Method; Rate of convergence; Newton’s Method.)
1.2.1 Relaxation and Approximation

The simplest goal in general Nonlinear Optimization consists in finding a local


minimum of a differentiable function. However, even to reach such a restricted goal,
it is necessary to follow some special principles which guarantee convergence of the
minimization process.
The majority of methods in general Nonlinear Optimization are based on the idea
of relaxation.

A sequence of real numbers {ak}_{k=0}^∞ is called a relaxation sequence if

ak+1 ≤ ak,  ∀ k ≥ 0.

In this section we consider several methods for solving the following unconstrained
minimization problem:

min_{x∈Rn} f(x),    (1.2.1)

where f(·) is a smooth function. In order to do so, these methods generate a
relaxation sequence of function values {f(xk)}_{k=0}^∞:

f(xk+1) ≤ f(xk),  k = 0, 1, … .

This rule has the following important advantages.


1. If f(·) is bounded below on Rn, then the sequence {f(xk)}_{k=0}^∞ converges.
2. In any case, we improve the initial value of the objective function.
However, it is impossible to implement the idea of relaxation without employing
another fundamental element of Numerical Analysis, approximation. In general,

To approximate means to replace an initial complex object by a
simpler one which is close to the original in terms of its properties.

In Nonlinear Optimization, we usually apply local approximations based on deriva-


tives of nonlinear functions. These are the first- and second-order approximations
(or, the linear and quadratic approximations).
Let the function f (·) be differentiable at x̄ ∈ Rn . Then, for any y ∈ Rn we have

f(y) = f(x̄) + ⟨∇f(x̄), y − x̄⟩ + o(‖y − x̄‖),


where o(·) : [0, ∞) → R is a function of r ≥ 0 satisfying the conditions

lim_{r↓0} (1/r) o(r) = 0,   o(0) = 0.

In the remaining part of this chapter, unless stated otherwise, we use the notation
 ·  for the standard Euclidean norm in Rn :
‖x‖ = ( Σ_{i=1}^{n} (x^(i))² )^{1/2} = (x^T x)^{1/2} = ⟨x, x⟩^{1/2},

where ·, · is the standard inner product in the corresponding coordinate space.
Note that for any x ∈ Rn , y ∈ Rm , and matrix A ∈ Rm×n we have

⟨Ax, y⟩ ≡ ⟨x, A^T y⟩.    (1.2.2)

The linear function f(x̄) + ⟨∇f(x̄), y − x̄⟩ is called the linear approximation
of f at x̄. Recall that the vector ∇f(x̄) is called the gradient of the function f at x̄.
Considering the points yi = x̄ + εei, where ei is the ith coordinate vector in Rn,
and taking the limit as ε → 0, we obtain the following coordinate representation of
the gradient:

∇f(x̄) = ( ∂f(x̄)/∂x^(1), …, ∂f(x̄)/∂x^(n) )^T.    (1.2.3)
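In computations, the same limit underlies finite-difference approximations of the gradient. A small sketch (the test function is hypothetical, and forward differences are just one simple choice):

```python
def fd_gradient(f, x, eps=1e-6):
    """Forward-difference approximation of (1.2.3): the i-th
    component is (f(x + eps*e_i) - f(x)) / eps for small eps."""
    fx = f(x)
    g = []
    for i in range(len(x)):
        y = list(x)
        y[i] += eps
        g.append((f(y) - fx) / eps)
    return g

# Hypothetical test function: f(x) = (x1)^2 + 3*x1*x2,
# with analytic gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
g = fd_gradient(f, [1.0, 2.0])      # should be close to (8, 3)
```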

Let us mention two important properties of the gradient. Denote by Lf (α) the
(sub)level set of f (·):

Lf (α) = {x ∈ Rn | f (x) ≤ α}.

Consider the set of directions that are tangent to Lf (f (x̄)) at x̄:


 
Sf(x̄) = { s ∈ Rn | s = lim_{k→∞} (yk − x̄)/‖yk − x̄‖, for some {yk} → x̄ with f(yk) = f(x̄) ∀k }.

Lemma 1.2.1 If s ∈ Sf(x̄), then ⟨∇f(x̄), s⟩ = 0.

Proof Since f(yk) = f(x̄), we have

f(yk) = f(x̄) + ⟨∇f(x̄), yk − x̄⟩ + o(‖yk − x̄‖) = f(x̄).

Therefore ⟨∇f(x̄), yk − x̄⟩ + o(‖yk − x̄‖) = 0. Dividing this equation by ‖yk − x̄‖
and taking the limit as yk → x̄, we obtain the result. □
Let s be a direction in Rn, ‖s‖ = 1. Consider the local decrease of the function
f(·) along the direction s:

Δ(s) = lim_{α↓0} (1/α)[ f(x̄ + αs) − f(x̄) ].

Note that f(x̄ + αs) − f(x̄) = α⟨∇f(x̄), s⟩ + o(α). Therefore

Δ(s) = ⟨∇f(x̄), s⟩.

Using the Cauchy–Schwarz inequality,

−‖x‖ · ‖y‖ ≤ ⟨x, y⟩ ≤ ‖x‖ · ‖y‖,

we obtain Δ(s) = ⟨∇f(x̄), s⟩ ≥ −‖∇f(x̄)‖. Let us take

s̄ = −∇f(x̄)/‖∇f(x̄)‖.

Then

Δ(s̄) = −⟨∇f(x̄), ∇f(x̄)⟩/‖∇f(x̄)‖ = −‖∇f(x̄)‖.

Thus, the direction −∇f (x̄) (the antigradient) is the direction of the fastest local
decrease of the function f (·) at point x̄.
The next statement is probably the most fundamental fact in Optimization
Theory.
Theorem 1.2.1 (First-Order Optimality Condition) Let x ∗ be a local minimum
of a differentiable function f (·). Then

∇f (x ∗ ) = 0. (1.2.4)

Proof Since x∗ is a local minimum of f(·), there exists an r > 0 such that for all
y ∈ Rn with ‖y − x∗‖ ≤ r, we have f(y) ≥ f(x∗). Since f is differentiable, this
implies that

f(y) = f(x∗) + ⟨∇f(x∗), y − x∗⟩ + o(‖y − x∗‖) ≥ f(x∗).

Thus, for all s ∈ Rn, we have ⟨∇f(x∗), s⟩ ≥ 0. By taking s = −∇f(x∗), we get
−‖∇f(x∗)‖² ≥ 0. Hence, ∇f(x∗) = 0. □
In what follows, the notation B ⪰ 0, where B is a symmetric (n × n)-matrix,
means that B is positive semidefinite:

⟨Bx, x⟩ ≥ 0,  ∀ x ∈ Rn.
The notation B ≻ 0 means that B is positive definite (in this case, the inequality
above must be strict for all x ≠ 0).
Corollary 1.2.1 Let x∗ be a local minimum of a differentiable function f(·) subject
to the linear equality constraints

x ∈ L ≡ { x ∈ Rn | Ax = b } ≠ ∅,

where A is an (m × n)-matrix with full row rank, and b ∈ Rm, m < n. Then there
exists a vector of multipliers λ∗ ∈ Rm such that

∇f(x∗) = A^T λ∗.    (1.2.5)

Proof Let us assume that ∇f(x∗) ≠ 0. Consider the following optimization
problem:

g∗ = min_{λ∈Rm} { g(λ) = ½ ‖∇f(x∗) − A^T λ‖² }.    (1.2.6)

Assume that g∗ > 0. Note that

g(λ) = ½‖∇f(x∗)‖² − ⟨∇f(x∗), A^T λ⟩ + ½⟨Bλ, λ⟩,

where B = AA^T ⪰ λmin(B) Im and λmin(B) > 0 denotes the smallest eigenvalue
of the matrix B (recall that A has full row rank). Hence, the level sets of this function
are bounded, and therefore problem (1.2.6) has a solution λ∗ satisfying the first-order
optimality condition (1.2.4):

0 = ∇g(λ∗) = Bλ∗ − A∇f(x∗).

Thus, λ∗ = B^{−1}A∇f(x∗). Let s∗ = (In − A^T B^{−1}A)∇f(x∗). Note that As∗ = 0.
Then,

⟨∇f(x∗), s∗⟩ = ‖∇f(x∗)‖² − ⟨B^{−1}A∇f(x∗), A∇f(x∗)⟩ = 2g∗ > 0.

Therefore, the value of the objective function can be reduced along the feasible ray
{x∗ − αs∗ : α ≥ 0}. This contradicts the local optimality of x∗, and hence g∗ = 0,
which gives (1.2.5). □

Note that we have proved only a necessary condition for a local minimum. The
points satisfying this condition are called the stationary points of the function f . In
order to see that such points are not always local minima, it is enough to look at the
function f(x) = x³, x ∈ R, at the point x = 0.
Now let us introduce second-order approximation. Let the function f (·) be twice
differentiable at x̄. Then
f(y) = f(x̄) + ⟨∇f(x̄), y − x̄⟩ + ½⟨∇²f(x̄)(y − x̄), y − x̄⟩ + o(‖y − x̄‖²).
The quadratic function

f(x̄) + ⟨∇f(x̄), y − x̄⟩ + ½⟨∇²f(x̄)(y − x̄), y − x̄⟩
is called the quadratic (or second-order) approximation of the function f at x̄.
Recall that ∇ 2 f (x̄) is an (n × n)-matrix with the following entries:

(∇²f(x̄))^(i,j) = ∂²f(x̄) / (∂x^(i) ∂x^(j)),   i, j = 1, …, n.

It is called the Hessian of function f at x̄. Note that the Hessian is a symmetric
matrix:
∇²f(x̄) = ( ∇²f(x̄) )^T.

The Hessian can be regarded as a derivative of the vector function ∇f (·):

∇f(y) = ∇f(x̄) + ∇²f(x̄)(y − x̄) + o(‖y − x̄‖) ∈ Rn,    (1.2.7)

where o(·) : [0, ∞) → Rn is a continuous vector function satisfying the condition

lim_{r↓0} (1/r) ‖o(r)‖ = 0.

Using the second-order approximation, we can write down the second-order


optimality conditions.
Theorem 1.2.2 (Second-Order Optimality Condition) Let x ∗ be a local mini-
mum of a twice differentiable function f (·). Then

∇f(x∗) = 0,  ∇²f(x∗) ⪰ 0.

Proof Since x∗ is a local minimum of the function f(·), there exists an r > 0 such
that for all y with ‖y − x∗‖ ≤ r, we have

f(y) ≥ f(x∗).

In view of Theorem 1.2.1, ∇f(x∗) = 0. Therefore, for any such y,

f(y) = f(x∗) + ½⟨∇²f(x∗)(y − x∗), y − x∗⟩ + o(‖y − x∗‖²) ≥ f(x∗).

Thus, ⟨∇²f(x∗)s, s⟩ ≥ 0 for all s with ‖s‖ = 1. □
Again, the above theorem is a necessary (second-order) characteristic of a local
minimum. Let us prove now a sufficient condition.
Theorem 1.2.3 Let a function f(·) be twice differentiable on Rn, and let x∗ ∈ Rn
satisfy the following conditions:

∇f(x∗) = 0,  ∇²f(x∗) ≻ 0.

Then x∗ is a strict local minimum of f(·).

Proof Note that in a small neighborhood of the point x∗ the function f(·) can be
represented as

f(y) = f(x∗) + ½⟨∇²f(x∗)(y − x∗), y − x∗⟩ + o(‖y − x∗‖²).

Since o(r²)/r² → 0 as r ↓ 0, there exists a value r̄ > 0 such that for all r ∈ [0, r̄] we
have

| o(r²) | ≤ (r²/4) λmin(∇²f(x∗)).

In view of our assumption, this eigenvalue is positive. Therefore, for any y ∈ Rn
with 0 < ‖y − x∗‖ ≤ r̄, we have

f(y) ≥ f(x∗) + ½ λmin(∇²f(x∗)) ‖y − x∗‖² + o(‖y − x∗‖²)

≥ f(x∗) + ¼ λmin(∇²f(x∗)) ‖y − x∗‖² > f(x∗). □

1.2.2 Classes of Differentiable Functions

It is well known that any continuous function can be approximated by a smooth


function with arbitrarily small accuracy. Therefore, assuming only differentiability
of the objective function, we cannot ensure any reasonable properties of minimiza-
tion processes. For that, we need to impose some additional assumptions on the
magnitude of some derivatives. Traditionally, in Optimization such assumptions are
presented in the form of a Lipschitz condition for a derivative of certain degree.
Let Q be a subset of Rn. We denote by C_L^{k,p}(Q) the class of functions with the
following properties:

• Any f ∈ C_L^{k,p}(Q) is k times continuously differentiable on Q.
• Its pth derivative is Lipschitz continuous on Q with the constant L:

‖∇^p f(x) − ∇^p f(y)‖ ≤ L ‖x − y‖

for all x, y ∈ Q. In this book, we usually work with p = 1 and p = 2.


Clearly, we always have p ≤ k. If q ≥ k, then C_L^{q,p}(Q) ⊆ C_L^{k,p}(Q). For example,
C_L^{2,1}(Q) ⊆ C_L^{1,1}(Q). Note also that these classes possess the following property:

If f1 ∈ C_{L1}^{k,p}(Q), f2 ∈ C_{L2}^{k,p}(Q) and α1, α2 ∈ R, then for

L3 = |α1| L1 + |α2| L2

we have α1 f1 + α2 f2 ∈ C_{L3}^{k,p}(Q).
We use the notation f ∈ C^k(Q) for a function f which is k times continuously
differentiable on Q.

One of the most important classes of differentiable functions is C_L^{1,1}(Rn), the
class of functions with Lipschitz continuous gradient. By definition, the inclusion
f ∈ C_L^{1,1}(Rn) means that

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖    (1.2.8)

for all x, y ∈ Rn . Let us give a sufficient condition for this inclusion.


Lemma 1.2.2 A function f(·) belongs to the class C_L^{2,1}(Rn) ⊂ C_L^{1,1}(Rn) if and
only if for all x ∈ Rn we have

‖∇²f(x)‖ ≤ L.    (1.2.9)

Proof Indeed, for any x, y ∈ Rn we have

∇f(y) = ∇f(x) + ∫_0^1 ∇²f(x + τ(y − x))(y − x) dτ

= ∇f(x) + ( ∫_0^1 ∇²f(x + τ(y − x)) dτ ) · (y − x).

Therefore, if condition (1.2.9) is satisfied, then

‖∇f(y) − ∇f(x)‖ = ‖ ( ∫_0^1 ∇²f(x + τ(y − x)) dτ ) · (y − x) ‖

≤ ‖ ∫_0^1 ∇²f(x + τ(y − x)) dτ ‖ · ‖y − x‖

≤ ∫_0^1 ‖∇²f(x + τ(y − x))‖ dτ · ‖y − x‖

≤ L ‖y − x‖.

On the other hand, if f ∈ C_L^{2,1}(Rn), then for any s ∈ Rn and α > 0, we have

‖ ( ∫_0^α ∇²f(x + τs) dτ ) · s ‖ = ‖∇f(x + αs) − ∇f(x)‖ ≤ αL ‖s‖.

Dividing this inequality by α and taking the limit as α ↓ 0, we obtain (1.2.9). □



Note that the condition (1.2.9) can be written in the form of a matrix inequality:

−L In ⪯ ∇²f(x) ⪯ L In,  ∀ x ∈ Rn.    (1.2.10)

Lemma 1.2.2 provides us with many examples of functions with Lipschitz


continuous gradient.
Example 1.2.1

1. The linear function f(x) = α + ⟨a, x⟩ belongs to C_0^{1,1}(Rn), since

∇f(x) = a,  ∇²f(x) = 0.

2. For the quadratic function f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ with A = A^T, we have

∇f(x) = a + Ax,  ∇²f(x) = A.

Therefore f(·) ∈ C_L^{1,1}(Rn) with L = ‖A‖.

3. Consider the function of one variable f(x) = (1 + x²)^{1/2}, x ∈ R. We have

∇f(x) = x/(1 + x²)^{1/2},  ‖∇²f(x)‖ = 1/(1 + x²)^{3/2} ≤ 1.

Therefore, f(·) ∈ C_1^{1,1}(R). □

The next statement is important for the geometric interpretation of functions in
CL1,1 (Rn ).

Lemma 1.2.3 Let f ∈ C_L^{1,1}(Rn). Then, for any x, y ∈ Rn, we have

| f(y) − f(x) − ⟨∇f(x), y − x⟩ | ≤ (L/2) ‖y − x‖².    (1.2.11)

Proof For all x, y ∈ Rn, we have

f(y) = f(x) + ∫_0^1 ⟨∇f(x + τ(y − x)), y − x⟩ dτ

= f(x) + ⟨∇f(x), y − x⟩ + ∫_0^1 ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ dτ.

Therefore,

| f(y) − f(x) − ⟨∇f(x), y − x⟩ |

= | ∫_0^1 ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ dτ |

≤ ∫_0^1 | ⟨∇f(x + τ(y − x)) − ∇f(x), y − x⟩ | dτ

≤ ∫_0^1 ‖∇f(x + τ(y − x)) − ∇f(x)‖ · ‖y − x‖ dτ

≤ ∫_0^1 τ L ‖y − x‖² dτ = (L/2) ‖y − x‖². □
Geometrically, we have the following picture. Consider a function f ∈ C_L^{1,1}(Rn).
Let us fix a point x0 ∈ Rn, and define two quadratic functions

φ1(x) = f(x0) + ⟨∇f(x0), x − x0⟩ − (L/2) ‖x − x0‖²,

φ2(x) = f(x0) + ⟨∇f(x0), x − x0⟩ + (L/2) ‖x − x0‖².

Then the graph of the function f lies between the graphs of φ1 and φ2:

φ1(x) ≤ f(x) ≤ φ2(x),  ∀ x ∈ Rn.
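This sandwich can be checked numerically. Below we use f(x) = √(1 + x²) from Example 1.2.1 (so L = 1); the choice of x0 and of the grid of test points is an arbitrary illustration.

```python
import math

f = lambda x: math.sqrt(1 + x * x)          # f in C_1^{1,1}(R), L = 1
df = lambda x: x / math.sqrt(1 + x * x)     # its derivative
L, x0 = 1.0, 0.5                            # x0 is an arbitrary fixed point

phi1 = lambda x: f(x0) + df(x0) * (x - x0) - L / 2 * (x - x0) ** 2
phi2 = lambda x: f(x0) + df(x0) * (x - x0) + L / 2 * (x - x0) ** 2

# Check phi1 <= f <= phi2 on a grid around x0 (equality holds at x0).
sandwich_holds = all(
    phi1(x) <= f(x) <= phi2(x)
    for x in [x0 + 0.1 * k for k in range(-100, 101)]
)
```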

Let us prove similar results for the class of twice differentiable functions. The
main class of functions of this type is C_M^{2,2}(Rn), the class of twice differentiable
functions with Lipschitz continuous Hessian. Recall that for f ∈ C_M^{2,2}(Rn), we
have

‖∇²f(x) − ∇²f(y)‖ ≤ M ‖x − y‖,  ∀ x, y ∈ Rn.    (1.2.12)


Lemma 1.2.4 Let f ∈ C_M^{2,2}(Rn). Then for all x, y ∈ Rn we have

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖ ≤ (M/2) ‖y − x‖²,    (1.2.13)

| f(y) − f(x) − ⟨∇f(x), y − x⟩ − ½⟨∇²f(x)(y − x), y − x⟩ | ≤ (M/6) ‖y − x‖³.    (1.2.14)
Proof Let us fix some x, y ∈ Rn. Then

∇f(y) = ∇f(x) + ∫_0^1 ∇²f(x + τ(y − x))(y − x) dτ

= ∇f(x) + ∇²f(x)(y − x) + ∫_0^1 ( ∇²f(x + τ(y − x)) − ∇²f(x) )(y − x) dτ.

Therefore,

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖

= ‖ ∫_0^1 ( ∇²f(x + τ(y − x)) − ∇²f(x) )(y − x) dτ ‖

≤ ∫_0^1 ‖( ∇²f(x + τ(y − x)) − ∇²f(x) )(y − x)‖ dτ

≤ ∫_0^1 ‖∇²f(x + τ(y − x)) − ∇²f(x)‖ · ‖y − x‖ dτ

≤ ∫_0^1 τ M ‖y − x‖² dτ = (M/2) ‖y − x‖².

Inequality (1.2.14) can be proved in a similar way. □
Corollary 1.2.2 Let f ∈ C_M^{2,2}(Rn) and x, y ∈ Rn with ‖y − x‖ = r. Then

∇²f(x) − Mr In ⪯ ∇²f(y) ⪯ ∇²f(x) + Mr In.

(Recall that for matrices A and B we write A ⪰ B if A − B ⪰ 0.)

Proof Let G = ∇²f(y) − ∇²f(x). Since f ∈ C_M^{2,2}(Rn), we have ‖G‖ ≤ Mr. This
means that the eigenvalues λi(G) of the symmetric matrix G satisfy the following
inequality:

| λi(G) | ≤ Mr,  i = 1, …, n.

Hence, −Mr In ⪯ G ≡ ∇²f(y) − ∇²f(x) ⪯ Mr In. □
1.2.3 The Gradient Method

Now we are ready to study the rate of convergence of unconstrained minimization


schemes. Let us start with the simplest method. As we have already seen, the
antigradient is the direction of locally steepest descent of a differentiable function.
Since we are going to find a local minimum, the following strategy is the first to be
tried.

Gradient Method

Choose x0 ∈ Rn.
Iterate xk+1 = xk − hk∇f(xk),  k = 0, 1, … .    (1.2.15)

We will refer to this scheme as the Gradient Method. The scalar factors for the
gradients, hk , are called the step sizes. Of course, they must be positive.
There are many variants of this method, which differ one from another by the
step-size strategy. Let us consider the most important examples.
1. The sequence {hk}_{k=0}^∞ is chosen in advance. For example,

   hk = h > 0  (constant step),  or  hk = h/√(k+1).

2. Full relaxation:

   hk = arg min_{h≥0} f(xk − h∇f(xk)).

3. The Armijo rule: Find xk+1 = xk − h∇f(xk) with h > 0 such that

   α⟨∇f(xk), xk − xk+1⟩ ≤ f(xk) − f(xk+1),    (1.2.16)

   β⟨∇f(xk), xk − xk+1⟩ ≥ f(xk) − f(xk+1),    (1.2.17)

where 0 < α < β < 1 are some fixed parameters.
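A minimal sketch of a step-size search satisfying (1.2.16)–(1.2.17): the doubling/bisection strategy below is one simple choice, not the book's procedure, and the quadratic test function is hypothetical.

```python
def armijo_step(f, grad, x, alpha=0.25, beta=0.75, h0=1.0):
    """Search for h > 0 with
    alpha*h*||g||^2 <= f(x) - f(x - h*g) <= beta*h*||g||^2,
    i.e. conditions (1.2.16)-(1.2.17), by doubling and bisection."""
    g = grad(x)
    g2 = sum(gi * gi for gi in g)
    h, lo, hi = h0, 0.0, None
    for _ in range(100):
        decrease = f(x) - f([xi - h * gi for xi, gi in zip(x, g)])
        if decrease < alpha * h * g2:      # (1.2.16) violated: h too long
            hi = h
        elif decrease > beta * h * g2:     # (1.2.17) violated: h too short
            lo = h
        else:
            return h
        h = (lo + hi) / 2 if hi is not None else 2 * h
    return h

# Gradient Method with Armijo steps on a hypothetical quadratic.
f = lambda x: 2 * x[0] ** 2 + 0.5 * x[1] ** 2
grad = lambda x: [4 * x[0], x[1]]
x = [1.0, 1.0]
for _ in range(50):
    h = armijo_step(f, grad, x)
    x = [xi - h * gi for xi, gi in zip(x, grad(x))]
```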


Comparing these strategies, we see that the first strategy is the simplest one. It is
often used in the context of Convex Optimization. In this framework, the behavior
of functions is much more predictable than in the general nonlinear case.
The second strategy is completely theoretical. It is never used in practice since
even in the one-dimensional case we cannot find the exact minimum in finite time.
The third strategy is used in the majority of practical algorithms. It has the
following geometric interpretation. Let us fix x ∈ Rn, assuming that ∇f(x) ≠ 0.
Consider the following function of one variable:

φ(h) = f(x − h∇f(x)),  h ≥ 0.

Then the step-size values acceptable for this strategy belong to the part of the graph
of φ which is located between two linear functions:

φ1(h) = f(x) − αh ‖∇f(x)‖²,   φ2(h) = f(x) − βh ‖∇f(x)‖².

Note that φ(0) = φ1(0) = φ2(0) and φ′(0) < φ2′(0) < φ1′(0) < 0. Therefore, the
acceptable values exist provided that φ(·) is bounded below. There are several very fast
one-dimensional procedures for finding a point satisfying the Armijo conditions.
However, their detailed description is not important for us now.
Let us estimate the performance of the Gradient Method. Consider the problem

min_{x∈Rn} f(x),    (1.2.18)

with f ∈ C_L^{1,1}(Rn), and assume that f(·) is bounded below on Rn.


Let us evaluate the result of one gradient step. Consider y = x − h∇f(x). Then,
in view of (1.2.11), we have

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖²

= f(x) − h ‖∇f(x)‖² + (h²/2) L ‖∇f(x)‖²    (1.2.19)

= f(x) − h (1 − (h/2)L) ‖∇f(x)‖².

Thus, in order to get the best upper bound for the possible decrease of the objective
function, we have to solve the following one-dimensional problem:

Δ(h) = −h (1 − (h/2)L) → min_h.

Computing the derivative of this function, we conclude that the optimal step size
must satisfy the equation Δ′(h) = hL − 1 = 0. Thus, h∗ = 1/L, which is a minimum
of Δ(h) since Δ″(h) = L > 0.

Thus, our considerations prove that one step of the Gradient Method decreases
the value of the objective function at least as follows:

f(y) ≤ f(x) − (1/(2L)) ‖∇f(x)‖².

Let us check what is going on with the other step-size strategies.


Let xk+1 = xk − hk∇f(xk). Then for the constant step strategy, hk = h, we have

f(xk) − f(xk+1) ≥ h (1 − ½Lh) ‖∇f(xk)‖².

Therefore, if we choose hk = 2α/L with α ∈ (0, 1), then

f(xk) − f(xk+1) ≥ (2/L) α(1 − α) ‖∇f(xk)‖².

Of course, the optimal choice is hk = 1/L.

For the full relaxation strategy, we have

f(xk) − f(xk+1) ≥ (1/(2L)) ‖∇f(xk)‖²,

since the maximal decrease is not worse than the decrease attained by hk = 1/L.

Finally, for the Armijo rule, in view of (1.2.17), we have

f(xk) − f(xk+1) ≤ β⟨∇f(xk), xk − xk+1⟩ = βhk ‖∇f(xk)‖².

From (1.2.19), we obtain

f(xk) − f(xk+1) ≥ hk (1 − (hk/2)L) ‖∇f(xk)‖².

Therefore, hk ≥ (2/L)(1 − β). Further, using (1.2.16), we have

f(xk) − f(xk+1) ≥ α⟨∇f(xk), xk − xk+1⟩ = αhk ‖∇f(xk)‖².

Combining this inequality with the previous one, we conclude that

f(xk) − f(xk+1) ≥ (2/L) α(1 − β) ‖∇f(xk)‖².

Thus, we have proved that in all cases we have

f(xk) − f(xk+1) ≥ (ω/L) ‖∇f(xk)‖²,    (1.2.20)

where ω is some positive constant.

Now we are ready to estimate the performance of the Gradient Method. Summing
up the inequalities (1.2.20) for k = 0, …, N, we obtain

(ω/L) Σ_{k=0}^{N} ‖∇f(xk)‖² ≤ f(x0) − f(xN+1) ≤ f(x0) − f∗,    (1.2.21)
where f∗ is a lower bound for the values of the objective function in prob-
lem (1.2.1). As a simple consequence of the bound (1.2.21), we have

‖∇f(xk)‖ → 0 as k → ∞.

However, we can also say something about the rate of convergence. Indeed, define

gN∗ = min_{0≤k≤N} ‖∇f(xk)‖.

Then, in view of (1.2.21), we come to the following inequality:

gN∗ ≤ (1/√(N+1)) [ (1/ω) L (f(x0) − f∗) ]^{1/2}.    (1.2.22)

The right-hand side of this inequality describes the rate of convergence of the
sequence {gN∗} to zero. Note that we cannot say anything about the rate of
convergence of the sequences {f(xk)} and {xk}.
convergence of the sequences {f (xk )} and {xk }.


Recall that in general Nonlinear Optimization, our current goal is quite modest:
we only want to approach a local minimum of the optimization problem (1.2.18).
Nevertheless, in general, even this goal is unreachable for the Gradient Method. Let
us consider the following example.
Example 1.2.2 Consider the following function of two variables:

f(x) ≡ f(x^(1), x^(2)) = ½(x^(1))² + ¼(x^(2))⁴ − ½(x^(2))².

The gradient of this function is ∇f(x) = (x^(1), (x^(2))³ − x^(2))^T. Therefore, there
are only three points which can pretend to be a local minimum of this function:

x1∗ = (0, 0),   x2∗ = (0, −1),   x3∗ = (0, 1).

Computing the Hessian of this function,

∇²f(x) = ( 1        0
           0        3(x^(2))² − 1 ),

we conclude that x2∗ and x3∗ are isolated local minima,³ but x1∗ is only a stationary
point of our function. Indeed, f(x1∗) = 0 and f(x1∗ + εe2) = ε⁴/4 − ε²/2 < 0 for ε
small enough.
Let us consider now the trajectory of the Gradient Method which starts at x0 =
(1, 0). Note that the second coordinate of this point is zero. Therefore, the second
coordinate of ∇f (x0 ) is also zero. Consequently, the second coordinate of x1 is

3 In fact, in our example they are global solutions.


zero, etc. Thus, the entire sequence of points generated by the Gradient Method will
have the second coordinate equal to zero. This means that this sequence converges
to x1∗ .
To conclude our example, note that this situation is typical for all first-order
unconstrained minimization methods. Without additional rather restrictive assump-
tions, it is impossible to guarantee their global convergence to a local minimum.
Only a stationary point can be approached by these schemes. 
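The trajectory of Example 1.2.2 is easy to reproduce (the constant step size h = 0.5 is an arbitrary illustrative choice):

```python
# Gradient Method for Example 1.2.2, started at x0 = (1, 0).
grad = lambda x: (x[0], x[1] ** 3 - x[1])    # gradient of f from the example
x, h = (1.0, 0.0), 0.5
for _ in range(200):
    g = grad(x)
    x = (x[0] - h * g[0], x[1] - h * g[1])
# The second coordinate of every iterate stays exactly zero, so the
# process converges to the stationary point x1* = (0, 0), not to the
# local minima x2* = (0, -1) or x3* = (0, 1).
```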
Note that inequality (1.2.22) provides us with an example of a new notion,
that is, the rate of convergence of a minimization process. How can we use this
information in the complexity analysis? The rate of convergence delivers an upper
complexity bound for the corresponding problem class. Such a bound is always
justified by some numerical method. A method for which the upper complexity
bound is proportional to the lower complexity bound of the problem class is said to
be optimal. Recall that in Sect. 1.1.3 we have already seen an optimal method for
the problem class P∞ .
Let us now present a formal description of our result. Consider the following
problem class G∗ .

Model :       1. Unconstrained minimization.
              2. f ∈ C_L^{1,1}(R^n).
              3. f(·) is bounded below by the value f*.

Oracle :      First-order Black Box.                                (1.2.23)

ε-solution :  f(x̄) ≤ f(x0),  ‖∇f(x̄)‖ ≤ ε.

Note that inequality (1.2.22) can be used in order to obtain an upper bound for the
number of steps (= calls of the oracle), which is necessary to find a point where the
norm of the gradient is small. For that, let us write down the following inequality:

gN* ≤ (1/√(N+1)) [ (1/ω) L (f(x0) − f*) ]^{1/2} ≤ ε.                (1.2.24)

Therefore, if N + 1 ≥ (L/(ωε²)) (f(x0) − f*), then we necessarily have gN* ≤ ε.
Thus, we can use the value (L/(ωε²)) (f(x0) − f*) as an upper complexity bound for
our problem class. Comparing this estimate with the result of Theorem 1.1.2, we can
see that it is much better. At least it does not depend on n. The lower complexity
bound for the class G∗ is unknown.

Let us see, what can be said about the local convergence of the Gradient Method.
Consider the unconstrained minimization problem

min f (x)
x∈Rn

under the following assumptions.


1. f ∈ C_M^{2,2}(R^n).
2. There exists a local minimum x ∗ ∈ Rn of function f at which the Hessian is
positive definite.
3. We know some bounds 0 < μ ≤ L < ∞ for the Hessian at x ∗ :

μIn ⪯ ∇²f(x*) ⪯ LIn.                                                (1.2.25)

4. Our starting point x0 is close enough to x ∗ .


Consider the process: xk+1 = xk − hk ∇f (xk ). Note that ∇f (x ∗ ) = 0. Hence,

∇f(xk) = ∇f(xk) − ∇f(x*) = ∫₀¹ ∇²f(x* + τ(xk − x*))(xk − x*) dτ

= Gk (xk − x ∗ ),

where Gk = ∫₀¹ ∇²f(x* + τ(xk − x*)) dτ. Therefore,

xk+1 − x ∗ = xk − x ∗ − hk Gk (xk − x ∗ ) = (In − hk Gk )(xk − x ∗ ).

There is a standard technique for analyzing processes of this type, which is based
on contraction mappings. Let the sequence {ak } be defined as follows:

a0 ∈ R^n,  ak+1 = Ak ak,

where Ak are (n×n)-matrices such that ‖Ak‖ ≤ 1 − q for all k ≥ 0 with q ∈ (0, 1).
Then we can estimate the rate of convergence of the sequence {ak} to zero:

‖ak+1‖ ≤ (1 − q)‖ak‖ ≤ (1 − q)^{k+1} ‖a0‖ → 0.

In our case, we need to estimate ‖In − hkGk‖. Let rk = ‖xk − x*‖. In view of
Corollary 1.2.2, we have

∇²f(x*) − τMrk In ⪯ ∇²f(x* + τ(xk − x*)) ⪯ ∇²f(x*) + τMrk In.



Therefore, using assumption (1.2.25), we obtain


(μ − (rk/2)M) In ⪯ Gk ⪯ (L + (rk/2)M) In.

Hence, (1 − hk(L + (rk/2)M)) In ⪯ In − hkGk ⪯ (1 − hk(μ − (rk/2)M)) In, and we
conclude that

‖In − hkGk‖ ≤ max{ak(hk), bk(hk)},                                  (1.2.26)

where ak(h) = 1 − h(μ − (rk/2)M) and bk(h) = h(L + (rk/2)M) − 1.



Note that ak(0) = 1 and bk(0) = −1. Therefore, if 0 < rk < r̄ ≡ 2μ/M, then ak(·)
is a strictly decreasing function and we can ensure

‖In − hkGk‖ < 1

for hk small enough. In this case, we will have rk+1 < rk.


As usual, many step-size strategies are available. For example, we can choose
hk = 1/L. Let us consider the "optimal" strategy consisting in minimizing the
right-hand side of (1.2.26):

max{ak(h), bk(h)} → min_h .

Assume that r0 < r̄. Then, if we form the sequence {xk } using the optimal strategy,
we can be sure that rk+1 < rk < r̄. Further, the optimal step size hk* can be found
from the equation

ak(h) = bk(h)  ⇔  1 − h(μ − (rk/2)M) = h(L + (rk/2)M) − 1.

Hence

hk* = 2/(L + μ).                                                    (1.2.27)

(Surprisingly enough, the optimal step size does not depend on M.) Under this
choice, we obtain

rk+1 ≤ (L − μ)rk/(L + μ) + Mrk²/(L + μ).


Let us estimate the rate of convergence of the process. Let q = 2μ/(L + μ) and
ak = (M/(L + μ)) rk (< q). Then

ak+1 ≤ (1 − q)ak + ak² = ak(1 + (ak − q)) = ak(1 − (ak − q)²)/(1 − (ak − q)) ≤ ak/(1 + q − ak).

Therefore 1/ak+1 ≥ (1 + q)/ak − 1, or

q/ak+1 − 1 ≥ q(1 + q)/ak − q − 1 = (1 + q)(q/ak − 1).

Hence,

q/ak − 1 ≥ (1 + q)^k (q/a0 − 1) = (1 + q)^k ( (2μ/(L + μ)) · ((L + μ)/(r0 M)) − 1 )

         = (1 + q)^k (r̄/r0 − 1).

Thus,

ak ≤ q r0/(r0 + (1 + q)^k (r̄ − r0)) ≤ (q r0/(r̄ − r0)) (1/(1 + q))^k.

This proves the following theorem.


Theorem 1.2.4 Let the function f (·) satisfy our assumptions and let the starting
point x0 be close enough to a strict local minimum x ∗ :

r0 = ‖x0 − x*‖ < r̄ = 2μ/M.

Then the Gradient Method with step size (1.2.27) converges as follows:

‖xk − x*‖ ≤ (r̄ r0/(r̄ − r0)) (1 − 2μ/(L + 3μ))^k.

This type of rate of convergence is called linear.
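The linear rate can be observed numerically. The following Python sketch runs the Gradient Method with step size (1.2.27) on a simple quadratic whose Hessian has eigenvalues μ and L (the data are illustrative):

```python
import numpy as np

mu, L = 1.0, 10.0
A = np.diag([mu, L])        # Hessian with extreme eigenvalues mu and L
h = 2.0 / (L + mu)          # the "optimal" step size (1.2.27)

x = np.array([1.0, 1.0])
ratios = []
for _ in range(20):
    x_next = x - h * (A @ x)            # gradient step for f(x) = <Ax, x>/2
    ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next

print(ratios[-1], (L - mu) / (L + mu))  # both are close to 9/11
```

Each iteration contracts the distance to the minimum by exactly the factor (L − μ)/(L + μ), in agreement with the linear rate above (for a quadratic, M = 0).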

1.2.4 Newton’s Method

Newton’s Method is widely known as a technique for finding a root of a univariate


function. Let φ(·) : R → R. Consider the equation

φ(t ∗ ) = 0.

Newton’s rule can be obtained by linear approximation. Assume that we know some
t ∈ R which is close enough to t ∗ . Note that

φ(t + Δt) = φ(t) + φ′(t)Δt + o(|Δt|).



Therefore, the solution of the equation φ(t + Δt) = 0 can be approximated by the
solution of the following linear equation:

φ(t) + φ′(t)Δt = 0.

Under some conditions, we can expect the displacement Δt to be a good
approximation to the optimal displacement Δt* = t* − t. Converting this idea into an
algorithm, we get the process

tk+1 = tk − φ(tk)/φ′(tk).

This scheme can be naturally extended to the problem of finding a solution to a


system of nonlinear equations,

F (x) = 0,

where x ∈ Rn and F (·) : Rn → Rn . In this case, we need to define the displacement


Δx as a solution to the following system of linear equations:

F(x) + F′(x)Δx = 0

(called the Newton system). If the Jacobian F′(x) is nondegenerate, we can compute
the displacement Δx = −[F′(x)]^{-1} F(x). The corresponding iterative scheme is as
follows:

xk+1 = xk − [F′(xk)]^{-1} F(xk).

Finally, in view of Theorem 1.2.1, we can replace the unconstrained minimiza-


tion problem (1.2.1) by the problem of finding a root of the nonlinear system

∇f (x) = 0. (1.2.28)

(This replacement is not completely equivalent, but it works in nondegenerate


situations.) Further, to solve (1.2.28) we can apply the standard Newton Method
for the system of nonlinear equations. In this case, the Newton system is as follows:

∇f (x) + ∇ 2 f (x)Δx = 0.

Hence, Newton's Method for optimization problems can be written in the
following form:

xk+1 = xk − [∇ 2 f (xk )]−1 ∇f (xk ). (1.2.29)



Note that we can obtain the process (1.2.29) using the idea of quadratic
approximation. Consider this approximation, computed with respect to the point
xk :

φ(x) = f(xk) + ⟨∇f(xk), x − xk⟩ + (1/2)⟨∇²f(xk)(x − xk), x − xk⟩.

Assume that ∇²f(xk) ≻ 0. Then we can choose xk+1 as the minimizer of the
quadratic function φ(·). This means that

∇φ(xk+1 ) = ∇f (xk ) + ∇ 2 f (xk )(xk+1 − xk ) = 0,

and we come again to Newton’s process (1.2.29).


We will see that the convergence of Newton's Method in a neighborhood of a
strict local minimum is very fast. However, this method has two serious drawbacks.
Firstly, it can break down if ∇ 2 f (xk ) is degenerate. Secondly, Newton’s process can
diverge. Let us look at the following example.
Example 1.2.3 Let us apply Newton's Method to find a root of the following
univariate function:

φ(t) = t/√(1 + t²).

Clearly, t ∗ = 0. Note that

φ′(t) = 1/(1 + t²)^{3/2}.

Therefore Newton’s process is as follows:

tk+1 = tk − φ(tk)/φ′(tk) = tk − (tk/√(1 + tk²)) · (1 + tk²)^{3/2} = tk − tk(1 + tk²) = −tk³.

Thus, if | t0 |< 1, then this method converges and the convergence is extremely fast.
The points ±1 are oscillation points of this scheme. If | t0 |> 1, then the method
diverges. 
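This behavior is easy to reproduce; the sketch below iterates the map tk+1 = −tk³ derived above, once from |t0| < 1 and once from |t0| > 1:

```python
def newton_step(t):
    # one Newton step for phi(t) = t / sqrt(1 + t^2) simplifies to -t^3
    return -t ** 3

t_conv = 0.5                 # |t0| < 1: convergence
for _ in range(5):
    t_conv = newton_step(t_conv)

t_div = 1.1                  # |t0| > 1: divergence
for _ in range(5):
    t_div = newton_step(t_div)

print(abs(t_conv), abs(t_div))  # ~1e-74 versus ~1e10
```

After only five steps the first iterate is |t5| = 0.5^243, while the second has already exploded.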
In order to avoid a possible divergence, in practice we can apply the damped
Newton’s method:

xk+1 = xk − hk [∇ 2 f (xk )]−1 ∇f (xk ),

where hk > 0 is a step size parameter. At the initial stage of the method we can
use the same step size strategies as for the gradient scheme. At the final stage,
it is reasonable to choose hk = 1. Another possibility for ensuring the global

convergence of this scheme consists in using Cubic Regularization. This approach


will be studied in detail in Chap. 4.
Let us derive the local rate of convergence of Newton's Method. Consider the
problem

min f (x)
x∈Rn

under the following assumptions:


1. f ∈ C_M^{2,2}(R^n).
2. There exists a local minimum of the function f with positive definite Hessian:

∇²f(x*) ⪰ μIn,  μ > 0.                                              (1.2.30)

3. Our starting point x0 is close enough to x ∗ .


Consider the process xk+1 = xk − [∇ 2 f (xk )]−1 ∇f (xk ). Then, using the same
reasoning as for the Gradient Method, we obtain the following representation:

xk+1 − x ∗ = xk − x ∗ − [∇ 2 f (xk )]−1 ∇f (xk )

= xk − x* − [∇²f(xk)]^{-1} ∫₀¹ ∇²f(x* + τ(xk − x*))(xk − x*) dτ

= [∇ 2 f (xk )]−1 Gk (xk − x ∗ ),

where Gk = ∫₀¹ [∇²f(xk) − ∇²f(x* + τ(xk − x*))] dτ.

Let rk = ‖xk − x*‖. Then

‖Gk‖ = ‖ ∫₀¹ [∇²f(xk) − ∇²f(x* + τ(xk − x*))] dτ ‖

     ≤ ∫₀¹ ‖∇²f(xk) − ∇²f(x* + τ(xk − x*))‖ dτ

     ≤ ∫₀¹ M(1 − τ) rk dτ = (rk/2) M.

In view of Corollary 1.2.2, and relation (1.2.30), we have

∇²f(xk) ⪰ ∇²f(x*) − Mrk In ⪰ (μ − Mrk) In.



Therefore, if rk < μ/M, then ∇²f(xk) is positive definite and

‖[∇²f(xk)]^{-1}‖ ≤ (μ − Mrk)^{-1}.

Hence, for rk small enough (rk ≤ 2μ/(3M)), we have

rk+1 ≤ Mrk²/(2(μ − Mrk))  (≤ rk).

The rate of convergence of this type is called quadratic.


Thus, we have proved the following theorem.
Theorem 1.2.5 Let the function f(·) satisfy our assumptions. Suppose that the
initial starting point x0 is close enough to x*:

‖x0 − x*‖ ≤ r̄ = 2μ/(3M).

Then ‖xk − x*‖ ≤ r̄ for all k, and Newton's Method converges quadratically:

‖xk+1 − x*‖ ≤ M‖xk − x*‖² / (2(μ − M‖xk − x*‖)).

Comparing this result with the local rate of convergence of the Gradient Method,
we see that Newton's Method is much faster. Surprisingly enough, the region
of quadratic convergence of Newton's Method is almost the same as the
region of linear convergence of the Gradient Method. This justifies the standard
recommendation to use the Gradient Method only at the initial stage of the
minimization process in order to get close to a local minimum. The final job should
be performed by Newton’s scheme. However, we will come back to a detailed
comparison of the performance of these two methods in Chap. 4.
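The doubling of correct digits under quadratic convergence is easy to observe numerically. The following sketch applies Newton's Method to the univariate function f(t) = t − ln(1 + t) (a hypothetical test function with minimum t* = 0), for which one Newton step reduces algebraically to t ← −t²:

```python
t = 0.5
errors = []
for _ in range(5):
    fp = t / (1.0 + t)            # f'(t) for f(t) = t - ln(1 + t)
    fpp = 1.0 / (1.0 + t) ** 2    # f''(t)
    t = t - fp / fpp              # Newton step; simplifies to t <- -t^2
    errors.append(abs(t))

print(errors)  # 0.25, 0.0625, ...: each error is the square of the previous
```

Five iterations already give an error of about 2·10^{-10}, starting from 0.5.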
In this section, we have seen several examples of convergence rate. Let us find a
correspondence between these rates and the complexity bounds. As we have already
seen (for example, in the case of the problem class G∗ (1.2.23)), the upper bound
for the analytical complexity of a problem class is an inverse function of the rate of
convergence.
1. Sublinear rate. This rate is described in terms of a power function of the iteration
counter. For example, suppose that for some method we can prove the rate of
convergence rk ≤ c/√k. In this case, the upper complexity bound justified by this
scheme for the corresponding problem class is (c/ε)².
The sublinear rate is rather slow. In terms of complexity, each new right digit
of the answer takes a number of iterations comparable with the total amount of
the previous work. Note also, that the constant c plays a significant role in the
corresponding complexity bound.

2. Linear rate. This rate is given in terms of an exponential function of the iteration
counter. For example,

rk ≤ c(1 − q)^k ≤ c e^{−qk},  0 < q ≤ 1.

Note that the corresponding complexity bound is (1/q)(ln c + ln(1/ε)).


This rate is fast: Each new right digit of the answer takes a constant number of
iterations. Moreover, the dependence of the complexity estimate on the constant
c is very weak.
3. Quadratic rate. This rate has a double exponential dependence in the iteration
counter. For example,

rk+1 ≤ c rk².

The corresponding complexity estimate depends on the double logarithm of the
desired accuracy: ln ln(1/ε).
This rate is extremely fast: Each iteration doubles the number of right digits
in the answer. The constant c is important only for the starting moment of the
quadratic convergence (crk < 1). For example, after the moment crk ≤ 1/2, we
can guarantee a fast convergence rate rk+1 ≤ (1/2) rk, which does not depend on c at
all.
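The three rates can be compared side by side in a short sketch (the constants c = 1 and q = 0.1 are arbitrary):

```python
import math

c, q = 1.0, 0.1
sublinear = [c / math.sqrt(k) for k in (1, 100, 10000)]  # one extra digit costs 100x iterations
linear = [c * (1 - q) ** k for k in (22, 44, 66)]        # ~one extra digit per 22 iterations
quadratic, r = [], 0.5
for _ in range(5):
    r = r * r                                            # r_{k+1} = c r_k^2 with c = 1
    quadratic.append(r)

print(sublinear)   # [1.0, 0.1, 0.01]
print(linear)      # roughly 0.1, 0.01, 0.001
print(quadratic)   # the number of correct digits doubles at every step
```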

1.3 First-Order Methods in Nonlinear Optimization

(The Gradient Method and Newton’s Method: What is different? The idea of a variable
metric; Variable metric methods; Conjugate gradient methods; Constrained minimization;
Lagrangian relaxation; A sufficient condition for zero duality gap; Penalty functions and
penalty function methods; Barrier functions and barrier function methods.)

1.3.1 The Gradient Method and Newton’s Method: What Is


Different?

In the previous section, we considered two local methods for finding a local
minimum of the simplest minimization problem

min f (x),
x∈Rn

with f ∈ C_M^{2,2}(R^n). Namely, the Gradient Method

xk+1 = xk − hk ∇f (xk ), hk > 0.



and Newton's Method:

xk+1 = xk − [∇ 2 f (xk )]−1 ∇f (xk ).

Recall that the local rate of convergence of these methods is different. We have
seen that the Gradient Method has a linear rate, while Newton's Method converges
quadratically. What is the reason for this difference?
If we look at the analytical form of these methods, we can see at least the
following formal difference: in the Gradient Method, the search direction is the
antigradient, while in Newton's Method we multiply the antigradient by some
matrix, the inverse Hessian. Let us try to derive these directions using some
"universal" reasoning.
Let us fix a point x̄ ∈ Rn . Consider the following approximation of the function
f (·):

φ1(x) = f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (1/(2h)) ‖x − x̄‖²,

where the parameter h is positive. The first-order optimality condition provides us


with the following equation for x1∗ , the unconstrained minimum of this function:

∇φ1(x1*) = ∇f(x̄) + (1/h)(x1* − x̄) = 0.

Thus, x1* = x̄ − h∇f(x̄). This is exactly the iterate of the Gradient Method. Note
that if h ∈ (0, 1/L], then the function φ1(·) is a global upper approximation of f(·):

f (x) ≤ φ1 (x), ∀x ∈ Rn ,

(see Lemma 1.2.3). This fact is responsible for the global convergence of the
Gradient Method.
Further, consider a quadratic approximation of the function f (·):

φ2(x) = f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (1/2)⟨∇²f(x̄)(x − x̄), x − x̄⟩.

We have already seen that the minimum of this function is

x2* = x̄ − [∇²f(x̄)]^{-1} ∇f(x̄),

and this is exactly the iterate of Newton's Method.


Thus, we can try to use some quadratic approximations of the function f (·),
which are better than φ1 (·) and which are less expensive than φ2 (·).
Let G be a symmetric positive definite n × n-matrix. Define

φG(x) = f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (1/2)⟨G(x − x̄), x − x̄⟩.

Computing the minimizer of φG (·) from the equation


∇φG(xG*) = ∇f(x̄) + G(xG* − x̄) = 0,

we obtain

xG* = x̄ − G^{-1} ∇f(x̄).                                            (1.3.1)

The first-order methods, which form a sequence of matrices

{Gk } : Gk → ∇ 2 f (x ∗ )

(or {Hk} : Hk ≡ Gk^{-1} → [∇²f(x*)]^{-1}), are called variable metric methods.


(Sometimes the name quasi-Newton methods is used.) In these methods, only the
gradients are involved in the process of generating the sequences {Gk } or {Hk }.
The updating rule (1.3.1) is very common in Optimization. Let us provide it with
one more interpretation.
Note that the gradient and Hessian of a nonlinear function f (·) are defined with
respect to the standard Euclidean inner product on Rn :

⟨x, y⟩ = xᵀy = Σ_{i=1}^n x^(i) y^(i),  x, y ∈ R^n,  ‖x‖ = ⟨x, x⟩^{1/2}.

Indeed, the definition of the gradient is as follows:

f(x + h) = f(x) + ⟨∇f(x), h⟩ + o(‖h‖).

From this equation, we derive its coordinate representation:

∇f(x) = ( ∂f(x)/∂x^(1), . . . , ∂f(x)/∂x^(n) )ᵀ.

Let us now introduce a new inner product. Consider a symmetric positive definite
(n × n)-matrix A. For x, y ∈ Rn define

⟨x, y⟩_A = ⟨Ax, y⟩,  ‖x‖_A = ⟨Ax, x⟩^{1/2}.

The function ‖·‖_A is treated as a new norm on R^n. Note that topologically this new
norm is equivalent to the old one:

λmin(A)^{1/2} ‖x‖ ≤ ‖x‖_A ≤ λmax(A)^{1/2} ‖x‖,



where λmin (A) and λmax (A) are the smallest and the largest eigenvalues of the
matrix A. However, the gradient and the Hessian, computed with respect to the new
inner product, are different:

f(x + h) = f(x) + ⟨∇f(x), h⟩ + (1/2)⟨∇²f(x)h, h⟩ + o(‖h‖)

         = f(x) + ⟨A^{-1}∇f(x), h⟩_A + (1/2)⟨A^{-1}∇²f(x)h, h⟩_A + o(‖h‖_A).

Hence, ∇fA (x) = A−1 ∇f (x) is the new gradient and ∇ 2 fA (x) = A−1 ∇ 2 f (x) is
the new Hessian.
Thus, the direction used in Newton's Method can be seen as a gradient
direction computed with respect to the inner product defined by A = ∇²f(x) ≻ 0.
Note that the Hessian of f(·) at x computed with respect to A = ∇²f(x) is In.
Example 1.3.1 Consider the quadratic function

f(x) = α + ⟨a, x⟩ + (1/2)⟨Ax, x⟩,

where A = Aᵀ ≻ 0. Note that ∇f(x) = Ax + a, ∇²f(x) = A, and

∇f (x ∗ ) = Ax ∗ + a = 0

for x* = −A^{-1}a. Let us compute the Newton direction at some x ∈ R^n:

dN(x) = [∇²f(x)]^{-1}∇f(x) = A^{-1}(Ax + a) = x + A^{-1}a.

Therefore for any x ∈ Rn we have x −dN (x) = −A−1 a = x ∗ . Thus, for a quadratic
function, Newton’s method converges in one step. Note also that

f(x) = α + ⟨A^{-1}a, x⟩_A + (1/2)‖x‖²_A,

∇fA(x) = A^{-1}∇f(x) = dN(x),

∇²fA(x) = A^{-1}∇²f(x) = In. 
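The one-step behavior on quadratics is easy to verify numerically; the data below are an arbitrary positive definite example:

```python
import numpy as np

# One Newton step on a quadratic f(x) = alpha + <a, x> + <Ax, x>/2
# reaches the minimizer x* = -A^{-1} a from any starting point.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # A = A^T, positive definite
a = np.array([1.0, -1.0])
x = np.array([10.0, -7.0])           # arbitrary starting point

g = A @ x + a                        # gradient at x
x_next = x - np.linalg.solve(A, g)   # Newton step x - [grad^2 f]^{-1} grad f
x_star = -np.linalg.solve(A, a)      # exact minimizer

print(np.allclose(x_next, x_star))   # True: one step is enough
```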


Let us look at the general scheme of the variable metric methods.



Variable metric method

0. Choose x0 ∈ Rn . Set H0 = In . Compute f (x0 ) and ∇f (x0 ).

1. kth iteration (k ≥ 0).

(a) Set pk = Hk ∇f (xk ).

(b) Find xk+1 = xk − hk pk


(see Section 1.2.3 for step size rules).

(c) Compute f (xk+1 ) and ∇f (xk+1 ).

(d) Update the matrix Hk to Hk+1 .

The variable metric schemes differ from one to another only in the implementa-
tion of Step 1(d), which updates the matrix Hk . For that, they use new information,
accumulated at Step 1(c), namely the gradient ∇f (xk+1 ). This update is justified by
the following property of quadratic functions. Let

f(x) = α + ⟨a, x⟩ + (1/2)⟨Ax, x⟩,  ∇f(x) = Ax + a.

Then, for any x, y ∈ Rn we have ∇f (x) − ∇f (y) = A(x − y). This identity
explains the origin of the so-called quasi-Newton rule.

Quasi-Newton rule

Choose Hk+1 = (Hk+1)ᵀ ≻ 0 such that

Hk+1 (∇f(xk+1) − ∇f(xk)) = xk+1 − xk.

Actually, there are many ways to satisfy this relation. Below, we present several
examples of schemes which are usually recommended as the most efficient.

Define

ΔHk = Hk+1 − Hk , γk = ∇f (xk+1 ) − ∇f (xk ), δk = xk+1 − xk .

Then the quasi-Newton relation is satisfied by the following updating rules.


1. Rank-one correction scheme:

   ΔHk = (δk − Hkγk)(δk − Hkγk)ᵀ / ⟨δk − Hkγk, γk⟩.

2. Davidon–Fletcher–Powell scheme (DFP):

   ΔHk = δkδkᵀ / ⟨γk, δk⟩ − Hkγkγkᵀ Hk / ⟨Hkγk, γk⟩.

3. Broyden–Fletcher–Goldfarb–Shanno scheme (BFGS):

   ΔHk = βk δkδkᵀ / ⟨γk, δk⟩ − (Hkγkδkᵀ + δkγkᵀ Hk) / ⟨γk, δk⟩,

   where βk = 1 + ⟨Hkγk, γk⟩ / ⟨γk, δk⟩.
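As a sanity check, the sketch below applies the BFGS correction to random vectors (the data are synthetic) and verifies that the resulting Hk+1 satisfies the quasi-Newton relation Hk+1 γk = δk and remains symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
H = np.eye(n)                     # current approximation H_k
delta = rng.standard_normal(n)    # delta_k = x_{k+1} - x_k
gamma = rng.standard_normal(n)    # gamma_k = grad f(x_{k+1}) - grad f(x_k)

gd = gamma @ delta
beta = 1.0 + (gamma @ (H @ gamma)) / gd
dH = (beta * np.outer(delta, delta)
      - np.outer(H @ gamma, delta)
      - np.outer(delta, H @ gamma)) / gd   # BFGS correction Delta H_k
H_next = H + dH

print(np.allclose(H_next @ gamma, delta))  # True: quasi-Newton relation holds
```

The relation holds identically in exact arithmetic, for any δk, γk with ⟨γk, δk⟩ ≠ 0.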


Clearly, there are many other possibilities. From the computational point of view,
BFGS is considered to be the most stable scheme.
Note that for quadratic functions, the variable metric methods usually terminate
in at most n iterations. In a neighborhood of a strict local minimum x ∗ they
demonstrate a superlinear rate of convergence: for any x0 ∈ Rn close enough to
x* there exists a number N such that for all k ≥ N we have

‖xk+1 − x*‖ ≤ const · ‖xk − x*‖ · ‖xk−n − x*‖

(the proofs are very long and technical). As far as the worst-case global convergence
is concerned, these methods are not better than the Gradient Method.
In the variable metric schemes it is necessary to store and update a symmetric
(n × n)-matrix. Thus, each iteration needs O(n2 ) auxiliary arithmetic operations.
This feature is considered as one of the main drawbacks of the variable metric
methods. It stimulated the interest in conjugate gradient schemes which have a
much lower complexity of each iteration. We discuss these schemes in Sect. 1.3.2.

1.3.2 Conjugate Gradients

Conjugate gradient methods were initially proposed for minimizing quadratic


functions. Consider the problem

min_{x∈R^n} f(x)                                                    (1.3.2)

with f(x) = α + ⟨a, x⟩ + (1/2)⟨Ax, x⟩ and A = Aᵀ ≻ 0. We have already seen that
the solution of this problem is x* = −A^{-1}a. Therefore, our objective function can
be written in the following form:

f(x) = α + ⟨a, x⟩ + (1/2)⟨Ax, x⟩ = α − ⟨Ax*, x⟩ + (1/2)⟨Ax, x⟩

     = α − (1/2)⟨Ax*, x*⟩ + (1/2)⟨A(x − x*), x − x*⟩.

Thus, f* = α − (1/2)⟨Ax*, x*⟩ and ∇f(x) = A(x − x*).


Suppose we are given a starting point x0 ∈ Rn . Consider the linear Krylov
subspaces

Lk = Lin{A(x0 − x*), . . . , A^k(x0 − x*)},  k ≥ 1,

where A^k is the kth power of the matrix A. A sequence of points {xk} is generated by


the Conjugate Gradient Method in accordance with the following rule.

xk = arg min{f (x) | x ∈ x0 + Lk }, k ≥ 1. (1.3.3)

This definition looks quite artificial. However, later we will see that this method can
be written in a pure “algorithmic” form. We need representation (1.3.3) only for
theoretical analysis.
Lemma 1.3.1 For any k ≥ 1 we have Lk = Lin{∇f (x0 ), . . . , ∇f (xk−1 )}.
Proof For k = 1, the statement is true since ∇f (x0 ) = A(x0 − x ∗ ). Suppose that it
is valid for some k ≥ 1. Consider a point


xk = x0 + Σ_{i=1}^k λ^(i) A^i (x0 − x*) ∈ x0 + Lk

with some λ ∈ Rk . Then


∇f(xk) = A(x0 − x*) + Σ_{i=1}^k λ^(i) A^{i+1}(x0 − x*) = y + λ^(k) A^{k+1}(x0 − x*),

for a certain y from Lk . Thus,


 
Lk+1 ≡ Lin{Lk ∪ {A^{k+1}(x0 − x*)}} = Lin{Lk ∪ {∇f(xk)}} = Lin{∇f(x0), . . . , ∇f(xk)}. 


The next result helps us to understand the behavior of the sequence {xk }.

Lemma 1.3.2 For any k, i ≥ 0, k ≠ i, we have ⟨∇f(xk), ∇f(xi)⟩ = 0.

Proof Let k > i. Consider the function
 

φ(λ) = f( x0 + Σ_{j=1}^k λ^(j) ∇f(xj−1) ),  λ ∈ R^k.


In view of Lemma 1.3.1, for some λ* ∈ R^k we have xk = x0 + Σ_{j=1}^k λ*^(j) ∇f(xj−1).
However, by definition, xk is the minimum point of f(·) on x0 + Lk. Therefore
∇φ(λ*) = 0. It remains to compute the components of the gradient:

0 = ∂φ(λ*)/∂λ^(j) = ⟨∇f(xk), ∇f(xj−1)⟩,  j = 1, . . . , k. 

This lemma has two evident consequences.


Corollary 1.3.1 The sequence generated by the Conjugate Gradient Method for
problem (1.3.2) is finite.
Proof Indeed, the number of nonzero orthogonal directions in Rn cannot exceed n.


Corollary 1.3.2 For any p ∈ Lk, k ≥ 1, we have ⟨∇f(xk), p⟩ = 0.

The last auxiliary result explains the name of the method. Let δi = xi+1 − xi . It
is clear that Lk = Lin{δ0 , . . . , δk−1 }.
Lemma 1.3.3 For any k, i ≥ 0, k ≠ i, we have ⟨Aδk, δi⟩ = 0.
(Such directions are called conjugate with respect to A.)
Proof Without loss of generality, we can assume that k > i. Then

⟨Aδk, δi⟩ = ⟨A(xk+1 − xk), δi⟩ = ⟨∇f(xk+1) − ∇f(xk), δi⟩ = 0

since δi = xi+1 − xi ∈ Li+1 ⊆ Lk ⊆ Lk+1 .



Let us show how we can write down the Conjugate Gradient Method in a more
algorithmic form. Since Lk = Lin{δ0 , . . . , δk−1 }, we can represent xk+1 as follows:


xk+1 = xk − hk∇f(xk) + Σ_{j=0}^{k−1} λ^(j) δj.

In our notation, this is


δk = −hk∇f(xk) + Σ_{j=0}^{k−1} λ^(j) δj.                            (1.3.4)

Let us compute the coefficients in this representation. Multiplying (1.3.4) by A and


δi , 0 ≤ i ≤ k − 1, and using Lemma 1.3.3, we obtain


0 = ⟨Aδk, δi⟩ = −hk⟨A∇f(xk), δi⟩ + Σ_{j=0}^{k−1} λ^(j) ⟨Aδj, δi⟩

  = −hk⟨A∇f(xk), δi⟩ + λ^(i)⟨Aδi, δi⟩

  = −hk⟨∇f(xk), Aδi⟩ + λ^(i)⟨Aδi, δi⟩

  = −hk⟨∇f(xk), ∇f(xi+1) − ∇f(xi)⟩ + λ^(i)⟨Aδi, δi⟩.

Hence, in view of Lemma 1.3.2, λ^(i) = 0 for i < k − 1. For i = k − 1, we have

λ^(k−1) = hk‖∇f(xk)‖² / ⟨Aδk−1, δk−1⟩ = hk‖∇f(xk)‖² / ⟨∇f(xk) − ∇f(xk−1), δk−1⟩.

Thus, xk+1 = xk − hk pk , where

pk = ∇f(xk) − ‖∇f(xk)‖² δk−1 / ⟨∇f(xk) − ∇f(xk−1), δk−1⟩
   = ∇f(xk) − ‖∇f(xk)‖² pk−1 / ⟨∇f(xk) − ∇f(xk−1), pk−1⟩,

since δk−1 = −hk−1 pk−1 by definition of the directions {pk }.


Note that we managed to write down the Conjugate Gradient Method in terms of
the gradients of the objective function f (·). This provides us with the possibility
of formally applying this scheme to minimize a general nonlinear function. Of
course, such an extension destroys all properties of the process which are specific
for quadratic functions. However, in a neighborhood of a strict local minimum,
the objective function is close to quadratic. Therefore, asymptotically this method
should be fast.
Let us present a general scheme of the Conjugate Gradient Method for minimiz-
ing a general nonlinear function.

Conjugate Gradient Method

0. Let x0 ∈ Rn . Compute f (x0 ), ∇f (x0 ). Set p0 = ∇f (x0 ).

1. kth iteration (k ≥ 0).

(a) Find xk+1 = xk − hk pk (by “exact” line search).

(b) Compute f (xk+1 ) and ∇f (xk+1 ).

(c) Compute the coefficient βk .

(d) Define pk+1 = ∇f (xk+1 ) − βk pk .

In this scheme, we have not yet specified the coefficient βk . In fact, there exist many
different formulas for this coefficient. All of them give the same results on quadratic
functions. However, in the general nonlinear case, they generate different sequences.
Let us present the three most popular expressions.
1. Dai–Yuan:        βk = ‖∇f(xk+1)‖² / ⟨∇f(xk+1) − ∇f(xk), pk⟩.

2. Fletcher–Reeves: βk = −‖∇f(xk+1)‖² / ‖∇f(xk)‖².

3. Polak–Ribière:   βk = −⟨∇f(xk+1), ∇f(xk+1) − ∇f(xk)⟩ / ‖∇f(xk)‖².
Recall that in the quadratic case, the Conjugate Gradient Method terminates
in n iterations (or less). Algorithmically, this means that pn = 0. In the general
nonlinear case, this is not true. However, after n iterations, this direction loses its
interpretation. Therefore, in all practical schemes, there exists a restarting strategy,
which at some moment sets βk = 0 (usually after every n iterations). This ensures
the global convergence of the process (since we have the usual gradient step just
after the restart, and all other iterations decrease the value of the objective function).
In a neighborhood of a strict minimum, the conjugate gradient schemes demonstrate
a local n-step quadratic convergence:

‖xn − x*‖ ≤ const · ‖x0 − x*‖².

Note that this local convergence is slower than that of the variable metric methods.
However, the conjugate gradient methods have the advantage of cheap iteration. As
far as the global convergence is concerned, these schemes, in general, are not better
than the simplest Gradient Method.
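The finite termination of the scheme on quadratics can be checked numerically. The sketch below runs the method with the Fletcher–Reeves coefficient and the closed-form exact line search available for a quadratic objective (the test matrix is randomly generated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)            # A = A^T, positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)         # minimizer of f(x) = <Ax, x>/2 - <b, x>

x = np.zeros(n)
g = A @ x - b                          # gradient of f at x
p = g.copy()                           # p0 = grad f(x0)
for _ in range(n):                     # at most n iterations are needed
    if np.linalg.norm(g) < 1e-12:
        break
    h = (g @ p) / (p @ (A @ p))        # exact line search along -p
    x = x - h * p
    g_new = A @ x - b
    p = g_new + (g_new @ g_new) / (g @ g) * p   # Fletcher-Reeves update
    g = g_new

print(np.linalg.norm(x - x_star))      # close to machine precision
```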

1.3.3 Constrained Minimization

Let us discuss now the main ideas underlying the methods of optimization with
functional constraints. The problem we consider here is as follows:

f0(x) → min,  x ∈ Q,
                                                                    (1.3.5)
fj(x) ≤ 0,  j = 1, . . . , m,

where Q is a simple closed set in Rn , and the functional components


f0 (·), . . . , fm (·) are continuous functions. Since these components are general
nonlinear functions, we cannot expect this problem to be easier than an
unconstrained minimization problem. Indeed, even the standard difficulties with
stationary points, which we have in Unconstrained Minimization, appear in (1.3.5)
in a much stronger form. Note that a stationary point of this problem (whatever its
definition is) can be infeasible for the system of functional constraints. Hence, any
minimization scheme attracted by such a point fails even to find a feasible solution
of (1.3.5).
Therefore, the following reasoning looks quite convincing.
1. We have efficient methods for unconstrained minimization.4
2. Unconstrained minimization is simpler than constrained minimization.5
3. Therefore, let us try to approximate a solution of problem (1.3.5) by a sequence
of solutions to some auxiliary unconstrained minimization problems.
This philosophy is implemented by the schemes of Sequential Unconstrained
Minimization. There are three main groups of such methods.
• Lagrangian relaxation methods.
• Penalty function methods.
• Barrier methods.
Let us describe the main ideas of these approaches.

1.3.3.1 Lagrangian Relaxation

This approach is based on the following fundamental Minimax Principle.

4 In fact, this is not absolutely true. We will see that, in order to apply the unconstrained

minimization methods to solve constrained problems, we need to be able to find a global minimum
of some auxiliary problem, and we have already seen (Example 1.2.2) that this could be difficult.
5 We are not going to discuss the correctness of this statement for general nonlinear problems. We

just prevent the reader from extending it to other problem classes. In the following chapters, we
will see that this statement is valid only up to a certain point.

Theorem 1.3.1 Let the function F (x, λ) be defined for x ∈ Q1 ⊆ Rn and λ ∈


Q2 ⊆ Rm , where both Q1 and Q2 are nonempty. Then,

sup_{λ∈Q2} inf_{x∈Q1} F(x, λ) ≤ inf_{x∈Q1} sup_{λ∈Q2} F(x, λ).      (1.3.6)

Proof Indeed, for arbitrary x ∈ Q1 and λ ∈ Q2 , we have

F(x, λ) ≤ sup_{ξ∈Q2} F(x, ξ).

Since this inequality is valid for all x ∈ Q1 , we conclude that

inf_{x∈Q1} F(x, λ) ≤ inf_{x∈Q1} sup_{ξ∈Q2} F(x, ξ).

It remains to note that this inequality is valid for all λ ∈ Q2 . 



Let us apply this principle to problem (1.3.5). Note that

f* = inf_{x∈Q} {f0(x) : fj(x) ≤ 0, j = 1, . . . , m}

   = inf_{x∈Q} sup_{λ∈R^m_+} L(x, λ),  where L(x, λ) := f0(x) + ⟨λ, f(x)⟩,

with f(x) = (f1(x), . . . , fm(x))ᵀ. Here R^m_+ = {λ ∈ R^m : λ^(j) ≥ 0, j = 1, . . . , m}
is the positive orthant, and L(x, λ) is the Lagrange function, or Lagrangian, of
problem (1.3.5). Let

ψ(λ) = inf_{x∈Q} L(x, λ),

dom ψ = {λ ∈ R^m : ψ(λ) > −∞},                                      (1.3.7)

X*(λ) = Arg inf_{x∈Q} L(x, λ),

where X*(λ) is the set of global solutions of the corresponding minimization
problem. Note that at some λ ∈ R^m the value of the function ψ can be −∞. For us,
it is important to have dom ψ ∩ R^m_+ ≠ ∅. For simplicity, we assume that, for all λ
from this set, X*(λ) ≠ ∅.
Thus, we come to the following Lagrange dual problem:

f∗ := sup_λ {ψ(λ) : λ ∈ dom ψ ∩ R^m_+} ≤ f*,                        (1.3.8)

where the inequality follows from (1.3.6).

Note that the objective function of the dual problem is very special. Indeed, for any
two vectors λ1 , λ2 from domψ, and any x1 ∈ X∗ (λ1 ), x2 ∈ X∗ (λ2 ) we have


ψ(λ2) = f0(x2) + Σ_{j=1}^m λ2^(j) fj(x2) ≤ f0(x1) + Σ_{j=1}^m λ2^(j) fj(x1)
                                                                    (1.3.9)
      = ψ(λ1) + ⟨f(x1), λ2 − λ1⟩.

This means that the function ψ is concave, and (1.3.8) is a convex optimization
problem. Such problems can be efficiently solved by numerical schemes (see
Chap. 3), provided that for any λ ∈ domψ we are able to compute the vector
f (x(λ)), where x(λ) is one of the global solutions of problem (1.3.7).
Note that the dual problem (1.3.8) is not completely equivalent to the primal
problem (1.3.5). Very often, we can observe the situation f∗ < f ∗ (the so-called
nonzero duality gap). This is the reason why the problem (1.3.8) is often called the
Lagrangian relaxation of problem (1.3.5).
Conditions for a zero duality gap, f∗ = f ∗ , are usually quite restrictive and
require convexity of all elements of problem (1.3.5). We will see many instances of
such problems in Part II of this book. Here, we give a sufficient condition, which is
sometimes useful.
Theorem 1.3.2 (Certificate of Global Optimality) Let λ∗ be an optimal solution
to problem (1.3.8). Assume that for some positive ε we have

Δ+_ε(λ*) := {λ ∈ R^m_+ : ‖λ − λ*‖ ≤ ε} ⊆ dom ψ.

Let the vector x(λ) ∈ X*(λ), λ ≠ λ*, be uniquely defined, and let the following limit
exist:

x* = lim_{λ→λ*, λ∈Δ+_ε(λ*)} x(λ).

If x* ∈ X*(λ*), then it is a global optimal solution to problem (1.3.5).


Let I ∗ = {j : λ∗ > 0}. Choosing j ∈ I ∗ and  > 0
(j )
Proof Let g(λ) = f (x(λ)). 
ensuring λ∗ ± ej ∈ domψ Rm + , we get

(1.3.9)
ψ(λ∗ ) ≤ ψ(λ∗ + ej ) + g(λ∗ + ej ), −ej ≤ ψ(λ∗ ) + g(λ∗ + ej ), −ej ,

(1.3.9)
ψ(λ∗ ) ≤ ψ(λ∗ − ej ) + g(λ∗ − ej ), ej ≤ ψ(λ∗ ) + g(λ∗ − ej ), ej ,

Thus, we have

g(λ∗ + ej ), ej ≤ 0 ≤ g(λ∗ − ej ), ej .


1.3 First-Order Methods in Nonlinear Optimization 53

Taking the limit in both inequalities as ε → 0, we obtain f_j(x*) = 0 for all j ∈ I*.
Similarly, if j ∉ I*, we can take ε small enough to have λ* + εe_j ∈ dom ψ. Then,
by (1.3.9),

    ψ(λ*) ≤ ψ(λ* + εe_j) + ⟨g(λ* + εe_j), −εe_j⟩ ≤ ψ(λ*) + ⟨g(λ* + εe_j), −εe_j⟩.

Hence, ⟨g(λ* + εe_j), e_j⟩ ≤ 0. Taking the limit in this inequality as ε → 0, we get
f_j(x*) ≤ 0.
Thus, the point x* is feasible for problem (1.3.5), and

    λ*^(j) f_j(x*) = 0,    j = 1, …, m.                                       (1.3.10)

Therefore, we obtain

    f_0(x*) = f_0(x*) + Σ_{j=1}^m λ*^(j) f_j(x*) = ψ(λ*) ≤ f*,

where the first equality follows from (1.3.10) and the last inequality from (1.3.8). □
Remark 1.3.1 The equality constraints in problem (1.3.5) can be treated in a similar
way. The only difference is that in the dual problem (1.3.8), the corresponding
Lagrange multipliers do not have sign restrictions. At the same time, the statement
of Theorem 1.3.2 remains valid.
Let us show how this condition works in some simple situations.
Example 1.3.2 Let us choose in the problem (1.3.5) Q = R², and

    f_0(x) = ½‖x − ē_2‖²,    f_1(x) = x^(1) − ½(x^(2))²,

where ē_2 = (1, 1)^T. Then, we can form the Lagrangian

    L(x, λ) = ½‖x − ē_2‖² + λ(x^(1) − ½(x^(2))²),

and define ψ(λ) = inf_{x∈R²} L(x, λ). It is clear that dom ψ = (−∞, 1), and for any
feasible λ, the point x(λ) can be found from the following equations:

    x^(1)(λ) − 1 + λ = 0,

    x^(2)(λ) − 1 − λx^(2)(λ) = 0.

Thus, x^(1)(λ) = 1 − λ, and x^(2)(λ) = 1/(1 − λ). Substituting this point into the
Lagrangian, we obtain

    ψ(λ) = λ − ½λ² − 1/(2(1 − λ)) + ½.

The maximum of ψ is attained at λ* = 1 − (1/2)^{1/3}. Since the trajectory x(λ)
is uniquely defined and continuous on the domain dom ψ, by Theorem 1.3.2 we
conclude that the point x(λ*) = (2^{−1/3}, 2^{1/3}) is the global optimal solution of our
problem. □
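The closed-form answer in this example is easy to cross-check. The following Python sketch (an illustration added to this text, not part of the original derivation) maximizes ψ(λ) = λ − λ²/2 − 1/(2(1−λ)) + 1/2 by a crude grid search and compares the maximizer with λ* = 1 − (1/2)^{1/3}:

```python
def psi(lam):
    # Dual function of Example 1.3.2, defined for lam < 1
    return lam - 0.5 * lam ** 2 - 1.0 / (2.0 * (1.0 - lam)) + 0.5

# Crude grid search for the maximizer of psi over [0, 0.999)
grid = [i * 1e-5 for i in range(99900)]
lam_best = max(grid, key=psi)

lam_star = 1.0 - 0.5 ** (1.0 / 3.0)                # closed-form maximizer
x_star = (1.0 - lam_star, 1.0 / (1.0 - lam_star))  # x(lam*) = (2^(-1/3), 2^(1/3))

print(abs(lam_best - lam_star) < 1e-4)             # True
print(abs(x_star[1] - 2.0 ** (1.0 / 3.0)) < 1e-12) # True
```

The grid step bounds the accuracy of the search; any one-dimensional maximization routine would do equally well here, since ψ is concave on its domain.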
We consider another example of application of Theorem 1.3.2 in Sect. 4.1.4.

1.3.3.2 Penalty Functions

Definition 1.3.1 A continuous function Φ(·) is called a penalty function for a
closed set F ⊂ R^n if
• Φ(x) = 0 for any x ∈ F,
• Φ(x) > 0 for any x ∉ F.
Sometimes, a penalty function is called just a penalty for the set F. The main
property of penalty functions is as follows.

    If Φ_1(·) is a penalty for F_1 and Φ_2(·) is a penalty for F_2, then
    Φ_1(·) + Φ_2(·) is a penalty for the intersection F_1 ∩ F_2.

Let us give several examples of such functions.


Example 1.3.3 Define (a)_+ = max{a, 0}, a ∈ R. Let f_1(·), …, f_m(·) be continuous
functions, and

    F = {x ∈ R^n | f_j(x) ≤ 0, j = 1 … m}.

Then, the following functions are penalties for F:

1. Quadratic penalty: Φ(x) = Σ_{j=1}^m (f_j(x))_+².
2. Nonsmooth penalty: Φ(x) = Σ_{j=1}^m (f_j(x))_+.

The reader can easily continue the list. □
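For illustration (this snippet is not from the book), the two penalties above can be coded directly; the constraint functions and test points below are arbitrary choices made just for the example:

```python
def plus(a):
    # (a)_+ = max{a, 0}
    return max(a, 0.0)

def quadratic_penalty(fs, x):
    # Phi(x) = sum_j (f_j(x))_+^2
    return sum(plus(f(x)) ** 2 for f in fs)

def nonsmooth_penalty(fs, x):
    # Phi(x) = sum_j (f_j(x))_+
    return sum(plus(f(x)) for f in fs)

# F = {x in R : x - 1 <= 0, -x - 1 <= 0} = [-1, 1]
fs = [lambda x: x - 1.0, lambda x: -x - 1.0]

print(quadratic_penalty(fs, 0.5))   # 0.0 : zero inside F
print(quadratic_penalty(fs, 2.0))   # 1.0 : (2-1)^2, positive outside F
print(nonsmooth_penalty(fs, -3.0))  # 2.0 : (3-1), positive outside F
```

Both functions vanish exactly on F and are positive outside it, as Definition 1.3.1 requires; their sum over several sets is a penalty for the intersection.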



Let us present the general scheme of the Penalty Function Method as applied to
problem (1.3.5).

Penalty Function Method

0. Choose x_0 ∈ Q. Choose a sequence of penalty coefficients:
   0 < t_k < t_{k+1}, t_k → ∞.
1. kth iteration (k ≥ 0).
   Find x_{k+1} = arg min_{x∈Q} {f_0(x) + t_k Φ(x)} using x_k as starting point.
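A minimal numerical sketch of this scheme (added here; the toy problem, the step-size rule, and the coefficients t_k are our own choices, not from the book): minimize f_0(x) = (x−2)² subject to x ≤ 1, whose solution is x* = 1, using the quadratic penalty Φ(x) = (x−1)_+²:

```python
def penalty_method(df0, dphi, x0, ts, steps=500):
    # Minimize f0(x) + t*phi(x) by plain gradient descent for each t in turn,
    # warm-starting from the previous solution (a rough 1-D sketch; the step
    # size is tied to the curvature 2 + 2t of this particular toy problem).
    x = x0
    for t in ts:
        h = 0.9 / (2.0 + 2.0 * t)
        for _ in range(steps):
            x -= h * (df0(x) + t * dphi(x))
    return x

df0  = lambda x: 2.0 * (x - 2.0)          # f0(x) = (x - 2)^2
dphi = lambda x: 2.0 * max(x - 1.0, 0.0)  # Phi(x) = (x - 1)_+^2

x = penalty_method(df0, dphi, x0=0.0, ts=[1.0, 10.0, 100.0, 1000.0])
print(x > 1.0, abs(x - 1.0) < 0.01)  # True True
```

Note that the iterates approach x* from outside the feasible set: for this toy problem the minimizer of the penalized function is (2+t)/(1+t) > 1, which tends to 1 only as t → ∞. This exterior behavior is characteristic of penalty methods.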

It is easy to prove the convergence of this scheme assuming that x_{k+1} is a global
minimum of the auxiliary function.6 Define

    Ψ_k(x) = f_0(x) + t_k Φ(x),    Ψ_k* = min_{x∈Q} Ψ_k(x) = Ψ_k(x_{k+1})

(Ψ_k* is the global optimal value of Ψ_k(·)). Let x* be a global solution to (1.3.5).
Theorem 1.3.3 Let there exist a value t̄ > 0 such that the set

    S = {x ∈ R^n | f_0(x) + t̄Φ(x) ≤ f_0(x*)}

is bounded. Then

    lim_{k→∞} f_0(x_k) = f_0(x*),    lim_{k→∞} Φ(x_k) = 0.

Proof Note that Ψ_k* ≤ Ψ_k(x*) = f_0(x*). At the same time, for any x ∈ Q we have
Ψ_{k+1}(x) ≥ Ψ_k(x). Therefore Ψ_{k+1}* ≥ Ψ_k*. Thus, there exists a limit

    lim_{k→∞} Ψ_k* ≡ Ψ* ≤ f_0(x*).

If t_k > t̄, then

    f_0(x_{k+1}) + t̄Φ(x_{k+1}) ≤ f_0(x_{k+1}) + t_k Φ(x_{k+1}) = Ψ_k* ≤ f_0(x*).

Therefore, x_k ∈ S for k large enough. Hence, the sequence {x_k} has limit points.
Since lim_{k→∞} t_k = +∞, for any such point x_* we have Φ(x_*) = 0. Thus, x_* ∈ F and
f_0(x_*) ≤ f_0(x*). Consequently, f_0(x_*) = f_0(x*). □

Note that this result is very general, but not too informative. There are still
many questions which should be answered. For example, we do not know what

6 If we assume that it is a strict local minimum, then the results are much weaker.

kind of penalty functions we should use. What should be the rules for choosing
the penalty coefficients? What should be the accuracy for solving the auxiliary
problems? In fact, all these questions are difficult to address in the framework of
general Nonlinear Optimization. Traditionally, they are redirected to computational
practice.

1.3.3.3 Barrier Functions

Let us look at Barrier Methods.


Definition 1.3.2 Let F be a closed set in R^n with nonempty interior. A continuous
function F(·) is called a barrier function for F if F(x) → ∞ as x approaches the
boundary of this set.
Sometimes a barrier function is called a barrier for short. Similarly to penalty
functions, the barriers possess the following property.

    If F_1(·) is a barrier for F_1 and F_2(·) is a barrier for F_2, then
    F_1(·) + F_2(·) is a barrier for the intersection F_1 ∩ F_2, provided
    that its interior is nonempty.

In order to apply the barrier approach, problem (1.3.5) must satisfy the Slater
condition:

∃x̄ ∈ Rn : fj (x̄) < 0, j = 1 . . . m. (1.3.11)

Let us look at some examples of barrier functions.


Example 1.3.4 Let f_1(·), …, f_m(·) be continuous functions and F = {x ∈ R^n |
f_j(x) ≤ 0, j = 1 … m}. Then all the functions below are barriers for F:

1. Power-function barrier: F(x) = Σ_{j=1}^m 1/(−f_j(x))^p, p ≥ 1.
2. Logarithmic barrier: F(x) = −Σ_{j=1}^m ln(−f_j(x)).
3. Exponential barrier: F(x) = Σ_{j=1}^m exp(1/(−f_j(x))).

The reader can easily extend this list. □

Let F_0 = Q ∩ int F, and let F(·) be a barrier for F. The general scheme of the
Barrier Method is as follows.

Barrier Function Method

0. Choose x_0 ∈ F_0. Choose a sequence of penalty coefficients:
   0 < t_k < t_{k+1}, t_k → ∞.
1. kth iteration (k ≥ 0).
   Find x_{k+1} = arg min_{x∈F_0} {f_0(x) + (1/t_k)F(x)} using x_k as the starting
   point.
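For comparison with the penalty approach, here is a sketch of the barrier iterations on the same toy problem (again an added illustration with our own choices): minimize f_0(x) = (x−2)² subject to x ≤ 1, with the logarithmic barrier F(x) = −ln(1−x) on F_0 = (−∞, 1):

```python
def solve_aux(t, lo=-10.0, hi=1.0 - 1e-12, iters=200):
    # Auxiliary problem: minimize (x-2)^2 + (1/t) * (-ln(1-x)) over x < 1.
    # Its derivative g(x) = 2(x-2) + 1/(t(1-x)) is increasing on (-inf, 1),
    # so the minimizer is the unique root of g, found here by bisection.
    g = lambda x: 2.0 * (x - 2.0) + 1.0 / (t * (1.0 - x))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Central path of: minimize (x-2)^2 subject to x <= 1 (solution x* = 1)
for t in [1.0, 10.0, 100.0, 10000.0]:
    print(t, solve_aux(t))  # iterates increase toward 1, staying strictly feasible
```

In contrast to the penalty method, every point x(t_k) is strictly feasible; the iterates approach x* = 1 from the interior of the feasible set as t_k grows.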

Let us prove the convergence of this method assuming that x_{k+1} is a global
minimum of the auxiliary function. Define

    Ψ_k(x) = f_0(x) + (1/t_k)F(x),    Ψ_k* = min_{x∈F_0} Ψ_k(x)

(Ψ_k* is the global optimal value of Ψ_k(·)), and let f* be the optimal value of the
problem (1.3.5).

Theorem 1.3.4 Let the barrier F(·) be bounded below on F_0. Then

    lim_{k→∞} Ψ_k* = f*.

Proof Let F(x) ≥ F* for all x ∈ F_0. For arbitrary x̄ ∈ F_0 we have

    lim sup_{k→∞} Ψ_k* ≤ lim_{k→∞} [f_0(x̄) + (1/t_k)F(x̄)] = f_0(x̄).

Therefore lim sup_{k→∞} Ψ_k* ≤ f*. On the other hand,

    Ψ_k* = min_{x∈F_0} {f_0(x) + (1/t_k)F(x)} ≥ inf_{x∈F_0} {f_0(x) + (1/t_k)F*} = f* + (1/t_k)F*.

Thus, lim_{k→∞} Ψ_k* = f*. □
As with Penalty Function Methods, many questions need to be answered. We
do not know how to find the starting point x0 and how to choose the best barrier
function. We do not know theoretically justified rules for updating the penalty
coefficients and the acceptable accuracy of the solution for the auxiliary problems.
Finally, we have no ideas about the efficiency estimates of this process. And the
reason is not in the lack of theory. Our problem (1.3.5) is still too complicated. We
will see that all the questions above get precise answers in the framework of Convex
Optimization (see Chap. 5).

We have finished our brief presentation of general Nonlinear Optimization. It


was very short indeed, and there are many interesting theoretical topics that we
did not mention. The reason is that the main goal of this book is to describe the
areas of Optimization where we can obtain clear and comprehensive results on the
performance of numerical methods. Unfortunately, general Nonlinear Optimization
is just too complicated to fit the goal. However, it was impossible to skip this
field since a lot of basic ideas underlying Convex Optimization have their ori-
gin in the general theory of Nonlinear Optimization. The Gradient Method and
Newton’s Method, Sequential Unconstrained Minimization and Barrier Functions
were originally developed and used for general optimization problems. But only the
framework of Convex Optimization allows these ideas to get their real power. In the
following chapters of this book, we will see many examples of the second birth of
these old ideas.
Chapter 2
Smooth Convex Optimization

In this chapter, we study the complexity of solving optimization problems formed


by differentiable convex components. We start by establishing the main properties
of such functions and deriving the lower complexity bounds, which are valid for
all natural optimization methods. After that, we prove the worst-case performance
guarantees for the Gradient Method. Since these bounds are quite far from the
lower complexity bounds, we develop a special technique, based on the notion
of estimating sequences, which allows us to justify the Fast Gradient Methods.
These methods appear to be optimal for smooth convex problems. We also obtain
performance guarantees for these methods when the goal is to generate points with a small
norm of the gradient. In order to treat problems with set constraints, we introduce the
notion of a Gradient Mapping. This allows an automatic extension of methods for
unconstrained minimization to the constrained case. In the last section, we consider
methods for solving smooth optimization problems, defined by several functional
components.

2.1 Minimization of Smooth Functions

(Smooth convex functions; Lower complexity bounds for F_L^{∞,1}(R^n); Strongly convex
functions; Lower complexity bounds for S_{μ,L}^{∞,1}(R^n); The Gradient Method.)

2.1.1 Smooth Convex Functions

In this section, we consider the unconstrained minimization problem

min f (x), (2.1.1)


x∈Rn

© Springer Nature Switzerland AG 2018 59


Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_2

where the objective function f (·) is smooth enough. Recall that in the previous
chapter we were trying to solve this problem under very weak assumptions on the
function f . We have seen that in this general situation we cannot do too much: It is
impossible to guarantee convergence even to a local minimum and it is impossible to
get acceptable bounds on the global performance of minimization schemes, etc. Let
us try to introduce some reasonable assumptions on the function f in order to make
our problem more tractable. For that, let us try to specify the desired properties of a
hypothetical class of differentiable functions F we want to work with.
From the results of the previous chapter, we could come to the conclusion that the
main reason for our troubles is the weakness of the first-order optimality condition
(Theorem 1.2.1). Indeed, we have seen that, in general, the Gradient Method
converges only to a stationary point of the function f (see inequality (1.2.22) and
Example 1.2.2). Therefore, the first additional property we definitely need is as
follows.
Assumption 2.1.1 For any f ∈ F , the first-order optimality condition is sufficient
for a point to be a global solution to (2.1.1).
Further, the main feature of any tractable functional class F is the possibility to
verify the inclusion f ∈ F in a simple way. Usually, this is ensured by a set of basic
elements of the class, endowed with a list of possible operations with elements of F ,
which keep the result in the class (such operations are called invariant). An excellent
example of such a construction is the class of differentiable functions. In order to
check whether a function is differentiable or not, we just need to look at its analytical
representation.
We do not want to restrict our class too much. Therefore, let us introduce only
one invariant operation for the hypothetical class F .
Assumption 2.1.2 If f1 , f2 ∈ F and α, β ≥ 0, then αf1 + βf2 ∈ F .
The reason for the restriction on the sign of coefficients in this assumption is evident:
We would like to see x 2 in our class, but the function −x 2 is not suitable for our
goals.
Finally, let us add to F some basic elements.
Assumption 2.1.3 Any linear function ℓ(x) = α + ⟨a, x⟩ belongs to F.1
Note that the linear function ℓ(·) perfectly fits Assumption 2.1.1. Indeed, ∇ℓ(x) = 0
implies that this function is constant, and any point in R^n is its global minimum.
It turns out that we have already introduced enough assumptions to specify our
functional class. Consider f ∈ F . Let us fix some x0 ∈ Rn and consider the
function

φ(y) = f (y) − ∇f (x0 ), y .

1 This is not a description of the whole set of basic elements. We just say that we want to have all

linear functions in our class.



Then φ ∈ F in view of Assumptions 2.1.2 and 2.1.3. Note that

∇φ(y) |y=x0 = ∇f (x0 ) − ∇f (x0 ) = 0.

Therefore, in view of Assumption 2.1.1, x0 is the global minimum of function φ,


and for any y ∈ Rn we have

φ(y) ≥ φ(x0 ) = f (x0 ) − ∇f (x0 ), x0 .

Hence, f (y) ≥ f (x0 ) + ∇f (x0 ), y − x0 .


This inequality is very well known in Optimization Theory. It defines the class
of differentiable convex functions. Such functions may have a restricted domain.
However, this domain must always be convex.
Definition 2.1.1 A set Q ⊆ Rn is called convex if for any x, y ∈ Q and α from
[0, 1] we have

αx + (1 − α)y ∈ Q.

Thus, a convex set contains the whole segment [x, y] provided that the end points x
and y belong to the set.
Definition 2.1.2 A continuously differentiable function f (·) is called convex on a
convex set Q (notation f ∈ F 1 (Q)) if for any x, y ∈ Q we have

f (y) ≥ f (x) + ∇f (x), y − x . (2.1.2)

If −f(·) is convex, we call f(·) concave.
In what follows, we also consider the classes of convex functions F_L^{k,l}(Q), where
the indices have the same meaning as for the classes C_L^{k,l}(Q).
Let us check our assumptions, which now become the properties of the functional
class.
Theorem 2.1.1 If f ∈ F 1 (Rn ) and ∇f (x ∗ ) = 0 then x ∗ is the global minimum of
f (·) on Rn .
Proof In view of inequality (2.1.2), for any x ∈ Rn we have

f (x) ≥ f (x ∗ ) + ∇f (x ∗ ), x − x ∗ = f (x ∗ ). 

Thus, we get what we want in Assumption 2.1.1. Let us check Assumption 2.1.2.
Lemma 2.1.1 If f1 and f2 belong to F 1 (Q) and α, β ≥ 0, then the function f =
αf1 + βf2 also belongs to F 1 (Q).

Proof For any x, y ∈ Q, we have

f1 (y) ≥ f1 (x) + ∇f1 (x), y − x ,

f2 (y) ≥ f2 (x) + ∇f2 (x), y − x .

It remains to multiply the first inequality by α, the second one by β, and add the
results. □

Thus, for differentiable functions our hypothetical class coincides with the class
of convex functions. Let us present their main properties.
The next statement significantly increases our possibilities in constructing the
convex functions.
Lemma 2.1.2 If f ∈ F 1 (Q), b ∈ Rm and A : Rn → Rm then

φ(x) = f (Ax + b) ∈ F 1 (Q̂), Q̂ = {x ∈ Rn : Ax + b ∈ Q}.

Proof Indeed, let x, y ∈ Q. Define x̄ = Ax + b, ȳ = Ay + b. Since

∇φ(x) = AT ∇f (Ax + b),

we have

φ(y) = f (ȳ) ≥ f (x̄) + ∇f (x̄), ȳ − x̄

= φ(x) + ∇f (x̄), A(y − x)

= φ(x) + AT ∇f (x̄), y − x

= φ(x) + ∇φ(x), y − x . 

In order to make the verification of the inclusion f ∈ F 1 (Q) easier, let us


provide several equivalent definitions of this class.
Theorem 2.1.2 A continuously differentiable function f belongs to the class
F 1 (Q) if and only if for any x, y ∈ Q and α ∈ [0, 1] we have2

f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y). (2.1.3)

2 Note that inequality (2.1.3) without the assumption of differentiability of f serves as a definition

of general convex functions. We will study these functions in detail in Chap. 3.



Proof Define xα = αx + (1 − α)y. Let f ∈ F 1 (Q). Then

f (xα ) ≤ f (y) − ∇f (xα ), y − xα = f (y) − α∇f (xα ), y − x ,

f (xα ) ≤ f (x) − ∇f (xα ), x − xα = f (x) + (1 − α)∇f (xα ), y − x .

Multiplying the first inequality by (1 − α), the second one by α, and adding the
results, we get (2.1.3).
Let (2.1.3) be true for all x, y ∈ Q and α ∈ [0, 1]. Let us choose some α ∈ [0, 1).
Then

    f(y) ≥ (1/(1−α))[f(x_α) − αf(x)] = f(x) + (1/(1−α))[f(x_α) − f(x)]
         = f(x) + (1/(1−α))[f(x + (1−α)(y−x)) − f(x)].

Letting α tend to 1, we get (2.1.2). □



Theorem 2.1.3 A continuously differentiable function f belongs to the class
F 1 (Q) if and only if for any x, y ∈ Q we have

∇f (x) − ∇f (y), x − y ≥ 0. (2.1.4)

Proof Let f be a convex continuously differentiable function. Then

f (x) ≥ f (y) + ∇f (y), x − y , f (y) ≥ f (x) + ∇f (x), y − x .

Adding these inequalities, we get (2.1.4).


Let (2.1.4) hold for all x, y ∈ Q. Define x_τ = x + τ(y − x) ∈ Q. Then

    f(y) = f(x) + ∫_0^1 ⟨∇f(x + τ(y−x)), y − x⟩ dτ
         = f(x) + ⟨∇f(x), y − x⟩ + ∫_0^1 ⟨∇f(x_τ) − ∇f(x), y − x⟩ dτ
         = f(x) + ⟨∇f(x), y − x⟩ + ∫_0^1 (1/τ)⟨∇f(x_τ) − ∇f(x), x_τ − x⟩ dτ
         ≥ f(x) + ⟨∇f(x), y − x⟩. □


Sometimes it is more convenient to work with functions from a smaller class
F²(Q) ⊂ F¹(Q).

Theorem 2.1.4 Let Q be an open set. A twice continuously differentiable function
f belongs to the class F²(Q) if and only if for any x ∈ Q we have

    ∇²f(x) ⪰ 0.                                                              (2.1.5)

Proof Let a function f from C²(Q) be convex and s ∈ R^n. Let x_τ = x + τs ∈ Q
for τ > 0 small enough. Then, in view of (2.1.4), we have

    0 ≤ (1/τ²)⟨∇f(x_τ) − ∇f(x), x_τ − x⟩ = (1/τ)⟨∇f(x_τ) − ∇f(x), s⟩
      = (1/τ) ∫_0^τ ⟨∇²f(x + λs)s, s⟩ dλ,

and we get (2.1.5) by letting τ tend to zero.
Let (2.1.5) hold for all x ∈ Q. Then for y ∈ Q we have

    f(y) = f(x) + ⟨∇f(x), y − x⟩ + ∫_0^1 ∫_0^τ ⟨∇²f(x + λ(y−x))(y−x), y − x⟩ dλ dτ
         ≥ f(x) + ⟨∇f(x), y − x⟩. □

Let us look at some examples of differentiable convex functions on Rn .


Example 2.1.1
1. Every linear function f(x) = α + ⟨a, x⟩ is convex.
2. Let a matrix A be symmetric and positive semidefinite. Then the quadratic function

       f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩

   is convex (since ∇²f(x) = A ⪰ 0).
3. The following functions of one variable belong to F¹(R):

       f(x) = e^x,
       f(x) = |x|^p,  p > 1,
       f(x) = x²/(1 − |x|),
       f(x) = |x| − ln(1 + |x|).

We can check this using Theorem 2.1.4. Therefore, functions arising in Geometric
Optimization (see Sect. 5.4.8), like

    f(x) = Σ_{i=1}^m e^{α_i + ⟨a_i, x⟩},

are convex (see Lemma 2.1.2). Similarly, functions arising in ℓ_p-norm approximation
problems, like

    f(x) = Σ_{i=1}^m |⟨a_i, x⟩ − b_i|^p,

are convex too.
4. Consider the function f(x) = ln(Σ_{i=1}^n e^{x^(i)}), x ∈ R^n. Define σ(x) = Σ_{i=1}^n e^{x^(i)}.
For an arbitrary h ∈ R^n, we have

    ⟨∇f(x), h⟩ = (1/σ(x)) Σ_{i=1}^n e^{x^(i)} h^(i),

    ⟨∇²f(x)h, h⟩ = (1/σ(x)) Σ_{i=1}^n e^{x^(i)} (h^(i))² − (1/σ²(x)) (Σ_{i=1}^n e^{x^(i)} h^(i))²
                 = ⟨[(1/σ(x)) D(x) − (1/σ²(x)) d(x)d(x)^T] h, h⟩,

where D(x) is a diagonal matrix with diagonal entries e^{x^(i)}, i = 1, …, n, and
the vector d(x) ∈ R^n has the same entries. Since σ(x) = ⟨d(x), ē_n⟩, it is easy
to see that D(x) ⪰ (1/σ(x)) d(x)d(x)^T. Thus, by Theorem 2.1.4 the function f is
convex on R^n. □
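As a quick numerical illustration (added here, not from the original text), the gradient inequality (2.1.2) for the log-sum-exp function can be tested at random pairs of points:

```python
import math
import random

def f(x):
    # log-sum-exp
    return math.log(sum(math.exp(xi) for xi in x))

def grad_f(x):
    # The gradient of log-sum-exp is the softmax vector
    s = sum(math.exp(xi) for xi in x)
    return [math.exp(xi) / s for xi in x]

random.seed(0)
ok = True
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(4)]
    y = [random.uniform(-3, 3) for _ in range(4)]
    lhs = f(y)
    rhs = f(x) + sum(g * (yi - xi) for g, yi, xi in zip(grad_f(x), y, x))
    ok = ok and (lhs >= rhs - 1e-12)
print(ok)  # True: f(y) >= f(x) + <grad f(x), y - x> at all sampled pairs
```

Of course, sampling proves nothing; the convexity of log-sum-exp is established rigorously by the Hessian argument above.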
Note that for general convex functions, differentiability itself cannot ensure any
favorable growth properties. Therefore, we need to consider problem classes
with some bounds on the derivatives. The most important functions of this type are
convex functions whose gradient is Lipschitz continuous in the standard Euclidean
norm. However, for future use in this book, let us explicitly state the necessary
and sufficient conditions for Lipschitz continuity of the gradient with respect to an
arbitrary norm ‖·‖ in R^n. In this case, the size of linear functions on R^n (e.g. the
gradients) must be measured in the dual norm

    ‖g‖_* = max_{x∈R^n} {⟨g, x⟩ : ‖x‖ ≤ 1}.

This definition is necessary and sufficient for the justification of the Cauchy–Schwarz
inequality:

    ⟨g, x⟩ ≤ ‖g‖_* · ‖x‖,    x, g ∈ R^n.                                      (2.1.6)

Thus, for functions with Lipschitz continuous gradient with respect to the norm ‖·‖,
we introduce a new notation: f ∈ F_L^{1,1}(Q, ‖·‖) means that Q ⊆ dom f and

    ‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖,    ∀x, y ∈ Q.                              (2.1.7)

If in this notation the norm is missing, then we are working with the standard
Euclidean norm (e.g. F_L^{1,1}(R^n)). Let us prove that this norm is self-dual.
Lemma 2.1.3 For any x and s in R^n we have

    max_{x∈R^n} {⟨s, x⟩ : Σ_{i=1}^n (x^(i))² ≤ 1} = (Σ_{i=1}^n (s^(i))²)^{1/2}.

Proof Let ‖·‖ be the standard Euclidean norm. By simple coordinate maximization,
it is easy to check that

    max_{x∈R^n} {2⟨s, x⟩ − ‖x‖²} = max_{x∈R^n} Σ_{i=1}^n [2s^(i)x^(i) − (x^(i))²] = ‖s‖².

On the other hand,

    max_{x∈R^n} {2⟨s, x⟩ − ‖x‖²} = max_{x∈R^n, τ∈R} {2τ⟨s, x⟩ − τ²‖x‖²} = max_{x≠0} ⟨s, x⟩²/‖x‖²
                                 = max_{‖x‖≤1} ⟨s, x⟩². □
Thus, the standard Euclidean norm can be used both for measuring sizes of points
and gradients. Before we proceed, let us prove a simple property of general norms.

Lemma 2.1.4 For all x, y ∈ R^n and α ∈ [0, 1] we have

    α‖x‖² + (1−α)‖y‖² ≥ α(1−α)(‖x‖ + ‖y‖)² ≥ α(1−α)‖x − y‖².                  (2.1.8)

Proof Using the inequality a² + b² ≥ 2ab with a = α‖x‖ and b = (1−α)‖y‖,
we get the first inequality. The second one follows from the triangle inequality for
norms. □

Theorem 2.1.5 All conditions below, holding for all x, y ∈ R^n and α from [0, 1],
are equivalent to the inclusion f ∈ F_L^{1,1}(R^n, ‖·‖):

    0 ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖x − y‖²,                        (2.1.9)

    f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖_*² ≤ f(y),               (2.1.10)

    (1/L)‖∇f(x) − ∇f(y)‖_*² ≤ ⟨∇f(x) − ∇f(y), x − y⟩,                        (2.1.11)

    0 ≤ ⟨∇f(x) − ∇f(y), x − y⟩ ≤ L‖x − y‖²,                                  (2.1.12)

    αf(x) + (1−α)f(y) ≥ f(αx + (1−α)y)
                        + (α(1−α)/(2L))‖∇f(x) − ∇f(y)‖_*²,                   (2.1.13)

    0 ≤ αf(x) + (1−α)f(y) − f(αx + (1−α)y) ≤ α(1−α)(L/2)‖x − y‖².            (2.1.14)

Moreover, if f ∈ F_L^{1,1}(Q), then inequalities (2.1.9), (2.1.12), and (2.1.14) are valid
for all x, y ∈ Q.
Proof Indeed, the first inequality in (2.1.9) follows from the definition of convex
functions. To prove the second one, note that, by (2.1.6) and (2.1.7),

    f(y) − f(x) − ⟨∇f(x), y − x⟩ = ∫_0^1 ⟨∇f(x + τ(y−x)) − ∇f(x), y − x⟩ dτ
                                 ≤ ∫_0^1 Lτ‖y − x‖² dτ = (L/2)‖y − x‖².

Further, let us fix x_0 ∈ R^n. Consider the function φ(y) = f(y) − ⟨∇f(x_0), y⟩.
Note that φ ∈ F_L^{1,1}(R^n, ‖·‖) and its optimal point is y* = x_0. Therefore, in view
of (2.1.9), we have

    φ(y*) = min_{x∈R^n} φ(x) ≤ min_{x∈R^n} [φ(y) + ⟨∇φ(y), x − y⟩ + (L/2)‖x − y‖²]
          = min_{r≥0} [φ(y) − r‖∇φ(y)‖_* + (L/2)r²] = φ(y) − (1/(2L))‖∇φ(y)‖_*²,

where the last minimization uses (2.1.6), and we get (2.1.10) since ∇φ(y) = ∇f(y) − ∇f(x_0).

We obtain (2.1.11) from inequality (2.1.10) by adding two copies of it with x
and y interchanged. Applying the Cauchy–Schwarz inequality to (2.1.11), we get
‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖.
In the same way, we can obtain (2.1.12) from (2.1.9). In order to get (2.1.9)
from (2.1.12), we apply integration:

    f(y) − f(x) − ⟨∇f(x), y − x⟩ = ∫_0^1 ⟨∇f(x + τ(y−x)) − ∇f(x), y − x⟩ dτ
                                 ≤ (L/2)‖y − x‖².

Let us now prove the two last inequalities. Define x_α = αx + (1−α)y. Then,
using (2.1.10), we get

    f(x) ≥ f(x_α) + ⟨∇f(x_α), (1−α)(x − y)⟩ + (1/(2L))‖∇f(x) − ∇f(x_α)‖_*²,

    f(y) ≥ f(x_α) + ⟨∇f(x_α), α(y − x)⟩ + (1/(2L))‖∇f(y) − ∇f(x_α)‖_*².

Adding these inequalities multiplied by α and (1−α) respectively, and using
inequality (2.1.8), we get (2.1.13). It is easy to check that we get (2.1.10)
from (2.1.13) by letting α → 1.
Similarly, from (2.1.9) we get

    f(x) ≤ f(x_α) + ⟨∇f(x_α), (1−α)(x − y)⟩ + (L/2)‖(1−α)(x − y)‖²,

    f(y) ≤ f(x_α) + ⟨∇f(x_α), α(y − x)⟩ + (L/2)‖α(y − x)‖².

Adding these inequalities multiplied by α and (1−α) respectively, we
obtain (2.1.14), and we get back to (2.1.9) as α → 1. □

Finally, let us characterize the class F_L^{2,1}(R^n, ‖·‖).

Theorem 2.1.6 A twice continuously differentiable function f belongs to the class
F_L^{2,1}(R^n, ‖·‖) if and only if for any x, h ∈ R^n we have

    0 ≤ ⟨∇²f(x)h, h⟩ ≤ L‖h‖².                                                (2.1.15)

Proof The first condition characterizes the convexity of the function f(·), and it was
proved in Theorem 2.1.4. The second inequality is a limiting case of (2.1.12). □

Note that for the class F_L^{2,1}(R^n), condition (2.1.15) can be written in the form
of a matrix inequality:

    0 ⪯ ∇²f(x) ⪯ L·I_n,    x ∈ R^n.                                          (2.1.16)
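As a sanity check of Theorem 2.1.5 (an added illustration, not from the book), inequality (2.1.10) can be verified numerically for a simple quadratic f(x) = ½⟨Ax, x⟩ with A = diag(2, 1), for which L = 2 in the Euclidean norm:

```python
import random

# f(x) = 0.5 * <A x, x> with A = diag(2, 1); grad f(x) = A x; here L = 2
A = [2.0, 1.0]
L = 2.0

def f(x):    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))
def grad(x): return [a * xi for a, xi in zip(A, x)]

random.seed(1)
ok = True
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(2)]
    y = [random.uniform(-5, 5) for _ in range(2)]
    g_x, g_y = grad(x), grad(y)
    inner = sum(gi * (yi - xi) for gi, yi, xi in zip(g_x, y, x))
    gap2 = sum((a - b) ** 2 for a, b in zip(g_x, g_y))
    # inequality (2.1.10): f(x) + <grad f(x), y-x> + (1/2L)||grad f(x)-grad f(y)||^2 <= f(y)
    ok = ok and (f(x) + inner + gap2 / (2.0 * L) <= f(y) + 1e-9)
print(ok)  # True
```

For this quadratic the inequality even holds with equality in the coordinate where the curvature attains L, which is why a small tolerance is used in the comparison.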



2.1.2 Lower Complexity Bounds for F_L^{∞,1}(R^n)

Let us check our potential ability to minimize smooth convex functions. In this
section, we obtain the lower complexity bounds for optimization problems with
objective functions from F_L^{∞,1}(R^n) (and, consequently, F_L^{1,1}(R^n)).
Recall that our problem class is as follows.

    Model:                 min_{x∈R^n} f(x),  f ∈ F_L^{∞,1}(R^n).

    Oracle:                First-order local Black Box.

    Approximate solution:  x̄ ∈ R^n,  f(x̄) − f* ≤ ε.

In order to make our considerations simpler, let us introduce the following assump-
tion on iterative processes.
Assumption 2.1.4 An iterative method M generates a sequence of test points {xk }
such that

xk ∈ x0 + Lin{∇f (x0 ), . . . , ∇f (xk−1 )}, k ≥ 1.

This assumption is not absolutely necessary and it can be avoided using more
sophisticated reasoning. However, it holds for the majority of practical methods.
We can prove the lower complexity bounds for our problem class without
developing a resisting oracle. Instead, we just point out the “worst function in the
world” belonging to the class FL∞,1 (Rn ). This function appears to be difficult for
all iterative schemes satisfying Assumption 2.1.4.
Let us fix some constant L > 0. Consider the following family of quadratic
functions

    f_k(x) = (L/4) { (1/2)[(x^(1))² + Σ_{i=1}^{k−1} (x^(i) − x^(i+1))² + (x^(k))²] − x^(1) }

for k = 1 … n. Note that for all h ∈ R^n, we have

    ⟨∇²f_k(x)h, h⟩ = (L/4) [(h^(1))² + Σ_{i=1}^{k−1} (h^(i) − h^(i+1))² + (h^(k))²] ≥ 0,

and

    ⟨∇²f_k(x)h, h⟩ ≤ (L/4) [(h^(1))² + Σ_{i=1}^{k−1} 2((h^(i))² + (h^(i+1))²) + (h^(k))²]
                   ≤ L Σ_{i=1}^n (h^(i))².

Thus, 0 ⪯ ∇²f_k(x) ⪯ L·I_n. Therefore, f_k(·) ∈ F_L^{∞,1}(R^n), 1 ≤ k ≤ n.
Let us compute the minimal value of the function f_k. Note that ∇²f_k(x) = (L/4)A_k
with

    A_k = ( T_k          0_{k,n−k}
            0_{n−k,k}    0_{n−k,n−k} ),

where the upper-left (k × k) block T_k is tridiagonal,

    T_k = (  2  −1
            −1   2  −1
                 ⋱   ⋱   ⋱
                    −1   2  −1
                        −1   2 ),
and 0_{k,p} denotes a (k × p) zero matrix. Therefore, the equation

    ∇f_k(x) = (L/4)(A_k x − e_1) = 0

has the following unique solution:

    x̄_k^(i) = 1 − i/(k+1) for i = 1 … k;    x̄_k^(i) = 0 for k+1 ≤ i ≤ n.

Hence, the optimal value of the function f_k is

    f_k* = (L/4)[½⟨A_k x̄_k, x̄_k⟩ − ⟨e_1, x̄_k⟩] = −(L/8)⟨e_1, x̄_k⟩
         = (L/8)(−1 + 1/(k+1)).                                             (2.1.17)

Note also that

    Σ_{i=1}^k i² = k(k+1)(2k+1)/6 ≤ (k+1)³/3.                                (2.1.18)

Therefore,

    ‖x̄_k‖² = Σ_{i=1}^n (x̄_k^(i))² = Σ_{i=1}^k (1 − i/(k+1))²
           = k − (2/(k+1)) Σ_{i=1}^k i + (1/(k+1)²) Σ_{i=1}^k i²             (2.1.19)
           ≤ k − (2/(k+1)) · k(k+1)/2 + (1/(k+1)²) · (k+1)³/3 = (1/3)(k+1).
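The claims about the minimizer of f_k can be verified mechanically. The sketch below (an added illustration; the values of k, n and L are arbitrary) builds the matrix A_k, checks the first-order condition A_k x̄_k = e_1, and compares f_k(x̄_k) with formula (2.1.17):

```python
def worst_function_check(k, n, L=1.0):
    # Tridiagonal (k x k) block of A_k embedded in an n x n zero matrix
    A = [[0.0] * n for _ in range(n)]
    for i in range(k):
        A[i][i] = 2.0
        if i + 1 < k:
            A[i][i + 1] = -1.0
            A[i + 1][i] = -1.0
    # Claimed minimizer: 1 - i/(k+1) for i = 1..k, zero afterwards
    xbar = [1.0 - (i + 1) / (k + 1) if i < k else 0.0 for i in range(n)]
    # First-order condition A_k xbar = e_1
    Ax = [sum(A[i][j] * xbar[j] for j in range(n)) for i in range(n)]
    e1 = [1.0] + [0.0] * (n - 1)
    ok = all(abs(Ax[i] - e1[i]) < 1e-12 for i in range(n))
    # f_k(xbar) = (L/4)(0.5 <A xbar, xbar> - <e1, xbar>) vs (L/8)(-1 + 1/(k+1))
    fval = (L / 4.0) * (0.5 * sum(Ax[i] * xbar[i] for i in range(n)) - xbar[0])
    fstar = (L / 8.0) * (-1.0 + 1.0 / (k + 1))
    return ok, abs(fval - fstar) < 1e-12

print(worst_function_check(5, 8))  # (True, True)
```

This only confirms the algebra for particular k and n; the point of the construction, of course, is the behavior of f_k along Krylov-type subspaces, established in Lemma 2.1.5 below.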

Let R^{k,n} = {x ∈ R^n | x^(i) = 0, k+1 ≤ i ≤ n}. This is the subspace of
R^n in which only the first k components of a point can differ from zero. From the
analytical form of the functions {f_k}, it is easy to see that for all x ∈ R^{k,n} we have

    f_p(x) ≡ f_k(x),    p = k, …, n.

Let us fix some p, 1 ≤ p ≤ n.


Lemma 2.1.5 Let x_0 = 0. Then for any sequence {x_k}_{k=0}^p satisfying the condition

    x_k ∈ L_k := Lin{∇f_p(x_0), …, ∇f_p(x_{k−1})},

we have L_k ⊆ R^{k,n}.

Proof Since x_0 = 0, we have ∇f_p(x_0) = −(L/4)e_1 ∈ R^{1,n}. Thus L_1 ≡ R^{1,n}.
Let L_k ⊆ R^{k,n} for some k < p. Since the matrix A_p is tridiagonal, for any
x ∈ R^{k,n} we have ∇f_p(x) ∈ R^{k+1,n}. Therefore L_{k+1} ⊆ R^{k+1,n}, and we can
complete the proof by induction. □

Corollary 2.1.1 For any sequence {x_k}_{k=0}^p with x_0 = 0 and x_k ∈ L_k, we have

    f_p(x_k) ≥ f_k*.

Proof Indeed, x_k ∈ L_k ⊆ R^{k,n} and therefore f_p(x_k) = f_k(x_k) ≥ f_k*. □
Now we are ready to prove the main result of this section.
Theorem 2.1.7 For any k, 1 ≤ k ≤ ½(n − 1), and any x_0 ∈ R^n, there exists
a function f ∈ F_L^{∞,1}(R^n) such that for any first-order method M satisfying
Assumption 2.1.4 we have

    f(x_k) − f* ≥ 3L‖x_0 − x*‖² / (32(k+1)²),

    ‖x_k − x*‖² ≥ (1/8)‖x_0 − x*‖²,

where x* is the minimum of the function f and f* = f(x*).


Proof It is clear that the methods of this type are invariant with respect to a
simultaneous shift of all objects in the space of variables. Thus, the sequence of
iterates, which is generated by such a method for the function f(·) starting from x_0,
is just a shift of the sequence generated for f̄(x) = f(x + x_0) starting from the
origin. Therefore, we can assume that x_0 = 0.
Let us prove the first inequality. For that, let us fix k and apply M to minimize
f(x) = f_{2k+1}(x). Then x* = x̄_{2k+1} and f* = f*_{2k+1}. Using Corollary 2.1.1, we
conclude that

    f(x_k) ≡ f_{2k+1}(x_k) = f_k(x_k) ≥ f_k*.

Hence, since x_0 = 0, in view of (2.1.17) and (2.1.19) we get the following estimate:

    (f(x_k) − f*) / ‖x_0 − x*‖² ≥ (L/8)[−1 + 1/(k+1) + 1 − 1/(2k+2)] / ((1/3)(2k+2))
                                = (3/8)L · 1/(4(k+1)²) = 3L/(32(k+1)²).

Let us prove the second inequality. Since x_k ∈ R^{k,n} and x_0 = 0, we have

    ‖x_k − x*‖² ≥ Σ_{i=k+1}^{2k+1} (x̄_{2k+1}^(i))² = Σ_{i=k+1}^{2k+1} (1 − i/(2k+2))²
                = k + 1 − (1/(k+1)) Σ_{i=k+1}^{2k+1} i + (1/(4(k+1)²)) Σ_{i=k+1}^{2k+1} i².

In view of (2.1.18), we have

    Σ_{i=k+1}^{2k+1} i² = (1/6)[(2k+1)(2k+2)(4k+3) − k(k+1)(2k+1)]
                        = (1/6)(k+1)(2k+1)(7k+6).

Therefore, using (2.1.19) we finally obtain

    ‖x_k − x*‖² ≥ k + 1 − (1/(k+1)) · (3k+2)(k+1)/2 + (2k+1)(7k+6)/(24(k+1))
                = (2k+1)(7k+6)/(24(k+1)) − k/2 = (2k²+7k+6)/(24(k+1))
                ≥ ((2k²+7k+6)/(16(k+1)²)) ‖x_0 − x̄_{2k+1}‖² ≥ (1/8)‖x_0 − x*‖². □


The above theorem is valid only under the assumption that the number of steps
of the iterative scheme is not too large as compared with the dimension of the space
of variables (k ≤ 12 (n − 1)). Complexity bounds of this type are called uniform
in the dimension. Clearly, they are valid for very large problems, in which we
cannot even wait for n iterates of the method. However, even for problems with
a moderate dimension, these bounds also provide us with some information. Firstly,
they describe the potential performance of numerical methods at the initial stage of
the minimization process. Secondly, they warn us that without a direct use of finite-
dimensional arguments we cannot justify a better complexity of the corresponding
numerical scheme.
To conclude this section, let us note that the obtained lower bound for the value
of the objective function is rather optimistic. Indeed, after one hundred iterations
we could decrease the initial residual by a factor of 10⁴. However, the result on the
behavior of the minimizing sequence is quite disappointing. The convergence to the
optimal point can be arbitrarily slow. Since this is a lower bound, this conclusion is
inevitable for our problem class. The only thing we can do is to try to find problem
classes in which the situation could be better. This is the goal of the next section.

2.1.3 Strongly Convex Functions

Let us look at a possible restriction of the functional class F_L^{1,1}(R^n, ‖·‖), for
which we can guarantee a reasonable rate of convergence to a unique solution of
the minimization problem

    min_{x∈R^n} f(x),    f ∈ F¹(R^n, ‖·‖).

Recall that in Sect. 1.2.3 we proved that in a small neighborhood of a
nondegenerate local minimum the Gradient Method (1.2.15) converges linearly. Let
us try to globalize this non-degeneracy assumption. Namely, let us assume that there
exists some constant μ > 0 such that for any x̄ with ∇f(x̄) = 0 and any x ∈ R^n
we have

    f(x) ≥ f(x̄) + ½μ‖x − x̄‖².

Recall that the norm in this definition can be general.
Using the same reasoning as in the beginning of Sect. 2.1.1, we obtain the class
of strongly convex functions.

Definition 2.1.3 A continuously differentiable function f(·) is called strongly
convex on a convex set Q (notation f ∈ S¹_μ(Q, ‖·‖)) if there exists a constant μ > 0
such that for any x, y ∈ Q we have

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ½μ‖y − x‖².                               (2.1.20)

The constant μ is called the convexity parameter of the function f.
We will also consider the classes S^{k,l}_{μ,L}(Q, ‖·‖), where the indices k, l and L
have the same meaning as for the class C_L^{k,l}(Q).
Let us mention the most important properties of strongly convex functions.
Theorem 2.1.8 If f ∈ S¹_μ(Rn) and ∇f(x*) = 0, then

f(x) ≥ f(x*) + (1/2)μ‖x − x*‖²  (2.1.21)

for all x ∈ Rn.
Proof Since ∇f(x*) = 0, for any x ∈ Rn, we have

f(x) ≥ f(x*) + ⟨∇f(x*), x − x*⟩ + (1/2)μ‖x − x*‖²   (by (2.1.20))

     = f(x*) + (1/2)μ‖x − x*‖². □
Let us describe the result of addition of two strongly convex functions.
Lemma 2.1.6 If f1 ∈ S¹_{μ1}(Q1, ‖·‖), f2 ∈ S¹_{μ2}(Q2, ‖·‖) and α, β ≥ 0, then

f = αf1 + βf2 ∈ S¹_{αμ1+βμ2}(Q1 ∩ Q2, ‖·‖).

Proof For any x, y ∈ Q1 ∩ Q2, we have

f1(y) ≥ f1(x) + ⟨∇f1(x), y − x⟩ + (1/2)μ1‖y − x‖²,

f2(y) ≥ f2(x) + ⟨∇f2(x), y − x⟩ + (1/2)μ2‖y − x‖².

It remains to add these inequalities multiplied by α and β respectively. □

Note that the class S01 (Q,  · ) coincides with F 1 (Q,  · ). Therefore, addition
of a convex function and a strongly convex function gives a strongly convex function
with the same value of convexity parameter.
Let us give several equivalent definitions of strongly convex functions.
2.1 Minimization of Smooth Functions 75

Theorem 2.1.9 Let f be continuously differentiable. Both conditions below, holding
for all x, y ∈ Q and α ∈ [0, 1], are equivalent to the inclusion f ∈ S¹_μ(Q, ‖·‖):

⟨∇f(x) − ∇f(y), x − y⟩ ≥ μ‖x − y‖²,  (2.1.22)

αf(x) + (1 − α)f(y) ≥ f(αx + (1 − α)y) + α(1 − α)(μ/2)‖x − y‖².  (2.1.23)

The proof of this theorem is very similar to the proof of Theorem 2.1.5 and we leave
it as an exercise for the reader.
The next statement is sometimes useful.
Theorem 2.1.10 If f ∈ S¹_μ(Rn, ‖·‖), then for any x and y from Rn we have

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2μ))‖∇f(x) − ∇f(y)‖²_*,  (2.1.24)

⟨∇f(x) − ∇f(y), x − y⟩ ≤ (1/μ)‖∇f(x) − ∇f(y)‖²_*,  (2.1.25)

μ‖x − y‖ ≤ ‖∇f(x) − ∇f(y)‖_*.  (2.1.26)

Proof Let us fix some x ∈ Rn. Consider the function

φ(y) = f(y) − ⟨∇f(x), y⟩ ∈ S¹_μ(Rn, ‖·‖).

Since ∇φ(x) = 0, for any y ∈ Rn, we have

φ(x) = min_{v∈Rn} φ(v) ≥ min_{v∈Rn} [φ(y) + ⟨∇φ(y), v − y⟩ + (1/2)μ‖v − y‖²]   (by (2.1.20))

     = φ(y) − (1/(2μ))‖∇φ(y)‖²_*,

and this is exactly (2.1.24). Adding two copies of (2.1.24) with x and y interchanged,
we get (2.1.25). Finally, (2.1.26) follows from (2.1.25) and (2.1.22). □
Let us present a second-order characterization of the class S¹_μ(Q, ‖·‖).
Theorem 2.1.11 Let a continuous function f be twice continuously differentiable
in int Q. It belongs to the class S²_μ(Q, ‖·‖) if and only if for all x ∈ int Q and
h ∈ Rn we have

⟨∇²f(x)h, h⟩ ≥ μ‖h‖².  (2.1.27)

Proof We get (2.1.27) from (2.1.22) by setting y = x + αh ∈ Q with α small
enough and letting α → 0. □

In the case of the standard Euclidean norm, condition (2.1.27) can be written in
the form of a matrix inequality:

∇²f(x) ⪰ μIn,  x ∈ int Q.  (2.1.28)

Now we can look at some examples of strongly convex functions.


Example 2.1.2
1. Let a symmetric matrix A satisfy the conditions μIn ⪯ A ⪯ LIn. Then, since
   ∇²f(x) = A, we have

   f(x) = α + ⟨a, x⟩ + (1/2)⟨Ax, x⟩ ∈ S^{∞,1}_{μ,L}(Rn) ⊂ S^{1,1}_{μ,L}(Rn).

   Adding this function to a convex function, we get other examples of strongly
   convex functions.
2. Let Q = Δ⁺_n ≝ {x ∈ Rⁿ₊ : ⟨ēn, x⟩ ≤ 1}, where ēn ∈ Rn is the vector of all ones.
   Consider the entropy function:

   η(x) = Σ_{i=1}^n x^(i) ln x^(i),  x ∈ Δ⁺_n.  (2.1.29)

   For a direction h ∈ Rn, we have ⟨∇²η(x)h, h⟩ = Σ_{i=1}^n (h^(i))²/x^(i). We need to find
   the minimum of this expression in x ∈ int Δ⁺_n. Since it is decreasing in x,
   we conclude that the inequality constraint is active and we need to compute
   min_{⟨ēn,x⟩=1} Σ_{i=1}^n (h^(i))²/x^(i). In view of Corollary 1.2.1, this minimum x_* can be found from
   the system of equations

   (h^(i))²/(x_*^(i))² = λ_*,  i = 1, …, n,

   where λ_* is the optimal dual multiplier. It can be found from the equation

   1 = Σ_{i=1}^n x_*^(i) = (1/λ_*^{1/2}) Σ_{i=1}^n |h^(i)|.

   Thus, ⟨∇²η(x)h, h⟩ ≥ Σ_{i=1}^n (h^(i))²/x_*^(i) = (Σ_{i=1}^n |h^(i)|)² = ‖h‖₁², and by Theorem 2.1.11 we
   conclude that the entropy function is strongly convex on Δ⁺_n in the ℓ1-norm with
   convexity parameter one. □
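This bound is easy to probe numerically. The following pure-Python sketch (our own illustration, not part of the book) samples random points in the interior of the simplex and random directions, and checks that the Hessian quadratic form of the entropy dominates the squared ℓ1-norm of the direction:

```python
import random

def entropy_hessian_form(x, h):
    # <grad^2 eta(x) h, h> = sum_i (h^(i))^2 / x^(i) for the entropy function
    return sum(hi * hi / xi for xi, hi in zip(x, h))

random.seed(1)
n = 5
violations = 0
for _ in range(1000):
    # random point in the interior of the simplex {x > 0, sum(x) <= 1}
    x = [random.uniform(0.01, 1.0) for _ in range(n)]
    s = sum(x) * random.uniform(1.0, 2.0)
    x = [xi / s for xi in x]              # now sum(x) <= 1
    h = [random.uniform(-1.0, 1.0) for _ in range(n)]
    l1 = sum(abs(hi) for hi in h)
    if entropy_hessian_form(x, h) < l1 * l1 - 1e-12:
        violations += 1

print(violations)  # 0: strong convexity in the l1-norm with parameter one
```

The sample count, dimension, and tolerance are arbitrary choices for the sketch.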

One of the most important functional classes is S^{1,1}_{μ,L}(Rn) (recall that the
corresponding norm is the standard Euclidean one). This class is described by the
following inequalities:

⟨∇f(x) − ∇f(y), x − y⟩ ≥ μ‖x − y‖²,  (2.1.30)

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.  (2.1.31)

The value Qf = L/μ ≥ 1 is called the condition number of the function f.


It is important that inequality (2.1.30) can be strengthened by the additional
information obtained from (2.1.31).
Theorem 2.1.12 If f ∈ S^{1,1}_{μ,L}(Rn), then for any x, y ∈ Rn we have

⟨∇f(x) − ∇f(y), x − y⟩ ≥ (μL/(μ+L))‖x − y‖² + (1/(μ+L))‖∇f(x) − ∇f(y)‖².  (2.1.32)

Proof Define φ(x) = f(x) − (1/2)μ‖x‖². Then ∇φ(x) = ∇f(x) − μx. Hence,
by inequalities (2.1.30) and (2.1.12), φ ∈ F^{1,1}_{L−μ}(Rn). If μ = L, then (2.1.32) is
proved. If μ < L, then by (2.1.11) we have

⟨∇φ(x) − ∇φ(y), x − y⟩ ≥ (1/(L−μ))‖∇φ(x) − ∇φ(y)‖²,

and this is exactly (2.1.32). □




2.1.4 Lower Complexity Bounds for S^{∞,1}_{μ,L}(Rn)

Let us obtain the lower complexity bounds for unconstrained minimization of
functions from the class S^{∞,1}_{μ,L}(Rn) ⊂ S^{1,1}_{μ,L}(Rn). Consider the following problem
class.

Model:  min_{x∈Rn} f(x),  f ∈ S^{∞,1}_{μ,L}(Rn), μ > 0, n ≥ 1.

Oracle:  First-order local Black Box.

Approximate solution:  x̄ : f(x̄) − f* ≤ ε,  ‖x̄ − x*‖² ≤ ε.



As in the previous section, we consider methods satisfying Assumption 2.1.4. We
are going to find the lower complexity bounds for our problems in terms of the
condition number Qf = L/μ. Note that in the description of our problem class, we do
not fix the dimension of the space of variables. Therefore, formally this class also
includes an infinite-dimensional problem.
We are going to give an example of a bad function defined in an infinite-
dimensional space. It is also possible to do this in finite dimensions, but the
corresponding reasoning is more complicated.
Consider R∞ ≡ ℓ2, the space of all sequences x = {x^(i)}_{i=1}^∞ with finite standard
Euclidean norm

‖x‖² = Σ_{i=1}^∞ (x^(i))² < ∞.

Let us choose two parameters, μ > 0 and Qf > 1, which define the following
function

f_{μ,Qf}(x) = (μ(Qf − 1)/8)[(x^(1))² + Σ_{i=1}^∞ (x^(i) − x^(i+1))² − 2x^(1)] + (μ/2)‖x‖².

Let L = μQf and let A be the infinite tridiagonal matrix

    ⎛ 2 −1  0  0 ⋯ ⎞
A = ⎜−1  2 −1  0 ⋯ ⎟
    ⎜ 0 −1  2  ⋱   ⎟
    ⎝ ⋮      ⋱  ⋱  ⎠

Then ∇²f_{μ,Qf}(x) = (μ(Qf − 1)/4) A + μI, where I is the unit operator in R∞. As in
Sect. 2.1.2, we can see that 0 ⪯ A ⪯ 4I. Therefore,

μI ⪯ ∇²f_{μ,Qf}(x) ⪯ (μ(Qf − 1) + μ)I = μQf I = LI.

This means that f_{μ,Qf} ∈ S^{∞,1}_{μ,L}(R∞). Note that the condition number of the
function f_{μ,Qf} is Qf.
Let us find the minimum of the function f_{μ,Qf}. The first-order optimality
condition

∇f_{μ,Qf}(x) ≡ ((μ(Qf − 1)/4) A + μI) x − (μ(Qf − 1)/4) e1 = 0

can be written as

(A + (4/(Qf − 1)) I) x = e1.

The coordinate form of this equation is as follows:

2((Qf + 1)/(Qf − 1)) x^(1) − x^(2) = 1,

x^(k+1) − 2((Qf + 1)/(Qf − 1)) x^(k) + x^(k−1) = 0,  k = 2, … .  (2.1.33)

Let q be the smallest root of the equation

q² − 2((Qf + 1)/(Qf − 1)) q + 1 = 0,

that is, q = (√Qf − 1)/(√Qf + 1). Then the sequence (x*)^(k) = q^k, k = 1, 2, …, satisfies the
system (2.1.33). Thus, we come to the following result.
Theorem 2.1.13 For any x0 ∈ R∞ and any constants μ > 0, Qf > 1, there
exists a function f ∈ S^{∞,1}_{μ,L}(R∞) such that for any first-order method M satisfying
Assumption 2.1.4, we have

‖xk − x*‖² ≥ ((√Qf − 1)/(√Qf + 1))^{2k} ‖x0 − x*‖²,  (2.1.34)

f(xk) − f(x*) ≥ (μ/2)((√Qf − 1)/(√Qf + 1))^{2k} ‖x0 − x*‖²,  (2.1.35)

where x* is the unique unconstrained minimum of the function f.


Proof Indeed, we can assume that x0 = 0. Let us choose f(x) = f_{μ,Qf}(x). Then

‖x0 − x*‖² = Σ_{i=1}^∞ [(x*)^(i)]² = Σ_{i=1}^∞ q^{2i} = q²/(1 − q²).

Since ∇²f_{μ,Qf}(x) is a tri-diagonal operator and ∇f_{μ,Qf}(0) = −((L−μ)/4) e1, we
conclude that xk ∈ R^{k,∞}. Therefore

‖xk − x*‖² ≥ Σ_{i=k+1}^∞ [(x*)^(i)]² = Σ_{i=k+1}^∞ q^{2i} = q^{2(k+1)}/(1 − q²) = q^{2k}‖x0 − x*‖².

The second bound of this theorem follows from (2.1.34) and Theorem 2.1.8. □
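The explicit solution used in this construction can be verified directly. The following small Python check (ours, not the book's) confirms that x^(k) = q^k with q = (√Qf − 1)/(√Qf + 1) satisfies both equations of the system (2.1.33):

```python
import math

Qf = 10.0
c = (Qf + 1.0) / (Qf - 1.0)
q = (math.sqrt(Qf) - 1.0) / (math.sqrt(Qf) + 1.0)

# first equation of (2.1.33): 2*c*x^(1) - x^(2) = 1
res_first = abs(2.0 * c * q - q ** 2 - 1.0)
# generic equation: x^(k+1) - 2*c*x^(k) + x^(k-1) = 0 for k = 2, 3, ...
res_generic = max(abs(q ** (k + 1) - 2.0 * c * q ** k + q ** (k - 1))
                  for k in range(2, 50))
print(res_first, res_generic)  # both are zero up to rounding errors
```

The value Qf = 10 is an arbitrary test choice; any Qf > 1 behaves the same way.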


2.1.5 The Gradient Method

Let us describe the performance of the Gradient Method as applied to the problem

min_{x∈Rn} f(x)  (2.1.36)

with f ∈ FL1,1 (Rn ). Recall that the scheme of the Gradient Method is as follows.

Gradient Method

0. Choose x0 ∈ Rn.
1. kth iteration (k ≥ 0).                                     (2.1.37)
   (a) Compute f(xk) and ∇f(xk).
   (b) Find x_{k+1} = xk − hk∇f(xk) (see Sect. 1.2.3 for step-size rules).

In this section, we analyze the simplest variant of the gradient scheme with hk =
h > 0. It is possible to show that for all other reasonable step-size rules the rate of
convergence of this method is similar. Denote by x ∗ an arbitrary optimal point of
our problem, and let f ∗ = f (x ∗ ).
Theorem 2.1.14 Let f ∈ F_L^{1,1}(Rn) and 0 < h < 2/L. Then the Gradient Method
generates a sequence of points {xk} with function values satisfying the inequality

f(xk) − f* ≤ 2(f(x0) − f*)‖x0 − x*‖² / (2‖x0 − x*‖² + k·h(2 − Lh)·(f(x0) − f*)),  k ≥ 0.

Proof Let rk = ‖xk − x*‖. Then

r²_{k+1} = ‖xk − x* − h∇f(xk)‖² = r²_k − 2h⟨∇f(xk), xk − x*⟩ + h²‖∇f(xk)‖²

        ≤ r²_k − h(2/L − h)‖∇f(xk)‖²

(we use (2.1.11) and ∇f(x*) = 0). Therefore, rk ≤ r0. In view of (2.1.9), we have

f(x_{k+1}) ≤ f(xk) + ⟨∇f(xk), x_{k+1} − xk⟩ + (L/2)‖x_{k+1} − xk‖²

         = f(xk) − ω‖∇f(xk)‖²,

where ω = h(1 − (L/2)h). Define Δk = f(xk) − f*. Then

Δk ≤ ⟨∇f(xk), xk − x*⟩ ≤ r0‖∇f(xk)‖   (by (2.1.2)).

Therefore, Δ_{k+1} ≤ Δk − (ω/r²_0)Δ²_k. Thus,

1/Δ_{k+1} ≥ 1/Δk + (ω/r²_0)·(Δk/Δ_{k+1}) ≥ 1/Δk + ω/r²_0.

Summing up these inequalities, we get

1/Δ_{k+1} ≥ 1/Δ0 + (ω/r²_0)(k + 1). □



In order to choose the optimal step size, we need to maximize the function
φ(h) = h(2 − Lh) with respect to h. The first-order optimality condition φ′(h) =
2 − 2Lh = 0 provides us with the value h* = 1/L. In this case, we get the following
rate of convergence for the Gradient Method:

f(xk) − f* ≤ 2L(f(x0) − f*)‖x0 − x*‖² / (2L‖x0 − x*‖² + k·(f(x0) − f*)).  (2.1.38)

Further, in view of (2.1.9) we have

f(x0) ≤ f* + ⟨∇f(x*), x0 − x*⟩ + (L/2)‖x0 − x*‖² = f* + (L/2)‖x0 − x*‖².

Since the right-hand side of inequality (2.1.38) is increasing in f(x0) − f*, we
obtain the following result.
Corollary 2.1.2 If h = 1/L and f ∈ F_L^{1,1}(Rn), then

f(xk) − f* ≤ 2L‖x0 − x*‖²/(k + 4).  (2.1.39)

Let us estimate the performance of the Gradient Method on the class of strongly
convex functions.
Theorem 2.1.15 If f ∈ S^{1,1}_{μ,L}(Rn) and 0 < h ≤ 2/(μ + L), then the Gradient Method
generates a sequence {xk} such that

‖xk − x*‖² ≤ (1 − 2hμL/(μ + L))^k ‖x0 − x*‖².

If h = 2/(μ + L), then

‖xk − x*‖ ≤ ((Qf − 1)/(Qf + 1))^k ‖x0 − x*‖,

f(xk) − f* ≤ (L/2)((Qf − 1)/(Qf + 1))^{2k} ‖x0 − x*‖²,

where Qf = L/μ.
Proof Let rk = ‖xk − x*‖. Then

r²_{k+1} = ‖xk − x* − h∇f(xk)‖² = r²_k − 2h⟨∇f(xk), xk − x*⟩ + h²‖∇f(xk)‖²

        ≤ (1 − 2hμL/(μ + L)) r²_k + h(h − 2/(μ + L))‖∇f(xk)‖²

(we use (2.1.32) and ∇f(x*) = 0). The last inequality of the theorem follows from
the previous one and (2.1.9). □
Note that the highest rate of convergence is achieved for h = 2/(μ + L). In this case,

‖xk − x*‖² ≤ ((L − μ)/(L + μ))^{2k} ‖x0 − x*‖².  (2.1.40)

We have already seen the step-size rule h = 2/(μ + L) and the linear rate of convergence
of the Gradient Method in Sect. 1.2.3, Theorem 1.2.4. However, this was only a local
result.
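These estimates are easy to observe in computation. The sketch below (our illustration, not from the book; the objective is a diagonal quadratic with minimum at the origin) runs the Gradient Method with h = 1/L and checks the bound (2.1.39) together with monotonicity of the function values:

```python
# Gradient Method x_{k+1} = x_k - (1/L) grad f(x_k) on a convex quadratic,
# compared against the bound (2.1.39): f(x_k) - f* <= 2L||x0 - x*||^2/(k+4).
# Here x* = 0 and f* = 0 by construction.
a = [0.1, 1.0, 5.0, 10.0]                      # Hessian eigenvalues
L = max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
x = [1.0, 1.0, 1.0, 1.0]
R2 = sum(xi * xi for xi in x)                  # ||x0 - x*||^2

h = 1.0 / L
bound_ok, monotone = True, True
prev = f(x)
for k in range(200):
    if f(x) > 2.0 * L * R2 / (k + 4) + 1e-12:  # bound (2.1.39)
        bound_ok = False
    x = [(1.0 - h * ai) * xi for ai, xi in zip(a, x)]  # x - h*grad f(x)
    if f(x) > prev + 1e-12:
        monotone = False
    prev = f(x)

print(bound_ok, monotone)  # True True
```

The eigenvalues, starting point, and iteration count are arbitrary test choices.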
Comparing the rate of convergence of the Gradient Method with the lower
complexity bounds (Theorems 2.1.7 and 2.1.13), we can see that it is far from
being optimal for the classes F_L^{1,1}(Rn) and S^{1,1}_{μ,L}(Rn). We should also note that on
these problem classes the standard unconstrained minimization methods (Conjugate
Gradients, Variable Metric) are not better. The optimal methods for minimizing
smooth convex and strongly convex functions need the accumulation of some global
information on the objective function. We will describe such schemes in the next
section.

2.2 Optimal Methods

(Estimating sequences and Fast Gradient Methods; Decreasing the norm of the gradient;
Convex sets; Constrained minimization problems; The gradient mapping; Minimization
methods over simple sets.)
2.2 Optimal Methods 83

2.2.1 Estimating Sequences

Let us consider the following unconstrained minimization problem:

min_{x∈Rn} f(x),  (2.2.1)

where f is strongly convex: f ∈ S^{1,1}_{μ,L}(Rn), μ ≥ 0. Since S^{1,1}_{0,L}(Rn) ≡ F_L^{1,1}(Rn),
this family of classes also contains the class of convex functions with Lipschitz
continuous gradient. We assume that there exists a solution x* of problem (2.2.1)
and define f* = f(x*).
and define f ∗ = f (x ∗ ).
In Sect. 2.1, we proved the following convergence rates for the Gradient Method:

F_L^{1,1}(Rn):  f(xk) − f* ≤ 2L‖x0 − x*‖²/(k + 4),

S^{1,1}_{μ,L}(Rn):  f(xk) − f* ≤ (L/2)((L − μ)/(L + μ))^{2k} ‖x0 − x*‖².

These estimates differ from our lower complexity bounds (Theorem 2.1.7 and
Theorem 2.1.13) by an order of magnitude. Of course, generally speaking, this does
not mean that the Gradient Method is not optimal (it may be that the lower bounds
are too optimistic). However, we will see that in our case the lower bounds are
exact up to a constant factor. We prove this by constructing a method with rate of
convergence proportional to these bounds.
Recall that the Gradient Method forms a relaxation sequence:

f (xk+1 ) ≤ f (xk ).

This fact is crucial for the justification of its convergence rate (Theorem 2.1.14).
However, in Convex Optimization relaxation is not so important. Firstly, for some
problem classes, this property is quite expensive. Secondly, the schemes and
efficiency estimates of optimal methods are derived from some global topological
properties of convex functions (see Theorem 2.1.5). From this point of view, the
relaxation property is too microscopic to be useful.
The schemes and efficiency bounds of optimal methods are based on the notion
of estimating sequences.
Definition 2.2.1 A pair of sequences {φk(x)}_{k=0}^∞ and {λk}_{k=0}^∞, λk ≥ 0, are called
the estimating sequences of the function f(·) if

λk → 0,

and for any x ∈ Rn and all k ≥ 0 we have

φk(x) ≤ (1 − λk)f(x) + λkφ0(x).  (2.2.2)



The next statement explains why these objects are useful.


Lemma 2.2.1 If for some sequence of points {xk} we have

f(xk) ≤ φ*_k ≝ min_{x∈Rn} φk(x),  (2.2.3)

then f(xk) − f* ≤ λk[φ0(x*) − f*] → 0.
Proof Indeed,

f(xk) ≤ φ*_k = min_{x∈Rn} φk(x) ≤ min_{x∈Rn} [(1 − λk)f(x) + λkφ0(x)]   (by (2.2.2))

      ≤ (1 − λk)f(x*) + λkφ0(x*). □



Thus, for any sequence {xk }, satisfying (2.2.3), we can derive its rate of
convergence directly from the convergence rate of the sequence {λk }. However, at
this moment we have two serious questions. Firstly, we do not know how to form the
estimating sequences. Secondly, we do not know how to satisfy inequalities (2.2.3).
The first question is simpler.
Lemma 2.2.2 Assume that:
1. a function f(·) belongs to the class S^{1,1}_{μ,L}(Rn),
2. φ0(·) is an arbitrary convex function on Rn,
3. {yk}_{k=0}^∞ is an arbitrary sequence of points in Rn,
4. the coefficients {αk}_{k=0}^∞ satisfy the conditions αk ∈ (0, 1) and Σ_{k=0}^∞ αk = ∞,
5. we choose λ0 = 1.
Then the pair of sequences {φk(·)}_{k=0}^∞ and {λk}_{k=0}^∞, defined recursively by the
relations

λ_{k+1} = (1 − αk)λk,

φ_{k+1}(x) = (1 − αk)φk(x) + αk[f(yk) + ⟨∇f(yk), x − yk⟩ + (μ/2)‖x − yk‖²],  (2.2.4)

are estimating sequences.



Proof Indeed, φ0(x) ≤ (1 − λ0)f(x) + λ0φ0(x) ≡ φ0(x). Further, let (2.2.2) hold
for some k ≥ 0. Then

φ_{k+1}(x) ≤ (1 − αk)φk(x) + αk f(x)   (by (2.1.20) and (2.2.4))

= (1 − (1 − αk)λk)f(x) + (1 − αk)(φk(x) − (1 − λk)f(x))

≤ (1 − (1 − αk)λk)f(x) + (1 − αk)λkφ0(x)

= (1 − λ_{k+1})f(x) + λ_{k+1}φ0(x)   (by (2.2.4)).

It remains to note that condition 4 ensures λk → 0. □



Thus, the above statement provides us with some rules for updating the estimating
sequences. Now we have two control sequences which can help us to maintain
recursively the relation (2.2.3). At this moment, we are also free in our choice of
initial function φ0 (x). Let us choose it as a simple quadratic function. Then, we can
obtain a closed form recurrence for values φk∗ .
Lemma 2.2.3 Let φ0(x) = φ*_0 + (γ0/2)‖x − v0‖². Then the process (2.2.4) preserves
the canonical form of the functions {φk(x)}:

φk(x) ≡ φ*_k + (γk/2)‖x − vk‖²,  (2.2.5)

where the sequences {γk}, {vk} and {φ*_k} are defined as follows:

γ_{k+1} = (1 − αk)γk + αkμ,

v_{k+1} = (1/γ_{k+1})[(1 − αk)γkvk + αkμyk − αk∇f(yk)],

φ*_{k+1} = (1 − αk)φ*_k + αk f(yk) − (α²_k/(2γ_{k+1}))‖∇f(yk)‖²
         + (αk(1 − αk)γk/γ_{k+1})((μ/2)‖yk − vk‖² + ⟨∇f(yk), vk − yk⟩).

Proof Note that ∇ 2 φ0 (x) = γ0 In . Let us show that ∇ 2 φk (x) = γk In for all k ≥ 0.
Indeed, if it is true for some k, then

∇ 2 φk+1 (x) = (1 − αk )∇ 2 φk (x) + αk μIn = ((1 − αk )γk + αk μ)In ≡ γk+1 In .



This justifies the canonical form (2.2.5) of the functions φk(·). Further,

φ_{k+1}(x) = (1 − αk)[φ*_k + (γk/2)‖x − vk‖²]
           + αk[f(yk) + ⟨∇f(yk), x − yk⟩ + (μ/2)‖x − yk‖²]   (by (2.2.4)).

Therefore the equation ∇φ_{k+1}(x) = 0, which is the first-order optimality condition
for the function φ_{k+1}(·), is as follows:

(1 − αk)γk(x − vk) + αk∇f(yk) + αkμ(x − yk) = 0.

From this equation, we get a closed form expression for the point v_{k+1}, the minimum
of the function φ_{k+1}(·).
Finally, let us compute φ*_{k+1}. In view of the recurrence (2.2.4) for the sequence
{φk(·)}, we have

φ*_{k+1} + (γ_{k+1}/2)‖yk − v_{k+1}‖² = φ_{k+1}(yk)
    = (1 − αk)[φ*_k + (γk/2)‖yk − vk‖²] + αk f(yk).  (2.2.6)

By the recursive relation for v_{k+1}, we have

v_{k+1} − yk = (1/γ_{k+1})[(1 − αk)γk(vk − yk) − αk∇f(yk)].

Therefore,

(γ_{k+1}/2)‖v_{k+1} − yk‖² = (1/(2γ_{k+1}))[(1 − αk)²γ²_k‖vk − yk‖²
    − 2αk(1 − αk)γk⟨∇f(yk), vk − yk⟩ + α²_k‖∇f(yk)‖²].

It remains to substitute this relation into (2.2.6), taking into account that the
multiplicative factor for the term ‖yk − vk‖² in the resulting expression is as
follows:

(1 − αk)(γk/2) − (1/(2γ_{k+1}))(1 − αk)²γ²_k = (1 − αk)(γk/2)[1 − (1 − αk)γk/γ_{k+1}]

                                             = (1 − αk)(γk/2)·(αkμ/γ_{k+1}). □


The situation now is more transparent, and we are close to getting an algorithmic
scheme. Indeed, assume that we already have xk with

φ*_k ≥ f(xk).

Then, in view of Lemma 2.2.3,

φ*_{k+1} ≥ (1 − αk)f(xk) + αk f(yk) − (α²_k/(2γ_{k+1}))‖∇f(yk)‖²
         + (αk(1 − αk)γk/γ_{k+1})⟨∇f(yk), vk − yk⟩.

Since f(xk) ≥ f(yk) + ⟨∇f(yk), xk − yk⟩ by (2.1.2), we get the following estimate:

φ*_{k+1} ≥ f(yk) − (α²_k/(2γ_{k+1}))‖∇f(yk)‖²
         + (1 − αk)⟨∇f(yk), (αkγk/γ_{k+1})(vk − yk) + xk − yk⟩.

Let us look at this inequality. We want to have φ*_{k+1} ≥ f(x_{k+1}). Recall that we can
ensure the inequality

f(yk) − (1/(2L))‖∇f(yk)‖² ≥ f(x_{k+1})

in many different ways. The simplest one is just to take the gradient step

x_{k+1} = yk − hk∇f(yk)

with hk = 1/L (see (2.1.9)). Let us define αk as a positive root of the quadratic
equation

Lα²_k = (1 − αk)γk + αkμ (= γ_{k+1}).

Then α²_k/(2γ_{k+1}) = 1/(2L), and we can replace the previous inequality by the following one:

φ*_{k+1} ≥ f(x_{k+1}) + (1 − αk)⟨∇f(yk), (αkγk/γ_{k+1})(vk − yk) + xk − yk⟩.

Let us now use our freedom in the choice of yk. It can be found from the equation

(αkγk/γ_{k+1})(vk − yk) + xk − yk = 0.

This gives yk = (αkγkvk + γ_{k+1}xk)/(γk + αkμ), and we come to the following methods, which are often
referred to as Fast Gradient Methods.

General Scheme of Optimal Method

0. Choose a point x0 ∈ Rn, some γ0 > 0, and set v0 = x0.
1. kth iteration (k ≥ 0).
   (a) Compute αk ∈ (0, 1) from the equation

       Lα²_k = (1 − αk)γk + αkμ.                               (2.2.7)

       Set γ_{k+1} = (1 − αk)γk + αkμ.
   (b) Choose yk = (1/(γk + αkμ))[αkγkvk + γ_{k+1}xk]. Compute
       f(yk) and ∇f(yk).
   (c) Find x_{k+1} such that

       f(x_{k+1}) ≤ f(yk) − (1/(2L))‖∇f(yk)‖²

       (see Sect. 1.2.3 for the step-size rules).
   (d) Set v_{k+1} = (1/γ_{k+1})[(1 − αk)γkvk + αkμyk − αk∇f(yk)].

Note that in Step 1(c) of this scheme we can choose an arbitrary x_{k+1} satisfying
the inequality f(x_{k+1}) ≤ f(yk) − (ω/2)‖∇f(yk)‖² with some ω > 0. Then the
constant 1/ω replaces L in the equation of Step 1(a).
Theorem 2.2.1 Scheme (2.2.7) generates a sequence of points {xk}_{k=0}^∞ such that

f(xk) − f* ≤ λk[f(x0) − f* + (γ0/2)‖x0 − x*‖²],

where λ0 = 1 and λk = Π_{i=0}^{k−1}(1 − αi).
Proof Indeed, let us choose φ0(x) = f(x0) + (γ0/2)‖x − v0‖². Then f(x0) = φ*_0,
and we get f(xk) ≤ φ*_k by the rules of the scheme. It remains to use Lemma 2.2.1. □


Thus, in order to estimate the rate of convergence of method (2.2.7), we need
to understand how quickly the sequence {λk} approaches zero. Define

qf = 1/Qf = μ/L.  (2.2.8)
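Before analyzing {λk}, the mechanics of scheme (2.2.7) can be illustrated in code. The sketch below (our own, not from the book; a diagonal quadratic test function, the gradient step in Step 1(c), and φ*k updated by the recursion of Lemma 2.2.3) checks the key invariant f(xk) ≤ φ*k along the iterations:

```python
import math

a = [0.5, 1.0, 4.0, 9.0]               # Hessian eigenvalues of the test function
mu, L = min(a), max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x = [1.0, -1.0, 1.0, -1.0]
v = x[:]
gamma = 2.0 * L                         # gamma_0 in (mu, 3L + mu]
phi_star = f(x)                         # phi_0^* = f(x_0)
ok = True
for k in range(60):
    # Step (a): positive root of L*alpha^2 = (1 - alpha)*gamma + alpha*mu
    b = (mu - gamma) / L
    alpha = 0.5 * (b + math.sqrt(b * b + 4.0 * gamma / L))
    gamma_new = (1.0 - alpha) * gamma + alpha * mu
    # Step (b)
    y = [(alpha * gamma * vi + gamma_new * xi) / (gamma + alpha * mu)
         for vi, xi in zip(v, x)]
    g = grad(y)
    g2 = sum(gi * gi for gi in g)
    dv = [vi - yi for vi, yi in zip(v, y)]
    # phi_{k+1}^* by the recursion of Lemma 2.2.3
    phi_star = ((1 - alpha) * phi_star + alpha * f(y)
                - alpha ** 2 / (2 * gamma_new) * g2
                + alpha * (1 - alpha) * gamma / gamma_new
                  * (0.5 * mu * sum(d * d for d in dv)
                     + sum(gi * di for gi, di in zip(g, dv))))
    # Steps (c) and (d)
    x = [yi - gi / L for yi, gi in zip(y, g)]
    v = [((1 - alpha) * gamma * vi + alpha * mu * yi - alpha * gi) / gamma_new
         for vi, yi, gi in zip(v, y, g)]
    gamma = gamma_new
    if f(x) > phi_star + 1e-9:
        ok = False

print(ok)  # True: f(x_k) <= phi_k^* is maintained, as in the derivation above
```

All numerical choices (eigenvalues, γ0 = 2L, 60 iterations) are arbitrary for the sketch.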

Lemma 2.2.4 If in the method (2.2.7) we choose γ0 ∈ (μ, 3L + μ], then for all
k ≥ 0 we have

λk ≤ 4μ / ((γ0 − μ)·[exp((k+1)q_f^{1/2}/2) − exp(−(k+1)q_f^{1/2}/2)]²) ≤ 4L/((γ0 − μ)(k + 1)²).  (2.2.9)

For γ0 = μ, we have λk = (1 − q_f^{1/2})^k, k ≥ 0.
Proof Let us start from the case γ0 > μ. In accordance with Step 1(a) in (2.2.7),

γ_{k+1} − μ = (1 − αk)(γk − μ) = … = λ_{k+1}(γ0 − μ).  (2.2.10)

Since αk = 1 − λ_{k+1}/λk, from the quadratic equation of Step 1(a) we have

1 − λ_{k+1}/λk = [γ_{k+1}/L]^{1/2} = [μ/L + λ_{k+1}(γ0 − μ)/L]^{1/2}   (by (2.2.10)).

Therefore, 1/λ_{k+1} − 1/λk = (1/λ_{k+1}^{1/2})·[qf/λ_{k+1} + (γ0 − μ)/L]^{1/2}. Thus,

(1/λ_{k+1}^{1/2})·[qf/λ_{k+1} + (γ0 − μ)/L]^{1/2} = (1/λ_{k+1}^{1/2} + 1/λk^{1/2})(1/λ_{k+1}^{1/2} − 1/λk^{1/2})
    ≤ (2/λ_{k+1}^{1/2})(1/λ_{k+1}^{1/2} − 1/λk^{1/2}).

Defining ξk = [L/((γ0 − μ)λk)]^{1/2}, we get the following relation:

ξ_{k+1} − ξk ≥ (1/2)[qf ξ²_{k+1} + 1]^{1/2}.  (2.2.11)

Now, for δ = (1/2)q_f^{1/2}, we are going to prove by induction that

ξk ≥ (1/(4δ))(e^{(k+1)δ} − e^{−(k+1)δ}),  k ≥ 0.  (2.2.12)

For k = 0, in view of the upper bound on γ0, we have

ξ0 = [L/(γ0 − μ)]^{1/2} ≥ 1/3^{1/2} > (1/2)(e^{1/2} − e^{−1/2}) ≥ (1/(4δ))(e^δ − e^{−δ})

since the right-hand side of the above inequality is increasing in δ, and δ ≤ 1/2.
Thus, for k = 0, inequality (2.2.12) is valid. Let us assume that it is valid for
some k ≥ 0. Consider the function ψ(t) = (1/(4δ))(e^{(t+1)δ} − e^{−(t+1)δ}). Its derivative

ψ′(t) = (1/4)(e^{(t+1)δ} + e^{−(t+1)δ})

is increasing in t. Thus, in view of Theorem 2.1.3 the function ψ(·) is convex.
In view of our assumption,

ψ(k) ≤ ξk ≤ ξ_{k+1} − (1/2)[qf ξ²_{k+1} + 1]^{1/2} ≝ γ(ξ_{k+1})   (by (2.2.11)).

Note that γ′(ξ) = 1 − (1/2)·qf ξ/[qf ξ² + 1]^{1/2} > 0. Suppose that ξ_{k+1} < ψ(k + 1). Then

ψ(k) < ψ(k + 1) − (1/2)[4δ²·((1/(4δ))(e^{(k+2)δ} − e^{−(k+2)δ}))² + 1]^{1/2}

     = ψ(k + 1) − (1/4)(e^{(k+2)δ} + e^{−(k+2)δ})

     = ψ(k + 1) + ψ′(k + 1)(k − (k + 1)) ≤ ψ(k)   (by (2.1.2)).

Thus, we get a contradiction with our second assumption, which proves the lower
bound (2.2.12).
For the case γ0 = μ, we have γk = μ for all k ≥ 0 (see (2.2.10)). By the
quadratic equation of Step 1(a) in method (2.2.7), this means that αk = q_f^{1/2}, k ≥ 0. □
Let us present an exact statement on the optimality of (2.2.7).
Theorem 2.2.2 Let us take in (2.2.7) γ0 = 3L + μ. Then this scheme generates a
sequence {xk }∞
k=0 such that

2(4+qf )μx0 −x ∗ 2 2(4+qf )Lx0 −x ∗ 2


f (xk ) − f ∗ ≤  
1/2
  
1/2 2
≤ 3(k+1)2
. (2.2.13)
3 exp k+1
2 qf −exp − k+12 qf

This means that method (2.2.7) is optimal for solving the unconstrained minimiza-
1,1
tion problem (2.2.1) with f ∈ Sμ,L (Rn ) and μ ≥ 0, when the accuracy  > 0 is
small enough:

≤ μ
2 x0 − x ∗ 2 . (2.2.14)

If μ = 0, then this method is optimal for

≤ 32 x0
3L
− x ∗ 2 . (2.2.15)

Proof Indeed, since f(x0) − f* ≤ (L/2)‖x0 − x*‖² by (2.1.9), by Theorem 2.2.1 we have

f(xk) − f* ≤ (λk/2)(L + γ0)‖x0 − x*‖².

Therefore, by Lemma 2.2.4, we obtain the following bounds:

f(xk) − f* ≤ 2μ(L + γ0)‖x0 − x*‖² / ((γ0 − μ)·[exp((k+1)q_f^{1/2}/2) − exp(−(k+1)q_f^{1/2}/2)]²)

           ≤ 2L(L + γ0)‖x0 − x*‖²/((γ0 − μ)(k + 1)²).

The upper bounds in the above relations are decreasing in γ0. Hence, choosing it as
the maximal allowed value, we get inequality (2.2.13).
Let μ > 0. From the lower complexity bounds for the class (see Theorem 2.1.13),
we have

f(xk) − f* ≥ (μ/2)((√Qf − 1)/(√Qf + 1))^{2k} R² ≥ (μ/2) exp(−4k/(√Qf − 1)) R²,

where R = ‖x0 − x*‖. Therefore, the worst case lower bound for finding xk
satisfying f(xk) − f* ≤ ε cannot be better than

k ≥ ((√Qf − 1)/4) ln(μR²/(2ε))  (2.2.16)

calls of the oracle (in view of assumption (2.2.14), the right-hand side of this
inequality is positive). For our scheme, we have

f(xk) − f* ≤ (10μR²/3)(e^{(k+1)q_f^{1/2}} − 1)^{−1}   (by (2.2.13)).

Therefore, we guarantee that for k > (1/q_f^{1/2}) ln(1 + 10μR²/(3ε)) our problem will be
solved. Since

ln(1 + 10μR²/(3ε)) ≤ ln(μR²/(2ε) + 10μR²/(3ε)) = ln(μR²/(2ε)) + ln(23/3)   (by (2.2.14)),

the upper bound for the number of iterations (= calls of the oracle) in method (2.2.7)
is as follows:

√Qf · [ln(μR²/(2ε)) + ln(23/3)].  (2.2.17)

Clearly, this bound is proportional to the lower bound (2.2.16). Therefore, the
method (2.2.7) is optimal.
The same reasoning can be used for the class S^{1,1}_{0,L}(Rn). As above, we need
to impose the upper bound (2.2.15) for the accuracy in order to have a positive lower
bound for the number of calls of the oracle (see Theorem 2.1.7). □

Remark 2.2.1 Note that the scheme and the complexity analysis of method (2.2.7) are
continuous in the convexity parameter μ. Therefore, its version for convex functions
with Lipschitz continuous gradient has the following rate of convergence:

f(xk) − f* ≤ 8L‖x0 − x*‖²/(3(k + 1)²)   (by (2.2.13)).  (2.2.18)

Let us analyze a variant of scheme (2.2.7), which uses a constant gradient step
for finding the point xk+1 .

Constant Step Scheme I

0. Choose a point x0 ∈ Rn, some γ0 > 0, and set v0 = x0.
1. kth iteration (k ≥ 0).
   (a) Compute αk ∈ (0, 1) from the equation

       Lα²_k = (1 − αk)γk + αkμ.                               (2.2.19)

       Set γ_{k+1} = (1 − αk)γk + αkμ.
   (b) Choose yk = (1/(γk + αkμ))[αkγkvk + γ_{k+1}xk]. Compute
       f(yk) and ∇f(yk).
   (c) Set x_{k+1} = yk − (1/L)∇f(yk) and

       v_{k+1} = (1/γ_{k+1})[(1 − αk)γkvk + αkμyk − αk∇f(yk)].

Let us show that this scheme can be rewritten in a simpler form. Note that

yk = (1/(γk + αkμ))(αkγkvk + γ_{k+1}xk),

x_{k+1} = yk − (1/L)∇f(yk),

v_{k+1} = (1/γ_{k+1})[(1 − αk)γkvk + αkμyk − αk∇f(yk)].

Therefore,

v_{k+1} = (1/γ_{k+1})(((1 − αk)/αk)[(γk + αkμ)yk − γ_{k+1}xk] + αkμyk − αk∇f(yk))

       = (1/αk)yk − ((1 − αk)/αk)xk − (αk/γ_{k+1})∇f(yk)

       = xk + (1/αk)(yk − xk) − (1/(αkL))∇f(yk) = xk + (1/αk)(x_{k+1} − xk).

Hence,

y_{k+1} = (1/(γ_{k+1} + α_{k+1}μ))(α_{k+1}γ_{k+1}v_{k+1} + γ_{k+2}x_{k+1})

       = x_{k+1} + (α_{k+1}γ_{k+1}/(γ_{k+1} + α_{k+1}μ))(v_{k+1} − x_{k+1}) = x_{k+1} + βk(x_{k+1} − xk),

where βk = α_{k+1}γ_{k+1}(1 − αk)/(αk(γ_{k+1} + α_{k+1}μ)). Thus, we managed to eliminate the sequence {vk}. Let
us do the same with the coefficients {γk}. We have

α²_kL = (1 − αk)γk + μαk ≡ γ_{k+1}.

Therefore,

βk = α_{k+1}γ_{k+1}(1 − αk)/(αk(γ_{k+1} + α_{k+1}μ)) = α_{k+1}γ_{k+1}(1 − αk)/(αk(γ_{k+1} + α²_{k+1}L − (1 − α_{k+1})γ_{k+1}))

   = γ_{k+1}(1 − αk)/(αk(γ_{k+1} + α_{k+1}L)) = αk(1 − αk)/(α²_k + α_{k+1}).

Note also that α²_{k+1} = (1 − α_{k+1})α²_k + qf α_{k+1}, and

α²_0L = (1 − α0)γ0 + μα0.

The latter relation means that γ0 can be seen as a function of α0. Thus, we can
completely eliminate the sequence {γk}. Let us write down the corresponding
method.

Constant Step Scheme II

0. Choose a point x0 ∈ Rn, some α0 ∈ (0, 1), and set y0 = x0.
1. kth iteration (k ≥ 0).                                      (2.2.20)
   (a) Compute f(yk) and ∇f(yk). Set x_{k+1} = yk − (1/L)∇f(yk).
   (b) Compute α_{k+1} ∈ (0, 1) from the equation

       α²_{k+1} = (1 − α_{k+1})α²_k + qf α_{k+1}.

       Set βk = αk(1 − αk)/(α²_k + α_{k+1}) and y_{k+1} = x_{k+1} + βk(x_{k+1} − xk).
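As a sketch (ours, not the book's), Scheme II takes only a few lines of code. Below it is run with μ = 0 (so qf = 0) on a convex quadratic with minimum value f* = 0, starting from α0 = 0.5, a value satisfying condition (2.2.21) of the theorem stated next:

```python
import math

# Constant Step Scheme II (2.2.20) with mu = 0 on a convex quadratic (x* in the
# positive-eigenvalue coordinates is 0, f* = 0). Illustration only.
a = [0.0, 0.5, 2.0, 10.0]          # Hessian eigenvalues: mu = 0, L = 10
L = max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x = [1.0, 1.0, 1.0, 1.0]
y = x[:]
alpha, qf = 0.5, 0.0
for k in range(100):
    g = grad(y)
    x_new = [yi - gi / L for yi, gi in zip(y, g)]        # Step (a)
    # Step (b): positive root of alpha^2 = (1 - alpha)*alpha_k^2 + qf*alpha
    a2 = alpha * alpha
    alpha_new = 0.5 * (qf - a2 + math.sqrt((qf - a2) ** 2 + 4.0 * a2))
    beta = alpha * (1.0 - alpha) / (a2 + alpha_new)
    y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
    x, alpha = x_new, alpha_new

print(f(x))  # O(1/k^2) decrease: far below f(x0) = 6.25
```

The eigenvalues and iteration count are arbitrary; by Theorem 2.2.3 the residual after 100 steps is guaranteed to be below roughly 0.013 here.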

The rate of convergence of this method can be derived from Theorem 2.2.1 and
Lemma 2.2.4. Let us write down the corresponding statement in terms of α0 .
Theorem 2.2.3 If in the method (2.2.20) we choose α0 in accordance with the
conditions

q_f^{1/2} ≤ α0 ≤ 2(3 + qf)/(3 + (21 + 4qf)^{1/2}),  (2.2.21)

then

f(xk) − f* ≤ 4μ[f(x0) − f* + (γ0/2)‖x0 − x*‖²] / ((γ0 − μ)·[exp((k+1)q_f^{1/2}/2) − exp(−(k+1)q_f^{1/2}/2)]²)

           ≤ (4L/((γ0 − μ)(k + 1)²))·[f(x0) − f* + (γ0/2)‖x0 − x*‖²],

where γ0 = α0(α0L − μ)/(1 − α0).
We do not need to prove this theorem since the initial scheme has not changed.
We change only the notation. In Theorem 2.2.3, condition (2.2.21) is equivalent to
the condition μ ≤ γ0 ≤ 3L + μ of Lemma 2.2.4.

Scheme (2.2.20) becomes very simple if we choose α0 = q_f^{1/2} (this corresponds
to γ0 = μ). Then

αk = q_f^{1/2},   βk = (1 − q_f^{1/2})/(1 + q_f^{1/2})

for all k ≥ 0. Thus, we come to the following process.

Constant Step Scheme III

0. Choose y0 = x0 ∈ Rn.
1. kth iteration (k ≥ 0).                                      (2.2.22)

   x_{k+1} = yk − (1/L)∇f(yk),

   y_{k+1} = x_{k+1} + ((1 − q_f^{1/2})/(1 + q_f^{1/2}))(x_{k+1} − xk).

In accordance with Theorem 2.2.1 and Lemma 2.2.4, it has the following rate of
convergence:

f(xk) − f* ≤ ((L + μ)/2)‖x0 − x*‖² e^{−k·q_f^{1/2}},  k ≥ 0   (by (2.1.9)).  (2.2.23)

However, this method does not work for μ = 0. The choice of a bigger value of the
parameter γ0 (which corresponds to another value of α0 ) is much safer.
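The difference between rate (2.2.23) and the Gradient Method's linear rate is striking already in small examples. The sketch below (our illustration, not from the book) runs Scheme III and the plain Gradient Method with h = 1/L from the same starting point on an ill-conditioned quadratic (x* = 0, f* = 0):

```python
import math

a = [0.01, 0.1, 1.0, 10.0]            # mu = 0.01, L = 10, so Qf = 1000
mu, L = min(a), max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

beta = (1 - math.sqrt(mu / L)) / (1 + math.sqrt(mu / L))
x = [1.0, 1.0, 1.0, 1.0]
y = x[:]
xg = x[:]
for k in range(500):
    # Scheme III (2.2.22): gradient step from y_k, then momentum
    g = grad(y)
    x_new = [yi - gi / L for yi, gi in zip(y, g)]
    y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
    x = x_new
    # Gradient Method with h = 1/L
    xg = [(1 - ai / L) * xi for ai, xi in zip(a, xg)]

print(f(x) < f(xg))  # True: the fast scheme is far ahead of the gradient method
```

The eigenvalues and 500 iterations are arbitrary; bound (2.2.23) already guarantees f(x500) − f* below about 3·10⁻⁶, while the Gradient Method is still near 2·10⁻³.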
Finally, let us prove the following statement.
Theorem 2.2.4 Let method (2.2.7) be applied to the function f ∈ F_L^{1,1}(Rn) (this
means that μ = 0). Then for any k ≥ 0 we have

‖vk − x*‖ ≤ (1 + L/γ0)^{1/2} r0,  (2.2.24)

‖xk − x*‖ ≤ (1 + L/γ0)^{1/2} r0,  (2.2.25)

where r0 ≝ ‖x* − x0‖. Moreover, for the vector gk = (λk/(1 − λk)) Σ_{i=0}^{k−1} (αi/λ_{i+1}) ∇f(yi), whose
coefficients satisfy the equation Σ_{i=0}^{k−1} αi/λ_{i+1} = (1 − λk)/λk, k ≥ 1, we have

‖gk‖ ≤ (λkγ0/(1 − λk))(1 + (1 + L/γ0)^{1/2}) r0.  (2.2.26)

Choosing γ0 = 3L, we get the following rate:

‖gk‖ ≤ 4(3 + 2√3)L r0/(3(k + 1)² − 4),  k ≥ 1   (by (2.2.9)).  (2.2.27)

Proof As we have seen, method (2.2.7) recursively updates a sequence of estimating
functions, which can be represented as follows:

φk(x) = ℓk(x) + λk(f(x0) + (1/2)γ0‖x − x0‖²),  k ≥ 0,

where ℓk(·) are linear functions updated by the rules ℓ0(x) ≡ 0,

ℓ_{k+1}(x) = (1 − αk)ℓk(x) + αk[f(yk) + ⟨∇f(yk), x − yk⟩],  k ≥ 0.  (2.2.28)

Let ∇ℓk ≡ ∇ℓk(x), x ∈ Rn.
Note that the function φk is strongly convex with convexity parameter λkγ0. Therefore,
for all x ∈ Rn we have

f(xk) + (1/2)λkγ0‖x − vk‖² ≤ φ*_k + (1/2)λkγ0‖x − vk‖² ≤ φk(x)   (by (2.1.21))

≤ f(x) + λk(f(x0) + (1/2)γ0‖x − x0‖² − f(x))   (by (2.2.2)).

Taking in this inequality x = x*, we get

(1/2)λkγ0‖x* − vk‖² ≤ λk(f(x0) − f(x*) + (1/2)γ0‖x* − x0‖²) ≤ (1/2)λk(L + γ0)r²_0   (by (2.1.9)),

and this is the bound (2.2.24).
Let us prove by induction that the bound (2.2.25) holds for all k ≥ 0. Since
x0 = v0, it holds for k = 0. Assume it holds for some k ≥ 0. Then, in view of Step
(b) in (2.2.19), we have ‖yk − x*‖ ≤ [1 + L/γ0]^{1/2} r0. It remains to note that the
gradient step decreases the distance to the optimal point (see, for example, the proof
of Theorem 2.1.14).
Let us look now at the evolution of the vectors sk ≝ (1/λk)∇ℓk. Note that s0 = 0
and

∇ℓ_{k+1} = (1 − αk)∇ℓk + αk∇f(yk)   (by (2.2.28))

        = (λ_{k+1}/λk)∇ℓk + αk∇f(yk),  k ≥ 0.

Thus, sk = Σ_{i=0}^{k−1} (αi/λ_{i+1})∇f(yi), k ≥ 0. On the other hand, for τi = αi/λ_{i+1} we have

τi = αi/((1 − αi)λi) = 1/λ_{i+1} − 1/λi   (by (2.2.4)).

Thus, Σ_{i=0}^{k−1} τi = 1/λk − 1, and gk = λksk/(1 − λk) ≡ (1/(1 − λk))∇ℓk(x), x ∈ Rn. Note that

vk = x0 − (1/(λkγ0))∇ℓk = x0 − ((1 − λk)/(λkγ0)) gk.

Hence,

(1 + L/γ0)^{1/2} r0 ≥ ‖x0 − ((1 − λk)/(λkγ0))gk − x*‖ ≥ ((1 − λk)/(λkγ0))‖gk‖ − r0   (by (2.2.24)),

and we get inequality (2.2.26). □

Theorem 2.2.4 can be used to generate points with a small gradient of the quadratic
function f(x) = (1/2)⟨Ax, x⟩ − ⟨b, x⟩ with A ⪰ 0. For that, we just compute the point

ŷk = (λk/(1 − λk)) Σ_{i=0}^{k−1} (αi/λ_{i+1}) yi,  k ≥ 1.  (2.2.29)

Another example of employing the rule (2.2.29) is given in Sect. 2.2.3.


2.2.2 Decreasing the Norm of the Gradient

Sometimes, in solving the optimization problem (2.2.1) with f ∈ F^{1,1}_{μ,L}(Rn), we
are interested in finding a point with a small norm of the gradient:

‖∇f(x)‖ ≤ ε.  (2.2.30)

(We will give an important example of this situation in Example 2.2.4 in Sect. 2.2.3.)
What are the lower and upper complexity bounds for this goal? Since

f(x) − f* ≤ ‖∇f(x)‖ · ‖x − x*‖   (by (2.1.2)),

the corresponding lower complexity bounds must be of the same order as for finding
a point with a small residual in function value: f(x) − f* ≤ ε. Let us see which
methods can be used to find points with small gradients.
methods can be used to find points with small gradients.
First of all, let us look at the abilities of the Gradient Method (2.1.37) with hk = 1/L.
Denote R0 = ‖x0 − x*‖. Let us fix the total number of iterations T ≥ 3. After the
first k iterations, 0 ≤ k < T, we have

f(xk) − f* ≤ 2LR²_0/(k + 4)   (by (2.1.39)).

If i ≥ k, then f(xi) − f(x_{i+1}) ≥ (1/(2L))‖∇f(xi)‖² by (2.1.9). Define g_{k,T} =
min_{k≤i≤T} ‖∇f(xi)‖. Then

(T − k + 1)g²_{k,T} ≤ Σ_{i=k}^T ‖∇f(xi)‖² ≤ 2L Σ_{i=k}^T (f(xi) − f(x_{i+1}))

= 2L(f(xk) − f(x_{T+1})) ≤ 2L(f(xk) − f*) ≤ 4L²R²_0/(k + 4).

Thus, g²_{0,T} ≤ 4L²R²_0/((k + 4)(T − k + 1)). We can choose k by maximizing the quadratic function
q(k) = (k + 4)(T − k + 1) for integer k. Note that

q* ≝ max_{k∈Z} q(k) ≥ q(τ* + 1/2),   τ* = arg max_{τ∈R} q(τ).

Since τ* = (T − 3)/2, we get q* ≥ q((T − 2)/2) = (1/4)(T + 4)(T + 6).
Thus, we have proved the following theorem.

98 2 Smooth Convex Optimization

Theorem 2.2.5 Let f ∈ F_L^{1,1}(R^n) and choose in method (2.1.37) h_k = 1/L. Then for the total number of steps T ≥ 3 in this method we have

g_{0,T} ≤ 4LR_0 / [(T + 4)(T + 6)]^{1/2}.   (2.2.31)

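The bound (2.2.31) is easy to check numerically. The sketch below is a hypothetical illustration (the quadratic instance and the starting point are arbitrary choices, not from the text): it runs the gradient method with h_k = 1/L and compares the smallest observed gradient norm with the right-hand side of (2.2.31).

```python
import numpy as np

# Convex quadratic f(x) = 1/2 <Ax, x> - <b, x>, so grad f(x) = Ax - b.
A = np.diag([1.0, 0.2, 0.05])        # A >= 0; L = 1.0 (largest eigenvalue)
b = np.array([1.0, -1.0, 0.5])
L = 1.0
x_star = np.linalg.solve(A, b)       # unconstrained minimizer

x = np.zeros(3)
R0 = np.linalg.norm(x - x_star)
T = 50
g_min = np.inf
for _ in range(T + 1):
    g = A @ x - b
    g_min = min(g_min, np.linalg.norm(g))   # g_{0,T} = min over visited points
    x = x - g / L                           # gradient step with h_k = 1/L

bound = 4 * L * R0 / np.sqrt((T + 4) * (T + 6))   # right-hand side of (2.2.31)
print(g_min <= bound)   # True
```

The observed minimal gradient norm is usually far below the worst-case bound; (2.2.31) only guarantees the O(1/T) envelope.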
Thus, the Gradient Method ensures the goal (2.2.30) in O(1/ε) iterations. Let us see what happens with a monotone version of the Optimal Method (2.2.19) in the case μ = 0.

Monotone Constant Step Scheme IA

0. Choose a point x_0 ∈ R^n. Set λ_0 = 1 and v_0 = x_0.
1. kth iteration (k ≥ 0).
 (a) Compute α_k ∈ (0, 1) from the equation α_k² = 3(1 − α_k)λ_k.
 (b) Set y_k = α_k v_k + (1 − α_k)x_k and λ_{k+1} = (1 − α_k)λ_k.
 (c) Compute ∇f(y_k) and set x̂_{k+1} = y_k − (1/L)∇f(y_k).
 (d) Define v_{k+1} = v_k − (1/(Lα_k))∇f(y_k).
 (e) Set ŷ_k = arg min{ f(y) : y ∈ {x_k, x̂_{k+1}} }.
 (f) Compute ∇f(ŷ_k) and set x_{k+1} = ŷ_k − (1/L)∇f(ŷ_k).

(2.2.32)

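Scheme (2.2.32) can be sketched in code as follows. The quadratic test function is a hypothetical instance; α_k is computed as the positive root of α² = 3(1 − α)λ_k, and the monotonicity check mirrors (2.2.33).

```python
import numpy as np

# f(x) = 1/2 <Ax, x> - <b, x> with A >= 0 (a hypothetical test instance; L = 1).
A = np.diag([1.0, 0.1])
b = np.array([1.0, 1.0])
L = 1.0
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(2)
v = x.copy()
lam = 1.0
values = [f(x)]
for k in range(30):
    # (a) positive root of alpha^2 + 3*lam*alpha - 3*lam = 0
    alpha = (-3 * lam + np.sqrt(9 * lam ** 2 + 12 * lam)) / 2
    # (b)
    y = alpha * v + (1 - alpha) * x
    lam = (1 - alpha) * lam
    # (c), (d)
    gy = grad(y)
    x_hat = y - gy / L
    v = v - gy / (L * alpha)
    # (e), (f): monotone step from the better of x_k and x_hat
    y_hat = x if f(x) <= f(x_hat) else x_hat
    x = y_hat - grad(y_hat) / L
    values.append(f(x))

monotone = all(values[i + 1] <= values[i] + 1e-12 for i in range(len(values) - 1))
print(monotone)   # True
```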
This scheme corresponds to the method (2.2.7) with γ_0 = 3L and μ = 0. Hence, γ_k ≡ 3Lλ_k. Note that it ensures a monotone decrease of the objective function:

f(x_k) ≥ f(ŷ_k) ≥ f(x_{k+1}) + (1/(2L))‖∇f(ŷ_k)‖²,   (2.2.33)

where the first inequality follows from Step (e) and the second from Step (f) of (2.2.32).

As before, we divide the total number of iterations T ≥ 3 into two parts. After the first k iterations, 0 ≤ k < T, we have

f(x_k) − f* ≤ 8LR_0²/(3(k + 1)²)   (by (2.2.18)).

If i ≥ k, then f(x_i) − f(x_{i+1}) ≥ (1/(2L))‖∇f(ŷ_i)‖² (by (2.2.33)). Define g_{k,T} = min_{k≤i≤T} ‖∇f(ŷ_i)‖. Then

(T − k + 1) g²_{k,T} ≤ Σ_{i=k}^{T} ‖∇f(ŷ_i)‖² ≤ 2L Σ_{i=k}^{T} (f(x_i) − f(x_{i+1}))

= 2L(f(x_k) − f(x_{T+1})) ≤ 2L(f(x_k) − f*) ≤ 16L²R_0²/(3(k + 1)²).

Thus, g²_{0,T} ≤ 16L²R_0² / (3(k + 1)²(T − k + 1)). We can choose k by maximizing the cubic function q(k) = (k + 1)²(T − k + 1) over the integers. Note that k*, the optimal solution of the problem q* ≝ max_{k∈Z} q(k), belongs to the interval [τ* − 1/2, τ* + 1/2], where τ* = arg max_{τ∈R₊} q(τ). Moreover, since the function q(·) is concave in this interval, we have

q* ≥ min{ q(τ* − 1/2), q(τ* + 1/2) } = min_{δ=±1/2} { q(τ*) + (1/2)q″(τ*)(1/2)² + (1/6)q‴(τ*)δ³ } = q(τ*) + (1/8)q″(τ*) − 1/8.

Note that q′(τ) = (τ + 1)(2T + 1 − 3τ) and q″(τ) = 2T − 2 − 6τ. Therefore, τ* = (2T + 1)/3, q″(τ*) = −2T − 4, and q(τ*) = (4/27)(T + 2)³. Hence,

q* ≥ (4/27)(T + 2)³ − (1/4)(T + 2) − 1/8.

Thus, we have proved the following theorem.


Theorem 2.2.6 If f ∈ F_L^{1,1}(R^n), then method (2.2.32) ensures the following rate of decrease for the norm of the gradient:

g_{0,T} ≤ 4LR_0 / [ (4/9)(T + 2)³ − (3/4)(T + 2) − 3/8 ]^{1/2},  T ≥ 3.   (2.2.34)

Thus, the Optimal Method (2.2.32) ensures the goal (2.2.30) in O(1/ε^{2/3}) iterations. Let us show that we can be even faster if we apply a regularization technique. Let us fix a regularization parameter δ > 0 and consider the following function:

f_δ(x) = f(x) + (δ/2)‖x − x_0‖².

In view of conditions (2.1.12) and (2.1.22), f_δ ∈ S^{1,1}_{δ,L+δ}(R^n). Denote by x*_δ its unique optimal point, which satisfies the equation

∇f(x*_δ) + δ(x*_δ − x_0) = 0.   (2.2.35)

Note that

f_δ(x*_δ) + (δ/2)‖x*_δ − x*‖² ≤ f_δ(x*) = f(x*) + (δ/2)‖x* − x_0‖²   (by (2.1.21)).

Since f(x*) ≤ f(x*_δ), we conclude that

‖x*_δ − x_0‖² + ‖x*_δ − x*‖² ≤ ‖x_0 − x*‖².   (2.2.36)

Thus, by choosing an appropriate δ, we can make the gradient ∇f(x*_δ) small:

‖∇f(x*_δ)‖ = δ‖x*_δ − x_0‖ ≤ δR_0   (by (2.2.35) and (2.2.36)).

Therefore, it is possible to find a point with a small norm of the gradient by minimizing the function f_δ. Let us estimate the complexity of this process.
Let us use for our goal the scheme (2.2.22) with parameters L + δ and q_f = δ/(δ + L). Then after T iterations of this method, we have

‖∇f(x_T)‖ ≤ ‖∇f(x*_δ)‖ + ‖∇f(x_T) − ∇f(x*_δ)‖ ≤ δR_0 + L‖x_T − x*_δ‖   (by (1.2.8))

≤ δR_0 + L [ (2/δ)(f_δ(x_T) − f_δ(x*_δ)) ]^{1/2}   (by (2.1.21))

≤ δR_0 + L [ ((L + 2δ)/δ) R_0² e^{−T√q_f} ]^{1/2}   (by (2.2.23)).

Thus, choosing δ from the condition δR_0 = ε/2, we get 1/q_f = 1 + 2LR_0/ε. Therefore, the number of steps T in our scheme is bounded by the solution of the following inequality:

LR_0 [ (L + 2δ)/δ ]^{1/2} ≤ (ε/2) e^{(T/2)√q_f}.

This gives T ≥ (2/√q_f) ln [ (1/q_f − 1)(1 + 1/q_f)^{1/2} ]. Thus, we have proved the following theorem.

Theorem 2.2.7 Let f ∈ F_L^{1,1}(R^n) and δ = ε/(2R_0). Then the number of steps T which is necessary for method (2.2.22) to generate a point x_T with ‖∇f(x_T)‖ ≤ ε by minimizing the function f_δ is bounded as follows:

T ≤ 3 (1 + 2LR_0/ε)^{1/2} ln (1 + 2LR_0/ε).   (2.2.37)

Thus, up to a logarithmic factor, the complexity estimate of the regularization scheme is optimal. To the best of our knowledge, it is not known yet whether this factor can be dropped.
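The regularization recipe of Theorem 2.2.7 can be illustrated numerically. The sketch below takes only the choice δ = ε/(2R_0) and the final guarantee ‖∇f(x_T)‖ ≤ ε from the text; it uses plain gradient steps on f_δ instead of scheme (2.2.22) (which is not reproduced here), and the instance itself is an arbitrary assumption.

```python
import numpy as np

# A convex but poorly conditioned quadratic f(x) = 1/2 <Ax, x> - <b, x>.
A = np.diag([1.0, 0.01])
b = np.array([1.0, 1.0])
L = 1.0
x0 = np.zeros(2)
R0 = np.linalg.norm(np.linalg.solve(A, b) - x0)

eps = 0.5
delta = eps / (2 * R0)                      # regularization parameter of Theorem 2.2.7
grad_fd = lambda x: A @ x - b + delta * (x - x0)    # gradient of f_delta

# f_delta is delta-strongly convex, so gradient steps converge linearly to x*_delta,
# where (2.2.35)-(2.2.36) guarantee ||grad f(x*_delta)|| <= delta * R0 = eps/2.
x = x0.copy()
for _ in range(2000):
    x = x - grad_fd(x) / (L + delta)

print(np.linalg.norm(A @ x - b) <= eps)   # True
```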

2.2.3 Convex Sets

The next step in generalizing the unconstrained minimization problem (2.1.36) is a constrained minimization problem with no functional constraints:

min_{x∈Q} f(x),

where Q is a convex subset of R^n. We have already introduced these sets in Definition 2.1.1, as natural domains of convex functions. Now we will need them as simple constraints.

Let us look at two important examples of convex sets.
Lemma 2.2.5 If f (·) is a convex function on Rn , then for any β ∈ R its level set

Lf (β) = {x ∈ Rn | f (x) ≤ β}

is either convex or empty.


Proof Indeed, let x and y belong to L_f(β). Then f(x) ≤ β and f(y) ≤ β. Therefore, for any α ∈ [0, 1],

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ β   (by (2.1.3)),

which means αx + (1 − α)y ∈ L_f(β). □

Lemma 2.2.6 Let f (·) be a convex function on Rn . Then its epigraph

Ef = {(x, τ ) ∈ Rn+1 | f (x) ≤ τ }

is a convex set.
Proof Indeed, let z_1 = (x_1, τ_1) ∈ E_f and z_2 = (x_2, τ_2) ∈ E_f. Then for any α ∈ [0, 1] we have

z_α ≡ αz_1 + (1 − α)z_2 = (αx_1 + (1 − α)x_2, ατ_1 + (1 − α)τ_2),

f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2) ≤ ατ_1 + (1 − α)τ_2   (by (2.1.3)).

Thus, z_α ∈ E_f. □

Let us consider now the most important operations with convex sets.

Theorem 2.2.8 Let Q1 ⊆ Rn and Q2 ⊆ Rm be closed convex sets, and A (·) be a


linear operator:

A (x) = Ax + b : Rn → Rm .

1. The intersection of two sets (m = n), Q_1 ∩ Q_2 = {x ∈ R^n | x ∈ Q_1, x ∈ Q_2}, is convex and closed.
2. The sum of two sets (m = n), Q1 + Q2 = {z = x + y | x ∈ Q1 , y ∈ Q2 }, is
convex. It is closed provided that one of the sets is bounded.
3. The direct product of two sets, Q1 × Q2 = {(x, y) ∈ Rn+m | x ∈ Q1 , y ∈ Q2 }
is convex and closed.
4. The conic hull of a set, K (Q1 ) = {z ∈ Rn | z = βx, x ∈ Q1 , β ≥ 0}, is
convex. It is closed if the set Q1 is bounded and does not contain the origin.
5. The convex hull of two sets,

Conv(Q1 , Q2 ) = {z ∈ Rn | z = αx + (1 − α)y, x ∈ Q1 , y ∈ Q2 , α ∈ [0, 1]},

is convex. It is closed if both sets are bounded.


6. The affine image of a set, A (Q1 ) = {y ∈ Rm | y = A (x), x ∈ Q1 }, is convex
and closed.
7. The inverse affine image: A −1 (Q2 ) = {x ∈ Rn | A (x) ∈ Q2 } is convex. It is
closed if Q2 is bounded.
Proof
 
1. If x_1 ∈ Q_1 ∩ Q_2 and x_2 ∈ Q_1 ∩ Q_2, then [x_1, x_2] ⊂ Q_1 and [x_1, x_2] ⊂ Q_2. Therefore, [x_1, x_2] ⊂ Q_1 ∩ Q_2. Closedness of the intersection is evident.
2. If z1 = x1 + y1 with x1 ∈ Q1 , y1 ∈ Q2 , and z2 = x2 + y2 with x2 ∈ Q1 ,
y2 ∈ Q2 , then

αz1 + (1 − α)z2 = [αx1 + (1 − α)x2 ]1 + [αy1 + (1 − α)y2 ]2 ,

where [·]1 ∈ Q1 and [·]2 ∈ Q2 . Let us assume now that the set Q2 is bounded.
Consider a convergent sequence zk = xk + yk → z̄ with {xk } ⊂ Q1 and
{yk } ⊂ Q2 . Since Q2 is bounded, we can assume that the whole sequence {yk }
converges (otherwise, select a converging subsequence). Then, the sequence {xk }
also converges. This implies the inclusion z̄ ∈ Q1 + Q2 .
3. If z1 = (x1 , y1 ), x1 ∈ Q1 , y1 ∈ Q2 and z2 = (x2 , y2 ), x2 ∈ Q1 , y2 ∈ Q2 , then

αz1 + (1 − α)z2 = ([αx1 + (1 − α)x2 ]1 , [αy1 + (1 − α)y2 ]2 ),

where [·]1 ∈ Q1 and [·]2 ∈ Q2 . Further, if a sequence {zk = (xk , yk )} ⊂ Q1 ×Q2


converges to z̄ = (x̄, ȳ), this means that xk → x̄ ∈ Q1 and yk → ȳ ∈ Q2 .
Hence, the point z̄ belongs to Q1 × Q2 .

4. If z1 = β1 x1 with x1 ∈ Q1 and β1 ≥ 0, and z2 = β2 x2 with x2 ∈ Q1 and β2 ≥ 0,


then for any α ∈ [0, 1] we have

αz1 + (1 − α)z2 = αβ1 x1 + (1 − α)β2 x2 = γ (ᾱx1 + (1 − ᾱ)x2 ),

where γ = αβ1 + (1 − α)β2 , and ᾱ = αβ1 /γ ∈ [0, 1]. Thus, the set K (Q1 ) is
convex.
Consider a convergent sequence {z_k = β_k x_k → z̄} with {x_k} ⊂ Q_1. If Q_1 is bounded, then the sequence {x_k} is bounded. If 0 ∉ Q_1, then the sequence {β_k}
is also bounded. Therefore, without loss of generality, we can assume that both
sequences {βk } and {xk } are convergent. Hence, z̄ ∈ K (Q1 ) and we conclude
that this cone is closed.
5. If z1 = β1 x1 + (1 − β1 )y1 with x1 ∈ Q1 , y1 ∈ Q2 , and β1 ∈ [0, 1], and
z2 = β2 x2 + (1 − β2 )y2 with x2 ∈ Q1 , y2 ∈ Q2 , and β2 ∈ [0, 1], then for any
α ∈ [0, 1] we have

αz1 + (1 − α)z2 = α(β1 x1 + (1 − β1 )y1 ) + (1 − α)(β2 x2 + (1 − β2 )y2 )

= ᾱ(β̄1 x1 + (1 − β̄1 )x2 ) + (1 − ᾱ)(β̄2 y1 + (1 − β̄2 )y2 ),

where ᾱ = αβ1 + (1 − α)β2 and β̄1 = αβ1 /ᾱ, β̄2 = α(1 − β1 )/(1 − ᾱ).
Let us assume that both sets are bounded. Considering now a convergent
sequence {zk = βk xk + (1 − βk )yk → z̄} with {βk } ⊂ [0, 1], {xk } ⊂ Q1 , and
{yk } ⊂ Q2 , without loss of generality, we can assume that all these sequences
are convergent. This implies that z̄ ∈ Conv{Q1 , Q2 }.
6. If y1 , y2 ∈ A (Q1 ) then y1 = Ax1 + b and y2 = Ax2 + b for some x1 , x2 ∈ Q1 .
Therefore, for y(α) = αy1 + (1 − α)y2 , 0 ≤ α ≤ 1, we have

y(α) = α(Ax1 + b) + (1 − α)(Ax2 + b) = A(αx1 + (1 − α)x2 ) + b.

Thus, y(α) ∈ A (Q1 ). This set is closed in view of the continuity of linear
operators.
7. If x1 , x2 ∈ A −1 (Q2 ) then Ax1 +b = y1 and Ax2 +b = y2 for some y1 , y2 ∈ Q2 .
Therefore, for x(α) = αx1 + (1 − α)x2 , 0 ≤ α ≤ 1, we have

A (x(α)) = A(αx1 + (1 − α)x2 ) + b

= α(Ax1 + b) + (1 − α)(Ax2 + b) = αy1 + (1 − α)y2 ∈ Q2 .

Let Q2 be bounded. Consider a convergent sequence {xk → x̄} ⊂ A −1 (Q2 ).


Then, without loss of generality, we can assume that the sequence {yk =
A (xk )} ⊂ Q2 is convergent to a point ȳ ∈ Q2 . Since ȳ = A(x̄), we conclude
that x̄ ∈ A −1 (Q2 ). Thus, the inverse image of a bounded set is closed. 


Let us give examples justifying the additional assumptions of Theorem 2.2.8,


which were introduced to ensure closedness of the results of some operations with
convex sets.
Example 2.2.1 In all the examples below, we work with the unbounded convex set

Q = { x ∈ R²₊ : x⁽²⁾ ≥ 1/x⁽¹⁾ }.

• Sum of two sets. Consider the set R^{1,2}₊ ≝ { x ∈ R² : x⁽¹⁾ ≥ 0, x⁽²⁾ = 0 }. Then

Q − R^{1,2}₊ = { x ∈ R² : x⁽²⁾ > 0 }

is an open set. At the same time, Q + R^{1,2}₊ ≡ Q is closed.
• Conic hull. Let 0₂ = (0, 0)ᵀ ∈ R². The set

K(Q) ≡ { x ∈ R² : x⁽¹⁾ > 0, x⁽²⁾ > 0 } ∪ {0₂}

is not closed. Also, for Q_1 = { x ∈ R² : ‖x − e_1‖ ≤ 1 }, we have

K(Q_1) = { x ∈ R² : x⁽¹⁾ > 0 } ∪ {0₂},

which is not closed.
• Convex hull. Note that Conv{0₂, Q} = K(Q), and the latter set is not closed.
• Inverse affine image. Note that

{ x ∈ R : ∃τ > 0 such that (τ, x) ∈ Q } = { x ∈ R : x > 0 },

and this set is open.



Using the statements above, we can justify the convexity of some important sets.
Example 2.2.2
1. Half-space. The set {x ∈ R^n | ⟨a, x⟩ ≤ β} is convex since a linear function is convex.
2. Polytope. The set {x ∈ R^n | ⟨a_i, x⟩ ≤ b_i, i = 1 … m} is convex as an intersection of convex sets.
3. Ellipsoid. Let A = Aᵀ ⪰ 0. Then the set {x ∈ R^n | ⟨Ax, x⟩ ≤ r²} is convex since the function ⟨Ax, x⟩ is convex.

Let us consider now a smooth optimization problem with a set constraint:

min_{x∈Q} f(x),  f ∈ F¹(Q, ‖·‖),   (2.2.38)

where Q is a closed convex set. We assume that the optimal set of this problem X∗
is not empty. Our current goal consists in describing the optimality conditions for
problem (2.2.38). It is clear that the old condition

∇f (x) = 0

does not work here.


Example 2.2.3 Consider the following univariate minimization problem:

min_{x≥0} x.

Here Q = {x ∈ R : x ≥ 0} and f(x) = x. Note that x* = 0, but ∇f(x*) = 1 > 0. □




Theorem 2.2.9 Let f ∈ F¹(Q) and the set Q be closed and convex. A point x* is a solution of problem (2.2.38) if and only if

⟨∇f(x*), x − x*⟩ ≥ 0   (2.2.39)

for all x ∈ Q.

Proof Indeed, if (2.2.39) is true, then

f(x) ≥ f(x*) + ⟨∇f(x*), x − x*⟩ ≥ f(x*)   (by (2.1.2) and (2.2.39))

for all x ∈ Q.

Let x* be a solution to (2.2.38). Assume that there exists some x ∈ Q such that

⟨∇f(x*), x − x*⟩ < 0.

Consider the function φ(α) = f(x* + α(x − x*)), α ∈ [0, 1]. Note that

φ(0) = f(x*),  φ′(0) = ⟨∇f(x*), x − x*⟩ < 0.

Therefore, for α small enough we have

f(x* + α(x − x*)) = φ(α) < φ(0) = f(x*).

This is a contradiction. □

The next statement is often addressed as the growth property of strongly convex
functions.

Corollary 2.2.1 If f ∈ S¹_μ(Q, ‖·‖), then for any x ∈ Q, we have

f(x) ≥ f(x*) + (μ/2)‖x − x*‖².   (2.2.40)

Proof Indeed,

f(x) ≥ f(x*) + ⟨∇f(x*), x − x*⟩ + (μ/2)‖x − x*‖² ≥ f(x*) + (μ/2)‖x − x*‖²   (by (2.1.20) and (2.2.39)). □

Corollary 2.2.2 Let f ∈ C_L^{1,1}(R^n, ‖·‖). Then, for any two points x_1*, x_2* ∈ X*, we have

∇f(x_1*) = ∇f(x_2*),  ⟨∇f(x_1*), x_1*⟩ = ⟨∇f(x_2*), x_2*⟩.   (2.2.41)

Proof Indeed, by (2.2.39), ⟨∇f(x_1*), x_2* − x_1*⟩ ≥ 0 and ⟨∇f(x_2*), x_1* − x_2*⟩ ≥ 0. Adding these two inequalities, we get

0 ≥ ⟨∇f(x_1*) − ∇f(x_2*), x_1* − x_2*⟩ ≥ (1/L)‖∇f(x_1*) − ∇f(x_2*)‖²_*   (by (2.1.11)).

For x* ∈ X*, let g* = ∇f(x*). Then,

0 ≥ ⟨∇f(x_2*), x_2* − x_1*⟩ = ⟨g*, x_2* − x_1*⟩ = ⟨∇f(x_1*), x_2* − x_1*⟩ ≥ 0   (by (2.2.39) and (2.2.41)). □

Let us now prove the existence theorem.


Theorem 2.2.10 Let f ∈ Sμ1 (Q,  · ) with μ > 0 and the set Q be closed and
convex. Then there exists a unique solution x ∗ of problem (2.2.38).
Proof Let x0 ∈ Q. Consider the set Q̄ = {x ∈ Q | f (x) ≤ f (x0 )}. Note that the
problem (2.2.38) is equivalent to the following problem:

min{f(x) | x ∈ Q̄}.   (2.2.42)

However, the set Q̄ is bounded: for all x ∈ Q̄, we have

f(x_0) ≥ f(x) ≥ f(x_0) + ⟨∇f(x_0), x − x_0⟩ + (μ/2)‖x − x_0‖²   (by (2.1.20)).

Hence, ‖x − x_0‖ ≤ (2/μ)‖∇f(x_0)‖_*.

Thus, the solution x* of problem (2.2.42) (≡ (2.2.38)) exists. Let us prove that it is unique. Indeed, if x_1* is also an optimal solution to (2.2.38), then

f* = f(x_1*) ≥ f* + (μ/2)‖x_1* − x*‖²   (by (2.2.40)).

Therefore x_1* = x*. □

Example 2.2.4 Let f ∈ F¹_μ(Q, ‖·‖_p). Consider the following primal minimization problem:

f* = min_{x∈Q} { f(x) : Ax = b },   (2.2.43)

where A ∈ R^{m×n} and b ∈ R^m. In some applications, the set Q and the function f are very simple, and the complexity of this problem is related to the nontrivial intersection of the linear constraints with the set Q. In these cases, it is recommended to solve problem (2.2.43) by dualizing the linear constraints.

Let us introduce dual multipliers for the equality constraints and define the Lagrangian

L(x, u) = f(x) + ⟨u, b − Ax⟩,  x ∈ Q, u ∈ R^m.

Now we can define the dual function φ(u) = min_{x∈Q} L(x, u). By Theorem 2.2.10, this function is well defined for all u ∈ R^m. Let x(u) = arg min_{x∈Q} L(x, u) ∈ Q and let g(u) = b − Ax(u). Note that for arbitrary u_1 and u_2 ∈ R^m we have

φ(u_1) = f(x(u_1)) + ⟨u_1, b − Ax(u_1)⟩ ≤ f(x(u_2)) + ⟨u_1, b − Ax(u_2)⟩ = φ(u_2) + ⟨u_1 − u_2, g(u_2)⟩.

Let us introduce in R^m the norm ‖·‖_d. Define

‖A‖_{p,d} = max_{x,u} { ⟨Ax, u⟩ : ‖x‖_p ≤ 1, ‖u‖_d ≤ 1 } = max_u { ‖Aᵀu‖*_p : ‖u‖_d ≤ 1 }   (by (2.1.6)).

Then, for any u_1, u_2 ∈ R^m we have

⟨∇f(x(u_2)), x(u_1) − x(u_2)⟩ ≥ ⟨Aᵀu_2, x(u_1) − x(u_2)⟩   (by (2.2.39)).   (2.2.44)

Therefore,

φ(u_1) = f(x(u_1)) + ⟨u_1, b − Ax(u_1)⟩

≥ f(x(u_2)) + ⟨∇f(x(u_2)), x(u_1) − x(u_2)⟩ + (μ/2)‖x(u_1) − x(u_2)‖²_p + ⟨u_1, b − Ax(u_1)⟩   (by (2.1.20))

≥ f(x(u_2)) + ⟨u_2, A(x(u_1) − x(u_2))⟩ + (μ/2)‖x(u_1) − x(u_2)‖²_p + ⟨u_1, b − Ax(u_1)⟩   (by (2.2.44))

= φ(u_2) + ⟨g(u_2), u_1 − u_2⟩ − ⟨u_1 − u_2, A(x(u_1) − x(u_2))⟩ + (μ/2)‖x(u_1) − x(u_2)‖²_p

≥ φ(u_2) + ⟨g(u_2), u_1 − u_2⟩ − (1/(2μ)) (‖Aᵀ(u_1 − u_2)‖*_p)².

Since φ is concave, g(u) = ∇φ(u) and −φ ∈ F_L^{1,1}(R^m, ‖·‖_d) with L = (1/μ)‖A‖²_{p,d} (in view of (2.1.9)).

Now we can solve the Lagrangian dual problem

min_{u∈R^m} {−φ(u)}   (2.2.45)

by any method for minimizing smooth convex functions. Assuming that the solution u* of this problem exists, we have

0 = ∇φ(u*) = b − Ax(u*).

Thus, x(u*) is feasible for problem (2.2.43). On the other hand,

f* ≥ f_* ≝ max_{u∈R^m} φ(u) = f(x(u*)) + ⟨u*, ∇φ(u*)⟩ = f(x(u*))   (by (1.3.6)).

Hence, f* = f_* and x(u*) is the optimal solution of problem (2.2.43).


Now, assume that ū ∈ Rm is an approximate solution to the dual prob-
lem (2.2.45). Then it is clear that the norm of the gradient of the objective function
at this point is very important. Indeed, it bounds the residual b − A(x(ū)). On the
other hand,

f (x(ū)) − f ∗ = φ(ū) − ū, ∇φ(ū) − φ(u∗ ) ≤ ūd · ∇φ(ū)∗d .



Thus, the size of the gradient of the dual function bounds at the same time the level of infeasibility and the level of optimality.

We have already discussed in Sect. 2.2.2 how to compute a point with a small norm of the gradient. However, for problem (2.2.45) the situation is even simpler. Indeed, Theorem 2.2.4 shows that the average gradient at the points {y_k} decreases as O(1/k²). For problem (2.2.45), this means that the residual of the linear system Ax = b at some average point of the sequence {x(v_k)} ⊂ Q (with the points {v_k} corresponding to {y_k} in method (2.2.7)) decreases as O(1/k²). So, these average points can be taken as approximate solutions to the primal problem (2.2.43). □
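The dual construction of this example can be sketched for the simplest case f(x) = (1/2)‖x‖² and Q = R^n, so that μ = 1 and x(u) = Aᵀu is available in closed form. The instance is a hypothetical illustration; the step size 1/L with L = ‖A‖²/μ follows the smoothness bound derived above.

```python
import numpy as np

# Dual approach for min { 1/2 ||x||^2 : Ax = b } with Q = R^n (mu = 1).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

L = np.linalg.norm(A, 2) ** 2        # = ||A||^2 / mu, smoothness constant of -phi
u = np.zeros(2)
for _ in range(200):
    x_u = A.T @ u                    # x(u) = argmin_x L(x, u), closed form here
    g = b - A @ x_u                  # g(u) = grad phi(u) = equality residual
    u = u + g / L                    # gradient ascent on the concave dual

residual = np.linalg.norm(A @ (A.T @ u) - b)
print(residual < 1e-8)   # True
```

At the dual optimum the residual vanishes and x(u*) = Aᵀu* is the least-norm solution of Ax = b, in agreement with f* = f_* above.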
To conclude this section, let us analyze the properties of the Euclidean projection onto a convex set. Up to the end of this section, the notation ‖·‖ is used for the standard Euclidean norm.

Definition 2.2.2 Let Q be a closed set and x_0 ∈ R^n. Define

π_Q(x_0) = arg min_{x∈Q} ‖x − x_0‖.   (2.2.46)

We call π_Q(x_0) the Euclidean projection of the point x_0 onto the set Q.

Let f(x) = (1/2)‖x‖². Since ∇²f(x) = I_n, this function belongs to the class S^{1,1}_{1,1}(R^n).

Theorem 2.2.11 If Q is a convex set, then there exists a unique projection π_Q(x_0).

Proof Indeed, π_Q(x_0) = arg min_{x∈Q} f(x), where f(x) = (1/2)‖x − x_0‖² ∈ S^{1,1}_{1,1}(R^n). Therefore π_Q(x_0) is unique and well defined in view of Theorem 2.2.10. □

Since Q is closed, π_Q(x_0) = x_0 if and only if x_0 ∈ Q.
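For simple sets the projection (2.2.46) is available in closed form. The sketch below (hypothetical sets and points) implements two standard cases, a box and a Euclidean ball, and spot-checks the variational inequality (2.2.47) and the nonexpansiveness property (2.2.48) established just below.

```python
import numpy as np

def proj_box(x, lo, hi):
    # Componentwise clipping is the Euclidean projection onto a box.
    return np.clip(x, lo, hi)

def proj_ball(x, center, r):
    # Radial scaling is the Euclidean projection onto a ball.
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= r else center + r * d / n

bx = proj_box(np.array([1.5, -0.3]), 0.0, 1.0)       # -> [1.0, 0.0]

x0 = np.array([3.0, -2.0])                           # point outside the unit ball
p = proj_ball(x0, np.zeros(2), 1.0)

# (2.2.47): <pi(x0) - x0, x - pi(x0)> >= 0 for all x in Q (random spot-check)
rng = np.random.default_rng(1)
ok = True
for _ in range(100):
    q = proj_ball(rng.standard_normal(2), np.zeros(2), 1.0)   # a point of Q
    ok = ok and (p - x0) @ (q - p) >= -1e-12

# (2.2.48): ||pi(x1) - pi(x2)|| <= ||x1 - x2||
x1, x2 = np.array([5.0, 5.0]), np.array([-4.0, 2.0])
nonexpansive = (np.linalg.norm(proj_ball(x1, np.zeros(2), 1.0)
                               - proj_ball(x2, np.zeros(2), 1.0))
                <= np.linalg.norm(x1 - x2))
print(ok and nonexpansive)   # True
```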
Lemma 2.2.7 Let Q be a closed convex set and x_0 ∉ Q. Then for any x ∈ Q, we have

⟨π_Q(x_0) − x_0, x − π_Q(x_0)⟩ ≥ 0.   (2.2.47)

Proof Note that π_Q(x_0) is a solution of the minimization problem min_{x∈Q} f(x) with f(x) = (1/2)‖x − x_0‖². Therefore, in view of Theorem 2.2.9 we have

⟨∇f(π_Q(x_0)), x − π_Q(x_0)⟩ ≥ 0

for all x ∈ Q. It remains to note that ∇f(x) = x − x_0. □



Corollary 2.2.3 For any two points x_1 and x_2 ∈ R^n, we have

‖π_Q(x_1) − π_Q(x_2)‖ ≤ ‖x_1 − x_2‖.   (2.2.48)

Proof Indeed, in view of inequality (2.2.47), we have

⟨π_Q(x_1) − x_1, π_Q(x_2) − π_Q(x_1)⟩ ≥ 0,

⟨π_Q(x_2) − x_2, π_Q(x_1) − π_Q(x_2)⟩ ≥ 0.

Adding these two inequalities, we get

‖π_Q(x_1) − π_Q(x_2)‖² ≤ ⟨π_Q(x_1) − π_Q(x_2), x_1 − x_2⟩ ≤ ‖π_Q(x_1) − π_Q(x_2)‖ · ‖x_1 − x_2‖. □

Let us also mention a triangle inequality for the projection (compare with (2.2.36)).

Lemma 2.2.8 For any two points x ∈ Q and y ∈ R^n, we have

‖x − π_Q(y)‖² + ‖π_Q(y) − y‖² ≤ ‖x − y‖².   (2.2.49)

Proof Indeed, in view of (2.2.47), we have

‖x − π_Q(y)‖² − ‖x − y‖² = ⟨y − π_Q(y), 2x − π_Q(y) − y⟩ ≤ −‖y − π_Q(y)‖². □

There exists a useful characterization of optimal solutions to problem (2.2.38) in terms of the Euclidean projection.

Theorem 2.2.12 Let x* be an optimal solution to problem (2.2.38). Then, for any γ > 0 we have

π_Q(x* − (1/γ)∇f(x*)) = x*.   (2.2.50)

Proof Consider the minimization problem min_{x∈Q} (1/2)‖x − x* + (1/γ)∇f(x*)‖². Its objective function is strongly convex. Hence, in view of Theorem 2.2.10, its solution x_* exists and is unique. Moreover, in view of Theorem 2.2.9, it is completely characterized by the following inequality:

⟨x_* − x* + (1/γ)∇f(x*), x − x_*⟩ ≥ 0,  ∀x ∈ Q.

For x_* = x*, this inequality becomes (1/γ)⟨∇f(x*), x − x*⟩ ≥ 0, which holds by (2.2.39). Hence, x_* = x*. □


Finally, let us mention some properties of the distance function to a convex set:

ρ_Q(x) ≝ (1/2)‖x − π_Q(x)‖²,  x ∈ R^n.   (2.2.51)

Lemma 2.2.9 The function ρ_Q is convex and differentiable on R^n with gradient

∇ρ_Q(x) = x − π_Q(x),  x ∈ R^n,   (2.2.52)

which is Lipschitz continuous in the standard Euclidean norm with constant one.

Proof Let us fix two arbitrary points x_1 and x_2 in R^n. Let π_1 = π_Q(x_1) ∈ Q, π_2 = π_Q(x_2) ∈ Q, g_1 = x_1 − π_1, and g_2 = x_2 − π_2. In view of the Euclidean identity

(1/2)‖g_2‖² = (1/2)‖g_1‖² + ⟨g_1, g_2 − g_1⟩ + (1/2)‖g_2 − g_1‖²,   (2.2.53)

we have

ρ_Q(x_2) ≥ ρ_Q(x_1) + ⟨x_1 − π_Q(x_1), x_2 − x_1⟩ + ⟨π_Q(x_1) − x_1, π_Q(x_2) − π_Q(x_1)⟩ ≥ ρ_Q(x_1) + ⟨g_1, x_2 − x_1⟩,

where the last inequality follows from (2.2.47). On the other hand,

ρ_Q(x_2) − ρ_Q(x_1) = ⟨g_1, g_2 − g_1⟩ + (1/2)‖g_2 − g_1‖²   (by (2.2.53))

= ⟨g_1, x_2 − x_1⟩ + ⟨g_1, π_1 − π_2 − g_2⟩ + (1/2)‖g_1‖² + (1/2)‖g_2‖²

≤ ⟨g_1, x_2 − x_1⟩ + ⟨g_1, π_1 − x_2⟩ + (1/2)‖g_1‖² + (1/2)‖x_2 − π_1‖²   (by (2.2.46))

= ⟨g_1, x_2 − x_1⟩ + (1/2)‖x_2 − x_1‖².

Thus, for arbitrary points x_1 and x_2 ∈ R^n we have proved the following relations:

⟨g_1, x_2 − x_1⟩ ≤ ρ_Q(x_2) − ρ_Q(x_1) ≤ ⟨g_1, x_2 − x_1⟩ + (1/2)‖x_2 − x_1‖².

Hence the function ρ_Q is convex and differentiable at any point x ∈ R^n with ∇ρ_Q(x) = x − π_Q(x). Moreover, in view of condition (2.1.9), ρ_Q ∈ F_1^{1,1}(R^n). □

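Formula (2.2.52) can be checked against finite differences for a simple set. The sketch below uses the unit Euclidean ball and an arbitrary test point (both hypothetical choices).

```python
import numpy as np

# rho_Q(x) = 1/2 ||x - pi_Q(x)||^2 for Q the unit Euclidean ball; its gradient
# (2.2.52) is x - pi_Q(x).  We compare with central finite differences.
def proj(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

rho = lambda x: 0.5 * np.dot(x - proj(x), x - proj(x))
grad_rho = lambda x: x - proj(x)     # formula (2.2.52)

x = np.array([2.0, -1.5])
h = 1e-6
fd = np.array([(rho(x + h * e) - rho(x - h * e)) / (2 * h)
               for e in np.eye(2)])
print(np.allclose(fd, grad_rho(x), atol=1e-5))   # True
```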

2.2.4 The Gradient Mapping

As compared with the unconstrained problem, in the constrained minimization


problem (2.2.38), the gradient of the objective function should be treated differently.
In the previous section, we have already seen that its role in optimality conditions
is changing. Moreover, we can no longer use it for the gradient step since the result
may be infeasible. If we look at the main properties of the gradient, which are
useful for functions from the class FL1,1 (Rn ), we can see that two of them are of the
highest importance. The first is that the step along the direction of the anti-gradient
decreases the function value by an amount comparable with the squared norm of the
gradient:

f(x − (1/L)∇f(x)) ≤ f(x) − (1/(2L))‖∇f(x)‖².

The second is the inequality

⟨∇f(x), x − x*⟩ ≥ (1/L)‖∇f(x)‖².

It turns out that for Constrained Minimization we can introduce an object which
inherits both these important properties.
Definition 2.2.3 Let us fix some γ > 0. Define

x_Q(x̄; γ) = arg min_{x∈Q} { f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (γ/2)‖x − x̄‖² },
g_Q(x̄; γ) = γ(x̄ − x_Q(x̄; γ)).   (2.2.54)

We call x_Q(x̄; γ) the gradient mapping, and g_Q(x̄; γ) the reduced gradient of the function f on Q.

Note that the objective function of the optimization problem in this definition can be written as

f(x̄) + (γ/2)‖x − x̄ + (1/γ)∇f(x̄)‖² − (1/(2γ))‖∇f(x̄)‖².   (2.2.55)

Thus, x_Q(x̄; γ) is a projection of the point x̄ − (1/γ)∇f(x̄) onto the feasible set. For Q ≡ R^n, we have

x_Q(x̄; γ) = x̄ − (1/γ)∇f(x̄),  g_Q(x̄; γ) = ∇f(x̄).

The value 1/γ can be seen as a natural step size for the "gradient" step

x̄ → x_Q(x̄; γ) = x̄ − (1/γ)g_Q(x̄; γ)   (by (2.2.54)).   (2.2.56)
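In view of (2.2.55), for a simple Q the gradient mapping can be computed by projecting the gradient step. A minimal sketch (the box constraint and the quadratic objective are arbitrary choices):

```python
import numpy as np

# Gradient mapping for f(x) = 1/2 ||x - c||^2 on the box Q = [0, 1]^n,
# using the projection form x_Q(x; gamma) = pi_Q(x - grad f(x)/gamma).
c = np.array([2.0, -1.0, 0.5])
grad = lambda x: x - c                       # L = 1 for this f

def grad_mapping(x_bar, gamma, lo=0.0, hi=1.0):
    x_q = np.clip(x_bar - grad(x_bar) / gamma, lo, hi)   # projection onto the box
    g_q = gamma * (x_bar - x_q)                          # reduced gradient (2.2.54)
    return x_q, g_q

# A fixed point of the mapping is the constrained minimizer, cf. (2.2.50):
x_star = np.clip(c, 0.0, 1.0)                # minimizer of f over Q
x_fix, g_fix = grad_mapping(x_star, gamma=1.0)
print(np.allclose(x_fix, x_star) and np.allclose(g_fix, 0.0))   # True
```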

Note that the gradient mapping is well defined in view of Theorem 2.2.10. Moreover, it is defined for all x̄ ∈ R^n, not necessarily from Q.

Let us write down the main property of the gradient mapping.

Theorem 2.2.13 Let f ∈ S^{1,1}_{μ,L}(Q), γ ≥ L, and x̄ ∈ R^n. Then for any x ∈ Q, we have

f(x) ≥ f(x_Q(x̄; γ)) + ⟨g_Q(x̄; γ), x − x̄⟩ + (1/(2γ))‖g_Q(x̄; γ)‖² + (μ/2)‖x − x̄‖².   (2.2.57)

Proof Let x_Q = x_Q(x̄; γ), g_Q = g_Q(x̄; γ), and

φ(x) = f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (γ/2)‖x − x̄‖².

Then ∇φ(x) = ∇f(x̄) + γ(x − x̄), and for any x ∈ Q we have

⟨∇f(x̄) − g_Q, x − x_Q⟩ = ⟨∇φ(x_Q), x − x_Q⟩ ≥ 0   (by (2.2.39)).

Hence,

f(x) − (μ/2)‖x − x̄‖² ≥ f(x̄) + ⟨∇f(x̄), x − x̄⟩   (by (2.1.20))

= f(x̄) + ⟨∇f(x̄), x_Q − x̄⟩ + ⟨∇f(x̄), x − x_Q⟩

≥ f(x̄) + ⟨∇f(x̄), x_Q − x̄⟩ + ⟨g_Q, x − x_Q⟩

= φ(x_Q) − (γ/2)‖x_Q − x̄‖² + ⟨g_Q, x − x_Q⟩

= φ(x_Q) − (1/(2γ))‖g_Q‖² + ⟨g_Q, x − x_Q⟩

= φ(x_Q) + (1/(2γ))‖g_Q‖² + ⟨g_Q, x − x̄⟩,

and φ(x_Q) ≥ f(x_Q) since γ ≥ L (by (2.1.9)). □

Corollary 2.2.4 Let f ∈ S^{1,1}_{μ,L}(Q), γ ≥ L, and x̄ ∈ Q. Then

f(x_Q(x̄; γ)) ≤ f(x̄) − (1/(2γ))‖g_Q(x̄; γ)‖²,   (2.2.58)

⟨g_Q(x̄; γ), x̄ − x*⟩ ≥ (1/(2γ))‖g_Q(x̄; γ)‖² + (μ/2)‖x̄ − x*‖² + (μ/2)‖x_Q(x̄; γ) − x*‖².   (2.2.59)

Proof Indeed, using (2.2.57) with x = x̄, we get (2.2.58). Using (2.2.57) with x =
x ∗ , we get (2.2.59) since

(2.2.40)
f (xQ (x̄; γ )) ≥ f (x ∗ ) + μ2 xQ (x̄; γ ) − x ∗ 2 . 

2.2.5 Minimization over Simple Sets

Let us show that we can use the gradient mapping to solve the following problem:

min_{x∈Q} f(x),

where f ∈ S^{1,1}_{μ,L}(Q) and Q is a closed convex set. We assume that the set Q is simple enough, so that the gradient mapping can be computed by a closed-form expression. This assumption is valid for some simple sets like positive orthants, n-dimensional boxes, simplexes, Euclidean balls, and some others.

Let us start with the Gradient Method.

Gradient Method for Simple Set

0. Choose a starting point x_0 ∈ Q and a parameter γ > 0.
1. kth iteration (k ≥ 0). Set

x_{k+1} = x_k − (1/γ) g_Q(x_k; γ).   (2.2.60)

Note that in this scheme

x_{k+1} = x_Q(x_k; γ) = π_Q(x_k − (1/γ)∇f(x_k))   (by (2.2.56)).   (2.2.61)
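Scheme (2.2.60), in its projection form (2.2.61), can be sketched as follows. The strongly convex quadratic and the box constraint are hypothetical choices; γ is set to (L+μ)/2, the smallest value admitted by the analysis below.

```python
import numpy as np

# Projected gradient method for f(x) = 1/2 <Ax, x> - <b, x> over Q = [0, 1]^2.
A = np.array([[2.0, 0.0], [0.0, 0.5]])       # mu = 0.5, L = 2.0
b = np.array([4.0, -1.0])
mu, Lip = 0.5, 2.0
grad = lambda x: A @ x - b

gamma = (Lip + mu) / 2                        # minimal admissible gamma
x = np.array([0.0, 1.0])
for _ in range(100):
    x = np.clip(x - grad(x) / gamma, 0.0, 1.0)    # (2.2.61)

# Optimality via (2.2.50): x* is a fixed point of the projected step.
fixed_point = np.allclose(np.clip(x - grad(x) / gamma, 0.0, 1.0), x)
print(fixed_point)   # True
```

For this separable instance the minimizer over the box is (1, 0), which the iteration reaches and then keeps fixed.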

The efficiency analysis of this scheme is very similar to the analysis of its
unconstrained version.
Theorem 2.2.14 Let f ∈ S^{1,1}_{μ,L}(R^n). If in (2.2.60) γ ≥ (L + μ)/2, then

‖x_k − x*‖ ≤ (1 − μ/γ)^k ‖x_0 − x*‖.

Proof Let r_k = ‖x_k − x*‖. Then, in view of Theorem 2.2.12, we have

r²_{k+1} = ‖π_Q(x_k − (1/γ)∇f(x_k)) − π_Q(x* − (1/γ)∇f(x*))‖²   (by (2.2.61))

≤ ‖x_k − x* − (1/γ)(∇f(x_k) − ∇f(x*))‖²   (by (2.2.48))

= r_k² − (2/γ)⟨∇f(x_k) − ∇f(x*), x_k − x*⟩ + (1/γ²)‖∇f(x_k) − ∇f(x*)‖²

≤ (1 − (2/γ)·(μL/(μ + L))) r_k² + ((1/γ²) − (2/γ)·(1/(μ + L)))‖∇f(x_k) − ∇f(x*)‖²   (by (2.1.32))

≤ [ 1 − (2/γ)·(μL/(μ + L)) + μ²((1/γ²) − (2/γ)·(1/(μ + L))) ] r_k² = (1 − μ/γ)² r_k²   (by (2.1.26)). □

Thus, for the minimal value of the scaling parameter, γ = (L + μ)/2, method (2.2.60) has the same rate of convergence as the unconstrained scheme (2.1.37):

‖x_k − x*‖ ≤ ((L − μ)/(L + μ))^k ‖x_0 − x*‖.   (2.2.62)

Consider now the optimal schemes. We give only a sketch of their justification since it is very similar to the analysis of Sect. 2.2.1.

First of all, we define the estimating sequences. Assume that x_0 ∈ Q. Define

φ_0(x) = f(x_0) + (γ_0/2)‖x − x_0‖²,

φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k [ f(x_Q(y_k; L)) + (1/(2L))‖g_Q(y_k; L)‖² + ⟨g_Q(y_k; L), x − y_k⟩ + (μ/2)‖x − y_k‖² ],  k ≥ 0.

Note that the recursive rule for updating the estimating functions φ_k(·) has changed. The reason is that now we have to use inequality (2.2.57) instead of (2.1.20). However, this modification does not change the functional terms in the recursion; only the constant terms are affected. Therefore, it is possible to keep all complexity results of Sect. 2.2.1.

It is easy to see that the estimating sequence {φ_k(·)} can be represented in the canonical form

φ_k(x) = φ_k* + (γ_k/2)‖x − v_k‖²,

with the following recursive rules for γ_k, v_k and φ_k*:

γ_{k+1} = (1 − α_k)γ_k + α_k μ,

v_{k+1} = (1/γ_{k+1}) [ (1 − α_k)γ_k v_k + α_k μ y_k − α_k g_Q(y_k; L) ],

φ*_{k+1} = (1 − α_k)φ_k* + α_k f(x_Q(y_k; L)) + ( α_k/(2L) − α_k²/(2γ_{k+1}) ) ‖g_Q(y_k; L)‖²

+ (α_k(1 − α_k)γ_k/γ_{k+1}) [ (μ/2)‖y_k − v_k‖² + ⟨g_Q(y_k; L), v_k − y_k⟩ ].

Further, assuming that φ_k* ≥ f(x_k) and using the inequality

f(x_k) ≥ f(x_Q(y_k; L)) + ⟨g_Q(y_k; L), x_k − y_k⟩ + (1/(2L))‖g_Q(y_k; L)‖² + (μ/2)‖x_k − y_k‖²   (by (2.2.57)),

we come to the following lower bound:

φ*_{k+1} ≥ (1 − α_k)f(x_k) + α_k f(x_Q(y_k; L)) + ( α_k/(2L) − α_k²/(2γ_{k+1}) )‖g_Q(y_k; L)‖² + (α_k(1 − α_k)γ_k/γ_{k+1}) ⟨g_Q(y_k; L), v_k − y_k⟩

≥ f(x_Q(y_k; L)) + ( 1/(2L) − α_k²/(2γ_{k+1}) )‖g_Q(y_k; L)‖² + (1 − α_k)⟨g_Q(y_k; L), (α_k γ_k/γ_{k+1})(v_k − y_k) + x_k − y_k⟩.

Thus, again we can choose

x_{k+1} = x_Q(y_k; L),

Lα_k² = (1 − α_k)γ_k + α_k μ ≡ γ_{k+1},

y_k = (1/(γ_k + α_k μ)) (α_k γ_k v_k + γ_{k+1} x_k).
2.3 The Minimization Problem with Smooth Components 117

Let us write down the corresponding variant of scheme (2.2.20).

Constant Step Scheme II for Simple Set

0. Choose x_0 ∈ R^n and α_0 ∈ [√q_f, 2(3 + q_f)/(3 + √(21 + 4q_f))]. Set y_0 = x_0.
1. kth iteration (k ≥ 0).
 (a) Compute f(y_k) and ∇f(y_k). Set x_{k+1} = x_Q(y_k; L).
 (b) Compute α_{k+1} ∈ (0, 1) from the equation α²_{k+1} = (1 − α_{k+1})α_k² + q_f α_{k+1}. Set β_k = α_k(1 − α_k)/(α_k² + α_{k+1}) and y_{k+1} = x_{k+1} + β_k(x_{k+1} − x_k).

(2.2.63)

The rate of convergence of this method is given by Theorem 2.2.3. Note that
only the points {xk } are feasible for Q. The sequence {yk } is used for computing the
gradient mapping and it may be infeasible.
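Scheme (2.2.63) can be sketched in code, with the gradient-mapping step x_{k+1} = x_Q(y_k; L) computed as the projection of y_k − ∇f(y_k)/L (valid here since the prox-term in (2.2.54) is Euclidean). The quadratic instance and the choice α_0 = √q_f are illustrative assumptions.

```python
import numpy as np

# Accelerated scheme (2.2.63) for a strongly convex quadratic over Q = [0, 1]^2.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
b = np.array([4.0, -1.0])
mu, Lip = 0.5, 2.0
qf = mu / Lip
grad = lambda x: A @ x - b

x = np.array([0.0, 1.0])
y = x.copy()
alpha = np.sqrt(qf)                               # alpha_0 = sqrt(q_f), admissible
for _ in range(50):
    x_new = np.clip(y - grad(y) / Lip, 0.0, 1.0)  # x_{k+1} = x_Q(y_k; L)
    # alpha_{k+1} in (0, 1) solves alpha^2 = (1 - alpha) alpha_k^2 + q_f alpha
    a2 = alpha ** 2
    alpha_new = ((qf - a2) + np.sqrt((qf - a2) ** 2 + 4 * a2)) / 2
    beta = alpha * (1 - alpha) / (alpha ** 2 + alpha_new)
    y = x_new + beta * (x_new - x)                # may leave Q: only x_k is feasible
    x, alpha = x_new, alpha_new

x_star = np.array([1.0, 0.0])                     # minimizer over the box
print(np.allclose(x, x_star, atol=1e-8))   # True
```

Tracing this run also illustrates the remark above: the auxiliary points y_k briefly leave the box, while all x_k remain feasible.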

2.3 The Minimization Problem with Smooth Components

(Minimax problems: Gradient Mapping, Gradient Method, Optimal Methods; Problem with
functional constraints; Methods for Constrained Minimization.)

2.3.1 The Minimax Problem

Very often, the objective function in optimization problems is composed of several


functional components. For example, the reliability of a complex system is usually
defined as the minimal reliability of its parts. A constrained minimization problem
with functional constraints also provides us with an example of the interaction of
several nonlinear functions, etc.
The simplest problem of this type is called the (discrete) minimax problem. In
this section, we consider the following smooth minimax problem:
 
min_{x∈Q} { f(x) = max_{1≤i≤m} f_i(x) },   (2.3.1)

where f_i ∈ S^{1,1}_{μ,L}(R^n, ‖·‖), i = 1 … m, and Q is a closed convex set. We call the function f a max-type function composed of the components f_i(x). We write f ∈ S^{1,1}_{μ,L}(R^n, ‖·‖) if all components of the function f belong to this class.

Note that, in general, f is not differentiable. However, provided that all f_i are differentiable functions, we can introduce an object which behaves exactly as a linear approximation of a differentiable function.
Definition 2.3.1 Let f be a max-type function:

f(x) = max_{1≤i≤m} f_i(x).

The function

f(x̄; x) = max_{1≤i≤m} [ f_i(x̄) + ⟨∇f_i(x̄), x − x̄⟩ ]

is called the linearization of f at the point x̄.


Compare the following result with inequalities (2.1.20) and (2.1.9).

Lemma 2.3.1 For any two points x and x̄ in R^n, we have

f(x) ≥ f(x̄; x) + (μ/2)‖x − x̄‖²,   (2.3.2)

f(x) ≤ f(x̄; x) + (L/2)‖x − x̄‖².   (2.3.3)

Proof Indeed, for all i = 1, …, m, we have

f_i(x) ≥ f_i(x̄) + ⟨∇f_i(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖²   (by (2.1.20)).

Taking the maximum of these inequalities over i, we get (2.3.2). To prove (2.3.3), we use the inequalities

f_i(x) ≤ f_i(x̄) + ⟨∇f_i(x̄), x − x̄⟩ + (L/2)‖x − x̄‖²,  i = 1, …, m   (by (2.1.9)). □

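The two bounds of Lemma 2.3.1 can be spot-checked numerically. In the sketch below the components are quadratics with identical Hessians μI_n (so μ = L here and both bounds hold with equality); the instance is an arbitrary assumption.

```python
import numpy as np

# Max-type function f(x) = max_i f_i(x) with f_i(x) = 1/2 mu ||x||^2 + <a_i, x>,
# its linearization (Definition 2.3.1), and the bounds (2.3.2)-(2.3.3).
mu = 1.0
a = [np.array([1.0, 0.0]), np.array([-1.0, 1.0]), np.array([0.0, -1.0])]

f   = lambda x: max(0.5 * mu * (x @ x) + ai @ x for ai in a)
lin = lambda xb, x: max(0.5 * mu * (xb @ xb) + ai @ xb
                        + (mu * xb + ai) @ (x - xb) for ai in a)

rng = np.random.default_rng(3)
ok = True
for _ in range(200):
    xb, x = rng.standard_normal(2), rng.standard_normal(2)
    d2 = np.dot(x - xb, x - xb)
    ok = ok and (lin(xb, x) + 0.5 * mu * d2 - 1e-9
                 <= f(x)
                 <= lin(xb, x) + 0.5 * mu * d2 + 1e-9)
print(ok)   # True
```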
Let us write down the optimality conditions for problem (2.3.1) (compare with Theorem 2.2.9).

Theorem 2.3.1 A point x* ∈ Q is an optimal solution to problem (2.3.1) if and only if for any x ∈ Q we have

f(x*; x) ≥ f(x*; x*) = f(x*).   (2.3.4)

Proof Indeed, if condition (2.3.4) holds, then

f(x) ≥ f(x*; x) ≥ f(x*; x*) = f(x*)   (by (2.3.2) and (2.3.4))

for all x ∈ Q.

Let x* be an optimal solution to (2.3.1). Assume that there exists an x ∈ Q such that f(x*; x) < f(x*). Consider the functions

φ_i(α) = f_i(x* + α(x − x*)),  i = 1 … m.

Note that for all i, 1 ≤ i ≤ m, we have

f_i(x*) + ⟨∇f_i(x*), x − x*⟩ < f(x*) = max_{1≤i≤m} f_i(x*).

Therefore, either φ_i(0) ≡ f_i(x*) < f(x*), or

φ_i(0) = f(x*),  φ′_i(0) = ⟨∇f_i(x*), x − x*⟩ < 0.

Thus, for α small enough, we have

f_i(x* + α(x − x*)) = φ_i(α) < f(x*)

for all i, 1 ≤ i ≤ m. This is a contradiction. □



Corollary 2.3.1 Let x* be a minimum of the max-type function f(·) on the set Q. If f belongs to S¹_μ(R^n, ‖·‖), then

f(x) ≥ f(x*) + (μ/2)‖x − x*‖²

for all x ∈ Q.

Proof Indeed, in view of (2.3.2) and Theorem 2.3.1, for any x ∈ Q, we have

f(x) ≥ f(x*; x) + (μ/2)‖x − x*‖² ≥ f(x*; x*) + (μ/2)‖x − x*‖² = f(x*) + (μ/2)‖x − x*‖². □

Finally, let us prove an existence theorem.

Theorem 2.3.2 Let the max-type function f belong to the class S¹_μ(R^n, ‖·‖) with μ > 0, and let Q be a closed convex set. Then there exists a unique optimal solution x* to problem (2.3.1).
120 2 Smooth Convex Optimization

Proof Let x̄ ∈ Q. Consider the set Q̄ = {x ∈ Q | f(x) ≤ f(x̄)}. Note that problem (2.3.1) is equivalent to the following problem:

min{f(x) | x ∈ Q̄}.   (2.3.5)

However, the set Q̄ is bounded: for any x ∈ Q̄, in view of (2.1.20) we have

f(x̄) ≥ fi(x) ≥ fi(x̄) + ⟨∇fi(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖², i = 1, . . . , m.

Consequently,

(μ/2)‖x − x̄‖² ≤ ‖∇fi(x̄)‖∗ · ‖x − x̄‖ + f(x̄) − fi(x̄), i = 1, . . . , m.

Thus, the solution x∗ of (2.3.5) (and of (2.3.1)) exists.

If x₁∗ is another solution to (2.3.1), then by (2.3.2) and (2.3.4) we have

f(x∗) = f(x₁∗) ≥ f(x∗; x₁∗) + (μ/2)‖x₁∗ − x∗‖² ≥ f(x∗) + (μ/2)‖x₁∗ − x∗‖².

Therefore, x₁∗ = x∗. □


2.3.2 Gradient Mapping

In Sect. 2.2.4, we introduced the reduced gradient, which replaces the usual gradient
for a constrained minimization problem over a simple set. Since linearization of a
max-type function behaves similarly to the linearization of a smooth function, we
can adapt this notion to our particular situation. Up to the end of this chapter, we
will be working with the standard Euclidean norm.
Let us fix some γ > 0 and a point x̄ ∈ Rⁿ. For a max-type function f, define

fγ(x̄; x) = f(x̄; x) + (γ/2)‖x − x̄‖².

The following definition is an extension of Definition 2.2.3.


Definition 2.3.2 Define

f∗(x̄; γ) = min_{x∈Q} fγ(x̄; x),

xf(x̄; γ) = arg min_{x∈Q} fγ(x̄; x),

gf(x̄; γ) = γ(x̄ − xf(x̄; γ)).



We call xf(x̄; γ) the Gradient Mapping and gf(x̄; γ) the Reduced Gradient of a max-type function f on Q.
For m = 1, this definition is equivalent to Definition 2.2.3. Note that the point of
linearization x̄ does not necessarily belong to Q. At the same time, now the point
xf (x̄; γ ) cannot be interpreted as a projection (2.2.55).
It is clear that fγ(x̄; ·) is a max-type function composed of the components

fi(x̄) + ⟨∇fi(x̄), x − x̄⟩ + (γ/2)‖x − x̄‖² ∈ S^{1,1}_{γ,γ}(Rⁿ), i = 1, . . . , m.

Therefore, the gradient mapping is well defined (see Theorem 2.3.2).
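The auxiliary problem in Definition 2.3.2 is itself a nontrivial optimization problem. For the particular case m = 2 with Q = Rⁿ it can be solved in closed form through its dual over the simplex. The following sketch is our own illustration, not part of the book; the function `grad_mapping_2` and the toy instance are illustrative assumptions. The dual variable reduces to a scalar t ∈ [0, 1], and the concave quadratic dual objective is maximized explicitly.

```python
import numpy as np

def grad_mapping_2(x_bar, c, g, gamma):
    """Gradient mapping x_f(x_bar; gamma) and reduced gradient g_f(x_bar; gamma)
    for f = max(f_1, f_2) with Q = R^n.

    The auxiliary problem min_x max_i [c_i + <g_i, x - x_bar>] + (gamma/2)||x - x_bar||^2
    has the dual max_{t in [0,1]} t*c_1 + (1-t)*c_2 - ||t*g_1 + (1-t)*g_2||^2/(2*gamma),
    a concave quadratic in t, so the optimal weight is the unconstrained maximizer
    clipped to [0, 1]."""
    d = g[0] - g[1]
    dd = float(d @ d)
    if dd < 1e-14:                     # equal gradients: the weight does not matter
        t = 1.0
    else:
        t = min(1.0, max(0.0, (gamma * (c[0] - c[1]) - float(d @ g[1])) / dd))
    g_f = t * g[0] + (1.0 - t) * g[1]  # aggregated (reduced) gradient
    return x_bar - g_f / gamma, g_f

# Toy instance: f(x) = max(||x - a||^2, ||x - b||^2), minimized at the origin.
a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
def oracle(x):
    return (np.array([float((x - a) @ (x - a)), float((x - b) @ (x - b))]),
            np.array([2.0 * (x - a), 2.0 * (x - b)]))

L = 2.0                                # both components have Hessian 2*I
c, g = oracle(np.zeros(2))
x_f, g_f = grad_mapping_2(np.zeros(2), c, g, L)
```

At the minimizer of this toy instance the reduced gradient vanishes, in agreement with Theorem 2.3.1, while at a non-optimal point the mapping decreases the objective, as in inequality (2.3.8).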


Let us now prove the main result of this section, which highlights the similarity
between the properties of the Gradient Mapping and the properties of the reduced
gradient (compare with Theorem 2.2.13).
Theorem 2.3.3 For all x ∈ Q, γ ≥ L, and x̄ ∈ Rⁿ, we have

f(x̄; x) ≥ f∗(x̄; γ) + ⟨gf(x̄; γ), x − x̄⟩ + (1/(2γ))‖gf(x̄; γ)‖².   (2.3.6)

Proof Let xf = xf(x̄; γ) and gf = gf(x̄; γ). It is clear that fγ(x̄; ·) ∈ S^{1,1}_{γ,γ}(Rⁿ) and it is a max-type function. Therefore, all results of the previous section can also be applied to the function fγ.
Since xf = arg min_{x∈Q} fγ(x̄; x), in view of Corollary 2.3.1 and Theorem 2.3.1 we have

f(x̄; x) = fγ(x̄; x) − (γ/2)‖x − x̄‖²
  ≥ fγ(x̄; xf) + (γ/2)(‖x − xf‖² − ‖x − x̄‖²)
  ≥ f∗(x̄; γ) + (γ/2)⟨x̄ − xf, 2x − xf − x̄⟩
  = f∗(x̄; γ) + (γ/2)⟨x̄ − xf, 2(x − x̄) + x̄ − xf⟩
  = f∗(x̄; γ) + ⟨gf, x − x̄⟩ + (1/(2γ))‖gf‖². □

In what follows, we often use the following corollary to Theorem 2.3.3.


Corollary 2.3.2 Let f ∈ S^{1,1}_{μ,L}(Rⁿ) and γ ≥ L. Then:
1. For any x ∈ Q and x̄ ∈ Rⁿ, we have

f(x) ≥ f(xf(x̄; γ)) + ⟨gf(x̄; γ), x − x̄⟩ + (1/(2γ))‖gf(x̄; γ)‖² + (μ/2)‖x − x̄‖².   (2.3.7)

2. If x̄ ∈ Q, then

f(xf(x̄; γ)) ≤ f(x̄) − (1/(2γ))‖gf(x̄; γ)‖².   (2.3.8)

3. For any x̄ ∈ Rⁿ, we have

⟨gf(x̄; γ), x̄ − x∗⟩ ≥ (1/(2γ))‖gf(x̄; γ)‖² + (μ/2)‖x∗ − x̄‖².   (2.3.9)

Proof The assumption γ ≥ L implies that f∗(x̄; γ) ≥ f(xf(x̄; γ)). Therefore, (2.3.7) follows from (2.3.6) since

f(x) ≥ f(x̄; x) + (μ/2)‖x − x̄‖²

for all x ∈ Rⁿ (see Lemma 2.3.1).
Using (2.3.7) with x = x̄, we get (2.3.8), and using (2.3.7) with x = x∗, we get (2.3.9) since f(xf(x̄; γ)) − f(x∗) ≥ 0. □

Finally, let us estimate the variation of the optimal value f ∗ (x̄; γ ) as a function
of γ .
Lemma 2.3.2 For any γ₁, γ₂ > 0 and x̄ ∈ Rⁿ, we have

f∗(x̄; γ₂) ≥ f∗(x̄; γ₁) + ((γ₂ − γ₁)/(2γ₁γ₂))‖gf(x̄; γ₁)‖².

Proof Let xᵢ = xf(x̄; γᵢ) and gᵢ = gf(x̄; γᵢ), i = 1, 2. In view of (2.3.6), we have

f(x̄; x) + (γ₂/2)‖x − x̄‖² ≥ f∗(x̄; γ₁) + ⟨g₁, x − x̄⟩ + (1/(2γ₁))‖g₁‖² + (γ₂/2)‖x − x̄‖²   (2.3.10)

for all x ∈ Q. In particular, for x = x₂ we obtain

f∗(x̄; γ₂) = f(x̄; x₂) + (γ₂/2)‖x₂ − x̄‖²
  ≥ f∗(x̄; γ₁) + ⟨g₁, x₂ − x̄⟩ + (1/(2γ₁))‖g₁‖² + (γ₂/2)‖x₂ − x̄‖²
  = f∗(x̄; γ₁) + (1/(2γ₁))‖g₁‖² − (1/γ₂)⟨g₁, g₂⟩ + (1/(2γ₂))‖g₂‖²
  ≥ f∗(x̄; γ₁) + (1/(2γ₁))‖g₁‖² − (1/(2γ₂))‖g₁‖². □


2.3.3 Minimization Methods for the Minimax Problem

As usual, we start the presentation of numerical methods for problem (2.3.1) with a
variant of the Gradient Method with constant step.

Gradient Method for Minimax Problem   (2.3.11)

0. Choose x0 ∈ Q and h > 0.
1. kth iteration (k ≥ 0).

   xk+1 = xk − h gf(xk; L).

Theorem 2.3.4 Let f ∈ S^{1,1}_{μ,L}(Rⁿ). If in method (2.3.11) we choose h ≤ 1/L, then it forms a feasible sequence of points such that

‖xk − x∗‖² ≤ (1 − μh)^k ‖x0 − x∗‖², k ≥ 0.

Proof Let rk = ‖xk − x∗‖ and gk = gf(xk; L). Then, in view of (2.3.9), we have

rk+1² = ‖xk − x∗ − h gk‖² = rk² − 2h⟨gk, xk − x∗⟩ + h²‖gk‖²
  ≤ (1 − hμ)rk² + h(h − 1/L)‖gk‖² ≤ (1 − hμ)rk².

Let α = hL ≤ 1. Then xk+1 = (1 − α)xk + α xf(xk; L) ∈ Q. □



With the maximal step size h = 1/L, we have

xk+1 = xk − (1/L) gf(xk; L) = xf(xk; L).

For this step size, the rate of convergence of method (2.3.11) is as follows:

‖xk − x∗‖² ≤ (1 − μ/L)^k ‖x0 − x∗‖².
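As a numerical illustration of method (2.3.11) (ours, not the book's), consider a hypothetical two-component instance with μ = 2 and L = 10 on Q = R²; for m = 2 the gradient mapping is available in closed form through the dual weight over the simplex, as discussed after Definition 2.3.2. All names below are our own illustrative choices.

```python
import numpy as np

# Hypothetical instance: f(x) = max((x1-1)^2 + 5 x2^2, (x1+1)^2 + 5 x2^2),
# Q = R^2, so mu = 2, L = 10, x* = 0 and f* = 1.
def oracle(x):
    c = np.array([(x[0] - 1.0) ** 2 + 5.0 * x[1] ** 2,
                  (x[0] + 1.0) ** 2 + 5.0 * x[1] ** 2])
    g = np.array([[2.0 * (x[0] - 1.0), 10.0 * x[1]],
                  [2.0 * (x[0] + 1.0), 10.0 * x[1]]])
    return c, g

def grad_mapping_2(x_bar, c, g, gamma):
    """x_f(x_bar; gamma) for m = 2, Q = R^n: closed-form dual weight in [0, 1]."""
    d = g[0] - g[1]
    dd = float(d @ d)
    t = 1.0 if dd < 1e-14 else min(1.0, max(0.0, (gamma * (c[0] - c[1]) - float(d @ g[1])) / dd))
    return x_bar - (t * g[0] + (1.0 - t) * g[1]) / gamma

mu, L = 2.0, 10.0
x = np.array([3.0, -2.0])
r0_sq = float(x @ x)                   # ||x_0 - x*||^2, since x* = 0
for _ in range(60):                    # method (2.3.11) with h = 1/L
    c, g = oracle(x)
    x = grad_mapping_2(x, c, g, L)
f_end = float(max(oracle(x)[0]))
```

With h = 1/L the run stays within the contraction guaranteed by Theorem 2.3.4.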

As compared with Theorem 2.2.14, the Gradient Method for the minimax problem
has a rate of convergence with a similar dependence on the condition number.
Let us check what we can say about the optimal methods. In order to develop an
optimal scheme, we need to introduce estimating sequences with some recursive
updating rules. Formally, the minimax problem differs from the unconstrained
minimization problem only by the analytical form of the lower approximation
of the objective function. In the case of unconstrained minimization, we use

inequality (2.1.20) for updating the estimating sequence. Now we just replace it
by the lower bound (2.3.7).
Let us introduce the estimating sequences for problem (2.3.1). We fix a point x0 ∈ Q and a coefficient γ0 > 0. Consider the sequences {yk} ⊂ Rⁿ and {αk} ⊂ (0, 1). Define

φ0(x) = f(x0) + (γ0/2)‖x − x0‖²,

φk+1(x) = (1 − αk)φk(x) + αk[ f(xf(yk; L)) + (1/(2L))‖gf(yk; L)‖² + ⟨gf(yk; L), x − yk⟩ + (μ/2)‖x − yk‖² ].

Comparing these relations with (2.2.4), we can see that the difference lies only in the constant term f(xf(yk; L)) + (1/(2L))‖gf(yk; L)‖². In (2.2.4), we used f(yk) in this position. This difference leads to a trivial modification of the results of Lemma 2.2.3: all appearances of f(yk) must be formally replaced by this expression, and ∇f(yk) must be replaced by the reduced gradient gf(yk; L). Thus, we come to the following lemma.
Lemma 2.3.3 For all k ≥ 0 we have

φk(x) ≡ φk∗ + (γk/2)‖x − vk‖²,

where the sequences {γk}, {vk} and {φk∗} are defined as v0 = x0, φ0∗ = f(x0), and

γk+1 = (1 − αk)γk + αk μ,

vk+1 = (1/γk+1)[(1 − αk)γk vk + αk μ yk − αk gf(yk; L)],

φk+1∗ = (1 − αk)φk∗ + αk (f(xf(yk; L)) + (1/(2L))‖gf(yk; L)‖²) − (αk²/(2γk+1))‖gf(yk; L)‖²
  + (αk(1 − αk)γk/γk+1)( (μ/2)‖yk − vk‖² + ⟨gf(yk; L), vk − yk⟩ ). □



Now we can proceed exactly as in Sect. 2.2. Assume that φk∗ ≥ f(xk). Inequality (2.3.7) with x = xk and x̄ = yk becomes

f(xk) ≥ f(xf(yk; L)) + ⟨gf(yk; L), xk − yk⟩ + (1/(2L))‖gf(yk; L)‖² + (μ/2)‖xk − yk‖².

Hence,

φk+1∗ ≥ (1 − αk)f(xk) + αk f(xf(yk; L)) + (αk/(2L) − αk²/(2γk+1))‖gf(yk; L)‖²
  + (αk(1 − αk)γk/γk+1)⟨gf(yk; L), vk − yk⟩

≥ f(xf(yk; L)) + (1/(2L) − αk²/(2γk+1))‖gf(yk; L)‖²
  + (1 − αk)⟨gf(yk; L), (αkγk/γk+1)(vk − yk) + xk − yk⟩.

Thus, again we can choose

xk+1 = xf(yk; L),

Lαk² = (1 − αk)γk + αk μ ≡ γk+1,

yk = (1/(γk + αk μ))(αkγk vk + γk+1 xk).

Let us write down the resulting scheme in the form of (2.2.20), with eliminated
sequences {vk } and {γk }.

Constant Step Scheme II for Minimax Problem   (2.3.12)

0. Choose x0 ∈ Rⁿ and α0 ∈ [√qf, 2(3 + qf)/(3 + √(21 + 4qf))]. Set y0 = x0.
1. kth iteration (k ≥ 0).

  (a) Compute {fi(yk)}, {∇fi(yk)}, i = 1, . . . , m. Set xk+1 = xf(yk; L).

  (b) Compute αk+1 ∈ (0, 1) from the equation

    αk+1² = (1 − αk+1)αk² + qf αk+1.

  Set βk = αk(1 − αk)/(αk² + αk+1) and yk+1 = xk+1 + βk(xk+1 − xk).

The convergence analysis of this scheme is completely identical to the analysis


used for scheme (2.2.20). Let us just give the final result.

Theorem 2.3.5 Let the max-type function f belong to S^{1,1}_{μ,L}(Rⁿ). If in method (2.3.12) we take α0 ∈ [√qf, 2(3 + qf)/(3 + √(21 + 4qf))], then

f(xk) − f∗ ≤ 4μ[f(x0) − f∗ + (γ0/2)‖x0 − x∗‖²] / { (γ0 − μ)·[ exp(((k+1)/2)·qf^{1/2}) − exp(−((k+1)/2)·qf^{1/2}) ]² }

  ≤ (4L/((γ0 − μ)(k + 1)²))·[ f(x0) − f∗ + (γ0/2)‖x0 − x∗‖² ],

where γ0 = α0(α0L − μ)/(1 − α0). □

Note that the scheme (2.3.12) works for all μ ≥ 0. Let us write down the method
for solving problem (2.3.1) with strictly convex components.

Optimal Method for Minimax Problem with f ∈ S^{1,1}_{μ,L}(Rⁿ)   (2.3.13)

0. Choose x0 ∈ Q. Set y0 = x0 and β = (1 − √qf)/(1 + √qf).
1. kth iteration (k ≥ 0).
   Compute {fi(yk)} and {∇fi(yk)}. Set xk+1 = xf(yk; L) and

   yk+1 = xk+1 + β(xk+1 − xk).

Theorem 2.3.6 For scheme (2.3.13) we have

f(xk) − f∗ ≤ 2(1 − √(μ/L))^k (f(x0) − f∗).   (2.3.14)


Proof Scheme (2.3.13) is a variant of (2.3.12) with α0 = √(μ/L). Under this choice, γ0 = μ and we get (2.3.14) from Theorem 2.3.5 since, in view of Corollary 2.3.1, (μ/2)‖x0 − x∗‖² ≤ f(x0) − f∗. □
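A minimal sketch of scheme (2.3.13) on a toy instance (again our own illustrative assumptions, not the book's data); for m = 2 and Q = R² the gradient mapping is computed in closed form through the dual weight over the simplex:

```python
import numpy as np

def oracle(x):   # f(x) = max((x1-1)^2 + 5 x2^2, (x1+1)^2 + 5 x2^2): mu = 2, L = 10
    c = np.array([(x[0] - 1.0) ** 2 + 5.0 * x[1] ** 2,
                  (x[0] + 1.0) ** 2 + 5.0 * x[1] ** 2])
    g = np.array([[2.0 * (x[0] - 1.0), 10.0 * x[1]],
                  [2.0 * (x[0] + 1.0), 10.0 * x[1]]])
    return c, g

def grad_mapping_2(x_bar, c, g, gamma):
    """x_f(x_bar; gamma) for m = 2, Q = R^n, via the closed-form dual weight."""
    d = g[0] - g[1]
    dd = float(d @ d)
    t = 1.0 if dd < 1e-14 else min(1.0, max(0.0, (gamma * (c[0] - c[1]) - float(d @ g[1])) / dd))
    return x_bar - (t * g[0] + (1.0 - t) * g[1]) / gamma

mu, L = 2.0, 10.0
q_f = mu / L
beta = (1.0 - q_f ** 0.5) / (1.0 + q_f ** 0.5)   # momentum coefficient of (2.3.13)
x = y = np.array([3.0, -2.0])
f0 = float(max(oracle(x)[0]))
for _ in range(40):
    c, g = oracle(y)
    x_new = grad_mapping_2(y, c, g, L)           # x_{k+1} = x_f(y_k; L)
    y = x_new + beta * (x_new - x)               # y_{k+1} = x_{k+1} + beta (x_{k+1} - x_k)
    x = x_new
gap = float(max(oracle(x)[0])) - 1.0             # f(x_k) - f*, with f* = 1 here
```

On this instance the run stays within the rate (2.3.14).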
To conclude this section, let us look at the auxiliary problem, which we need to
solve for computing the Gradient Mapping of the minimax problem. Recall that this
problem is as follows:
 
min_{x∈Q} { max_{1≤i≤m} [fi(x0) + ⟨∇fi(x0), x − x0⟩] + (γ/2)‖x − x0‖² }.

Introducing an additional variable t ∈ R, we can rewrite this problem in the following form:

min_{x,t} { t + (γ/2)‖x − x0‖² }

s.t. fi(x0) + ⟨∇fi(x0), x − x0⟩ ≤ t, i = 1, . . . , m,   (2.3.15)

   x ∈ Q, t ∈ R.

If Q is a polytope, then the problem (2.3.15) is a quadratic optimization problem.


Such a problem can be solved by some special finite methods (simplex-type
algorithms). It can also be solved by Interior Point Methods (see Chap. 5). In the
latter case, we can treat much more complicated structures of the basic feasible
set Q.

2.3.4 Optimization with Functional Constraints

Let us show that the methods of the previous section can be used to solve a constrained minimization problem with smooth functional constraints. Recall that the analytical form of such a problem is as follows:

min_{x∈Q} f0(x),

s.t. fi(x) ≤ 0, i = 1, . . . , m,   (2.3.16)

where the functions fi are convex and smooth and Q is a simple closed convex set. In this section, we assume that fi ∈ S^{1,1}_{μ,L}(Rⁿ), i = 0, . . . , m, with some μ > 0.
The relation between problem (2.3.16) and minimax problems is established by
some special function of one variable. Consider the parametric max-type function

f (t; x) = max{f0 (x) − t; fi (x), i = 1 . . . m}, t ∈ R, x ∈ Q.

Let us introduce the auxiliary function

f ∗ (t) = min f (t; x). (2.3.17)


x∈Q

Note that the components of the max-type function f (t; ·) are strongly convex in x.
Therefore, for any t ∈ R, the solution of problem (2.3.17), x ∗ (t), exists and is
unique in view of Theorem 2.3.2.

We will try to approach the solution of problem (2.3.16) by a process based on approximate values of the function f∗(t). This approach can be seen as a variant of Sequential Quadratic Optimization. It can also be applied to nonconvex problems.
Let us establish some properties of the function f∗(·). Clearly, this is a continuous function.
Lemma 2.3.4 Let t ∗ be the optimal value of problem (2.3.16). Then

f ∗ (t) ≤ 0 for all t ≥ t ∗ ,

f ∗ (t) > 0 for all t < t ∗ .

Proof Let x ∗ be the solution to problem (2.3.16). If t ≥ t ∗ , then

f ∗ (t) ≤ f (t; x ∗ ) = max{f0 (x ∗ ) − t; fi (x ∗ )} ≤ max{t ∗ − t; fi (x ∗ )} ≤ 0.

Suppose that t < t ∗ and f ∗ (t) ≤ 0. Then there exists a y ∈ Q such that

f0 (y) ≤ t < t ∗ , fi (y) ≤ 0, i = 1 . . . m.

Hence, t ∗ cannot be the optimal value of problem (2.3.16).



Thus, the smallest root of the function f ∗ (·) corresponds to the optimal value of
problem (2.3.16). Note also that, using the methods of the previous section, we can
only compute an approximation to the value f ∗ (t). Hence, our goal now is to form
a process for finding this root, based on this inexact information. To do so, we need
to establish some properties of the function f ∗ (·).
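As a small numerical sketch of this observation (our own hypothetical instance, not from the book), take the one-dimensional problem min{x² : x − 1 ≤ 0}, for which t∗ = 0. Approximating the inner minimization on a grid, the smallest root of f∗(·) can be located by bisection, since this function turns out to be decreasing (see Lemma 2.3.5 below):

```python
import numpy as np

# Hypothetical 1-D instance of (2.3.16): min { x^2 : x - 1 <= 0 }, so t* = 0.
xs = np.linspace(-2.0, 2.0, 4001)          # grid stand-in for the inner minimization

def f_star(t):
    """f*(t) = min_x max(f0(x) - t, f1(x)), approximated over the grid."""
    return float(np.min(np.maximum(xs ** 2 - t, xs - 1.0)))

# f*(.) is positive to the left of t* and non-positive to the right of it,
# so its root can be bracketed and located by bisection.
lo, hi = -1.0, 1.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if f_star(mid) > 0.0:
        lo = mid
    else:
        hi = mid
t_root = 0.5 * (lo + hi)
```

The located root agrees with the optimal value t∗ = 0 of the toy problem, up to the grid resolution.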
Lemma 2.3.5 For any Δ ≥ 0, we have

f ∗ (t) − Δ ≤ f ∗ (t + Δ) ≤ f ∗ (t).

Proof Indeed,

f∗(t + Δ) = min_{x∈Q} max_{1≤i≤m} {f0(x) − t − Δ; fi(x)}
  ≤ min_{x∈Q} max_{1≤i≤m} {f0(x) − t; fi(x)} = f∗(t),

f∗(t + Δ) = min_{x∈Q} max_{1≤i≤m} {f0(x) − t; fi(x) + Δ} − Δ
  ≥ min_{x∈Q} max_{1≤i≤m} {f0(x) − t; fi(x)} − Δ = f∗(t) − Δ. □

In other words, the function f∗(·) is decreasing and Lipschitz continuous with constant one.
Lemma 2.3.6 For any t₁ < t₂ and Δ ≥ 0, we have

f∗(t₁ − Δ) ≥ f∗(t₁) + (Δ/(t₂ − t₁))(f∗(t₁) − f∗(t₂)).   (2.3.18)

Proof Let t₀ = t₁ − Δ and α = Δ/(t₂ − t₀) ≡ Δ/(t₂ − t₁ + Δ) ∈ [0, 1]. Then t₁ = (1 − α)t₀ + αt₂, and inequality (2.3.18) can be written as follows:

f∗(t₁) ≤ (1 − α)f∗(t₀) + αf∗(t₂).   (2.3.19)

Let xα = (1 − α)x∗(t₀) + αx∗(t₂). Then

f∗(t₁) ≤ max_{1≤i≤m} {f0(xα) − t₁; fi(xα)}
  ≤ max_{1≤i≤m} {(1 − α)(f0(x∗(t₀)) − t₀) + α(f0(x∗(t₂)) − t₂); (1 − α)fi(x∗(t₀)) + αfi(x∗(t₂))}   (by (2.1.3))
  ≤ (1 − α) max_{1≤i≤m} {f0(x∗(t₀)) − t₀; fi(x∗(t₀))} + α max_{1≤i≤m} {f0(x∗(t₂)) − t₂; fi(x∗(t₂))}
  = (1 − α)f∗(t₀) + αf∗(t₂),

and we get (2.3.18). □

Note that Lemmas 2.3.5 and 2.3.6 are valid for any parametric max-type function, not necessarily formed by the functional components of problem (2.3.16).
Let us now study the properties of the Gradient Mapping for the parametric max-type function. Define a linearization of the parametric max-type function f(t; ·):

f(t; x̄; x) = max_{1≤i≤m} {f0(x̄) + ⟨∇f0(x̄), x − x̄⟩ − t; fi(x̄) + ⟨∇fi(x̄), x − x̄⟩}.

Now we can introduce a Gradient Mapping in the usual way. Let us fix some γ > 0. Define

fγ(t; x̄; x) = f(t; x̄; x) + (γ/2)‖x − x̄‖²,

f∗(t; x̄; γ) = min_{x∈Q} fγ(t; x̄; x),

xf(t; x̄; γ) = arg min_{x∈Q} fγ(t; x̄; x),

gf(t; x̄; γ) = γ(x̄ − xf(t; x̄; γ)).

We call xf(t; x̄; γ) the Constrained Gradient Mapping, and gf(t; x̄; γ) the Constrained Reduced Gradient of problem (2.3.16). As usual, the point of linearization x̄ is not necessarily feasible for Q.
Note that the function fγ(t; x̄; ·) is itself a max-type function composed of the components

f0(x̄) + ⟨∇f0(x̄), x − x̄⟩ − t + (γ/2)‖x − x̄‖²,

fi(x̄) + ⟨∇fi(x̄), x − x̄⟩ + (γ/2)‖x − x̄‖², i = 1, . . . , m.

Moreover, fγ(t; x̄; ·) ∈ S^{1,1}_{γ,γ}(Rⁿ). Therefore, in view of Theorem 2.3.2, the Constrained Gradient Mapping is well defined for any t ∈ R.
Since f(t; ·) ∈ S^{1,1}_{μ,L}(Rⁿ), we have

fμ(t; x̄; x) ≤ f(t; x) ≤ fL(t; x̄; x)   (see (2.3.2) and (2.3.3))

for all x ∈ Rⁿ. Therefore,

f∗(t; x̄; μ) ≤ f∗(t) ≤ f∗(t; x̄; L).

Moreover, using Lemma 2.3.6, we obtain the following result: for any x̄ ∈ Rⁿ, γ > 0, Δ ≥ 0 and t₁ < t₂, we have

f∗(t₁ − Δ; x̄; γ) ≥ f∗(t₁; x̄; γ) + (Δ/(t₂ − t₁))(f∗(t₁; x̄; γ) − f∗(t₂; x̄; γ)).   (2.3.20)

There are two values, γ = L and γ = μ, which are important for us. Applying
Lemma 2.3.2 to the max-type function fγ (t; x̄; x) with γ1 = L and γ2 = μ, we get
the following inequality:

f∗(t; x̄; μ) ≥ f∗(t; x̄; L) − ((L − μ)/(2μL))‖gf(t; x̄; L)‖².   (2.3.21)

Since we are interested in finding a root of the function f∗(·), let us look first at the roots of the function f∗(·; x̄; γ), which can be seen as an approximation of f∗(·). Define

t∗(x̄, t) = root_t(f∗(t; x̄; μ))

(the notation root_t(·) corresponds to the root in t of the function (·)).
Lemma 2.3.7 Let x̄ ∈ Rⁿ and t̄ < t∗ be such that

f∗(t̄; x̄; μ) ≥ (1 − κ)f∗(t̄; x̄; L)

for some κ ∈ (0, 1). Then t̄ < t∗(x̄, t̄) ≤ t∗. Moreover, for any t < t̄ and x ∈ Rⁿ we have

f∗(t; x; L) ≥ 2(1 − κ)f∗(t̄; x̄; L)·√((t̄ − t)/(t∗(x̄, t̄) − t̄)).

Proof Since t̄ < t∗, we have

0 < f∗(t̄) ≤ f∗(t̄; x̄; L) ≤ (1/(1 − κ))f∗(t̄; x̄; μ).

Thus, f∗(t̄; x̄; μ) > 0 and, since f∗(·; x̄; μ) is decreasing, we get

t∗(x̄, t̄) > t̄.

Let Δ = t̄ − t. Then, in view of inequality (2.3.20), we have

f∗(t; x; L) ≥ f∗(t) ≥ f∗(t; x̄; μ) ≥ f∗(t̄; x̄; μ) + (Δ/(t∗(x̄, t̄) − t̄))·f∗(t̄; x̄; μ)
  ≥ (1 − κ)(1 + Δ/(t∗(x̄, t̄) − t̄))·f∗(t̄; x̄; L)
  ≥ 2(1 − κ)f∗(t̄; x̄; L)·√(Δ/(t∗(x̄, t̄) − t̄)).

In the last inequality, we use the relation 1 + τ ≥ 2√τ, τ ≥ 0. □

2.3.5 The Method for Constrained Minimization

Now we are ready to analyze the following process.



Constrained Minimization Scheme   (2.3.22)

0. Choose x0 ∈ Q, κ ∈ (0, 1/2), t0 < t∗, and accuracy ε > 0.
1. kth iteration (k ≥ 0).

  (a) Generate the sequence {xk,j} by method (2.3.13) as applied to f(tk; ·) with starting point xk,0 = xk. If

    f∗(tk; xk,j; μ) ≥ (1 − κ)f∗(tk; xk,j; L),

  then stop the internal process and set j(k) = j,

    j∗(k) = arg min_{0≤j≤j(k)} f∗(tk; xk,j; L),   xk+1 = xf(tk; xk,j∗(k); L).

  Global Stop: terminate if f∗(tk; xk,j; L) ≤ ε at some iteration of the internal scheme.

  (b) Set tk+1 = t∗(xk,j(k), tk).

This is the first time in this book we have met a two-level process. Clearly, its
analysis is more complicated. Firstly, we need to estimate the rate of convergence of
the upper-level process in (2.3.22) (called the Master Process). Secondly, we need
to estimate the total complexity of the internal processes in Step 1(a). Since we
are interested in the analytical complexity of this method, the arithmetical cost of
computation of the root t ∗ (x, t) and optimal value f ∗ (t; x, γ ) is not important for
us now.
Let us describe the convergence of the Master Process.
Lemma 2.3.8

f∗(tk; xk+1; L) ≤ (1/(1 − κ))·[1/(2(1 − κ))]^k·(t∗ − t0).

Proof Let β = 1/(2(1 − κ)) (< 1) and

δk = f∗(tk; xk,j(k); L)/√(tk+1 − tk).

Since tk+1 = t∗(xk,j(k), tk), in view of Lemma 2.3.7, for k ≥ 1 we have

2(1 − κ)·f∗(tk; xk,j(k); L)/√(tk+1 − tk) ≤ f∗(tk−1; xk−1,j(k−1); L)/√(tk − tk−1).

Thus, δk ≤ βδk−1 and we obtain

f∗(tk; xk,j(k); L) = δk√(tk+1 − tk) ≤ β^k δ0 √(tk+1 − tk)
  = β^k f∗(t0; x0,j(0); L)·√((tk+1 − tk)/(t1 − t0)).

Further, in view of Lemma 2.3.5, we have t1 − t0 ≥ f∗(t0; x0,j(0); μ). Hence,

f∗(tk; xk,j(k); L) ≤ β^k f∗(t0; x0,j(0); L)·√((tk+1 − tk)/f∗(t0; x0,j(0); μ))
  ≤ (β^k/(1 − κ))·√(f∗(t0; x0,j(0); μ)(tk+1 − tk))
  ≤ (β^k/(1 − κ))·√(f∗(t0)(t∗ − t0)).

It remains to note that f∗(t0) ≤ t∗ − t0 (see Lemma 2.3.5), and

f∗(tk; xk+1; L) ≡ f∗(tk; xk,j∗(k); L) ≤ f∗(tk; xk,j(k); L). □


The above result provides us with an estimate for the number of upper-level iterations which we need for finding an ε-solution to problem (2.3.16). Indeed, let f∗(tk; xk,j; L) ≤ ε. Then for x∗ = xf(tk; xk,j; L) we have

f(tk; x∗) = max_{1≤i≤m} {f0(x∗) − tk; fi(x∗)} ≤ f∗(tk; xk,j; L) ≤ ε.

Since tk ≤ t∗, we conclude that

f0(x∗) ≤ t∗ + ε,   fi(x∗) ≤ ε, i = 1, . . . , m.   (2.3.23)

In view of Lemma 2.3.8, we can get (2.3.23) in at most

N(ε) = (1/ln[2(1 − κ)])·ln[(t∗ − t0)/((1 − κ)ε)]   (2.3.24)

full iterations of the master process (the last iteration of the process, in general, is not full since it is terminated by the Global Stop rule). Note that in estimate (2.3.24), κ is an absolute constant (for example, κ = 1/4).

Let us analyze the complexity of the internal process. Assume that the sequence
{xk,j } is generated by (2.3.13) starting from the point xk,0 = xk . In view of
Theorem 2.3.6, we have
f(tk; xk,j) − f∗(tk) ≤ 2(1 − √qf)^j (f(tk; xk) − f∗(tk)) ≤ 2e^{−σj}(f(tk; xk) − f∗(tk)) ≤ 2e^{−σj}f(tk; xk),

where σ := √qf. Recall that Qf = 1/qf = L/μ.
Let N be the number of full iterations of process (2.3.22) (N ≤ N(ε)). Thus, j(k) is well defined for all k, 0 ≤ k ≤ N. Note that tk = t∗(xk−1,j(k−1), tk−1) > tk−1. Therefore

f(tk; xk) ≤ f(tk−1; xk) ≤ f∗(tk−1; xk−1,j∗(k−1); L).

Define

Δk = f∗(tk−1; xk−1,j∗(k−1); L), k ≥ 1,   Δ0 = f(t0; x0).

Then, for all k ≥ 0 we have

f (tk ; xk ) − f ∗ (tk ) ≤ Δk .

Lemma 2.3.9 For all k, 0 ≤ k ≤ N, the internal process stops no later than the first j for which the following condition is satisfied:

f(tk; xk,j) − f∗(tk) ≤ (κ/(Qf − 1))·f∗(tk; xk,j; L).   (2.3.25)

Proof Assume that (2.3.25) is satisfied. Then, in view of (2.3.8), we have

(1/(2L))‖gf(tk; xk,j; L)‖² ≤ f(tk; xk,j) − f(tk; xf(tk; xk,j; L)) ≤ f(tk; xk,j) − f∗(tk).

Therefore, using (2.3.21), we obtain

f∗(tk; xk,j; μ) ≥ f∗(tk; xk,j; L) − ((L − μ)/(2μL))‖gf(tk; xk,j; L)‖²
  ≥ f∗(tk; xk,j; L) − (Qf − 1)·(f(tk; xk,j) − f∗(tk))
  ≥ (1 − κ)f∗(tk; xk,j; L)   (by (2.3.25)),

which is the termination criterion of Step 1(a) in (2.3.22). □


The above result, combined with the estimate of the rate of convergence for the
internal process, provide us with the total complexity estimate for the constrained
minimization scheme.
Lemma 2.3.10 For all k, 0 ≤ k ≤ N, we have

j(k) ≤ 1 + √Qf · ln[2(Qf − 1)Δk/(κΔk+1)].

Proof Assume that

j(k) − 1 > (1/σ)·ln[2(Qf − 1)Δk/(κΔk+1)],   (2.3.26)

where σ = √qf. Recall that Δk+1 = min_{0≤j≤j(k)} f∗(tk; xk,j; L). Note that the stopping criterion of the internal process was not satisfied for j = j(k) − 1. Therefore, in view of Lemma 2.3.9, we have

f∗(tk; xk,j; L) ≤ ((Qf − 1)/κ)·(f(tk; xk,j) − f∗(tk)) ≤ 2((Qf − 1)/κ)·e^{−σj}Δk < Δk+1   (by (2.3.26)).

This is a contradiction with the definition of Δk+1. □



Corollary 2.3.3

Σ_{k=0}^{N} j(k) ≤ (N + 1)(1 + √Qf · ln[2(L − μ)/(κμ)]) + √Qf · ln(Δ0/ΔN+1). □

It remains to estimate the number of internal iterations in the last step of the
Master Process. Denote this number by j ∗ .
Lemma 2.3.11

j∗ ≤ 1 + √Qf · ln[2(Qf − 1)ΔN+1/(κε)].

Proof The proof is very similar to the proof of Lemma 2.3.10. Suppose that

j∗ − 1 > √Qf · ln[2(Qf − 1)ΔN+1/(κε)].

Note that for j = j∗ − 1 we have

ε ≤ f∗(tN+1; xN+1,j; L) ≤ ((Qf − 1)/κ)·(f(tN+1; xN+1,j) − f∗(tN+1)) ≤ 2((Qf − 1)/κ)·e^{−σj}ΔN+1 < ε.

This is a contradiction.


Corollary 2.3.4

j∗ + Σ_{k=0}^{N} j(k) ≤ (N + 2)(1 + √Qf · ln[2(Qf − 1)/κ]) + √Qf · ln(Δ0/ε).

Let us put everything together. Substituting the estimate (2.3.24) for the number of full iterations N into the estimate of Corollary 2.3.4, we come to the following bound for the total number of internal iterations of process (2.3.22):

( (1/ln[2(1 − κ)])·ln[(t∗ − t0)/((1 − κ)ε)] + 2 )·( 1 + √Qf · ln[2(Qf − 1)/κ] )
  + √Qf · ln( (1/ε)·max_{1≤i≤m}{f0(x0) − t0; fi(x0)} ).   (2.3.27)

Note that method (2.3.13), which is used in the internal process, calls the oracle of problem (2.3.16) only once at each iteration. Therefore, the estimate (2.3.27) is an upper bound for the analytical complexity of problem (2.3.16), whose ε-solution is defined by relations (2.3.23).
Let us check how far this estimate is from the lower bound. The principal term in the estimate (2.3.27) is of the order

ln((t∗ − t0)/ε) · √Qf · ln Qf.

This value differs from the lower bound for an unconstrained minimization problem by a factor of ln(L/μ). This means that the scheme (2.3.22) is at least suboptimal for constrained optimization problems.
To conclude this section, let us address two technical questions. Firstly, in scheme (2.3.22) it is assumed that we know some estimate t0 < t∗. This assumption is not binding since it is possible to choose t0 as the optimal value of the minimization problem

min_{x∈Q} [f0(x0) + ⟨∇f0(x0), x − x0⟩ + (μ/2)‖x − x0‖²].

Clearly, this value is less than or equal to t∗.


Secondly, we assume that we are able to compute t∗(x̄, t). Recall that t∗(x̄, t) is a root of the function

f∗(t; x̄; μ) = min_{x∈Q} fμ(t; x̄; x),

where fμ(t; x̄; x) is a max-type function composed of the components

f0(x̄) + ⟨∇f0(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖² − t,

fi(x̄) + ⟨∇fi(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖², i = 1, . . . , m.

In view of Lemma 2.3.4, this root is the optimal value of the following minimization problem:

min_{x∈Q} [f0(x̄) + ⟨∇f0(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖²],

s.t. fi(x̄) + ⟨∇fi(x̄), x − x̄⟩ + (μ/2)‖x − x̄‖² ≤ 0, i = 1, . . . , m.

This is not a pure problem of Quadratic Optimization since the constraints are not linear. However, it can still be solved in finite time by a simplex-type procedure, since the objective function and the constraints have the same Hessian. This problem can also be solved by Interior-Point Methods (see Chap. 5).
Chapter 3
Nonsmooth Convex Optimization

In this chapter, we consider the most general convex optimization problems, which
are formed by non-differentiable convex functions. We start by studying the main
properties of these functions and the definition of subgradients, which are the main
directions used in the corresponding optimization schemes. We also prove the neces-
sary facts from Convex Analysis, including different variants of Minimax Theorems.
After that, we establish the lower complexity bounds and prove the convergence
rate of the Subgradient Method for constrained and unconstrained optimization
problems. This method appears to be optimal uniformly in the dimension of the
space of variables. In the next section, we consider other optimization methods,
which can work in spaces of moderate dimension (the Method of Centers of Gravity,
the Ellipsoid Algorithm). The chapter concludes with a presentation of methods
based on a complete piece-wise linear model of the objective function (Kelley’s
method, the Level Method).

3.1 General Convex Functions

(Equivalent definitions; Closed functions; The discrete minimax theorem; Continuity of convex functions; Separation theorems; Subgradients; Computation rules; Optimality conditions; The Karush–Kuhn–Tucker Theorem; The exact penalty function; Minimax theorems; Basic elements of primal-dual methods.)

© Springer Nature Switzerland AG 2018
Y. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_3

3.1.1 Motivation and Definitions

In this chapter, we consider methods for solving the most general convex minimiza-
tion problem

min_{x∈Q} f0(x),

s.t. fi(x) ≤ 0, i = 1, . . . , m,   (3.1.1)

where Q ⊆ Rn is a closed convex set and fi (·), i = 0 . . . m, are general convex


functions. The term general means that these functions can be nondifferentiable.
Clearly, such a problem is more difficult than a problem with differentiable
components.
Note that nonsmooth minimization problems arise frequently in different appli-
cations. Quite often, some components of a model are composed of max-type
functions:

f(x) = max_{1≤j≤p} fj(x),

where fj (·) are convex and differentiable. In Sect. 2.3, we have seen that such a
function can be minimized by methods based on Gradient Mapping. However, if
the number of smooth components p is very big, the computation of the Gradient
Mapping becomes too expensive. Then, it is reasonable to treat this max-type
function as a general convex function. Another source of nondifferentiable functions
is the situation when some components of the problem (3.1.1) are given implicitly,
as solutions of some auxiliary problems. Such functions are called the functions
with implicit structure. Very often, these functions are nondifferentiable.
Let us start our considerations with the definition of a general convex function.
In the sequel, the term “general” is often omitted.
Denote by

dom f = {x ∈ Rⁿ : |f(x)| < ∞}

the domain of the function f. We always assume that dom f ≠ ∅.
Definition 3.1.1 A function f (·) is called convex if its domain is convex and for all
x, y ∈ dom f and α ∈ [0, 1] the following inequality holds:

f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y). (3.1.2)

If this inequality is strict, the function is called strictly convex. We call f concave if
−f is convex.

At this point, we are not yet ready to speak about any methods for solving
problem (3.1.1). In Chap. 2, our optimization schemes were based on gradients of
smooth functions. For nonsmooth functions, such objects do not exist and we have
to find something to replace them. However, in order to do that, we should first
study the properties of general convex functions and justify a possible definition of
a computable generalized gradient. This route is quite long, but we have to follow it
up to the end.
A straightforward consequence of Definition 3.1.1 is the following.
Lemma 3.1.1 (Jensen's Inequality) For any x1, . . . , xm ∈ dom f and positive coefficients α1, . . . , αm such that

Σ_{i=1}^{m} αi = 1,   (3.1.3)

we have

f( Σ_{i=1}^{m} αi xi ) ≤ Σ_{i=1}^{m} αi f(xi).   (3.1.4)

Proof Let us prove this statement by induction over m. Definition 3.1.1 justifies inequality (3.1.4) for m = 2. Assume it is true for some m ≥ 2. For a set of m + 1 points we have

Σ_{i=1}^{m+1} αi xi = α1 x1 + (1 − α1) Σ_{i=1}^{m} βi xi,

where βi = αi+1/(1 − α1), i = 1, . . . , m. Clearly,

Σ_{i=1}^{m} βi = 1,  βi > 0, i = 1, . . . , m.

Therefore, using Definition 3.1.1 and our inductive assumption, we have

f( Σ_{i=1}^{m+1} αi xi ) = f( α1 x1 + (1 − α1) Σ_{i=1}^{m} βi xi )
  ≤ α1 f(x1) + (1 − α1) f( Σ_{i=1}^{m} βi xi ) ≤ Σ_{i=1}^{m+1} αi f(xi). □


A point x = Σ_{i=1}^{m} αi xi with positive coefficients αi satisfying the normalizing condition (3.1.3) is called a convex combination of the points {xi}_{i=1}^{m}.
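A quick numerical sanity check of Jensen's inequality (3.1.4) can be useful; in the sketch below the convex function, the points, and the weights are our own illustrative choices, with the Euclidean norm playing the role of f:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                     # a convex function: the Euclidean norm
    return float(np.linalg.norm(x))

pts = rng.normal(size=(5, 3))                 # points x_1, ..., x_5 in R^3
alpha = rng.random(5)
alpha /= alpha.sum()                          # positive weights with sum 1, as in (3.1.3)

lhs = f(alpha @ pts)                          # f(sum_i alpha_i x_i)
rhs = float(np.dot(alpha, [f(p) for p in pts]))   # sum_i alpha_i f(x_i)
```

Here `lhs` never exceeds `rhs`, and it is also bounded by the largest f(x_i), in line with the corollaries that follow.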

Let us mention two important consequences of Jensen’s inequality.


Corollary 3.1.1 Let x be a convex combination of points x1, . . . , xm. Then

f(x) ≤ max_{1≤i≤m} f(xi).

Proof Indeed, by Jensen's inequality and condition (3.1.3), we have

f(x) = f( Σ_{i=1}^{m} αi xi ) ≤ Σ_{i=1}^{m} αi f(xi) ≤ max_{1≤i≤m} f(xi). □

Corollary 3.1.2 Let

Δ = Conv{x1, . . . , xm} ≡ { x = Σ_{i=1}^{m} αi xi | αi ≥ 0, Σ_{i=1}^{m} αi = 1 }.

Then max_{x∈Δ} f(x) = max_{1≤i≤m} f(xi). □

There exist two other equivalent definitions of convex functions.


Theorem 3.1.1 A function f is convex if and only if for all x, y ∈ dom f and β ≥ 0 such that y + β(y − x) ∈ dom f, we have

f(y + β(y − x)) ≥ f(y) + β(f(y) − f(x)).   (3.1.5)

Proof Let f be convex. Define α = β/(1 + β) and u = y + β(y − x). Then

y = (1/(1 + β))(u + βx) = (1 − α)u + αx.

Therefore,

f(y) ≤ (1 − α)f(u) + αf(x) = (1/(1 + β))f(u) + (β/(1 + β))f(x),

and multiplying both sides by 1 + β, we get (3.1.5).
Assume now that (3.1.5) holds. Let us fix x, y ∈ dom f and α ∈ (0, 1]. Define β = (1 − α)/α and u = αx + (1 − α)y. Then

x = (1/α)(u − (1 − α)y) = u + β(u − y).

Therefore, f(x) ≥ f(u) + β(f(u) − f(y)) = (1/α)f(u) − ((1 − α)/α)f(y), which is equivalent to f(u) ≤ αf(x) + (1 − α)f(y). □

Theorem 3.1.2 A function f is convex if and only if its epigraph

epi (f ) = {(x, t) ∈ dom f × R | t ≥ f (x)}

is a convex set.

Proof Indeed, if (x1, t1) ∈ epi(f) and (x2, t2) ∈ epi(f), then for any α ∈ [0, 1] we have

αt1 + (1 − α)t2 ≥ αf(x1) + (1 − α)f(x2) ≥ f(αx1 + (1 − α)x2).

Thus, (αx1 + (1 − α)x2, αt1 + (1 − α)t2) ∈ epi(f).
Let epi(f) be convex. Note that for x1, x2 ∈ dom f, the corresponding points of the graph of the function belong to the epigraph:

(x1, f(x1)) ∈ epi(f),  (x2, f(x2)) ∈ epi(f).

Therefore (αx1 + (1 − α)x2, αf(x1) + (1 − α)f(x2)) ∈ epi(f). This means that

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2). □

We also need the following property of the level sets of convex functions.
Theorem 3.1.3 If a function f is convex, then all level sets

Lf (β) = {x ∈ dom f | f (x) ≤ β}, β ∈ R,

are either convex or empty.


Proof Indeed, if x1 ∈ Lf (β) and x2 ∈ Lf (β), then for any α ∈ [0, 1] we have

f (αx1 + (1 − α)x2 ) ≤ αf (x1 ) + (1 − α)f (x2 ) ≤ αβ + (1 − α)β = β. 


In Example 3.1.1(6) we will see that the behavior of a general convex function on
the boundary of its domain is sometimes out of any control. Therefore, we need to
introduce a convenient notion, which will be very useful in our analysis.
Definition 3.1.2 A function f is called closed and convex on a convex set Q ⊆
dom f if its constrained epigraph

epi Q (f ) = {(x, t) ∈ Q × R : t ≥ f (x)}

is a closed convex set. If Q = dom f , we call f a closed convex function.


Note that in this definition the set Q is not necessarily closed. Let us prove the
following natural statement.
Lemma 3.1.2 Let a function f be closed and convex on Q. Then for any closed
convex set Q1 ⊆ Q, this function is closed and convex on Q1 .
Proof Indeed, the set {(x, t) : x ∈ Q1 } is closed. Hence, the statement follows
from Item 1 of Theorem 2.2.8. 
144 3 Nonsmooth Convex Optimization

Let us mention the most important topological properties of closed convex


functions.
Theorem 3.1.4 Let a function f be closed and convex.
1. For any sequence {xk} ⊂ dom f convergent to a point x̄ ∈ dom f we have

   lim inf_{k→∞} f(xk) ≥ f(x̄).     (3.1.6)

   (This means that f is lower semi-continuous.)
2. For any sequence {xk} ⊂ dom f convergent to some point x̄ ∉ dom f we have

   lim_{k→∞} f(xk) = +∞.     (3.1.7)

3. All level sets of the function f are either empty or closed and convex.
4. Let f be closed and convex on a set Q and its constrained level sets be bounded. Then the problem

   min_{x∈Q} f(x)

   is solvable.
5. Let f be closed and convex on Q. If the optimal set X* = Arg min_{x∈Q} f(x) is nonempty and bounded, then all level sets of the function f on Q are either empty or bounded.
Proof
1. Note that the sequence {(xk, f(xk))} belongs to the closed set epi(f). If it has a subsequence convergent to (x̄, f̄) ∈ epi(f), then x̄ ∈ dom f and f̄ ≥ f(x̄). This is the inequality (3.1.6).
   If there is no convergent subsequence in {f(xk)}, we need to consider two cases. Assume that lim inf_{k→∞} f(xk) = −∞. Since x̄ ∈ dom f, the sequence {(xk, f(x̄) − 1)} belongs to epi(f) for k large enough, but it converges to the point (x̄, f(x̄) − 1) ∉ epi(f). This contradicts our assumption. Thus, the only possibility is lim_{k→∞} f(xk) = +∞. Hence, (3.1.6) is also satisfied.
2. Let x̄ ∉ dom f. If the sequence {f(xk)} contains a bounded subsequence, then the corresponding points (xk, τ) with τ big enough belong to the epigraph. However, their limit is not in this set. This contradiction proves (3.1.7).
3. By its definition, (Lf(β), β) = epi(f) ∩ {(x, t) | t = β}. Therefore, the level set Lf(β) is closed and convex as an intersection of two closed convex sets.
4. Consider a sequence {xk} ⊂ Q such that lim_{k→∞} f(xk) = f* := inf_{x∈Q} f(x). Since the level sets of the function f on Q are bounded, we can assume that it is a convergent sequence: lim_{k→∞} xk = x*. Assume that f* = −∞. Consider the points yk = (1 − αk)x0 + αk xk ∈ Q, k ≥ 0, with slowly decreasing coefficients αk ↓ 0. Note that we can always ensure

   f(yk) ≤ f(x0) + αk (f(xk) − f(x0)) → −∞   (by (3.1.2)),

   and this contradicts the closedness of the set epi_Q(f).
   Thus, f* > −∞, and we can assume that the whole sequence {(xk, f(xk))} converges to a certain point (x*, f*) from epi_Q(f). However, by definition of this set, x* ∈ Q and f(x*) ≤ f*.
5. Assume that some set Lf(β) with β > f* = min_{x∈Q} f(x) is unbounded. Let us fix a point x* ∈ X* and choose R > max_{y∈X*} ‖y − x*‖. Consider a sequence {xk} ⊂ Lf(β) with ρk := ‖xk − x*‖ → ∞. Without loss of generality, we can assume that all ρk ≥ R. Define yk = x* + (R/ρk)(xk − x*). Clearly, yk ∈ Q and ‖yk − x*‖ = R. However,

   f(yk) ≤ f* + (R/ρk)(f(xk) − f*) → f*,  k → ∞   (by (3.1.2)).

   Since the sequence {yk}k≥0 is compact and the level set Lf(β) is closed (see Item 3), we can assume that the limit lim_{k→∞} yk = ȳ ∈ Lf(β) exists. However, by (3.1.6) we have f(ȳ) = f*, and this contradicts the choice of R. □

Note that, if f is convex and continuous and its domain dom f is closed, then f
is a closed function. However, in general, a closed convex function is not necessarily
continuous.
Let us look at some examples of closed convex functions.
Example 3.1.1
1. A linear function is closed and convex.
2. f(x) = |x|, x ∈ R, is closed and convex since its epigraph is

   {(x, t) | t ≥ x, t ≥ −x},

   which is the intersection of two closed convex sets (see Theorem 3.1.2).
3. All continuous and convex functions on Rn belong to the class of general closed convex functions.
4. The function f(x) = 1/x, x > 0, is convex and closed. However, its domain dom f = int R+ is open.
5. The function f(x) = ‖x‖, where ‖·‖ is any norm, is closed and convex:

   f(αx1 + (1 − α)x2) = ‖αx1 + (1 − α)x2‖ ≤ ‖αx1‖ + ‖(1 − α)x2‖ = α‖x1‖ + (1 − α)‖x2‖

   for any x1, x2 ∈ Rn and α ∈ [0, 1]. The most popular norms in Numerical Analysis are the so-called ℓp-norms:

   ‖x‖_(p) = [ Σ_{i=1}^n |x^(i)|^p ]^{1/p},   p ≥ 1.

   Among them, there are three norms which are commonly used:
   • The Euclidean norm ‖x‖_(2) = [ Σ_{i=1}^n (x^(i))² ]^{1/2}, p = 2. Since it is used very often, we usually drop the subscript if no ambiguity arises.
   • The ℓ1-norm ‖x‖_(1) = Σ_{i=1}^n |x^(i)|, p = 1.
   • The ℓ∞-norm (Chebyshev norm, uniform norm, infinity norm) ‖x‖_(∞) = max_{1≤i≤n} |x^(i)|.

   Any norm defines a system of balls,

   B‖·‖(x0, r) = {x ∈ Rn | ‖x − x0‖ ≤ r},   r ≥ 0,

   where r is the radius of the ball and x0 ∈ Rn is its center. We call the ball B‖·‖(0, 1) the unit ball of the norm ‖·‖. Clearly, these balls are convex sets (see Theorem 3.1.3). For ℓp-balls of radius r we also use the notation

   Bp(x0, r) = {x ∈ Rn | ‖x − x0‖_(p) ≤ r}.

   For ℓ1-balls, we often use the following representation:

   B1(x0, r) = {x ∈ Rn : ‖x − x0‖_(1) ≤ r} = Conv{x0 ± r·ei, i = 1, …, n},     (3.1.8)

   where the ei are coordinate vectors in Rn.


6. Up to now, none of our examples have demonstrated any pathological behavior.
However, let us look at the following function of two variables:

   f(x, y) = ⎧ 0,        if x² + y² < 1,
             ⎩ φ(x, y),  if x² + y² = 1,

where φ(x, y) is an arbitrary nonnegative function defined on the boundary of


the unit circle. The domain of this function is the unit Euclidean disk, which is
closed and convex. Moreover, it is easy to see that f is convex. However, it has
no reasonable properties at the boundary of its domain. Definitely, we want to

exclude such functions from our considerations. This was the main reason for
introducing the notion of the closed function. It is clear that f (·, ·) is not closed
unless φ(x, y) ≡ 0.
Another possibility would be to consider a smaller class of continuous convex
functions. However, we will see that for closedness of a convex function there
exist very natural sufficient conditions, and this is not the case for continuity. 
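As a side note (an illustration added here, not from the original text), the triangle inequality behind the convexity of ‖·‖ in Example 3.1.1(5), and the standard ordering of the ℓp-norms, can be verified numerically; the vectors below are arbitrary:

```python
def lp_norm(x, p):
    # ||x||_(p) for p >= 1; p = float('inf') gives the Chebyshev norm.
    if p == float('inf'):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x = (1.0, -2.0, 3.0)
y = (0.5, 4.0, -1.0)
for p in (1, 2, 3, float('inf')):
    # Triangle inequality, the key step in Example 3.1.1(5).
    s = tuple(a + b for a, b in zip(x, y))
    assert lp_norm(s, p) <= lp_norm(x, p) + lp_norm(y, p) + 1e-12
# For a fixed vector, ||x||_(inf) <= ||x||_(2) <= ||x||_(1).
assert lp_norm(x, float('inf')) <= lp_norm(x, 2) <= lp_norm(x, 1)
```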

3.1.2 Operations with Convex Functions

In the previous section, we have seen several examples of convex functions. Let us
describe a set of invariant operations which allow us to create more complicated
objects.
Theorem 3.1.5 Let functions f1 and f2 be closed and convex on convex sets Q1
and Q2, and β ≥ 0. Then all functions below are closed and convex on the
corresponding sets Q:
1. f(x) = βf1(x), Q = Q1.
2. f(x) = f1(x) + f2(x), Q = Q1 ∩ Q2.¹
3. f(x) = max{f1(x), f2(x)}, Q = Q1 ∩ Q2.

Proof
1. The first item is evident:

f (αx1 + (1 − α)x2 ) ≤ β(αf1 (x1 ) + (1 − α)f1 (x2 )), x1 , x2 ∈ Q1 .



2. For all x1, x2 ∈ Q = Q1 ∩ Q2 and α ∈ [0, 1] we have

f1 (αx1 + (1 − α)x2 ) + f2 (αx1 + (1 − α)x2 )

≤ αf1 (x1 ) + (1 − α)f1 (x2 ) + αf2 (x1 ) + (1 − α)f2 (x2 )

= α(f1 (x1 ) + f2 (x1 )) + (1 − α)(f1 (x2 ) + f2 (x2 )).

Thus, f is convex on the set Q. Let us prove that it is also closed on Q. Consider
a convergent sequence {(xk , tk )} ⊂ epi Q (f ):

   tk ≥ f1(xk) + f2(xk),   xk ∈ Q,   lim_{k→∞} xk = x̄,   lim_{k→∞} tk = t̄.


¹ Recall that without additional assumptions, we cannot guarantee the closedness of the sum of
two closed convex sets (see Item 2 in Theorem 2.2.8 and Example 2.2.1). For that, we need
boundedness of one of them. However, epigraphs are never bounded.

Since the functions f1 and f2 are closed on Q1 and Q2 respectively, we have

   lim inf_{k→∞} f1(xk) ≥ f1(x̄), x̄ ∈ Q1,   lim inf_{k→∞} f2(xk) ≥ f2(x̄), x̄ ∈ Q2   (by (3.1.6)).

Therefore, x̄ ∈ Q1 ∩ Q2, and

   t̄ = lim_{k→∞} tk ≥ lim inf_{k→∞} f1(xk) + lim inf_{k→∞} f2(xk) ≥ f(x̄).

Thus, (x̄, t̄) ∈ epi_Q(f).
3. The constrained epigraph of the function f can be represented as follows:

   epi_Q(f) = {(x, t) | t ≥ f1(x), t ≥ f2(x), x ∈ Q1 ∩ Q2} ≡ epi_{Q1}(f1) ∩ epi_{Q2}(f2).

Thus, epi_Q(f) is closed and convex as an intersection of two closed convex sets. □
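Item 3 of Theorem 3.1.5 can also be observed numerically by checking inequality (3.1.2) directly for the pointwise maximum of two convex functions (the functions below are an arbitrary example, added for illustration):

```python
import random

f1 = lambda x: x * x             # convex
f2 = lambda x: abs(x - 1.0)      # convex
f = lambda x: max(f1(x), f2(x))  # convex by Theorem 3.1.5(3)

random.seed(0)
for _ in range(1000):
    x1, x2 = random.uniform(-3, 3), random.uniform(-3, 3)
    a = random.random()
    # Convexity inequality (3.1.2) for the pointwise maximum.
    assert f(a * x1 + (1 - a) * x2) <= a * f(x1) + (1 - a) * f(x2) + 1e-9
```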


Let us prove that convexity is an affine-invariant property.
Theorem 3.1.6 Let a function φ be closed and convex on a bounded set S ⊆ Rm .
Consider a linear operator

   A(x) = Ax + b : Rn → Rm.

Then the function f (x) = φ(A (x)) is closed and convex on the inverse image of
the set S defined as follows:

Q = {x ∈ Rn | A (x) ∈ S}.

Proof For x1 and x2 in Q, define y1 = A (x1 ), y2 = A (x2 ). Then for α ∈ [0, 1] we


have

f (αx1 + (1 − α)x2 ) = φ(A (αx1 + (1 − α)x2 )) = φ(αy1 + (1 − α)y2 )

≤ αφ(y1 ) + (1 − α)φ(y2 ) = αf (x1 ) + (1 − α)f (x2 ).

Thus, the function f is convex. The closedness of its constrained epigraph follows
from the continuity of the linear operator A (·).

The next two theorems are the main providers of closed convex functions with
implicit structure.

Theorem 3.1.7 Let Q be a convex set, and let the function φ be convex with
dom φ ⊇ Q. Then the function

   f(x) = inf_y {φ(x, y) : (x, y) ∈ Q}     (3.1.9)

is convex on Q̂ = {x : ∃y such that (x, y) ∈ Q}.


Proof Let us take arbitrary points x1, x2 ∈ Q̂. Consider two sequences {y1,k} and
{y2,k} such that {(x1, y1,k)} ⊂ Q, {(x2, y2,k)} ⊂ Q, and

   lim_{k→∞} φ(x1, y1,k) = f(x1),   lim_{k→∞} φ(x2, y2,k) = f(x2).

Since φ is jointly convex in (x, y), for any α ∈ [0, 1] we have

   f(αx1 + (1 − α)x2) ≤ φ(αx1 + (1 − α)x2, αy1,k + (1 − α)y2,k)   (by (3.1.9))
                      ≤ αφ(x1, y1,k) + (1 − α)φ(x2, y2,k).

Taking the limit of the right-hand side of this inequality, we get the convexity
condition (3.1.2) for the function f. □

Conditions for closedness of the function (3.1.9) will be presented later in Theo-
rem 3.1.25 and Theorem 3.1.28.
Theorem 3.1.8 Let Δ be an arbitrary set and

   f(x) = sup_y {φ(x, y) | y ∈ Δ}.

Suppose that for any y ∈ Δ the function φ(·, y) is closed and convex on some set Q.
Then f(·) is a closed convex function on the set

   Q̂ = { x ∈ Q | sup_{y∈Δ} φ(x, y) < +∞ }.     (3.1.10)

Proof Indeed, if x ∈ Q̂, then f(x) < +∞, and we conclude that Q̂ ⊆ dom f.
Further, it is clear that (x, t) ∈ epi_Q(f) if and only if for all y ∈ Δ we have

   x ∈ Q,   t ≥ φ(x, y).

This means that

   epi_Q(f) = ∩_{y∈Δ} epi_Q(φ(·, y)).

Thus, epi_Q(f) is closed and convex since each set epi_Q(φ(·, y)) is closed and
convex. □
Theorem 3.1.9 Let a function ψ(·) be convex and ϕ be a univariate convex function
which is non-decreasing on the set

Im ψ = {τ = ψ(x), x ∈ dom ψ}.

Then the function f (x) = ϕ(ψ(x)), x ∈ dom ψ, is convex.


Proof Indeed, for any points x and y from dom f, and α ∈ [0, 1], we have

   f(αx + (1 − α)y) = ϕ(ψ(αx + (1 − α)y)) ≤ ϕ(αψ(x) + (1 − α)ψ(y))
                    ≤ αϕ(ψ(x)) + (1 − α)ϕ(ψ(y)) = αf(x) + (1 − α)f(y). □



Now we are ready to look at more sophisticated examples of convex functions.


Example 3.1.2
1. The function f(x) = max_{1≤i≤n} {x^(i)} is closed and convex. Another example of a closed convex function is

   φ*(s) = sup_{x∈dom φ} [⟨s, x⟩ − φ(x)],

   where φ is an arbitrary function on Rn. The function φ* is called the Fenchel dual of φ.
2. Let λ = (λ^(1), …, λ^(m)), and let Δ be a set in Rm+. Consider the function

   f(x) = sup_{λ∈Δ} Σ_{i=1}^m λ^(i) fi(x),

   where all the fi are closed and convex. In view of Theorem 3.1.5, the epigraphs of the functions

   φλ(x) = Σ_{i=1}^m λ^(i) fi(x)

   are convex and closed. Thus, f(·) is closed and convex in view of Theorem 3.1.8. Note that we have not assumed anything about the structure of the set Δ.

3. Let Q be an arbitrary set. Consider the function

   ξQ(x) = sup{⟨g, x⟩ | g ∈ Q}.

   The function ξQ(·) is called the support function of the set Q. Note that ξQ(·) is closed and convex in view of Theorem 3.1.8. This function is positively homogeneous of degree one:

   ξQ(τx) = τ ξQ(x),   x ∈ dom ξQ, τ ≥ 0.

   If the set Q is bounded then dom ξQ = Rn.


The support function is a very useful tool in Convex Analysis with many
interesting properties. We will present them later in the appropriate places. Here
we mention only one of them.
Lemma 3.1.3 For two sets Q1 and Q2, define Q = Conv{Q1, Q2}. Then

   ξQ(x) = max{ξQ1(x), ξQ2(x)},   x ∈ Rn.

Proof Indeed, since the sets Q1 and Q2 are subsets of Q, for any x ∈ Rn we have

   ξQ(x) ≥ max{ξQ1(x), ξQ2(x)}.

On the other hand,

   ξQ(x) = sup_{α,g1,g2} {⟨αg1 + (1 − α)g2, x⟩ : g1 ∈ Q1, g2 ∈ Q2, α ∈ [0, 1]}
         ≤ sup_{α∈[0,1]} {αξQ1(x) + (1 − α)ξQ2(x)} = max{ξQ1(x), ξQ2(x)}. □


4. Another important example of a convex homogeneous function related to a convex set is the Minkowski function. Let Q be a bounded closed convex set, and 0 ∈ int Q. Then we can define

   ψQ(x) = min_{τ≥0} {τ : x ∈ τQ}.

   Denote the unique solution of this problem by τ(x). Then x/τ(x) ∈ ∂Q for x ≠ 0. It is easy to see that ψQ is a positively homogeneous convex function with dom ψQ = Rn. Indeed, for arbitrary x1, x2 ∈ Rn \ {0} and α ∈ [0, 1], we have

   (αx1 + (1 − α)x2) / (ατ(x1) + (1 − α)τ(x2))
      = [ ατ(x1) · x1/τ(x1) + (1 − α)τ(x2) · x2/τ(x2) ] / (ατ(x1) + (1 − α)τ(x2)) ∈ Q.

   Therefore, ψQ(αx1 + (1 − α)x2) ≤ ατ(x1) + (1 − α)τ(x2).


5. Let Q be a set in Rn. Consider the function ψ(g, γ) = sup_{y∈Q} φ(y, g, γ), where

   φ(y, g, γ) = ⟨g, y⟩ − (γ/2)‖y‖².

   The function ψ(g, γ) is closed and convex in (g, γ) in view of Theorem 3.1.8. Let us look at its properties.
   If Q is bounded, then dom ψ = Rn+1. Let us describe the domain of ψ for the case Q = Rn. If γ < 0, then for any g ≠ 0 we can set yα = αg. Clearly, along this line, φ(yα, g, γ) → ∞ as α → ∞ (for g = 0 and γ < 0, take yα = αy with any y ≠ 0). Thus, dom ψ contains only points with γ ≥ 0.
   If γ = 0, the only possible value for g is zero, since otherwise the function φ(y, g, 0) is unbounded. Finally, if γ > 0, then the point maximizing φ(y, g, γ) with respect to y is y*(g, γ) = g/γ, and we get the following expression for ψ:

   ψ(g, γ) = ‖g‖²/(2γ).

   Thus,

   ψ(g, γ) = ⎧ 0,           if g = 0, γ = 0,
             ⎩ ‖g‖²/(2γ),   if γ > 0,

   with domain dom ψ = (Rn × {γ > 0}) ∪ {(0, 0)}. This is a convex set which is neither closed nor open. Nevertheless, ψ is a closed convex function. At the same time, this function is discontinuous at the origin:

   ψ(√γ g, γ) ≡ (1/2)‖g‖²,   γ > 0.

   Considering the closed convex set Q = {(g, γ) : γ ≥ ‖g‖²}, we can see that ψ is a closed convex function on Q (see Lemma 3.1.2), with bounded values. However, it is still discontinuous at the origin.
6. Similar constructions can be obtained by homogenization. Let f be convex on Rn. Consider the function

   f̂(τ, x) = τ f(x/τ).

   This function is well defined for all x ∈ Rn and τ > 0. Note that f̂ is a positively homogeneous function. Therefore, it is natural to define its value at the origin as follows:

   f̂(0, 0) = 0.

Let us prove that this function is convex. Consider z1 = (τ1, x1) and z2 = (τ2, x2) with τ1, τ2 > 0. Then, for any α ∈ [0, 1] we have:

   f̂(αz1 + (1 − α)z2) = (ατ1 + (1 − α)τ2) f( (αx1 + (1 − α)x2) / (ατ1 + (1 − α)τ2) )
      = (ατ1 + (1 − α)τ2) f( [ατ1 · x1/τ1 + (1 − α)τ2 · x2/τ2] / (ατ1 + (1 − α)τ2) )
      ≤ ατ1 f(x1/τ1) + (1 − α)τ2 f(x2/τ2) = α f̂(z1) + (1 − α) f̂(z2).

However, in general, f̂(·) is not closed. In order to ensure closedness, it is enough to assume that

   lim_{τ→∞} (1/τ) f(τx) = +∞   ∀x ≠ 0.     (3.1.11)

Note that the function ψ in Item 5 can be obtained from f(x) = (1/2)‖x‖², which satisfies condition (3.1.11).
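The discontinuity described in Item 5 can be reproduced numerically; the closed-form expression for ψ below follows the formulas above (an illustration for Q = Rn with n = 2, added here and not part of the original text):

```python
import math

def psi(g, gamma):
    # psi(g, gamma) = sup_y { <g, y> - (gamma/2)||y||^2 } for Q = R^2.
    if gamma > 0:
        return (g[0] ** 2 + g[1] ** 2) / (2.0 * gamma)
    if gamma == 0 and g == (0.0, 0.0):
        return 0.0
    return float('inf')          # outside dom psi

g = (1.0, 2.0)                   # ||g||^2 = 5
for gamma in (1.0, 1e-3, 1e-9):
    sg = math.sqrt(gamma)
    # Along the curve (sqrt(gamma) g, gamma), psi stays at ||g||^2 / 2 ...
    assert abs(psi((sg * g[0], sg * g[1]), gamma) - 2.5) < 1e-9
# ... although the curve approaches the origin, where psi(0, 0) = 0.
assert psi((0.0, 0.0), 0.0) == 0.0
```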

As we have seen in Example 3.1.2(5), a closed convex function can be discontinuous at some points of its domain. However, there exists one very exceptional case when this cannot happen.
Lemma 3.1.4 Any univariate closed convex function is continuous on its domain.
Proof Let f be closed and convex, and let x̄ ∈ dom f ⊆ R. We have proved in Item 1 of Theorem 3.1.4 that f is lower semi-continuous at x̄. On the other hand, if xk = (1 − αk)x̄ + αk ȳ, for certain ȳ ∈ dom f and αk ∈ [0, 1], then

   f(xk) ≤ (1 − αk)f(x̄) + αk f(ȳ)   (by (3.1.2)).

Thus, if xk → x̄, then αk → 0 and lim sup_{k→∞} f(xk) ≤ f(x̄). Hence, f is also upper semi-continuous at x̄. Consequently, it is continuous at x̄. □

Thus, it is not surprising that a restriction of the discontinuous function ψ in
Item 5 of Example 3.1.2 onto the ray {(γ g, γ ), γ ≥ 0} is a continuous convex
function.
As for any other exception, the statement of Lemma 3.1.4 is sometimes very
useful.

Theorem 3.1.10 Let functions f1 and f2 be closed and convex on Q and their
constrained level sets be bounded. Then there exists some λ* ∈ [0, 1] such that

   min_{x∈Q} { f(x) := max{f1(x), f2(x)} } = min_{x∈Q} { λ* f1(x) + (1 − λ*) f2(x) }.     (3.1.12)


Proof Define φ(λ) = min_{x∈Q} {λf1(x) + (1 − λ)f2(x)}. In view of Theorem 3.1.8, the function −φ(λ) = max_{x∈Q} {−λf1(x) − (1 − λ)f2(x)} is closed and convex; hence φ is concave, and by Lemma 3.1.4 it is continuous for λ ∈ [0, 1]. Thus, its maximal value φ* is well defined and

   φ* = φ(λ*) = max_{λ∈[0,1]} φ(λ) ≤ f* := min_{x∈Q} f(x).

Our goal is to show that φ* = f*.
For each λ ∈ [0, 1], we fix an arbitrary point

   x(λ) ∈ Arg min_{x∈Q} {λf1(x) + (1 − λ)f2(x)}.

Define g(λ) = f1(x(λ)) − f2(x(λ)). Note that for arbitrary λ1, λ2 ∈ [0, 1] we have

   φ(λ1) ≤ λ1 f1(x(λ2)) + (1 − λ1)f2(x(λ2)) = φ(λ2) + g(λ2)(λ1 − λ2).     (3.1.13)

Adding two variants of this inequality with λ1 and λ2 interchanged, we get

   (g(λ2) − g(λ1))(λ1 − λ2) ≥ 0,   λ1, λ2 ∈ [0, 1].

Thus, g(·) is a non-increasing function on [0, 1].


Define fi* = min_{x∈Q} fi(x), i = 1, 2. If λ* = 1, then taking in (3.1.13) λ1 = 1 and λ2 = λ ∈ [0, 1), we get g(λ) ≥ 0. Therefore, in view of Lemma 3.1.4, we have

   φ* = lim_{λ→1} {λ f1(x(λ)) + (1 − λ)f2(x(λ))} ≥ lim_{λ→1} {λ f(x(λ)) + (1 − λ)f2*} ≥ f*.

Thus, φ* = f* and in this case equality (3.1.12) is proved. By a symmetric reasoning, we can justify this equality for λ* = 0.

Consider now the case λ* ∈ (0, 1). Assume first that there exists a sequence {λk}k≥0 ⊂ [0, 1] such that

   λk → λ*,   g(λk) → 0,     (3.1.14)

as k → ∞. Then, in view of Lemma 3.1.4,

   φ* = lim_{k→∞} {λk f1(x(λk)) + (1 − λk)f2(x(λk))} = lim_{k→∞} {f2(x(λk)) + λk g(λk)}
      = lim_{k→∞} f2(x(λk)).

Similarly, we can prove that φ* = lim_{k→∞} f1(x(λk)). Since max{·, ·} is a continuous function, we conclude that

   φ* = lim_{k→∞} f(x(λk)) ≥ f*,

which proves (3.1.12) under assumption (3.1.14).


Finally, let us assume that there is no sequence satisfying conditions (3.1.14). Consider two sequences:

   {αk}k≥0 : αk ↑ λ*,   {βk}k≥0 : βk ↓ λ*.

Since the condition (3.1.14) is not satisfied and the function g is monotone, there exist two positive values a and b such that

   lim_{k→∞} g(αk) = a,   lim_{k→∞} g(βk) = −b.

Let γ = b/(a + b). Then, in view of Lemma 3.1.4, we have

   φ* = lim_{k→∞} {γφ(αk) + (1 − γ)φ(βk)}
      = lim_{k→∞} { γ[f2(x(αk)) + αk g(αk)] + (1 − γ)[f2(x(βk)) + βk g(βk)] }
      = lim_{k→∞} { γf2(x(αk)) + (1 − γ)f2(x(βk)) }
      ≥ lim sup_{k→∞} f2(γ x(αk) + (1 − γ)x(βk)).

Similarly,

   φ* = lim_{k→∞} { γ[f1(x(αk)) − (1 − αk)g(αk)] + (1 − γ)[f1(x(βk)) − (1 − βk)g(βk)] }
      = lim_{k→∞} { γf1(x(αk)) + (1 − γ)f1(x(βk)) }
      ≥ lim sup_{k→∞} f1(γ x(αk) + (1 − γ)x(βk)).

Choosing subsequences convergent in the function values, we can see that

   φ* ≥ lim_{k→∞} f(γ x(αk) + (1 − γ)x(βk)) ≥ f*. □


Corollary 3.1.3 Let functions fi, i = 1, …, m, be closed and convex on Q and
their constrained level sets be bounded. Then there exists some λ* ∈ Δm such that

   min_{x∈Q} { F(x) := max_{1≤i≤m} fi(x) } = min_{x∈Q} Σ_{i=1}^m λ*^(i) fi(x).     (3.1.15)


Proof In view of the cumbersome notation, we do only the first two steps in our proof by induction. Let Fk(x) = max_{k≤i≤m} fi(x). Then

   F(x) = max{f1(x), F2(x)},   Fk(x) = max{fk(x), Fk+1(x)},   k = 2, …, m − 1.

Therefore, by Theorem 3.1.10 there exists a λ*^(1) ∈ [0, 1] such that

   F* := min_{x∈Q} F(x) = min_{x∈Q} { ψ1(x) := λ*^(1) f1(x) + (1 − λ*^(1)) F2(x) }
      = min_{x∈Q} max{ λ*^(1) f1(x) + (1 − λ*^(1)) f2(x), λ*^(1) f1(x) + (1 − λ*^(1)) F3(x) }.

Again, by Theorem 3.1.10, there exists a ξ* ∈ [0, 1] such that F* = min_{x∈Q} ψ2(x), where

   ψ2(x) = ξ*(λ*^(1) f1(x) + (1 − λ*^(1)) f2(x)) + (1 − ξ*)(λ*^(1) f1(x) + (1 − λ*^(1)) F3(x))
         = λ*^(1) f1(x) + ξ*(1 − λ*^(1)) f2(x) + (1 − ξ*)(1 − λ*^(1)) F3(x).

Defining λ*^(2) = ξ*(1 − λ*^(1)), observe that

   ψ2(x) = λ*^(1) f1(x) + λ*^(2) f2(x) + (1 − λ*^(1) − λ*^(2)) F3(x).

And we can continue. □



Note that the functions fi, i = 1, …, m, in Corollary 3.1.3 may be discontinuous.
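A rough grid-search sketch (added for illustration; the functions f1, f2, the set Q = [−2, 2], and the grids are arbitrary choices, not from the original text) shows equality (3.1.12) of Theorem 3.1.10 in action:

```python
f1 = lambda x: (x - 1.0) ** 2
f2 = lambda x: (x + 1.0) ** 2
Q = [i / 100.0 for i in range(-200, 201)]         # grid on Q = [-2, 2]

# Left-hand side of (3.1.12): minimize the max of the two functions.
f_star = min(max(f1(x), f2(x)) for x in Q)

# Right-hand side: phi(lam) = min_x {lam f1 + (1 - lam) f2}, maximized in lam.
lambdas = [i / 100.0 for i in range(101)]
phi_star = max(min(l * f1(x) + (1 - l) * f2(x) for x in Q) for l in lambdas)

assert abs(f_star - phi_star) < 1e-6              # here both equal 1 at lam* = 1/2
```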

3.1.3 Continuity and Differentiability

In the previous sections, we have seen that the behavior of a convex function on the
boundary of its domain can be unpredictable (see Examples 3.1.1(6) and 3.1.2(5)).
Fortunately, this is the only bad thing which can happen. In this section, we will
see that the local structure of a convex function in the interior of its domain is very
simple.
Theorem 3.1.11 Let f be convex and x0 ∈ int(dom f). Then f is locally bounded
and locally Lipschitz continuous at x0.
Proof Let us first prove that f is locally bounded. Let us choose some ε > 0 such that x0 ± εei ∈ int(dom f), i = 1, …, n. Define

   Δ = Conv{x0 ± εei, i = 1, …, n} = B1(x0, ε)   (see (3.1.8)).

Clearly, Δ ⊆ dom f and, in view of Corollary 3.1.2, we have

   max_{x∈Δ} f(x) = max_{1≤i≤n} f(x0 ± εei) := M.     (3.1.16)

Consider now a point y ∈ B1(x0, ε), y ≠ x0. Let

   α = (1/ε)‖y − x0‖_(1),   z = x0 + (1/α)(y − x0).

It is clear that ‖z − x0‖_(1) = (1/α)‖y − x0‖_(1) = ε. Therefore, α ≤ 1 and

   y = αz + (1 − α)x0.

Hence,

   f(y) ≤ αf(z) + (1 − α)f(x0) ≤ f(x0) + α(M − f(x0))   (by (3.1.16))
        = f(x0) + ((M − f(x0))/ε) ‖y − x0‖_(1).

Further, let u = x0 + (1/α)(x0 − y). Then ‖u − x0‖_(1) = ε and y = x0 + α(x0 − u). Therefore, in view of Theorem 3.1.1, we have

   f(y) ≥ f(x0) + α(f(x0) − f(u)) ≥ f(x0) − α(M − f(x0))   (by (3.1.16))
        = f(x0) − ((M − f(x0))/ε) ‖y − x0‖_(1).

Thus, |f(y) − f(x0)| ≤ ((M − f(x0))/ε) ‖y − x0‖_(1). □
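The Lipschitz bound obtained in this proof can be tested numerically (the convex function, the point x0, and the radius ε below are an arbitrary example added for illustration):

```python
import random

f = lambda y: y[0] ** 2 + abs(y[1])      # convex on R^2
x0, eps = (0.0, 0.0), 1.0
# M from (3.1.16): the maximum of f over the 2n vertices x0 +- eps * e_i.
M = max(f((eps, 0.0)), f((-eps, 0.0)), f((0.0, eps)), f((0.0, -eps)))

random.seed(2)
for _ in range(1000):
    y = (random.uniform(-1, 1), random.uniform(-1, 1))
    if abs(y[0] - x0[0]) + abs(y[1] - x0[1]) > eps:
        continue                         # keep only points of B1(x0, eps)
    bound = (M - f(x0)) / eps * (abs(y[0] - x0[0]) + abs(y[1] - x0[1]))
    assert abs(f(y) - f(x0)) <= bound + 1e-12
```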

Let us show that all convex functions possess a property which is very close to differentiability.
Definition 3.1.3 Let x ∈ dom f. We call f differentiable at the point x in direction p ≠ 0 if the following limit exists:

   f′(x; p) = lim_{α↓0} (1/α)[f(x + αp) − f(x)].     (3.1.17)

The value f′(x; p) is called the directional derivative of f at x.


Theorem 3.1.12 A convex function f is differentiable in any direction at any
interior point of its domain.
Proof Let x ∈ int(dom f). Consider the function

   φ(α) = (1/α)[f(x + αp) − f(x)],   α > 0.

Let β ∈ (0, 1], and let the value α ∈ (0, ε] be small enough to have x + εp ∈ dom f. Then

   f(x + αβp) = f((1 − β)x + β(x + αp)) ≤ (1 − β)f(x) + βf(x + αp).

Therefore,

   φ(αβ) = (1/(αβ))[f(x + αβp) − f(x)] ≤ (1/α)[f(x + αp) − f(x)] = φ(α).

Thus, φ(α) decreases as α ↓ 0. Let us choose γ > 0 small enough to have the point x − γp inside the domain. Then, x + αp = x + (α/γ)(x − (x − γp)). Therefore, in view of inequality (3.1.5), we have

   φ(α) ≥ (1/γ)[f(x) − f(x − γp)].

Hence, the limit in the right-hand side of (3.1.17) exists. □
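The monotonicity of the difference quotient φ(α) used in this proof is easy to observe numerically (the nonsmooth convex function below is an arbitrary example added for illustration):

```python
f = lambda x: abs(x) + x * x             # convex, nonsmooth at 0
x0, p = 0.0, 1.0
phi = lambda a: (f(x0 + a * p) - f(x0)) / a
alphas = [1.0, 0.5, 0.25, 0.125, 1e-3, 1e-6]
# phi(alpha) decreases as alpha decreases ...
for a_big, a_small in zip(alphas, alphas[1:]):
    assert phi(a_small) <= phi(a_big) + 1e-12
# ... and tends to the directional derivative f'(0; 1) = 1.
assert abs(phi(1e-6) - 1.0) < 1e-5
```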



Let us prove that the directional derivative provides us with a global lower
support of the initial convex function.

Lemma 3.1.5 Let the function f be convex and x ∈ int(dom f). Then f′(x; ·) is a
convex function which is positively homogeneous of degree one. For any y ∈ dom f,
we have

   f(y) ≥ f(x) + f′(x; y − x).     (3.1.18)

Proof Let us prove that the directional derivative is homogeneous. Indeed, for any p ∈ Rn and τ > 0, we have

   f′(x; τp) = lim_{α↓0} (1/α)[f(x + ταp) − f(x)] = τ lim_{β↓0} (1/β)[f(x + βp) − f(x)] = τ f′(x; p).

Further, for any p1, p2 ∈ Rn and β ∈ [0, 1], we obtain

   f′(x; βp1 + (1 − β)p2) = lim_{α↓0} (1/α)[f(x + α(βp1 + (1 − β)p2)) − f(x)]
      ≤ lim_{α↓0} (1/α){ β[f(x + αp1) − f(x)] + (1 − β)[f(x + αp2) − f(x)] }
      = βf′(x; p1) + (1 − β)f′(x; p2).

Thus, f′(x; p) is convex in p. Finally, let α ∈ (0, 1], y ∈ dom f, and yα = x + α(y − x). Then, in view of Theorem 3.1.1, we have

   f(y) = f(yα + (1/α)(1 − α)(yα − x)) ≥ f(yα) + (1/α)(1 − α)[f(yα) − f(x)],

and we get (3.1.18) taking the limit as α ↓ 0. □




3.1.4 Separation Theorems

Up to now, we have looked at the properties of convex functions in terms of
function values. We have not yet introduced any directions which could be used
by minimization schemes. In Convex Analysis, such directions are defined by
separation theorems, which are presented in this section.
Definition 3.1.4 Let Q be a convex set. We say that the hyperplane

   H(g, γ) = {x ∈ Rn | ⟨g, x⟩ = γ},   g ≠ 0,

is supporting to Q if any x ∈ Q satisfies the inequality ⟨g, x⟩ ≤ γ. The hyperplane H(g, γ) separates a point x0 from Q if

   ⟨g, x⟩ ≤ γ ≤ ⟨g, x0⟩     (3.1.19)

for all x ∈ Q. If one of the inequalities in (3.1.19) is strict, then we call the separation strong. □
In a similar way, we define separability of convex sets. Two sets Q1 and Q2 are called separable if there exist g ∈ Rn, g ≠ 0, and γ ∈ R such that

   ⟨g, x⟩ ≤ γ ≤ ⟨g, y⟩   ∀x ∈ Q1, y ∈ Q2.     (3.1.20)

The separation is strict if one of the inequalities in (3.1.20) is strict. We call the separation strong if

   sup_{x∈Q1} ⟨g, x⟩ < γ < inf_{y∈Q2} ⟨g, y⟩.     (3.1.21)

All separation theorems in Rn can be derived from the properties of the Euclidean projection. Let us first describe the possibilities for strong separation.

Theorem 3.1.13 Let Q1 and Q2 be closed convex sets in Rn such that Q1 ∩ Q2 = ∅. These sets are strongly separable provided that one of them is bounded.
Proof Suppose that Q1 is bounded. Consider the following minimization problem:

   ρ* = min_{x∈Q1} ρ_{Q2}(x).

Note that the optimal value of this problem is positive and its optimal set X* is not empty. Moreover, for all x* ∈ X*, we have (see (2.2.41))

   ∇ρ_{Q2}(x*) = g*,   ⟨g*, x*⟩ = γ*.

Therefore, for all x1 ∈ Q1 we have

   ⟨g*, x1⟩ − γ* = ⟨∇ρ_{Q2}(x*), x1 − x*⟩ ≥ 0   (by (2.2.41) and (2.2.39)).

On the other hand, for all x2 ∈ Q2 we have

   ⟨g*, x2⟩ − γ* ≤ ⟨x* − π_{Q2}(x*), x2 − x*⟩ ≤ −‖x* − π_{Q2}(x*)‖² = −(ρ*)²   (by (2.2.41) and (2.2.47)). □


Remark 3.1.1 The assumption of boundedness of one of the sets in Theorem 3.1.13 cannot be omitted. To see why, consider the separation problem for the sets Q and R²₊ in Example 2.2.1. □
Corollary 3.1.4 Let Q be a closed convex set and x ∉ Q. Then x is strongly separable from Q. □
Let us give an example of an application of this important fact.
Corollary 3.1.5 Let Q1 and Q2 be two closed convex sets.
1. If ξQ1(g) ≤ ξQ2(g) for all g ∈ dom ξQ2, then Q1 ⊆ Q2.
2. Let dom ξQ1 = dom ξQ2, and for any g ∈ dom ξQ1 we have ξQ1(g) = ξQ2(g). Then Q1 ≡ Q2.
Proof
1. Assume that there exists an x0 ∈ Q1 which does not belong to Q2. Then, in view of Corollary 3.1.4, there exists a direction g such that

   ⟨g, x0⟩ > γ ≥ ⟨g, x⟩

   for all x ∈ Q2. Hence, g ∈ dom ξQ2 and ξQ1(g) > ξQ2(g). This is a contradiction.
2. In view of the first statement, Q1 ⊆ Q2 and Q2 ⊆ Q1. Therefore, Q1 ≡ Q2. □
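For finite sets of points, support functions are easy to compute, which gives a quick sanity check of Lemma 3.1.3: the supremum over the convex hull of a finite set is attained at one of its generating points (the sets and test directions below are arbitrary choices added for illustration):

```python
def support(points, x):
    # xi_Q(x) = sup { <g, x> : g in Q } for a finite Q in R^2.
    return max(g[0] * x[0] + g[1] * x[1] for g in points)

Q1 = [(1.0, 0.0), (0.0, 1.0)]
Q2 = [(-1.0, -1.0), (2.0, 0.5)]
# For finite sets, the sup over Conv{Q1, Q2} is attained at a generating
# point, so xi_{Conv{Q1,Q2}} = max(xi_{Q1}, xi_{Q2}), as in Lemma 3.1.3.
for x in [(1.0, 2.0), (-3.0, 0.5), (0.0, -1.0)]:
    lhs = support(Q1 + Q2, x)
    rhs = max(support(Q1, x), support(Q2, x))
    assert abs(lhs - rhs) < 1e-12
```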


The next separation theorem deals with boundary points of convex sets.
Theorem 3.1.14 Let Q be a closed convex set. If the point x0 belongs to the boundary of Q, then there exists a hyperplane H(g, γ), supporting to Q, which contains x0.
(Such a vector g is called supporting to Q at the point x0.)
Proof Consider a sequence {yk} such that yk ∉ Q and yk → x0. Let

   gk = (yk − πQ(yk)) / ‖yk − πQ(yk)‖,   γk = ⟨gk, πQ(yk)⟩.

In view of Corollary 3.1.4, for all x ∈ Q we have

   ⟨gk, x⟩ ≤ γk ≤ ⟨gk, yk⟩.     (3.1.22)

However, ‖gk‖ = 1 and, in view of Lemma 2.2.8, the sequence {γk} is bounded:

   |γk| = |⟨gk, πQ(yk) − x0⟩ + ⟨gk, x0⟩| ≤ ‖πQ(yk) − x0‖ + ‖x0‖ ≤ ‖yk − x0‖ + ‖x0‖.

Therefore, without loss of generality, we can assume that there exist g* = lim_{k→∞} gk and γ* = lim_{k→∞} γk. It remains to take the limit in inequalities (3.1.22). □

3.1.5 Subgradients

Now we are ready to introduce a generalization of the notion of the gradient.


Definition 3.1.5 A vector g is called a subgradient of the function f at the point x0 ∈ dom f if for any y ∈ dom f we have

   f(y) ≥ f(x0) + ⟨g, y − x0⟩.     (3.1.23)

The set of all subgradients of f at x0, ∂f(x0), is called the subdifferential of the function f at the point x0.
If inequality (3.1.23) is valid only for points y ∈ Q, we use the notation g ∈ ∂_Q f(x0). The latter set is called the constrained subdifferential. Clearly, ∂f(x0) ⊆ ∂_Q f(x0) for any convex set Q ⊆ dom f.
For concave functions, we define super-gradients and super-differentials by changing the sign in inequality (3.1.23). Note that ∂f(x0) can be nonempty even for nonconvex f.
A simple consequence of Definition 3.1.5 is as follows:

   ⟨g1 − g2, x1 − x2⟩ ≥ 0   ∀x1, x2 ∈ dom f, g1 ∈ ∂f(x1), g2 ∈ ∂f(x2).     (3.1.24)

The necessity of introducing the notion of the subdifferential is clear from the following example.
Example 3.1.3 Consider the function f(x) := (x)+ = max{x, 0}, x ∈ R. For all y ∈ R and g ∈ [0, 1], we have

   f(y) = max{y, 0} ≥ g · y = f(0) + g · (y − 0).

Therefore, the subgradient of f at x = 0 is not uniquely defined. In our example, it can be an arbitrary value from the interval [0, 1]. □
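A brute-force check (added for illustration, with an arbitrary sampling grid) confirms that at x0 = 0 the subgradient inequality (3.1.23) for f(x) = (x)+ holds exactly for g ∈ [0, 1] and fails outside this interval:

```python
f = lambda x: max(x, 0.0)                # f(x) = (x)_+
ys = [i / 10.0 for i in range(-50, 51)]  # sample points of dom f = R
# Every g in [0, 1] satisfies (3.1.23) at x0 = 0 ...
for g in (0.0, 0.3, 1.0):
    assert all(f(y) >= f(0.0) + g * (y - 0.0) for y in ys)
# ... while values of g outside [0, 1] violate it at some y.
assert any(f(y) < 1.5 * y for y in ys)   # g = 1.5 fails for y > 0
assert any(f(y) < -0.5 * y for y in ys)  # g = -0.5 fails for y < 0
```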

The whole set of conditions (3.1.23) parameterized by y ∈ Q can be seen as a
set of linear inequality constraints for g, defining the set ∂Q f (x0 ). Therefore, by
definition, any subdifferential is a closed convex set.
Let us prove that subdifferentiability of function f at all points of some convex
set implies convexity and closedness of the function.

Lemma 3.1.6 Let Q be a convex set. Assume that, for any x ∈ Q ⊆ dom f , the
constrained subdifferential ∂Q f (x) is nonempty. Then f is a closed convex function
on Q.
Proof For any x ∈ Q, define fˆ(x) = sup{f (y) + g(y), x − y : y ∈ Q} ≥ f (x),
y
where g(y) is an arbitrary subgradient from ∂Q f (y). By Theorem 3.1.8, fˆ is a
(3.1.23)
closed convex function, and f (x) ≥ fˆ(x) for all x ∈ Q. 

On the other hand, we can prove a relaxed converse statement.
Theorem 3.1.15 Let the function f be convex. If x0 ∈ int (dom f ), then ∂f (x0 ) is
a nonempty bounded set.
Proof Since the point (f (x0 ), x0 ) belongs to the boundary of epi (f ), in view of
Theorem 3.1.14, there exists a hyperplane supporting to epi (f ) at (f (x0 ), x0 ):

− ατ + d, x ≤ −αf (x0 ) + d, x0 (3.1.25)

for all (τ, x) ∈ epi (f ). Let us normalize the coefficients of hyperplane in order to
satisfy the condition

 d 2 +α 2 = 1, (3.1.26)

where the norm is standard Euclidean. Since the point (τ, x0 ) belongs to epi (f ) for
all τ ≥ f (x0 ), we conclude that α ≥ 0.
In view of Theorem 3.1.11 a convex function is locally Lipschitz continuous in
the interior of its domain. This means that there exist some  > 0 and M > 0 such
that B2 (x0 , ) ⊆ dom f and

f (x) − f (x0 ) ≤ M  x − x0 

for all x ∈ B2 (x0 , ).2 Therefore, in view of (3.1.25), for any x from this ball

d, x − x0 ≤ α(f (x) − f (x0 )) ≤ αM  x − x0  .

Choosing x = x0 + d, we get  d 2 ≤ Mα  d . Thus, in view of normalizing


condition (3.1.26), we get α ≥ [1 + M 2 ]−1/2 . Hence, choosing g = d/α, we obtain

(3.1.25)
f (x) ≥ f (x0 ) + g, x − x0

for all x ∈ dom f .

2 Inthe proof of Theorem 3.1.11, we worked with the 1 -norm. However, the result remains valid
for any norm in Rn , since in finite dimensions all norms are topologically equivalent.
164 3 Nonsmooth Convex Optimization

Finally, if g ∈ ∂f(x0), g ≠ 0, then choosing x = x0 + εg/‖g‖ we obtain

ε‖g‖ = ⟨g, x − x0⟩ ≤ f(x) − f(x0) ≤ M ‖x − x0‖ = εM.

Thus, ‖g‖ ≤ M, and ∂f(x0) is bounded.



The next example shows that the statement of Theorem 3.1.15 cannot be
strengthened.

Example 3.1.4 Consider the function f(x) = −√x with domain R+. This function is
convex and closed, but the subdifferential does not exist at x = 0.
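This can be seen numerically as well. The sketch below (not from the book) checks that the difference quotients of f(x) = −√x at zero equal −1/√t and are therefore unbounded below, so no finite subgradient g can satisfy f(t) ≥ f(0) + g·t for all t > 0.

```python
import math

# Hypothetical numerical check (not from the book): f(x) = -sqrt(x) on R_+
# is convex and closed, yet has no subgradient at x = 0: any candidate g
# would need f(t) >= f(0) + g*t for all t > 0, i.e. g <= -1/sqrt(t),
# which fails as t -> 0.
def f(x):
    return -math.sqrt(x)

for t in [1.0, 1e-2, 1e-4, 1e-8]:
    slope = (f(t) - f(0.0)) / t               # equals -1/sqrt(t)
    assert math.isclose(slope, -1.0 / math.sqrt(t), rel_tol=1e-9)

# The difference quotients are unbounded below, so no finite g works:
assert (f(1e-8) - f(0.0)) / 1e-8 < -1e3
```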

Sub-differentiability at x ∈ dom f is an important characteristic of the local
structure of the function f around this point. Let us prove the following fact.
Theorem 3.1.16 For the function f , define its Fenchel dual

f∗ (s) = sup [s, y − f (y)], (3.1.27)


y∈dom f

and the dual of the Fenchel dual:

f∗∗ (x) = sup [s, x − f∗ (s)].


s∈dom f∗

Then f(x) ≥ f∗∗(x) for all x ∈ dom f. Moreover, if ∂f(x) ≠ ∅ for some x ∈
dom f, then ∂f(x) ⊆ dom f∗ and f(x) = f∗∗(x).
Proof Indeed, for any x ∈ dom f we have

(3.1.27)
f∗∗(x) = sup_{s∈dom f∗} [⟨s, x⟩ − f∗(s)] = sup_{s∈dom f∗} inf_{y∈dom f} [⟨s, x⟩ − ⟨s, y⟩ + f(y)]

(1.3.6) y=x
≤ inf_{y∈dom f} sup_{s∈dom f∗} [⟨s, x − y⟩ + f(y)] ≤ f(x).

Let us choose now an arbitrary g ∈ ∂f (x). Then for any y ∈ dom f we have

(3.1.23)
⟨g, y⟩ − f(y) ≤ ⟨g, y⟩ − f(x) − ⟨g, y − x⟩ = ⟨g, x⟩ − f(x).

Thus, g ∈ dom f∗ . Therefore,

f∗∗(x) = sup_{s∈dom f∗} inf_{y∈dom f} [⟨s, x⟩ − ⟨s, y⟩ + f(y)]

(3.1.23)
≥ inf_{y∈dom f} [⟨g, x⟩ − ⟨g, y⟩ + f(y)] = f(x). 
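The Fenchel dual can be checked numerically on a simple instance. The sketch below (data chosen here, not from the book) uses f(x) = x², for which f∗(s) = s²/4 and the bidual f∗∗ recovers f; both suprema in (3.1.27) are approximated on a fine grid.

```python
import numpy as np

# A sketch (data chosen here, not from the book): for f(x) = x^2 the
# Fenchel dual (3.1.27) is f_*(s) = sup_x [s*x - x^2] = s^2/4, and the
# bidual f_** recovers f.  Both suprema are approximated on a grid.
def conj(vals, grid, s):
    # numerical sup over the grid of [s*t - vals(t)]
    return np.max(s * grid - vals)

xs = np.linspace(-10.0, 10.0, 200001)
f = xs ** 2
for s in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    assert abs(conj(f, xs, s) - s * s / 4.0) < 1e-3

ss = np.linspace(-40.0, 40.0, 200001)
fstar = ss ** 2 / 4.0
for x in [-2.0, 0.0, 1.5]:
    assert abs(conj(fstar, ss, x) - x * x) < 1e-3
```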

Let us prove an important relation between subdifferential and directional


derivatives of a convex function.
Theorem 3.1.17 Let the function f be convex, and x0 ∈ int (dom f ). Then

∂2 f  (x0 ; 0) = ∂f (x0 ),

where the subdifferential ∂2 corresponds to the second argument of the function


f (x0 ; ·). Moreover, for any p ∈ Rn , we have

f  (x0 ; p) = max{g, p | g ∈ ∂f (x0 )}. (3.1.28)

Proof Note that

f′(x0; p) = lim_{α↓0} (1/α)[f(x0 + αp) − f(x0)] ≥ ⟨g, p⟩, (3.1.29)

where g is an arbitrary vector from ∂f (x0 ). Therefore, the subdifferential of the


function f  (x0 ; ·) at p = 0 is not empty and ∂f (x0 ) ⊆ ∂2 f  (x0 ; 0). On the other
hand, since f  (x0 ; p) is convex in p, in view of Lemma 3.1.5, for any y ∈ dom f
we have

f (y) ≥ f (x0 ) + f  (x0 ; y − x0 ) ≥ f (x0 ) + g, y − x0 ,

where g ∈ ∂2 f  (x0 ; 0). Thus, ∂2 f  (x0 ; 0) ⊆ ∂f (x0 ) and we see that these two sets
coincide.
Consider g ∈ ∂2 f  (x0 ; p). Then, in view of inequality (3.1.18), for all v ∈ Rn
and τ > 0 we have

τf  (x0 ; v) = f  (x0 ; τ v) ≥ f  (x0 ; p) + g, τ v − p .

Considering τ → ∞ we get

f  (x0 ; v) ≥ g, v , (3.1.30)

and, considering τ → 0, we obtain

f  (x0 ; p) − g, p ≤ 0. (3.1.31)

However, inequality (3.1.30) implies that g ∈ ∂2 f  (x0 ; 0). Therefore, compar-


ing (3.1.29) and (3.1.31), we conclude that g, p = f  (x0 ; p).
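Formula (3.1.28) admits a quick numerical sanity check. The sketch below (assuming f(x) = |x|, so ∂f(0) = [−1, 1]; example chosen here, not from the book) compares a one-sided difference quotient with the maximum of ⟨g, p⟩ over the subdifferential.

```python
# A sanity check of formula (3.1.28), assuming f(x) = |x| so that
# ∂f(0) = [-1, 1] (example chosen here, not from the book).
def dir_deriv(f, x0, p, alpha=1e-9):
    # one-sided difference quotient approximating f'(x0; p)
    return (f(x0 + alpha * p) - f(x0)) / alpha

for p in [-3.0, -1.0, 0.5, 2.0]:
    lhs = dir_deriv(abs, 0.0, p)              # equals |p| for f = |.|
    rhs = max(g * p for g in (-1.0, 1.0))     # max over extreme points of [-1, 1]
    assert abs(lhs - rhs) < 1e-6
```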

Let us mention some properties of subgradients, which are of central importance
for Convex Optimization. The next result forms the basis for the cutting plane
optimization schemes.

Theorem 3.1.18 For any x0 ∈ dom f , all vectors g ∈ ∂f (x0 ) are supporting to the
level set Lf (f (x0 )):

g, x0 − x ≥ 0 ∀x ∈ Lf (f (x0 )) = {x ∈ dom f : f (x) ≤ f (x0 )}.

Proof Indeed, if f (x) ≤ f (x0 ) and g ∈ ∂f (x0 ), then

f (x0 ) + g, x − x0 ≤ f (x) ≤ f (x0 ). 


Corollary 3.1.6 Let Q ⊆ dom f be a closed convex set, x0 ∈ Q, and

x∗ ∈ Arg min_{x∈Q} f(x).

Then for any g ∈ ∂f (x0 ), we have g, x0 − x ∗ ≥ 0. 



In some situations, the following objects are very useful.
Definition 3.1.6 Let the set X ⊆ dom f be closed and convex. The set

∂̂f(X) = ⋂_{x∈X} ∂f(x) (3.1.32)

is called the epigraph facet of the set X.


This definition is motivated by the following statement.
Theorem 3.1.19 Let the set X be closed and convex, and ∂̂f(X) ≠ ∅. Then

f ((1 − α)x0 + αx1 ) = (1 − α)f (x0 ) + αf (x1 ), ∀x0 , x1 ∈ X, α ∈ [0, 1].


(3.1.33)
Moreover, for any g ∈ ∂̂f(X) and all x0, x1 from X, we have

f (x1 ) = f (x0 ) + g, x1 − x0 . (3.1.34)



Proof Indeed, let g ∈ ∂̂f(X) ⊆ ∂f(x0) ∩ ∂f(x1). Then,

(3.1.23) (3.1.23)
f (x0 ) + g, x1 − x0 ≤ f (x1 ) ≤ f (x0 ) + g, x1 − x0 .

Thus, (3.1.34) is proved. Consequently, for xα = (1 − α)x0 + αx1 with α ∈ [0, 1],
we have
(3.1.2) (3.1.23)
(1 − α)f (x0 ) + αf (x1 ) ≥ f (xα ) ≥ f (x0 ) + g, xα − x0

(3.1.34)
= f (x0 ) + αg, x1 − x0 = (1 − α)f (x0 ) + αf (x1 ).

Thus, we have proved equality (3.1.33).




Let us show how the epigraph facets arise in optimality conditions for Uncon-
strained Optimization.
Theorem 3.1.20 Let X∗ = Arg min_{x∈dom f} f(x). Then a closed convex set X̂ is a
subset of X∗ if and only if

0 ∈ ∂̂f(X̂).

Proof Indeed, if 0 ∈ ∂̂f(X̂), then for any x∗ ∈ X̂ and all x ∈ dom f we have

f(x) ≥ f(x∗) + ⟨0, x − x∗⟩ = f(x∗).

Thus, x∗ ∈ X∗.
On the other hand, if X̂ ⊆ X∗, then for any x∗ ∈ X̂ we have f(x) ≥ f(x∗) for
all x ∈ dom f. Hence, by Definition 3.1.5, 0 ∈ ∂f(x∗) for every x∗ ∈ X̂, that is,
0 ∈ ∂̂f(X̂). 

In what follows, for a set-valued mapping S(·) and an arbitrary set X ⊆ Rn, we
def
use the notation S̃(X) = ⋃_{x∈X} S(x).
x∈X

3.1.6 Computing Subgradients

In the previous section, we introduced subgradients, the objects which we are going
to use in minimization methods. However, in order to apply such methods for
solving real-life problems, we need to be sure that subgradients are computable.
In this section, we present the corresponding computational rules. Note that for
the majority of minimization methods, it is enough to be able to compute a single
subgradient from the set ∂f (x).
Let us first establish some relations between gradients and subgradients.
Lemma 3.1.7 Let a function f be convex. Assume that it is differentiable at a point
x ∈ int (dom f ). Then ∂f (x) = {∇f (x)}.
Proof Indeed, for any direction p ∈ Rn , we have

f  (x; p) = ∇f (x), p .

It remains to use Theorem 3.1.17 and Item 2 of Corollary 3.1.5.



Lemma 3.1.8 Let a function ψ(·) be convex and ϕ be a univariate convex function,
which is non-decreasing on the set

Im ψ = {τ = ψ(x), x ∈ dom ψ}.



Then the function f (·) = ϕ(ψ(·)) is convex and for any x from int (dom ψ) we have

∂f (x) = Conv {λ∂ψ(x), λ ∈ ∂ϕ(ψ(x))}.

Proof Indeed, the function f is convex in view of Theorem 3.1.9. Let us fix an
arbitrary x ∈ int (dom ψ) and any direction p. Then, by the chain rule for directional
derivatives, we have

f′(x; p) = ϕ′(ψ(x); ψ′(x; p)) = max_λ {λ ψ′(x; p) : λ ∈ ∂ϕ(ψ(x))}

= max_{λ,g} {⟨g, p⟩ : g ∈ λ∂ψ(x), λ ∈ ∂ϕ(ψ(x))}.

It remains to use Theorem 3.1.17 and Item 2 of Corollary 3.1.5.



Consider now a mixed situation when the function f (x, y) depends on two
variables x ∈ Rn and y ∈ Rm .
Lemma 3.1.9 Let a function f be convex, and

z̄ = (x̄, ȳ) ∈ int (dom f ) ⊆ Rn × Rm .

Assume that f is differentiable in the first variable, and the corresponding partial
gradient ∇1 f (·, ·) ∈ Rn is continuous at z̄ along any direction in Rn+m . Then

∂f (z̄) = (∇1 f (x̄, ȳ), ∂2 f (x̄, ȳ)),

where ∂2 f (x, y) ⊂ Rm is the partial subdifferential of f with respect to the second


variable, when the first variable is fixed.
Proof Let us fix an arbitrary direction h = (hx , hy ) ∈ Rn × Rm . Then for α > 0
small enough, we have

(1/α)(f(x̄ + αhx, ȳ + αhy) − f(x̄, ȳ)) = (1/α)(f(x̄ + αhx, ȳ + αhy) − f(x̄, ȳ + αhy))

+ (1/α)(f(x̄, ȳ + αhy) − f(x̄, ȳ)).

Since f is convex, we have

(2.1.2)
α∇1 f (x̄, ȳ + αhy ), hx ≤ f (x̄ + αhx , ȳ + αhy ) − f (x̄, ȳ + αhy )

(2.1.2)
≤ α∇1 f (x̄ + αhx , ȳ + αhy ), hx .

Hence, in view of the directional continuity of ∇1 f , we have

f′(z̄; h) = ⟨∇1 f(x̄, ȳ), hx⟩ + f′(z̄; (0, hy))

(3.1.28)
= ⟨∇1 f(x̄, ȳ), hx⟩ + max{⟨g, hy⟩ : g ∈ ∂2 f(x̄, ȳ)}.

It remains to use Corollary 3.1.5.



Finally, let us present a converse statement, which derives differentiability from
a kind of continuous subdifferentiability.
Lemma 3.1.10 Let f be convex and x0 ∈ int (dom f ). Assume that there exists a
vector function g(x) ∈ ∂f (x) which is continuous at x0 . Then f is differentiable at
x0 and ∇f (x0 ) = g(x0 ).
Proof Indeed, for any direction h ∈ Rn and small enough positive α, we have

(3.1.23) (3.1.23)
⟨g(x0), h⟩ ≤ (1/α)[f(x0 + αh) − f(x0)] ≤ ⟨g(x0 + αh), h⟩.

Thus, taking the limit as α ↓ 0, we get f′(x0; h) = ⟨g(x0), h⟩ for all h ∈ Rn.


Hence, g(x0 ) = ∇f (x0 ). 
Let us provide all operations for convex functions, described in Sect. 3.1.2, with
corresponding chain rules for updating subgradients.
Lemma 3.1.11 Let the function f be closed and convex on the bounded set S ⊆
dom f ⊆ Rm . Consider a linear operator

A (x) = Ax + b : Rn → Rm .

Then φ(x) = f (A (x)) is a closed convex function on the set

Q = {x | A (x) ∈ S}.

For any x ∈ Q with nonempty ∂f (A (x)) we have

∂φ(x) = AT ∂f (A (x)).

Proof We have already proved the first part of this lemma in Theorem 3.1.6. Let us
prove the relation for the subdifferential. Let y0 = A (x0 ). Then for all p ∈ Rn , we
have

φ′(x0; p) = f′(y0; Ap) = max{⟨g, Ap⟩ | g ∈ ∂f(y0)}

= max{⟨ḡ, p⟩ | ḡ ∈ AT ∂f(y0)}.

Using Theorem 3.1.17 and Corollary 3.1.5, we get ∂φ(x0 ) = AT ∂f (A (x0 )). 


Lemma 3.1.12 Let functions f1 and f2 be closed and convex, and α1 , α2 ≥ 0. Then
the function f (x) = α1 f1 (x) + α2 f2 (x) is also closed and convex and

∂f (x) = α1 ∂f1 (x) + α2 ∂f2 (x) (3.1.35)



for any x from int (dom f) = int (dom f1) ∩ int (dom f2).
Proof In view of Theorem 3.1.5, we need to prove only the relation for the subdif-
ferentials. Consider x0 ∈ int (dom f1) ∩ int (dom f2). In view of Theorem 3.1.15,
at this point both subdifferentials are bounded. For any p ∈ Rn , we have

f  (x0 ; p) = α1 f1 (x0 ; p) + α2 f2 (x0 ; p)

= max{g1 , α1 p | g1 ∈ ∂f1 (x0 )}

+ max{g2 , α2 p | g2 ∈ ∂f2 (x0 )}

= max{α1 g1 + α2 g2 , p | g1 ∈ ∂f1 (x0 ), g2 ∈ ∂f2 (x0 )}

= max{g, p | g ∈ α1 ∂f1 (x0 ) + α2 ∂f2 (x0 )}.

Hence, using Theorem 3.1.17 and Corollary 3.1.5, we get (3.1.35).



Lemma 3.1.13 Let functions fi , i = 1 . . . m, be closed and convex. Then the
function f(x) = max_{1≤i≤m} fi(x) is closed and convex. For any x ∈ int (dom f) =
⋂_{i=1}^m int (dom fi), we have

∂f (x) = Conv {∂fi (x) | i ∈ I (x)}, (3.1.36)

where I (x) = {i : fi (x) = f (x)}.


Proof Again, in view of Theorem 3.1.5, we need to justify only the rules for
subdifferentials. Consider x ∈ ⋂_{i=1}^m int (dom fi). In view of Theorem 3.1.15, at this
point, the subdifferentials of all functions fi are bounded.
For the sake of notation, assume that I (x) = {1, . . . , k}. Then for any p ∈ Rn ,
we have

f′(x; p) = max_{1≤i≤k} fi′(x; p) = max_{1≤i≤k} max{⟨gi, p⟩ | gi ∈ ∂fi(x)}.

Note that for any set of values a1, . . . , ak we have

max_{1≤i≤k} ai = max{ Σ_{i=1}^k λi ai | {λi} ∈ Δk },

where Δk = {λ : λi ≥ 0, Σ_{i=1}^k λi = 1} is the standard k-dimensional simplex. Therefore,

f′(x; p) = max_{{λi}∈Δk} { Σ_{i=1}^k λi max{⟨gi, p⟩ | gi ∈ ∂fi(x)} }

= max{ ⟨Σ_{i=1}^k λi gi, p⟩ | gi ∈ ∂fi(x), {λi} ∈ Δk }

= max{ ⟨g, p⟩ | g = Σ_{i=1}^k λi gi, gi ∈ ∂fi(x), {λi} ∈ Δk }

= max{ ⟨g, p⟩ | g ∈ Conv {∂fi(x), i ∈ I(x)} }. 
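The resulting rule (3.1.36) is easy to test numerically for a maximum of affine functions, where each ∂fi(x) = {ai}. The sketch below (random data invented here, not from the book) forms a convex combination of the gradients of the active pieces and checks the subgradient inequality (3.1.23) at random points.

```python
import numpy as np

# Sketch of rule (3.1.36) for f(x) = max_i [<a_i, x> + b_i], where each
# ∂f_i(x) = {a_i} (random data invented here): any convex combination of
# the gradients of the active pieces is a subgradient of f.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))               # rows a_i
b = rng.standard_normal(4)

def f(x):
    return np.max(A @ x + b)

x0 = rng.standard_normal(3)
vals = A @ x0 + b
active = np.where(vals > f(x0) - 1e-12)[0]    # the index set I(x0)
lam = np.ones(len(active)) / len(active)      # a point of the simplex
g = lam @ A[active]                           # g in Conv{a_i : i in I(x0)}

for _ in range(100):                          # subgradient inequality (3.1.23)
    y = rng.standard_normal(3)
    assert f(y) >= f(x0) + g @ (y - x0) - 1e-9
```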
The last rule can be useful for computing some elements from subdifferentials.
Lemma 3.1.14 Let Δ be an arbitrary set, and f (x) = sup{φ(x, y) | y ∈ Δ}.
Suppose that for any y ∈ Δ the function φ(·, y) is closed and convex on some
convex set Q. Then f is closed convex on the set
" %
Q̂ = x ∈ Q | sup φ(x, y) < +∞ .
y∈Δ

Moreover, for any x ∈ Q̂ we have

∂Q̂ f (x) ⊇ Conv {∂Q,x φ(x, y) | y ∈ I (x)},

where I (x) = {y ∈ Δ | φ(x, y) = f (x)}.


Proof In view of Theorem 3.1.8, we have to prove only the inclusion. Indeed, for
any x ∈ Q̂, x0 ∈ Q̂, y0 ∈ I(x0), and g0 ∈ ∂Q,x φ(x0, y0), we have

f(x) ≥ φ(x, y0) ≥ φ(x0, y0) + ⟨g0, x − x0⟩ = f(x0) + ⟨g0, x − x0⟩. 


Now we can look at some examples of subdifferentials.



Example 3.1.5
1. Let f(x) = (x)+, x ∈ R. Then ∂f(0) = [0, 1] since f(x) = max_{g∈[0,1]} g x.

2. Consider the function f(x) = Σ_{i=1}^m |⟨ai, x⟩|. Define

I−(x) = {i : ⟨ai, x⟩ < 0},

I+(x) = {i : ⟨ai, x⟩ > 0},

I0(x) = {i : ⟨ai, x⟩ = 0}.

Then ∂f(x) = Σ_{i∈I+(x)} ai − Σ_{i∈I−(x)} ai + Σ_{i∈I0(x)} [−ai, ai].
3. Consider the function f(x) = max_{1≤i≤n} x^(i). Define I(x) = {i : x^(i) = f(x)}.
Then

∂f (x) = Conv {ei | i ∈ I (x)}.

For x = 0, we have ∂f (0) = Conv {ei | 1 ≤ i ≤ n} ≡ Δn .


4. For the Euclidean norm f(x) = ‖x‖, we have

∂f(0) = B2(0, 1) = {x ∈ Rn | ‖x‖ ≤ 1},

∂f(x) = {x/‖x‖}, x ≠ 0.


5. For the ℓ1-norm, f(x) = ‖x‖1 = Σ_{i=1}^n |x^(i)|, we have

∂f(0) = B∞(0, 1) = {x ∈ Rn | max_{1≤i≤n} |x^(i)| ≤ 1},

∂f(x) = Σ_{i∈I+(x)} ei − Σ_{i∈I−(x)} ei + Σ_{i∈I0(x)} [−ei, ei], x ≠ 0,

where I+ (x) = {i | x (i) > 0}, I− (x) = {i | x (i) < 0} and I0 (x) = {i | x (i) =
0}.
6. In the case of the Minkowski function, we need to introduce a polar of the set Q:

PQ = {g ∈ Rn : g, x ≤ 1 ∀x ∈ Q}. (3.1.37)



Then

∂ψQ(0) = PQ, ∂ψQ(x) = Arg max_{g∈PQ} ⟨g, x⟩.

We leave the justification of these examples as an exercise for the reader.
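One of these exercises can at least be checked numerically. The sketch below (dimensions and data chosen here, not from the book) verifies Item 5 at zero: every g with ‖g‖∞ ≤ 1 satisfies the subgradient inequality for ‖·‖1, and g = sign(y) attains the maximum in (3.1.41).

```python
import numpy as np

# Numerical check of Item 5 (sizes chosen here, not from the book):
# every g with ||g||_inf <= 1 satisfies ||y||_1 >= <g, y>, i.e. it is a
# subgradient of the l1-norm at 0; g = sign(y) attains the maximum.
rng = np.random.default_rng(1)
for _ in range(200):
    g = rng.uniform(-1.0, 1.0, size=5)        # ||g||_inf <= 1
    y = rng.standard_normal(5)
    assert np.sum(np.abs(y)) >= g @ y - 1e-12

y = rng.standard_normal(5)
g = np.sign(y)                                # attains the supremum
assert abs(np.sum(np.abs(y)) - g @ y) < 1e-12
```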



Finally, let us describe subgradients of homogeneous functions.
Definition 3.1.7 A function f is called (positively) homogeneous of degree p ≥ 0
if dom f is a cone and

f (τ x) = τ p f (x) ∀x ∈ dom f, ∀τ ≥ 0. (3.1.38)

Note that all functions in Example 3.1.5 are homogeneous of degree one.
Theorem 3.1.21 (Euler’s Homogeneous Function Theorem) Let the function f
be convex and subdifferentiable on its domain. If it is homogeneous of degree p ≥ 1,
then

g, x = pf (x) ∀x ∈ dom f, ∀g ∈ ∂f (x). (3.1.39)

Proof Indeed, let x ∈ dom f and g ∈ ∂f(x). Then for any τ ≥ 0 we have

(3.1.38) (3.1.23)
τ^p f(x) = f(τx) ≥ f(x) + (τ − 1)⟨g, x⟩.

For τ > 1, this implies that (τ^p − 1)/(τ − 1) f(x) ≥ ⟨g, x⟩. Therefore, taking the limit as τ ↓ 1,
we get pf(x) ≥ ⟨g, x⟩.
For τ < 1, the above inequality implies (1 − τ^p)/(1 − τ) f(x) ≤ ⟨g, x⟩. Hence, taking the
limit as τ ↑ 1, we get pf(x) ≤ ⟨g, x⟩.
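Identity (3.1.39) is easy to confirm on a differentiable instance. The sketch below (example chosen here, not from the book) uses f(x) = ‖x‖², which is homogeneous of degree p = 2 with ∂f(x) = {2x}.

```python
import numpy as np

# Identity (3.1.39) for f(x) = ||x||^2, homogeneous of degree p = 2 with
# ∂f(x) = {2x} (example chosen here, not from the book):
#   <∇f(x), x> = 2 f(x).
rng = np.random.default_rng(2)
for _ in range(10):
    x = rng.standard_normal(4)
    grad = 2.0 * x
    assert abs(grad @ x - 2.0 * (x @ x)) < 1e-12
```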

In Convex Analysis, the most important homogeneous functions have degree of
homogeneity one. For such functions,

(3.1.39)
g, x = f (x) ∀x ∈ dom f, ∀g ∈ ∂f (x). (3.1.40)

From now on, let us assume that dom f = Rn . Then, for all x ∈ Rn we have

(3.1.28)
f(x) = f′(0; x) = max{⟨g, x⟩ : g ∈ ∂f(0)}. (3.1.41)

The simplest example of a homogeneous function is a linear function f(x) =
⟨a, x⟩. A more important case is a general norm. For f(x) = ‖x‖, we have

‖x‖ = max{⟨g, x⟩ : ‖g‖∗ ≤ 1},

where ‖g‖∗ = max{⟨g, x⟩ : ‖x‖ ≤ 1} is the dual norm. Thus,

∂‖x‖ |_{x=0} = {g ∈ Rn : ‖g‖∗ ≤ 1}. (3.1.42)

Lemma 3.1.15 Let a function f be convex and homogeneous of degree one with
dom f = Rn . Then for all x ∈ Rn , we have

∂f (x) = {g ∈ ∂f (0) : g, x = f (x)}. (3.1.43)

Proof Denote the right-hand side of equality (3.1.43) by G(x). If g ∈ ∂f(x), then
for any y ∈ Rn we have

(3.1.23) (3.1.40)
f (y) ≥ f (x) + g, y − x = g, y .

(3.1.40)
Thus, g ∈ ∂f (0). Consequently, g ∈ G(x). On the other hand, if g ∈ G(x),
then for any y ∈ Rn we have

(3.1.23)
f (y) ≥ g, y = f (x) + g, y − x .

Therefore, g ∈ ∂f (x).

Thus, in view of equality (3.1.41), ∂f (x) is a facet of ∂f (0).
Let us give an example of application for the machinery developed so far.
Theorem 3.1.22 Let Q1 and Q2 be bounded closed convex sets with intersection
Q = Q1 ∩ Q2, which has nonempty interior. Then

ξQ(x) = min_{y∈Rn} [ξQ1(x + y) + ξQ2(−y)], x ∈ Rn. (3.1.44)

Proof Let us first prove that the optimization problem in (3.1.44) is solvable. If
g ∈ Q1 ∩ Q2, then for any y ∈ Rn we have

def
φx(y) = ξQ1(x + y) + ξQ2(−y) ≥ ⟨g, x + y⟩ + ⟨g, −y⟩ = ⟨g, x⟩.

Thus the objective function in (3.1.44) is bounded below and for its infimum φx∗ we
have φx∗ ≥ ξQ(x). Consider a sequence {yk} such that φx(yk) → φx∗. If this sequence
def
is bounded, then the infimum is attained. If not, then we can have tk = ‖yk‖ → ∞.
Let ȳk = (1/tk) yk. Then

lim_{k→∞} φx(ȳk) = lim_{k→∞} [ξQ1((1/tk)x + ȳk) + ξQ2(−ȳk)] = lim_{k→∞} (1/tk) φx(yk) = 0.

Since the sequence {ȳk} is bounded, we can assume that it converges to a point
ȳ with ‖ȳ‖ = 1 and φx(ȳ) = 0. In this case, we have

⟨g1, ȳ⟩ ≤ ξQ1(ȳ) = −ξQ2(−ȳ) ≤ ⟨g2, ȳ⟩, ∀g1 ∈ Q1, ∀g2 ∈ Q2.

Hence, ⟨g, ȳ⟩ = ξQ1(ȳ) for all g ∈ Q, so the linear function ⟨·, ȳ⟩ is constant on Q.
Since ȳ ≠ 0, this contradicts the assumption that int Q is nonempty.
Denote by y ∗ the solution of the optimization problem in (3.1.44). In view of
Theorem 3.1.20, we have
(3.1.35)
0 ∈ ∂φx (y ∗ ) = ∂ξQ1 (x + y ∗ ) + ∂ξ−Q2 (y ∗ ).

In view of Lemma 3.1.15 this means that there exists a vector g such that

g ∈ Q1 , g, x + y ∗ = ξQ1 (x + y ∗ ),

−g ∈ −Q2 , −g, y ∗ = ξ−Q2 (y ∗ ).

Thus, φx∗ = ξQ1 (x + y ∗ ) + ξQ2 (−y ∗ ) = ξQ1 (x + y ∗ ) + ξ−Q2 (y ∗ ) = g, x . Since


g ∈ Q, we conclude that φx∗ ≤ ξQ (x). 
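Formula (3.1.44) can be checked in one dimension, where the support function of an interval [a, b] is ξ(t) = max(at, bt). The sketch below (sets Q1 = [−1, 2], Q2 = [0, 3] chosen here, not from the book) approximates the infimal convolution on a grid.

```python
import numpy as np

# 1-D sketch of (3.1.44) with sets chosen here (not from the book):
# Q1 = [-1, 2], Q2 = [0, 3], Q = Q1 ∩ Q2 = [0, 2].  The support function
# of [a, b] is max(a*t, b*t); the infimum over y is taken on a grid.
def support_interval(a, b, t):
    return np.maximum(a * t, b * t)

ys = np.linspace(-10.0, 10.0, 100001)
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    vals = support_interval(-1.0, 2.0, x + ys) + support_interval(0.0, 3.0, -ys)
    assert abs(vals.min() - support_interval(0.0, 2.0, x)) < 1e-3
```

The grid step bounds the approximation error, since all functions involved are Lipschitz continuous on the sampled range.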
Finally, let us describe subgradients of superpositions of convex functions and
differentiable convex functions.
Lemma 3.1.16 Consider ψ(g) = max_{λ∈Λ} ⟨λ, g⟩, where Λ ⊂ Rm_+ is a bounded convex
set. Let the vector function F(x) = (f1(x), . . . , fm(x)), x ∈ Rn, have differentiable
convex components. Then the superposition f(x) = ψ(F(x)) is convex and

∂f(x) = { Σ_{i=1}^m λ^(i) ∇fi(x) : λ ∈ Arg max_{λ∈Λ} ⟨λ, F(x)⟩ }. (3.1.45)

Proof Indeed the function ψ(·) is monotone: if g1 ≤ g2 in the component-wise


sense, then ψ(g1 ) ≤ ψ(g2 ). Therefore, for any x, y from Rn and α ∈ [0, 1] we have

f (αx + (1 − α)y) ≤ ψ(αF (x) + (1 − α)F (y)) ≤ αf (x) + (1 − α)f (y).

Relation (3.1.45) follows from the representation of directional derivatives.


Define F′(x) = (∇f1(x), . . . , ∇fm(x)) ∈ Rn×m. Then for any direction h ∈ Rn
we have

f′(x; h) = ψ′(F(x); (F′(x))T h)

(3.1.43)
= max{⟨λ, (F′(x))T h⟩ : λ ∈ Arg max_{λ∈Λ} ⟨λ, F(x)⟩}. 

Lemma 3.1.17 Let F be a differentiable convex and monotone function on Rm and


suppose the functions fi are convex on a convex open set Q. Then the function

φ(x) = F (f1 (x), . . . , fm (x))

is convex on Q and


∂φ(x) = Σ_{i=1}^m ∇i F(f(x)) · ∂fi(x), x ∈ Q, (3.1.46)

where f (x) = (f1 (x), . . . , fm (x))T ∈ Rm .


Proof Indeed, for x, y ∈ Q and α ∈ [0, 1] we have

φ(αx + (1 − α)y) ≤ F (αf (x) + (1 − α)f (y)) ≤ αφ(x) + (1 − α)φ(y).

Further, for any direction p ∈ Rn,

(3.1.28)
φ′(x; p) = Σ_{i=1}^m ∇i F(f(x)) fi′(x; p) = Σ_{i=1}^m ∇i F(f(x)) ξ∂fi(x)(p).

It remains to use Corollary 3.1.5.



Corollary 3.1.7 If all fi , i = 1, . . . , m, are convex, then the function



φ(x) = ln( Σ_{i=1}^m e^{fi(x)} ) (3.1.47)

is also convex.
Proof Indeed, we have seen in Example 2.1.1(4) that the function



F(s) = ln( Σ_{i=1}^n e^{s^(i)} )

is convex and monotone on Rn . 


3.1.7 Optimality Conditions

Let us apply the developed technique to derive different optimality conditions.


We start with a simple minimization problem, where the objective function has a
composite form:

def
min_{x∈Q} { f̃(x) = f(x) + Ψ(x) }, (3.1.48)

where Q is a closed convex set, f ∈ C 1 (Q) is a continuously differentiable convex


function and Ψ is a closed convex function defined on the set Q.
Theorem 3.1.23 A point x ∗ is a solution to problem (3.1.48) if and only if for every
x ∈ Q we have

∇f (x ∗ ), x − x ∗ + Ψ (x) ≥ Ψ (x ∗ ). (3.1.49)

Proof Indeed, if condition (3.1.49) is satisfied, then

(2.1.2)
f˜(x) = f (x) + Ψ (x) ≥ f (x ∗ ) + ∇f (x ∗ ), x − x ∗ + Ψ (x)

(3.1.49)
≥ f (x ∗ ) + Ψ (x ∗ ) = f˜(x ∗ ).

Assume now that x ∗ is an optimal solution of the minimization problem (3.1.48).


Suppose that there exists an x ∈ Q such that

∇f (x ∗ ), x − x ∗ + Ψ (x) < Ψ (x ∗ ).

Note that lim_{α↓0} (1/α)[f(αx + (1 − α)x∗) − f(x∗)] = ⟨∇f(x∗), x − x∗⟩. Thus, for a
positive α small enough we have

f (αx + (1 − α)x ∗ ) < f (x ∗ ) + α[Ψ (x ∗ ) − Ψ (x)]

= f˜(x ∗ ) + α[Ψ (x ∗ ) − Ψ (x)] − Ψ (x ∗ )

(3.1.2)
≤ f˜(x ∗ ) − Ψ (αx + (1 − α)x ∗ ).

Hence, f˜(αx + (1 − α)x ∗ ) < f˜(x ∗ ) and we get a contradiction.
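Criterion (3.1.49) can be illustrated on a tiny composite instance (chosen here, not from the book): f(x) = x², Ψ(x) = |x|, Q = R, whose minimizer is x∗ = 0.

```python
import numpy as np

# Check of criterion (3.1.49) on a composite instance chosen here:
# f(x) = x^2, Psi(x) = |x|, Q = R.  The minimizer of x^2 + |x| is x* = 0,
# and <f'(x*), x - x*> + Psi(x) >= Psi(x*) holds for all x.
xs = np.linspace(-5.0, 5.0, 100001)
composite = xs ** 2 + np.abs(xs)
x_star = xs[np.argmin(composite)]
assert abs(x_star) < 1e-9                     # the grid contains 0

grad_at_star = 2.0 * x_star                   # f'(x*) = 0
assert np.all(grad_at_star * (xs - x_star) + np.abs(xs) >= -1e-12)
```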



In view of Definition 3.1.5, condition (3.1.49) is equivalent to the inclusion

−∇f (x ∗ ) ∈ ∂Q Ψ (x ∗ ).

Let us now look at optimization problems with general objective functions.


Consider the problem

min f (x), (3.1.50)


x∈Q

where Q ⊆ Rn is a closed convex set and f is a closed convex function, dom f ⊃


Q. For a point x̄ ∈ Q, define the normal cone:

N (x̄) = {g ∈ Rn | g, x − x̄ ≥ 0, ∀x ∈ Q}. (3.1.51)



Since inclusion g ∈ N (x̄) implies τg ∈ N (x̄) for any τ ≥ 0, this is indeed a cone.
It is closed and convex as an intersection of closed convex sets, the half-spaces

{g : g, x − x̄ ≥ 0}, x ∈ Q.

Clearly, N (x̄) = {0n } for all x̄ ∈ int Q. Thus, this cone is nontrivial only at the
boundary points x̄ ∈ ∂Q.
For x̄ ∈ Q, define the tangent cone

T (x̄) = {p ∈ Rn | g, p ≥ 0, ∀g ∈ N (x̄)}. (3.1.52)

Thus, this is a standard dual cone to N (x̄). Again, this cone is closed and convex
as the intersection of the system of half-spaces. Clearly, for x̄ ∈ int Q we have
T (x̄) = Rn .
The name of the cone T (·) is justified by the following property.
Lemma 3.1.18 Let x̄ ∈ ∂Q. Then Q − x̄ ⊂ T (x̄). Moreover,

T (x̄) = cl (K (Q − x̄)) . (3.1.53)

Thus, T (x̄) is the closure of the conic hull of the set Q − x̄.
Proof Indeed, in view of the definition of normal cone (3.1.51), we have

g, x − x̄ ≥ 0, ∀x ∈ Q, g ∈ N (x̄).

(3.1.52)
Therefore, Q − x̄ ⊂ T(x̄). Since T(x̄) is a closed cone, this means that

def
K̄ = cl (K(Q − x̄)) ⊆ T(x̄).

Let us assume that there exists a point p̄ ∈ T(x̄) such that p̄ ∉ K̄. Then, by
Corollary 3.1.4, there exists a direction ḡ which strongly separates p̄ from K̄:

ḡ, p̄ < γ ≤ ḡ, α(x − x̄) , ∀x ∈ Q, α ≥ 0.

Letting α → +∞ in this inequality, we get ⟨ḡ, x − x̄⟩ ≥ 0 for all x ∈ Q. Thus,
the direction ḡ belongs to the cone N(x̄). On the other hand, taking α = 0, we get
(3.1.52)
γ ≤ 0. Thus, ⟨ḡ, p̄⟩ < 0. This means that p̄ ∉ T(x̄). Hence, we get a
contradiction.
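The cones N(x̄) and T(x̄) can be probed numerically. Note that with the sign convention of definition (3.1.51), both coincide with R²₊ for Q = R²₊ at x̄ = 0 (example chosen here, not from the book).

```python
import numpy as np

# Sketch for Q = R^2_+ at x_bar = 0 (example chosen here).  With the sign
# convention of (3.1.51), N(0) = {g : <g, x> >= 0 for all x in Q} = R^2_+,
# and by (3.1.53) the tangent cone T(0) = cl K(Q - 0) = R^2_+ as well.
rng = np.random.default_rng(3)
Q_samples = rng.uniform(0.0, 5.0, size=(500, 2))   # points of Q

def in_N(g):
    # empirical test of <g, x> >= 0 over the sampled points of Q
    return bool(np.all(Q_samples @ g >= -1e-12))

assert in_N(np.array([1.0, 2.0]))             # nonnegative g lies in N(0)
assert not in_N(np.array([-1.0, 2.0]))        # a negative component fails

# T(0) is the dual cone of N(0): nonnegative directions pass, while a
# direction with a negative component is cut off by some g in N(0).
N_samples = rng.uniform(0.0, 5.0, size=(500, 2))
assert np.all(N_samples @ np.array([0.5, 3.0]) >= 0.0)
assert np.any(N_samples @ np.array([-0.5, 0.1]) < 0.0)
```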


Remark 3.1.2 For the special case Q = {x ∈ Rn : Ax = b}, where A is an (m×n)-


matrix, standard arguments from Linear Algebra prove the following representation:

N (x̄) = {g ∈ Rn : g = AT y, y ∈ Rm },
(3.1.54)
T (x̄) = {h ∈ Rn : Ah = 0},

which is valid for all x̄ ∈ Q.


The next statement gives us an optimality condition for a linearized version of
problem (3.1.50).
Lemma 3.1.19 Let x ∗ be an optimal solution to problem (3.1.50). Then

f  (x ∗ ; p) ≥ 0 ∀p ∈ T (x ∗ ). (3.1.55)

Proof Assume that there exists a point p̄ ∈ T (x ∗ ) such that f  (x ∗ , p̄) < 0. In view
of Lemma 3.1.18, there exist two sequences {αk } ⊂ R+ and {xk } ⊂ Q such that

p̄ = lim αk (xk − x ∗ ).
k→∞

Since the function f  (x ∗ ; ·) is continuous, in view of Lemma 3.1.5, we have

0 > f′(x∗; p̄) = lim_{k→∞} αk f′(x∗; xk − x∗)

= lim_{k→∞} lim_{β↓0} (αk/β) [f(x∗ + β(xk − x∗)) − f(x∗)] ≥ 0.

Thus, we come to a contradiction.



Now we can justify an optimality condition for problem (3.1.50). Define

X∗ = Arg min f (x).


x∈Q

Theorem 3.1.24 A point x ∗ from Q belongs to X∗ if and only if there exists a


g ∗ ∈ ∂f (x ∗ ) such that

g ∗ , x − x ∗ ≥ 0 ∀x ∈ Q. (3.1.56)

In this case, g∗ ∈ ∂̂f(X∗) ∩ Ñ(X∗) (see Definition 3.1.6).
Proof Indeed, from the condition (3.1.56) and definition of ∂f (x ∗ ), we have

(3.1.23) (3.1.56)
f (x) ≥ f (x ∗ ) + g ∗ , x − x ∗ ≥ f (x ∗ ) ∀x ∈ Q.

Thus, x ∗ ∈ X∗ .

Let us prove the converse statement. Let x ∗ ∈ X∗ be an optimal solution of


problem (3.1.50). Assume that there is no g ∈ ∂f (x ∗ ) such that

g, x − x ∗ ≥ 0 ∀x ∈ Q.

In view of definition (3.1.51), this means that ∂f(x∗) ∩ N(x∗) = ∅. Consider the
following auxiliary optimization problem:

min_{g1,g2} { φ(g1, g2) = (1/2)‖g1 − g2‖² : g1 ∈ ∂f(x∗), g2 ∈ N(x∗) },

where the norm is standard Euclidean. Since the set ∂f (x ∗ ) is bounded, there exists
def
its optimal solution (g1∗ , g2∗ ) and the optimal value ρ ∗ = φ(g1∗ , g2∗ ) is positive. Let
us write down optimality conditions for this auxiliary problem. By Theorem 2.2.9,
we obtain

∇g1 φ(g1∗ , g2∗ ), g1 − g1∗ = g1∗ − g2∗ , g1 − g1∗ ≥ 0 ∀g1 ∈ ∂f (x ∗ ),


(3.1.57)
∇g2 φ(g1∗ , g2∗ ), g2 − g2∗ = g2∗ − g1∗ , g2 − g2∗ ≥ 0 ∀g2 ∈ N (x ∗ ).
(3.1.58)
Taking in (3.1.58) g2 = 0 and g2 = αg2∗ as α → +∞, we get

g2∗ − g1∗ , g2∗ ≤ 0 ≤ g2∗ − g1∗ , g2∗ .

def
Thus, for p∗ = g2∗ − g1∗ we have g2∗ , p∗ = 0. Therefore,

(3.1.58)
g2 , p∗ ≥ 0 ∀g2 ∈ N (x ∗ ),

(3.1.52)
which means p∗ ∈ T (x ∗ ). On the other hand, for all g1 ∈ ∂f (x ∗ ) we have

(3.1.57)
g1 , p∗ ≤ g1∗ , p∗ = g1∗ − g2∗ , p∗ = −2ρ ∗ .

(3.1.28)
This means that f  (x ∗ ; p∗ ) = −2ρ ∗ < 0. Thus, we get a contradiction with
Lemma 3.1.19 and prove the existence of a vector g ∗ ∈ ∂f (x ∗ ) such that

g ∗ , x − x ∗ ≥ 0 ∀x ∈ Q.

Note that for any other point x1∗ ∈ X∗ we have

(3.1.23)
f (x ∗ ) = f (x1∗ ) ≥ f (x ∗ ) + g ∗ , x1∗ − x ∗ ≥ f (x ∗ ).

Hence, ⟨g∗, x1∗ − x∗⟩ = 0 and we conclude that g∗ ∈ ∂f(x1∗). Consequently, g∗ ∈
∂̂f(X∗). For the same reason, g∗ belongs both to N(x∗) and N(x1∗). 
Remark 3.1.3 For x ∗ ∈ int Q, condition (3.1.56) is equivalent to the inclusion of
Theorem 3.1.20.
Remark 3.1.4 In the special case Q = {x ∈ Rn : Ax = b}, where A is an (m × n)-
matrix, in view of representation (3.1.54), the statement of Theorem 3.1.24 can be
specified in the following way:

A point x ∗ belongs to X∗ if and only if there exists a


(3.1.59)
g ∗ ∈ ∂f (x ∗ ) such that g ∗ = AT y ∗ for some y ∗ ∈ Rm .

(Compare with the statement of Corollary 1.2.1.)


Theorem 3.1.24 is one of the most powerful tools of Convex Analysis. Let us
demonstrate this with several important examples.
First of all, consider the differentiation rules for a partial minimum of a convex
function (3.1.9).
Theorem 3.1.25 Let φ be a closed convex function, and Q1 ⊆ Rn and Q2 ⊆ Rm
be two closed convex sets such that Q1 × Q2 ⊆ dom φ. Define

f (x) = inf φ(x, y).


y∈Q2

def
Then f is convex on Q1. Moreover, if Y(x) = Arg min_{y∈Q2} φ(x, y) ≠ ∅, then

∂Q1 f(x) ⊇ ⋃_{yx∈Y(x)} {gx ∈ Rn : ∃gy such that (gx, gy) ∈ ∂φ(x, yx)

and ⟨gy, y − yx⟩ ≥ 0 ∀y ∈ Q2}.
(3.1.60)
Proof The convexity of the function f was already proved in Theorem 3.1.7. Let us
fix a point x ∈ Q1 with Y(x) ≠ ∅. In view of Theorem 3.1.24, the right-hand side of
inclusion (3.1.60) is not empty. Consider an arbitrary element gx of this set with the
corresponding yx ∈ Y(x) and gy. Let x1 ∈ Q1 and ε > 0. Choosing a point y1 ∈ Q2
such that φ(x1, y1) ≤ f(x1) + ε, we get

f(x1) + ε ≥ φ(x1, y1) ≥ φ(x, yx) + ⟨gx, x1 − x⟩ + ⟨gy, y1 − yx⟩

≥ φ(x, yx) + ⟨gx, x1 − x⟩ = f(x) + ⟨gx, x1 − x⟩.

Since we can choose ε arbitrarily small, the inclusion gx ∈ ∂Q1 f(x) is proved.




Corollary 3.1.8 If Y(x) ≠ ∅ for all x ∈ dom f, then f is a closed convex function
on Q1.
Proof By inclusion (3.1.60), ∂f(x) ≠ ∅. Therefore, we can apply Lemma 3.1.6.


Note that separability of the constraints x ∈ Q1 and y ∈ Q2 is essential for the
validity of the rule (3.1.60). Simple examples show that in the general situation of
Theorem 3.1.7, the set ∂f (x) can be dependent also on the partial subgradients of
function φ in y. Such a general case can be treated by Theorem 3.1.28.
Let us look now at optimality conditions for smooth minimization problem with
functional constraints:

min{f0 (x)| fi (x) ≤ 0, i = 1, . . . , m}, (3.1.61)


x∈Q

where Q is a closed convex set.


Theorem 3.1.26 (Karush–Kuhn–Tucker) Let functions fi , i = 0 . . . m, be
convex and differentiable with int (dom fi ) ⊃ Q . Suppose that there exists a point
x̄ ∈ Q such that

fi (x̄) < 0, i = 1, . . . , m. (Slater condition for inequalities) (3.1.62)

A point x ∗ is an optimal solution of problem (3.1.61) if and only if there exist


nonnegative values λ∗i , i = 1 . . . m, satisfying the following conditions:


⟨∇f0(x∗) + Σ_{i=1}^m λ∗i ∇fi(x∗), x − x∗⟩ ≥ 0, ∀x ∈ Q,
(3.1.63)
λ∗i fi(x∗) = 0, i = 1, . . . , m.

Proof In view of Lemma 2.3.4, x ∗ is an optimal solution to problem (3.1.61) if and


only if it is a global minimizer of the function

φ(x) = max{f0 (x) − f ∗ ; fi (x), i = 1 . . . m}

over the set Q. In view of Theorem 3.1.24, this is the case if and only if there exists
a g ∗ ∈ ∂φ(x ∗ ) such that

g ∗ , x − x ∗ ≥ 0 ∀x ∈ Q.

Further, in view of Lemma 3.1.13, the inclusion g∗ ∈ ∂φ(x∗) is equivalent to the
existence of nonnegative weights λ̄i, i = 0, . . . , m, such that

λ̄0 ∇f0(x∗) + Σ_{i∈I∗} λ̄i ∇fi(x∗) = g∗,

λ̄0 + Σ_{i∈I∗} λ̄i = 1,

where I ∗ = {i ∈ {1, . . . , m} : fi (x ∗ ) = 0}.


Thus, we need to prove only that λ̄0 > 0. Indeed, if λ̄0 = 0, then

Σ_{i∈I∗} λ̄i fi(x̄) ≥ Σ_{i∈I∗} λ̄i [fi(x∗) + ⟨∇fi(x∗), x̄ − x∗⟩] = ⟨g∗, x̄ − x∗⟩ ≥ 0.

This contradicts the Slater condition. Therefore λ̄0 > 0 and we can take λ∗i = λ̄i/λ̄0
for all i ∈ I∗ and λ∗i = 0 for i ∉ I∗.
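Here is a minimal numerical instance of conditions (3.1.63) (data chosen here, not from the book).

```python
# Tiny instance of conditions (3.1.63), data chosen here (not from the
# book):  min (x - 2)^2  s.t.  f1(x) = x - 1 <= 0,  Q = R.
# Slater holds (e.g. x_bar = 0); the solution is x* = 1 with lambda* = 2.
x_star, lam_star = 1.0, 2.0

grad_f0 = 2.0 * (x_star - 2.0)                # = -2
grad_f1 = 1.0
assert grad_f0 + lam_star * grad_f1 == 0.0    # stationarity (here Q = R)
assert lam_star * (x_star - 1.0) == 0.0       # complementary slackness
assert x_star - 1.0 <= 0.0 and lam_star >= 0.0
```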

Theorem 3.1.26 is very useful for solving simple optimization problems.
Lemma 3.1.20 Let A ≻ 0. Then

max_x {⟨c, x⟩ : ⟨Ax, x⟩ ≤ 1} = ⟨c, A⁻¹c⟩^{1/2}. (3.1.64)

Proof Note that all conditions of Theorem 3.1.26 are satisfied and the solution x ∗
of the above problem is attained at the boundary of the feasible set. Therefore, in
accordance with Theorem 3.1.26, we have to solve the following equations:

c = λ∗ Ax∗, ⟨Ax∗, x∗⟩ = 1.

Thus, λ∗ = ⟨c, A⁻¹c⟩^{1/2} and x∗ = (1/λ∗) A⁻¹c. 
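Equality (3.1.64) can be verified numerically on random data (invented here): the candidate x∗ = A⁻¹c/λ∗ is feasible, attains the claimed value, and no other feasible point does better.

```python
import numpy as np

# Numerical check of (3.1.64) on random data invented here: the point
# x* = A^{-1} c / lambda* is feasible, attains <c, A^{-1} c>^{1/2}, and
# no other feasible point does better (Cauchy-Schwarz in the A-norm).
rng = np.random.default_rng(4)
B = rng.standard_normal((3, 3))
A = B @ B.T + 3.0 * np.eye(3)                 # A positive definite
c = rng.standard_normal(3)

opt = np.sqrt(c @ np.linalg.solve(A, c))      # <c, A^{-1} c>^{1/2}
x_star = np.linalg.solve(A, c) / opt          # x* = (1/lambda*) A^{-1} c
assert abs(x_star @ (A @ x_star) - 1.0) < 1e-9
assert abs(c @ x_star - opt) < 1e-9

for _ in range(200):
    x = rng.standard_normal(3)
    x = x / np.sqrt(x @ (A @ x))              # boundary point <Ax, x> = 1
    assert c @ x <= opt + 1e-9
```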

The values λ∗i ≥ 0, i = 1, . . . , m, are called optimal dual (Lagrange) multipliers
for problem (3.1.61). We can get some upper bounds for these values from the depth
of the Slater condition (3.1.62).
Lemma 3.1.21 Any point x̄, feasible for problem (3.1.61), generates the following
upper bound on the magnitude of optimal dual multipliers:


f0(x̄) − f0(x∗) ≥ Σ_{i=1}^m (−fi(x̄)) λ∗i. (3.1.65)

Proof Indeed,

f0(x̄) + Σ_{i=1}^m λ∗i fi(x̄)

(2.1.2)
≥ f0(x∗) + ⟨∇f0(x∗), x̄ − x∗⟩ + Σ_{i=1}^m λ∗i [fi(x∗) + ⟨∇fi(x∗), x̄ − x∗⟩]

= f0(x∗) + Σ_{i=1}^m λ∗i fi(x∗) + ⟨∇f0(x∗) + Σ_{i=1}^m λ∗i ∇fi(x∗), x̄ − x∗⟩

(3.1.63)
≥ f0(x∗). 
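Bound (3.1.65) can be checked on the one-dimensional instance min (x − 2)² subject to x − 1 ≤ 0 (data chosen here, not from the book), for which λ∗ = 2.

```python
# Bound (3.1.65) on the instance min (x - 2)^2 s.t. x - 1 <= 0 (data
# chosen here, not from the book): with Slater point x_bar = 0, x* = 1
# and lambda* = 2,
#   f0(x_bar) - f0(x*) >= (-f1(x_bar)) * lambda*,  i.e.  3 >= 2.
f0 = lambda x: (x - 2.0) ** 2
f1 = lambda x: x - 1.0
x_bar, x_star, lam_star = 0.0, 1.0, 2.0

assert f1(x_bar) < 0.0                        # Slater condition (3.1.62)
assert f0(x_bar) - f0(x_star) >= (-f1(x_bar)) * lam_star
```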

The statement of Lemma 3.1.21 can be used to construct an exact penalty


function for problem (3.1.61). Let the point x̄ ∈ Q satisfy Slater condition (3.1.62).
Assume that we know some upper bound D for the gap f0 (x̄)−f0 (x ∗ ). For example,
it can be found by the following optimization problem:

D = max∇f0 (x̄), x̄ − x .
x∈Q


m
+ :
Consider the set Λ = {λ ∈ Rm (−fi (x̄))λi ≤ D}. In view of
i=1
Lemma 3.1.21, we have λ∗ ∈ Λ. Define the following nonsmooth penalty function:


Ψ(g) = max_{λ∈Λ} ⟨λ, g⟩ = D max_{1≤i≤m} ( g^(i) / (−fi(x̄)) )_+ , g ∈ Rm, (3.1.66)

where (a)+ = max{0, a}.


Consider the following minimization problem:
 
def
min_{x∈Q} { φ(x) = f0(x) + Ψ(f(x)) }, (3.1.67)

where f (x) = (f1 (x), . . . , fm (x)). Let us compute its subdifferential at the point
x ∗ , the solution of problem (3.1.61).
Note that max_{λ∈Λ} ⟨λ, f(x∗)⟩ = 0. In accordance with the rules of Lemma 3.1.16, we
can form the set

Λ+ = {λ ∈ Λ : ⟨λ, f(x∗)⟩ = 0} = {λ ∈ Λ : λi = 0, i ∉ I(x∗)},

where I(x∗) = {i : fi(x∗) = 0}. Since λ∗ ∈ Λ+, in view of Lemma 3.1.16 we
have

g∗ = ∇f0(x∗) + Σ_{i∈I(x∗)} λ∗i ∇fi(x∗) ∈ ∂φ(x∗).

Hence, by Theorem 3.1.26 and Theorem 3.1.24, x∗ ∈ Arg min_{x∈Q} φ(x). Thus, the
optimal values of problems (3.1.67) and (3.1.61) coincide.
Let x̂ be an arbitrary optimal solution to problem (3.1.67). Then, by Theo-
rem 3.1.24 and Lemma 3.1.16, there exists a vector λ̂ ∈ Arg max_{λ∈Λ} ⟨λ, f(x̂)⟩ such
that

⟨∇f0(x̂) + Σ_{i=1}^m λ̂i ∇fi(x̂), x − x̂⟩ ≥ 0, ∀x ∈ Q.

Let us assume that Ψ(f(x̂)) > 0. Then the inequality constraint in the definition of
the set Λ is active and we have ⟨λ̂, −f(x̄)⟩ = D. However,

D ≥ f0(x̄) − f0(x̂) ≥ ⟨∇f0(x̂), x̄ − x̂⟩ ≥ Σ_{i=1}^m λ̂i ⟨∇fi(x̂), x̂ − x̄⟩

≥ ⟨λ̂, f(x̂) − f(x̄)⟩ = Ψ(f(x̂)) + D.

This contradiction proves that Ψ (f (x̂)) = 0. Therefore, this point is feasible for
problem (3.1.61) and it attains the optimal value of the objective function.
In some situations, the optimization methods based on the exact penalty may
look more attractive than the two-level procedure described in Sect. 2.3.5. However,
note that for these methods it is necessary to know the point x̄ satisfying the Slater
condition (3.1.62). If this condition is not “deep” enough, the resulting penalty
function can have bad bounds on the derivatives. This slows down the minimization
schemes.
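The exact penalty construction can be illustrated on a one-dimensional instance (data chosen here, not from the book): once D is at least the optimality gap, the unconstrained minimizer of the penalized function is exactly the constrained solution.

```python
import numpy as np

# Exact penalty sketch for min (x - 2)^2 s.t. x - 1 <= 0 (data chosen
# here, not from the book).  With x_bar = 0 (so -f1(x_bar) = 1) and
# D = 4 >= f0(x_bar) - f0(x*) = 3, the penalized function (3.1.67)
#   phi(x) = (x - 2)^2 + D * max(0, x - 1)
# attains its unconstrained minimum at the constrained solution x* = 1.
D = 4.0
xs = np.linspace(-3.0, 5.0, 80001)
phi = (xs - 2.0) ** 2 + D * np.maximum(0.0, xs - 1.0)
x_hat = xs[np.argmin(phi)]
assert abs(x_hat - 1.0) < 1e-9
assert abs(phi.min() - 1.0) < 1e-9            # optimal value f0(x*) = 1
```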
The Slater condition in the form (3.1.62) cannot work for equality constraints.
Let us show how it can be modified in order to justify the Karush–Kuhn–Tucker
condition for a minimization problem of the following form:

min_{x∈Q} {f(x) : Ax = b},   (3.1.68)

where Q is a closed convex set and the matrix A ∈ Rm×n has full row rank.
Theorem 3.1.27 Let a function f be convex on Q ⊂ int(dom f) and let its level sets on Q be bounded. Suppose that there exist a point x̄ and ρ > 0 such that

Ax̄ = b,   B(x̄, ρ) ⊆ Q.   (Slater condition for equalities)   (3.1.69)


186 3 Nonsmooth Convex Optimization

A point x* ∈ Q is an optimal solution for problem (3.1.68) if and only if Ax* = b and there exist y* ∈ R^m and g* ∈ ∂f(x*) such that

⟨g* − A^T y*, x − x*⟩ ≥ 0   ∀x ∈ Q.   (3.1.70)

The magnitude of the vector y* can be estimated as follows:

‖A^T y*‖ ≤ (1/ρ) [ max_{x∈B(x̄,ρ)} f(x) − min_{x∈Q} f(x) ].   (3.1.71)

Proof Indeed, if condition (3.1.70) is satisfied, then for any x ∈ Q with Ax = b we have

f(x) − f(x*) ≥ ⟨g*, x − x*⟩ ≥ ⟨y*, A(x − x*)⟩ = 0   (by (3.1.23) and (3.1.70)).

To prove the converse statement, consider the function

φ(x) = f(x) + K‖b − Ax‖,

where the norm is standard Euclidean and K > 0 is a constant, which will be specified later. In view of our assumptions, φ attains its minimum on Q at some point x_*. Therefore, by Theorem 3.1.24, there exists a vector g*_φ ∈ ∂φ(x_*) such that

⟨g*_φ, x − x_*⟩ ≥ 0   ∀x ∈ Q.   (3.1.72)

In view of Lemma 3.1.12, Lemma 3.1.11 and representation (3.1.42), there exist g* ∈ ∂f(x_*) and ȳ ∈ R^m with ‖ȳ‖ ≤ 1 such that

g*_φ = g* − K A^T ȳ.

Moreover, in view of Lemma 3.1.15, ⟨ȳ, b − Ax_*⟩ = ‖b − Ax_*‖.

On the other hand, for any δ ∈ B(0, ρ), in view of (3.1.69) we have x_δ := x̄ + δ ∈ Q. Therefore,

⟨g*, x_δ − x_*⟩ ≥ K⟨A^T ȳ, x̄ + δ − x_*⟩ = K⟨ȳ, Aδ + b − Ax_*⟩
= K‖b − Ax_*‖ + K⟨A^T ȳ, δ⟩   (3.1.73)

(the first inequality follows from (3.1.72)).

In view of Theorem 3.1.11, M := max_x {f(x) : x ∈ B(x̄, ρ)} < +∞. Then

⟨g*, x_δ − x_*⟩ ≤ f(x_δ) − f(x_*) ≤ M − f_*   (by (3.1.23)),

where f_* = min_{x∈Q} f(x). Therefore, maximizing the right-hand side of inequality (3.1.73) in δ ∈ B(0, ρ), we get

M − f_* ≥ ρ‖K A^T ȳ‖ ≥ Kρ μ^{1/2} ‖ȳ‖,

where μ = λ_min(AA^T) > 0. Defining y* = K ȳ, from the first inequality we get the bound (3.1.71). On the other hand, choosing K > (1/(ρ μ^{1/2}))(M − f_*), from the second inequality we necessarily get ‖ȳ‖ < 1. By Lemma 3.1.15, this implies that Ax_* = b. Consequently, x_* is an optimal solution for problem (3.1.68).

As we can see now, for K big enough, any solution x* of problem (3.1.68) is a global minimum of the function φ. Repeating the above reasoning, we can justify condition (3.1.70). □
In view of its simplicity, Theorem 3.1.27 has many interesting applications.
Here we present only one of them, related to the rules for differentiating a partial
minimization of a convex function. The new statement significantly extends a
particular case of Theorem 3.1.25.
Theorem 3.1.28 Let a function f be convex and Q be a closed convex set belonging
to int (dom f ). Assume that the level sets of f are bounded on Q.
Let a matrix A ∈ Rm×n with n > m have a full row rank. Consider the function
φ(u) = min_{x∈Q} {f(x) : Ax = u}.

Then φ is convex, and for any u ∈ R^m such that {x ∈ int(Q) : Ax = u} ≠ ∅, we have

{y* : ∃x* ∈ Q with Ax* = u and g* ∈ ∂f(x*) such that ⟨g* − A^T y*, x − x*⟩ ≥ 0 ∀x ∈ Q} ⊆ ∂φ(u).   (3.1.74)

Proof Let Q(u) = {x ∈ Q : Ax = u}. Then dom φ = {u ∈ R^m : Q(u) ≠ ∅}. In view of the conditions of the theorem, for any u ∈ dom φ there exists at least one point x(u) in the set Arg min_{x∈Q(u)} f(x). Let u_1, u_2 ∈ dom φ and α ∈ [0, 1]. Then

x_α := α x(u_1) + (1 − α) x(u_2) ∈ Q(α u_1 + (1 − α) u_2).

Therefore,

φ(α u_1 + (1 − α) u_2) ≤ f(x_α) ≤ α f(x(u_1)) + (1 − α) f(x(u_2)) = α φ(u_1) + (1 − α) φ(u_2)   (by (3.1.2)).

Further, in view of Theorem 3.1.27, the set on the left-hand side of inclusion (3.1.74) is nonempty. Let the triple (x*, y*, g*) be an element of this set for some u = u_1 ∈ dom φ. Then for any other u_2 ∈ dom φ we have

φ(u_2) = f(x(u_2)) ≥ f(x*) + ⟨g*, x(u_2) − x*⟩ ≥ f(x*) + ⟨A^T y*, x(u_2) − x*⟩
= φ(u_1) + ⟨y*, u_2 − u_1⟩

(the first inequality is (3.1.23), the second follows from (3.1.74)). Therefore, y* ∈ ∂φ(u_1) by definition (3.1.23). □

Thus, the rules for differentiating the function φ at a point u are very simple. We
need to solve the corresponding minimization problem and extract from the solver
the optimal Lagrange multipliers for equality constraints. This vector is an element
of the subdifferential ∂φ(u).
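This recipe can be illustrated on a tiny quadratic instance; the problem data below (f, Q, A) are our own illustrative choices, not from the text.

```python
# Partial minimization of a convex quadratic under an equality constraint:
#   phi(u) = min { 0.5*(x1^2 + x2^2) : x1 + x2 = u },  Q = R^2, A = [1 1].
# For this instance, the KKT system g* = A^T y*, A x* = u gives
# x* = (u/2, u/2) and the Lagrange multiplier y* = u/2, which by
# Theorem 3.1.28 is a subgradient of phi at u.

def solve(u):
    # minimizer and multiplier obtained from the KKT conditions in closed form
    y = u / 2.0
    x = (y, y)
    return x, y

def phi(u):
    x, _ = solve(u)
    return 0.5 * (x[0] ** 2 + x[1] ** 2)   # here phi(u) = u^2 / 4

# Verify the subgradient inequality phi(u2) >= phi(u1) + y1*(u2 - u1)
u1 = 3.0
(_, y1) = solve(u1)
checks = [phi(u2) >= phi(u1) + y1 * (u2 - u1) - 1e-12
          for u2 in (-5.0, -1.0, 0.0, 2.0, 7.0)]
```

The multiplier returned by the solver is exactly the derivative of the optimal value: φ(u) = u²/4 and φ'(u) = u/2 = y*.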

3.1.8 Minimax Theorems

Consider a function Ψ (·, ·) defined on the direct product of two convex sets, P ⊆ Rn
and S ⊆ Rm . We assume that the functions Ψ (·, u) are closed and convex on P ⊆
dom Ψ (·, u) for all u ∈ S. Similarly, all functions Ψ (x, ·) are closed and concave
on S ⊆ dom Ψ (x, ·) for all x ∈ P . The main goal of this section is the justification
of the sufficient conditions for the equality

inf_{x∈P} sup_{u∈S} Ψ(x, u) = sup_{u∈S} inf_{x∈P} Ψ(x, u).   (3.1.75)

Note that in general, we can guarantee only that the right-hand side of this relation
does not exceed its left-hand side (see (1.3.6)).
Define f(x) = sup_{u∈S} Ψ(x, u) ≥ φ(u) = inf_{x∈P} Ψ(x, u). We will see that in many situations

min_{x∈P} f(x) = max_{u∈S} φ(u).

Let us start from a simple observation.


Lemma 3.1.22 Assume that for any u ∈ S, the level sets of the function Ψ (·, u) are
bounded on P , and the function φ attains its maximum on S at some point u∗ . Then
for any u ∈ S we have

min_{x∈P} max{Ψ(x, u), Ψ(x, u*)} = φ(u*).   (3.1.76)

Proof Let us choose an arbitrary u ∈ S. For x ∈ P, consider the function

f_u(x) = max{Ψ(x, u), Ψ(x, u*)} ≥ max{φ(u), φ(u*)} = φ(u*).   (3.1.77)

In view of Theorem 3.1.10, there exists a λ* ∈ [0, 1] such that

min_{x∈P} f_u(x) = min_{x∈P} { λ* Ψ(x, u) + (1 − λ*) Ψ(x, u*) } ≤ min_{x∈P} Ψ(x, λ* u + (1 − λ*) u*) = φ(λ* u + (1 − λ*) u*).

Hence, by (3.1.77), φ(u*) ≤ min_{x∈P} f_u(x) ≤ φ(λ* u + (1 − λ*) u*) ≤ φ(u*). □
Now we can prove the first variant of the Minimax Theorem.
Theorem 3.1.29 Let each of the functions Ψ (·, u) attain a unique minimum on P ,
and let the function φ attain its maximum on S. Then

min_{x∈P} f(x) = max_{u∈S} φ(u).   (3.1.78)

Proof Since the point x(u) = arg min_{x∈P} Ψ(x, u) is uniquely defined, the level sets of all functions Ψ(·, u), u ∈ S, are bounded (see Theorem 3.1.4(5)). Thus, by Lemma 3.1.22, relation (3.1.76) is valid for all u ∈ S.

Since φ(u*) = Ψ(x(u*), u*), the minimum in problem (3.1.76) can be achieved only at the point x(u*). But then, in view of (3.1.76), for any u ∈ S we have

Ψ(x(u*), u) ≤ Ψ(x(u*), u*) ≤ Ψ(x, u*),   x ∈ P.

Thus, f(x(u*)) ≤ φ(u*), and we get (3.1.78) by (1.3.6). □



Relaxation of the uniqueness condition for the minimizers of the functions Ψ(·, u), u ∈ S, gives us a variant of von Neumann's Theorem.³
Theorem 3.1.30 Assume that both sets P and S are bounded. Then

min_{x∈P} f(x) = max_{u∈S} φ(u).   (3.1.79)

³ As compared with the standard version of this theorem, we replace the continuity assumptions by assumptions on the closedness of the epigraphs.



Proof Let us fix some ε > 0. For the standard Euclidean norm ‖·‖, consider the function

Ψ_ε(x, u) = Ψ(x, u) + (ε/2)‖x‖²,   x ∈ P, u ∈ S.

Since for each u ∈ S the function Ψ_ε(·, u) is strongly convex, it attains a unique minimum on P. Therefore the function φ_ε(u) = min_{x∈P} Ψ_ε(x, u) is well defined and, in view of Theorem 3.1.8, it is concave and closed on S. Therefore, by Theorem 3.1.29, there exist points u*_ε ∈ S and x*_ε = arg min_{x∈P} Ψ_ε(x, u*_ε) such that

Ψ_ε(x*_ε, u) ≤ Ψ_ε(x*_ε, u*_ε) ≤ Ψ_ε(x, u*_ε),   x ∈ P, u ∈ S.

The first inequality here is Ψ(x*_ε, u) ≤ Ψ(x*_ε, u*_ε) for all u ∈ S. Thus,

f(x*_ε) = sup_{u∈S} Ψ(x*_ε, u) ≤ Ψ(x*_ε, u*_ε).

On the other hand, for all x ∈ P we have

Ψ(x*_ε, u*_ε) ≤ Ψ_ε(x*_ε, u*_ε) ≤ Ψ_ε(x, u*_ε) ≤ Ψ(x, u*_ε) + (ε/2)D²,

where D ≥ sup_{x∈P} ‖x‖. Hence,

f(x*_ε) ≤ φ(u*_ε) + (ε/2)D²,   ε > 0.

In view of the boundedness of the sets P and S, letting ε → 0 in this inequality, we get relation (3.1.79) (see Item 4 of Theorem 3.1.4). □
Finally, let us show that sometimes it is possible to derive the no-gap property (3.1.78) from the local optimality conditions.
Theorem 3.1.31 Let a function f attain its minimum on P at the point x*. Suppose that for some g* ∈ ∂_P f(x*), yielding the first-order optimality condition (3.1.56)

⟨g*, x − x*⟩ ≥ 0,   x ∈ P,

there exists a representation

g* = Σ_{i=1}^k λ^(i) g_i,   (3.1.80)

for certain k ≥ 1, λ ∈ Δ_k, and some g_i belonging to the sets ∂_{P,x} Ψ(x*, u_i), where u_i ∈ I(x*), i = 1, . . . , k, and I(x*) = {u ∈ S : Ψ(x*, u) = f(x*)}. Then relation (3.1.78) is satisfied.

Proof Indeed, let ū = Σ_{i=1}^k λ^(i) u_i. Then, for any x ∈ P, we have

f(x*) ≤ f(x*) + ⟨g*, x − x*⟩ = f(x*) + Σ_{i=1}^k λ^(i) ⟨g_i, x − x*⟩   (by (3.1.56) and (3.1.80))

≤ f(x*) + Σ_{i=1}^k λ^(i) [Ψ(x, u_i) − Ψ(x*, u_i)] = Σ_{i=1}^k λ^(i) Ψ(x, u_i)   (by (3.1.23))

≤ Ψ(x, ū).

Thus, f(x*) ≤ φ(ū), and by (1.3.6) we see that φ(ū) = max_{u∈S} φ(u). □

Note that the right-hand side of representation (3.1.80) belongs to ∂_P f(x*) (see Lemma 3.1.14). Therefore, a sufficient condition for the existence of this representation is

∂_P f(x*) = Conv{∂_{P,x} Ψ(x*, u) : u ∈ I(x*)}.   (3.1.81)

3.1.9 Basic Elements of Primal-Dual Methods

Very often, the possibility of applying primal-dual optimization methods comes out
from direct access to the internal structure of the objective function. Consider the
problem

f* = min_{x∈P} f(x),   (3.1.82)

where the function f is closed and convex on P . Suppose that the objective function
f has a max-representation:

f(x) = max_{u∈S} Ψ(x, u),   (3.1.83)

where the function Ψ satisfies all our assumptions made in the beginning of
Sect. 3.1.8. From this representation, we derive the dual problem4

φ* := max_{u∈S} φ(u),   φ(u) := min_{x∈P} Ψ(x, u).   (3.1.84)

4 In Chap. 6 we call it the adjoint problem due to the fact that very often representation (3.1.83) is

not unique.

From the mathematical point of view, the pair of primal-dual problems (3.1.82)
and (3.1.84) looks completely symmetric. However, this is not true for numerical
methods. Indeed, our initial intention was to solve problem (3.1.82). Hence, it is
implicitly assumed that the maximization problem in definition (3.1.83) is relatively
easy. It should be possible to solve it either in a closed form, or by a simple
numerical procedure (which defines the complexity of the oracle). At the same time,
the complexity of computing the value of the objective function in problem (3.1.84)
can be very high. It can easily reach the complexity of our initial problem (3.1.82).
Therefore, it seems that the dual problem has a good chance of being much more
difficult than the initial primal problem (3.1.82).
Fortunately, this is not the case, provided that we have access to the internal structure of the oracle (3.1.83). Indeed, in order to compute the value f(x), the oracle
needs to compute a point

u(x) ∈ Arg max_{u∈S} Ψ(x, u).

Let us assume that this point is used to compute the subgradient g(x) (or, when f is
smooth, the gradient) of the objective function (see Lemma 3.1.14):

g(x) ∈ ∂P ,x Ψ (x, u(x)).

Thus, we assume that the oracle returns three objects: f (x), g(x), and u(x) ∈ S.
Let us show how this information can be used in numerical methods.
In Smooth Optimization, we often use a functional model of the objective function. Assume that some method has accumulated information from the oracle at points {y_k}_{k=0}^N ⊂ P. Then, for some scaling coefficients

α_k > 0, k = 0, . . . , N,   Σ_{k=0}^N α_k = 1,

we can construct a linear model of the objective function:

ℓ_N(x) = Σ_{k=0}^N α_k [f(y_k) + ⟨g(y_k), x − y_k⟩] ≤ f(x),   x ∈ P   (by (3.1.23)).

In some methods (see, for example, (2.2.3), (2.2.4)), for points of minimizing
sequence {xk }k≥0 , it is possible to ensure the following relation:

f(x_N) ≤ min_{x∈P} ℓ_N(x) + r_N,   (3.1.85)

where rN → 0 as N → ∞. In fact, this relation can be used not only for justifying
the quality of point xN , but also for estimating the primal-dual gap with respect to
the dual solution

û_N = Σ_{k=0}^N α_k u(y_k) ∈ S.   (3.1.86)

Lemma 3.1.23 Let the point xN satisfy (3.1.85). Then

0 ≤ (f (xN ) − f ∗ ) + (φ ∗ − φ(ûN )) ≤ f (xN ) − φ(ûN ) ≤ rN .

Proof Indeed, g(y_k) ∈ ∂_{P,x} Ψ(y_k, u(y_k)). Therefore,

min_{x∈P} ℓ_N(x) = min_{x∈P} Σ_{k=0}^N α_k [Ψ(y_k, u(y_k)) + ⟨g(y_k), x − y_k⟩]

≤ min_{x∈P} Σ_{k=0}^N α_k Ψ(x, u(y_k)) ≤ min_{x∈P} Ψ(x, û_N) = φ(û_N)   (by (3.1.23)).

It remains to use inequality (3.1.85). □



Since we have ensured rN → 0, for our problem we have managed to prove
the no-gap property algorithmically. Note that our way of generating the good dual
solution (3.1.86) does not require a single computation of the dual function.
In Nonsmooth Optimization, we use another certificate of optimality based on
the gap function. It is defined by a sequence of test points {y_k}_{k=0}^N and scaling coefficients as follows:

δ_N(x) = Σ_{k=0}^N α_k ⟨g(y_k), y_k − x⟩.

Define fˆ_N = Σ_{k=0}^N α_k f(y_k).
Lemma 3.1.24 Assume that max_{x∈P} δ_N(x) ≤ r_N → 0. Then

0 ≤ (fˆ_N − f*) + (φ* − φ(û_N)) ≤ fˆ_N − φ(û_N) ≤ r_N → 0.



Proof Indeed,

max_{x∈P} δ_N(x) = max_{x∈P} Σ_{k=0}^N α_k ⟨g(y_k), y_k − x⟩

≥ max_{x∈P} Σ_{k=0}^N α_k [Ψ(y_k, u(y_k)) − Ψ(x, u(y_k))]   (by (3.1.23))

≥ fˆ_N − min_{x∈P} Ψ(x, û_N) = fˆ_N − φ(û_N)   (by (3.1.4)). □

Again, for nonsmooth problems, computation of the good dual solution ûN does
not require significant computational resources.
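These certificates can be sketched on a minimal instance: f(x) = |x| = max{x, −x} on P = [−1, 1], with S the standard simplex in R² carrying the max-representation. All run parameters below (step size, horizon, starting point) are arbitrary illustrative choices.

```python
# Dual certificate (Lemma 3.1.24) for f(x) = max(x, -x) = |x| on P = [-1, 1],
# using the max-representation Psi(x, u) = (u1 - u2) * x over the simplex S.

def oracle(x):
    # u(x) picks a maximizer in (3.1.83); g(x) is the matching subgradient
    if x >= 0:
        return x, 1.0, (1.0, 0.0)      # f(x), g(x), u(x)
    return -x, -1.0, (0.0, 1.0)

N, h, x = 200, 0.05, 0.9               # illustrative horizon, step, start
ys, gs, us, fs = [], [], [], []
for k in range(N):
    f_val, g_val, u_val = oracle(x)
    ys.append(x); gs.append(g_val); us.append(u_val); fs.append(f_val)
    x = min(1.0, max(-1.0, x - h * g_val))   # projected subgradient step

alpha = 1.0 / N                        # uniform scaling coefficients
f_hat = alpha * sum(fs)                # \hat f_N
u_hat = (alpha * sum(u[0] for u in us), alpha * sum(u[1] for u in us))
g_bar = alpha * sum(gs)
# phi(u) = min_{x in [-1,1]} (u1 - u2) x = -|u1 - u2|
phi_u_hat = -abs(u_hat[0] - u_hat[1])
gap = f_hat - phi_u_hat                # primal-dual gap at (trajectory, u_hat)
# max_{x in [-1,1]} delta_N(x) = sum_k alpha*g_k*y_k + |g_bar|
max_delta = alpha * sum(g * y for g, y in zip(gs, ys)) + abs(g_bar)
```

The dual point û_N is obtained purely by averaging the oracle's answers u(y_k), and the gap fˆ_N − φ(û_N) is bounded by the computable quantity max δ_N, exactly as the lemma states.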

3.2 Methods of Nonsmooth Minimization

(General lower complexity bounds; Main lemma; Localization sets; The subgradient
method; Minimization with functional constraints; Approximation of optimal Lagrange
multipliers; Strongly convex functions; Optimization in finite dimensions and lower
complexity bounds; Cutting plane schemes; The center of gravity method; The ellipsoid
method and others.)

3.2.1 General Lower Complexity Bounds

In Sect. 3.1, we introduced a class of general convex functions. These functions can
be nonsmooth and therefore the corresponding minimization problem can be quite
difficult. As for smooth problems, let us try to derive lower complexity bounds,
which will help us to evaluate the performance of numerical methods.
In this section, we derive such bounds for the following unconstrained minimiza-
tion problem

min_{x∈R^n} f(x),   (3.2.1)

where f is a convex function. Denote by x* ∈ R^n one of its optimal solutions. Thus, our problem class is as follows.

Model: 1. Unconstrained minimization.
  2. f is convex on R^n and Lipschitz continuous on a bounded set.

Oracle: First-order Black Box: at each point x̂, we can compute f(x̂) and g(x̂) ∈ ∂f(x̂), where g(x̂) is an arbitrary subgradient.   (3.2.2)

Approximate solution: Find x̄ ∈ R^n : f(x̄) − f* ≤ ε.

Methods: Generate a sequence {x_k} : x_k ∈ x_0 + Lin{g(x_0), . . . , g(x_{k−1})}.

As in Sect. 2.1.2, to derive lower complexity bounds for our problem class, we
will study the behavior of numerical methods on some function, which appears to
be very difficult for all schemes.
Let us fix some parameters μ > 0 and γ > 0. Consider the family of functions

f_k(x) = γ max_{1≤i≤k} x^(i) + (μ/2)‖x‖²,   k = 1, . . . , n,   (3.2.3)

where the norm is standard Euclidean. Using the rules of subdifferential calculus described in Sect. 3.1.6, we can write down a closed-form expression for the subdifferential of f_k at x:

∂f_k(x) = μx + γ Conv{e_i | i ∈ I(x)},   I(x) = {j | 1 ≤ j ≤ k, x^(j) = max_{1≤i≤k} x^(i)}.

Let x_k* be the global minimum of the function f_k. Then, for any x, y ∈ B_2(x_k*, ρ), ρ > 0, and g_k(y) ∈ ∂f_k(y), we have

f_k(y) − f_k(x) ≤ ⟨g_k(y), y − x⟩ ≤ ‖g_k(y)‖ · ‖y − x‖ ≤ (μ‖x_k*‖ + μρ + γ)‖y − x‖.   (3.2.4)

Thus, f_k is Lipschitz continuous on B_2(x_k*, ρ) with Lipschitz constant

M = μ‖x_k*‖ + μρ + γ.

Further, by Theorem 3.1.20, it is easy to check that the optimal point x_k* has the following coordinates:

(x_k*)^(i) = −γ/(μk) for 1 ≤ i ≤ k,   (x_k*)^(i) = 0 for k + 1 ≤ i ≤ n.

Now we have all the important characteristics of our problem:

R_k := ‖x_k*‖ = γ/(μ√k),   f_k* = −γ²/(μk) + (μ/2)R_k² = −γ²/(2μk),

M = μ‖x_k*‖ + μρ + γ = μρ + γ(√k + 1)/√k.   (3.2.5)

Let us now describe a resisting oracle for the function f_k(·). Since the analytical form of this function is fixed, the resistance of this oracle consists in providing us with the worst possible subgradient at each test point. The algorithmic scheme of this oracle is as follows:

Input: x ∈ R^n.
Main Loop: f := −∞; i* := 0;
  for j := 1 to k do
    if x^(j) > f then { f := x^(j); i* := j };
  f := γ f + (μ/2)‖x‖²;  g := γ e_{i*} + μ x;   (3.2.6)
Output: f_k(x) := f, g_k(x) := g ∈ R^n.

At first glance, there is nothing special in this procedure. Its main loop is just a
standard process for finding the maximal coordinate of a vector from Rk . However,
the main feature of this loop is that we always form the subgradient of the
nonsmooth part of the objective proportional to a coordinate vector. Moreover, the

active coordinate i ∗ always corresponds to the first maximal component of vector x.


Let us see what happens with a minimizing sequence based on such an oracle.
Let us choose the starting point x_0 = 0. Define

R^{p,n} = {x ∈ R^n | x^(i) = 0, p + 1 ≤ i ≤ n}.

Since x_0 = 0, the answer of the oracle is f_k(x_0) = 0 and g_k(x_0) = γ e_1. Therefore, the next point of the sequence, x_1, necessarily belongs to R^{1,n}. Assume now that the current test point of the sequence, x_i, belongs to R^{p,n}, 1 ≤ p ≤ k. Then the oracle returns a subgradient

g = μ x_i + γ e_{i*},

where i* ≤ p + 1. Therefore, the next test point x_{i+1} belongs to R^{p+1,n}.

This simple reasoning proves that for all i, 1 ≤ i ≤ k, we have x_i ∈ R^{i,n}. Consequently, for i, 1 ≤ i ≤ k − 1, we cannot improve the starting value of the objective function:

f_k(x_i) ≥ γ max_{1≤j≤k} x_i^(j) ≥ 0.
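The oracle and this argument can be played out in code. The sketch below is illustrative: the parameter values, the dimensions, and the fixed gradient-type step are arbitrary choices (any step inside the linear span of the returned subgradients would do).

```python
# A sketch of the resisting oracle (3.2.6) for
# f_k(x) = gamma * max_{i<=k} x^(i) + (mu/2) * ||x||^2.

def resisting_oracle(x, k, gamma=1.0, mu=1.0):
    # returns f_k(x) and the "worst" subgradient, built from the FIRST
    # maximal coordinate among x[0..k-1]
    f, i_star = -float("inf"), 0
    for j in range(k):
        if x[j] > f:
            f, i_star = x[j], j
    f = gamma * f + 0.5 * mu * sum(t * t for t in x)
    g = [mu * t for t in x]
    g[i_star] += gamma
    return f, g

# Start at x0 = 0 and take gradient-type steps: after i calls only the
# first i coordinates can be nonzero, so f_k(x_i) >= 0 for i < k.
n, k = 10, 8
x = [0.0] * n
values = []
for i in range(k - 1):
    f, g = resisting_oracle(x, k)
    values.append(f)
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]   # one admissible step
nonzero_prefix = max((j + 1 for j, t in enumerate(x) if t != 0.0), default=0)
```

Every recorded value stays at or above the starting value f_k(x_0) = 0, and the iterates never leave the subspace spanned by the first k − 1 coordinate vectors, exactly as the reasoning above predicts.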

Let us convert this observation into a lower complexity bound. Let us fix some parameters of our problem class P(x_0, R, M), that is, R > 0 and M > 0. In addition to (3.2.2), we assume the following.

• The point x_0 is close enough to the solution of problem (3.2.1):

‖x_0 − x*‖ ≤ R.   (3.2.7)

• The function f is Lipschitz continuous on B_2(x*, R) with constant M > 0.

Theorem 3.2.1 For any class P(x_0, R, M) and any k, 0 ≤ k ≤ n − 1, there exists a function f ∈ P(x_0, R, M) such that

f(x_k) − f* ≥ MR/(2(2 + √(k+1)))

for any optimization scheme which generates a sequence {x_k} satisfying the condition

x_k ∈ x_0 + Lin{g(x_0), . . . , g(x_{k−1})}.



Proof Without loss of generality, we can assume that x_0 = 0. Let us choose f(x) = f_{k+1}(x) with the following values of parameters:

γ = √(k+1) M/(2 + √(k+1)),   μ = M/((2 + √(k+1)) R).

Then, in view of (3.2.5),

f* = f_{k+1}* = −γ²/(2μ(k+1)) = −MR/(2(2 + √(k+1))),

‖x_0 − x*‖ = R_{k+1} = γ/(μ√(k+1)) = R.

Moreover, f is Lipschitz continuous on B_2(x*, R) with constant μR + γ(√(k+1) + 1)/√(k+1) = M. Note that x_k ∈ R^{k,n}. Hence, f(x_k) − f* ≥ −f*. □

The lower complexity bound presented in Theorem 3.2.1 does not depend on the
dimension of the space of variables. As for the lower bound of Theorem 2.1.7, it
can be applied to problems with very large dimension, or to the efficiency analysis
of starting iterations of a minimization scheme (k ≤ n − 1).
We will see that our lower estimate is exact: There exist minimization methods
which have a rate of convergence proportional to this lower bound. Comparing this
bound with the lower bound for smooth minimization problems, we can see that
now the possible convergence rate is much slower. However, we should remember
that we are working with one of the most general classes of convex problems.

3.2.2 Estimating Quality of Approximate Solutions

We are now interested in the following optimization problem:

min_{x∈Q} f(x),   (3.2.8)

where Q is a closed convex set, and the function f is convex on Rn . We are going
to study numerical methods for solving (3.2.8), which employ subgradients g(x) of
the objective function, computed at x ∈ Q. As compared with the smooth problem,
our goal is more challenging. Indeed, even in the simplest situation, when Q ≡
Rn , the subgradient seems to be a poor replacement for the gradient of a smooth
function. For example, we cannot be sure that the value of the objective function is
decreasing in the direction −g(x). We cannot expect that g(x) → 0 as x approaches
the solution of our problem, etc.

Fortunately, there is one property of subgradients which makes our goals


reachable. We have justified this property in Corollary 3.1.6:

At any x ∈ Q, the following inequality holds:

⟨g(x), x − x*⟩ ≥ 0.   (3.2.9)

This simple inequality leads to two important consequences, which form the basis
for the majority of nonsmooth minimization methods. Namely:
• The distance between x and x ∗ decreases along the direction −g(x).
• Inequality (3.2.9) cuts Rn in two half-spaces, and it is known which of them
contains the optimal point x ∗ .
Nonsmooth minimization methods cannot employ the idea of relaxation or
approximation. There is another concept underlying all these schemes. This is
the concept of localization. However, to go forward with this concept, we have
to develop a special technique which allows us to estimate the quality of an
approximate solution to problem (3.2.8). This is the main goal of this section.
Let us fix some x̄ ∈ R^n. For x ∈ R^n with g(x) ≠ 0, define

v_f(x̄, x) = (1/‖g(x)‖) ⟨g(x), x − x̄⟩.   (3.2.10)

If g(x) = 0, then define v_f(x̄, x) = 0. Clearly, by the Cauchy–Schwarz inequality,

v_f(x̄, x) ≤ ‖x − x̄‖.

The values v_f(x̄, x) have a natural geometric interpretation. Consider a point x such that g(x) ≠ 0 and ⟨g(x), x − x̄⟩ ≥ 0. Let us look at the point

ȳ = x̄ + v_f(x̄, x) g(x)/‖g(x)‖.

Then

⟨g(x), x − ȳ⟩ = ⟨g(x), x − x̄⟩ − v_f(x̄, x)‖g(x)‖ = 0   (by (3.2.10)),

and ‖ȳ − x̄‖ = v_f(x̄, x). Thus, v_f(x̄, x) is the distance from the point x̄ to the hyperplane {y : ⟨g(x), x − y⟩ = 0}.
Let us introduce a function which measures the growth of the function f around the point x̄. For t ≥ 0, define

ω_f(x̄; t) = max_x {f(x) − f(x̄) : ‖x − x̄‖ ≤ t}.

If t < 0, we set ω_f(x̄; t) = 0.



Clearly, the function ω_f possesses the following properties:

• ω_f(x̄; t) = 0 for all t ≤ 0;
• ω_f(x̄; t) is a nondecreasing function of t ∈ R;
• f(x) − f(x̄) ≤ ω_f(x̄; ‖x − x̄‖).
It is important that under a convexity assumption the last inequality can be
significantly strengthened.
Lemma 3.2.1 For any x ∈ R^n, we have

f(x) − f(x̄) ≤ ω_f(x̄; v_f(x̄; x)).   (3.2.11)

If f(·) is Lipschitz continuous on B_2(x̄, R) with constant M, then

f(x) − f(x̄) ≤ M (v_f(x̄; x))_+   (3.2.12)

for all x ∈ R^n with v_f(x̄; x) ≤ R.


Proof If ⟨g(x), x − x̄⟩ < 0, then f(x̄) ≥ f(x) + ⟨g(x), x̄ − x⟩ ≥ f(x). Since v_f(x̄; x) is negative, we have ω_f(x̄; v_f(x̄; x)) = 0 and (3.2.11) holds.

Let ⟨g(x), x − x̄⟩ ≥ 0. For

ȳ = x̄ + v_f(x̄; x) g(x)/‖g(x)‖,

we have ⟨g(x), ȳ − x⟩ = 0 and ‖ȳ − x̄‖ = v_f(x̄; x). Therefore,

f(ȳ) ≥ f(x) + ⟨g(x), ȳ − x⟩ = f(x),

and

f(x) − f(x̄) ≤ f(ȳ) − f(x̄) ≤ ω_f(x̄; ‖ȳ − x̄‖) = ω_f(x̄; v_f(x̄; x)).

If f is Lipschitz continuous on B_2(x̄, R) and 0 ≤ v_f(x̄; x) ≤ R, then ȳ ∈ B_2(x̄, R). Hence,

f(x) − f(x̄) ≤ f(ȳ) − f(x̄) ≤ M‖ȳ − x̄‖ = M v_f(x̄; x). □


Let us fix some optimal solution x* of problem (3.2.8). The values v_f(x*; x) allow us to estimate the quality of so-called localization sets.

Definition 3.2.1 Let {x_i}_{i=0}^∞ be a sequence in Q. Define

S_k = {x ∈ Q | ⟨g(x_i), x_i − x⟩ ≥ 0, i = 0, . . . , k}.

We call S_k the localization set of problem (3.2.8) generated by the sequence {x_i}_{i=0}^∞.

In view of inequality (3.2.9), for all k ≥ 0 we have x* ∈ S_k. Let

v_i = v_f(x*; x_i) (≥ 0),   v_k* = min_{0≤i≤k} v_i.

Thus,

v_k* = max{r : ⟨g(x_i), x_i − x⟩ ≥ 0, i = 0, . . . , k, ∀x ∈ B_2(x*, r)}.

This is the radius of the maximal ball centered at x* which is contained in the localization set S_k.

Lemma 3.2.2 Let f_k* = min_{0≤i≤k} f(x_i). Then

f_k* − f* ≤ ω_f(x*; v_k*).

Proof Using Lemma 3.2.1, we have

ω_f(x*; v_k*) = min_{0≤i≤k} ω_f(x*; v_i) ≥ min_{0≤i≤k} [f(x_i) − f*] = f_k* − f*. □

3.2.3 The Subgradient Method

Now we are ready to analyze the behavior of some minimization methods. Consider
the problem

min f (x), (3.2.13)


x∈Q

where the function f is convex on Rn , and Q is a simple closed convex set. The term
“simple” means that we can solve explicitly some simple minimization problems
over Q. In this section, we need to find in a reasonably cheap way the Euclidean
projection of any point onto the set Q.
We assume that problem (3.2.13) is equipped with a first-order oracle, which at
any test point x̄ provides us with the value of the objective function f (x̄) and one
of its subgradients g(x̄).
As usual, we first try a version of the Gradient Method. Note that for nonsmooth problems the norm of the subgradient, ‖g(x)‖, is not very informative. Therefore, in the subgradient scheme we use the normalized direction g(x̄)/‖g(x̄)‖.

Subgradient Method for Simple Sets

0. Choose x_0 ∈ Q and a sequence {h_k}_{k=0}^∞ such that

h_k > 0,   h_k → 0,   Σ_{k=0}^∞ h_k = ∞.   (3.2.14)

1. kth iteration (k ≥ 0). Compute f(x_k), g(x_k), and set

x_{k+1} = π_Q ( x_k − h_k g(x_k)/‖g(x_k)‖ ).
Let us estimate the rate of convergence of this scheme.


Theorem 3.2.2 Let a function f be Lipschitz continuous on B_2(x*, R) with constant M, where R ≥ ‖x_0 − x*‖. Then

f_k* − f* ≤ M (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).   (3.2.15)

Proof Let r_i = ‖x_i − x*‖. Then, in view of Lemma 2.2.8, we have

r_{i+1}² = ‖π_Q(x_i − h_i g(x_i)/‖g(x_i)‖) − x*‖² ≤ ‖x_i − h_i g(x_i)/‖g(x_i)‖ − x*‖² = r_i² − 2 h_i v_i + h_i².

Summing up these inequalities for i = 0, . . . , k, we get

r_0² + Σ_{i=0}^k h_i² ≥ 2 Σ_{i=0}^k h_i v_i + r_{k+1}² ≥ 2 v_k* Σ_{i=0}^k h_i.

Thus,

v_k* ≤ (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).

Since v_k* ≤ v_0 ≤ ‖x_0 − x*‖ ≤ R, we can use Lemma 3.2.2. □


Thus, by Theorem 3.2.2, the rate of convergence of the Subgradient Method (3.2.14) depends on the values

Δ_k = (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).

We can easily see that Δ_k → 0 if the series Σ_{i=0}^∞ h_i diverges. However, let us try to choose h_k in an optimal way.

Let us assume that we have to perform a fixed number of steps, say N ≥ 1, of the Subgradient Method. Then, minimizing Δ_N as a function of {h_k}_{k=0}^N, we can see that the optimal strategy is as follows⁵:

h_i = R/√(N+1),   i = 0, . . . , N.   (3.2.16)

In this case, Δ_N = R/√(N+1), and we obtain the following rate of convergence:

f_N* − f* ≤ MR/√(N+1).   (3.2.17)

Another possibility for defining the step sizes in the Subgradient Method (3.2.14) consists in using the desired final accuracy ε > 0 as a parameter of the algorithm. Indeed, let us find N from equation (3.2.17):

MR/√(N+1) = ε   ⇒   N + 1 = M²R²/ε².   (3.2.18)

Then, in accordance with (3.2.16), we have

h_i = ε/M,   i ≥ 0.   (3.2.19)

In view of the upper bound (3.2.15), in this case we have

f_N* − f* ≤ M²R²/(2(N+1)ε) + ε/2.   (3.2.20)

Thus, we get an ε-solution of problem (3.2.1) as soon as

N ≥ M²R²/ε².   (3.2.21)

⁵ From Example 3.1.2(5), we can see that Δ_N is a symmetric convex function of {h_i}. Therefore, its minimum is achieved at a point having the same value for all variables.

The main advantage of the step size rule (3.2.19) consists in its independence of the parameters R and N, which are usually not known in advance. The parameter M is an upper bound on the norms of the subgradients of the objective function, which are easily observable during the minimization process.
Comparing inequality (3.2.17) with the lower bound of Theorem 3.2.1, we come
to the following conclusion.

Subgradient Method (3.2.14), (3.2.16) is optimal for problem (3.2.13) uniformly in the number of variables n.

If we are not going to fix the number of iterations in advance, we can choose

h_i = r/√(i+1),   i = 0, 1, . . . .

Then it is easy to see that Δ_k is proportional to

(R² + r² ln(k+1)) / (4r√(k+1)),

and we can classify this rate of convergence as sub-optimal.


Thus, the simplest method for solving problem (3.2.8) appears to be optimal.
Usually, this indicates that the problems from our class cannot be solved very
efficiently. However, we should remember that our conclusion is valid uniformly
in the dimension of the problem. We will see later that a moderate dimension of
the problem, taken into account in a proper way, helps in developing much faster
schemes.
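The scheme (3.2.14) with the optimal steps (3.2.16) can be sketched on a small instance. Everything below is an illustrative choice: the objective f(x) = |x¹| + |x²| over the box Q = [−1, 1]², the starting point, R and N.

```python
# Projected subgradient method (3.2.14) with the optimal steps (3.2.16)
# on f(x) = |x1| + |x2| over Q = [-1, 1]^2 (so x* = 0 and f* = 0).
import math

def f(x):
    return abs(x[0]) + abs(x[1])

def subgrad(x):
    sgn = lambda t: (t > 0) - (t < 0)
    return [float(sgn(x[0])), float(sgn(x[1]))]   # one valid subgradient

def project(x):                       # Euclidean projection onto the box
    return [min(1.0, max(-1.0, t)) for t in x]

R, N = 1.0, 399                       # R >= ||x0 - x*||, fixed step budget
h = R / math.sqrt(N + 1)              # optimal step size (3.2.16)
x = [0.8, -0.6]                       # ||x0 - x*|| = 1.0
best = f(x)
for k in range(N + 1):
    g = subgrad(x)
    norm_g = math.sqrt(g[0] ** 2 + g[1] ** 2)
    if norm_g == 0.0:                 # 0 in the subdifferential: x is optimal
        break
    x = project([x[0] - h * g[0] / norm_g, x[1] - h * g[1] / norm_g])
    best = min(best, f(x))

M = math.sqrt(2.0)                    # Lipschitz constant of f in the l2-norm
bound = M * R / math.sqrt(N + 1)      # guaranteed rate (3.2.17)
```

The recorded record value f_N* stays within the guarantee MR/√(N+1) of (3.2.17), even though individual iterations of the method are not monotone in f.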

3.2.4 Minimization with Functional Constraints

Let us show how we can use the Subgradient Method to solve minimization
problems with functional constraints. Consider the problem

min_{x∈Q} {f(x) : f_j(x) ≤ 0, j = 1, . . . , m},   (3.2.22)

with closed and convex functions f and f_j, and a simple closed convex set Q. Let us form an aggregate constraint f̄(x) = max_{1≤j≤m} f_j(x). Then our problem can be written in the following way:

min_{x∈Q} {f(x) : f̄(x) ≤ 0}.   (3.2.23)

Note that we can easily compute a subgradient ḡ(x) of the function f̄, provided that we can do so for the functions f_j (see Lemma 3.1.13).

Let us fix some x*, an optimal solution to problem (3.2.22). Let ε > 0 be the desired accuracy of the approximate solution of problem (3.2.22). Consider the following method.

Subgradient Method with Functional Constraints

0. Choose a starting point x_0 ∈ Q.

1. kth iteration (k ≥ 0).

(a) Compute f(x_k) with g(x_k) ∈ ∂f(x_k), and f̄(x_k) with ḡ(x_k) ∈ ∂f̄(x_k).

(b) If f̄(x_k) ≤ ε, then set x_{k+1} = π_Q ( x_k − (ε/‖g(x_k)‖²) g(x_k) ).   (Case A)

Else, set x_{k+1} = π_Q ( x_k − (f̄(x_k)/‖ḡ(x_k)‖²) ḡ(x_k) ).   (Case B)   (3.2.24)

For method (3.2.24), denote by I_A(N) the set of iterations of type A, and by I_B(N) the set of iterations of type B, which occurred during the first N steps of this scheme. Clearly,

f̄(x_k) ≤ ε,   ∀k ∈ I_A(N).   (3.2.25)

Theorem 3.2.3 Let the functions f and f_j, j = 1, . . . , m, be Lipschitz continuous on the ball B_2(x*, ‖x_0 − x*‖) with constant M. If the number of steps N in method (3.2.24) is big enough,

N ≥ (M²/ε²) ‖x_0 − x*‖²,   (3.2.26)

then I_A(N) ≠ ∅ and

f_N* := min_k {f(x_k) : k ∈ I_A(N)} ≤ f(x*) + ε.   (3.2.27)

Proof Define r_k = ‖x_k − x*‖. Let us assume that N satisfies (3.2.26), but

f(x_k) − f* ≥ ε,   ∀k ∈ I_A(N).   (3.2.28)

If k ∈ I_A(N), then, in view of (2.2.49),

r_{k+1}² ≤ ‖x_k − (ε/‖g(x_k)‖²) g(x_k) − x*‖² = r_k² − (2ε/‖g(x_k)‖²)⟨g(x_k), x_k − x*⟩ + ε²/‖g(x_k)‖²

≤ r_k² − (2ε/‖g(x_k)‖²)(f(x_k) − f*) + ε²/‖g(x_k)‖²   (by (3.1.23))

≤ r_k² − ε²/‖g(x_k)‖²   (by (3.2.28)).

In Case B, we have

r_{k+1}² ≤ ‖x_k − (f̄(x_k)/‖ḡ(x_k)‖²) ḡ(x_k) − x*‖² = r_k² − (2f̄(x_k)/‖ḡ(x_k)‖²)⟨ḡ(x_k), x_k − x*⟩ + f̄(x_k)²/‖ḡ(x_k)‖²

≤ r_k² − f̄(x_k)²/‖ḡ(x_k)‖²   (by (3.1.23))

≤ r_k² − ε²/‖ḡ(x_k)‖²,

since in this case f̄(x_k) > ε. Thus, in both cases, r_{k+1} < r_k ≤ ‖x_0 − x*‖. Hence,

‖g(x_k)‖ ≤ M, k ∈ I_A(N),   ‖ḡ(x_k)‖ ≤ M, k ∈ I_B(N).

Therefore, r_{k+1}² ≤ r_k² − ε²/M² for any k = 0, . . . , N. Summing up these inequalities, we get

0 ≤ r_{N+1}² ≤ r_0² − (ε²/M²)(N + 1),

which contradicts our assumption (3.2.26). □



Comparing the bound (3.2.26) with the result of Theorem 3.2.1, we see that the scheme (3.2.24) has an optimal worst-case performance guarantee. Recall that the same lower complexity bound was obtained for an unconstrained minimization problem. Thus, we can see that, from the viewpoint of analytical complexity, Convex Unconstrained Minimization is not easier than Constrained Minimization.
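The switching scheme (3.2.24) can be run on a small two-dimensional instance. The problem data, the accuracy ε, and the iteration budget below are illustrative choices; the budget is taken larger than the bound (3.2.26) for this instance.

```python
# Switching subgradient scheme (3.2.24) on an illustrative problem:
#   min x1 + x2   s.t.  f1(x) = x1^2 + x2^2 - 1 <= 0,  Q = [-2, 2]^2,
# whose solution is x* = (-1/sqrt(2), -1/sqrt(2)), f* = -sqrt(2).
import math

eps = 0.1
def f(x):     return x[0] + x[1]
def fbar(x):  return x[0] ** 2 + x[1] ** 2 - 1.0   # single constraint
def g(x):     return (1.0, 1.0)                    # gradient of the objective
def gbar(x):  return (2.0 * x[0], 2.0 * x[1])      # gradient of the constraint

def project(x):                                    # projection onto the box Q
    return tuple(min(2.0, max(-2.0, t)) for t in x)

x = (1.0, 1.0)
best_A = float("inf")                              # record over Case A iterates
for k in range(30000):                             # budget exceeding (3.2.26)
    if fbar(x) <= eps:                             # Case A: objective step
        best_A = min(best_A, f(x))
        gx = g(x); s = eps / (gx[0] ** 2 + gx[1] ** 2)
        x = project((x[0] - s * gx[0], x[1] - s * gx[1]))
    else:                                          # Case B: constraint step
        gb = gbar(x); s = fbar(x) / (gb[0] ** 2 + gb[1] ** 2)
        x = project((x[0] - s * gb[0], x[1] - s * gb[1]))

f_star = -math.sqrt(2.0)
```

The record value over Case-A (nearly feasible) iterates lands below f(x*) + ε, as Theorem 3.2.3 guarantees, while all such iterates violate the constraint by at most ε.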

3.2.5 Approximating the Optimal Lagrange Multipliers

Let us show now that a simple subgradient switching strategy can be used for
approximating the optimal Lagrange multipliers of problem (3.2.22) (see Theo-
rem 3.1.26).
For ε > 0, denote by

F(ε) = {x ∈ Q : f_j(x) ≤ ε, j = 1, . . . , m}

the extended feasible set of problem (3.2.22). Defining the Lagrangian

L(x, λ) = f(x) + Σ_{j=1}^m λ^(j) f_j(x),   x ∈ Q,  λ = (λ^(1), . . . , λ^(m)) ∈ R^m_+,

we can introduce the Lagrangian dual problem

φ* := sup_{λ∈R^m_+} φ(λ),   (3.2.29)

where φ(λ) := min_{x∈Q} L(x, λ). Clearly, f* ≥ φ* (see (1.3.6)).
In order to approach an optimal solution of problems (3.2.22), (3.2.29), we apply the following switching strategy. It has only one input parameter, the step size $h > 0$. In what follows, we use the notation $\| \cdot \|$ for the standard Euclidean norm, $g(\cdot)$ denotes a subgradient of the objective function, and $g_j(\cdot)$ denotes a subgradient of the corresponding constraint.

Subgradient Method for Lagrange Multipliers

0. Choose a starting point $x_0 \in Q$.
1. $k$th iteration ($k \geq 0$).
   (a) Define $I_k = \{ j : f_j(x_k) > h \| g_j(x_k) \| \}$.
   (b) If $I_k = \emptyset$, then compute $x_{k+1} = \pi_Q\!\left( x_k - \frac{h\, g(x_k)}{\| g(x_k) \|} \right)$.
   (c) If $I_k \neq \emptyset$, then choose an arbitrary $j_k \in I_k$, define $h_k = \frac{f_{j_k}(x_k)}{\| g_{j_k}(x_k) \|^2}$, and compute $x_{k+1} = \pi_Q\big( x_k - h_k\, g_{j_k}(x_k) \big)$.

(3.2.30)

After $t \geq 0$ iterations, define $A_0(t) = \{ k \in \{0, \dots, t\} : I_k = \emptyset \}$ and let

$$A_j(t) = \{ k \in \{0, \dots, t\} : j_k = j \}, \quad 1 \leq j \leq m.$$

Let $N(t) = | A_0(t) |$. It is possible that $N(t) = 0$. However, if $N(t) > 0$, then we can define the approximate dual multipliers as follows:

$$\sigma_t = h \sum_{k \in A_0(t)} \frac{1}{\| g(x_k) \|}, \qquad \lambda_t^{(j)} = \frac{1}{\sigma_t} \sum_{k \in A_j(t)} h_k, \quad j = 1, \dots, m. \qquad (3.2.31)$$

Let $S_t = \sum\limits_{k \in A_0(t)} \frac{1}{\| g(x_k) \|}$. If $A_0(t) = \emptyset$, then we define $S_t = 0$. Thus, $\sigma_t = h S_t$.
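The scheme (3.2.30) with the multipliers (3.2.31) can be sketched as follows, on a hypothetical toy problem chosen so that the optimal multiplier is known to be $\lambda^* = 1$ (all names and problem data are illustrative, not from the text):

```python
import numpy as np

def lagrange_multipliers(f, g, cons, x0, h, t):
    """Sketch of scheme (3.2.30). `cons` is a list of (f_j, g_j) pairs; returns the
    approximate dual multipliers lambda_t from (3.2.31)."""
    x = np.asarray(x0, dtype=float)
    proj = lambda z: np.clip(z, -5.0, 5.0)         # Q = [-5,5]^n in this toy example
    S = 0.0                                        # running sum, so sigma_t = h * S
    H = np.zeros(len(cons))                        # per-constraint sums of h_k
    for _ in range(t + 1):                         # iterations k = 0, ..., t
        I = [j for j, (fj, gj) in enumerate(cons)
             if fj(x) > h * np.linalg.norm(gj(x))]
        if not I:                                  # step (b): objective step
            gf = g(x)
            S += 1.0 / np.linalg.norm(gf)
            x = proj(x - h * gf / np.linalg.norm(gf))
        else:                                      # step (c): constraint step
            j = I[0]
            fj, gj = cons[j]
            hk = fj(x) / np.linalg.norm(gj(x)) ** 2
            H[j] += hk
            x = proj(x - hk * gj(x))
    sigma = h * S
    return H / sigma                               # lambda_t^{(j)}, j = 1..m

# Toy problem: min |x1|+|x2| s.t. 1 - x1 <= 0 on Q = [-5,5]^2; the optimal
# multiplier is lambda* = 1 (from stationarity at x* = (1, 0)).
f = lambda x: abs(x[0]) + abs(x[1])
g = lambda x: np.sign(x) + (x == 0)
cons = [(lambda x: 1.0 - x[0], lambda x: np.array([-1.0, 0.0]))]
lam = lagrange_multipliers(f, g, cons, np.array([3.0, 3.0]), h=0.01, t=20000)
```

With a small step size $h$, the recovered multiplier approaches $\lambda^* = 1$, in agreement with the bound $\delta_t \leq M h$ proved below.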

For proving convergence of the switching strategy (3.2.30), we are going to find an upper bound for the gap

$$\delta_t \;=\; \frac{1}{S_t} \sum_{k \in A_0(t)} \frac{f(x_k)}{\| g(x_k) \|} \;-\; \phi(\lambda_t),$$

assuming that $N(t) > 0$. Here and in the sequel, $\lambda_t$ denotes $(\lambda_t^{(1)}, \dots, \lambda_t^{(m)})$.

Theorem 3.2.4  Let the set $Q$ be bounded: $\| x - x_0 \| \leq R$ for all $x \in Q$. If the number of iterations $t$ of method (3.2.30) is big enough,

$$t \;>\; \frac{R^2}{h^2}, \qquad (3.2.32)$$

then $N(t) > 0$. Moreover, in this case

$$\max_{1 \leq j \leq m} f_j(x_k) \;\leq\; M h, \;\; k \in A_0(t), \qquad \delta_t \;\leq\; M h, \qquad (3.2.33)$$

where $M = \max\limits_{0 \leq k \leq t}\, \max\limits_{0 \leq j \leq m} \| g_j(x_k) \|$.

Proof  Note that

$$\sigma_t \cdot \delta_t \overset{(3.2.31)}{=} \max_{x \in Q} \left\{ \sum_{k \in A_0(t)} \frac{h f(x_k)}{\| g(x_k) \|} - \sigma_t f(x) - \sum_{j=1}^{m} \sum_{k \in A_j(t)} h_k f_j(x) \right\}$$

$$= \max_{x \in Q} \left\{ \sum_{k \in A_0(t)} \frac{h (f(x_k) - f(x))}{\| g(x_k) \|} - \sum_{k \notin A_0(t)} h_k f_{j_k}(x) \right\}$$

$$\leq \max_{x \in Q} \left\{ \sum_{k \in A_0(t)} \frac{h \langle g(x_k), x_k - x \rangle}{\| g(x_k) \|} + \sum_{k \notin A_0(t)} h_k \big[ \langle g_{j_k}(x_k), x_k - x \rangle - f_{j_k}(x_k) \big] \right\}. \qquad (3.2.34)$$

Let us estimate from above the right-hand side of this inequality. For arbitrary $x \in Q$, let $r_k(x) = \| x - x_k \|$. Assume that $k \in A_0(t)$. Then

$$r_{k+1}^2(x) \overset{(2.2.48)}{\leq} \left\| x_k - x - \frac{h\, g(x_k)}{\| g(x_k) \|} \right\|^2 = r_k^2(x) - \frac{2h}{\| g(x_k) \|} \langle g(x_k), x_k - x \rangle + h^2. \qquad (3.2.35)$$

If $k \notin A_0(t)$, then

$$r_{k+1}^2(x) \overset{(2.2.48)}{\leq} \| x_k - x - h_k\, g_{j_k}(x_k) \|^2 = r_k^2(x) - 2 h_k \langle g_{j_k}(x_k), x_k - x \rangle + h_k^2 \| g_{j_k}(x_k) \|^2.$$

Hence,

$$2 h_k \big[ \langle g_{j_k}(x_k), x_k - x \rangle - f_{j_k}(x_k) \big] \;\leq\; r_k^2(x) - r_{k+1}^2(x) - \frac{f_{j_k}^2(x_k)}{\| g_{j_k}(x_k) \|^2} \;\leq\; r_k^2(x) - r_{k+1}^2(x) - h^2.$$

Summing up these inequalities and inequalities (3.2.35) for $k = 0, \dots, t$, and taking into account that $r_{t+1}(x) \geq 0$, we get

$$\sigma_t \delta_t \overset{(3.2.34)}{\leq} \tfrac12 r_0^2(x) + \tfrac12 N(t) h^2 - \tfrac12 (t - N(t)) h^2 = \tfrac12 r_0^2(x) - \tfrac12 t h^2 + N(t) h^2 \leq \tfrac12 R^2 - \tfrac12 t h^2 + N(t) h^2. \qquad (3.2.36)$$

Assume now that $t$ satisfies the condition (3.2.32). In this case we cannot have $N(t) = 0$, since then $\sigma_t = 0$ and inequality (3.2.36) is violated. Thus, the first inequality in (3.2.33) follows from the conditions of Step (b) in method (3.2.30). Finally, $\sigma_t \overset{(3.2.31)}{\geq} \frac{h}{M}\, N(t)$. Therefore, if $N(t) > 0$ and the iteration counter $t$ satisfies inequality (3.2.32), then $\delta_t \overset{(3.2.36)}{\leq} \frac{N(t)\, h^2}{\sigma_t} \leq M h$. $\square$

3.2.6 Strongly Convex Functions

In Sect. 2.1.3, we introduced the notion of strong convexity for differentiable convex
functions. We have seen that this additional assumption significantly accelerates
optimization methods. Let us study the effect of this assumption on the class of non-
differentiable convex functions. For the sake of simplicity, we work in this section
with standard Euclidean norm.
Definition 3.2.2  A function $f$ is called strongly convex on a convex set $Q$ if there exists a constant $\mu > 0$ such that for all $x, y \in Q$ and $\alpha \in [0, 1]$ we have

$$f(\alpha x + (1-\alpha) y) \;\leq\; \alpha f(x) + (1-\alpha) f(y) - \tfrac12 \mu\, \alpha (1-\alpha) \| x - y \|^2. \qquad (3.2.37)$$

For such functions, we use the notation f ∈ Sμ0 (Q). If in this inequality μ = 0, we
get definition (3.1.2) of the usual convex function.
Note that for smooth convex functions we proved this inequality as one of the
equivalent definitions (2.1.23).
Let us present the most important properties of strongly convex functions.
Lemma 3.2.3  Let $f \in S^0_\mu(Q)$. Then for any $x \in \operatorname{int} Q$ and $y \in Q$, we have

$$f(y) \;\geq\; f(x) + f'(x; y - x) + \tfrac12 \mu \| x - y \|^2. \qquad (3.2.38)$$

Proof  Indeed,

$$f(y) \overset{(3.2.37)}{\geq} \tfrac{1}{\alpha} \big[ f((1-\alpha) x + \alpha y) - (1-\alpha) f(x) \big] + \tfrac12 \mu (1-\alpha) \| x - y \|^2$$

$$= f(x) + \tfrac{1}{\alpha} \big[ f(x + \alpha (y - x)) - f(x) \big] + \tfrac12 \mu (1-\alpha) \| y - x \|^2.$$

Taking in this inequality the limit as $\alpha \downarrow 0$, we get inequality (3.2.38). The limit exists in view of Theorem 3.1.12. $\square$
Corollary 3.2.1  Let $f \in S^0_\mu(Q)$. For any $g \in \partial f(x)$, we have

$$f(y) \;\geq\; f(x) + \langle g, y - x \rangle + \tfrac12 \mu \| y - x \|^2. \qquad (3.2.39)$$

Proof  Indeed, in view of Theorem 3.1.17, for any $g \in \partial f(x)$ we have $f'(x; y - x) \geq \langle g, y - x \rangle$. $\square$

Corollary 3.2.2  If in problem (3.2.13) the objective function belongs to the class $S^0_\mu(Q)$, then its level sets are bounded. Hence, its optimal solution exists.

Corollary 3.2.3  Let $x^* \in \operatorname{int} \operatorname{dom} f$ be an optimal solution of problem (3.2.13) with $f \in S^0_\mu$. Then for all $x \in Q$, we have

$$f(x) \;\geq\; f^* + \tfrac12 \mu \| x - x^* \|^2. \qquad (3.2.40)$$

Hence, the solution of this problem is unique.

Proof  Indeed, in view of Theorem 3.1.24, there exists a $g^* \in \partial f(x^*)$ such that

$$\langle g^*, x - x^* \rangle \geq 0 \quad \forall x \in Q.$$

Thus, (3.2.40) follows from (3.2.39). $\square$


Let us describe the results of some operations with strongly convex functions.

1. Addition. If $f_1 \in S^0_{\mu_1}(Q)$ and $f_2 \in S^0_{\mu_2}(Q)$, then for any $\alpha_1, \alpha_2 \geq 0$ we have

$$\alpha_1 f_1 + \alpha_2 f_2 \in S^0_{\alpha_1 \mu_1 + \alpha_2 \mu_2}(Q).$$

(The proof follows directly from definition (3.2.37).) In particular, if we add a convex function and a strongly convex function with parameter $\mu$, then we get a strongly convex function with the same value of the parameter.

2. Maximum. If $f_1 \in S^0_{\mu_1}(Q)$ and $f_2 \in S^0_{\mu_2}(Q)$, then

$$f(x) = \max\{ f_1(x), f_2(x) \} \in S^0_\mu(Q)$$

with $\mu = \min\{ \mu_1, \mu_2 \}$. Indeed, for any $x_1, x_2 \in Q$ and $\alpha \in [0, 1]$, we have

$$f(\alpha x_1 + (1-\alpha) x_2) \leq \max\Big\{ \alpha f_1(x_1) + (1-\alpha) f_1(x_2) - \tfrac12 \mu_1 \alpha (1-\alpha) \| x_1 - x_2 \|^2,$$
$$\alpha f_2(x_1) + (1-\alpha) f_2(x_2) - \tfrac12 \mu_2 \alpha (1-\alpha) \| x_1 - x_2 \|^2 \Big\}$$
$$\leq \alpha f(x_1) + (1-\alpha) f(x_2) - \tfrac12 \mu\, \alpha (1-\alpha) \| x_1 - x_2 \|^2.$$

3. Subtraction. If $f \in S^0_\mu(Q)$, then the function $\hat f(x) = f(x) - \tfrac12 \mu \| x \|^2$ is convex. This fact follows from definition (3.2.37) and the Euclidean identity

$$\tfrac12 \| \alpha x + (1-\alpha) y \|^2 \equiv \tfrac12 \alpha \| x \|^2 + \tfrac12 (1-\alpha) \| y \|^2 - \tfrac12 \alpha (1-\alpha) \| x - y \|^2, \qquad (3.2.41)$$

which is valid for all $x, y \in \mathbb{R}^n$ and $\alpha \in [0, 1]$.

Note also that any differentiable strongly convex function in the sense of (2.1.20) belongs to the class $S^0_\mu(Q)$ (see Theorem 2.1.9).
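As a quick numerical sanity check of definition (3.2.37) and the addition rule above, one can sample the inequality for $f(x) = |x| + \frac{\mu}{2} x^2$, a convex term plus a strongly convex quadratic. This is an illustration only, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
# Strongly convex with parameter mu: convex |x| plus the quadratic (mu/2)x^2.
f = lambda x: np.abs(x) + 0.5 * mu * x**2

# Sample inequality (3.2.37) at random points x, y and weights alpha.
ok = True
for _ in range(1000):
    x, y = rng.uniform(-5, 5, size=2)
    a = rng.uniform(0, 1)
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y) - 0.5 * mu * a * (1 - a) * (x - y)**2
    ok = ok and lhs <= rhs + 1e-12
```

For the quadratic part the inequality holds with equality, by identity (3.2.41); the convex part only improves it.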
Let us now derive the lower complexity bounds for problem (3.2.13) with a strongly convex objective function. For that, we are going to use the function $f_k(\cdot)$ defined by (3.2.3). We add to assumptions (3.2.2) on the problem class the following specification (compare with (3.2.7)).

• The function $f$ is Lipschitz continuous on $B_2(x^*, \| x_0 - x^* \|)$ with constant $M > 0$.
• $f \in S^0_\mu(B_2(x^*, \| x_0 - x^* \|))$ with $\mu > 0$.

(3.2.42)

In what follows, we denote the class of problems satisfying assumptions (3.2.2), (3.2.42) by $\mathscr{P}_s(x_0, \mu, M)$.

Theorem 3.2.5  For any class $\mathscr{P}_s(x_0, \mu, M)$ and any $k$, $0 \leq k \leq n - 1$, there exists a function $f \in \mathscr{P}_s(x_0, \mu, M)$ such that

$$f(x_k) - f^* \;\geq\; \frac{M^2}{2 \mu (2 + \sqrt{k+1})^2} \qquad (3.2.43)$$

for any optimization scheme generating a sequence $\{x_k\}$ which satisfies the condition

$$x_k \in x_0 + \operatorname{Lin}\{ g(x_0), \dots, g(x_{k-1}) \}.$$

Proof  In this proof, we use functions (3.2.3) with the resisting oracle (3.2.6). Without loss of generality, we can take $x_0 = 0$. Let us choose $f(x) = f_{k+1}(x)$ with parameter

$$\gamma = \frac{M \sqrt{k+1}}{2 + \sqrt{k+1}}. \qquad (3.2.44)$$

In view of identity (3.2.41), the function $f_{k+1}$ belongs to the class $S^0_\mu(\mathbb{R}^n)$. At the same time,

$$R_k \;\stackrel{\mathrm{def}}{=}\; \| x_0 - x_k^* \| \overset{(3.2.5)}{=} \frac{\gamma}{\mu \sqrt{k+1}} \overset{(3.2.44)}{=} \frac{M}{\mu (2 + \sqrt{k+1})}.$$

In view of (3.2.4), the Lipschitz constant of the function $f_{k+1}$ on the ball $B_2(x_k^*, R_k)$ is bounded by

$$2 \mu R_k + \gamma = \frac{2M}{2 + \sqrt{k+1}} + \frac{M \sqrt{k+1}}{2 + \sqrt{k+1}} \overset{(3.2.44)}{=} M.$$

Thus, optimization problem (3.2.13) with $f = f_{k+1}$ belongs to the problem class $\mathscr{P}_s(x_0, \mu, M)$. At the same time, in view of the condition of the theorem,

$$f(x_k) - f^* \;\geq\; - f_{k+1}^* \overset{(3.2.5)}{=} \frac{\gamma^2}{2 \mu (k+1)} = \frac{M^2}{2 \mu (2 + \sqrt{k+1})^2}. \qquad \square$$

It appears that for our problem class the simplest subgradient method is
suboptimal.
Theorem 3.2.6  Assume that the objective function $f$ in problem (3.2.13) satisfies assumptions (3.2.42). Let $\epsilon > 0$ be the desired accuracy in the optimal value of this problem. Consider a sequence of points $\{x_k\} \subset Q$ generated by the following rule:

$$x_{k+1} = \pi_Q\!\left( x_k - \frac{2 \epsilon\, g(x_k)}{\| g(x_k) \|^2} \right), \quad k \geq 0, \qquad (3.2.45)$$

where $g(x_k) \in \partial f(x_k)$. Then, if the number of steps $N$ of this scheme is big enough,

$$N \;\geq\; \frac{M^2}{\mu \epsilon}\, \ln \frac{M \| x_0 - x^* \|}{\epsilon}, \qquad (3.2.46)$$

we have $f_N^* \stackrel{\mathrm{def}}{=} \min\limits_{0 \leq k \leq N} f(x_k) \leq f^* + \epsilon$.

Proof  Let $r_k = \| x_k - x^* \|$ and $h_k = \frac{2\epsilon}{\| g(x_k) \|^2}$. Assume that $N$ satisfies the bound (3.2.46) and

$$f(x_k) - f^* > \epsilon, \quad k = 0, \dots, N. \qquad (3.2.47)$$

Then

$$r_{k+1}^2 \overset{(2.2.49)}{\leq} \| x_k - h_k g(x_k) - x^* \|^2 = r_k^2 - 2 h_k \langle g(x_k), x_k - x^* \rangle + \frac{4 \epsilon^2}{\| g(x_k) \|^2}$$

$$\overset{(3.2.39)}{\leq} r_k^2 - \frac{4 \epsilon}{\| g(x_k) \|^2} \left( f(x_k) - f^* + \tfrac12 \mu r_k^2 \right) + \frac{4 \epsilon^2}{\| g(x_k) \|^2} \overset{(3.2.47)}{\leq} \left( 1 - \frac{2 \mu \epsilon}{\| g(x_k) \|^2} \right) r_k^2.$$

Thus, all $x_k \in B_2(x^*, r_0)$, and therefore $\| g(x_k) \| \leq M$. This implies that

$$\epsilon \overset{(3.2.47)}{<} f(x_N) - f^* \leq M r_N \leq M \left( 1 - \frac{2 \mu \epsilon}{M^2} \right)^{N/2} r_0 \leq M \exp\left( - \frac{\mu \epsilon N}{M^2} \right) r_0.$$

This contradicts the lower bound (3.2.46). $\square$


In view of our assumptions,

$$\tfrac12 \mu \| x_0 - x^* \|^2 \overset{(3.2.40)}{\leq} f(x_0) - f^* \leq M \| x_0 - x^* \|.$$

Therefore, $\| x_0 - x^* \| \leq \frac{2M}{\mu}$. Thus, the bound on the number of iterations (3.2.46) can be rewritten in terms of the class parameters in the following way:

$$N \;\geq\; \frac{M^2}{\mu \epsilon}\, \ln \frac{2 M^2}{\mu \epsilon}. \qquad (3.2.48)$$

Comparing it with the lower complexity bound (3.2.43), we can see that the Subgradient Method (3.2.45) is suboptimal. Its main advantage is its independence of the exact values of the class parameters $\mu$ and $M$.

Note that the step sizes of method (3.2.45) are twice as big as those of method (3.2.24). If we divide the step sizes in (3.2.45) by two, then, for strongly convex functions, this method will be twice as slow. At the same time, this new version will be identical to (3.2.24) with $m = 0$, which is able to minimize Lipschitz continuous functions with simple set constraints (see Theorem 3.2.3).
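A minimal Python sketch of scheme (3.2.45) on a strongly convex nonsmooth toy problem with $\mu = 1$; the problem data and names are illustrative assumptions:

```python
import numpy as np

def strongly_convex_subgrad(f, g, proj, x0, eps, N):
    """Sketch of scheme (3.2.45): fixed target accuracy eps, step 2*eps/||g||^2."""
    x = np.asarray(x0, dtype=float)
    best = f(x)
    for _ in range(N):
        gk = g(x)
        x = proj(x - (2.0 * eps / gk.dot(gk)) * gk)
        best = min(best, f(x))
    return best

# Toy problem: f(x) = max(|x1|, |x2|) + 0.5*||x||^2 (mu = 1), minimized at
# x* = 0 with f* = 0.
def f(x):
    return np.max(np.abs(x)) + 0.5 * x.dot(x)

def g(x):
    i = int(np.argmax(np.abs(x)))
    e = np.zeros_like(x)
    e[i] = np.sign(x[i]) if x[i] != 0 else 1.0
    return e + x     # subgradient of the max-term plus gradient of the quadratic

proj = lambda z: np.clip(z, -5.0, 5.0)
best = strongly_convex_subgrad(f, g, proj, np.array([2.0, 1.0]), eps=0.01, N=10000)
```

With $N$ chosen in accordance with (3.2.46), some iterate reaches $f(x_k) \leq f^* + \epsilon$; note that the method needs only $\epsilon$, not $\mu$ or $M$.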

3.2.7 Complexity Bounds in Finite Dimension

Let us look at the problems of Unconstrained Minimization again, assuming that


their dimension is relatively small. This means that our computational resources
allow us to perform a number of iterations of minimization schemes proportional to
the dimension of the space of variables. What will be the lower complexity bounds
in this case?
In this section, we obtain a finite-dimensional lower complexity bound for a
problem which is closely related to minimization problems. This is the feasibility
problem:

Find x ∗ ∈ S, (3.2.49)

where S is a closed convex set. We assume that this problem is endowed with a
separation oracle, which answers our request at a point x̄ ∈ Rn in the following
way.

• Either it reports that $\bar x \in S$.
• Or, it returns a vector $\bar g$ separating $\bar x$ from $S$:

$$\langle \bar g, \bar x - x \rangle \geq 0 \quad \forall x \in S.$$

In order to measure the complexity of this problem, we introduce the following assumption.

Assumption 3.2.1  There exists a point $x^* \in S$ such that for some $\epsilon > 0$ the ball $B_2(x^*, \epsilon)$ belongs to $S$.

For example, if we know an optimal value $f^*$ for problem (3.2.8), we can treat this problem as a feasibility problem with

$$S = \{ (t, x) \in \mathbb{R}^{n+1} \mid t \geq f(x), \; t \leq f^* + \bar\epsilon, \; x \in Q \}.$$

The relation between the accuracy parameters $\bar\epsilon$ and $\epsilon$ in (3.2.2) can be easily obtained using the assumption that $f$ is Lipschitz continuous. We leave the corresponding reasoning as an exercise for the reader.

Let us describe now a resisting oracle for problem (3.2.49). Taking into account the requests of the numerical method, this oracle forms a sequence of boxes $\{B_k\}_{k=0}^\infty$, $B_{k+1} \subset B_k$, defined by their lower and upper bounds:

$$B_k = \{ x \in \mathbb{R}^n \mid a_k \leq x \leq b_k \}.$$

For each box $B_k$, $k \geq 0$, denote by $c_k = \tfrac12 (a_k + b_k)$ its center. For each box $B_k$, $k \geq 1$, the oracle creates an individual separating vector $g_k$. Up to the choice of sign, this is always a coordinate vector.

In the scheme below, we use two dynamic counters:

• $m$ is the number of generated boxes.
• $i$ is the active coordinate.

Denote by $\bar e_n \in \mathbb{R}^n$ the vector of all ones. The oracle starts from the following settings:

$$a_0 := -R \bar e_n, \quad b_0 := R \bar e_n, \quad m := 0, \quad i := 1.$$

Its input is an arbitrary test point $x \in \mathbb{R}^n$.

Resisting oracle for feasibility problem

If $x \notin B_0$, then return a separator of $x$ from $B_0$; else:

1. Find the maximal $k \in [0, \dots, m] : x \in B_k$.
2. If $k < m$, then return $g_k$; else {create a new box}:

   If $x^{(i)} \geq c_m^{(i)}$, then $a_{m+1} := a_m$, $b_{m+1} := b_m + (c_m^{(i)} - b_m^{(i)}) e_i$, $g_m := e_i$;
   else $a_{m+1} := a_m + (c_m^{(i)} - a_m^{(i)}) e_i$, $b_{m+1} := b_m$, $g_m := -e_i$.

   $m := m + 1$; $i := i + 1$; if $i > n$, then $i := 1$.

   Return $g_m$.
216 3 Nonsmooth Convex Optimization

This oracle implements a very simple strategy. Note that the next box Bm+1 is
always half of the last box Bm . The last generated box Bm is divided into two equal
parts by a hyperplane, defined by the coordinate vector ei , which passes through cm ,
the center of Bm . Depending on the part of the box Bm containing the point x, we
choose the sign of the separation vector: gm+1 = ±ei . The new box Bm+1 is always
the half of the box Bm which does not contain the test point x.
After creating a new box Bm+1 , the index i is increased by 1. If its value exceeds
n, we set again i = 1. Thus, the sequence of boxes {Bk } possesses two important
properties:
• voln Bk+1 = 12 voln Bk .
• For any k ≥ 0 we have bk+n − ak+n = 12 (bk − ak ).
Note also that the number of generated boxes does not exceed the number of calls
of the oracle.
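The oracle above can be sketched in Python as follows (a hypothetical implementation for illustration; indices are 0-based here, while the scheme above uses 1-based coordinates):

```python
import numpy as np

class ResistingOracle:
    """Box-bisection resisting oracle for the feasibility problem (3.2.49)."""
    def __init__(self, n, R):
        self.n = n
        self.boxes = [(-R * np.ones(n), R * np.ones(n))]   # B_0
        self.seps = []      # seps[k] is the vector cutting B_{k+1} out of B_k
        self.i = 0          # active coordinate

    def query(self, x):
        a0, b0 = self.boxes[0]
        if np.any(x < a0) or np.any(x > b0):
            # separator of x from B_0: the most violated coordinate direction
            j = int(np.argmax(np.maximum(x - b0, a0 - x)))
            return np.sign(x[j] - 0.5 * (a0[j] + b0[j])) * np.eye(self.n)[j]
        # maximal k with x in B_k
        k = max(kk for kk, (a, b) in enumerate(self.boxes)
                if np.all(a <= x) and np.all(x <= b))
        if k < len(self.boxes) - 1:
            return self.seps[k]
        a, b = self.boxes[-1]                # create a new box from the last one
        c = 0.5 * (a + b)
        e = np.eye(self.n)[self.i]
        if x[self.i] >= c[self.i]:           # keep the half not containing x
            self.boxes.append((a.copy(), b + (c[self.i] - b[self.i]) * e))
            self.seps.append(e)
        else:
            self.boxes.append((a + (c[self.i] - a[self.i]) * e, b.copy()))
            self.seps.append(-e)
        self.i = (self.i + 1) % self.n
        return self.seps[-1]
```

Querying the center of the current box always forces a new box, so one can observe directly the two properties listed above (volume halving per box, edge halving every $n$ boxes).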
Lemma 3.2.4  For all $k \geq 0$ we have the inclusion

$$B_2(c_k, r_k) \subset B_k, \quad \text{with } r_k = \frac{R}{2} \left( \frac12 \right)^{k/n}. \qquad (3.2.50)$$

Proof  Indeed, for all $k \in [0, \dots, n-1]$ we have

$$B_k \supset B_n = \{ x \mid c_n - \tfrac12 R \bar e_n \leq x \leq c_n + \tfrac12 R \bar e_n \} \supset B_2(c_n, \tfrac12 R).$$

Therefore, for such $k$ we have $B_k \supset B_2(c_k, \tfrac12 R)$ and (3.2.50) holds. Further, let $k = nl + p$ for some $p \in [0, \dots, n-1]$. Since

$$b_k - a_k = \left( \tfrac12 \right)^l (b_p - a_p),$$

we conclude that

$$B_k \supset B_2\!\left( c_k, \left( \tfrac12 \right)^l \tfrac12 R \right).$$

It remains to note that $r_k \leq \tfrac12 R \left( \tfrac12 \right)^l$. $\square$

Lemma 3.2.4 immediately leads to the following complexity result.

Theorem 3.2.7  Consider a class of feasibility problems (3.2.49) which satisfy Assumption 3.2.1 and for which the feasible sets $S$ are subsets of $B_\infty(0, R)$. The lower analytical complexity bound for this class is

$$n \ln \frac{R}{2 \epsilon}$$

calls of the separation oracle.

Proof  Indeed, we have seen that the number of generated boxes does not exceed the number of calls of the oracle. Moreover, in view of Lemma 3.2.4, after $k$ iterations the last box contains the ball $B_2(c_{m_k}, r_k)$. $\square$

The lower complexity bound for minimization problem (3.2.8) can be obtained in a similar way. However, the corresponding reasoning is more complicated. Therefore, we present here only the final result.

Theorem 3.2.8  A lower bound for the analytical complexity of the problem class formed by minimization problem (3.2.8) with $Q \subseteq B_\infty(0, R)$ and $f \in \mathscr{F}^{0,0}_M(B_\infty(0, R))$ is $n \ln \frac{M R}{8 \epsilon}$ calls of the oracle.

3.2.8 Cutting Plane Schemes

Let us look now at the following minimization problem with set constraint:

min{f (x) | x ∈ Q}, (3.2.51)

where the function $f$ is convex on $\mathbb{R}^n$, and $Q$ is a bounded closed convex set such that

$$\operatorname{int} Q \neq \emptyset, \quad \operatorname{diam} Q = D < \infty.$$

We assume that $Q$ is not simple and that our problem is equipped with a separation oracle. At any test point $\bar x \in \mathbb{R}^n$, this oracle returns a vector $g(\bar x)$, which is either:

• a subgradient of $f$ at $\bar x$, if $\bar x \in Q$,
• a separator of $\bar x$ from $Q$, if $\bar x \notin Q$.

An important example of such a problem is a constrained minimization problem with functional constraints (3.2.22). We have seen that this problem can be rewritten as a problem with a single functional constraint (see (3.2.23)) defining the feasible set

$$Q = \{ x \in \mathbb{R}^n \mid \bar f(x) \leq 0 \}.$$

In this case, for $x \notin Q$ the oracle has to provide us with any subgradient $\bar g \in \partial \bar f(x)$. Clearly, $\bar g$ separates $x$ from $Q$ (see Theorem 3.1.18).

Let us present the main property of finite-dimensional localization sets. Consider a sequence $X \equiv \{x_i\}_{i=0}^\infty$ belonging to the set $Q$. Recall that the localization sets generated by this sequence are defined as follows:

$$S_0(X) = Q,$$
$$S_{k+1}(X) = \{ x \in S_k(X) \mid \langle g(x_k), x_k - x \rangle \geq 0 \}.$$


Clearly, for any $k \geq 0$ we have $x^* \in S_k$. Define

$$v_i = v_f(x^*; x_i) \;(\geq 0), \qquad v_k^* = \min_{0 \leq i \leq k} v_i.$$

Denote by $\operatorname{vol}_n S$ the $n$-dimensional volume of the set $S \subset \mathbb{R}^n$.

Theorem 3.2.9  For any $k \geq 0$ we have

$$v_k^* \;\leq\; D \left( \frac{\operatorname{vol}_n S_k(X)}{\operatorname{vol}_n Q} \right)^{1/n}.$$

Proof  Let $\alpha = v_k^* / D \;(\leq 1)$. Since $Q \subseteq B_2(x^*, D)$, we have the following inclusion:

$$(1-\alpha) x^* + \alpha Q \;\subseteq\; (1-\alpha) x^* + \alpha B_2(x^*, D) = B_2(x^*, v_k^*).$$

Since $Q$ is convex, we conclude that

$$(1-\alpha) x^* + \alpha Q \;\equiv\; [(1-\alpha) x^* + \alpha Q] \cap Q \;\subseteq\; B_2(x^*, v_k^*) \cap Q \;\subseteq\; S_k(X).$$

Therefore, $\operatorname{vol}_n S_k(X) \geq \operatorname{vol}_n [(1-\alpha) x^* + \alpha Q] = \alpha^n \operatorname{vol}_n Q$. $\square$

Quite often, the set Q is very complicated and it is difficult to work directly with
the sets Sk (X). Instead, we can update some simple upper approximations of these
sets. The process of generating such approximations is described by the following
cutting plane scheme.

General cutting plane scheme

0. Choose a bounded set $E_0 \supseteq Q$.
1. $k$th iteration ($k \geq 0$).
   (a) Choose $y_k \in E_k$.
   (b) If $y_k \in Q$, then compute $f(y_k)$, $g(y_k)$. If $y_k \notin Q$, then compute $\bar g(y_k)$, which separates $y_k$ from $Q$.
   (c) Set

$$g_k = \begin{cases} g(y_k), & \text{if } y_k \in Q, \\ \bar g(y_k), & \text{if } y_k \notin Q. \end{cases}$$

   (d) Choose $E_{k+1} \supseteq \{ x \in E_k \mid \langle g_k, y_k - x \rangle \geq 0 \}$.

(3.2.52)


Let us estimate the performance of this process. Consider the sequence $Y = \{y_k\}_{k=0}^\infty$ involved in this scheme. Denote by $X$ the subsequence of feasible points in the sequence $Y$: $X = Y \cap Q$. Let us introduce the counter

$$i(k) = \text{number of points } y_j, \; 0 \leq j < k, \text{ such that } y_j \in Q.$$

Thus, if $i(k) > 0$, then $X \neq \emptyset$.

Lemma 3.2.5  For any $k \geq 0$, we have $S_{i(k)} \subseteq E_k$.

Proof  Indeed, if $i(0) = 0$, then $S_0 = Q \subseteq E_0$. Let us assume that $S_{i(k)} \subseteq E_k$ for some $k \geq 0$. Then, at the next iteration there are two possibilities.

(a) $i(k+1) = i(k)$. This happens if and only if $y_k \notin Q$. Then

$$E_{k+1} \supseteq \{ x \in E_k \mid \langle \bar g(y_k), y_k - x \rangle \geq 0 \} \supseteq \{ x \in S_{i(k+1)} \mid \langle \bar g(y_k), y_k - x \rangle \geq 0 \} = S_{i(k+1)},$$

since $S_{i(k+1)} \subseteq Q$ and $\bar g(y_k)$ separates $y_k$ from $Q$.

(b) $i(k+1) = i(k) + 1$. In this case $y_k \in Q$. Then

$$E_{k+1} \supseteq \{ x \in E_k \mid \langle g(y_k), y_k - x \rangle \geq 0 \} \supseteq \{ x \in S_{i(k)} \mid \langle g(y_k), y_k - x \rangle \geq 0 \} = S_{i(k)+1},$$

since $y_k = x_{i(k)}$. $\square$

The above results immediately lead to the following important conclusion.

Corollary 3.2.4

1. For any $k$ such that $i(k) > 0$, we have

$$v^*_{i(k)}(X) \;\leq\; D \left( \frac{\operatorname{vol}_n S_{i(k)}(X)}{\operatorname{vol}_n Q} \right)^{1/n} \leq D \left( \frac{\operatorname{vol}_n E_k}{\operatorname{vol}_n Q} \right)^{1/n}.$$

2. If $\operatorname{vol}_n E_k < \operatorname{vol}_n Q$, then $i(k) > 0$.

Proof  We have already proved the first statement. The second one follows from the inclusion $Q = S_0 = S_{i(k)} \subseteq E_k$, which is valid for all $k$ such that $i(k) = 0$. $\square$

Thus, if we manage to ensure voln Ek → 0, then we obtain a convergent
scheme. Moreover, the rate of decrease of the volume automatically defines the
rate of convergence of the corresponding method. Clearly, we should try to decrease
voln Ek as quickly as possible.

Historically, the first nonsmooth minimization method implementing the idea of cutting planes was the Center of Gravity Method. It is based on the following geometric idea.

Consider a bounded convex set $S \subset \mathbb{R}^n$, $\operatorname{int} S \neq \emptyset$. Define the center of gravity of this set as

$$\operatorname{cg}(S) = \frac{1}{\operatorname{vol}_n S} \int_S x\, dx.$$

It appears that any cutting plane passing through the center of gravity divides the set into two almost proportional pieces.

Lemma 3.2.6  Let $g$ be a direction in $\mathbb{R}^n$. Define

$$S_+ = \{ x \in S \mid \langle g, \operatorname{cg}(S) - x \rangle \geq 0 \}.$$

Then

$$\frac{\operatorname{vol}_n S_+}{\operatorname{vol}_n S} \;\leq\; 1 - \frac{1}{e}.$$

(We accept this result without proof.)

This observation naturally leads to the following minimization scheme.

Method of Centers of Gravity

0. Set $S_0 = Q$.
1. $k$th iteration ($k \geq 0$).
   (a) Choose $x_k = \operatorname{cg}(S_k)$ and compute $f(x_k)$, $g(x_k)$.
   (b) Set $S_{k+1} = \{ x \in S_k \mid \langle g(x_k), x_k - x \rangle \geq 0 \}$.
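In $\mathbb{R}^1$, the center of gravity of the localization interval is its midpoint, so the scheme reduces to bisection on the sign of a subgradient, and the factor $1 - \frac{1}{e}$ of Lemma 3.2.6 improves to $\frac12$. A minimal sketch (illustrative names):

```python
def centers_of_gravity_1d(fprime, a, b, iters):
    """In R^1 the localization set S_k = [a, b] has cg(S_k) = (a+b)/2, so the
    Method of Centers of Gravity is exactly bisection on a subgradient sign."""
    for _ in range(iters):
        c = 0.5 * (a + b)
        if fprime(c) > 0:      # the cut {x : g(x_k)(x_k - x) >= 0} keeps the left half
            b = c
        else:
            a = c
    return 0.5 * (a + b)

# minimize f(x) = |x - 0.3| on Q = [0, 1]; subgradient f'(x) = sign(x - 0.3)
x = centers_of_gravity_1d(lambda t: (t > 0.3) - (t < 0.3), 0.0, 1.0, 40)
```

After $k$ iterations the localization interval has length $2^{-k}$, matching the rate of Theorem 3.2.10 with $1 - \frac{1}{e}$ replaced by $\frac12$ and $n = 1$.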

Let us estimate the rate of convergence of this method. Define

$$f_k^* = \min_{0 \leq j \leq k} f(x_j).$$

Theorem 3.2.10  If $f$ is Lipschitz continuous on $B_2(x^*, D)$ with constant $M$, then for any $k \geq 0$ we have

$$f_k^* - f^* \;\leq\; M D \left( 1 - \frac{1}{e} \right)^{k/n}.$$

Proof  The statement follows from Lemma 3.2.2, Theorem 3.2.9 and Lemma 3.2.6. $\square$

Comparing this result with the lower complexity bound of Theorem 3.2.8, we
see that the method of centers of gravity is optimal in finite dimensions. Its rate of
convergence does not depend on any individual characteristics of our problem like
the condition number, etc. However, we should accept that this method is absolutely
impractical, since the computation of the center of gravity in a high-dimensional
space is a more difficult problem than the problem of Convex Optimization.
Let us look at another method, which uses the possibility of approximating the
localization sets. This method is based on the following geometrical observation.
Let $H$ be a positive definite symmetric $n \times n$ matrix. Consider the ellipsoid

$$E(H, \bar x) = \{ x \in \mathbb{R}^n \mid \langle H^{-1} (x - \bar x), x - \bar x \rangle \leq 1 \}.$$

Let us choose a direction $g \in \mathbb{R}^n$, and consider the half of the above ellipsoid defined by the corresponding hyperplane:

$$E_+ = \{ x \in E(H, \bar x) \mid \langle g, \bar x - x \rangle \geq 0 \}.$$

It turns out that this set belongs to another ellipsoid, whose volume is strictly smaller than the volume of $E(H, \bar x)$.

Lemma 3.2.7  Define

$$\bar x_+ = \bar x - \frac{1}{n+1} \cdot \frac{H g}{\langle H g, g \rangle^{1/2}},$$

$$H_+ = \frac{n^2}{n^2 - 1} \left( H - \frac{2}{n+1} \cdot \frac{H g g^T H}{\langle H g, g \rangle} \right).$$

Then $E_+ \subset E(H_+, \bar x_+)$ and

$$\operatorname{vol}_n E(H_+, \bar x_+) \;\leq\; \left( 1 - \frac{1}{(n+1)^2} \right)^{n/2} \operatorname{vol}_n E(H, \bar x).$$

Proof  Let $G = H^{-1}$ and $G_+ = H_+^{-1}$. It is clear that

$$G_+ = \frac{n^2 - 1}{n^2} \left( G + \frac{2}{n-1} \cdot \frac{g g^T}{\langle H g, g \rangle} \right).$$

Without loss of generality, we can assume that $\bar x = 0$ and $\langle H g, g \rangle = 1$. Suppose $x \in E_+$. Note that $\bar x_+ = -\frac{1}{n+1} H g$. Therefore,

$$\| x - \bar x_+ \|^2_{G_+} = \frac{n^2 - 1}{n^2} \left[ \| x - \bar x_+ \|^2_G + \frac{2}{n-1} \langle g, x - \bar x_+ \rangle^2 \right],$$

$$\| x - \bar x_+ \|^2_G = \| x \|^2_G + \frac{2}{n+1} \langle g, x \rangle + \frac{1}{(n+1)^2},$$

$$\langle g, x - \bar x_+ \rangle^2 = \langle g, x \rangle^2 + \frac{2}{n+1} \langle g, x \rangle + \frac{1}{(n+1)^2}.$$

Putting all the terms together, we obtain

$$\| x - \bar x_+ \|^2_{G_+} = \frac{n^2 - 1}{n^2} \left[ \| x \|^2_G + \frac{2}{n-1} \langle g, x \rangle^2 + \frac{2}{n-1} \langle g, x \rangle + \frac{1}{n^2 - 1} \right].$$

Note that $\langle g, x \rangle \leq 0$ and $\| x \|_G \leq 1$. Therefore,

$$\langle g, x \rangle^2 + \langle g, x \rangle = \langle g, x \rangle (1 + \langle g, x \rangle) \leq 0.$$

Hence,

$$\| x - \bar x_+ \|^2_{G_+} \;\leq\; \frac{n^2 - 1}{n^2} \left[ \| x \|^2_G + \frac{1}{n^2 - 1} \right] \;\leq\; 1.$$

Thus, we have proved that $E_+ \subset E(H_+, \bar x_+)$.

Let us estimate the volume of $E(H_+, \bar x_+)$:

$$\frac{\operatorname{vol}_n E(H_+, \bar x_+)}{\operatorname{vol}_n E(H, \bar x)} = \left( \frac{\det H_+}{\det H} \right)^{1/2} = \left[ \left( \frac{n^2}{n^2 - 1} \right)^n \frac{n-1}{n+1} \right]^{1/2}$$

$$= \left[ \frac{n^2}{n^2 - 1} \left( 1 - \frac{2}{n+1} \right)^{1/n} \right]^{n/2} \leq \left[ \frac{n^2}{n^2 - 1} \left( 1 - \frac{2}{n(n+1)} \right) \right]^{n/2}$$

$$= \left[ \frac{n^2 (n^2 + n - 2)}{n (n-1) (n+1)^2} \right]^{n/2} = \left( 1 - \frac{1}{(n+1)^2} \right)^{n/2}. \qquad \square$$

It turns out that the ellipsoid E(H+ , x̄+ ) is the ellipsoid of minimal volume
containing half of the initial ellipsoid E+ .
Our observations can be implemented in the following algorithmic scheme of the
famous Ellipsoid Method.

Ellipsoid Method

0. Choose $y_0 \in \mathbb{R}^n$ and $R > 0$ such that $B_2(y_0, R) \supseteq Q$. Set $H_0 = R^2 \cdot I_n$.
1. $k$th iteration ($k \geq 0$).

$$g_k = \begin{cases} g(y_k), & \text{if } y_k \in Q, \\ \bar g(y_k), & \text{if } y_k \notin Q, \end{cases}$$

$$y_{k+1} = y_k - \frac{1}{n+1} \cdot \frac{H_k g_k}{\langle H_k g_k, g_k \rangle^{1/2}},$$

$$H_{k+1} = \frac{n^2}{n^2 - 1} \left( H_k - \frac{2}{n+1} \cdot \frac{H_k g_k g_k^T H_k}{\langle H_k g_k, g_k \rangle} \right).$$

(3.2.53)

This method can be seen as a particular implementation of the general cutting plane scheme (3.2.52), obtained by choosing

$$E_k = \{ x \in \mathbb{R}^n \mid \langle H_k^{-1} (x - y_k), x - y_k \rangle \leq 1 \},$$

with $y_k$ being the center of the ellipsoid.
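A minimal Python sketch of method (3.2.53) on a hypothetical two-dimensional problem, where $Q$ is a Euclidean ball so that the separation oracle is trivial; all problem data and names are illustrative assumptions:

```python
import numpy as np

def ellipsoid_method(f, g, sep, y0, R, iters):
    """Sketch of the Ellipsoid Method (3.2.53). sep(y) returns None for y in Q,
    otherwise a vector separating y from Q."""
    n = len(y0)
    y = np.asarray(y0, dtype=float)
    H = R ** 2 * np.eye(n)
    best, ybest = np.inf, y.copy()
    for _ in range(iters):
        s = sep(y)
        if s is None:                       # feasible: use a subgradient of f
            if f(y) < best:
                best, ybest = f(y), y.copy()
            gk = g(y)
        else:                               # infeasible: use the separator
            gk = s
        Hg = H @ gk
        y = y - Hg / ((n + 1) * np.sqrt(gk @ Hg))
        H = (n * n / (n * n - 1.0)) * (H - (2.0 / (n + 1))
                                       * np.outer(Hg, Hg) / (gk @ Hg))
    return best, ybest

# Toy problem: min max(|x1 - 0.5|, |x2 + 0.25|) over Q = {||x|| <= 2};
# f* = 0 at (0.5, -0.25).
def f(x):
    return max(abs(x[0] - 0.5), abs(x[1] + 0.25))

def g(x):
    z = np.array([x[0] - 0.5, x[1] + 0.25])
    i = int(np.argmax(np.abs(z)))
    e = np.zeros(2); e[i] = np.sign(z[i]) if z[i] != 0 else 1.0
    return e

sep = lambda y: y.copy() if y @ y > 4.0 else None    # separator: y itself
best, ybest = ellipsoid_method(f, g, sep, np.zeros(2), R=2.0, iters=200)
```

For $n = 2$, each iteration shrinks the ellipsoid volume by the factor $1 - \frac{1}{(n+1)^2} = \frac89$, so 200 iterations give an accuracy far below $10^{-3}$ here, in agreement with Theorem 3.2.11 below.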


Let us present an efficiency estimate for the Ellipsoid Method. Let Y = {yk }∞
k=0 ,
and let X be a feasible subsequence of sequence Y :
0
X=Y Q.

Define fk∗ = min f (xj ).


0≤j ≤k

Theorem 3.2.11 Let the function f be Lipschitz continuous on B2 (x ∗ , R) with


some constant M. Then for i(k) > 0, we have
 k  1
∗ − f ∗ ≤ MR 1 −
fi(k) 1 2 B2 (x0 ,R) n
· volnvol .
(n+1)2 n Q

Proof The proof follows from Lemma 3.2.2, Corollary 3.2.4 and Lemma 3.2.7.

We need additional assumptions to guarantee X
= ∅. Assume that there exists
some ρ > 0 and x̄ ∈ Q such that

B2 (x̄, ρ) ⊆ Q. (3.2.54)
224 3 Nonsmooth Convex Optimization

Then
 1  k  1 − k
voln Ek n 2 voln B2 (x0 ,R) n
voln Q ≤ 1− 1
(n+1)2 voln Q ≤ ρ1 e 2(n+1)2 R.

In view of Corollary 3.2.4, this implies that i(k) > 0 for all

k > 2(n + 1)2 ln Rρ .

If i(k) > 0, then

∗ − f ∗ ≤ 1 MR 2 · e − k
fi(k) 2(n+1)2 .
ρ

In order to ensure that (3.2.54) holds for a constrained minimization problem with functional constraints, it is enough to assume that all constraints are Lipschitz continuous and that there is a feasible point at which all functional constraints are strictly negative (the Slater condition). We leave the details of the corresponding justification as an exercise for the reader.

Let us now discuss the total complexity of the Ellipsoid Method (3.2.53). Each iteration of this scheme is relatively cheap: it takes $O(n^2)$ arithmetic operations. On the other hand, in order to generate an $\epsilon$-solution of problem (3.2.51) satisfying assumption (3.2.54), this method needs

$$2 (n+1)^2 \ln \frac{M R^2}{\rho\, \epsilon}$$

calls of the oracle. This efficiency estimate is not optimal (see Theorem 3.2.8), but it has linear dependence on $\ln \frac{1}{\epsilon}$ and polynomial dependence on the dimension and on the logarithms of the class parameters $M$, $R$ and $\rho$. For problem classes whose oracle also has polynomial complexity, such algorithms are called (weakly) polynomial.

To conclude this section, note that there are several methods which work with localization sets in the form of a polytope:

$$E_k = \{ x \in \mathbb{R}^n \mid \langle a_j, x \rangle \leq b_j, \; j = 1, \dots, m_k \}.$$

Let us mention the most important methods of this type:

• Inscribed Ellipsoid Method. The point $y_k$ in this scheme is chosen as follows:

$$y_k = \text{Center of the maximal ellipsoid } W_k : W_k \subset E_k.$$

• Analytic Center Method. In this method, the point $y_k$ is chosen as the minimum of the analytic barrier

$$F_k(x) = - \sum_{j=1}^{m_k} \ln ( b_j - \langle a_j, x \rangle ).$$

• Volumetric Center Method. This is also a barrier-type scheme. The point $y_k$ is chosen as the minimum of the volumetric barrier

$$V_k(x) = \ln \det \nabla^2 F_k(x),$$

where $F_k(\cdot)$ is the analytic barrier for the set $E_k$.

All these methods are polynomial with complexity bound

$$\left( n \ln \frac{1}{\epsilon} \right)^p,$$

where $p$ is either 1 or 2. However, the complexity of each iteration in these methods is much larger ($n^3$ to $n^4$ arithmetic operations). In Chap. 5, we will see that the test points $y_k$ for these schemes can be efficiently computed by Interior-Point Methods.

3.3 Methods with Complete Data

(Nonsmooth models of objective function; Kelley’s method; The Level Method; Uncon-
strained minimization; Efficiency estimates; Problems with functional constraints.)

3.3.1 Nonsmooth Models of the Objective Function

In the previous section, we looked at several methods for solving the following
problem:

$$\min_{x \in Q} f(x), \qquad (3.3.1)$$

where f is a Lipschitz continuous convex function and Q is a closed convex


set. We have seen that the optimal method for problem (3.3.1) is the Subgradient
Method (3.2.14), (3.2.16). Note that this conclusion is valid for the whole class of
Lipschitz continuous functions. However, if we are going to minimize a particular
function from this class, we can expect that it will not be as bad as in the worst case.
We usually can hope that the actual performance of the minimization methods can
be much better than the worst-case theoretical bound. Unfortunately, as far as the
Subgradient Method is concerned, these expectations are too optimistic. The scheme
of the Subgradient Method is very strict and in general it cannot converge faster
than in theory. It can also be shown that the Ellipsoid Method (3.2.53) inherits this
drawback of subgradient schemes. In practice it works more or less in accordance
with its theoretical bound even when it is applied to a very simple function like
 x 2 .

In this section, we will discuss algorithmic schemes which are more flexible than the Subgradient and Ellipsoid Methods. These schemes are based on the notion of a nonsmooth model of a convex objective function.

Definition 3.3.1  Let $X = \{x_k\}_{k=0}^\infty$ be a sequence of points in $Q$. Define

$$\hat f_k(X; x) = \max_{0 \leq i \leq k} \big[ f(x_i) + \langle g(x_i), x - x_i \rangle \big],$$

where $g(x_i)$ are some subgradients of $f$ at $x_i$. The function $\hat f_k(X; \cdot)$ is called a nonsmooth model of the convex function $f$.

Note that $\hat f_k(X; \cdot)$ is a piecewise linear function. In view of inequality (3.1.23), we always have

$$f(x) \geq \hat f_k(X; x)$$

for all $x \in \mathbb{R}^n$. However, at all test points $x_i$, $0 \leq i \leq k$, we have

$$f(x_i) = \hat f_k(X; x_i), \qquad g(x_i) \in \partial \hat f_k(X; x_i).$$

Moreover, the next model is always better than the previous one:

$$\hat f_{k+1}(X; x) \geq \hat f_k(X; x)$$

for all $x \in \mathbb{R}^n$.
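A piecewise-linear model is straightforward to build from the oracle answers. In the sketch below (illustrative names), the model minorizes $f$ everywhere and coincides with it at the test points:

```python
import numpy as np

def make_model(points, fvals, subgrads):
    """Piecewise-linear model \\hat f_k(X; x) built from k+1 oracle answers:
    the pointwise maximum of the linear minorants f(x_i) + <g(x_i), x - x_i>."""
    def model(x):
        return max(fv + gv @ (x - xi)
                   for xi, fv, gv in zip(points, fvals, subgrads))
    return model

# Illustrative function: f(x) = ||x||_1 with three test points.
f = lambda x: np.abs(x).sum()
g = lambda x: np.sign(x) + (x == 0)     # a valid subgradient of the l1-norm
X = [np.array([1.0, 2.0]), np.array([-1.0, 0.5]), np.array([0.2, -0.3])]
model = make_model(X, [f(x) for x in X], [g(x) for x in X])
```

Each call to the oracle adds one more linear piece, so the model can only grow, which is exactly the monotonicity property stated above.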

3.3.2 Kelley’s Method

The model fˆk (X; ·) represents complete information on the function f accumulated
after k calls of the oracle. Therefore, it seems natural to develop a minimization
scheme, based on this object. Perhaps, the most natural method of this type is as
follows.

Kelley's Method

0. Choose $x_0 \in Q$.   (3.3.2)
1. $k$th iteration ($k \geq 0$). Find $x_{k+1} \in \operatorname{Arg}\min\limits_{x \in Q} \hat f_k(X; x)$.
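In $\mathbb{R}^1$, the auxiliary problem of (3.3.2) can be solved exactly: the model is convex piecewise linear, so its minimizer over $Q = [a, b]$ lies at an endpoint or at an intersection of two linear pieces. A minimal sketch with illustrative names; note that the instability of the method appears only in higher dimensions (see Example 3.3.1):

```python
def kelley_1d(f, g, a, b, x0, iters):
    """Sketch of Kelley's method (3.3.2) on Q = [a, b] in R^1. The minimizer of the
    piecewise-linear model is found by checking endpoints and all intersections
    of pairs of linear pieces."""
    pts, F, G = [x0], [f(x0)], [g(x0)]
    model = lambda x: max(Fi + Gi * (x - xi) for xi, Fi, Gi in zip(pts, F, G))
    for _ in range(iters):
        cand = [a, b]
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if G[i] != G[j]:           # intersection of pieces i and j
                    t = (F[j] - G[j] * pts[j] - F[i] + G[i] * pts[i]) / (G[i] - G[j])
                    if a <= t <= b:
                        cand.append(t)
        x = min(cand, key=model)           # x_{k+1} in Arg min of the model
        pts.append(x); F.append(f(x)); G.append(g(x))
    return min(F)

# minimize f(t) = |t - 0.3| on [0, 1], starting from x0 = 0.9
best = kelley_1d(lambda t: abs(t - 0.3), lambda t: 1.0 if t > 0.3 else -1.0,
                 0.0, 1.0, 0.9, 15)
```

On this one-dimensional instance the method finds the minimizer in a couple of iterations; the disappointing behavior described next is a genuinely high-dimensional phenomenon.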

Intuitively, this scheme looks very attractive. Even the presence of a complicated
auxiliary problem is not too disturbing, since for polyhedral Q it can be solved by
linear optimization methods in finite time. However, it turns out that this method
cannot be recommended for practical applications. The main reason for this is its instability. Note that the solution of the auxiliary problem in method (3.3.2) may not be unique. Moreover, the whole set $\operatorname{Arg}\min\limits_{x \in Q} \hat f_k(X; x)$ can be unstable with respect to an arbitrarily small variation of the data $\{f(x_i), g(x_i)\}$. This feature results in unstable
practical behavior of the scheme. At the same time, it can be used to construct an
example of a problem for which method (3.3.2) has a very disappointing lower
complexity bound.
Example 3.3.1  Consider problem (3.3.1) with

$$f(y, x) = \max\{ | y |, \| x \|^2 \}, \quad y \in \mathbb{R}, \; x \in \mathbb{R}^n,$$
$$Q = \{ z = (y, x) : y^2 + \| x \|^2 \leq 1 \},$$

where the norm is standard Euclidean. Thus, the solution of this problem is $z^* = (y^*, x^*) = (0, 0)$, and the optimal value is $f^* = 0$. Denote by $Z_k^* = \operatorname{Arg}\min\limits_{z \in Q} \hat f_k(Z; z)$ the optimal set of the model $\hat f_k(Z; z)$, and let $\hat f_k^* = \hat f_k(Z_k^*)$ be the optimal value of the model.

Let us choose $z_0 = (1, 0)$. Then the initial model of the function $f$ is $\hat f_0(Z; z) = y$. Therefore, the first point generated by Kelley's method is $z_1 = (-1, 0)$. Hence, the next model of the function $f$ is as follows:

$$\hat f_1(Z; z) = \max\{ y, -y \} = | y |.$$

Clearly, $\hat f_1^* = 0$. Note that $\hat f_{k+1}^* \geq \hat f_k^*$. On the other hand,

$$\hat f_k^* \leq f(z^*) = 0.$$

Thus, for all subsequent models with $k \geq 1$, we will have $\hat f_k^* = 0$ and $Z_k^* = (0, X_k^*)$, where

$$X_k^* = \{ x \in B_2(0, 1) : \| x_i \|^2 + 2 \langle x_i, x - x_i \rangle \leq 0, \; i = 0, \dots, k \}.$$

Let us estimate the efficiency of the cuts for the set $X_k^*$. Since $x_{k+1}$ can be an arbitrary point from $X_k^*$, at the first stage of the method we can choose the $x_i$ with unit norms: $\| x_i \| = 1$. Then the set $X_k^*$ is defined as follows:

$$X_k^* = \{ x \in B_2(0, 1) \mid \langle x_i, x \rangle \leq \tfrac12, \; i = 0, \dots, k \}.$$

We can do this if
0
S2 (0, 1) ≡ {x ∈ Rn |  x = 1} Xk∗
= ∅.

As far as this is possible, we can have

f (zi ) ≡ f (0, xi ) = 1.

Let us estimate the possible length of this stage using the following fact.

Let d be a direction in Rn ,  d = 1. Consider a surface

1
Sd (α) = {x ∈ Rn |  x = 1, d, x ≥ α}, α ∈ [ , 1].
2
! n−1
Then v(α) ≡ voln−1 (S(α)) ≤ v(0) 1 − α 2 2
.

At the first stage, each step cuts from the sphere S2 (0, 1) one of the segments
 n−1
Sd ( 12 ), at most. Therefore, we can continue the process for all k ≤ √2 . During
3
these iterations we still have f (zi ) = 1.
Since at the first stage of the process the cuts are xi , x ≤ 12 , for all k, 0 ≤ k ≤
 n−1
N ≡ √2 , we have
3

1
B2 (0, ) ⊂ Xk∗ .
2

This means that after N iterations we can repeat our process with the ball B2 (0, 12 ),
etc. Note that f (0, x) = 14 for all x from B2 (0, 12 ).
Thus, we have proved the following lower bound for Kelley's method (3.3.2):

$$f(x_k) - f^* \ge \left(\frac14\right)^{k\left(\frac{\sqrt 3}{2}\right)^{n-1}}.$$

This means that we cannot get an $\epsilon$-solution of our problem in fewer than

$$\frac{1}{2\ln 2}\left(\frac{2}{\sqrt 3}\right)^{n-1}\ln\frac{1}{\epsilon}$$
3.3 Methods with Complete Data 229

calls of the oracle. It remains to compare this lower bound with the upper complexity
bounds of other methods:

Ellipsoid method: $O\left(n^2 \ln\frac{1}{\epsilon}\right)$

Optimal methods: $O\left(n \ln\frac{1}{\epsilon}\right)$

Gradient method: $O\left(\frac{1}{\epsilon^2}\right)$

3.3.3 The Level Method

Let us show that it is possible to work with a nonsmooth model of the objective
function in a stable way. Define

$$\hat f_k^* = \min_{x\in Q} \hat f_k(X;x), \qquad f_k^* = \min_{0\le i\le k} f(x_i).$$

The first of these values is called the minimal value of the model, and the second one is the record value of the model. Clearly, $\hat f_k^* \le f^* \le f_k^*$.
Let us choose some α ∈ (0, 1). Define

$$\ell_k(\alpha) = (1-\alpha)\hat f_k^* + \alpha f_k^*.$$

Consider the level set

$$\mathcal{L}_k(\alpha) = \{x \in Q \mid \hat f_k(X;x) \le \ell_k(\alpha)\}.$$

Clearly, Lk (α) is a closed convex set.


Note that the set Lk (α) is certainly interesting for optimization schemes. Firstly,
inside this set there is clearly no test point of the current model. Secondly, this set is
stable with respect to a small perturbation of the data. Let us present a minimization
method which deals directly with this level set.

Level Method

0. Choose a point $x_0 \in Q$, accuracy $\epsilon > 0$, and level coefficient $\alpha \in (0, 1)$.          (3.3.3)

1. $k$th iteration ($k \ge 0$).

   (a) Compute $\hat f_k^*$ and $f_k^*$.
   (b) If $f_k^* - \hat f_k^* \le \epsilon$, then STOP.
   (c) Set $x_{k+1} = \pi_{\mathcal{L}_k(\alpha)}(x_k)$.

In this scheme, there are two potentially expensive operations. We need to


compute an optimal value fˆk∗ of the current model. If Q is a polytope, then this
value can be obtained from the following linear programming problem:

$$\min\; t, \quad \text{s.t.}\; f(x_i) + \langle g(x_i), x - x_i\rangle \le t,\; i = 0 \ldots k, \quad x \in Q.$$

We also need to compute the Euclidean projection πLk (α) (xk ). If Q is a polytope,
then this is a quadratic programming problem:

$$\min\; \|x - x_k\|^2, \quad \text{s.t.}\; f(x_i) + \langle g(x_i), x - x_i\rangle \le \ell_k(\alpha),\; i = 0 \ldots k, \quad x \in Q.$$

Both problems are solvable either by a standard simplex-type method, or by Interior-


Point Methods (see Chap. 5).
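To make the two auxiliary operations above concrete, here is a small self-contained numerical sketch (not from the book) of the Level Method in one dimension, where the model minimization reduces to minimizing a convex piecewise-linear function over an interval, and the Euclidean projection onto the level set reduces to clamping onto an interval. The function and all names are illustrative.

```python
def level_method_1d(f, subgrad, a, b, alpha=0.2929, eps=1e-6, max_iter=100):
    """Level Method (3.3.3) sketch for min f(x) over Q = [a, b], f convex."""
    xs, fs, gs = [], [], []
    x = float(b)
    for _ in range(max_iter):
        xs.append(x); fs.append(f(x)); gs.append(subgrad(x))
        model = lambda y: max(fi + gi * (y - xi)
                              for fi, gi, xi in zip(fs, gs, xs))
        # (a) minimal value of the model over [a, b]; the model is convex
        #     piecewise-linear, so a ternary search is enough in 1-D
        lo, hi = a, b
        for _ in range(100):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if model(m1) <= model(m2): hi = m2
            else: lo = m1
        f_hat, f_rec = model(0.5 * (lo + hi)), min(fs)
        # (b) stopping test on the gap delta_k = f_rec - f_hat
        if f_rec - f_hat <= eps:
            break
        # (c) projection of x_k onto the level set {y : model(y) <= level};
        #     in 1-D this set is an interval, so projection is clamping
        level = (1 - alpha) * f_hat + alpha * f_rec
        lo_s, hi_s = a, b
        for fi, gi, xi in zip(fs, gs, xs):
            if gi > 0:   hi_s = min(hi_s, xi + (level - fi) / gi)
            elif gi < 0: lo_s = max(lo_s, xi + (level - fi) / gi)
        x = min(max(x, lo_s), hi_s)
    return f_rec, f_rec - f_hat
```

On $f(x) = |x|$ over $Q = [-1, 2]$ the gap contracts by roughly the factor $\alpha$ per step, so far fewer iterations are needed than the worst-case bound suggests.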
Let us look at some properties of the Level Method. Recall that the optimal values
of the model increase, and the record values decrease:

$$\hat f_k^* \le \hat f_{k+1}^* \le f^* \le f_{k+1}^* \le f_k^*.$$

Let Δk = [fˆk∗ , fk∗ ] and δk = fk∗ − fˆk∗ . We call δk the gap of the model fˆk (X; x).
Then

Δk+1 ⊆ Δk , δk+1 ≤ δk .

The next result is crucial for the analysis of the Level Method.

Lemma 3.3.1 Assume that for some $p \ge k$ the gap is still big enough:

$$\delta_p \ge (1-\alpha)\delta_k.$$

Then for all $i$, $k \le i \le p$, we have $\ell_i(\alpha) \ge \hat f_p^*$.

Proof Note that for all such $i$, we have $\delta_p \ge (1-\alpha)\delta_k \ge (1-\alpha)\delta_i$. Therefore,

$$\ell_i(\alpha) = f_i^* - (1-\alpha)\delta_i \ge f_p^* - (1-\alpha)\delta_i = \hat f_p^* + \delta_p - (1-\alpha)\delta_i \ge \hat f_p^*. \qquad\square$$


Let us show that the steps of the Level Method are large enough. Define

$$M_f = \max\{\|g\| \mid g \in \partial f(x),\; x \in Q\}.$$

Lemma 3.3.2 For the sequence of points {xk } generated by the Level Method, we
have
$$\|x_{k+1} - x_k\| \ge \frac{(1-\alpha)\delta_k}{M_f}.$$

Proof Indeed,

$$f(x_k) - (1-\alpha)\delta_k \ge f_k^* - (1-\alpha)\delta_k = \ell_k(\alpha) \ge \hat f_k(X; x_{k+1}) \ge f(x_k) + \langle g(x_k), x_{k+1} - x_k\rangle \ge f(x_k) - M_f\|x_{k+1} - x_k\|. \qquad\square$$
Finally, we need to show that the gap of the model is decreasing.
Lemma 3.3.3 Let the set Q in problem (3.3.1) be bounded: diam Q ≤ D. If for
some p ≥ k we have δp ≥ (1 − α)δk , then

$$p + 1 - k \le \frac{M_f^2 D^2}{(1-\alpha)^2 \delta_p^2}.$$

Proof Let $x_p^* \in \mathrm{Arg}\min\limits_{x\in Q} \hat f_p(X;x)$. In view of Lemma 3.3.1, we have

$$\hat f_i(X; x_p^*) \le \hat f_p(X; x_p^*) = \hat f_p^* \le \ell_i(\alpha)$$

for all $i$, $k \le i \le p$. Therefore, in view of Lemma 2.2.8 and Lemma 3.3.2, we get

$$\|x_{i+1} - x_p^*\|^2 \le \|x_i - x_p^*\|^2 - \|x_{i+1} - x_i\|^2 \le \|x_i - x_p^*\|^2 - \frac{(1-\alpha)^2\delta_i^2}{M_f^2} \le \|x_i - x_p^*\|^2 - \frac{(1-\alpha)^2\delta_p^2}{M_f^2}.$$

Summing up these inequalities for $i = k \ldots p$, we get

$$(p+1-k)\,\frac{(1-\alpha)^2\delta_p^2}{M_f^2} \le \|x_k - x_p^*\|^2 \le D^2. \qquad\square$$

Note that the number of indices in the segment [k, p] is equal to p + 1 − k. Now
we can prove the efficiency estimate of the Level Method.
Theorem 3.3.1 Let $\mathrm{diam}\, Q = D$. Then the Level Method terminates after

$$N = \frac{M_f^2 D^2}{\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)} + 1$$

iterations at most. Its termination criterion guarantees $f_k^* - f^* \le \epsilon$.


Proof Assume that $\delta_k \ge \epsilon$ for $0 \le k \le N$. Let us represent the whole set of indices, in decreasing order, as a union of $m+1$ groups,

$$\{N, \ldots, 0\} = I(0) \cup I(1) \cup \cdots \cup I(m),$$

such that

$$I(j) = [p(j), k(j)], \quad p(j) \ge k(j), \; j = 0 \ldots m,$$
$$p(0) = N, \quad p(j+1) = k(j) - 1, \quad k(m) = 0,$$
$$\delta_{k(j)} \le \frac{1}{1-\alpha}\,\delta_{p(j)} < \delta_{k(j)+1} \equiv \delta_{p(j+1)}.$$

Clearly, for $j \ge 0$ we have

$$\delta_{p(j+1)} \ge \frac{\delta_{p(j)}}{1-\alpha} \ge \frac{\delta_{p(0)}}{(1-\alpha)^{j+1}} \ge \frac{\epsilon}{(1-\alpha)^{j+1}}.$$

In view of Lemma 3.3.3, $n(j) = p(j) + 1 - k(j)$ is bounded:

$$n(j) \le \frac{M_f^2 D^2}{(1-\alpha)^2 \delta_{p(j)}^2} \le \frac{M_f^2 D^2}{\epsilon^2 (1-\alpha)^2}\,(1-\alpha)^{2j}.$$

Therefore,

$$N = \sum_{j=0}^m n(j) \le \frac{M_f^2 D^2}{\epsilon^2 (1-\alpha)^2}\sum_{j=0}^m (1-\alpha)^{2j} \le \frac{M_f^2 D^2}{\epsilon^2 (1-\alpha)^2\left(1-(1-\alpha)^2\right)}. \qquad\square$$

Let us discuss the above efficiency estimate. Note that we can obtain the optimal
value of the level parameter α from the following maximization problem:

$$(1-\alpha)^2\left(1-(1-\alpha)^2\right) \to \max_{\alpha\in[0,1]}.$$

Its solution is $\alpha^* = \frac{1}{2+\sqrt 2} \approx 0.2929$. Under this choice, we have the following efficiency bound for the Level Method:

$$N \le \frac{4 M_f^2 D^2}{\epsilon^2}.$$

Comparing this result with Theorem 3.2.1, we see that the Level Method is optimal uniformly in the dimension of the space of variables. Note that the analytical complexity bound of this method in finite dimensions is not known.

One of the advantages of this method is that the gap $\delta_k = f_k^* - \hat f_k^*$ provides us with an exact estimate of the current accuracy. Usually, this gap converges to zero much faster than in the worst-case situation. For the majority of real-life optimization problems, an accuracy of $\epsilon = 10^{-4}$–$10^{-5}$ is obtained by the method after $3n$ to $4n$ iterations.

3.3.4 Constrained Minimization

Let us show how to use piece-wise linear models to solve constrained minimization
problems. Consider the problem

$$\min_{x\in Q}\; f(x), \quad \text{s.t.}\; f_j(x) \le 0,\; j = 1 \ldots m, \tag{3.3.4}$$

where Q is a bounded closed convex set, and functions f (·), fj (·) are Lipschitz
continuous on Q.
Let us rewrite this problem as a problem with a single functional constraint.
Define $\bar f(x) = \max\limits_{1\le j\le m} f_j(x)$. Then we obtain the equivalent problem

$$\min_{x\in Q}\; f(x), \quad \text{s.t.}\; \bar f(x) \le 0. \tag{3.3.5}$$

Note that the functions f (·) and f¯(·) are convex and Lipschitz continuous. In this
section, we will try to solve (3.3.5) using the models for both of them.
Let us define the corresponding models. Consider a sequence X = {xk }∞ k=0 .
Define

$$\hat f_k(X;x) = \max_{0\le j\le k}\,[f(x_j) + \langle g(x_j), x - x_j\rangle] \le f(x),$$
$$\check f_k(X;x) = \max_{0\le j\le k}\,[\bar f(x_j) + \langle \bar g(x_j), x - x_j\rangle] \le \bar f(x),$$

where $g(x_j) \in \partial f(x_j)$ and $\bar g(x_j) \in \partial \bar f(x_j)$.


As in Sect. 2.3.4, our scheme is based on the parametric function

$$f(t;x) = \max\{f(x) - t,\; \bar f(x)\}, \qquad f^*(t) = \min_{x\in Q} f(t;x).$$

Recall that $f^*(t)$ is nonincreasing in $t$. Let $x^*$ be a solution to (3.3.5), and let $t^* = f(x^*)$. Then $t^*$ is the smallest root of the function $f^*(t)$.
Using the models for the objective function and the constraint, we can introduce
a model for the parametric function. Define

$$f_k(X;t,x) = \max\{\hat f_k(X;x) - t,\; \check f_k(X;x)\} \le f(t;x),$$
$$\hat f_k^*(X;t) = \min_{x\in Q} f_k(X;t,x) \le f^*(t).$$

Again, $\hat f_k^*(X;t)$ is nonincreasing in $t$. It is clear that its smallest root $t_k^*(X)$ does not exceed $t^*$.
We will need the following characterization of the root tk∗ (X).
Lemma 3.3.4

$$t_k^*(X) = \min_{x\in Q}\{\hat f_k(X;x) \mid \check f_k(X;x) \le 0\}.$$

Proof Denote by $\hat x_k^*$ the solution of the minimization problem in the above equation, and let $\hat t_k^* = \hat f_k(X; \hat x_k^*)$ be its optimal value. Then

$$\hat f_k^*(X; \hat t_k^*) \le \max\{\hat f_k(X;\hat x_k^*) - \hat t_k^*,\; \check f_k(X;\hat x_k^*)\} \le 0.$$

Thus, we always have $\hat t_k^* \ge t_k^*(X)$.

Assume that $\hat t_k^* > t_k^*(X)$. Then there exists a point $y$ such that

$$\hat f_k(X;y) - t_k^*(X) \le 0, \qquad \check f_k(X;y) \le 0.$$

However, in this case $\hat t_k^* = \hat f_k(X;\hat x_k^*) \le \hat f_k(X;y) \le t_k^*(X) < \hat t_k^*$. This is a contradiction. $\square$

In our analysis, we will also need the function

$$f_k^*(X;t) = \min_{0\le j\le k} f_k(X;t,x_j),$$

the record value of our parametric model.


Lemma 3.3.5 Let $t_0 < t_1 \le t^*$. Assume that $\hat f_k^*(X;t_1) > 0$. Then $t_k^*(X) > t_1$ and

$$\hat f_k^*(X;t_0) \ge \hat f_k^*(X;t_1) + \frac{t_1 - t_0}{t_k^*(X) - t_1}\,\hat f_k^*(X;t_1). \tag{3.3.6}$$

Proof Let $x_k^*(t) \in \mathrm{Arg}\min\limits_{x\in Q} f_k(X;t,x)$, $t_2 = t_k^*(X)$, and $\alpha = \frac{t_1 - t_0}{t_2 - t_0} \in [0,1]$. Then

$$t_1 = (1-\alpha)t_0 + \alpha t_2,$$

and inequality (3.3.6) is equivalent to the following:

$$\hat f_k^*(X;t_1) \le (1-\alpha)\hat f_k^*(X;t_0) + \alpha \hat f_k^*(X;t_2) \tag{3.3.7}$$

(note that $\hat f_k^*(X;t_2) = 0$). Let $x_\alpha = (1-\alpha)x_k^*(t_0) + \alpha x_k^*(t_2)$. Then we have

$$\begin{aligned}
\hat f_k^*(X;t_1) &\le \max\{\hat f_k(X;x_\alpha) - t_1;\; \check f_k(X;x_\alpha)\}\\
&\le \max\{(1-\alpha)(\hat f_k(X;x_k^*(t_0)) - t_0) + \alpha(\hat f_k(X;x_k^*(t_2)) - t_2);\; (1-\alpha)\check f_k(X;x_k^*(t_0)) + \alpha\check f_k(X;x_k^*(t_2))\}\\
&\le (1-\alpha)\max\{\hat f_k(X;x_k^*(t_0)) - t_0;\; \check f_k(X;x_k^*(t_0))\} + \alpha\max\{\hat f_k(X;x_k^*(t_2)) - t_2;\; \check f_k(X;x_k^*(t_2))\}\\
&= (1-\alpha)\hat f_k^*(X;t_0) + \alpha\hat f_k^*(X;t_2),
\end{aligned}$$

and we get (3.3.7). $\square$



We also need the following statement (compare with Lemma 2.3.5).
Lemma 3.3.6 For any $\Delta \ge 0$, we have

$$f^*(t) - \Delta \le f^*(t+\Delta), \qquad \hat f_k^*(X;t) - \Delta \le \hat f_k^*(X;t+\Delta).$$

Proof Indeed, for $f^*(t)$ we have

$$f^*(t+\Delta) = \min_{x\in Q}\,[\max\{f(x) - t;\; \bar f(x) + \Delta\} - \Delta] \ge \min_{x\in Q}\,[\max\{f(x) - t;\; \bar f(x)\} - \Delta] = f^*(t) - \Delta.$$

The proof of the second inequality is similar. $\square$

Now we are ready to present a constrained minimization scheme (compare with
the constrained minimization scheme of Sect. 2.3.5).

Constrained Level Method

0. Choose $x_0 \in Q$, $t_0 < t^*$, $\kappa \in (0, \frac12)$, and accuracy $\epsilon > 0$.

1. $k$th iteration ($k \ge 0$).

   (a) Keep generating the sequence $X = \{x_j\}_{j=0}^\infty$ by the Level Method as applied to the function $f(t_k; x)$. If the internal termination criterion          (3.3.8)

   $$\hat f_j^*(X; t_k) \ge (1-\kappa)\, f_j^*(X; t_k)$$

   holds, then stop the internal process and set $j(k) = j$.

   Global stop: $f_j^*(X; t_k) \le \epsilon$.

   (b) Set $t_{k+1} = t_{j(k)}^*(X)$.

We are interested in the analytical complexity of this method. Therefore, the


complexity of the computation of the root tj∗ (X) and of the value fˆj∗ (X; t) is not
important for us now. We need to estimate the rate of convergence of the master
process and the complexity of Step 1(a).
Let us start from the master process.
Lemma 3.3.7 For all $k \ge 0$, we have

$$f_{j(k)}^*(X; t_k) \le \left(\frac{1}{2(1-\kappa)}\right)^k \frac{t^* - t_0}{1-\kappa}.$$

Proof Define

$$\sigma_k = \frac{f_{j(k)}^*(X;t_k)}{\sqrt{t_{k+1} - t_k}}, \qquad \beta = \frac{1}{2(1-\kappa)}\;(<1).$$

Since $t_{k+1} = t_{j(k)}^*(X)$, in view of Lemma 3.3.5, for all $k \ge 1$ we have

$$\sigma_{k-1} = \frac{f_{j(k-1)}^*(X; t_{k-1})}{\sqrt{t_k - t_{k-1}}} \ge \frac{\hat f_{j(k)}^*(X; t_{k-1})}{\sqrt{t_k - t_{k-1}}} \ge \frac{2\hat f_{j(k)}^*(X; t_k)}{\sqrt{t_{k+1} - t_k}} \ge \frac{2(1-\kappa)\, f_{j(k)}^*(X; t_k)}{\sqrt{t_{k+1} - t_k}} = \frac{\sigma_k}{\beta}.$$

Thus, $\sigma_k \le \beta\sigma_{k-1}$, and we obtain

$$f_{j(k)}^*(X; t_k) = \sigma_k\sqrt{t_{k+1} - t_k} \le \beta^k\sigma_0\sqrt{t_{k+1} - t_k} = \beta^k f_{j(0)}^*(X; t_0)\sqrt{\frac{t_{k+1} - t_k}{t_1 - t_0}}.$$

Further, in view of Lemma 3.3.6, $t_1 - t_0 \ge \hat f_{j(0)}^*(X; t_0)$. Therefore,

$$f_{j(k)}^*(X; t_k) \le \beta^k f_{j(0)}^*(X; t_0)\sqrt{\frac{t_{k+1} - t_k}{\hat f_{j(0)}^*(X; t_0)}} \le \frac{\beta^k}{1-\kappa}\sqrt{\hat f_{j(0)}^*(X; t_0)\,(t_{k+1} - t_k)} \le \frac{\beta^k}{1-\kappa}\sqrt{f^*(t_0)\,(t^* - t_0)}.$$

It remains to note that $f^*(t_0) \le t^* - t_0$ (see Lemma 3.3.6). $\square$


Let the Global Stop condition in (3.3.8) be satisfied: $f_j^*(X;t_k) \le \epsilon$. Then there exists a $j^*$ such that

$$f(t_k; x_{j^*}) = f_j^*(X;t_k) \le \epsilon.$$

Therefore, we have

$$f(t_k; x_{j^*}) = \max\{f(x_{j^*}) - t_k;\; \bar f(x_{j^*})\} \le \epsilon.$$

Since $t_k \le t^*$, we conclude that

$$f(x_{j^*}) \le t^* + \epsilon, \qquad \bar f(x_{j^*}) \le \epsilon. \tag{3.3.9}$$
In view of Lemma 3.3.7, we can get (3.3.9) in at most

$$N(\kappa) = \frac{1}{\ln[2(1-\kappa)]}\,\ln\frac{t^* - t_0}{(1-\kappa)\epsilon}$$

full iterations of the master process. (The last iteration of the process is terminated by the Global Stop rule.) Note that in the above expression, $\kappa$ is an absolute constant (for example, we can take $\kappa = \frac14$).

Let us estimate the complexity of the internal process. Define

$$M_f = \max\{\|g\| \mid g \in \partial f(x) \cup \partial\bar f(x),\; x \in Q\}.$$

We need to analyze two cases.


1. Full step. At this step, the internal process is terminated by the rule

$$\hat f_{j(k)}^*(X; t_k) \ge (1-\kappa)\, f_{j(k)}^*(X; t_k).$$

The corresponding inequality for the gap is as follows:

$$f_{j(k)}^*(X; t_k) - \hat f_{j(k)}^*(X; t_k) \le \kappa\, f_{j(k)}^*(X; t_k).$$

In view of Theorem 3.3.1, this happens after at most

$$\frac{M_f^2 D^2}{\kappa^2\,(f_{j(k)}^*(X;t_k))^2\,\alpha(1-\alpha)^2(2-\alpha)}$$

iterations of the internal process. Since at the full step $f_{j(k)}^*(X; t_k) \ge \epsilon$, we conclude that

$$j(k) - j(k-1) \le \frac{M_f^2 D^2}{\kappa^2\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)}$$

for any full iteration of the master process.


2. Last step. The internal process of this step was terminated by the Global Stop rule:

$$f_j^*(X; t_k) \le \epsilon.$$

Since the normal stopping criterion did not work, we conclude that

$$f_{j-1}^*(X; t_k) - \hat f_{j-1}^*(X; t_k) \ge \kappa\, f_{j-1}^*(X; t_k) \ge \kappa\epsilon.$$

Therefore, in view of Theorem 3.3.1, the number of iterations at the last step does not exceed

$$\frac{M_f^2 D^2}{\kappa^2\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)}.$$
Thus, we come to the following estimate of the total complexity of the Constrained Level Method:

$$(N(\kappa)+1)\,\frac{M_f^2 D^2}{\kappa^2\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)} = \frac{M_f^2 D^2}{\kappa^2\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)}\left(1 + \frac{1}{\ln[2(1-\kappa)]}\,\ln\frac{t^* - t_0}{(1-\kappa)\epsilon}\right) = \frac{M_f^2 D^2\,\ln\frac{2(t^* - t_0)}{\epsilon}}{\kappa^2\epsilon^2\,\alpha(1-\alpha)^2(2-\alpha)\,\ln[2(1-\kappa)]}.$$

A reasonable choice for the parameters of this scheme is $\alpha = \kappa = \frac{1}{2+\sqrt 2}$.

The principal term in the above complexity bound is of the order $O\left(\frac{1}{\epsilon^2}\ln\frac{2(t^* - t_0)}{\epsilon}\right)$. Thus, the Constrained Level Method is suboptimal (see Theorem 3.2.1).
In this method, at each iteration of the master process we need to find the root
tj∗(k) (X). In view of Lemma 3.3.4, this is equivalent to the following problem:

$$\min_{x\in Q}\{\hat f_k(X;x) \mid \check f_k(X;x) \le 0\}.$$

In other words, we need to solve the problem

$$\min\; t, \quad \text{s.t.}\; f(x_j) + \langle g(x_j), x - x_j\rangle \le t,\; j = 0 \ldots k,$$
$$\bar f(x_j) + \langle\bar g(x_j), x - x_j\rangle \le 0,\; j = 0 \ldots k, \quad x \in Q.$$

If Q is a polytope, this problem can be solved by finite linear programming methods


(simplex method). If Q is more complicated, we can use Interior-Point Schemes
(Chap. 5).
To conclude this section, let us note that we can use a better model for the
functional constraints. Since

$$\bar f(x) = \max_{1\le i\le m} f_i(x),$$

it is possible to work with

$$\check f_k(X;x) = \max_{0\le j\le k}\,\max_{1\le i\le m}\,[f_i(x_j) + \langle g_i(x_j), x - x_j\rangle],$$

where gi (xj ) ∈ ∂fi (xj ). In practice, this complete model significantly accelerates
the convergence of the process. However, clearly each iteration becomes much more
expensive.
As far as the practical behavior of this scheme is concerned, we note that
usually the process is very fast. There are some technical problems related to
the accumulation of many linear pieces in the model. However, in all practical
implementations of the Level Method there exist some strategies for dropping the
old inactive elements of the model.
Chapter 4
Second-Order Methods

In this chapter, we study Black-Box second-order methods. In the first two sections,
these methods are based on cubic regularization of the second-order model of the
objective function. With an appropriate proximal coefficient, this model becomes
a global upper approximation of the objective function. At the same time, the
global minimum of this approximation is computable in polynomial time even if
the Hessian of the objective is not positive semidefinite. We study global and local
convergence of the Cubic Newton Method in convex and non-convex cases. In the
next section, we derive the lower complexity bounds and show that this method
can be accelerated using the estimating sequences technique. In the last section,
we consider a modification of the standard Gauss–Newton method for solving
systems of nonlinear equations. This modification is also based on an overestimating
principle as applied to the norm of the residual of the system. Both global and local
convergence results are justified.

4.1 Cubic Regularization of Newton’s Method

(Cubic regularization of quadratic approximation; General convergence results; Global rate


of convergence for different problem classes; Implementation issues; Complexity results for
strongly convex functions.)

4.1.1 Cubic Regularization of Quadratic Approximation

In this section, we consider the simplest unconstrained minimization problem

$$\min_{x\in\mathbb{R}^n} f(x)$$

© Springer Nature Switzerland AG 2018 241


Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_4
242 4 Second-Order Methods

with a twice continuously differentiable objective function. The standard second-


order scheme for this problem, Newton’s method, is as follows:

xk+1 = xk − [∇ 2 f (xk )]−1 ∇f (xk ). (4.1.1)

We have already looked at this method in Sect. 1.2.


Despite its very natural motivation, this scheme has several hidden drawbacks.
First of all, it may happen that at the current test point the Hessian is degenerate; in
this case the method is not well-defined. Secondly, it may happen that this scheme
diverges or converges to a saddle point or even to a point of local maximum. In order
to overcome these difficulties, there are three standard recipes.
• Levenberg–Marquardt regularization. If $\nabla^2 f(x_k)$ is indefinite, let us regularize it with a unit matrix. Namely, use the matrix $G_k = \nabla^2 f(x_k) + \gamma I_n \succ 0$ in order to perform the step:

$$x_{k+1} = x_k - G_k^{-1}\nabla f(x_k).$$

This strategy is sometimes considered as a way of mixing Newton’s method with


the gradient method.
• Line search. Since we are interested in minimization, it is reasonable to introduce
in method (4.1.1) a certain step size hk > 0:

xk+1 = xk − hk [∇ 2 f (xk )]−1 ∇f (xk ).

(This is a damped Newton method. Compare with the scheme (5.1.28).) This can
help in generating a monotone sequence of function values: f (xk+1 ) ≤ f (xk ).
• Trust-region methods. In accordance with this approach, at a point $x_k$ we have to define a neighborhood where the second-order approximation of the objective function is reliable. This is a certain trust region $\Delta(x_k)$. For example, we can take

$$\Delta(x_k) = \{x : \|x - x_k\| \le \epsilon\}$$

with some  > 0. Then the next point xk+1 can be chosen as a solution to the
following auxiliary problem:
 
1 2
min ∇f (xk ), x − xk + ∇ f (xk )(x − xk ), x − xk .
x∈Δ(xk ) 2

Note that for Δ(xk ) ≡ Rn , this is exactly the standard Newton step.
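To see how badly the pure scheme (4.1.1) can behave globally, consider the smooth convex one-dimensional function $f(x) = \sqrt{1+x^2}$, for which the Newton step simplifies to $x_{k+1} = -x_k^3$. The following sketch is our own illustration, not part of the book:

```python
import math

def newton_step(x):
    """One step of pure Newton's method (4.1.1) for f(x) = sqrt(1 + x^2):
    f'(x) = x / sqrt(1 + x^2),  f''(x) = (1 + x^2)^(-3/2),
    so  x - f'(x)/f''(x) = x - x*(1 + x^2) = -x^3."""
    return x - (x / math.sqrt(1 + x * x)) / (1 + x * x) ** (-1.5)

def run(x0, iters):
    """Iterate Newton's method from x0."""
    x = x0
    for _ in range(iters):
        x = newton_step(x)
    return x

# |x0| < 1: very fast (cubic) convergence to the minimizer x* = 0.
# |x0| > 1: divergence, even though f is smooth and strictly convex.
```

So the method converges only from starting points with $|x_0| < 1$ and diverges otherwise; this is exactly the kind of global misbehavior that the cubic regularization introduced below is designed to exclude.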
Unfortunately, none of these approaches seems to be useful in addressing the
global behavior of second-order schemes. In this section, we present a modification
of Newton’s method, which is constructed in a similar way as the Gradient Mapping
(see Sect. 2.2.4).
4.1 Cubic Regularization of Newton’s Method 243

Let F ⊆ Rn be an open convex set. Consider a function f which is twice


differentiable on F . Let x0 ∈ F be a starting point of our iterative scheme. We
assume that the set F is large enough: It contains at least the level set

$$\mathcal{L}(f(x_0)) \equiv \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}.$$

Moreover, in this section we always assume the following.


Assumption 4.1.1 The Hessian of the function f is Lipschitz continuous on F :

$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x - y\| \quad \forall x, y \in F, \tag{4.1.2}$$

with some constant L > 0. In this section, the norm is always standard Euclidean.
For the reader’s convenience, let us recall the following variant of Lemma 1.2.4.
Lemma 4.1.1 For any x and y from F we have

$$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y-x)\| \le \tfrac12 L\|y-x\|^2 \quad \text{(cf. (1.2.13))}, \tag{4.1.3}$$

$$|f(y) - f(x) - \langle\nabla f(x), y-x\rangle - \tfrac12\langle\nabla^2 f(x)(y-x), y-x\rangle| \le \tfrac{L}{6}\|y-x\|^3 \quad \text{(cf. (1.2.14))}. \tag{4.1.4}$$
Let M be a positive parameter. Define a modified Newton step by minimizing a
cubic regularization of the quadratic approximation of the function f :
 
$$\min_y\left[\langle\nabla f(x), y-x\rangle + \tfrac12\langle\nabla^2 f(x)(y-x), y-x\rangle + \tfrac{M}{6}\|y-x\|^3\right]. \tag{4.1.5}$$

Denote by $T_M(x)$ an arbitrary point from the set of global minima of this minimization problem. We postpone the discussion of the computational complexity of finding this point up to Sect. 4.1.4.1.

Note that the point $T_M(x)$ satisfies the following first-order optimality condition (cf. (1.2.4)):

$$\nabla f(x) + \nabla^2 f(x)(T_M(x) - x) + \tfrac{M}{2}\|T_M(x) - x\|\cdot(T_M(x) - x) = 0. \tag{4.1.6}$$

Let $r_M(x) = \|x - T_M(x)\|$. Multiplying (4.1.6) by $T_M(x) - x$, we get the equation

$$\langle\nabla f(x), T_M(x) - x\rangle + \langle\nabla^2 f(x)(T_M(x) - x), T_M(x) - x\rangle + \tfrac{M}{2}\, r_M^3(x) = 0. \tag{4.1.7}$$

In our analysis of the process (4.1.16), we need the following fact.



Lemma 4.1.2 For any x ∈ F , we have

$$\nabla^2 f(x) + \tfrac{M}{2}\, r_M(x)\, I_n \succeq 0. \tag{4.1.8}$$

This statement will be justified later in Sect. 4.1.4.1. Let us now present the main
properties of the vector function TM (·).
Lemma 4.1.3 For any $x \in \mathcal{L}(f(x_0))$, we have the following relation:

$$\langle\nabla f(x), x - T_M(x)\rangle \ge 0. \tag{4.1.9}$$

If $M > \frac23 L$ and $x \in \mathrm{int}\, F$, then $T_M(x) \in \mathcal{L}(f(x)) \subset F$.


Proof Indeed, multiplying (4.1.8) by $x - T_M(x)$ twice, we get

$$\langle\nabla^2 f(x)(T_M(x) - x), T_M(x) - x\rangle + \tfrac{M}{2}\, r_M^3(x) \ge 0.$$

Therefore, (4.1.9) follows from (4.1.7).

Further, let $M > \frac23 L$. Assume that $T_M(x) \notin F$. Then $r_M(x) > 0$. Consider the following points:

$$y(\alpha) = x + \alpha(T_M(x) - x), \quad \alpha \in [0,1].$$

Since $y(0) \in F$, the value

$$\bar\alpha : y(\bar\alpha) \in \partial\,\mathrm{cl}(F)$$

is well defined. In accordance with our assumption, $\bar\alpha \le 1$ and $y(\alpha) \in F$ for all $\alpha \in [0, \bar\alpha)$. Therefore, using (4.1.4), relation (4.1.7), and inequality (4.1.9), we get

$$\begin{aligned}
f(y(\alpha)) &\le f(x) + \langle\nabla f(x), y(\alpha) - x\rangle + \tfrac12\langle\nabla^2 f(x)(y(\alpha)-x), y(\alpha)-x\rangle + \tfrac{\alpha^3 L}{6}\, r_M^3(x)\\
&= f(x) + \langle\nabla f(x), y(\alpha) - x\rangle + \tfrac12\langle\nabla^2 f(x)(y(\alpha)-x), y(\alpha)-x\rangle + \tfrac{\alpha^3 M}{4}\, r_M^3(x) - \alpha^3\delta\\
&= f(x) + \left(\alpha - \tfrac{\alpha^2}{2}\right)\langle\nabla f(x), T_M(x) - x\rangle - \tfrac{\alpha^2(1-\alpha)}{4}\, M r_M^3(x) - \alpha^3\delta\\
&\le f(x) - \tfrac{\alpha^2(1-\alpha)}{4}\, M r_M^3(x) - \alpha^3\delta,
\end{aligned}$$

where $\delta = \left(\frac{M}{4} - \frac{L}{6}\right) r_M^3(x) > 0$. Thus, $f(y(\bar\alpha)) < f(x)$. Therefore $y(\bar\alpha) \in \mathcal{L}(f(x)) \subset F$. This is a contradiction. Hence, $T_M(x) \in F$. Using the same arguments, we prove that $f(T_M(x)) \le f(x)$. $\square$

Lemma 4.1.4 If $T_M(x) \in F$, then

$$\|\nabla f(T_M(x))\| \le \tfrac12(L+M)\, r_M^2(x). \tag{4.1.10}$$

Proof From Eq. (4.1.6), we get

$$\|\nabla f(x) + \nabla^2 f(x)(T_M(x) - x)\| = \tfrac12 M r_M^2(x).$$

On the other hand, in view of (4.1.3), we have

$$\|\nabla f(T_M(x)) - \nabla f(x) - \nabla^2 f(x)(T_M(x) - x)\| \le \tfrac12 L r_M^2(x).$$

Combining these two relations, we obtain inequality (4.1.10). $\square$

Define

$$\bar f_M(x) = \min_y\left[f(x) + \langle\nabla f(x), y-x\rangle + \tfrac12\langle\nabla^2 f(x)(y-x), y-x\rangle + \tfrac{M}{6}\|y-x\|^3\right].$$

Lemma 4.1.5 For any $x \in F$, we have

$$\bar f_M(x) \le \min_{y\in F}\left[f(y) + \tfrac{L+M}{6}\|y-x\|^3\right], \tag{4.1.11}$$

$$f(x) - \bar f_M(x) \ge \tfrac{M}{12}\, r_M^3(x). \tag{4.1.12}$$

Moreover, if $M \ge L$, then $T_M(x) \in F$ and

$$f(T_M(x)) \le \bar f_M(x). \tag{4.1.13}$$

Proof Indeed, using the lower bound in (4.1.4), for any $y \in F$ we have

$$f(x) + \langle\nabla f(x), y-x\rangle + \tfrac12\langle\nabla^2 f(x)(y-x), y-x\rangle \le f(y) + \tfrac{L}{6}\|y-x\|^3,$$

and inequality (4.1.11) follows from the definition of $\bar f_M(x)$.



Further, in view of the definition of the point $T_M(x)$, relation (4.1.7), and inequality (4.1.9), we have

$$f(x) - \bar f_M(x) = \langle\nabla f(x), x - T_M(x)\rangle - \tfrac12\langle\nabla^2 f(x)(T_M(x)-x), T_M(x)-x\rangle - \tfrac{M}{6}\, r_M^3(x)$$
$$= \tfrac12\langle\nabla f(x), x - T_M(x)\rangle + \tfrac{M}{12}\, r_M^3(x) \ge \tfrac{M}{12}\, r_M^3(x).$$

Finally, if $M \ge L$, then $T_M(x) \in F$ in view of Lemma 4.1.3. Therefore, we get inequality (4.1.13) from the upper bound in (4.1.4). $\square$


4.1.2 General Convergence Results

In this section, our main problem of interest is as follows:

$$\min_{x\in\mathbb{R}^n} f(x), \tag{4.1.14}$$

where the objective function f (·) satisfies Assumption 4.1.1. Recall that the
necessary conditions for a point x ∗ to be a local minimum of problem (4.1.14) are
as follows (see Theorem 1.2.2):

$$\nabla f(x^*) = 0, \qquad \nabla^2 f(x^*) \succeq 0. \tag{4.1.15}$$

Therefore, for arbitrary $x \in F$, we can introduce the following measure of local optimality:

$$\mu_M(x) = \max\left\{\sqrt{\tfrac{2}{L+M}\,\|\nabla f(x)\|},\; -\tfrac{2}{2L+M}\,\lambda_{\min}(\nabla^2 f(x))\right\},$$

where $M$ is a positive parameter, and $\lambda_{\min}(\cdot)$ is the minimal eigenvalue of the corresponding matrix. It is clear that for any $x$ from $F$ the measure $\mu_M(x)$ is non-negative, and it vanishes only at the points satisfying conditions (4.1.15). The analytical form of this measure can be justified by the following result.

Lemma 4.1.6 For any $x \in F$ we have $\mu_M(T_M(x)) \le r_M(x)$.

Proof The proof follows immediately from inequality (4.1.10) and relation (4.1.8), since

$$\nabla^2 f(T_M(x)) \succeq \nabla^2 f(x) - L r_M(x) I \succeq -\left(\tfrac12 M + L\right) r_M(x)\, I. \qquad\square$$


Let L0 ∈ (0, L] be a positive parameter. Consider the following regularized


Newton method.

Cubic Regularization of Newton’s Method

Initialization: Choose x0 ∈ Rn .

(4.1.16)
Iteration k, (k ≥ 0):

1. Find Mk ∈ [L0 , 2L] such that f (TMk (xk )) ≤ f¯Mk (xk ).

2. Set xk+1 = TMk (xk ).

Since f¯M (x) ≤ f (x), this process is monotone:

f (xk+1 ) ≤ f (xk ).

If the constant L is known, in Step 1 of this scheme we can take Mk ≡ L. In the


opposite case, it is possible to apply a simple search procedure; we will discuss its
complexity later in Sect. 4.1.4.2.
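For intuition, here is a small numerical sketch of scheme (4.1.16) with a constant regularization parameter (illustrative only; the real computation of $T_M(x)$ is the subject of Sect. 4.1.4.1). It uses the fact, visible from the optimality condition (4.1.6), that for a positive definite Hessian the step length $r = r_M(x)$ solves the scalar equation $r = \|(\nabla^2 f(x) + \frac{Mr}{2} I)^{-1}\nabla f(x)\|$, which we locate by bisection. The test function and the value $M = 15$ are our own choices, not taken from the book:

```python
import numpy as np

def cubic_newton_step(g, H, M, bisect_iters=100):
    """Displacement T_M(x) - x for the cubic model (4.1.5), assuming H is
    positive definite: find r >= 0 with r = ||(H + (M r / 2) I)^{-1} g||."""
    n = len(g)
    step = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    lo, hi = 0.0, np.linalg.norm(step(0.0))   # ||step(r)|| - r decreases in r
    for _ in range(bisect_iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid: lo = mid
        else: hi = mid
    return -step(0.5 * (lo + hi))

# test function f(x) = sum_i (x_i^4 / 4 + x_i^2 / 2): strongly convex,
# with a Lipschitz continuous Hessian on bounded sets
grad = lambda x: x**3 + x
hess = lambda x: np.diag(3.0 * x**2 + 1.0)

x = np.array([1.0, 2.0])
for _ in range(50):               # method (4.1.16) with a fixed M_k = M
    x = x + cubic_newton_step(grad(x), hess(x), M=15.0)
```

After a few globally safe steps the iterates enter the region of quadratic convergence, and the gradient norm collapses to essentially machine precision.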
Let us start from the following simple observation.
Theorem 4.1.1 Let the sequence {xi } be generated by method (4.1.16). Assume that
the objective function f (·) is bounded below:

$$f(x) \ge f^* \quad \forall x \in F.$$

Then $\sum\limits_{i=0}^{\infty} r_{M_i}^3(x_i) \le \frac{12}{L_0}\,(f(x_0) - f^*)$. Hence, $\lim\limits_{i\to\infty}\mu_L(x_i) = 0$, and for any $k \ge 1$ we have

$$\min_{1\le i\le k}\mu_L(x_i) \le \frac{8}{3}\left(\frac{3(f(x_0) - f^*)}{2k\cdot L_0}\right)^{1/3}. \tag{4.1.17}$$

Proof In view of inequality (4.1.12), we have

$$f(x_0) - f^* \ge \sum_{i=0}^{k-1}\,[f(x_i) - f(x_{i+1})] \ge \sum_{i=0}^{k-1}\frac{M_i}{12}\, r_{M_i}^3(x_i) \ge \frac{L_0}{12}\sum_{i=0}^{k-1} r_{M_i}^3(x_i).$$

It remains to use the statement of Lemma 4.1.6 and the upper bound on $M_k$ at Step 1 of (4.1.16):

$$r_{M_i}(x_i) \ge \mu_{M_i}(x_{i+1}) \ge \tfrac{3}{4}\,\mu_L(x_{i+1}). \qquad\square$$

Note that inequality (4.1.17) implies that

$$\min_{1\le i\le k}\|\nabla f(x_i)\| \le O\!\left(k^{-2/3}\right).$$

We have seen that for a gradient scheme, the right-hand side of this inequality can be of the order $O\!\left(k^{-1/2}\right)$ (see inequality (1.2.24)).
Theorem 4.1.1 helps us to get convergence results in many different situations.
We mention only one of them.
Theorem 4.1.2 Let the sequence {xi } be generated by method (4.1.16). Let us
assume that for some i ≥ 0 the set L (f (xi )) is bounded. Then there exists a limit

lim f (xi ) = f ∗ .
i→∞

The set X∗ of limit points of this sequence is non-empty. Moreover, this is a


connected set such that for any x ∗ ∈ X∗ we have

$$f(x^*) = f^*, \quad \nabla f(x^*) = 0, \quad \nabla^2 f(x^*) \succeq 0.$$

Proof The proof of this theorem can be derived from Theorem 4.1.1 in a standard
way. 
Let us describe now the behavior of the process (4.1.16) in a neighborhood of a
non-degenerate stationary point, which is not a point of local minimum.
Lemma 4.1.7 Let x̄ ∈ F be a non-degenerate saddle point or a point of local
maximum of the function f (·):

∇f (x̄) = 0, λmin (∇ 2 f (x̄)) < 0.

Then there exist constants $\epsilon, \delta > 0$ such that whenever the point $x_i$ appears in the set $Q = \{x : \|x - \bar x\| \le \epsilon,\; f(x) \ge f(\bar x)\}$ (for instance, if $x_i = \bar x$), then the next point $x_{i+1}$ leaves the set $Q$:

$$f(x_{i+1}) \le f(\bar x) - \delta.$$

Proof Let us choose a direction $d$, $\|d\| = 1$, with negative curvature:

$$\langle\nabla^2 f(\bar x)\, d, d\rangle \equiv -2\sigma < 0,$$

and let $\bar\tau > 0$ be small enough that $\bar x \pm \bar\tau d \in F$. Define $\epsilon = \min\left\{\frac{\sigma}{2L}, \bar\tau\right\}$ and $\delta = \frac{\sigma}{6}\epsilon^2$. Then, in view of inequality (4.1.11), the upper bound on $M_i$, and inequality (4.1.4), for $|\tau| \le \bar\tau$ we get the following estimate:

$$f(x_{i+1}) \le f(\bar x + \tau d) + \tfrac{L}{2}\|\bar x + \tau d - x_i\|^3 \le f(\bar x) - \sigma\tau^2 + \tfrac{L}{6}|\tau|^3 + \tfrac{L}{2}\left(\epsilon^2 + 2\tau\langle d, \bar x - x_i\rangle + \tau^2\right)^{3/2}.$$

Since we are free in the choice of the sign of $\tau$, we can guarantee that

$$f(x_{i+1}) \le f(\bar x) - \sigma\tau^2 + \tfrac{L}{6}|\tau|^3 + \tfrac{L}{2}\left(\epsilon^2 + \tau^2\right)^{3/2}, \quad |\tau| \le \bar\tau.$$

Let us choose $|\tau| = \epsilon \le \bar\tau$. Then

$$f(x_{i+1}) \le f(\bar x) - \sigma\tau^2 + \tfrac{5L}{3}|\tau|^3 \le f(\bar x) - \sigma\tau^2 + \tfrac{5L}{3}\cdot\tfrac{\sigma}{2L}\cdot\tau^2 = f(\bar x) - \tfrac16\sigma\tau^2.$$

Since the process (4.1.16) is monotone with respect to the objective function, it will never return to $Q$. $\square$
Consider now the behavior of the regularized Newton scheme (4.1.16) in a
neighborhood of a non-degenerate local minimum. It appears that in such a situation,
condition L0 > 0 is no longer necessary. Let us analyze a relaxed version
of (4.1.16):

xk+1 = TMk (xk ), k ≥ 0 (4.1.18)

where $M_k \in (0, 2L]$. Define

$$\delta_k = \frac{L\|\nabla f(x_k)\|}{\lambda_{\min}^2(\nabla^2 f(x_k))}.$$

Theorem 4.1.3 Let $\nabla^2 f(x_0) \succ 0$ and $\delta_0 \le \frac14$. Let the points $\{x_k\}$ be generated by method (4.1.18). Then:

1. For all $k \ge 0$ the values $\delta_k$ are well defined and converge quadratically to zero:

$$\delta_{k+1} \le \frac32\left(\frac{\delta_k}{1-\delta_k}\right)^2 \le \frac83\delta_k^2 \le \frac23\delta_k, \quad k \ge 0. \tag{4.1.19}$$

2. The minimal eigenvalues of all Hessians $\nabla^2 f(x_k)$ satisfy the following bounds:

$$e^{-1}\lambda_{\min}(\nabla^2 f(x_0)) \le \lambda_{\min}(\nabla^2 f(x_k)) \le e^{3/4}\lambda_{\min}(\nabla^2 f(x_0)). \tag{4.1.20}$$


3. The whole sequence $\{x_i\}$ converges quadratically to a point $x^*$, which is a non-degenerate local minimum of the function $f$. In particular, for any $k \ge 1$ we have

$$\|\nabla f(x_k)\| \le \frac{9 e^{3/2}\,\lambda_{\min}^2(\nabla^2 f(x_0))}{16 L}\left(\frac12\right)^{2^k}. \tag{4.1.21}$$

Proof Assume that $\nabla^2 f(x_k) \succ 0$ for some $k \ge 0$. Then the corresponding $\delta_k$ is well defined. Assume that $\delta_k \le \frac14$. From Eq. (4.1.6), we have

$$r_{M_k}(x_k) = \|T_{M_k}(x_k) - x_k\| = \left\|\left(\nabla^2 f(x_k) + \tfrac{M_k}{2}\, r_{M_k}(x_k)\, I_n\right)^{-1}\nabla f(x_k)\right\| \le \frac{\|\nabla f(x_k)\|}{\lambda_{\min}(\nabla^2 f(x_k))} = \frac1L\,\lambda_{\min}(\nabla^2 f(x_k))\,\delta_k. \tag{4.1.22}$$

Note also that $\nabla^2 f(x_{k+1}) \succeq \nabla^2 f(x_k) - r_{M_k}(x_k) L\, I_n$ by (4.1.2). Therefore,

$$\lambda_{\min}(\nabla^2 f(x_{k+1})) \ge \lambda_{\min}(\nabla^2 f(x_k)) - r_{M_k}(x_k) L \ge \lambda_{\min}(\nabla^2 f(x_k)) - \frac{L\|\nabla f(x_k)\|}{\lambda_{\min}(\nabla^2 f(x_k))} = (1-\delta_k)\,\lambda_{\min}(\nabla^2 f(x_k)). \tag{4.1.23}$$

Thus, $\nabla^2 f(x_{k+1})$ is also positive definite. Moreover, using inequality (4.1.10) and the upper bound for $M_k$, we obtain

$$\delta_{k+1} = \frac{L\|\nabla f(x_{k+1})\|}{\lambda_{\min}^2(\nabla^2 f(x_{k+1}))} \le \frac{3L^2 r_{M_k}^2(x_k)}{2\lambda_{\min}^2(\nabla^2 f(x_{k+1}))} \le \frac{3L^2\|\nabla f(x_k)\|^2}{2\lambda_{\min}^4(\nabla^2 f(x_k))(1-\delta_k)^2} = \frac32\left(\frac{\delta_k}{1-\delta_k}\right)^2 \le \frac83\delta_k^2.$$

Thus, $\delta_{k+1} \le \frac14$, and we prove (4.1.19) by induction. We also get $\delta_{k+1} \le \frac23\delta_k$, and, since $\delta_0 \le \frac14$, we come to the following bound:

$$\sum_{i=0}^\infty \delta_i \le \frac{\delta_0}{1-\frac23} \le 1 - \delta_0. \tag{4.1.24}$$

Further,

$$\ln\frac{\lambda_{\min}(\nabla^2 f(x_k))}{\lambda_{\min}(\nabla^2 f(x_0))} \stackrel{(4.1.23)}{\ge} \sum_{i=0}^\infty\ln(1-\delta_i) \ge -\sum_{i=0}^\infty\frac{\delta_i}{1-\delta_i} \ge -\frac{1}{1-\delta_0}\sum_{i=0}^\infty\delta_i \ge -1.$$

In order to get an upper bound, note that $\nabla^2 f(x_{k+1}) \preceq \nabla^2 f(x_k) + r_{M_k}(x_k) L\, I_n$ by (4.1.2). Hence, in view of (4.1.22),

$$\lambda_{\min}(\nabla^2 f(x_{k+1})) \le \lambda_{\min}(\nabla^2 f(x_k)) + r_{M_k}(x_k) L \le (1+\delta_k)\,\lambda_{\min}(\nabla^2 f(x_k)).$$

Therefore,

$$\ln\frac{\lambda_{\min}(\nabla^2 f(x_k))}{\lambda_{\min}(\nabla^2 f(x_0))} \le \sum_{i=0}^\infty\ln(1+\delta_i) \le \sum_{i=0}^\infty\delta_i \le \frac34.$$

It remains to prove Item 3 of the theorem. In view of inequalities (4.1.22) and (4.1.20), we have

$$r_{M_k}(x_k) \le \frac1L\,\lambda_{\min}(\nabla^2 f(x_k))\,\delta_k \le \frac{e^{3/4}}{L}\,\lambda_{\min}(\nabla^2 f(x_0))\,\delta_k.$$

Thus, in view of the bound (4.1.24), $\{x_i\}$ is a Cauchy sequence, which has a unique limit point $x^*$. Since the eigenvalues of $\nabla^2 f(x)$ are continuous functions of $x$, from the first inequality in (4.1.20) we conclude that $\nabla^2 f(x^*) \succ 0$.

Further, by inequality (4.1.19), we get the bound

$$\delta_{k+1} \le \frac{\delta_k^2}{(1-\delta_0)^2} \le \frac{16}{9}\,\delta_k^2.$$

Defining $\hat\delta_k = \frac{16}{9}\delta_k$, we get $\hat\delta_{k+1} \le \hat\delta_k^2$. Thus, for any $k \ge 1$, we have

$$\delta_k = \frac{9}{16}\,\hat\delta_k \le \frac{9}{16}\,\hat\delta_0^{2^k} < \frac{9}{16}\left(\frac12\right)^{2^k}.$$

Using the upper bound in (4.1.20), we get the last upper bound (4.1.21). $\square$


4.1.3 Global Efficiency Bounds on Specific Problem Classes

In the previous section, we have already seen that the modified Newton scheme
can be supported by a global efficiency estimate (4.1.17) on a general class of non-
convex problems. The main goal of this section is to show that by specifying some
additional properties of non-convex functions, it is possible to get for this method
much better performance guarantees. A nice feature of method (4.1.16) consists in
its ability to automatically adjust its rate of convergence to the specific problem
classes.

4.1.3.1 Star-Convex Functions

Let us start from a definition.


Definition 4.1.1 We call a function $f$ star-convex if its set of global minima $X^*$ is not empty and for any $x^* \in X^*$ we have

$$f(\alpha x^* + (1-\alpha)x) \le \alpha f(x^*) + (1-\alpha)f(x) \quad \forall x \in F,\; \forall\alpha \in [0,1]. \tag{4.1.25}$$

A particular example of a star-convex function is a usual convex function. However, in general a star-convex function need not be convex, even in the scalar case. For instance, $f(x) = |x|(1 - e^{-|x|})$, $x \in \mathbb{R}$, is star-convex but not convex. Star-convex functions arise quite often in optimization problems related to sums of squares. For example, the function $f(x,y) = x^2y^2 + x^2 + y^2$ with $(x,y) \in \mathbb{R}^2$ belongs to this class.
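The two claims about the scalar example can be checked numerically (our own illustration, not from the book): star-convexity with respect to $x^* = 0$ follows from $e^{-\alpha|x|} \ge e^{-|x|}$ for $\alpha \in [0,1]$, while convexity fails since $f''(x) = (2-x)e^{-x} < 0$ for $x > 2$.

```python
import math

f = lambda x: abs(x) * (1 - math.exp(-abs(x)))

# star-convexity (4.1.25) with x* = 0, f(x*) = 0: f(alpha * x) <= alpha * f(x)
star_ok = all(f(0.01 * a * x) <= 0.01 * a * f(x) + 1e-12
              for x in range(-50, 51) for a in range(101))

# convexity fails: the midpoint lies above the chord on the concave part
x, y = 3.0, 6.0
not_convex = f(0.5 * (x + y)) > 0.5 * (f(x) + f(y))
```

Both flags come out true, confirming that this function separates the class of star-convex functions from the convex ones.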
Theorem 4.1.4 Assume that the objective function in the problem (4.1.14) is star-
convex, and the set F is bounded: diam F = D < ∞. Let the sequence {xk } be
generated by method (4.1.16).
1. If $f(x_0) - f^* \ge \frac32 LD^3$, then $f(x_1) - f^* \le \frac12 LD^3$.
2. If $f(x_0) - f^* \le \frac32 LD^3$, then the rate of convergence of process (4.1.16) is as follows:

$$f(x_k) - f^* \le \frac{3LD^3}{2\left(1 + \frac{k}{3}\right)^2}, \quad k \ge 0. \tag{4.1.26}$$

Proof Indeed, in view of inequality (4.1.11), the upper bound on the parameters $M_k$, and definition (4.1.25), for any $k \ge 0$ we have:

$$\begin{aligned}
f(x_{k+1}) - f(x^*) \;&\le\; \min_y\left[\, f(y) - f(x^*) + \tfrac{L}{2}\|y - x_k\|^3 \;:\; y = \alpha x^* + (1-\alpha)x_k, \; \alpha \in [0,1]\,\right]\\
&\le\; \min_{\alpha \in [0,1]}\left[\, f(x_k) - f(x^*) - \alpha(f(x_k) - f(x^*)) + \tfrac{L}{2}\alpha^3\|x^* - x_k\|^3\,\right]\\
&\le\; \min_{\alpha \in [0,1]}\left[\, f(x_k) - f(x^*) - \alpha(f(x_k) - f(x^*)) + \tfrac{L}{2}\alpha^3 D^3\,\right].
\end{aligned}$$

The minimum of the objective function in the last minimization problem in $\alpha \ge 0$ is achieved for

$$\alpha_k \;=\; \left(\frac{2(f(x_k) - f(x^*))}{3LD^3}\right)^{1/2}.$$
4.1 Cubic Regularization of Newton’s Method 253

If $\alpha_k \ge 1$, then the actual optimal value corresponds to $\alpha = 1$. In this case,

$$f(x_{k+1}) - f(x^*) \;\le\; \tfrac{1}{2}LD^3.$$

Since the process (4.1.16) is monotone, this can happen only at the first iteration of the method.

Assume that $\alpha_k \le 1$. Then

$$f(x_{k+1}) - f(x^*) \;\le\; f(x_k) - f(x^*) - \frac{2}{3}\,(f(x_k) - f(x^*))^{3/2}\sqrt{\frac{2}{3LD^3}}.$$

Or, using the notation $\alpha_k = \left(\frac{2(f(x_k) - f(x^*))}{3LD^3}\right)^{1/2}$, this is $\alpha_{k+1}^2 \le \alpha_k^2 - \frac{2}{3}\alpha_k^3 < \alpha_k^2$. Therefore,

$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} \;=\; \frac{\alpha_k - \alpha_{k+1}}{\alpha_k\alpha_{k+1}} \;=\; \frac{\alpha_k^2 - \alpha_{k+1}^2}{\alpha_k\alpha_{k+1}(\alpha_k + \alpha_{k+1})} \;\ge\; \frac{\alpha_k^2 - \alpha_{k+1}^2}{2\alpha_k^3} \;\ge\; \frac{1}{3}.$$

Thus, $\frac{1}{\alpha_k} \ge \frac{1}{\alpha_0} + \frac{k}{3} \ge 1 + \frac{k}{3}$, and (4.1.26) follows. $\square$
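The key recurrence in the proof, $\alpha_{k+1}^2 \le \alpha_k^2 - \frac{2}{3}\alpha_k^3$, can be traced numerically. The short sketch below iterates its worst case (equality, starting from $\alpha_0 = 1$) and checks the bound $1/\alpha_k \ge 1 + k/3$ that yields (4.1.26).

```python
import math

a = 1.0                    # worst case alpha_0 = 1 (the boundary of alpha_k <= 1)
for k in range(500):
    # the claimed lower bound 1/alpha_k >= 1 + k/3
    assert 1.0 / a >= 1.0 + k / 3.0 - 1e-9
    # worst-case update: alpha_{k+1}^2 = alpha_k^2 - (2/3) alpha_k^3
    a = math.sqrt(a * a - (2.0 / 3.0) * a ** 3)
```
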

Let us now introduce the notion of a generalized non-degenerate global minimum.
Definition 4.1.2 We say that the optimal set X∗ of function f (·) is globally non-
degenerate if there exists a constant μ > 0 such that for any x ∈ F we have

$$f(x) - f^* \;\ge\; \frac{\mu}{2}\,\rho^2(x, X^*), \tag{4.1.27}$$

where f ∗ is the global minimal value of the function f (·), and ρ(x, X∗ ) is the
Euclidean distance from x to X∗ .
Of course, this property holds for strongly convex functions (see (3.2.43); in this
case X∗ is a singleton). However, it can also hold for some non-convex functions.
As an example, we can look at the function

$$f(x) \;=\; (\|x\|^2 - 1)^2, \qquad X^* \;=\; \{x : \|x\| = 1\} \subset \mathbb{R}^n.$$

Note also that if the set X∗ has a connected non-trivial component, then the Hessians
of the objective function at these points are necessarily degenerate. However, as we
will see, in this situation the modified Newton scheme still ensures a super-linear
rate of convergence. Define
$$\bar{\omega} \;=\; \frac{1}{L^2}\left(\frac{\mu}{2}\right)^3.$$

Theorem 4.1.5 Let a function f be star-convex. Assume that it also has a globally
non-degenerate optimal set. Then the performance of the scheme (4.1.16) on this
problem is as follows.

1. If $f(x_0) - f(x^*) \ge \frac{4}{9}\bar{\omega}$, then at the first phase of the process we get the following rate of convergence:

$$f(x_k) - f(x^*) \;\le\; \left[(f(x_0) - f(x^*))^{1/4} - \frac{k}{6}\sqrt{\frac{2}{3}}\,\bar{\omega}^{1/4}\right]^4. \tag{4.1.28}$$

This phase is terminated as soon as $f(x_{k_0}) - f(x^*) \le \frac{4}{9}\bar{\omega}$ for some $k_0 \ge 0$.

2. For $k \ge k_0$ the sequence converges superlinearly:

$$f(x_{k+1}) - f(x^*) \;\le\; \frac{1}{2}(f(x_k) - f(x^*))\sqrt{\frac{f(x_k) - f(x^*)}{\bar{\omega}}}. \tag{4.1.29}$$

Proof Denote by $x_k^*$ the projection of the point $x_k$ onto the optimal set $X^*$. In view of inequality (4.1.11), the upper bound on the parameters $M_k$, and definitions (4.1.25), (4.1.27), for any $k \ge 0$ we have:

$$\begin{aligned}
f(x_{k+1}) - f(x^*) \;&\le\; \min_{\alpha \in [0,1]}\left[\, f(x_k) - f(x^*) - \alpha(f(x_k) - f(x^*)) + \tfrac{L}{2}\alpha^3\|x_k^* - x_k\|^3\,\right]\\
&\le\; \min_{\alpha \in [0,1]}\left[\, f(x_k) - f(x^*) - \alpha(f(x_k) - f(x^*)) + \tfrac{L}{2}\alpha^3\left(\tfrac{2}{\mu}(f(x_k) - f(x^*))\right)^{3/2}\,\right].
\end{aligned}$$

Defining $\Delta_k = (f(x_k) - f(x^*))/\bar{\omega}$, we get the inequality

$$\Delta_{k+1} \;\le\; \min_{\alpha \in [0,1]}\left[\,\Delta_k - \alpha\Delta_k + \frac{1}{2}\alpha^3\Delta_k^{3/2}\,\right]. \tag{4.1.30}$$

Note that the first-order optimality condition for $\alpha \ge 0$ in this problem is

$$\alpha_k \;=\; \left(\frac{2}{3}\,\Delta_k^{-1/2}\right)^{1/2}.$$

Therefore, if $\Delta_k \ge \frac{4}{9}$, we get

$$\Delta_{k+1} \;\le\; \Delta_k - \left(\frac{2}{3}\right)^{3/2}\Delta_k^{3/4}.$$

Defining $u_k = \frac{9}{4}\Delta_k$, we get a simpler relation:

$$u_{k+1} \;\le\; u_k - \frac{2}{3}u_k^{3/4},$$

which is applicable if $u_k \ge 1$. Since the right-hand side of this inequality is increasing for $u_k \ge \frac{1}{16}$, let us prove by induction that

$$u_k \;\le\; \left(u_0^{1/4} - \frac{k}{6}\right)^4.$$

Indeed, the inequality

$$\left(u_0^{1/4} - \frac{k+1}{6}\right)^4 \;\ge\; \left(u_0^{1/4} - \frac{k}{6}\right)^4 - \frac{2}{3}\left(u_0^{1/4} - \frac{k}{6}\right)^3$$

is clearly equivalent to

$$\begin{aligned}
\frac{2}{3}\left(u_0^{1/4} - \frac{k}{6}\right)^3 \;\ge\;& \left(u_0^{1/4} - \frac{k}{6}\right)^4 - \left(u_0^{1/4} - \frac{k+1}{6}\right)^4\\
=\; \frac{1}{6}\Bigg[&\left(u_0^{1/4} - \frac{k}{6}\right)^3 + \left(u_0^{1/4} - \frac{k}{6}\right)^2\left(u_0^{1/4} - \frac{k+1}{6}\right)\\
&+ \left(u_0^{1/4} - \frac{k}{6}\right)\left(u_0^{1/4} - \frac{k+1}{6}\right)^2 + \left(u_0^{1/4} - \frac{k+1}{6}\right)^3\Bigg],
\end{aligned}$$

which is obviously true.


Finally, if uk ≤ 1, then the optimal value for α in (4.1.30) is equal to one, and we
get (4.1.29). 

4.1.3.2 Gradient-Dominated Functions

Let us now look at another interesting class of nonconvex functions.


Definition 4.1.3 A function f (·) is called gradient dominated of degree p ∈ [1, 2]
if it attains a global minimum at some point x ∗ and for any x ∈ F we have

$$f(x) - f(x^*) \;\le\; \tau_f\|\nabla f(x)\|^p, \tag{4.1.31}$$

where τf is a positive constant. The parameter p is called the degree of domination.


We do not assume here that the global minimum of function f is unique. Let us
give several examples of gradient dominated functions.
Example 4.1.1 (Convex Functions) Let f be convex on Rn . Assume it achieves its
minimum at point x ∗ . Then, for any x ∈ Rn with x − x ∗  < R, we have

$$f(x) - f(x^*) \;\overset{(2.1.2)}{\le}\; \langle\nabla f(x), x - x^*\rangle \;\le\; \|\nabla f(x)\|\cdot R.$$

Thus, the function $f$ is a gradient dominated function of degree one on the set $F = \{x : \|x - x^*\| < R\}$ with $\tau_f = R$. $\square$

Example 4.1.2 (Strongly Convex Functions) Let f be differentiable and strongly


convex on Rn . This means that there exists a constant μ > 0 such that

$$f(y) \;\overset{(2.1.20)}{\ge}\; f(x) + \langle\nabla f(x), y - x\rangle + \frac{1}{2}\mu\|y - x\|^2, \tag{4.1.32}$$

for all $x, y \in \mathbb{R}^n$. Then, minimizing both sides of this inequality in $y$, we obtain

$$f(x) - f(x^*) \;\le\; \frac{1}{2\mu}\|\nabla f(x)\|^2 \quad \forall x \in \mathbb{R}^n.$$

Thus, $f$ is a gradient dominated function of degree two on the set $F = \mathbb{R}^n$ with $\tau_f = \frac{1}{2\mu}$. $\square$
Example 4.1.3 (Sum of Squares) Consider a system of non-linear equations:

g(x) = 0, (4.1.33)

where g(x) = (g1 (x), . . . , gm (x))T : Rn → Rm is a differentiable vector function.


We assume that m ≤ n and that there exists a solution x ∗ to (4.1.33). Let us assume
in addition that the Jacobian

J T (x) = (∇g1 (x), . . . , ∇gm (x))

is uniformly non-degenerate on a certain convex set F containing x ∗ . This means


that the value

$$\sigma \;\equiv\; \inf_{x \in F}\lambda_{\min}\left(J(x)J^T(x)\right)$$

is positive. Consider the function

$$f(x) \;=\; \frac{1}{2}\sum_{i=1}^m g_i^2(x).$$

Clearly, $f(x^*) = 0$. Note that $\nabla f(x) = J^T(x)g(x)$. Therefore,

$$\|\nabla f(x)\|^2 \;=\; \langle J(x)J^T(x)g(x), g(x)\rangle \;\ge\; \sigma\|g(x)\|^2 \;=\; 2\sigma(f(x) - f(x^*)).$$

Thus, $f$ is a gradient dominated function on $F$ of degree two with $\tau_f = \frac{1}{2\sigma}$. Note that, for $m < n$, the set of solutions to (4.1.33) is not a singleton and therefore the Hessians of the function $f$ are necessarily degenerate at the solutions. $\square$
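This degree-two domination is easy to observe on a tiny instance. The sketch below uses an illustrative choice $g_1(x) = x_1 - 1$, $g_2(x) = x_2 + x_1^2$ (so $m = n = 2$ and $x^* = (1,-1)$), estimates $\sigma$ on a grid over a box $F$, and checks inequality (4.1.31) with $\tau_f = \frac{1}{2\sigma}$ at the same grid points.

```python
import numpy as np

g = lambda x: np.array([x[0] - 1.0, x[1] + x[0]**2])    # g(x*) = 0 at x* = (1, -1)
J = lambda x: np.array([[1.0, 0.0], [2.0 * x[0], 1.0]]) # Jacobian of g
f = lambda x: 0.5 * g(x) @ g(x)                         # sum of squares, f(x*) = 0

pts = [np.array([a, b]) for a in np.linspace(-2.0, 2.0, 9)
                        for b in np.linspace(-2.0, 2.0, 9)]

# sigma = inf lambda_min(J J^T) over the sampled box; here det(J J^T) = 1, so sigma > 0
sigma = min(np.linalg.eigvalsh(J(x) @ J(x).T)[0] for x in pts)

# degree-two gradient domination: ||grad f||^2 = <J J^T g, g> >= 2 sigma (f - f*)
ok = all(np.linalg.norm(J(x).T @ g(x))**2 >= 2.0 * sigma * f(x) - 1e-9 for x in pts)
```
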
In order to study the complexity of minimization of the gradient dominated
functions, we need one auxiliary result.

Lemma 4.1.8 At each step of method (4.1.16) we can guarantee the following decrease of the objective function:

$$f(x_k) - f(x_{k+1}) \;\ge\; \frac{L_0\|\nabla f(x_{k+1})\|^{3/2}}{3\sqrt{2}\,(L + L_0)^{3/2}}, \quad k \ge 0. \tag{4.1.34}$$

Proof In view of inequalities (4.1.12) and (4.1.10), we get

$$f(x_k) - f(x_{k+1}) \;\ge\; \frac{M_k}{12}r_{M_k}^3(x_k) \;\ge\; \frac{M_k}{12}\left(\frac{2\|\nabla f(x_{k+1})\|}{L + M_k}\right)^{3/2} \;=\; \frac{M_k\|\nabla f(x_{k+1})\|^{3/2}}{3\sqrt{2}\,(L + M_k)^{3/2}}.$$

It remains to note that the right-hand side of this inequality is increasing in $M_k \le 2L$. Thus, we can replace $M_k$ by its lower bound $L_0$. $\square$
Let us start from the analysis of gradient dominated functions of degree one. The
following theorem shows that the process can be partitioned into two phases. The
first phase (with large values of the objective function) is very short, while at the
second phase we can guarantee the rate of convergence of the order O(1/k 2 ).
Theorem 4.1.6 Let us use method (4.1.16) for minimizing a gradient dominated
function f of degree p = 1.
1. If the initial value of the objective function is large enough:

$$f(x_0) - f(x^*) \;\ge\; \hat{\omega} \;\overset{\mathrm{def}}{=}\; \frac{18}{L_0^2}\,\tau_f^3(L + L_0)^3,$$

then the process converges to the region $\mathscr{L}(\hat{\omega})$ superlinearly:

$$\ln\left(\frac{1}{\hat{\omega}}(f(x_k) - f(x^*))\right) \;\le\; \left(\frac{2}{3}\right)^k\ln\left(\frac{1}{\hat{\omega}}(f(x_0) - f(x^*))\right). \tag{4.1.35}$$

2. If $f(x_0) - f(x^*) \le \gamma^2\hat{\omega}$ for some $\gamma > 1$, then we have the following estimate for the rate of convergence:

$$f(x_k) - f(x^*) \;\le\; \hat{\omega}\cdot\frac{\gamma^2\left(2 + \frac{3}{2}\gamma\right)^2}{\left(2 + \left(k + \frac{3}{2}\right)\gamma\right)^2}, \quad k \ge 0. \tag{4.1.36}$$

Proof Using inequalities (4.1.34) and (4.1.31) with $p = 1$, we get

$$f(x_k) - f(x_{k+1}) \;\ge\; \frac{L_0(f(x_{k+1}) - f(x^*))^{3/2}}{3\sqrt{2}\,(L + L_0)^{3/2}\tau_f^{3/2}} \;=\; \hat{\omega}^{-1/2}(f(x_{k+1}) - f(x^*))^{3/2}.$$

Defining $\delta_k = (f(x_k) - f(x^*))/\hat{\omega}$, we obtain

$$\delta_k - \delta_{k+1} \;\ge\; \delta_{k+1}^{3/2}. \tag{4.1.37}$$

Hence, $\ln\delta_k \ge \ln\delta_{k+1} + \ln(1 + \delta_{k+1}^{1/2}) \ge \frac{3}{2}\ln\delta_{k+1}$. Thus, $\ln\delta_k \le \left(\frac{2}{3}\right)^k\ln\delta_0$, and this is inequality (4.1.35).

Let us now prove inequality (4.1.36). Using inequality (4.1.37), we have

$$\begin{aligned}
\frac{1}{\sqrt{\delta_{k+1}}} - \frac{1}{\sqrt{\delta_k}} \;&\ge\; \frac{1}{\sqrt{\delta_{k+1}}} - \frac{1}{\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}}} \;=\; \frac{\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}} - \sqrt{\delta_{k+1}}}{\sqrt{\delta_{k+1}}\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}}} \;=\; \frac{\sqrt{1 + \sqrt{\delta_{k+1}}} - 1}{\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}}}\\
&=\; \frac{1}{\sqrt{1 + \sqrt{\delta_{k+1}}}\left(1 + \sqrt{1 + \sqrt{\delta_{k+1}}}\right)} \;=\; \frac{1}{1 + \sqrt{\delta_{k+1}} + \sqrt{1 + \sqrt{\delta_{k+1}}}}\\
&\ge\; \frac{1}{2 + \frac{3}{2}\sqrt{\delta_{k+1}}} \;\ge\; \frac{1}{2 + \frac{3}{2}\sqrt{\delta_0}}.
\end{aligned}$$

Thus, $\frac{1}{\sqrt{\delta_k}} \ge \frac{1}{\gamma} + \frac{k}{2 + \frac{3}{2}\gamma}$, and this is (4.1.36). $\square$

The reader should not be confused by the superlinear rate of convergence established
by (4.1.35). It is valid only for the first stage of the process and describes a
convergence to the set L (ω̂). For example, the first stage of the process discussed
in Theorem 4.1.4 is even shorter: it takes just one iteration.
Let us now look at the gradient dominated functions of degree two. Here we can
also see two phases of the process.
Theorem 4.1.7 Let us apply method (4.1.16) for minimizing a gradient dominated
function f of degree p = 2.
1. If the initial value of the objective function is large enough:

$$f(x_0) - f(x^*) \;\ge\; \tilde{\omega} \;\overset{\mathrm{def}}{=}\; \frac{L_0^4}{324(L + L_0)^6\tau_f^3}, \tag{4.1.38}$$

then at its first phase the process converges as follows:

$$f(x_k) - f(x^*) \;\le\; (f(x_0) - f(x^*))\cdot e^{-k\sigma}, \tag{4.1.39}$$

where $\sigma = \frac{\tilde{\omega}^{1/4}}{\tilde{\omega}^{1/4} + (f(x_0) - f(x^*))^{1/4}}$. This phase ends at the first iteration $k_0$ for which (4.1.38) does not hold.

2. For $k \ge k_0$, the rate of convergence is super-linear:

$$f(x_{k+1}) - f(x^*) \;\le\; \tilde{\omega}\cdot\left(\frac{f(x_k) - f(x^*)}{\tilde{\omega}}\right)^{4/3}. \tag{4.1.40}$$

Proof Using inequalities (4.1.34) and (4.1.31) with $p = 2$, we get

$$f(x_k) - f(x_{k+1}) \;\ge\; \frac{L_0(f(x_{k+1}) - f(x^*))^{3/4}}{3\sqrt{2}\,(L + L_0)^{3/2}\tau_f^{3/4}} \;=\; \tilde{\omega}^{1/4}(f(x_{k+1}) - f(x^*))^{3/4}.$$

Defining $\delta_k = (f(x_k) - f(x^*))/\tilde{\omega}$, we obtain

$$\delta_k \;\ge\; \delta_{k+1} + \delta_{k+1}^{3/4}. \tag{4.1.41}$$

Hence,

$$\frac{\delta_k}{\delta_{k+1}} \;\ge\; 1 + \delta_{k+1}^{-1/4} \;\ge\; 1 + \delta_0^{-1/4} \;=\; \frac{1}{1 - \sigma} \;\ge\; e^{\sigma},$$

and we get (4.1.39). Finally, from (4.1.41) we have $\delta_{k+1} \le \delta_k^{4/3}$, which is (4.1.40). $\square$
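The two regimes of Theorem 4.1.7 can be seen directly in the recurrence (4.1.41). The sketch below iterates its worst case (equality), solving $\delta_{k+1} + \delta_{k+1}^{3/4} = \delta_k$ by bisection at each step (the starting value $\delta_0 = 10$ is an arbitrary illustrative choice), and checks both the linear decay of (4.1.39) and the superlinear relation $\delta_{k+1} \le \delta_k^{4/3}$.

```python
import math

def next_delta(d):
    # worst case of (4.1.41): solve t + t**0.75 = d for t in [0, d] by bisection
    lo, hi = 0.0, d
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid + mid ** 0.75 < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d0 = 10.0
sigma = 1.0 / (1.0 + d0 ** 0.25)     # the contraction factor of (4.1.39) in delta-form
d = d0
for k in range(1, 21):
    d_prev, d = d, next_delta(d)
    assert d <= d0 * math.exp(-k * sigma) + 1e-9    # linear decay, first phase
    assert d <= d_prev ** (4.0 / 3.0) + 1e-12       # superlinear relation (4.1.40)
```
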

Comparing the statement of Theorem 4.1.7 with other theorems of this section, we
can see a significant difference. This is the first time when the initial residual f (x0 )−
f (x ∗ ) enters the complexity estimate of the first phase of the process in a polynomial
way. In all other cases, the dependence on this value is much weaker. However, we
will observe a similar situation in Sect. 5.2, when we will address the complexity of
minimizing self-concordant functions.
Note that it is possible to embed the gradient dominated functions of degree two
into the class of gradient dominated functions of degree one. However, it is easy to
check that this only makes the efficiency estimates established by Theorem 4.1.7
worse.

4.1.3.3 Nonlinear Transformations of Convex Functions

Let u(x) : Rn → Rn be a non-degenerate vector function. Denote by v(u) its


inverse:

v(u) : Rn → Rn , v(u(x)) ≡ x.

Consider the following function:

f (x) = φ(u(x)),

where φ(u) is a convex function with bounded level sets. Denote by x ∗ ≡ v(u∗ ) its
minimum. Let us fix some x0 ∈ Rn . Define

$$\sigma \;=\; \max_u\left\{\|v'(u)\| : \phi(u) \le f(x_0)\right\},$$

$$D \;=\; \max_u\left\{\|u - u^*\| : \phi(u) \le f(x_0)\right\}.$$

The following result is straightforward.



Lemma 4.1.9 For any $x, y \in \mathscr{L}(f(x_0))$ we have

$$\|x - y\| \;\le\; \sigma\|u(x) - u(y)\|. \tag{4.1.42}$$

Proof Indeed, for $x, y \in \mathscr{L}(f(x_0))$, we have $\phi(u(x)) \le f(x_0)$ and $\phi(u(y)) \le f(x_0)$. Consider the trajectory $x(t) = v(tu(y) + (1-t)u(x))$, $t \in [0,1]$. Then

$$y - x \;=\; \int_0^1 x'(t)\,dt \;=\; \int_0^1 v'(tu(y) + (1-t)u(x))\,dt\cdot(u(y) - u(x)),$$

and (4.1.42) follows. $\square$



The following result is very similar to Theorem 4.1.4.
Theorem 4.1.8 Assume that the Hessian of the function f is Lipschitz continuous
on a convex set F ⊃ L (f (x0 )) with constant L and let the sequence {xk } be
generated by method (4.1.16).
1. If $f(x_0) - f^* \ge \frac{3}{2}L(\sigma D)^3$, then $f(x_1) - f^* \le \frac{1}{2}L(\sigma D)^3$.
2. If $f(x_0) - f^* \le \frac{3}{2}L(\sigma D)^3$, then the rate of convergence of the process (4.1.16) is as follows:

$$f(x_k) - f(x^*) \;\le\; \frac{3L(\sigma D)^3}{2\left(1 + \frac{1}{3}k\right)^2}, \quad k \ge 0. \tag{4.1.43}$$

Proof Indeed, in view of inequality (4.1.11), the upper bound on the parameters $M_k$, and definition (4.1.25), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \;\le\; \min_y\left[\, f(y) - f(x^*) + \tfrac{L}{2}\|y - x_k\|^3 \;:\; y = v(\alpha u^* + (1-\alpha)u(x_k)), \; \alpha \in [0,1]\,\right].$$

By definition of the points $y$ in the above minimization problem and (4.1.42), we have

$$f(y) - f(x^*) \;=\; \phi(\alpha u^* + (1-\alpha)u(x_k)) - \phi(u^*) \;\le\; (1-\alpha)(f(x_k) - f(x^*)),$$

$$\|y - x_k\| \;\le\; \alpha\sigma\|u(x_k) - u^*\| \;\le\; \alpha\sigma D.$$

This means that the reasoning of Theorem 4.1.4 goes through, replacing $D$ by $\sigma D$. $\square$
 3
Let us prove a statement on strongly convex φ. Define ω̌ = L12 2σμ2 .

Theorem 4.1.9 Let the function φ be strongly convex with convexity parameter
μ > 0. Then, under assumptions of Theorem 4.1.8, the performance of the
scheme (4.1.16) is as follows.
1. If $f(x_0) - f(x^*) \ge \frac{4}{9}\check{\omega}$, then in the first phase of the process we get the following rate of convergence:

$$f(x_k) - f(x^*) \;\le\; \left[(f(x_0) - f(x^*))^{1/4} - \frac{k}{6}\sqrt{\frac{2}{3}}\,\check{\omega}^{1/4}\right]^4. \tag{4.1.44}$$

This phase is terminated as soon as $f(x_{k_0}) - f(x^*) \le \frac{4}{9}\check{\omega}$ for some $k_0 \ge 0$.

2. For $k \ge k_0$, the sequence converges superlinearly:

$$f(x_{k+1}) - f(x^*) \;\le\; \frac{1}{2}(f(x_k) - f(x^*))\sqrt{\frac{f(x_k) - f(x^*)}{\check{\omega}}}. \tag{4.1.45}$$

Proof Indeed, in view of inequality (4.1.11), the upper bound on the parameters $M_k$, and definition (4.1.25), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \;\le\; \min_y\left[\, f(y) - f(x^*) + \tfrac{L}{2}\|y - x_k\|^3 \;:\; y = v(\alpha u^* + (1-\alpha)u(x_k)), \; \alpha \in [0,1]\,\right].$$

By definition of the points $y$ in the above minimization problem and (4.1.42), we have

$$f(y) - f(x^*) \;=\; \phi(\alpha u^* + (1-\alpha)u(x_k)) - \phi(u^*) \;\le\; (1-\alpha)(f(x_k) - f(x^*)),$$

$$\|y - x_k\| \;\le\; \alpha\sigma\|u(x_k) - u^*\| \;\overset{(2.1.21)}{\le}\; \alpha\sigma\sqrt{\tfrac{2}{\mu}(f(x_k) - f(x^*))}.$$

This means that the reasoning of Theorem 4.1.5 goes through, replacing $L$ by $\sigma^3 L$. $\square$

Note that the functions described in this section are often used as test functions for non-convex optimization algorithms.
Note that the functions described in this section are often used as test functions for
non-convex optimization algorithms. The simplest way of defining a nondegenerate
The simplest way of defining a nondegenerate transformation $u(\cdot) : \mathbb{R}^n \to \mathbb{R}^n$ is as follows:

$$\begin{array}{rcl}
u^{(1)}(x) &=& x^{(1)},\\
u^{(2)}(x) &=& x^{(2)} + \phi_1(x^{(1)}),\\
u^{(3)}(x) &=& x^{(3)} + \phi_2(x^{(1)}, x^{(2)}),\\
&\cdots&\\
u^{(n)}(x) &=& x^{(n)} + \phi_{n-1}(x^{(1)}, \ldots, x^{(n-1)}),
\end{array} \tag{4.1.46}$$

where $\phi_1, \ldots, \phi_{n-1}$ are arbitrary differentiable functions. It is clear that the Jacobian $u'(x)$ is an upper-triangular matrix with unit diagonal. Thus, this transformation is non-degenerate.

4.1.4 Implementation Issues

4.1.4.1 Minimizing the Cubic Regularization

In order to compute the mapping TM (x), we need to solve an auxiliary minimization


problem (4.1.5), namely,
 
def
minn v(h) = g, h + 12 H h, h + 6 h
M 3 . (4.1.47)
h∈R

If the Hessian H is indefinite, this problem is nonconvex. It can have many strict
isolated minima, while we need to find a global one. Nevertheless, as we will
show in this section, this problem is equivalent to a convex univariate optimization
problem.
Note that the objective function of the optimization problem (4.1.47) can be
represented in the following way:
 
def
v(h) = min ṽ(h, τ ) = g, h + 12 H h, h + 6 |τ |
M 3/2 : h2 ≤ τ .
τ ∈R

Thus, the point TM (x) can be found from the following problem
 
def
min ṽ(h, τ ) : f (h, τ ) = 12 h2 − 12 τ ≤ 0 .
h∈R ,τ ∈R
n

Since this is already a constrained minimization problem, we can form for


it a Lagrangian dual problem (see Sect. 1.3.3). Indeed, define the Lagrangian
L (h, τ, λ) = ṽ(h, τ ) + λ[ 12 h2 − 12 τ ] with h ∈ Rn and τ, λ ∈ R. Then the
dual function is
 
ψ(λ) = inf g, h + 1
2 H h, h + M
6 |τ | 3/2 + λ[ 1 h2 − 1 τ ] .
2 2
h∈R ,τ ∈R
n

4 |τ | =
M 1/2 sign(τ ) 1
The optimal value of τ can be found from the equation 2 λ.
Therefore, τ (λ) = 4λ|λ|
M2
, and we have
 
ψ(λ) = infn g, h + 12 (H + λIn )h, h − 2
3M 2
|λ|3 ,
h∈R
 
def
dom ψ = λ ∈ R : infn [qλ (h) = g, h + 12 (H + λIn )h, h ] > −∞ .
h∈R

Let us describe the structure of $\mathrm{dom}\,\psi$. Without loss of generality, we can assume that $H$ is a diagonal matrix with values $\{H_i\}_{i=1}^n$ on the diagonal. Let $H_{\min} = \min\limits_{1\le i\le n}H_i$.

If $\lambda > -H_{\min}$, then $\lambda \in \mathrm{dom}\,\psi$. If $\lambda < -H_{\min}$, then $\lambda \notin \mathrm{dom}\,\psi$. Thus, only the status of the point $\lambda = -H_{\min}$ can be different. Define

$$G^2 \;=\; \sum_{i \in I^*}(g^{(i)})^2, \qquad I^* \;=\; \{i : H_i = H_{\min}\}.$$

There are three possibilities.

1. $G^2 > 0$. Then $\mathrm{dom}\,\psi = \{\lambda \in \mathbb{R} : \lambda > -H_{\min}\}$. For any $\lambda$ in this domain we have

$$\psi(\lambda) \;=\; -\frac{1}{2}\frac{G^2}{H_{\min} + \lambda} - \frac{1}{2}\sum_{i \notin I^*}\frac{(g^{(i)})^2}{H_i + \lambda} - \frac{2}{3M^2}|\lambda|^3. \tag{4.1.48}$$

At the same time, the optimal vector for the function $q_\lambda(\cdot)$ has the form

$$h(\lambda) \;=\; -(H + \lambda I_n)^{-1}g.$$

This vector and the value $\tau(\lambda)$ are uniquely defined and continuous on $\mathrm{dom}\,\psi$. Hence, in view of Theorem 1.3.2, we have

$$\min_{h \in \mathbb{R}^n}v(h) \;=\; \max_{\lambda \in \mathrm{dom}\,\psi\,\cap\,\mathbb{R}_+}\psi(\lambda). \tag{4.1.49}$$

2. $G^2 = 0$. Then $\mathrm{dom}\,\psi = \{\lambda \in \mathbb{R} : \lambda \ge -H_{\min}\}$. In this case, for any $\lambda > -H_{\min}$, the optimal vector is uniquely defined as follows:

$$h^{(i)}(\lambda) \;=\; \begin{cases} -\dfrac{g^{(i)}}{H_i + \lambda}, & \text{if } i \notin I^*,\\[6pt] 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n. \tag{4.1.50}$$

This vector is continuous on $\mathrm{dom}\,\psi$. Therefore, if

$$\lambda^* \;\overset{\mathrm{def}}{=}\; \arg\max_{\lambda \in \mathrm{dom}\,\psi\,\cap\,\mathbb{R}_+}\psi(\lambda) \;>\; -H_{\min},$$

then the conditions of Theorem 1.3.2 are satisfied. Hence, in this case relation (4.1.49) is also valid.

3. The only remaining case is $G^2 = 0$ and $\lambda^* = -H_{\min}$. This is possible only if $H_{\min} \le 0$ and the gradient is small enough (e.g. $g = 0$). In this situation, the rule (4.1.50) does not work and we need to form the solution of problem (4.1.47) using an eigenvector of the matrix $H$ which corresponds to the eigenvalue $H_{\min}$.

Let us choose an arbitrary $k \in I^*$ and a small parameter $\delta > 0$. Define a new function

$$v_\delta(h) \;=\; v(h) + \delta h^{(k)}.$$

This function satisfies the condition of Item 1. Therefore, in view of (4.1.49) we have

$$\min_{h \in \mathbb{R}^n}v_\delta(h) \;=\; \max_{\lambda \in \mathrm{dom}\,\psi_\delta\,\cap\,\mathbb{R}_+}\psi_\delta(\lambda),$$

$$\psi_\delta(\lambda) \;=\; -\frac{1}{2}\frac{\delta^2}{H_{\min} + \lambda} - \frac{1}{2}\sum_{i \notin I^*}\frac{(g^{(i)})^2}{H_i + \lambda} - \frac{2}{3M^2}|\lambda|^3.$$

Since $\mathrm{dom}\,\psi_\delta = (-H_{\min}, +\infty)$, the optimal point of the dual problem $\lambda_\delta^*$ can be found from the following equation:

$$\frac{\delta^2}{(H_{\min} + \lambda)^2} + \sum_{i \notin I^*}\frac{(g^{(i)})^2}{(H_i + \lambda)^2} \;=\; \frac{4\lambda^2}{M^2}. \tag{4.1.51}$$

Thus, the optimal vector for the primal problem is

$$h_*(\delta) \;=\; -(H + \lambda_\delta^*I_n)^{-1}(g + \delta e_k).$$

All components $h_*^{(i)}(\delta)$ with $i \ne k$ are continuous in $\delta$ (recall that $H$ is a diagonal matrix). For $i = k$, we have

$$h_*^{(k)}(\delta) \;=\; -\frac{\delta}{H_{\min} + \lambda_\delta^*} \;\overset{(4.1.51)}{=}\; -\left[\frac{4(\lambda_\delta^*)^2}{M^2} - \sum_{i \notin I^*}\frac{(g^{(i)})^2}{(H_i + \lambda_\delta^*)^2}\right]^{1/2}.$$

Thus, there exists a limit $h_* = \lim\limits_{\delta \to 0}h_*(\delta)$, defined as follows:

$$h_* \;=\; \sum_{i \notin I^*}h_*^{(i)}e_i + h_*^{(k)}e_k, \qquad h_*^{(i)} \;=\; -\frac{g^{(i)}}{H_i - H_{\min}}, \; i \notin I^*,$$

$$h_*^{(k)} \;=\; -\left[\frac{4H_{\min}^2}{M^2} - \sum_{i \notin I^*}\frac{(g^{(i)})^2}{(H_i - H_{\min})^2}\right]^{1/2}. \tag{4.1.52}$$

It is easy to see that $h_*$ is a global optimum for problem (4.1.47). Indeed, for any $h \in \mathbb{R}^n$ we have

$$v_\delta(h) \;\ge\; v_\delta(h_*(\delta)) \;\ge\; v(h_*(\delta)) - \delta|h_*^{(k)}(\delta)|.$$

Taking in these inequalities the limit as $\delta \to 0$, we get $v(h) \ge v(h_*)$. $\square$




Note that in both Items 1 and 2, the optimal solution of the dual problem $\lambda^*$ satisfies the first-order optimality condition

$$\psi'(\lambda^*) \;=\; \frac{1}{2}\frac{G^2}{(H_{\min} + \lambda^*)^2} + \frac{1}{2}\sum_{i \notin I^*}\frac{(g^{(i)})^2}{(H_i + \lambda^*)^2} - \frac{2}{M^2}(\lambda^*)^2 \;\overset{(1.2.4)}{=}\; 0,$$

and the optimal global solution of the primal problem (4.1.47) is $h^* = -(H + \lambda^*I_n)^{-1}g$. In other words, $\lambda^*$ satisfies the equation

$$\left\|(H + \lambda^*I_n)^{-1}g\right\| \;=\; \frac{2}{M}\lambda^*. \tag{4.1.53}$$

Thus, $r_M(x) = \|h^*\| = \frac{2}{M}\lambda^*$, and we conclude that $H + \frac{Mr_M(x)}{2}I_n \succeq 0$ (this is (4.1.8)). Note that in the case described in Item 3, we have $\|h_*\| = \frac{2|H_{\min}|}{M}$. Thus, we also have

$$H + \frac{Mr_M(x)}{2}I_n \;=\; H + |H_{\min}|I_n \;\succeq\; 0.$$

Using the new variable $r$, we can rewrite equation (4.1.53) in the following form

$$r \;=\; \left\|\left(H + \frac{Mr}{2}I\right)^{-1}g\right\|, \tag{4.1.54}$$

with $r \ge \frac{2}{M}(-\lambda_{\min}(H))_+$. A technique for solving such equations is very well developed for the needs of Trust Region Methods. As compared with (4.1.54), the equations for Trust Region Schemes have a constant left-hand side. But of course, all possible difficulties with (4.1.54) are due to the non-linear convex right-hand side. In any case, before running a procedure for solving this equation, it is reasonable to transform the matrix $H$ into a tri-diagonal form using the Lanczos algorithm. In the general case, this takes $O(n^3)$ operations.
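Under the assumptions of Item 1 ($G^2 > 0$, so the dual maximizer lies strictly inside the domain), equation (4.1.53) has a unique root on $(-H_{\min}, +\infty)$: its left-hand side is decreasing in $\lambda$ while the right-hand side is increasing. A minimal NumPy sketch, solving it by plain bisection rather than the Lanczos-based procedure mentioned above:

```python
import numpy as np

def cubic_step(g, H, M):
    """Global minimizer of <g,h> + 0.5 <Hh,h> + (M/6)||h||^3 via the dual
    root of ||(H + lam I)^{-1} g|| = (2/M) lam  (Case 1 of the text)."""
    n = len(g)
    lam_lo = max(0.0, -np.linalg.eigvalsh(H)[0]) + 1e-12   # dual feasibility
    phi = lambda lam: (np.linalg.norm(np.linalg.solve(H + lam * np.eye(n), g))
                       - 2.0 * lam / M)
    lam_hi = lam_lo + 1.0
    while phi(lam_hi) > 0.0:          # bracket: phi changes sign exactly once
        lam_hi *= 2.0
    for _ in range(200):              # bisection on the monotone residual
        mid = 0.5 * (lam_lo + lam_hi)
        if phi(mid) > 0.0:
            lam_lo = mid
        else:
            lam_hi = mid
    lam = 0.5 * (lam_lo + lam_hi)
    return -np.linalg.solve(H + lam * np.eye(n), g), lam
```

As an assumed test instance, take $H = \mathrm{diag}(0, -1)$, $g = (-1, 0.5)^T$, $M = 1$ (Case 1, since $g$ has a nonzero component on the minimal eigenspace). At the root, $h = -(H + \lambda I)^{-1}g$ satisfies the stationarity condition $g + Hh + \frac{M}{2}\|h\|h = 0$ together with $H + \frac{M\|h\|}{2}I \succeq 0$.
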
In order to illustrate possible difficulties arising in the dual problem, let us look
at the following example.
Example 4.1.4 Let n = 2 and

g = (−1, 0)T , H1 = 0, H2 = −1, M = 1.

Thus, our primal problem is as follows:


"  3 %
 (2) 2  2  2
min ψ(h) ≡ −h(1) − 1
2 h + 1
6 h(1) + h(2) .
h∈R2

Following (4.1.6), we have to solve the system of two non-linear equations:

$$\frac{h^{(1)}}{2}\left[\left(h^{(1)}\right)^2 + \left(h^{(2)}\right)^2\right]^{1/2} \;=\; 1,$$

$$\frac{h^{(2)}}{2}\left[\left(h^{(1)}\right)^2 + \left(h^{(2)}\right)^2\right]^{1/2} \;=\; h^{(2)}.$$

Thus, we have three candidate solutions:

$$h_1^* = (\sqrt{2}, 0)^T, \qquad h_2^* = (1, \sqrt{3})^T, \qquad h_3^* = (1, -\sqrt{3})^T.$$

By direct substitution, we can see that

$$\psi(h_1^*) \;=\; -\frac{2\sqrt{2}}{3} \;>\; -\frac{7}{6} \;=\; \psi(h_2^*) \;=\; \psi(h_3^*).$$

Thus, both $h_2^*$ and $h_3^*$ are our global solutions.

Let us look at the dual problem. Since $G^2 = 0$, we have the following objective:

$$\psi(\lambda) \;\overset{(4.1.48)}{=}\; -\frac{1}{2\lambda} - \frac{2}{3}\lambda^3.$$

We need to maximize this function subject to the constraint $\lambda \ge (-H_{\min})_+ = 1$. Since $\psi'(1) < 0$, we conclude that $\lambda^* = 1$. Thus, using representation (4.1.52), we get

$$h_* \;=\; -e_1\cdot\frac{-1}{0 + 1} - e_2\left[\frac{4H_{\min}^2}{M^2} - \frac{1}{(H_1 - H_{\min})^2}\right]^{1/2} \;=\; (1, -\sqrt{3})^T,$$

which is one of the two global solutions found above. $\square$

To the best of our knowledge, a technique for finding the global minimum
of problem (4.1.47) in the degenerate situation of Item 3 without computing an
eigenvalue decomposition of the matrix $H$ is not known yet. Of course, we can always say that this degeneracy disappears with probability one after an arbitrarily small random perturbation of the vector $g$.

4.1.4.2 Line Search Strategies

Let us discuss the computational cost of Step 1 in method (4.1.16), which consists in finding $M_k \in [L_0, 2L]$ satisfying the inequality

$$f(T_{M_k}(x_k)) \;\le\; \bar{f}_{M_k}(x_k).$$

Note that for $M_k \ge L$ this inequality holds. Consider now the following backtracking strategy.

Find the first $i_k \ge 0$ such that $f(T_{2^{i_k}M_k}(x_k)) \le \bar{f}_{2^{i_k}M_k}(x_k)$.
Define $x_{k+1} := T_{2^{i_k}M_k}(x_k)$ and $M_{k+1} := 2^{i_k}M_k$.   (4.1.55)

If we apply this procedure at each iteration of process (4.1.16), which starts from
M0 ∈ [L0 , 2L], then we have the following advantages:
• Mk ≤ 2L.
• The total amount of additional computations of mappings TMk (·) during N
iterations of process (4.1.16) is equal to


N 
N
Mk+1 MN+1
ik = log2 Mk = log2 M0 ≤ 1 + log2 L
L0 .
k=0 k=0

(Indeed, if ik = 0, then we compute only one mapping TMk (·) at this iteration.)
The right-hand side of the above bound does not depend on N, the number of
iterations of the main process.
However, it may happen that rule (4.1.55) is too conservative. Indeed, we can
only increase our estimates for the constant L and never let them go down. This
may force the method to take only short steps. A more optimistic strategy is as
follows:

Find the first $i_k \ge 0$ such that $f(T_{2^{i_k}M_k}(x_k)) \le \bar{f}_{2^{i_k}M_k}(x_k)$.
Define $x_{k+1} := T_{2^{i_k}M_k}(x_k)$ and $M_{k+1} := \max\left\{L_0,\, 2^{i_k-1}M_k\right\}$.   (4.1.56)

Then the total amount of additional computations of mappings $T_{M_k}(\cdot)$ after $N$ iterations of the process (4.1.16) can be bounded as follows:

$$\sum_{k=0}^{N}i_k \;\le\; \sum_{k=0}^{N}\log_2\frac{2M_{k+1}}{M_k} \;=\; N + 1 + \log_2\frac{M_{N+1}}{M_0} \;\le\; N + 2 + \log_2\frac{L}{L_0}.$$

Thus, after $N$ iterations of this process, we never compute more than

$$2N + 3 + \log_2\frac{2L}{L_0}$$

mappings $T_M(\cdot)$. This is a reasonable price to pay for the possibility of moving by long steps.
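The strategy (4.1.55) is easiest to see in one dimension, where the regularized model can be minimized in closed form. The sketch below (illustrative test function $f(x) = x^4$; the starting point and $M_0$ are arbitrary choices) doubles $M$ only until the model overestimates $f$ at the trial point, exactly as in (4.1.55).

```python
def t_step(f1, f2, M):
    """1-D minimizer of the model f1*h + 0.5*f2*h**2 + (M/6)*|h|**3."""
    m = lambda h: f1 * h + 0.5 * f2 * h * h + (M / 6.0) * abs(h) ** 3
    best = 0.0
    for sgn in (1.0, -1.0):                 # stationarity on each half-line:
        a, b, c = sgn * M / 2.0, f2, f1     # sgn*(M/2) h^2 + f2 h + f1 = 0
        disc = b * b - 4.0 * a * c
        if disc >= 0.0:
            for h in ((-b + disc ** 0.5) / (2 * a), (-b - disc ** 0.5) / (2 * a)):
                if sgn * h > 0.0 and m(h) < m(best):
                    best = h
    return best

f, df, d2f = lambda x: x ** 4, lambda x: 4 * x ** 3, lambda x: 12 * x ** 2
x, M = 2.0, 1.0
values = [f(x)]
for _ in range(25):
    while True:                             # backtracking rule (4.1.55)
        h = t_step(df(x), d2f(x), M)
        model = f(x) + df(x) * h + 0.5 * d2f(x) * h * h + (M / 6.0) * abs(h) ** 3
        if f(x + h) <= model:               # model overestimates f: accept
            break
        M *= 2.0                            # otherwise double the estimate of L
    x += h
    values.append(f(x))
```

Each accepted step satisfies $f(x_{k+1}) \le \bar{f}_M(x_k) \le f(x_k)$, so the sequence of function values is monotone.
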

4.1.5 Global Complexity Bounds

Let us compare the complexity results presented in this section with some known
facts on global efficiency bounds of other minimization schemes.
Assume that the function f is strongly convex on Rn with convexity parameter
μ > 0 (see (4.1.32)). In this case, there exists its unique global minimum x ∗ , and
condition (4.1.27) holds for all x ∈ Rn (see Theorem 2.1.8). Assume also that the
Hessian of this function is Lipschitz continuous:

$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \;\le\; L\|x - y\|, \quad \forall x, y \in \mathbb{R}^n.$$

For such functions, let us obtain the complexity bounds of method (4.1.16) using
the results of Theorems 4.1.4 and 4.1.5.
Indeed, let us fix some $x_0 \in \mathbb{R}^n$. Denote by $D$ the radius of its level set:

$$D \;=\; \max_x\{\|x - x^*\| : f(x) \le f(x_0)\}.$$

From the condition (4.1.27), we get

$$D \;\le\; \left(\frac{2}{\mu}(f(x_0) - f(x^*))\right)^{1/2}.$$

We will see that it is natural to measure the quality of the starting point $x_0$ by the following characteristic:

$$\varkappa \;\equiv\; \varkappa(x_0) \;=\; \frac{LD}{\mu}.$$

Let us introduce three switching values

$$\omega_0 \;=\; \frac{\mu^3}{18L^2} \;\equiv\; \frac{4}{9}\bar{\omega}, \qquad \omega_1 \;=\; \frac{3}{2}\mu D^2, \qquad \omega_2 \;=\; \frac{3}{2}LD^3.$$

In view of Theorem 4.1.4, we can reach the level $f(x_0) - f(x^*) \le \frac{1}{2}LD^3$ in one iteration. Therefore, without loss of generality we assume that

$$f(x_1) - f(x^*) \;\le\; \omega_2.$$

Suppose we are interested in a very high accuracy $\epsilon$ of the solution. Note that the case $\varkappa \le 1$ is very easy since the first iteration of method (4.1.16) comes very close to the region of super-linear convergence (see Item 2 of Theorem 4.1.5).

Consider the case $\varkappa \ge 1$. Then $\omega_0 \le \omega_1 \le \omega_2$. Let us estimate the duration of the following phases:

Phase 1: $\omega_1 \le f(x_i) - f^* \le \omega_2$,
Phase 2: $\omega_0 \le f(x_i) - f^* \le \omega_1$,
Phase 3: $\epsilon \le f(x_i) - f^* \le \omega_0$.

In view of Theorem 4.1.4, the duration $k_1$ of the first phase is bounded as follows:

$$\omega_1 \;\le\; \frac{3LD^3}{2\left(1 + \frac{1}{3}k_1\right)^2}.$$

Thus, $k_1 \le 3\sqrt{\varkappa}$. Further, in view of Item 1 of Theorem 4.1.5, we can bound the duration $k_2$ of the second phase:

$$\omega_0^{1/4} \;\le\; (f(x_{k_1+1}) - f(x^*))^{1/4} - \frac{k_2}{6}\,\omega_0^{1/4} \;\le\; \left(\tfrac{1}{2}\mu D^2\right)^{1/4} - \frac{k_2}{6}\,\omega_0^{1/4}.$$

This gives the following bound: $k_2 \le 3^{3/4}2^{1/2}\sqrt{\varkappa} \le 3.25\sqrt{\varkappa}$.
Finally, let $\delta_k = \frac{1}{4\omega_0}(f(x_k) - f(x^*))$. In view of inequality (4.1.29) we have:

$$\delta_{k+1} \;\le\; \delta_k^{3/2}, \qquad k \ge \bar{k} \equiv k_1 + k_2 + 1.$$

At the same time $f(x_{\bar{k}}) - f(x^*) \le \omega_0$. Thus, $\delta_{\bar{k}} \le \frac{1}{4}$, and the bound on the duration $k_3$ of the last phase can be found from the following inequality:

$$\left(\frac{3}{2}\right)^{k_3}\ln 4 \;\le\; \ln\frac{4\omega_0}{\epsilon}.$$

This is $k_3 \le \log_{3/2}\log_4\frac{2\mu^3}{9L^2\epsilon}$. Putting all the bounds together, we obtain that the total number of steps $N$ in (4.1.16) is bounded as follows:

$$N \;\le\; 6.25\sqrt{\frac{LD}{\mu}} + \log_{3/2}\left(\log_4\frac{1}{\epsilon} + \log_4\frac{2\mu^3}{9L^2}\right). \tag{4.1.57}$$

It is interesting that in estimate (4.1.57) the parameters of our problem interact with the accuracy in an additive way. Recall that usually such an interaction is multiplicative. Let us estimate, for example, the complexity of our problem for the Fast Gradient Method (2.2.20) for strongly convex functions with Lipschitz continuous gradient. Denote by $\hat{L}$ the largest eigenvalue of the matrix $\nabla^2 f(x^*)$. Then we can guarantee that

$$\mu I \;\preceq\; \nabla^2 f(x) \;\preceq\; (\hat{L} + LD)I \qquad \forall x : \|x - x^*\| \le D.$$

Thus, the complexity bound for the optimal gradient method is of the order of

$$O\left(\sqrt{\frac{\hat{L} + LD}{\mu}}\,\ln\frac{(\hat{L} + LD)D^2}{\epsilon}\right)$$

iterations. For the Gradient Method (2.1.37) it is even worse:

$$O\left(\frac{\hat{L} + LD}{\mu}\,\ln\frac{(\hat{L} + LD)D^2}{\epsilon}\right).$$

Thus, we conclude that the global complexity bounds of the Cubic Newton
Method (4.1.16) are considerably better than the estimates of the gradient schemes.
At the same time, we should recall, of course, the difference in computational cost
of each iteration.
Note that similar bounds can be obtained for other classes of non-convex problems. For example, for nonlinear transformations of convex functions (see Sect. 4.1.3.3), the complexity bound is as follows:

$$N \;\le\; 6.25\sqrt{\frac{\sigma^3LD}{\mu}} + \log_{3/2}\left(\log_4\frac{1}{\epsilon} + \log_4\frac{2\mu^3}{9\sigma^6L^2}\right). \tag{4.1.58}$$

To conclude, note that in scheme (4.1.16) it is possible to find elements of


the Levenberg–Marquardt approach (see relation (4.1.8)), or a trust-region idea
(see the discussion in Sect. 4.1.4.1), or a line-search technique (see the rule of
Step 1 in (4.1.16)). However, all these facts are consequences of the main idea of
the scheme, consisting in computation of the next test point of the process as a
global minimizer of cubic regularization of the second-order approximation, which
globally overestimates the values of the objective function.

4.2 Accelerated Cubic Newton

(Primal and dual spaces; Uniformly convex functions; Regularization of Newton iteration; An Accelerated scheme; Global non-degeneracy for second-order schemes; Minimizing strongly convex functions; False accelerations.)

4.2.1 Real Vector Spaces

Starting from this section, we often work with more abstract real vector spaces.
In the previous part of the book, we were dealing mainly with the simplest space

Rn . However, very often we need to highlight the fundamental difference between


the vectors of decision variables and the vectors of gradients. The simplest way of
doing this is to just keep them in different spaces. For us, the space for variables will
always be the primal space, and the space for gradients will be the dual space.
Let E be a finite-dimensional real vector space, and E∗ be its dual space,
comprised of linear functions on E. Denote by ⟨s, x⟩_E the value of s ∈ E∗ at a point x ∈ E (sometimes it is called the scalar product of s and x). If there is no
ambiguity of notation, the subscript of the scalar product is usually omitted. Since
we always work in finite dimensions, we have (E∗ )∗ = E.
Consider, for example, a differentiable function f with dom f = E. Then, by
definition of the gradient, we have

$$f(x + h) \;=\; f(x) + \langle\nabla f(x), h\rangle + o(\|h\|), \quad x, h \in E.$$

Thus, the gradient defines a linear function of x, and therefore ∇f (x) ∈ E∗ . It


is important to remember that the coordinate form of the gradient (1.2.3) makes
sense only if E = E∗ = Rn . In order to convert E to Rn , we need to fix a basis of
this space. This operation can be done in many different ways, which significantly
change the topology of functions and their characteristics. Therefore, it is often
convenient to avoid this operation in explaining the principles of optimization
schemes.
Further, for two spaces E₁ and E₂, we can consider a linear operator A : E₁ → E₂∗. For this operator, we can define the adjoint operator A∗ as follows:

$$\langle Ax, y\rangle_{E_2} \;\equiv\; \langle A^*y, x\rangle_{E_1}, \quad \forall x \in E_1, \; y \in E_2.$$

Clearly, A∗ maps E₂ to E₁∗. In the case when E₁ = Rⁿ and E₂ = Rᵐ, the operator A


can be represented by an (m×n)-matrix. Then the matrix for A∗ is just its transpose:
A∗ = AT .
In order to have a full picture, let us describe a standard procedure for converting
E and E∗ into Rn . Let n = dim E. Let us choose a basis B = (b1 , . . . , bn ) in E. We
can treat it as a linear operator B : Rn → E defined by the following rule:

$$x \;\overset{\mathrm{def}}{=}\; B\bar{x} \;=\; \sum_{i=1}^n b_i\bar{x}^{(i)}, \qquad \bar{x} = (\bar{x}^{(1)}, \ldots, \bar{x}^{(n)})^T \in \mathbb{R}^n.$$

Using this basis, we can define a linear operator $B^* : E^* \to \mathbb{R}^n$ as follows:

$$\bar{s} = (\bar{s}^{(1)}, \ldots, \bar{s}^{(n)})^T \;=\; B^*s \in \mathbb{R}^n, \quad s \in E^*,$$

which is equivalent to the following rules:

$$\bar{s}^{(i)} \;=\; \langle s, b_i\rangle, \quad i = 1, \ldots, n.$$

Then, using the operator $(B^*)^{-1} : \mathbb{R}^n \to E^*$, we can define the dual basis in $E^*$. Indeed, $s = (B^*)^{-1}\bar{s} \in E^*$ for $\bar{s} \in \mathbb{R}^n$. Therefore, the corresponding basis vectors in $E^*$ are as follows:

$$(B^*)^{-1}e_1, \;\ldots,\; (B^*)^{-1}e_n,$$

where $e_i$ are the unit coordinate vectors in $\mathbb{R}^n$, $i = 1, \ldots, n$. Note that

$$\langle(B^*)^{-1}\bar{s}, b_i\rangle_E \;=\; \langle(B^*)^{-1}\bar{s}, Be_i\rangle_E \;=\; \langle B^*(B^*)^{-1}\bar{s}, e_i\rangle_{\mathbb{R}^n} \;=\; \bar{s}^{(i)}, \quad i = 1, \ldots, n. \tag{4.2.1}$$

Hence, we get the following representation for the scalar product of two vectors $s \in E^*$ and $x \in E$:

$$\langle s, x\rangle_E \;=\; \langle(B^*)^{-1}\bar{s}, B\bar{x}\rangle_E \;=\; \sum_{i=1}^n\bar{x}^{(i)}\langle(B^*)^{-1}\bar{s}, b_i\rangle \;\overset{(4.2.1)}{=}\; \sum_{i=1}^n\bar{x}^{(i)}\bar{s}^{(i)} \;=\; \bar{s}^T\bar{x} \;\equiv\; \langle\bar{s}, \bar{x}\rangle_{\mathbb{R}^n}.$$

Further, the operator $B : E \to E^*$ is called self-adjoint if

$$\langle Bx, y\rangle \;\equiv\; \langle By, x\rangle, \quad \forall x, y \in E.$$

For $E = \mathbb{R}^n$, a self-adjoint operator is represented by a symmetric matrix. The most important examples of self-adjoint operators are given by Hessians. Indeed, by definition (see (1.2.7)), we have

$$\nabla f(x + h) \;=\; \nabla f(x) + \nabla^2 f(x)h + o(\|h\|) \in E^*, \quad x \in E, \; h \in E.$$

Thus, $\nabla^2 f(x)$ is a linear operator from $E$ to $E^*$. This interpretation confirms the validity of the Newton direction:

$$[\nabla^2 f(x)]^{-1}\nabla f(x) \in E.$$

It is well known that for twice continuously differentiable functions the matrix representation of the Hessian is symmetric. This means that any Hessian is a self-adjoint operator.

Finally, a self-adjoint operator $B : E \to E^*$ is positive semidefinite if

$$\langle Bx, x\rangle \;\ge\; 0, \quad \forall x \in E,$$

notation $B \succeq 0$. If the above inequality is strict for all $x \ne 0$, we call the operator positive definite (notation $B \succ 0$). Positive definite operators are invertible.
4.2 Accelerated Cubic Newton 273

Now we can define all necessary objects. Let us fix a positive definite self-adjoint
operator B : E → E∗ . Define the primal norm for the space E:

‖h‖ = ⟨Bh, h⟩^{1/2} ,  h ∈ E.   (4.2.2)

Our above discussion suggests that the most natural candidates for such an operator
B are nondegenerate Hessians of convex functions. We will discuss this possibility
in detail in Chap. 5.
The dual norm for E∗ can be defined in the standard way:

‖s‖∗ = max_{x∈E} {⟨s, x⟩ : ‖x‖ ≤ 1} = ⟨s, B^{−1}s⟩^{1/2}  (by (3.1.64)),  s ∈ E∗ .   (4.2.3)

An immediate consequence of this definition is the Cauchy–Schwarz inequality

⟨s, x⟩ ≤ ‖s‖∗ · ‖x‖,  x ∈ E, s ∈ E∗ .   (4.2.4)

Finally, for a linear operator A : E → E∗ we have

‖A‖ = max_{‖h‖≤1} ‖Ah‖∗ .   (4.2.5)

If the operator A is self-adjoint, the same norm can be defined as

‖A‖ = max_{‖h‖≤1} |⟨Ah, h⟩|.   (4.2.6)

Any s ∈ E∗ generates a rank-one self-adjoint operator ss∗ : E → E∗ acting as
follows:

ss∗ · x = ⟨s, x⟩ · s,  x ∈ E.

We extend the operator A(s) def= ss∗/‖s‖∗ onto the origin in a continuous way: A(0) = 0.
In this section, we mainly consider functions with Lipschitz continuous Hessian:

‖∇²f(x) − ∇²f(y)‖ ≤ L3 ‖x − y‖,  x, y ∈ E,   (4.2.7)

where L3 def= L3(f). Consequently, for all x and y from E we have

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖∗ ≤ (1/2) L3 ‖y − x‖²   (by (1.2.13)).   (4.2.8)

Moreover, for the quadratic model

f2(x; y) def= f(x) + ⟨∇f(x), y − x⟩ + (1/2)⟨∇²f(x)(y − x), y − x⟩

we can bound the residual:

|f(y) − f2(x; y)| ≤ (L3/6) ‖y − x‖³  (by (1.2.14)),  x, y ∈ E.   (4.2.9)

4.2.2 Uniformly Convex Functions

In this section, we will often use the cubic power function

d3(x) = (1/3)‖x − x0‖³ ,   ∇d3(x) = ‖x − x0‖ · B(x − x0),  x ∈ E.

This is the simplest example of a uniformly convex function. In order to understand
the properties of such functions, we need to develop some theory.
Let the function d(·) be differentiable on a closed convex set Q. We call it
uniformly convex on Q of degree p ≥ 2 if there exists a constant σp = σp(d) > 0
such that1

d(y) ≥ d(x) + ⟨∇d(x), y − x⟩ + (1/p) σp ‖y − x‖^p ,  ∀x, y ∈ Q.   (4.2.10)

The constant σp is called the parameter of uniform convexity of this function. By
adding such a function to an arbitrary convex function, we get a uniformly convex
function of the same degree and with the same value of the parameter. Recall that degree
p = 2 corresponds to strongly convex functions (see (2.1.20)). In our old notation,
the parameter μ of strong convexity of the function f corresponds to σ2(f).
Note that any uniformly convex function grows faster than any linear function.
Therefore, its level sets are always bounded. This implies that any minimization
problem with uniformly convex objective is always solvable provided that its
feasible set is nonempty. Moreover, its solution is always unique.
Adding two copies of inequality (4.2.10), with x and y interchanged, we get

⟨∇d(x) − ∇d(y), x − y⟩ ≥ (2/p) σp ‖x − y‖^p ,  ∀x, y ∈ Q.   (4.2.11)

It appears that this condition is also sufficient for uniform convexity (however, for p > 2
the convexity parameter changes).
Lemma 4.2.1 Assume that for some p ≥ 2, σ > 0, and all x, y ∈ Q the following
inequality holds:

⟨∇d(x) − ∇d(y), x − y⟩ ≥ σ ‖x − y‖^p ,  x, y ∈ Q.   (4.2.12)

Then the function d is uniformly convex on Q with degree p and parameter σ.

1 It could be a good exercise for the reader to prove that there are no uniformly convex functions

with degree p ∈ (0, 2).



Proof Indeed,

d(y) − d(x) − ⟨∇d(x), y − x⟩ = ∫₀¹ ⟨∇d(x + τ(y − x)) − ∇d(x), y − x⟩ dτ

= ∫₀¹ (1/τ) ⟨∇d(x + τ(y − x)) − ∇d(x), τ(y − x)⟩ dτ

≥ ∫₀¹ σ τ^{p−1} ‖y − x‖^p dτ = (σ/p) ‖y − x‖^p   (by (4.2.12)).  □

Lemma 4.2.2 Let d be uniformly convex on Q of degree p ≥ 2. Then for all x, y ∈
Q we have

d(y) − d(x) − ⟨∇d(x), y − x⟩ ≤ ((p−1)/p) (1/σp)^{1/(p−1)} ‖∇d(y) − ∇d(x)‖∗^{p/(p−1)} .   (4.2.13)

Proof Assume that d attains its global minimum at some point x∗ ∈ Q. Then, for any
x ∈ Q,

d(x∗) = min_{y∈Q} d(y) ≥ min_{y∈Q} { d(x) + ⟨∇d(x), y − x⟩ + (1/p) σp ‖y − x‖^p }   (by (4.2.10))

≥ min_{y∈E} { d(x) + ⟨∇d(x), y − x⟩ + (1/p) σp ‖y − x‖^p }

= d(x) − ((p−1)/p) (1/σp)^{1/(p−1)} ‖∇d(x)‖∗^{p/(p−1)}   (by (4.2.3)).

Let us fix x ∈ Q and consider the convex function φ(y) = d(y) − ⟨∇d(x), y⟩. It is
uniformly convex of degree p with parameter σp . Moreover, it attains its minimum
at y = x ∈ Q. Hence, applying the above inequality to φ(y), we get (4.2.13).  □
Let us give an important example of a uniformly convex function. Fixing an
arbitrary x0 ∈ E, we define the function dp(x) = (1/p)‖x − x0‖^p , where the norm is
Euclidean (see (4.2.2)). Then

∇dp(x) = ‖x − x0‖^{p−2} · B(x − x0),  x ∈ E.

Lemma 4.2.3 For any x and y from E we have

⟨∇dp(x) − ∇dp(y), x − y⟩ ≥ (1/2)^{p−2} ‖x − y‖^p ,   (4.2.14)

dp(x) − dp(y) − ⟨∇dp(y), x − y⟩ ≥ (1/p) (1/2)^{p−2} ‖x − y‖^p .   (4.2.15)

Proof Without loss of generality, let us assume that x0 = 0. Then

⟨∇dp(x) − ∇dp(y), x − y⟩ = ⟨‖x‖^{p−2} Bx − ‖y‖^{p−2} By, x − y⟩

= ‖x‖^p + ‖y‖^p − ⟨Bx, y⟩ (‖x‖^{p−2} + ‖y‖^{p−2}).

To prove (4.2.14), we need to show that the right-hand side of the latter equality is
greater than or equal to

(1/2)^{p−2} ‖x − y‖^p = (1/2)^{p−2} ( ‖x‖² + ‖y‖² − 2⟨Bx, y⟩ )^{p/2} .

Without loss of generality we can assume that x ≠ 0 and y ≠ 0. Then, defining

τ = ‖y‖/‖x‖,   α = ⟨Bx, y⟩/(‖x‖ · ‖y‖) ∈ [−1, 1],

we obtain the statement to be proved:

1 + τ^p ≥ ατ(1 + τ^{p−2}) + (1/2)^{p−2} [1 + τ² − 2ατ]^{p/2} ,  τ ≥ 0, |α| ≤ 1.   (4.2.16)

Since the right-hand side of this inequality is convex in α, in view of Corollary 3.1.2,
we need to justify only two marginal inequalities:

α = 1 :   1 + τ^p ≥ τ(1 + τ^{p−2}) + (1/2)^{p−2} |1 − τ|^p ,
α = −1 :  1 + τ^p ≥ −τ(1 + τ^{p−2}) + (1/2)^{p−2} (1 + τ)^p ,   (4.2.17)

for all τ ≥ 0.
The second inequality in (4.2.17) can be derived from the lower bound for the
ratio

(1 + τ^p + τ(1 + τ^{p−2}))/(1 + τ)^p = (1 + τ^{p−1})/(1 + τ)^{p−1} ,  τ ≥ 0.

Indeed, its minimum is attained at τ = 1, and this proves the second line in (4.2.17).
To prove the first line, note that it is valid for τ = 1. If τ ≥ 0 and τ ≠ 1, then we
need to estimate from below the ratio

(1 + τ^p − τ(1 + τ^{p−2}))/|1 − τ|^p = (1 − τ)(1 − τ^{p−1})/|1 − τ|^p = (1 + τ + · · · + τ^{p−2})/|1 − τ|^{p−2} .

Since the absolute value of any coefficient of the polynomial (1 − τ)^{p−2} does
not exceed 2^{p−2} , the first line in inequality (4.2.17) is also justified. This
proves (4.2.14), and, to prove (4.2.15), we can now use Lemma 4.2.1.  □
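Both inequalities of Lemma 4.2.3 are easy to probe numerically (a sketch only, with B = I and randomly sampled points; this is a check of the statement, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3.0
x0 = np.zeros(4)

d_p = lambda x: np.linalg.norm(x - x0) ** p / p
grad_d_p = lambda x: np.linalg.norm(x - x0) ** (p - 2) * (x - x0)   # B = I

viol = 0.0   # largest observed violation of (4.2.14) / (4.2.15)
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    r = np.linalg.norm(x - y) ** p
    lhs1 = (grad_d_p(x) - grad_d_p(y)) @ (x - y)          # (4.2.14), left side
    lhs2 = d_p(x) - d_p(y) - grad_d_p(y) @ (x - y)        # (4.2.15), left side
    viol = max(viol, 0.5 ** (p - 2) * r - lhs1,
               (0.5 ** (p - 2) / p) * r - lhs2)
print(viol)   # <= 0 up to rounding: no violations found
```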

The main property of uniformly convex functions is the following growth condition.
Theorem 4.2.1 Let d be uniformly convex on Q of degree p ≥ 2 with positive
constant σp . Let x∗ = arg min_{x∈Q} d(x). Then for all x ∈ Q we have

d(x) ≥ d(x∗) + (1/p) σp ‖x − x∗‖^p .   (4.2.18)

Proof Indeed, in view of the first-order optimality condition (2.2.39), we have

⟨∇d(x∗), x − x∗⟩ ≥ 0,  x ∈ Q.

Therefore, (4.2.18) follows from (4.2.10).  □



Thus, by (4.2.14) and Lemma 4.2.1, we conclude that σ3(d3) = 1/2. On the other hand,
we can prove the following important fact.
Lemma 4.2.4 For any x, y ∈ E we have

‖∇²d3(x) − ∇²d3(y)‖ ≤ 2 ‖x − y‖.   (4.2.19)

Proof Without loss of generality, assume that x0 = 0. For any x ≠ 0 we have
∇²d3(x) = ‖x‖ B + (1/‖x‖)(Bx)(Bx)∗ . Clearly, for all x ∈ E we have

‖∇²d3(x)‖ ≤ 2‖x‖   (by (4.2.4)).   (4.2.20)

Let us fix two points x, y ∈ E and an arbitrary direction h ∈ E. Define x(τ) =
x + τ(y − x) and

φ(τ) = ⟨∇²d3(x(τ))h, h⟩ = ‖x(τ)‖ · ‖h‖² + (1/‖x(τ)‖) ⟨Bx(τ), h⟩² ,  τ ∈ [0, 1].

Assume first that 0 ∉ [x, y]. Then φ(τ) is continuously differentiable on [0, 1] and

φ′(τ) = (⟨Bx(τ), y − x⟩/‖x(τ)‖) ‖h‖² + (2⟨Bx(τ), h⟩/‖x(τ)‖) ⟨Bh, y − x⟩ − ⟨Bx(τ), y − x⟩ ⟨Bx(τ), h⟩² / ‖x(τ)‖³

= (⟨Bx(τ), y − x⟩/‖x(τ)‖) [ ‖h‖² − ⟨Bx(τ), h⟩²/‖x(τ)‖² ] + (2⟨Bx(τ), h⟩/‖x(τ)‖) ⟨Bh, y − x⟩,

where the expression in square brackets is nonnegative by (4.2.4).
Let α = ⟨Bx(τ), h⟩/(‖x(τ)‖ · ‖h‖) ∈ [−1, 1]. Then

|φ′(τ)| ≤ ‖y − x‖ · ‖h‖² · (1 − α² + 2|α|) ≤ 2 ‖y − x‖ · ‖h‖² .

Hence,

|⟨(∇²d3(y) − ∇²d3(x))h, h⟩| = |φ(1) − φ(0)| ≤ 2 ‖y − x‖ · ‖h‖² ,

and we get (4.2.19) from (4.2.6).
The remaining case 0 ∈ [x, y] is trivial since then ‖x − y‖ = ‖x‖ + ‖y‖ and we
can apply (4.2.20).  □
In the sequel, we often use Lipschitz constants for different derivatives. For p ≥
2, denote by Lp(f) the Lipschitz constant for the (p − 1)-st derivative of the
function f:

‖∇^{(p−1)} f(x) − ∇^{(p−1)} f(y)‖ ≤ Lp(f) ‖x − y‖,  x, y ∈ dom f.   (4.2.21)

In this notation, L2(f) is the Lipschitz constant for the gradient of the function f.
At the same time, by Lemma 4.2.4, we conclude that L3(d3) = 2.
We often establish the complexity of different problem classes in terms of
condition numbers of variable degree:

γp(f) def= σp(f)/Lp(f),  p ≥ 2.   (4.2.22)

It is clear, for example, that for d2(x) = (1/2)‖x − x0‖² we have γ2(d2) = 1. On the
other hand, we have seen that γ3(d3) = 1/4.

4.2.3 Cubic Regularization of Newton Iteration

Consider the following minimization problem:

min_{x∈E} f(x),   (4.2.23)

where E is a finite-dimensional real vector space, and f is a twice differentiable
convex function with Lipschitz continuous Hessian. As was shown in Sect. 4.1, the
global rate of convergence of the Cubic Newton Method (CNM) on this problem
class is of the order O(1/k²), where k is the iteration counter (see Theorem 4.1.4).
However, note that CNM is a local one-step second-order method. From the
complexity theory of smooth Convex Optimization, it is known that the rate of
convergence of the local one-step first-order method (this is just the Gradient
Method, see Theorem 2.1.14) can be improved from O(1/k) to O(1/k²) by applying
a multi-step strategy (see, for example, Theorem 2.2.3). In this section we show
that a similar trick also works with CNM. As a result, we get a new method, which
converges on the specified problem class as O(1/k³).

Let us recall the most important properties of the cubic regularization of Newton's
method, taking into account the convexity of the objective function.
As suggested in Sect. 4.1, we introduce the following mapping:

TM(x) def= Arg min_{y∈E} { f̂M(x; y) def= f2(x; y) + (M/6) ‖y − x‖³ }.   (4.2.24)

Note that T = TM(x) is the unique solution of the following equation:

∇f(x) + ∇²f(x)(T − x) + (1/2) M · ‖T − x‖ · B(T − x) = 0.   (4.2.25)
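Equation (4.2.25) can be solved by a one-dimensional search: for fixed r, the vector h(r) = −(∇²f(x) + (Mr/2)B)^{−1}∇f(x) satisfies (4.2.25) exactly when ‖h(r)‖ = r, and for convex f the map r ↦ ‖h(r)‖ is nonincreasing, so bisection applies. A sketch with B = I; the test function is an illustration chosen for the demo, not taken from the text:

```python
import numpy as np

def cubic_newton_step(grad, hess, x, M):
    """Solve (4.2.25): grad(x) + hess(x) h + (M/2)||h|| h = 0; return T = x + h."""
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) == 0.0:
        return x.copy()
    # For fixed r, h(r) = -(H + (M r / 2) I)^{-1} g; find r with ||h(r)|| = r.
    h = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(len(x)), -g)
    lo, hi = 1e-12, 1.0
    while np.linalg.norm(h(hi)) > hi:   # bracket: ||h(r)|| < r for large r
        hi *= 2.0
    for _ in range(200):                # bisection on the scalar equation
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(h(mid)) > mid else (lo, mid)
    return x + h(hi)

# Illustrative test function: f(x) = sum_i |x_i|^3/3 - <b, x>, with L3 = 2.
b = np.array([1.0, 4.0])
grad = lambda x: x * np.abs(x) - b
hess = lambda x: np.diag(2.0 * np.abs(x))

x = np.array([3.0, -1.0])
T = cubic_newton_step(grad, hess, x, M=2.0)
h_star = T - x
residual = grad(x) + hess(x) @ h_star + 0.5 * 2.0 * np.linalg.norm(h_star) * h_star
print(np.linalg.norm(residual))   # ~0: T satisfies (4.2.25)
```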

Define rM(x) = ‖x − TM(x)‖. Then, by (4.2.25),

‖∇f(T)‖∗ = ‖∇f(T) − ∇f(x) − ∇²f(x)(T − x) − (M/2) rM(x) B(T − x)‖∗

≤ ((L3 + M)/2) r²M(x)   (by (4.2.8)).   (4.2.26)

Further, multiplying (4.2.25) by T − x, we obtain

⟨∇f(x), x − T⟩ = ⟨∇²f(x)(T − x), T − x⟩ + (1/2) M r³M(x).   (4.2.27)
2
Let us assume that M ≥ L3 . Then, in view of (4.2.9), we have

f(x) − f(T) ≥ f(x) − f̂M(x; T)

= ⟨∇f(x), x − T⟩ − (1/2)⟨∇²f(x)(T − x), T − x⟩ − (M/6) r³M(x)

= (1/2)⟨∇²f(x)(T − x), T − x⟩ + (M/3) r³M(x)   (by (4.2.27)).   (4.2.28)

In particular, since f is convex,

f(x) − f(T) ≥ (M/3) r³M(x) ≥ (M/3) ( (2/(L3 + M)) ‖∇f(T)‖∗ )^{3/2}   (by (4.2.28), (4.2.26)).   (4.2.29)

Sometimes we need to interpret this step from a global perspective: since M ≥ L3 ,

f(T) ≤ min_y { f2(x; y) + (M/6) ‖y − x‖³ }

≤ min_y { f(y) + ((L3 + M)/6) ‖y − x‖³ }   (by (4.2.9)).   (4.2.30)
Finally, let us prove the following result.



Lemma 4.2.5 If M ≥ 2L3 , then

⟨∇f(T), x − T⟩ ≥ ( 2/(L3 + M) )^{1/2} · ‖∇f(T)‖∗^{3/2} .   (4.2.31)

Proof Let T = TM(x) and r = rM(x). Then

(1/4) L3² r⁴ = ( (L3/2) ‖T − x‖² )² ≥ ‖∇f(T) − ∇f(x) − ∇²f(x)(T − x)‖∗²   (by (4.2.8))

= ‖∇f(T) + (1/2) M · r · B(T − x)‖∗²   (by (4.2.25))

= ‖∇f(T)‖∗² + M r ⟨∇f(T), T − x⟩ + (1/4) M² r⁴ .

Hence,

⟨∇f(T), x − T⟩ ≥ (1/(Mr)) ‖∇f(T)‖∗² + (1/(4M)) (M² − L3²) r³ .   (4.2.32)

In view of the conditions of the lemma, we can estimate the derivative in r of the
right-hand side of inequality (4.2.32) on the feasible ray (4.2.26), where r² ≥ (2/(L3 + M)) ‖∇f(T)‖∗ :

−(1/(Mr²)) ‖∇f(T)‖∗² + (3r²/(4M)) (M² − L3²) ≥ −((L3 + M)/(2M)) ‖∇f(T)‖∗ + (3(M − L3)/(2M)) ‖∇f(T)‖∗ = ((M − 2L3)/M) ‖∇f(T)‖∗ ≥ 0.

Thus, its minimum is attained at the boundary point r = ( (2/(L3 + M)) ‖∇f(T)‖∗ )^{1/2}
of the feasible ray (4.2.26). Substituting this value into (4.2.32), we obtain
(4.2.31).  □
To conclude this section, let us estimate the rate of convergence of CNM as applied
to our main problem (4.2.23). We assume that there exists a solution x∗ of this problem,
and that the Lipschitz constant L3 for the Hessian of the objective function is known.
Thus, we just iterate

x_{k+1} = T_{L3}(x_k),  k = 0, 1, . . . .   (4.2.33)

Theorem 4.2.2 Assume that the level sets of problem (4.2.23) are bounded:

‖x − x∗‖ ≤ D  ∀x : f(x) ≤ f(x0).   (4.2.34)

If the sequence {x_k}_{k=1}^∞ is generated by method (4.2.33), then

f(x_k) − f(x∗) ≤ 9 L3 D³ / (k + 4)² ,  k ≥ 1.   (4.2.35)

Proof In view of (4.2.28), f(x_{k+1}) ≤ f(x_k) for all k ≥ 0. Thus, ‖x_k − x∗‖ ≤ D,
k ≥ 0. Further, in view of (4.2.30), we have

f(x_1) ≤ f(x∗) + (L3/3) D³ .   (4.2.36)

Consider now an arbitrary k ≥ 1. Let x_k(τ) = x∗ + (1 − τ)(x_k − x∗). In view of
inequality (4.2.30), for any τ ∈ [0, 1] we have

f(x_{k+1}) ≤ f(x_k(τ)) + τ³ (L3/3) ‖x_k − x∗‖³ ≤ f(x_k) − τ (f(x_k) − f(x∗)) + τ³ L3 D³/3.

The minimum of the right-hand side of this inequality in τ is attained for

τ = ( (f(x_k) − f(x∗)) / (L3 D³) )^{1/2} ≤ ( (f(x_1) − f(x∗)) / (L3 D³) )^{1/2} < 1   (by (4.2.36)).

Thus, for any k ≥ 1, we have

f(x_{k+1}) ≤ f(x_k) − (2/3) · (f(x_k) − f(x∗))^{3/2} / (L3 D³)^{1/2} .   (4.2.37)

Let δ_k = f(x_k) − f(x∗). Then

1/√δ_{k+1} − 1/√δ_k = (δ_k − δ_{k+1}) / ( √δ_k √δ_{k+1} (√δ_k + √δ_{k+1}) )

≥ (2/(3√(L3 D³))) · δ_k / ( √δ_{k+1} (√δ_k + √δ_{k+1}) )   (by (4.2.37))

≥ 1/(3√(L3 D³)).

Thus, for any k ≥ 1, we have

1/√δ_k ≥ 1/√δ_1 + (k − 1)/(3√(L3 D³)) ≥ (1/√(L3 D³)) · ( √3 + (k − 1)/3 )   (by (4.2.36)) ≥ (k + 4)/(3√(L3 D³)).  □
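Iterating this step as prescribed by (4.2.33) is easy to try out numerically. Below is a self-contained sketch with B = I; the separable test function is an illustration (not from the text), chosen so that the minimizer x∗ is known in closed form, and the step T_M is computed by bisection on r = ‖T − x‖:

```python
import numpy as np

# Illustrative setup: f(x) = sum_i |x_i|^3/3 - <b, x>, whose Hessian
# diag(2|x_i|) is Lipschitz continuous with L3 = 2.
L3 = 2.0
b = np.array([1.0, 4.0])
f = lambda x: np.sum(np.abs(x) ** 3) / 3.0 - b @ x
grad = lambda x: x * np.abs(x) - b
hess = lambda x: np.diag(2.0 * np.abs(x))
x_star = np.sign(b) * np.sqrt(np.abs(b))      # here grad(x_star) = 0
f_star = f(x_star)

def T_M(x, M):
    """Cubic Newton step (4.2.24), computed by bisection on r = ||T - x||."""
    g, H = grad(x), hess(x)
    h = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(len(x)), -g)
    lo, hi = 1e-12, 1.0
    while np.linalg.norm(h(hi)) > hi:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(h(mid)) > mid else (lo, mid)
    return x + h(hi)

x = np.zeros(2)
gaps = [f(x) - f_star]
for _ in range(30):
    x = T_M(x, L3)            # method (4.2.33)
    gaps.append(f(x) - f_star)
print(gaps[0], gaps[-1])      # the gap decreases monotonically, as (4.2.28) predicts
```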

4.2.4 An Accelerated Scheme

In order to accelerate method (4.2.33), we apply a variant of the estimating
sequences technique, which we presented in Sect. 2.2.1 as a tool for accelerating
the usual Gradient Method. In our situation, this idea can be applied to CNM in the
following way.
To solve the problem (4.2.23), we recursively update the following sequences.
• The sequence of estimating functions

ψ_k(x) = ℓ_k(x) + (C/6) ‖x − x0‖³ ,  k = 1, 2, . . . ,   (4.2.38)

where the ℓ_k(x) are linear functions of x ∈ E, and C is a positive parameter.
• The minimizing sequence {x_k}_{k=1}^∞ .
• The sequence of scaling parameters {A_k}_{k=1}^∞ :

A_{k+1} def= A_k + a_k ,  k = 1, 2, . . . .

For these objects, we are going to maintain the following relations for all k ≥ 1:

R¹_k : A_k f(x_k) ≤ ψ_k∗ ≡ min_{x∈E} ψ_k(x),
R²_k : ψ_k(x) ≤ A_k f(x) + ((2L3 + C)/6) ‖x − x0‖³ ,  ∀x ∈ E.   (4.2.39)

Let us ensure that relations (4.2.39) hold for k = 1. We choose

x_1 = T_{L3}(x0),   ℓ_1(x) ≡ f(x_1), x ∈ E,   A_1 = 1.   (4.2.40)

Then ψ_1∗ = f(x_1), so R¹_1 holds. On the other hand, in view of definition (4.2.38),
we get

ψ_1(x) = f(x_1) + (C/6) ‖x − x0‖³

≤ min_{y∈E} { f(y) + (2L3/6) ‖y − x0‖³ } + (C/6) ‖x − x0‖³   (by (4.2.30)),

and R²_1 follows.

Assume now that relations (4.2.39) hold for some k ≥ 1. Let

v_k = arg min_{x∈E} ψ_k(x).

Let us choose some a_k > 0 and M ≥ 2L3 . Define2

α_k = a_k/(A_k + a_k),   y_k = (1 − α_k) x_k + α_k v_k ,   x_{k+1} = T_M(y_k),

ψ_{k+1}(x) = ψ_k(x) + a_k [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ ].   (4.2.41)

In view of R²_k , for any x ∈ E we have

ψ_{k+1}(x) ≤ A_k f(x) + ((2L3 + C)/6) ‖x − x0‖³ + a_k [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ ]

≤ (A_k + a_k) f(x) + ((2L3 + C)/6) ‖x − x0‖³   (by (2.1.2)),

and this is R²_{k+1} . Let us show now that, for the appropriate choices of a_k , C and M,
relation R¹_{k+1} is also valid.

2 This is the main difference with the technique presented in Sect. 2.2.1: we update the estimating
function by a linearization computed at the new point x_{k+1} .

Indeed, in view of R¹_k and Lemma 4.2.3 with p = 3, for any x ∈ E, we have

ψ_k(x) ≡ ℓ_k(x) + (C/2) d3(x) ≥ ψ_k∗ + (C/2) · (1/6) ‖x − v_k‖³

≥ A_k f(x_k) + (C/12) ‖x − v_k‖³ .   (4.2.42)

Therefore,

ψ∗_{k+1} = min_{x∈E} { ψ_k(x) + a_k [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ ] }

≥ min_{x∈E} { A_k f(x_k) + (C/12) ‖x − v_k‖³ + a_k [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ ] }   (by (4.2.42))

≥ min_{x∈E} { (A_k + a_k) f(x_{k+1}) + A_k ⟨∇f(x_{k+1}), x_k − x_{k+1}⟩ + a_k ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + (C/12) ‖x − v_k‖³ }   (by (2.1.2))

= min_{x∈E} { A_{k+1} f(x_{k+1}) + ⟨∇f(x_{k+1}), A_{k+1} y_k − a_k v_k − A_k x_{k+1}⟩ + a_k ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + (C/12) ‖x − v_k‖³ }   (by (4.2.41))

= min_{x∈E} { A_{k+1} f(x_{k+1}) + A_{k+1} ⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + a_k ⟨∇f(x_{k+1}), x − v_k⟩ + (C/12) ‖x − v_k‖³ }.

Further, if we choose M ≥ 2L3 , then by (4.2.31) we have

⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ ≥ ( 2/(L3 + M) )^{1/2} · ‖∇f(x_{k+1})‖∗^{3/2} .

Hence, our choice of parameters must ensure the following inequality:

A_{k+1} ( 2/(L3 + M) )^{1/2} ‖∇f(x_{k+1})‖∗^{3/2} + a_k ⟨∇f(x_{k+1}), x − v_k⟩ + (C/12) ‖x − v_k‖³ ≥ 0

for all x ∈ E. Minimizing this expression in x ∈ E, we come to the following
condition:

A_{k+1} ( 2/(L3 + M) )^{1/2} ≥ (4/(3√C)) a_k^{3/2} .   (4.2.43)

For k ≥ 1, let us choose

A_k = k(k + 1)(k + 2)/6,

a_k = A_{k+1} − A_k = (k + 1)(k + 2)(k + 3)/6 − k(k + 1)(k + 2)/6 = (k + 1)(k + 2)/2.   (4.2.44)

Since

a_k^{−3/2} A_{k+1} = 2^{3/2} (k + 1)(k + 2)(k + 3) / ( 6 [(k + 1)(k + 2)]^{3/2} ) = 2^{1/2} (k + 3) / ( 3 [(k + 1)(k + 2)]^{1/2} ) ≥ 2/3,

inequality (4.2.43) leads to the following condition on the parameters:

1/(L3 + M) ≥ 2/C.

Hence, we can choose

M = 2L3 ,   C = 2(L3 + M) = 6L3 .   (4.2.45)

In this case, 2L3 + C = 8L3 .


Now we are ready to put all the pieces together.

Accelerated Cubic Regularization of Newton’s Method

Initialization: Choose x0 ∈ E. Set M = 2L3 and C = 6L3 .

Compute x1 = TL3 (x0 ) and define ψ1 (x) = f (x1 ) + 6 x


C
− x0 3 .

Iteration k,(k ≥ 1):

1. Compute vk = arg min ψk (x) and choose yk = k


k+3 xk + 3
k+3 vk .
x∈E

2. Compute xk+1 = TM (yk ) and update

(k+1)(k+2)
ψk+1 (x) = ψk (x) + 2 · [f (xk+1 ) + ∇f (xk+1 ), x − xk+1 ].

(4.2.46)

The above discussion proves the following theorem.



Theorem 4.2.3 If the sequence {x_k}_{k=1}^∞ is generated by method (4.2.46) as applied
to problem (4.2.23), then for any k ≥ 1 we have

f(x_k) − f(x∗) ≤ 8 L3 ‖x0 − x∗‖³ / ( k(k + 1)(k + 2) ),   (4.2.47)

where x∗ is an optimal solution to the problem.
Proof Indeed, we have shown that

A_k f(x_k) ≤ ψ_k∗ ≤ A_k f(x∗) + ((2L3 + C)/6) ‖x0 − x∗‖³   (by R¹_k and R²_k).

Thus, (4.2.47) follows from (4.2.44) and (4.2.45).  □
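The whole scheme fits in a few dozen lines. Below is a self-contained numerical sketch, not the author's code: B = I, an illustrative test function with L3 = 2, the step T_M computed by bisection on r = ‖T − x‖, and v_k computed in closed form as the minimizer of ψ_k:

```python
import numpy as np

# Illustrative setup: f(x) = sum_i |x_i|^3/3 - <b, x>, with L3 = 2.
L3 = 2.0
b = np.array([1.0, 4.0])
f = lambda x: np.sum(np.abs(x) ** 3) / 3.0 - b @ x
grad = lambda x: x * np.abs(x) - b
hess = lambda x: np.diag(2.0 * np.abs(x))
x_star = np.sign(b) * np.sqrt(np.abs(b))
f_star = f(x_star)

def T_M(x, M):
    """Cubic Newton step, by bisection on r = ||T - x|| (B = I)."""
    g, H = grad(x), hess(x)
    h = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(len(x)), -g)
    lo, hi = 1e-12, 1.0
    while np.linalg.norm(h(hi)) > hi:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(h(mid)) > mid else (lo, mid)
    return x + h(hi)

# Method (4.2.46).
x0 = np.zeros(2)
M, C = 2.0 * L3, 6.0 * L3
x = T_M(x0, L3)                    # x_1
s = np.zeros_like(x0)              # gradient of the linear part of psi_k
K = 20
for k in range(1, K):
    ns = np.linalg.norm(s)
    v = x0 if ns == 0.0 else x0 - np.sqrt(2.0 / (C * ns)) * s  # v_k, closed form
    y = (k * x + 3.0 * v) / (k + 3.0)
    x = T_M(y, M)                  # x_{k+1}
    s = s + 0.5 * (k + 1) * (k + 2) * grad(x)

gap = f(x) - f_star
bound = 8.0 * L3 * np.linalg.norm(x0 - x_star) ** 3 / (K * (K + 1) * (K + 2))
print(gap, bound)   # on this locally strongly convex example, gap << bound
```

On well-conditioned instances like this one, the observed gap is typically orders of magnitude below the worst-case estimate of Theorem 4.2.3.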



Note that the point v_k in (4.2.46) can be found by a closed-form expression.
Consider

s_k = ∇ℓ_k(x).

Since the function ℓ_k(x) is linear, this vector does not depend on x. Therefore,

v_k = x0 − ( 2/(C ‖s_k‖∗) )^{1/2} · B^{−1} s_k .
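This closed-form expression is easy to verify numerically (a sketch with B = I and arbitrary illustrative data): the gradient of ψ(x) = ⟨s, x⟩ + (C/6)‖x − x0‖³ must vanish at v:

```python
import numpy as np

C = 12.0
x0 = np.array([1.0, -2.0, 0.5])
s = np.array([3.0, 0.0, -4.0])      # gradient of the linear part l_k

# Closed-form minimizer (B = I).
v = x0 - np.sqrt(2.0 / (C * np.linalg.norm(s))) * s

# Stationarity: grad psi(v) = s + (C/2) ||v - x0|| (v - x0) should be zero.
g = s + 0.5 * C * np.linalg.norm(v - x0) * (v - x0)
print(np.linalg.norm(g))            # ~0
```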

4.2.5 Global Non-degeneracy for Second-Order Schemes

Traditionally, in Numerical Analysis the term non-degenerate is applied to certain
classes of efficiently solvable problems. For unconstrained optimization, non-
degeneracy of the objective function is usually characterized by a uniform lower
bound τ(f) for the cosine of the angle between the gradient at a point x and the direction
pointing towards the optimal solution:

α(x) def= ⟨∇f(x), x − x∗⟩ / ( ‖∇f(x)‖∗ · ‖x − x∗‖ ) ≥ τ(f) > 0,  x ∈ E.   (4.2.48)

This condition has a nice geometric interpretation. Moreover, there exists a large
class of smooth convex functions possessing this property: the class of
strongly convex functions with Lipschitz continuous gradient.
Lemma 4.2.6 τ(f) ≥ 2√(γ2(f)) / (1 + γ2(f)) ≥ √(γ2(f)).
Proof Indeed, in view of inequality (2.1.32), we have

⟨∇f(x), x − x∗⟩ ≥ (1/(σ2 + L2)) ‖∇f(x)‖∗² + (σ2 L2/(σ2 + L2)) ‖x − x∗‖²

≥ ( 2√(σ2 L2)/(σ2 + L2) ) · ‖∇f(x)‖∗ · ‖x − x∗‖,

and this proves the required inequality.  □




Note that the efficiency bounds of first-order schemes for the class of smooth
strongly convex functions can be completely characterized in terms of the condition
number γ2 . Indeed, on one hand, the lower complexity bound for finding an ε-
solution for problems from this problem class is proven to be

O( (1/√γ2) ln (σ2 D²/ε) )   (4.2.49)

calls of the oracle, where the constant D bounds the distance between the initial
point and the optimal solution (see Theorem 2.1.13). On the other hand, the
simple numerical scheme (2.2.20) exhibits the required rate of convergence (see
Theorem 2.2.3).
What can be said about the complexity of the above problem class for second-order
schemes? Surprisingly enough, in this situation it is difficult to find any
favorable consequences of condition (4.2.48). We will discuss the complexity
bounds for this problem class in detail later, in Sect. 4.2.6. For now, let us present a new
non-degeneracy condition which replaces (4.2.48) for second-order methods.
Assume that γ3(f) = σ3(f)/L3(f) > 0. In this case, by (4.2.13),

f(x) − f(x∗) ≤ (2/(3√σ3)) · ‖∇f(x)‖∗^{3/2} .   (4.2.50)

Therefore, for method (4.2.33) we have

f(x_k) − f(x_{k+1}) ≥ (1/(3√L3)) ‖∇f(x_{k+1})‖∗^{3/2}   (by (4.2.29))

≥ (1/2) √(γ3(f)) · (f(x_{k+1}) − f(x∗))   (by (4.2.50)).   (4.2.51)

Hence, for any k ≥ 1 we have

f(x_k) − f(x∗) ≤ (f(x_1) − f∗) / ( 1 + (1/2)√(γ3(f)) )^{k−1}   (by (4.2.51))

≤ e^{ −√(γ3(f)) (k−1) / (2 + √(γ3(f))) } · (L3/3) ‖x0 − x∗‖³   (by (4.2.30)).   (4.2.52)

Thus, the complexity of minimizing a function with positive condition number
γ3(f) by method (4.2.33) is of the order of

O( (1/√(γ3(f))) ln (L3 D³/ε) )   (4.2.53)

calls of the oracle. The structure of this estimate is similar to that of (4.2.49). Hence,
it is natural to say that such functions possess global second-order non-degeneracy.

Let us demonstrate that the accelerated variant of Newton's method (4.2.46) can
be used to improve the complexity estimate (4.2.53). Denote by A_k(x0) the point x_k
generated by method (4.2.46) with starting point x0 . Consider the following process:

1. Define m = ⌈ (24e/γ3(f))^{1/3} ⌉, and set y0 = x0 .
2. For k ≥ 0, iterate y_{k+1} = A_m(y_k).      (4.2.54)

The performance of this scheme can be derived from the following lemma.
Lemma 4.2.7 For any k ≥ 0 we have

‖y_{k+1} − x∗‖³ ≤ (1/e) ‖y_k − x∗‖³ ,
f(y_{k+1}) − f(x∗) ≤ (1/e) (f(y_k) − f(x∗)).      (4.2.55)

Proof Indeed, since m ≥ (24e/γ3(f))^{1/3} , we have

(1/3) σ3 ‖y_{k+1} − x∗‖³ ≤ f(y_{k+1}) − f(x∗)   (by (4.2.10))

≤ 8 L3 ‖y_k − x∗‖³ / ( m(m + 1)(m + 2) )   (by (4.2.47))

≤ (1/(3e)) σ3 ‖y_k − x∗‖³ ≤ (1/e) (f(y_k) − f(x∗))   (by (4.2.10)).  □
Thus,

f(T_{L3}(y_k)) − f(x∗) ≤ (L3/3) ‖y_k − x∗‖³ ≤ (L3/3) ‖y0 − x∗‖³ · e^{−k}   (by (4.2.30)),

and we conclude that an ε-solution to our problem can be found by (4.2.54) in

O( (1/[γ3(f)]^{1/3}) ln ( (L3/ε) ‖x0 − x∗‖³ ) )   (4.2.56)

iterations. Lower complexity bounds for this problem class have not yet been
developed, so we cannot say how far these results are from the best possible ones.

4.2.6 Minimizing Strongly Convex Functions

Let us look now at the complexity of problem (4.2.23) with

σ2(f) > 0,  L3(f) < ∞.   (4.2.57)

The main advantage of such functions consists in the quadratic convergence of New-
ton's method (4.2.33) in a certain neighborhood of the optimal solution. Indeed, for
T = T_{L3}(x) we have

f(x) − f(T) ≥ (1/2) ⟨∇²f(x)(T − x), T − x⟩   (by (4.2.28)) ≥ (σ2/2) · r²_{L3}(x)

≥ (σ2/(2L3)) · ‖∇f(T)‖∗   (by (4.2.26)) ≥ (σ2/(2L3)) · [ 2σ2 (f(T) − f(x∗)) ]^{1/2}   (by (4.2.13)).   (4.2.58)

Hence,

f(T) − f(x∗) ≤ (2L3²/σ2³) (f(x) − f(T))²   (by (4.2.58)) ≤ (2L3²/σ2³) (f(x) − f(x∗))² .   (4.2.59)

Therefore, the region of quadratic convergence of method (4.2.33) can be defined as

Q_f = { x ∈ E : f(x) − f(x∗) ≤ σ2³/(2L3²) }.   (4.2.60)

Alternatively, the region of quadratic convergence can be described in terms of
the norm of the gradient. Indeed,

(σ2/2) · r²_{L3}(x) ≤ (1/2) ⟨∇²f(x)(T − x), T − x⟩ ≤ f(x) − f(T)   (by (4.2.28)) ≤ ‖∇f(x)‖∗ · r_{L3}(x).

Thus,

‖∇f(x)‖∗ ≥ (σ2/2) · r_{L3}(x) ≥ (σ2/2) ( (1/L3) ‖∇f(T)‖∗ )^{1/2}   (by (4.2.26)).

Consequently,

‖∇f(T)‖∗ ≤ (4L3/σ2²) ‖∇f(x)‖∗² ,   (4.2.61)

and the region of quadratic convergence can be defined as

Q_g = { x ∈ E : ‖∇f(x)‖∗ ≤ σ2²/(4L3) }.   (4.2.62)

Thus, the global complexity of problem (4.2.23), (4.2.57) is mainly related to the
number of iterations required to come from x0 to the region Q_f (or to Q_g). For
method (4.2.33), this value can be estimated from above by

O( (L3(f) D/σ2(f))^{1/2} )   (4.2.63)

iterations, where D is defined by (4.2.34) (see Sect. 4.1). Let us show that, using the accelerated
scheme (4.2.46), it is possible to improve this complexity bound.
Assume that we know an upper bound for the distance to the solution:

‖x0 − x∗‖ ≤ R (≤ D).

Consider the following process:

1. Set y0 = T_{L3}(x0), and define m0 = ⌈ (64 L3(f) R/σ2(f))^{1/3} ⌉.

2. While ‖∇f(T_{L3}(y_k))‖∗ ≥ σ2²/(4L3), iterate { y_{k+1} = A_{m_k}(y_k), m_{k+1} = 2^{−1/3} m_k }.
      (4.2.64)
(4.2.64)
Theorem 4.2.4 The process (4.2.64) terminates after at most

(1/ln 4) · ln ( (8/3) · (L3(f) R/σ2(f))³ )   (4.2.65)

stages. The total number of Newton steps in all stages does not exceed 4m0 .
Proof Let R_k = R · (1/2)^k . It is clear that

m_k ≥ 4 ( L3(f) R_k/σ2(f) )^{1/3} ,  k ≥ 0.   (4.2.66)

For k ≥ 0, let us prove by induction that

‖y_k − x∗‖ ≤ R_k .   (4.2.67)

Assume that for some k ≥ 0 this statement is valid (it is true for k = 0). Then

(σ2/2) ‖y_{k+1} − x∗‖² ≤ f(y_{k+1}) − f(x∗)   (by (2.1.21)) ≤ 8 L3 R_k³ / ( m_k(m_k + 1)(m_k + 2) )   (by (4.2.47))

≤ (8/64) σ2 R_k² = (1/8) σ2 R_k² = (1/2) σ2 R²_{k+1}   (by (4.2.66)).

Thus, (4.2.67) is valid for all k ≥ 0. On the other hand,

f(y_{k+1}) − f(x∗) ≤ 8 L3 ‖y_k − x∗‖³ / ( m_k(m_k + 1)(m_k + 2) )   (by (4.2.47))

≤ 8 L3 ‖y_k − x∗‖² R_k / ( m_k(m_k + 1)(m_k + 2) )   (by (4.2.67))

≤ (1/8) σ2 ‖y_k − x∗‖²   (by (4.2.66)) ≤ (1/4) (f(y_k) − f(x∗))   (by (2.1.21)).

Hence,

(σ2/(2L3)) ‖∇f(T_{L3}(y_k))‖∗ ≤ f(y_k) − f(T_{L3}(y_k))   (by (4.2.58)) ≤ f(y_k) − f(x∗)

≤ (1/4)^k (f(y0) − f(x∗)) ≤ (1/4)^k (L3/3) R³   (by (4.2.30)),

and (4.2.65) follows from (4.2.62). Finally, the total number of Newton steps does
not exceed

Σ_{k=0}^∞ m_k = m0 Σ_{k=0}^∞ 1/2^{k/3} = m0 / (2^{1/3} − 1) < 4m0 .  □

4.2.7 False Acceleration

Note that the properties of the class of smooth strongly convex functions (4.2.57)
leave some room for erroneous conclusions related to the rate of convergence of
optimization methods at the first stage of the process, aiming to enter the region of
quadratic convergence. Let us demonstrate this with a particular example.
Consider a modified version M̃ of method (4.2.46). The only modification is
introduced in Step 2. Now it is as follows:

2'. Compute ŷ_k = T_M(y_k) and update

ψ_{k+1}(x) = ψ_k(x) + ((k + 1)(k + 2)/2) · [ f(ŷ_k) + ⟨∇f(ŷ_k), x − ŷ_k⟩ ].

Choose x̂_k : f(x̂_k) = min{f(x_k), f(ŷ_k)}. Set x_{k+1} = T_M(x̂_k).      (4.2.68)

Note that for M̃ the statement of Theorem 4.2.3 remains valid. Moreover, the process now
becomes monotone, and, using the same reasoning as in (4.2.58) with M = 2L3 ,
we obtain

f(x_k) − f(x_{k+1}) ≥ f(x̂_k) − f(x_{k+1}) ≥ (√2 σ2^{3/2}/(3L3)) · [ f(x_{k+1}) − f(x∗) ]^{1/2} .   (4.2.69)

Further, let us fix the number of steps N. Define k̂ = ⌈(2/3)N⌉. Then, in view
of (4.2.47), we can guarantee that

f(x_{k̂}) − f(x∗) ≤ 8 L3 R³ / k̂³ ≤ 8 L3 R³ / ( (2N/3)³ ) = 3³ L3 R³ / N³ .   (4.2.70)

On the other hand,

f(x_{k̂}) − f(x∗) ≥ f(x_{k̂}) − f(x_{N+1})

≥ (1/3) N · (√2 σ2^{3/2}/(3L3)) · [ f(x_{N+1}) − f(x∗) ]^{1/2}   (by (4.2.69)).   (4.2.71)

Combining (4.2.70) and (4.2.71), we obtain

f(x_{N+1}) − f(x∗) ≤ ( 3^{10} · L3⁴ · R⁶ / (2σ2³) ) · N^{−8} .   (4.2.72)

As compared with the rate of convergence (4.2.47), the proposed modification
looks amazingly efficient. However, this is just an illusion. Indeed, in view
of (4.2.60), in order to enter the region of quadratic convergence of Newton's
method, we need to make the right-hand side of inequality (4.2.72) smaller than
σ2³/(2L3²). For that we need

O( (L3 R/σ2)^{3/4} )   (4.2.73)

iterations of M̃. This is much worse than the complexity estimate (4.2.63) of the
basic scheme (4.2.33), even without the acceleration (4.2.46).
Another clarification comes from an estimate of the number of steps which is
necessary for M̃ to halve the distance to the minimum. From (4.2.72) we see that
it needs O( (L3 R/σ2)^{1/2} ) iterations, which is worse than the corresponding estimate
for the method (4.2.46).

4.2.8 Decreasing the Norm of the Gradient

Let us now check our ability to generate points with a small norm of the gradient
using second-order methods (compare with Sect. 2.2.2). We first look at the simplest
method (4.2.33).
Denote by T the total number of iterations of this scheme. For the sake of
simplicity, let us assume that T = 3m + 2 for some integer m ≥ 0. Let us divide all
iterations of the method into two parts. For the first part, of length 2m, we have

f(x_{2m}) − f∗ ≤ 9 L3 D³ / (4(m + 2)²)   (by (4.2.35)),

where L3 = L3(f). For the second part, of length m + 2, we have

f(x_{2m}) − f(x_T) = Σ_{k=0}^{m+1} ( f(x_{2m+k}) − f(x_{2m+k+1}) ) ≥ ((m + 2)/(3 L3^{1/2})) (g_T∗)^{3/2}   (by (4.2.29)),

where g_T∗ = min_{1≤k≤T} ‖∇f(x_k)‖∗ . Thus,

g_T∗ ≤ ( 27 L3^{3/2} D³ / (4(m + 2)³) )^{2/3} = 3⁴ L3 D² / ( 2^{4/3} (T + 4)² ).   (4.2.74)

Let us look now at the monotone version (4.2.68) of the accelerated Cubic Newton
Method (4.2.46). Let R0 = ‖x0 − x∗‖, and let T = 4m for some integer
m ≥ 1. Then, for the first 3m iterations of this method, we have

f(x_{3m}) − f∗ ≤ 8 L3 R0³ / ( 3m(3m + 1)(3m + 2) )   (by (4.2.47)).

For the second part, of length m, we have

f(x_{3m}) − f(x_T) = Σ_{k=0}^{m−1} ( f(x_{3m+k}) − f(x_{3m+k+1}) ) ≥ (m/(3 L3^{1/2})) (g_T∗)^{3/2}   (by (4.2.29)).

Thus,

g_T∗ ≤ ( 8 L3^{3/2} R0³ / ( m²(3m + 1)(3m + 2) ) )^{2/3} < 2⁸ L3 R0² / T^{8/3} .   (4.2.75)

Finally, let us check what can be achieved with the regularization technique. As
in Sect. 2.2.2, we fix a regularization parameter δ > 0 and introduce the following
function:

f_δ(x) = f(x) + (1/3) δ ‖x − x0‖³ .

Let D = max_{x∈E} { ‖x − x0‖ : f(x) ≤ f(x0) }. Since f_δ(x) ≥ f(x) for all x ∈ E,
the inequality f_δ(x) ≤ f(x0) implies ‖x − x0‖ ≤ D.
In view of Lemmas 4.2.3 and 4.2.4, we have

σ3(f_δ) = (1/2) δ,   L3(f_δ) = L3 + 2δ.

Thus, γ3(f_δ) = δ/(2L3 + 4δ).

Let x_δ∗ = arg min_{x∈E} f_δ(x), and let m = ⌈ ( 24e (4 + 2L3/δ) )^{1/3} ⌉. In view of Lem-
ma 4.2.7, the restarting strategy (4.2.54) ensures the following rate of convergence:

f_δ(y_{k+1}) − f_δ(x_δ∗) ≤ (1/e) ( f_δ(y_k) − f_δ(x_δ∗) ),

where y0 = T_{L3}(x0). Thus, by (4.1.11), f_δ(y_k) − f_δ(x_δ∗) ≤ (1/(3e^k)) L3(f_δ) D³ .
Define y_k⁺ = T_{L3(f_δ)}(y_k). Then f_δ(y_k⁺) ≤ f_δ(y_k) ≤ f(x0). Hence, ‖y_k⁺ − x0‖ ≤
D and we have

‖∇f(y_k⁺)‖∗ ≤ ‖∇f_δ(y_k⁺)‖∗ + δD²

≤ ( 3 L3^{1/2}(f_δ) · ( f_δ(y_k) − f_δ(x_δ∗) ) )^{2/3} + δD²   (by (4.2.29))

≤ (1/e^{2k/3}) L3 D² (1 + 2δ/L3) + δD² .

Let us now choose δ = ε/(2D²), and define κ = L3 D²/ε. Then, to ensure ‖∇f(y_k⁺)‖∗ ≤ ε,
we need to perform

k ≥ (3/2) ln ( 2(1 + κ) )

iterations of the restarting strategy (4.2.54). Each cycle of this strategy needs
⌈ 2 (12e(1 + κ))^{1/3} ⌉ iterations of the Accelerated Cubic Newton Method (4.2.46).
Thus, we get a bound which is asymptotically better than the simple
estimate (4.2.75). However, it seems that for all practical values of the accuracy, the
method (4.2.46), (4.2.68) has better performance guarantees.

4.2.9 Complexity of Non-degenerate Problems

1. From the complexity results presented in the previous sections, we can derive a
class of problems which are easy for second-order schemes:

σ2(f) > 0,  σ3(f) > 0,  L3(f) < ∞.   (4.2.76)

For such functions, second-order methods exhibit a global linear rate of conver-
gence and a local quadratic convergence. In accordance with (4.2.56) and (4.2.60),
we need

O( (L3(f)/σ3(f))^{1/3} ln ( (L3(f)/σ2(f)) ‖x0 − x∗‖ ) )   (4.2.77)

iterations of (4.2.46) to enter the region of quadratic convergence.


Note that the class (4.2.76) is non-trivial. It contains, for example, all functions

ξα,β (x) = αd2 (x) + βd3 (x), α, β > 0,

with parameters

1
σ2 (ξα,β ) = α, σ3 (ξα,β ) = β, L3 (ξα,β ) = 2β.
2
Moreover, any convex function with Lipschitz-continuous Hessian can be regular-
ized by adding an auxiliary function ξα,β .
2. For one important class of convex problems, namely problems with

σ2(f) > 0,  L2(f) < ∞,  L3(f) < ∞,   (4.2.78)

we have actually failed to clarify the situation. The standard theory of optimal first-
order methods (see Sect. 2.2) can bound the number of iterations which are required
to enter the region of quadratic convergence (4.2.60) as follows:

O( (L2(f)/σ2(f))^{1/2} ln ( (L2(f) L3²(f)/σ2³(f)) ‖x0 − x∗‖² ) ).   (4.2.79)

Note that in this estimate the role of the second-order scheme is quite weak: it is
used only to establish the bounds of the termination stage. Of course, as is shown in
Sect. 4.2.6, we could also use it at the first stage. However, in this case the size of
the optimal solution x∗ enters the estimate for the number of iterations polynomially.
Thus, the following question is still open:
Can we get any advantage from second-order schemes being used at the initial stage of the
minimization process, as applied to a function from the problem class (4.2.78)?

We will come back to the complexity of the problem class (4.2.78) in Sect. 5.2,
when we discuss our possibilities in minimizing self-concordant functions.

4.3 Optimal Second-Order Methods

4.3.1 Lower Complexity Bounds

Let us derive lower complexity bounds for the second-order methods as applied to
the problem

f ∗ = minn f (x), (4.3.1)


x∈R

where the Hessian of the objective function is Lipschitz continuous. We assume that
this problem is solvable and x ∗ is its optimal solution.
For the sake of simplicity, as we did in Sect. 2.1.2 (see Assumption 2.1.4), let us
first fix the natural rules for generating the test points. It can be easily checked that
the second-order methods usually compute the next test point as follows:

xk+1 = xk − hk [αk In + (1 − αk )∇ 2 f (xk )]−1 ∇f (xk ),

where hk > 0 is a step-size parameter, and the coefficient αk ∈ [0, 1] depends


on a particular optimization scheme. In the case αk = 1, we get the usual Gradient
Method. The case αk = 0 corresponds to the standard Newton direction. Finally, the
Cubic Regularization strategy (4.2.24) and the majority of Trust Region Methods
compute these values from some equation (see, for example, (4.2.25)). Therefore,
the following assumption looks quite reasonable.
Assumption 4.3.1 All iterative second-order schemes generate a sequence of test
points {xk }k≥0 such that
 
xk+1 ∈ x0 + Lin{ Gf(x0), . . . , Gf(xk) },   k ≥ 0,   (4.3.2)

where Gf(x) = cl( Conv{ [αIn + (1 − α)∇²f(x)]⁻¹∇f(x), α ∈ [0, 1) } ).
Note that the set Gf (x) also contains ∇f (x). Therefore, the rules for computing
the point vk in the accelerated method (4.2.46) also satisfy condition (4.3.2).
For 2 ≤ k ≤ n, consider the following parametric family of functions:
" %

k−1 
n
fk (x) = 1
3 |x (i) − x (i+1) |3 + |x (i) |3 − x (1), x ∈ Rn . (4.3.3)
i=1 i=k

This is a uniformly convex function, and its unique minimum can be found from the
following system of equations:

(x (1) − x (2) )|x (1) − x (2) | = 1,

(x (i) − x (i−1) )|x (i) − x (i−1) | + (x (i) − x (i+1) )|x (i) − x (i+1) | = 0, 2 ≤ i ≤ k − 1,

(x (k) − x (k−1) )|x (k) − x (k−1) | + x (k) |x (k) | = 0,

x (i) |x (i) | = 0, k + 1 ≤ i ≤ n.

Clearly, the only solution of this system is given by vector x∗ with coordinates
x*^(i) = (k − i + 1)+,   i = 1, . . . , n,   (4.3.4)

where (τ )+ = max{τ, 0}. For our methods, we always take x0 = 0. Therefore, we


have the following characteristics of our problem (4.3.1) with f = fk:

fk* = −(2/3)k,
                                                                    (4.3.5)
Rk² = ‖x0 − x*‖²_(2) = Σ_{i=1}^{k} i² < (k+1)³/3.
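The characteristics (4.3.5) are easy to verify numerically from the definition (4.3.3) and the minimizer (4.3.4). A small sketch (our own code and variable names, not part of the text):

```python
import numpy as np

def f_k(x, k):
    # (4.3.3): f_k(x) = (1/3)[ sum_{i<k} |x^(i)-x^(i+1)|^3 + sum_{i>=k} |x^(i)|^3 ] - x^(1)
    head = np.sum(np.abs(x[:k - 1] - x[1:k]) ** 3)
    tail = np.sum(np.abs(x[k - 1:]) ** 3)
    return (head + tail) / 3.0 - x[0]

n, k = 10, 6
x_star = np.maximum(k - np.arange(1, n + 1) + 1.0, 0.0)    # coordinates (4.3.4)
assert abs(f_k(x_star, k) - (-2.0 / 3.0 * k)) < 1e-10      # f_k^* = -(2/3) k
R2 = np.sum(x_star ** 2)                                   # ||x0 - x*||^2 for x0 = 0
assert R2 == sum(i ** 2 for i in range(1, k + 1))
assert R2 < (k + 1) ** 3 / 3.0                             # bound in (4.3.5)
```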

It remains to estimate the Lipschitz constant of the Hessian of the function fk with
respect to the standard Euclidean norm.
Let us look first at the Hessian of the following function


ρ3(u) = (1/3) Σ_{i=1}^{n} |u^(i)|³,   u ∈ Rⁿ.

For a direction h ∈ Rⁿ, we have ⟨∇²ρ3(u)h, h⟩ = 2 Σ_{i=1}^{n} |u^(i)|(h^(i))². Therefore, for
u, v ∈ Rⁿ we get
|⟨(∇²ρ3(u) − ∇²ρ3(v))h, h⟩| = 2 | Σ_{i=1}^{n} (|u^(i)| − |v^(i)|)(h^(i))² | ≤ 2‖u − v‖_(∞) ‖h‖²_(2).

Note that the function fk(·) can be represented as follows:

fk(x) = ρ3(Bk x) − x^(1),   Bk = diag(Ak, In−k) ∈ Rⁿˣⁿ,

where the upper bi-diagonal matrix Ak ∈ Rᵏˣᵏ has ones on its main diagonal, −1 on
its superdiagonal, and zeros elsewhere:

     ⎛ 1 −1          0 ⎞
     ⎜    1 −1         ⎟
Ak = ⎜      ⋱   ⋱      ⎟ .
     ⎜          ⋱  −1  ⎟
     ⎝ 0            1  ⎠

Therefore, for any point x, displacement d, and direction h in Rn we have


|⟨(∇²fk(x + d) − ∇²fk(x))h, h⟩| = |⟨(∇²ρ3(Bk(x + d)) − ∇²ρ3(Bk x))Bk h, Bk h⟩|

                                ≤ 2‖Bk d‖_(∞) ‖Bk h‖²_(2).



Note that for any d and h in Rⁿ we have

‖Bk d‖_(∞) ≤ max_{1≤i≤n−1} { |d^(i)| + |d^(i+1)| } ≤ max_{1≤i≤n−1} ( 2[(d^(i))² + (d^(i+1))²] )^{1/2} ≤ 2^{1/2} ‖d‖_(2),

‖Bk h‖²_(2) = Σ_{i=1}^{k−1} (h^(i) − h^(i+1))² + Σ_{i=k}^{n} (h^(i))² ≤ 4‖h‖²_(2).

Thus, we conclude that



‖∇²fk(x + d) − ∇²fk(x)‖ ≤ 8 · 2^{1/2} ‖d‖_(2),

and we can take L = 2^{7/2} as the Lipschitz constant for the Hessian of this function.
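Using the representation fk(x) = ρ3(Bk x) − x^(1), the Hessian is ∇²fk(x) = Bkᵀ diag(2|Bk x|) Bk, and the constant 2^{7/2} can be checked numerically on random points. A sketch (our own helper names, not from the text):

```python
import numpy as np

def B_mat(n, k):
    # block-diagonal B_k: upper bi-diagonal A_k (1 on the diagonal, -1 above it)
    # in the leading k x k block, identity in the rest
    B = np.eye(n)
    for i in range(k - 1):
        B[i, i + 1] = -1.0
    return B

def hess_fk(x, B):
    # Hessian of f_k(x) = rho3(B x) - x^(1):  B^T diag(2 |B x|) B
    return B.T @ np.diag(2.0 * np.abs(B @ x)) @ B

rng = np.random.default_rng(4)
n, k = 7, 4
B = B_mat(n, k)
for _ in range(100):
    x, d = rng.standard_normal(n), rng.standard_normal(n)
    gap = np.linalg.norm(hess_fk(x + d, B) - hess_fk(x, B), 2)
    assert gap <= 2.0 ** 3.5 * np.linalg.norm(d) + 1e-9    # L = 2^{7/2}
```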
In order to understand the behavior of numerical schemes satisfying condi-
tion (4.3.2), as applied to minimization of some function ft with t big enough, we
need to introduce the following subspaces (compare with Sect. 2.1.2):

Rk,n = {x ∈ Rⁿ : x^(i) = 0 for i > k},   1 ≤ k ≤ n − 1,

Sk,n = {H ∈ Rⁿˣⁿ : H = Hᵀ, H^(i,j) = 0 if i ≠ j and (i > k or j > k)}.

Let us write down the first and second derivatives of the function ft along a
direction h ∈ Rⁿ (see (4.3.3)):

⟨∇ft(x), h⟩ = Σ_{i=1}^{t−1} |x^(i) − x^(i+1)| (x^(i) − x^(i+1)) (h^(i) − h^(i+1))

              + Σ_{i=t}^{n} |x^(i)| x^(i) h^(i) − h^(1),

⟨∇²ft(x)h, h⟩ = 2 Σ_{i=1}^{t−1} |x^(i) − x^(i+1)| (h^(i) − h^(i+1))² + 2 Σ_{i=t}^{n} |x^(i)| (h^(i))².
                                                                    (4.3.6)

From this structure, we derive the following important conclusions.


Lemma 4.3.1 If x ∈ Ri,n and i < k, then ∇ft (x) ∈ Ri+1,n and ∇ 2 ft (x) ∈
Si+1,n .

Corollary 4.3.1 Let xi ∈ Ri,n , i = 0, . . . , k, and suppose the point xk+1 satisfies
condition (4.3.2) with f (·) = ft (·), where k + 1 ≤ t ≤ n. Then xk+1 ∈ Rk+1,n .
Proof Indeed, in view of Lemma 4.3.1, we have

∇ft (xi ) ∈ Ri+1,n ⊂ Rk+1,n , ∇ 2 ft (xi ) ∈ Si+1,n ⊂ Sk+1,n , i = 0, . . . , k.



Therefore,

[αIn + (1 − α)∇ 2 ft (xi )]−1 ∇ft (xi ) ∈ Rk+1,n

for all α ∈ [0, 1) and i = 0, . . . , k. 
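The mechanism behind Corollary 4.3.1 can be illustrated numerically: if H has the sparsity pattern of Sk+1,n and the gradient lies in Rk+1,n, then the direction [αIn + (1 − α)H]⁻¹∇f cannot leave Rk+1,n. A small sketch (our own construction; the tail diagonal of H is shifted away from zero only to keep the matrix safely invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
# H in S_{k+1,n}: symmetric, off-diagonal entries vanish outside the leading (k+1)x(k+1) block
C = rng.standard_normal((k + 1, k + 1))
H = np.zeros((n, n))
H[:k + 1, :k + 1] = C + C.T
tail = np.arange(k + 1, n)
H[tail, tail] = 1.0 + np.abs(rng.standard_normal(n - k - 1))
g = np.zeros(n)
g[:k + 1] = rng.standard_normal(k + 1)        # gradient in R_{k+1,n}
alpha = 0.3
d = np.linalg.solve(alpha * np.eye(n) + (1.0 - alpha) * H, g)
assert np.max(np.abs(d[k + 1:])) < 1e-12      # the direction stays in R_{k+1,n}
```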



Our last observation is as follows.
Lemma 4.3.2 For any p ≥ 0 and x ∈ Rk,n , we have fk+p (x) = fk (x).

Now we can prove the lower complexity bound for the second-order methods.
Theorem 4.3.1 Let the Hessian of the objective function f in problem (4.3.1)
be Lipschitz continuous with constant Lf . Assume that the rules of a second-
order method M satisfy condition (4.3.2), and for any starting point x0 with
‖x0 − x*‖_(2) ≤ ρ0 we can guarantee that

min_{0≤i≤k} f(xi) − f(x*) ≤ Lf ρ0³ / CM(k),   (4.3.7)

where k is the number of generated test points. Then for k = 3m + 2 with integer
m, 0 ≤ m ≤ n/4 − 1, we have

CM (k) ≤ 36(k + 1)3.5 . (4.3.8)

Proof Let k = 3m + 2 for some integer m ≥ 0. Define t = 4m + 3. Then

k + 1 = 3(m + 1), t + 1 = 4(m + 1).

Let us apply method M for minimizing the function ft(·) starting from the point
x0 = 0. Note that ∇ft(x0) = −e1 ∈ R1,n and ∇²ft(x0) = 0. Therefore, by (4.3.2),
x1 ∈ R1,n, and by induction, using Corollary 4.3.1, we get xk ∈ Rk,n, 0 ≤ k ≤ t.
Hence, by Lemma 4.3.2, we have

(2/3)(m + 1) = fk* − ft* ≤ min_{0≤i≤k} ft(xi) − ft* ≤ Lf ρ0³ / CM(k)

                                                    ≤ (2^{7/2}/CM(k)) ((t+1)³/3)^{3/2},

where the first equality follows from (4.3.5) and the last two inequalities from
(4.3.7) and (4.3.5). Thus,

CM(k) ≤ 2^{5/2}(t + 1)^{9/2} / (3^{1/2}(m + 1)) = (2^{5/2} 3^{1/2}/(k + 1)) ((4/3)(k + 1))^{9/2}

      = (2^{23/2}/3⁴)(k + 1)^{3.5} < 36(k + 1)^{3.5}. □


As we can see, the lower bound (4.3.8) is a little bit better than the rate of
convergence (4.2.47) of the Accelerated Cubic Regularization (4.2.46). In the next
section, we will discuss the possibility of reaching this lower bound.

4.3.2 A Conceptual Optimal Scheme

As in Sect. 4.2.3, let us fix a self-adjoint positive definite operator B : E → E∗ and


define primal and dual Euclidean norms

‖x‖ = ⟨Bx, x⟩^{1/2},   ‖g‖* = ⟨g, B⁻¹g⟩^{1/2},   x ∈ E, g ∈ E*.

Consider the problem of unconstrained optimization

min f (x), (4.3.9)


x∈E

where the Hessian of the function f satisfies the Lipschitz condition

‖∇²f(x) − ∇²f(y)‖ ≤ Mf ‖x − y‖,   ∀x, y ∈ E.   (4.3.10)

Our main iteration will be the Cubic Newton Step



TM(x) = arg min_{T∈E} { ⟨∇f(x), T − x⟩ + (1/2)⟨∇²f(x)(T − x), T − x⟩ + (M/6)‖T − x‖³ }.   (4.3.11)

Let rM(x) = ‖TM(x) − x‖. Then the point T = TM(x) is characterized by the
following first-order optimality condition:

∇f(x) + ∇²f(x)(T − x) + (1/2)M rM(x) B(T − x) = 0.   (4.3.12)
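For B = In and a positive definite Hessian, condition (4.3.12) reduces to the scalar equation ‖(∇²f(x) + (M r/2)In)⁻¹∇f(x)‖ = r, which can be solved by bisection on r. The following sketch is our own illustration of this standard computational trick, not the book's procedure:

```python
import numpy as np

def cubic_step(g, H, M):
    # Solve (4.3.12) with B = I:  g + H d + (M r / 2) d = 0,  r = ||d||,
    # by bisection on the scalar r (H is assumed positive definite here).
    def d_of(r):
        return -np.linalg.solve(H + 0.5 * M * r * np.eye(len(g)), g)
    hi = 1.0
    while np.linalg.norm(d_of(hi)) > hi:   # bracket the root of ||d(r)|| - r
        hi *= 2.0
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(d_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return d_of(hi)

rng = np.random.default_rng(0)
n, M = 5, 2.0
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)                    # positive definite Hessian
g = rng.standard_normal(n)
d = cubic_step(g, H, M)
res = g + H @ d + 0.5 * M * np.linalg.norm(d) * d
assert np.linalg.norm(res) < 1e-8          # stationarity condition (4.3.12) holds
```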

Lemma 4.3.3 For any x ∈ E we have

⟨∇f(TM(x)), x − TM(x)⟩ ≥ (1/(M rM(x))) ‖∇f(TM(x))‖*² + ((M² − Mf²)/(4M)) rM(x)³.   (4.3.13)

Moreover, if M ≥ (1/σ)Mf for some σ ∈ (0, 1], then

⟨∇f(TM(x)), x − TM(x)⟩ ≥ (1/(M rM(x))) ‖∇f(TM(x))‖*² + ((1 − σ²)/4) M rM(x)³.   (4.3.14)

Proof Let T = TM(x). Then by (4.3.10),

(Mf² rM(x)⁴)/4 ≥ ‖∇f(T) − ∇f(x) − ∇²f(x)(T − x)‖*²

               = ‖∇f(T) + (1/2)M rM(x) B(T − x)‖*²                 (by (4.3.12))

               = ‖∇f(T)‖*² + M rM(x) ⟨∇f(T), T − x⟩ + (M² rM(x)⁴)/4.

This is (4.3.13). Inequality (4.3.14) follows from (4.3.13) since Mf ≤ σM. □



Let us consider now the following conceptual version of the Optimal Cubic Newton
Method.

Optimal Cubic Newton Method (Conceptual Version)

Initialization. Choose x0 ∈ E, σ ∈ (0, 1). Define ψ0(x) = (1/2)‖x − x0‖².
Set A0 = 0 and M = (1/σ)Mf.

kth iteration (k ≥ 0).

(a) Compute vk = arg min_{x∈E} ψk(x).

(b) Choose ρk > 0 and find ak+1 > 0 from the equation ak+1² = 2(Ak + ak+1)/(M ρk).

(c) Set Ak+1 = Ak + ak+1, τk = ak+1/Ak+1, yk = (1 − τk)xk + τk vk.

(d) Compute xk+1 = TM(yk) and define

    ψk+1(x) = ψk(x) + ak+1 [f(xk+1) + ⟨∇f(xk+1), x − xk+1⟩].

                                                               (4.3.15)

Step (b) of method (4.3.15) is not completely specified since the definition of the
parameter ρk is missing. This is the reason why we call this method conceptual. Let
us present some guidelines for its choice.
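Whatever rule is used for ρk, Step (b) itself is elementary: ak+1 is the positive root of the quadratic equation Mρk a² − 2a − 2Ak = 0. A one-line sketch (our own code):

```python
import math

def next_a(A_k, M, rho):
    # Step (b) of (4.3.15): positive root of  M*rho*a^2 = 2*(A_k + a)
    return (1.0 + math.sqrt(1.0 + 2.0 * A_k * M * rho)) / (M * rho)

a = next_a(3.0, 2.0, 0.5)
assert abs(2.0 * 0.5 * a * a - 2.0 * (3.0 + a)) < 1e-9   # a solves the equation
```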
Lemma 4.3.4 Assume that parameters ρk in method (4.3.15) satisfy condition

rM (yk ) ≤ ρk . (4.3.16)

Then for any k ≥ 0 we have

Ak f(xk) + Bk ≤ ψk* := min_{x∈E} ψk(x),   (4.3.17)

where Bk = ((1 − σ²)/4) M Σ_{i=0}^{k−1} Ai+1 rM(yi)³.
Proof Let us prove (4.3.17) by induction. For k = 0 it is trivial. Assume that
inequality (4.3.17) is valid for some k ≥ 0. Then for any x ∈ E we have

ψk+1(x) ≥ ψk* + (1/2)‖x − vk‖² + ak+1 [f(xk+1) + ⟨∇f(xk+1), x − xk+1⟩]

        ≥ Ak f(xk) + Bk + (1/2)‖x − vk‖²                           (by (4.3.17))

          + ak+1 [f(xk+1) + ⟨∇f(xk+1), x − xk+1⟩]

        ≥ Ak+1 f(xk+1) + Bk + (1/2)‖x − vk‖²

          + ⟨∇f(xk+1), Ak(xk − xk+1) + ak+1(x − xk+1)⟩

        = Ak+1 f(xk+1) + Bk + (1/2)‖x − vk‖²

          + ⟨∇f(xk+1), ak+1(x − vk) + Ak+1(yk − xk+1)⟩.

Therefore,


ψk+1* ≥ Ak+1 f(xk+1) + Bk − (1/2) ak+1² ‖∇f(xk+1)‖*²

        + Ak+1 ⟨∇f(xk+1), yk − xk+1⟩

      ≥ Ak+1 f(xk+1) + Bk − (Ak+1/(M ρk)) ‖∇f(xk+1)‖*²

        + Ak+1 [ (1/(M rM(yk))) ‖∇f(xk+1)‖*² + ((1 − σ²)/4) M rM(yk)³ ]   (by (4.3.14))

      ≥ Ak+1 f(xk+1) + Bk + ((1 − σ²)/4) M Ak+1 rM(yk)³.                  (by (4.3.16))  □
In order to ensure a fast growth of the coefficients Ak , we need to introduce more
conditions for the parameters ρk .

Lemma 4.3.5 Let us choose γ ≥ 1. Assume that parameters ρk in method (4.3.15)


satisfy condition

rM (yk ) ≤ ρk ≤ γ rM (yk ). (4.3.18)

Then for any k ≥ 1 we have



Ak ≥ (1/4) (1/γ)^{3/2} ( (1 − σ²)^{1/2} / (M ‖x0 − x*‖) ) ((2k+1)/3)^{3.5}.   (4.3.19)

Proof First of all, let us relate the rate of growth of the coefficients Ak to the values
rM(yk). Note that

Ak+1^{1/2} − Ak^{1/2} = ak+1/(Ak+1^{1/2} + Ak^{1/2}) = (1/(Ak+1^{1/2} + Ak^{1/2})) (2Ak+1/(M ρk))^{1/2} ≥ (1/(2M ρk))^{1/2}.

Thus,

Ak ≥ (1/(2M)) ( Σ_{i=0}^{k−1} 1/ρi^{1/2} )² ≥ (1/(2Mγ)) ( Σ_{i=0}^{k−1} 1/rM(yi)^{1/2} )²,   (4.3.20)

where the second inequality follows from (4.3.18).

On the other hand, by (4.3.17) we have Ak f(xk) + Bk ≤ Ak f(x*) + (1/2)‖x0 − x*‖².
Therefore,

Bk ≡ ((1 − σ²)/4) M Σ_{i=0}^{k−1} Ai+1 rM(yi)³ ≤ (1/2)‖x0 − x*‖².


Let us estimate from below the value Σ_{i=0}^{k−1} 1/rM(yi)^{1/2} subject to the above constraint.
Defining ξi = rM(yi)^{1/2} and D = 2‖x0 − x*‖² / ((1 − σ²)M), we come to the following
minimization problem:

ξ* = min_{ξ∈Rᵏ} { Σ_{i=0}^{k−1} 1/ξi : Σ_{i=0}^{k−1} Ai+1 ξi⁶ ≤ D }.

Introducing a Lagrange multiplier λ for the inequality constraint, we get the
following optimality conditions:

1/ξi² = λ Ai+1 ξi⁵,   i = 0, . . . , k − 1.

Thus, ξi = (1/(λ Ai+1))^{1/7}. Since the constraint is active,

D = Σ_{i=0}^{k−1} Ai+1 (1/(λ Ai+1))^{6/7} = (1/λ^{6/7}) Σ_{i=0}^{k−1} Ai+1^{1/7}.

Therefore, ξ* = Σ_{i=0}^{k−1} (λ Ai+1)^{1/7} = (1/D^{1/6}) ( Σ_{i=0}^{k−1} Ai+1^{1/7} )^{7/6}. Coming back to
our initial notation, we get

Σ_{i=0}^{k−1} 1/rM(yi)^{1/2} ≥ ( (1 − σ²)M / (2‖x0 − x*‖²) )^{1/6} ( Σ_{i=0}^{k−1} Ai+1^{1/7} )^{7/6}.

In view of inequality (4.3.20), we come to the following relation:

Ak ≥ (1/(2γ)) ( (1 − σ²) / (2M²‖x0 − x*‖²) )^{1/3} ( Σ_{i=1}^{k} Ai^{1/7} )^{7/3},   k ≥ 1.   (4.3.21)

Denote the coefficient in the right-hand side of inequality (4.3.21) by θ and let

Ck = ( Σ_{i=1}^{k} Ai^{1/7} )^{2/3}.

Then (4.3.21) can be rewritten as Ak ≥ θ Ck^{7/2}, and hence

Ck+1^{3/2} − Ck^{3/2} = Ak+1^{1/7} ≥ θ^{1/7} Ck+1^{1/2}.

This means that C1 ≥ θ^{1/7} and

θ^{1/7} Ck+1^{1/2} ≤ (Ck+1^{1/2} − Ck^{1/2}) ( Ck+1^{1/2}(Ck+1^{1/2} + Ck^{1/2}) + Ck )

                  ≤ (Ck+1^{1/2} − Ck^{1/2}) ( Ck+1^{1/2}(Ck+1^{1/2} + Ck^{1/2}) + (1/2) Ck+1^{1/2}(Ck+1^{1/2} + Ck^{1/2}) )

                  = (3/2) Ck+1^{1/2} (Ck+1 − Ck).

Thus, Ck ≥ θ^{1/7} (1 + (2/3)(k − 1)) = θ^{1/7} (2k+1)/3 for k ≥ 1. Finally, we obtain

Ak ≥ θ Ck^{7/2} ≥ θ ( θ^{1/7} (2k+1)/3 )^{7/2} = θ^{3/2} ((2k+1)/3)^{7/2}

   = ( (1/(2γ)) ( (1 − σ²)/(2M²‖x0 − x*‖²) )^{1/3} )^{3/2} ((2k+1)/3)^{3.5}

   = (1/4) (1/γ)^{3/2} ( (1 − σ²)^{1/2} / (M‖x0 − x*‖) ) ((2k+1)/3)^{3.5}. □

Now we are ready to justify the rate of convergence of method (4.3.15).



Theorem 4.3.2 Let us choose σ ∈ (0, 1) and γ ≥ 1. Suppose that the parameters
ρk in method (4.3.15) satisfy condition (4.3.18). If method (4.3.15) is applied with
M = (1/σ)Mf, then for any k ≥ 1 we have

f(xk) − f(x*) ≤ ( 2γ^{3/2} Mf ‖x0 − x*‖³ / (σ(1 − σ²)^{1/2}) ) (3/(2k+1))^{3.5}.   (4.3.22)
σ 1−σ 2

Proof Indeed, in view of inequality (4.3.17), we have

f(xk) − f(x*) ≤ (1/(2Ak)) ‖x0 − x*‖².

It remains to use the lower bound (4.3.19). □



The best value of σ in the right-hand side of inequality (4.3.22) is σ = 1/√2. In this
case,

f(xk) − f(x*) ≤ 4γ^{3/2} Mf ‖x0 − x*‖³ (3/(2k+1))^{3.5},   k ≥ 1.   (4.3.23)

4.3.3 Complexity of the Search Procedure

In the previous section, we presented a conceptual second-order scheme (4.3.15),


which reaches the best possible rate of convergence (4.3.8). In contrast to the
Accelerated Cubic Newton Method (4.2.46), its estimating sequence {ψk } starts
from the squared Euclidean norm. Another difference consists in the presence of the
coefficient ρk in the equation defining the scaling coefficient ak+1 (see Step (b)). In
order to make this method function in accordance to its rate of convergence (4.3.22),
we need to ensure that

ρk ≈ rM (yk ). (4.3.24)

Note that the right-hand side of this equality is a continuous function of ρk . In this
method, if ρk = 0, then ak+1 = +∞ and yk = vk . In this case, the left-hand side of
inequality (4.3.24) is smaller than its right-hand side. If ρk → ∞, then ak+1 → 0
and yk → xk . Thus, there is always a root of equation (4.3.24).
However, the problem is that any search procedure in ρk is very expensive.
It needs to call the oracle many times. At present it is difficult to point out any
favorable property of function yk = yk (ρk ) which could help.
At the same time, from the practical point of view, the gain from this acceleration
of the rate of convergence is very small. Indeed, method (4.2.46) ensures O(1/ε^{1/3})
complexity of finding an ε-solution of problem (4.3.9). The number of iterations of
method (4.3.15) is of the order O(1/ε^{2/7}). Thus, the gain in the number of iterations of
the “optimal” method is bounded by a factor proportional to (1/ε)^{1/21}. For the values
of ε used in practical applications, namely the range 10⁻⁴ . . . 10⁻¹², this is just
an absolute constant (since (10¹²)^{1/21} < 4). Therefore, this factor, decreasing the
total number of iterations, cannot compensate a significant increase in the analytical
computational complexity of each iteration. That is the main reason why we drop
the cumbersome analysis of the complexity of the corresponding search procedure
in this book.
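The arithmetic behind this comparison is immediate: the ratio of the two iteration counts is proportional to ε^{−1/3+2/7} = (1/ε)^{1/21}. A quick numeric confirmation (our own code):

```python
# Iteration counts: O(1/eps^{1/3}) for (4.2.46) vs O(1/eps^{2/7}) for (4.3.15);
# their ratio is proportional to (1/eps)^{1/3 - 2/7} = (1/eps)^{1/21}.
for eps in (1e-4, 1e-8, 1e-12):
    assert (1.0 / eps) ** (1.0 / 21.0) < 4.0   # bounded by a small absolute constant
```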
To conclude, from the practical point of view, method (4.2.46) is now the fastest
second-order scheme. At the same time, the problem of finding the optimal second-
order method with cheap iteration remains an open and challenging question in
Optimization Theory.

4.4 The Modified Gauss–Newton Method

(Quadratic regularization; The modified Gauss–Newton process; Global rate of conver-


gence; Comparative analysis; Implementation issues.)

4.4.1 Quadratic Regularization of the Gauss–Newton Iterate

The problem of solving a system of nonlinear equations is one of the most


fundamental problems in Numerical Analysis. The standard approach consists in
replacing the initial problem

Find x ∈ E : fi (x) = 0, i = 1, . . . , m, (4.4.1)

by a minimization problem
 
def
min f (x) = φ(f1 (x), . . . , fm (x)) , (4.4.2)
x∈E

where function φ(u) is non-negative and vanishes only at the origin. The most
recommended choice for this merit function φ(u) is the standard squared Euclidean
norm:
φ(u) = ‖u‖²_(2) ≡ Σ_{i=1}^{m} (u^(i))²,   (4.4.3)

where squaring the norm has the advantage of keeping the objective function
in (4.4.2) smooth enough. Of course, the new problem (4.4.2), (4.4.3) can be solved
by the standard second-order minimization schemes. However, it is possible to
reduce the order of the required derivatives by applying the so-called Gauss–Newton
approach. In this case, the search direction is defined as a solution of the following

auxiliary problem:

min_{h∈E} { φ( f1(x) + ⟨∇f1(x), h⟩, . . . , fm(x) + ⟨∇fm(x), h⟩ ) : x + h ∈ D(x) },

where D(x) is a properly chosen neighborhood of the point x. Under some non-
degeneracy assumptions, for this strategy it is possible to establish local quadratic
convergence.
Despite its elegance, the above approach deserves some criticism. Indeed, the
transformation of problem (4.4.1) into problem (4.4.2) is done in a quite straight-
forward way. For example, if the initial system of equations is linear, then such a
transformation squares the condition number of the problem. Besides increasing
numerical instability, for large problems this leads to squaring the number of
iterations, which is necessary to get an -solution of the original problem.
In this section, we consider another approach. At first glance, it looks very
similar to the standard one: We replace our initial problem by a minimization
problem (4.4.2). However, our merit function is non-smooth.
Before we start, let us recall some notation. For a linear operator A : E1 → E2 ,
its adjoint operator A∗ : E∗2 → E∗1 is defined as follows:

⟨s, Ax⟩ = ⟨A*s, x⟩,   ∀x ∈ E1, s ∈ E2*.

For measuring distances in E1 and E2, we introduce the norms ‖·‖_{E1} and ‖·‖_{E2}.
In the dual spaces, the norms are defined in the standard way. For example,

‖s‖_{E1*} = max_{x∈E1} { ⟨s, x⟩ : ‖x‖_{E1} ≤ 1 },   s ∈ E1*.

If no ambiguity occurs, we drop the subindexes of the norms since they are always
defined by the spaces containing the arguments. For example, ‖s‖ ≡ ‖s‖_{E1*} for
s ∈ E1*.
For A : E1 → E2 , we define the minimal singular value as follows:

σmin(A) = min_{x∈E1} { ‖Ax‖ : ‖x‖ = 1 }  ⇒  ‖Ax‖ ≥ σmin(A)‖x‖  ∀x ∈ E1.

For invertible A, we have σmin(A) = 1/‖A⁻¹‖. Note that for two linear operators
A1 and A2 ,

σmin (A1 A2 ) ≥ σmin (A1 ) · σmin (A2 ).

If σmin (A) > 0, then we say that the operator A possesses primal non-degeneracy.
If σmin (A∗ ) > 0, then we say that A possesses dual non-degeneracy.

Finally, for a non-linear function F(·) : E1 → E2 we denote by F′(x) its
Jacobian, which is a linear operator from E1 to E2:

F′(x)h = lim_{α→0} (1/α)[F(x + αh) − F(x)] ∈ E2,   h ∈ E1.

In the special case f(·) : E1 → E2 ≡ R, we have f′(x)h = ⟨∇f(x), h⟩ for all
h ∈ E1.
Consider a smooth non-linear function F (·) : E1 → E2 . Our main problem of
interest is to find an approximate solution to the following system of equations:

F (x) = 0, x ∈ E1 . (4.4.4)

In order to measure the quality of such a solution, we introduce a (sharp) merit


function φ(u), u ∈ E2 , which satisfies the following conditions:
• It is convex, non-negative and vanishes only at the origin. (Hence, its level sets
are bounded.)
• It is Lipschitz-continuous with unit Lipschitz constant:

|φ(u) − φ(v)| ≤ ‖u − v‖,   ∀u, v ∈ E2.   (4.4.5)

• It has a sharp minimum at the origin:

φ(u) ≥ γφ ‖u‖,   ∀u ∈ E2,   (4.4.6)

for a certain γφ ∈ (0, 1].


For example, we can take φ(u) = ‖u‖_{E2}. Then γφ = 1.
We can use this merit function to transform the problem (4.4.4) into the following
unconstrained minimization problem:

min_{x∈E1} { f(x) := φ(F(x)) } = f*.   (4.4.7)

Clearly, the solution x ∗ to the system (4.4.4) exists if and only if the optimal value
f ∗ of the problem (4.4.7) is equal to zero. The iterative scheme proposed below
can be seen as a minimization method for problem (4.4.7), which employs a special
structure of the objective function. Function f can even be non-smooth. However,
we will see that it is possible to decrease its value at any point x ∈ E1 excluding the
stationary points of the problem (4.4.7).
Let us fix some x ∈ E1 . Consider the following local model of our objective
function:

ψ(x; y) = φ( F(x) + F′(x)(y − x) ),   y ∈ E1.

Note that ψ(x; y) is convex in y. Therefore it looks natural to choose the next
approximation of the solution to problem (4.4.7) from the set

Arg min_{y∈E1} ψ(x; y).

Such schemes are very well studied in the literature. For example, if choosing φ as
in (4.4.3), we get the classical Gauss–Newton method. However, in what follows we
see that a simple regularization of this approach leads to another scheme, for which
we can speak about global efficiency of the process.
Let us introduce the following smoothness assumption. Denote by F a closed
convex set in E1 with non-empty interior.
Assumption 4.4.1 The function F (·) is differentiable on the set F and its deriva-
tive is Lipschitz-continuous:

‖F′(x) − F′(y)‖ ≤ L‖x − y‖,   ∀x, y ∈ F,   (4.4.8)

with some L > 0.


A straightforward consequence of this assumption is as follows:

‖F(y) − F(x) − F′(x)(y − x)‖ ≤ (1/2)L‖y − x‖²,   x, y ∈ F.   (4.4.9)

We skip its proof since it is very similar to the proof of inequality (1.2.13). In the
remaining part of this section, we always assume that Assumption 4.4.1 is satisfied.
Lemma 4.4.1 For any x and y from F , we have

|f(y) − ψ(x; y)| ≤ (1/2)L‖y − x‖².   (4.4.10)

Proof Let d(x, y) = F(y) − F(x) − F′(x)(y − x) ∈ E2. By inequality (4.4.9),

‖d(x, y)‖ ≤ (1/2)L‖x − y‖².

Since both x and y belong to F, we have

|f(y) − ψ(x; y)| = |φ(F(y)) − φ(F(x) + F′(x)(y − x))|

                 ≤ ‖d(x, y)‖ ≤ (1/2)L‖y − x‖².            (by (4.4.5))  □
Inequality (4.4.10) provides us with an upper approximation of function f :

f(y) ≤ ψ(x; y) + (1/2)L‖y − x‖²,   ∀x, y ∈ F.



Let us use it for constructing a minimization scheme. Let M be a positive parameter.


For the problem (4.4.7), define a modified Gauss–Newton iterate from a point x ∈
F as follows:
 
VM(x) ∈ Arg min_{y∈E1} { ψ(x; y) + (1/2)M‖y − x‖² },   (4.4.11)

where “Arg” indicates that VM (x) is chosen from the set of global minima of the
corresponding minimization problem.3 Note that the auxiliary optimization problem
in (4.4.11) is convex in y. We postpone a discussion on the complexity of finding
the point VM (x) until Sect. 4.4.4.
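To get some feeling for the iterate (4.4.11), consider the simplest case of a single equation (m = 1) with φ(u) = |u|. Then the auxiliary problem admits a closed-form solution. The derivation below, with our own names f0, g, w, is an illustration only, not part of the text:

```python
import numpy as np

def V_M_scalar(x, f0, g, M):
    # minimize |f0 + <g, h>| + (M/2)||h||^2 over h, for one equation F(x) = f0, F'(x) = g
    w = g @ g
    if w <= M * abs(f0):
        h = -(np.sign(f0) / M) * g      # gradient-type step: linear model not driven to zero
    else:
        h = -(f0 / w) * g               # Gauss-Newton step: linear model solved exactly
    return x + h

f0, M = 2.0, 1.5
g = np.array([3.0, -1.0])
x = np.zeros(2)
h_star = V_M_scalar(x, f0, g, M) - x
model = lambda h: abs(f0 + g @ h) + 0.5 * M * (h @ h)
# the optimum lies on span(g), so a fine grid search over h = -b*g validates the closed form
vals = [model(-b * g) for b in np.linspace(-1.0, 1.0, 2001)]
assert model(h_star) <= min(vals) + 1e-9
```

The two branches anticipate the two regimes in the global rate analysis of Sect. 4.4.3: a constant decrease of the merit function far from the solution, and a quadratic one near it.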
Let us prove several auxiliary results. Define

rM(x) = ‖VM(x) − x‖,

fM(x) = ψ(x; VM(x)) + (1/2)M rM(x)²,

δM(x) = f(x) − fM(x).

For a fixed x, the value fM(x) is a concave function of M since it can be represented
as a minimum of functions linear in M (see Theorem 3.1.8):

fM(x) = min_{y∈E1} { ψ(x; y) + (1/2)M‖y − x‖² }.

Consequently, the value (1/2)rM(x)², which is equal to the derivative of fM(x) in M
(see Lemma 3.1.14), is a decreasing function of M.
Lemma 4.4.2 For any x ∈ E1 we have

δM(x) ≥ (1/2)M rM(x)².   (4.4.12)

Proof Let us fix an arbitrary x ∈ E1. Let ψ0(y) = (1/2)M‖y − x‖² and

ψ1 (y) = ψ(x; y) + ψ0 (y).

In view of Theorem 3.1.24, there exists g1 ∈ ∂y ψ(x; VM (x)) and g2 ∈ ∂ψ0 (VM (x))
such that

⟨g1 + g2, y − VM(x)⟩ ≥ 0   ∀y ∈ E1.   (4.4.13)

3 Since we do not assume that the norm ‖x‖, x ∈ E1, is strongly convex, this problem may have a
non-trivial convex set of global solutions.

At the same time, in view of identity (3.1.39), we have ⟨g2, VM(x) − x⟩ = M rM(x)².
Hence,

f(x) = ψ(x; x) ≥ ψ(x; VM(x)) + ⟨g1, x − VM(x)⟩                (by (3.1.23))

              ≥ ψ(x; VM(x)) + ⟨g2, VM(x) − x⟩                 (by (4.4.13))

              = ψ(x; VM(x)) + M rM(x)² = fM(x) + (1/2)M rM(x)².

This is exactly inequality (4.4.12). □



Let us compare δM (x) with another natural measure of local decrease of the model
ψ(x; ·). For r > 0 define

Δr(x) = f(x) − min_{y∈E1} { ψ(x; y) : ‖y − x‖ ≤ r }.

Lemma 4.4.3 For any x ∈ E1 and r > 0 we have

δM(x) ≥ Mr² · ω( (1/(Mr²)) Δr(x) ),   (4.4.14)

where

ω(t) = t − 1/2  for t ≥ 1,   ω(t) = (1/2)t²  for t ∈ [0, 1].

The right-hand side of the bound (4.4.14) is a decreasing function of M.


Proof Let us choose hr ∈ Arg min_{h∈E1} { ψ(x; x + h) : ‖h‖ ≤ r }. Then

fM(x) ≤ min_{τ∈[0,1]} { φ(F(x) + τF′(x)hr) + (1/2)Mτ²r² }

      = min_{τ∈[0,1]} { φ((1 − τ)F(x) + τ(F(x) + F′(x)hr)) + (1/2)Mτ²r² }

      ≤ min_{τ∈[0,1]} { (1 − τ)φ(F(x)) + τφ(F(x) + F′(x)hr) + (1/2)Mτ²r² }

      = min_{τ∈[0,1]} { f(x) − τΔr(x) + (1/2)Mτ²r² }.

Thus,
 
δM(x) ≥ max_{τ∈[0,1]} { τΔr(x) − (1/2)Mτ²r² } = Mr² · ω( (1/(Mr²)) Δr(x) ).

Note that the right-hand side of this inequality is decreasing in M. 




Define

L (τ ) = {y ∈ E1 : f (y) ≤ τ }.

Lemma 4.4.4 Let L (f (x)) ⊆ int F and M ≥ L. Then VM (x) ∈ L (f (x)).


Proof Assume VM(x) ∉ L(f(x)). Consider the points

y(α) = x + α · (VM (x) − x), α ∈ [0, 1].

Since y(0) = x ∈ int F , we can define the value ᾱ ∈ (0, 1) such that y(ᾱ) lies at
the boundary of the set F . Note that

f (y(ᾱ)) ≥ f (x) ≥ fM (x),

and rM (x) > 0. By our assumption, ᾱ ∈ (0, 1). Define

d = F(y(ᾱ)) − F(x) − ᾱF′(x)(VM(x) − x) ∈ E2.

In view of inequality (4.4.9), ‖d‖ ≤ (L/2)ᾱ² rM(x)². Therefore,

f(x) ≤ f(y(ᾱ)) = φ( F(x) + ᾱF′(x)(y(1) − x) + d )

     ≤ φ( F(x) + ᾱF′(x)(VM(x) − x) ) + ‖d‖

     ≤ (1 − ᾱ)f(x) + ᾱφ( F(x) + F′(x)(VM(x) − x) ) + (1/2)M ᾱ² rM(x)²

     ≤ (1 − ᾱ)f(x) + ᾱfM(x) − (1/2)M ᾱ(1 − ᾱ) rM(x)².

Thus, f(x) ≤ fM(x) − (1/2)M(1 − ᾱ) rM(x)², which is a contradiction to (4.4.12). □

Lemma 4.4.5 Let both x and VM (x) belong to F . Then
 
fM(x) ≤ min_{y∈F} { f(y) + (1/2)(L + M)‖y − x‖² }.   (4.4.15)

Proof For y ∈ F let d(x, y) = F(y) − F(x) − F′(x)(y − x) ∈ E2. By
inequality (4.4.9),

‖d(x, y)‖ ≤ (1/2)L‖x − y‖².

Hence, since both x and VM(x) belong to F, we have

fM(x) = min_{y∈F} { φ(F(x) + F′(x)(y − x)) + (1/2)M‖y − x‖² }

      = min_{y∈F} { φ(F(y) − d(x, y)) + (1/2)M‖y − x‖² }

      ≤ min_{y∈F} { f(y) + (1/2)(L + M)‖y − x‖² }. □

Corollary 4.4.1 Let x ∗ be a solution to problem (4.4.7) and L (f (x)) ⊆ F . Then

fM(x) ≤ f* + (1/2)(L + M)‖x − x*‖².   (4.4.16)

Proof It is enough to substitute y = x ∗ in the right-hand side of (4.4.15).




4.4.2 The Modified Gauss–Newton Process

Now we can analyze the convergence of the following process. Let us fix L0 ∈
(0, L].

Modified Gauss–Newton method

Initialization: Choose x0 ∈ Rn .

Iteration k, (k ≥ 0) : (4.4.17)

1. Find Mk ∈ [L0 , 2L] such that

f (VMk (xk )) ≤ fMk (xk ).

2. Set xk+1 = VMk (xk ).

Since fM (x) ≤ f (x), this process is monotone:

f (xk+1 ) ≤ f (xk ). (4.4.18)
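To see the scheme at work, here is a tiny self-contained experiment (our own code, not from the text): one equation F(x) = ‖x‖² − 1 = 0 in R², merit function f = |F|, and Mk ≡ L = 2, with the closed-form solution of the auxiliary problem for a single equation inlined:

```python
import numpy as np

L = 2.0                                    # Lipschitz constant of F'(x) = 2x

def gn_step(x):
    f0, g = x @ x - 1.0, 2.0 * x
    w = g @ g
    # minimizer of |f0 + <g,h>| + (L/2)||h||^2 (closed form for one equation)
    h = -(np.sign(f0) / L) * g if w <= L * abs(f0) else -(f0 / w) * g
    return x + h

x = np.array([3.0, 4.0])
vals = [abs(x @ x - 1.0)]
for _ in range(60):
    x = gn_step(x)
    vals.append(abs(x @ x - 1.0))
assert all(vals[i + 1] <= vals[i] + 1e-12 for i in range(60))   # monotonicity (4.4.18)
assert vals[-1] < 1e-10                                         # convergence to F(x) = 0
```

On this example the iterates stay in the Gauss-Newton regime and converge quadratically to the unit circle.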



If the constant L is known, then in Item 1 of this scheme we can use Mk ≡ L. In the
opposite case, it is possible to apply a simple search procedure (see, for example,
Sect. 4.1.4). Let us now present the convergence results.
Let x0 ∈ int F be a starting point for the above minimization process. We need
to assume the following.
Assumption 4.4.2 The set F is big enough: L (f (x0 )) ⊆ F .
In what follows, we always suppose that Assumption 4.4.2 is satisfied. In view
of (4.4.18,) this assumption implies that L (f (xk )) ⊆ F for any k ≥ 0.
Theorem 4.4.1 For any k ≥ 0 and r > 0 we have

f(xk) − f* ≥ (1/2)L0 Σ_{i=k}^{∞} rMi(xi)² ≥ (1/2)L0 Σ_{i=k}^{∞} r2L(xi)²,
                                                                  (4.4.19)
f(xk) − f* ≥ Σ_{i=k}^{∞} Mi r² ω( (1/(Mi r²)) Δr(xi) ) ≥ Σ_{i=k}^{∞} 2Lr² ω( (1/(2Lr²)) Δr(xi) ).

Proof Indeed, in view of the rules of Step 1 in (4.4.17),

fMi(xi) ≥ f(xi+1),   Mi ≥ L0,   rMi(xi) ≥ r2L(xi).

Thus, inequality (4.4.12) justifies the first line of (4.4.19). In order to prove
the second one, we apply (4.4.14) and use the bound Mi ≤ 2L imposed by (4.4.17). □

Corollary 4.4.2 Let the sequence {xk }∞
k=0 be generated by the scheme (4.4.17).
Then

lim_{k→∞} ‖xk − xk+1‖ = 0,   lim_{k→∞} Δr(xk) = 0,

and therefore the set of limit points X∗ of this sequence is connected. For any x̄ from
X∗ , we have Δr (x̄) = 0. 
Let us justify now the local convergence of the scheme (4.4.17).
Theorem 4.4.2 Let the point x ∗ ∈ L (f (x0 )) with F (x ∗ ) = 0 be a non-degenerate
solution to problem (4.4.4):

σ ≡ σmin(F′(x*)) > 0.

Let γφ be defined by (4.4.6). If xk ∈ L(f(x0)) and

‖xk − x*‖ ≤ (σγφ/L) · 2/(3 + 5γφ),

then xk+1 ∈ L(f(x0)) and

‖xk+1 − x*‖ ≤ 3(1 + γφ)L‖xk − x*‖² / (2γφ(σ − L‖xk − x*‖)) ≤ ‖xk − x*‖.   (4.4.20)

Proof Since f(x*) = 0, in view of inequality (4.4.16) and inequality (4.4.9), we
have

(3L/2)‖xk − x*‖² ≥ fMk(xk) ≥ ψ(xk; xk+1) ≥ γφ ‖F(xk) + F′(xk)(xk+1 − xk)‖

   = γφ ‖F′(x*)(xk+1 − x*) + F(xk) − F(x*) − F′(x*)(xk − x*)

        + (F′(xk) − F′(x*))(xk+1 − xk)‖

   ≥ γφ [ ‖F′(x*)(xk+1 − x*)‖ − (L/2)‖xk − x*‖² − L‖xk − x*‖ · ‖xk+1 − xk‖ ]

   ≥ γφ [ (σ − L‖xk − x*‖) · ‖xk+1 − x*‖ − (3L/2)‖xk − x*‖² ]. □

4.4.3 Global Rate of Convergence

In order to get global complexity results for method (4.4.17), we need to introduce
an additional non-degeneracy assumption.
Assumption 4.4.3 The operator F  (x) : E1 → E2 possesses a uniform dual non-
degeneracy:

σmin(F′(x)*) ≥ σ > 0   ∀x ∈ L(f(x0)).

Note that this assumption implies dim E2 ≤ dim E1 . The role of Assump-
tion 4.4.3 in our analysis can be seen from the following standard result.
Lemma 4.4.6 Let the linear operator A : E1 → E2 possess dual non-degeneracy:

σmin (A∗ ) > 0.

Then for any b ∈ E2 there exists a point x(b) ∈ E1 such that

Ax(b) = b,   ‖x(b)‖ ≤ ‖b‖/σmin(A*).

Proof Consider the following optimization problem:

min_x { f(x) = ‖x‖ : Ax = b }.

Since the level sets of its objective function are bounded, its solution x* exists. In
view of the statement (3.1.59), there exists a y* ∈ E2* such that g* = A*y* ∈
∂f(x*). Using inequality (3.1.42) and Lemma 3.1.15, we conclude that ‖g*‖ ≤ 1.
Thus,

1 ≥ ‖A*y*‖ ≥ σmin(A*)‖y*‖.   (4.4.21)

On the other hand, by (3.1.40),

‖x*‖ = ⟨g*, x*⟩ = ⟨Ax*, y*⟩ = ⟨b, y*⟩ ≤ ‖b‖ · ‖y*‖.

It remains to apply inequality (4.4.21). □
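For Euclidean norms, the point x(b) of Lemma 4.4.6 is just the least-norm solution A⁺b, and σmin(A*) is the smallest singular value of A. A quick numerical illustration (our own code):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))            # full row rank (almost surely): dual non-degenerate
b = rng.standard_normal(3)
x_b = np.linalg.pinv(A) @ b                # least-norm solution of A x = b
sigma = np.linalg.svd(A, compute_uv=False).min()
assert np.linalg.norm(A @ x_b - b) < 1e-10
assert np.linalg.norm(x_b) <= np.linalg.norm(b) / sigma + 1e-12
```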



An important consequence of Lemma 4.4.6 is as follows.
Lemma 4.4.7 Let the operator F′(x) possess dual non-degeneracy: σmin(F′(x)*) > 0.
Then for any M > 0 we have

rM(x) ≤ ‖F(x)‖/σmin(F′(x)*).   (4.4.22)

Proof Indeed, in view of Lemma 4.4.6 there exists an h* such that

F(x) + F′(x)h* = 0

and ‖h*‖ ≤ ‖F(x)‖/σmin(F′(x)*). Therefore

(M/2) rM(x)² ≤ ψ(x; VM(x)) + (M/2) rM(x)² = min_{h∈E1} { ψ(x; x + h) + (M/2)‖h‖² }

             ≤ (M/2)‖h*‖² ≤ M‖F(x)‖² / (2σmin²(F′(x)*)). □
2 2σmin

Now we can justify the global rate of convergence of scheme (4.4.17).


Theorem 4.4.3 Let Assumptions 4.4.1, 4.4.2 and 4.4.3 be satisfied.
1) Suppose that the sequence {xk}_{k=0}^{∞} is generated by method (4.4.17). If
f(xk) ≥ (σ²/(2L))γφ², then

f(xk+1) ≤ f(xk) − (σ²/(4L))γφ².   (4.4.23)

Otherwise,

f(xk+1) ≤ (L/(σ²γφ²)) f(xk)² ≤ (1/2)f(xk).   (4.4.24)

2) Suppose that the sequence {xk}_{k=0}^{∞} is generated by method (4.4.17) with Mk ≡ L.
If f(xk) ≥ (σ²/L)γφ², then

f(xk+1) ≤ f(xk) − (σ²/(2L))γφ².   (4.4.25)

Otherwise,

f(xk+1) ≤ (L/(2σ²γφ²)) f(xk)² ≤ (1/2)f(xk).   (4.4.26)

Proof Let us prove the first part of the theorem. Since the operator F′(xk) is non-
degenerate, in view of Lemma 4.4.6 there exists a solution h*k to the system of linear
equations F(xk) + F′(xk)h = 0 with a bounded norm:

‖h*k‖ ≤ (1/σ)‖F(xk)‖ ≤ (1/(σγφ)) f(xk).

Therefore, in view of the step-size rules in the scheme (4.4.17) and the upper bound
on the values Mk, we have

f(xk+1) ≤ min_{h∈E1} { φ(F(xk) + F′(xk)h) + (1/2)Mk‖h‖² }

        ≤ min_{t∈[0,1]} { φ(F(xk) + tF′(xk)h*k) + Lt²‖h*k‖² }

        ≤ min_{t∈[0,1]} { φ((1 − t)F(xk)) + (L/(σ²γφ²)) t² f(xk)² }

        ≤ min_{t∈[0,1]} { (1 − t)f(xk) + (L/(σ²γφ²)) t² f(xk)² }.

Thus, if f(xk) ≤ (σ²/(2L))γφ², then the minimum in the latter univariate problem is
attained at t = 1 and we get inequalities (4.4.24). In the opposite case, the minimum
is attained at t = σ²γφ²/(2Lf(xk)) and we get estimate (4.4.23).
The second part of the theorem can be proved in a similar way. □

Using Theorem 4.4.3, we can establish some properties of problem (4.4.7).

Theorem 4.4.4 Let Assumptions 4.4.1, 4.4.2 and 4.4.3 be satisfied. Then there
exists a solution x ∗ to problem (4.4.7) such that f (x ∗ ) = 0 and

‖x* − x0‖ ≤ (2/σ)‖F(x0)‖.   (4.4.27)

Proof Let us choose φ(u) = ‖u‖. Then γφ = 1. Let us now apply method (4.4.17)
with Mk ≡ L to the corresponding problem (4.4.7) with f(x) = ‖F(x)‖.
Assume first that f(x0) > σ²/L. In accordance with the second statement of
Theorem 4.4.3, as long as f(xk) ≥ σ²/L we have

f(xk) − f(xk+1) ≥ σ²/(2L).   (4.4.28)

Denote by N the length of the first stage of the process:

f(x_N) ≥ σ²/L ≥ f(x_{N+1}).

Summing up inequalities (4.4.28) for k = 0, …, N, we get

N + 1 ≤ (2L/σ²) (f(x₀) − f(x_{N+1})).    (4.4.29)

On the other hand, in view of inequality (4.4.12) we have

f(x_k) − f(x_{k+1}) ≥ (L/2) ‖x_k − x_{k+1}‖².    (4.4.30)

Summing up these inequalities for k = 0, …, N, we get

f(x₀) − f(x_{N+1}) ≥ (L/2) Σ_{k=0}^{N} ‖x_k − x_{k+1}‖² ≥ (L/(2(N+1))) ( Σ_{k=0}^{N} ‖x_k − x_{k+1}‖ )²

 ≥ (L/(2(N+1))) ‖x₀ − x_{N+1}‖².

Now, using estimate (4.4.29), we obtain

‖x₀ − x_{N+1}‖ ≤ ( (2(N+1)/L) (f(x₀) − f(x_{N+1})) )^{1/2} ≤ (2/σ) (f(x₀) − f(x_{N+1})).    (4.4.31)

Further, in view of Theorem 4.4.3, at the second stage of the process we can guarantee that

f(x_{k+1}) ≤ (L/(2σ²)) f²(x_k) ≤ ½ f(x_k),  k ≥ N + 1.    (4.4.32)

Thus, f(x_{N+k+1}) ≤ (½)^k f(x_{N+1}) for k ≥ 0. Hence, in view of inequality (4.4.22), we have

‖x_{N+k+2} − x_{N+k+1}‖ ≤ (1/σ) (½)^k f(x_{N+1}),  k ≥ 0.

Thus, the sequence {x_k}_{k=0}^∞ converges to a point x* with F(x*) = 0 and

‖x* − x_{N+1}‖ ≤ (2/σ) f(x_{N+1}).

Combining this inequality with (4.4.31) by the triangle inequality, we get

‖x* − x₀‖ ≤ ‖x* − x_{N+1}‖ + ‖x_{N+1} − x₀‖ ≤ (2/σ) f(x_{N+1}) + (2/σ)(f(x₀) − f(x_{N+1})) = (2/σ) f(x₀),

which is exactly (4.4.27).


If f(x₀) ≤ σ²/L, then we can apply the latter reasoning from the very beginning:

Σ_{k=0}^∞ ‖x_{k+1} − x_k‖ ≤ (1/σ) Σ_{k=0}^∞ f(x_k) ≤ (1/σ) f(x₀) Σ_{k=0}^∞ (½)^k = (2/σ) f(x₀). □

Applying exactly the same arguments as in the proof of Theorem 4.4.4, it is possible
to justify the following statement.
Theorem 4.4.5 Let Assumptions 4.4.1, 4.4.2 and 4.4.3 be satisfied. Suppose the sequence {x_k}_{k=0}^∞ is generated by method (4.4.17) as applied to problem (4.4.7). Then this sequence converges to a single point x* with F(x*) = 0.

Let us conclude this section with the following remark. We have seen that
Assumptions 4.4.1, 4.4.2 and 4.4.3 guarantee the existence of a solution to
problem (4.4.4). Define

D = min_x { ‖x − x₀‖ : x ∈ L(f(x₀)), F(x) = 0 }.

In view of Corollary 4.4.1 and the bounds on M_k in method (4.4.17), we can always guarantee that

f(x₁) ≤ (3/2) L D².    (4.4.33)

Thus, in view of Theorem 4.4.3, the number of iterations N of method (4.4.17) which is necessary for reaching the region of quadratic convergence can be bounded as follows:

N ≤ 1 + (4L/(σ²γ_φ²)) f(x₁) ≤ 1 + 6 (LD/(σγ_φ))².    (4.4.34)

We will refer to this bound as an upper complexity estimate of the class of problems
described by Assumptions 4.4.1, 4.4.2 and 4.4.3. This bound is justified by the
modified Gauss–Newton method (4.4.17).

4.4.4 Discussion
4.4.4.1 A Comparative Analysis of Scheme (4.4.17)

Let us compare the efficiency of method (4.4.17) with the Cubic Newton Method
for unconstrained minimization (see Sect. 4.1). Note that the fields of applications
of both methods intersect. Indeed, any problem of solving a system of non-linear
equations can be transformed into a problem of unconstrained minimization using
some merit function. On the other hand, any unconstrained minimization problem
can be reduced to a system of non-linear equations, which corresponds to the first-
order optimality conditions (1.2.4).
Consider the following unconstrained minimization problem:

min_{x∈E₁} ϕ(x),    (4.4.35)

where ϕ(·) is a twice differentiable strongly convex function whose Hessian is


Lipschitz continuous. In this subsection, we assume that all norms are Euclidean.
Suppose that there exist positive σ and L such that the conditions

⟨∇²ϕ(x)h, h⟩ ≥ σ‖h‖²,
‖∇²ϕ(x + h) − ∇²ϕ(x)‖ ≤ L‖h‖    (4.4.36)

are satisfied for any x and h from E₁. Let D = ‖x₀ − x*‖. Then in Sect. 4.1.5, we have shown that the complexity of problem (4.4.35) for the Cubic Newton Method (4.1.16) depends on the characteristic

ζ = LD/σ

(we use the notation of this section). If ζ < 1, then problem (4.4.35) is easy. In the opposite case, the number of iterations of the modified Newton scheme which is necessary to come to the region of quadratic convergence is essentially bounded by

N₁ = 6.25 ζ    (4.4.37)

(see (4.1.57)).
Note that problem (4.4.35) can be posed in the form (4.4.4):

Find x : F(x) ≝ ∇ϕ(x) = 0.    (4.4.38)

In this case, F'(x) = ∇²ϕ(x). Therefore, in view of conditions (4.4.36), our problem (4.4.38) satisfies Assumptions 4.4.1, 4.4.2 and 4.4.3. Let us choose f(x) = ‖F(x)‖. Then, in view of (4.4.34), the number of iterations of the modified Gauss–Newton scheme (4.4.17) required to come to the region of quadratic convergence is bounded by

N₂ = 1 + 6ζ².    (4.4.39)

Clearly, the estimate (4.4.37) is much better than (4.4.39). However, this observation just confirms the standard rule that specialized procedures are usually more efficient than general-purpose schemes. Still, at this moment we cannot come to a definitive conclusion, since the lower complexity bounds for the problem class described by Assumptions 4.4.1, 4.4.2 and 4.4.3 are not known. So, there is a chance that the complexity estimate (4.4.39) can be improved by other methods.
In fact, as compared with the Cubic Newton Method (4.1.16), the scheme (4.4.17)
has one important advantage. The auxiliary problem for computing the new test
point at each iteration of method (4.1.16) is solvable in polynomial time only if this
method is based on the Euclidean norm. On the contrary, in the modified Gauss–
Newton scheme we are absolutely free in the choice of norms in the spaces E1 and
E2 . As we will see in Sect. 4.4.4.2, any choice results in a convex auxiliary problem.
Therefore, it is possible to choose the norms in a reasonable way, which makes the
ratio L/σ as small as possible.

4.4.4.2 Implementation Issues

Let us study the complexity of the auxiliary problem (4.4.11). For simplicity, let us assume that we choose f(x) = ‖F(x)‖. So, our problem is as follows:

Find f_M(x) = min_{h∈E₁} { ‖F(x) + F'(x)h‖ + (M/2) ‖h‖² }.    (4.4.40)

Note that sometimes this problem looks easier in its dual form:

min_{h∈E₁} { ‖F(x) + F'(x)h‖ + (M/2) ‖h‖² }

 = min_{h∈E₁} max_{s∈E₂*, ‖s‖≤1} { ⟨s, F(x) + F'(x)h⟩ + (M/2) ‖h‖² }

 = max_{s∈E₂*, ‖s‖≤1} min_{h∈E₁} { ⟨s, F(x) + F'(x)h⟩ + (M/2) ‖h‖² }

 = max_{s∈E₂*} { ⟨s, F(x)⟩ − (1/(2M)) ‖F'(x)*s‖*² : ‖s‖ ≤ 1 }.

Since this problem is convex, it can be solved by the efficient optimization schemes
of Convex Optimization.
Let us show that for Euclidean norms, problem (4.4.40) can be solved by the
standard Linear Algebra technique.
Lemma 4.4.8 Let us introduce in E₁ and E₂ the Euclidean norms

‖x‖ = ⟨B₁x, x⟩^{1/2}, x ∈ E₁,  ‖u‖ = ⟨B₂u, u⟩^{1/2}, u ∈ E₂,

where B₁ = B₁* ≻ 0 and B₂ = B₂* ≻ 0. Then the solution of problem (4.4.40) can be found from the following univariate convex optimization problem:

f_M(x) = min_{τ≥0} ½ { τ + (1/τ) ‖F(x)‖² − ⟨[ τF'(x)* B₂ F'(x) + τ²MB₁ ]⁻¹ g, g⟩ },    (4.4.41)

where g = F'(x)* B₂ F(x). If τ* is an optimal solution to this problem, then the solution to (4.4.40) is given by

h* = −[ F'(x)* B₂ F'(x) + τ*MB₁ ]⁻¹ F'(x)* B₂ F(x).    (4.4.42)

Proof Indeed,

f_M(x) = min_{h∈E₁} min_{τ≥0} { ½τ + (1/(2τ)) ‖F(x) + F'(x)h‖² + (M/2) ‖h‖² }

 = min_{τ≥0} min_{h∈E₁} { ½τ + (1/(2τ)) ‖F(x) + F'(x)h‖² + (M/2) ‖h‖² }

 = min_{τ≥0} min_{h∈E₁} { ½τ + (1/(2τ)) ‖F(x)‖² + (1/τ) ⟨B₂F(x), F'(x)h⟩ + (1/(2τ)) ⟨B₂F'(x)h, F'(x)h⟩ + (M/2) ⟨B₁h, h⟩ }.

The minimum of the internal minimization problem is achieved at

h*(τ) = −[ (1/τ) F'(x)* B₂ F'(x) + MB₁ ]⁻¹ (1/τ) F'(x)* B₂ F(x)

 = −[ F'(x)* B₂ F'(x) + τMB₁ ]⁻¹ F'(x)* B₂ F(x).

With the notation g = F'(x)* B₂ F(x), the objective function of the optimization problem in τ is as follows:

½τ + (1/(2τ)) ‖F(x)‖² − (1/(2τ²)) ⟨[ (1/τ) F'(x)* B₂ F'(x) + MB₁ ]⁻¹ g, g⟩

 = ½τ + (1/(2τ)) ‖F(x)‖² − ½ ⟨[ τF'(x)* B₂ F'(x) + τ²MB₁ ]⁻¹ g, g⟩.

In view of Theorem 3.1.7, this function is convex in τ. □



Note that the univariate optimization problem in (4.4.41) can be solved efficiently
by one-dimensional search procedures (see, for example, Sect. A.1).
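To make this concrete, here is a small numerical sketch (our own illustration, not part of the text) for the Euclidean setting of Lemma 4.4.8 with B₁ = B₂ = I: it minimizes the univariate objective of (4.4.41) by a ternary search (valid since the objective is convex in τ) and then recovers h* from (4.4.42). The function names and the 2×2 test data are hypothetical.

```python
# Sketch: modified Gauss-Newton auxiliary problem min_h ||F + Jh|| + (M/2)||h||^2
# solved through the univariate dual (4.4.41)-(4.4.42) with B1 = B2 = I (2x2 toy case).
import math

def solve2(H, g):
    # Solve the 2x2 linear system H z = g by Cramer's rule.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]

def psi(tau, A, g, normF2, M):
    # Objective of (4.4.41): (1/2)[tau + ||F||^2/tau - <(tau*A + tau^2*M*I)^{-1} g, g>].
    H = [[tau * A[0][0] + tau ** 2 * M, tau * A[0][1]],
         [tau * A[1][0], tau * A[1][1] + tau ** 2 * M]]
    z = solve2(H, g)
    return 0.5 * (tau + normF2 / tau - (z[0] * g[0] + z[1] * g[1]))

def gauss_newton_step(F, J, M):
    # A = J^T J and g = J^T F play the roles of F'(x)* B2 F'(x) and F'(x)* B2 F(x).
    A = [[sum(J[k][i] * J[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    g = [sum(J[k][i] * F[k] for k in range(2)) for i in range(2)]
    normF2 = F[0] ** 2 + F[1] ** 2
    lo, hi = 1e-8, 100.0
    for _ in range(300):  # ternary search over the convex univariate objective
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if psi(m1, A, g, normF2, M) < psi(m2, A, g, normF2, M):
            hi = m2
        else:
            lo = m1
    tau = 0.5 * (lo + hi)
    # h* from (4.4.42): h = -(A + tau*M*I)^{-1} g.
    H = [[A[0][0] + tau * M, A[0][1]], [A[1][0], A[1][1] + tau * M]]
    z = solve2(H, g)
    return tau, [-z[0], -z[1]], psi(tau, A, g, normF2, M)
```

For instance, with F = (3, 4), J = I and M = 1, the primal problem min_h ‖F + h‖ + ½‖h‖² has the solution h* = −F/5 with value 4.5, and the search indeed returns τ* = 4 = ‖F + h*‖.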
Part II
Structural Optimization
Chapter 5
Polynomial-Time Interior-Point Methods

In this section, we present the problem classes and complexity bounds of polynomial-time interior-point methods. These methods are based on the notion of a
self-concordant function. It appears that such a function can be easily minimized
by the Newton’s Method. On the other hand, an important subclass of these
functions, the self-concordant barriers, can be used in the framework of path-
following schemes. Moreover, it can be proved that we can follow the corresponding
central path with polynomial-time complexity. The size of the steps in the penalty
coefficient of the central path depends on the corresponding barrier parameter. It
appears that for any convex set there exists a self-concordant barrier with parameter
proportional to the dimension of the space of variables. On the other hand, for
any convex set with explicit structure, such a barrier with a reasonable value of
parameter can be constructed by simple combination rules. We present applications
of this technique to Linear and Quadratic Optimization, Linear Matrix Inequalities
and other optimization problems.

5.1 Self-concordant Functions

(Do we really have a Black Box? What does the Newton method actually do? Definition
of self-concordant functions; Main properties; The Implicit Function Theorem; Minimizing
self-concordant functions; Relations with the standard second-order methods.)

5.1.1 The Black Box Concept in Convex Optimization

In this chapter, we are going to present the main ideas underlying the modern
polynomial-time interior-point methods in Nonlinear Optimization. In order to start,
let us look first at the traditional formulation of a minimization problem.

© Springer Nature Switzerland AG 2018
Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_5

Suppose we want to solve a minimization problem in the following form:

min_{x∈Rⁿ} { f₀(x) : f_j(x) ≤ 0, j = 1, …, m }.

We assume that the functional components of this problem are convex. Note that
all standard convex optimization schemes for solving this problem are based on the
Black-Box concept. This means that we assume our problem to be equipped with an
oracle, which provides us with some information on the functional components of
the problem at some test point x. This oracle is local: If we change the shape of the
component far enough from the test point, the answer of the oracle does not change.
These answers comprise the only information available for numerical methods.1
However, looking carefully at the above situation, we can discover a certain
contradiction. Indeed, in order to apply the convex optimization methods, we need
to be sure that our functional components are convex. However, we can check
convexity only by analyzing the structure of these functions2: If our function
is obtained from the basic convex functions by convex operations (summation,
maximum, etc.), we conclude that it is convex.
Thus, the functional components of the problem are not in the Black Box at
the moment we are checking their convexity and choose the minimization scheme.
However, we lock them in the Black Box for numerical methods. This is the main
conceptual contradiction of the standard Convex Optimization theory.3
The above observation gives us hope that the structure of the problem could
be used to improve performance of convex minimization schemes. Unfortunately,
structure is a very fuzzy notion, which is quite difficult to formalize. One possible
way to describe the structure is to fix the analytical type of functional components.
For example, we can consider the problems with linear functions fj (·) only. This
works, but note that this approach is very fragile: If we introduce in our problem
just a single functional component of different type, we get another problem class
and all the theory must be redone from scratch.
Alternatively, it is clear that having the structure at hand, we can play with the
analytical form of the problem. We can rewrite the problem in many equivalent
forms using nontrivial transformations of variables or constraints, introducing
additional variables, etc. However, this would serve no purpose without realizing
the final goal of such transformations. So, let us try to find such a goal.
At this moment, it is better to look at classical examples. In many situations,
the sequential reformulations of the initial problem can be seen as a part of the
numerical method. We start from a complicated problem P and, step by step, simplify its structure up to the moment we get a trivial problem (or a problem which we know how to solve):

P −→ … −→ (f*, x*).

1 We have already discussed this concept and the corresponding methods in Part I of the book.
2 A numerical verification of convexity is a hopeless computational task.
3 Nevertheless, the conclusions of the theory concerning the oracle-based minimization schemes remain valid, of course, for the methods which are designed in accordance with the Black-Box principles.

Let us look at the standard approach for solving the system of linear equations,
namely,

Ax = b.

We can proceed as follows:


1. Check that matrix A is symmetric and positive definite. Sometimes this is clear
from its origin.
2. Compute the Cholesky factorization of the matrix:

A = LLT ,

where L is a lower-triangular matrix. Form two auxiliary systems

Ly = b, LT x = y.

3. Solve the auxiliary systems.
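The three steps above can be sketched directly in code. The following pure-Python illustration is our own (a real solver would call an optimized library routine); it factors a symmetric positive definite A and then solves the two triangular systems:

```python
import math

def cholesky(A):
    # A = L L^T for a symmetric positive definite matrix A; L is lower-triangular.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    # Step 2: factorize; Step 3: forward solve L y = b, then back solve L^T x = y.
    n = len(b)
    L = cholesky(A)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```

For A = [[4, 2], [2, 3]] and b = (6, 5), this returns x = (1, 1).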


This process can be seen as a sequence of equivalent transformations of the initial
problem.
Imagine for a moment that we do not know how to solve the systems of
linear equation. In order to discover the above technology, we should perform the
following steps:
1. Find a class of problems which can be solved very efficiently (linear systems
with triangular matrices in our example).
2. Describe the transformation rules for converting our initial problem into the
desired form.
3. Describe the class of problems for which these transformation rules are applica-
ble.
We are ready to explain how it works in Convex Optimization. First of all,
we need to find a basic numerical scheme and problem formulation at which
this scheme is very efficient. We will see that for our goals the most appropriate
candidate is the Newton’s method (see Sect. 1.2.4 and Chap. 4) as applied in the
framework of Sequential Unconstrained Minimization (see Sect. 1.3.3).
In the next section, we will analyze some drawbacks of the standard theory on
the Newton’s method. From this analysis, we derive a family of very special convex
functions, so-called self-concordant functions and self-concordant barriers, which
can be efficiently minimized by the Newton’s method. We use these objects in the
description of a transformed version of the initial problem. In the sequel, we refer
to this description as a barrier model of our problem. This model will replace

the standard functional model of the optimization problem used in the previous
chapters.

5.1.2 What Does the Newton’s Method Actually Do?

Let us look at the standard result on the local convergence of Newton's method (we have proved it as Theorem 1.2.5). We need to find an unconstrained local minimum x* of the twice differentiable function f(·):

min_{x∈Rⁿ} f(x).    (5.1.1)

For the moment, all the norms we use are standard Euclidean. Assume that:
• ∇²f(x*) ⪰ μIₙ with some constant μ > 0;
• ‖∇²f(x) − ∇²f(y)‖ ≤ M‖x − y‖ for all x and y ∈ Rⁿ.
Assume also that the starting point of the Newton process x₀ is close enough to x*:

‖x₀ − x*‖ < r̄ = 2μ/(3M).    (5.1.2)

Then we can prove (see Theorem 1.2.5) that the sequence

x_{k+1} = x_k − [∇²f(x_k)]⁻¹ ∇f(x_k),  k ≥ 0,    (5.1.3)

is well defined. Moreover, ‖x_k − x*‖ < r̄ for all k ≥ 0 and the Newton's method (5.1.3) converges quadratically:

‖x_{k+1} − x*‖ ≤ M‖x_k − x*‖² / (2(μ − M‖x_k − x*‖)).

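As a small numerical illustration of this quadratic rate (our own toy example, not from the text), take the univariate function f(x) = eˣ − x, for which x* = 0 and the iteration (5.1.3) reads x_{k+1} = x_k − (e^{x_k} − 1)/e^{x_k}. Starting from x₀ = 1, the error is roughly squared at every step:

```python
import math

# Newton's method for f(x) = exp(x) - x: f'(x) = exp(x) - 1, f''(x) = exp(x),
# with the unique minimizer x* = 0.
x = 1.0
errors = [abs(x)]
for _ in range(5):
    x = x - (math.exp(x) - 1.0) / math.exp(x)
    errors.append(abs(x))
# errors: 1, 0.368, 0.060, 0.0018, 1.6e-06, 1.2e-12 -- each roughly the square
# of the previous one, as predicted by the quadratic convergence estimate.
```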
What is wrong with this result? Note that the description of the region of
quadratic convergence (5.1.2) for this method is given in terms of the standard
inner product

⟨x, y⟩ = Σ_{i=1}^{n} x⁽ⁱ⁾ y⁽ⁱ⁾,  x, y ∈ Rⁿ.

If we choose a new basis in Rn , then all objects in our description change: the
metric, the Hessians, the bounds μ and M. However, let us see what happens in this
situation with the Newton process. Namely, let B be a nondegenerate (n×n)-matrix.
Consider the function

φ(y) = f (By), y ∈ Rn .

The following result is very important for understanding the nature of the Newton’s
method.
Lemma 5.1.1 Let the sequence {xk } be generated by the Newton’s method as
applied to the function f :

xk+1 = xk − [∇ 2 f (xk )]−1 ∇f (xk ), k ≥ 0.

Consider the sequence {yk }, generated by the Newton’s method for the function φ:

yk+1 = yk − [∇ 2 φ(yk )]−1 ∇φ(yk ), k ≥ 0,

with y0 = B −1 x0 . Then yk = B −1 xk for all k ≥ 0.


Proof Let y_k = B⁻¹x_k for some k ≥ 0. Then

y_{k+1} = y_k − [∇²φ(y_k)]⁻¹ ∇φ(y_k) = y_k − [Bᵀ∇²f(By_k)B]⁻¹ Bᵀ∇f(By_k)

 = B⁻¹x_k − B⁻¹[∇²f(x_k)]⁻¹ ∇f(x_k) = B⁻¹x_{k+1}. □
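This invariance is easy to observe numerically. The following sketch (our own toy example with f(x) = ¼(x⁽¹⁾)⁴ + ½(x⁽²⁾)², a fixed nondegenerate B, and hypothetical helper names) runs the two Newton processes side by side and confirms that y_k = B⁻¹x_k:

```python
# Affine invariance of Newton's method: for phi(y) = f(By), iterates satisfy y_k = B^{-1} x_k.
# Toy data: f(x) = x1^4/4 + x2^2/2 and B = [[2, 1], [0, 1]] (nondegenerate).

def grad_f(x):
    return [x[0] ** 3, x[1]]

def hess_f(x):
    return [[3.0 * x[0] ** 2, 0.0], [0.0, 1.0]]

B = [[2.0, 1.0], [0.0, 1.0]]
Binv = [[0.5, -0.5], [0.0, 1.0]]  # inverse of B

def mv(M, v):  # 2x2 matrix-vector product
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def solve2(H, g):  # solve the 2x2 system H d = g
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]

def grad_phi(y):  # gradient of phi(y) = f(By): B^T grad_f(By)
    g = grad_f(mv(B, y))
    return [B[0][0] * g[0] + B[1][0] * g[1], B[0][1] * g[0] + B[1][1] * g[1]]

def hess_phi(y):  # Hessian of phi: B^T hess_f(By) B
    H = hess_f(mv(B, y))
    return [[sum(B[k][i] * H[k][l] * B[l][j] for k in range(2) for l in range(2))
             for j in range(2)] for i in range(2)]

def newton(x, grad, hess, steps):
    for _ in range(steps):
        d = solve2(hess(x), grad(x))
        x = [x[0] - d[0], x[1] - d[1]]
    return x

x0 = [1.2, 0.7]
xk = newton(x0, grad_f, hess_f, 4)
yk = newton(mv(Binv, x0), grad_phi, hess_phi, 4)
Byk = mv(B, yk)  # should coincide with xk up to rounding
```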

Thus, the Newton’s method is affine invariant with respect to affine transforma-
tions of variables. Therefore, its actual region of quadratic convergence does not
depend on a particular choice of the basis. It depends only on the local topological
structure of the function f (·).
Let us try to understand what was wrong in our assumptions. The main
assumption is related to the Lipschitz continuity of the Hessians:

‖∇²f(x) − ∇²f(y)‖ ≤ M‖x − y‖,  ∀x, y ∈ Rⁿ.

Let us assume that f ∈ C³(Rⁿ). Define

f'''(x)[u] = lim_{α→0} (1/α) [∇²f(x + αu) − ∇²f(x)] ≡ D³f(x)[u].

The object in the right-hand side of this equality (and, consequently, in its left-hand side) is an (n × n)-matrix. Thus, our assumption is equivalent to the condition

‖f'''(x)[u]‖ ≤ M‖u‖.

This means that at any point x ∈ Rⁿ, we have

⟨f'''(x)[u]v, v⟩ ≡ D³f(x)[u, v, v] ≤ M‖u‖ · ‖v‖²  ∀u, v ∈ Rⁿ.

Note that the value in the left-hand side of this inequality is invariant with respect to affine transformations of variables (since it is just a third directional derivative, taken once along the direction u and twice along the direction v). However, its right-hand side does depend on the choice of coordinates. Therefore, the most natural way to improve our situation consists in finding an affine-invariant replacement for the standard Euclidean norm ‖·‖. The most natural candidate for such a replacement is quite evident: this is the norm defined by the Hessian ∇²f(x) itself, namely,

‖u‖²_{∇²f(x)} = ⟨∇²f(x)u, u⟩ ≡ D²f(x)[u, u].

This choice results in the definition of a self-concordant function.

5.1.3 Definition of Self-concordant Functions

Since we are going to work with affine-invariant objects, it is natural to get rid of
coordinate representations and denote by E a real vector space for our variables, and
by E∗ the dual space (see Sect. 4.2.1).
Let us consider a closed convex function f (·) ∈ C 3 (dom f ) with open domain.
By fixing a point x ∈ dom f and direction u ∈ E, we define a function

φ(x; t) = f (x + tu),

dependent on the variable t ∈ dom φ(x; ·) ⊆ R. Define

Df(x)[u] = φ'(x; 0) = ⟨∇f(x), u⟩,

D²f(x)[u, u] = φ''(x; 0) = ⟨∇²f(x)u, u⟩ = ‖u‖²_{∇²f(x)},

D³f(x)[u, u, u] = φ'''(x; 0) = ⟨D³f(x)[u]u, u⟩.

Definition 5.1.1 A function f is called self-concordant if there exists a constant


Mf ≥ 0 such that the inequality

|D³f(x)[u, u, u]| ≤ 2Mf ‖u‖³_{∇²f(x)}    (5.1.4)

holds for all x ∈ dom f and u ∈ E. If Mf = 1, the function is called standard


self-concordant.
Note that we are going to use these functions to construct a barrier model of
our problem. Our main hope is that they can be easily minimized by the Newton’s
method.
Let us point out an equivalent definition of self-concordant functions.

Lemma 5.1.2 A function f is self-concordant if and only if for any x ∈ dom f and any triple of directions u₁, u₂, u₃ ∈ E we have

|D³f(x)[u₁, u₂, u₃]| ≤ 2Mf ∏_{i=1}^{3} ‖u_i‖_{∇²f(x)}.    (5.1.5)

We accept this statement without proof since it needs some special facts from the
theory of tri-linear symmetric forms. For the same reason, we accept without proof
the following corollary.
Corollary 5.1.1 A function f is self-concordant if and only if for any x ∈ dom f and any direction u ∈ Rⁿ we have

D³f(x)[u] ⪯ 2Mf ‖u‖_{∇²f(x)} ∇²f(x).    (5.1.6)

In what follows, we often use Definition 5.1.1 in order to prove that some f
is self-concordant. In contrast, Lemma 5.1.2 is useful for establishing different
properties of self-concordant functions.
Let us consider several examples.
Example 5.1.1
1. Linear function. Consider the function

f(x) = α + ⟨a, x⟩,  dom f = E.

Then

∇f(x) = a,  ∇²f(x) = 0,  ∇³f(x) = 0,

and we conclude that Mf = 0.
2. Convex quadratic function. Consider the function

f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩,  dom f = E,

where A = A* ⪰ 0. Then

∇f(x) = a + Ax,  ∇²f(x) = A,  ∇³f(x) = 0,

and we conclude that Mf = 0.
3. Logarithmic barrier for a ray. Consider the univariate function

f(x) = −ln x,  dom f = {x ∈ R | x > 0}.



Then

f'(x) = −1/x,  f''(x) = 1/x²,  f'''(x) = −2/x³.

Therefore, f(·) is self-concordant with Mf = 1.


4. Logarithmic barrier for an ellipsoid. Let A = A* ⪰ 0. Consider the concave function

φ(x) = α + ⟨a, x⟩ − ½⟨Ax, x⟩.

Define f(x) = −ln φ(x), with dom f = {x ∈ E : φ(x) > 0}. Then

Df(x)[u] = −(1/φ(x)) [⟨a, u⟩ − ⟨Ax, u⟩],

D²f(x)[u, u] = (1/φ²(x)) [⟨a, u⟩ − ⟨Ax, u⟩]² + (1/φ(x)) ⟨Au, u⟩,

D³f(x)[u, u, u] = −(2/φ³(x)) [⟨a, u⟩ − ⟨Ax, u⟩]³ − (3/φ²(x)) [⟨a, u⟩ − ⟨Ax, u⟩] ⟨Au, u⟩.

Let ω₁ = Df(x)[u] and ω₂ = (1/φ(x)) ⟨Au, u⟩. Then

D²f(x)[u, u] = ω₁² + ω₂ ≥ 0,

|D³f(x)[u, u, u]| = |2ω₁³ + 3ω₁ω₂|.

The only nontrivial case is ω₁ ≠ 0. Let ξ = ω₂/ω₁². Then

|D³f(x)[u,u,u]| / (D²f(x)[u,u])^{3/2} ≤ (2|ω₁|³ + 3|ω₁|ω₂) / (ω₁² + ω₂)^{3/2} = 2(1 + (3/2)ξ) / (1 + ξ)^{3/2} ≤ 2,

where the last inequality follows from the convexity of the function (1 + ξ)^{3/2} for ξ ≥ −1. Thus, the function f is self-concordant and Mf = 1.
5. It is easy to verify that none of the following univariate functions is self-concordant:

f(x) = eˣ;  f(x) = 1/xᵖ, x > 0, p > 0;  f(x) = |x|ᵖ, p > 2.

However, the function f_p(x) = ½x² + 1/(p xᵖ) − 1/p with p > 0 is self-concordant for x > 0. Let us prove this. Indeed,

f_p'(x) = x − 1/x^{p+1},  f_p''(x) = 1 + (p+1)/x^{p+2} ≥ 1,  f_p'''(x) = −(p+1)(p+2)/x^{p+3}.

If x ≥ 1, then

|f_p'''(x)| = (p+1)(p+2)/x^{p+3} ≤ (p+2) f_p''(x) ≤ (p+2) [f_p''(x)]^{3/2}.

If x ∈ (0, 1], then

|f_p'''(x)| = (p+1)(p+2)/x^{p+3} ≤ (p+1)(p+2) (1/x^{p+2})^{3/2} ≤ (p+1)(p+2) ( f_p''(x)/(p+1) )^{3/2}.

Thus, we can take M_{f_p} = max{ 1 + p/2, (p+2)/(2√(p+1)) } = 1 + p/2. Note that the function f_p is well defined as p → 0. Indeed,

lim_{p→0} f_p(x) = ½x² + lim_{p→0} (1/p)(e^{−p ln x} − 1) = ½x² − ln x.

6. Let f ∈ C_{L₃}^{3,2}(Rⁿ). Assume that it is strongly convex on Rⁿ with convexity parameter σ₂(f). Then, for any x ∈ Rⁿ and direction u ∈ Rⁿ, we have, in view of (2.1.28),

D³f(x)[u] ⪯ L₃‖u‖ Iₙ ⪯ L₃ (1/σ₂(f))^{1/2} ‖u‖_{∇²f(x)} · (1/σ₂(f)) ∇²f(x).

Thus, in view of Corollary 5.1.1, we can take Mf = L₃ / (2σ₂(f)^{3/2}). □
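In the univariate case, Definition 5.1.1 reads |f'''(x)| ≤ 2Mf (f''(x))^{3/2}, which is easy to test numerically (our own illustration, not from the text): for f(x) = −ln x the ratio |f'''|/(f'')^{3/2} equals 2 at every point (so Mf = 1 is exact), while for f(x) = eˣ it equals e^{−x/2} and is unbounded as x → −∞:

```python
import math

def sc_ratio(f2, f3, x):
    # |f'''(x)| / (f''(x))^{3/2}; self-concordance with constant M means this is <= 2M.
    return abs(f3(x)) / f2(x) ** 1.5

# f(x) = -ln x: f''(x) = 1/x^2, f'''(x) = -2/x^3.
def log_barrier_ratio(x):
    return sc_ratio(lambda t: 1.0 / t ** 2, lambda t: -2.0 / t ** 3, x)

# f(x) = exp(x): f'' = f''' = exp, so the ratio is exp(-x/2).
def exp_ratio(x):
    return sc_ratio(math.exp, math.exp, x)
```

The first ratio is identically 2, confirming Mf = 1 for the logarithmic barrier; the second exceeds any fixed bound 2Mf once x is negative enough, which is why eˣ is not self-concordant.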

Let us now look at the main properties of self-concordant functions.


Theorem 5.1.1 Let functions f_i be self-concordant with constants M_i, i = 1, 2, and let α, β > 0. Then the function f(x) = αf₁(x) + βf₂(x) is self-concordant with constant

Mf = max{ (1/√α) M₁, (1/√β) M₂ }

and dom f = dom f₁ ∩ dom f₂.
Proof In view of Theorem 3.1.5, f is a closed convex function. Let us fix some x ∈ dom f and u ∈ E. Then

|D³f_i(x)[u, u, u]| ≤ 2M_i [D²f_i(x)[u, u]]^{3/2},  i = 1, 2.

Let ω_i = D²f_i(x)[u, u] ≥ 0. Then

|D³f(x)[u,u,u]| / [D²f(x)[u,u]]^{3/2} ≤ (α|D³f₁(x)[u,u,u]| + β|D³f₂(x)[u,u,u]|) / [αD²f₁(x)[u,u] + βD²f₂(x)[u,u]]^{3/2}

 ≤ 2(αM₁ω₁^{3/2} + βM₂ω₂^{3/2}) / [αω₁ + βω₂]^{3/2}.    (5.1.7)

The right-hand side of this inequality does not change if we replace (ω₁, ω₂) by (tω₁, tω₂) with t > 0. Therefore, we can assume that

αω₁ + βω₂ = 1.

Let ξ = αω₁. Then the right-hand side of inequality (5.1.7) becomes equal to

2 [ (M₁/√α) ξ^{3/2} + (M₂/√β) (1 − ξ)^{3/2} ],  ξ ∈ [0, 1].

This function is convex in ξ. Therefore, it attains its maximum at the end points of the interval (see Corollary 3.1.1). □

Corollary 5.1.2 Let a function f be self-concordant with some constant Mf. If A = A* ⪰ 0, then the function

φ(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ + f(x)

is also self-concordant with constant Mφ = Mf.
Proof We have seen that any convex quadratic function is self-concordant with zero constant. □

Corollary 5.1.3 Let a function f be self-concordant with some constant Mf, and let α > 0. Then the function φ(x) = αf(x) is also self-concordant with constant Mφ = (1/√α) Mf. □
Let us now prove that self-concordance is an affine-invariant property.
Theorem 5.1.2 Let A(x) = Ax + b : E → E₁ be an affine operator. Assume that a function f(·) is self-concordant with constant Mf. Then the function

φ(x) = f(A(x))

is also self-concordant with Mφ = Mf.


Proof The function φ(·) is closed and convex in view of Theorem 3.1.6. Let us fix some x ∈ dom φ = {x : A(x) ∈ dom f} and u ∈ E. Define y = A(x), v = Au. Then

Dφ(x)[u] = ⟨∇f(A(x)), Au⟩ = ⟨∇f(y), v⟩,

D²φ(x)[u, u] = ⟨∇²f(A(x))Au, Au⟩ = ⟨∇²f(y)v, v⟩,

D³φ(x)[u, u, u] = D³f(A(x))[Au, Au, Au] = D³f(y)[v, v, v].


Therefore,

|D³φ(x)[u, u, u]| = |D³f(y)[v, v, v]| ≤ 2Mf ⟨∇²f(y)v, v⟩^{3/2} = 2Mf (D²φ(x)[u, u])^{3/2}. □

Finally, let us describe the behavior of a self-concordant function near the


boundary of its domain.
Theorem 5.1.3 Let f be a self-concordant function. Then for any x̄ ∈ ∂(dom f )
and any sequence

{xk } ⊂ dom f : xk → x̄

we have f (xk ) → +∞.


Proof Since f is a closed convex function with open domain, this statement follows
from Item 2 of Theorem 3.1.4. 
Thus, f is a barrier function for cl(dom f) (see Sect. 1.3.3). Finally, let us establish the self-concordance of the logarithmic barrier for a level set of a self-concordant function.
Theorem 5.1.4 Let a function f be self-concordant with constant Mf and f(x) ≥ f* for all x ∈ dom f. For arbitrary β > f*, consider the function

φ(x) = −ln(β − f(x)).

Then:
1. φ is well defined on dom φ = {x ∈ dom f : f(x) < β}.
2. For any x ∈ dom φ and h ∈ E we have

⟨∇²φ(x)h, h⟩ ≥ ⟨∇φ(x), h⟩².    (5.1.8)

3. φ is self-concordant with constant Mφ = (1 + Mf²(β − f*))^{1/2}.

Proof Let us fix x ∈ dom φ and h ∈ E. Consider the function ψ(τ) = φ(x + τh). Define ω = β − f(x). Then

ψ'(0) = (1/ω) ⟨∇f(x), h⟩,

ψ''(0) = (1/ω) ⟨∇²f(x)h, h⟩ + (1/ω²) ⟨∇f(x), h⟩²,

ψ'''(0) = (1/ω) D³f(x)[h, h, h] + (3/ω²) ⟨∇²f(x)h, h⟩ ⟨∇f(x), h⟩ + (2/ω³) ⟨∇f(x), h⟩³.

Thus, ψ''(0) ≥ (ψ'(0))², and this is inequality (5.1.8).



Further, we need to bound ψ'''(0) from above by ψ''(0)^{3/2}. Since f is self-concordant, in view of (5.1.4) we have

ψ'''(0) ≤ (2Mf/ω) ⟨∇²f(x)h, h⟩^{3/2} + (3/ω²) ⟨∇²f(x)h, h⟩ ⟨∇f(x), h⟩ + (2/ω³) ⟨∇f(x), h⟩³.

The right-hand side of this inequality is homogeneous in h of degree three. Therefore, let us find an upper bound for it assuming that ψ''(0) = 1. Defining

τ = ( (1/ω) ⟨∇²f(x)h, h⟩ )^{1/2},  ξ = (1/ω) ⟨∇f(x), h⟩,

we come to the following maximization problem:

max_{τ,ξ∈R} { 2ω̂^{1/2} τ³ + 3τ²ξ + 2ξ³ : τ² + ξ² = 1 },

where ω̂ = Mf² ω. Note that the optimal values of τ and ξ in this problem are nonnegative. Therefore, in view of the equality constraint, we can rewrite the objective function as follows:

2ω̂^{1/2} τ³ + 3τ²ξ + 2ξ³ = 2ω̂^{1/2} τ³ + τ²ξ + 2ξ(τ² + ξ²) = 2ω̂^{1/2} τ³ + (τ² + 2)ξ

 = 2ω̂^{1/2} τ³ + (τ² + 2)√(1 − τ²).

The first-order optimality condition for this univariate function can be written as follows:

0 = 6ω̂^{1/2} τ² + 2τ√(1 − τ²) − (τ² + 2) τ/√(1 − τ²) = 6ω̂^{1/2} τ² − 3τ³/√(1 − τ²).


Thus, the optimal value τ* satisfies the equation 2ω̂^{1/2} = τ*/√(1 − τ*²). Hence, τ*² = 4ω̂/(1 + 4ω̂). Substituting this value into the objective function, we come to the following bound:

2ω̂^{1/2} (4ω̂/(1 + 4ω̂))^{3/2} + (2 + 12ω̂)/(1 + 4ω̂)^{3/2} = (2 + 12ω̂ + 16ω̂²)/(1 + 4ω̂)^{3/2} = 2(1 + 2ω̂)/(1 + 4ω̂)^{1/2} ≤ 2√(1 + ω̂).

It remains to note that ω̂ ≤ Mf²(β − f*). □




5.1.4 Main Inequalities

Let f be a self-concordant function. Define

‖h‖_x = ⟨∇²f(x)h, h⟩^{1/2}.

We call ‖h‖_x the (primal) local norm of the direction h with respect to x. Let us fix a point x ∈ dom f and a direction h ∈ E such that ⟨∇²f(x)h, h⟩ > 0. Consider the univariate function

φ(t) = 1 / ⟨∇²f(x + th)h, h⟩^{1/2}.

In view of the continuity of the second derivative of the function f, we have 0 ∈ int(dom φ).
Lemma 5.1.3 For all feasible t, we have |φ'(t)| ≤ Mf.
Proof Indeed,

φ'(t) = −D³f(x + th)[h, h, h] / (2 ⟨∇²f(x + th)h, h⟩^{3/2}).

Therefore |φ'(t)| ≤ Mf in view of Definition 5.1.1. □

Corollary 5.1.4 The domain of the function φ(·) contains the interval

I_x = ( −(1/Mf) φ(0), (1/Mf) φ(0) ).

Proof Indeed, in view of Lemma 5.1.3, the values ⟨∇²f(x + τh)h, h⟩ are positive on any subinterval of I_x, and φ(t) ≥ φ(0) − Mf |t|. Moreover, since f(x + th) → ∞ as the points x + th approach the boundary of dom f (see Theorem 5.1.3), these points cannot cross the boundary while t ∈ I_x. □

Let us consider the following ellipsoids:

W⁰(x; r) = { y ∈ E : ‖y − x‖_x < r },

W(x; r) = cl(W⁰(x; r)) = { y ∈ E : ‖y − x‖_x ≤ r }.

This set is called the Dikin ellipsoid of the function f at x.


Theorem 5.1.5
1. For any x ∈ dom f, we have W⁰(x; 1/Mf) ⊆ dom f.
2. For all x, y ∈ dom f, the following inequality holds:

‖y − x‖_y ≥ ‖y − x‖_x / (1 + Mf ‖y − x‖_x).    (5.1.9)

3. If ‖y − x‖_x < 1/Mf, then

‖y − x‖_y ≤ ‖y − x‖_x / (1 − Mf ‖y − x‖_x).    (5.1.10)

Proof 1. Let us choose in E a Euclidean norm ‖·‖ and a small ε > 0. Consider the function f_ε(x) = f(x) + (ε/2)‖x‖². In view of Corollary 5.1.2, it is self-concordant with constant Mf. Moreover, for any h ∈ E we have ⟨∇²f_ε(x)h, h⟩ > 0. Therefore, in view of Corollary 5.1.4, dom f_ε ≡ dom f contains the set

{ y = x + th : t² (‖h‖_x² + ε‖h‖²) < 1/Mf² }

(since now φ(0) = 1/⟨∇²f_ε(x)h, h⟩^{1/2}). Since ε can be arbitrarily small, this means that dom f contains W⁰(x; 1/Mf).
2. Let us choose h = y − x. Assume for a moment that ‖h‖_x > 0. Then

φ(1) = 1/‖y − x‖_y,  φ(0) = 1/‖y − x‖_x,

and φ(1) ≤ φ(0) + Mf in view of Lemma 5.1.3. This is inequality (5.1.9).
3. If ‖y − x‖_x < 1/Mf, then φ(0) > Mf, and in view of Lemma 5.1.3, φ(1) ≥ φ(0) − Mf. This is inequality (5.1.10).
In the case ‖h‖_x = 0, both items can be justified by the trick used in the proof of Item 1. □
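Item 1 can be observed on a standard example (our own illustration, not from the text): for the log-barrier f(x) = −Σᵢ ln x⁽ⁱ⁾ of the positive orthant we have Mf = 1 and ∇²f(x) = diag(1/(x⁽ⁱ⁾)²), so ‖y − x‖_x < 1 forces |y⁽ⁱ⁾ − x⁽ⁱ⁾| < x⁽ⁱ⁾ for every i, i.e. the open Dikin ellipsoid W⁰(x; 1) stays inside dom f:

```python
# Dikin ellipsoid of f(x) = -sum_i ln x_i: ||y - x||_x^2 = sum_i (y_i - x_i)^2 / x_i^2.

def local_norm(x, y):
    return sum((yi - xi) ** 2 / xi ** 2 for xi, yi in zip(x, y)) ** 0.5

def dikin_point(x, direction, radius):
    # Scale the displacement so that the resulting point has local norm `radius` at x.
    t = radius / local_norm(x, [xi + di for xi, di in zip(x, direction)])
    return [xi + t * di for xi, di in zip(x, direction)]
```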
The next statement demonstrates that some local properties of self-concordant
functions reflect somehow the global properties of its domain.
Theorem 5.1.6 Let a function f be self-concordant, and let dom f contain no straight lines. Then the Hessian ∇²f(x) is nondegenerate at all points x ∈ dom f.
Proof Assume that ⟨∇²f(x̄)h, h⟩ = 0 for some x̄ ∈ dom f and direction h ∈ E, h ≠ 0. Then all points of the line {x = x̄ + τh, τ ∈ R} belong to the ellipsoid W⁰(x̄; 1/Mf). However, in view of Item 1 of Theorem 5.1.5, this ellipsoid belongs to dom f. This contradicts the conditions of the theorem. □
Theorem 5.1.7 Let x ∈ dom f. Then for any y ∈ W⁰(x; 1/Mf) we have

(1 − Mf r)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1/(1 − Mf r)²) ∇²f(x),    (5.1.11)

where r = ‖y − x‖_x.
Proof Let us fix an arbitrary direction h ∈ E, h ≠ 0. Consider the function

ψ(t) = ⟨∇²f(x + t(y − x))h, h⟩,  t ∈ [0, 1].

Define y_t = x + t(y − x) and r = ‖y − x‖_x. Then, in view of Lemma 5.1.2 and inequality (5.1.10), we have

|ψ'(t)| = |D³f(y_t)[y − x, h, h]| ≤ 2Mf ‖y − x‖_{y_t} ‖h‖²_{y_t}

 = (2Mf/t) ‖y_t − x‖_{y_t} ψ(t) ≤ (2Mf/t) · ( ‖y_t − x‖_x / (1 − Mf ‖y_t − x‖_x) ) · ψ(t)

 = (2Mf r / (1 − tMf r)) ψ(t).

If ‖y − x‖_x = 0, then ψ(t) = ψ(0), t ∈ [0, 1], and therefore

(1 − Mf r)² ψ(0) ≤ ψ(t) ≤ (1/(1 − Mf r)²) ψ(0).    (5.1.12)

If r > 0, then 2(ln(1 − tMf r))' ≤ (ln ψ(t))' ≤ −2(ln(1 − tMf r))' for all t ∈ [0, 1]. Integrating these inequalities in t ∈ [0, 1], we again get (5.1.12), which is equivalent to (5.1.11) since h was chosen arbitrarily. □
Corollary 5.1.5 Let $x \in \mathrm{dom}\, f$ and $r = \|y - x\|_x < \frac{1}{M_f}$. Then we can bound the operator

$$G = \int_0^1 \nabla^2 f(x + \tau(y - x))\, d\tau$$

as follows:

$$\left( 1 - M_f r + \tfrac{1}{3} M_f^2 r^2 \right) \nabla^2 f(x) \preceq G \preceq \frac{1}{1 - M_f r}\, \nabla^2 f(x).$$

Proof Indeed, in view of Theorem 5.1.7 we have

$$G = \int_0^1 \nabla^2 f(x + \tau(y - x))\, d\tau \succeq \nabla^2 f(x) \cdot \int_0^1 (1 - \tau M_f r)^2\, d\tau = \left( 1 - M_f r + \tfrac{1}{3} M_f^2 r^2 \right) \nabla^2 f(x),$$

and $G \preceq \nabla^2 f(x) \cdot \int_0^1 \frac{d\tau}{(1 - \tau M_f r)^2} = \frac{1}{1 - M_f r}\, \nabla^2 f(x)$. $\square$
Remark 5.1.1 The statement of Corollary 5.1.5 remains valid for $r = \|y - x\|_y$.
Let us now recall the most important facts we have already proved.
340 5 Polynomial-Time Interior-Point Methods

• At any point $x \in \mathrm{dom}\, f$ we can define an ellipsoid

$$W^0\left( x; \tfrac{1}{M_f} \right) = \left\{ y \in \mathbb{E} \;:\; \langle \nabla^2 f(x)(y - x), y - x \rangle < \tfrac{1}{M_f^2} \right\},$$

belonging to $\mathrm{dom}\, f$.
• Inside the ellipsoid $W(x; r)$ with $r \in [0, \frac{1}{M_f})$ the function $f$ is almost quadratic:

$$(1 - M_f r)^2\, \nabla^2 f(x) \preceq \nabla^2 f(y) \preceq \frac{1}{(1 - M_f r)^2}\, \nabla^2 f(x)$$

for all $y \in W(x; r)$. By choosing $r$ small enough, we can make the quality of this quadratic approximation acceptable for our goals.
These two facts form the basis for all subsequent results.
Let us now prove several inequalities relating the values of a self-concordant function to its linear approximation.
Theorem 5.1.8 For any $x, y \in \mathrm{dom}\, f$, we have

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \frac{\|y - x\|_x^2}{1 + M_f \|y - x\|_x}, \qquad (5.1.13)$$

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\, \omega(M_f \|y - x\|_x), \qquad (5.1.14)$$

where $\omega(t) = t - \ln(1 + t)$.
Proof Let $y_\tau = x + \tau(y - x)$, $\tau \in [0, 1]$, and $r = \|y - x\|_x$. Then, in view of (5.1.9), we have

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla^2 f(y_\tau)(y - x), y - x \rangle\, d\tau = \int_0^1 \frac{1}{\tau^2}\, \|y_\tau - x\|_{y_\tau}^2\, d\tau \geq \int_0^1 \frac{r^2\, d\tau}{(1 + \tau M_f r)^2} = \frac{r}{M_f} \int_0^{M_f r} \frac{dt}{(1 + t)^2} = \frac{r^2}{1 + M_f r}.$$

Further, using (5.1.13), we obtain

$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(y_\tau) - \nabla f(x), y - x \rangle\, d\tau = \int_0^1 \frac{1}{\tau}\, \langle \nabla f(y_\tau) - \nabla f(x), y_\tau - x \rangle\, d\tau \geq \int_0^1 \frac{\|y_\tau - x\|_x^2\, d\tau}{\tau (1 + M_f \|y_\tau - x\|_x)} = \int_0^1 \frac{\tau r^2\, d\tau}{1 + \tau M_f r} = \frac{1}{M_f^2} \int_0^{M_f r} \frac{t\, dt}{1 + t} = \frac{1}{M_f^2}\, \omega(M_f r). \qquad \square$$

Theorem 5.1.9 Let $x \in \mathrm{dom}\, f$ and $\|y - x\|_x < \frac{1}{M_f}$. Then

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \leq \frac{\|y - x\|_x^2}{1 - M_f \|y - x\|_x}, \qquad (5.1.15)$$

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\, \omega_*(M_f \|y - x\|_x), \qquad (5.1.16)$$

where $\omega_*(t) = -t - \ln(1 - t)$.
Proof Let $y_\tau = x + \tau(y - x)$, $\tau \in [0, 1]$, and $r = \|y - x\|_x$. Since $\|y_\tau - x\|_x < \frac{1}{M_f}$, in view of (5.1.10) we have

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla^2 f(y_\tau)(y - x), y - x \rangle\, d\tau = \int_0^1 \frac{1}{\tau^2}\, \|y_\tau - x\|_{y_\tau}^2\, d\tau \leq \int_0^1 \frac{r^2\, d\tau}{(1 - \tau M_f r)^2} = \frac{r}{M_f} \int_0^{M_f r} \frac{dt}{(1 - t)^2} = \frac{r^2}{1 - M_f r}.$$

Further, using (5.1.15), we obtain

$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(y_\tau) - \nabla f(x), y - x \rangle\, d\tau = \int_0^1 \frac{1}{\tau}\, \langle \nabla f(y_\tau) - \nabla f(x), y_\tau - x \rangle\, d\tau \leq \int_0^1 \frac{\|y_\tau - x\|_x^2\, d\tau}{\tau (1 - M_f \|y_\tau - x\|_x)} = \int_0^1 \frac{\tau r^2\, d\tau}{1 - \tau M_f r} = \frac{1}{M_f^2} \int_0^{M_f r} \frac{t\, dt}{1 - t} = \frac{1}{M_f^2}\, \omega_*(M_f r). \qquad \square$$
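For the univariate test function $f(z) = -\ln z$ (with $M_f = 1$), the gap $f(y) - f(x) - \langle \nabla f(x), y - x \rangle$ can be compared numerically against the lower bound (5.1.14) and the upper bound (5.1.16); in one dimension this gap in fact coincides with $\omega(r)$ for $y > x$ and with $\omega_*(r)$ for $y < x$. A small sketch (not the book's code):

```python
import math

def omega(t):   return t - math.log1p(t)       # omega(t) = t - ln(1+t)
def omega_s(t): return -t - math.log1p(-t)     # omega*(t) = -t - ln(1-t)

def bregman_gap(x, y):
    """f(y) - f(x) - f'(x)(y - x) for f(z) = -ln z (M_f = 1)."""
    return -math.log(y) + math.log(x) + (y - x) / x

for x, y in [(1.0, 1.5), (2.0, 1.2), (0.5, 0.8)]:
    r = abs(y - x) / x                     # ||y - x||_x
    gap = bregman_gap(x, y)
    assert omega(r) <= gap + 1e-12         # lower bound (5.1.14)
    if r < 1.0:
        assert gap <= omega_s(r) + 1e-12   # upper bound (5.1.16)
print("bounds (5.1.14) and (5.1.16) verified")
```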

Theorem 5.1.10 Inequalities (5.1.9), (5.1.10), (5.1.13), (5.1.14), (5.1.15) and (5.1.16) are necessary and sufficient characteristics of self-concordant functions.
Proof We have already justified two sequences of implications:

Definition 5.1.1 $\Rightarrow$ (5.1.9) $\Rightarrow$ (5.1.13) $\Rightarrow$ (5.1.14),
Definition 5.1.1 $\Rightarrow$ (5.1.10) $\Rightarrow$ (5.1.15) $\Rightarrow$ (5.1.16).

Let us prove the implication (5.1.14) $\Rightarrow$ Definition 5.1.1. Let $x \in \mathrm{dom}\, f$ and $x - \alpha u \in \mathrm{dom}\, f$ for $\alpha \in [0, \epsilon)$. Consider the function

$$\psi(\alpha) = f(x - \alpha u), \quad \alpha \in [0, \epsilon).$$

Let $r = \|u\|_x \equiv [\psi''(0)]^{1/2}$. Assuming that (5.1.14) holds for all $x$ and $y$ from $\mathrm{dom}\, f$, we have

$$\psi(\alpha) - \psi(0) - \psi'(0)\alpha - \tfrac{1}{2} \psi''(0)\alpha^2 \geq \frac{1}{M_f^2}\, \omega(\alpha M_f r) - \tfrac{1}{2} \alpha^2 r^2.$$

Therefore

$$\tfrac{1}{6}\, \psi'''(0) = \lim_{\alpha \downarrow 0} \frac{1}{\alpha^3} \left[ \psi(\alpha) - \psi(0) - \psi'(0)\alpha - \tfrac{1}{2} \psi''(0)\alpha^2 \right] \geq \lim_{\alpha \downarrow 0} \frac{1}{\alpha^3} \left[ \frac{1}{M_f^2}\, \omega(\alpha M_f r) - \tfrac{1}{2} \alpha^2 r^2 \right] = \lim_{\alpha \downarrow 0} \frac{1}{3\alpha^2} \left[ \frac{r}{M_f}\, \omega'(\alpha M_f r) - \alpha r^2 \right] = \lim_{\alpha \downarrow 0} \frac{r^2}{3\alpha} \left[ \frac{1}{1 + \alpha M_f r} - 1 \right] = -\tfrac{1}{3} M_f r^3.$$

Therefore, $D^3 f(x)[u, u, u] = -\psi'''(0) \leq 2 M_f [\psi''(0)]^{3/2}$, and this is Definition 5.1.1. The implication (5.1.16) $\Rightarrow$ Definition 5.1.1 can be proved in a similar way. $\square$
Sometimes Theorem 5.1.10 is convenient for establishing self-concordance of
certain functions. Let us demonstrate this with an Implicit Function Theorem.
Let us assume that $\mathbb{E} = \mathbb{E}_1 \times \mathbb{E}_2$. Thus, we have a corresponding partition of the variables: $z = (x, y) \in \mathbb{E}$. Let $\Phi$ be a self-concordant function with $\mathrm{dom}\, \Phi \subseteq \mathbb{E}$. Consider the following implicit function:

$$f(x) = \min_y \{ \Phi(x, y) : (x, y) \in \mathrm{dom}\, \Phi \}. \qquad (5.1.17)$$

In order to simplify the situation, let us assume that for any $x$ such that the set $Q(x) = \{ y : (x, y) \in \mathrm{dom}\, \Phi \}$ is nonempty, this set does not contain a straight line. Then simple conditions, such as boundedness of $\Phi$ from below, guarantee the existence of a unique solution $y(x)$ of the optimization problem in (5.1.17) (see Sect. 5.1.5).
In any case, let us assume that the point $y(x)$ exists. Then it is characterized by the first-order optimality condition:

$$\nabla_y \Phi(x, y(x)) = 0. \qquad (5.1.18)$$

Moreover, by Theorem 3.1.25 and Lemma 3.1.10, we have

$$\nabla f(x) = \nabla_x \Phi(x, y(x)). \qquad (5.1.19)$$

Let us compute the Hessian of the function $f$. Differentiating equation (5.1.18) along direction $h \in \mathbb{E}_1$, we get

$$\nabla_{yx}^2 \Phi(x, y(x))h + \nabla_{yy}^2 \Phi(x, y(x))\, y'(x)h = 0.$$

Therefore, differentiating equality (5.1.19) along direction $h$, we obtain

$$\nabla^2 f(x)h = \nabla_{xx}^2 \Phi(x, y(x))h + \nabla_{xy}^2 \Phi(x, y(x))\, y'(x)h = \nabla_{xx}^2 \Phi(x, y(x))h - \nabla_{xy}^2 \Phi(x, y(x)) [\nabla_{yy}^2 \Phi(x, y(x))]^{-1} \nabla_{yx}^2 \Phi(x, y(x))h. \qquad (5.1.20)$$
Theorem 5.1.11 Let $\Phi$ be a self-concordant function. Then the function $f$ defined by (5.1.17) is also self-concordant, with constant $M_\Phi$.
Proof Let us fix $\bar{x} \in \mathrm{dom}\, f$. Define $\bar{z} = (\bar{x}, \bar{y})$ with $\bar{y} = y(\bar{x})$, and let $x \in \mathrm{dom}\, f$. Then, with $z = (x, y)$, we have

$$f(x) = \min_{y \in Q(x)} \Phi(x, y) \overset{(5.1.14)}{\geq} \min_{y \in Q(x)} \left[ \Phi(\bar{x}, y(\bar{x})) + \langle \nabla \Phi(\bar{z}), z - \bar{z} \rangle + \frac{1}{M_\Phi^2}\, \omega(M_\Phi \|z - \bar{z}\|_{\bar{z}}) \right] \overset{(5.1.19)}{=} f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle_{\mathbb{E}_1} + \frac{1}{M_\Phi^2}\, \omega\!\left( M_\Phi \min_{y \in Q(x)} \|z - \bar{z}\|_{\bar{z}} \right).$$

It remains to compute the minimum in the last line. Let $h = x - \bar{x}$. Then

$$\min_{y \in Q(x)} \|z - \bar{z}\|_{\bar{z}}^2 = \langle \nabla_{xx}^2 \Phi(\bar{z})h, h \rangle_{\mathbb{E}_1} + \min_{y \in Q(x)} \left[ 2 \langle \nabla_{xy}^2 \Phi(\bar{z})(y - \bar{y}), h \rangle_{\mathbb{E}_1} + \langle \nabla_{yy}^2 \Phi(\bar{z})(y - \bar{y}), y - \bar{y} \rangle_{\mathbb{E}_2} \right]$$
$$\geq \langle \nabla_{xx}^2 \Phi(\bar{z})h, h \rangle_{\mathbb{E}_1} + \min_{\delta \in \mathbb{E}_2} \left[ 2 \langle \nabla_{xy}^2 \Phi(\bar{z})\delta, h \rangle_{\mathbb{E}_1} + \langle \nabla_{yy}^2 \Phi(\bar{z})\delta, \delta \rangle_{\mathbb{E}_2} \right]$$
$$= \langle \nabla_{xx}^2 \Phi(\bar{z})h, h \rangle_{\mathbb{E}_1} - \langle [\nabla_{yy}^2 \Phi(\bar{z})]^{-1} \nabla_{yx}^2 \Phi(\bar{z})h, \nabla_{yx}^2 \Phi(\bar{z})h \rangle_{\mathbb{E}_1} \overset{(5.1.20)}{=} \langle \nabla^2 f(\bar{x})h, h \rangle.$$

It remains to apply Theorem 5.1.10. $\square$



Let us prove two more inequalities. From now on, we assume that $\mathrm{dom}\, f$ contains no straight lines. In this case, in view of Theorem 5.1.6, all Hessians $\nabla^2 f(x)$ with $x \in \mathrm{dom}\, f$ are nondegenerate. Denote by

$$\|g\|_x^* = \langle g, [\nabla^2 f(x)]^{-1} g \rangle^{1/2}, \quad g \in \mathbb{E}^*,$$

the dual local norm. Clearly, $|\langle g, h \rangle| \leq \|g\|_x^* \cdot \|h\|_x$.


Theorem 5.1.12 For any $x$ and $y$ from $\mathrm{dom}\, f$, we have

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\, \omega(M_f \|\nabla f(y) - \nabla f(x)\|_y^*). \qquad (5.1.21)$$

If in addition $\|\nabla f(y) - \nabla f(x)\|_y^* < \frac{1}{M_f}$, then

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\, \omega_*(M_f \|\nabla f(y) - \nabla f(x)\|_y^*). \qquad (5.1.22)$$

Proof Let us fix arbitrary points $x$ and $y$ from $\mathrm{dom}\, f$, and consider the function

$$\phi(z) = f(z) - \langle \nabla f(x), z \rangle, \quad z \in \mathrm{dom}\, f.$$

Note that this function is self-concordant and $\nabla \phi(x) = 0$. Therefore, using inequality (5.1.16), we get

$$f(x) - \langle \nabla f(x), x \rangle = \phi(x) = \min_{z \in \mathrm{dom}\, f} \phi(z)$$
$$\leq \min_z \left\{ \phi(y) + \langle \nabla \phi(y), z - y \rangle + \frac{1}{M_f^2}\, \omega_*(M_f \|z - y\|_y) \;:\; \|z - y\|_y < \frac{1}{M_f} \right\}$$
$$= \min_{0 \leq \tau < 1} \left[ \phi(y) - \frac{\tau}{M_f}\, \|\nabla \phi(y)\|_y^* + \frac{1}{M_f^2}\, \omega_*(\tau) \right] = \phi(y) - \frac{1}{M_f^2}\, \omega(M_f \|\nabla \phi(y)\|_y^*)$$
$$= f(y) - \langle \nabla f(x), y \rangle - \frac{1}{M_f^2}\, \omega(M_f \|\nabla f(y) - \nabla f(x)\|_y^*),$$

and this is inequality (5.1.21). In order to prove inequality (5.1.22), we use a similar reasoning based on inequality (5.1.14). $\square$
All the theorems above are written in terms of two auxiliary univariate functions,

$$\omega(t) = t - \ln(1 + t), \qquad \omega_*(\tau) = -\tau - \ln(1 - \tau).$$

Note that

$$\omega'(t) = \frac{t}{1 + t} \geq 0, \quad \omega''(t) = \frac{1}{(1 + t)^2} > 0, \qquad \omega_*'(\tau) = \frac{\tau}{1 - \tau} \geq 0, \quad \omega_*''(\tau) = \frac{1}{(1 - \tau)^2} > 0.$$

Therefore, $\omega(\cdot)$ and $\omega_*(\cdot)$ are convex functions. In what follows, we often use different relations between these objects. Let us provide them with a formal justification.
Lemma 5.1.4 For any $t \geq 0$ and $\tau \in [0, 1)$, we have

$$\omega'(\omega_*'(\tau)) = \tau, \qquad \omega_*'(\omega'(t)) = t,$$
$$\omega(t) = \max_{0 \leq \xi < 1} [\xi t - \omega_*(\xi)], \qquad \omega_*(\tau) = \max_{\xi \geq 0} [\xi \tau - \omega(\xi)],$$
$$\omega(t) + \omega_*(\tau) \geq \tau t,$$
$$\omega_*(\tau) = \tau\, \omega_*'(\tau) - \omega(\omega_*'(\tau)), \qquad \omega(t) = t\, \omega'(t) - \omega_*(\omega'(t)).$$

We leave the proof of this lemma as an exercise for the reader. Note that the main reason for the above relations is that the functions $\omega(t)$ and $\omega_*(\tau)$ are Fenchel conjugate (see definition (3.1.27)).
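The relations of Lemma 5.1.4 are straightforward to confirm numerically; the following sketch (an illustration, not part of the book) checks the inverse-derivative identities, the Fenchel–Young inequality, and the equality cases:

```python
import math

def omega(t):     return t - math.log1p(t)     # omega(t)  = t - ln(1+t)
def omega_s(t):   return -t - math.log1p(-t)   # omega*(t) = -t - ln(1-t)
def d_omega(t):   return t / (1.0 + t)         # omega'(t)
def d_omega_s(t): return t / (1.0 - t)         # omega*'(t)

for t, tau in [(0.3, 0.2), (1.7, 0.9), (0.05, 0.6)]:
    # omega' and omega*' are mutually inverse:
    assert abs(d_omega(d_omega_s(tau)) - tau) < 1e-12
    assert abs(d_omega_s(d_omega(t)) - t) < 1e-12
    # Fenchel-Young inequality:
    assert omega(t) + omega_s(tau) >= tau * t - 1e-12
    # equality cases of the conjugacy:
    assert abs(omega_s(tau) - (tau * d_omega_s(tau) - omega(d_omega_s(tau)))) < 1e-12
    assert abs(omega(t) - (t * d_omega(t) - omega_s(d_omega(t)))) < 1e-12
print("Lemma 5.1.4 relations verified")
```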
Functions $\omega(\cdot)$ and $\omega_*(\cdot)$ will often be used for estimating the rate of growth of self-concordant functions. Sometimes, it is more convenient to replace them by appropriate lower and upper bounds.
Lemma 5.1.5 For any $t \geq 0$ we have

$$\frac{t^2}{2(1 + t)} \leq \frac{t^2}{2\left(1 + \frac{2}{3} t\right)} \leq \omega(t) \leq \frac{t^2}{2 + t}, \qquad (5.1.23)$$

and for $t \in [0, 1)$,

$$\frac{t^2}{2 - t} \leq \omega_*(t) \leq \frac{t^2}{2(1 - t)}. \qquad (5.1.24)$$

Proof Let $\psi_1(t) = \frac{t^2}{2(1 + \frac{2}{3} t)}$. Note that $\psi_1(0) = \omega(0) = 0$. At the same time,

$$\psi_1'(t) = \frac{t}{1 + \frac{2}{3} t} - \frac{t^2}{3\left(1 + \frac{2}{3} t\right)^2} = \frac{t(3 + t)}{3\left(1 + \frac{2}{3} t\right)^2} \leq \frac{t}{1 + t} = \omega'(t).$$

Similarly, for $\psi_2(t) = \frac{t^2}{2 + t}$, we have

$$\psi_2'(t) = \frac{2t}{2 + t} - \frac{t^2}{(2 + t)^2} = \frac{4t + t^2}{(2 + t)^2} \geq \frac{t}{1 + t} = \omega'(t).$$

For the second pair of inequalities, let $\psi_3(t) = \frac{t^2}{2 - t}$ and $\psi_4(t) = \frac{t^2}{2(1 - t)}$. Then

$$\psi_3'(t) = \frac{2t}{2 - t} + \frac{t^2}{(2 - t)^2} = \frac{4t - t^2}{(2 - t)^2} \leq \frac{t}{1 - t},$$
$$\psi_4'(t) = \frac{t}{1 - t} + \frac{t^2}{2(1 - t)^2} = \frac{2t - t^2}{2(1 - t)^2} \geq \frac{t}{1 - t}.$$

Since $\frac{t}{1 - t} = \omega_*'(t)$ and $\omega_*(0) = \psi_3(0) = \psi_4(0) = 0$, we get (5.1.24) by integration. $\square$
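These bounds can be confirmed on a grid; the following sketch (not from the book) checks (5.1.23) and (5.1.24) numerically:

```python
import math

def omega(t):   return t - math.log1p(t)
def omega_s(t): return -t - math.log1p(-t)

# (5.1.23):  t^2/(2(1+t)) <= t^2/(2(1+2t/3)) <= omega(t) <= t^2/(2+t)   for t >= 0
for i in range(1, 500):
    t = 0.02 * i                       # grid on (0, 10]
    assert t**2 / (2 * (1 + t)) <= t**2 / (2 * (1 + 2 * t / 3)) + 1e-15
    assert t**2 / (2 * (1 + 2 * t / 3)) <= omega(t) <= t**2 / (2 + t)

# (5.1.24):  t^2/(2-t) <= omega*(t) <= t^2/(2(1-t))   for t in [0, 1)
for i in range(1, 100):
    t = 0.01 * i                       # grid on (0, 1)
    assert t**2 / (2 - t) <= omega_s(t) <= t**2 / (2 * (1 - t))
print("bounds (5.1.23) and (5.1.24) hold on the grid")
```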


5.1.5 Self-Concordance and Fenchel Duality

Let us start with some preliminary results. Consider the following minimization problem:

$$\min \{ f(x) \mid x \in \mathrm{dom}\, f \}, \qquad (5.1.25)$$

where we assume that $f$ is self-concordant and all Hessians $\nabla^2 f(x)$, $x \in \mathrm{dom}\, f$, are positive definite. In view of Theorem 5.1.6, this can be derived from the fact that $\mathrm{dom}\, f$ contains no straight lines. Alternatively, we can assume that $f$ is strongly convex.
Define

$$\lambda_f(x) = \langle \nabla f(x), [\nabla^2 f(x)]^{-1} \nabla f(x) \rangle^{1/2}.$$

We call $\lambda_f(x) = \|\nabla f(x)\|_x^*$ the local norm of the gradient $\nabla f(x)$.$^4$

4 Sometimes λf (x) is called the Newton decrement of the function f at x.



The next theorem provides us with a sufficient condition for the existence of a solution to problem (5.1.25).
Theorem 5.1.13 Let $\lambda_f(x) < \frac{1}{M_f}$ for some $x \in \mathrm{dom}\, f$. Then there exists a unique solution $x_f^*$ of problem (5.1.25), and

$$f(x) - f(x_f^*) \leq \frac{1}{M_f^2}\, \omega_*(M_f \lambda_f(x)). \qquad (5.1.26)$$

Proof Indeed, in view of (5.1.14), for any $y \in \mathrm{dom}\, f$ we have

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\, \omega(M_f \|y - x\|_x) \geq f(x) - \lambda_f(x) \cdot \|y - x\|_x + \frac{1}{M_f^2}\, \omega(M_f \|y - x\|_x)$$
$$= f(x) + \left( \frac{1}{M_f} - \lambda_f(x) \right) \|y - x\|_x - \frac{1}{M_f^2}\, \ln(1 + M_f \|y - x\|_x).$$

Thus, the level set $\mathcal{L}_f(f(x))$ is bounded, and therefore $x_f^*$ exists. It is unique since, in view of (5.1.14), for all $y \in \mathrm{dom}\, f$ we have

$$f(y) \geq f(x_f^*) + \frac{1}{M_f^2}\, \omega(M_f \|y - x_f^*\|_{x_f^*}).$$

Finally, applying (5.1.22) with $x = x_f^*$ and $y = x$, we get inequality (5.1.26). $\square$



Thus, we have proved that the local condition $\lambda_f(x) < \frac{1}{M_f}$ provides us with some global information about the function $f$, namely, the existence of the minimizer $x_f^*$. Note that the result of Theorem 5.1.13 cannot be strengthened.
Example 5.1.2 Let us fix some $\epsilon > 0$. Consider the function of one variable

$$f_\epsilon(x) = \epsilon x - \ln x, \quad x > 0.$$

This function is self-concordant in view of Example 5.1.1 and Corollary 5.1.2. Note that

$$\nabla f_\epsilon(x) = \epsilon - \frac{1}{x}, \qquad \nabla^2 f_\epsilon(x) = \frac{1}{x^2}.$$

Therefore $\lambda_{f_\epsilon}(x) = |1 - \epsilon x|$. Thus, for $\epsilon = 0$ we have $\lambda_{f_0}(x) = 1$ for any $x > 0$. Note that the function $f_0$ is not bounded below.
If $\epsilon > 0$, then $x_{f_\epsilon}^* = \frac{1}{\epsilon}$. However, we can guarantee the existence of this point by collecting information at the point $x = 1$, even if $\epsilon$ is arbitrarily small. $\square$

Theorem 5.1.13 has several important consequences. One of them is called the Theorem on the Recession Direction. Note that for its validity we do not need the assumption that all Hessians of the function $f$ are positive definite.
Theorem 5.1.14 Let $h \in \mathbb{E}$ be a recession direction of the self-concordant function $f$: for any $x \in \mathrm{dom}\, f$ we have

$$\langle \nabla f(x), h \rangle \leq 0,$$

and there exists a $\tau = \tau(x)$ such that $x - \tau h \in \partial\, \mathrm{dom}\, f$. Then

$$\langle \nabla^2 f(x)h, h \rangle^{1/2} \leq M_f \langle -\nabla f(x), h \rangle, \quad x \in \mathrm{dom}\, f. \qquad (5.1.27)$$

Proof Let us fix an arbitrary $x \in \mathrm{dom}\, f$ and consider the univariate function $\phi(\tau) = f(x + \tau h)$. This function is self-concordant and $0 \in \mathrm{dom}\, \phi$. As $\mathrm{dom}\, \phi$ contains no straight line, by Theorem 5.1.6 we have $\phi''(\tau) > 0$ for all $\tau \in \mathrm{dom}\, \phi$. Therefore, we must have

$$\lambda_\phi^2(0) \equiv \frac{\langle \nabla f(x), h \rangle^2}{\langle \nabla^2 f(x)h, h \rangle} \geq \frac{1}{M_f^2},$$

since otherwise, by Theorem 5.1.13, the minimum of $\phi(\cdot)$ would exist. Thus,

$$\langle \nabla f(x), h \rangle^2 \geq \frac{1}{M_f^2}\, \langle \nabla^2 f(x)h, h \rangle,$$

and we get (5.1.27) by taking into account the sign of the first derivative. $\square$

Let us now consider the scheme of the Damped Newton's Method.

Damped Newton's Method

0. Choose $x_0 \in \mathrm{dom}\, f$.
1. Iterate $\; x_{k+1} = x_k - \frac{1}{1 + M_f \lambda_f(x_k)}\, [\nabla^2 f(x_k)]^{-1} \nabla f(x_k), \quad k \geq 0. \qquad (5.1.28)$
Theorem 5.1.15 For any $k \geq 0$, we have

$$f(x_{k+1}) \leq f(x_k) - \frac{1}{M_f^2}\, \omega(M_f \lambda_f(x_k)). \qquad (5.1.29)$$

Proof Let $\lambda = \lambda_f(x_k)$. Then $\|x_{k+1} - x_k\|_{x_k} = \frac{\lambda}{1 + M_f \lambda} = \frac{1}{M_f}\, \omega'(M_f \lambda)$. Therefore, in view of (5.1.16) and Lemma 5.1.4, we have

$$f(x_{k+1}) \leq f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{1}{M_f^2}\, \omega_*(M_f \|x_{k+1} - x_k\|_{x_k})$$
$$= f(x_k) - \frac{\lambda^2}{1 + M_f \lambda} + \frac{1}{M_f^2}\, \omega_*(\omega'(M_f \lambda)) = f(x_k) - \frac{\lambda}{M_f}\, \omega'(M_f \lambda) + \frac{1}{M_f^2}\, \omega_*(\omega'(M_f \lambda)) = f(x_k) - \frac{1}{M_f^2}\, \omega(M_f \lambda). \qquad \square$$

Thus, for all $x \in \mathrm{dom}\, f$ with $\lambda_f(x) \geq \beta > 0$, one step of the Damped Newton's Method decreases the value of the function $f(\cdot)$ at least by the constant $\frac{1}{M_f^2}\, \omega(M_f \beta) > 0$. Note that the result of Theorem 5.1.15 is global. In Sect. 5.2 it will be used to obtain a global efficiency bound for the process. For now, however, we employ it to prove an existence theorem. Recall that we assume that $\mathrm{dom}\, f$ contains no straight line.
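The guaranteed decrease (5.1.29) is easy to observe experimentally. The sketch below (an illustration, not the book's code) runs the Damped Newton's Method on the separable function $f(x) = \sum_i (x^{(i)} - \ln x^{(i)})$, which is self-concordant with $M_f = 1$; for it $\lambda_f(x) = \|x - \bar{1}\|_2$ and the minimizer is $x = (1, \ldots, 1)$:

```python
import math

def f(x):    return sum(xi - math.log(xi) for xi in x)
def grad(x): return [1.0 - 1.0 / xi for xi in x]
def lam(x):  # lambda_f(x) = <g, H^{-1} g>^{1/2};  here H = diag(1/x_i^2)
    return math.sqrt(sum((g * xi) ** 2 for g, xi in zip(grad(x), x)))

def damped_newton_step(x, Mf=1.0):
    """One step of scheme (5.1.28): x+ = x - [H(x)]^{-1} g(x) / (1 + M_f lambda_f(x))."""
    l = lam(x)
    return [xi - xi ** 2 * g / (1.0 + Mf * l) for xi, g in zip(x, grad(x))]

def omega(t): return t - math.log1p(t)

x = [10.0, 0.05, 3.0]
for _ in range(50):
    l = lam(x)
    x_new = damped_newton_step(x)
    # guaranteed decrease (5.1.29) with M_f = 1:  f(x+) <= f(x) - omega(lambda_f(x))
    assert f(x_new) <= f(x) - omega(l) + 1e-10
    x = x_new
print(lam(x) < 1e-6)  # True: the iterates converge to (1, 1, 1)
```

Note that the iterates stay inside the positive orthant automatically, as guaranteed by Theorem 5.1.5.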
Theorem 5.1.16 Let a self-concordant function $f$ be bounded below. Then it attains its minimum at a single point.
Proof Indeed, assume that $f(x) \geq f^*$ for all $x \in \mathrm{dom}\, f$. Let us start the process (5.1.28) from some $x_0 \in \mathrm{dom}\, f$. If the number of steps of this method exceeds $M_f^2 (f(x_0) - f^*)/\omega(1)$, then in view of (5.1.29) we must get a point $x_k$ with $\lambda_f(x_k) < \frac{1}{M_f}$. However, by Theorem 5.1.13 this implies the existence of the point $x_f^*$. It is unique since all Hessians of the function $f$ are nondegenerate. $\square$

Now we can introduce the Fenchel dual of a self-concordant function $f$ (sometimes called the conjugate function, or dual function, of $f$). For $s \in \mathbb{E}^*$, the value of this function is defined as follows:

$$f_*(s) = \sup_{x \in \mathrm{dom}\, f} [\langle s, x \rangle - f(x)]. \qquad (5.1.30)$$

Clearly, $\mathrm{dom}\, f_* = \{ s \in \mathbb{E}^* : f(x) - \langle s, x \rangle \text{ is bounded below on } \mathrm{dom}\, f \}$.
Lemma 5.1.6 The function $f_*$ is a closed convex function with nonempty open domain. Moreover, $\mathrm{dom}\, f_* = \{ \nabla f(x) : x \in \mathrm{dom}\, f \}$.
Proof Indeed, for any $\bar{x} \in \mathrm{dom}\, f$, we have $\nabla f(\bar{x}) \in \mathrm{dom}\, f_*$. On the other hand, if $s \in \mathrm{dom}\, f_*$, then $f(x) - \langle s, x \rangle$ is bounded below. Hence, by Theorem 5.1.16 and the first-order optimality condition, there exists an $x \in \mathrm{dom}\, f$ such that $s = \nabla f(x)$.
Further, the epigraph of the function $f_*$ is an intersection of the half-spaces

$$\{ (s, \tau) \in \mathbb{E}^* \times \mathbb{R} : \tau \geq \langle s, x \rangle - f(x) \}, \quad x \in \mathrm{dom}\, f,$$

which are closed and convex. Therefore, the epigraph of $f_*$ is also closed and convex.
Suppose that for $s_1$ and $s_2$ from $\mathrm{dom}\, f_*$ we have

$$f(x) - \langle s_1, x \rangle \geq f_1^*, \qquad f(x) - \langle s_2, x \rangle \geq f_2^*$$

for all $x \in \mathrm{dom}\, f$. Then, for any $\alpha \in [0, 1]$,

$$f(x) - \langle \alpha s_1 + (1 - \alpha)s_2, x \rangle = \alpha (f(x) - \langle s_1, x \rangle) + (1 - \alpha)(f(x) - \langle s_2, x \rangle) \geq \alpha f_1^* + (1 - \alpha) f_2^*, \quad x \in \mathrm{dom}\, f.$$

Thus, $\alpha s_1 + (1 - \alpha)s_2 \in \mathrm{dom}\, f_*$.
Finally, let $s \in \mathrm{dom}\, f_*$. Denote by $x(s) \in \mathrm{dom}\, f$ the unique solution of the equation

$$s = \nabla f(x(s)).$$

Let $\delta \in \mathbb{E}^*$ be small enough: $\|\delta\|_{x(s)}^* < \frac{1}{M_f}$. Consider the function

$$f_\delta(x) = f(x) - \langle s + \delta, x \rangle.$$

Then $\nabla f_\delta(x(s)) = \nabla f(x(s)) - s - \delta = -\delta$. Therefore, $\lambda_{f_\delta}(x(s)) = \|\delta\|_{x(s)}^* < \frac{1}{M_f}$. Thus, in view of Theorem 5.1.13, the function $f_\delta$ attains its minimum. Consequently, $s + \delta \in \mathrm{dom}\, f_*$, and we conclude that $s$ is an interior point of $\mathrm{dom}\, f_*$. $\square$

Example 5.1.3 Note that in general, the structure of the set $\{ \nabla f(x) : x \in \mathrm{dom}\, f \}$ can be quite complicated. Consider the function

$$f(x) = \frac{1}{2} \cdot \frac{(x^{(2)})^2}{x^{(1)}}, \qquad \mathrm{dom}\, f = \{ x \in \mathbb{R}^2 : x^{(1)} > 0 \} \cup \{0\}, \qquad f(0) = 0.$$

In Example 3.1.2(5) we have seen that this is a closed convex function. However,

$$\nabla f(x) = \left( -\frac{1}{2} \left( \frac{x^{(2)}}{x^{(1)}} \right)^2,\; \frac{x^{(2)}}{x^{(1)}} \right), \quad x \neq 0, \qquad \nabla f(0) = 0.$$

Thus, $\{ \nabla f(x) : x \in \mathrm{dom}\, f \} = \{ g \in \mathbb{R}^2 : g^{(1)} = -\frac{1}{2} (g^{(2)})^2 \}$. $\square$



Let us now look at the derivatives of the function f∗ . Since f is self-concordant,
for any s ∈ dom f∗ , the supremum in (5.1.30) is attained (see Theorem 5.1.16).

Define

$$x(s) = \arg\max_{x \in \mathrm{dom}\, f} [\langle s, x \rangle - f(x)].$$

Thus,

$$\nabla f(x(s)) = s. \qquad (5.1.31)$$

In view of Lemma 3.1.14, we have $x(s) \in \partial f_*(s)$. On the other hand, for $s_1$ and $s_2$ from $\mathrm{dom}\, f_*$ we have

$$\frac{\|x(s_1) - x(s_2)\|_{x(s_1)}^2}{1 + M_f \|x(s_1) - x(s_2)\|_{x(s_1)}} \overset{(5.1.13)}{\leq} \langle \nabla f(x(s_1)) - \nabla f(x(s_2)), x(s_1) - x(s_2) \rangle \overset{(5.1.31)}{=} \langle s_1 - s_2, x(s_1) - x(s_2) \rangle \leq \|s_1 - s_2\|_{x(s_1)}^*\, \|x(s_1) - x(s_2)\|_{x(s_1)}.$$

Thus, $x(s)$ is a continuous function of $s$, and by Lemma 3.1.10 we conclude that

$$\nabla f_*(s) = x(s). \qquad (5.1.32)$$

Let us differentiate the identities (5.1.31) and (5.1.32) along a direction $h \in \mathbb{E}^*$:

$$\nabla^2 f(x(s))\, x'(s)h = h, \qquad \nabla^2 f_*(s)h = x'(s)h.$$

Thus,

$$\nabla^2 f_*(s) = [\nabla^2 f(x(s))]^{-1}, \quad s \in \mathrm{dom}\, f_*. \qquad (5.1.33)$$

In other words, if $s = \nabla f(x)$, then

$$\nabla^2 f_*(s) = [\nabla^2 f(x)]^{-1}, \quad x \in \mathrm{dom}\, f. \qquad (5.1.34)$$

Let us compute the third derivative of the dual function $f_*$ along a direction $h \in \mathbb{E}^*$, using the representation (5.1.33):

$$D^3 f_*(s)[h] = \lim_{\alpha \to 0} \frac{1}{\alpha} \left( [\nabla^2 f(x(s + \alpha h))]^{-1} - [\nabla^2 f(x(s))]^{-1} \right)$$
$$= \lim_{\alpha \to 0} \frac{1}{\alpha}\, [\nabla^2 f(x(s))]^{-1} \left( \nabla^2 f(x(s)) - \nabla^2 f(x(s + \alpha h)) \right) [\nabla^2 f(x(s + \alpha h))]^{-1}$$
$$= -[\nabla^2 f(x(s))]^{-1}\, D^3 f(x(s))[x'(s)h]\, [\nabla^2 f(x(s))]^{-1}.$$



Thus, we have proved the following representation:

$$D^3 f_*(s)[h] = \nabla^2 f_*(s)\, D^3 f(x(s))\!\left[ -\nabla^2 f_*(s)h \right] \nabla^2 f_*(s), \qquad (5.1.35)$$

which is valid for all $s \in \mathrm{dom}\, f_*$ and $h \in \mathbb{E}^*$. Now we can prove our main statement.
Theorem 5.1.17 The function $f_*$ is self-concordant with $M_{f_*} = M_f$.
Proof Indeed, in view of Lemma 5.1.6, $f_*$ is a closed convex function with open domain. Further, for any $s \in \mathrm{dom}\, f_*$ and $h \in \mathbb{E}^*$ we have

$$\|\nabla^2 f_*(s)h\|_{x(s)}^2 \overset{(5.1.33)}{=} \langle h, \nabla^2 f_*(s)h \rangle \overset{\mathrm{def}}{=} r^2.$$

Therefore, in view of (5.1.35),

$$D^3 f_*(s)[h] \overset{(5.1.6)}{\preceq} 2 M_f r\, \nabla^2 f_*(s)\, \nabla^2 f(x(s))\, \nabla^2 f_*(s) \overset{(5.1.33)}{=} 2 M_f r\, \nabla^2 f_*(s).$$

It remains to use Corollary 5.1.1. $\square$



As an example of application of Theorem 5.1.17, let us prove the following result.
Lemma 5.1.7 Let $x, y \in \mathrm{dom}\, f$ and $d = \|\nabla f(x) - \nabla f(y)\|_x^* < \frac{1}{M_f}$. Then

$$(1 - M_f d)^2\, \nabla^2 f(x) \preceq \nabla^2 f(y) \preceq \frac{1}{(1 - M_f d)^2}\, \nabla^2 f(x). \qquad (5.1.36)$$

Proof Let $u = \nabla f(x)$ and $v = \nabla f(y)$. In view of Lemma 5.1.6, both points belong to $\mathrm{dom}\, f_*$. Note that

$$d^2 = (\|\nabla f(x) - \nabla f(y)\|_x^*)^2 = \langle u - v, \nabla^2 f_*(u)(u - v) \rangle.$$

Since $f_*$ is self-concordant with constant $M_f$, by Theorem 5.1.7 we have

$$(1 - M_f d)^2\, \nabla^2 f_*(u) \preceq \nabla^2 f_*(v) \preceq \frac{1}{(1 - M_f d)^2}\, \nabla^2 f_*(u).$$

In view of (5.1.33), this is exactly (5.1.36). $\square$
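For $f(x) = -\ln x$ all the dual objects are available in closed form: $f_*(s) = -1 - \ln(-s)$ on $\mathrm{dom}\, f_* = \{ s < 0 \}$, $x(s) = -1/s$, and $\nabla^2 f_*(s) = 1/s^2 = [\nabla^2 f(x(s))]^{-1}$. The sketch below (an illustration, not from the book) checks (5.1.30), (5.1.31) and (5.1.33)/(5.1.34) on this example:

```python
import math

# f(x) = -ln x (self-concordant, M_f = 1); its dual in closed form: f*(s) = -1 - ln(-s), s < 0
f        = lambda x: -math.log(x)
df       = lambda x: -1.0 / x
d2f      = lambda x: 1.0 / x ** 2
f_star   = lambda s: -1.0 - math.log(-s)
x_of_s   = lambda s: -1.0 / s          # the maximizer x(s) in (5.1.30)
d2f_star = lambda s: 1.0 / s ** 2

for s in (-0.1, -1.0, -7.5):
    x = x_of_s(s)
    assert abs(df(x) - s) < 1e-12                      # (5.1.31): grad f(x(s)) = s
    assert abs(f_star(s) - (s * x - f(x))) < 1e-12     # definition (5.1.30) at the maximizer
    assert abs(d2f_star(s) - 1.0 / d2f(x)) < 1e-12     # (5.1.33)/(5.1.34)
print("dual identities verified")
```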



Remark 5.1.2 Some results on self-concordant functions have a more natural dual interpretation. Let us look at the statement of Theorem 5.1.13. Since the function $f_*$ is self-concordant, for any $\bar{s} \in \mathrm{dom}\, f_*$ the ellipsoid

$$W_*^0(\bar{s}) = \left\{ s \in \mathbb{E}^* : \langle s - \bar{s}, \nabla^2 f_*(\bar{s})(s - \bar{s}) \rangle < \tfrac{1}{M_f^2} \right\}$$

belongs to $\mathrm{dom}\, f_*$. Note that for $\bar{s} = \nabla f(x)$, in view of (5.1.33), the condition $\lambda_f(x) < \frac{1}{M_f}$ is equivalent to

$$\langle \bar{s}, \nabla^2 f_*(\bar{s})\bar{s} \rangle < \tfrac{1}{M_f^2}.$$

This guarantees that $0 \in W_*^0(\bar{s})$. Consequently, $0 \in \mathrm{dom}\, f_*$, which means that the function $f$ is bounded below. $\square$

5.2 Minimizing Self-concordant Functions

(Local convergence of different variants of Newton's Method; Path-following method; Minimization of strongly convex functions.)

5.2.1 Local Convergence of Newton’s Methods

In this section, we are going to study the complexity of solving the problem (5.1.25)
by different optimization strategies. Let us look first at different variants of Newton’s
Method.

Variants of Newton’s Method

0. Choose x0 ∈ dom f.
1. For k ≥ 0, iterate

xk+1 = xk − −1
1+ξk [∇ f (xk )] ∇f (xk ),
1 2
(5.2.1)

where ξk is chosen in one of the following ways:


(A) ξk = 0 (this is the Standard Newton’s Method),
(B) ξk = Mf λk (this is the Damped Newton’s Method (5.1.28)),
Mf2 λ2k
(C) ξk = 1+Mf λk (this is the Intermediate Newton’s Method),
where λk = λf (xk ).

We call method $(5.2.1)_C$ intermediate since for large $\lambda_k$ it is close to variant B, while for small values of $\lambda_k$ it is very close to variant A. Note, however, that its step size is always larger than the step size of variant B, which was obtained by minimizing an upper bound on the self-concordant function (see the proof of Theorem 5.1.15). Nevertheless, method $(5.2.1)_C$ ensures a monotone decrease of the value of the objective function in problem (5.1.25).
Lemma 5.2.1 Let the points $\{x_k\}_{k \geq 0}$ be generated by method $(5.2.1)_C$. Then, for any $k \geq 0$ we have

$$f(x_k) - f(x_{k+1}) \geq \frac{\lambda_k^2}{2(1 + M_f \lambda_k + M_f^2 \lambda_k^2)} + \frac{M_f \lambda_k^3}{2(1 + M_f \lambda_k)(3 + 2 M_f \lambda_k)}. \qquad (5.2.2)$$

Proof Indeed, in view of inequality (5.1.16), we have

$$f(x_{k+1}) \leq f(x_k) - \frac{\lambda_k^2}{1 + \xi_k} + \frac{1}{M_f^2}\, \omega_*\!\left( \frac{M_f \lambda_k}{1 + \xi_k} \right) = f(x_k) - \frac{\lambda_k^2 (1 + M_f \lambda_k)}{1 + M_f \lambda_k + M_f^2 \lambda_k^2} + \frac{1}{M_f^2} \left[ -\frac{M_f \lambda_k (1 + M_f \lambda_k)}{1 + M_f \lambda_k + M_f^2 \lambda_k^2} + \ln\left( 1 + M_f \lambda_k + M_f^2 \lambda_k^2 \right) \right].$$

Defining $\tau_k = M_f \lambda_k$, so that $\xi_k = \frac{\tau_k^2}{1 + \tau_k}$, we have

$$\frac{\tau_k (1 + \tau_k)^2}{1 + \tau_k + \tau_k^2} - \ln\left( 1 + \tau_k + \tau_k^2 \right) = \frac{\tau_k (1 + \tau_k)^2}{1 + \tau_k + \tau_k^2} - \tau_k + \omega(\tau_k) - \ln\left( 1 + \frac{\tau_k^2}{1 + \tau_k} \right)$$
$$\overset{(5.1.23)}{\geq} \frac{\tau_k^2}{1 + \tau_k + \tau_k^2} + \frac{\tau_k^2}{2\left(1 + \frac{2}{3} \tau_k\right)} - \ln\left( 1 + \frac{\tau_k^2}{1 + \tau_k} \right) = \frac{\tau_k^2}{2\left(1 + \frac{2}{3} \tau_k\right)} - \xi_k + \frac{\xi_k}{1 + \xi_k} + \omega(\xi_k).$$

It remains to note that

$$\frac{\tau_k^2}{2\left(1 + \frac{2}{3} \tau_k\right)} - \frac{1}{2}\, \xi_k = \frac{\tau_k^2}{2} \left[ \frac{1}{1 + \frac{2}{3} \tau_k} - \frac{1}{1 + \tau_k} \right] = \frac{\tau_k^3}{2(1 + \tau_k)(3 + 2\tau_k)},$$

and, by (5.1.23), $-\frac{1}{2} \xi_k + \omega(\xi_k) \geq -\frac{1}{2} \xi_k + \frac{\xi_k^2}{2(1 + \xi_k)} = -\frac{\xi_k}{2(1 + \xi_k)}$, while $\frac{\xi_k}{1 + \xi_k} - \frac{\xi_k}{2(1 + \xi_k)} = \frac{\xi_k}{2(1 + \xi_k)} = \frac{\tau_k^2}{2(1 + \tau_k + \tau_k^2)}$. $\square$

Let us now describe the local convergence of the different variants of Newton's Method. Note that we can measure the convergence of these schemes in four different ways: by the functional gap $f(x_k) - f(x_f^*)$, by the local norm of the gradient $\lambda_f(x_k) = \|\nabla f(x_k)\|_{x_k}^*$, by the local distance to the minimum $\|x_k - x_f^*\|_{x_k}$, or, finally, by the distance to the minimum in the fixed metric

$$r_*(x_k) = \|x_k - x_f^*\|_{x_f^*},$$

defined by the minimizer itself. Let us prove that locally all these measures are equivalent.

Theorem 5.2.1 Let $\lambda_f(x) < \frac{1}{M_f}$. Then

$$\omega(M_f \lambda_f(x)) \leq M_f^2 (f(x) - f(x_f^*)) \leq \omega_*(M_f \lambda_f(x)), \qquad (5.2.3)$$
$$\omega'(M_f \lambda_f(x)) \leq M_f \|x - x_f^*\|_x \leq \omega_*'(M_f \lambda_f(x)), \qquad (5.2.4)$$
$$\omega(M_f r_*(x)) \leq M_f^2 (f(x) - f(x_f^*)) \leq \omega_*(M_f r_*(x)), \qquad (5.2.5)$$

where the last inequality is valid for $r_*(x) < \frac{1}{M_f}$.
Proof Let $r = \|x - x_f^*\|_x$ and $\lambda = \lambda_f(x)$. Inequalities (5.2.3) follow from Theorem 5.1.12. Further, in view of (5.1.13), we have

$$\frac{r^2}{1 + M_f r} \leq \langle \nabla f(x), x - x_f^* \rangle \leq \lambda r.$$

Applying the function $\omega_*'(\cdot)$ to both sides of the inequality $\frac{M_f r}{1 + M_f r} \leq M_f \lambda$, we get the right-hand side of inequality (5.2.4). If $r \geq \frac{1}{M_f}$, then the left-hand side of this inequality is trivial. Suppose that $r < \frac{1}{M_f}$. Then $\nabla f(x) = G(x - x_f^*)$ with

$$G = \int_0^1 \nabla^2 f(x_f^* + \tau(x - x_f^*))\, d\tau \succ 0,$$

and $\lambda_f^2(x) = \langle G [\nabla^2 f(x)]^{-1} G (x - x_f^*), x - x_f^* \rangle$. Let us introduce in $\mathbb{E}$ a canonical basis. Then all self-adjoint operators from $\mathbb{E}$ to $\mathbb{E}^*$ can be represented by symmetric matrices (we do not change the existing notation). Define

$$H = \nabla^2 f(x), \qquad S = H^{-1/2} G H^{-1} G H^{-1/2} = \left( H^{-1/2} G H^{-1/2} \right)^2 \overset{\mathrm{def}}{=} P^2 \succeq 0.$$

Then $\|H^{1/2}(x - x_f^*)\|_2 = \|x - x_f^*\|_x = r$, where $\|\cdot\|_2$ is the standard Euclidean norm, and

$$\lambda_f(x) = \langle H^{1/2} S H^{1/2}(x - x_f^*), x - x_f^* \rangle^{1/2} \leq \|P\|_2\, \|H^{1/2}(x - x_f^*)\|_2 = \|P\|_2\, r.$$

In view of Corollary 5.1.5 (see Remark 5.1.1), we have

$$G \preceq \frac{1}{1 - M_f r}\, H.$$

Therefore, $\|P\|_2 \leq \frac{1}{1 - M_f r}$, and we conclude that

$$M_f \lambda_f(x) \leq \frac{M_f r}{1 - M_f r} = \omega_*'(M_f r).$$

Applying the function $\omega'(\cdot)$ to both sides of this inequality, we get the remaining part of (5.2.4). Finally, inequalities (5.2.5) follow from (5.1.14) and (5.1.16). $\square$

We are going to estimate the local rate of convergence of the different variants of Newton's Method (5.2.1) in terms of $\lambda_f(\cdot)$, the local norm of the gradient.
Theorem 5.2.2 Let $x \in \mathrm{dom}\, f$ and $\lambda = \lambda_f(x)$.
1. If $\lambda < \frac{1}{M_f}$ and the point $x_+$ is generated by variant A of method (5.2.1), then $x_+ \in \mathrm{dom}\, f$ and

$$\lambda_f(x_+) \leq \frac{M_f \lambda^2}{(1 - M_f \lambda)^2}. \qquad (5.2.6)$$

2. If the point $x_+$ is generated by variant B of method (5.2.1), then $x_+ \in \mathrm{dom}\, f$ and

$$\lambda_f(x_+) \leq M_f \lambda^2 \left( 1 + \frac{1}{1 + M_f \lambda} \right). \qquad (5.2.7)$$

3. If $M_f \lambda + M_f^2 \lambda^2 + M_f^3 \lambda^3 \leq 1$ and the point $x_+$ is generated by method $(5.2.1)_C$, then $x_+ \in \mathrm{dom}\, f$ and

$$\lambda_f(x_+) \leq M_f \lambda^2 \left( 1 + M_f \lambda + \frac{M_f \lambda}{1 + M_f \lambda + M_f^2 \lambda^2} \right) \leq M_f \lambda^2 \left( 1 + 2 M_f \lambda \right). \qquad (5.2.8)$$
Proof Let $h = x_+ - x$ and $r = \|h\|_x$. Then $r = \frac{\lambda}{1 + \xi}$. Note that for all variants of method (5.2.1) we have $M_f \lambda < 1 + \xi$. Therefore, in all cases $M_f r < 1$ and $x_+ \in \mathrm{dom}\, f$ (see Theorem 5.1.5). Hence, in view of Theorem 5.1.7, we have

$$\lambda_f(x_+) = \langle \nabla f(x_+), [\nabla^2 f(x_+)]^{-1} \nabla f(x_+) \rangle^{1/2} \leq \frac{1}{1 - M_f r}\, \|\nabla f(x_+)\|_x^*.$$

Further, by (5.2.1),

$$\nabla f(x_+) = \nabla f(x) + \int_0^1 \nabla^2 f(x + \tau h)h\, d\tau = G h,$$

where $G = \int_0^1 \left[ \nabla^2 f(x + \tau h) - (1 + \xi)\nabla^2 f(x) \right] d\tau$. As in the proof of Theorem 5.2.1, let us pass to matrices. Define

$$H = \nabla^2 f(x), \qquad S = H^{-1/2} G H^{-1} G H^{-1/2} \overset{\mathrm{def}}{=} P^2,$$

where $P = H^{-1/2} G H^{-1/2}$. Then $\|H^{1/2} h\|_2 = \|h\|_x = r$, and

$$\|\nabla f(x_+)\|_x^* = \langle G h, H^{-1} G h \rangle^{1/2} = \langle H^{1/2} S H^{1/2} h, h \rangle^{1/2} \leq \|P\|_2\, r.$$

In view of Corollary 5.1.5,

$$-\left( \xi + M_f r - \tfrac{1}{3} M_f^2 r^2 \right) H \preceq G \preceq \left( \frac{1}{1 - M_f r} - (1 + \xi) \right) H.$$

Therefore, $\|P\|_2 \leq \max\left\{ \frac{M_f r}{1 - M_f r} - \xi,\; M_f r + \xi \right\}$.
For variant A, $\xi = 0$. Thus, $r = \lambda$, and we get $\|P\|_2 \leq \frac{M_f \lambda}{1 - M_f \lambda}$. Therefore,

$$\lambda_f(x_+) \leq \frac{\|P\|_2\, \lambda}{1 - M_f \lambda} \leq \frac{M_f \lambda^2}{(1 - M_f \lambda)^2}.$$

For variant B, $\xi = M_f \lambda$. Therefore, $r = \frac{\lambda}{1 + M_f \lambda}$, and we get $\|P\|_2 \leq M_f \lambda + \frac{M_f \lambda}{1 + M_f \lambda}$. Consequently,

$$\lambda_f(x_+) \leq \frac{r\, \|P\|_2}{1 - M_f r} \leq M_f \lambda^2 \left( 1 + \frac{1}{1 + M_f \lambda} \right).$$

Finally, for variant C, $\xi = \frac{M_f^2 \lambda^2}{1 + M_f \lambda}$. Then $r = \frac{\lambda (1 + M_f \lambda)}{1 + M_f \lambda + M_f^2 \lambda^2}$, and we have

$$\frac{M_f r}{1 - M_f r} - M_f r - \xi = \frac{M_f^2 r^2}{1 - M_f r} - \xi = \frac{M_f^2 \lambda^2 (1 + M_f \lambda)^2}{1 + M_f \lambda + M_f^2 \lambda^2} - \frac{M_f^2 \lambda^2}{1 + M_f \lambda} = \frac{M_f^2 \lambda^2 \left( 2 M_f \lambda + 2 M_f^2 \lambda^2 + M_f^3 \lambda^3 \right)}{(1 + M_f \lambda + M_f^2 \lambda^2)(1 + M_f \lambda)} = \frac{\xi \left( 2 M_f \lambda + 2 M_f^2 \lambda^2 + M_f^3 \lambda^3 \right)}{1 + M_f \lambda + M_f^2 \lambda^2} \leq \xi$$

in view of the condition of this item. Hence,

$$\lambda_f(x_+) \leq \frac{r\, \|P\|_2}{1 - M_f r} \leq \frac{r}{1 - M_f r}\, (M_f r + \xi) = \lambda (1 + M_f \lambda) \left[ \frac{M_f \lambda (1 + M_f \lambda)}{1 + M_f \lambda + M_f^2 \lambda^2} + \frac{M_f^2 \lambda^2}{1 + M_f \lambda} \right]$$
$$= M_f \lambda^2 \left[ \frac{(1 + M_f \lambda)^2}{1 + M_f \lambda + M_f^2 \lambda^2} + M_f \lambda \right] = M_f \lambda^2 \left( 1 + M_f \lambda + \frac{M_f \lambda}{1 + M_f \lambda + M_f^2 \lambda^2} \right). \qquad \square$$

Among all the convergence estimates described in Theorem 5.2.2, the estimate (5.2.8) looks the most attractive. It provides us with the following description of the region of quadratic convergence of method $(5.2.1)_C$:

$$Q_f \overset{\mathrm{def}}{=} \left\{ x \in \mathrm{dom}\, f : \lambda_f(x) < \frac{1}{2 M_f} \right\}. \qquad (5.2.9)$$

In this case, we can guarantee that $\lambda_f(x_+) < \lambda_f(x)$, and then the quadratic convergence starts (see (5.2.8)). Thus, our results lead to the following strategy for solving the initial problem (5.1.25).
• First stage: $\lambda_f(x_k) \geq \frac{1}{2 M_f}$. At this stage we apply the Damped Newton's Method (5.1.28). At each iteration of this method, we have

$$f(x_{k+1}) \leq f(x_k) - \frac{1}{M_f^2}\, \omega\!\left( \tfrac{1}{2} \right).$$

Thus, the number of steps of this stage is bounded as follows:

$$N \leq M_f^2 \left[ f(x_0) - f(x_f^*) \right] / \omega\!\left( \tfrac{1}{2} \right). \qquad (5.2.10)$$

• Second stage: $\lambda_f(x_k) < \frac{1}{2 M_f}$. At this stage, we apply method $(5.2.1)_C$. This process converges quadratically:

$$\lambda_f(x_{k+1}) \leq M_f \lambda_f^2(x_k) \left( 1 + 2 M_f \lambda_f(x_k) \right) < \lambda_f(x_k).$$

Since the quadratic convergence is very fast, the main efforts in the above strategy are spent at the first stage. The estimate (5.2.10) shows that the length of this stage is $O(\Delta_f(x_0))$, where

$$\Delta_f(x_0) \overset{\mathrm{def}}{=} M_f^2 \left[ f(x_0) - f(x_f^*) \right]. \qquad (5.2.11)$$
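The two-stage strategy above can be sketched as follows on the separable test function $f(x) = \sum_i (x^{(i)} - \ln x^{(i)})$ ($M_f = 1$, $\lambda_f(x) = \|x - \bar{1}\|_2$); the code is an illustration, not the book's:

```python
import math

def grad(x): return [1.0 - 1.0 / xi for xi in x]
def lam(x):  return math.sqrt(sum(((1.0 - 1.0 / xi) * xi) ** 2 for xi in x))

def newton_iterate(x, xi_k):
    """x+ = x - [H]^{-1} g / (1 + xi_k); for f = sum(x_i - ln x_i), [H]^{-1}g = x_i^2 g_i."""
    return [v - v ** 2 * g / (1.0 + xi_k) for v, g in zip(x, grad(x))]

Mf = 1.0
x = [25.0, 0.01]
stage1 = 0
# First stage: damped steps (variant B) until we enter Q_f = {lambda_f < 1/(2 M_f)}.
while lam(x) >= 0.5 / Mf:
    x = newton_iterate(x, Mf * lam(x))                      # xi_k = M_f lambda_k
    stage1 += 1
# Second stage: intermediate steps (variant C); quadratic decrease of lambda_f, cf. (5.2.8).
lams = [lam(x)]
for _ in range(10):
    l = lam(x)
    x = newton_iterate(x, (Mf * l) ** 2 / (1.0 + Mf * l))   # xi_k = M_f^2 l^2 / (1 + M_f l)
    lams.append(lam(x))
    assert lams[-1] <= Mf * l ** 2 * (1 + 2 * Mf * l) + 1e-12   # bound (5.2.8)
print(stage1 >= 1, lams[-1] < 1e-9)  # True True
```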

Is it possible to reach the region of quadratic convergence in a faster way? In order to


answer this question, let us consider an alternative way to solve the problem (5.1.25),
based on a path-following scheme. In Sect. 5.3 we will see how we can use this idea
for solving a constrained minimization problem.

5.2.2 Path-Following Scheme

Assume that we have $y_0 \in \mathrm{dom}\, f$. Let us define an auxiliary central path

$$y(t) = \arg\min_{y \in \mathrm{dom}\, f} \left\{ \psi(t; y) \overset{\mathrm{def}}{=} f(y) - t \langle \nabla f(y_0), y \rangle \right\}, \quad t \in [0, 1]. \qquad (5.2.12)$$

This minimization problem corresponds to the computation of the value of the dual function $-f_*(s)$ with $s = t \nabla f(y_0)$ (see (5.1.30)). Note that $\nabla f(y_0) \in \mathrm{dom}\, f_*$, and the origin of the dual space also belongs to $\mathrm{dom}\, f_*$ since problem (5.1.25) is solvable. Therefore, in view of Lemma 5.1.6,

$$t \nabla f(y_0) \in \mathrm{dom}\, f_*, \quad 0 \leq t \leq 1,$$

and trajectory (5.2.12) is well defined.
We are going to follow the auxiliary central path with the parameter $t$ changing from one to zero, by updating points satisfying the approximate centering condition

$$\lambda_{\psi(t;\cdot)}(y) \overset{\mathrm{def}}{=} \|\nabla f(y) - t \nabla f(y_0)\|_y^* \leq \frac{\beta}{M_f}, \qquad (5.2.13)$$

where the centering parameter $\beta$ is small enough. Note that the function $\psi(t; \cdot)$ is self-concordant with constant $M_f$ and domain $\mathrm{dom}\, f$ (see Corollary 5.1.2).
Consider the following iterate:

$$(t_+, y_+) = P_\gamma(t, y):\qquad t_+ = t - \frac{\gamma}{M_f \|\nabla f(y_0)\|_y^*}, \qquad y_+ = y - \frac{[\nabla^2 f(y)]^{-1} \left( \nabla f(y) - t_+ \nabla f(y_0) \right)}{1 + \xi}, \qquad (5.2.14)$$

where $\xi = \frac{M_f^2 \lambda^2}{1 + M_f \lambda}$ and $\lambda = \lambda_{\psi(t;\cdot)}(y)$ (this is one iteration of method $(5.2.1)_C$). For future use, we allow the parameter $\gamma$ in (5.2.14) to be either positive or negative.
Lemma 5.2.2 Let the pair $(t, y)$ satisfy (5.2.13) with $\beta = \tau^2 \left( 1 + \tau + \frac{\tau}{1 + \tau + \tau^2} \right)$, where $\tau \leq \frac{1}{2}$. Then the pair $(t_+, y_+)$ satisfies the same condition for $\gamma$ small enough, namely

$$|\gamma| \leq \tau - \tau^2 \left( 1 + \tau + \frac{\tau}{1 + \tau + \tau^2} \right). \qquad (5.2.15)$$

Proof Let $\lambda = \|\nabla f(y) - t \nabla f(y_0)\|_y^* \leq \frac{\beta}{M_f}$, $\lambda_1 = \|\nabla f(y) - t_+ \nabla f(y_0)\|_y^*$, and $\lambda_+ = \|\nabla f(y_+) - t_+ \nabla f(y_0)\|_{y_+}^*$. Then

$$\lambda_1 \leq \lambda + \frac{|\gamma|}{M_f} \leq \frac{1}{M_f}\, (\beta + |\gamma|) \overset{(5.2.15)}{\leq} \frac{\tau}{M_f}.$$

Hence,

$$\lambda_+ \overset{(5.2.8)}{\leq} M_f \lambda_1^2 \left( 1 + M_f \lambda_1 + \frac{M_f \lambda_1}{1 + M_f \lambda_1 + M_f^2 \lambda_1^2} \right) \leq \frac{\tau^2}{M_f} \left( 1 + \tau + \frac{\tau}{1 + \tau + \tau^2} \right) = \frac{\beta}{M_f}. \qquad \square$$

Let us derive from this fact a complexity bound of the path-following scheme as
applied to problem (5.1.25).

Theorem 5.2.3 Consider the following process:

$$t_0 = 1, \quad y_0 \in \mathrm{dom}\, f, \qquad (t_{k+1}, y_{k+1}) = P_\gamma(t_k, y_k), \quad k \geq 0, \qquad (5.2.16)$$

where $\gamma = \gamma(\tau) = \tau - \beta$, $\beta = \beta(\tau) = \tau^2 \left( 1 + \tau + \frac{\tau}{1 + \tau + \tau^2} \right)$, and $\tau \leq 0.23$. Then

$$\lambda_k \overset{\mathrm{def}}{=} \|\nabla f(y_k) - t_k \nabla f(y_0)\|_{y_k}^* \leq \frac{\beta}{M_f}, \quad k \geq 0. \qquad (5.2.17)$$

Assume that $\lambda_f(y_k) \geq \frac{1}{2 M_f}$ for all $k = 0, \ldots, N$. Then

$$t_N \leq \exp\left\{ -\frac{\gamma\, \kappa(\tau)\, N^2}{\Delta_f(y_0)} \right\}, \qquad (5.2.18)$$

where $\kappa(\tau) = \frac{(\tau - 3\beta)(1 + \beta)}{2(1 + \beta + \beta^2)}$.

Proof Since $\lambda_0 = 0 < \frac{\beta}{M_f}$, by Lemma 5.2.2 we prove by induction that inequality (5.2.17) is valid for all $k \geq 0$. Let $c = -\nabla f(y_0)$. Note that

$$y_k - y_{k+1} \overset{(5.2.14)}{=} \frac{1}{1 + \xi_k}\, [\nabla^2 f(y_k)]^{-1} \left( t_k c + \nabla f(y_k) - \frac{\gamma c}{M_f \|c\|_{y_k}^*} \right), \qquad (5.2.19)$$

where $\xi_k = \frac{M_f^2 \lambda_k^2}{1 + M_f \lambda_k}$. Therefore,

$$r_k \overset{\mathrm{def}}{=} \|y_k - y_{k+1}\|_{y_k} \leq \frac{\lambda_k}{1 + \xi_k} + \frac{\gamma}{M_f (1 + \xi_k)} = \frac{\gamma + M_f \lambda_k}{M_f (1 + \xi_k)} \overset{(5.2.17)}{\leq} \frac{\tau}{M_f}. \qquad (5.2.20)$$

Further,

$$t_{k+1} \overset{(5.2.14)}{=} t_k - \frac{\gamma}{M_f \|c\|_{y_k}^*} = t_k \left( 1 - \frac{\gamma}{M_f t_k \|c\|_{y_k}^*} \right) \leq t_k \exp\left\{ -\frac{\gamma}{M_f t_k \|c\|_{y_k}^*} \right\}.$$

Thus, $t_N \leq \exp\left\{ -\frac{\gamma}{M_f}\, S_N \right\}$, where $S_N = \sum\limits_{k=0}^{N-1} \frac{1}{t_k \|c\|_{y_k}^*}$. Let us estimate this value from below.
Since $\frac{\beta^2}{M_f^2} \overset{(5.2.17)}{\geq} \lambda_k^2 = \lambda_f^2(y_k) + 2 t_k \langle \nabla f(y_k), [\nabla^2 f(y_k)]^{-1} c \rangle + t_k^2 (\|c\|_{y_k}^*)^2$, we have

$$-\langle \nabla f(y_k), [\nabla^2 f(y_k)]^{-1} c \rangle \geq \frac{1}{2 t_k} \left( \lambda_f^2(y_k) + t_k^2 (\|c\|_{y_k}^*)^2 - \frac{\beta^2}{M_f^2} \right). \qquad (5.2.21)$$

Therefore,

$$f(y_k) - f(y_{k+1}) \overset{(5.1.16)}{\geq} \langle \nabla f(y_k), y_k - y_{k+1} \rangle - \frac{1}{M_f^2}\, \omega_*(M_f r_k)$$
$$\overset{(5.2.19)}{=} \frac{1}{1 + \xi_k} \left\langle \nabla f(y_k), [\nabla^2 f(y_k)]^{-1} \left( t_k c + \nabla f(y_k) - \frac{\gamma c}{M_f \|c\|_{y_k}^*} \right) \right\rangle - \frac{1}{M_f^2}\, \omega_*(M_f r_k)$$
$$= \frac{\lambda_k^2}{1 + \xi_k} - \frac{t_k}{1 + \xi_k} \left\langle c, [\nabla^2 f(y_k)]^{-1} (t_k c + \nabla f(y_k)) \right\rangle - \frac{\gamma}{M_f \|c\|_{y_k}^* (1 + \xi_k)} \left\langle \nabla f(y_k), [\nabla^2 f(y_k)]^{-1} c \right\rangle - \frac{1}{M_f^2}\, \omega_*(M_f r_k)$$
$$\overset{(5.2.17)}{\geq} \frac{\lambda_k^2 - t_k \|c\|_{y_k}^* \lambda_k}{1 + \xi_k} - \frac{\gamma}{M_f \|c\|_{y_k}^* (1 + \xi_k)} \left\langle \nabla f(y_k), [\nabla^2 f(y_k)]^{-1} c \right\rangle - \frac{1}{M_f^2}\, \omega_*(M_f r_k)$$
$$\overset{(5.2.21)}{\geq} \frac{\lambda_k^2 - t_k \|c\|_{y_k}^* \lambda_k}{1 + \xi_k} + \frac{\gamma}{2 M_f t_k \|c\|_{y_k}^* (1 + \xi_k)} \left( \lambda_f^2(y_k) + t_k^2 (\|c\|_{y_k}^*)^2 - \frac{\beta^2}{M_f^2} \right) - \frac{1}{M_f^2}\, \omega_*(M_f r_k)$$
$$\overset{(5.2.20)}{\geq} \frac{\gamma - 2 M_f \lambda_k}{2 M_f (1 + \xi_k)}\, t_k \|c\|_{y_k}^* + \rho_k,$$

where $\rho_k = \frac{\gamma}{2 M_f t_k \|c\|_{y_k}^* (1 + \xi_k)} \left( \lambda_f^2(y_k) - \frac{\beta^2}{M_f^2} \right) - \frac{1}{M_f^2}\, \omega_*(\tau)$.
Our next goal is to show that $\rho_k \geq 0$. Note that $t_k \|c\|_{y_k}^* \overset{(5.2.17)}{\leq} \lambda_f(y_k) + \frac{\beta}{M_f}$. Since $\lambda_f(y_k) \geq \frac{1}{2 M_f}$, we have

$$\rho_k \geq \frac{\gamma}{2 M_f (1 + \xi_k)} \left( \lambda_f(y_k) - \frac{\beta}{M_f} \right) - \frac{1}{M_f^2}\, \omega_*(\tau) \geq \frac{\gamma (1 - 2\beta)}{4 M_f^2 (1 + \xi_k)} - \frac{1}{M_f^2}\, \omega_*(\tau) \overset{(5.2.17)}{\geq} \frac{1}{M_f^2} \left( \frac{\gamma (1 - 2\beta)(1 + \beta)}{4 (1 + \beta + \beta^2)} - \omega_*(\tau) \right).$$

Note that $\gamma = O(\tau)$, $\beta = O(\tau^2)$, and $\omega_*(\tau) = O(\tau^2)$. Therefore, for $\tau$ small enough we have $\rho_k \geq 0$. By numerical evaluation, it is easy to check that this can be achieved by taking $\tau \leq 0.23$.
Further,

$$\frac{\gamma - 2 M_f \lambda_k}{2 (1 + \xi_k)} \overset{(5.2.17)}{\geq} \frac{(\gamma - 2\beta)(1 + \beta)}{2 (1 + \beta + \beta^2)} = \frac{(\tau - 3\beta)(1 + \beta)}{2 (1 + \beta + \beta^2)} \overset{\mathrm{def}}{=} \kappa(\tau).$$

Again, it is easy to check that $\kappa(\tau) > 0$ for $\tau \in (0, 0.23]$. Thus, we have proved that $f(y_k) - f(y_{k+1}) \geq \frac{\kappa(\tau)}{M_f}\, t_k \|c\|_{y_k}^*$. Therefore,

$$S_N \geq \sum_{k=0}^{N-1} \frac{\kappa(\tau)}{M_f (f(y_k) - f(y_{k+1}))} \geq \frac{\kappa(\tau)\, \Lambda_*(N)}{M_f (f(y_0) - f(y_N))},$$

where $\Lambda_*(N) = \min\limits_{\lambda \in \mathbb{R}_+^N} \left\{ \sum\limits_{i=1}^{N} \frac{1}{\lambda^{(i)}} : \sum\limits_{i=1}^{N} \lambda^{(i)} = 1 \right\} = N^2$. Since $M_f^2 (f(y_0) - f(y_N)) \leq \Delta_f(y_0)$, this gives (5.2.18). $\square$

Let us now estimate the number of iterations which are necessary for method (5.2.16) to enter the region of quadratic convergence Qf. Define

D = max_{x,y ∈ dom f} { ‖x − y‖_{y0} : f(x) ≤ f(y0), f(y) ≤ f(y0) }.

Theorem 5.2.4 Let the sequence {yk}_{k≥0} be generated by method (5.2.16). Then we have yN ∈ Qf for all

N ≥ ( Δf(x0)/(γ κ(τ)) )^{1/2} ln [ Mf D ω⁻¹(Δf(x0)) / ω((1−β)(1−2β)/2) ].    (5.2.22)
Proof Indeed,

f(y(tk)) − f∗ ≤(5.2.12) ⟨∇f(y(tk)), y(tk) − x∗⟩ = tk ⟨∇f(y0), y(tk) − x∗⟩ ≤ tk λf(y0) D.

Note that ω(Mf λf(y0)) ≤(5.1.29) Mf²(f(y0) − f∗) = Δf(y0). Thus,

(1/Mf²) ω(Mf λf(y(tk))) ≤(5.1.29) f(y(tk)) − f∗ ≤ (tk/Mf) ω⁻¹(Δf(y0)) D.

Since ‖∇f(yk) − ∇f(y(tk))‖*_{yk} =(5.2.12) ‖∇f(yk) − tk ∇f(y0)‖*_{yk} ≤(5.2.17) β/Mf, we have

λf(yk) ≤ tk ‖∇f(y0)‖*_{yk} + β/Mf = ⟨∇f(y(tk)), [∇²f(yk)]⁻¹∇f(y(tk))⟩^{1/2} + β/Mf

≤(5.1.36) (1/(1−β)) λf(y(tk)) + β/Mf.

Thus, the inclusion yk ∈ Qf is ensured by the inequality λf(y(tk)) ≤ (1−β)(1−2β)/(2Mf). Consequently, we need to ensure the inequality

(tk/Mf) ω⁻¹(Δf(x0)) D ≤ (1/Mf²) ω((1−β)(1−2β)/2).

It remains to use the estimate (5.2.18). □



As we can see from the estimate (5.2.22), up to a logarithmic factor, the number of iterations of the path-following scheme is proportional to Δf^{1/2}(y0). This is much better than the guarantee (5.2.10) obtained for the Damped Newton's Method (5.1.28). However, as we will see in Sect. 5.2.3, for some special subclasses of self-concordant functions the performance estimate (5.2.22) can be significantly improved.
From the practical point of view, reasonable values of the parameters for the path-following scheme (5.2.16) correspond to τ = 0.15. In this case, (1/(γ κ(τ)))^{1/2} ≤ 16.1.
Remark 5.2.1 The dual interpretation of the central path (5.2.12) is quite straight-
forward: it is just a straight line. We follow the primal image of the dual central
path

s(t) = t∇f (y0 ) ∈ dom f∗ , 0 ≤ t ≤ 1,

by generating points sk = ∇f(yk) in a small neighborhood of this trajectory:

⟨sk − s(tk), ∇²f∗(sk)(sk − s(tk))⟩ ≤(5.2.13) β²/Mf². □

5.2.3 Minimizing Strongly Convex Functions

Let B = B∗ ≻ 0 map E to E∗. Define the Euclidean metric

‖x‖₂ = ⟨Bx, x⟩^{1/2}, x ∈ E.

In this section, we consider the following minimization problem

min_{x∈E} f(x),    (5.2.23)

where f is a strongly convex function:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ½ σ₂(f) ‖y − x‖₂², x, y ∈ E,    (5.2.24)

where σ₂(f) > 0. We also assume that the function f belongs to C³(E) and its Hessian is Lipschitz continuous:

‖∇²f(x) − ∇²f(y)‖ ≤ L₃(f) ‖x − y‖₂, x, y ∈ E.    (5.2.25)

As we have seen in Example 5.1.1 (6), this function is self-concordant on E with the constant

Mf = L₃(f)/(2σ₂^{3/2}(f)).    (5.2.26)

Thus, problem (5.2.23) can be solved by methods (5.1.28) and (5.2.16). The corresponding complexity bounds can be given in terms of the complexity measure

Δf(x0) = Mf²(f(x0) − f∗) = (L₃²(f)/(4σ₂³(f))) (f(x0) − f∗).

As we have seen, the first method needs O(Δf(x0)) iterations. The complexity bound for the second scheme is of the order Õ(Δf^{1/2}(x0)), where Õ(·) denotes the hidden logarithmic factors. Let us show that for our particular subclass of self-concordant functions these bounds can be significantly improved.
We will do this by second-order methods based on cubic regularization of the Newton's Method (see Sect. 4.2). In view of (4.2.60), the region of quadratic convergence of the Cubic Newton's Method (4.2.33) in terms of the function value is defined as

Qf = { x ∈ E : f(x) − f∗ ≤ σ₂³(f)/(2L₃²(f)) = 1/(8Mf²) }.
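The identity in the description of Qf is elementary arithmetic: substituting Mf = L₃(f)/(2σ₂^{3/2}(f)) from (5.2.26) into 1/(8Mf²) recovers σ₂³(f)/(2L₃²(f)). A quick Python check (illustrative only, with arbitrary sample values):

```python
# Substituting Mf = L3 / (2 * sigma**1.5) from (5.2.26) into 1/(8*Mf**2)
# recovers sigma**3 / (2 * L3**2); check for a few sample (sigma, L3) pairs.
for sigma, L3 in [(1.0, 1.0), (0.5, 3.0), (2.0, 0.7)]:
    Mf = L3 / (2 * sigma ** 1.5)
    lhs = sigma ** 3 / (2 * L3 ** 2)
    rhs = 1 / (8 * Mf ** 2)
    assert abs(lhs - rhs) < 1e-12 * max(lhs, rhs)
```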

Let us check how many iterations we need to enter this region by different schemes
based on the cubic Newton step.
Assume that our method has the following rate of convergence:

f(xk) − f∗ ≤ cL₃(f)D³/k^p,

where c is an absolute constant, p > 0, and D = max_{x∈E} { ‖x − x∗‖₂ : f(x) ≤ f(x0) }.
Since f is strongly convex, for all x with f(x) ≤ f(x0) we have

½ σ₂(f) ‖x − x∗‖₂² ≤(5.2.24) f(x) − f∗ ≤ f(x0) − f∗.
Therefore,

f(xk) − f∗ ≤ (cL₃(f)/k^p) (2(f(x0) − f∗)/σ₂(f))^{3/2} =(5.2.26) (2^{5/2} c Mf/k^p) (f(x0) − f∗)^{3/2}.    (5.2.27)

Thus, we need O( (Mf³ (f(x0) − f∗)^{3/2})^{1/p} ) = O( Δf^{3/(2p)}(x0) ) iterations to enter the region of quadratic convergence Qf. For the Cubic Newton's Method (4.2.33) we have p = 2. Thus, it ensures complexity O(Δf^{3/4}(x0)). For the Accelerated Cubic Newton's Method (4.2.46) we have p = 3. Thus, it needs O(Δf^{1/2}(x0)) iterations (which is slightly better than (5.2.22)). However, note that for these methods there exists a powerful acceleration tool based on a restarting procedure.
Let us define kp as the first integer for which the right-hand side of inequality (5.2.27) is smaller than ½(f(x0) − f∗):

(2^{5/2} c Mf/kp^p) (f(x0) − f∗)^{3/2} ≤ ½ (f(x0) − f∗).

Clearly, kp = O( (Mf (f(x0) − f∗)^{1/2})^{1/p} ) = O( Δf^{1/(2p)}(x0) ). This value can be used in the following multi-stage scheme.

Multi-stage Acceleration Scheme

Set y0 = x0.    (5.2.28)
At the kth stage (k ≥ 1) the method starts from the point yk−1.
After tk = ⌈ kp/2^{(k−1)/(2p)} ⌉ steps it generates the output yk.
The method stops when yk ∈ Qf.

Theorem 5.2.5 The total number of stages T in the optimization strategy (5.2.28) satisfies the inequality

T ≤ 4 + log₂ Δf(x0).    (5.2.29)

The total number of lower-level iterations N in this scheme does not exceed

4 + log₂ Δf(x0) + (2^{1/(2p)}/(2^{1/(2p)} − 1)) kp.
Proof Let us prove by induction that f(yk) − f∗ ≤ (½)^k (f(y0) − f∗). For k = 0 this is true. Assume that this is also true for some k ≥ 0. Note that t^p_{k+1} ≥ (½)^{k/2} kp^p. Therefore,

(f(yk+1) − f∗)/(f(yk) − f∗) ≤ (2^{5/2} c Mf/t^p_{k+1}) (f(yk) − f∗)^{1/2} ≤ (kp^p/(2 t^p_{k+1})) ( (f(yk) − f∗)/(f(x0) − f∗) )^{1/2}

≤ ½ ( 2^k (f(yk) − f∗)/(f(x0) − f∗) )^{1/2} ≤ ½.

Thus, the total number of stages satisfies the inequality (½)^{T−1} (f(x0) − f∗) ≥ 1/(8Mf²). Finally,

N = Σ_{k=1}^{T} tk ≤ T + kp Σ_{k=0}^{T−1} (½)^{k/(2p)} ≤ T + kp Σ_{k=0}^{∞} (½)^{k/(2p)} = T + kp/(1 − (½)^{1/(2p)}). □
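Since the stage lengths tk = ⌈kp/2^{(k−1)/(2p)}⌉ decrease geometrically, the total work is dominated by the first stage. The following Python sketch (an added illustration, with arbitrary sample values of kp, p, and T) verifies the geometric-series bound from the theorem:

```python
import math

def stage_lengths(k_p, p, T):
    # t_k = ceil(k_p / 2**((k - 1)/(2p))) inner iterations at stage k = 1..T
    return [math.ceil(k_p / 2 ** ((k - 1) / (2 * p))) for k in range(1, T + 1)]

def iteration_bound(k_p, p, T):
    # T + k_p * 2**(1/(2p)) / (2**(1/(2p)) - 1): the geometric-series bound
    q = 2 ** (1 / (2 * p))
    return T + k_p * q / (q - 1)

for p in (2, 3, 3.5):
    for k_p in (1, 5, 40):
        for T in (1, 10, 30):
            N = sum(stage_lengths(k_p, p, T))
            assert N <= iteration_bound(k_p, p, T) + 1e-9
```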

Applying Theorem 5.2.5 to different second-order methods based on Cubic Regularization, we get the following complexity bounds.
• Cubic Newton's Method (4.2.33). For this method p = 2. Therefore, the complexity bound of this scheme, used in the framework of the multi-stage method (5.2.28), is of the order

O( Δf^{1/4}(x0) ).

In fact, this method does not need a restarting strategy. Thus, Theorem 5.2.5 provides the Cubic Newton Method with a better way of estimating its rate of convergence.
• Accelerated Cubic Newton's Method (4.2.46). For this method p = 3. Hence, the complexity bound of the corresponding multi-stage scheme (5.2.28) becomes

O( Δf^{1/6}(x0) ).

• Optimal second-order method (see Sect. 4.3.2). For this method p = 3.5. Therefore, the corresponding complexity bound is

Õ( Δf^{1/7}(x0) ).

However, note that this method includes an expensive line-search procedure. Consequently, its practical efficiency should be worse than the efficiency of the method from the previous item. Note that the theoretical gap in the complexity estimates of these methods is negligibly small, of the order of O( Δf^{1/42}(x0) ). For all reasonable values of the complexity measure Δf(x0), feasible for modern computers, it should be much smaller than the logarithmic factors coming from the line search.

5.3 Self-concordant Barriers

(Motivation; Definition of self-concordant barriers; Barriers related to self-concordant


functions; The implicit barrier theorem; Main properties; Standard minimization problems;
The central path; The path-following method; How to initialize the process? Problems with
functional constraints.)

5.3.1 Motivation

In the previous section, we have seen that the Newton’s Method is very efficient
in minimizing self-concordant functions. Such a function is always a barrier for
its domain. Let us check what can be proved about the Sequential Unconstrained
Minimization approach (Sect. 1.3.3) based on these barriers. From now on, we are
always working with standard self-concordant functions, which means that

Mf = 1. (5.3.1)

In what follows, we deal with constrained minimization problems of a special


type. Let Dom f = cl (dom f ).
Definition 5.3.1 A constrained minimization problem is called standard if it has
the following form:

min{⟨c, x⟩ | x ∈ Q},    (5.3.2)

where Q is a closed convex set. It is also assumed that we know a standard self-
concordant function f such that Dom f = Q.
Note that the assumption Mf = 1 is not binding since otherwise we can multiply
f by an appropriate constant (see Corollary 5.1.3).
Let us introduce a parametric family of penalty functions

f(t; x) = t⟨c, x⟩ + f(x)



with t ≥ 0. Note that f(t; x) is self-concordant in x (see Corollary 5.1.2). Define

x∗(t) = arg min_{x∈dom f} f(t; x).

This trajectory is called the central path of problem (5.3.2). We can expect that
x ∗ (t) → x ∗ as t → ∞ (see Sect. 1.3.3). Therefore, it should be a good idea to keep
our test points close to this trajectory.
Recall that the Newton's Method, as applied to the minimization of the function f(t; ·), has local quadratic convergence (Theorem 5.2.2). Our subsequent analysis is based on the Intermediate Newton Method (5.2.1), which has the following region of quadratic convergence:

λ_{f(t;·)}(x) ≤ β < ½.

Let us study our possibilities to move forward in t, assuming that we know exactly
x = x ∗ (t) for some t > 0.
Thus, we are going to increase t:

t+ = t + Δ, Δ > 0.

However, we need to keep x in the region of quadratic convergence of Newton’s


Method for the function f (t + Δ; ·):

λ_{f(t+Δ;·)}(x) ≤ β < ½.

Note that the update t → t+ does not change the Hessian of the barrier function:

∇ 2 f (t + Δ; x) = ∇ 2 f (t; x).

Therefore, it is easy to estimate how big the step Δ can be. Indeed, the first-order
optimality condition (1.2.4) provides us with the following central path equation:

tc + ∇f (x ∗ (t)) = 0. (5.3.3)

Since tc + ∇f (x) = 0, we obtain

λ_{f(t+Δ;·)}(x) =(5.3.3) ‖t₊c + ∇f(x)‖*_x = Δ ‖c‖*_x = (Δ/t) ‖∇f(x)‖*_x ≤ β.

Hence, if we want to increase t at some linear rate, we need to assume that the value

λf²(x) = (‖∇f(x)‖*_x)² ≡ ⟨∇f(x), [∇²f(x)]⁻¹∇f(x)⟩

is uniformly bounded on dom f . Without this assumption, we can have only a


sublinear rate of convergence of the process (see Sect. 5.2.2).
Thus, we come to a definition of a self-concordant barrier.

5.3.2 Definition of a Self-concordant Barrier

Definition 5.3.2 Let F(·) be a standard self-concordant function. We call it a ν-self-concordant barrier for the set Dom F, if

sup_{u∈E} [ 2⟨∇F(x), u⟩ − ⟨∇²F(x)u, u⟩ ] ≤ ν    (5.3.4)

for all x ∈ dom F. The value ν is called the parameter of the barrier.
Note that we do not assume ∇ 2 F (x) to be nondegenerate. However, if this is the
case, then inequality (5.3.4) is equivalent to the following:

⟨∇F(x), [∇²F(x)]⁻¹∇F(x)⟩ ≤ ν.    (5.3.5)

We will also use another equivalent form of inequality (5.3.4):

⟨∇F(x), u⟩² ≤ ν ⟨∇²F(x)u, u⟩ ∀u ∈ E.    (5.3.6)

(To see this for u with ⟨∇²F(x)u, u⟩ > 0, replace u in (5.3.4) by τu and find the maximum of the left-hand side in τ.) Note that condition (5.3.6) can be rewritten in matrix notation:

∇²F(x) ⪰ (1/ν) ∇F(x)∇F(x)^T.    (5.3.7)
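As a concrete illustration of the matrix form (5.3.7) (added here; the instance is a hypothetical sample), take F(x) = −ln x₁ − ln x₂, a barrier for the positive orthant whose parameter is ν = 2 by Example 5.3.1 and Theorem 5.3.2 below. The Python sketch checks that ∇²F(x) − (1/ν)∇F(x)∇F(x)^T is positive semidefinite at a few points, via the trace/determinant test for symmetric 2×2 matrices:

```python
# Hypothetical 2D illustration: F(x) = -ln x1 - ln x2, barrier parameter nu = 2.
# Check that M = Hess F(x) - (1/nu) grad F(x) grad F(x)^T is positive
# semidefinite, using the trace/determinant test for symmetric 2x2 matrices.
def psd_residual(x1, x2, nu=2.0):
    g = (-1.0 / x1, -1.0 / x2)                        # gradient of F
    h = ((1.0 / x1 ** 2, 0.0), (0.0, 1.0 / x2 ** 2))  # Hessian of F
    m = [[h[i][j] - g[i] * g[j] / nu for j in range(2)] for i in range(2)]
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return trace, det

for x1, x2 in [(0.1, 5.0), (1.0, 1.0), (7.0, 0.2)]:
    tr, det = psd_residual(x1, x2)
    # PSD: nonnegative trace and determinant; here det vanishes identically,
    # i.e. (5.3.7) is tight along one direction for this barrier
    assert tr > 0 and det > -1e-9
```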

Lemma 5.3.1 Let F be a ν-self-concordant barrier. Then for any p ≥ ν the function ξp(x) = exp(−F(x)/p) is concave on dom F. On the other hand, if the function ξν(·) is concave on dom F, then F is a ν-self-concordant barrier.
Proof Indeed, for any x ∈ dom F and h ∈ E, we have

⟨∇ξp(x), h⟩ = −(1/p) ⟨∇F(x), h⟩ ξp(x),

⟨∇²ξp(x)h, h⟩ = (1/p²) ⟨∇F(x), h⟩² ξp(x) − (1/p) ⟨∇²F(x)h, h⟩ ξp(x).

It remains to use definition (5.3.6). □



Note that condition (5.3.5) has an interesting dual interpretation. In view of relation (5.1.34), definition (5.3.5) is equivalent to the following condition:

⟨s, ∇²F∗(s)s⟩ ≤ ν, s ∈ dom F∗.    (5.3.8)

In other words, at any feasible s, the distance to the origin is proportional to the size of the unit Dikin ellipsoid, which describes an ellipsoidal neighborhood in dom F∗ with similar Hessians.

Let us now check which self-concordant functions presented in Example 5.1.1


are also self-concordant barriers.
Example 5.3.1
1. Linear function: f(x) = α + ⟨a, x⟩, dom f = E. Clearly, for a ≠ 0 this function is not a self-concordant barrier since ∇²f(x) = 0.
2. Convex quadratic function. Let A = A^T ≻ 0. Consider the function

f(x) = α + ⟨a, x⟩ + ½ ⟨Ax, x⟩, dom f = Rⁿ.

Then ∇f(x) = a + Ax and ∇²f(x) = A. Therefore,

⟨[∇²f(x)]⁻¹∇f(x), ∇f(x)⟩ = ⟨A⁻¹(Ax + a), Ax + a⟩ = ⟨Ax, x⟩ + 2⟨a, x⟩ + ⟨A⁻¹a, a⟩.

Clearly, this value is unbounded from above on Rn . Thus, a quadratic function is


not a self-concordant barrier.
3. Logarithmic barrier for a ray. Consider the following function of one variable:

F(x) = −ln x, dom F = {x ∈ R | x > 0}.

Then ∇F(x) = −1/x and ∇²F(x) = 1/x² > 0. Therefore,

(∇F(x))²/∇²F(x) = (1/x²) · x² = 1.

Thus, F (·) is a ν-self-concordant barrier for the set {x ≥ 0} with ν = 1.


4. Logarithmic barrier for a second-order region. Let A = A^T ⪰ 0. Consider the concave quadratic function

φ(x) = α + ⟨a, x⟩ − ½ ⟨Ax, x⟩.

Define F(x) = −ln φ(x), dom F = {x ∈ Rⁿ | φ(x) > 0}. Then

⟨∇F(x), u⟩ = −(1/φ(x)) [⟨a, u⟩ − ⟨Ax, u⟩],

⟨∇²F(x)u, u⟩ = (1/φ²(x)) [⟨a, u⟩ − ⟨Ax, u⟩]² + (1/φ(x)) ⟨Au, u⟩.

Let ω₁ = ⟨∇F(x), u⟩ and ω₂ = (1/φ(x)) ⟨Au, u⟩. Then

⟨∇²F(x)u, u⟩ = ω₁² + ω₂ ≥ ω₁².

Therefore, 2⟨∇F(x), u⟩ − ⟨∇²F(x)u, u⟩ ≤ 2ω₁ − ω₁² ≤ 1. Thus, F(·) is a ν-self-concordant barrier with ν = 1. □
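The one-dimensional computations in Items 3 and 4 are easy to verify numerically. The Python sketch below is an added illustration; the instance φ(x) = 1 − x²/2 (i.e., α = 1, a = 0, A = 1) is an arbitrary sample choice:

```python
# Item 3: F(x) = -ln x on x > 0; (F'(x))^2 / F''(x) equals 1 identically.
for x in (0.01, 1.0, 50.0):
    g, h = -1.0 / x, 1.0 / x ** 2
    assert abs(g * g / h - 1.0) < 1e-12

# Item 4 in one dimension, with sample data alpha = 1, a = 0, A = 1:
# phi(x) = 1 - x^2/2 on |x| < sqrt(2), F(x) = -ln phi(x).
def F_derivs(x):
    phi = 1 - 0.5 * x * x
    g = x / phi                      # F'(x) = -phi'(x)/phi(x)
    h = 1.0 / phi + (x / phi) ** 2   # F''(x) = A/phi + (phi'/phi)^2
    return g, h

# check 2 F'(x) u - F''(x) u^2 <= 1 over a grid of x and u
worst = max(2 * g * u - h * u * u
            for x in [i / 100.0 - 1.4 for i in range(281)]
            for g, h in [F_derivs(x)]
            for u in [j / 10.0 - 5.0 for j in range(101)])
assert worst <= 1.0 + 1e-9
```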
Let us now check the results of some simple operations with self-concordant
barriers.
Theorem 5.3.1 Let F(·) be a self-concordant barrier. Then the function ⟨c, x⟩ + F(x) is a standard self-concordant function on dom F.
Proof Since F (·) is a self-concordant function, we just apply Corollary 5.1.2.

Note that this property is important for path-following schemes.
Theorem 5.3.2 Let Fi be νi-self-concordant barriers, i = 1, 2. Then the function

F(x) = F₁(x) + F₂(x)

is a self-concordant barrier for the convex set Dom F = Dom F₁ ∩ Dom F₂ with the parameter ν = ν₁ + ν₂.
Proof In view of Theorem 5.1.1, F is a standard self-concordant function. Let us fix x ∈ dom F. Then

max_{u∈Rⁿ} [ 2⟨∇F(x), u⟩ − ⟨∇²F(x)u, u⟩ ]

= max_{u∈Rⁿ} [ 2⟨∇F₁(x), u⟩ − ⟨∇²F₁(x)u, u⟩ + 2⟨∇F₂(x), u⟩ − ⟨∇²F₂(x)u, u⟩ ]

≤ max_{u∈Rⁿ} [ 2⟨∇F₁(x), u⟩ − ⟨∇²F₁(x)u, u⟩ ] + max_{u∈Rⁿ} [ 2⟨∇F₂(x), u⟩ − ⟨∇²F₂(x)u, u⟩ ]

≤ ν₁ + ν₂. □

It is easy to see that the value of the parameter of a self-concordant barrier is


invariant with respect to an affine transformation of variables.
Theorem 5.3.3 Let A (x) = Ax + b be a linear operator, A : E → E1 . Assume
that function F is a ν-self-concordant barrier with Dom F ⊂ E1 . Then the function

Φ(x) = F (A (x))

is a ν-self-concordant barrier for the set Dom Φ = {x ∈ E : A (x) ∈ Dom F }.


Proof The function Φ(·) is a standard self-concordant function in view of Theorem 5.1.2. Let us fix x ∈ dom Φ. Then y = A(x) ∈ dom F. Note that for any u ∈ E we have

⟨∇Φ(x), u⟩ = ⟨∇F(y), Au⟩, ⟨∇²Φ(x)u, u⟩ = ⟨∇²F(y)Au, Au⟩.

Therefore,

max_{u∈E} [ 2⟨∇Φ(x), u⟩ − ⟨∇²Φ(x)u, u⟩ ] = max_{u∈E} [ 2⟨∇F(y), Au⟩ − ⟨∇²F(y)Au, Au⟩ ]

≤ max_{w∈E₁} [ 2⟨∇F(y), w⟩ − ⟨∇²F(y)w, w⟩ ] ≤ ν. □

To conclude this section, let us show how to construct self-concordant barriers for
the level sets of self-concordant functions and for the epigraphs of self-concordant
barriers.
Theorem 5.3.4 Let the function f be self-concordant with constant Mf ≥ 0. Suppose that the set

L(β) = {x ∈ dom f : f(x) ≤ β}

has nonempty interior and f(x) ≥ f∗ for all x ∈ dom f. Then the function

F(x) = −ν ln(β − f(x))

with any ν ≥ 1 + Mf²(β − f∗) is a ν-self-concordant barrier for the level set L(β).
Proof Let φ(x) = −ln(β − f(x)). In view of Theorem 5.1.4 and Corollary 5.1.3, the function F(x) = νφ(x) is a standard self-concordant function on dom f. On the other hand, for any h ∈ E we have

⟨∇F(x), h⟩² = ν² ⟨∇φ(x), h⟩² ≤(5.1.8) ν² ⟨∇²φ(x)h, h⟩ = ν ⟨∇²F(x)h, h⟩.

Thus, by definition (5.3.6), F is a ν-self-concordant barrier for L(β). □

Theorem 5.3.5 Let f be a ν-self-concordant barrier. Then the function

F (x, t) = f (x) − ln(t − f (x))

is a (ν + 1)-self-concordant barrier for the epigraph

Ef = {(x, t) ∈ dom f × R : t ≥ f (x)}.

Proof Let us fix a direction h ∈ E and δ ∈ R. Consider the function

φ(τ) = F(x + τh, t + τδ) = f(x + τh) − ln(t + τδ − f(x + τh)).

Let ω = t − f(x) and ω̂ = 1 + 1/ω. Then

φ'(0) = ⟨∇f(x), h⟩ + (1/ω)(⟨∇f(x), h⟩ − δ),

φ''(0) = ⟨∇²f(x)h, h⟩ + (1/ω²)(⟨∇f(x), h⟩ − δ)² + (1/ω)⟨∇²f(x)h, h⟩

= ω̂ ⟨∇²f(x)h, h⟩ + (1/ω²)(⟨∇f(x), h⟩ − δ)².

Define ξ = [ω̂ ⟨∇²f(x)h, h⟩]^{1/2} and λ = (1/ω)(⟨∇f(x), h⟩ − δ). Note that

φ'(0) ≤(5.3.6) √ν ⟨∇²f(x)h, h⟩^{1/2} + λ = ξ (ν/ω̂)^{1/2} + λ.

It remains to note that the maximum of the right-hand side of this inequality subject to the constraint ξ² + λ² = 1 is equal to (ν/ω̂ + 1)^{1/2} ≤ (ν + 1)^{1/2}. Thus, in view of definition (5.3.6), the parameter of the barrier F can be chosen as ν + 1.
Let us now estimate the third derivative of the function φ at zero, assuming that its second derivative is less than or equal to one. Note that

φ'''(0) = D³f(x)[h, h, h] + (2/ω³)(⟨∇f(x), h⟩ − δ)³ + (3/ω²)(⟨∇f(x), h⟩ − δ)⟨∇²f(x)h, h⟩ + (1/ω) D³f(x)[h, h, h]

≤(5.1.4) 2ω̂ ⟨∇²f(x)h, h⟩^{3/2} + (2/ω³)(⟨∇f(x), h⟩ − δ)³ + (3/ω²)(⟨∇f(x), h⟩ − δ)⟨∇²f(x)h, h⟩

= 2(ω/(1+ω))^{1/2} ξ³ + 2λ³ + 3(1/(1+ω)) ξ²λ = 2γξ³ + 2λ³ + 3(1 − γ²)ξ²λ,

where γ² = ω/(1+ω). We need to maximize the right-hand side of the above inequality subject to the constraints ξ² + λ² ≤ 1 and γ ∈ [0, 1]:

σ∗ = max_{γ,λ,ξ} { 2γξ³ + 2λ³ + 3(1 − γ²)ξ²λ : ξ² + λ² ≤ 1, 0 ≤ γ ≤ 1 }.

Let us maximize this objective in γ. From the first-order optimality condition for γ,

2ξ³ − 6γξ²λ = 0,

we have γ∗ = min{1, ξ/(3λ)}. Assume that ξ ≥ 3λ. Then γ∗ = 1 and we need to maximize 2ξ³ + 2λ³ with the constraints ξ² + λ² = 1 and ξ ≥ 3λ. Introducing new variables ξ̂ = ξ² and λ̂ = λ², we come to the problem

max_{ξ̂,λ̂≥0} { 2ξ̂^{3/2} + 2λ̂^{3/2} : ξ̂ + λ̂ ≤ 1, ξ̂ ≥ 9λ̂ }.

Its objective is convex. Hence, by inspecting the extreme points of its feasible set we find the optimal solution ξ̂∗ = 1, λ̂∗ = 0. Thus, the maximal value of this problem is two.
Assume now that ξ ≤ 3λ. Then γ∗ = ξ/(3λ) and we get the following objective:

2(ξ/(3λ))ξ³ + 2λ³ + 3(1 − ξ²/(9λ²))ξ²λ = ξ⁴/(3λ) + 2λ³ + 3ξ²λ.

Note that the maximum of this expression is attained at the boundary of the unit circle: ξ² + λ² = 1. Thus, we need to show that

(1 − λ²)²/(3λ) + 2λ³ + 3(1 − λ²)λ ≤ 2,

with the constraint 3λ ≥ (1 − λ²)^{1/2}, i.e. λ ≥ 1/√10. In other words, we need to prove that

p(λ) := (1 − λ²)² + 3λ(3λ − λ³) − 6λ ≤ 0, 1/√10 ≤ λ ≤ 1.

Note that p(λ) = (1 − λ)²(3 − 2(1 + λ)²) ≤ 0 for all λ ≥ √(3/2) − 1 = 1/(2 + √6), and this constant is smaller than our lower bound for λ: 1/√10 > 1/(2 + √6).
Thus, σ∗ ≤ 2, which means that F is a standard self-concordant function. □

Corollary 5.3.1 If f is a standard self-concordant function, then F is also a
standard self-concordant function with Dom F = Ef .
Finally, let us prove the Implicit Barrier Theorem. Let Φ be a ν-self-concordant
barrier for dom Φ ⊂ E. We partition the space as follows: E = E1 × E2 . Define

F(x) = min_y { Φ(x, y) : (x, y) ∈ dom Φ }.    (5.3.9)

We assume that for any x ∈ dom F ⊂ E1 the solution y(x) of this optimization
problem exists and is unique. Then, as we have seen in the proof of Theorem 5.1.11,

∇y Φ(x, y(x)) = 0, ∇x Φ(x, y(x)) = ∇F (x).

Theorem 5.3.6 The function F defined by (5.3.9) is a ν-self-concordant barrier.


Proof In view of Theorem 5.1.11, the function F is standard self-concordant. Let us fix x ∈ dom F. Then for any direction z = (h, δ) ∈ E₁ × E₂ we have

⟨∇F(x), h⟩² = ⟨∇ₓΦ(x, y(x)), h⟩² = ⟨∇Φ(x, y(x)), z⟩² ≤(5.3.6) ν ⟨∇²Φ(x, y(x))z, z⟩.

As was shown in the proof of Theorem 5.1.11,

min_{δ∈E₂} ⟨∇²Φ(x, y(x))z, z⟩ = ⟨∇²F(x)h, h⟩.

Thus, F satisfies definition (5.3.6) of a ν-self-concordant barrier. □




5.3.3 Main Inequalities

Let us show that the local characteristics of a self-concordant barrier (gradient and
Hessian) provide us with global information about the structure of its domain.
Theorem 5.3.7 1. Let F be a ν-self-concordant barrier. For any x and y from dom F, we have

⟨∇F(x), y − x⟩ < ν.    (5.3.10)

Moreover, if ⟨∇F(x), y − x⟩ ≥ 0, then

⟨∇F(y) − ∇F(x), y − x⟩ ≥ ⟨∇F(x), y − x⟩² / (ν − ⟨∇F(x), y − x⟩).    (5.3.11)

2. A standard self-concordant function F is a ν-self-concordant barrier if and only if

F(y) ≥ F(x) − ν ln( 1 − (1/ν)⟨∇F(x), y − x⟩ ) ∀x, y ∈ dom F.    (5.3.12)

Proof 1. Let us fix two points x, y ∈ dom F. Consider the univariate function

φ(t) = ⟨∇F(x + t(y − x)), y − x⟩, t ∈ [0, 1].

If φ(0) ≤ 0, then (5.3.10) is trivial. If φ(0) = 0, then (5.3.11) is valid in view of the convexity of F. Suppose that φ(0) > 0. In view of inequality (5.3.6), we have

φ'(t) = ⟨∇²F(x + t(y − x))(y − x), y − x⟩ ≥ (1/ν) ⟨∇F(x + t(y − x)), y − x⟩² = (1/ν) φ²(t).

Therefore, φ(t) increases and is positive for t ∈ [0, 1]. Moreover, for any t ∈ [0, 1] we have

−1/φ(t) + 1/φ(0) = ∫₀ᵗ (φ'(τ)/φ²(τ)) dτ ≥(5.3.6) t/ν.

This implies that ⟨∇F(x), y − x⟩ = φ(0) < ν/t for all t ∈ (0, 1]. Thus, (5.3.10) is proved. At the same time,

φ(t) − φ(0) ≥ νφ(0)/(ν − tφ(0)) − φ(0) = tφ²(0)/(ν − tφ(0)), t ∈ [0, 1].

Choosing t = 1, we get inequality (5.3.11).
2. Let ψ(x) = e^{−F(x)/ν}. In view of Lemma 5.3.1, this function is concave. It remains to note that inequality (5.3.12) is equivalent to the condition

ψ(y) ≤ ψ(x) + ⟨∇ψ(x), y − x⟩

up to a logarithmic transformation of both sides. □



Corollary 5.3.2 Let F be a ν-self-concordant barrier and let h ∈ E be a recession direction of dom F: x + τh ∈ dom F for any x ∈ dom F and τ ≥ 0. Then,

⟨∇²F(x)h, h⟩^{1/2} ≤ −⟨∇F(x), h⟩.    (5.3.13)

Proof In view of inequality (5.3.10), ⟨∇F(x), h⟩ ≤ 0. If dom F does not contain the line {x + τh, τ ∈ R}, then inequality (5.3.13) follows from (5.1.27). If it contains the line, then ⟨∇F(x), h⟩ = 0 for all x ∈ dom F. This means that F is constant along this line and both sides of inequality (5.3.13) vanish. □

Corollary 5.3.3 Let x, y ∈ dom F. Then for any α ∈ [0, 1) we have

F(x + α(y − x)) ≤ F(x) − ν ln(1 − α).    (5.3.14)

Proof Let y(t) = x + t(y − x) and φ(t) = F(y(t)). Then

φ'(t) = ⟨∇F(y(t)), y − x⟩ = (1/(1 − t)) ⟨∇F(y(t)), y − y(t)⟩ ≤(5.3.10) ν/(1 − t).

Integrating this inequality over t ∈ [0, α], we get inequality (5.3.14). □



Theorem 5.3.8 Let F be a ν-self-concordant barrier. Then for any x ∈ dom F and y ∈ Dom F such that

⟨∇F(x), y − x⟩ ≥ 0,    (5.3.15)

we have

‖y − x‖_x ≤ ν + 2√ν.    (5.3.16)

Proof Let r = ‖y − x‖_x and suppose r > √ν (otherwise (5.3.16) is trivial). Consider the point yα = x + α(y − x) with α = √ν/r < 1. In view of our assumption (5.3.15) and inequality (5.1.13) we have

ω ≡ ⟨∇F(yα), y − x⟩ ≥ ⟨∇F(yα) − ∇F(x), y − x⟩ = (1/α) ⟨∇F(yα) − ∇F(x), yα − x⟩

≥ (1/α) · ‖yα − x‖²_x/(1 + ‖yα − x‖_x) = α‖y − x‖²_x/(1 + α‖y − x‖_x) = r√ν/(1 + √ν).

On the other hand, in view of (5.3.10), we obtain

(1 − α)ω = ⟨∇F(yα), y − yα⟩ ≤ ν.

Thus,

(1 − √ν/r) · r√ν/(1 + √ν) ≤ ν,

and this is exactly (5.3.16). □



We conclude this section by studying the properties of one special point of a
convex set.
Definition 5.3.3 Let F be a ν-self-concordant barrier for the set Dom F . The point

xF∗ = arg min_{x∈dom F} F(x)

is called the analytic center of the convex set Dom F , generated by the barrier F .
Theorem 5.3.9 Assume that the analytic center of a ν-self-concordant barrier F exists. Then for any x ∈ Dom F we have

‖x − xF∗‖_{xF∗} ≤ ν + 2√ν.

On the other hand, for any x ∈ Rⁿ such that ‖x − xF∗‖_{xF∗} ≤ 1, we have x ∈ Dom F.
Proof The first statement follows from Theorem 5.3.8 since ∇F(xF∗) = 0. The second statement follows from Theorem 5.1.5. □

Thus, the asphericity of the set Dom F with respect to xF∗, computed in the metric ‖·‖_{xF∗}, does not exceed ν + 2√ν. It is well known that for any convex set in Rⁿ there exists a metric in which the asphericity of this set is less than or equal to n

(John’s Theorem). However, we managed to estimate the asphericity in terms of the


parameter of the self-concordant barrier. This value does not depend directly on the
dimension of the space of variables.
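For a concrete illustration of Theorem 5.3.9 (an added sketch, not part of the original text), take Q = [0, 1] with the barrier F(x) = −ln x − ln(1 − x). By Theorem 5.3.2 its parameter is ν = 2, its analytic center is xF∗ = ½, and F''(xF∗) = 8. The Python sketch below checks both statements of the theorem:

```python
import math

nu = 2.0                     # parameter of F(x) = -ln x - ln(1 - x)
xF = 0.5                     # analytic center of [0, 1]
hF = 1 / xF ** 2 + 1 / (1 - xF) ** 2   # F''(xF) = 8
bound = nu + 2 * math.sqrt(nu)         # ~ 4.83

# first statement: every feasible x satisfies ||x - xF||_{xF} <= nu + 2 sqrt(nu)
for i in range(1, 1000):
    x = i / 1000.0
    assert abs(x - xF) * math.sqrt(hF) <= bound

# second statement: the unit ball in the local metric at xF is feasible
assert 0 < xF - 1 / math.sqrt(hF) and xF + 1 / math.sqrt(hF) < 1
```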
Recall also that if Dom F contains no straight lines the existence of xF∗
implies the boundedness of Dom F (since then ∇ 2 F (xF∗ ) is nondegenerate, see
Theorem 5.1.6).
Corollary 5.3.4 Let Dom F be bounded. Then for any x ∈ dom F and v ∈ Rⁿ we have

‖v‖*_x ≤ (ν + 2√ν) ‖v‖*_{xF∗}.

In other words, for any x ∈ dom F we have

∇²F(x) ⪰ (1/(ν + 2√ν)²) ∇²F(xF∗).    (5.3.17)

Proof By Lemma 3.1.20, we get the following representation:

‖v‖*_x ≡ ⟨v, [∇²F(x)]⁻¹v⟩^{1/2} = max{ ⟨v, u⟩ | ⟨∇²F(x)u, u⟩ ≤ 1 }.

On the other hand, in view of Theorems 5.1.5 and 5.3.9, we have

B ≡ {y ∈ Rⁿ | ‖y − x‖_x ≤ 1} ⊆ Dom F ⊆ {y ∈ Rⁿ | ‖y − xF∗‖_{xF∗} ≤ ν + 2√ν} ≡ B∗.

Therefore, using again Theorem 5.3.9, we get the following relations:

‖v‖*_x = max{ ⟨v, y − x⟩ | y ∈ B } ≤ max{ ⟨v, y − x⟩ | y ∈ B∗ } = ⟨v, xF∗ − x⟩ + (ν + 2√ν) ‖v‖*_{xF∗}.

Note that ‖v‖*_x = ‖−v‖*_x. Therefore, we can always ensure ⟨v, xF∗ − x⟩ ≤ 0. □

5.3.4 The Path-Following Scheme

Now we are ready to describe a barrier model of the minimization problem. This is
a standard minimization problem

min{⟨c, x⟩ | x ∈ Q}    (5.3.18)

where Q is a bounded closed convex set with nonempty interior, which is a closure
of the domain of some ν-self-concordant barrier F .

We are going to solve (5.3.18) by tracing the central path:

x∗(t) = arg min_{x∈dom F} f(t; x),    (5.3.19)

where f(t; x) = t⟨c, x⟩ + F(x) and t ≥ 0. In view of the first-order optimality condition (1.2.4), any point of the central path satisfies the equation

tc + ∇F(x∗(t)) = 0.    (5.3.20)

Since the set Q is bounded and F is a closed convex function, the analytic center of
this set xF∗ exists and it is uniquely defined (see Item 4 of Theorems 3.1.4 and 5.1.6).
Moreover, it is a starting point for the central path:

x ∗ (0) = xF∗ . (5.3.21)

In order to follow the central path, we are going to update the points satisfying an approximate centering condition:

λ_{f(t;·)}(x) ≡ ‖f'(t; x)‖*_x = ‖tc + ∇F(x)‖*_x ≤ β,    (5.3.22)

where the centering parameter β is small enough.


Let us show that this is a reasonable goal.
Theorem 5.3.10 For any t > 0, we have

⟨c, x∗(t)⟩ − c∗ ≤ ν/t,    (5.3.23)

where c∗ is the optimal value of problem (5.3.18). If a point x satisfies the approximate centering condition (5.3.22), then

⟨c, x⟩ − c∗ ≤ (1/t) ( ν + (β + √ν)β/(1 − β) ).    (5.3.24)

Proof Let x∗ be a solution to (5.3.18). In view of (5.3.20) and (5.3.10), we have

⟨c, x∗(t) − x∗⟩ = (1/t) ⟨∇F(x∗(t)), x∗ − x∗(t)⟩ ≤ ν/t.

Further, let x satisfy (5.3.22). Let λ = λ_{f(t;·)}(x). Then, in view of (5.3.5), Theorem 5.2.1, and (5.3.22), we have

t⟨c, x − x∗(t)⟩ = ⟨f'(t; x) − ∇F(x), x − x∗(t)⟩ ≤ (λ + √ν) ‖x − x∗(t)‖_x ≤ (λ + √ν) λ/(1 − λ) ≤ (β + √ν)β/(1 − β). □
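Both bounds are easy to observe on a small instance (an added illustration, not part of the original proof). For min{x : x ∈ [0, 1]} with the barrier F(x) = −ln x − ln(1 − x), so that ν = 2 and c∗ = 0, the central path equation (5.3.20) reads t − 1/x + 1/(1 − x) = 0, a quadratic in x that can be solved in closed form:

```python
import math

# min{ x : x in [0, 1] }: c = 1, c* = 0, F(x) = -ln x - ln(1 - x), nu = 2.
# The central path equation t*c + F'(x*(t)) = 0 reads
#   t - 1/x + 1/(1 - x) = 0   <=>   t x^2 - (t + 2) x + 1 = 0.
def x_star(t):
    # the smaller root of the quadratic lies in (0, 1/2)
    return ((t + 2) - math.sqrt((t + 2) ** 2 - 4 * t)) / (2 * t)

for t in (0.5, 2.0, 10.0, 1000.0):
    x = x_star(t)
    assert 0 < x < 1
    # residual of the central path equation (5.3.20)
    assert abs(t - 1 / x + 1 / (1 - x)) < 1e-6 * max(1.0, t)
    # the gap bound (5.3.23): <c, x*(t)> - c* <= nu / t
    assert x <= 2.0 / t + 1e-12
```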

Let us analyze now one step of a path-following scheme. It differs from the
updating rule (5.2.14) only by the origin of the objective vector.
Assume that x ∈ dom F. Consider the following iterate:

t₊ = t + γ/‖c‖*_x,

x₊ = x − (1/(1 + ξ)) [∇²F(x)]⁻¹ (t₊c + ∇F(x)),    (5.3.25)

where ξ = λ²/(1 + λ) and λ = ‖t₊c + ∇F(x)‖*_x.
From Lemma 5.2.2, we know that if β = β(τ) = τ²(1 + τ + τ/(1 + τ + τ²)) with τ ∈ [0, ½] and x satisfies the approximate centering condition (5.3.22), then for γ such that

|γ| ≤ τ − τ²(1 + τ + τ/(1 + τ + τ²)),    (5.3.26)

we have again ‖t₊c + ∇F(x₊)‖*_{x₊} ≤ β.
Let us prove now that the increase of t in the scheme (5.3.25) is sufficiently large.
Lemma 5.3.2 Let x satisfy (5.3.22). Then

‖c‖*_x ≤ (1/t)(β + √ν).    (5.3.27)

Proof Indeed, in view of (5.3.22) and (5.3.5), we have

t‖c‖*_x = ‖f'(t; x) − ∇F(x)‖*_x ≤ ‖f'(t; x)‖*_x + ‖∇F(x)‖*_x ≤ β + √ν. □

Let us now fix some reasonable values of the parameters in method (5.3.25). In the remaining part of this chapter we always assume that

τ = 0.29, β = β(τ) ≈ 0.126,
γ = τ − β(τ) ≈ 0.164 ⇒ γ⁻¹ < 6.11.    (5.3.28)

We have proved that it is possible to follow the central path, using the rule (5.3.25). Note that we can either increase or decrease the current value of t. The lower estimate for the rate of increasing t is

t₊ ≥ (1 + γ/(β + √ν)) · t,

and the upper estimate for the rate of decreasing t is

t₊ ≤ (1 − γ/(β + √ν)) · t.
Thus, the general scheme for solving the problem (5.3.18) is as follows.

Main path-following scheme

0. Set t0 = 0. Choose an accuracy ε > 0 and x0 ∈ dom F such that

‖∇F(x0)‖*_{x0} ≤ β.

1. kth iteration (k ≥ 0). Set

tk+1 = tk + γ/‖c‖*_{xk},

xk+1 = xk − (1/(1 + ξk)) [∇²F(xk)]⁻¹ (tk+1 c + ∇F(xk)),

where ξk = λk²/(1 + λk) and λk = ‖tk+1 c + ∇F(xk)‖*_{xk}.

2. Stop the process if tk ≥ (1/ε) ( ν + (β + √ν)β/(1 − β) ).

(5.3.29)

Let us derive a complexity bound for the above scheme.


Theorem 5.3.11 Method (5.3.29) terminates after N steps at most, where

N ≤ O( √ν ln( ν‖c‖*_{xF∗}/ε ) ).

Moreover, at the moment of termination we have ⟨c, xN⟩ − c∗ ≤ ε.


Proof Note that r0 ≡ ‖x0 − xF∗‖_{x0} ≤ β/(1 − β) (see Theorem 5.2.1). Therefore, in view of Theorem 5.1.7, we have

t1 = γ/‖c‖*_{x0}, ‖c‖*_{x0} ≤ (1/(1 − r0)) ‖c‖*_{xF∗} ≤ ((1 − β)/(1 − 2β)) ‖c‖*_{xF∗}.

Thus, tk ≥ ( γ(1 − 2β)/((1 − β)‖c‖*_{xF∗}) ) (1 + γ/(β + √ν))^{k−1} for all k ≥ 1. □

Let us discuss now the above complexity bound. The main term there is

6.11 √ν ln( ν‖c‖*_{xF∗}/ε ).

Note that the value ν‖c‖*_{xF∗} estimates from above the variation of the linear function ⟨c, x⟩ over the set Dom F (see Theorem 5.3.9). Thus, the ratio ε/(ν‖c‖*_{xF∗}) can be seen as the relative accuracy of the solution.
be seen as the relative accuracy of the solution.
The process (5.3.29) has one drawback. Sometimes it is difficult to satisfy its
starting condition

‖∇F(x0)‖*_{x0} ≤ β.

In this case, we need an additional process for computing an appropriate starting


point. We analyze the corresponding strategies in the next section.

5.3.5 Finding the Analytic Center

Thus, our current goal is to find an approximation to the analytic center of the set
Dom F . Let us look at the following minimization problem:

min{F (x) | x ∈ dom F }, (5.3.30)

where F is a ν-self-concordant barrier. In view of the needs of the previous section,


we accept an approximate solution x̄ ∈ dom F to this problem, which satisfies the
inequality

‖∇F(x̄)‖*_{x̄} ≤ β,

for a certain β ∈ (0, 1).


As we have already discussed in Sect. 5.2, we can apply two different minimiza-
tion strategies. The first one is a straightforward implementation of the Intermediate
Newton’s Method and the second one is based on a path-following approach.

Consider the first scheme.

Intermediate Newton's Method for finding the analytic center

0. Choose y0 ∈ dom F.
1. kth iteration (k ≥ 0). Set    (5.3.31)

yk+1 = yk − (1/(1 + ξk)) [∇²F(yk)]⁻¹ ∇F(yk),

where ξk = λk²/(1 + λk) and λk = ‖∇F(yk)‖*_{yk}.

2. Stop the process if ‖∇F(yk)‖*_{yk} ≤ β.

As we have seen already, this method needs O(F(y0) − F(xF∗)) iterations to enter the region of quadratic convergence.
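A minimal one-dimensional run of scheme (5.3.31) (an added illustration, with the sample barrier F(x) = −ln x − ln(1 − x), whose analytic center is ½):

```python
import math

def dF(y):  return -1.0 / y + 1.0 / (1.0 - y)
def d2F(y): return 1.0 / y ** 2 + 1.0 / (1.0 - y) ** 2

beta = 0.126
y = 0.001                                 # y0: badly off-center, but feasible
for k in range(200):
    lam = abs(dF(y)) / math.sqrt(d2F(y))  # lambda_F(y) = ||F'(y)||_y^*
    if lam <= beta:
        break
    xi = lam ** 2 / (1 + lam)
    y -= dF(y) / ((1 + xi) * d2F(y))      # damped (intermediate) Newton step

assert abs(dF(y)) / math.sqrt(d2F(y)) <= beta
assert abs(y - 0.5) < 0.06                # close to the analytic center 1/2
```

Note the constant-decrease phase while y is far from the center, followed by fast local convergence, exactly as the O(F(y0) − F(xF∗)) estimate suggests.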
To implement the path-following approach, we need to choose some y0 ∈ dom F and define the auxiliary central path:

y∗(t) = arg min_{y∈dom F} [ −t⟨∇F(y0), y⟩ + F(y) ],

where t ≥ 0. Since this trajectory satisfies the equation

∇F(y∗(t)) = t∇F(y0),    (5.3.32)

it connects two points, the starting point y0 and the analytic center xF∗:

y∗(1) = y0, y∗(0) = xF∗.

As was shown in Lemma 5.2.2, we can follow this trajectory by the process (5.3.25) with decreasing t.
Let us estimate the rate of convergence of the auxiliary central path y∗(t) to the analytic center in terms of the barrier parameter.
Lemma 5.3.3 For any t ≥ 0, we have

‖∇F(y∗(t))‖*_{y∗(t)} ≤ (ν + 2√ν) ‖∇F(y0)‖*_{xF∗} · t.

Proof This estimate follows from (5.3.32) and Corollary 5.3.4. □

384 5 Polynomial-Time Interior-Point Methods

Let us look now at the corresponding algorithmic scheme.

Auxiliary Path-Following Scheme

0. Choose y0 ∈ dom F. Set t0 = 1.


1. kth iteration (k ≥ 0). Set

       t_{k+1} = t_k − γ/‖∇F(y_0)‖*_{y_k},

       y_{k+1} = y_k − (1/(1+ξ_k)) [∇²F(y_k)]⁻¹ (−t_{k+1}∇F(y_0) + ∇F(y_k)),

   where ξ_k = λ_k²/(1+λ_k) and λ_k = ‖t_{k+1}∇F(y_0) − ∇F(y_k)‖*_{y_k}.

2. Stop the process if ‖∇F(y_k)‖*_{y_k} ≤ τ. Set ξ_k = λ_F(y_k)²/(1+λ_F(y_k)),
   where λ_F(y_k) = ‖∇F(y_k)‖*_{y_k}, and x̄ = y_k − (1/(1+ξ_k)) [∇²F(y_k)]⁻¹ ∇F(y_k).
                                                                    (5.3.33)

Note that the above scheme follows the auxiliary central path y ∗ (t) as tk → 0. It
updates the points {yk } satisfying the approximate centering condition

       ‖−t_k∇F(y_0) + ∇F(y_k)‖*_{y_k} ≤ β.
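A minimal numerical sketch of scheme (5.3.33); the concrete values of γ and τ below are illustrative placeholders rather than the parameters fixed in (5.3.28), and the barrier is supplied through its gradient and Hessian:

```python
import numpy as np

def aux_path_following(grad, hess, y0, gamma=0.35, tau=0.4, max_iter=1000):
    """Follow the auxiliary central path y*(t) from y0 (t = 1) towards the
    analytic center (t -> 0), as in scheme (5.3.33)."""
    y = np.array(y0, dtype=float)
    g0 = grad(y)
    t = 1.0
    for _ in range(max_iter):
        H = hess(y)
        dual_norm = lambda v: np.sqrt(v @ np.linalg.solve(H, v))
        g = grad(y)
        lam_F = dual_norm(g)
        if lam_F <= tau:                      # termination criterion of step 2
            xi = lam_F**2 / (1.0 + lam_F)
            return y - np.linalg.solve(H, g) / (1.0 + xi)
        t -= gamma / dual_norm(g0)            # update of the penalty parameter
        v = -t * g0 + g
        lam = dual_norm(v)
        xi = lam**2 / (1.0 + lam)
        y -= np.linalg.solve(H, v) / (1.0 + xi)
    return y
```

For instance, with the barrier of the box {‖y‖_∞ ≤ 1} and y₀ = (0.5, −0.3), the returned point lands near the origin, the analytic center of the box.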

The termination criterion of this process,

       λ_k = ‖∇F(y_k)‖*_{y_k} ≤ τ,

guarantees that ‖∇F(x̄)‖*_{x̄} ≤ β(τ) (see Theorem 5.2.2). Let us derive a complexity
bound for this process.
Theorem 5.3.12 The process (5.3.33) terminates no later than after
       (1/γ)(β + √ν) · ln( (1/γ)(ν + 2√ν) ‖∇F(y_0)‖*_{x*_F} )

iterations.
Proof Recall that our parameters are fixed by (5.3.28). Note that t0 = 1. Therefore,
in view of Lemmas 5.2.2 and 5.3.2, we have
   
       t_{k+1} ≤ (1 − γ/(β + √ν)) t_k ≤ exp(−γ(k+1)/(β + √ν)) t_0.

Further, in view of Lemma 5.3.3, we obtain

       ‖∇F(y_k)‖*_{y_k} = ‖(−t_k∇F(y_0) + ∇F(y_k)) + t_k∇F(y_0)‖*_{y_k}

                        ≤ β + t_k ‖∇F(y_0)‖*_{y_k} ≤ β + t_k (ν + 2√ν) ‖∇F(y_0)‖*_{x*_F}.

Thus, the process is terminated at most when the following inequality holds:

       t_k (ν + 2√ν) ‖∇F(y_0)‖*_{x*_F} ≤ τ − β(τ) = γ. □

The principal term in the complexity bound of the auxiliary path-following


scheme is

       6.11 √ν [ln ν + ln ‖∇F(y_0)‖*_{x*_F}]

and for the auxiliary Intermediate Newton's Method it is O(F(y_0) − F(x*_F)). These
estimates cannot be compared directly. However, as we have proved in Sect. 5.2.2
by a different argument, the path-following approach is much more efficient. Note also
that its complexity estimate naturally fits the complexity of the main path-following
process. Indeed, if we apply (5.3.29) with (5.3.33), we get the following complexity
bound for the whole process:

       6.11 √ν [2 ln ν + ln ‖∇F(y_0)‖*_{x*_F} + ln ‖c‖*_{x*_F} + ln(1/ε)].

To conclude this section, note that for some problems it is difficult even to
point out a starting point y0 ∈ dom F . In such cases, we should apply one more
auxiliary minimization process, which is similar to the process (5.3.33). We discuss
this situation in the next section.

5.3.6 Problems with Functional Constraints

Let us consider the following minimization problem:

       min_{x∈Q} { f_0(x) : f_j(x) ≤ 0, j = 1 … m },        (5.3.34)

where Q is a simple bounded closed convex set with nonempty interior and all
functions fj , j = 0 . . . m, are convex. We assume that the problem satisfies the
Slater condition: There exists an x̄ ∈ int Q such that fj (x̄) < 0 for all j = 1 . . . m.

Let us assume that we know an upper bound ξ̄ such that f_0(x) < ξ̄ for all x ∈ Q.
Then, introducing two additional variables ξ and κ, we can rewrite this problem in
the standard form:

       min_{x∈Q, ξ≤ξ̄, κ≤0} { ξ : f_0(x) ≤ ξ, f_j(x) ≤ κ, j = 1 … m }.        (5.3.35)

Note that we can apply interior-point methods to this problem only if we are able to
construct a self-concordant barrier for the feasible set. In the current situation, this
means that we should be able to construct the following barriers:
• A self-concordant barrier FQ (x) for the set Q.
• A self-concordant barrier F0 (x, ξ ) for the epigraph of the objective function
f0 (x).
• Self-concordant barriers F_j(x, κ) for the epigraphs of the functional constraints
  f_j(x).
Let us assume that we can do that. Then the resulting self-concordant barrier for the
feasible set of problem (5.3.35) is as follows:


       F̂(x, ξ, κ) = F_Q(x) + F_0(x, ξ) + Σ_{j=1}^m F_j(x, κ) − ln(ξ̄ − ξ) − ln(−κ).

The parameter of this barrier is


       ν̂ = ν_Q + ν_0 + Σ_{j=1}^m ν_j + 2,        (5.3.36)

where ν(·) are the parameters of the corresponding barriers.


Note that it could still be difficult to find a starting point from dom F̂ . This
domain is an intersection of the set Q with epigraphs of the objective function and
constraints, and with two additional linear constraints ξ ≤ ξ̄ and κ ≤ 0. If we have
a point x_0 ∈ int Q, then we can choose ξ_0 and κ_0 large enough to guarantee

       f_0(x_0) < ξ_0 < ξ̄,   f_j(x_0) < κ_0, j = 1 … m.

Then only the constraint κ ≤ 0 will be violated.


In order to simplify our analysis, let us change the notation. From now on, we
consider the problem

       min_{z∈S} { ⟨c, z⟩ : ⟨d, z⟩ ≤ 0 },        (5.3.37)

where z = (x, ξ, κ), ⟨c, z⟩ ≡ ξ, ⟨d, z⟩ ≡ κ, and S is the feasible set of
problem (5.3.35) without the constraint κ ≤ 0. Note that we know a self-concordant
barrier F(z) for the set S, and we can easily find a point z_0 ∈ int S. Moreover, in

view of our assumptions, the set

       S(α) = {z ∈ S | ⟨d, z⟩ ≤ α}

is bounded and, for α large enough, it has nonempty interior.


The process of solving problem (5.3.37) consists of three stages.
1. Choose a starting point z_0 ∈ int S and some initial gap Δ > 0. Set α =
⟨d, z_0⟩ + Δ. If α ≤ 0, then we can use the two-stage process described in Sect. 5.3.5.
Otherwise, we do the following. First, we find an approximate analytic center of the
set S(α), generated by the barrier

       F̃(z) = F(z) − ln(α − ⟨d, z⟩).

Namely, we find a point z̃ satisfying the condition


 
       λ_{F̃}(z̃) ≡ ⟨∇F(z̃) + d/(α − ⟨d, z̃⟩), [∇²F̃(z̃)]⁻¹ (∇F(z̃) + d/(α − ⟨d, z̃⟩))⟩^{1/2} ≤ β.

In order to generate such a point, we can use the auxiliary schemes discussed in
Sect. 5.3.5.
2. The next stage consists in following the central path z(t) defined by the
equation

td + ∇ F̃ (z(t)) = 0, t ≥ 0.

Note that the previous stage provides us with a reasonable approximation to the
analytic center z(0). Therefore, we can follow this path, using the process (5.3.25).
This trajectory leads us to the solution of the minimization problem

       min{ ⟨d, z⟩ | z ∈ S(α) }.

In view of the Slater condition for problem (5.3.37), the optimal value of this
problem is strictly negative.
The goal of this stage consists in finding an approximation to the analytic center
of the set

       S̄ = {z ∈ S(α) | ⟨d, z⟩ ≤ 0}

generated by the barrier F̄(z) = F̃(z) − ln(−⟨d, z⟩). This point, z*, satisfies the
equation

       ∇F̃(z*) − d/⟨d, z*⟩ = 0.

Therefore, z∗ is a point of the central path z(t). The corresponding value of the
penalty parameter t∗ is

       t* = −1/⟨d, z*⟩ > 0.

This stage terminates with a point z̄ satisfying the condition


 
       λ_{F̃}(z̄) ≡ ⟨∇F̃(z̄) − d/⟨d, z̄⟩, [∇²F̃(z̄)]⁻¹ (∇F̃(z̄) − d/⟨d, z̄⟩)⟩^{1/2} ≤ β.

3. Note that ∇²F̄(z) ⪰ ∇²F̃(z). Therefore, the point z̄, computed at the previous
stage, satisfies the inequality

       λ_{F̄}(z̄) ≡ ⟨∇F̃(z̄) − d/⟨d, z̄⟩, [∇²F̄(z̄)]⁻¹ (∇F̃(z̄) − d/⟨d, z̄⟩)⟩^{1/2} ≤ β.

This means that we have a good approximation of the analytic center of the set S̄,
and we can apply the main path-following scheme (5.3.29) to solve the problem

       min{ ⟨c, z⟩ : z ∈ S̄ }.

Clearly, this problem is equivalent to (5.3.37).


We omit the detailed complexity analysis of the above three-stage scheme. It can
be done in a similar way to the analysis of Sect. 5.3.5. The main term in the complexity
of this scheme is proportional to the product of √ν̂ (see (5.3.36)) and the sum
of the logarithm of the desired accuracy ε and the logarithms of some structural
characteristics of the problem (the size of the region, the depth of the Slater condition, etc.).
Thus, we have shown that interior-point methods can be applied to all
problems for which we can point out self-concordant barriers for the basic
feasible set Q and for the epigraphs of the functional constraints. Our main goal
now is to describe the classes of convex problems for which such barriers can be
constructed in a computable form. Note that we have an exact characteristic of the
quality of a self-concordant barrier: the value of its parameter. The smaller it
is, the more efficient the corresponding path-following scheme will be. In the next
section, we discuss the possibilities for applying the developed theory to particular
convex problems.

5.4 Applications to Problems with Explicit Structure

(Bounds on parameters of self-concordant barriers; Linear and quadratic optimization;
Semidefinite optimization; Extremal ellipsoids; Constructing self-concordant barriers for
particular sets; Separable problems; Geometric optimization; Approximation in ℓ_p-norms;
Choice of optimization scheme.)

5.4.1 Lower Bounds for the Parameter of a Self-concordant


Barrier

In the previous section, we discussed a path-following scheme for solving the


following problem:

       min_{x∈Q} ⟨c, x⟩,        (5.4.1)

where Q is a closed convex set with nonempty interior, for which we know a ν-self-
concordant barrier F(·). Using such a barrier, we can solve (5.4.1) in O(√ν · ln(ν/ε))
iterations of a path-following scheme. Recall that the most difficult part of each
iteration is the solution of a system of linear equations.
In this section, we study the limits of applicability of this approach. We discuss
the lower and upper bounds for the parameters of self-concordant barriers. We also
discuss some classes of convex problems for which the model (5.4.1) can be created
in a computable form.
Let us start from the lower bounds on the barrier parameters.
Lemma 5.4.1 Let f be a ν-self-concordant barrier for the interval (α, β) ⊂ R,
α < β < ∞, where we admit the value α = −∞. Then

       ν ≥ ϰ := sup_{t∈(α,β)} (f′(t))²/f″(t) ≥ 1.

Proof Note that ν ≥ ϰ by definition. Let us assume that ϰ < 1. Since f is a convex
barrier function for (α, β), there exists a value ᾱ ∈ (α, β) such that f′(t) > 0 for
all t ∈ [ᾱ, β).
Consider the function φ(t) = (f′(t))²/f″(t), t ∈ [ᾱ, β). Then, since f′(t) > 0, f(·) is
standard self-concordant, and φ(t) ≤ ϰ < 1, we have

       φ′(t) = 2f′(t) − (f′(t)/f″(t))² f‴(t)

             = f′(t) [ 2 − (f′(t)/[f″(t)]^{1/2}) · (f‴(t)/[f″(t)]^{3/2}) ] ≥ 2(1 − √ϰ) f′(t).

Hence, for all t ∈ [ᾱ, β) we obtain φ(t) ≥ φ(ᾱ) + 2(1 − √ϰ)(f(t) − f(ᾱ)). This
is a contradiction since f is a barrier function and φ is bounded from above. □

Corollary 5.4.1 Let F be a ν-self-concordant barrier for Q ⊂ E. Then ν ≥ 1.
Proof Indeed, let x ∈ int Q. Since Q ⊂ E, there exists a nonzero direction u ∈ E
such that the line {y = x + tu, t ∈ R} intersects the boundary of the set Q.
Therefore, considering the function f(t) = F(x + tu) and using Lemma 5.4.1, we get
the result. □


Let us prove a simple lower bound for parameters of self-concordant barriers for
unbounded sets.
Let Q be a closed convex set with nonempty interior. Consider x̄ ∈ int Q.
Assume that there exists a nontrivial set of recession directions {p1 , . . . , pk } of the
set Q:

x̄ + αpi ∈ Q ∀α ≥ 0, i = 1, . . . , k.

Theorem 5.4.1 Let the positive coefficients {βi }ki=1 satisfy the condition

       x̄ − β_i p_i ∉ int Q, i = 1, …, k.


If for some positive α_1, …, α_k we have ȳ = x̄ − Σ_{i=1}^k α_i p_i ∈ Q, then the parameter
ν of any self-concordant barrier for the set Q satisfies the inequality

       ν ≥ Σ_{i=1}^k α_i/β_i.

Proof Let F be a ν-self-concordant barrier for the set Q. Since p_i is a recession
direction, by Theorem 5.1.14 we have

       ⟨∇F(x̄), −p_i⟩ ≥ ⟨∇²F(x̄)p_i, p_i⟩^{1/2} ≡ ‖p_i‖_{x̄}.

Note that x̄ − β_i p_i ∉ Q. Therefore, in view of Theorem 5.1.5, the norm of the
direction p_i is large enough: β_i ‖p_i‖_{x̄} ≥ 1. Hence, in view of Theorem 5.3.7, we obtain
pi is large enough: βi  pi x̄ ≥ 1. Hence, in view of Theorem 5.3.7, we obtain


       ν ≥ ⟨∇F(x̄), ȳ − x̄⟩ = ⟨∇F(x̄), −Σ_{i=1}^k α_i p_i⟩

         ≥ Σ_{i=1}^k α_i ‖p_i‖_{x̄} ≥ Σ_{i=1}^k α_i/β_i. □

5.4.2 Upper Bound: Universal Barrier and Polar Set

Let us present now an existence theorem for self-concordant barriers. Consider a


closed convex set Q, int Q ≠ ∅, and assume that Q contains no straight lines.
Define a polar set of Q with respect to some point x̄ ∈ int Q as follows:

       P(x̄) = {s ∈ Rⁿ | ⟨s, x − x̄⟩ ≤ 1, ∀x ∈ Q}.



It can be proved that for any x ∈ int Q the set P (x) is a bounded closed convex set
with nonempty interior. It always contains the origin.
Define V (x) = voln P (x).
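As a concrete illustration (our own example, not from the text), for Q = [−1, 1]² and x̄ = 0 the polar P(0) is the cross-polytope {s : |s^{(1)}| + |s^{(2)}| ≤ 1}, since ⟨s, x⟩ is maximized over the box at one of its corners:

```python
import numpy as np

# Membership in P(0) for Q = [-1, 1]^2: s is in P(0) iff
# max_{x in Q} <s, x> <= 1, and this maximum is attained at a corner of Q.
corners = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def in_polar(s):
    return np.max(corners @ s) <= 1.0 + 1e-12

assert in_polar(np.array([0.5, 0.5]))       # |s|_1 = 1: boundary of P(0)
assert not in_polar(np.array([0.8, 0.5]))   # |s|_1 = 1.3: outside P(0)
```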
Theorem 5.4.2 There exist absolute constants c1 and c2 , such that the function

U (x) = c1 · ln V (x)

is a (c2 · n)-self-concordant barrier for Q. 



We drop the proof of this statement since it is very technical. 
The function U(·) is called the Universal Barrier for the set Q. Note that
the analytical complexity of problem (5.4.1), equipped with the universal barrier, is
O(√n · ln(n/ε)) calls of the oracle. Recall that such an efficiency estimate is impossible for
the methods based on a local Black-Box oracle (see Theorem 3.2.8).
The statement of Theorem 5.4.2 is mainly of theoretical interest. Indeed, in
general, the value U (x) cannot easily be computed. However, Theorem 5.4.2
demonstrates that self-concordant barriers, in principle, can be found for any convex
set. Thus, the applicability of this approach is restricted only by our ability to
construct a computable self-concordant barrier, hopefully with a small value of
the parameter. The process of creating the barrier model of the initial problem
can hardly be described in a formal way. For each particular problem, there could
be many different barrier models, and we should choose the best one, taking into
account the value of the parameter of the self-concordant barrier, the complexity of
the computation of its gradient and Hessian, and the complexity of the solution of
the corresponding Newton system.
In the remaining part of this section we will see how this can be done for some
standard problem classes of Convex Optimization.

5.4.3 Linear and Quadratic Optimization

Let us start from a problem of Linear Optimization:

       min_{x∈Rⁿ₊} { ⟨c, x⟩ : Ax = b },        (5.4.2)

where A is an (m × n)-matrix, m < n. The basic feasible set in this problem


is represented by the positive orthant, the set of all vectors with nonnegative
coefficients in Rn . It can be equipped with the following self-concordant barrier:


       F(x) = −Σ_{i=1}^n ln x^{(i)},   ν = n,        (5.4.3)

(see Example 5.3.1 and Theorem 5.3.2). This barrier is called the standard
logarithmic barrier for Rn+ .
In order to solve problem (5.4.2), we have to use the restriction of the barrier
F onto the affine subspace {x : Ax = b}. Since this restriction is an n-self-concordant
barrier (see Theorem 5.3.3), the complexity bound for problem (5.4.2)
is O(√n · ln(n/ε)) iterations of a path-following scheme.
Let us prove that the standard logarithmic barrier is optimal for Rn+ .
Lemma 5.4.2 The parameter ν of any self-concordant barrier for Rn+ satisfies
inequality ν ≥ n.
Proof Let us choose

x̄ = ēn ≡ (1, . . . , 1)T ∈ int Rn+ ,

pi = ei , i = 1 . . . n,

where ei is the ith coordinate vector of Rn . In this case the conditions of


Theorem 5.4.1 are satisfied with αi = βi = 1, i = 1 . . . n. Therefore,


       ν ≥ Σ_{i=1}^n α_i/β_i = n. □

Note that the above lower bound is valid only for the whole set Rn+ . The lower
bound for the intersection {x ∈ Rn+ | Ax = b} can be smaller.
Self-concordant barriers for cones usually have one important property, which is
called logarithmic homogeneity (e.g. (5.4.3)).
Definition 5.4.1 A function F ∈ C 2 (E) with Dom F = K, where K is a closed
convex cone, is called logarithmically homogeneous if there exists a constant ν ≥ 1
such that

F (τ x) = F (x) − ν ln τ, ∀x ∈ int K, τ > 0. (5.4.4)

This simple property has surprisingly many interesting consequences, one of


which makes the computation of the barrier parameter completely trivial.
Lemma 5.4.3 Let F be a logarithmically homogeneous self-concordant barrier for
a convex cone K which contains no straight lines. Then for any x ∈ int K and τ > 0
we have

       ∇F(τx) = (1/τ)∇F(x),   ∇²F(τx) = (1/τ²)∇²F(x),        (5.4.5)

       ⟨∇F(x), x⟩ = −ν,   ∇²F(x)x = −∇F(x),        (5.4.6)

       ⟨∇²F(x)x, x⟩ = ν,   ⟨∇F(x), [∇²F(x)]⁻¹∇F(x)⟩ = ν.        (5.4.7)



Proof Differentiating identity (5.4.4) in x, we get the first identity in (5.4.5).


Differentiating the latter identity in x again, we get the second relation in (5.4.5).
Differentiating identity (5.4.4) in τ and taking τ = 1, we get the first identity
in (5.4.6). Differentiating it in x, we obtain the second identity in this line.
Finally, substituting the last expression in (5.4.6) into the first one, we get the first
identity in (5.4.7). Since K contains no straight lines, ∇²F(x) is non-degenerate.
Therefore, x = −[∇²F(x)]⁻¹∇F(x), and we get the last expression in (5.4.7). □
Thus, for logarithmically homogeneous barriers, the degree of homogeneity is
always equal to the barrier parameter (see the second identity in (5.4.7)).
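The identities (5.4.5)–(5.4.7) are easy to verify numerically. Here is a check (our own sketch) for the standard logarithmic barrier F(x) = −Σᵢ ln x^{(i)} on Rⁿ₊, which is logarithmically homogeneous with ν = n; the test point is arbitrary:

```python
import numpy as np

n = 5
x = np.array([0.7, 1.2, 2.0, 0.3, 1.5])        # a point in int R^n_+
g = -1.0 / x                                    # gradient of F(x) = -sum_i ln x_i
H = np.diag(1.0 / x**2)                         # Hessian of F

assert np.isclose(g @ x, -n)                    # <F'(x), x> = -nu
assert np.allclose(H @ x, -g)                   # F''(x) x = -F'(x)
assert np.isclose(x @ (H @ x), n)               # <F''(x) x, x> = nu
assert np.isclose(g @ np.linalg.solve(H, g), n) # <F'(x), [F''(x)]^{-1} F'(x)> = nu
tau = 2.5
assert np.allclose(-1.0 / (tau * x), g / tau)   # F'(tau x) = F'(x)/tau
```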
Let us look now at the quadratically constrained quadratic optimization problem:

       min_{x∈Rⁿ} { q_0(x) = α_0 + ⟨a_0, x⟩ + ½⟨A_0x, x⟩ :
                    q_i(x) = α_i + ⟨a_i, x⟩ + ½⟨A_ix, x⟩ ≤ β_i, i = 1 … m },        (5.4.8)

where Ai are some positive semidefinite (n × n)-matrices. Let us rewrite this


problem in the standard form:

       min_{x∈Rⁿ, τ∈R} { τ : q_0(x) ≤ τ, q_i(x) ≤ β_i, i = 1 … m }.        (5.4.9)

The feasible set of this problem can be equipped with the following self-concordant
barrier:

       F(x, τ) = −ln(τ − q_0(x)) − Σ_{i=1}^m ln(β_i − q_i(x)),   ν = m + 1,

(see Example 5.3.1 and Theorem 5.3.2). Thus, the complexity bound for
problem (5.4.8) is O(√(m+1) · ln(m/ε)) iterations of a path-following scheme. Note that
this estimate does not depend on n.
In some applications, the functional components of the problem include a
nonsmooth quadratic term of the form ‖Ax − b‖, where the norm is the standard
Euclidean one. Let us show that we can treat such terms using an interior-point
technique.
Lemma 5.4.4 The function

       F(x, t) = −ln(t² − ‖x‖²)

is a 2-self-concordant barrier for the convex cone⁵

       K₂ = {(x, t) ∈ Rⁿ⁺¹ | t ≥ ‖x‖}.

5 Depending on the field, this set has different names: Lorentz cone, ice-cream cone, second-order

cone.

Proof Let us fix a point z = (x, t) ∈ int K₂ and a nonzero direction u = (h, τ) ∈
Rⁿ⁺¹. Let ξ(α) = (t + ατ)² − ‖x + αh‖². We need to compare the derivatives of
the function

       φ(α) = F(z + αu) = −ln ξ(α)

at α = 0. Let φ^{(·)} = φ^{(·)}(0) and ξ^{(·)} = ξ^{(·)}(0). Then

       ξ′ = 2(tτ − ⟨x, h⟩),   ξ″ = 2(τ² − ‖h‖²),   ξ‴ = 0,

       φ′ = −ξ′/ξ,   φ″ = (ξ′/ξ)² − ξ″/ξ,   φ‴ = 3ξ′ξ″/ξ² − 2(ξ′/ξ)³.

Note that the inequality 2φ″ ≥ (φ′)² is equivalent to (ξ′)² ≥ 2ξξ″. Thus, we need to
prove that for any (h, τ) we have

       (tτ − ⟨x, h⟩)² ≥ (t² − ‖x‖²)(τ² − ‖h‖²).

After opening the brackets and cancellation, we come to the inequality

       τ²‖x‖² + t²‖h‖² + ⟨x, h⟩² − 2τt⟨x, h⟩ ≥ ‖x‖²‖h‖².

Minimizing the left-hand side in τ, we get the inequalities

       t²‖h‖² + ⟨x, h⟩² − t²⟨x, h⟩²/‖x‖² ≥ ‖x‖²‖h‖²,

       ‖h‖²(t² − ‖x‖²) ≥ ⟨x, h⟩²(t²/‖x‖² − 1),

which are valid since t ≥ ‖x‖.
Finally, since 0 ≤ ξξ″/(ξ′)² ≤ ½ and [1 − ζ]^{3/2} ≥ 1 − (3/2)ζ, we get the following:

       |φ‴|/(φ″)^{3/2} = 2|ξ′|·|(ξ′)² − (3/2)ξξ″| / [(ξ′)² − ξξ″]^{3/2} ≤ 2. □
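The key inequality of this proof, (tτ − ⟨x, h⟩)² ≥ (t² − ‖x‖²)(τ² − ‖h‖²) for t ≥ ‖x‖, can also be spot-checked by random sampling (our own sanity test, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    x = rng.standard_normal(4)
    h = rng.standard_normal(4)
    t = np.linalg.norm(x) + rng.random() + 0.1   # guarantees (x, t) in int K_2
    tau = rng.standard_normal()
    lhs = (t * tau - x @ h) ** 2
    rhs = (t**2 - x @ x) * (tau**2 - h @ h)
    assert lhs >= rhs - 1e-9
```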

Let us prove that the barrier described in the above statement is optimal for the
second-order cone.
Lemma 5.4.5 The parameter ν of any self-concordant barrier for the set K2
satisfies the inequality ν ≥ 2.
Proof Let us choose z̄ = (0, 1) ∈ int K₂ and some h ∈ Rⁿ, ‖h‖ = 1. Define

       p₁ = (h, 1),   p₂ = (−h, 1),   α₁ = α₂ = ½,   β₁ = β₂ = ½.

Note that for all γ ≥ 0 we have z̄ + γp_i = (±γh, 1 + γ) ∈ K₂ and

       z̄ − β_i p_i = (±½h, ½) ∉ int K₂,

       z̄ − α₁p₁ − α₂p₂ = (−½h + ½h, 1 − ½ − ½) = 0 ∈ K₂.

Therefore, the conditions of Theorem 5.4.1 are satisfied and

       ν ≥ α₁/β₁ + α₂/β₂ = 2. □

5.4.4 Semidefinite Optimization

In Semidefinite Optimization, the decision variables are matrices. Let

       X = {X^{(i,j)}}_{i,j=1}^n

be a symmetric n × n-matrix (notation: X ∈ Sⁿ). The real vector space Sⁿ can be
provided with the following inner product: for any X, Y ∈ Sⁿ define

       ⟨X, Y⟩_F = Σ_{i=1}^n Σ_{j=1}^n X^{(i,j)} Y^{(i,j)},   ‖X‖_F = ⟨X, X⟩_F^{1/2}.

Sometimes the value ‖X‖_F is called the Frobenius norm of the matrix X. For
symmetric matrices X and Y, we have the following identity:

       ⟨X, Y·Y⟩_F = Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n X^{(i,j)} Y^{(i,k)} Y^{(j,k)}

                  = Σ_{k=1}^n Σ_{j=1}^n Σ_{i=1}^n Y^{(k,j)} X^{(j,i)} Y^{(i,k)} = Σ_{k=1}^n Σ_{j=1}^n Y^{(k,j)} (XY)^{(j,k)}

                  = Σ_{k=1}^n (YXY)^{(k,k)} = Trace(YXY) = ⟨YXY, I_n⟩_F.
                                                                        (5.4.10)
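Identity (5.4.10) can be confirmed numerically (a small check of ours, with arbitrary random symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
M1 = rng.standard_normal((4, 4))
M2 = rng.standard_normal((4, 4))
X, Y = M1 + M1.T, M2 + M2.T            # random symmetric matrices

lhs = np.trace(X @ (Y @ Y))            # <X, Y*Y>_F, using symmetry of X and Y
rhs = np.trace(Y @ X @ Y)              # Trace(YXY) = <YXY, I_n>_F
assert abs(lhs - rhs) < 1e-10
```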

In Semidefinite Optimization, a nontrivial part of the constraints is formed by
the cone of positive semidefinite n × n-matrices Sⁿ₊ ⊂ Sⁿ. Recall that X ∈ Sⁿ₊ if
and only if ⟨Xu, u⟩ ≥ 0 for any u ∈ Rⁿ. If ⟨Xu, u⟩ > 0 for all nonzero u, we call
X positive definite. Such matrices form the interior of the cone Sⁿ₊. Note that Sⁿ₊ is a
closed convex set.

The general formulation of the Semidefinite Optimization problem is as follows:

       min_{X∈Sⁿ₊} { ⟨C, X⟩_F : ⟨A_i, X⟩_F = b_i, i = 1 … m },        (5.4.11)

where C and all Ai belong to Sn . In order to apply a path-following scheme to this


problem, we need a self-concordant barrier for the cone Sn+ .
Let the matrix X belong to int Sⁿ₊. Define F(X) = −ln det X. Clearly,

       F(X) = −Σ_{i=1}^n ln λ_i(X),

where {λ_i(X)}_{i=1}^n is the set of eigenvalues of the matrix X.


Lemma 5.4.6 The function F is convex and ∇F(X) = −X⁻¹. Moreover, for any
direction Δ ∈ Sⁿ, we have

       ⟨∇²F(X)Δ, Δ⟩_F = ‖X^{−1/2}ΔX^{−1/2}‖²_F = ⟨X⁻¹ΔX⁻¹, Δ⟩_F

                      = Trace([X^{−1/2}ΔX^{−1/2}]²),

       D³F(X)[Δ, Δ, Δ] = −2⟨I_n, [X^{−1/2}ΔX^{−1/2}]³⟩_F

                       = −2 Trace([X^{−1/2}ΔX^{−1/2}]³).

Proof Let us fix some Δ ∈ Sⁿ and X ∈ int Sⁿ₊ such that X + Δ ∈ Sⁿ₊. Then

       F(X + Δ) − F(X) = −ln det(X + Δ) + ln det X

                       = −ln det(I_n + X^{−1/2}ΔX^{−1/2})

                       ≥ −ln [ (1/n) Trace(I_n + X^{−1/2}ΔX^{−1/2}) ]ⁿ

                       = −n ln( 1 + (1/n)⟨I_n, X^{−1/2}ΔX^{−1/2}⟩_F )

                       ≥ −⟨I_n, X^{−1/2}ΔX^{−1/2}⟩_F = −⟨X⁻¹, Δ⟩_F.

Thus, −X⁻¹ ∈ ∂F(X). Therefore, F is convex (Lemma 3.1.6) and ∇F(X) = −X⁻¹
(Lemma 3.1.7).

Further, consider the function φ(α) ≡ ⟨∇F(X + αΔ), Δ⟩_F, α ∈ [0, 1]. Then

       φ(α) − φ(0) = ⟨X⁻¹ − (X + αΔ)⁻¹, Δ⟩_F

                   = ⟨(X + αΔ)⁻¹[(X + αΔ) − X]X⁻¹, Δ⟩_F

                   = α⟨(X + αΔ)⁻¹ΔX⁻¹, Δ⟩_F.

Thus, φ′(0) = ⟨∇²F(X)Δ, Δ⟩_F = ⟨X⁻¹ΔX⁻¹, Δ⟩_F.
The last expression can be proved in a similar way by differentiating the function
ψ(α) = ⟨(X + αΔ)⁻¹Δ(X + αΔ)⁻¹, Δ⟩_F. □
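The formula ∇F(X) = −X⁻¹ can be confirmed by a finite-difference check (our own sketch; the point X and direction D are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4.0 * np.eye(4)          # a point in int S^n_+
D = rng.standard_normal((4, 4))
D = D + D.T                            # a symmetric direction

F = lambda M: -np.log(np.linalg.det(M))
eps = 1e-6
fd = (F(X + eps * D) - F(X - eps * D)) / (2 * eps)   # <F'(X), D>_F numerically
exact = np.trace(-np.linalg.inv(X) @ D)              # <-X^{-1}, D>_F
assert abs(fd - exact) < 1e-5
```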
Theorem 5.4.3 The function F is an n-self-concordant barrier for Sn+ .
Proof Let us fix X ∈ int Sⁿ₊ and Δ ∈ Sⁿ. Define Q = X^{−1/2}ΔX^{−1/2} and λ_i =
λ_i(Q), i = 1 … n. Then, in view of Lemma 5.4.6, we have

       ⟨∇F(X), Δ⟩_F = −Σ_{i=1}^n λ_i,

       ⟨∇²F(X)Δ, Δ⟩_F = Σ_{i=1}^n λ_i²,

       D³F(X)[Δ, Δ, Δ] = −2 Σ_{i=1}^n λ_i³.

Using the two standard inequalities

       ( Σ_{i=1}^n λ_i )² ≤ n Σ_{i=1}^n λ_i²,   | Σ_{i=1}^n λ_i³ | ≤ ( Σ_{i=1}^n λ_i² )^{3/2},

we obtain

       ⟨∇F(X), Δ⟩_F² ≤ n ⟨∇²F(X)Δ, Δ⟩_F,

       |D³F(X)[Δ, Δ, Δ]| ≤ 2 ⟨∇²F(X)Δ, Δ⟩_F^{3/2}. □

Let us prove that F (X) = − ln det X is the optimal barrier for Sn+ .
Lemma 5.4.7 The parameter ν of any self-concordant barrier for the cone Sn+
satisfies the inequality ν ≥ n.
Proof Let us choose X̄ = I_n ∈ int Sⁿ₊ and the directions P_i = e_ie_iᵀ, i = 1 … n,
where e_i is the ith coordinate vector of Rⁿ. Note that for any γ ≥ 0 we have
I_n + γP_i ∈ int Sⁿ₊. Moreover,

       I_n − e_ie_iᵀ ∉ int Sⁿ₊,   I_n − Σ_{i=1}^n e_ie_iᵀ = 0 ∈ Sⁿ₊.

Therefore the conditions of Theorem 5.4.1 are satisfied with α_i = β_i = 1, i = 1 … n,
and we obtain ν ≥ Σ_{i=1}^n α_i/β_i = n. □
As in the Linear Optimization problem (5.4.2), in problem (5.4.11) we need to use
the restriction of F onto the affine subspace

       L = {X : ⟨A_i, X⟩_F = b_i, i = 1 … m}.

This restriction is an n-self-concordant barrier in view of Theorem 5.3.3. Thus,
the complexity bound for problem (5.4.11) is O(√n · ln(n/ε)) iterations of a path-following
scheme. Note that this estimate is very encouraging since the dimension
of the problem (5.4.11) is ½n(n + 1).
Let us estimate the arithmetical cost of each iteration of a path-following
scheme (5.3.29) as applied to the problem (5.4.11). Note that we work with a
restriction of the barrier F to the set L . In view of Lemma 5.4.6, each Newton
step consists in solving the following problem:

       min_Δ { ⟨U, Δ⟩_F + ½⟨X⁻¹ΔX⁻¹, Δ⟩_F : ⟨A_i, Δ⟩_F = 0, i = 1 … m },

where X ≻ 0 belongs to L and U is a combination of the cost matrix C and
the gradient ∇F(X). In accordance with the statement (3.1.59), the solution of this
problem can be found from the following system of linear equations:
problem can be found from the following system of linear equations:


       U + X⁻¹ΔX⁻¹ = Σ_{j=1}^m λ^{(j)} A_j,
                                                        (5.4.12)
       ⟨A_i, Δ⟩_F = 0, i = 1 … m.

From the first equation in (5.4.12) we get


       Δ = X [ −U + Σ_{j=1}^m λ^{(j)} A_j ] X.        (5.4.13)

Substituting this expression into the second equation in (5.4.12), we get the linear
system


       Σ_{j=1}^m λ^{(j)} ⟨A_i, XA_jX⟩_F = ⟨A_i, XUX⟩_F,   i = 1 … m,        (5.4.14)

which can be written in matrix form as Sλ = d with

       S^{(i,j)} = ⟨A_i, XA_jX⟩_F,   d^{(j)} = ⟨U, XA_jX⟩_F,   i, j = 1 … m.

Thus, a straightforward strategy of solving system (5.4.12) consists in the following


steps.
• Compute the matrices XAj X, j = 1 . . . m. Cost: O(mn3 ) operations.
• Compute the elements of S and d. Cost: O(m2 n2 ) operations.
• Compute λ = S −1 d. Cost: O(m3 ) operations.
• Compute Δ by (5.4.13). Cost: O(mn2 ) operations.
Taking into account that m ≤ ½n(n+1), we conclude that the complexity of one
Newton step does not exceed

       O(n²(m + n)m) arithmetic operations.        (5.4.15)
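The straightforward strategy can be sketched as follows (a schematic dense implementation of (5.4.13)–(5.4.14); the function name and data layout are our own choices):

```python
import numpy as np

def sdp_newton_direction(U, X, A_list):
    """Solve system (5.4.12): Delta = X(-U + sum_j lam_j A_j)X
    with <A_i, Delta>_F = 0, i = 1..m."""
    XAX = [X @ A @ X for A in A_list]                       # O(m n^3)
    S = np.array([[np.trace(Ai @ XAj) for XAj in XAX] for Ai in A_list])
    d = np.array([np.trace(U @ XAj) for XAj in XAX])        # <U, X A_j X>_F
    lam = np.linalg.solve(S, d)                             # O(m^3)
    return X @ (-U + sum(l * A for l, A in zip(lam, A_list))) @ X
```

The returned Δ satisfies the orthogonality constraints ⟨A_i, Δ⟩_F = 0 up to rounding, provided the matrices A_i are linearly independent (so that S is nonsingular).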

However, if the matrices Aj possess a certain structure, then this estimate can be
significantly improved. For example, if all Aj are of rank 1:

       A_j = a_ja_jᵀ,   a_j ∈ Rⁿ,   j = 1 … m,

then the computation of the Newton step can be done in

O((m + n)3 ) arithmetic operations. (5.4.16)

We leave the justification of this claim as an exercise for the reader.


To conclude this section, note that in many important applications we can use the
barrier − ln det(·) to treat some functions of eigenvalues. Consider, for example, a
matrix A (x) ∈ Sn which depends linearly on x. Then the convex region

       {(x, t) | max_{1≤i≤n} λ_i(A(x)) ≤ t}

can be described by a self-concordant barrier

F (x, t) = − ln det(tIn − A (x)).

The value of the parameter of this barrier is equal to n.



5.4.5 Extremal Ellipsoids

In some applications, we are interested in approximating different sets by ellipsoids.


Let us consider the most important examples.

5.4.5.1 Circumscribed Ellipsoid

Given a set of points a1 , . . . , am ∈ Rn , find an ellipsoid W with the


minimal volume which contains all points {ai }.

Let us pose this problem in a formal way. First of all, note that any bounded
ellipsoid W ⊂ Rⁿ can be represented as

       W = {x ∈ Rⁿ | x = H⁻¹(v + u), ‖u‖ ≤ 1},

where H ∈ int Sⁿ₊, v ∈ Rⁿ, and the norm is the standard Euclidean one. Then the inclusion
a ∈ W is equivalent to the inequality ‖Ha − v‖ ≤ 1. Note also that

       vol_n W = vol_n B₂(0,1) · det H⁻¹ = vol_n B₂(0,1)/det H.

Thus, our problem is as follows:

       min_{H∈Sⁿ₊, v∈Rⁿ, τ∈R} { τ : −ln det H ≤ τ, ‖Ha_i − v‖ ≤ 1, i = 1 … m }.        (5.4.17)

In order to solve this problem by an interior-point scheme, we need to find a self-


concordant barrier for the feasible set. In view of Theorems 5.4.3 and 5.3.5, we
know self-concordant barriers for all components. Indeed, we can use the following
barrier:

       F(H, v, τ) = −ln det H − ln(τ + ln det H) − Σ_{i=1}^m ln(1 − ‖Ha_i − v‖²),   ν = m + n + 1.

The corresponding complexity bound is O(√(m+n+1) · ln((m+n)/ε)) iterations of a
path-following scheme.

5.4.5.2 Inscribed Ellipsoid with Fixed Center

Let Q be a convex polytope defined by a set of linear inequalities:

       Q = {x ∈ Rⁿ | ⟨a_i, x⟩ ≤ b_i, i = 1 … m},

and let v ∈ int Q. Find an ellipsoid W ⊂ Q with the biggest volume
which is centered at v.

Let us fix some H ∈ int Sⁿ₊. We can represent the ellipsoid W as

       W = {x ∈ Rⁿ | ⟨H⁻¹(x − v), x − v⟩ ≤ 1}.

We need the following simple result.


Lemma 5.4.8 Let ⟨a, v⟩ < b. The inequality ⟨a, x⟩ ≤ b is valid for all x ∈ W if
and only if

       ⟨Ha, a⟩ ≤ (b − ⟨a, v⟩)².

Proof In view of Lemma 3.1.20, we have

       max_u { ⟨a, u⟩ | ⟨H⁻¹u, u⟩ ≤ 1 } = ⟨Ha, a⟩^{1/2}.

Therefore, we need to ensure

       max_{x∈W} ⟨a, x⟩ = max_{x∈W} [⟨a, x − v⟩ + ⟨a, v⟩]

                       = ⟨a, v⟩ + max_u { ⟨a, u⟩ | ⟨H⁻¹u, u⟩ ≤ 1 }

                       = ⟨a, v⟩ + ⟨Ha, a⟩^{1/2} ≤ b.

This proves our statement since ⟨a, v⟩ < b. □
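The maximization fact used in this proof, max{⟨a, u⟩ : ⟨H⁻¹u, u⟩ ≤ 1} = ⟨Ha, a⟩^{1/2}, with maximizer u* = Ha/⟨Ha, a⟩^{1/2}, can be verified numerically (our own sketch with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
B = rng.standard_normal((n, n))
H = B @ B.T + np.eye(n)                       # H in int S^n_+
a = rng.standard_normal(n)

u_star = H @ a / np.sqrt(a @ H @ a)           # candidate maximizer
assert np.isclose(np.linalg.solve(H, u_star) @ u_star, 1.0)  # on the boundary
assert np.isclose(a @ u_star, np.sqrt(a @ H @ a))            # attains <Ha, a>^{1/2}

# no feasible point exceeds this value
for _ in range(200):
    u = rng.standard_normal(n)
    u = u / np.sqrt(np.linalg.solve(H, u) @ u)   # rescale onto the boundary
    assert a @ u <= np.sqrt(a @ H @ a) + 1e-9
```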



Note that vol_n W = vol_n B₂(0,1) · [det H]^{1/2}. Hence, our problem is as follows:

       min_{H∈Sⁿ₊, τ∈R} { τ : −ln det H ≤ τ, ⟨Ha_i, a_i⟩ ≤ (b_i − ⟨a_i, v⟩)², i = 1 … m }.        (5.4.18)

In view of Theorems 5.4.3 and 5.3.5, we can use the following self-concordant
barrier:

       F(H, τ) = −ln det H − ln(τ + ln det H) − Σ_{i=1}^m ln[(b_i − ⟨a_i, v⟩)² − ⟨Ha_i, a_i⟩],

with barrier parameter ν = m + n + 1. The complexity bound of the corresponding
path-following scheme is

       O(√(m+n+1) · ln((m+n)/ε))

iterations.

5.4.5.3 Inscribed Ellipsoid with Free Center

Let Q be a convex polytope defined by a set of linear inequalities:

       Q = {x ∈ Rⁿ | ⟨a_i, x⟩ ≤ b_i, i = 1 … m},

and let int Q ≠ ∅. Find an ellipsoid W with the biggest volume which
is contained in Q.

Let G ∈ int Sⁿ₊ and v ∈ int Q. We can represent W as follows:

       W = {x ∈ Rⁿ | ‖G⁻¹(x − v)‖ ≤ 1} ≡ {x ∈ Rⁿ | ⟨G⁻²(x − v), x − v⟩ ≤ 1}.

In view of Lemma 5.4.8, the inequality ⟨a, x⟩ ≤ b is valid for any x ∈ W if and only if

       ‖Ga‖² ≡ ⟨G²a, a⟩ ≤ (b − ⟨a, v⟩)².

This gives us a convex feasible set for the parameters (G, v):

       ‖Ga‖ ≤ b − ⟨a, v⟩.

Note that vol_n W = vol_n B₂(0,1) · det G. Therefore, our problem can be written as
follows:

       min_{G∈Sⁿ₊, v∈Rⁿ, τ∈R} { τ : −ln det G ≤ τ, ‖Ga_i‖ ≤ b_i − ⟨a_i, v⟩, i = 1 … m }.        (5.4.19)

In view of Theorems 5.4.3, 5.3.5 and Lemma 5.4.4, we can use the following
self-concordant barrier:

       F(G, v, τ) = −ln det G − ln(τ + ln det G) − Σ_{i=1}^m ln[(b_i − ⟨a_i, v⟩)² − ‖Ga_i‖²],

with barrier parameter ν = 2m + n + 1. The corresponding efficiency estimate is
O(√(2m+n+1) · ln((m+n)/ε)) iterations of a path-following scheme.

5.4.6 Constructing Self-concordant Barriers for Convex Sets

In this section we develop a general framework for constructing self-concordant


barriers for convex cones. First of all, let us define the objects we are working with.
They are related to three different real vector spaces, E1 , E2 , and E3 .
Consider a function ξ(·) : E1 → E2 defined on a closed convex set Q1 ⊂ E1 .
Assume that ξ is three times continuously differentiable and concave with respect
to a closed convex cone K ⊂ E2 :

       −D²ξ(x)[h, h] ∈ K   ∀x ∈ int Q₁, h ∈ E₁.        (5.4.20)

It is convenient to write this inclusion as D²ξ(x)[h, h] ⪯_K 0.


Definition 5.4.2 Let F(·) be a ν-self-concordant barrier for Q₁ and β ≥ 1. We say
that a function ξ is β-compatible with F if for all x ∈ int Q₁ and h ∈ E₁ we have

       D³ξ(x)[h, h, h] ⪯_K −3β · D²ξ(x)[h, h] · ⟨∇²F(x)h, h⟩^{1/2}.        (5.4.21)

Alternating the sign of the direction h in (5.4.21), we get the following equivalent
condition:

       −D³ξ(x)[h, h, h] ⪯_K −3β · D²ξ(x)[h, h] · ⟨∇²F(x)h, h⟩^{1/2}.        (5.4.22)

Note that the set of β-compatible functions is a convex cone: if functions ξ1


and ξ2 are β-compatible with barrier F , then the sum α1 ξ1 + α2 ξ2 , with arbitrary
α1 , α2 > 0, is also β-compatible with F .
Let us construct a self-concordant barrier for a composition of the set

       S₁ = {(x, y) ∈ Q₁ × E₂ : ξ(x) ⪰_K y}

and a convex set Q₂ ⊂ E₂ × E₃. That is,

       Q = {(x, z) ∈ Q₁ × E₃ : ∃y, ξ(x) ⪰_K y, (y, z) ∈ Q₂}.

The necessity of such a structure is clear from the following example.



Example 5.4.1 Let us fix some α ∈ (0, 1). Consider the following power cone:

       K_α = { (x^{(1)}, x^{(2)}, z) ∈ R²₊ × R : (x^{(1)})^α · (x^{(2)})^{1−α} ≥ |z| }.

For our representation, we need the following objects:

       E₁ = R²,   Q₁ = R²₊,   F(x) = −ln x^{(1)} − ln x^{(2)},   ν = 2,
       E₂ = R,   ξ(x) = (x^{(1)})^α · (x^{(2)})^{1−α},   K = R₊ ⊂ E₂,
       E₃ = R,   Q₂ = {(y, z) ∈ E₂ × E₃ : y ≥ |z|}. □
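In this example, concavity of ξ with respect to K = R₊ is ordinary concavity of the weighted geometric mean, which can be spot-checked numerically (our own sketch; the value of α and the sampling are arbitrary):

```python
import numpy as np

alpha = 0.3
xi = lambda x: x[0]**alpha * x[1]**(1.0 - alpha)   # xi(x) = (x1)^a (x2)^(1-a)

rng = np.random.default_rng(5)
for _ in range(500):
    x = rng.random(2) + 0.1                  # points in int R^2_+
    y = rng.random(2) + 0.1
    # midpoint concavity: xi((x+y)/2) >= (xi(x) + xi(y))/2
    assert xi((x + y) / 2) >= (xi(x) + xi(y)) / 2 - 1e-12
```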

In our construction, we also need a μ-self-concordant barrier Φ(y, z) for the set
Q₂. We assume that all directions from the cone K₀ := K × {0} ⊂ E₂ × E₃ are
recession directions of the set Q₂. Consequently, for any s ∈ K and (y, z) ∈ int Q₂
we have

       ⟨∇_yΦ(y, z), s⟩ = ⟨∇Φ(y, z), (s, 0)⟩ ≤ 0,        (5.4.23)

where the inequality follows from (5.3.13).

Consider the barrier

Ψ(x, z) = Φ(ξ(x), z) + β³F(x).

Let us fix a point (x, z) ∈ int Q and choose an arbitrary direction d = (h, v) ∈
E1 × E3 . Define

ξ′ = Dξ(x)[h], ξ″ = D²ξ(x)[h, h], ξ‴ = D³ξ(x)[h, h, h], l = (ξ′, v).

Let ψ(x, z) = Φ(ξ(x), z). Consider the following directional derivatives:

Δ1 := Dψ(x, z)[d] = ⟨∇_y Φ(ξ(x), z), ξ′⟩ + ⟨∇_z Φ(ξ(x), z), v⟩ = ⟨∇Φ(ξ(x), z), l⟩.

Note that l ≡ l(x). Therefore l′ := Dl(x)[d] = (ξ″, 0), which belongs to −K0 in
view of (5.4.20). Thus, we can continue:

Δ2 := D²ψ(x, z)[d, d] = ⟨∇²Φ(ξ(x), z)l, l⟩ + ⟨∇Φ(ξ(x), z), l′⟩
(5.4.24)
= ⟨∇²Φ(ξ(x), z)l, l⟩ + ⟨∇_y Φ(ξ(x), z), ξ″⟩ =: σ1 + σ2.
5.4 Applications to Problems with Explicit Structure 405

Since −l′ is a recession direction of Q2, by (5.3.13) we have σ2 ≥ 0. Finally,

Δ3 := D³ψ(x, z)[d, d, d]

= D³Φ(ξ(x), z)[l, l, l] + 3⟨∇²Φ(ξ(x), z)l, l′⟩ + ⟨∇_y Φ(ξ(x), z), ξ‴⟩.
(5.4.25)

Again, since −l′ is a recession direction of Q2,

⟨∇²Φ(ξ(x), z)l, l′⟩ ≤ ⟨∇²Φ(ξ(x), z)l, l⟩^{1/2} · ⟨∇²Φ(ξ(x), z)l′, l′⟩^{1/2}

≤ ⟨∇²Φ(ξ(x), z)l, l⟩^{1/2} · ⟨−∇Φ(ξ(x), z), −l′⟩ = σ1^{1/2} σ2,

where the second inequality follows from (5.3.13).
Further, let σ3 = ⟨∇²F(x)h, h⟩. Since ξ is β-compatible with F (see (5.4.22)), we
have, in view of (5.4.23),

⟨−∇_y Φ(ξ(x), z), −ξ‴⟩ ≤ 3β⟨−∇_y Φ(ξ(x), z), −ξ″⟩ · σ3^{1/2} = 3β · σ2 · σ3^{1/2}.

Thus, substituting these inequalities into (5.4.25) and using (5.1.4), we obtain

Δ3 ≤ 2σ1^{3/2} + 3σ1^{1/2} σ2 + 3β · σ2 · σ3^{1/2}.

Consider now D_k, k = 1, …, 3, the directional derivatives of the function Ψ. Note
that

D2 = Δ2 + β³σ3 = σ1 + σ2 + β³σ3 ≥ σ1 + σ2 + β²σ3. (5.4.26)

Therefore, in view of (5.1.4),

D3 = Δ3 + β³D³F(x)[h, h, h] ≤ Δ3 + 2β³σ3^{3/2}

≤ 2σ1^{3/2} + 3σ1^{1/2}σ2 + 3β · σ2 · σ3^{1/2} + 2β³σ3^{3/2}

= (σ1^{1/2} + βσ3^{1/2})(2σ1 − 2βσ1^{1/2}σ3^{1/2} + 2β²σ3 + 3σ2)

≤ (σ1^{1/2} + βσ3^{1/2})(3D2 − (σ1^{1/2} + βσ3^{1/2})²) ≤ 2D2^{3/2},

where the next-to-last inequality follows from (5.4.26).

Thus, we come to the following statement.


Theorem 5.4.4 Let the function ξ(·) : E1 → E2 satisfy the following conditions.
• It is concave with respect to a convex cone K ⊂ E2 .
• It is β-compatible with self-concordant barrier F (·) for a set Q ⊆ dom ξ .

Assume in addition that Φ(·, ·) is a μ-self-concordant barrier for a closed convex
set Q2 ⊂ E2 × E3, and that all directions of the cone K × {0} ⊂ E2 × E3 are
recession directions of the set Q2. Then the function

Ψ(x, z) = Φ(ξ(x), z) + β³F(x) (5.4.27)

is a self-concordant barrier for the set {(x, z) ∈ Q × E3 : ∃y, ξ(x) ⪰_K y, (y, z) ∈
Q2} with barrier parameter ν̂ = μ + β³ν.
Proof We need to justify only the value of the barrier parameter ν̂. Indeed,

D1 = ⟨∇Φ(ξ(x), z), l⟩ + β³⟨∇F(x), h⟩ ≤ √μ · σ1^{1/2} + β³√ν · σ3^{1/2}

≤ max_{σ1,σ3≥0} { √μ · σ1^{1/2} + β³√ν · σ3^{1/2} : σ1 + β³σ3 ≤ D2 } = √ν̂ · D2^{1/2},

where the constraint follows from (5.4.26). It remains to use definition (5.3.6). □



Note that in construction (5.4.27) the function ξ must be compatible only with
the barrier F . The function Φ can be an arbitrary self-concordant barrier for the set
Q2 .

5.4.7 Examples of Self-concordant Barriers

Despite its complicated formulation, Theorem 5.4.4 is very convenient for con-
structing a good self-concordant barrier for convex cones. Let us confirm this claim
with several examples.
1. The power cone and the epigraph of the ℓp-norm. Let us fix some α ∈ (0, 1). To
the description of the representation of the power cone

Kα = {(x^{(1)}, x^{(2)}, z) ∈ R²₊ × R : (x^{(1)})^α · (x^{(2)})^{1−α} ≥ |z|},

given in Example 5.4.1, we need to add only a definition of the barrier function for
the set Q2. In view of Lemma 5.4.4, we can take

Φ(y, z) = −ln(y² − z²),

with barrier parameter μ = 2. Thus, all conditions of Theorem 5.4.4 are clearly
satisfied except β-compatibility.
Let us prove that the function ξ(x) = (x^{(1)})^α · (x^{(2)})^{1−α} is β-compatible with the
barrier F(x) = −ln x^{(1)} − ln x^{(2)}. Let us choose a direction h ∈ R² and x ∈ int R²₊.

Define

δ1 = h^{(1)}/x^{(1)}, δ2 = h^{(2)}/x^{(2)}, σ = δ1² + δ2².

Let us compute the directional derivatives:

Dξ(x)[h] = [αh^{(1)}/x^{(1)} + (1−α)h^{(2)}/x^{(2)}] · ξ(x) = [αδ1 + (1−α)δ2] · ξ(x),

D²ξ(x)[h, h] = −[αδ1² + (1−α)δ2²] · ξ(x) + [αδ1 + (1−α)δ2] · Dξ(x)[h]

= −α(1−α)(δ1 − δ2)² · ξ(x),

D³ξ(x)[h, h, h] = 2α(1−α)(δ1 − δ2) · (δ1² − δ2²) · ξ(x)

− α(1−α)(δ1 − δ2)² · Dξ(x)[h]

= ξ(x) · α(1−α)(δ1 − δ2)² · [2δ1 + 2δ2 − αδ1 − (1−α)δ2]

= −D²ξ(x)[h, h] · [(2−α)δ1 + (1+α)δ2].

Since (2−α)δ1 + (1+α)δ2 ≤ [(2−α)² + (1+α)²]^{1/2} σ^{1/2} < 3σ^{1/2}, we conclude
that ξ is 1-compatible with F. Therefore, in view of Theorem 5.4.4, the function

ΨP(x, z) = −ln((x^{(1)})^{2α} · (x^{(2)})^{2(1−α)} − z²) − ln x^{(1)} − ln x^{(2)} (5.4.28)

is a 4-self-concordant barrier for cone Kα .
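As a numerical sanity check of this computation (an illustration added here, not part of the original derivation; all names are ad hoc), the 1-compatibility inequality (5.4.21) can be verified by finite differences at random interior points:

```python
import math, random

def xi(x1, x2, a):
    # xi(x) = (x1^a)(x2^(1-a)), the power function of Example 5.4.1
    return x1**a * x2**(1 - a)

def dir_derivs(x1, x2, h1, h2, a, eps=1e-4):
    # finite-difference directional derivatives of xi along h = (h1, h2)
    f = lambda t: xi(x1 + t*h1, x2 + t*h2, a)
    d1 = (f(eps) - f(-eps)) / (2*eps)
    d2 = (f(eps) - 2*f(0) + f(-eps)) / eps**2
    d3 = (f(2*eps) - 2*f(eps) + 2*f(-eps) - f(-2*eps)) / (2*eps**3)
    return d1, d2, d3

random.seed(1)
a = 0.3
for _ in range(100):
    x1, x2 = random.uniform(0.5, 2), random.uniform(0.5, 2)
    h1, h2 = random.uniform(-1, 1), random.uniform(-1, 1)
    _, d2, d3 = dir_derivs(x1, x2, h1, h2, a)
    sigma = (h1/x1)**2 + (h2/x2)**2   # <F''(x)h, h> for F = -ln x1 - ln x2
    assert d2 <= 1e-6                 # concavity of xi
    assert d3 <= -3*d2*math.sqrt(sigma) + 1e-3   # 1-compatibility (5.4.21)
print("1-compatibility of the power function verified at 100 points")
```

Here `sigma` plays the role of ⟨∇²F(x)h, h⟩; the small tolerances absorb finite-difference error.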


A similar structure can be used to construct a self-concordant barrier for the cone
 
Kα+ = {(x^{(1)}, x^{(2)}, z) ∈ R²₊ × R : (x^{(1)})^α · (x^{(2)})^{1−α} ≥ z}.

In this case, we can choose Φ(y, z) = −ln(y − z) with parameter μ = 1. Thus, by
Theorem 5.4.4, we get the following 3-self-concordant barrier:

ΨP+(x, z) = −ln((x^{(1)})^α · (x^{(2)})^{1−α} − z) − ln x^{(1)} − ln x^{(2)}. (5.4.29)

Let us show that this barrier has the best possible value of the parameter.
Lemma 5.4.9 Any ν-self-concordant barrier for the cone Kα+ has ν ≥ 3.
Proof Note that the cone Kα+ has three recession directions:

p1 = (1, 0, 0)ᵀ, p2 = (0, 1, 0)ᵀ, p3 = (0, 0, −1)ᵀ.



Let us choose a parameter τ > 0 and define x̄ = (1, 1, −τ)ᵀ. Note that

x̄ − p1 ∈ ∂Kα+, x̄ − p2 ∈ ∂Kα+, x̄ − (1 + τ)p3 ∈ ∂Kα+.

On the other hand, x̄ − p1 − p2 − τp3 = 0 ∈ Kα+. Thus, to apply Theorem 5.4.1,
we can choose

α1 = α2 = 1, α3 = τ, β1 = β2 = 1, β3 = 1 + τ.

Hence, ν ≥ Σ_{i=1}^3 (αi/βi) = 2 + τ/(1 + τ). It remains to compute the limit as
τ → +∞. □

i=1
Note that the barrier ΨP(x, z) can be used to construct a 4n-self-concordant barrier
for the epigraph of the ℓp-norm in Rⁿ:

Kp = {(τ, z) ∈ R × Rⁿ : τ ≥ ∥z∥_{(p)}}, 1 ≤ p ≤ ∞,

where ∥z∥_{(p)} := (Σ_{i=1}^n |z^{(i)}|^p)^{1/p}. Let us assume that α := 1/p ∈ (0, 1). Then it
is easy to prove that the point (τ, z) belongs to Kp if and only if there exists an
x ∈ Rⁿ₊ satisfying the conditions

(x^{(i)})^α · τ^{1−α} ≥ |z^{(i)}|, i = 1, …, n,
(5.4.30)
Σ_{i=1}^n x^{(i)} = τ.

Thus, a self-concordant barrier for the cone Kp can be implemented by restricting
the (4n)-self-concordant barrier

Ψα(τ, x, z) = −Σ_{i=1}^n [ln((x^{(i)})^{2α} · τ^{2(1−α)} − (z^{(i)})²) + ln x^{(i)} + ln τ] (5.4.31)

onto the hyperplane Σ_{i=1}^n x^{(i)} = τ.
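The "if" direction of this equivalence admits an explicit witness: for τ ≥ ∥z∥_(p) one can take x^{(i)} proportional to |z^{(i)}|^p. The following sketch (the function name is mine, not from the book) checks the conditions (5.4.30) numerically:

```python
import random

def lp_witness(tau, z, p):
    """For tau >= ||z||_p, return x in R^n_+ with sum(x) = tau and
    (x_i)^(1/p) * tau^(1-1/p) >= |z_i| for all i, i.e. conditions (5.4.30)."""
    s = sum(abs(zi)**p for zi in z)
    if s == 0:
        return [tau / len(z)] * len(z)
    return [tau * abs(zi)**p / s for zi in z]

random.seed(0)
p, n = 3.0, 5
z = [random.uniform(-1, 1) for _ in range(n)]
norm_p = sum(abs(zi)**p for zi in z) ** (1/p)
tau = norm_p * 1.1                      # any tau >= ||z||_p works
x = lp_witness(tau, z, p)
alpha = 1/p
assert abs(sum(x) - tau) < 1e-12        # lies on the hyperplane
for xv, zi in zip(x, z):
    assert xv**alpha * tau**(1 - alpha) >= abs(zi) - 1e-12
print("witness for conditions (5.4.30) verified")
```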
2. The conic hull of the epigraph of the entropy function. We need to describe
the conic hull of the following set:
 
{(x^{(1)}, z) : z ≥ x^{(1)} ln x^{(1)}, x^{(1)} > 0}.

Introducing a projective variable x^{(2)} > 0, we obtain the cone

Q = {(x^{(1)}, x^{(2)}, z) : z ≥ x^{(1)} · [ln x^{(1)} − ln x^{(2)}], x^{(1)}, x^{(2)} > 0}. (5.4.32)

Let us represent it in the format of Theorem 5.4.4:

E1 = R², Q1 = R²₊, F(x) = −ln x^{(1)} − ln x^{(2)}, ν = 2,

E2 = R, ξ(x) = −x^{(1)} · [ln x^{(1)} − ln x^{(2)}], K = R₊,

E3 = R, Q2 = {(y, z) : y + z ≥ 0}, Φ(y, z) = −ln(y + z), μ = 1.

Let us show that ξ is 1-compatible with F . We use the notation of the previous
example.

Dξ(x)[h] = δ1 · ξ(x) − x^{(1)} · [δ1 − δ2],

D²ξ(x)[h, h] = −δ1² · ξ(x) + δ1 · Dξ(x)[h] − h^{(1)} · [δ1 − δ2] + x^{(1)} · [δ1² − δ2²]

= x^{(1)} · [−2δ1(δ1 − δ2) + δ1² − δ2²] = −x^{(1)} · (δ1 − δ2)²,

D³ξ(x)[h, h, h] = −h^{(1)} · (δ1 − δ2)² + 2x^{(1)} · (δ1 − δ2) · (δ1² − δ2²)

= x^{(1)}(δ1 − δ2)² · [−δ1 + 2(δ1 + δ2)]

= −D²ξ(x)[h, h] · [δ1 + 2δ2].

Since δ1 + 2δ2 ≤ √5 · σ^{1/2} < 3σ^{1/2}, we conclude that ξ is 1-compatible with F.
Therefore, in view of Theorem 5.4.4 the function

ΨE(x, z) = −ln(z − x^{(1)} · ln(x^{(1)}/x^{(2)})) − ln x^{(1)} − ln x^{(2)} (5.4.33)

is a 3-self-concordant barrier for the cone Q. It is interesting that the same barrier
can also describe the epigraphs of the logarithm and the exponent. Indeed,

Q ∩ {x : x^{(1)} = 1} = {(x^{(2)}, z) : z ≥ −ln x^{(2)}} = {(x^{(2)}, z) : x^{(2)} ≥ e^{−z}}.
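The barrier (5.4.33) can also be probed numerically: along any direction, a ν-self-concordant barrier must satisfy ⟨∇F, h⟩² ≤ ν⟨∇²F h, h⟩ and |D³F[h, h, h]| ≤ 2⟨∇²F h, h⟩^{3/2}. A finite-difference sketch (tolerances and helper names are ad hoc assumptions of this illustration):

```python
import math, random

def Psi_E(x1, x2, z):
    # barrier (5.4.33) for the cone z >= x1*ln(x1/x2), x1, x2 > 0
    return -math.log(z - x1*math.log(x1/x2)) - math.log(x1) - math.log(x2)

def check_scb(point, h, nu=3.0, eps=1e-3):
    f = lambda t: Psi_E(*[p + t*d for p, d in zip(point, h)])
    d1 = (f(eps) - f(-eps)) / (2*eps)
    d2 = (f(eps) - 2*f(0) + f(-eps)) / eps**2
    d3 = (f(2*eps) - 2*f(eps) + 2*f(-eps) - f(-2*eps)) / (2*eps**3)
    assert d2 > 0
    assert d1*d1 <= nu*d2 + 1e-2          # barrier-parameter inequality
    assert abs(d3) <= 2*d2**1.5 + 5e-2    # self-concordance

random.seed(2)
for _ in range(50):
    x1, x2 = random.uniform(0.5, 2), random.uniform(0.5, 2)
    z = x1*math.log(x1/x2) + random.uniform(0.5, 2)   # strictly feasible
    h = [random.uniform(-1, 1) for _ in range(3)]
    check_scb((x1, x2, z), h)
print("self-concordance inequalities for Psi_E hold at 50 random points")
```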

Let us show that we can use the 3-self-concordant barrier

ψE(x, y, τ) = −ln(τ ln(y/τ) − x) − ln y − ln τ,
(5.4.34)
(x, y, τ) ∈ int E, where E := {(x, y, τ) : y ≥ τe^{x/τ}, τ > 0} ⊂ R³,

in more complicated situations. Consider the conic hull of the epigraph of the
following function:

fn(x) := ln Σ_{i=1}^n e^{x^{(i)}}, x ∈ Rⁿ,
(5.4.35)
Q := {(x, t, τ) ∈ Rⁿ × R × R : t ≥ τ fn(x/τ), τ > 0}.

Clearly, (x, t, τ) ∈ Q if and only if

fn((x − t · ēn)/τ) ≤ 0,

where ēn ∈ Rⁿ is the vector of all ones. Therefore, we can model Q as a projection
of the following cone:

Q̂ = {(x, y, t, τ) ∈ Rⁿ × Rⁿ × R × R : y^{(i)} ≥ τ e^{(x^{(i)}−t)/τ}, i = 1, …, n,

Σ_{i=1}^n y^{(i)} = τ}.

This cone admits a 3n-self-concordant barrier, obtained as a restriction of the
function

ΨL(x, y, t, τ) = −Σ_{i=1}^n [ln(t + τ ln y^{(i)} − x^{(i)} − τ ln τ) + ln y^{(i)} + ln τ],
(5.4.36)

onto the hyperplane Σ_{i=1}^n y^{(i)} = τ.
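The lifting into Q̂ can be illustrated directly: for a strictly feasible point of Q, the choice y^{(i)} = τe^{(x^{(i)}−t)/τ} satisfies all exponential inequalities with equality, and its coordinates sum to at most τ (any surplus can then be added to the y^{(i)} without violating the inequalities). A small sketch with ad hoc names:

```python
import math, random

def fn(x):                      # log-sum-exp, function (5.4.35)
    return math.log(sum(math.exp(xi) for xi in x))

random.seed(3)
n = 4
x = [random.uniform(-1, 1) for _ in range(n)]
tau = 1.5
t = tau * fn([xi/tau for xi in x]) + 0.1      # strictly inside Q

# candidate lifting into Q_hat
y = [tau * math.exp((xi - t)/tau) for xi in x]
assert all(yi >= tau*math.exp((xi - t)/tau) - 1e-12 for yi, xi in zip(y, x))
assert sum(y) <= tau            # slack, because t is strictly feasible
print("strictly feasible point of Q lifted into the cone Q_hat")
```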
3. The geometric mean. Let x ∈ Rⁿ₊ and a ∈ Δn := {y ∈ Rⁿ₊ : Σ_{i=1}^n y^{(i)} = 1}.
Without loss of generality, we can consider a with positive components. Define

ξ(x) := x^a = Π_{i=1}^n (x^{(i)})^{a^{(i)}}.

Let us write down the directional derivatives of this function along some h ∈ Rⁿ.
Define

δx^{(i)}(h) = h^{(i)}/x^{(i)}, i = 1, …, n,

δx(h) = (δx^{(1)}(h), …, δx^{(n)}(h))ᵀ,

F(x) = −Σ_{i=1}^n ln x^{(i)}.

Clearly, ∥h∥x := ⟨F″(x)h, h⟩^{1/2} = ∥δx(h)∥, where the norm is standard Euclidean.
Note that

D(ln ξ(x))[h] = (1/ξ(x)) Dξ(x)[h] = ⟨a, δx(h)⟩.

Thus, Dξ(x)[h] = ξ(x) · ⟨a, δx(h)⟩. Denoting by [x]^k ∈ Rⁿ the component-wise
power of a vector x ∈ Rⁿ, we obtain:

D²ξ(x)[h, h] = ξ(x) · ⟨a, δx(h)⟩² − ξ(x) · ⟨a, [δx(h)]²⟩

= −ξ(x) · ⟨a, [δx(h) − ⟨a, δx(h)⟩ · ēn]²⟩ =: −ξ(x) · S2.

Further, defining ξ = ξ(x) and δ = δx(h), we obtain:

D³ξ(x)[h, h, h] = ξ⟨a, δ⟩³ + 2ξ⟨a, δ⟩⟨a, −[δ]²⟩ − ξ⟨a, δ⟩⟨a, [δ]²⟩ − ξ⟨a, −2[δ]³⟩

= ξ(⟨a, δ⟩³ − 3⟨a, δ⟩⟨a, [δ]²⟩ + 2⟨a, [δ]³⟩).

Define

S3 = ⟨a, [δ − ⟨a, δ⟩ēn]³⟩ = ⟨a, [δ]³ − 3⟨a, δ⟩[δ]² + 3⟨a, δ⟩²δ − ⟨a, δ⟩³ēn⟩

= ⟨a, [δ]³⟩ − 3⟨a, δ⟩⟨a, [δ]²⟩ + 2⟨a, δ⟩³.

Then, in this new notation, we have

D³ξ(x)[h, h, h] = ξ(⟨a, δ⟩³ − 3⟨a, δ⟩⟨a, [δ]²⟩ + 2[S3 + 3⟨a, δ⟩⟨a, [δ]²⟩ − 2⟨a, δ⟩³])

= ξ(2S3 + 3⟨a, δ⟩⟨a, [δ]²⟩ − 3⟨a, δ⟩³) = ξ(2S3 + 3⟨a, δ⟩S2).

Therefore,

D³ξ(x)[h, h, h] ≤ ξS2 (3⟨a, δ⟩ + 2 max_{1≤i≤n} [δ^{(i)} − ⟨a, δ⟩])

≤ ξS2 (⟨a, δ⟩ + 2 max_{1≤i≤n} |δ^{(i)}|)

≤ 3ξS2 ∥δ∥ = −3D²ξ(x)[h, h] · ⟨∇²F(x)h, h⟩^{1/2}.



Thus, we have proved that ξ is 1-compatible with F. This means that the function

Ψ(x, t) = −ln(ξ(x) − t) + F(x), x ∈ int Rⁿ₊, (5.4.37)

is an (n + 1)-self-concordant barrier for the hypograph of the function ξ. Moreover,
since the set of β-compatible functions is a convex cone, any sum

ξ(x) = Σ_{k=1}^m αk x^{a_k}, (5.4.38)

with αk > 0 and ak ∈ Δn, k = 1, …, m, is 1-compatible with F. Hence, for
such functions formula (5.4.37) is also applicable and the parameter of this barrier
remains equal to n + 1.
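A quick computational illustration of this construction (names are mine, not the book's): the weighted geometric mean is concave, so the barrier (5.4.37) is finite exactly on its hypograph with x > 0.

```python
import math, random

def xi_a(x, a):
    # weighted geometric mean x^a = prod_i x_i^{a_i}
    return math.prod(xv**av for xv, av in zip(x, a))

def Psi(x, t, a):
    # barrier (5.4.37); finite only for x > 0 and t < xi_a(x)
    return -math.log(xi_a(x, a) - t) - sum(math.log(xv) for xv in x)

random.seed(4)
n = 4
a = [random.uniform(0.1, 1.0) for _ in range(n)]
s = sum(a); a = [av/s for av in a]              # a in the simplex
for _ in range(100):
    x = [random.uniform(0.5, 2) for _ in range(n)]
    y = [random.uniform(0.5, 2) for _ in range(n)]
    mid = [(xv + yv)/2 for xv, yv in zip(x, y)]
    # midpoint concavity of the geometric mean
    assert xi_a(mid, a) >= (xi_a(x, a) + xi_a(y, a))/2 - 1e-12
t = 0.5*xi_a([1.0]*n, a)
assert math.isfinite(Psi([1.0]*n, t, a))
print("geometric mean is concave; barrier (5.4.37) finite on its hypograph")
```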
Note that functions of the form (5.4.38) sometimes arise in optimization
problems related to polynomials. Indeed, assume we need to solve the problem

max_y { p(y) = Σ_{k=1}^m αk y^{b_k} : y ≥ 0, ∥y∥_{(d)} ≤ 1 },

where all bk belong to d · Δn and ∥y∥_{(d)} = (Σ_{i=1}^n (y^{(i)})^d)^{1/d}. Then for the new
variables y^{(i)} = (x^{(i)})^{1/d}, i = 1, …, n, our problem becomes convex with a
concave objective ξ(·) given by (5.4.38).
4. The hypograph of the exponent of a self-concordant barrier. Let F(·)
be a ν-self-concordant barrier for the set Dom F. Let us fix p ≥ ν and consider the
function ξp(x) = exp(−F(x)/p). As we have proved in Lemma 5.3.1, this function
is concave on dom F. Consider the following set:

Hp = {(x, t) ∈ dom F × R : ξp(x) ≥ t}.

Let us construct a self-concordant barrier for this set.
In our framework, Q1 = Dom F, Q2 = {(y, t) ∈ R² : y ≥ t}, K = R₊, and
Φ(y, t) = −ln(y − t) with μ = 1. Let us prove that ξp(x) is concave with respect
to K, and that it is β-compatible with F.
Let us fix x ∈ dom F and an arbitrary direction h ∈ E. Then

ξ′ := Dξp(x)[h] = −(1/p)⟨∇F(x), h⟩ ξp(x),

ξ″ := D²ξp(x)[h, h] = (1/p²)⟨∇F(x), h⟩² ξp(x) − (1/p)⟨∇²F(x)h, h⟩ ξp(x),

ξ‴ := D³ξp(x)[h, h, h] = −(1/p³)⟨∇F(x), h⟩³ ξp(x)

+ (3/p²)⟨∇F(x), h⟩ · ⟨∇²F(x)h, h⟩ ξp(x) − (1/p)D³F(x)[h, h, h] ξp(x).


As we have already seen, in view of (5.3.6), we have ξ″ ≤ 0. This means that ξp is
concave with respect to K.
Let ξ = ξp(x), D1 = ⟨∇F(x), h⟩, D2 = ⟨∇²F(x)h, h⟩^{1/2}, and τ = (ξ/p)D2². Then

ξ″ = (ξ/p²)D1² − τ ≤ 0,

ξ‴ ≤ (2ξ/p)D2³ + (3ξ/p²)D1D2² − (ξ/p³)D1³ = 2τD2 + (1/p)D1(3τ − (ξ/p²)D1²)

= 2τD2 + (1/p)D1(2τ − ξ″) ≤ 2τD2 + (√ν/p)D2(2τ − ξ″),

where the first inequality follows from (5.1.4) and the last one from (5.3.6).
Note that ξ″ + τ = (ξ/p²)D1² ≤ (ν/p)τ, again by (5.3.6). Thus, τ ≤ (p/(p−ν))(−ξ″),
and therefore

ξ‴ ≤ D2 [2(1 + √ν/p)τ + (√ν/p)(−ξ″)] ≤ D2 (2√p/(√p − √ν) + √ν/p)(−ξ″).

This means that for p ≥ (1 + ν)² the function ξp(x) is 1-compatible with F, and
by Theorem 5.4.4 we get a (ν + 1)-self-concordant barrier

ΨH(x, t) = −ln(exp(−F(x)/p) − t) + F(x) (5.4.39)

for the set Hp.


5. The matrix epigraph of the inverse matrix. Consider the following set

In = {(X, Y) ∈ Sⁿ₊ × Sⁿ₊ : X⁻¹ ⪯ Y}.

In order to construct a barrier for this set, consider the mapping ξ(X) = −X⁻¹.
It is defined on the set of positive definite matrices, for which we know a ν-self-
concordant barrier F(X) = −ln det X with the barrier parameter ν = n (see
Theorem 5.4.3). Let us show that ξ is 1-compatible with F.
Indeed, let us fix an arbitrary direction H ∈ Sⁿ. By the same reasoning as in
Lemma 5.4.6, we can prove that

Dξ(X)[H] = X⁻¹HX⁻¹,

D²ξ(X)[H, H] = −2X⁻¹HX⁻¹HX⁻¹ ∈ −Sⁿ₊,

D³ξ(X)[H, H, H] = 6X⁻¹HX⁻¹HX⁻¹HX⁻¹.



Let A = X^{−1/2}HX^{−1/2} and ρ = max_{1≤i≤n} |λi(A)|. Then, in view of Lemma 5.4.6,

⟨∇²F(X)H, H⟩ = ∥A∥²_F = Σ_{i=1}^n λi²(A) ≥ ρ².

On the other hand,

D³ξ(X)[H, H, H] = 6X^{−1/2}A³X^{−1/2} ⪯ 6ρ X^{−1/2}A²X^{−1/2}

⪯ 6⟨∇²F(X)H, H⟩^{1/2} X^{−1/2}A²X^{−1/2}

= 3⟨∇²F(X)H, H⟩^{1/2} · (−D²ξ(X)[H, H]).

Thus, condition (5.4.21) is satisfied with β = 1. Hence, by Theorem 5.4.4 the
function

F(X, Y) = −ln det(Y − X⁻¹) − ln det X (5.4.40)

is a ν-self-concordant barrier for In with ν = 2n.
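A numerical illustration of the barrier (5.4.40) (a sketch with names of my choosing; NumPy's `slogdet` is used to detect when the argument leaves the domain):

```python
import numpy as np

def F_In(X, Y):
    """Barrier (5.4.40) for I_n = {(X, Y): X > 0, Y >= X^{-1}}."""
    Xinv = np.linalg.inv(X)
    s1, d1 = np.linalg.slogdet(Y - Xinv)
    s2, d2 = np.linalg.slogdet(X)
    if s1 <= 0 or s2 <= 0:
        return np.inf               # outside the domain
    return -d1 - d2

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n*np.eye(n)           # positive definite
Y = np.linalg.inv(X) + np.eye(n)    # strictly above X^{-1}
assert np.isfinite(F_In(X, Y))
assert F_In(X, np.linalg.inv(X)) == np.inf   # boundary: Y - X^{-1} singular
print("barrier (5.4.40) is finite inside I_n and +inf on its boundary")
```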


Lemma 5.4.10 Any self-concordant barrier for the set In has parameter ν ≥ 2n.
Proof Let us choose γ > 1 and consider the matrices X̄ = Ȳ = γIn. Clearly the point
(X̄, Ȳ) belongs to int In. Note that for positive definite matrices, the relation Y ⪰ X⁻¹
holds if and only if X ⪰ Y⁻¹. Therefore, all directions

pi = (ei eiᵀ, 0), qi = (0, ei eiᵀ), i = 1, …, n,

are recession directions of the set In. It is easy to check that for β = γ − 1/γ we get

(X̄, Ȳ) − βpi ∈ ∂In, (X̄, Ȳ) − βqi ∈ ∂In, i = 1, …, n.

On the other hand, for α = γ − 1, we have Ȳ − α Σ_{i=1}^n ei eiᵀ = In = (X̄ − α Σ_{i=1}^n ei eiᵀ)⁻¹.
Therefore, in the conditions of Theorem 5.4.1 we can take all αi = α and all βi = β.
Thus, we obtain ν ≥ 2n(α/β) = 2nγ/(1 + γ). Since γ can be arbitrarily big, we come
to the bound ν ≥ 2n. □

5.4.8 Separable Optimization

In problems of Separable Optimization all nonlinear terms in functional components


are represented by univariate functions. A general formulation of such a problem is

as follows:
 
min_{x∈Rⁿ} { q0(x) = Σ_{j=1}^{m0} α_{0,j} f_{0,j}(⟨a_{0,j}, x⟩ + b_{0,j}) :
(5.4.41)
qi(x) = Σ_{j=1}^{mi} α_{i,j} f_{i,j}(⟨a_{i,j}, x⟩ + b_{i,j}) ≤ βi, i = 1 … m },

where α_{i,j} are some positive coefficients, a_{i,j} ∈ Rⁿ and f_{i,j}(·) are convex functions
of one variable. Let us rewrite this problem in the standard form:

min_{x∈Rⁿ, τ∈R^{m+1}, t∈R^M} { τ0 : Σ_{j=1}^{mi} α_{i,j} t_{i,j} ≤ τi, i = 0 … m, τi ≤ βi, i = 1 … m,

f_{i,j}(⟨a_{i,j}, x⟩ + b_{i,j}) ≤ t_{i,j}, j = 1 … mi, i = 0 … m },
(5.4.42)

where M = Σ_{i=0}^m mi. Thus, in order to construct a self-concordant barrier for the
feasible set of this problem, we need barriers for epigraphs of univariate convex
functions fi,j . Let us point out such barriers for several important examples.

5.4.8.1 Logarithm and Exponent

By fixing the first coordinate in the barrier (5.4.33), we obtain the barrier function
F1 (x, t) = − ln x − ln(ln x + t), which is a 3-self-concordant barrier for the set

Q1 = {(x, t) ∈ R2 | x > 0, t ≥ − ln x}.

Similarly, we obtain the function F2 (x, t) = − ln t − ln(ln t − x) as a 3-self-


concordant barrier for the set

Q2 = {(x, t) ∈ R2 | t ≥ ex }.

5.4.8.2 Entropy Function

By fixing the second coordinate in the barrier (5.4.33), we obtain the barrier function
F3 (x, t) = − ln x − ln(t − x ln x), which is a 3-self-concordant barrier for the set

Q3 = {(x, t) ∈ R2 | x ≥ 0, t ≥ x ln x}.

5.4.8.3 Increasing Power Functions

Let p ≥ 1 and define α = 1/p. By fixing the second variable in the barrier (5.4.28),
x^{(2)} = 1, we get the function F4(x, t) = −ln t − ln(t^{2/p} − x²), which is a 4-self-
concordant barrier for the set

Q4 = {(x, t) ∈ R² | t ≥ |x|^p}, p ≥ 1.

If p < 1, then a similar operation with the barrier (5.4.29) gives us the function
F5(x, t) = −ln t − ln(t^p − x), which is a 3-self-concordant barrier for the set

Q5 = {(x, t) ∈ R² | t ≥ 0, t^p ≥ x}, 0 < p ≤ 1.

5.4.8.4 Decreasing Power Functions


Let p > 0. Define α = p/(p + 1). Then by fixing z = 1 in the barrier (5.4.29), we get
the function F6(x, t) = −ln x − ln t − ln(x^α t^{1−α} − 1), which is a 3-self-concordant
barrier for the set

Q6 = {(x, t) ∈ R² | x > 0, t ≥ 1/x^p}.
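The univariate barriers of this subsection can be collected into a tiny library and tested on points inside and outside the corresponding sets Q1, …, Q6; the code below is an illustration only, with ad hoc names:

```python
import math

def barrier(terms):
    """-sum(ln term); +inf as soon as one argument leaves the domain."""
    if any(term <= 0 for term in terms):
        return math.inf
    return -sum(math.log(term) for term in terms)

# Q1: t >= -ln x            F1(x,t) = -ln x - ln(ln x + t)
F1 = lambda x, t: barrier([x, math.log(x) + t]) if x > 0 else math.inf
# Q2: t >= e^x              F2(x,t) = -ln t - ln(ln t - x)
F2 = lambda x, t: barrier([t, math.log(t) - x]) if t > 0 else math.inf
# Q3: t >= x ln x, x >= 0   F3(x,t) = -ln x - ln(t - x ln x)
F3 = lambda x, t: barrier([x, t - x*math.log(x)]) if x > 0 else math.inf
# Q4: t >= |x|^p, p >= 1    F4(x,t) = -ln t - ln(t^(2/p) - x^2)
F4 = lambda x, t, p: barrier([t, t**(2/p) - x*x]) if t > 0 else math.inf
# Q5: t >= 0, t^p >= x      F5(x,t) = -ln t - ln(t^p - x), 0 < p <= 1
F5 = lambda x, t, p: barrier([t, t**p - x]) if t > 0 else math.inf
# Q6: x > 0, t >= 1/x^p     F6(x,t) = -ln x - ln t - ln(x^a t^(1-a) - 1)
def F6(x, t, p):
    a = p/(p + 1)
    return barrier([x, t, x**a * t**(1 - a) - 1]) if min(x, t) > 0 else math.inf

assert math.isfinite(F1(2.0, 0.0))                    # 0 > -ln 2
assert F2(1.0, 3.0) < math.inf and F2(1.0, 2.0) == math.inf
assert math.isfinite(F3(0.5, 0.0))                    # 0 > 0.5 ln 0.5
assert math.isfinite(F4(1.0, 2.0, 3.0)) and F4(2.0, 1.0, 3.0) == math.inf
assert math.isfinite(F5(0.5, 1.0, 0.5))               # 1^0.5 > 0.5
assert math.isfinite(F6(2.0, 1.0, 1.0))               # 1 > 1/2
print("all six univariate barriers are finite exactly inside their sets")
```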

Let us conclude our discussion with two examples.

5.4.8.5 Geometric Optimization

The initial formulation of such problems is as follows:

min_{x∈Rⁿ₊₊} { q0(x) = Σ_{j=1}^{m0} α_{0,j} Π_{l=1}^n (x^{(l)})^{σ^{(l)}_{0,j}} :
(5.4.43)
qi(x) = Σ_{j=1}^{mi} α_{i,j} Π_{l=1}^n (x^{(l)})^{σ^{(l)}_{i,j}} ≤ 1, i = 1 … m },

where Rⁿ₊₊ is the interior of the positive orthant, and α_{i,j} are some positive
coefficients. Note that the problem (5.4.43) is not convex.
Let us introduce the vectors a_{i,j} = (σ^{(1)}_{i,j}, …, σ^{(n)}_{i,j}) ∈ Rⁿ, and change variables:

x^{(i)} = e^{y^{(i)}}, i = 1, …, n.

Then problem (5.4.43) can be written in a convex form:

min_{y∈Rⁿ} { Σ_{j=1}^{m0} α_{0,j} exp(⟨a_{0,j}, y⟩) : Σ_{j=1}^{mi} α_{i,j} exp(⟨a_{i,j}, y⟩) ≤ 1, i = 1 … m }.
(5.4.44)

Let M = Σ_{i=0}^m mi. The complexity of solving (5.4.44) by a path-following scheme
is O(M^{1/2} · ln(M/ε)) iterations.
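The convexifying change of variables can be demonstrated on a toy posynomial (the coefficients and exponents below are arbitrary stand-ins): after x = e^y the function value is unchanged, and a midpoint-convexity test passes.

```python
import math, random

def posynomial(x, alphas, exps):
    # q(x) = sum_j alpha_j * prod_i x_i^{sigma_{j,i}}  (not convex in x)
    return sum(a * math.prod(xv**s for xv, s in zip(x, sig))
               for a, sig in zip(alphas, exps))

def transformed(y, alphas, exps):
    # same value after x_i = e^{y_i}: sum_j alpha_j exp(<a_j, y>), convex in y
    return sum(a * math.exp(sum(sv*yv for sv, yv in zip(sig, y)))
               for a, sig in zip(alphas, exps))

random.seed(5)
alphas = [1.0, 2.5, 0.3]
exps = [[1.2, -0.7], [0.5, 0.5], [-2.0, 1.0]]
for _ in range(100):
    y1 = [random.uniform(-1, 1) for _ in range(2)]
    y2 = [random.uniform(-1, 1) for _ in range(2)]
    mid = [(a + b)/2 for a, b in zip(y1, y2)]
    x1 = [math.exp(v) for v in y1]
    # value agreement with the original posynomial
    assert abs(posynomial(x1, alphas, exps) - transformed(y1, alphas, exps)) < 1e-9
    # midpoint convexity in the new variables
    assert transformed(mid, alphas, exps) <= (transformed(y1, alphas, exps)
                                              + transformed(y2, alphas, exps))/2 + 1e-12
print("change of variables x = e^y makes the posynomial convex")
```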

5.4.8.6 Approximation in an ℓp-Norm

The simplest problem of this type is as follows:

min_{x∈Rⁿ} { Σ_{i=1}^m |⟨ai, x⟩ − b^{(i)}|^p : α ≤ x ≤ β }, (5.4.45)

where p ≥ 1 and α, β ∈ Rⁿ. Clearly, we can rewrite this problem in an equivalent
standard form:

min_{x∈Rⁿ, τ∈R^{m+1}} { τ^{(0)} : |⟨ai, x⟩ − b^{(i)}|^p ≤ τ^{(i)}, i = 1 … m,
(5.4.46)
Σ_{i=1}^m τ^{(i)} ≤ τ^{(0)}, α ≤ x ≤ β }.

The complexity bound for this problem is O(√(m + n) · ln((m + n)/ε)) iterations of
a path-following scheme.
We have discussed the performance of Interior-Point Methods for several pure
optimization problems. However, it is important that we can apply these methods
to mixed problems. For example, in problems (5.4.11) or (5.4.45) we can also treat
the quadratic constraints. To do this, we need to construct a corresponding self-
concordant barrier. Such barriers are known for all important functional components
arising in practical applications.

5.4.9 Choice of Minimization Scheme

We have seen that the majority of convex optimization problems can be solved
by Interior-Point Methods. However, the same problems can also be solved by
methods of Nonsmooth Optimization. In general, we cannot say which approach is
better, since the answer depends on the individual structure of a particular problem.
However, the complexity estimates for optimization schemes are often helpful in
making the choice. Let us consider a simple example.

Assume we are going to solve a problem of finding the best approximation in an
ℓp-norm:

min_{x∈Rⁿ} { Σ_{i=1}^m |⟨ai, x⟩ − b^{(i)}|^p : α ≤ x ≤ β }, (5.4.47)

where p ≥ 1. We have two available numerical methods:


• The Ellipsoid Method (Sect. 3.2.8).
• The Interior-Point Path-Following Scheme.
Which of them should we use? The answer can be derived from the complexity
analysis of the corresponding schemes.
Firstly, let us estimate the performance of the Ellipsoid Method as applied to
problem (5.4.47).

Complexity of the Ellipsoid Method

Number of iterations: O(n² ln(1/ε)),
Complexity of the oracle: O(mn) operations,
Complexity of the iteration: O(n²) operations.

Total complexity: O(n³(m + n) ln(1/ε)) operations.

The analysis of the Path-Following Method is more involved. First of all, we
should form a barrier model of the problem:

min_{x∈Rⁿ, τ∈R^m, ξ∈R} { ξ : |⟨ai, x⟩ − b^{(i)}|^p ≤ τ^{(i)}, i = 1 … m,

Σ_{i=1}^m τ^{(i)} ≤ ξ, α ≤ x ≤ β },
(5.4.48)

F(x, τ, ξ) = Σ_{i=1}^m f(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) − ln(ξ − Σ_{i=1}^m τ^{(i)})

− Σ_{i=1}^n [ln(x^{(i)} − α^{(i)}) + ln(β^{(i)} − x^{(i)})],

where f(y, t) = −ln t − ln(t^{2/p} − y²).



We have seen that the parameter of the barrier F(x, τ, ξ) is ν = 4m + n + 1.
Therefore, the Path-Following Scheme needs at most O(√(4m + n + 1) · ln((m + n)/ε))
iterations.
At each iteration of this method, we need to compute the gradient and the Hessian
of the barrier F(x, τ, ξ). Define

g1(y, t) = f′_y(y, t), g2(y, t) = f′_t(y, t).

Then

∇x F(x, τ, ξ) = Σ_{i=1}^m g1(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) ai − Σ_{i=1}^n [1/(x^{(i)} − α^{(i)}) − 1/(β^{(i)} − x^{(i)})] ei,

F′_{τ(i)}(x, τ, ξ) = g2(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) + (ξ − Σ_{j=1}^m τ^{(j)})⁻¹,

F′_ξ(x, τ, ξ) = −(ξ − Σ_{i=1}^m τ^{(i)})⁻¹.

Further, defining

h11(y, t) = f″_{yy}(y, t), h12(y, t) = f″_{yt}(y, t), h22(y, t) = f″_{tt}(y, t),

we obtain

∇²_{xx} F(x, τ, ξ) = Σ_{i=1}^m h11(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) ai aiᵀ

+ diag(1/(x^{(i)} − α^{(i)})² + 1/(β^{(i)} − x^{(i)})²),

∇²_{τ(i)x} F(x, τ, ξ) = h12(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) ai,

F″_{τ(i),τ(i)}(x, τ, ξ) = h22(⟨ai, x⟩ − b^{(i)}, τ^{(i)}) + (ξ − Σ_{j=1}^m τ^{(j)})⁻²,

F″_{τ(i),τ(j)}(x, τ, ξ) = (ξ − Σ_{l=1}^m τ^{(l)})⁻², i ≠ j,

∇²_{xξ} F(x, τ, ξ) = 0, F″_{τ(i),ξ}(x, τ, ξ) = −(ξ − Σ_{j=1}^m τ^{(j)})⁻²,

F″_{ξ,ξ}(x, τ, ξ) = (ξ − Σ_{i=1}^m τ^{(i)})⁻².

Thus, the complexity of the second-order oracle in the Path-Following Scheme is
O(mn²) arithmetic operations.
Let us estimate now the complexity of each iteration. The main source of
computations at each iteration is the solution of the Newton system. Let

ρ = (ξ − Σ_{i=1}^m τ^{(i)})⁻², s_i = ⟨ai, x⟩ − b^{(i)}, i = 1 … m,

and

Λ0 = diag(1/(x^{(i)} − α^{(i)})² + 1/(β^{(i)} − x^{(i)})²)_{i=1}^n, Λ1 = diag(h11(s_i, τ^{(i)}))_{i=1}^m,

Λ2 = diag(h12(s_i, τ^{(i)}))_{i=1}^m, D = diag(h22(s_i, τ^{(i)}))_{i=1}^m.

Then, using the notation A = (a1, …, am), ēm = (1, …, 1)ᵀ ∈ Rᵐ, the Newton
system can be written in the following form:

[Λ0 + AΛ1Aᵀ]Δx + AΛ2Δτ = ∇x F(x, τ, ξ),

Λ2AᵀΔx + [D + ρ ēmēmᵀ]Δτ − ρ ēmΔξ = F′_τ(x, τ, ξ), (5.4.49)

−ρ⟨ēm, Δτ⟩ + ρΔξ = F′_ξ(x, τ, ξ) + t,

where t is the penalty parameter. From the second equation in (5.4.49), we obtain

Δτ = [D + ρ ēmēmᵀ]⁻¹(F′_τ(x, τ, ξ) − Λ2AᵀΔx + ρ ēmΔξ).

Substituting Δτ into the first equation in (5.4.49), we have

Δx = [Λ0 + A(Λ1 − Λ2[D + ρ ēmēmᵀ]⁻¹Λ2)Aᵀ]⁻¹ {∇x F(x, τ, ξ)

− AΛ2[D + ρ ēmēmᵀ]⁻¹(F′_τ(x, τ, ξ) + ρ ēmΔξ)}.
Using these relations, we can find Δξ from the last equation in (5.4.49).
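The elimination order just described can be prototyped on a system with the same block pattern. In the sketch below, all matrices are random stand-ins (not the actual barrier Hessians), ρ denotes the scalar (ξ − Σᵢτ^{(i)})⁻², and the τ–τ block is taken as D + ρēēᵀ, consistently with the Hessian entries above; the staged solution is compared with a dense solve:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 5, 8
A  = rng.standard_normal((n, m))            # columns a_1, ..., a_m
L0 = np.diag(rng.uniform(1, 2, n))          # Lambda_0
L1 = np.diag(rng.uniform(1, 2, m))          # Lambda_1
L2 = np.diag(rng.uniform(0.1, 0.5, m))      # Lambda_2
D  = np.diag(rng.uniform(1, 2, m))          # D
rho = 0.7                                   # stands for (xi - sum tau)^(-2)
e = np.ones(m)
T = D + rho*np.outer(e, e)                  # tau-tau block of the Hessian

# Full Newton matrix in the variables (dx, dtau, dxi)
M = np.zeros((n + m + 1, n + m + 1))
M[:n, :n] = L0 + A @ L1 @ A.T
M[:n, n:n+m] = A @ L2
M[n:n+m, :n] = L2 @ A.T
M[n:n+m, n:n+m] = T
M[n:n+m, -1] = -rho*e
M[-1, n:n+m] = -rho*e
M[-1, -1] = rho
r = rng.standard_normal(n + m + 1)
rx, rt, rxi = r[:n], r[n:n+m], r[-1]

Tinv = np.linalg.inv(T)
S = L0 + A @ (L1 - L2 @ Tinv @ L2) @ A.T    # Schur complement for dx

def dx_of(dxi):
    return np.linalg.solve(S, rx - A @ L2 @ Tinv @ (rt + rho*e*dxi))

def dtau_of(dxi):
    return Tinv @ (rt - L2 @ A.T @ dx_of(dxi) + rho*e*dxi)

def last_eq(dxi):                           # residual of the last equation
    return -rho*(e @ dtau_of(dxi)) + rho*dxi - rxi

g0, g1 = last_eq(0.0), last_eq(1.0)         # last_eq is affine in dxi
dxi = -g0/(g1 - g0)
sol = np.concatenate([dx_of(dxi), dtau_of(dxi), [dxi]])
assert np.allclose(M @ sol, r)
print("block elimination matches the dense solution of the Newton system")
```

The Schur complement S is n × n and costs O(mn²) to form, which is where the O(n³ + mn²) estimate for one iteration comes from.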
Thus, the Newton system (5.4.49) can be solved in O(n³ + mn²) operations. This
implies that the total complexity of the Path-Following Scheme can be estimated as

O(n²(m + n)^{3/2} · ln((m + n)/ε))

arithmetic operations. Comparing this estimate with the bound for the Ellipsoid
Method, we conclude that the Interior-Point Method is more efficient if m is not too
big, namely, if m ≤ O(n²).

Of course, this analysis is valid only if the methods behave in accordance with
their worst-case complexity bounds. For the Ellipsoid Method this is indeed true.
However, Interior-Point Path-Following Schemes can be accelerated by long-step
strategies. The explanation of these abilities requires the introduction of a primal-
dual setting of the optimization problems, posed in a conic form. Because of the
volume constraints, we have decided not to touch on this deep theory in the present
book.
Chapter 6
The Primal-Dual Model of an Objective
Function

In the previous chapters, we have proved that in the Black-Box framework the
non-smooth optimization problems are much more difficult than the smooth ones.
However, very often we know the explicit structure of the functional components. In
this chapter we show how this knowledge can be used to accelerate the minimization
methods and to extract useful information about the dual counterpart of the problem.
The main acceleration idea is based on the approximation of a nondifferentiable
function by a differentiable one. We develop a technique for creating computable
smoothed versions of non-differentiable functions and minimize them by Fast
Gradient Methods. The number of iterations of the resulting methods is proportional
to the square root of the number of iterations of the standard subgradient scheme.
At the same time, the complexity of each iteration does not change. This technique
can be used either in the primal form, or in the symmetric primal-dual form. We
include in this chapter an example of application of this approach to the problem of
Semidefinite Optimization. The chapter is concluded by analysis of performance of
the Conditional Gradient method, which is based only on solving at each iteration
an auxiliary problem of minimization of a linear function. We show that this method
can also reconstruct the primal-dual solution of the problem. A similar idea is used
in the second-order Trust Region Method with contraction, the first method of this
type with provable global worst-case performance guarantees.

6.1 Smoothing for an Explicit Model of an Objective Function

(The minimax model of non-differentiable objective functions; The Fast Gradient Method
for arbitrary norms and composite objective function; Application examples: minimax
strategies for matrix games, the continuous location problem, variational inequalities with
linear operator, minimization of piece-wise linear functions; Implementation issues.)

© Springer Nature Switzerland AG 2018 423


Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_6
424 6 The Primal-Dual Model of an Objective Function

6.1.1 Smooth Approximations of Non-differentiable Functions

As we have seen in Chap. 3, subgradient methods solve the problem of Nonsmooth
Convex Optimization in

O(1/ε²) (6.1.1)

calls of the oracle, where ε is the desired absolute accuracy of finding the
approximate solution in the function value. Moreover, we have already seen that the
efficiency bound of the simplest Subgradient Method cannot be improved uniformly
in the dimension of the space of variables (see Sect. 3.2). Of course, this statement
is valid only for a Black-Box model of the objective function. However, the proof is
constructive: it can be shown that the simplest problems, like

min_{x∈Rⁿ} { γ max_{1≤i≤k} x^{(i)} + (μ/2)∥x∥² }, 1 ≤ k ≤ n,

where the norm is standard Euclidean, are difficult for all numerical schemes.
The extremal simplicity of these functions possibly explains a common pessimistic
belief that the actual worst-case complexity bound for finding an ε-approximation
of the minimal value of a piece-wise linear function by gradient schemes is indeed
given by (6.1.1).
In fact, this is not absolutely true. In practice, we almost never meet a pure
Black-Box model. We always know something about the structure of the underlying
objects (we have already discussed this in Sect. 5.1.1), and the proper use of this
structure can and does help in constructing more efficient schemes.
In this section, we discuss one such possibility based on constructing a smooth
approximation of a nonsmooth function. Let us look at the following situation.
Consider a function f which is convex on E. Assume that f satisfies the following
growth condition:

f(x) ≤ f(0) + L∥x∥, ∀x ∈ E, (6.1.2)

where the Euclidean norm ∥x∥ = ⟨Bx, x⟩^{1/2} is defined by a self-adjoint positive
definite linear operator B : E → E∗. Define the Fenchel conjugate of the function
f as follows:

f∗(s) = sup_{x∈E} [⟨s, x⟩ − f(x)], s ∈ E∗. (6.1.3)

Clearly, this function is closed and convex in view of Theorem 3.1.8. Its domain is
not empty since by Theorem 3.1.20

dom f∗ ⊇ ∂f (x), ∀x ∈ E.
6.1 Smoothing for an Explicit Model of an Objective Function 425

At the same time, dom f∗ is bounded: in view of (6.1.2),

∥s∥∗ ≤ L ∀s ∈ dom f∗. (6.1.4)

Note that for all x ∈ E and g ∈ ∂f(x), we have

f(x) + f∗(g) = ⟨g, x⟩. (6.1.5)

Hence, for any s ∈ dom f∗ this implies that

f∗(s) ≥ ⟨s, x⟩ − f(x) = f∗(g) + ⟨s − g, x⟩,

where the inequality follows from (6.1.3) and the equality from (6.1.5).

In other words, if g ∈ ∂f (x), then x ∈ ∂f∗ (g).


Let us prove the following relation (compare with general Theorem 3.1.16).
Lemma 6.1.1 For all x ∈ E, we have

f(x) = max_{s∈dom f∗} [⟨s, x⟩ − f∗(s)].

Proof Indeed, in view of (6.1.3), for any s ∈ dom f∗ we have ⟨s, x⟩ − f∗(s) ≤ f(x),
and, in view of (6.1.5), equality is achieved for s ∈ ∂f(x). □

Let us now look at the following smooth approximation of the function f:

fμ(x) = max_{s∈dom f∗} { ⟨s, x⟩ − f∗(s) − (μ/2)(∥s∥∗)² }, (6.1.6)

where μ ≥ 0 is a smoothing parameter and the dual norm is defined as ∥s∥∗ =
⟨s, B⁻¹s⟩^{1/2}. In view of Lemma 6.1.1 and (6.1.4), we have

f(x) ≥ fμ(x) ≥ f(x) − (μ/2)L², ∀x ∈ E. (6.1.7)

On the other hand, it appears that the function fμ has a Lipschitz continuous
gradient.
Lemma 6.1.2 The function fμ is differentiable on E, and for any points x1 and
x2 ∈ E we have

∥∇fμ(x1) − ∇fμ(x2)∥∗ ≤ (1/μ)∥x1 − x2∥. (6.1.8)

Proof Consider two points x1 and x2 from E. Let si∗ , i = 1, 2 be the optimal
solutions of the corresponding optimization problems in (6.1.6). They are uniquely
defined since the objective function in definition (6.1.6) is strongly concave.

Note that by Theorem 3.1.14, si∗ ∈ ∂fμ (xi ), i = 1, 2. On the other hand, by the
first-order optimality condition of Theorem 3.1.20, there exist vectors x̃i ∈ ∂f∗ (si∗ )
such that

⟨s − si∗, xi − x̃i − μB⁻¹si∗⟩ ≤ 0, ∀s ∈ dom f∗, i = 1, 2.

Taking in this inequality s = s∗_{3−i} and adding the two resulting copies with
i = 1, 2, we get

μ(∥s1∗ − s2∗∥∗)² ≤ ⟨s1∗ − s2∗, x1 − x̃1 − (x2 − x̃2)⟩ ≤ ⟨s1∗ − s2∗, x1 − x2⟩

≤ ∥s1∗ − s2∗∥∗ · ∥x1 − x2∥,

where the second inequality follows from (3.1.24). Thus, ∥s1∗ − s2∗∥∗ ≤ (1/μ)∥x1 − x2∥.
Now, applying Lemma 3.1.10, we get ∇fμ(xi) = si∗, i = 1, 2. □

Of course, the smooth approximation (6.1.6) of the function f is not very practical
since its internal minimization problem includes a potentially complicated function
f∗. However, it already gives us some hints. Indeed, if we choose μ ≈ ε, then the
Lipschitz constant Lμ for the gradient of fμ will be O(1/ε). Therefore, Fast Gradient
Methods (e.g. (2.2.20)) can find an ε-approximation of the function fμ (and hence
of f) in O(√(Lμ/ε)) ≈ O(1/ε) calls of the oracle.
It remains to find a systematic and computationally inexpensive way of approx-
imating the initial non-smooth objective function by a function with a Lipschitz
continuous gradient. This can be done by exploiting a special max-representation of
the objective function, which we introduce in Sect. 6.1.2.
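In the simplest scalar case f(x) = |x| (so L = 1 and B = 1), the maximization in (6.1.6) can be done by hand and yields the classical Huber function; the bounds (6.1.7) and the Lipschitz estimate (6.1.8) are then easy to verify. A small sketch (names are mine):

```python
def f(x):                 # f(x) = |x|; here dom f* = [-1, 1] and f*(s) = 0 on it
    return abs(x)

def f_mu(x, mu):
    # max over |s| <= 1 of (s*x - mu*s^2/2): the Huber function
    if abs(x) <= mu:
        return x*x/(2*mu)
    return abs(x) - mu/2

def grad_f_mu(x, mu):     # the maximizer s*(x), which is the gradient of f_mu
    return max(-1.0, min(1.0, x/mu))

mu, L = 0.1, 1.0
pts = [i/100 for i in range(-300, 301)]
for x in pts:
    # sandwich inequality (6.1.7): f - mu*L^2/2 <= f_mu <= f
    assert f(x) - mu*L*L/2 - 1e-12 <= f_mu(x, mu) <= f(x) + 1e-12
for x, y in zip(pts, pts[1:]):
    # Lipschitz continuity of the gradient with constant 1/mu, cf. (6.1.8)
    assert abs(grad_f_mu(x, mu) - grad_f_mu(y, mu)) <= (1/mu)*abs(x - y) + 1e-12
print("Huber smoothing of |x| satisfies (6.1.7) and (6.1.8)")
```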
For our goals, it is convenient to use the following notation. We often work with
two finite-dimensional real vector spaces E1 and E2 . In these spaces, we use the
corresponding scalar products and general norms

⟨s, x⟩_{E_i}, ∥x∥_{E_i}, ∥s∥∗_{E_i}, x ∈ E_i, s ∈ E∗_i, i = 1, 2,

which are not necessarily Euclidean. The norm of a linear operator A : E1 → E∗2 is
defined in the standard way:

∥A∥_{1,2} = max_{x,u} { ⟨Ax, u⟩_{E_2} : ∥x∥_{E_1} = 1, ∥u∥_{E_2} = 1 }.

Clearly,

∥A∥_{1,2} = ∥A∗∥_{2,1} = max_x { ∥Ax∥∗_{E_2} : ∥x∥_{E_1} = 1 }

= max_u { ∥A∗u∥∗_{E_1} : ∥u∥_{E_2} = 1 }.
6.1 Smoothing for an Explicit Model of an Objective Function 427

Hence, for any x ∈ E1 and u ∈ E2 we have

∥Ax∥∗_{E_2} ≤ ∥A∥_{1,2} · ∥x∥_{E_1}, ∥A∗u∥∗_{E_1} ≤ ∥A∥_{1,2} · ∥u∥_{E_2}. (6.1.9)

6.1.2 The Minimax Model of an Objective Function

In this section, our main problem of interest is as follows:

Find $f^* = \min_x \{ f(x) : x \in Q_1 \}, \qquad (6.1.10)$

where Q1 is a bounded closed convex set in a finite-dimensional real vector space


E1 , and f (·) is a continuous convex function on Q1 . We do not assume f to be
differentiable.
Quite often, the structure of the objective function in (6.1.10) is given explicitly.
Let us assume that this structure can be described by the following model:

$f(x) = \hat f(x) + \max_u \{ \langle Ax, u \rangle_{E_2} - \hat\phi(u) : u \in Q_2 \}, \qquad (6.1.11)$

where the function fˆ(·) is continuous and convex on Q1 , Q2 is a bounded closed


convex set in a finite-dimensional real vector space E2 , φ̂(·) is a continuous
convex function on Q2 , and the linear operator A maps E1 to E2∗ . In this case,
problem (6.1.10) can be written in an adjoint form. Indeed,

$f^* = \min_{x \in Q_1} \max_{u \in Q_2} \{ \hat f(x) + \langle Ax, u \rangle_{E_2} - \hat\phi(u) \} \overset{(1.3.6)}{\ge} \max_{u \in Q_2} \min_{x \in Q_1} \{ \hat f(x) + \langle Ax, u \rangle_{E_2} - \hat\phi(u) \}.$

Thus, the adjoint problem can be stated as follows:

$f_* = \max_{u \in Q_2} \phi(u), \qquad \phi(u) = -\hat\phi(u) + \min_{x \in Q_1} \{ \langle Ax, u \rangle_{E_2} + \hat f(x) \}. \qquad (6.1.12)$

However, the complexity of this problem is not completely identical to that of (6.1.10). Indeed, in the primal problem (6.1.10), we implicitly assume that the function $\hat\phi(\cdot)$ and the set $Q_2$ are so simple that the solution of the optimization problem in (6.1.11) can be found in a closed form. This assumption may not be valid for the objects defining the function $\phi(\cdot)$.

Note that usually, for a convex function $f$, representation (6.1.11) is not uniquely defined. If we decide to use, for example, the Fenchel dual of $f$,

$\hat\phi(u) \equiv f_*(u) = \max_x \{ \langle u, x \rangle_{E_1} - f(x) : x \in E_1 \}, \qquad Q_2 \equiv E_2 = E_1^*,$

then we can take $\hat f(x) \equiv 0$ and $A = I_n$, the identity operator. However, in this case the function $\hat\phi(\cdot)$ may be too complicated for our goals. Intuitively, it is clear that the bigger the dimension of the space $E_2$, the simpler the structure of the adjoint objects defined by the function $\hat\phi(\cdot)$ and the set $Q_2$. Let us demonstrate this with an example.
Example 6.1.1 Consider $f(x) = \max_{1 \le j \le m} |\langle a_j, x \rangle_{E_1} - b^{(j)}|$. Let us choose $A = I_n$, $E_2 = E_1^* = \mathbb{R}^n$, and

$\hat\phi(u) = f_*(u) = \max_x \Big\{ \langle u, x \rangle_{E_1} - \max_{1 \le j \le m} |\langle a_j, x \rangle_{E_1} - b^{(j)}| \Big\}$

$= \max_x \min_{s \in \mathbb{R}^m} \Big\{ \langle u, x \rangle_{E_1} - \sum_{j=1}^m s^{(j)} [\langle a_j, x \rangle_{E_1} - b^{(j)}] : \sum_{j=1}^m |s^{(j)}| \le 1 \Big\}$

$= \min_{s \in \mathbb{R}^m} \Big\{ \langle b, s \rangle_{E_2} : \sum_{j=1}^m s^{(j)} a_j = u, \; \sum_{j=1}^m |s^{(j)}| \le 1 \Big\}.$

It is clear that the structure of such a function can be very complicated.

Let us look at another possibility. Note that

$f(x) = \max_{1 \le j \le m} |\langle a_j, x \rangle_{E_1} - b^{(j)}| = \max_{u \in \mathbb{R}^m} \Big\{ \sum_{j=1}^m u^{(j)} [\langle a_j, x \rangle_{E_1} - b^{(j)}] : \sum_{j=1}^m |u^{(j)}| \le 1 \Big\}.$

In this case $E_2 = \mathbb{R}^m$, $\hat\phi(u) = \langle b, u \rangle_{E_2}$ and $Q_2 = \Big\{ u \in \mathbb{R}^m : \sum_{j=1}^m |u^{(j)}| \le 1 \Big\}$.

Finally, we can also represent $f(x)$ as follows:

$f(x) = \max_{u = (u_1, u_2) \in \mathbb{R}^{2m}_+} \Big\{ \sum_{j=1}^m (u_1^{(j)} - u_2^{(j)}) \cdot [\langle a_j, x \rangle_{E_1} - b^{(j)}] : \sum_{j=1}^m (u_1^{(j)} + u_2^{(j)}) = 1 \Big\}.$

In this case $E_2 = \mathbb{R}^{2m}$, $\hat\phi(u)$ is a linear function and $Q_2$ is a simplex. In Sect. 6.1.4.4 we will see that this representation is the easiest one.
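The identity behind the second representation, $\max_j |c^{(j)}| = \max\{\langle u, c\rangle : \sum_j |u^{(j)}| \le 1\}$, is easy to check numerically. A minimal sketch (plain Python; the data are invented for illustration):

```python
# Verify max_j |c^(j)| = max{ <u, c> : sum_j |u^(j)| <= 1 }.
# A linear function attains its maximum over the l1-ball at a vertex,
# i.e. at u = +e_j or u = -e_j for some coordinate j, so it suffices
# to enumerate these 2n vertices.

def max_over_l1_ball(c):
    return max(s * c[j] for j in range(len(c)) for s in (1.0, -1.0))

c = [1.5, -3.2, 0.7]
assert max_over_l1_ball(c) == max(abs(v) for v in c)  # both equal 3.2
```

The same vertex argument explains why all three representations above describe the same function: the inner problem is linear in $u$, so only the extreme points of $Q_2$ matter.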

Let us show that the knowledge of structure (6.1.11) can help in solving both
problems (6.1.10) and (6.1.12). We are going to use this structure to construct a
smooth approximation of the objective function in (6.1.10).
Consider a differentiable prox-function d2 (·) of the set Q2 . This means that d2 (·)
is strongly convex on Q2 with convexity parameter one. Denote by

$u_0 = \arg\min_u \{ d_2(u) : u \in Q_2 \}$

its prox-center. Without loss of generality, we assume that d2 (u0 ) = 0. Thus, for
any u ∈ Q2 we have

$d_2(u) \overset{(2.2.40)}{\ge} \frac{1}{2} \|u - u_0\|_{E_2}^2. \qquad (6.1.13)$

Let $\mu$ be a positive smoothing parameter. Consider the following function:

$f_\mu(x) = \max_u \{ \langle Ax, u \rangle_{E_2} - \hat\phi(u) - \mu d_2(u) : u \in Q_2 \}. \qquad (6.1.14)$

Denote by uμ (x) the optimal solution of the above problem. Since the function d2 (·)
is strongly convex, this solution is unique.
Theorem 6.1.1 The function fμ is well defined and continuously differentiable at
any x ∈ E1 . Moreover, this function is convex and its gradient

∇fμ (x) = A∗ uμ (x) (6.1.15)

is Lipschitz continuous with constant

$L_\mu = \frac{1}{\mu} \|A\|_{1,2}^2.$

Proof Indeed the function fμ (·) is convex as a maximum of functions which are
linear in x, and A∗ uμ (x) ∈ ∂fμ (x) (see Lemma 3.1.14). Let us prove now the
existence and Lipschitz continuity of its gradient.
Consider two points $x_1$ and $x_2$ from $E_1$. From the first-order optimality conditions (3.1.56), we have

$\langle Ax_i - g_i - \mu \nabla d_2(u_\mu(x_i)), \, u_\mu(x_{3-i}) - u_\mu(x_i) \rangle_{E_2} \le 0$

for some $g_i \in \partial \hat\phi(u_\mu(x_i))$, $i = 1, 2$. Adding these inequalities, we get

$\mu \|u_\mu(x_1) - u_\mu(x_2)\|_{E_2}^2 \overset{(2.1.22)}{\le} \mu \langle \nabla d_2(u_\mu(x_1)) - \nabla d_2(u_\mu(x_2)), \, u_\mu(x_1) - u_\mu(x_2) \rangle_{E_2}$

$\le \langle A(x_1 - x_2) - (g_1 - g_2), \, u_\mu(x_1) - u_\mu(x_2) \rangle_{E_2}$

$\overset{(3.1.24)}{\le} \langle A(x_1 - x_2), \, u_\mu(x_1) - u_\mu(x_2) \rangle_{E_2}$

$\le \|A\|_{1,2} \cdot \|x_1 - x_2\|_{E_1} \cdot \|u_\mu(x_1) - u_\mu(x_2)\|_{E_2}.$

Thus, in view of (6.1.9), we have

$\|A^* u_\mu(x_1) - A^* u_\mu(x_2)\|_{E_1}^* \le \|A\|_{1,2} \cdot \|u_\mu(x_1) - u_\mu(x_2)\|_{E_2} \le \frac{1}{\mu} \|A\|_{1,2}^2 \cdot \|x_1 - x_2\|_{E_1}.$

It remains to use Lemma 3.1.10.



Let $D_2 = \max_{u \in Q_2} d_2(u)$ and $f_0(x) = \max_{u \in Q_2} \{ \langle Ax, u \rangle_{E_2} - \hat\phi(u) \}$. Then, for any $x \in E_1$ we have

$f_0(x) \overset{(6.1.14)}{\ge} f_\mu(x) \overset{(6.1.14)}{\ge} f_0(x) - \mu D_2. \qquad (6.1.16)$

Thus, for $\mu > 0$ the function $f_\mu$ can be seen as a uniform $\mu D_2$-approximation of the objective function $f_0$, with a Lipschitz constant for the gradient of the order $O(\frac{1}{\mu})$.

6.1.3 The Fast Gradient Method for Composite Minimization

Let f (·) be a convex differentiable function defined on a closed convex set Q ⊆ E.


Assume that the gradient of this function is Lipschitz continuous:

$\|\nabla f(x) - \nabla f(y)\|^* \le L \|x - y\|, \quad \forall x, y \in Q.$

Denote by $d(\cdot)$ a differentiable prox-function of the set $Q$. Assume that $d(\cdot)$ is strongly convex on $Q$ with convexity parameter one. Let $x_0$ be the $d$-center of $Q$:

$x_0 = \arg\min_{x \in Q} d(x).$

Without loss of generality, assume that $d(x_0) = 0$. Thus, for any $x \in Q$ we have

$d(x) \overset{(2.2.40)}{\ge} \frac{1}{2} \|x - x_0\|^2. \qquad (6.1.17)$

In this section, we present a fast gradient method for solving the following
composite optimization problem:
 
$\min_x \Big\{ \tilde f(x) \overset{\mathrm{def}}{=} f(x) + \Psi(x) : x \in Q \Big\}, \qquad (6.1.18)$
6.1 Smoothing for an Explicit Model of an Objective Function 431

where Ψ (·) is an arbitrary simple closed convex function defined on Q. Our main
assumption is that the auxiliary minimization problem of the form

$\min_{x \in Q} \{ \langle s, x \rangle + \alpha d(x) + \beta \Psi(x) \}, \qquad \alpha, \beta \ge 0,$

is easily solvable. For simplicity, we assume that the constant L > 0 is known.

Method of Similar Triangles

0. Choose $x_0 \in Q$. Set $v_0 = x_0$ and $\phi_0(x) = L d(x)$.

1. $k$th iteration ($k \ge 0$).

(a) Define $y_k = \frac{k}{k+2} x_k + \frac{2}{k+2} v_k$.

(b) Set $\phi_{k+1}(x) = \phi_k(x) + \frac{k+1}{2} \big[ f(y_k) + \langle \nabla f(y_k), x - y_k \rangle + \Psi(x) \big]$.

(c) Compute $v_{k+1} = \arg\min_{x \in Q} \phi_{k+1}(x)$.

(d) Define $x_{k+1} = \frac{k}{k+2} x_k + \frac{2}{k+2} v_{k+1}$.

(6.1.19)

In this scheme, we generate two sequences of feasible points $\{x_k\}_{k=0}^\infty$ and $\{y_k\}_{k=0}^\infty$, and a sequence of estimating functions $\{\phi_k(x)\}_{k=0}^\infty$. At each iteration of this method,
all “events” happen in the two-dimensional plane defined by the triangle

{xk , vk , vk+1 }.

Note that this triangle is similar to the resulting triangle $\{x_k, y_k, x_{k+1}\}$, defining the new point of the sequence $\{x_k\}_{k=0}^\infty$, for which we are able to establish the rate of convergence.
Theorem 6.1.2 Let the sequences $\{x_k\}_{k=0}^\infty$, $\{y_k\}_{k=0}^\infty$, and $\{v_k\}_{k=0}^\infty$ be generated by method (6.1.19). Then, for any $k \ge 0$ and $x \in Q$ we have

$\frac{k(k+1)}{4} \tilde f(x_k) + \frac{L}{2} \|v_k - x\|^2 \le \phi_k(x) = L d(x) + \sum_{i=0}^{k-1} \frac{i+1}{2} \big[ f(y_i) + \langle \nabla f(y_i), x - y_i \rangle \big] + \frac{k(k+1)}{4} \Psi(x). \qquad (6.1.20)$

Therefore, for any $k \ge 1$, we get

$\tilde f(x_k) - \tilde f(x^*) + \frac{2L}{k(k+1)} \|v_k - x^*\|^2 \le \frac{4 L d(x^*)}{k(k+1)}, \qquad (6.1.21)$

where $x^*$ is an optimal solution to problem (6.1.18).


Proof For $k \ge 0$, let

$a_k = \frac{k}{2}, \qquad A_k = \sum_{i=0}^k a_i = \frac{k(k+1)}{4}, \qquad \tau_k = \frac{a_{k+1}}{A_{k+1}}.$

Then the rules of method (6.1.19) can be written as follows:

$y_k = (1 - \tau_k) x_k + \tau_k v_k, \qquad x_{k+1} = (1 - \tau_k) x_k + \tau_k v_{k+1}. \qquad (6.1.22)$

Let us prove that

$A_k \tilde f(x_k) \le \phi_k^* \overset{\mathrm{def}}{=} \min_{x \in Q} \phi_k(x) = \phi_k(v_k), \qquad k \ge 0. \qquad (6.1.23)$

Since $A_0 = 0$, this inequality is valid for $k = 0$. Assume that it is true for some $k \ge 0$. Since all functions $\phi_k$ are strongly convex with convexity parameter $L$, we have

$\phi_{k+1}^* = \phi_k(v_{k+1}) + a_{k+1} \big[ f(y_k) + \langle \nabla f(y_k), v_{k+1} - y_k \rangle + \Psi(v_{k+1}) \big]$

$\overset{(2.2.40)}{\ge} \phi_k^* + \frac{L}{2} \|v_{k+1} - v_k\|^2 + a_{k+1} \big[ f(y_k) + \langle \nabla f(y_k), v_{k+1} - y_k \rangle + \Psi(v_{k+1}) \big]$

$\overset{(6.1.23)}{\ge} A_k [f(x_k) + \Psi(x_k)] + \frac{L}{2} \|v_{k+1} - v_k\|^2 + a_{k+1} \big[ f(y_k) + \langle \nabla f(y_k), v_{k+1} - y_k \rangle + \Psi(v_{k+1}) \big]$

$\overset{(2.1.2)}{\ge} A_{k+1} f(y_k) + \langle \nabla f(y_k), A_k(x_k - y_k) + a_{k+1}(v_{k+1} - y_k) \rangle + \frac{L}{2} \|v_{k+1} - v_k\|^2 + A_k \Psi(x_k) + a_{k+1} \Psi(v_{k+1}).$

By the rules of the method, $A_k(x_k - y_k) + a_{k+1}(v_{k+1} - y_k) \overset{(6.1.22)}{=} a_{k+1}(v_{k+1} - v_k)$ and $A_k \Psi(x_k) + a_{k+1} \Psi(v_{k+1}) \ge A_{k+1} \Psi(x_{k+1})$. Therefore,

$\phi_{k+1}^* \ge A_{k+1} f(y_k) + a_{k+1} \langle \nabla f(y_k), v_{k+1} - v_k \rangle + \frac{L}{2} \|v_{k+1} - v_k\|^2 + A_{k+1} \Psi(x_{k+1})$

$\overset{(6.1.22)}{=} A_{k+1} \Big[ f(y_k) + \langle \nabla f(y_k), x_{k+1} - y_k \rangle + \frac{L A_{k+1}}{2 a_{k+1}^2} \|x_{k+1} - y_k\|^2 + \Psi(x_{k+1}) \Big].$

Since $\frac{A_{k+1}}{a_{k+1}^2} = \frac{(k+1)(k+2)}{4} \cdot \frac{4}{(k+1)^2} > 1$, by (2.1.9) we get $\phi_{k+1}^* \ge A_{k+1} \tilde f(x_{k+1})$. By strong convexity of the function $\phi_k$, we have

$\phi_k(x) \overset{(2.2.40)}{\ge} \phi_k^* + \frac{L}{2} \|x - v_k\|^2 \overset{(6.1.23)}{\ge} A_k \tilde f(x_k) + \frac{L}{2} \|x - v_k\|^2,$

and this is inequality (6.1.20). Finally, inequality (6.1.21) follows from (6.1.20) in view of the convexity of the function $f$.
Remark 6.1.1 Note that method (6.1.19) generates bounded sequences of points.
Indeed, by the rules of this method we have

xk , yk ∈ Conv {v0 , . . . , vk }, k ≥ 0.

On the other hand, from inequality (6.1.21), it follows that

$\|v_k - x^*\|^2 \le 2 d(x^*). \qquad (6.1.24)$

In the Euclidean case, $d(x) = \frac{1}{2} \|x - x_0\|^2$, and we get

$\|v_k - x^*\| \le \|x_0 - x^*\|, \qquad k \ge 0. \qquad (6.1.25)$
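As an illustration of scheme (6.1.19) (not from the text: the problem, the set, and all names below are invented), here is a minimal sketch for the Euclidean prox-function $d(x) = \frac{1}{2}\|x - x_0\|^2$ with $\Psi \equiv 0$ and $Q$ a coordinate box. In this setting $\phi_{k+1}$ is a quadratic plus an accumulated linear term, so step (c) reduces to clipping $x_0 - g/L$ to the box, where $g$ accumulates the weighted gradients $a_{i+1}\nabla f(y_i)$.

```python
# A minimal sketch of the Method of Similar Triangles (6.1.19):
# Euclidean prox d(x) = 0.5*||x - x0||^2, Psi = 0, Q = [lo, hi]^n.
# Step (c) becomes v_{k+1} = clip(x0 - g/L), g = sum_i a_{i+1}*grad f(y_i).

def mst(grad_f, x0, L, lo, hi, iters):
    n = len(x0)
    clip = lambda z: [min(hi, max(lo, zi)) for zi in z]
    x, v = list(x0), list(x0)
    g = [0.0] * n                       # accumulated linear term of phi_k
    for k in range(iters):
        tk = 2.0 / (k + 2)              # tau_k = a_{k+1} / A_{k+1}
        y = [(1 - tk) * x[i] + tk * v[i] for i in range(n)]   # step (a)
        gy = grad_f(y)
        a = (k + 1) / 2.0               # a_{k+1}
        g = [g[i] + a * gy[i] for i in range(n)]              # step (b)
        v = clip([x0[i] - g[i] / L for i in range(n)])        # step (c)
        x = [(1 - tk) * x[i] + tk * v[i] for i in range(n)]   # step (d)
    return x

# Example: minimize f(x) = 0.5*||x - a||^2 over the box [0, 1]^2.
a = [2.0, -0.5]                         # unconstrained minimizer lies outside
x = mst(lambda y: [y[i] - a[i] for i in range(2)], [0.5, 0.5],
        L=1.0, lo=0.0, hi=1.0, iters=200)
# The constrained minimizer is the projection of a onto the box: (1, 0).
```

On this toy quadratic the iterates settle at the projection of $a$ onto the box, in agreement with the $O(1/k^2)$ bound of Theorem 6.1.2.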

6.1.4 Application Examples

Let us put the results of the previous sections together. Assume that the function
fˆ(·) in (6.1.11) is differentiable and its gradient is Lipschitz-continuous with some
constant M ≥ 0. Then the smoothing technique as applied to problem (6.1.10)
provides us with the following objective function:

f¯μ (x) = fˆ(x) + fμ (x) → min : x ∈ Q1 . (6.1.26)



In view of Theorem 6.1.1, the gradient of this function is Lipschitz continuous with the constant

$L_\mu = M + \frac{1}{\mu} \|A\|_{1,2}^2.$

Let us choose some prox-function d1 (·) for the set Q1 with convexity parameter
equal to one. Recall that the set Q1 is assumed to be bounded:

$\max_{x \in Q_1} d_1(x) \le D_1.$

Theorem 6.1.3 Let us apply method (6.1.19) to problem (6.1.26) with the following value of the smoothness parameter:

$\mu = \mu(N) = \frac{2 \|A\|_{1,2}}{\sqrt{N(N+1)}} \cdot \sqrt{\frac{D_1}{D_2}}.$

Then after $N$ iterations we can generate approximate solutions to problems (6.1.10) and (6.1.12), namely,

$\hat x = x_N \in Q_1, \qquad \hat u = \sum_{i=0}^{N-1} \frac{2(i+1)}{N(N+1)} u_\mu(y_i) \in Q_2, \qquad (6.1.27)$

which satisfy the following inequality:

$0 \le f(\hat x) - \phi(\hat u) \le \frac{4 \|A\|_{1,2}}{\sqrt{N(N+1)}} \cdot \sqrt{D_1 D_2} + \frac{4 M D_1}{N(N+1)}. \qquad (6.1.28)$

Thus, the complexity of finding an $\epsilon$-solution to problems (6.1.10), (6.1.12) by the smoothing technique does not exceed

$4 \|A\|_{1,2} \sqrt{D_1 D_2} \cdot \frac{1}{\epsilon} + 2 \sqrt{\frac{M D_1}{\epsilon}} \qquad (6.1.29)$

iterations of method (6.1.19).


Proof Let us fix an arbitrary $\mu > 0$. In view of Theorem 6.1.2, after $N$ iterations of method (6.1.19) we can deliver a point $\hat x = x_N$ such that

$\bar f_\mu(\hat x) \le \frac{4 L_\mu D_1}{N(N+1)} + \min_{x \in Q_1} \sum_{i=0}^{N-1} \frac{2(i+1)}{N(N+1)} \big[ \bar f_\mu(y_i) + \langle \nabla \bar f_\mu(y_i), x - y_i \rangle_{E_1} \big]. \qquad (6.1.30)$

Note that

$f_\mu(y) = \max_u \{ \langle Ay, u \rangle_{E_2} - \hat\phi(u) - \mu d_2(u) : u \in Q_2 \} = \langle Ay, u_\mu(y) \rangle_{E_2} - \hat\phi(u_\mu(y)) - \mu d_2(u_\mu(y)),$

$\langle \nabla f_\mu(y), y \rangle_{E_1} = \langle A^* u_\mu(y), y \rangle_{E_1}.$

Therefore, for $i = 0, \dots, N-1$ we have

$f_\mu(y_i) - \langle \nabla f_\mu(y_i), y_i \rangle_{E_1} = -\hat\phi(u_\mu(y_i)) - \mu d_2(u_\mu(y_i)). \qquad (6.1.31)$

Thus, in view of (6.1.15) and (6.1.31) we obtain

$\sum_{i=0}^{N-1} (i+1) \big[ \bar f_\mu(y_i) + \langle \nabla \bar f_\mu(y_i), x - y_i \rangle_{E_1} \big]$

$\overset{(2.1.2)}{\le} \sum_{i=0}^{N-1} (i+1) \big[ f_\mu(y_i) - \langle \nabla f_\mu(y_i), y_i \rangle_{E_1} \big] + \tfrac{1}{2} N(N+1) \big( \hat f(x) + \langle A^* \hat u, x \rangle_{E_1} \big)$

$\le -\sum_{i=0}^{N-1} (i+1) \hat\phi(u_\mu(y_i)) + \tfrac{1}{2} N(N+1) \big( \hat f(x) + \langle A^* \hat u, x \rangle_{E_1} \big)$

$\le \tfrac{1}{2} N(N+1) \big[ -\hat\phi(\hat u) + \hat f(x) + \langle Ax, \hat u \rangle_{E_2} \big].$

Hence, using (6.1.30), (6.1.12) and (6.1.16), we get the following bound:

$\frac{4 L_\mu D_1}{N(N+1)} \ge \bar f_\mu(\hat x) - \phi(\hat u) \ge f(\hat x) - \phi(\hat u) - \mu D_2.$

This is

$0 \le f(\hat x) - \phi(\hat u) \le \mu D_2 + \frac{4 \|A\|_{1,2}^2 D_1}{\mu N(N+1)} + \frac{4 M D_1}{N(N+1)}. \qquad (6.1.32)$

Minimizing the right-hand side of this inequality in $\mu$, we get inequality (6.1.28).


Note that the efficiency estimate (6.1.29) is much better than the standard bound $O\big(\frac{1}{\epsilon^2}\big)$. In accordance with the above theorem, for $M = 0$ the optimal dependence of the parameters $\mu$, $L_\mu$ and $N$ on $\epsilon$ is as follows:

$\sqrt{N(N+1)} \ge 4 \|A\|_{1,2} \sqrt{D_1 D_2} \cdot \frac{1}{\epsilon}, \qquad \mu = \frac{\epsilon}{2 D_2}, \qquad L_\mu = \frac{2 D_2}{\epsilon} \|A\|_{1,2}^2. \qquad (6.1.33)$

Remark 6.1.2 Inequality (6.1.28) shows that the pair of adjoint problems (6.1.10)
and (6.1.12) has no duality gap:

f ∗ = f∗ . (6.1.34)

Let us now look at some examples.

6.1.4.1 Minimax Strategies for Matrix Games

Denote by $\Delta_n$ the standard simplex in $\mathbb{R}^n$:

$\Delta_n = \Big\{ x \in \mathbb{R}^n_+ : \sum_{i=1}^n x^{(i)} = 1 \Big\}.$

Let $A : \mathbb{R}^n \to \mathbb{R}^m$, $E_1 = \mathbb{R}^n$, and $E_2 = \mathbb{R}^m$. Consider the following saddle point problem:

$\min_{x \in \Delta_n} \max_{u \in \Delta_m} \{ \langle Ax, u \rangle_{E_2} + \langle c, x \rangle_{E_1} + \langle b, u \rangle_{E_2} \}. \qquad (6.1.35)$

From the viewpoint of players, this problem can be seen as a pair of non-smooth minimization problems:

$\min_{x \in \Delta_n} f(x), \quad f(x) = \langle c, x \rangle_{E_1} + \max_{1 \le j \le m} [\langle a_j, x \rangle_{E_1} + b^{(j)}],$

$\max_{u \in \Delta_m} \phi(u), \quad \phi(u) = \langle b, u \rangle_{E_2} + \min_{1 \le i \le n} [\langle \hat a_i, u \rangle_{E_2} + c^{(i)}], \qquad (6.1.36)$

where aj are the rows and âi are the columns of matrix A. In order to solve this
pair of problems using the smoothing approach, we need to find a reasonable prox-
function for the simplex. Let us compare two possibilities.
1. Euclidean Distance Let us choose

$\|x\|_{E_1} = \Big( \sum_{i=1}^n (x^{(i)})^2 \Big)^{1/2}, \qquad d_1(x) = \frac{1}{2} \sum_{i=1}^n \Big( x^{(i)} - \frac{1}{n} \Big)^2,$

$\|u\|_{E_2} = \Big( \sum_{j=1}^m (u^{(j)})^2 \Big)^{1/2}, \qquad d_2(u) = \frac{1}{2} \sum_{j=1}^m \Big( u^{(j)} - \frac{1}{m} \Big)^2.$

Then $D_1 = 1 - \frac{1}{n} < 1$, $D_2 = 1 - \frac{1}{m} < 1$ and

$\|A\|_{1,2} = \max_x \{ \|Ax\|_2^* : \|x\|_{E_1} = 1 \} = \lambda_{\max}^{1/2}(A^T A).$

Thus, in our case the estimate (6.1.28) for the result (6.1.27) can be specified as follows:

$0 \le f(\hat x) - \phi(\hat u) \le \frac{4 \lambda_{\max}^{1/2}(A^T A)}{\sqrt{N(N+1)}}. \qquad (6.1.37)$

2. Entropy Distance Let us choose

$\|x\|_{E_1} = \sum_{i=1}^n |x^{(i)}|, \qquad d_1(x) = \ln n + \sum_{i=1}^n x^{(i)} \ln x^{(i)},$

$\|u\|_{E_2} = \sum_{j=1}^m |u^{(j)}|, \qquad d_2(u) = \ln m + \sum_{j=1}^m u^{(j)} \ln u^{(j)}.$

Functions d1 and d2 are called the entropy functions.


Lemma 6.1.3 The above prox-functions are strongly convex in the $\ell_1$-norm with convexity parameter one, and $D_1 = \ln n$, $D_2 = \ln m$.

Proof Note that the function $d_1$ is twice continuously differentiable in the interior of the simplex $\Delta_n$, and

$\langle \nabla^2 d_1(x) h, h \rangle = \sum_{i=1}^n \frac{(h^{(i)})^2}{x^{(i)}}.$

Thus, in view of Theorem 2.1.11, strong convexity of $d_1$ is a consequence of the following variant of the Cauchy–Schwarz inequality,

$\Big( \sum_{i=1}^n |h^{(i)}| \Big)^2 \le \sum_{i=1}^n x^{(i)} \cdot \sum_{i=1}^n \frac{(h^{(i)})^2}{x^{(i)}},$

which is valid for all positive vectors $x \in \mathbb{R}^n$. Since $d_1(\cdot)$ is a convex symmetric function of its arguments, its minimum is attained at the center of the simplex, the point $x_0 = \frac{1}{n} \bar e_n$. Clearly, $d_1(x_0) = 0$. On the other hand, its maximum is attained at one of the vertices of the simplex (see Corollary 3.1.2).

The reasoning for $d_2(\cdot)$ is similar.
Note also that now we get the following norm of the operator $A$:

$\|A\|_{1,2} = \max_x \{ \max_{1 \le j \le m} |\langle a_j, x \rangle| : \|x\|_{E_1} \le 1 \} = \max_{i,j} |A^{(i,j)}|$

(see Corollary 3.1.2). Thus, if we apply the entropy distance, the estimate (6.1.28) can be written as follows:

$0 \le f(\hat x) - \phi(\hat u) \le \frac{4 \sqrt{\ln n \ln m}}{\sqrt{N(N+1)}} \cdot \max_{i,j} |A^{(i,j)}|. \qquad (6.1.38)$

Note that typically the estimate (6.1.38) is much better than its Euclidean variant (6.1.37).

Let us write down explicitly the smooth approximation of the objective function in the first problem of (6.1.36) using the entropy distance. By definition,

$\bar f_\mu(x) = \langle c, x \rangle_{E_1} + \max_{u \in \Delta_m} \Big\{ \sum_{j=1}^m u^{(j)} [\langle a_j, x \rangle + b^{(j)}] - \mu \sum_{j=1}^m u^{(j)} \ln u^{(j)} - \mu \ln m \Big\}.$
Let us apply the following result.


Lemma 6.1.4 The solution of the problem
" %

m 
m
Find φ∗ (s) = max u(j ) s (j ) −μ u(j ) ln u(j ) (6.1.39)
u∈Δm j =1 j =1

is given by the vector uμ (s) ∈ Δm with the following entries

(j) /μ
(j ) es
uμ (s) = 
m , j = 1, . . . , m. (6.1.40)
(i) /μ
es
i=1




m (i) /μ
Therefore, φ∗ (s) = μ ln es .
i=1
Proof Note that the gradient of the objective function in problem (6.1.39) goes to
infinity as the argument approaches the boundary of the domain. Therefore, the first
order necessary and sufficient optimality conditions for this problem are as follows
(see (3.1.59)):

s (j ) − μ(1 + ln u(j ) ) = λ, j = 1, . . . , m,


m
u(j ) = 1.
j =1




m (l)
Clearly, they are satisfied by (6.1.40) with λ = μ ln es /μ − μ.

l=1
Using the result of Lemma 6.1.4, we conclude that in our case the problem (6.1.26) is as follows:

$\min_{x \in \Delta_n} \Big\{ \bar f_\mu(x) = \langle c, x \rangle_{E_1} + \mu \ln \Big( \frac{1}{m} \sum_{j=1}^m e^{[\langle a_j, x \rangle + b^{(j)}]/\mu} \Big) \Big\}.$

Note that the complexity of the oracle for this problem is basically the same as that of the initial problem (6.1.36).
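A quick numerical illustration of the entropy-smoothed maximum from Lemma 6.1.4 (plain Python; the vector s is invented): with $s^{(j)} = \langle a_j, x \rangle + b^{(j)}$, the value $\mu \ln\big(\frac{1}{m}\sum_j e^{s^{(j)}/\mu}\big)$ always lies between $\max_j s^{(j)} - \mu \ln m$ and $\max_j s^{(j)}$, which is the bound (6.1.16) with $D_2 = \ln m$.

```python
import math

# Entropy-smoothed maximum (cf. Lemma 6.1.4 and the formula above):
#   smooth_max(s, mu) = mu * ln( (1/m) * sum_j exp(s_j / mu) )
# satisfies  max_j s_j - mu*ln(m) <= smooth_max(s, mu) <= max_j s_j.

def smooth_max(s, mu):
    return mu * math.log(sum(math.exp(v / mu) for v in s) / len(s))

s = [0.3, -1.2, 0.9, 0.85]          # invented values of <a_j, x> + b^(j)
for mu in (1.0, 0.1, 0.01):
    val = smooth_max(s, mu)
    assert max(s) - mu * math.log(len(s)) <= val <= max(s)
```

As $\mu$ decreases, the smoothed value approaches the true maximum, at the cost of the growing Lipschitz constant $O(1/\mu)$ of the gradient.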

6.1.4.2 The Continuous Location Problem

Consider the following location problem. There are p cities with population mj ,
which are located at points cj ∈ Rn , j = 1, . . . , p. We want to construct a service
center at some position x ∈ Rn ≡ E1 , which minimizes the total social distance
f (x) to the center. On the other hand, this center must be constructed not too far
from the origin.
Mathematically, the above problem can be posed as follows:

Find $f^* = \min_x \Big\{ f(x) = \sum_{j=1}^p m_j \|x - c_j\|_{E_1} : \|x\|_{E_1} \le \bar r \Big\}. \qquad (6.1.41)$

In accordance with its interpretation, it is natural to choose

$\|x\|_{E_1} = \Big( \sum_{i=1}^n (x^{(i)})^2 \Big)^{1/2}, \qquad d_1(x) = \frac{1}{2} \|x\|_{E_1}^2.$

Then $D_1 = \frac{1}{2} \bar r^2$.
Further, the structure of the adjoint space $E_2$ is quite clear:

$E_2 = (E_1^*)^p, \qquad Q_2 = \{ u = (u_1, \dots, u_p) \in E_2 : \|u_j\|_{E_1}^* \le 1, \; j = 1, \dots, p \}.$

Let us choose

$\|u\|_{E_2} = \Big( \sum_{j=1}^p m_j (\|u_j\|_{E_1}^*)^2 \Big)^{1/2}, \qquad d_2(u) = \frac{1}{2} \|u\|_{E_2}^2.$

Then $D_2 = \frac{1}{2} P$ with $P \equiv \sum_{j=1}^p m_j$. Note that the value $P$ may be interpreted as the total size of the population.
It remains to compute the norm of the operator $A$:

$\|A\|_{1,2} = \max_{x,u} \Big\{ \sum_{j=1}^p m_j \langle u_j, x \rangle_{E_1} : \sum_{j=1}^p m_j (\|u_j\|_{E_1}^*)^2 = 1, \; \|x\|_{E_1} = 1 \Big\}$

$= \max_{r_j} \Big\{ \sum_{j=1}^p m_j r_j : \sum_{j=1}^p m_j r_j^2 = 1 \Big\} = P^{1/2}$

(see Lemma 3.1.20).



Putting the computed values into the estimate (6.1.28), we get the following rate of convergence:

$f(\hat x) - f^* \le \frac{2 P \bar r}{\sqrt{N(N+1)}}. \qquad (6.1.42)$

Note that the value $\tilde f(x) = \frac{1}{P} f(x)$ corresponds to the average individual expenses generated by the location $x$. Therefore,

$\tilde f(\hat x) - \tilde f^* \le \frac{2 \bar r}{\sqrt{N(N+1)}}.$

It is interesting that the right-hand side of this inequality is independent of any


dimension. At the same time, it is clear that the reasonable accuracy for the
approximate solution of our problem should not be too high. Given the low
complexity of each iteration in the scheme (6.1.19), the total efficiency of the
proposed technique looks quite promising.
To conclude with the location problem, let us write down explicitly a smooth approximation of the objective function:

$f_\mu(x) = \max_u \Big\{ \sum_{j=1}^p m_j \langle u_j, x - c_j \rangle_{E_1} - \mu d_2(u) : u \in Q_2 \Big\}$

$= \max_u \Big\{ \sum_{j=1}^p m_j \Big[ \langle u_j, x - c_j \rangle_{E_1} - \frac{1}{2} \mu (\|u_j\|_{E_1}^*)^2 \Big] : \|u_j\|_{E_1}^* \le 1, \; j = 1, \dots, p \Big\}$

$= \sum_{j=1}^p m_j \psi_\mu(\|x - c_j\|_{E_1}),$

where the function $\psi_\mu(\tau)$, $\tau \ge 0$, is defined as follows:

$\psi_\mu(\tau) = \max_{\gamma \in [0,1]} \Big\{ \gamma \tau - \frac{1}{2} \mu \gamma^2 \Big\} = \begin{cases} \frac{\tau^2}{2\mu}, & 0 \le \tau \le \mu, \\ \tau - \frac{\mu}{2}, & \mu \le \tau. \end{cases} \qquad (6.1.43)$

This is the so-called Huber loss function.
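A quick numerical check of (6.1.43) (plain Python; the values of $\tau$ and $\mu$ are arbitrary): the piecewise formula agrees with the maximization over $\gamma \in [0,1]$, whose solution is $\gamma = \min(\tau/\mu, 1)$.

```python
# Check the Huber function (6.1.43):
#   psi_mu(tau) = max_{0 <= gamma <= 1} { gamma*tau - mu*gamma^2/2 }
# equals tau^2/(2*mu) for 0 <= tau <= mu, and tau - mu/2 for tau >= mu.

def huber(tau, mu):
    return tau * tau / (2 * mu) if tau <= mu else tau - mu / 2

def huber_by_max(tau, mu, steps=100000):
    # brute-force the inner maximization over a fine grid of gamma in [0, 1]
    return max(g / steps * tau - mu * (g / steps) ** 2 / 2
               for g in range(steps + 1))

mu = 0.5
for tau in (0.0, 0.2, 0.5, 1.3):
    assert abs(huber(tau, mu) - huber_by_max(tau, mu)) < 1e-6
```

The quadratic branch smooths the kink of $\tau \mapsto \tau$ near zero, which is exactly what makes $f_\mu$ differentiable in the location problem.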

6.1.4.3 Variational Inequalities with a Linear Operator

Consider a linear operator $B(w) = Bw + c : E \to E^*$ which is monotone:

$\langle Bh, h \rangle \ge 0 \quad \forall h \in E.$

Let Q be a bounded closed convex set in E. Then we can pose the following
variational inequality problem:

Find w∗ ∈ Q : B(w∗ ), w − w∗ ≥ 0 ∀w ∈ Q. (6.1.44)

Note that we can always rewrite problem (6.1.44) as an optimization problem.


Indeed, define

$\psi(w) = \max_v \{ \langle B(v), w - v \rangle : v \in Q \}.$

In view of Theorem 3.1.8, $\psi(w)$ is a convex function. Let us show that the problem

$\min_w \{ \psi(w) : w \in Q \} \qquad (6.1.45)$

is equivalent to (6.1.44).
Lemma 6.1.5 A point w∗ is a solution to (6.1.45) if and only if it solves variational
inequality (6.1.44). Moreover, for such w∗ we have ψ(w∗ ) = 0.
Proof Indeed, at any $w \in Q$ the function $\psi$ is non-negative. If $w^*$ is a solution to (6.1.44), then for any $v \in Q$ we have

$\langle B(v), v - w^* \rangle \ge \langle B(w^*), v - w^* \rangle \ge 0.$

Hence, $\psi(w^*) = 0$ and $w^* \in \mathrm{Arg}\min_{w \in Q} \psi(w)$.

Now, consider some $w^* \in Q$ with $\psi(w^*) = 0$. Then for any $v \in Q$ we have

$\langle B(v), v - w^* \rangle \ge 0.$

Suppose there exists some $v_1 \in Q$ such that $\langle B(w^*), v_1 - w^* \rangle < 0$. Consider the points

$v_\alpha = w^* + \alpha (v_1 - w^*), \qquad \alpha \in [0, 1].$

Then

$0 \le \langle B(v_\alpha), v_\alpha - w^* \rangle = \alpha \langle B(v_\alpha), v_1 - w^* \rangle = \alpha \langle B(w^*), v_1 - w^* \rangle + \alpha^2 \langle B(v_1 - w^*), v_1 - w^* \rangle.$

Hence, for $\alpha$ small enough we get a contradiction.
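As a toy illustration of Lemma 6.1.5 (one-dimensional, invented data, not from the text): for $B(w) = bw + c$ with $b \ge 0$ on $Q = [0,1]$, the merit function $\psi$ vanishes at the interior VI solution $w^* = -c/b$ and is positive elsewhere on $Q$.

```python
# Toy check of Lemma 6.1.5 in one dimension: B(w) = b*w + c, b >= 0
# (monotone), Q = [0, 1]. The merit function
#   psi(w) = max_{v in Q} B(v) * (w - v)
# is nonnegative on Q and vanishes exactly at the VI solution.
# Here b = 2, c = -1, so w* = -c/b = 0.5 (an interior solution).

def psi(w, b=2.0, c=-1.0, steps=10000):
    # the inner objective is concave in v (coefficient -b on v^2),
    # so a fine grid search over [0, 1] is adequate for a demo
    return max((b * (g / steps) + c) * (w - g / steps)
               for g in range(steps + 1))

assert abs(psi(0.5)) < 1e-6                  # psi(w*) = 0
assert psi(0.1) > 1e-3 and psi(0.9) > 1e-3   # positive away from w*
```

This mirrors the proof above: $\psi(w) > 0$ certifies that $w$ violates the variational inequality at some test point $v$.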



There are two possibilities for representing the problem (6.1.44), (6.1.45) in the
form (6.1.10), (6.1.11).

1. Primal Form We take $E_1 = E_2 = E$, $Q_1 = Q_2 = Q$, $d_1(x) = d_2(x) = d(x)$, $A = B$, and

$\hat f(x) = \langle c, x \rangle_{E_1}, \qquad \hat\phi(u) = \langle c, u \rangle_{E_1} + \langle Bu, u \rangle_{E_1}.$

Note that the quadratic function $\hat\phi(u)$ is convex. To compute the value and the gradient of the function $f_\mu(x)$, we need to solve the following problem:

$\max_{u \in Q} \{ \langle Bx, u \rangle_{E_1} - \mu d(u) - \langle c, u \rangle_{E_1} - \langle Bu, u \rangle_{E_1} \}. \qquad (6.1.46)$

Since in our case $M = 0$, from Theorem 6.1.3 we get the following estimate for the complexity of problem (6.1.44):

$\frac{4 D_1 \|B\|_{1,2}}{\epsilon}. \qquad (6.1.47)$

However, because of the presence of a non-trivial quadratic function in (6.1.46), the oracle for the function $f_\mu$ can be quite expensive. We can avoid this in the dual variant of this problem.
2. Dual Form Consider the dual variant of problem (6.1.45):

$\min_{w \in Q} \max_{v \in Q} \langle B(v), w - v \rangle = \max_{v \in Q} \min_{w \in Q} \langle B(v), w - v \rangle = -\min_{v \in Q} \max_{w \in Q} \langle B(v), v - w \rangle.$

Thus, we can take $E_1 = E_2 = E$, $Q_1 = Q_2 = Q$, $d_1(x) = d_2(x) = d(x)$, $A = B$, and

$\hat f(x) = \langle c, x \rangle_{E_1} + \langle Bx, x \rangle_{E_1}, \qquad \hat\phi(u) = \langle c, u \rangle_{E_1}.$

Now the computation of the function value $f_\mu(x)$ becomes much simpler:

$f_\mu(x) = \max_u \{ \langle Bx, u \rangle_{E_1} - \mu d(u) - \langle c, u \rangle_{E_1} : u \in Q \}.$

Note that we pay quite a moderate cost for this. Indeed, now $M$ becomes equal to $\|B\|_{1,2}$. Hence, the complexity estimate (6.1.47) increases up to the following level:

$\frac{4 D_1 \|B\|_{1,2}}{\epsilon} + 2 \sqrt{\frac{D_1 \|B\|_{1,2}}{\epsilon}}.$

In the important particular case of a skew-symmetric operator $B$, that is, $B + B^* = 0$, the primal and dual variants have similar complexity.

6.1.4.4 Piece-Wise Linear Optimization

1. Maximum of Absolute Values Consider the following problem:

$\min_{x \in Q_1} \Big\{ f(x) = \max_{1 \le j \le m} |\langle a_j, x \rangle_{E_1} - b^{(j)}| \Big\}. \qquad (6.1.48)$

For simplicity, let us choose

$\|x\|_{E_1} = \Big( \sum_{i=1}^n (x^{(i)})^2 \Big)^{1/2}, \qquad d_1(x) = \frac{1}{2} \|x\|^2.$

Denote by $A$ the matrix with rows $a_j$, $j = 1, \dots, m$. It is convenient to choose

$E_2 = \mathbb{R}^{2m}, \qquad \|u\|_{E_2} = \sum_{j=1}^{2m} |u^{(j)}|, \qquad d_2(u) = \ln(2m) + \sum_{j=1}^{2m} u^{(j)} \ln u^{(j)}.$

Then

$f(x) = \max_u \{ \langle \hat A x, u \rangle_{E_2} - \langle \hat b, u \rangle_{E_2} : u \in \Delta_{2m} \},$

where $\hat A = \begin{pmatrix} A \\ -A \end{pmatrix}$ and $\hat b = \begin{pmatrix} b \\ -b \end{pmatrix}$. Thus, $D_2 = \ln(2m)$, and

$D_1 = \frac{1}{2} \bar r^2, \qquad \bar r = \max_x \{ \|x\|_{E_1} : x \in Q_1 \}.$

It remains to compute the norm of the operator $\hat A$:

$\|\hat A\|_{1,2} = \max_{x,u} \{ \langle \hat A x, u \rangle_{E_2} : \|x\|_{E_1} = 1, \; \|u\|_{E_2} = 1 \}$

$= \max_x \{ \max_{1 \le j \le m} |\langle a_j, x \rangle_{E_1}| : \|x\|_{E_1} = 1 \} = \max_{1 \le j \le m} \|a_j\|_{E_1}^*.$

Putting all the computed values into the estimate (6.1.29), we see that the problem (6.1.48) can be solved in

$2\sqrt{2}\, \bar r \max_{1 \le j \le m} \|a_j\|_{E_1}^* \sqrt{\ln(2m)} \cdot \frac{1}{\epsilon}$

iterations of scheme (6.1.19). The standard subgradient schemes in this situation can count only on an

$O\Big( \Big( \bar r \max_{1 \le j \le m} \|a_j\|_{E_1}^* \cdot \frac{1}{\epsilon} \Big)^2 \Big)$

upper bound for the number of iterations.



Finally, the smooth version of the objective function in (6.1.48) is as follows:

$\bar f_\mu(x) = \mu \ln \Big( \frac{1}{m} \sum_{j=1}^m \xi\Big( \frac{1}{\mu} [\langle a_j, x \rangle - b^{(j)}] \Big) \Big)$

with $\xi(\tau) = \frac{1}{2} [e^\tau + e^{-\tau}]$. We leave the justification of this expression as an exercise for the reader.
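A numerical sanity check of this expression (plain Python, invented data): writing $s_j = \langle a_j, x \rangle - b^{(j)}$ and using $\xi(\tau) = \cosh \tau$, the smoothed value stays within $\mu \ln(2m)$ of $f(x) = \max_j |s_j|$, consistent with (6.1.16) and $D_2 = \ln(2m)$.

```python
import math

# Sanity check of the smoothed objective for (6.1.48):
#   f_mu(x) = mu * ln( (1/m) * sum_j xi(s_j / mu) ),  xi(t) = cosh(t),
# with s_j = <a_j, x> - b^(j), satisfies
#   f(x) - mu*ln(2m) <= f_mu(x) <= f(x),   f(x) = max_j |s_j|.

def f_smooth(s, mu):
    return mu * math.log(sum(math.cosh(v / mu) for v in s) / len(s))

s = [0.4, -1.1, 0.75]     # invented values of <a_j, x> - b^(j) at some x
f = max(abs(v) for v in s)
for mu in (0.5, 0.05):
    val = f_smooth(s, mu)
    assert f - mu * math.log(2 * len(s)) <= val <= f + 1e-12
```

The factor 2 in the accuracy bound comes from the doubled simplex $\Delta_{2m}$ used in the representation of $|\cdot|$.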
2. Sum of Absolute Values Consider now the problem

$\min_{x \in Q_1} \Big\{ f(x) = \sum_{j=1}^m |\langle a_j, x \rangle_{E_1} - b^{(j)}| \Big\}. \qquad (6.1.49)$

The simplest representation of the function $f(\cdot)$ is as follows. Denote by $A$ the matrix with the rows $a_j$. Let us choose

$E_2 = \mathbb{R}^m, \qquad Q_2 = \{ u \in \mathbb{R}^m : |u^{(j)}| \le 1, \; j = 1, \dots, m \},$

$d_2(u) = \frac{1}{2} \|u\|_{E_2}^2 = \frac{1}{2} \sum_{j=1}^m \|a_j\|_{E_1}^* \cdot (u^{(j)})^2.$

Then the smooth version of the objective function is as follows:

$f_\mu(x) = \max_u \{ \langle Ax - b, u \rangle_{E_2} - \mu d_2(u) : u \in Q_2 \} = \sum_{j=1}^m \|a_j\|_{E_1}^* \cdot \psi_\mu \Big( \frac{|\langle a_j, x \rangle_{E_1} - b^{(j)}|}{\|a_j\|_{E_1}^*} \Big),$

where the function $\psi_\mu(\tau)$ is defined by (6.1.43). Note that

$\|A\|_{1,2} = \max_{x,u} \Big\{ \sum_{j=1}^m u^{(j)} \langle a_j, x \rangle_{E_1} : \|x\|_{E_1} \le 1, \; \|u\|_{E_2} \le 1 \Big\}$

$\le \max_u \Big\{ \sum_{j=1}^m \|a_j\|_{E_1}^* \cdot |u^{(j)}| : \sum_{j=1}^m \|a_j\|_{E_1}^* \cdot (u^{(j)})^2 \le 1 \Big\} = D^{1/2} \equiv \Big( \sum_{j=1}^m \|a_j\|_{E_1}^* \Big)^{1/2}.$

On the other hand, $D_2 = \frac{1}{2} D$. Therefore, from Theorem 6.1.3 we get the following complexity bound:

$\frac{2}{\epsilon} \cdot \sqrt{2 D_1} \cdot \sum_{j=1}^m \|a_j\|_{E_1}^*$

iterations of method (6.1.19).

6.1.5 Implementation Issues

6.1.5.1 Computational Complexity

Let us discuss the computational complexity of the method (6.1.19) as applied to


the function f¯μ (·). The main computations are performed at Steps (b) and (c) of the
algorithm.
Step (b). Call of Oracle At this step we need to compute the solution of the
following maximization problem:

$\max_{u \in Q_2} \{ \langle A y_k, u \rangle_{E_2} - \hat\phi(u) - \mu d_2(u) \}.$

Note that from the origin of this problem we know that this computation for μ = 0
can be done in a closed form. Thus, we can expect that with a properly chosen prox-
function, computation of the smoothed version is not too difficult. In Sect. 6.1.4 we
have seen three examples which confirm this belief.
Step (c). Computation of vk+1 This computation consists in solving the following
problem:

$\min_{x \in Q_1} \{ d_1(x) + \langle s, x \rangle_{E_1} \}$

for some fixed s ∈ E∗1 . If the set Q1 and the prox-function d1 (·) are simple enough,
this computation can be done in a closed form (see Sect. 6.1.4). For some sets we
need to solve an auxiliary equation with one variable.

6.1.5.2 Computational Stability

Our approach is based on the smoothing of non-differentiable functions. In accordance with (6.1.33), the value of the smoothness parameter $\mu$ must be of the order of $\epsilon$. This may cause some numerical troubles in computing the function $\bar f_\mu(x)$ and its gradient. Among the examples of Sect. 6.1.4, only the smooth variant of the objective function in Sect. 6.1.4.2 does not involve dangerous operations; all others need a careful implementation.

In both Sects. 6.1.4.1 and 6.1.4.4 we need a stable technique for computing the values and derivatives of the function

$\eta(u) = \mu \ln \Big( \sum_{j=1}^m e^{u^{(j)}/\mu} \Big) \qquad (6.1.50)$

with very small values of the parameter $\mu$. This can be done in the following way. Let

$\bar u = \max_{1 \le j \le m} u^{(j)}, \qquad v^{(j)} = u^{(j)} - \bar u, \quad j = 1, \dots, m.$

Then

$\eta(u) = \bar u + \eta(v).$

Note that all components of the vector $v$ are non-positive and one of them is zero. Therefore, the value $\eta(v)$ can be computed quite accurately. The same technique can be used to compute the gradient, since $\nabla \eta(u) = \nabla \eta(v)$.
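This shift-by-the-maximum stabilization is a few lines in code (plain Python sketch):

```python
import math

# Stable evaluation of eta(u) = mu * ln( sum_j exp(u_j / mu) ) from (6.1.50):
# shift by u_bar = max_j u_j so that every exponent is <= 0 (no overflow),
# then add u_bar back. The gradient is the softmax vector (6.1.40) with
# s = u, computed from the same shifted values.

def eta(u, mu):
    u_bar = max(u)
    return u_bar + mu * math.log(sum(math.exp((x - u_bar) / mu) for x in u))

def grad_eta(u, mu):
    u_bar = max(u)
    w = [math.exp((x - u_bar) / mu) for x in u]   # exponents <= 0
    s = sum(w)
    return [x / s for x in w]

u = [1000.0, 999.0, 500.0]
mu = 1e-3                          # naive exp(u_j/mu) would overflow here
assert abs(eta(u, mu) - 1000.0) < 1e-6          # eta ~ max_j u_j for small mu
assert abs(sum(grad_eta(u, mu)) - 1.0) < 1e-12  # softmax weights sum to one
```

Underflow of the tiny shifted exponentials is harmless: they contribute nothing to the sum, while the dominant term $e^0 = 1$ is represented exactly.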

6.2 An Excessive Gap Technique for Non-smooth Convex


Minimization

(Primal-dual problem structure; An excessive gap condition; Gradient mapping; Conver-


gence analysis; Minimizing strongly convex functions.)

6.2.1 Primal-Dual Problem Structure

In this section, we give some extensions of the results presented in Sect. 6.1, where it was shown that some structured non-smooth optimization problems can be solved in $O(\frac{1}{\epsilon})$ iterations of a gradient-type scheme, with $\epsilon$ being the desired accuracy of the solution. This complexity is much better than the theoretical lower complexity bound $O(\frac{1}{\epsilon^2})$ for Black-Box methods (see Sect. 3.2). This improvement, of course, is possible because of certain relaxations of the standard Black-Box assumption. Instead, it was assumed that our problem has an explicit and quite simple minimax structure. However, the approach discussed in Sect. 6.1 has a certain drawback. Namely, the number of steps of the optimization scheme must be fixed in advance. It is chosen in accordance with the worst-case complexity analysis and the desired accuracy. Let us try to be more flexible.
Consider the same optimization problems as before:

Find $f^* = \min_{x \in Q_1} f(x), \qquad (6.2.1)$

where Q1 is a bounded closed convex set in a finite-dimensional real vector


space E1 , and f is a continuous convex function on Q1 . We do not assume f to
be differentiable. Let the structure of the objective function be described by the
following model:

$f(x) = \hat f(x) + \max_{u \in Q_2} \{ \langle Ax, u \rangle_{E_2} - \hat\phi(u) \}, \qquad (6.2.2)$

where the function fˆ is continuous and convex on Q1 , Q2 is a closed convex


bounded set in a finite-dimensional real vector space E2 , φ̂(·) is a continuous
convex function on Q2 , and the linear operator A maps E1 to E∗2 . In this case,
problem (6.2.1) can be written in an adjoint form:

$f_* = \max_{u \in Q_2} \phi(u), \qquad \phi(u) = -\hat\phi(u) + \min_{x \in Q_1} \{ \langle Ax, u \rangle_{E_2} + \hat f(x) \}, \qquad (6.2.3)$

which has zero duality gap (see (6.1.34)).


We assume that this representation is completely similar to (6.2.1) in the
following sense. All methods described in this section are implementable only if
the optimization problems involved in the definitions of functions f and φ can be
solved in a closed form. So, we assume that the structure of all objects in fˆ, φ̂, Q1
and Q2 is simple enough. We also assume that functions fˆ and φ̂ have Lipschitz
continuous gradients with Lipschitz constants L1 (fˆ) and L2 (φ̂) respectively.
Let us show that the knowledge of structure (6.2.2) can help in solving prob-
lems (6.2.1) and (6.2.3). Consider a prox-function d2 (·) of the set Q2 . This means
that d2 is continuous and strongly convex on Q2 with a strong convexity parameter
equal to one. Denote by

$u_0 = \arg\min_{u \in Q_2} d_2(u)$

the prox-center of the function d2 . Without loss of generality we assume that


d2 (u0 ) = 0. Thus, in view of (4.2.18), for any u ∈ Q2 we have

$d_2(u) \ge \frac{1}{2} \|u - u_0\|_2^2. \qquad (6.2.4)$
Let μ2 be a positive smoothing parameter. Consider the following function:

$f_{\mu_2}(x) = \hat f(x) + \max_{u \in Q_2} \{ \langle Ax, u \rangle_{E_2} - \hat\phi(u) - \mu_2 d_2(u) \}. \qquad (6.2.5)$

Denote by uμ2 (x) the optimal solution of this problem. Since the function d2 is
strongly convex, this solution is unique. In accordance with Danskin’s theorem, the

gradient of $f_{\mu_2}$ is well defined as

$\nabla f_{\mu_2}(x) = \nabla \hat f(x) + A^* u_{\mu_2}(x). \qquad (6.2.6)$

Moreover, this gradient is Lipschitz continuous with constant

$L_1(f_{\mu_2}) = L_1(\hat f) + \frac{1}{\mu_2} \|A\|_{1,2}^2 \qquad (6.2.7)$

(see Theorem 6.1.1).


Similarly, let us consider a prox-function d1 (·) of the set Q1 , which has convexity
parameter equal to one, and the prox-center x0 with d1 (x0 ) = 0. By (4.2.18), for any
x ∈ Q1 we have

$d_1(x) \ge \frac{1}{2} \|x - x_0\|_1^2. \qquad (6.2.8)$
Let $\mu_1$ be a positive smoothing parameter. Consider

$\phi_{\mu_1}(u) = -\hat\phi(u) + \min_{x \in Q_1} \{ \langle Ax, u \rangle_{E_2} + \hat f(x) + \mu_1 d_1(x) \}. \qquad (6.2.9)$

Since the second term in the above definition is a minimum of functions which are linear in $u$, $\phi_{\mu_1}(u)$ is concave. Denote by $x_{\mu_1}(u)$ the unique optimal solution of the above problem. In accordance with Theorem 6.1.1, the gradient

$\nabla \phi_{\mu_1}(u) = -\nabla \hat\phi(u) + A x_{\mu_1}(u) \qquad (6.2.10)$

is Lipschitz continuous with constant

$L_2(\phi_{\mu_1}) = L_2(\hat\phi) + \frac{1}{\mu_1} \|A\|_{1,2}^2. \qquad (6.2.11)$

6.2.2 An Excessive Gap Condition

In view of Theorem 1.3.1, for any x ∈ Q1 and u ∈ Q2 we have

φ(u) ≤ f (x), (6.2.12)

and our assumptions guarantee no duality gap for problems (6.2.1) and (6.2.3).
However, fμ2 (x) ≤ f (x) and φ(u) ≤ φμ1 (u). This opens a possibility to satisfy
the following excessive gap condition:

fμ2 (x̄) ≤ φμ1 (ū) (6.2.13)



for certain x̄ ∈ Q1 and ū ∈ Q2 . Let us show that condition (6.2.13) provides us with
an upper bound on the quality of the primal-dual pair (x̄, ū).
Lemma 6.2.1 Let $\bar x \in Q_1$ and $\bar u \in Q_2$ satisfy (6.2.13). Then

$0 \le \max\{ f(\bar x) - f^*, \; f^* - \phi(\bar u) \} \le f(\bar x) - \phi(\bar u) \le \mu_1 D_1 + \mu_2 D_2, \qquad (6.2.14)$

where $D_1 = \max_{x \in Q_1} d_1(x)$ and $D_2 = \max_{u \in Q_2} d_2(u)$.

Proof Indeed, for any x̄ ∈ Q1 , ū ∈ Q2 we have

$f(\bar x) - \mu_2 D_2 \le f_{\mu_2}(\bar x) \overset{(6.2.13)}{\le} \phi_{\mu_1}(\bar u) \le \phi(\bar u) + \mu_1 D_1.$

It remains to apply inequality (6.2.12).



Our goal is to justify a process for recursively updating the pair (x̄, ū), which
maintains inequality (6.2.13) as μ1 and μ2 go to zero. Before we start our analysis,
let us prove a useful inequality.
Lemma 6.2.2 For any x and x̂ from Q1 we have:

fμ2 (x̂) + ∇fμ2 (x̂), x − x̂ E1 ≤ fˆ(x) + Ax, uμ2 (x̂) E2 − φ̂(uμ2 (x̂)). (6.2.15)

Proof Let us take arbitrary x and x̂ from Q1, and let û = uμ2(x̂). Then

fμ2(x̂) + ⟨∇fμ2(x̂), x − x̂⟩_{E1}
  = f̂(x̂) + ⟨Ax̂, û⟩_{E2} − φ̂(û) − μ2 d2(û)
    + ⟨∇f̂(x̂) + A*û, x − x̂⟩_{E1}                         (by (6.2.5), (6.2.6))
  ≤ f̂(x) + ⟨Ax, û⟩_{E2} − φ̂(û).                          (by (2.1.2))  □

Let us justify the possibility of satisfying the excessive gap condition (6.2.13) at
some starting primal-dual pair.
Lemma 6.2.3 Let us choose an arbitrary μ2 > 0 and set

x̄ = arg min_{x∈Q1} {⟨∇fμ2(x0), x − x0⟩_{E1} + L1(fμ2) d1(x)},
                                                                    (6.2.16)
ū = uμ2(x0).

Then the excessive gap condition is satisfied for any μ1 ≥ L1(fμ2).



Proof Indeed, in view of (1.2.11) we have

fμ2(x̄) ≤ fμ2(x0) + ⟨∇fμ2(x0), x̄ − x0⟩_{E1} + (1/2)L1(fμ2)‖x̄ − x0‖²_1
       ≤ fμ2(x0) + ⟨∇fμ2(x0), x̄ − x0⟩_{E1} + L1(fμ2) d1(x̄)            (by (6.2.8))
       = fμ2(x0) + min_{x∈Q1} {⟨∇fμ2(x0), x − x0⟩_{E1} + L1(fμ2) d1(x)}  (by (6.2.16))
       ≤ min_{x∈Q1} {f̂(x) + ⟨Ax, uμ2(x0)⟩_{E2} − φ̂(uμ2(x0)) + L1(fμ2) d1(x)}  (by (6.2.15))
       = φ_{L1(fμ2)}(ū) ≤ φμ1(ū).                                       (by (6.2.9))  □

Thus, condition (6.2.13) can be satisfied for some primal-dual pair. Let us show
how we can update the points x̄ and ū in order to keep it valid for smaller values of
μ1 and μ2 . In view of the symmetry of the situation, at the first step of the process
we can try to decrease only μ1 , keeping μ2 unchanged. After that, at the second
step, we update μ2 and keep μ1 constant, and so on. The main advantage of such a
switching strategy is that we need to find a justification only for the first step. The
proof for the second one will be symmetric.
Theorem 6.2.1 Let points x̄ ∈ Q1 and ū ∈ Q2 satisfy the excessive gap
condition (6.2.13) for some positive μ1 and μ2. Let us fix τ ∈ (0, 1) and choose
μ1⁺ = (1 − τ)μ1,

x̂ = (1 − τ)x̄ + τ xμ1(ū),
ū⁺ = (1 − τ)ū + τ uμ2(x̂),                                           (6.2.17)
x̄⁺ = (1 − τ)x̄ + τ x_{μ1⁺}(ū⁺).

Then the pair (x̄⁺, ū⁺) satisfies condition (6.2.13) with smoothing parameters μ1⁺
and μ2, provided that τ satisfies the following relation:

τ²/(1 − τ) ≤ μ1/L1(fμ2). (6.2.18)

Proof Let û = uμ2(x̂), x1 = xμ1(ū), and x̃⁺ = x_{μ1⁺}(ū⁺). Since φ̂ is convex, in view
of the operation in (6.2.17), we have φ̂(ū⁺) ≤ (1 − τ)φ̂(ū) + τφ̂(û). Therefore,

φ_{μ1⁺}(ū⁺) = (1 − τ)μ1 d1(x̃⁺) + ⟨Ax̃⁺, (1 − τ)ū + τû⟩_{E2} + f̂(x̃⁺) − φ̂(ū⁺)
  ≥ (1 − τ)[μ1 d1(x̃⁺) + ⟨Ax̃⁺, ū⟩_{E2} + f̂(x̃⁺) − φ̂(ū)]
    + τ[f̂(x̃⁺) + ⟨Ax̃⁺, û⟩_{E2} − φ̂(û)]
  ≥ (1 − τ)[φμ1(ū) + ½μ1‖x̃⁺ − x1‖²_1]_a + τ[fμ2(x̂) + ⟨∇fμ2(x̂), x̃⁺ − x̂⟩_{E1}]_b.   (by (6.2.15))

Note that in view of condition (6.2.13) and the first line in (6.2.17) we have

φμ1(ū) ≥ fμ2(x̄) ≥ fμ2(x̂) + ⟨∇fμ2(x̂), x̄ − x̂⟩_{E1} = fμ2(x̂) + τ⟨∇fμ2(x̂), x̄ − x1⟩_{E1}.

Therefore, we can estimate the expression in the first brackets as follows:

[·]_a ≥ fμ2(x̂) + τ⟨∇fμ2(x̂), x̄ − x1⟩_{E1} + ½μ1‖x̃⁺ − x1‖²_1.

In view of the first line in (6.2.17), for the second brackets we have

[·]_b = fμ2(x̂) + ⟨∇fμ2(x̂), x̃⁺ − x1 + (1 − τ)(x1 − x̄)⟩_{E1}.

Thus, taking into account that x̄⁺ − x̂ = τ(x̃⁺ − x1) (see (6.2.17)), we finish the
proof as follows:

φ_{μ1⁺}(ū⁺) ≥ fμ2(x̂) + τ⟨∇fμ2(x̂), x̃⁺ − x1⟩_{E1} + ½(1 − τ)μ1‖x̃⁺ − x1‖²_1
  = fμ2(x̂) + ⟨∇fμ2(x̂), x̄⁺ − x̂⟩_{E1} + ((1 − τ)μ1/(2τ²))‖x̄⁺ − x̂‖²_1
  ≥ fμ2(x̂) + ⟨∇fμ2(x̂), x̄⁺ − x̂⟩_{E1} + ½L1(fμ2)‖x̄⁺ − x̂‖²_1       (by (6.2.18))
  ≥ fμ2(x̄⁺).                                                       (by (1.2.11))  □


6.2.3 Convergence Analysis

In Sect. 6.2.2, we have seen that the smoothness parameters μ1 and μ2 can
be decreased by a switching strategy. Thus, in order to transform the result of
Theorem 6.2.1 into an algorithmic scheme, we need to point out a strategy for
updating these parameters, which is compatible with the growth condition (6.2.18).
In this section, we do this for an important case L1 (fˆ) = L2 (φ̂) = 0.
It is convenient to represent the smoothness parameters as follows:
 
μ1 = λ1 · ‖A‖_{1,2} · √(D2/D1),   μ2 = λ2 · ‖A‖_{1,2} · √(D1/D2). (6.2.19)

Then the estimate (6.2.14) for the duality gap becomes symmetric:

f(x̄) − φ(ū) ≤ (λ1 + λ2) · ‖A‖_{1,2} · √(D1 D2). (6.2.20)

Since, by (6.2.7), L1(fμ2) = (1/μ2)‖A‖²_{1,2}, condition (6.2.18) becomes problem
independent:

τ²/(1 − τ) ≤ μ1 μ2 · (1/‖A‖²_{1,2}) = λ1 λ2. (6.2.21)

Let us write down the corresponding switching algorithmic scheme in an explicit


form. It is convenient to have a permanent iteration counter. In this case, at
even iterations we apply the primal update (6.2.17), and at odd iterations the
corresponding dual update is used. Since at even iterations λ2 does not change and
at odd iterations λ1 does not change it is convenient to put their new values in the
same sequence {αk}_{k=−1}^∞. Let us fix the following relations between the sequences:

k = 2l:      λ1,k = αk−1,  λ2,k = αk,
                                                                    (6.2.22)
k = 2l + 1:  λ1,k = αk,    λ2,k = αk−1.

Then the corresponding parameters τk (see the rule (6.2.17)) define the reduction rate
of the sequence {αk}_{k=−1}^∞.
Lemma 6.2.4 For all k ≥ 0 we have αk+1 = (1 − τk )αk−1 .
Proof Indeed, in accordance with (6.2.22), if k = 2l, then

αk+1 = λ1,k+1 = (1 − τk )λ1,k = (1 − τk )αk−1 .

And if k = 2l + 1, then αk+1 = λ2,k+1 = (1 − τk )λ2,k = (1 − τk )αk−1 . 




Corollary 6.2.1 In terms of the sequence {αk}_{k=−1}^∞, condition (6.2.21) is as
follows:

(αk+1 − αk−1)² ≤ αk+1 αk α²_{k−1},   k ≥ 0. (6.2.23)

Proof In view of (6.2.22), we always have λ1,k λ2,k = αk αk−1. Since τk = 1 − αk+1/αk−1,
we get (6.2.23). □
Clearly, condition (6.2.23) is satisfied by

αk = 2/(k+2),   k ≥ −1. (6.2.24)

Then

τk = 1 − αk+1/αk−1 = 2/(k+3),   k ≥ 0. (6.2.25)
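The choice (6.2.24)–(6.2.25) is easy to verify numerically. The following quick check (ours, not part of the book) confirms the recursion of Lemma 6.2.4 and condition (6.2.23) for this sequence:

```python
# Check (ours) that alpha_k = 2/(k+2) satisfies the recursion of
# Lemma 6.2.4 and condition (6.2.23), and that tau_k = 2/(k+3).

def alpha(k):          # alpha_k = 2/(k+2), defined for k >= -1
    return 2.0 / (k + 2)

def tau(k):            # tau_k = 1 - alpha_{k+1}/alpha_{k-1}
    return 1.0 - alpha(k + 1) / alpha(k - 1)

for k in range(0, 100):
    # Lemma 6.2.4: alpha_{k+1} = (1 - tau_k) * alpha_{k-1}
    assert abs(alpha(k + 1) - (1 - tau(k)) * alpha(k - 1)) < 1e-12
    # (6.2.25): tau_k = 2/(k+3)
    assert abs(tau(k) - 2.0 / (k + 3)) < 1e-12
    # (6.2.23): (alpha_{k+1} - alpha_{k-1})^2 <= alpha_{k+1} alpha_k alpha_{k-1}^2
    lhs = (alpha(k + 1) - alpha(k - 1)) ** 2
    rhs = alpha(k + 1) * alpha(k) * alpha(k - 1) ** 2
    assert lhs <= rhs + 1e-12
```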

Now we are ready to write down an algorithmic scheme. Let us do this for the
rule (6.2.17). In this scheme, we use the sequences {μ1,k}_{k=−1}^∞ and {μ2,k}_{k=−1}^∞,
generated in accordance with rules (6.2.19), (6.2.22) and (6.2.24).

1. Initialization: Choose x̄0 and ū0 in accordance with (6.2.16), taking
   μ1 = μ1,0 and μ2 = μ2,0.

2. Iterations (k ≥ 0):
   (a) Set τk = 2/(k+3).                                            (6.2.26)
   (b) If k is even, then generate (x̄k+1, ūk+1) from (x̄k, ūk) using (6.2.17).
   (c) If k is odd, then generate (x̄k+1, ūk+1) from (x̄k, ūk) using
       the symmetric dual variant of (6.2.17).

Theorem 6.2.2 Let the sequences {x̄k}_{k=0}^∞ and {ūk}_{k=0}^∞ be generated by
method (6.2.26). Then each pair of points (x̄k, ūk) satisfies the excessive gap
condition. Therefore,

f(x̄k) − φ(ūk) ≤ (4‖A‖_{1,2}/(k+1)) · √(D1 D2). (6.2.27)

Proof In accordance with our choice of parameters,

μ1,0 μ2,0 = λ1,0 λ2,0 · ‖A‖²_{1,2} = 2μ2,0 L1(fμ2,0) > μ2,0 L1(fμ2,0).

Hence, in view of Lemma 6.2.3 the pair (x̄0 , ū0 ) satisfies the excessive gap condi-
tion. We have already checked that the sequence {τk }∞k=0 defined by (6.2.25) satisfies

the conditions of Theorem 6.2.1. Therefore, excessive gap conditions will be valid
for the sequences generated by (6.2.26). It remains to use inequality (6.2.20).


6.2.4 Minimizing Strongly Convex Functions

Consider now the model (6.2.2), which satisfies the following assumption.
Assumption 6.2.1 In representation (6.2.2) the function fˆ is strongly convex with
convexity parameter σ̂ > 0.
Let us prove the following variant of Danskin’s theorem.
Lemma 6.2.5 Under Assumption 6.2.1 the function φ defined by (6.2.3) is concave
and differentiable. Moreover, its gradient

∇φ(u) = −∇ φ̂(u) + Ax0 (u), (6.2.28)

where x0(u) is defined by (6.2.9), is Lipschitz-continuous with constant

L2(φ) = (1/σ̂)‖A‖²_{1,2} + L2(φ̂). (6.2.29)

Proof Let φ̃(u) = min_{x∈Q1} {⟨Ax, u⟩_{E2} + f̂(x)}. This function is concave as a minimum
of linear functions. Since f̂ is strongly convex, the solution of the latter minimization
problem is unique. Therefore, φ̃(·) is differentiable and ∇φ̃(u) = Ax0(u).

Consider two points u1 and u2. From the first-order optimality conditions
for (6.2.3) we have

⟨A*u1 + ∇f̂(x0(u1)), x0(u2) − x0(u1)⟩_{E1} ≥ 0,

⟨A*u2 + ∇f̂(x0(u2)), x0(u1) − x0(u2)⟩_{E1} ≥ 0.

Adding these inequalities and using the strong convexity of f̂(·), we continue as
follows:

⟨Ax0(u2) − Ax0(u1), u1 − u2⟩_{E2}
  ≥ ⟨∇f̂(x0(u1)) − ∇f̂(x0(u2)), x0(u1) − x0(u2)⟩_{E1}
  ≥ σ̂‖x0(u1) − x0(u2)‖²_{E1}                              (by (2.1.22))
  ≥ (σ̂/‖A‖²_{1,2}) (‖∇φ̃(u1) − ∇φ̃(u2)‖*_{E2})².            (by (6.1.9))

Thus, ‖∇φ̃(u1) − ∇φ̃(u2)‖*_{E2} ≤ (1/σ̂)‖A‖²_{1,2} · ‖u1 − u2‖_{E2}, and (6.2.29) follows. □


Lemma 6.2.6 For any u and û from Q2 , we have:

φ(û) + ∇φ(û), u − û E2 ≥ −φ̂(u) + Ax0 (û), u E2 + fˆ(x0 (û)). (6.2.30)

Proof Let us take arbitrary u and û from Q2 . Define x̂ = x0 (û). Then

φ(û) + ∇φ(û), u − û E2

= −φ̂(û) + Ax̂, û E2 + fˆ(x̂) + −∇ φ̂(û) + Ax̂, u − û E2

(2.1.2)
≥ −φ̂(u) + Ax̂, u E2 + fˆ(x̂). 

In this section, we derive an optimization scheme from the following variant of


excessive gap condition:

fμ2 (x̄) ≤ φ(ū) (6.2.31)

for some x̄ ∈ Q1 and ū in Q2 .


This condition can be seen as a variant of condition (6.2.13) with μ1 = 0.
However, we prefer not to use the results of the previous sections since our
assumptions will be slightly different. For example, we no longer need the set Q1 to
be bounded.
Lemma 6.2.7 Let points x̄ from Q1 and ū from Q2 satisfy condition (6.2.31). Then

0 ≤ f (x̄) − φ(ū) ≤ μ2 D2 . (6.2.32)

Proof Indeed, for any x ∈ Q1 , we have fμ2 (x) ≥ f (x) − μ2 D2 . 



Define the adjoint gradient mapping as follows:

V(u) = arg max_{v∈Q2} {⟨∇φ(u), v − u⟩_{E2} − (1/2)L2(φ)‖v − u‖²_{E2}}. (6.2.33)
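When ‖·‖_{E2} is the standard Euclidean norm, the maximization in (6.2.33) reduces to a projected gradient step, V(u) = π_{Q2}(u + ∇φ(u)/L2(φ)), since ⟨g, v − u⟩ − ½L‖v − u‖² = −½L‖v − (u + g/L)‖² + ‖g‖²/(2L). The sketch below is ours; the box Q2 and all names are illustrative assumptions, not the book's setting:

```python
import numpy as np

# Illustration (ours): for the Euclidean norm, the adjoint gradient
# mapping (6.2.33) is V(u) = proj_{Q2}(u + grad/L). Here Q2 is a box
# [0,1]^m, so the projection is a componentwise clip.

def V(u, grad_phi_u, L, lo=0.0, hi=1.0):
    return np.clip(u + grad_phi_u / L, lo, hi)

# Sanity check on random data: V(u) must dominate every feasible point
# in the concave model psi(v) = <g, v-u> - (L/2)||v-u||^2.
rng = np.random.default_rng(0)
u = rng.uniform(0, 1, size=5)
g = rng.normal(size=5)
L = 2.0
v_star = V(u, g, L)

def psi(v):
    return g @ (v - u) - 0.5 * L * np.dot(v - u, v - u)

for _ in range(1000):
    v = rng.uniform(0, 1, size=5)   # feasible point in the box
    assert psi(v) <= psi(v_star) + 1e-12
```

For a non-Euclidean norm or a general prox-function, V(u) is instead computed through the corresponding auxiliary minimization, as in the entropy example at the end of this section.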

Lemma 6.2.8 The excessive gap condition (6.2.31) is valid for μ2 = L2 (φ) and

x̄ = x0 (u0 ), ū = V (u0 ). (6.2.34)



Proof Indeed, in view of Lemma 6.2.5 and (1.2.11), we get the following relations:

φ(V(u0)) ≥ φ(u0) + ⟨∇φ(u0), V(u0) − u0⟩_{E2} − ½L2(φ)‖V(u0) − u0‖²_2
  = max_{u∈Q2} {φ(u0) + ⟨∇φ(u0), u − u0⟩_{E2} − ½L2(φ)‖u − u0‖²_2}     (by (6.2.33))
  = max_{u∈Q2} {−φ̂(u0) + ⟨Ax0(u0), u0⟩_{E2} + f̂(x0(u0))
        + ⟨Ax0(u0) − ∇φ̂(u0), u − u0⟩_{E2} − ½μ2‖u − u0‖²_2}           (by (6.2.3), (6.2.28))
  ≥ max_{u∈Q2} {−φ̂(u) + f̂(x0(u0)) + ⟨Ax0(u0), u⟩_{E2} − μ2 d2(u)}    (by (6.2.4))
  = fμ2(x0(u0)).                                                      (by (6.2.5))  □

Theorem 6.2.3 Let points x̄ ∈ Q1 and ū ∈ Q2 satisfy the excessive gap
condition (6.2.31) for some positive μ2. Let us fix τ ∈ (0, 1) and choose
μ2⁺ = (1 − τ)μ2,

û = (1 − τ)ū + τ uμ2(x̄),
x̄⁺ = (1 − τ)x̄ + τ x0(û),                                            (6.2.35)
ū⁺ = V(û).

Then the pair (x̄⁺, ū⁺) satisfies condition (6.2.31) with smoothness parameter μ2⁺,
provided that τ satisfies the following growth relation:

τ²/(1 − τ) ≤ μ2/L2(φ). (6.2.36)

Proof Let x̂ = x0(û) and u2 = uμ2(x̄). In view of the second rule in (6.2.35)
and (6.2.5), we have:

f_{μ2⁺}(x̄⁺) = f̂(x̄⁺) + max_{u∈Q2} {⟨A((1 − τ)x̄ + τx̂), u⟩_{E2} − φ̂(u) − (1 − τ)μ2 d2(u)}
  ≤ max_{u∈Q2} {(1 − τ)[f̂(x̄) + ⟨Ax̄, u⟩_{E2} − φ̂(u) − μ2 d2(u)]
        + τ[f̂(x̂) + ⟨Ax̂, u⟩_{E2} − φ̂(u)]}                             (by (3.1.2))
  ≤ max_{u∈Q2} {(1 − τ)[fμ2(x̄) − ½μ2‖u − u2‖²_2]
        + τ[φ(û) + ⟨∇φ(û), u − û⟩_{E2}]},                               (by (4.2.18))

where we used (6.2.30) in the last line. Since φ is concave, by (6.2.31) we obtain

fμ2(x̄) ≤ φ(ū) ≤ φ(û) + ⟨∇φ(û), ū − û⟩_{E2} = φ(û) + τ⟨∇φ(û), ū − u2⟩_{E2}

(the last equality is the first line in (6.2.35)). Hence, we can finish the proof as follows:

f_{μ2⁺}(x̄⁺) ≤ max_{u∈Q2} {φ(û) + τ⟨∇φ(û), u − u2⟩_{E2} − ½(1 − τ)μ2‖u − u2‖²_2}
  ≤ max_{u∈Q2} {φ(û) + τ⟨∇φ(û), u − u2⟩_{E2} − ½τ²L2(φ)‖u − u2‖²_2}.    (by (6.2.36))

Defining now v = ū + τ(u − ū) with u ∈ Q2, we continue:

f_{μ2⁺}(x̄⁺) ≤ max_{v∈ū+τ(Q2−ū)} {φ(û) + ⟨∇φ(û), v − û⟩_{E2} − ½L2(φ)‖v − û‖²_2}
  ≤ max_{v∈Q2} {φ(û) + ⟨∇φ(û), v − û⟩_{E2} − ½L2(φ)‖v − û‖²_2}     (Q2 is convex)
  = φ(û) + ⟨∇φ(û), ū⁺ − û⟩_{E2} − ½L2(φ)‖ū⁺ − û‖²_2                (by (6.2.33))
  ≤ φ(ū⁺).                                                          (by (1.2.11))  □


Now we can justify the following minimization scheme.

1. Initialization: Set μ2,0 = 2L2(φ), x̄0 = x0(u0) and ū0 = V(u0).

2. For k ≥ 0 iterate:

   Set τk = 2/(k+3) and ûk = (1 − τk)ūk + τk uμ2,k(x̄k).             (6.2.37)

   Update μ2,k+1 = (1 − τk)μ2,k,
          x̄k+1 = (1 − τk)x̄k + τk x0(ûk),
          ūk+1 = V(ûk).

Theorem 6.2.4 Let problem (6.2.1) satisfy Assumption 6.2.1. Then the pairs
(x̄k, ūk) generated by scheme (6.2.37) satisfy the following inequality:

f(x̄k) − φ(ūk) ≤ 4L2(φ)D2/((k+1)(k+2)), (6.2.38)

where L2(φ) is given by (6.2.29).

Proof Indeed, in view of Theorem 6.2.3 and Lemma 6.2.8 we need only to
justify that the sequences {μ2,k}_{k=0}^∞ and {τk}_{k=0}^∞ satisfy relation (6.2.36). This is
straightforward because of the following relation:

μ2,k = 4L2(φ)/((k+1)(k+2)),

which is valid for all k ≥ 0. □



Let us conclude this section with an example. Consider the problem

f(x) = ½‖x‖²_{E1} + max_{1≤j≤m} [fj + ⟨gj, x − xj⟩_{E1}] → min : x ∈ E1. (6.2.39)

Let E1 = R^n and choose

‖x‖²_1 = Σ_{i=1}^n (x^{(i)})²,   x ∈ E1.

Then this problem can be solved by the method (6.2.37).



Indeed, we can represent the objective function in (6.2.39) in the form (6.2.2)
using the following objects:

E2 = R^m,   Q2 = Δm = {u ∈ R^m_+ : Σ_{j=1}^m u^{(j)} = 1},

f̂(x) = ½‖x‖²_1,   φ̂(u) = ⟨b, u⟩_{E2},   b^{(j)} = ⟨gj, xj⟩_{E1} − fj,   j = 1, …, m,

A^T = (g1, …, gm).

Thus, σ̂ = 1 and L2(φ̂) = 0. Let us choose for E2 the following norm:

‖u‖_{E2} = Σ_{j=1}^m |u^{(j)}|.

Then we can use the entropy distance function,

d2(u) = ln m + Σ_{j=1}^m u^{(j)} ln u^{(j)},   u0 = (1/m, …, 1/m),

for which the convexity parameter is one and D2 = ln m. Note that in this case

‖A‖_{1,2} = max_{1≤j≤m} ‖gj‖*_1.

Thus, method (6.2.37) as applied to problem (6.2.39) converges with the following
rate:

f(x̄k) − φ(ūk) ≤ (4 ln m/((k+1)(k+2))) · max_{1≤j≤m} (‖gj‖*_1)².

Let us study the complexity of method (6.2.37) for our example. At each
iteration, we need to compute the following objects.

1. Computation of uμ2(x̄). This is the solution of the following problem:

max_u {Σ_{j=1}^m u^{(j)} s^{(j)}(x̄) − μ2 d2(u) : u ∈ Q2}

with s^{(j)}(x̄) = fj + ⟨gj, x̄ − xj⟩, j = 1, …, m. As we have seen several times,
this solution can be found in a closed form:

u^{(j)}_{μ2}(x̄) = e^{s^{(j)}(x̄)/μ2} · (Σ_{l=1}^m e^{s^{(l)}(x̄)/μ2})^{−1},   j = 1, …, m.
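In floating-point arithmetic this ratio is best evaluated after subtracting max_j s^{(j)}(x̄) from all components, which leaves the ratio unchanged but avoids overflow. A minimal sketch (ours):

```python
import numpy as np

def u_mu2(s, mu2):
    """Entropy-smoothed maximizer over the simplex:
    u^{(j)} = exp(s_j/mu2) / sum_l exp(s_l/mu2).
    Subtracting max(s) avoids overflow and does not change the ratio."""
    z = (s - np.max(s)) / mu2
    w = np.exp(z)
    return w / w.sum()

s = np.array([1.0, 2.0, 3.0])
u = u_mu2(s, mu2=0.5)
assert abs(u.sum() - 1.0) < 1e-12 and np.all(u >= 0)   # u lies in the simplex
# As mu2 -> 0, u concentrates on the maximal component of s.
assert u_mu2(s, 1e-6).argmax() == s.argmax()
```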

2. Computation of x0(û). In our case, this is a solution to the problem

min_x {⟨Ax, û⟩_{E2} + ½‖x‖²_{E1} : x ∈ E1}.

Hence, the answer is very simple: x0(û) = −A^T û.

3. Computation of V(û). In our case,

φ(û) = min_{x∈E1} {Σ_{j=1}^m û^{(j)}[fj + ⟨gj, x − xj⟩_{E1}] + ½‖x‖²_{E1}}

     = −⟨b, û⟩_{E2} − ½(‖A^T û‖*_{E1})².

Thus, ∇φ(û) = −b − AA^T û. Now we can compute V(û) by (6.2.33). It can
be easily shown that the complexity of finding V(û) is of the order O(m ln m),
which comes from the necessity to sort the components of a vector in R^m.
which comes from the necessity to sort the components of a vector in Rm .
Thus, we have seen that all computations at each iteration of method (6.2.37) as
applied to problem (6.2.39) are very cheap. The most expensive part of the iteration
is the multiplication of matrix A by a vector. In a straightforward implementation,
we need three such multiplications per iteration. However, a simple modification of
the order of operations can reduce this amount to two.

6.3 The Smoothing Technique in Semidefinite Optimization

(Smooth symmetric functions of eigenvalues; Minimizing the maximal eigenvalue of a


symmetric matrix.)

6.3.1 Smooth Symmetric Functions of Eigenvalues

In Sects. 6.1 and 6.2, we have shown that a proper use of the structure of nonsmooth
convex optimization problems leads to very efficient gradient schemes, whose
performance is significantly better than the lower complexity bounds derived from
the Black Box assumptions. However, this observation leads to implementable
algorithms only if we are able to form a computable smooth approximation of the
objective function of our problem. In this case, applying to this approximation an
optimal method (6.1.19) for minimizing smooth convex functions, we can easily
obtain a good solution to our initial problem.
Our previous results are related mainly to piece-wise linear functions. In this
section, we extend them to the problems of Semidefinite Optimization (SO).

For that, we introduce a computable smooth approximation of one of the most
important nonsmooth functions of symmetric matrices, its maximal eigenvalue. Our
approximation is based on entropy smoothing.
In what follows, we denote by Mn the space of real n × n-matrices, and by
Sn ⊂ Mn the space of symmetric matrices. A particular matrix is always denoted
by a capital letter. In the spaces R^n and Mn we use the standard inner products

⟨x, y⟩ = Σ_{i=1}^n x^{(i)} y^{(i)},   x, y ∈ R^n,

⟨X, Y⟩_F = Σ_{i,j=1}^n X^{(i,j)} Y^{(i,j)},   X, Y ∈ Mn.

For X ∈ Sn , we denote by λ(X) ∈ Rn the vector of its eigenvalues. We assume that


the eigenvalues are ordered in a decreasing order:

λ(1) (X) ≥ λ(2) (X) ≥ · · · ≥ λ(n) (X), X ∈ Sn .

Thus, λmax (X) = λ(1) (X). The notation D(λ) ∈ Sn is used for a diagonal matrix
with vector λ ∈ Rn on the main diagonal. Note that any X ∈ Sn admits an
eigenvalue decomposition

X = U (X)D(λ(X))U (X)T

with U (X)U (X)T = In , where In ∈ Sn is the identity matrix.


Let us mention some notations with different meanings for vectors and matrices.
For a vector λ ∈ Rn , we denote by |λ| ∈ Rn the vector with entries |λ(i) |, i =
1, . . . , n. The notation λk ∈ Rn is used for the vector with components (λ(i) )k ,
i = 1, . . . , n. However, for X ∈ Sn we define

|X| := U(X) D(|λ(X)|) U(X)^T ⪰ 0,

and the notation X^k is used for the standard matrix power. Since the power k ≥ 0
does not change the ordering of nonnegative components, for any X ⪰ 0 we have

λk (X) = λ(Xk ). (6.3.1)

Further, in R^n, we use the standard notation for ℓp-norms:

‖x‖_{(p)} = (Σ_{i=1}^n |x^{(i)}|^p)^{1/p},   x ∈ R^n,

where p ≥ 1, and ‖x‖_{(∞)} = max_{1≤i≤n} |x^{(i)}|. The corresponding norms in Sn are
introduced by

‖X‖_{(p)} = ‖λ(X)‖_{(p)} = ‖λ(|X|)‖_{(p)},   X ∈ Sn. (6.3.2)

For k ≥ 1, consider the following function:

πk(X) = ⟨X^k, In⟩_F = Σ_{i=1}^n (λ^{(i)}(X))^k,   X ∈ Sn.

Let us derive an upper bound for its second derivative. Note that this bound is
nontrivial only for k ≥ 2.

The derivatives of this function along a direction H ∈ Sn are defined as follows:

⟨∇πk(X), H⟩_F = k⟨X^{k−1}, H⟩_F,
                                                                    (6.3.3)
⟨∇²πk(X)H, H⟩_F = k Σ_{p=0}^{k−2} ⟨X^p H X^{k−2−p}, H⟩_F.

We need the following result.


Lemma 6.3.1 For any p, q ≥ 0, and X, H from Sn we have

⟨X^p H X^q + X^q H X^p, H⟩_F ≤ 2⟨|X|^{p+q}, H²⟩_F
                                                                    (6.3.4)
                              ≤ 2⟨λ^{p+q}(|X|), λ²(|H|)⟩.

Proof Indeed, let λ = λ(X), D = D(λ), U = U(X) and Ĥ = U^T H U. Then

⟨X^p H X^q + X^q H X^p, H⟩_F = ⟨U D^p U^T H U D^q U^T + U D^q U^T H U D^p U^T, H⟩_F
  = ⟨D^p Ĥ D^q + D^q Ĥ D^p, Ĥ⟩_F
  = Σ_{i,j=1}^n (Ĥ^{(i,j)})² ((λ^{(i)})^p (λ^{(j)})^q + (λ^{(i)})^q (λ^{(j)})^p)
  ≤ Σ_{i,j=1}^n (Ĥ^{(i,j)})² (|λ^{(i)}|^p |λ^{(j)}|^q + |λ^{(i)}|^q |λ^{(j)}|^p).

Note that for arbitrary non-negative values a and b we always have

0 ≤ (a^p − b^p)(a^q − b^q) = (a^{p+q} + b^{p+q}) − (a^p b^q + a^q b^p).

Thus, we can continue as follows:

⟨X^p H X^q + X^q H X^p, H⟩_F ≤ Σ_{i,j=1}^n (Ĥ^{(i,j)})² (|λ^{(i)}|^{p+q} + |λ^{(j)}|^{p+q})
  = 2 Σ_{i,j=1}^n (Ĥ^{(i,j)})² |λ^{(i)}|^{p+q} = 2⟨D(|λ|)^{p+q} Ĥ, Ĥ⟩_F
  = 2⟨D^{p+q}(|λ|), Ĥ²⟩_F = 2⟨|X|^{p+q}, H²⟩_F.

Hence, we get the first inequality in (6.3.4). Further, by von Neumann's inequality,

⟨|X|^{p+q}, H²⟩_F ≤ ⟨λ(|X|^{p+q}), λ(H²)⟩ = ⟨λ^{p+q}(|X|), λ²(|H|)⟩     (by (6.3.1)),

and this proves the remaining part of (6.3.4). □
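The first inequality in (6.3.4) is easy to stress-test numerically; the sketch below (ours, not part of the book) checks it on random symmetric matrices and small powers:

```python
import numpy as np

# Numerical check (ours) of the first inequality in (6.3.4):
#   <X^p H X^q + X^q H X^p, H>_F <= 2 <|X|^{p+q}, H^2>_F.

rng = np.random.default_rng(1)

def sym(n):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

def mat_abs(X):
    lam, U = np.linalg.eigh(X)          # X = U diag(lam) U^T
    return U @ np.diag(np.abs(lam)) @ U.T

def mpow(X, k):
    return np.linalg.matrix_power(X, k)

n = 6
for _ in range(100):
    X, H = sym(n), sym(n)
    for p in range(0, 3):
        for q in range(0, 3):
            lhs = np.trace((mpow(X, p) @ H @ mpow(X, q)
                            + mpow(X, q) @ H @ mpow(X, p)) @ H)
            rhs = 2 * np.trace(mpow(mat_abs(X), p + q) @ H @ H)
            assert lhs <= rhs + 1e-8 * max(1.0, abs(rhs))
```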



Corollary 6.3.1 For any k ≥ 2, we have

∇ 2 πk (X)H, H F ≤ k(k − 1)λk−2 (|X|), λ2 (|H |) . (6.3.5)

Proof For k = 2, the bound is trivial. For k ≥ 3, in representation (6.3.3) we can
unify the terms of the sum Σ_{p=0}^{k−2} ⟨X^p H X^{k−2−p}, H⟩_F in symmetric pairs

⟨X^p H X^{k−2−p} + X^{k−2−p} H X^p, H⟩_F.

Applying inequality (6.3.4) to each pair, we get the estimate (6.3.5).



Let f(·) be a function of a real variable, defined by a power series

f(τ) = a0 + Σ_{k=1}^∞ ak τ^k

with ak ≥ 0 for k ≥ 2. We assume that its domain dom f = {τ : |τ| < R} is
nonempty. For X ∈ Sn, consider the following symmetric function of eigenvalues:

F(X) = Σ_{i=1}^n f(λ^{(i)}(X)).

Clearly, dom F = {X ∈ Sn : λ^{(1)}(X) < R, λ^{(n)}(X) > −R}.


Theorem 6.3.1 For any X ∈ dom F and H ∈ Sn we have

⟨∇²F(X)H, H⟩ ≤ Σ_{i=1}^n ∇²f(λ^{(i)}(|X|)) (λ^{(i)}(|H|))².

Proof Indeed,

F(X) = n·a0 + Σ_{i=1}^n Σ_{k=1}^∞ ak (λ^{(i)}(X))^k = n·a0 + Σ_{k=1}^∞ ak πk(X).

Thus, in view of inequality (6.3.5),

⟨∇²F(X)H, H⟩_F = Σ_{k=2}^∞ ak ⟨∇²πk(X)H, H⟩_F
  ≤ Σ_{k=2}^∞ k(k−1) ak ⟨λ^{k−2}(|X|), λ²(|H|)⟩
  = Σ_{i=1}^n Σ_{k=2}^∞ k(k−1) ak (λ^{(i)}(|X|))^{k−2} (λ^{(i)}(|H|))²
  = Σ_{i=1}^n ∇²f(λ^{(i)}(|X|)) (λ^{(i)}(|H|))². □

Let us consider now two important examples of symmetric functions of eigen-


values.
1. Squared ℓp-Matrix Norm. For an integer p ≥ 1, consider the following
function:

Fp(X) = ½‖λ(X)‖²_{(2p)} = ½⟨X^{2p}, In⟩_F^{1/p},   X ∈ Sn. (6.3.6)

Thus, Fp(X) = ½(π2p(X))^{1/p}. Therefore, in view of (6.3.5), for any X, H ∈ Sn we
have

⟨∇Fp(X), H⟩_F = (1/(2p)) (π2p(X))^{1/p − 1} ⟨∇π2p(X), H⟩_F,

⟨∇²Fp(X)H, H⟩_F = (1/(2p)) · ((1/p) − 1) · (π2p(X))^{1/p − 2} ⟨∇π2p(X), H⟩²_F
        + (1/(2p)) (π2p(X))^{1/p − 1} ⟨∇²π2p(X)H, H⟩_F                  (6.3.7)
  ≤ (2p − 1) (π2p(X))^{1/p − 1} ⟨λ^{2p−2}(|X|), λ²(|H|)⟩.

Let us apply Hölder's inequality ⟨x, y⟩ ≤ ‖x‖_{(β)}‖y‖_{(γ)} with β = p/(p−1),
γ = β/(β−1) = p, and

x^{(i)} = (λ^{(i)}(|X|))^{2p−2},   y^{(i)} = (λ^{(i)}(|H|))²,   i = 1, …, n.

Then,

⟨x, y⟩ ≤ (Σ_{i=1}^n (λ^{(i)}(|X|))^{2p})^{(p−1)/p} · (Σ_{i=1}^n (λ^{(i)}(|H|))^{2p})^{1/p}
       = (π2p(X))^{(p−1)/p} · ‖λ(H)‖²_{(2p)}                        (by (6.3.2)),

and we can continue:

⟨∇²Fp(X)H, H⟩_F ≤ (2p − 1)‖λ(H)‖²_{(2p)} = (2p − 1)‖H‖²_{(2p)}. (6.3.8)
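The resulting bound (6.3.8) can be checked by finite differences: the second directional derivative of Fp along H should not exceed (2p − 1)‖H‖²_{(2p)}. A sketch (ours, not part of the book):

```python
import numpy as np

# Finite-difference check (ours) of the bound (6.3.8):
#   <grad^2 F_p(X) H, H> <= (2p - 1) ||H||_{(2p)}^2,
# where F_p(X) = (1/2) ||lambda(X)||_{(2p)}^2 = (1/2) (sum_i lambda_i^{2p})^{1/p}.

def F_p(X, p):
    lam = np.linalg.eigvalsh(X)
    return 0.5 * (np.sum(lam ** (2 * p))) ** (1.0 / p)

def norm_2p(H, p):
    lam = np.linalg.eigvalsh(H)
    return (np.sum(np.abs(lam) ** (2 * p))) ** (1.0 / (2 * p))

rng = np.random.default_rng(3)
n, p, h = 5, 2, 1e-4
for _ in range(20):
    A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
    X, H = (A + A.T) / 2, (B + B.T) / 2
    # central second difference approximates <grad^2 F_p(X) H, H>
    second_diff = (F_p(X + h * H, p) - 2 * F_p(X, p) + F_p(X - h * H, p)) / h**2
    assert second_diff <= (2 * p - 1) * norm_2p(H, p) ** 2 + 1e-3
```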

2. Entropy Smoothing of Maximal Eigenvalue. Consider the function

E(X) = ln( Σ_{i=1}^n e^{λ^{(i)}(X)} ) := ln F(X),   X ∈ Sn. (6.3.9)

Note that

⟨∇E(X), H⟩_F = (1/F(X)) ⟨∇F(X), H⟩_F,

⟨∇²E(X)H, H⟩_F = −(1/F²(X)) ⟨∇F(X), H⟩²_F + (1/F(X)) ⟨∇²F(X)H, H⟩_F
               ≤ (1/F(X)) ⟨∇²F(X)H, H⟩_F.

Let us assume first that X ⪰ 0. The function F(X) is formed by the auxiliary
function f(τ) = e^τ, which satisfies the assumptions of Theorem 6.3.1. Therefore,

⟨∇²E(X)H, H⟩_F ≤ (Σ_{i=1}^n e^{λ^{(i)}(X)})^{−1} Σ_{i=1}^n e^{λ^{(i)}(X)} (λ^{(i)}(|H|))² ≤ ‖H‖²_{(∞)}.
                                                                    (6.3.10)
(6.3.10)

It remains to note that E(X + τ In ) = E(X) + τ . Hence, the Hessian ∇ 2 E(X +


τ In ) does not depend on τ , and we conclude that the estimate (6.3.10) is valid for
arbitrary X ∈ Sn .

6.3.2 Minimizing the Maximal Eigenvalue of a Symmetric Matrix

Consider the following problem:

Find φ* = min_{y∈Q} {φ(y) := λmax(C + A(y))}, (6.3.11)

where Q is a closed convex set in R^m and A(·) is a linear operator from R^m to Sn:

A(y) = Σ_{i=1}^m y^{(i)} Ai ∈ Sn,   y ∈ R^m.

Note that the objective function in (6.3.11) is nonsmooth. Therefore, this problem
can be solved either by interior-point methods (see Chap. 5), or by general methods
of nonsmooth convex optimization (see Chap. 3). However, due to the very special
structure of the objective function, for problem (6.3.11) it is better to develop a
special scheme.
We are going to solve problem (6.3.11) by a smoothing technique discussed
in Sect. 6.1. This means that we replace the function λmax(X) by its smooth
approximation fμ(X) = μE((1/μ)X), defined by (6.3.9) with tolerance parameter
μ > 0. Note that

fμ(X) = μ ln( Σ_{i=1}^n e^{λ^{(i)}(X)/μ} ) ≥ λmax(X),
                                                                    (6.3.12)
fμ(X) ≤ λmax(X) + μ ln n.
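A quick numerical illustration (ours, not part of the book) of the two bounds in (6.3.12), using a shifted log-sum-exp for floating-point stability:

```python
import numpy as np

# Numerical illustration (ours) of the bounds (6.3.12):
#   lambda_max(X) <= f_mu(X) = mu * ln sum_i exp(lambda_i(X)/mu)
#                 <= lambda_max(X) + mu * ln n.

def f_mu(X, mu):
    lam = np.linalg.eigvalsh(X)
    m = lam.max()
    # stable evaluation: mu*ln sum exp(lam/mu) = m + mu*ln sum exp((lam-m)/mu)
    return m + mu * np.log(np.exp((lam - m) / mu).sum())

rng = np.random.default_rng(2)
n = 8
M = rng.normal(size=(n, n))
X = (M + M.T) / 2
lmax = np.linalg.eigvalsh(X).max()
for mu in [1.0, 0.1, 0.01]:
    val = f_mu(X, mu)
    assert lmax - 1e-12 <= val <= lmax + mu * np.log(n) + 1e-12
```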

At the same time,

∇fμ(X) = (Σ_{i=1}^n e^{λ^{(i)}(X)/μ})^{−1} · Σ_{i=1}^n e^{λ^{(i)}(X)/μ} ui(X) ui(X)^T, (6.3.13)

where ui (X), i = 1, . . . , n, are corresponding unit eigenvectors of the symmetric


matrix X. Thus, at each test point X, the gradient ∇fμ (X) takes into account all
(i)
eigenvalues of the matrix X. However, since the factors eλ (X)/μ decrease very
rapidly, it actually depends only on few largest eigenvalues. Their selection is made
automatically by expression (6.3.13). The ranking of importance of the eigenvalues
is done in a logarithmic scale controlled by the tolerance parameter μ.
Let us analyze now the efficiency of the smoothing technique as applied to
problem (6.3.11). Our goal is to find an ε-solution ȳ ∈ Q to problem (6.3.11):

φ(ȳ) − φ* ≤ ε. (6.3.14)

For that, we will try to find a (1/2)ε-solution to the smooth problem

Find φμ* = min_{y∈Q} {φμ(y) := fμ(C + A(y))}, (6.3.15)

with

μ = μ(ε) = ε/(2 ln n). (6.3.16)

Clearly, if φμ(ȳ) − φμ* ≤ (1/2)ε, then in view of (6.3.12) we have

φ(ȳ) − φ* ≤ φμ(ȳ) − φμ* + μ ln n ≤ ε.

Let us analyze now the complexity of finding a (1/2)ε-solution to problem (6.3.15) by
the optimal method (6.1.19).
Let us fix some norm ‖h‖ for h ∈ R^m. Consider a prox-function d(·) of the set
Q with prox-center x0 ∈ Q. We assume this function to be strongly convex on Q
with convexity parameter one. Define

‖A‖ = max_{h∈R^m} {‖A(h)‖_{(∞)} : ‖h‖ = 1}.

Note that this norm is quite small. Indeed,

‖A(h)‖_{(∞)} = λ^{(1)}(|A(h)|) ≤ ⟨A(h), A(h)⟩_F^{1/2},   h ∈ R^m.

Therefore, for example, ‖A‖ ≤ ‖A‖_G := max_{‖h‖=1} ⟨A(h), A(h)⟩_F^{1/2}.

Let us estimate the second derivative of the function φμ(·). For any y and h from
R^m, in view of inequality (6.3.10) we have

⟨∇φμ(y), h⟩ = ⟨∇fμ(C + A(y)), A(h)⟩_F = ⟨∇E((1/μ)(C + A(y))), A(h)⟩_F,

⟨∇²φμ(y)h, h⟩ = (1/μ) ⟨∇²E((1/μ)(C + A(y))) A(h), A(h)⟩_F
             ≤ (1/μ)‖A(h)‖²_{(∞)} ≤ (1/μ)‖A‖² · ‖h‖².

Thus, by Theorem 6.1.1 the function φμ has Lipschitz continuous gradient with the
constant

L = (1/μ)‖A‖² = (2 ln n/ε)‖A‖².

Now taking into account the estimate (6.1.21), we conclude that the method (6.1.19),
as applied to problem (6.3.15), has the following rate of convergence:

φμ(yk) − φμ* ≤ 8 ln n ‖A‖² d(yμ*) / (ε·(k+1)(k+2)),

where yμ* ∈ Q is the solution to (6.3.15). Hence, it is able to generate a (1/2)ε-solution
to this problem (which is an ε-solution to problem (6.3.11)) after at most

(4‖A‖/ε) · √(d(yμ*) ln n) (6.3.17)

iterations.

6.4 Minimizing the Local Model of an Objective Function

(A linear optimization oracle; The method of conditional gradients; Conditional gradients


with contraction; Computation of primal-dual solution; Strong convexity of the composite
term; The second-order trust-region method with contraction.)

6.4.1 A Linear Optimization Oracle

In this section we consider numerical methods for solving the following composite
minimization problem:

min_x {f̄(x) := f(x) + Ψ(x)}, (6.4.1)

where Ψ is a simple closed convex function with bounded domain Q ⊂ E, and f
is a convex function which is differentiable on Q. Denote by x* one of the optimal
solutions of (6.4.1), and let D := diam(Q). As usual, our assumption on the simplicity
of the function Ψ means that some auxiliary optimization problems related to Ψ
are easily solvable. The complexity of these problems will always be discussed for
the corresponding optimization schemes.
The most important examples of the function Ψ are as follows.
• Ψ is an indicator function of a closed convex set Q:

Ψ(x) = Ind_Q(x) := { 0, if x ∈ Q;  +∞, otherwise }. (6.4.2)

• Ψ is a self-concordant barrier for a closed convex set Q (see Sect. 5.3).


• Ψ is a nonsmooth convex function with simple structure. In this case, we need
to include in Ψ an indicator function for a bounded domain. For example, it

could be

x(1) , if x(1) ≤ R,
Ψ (x) =
+∞, otherwise.

We assume that the function f is represented by a Black-Box oracle. If it is a
first-order oracle, we assume its gradients satisfy the following Hölder condition:

‖∇f(x) − ∇f(y)‖* ≤ Gν‖x − y‖^ν,   x, y ∈ Q. (6.4.3)

The constant Gν is formally defined for any ν ∈ (0, 1]. For some values of ν it can
be +∞. Note that for any x and y in Q we have

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (Gν/(1+ν))‖y − x‖^{1+ν}. (6.4.4)

If this is a second-order oracle, we assume that its Hessians satisfy the Hölder
condition

‖∇²f(x) − ∇²f(y)‖ ≤ Hν‖x − y‖^ν,   x, y ∈ Q. (6.4.5)

In this case, for any x and y in Q we have

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + ½⟨∇²f(x)(y − x), y − x⟩ + (Hν/((1+ν)(2+ν)))‖y − x‖^{2+ν}.
                                                                    (6.4.6)

Our assumption on the simplicity of the function Ψ means exactly the following.
Assumption 6.4.1 For any s ∈ E*, the auxiliary problem

min_{x∈Q} {⟨s, x⟩ + Ψ(x)} (6.4.7)

is easily solvable. Denote by vΨ(s) ∈ Q one of its optimal solutions.


Thus, for our methods we assume that we can use a linear optimization oracle,
related to the set Q. Indeed, in the case (6.4.2), this assumption implies that we are
able to solve the problem

min_x {⟨s, x⟩ : x ∈ Q}.

For some sets (e.g. convex hulls of finite number of points), this oracle has lower
complexity than the standard auxiliary problem consisting in minimizing a prox-
function plus a linear term (see, for example, Sect. 6.1.3).
In view of Theorem 3.1.23 the point vΨ (s) is characterized by the following
variational principle:

s, x − vΨ (s) + Ψ (x) ≥ Ψ (vΨ (s)), x ∈ Q. (6.4.8)

By Definition 3.1.5, this means that −s ∈ ∂Ψ (vΨ (s)).



In the sequel, we often need to estimate the partial sums of different series. For
that, it is convenient to use the following lemma, the proof of which we leave as an
exercise for the reader.
Lemma 6.4.1 Let the function ξ(τ), τ ∈ R, be decreasing and convex. Then, for
any two integers a and b such that [a − ½, b + 1] ⊂ dom ξ, we have

∫_a^{b+1} ξ(τ) dτ ≤ Σ_{k=a}^b ξ(k) ≤ ∫_{a−1/2}^{b+1/2} ξ(τ) dτ. (6.4.9)

For example, for any t ≥ 0 and p ≥ −t, we have

Σ_{k=t}^{2t+p} 1/(k+p+1) ≥ ∫_t^{2t+p+1} dτ/(τ+p+1) = ln(τ+p+1) |_t^{2t+p+1}     (by (5.4.38))
                                                                    (6.4.10)
   = ln((2t+2p+2)/(t+p+1)) = ln 2.

On the other hand, if t ≥ 1, then

Σ_{k=t}^{2t+1} 1/(k+2)² ≤ ∫_{t−1/2}^{2t+3/2} dτ/(τ+2)² = −1/(τ+2) |_{t−1/2}^{2t+3/2}     (by (5.4.38))
   = 1/(t+3/2) − 1/(2t+7/2)
   = (4t+8)/((2t+3)(4t+7)) ≤ 12/(11(2t+3)).
                                                                    (6.4.11)
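Both example bounds are easy to sanity-check numerically (our sketch, not part of the book):

```python
import math

# Quick numerical check (ours) of the example bounds (6.4.10) and (6.4.11).

# (6.4.10): sum_{k=t}^{2t+p} 1/(k+p+1) >= ln 2 for t >= 0, p >= -t.
for t in range(1, 50):
    for p in range(-t, 20):
        s = sum(1.0 / (k + p + 1) for k in range(t, 2 * t + p + 1))
        assert s >= math.log(2) - 1e-12

# (6.4.11): sum_{k=t}^{2t+1} 1/(k+2)^2 <= 12/(11(2t+3)) for t >= 1.
for t in range(1, 200):
    s = sum(1.0 / (k + 2) ** 2 for k in range(t, 2 * t + 2))
    assert s <= 12.0 / (11 * (2 * t + 3)) + 1e-12
```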

6.4.2 The Method of Conditional Gradients with Composite
Objective

In order to solve problem (6.4.1), we apply the following method.

Conditional Gradients with Composite Objective

1. Choose an arbitrary point x0 ∈ Q.
                                                                    (6.4.12)
2. For t ≥ 0 iterate: (a) Compute vt = vΨ(∇f(xt)).
                      (b) Choose τt ∈ (0, 1] and set xt+1 = (1 − τt)xt + τt vt.
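As an illustration (ours, not from the book), here is method (6.4.12) with Ψ = Ind_Q for Q the standard simplex, where the linear optimization oracle vΨ(s) returns a minimizing vertex; the step sizes τt = 2/(t+2) correspond to the linear weights at ≡ t analyzed at the end of this section. The objective and all names are our illustrative assumptions:

```python
import numpy as np

# Sketch (ours) of method (6.4.12) with Psi = Ind_Q for Q the standard
# simplex: the oracle min_{x in Q} <s, x> returns the vertex e_j with
# j = argmin_j s_j. Objective: f(x) = 0.5 ||x - c||^2.

def conditional_gradient(grad, x0, T):
    x = x0.copy()
    for t in range(T):
        s = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(s)] = 1.0           # LMO over the simplex: a vertex
        tau = 2.0 / (t + 2)             # step rule for linear weights a_t = t
        x = (1 - tau) * x + tau * v
    return x

n = 10
c = np.linspace(0.0, 1.0, n)            # illustrative target point
x0 = np.ones(n) / n
f = lambda x: 0.5 * np.dot(x - c, x - c)
x = conditional_gradient(lambda x: x - c, x0, T=2000)
# The iterate stays feasible (a convex combination of vertices) ...
assert abs(x.sum() - 1.0) < 1e-9 and np.all(x >= -1e-12)
# ... and the objective has improved over the starting point.
assert f(x) < f(x0)
```

Note that each iterate is a convex combination of simplex vertices, so feasibility is automatic; no projection is ever needed.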



It is clear that this method can solve only problems where the function f has
continuous gradient.
Example 6.4.1 Let Ψ (x) = Ind Q (x) with Q = {x ∈ R2 : (x (1))2 + (x (2))2 ≤ 1}.
Define

f (x) = max{x (1), x (2) }.


Then clearly x* = (−1/√2, −1/√2)^T. Let us choose in (6.4.12) x0 ≠ x*.
2 2
For the function f , we can apply an oracle which returns at any x ∈ Q
a subgradient ∇f (x) ∈ {(1, 0)T , (0, 1)T }. Then, for any feasible x, the point
vΨ (∇f (x)) is equal either to y1 = (−1, 0)T , or to y2 = (0, −1)T . Therefore, all
points of the sequence {xt }t ≥0, generated by method (6.4.12), belong to the triangle
Conv{x0 , y1 , y2 }, which does not contain the optimal point x∗ .
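The failure described in Example 6.4.1 can be reproduced numerically; in the sketch below (ours, with an arbitrarily chosen feasible x0), the iterates never approach the optimal value max-component −1/√2:

```python
import numpy as np

# Illustration (ours) of Example 6.4.1: on the unit disk, the linear
# optimization oracle applied to subgradients of f(x) = max(x1, x2)
# only ever returns y1 = (-1, 0) or y2 = (0, -1), so the iterates
# stay in Conv{x0, y1, y2}, which misses x* = (-1, -1)/sqrt(2).

def subgrad(x):                       # a subgradient of max(x1, x2)
    return np.array([1.0, 0.0]) if x[0] >= x[1] else np.array([0.0, 1.0])

def lmo_disk(s):                      # argmin_{||x|| <= 1} <s, x> = -s/||s||
    return -s / np.linalg.norm(s)

x = np.array([0.3, -0.2])             # an arbitrary feasible x0 != x*
for t in range(1000):
    v = lmo_disk(subgrad(x))
    x = (1 - 2.0 / (t + 2)) * x + (2.0 / (t + 2)) * v

x_star = -np.ones(2) / np.sqrt(2)
# The method stalls: on the triangle, max(x1, x2) >= -1/2 > -1/sqrt(2).
assert max(x) > -0.6 > max(x_star)
```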

In order to justify the rate of convergence of method (6.4.12) for functions
with Hölder continuous gradients, we apply a variant of the estimating sequences
technique (see Sects. 2.2.1 and 6.1.3). For that, it is convenient to introduce
in (6.4.12) new control variables. Consider a sequence of nonnegative weights
{at}_{t≥0}. Define

At = Σ_{k=0}^t ak,   τt = a_{t+1}/A_{t+1},   t ≥ 0. (6.4.13)

From now on, we assume that the parameter τt in method (6.4.12) is chosen in
accordance with the rule (6.4.13). Define

V0 = max_x {⟨∇f(x0), x0 − x⟩ + Ψ(x0) − Ψ(x)},
                                                                    (6.4.14)
Bν,t = a0 V0 + Σ_{k=1}^t (a_k^{1+ν}/A_k^ν) Gν D^{1+ν},   t ≥ 0.

It is clear that

V0 ≤ max_x {f(x0) − f(x) + (Gν/(1+ν))‖x − x0‖^{1+ν} + Ψ(x0) − Ψ(x)}    (by (6.4.4))
                                                                       (6.4.15)
   ≤ f̄(x0) − f̄(x*) + Gν D^{1+ν}/(1+ν) = Δ(x0) + Gν D^{1+ν}/(1+ν),

where Δ(x0) := f̄(x0) − f̄(x*).

Theorem 6.4.1 Let the sequence {xt}_{t≥0} be generated by method (6.4.12). Then,
for any ν ∈ (0, 1] with Gν < +∞, any step t ≥ 0, and any x ∈ Q we have

At(f(xt) + Ψ(xt)) ≤ Σ_{k=0}^t ak[f(xk) + ⟨∇f(xk), x − xk⟩ + Ψ(x)] + Bν,t.
                                                                    (6.4.16)

Proof Indeed, in view of definition (6.4.14), for t = 0 inequality (6.4.16) is
satisfied. Assume that it is valid for some t ≥ 0. Then

Σ_{k=0}^{t+1} ak[f(xk) + ⟨∇f(xk), x − xk⟩ + Ψ(x)] + Bν,t
  ≥ At(f(xt) + Ψ(xt)) + a_{t+1}[f(x_{t+1}) + ⟨∇f(x_{t+1}), x − x_{t+1}⟩ + Ψ(x)]   (by (6.4.16))
  ≥ A_{t+1} f(x_{t+1}) + At Ψ(xt) + ⟨∇f(x_{t+1}), a_{t+1}(x − x_{t+1}) + At(xt − x_{t+1})⟩
      + a_{t+1} Ψ(x)
  = A_{t+1} f(x_{t+1}) + At Ψ(xt) + a_{t+1}[Ψ(x) + ⟨∇f(x_{t+1}), x − vt⟩]         (by (6.4.12)b)
  ≥ A_{t+1}(f(x_{t+1}) + Ψ(x_{t+1})) + a_{t+1}[Ψ(x) − Ψ(vt) + ⟨∇f(x_{t+1}), x − vt⟩].  (by (6.4.12)b)

It remains to note that

Ψ(x) − Ψ(vt) + ⟨∇f(x_{t+1}), x − vt⟩ ≥ ⟨∇f(x_{t+1}) − ∇f(xt), x − vt⟩              (by (6.4.8))
  ≥ −τt^ν Gν D^{1+ν}.                                                              (by (6.4.3))

Thus, to ensure that (6.4.16) is valid for the next iteration, it is enough to choose

Bν,t+1 = Bν,t + (a_{t+1}^{1+ν}/A_{t+1}^ν) Gν D^{1+ν}. □

Corollary 6.4.1 For any t ≥ 0 with At > 0, and any ν ∈ (0, 1], we have

f̄(x_t) − f̄(x_*) ≤ (1/A_t) B_{ν,t}.   (6.4.17)

Let us discuss now the possible variants for choosing the weights {at }t ≥0.
1. Constant weights. Let us choose at ≡ 1, t ≥ 0. Then At = t + 1, and for
ν ∈ (0, 1) we have



B_{ν,t} = V_0 + Σ_{k=1}^t (1/(1+k)^ν) G_ν D^{1+ν}

        ≤ V_0 + G_ν D^{1+ν} · (1/(1−ν)) (1+τ)^{1−ν} |_{τ=1/2}^{τ=t+1/2}                          (by (6.4.9))

        ≤ Δ(x_0) + G_ν D^{1+ν} [ 1/(1+ν) + (3/2)^{1−ν} (1/(1−ν)) ( (1 + (2/3)t)^{1−ν} − 1 ) ].   (by (6.4.15))

Thus, for ν ∈ (0, 1), we have (1/A_t) B_{ν,t} ≤ O(t^{−ν}). For the most important case ν = 1, we have lim_{ν→1} (1/(1−ν)) [ (1 + (2/3)t)^{1−ν} − 1 ] = ln(1 + (2/3)t). Therefore,

f̄(x_t) − f̄(x_*) ≤ (1/(t+1)) [ Δ(x_0) + G_1 D^2 ( 1/2 + ln(1 + (2/3)t) ) ].   (6.4.18)

In this situation, in view of (6.4.13), in method (6.4.12) we take τ_t = 1/(t+1).
2. Linear weights. Let us choose a_t ≡ t, t ≥ 0. Then A_t = t(t+1)/2, and for ν ∈ (0, 1) with t ≥ 1 we have

B_{ν,t} = Σ_{k=1}^t (2^ν k^{1+ν} / (k^ν (1+k)^ν)) G_ν D^{1+ν} ≤ 2^ν Σ_{k=1}^t k^{1−ν} G_ν D^{1+ν}

        ≤ G_ν D^{1+ν} · (2^ν/(2−ν)) τ^{2−ν} |_{τ=1/2}^{τ=t+1/2} = (2^ν/(2−ν)) [ (t + 1/2)^{2−ν} − (1/2)^{2−ν} ] G_ν D^{1+ν}.   (by (6.4.9))

Thus, for ν ∈ (0, 1), we again have (1/A_t) B_{ν,t} ≤ O(t^{−ν}). For the case ν = 1, we get the following bound:

f̄(x_t) − f̄(x_*) ≤ (4/(t+1)) G_1 D^2,   t ≥ 1.   (6.4.19)

As we can see, this rate of convergence is better than (6.4.18). In this case, in view of (6.4.13), in method (6.4.12) we take τ_t = 2/(t+2), which is a standard recommendation for
this scheme.
3. Aggressive weights. Let us choose, for example, a_t ≡ t^2, t ≥ 0. Then A_t = t(t+1)(2t+1)/6. Note that for k ≥ 0 we have (k+1)(2k+1) ≥ 2k^2. Therefore, for ν ∈ (0, 1) with t ≥ 1 we obtain

B_{ν,t} = Σ_{k=1}^t (6^ν k^{2(1+ν)} / (k^ν (1+k)^ν (2k+1)^ν)) G_ν D^{1+ν} ≤ 3^ν Σ_{k=1}^t k^{2−ν} G_ν D^{1+ν}

        ≤ G_ν D^{1+ν} · (3^ν/(3−ν)) τ^{3−ν} |_{τ=1/2}^{τ=t+1/2} = (3^ν/(3−ν)) [ (t + 1/2)^{3−ν} − (1/2)^{3−ν} ] G_ν D^{1+ν}.   (by (6.4.9))

For ν ∈ (0, 1), we get again (1/A_t) B_{ν,t} ≤ O(t^{−ν}). For ν = 1, we obtain

f̄(x_t) − f̄(x_*) ≤ (9/(2t+1)) G_1 D^2,   t ≥ 1,   (6.4.20)

which is slightly worse than (6.4.19). The rule (6.4.13) for choosing the coefficients τ_t in this situation gives τ_t = 6(t+1)/((t+2)(2t+3)). It can be easily checked that a further increase of the rate of growth of the coefficients a_t makes the rate of convergence of method (6.4.12) even worse.
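The three step-size rules above can be compared on a toy smooth instance. The sketch below is illustrative only (the test problem, a quadratic over a box, is an assumption, not from the book); it runs method (6.4.12) with each τ_t schedule from (6.4.13) and reports the final accuracy.

```python
import numpy as np

# Compares the three step-size rules of Sect. 6.4.2 for method (6.4.12) on an
# illustrative smooth instance: f(x) = 0.5||x - c||^2, Q = [-1,1]^n, Psi = Ind_Q.
# The linear minimization oracle over the box is v = -sign(grad f).

rng = np.random.default_rng(0)
n = 20
c = rng.uniform(-2.0, 2.0, n)
x_star = np.clip(c, -1.0, 1.0)                 # exact solution (projection on Q)
f = lambda x: 0.5 * np.dot(x - c, x - c)

def run_cg(tau_rule, iters=500):
    x = np.ones(n)
    for t in range(iters):
        v = -np.sign(x - c)                    # v_Psi(grad f(x_t))
        tau = tau_rule(t)
        x = (1.0 - tau) * x + tau * v
    return f(x) - f(x_star)

gap_const  = run_cg(lambda t: 1.0 / (t + 1.0))                                   # a_t = 1
gap_linear = run_cg(lambda t: 2.0 / (t + 2.0))                                   # a_t = t
gap_square = run_cg(lambda t: 6.0 * (t + 1.0) / ((t + 2.0) * (2.0 * t + 3.0)))   # a_t = t^2
```

All three schedules are parameter-free, in agreement with the universality discussion below; the constant-weight rule carries the extra logarithmic factor of (6.4.18).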

Note that the above rules for choosing the coefficients {τt }t ≥0 in method (6.4.12)
do not depend on the smoothness parameter ν ∈ (0, 1]. In this sense,
method (6.4.12) is a universal method for solving the problem (6.4.1). Moreover,
this method is affine invariant. Its behavior does not depend on the choice of norm
in E. Hence, its rate of convergence can be established with respect to the best norm
describing the geometry of the feasible set.

6.4.3 Conditional Gradients with Contraction

In this section, we will use some special dual functions. Let Q ⊂ E be a bounded
closed convex set. For a closed convex function F (·) with dom F ⊇ int Q, we define
its restricted dual function, (with respect to a central point x̄ ∈ Q), as follows:

F*_{x̄,Q}(s) = max_{x∈Q} {⟨s, x̄ − x⟩ + F(x̄) − F(x)},   s ∈ E*.   (6.4.21)

Clearly, this function is well defined for all s ∈ E∗ . Moreover, it is convex and
nonnegative on E∗ .
We need to introduce in construction (6.4.21) an additional scaling parameter
τ ∈ [0, 1], which controls the size of the feasible set. For s ∈ E∗ , we call the
function

F*_{τ,x̄,Q}(s) = max_{x∈Q} {⟨s, x̄ − y⟩ + F(x̄) − F(y) : y = (1−τ)x̄ + τx}   (6.4.22)

the scaled restricted dual of the function F .


Lemma 6.4.2 For any s ∈ E∗ and τ ∈ [0, 1], we have

F*_{x̄,Q}(s) ≥ F*_{τ,x̄,Q}(s) ≥ τ F*_{x̄,Q}(s).   (6.4.23)

Proof Since for any x ∈ Q, the point y = (1 − τ )x̄ + τ x belongs to Q, the first
inequality is trivial. On the other hand,

F*_{τ,x̄,Q}(s) = max_{x∈Q} {⟨s, τ(x̄ − x)⟩ + F(x̄) − F(y) : y = (1−τ)x̄ + τx}

              ≥ max_{x∈Q} {⟨s, τ(x̄ − x)⟩ + F(x̄) − (1−τ)F(x̄) − τF(x)}

              = τ F*_{x̄,Q}(s).  □

Let us consider a variant of method (6.4.12), which takes into account the com-
posite form of the objective function in problem (6.4.1). For Ψ (x) ≡ IndQ (x), these

two methods coincide. Otherwise, they generate different minimization sequences.

Conditional Gradient Method with Contraction

1. Choose an arbitrary point x0 ∈ Q.

2. For t ≥ 0 iterate: Choose a coefficient τt ∈ (0, 1] and compute

x_{t+1} = arg min_{x∈Q} { ⟨∇f(x_t), y⟩ + Ψ(y) : y = (1−τ_t)x_t + τ_t x }.   (6.4.24)

This method can be seen as a Trust-Region Scheme with a linear model of the
objective function. The trust region in method (6.4.24) is formed by a contraction of
the initial feasible set. In Sect. 6.4.6, we will consider a more traditional trust-region
method with quadratic model of the objective.
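For illustration, here is one iteration of method (6.4.24) in a case where the subproblem is solvable in closed form. The setting (Q a box, Ψ = λ‖·‖₁) is an assumption of this sketch: the contracted problem then separates over coordinates, and each one-dimensional piece is piecewise linear, so its minimum lies at an endpoint of the contracted segment or at the kink y_i = 0.

```python
import numpy as np

# One step of method (6.4.24) for Q = [-1,1]^n and Psi(y) = lam*||y||_1
# (an illustrative choice).  With y = (1-tau)*x_t + tau*x the subproblem
# separates: each term g_i*y_i + lam*|y_i| is piecewise linear in y_i, so its
# minimum over the contracted segment is at an endpoint or at the kink y_i = 0.

def contraction_step(x_t, g, lam, tau):
    y = np.empty_like(x_t)
    for i in range(len(x_t)):
        lo = (1.0 - tau) * x_t[i] - tau        # corresponds to x_i = -1
        hi = (1.0 - tau) * x_t[i] + tau        # corresponds to x_i = +1
        cand = [lo, hi] + ([0.0] if lo <= 0.0 <= hi else [])
        vals = [g[i] * z + lam * abs(z) for z in cand]
        y[i] = cand[int(np.argmin(vals))]
    return y

x_t = np.array([0.5, -0.2, 0.8])
g = np.array([1.0, -0.3, 0.1])                 # gradient of f at x_t
lam, tau = 0.2, 0.5
y_next = contraction_step(x_t, g, lam, tau)
```

Note how the ℓ₁ term can pin coordinates of y exactly to zero inside the trust region, which the plain method (6.4.12) with Ψ folded into the oracle cannot do.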
In view of Theorem 3.1.23 the point xt +1 in method (6.4.24) is characterized by
the following variational principle:

x_{t+1} = (1−τ_t)x_t + τ_t v_t,   v_t ∈ Q,

Ψ((1−τ_t)x_t + τ_t x) + τ_t ⟨∇f(x_t), x − x_t⟩ ≥ Ψ(x_{t+1}) + ⟨∇f(x_t), x_{t+1} − x_t⟩,   x ∈ Q.   (6.4.25)

Let us choose somehow the sequence of nonnegative weights {at }t ≥0, and define
in (6.4.24) the coefficients τt in accordance to (6.4.13). Define now the estimating
functional sequence {φt (x)}t ≥0 as follows:

φ_0(x) = a_0 f̄(x),

φ_{t+1}(x) = φ_t(x) + a_{t+1} [f(x_t) + ⟨∇f(x_t), x − x_t⟩ + Ψ(x)],   t ≥ 0.   (6.4.26)

Clearly, for all t ≥ 0 we have

φt (x) ≤ At f¯(x), x ∈ Q. (6.4.27)



Define



C_{ν,t} = a_0 Δ(x_0) + (1/(1+ν)) Σ_{k=1}^t (a_k^{1+ν}/A_k^ν) G_ν D^{1+ν},   t ≥ 0.   (6.4.28)

Let us introduce

δ(x) ≝ max_{y∈Q} {⟨∇f(x), x − y⟩ + Ψ(x) − Ψ(y)} ≡ Ψ*_{x,Q}(∇f(x))   (see (6.4.21)).   (6.4.29)

For problem (6.4.1), this value measures the level of satisfaction of the first-order
optimality conditions at a point x ∈ Q. For any x ∈ Q, we have

δ(x) ≥ f¯(x) − f¯(x∗ ) ≥ 0. (6.4.30)

We call δ(x) the total variation of the linear model of the composite objective
function in problem (6.4.1) over the feasible set. It justifies the first-order optimality
conditions in our problem. Note that this value can be computed by a procedure for
solving the auxiliary problem (6.4.7).
Theorem 6.4.2 Let the sequence {xt }t ≥0 be generated by method (6.4.24). Then,
for any ν ∈ (0, 1] and any step t ≥ 0, we have

At f¯(xt ) ≤ φt (x) + Cν,t , x ∈ Q. (6.4.31)

Moreover, for any t ≥ 0 we have

f̄(x_t) − f̄(x_{t+1}) ≥ τ_t δ(x_t) − (G_ν D^{1+ν}/(1+ν)) τ_t^{1+ν}.   (6.4.32)

Proof Let us prove inequality (6.4.31). For t = 0, we have C_{ν,0} = a_0 [f̄(x_0) − f̄(x_*)]. Thus, in this case (6.4.31) follows from (6.4.27).
Assume now that (6.4.31) is valid for some t ≥ 0. In view of definition (6.4.13), the optimality condition (6.4.25) can be written in the following form:

a_{t+1} ⟨∇f(x_t), x − x_t⟩ ≥ A_{t+1} [Ψ(x_{t+1}) − Ψ((1−τ_t)x_t + τ_t x) + ⟨∇f(x_t), x_{t+1} − x_t⟩]

for all x ∈ Q. Therefore,

φ_{t+1}(x) + C_{ν,t} = φ_t(x) + C_{ν,t} + a_{t+1} [f(x_t) + ⟨∇f(x_t), x − x_t⟩ + Ψ(x)]

  ≥ A_t [f(x_t) + Ψ(x_t)] + a_{t+1} [f(x_t) + Ψ(x)]
    + A_{t+1} [Ψ(x_{t+1}) − Ψ((1−τ_t)x_t + τ_t x) + ⟨∇f(x_t), x_{t+1} − x_t⟩]     (by (6.4.25), (6.4.31))

  ≥ A_{t+1} [f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + Ψ(x_{t+1})]

  ≥ A_{t+1} [ f̄(x_{t+1}) − (G_ν/(1+ν)) ‖x_{t+1} − x_t‖^{1+ν} ].                   (by (6.4.4))

It remains to note that, by (6.4.13), ‖x_{t+1} − x_t‖ = τ_t ‖x_t − v_t‖ ≤ (a_{t+1}/A_{t+1}) D. Thus, we can take

C_{ν,t+1} = C_{ν,t} + (1/(1+ν)) (a_{t+1}^{1+ν}/A_{t+1}^ν) G_ν D^{1+ν}.

In order to prove inequality (6.4.32), let us introduce the values

δ_τ(x) ≝ max_{u∈Q} {⟨∇f(x), x − y⟩ + Ψ(x) − Ψ(y) : y = (1−τ)x + τu}
       = Ψ*_{τ,x,Q}(∇f(x)),   τ ∈ [0, 1].                                       (by (6.4.22))

Clearly,

−δ_{τ_t}(x_t) = min_{x∈Q} {⟨∇f(x_t), y − x_t⟩ + Ψ(y) − Ψ(x_t) : y = (1−τ_t)x_t + τ_t x}

             = ⟨∇f(x_t), x_{t+1} − x_t⟩ + Ψ(x_{t+1}) − Ψ(x_t)

             ≥ f̄(x_{t+1}) − f̄(x_t) − (G_ν/(1+ν)) ‖x_{t+1} − x_t‖^{1+ν}.         (by (6.4.4))

Since ‖x_{t+1} − x_t‖ ≤ τ_t D, we conclude that

f̄(x_t) − f̄(x_{t+1}) ≥ δ_{τ_t}(x_t) − (G_ν D^{1+ν}/(1+ν)) τ_t^{1+ν} ≥ τ_t δ(x_t) − (G_ν D^{1+ν}/(1+ν)) τ_t^{1+ν},   (by (6.4.23))

which completes the proof.  □


In view of (6.4.27), inequality (6.4.31) results in the following rate of convergence:

f̄(x_t) − f̄(x_*) ≤ (1/A_t) C_{ν,t},   t ≥ 0.   (6.4.33)

For the linearly growing weights a_t = t, A_t = t(t+1)/2, t ≥ 0, we have already seen that

C_{ν,t} = (1/(1+ν)) B_{ν,t} ≤ (2^ν/((1+ν)(2−ν))) [ (t + 1/2)^{2−ν} − (1/2)^{2−ν} ] G_ν D^{1+ν}.

In the case ν = 1, this results in the following rate of convergence:

f̄(x_t) − f̄(x_*) ≤ (2/(t+1)) G_1 D^2,   t ≥ 1.   (6.4.34)

Let us justify for this case the rate of convergence of the sequence {δ(x_t)}_{t≥1}. In view of (6.4.13), we have τ_t = a_{t+1}/A_{t+1} = 2/(t+2). On the other hand, for any T ≥ t,

(2 G_1 D^2)/(t+1) ≥ f̄(x_t) − f̄(x_*)                                                         (by (6.4.34))
                                                                                              (6.4.35)
                  ≥ Σ_{k=t}^T [ τ_k δ(x_k) − (1/2) G_1 D^2 τ_k^2 ] + f̄(x_{T+1}) − f̄(x_*).   (by (6.4.32))

Let δ*_T = min_{0≤t≤T} δ(x_t). Then, choosing T = 2t + 1, we get

2 ln 2 · δ*_T ≤ Σ_{k=t}^T (2/(k+2)) δ*_T ≤ 2 G_1 D^2 [ 1/(t+1) + Σ_{k=t}^T 1/(k+2)^2 ]       (by (6.4.10), (6.4.35))

              ≤ 2 G_1 D^2 [ 1/(t+1) + 12/(11(2t+3)) ] = 2 G_1 D^2 [ 2/(T+1) + 12/(11(T+2)) ]  (by (6.4.11))

              ≤ (68/11) · (G_1 D^2)/(T+1).

Thus, in the case ν = 1, for odd T, we get the following bound:

δ*_T ≤ (34/(11 ln 2)) · (G_1 D^2)/(T+1).   (6.4.36)

6.4.4 Computing the Primal-Dual Solution

Note that both methods (6.4.12) and (6.4.24) admit computable accuracy certificates. For the first method, define

ℓ_t = (1/A_t) min_{x∈Q} Σ_{k=0}^t a_k [f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x)].

This value can be computed by the standard operation (6.4.7). Clearly, by (6.4.16),

f̄(x_t) − f̄(x_*) ≤ f̄(x_t) − ℓ_t ≤ (1/A_t) B_{ν,t}.   (6.4.37)
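A minimal sketch of this certificate (the test problem and the weights are illustrative assumptions): along a run of method (6.4.12) we accumulate the aggregated linear model and minimize it over Q, which gives a computable lower bound ℓ_t on the optimal value.

```python
import numpy as np

# Accuracy certificate (6.4.37) along a run of method (6.4.12) with a_k = k on
# an illustrative instance: f(x) = 0.5||x - c||^2, Q = [-1,1]^n, Psi = Ind_Q.
# The aggregated linear model lower-bounds f on Q, so its minimum ell_t <= f*.

rng = np.random.default_rng(3)
n = 10
c = rng.uniform(-2.0, 2.0, n)
f = lambda x: 0.5 * np.dot(x - c, x - c)
f_star = f(np.clip(c, -1.0, 1.0))              # known optimum, for checking only

x = np.ones(n)
A_t, lin_w, lin_c = 0.0, np.zeros(n), 0.0
for t in range(300):
    g = x - c
    a = float(t)                               # weight a_t = t (so a_0 = 0)
    A_t += a
    lin_w += a * g                             # accumulates sum a_k * grad f(x_k)
    lin_c += a * (f(x) - g @ x)                # accumulates sum a_k * (f(x_k) - <g_k, x_k>)
    v = -np.sign(g)                            # LMO over the box
    x = x + (2.0 / (t + 2.0)) * (v - x)        # tau_t = a_{t+1}/A_{t+1}

ell = (lin_c + lin_w @ (-np.sign(lin_w))) / A_t    # min of the model over the box
cert_gap = f(x) - ell                              # computable bound on f(x_t) - f*
```

The certificate requires no knowledge of f*: cert_gap is observable during the run and dominates the true gap.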

For the second method, let us choose a_0 = 0. Then the estimating functions are linear:

φ_t(x) = Σ_{k=1}^t a_k [f(x_{k−1}) + ⟨∇f(x_{k−1}), x − x_{k−1}⟩ + Ψ(x)].

Therefore, defining ℓ̂_t = (1/A_t) min_{x∈Q} φ_t(x), we also have, in view of (6.4.31),

f̄(x_t) − f̄(x_*) ≤ f̄(x_t) − ℓ̂_t ≤ (1/A_t) C_{ν,t},   t ≥ 1.   (6.4.38)

Accuracy certificates (6.4.37) and (6.4.38) justify that both methods (6.4.12)
and (6.4.24) are able to recover some information on the optimal dual solution.
However, in order to implement this ability, we need to open the Black Box and
introduce an explicit model of the function f (·).
Let us assume that the function f is representable in the following form:

f(x) = max_u {⟨Ax, u⟩ − g(u) : u ∈ Q_d},   (6.4.39)

where A : E → E*_2, Q_d is a closed convex set in a finite-dimensional linear space E_2, and the function g(·) is p-uniformly convex on Q_d:

⟨∇g(u_1) − ∇g(u_2), u_1 − u_2⟩ ≥ σ_g ‖u_1 − u_2‖^p,   u_1, u_2 ∈ Q_d,   (6.4.40)

where the convexity degree is p ≥ 2. Denote by u(x) ∈ Q_d the unique optimal solution to the optimization problem in (6.4.39).
solution to optimization problem in (6.4.39).
Lemma 6.4.3 The function f has a Hölder continuous gradient ∇f(x) = A* u(x) with parameter ν = 1/(p−1) and constant G_ν = (1/σ_g)^ν ‖A‖^{1+ν}.

Proof Let u_1 = u(x_1), u_2 = u(x_2), g_1 = ∇g(u_1), and g_2 = ∇g(u_2). Then, in view of the optimality condition (2.2.39), we have

⟨Ax_1 − g_1, u_2 − u_1⟩ ≤ 0,   ⟨Ax_2 − g_2, u_1 − u_2⟩ ≤ 0.

Adding these two inequalities, we get

⟨A(x_1 − x_2), u_1 − u_2⟩ ≥ ⟨g_1 − g_2, u_1 − u_2⟩ ≥ σ_g ‖u_1 − u_2‖^p.   (by (6.4.40))

Thus,

‖∇f(x_1) − ∇f(x_2)‖_* = ‖A*(u_1 − u_2)‖_* ≤ ‖A‖ · ‖u_1 − u_2‖

                       ≤ ‖A‖ · ( (1/σ_g) ‖A(x_1 − x_2)‖ )^{1/(p−1)}

                       ≤ ‖A‖^{p/(p−1)} (1/σ_g)^{1/(p−1)} ‖x_1 − x_2‖^{1/(p−1)}.  □

Let us write down an adjoint problem to (6.4.1).


 
min_{x∈Q} f̄(x) = min_x [ Ψ(x) + max_u {⟨Ax, u⟩ − g(u) : u ∈ Q_d} ]     (by (6.4.39))

               ≥ max_{u∈Q_d} [ −g(u) + min_x {⟨A*u, x⟩ + Ψ(x)} ].

Thus, defining Φ(u) = min_x {⟨A*u, x⟩ + Ψ(x)}, we get the following adjoint problem:

max_{u∈Q_d} { ḡ(u) ≝ −g(u) + Φ(u) }.   (6.4.41)

In this problem, the objective function is nonsmooth and uniformly strongly concave
of degree p. Clearly, we have

f¯(x) − ḡ(u) ≥ 0, x ∈ Q, u ∈ Qd . (6.4.42)

Let us show that both methods (6.4.12) and (6.4.24) are able to approximate the
optimal solution to the problem (6.4.41).
Note that for any x̄ ∈ Q we have

f(x̄) + ⟨∇f(x̄), x − x̄⟩ = ⟨Ax̄, u(x̄)⟩ − g(u(x̄)) + ⟨A*u(x̄), x − x̄⟩ = ⟨Ax, u(x̄)⟩ − g(u(x̄)).   (by (6.4.39))




Therefore, defining for the first method (6.4.12) u_t = (1/A_t) Σ_{k=0}^t a_k u(x_k), we obtain

ℓ_t = min_{x∈Q} [ Ψ(x) + (1/A_t) Σ_{k=0}^t a_k (⟨Ax, u(x_k)⟩ − g(u(x_k))) ]

    = Φ(u_t) − (1/A_t) Σ_{k=0}^t a_k g(u(x_k)) ≤ ḡ(u_t).
Thus, we get

(6.4.42) (6.4.37)
0 ≤ f¯(xt ) − ḡ(ut ) ≤ f¯(xt ) − t ≤ 1
At Bν,t , t ≥ 0. (6.4.43)


For the second method (6.4.24), we choose a_0 = 0 and take u_t = (1/A_t) Σ_{k=1}^t a_k u(x_{k−1}). In this case, by a similar reasoning, we get

0 ≤ f̄(x_t) − ḡ(u_t) ≤ f̄(x_t) − ℓ̂_t ≤ (1/A_t) C_{ν,t},   t ≥ 1,   (6.4.44)

in view of (6.4.42) and (6.4.38).
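A sketch of this primal-dual recovery (the instance is an assumption: a smoothed max over the simplex, Q = [−1,1]^n, Ψ = Ind_Q, weights a_k = k). The averaged dual point ū_t stays in Q_d, ḡ(ū_t) is computable in closed form, and the duality gap f̄(x_t) − ḡ(ū_t) serves as an accuracy certificate in the spirit of (6.4.44).

```python
import numpy as np

# Primal-dual recovery along method (6.4.12) with a_k = k on an illustrative
# instance: f(x) = max_{u in simplex}{<Ax,u> - 0.5||u||^2}, Q = [-1,1]^n,
# Psi = Ind_Q.  Then u(x) = proj_simplex(Ax), grad f(x) = A^T u(x), and
# g_bar(u) = -0.5||u||^2 - ||A^T u||_1 (the min of <A^T u, x> over the box),
# so f(x_t) - g_bar(u_bar) is a computable duality-gap certificate.

def proj_simplex(y):
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > css - 1.0)[0][-1]
    return np.maximum(y - (css[rho] - 1.0) / (rho + 1.0), 0.0)

rng = np.random.default_rng(7)
A = 0.3 * rng.standard_normal((8, 5))
u_of = lambda x: proj_simplex(A @ x)
f = lambda x: (A @ x) @ u_of(x) - 0.5 * np.dot(u_of(x), u_of(x))
g_bar = lambda u: -0.5 * np.dot(u, u) - np.abs(A.T @ u).sum()

x = np.ones(5)
u_acc, A_t = np.zeros(8), 0.0
for t in range(1000):
    u = u_of(x)
    u_acc += (t + 1.0) * u                     # weight a_{t+1} = t + 1 at x_t
    A_t += t + 1.0
    v = -np.sign(A.T @ u)                      # LMO over the box
    x = x + (2.0 / (t + 2.0)) * (v - x)
u_bar = u_acc / A_t                            # convex combination: stays in Q_d
duality_gap = f(x) - g_bar(u_bar)
```

By weak duality (6.4.42) the gap is nonnegative, and it decays at the rate of the right-hand side of (6.4.44).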

6.4.5 Strong Convexity of the Composite Term

In this section, we assume that the function Ψ in problem (6.4.1) is strongly convex
(see Sect. 3.2.6). In view of (3.2.37), this means that there exists a positive constant
σΨ such that

Ψ(τx + (1−τ)y) ≤ τΨ(x) + (1−τ)Ψ(y) − (1/2) σ_Ψ τ(1−τ) ‖x − y‖^2   (6.4.45)

for all x, y ∈ Q and τ ∈ [0, 1]. Let us show that in this case CG-methods converge
much faster. We demonstrate this for method (6.4.12).
In view of the strong convexity of Ψ , the variational principle (6.4.8) character-
izing the point vt in method (6.4.12) can be strengthened:

Ψ(x) + ⟨∇f(x_t), x − v_t⟩ ≥ Ψ(v_t) + (1/2) σ_Ψ ‖x − v_t‖^2,   x ∈ Q.   (6.4.46)

Let V0 be defined as in (6.4.14). Define





B̂_{ν,t} = a_0 V_0 + Σ_{k=1}^t (a_k^{1+2ν}/A_k^{2ν}) (G_ν^2 D^{2ν})/(2σ_Ψ),   t ≥ 0.   (6.4.47)

Theorem 6.4.3 Let the sequence {xt }t ≥0 be generated by method (6.4.12), and
assume the function Ψ is strongly convex. Then, for any ν ∈ (0, 1], any step t ≥ 0,
and any x ∈ Q we have


A_t (f(x_t) + Ψ(x_t)) ≤ Σ_{k=0}^t a_k [f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x)] + B̂_{ν,t}.   (6.4.48)
Proof The beginning of the proof of this statement is very similar to that of
Theorem 6.4.1. Assuming that (6.4.48) is valid for some t ≥ 0, we get the following
inequality:

Σ_{k=0}^{t+1} a_k [f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x)] + B̂_{ν,t}

  ≥ A_{t+1} (f(x_{t+1}) + Ψ(x_{t+1})) + a_{t+1} [Ψ(x) − Ψ(v_t) + ⟨∇f(x_{t+1}), x − v_t⟩].

Further,

Ψ(x) − Ψ(v_t) + ⟨∇f(x_{t+1}), x − v_t⟩

  ≥ ⟨∇f(x_{t+1}) − ∇f(x_t), x − v_t⟩ + (1/2) σ_Ψ ‖x − v_t‖^2      (by (6.4.46))

  ≥ −(1/(2σ_Ψ)) ‖∇f(x_{t+1}) − ∇f(x_t)‖_*^2                       (by (4.2.3))

  ≥ −(1/(2σ_Ψ)) ( (a_{t+1}^ν/A_{t+1}^ν) G_ν D^ν )^2.              (by (6.4.3))

Thus, to ensure that (6.4.48) is valid for the next iteration, it is enough to choose

B̂_{ν,t+1} = B̂_{ν,t} + (1/(2σ_Ψ)) (a_{t+1}^{1+2ν}/A_{t+1}^{2ν}) G_ν^2 D^{2ν}.  □

It can be easily checked that in our situation the linear weights strategy a_t ≡ t is not the best one. Let us choose a_t = t^2, t ≥ 0. Then A_t = t(t+1)(2t+1)/6, and we get

B̂_{ν,t} = Σ_{k=1}^t (6^{2ν} k^{2(1+2ν)} / (k^{2ν} (k+1)^{2ν} (2k+1)^{2ν})) (G_ν^2 D^{2ν})/(2σ_Ψ) ≤ 3^{2ν} Σ_{k=1}^t k^{2(1−ν)} (G_ν^2 D^{2ν})/(2σ_Ψ)

        ≤ (G_ν^2 D^{2ν})/(2σ_Ψ) · (3^{2ν}/(3−2ν)) τ^{3−2ν} |_{τ=1/2}^{τ=t+1/2} = (3^{2ν}/(3−2ν)) [ (t + 1/2)^{3−2ν} − (1/2)^{3−2ν} ] (G_ν^2 D^{2ν})/(2σ_Ψ).   (by (6.4.9))

Thus, for ν ∈ (0, 1), we get (1/A_t) B̂_{ν,t} ≤ O(t^{−2ν}). For ν = 1, we obtain

f̄(x_t) − f̄(x_*) ≤ (54/((t+1)(2t+1))) · (G_1^2 D^2)/(2σ_Ψ),   (6.4.49)

which is much better than (6.4.19). This gives us an example of acceleration of the
Conditional Gradient Method by a strong convexity assumption.
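The acceleration is visible numerically. In the sketch below (the instance and the weights are assumptions), Ψ(x) = Ind_{‖x‖≤1}(x) + (σ_Ψ/2)‖x‖², so the auxiliary point v_Ψ(g) is the Euclidean projection of −g/σ_Ψ onto the unit ball, and method (6.4.12) runs with a_t = t².

```python
import numpy as np

# Method (6.4.12) with a strongly convex composite term, on an illustrative
# instance: f(x) = 0.5||Bx - b||^2, Psi(x) = Ind_{||x||<=1}(x) + 0.5*sig*||x||^2.
# The auxiliary point v_Psi(g) = argmin_{||x||<=1}{<g,x> + 0.5*sig*||x||^2}
# is the projection of -g/sig onto the unit ball; weights a_t = t^2.

rng = np.random.default_rng(11)
B = rng.standard_normal((12, 6)) / 3.0
b = rng.standard_normal(12)
sig = 1.0
fbar = lambda x: 0.5 * np.dot(B @ x - b, B @ x - b) + 0.5 * sig * np.dot(x, x)

def v_psi(g):
    v = -g / sig
    nv = np.linalg.norm(v)
    return v if nv <= 1.0 else v / nv

def run(iters):
    x = np.zeros(6)
    for t in range(iters):
        v = v_psi(B.T @ (B @ x - b))           # uses grad f(x_t)
        tau = 6.0 * (t + 1.0) / ((t + 2.0) * (2.0 * t + 3.0))
        x = (1.0 - tau) * x + tau * v
    return x

x_ref = run(20000)                             # high-accuracy reference point
gap_100 = fbar(run(100)) - fbar(x_ref)
gap_400 = fbar(run(400)) - fbar(x_ref)
```

Quadrupling the iteration count shrinks the gap far more than the O(1/t) rate (6.4.19) would allow, consistent with the O(1/t²) bound (6.4.49).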

6.4.6 Minimizing the Second-Order Model

Let us assume now that in problem (6.4.1) the function f is twice continuously
differentiable. Then we can apply to this problem the following method.

Composite Trust-Region Method with Contraction

1. Choose an arbitrary point x0 ∈ Q.

2. For t ≥ 0 iterate: Define the coefficient τt ∈ (0, 1] and choose



x_{t+1} ∈ Arg min_y { ⟨∇f(x_t), y − x_t⟩ + (1/2)⟨∇²f(x_t)(y − x_t), y − x_t⟩ + Ψ(y) : y = (1−τ_t)x_t + τ_t x, x ∈ Q }.   (6.4.50)

Note that this scheme is well defined even if the Hessian of the function f is
positive semidefinite. Of course, in general, the computational cost of each iteration
of this scheme can be big. However, in one important case, when Ψ (·) is an indicator
function of a Euclidean ball, the complexity of each iteration of this scheme is
dominated by the complexity of matrix inversion. Thus, method (6.4.50) can be
easily applied to problems of the form

min{f (x) : x − x0  ≤ r}, (6.4.51)


x

where the norm  ·  is Euclidean.
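A runnable sketch of this scheme on problem (6.4.51) with a convex quadratic objective. The instance, and the projected-gradient inner solver used in place of an exact minimizer, are assumptions of this sketch: since y = (1−τ)x_t + τx gives y − x_t = τ(x − x_t), each step reduces to minimizing a convex quadratic in x over the ball.

```python
import numpy as np

# Sketch of the Composite Trust-Region Method with Contraction (6.4.50) for
# problem (6.4.51): minimize a convex quadratic f over the Euclidean ball
# ||x - x0|| <= r.  The inner problem over the contracted set is a convex
# quadratic over the ball, solved here by projected gradient.

rng = np.random.default_rng(13)
n = 6
M = rng.standard_normal((n, n)) / 2.0
H = M.T @ M + 0.1 * np.eye(n)                  # f(x) = 0.5 x'Hx - c'x
c = rng.standard_normal(n)
f = lambda x: 0.5 * x @ H @ x - c @ x
grad = lambda x: H @ x - c
x0, r = np.zeros(n), 1.0
Hn = np.linalg.norm(H, 2)

def proj_ball(x):
    d = x - x0
    nd = np.linalg.norm(d)
    return x if nd <= r else x0 + r * d / nd

def inner_argmin(x_t, tau, steps=400):
    # min_{x in Ball} tau*<grad f(x_t), x - x_t> + (tau^2/2)*<H(x - x_t), x - x_t>
    g, x = grad(x_t), x_t.copy()
    step = 1.0 / (tau * tau * Hn)
    for _ in range(steps):
        x = proj_ball(x - step * (tau * g + tau * tau * (H @ (x - x_t))))
    return x

x = proj_ball(np.ones(n))
for t in range(500):
    tau = 6.0 * (t + 1.0) / ((t + 2.0) * (2.0 * t + 3.0))    # weights a_t = t^2
    x = (1.0 - tau) * x + tau * inner_argmin(x, tau)

y = x0.copy()                                  # reference by projected gradient on f
for _ in range(5000):
    y = proj_ball(y - grad(y) / Hn)
gap_tr = f(x) - f(y)
```

For a quadratic objective the Hessian is exact in the model (H₁ = 0), so only the LD² term of (6.4.58) is active.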


Let Hν < +∞ for some ν ∈ (0, 1]. In this section we assume that

∇ 2 f (x)h, h ≤ Lh2 , x ∈ Q, h ∈ E. (6.4.52)



Let us choose a sequence of nonnegative weights {at }t ≥0, and define in (6.4.50)
the coefficients {τt }t ≥0 in accordance with (6.4.13). Define the estimating functional
sequence {φt (x)}t ≥0 by recurrent relations (6.4.26), where the sequence {xt }t ≥0 is
generated by method (6.4.50). Finally, define




Ĉ_{ν,t} = a_0 Δ(x_0) + Σ_{k=1}^t (a_k^{2+ν}/A_k^{1+ν}) (H_ν D^{2+ν})/((1+ν)(2+ν)) + Σ_{k=1}^t (a_k^2/(2A_k)) L D^2,   t ≥ 0.   (6.4.53)

In our convergence results, we also estimate the second-order optimality measure


for problem (6.4.1) at the current test points. Let us introduce

θ(x) ≝ max_{y∈Q} {⟨∇f(x), x − y⟩ − (1/2)⟨∇²f(x)(y − x), y − x⟩ + Ψ(x) − Ψ(y)}.   (6.4.54)

For any x ∈ Q we have θ (x) ≥ 0. We call θ (x) the total variation of the quadratic
model of the composite objective function in problem (6.4.1) over the feasible set.
Defining

F_x(y) = (1/2)⟨∇²f(x)(y − x), y − x⟩ + Ψ(y),

we get θ(x) = (F_x)*_{x,Q}(∇f(x)) (see definition (6.4.21)).

Theorem 6.4.4 Let the sequence {xt }t ≥0 be generated by method (6.4.50). Then,
for any ν ∈ [0, 1] and any step t ≥ 0 we have

At f¯(xt ) ≤ φt (x) + Ĉν,t , x ∈ Q. (6.4.55)

Moreover, for any t ≥ 0 we have

f̄(x_t) − f̄(x_{t+1}) ≥ τ_t θ(x_t) − (H_ν D^{2+ν}/((1+ν)(2+ν))) τ_t^{2+ν}.   (6.4.56)

Proof Let us prove inequality (6.4.55). For t = 0, Ĉν,0 = a0 [f¯(x0 ) − f¯(x∗ )].
Therefore, this inequality is valid.
In view of Theorem 3.1.23 the point xt +1 is characterized by the following
variational principle:

x_{t+1} = (1−τ_t)x_t + τ_t v_t,   v_t ∈ Q,

Ψ(y) + ⟨∇f(x_t) + ∇²f(x_t)(x_{t+1} − x_t), y − x_{t+1}⟩ ≥ Ψ(x_{t+1}),   ∀ y = (1−τ_t)x_t + τ_t x, x ∈ Q.

Therefore, in view of definition (6.4.13), for any x ∈ Q we have

a_{t+1} ⟨∇f(x_t), x − x_t⟩ ≥ A_{t+1} ⟨∇f(x_t) + ∇²f(x_t)(x_{t+1} − x_t), x_{t+1} − x_t⟩
                             + a_{t+1} ⟨∇²f(x_t)(x_{t+1} − x_t), x_t − x⟩
                             + A_{t+1} [Ψ(x_{t+1}) − Ψ((1−τ_t)x_t + τ_t x)]

  ≥ A_{t+1} ⟨∇f(x_t) + (1/2)∇²f(x_t)(x_{t+1} − x_t), x_{t+1} − x_t⟩
    + A_{t+1} [Ψ(x_{t+1}) − Ψ((1−τ_t)x_t + τ_t x)] − (a_{t+1}^2/(2A_{t+1})) L D^2.     (by (6.4.52))

Hence,

A_t f̄(x_t) + a_{t+1} [f(x_t) + ⟨∇f(x_t), x − x_t⟩ + Ψ(x)]

  ≥ A_t Ψ(x_t) + A_{t+1} [f(x_t) + ⟨∇f(x_t) + (1/2)∇²f(x_t)(x_{t+1} − x_t), x_{t+1} − x_t⟩]
    + a_{t+1} Ψ(x) + A_{t+1} [Ψ(x_{t+1}) − Ψ((1−τ_t)x_t + τ_t x)] − (a_{t+1}^2/(2A_{t+1})) L D^2

  ≥ A_{t+1} [f(x_{t+1}) + Ψ(x_{t+1})] − A_{t+1} (H_ν ‖x_{t+1} − x_t‖^{2+ν})/((1+ν)(2+ν)) − (a_{t+1}^2/(2A_{t+1})) L D^2     (by (6.4.6))

  ≥ A_{t+1} f̄(x_{t+1}) − (a_{t+1}^{2+ν}/A_{t+1}^{1+ν}) · (H_ν D^{2+ν})/((1+ν)(2+ν)) − (a_{t+1}^2/(2A_{t+1})) L D^2.

Thus, if (6.4.55) is valid for some t ≥ 0, then

φ_{t+1}(x) + Ĉ_{ν,t} ≥ A_t f̄(x_t) + a_{t+1} [f(x_t) + ⟨∇f(x_t), x − x_t⟩ + Ψ(x)]

  ≥ A_{t+1} f̄(x_{t+1}) − (a_{t+1}^{2+ν}/A_{t+1}^{1+ν}) · (H_ν D^{2+ν})/((1+ν)(2+ν)) − (a_{t+1}^2/(2A_{t+1})) L D^2.

Therefore, we can take Ĉ_{ν,t+1} = Ĉ_{ν,t} + (a_{t+1}^{2+ν}/A_{t+1}^{1+ν}) · (H_ν D^{2+ν})/((1+ν)(2+ν)) + (a_{t+1}^2/(2A_{t+1})) L D^2.
In order to justify inequality (6.4.56), let us introduce the values

θ_t(τ) ≝ max_{x∈Q} {⟨∇f(x_t), x_t − y⟩ − (1/2)⟨∇²f(x_t)(y − x_t), y − x_t⟩ + Ψ(x_t) − Ψ(y) : y = (1−τ)x_t + τx}
       = (F_{x_t})*_{τ,x_t,Q}(∇f(x_t)),   τ ∈ [0, 1].                              (by (6.4.22))

Clearly,

−θ_t(τ_t) = min_{x∈Q} {⟨∇f(x_t), y − x_t⟩ + (1/2)⟨∇²f(x_t)(y − x_t), y − x_t⟩ + Ψ(y) − Ψ(x_t) : y = (1−τ_t)x_t + τ_t x}

          = ⟨∇f(x_t), x_{t+1} − x_t⟩ + (1/2)⟨∇²f(x_t)(x_{t+1} − x_t), x_{t+1} − x_t⟩ + Ψ(x_{t+1}) − Ψ(x_t)

          ≥ f̄(x_{t+1}) − f̄(x_t) − (H_ν/((1+ν)(2+ν))) ‖x_{t+1} − x_t‖^{2+ν}.       (by (6.4.6))

Since ‖x_{t+1} − x_t‖ ≤ τ_t D, we conclude that

f̄(x_t) − f̄(x_{t+1}) ≥ θ_t(τ_t) − (H_ν D^{2+ν}/((1+ν)(2+ν))) τ_t^{2+ν} ≥ τ_t θ(x_t) − (H_ν D^{2+ν}/((1+ν)(2+ν))) τ_t^{2+ν},   (by (6.4.23))

which completes the proof.  □

Thus, inequality (6.4.55) ensures the following rate of convergence of method (6.4.50):

f̄(x_t) − f̄(x_*) ≤ (1/A_t) Ĉ_{ν,t}.   (6.4.57)

A particular expression of the right-hand side of this inequality for different values
of ν ∈ [0, 1] can be obtained in exactly the same way as it was done in Sect. 6.4.2.
Here, we restrict ourselves only to the case when ν = 1 and a_t = t^2, t ≥ 0. Then A_t = t(t+1)(2t+1)/6, and

Σ_{k=1}^t a_k^3/A_k^2 = Σ_{k=1}^t 36 k^6 / (k^2 (k+1)^2 (2k+1)^2) ≤ 18 t,

Σ_{k=1}^t a_k^2/(2A_k) = Σ_{k=1}^t 3 k^4 / (k(k+1)(2k+1)) ≤ (3/2) Σ_{k=1}^t k = (3/4) t(t+1).

Thus, we get

f̄(x_t) − f̄(x_*) ≤ (18 H_1 D^3)/((t+1)(2t+1)) + (9 L D^2)/(2(2t+1)).   (6.4.58)

Note that the rate of convergence (6.4.58) is worse than the convergence rate of
cubic regularization of the Newton method (see Sect. 4.2.3). However, to the best of
our knowledge, inequality (6.4.58) gives us the first global rate of convergence of
an optimization scheme belonging to the family of trust-region methods. In view
of inequality (6.4.55), the optimal solution of the dual problem (6.4.41) can be

approximated by method (6.4.50) with a0 = 0 in the same way as it was suggested


in Sect. 6.4.4 for Conditional Gradient Methods.
Let us now estimate the rate of decrease of the values θ(x_t), t ≥ 0, in the case ν = 1. Note that, by (6.4.13), τ_t = a_{t+1}/A_{t+1} = 6(t+1)/((t+2)(2t+3)). It is easy to see that these coefficients satisfy the following inequalities:

3/(t+3) ≤ τ_t ≤ 6/(2t+5),   t ≥ 0.   (6.4.59)

Therefore, choosing the total number of steps T = 2t + 2, we have

Σ_{k=t}^T τ_k ≥ 3 Σ_{k=t}^{2t+2} 1/(k+3) ≥ 3 ln 2,                                           (by (6.4.59), (6.4.10))

Σ_{k=t}^T τ_k^3 ≤ 27 Σ_{k=t}^{2t+2} 1/(k+5/2)^3 ≤ −(27/2) (k+5/2)^{−2} |_{k=t−1/2}^{k=2t+5/2}   (by (6.4.59), (6.4.11))
                                                                                              (6.4.60)
              = (27/2) [ 1/(t+2)^2 − 1/(2t+5)^2 ] = (27/2) [ 4/(T+2)^2 − 1/(T+3)^2 ]

              = (27 (3T+8)(T+4)) / (2 (T+2)^2 (T+3)^2) ≤ 81/(2(T+1)(T+2)).

Now we can use the same trick as at the end of Sect. 6.4.2. Define

θ*_T = min_{0≤t≤T} θ(x_t).

Then

(36 H_1 D^3)/(T(T−1)) + (9 L D^2)/(2(T−1)) ≥ f̄(x_t) − f̄(x_*) ≥ Σ_{k=t}^T (f̄(x_k) − f̄(x_{k+1}))    (by (6.4.58))

  ≥ θ*_T Σ_{k=t}^T τ_k − (H_1 D^3/6) Σ_{k=t}^T τ_k^3                                (by (6.4.56))

  ≥ 3 θ*_T ln 2 − (27 H_1 D^3)/(4(T+1)(T+2)).                                       (by (6.4.60))

Thus, for even T, we get the following bound:

θ*_T ≤ (3/ln 2) [ (4 H_1 D^3)/(T(T−1)) + (3 H_1 D^3)/(4(T+1)(T+2)) + (L D^2)/(2(T−1)) ]
                                                                                    (6.4.61)
     ≤ (3/ln 2) [ (5 H_1 D^3)/(T(T−1)) + (L D^2)/(2(T−1)) ].
Chapter 7
Optimization in Relative Scale

In many applications, it is difficult to relate the number of iterations in an


optimization scheme with the desired accuracy of the solution since the corre-
sponding inequality contains unknown parameters (Lipschitz constant, distance to
the optimum). However, in many cases the required level of relative accuracy is
quite understandable. To develop methods which compute solutions with relative
accuracy, we need to employ internal structure of the problem. In this chapter,
we start from problems of minimizing homogeneous objective functions over a
convex set separated from the origin. The availability of the subdifferential of this
function at zero provides us with a good metric, which can be used in optimization
schemes and in the smoothing technique. If this subdifferential is polyhedral, then
the metric can be computed by a cheap preliminary rounding process. We also
present a barrier subgradient method, which computes an approximate maximum
of a positive convex function with certain relative accuracy. We show how to apply
this method to solve problems of fractional covering, maximal concurrent flow,
semidefinite relaxation, online optimization, portfolio management, and others.
Finally, we consider a class of strictly positive functions, for which a kind of quasi-
Newton method is developed.

7.1 Homogeneous Models of an Objective Function

(The conic unconstrained minimization problem; The subgradient approximation scheme;


Structural optimization; Application examples: Linear Programming, Minimization of the
spectral radius; The truss topology design problem.)

© Springer Nature Switzerland AG 2018 489


Y. Nesterov, Lectures on Convex Optimization, Springer Optimization
and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4_7
490 7 Optimization in Relative Scale

7.1.1 The Conic Unconstrained Minimization Problem

Quite often, in the theoretical justification of convex optimization methods it is


assumed that problems have bounded feasible sets. Besides its technical conve-
nience, this assumption allows us to introduce a reasonable scale for measuring the
absolute accuracy of an approximate solution. In the cases when the initial problem
does not possess this property, some algorithms require an artificial bounding of
the domain (the “big M” approach). This approach is, perhaps, acceptable for
polynomial-time methods, where the “big M” enters the complexity estimates only
inside a logarithm term (see Chap. 5). However, it is clear that for gradient-type
methods, such a strategy cannot work.
In fact, this is almost a philosophical question: Do the problems with unbounded
feasible sets really arise in practice? And if so, how they should be treated? Actually,
there is at least one, very important class of such problems, namely, the problems
obtained by Lagrangian relaxations of inequality constraints (see Sects. 1.3.3 and
3.1.7). If there were some reasonable bounds on the dual variables for these
constraints, then it would be natural to incorporate them into the primal problem.
Then, instead of constraints in the primal problem, we could have an additional
term in the objective function.
Another difficulty is related to the way of bounding unbounded feasible sets.
It is not always possible to find a reasonable localization set a priori, without
collecting additional information on the topology of the problem by some auxiliary
computations.
In this chapter, we suggest an alternative way of treating the convex minimization
problems. Namely, we are going to compute their approximate solutions in relative
scale. We will see that this idea works at least for a special class of conic
unconstrained minimization problems.1 These are the problems of minimizing a
positively homogeneous convex function over a convex set, which is separated from
the origin. In order to compute an approximate solution to this problem with a
certain relative accuracy, we need to know a John ellipsoid for the subdifferential
of the objective function evaluated at the origin. We will see that in many cases
all necessary information about the objective function can be easily obtained by
analyzing its structure.
In what follows, we say that the value f (x̄) approximates the optimal value f ∗ >
0 with relative accuracy δ if

f ∗ ≤ f (x̄) ≤ (1 + δ)f ∗ .

In this chapter, it is convenient to use the following notation for the balls in E
with respect to  · :

B_{‖·‖}(r) = {x ∈ E : ‖x‖ ≤ r}.

1 By this term we mean problems with no functional constraints.


7.1 Homogeneous Models of an Objective Function 491

The notation πQ,· (x) is used for the projection of a point x onto the set Q with
respect to the norm  · . For the sake of notation, if no ambiguity arises, the
indication of the norm is omitted.
Finally, in the case E = Rn , In denotes the unit matrix in Rn , ei denotes the ith
coordinate vector, and ēn stands for the vector of all ones. For an n × n matrix X we
denote by λ1 (X), . . . , λn (X) its spectrum of eigenvalues numbered in decreasing
order.
The most general form of the optimization problem considered in this section is
as follows:

Find f* = min_{x∈Q_1} f(x),   (7.1.1)

where f is a convex positively homogeneous function of degree one (see the end of
Sect. 3.1.6), and Q1 ⊂ E is a closed convex set, which does not contain the origin.
In many applications, the role of Q1 is played by an affine subspace

L = {x ∈ E : Cx = b},

where b ∈ E_1, b ≠ 0, and C : E → E_1. Without loss of generality, we can assume that C is non-degenerate.
that C is non-degenerate.
Our main assumption on problem (7.1.1) is that

dom f ≡ E, 0 ∈ int ∂f (0). (7.1.2)

In other words, we assume that f is a support function of a convex compact set


containing the origin in its interior. Then f ∗ > 0, and the problem of finding
an approximate solution to (7.1.1) with a certain relative accuracy becomes well
posed. In what follows, we call the setting (7.1.1), (7.1.2) the conic unconstrained
minimization problem.
Note that any unconstrained minimization problem

min φ(y),
y∈E

with convex objective φ(·), can be rewritten in the form (7.1.1) by simple homoge-
nization:

x = (y, τ ) ∈ E × R+ , f (x) = τ φ(y/τ ), Cx ≡ τ, b=1

(see Example 3.1.2(6)). However, in general, we cannot guarantee that such a


function satisfies assumption (7.1.2).
Let us look at the following examples.

Example 7.1.1 Let our initial problem consist in finding approximately an uncon-
strained minimum of the function

φ_∞(y) = max_{1≤i≤m} |⟨a_i, y⟩ + c^{(i)}|,   y ∈ R^{n−1}.




Let us introduce x = (y; τ) and â_i = (a_i; c^{(i)}), i = 1, …, m. Let

A^T = (â_1, …, â_m),   F_∞(v) = max_{1≤i≤m} |v^{(i)}|,

p = 1,   C = (0, …, 0, 1)  ((n−1) zeros),   b = 1.

Then for positive τ we can define

f(x) = τ φ_∞(y/τ) ≡ F_∞(Ax),   Q_1 = L.

Thus, this description of f (·) can be extended onto the whole space.
In a similar way, for the function


φ_1(y) = Σ_{i=1}^m |⟨a_i, y⟩ + c^{(i)}|,   y ∈ R^{n−1},

we can get a representation (7.1.1) which satisfies (7.1.2). In this case, we use f(x) = F_1(Ax) with

F_1(v) = Σ_{i=1}^m |v^{(i)}|.

However, for the function


 
φ(y) = max_{1≤i≤m} (⟨a_i, y⟩ + c^{(i)}),   y ∈ R^{n−1},

the above lifting cannot guarantee (7.1.2).
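The homogenization of Example 7.1.1 can be checked directly. The sketch below (random data, an assumption) builds the matrix with rows â_i = (a_i, c^{(i)}) and verifies that f(x) = F_∞(Âx) is positively homogeneous of degree one and restricts to φ_∞ on the section τ = 1.

```python
import numpy as np

# Homogenization of Example 7.1.1: with x = (y, tau) and a_hat_i = (a_i, c_i),
# f(x) = F_inf(A_hat x) = max_i |<a_hat_i, x>| is positively homogeneous of
# degree one and coincides with phi_inf(y) on the section tau = 1.

rng = np.random.default_rng(17)
m, n1 = 6, 4                                   # m terms, y in R^{n1}
a = rng.standard_normal((m, n1))
c = rng.standard_normal(m)
A_hat = np.hstack([a, c[:, None]])             # rows (a_i, c_i)

phi_inf = lambda y: np.max(np.abs(a @ y + c))
f = lambda x: np.max(np.abs(A_hat @ x))

y = rng.standard_normal(n1)
x = np.append(y, 1.0)                          # the section tau = 1
homog_err = abs(f(3.0 * x) - 3.0 * f(x))       # degree-one homogeneity
restr_err = abs(f(x) - phi_inf(y))             # f restricted to tau = 1
```

Both identities hold up to floating-point rounding, as the lifting predicts.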



Let us fix some norm  ·  in E, and define the dual norm in the standard way:

‖g‖_* = max_{‖x‖≤1} ⟨g, x⟩,   g ∈ E*.   (7.1.3)

Then we can rewrite our main assumption (7.1.2) in a quantitative form. Let γ0 ≤ γ1
be some positive values satisfying the following asphericity condition:

B_{‖·‖_*}(γ_0) ⊆ ∂f(0) ⊆ B_{‖·‖_*}(γ_1).   (7.1.4)



Thus, by (7.1.2) we just assume that such values are well defined. Note that these
values depend on the choice of the norm  · . In the sequel, this choice will always
be evident from the context.
Denote by

α = γ_0/γ_1 ≤ 1

the asphericity coefficient of the function f. As we will see later, this parameter is
crucial for complexity bounds of finding approximate solutions to problem (7.1.1)
with a certain relative accuracy.
Note that in many situations it is reasonable to choose · as an ellipsoidal norm.
In view of John’s theorem, for a good variant of this norm we can guarantee that

α ≥ 1/n,   (7.1.5)

where n = dim E. Moreover, if ∂f (0) is symmetric:

f (x) = f (−x) ∀x ∈ E,

then the lower bound for ellipsoidal norms is even better:

α ≥ 1/√n.   (7.1.6)

(We will prove both variants of John’s Theorem in Sect. 7.2.) Of course, it may be
difficult to find a norm which is good for a particular objective function f . However,
in this case we can try to employ our knowledge of its structure.
For example, it may happen that we know a self-concordant barrier ψ(·) for the
convex set ∂f (0) (see Sect. 5.3), and ∇ψ(0) = 0. Then we can use

v∗ = v, ∇ 2 ψ(0)v 1/2 , x = [∇ 2 ψ(0)]−1 x, x 1/2.

In this case, it is possible to choose



γ0 = 1, γ1 = ν + 2 ν,

where ν is the parameter of the barrier ψ(·) (see Theorem 5.3.9).


For some important problems the subdifferential ∂f (0) is a polyhedral set. Then
the following result may be useful.
Lemma 7.1.1 Let f(x) = max_{1≤i≤m} ⟨a_i, x⟩, x ∈ R^n. Assume that the matrix

A = (a_1, …, a_m)

has full row rank and Σ_{i=1}^m a_i = 0 (thus, m > n). Then the norm

‖x‖ = ( Σ_{i=1}^m ⟨a_i, x⟩^2 )^{1/2}

is well defined. We can choose γ_1 = 1 and γ_0 = 1/√(m(m−1)).

Proof Note that the matrix G = Σ_{i=1}^m a_i a_i^T is non-degenerate. Then

‖v‖∗ = ⟨v, G⁻¹v⟩^{1/2}

(see Lemma 3.1.20), and therefore for any i = 1, …, m we have

(‖a_i‖∗)² = ⟨a_i, G⁻¹a_i⟩ = max_{x∈Rn} { 2⟨a_i, x⟩ − ⟨Gx, x⟩ }
= max_{x∈Rn} { 2⟨a_i, x⟩ − Σ_{k=1}^m ⟨a_k, x⟩² }
≤ max_{x∈Rn} { 2⟨a_i, x⟩ − ⟨a_i, x⟩² } = 1.

Since ∂f(0) = Conv{a_i, i = 1, …, m}, we can take γ1 = 1.


On the other hand, for any x ∈ Rn we have Σ_{i=1}^m ⟨a_i, x⟩ = 0. Therefore,

⟨Gx, x⟩ = Σ_{i=1}^m ⟨a_i, x⟩²
≤ max_{s∈Rm} { Σ_{i=1}^m (s^{(i)})² : Σ_{i=1}^m s^{(i)} = 0, s^{(i)} ≤ f(x), i = 1, …, m }.

In view of Corollary 3.1.2, the extremum in the above maximization problem is attained, for example, at

ŝ = f(x) · (ēm − m e_1).

This means that ⟨Gx, x⟩ ≤ m(m−1)f²(x). Hence, f(x) ≥ ‖x‖/√(m(m−1)). In view of representation (3.1.41), this justifies the choice γ0 = 1/√(m(m−1)). □
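Lemma 7.1.1 is easy to sanity-check numerically. The sketch below (the dimensions and the random data are illustrative choices, not from the text) builds vectors with Σ a_i = 0 and verifies the two-sided bound γ0‖x‖ ≤ f(x) ≤ γ1‖x‖ with γ1 = 1 and γ0 = 1/√(m(m−1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 9
# Vectors a_1, ..., a_{m-1} are random; a_m makes the sum equal to zero.
A = rng.standard_normal((m - 1, n))
A = np.vstack([A, -A.sum(axis=0)])           # rows a_i, with sum_i a_i = 0

G = A.T @ A                                  # G = sum_i a_i a_i^T
def norm(x):                                 # ||x|| = (sum_i <a_i,x>^2)^{1/2}
    return np.sqrt(x @ G @ x)

def f(x):                                    # f(x) = max_i <a_i, x>
    return np.max(A @ x)

gamma0 = 1.0 / np.sqrt(m * (m - 1))
for _ in range(1000):
    x = rng.standard_normal(n)
    assert gamma0 * norm(x) <= f(x) + 1e-12  # the choice gamma_0
    assert f(x) <= norm(x) + 1e-12           # the choice gamma_1 = 1
print("bounds verified")
```

Note that the lower bound is attained (up to the factor 1/√(m(m−1))) only on special directions; on random directions f(x) is usually much larger.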

The possibility of employing another structural representation of problem (7.1.1)
is discussed in Sect. 7.1.3.

Let us conclude this section with a statement which supports our ability to solve
problem (7.1.1) with a certain relative accuracy.
Denote by x0 the projection of the origin onto the set Q1 with respect to the norm ‖·‖:

‖x0‖ = min_{x∈Q1} ‖x‖.

Theorem 7.1.1
1. For any x ∈ Rn, we have

γ0 · ‖x‖ ≤ f(x) ≤ γ1 · ‖x‖. (7.1.7)

Therefore the function f is Lipschitz continuous on E in the norm ‖·‖ with Lipschitz constant γ1. Moreover,

αf(x0) ≤ γ0 · ‖x0‖ ≤ f* ≤ f(x0) ≤ γ1 · ‖x0‖. (7.1.8)

2. For any optimal solution x* to (7.1.1), we have

‖x0 − x*‖ ≤ (2/γ0) f* ≤ (2/γ0) f(x0). (7.1.9)

If the norm ‖·‖ is Euclidean, then this inequality can be strengthened as follows:

‖x0 − x*‖ ≤ (1/γ0) f* ≤ (1/γ0) f(x0). (7.1.10)

Proof For any x ∈ E, we have

f(x) = max_v {⟨v, x⟩ : v ∈ ∂f(0)} ≥ max_v {⟨v, x⟩ : v ∈ B‖·‖∗(γ0)} = γ0‖x‖,   (by (3.1.41))
f(x) = max_v {⟨v, x⟩ : v ∈ ∂f(0)} ≤ max_v {⟨v, x⟩ : v ∈ B‖·‖∗(γ1)} = γ1‖x‖.   (by (3.1.41))

Therefore, for any x and h ∈ E, we have

f(x + h) ≤ f(x) + f(h) ≤ f(x) + γ1‖h‖.

Moreover,

f* = min_{x∈Q1} f(x) ≥ min_{x∈Q1} γ0‖x‖ = γ0‖x0‖.

2 Recall that this can be any general norm.



Hence, in view of (7.1.7) we have

f* ≥ γ0‖x0‖ ≥ αf(x0),
f* ≤ f(x0) ≤ γ1‖x0‖.

In order to prove the second statement, note that in view of the first item of the theorem we have

‖x0 − x*‖ ≤ ‖x0‖ + ‖x*‖ ≤ (2/γ0) · f*.

For the Euclidean norm ‖x‖ = ⟨Gx, x⟩^{1/2} with G ≻ 0, this bound can be strengthened. Indeed, in this case ⟨Gx0, x* − x0⟩ ≥ 0 by (2.2.39). Therefore,

‖x0 − x*‖² = ‖x0‖² − 2⟨Gx0, x*⟩ + ‖x*‖² ≤ ‖x*‖² − ‖x0‖² ≤ ‖x*‖². □

7.1.2 The Subgradient Approximation Scheme

Let us discuss now different possibilities for finding an approximate solution to


problem (7.1.1). For the sake of simplicity, we assume that the norm ‖·‖ is
Euclidean.
The first of our schemes is based on the standard Subgradient Method for
minimizing non-smooth convex functions. Denote by g(x) an arbitrary subgradient
of the function f at point x. Consider the simplest variant of the Subgradient Method
as applied to problem (7.1.1).

Subgradient Method G_N(R)

for k := 0 to N do: Compute f(x_k) and g(x_k).

x_{k+1} := π_{Q1}( x_k − (R/√(N+1)) · g(x_k)/‖g(x_k)‖∗ ).   (7.1.11)

Output: x̄ = arg min_x { f(x) : x = x0, …, x_N }.
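The scheme above is a few lines of code once the projection π_{Q1} is available. The sketch below is a toy instance, with f(x) = ‖x‖1 (so γ0 = 1, γ1 = √n for the Euclidean norm) and Q1 a hyperplane {⟨c, x⟩ = 1}; the vector c and the iteration budget are illustrative assumptions, not from the text:

```python
import numpy as np

def subgrad_method(f, subgrad, proj, x0, R, N):
    """Scheme (7.1.11): N+1 normalized subgradient steps, return the best point."""
    x, best = x0.copy(), x0.copy()
    h = R / np.sqrt(N + 1)                       # step length
    for _ in range(N + 1):
        g = subgrad(x)
        x = proj(x - h * g / np.linalg.norm(g))  # ||g||_* = ||g||_2 here
        if f(x) < f(best):
            best = x.copy()
    return best

# Toy instance: min ||x||_1 over the hyperplane <c, x> = 1 (here f* = 1/2).
c = np.array([2.0, 1.0, 1.0, 1.0, 1.0])
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)                   # a subgradient of the l1-norm
proj = lambda x: x - (c @ x - 1.0) * c / (c @ c)

x0 = proj(np.zeros(5))                           # projection of the origin onto Q1
R = f(x0)                                        # rho-hat from (7.1.13), gamma_0 = 1
xbar = subgrad_method(f, subgrad, proj, x0, R, N=2000)
```

For this instance the minimum f* = 1/2 is attained at x = (1/2, 0, 0, 0, 0), and the bound (7.1.12) guarantees that f(x̄) is within a few percent of it.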

In what follows, the output of this process x̄ ∈ E is denoted by GN (R). In view of


Theorem 3.2.2, the rate of convergence of this method is as follows:

f(G_N(R)) − f* ≤ (γ1/√(N+1)) · (‖x0 − x*‖² + R²)/(2R). (7.1.12)

Thus, in order to be efficient, the Subgradient Method needs a good estimate for the
distance between the starting point x0 and the solution x ∗ :

R ≈ ‖x0 − x*‖.

In our case, this estimate could be obtained from the first inequality in (7.1.10).
However, since the value f ∗ is not known in advance, we will use the second part
of this inequality:

ρ̂ def= (1/γ0) f(x0) ≥ ‖x0 − x*‖. (7.1.13)

The performance of the corresponding scheme is given by the following statement.


Theorem 7.1.2 For a fixed δ from (0, 1), let us choose

N = ⌈ 1/(α⁴δ²) ⌉. (7.1.14)

Then f(G_N(ρ̂)) ≤ (1 + δ) · f*.


Proof In view of inequality (7.1.12), the choice (7.1.14) and inequalities (7.1.10), (7.1.8), we have

f(G_N(ρ̂)) − f* ≤ α²δγ1 · (‖x0 − x*‖² + ρ̂²)/(2ρ̂) ≤ α²δγ1 ρ̂ = αδ f(x0) ≤ δ · f*. □

Note that we pay a high price for the poor estimate of the initial distance. If
we were able to use the first part of inequality (7.1.10), then the corresponding
complexity bound could be much better. Let us show that a better bound for the
distance to the optimal solution can be derived from the trivial observation that
f ∗ ≤ f (x) for any point x from Q1 .
Denote by δ ∈ (0, 1) the desired relative accuracy. Let

N̂ = ⌈ (e/α²) · (1 + 1/δ)² ⌉,

where e is the base of the exponent. Consider the following restarting strategy. Set
x̂0 = x0 , and for t ≥ 1 iterate

 
x̂t := G_N̂( (1/γ0) f(x̂t−1) );
if f(x̂t) ≥ (1/√e) f(x̂t−1) then T := t and Stop.   (7.1.15)

Theorem 7.1.3 The number of points generated by the process (7.1.15) is bounded:

T ≤ 1 + 2 ln(1/α). (7.1.16)

The last generated point satisfies the inequality f(x̂T) ≤ (1 + δ)f*. The total number of lower-level gradient steps in the process (7.1.15) does not exceed

(e/α²) · (1 + 1/δ)² · (1 + 2 ln(1/α)). (7.1.17)

Proof By simple induction, it is easy to prove that at the beginning of stage t in (7.1.15) the following inequality holds:

(1/√e)^{t−1} f(x0) ≥ f(x̂t−1), t ≥ 1.

Thus, in view of inequality (7.1.8), at the last stage T of the process we have

(1/√e)^{T−1} f(x0) ≥ f(x̂T−1) ≥ f* ≥ αf(x0).

This leads to inequality (7.1.16).
In view of (7.1.10), we have ‖x0 − x*‖ ≤ (1/γ0) f* ≤ (1/γ0) f(x̂T−1). Therefore, at the last stage of the process, using (7.1.12) and the termination rule in (7.1.15), we get

f(x̂T) − f* ≤ (γ1/√(N̂+1)) · (1/γ0) · f(x̂T−1) ≤ (√e/(α√(N̂+1))) · f(x̂T) ≤ (δ/(1+δ)) · f(x̂T). □
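The restarting strategy (7.1.15) is a thin wrapper around the Subgradient Method. Below is a hedged sketch on the same toy objective as before (f(x) = ‖x‖1 over a hyperplane, so γ0 = 1, γ1 = √n, α = 1/√n); each stage restarts from the fixed point x0 with a refreshed estimate R_t = f(x̂_{t−1})/γ0. The data are illustrative assumptions:

```python
import numpy as np

def G(f, subgrad, proj, x0, R, N):
    """Inner solver: the Subgradient Method (7.1.11), started from x0."""
    x, best = x0.copy(), x0.copy()
    h = R / np.sqrt(N + 1)
    for _ in range(N + 1):
        g = subgrad(x)
        x = proj(x - h * g / np.linalg.norm(g))
        if f(x) < f(best):
            best = x.copy()
    return best

def restart(f, subgrad, proj, x0, gamma0, alpha, delta):
    """Restarting strategy (7.1.15), with N-hat chosen as in the text."""
    N_hat = int(np.ceil(np.e / alpha**2 * (1.0 + 1.0 / delta)**2))
    xt, T = x0.copy(), 0
    while True:
        T += 1
        xnew = G(f, subgrad, proj, x0, f(xt) / gamma0, N_hat)
        if f(xnew) >= f(xt) / np.sqrt(np.e):     # termination rule of (7.1.15)
            return xnew, T
        xt = xnew

# Toy problem: min ||x||_1 over the hyperplane <c, x> = 1 (f* = 1/2).
c = np.array([2.0, 1.0, 1.0, 1.0, 1.0])
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
proj = lambda x: x - (c @ x - 1.0) * c / (c @ c)
x0 = proj(np.zeros(5))                            # projection of the origin

xT, T = restart(f, subgrad, proj, x0, gamma0=1.0, alpha=1/np.sqrt(5), delta=0.5)
# Theorem 7.1.3: T <= 1 + 2 ln(1/alpha), and f(xT) <= (1 + delta) f* = 0.75.
```

On this instance the process typically stops after the very first stage, since the first stage already brings f(x̂1) close to f*.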

7.1.3 Direct Use of the Problem Structure

In Sect. 7.1.2 we have shown that the outer and inner ellipsoidal approximations
of the set ∂f (0) are the key ingredients of minimization schemes for computing

an approximate solution to problem (7.1.1) in relative scale. However, in order to


find an ellipsoidal norm, which is good for our problem, we need to employ its
structure somehow. In this section, we introduce a model of problem (7.1.1) which
is suitable both for the explicit indication of such a norm and for applying the
smoothing technique described in Sect. 6.1. We will see that the efficiency of the
latter approach significantly dominates that of the Subgradient Method.
Since the objective function f in problem (7.1.1) is positive homogeneous, the
simplest possible structure of such an object could be as follows. Let us assume that
the objective function f is a composition of two objects, a linear operator A(x) and
a simple nonlinear convex homogeneous function F . In other words, assume that
f (x) = F (A(x)). Let us introduce this object in a formal way. In this section we
switch to the notation of Sect. 6.1, choosing E1 = Rn and E2 = Rm .
Let Q2 be a closed bounded convex set in Rm containing the origin in its interior.
Define a convex homogeneous function F as follows:

F(v) = max_{u∈Q2} ⟨v, u⟩. (7.1.18)

Further, let A be an m × n-matrix which has a full column rank (thus, m ≥ n).
Define the objective function

f (x) = F (Ax), x ∈ Rn . (7.1.19)

Clearly, f is a convex function with degree of homogeneity one. Our problem of


interest is still (7.1.1), which we repeat for convenience here:

Find f* = min_{x∈Q1} f(x). (7.1.20)

Since ∂F (0) ≡ Q2 , we have ∂f (0) = AT Q2 (see Lemma 3.1.11). Thus,


problem (7.1.20) satisfies the main assumption (7.1.2).
Let ‖·‖_{Rm} denote the standard Euclidean norm in Rm:

‖u‖_{Rm} = ( Σ_{i=1}^m (u^{(i)})² )^{1/2}, u ∈ Rm.

Let us introduce the following characteristics of the function F:

γ0(F) = max_{r>0} { r : B‖·‖_{Rm}(r) ⊆ ∂F(0) },
γ1(F) = min_{r>0} { r : B‖·‖_{Rm}(r) ⊇ ∂F(0) },
α(F) = γ0(F)/γ1(F) ≤ 1.

For the sets from Example 7.1.1, these values are as follows:

γ0(F1) = 1/√m, γ1(F1) = 1, α(F1) = 1/√m,
γ0(F∞) = 1, γ1(F∞) = √m, α(F∞) = 1/√m. (7.1.21)

Let us now define the following Euclidean norm in the primal space:

‖x‖_{Rn} = ‖Ax‖∗_{Rm}, x ∈ Rn. (7.1.22)

Since A is non-degenerate, this norm is well defined. Defining G = A^T A ≻ 0, we get the following representations:
get the following representations:
‖x‖_{Rn} = ⟨Gx, x⟩^{1/2} = ( Σ_{i=1}^m ⟨a_i, x⟩² )^{1/2},
‖g‖∗_{Rn} = ⟨g, G⁻¹g⟩^{1/2}, (7.1.23)

where a_i, i = 1, …, m, denote the columns of the matrix A^T.


Lemma 7.1.2 For the norm ‖·‖_{Rn}, condition (7.1.4) holds with

γ0 = γ0(F), γ1 = γ1(F).

Thus, we can take α = α(F) = γ0(F)/γ1(F).

Proof Since ∂f(0) = A^T Q2, we have the following representation for the support function of this set:

ξ(x) def= max_{s∈∂f(0)} ⟨s, x⟩ = max_{u∈Q2} ⟨A^T u, x⟩ = max_{u∈Q2} ⟨Ax, u⟩.

Thus,

ξ(x) ≤ max_{‖u‖_{Rm} ≤ γ1(F)} ⟨Ax, u⟩ = γ1(F)‖Ax‖∗_{Rm} = γ1(F)‖x‖_{Rn},
ξ(x) ≥ max_{‖u‖_{Rm} ≤ γ0(F)} ⟨Ax, u⟩ = γ0(F)‖Ax‖∗_{Rm} = γ0(F)‖x‖_{Rn}.

Hence, in view of Corollary 3.1.5, ∂f(0) ⊆ B‖·‖∗_{Rn}(γ1(F)) and ∂f(0) ⊇ B‖·‖∗_{Rn}(γ0(F)). □
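As an illustration, take F(v) = Σ|v^{(i)}| (so Q2 is the unit box, with γ0(F) = 1 and γ1(F) = √m). The conclusion of Lemma 7.1.2 then amounts to ‖Ax‖2 ≤ f(x) = ‖Ax‖1 ≤ √m·‖Ax‖2, which is easy to confirm on random data (all data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.standard_normal((m, n))       # full column rank, almost surely

for _ in range(1000):
    x = rng.standard_normal(n)
    fx = np.abs(A @ x).sum()          # f(x) = F(Ax), F = l1-norm (box Q2)
    nx = np.linalg.norm(A @ x)        # ||x||_{Rn} = ||Ax||_{Rm}
    assert nx <= fx + 1e-12           # gamma_0(F) ||x|| <= f(x)
    assert fx <= np.sqrt(m) * nx + 1e-12   # f(x) <= gamma_1(F) ||x||
print("ok")
```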

Note that for many simple sets Q2 , parameters γ1 (F ) and γ0 (F ) are easily
available (see, for example, (7.1.21)). Therefore, metric (7.1.23) can be used to
find an approximate solution to the corresponding problems by the Subgradient

Method (7.1.15). However, the main advantage of representation (7.1.19) is related


to the possibility of employing the smoothing technique of Sect. 6.1. Let us show
how this can be done.
Problem (7.1.20) differs from problem (6.1.10) only in one aspect: it can have an
unbounded primal feasible set. Thus, a straightforward application of the efficient
smoothing technique to (7.1.20) is impossible. However, we can introduce an
artificial bound on the size of the optimal solution using the information provided
by inequality (7.1.10). Define

Q1(ρ) = { x ∈ Q1 : ‖x − x0‖_{Rn} ≤ ρ }.

In view of (7.1.10), we have x* ∈ Q1(ρ̂) for ρ̂ = (1/γ0(F)) f(x0). Thus, problem (7.1.20) is equivalent to the following:

Find f* = min_{x∈Rn} { f(x) : x ∈ Q1(ρ̂) }
= min_{x∈Q1(ρ̂)} max_{u∈Q2} ⟨Ax, u⟩ (7.1.24)
= max_{u∈Rm} { φρ̂(u) : u ∈ Q2 },   (by (6.1.34))

where φρ(u) = min_{x∈Q1(ρ)} ⟨Ax, u⟩. Thus, we have managed to represent our problem in the form required by Sect. 6.1.
Let us introduce the objects necessary for applying the smoothing technique. In the primal space, we choose the prox-function d1(x) = ½‖x − x0‖²_{Rn}. This function has convexity parameter equal to one. Its maximum on the feasible set Q1(ρ̂) does not exceed D1 = ½ρ̂².
Similarly, for the dual feasible set, we choose d2(u) = ½‖u‖²_{Rm}. Then its convexity parameter is one, and the maximum of this function on the dual feasible set Q2 is smaller than D2 = ½γ1²(F). It remains to note that

‖A‖1,2 = max_{x,u} { ⟨Ax, u⟩ : ‖x‖_{Rn} ≤ 1, ‖u‖_{Rm} ≤ 1 }
= max_x { ‖Ax‖∗_{Rm} : ‖x‖_{Rn} ≤ 1 } (7.1.25)
= max_x { ‖x‖_{Rn} : ‖x‖_{Rn} ≤ 1 } = 1.   (by (7.1.22))

For the reader's convenience, we present here the algorithm (6.1.19) adapted for our needs. This method is applied to a smooth approximation of the objective function f:

fμ(x) = max_{u∈Q2} { ⟨Ax, u⟩ − μd2(u) }, x ∈ Rn. (7.1.26)

In view of Theorem 6.1.1, this function has a Lipschitz continuous gradient

∇fμ(x) = A^T uμ(x),

where uμ(x) is the unique solution to the optimization problem in (7.1.26). In view of equality (7.1.25), the Lipschitz constant for the gradient is equal to 1/μ.
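When Q2 is the unit box (F(v) = ‖v‖1), the maximization in (7.1.26) is separable and uμ(x) is a componentwise clipping; fμ is then the classical Huber-type approximation of ‖Ax‖1. A short sketch (the data below are arbitrary assumptions), including the uniform approximation bound f − μD2 ≤ fμ ≤ f with D2 = m/2:

```python
import numpy as np

def f_mu_and_grad(A, x, mu):
    """Smoothing (7.1.26) for Q2 = unit box: u_mu(x) = clip((Ax)/mu, -1, 1)."""
    v = A @ x
    u = np.clip(v / mu, -1.0, 1.0)    # the unique maximizer u_mu(x)
    return v @ u - 0.5 * mu * (u @ u), A.T @ u   # (f_mu(x), grad f_mu(x))

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
x = rng.standard_normal(3)
mu = 0.1
fm, g = f_mu_and_grad(A, x, mu)
f = np.abs(A @ x).sum()               # f(x) = max over the box = ||Ax||_1
# Uniform approximation: f - mu*D2 <= f_mu <= f, with D2 = m/2 = 3 here.
assert fm <= f + 1e-12 and f <= fm + mu * 3.0 + 1e-12
```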

Method S_N(R)

Set μ = 2R/( γ1(F) · √(N(N+1)) ) and v0 = x0.

for k := 0 to N − 1 do

y_k = (k/(k+2)) x_k + (2/(k+2)) v_k,

u_μ(y_k) = arg max_{u∈Q2} { ⟨Ay_k, u⟩ − (μ/2)‖u‖²_{Rm} },

v_{k+1} = arg min_{x∈Q1(R)} { (1/(2μ))‖x − x0‖²_{Rn} + Σ_{i=0}^k ((i+1)/2) · ⟨Ax, u_μ(y_i)⟩ },

x_{k+1} = (k/(k+2)) x_k + (2/(k+2)) v_{k+1}.

Output: x̄ := x_N.   (7.1.27)

In what follows, we denote the output x̄ ∈ Rn of this process by S_N(R). It is easy to check that all conditions of Theorem 6.1.3 are satisfied. Thus, if ‖x0 − x*‖_{Rn} ≤ R, then the output of this process satisfies the inequality

f(S_N(R)) − f* ≤ 2γ1(F)R/√(N(N+1)). (7.1.28)
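Below is a hedged sketch of method (7.1.27) on a toy instance where everything is explicit: A = I (so f(x) = ‖x‖1 and γ1(F) = √n in the Euclidean norm), Q1 is a Euclidean ball not containing the origin, and R is taken larger than the diameter of Q1 so that Q1(R) = Q1 and the minimization defining v_{k+1} is a single ball projection. None of these simplifications come from the text:

```python
import numpy as np

def S(N, R, c, r):
    """Scheme (7.1.27) for f(x) = ||x||_1 over Q1 = {||x - c|| <= r}, with A = I."""
    n = c.size
    gamma1 = np.sqrt(n)                          # gamma_1(F) for the Euclidean norm
    mu = 2.0 * R / (gamma1 * np.sqrt(N * (N + 1)))

    def proj(x):                                 # projection onto the ball Q1
        d = x - c
        nd = np.linalg.norm(d)
        return c + d * (r / nd) if nd > r else x

    x0 = proj(np.zeros(n))                       # projection of the origin onto Q1
    x, v, s = x0.copy(), x0.copy(), np.zeros(n)
    for k in range(N):
        y = (k * x + 2.0 * v) / (k + 2)
        u = np.clip(y / mu, -1.0, 1.0)           # u_mu(y_k) for Q2 = unit box
        s += 0.5 * (k + 1) * u                   # weighted sum of A^T u_mu(y_i)
        v = proj(x0 - mu * s)                    # v_{k+1}
        x = (k * x + 2.0 * v) / (k + 2)
    return x

c, r = np.array([2.0, 1.0]), 1.0
x = S(N=500, R=2.0, c=c, r=r)                    # R = 2 >= diam(Q1), so Q1(R) = Q1
```

For this instance f* = 3 − √2 ≈ 1.586, and the bound (7.1.28) guarantees an accuracy of about 0.012 after N = 500 steps.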

This observation has an important corollary.


Theorem 7.1.4 For δ ∈ (0, 1), let

N = ⌈ 2/(α²(F)δ) ⌉. (7.1.29)

Then f( S_N( (1/γ0(F)) f(x0) ) ) ≤ (1 + δ)f*.

Proof Since ‖x0 − x*‖_{Rn} ≤ (1/γ0(F)) f(x0) by (7.1.10), and N + 1 ≥ 2/(α²(F)δ) by (7.1.29), from (7.1.28) and (7.1.8) we have

f(S_N(R)) − f* ≤ δ · α(F)f(x0) ≤ δ · f*. □


Note that the complexity bound (7.1.29) of the scheme (7.1.27) is even lower than the bound of the Subgradient Method (7.1.15) with a recursively updated estimate for the distance to the optimum. Let us show that a similar updating strategy can also accelerate scheme (7.1.27).
Let δ ∈ (0, 1) be the desired relative accuracy. Let

Ñ = ⌈ (2e/α(F)) · (1 + 1/δ) ⌉.

Consider the following restarting strategy. Set x̂0 = x0 . For t ≥ 1 iterate

 
x̂t := S_Ñ( (1/γ0(F)) f(x̂t−1) );
if f(x̂t) ≥ (1/e) f(x̂t−1) then T := t and Stop.   (7.1.30)

Theorem 7.1.5 The number of points T generated by scheme (7.1.30) is bounded as follows:

T ≤ 1 + ln(1/α(F)). (7.1.31)

The last generated point satisfies the inequality f(x̂T) ≤ (1 + δ)f*. The total number of lower-level steps in the process (7.1.30) does not exceed

(2e/α(F)) · (1 + 1/δ) · (1 + ln(1/α(F))). (7.1.32)

Proof By simple induction it is easy to prove that at the beginning of stage t the following inequality holds:

(1/e)^{t−1} f(x0) ≥ f(x̂t−1), t ≥ 1.

Thus, in view of Item 1 of Theorem 7.1.1, at the last stage T of the process we have

(1/e)^{T−1} f(x0) ≥ f(x̂T−1) ≥ f* ≥ α(F)f(x0).

This leads to inequality (7.1.31).
Note that ‖x0 − x*‖ ≤ (1/γ0(F)) f* ≤ (1/γ0(F)) f(x̂T−1). Therefore, at the last stage of the process, in view of inequality (7.1.28) and the termination rule in (7.1.30), we have

f(x̂T) − f* ≤ (2γ1(F)/√(Ñ(Ñ+1))) · (1/γ0(F)) · f(x̂T−1) ≤ (2e/(α(F) · Ñ)) · f(x̂T) ≤ (δ/(1+δ)) · f(x̂T). □

7.1.4 Application Examples

In this section, we discuss the complexity of implementation of the schemes


presented in Sect. 7.1.3 as applied to different structural classes of optimization
problems.

7.1.4.1 Linear Programming

Let Â be an m × (n−1)-matrix, m ≥ n, which has a full column rank. For a given vector c ∈ Rm, consider the following optimization problem:

Find f* = max_{u∈Rm} { ⟨c, u⟩ : Â^T u = 0, |u^{(i)}| ≤ 1, i = 1, …, m }. (7.1.33)

This problem is non-trivial only if the column rank of the matrix A = (Â, c) is equal to n, which we assume to be true.
Problem (7.1.33) can be rewritten in the adjoint form. Define

φ1(y) = max_{u∈Rm} { ⟨c, u⟩ + ⟨y, Â^T u⟩ : |u^{(i)}| ≤ 1, i = 1, …, m } = Σ_{i=1}^m |⟨a_i, y⟩ + c_i|,

where the a_i are the columns of the matrix Â^T. Then

f* = min_{y∈R^{n−1}} φ1(y).

In Example 7.1.1 we have already seen that the latter minimization problem can be

m
represented in the form (7.1.19)–(7.1.20) with x = (y T , τ )T , and F1 (v) = |v (i) |.
i=1
Thus,

Q2 = {u ∈ Rm : |u(i) | ≤ 1, i = 1, . . . , m}.

Choosing ‖u‖ = ( Σ_{i=1}^m (u^{(i)})² )^{1/2}, we get

γ0(F∞) = 1, γ1(F∞) = √m, α(F∞) = 1/√m.

Therefore, in view of Theorem 7.1.5, in order to estimate f* with relative accuracy δ ∈ (0, 1) we need at most

2e · m^{1/2} · (1 + ½ ln m) · (1 + 1/δ)

iterations of the scheme S_N(R).


For this method, we need to compute and invert the matrix G = AT A. If A is
dense, this takes O(n2 m) operations. Further, each iteration of the scheme SN (R)
requires O(nm) operations:
• Multiplication of the matrix A by y_k takes O(mn) operations.
• Since the set Q2 and the norm ‖u‖_{Rm} have separable structure, computation of u_μ(x_k) needs O(m) operations.
• Computation of v_{k+1} needs one multiplication of A^T by a vector, and finding the projection onto a set with representation

Q1(R) = { x ∈ Rn : Cx = 1, ‖x‖_{Rn} ≤ R }

in the Euclidean metric ‖·‖_{Rn}. Since C ∈ R^{1×n}, such a projection can be found by a closed-form expression.
Thus, the total amount of computations in the scheme is of the order of

O( n²m + (1/δ) · nm^{1.5} ln m ) (7.1.34)

operations. The first ingredient of this estimate is dominant when δ > (√m/n) ln m.
Note that for problem (7.1.33) we can apply a standard short-step path-following
scheme (5.3.25). Each iteration of this scheme needs O(n2 m) operations. Therefore
its worst-case efficiency estimate is as follows:

O( n²m^{1.5} ln(m/δ) ). (7.1.35)

Another possibility is to solve this problem by the ellipsoid method (3.2.53). In this
case, the total complexity of its solution is

O( n³m ln(m/δ) ). (7.1.36)

Comparing the bounds (7.1.34), (7.1.35), and (7.1.36), we conclude that the
scheme (7.1.30) is the best when δ is not too small, say
δ > O( (1/n) · max{ 1, √m/n } ).

7.1.4.2 Minimization of the Spectral Radius

Denote by Sn the space of symmetric n × n-matrices. For X ∈ Sn , we can define its


spectral radius:

ρ(X) = max_{1≤i≤n} |λi(X)|.

Note that this function is convex on Sn . For a vector of decision variables x ∈ Rp ,


let us introduce a linear operator A(x):

A(x) = Σ_{i=1}^p x^{(i)} A_i ∈ Sn.

Now we can define the following objective function in problem (7.1.20):

f(x) = ρ(A(x)). (7.1.37)

Assume also that the constraints in problem (7.1.20), (7.1.37) are linear and very
simple. For example, it could be x (1) = 1.
In order to treat the problem (7.1.20), (7.1.37), we need to represent the upper-level function ρ(X) in the special form (7.1.18). Let

Q2 = { X ∈ Sn : Σ_{i=1}^n |λi(X)| ≤ 1 }.

Let us endow the space Sn with the standard Frobenius norm:

‖X‖F = ⟨X, X⟩F^{1/2}, ⟨X, Y⟩F def= Σ_{i,j=1}^n X^{(i,j)} Y^{(i,j)}, X, Y ∈ Sn.

Lemma 7.1.3 Q2 is a closed convex set such that

B‖·‖F(1/√n) ⊂ Q2 ⊂ B‖·‖F(1). (7.1.38)

Moreover, ρ(X) = max_{U∈Q2} ⟨X, U⟩F.

Proof For any X ∈ Sn, we have

ρ(X) = min_{τ∈R} { τ : τIn ⪰ X, τIn ⪰ −X }
= min_{τ∈R} max_{Y1,Y2⪰0} [ τ + ⟨X − τIn, Y1⟩F − ⟨X + τIn, Y2⟩F ]
= max_{Y1,Y2⪰0} { ⟨X, Y1 − Y2⟩F : ⟨In, Y1 + Y2⟩F = 1 }.

Let U = Y1 − Y2 and V = Y1 + Y2. Then

ρ(X) = max_{U∈Sn} { ⟨X, U⟩F : U ∈ Q̂ },

where Q̂ = { U : ∃V ⪰ ±U, ⟨In, V⟩F = 1 }. It is clear that the set Q̂ is closed, convex and bounded. Let us prove that Q̂ = Q2.
Indeed, we can always represent U by its orthogonal basis of eigenvectors:

U = BΛB^T, BB^T = In,

where Λ is a diagonal matrix. Assume that U ∈ Q2. Define a diagonal matrix Λ̂ with the following diagonal entries:

Λ̂^{(i,i)} = |Λ^{(i,i)}| / [ Σ_{j=1}^n |Λ^{(j,j)}| ], i = 1, …, n.

Then V = BΛ̂B^T ⪰ ±U and ⟨In, V⟩F = 1. Thus Q2 ⊆ Q̂.
Conversely, if U ∈ Q̂, then there exists a V ∈ Sn such that B^T V B ⪰ ±Λ. Therefore

⟨V b_i, b_i⟩ ≥ |Λ^{(i,i)}|, i = 1, …, n,

where the b_i are the columns of the matrix B. Hence,

1 = ⟨In, V⟩F = ⟨BB^T, V⟩F = ⟨In, B^T V B⟩F = Σ_{i=1}^n ⟨V b_i, b_i⟩ ≥ Σ_{i=1}^n |λi(U)|.

Thus, Q̂ ⊆ Q2, and we conclude that Q̂ = Q2.



It remains to prove inclusion (7.1.38). Indeed, if ‖U‖²F ≤ 1/n, that is, Σ_{i=1}^n λ²i(U) ≤ 1/n, then

Σ_{i=1}^n |λi(U)| ≤ √n · ( Σ_{i=1}^n λ²i(U) )^{1/2} ≤ 1.

Conversely, if Σ_{i=1}^n |λi(U)| ≤ 1, then Σ_{i=1}^n λ²i(U) ≤ ( Σ_{i=1}^n |λi(U)| )² ≤ 1. □
Thus, in view of inclusion (7.1.38) we have

γ0(ρ) = 1/√n, γ1(ρ) = 1, α(ρ) = 1/√n.
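The representation of Lemma 7.1.3 can be checked numerically: every U with Σ|λi(U)| ≤ 1 (the "nuclear-norm" ball in Sn) gives ⟨X, U⟩F ≤ ρ(X), with equality attained at a signed rank-one matrix built from an extreme eigenvector. A sketch on random data (all data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
X = (M + M.T) / 2.0                           # a random symmetric matrix

lam, V = np.linalg.eigh(X)
rho = np.max(np.abs(lam))                     # spectral radius of X

# A feasible U in Q2: scale a random symmetric matrix to nuclear norm 1.
W = rng.standard_normal((n, n)); W = (W + W.T) / 2.0
U = W / np.abs(np.linalg.eigvalsh(W)).sum()
assert np.sum(X * U) <= rho + 1e-10           # <X, U>_F <= rho(X)

# Equality: U* = sign(lam_i) v_i v_i^T for the eigenvalue of largest modulus.
i = np.argmax(np.abs(lam))
Ustar = np.sign(lam[i]) * np.outer(V[:, i], V[:, i])
assert abs(np.sum(X * Ustar) - rho) < 1e-10
```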

Hence, in view of Theorem 7.1.5, the total number of iterations of the method S_N(R) does not exceed

2e√n · (1 + ½ ln n) · (1 + 1/δ).

In order to apply this approach, we need to compute and invert the matrix G. In our situation, G is the matrix of the following quadratic form:

⟨Gx, x⟩ = ⟨A(x), A(x)⟩F.

Thus, G^{(i,j)} = ⟨Ai, Aj⟩F, i, j = 1, …, p. If the matrices Ai are dense, the computation of this matrix takes O(p²n²) arithmetic operations and the inversion takes O(p³) operations. Since we assume p < n(n+1)/2, the total cost of the preliminary computation is of the order of O(p²n²) operations.
Further, the most expensive operations at each step of the method SN (R) are as
follows.
• Computation of the value of the bilinear form ⟨A(x), U⟩F and its gradients takes O(pn²) operations.
• Finding a projection of point X onto the set Q2 with respect to the standard
Frobenius norm. The most expensive part of this operation consists in solving an
eigenvalue problem for the matrix X. This can be done in O(n3 ) operations.
• The total amount of operations in the space Rp does not exceed O(p2 ).
Thus, the complexity of each iteration of SN (R) is of the order of O(n2 (n + p))
operations. Hence, in total, the method (7.1.30) requires
 
O( n²p² + (1/δ) · n^{2.5}(p + n) ln n ) (7.1.39)

arithmetic operations.
Let us compare this estimate with the worst-case complexity of a short-step path-
following scheme as applied to the problem (7.1.20)–(7.1.37). For this method, the
most expensive computations at each iteration are the computations of the elements
of the Hessian of the barrier function. In accordance with Lemma 5.4.6, these are
the values

⟨X⁻¹Ai X⁻¹, Aj⟩F, i, j = 1, …, p.

Such a computation needs O(pn2 (p + n)) operations. Thus, the total complexity of
the interior-point method is of the order of
O( pn^{2.5}(p + n) ln(n/δ) )
operations. Comparing this estimate with (7.1.39) we see that the gradient method
is better if the required relative accuracy is not too small:
 
δ ≥ O(1/p).

7.1.4.3 The Truss Topology Design Problem

In this problem, we have a set of points

x_i ∈ R², i = 1, …, n + p,

connected by a set of arcs (ik , jk ), k = 1, . . . , m. We always assume that jk > ik .


Each arc has a nonnegative weight t (k) , and the sum of all weights is equal to one.
The nodes xn+1 , . . . , xn+p are fixed. To all other nodes we can apply external forces

f_i ∈ R², i = 1, …, n, f def= (f1, …, fn)^T ∈ R^{2n}.

The goal is to find an optimal design vector

t = (t^{(1)}, …, t^{(m)})^T ∈ Δm def≡ { t ∈ Rm+ : Σ_{i=1}^m t^{(i)} = 1 }

which minimizes the total stiffness ψ(t) of the system.


To define the stiffness, we can always assume that ik < n, k = 1, …, m, allowing no arcs between fixed nodes. For each arc k, define the vectors

d_k = (x_{ik} − x_{jk}) / ‖x_{ik} − x_{jk}‖², k = 1, …, m,

where ‖·‖ is the standard Euclidean norm in R². Now we can define the constraint vector a_k = (a_{k,1}, …, a_{k,n})^T ∈ R^{2n}, which is composed of the following two-dimensional vectors:

a_{k,q} = d_k if q = ik; −d_k if q = jk and jk ≤ n; 0 otherwise;   q = 1, …, n.


Let B(t) = Σ_{k=1}^m t^{(k)} a_k a_k^T. Then the truss topology design problem can be written as follows:

Find ψ* = inf_t { ⟨[B(t)]⁻¹f, f⟩ : t ∈ rint Δm }. (7.1.40)

This problem is well defined if and only if the matrix G def= B(ēm) is positive definite.
Let us show how this problem can be rewritten in the form (7.1.19)–(7.1.20):

ψ* = inf_{t∈rint Δm} ⟨[B(t)]⁻¹f, f⟩
= inf_{t∈rint Δm} max_{x∈R^{2n}} [ 2⟨f, x⟩ − ⟨B(t)x, x⟩ ]
= max_{x∈R^{2n}} inf_{t∈rint Δm} [ 2⟨f, x⟩ − Σ_{k=1}^m t^{(k)}⟨a_k, x⟩² ]
= max_{x∈R^{2n}} [ 2⟨f, x⟩ − max_{1≤k≤m} ⟨a_k, x⟩² ]
= max_{x∈R^{2n}} ⟨f, x⟩² / max_{1≤k≤m} ⟨a_k, x⟩²

(in the last step we perform a maximization of the objective function along direction x by multiplying it by a positive factor).
Thus, we can consider the problem

Find f* def= min_{x∈R^{2n}} { f(x) = max_{1≤k≤m} |⟨a_k, x⟩| : ⟨f, x⟩ = 1 }, (7.1.41)

which is exactly in the desired form (7.1.19)–(7.1.20). Let A be an m × (2n)-matrix with the rows a_k^T. Then, using the notation of Example 7.1.1, the objective function of this problem can be written as

f(x) = F∞(Ax).

In view of (7.1.21) we have α(F∞) = 1/√m. Therefore, in order to find an approximate solution to (7.1.41) with relative accuracy δ, the method (7.1.30) needs at most

2e√m · (1 + ½ ln m) · (1 + 1/δ) (7.1.42)

iterations of the scheme S_N(R). The most expensive operations at each iteration of the latter scheme are as follows.
• Computation of the value and the gradients of the bilinear form ⟨Ax, u⟩ needs O(m) operations (recall that A is sparse).
• The Euclidean projection onto Q2 ⊂ Rm needs O(m ln m) operations.
• All steps in the primal space need O(n²) operations.
Note that the preliminary computation of the matrix G needs O(m + n²) operations, but its inversion costs O(n³). Since m ≤ n(n+1)/2, we come to the following upper bound for the total computational effort of the method (7.1.30):

O( n³ + (1/δ) · (n² + m ln m) · √m ln m ) (7.1.43)

arithmetic operations. For a dense truss with m = O(n²) this estimate becomes

O( (n³/δ) ln² n )

arithmetic operations.

7.2 Rounding of Convex Sets

(Computing rounding ellipsoids; John's Theorem; Rounding by diagonal ellipsoids; Minimizing the maximal absolute value of linear functions; Bilinear matrix games with non-negative coefficients; Minimizing the spectral radius of symmetric matrices.)

7.2.1 Computing Rounding Ellipsoids

Among modern methods for solving problems of Linear Programming (LP-problems, for short), the Interior-Point Methods (IPM) are considered to be the most efficient. However, these methods are based on an expensive machinery. For an LP-problem with n variables and m inequality constraints (m > n), in order to get an approximate solution with absolute accuracy ε, these methods need to perform

O( √m ln(m/ε) )

iterations of Newton's Method (see Sect. 5.4). Recall that for problems with dense data, each iteration can take up to O(n²m) operations.
Clearly these bounds leave considerable room for competition with gradient-type
methods, for which each iteration is much cheaper. However, the main drawback
of the latter schemes is their relatively slow convergence. In general, the gradient

 
schemes need O(C0/ε²) iterations in order to find an ε-solution to the problem (see Sect. 3.2). In this estimate, a strong dependence on ε is coupled with the presence of a constant C0, which depends on the norm of the matrix of constraints, the size of the solution, etc., and which can be uncontrollably large.
gradient-type schemes can compete with IPM only on very large problems.
However, in Chap. 6 we have shown that it is possible to use the special structure of LP-problems in order to get gradient-type schemes which converge in O(C1/ε) iterations. Moreover, it was shown that, for some LP-problems, the constant C1 can be found explicitly and that it is reasonably small. In Sect. 7.1 this result was extended to cover minimization schemes for finding an approximate solution with a certain relative accuracy. Namely, it was shown that for some classes of LP-problems it is possible to compute an approximate solution of relative accuracy δ with O(√m/δ) iterations of a gradient-type scheme. Recall that for many applications
the concept of relative accuracy is very attractive since it adapts automatically to
any size of the solution. So, there is no necessity to fight against big and unknown
constants. For many problems in Economics and Engineering, the level of relative
accuracy of the order 1.5–0.05% is completely acceptable.
The approach of Sect. 7.1 is applicable to special conic unconstrained minimization problems. They consist in minimization of a non-negative positively homogeneous convex function f, dom f = Rn, on a closed convex set separated from zero. In order to compute a solution to this problem with some relative accuracy, we need to know a rounding ellipsoid for the subdifferential of f at the origin. It was shown that for some LP-problems it is possible to use the structure of the objective function in order to compute such an ellipsoid with radius O(√m).
It is well known that, for any centrally symmetric set in Rn, there exists a √n-rounding ellipsoid. Moreover, a good approximation to such an ellipsoid can be easily computed. It appears that this ellipsoid provides us with a good norm, allowing us to solve the corresponding minimization problem up to a certain relative accuracy. In this section, we analyze two non-trivial classes of LP-problems and show that for both classes the approximate solutions with relative accuracy δ can be computed in O( (√(n ln m)/δ) ln n ) iterations of a gradient-type method.
At the same time, the preliminary computation of the rounding ellipsoids in
both situations is reasonably cheap: it takes O(n2 m ln m) operations at most. Up
to a logarithmic factor, this estimate coincides with the complexity of finding a
projection onto a linear subspace in Rm defined by n linear equations. However, we will see that the subsequent optimization process is even cheaper.
Let us recall some notation. In this section, it is convenient to identify E and E∗ with Rn. A symmetric n × n-matrix G ≻ 0 defines a norm on Rn:

‖x‖G = ⟨Gx, x⟩^{1/2}, x ∈ Rn.

The dual norm is defined in the usual way:

‖s‖∗G = sup_x { ⟨s, x⟩ : ‖x‖G ≤ 1 } = ⟨s, G⁻¹s⟩^{1/2}, s ∈ Rn.

For a closed convex bounded set C ⊂ Rn, ξC(x) denotes its support function:

ξC(x) = max_{s∈C} ⟨s, x⟩, x ∈ Rn.

Thus ∂ξC(0) = C.


Finally, D(a) denotes a diagonal n × n-matrix with the vector a ∈ Rn on the diagonal. In this setting, ek ∈ Rn denotes the kth coordinate vector, and ēn ∈ Rn denotes the vector of all ones. Thus, In ≡ D(ēn). As before, the notation Rn+ is used for the positive orthant, and Δn ≡ { x ∈ Rn+ : ⟨ēn, x⟩ = 1 } denotes the standard simplex in Rn.
In this section, we analyze efficient algorithms for constructing rounding ellipsoids for different types of convex sets. An ellipsoid Wr(v, G) ⊂ Rn is usually represented in the following form:

Wr(v, G) = { s ∈ Rn : ‖s − v‖∗G ≡ ⟨s − v, G⁻¹(s − v)⟩^{1/2} ≤ r },

where G ≻ 0 is a symmetric n × n-matrix. If v = 0, we often use the notation Wr(G). An ellipsoid W1(v, G) is called a β-rounding for a convex set C ⊂ Rn, β ≥ 1, if

W1(v, G) ⊆ C ⊆ Wβ(v, G).

We call β the radius of the ellipsoidal rounding.

7.2.1.1 Convex Sets with Central Symmetry

Let G ≻ 0. For an arbitrary g ∈ Rn, consider the set C±g(G) = Conv{ W1(G), ±g }. For α ∈ [0, 1], define

G(α) = (1 − α)G + αgg^T.

Lemma 7.2.1 For any α ∈ [0, 1), the following inclusion holds:

W1(G(α)) ⊂ C±g(G). (7.2.1)

If the value σ def= (1/n)(‖g‖∗G)² − 1 is positive, then the function

V(α) def= ln( det G(α)/det G(0) ) = ln(1 + α(n(1+σ) − 1)) + (n − 1) ln(1 − α)
attains its maximum at α* = σ/(n(1+σ) − 1). Moreover,

V(α*) = ln(1+σ) + (n − 1) ln( (n−1)(1+σ)/(n(1+σ)−1) )
≥ ln(1+σ) − σ/(1+σ) ≥ σ²/((1+σ)(2+σ)). (7.2.2)

Proof For any x ∈ Rn, we have

ξ_{W1(G(α))}(x) = ⟨G(α)x, x⟩^{1/2} = [ (1−α)⟨Gx, x⟩ + α⟨g, x⟩² ]^{1/2}
≤ max{ ⟨Gx, x⟩^{1/2}, |⟨g, x⟩| }
= max{ ξ_{W1(G)}(x), ξ_{Conv{±g}}(x) } = ξ_{C±g(G)}(x).

Hence, in view of Corollary 3.1.5, inclusion (7.2.1) is proved.
Furthermore,

V(α) = ln det( G^{−1/2} G(α) G^{−1/2} )
= ln det( (1−α)In + αG^{−1/2} gg^T G^{−1/2} )
= ln( 1 − α + α(‖g‖∗G)² ) + (n − 1) ln(1 − α)
= ln( 1 + α(n(1+σ) − 1) ) + (n − 1) ln(1 − α).

Hence, in view of Theorem 2.1.1, the global optimality condition for the function V(·) is as follows:

(n − 1)/(1 − α) = (n(1+σ) − 1)/(1 + α(n(1+σ) − 1)).

The only solution of this equation is α* = σ/(n(1+σ) − 1). Note that

V(α*) = ln(1+σ) + (n − 1) ln( (n−1)(1+σ)/(n(1+σ)−1) )
= ln(1+σ) − (n − 1) ln( 1 + σ/((n−1)(1+σ)) )
≥ ln(1+σ) − σ/(1+σ) = σ²/(1+σ) − ω(σ)
≥ σ²/((1+σ)(2+σ)).   (by (5.1.23)) □
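The determinant identity used in this proof is easy to confirm numerically: for a random G ≻ 0 and g, the difference ln det G(α) − ln det G equals ln(1 + α(n(1+σ) − 1)) + (n−1)ln(1−α), and α* maximizes it. A sketch (the random data and scaling of g are illustrative choices that make σ > 0):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)                  # G > 0
g = 100.0 * rng.standard_normal(n)           # large g ensures sigma > 0

sigma = (g @ np.linalg.solve(G, g)) / n - 1.0
assert sigma > 0

def V(alpha):
    Ga = (1 - alpha) * G + alpha * np.outer(g, g)
    return np.linalg.slogdet(Ga)[1] - np.linalg.slogdet(G)[1]

formula = lambda a: np.log(1 + a * (n * (1 + sigma) - 1)) + (n - 1) * np.log(1 - a)
a_star = sigma / (n * (1 + sigma) - 1)       # the maximizer alpha*
for a in [0.01, 0.1, a_star]:
    assert abs(V(a) - formula(a)) < 1e-8     # identity from the proof
assert V(a_star) >= V(0.9 * a_star) - 1e-12  # alpha* is indeed a maximum
assert V(a_star) >= V(1.1 * a_star) - 1e-12
```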


In this section, we are interested in solving the following problem. Let C be a convex centrally symmetric body, i.e. int C ≠ ∅ and x ∈ C ⇔ −x ∈ C. For a given γ > 1, we need to find an ellipsoidal rounding for C of radius γ√n. An initial approximation to the solution of our problem is given by a matrix G0 ≻ 0 such that W1(G0) ⊆ C and C ⊆ WR(G0) for a certain R ≥ 1.
Let us look at a particular variant of such a problem.
Example 7.2.1 Consider a set of vectors a_i \in R^n, i = 1, \dots, m, which span the whole space R^n. Let the set C be defined as follows:

C = \mathrm{Conv}\{\pm a_i,\ i = 1, \dots, m\}. \qquad (7.2.3)

We choose G_0 = \frac{1}{m}\sum_{i=1}^m a_i a_i^T. Note that for any x \in R^n we have \xi_C(x) = \max_{1\leq i\leq m}|\langle a_i, x\rangle|. Therefore,

\xi_{W_1(G_0)}(x) = \Bigl(\frac{1}{m}\sum_{i=1}^m \langle a_i, x\rangle^2\Bigr)^{1/2} \leq \xi_C(x),

\xi_{W_{\sqrt m}(G_0)}(x) = m^{1/2}\Bigl(\frac{1}{m}\sum_{i=1}^m \langle a_i, x\rangle^2\Bigr)^{1/2} \geq \xi_C(x).

Thus, in view of Corollary 3.1.5, W_1(G_0) \subseteq C \subseteq W_{\sqrt m}(G_0). □
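The two support-function inequalities above can be verified numerically. The following short NumPy sketch (names and data are ours, not from the text) checks them on random points, which is exactly the inclusion W_1(G_0) \subseteq C \subseteq W_{\sqrt m}(G_0):

```python
import numpy as np

# Check of Example 7.2.1: for C = Conv{±a_i} and G_0 = (1/m) Σ a_i a_i^T,
# the support functions satisfy ξ_{W_1(G_0)}(x) ≤ ξ_C(x) ≤ ξ_{W_√m(G_0)}(x).
rng = np.random.default_rng(0)
m, n = 10, 4
A = rng.standard_normal((m, n))       # rows are the points a_i
G0 = A.T @ A / m

for _ in range(100):
    x = rng.standard_normal(n)
    xi_C = np.abs(A @ x).max()        # ξ_C(x) = max_i |<a_i, x>|
    xi_W1 = np.sqrt(x @ G0 @ x)       # ξ_{W_1(G_0)}(x) = <G_0 x, x>^{1/2}
    assert xi_W1 <= xi_C + 1e-12
    assert xi_C <= np.sqrt(m) * xi_W1 + 1e-12
```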



Let us analyze the following algorithmic scheme.

For k \geq 0 iterate:

1. Compute g_k \in C:\ \|g_k\|_{G_k}^* = r_k \stackrel{\mathrm{def}}{=} \max_{g\in C}\|g\|_{G_k}^*.

2. If r_k \leq \gamma n^{1/2} then Stop, else set \qquad (7.2.4)

\alpha_k = \frac{r_k^2-n}{n(r_k^2-1)}, \qquad G_{k+1} = (1-\alpha_k)G_k + \alpha_k g_k g_k^T.

end.

The complexity bound for this scheme is given by the following statement.

Theorem 7.2.1 Let R \geq 1 and W_1(G_0) \subseteq C \subseteq W_R(G_0). Then scheme (7.2.4) terminates after at most

2n\Bigl(\frac{\gamma}{\gamma-1}\Bigr)^2\ln R \qquad (7.2.5)

iterations.

Proof Note that the coefficient \alpha_k in Step 2 of (7.2.4) is chosen in accordance with Lemma 7.2.1. Since the method runs as long as \sigma_k \stackrel{\mathrm{def}}{=} \frac{1}{n}r_k^2 - 1 \geq \gamma^2-1, in view of inequality (7.2.2), at each step k \geq 0 we have

\ln\det G_{k+1} \geq \ln\det G_k + 2\ln\gamma - \frac{\gamma^2-1}{\gamma^2}. \qquad (7.2.6)

Note that

2\ln\gamma - \frac{\gamma^2-1}{\gamma^2} = \frac{(\gamma^2-1)^2}{\gamma^2} - \omega(\gamma^2-1) \stackrel{(5.1.23)}{\geq} \frac{(\gamma^2-1)^2}{\gamma^2} - \frac{(\gamma^2-1)^2}{1+\gamma^2}

= \frac{(\gamma^2-1)^2}{\gamma^2(1+\gamma^2)} \geq \frac{1}{\gamma^2}(\gamma-1)^2.

At the same time, for any k \geq 0 we get

\det(G_k)^{1/2}\cdot\mathrm{vol}_n(W_1(I_n)) = \mathrm{vol}_n(W_1(G_k)) \leq \mathrm{vol}_n(C) \leq \mathrm{vol}_n(W_R(G_0)) = R^n\cdot\det(G_0)^{1/2}\cdot\mathrm{vol}_n(W_1(I_n)).

Hence, \ln\det G_k - \ln\det G_0 \leq 2n\ln R, and we get bound (7.2.5) by summing up inequalities (7.2.6). □

Let us estimate the total arithmetical complexity of the scheme (7.2.4) as applied to the particular symmetric convex set (7.2.3). In this situation, it is reasonable to recursively update the inverse matrices H_k \stackrel{\mathrm{def}}{=} G_k^{-1} and the set of values

\nu_k^{(i)} = \langle a_i, H_k a_i\rangle, \quad i = 1, \dots, m,

which we treat as a vector \nu_k \in R^m. A modified variant of the scheme (7.2.4) is as follows.

A. Compute H_0 = \Bigl(\frac{1}{m}\sum_{i=1}^m a_i a_i^T\Bigr)^{-1} and the vector \nu_0 \in R^m.

B. For k \geq 0 iterate:

1. Find i_k:\ \nu_k^{(i_k)} = \max_{1\leq i\leq m}\nu_k^{(i)}. Set r_k = [\nu_k^{(i_k)}]^{1/2}.

2. If r_k \leq \gamma n^{1/2} then Stop, else \qquad (7.2.7)

2.1. Set \sigma_k = \frac{1}{n}r_k^2 - 1, \quad \alpha_k = \frac{\sigma_k}{r_k^2-1}, \quad x_k = H_k a_{i_k}.

2.2. Update H_{k+1} := \frac{1}{1-\alpha_k}\Bigl(H_k - \frac{\alpha_k}{1+\sigma_k}\cdot x_k x_k^T\Bigr).

2.3. Update \nu_{k+1}^{(i)} := \frac{1}{1-\alpha_k}\Bigl(\nu_k^{(i)} - \frac{\alpha_k}{1+\sigma_k}\cdot\langle a_i, x_k\rangle^2\Bigr), \quad i = 1, \dots, m.

end.

Let us estimate the arithmetical complexity of this scheme. For simplicity, we assume that the matrix A = (a_1, \dots, a_m) is dense. We write down only the leading polynomial terms in the complexity of the corresponding computations, counting only multiplications.

• Phase A takes \frac{mn^2}{2} operations to compute the matrix G_0, plus \frac{n^3}{6} operations to compute its inverse, and \frac{mn^2}{2} operations to compute the vector \nu_0.
• Step 2.1 takes n^2 operations.
• Step 2.2 takes \frac{n^2}{2} operations.
• Step 2.3 takes mn operations.

Using now the estimate (7.2.5) with R = \sqrt m (see Example 7.2.1), we conclude that for \gamma > 1 and the centrally symmetric set (7.2.3), the scheme (7.2.7) can find a \gamma\sqrt n-rounding in

\frac{n^2}{6}(n + 6m) + \frac{\gamma^2}{(\gamma-1)^2}\cdot\frac{n^2}{2}(2m + 3n)\ln m

arithmetic operations. Note that for a sparse matrix A the complexity of Phase A and Step 2.3 will be much lower.
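Scheme (7.2.7) is easy to prototype. The following sketch (using NumPy; function and variable names are ours) maintains H_k and the values \nu_k^{(i)} exactly as in steps 2.1–2.3 above:

```python
import numpy as np

def round_symmetric(A, gamma=1.5, max_iter=10000):
    """Sketch of scheme (7.2.7): rank-one updates of H_k = G_k^{-1} and of
    nu_k^(i) = <a_i, H_k a_i> for C = Conv{±a_i} (rows of A)."""
    m, n = A.shape
    H = np.linalg.inv(A.T @ A / m)              # H_0 = G_0^{-1}
    nu = np.einsum('ij,jk,ik->i', A, H, A)      # nu_0^(i) = <a_i, H_0 a_i>
    for _ in range(max_iter):
        i = int(np.argmax(nu))                  # step 1: index i_k
        r2 = nu[i]
        if r2 <= gamma ** 2 * n:                # step 2: r_k <= gamma*sqrt(n)
            return H, np.sqrt(r2)
        sigma = r2 / n - 1.0                    # step 2.1
        alpha = sigma / (r2 - 1.0)
        x = H @ A[i]                            # x_k = H_k a_{i_k}
        H = (H - alpha / (1 + sigma) * np.outer(x, x)) / (1 - alpha)   # 2.2
        nu = (nu - alpha / (1 + sigma) * (A @ x) ** 2) / (1 - alpha)   # 2.3
    return H, np.sqrt(nu.max())

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
H, r = round_symmetric(A, gamma=1.2)
assert r <= 1.2 * np.sqrt(5) + 1e-9             # achieved radius
```

The update of H in step 2.2 is just the Sherman–Morrison formula applied to G_{k+1} = (1-\alpha_k)G_k + \alpha_k g_k g_k^T, using (1-\alpha_k) + \alpha_k r_k^2 = 1+\sigma_k.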
Remark 7.2.1 Note that the process (7.2.4) with the stopping criterion removed can be used to prove a symmetric version of John's theorem. Indeed, all matrices generated by this process have the following form:

G_k = \sum_{i=1}^m \lambda_k^{(i)} a_i a_i^T, \qquad \lambda_k \in R_+^m, \quad \sum_{i=1}^m \lambda_k^{(i)} = 1.

Therefore, I_n = \sum_{i=1}^m \lambda_k^{(i)} G_k^{-1/2} a_i a_i^T G_k^{-1/2}. Taking the trace of both sides of this equality, we get

n = \sum_{i=1}^m \lambda_k^{(i)}(\|a_i\|_{G_k}^*)^2 \leq r_k^2.

On the other hand, we have seen that

\ln\det G_{k+1} \stackrel{(7.2.6)}{\geq} \ln\det G_k + \ln(1+\sigma_k) - \frac{\sigma_k}{1+\sigma_k} \stackrel{(5.1.23)}{\geq} \ln\det G_k + \frac{1}{r_k^2}(r_k - \sqrt n)^2.

Therefore, by the same reasoning as in the proof of Theorem 7.2.1, after N iterations of the scheme we get

\sum_{k=0}^N\Bigl(1 - \frac{\sqrt n}{r_k}\Bigr)^2 \leq 2n\ln R.

Defining r_N^* = \min_{0\leq k\leq N} r_k, we have \frac{\sqrt n}{r_N^*} \geq 1 - \Bigl(\frac{2n}{N+1}\ln R\Bigr)^{1/2}. Thus, r_N^* \to \sqrt n as N \to \infty. Since the sequence of matrices \{G_k\} is compact, we conclude that there exists a limiting matrix G^* with rounding coefficient \beta = \sqrt n.

Thus, we have proved a symmetric version of John's Theorem for the set C defined by (7.2.3). Since the quality of this rounding does not depend on the number of points m, we can use the fact that any symmetric convex set can be approximated by a convex combination of a finite number of points with arbitrary accuracy. Thus, our statement is also valid for general sets.

Note that the process (7.2.4) always constructs a matrix with rounding coefficient \beta = \sqrt n. Of course, there exist symmetric sets with much better rounding. It would be interesting to develop an efficient procedure which can adjust to the exact rounding coefficient of a particular convex set. □

7.2.1.2 General Convex Sets

For an arbitrary g \in R^n, consider the set C_g(G) = \mathrm{Conv}\{W_1(G), g\}. In view of Lemma 3.1.3, the support function of this set is as follows:

\xi_{C_g(G)}(x) = \max\{\|x\|_G, \langle g, x\rangle\}, \quad x \in R^n.

Define r = \|g\|_G^* and

G(\alpha) = (1-\alpha)G + \Bigl[\frac{\alpha}{r} + \Bigl(\frac{r-1}{2}\cdot\frac{\alpha}{r}\Bigr)^2\Bigr]gg^T, \quad \alpha \in [0,1].

Lemma 7.2.2 For all \alpha \in [0,1), the ellipsoid

E_\alpha = \Bigl\{s \in R^n : \Bigl\|s - \frac{r-1}{2r}\,\alpha g\Bigr\|_{G(\alpha)}^* \leq 1\Bigr\}

belongs to the set C_g(G). If r \geq n, then the function

V(\alpha) \stackrel{\mathrm{def}}{=} \ln\frac{\det G(\alpha)}{\det G(0)} = 2\ln\Bigl(1 + \alpha\cdot\frac{r-1}{2}\Bigr) + (n-1)\ln(1-\alpha)

attains its maximum at \alpha^* = \frac{2}{n+1}\cdot\frac{r-n}{r-1}. Moreover,

V(\alpha^*) = 2\ln\frac{r+1}{n+1} + (n-1)\ln\frac{(n-1)(r+1)}{(n+1)(r-1)} \geq 2\Bigl[\ln(1+\sigma) - \frac{\sigma}{1+\sigma}\Bigr] \stackrel{(5.1.23)}{\geq} \frac{2\sigma^2}{(1+\sigma)(2+\sigma)}, \qquad (7.2.8)

where \sigma = \frac{r-n}{n+1}.
Proof In view of Corollary 3.1.5, we need to prove that for all x \in R^n

\xi_{E_\alpha}(x) \equiv \alpha\cdot\frac{r-1}{2r}\langle g,x\rangle + \Bigl[(1-\alpha)\|x\|_G^2 + \Bigl(\frac{\alpha}{r}+\Bigl(\frac{r-1}{2}\cdot\frac{\alpha}{r}\Bigr)^2\Bigr)\langle g,x\rangle^2\Bigr]^{1/2}

\leq \xi_{C_g(G)}(x) = \max\{\|x\|_G, \langle g,x\rangle\}.

If \|x\|_G \leq \langle g,x\rangle, then

\xi_{E_\alpha}(x) \leq \alpha\cdot\frac{r-1}{2r}\langle g,x\rangle + \Bigl|1-\alpha\cdot\frac{r-1}{2r}\Bigr|\cdot\langle g,x\rangle = \langle g,x\rangle.

Otherwise, we have -r\|x\|_G \leq \langle g,x\rangle \leq \|x\|_G. Note that the value \xi_{E_\alpha}(x) depends on \langle g,x\rangle in a convex way. Therefore, in view of Corollary 3.1.2, its maximum is achieved at an end point of the feasible interval for \langle g,x\rangle. For the end point \langle g,x\rangle = \|x\|_G, we have already proved that \xi_{E_\alpha}(x) = \|x\|_G. Consider now the case \langle g,x\rangle = -r\|x\|_G. Then

\xi_{E_\alpha}(x) = -\alpha\cdot\frac{r-1}{2}\cdot\|x\|_G + \Bigl[(1-\alpha)\|x\|_G^2 + \Bigl(\frac{\alpha}{r}+\Bigl(\frac{r-1}{2}\cdot\frac{\alpha}{r}\Bigr)^2\Bigr)r^2\|x\|_G^2\Bigr]^{1/2} = \|x\|_G.

Thus, we have proved that E_\alpha \subseteq C_g(G) for any \alpha \in [0,1). Further,

V(\alpha) = \ln\det(G^{-1/2}G(\alpha)G^{-1/2})

= \ln\det\Bigl((1-\alpha)I_n + \Bigl(\frac{\alpha}{r}+\Bigl(\frac{r-1}{2}\cdot\frac{\alpha}{r}\Bigr)^2\Bigr)G^{-1/2}gg^TG^{-1/2}\Bigr)

= \ln\Bigl(1-\alpha+\Bigl(\frac{\alpha}{r}+\Bigl(\frac{r-1}{2}\cdot\frac{\alpha}{r}\Bigr)^2\Bigr)r^2\Bigr) + (n-1)\ln(1-\alpha)

= 2\ln\Bigl(1+\alpha\cdot\frac{r-1}{2}\Bigr) + (n-1)\ln(1-\alpha).

Hence, in view of Theorem 2.1.1, the optimality condition for the concave function V(\cdot) is as follows:

\frac{n-1}{1-\alpha} = \frac{r-1}{1+\alpha\cdot\frac{r-1}{2}}.

Thus, the maximum is attained at \alpha^* = \frac{2}{n+1}\cdot\frac{r-n}{r-1}. Defining \sigma = \frac{r-n}{n+1}, we get

V(\alpha^*) = 2\ln\Bigl(1+\alpha^*\cdot\frac{r-1}{2}\Bigr) + (n-1)\ln(1-\alpha^*)

= 2\ln(1+\sigma) - (n-1)\ln\Bigl(1+\frac{2(r-n)}{(n-1)(r+1)}\Bigr)

\geq 2\ln(1+\sigma) - \frac{2(r-n)}{r+1} = 2\Bigl[\ln(1+\sigma) - \frac{\sigma}{1+\sigma}\Bigr]. □

In this section, we are interested in solving the following problem. Let C \subset R^n be a convex set with nonempty interior. For a given \gamma > 1, we need to find a \gamma n-rounding for C. An initial approximation to the solution of this problem is given by a point v_0 and a matrix G_0 \succ 0 such that W_1(v_0, G_0) \subseteq C \subseteq W_R(v_0, G_0) for a certain R \geq 1. We assume that n \geq 2.

Let us analyze the following algorithmic scheme.

For k \geq 0 iterate:

1. Compute g_k \in C:\ \|g_k - v_k\|_{G_k}^* = r_k \stackrel{\mathrm{def}}{=} \max_{g\in C}\|g - v_k\|_{G_k}^*.

2. If r_k \leq \gamma n then Stop, else set \qquad (7.2.9)

\alpha_k = \frac{2}{n+1}\cdot\frac{r_k-n}{r_k-1}, \qquad \beta_k = \frac{\alpha_k}{r_k} + \Bigl(\frac{r_k-1}{2}\cdot\frac{\alpha_k}{r_k}\Bigr)^2,

v_{k+1} = v_k + \frac{\alpha_k(r_k-1)}{2r_k}(g_k - v_k),

G_{k+1} = (1-\alpha_k)G_k + \beta_k\cdot(g_k - v_k)(g_k - v_k)^T.

end.

The complexity bound for this scheme is given by the following statement.

Theorem 7.2.2 Let W_1(v_0, G_0) \subseteq C \subseteq W_R(v_0, G_0) for some R \geq 1. Then scheme (7.2.9) terminates after at most

\frac{(1+2\gamma)(2+\gamma)}{2(\gamma-1)^2}\cdot n\ln R \qquad (7.2.10)

iterations.

Proof Note that the coefficient \alpha_k, the vector v_{k+1}, and the matrix G_{k+1} in Step 2 of (7.2.9) are chosen in accordance with Lemma 7.2.2. Since the method runs as long as

\sigma_k \stackrel{\mathrm{def}}{=} \frac{r_k-n}{n+1} \geq \frac{n}{n+1}(\gamma-1) \geq \frac{2}{3}(\gamma-1),

in view of inequality (7.2.8), at each step k \geq 0 we have

\ln\det G_{k+1} \geq \ln\det G_k + \frac{2\sigma_k^2}{(1+\sigma_k)(2+\sigma_k)} \geq \ln\det G_k + \frac{4(\gamma-1)^2}{(1+2\gamma)(2+\gamma)}. \qquad (7.2.11)

Note that for any k \geq 0, we have

\det(G_k)^{1/2}\cdot\mathrm{vol}_n(W_1(I_n)) = \mathrm{vol}_n(W_1(v_k, G_k)) \leq \mathrm{vol}_n(C) \leq \mathrm{vol}_n(W_R(v_0, G_0)) = R^n\cdot\det(G_0)^{1/2}\cdot\mathrm{vol}_n(W_1(I_n)).

Hence, \ln\det G_k - \ln\det G_0 \leq 2n\ln R, and we get bound (7.2.10) by summing up the inequalities (7.2.11). □

Note that in the case C = \mathrm{Conv}\{a_i,\ i = 1,\dots,m\}, scheme (7.2.9) can be implemented efficiently in the same style as (7.2.7). We leave the derivation of this modification and its complexity analysis as an exercise for the reader. The starting rounding ellipsoid for such a set C can be chosen as follows.

Lemma 7.2.3 Assume that the set C = \mathrm{Conv}\{a_i,\ i = 1,\dots,m\} has nonempty interior. Define

\hat a = \frac{1}{m}\sum_{i=1}^m a_i, \qquad G = \frac{1}{R^2}\sum_{i=1}^m(a_i - \hat a)(a_i - \hat a)^T,

where R = \sqrt{m(m-1)}. Then W_1(\hat a, G) \subset C \subset W_R(\hat a, G).

Proof For any x \in R^n and r > 0, we have

\xi_{W_r(\hat a, G)}(x) = \langle\hat a, x\rangle + r\|x\|_G = \langle\hat a, x\rangle + \frac{r}{R}\Bigl(\sum_{i=1}^m\langle a_i - \hat a, x\rangle^2\Bigr)^{1/2}.

Thus, \xi_{W_R(\hat a, G)}(x) \geq \max_{1\leq i\leq m}\langle a_i, x\rangle = \xi_C(x). Hence, W_R(\hat a, G) \supset C.

Further, let

\tau_i = \langle a_i - \hat a, x\rangle, \quad i = 1,\dots,m, \qquad \hat\tau = \max_{1\leq i\leq m}\langle a_i, x\rangle - \langle\hat a, x\rangle \geq 0.

Note that \sum_{i=1}^m\tau_i = 0 and \tau_i \leq \hat\tau for all i. Therefore,

\xi_{W_1(\hat a, G)}(x) - \langle\hat a, x\rangle \leq \frac{1}{R}\max_\tau\Bigl\{\Bigl(\sum_{i=1}^m\tau_i^2\Bigr)^{1/2} : \sum_{i=1}^m\tau_i = 0,\ \tau_i \leq \hat\tau,\ i = 1,\dots,m\Bigr\}

= \frac{\hat\tau}{R}\sqrt{m(m-1)} = \max_{1\leq i\leq m}\langle a_i, x\rangle - \langle\hat a, x\rangle = \xi_C(x) - \langle\hat a, x\rangle.

Thus, in view of Corollary 3.1.5, W_1(\hat a, G) \subset C. □



Remark 7.2.2 In the same way as in Remark 7.2.1, we can use algorithm (7.2.9) to prove John's Theorem for general convex sets. We leave this reasoning as an exercise for the reader.
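A plain prototype of scheme (7.2.9), started from the ellipsoid of Lemma 7.2.3, can be sketched as follows (NumPy; names are ours). Unlike the efficient implementation left above as an exercise, this version simply keeps G_k and solves a linear system per step:

```python
import numpy as np

def round_general(A, gamma=1.5, max_iter=20000):
    """Sketch of scheme (7.2.9) for C = Conv{a_i} (rows of A), started from
    the ellipsoid of Lemma 7.2.3; (||s||_G^*)^2 = <G^{-1} s, s>."""
    m, n = A.shape
    v = A.mean(axis=0)                          # \hat{a} of Lemma 7.2.3
    R = np.sqrt(m * (m - 1.0))
    G = (A - v).T @ (A - v) / R ** 2
    r = 0.0
    for _ in range(max_iter):
        D = A - v
        S = np.linalg.solve(G, D.T)             # columns G^{-1}(a_i - v_k)
        vals = np.einsum('ij,ji->i', D, S)      # (||a_i - v_k||_{G_k}^*)^2
        i = int(np.argmax(vals))
        r = np.sqrt(vals[i])
        if r <= gamma * n:                      # stop: r_k <= gamma * n
            break
        alpha = 2.0 / (n + 1.0) * (r - n) / (r - 1.0)
        beta = alpha / r + ((r - 1.0) * alpha / (2.0 * r)) ** 2
        d = A[i] - v                            # g_k - v_k (old center)
        v = v + alpha * (r - 1.0) / (2.0 * r) * d
        G = (1.0 - alpha) * G + beta * np.outer(d, d)
    return v, G, r
```

By Theorem 7.2.2, the loop terminates within the bound (7.2.10), so the iteration cap is only a safeguard.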


7.2.1.3 Sign-Invariant Convex Sets

We call a set C \subset R^n sign-invariant if an arbitrary change of signs of the entries of any of its points leaves the point inside C. In other words, for any g \in C \cap R_+^n, we have

B(g) \equiv \{s \in R^n : -g \leq s \leq g\} \subseteq C.

Examples of such sets are given by unit balls of \ell_p-norms, or of Euclidean norms generated by diagonal matrices.

Clearly, any sign-invariant set is centrally symmetric. Thus, in view of Lemma 7.2.1, for such a set there exists a \sqrt n-ellipsoidal rounding (this is John's Theorem). We will see that an important additional feature of sign-invariant sets is that the matrix of the corresponding quadratic form can be chosen diagonal.

Let D \succ 0 be a diagonal matrix, and let us choose an arbitrary vector g \in R_+^n. Define

C = \mathrm{Conv}\{W_1(D), B(g)\}, \qquad G(\alpha) = (1-\alpha)D + \alpha D^2(g).

Clearly, C is a sign-invariant set. Consider the function

V(\alpha) = \ln\frac{\det G(0)}{\det G(\alpha)} = -\sum_{i=1}^n\ln\bigl(1+\alpha(\tau_i-1)\bigr), \quad \alpha \in [0,1),

where \tau_i = \frac{(g^{(i)})^2}{D^{(i)}}, i = 1,\dots,n. Note that V(\cdot) is a standard self-concordant function (see Sect. 5.1). For our analysis it is important that

V'(0) = n - \sum_{i=1}^n\tau_i = n - (\|g\|_D^*)^2, \qquad V''(0) = \sum_{i=1}^n(\tau_i-1)^2. \qquad (7.2.12)

Lemma 7.2.4 For any \alpha \in [0,1], W_1(G(\alpha)) \subseteq C. Assuming that (\|g\|_D^*)^2 > n, define the step

\alpha^* \stackrel{\mathrm{def}}{=} \frac{(\|g\|_D^*)^2 - n}{\bigl(2(\|g\|_D^*)^2 - n\bigr)\cdot(\|g\|_D^*)^2}.

Then \alpha^* \in (0, \frac1n], and for any \gamma \in \Bigl(1, \frac{1}{\sqrt n}\|g\|_D^*\Bigr] we have

V(\alpha^*) \leq \ln\Bigl(1 + \frac{\gamma^2-1}{\gamma^2}\Bigr) - \frac{\gamma^2-1}{\gamma^2} < 0. \qquad (7.2.13)

Proof For any \alpha \in [0,1] and x \in R^n, we get

[\xi_{W_1(G(\alpha))}(x)]^2 = (1-\alpha)\langle Dx, x\rangle + \alpha\sum_{i=1}^n(g^{(i)}x^{(i)})^2

\leq (1-\alpha)\langle Dx, x\rangle + \alpha\Bigl(\sum_{i=1}^n g^{(i)}\cdot|x^{(i)}|\Bigr)^2

\leq \bigl[\max\{\xi_{W_1(D)}(x), \xi_{B(g)}(x)\}\bigr]^2 = [\xi_C(x)]^2.

Further, let S = \sum_{i=1}^n\tau_i = (\|g\|_D^*)^2. By assumption, S > n. Therefore,

V''(0) \leq \max_\tau\Bigl\{\sum_{i=1}^n(\tau_i-1)^2 : \sum_{i=1}^n\tau_i = S,\ \tau_i \geq 0,\ i = 1,\dots,n\Bigr\} = (S-1)^2 + n - 1 < S^2.

Since V(\cdot) is a standard self-concordant function, by inequality (5.1.16) we have

V(\alpha) \leq V(0) + \alpha\cdot V'(0) + \omega_*\bigl(\alpha\cdot(V''(0))^{1/2}\bigr) \leq -\alpha\cdot(S-n) + \omega_*(\alpha\cdot S), \qquad (7.2.14)

where \omega_*(\tau) = -\tau - \ln(1-\tau). By Theorem 2.1.1, the minimum of the right-hand side of this inequality is attained at the solution of the equation

S - n = \frac{\alpha_* S^2}{1 - \alpha_* S}.

Thus, \alpha_* = \frac{S-n}{S(2S-n)} < \frac1n. By Lemma 5.1.4, the decrease of the right-hand side in (7.2.14) is equal to

\omega\Bigl(1 - \frac{n}{S}\Bigr) \geq \omega(1 - \gamma^{-2}),

where \omega(t) = t - \ln(1+t). □



Corollary 7.2.1 For any sign-invariant set C \subset R^n with nonempty interior, there exists a diagonal matrix D \succ 0 such that

W_1(D) \subseteq C \subseteq W_{\sqrt n}(D).



Proof For R big enough, the set \{D \succ 0 : W_1(D) \subseteq C \subseteq W_R(D)\} is nonempty, closed, and bounded. Therefore, the existence of a \sqrt n-rounding follows from inequality (7.2.13). □

For us, Corollary 7.2.1 is important because of the following consequence.

Lemma 7.2.5 Let all vectors a_i \in R^n, i = 1,\dots,m, have nonnegative coefficients. Assume that there exists a diagonal matrix D \succ 0 such that

W_1(D) \subseteq \mathrm{Conv}\{B(a_i),\ i = 1,\dots,m\} \subseteq W_{\gamma\sqrt n}(D)

for certain \gamma \geq 1. Then the function f(x) = \max_{1\leq i\leq m}\langle a_i, x\rangle satisfies the inequalities

\|x\|_D \leq f(x) \leq \gamma\sqrt n\cdot\|x\|_D \quad \forall x \in R_+^n. \qquad (7.2.15)

Proof Consider the function \hat f(x) = \max_{1\leq i\leq m}\sum_{j=1}^n a_i^{(j)}|x^{(j)}|. In view of Lemma 3.1.13, its subdifferential at the origin can be expressed as follows:

\partial\hat f(0) = \mathrm{Conv}\{B(a_i),\ i = 1,\dots,m\}.

Thus, for any x \in R^n we have

\|x\|_D = \max_s\{\langle s, x\rangle : s \in W_1(D)\} \leq \max_s\{\langle s, x\rangle : s \in \partial\hat f(0)\} \equiv \hat f(x)

\leq \max_s\{\langle s, x\rangle : s \in W_{\gamma\sqrt n}(D)\} = \gamma\sqrt n\cdot\|x\|_D.

It remains to note that \hat f(x) \equiv f(x) for all x \in R_+^n. □

Corollary 7.2.2 Let a_i \in R_+^n, i = 1,\dots,m. Consider the set

F = \{x \in R_+^n : \langle a_i, x\rangle \leq b_i,\ i = 1,\dots,m\}

with b_i > 0, i = 1,\dots,m. Then there exists a diagonal matrix D \succ 0 such that

W_1(D) \cap R_+^n \subset F \subset W_{\sqrt n}(D) \cap R_+^n. \qquad (7.2.16)

Proof Consider f(x) = \max_{1\leq i\leq m}\frac{1}{b_i}\langle a_i, x\rangle. In view of Corollary 7.2.1, the assumptions of Lemma 7.2.5 are satisfied with \gamma = 1. Since F = \{x \in R_+^n : f(x) \leq 1\}, the inclusions (7.2.16) follow from inequalities (7.2.15). □

In this section, we are interested in finding a diagonal ellipsoidal rounding for the following sign-invariant set:

C = \mathrm{Conv}\{B(a_i),\ i = 1,\dots,m\}, \qquad (7.2.17)

where a_i \in R_+^n \setminus \{0\}, i = 1,\dots,m. Our main assumption on the data is as follows:

\hat a \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^m a_i > 0.

Let \hat D = D^2(\hat a).

Lemma 7.2.6 W_1(\hat D) \subset C \subset W_{m\sqrt n}(\hat D).

Proof Since \hat a \in C, we have W_1(\hat D) \subset B(\hat a) \subseteq C. On the other hand,

C \subseteq B(m\hat a) \subset \Bigl\{x \in R^n : \sum_{i=1}^n\Bigl(\frac{x^{(i)}}{m\,\hat a^{(i)}}\Bigr)^2 \leq n\Bigr\} = W_{m\sqrt n}(\hat D). □

For the sign-invariant set C \subset R^n defined by (7.2.17), consider the following algorithmic scheme, which finds a diagonal rounding of radius \gamma\sqrt n with \gamma > \bigl(1 + \frac{1}{\sqrt n}\bigr)^{1/2}.

Set D_0 = \hat D.

For k \geq 0 iterate:

1. Compute i_k:\ \|a_{i_k}\|_{D_k}^* = r_k \stackrel{\mathrm{def}}{=} \max_{1\leq i\leq m}\|a_i\|_{D_k}^*.

2. If r_k \leq \gamma\sqrt n then Stop, else set \qquad (7.2.18)

\beta_k := \sum_{j=1}^n\Bigl(\frac{(a_{i_k}^{(j)})^2}{D_k^{(j)}} - 1\Bigr)^2, \qquad \alpha_k := \frac{r_k^2-n}{\beta_k + (r_k^2-n)\beta_k^{1/2}},

D_{k+1} := (1-\alpha_k)D_k + \alpha_k D^2(a_{i_k}).

end.

Note that this scheme applies the rules described in Lemma 7.2.4, using the notation \beta_k for V''(0). Therefore, exactly as in Theorems 7.2.1 and 7.2.2, we can prove the following statement.

Theorem 7.2.3 For \gamma \geq \bigl(1 + \frac{1}{\sqrt n}\bigr)^{1/2}, the scheme (7.2.18) terminates after at most

\Bigl[\frac{\gamma^2-1}{\gamma^2} - \ln\Bigl(1 + \frac{\gamma^2-1}{\gamma^2}\Bigr)\Bigr]^{-1}\cdot n(\ln n + 2\ln m)

iterations.

Note that the number of operations at each iteration of the scheme (7.2.18) is proportional to the number of nonzero elements in the matrix A = (a_1,\dots,a_m).

7.2.2 Minimizing the Maximal Absolute Value of Linear Functions

Consider the following problem of Linear Programming:

\min_{y\in R^{n-1}}\max_{1\leq i\leq m}|\langle\bar a_i, y\rangle - c_i|. \qquad (7.2.19)

Defining a_i = (\bar a_i^T, -c_i)^T, i = 1,\dots,m, x = \binom{y}{\tau} \in R^n, and d = e_n, we can rewrite this problem in conic form (see Sect. 7.1):

Find f^* = \min_x\Bigl\{f(x) \stackrel{\mathrm{def}}{=} \max_{1\leq i\leq m}|\langle a_i, x\rangle| : \langle d, x\rangle = 1\Bigr\}. \qquad (7.2.20)

In Sect. 7.1, in order to construct an ellipsoidal rounding for \partial f(0), we used the composite structure of the function f(\cdot). However, the radius of this rounding was quite large, of the order O(\sqrt m). Now, by method (7.2.4), we can efficiently pre-compute a rounding ellipsoid for this set whose radius is proportional to O(\sqrt n). Let us show that this leads to a much more efficient minimization scheme.

Let us fix some \gamma > 1. Assume that, using the process (7.2.4), we have managed to construct an ellipsoidal rounding of radius \gamma\sqrt n for the centrally symmetric set \partial f(0):

W_1(G) \subseteq \partial f(0) \equiv \mathrm{Conv}\{\pm a_i,\ i = 1,\dots,m\} \subseteq W_{\gamma\sqrt n}(G).

The immediate consequences are as follows:

\|x\|_G \leq f(x) \equiv \sup_s\{\langle s, x\rangle : s \in \partial f(0)\} \leq \gamma\sqrt n\cdot\|x\|_G, \qquad (7.2.21)

\|a_i\|_G^* \leq \gamma\sqrt n, \quad i = 1,\dots,m. \qquad (7.2.22)

Let us now fix a smoothing parameter \mu > 0 and consider the following approximation of the function f(\cdot):

f_\mu(x) = \mu\ln\Bigl(\sum_{i=1}^m\bigl[e^{\langle a_i,x\rangle/\mu} + e^{-\langle a_i,x\rangle/\mu}\bigr]\Bigr).

Clearly, f_\mu(\cdot) is convex and infinitely differentiable on R^n. Moreover,

f(x) \leq f_\mu(x) \leq f(x) + \mu\ln(2m) \quad \forall x \in R^n. \qquad (7.2.23)

Finally, note that for any point x and any direction h in R^n we have

\langle\nabla f_\mu(x), h\rangle = \sum_{i=1}^m\lambda_\mu^{(i)}(x)\cdot\langle a_i, h\rangle,

\lambda_\mu^{(i)}(x) = \frac{1}{\omega_\mu(x)}\bigl[e^{\langle a_i,x\rangle/\mu} - e^{-\langle a_i,x\rangle/\mu}\bigr], \quad i = 1,\dots,m,

\omega_\mu(x) = \sum_{i=1}^m\bigl[e^{\langle a_i,x\rangle/\mu} + e^{-\langle a_i,x\rangle/\mu}\bigr].

Therefore, the expression for the Hessian is as follows:

\langle\nabla^2 f_\mu(x)h, h\rangle = \frac{1}{\mu}\Bigl[\frac{1}{\omega_\mu(x)}\sum_{i=1}^m\bigl(e^{\langle a_i,x\rangle/\mu} + e^{-\langle a_i,x\rangle/\mu}\bigr)\langle a_i,h\rangle^2 - \Bigl(\sum_{i=1}^m\lambda_\mu^{(i)}(x)\cdot\langle a_i,h\rangle\Bigr)^2\Bigr].

In view of (7.2.22), we have

\langle\nabla^2 f_\mu(x)h, h\rangle \leq \frac{1}{\mu}\max_{1\leq i\leq m}(\|a_i\|_G^*)^2\cdot\|h\|_G^2 \leq \frac{\gamma^2 n}{\mu}\cdot\|h\|_G^2.

In view of Theorem 2.1.6, this implies that the gradient of the function f_\mu(\cdot) is Lipschitz continuous in the metric \|\cdot\|_G with Lipschitz constant L_\mu = \frac{\gamma^2 n}{\mu}:

\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_G^* \leq L_\mu\|x - y\|_G \quad \forall x, y \in E.
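In practice, evaluating f_\mu and the weights \lambda_\mu^{(i)}(x) for small \mu requires the usual max-shift to avoid overflowing the exponentials. A minimal sketch (NumPy; function name is ours):

```python
import numpy as np

def f_mu_and_grad(A, x, mu):
    """f_mu(x) = mu*ln( sum_i [exp(<a_i,x>/mu) + exp(-<a_i,x>/mu)] ) and its
    gradient sum_i lambda_mu^(i)(x) * a_i, computed with a max-shift."""
    z = A @ x / mu
    t = np.concatenate([z, -z])                 # all 2m exponents
    tmax = t.max()
    w = np.exp(t - tmax)                        # shifted exponentials
    f = mu * (tmax + np.log(w.sum()))
    m = len(z)
    lam = (w[:m] - w[m:]) / w.sum()             # lambda_mu^(i)(x)
    return f, A.T @ lam

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 8))
x = rng.standard_normal(8)
f, g = f_mu_and_grad(A, x, mu=1e-3)
f_true = np.abs(A @ x).max()
# bounds (7.2.23): f(x) <= f_mu(x) <= f(x) + mu*ln(2m)
assert f_true <= f + 1e-9
assert f <= f_true + 1e-3 * np.log(60) + 1e-9
```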

Our approach is very similar to that of Sect. 7.1. Consider the problem

\min_x\{\phi(x) : x \in Q\}, \qquad (7.2.24)

where Q is a closed convex set and the differentiable convex function \phi(\cdot) has a gradient which is Lipschitz continuous in the Euclidean norm \|\cdot\|_G with constant L. Let us write down the optimal method (2.2.63) for solving problem (7.2.24).

Method S(\phi, L, Q, G, x_0, N)

Set v_0 = x_0. For k = 0,\dots,N-1 do:

1. Set y_k = \frac{k}{k+2}x_k + \frac{2}{k+2}v_k.

2. Compute \nabla\phi(y_k). \qquad (7.2.25)

3. v_{k+1} = \arg\min_{v\in Q}\Bigl\{\sum_{i=0}^k\frac{i+1}{2}\langle\nabla\phi(y_i), v - x_0\rangle + \frac{L}{2}\|v - x_0\|_G^2\Bigr\}.

4. x_{k+1} := \frac{k}{k+2}x_k + \frac{2}{k+2}v_{k+1}.

Return: S(\phi, L, Q, G, x_0, N) \equiv x_N.

In accordance with Theorem 6.1.2, the output x_N of this scheme satisfies the inequality

\phi(x_N) - \phi(x_\phi^*) \leq \frac{2L\|x_0 - x_\phi^*\|_G^2}{N(N+1)}, \qquad (7.2.26)

where x_\phi^* is an optimal solution to problem (7.2.24).
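Steps 1–4 of method S translate directly into code. In the sketch below (NumPy; all names are ours), the auxiliary subproblem of step 3 is delegated to a callback `project`, and we illustrate it on a toy instance with G = I where the subproblem reduces to a Euclidean projection:

```python
import numpy as np

def method_S(grad, L, project, x0, N):
    """Sketch of scheme (7.2.25). `project(s)` must return
    argmin_{v in Q} { <s, v> + (L/2)||v - x0||_G^2 }; for G = I this is the
    Euclidean projection of x0 - s/L onto Q."""
    x, v = x0.copy(), x0.copy()
    s = np.zeros_like(x0)                 # accumulates (i+1)/2 * grad(y_i)
    for k in range(N):
        y = (k * x + 2 * v) / (k + 2)     # step 1
        s = s + 0.5 * (k + 1) * grad(y)   # step 2
        v = project(s)                    # step 3
        x = (k * x + 2 * v) / (k + 2)     # step 4
    return x

# Toy check: minimize ||x||^2/2 (so L = 1, G = I) over the ball of
# radius 0.5 around c = (2, 0); the optimum is at (1.5, 0).
c = np.array([2.0, 0.0])
x0 = c.copy()

def project(s):
    u = x0 - s                            # minimizer of <s,v> + ||v-x0||^2/2
    dvec = u - c
    nd = np.linalg.norm(dvec)
    return c + dvec * min(1.0, 0.5 / nd) if nd > 0 else c.copy()

xN = method_S(lambda y: y, 1.0, project, x0, 200)
assert abs(np.linalg.norm(xN) - 1.5) < 1e-2
```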


As in Sect. 7.1, we are going to use the scheme (7.2.25) in order to compute an approximate solution to (7.2.20) with a certain relative accuracy \delta > 0. Define

Q(r) = \{x \in R^n : \langle d, x\rangle = 1,\ \|x\|_G \leq r\},

x_0 = \frac{G^{-1}d}{\langle d, G^{-1}d\rangle},

\tilde N = \Bigl\lceil 2e\gamma\sqrt{2n\ln(2m)}\Bigl(1 + \frac1\delta\Bigr)\Bigr\rceil.

Consider the following method.

Set \hat x_0 = x_0.

For t \geq 1 iterate:

\mu_t := \frac{\delta f(\hat x_{t-1})}{2e(1+\delta)\ln(2m)}; \qquad L_{\mu_t} := \frac{\gamma^2 n}{\mu_t}; \qquad (7.2.27)

\hat x_t := S\bigl(f_{\mu_t}, L_{\mu_t}, Q(f(\hat x_{t-1})), G, x_0, \tilde N\bigr);

If f(\hat x_t) \geq \frac1e f(\hat x_{t-1}) then T := t and Stop.

Theorem 7.2.4 The number of points generated by method (7.2.27) is bounded as follows:

T \leq 1 + \ln(\gamma\sqrt n). \qquad (7.2.28)

The last point of the process satisfies the inequality f(\hat x_T) \leq (1+\delta)f^*. The total number of lower-level steps in the process (7.2.27) does not exceed

2e\gamma\bigl(1 + \ln(\gamma\sqrt n)\bigr)\sqrt{2n\ln(2m)}\Bigl(1 + \frac1\delta\Bigr). \qquad (7.2.29)

Proof Let x^* be an optimal solution to problem (7.2.20). Note that all points \hat x_t generated by (7.2.27) are feasible for (7.2.20). Therefore, in view of (7.2.21),

f(\hat x_t) \geq f^* \geq \|x^*\|_G.

Thus, x^* \in Q(f(\hat x_t)) for any t \geq 0. Let

f_t^* = f_{\mu_t}(x_t^*) = \min_x\{f_{\mu_t}(x) : x \in Q(f(\hat x_{t-1}))\}.

Since x^* \in Q(f(\hat x_{t-1})), in view of (7.2.23) we have

f_t^* \leq f_{\mu_t}(x^*) \leq f^* + \mu_t\ln(2m).

By the first part of (7.2.23), f(\hat x_t) \leq f_{\mu_t}(\hat x_t). Note that

\|x_0 - x_t^*\|_G \leq \|x_t^*\|_G \leq f(\hat x_{t-1}), \quad t \geq 1.



In view of (7.2.26), at the last iteration T we have

f(\hat x_T) - f^* \leq f_{\mu_T}(\hat x_T) - f_T^* + \mu_T\ln(2m)

\leq \frac{2L_{\mu_T}f^2(\hat x_{T-1})}{(\tilde N+1)^2} + \mu_T\ln(2m) = \frac{2\gamma^2 n f^2(\hat x_{T-1})}{\mu_T(\tilde N+1)^2} + \mu_T\ln(2m)

\leq \frac{\delta^2 f^2(\hat x_{T-1})}{4\mu_T e^2\ln(2m)(1+\delta)^2} + \mu_T\ln(2m) = 2\mu_T\ln(2m).

Further, in view of the choice of \mu_t and the stopping criterion in (7.2.27), we have

2\mu_T\ln(2m) = \frac{\delta f(\hat x_{T-1})}{e(1+\delta)} \leq \frac{\delta f(\hat x_T)}{1+\delta}.

Thus, f(\hat x_T) \leq (1+\delta)f^*.

It remains to prove the estimate (7.2.28) for the number of steps in the upper level of the process. Indeed, by simple induction it is easy to prove that at the beginning of stage t the following inequality holds:

\Bigl(\frac1e\Bigr)^{t-1}f(x_0) \geq f(\hat x_{t-1}), \quad t \geq 1.

Note that x_0 is the projection of the origin onto the hyperplane \langle d, x\rangle = 1. Therefore, in view of inequalities (7.2.21), we have

f^* \geq \|x^*\|_G \geq \|x_0\|_G \geq \frac{1}{\gamma\sqrt n}f(x_0).

Thus, at the final stage of the scheme we have

\Bigl(\frac1e\Bigr)^{T-1}f(x_0) \geq f(\hat x_{T-1}) \geq f^* \geq \frac{1}{\gamma\sqrt n}f(x_0).

This leads to the bound (7.2.28). □



Recall that the preliminary stage of method (7.2.27), that is, the computation of a \gamma\sqrt n-rounding for \partial f(0) with relative accuracy \gamma > 1, can be performed by procedure (7.2.4) in

\frac{n^2}{6}(n + 6m) + \frac{\gamma^2}{(\gamma-1)^2}\cdot\frac{n^2}{2}(2m + 3n)\ln m = O(n^2(n+m)\ln m)

arithmetic operations. Since each step of method (7.2.25) takes O(mn) operations, the complexity of the preliminary stage is dominant if \delta is not too small, say \delta > \frac{1}{\sqrt n}.

7.2.3 Bilinear Matrix Games with Non-negative Coefficients

Let A = (a_1,\dots,a_m) be an n\times m-matrix with nonnegative coefficients. Consider the problem

Find f^* \stackrel{\mathrm{def}}{=} \min_{x\in\Delta_n}\Bigl\{f(x) = \max_{1\leq i\leq m}\langle a_i, x\rangle\Bigr\}. \qquad (7.2.30)

Note that this format can be used for different standard problem settings. Consider, for example, the linear packing problem

Find \psi^* = \max_{y\in R_+^n}\bigl\{\langle c, y\rangle : \langle a_i, y\rangle \leq b^{(i)},\ i = 1,\dots,m\bigr\},

where all entries of the vectors a_i are non-negative, b > 0 \in R^m, and c > 0 \in R^n. Then

\psi^* = \max_{y\in R_+^n}\Bigl\{\langle c, y\rangle : \max_{1\leq i\leq m}\frac{1}{b^{(i)}}\langle a_i, y\rangle \leq 1\Bigr\} = \max_{y\in R_+^n}\frac{\langle c, y\rangle}{\max_{1\leq i\leq m}\frac{1}{b^{(i)}}\langle a_i, y\rangle}

= \Bigl[\min_{y\in R_+^n}\Bigl\{\max_{1\leq i\leq m}\frac{1}{b^{(i)}}\langle a_i, y\rangle : \langle c, y\rangle = 1\Bigr\}\Bigr]^{-1}

= \Bigl[\min_{x\in\Delta_n}\max_{1\leq i\leq m}\frac{1}{b^{(i)}}\langle D^{-1}(c)a_i, x\rangle\Bigr]^{-1}.

As usual, we can approximate the objective function f(\cdot) in (7.2.30) by the following smooth function:

f_\mu(x) = \mu\ln\Bigl(\sum_{i=1}^m e^{\langle a_i, x\rangle/\mu}\Bigr).

In this case, the following relations hold:

f(x) \leq f_\mu(x) \leq f(x) + \mu\ln m \quad \forall x \in R^n. \qquad (7.2.31)

Define

\hat f(x) = \max_{1\leq i\leq m}\sum_{j=1}^n a_i^{(j)}|x^{(j)}|.

Note that the subdifferential of this homogeneous function at the origin is as follows:

\partial\hat f(0) = \mathrm{Conv}\{B(a_i),\ i = 1,\dots,m\}.



In Sect. 7.2.1.3, we have seen that it is possible to compute a diagonal matrix D \succ 0 such that

W_1(D) \subseteq \partial\hat f(0) \subseteq W_{2\sqrt n}(D)

(this corresponds to the choice \gamma = 2 in scheme (7.2.18)). In view of Lemma 7.2.5, using this matrix we can define a Euclidean norm \|\cdot\|_D such that

\|x\|_D \leq f(x) \leq 2\sqrt n\cdot\|x\|_D \quad \forall x \in R_+^n. \qquad (7.2.32)

Moreover, in this norm the sizes of all a_i are bounded by 2\sqrt n.

Now, using the same reasoning as in Sect. 7.2.2, we can show that for any x and h in R^n,

\langle\nabla^2 f_\mu(x)h, h\rangle \leq \frac{4n}{\mu}\cdot\|h\|_D^2.

Hence, the gradient of this function is Lipschitz continuous with respect to the norm \|\cdot\|_D with constant \frac{4n}{\mu}. This implies that the function f_\mu(\cdot) can be minimized by the efficient method (6.1.19).

Let us fix some relative accuracy \delta > 0. Define

Q(r) = \{x \in \Delta_n : \|x\|_D \leq r\},

x_0 = \frac{D^{-1}\bar e_n}{\langle\bar e_n, D^{-1}\bar e_n\rangle},

\tilde N = \Bigl\lceil 4e\sqrt{2n\ln m}\Bigl(1 + \frac1\delta\Bigr)\Bigr\rceil.

Consider the following method.

Set \hat x_0 = x_0.

For t \geq 1 iterate:

\mu_t := \frac{\delta f(\hat x_{t-1})}{2e(1+\delta)\ln m}; \qquad L_{\mu_t} := \frac{4n}{\mu_t}; \qquad (7.2.33)

\hat x_t := S\bigl(f_{\mu_t}, L_{\mu_t}, Q(f(\hat x_{t-1})), D, x_0, \tilde N\bigr);

If f(\hat x_t) \geq \frac1e f(\hat x_{t-1}) then T := t and Stop.

The justification of this scheme is very similar to that of (7.2.27).



Theorem 7.2.5 The number of points generated by method (7.2.33) is bounded as follows:

T \leq 1 + \ln(2\sqrt n). \qquad (7.2.34)

The last point of the process satisfies the inequality f(\hat x_T) \leq (1+\delta)f^*. The total number of lower-level steps in the process (7.2.33) does not exceed

4e\bigl(1 + \ln(2\sqrt n)\bigr)\sqrt{2n\ln m}\Bigl(1 + \frac1\delta\Bigr). \qquad (7.2.35)

Proof Let x^* be an optimal solution to problem (7.2.30). Note that all points \hat x_t generated by (7.2.33) are feasible. Therefore, in view of (7.2.32),

f(\hat x_t) \geq f^* \geq \|x^*\|_D.

Thus, x^* \in Q(f(\hat x_t)) for any t \geq 0. Define

f_t^* = f_{\mu_t}(x_t^*) = \min_x\{f_{\mu_t}(x) : x \in Q(f(\hat x_{t-1}))\}.

Since x^* \in Q(f(\hat x_{t-1})), in view of (7.2.31) we have

f_t^* \leq f_{\mu_t}(x^*) \leq f^* + \mu_t\ln m.

By the first part of (7.2.31), f(\hat x_t) \leq f_{\mu_t}(\hat x_t). Note that

\|x_0 - x_t^*\|_D \leq \|x_t^*\|_D \leq f(\hat x_{t-1})

for all t \geq 1. Thus, in view of (7.2.26), at the last iteration T we have

f(\hat x_T) - f^* \leq f_{\mu_T}(\hat x_T) - f_T^* + \mu_T\ln m \leq \frac{2L_{\mu_T}f^2(\hat x_{T-1})}{(\tilde N+1)^2} + \mu_T\ln m

= \frac{8n f^2(\hat x_{T-1})}{\mu_T(\tilde N+1)^2} + \mu_T\ln m \leq \frac{\delta^2 f^2(\hat x_{T-1})}{4\mu_T e^2\ln m(1+\delta)^2} + \mu_T\ln m = 2\mu_T\ln m.

Further, in view of the choice of \mu_T and the stopping criterion, we have

2\mu_T\ln m = \frac{\delta f(\hat x_{T-1})}{e(1+\delta)} \leq \frac{\delta f(\hat x_T)}{1+\delta}.

Thus, f(\hat x_T) \leq (1+\delta)f^*.



It remains to prove the estimate (7.2.34) for the number of steps of the upper-level process. Indeed, by simple induction it is easy to prove that at the beginning of stage t the following inequality holds:

\Bigl(\frac1e\Bigr)^{t-1}f(x_0) \geq f(\hat x_{t-1}), \quad t \geq 1.

Note that x_0 is the projection of the origin onto the hyperplane \langle\bar e_n, x\rangle = 1. Therefore, in view of inequalities (7.2.32), we have

f^* \geq \|x^*\|_D \geq \|x_0\|_D \geq \frac{1}{2\sqrt n}f(x_0).

Thus, at the last step of the scheme we have

\Bigl(\frac1e\Bigr)^{T-1}f(x_0) \geq f(\hat x_{T-1}) \geq f^* \geq \frac{1}{2\sqrt n}f(x_0).

This leads to the bound (7.2.34). □



Thus, we have seen that the scheme (7.2.33) needs O\Bigl(\frac{\sqrt{n\ln m}}{\delta}\ln n\Bigr) iterations of the gradient scheme (7.2.25). Since the matrix D is diagonal, each iteration of this scheme is very cheap: its complexity is proportional to the number of nonzero elements in the matrix A. Note also that in Step 3 of scheme (7.2.25) it is necessary to compute projections onto the set Q(r), which is the intersection of the simplex and a diagonal ellipsoid. However, since D is a diagonal matrix, this can be done in O(n\ln n) operations by relaxing the only equality constraint and arranging a one-dimensional search in the corresponding Lagrange multiplier.
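To make this projection concrete, here is a simple (not O(n ln n)) sketch that dualizes both constraints and runs two nested bisections instead of the sorting-based one-dimensional search mentioned above. All names are ours, and it relies on the standard fact that \|x(\nu)\|_D decreases as the penalty multiplier \nu grows:

```python
import numpy as np

def project_simplex_ellipsoid(c, d, r):
    """Projection of c onto Q(r) = {x in the simplex : sum_j d_j x_j^2 <= r^2}
    in the metric ||u||_D^2 = sum_j d_j u_j^2 (all d_j > 0)."""
    def x_of(nu):                                   # KKT point for fixed nu
        def x_lam(lam):
            return np.maximum(0.0, (d * c - lam) / (d * (1.0 + nu)))
        lo = (d * c).min() - d.max() * (1.0 + nu)   # sum(x(lo)) >= 1
        hi = (d * c).max()                          # sum(x(hi)) = 0
        for _ in range(200):                        # solve sum(x(lam)) = 1
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if x_lam(mid).sum() > 1.0 else (lo, mid)
        return x_lam(0.5 * (lo + hi))
    x = x_of(0.0)
    if d @ x ** 2 <= r ** 2 + 1e-12:                # ellipsoid inactive
        return x
    lo, hi = 0.0, 1.0
    while d @ x_of(hi) ** 2 > r ** 2:               # bracket the multiplier
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if d @ x_of(mid) ** 2 > r ** 2 else (lo, mid)
    return x_of(hi)
```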

7.2.4 Minimizing the Spectral Radius of Symmetric Matrices

For a matrix X \in S^n, define its spectral radius:

\rho(X) = \max_{1\leq i\leq n}|\lambda^{(i)}(X)| = \max\{\lambda^{(1)}(X), -\lambda^{(n)}(X)\} = \min_\tau\{\tau : \tau I_n \succeq \pm X\}.

In view of Theorem 3.1.7, \rho(X) is a convex function on S^n. In this section, we consider the following optimization problem:

Find \phi_* \stackrel{\mathrm{def}}{=} \min_{y\in Q}\{\phi(y) = \rho(A(y))\}, \qquad (7.2.36)

where Q \subset R^m is a closed convex set separated from the origin, and A(\cdot) is a linear operator from R^m to S^n:

A(y) = \sum_{i=1}^m y^{(i)}A_i \in S^n, \quad y \in R^m.

We assume that the matrices \{A_i\}_{i=1}^m are linearly independent. Hence, the matrix G \in S^m with elements

G^{(i,j)} = \langle A_i, A_j\rangle_M, \quad i, j = 1,\dots,m,

is positive definite. Denote by r the maximal rank of A(y):

r = \max_{y\in R^m}\mathrm{rank}\,A(y) \leq \min\Bigl\{n,\ \mathrm{rank}\sum_{i=1}^m A_i\Bigr\}.

We are going to solve (7.2.36) using a variant of the smoothing technique which is applicable to structural convex optimization problems posed in relative scale. Note that, in view of our assumptions, \phi_* is strictly positive.

First of all, we approximate the non-smooth objective function in (7.2.36) by a smooth one. For that, we use the function F_p(X) defined by (6.3.6). Note that

F_p(X) = \frac12\langle X^{2p}, I_n\rangle_M^{1/p} \geq \frac12\rho^2(X), \qquad F_p(X) \leq \frac12\rho^2(X)\cdot(\mathrm{rank}\,X)^{1/p}. \qquad (7.2.37)

Consider the problem

Find f_p^* \stackrel{\mathrm{def}}{=} \min_{y\in R^m}\{f_p(y) = F_p(A(y)) : y \in Q\}. \qquad (7.2.38)

From (7.2.37), we can see that

\frac12\phi_*^2 \leq f_p^* \leq \frac12\phi_*^2\cdot r^{1/p}. \qquad (7.2.39)

Our goal is to find a point \bar y \in Q which solves (7.2.36) with relative accuracy \delta > 0:

\phi(\bar y) \leq (1+\delta)\phi_*.

Let us choose an integer p satisfying the following inequality:

p(\delta) \stackrel{\mathrm{def}}{=} \frac{1+\delta}{\delta}\ln r \leq p \leq 2p(\delta). \qquad (7.2.40)

Assume that \bar y \in Q solves (7.2.38) with relative accuracy \delta. Then, in view of (7.2.37) and (7.2.39), we have

\phi(\bar y)/\phi_* \leq r^{\frac{1}{2p}}\cdot\bigl(f_p(\bar y)/f_p^*\bigr)^{1/2} \leq r^{\frac{1}{2p}}\cdot\sqrt{1+\delta} \leq e^{\frac{\delta}{2(1+\delta)}}\cdot\sqrt{1+\delta} \leq 1+\delta.

Thus, we need to estimate the efficiency of method (6.1.19) as applied to problem (7.2.38). Let us introduce the following norm:

\|h\|_G = \langle Gh, h\rangle^{1/2}, \quad h \in R^m.

Assuming that p(\delta) \geq 1 and using the estimate (6.3.8) and the notation of Sect. 6.3.1, for any y and h in R^m we get

\langle\nabla^2 f_p(y)h, h\rangle = \langle\nabla^2 F_p(A(y))A(h), A(h)\rangle_M \leq (2p-1)\|A(h)\|_{(2p)}^2 \leq (2p-1)\|A(h)\|_{(2)}^2

= (2p-1)\langle A(h), A(h)\rangle_M = (2p-1)\langle Gh, h\rangle = (2p-1)\|h\|_G^2.

Thus, in view of Theorem 2.1.6, the function f_p(\cdot) has a Lipschitz continuous gradient on R^m with respect to the norm \|\cdot\|_G, with Lipschitz constant

L = 2p - 1 \leq 4p(\delta). \qquad (7.2.41)

On the other hand, for any X \in S^n with \mathrm{rank}\,X \leq r and p \geq 1, we have

\frac1r\|X\|_{(2)}^2 \leq \|X\|_{(\infty)}^2 \leq \|X\|_{(2p)}^2.

Hence, \frac{1}{2r}\|y\|_G^2 \leq f_p(y) for any y \in R^m. In particular,

\frac{1}{2r}\|y_p^*\|_G^2 \leq f_p^*, \qquad (7.2.42)

where y_p^* is an optimal solution to (7.2.38).

Let x_0 = \arg\min_{y\in Q}\|y\|_G. Since the norm \|\cdot\|_G is Euclidean and Q is convex, in view of inequality (2.2.49) we have

\|y_p^* - x_0\|_G^2 \leq \|y_p^*\|_G^2 - \|x_0\|_G^2 < \|y_p^*\|_G^2.



Combining this inequality with estimate (7.2.42), we get

\frac12\|y_p^* - x_0\|_G^2 \leq \frac12\|y_p^*\|_G^2 \leq r f_p^*. \qquad (7.2.43)

In order to apply method (2.2.63) to problem (7.2.38), let us choose the following prox-function:

d(x) = \frac12\|x - x_0\|_G^2. \qquad (7.2.44)

Note that the convexity parameter of this function is equal to one. Hence, in view of bounds (7.2.41), (7.2.42), and (6.1.21), method (6.1.19) launched from the starting point x_0 converges as follows:

f_p(x_k) - f_p^* \leq \frac{16(1+\delta)r\ln r}{\delta\cdot k(k+1)}\cdot f_p^*. \qquad (7.2.45)

Hence, in order to solve problem (7.2.38) with relative accuracy \delta (and, therefore, to solve (7.2.36) with the same relative accuracy), method (6.1.19) needs at most

\frac4\delta\bigl[(1+\delta)r\ln r\bigr]^{1/2} \qquad (7.2.46)

iterations. Note that this bound does not depend on the size of the data of the particular problem instance.

At each iteration of method (6.1.19), as applied to problem (7.2.38) with d(\cdot) defined by (7.2.44), it is necessary to compute a projection of a point onto the set Q with respect to the Euclidean metric \|\cdot\|_G. This operation is easy in the following cases.

• The set Q is an affine subspace in R^m. Then the projection can be computed by inverting the matrix G. An important example of such a problem is

\min_{y\in R^m}\Bigl\{\rho\Bigl(\sum_{i=1}^m y^{(i)}A_i\Bigr) : y^{(1)} = 1\Bigr\}.

• The matrix G and the set Q are both simple. For example, if \langle A_i, A_j\rangle = 0 for i \neq j, then G is a diagonal matrix. In this case, a projection onto a box, for example, is easy to compute. Such a situation occurs when the matrix A(y) is parameterized directly by its entries.
Finally, note that the computation of the value and the gradient of the function f_p(\cdot) can be done without an eigenvalue decomposition of the matrix A(y). Indeed, let p = 2^k satisfy condition (7.2.40). Consider the following sequence of matrices:

X_0 = A(y), \quad Y_0 = I_n, \qquad X_i = X_{i-1}^2, \quad Y_i = Y_{i-1}X_{i-1}, \quad i = 1,\dots,k. \qquad (7.2.47)

By induction, it is easy to see that X_k = A^p(y) and Y_k = A^{p-1}(y). Hence, in accordance with (6.3.3), (6.3.6), and the definition of the function f_p(\cdot) in (7.2.38), we have

f_p(y) = \frac12\langle X_k, I_n\rangle_M^{2/p},

\nabla f_p(y)^{(i)} = \frac{2f_p(y)}{\langle X_k, I_n\rangle_M}\cdot\langle Y_k, A_i\rangle_M, \quad i = 1,\dots,m.

Note that the complexity of computing the matrix A(y) is of the order of O(n^2m) arithmetic operations. The auxiliary computation (7.2.47) takes

O(n^3\ln p) = O\Bigl(n^3\ln\frac{\ln r}{\delta}\Bigr)

operations. After that, the vector \nabla f_p(y) can be computed in O(n^2m) arithmetic operations. Clearly, the complexity of the first and the last computations is much lower if the matrices A_i are sparse.

Note also that the computation (7.2.47) can be performed more efficiently if the matrix A(y) is represented in the form

A(y) = U T U^T, \qquad UU^T = I_n,

where T is a tri-diagonal matrix. Computation of this representation needs O(n^3) arithmetic operations.
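The repeated squaring (7.2.47) is a few lines of code; the sketch below (NumPy; function name is ours) computes f_p(y) and \nabla f_p(y) by the two formulas above, and its correctness can be cross-checked against an eigenvalue decomposition:

```python
import numpy as np

def f_p_and_grad(As, y, k):
    """f_p(y) = (1/2) <X_k, I_n>^{2/p} with p = 2^k, via repeated squaring
    (7.2.47): X_i = X_{i-1}^2, Y_i = Y_{i-1} X_{i-1}, so that X_k = A(y)^p
    and Y_k = A(y)^{p-1}; no eigenvalue decomposition is needed."""
    p = 2 ** k
    X = sum(yi * Ai for yi, Ai in zip(y, As))   # A(y)
    Y = np.eye(X.shape[0])
    for _ in range(k):
        Y = Y @ X                                # uses the old X_{i-1}
        X = X @ X
    t = np.trace(X)                              # <X_k, I_n> = tr A(y)^p
    f = 0.5 * t ** (2.0 / p)
    g = np.array([2.0 * f / t * np.trace(Y @ Ai) for Ai in As])
    return f, g
```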

7.3 Barrier Subgradient Method

(Smoothing by a self-concordant barrier; The barrier subgradient scheme; Relative accuracy and maximization of positive concave functions; Applications: the fractional covering problem, the maximal concurrent flow problem, the minimax problem with nonnegative components, semidefinite relaxation of the Boolean quadratic problem; Online Optimization as an alternative to Stochastic Programming.)

7.3.1 Smoothing by a Self-Concordant Barrier

In Nonlinear Optimization, the performance of numerical methods strongly depends
on our ability to execute certain auxiliary operations related to the convex sets involved
in the problem's formulation. Usually, optimization methods assume the feasi-
bility of one of the following actions:
L: Maximization of a linear function ⟨c, x⟩ over a convex set Q.
540 7 Optimization in Relative Scale

S: Maximization of the function ⟨c, x⟩ − d(x) in x ∈ Q, where d is a strongly
convex prox-function of the set Q.
B: Computation of the value and first two derivatives of some self-concordant
barrier at the interior points of the convex set Q.
Note that in Structural Optimization we can always consider the optimization
problem posed in a primal-dual setting. The most important example of such a
representation is a bilinear saddle point formulation:

min_{x∈Q_p} max_{w∈Q_d} { ⟨Ax, w⟩ + ⟨c, x⟩ + ⟨b, w⟩ },   (7.3.1)

where Qp and Qd are closed convex sets in corresponding spaces and A is a


linear operator. Since the structure of the primal and dual sets may be of different
complexity, we have six possible combinations of the above mentioned auxiliary
operations. Let us present the known results on their complexity.
• L_p ∧ L_d. The complexity of this combination is still not clear.
• S_p ∧ S_d. This case is treated by the smoothing technique (see Chap. 6). An ε-
solution of the problem (7.3.1) can be obtained in

O( (1/ε) · ‖A‖ · [D_1·D_2]^{1/2} )

gradient steps, where D_1 and D_2 are the sizes of the primal and dual sets, and the
norm ‖A‖ is defined by the norms of the primal and dual spaces.
• B_p ∧ B_d. In this situation, Interior-Point Methods provide an ε-solution of the
problem (7.3.1) in

O( √ν · ln(ν/ε) )

Newton steps, where ν is the parameter of a self-concordant barrier for the primal-
dual feasible set Q_p × Q_d (see Chap. 5).
• S_p ∧ L_d. This case is similar to the standard Black-Box Nonsmooth Minimiza-
tion. Primal-dual subgradient methods provide an ε-solution to (7.3.1) in

O( (1/ε²) · ‖A‖² · D_1 · D_2 )

gradient steps (see Sect. 3.2).
• B_p ∧ S_d. The complexity of this combination is not known yet.
• B_p ∧ L_d. The last variant is studied in this section. From the viewpoint of Black-
Box Optimization, it corresponds to the problem of minimizing a nonsmooth
convex function over a feasible set endowed with a self-concordant barrier.

Let us recall our notation. For a linear operator A : E → H*, we denote by
A* : H → E* the adjoint operator:

⟨Ax, y⟩_H = ⟨A*y, x⟩_E,   x ∈ E, y ∈ H.

If there is no ambiguity, the subscripts of scalar products are omitted. For a concave
function f, we denote by ∇f(x) one of its subgradients at x:

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩,   y, x ∈ dom f.

For a function of two vector variables Ψ(u, x), the notation ∇_2Ψ(u, x) is used to
denote its subgradient with respect to the second argument.
Let Q ⊂ E be a closed convex set containing no straight lines. We assume
that Q is endowed with a ν-self-concordant barrier F (see Sect. 5.3). In view of
Theorem 5.1.6, its Hessian is non-degenerate at all points of the domain.
Consider another closed convex set P̂ ⊆ E. We are mainly interested in the set

P = P̂ ∩ Q,

which we assume to be bounded. Denote by x_0 its constrained analytic center:

x_0 = arg min_{x∈P_0} F(x) ∈ P_0 def= P̂ ∩ int Q ⊆ P.   (7.3.2)

Thus, F (x) ≥ F (x0 ) for all x ∈ P . Since Q contains no straight lines, x0 is well
defined (see Theorem 5.1.6).
For the set P , we introduce the following smooth approximation of its support
function:

U_β(s) = max_{u∈P̂} { ⟨s, u − x_0⟩ − β·[F(u) − F(x_0)] },   s ∈ E*,   (7.3.3)

where β > 0 is a smoothing parameter. Denote by uβ (s) the unique solution of the
maximization problem (7.3.3). Then, in view of relation (5.3.17) and Theorem 6.1.1,
we have

∇Uβ (s) = uβ (s) − x0 , s ∈ E∗ . (7.3.4)

For any x ∈ int Q, consider the following local norms:

‖h‖_x = ⟨∇²F(x)·h, h⟩^{1/2},   h ∈ E,

‖s‖*_x = ⟨s, [∇²F(x)]^{-1}·s⟩^{1/2},   s ∈ E*.

Then, we can guarantee the following level of smoothness of the function Uβ (·).

Lemma 7.3.1 Let β > 0, s ∈ E* and x = u_β(s). Then for any g ∈ E* with
‖g‖*_x < β we have

U_β(s + g) ≤ U_β(s) + ⟨g, ∇U_β(s)⟩ + β·ω*( (1/β)·‖g‖*_x ),   (7.3.5)

where ω*(τ) = −τ − ln(1−τ) ≤ τ²/(2(1−τ)) for τ ∈ [0,1) (see (5.1.24)).
Proof In view of definition (7.3.3) and Theorem 2.2.9, for any y ∈ P0 we have

⟨s − β·∇F(x), y − x⟩ ≤ 0.   (7.3.6)

Moreover, since F is a standard self-concordant function, at any point y ∈ int Q

F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + ω(‖y − x‖_x),   (7.3.7)

where ω(t) = t − ln(1 + t) (see inequality (5.1.14)). Hence,

U_β(s + g) − U_β(s) − ⟨g, ∇U_β(s)⟩

(7.3.4)
= max_{y∈P_0} { ⟨s + g, y − x_0⟩ − β·[F(y) − F(x_0)] } − ⟨s + g, x − x_0⟩
  + β·[F(x) − F(x_0)]

= max_{y∈P_0} { ⟨s + g, y − x⟩ − β·[F(y) − F(x)] }

(7.3.6)
≤ max_{y∈P_0} { ⟨g, y − x⟩ + β·[ ⟨∇F(x), y − x⟩ − F(y) + F(x) ] }

(7.3.7)
≤ max_{y∈P_0} { ⟨g, y − x⟩ − β·ω(‖y − x‖_x) } ≤ sup_{τ≥0} { τ·‖g‖*_x − β·ω(τ) }.

If ‖g‖*_x < β, then the supremum in the right-hand side is equal to β·ω*( (1/β)·‖g‖*_x ) (see
Lemma 5.1.4). 
Consider now an affine function ℓ(x), x ∈ P. For β ≥ 0 define

ℓ*(β) def= max_{x∈P_0} { ℓ(x) − β·[F(x) − F(x_0)] } ≥ ℓ(x_0) = ℓ_0.   (7.3.8)

Then ℓ*(0) = max_{x∈P} ℓ(x) def= ℓ*.

Lemma 7.3.2 For any β > 0 we have

ℓ*(β) ≤ ℓ* ≤ ℓ*(β) + βν·( 1 + [ ln( (ℓ* − ℓ_0)/(βν) ) ]_+ ),   (7.3.9)

where [a]_+ = max{a, 0}. Moreover,

ℓ* − ℓ_0 ≤ ( (ℓ*(β) − ℓ_0)^{1/2} + (βν)^{1/2} )².   (7.3.10)

Proof The first part of inequality (7.3.9) follows from definitions (7.3.2) and (7.3.8).
Let us prove the second part. Consider an arbitrary y* ∈ Arg max_{x∈P} ℓ(x). Define

y(α) = x_0 + α·(y* − x_0),   α ∈ [0, 1].

In view of inequality (5.3.14), we have

F(y(α)) ≤ F(x_0) − ν·ln(1 − α),   α ∈ [0, 1).

Since ℓ(·) is linear, this relation implies that

ℓ*(β) ≥ max_{α∈[0,1)} { ℓ(y(α)) − β·[F(y(α)) − F(x_0)] }
                                                                (7.3.11)
      ≥ (1 − α)·ℓ_0 + α·ℓ* + βν·ln(1 − α),   α ∈ [0, 1).

The maximum in α of the latter expression is attained at α* = 1 − βν/(ℓ* − ℓ_0). Thus,
if (ℓ* − ℓ_0)/(βν) ≤ 1 (that is, α* ≤ 0), then ℓ* ≤ ℓ_0 + βν, and (7.3.9) follows from (7.3.8).
If α* > 0, then we get (7.3.9) by direct substitution.
On the other hand, from (7.3.11) we have

ℓ* − ℓ_0 ≤ (1/α)·( ℓ*(β) − ℓ_0 + βν·ln(1 + α/(1−α)) ) ≤ (1/α)·[ℓ*(β) − ℓ_0] + βν/(1−α).

Minimizing the latter expression in α, we get (7.3.10).



Corollary 7.3.1 For any β > 0 we have

ℓ* ≤ ℓ*(β) + βν·[ 1 + 2·ln( 1 + ( (ℓ*(β) − ℓ_0)/(βν) )^{1/2} ) ].   (7.3.12)

7.3.2 The Barrier Subgradient Scheme

In this section, we consider convex optimization problems in the following form:

Find f* def= max_x { f(x) : x ∈ P },   (7.3.13)

where f is a concave function and P satisfies the structural assumptions specified


at the beginning of Sect. 7.3.1. In the sequel, we assume f to be subdifferentiable
on P0 and the set P to be simple. The latter means that the auxiliary optimization
problem (7.3.3) can be easily solved.
Consider now the generic scheme of the Barrier Subgradient Method (BSM).

Initialization: Set s0 = 0 ∈ E∗ .

Iteration (k ≥ 0):
(7.3.14)
1. Choose βk > 0 and compute xk = uβk (sk ).

2. Choose λk > 0 and set sk+1 = sk + λk ∇f (xk ).

Recall that uβ (s) denotes the unique solution of the optimization problem (7.3.3).
Thus, BSM is an affine-invariant scheme.
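A minimal sketch of the generic scheme (7.3.14) in R^n may help fix ideas. The function name and the callback interface are our own: grad_f must return a subgradient of the concave objective, and u_beta must return the maximizer u_β(s) of the smoothing problem (7.3.3) for the chosen barrier.

```python
import numpy as np

def barrier_subgradient_method(grad_f, u_beta, betas, lambdas, n, n_iters):
    """Generic Barrier Subgradient Method (7.3.14) in R^n (a sketch).

    grad_f(x)    -- a subgradient of the concave objective f at x;
    u_beta(b, s) -- the maximizer u_b(s) of the smoothing problem (7.3.3);
    betas[k], lambdas[k] -- parameter sequences, e.g. chosen by (7.3.19).
    """
    s = np.zeros(n)                       # s_0 = 0 in E*
    trajectory = []
    for k in range(n_iters):
        x = u_beta(betas[k], s)           # x_k = u_{beta_k}(s_k)
        trajectory.append(x)
        s = s + lambdas[k] * grad_f(x)    # s_{k+1} = s_k + lambda_k * grad f(x_k)
    return trajectory
```

Note that the iterates enter only through gradients accumulated in the dual variable s, which is what makes the scheme affine-invariant.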
In order to analyze the performance of method (7.3.14), consider the following
gap functions:

ℓ_k(y) = Σ_{i=0}^k λ_i·⟨∇f(x_i), y − x_i⟩,

ℓ_k* def= max_{y∈P} ℓ_k(y),   k ≥ 0.

Theorem 7.3.1 Assume that the parameters of scheme (7.3.14) satisfy the condition

λ_k·‖∇f(x_k)‖*_{x_k} ≤ β_k ≤ β_{k+1},   k ≥ 0.   (7.3.15)

Let S_k = Σ_{i=0}^k λ_i, and A_k = Σ_{i=0}^k β_i·ω*( (λ_i/β_i)·‖∇f(x_i)‖*_{x_i} ). Then, for any k ≥ 0 we have

ℓ_k* ≤ A_k + β_{k+1}·ν·[ 1 + 2·ln( 1 + ( A_k/(β_{k+1}ν) + (3S_k/β_{k+1})·‖∇f(x_0)‖*_{x_0} )^{1/2} ) ].   (7.3.16)

Proof Note that for any k ≥ 0 we have

          (7.3.15)
U_{β_{k+1}}(s_{k+1}) ≤ U_{β_k}(s_{k+1})

(7.3.5)
≤ U_{β_k}(s_k) + λ_k·⟨∇f(x_k), u_{β_k}(s_k) − x_0⟩ + β_k·ω*( (λ_k/β_k)·‖∇f(x_k)‖*_{x_k} ).

Since U_{β_0}(0) = 0, we conclude that

⟨s_{k+1}, x_{k+1} − x_0⟩ − β_{k+1}·[F(x_{k+1}) − F(x_0)] = U_{β_{k+1}}(s_{k+1})
                                                                (7.3.17)
≤ Σ_{i=0}^k λ_i·⟨∇f(x_i), x_i − x_0⟩ + Σ_{i=0}^k β_i·ω*( (λ_i/β_i)·‖∇f(x_i)‖*_{x_i} ).

In view of the first-order optimality condition for (7.3.3), for all y ∈ P_0 we have

⟨s_{k+1}, y − x_{k+1}⟩ ≤ β_{k+1}·⟨∇F(x_{k+1}), y − x_{k+1}⟩.   (7.3.18)

Note that s_{k+1} = Σ_{i=0}^k λ_i·∇f(x_i). Therefore, for any y ∈ P_0 we obtain

                              (7.3.17)
Σ_{i=0}^k λ_i·⟨∇f(x_i), y − x_i⟩ ≤ ⟨s_{k+1}, y − x_{k+1}⟩ + β_{k+1}·[F(x_{k+1}) − F(x_0)] + A_k

(7.3.18)
≤ β_{k+1}·[F(x_{k+1}) + ⟨∇F(x_{k+1}), y − x_{k+1}⟩ − F(x_0)] + A_k

≤ β_{k+1}·[F(y) − F(x_0)] + A_k.

Hence, ℓ_k*(β_{k+1}) ≤ A_k. On the other hand, since f is concave, we obtain

ℓ_k(x_0) = Σ_{i=0}^k λ_i·⟨∇f(x_i), x_0 − x_i⟩ ≥ Σ_{i=0}^k λ_i·⟨∇f(x_0), x_0 − x_i⟩

≥ −‖∇f(x_0)‖*_{x_0} · Σ_{i=0}^k λ_i·‖x_0 − x_i‖_{x_0}.

In view of definition (7.3.2), we have ⟨∇F(x_0), x_i − x_0⟩ ≥ 0. Hence, by Theorem
5.3.9, ‖x_i − x_0‖_{x_0} ≤ ν + 2√ν ≤ 3ν (recall that ν ≥ 1 by Lemma 5.4.1).
Thus, we conclude that ℓ_k(x_0) ≥ −3ν·S_k·‖∇f(x_0)‖*_{x_0}. Using our observations and
inequality (7.3.12), we obtain (7.3.16).


Let us now estimate the rate of convergence of method (7.3.14) as applied to a
specific problem class.
Definition 7.3.1 We say that f ∈ B_M(P) if ‖∇f(x)‖*_x ≤ M for any x ∈ P_0.
For a function f ∈ B_M(P), we suggest the following values of parameters
in (7.3.14):

λ_k = 1, k ≥ 0;   β_0 = β_1,   β_k = M·( 1 + √(k/ν) ),   k ≥ 1.   (7.3.19)

Theorem 7.3.2 Let problem (7.3.13) with f ∈ B_M(P) be solved by
method (7.3.14) with parameters given by (7.3.19). Then for any k ≥ 0 we have

(1/S_k)·ℓ_k* ≤ 2M·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ].   (7.3.20)

Proof Define τ_k = (1/M)·β_k > 1. In view of the choice of parameters (7.3.19) and
the assumptions of the theorem, we have S_k = k + 1, and

A_k = Σ_{i=0}^k β_i·ω*( (λ_i/β_i)·‖∇f(x_i)‖*_{x_i} ) ≤ M·Σ_{i=0}^k τ_i·ω*(1/τ_i) ≤ (1/2)·M·Σ_{i=0}^k τ_i · τ_i^{-2}/(1 − τ_i^{-1})

= (1/2)·M·Σ_{i=0}^k 1/(τ_i − 1) = (√ν/2)·M·( 1 + Σ_{i=1}^k 1/√i ) ≤ √ν·M·( 1/2 + √k ).
                                                                (7.3.21)
(The last inequality can be easily justified by induction.) Furthermore,

(S_k/β_{k+1})·‖∇f(x_0)‖*_{x_0} ≤ (k+1)/( 1 + √((k+1)/ν) ) ≤ √(ν(k+1)),

A_k/(β_{k+1}·ν) ≤ ( 1/2 + √k )/( √ν + √(k+1) ) ≤ 1.

Thus, substituting the above estimates in inequality (7.3.16), we obtain

(1/S_k)·ℓ_k* ≤ M·[ (√ν/(k+1))·(1/2 + √k)
  + ((ν + √(ν(k+1)))/(k+1))·( 1 + 2·ln( 1 + √(1 + 3√(ν(k+1))) ) ) ]

≤ 2M·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ].

In the last inequality we use the bound (√ν/(k+1))·(1/2 + √k) ≤ √(ν/(k+1)) + ν/(k+1).


With parameters chosen by (7.3.19), the scheme of method (7.3.14) can be
written in the following form:

x_{k+1} = arg max_{x∈P_0} { (1/(k+1))·Σ_{i=0}^k ⟨∇f(x_i), x − x_i⟩ − M·( (√ν + √(k+1))/(√ν·(k+1)) )·[F(x) − F(x_0)] }.
                                                                (7.3.22)

Since f is a concave function,

(1/S_k)·ℓ_k* = (1/S_k)·max_{y∈P} Σ_{i=0}^k λ_i·⟨∇f(x_i), y − x_i⟩

≥ (1/S_k)·max_{y∈P} Σ_{i=0}^k λ_i·[f(y) − f(x_i)] = f* − (1/S_k)·Σ_{i=0}^k λ_i·f(x_i).

Thus, the estimate (7.3.20) justifies the following rate of convergence for the primal
variables:

f* − Σ_{i=0}^k (λ_i/S_k)·f(x_i) ≤ 2M·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ].
                                                                (7.3.23)

Note that the value ℓ_k* is computable. Hence, it can be used for terminating the
process.
Let us show now that method (7.3.22) can also generate approximate solutions to
the dual problem. For that, we need to employ the internal structure of our problem.
Let us assume that it can be represented in a saddle-point form:

f(x) = min_{w∈S} Ψ(x, w) → max_{x∈P},   (7.3.24)

where S ⊂ E1 is a closed convex set, and the function Ψ (x, w) is convex in w ∈ S


and concave and subdifferentiable in x ∈ P . Then, the dual problem is defined as

Find f = min η(w),


w∈S
(7.3.25)
η(w) = max Ψ (y, w).
y∈P

Since P is bounded, the above problem is well defined. Without loss of generality,
it is always possible to choose

∇f(x) = ∇_1Ψ(x, w(x))   (7.3.26)

with some w(x) ∈ Arg min_{w∈S} Ψ(x, w) ⊆ S. Let us assume that w(x) is computable
for any x ∈ P.


Lemma 7.3.3 Define w̄_k = (1/S_k)·Σ_{i=0}^k λ_i·w(x_i), and x̄_k = (1/S_k)·Σ_{i=0}^k λ_i·x_i. Then

η(w̄_k) − f(x̄_k) ≤ (1/S_k)·ℓ_k*.   (7.3.27)

Proof Since Ψ is concave in the first argument, for any y ∈ P we have

⟨∇f(x_i), y − x_i⟩ = ⟨∇_1Ψ(x_i, w(x_i)), y − x_i⟩

≥ Ψ(y, w(x_i)) − Ψ(x_i, w(x_i)) = Ψ(y, w(x_i)) − f(x_i).

Hence,

(1/S_k)·ℓ_k* = (1/S_k)·max_{y∈P} Σ_{i=0}^k λ_i·⟨∇f(x_i), y − x_i⟩ ≥ (1/S_k)·max_{y∈P} Σ_{i=0}^k λ_i·[Ψ(y, w(x_i)) − f(x_i)]

≥ max_{y∈P} Ψ(y, w̄_k) − (1/S_k)·Σ_{i=0}^k λ_i·f(x_i) = η(w̄_k) − (1/S_k)·Σ_{i=0}^k λ_i·f(x_i)

≥ η(w̄_k) − f(x̄_k). 

Thus, the scheme (7.3.22) can generate approximate primal-dual solutions:

η(w̄_k) − f(x̄_k) ≤ 2M·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ].
                                                                (7.3.28)

7.3.3 Maximizing Positive Concave Functions

Consider now a convex optimization problem

Find ψ* def= max_x { ψ(x) : x ∈ P },   (7.3.29)

where the set P = P̂ ∩ Q satisfies the assumptions introduced for problem (7.3.13).
However, now we assume that the function ψ is concave and positive on int Q:

ψ(x) > 0,   ∀x ∈ int Q.   (7.3.30)

Lemma 7.3.4 Let ψ be concave and positive on int Q. Then for any x ∈ int Q we
have

‖∇ψ(x)‖*_x ≤ ψ(x).   (7.3.31)

Proof Let us choose an arbitrary x ∈ int Q and r ∈ [0, 1). Define

y = x − ( r/‖∇ψ(x)‖*_x )·[∇²F(x)]^{-1}·∇ψ(x).

In view of Item 1 of Theorem 5.1.5, y ∈ int Q. Therefore,

0 ≤ ψ(y) ≤ ψ(x) + ⟨∇ψ(x), y − x⟩ = ψ(x) − r·‖∇ψ(x)‖*_x.

Since r is an arbitrary value from [0, 1), we get (7.3.31).



This result has an important corollary. Let us apply to the objective function of
problem (7.3.29) a logarithmic transformation:

f(x) def= ln ψ(x).   (7.3.32)

Lemma 7.3.5 Let ψ be concave and positive in the sense of (7.3.30). Then f ∈
B_1(Q), and it is concave on Q.
Proof Indeed, it is well known that the logarithm of a concave function is a
concave function too. It remains to note that ∇f(x) = (1/ψ(x))·∇ψ(x) and to apply
inequality (7.3.31).

Thus, in order to solve problem (7.3.29), we can apply method (7.3.14) to
problem (7.3.13) with the objective function defined by (7.3.32). The resulting
optimization scheme is as follows:

x_{k+1} = arg max_{x∈P_0} { (1/(k+1))·Σ_{i=0}^k ⟨∇ψ(x_i)/ψ(x_i), x − x_i⟩ − ( (√ν + √(k+1))/(√ν·(k+1)) )·[F(x) − F(x_0)] }.
                                                                (7.3.33)

For scheme (7.3.33), we can guarantee a certain rate of convergence in relative scale.
Theorem 7.3.3 Let the sequence {x_k}_{k=0}^∞ be generated by method (7.3.33) for
problem (7.3.29). Then for any k ≥ 0 we have

( Π_{i=0}^k ψ(x_i) )^{1/(k+1)}

≥ ψ* · exp( −2·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ] )
                                                                (7.3.34)
≥ ψ* · ( 1 − 2·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ] ).

Proof Indeed, we just apply method (7.3.22) to the function f defined by (7.3.32).
Since f ∈ B_1(Q) ⊆ B_1(P), by (7.3.20) we conclude that

f* − (1/(k+1))·Σ_{i=0}^k f(x_i) ≤ δ_k def= 2·[ √(ν/(k+1)) + ν/(k+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(k+1)) ) ].

Hence, ( Π_{i=0}^k ψ(x_i) )^{1/(k+1)} ≥ ψ*·e^{−δ_k} ≥ ψ*·(1 − δ_k). This is exactly (7.3.34).
Let us show how we can treat a problem dual to (7.3.29). For simplicity, assume
that

ψ(x) = min_{u∈Ω} Ψ_0(u, x),   (7.3.35)

where Ω ⊂ E_1 is a closed convex set. In this case, condition (7.3.30) can be written
as

Ψ_0(u, x) ≥ 0,   u ∈ Ω, x ∈ P.   (7.3.36)

Note that

max_{x∈P} ln ψ(x) = max_{x∈P} min_{τ>0} min_{u∈Ω} [ τ·Ψ_0(u, x) − ln τ − 1 ]

= max_{x∈P} min_{v∈τΩ, τ>0} [ τ·Ψ_0( (1/τ)·v, x ) − ln τ − 1 ]

(1.3.6)
≤ min_{v∈τΩ, τ>0} { η(w) ≡ η(v, τ) def= −1 − ln τ + τ·ψ_*( (1/τ)·v ) },

where ψ_*(u) = max_{x∈P} Ψ_0(u, x).
Denote by u(x) a solution of the minimization problem (7.3.35). Then w(x) is
clearly defined as follows:

w(x) = (v(x), τ(x)),   v(x) = τ(x)·u(x),   τ(x) = 1/ψ(x).

In accordance with Lemma 7.3.3, we can form w̄_k = (v̄_k, τ̄_k) with

v̄_k = (1/(k+1))·Σ_{i=0}^k u(x_i)/ψ(x_i),   τ̄_k = (1/(k+1))·Σ_{i=0}^k 1/ψ(x_i).

 

Let x̄_k = (1/(k+1))·Σ_{i=0}^k x_i, and ū_k = v̄_k/τ̄_k = ( Σ_{i=0}^k u(x_i)/ψ(x_i) ) / ( Σ_{i=0}^k 1/ψ(x_i) ) ∈ Ω. Then, by (7.3.27)
we get

(1/S_k)·ℓ_k* ≥ η(w̄_k) − ln ψ(x̄_k) = −1 − ln τ̄_k + τ̄_k·ψ_*( (1/τ̄_k)·v̄_k ) − ln ψ(x̄_k)

= −1 − ln τ̄_k + τ̄_k·ψ_*(ū_k) − ln ψ(x̄_k) ≥ ln( ψ_*(ū_k)/ψ(x̄_k) ).

Hence,

ψ(x̄_k) ≥ ψ_*(ū_k) · exp( −(1/S_k)·ℓ_k* ).   (7.3.37)

Note that ψ_*(ū_k) ≥ ψ*.

7.3.4 Applications

In this section, we are going to consider examples of applications of
method (7.3.33). It will be more convenient to use a slight modification of the
usual notion of relative accuracy. We say that some value φ̄ is a δ-approximation of
the optimal value φ* > 0 in relative scale if

φ* ≥ φ̄ ≥ φ*·e^{−δ},   δ > 0.

In the complexity estimates, the short notation Õ(·) is used to indicate that some
logarithmic factors are omitted. Since the rate of convergence (7.3.34) does not
depend on the problem's data, our method is a so-called fully polynomial-time
approximation scheme.

7.3.4.1 The Fractional Covering Problem

Consider the following fractional covering problem:

Find φ* def= min_y { ⟨b, y⟩ : A^T y ≥ c, y ≥ 0 ∈ R^m },   (7.3.38)

where A = (a_1, ..., a_n) is an (m×n)-matrix with non-negative coefficients, and the
vectors b ∈ R^m and c ∈ R^n have positive coefficients. Define

ψ(y) = min_{1≤i≤n} (1/c^(i))·⟨a_i, y⟩.

Note that ψ is concave and positively homogeneous of degree one. Therefore,

φ* = min_y { ⟨b, y⟩/ψ(y) : y ≥ 0 ∈ R^m }

   = [ max_y { ψ(y)/⟨b, y⟩ : y ≥ 0 ∈ R^m } ]^{-1}

   = [ max_y { ψ(y) : ⟨b, y⟩ = 1, y ≥ 0 ∈ R^m } ]^{-1}.

Thus, problem (7.3.38) can be written in the form (7.3.29) with Q = R^m_+,

F(y) = −Σ_{j=1}^m ln y^(j),   ν = m,

and P̂ = {y : ⟨b, y⟩ = 1}. Hence, in accordance with the estimate (7.3.34), a δ-
approximation of φ* = (ψ*)^{-1} in relative scale can be found in Õ(m/δ²) iterations of
method (7.3.33). Each iteration of the scheme needs O(mn) operations to compute
ψ(y) and its subgradient, and essentially O(m ln m) operations to solve the auxiliary
maximization problem in (7.3.33) (see Sect. A.2). Of course, this computational
strategy is reasonable if m << n. Otherwise, it is better to solve the dual form
of problem (7.3.38) by the smoothing technique (see Chap. 6).
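The O(mn) oracle for ψ(y) is just a minimum of n linear forms; a minimal sketch (the function name and the interface are our own, and the supergradient is the active column by Danskin's rule):

```python
import numpy as np

def psi_and_subgrad(A, c, y):
    """Objective of the reformulation of (7.3.38): psi(y) = min_i <a_i, y>/c^(i),
    where the a_i are the columns of A. Returns psi(y) and a supergradient
    a_{i*}/c^(i*) for an active index i*."""
    ratios = (A.T @ y) / c             # <a_i, y> / c^(i), i = 1..n
    i_star = int(np.argmin(ratios))
    return ratios[i_star], A[:, i_star] / c[i_star]
```

This oracle then feeds directly into the barrier subgradient updates of scheme (7.3.33).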

7.3.4.2 The Maximal Concurrent Flow Problem

Consider a network consisting of a set of nodes N, |N| = n, and a set of directed arcs

A = {α = (i, j), i, j ∈ N},   |A| = m.

We assume that all arcs have bounded capacities. Formally, this means that the arc
flow vector f ∈ R^m_+ must satisfy the capacity constraint:

f ≤ f̄.

Let us introduce the set of origin-destination pairs

OD = {(i, j ), i, j ∈ N }.

Each pair (i, j) ∈ OD generates for the nodes i and j a directed flow f_{i,j} ∈ R^m_+ of
level d_{i,j}. Formally, this means that the vectors f_{i,j} must satisfy the system of linear
equations

B f_{i,j} = d_{i,j}·(e_i − e_j),   (i, j) ∈ OD,

where B is the balance matrix of the network and the e_(·) are the corresponding coordinate
vectors in R^n.
The maximal concurrent flow problem can be posed as follows:

Find λ* def= max_{λ, f_{i,j}} { λ : B f_{i,j} = λ·d_{i,j}·(e_i − e_j),
                                                                (7.3.39)
f_{i,j} ≥ 0, (i, j) ∈ OD,   Σ_{(i,j)∈OD} f_{i,j} ≤ f̄ }.

Dualizing the flow capacity constraints by a vector of Lagrange multipliers t ∈ R^m_+,
we get the following dual problem:

ψ* def= (λ*)^{-1} = max_t { ψ(t) : ⟨f̄, t⟩ = 1, t ≥ 0 ∈ R^m },
                                                                (7.3.40)
ψ(t) = Σ_{(i,j)∈OD} d_{i,j}·SP_{i,j}(t),

where the function SPi,j (t) is the shortest path distance between nodes i and j with
respect to a non-negative arc travel time vector t ∈ Rm .
Clearly, the function ψ in (7.3.40) satisfies all assumptions introduced for prob-
lem (7.3.29). Therefore (7.3.40) can be treated by method (7.3.33). In accordance
with the estimate (7.3.34), a δ-approximation of ψ* in relative scale can be found
in Õ(m/δ²) iterations. Each iteration of the scheme needs a computation of the
shortest-path distances for all origin-destination pairs. The complexity of solving
the auxiliary maximization problem in (7.3.33) is essentially O(m ln m) operations
(see Sect. A.2). Note that we are also able to reconstruct the dual solutions (origin-
destination flows) using the technique described at the end of Sect. 7.3.3.

7.3.4.3 The Minimax Problem with Nonnegative Components

Consider the following minimax problem:

Find ψ* def= min_{x∈S} max_{1≤i≤m} f_i(x),   (7.3.41)

where S is a closed convex set and all functions f_i(·) are convex and non-negative
on S. We assume that the function

ψ(y) = min_{x∈S} Σ_{i=1}^m y^(i)·f_i(x)

is well defined for any y ≥ 0 ∈ Rm . Moreover, let us assume that the values of this
function and its subgradients are easily computable.

Then we can rewrite problem (7.3.41) in the dual form

ψ* = max_y { ψ(y) : ⟨ē_m, y⟩ = 1, y ≥ 0 ∈ R^m },   (7.3.42)

where ē_m ∈ R^m is the vector of all ones.
Note that (7.3.42) satisfies all assumptions of problem (7.3.29). Therefore, in
accordance with the estimate (7.3.34), a δ-approximation of ψ* in relative scale can
be found by method (7.3.33) in Õ(m/δ²) iterations. Each iteration of the scheme
results in a minimization of a weighted sum of the functions f_i and the barrier
function F.
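To see what the oracle for ψ(y) looks like, here is a hand-made instance of (7.3.41) in which the inner minimization has a closed form; the quadratic choice of the f_i and the function name are our own illustration, not the book's example.

```python
import numpy as np

def psi_and_supergrad(y, c):
    """Inner problem of (7.3.42) for the toy instance f_i(x) = (x - c_i)^2
    on S = R: psi(y) = min_x sum_i y_i (x - c_i)^2 has the closed-form
    minimizer x(y) = <y, c> / sum_i y_i. By Danskin's rule, a supergradient
    of the concave function psi is (f_1(x(y)), ..., f_m(x(y)))."""
    c = np.asarray(c, dtype=float)
    x = float(np.dot(y, c) / np.sum(y))      # minimizer of the weighted sum
    f_vals = (x - c) ** 2                    # f_i(x(y))
    return float(np.dot(y, f_vals)), f_vals  # psi(y), a supergradient
```

In general the inner minimization must be solved numerically, but the supergradient structure (the vector of values f_i at the inner minimizer) is the same.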

7.3.4.4 Semidefinite Relaxation of the Boolean Quadratic Problem

Consider the following maximization problem:

Find f* def= max_x { ⟨Ax, x⟩ : x^(i) = ±1, i = 1, ..., n },   (7.3.43)

where A is a symmetric positive definite (n×n)-matrix. It is well known that this
problem is NP-hard. However, it appears that its optimal value can be approximated
in polynomial time with a certain dimension-independent relative accuracy. Namely,
define

ψ* = min_y { ⟨ē_n, y⟩ : D(y) ⪰ A },   (7.3.44)

where D(y) is a diagonal (n×n)-matrix with the vector y on the diagonal. Then it can
be proved that

(2/π)·ψ* ≤ f* ≤ ψ*.

Usually the problem (7.3.44) is treated by Interior-Point Methods. However, note
that quite often it is useless to compute an approximation to ψ* with a high relative
accuracy. Therefore it seems reasonable to solve it by a cheap gradient scheme.
Let us justify another representation for ψ*.
Lemma 7.3.6 Let A = L^T L. Then

ψ* = max_X { ψ(X) def= ( Σ_{i=1}^n ⟨Xq_i, q_i⟩^{1/2} )² : ⟨I_n, X⟩_F = 1, X ⪰ 0 },   (7.3.45)

where the q_i are the columns of the matrix L, I_n is the identity matrix, and the scalar
product in the space of symmetric matrices is defined in the natural way.

Proof Indeed, since A ≻ 0, we have

ψ* = min_u { Σ_{i=1}^n 1/u^(i) : A^{-1} ⪰ D(u) }

   = min_u max_{Y⪰0} { Σ_{i=1}^n 1/u^(i) + ⟨Y, D(u) − A^{-1}⟩_M }

   = max_{Y⪰0} min_u { Σ_{i=1}^n ( 1/u^(i) + Y^(i,i)·u^(i) ) − ⟨Y, A^{-1}⟩_M }.

Thus, ψ* = max_{Y⪰0} { 2·Σ_{i=1}^n (Y^(i,i))^{1/2} − ⟨Y, A^{-1}⟩_M }. Maximizing the objective func-
tion in this problem along a fixed direction Y ⪰ 0, we obtain

ψ* = max_{Y⪰0} { (1/⟨Y, A^{-1}⟩_M)·( Σ_{i=1}^n (Y^(i,i))^{1/2} )² }.

Choosing in this problem the new variables X = L^{-T} Y L^{-1}, we obtain representa-
tion (7.3.45).

Note that the function ψ in (7.3.45) is concave. Moreover, it is differentiable and
positive at any X ≻ 0. In our case, Q is the cone of positive semidefinite matrices
with

F(X) = −ln det X,   ν = n.

Hence, (7.3.45) satisfies the conditions of problem (7.3.29). Consequently, ψ* can
be approximated by (7.3.33) in Õ(n/δ²) iterations, where δ is the desired relative
accuracy. In our case, each iteration of the scheme (7.3.33) requires a representation
of an (n×n)-matrix in the form U T U^T, where U is an orthogonal matrix, and
the matrix T is tridiagonal. After that, we can apply the efficient search procedure
described at the end of Sect. A.2.
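The oracle for the representation (7.3.45) is cheap: both ψ(X) and its gradient need only the n quadratic forms ⟨Xq_i, q_i⟩. A minimal NumPy sketch (function name and dense-matrix interface are our own; it assumes X ≻ 0 so that all quadratic forms are positive):

```python
import numpy as np

def psi_and_grad(X, L):
    """psi(X) = ( sum_i <X q_i, q_i>^(1/2) )^2 from (7.3.45), where the q_i are
    the columns of L, and its gradient in the space of symmetric matrices:
    grad psi(X) = r * sum_i <X q_i, q_i>^(-1/2) q_i q_i^T,  r = sum_i <X q_i, q_i>^(1/2)."""
    vals = np.einsum('ij,jk,ik->i', L.T, X, L.T)   # <X q_i, q_i>, i = 1..n
    roots = np.sqrt(vals)
    r = roots.sum()
    grad = r * (L / roots) @ L.T                   # r * sum_i q_i q_i^T / roots_i
    return r ** 2, grad
```

For A = I_n (so L = I_n), the feasible point X = I_n/n already attains ψ(X) = n = ψ*, which gives a quick correctness check.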

7.3.5 Online Optimization as an Alternative to Stochastic


Programming
7.3.5.1 A Decision-Making Process in an Uncertain Environment

Consider a repeatable decision-making process with uncertain income. Assume we


have N + 1 periods of time, each of which corresponds to a full production cycle.

In the beginning of the kth period, we choose a production strategy

xk ∈ P , k = 0, . . . , N,

where the structure of P satisfies the assumptions of Sect. 7.3.1. The results of
different economic activities in this period are given by a production function

ψk (x) ≥ 0, x ∈ P.

The value ψk (x) is equal to the rate of growth of the capital invested at the beginning
of period k in accordance with production strategy x ∈ P . The function ψk (·)
becomes known only at the end of the period k. So, it can be used for choosing
the production strategies of the next periods.
Assume for a moment that we know in advance all production functions

ψk (x), k = 0, . . . , N.

However, for certain reasons, we are obliged to apply in all these periods the same
strategy x ∈ P. In this case, of course, it is reasonable to use

x_N* = arg max_{x∈P} Π_{k=0}^N ψ_k(x).

Then, the average efficiency of this static strategy is given by

ψ_N* = ( Π_{k=0}^N ψ_k(x_N*) )^{1/(N+1)}.

However, usually the future is unknown. Instead, often we have the freedom to
choose for each period k a specific production strategy xk ∈ P . Let us look at its
possible efficiency.
Suppose we know a ν-self-concordant barrier F(·) for the set Q. Then, we could
apply the following variant of method (7.3.33):

x_{k+1} = arg max_{x∈P} { (1/(k+1))·Σ_{i=0}^k ⟨∇ψ_i(x_i)/ψ_i(x_i), x − x_i⟩ − ( (√ν + √(k+1))/(√ν·(k+1)) )·[F(x) − F(x_0)] }.
                                                                (7.3.46)

In this case, after N + 1 periods, the average rate of growth is given by

Ψ_N def= ( Π_{k=0}^N ψ_k(x_k) )^{1/(N+1)}.

Theorem 7.3.4 For any N ≥ 0 we have Ψ_N ≥ ψ_N*·e^{−δ_N} with

δ_N = 2·[ √(ν/(N+1)) + ν/(N+1) ]·[ 1 + ln( 2 + (3/2)·√(ν(N+1)) ) ] → 0

as N → ∞.
Proof The proof is very similar to the proofs of Theorems 7.3.1 and 7.3.2. Define

f_k(x) = ln ψ_k(x),   f(x) = (1/(N+1))·Σ_{k=0}^N f_k(x),

s_k = Σ_{i=0}^{k-1} ∇f_i(x_i) = Σ_{i=0}^{k-1} ∇ψ_i(x_i)/ψ_i(x_i).

Note that method (7.3.46) can be seen as an application of scheme (7.3.14), (7.3.19)
to a changing objective function.
For any k ≥ 0, we have

U_{β_{k+1}}(s_{k+1}) ≤ U_{β_k}(s_{k+1})

(7.3.5)
≤ U_{β_k}(s_k) + ⟨∇f_k(x_k), u_{β_k}(s_k) − x_0⟩ + β_k·ω*( (1/β_k)·‖∇f_k(x_k)‖*_{x_k} )

(7.3.31)
≤ U_{β_k}(s_k) + ⟨∇f_k(x_k), u_{β_k}(s_k) − x_0⟩ + β_k·ω*(1/β_k).

Since U_{β_0}(0) = 0, we conclude that

⟨s_{N+1}, x_{N+1} − x_0⟩ − β_{N+1}·[F(x_{N+1}) − F(x_0)]

= U_{β_{N+1}}(s_{N+1}) ≤ Σ_{i=0}^N ⟨∇f_i(x_i), x_i − x_0⟩ + Σ_{i=0}^N β_i·ω*(1/β_i)
                                                                (7.3.47)
(7.3.21)
≤ Σ_{i=0}^N ⟨∇f_i(x_i), x_i − x_0⟩ + √ν·( 1/2 + √N ).

In view of the first-order optimality condition for (7.3.3), for all y ∈ P_0 we have

⟨s_{N+1}, y − x_{N+1}⟩ ≤ β_{N+1}·⟨∇F(x_{N+1}), y − x_{N+1}⟩.   (7.3.48)

Therefore, using the concavity of all functions f_i, for any y ∈ P we get

ℓ_N(y) def= Σ_{i=0}^N ⟨∇f_i(x_i), y − x_i⟩

(7.3.47)
≤ ⟨s_{N+1}, y − x_{N+1}⟩ + β_{N+1}·[F(x_{N+1}) − F(x_0)] + √ν·( 1/2 + √N )

(7.3.48)
≤ β_{N+1}·[F(x_{N+1}) + ⟨∇F(x_{N+1}), y − x_{N+1}⟩ − F(x_0)] + √ν·( 1/2 + √N )

≤ β_{N+1}·[F(y) − F(x_0)] + √ν·( 1/2 + √N ).
Hence, ℓ_N*(β_{N+1}) ≤ √ν·( 1/2 + √N ). On the other hand, applying the same
arguments as at the end of the proof of Theorem 7.3.1, we obtain

ℓ_N(x_0) = Σ_{i=0}^N ⟨∇f_i(x_i), x_0 − x_i⟩ ≥ Σ_{i=0}^N ⟨∇f_i(x_0), x_0 − x_i⟩

≥ −3ν·(N + 1).

Thus, ℓ_N*(β_{N+1}) − ℓ_N(x_0) ≤ √ν·( 1/2 + √N ) + 3ν·(N+1). Since β_{N+1} = 1 + √((N+1)/ν),
by (7.3.12) we have:

(1/(N+1))·ℓ_N* ≤ (√ν/(N+1))·( 1/2 + √N )

+ ((ν + √(ν(N+1)))/(N+1))·[ 1 + 2·ln( 1 + ( ( √ν·(1/2 + √N) + 3ν·(N+1) ) / ( ν + √(ν(N+1)) ) )^{1/2} ) ]

≤ (√ν/(N+1))·( 1/2 + √N ) + ((ν + √(ν(N+1)))/(N+1))·[ 1 + 2·ln( 1 + √(1 + 3√(ν(N+1))) ) ]

≤ δ_N

(see the arguments used at the end of the proof of Theorem 7.3.2). On the other
hand,

(1/(N+1))·ℓ_N* = (1/(N+1))·max_{y∈P} Σ_{i=0}^N ⟨∇f_i(x_i), y − x_i⟩ ≥ (1/(N+1))·max_{y∈P} Σ_{i=0}^N [f_i(y) − f_i(x_i)]

= ln ψ_N* − ln Ψ_N. 

Let us now look at several applications of this theorem.



7.3.5.2 Portfolio Management

Let x ∈ Δ_n be the structure of our portfolio. Denote by c_k^(i) ≥ 0, i = 1, ..., n,
the growth coefficient of the price of stock i during day k ≥ 0. Then the optimal
portfolio with constant sharing is defined as

x_N* = arg max_{x∈P} Π_{k=0}^N ⟨c_k, x⟩,   ψ_N* = ( Π_{k=0}^N ⟨c_k, x_N*⟩ )^{1/(N+1)}.

For the set Q = R^n_+, we can apply the standard n-self-concordant barrier

F(x) = −Σ_{i=1}^n ln x^(i).

Then, we can use the following variant of method (7.3.46):

x_{k+1} = arg max_{x∈P} { (1/(k+1))·Σ_{i=0}^k ⟨c_i, x − x_i⟩/⟨c_i, x_i⟩ − ( (√ν + √(k+1))/(√ν·(k+1)) )·[F(x) − F(x_0)] },   k ≥ 0.
                                                                (7.3.49)

In this case, after N + 1 periods, the average rate of growth of our portfolio is given
by

Ψ_N def= ( Π_{k=0}^N ⟨c_k, x_k⟩ )^{1/(N+1)}.

In view of Theorem 7.3.4, we have Ψ_N ≥ ψ_N*·e^{−δ_N}. Note that each step of
the algorithm (7.3.49) is implementable in O(n ln n) arithmetic operations (see
Sect. A.2).
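A minimal runnable sketch of the portfolio rule (7.3.49) follows. The function names are our own; the auxiliary maximization over the simplex with the log-barrier is solved here by bisection on the Lagrange multiplier, as a simple stand-in for the O(n ln n) procedure of Sect. A.2, and β_k follows (7.3.19) with M = 1 (valid since f_k(x) = ln⟨c_k, x⟩ belongs to B_1(Q) by Lemma 7.3.5).

```python
import numpy as np

def argmax_simplex_logbarrier(s, beta):
    """Maximize <s, x> + beta * sum(ln x) over the simplex {x >= 0, sum x = 1}.
    KKT: x_i = beta / (mu - s_i) with mu > max(s); the multiplier mu is found
    by bisection on the monotone equation sum_i x_i(mu) = 1."""
    s = np.asarray(s, dtype=float)
    lo = s.max() + beta              # here sum_i beta/(mu - s_i) >= 1
    hi = s.max() + beta * len(s)     # here sum_i beta/(mu - s_i) <= 1
    for _ in range(100):
        mu = 0.5 * (lo + hi)
        if (beta / (mu - s)).sum() > 1.0:
            lo = mu
        else:
            hi = mu
    return beta / (0.5 * (lo + hi) - s)

def portfolio_bsm(C, nu=None):
    """Online portfolio rule (7.3.49): row k of C holds the growth
    coefficients c_k. x_0 is the analytic center (uniform portfolio)."""
    N, n = C.shape
    nu = n if nu is None else nu
    s = np.zeros(n)
    xs = [np.full(n, 1.0 / n)]
    for k in range(1, N):
        s = s + C[k - 1] / C[k - 1].dot(xs[-1])   # accumulate grad ln<c_i, x_i>
        beta = 1.0 + np.sqrt(k / nu)              # beta_k from (7.3.19), M = 1
        xs.append(argmax_simplex_logbarrier(s, beta))
    return np.array(xs)
```

On data where one asset consistently outgrows the others, the weights drift toward that asset at the speed allowed by the growing barrier coefficient, in line with the Ψ_N ≥ ψ_N*·e^{−δ_N} guarantee.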

7.3.5.3 Processes with Full Production Cycles

Assume that in our economy there are n elastic production processes. At the
beginning of the kth period, we know the cost a_k^(i) > 0 of producing one unit of
product i, i = 1, ..., n. This cost is derived from the prices of raw materials, labor,
equipment, etc. However, the price b_k^(i) ≥ 0 of one unit of product i becomes known
only at the end of period k, when we sell it. It may depend on competition in the
market, uncertain preferences of the consumers, etc. Denoting by x^(i) the fraction
of the capital invested in process i, we come to the following model:

ψ_k(x) = Σ_{i=1}^n ( b_k^(i)/a_k^(i) )·x^(i),
                                                                (7.3.50)
x = (x^(1), ..., x^(n))^T ∈ Q def= R^n_+,   P̂ = Δ_n.

Then we can apply method (7.3.46) with

F(x) = −Σ_{i=1}^n ln x^(i),   ν = n.

In this situation, the complexity of solving the auxiliary maximization problem
in (7.3.46) is again O(n ln n) arithmetic operations (see Sect. A.2).

7.3.5.4 Discussion

Theorem 7.3.4, being applied in an uncertain environment, delivers an absolute


and risk-free guarantee for a certain level of efficiency of online optimization
strategy (7.3.46). To obtain such a result, we do not need to introduce the
standard machinery related to random events, risk measures, stochastic or robust
optimization. Note that in Theorem 7.3.4 we compare the efficiency of a dynamic
adjustment strategy with a static one. Hence, our arguments may not be too
convincing. However, let us look at the standard one-stage stochastic programming
problem

x* = arg max_{x∈P} E_ζ[ f(x, ζ) ],   (7.3.51)

where E_ζ[·] denotes the expectation with respect to a random vector ζ. The optimal
strategy x* must be static by its origin (otherwise, maximization of the expectation does
not make sense). At the same time, the quality of the model f(x, ζ), constructed
by an analysis of the past, can hardly be comparable with the quality of the static
model based on exact knowledge of the future. Thus, by transitivity, we can hope that
our online adjustment strategy gives much better results than the standard Stochastic
Programming approach. Of course, it can be applied only in situations where the
dynamic adjustments of the decision variables are implementable.
The main drawback of the online optimization strategy (7.3.46) is its low rate of
convergence. Therefore, it is efficient only for processes where the average gain
is big as compared to the number of iterations and the parameter of the barrier
function. Interesting applications of this technique are most probably to be found in
long-run production planning and management rather than in stock market activity.
7.4 Optimization with Mixed Accuracy 561

7.4 Optimization with Mixed Accuracy

(Strictly positive functions; The Quasi-Newton Method; Approximate solutions; Mixed


accuracy.)

7.4.1 Strictly Positive Functions

In the previous chapters, we considered different approaches for finding approx-


imate solutions of optimization problems with absolute and relative accuracy. In
all cases, the type of desired accuracy was very important for the definition of
the problem class, and consequently for the development of the corresponding
numerical schemes. In this section, we proceed in a converse way. Firstly, we
define a class of functions with favorable properties. Only after that will we try
to understand what kind of theory can be developed for corresponding optimization
problems.
Consider a closed convex function f with dom f ⊆ R^n. Let Q ⊆ dom f be a
closed convex set. We assume that ∂f(x) ≠ ∅ for all x ∈ Q.
Definition 7.4.1 A convex function f is called strictly positive on Q if for any x, y
from Q and g ∈ ∂f(x) we have

f(y) + f(x) + ⟨g, y − x⟩ ≥ 0.   (7.4.1)

Since f is convex, this inequality can be written in a more appealing form:

f(y) ≥ |f(x) + ⟨g, y − x⟩|,   x, y ∈ Q, g ∈ ∂f(x).   (7.4.2)

Clearly, strict positivity is an affine-invariant property.


Lemma 7.4.1 Let f be strictly positive on Q_x ⊆ R^n and let A ∈ R^{n×m} and
b ∈ R^n. Then the function φ(y) = f(Ay + b) is strictly positive on the set

Q_y = {y ∈ R^m : Ay + b ∈ Q_x}.

Proof Indeed, in view of Lemma 3.1.11, for x = Ay + b we have

g_y = A^T g_x ∈ ∂φ(y),   ∀g_x ∈ ∂f(x).

For two arbitrary points y_1, y_2 ∈ Q_y, let x_i = Ay_i + b, i = 1, 2. Then

φ(y_2) + φ(y_1) + ⟨g_{y_1}, y_2 − y_1⟩ = f(x_2) + f(x_1) + ⟨A^T g_{x_1}, y_2 − y_1⟩

(7.4.1)
= f(x_2) + f(x_1) + ⟨g_{x_1}, x_2 − x_1⟩ ≥ 0. 


Let us give some important examples of strictly positive functions and mention
their main properties.
1. Any positive constant is a strictly positive function.
2. Let us look at convex homogeneous functions of degree one.

Lemma 7.4.2 Let f(x) = max_{s∈S} ⟨s, x⟩, where the set S is bounded, closed and centrally symmetric. Then the function f is strictly positive.

Proof For any x ∈ R^n and g_x ∈ ∂f(x), in view of (3.1.40) we have f(x) = ⟨g_x, x⟩ and, by central symmetry, −g_x ∈ S. Therefore, by (3.1.23),

f(y) ≥ ⟨−g_x, y⟩ = −f(x) − ⟨g_x, y − x⟩. □
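As a quick numerical sanity check of Lemma 7.4.2, the snippet below verifies inequality (7.4.2) on random points for f(x) = ‖x‖₁, the support function of the (centrally symmetric) unit ℓ∞-ball. The choice of this particular norm, the sample sizes, and the tolerance are illustrative assumptions, not part of the text:

```python
import numpy as np

# Check (7.4.2) for f(x) = ||x||_1 with subgradient g = sign(x);
# this f is the support function of the centrally symmetric unit l_inf ball.
rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    g = subgrad(x)
    # strict positivity: f(y) >= |f(x) + <g, y - x>|
    assert f(y) >= abs(f(x) + g @ (y - x)) - 1e-12
```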

3. Thus, the simplest nontrivial examples of strictly positive functions are norms.

Let us now look at operations preserving strict positivity.

Lemma 7.4.3 The class of strictly positive functions is a convex cone: if f_1 and f_2 are strictly positive on Q, and α_1, α_2 ≥ 0, then f(x) = α_1 f_1(x) + α_2 f_2(x) is strictly positive on Q.

Proof Indeed, the characteristic inequality (7.4.1) is convex in f. □

Lemma 7.4.4 Let the functions f_1(·) and f_2(·) be strictly positive on Q. Then the function f(x) = max{f_1(x), f_2(x)} is also strictly positive.

Proof Let us fix an arbitrary x ∈ Q. Assume that f_1(x) > f_2(x). Then ∂f(x) = ∂f_1(x), and for y ∈ Q and g_1 ∈ ∂f_1(x) we have

f(y) ≥ f_1(y) ≥ −f_1(x) − ⟨g_1, y − x⟩ = −f(x) − ⟨g_1, y − x⟩.

The cases f_1(x) < f_2(x) and f_1(x) = f_2(x) can be justified in a similar way (see Lemma 3.1.13). □
Thus, the functions below are strictly positive on R^n:

f_1(x) = Σ_{i=1}^m ‖A_i x − b_i‖,  f_2(x) = max_{1≤i≤m} ‖A_i x − b_i‖,

where A_i ∈ R^{m×n} and b_i ∈ R^m, i = 1, . . . , m.


At the same time, the class of strictly positive functions contains functions with quite a general shape of epigraph. Let us fix a norm ‖ · ‖ for measuring distances in R^n, and define the corresponding dual norm ‖ · ‖_* in the standard way (7.1.3).

Theorem 7.4.1 Let the function φ be convex on Q with all its subgradients uniformly bounded:

‖g_x‖_* ≤ L,  x ∈ Q, g_x ∈ ∂φ(x).     (7.4.3)

Then the function f(x) = max{φ(x), L‖x‖} is strictly positive on Q.


Proof Let us fix an arbitrary x ∈ Q. Assume first that φ(x) < L‖x‖. Let us choose s ∈ R^n with ‖s‖_* = 1 such that ⟨s, x⟩ = ‖x‖. Note that any g_x ∈ ∂f(x) coincides with one of the vectors Ls (see Lemma 3.1.15). Hence, for any y ∈ Q we have

f(y) + f(x) + ⟨g_x, y − x⟩ ≥ L‖y‖ + L‖x‖ + ⟨Ls, y − x⟩ = L‖y‖ + L⟨s, y⟩ ≥ 0.

Further, if φ(x) > L‖x‖, then ∂f(x) = ∂φ(x), and therefore for any g_x ∈ ∂f(x) we have

f(y) + f(x) + ⟨g_x, y − x⟩ ≥ L‖y‖ + L‖x‖ + ⟨g_x, y − x⟩

≥ L‖y‖ + L‖x‖ − L‖y − x‖ ≥ 0,

where the middle inequality follows from (7.4.3). Finally, for the case φ(x) = L‖x‖ we can apply a convex combination of the above inequalities. □
Using this result, we can endow a general minimization problem

Find φ* = min_{x∈Q} φ(x)     (7.4.4)

with a strictly positive objective function. Denote by x* ∈ Q its optimal solution.


Corollary 7.4.1 Let the function φ satisfy condition (7.4.3). Then for any x_0 ∈ Q the function

f(x) = max{φ(x) − φ(x_0) + 2LR, L‖x − x_0‖}

is strictly positive on Q. Moreover, for all x with ‖x − x_0‖ ≤ R we have

f(x) = φ(x) − φ(x_0) + 2LR.     (7.4.5)

If ‖x_0 − x*‖ ≤ R, then problem (7.4.4) is equivalent to the problem

f* = min_{x∈Q} f(x),

with optimal value satisfying the following bounds:

LR ≤ f* ≤ 2LR.     (7.4.6)

Proof Indeed, f is strictly positive on Q in view of Theorem 7.4.1. If ‖x − x_0‖ ≤ R, then, by (7.4.3),

φ(x) − φ(x_0) + 2LR ≥ 2LR − L‖x − x_0‖ ≥ L‖x − x_0‖,

and we obtain representation (7.4.5). Further, f* ≤ f(x_0) = 2LR. Finally, in view of (7.4.3),

f(x) ≥ max{2LR − L‖x − x_0‖, L‖x − x_0‖} ≥ LR. □

7.4.2 The Quasi-Newton Method

Consider the following minimization problem:

min_{x∈Q} f(x),     (7.4.7)

where Q is a closed convex set in R^n, and the function f is strictly positive on Q. Denote by x* the optimal solution of this problem. It will be convenient to work with another objective function:

fˆ(x) = ½ f²(x),  ĝ(x) = f(x) · g(x) ∈ ∂fˆ(x),  g(x) ∈ ∂f(x),     (7.4.8)

where the inclusion follows from Lemma 3.1.8. Since the function f is nonnegative, problem (7.4.7) can be rewritten in the equivalent form

min_{x∈Q} fˆ(x).     (7.4.9)

The most unusual feature of the function fˆ is the existence of nonlinear lower support functions.

Lemma 7.4.5 Let the function f be strictly positive on Q. Then for any x and y ∈ Q we have

fˆ(y) ≥ fˆ(x) + ⟨ĝ(x), y − x⟩ + ½⟨g(x), y − x⟩².     (7.4.10)

Proof Indeed,

fˆ(y) = ½ f²(y) ≥ ½ [f(x) + ⟨g(x), y − x⟩]² = fˆ(x) + ⟨ĝ(x), y − x⟩ + ½⟨g(x), y − x⟩²,

where the inequality follows from (7.4.2) and the identities from (7.4.8). □


We will use inequality (7.4.10) in the framework of estimating sequences (see Sects. 2.2.1, 4.2.4, and 6.1.3). Let us fix a symmetric n×n matrix G_0 ≻ 0 and a starting point x_0 ∈ Q. Define the primal and dual norms:

‖x‖_{G_0} = ⟨G_0 x, x⟩^{1/2},  ‖g‖*_{G_0} = ⟨g, G_0^{−1} g⟩^{1/2},  x, g ∈ R^n.

We assume that ‖x_0 − x*‖_{G_0} ≤ R. Define the initial function for the estimating sequence as follows:

ψ_0(x) = ½ ‖x − x_0‖²_{G_0}.

Let us fix an accuracy parameter δ ∈ (0, 1). Assuming that g(x_k) ≠ 0 for all k ≥ 0, define

a_k = δ/(1−δ) · 1/(‖g(x_k)‖*_{G_k})²,  A_k = Σ_{i=0}^{k−1} a_i,  k ≥ 0.     (7.4.11)

Thus, A_0 = 0. For k ≥ 0, consider the following process:

x_k = arg min_{x∈Q} ψ_k(x),

ψ_{k+1}(x) = ψ_k(x) + a_k · [fˆ(x_k) + ⟨ĝ(x_k), x − x_k⟩ + ½⟨g(x_k), x − x_k⟩²].     (7.4.12)

Clearly, in view of inequality (7.4.10), we have

ψ_k(x) ≤ A_k fˆ(x) + ψ_0(x),  x ∈ Q.     (7.4.13)

On the other hand, ψ_k(·) is a quadratic function whose Hessian G_k ≻ 0 is updated by the following rule:

G_{k+1} = G_k + a_k · g(x_k)g(x_k)^T = G_k + δ/(1−δ) · g(x_k)g(x_k)^T / (‖g(x_k)‖*_{G_k})²,  k ≥ 0,     (7.4.14)

where the last equality follows from (7.4.11). Therefore, by the Sherman–Morrison–Woodbury formula, we have

G_{k+1}^{−1} = G_k^{−1} − δ · G_k^{−1} g(x_k)g(x_k)^T G_k^{−1} / (‖g(x_k)‖*_{G_k})².

Thus, we conclude that

½ a_k² (‖ĝ(x_k)‖*_{G_{k+1}})² = a_k² · fˆ(x_k) · (‖g(x_k)‖*_{G_{k+1}})²

= a_k² · fˆ(x_k) · (1 − δ) · (‖g(x_k)‖*_{G_k})² = δ · a_k · fˆ(x_k),     (7.4.15)

where the first equality follows from (7.4.8) and the last one from (7.4.11).

Lemma 7.4.6 For any k ≥ 0 we have

ψ_k* := min_{x∈Q} ψ_k(x) ≥ (1 − δ) Σ_{i=0}^{k−1} a_i fˆ(x_i).     (7.4.16)

Proof Let us prove inequality (7.4.16) by induction. For k = 0 it is true. Let us assume that it is true for some k ≥ 0. Since ψ_k(·) is a quadratic function, it is strongly convex in the norm ‖ · ‖_{G_k} with convexity parameter one. Thus, for any x ∈ Q the first-order optimality condition (2.2.40) implies

ψ_k(x) = ψ_k* + ⟨∇ψ_k(x_k), x − x_k⟩ + ½‖x − x_k‖²_{G_k} ≥ ψ_k* + ½‖x − x_k‖²_{G_k}.

Therefore,

ψ*_{k+1} ≥ ψ_k* + min_{x∈Q} { ½‖x − x_k‖²_{G_k} + a_k [fˆ(x_k) + ⟨ĝ(x_k), x − x_k⟩ + ½⟨g(x_k), x − x_k⟩²] }

= ψ_k* + a_k fˆ(x_k) + min_{x∈Q} { ½‖x − x_k‖²_{G_{k+1}} + a_k ⟨ĝ(x_k), x − x_k⟩ }

≥ ψ_k* + a_k fˆ(x_k) − ½ a_k² (‖ĝ(x_k)‖*_{G_{k+1}})² = ψ_k* + (1 − δ) · a_k fˆ(x_k),

where the middle equality follows from (7.4.14) and the last one from (7.4.15). □

We can now estimate the rate of convergence of method (7.4.12). Define

x_k* = arg min_x {f(x) : x ∈ {x_0, . . . , x_k}},  x̃_k = (1/A_k) Σ_{i=0}^{k−1} a_i x_i.

Theorem 7.4.2 Let us assume that a strictly positive function f has uniformly bounded subgradients:

‖g(x)‖*_{G_0} ≤ L,  x ∈ Q.     (7.4.17)

Then, for any k ≥ 0 we have

(1 − δ) fˆ(x_k*) ≤ fˆ(x*) + L²R² / (2n [e^{δ(k+1)/n} − 1]).     (7.4.18)

This estimate is also valid for the value fˆ(x̃_{k+1}).


Proof In view of inequalities (7.4.13) and (7.4.16),

(1 − δ) fˆ(x_k*) ≤ fˆ(x*) + (1/(2A_{k+1})) ‖x_0 − x*‖²_{G_0}.

Let us estimate the rate of growth of the coefficients A_k. Let Ḡ_k = G_0^{−1/2} G_k G_0^{−1/2}, k ≥ 0. Since det G_{k+1} = (1/(1−δ)) det G_k in view of (7.4.14), we have

det Ḡ_k = 1/(1−δ)^k,  k ≥ 0.     (7.4.19)

It remains to note that

A_k = Σ_{i=0}^{k−1} a_i ≥ (1/L²) Σ_{i=0}^{k−1} a_i (‖g(x_i)‖*_{G_0})² = (1/L²) [Trace Ḡ_k − n]

≥ (n/L²) [1/(1−δ)^{k/n} − 1] ≥ (n/L²) [e^{δk/n} − 1].

Here the first inequality follows from (7.4.17), the equality from the update rule (7.4.14), the next inequality from (7.4.19) combined with the arithmetic–geometric mean bound Trace Ḡ_k ≥ n (det Ḡ_k)^{1/n}, and the last one from 1/(1−δ) ≥ e^δ. □

7.4.3 Interpretation of Approximate Solutions

Note that the quality of the point x_k* as an approximate solution to problem (7.4.9) is characterized by inequality (7.4.18) in a nonstandard way. Let us introduce a new definition.

Definition 7.4.2 We say that a point x̄ ∈ Q is an approximate solution to problem (7.4.9) with mixed (ε, δ)-accuracy if

(1 − δ) fˆ(x̄) ≤ fˆ(x*) + ε.

In this definition, ε > 0 serves as an absolute accuracy, and δ ∈ (0, 1) represents the relative accuracy of the point x̄. Thus, in view of (7.4.18), mixed (ε, δ)-accuracy can be reached by the Quasi-Newton Method (7.4.12) in

N_n(ε, δ) := (n/δ) ln(1 + L²R²/(2nε))     (7.4.20)

iterations.

Thus, it is not difficult to reach a high absolute accuracy. A high level of relative accuracy is much more expensive. Nevertheless, despite the nonsmoothness of the objective function in (7.4.9), the number of iterations of method (7.4.12) is proportional to 1/δ. This is, of course, a consequence of the finite dimension of the space of variables. Note that we have the following uniform upper bound for our estimate of the number of iterations:

N_n(ε, δ) < N_∞(ε, δ) := L²R²/(2δε).     (7.4.21)

It is easy to see that the bound N_n(ε, δ) is a monotonically increasing function of the dimension n.
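The bounds (7.4.20) and (7.4.21), together with the monotonicity in n, can be checked directly; the values of L, R, ε, δ in this sketch are arbitrary illustrative choices:

```python
import math

def N(n, eps, delta, L=1.0, R=10.0):
    # Iteration bound (7.4.20) for mixed (eps, delta)-accuracy.
    return (n / delta) * math.log1p(L * L * R * R / (2.0 * n * eps))

def N_inf(eps, delta, L=1.0, R=10.0):
    # Dimension-independent upper bound (7.4.21).
    return L * L * R * R / (2.0 * delta * eps)

# N_n grows monotonically in n and stays below N_inf.
bounds = [N(n, eps=0.01, delta=0.1) for n in (1, 10, 100, 1000)]
```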
Let us discuss now the ability of method (7.4.12) to generate approximate
solutions in the standard accuracy scales.

7.4.3.1 Relative Accuracy

Consider our initial problem (7.4.7). Assume that our goal is to generate an approximate solution x̄ ∈ Q to this problem with relative accuracy δ ∈ (0, 1/2):

f(x̄) ≤ (1 + δ) f*.     (7.4.22)

After k iterations of method (7.4.12), in view of (7.4.8) and (7.4.18) we have

(1 − δ)(f(x_k*) − f*) f* ≤ (1 − δ)(fˆ(x_k*) − fˆ(x*)) ≤ δ fˆ(x*) + L²R² / (2n [e^{δ(k+1)/n} − 1]).     (7.4.23)

In order to have the point x̄ = x_k* satisfy inequality (7.4.22), we need to ensure that the right-hand side of the latter inequality does not exceed δ(1 − δ)(f*)². Thus, for δ ∈ (0, 1/2) we need

k = R_n(δ) := (n/δ) ln(1 + L²R²/(nδ(1 − 2δ)(f*)²))     (7.4.24)

iterations. Note that the main factor n/δ in this complexity bound does not depend on the data of the problem. Thus, for problem (7.4.7), we get a fully polynomial-time approximation scheme. Its dependence on n is the same as that of optimal methods for nonsmooth convex minimization in finite dimensions. However, each iteration of method (7.4.12) is very simple, of the same order as in the Ellipsoid Method. Note that for problem (7.4.7) the Ellipsoid Method has the complexity bound of O(n² ln(LR/(δf*))) iterations (see Sect. 3.2.8). Thus, for a moderate relative accuracy, method (7.4.12) is faster. It is important that the right-hand side of inequality (7.4.24) is uniformly bounded as n → ∞:

R_n(δ) < R_∞(δ) := L²R²/(δ²(1 − 2δ)(f*)²).

7.4.3.2 Absolute Accuracy

Consider now the general minimization problem (7.4.4), which we want to solve with absolute accuracy ε > 0:

φ(x̄) ≤ φ* + ε,  x̄ ∈ Q.     (7.4.25)

We assume that φ satisfies condition (7.4.3) and the constants L and R are known. Moreover, for the sake of simplicity, we assume that

‖x − x_0‖ ≤ R  ∀x ∈ Q.     (7.4.26)

Defining now a new strictly positive objective function f(·) by equation (7.4.5), we get

f(x) = φ(x) − φ(x_0) + 2LR  ∀x ∈ Q.     (7.4.27)

Let us choose some δ ∈ (0, 1) and apply method (7.4.12) to the corresponding problem (7.4.7) (by solving (7.4.9), of course). After k iterations of this scheme, in view of (7.4.27), (7.4.23), and (7.4.6), we have

φ(x_k*) − φ* = f(x_k*) − f* ≤ δf*/(2(1−δ)) + L²R² / (2n [e^{δ(k+1)/n} − 1] · (1−δ) f*)

≤ LR [ δ/(1−δ) + 1/(2n [e^{δ(k+1)/n} − 1] · (1−δ)) ].

Thus, to obtain accuracy ε > 0, we can find δ = δ(ε) from the equation

δ/(1−δ) = ε/(2LR)  ⟹  δ(ε) = ε/(ε + 2LR).

Then, we need at most

k = T_n(ε) := (n/δ(ε)) ln(1 + LR/(nε(1 − δ(ε))))

= n (1 + 2LR/ε) · ln(1 + (ε + 2LR)/(2nε))     (7.4.28)

iterations of method (7.4.12). Note that

T_n(ε) < T_∞(ε) = ½ (1 + 2LR/ε)².

Thus, in finite dimensions the worst-case complexity bound of the Quasi-Newton


Method (7.4.12) is always better than the bound of the standard subgradient scheme
(see Sect. 3.2.3).
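As with (7.4.20), the bounds (7.4.28) and T_∞(ε) are easy to evaluate numerically; all the constants in this sketch (L, R, ε and the tested dimensions) are illustrative assumptions:

```python
import math

def T(n, eps, L=1.0, R=10.0):
    # Iteration bound (7.4.28) for absolute accuracy eps,
    # obtained with delta(eps) = eps / (eps + 2 L R).
    return n * (1.0 + 2.0 * L * R / eps) \
             * math.log1p((eps + 2.0 * L * R) / (2.0 * n * eps))

def T_inf(eps, L=1.0, R=10.0):
    # Dimension-independent upper bound on T_n(eps).
    return 0.5 * (1.0 + 2.0 * L * R / eps) ** 2

vals = [T(n, eps=0.1) for n in (1, 10, 100)]
```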
Appendix A
Solving Some Auxiliary Optimization Problems

A.1 Newton’s Method for Univariate Minimization

Let us show that Newton's Method is very efficient in finding the maximal root of an increasing convex univariate function. Consider a univariate function f such that

f(τ*) = 0,  f(τ) > 0 for τ > τ*,     (A.1.1)

and f is convex for τ ≥ τ*. Let us choose τ_0 > τ*. Consider the following Newton process:

τ_{k+1} = τ_k − f(τ_k)/g_k,  g_k ∈ ∂f(τ_k).     (A.1.2)

Thus, we do not assume f to be differentiable for τ ≥ τ*.


Theorem A.1.1 Method (A.1.2) is well defined. For any k ≥ 0 we have

f(τ_{k+1}) g_{k+1} ≤ ¼ f(τ_k) g_k.     (A.1.3)

Thus, f(τ_k) ≤ (½)^k g_0 (τ_0 − τ*).

Proof Let f_k = f(τ_k). Let us assume that f_k > 0 for all k ≥ 0. Since f is convex for τ ≥ τ*, we have 0 = f(τ*) ≥ f_k + g_k(τ* − τ_k). Thus,

g_k (τ_k − τ*) ≥ f_k > 0.     (A.1.4)

This means that g_k > 0 and τ_{k+1} ∈ [τ*, τ_k). In particular, we conclude that

τ_k − τ* ≤ τ_0 − τ*.     (A.1.5)

© Springer Nature Switzerland AG 2018
Y. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4

Further, for any k ≥ 0 we have

f_k ≥ f_{k+1} + g_{k+1}(τ_k − τ_{k+1}) = f_{k+1} + f_k g_{k+1}/g_k,

where the equality follows from (A.1.2). Thus, 1 ≥ f_{k+1}/f_k + g_{k+1}/g_k ≥ 2 [f_{k+1} g_{k+1}/(f_k g_k)]^{1/2}, and this is (A.1.3). Finally, since f is convex for τ ≥ τ*, we have

g_0 ≥ [f_0 g_0/(τ_0 − τ*)]^{1/2} ≥ 2^k [f_k g_k/(τ_0 − τ*)]^{1/2} ≥ 2^k [f_k²/((τ_0 − τ*)(τ_k − τ*))]^{1/2} ≥ 2^k f_k/(τ_0 − τ*),

where the four relations follow from (A.1.4), (A.1.3), (A.1.4) again, and (A.1.5), respectively. □
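Scheme (A.1.2) is only a few lines of code. The sketch below runs it on the illustrative function f(τ) = τ² − 2, which is increasing and convex to the right of its maximal root τ* = √2; started from the right of the root, the iterates decrease monotonically towards it:

```python
def newton_max_root(f, df, tau0, tol=1e-12, iters=100):
    # Process (A.1.2): tau_{k+1} = tau_k - f(tau_k)/f'(tau_k),
    # started at tau0 strictly to the right of the maximal root.
    tau = tau0
    for _ in range(iters):
        v = f(tau)
        if v <= tol:
            break
        tau -= v / df(tau)
    return tau

root = newton_max_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, tau0=10.0)
```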

Thus, we have seen that method (A.1.2) has a linear rate of convergence which does not depend on the particular properties of the function f. Let us show that in a non-degenerate situation this method has local quadratic convergence.

Theorem A.1.2 Let the convex function f be twice differentiable. Assume that it satisfies conditions (A.1.1) and that its second derivative is increasing for τ ≥ τ*. Then for any k ≥ 0 we have

f(τ_{k+1}) ≤ f″(τ_k)/(2(f′(τ_k))²) · f²(τ_k).     (A.1.6)

If the root τ* is non-degenerate,

f′(τ*) > 0,     (A.1.7)

then f(τ_{k+1}) ≤ f″(τ_0)/(2(f′(τ*))²) · f²(τ_k).

Proof In view of the conditions of the theorem, f″(τ) ≤ f″(τ_k) for all τ ∈ [τ_{k+1}, τ_k]. Therefore,

f(τ_{k+1}) ≤ f(τ_k) + f′(τ_k)(τ_{k+1} − τ_k) + ½ f″(τ_k)(τ_{k+1} − τ_k)² = ½ f″(τ_k) f²(τ_k)/(f′(τ_k))²,

where the equality follows from (A.1.2). To prove the last statement, it remains to note that f″(τ_k) ≤ f″(τ_0) and f′(τ_k) ≥ f′(τ*). □


A.2 Barrier Projection onto a Simplex

In the case K = R^n_+, we can take

F(x) = −Σ_{i=1}^n ln x^{(i)},  ν = n.

Consider P̂ = {x ∈ R^n_+ : ⟨ē_n, x⟩ = 1}. Then, at each iteration of method (7.3.14) we need to solve the following problem:

φ* := max_x { ⟨s, x⟩ + Σ_{i=1}^n ln x^{(i)} : Σ_{i=1}^n x^{(i)} = 1 }.     (A.2.1)

Let us show that its complexity does not depend on the size of the particular data (that is, the coefficients of the vector s ∈ R^n). Consider the following Lagrangian:

L(x, λ) = ⟨s, x⟩ + Σ_{i=1}^n ln x^{(i)} + λ · (1 − Σ_{i=1}^n x^{(i)}),  x ∈ R^n, λ ∈ R.

The dual function

φ(λ) := max_x L(x, λ) = L(x(λ), λ)

is defined by the vector x(λ): x^{(i)}(λ) = 1/(λ − s^{(i)}), i = 1, . . . , n. Thus,

φ(λ) = −n + λ − Σ_{i=1}^n ln(λ − s^{(i)}),

φ* = min_λ { φ(λ) : λ > max_{1≤i≤n} s^{(i)} }.     (A.2.2)

Note that φ(·) is a standard self-concordant function. Therefore, for its minimization we can apply the intermediate Newton's Method (5.2.1), Item C), which converges quadratically starting from any λ in the region

Q(s) = {λ : 4(φ′(λ))² ≤ φ″(λ)}

(see Theorem 5.2.2). Let us show that the complexity of finding a starting point from this set does not depend on the initial data.

Consider the function ψ(λ) = −φ′(λ) = Σ_{i=1}^n 1/(λ − s^{(i)}) − 1. Clearly, problem (A.2.2) is equivalent to finding the largest root λ* of the equation

ψ(λ) = 0.     (A.2.3)

Let λ_0 = 1 + max_{1≤i≤n} s^{(i)}. Then ψ(λ_0) ≥ 0, and therefore λ_0 ≤ λ*. Consider the following process:

λ_{k+1} = λ_k − ψ(λ_k)/ψ′(λ_k),  k ≥ 0.     (A.2.4)

This is the standard Newton's method for solving Eq. (A.2.3), which can also be interpreted as a Newton's method for the minimization problem (A.2.2).

Lemma A.2.1 For any k ≥ 0 we have (φ′(λ_k))² ≤ n⁷ · (1/16)^k · φ″(λ_k).

Proof Note that the function ψ is decreasing and strictly convex. Therefore, for any k ≥ 0 we have

λ_k < λ_{k+1} < λ*,  ψ′(λ_k) < 0,  ψ(λ_k) > 0.

Since ψ(λ_k) ≥ ψ(λ_{k+1}) + ψ′(λ_{k+1})(λ_k − λ_{k+1}) = ψ(λ_{k+1}) + [ψ′(λ_{k+1})/ψ′(λ_k)] ψ(λ_k), we obtain¹

1 ≥ ψ(λ_{k+1})/ψ(λ_k) + ψ′(λ_{k+1})/ψ′(λ_k) ≥ 2 [ψ(λ_{k+1}) ψ′(λ_{k+1}) / (ψ(λ_k) ψ′(λ_k))]^{1/2}.

Thus, for any k ≥ 0 we get

φ″(λ_k) · |φ′(λ_k)| ≤ (¼)^k φ″(λ_0) · |φ′(λ_0)|.     (A.2.5)

Further, in view of the choice of λ_0, we have

|φ′(λ_0)| = ψ(λ_0) = Σ_{i=1}^n 1/(λ_0 − s^{(i)}) − 1 < n − 1,

φ″(λ_0) = Σ_{i=1}^n 1/(λ_0 − s^{(i)})² ≤ n.

Finally, since 0 ≤ ψ(λ_k) = Σ_{i=1}^n 1/(λ_k − s^{(i)}) − 1, we conclude that

φ″(λ_k) = Σ_{i=1}^n 1/(λ_k − s^{(i)})² ≥ 1/n.

¹ We use the same arguments as in the proof of Theorem A.1.1, but for a decreasing univariate function.

Using these bounds in (A.2.5), we obtain

(φ′(λ_k))²/φ″(λ_k) ≤ (1/16)^k (φ″(λ_0))² (φ′(λ_0))² / (φ″(λ_k))³ ≤ (1/16)^k · n⁷. □

Comparing the statement of Lemma A.2.1 with the definition of Q(s), we conclude that process (A.2.4) arrives at the region of quadratic convergence after at most

⌈¼ (2 + 7 log₂ n)⌉     (A.2.6)

iterations. Each such iteration takes O(n) arithmetic operations.
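A direct implementation of process (A.2.4) is only a few lines. In the sketch below (the input vector s is an arbitrary illustrative choice), the computed λ yields the maximizer x^{(i)} = 1/(λ − s^{(i)}) of problem (A.2.1); at the solution, the quantities s^{(i)} + 1/x^{(i)} all coincide with λ:

```python
import numpy as np

def barrier_projection_simplex(s, iters=60):
    # Newton process (A.2.4) on psi(lam) = sum_i 1/(lam - s_i) - 1,
    # started at lam_0 = 1 + max_i s_i <= lam_*.
    lam = 1.0 + s.max()
    for _ in range(iters):
        d = lam - s
        psi = (1.0 / d).sum() - 1.0
        dpsi = -(1.0 / d ** 2).sum()
        step = psi / dpsi
        lam -= step
        if abs(step) < 1e-14:
            break
    return 1.0 / (lam - s)   # maximizer of (A.2.1)

x = barrier_projection_simplex(np.array([3.0, -1.0, 0.5]))
```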


A similar technique can be used for finding the barrier projection onto the cone of positive semidefinite matrices:

max_X { ⟨S, X⟩ + ln det X : ⟨I_n, X⟩ = 1 }.

The most straightforward strategy consists in computing an eigenvalue decomposition of the matrix S and solving problem (A.2.1) with s being the spectrum of this matrix. In a more efficient strategy, we transform S into tri-diagonal form by an orthogonal transformation, compute its maximal eigenvalue, and apply Newton's method to the corresponding dual function.
Bibliographical Comments

In the past few decades, numerical methods for Convex Optimization have been widely studied in the monographic literature. The reader interested in engineering applications can benefit from the introductory exposition by Polyak [55], the excellent course by Boyd and Vandenberghe [6], and the lecture notes by Ben-Tal and Nemirovski [5]. Mathematical aspects are described in detail in the older lectures by A. Nemirovski (see [33] for the Internet version) and in the original versions of the theory of Interior-Point Methods by Renegar [57], Roos et al. [59], and Ye [63]. Recent theoretical highlights can be found in the monographs by Beck [3] and Bubeck [7]. In our book, we have tried to be more balanced, combining a comprehensive mathematical theory with many examples of practical applications, sometimes supported by numerical experiments.

Chapter 1: Nonlinear Optimization

Section 1.1 The complexity theory for black-box optimization schemes was devel-
oped in [34], where the reader can find different examples of resisting oracles and
lower complexity bounds similar to that of Theorem 1.1.2.
Sections 1.2 and 1.3 There exist several classical monographs [11, 12, 30, 53]
treating different aspects of Nonlinear Optimization. For understanding Sequential
Unconstrained Minimization, the best source is still [14]. Some facts in Sect. 1.3,
related to conditions for zero duality gap, are probably new.


Chapter 2: Smooth Convex Optimization

Section 2.1 The original lower complexity bounds for smooth convex and strongly
convex functions can be found in [34]. The proof used in this section was first
published in [39].
Section 2.2 Gradient mapping was introduced in [34]. The first optimal method for
smooth and strongly convex functions was proposed in [35]. The constrained variant
of this scheme is taken from [37]. However, the framework of estimating sequences
was suggested for the first time in [39]. A discussion of different approaches for
generating points with small norm of the gradient can be found in [48].
Section 2.3 Optimal methods for discrete minimax problems were developed in
[37]. The approach of Sect. 2.3.5 was first described in [39].

Chapter 3: Nonsmooth Convex Optimization

Section 3.1 A comprehensive treatment of different topics of Convex Analysis can


be found in [24]. However, the classical monograph [58] is still very useful.
Section 3.2 Lower complexity bounds for nonsmooth minimization problems can
be found in [34]. The framework of Sect. 3.2.2 was suggested in [36]. For detailed
bibliographical comments on the early history of Nonsmooth Minimization see [55,
56].
Section 3.3 The example of a difficult function for Kelley’s method is taken from
[34]. The presentation of the Level Method in this section is close to [28].

Chapter 4: Second-Order Methods

Section 4.1 Starting from the seminal papers of Bennet [4] and Kantorovich [26],
Newton’s Method became an important tool for numerous applied problems. In
the last 50 years, the number of different suggestions for improving the scheme is
extremely large (see, for example, [11, 12, 15, 21, 29, 31]). The reader can consult
an exhaustive bibliography in [11].
Most probably, the natural idea of using cubic regularization to improve the
stability of the Newton scheme was first analyzed in [22]. However, the author
was very sceptical about the complexity of solving the auxiliary minimization
problem in the case of nonconvex quadratic approximation (and indeed, it can
have an exponential number of local minima). As a result, this paper was never
published. Twenty-five years later, in an independent paper [52], this idea was
checked again, and it was shown that this problem is solvable by standard techniques

of Linear Algebra. The authors also developed global worst-case complexity


bounds for different problem classes. This paper forms the basis of Sect. 4.1. The
interested reader can also consult the complementary approach [8, 9], where cubic
regularization is coupled with a line search along the gradient direction. However,
note that this feature, though improving somewhat the numerical stability, forces
the algorithm to stop at saddle points. A historical exposition of the development in
this field with recent results, including lower complexity bounds for gradient norm
minimization, can be found in [10].
Section 4.2 This section is based on the paper [45].
Section 4.3 This section is based on very recent and partially unpublished results.
The first lower complexity bounds for second-order methods were obtained in [2]. At the same time, one of the second-order schemes in [32] achieves the rate of convergence Õ(1/k^{7/2}), which is optimal. However, each iteration of this method needs an expensive search procedure based on additional calls of the oracle. So, its practical efficiency is questionable.
In our presentation, we use a simpler derivation of the lower complexity bounds
and a simpler conceptual version of the “optimal” second-order scheme, based on
iteration of the Cubic Newton Method.
Section 4.4 Methods for solving systems of nonlinear equations have attracted a lot
of attention (see [11, 12, 53, 54]). However, we have not been able to find any global
worst-case efficiency estimates for them in the literature. Our presentation follows
the paper [43].

Chapter 5: Polynomial-Time Interior-Point Methods

This chapter contains an adaptation of the main concepts from [51]. We added several useful inequalities and a slightly simplified presentation of the path-following scheme. We refer the reader to [5] for numerous applications of interior-point methods, and to [57, 59, 62] and [63] for a detailed treatment of different theoretical aspects.
Section 5.1 In this section, we introduce the definition of a self-concordant function
and study its properties. As compared with Section 4.1 in [39], we add Fenchel
duality and the Implicit Function Theorem. The main novelty is an explicit treatment
of the constant of self-concordance. However, most of the material can be found in
[51].
Section 5.2 In this new section, we analyze different methods for minimizing self-
concordant functions. We propose a new step-size rule for the Newton scheme
(intermediate step), which gives better constants for the path-following approach.
Complexity estimates for a path-following scheme, as applied to a self-concordant
function, were obtained only recently [13].

Section 5.3 In this section we study the properties of a self-concordant barrier and
give the complexity analysis for the path-following method. This is an adaptation of
Section 4.2 in [39].
Section 5.4 In this section, we give examples of self-concordant barriers and related
applications. This is an extension of Section 4.3 in [39] by the results of [49].

Chapter 6: The Primal-Dual Model of an Objective Function

This is the first attempt at presenting in the monographic literature the fast primal-
dual gradient methods based on an explicit minimax model of the objective function.
In the first three sections we present different aspects of the smoothing technique,
following the papers [40, 41], and [42]. It seems that the Fast Gradient Method in
the form of the Method of Similar Triangles (6.1.19) was published for the first time
only recently (see [20]).
The last Sect. 6.4 is devoted to the new analysis of the old Conditional Gradient
Method (or, the Frank–Wolfe algorithm [16, 18, 19, 23, 25]). Our presentation
follows the paper [50], which is close in spirit to [17].

Chapter 7: Optimization in Relative Scale

The presentation in this new chapter is based on the papers [44, 46], and [47].
Some examples of application were analyzed in [5], however, from the viewpoint of
the applicability of Interior-Point Methods. Algorithms for computing the rounding
ellipsoids are studied in [1, 27, 61], and in the recent book [60]. The constant quality of semidefinite relaxation for Boolean quadratic maximization with a general matrix was proved in [38]. The material of Sect. 7.4 is new.
References

1. K.M. Anstreicher, Ellipsoidal approximations of convex sets based on the volumetric barrier.
CORE Discussion Paper 9745, 1997
2. Y. Arjevani, O. Shamir, R. Shiff, Oracle complexity of second-order methods for smooth
convex optimization. arXiv:1705.07260v2 (2017)
3. A. Beck, First-Order Methods in Optimization (SIAM, Philadelphia, 2017)
4. A.A. Bennet, Newton’s method in general analysis. Proc. Natl. Acad. Sci. U. S. A. 2(10), 592–
598 (1916)
5. A. Ben-Tal, A. Nemirovskii, Lectures on Modern Convex Optimization: Analysis, Algorithms,
and Engineering Applications (SIAM, Philadelphia, 2001)
6. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge,
2004)
7. S. Bubeck, Convex Optimization: Algorithms and Complexity (Now Publishers, LP Breda,
2015). arXiv:1405.4980
8. C. Cartis, N.I.M. Gould, P.L. Toint, Adaptive cubic regularisation methods for unconstrained
optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2),
245–295 (2011)
9. C. Cartis, N.I.M. Gould, P.L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
10. C. Cartis, N.I.M. Gould, P.L. Toint, How much patience do you have? A worst-case perspective on smooth nonconvex optimization. Optima 88, 1–10 (2012)
11. A.B. Conn, N.I.M. Gould, P.L. Toint. Trust Region Methods (SIAM, Philadelphia, 2000)
12. J.E. Dennis, R.B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, 2nd edn. (SIAM, Philadelphia, 1996)
13. P. Dvurechensky, Yu. Nesterov, Global performance guarantees of second-order methods for
unconstrained convex minimization, CORE Discussion Paper, 2018
14. A.V. Fiacco, G.P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques (Wiley, New York, 1968)
15. R. Fletcher, Practical Methods of Optimization, Vol. 1, Unconstrained Minimization (Wiley,
New York, 1980)
16. M. Frank, P. Wolfe, An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, 149–154
(1956)
17. R.M. Freund, P. Grigas, New analysis and results for the Frank–Wolfe method. Math. Program.
155, 199–230 (2014). https://doi.org/10.1007/s10107-014-0841-6
18. D. Garber, E. Hazan, A linearly convergent conditional gradient algorithm with application to
online and stochastic optimization. arXiv: 1301.4666v5 (2013)


19. D. Garber, E. Hazan, Faster rates for the Frank–Wolfe method over strongly convex sets.
arXiv:1406.1305v2 (2015)
20. A. Gasnikov, Yu. Nesterov, Universal method for problems of stochastic composite minimization. Comput. Math. Math. Phys. 58(1), 48–64 (2018)
21. S. Goldfeld, R. Quandt, H. Trotter, Maximization by quadratic hill climbing. Econometrica 34,
541–551 (1966)
22. A. Griewank, The modification of Newton’s method for unconstrained optimization by
bounding cubic terms, Technical Report NA/12 (1981), Department of Applied Mathematics
and Theoretical Physics, University of Cambridge, United Kingdom, 1981
23. Z. Harchaoui, A. Juditsky, A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 152, 75–112 (2014). https://doi.org/10.1007/s10107-014-0778-9
24. J.-B. Hiriart-Urruty, C. Lemarechal, Convex Analysis and Minimization Algorithms. Part 1. A
Series of Comprehensive Studies in Mathematics (Springer, Berlin, 1993)
25. M. Jaggi, Revisiting Frank–Wolfe: projection-free sparse convex optimization, in Proceedings
of the 30th International Conference on Machine Learning, Atlanta, Georgia (2013)
26. L.V. Kantorovich, Functional analysis and applied mathematics. Uspehi Mat. Nauk 3(1), 89–
185 (1948) (in Russian). Translated as N.B.S. Report 1509, Washington D.C., 1952
27. L.G. Khachiyan, Rounding of polytopes in the real number model of computation. Math. Oper.
Res. 21(2), 307–320 (1996)
28. C. Lemarechal, A. Nemirovskii, Yu. Nesterov, New variants of bundle methods. Math.
Program. 69, 111–148 (1995)
29. K. Levenberg. A method for the solution of certain problems in least squares. Q. Appl. Math.
2, 164–168 (1944)
30. D.G. Luenberger, Linear and Nonlinear Programming, 2nd edn. (Addison Wesley, Boston,
1984)
31. D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters. SIAM J.
Appl. Math. 11, 431–441 (1963)
32. R. Monteiro, B. Svaiter, An accelerated hybrid proximal extragradient method for convex
optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125
(2013)
33. A. Nemirovski, Interior-point polynomial-time methods in convex programming (1996),
https://www2.isye.gatech.edu/~nemirovs/LectIPM.pdf
34. A.S. Nemirovskij, D.B. Yudin, Problem Complexity and Method Efficiency in Optimization.
Wiley-Interscience Series in Discrete Mathematics (A Wiley-Interscience Publication/Wiley,
New York, 1983)
35. Yu. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR 269, 543–547 (1983) (In Russian; translated as Soviet Math. Docl.)
36. Yu. Nesterov, Minimization methods for nonsmooth convex and quasiconvex functions.
Ekonomika i Mat. Metody 11(3), 519–531 (1984) (In Russian; translated in MatEcon.)
37. Yu. Nesterov, Efficient Methods in Nonlinear Programming (Radio i Sviaz, Moscow, 1989) (In
Russian.)
38. Yu. Nesterov, Semidefinite relaxation and nonconvex quadratic optimization. Optim. Methods
Softw. 9, 141–160 (1998)
39. Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (Kluwer, Boston,
2004)
40. Yu. Nesterov, Smooth minimization of non-smooth functions. Math. Program. (A) 103(1), 127–
152 (2005)
41. Yu. Nesterov, Excessive gap technique in non-smooth convex minimizarion. SIAM J. Optim.
16 (1), 235–249 (2005)
42. Yu. Nesterov, Smoothing technique and its applications in semidefinite optimization. Math.
Program. 110(2), 245–259 (2007)
References 583

43. Yu. Nesterov, Modified Gauss–Newton scheme with worst-case guarantees for its global
performance. Optim. Methods Softw. 22(3), 469–483 (2007)
44. Yu. Nesterov, Rounding of convex sets and efficient gradient methods for linear programming
problems. Optim. Methods Softw. 23(1), 109–128 (2008)
45. Yu. Nesterov, Accelerating the cubic regularization of Newton’s method on convex problems.
Math. Program. 112(1), 159–181 (2008)
46. Yu. Nesterov, Unconstrained convex minimization in relative scale. Math. Oper. Res. 34(1),
180–193 (2009)
47. Yu. Nesterov, Barrier subgradient method. Math. Program. 127(1), 31–56 (2011)
48. Yu. Nesterov, How to make the gradients small. Optima 88, 10–11 (2012)
49. Yu. Nesterov, Towards non-symmetric conic optimization. Optim. Methods Softw. 27(4–5),
893–918 (2012)
50. Yu. Nesterov, Complexity bounds for primal-dual methods minimizing the model of objective
function. Math. Program. (2017). https://doi.org/10.1007/s10107-017-1188-6
51. Yu. Nesterov, A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming
(SIAM, Philadelphia, 1994)
52. Yu. Nesterov, B. Polyak, Cubic regularization of Newton’s method and its global performance.
Math. Program. 108(1), 177–205 (2006)
53. J. Nocedal, S.J. Wright, Numerical Optimization (Springer, New York, 1999)
54. J. Ortega, W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables
(Academic Press, New York, 1970)
55. B.T. Polyak, Introduction to Optimization (Optimization Software, Publications Division, New
York, 1987)
56. B.T. Polyak, History of mathematical programming in the USSR: analyzing the phenomenon.
Math. Program. 91(3), 401–416 (2002)
57. J. Renegar, A Mathematical View of Interior-Point Methods in Convex Optimization. MPS-
SIAM Series on Optimization (SIAM, Philadelphia, 2001)
58. R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1970)
59. C. Roos, T. Terlaky, J.-Ph. Vial, Theory and Algorithms for Linear Optimization: An Interior
Point Approach (Wiley, Chichester, 1997)
60. M. Todd, Minimum-Volume Ellipsoids: Theory and Algorithms. MOS-SIAM Series on Opti-
mization (SIAM, philadelphia, 2016)
61. M.J. Todd, E.A.Yildirim, On Khachiyan’s algorithm for the computation of minimum volume
enclosing ellipsoids, Technical Report, TR 1435, School of Operations Research and Industrial
Engineering, Cornell University, 2005
62. S.J. Wright, Primal-Dual Interior Point Methods (SIAM, Philadelphia, 1996)
63. Y. Ye, Interior Point Algorithms: Theory and Analysis (Wiley, Hoboken, 1997)
Index

Analytic center, 377
Antigradient, 20
Approximate centering condition, 359, 379
Approximate solution, 13
Approximation, 18
    first-order, 19
    global upper, 41
    linear, 19
    local, 18
    in ℓp-norms, 417
    quadratic, 22
    second-order, 22
Asphericity coefficient, 493

Barrier
    analytic, 224
    self-concordant, 369
    universal, 391
    volumetric, 225
Black box concept, 9

Center
    analytic, 377
    of gravity, 220
Central path, 368
    auxiliary, 383
    dual interpretation, 363
    equation, 368
Cholesky factorization, 327
Class of problems, 8
Complexity
    analytical, 9
    arithmetical, 9
    lower bounds, 12
    upper bounds, 12
Computational effort, 9
Computational stability
    for entropy function, 445
Condition number, 77
    of variable degree, 278
Cone
    normal, 177
    positive semidefinite matrices, 395
    second-order, 393
    tangent, 178
Conjugate directions, 47
Constant step scheme
    minimax problem, 125
    monotone, 98
    smooth convex functions, 93
    smooth strongly convex functions, 94
    unconstrained, 92
Constrained minimization schemes, 132
Contraction mapping, 33
Convex
    combination, 141
    differentiable function, 61
    function, 140
    set, 61
        asphericity, 377, 492
Cubic power function, 274
Cutting plane scheme, 218

Dikin ellipsoid, 337
Directional derivative, 158
Discrete approximation of integral, 470
Distance function, 111

© Springer Nature Switzerland AG 2018
Y. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications 137, https://doi.org/10.1007/978-3-319-91578-4
Domain of function, 140
Dual multipliers, 183

Epigraph, 101, 142
    constrained, 143
    facet, 166
Estimating sequences
    composite objective, 431
    for conditional gradients with contraction, 476
    for conditional gradients with composite objective, 471
    definition, 83
    second-order methods, 281
Euclidean projection, 109
    triangle inequality, 110
Excessive gap
    condition, 448
    updating rule, 450, 456

Fast gradient method, 431
    relative accuracy, 502, 503
Feasibility problem, 214
Function
    barrier, 56, 335
    β-compatible with self-concordant barrier, 403
    closed and convex, 143
    conjugate, 349, 424
    convex, 140
        nonlinear transformation, 259
        smooth approximation, 425
    strictly positive, 561
    with degree of homogeneity one, 173, 499
    Fenchel dual, 164, 349, 424
        for entropy, 438
    gradient dominated, 255
    growth measure, 199
    logarithmically homogeneous, 392
    objective, 4
        smooth approximation, 429
    positively homogeneous, 173
    self-concordant, 330
    star-convex, 252
    strongly convex, 74
        growth property, 105
        lower complexity bounds, 212
    uniformly convex, 274
        growth condition, 277
Functional constraints, 4

General iterative scheme, 8
Global optimality certificate, 52
Gradient, 19, 271
    mapping, 112
    reduced, 112

Hessian, 22, 272
Hölder condition
    gradient, 469, 479
    Hessian, 469
Hyperplane
    separating, 160
    supporting, 160

Inequality
    Cauchy–Schwarz, 20, 66, 273, 437
    Jensen, 141
    von Neumann, 463
Infinity norm, 146
Informational set, 8
Inner product, 4
    standard, 328

Kelley's method, 226
Krylov subspace, 46

Lagrange
    dual problem, 51
    function, 51
    multipliers, 183
Lagrangian, 51, 107
    dual problem, 108
    relaxation, 50
Level set, 19, 101, 143, 243
Levenberg–Marquardt regularization, 242
Linear operator
    adjoint, 271, 306
    dual non-degeneracy, 306
    norm, 273, 426, 437, 439, 467, 501
    positive semidefinite, 272
    primal non-degeneracy, 306
    of rank one, 273
    self-adjoint, 272
Linear optimization oracle
    composite form, 468
Lipschitz condition
    gradient, 24, 66, 430
        of smooth approximation, 425, 429, 448
    Hessian, 243, 268, 273, 328, 329, 364
        cubic power function, 277
    high-order derivative, 278
    Jacobian, 308
Local decrease of Gauss–Newton model, 310
Localization set, 200
Local norm, 541
    dual, 344
    primal, 337
Logarithmic barrier
    for ellipsoid, 332
    for level set of self-concordant function, 335
    for ray, 331, 370
    for second-order region, 370
    standard, 392

Making decisions in uncertain environment, 555
Matrix
    positive definite, 21
    positive semidefinite, 20
Maximal eigenvalue
    entropy smoothing, 465
    of symmetric matrix, 466
Max-type function, 118
Measure of
    local optimality, 246
Method of
    analytic centers, 224
    barrier functions, 57
    centers of gravity, 220
    conditional gradients
        with composite objective, 470
        with contraction, 475
    conjugate gradient, 46, 49
    ellipsoids, 222
    fast gradient, 88
    Gauss–Newton
        modified, 309
        standard, 306
    gradient, 28, 80, 114
    inscribed ellipsoids, 224
    optimal, 14, 32, 88
    path-following, 381
    penalty functions, 55
    primal-dual, 191, 479
    quasi-Newton, 42
    similar triangles, 431
    trust regions, 242
        composite form with contraction, 483
    uniform grid, 10
    variable metric, 42, 44
    volumetric centers, 225
Minimax
    principle, 50
    strategies for matrix games, 436
    theorem, 156, 189
Minimax problem, 117
    gradient method, 123
    optimal method, 126
Minimum
    global, 5
    local, 5
Minkowski function, 151
Mixed accuracy, 567
Model of
    convex function, 226
        linear, 192
    minimization problem, 7
        barrier, 327, 378
        functional, 10
    objective function, 427

Newton's method
    affine invariance, 329
    cubic regularization, 247, 249, 299
        accelerated, 284
        backtracking strategy, 267
        optimal conceptual version, 300
    damped, 37, 242, 348, 353
    intermediate, 353
    standard, 36, 242, 328, 353
Newton system, 36
No-gap property, 193
Non-degenerate
    global minimum, 253
    saddle point, 248
    sum of squares, 256
Norm
    dual, 65, 273, 306, 492, 500, 513
    Euclidean, 19
    Frobenius, 395, 506
    ℓ1, 146
    ℓp, 146, 408, 461
    ℓp-matrix, 462
    ℓ∞, 10, 146
    self-dual, 66
    squared ℓp-matrix, 464
Optimality condition
    composite form, 176
    constrained problem, 105
    first-order, 20, 61
        with equalities, 21
    minimax problem, 118
    nonsmooth convex problem, 167, 179
    second-order, 22
Optimality measure
    first-order, 476
    second-order, 484
Oracle
    first-order, 10
    local black box, 9
    resisting, 12, 13, 215
    second-order, 10
    zero-order, 10

Parameter of
    centering, 359, 379
    self-concordant barrier, 369
    smoothing, 429
    uniform convexity, 274
Partial minimization of convex function, 187
Penalty
    function, 54
    for set, 54
Performance
    on a problem, 7
    on a problem class, 7
Piece-wise linear optimization, 443
Polar set, 390
Polynomial methods, 224
Portfolio management, 559
Positive orthant, 391
Problem
    adjoint form, 427
    of approximation in ℓp-norms, 417, 418
    of composite optimization, 430
    of conic unconstrained optimization, 491
    constrained, 4
    of continuous location, 439
    feasible, 5
    general, 4
    of geometric optimization, 416
    of integer optimization, 6
    linearly constrained, 4
    nonsmooth, 4
    NP-hard, 15
    quadratically constrained quadratic, 4, 393
    of quadratic optimization, 4
    of semidefinite optimization, 396
    of separable optimization, 415
    smooth, 4
    strictly feasible, 5
    unconstrained, 4
    unsolvable, 14
Production processes with full cycle, 559
Prox-function, 430, 447
    definition, 429
    entropy distance, 437
    Euclidean distance, 436, 501

Quasi-Newton
    method
        Broyden–Fletcher–Goldfarb–Shanno, 45
        Davidon–Fletcher–Powell, 45
        rank-one correction, 45
        for strictly positive functions, 564
    rule, 44

Rate of convergence, 31
    gradient method, 35
    linear, 40
    Newton's method, 39
    quadratic, 40
    sublinear, 39
Recession direction, 390
Region of quadratic convergence
    distance to optimum, 328
    function value, 288
    norm of the gradient, 288
Regularization technique, 292
Relative accuracy, 490, 568
    for bilinear matrix games, 532
    for fractional covering problem, 551
    for linear optimization, 504
    for maximal concurrent flow problem, 552
    for maximizing positive concave function, 548
    for minimax problems with nonnegative component, 553
    for minimizing maximal absolute values, 527
    for minimizing the spectral radius, 506, 535
    for semidefinite relaxation of boolean quadratic problems, 554
    for truss topology design, 509
Relaxation, 18
    sequence, 18
Restarting strategy, 49
Rounding ellipsoids
    for centrally symmetric sets, 513
    for general convex sets, 519
    for sign-invariant sets, 523

Scalar product, 272
    Frobenius, 461, 506
    standard, 4, 328, 461
Self-concordant
    barrier
        for cone of positive semidefinite matrices, 397
        definition, 369
        for epigraph of entropy function, 408
        for epigraph of ℓp-norm, 408
        for epigraph of self-concordant barrier, 372
        for geometric mean, 412
        for hypograph of exponent of self-concordant barrier, 413
        for level set of self-concordant function, 372
        logarithmically homogeneous, 392
        for Lorentz cone, 393
        for matrix epigraph of inverse matrix, 414
        for power cone, 406
    function
        barrier property, 335
        definition, 330
        local rate of convergence, 355
        necessary and sufficient conditions, 342
        non-degeneracy of Hessian, 338
        standard, 330
Sequential quadratic optimization, 128
Sequential unconstrained minimization, 50
Set
    convex, 61
    feasible, 4
        basic, 4
    sign-invariant, 523
Singular value
    minimal, 306
Slater condition, 5, 56, 182
Solution
    approximate, 8, 69, 77, 195
    global, 5
    local, 5
Standard
    minimization problem, 367
    simplex, 171
Stationary point, 21
Step-size strategy, 28
    Armijo rule, 28
    full relaxation, 28
Strong separation, 160
Structural constraints, 5
Subdifferential, 162
    constrained, 162
Subgradient, 162
Subgradient method
    for finding Lagrange multipliers, 207
    functional constraints, 205
    for relative accuracy, 496
        restarting strategy, 498
    simple set, 202
Support function, 151
Supporting vector, 161

Theorem
    Euler, 173
    on implicit self-concordant barrier, 374
    John
        general convex sets, 522
        for symmetric sets, 518
    Karush–Kuhn–Tucker, 182
    on recession direction, 348
    von Neumann, 189
Third directional derivative, 329
Total variation of linear model, 476

Uniform dual non-degeneracy, 314
Unit ball, 146
Univariate convex function, 150, 153, 167, 345, 470, 571

Variational inequalities with linear operator, 440
