A Stochastic Successive Minimization Method for Nonsmooth Nonconvex Optimization with Applications to Transceiver Design in Wireless Communication Networks

  • Full Length Paper
  • Published in Mathematical Programming

Abstract

Consider the problem of minimizing the expected value of a cost function parameterized by a random variable. The classical sample average approximation method for solving this problem requires minimization of an ensemble average of the objective at each step, which can be expensive. In this paper, we propose a stochastic successive upper-bound minimization method (SSUM) which minimizes an approximate ensemble average at each iteration. To ensure convergence and to facilitate computation, we require the approximate ensemble average to be a locally tight upper-bound of the expected cost function and be easily optimized. The main contributions of this work include the development and analysis of the SSUM method as well as its applications in linear transceiver design for wireless communication networks and online dictionary learning. Moreover, using the SSUM framework, we extend the classical stochastic (sub-)gradient method to the case of minimizing a nonsmooth nonconvex objective function and establish its convergence.
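
For concreteness, the SSUM iteration described above can be sketched in a few lines of code: at iteration \(r\), a fresh sample \(\xi ^r\) is drawn and the running average of surrogate upper bounds \(\frac{1}{r} \sum _{i=1}^{r} {\hat{g}}(x, x^{i-1}, \xi ^i)\) is minimized. The sketch below is illustrative only; the names surrogate and sample_xi, and the use of a generic numerical solver, are assumptions of this example rather than part of the paper.

```python
import numpy as np
from scipy.optimize import minimize


def ssum(surrogate, sample_xi, x0, iters=100):
    """Sketch of the SSUM loop: x^r minimizes the running average of
    surrogate upper bounds g_hat(x, x^{i-1}, xi^i), i = 1, ..., r.

    surrogate(x, y, xi) should upper-bound the sample cost g(x, xi)
    and be tight at x = y, as the convergence analysis requires.
    """
    x = np.asarray(x0, dtype=float)
    history = []  # pairs (x^{i-1}, xi^i)
    for r in range(1, iters + 1):
        history.append((x.copy(), sample_xi()))  # draw xi^r
        # Approximate ensemble average of the surrogates at iteration r.
        f_hat = lambda z: np.mean([surrogate(z, y, xi) for (y, xi) in history])
        x = minimize(f_hat, x).x  # x^r = argmin_x f_hat^r(x)
    return x
```

This naive sketch re-evaluates all \(r\) surrogate terms at every iteration; practical instances choose surrogates whose running average can be maintained through a few sufficient statistics, so that the per-iteration cost stays constant.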


Notes

  1. The family of functions \(\mathcal {F}\) is uniformly equicontinuous over the set \({\mathcal {X}}\) if for every \(\epsilon >0\), there exists a \(\delta >0\) such that \(|f(x) - f(y)| <\epsilon , \;\forall f \in \mathcal {F}\) and for all \(x ,y \in {\mathcal {X}}\) with \(|x - y| <\delta \).
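
     For instance, the family \(\mathcal {F} = \{f_a : f_a(x) = \sin (ax), \; |a| \le 1\}\) is uniformly equicontinuous on \(\mathbb {R}\): since \(|f_a(x) - f_a(y)| \le |a| |x - y| \le |x - y|\) for every member, one may take \(\delta = \epsilon \).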

References

  1. Plambeck, E.L., Fu, B.R., Robinson, S.M., Suri, R.: Sample-path optimization of convex stochastic performance functions. Math. Program. 75(2), 137–176 (1996)

  2. Robinson, S.M.: Analysis of sample-path optimization. Math. Oper. Res. 21(3), 513–528 (1996)

  3. Healy, K., Schruben, L.W.: Retrospective simulation response optimization. In: Proceedings of the 23rd Conference on Winter Simulation, pp. 901–906. IEEE Computer Society (1991)

  4. Rubinstein, R.Y., Shapiro, A.: Optimization of static simulation models by the score function method. Math. Comput. Simul. 32(4), 373–392 (1990)

  5. Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, vol. 346. Wiley, New York (1993)

  6. Shapiro, A.: Monte Carlo sampling methods. Handb. Oper. Res. Manag. Sci. 10, 353–426 (2003)

  7. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory, vol. 9. Society for Industrial and Applied Mathematics, Philadelphia (2009)

  8. Kim, S., Pasupathy, R., Henderson, S.: A guide to sample average approximation. In: Fu, M.C. (ed.) Handbook of Simulation Optimization, pp. 207–243. Springer, New York (2015)

  9. Razaviyayn, M., Hong, M., Luo, Z.Q.: A unified convergence analysis of block successive minimization methods for non-smooth optimization. arXiv preprint, arXiv:1209.2385 (2012)

  10. Yuille, A.L., Rangarajan, A.: The concave–convex procedure. Neural Comput. 15, 915–936 (2003)

  11. Borman, S.: The expectation maximization algorithm—a short tutorial. Unpublished paper http://ftp.csd.uwo.ca/faculty/olga/Courses/Fall2006/Papers/EM_algorithm.pdf

  12. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38 (1977)

  13. Fristedt, B.E., Gray, L.F.: A Modern Approach to Probability Theory. Birkhäuser, Boston (1996)

  14. Dunford, N., Schwartz, J.T.: Linear Operators. Part 1: General Theory. Interscience Publications, New York (1958)

  15. Bastin, F., Cirillo, C., Toint, P.L.: Convergence theory for nonconvex stochastic programming with an application to mixed logit. Math. Program. 108(2–3), 207–234 (2006)

  16. Larsson, E., Jorswieck, E.: Competition versus cooperation on the MISO interference channel. IEEE J. Sel. Areas Commun. 26, 1059–1069 (2008)

  17. Razaviyayn, M., Luo, Z.Q., Tseng, P., Pang, J.S.: A Stackelberg game approach to distributed spectrum management. Math. Program. 129, 197–224 (2011)

  18. Bengtsson, M., Ottersten, B.: Optimal and suboptimal transmit beamforming. In: Godara, L.C. (ed.) Handbook of Antennas in Wireless Communications. CRC Press, Boca Raton (2001)

  19. Shi, C., Berry, R.A., Honig, M.L.: Local interference pricing for distributed beamforming in MIMO networks. In: Military Communications Conference, MILCOM, pp. 1–6 (2009)

  20. Kim, S.J., Giannakis, G.B.: Optimal resource allocation for MIMO ad hoc cognitive radio networks. IEEE Trans. Inf. Theory 57, 3117–3131 (2011)

  21. Shi, Q., Razaviyayn, M., Luo, Z.Q., He, C.: An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel. IEEE Trans. Signal Process. 59, 4331–4340 (2011)

  22. Razaviyayn, M., Sanjabi, M., Luo, Z.Q.: Linear transceiver design for interference alignment: complexity and computation. IEEE Trans. Inf. Theory 58, 2896–2910 (2012)

  23. Scutari, G., Facchinei, F., Song, P., Palomar, D.P., Pang, J.S.: Decomposition by partial linearization: parallel optimization of multi-agent systems. arXiv preprint, arXiv:1302.0756 (2013)

  24. Scutari, G., Palomar, D.P., Facchinei, F., Pang, J.S.: Distributed dynamic pricing for MIMO interfering multiuser systems: a unified approach. In: 2011 5th International Conference on Network Games, Control and Optimization (NetGCooP), pp. 1–5 (2011)

  25. Hong, M., Luo, Z.Q.: Signal processing and optimal resource allocation for the interference channel. arXiv preprint, arXiv:1206.5144 (2012)

  26. Wajid, I., Eldar, Y.C., Gershman, A.: Robust downlink beamforming using covariance channel state information. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 2285–2288 (2009)

  27. Vucic, N., Boche, H.: Downlink precoding for multiuser MISO systems with imperfect channel knowledge. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3121–3124 (2008)

  28. Song, E., Shi, Q., Sanjabi, M., Sun, R., Luo, Z.Q.: Robust SINR-constrained MISO downlink beamforming: When is semidefinite programming relaxation tight? In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3096–3099 (2011)

  29. Tajer, A., Prasad, N., Wang, X.: Robust linear precoder design for multi-cell downlink transmission. IEEE Trans. Signal Process. 59, 235–251 (2011)

  30. Shenouda, M., Davidson, T.N.: On the design of linear transceivers for multiuser systems with channel uncertainty. IEEE J. Sel. Areas Commun. 26, 1015–1024 (2008)

  31. Li, W.C., Chang, T.H., Lin, C., Chi, C.Y.: Coordinated beamforming for multiuser MISO interference channel under rate outage constraints. IEEE Trans. Signal Process. 61(5), 1087–1103 (2013)

  32. Negro, F., Ghauri, I., Slock, D.: Sum rate maximization in the noisy MIMO interfering broadcast channel with partial CSIT via the expected weighted MSE. In: International Symposium on Wireless Communication Systems, ISWCS, pp. 576–580 (2012)

  33. Razaviyayn, M., Baligh, H., Callard, A., Luo, Z.Q.: Joint transceiver design and user grouping in a MIMO interfering broadcast channel. In: 45th Annual Conference on Information Sciences and Systems (CISS) pp. 1–6 (2011)

  34. Shi, Q., Razaviyayn, M., Luo, Z.Q., He, C.: An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel. IEEE Trans. Signal Process. 59, 4331–4340 (2011)

  35. Guo, D., Shamai, S., Verdú, S.: Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inf. Theory 51(4), 1261–1282 (2005)

  36. Sampath, H., Stoica, P., Paulraj, A.: Generalized linear precoder and decoder design for MIMO channels using the weighted MMSE criterion. IEEE Trans. Commun. 49(12), 2198–2206 (2001)

  37. Hong, M., Sun, R., Baligh, H., Luo, Z.Q.: Joint base station clustering and beamformer design for partial coordinated transmission in heterogeneous networks. IEEE J. Sel. Areas Commun. 31(2), 226–240 (2013)

  38. 3GPP TR 36.814. http://www.3gpp.org/ftp/specs/archive/36_series/36.814/

  39. Yousefian, F., Nedić, A., Shanbhag, U.V.: On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica 48(1), 56–67 (2012)

  40. Razaviyayn, M., Sanjabi, M., Luo, Z.Q.: A stochastic weighted MMSE approach to sum rate maximization for a MIMO interference channel. In: IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 325–329 (2013)

  41. Luo, Z.Q., Zhang, S.: Dynamic spectrum management: complexity and duality. IEEE J. Sel. Top. Signal Process. 2(1), 57–73 (2008)

  42. Weeraddana, P.C., Codreanu, M., Latva-aho, M., Ephremides, A., Fischione, C.: Weighted Sum-Rate Maximization in Wireless Networks: A Review. Now Publishers, Hanover (2012)

  43. Christensen, S.S., Agarwal, R., Carvalho, E., Cioffi, J.M.: Weighted sum-rate maximization using weighted MMSE for MIMO-BC beamforming design. IEEE Trans. Wirel. Commun. 7(12), 4792–4799 (2008)

  44. Schmidt, D.A., Shi, C., Berry, R.A., Honig, M.L., Utschick, W.: Minimum mean squared error interference alignment. In: Forty-Third Asilomar Conference on Signals, Systems and Computers, pp. 1106–1110 (2009)

  45. Negro, F., Shenoy, S.P., Ghauri, I., Slock, D.T.: On the MIMO interference channel. In: Information Theory and Applications Workshop (ITA), 2010, pp. 1–9 (2010)

  46. Shin, J., Moon, J.: Weighted sum rate maximizing transceiver design in MIMO interference channel. In: Global Telecommunications Conference (GLOBECOM), pp. 1–5 (2011)

  47. Razaviyayn, M., Baligh, H., Callard, A., Luo, Z.Q.: Joint transceiver design and user grouping in a MIMO interfering broadcast channel. In: 45th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6 (2011)

  48. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: design of dictionaries for sparse representation. Proc. SPARS 5, 9–12 (2005)

  49. Lewicki, M.S., Sejnowski, T.J.: Learning overcomplete representations. Neural Comput. 12(2), 337–365 (2000)

  50. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010)

  51. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena-Scientific, Belmont (1999)

  52. Razaviyayn, M., Tseng, H.W., Luo, Z.Q.: Dictionary learning for sparse representation: Complexity and algorithms. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5247–5251 (2014)

  53. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  54. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)

  55. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  56. Koshal, J., Nedić, A., Shanbhag, U.V.: Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Trans. Autom. Control 58(3), 594–609 (2013)

  57. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  58. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25, 463–483 (1954)

  59. Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stoch. Int. J. Probab. Stoch. Process. 9(1–2), 1–36 (1983)

  60. Amari, S.: A theory of adaptive pattern classifiers. IEEE Trans. Electron. Comput. 16(3), 299–307 (1967)

  61. Wijnhoven, R., de With, P.H.N.: Fast training of object detection using stochastic gradient descent. In: Proceedings of IEEE International Conference on Pattern Recognition (ICPR), pp. 424–427 (2010)

  62. Grippo, L.: Convergent on-line algorithms for supervised learning in neural networks. IEEE Trans. Neural Netw. 11(6), 1284–1299 (2000)

  63. Mangasarian, O.L., Solodov, M.V.: Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Optim. Methods Softw. 4(2), 103–116 (1994)

  64. Luo, Z.Q.: On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Comput. 3(2), 226–245 (1991)

  65. Luo, Z.Q., Tseng, P.: Analysis of an approximate gradient projection method with applications to the backpropagation algorithm. Optim. Methods Softw. 4(2), 85–101 (1994)

  66. Bottou, L.: Online learning and stochastic approximations. In: On-Line Learning in Neural Networks, vol. 17(9) (1998)

  67. Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)

  68. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, 2nd edn. Athena-Scientific, Belmont (1999)

  69. Tsitsiklis, J., Bertsekas, D.P., Athans, M.: Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control 31(9), 803–812 (1986)

  70. Bertsekas, D.P.: Distributed asynchronous computation of fixed points. Math. Program. 27(1), 107–120 (1983)

  71. Ermol’ev, Y.M., Norkin, V.I.: Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybern. Syst. Anal. 34(2), 196–215 (1998)

  72. Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)

  73. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Optimization for Machine Learning 2010, pp. 1–38 (2011)

  74. Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

  75. George, A.P., Powell, W.B.: Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Mach. Learn. 65(1), 167–198 (2006)

  76. Broadie, M., Cicek, D., Zeevi, A.: General bounds and finite-time improvement for the Kiefer–Wolfowitz stochastic approximation algorithm. Oper. Res. 59(5), 1211–1224 (2011)

  77. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  78. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)

  79. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  80. Fisk, D.L.: Quasi-martingales. Trans. Am. Math. Soc. 120, 369–389 (1965)

  81. Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)

  82. Hewitt, E., Savage, L.J.: Symmetric measures on Cartesian products. Trans. Am. Math. Soc. 80, 470–501 (1955)

  83. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer, New York (2000)

Acknowledgments

The authors are grateful to Maury Bramson of the University of Minnesota for valuable discussions.

Author information

Correspondence to Meisam Razaviyayn.

Appendix

Proof of Lemma 1

First, observe that \({\hat{f}}_1^r(x^r) - f_1^r(x^r) = {\hat{f}}^r(x^r) - f^r(x^r)\) by the definition of the functions \({\hat{f}}(\cdot )\) and \(f(\cdot )\). Therefore it suffices to show that \(\lim _{r\rightarrow \infty } {\hat{f}}^r(x^r) - f^r(x^r)=0,\;\mathrm{almost \;surely}\). The proof of this fact relies on the quasi-martingale convergence theorem [80], much like the convergence proofs of online learning algorithms [50, Proposition 3]. In particular, we first show that the sequence \(\{{\hat{f}}^r(x^r)\}_{r=1}^{\infty }\) converges almost surely. Notice that

$$\begin{aligned}&{\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r) \nonumber \\&\quad = {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^{r+1}(x^r) + {\hat{f}}^{r+1}(x^r)- {\hat{f}}^r(x^r) \nonumber \\&\quad = {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^{r+1}(x^r) + \frac{1}{r+1} \sum _{i=1}^{r+1} {\hat{g}}(x^r,x^{i-1},\xi ^i) - \frac{1}{r} \sum _{i=1}^r {\hat{g}}(x^r,x^{i-1},\xi ^i) \nonumber \\&\quad = {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^{r+1}(x^r) - \frac{1}{r(r+1)} \sum _{i=1}^r {\hat{g}}(x^r,x^{i-1},\xi ^i) + \frac{1}{r+1} {\hat{g}}(x^r,x^r,\xi ^{r+1}) \nonumber \\&\quad = {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^{r+1}(x^r) - \frac{{\hat{f}}^r(x^r)}{r+1} + \frac{1}{r+1} g(x^r,\xi ^{r+1})\nonumber \\&\quad \le \frac{-{\hat{f}}^r(x^r) + g(x^r,\xi ^{r+1})}{r+1}, \end{aligned}$$

where the third equality uses the elementary averaging identity \(\frac{1}{r+1} \sum _{i=1}^{r+1} a_i - \frac{1}{r} \sum _{i=1}^{r} a_i = \frac{a_{r+1}}{r+1} - \frac{1}{r(r+1)} \sum _{i=1}^{r} a_i\) with \(a_i = {\hat{g}}(x^r,x^{i-1},\xi ^i)\), the last equality is due to the assumption A1, and the inequality is due to the update rule of the SSUM algorithm (\(x^{r+1}\) minimizes \({\hat{f}}^{r+1}(\cdot )\), so \({\hat{f}}^{r+1}(x^{r+1}) \le {\hat{f}}^{r+1}(x^r)\)). Taking the expectation with respect to the natural history yields

$$\begin{aligned} \mathbb {E}\left[ {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r)\bigg |\mathcal {F}^r\right]&\le \mathbb {E}\left[ \frac{-{\hat{f}}^r(x^r) + g(x^r,\xi ^{r+1})}{r+1}\bigg |\mathcal {F}^r\right] \nonumber \\&= \frac{-{\hat{f}}^r(x^r)}{r+1} + \frac{f(x^r)}{r+1} \nonumber \\&= \frac{-{\hat{f}}^r(x^r) + f^r(x^r)}{r+1} + \frac{f(x^r) - f^r(x^r)}{r+1} \end{aligned}$$
(43)
$$\begin{aligned}&\le \frac{f(x^r) - f^r(x^r)}{r+1} \end{aligned}$$
(44)
$$\begin{aligned}&\le \frac{\Vert f-f^r\Vert _\infty }{r+1}, \end{aligned}$$
(45)

where (44) is due to the assumption A2 and (45) follows from the definition of the sup-norm \(\Vert f - f^r\Vert _\infty \triangleq \sup _{x \in {\mathcal {X}}} |f(x) - f^r(x)|\). On the other hand, the Donsker theorem (see [50, Lemma 7] and [81, Chapter 19]) implies that there exists a constant \(k>0\) such that

$$\begin{aligned} \mathbb {E}\left[ \Vert f-f^r\Vert _\infty \right] \le \frac{k}{\sqrt{r}}. \end{aligned}$$
(46)

Combining (45) and (46), and using the monotonicity of the positive-part operator together with \(\frac{1}{(r+1)\sqrt{r}} \le \frac{1}{r^{3/2}}\), yields

$$\begin{aligned} \mathbb {E} \left[ \left( \mathbb {E}\left[ {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r)\bigg |\mathcal {F}^r\right] \right) _+\right] \le \frac{k}{r^{3/2}}, \end{aligned}$$
(47)

where \((a)_+ \triangleq \max \{0,a\}\) denotes the positive part. Summing (47) over r, we obtain

$$\begin{aligned} \sum _{r=1}^\infty \mathbb {E} \left[ \left( \mathbb {E}\left[ {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r)\bigg |\mathcal {F}^r\right] \right) _+\right] \le M < \infty , \end{aligned}$$
(48)

where \(M \triangleq \sum _{r=1}^\infty \frac{k}{r^{3/2}} < \infty \). The bound (48), combined with the quasi-martingale convergence theorem (see [80] and [50, Theorem 6]), implies that the stochastic process \(\{{\hat{f}}^r(x^r) + \bar{g}\}_{r=1}^{\infty }\) is a quasi-martingale with respect to the natural history \(\{\mathcal {F}^r\}_{r=1}^{\infty }\), and hence \({\hat{f}}^r(x^r)\) converges almost surely. Moreover, we have

$$\begin{aligned} \sum _{r=1}^\infty \bigg |\mathbb {E}\left[ {\hat{f}}^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r)\big |\mathcal {F}^r\right] \bigg | < \infty ,\quad \mathrm{almost\;surely.} \end{aligned}$$
(49)

Next we use (49) to show that \(\sum _{r=1}^\infty \frac{{\hat{f}}^r(x^r) - f^r(x^r)}{r+1} <\infty ,\; \mathrm{almost\;surely}\). To this end, let us rewrite (43) as

$$\begin{aligned} \frac{{\hat{f}}^r(x^r) - f^r(x^r)}{r+1}\le \mathbb {E}\left[ -{\hat{f}}^{r+1}(x^{r+1}) + {\hat{f}}^r(x^r)\bigg |\mathcal {F}^r\right] + \frac{f(x^r) - f^r(x^r)}{r+1}. \end{aligned}$$
(50)

Using the fact that \({\hat{f}}^r(x^r) \ge f^r(x^r),\;\forall \ r\) and summing (50) over all values of r, we have

$$\begin{aligned} 0\le & {} \sum _{r=1}^\infty \frac{{\hat{f}}^r(x^r) - f^r(x^r)}{r+1}\nonumber \\\le & {} \sum _{r=1}^\infty \bigg |\mathbb {E}\left[ -{\hat{f}}^{r+1}(x^{r+1}) + {\hat{f}}^r(x^r)\big |\mathcal {F}^r\right] \bigg | + \sum _{r=1}^\infty \frac{\Vert f-f^r\Vert _\infty }{r+1}. \end{aligned}$$
(51)

Notice that the first term in the right hand side is finite due to (49). Hence in order to show \(\sum _{r=1}^\infty \frac{{\hat{f}}^r(x^r) - f^r(x^r)}{r+1}<\infty ,\;\mathrm{almost\;surely}\), it suffices to show that \(\sum _{r=1}^\infty \frac{\Vert f-f^r\Vert _\infty }{r+1} < \infty ,\;\mathrm{almost\;surely}\). To show this, we use the Hewitt-Savage zero-one law; see [82, Theorem 11.3] and [13, Chapter 12, Theorem 19]. Let us define the event

$$\begin{aligned} \mathcal {A} \triangleq \left\{ (\xi ^1,\xi ^2,\ldots ) \mid \sum _{r=1}^\infty \frac{\Vert f^r-f\Vert _\infty }{r+1}<\infty \right\} . \end{aligned}$$

It can be checked that the event \(\mathcal {A}\) is permutable, i.e., any finite permutation of the coordinates of an element of \(\mathcal {A}\) remains in \(\mathcal {A}\); see [82, Theorem 11.3] and [13, Chapter 12, Theorem 19]. Therefore, by the Hewitt-Savage zero-one law [82], the probability of the event \(\mathcal {A}\) is either zero or one. On the other hand, it follows from (46) that there exists \(M'>0\) such that

$$\begin{aligned} \mathbb {E}\left[ \sum _{r=1}^\infty \frac{\Vert f^r-f\Vert _\infty }{r+1}\right] \le M' < \infty . \end{aligned}$$
(52)

Using Markov’s inequality, (52) implies that

$$\begin{aligned} Pr\left( \sum _{r=1}^\infty \frac{\Vert f^r-f\Vert _\infty }{r+1} > 2M'\right) \le \frac{1}{2}. \end{aligned}$$

Hence \(Pr(\mathcal {A}) \ge \frac{1}{2} > 0\), and the Hewitt-Savage zero-one law then forces \(Pr(\mathcal {A}) = 1\); or equivalently

$$\begin{aligned} \sum _{r=1}^\infty \frac{\Vert f^r-f\Vert _\infty }{r+1} < \infty , \quad \mathrm{almost \; surely}. \end{aligned}$$
(53)

As a result of (51) and (53), we have

$$\begin{aligned} 0 \le \sum _{r=1}^\infty \frac{{\hat{f}}^r(x^r) - f^r(x^r)}{r+1} < \infty , \quad \mathrm{almost \; surely}. \end{aligned}$$
(54)

On the other hand, it follows from the triangle inequality that

$$\begin{aligned}&\bigg |{\hat{f}}^{r+1}(x^{r+1}) - f^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r) + f^r(x^r)\bigg |\nonumber \\&\quad \le \bigg | {\hat{f}}^{r+1}(x^{r+1})- {\hat{f}}^r(x^r)\bigg | + \bigg | f^{r+1}(x^{r+1})- f^r(x^r)\bigg | \end{aligned}$$
(55)

and

$$\begin{aligned}&\bigg | {\hat{f}}^{r+1}(x^{r+1})- {\hat{f}}^r(x^r)\bigg | \nonumber \\&\quad \le \bigg | {\hat{f}}^{r+1}(x^{r+1})- {\hat{f}}^{r+1}(x^r)\bigg | + \bigg | {\hat{f}}^{r+1}(x^r)- {\hat{f}}^r(x^r)\bigg | \nonumber \\&\quad \le \kappa \Vert x^{r+1} - x^r\Vert +\left| \frac{1}{r+1} \sum _{i=1}^{r+1} {\hat{g}}(x^r,x^{i-1},\xi ^i)- \frac{1}{r} \sum _{i=1}^{r} {\hat{g}}(x^r,x^{i-1},\xi ^i) \right| \end{aligned}$$
(56)
$$\begin{aligned}&\quad \le \kappa \Vert x^{r+1} - x^r\Vert +\left| \frac{1}{r(r+1)} \sum _{i=1}^{r} {\hat{g}}(x^r,x^{i-1},\xi ^i) + \frac{{\hat{g}}(x^r,x^r,\xi ^{r+1})}{r+1}\right| \nonumber \\&\quad \le \kappa \Vert x^{r+1} - x^r\Vert + \frac{2\bar{g}}{r+1} \end{aligned}$$
(57)
$$\begin{aligned}&\quad = \mathcal {O}\left( \frac{1}{r}\right) , \end{aligned}$$
(58)

where (56) is due to the assumption B3 (with \(\kappa = K+K'\)), (57) follows from the assumption B6, and (58) is established in Lemma 2 below. Similarly, one can show that

$$\begin{aligned} |f^{r+1}(x^{r+1}) - f^r(x^r)| = \mathcal {O}\left( \frac{1}{r}\right) . \end{aligned}$$
(59)

It follows from (55), (58), and (59) that

$$\begin{aligned} \bigg |{\hat{f}}^{r+1}(x^{r+1}) - f^{r+1}(x^{r+1}) - {\hat{f}}^r(x^r) + f^r(x^r)\bigg | = \mathcal {O}\left( \frac{1}{r}\right) . \end{aligned}$$
(60)

Let us fix a random realization \(\{\xi ^r\}_{r=1}^\infty \) in the set of probability one for which (54) and (60) hold. Define

$$\begin{aligned} \varpi ^r \triangleq {\hat{f}}^r(x^r) - f^r(x^r). \end{aligned}$$

Clearly, \(\varpi ^r \ge 0\), and \(\sum _r \frac{\varpi ^r}{r}<\infty \) due to (54) (note that \(\frac{\varpi ^r}{r} \le \frac{2\varpi ^r}{r+1}\)). Moreover, it follows from (60) that \(|\varpi ^{r+1}-\varpi ^r| \le \frac{\tau }{r}\) for some constant \(\tau >0\). Hence Lemma 3 implies that

$$\begin{aligned} \lim _{r \rightarrow \infty } \; \varpi ^r = 0, \end{aligned}$$

which is the desired result. \(\square \)

Lemma 2

\(\Vert x^{r+1} - x^r\Vert = \mathcal {O}(\frac{1}{r}).\)

Proof

The proof of this lemma is similar to the proof of [50, Lemma 1]; see also [83, Proposition 4.32]. First, since \(x^r\) is the minimizer of \({\hat{f}}^r(\cdot )\), the first-order optimality condition implies

$$\begin{aligned} {\hat{f}}^{r\prime }(x^r;d) \ge 0, \quad \forall \ d\in \mathbb {R}^n, \end{aligned}$$

where \({\hat{f}}^{r\prime }(x^r;d)\) denotes the directional derivative of \({\hat{f}}^r\) at \(x^r\) in the direction \(d\). Hence, it follows from the assumption A3 that

$$\begin{aligned} {\hat{f}}^r(x^{r+1}) - {\hat{f}}^r(x^r) \ge \frac{\gamma }{2} \Vert x^{r+1} - x^r\Vert ^2. \end{aligned}$$
(61)

On the other hand,

$$\begin{aligned} {\hat{f}}^r(x^{r+1}) - {\hat{f}}^r(x^r)&\le {\hat{f}}^r(x^{r+1}) - {\hat{f}}^{r+1}(x^{r+1}) + {\hat{f}}^{r+1}(x^r) - {\hat{f}}^r(x^r) \end{aligned}$$
(62)
$$\begin{aligned}&\le \frac{1}{r(r+1)} \sum _{i=1}^r |{\hat{g}}(x^{r+1},x^{i-1},\xi ^i) - {\hat{g}}(x^r,x^{i-1},\xi ^i)| \nonumber \\&\quad + \frac{1}{r+1}|{\hat{g}}(x^{r+1},x^r,\xi ^{r+1}) - {\hat{g}}(x^r,x^r,\xi ^{r+1})| \nonumber \\&\le \frac{\theta }{r+1} \Vert x^{r+1} - x^r\Vert , \end{aligned}$$
(63)

where (62) follows from the fact that \(x^{r+1}\) is the minimizer of \({\hat{f}}^{r+1}(\cdot )\) (so that \({\hat{f}}^{r+1}(x^{r+1}) \le {\hat{f}}^{r+1}(x^r)\)), the second inequality is due to the definitions of \({\hat{f}}^r\) and \({\hat{f}}^{r+1}\), while (63) is the result of the assumptions B3 and B5. Combining (61) and (63) gives \(\frac{\gamma }{2} \Vert x^{r+1} - x^r\Vert ^2 \le \frac{\theta }{r+1} \Vert x^{r+1} - x^r\Vert \), and hence \(\Vert x^{r+1} - x^r\Vert \le \frac{2\theta }{\gamma (r+1)} = \mathcal {O}(\frac{1}{r})\), which is the desired result. \(\square \)

Lemma 3

Assume \(\varpi ^r \ge 0\) for all \(r\) and \(\sum _{r=1}^\infty \frac{\varpi ^r}{r} <\infty \). Furthermore, suppose that \(|\varpi ^{r+1}-\varpi ^r| \le {\tau }/{r}\) for all \(r\) and some constant \(\tau >0\). Then \(\lim _{r\rightarrow \infty } \varpi ^r = 0\).
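
Both hypotheses are essential; the summability condition alone does not force convergence. For example, the sequence with \(\varpi ^r = 1\) whenever \(r\) is a power of two and \(\varpi ^r = 0\) otherwise satisfies \(\sum _{r=1}^\infty \frac{\varpi ^r}{r} \le \sum _{k=0}^\infty 2^{-k} < \infty \), yet does not converge; it violates the increment condition, since \(\varpi ^r\) jumps by one at each spike.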

Proof

Since \(\varpi ^r \ge 0\) and \(\sum _{r=1}^{\infty } \frac{\varpi ^r}{r} < \infty \), we have \(\liminf _{r \rightarrow \infty } \varpi ^r = 0\). We prove the result by contradiction: assume the contrary, so that

$$\begin{aligned} \limsup _{r \rightarrow \infty } \varpi ^r >\epsilon , \end{aligned}$$
(64)

for some \(\epsilon >0\). Hence there exist index sequences \(\{m_j\}\) and \(\{n_j\}\) with \(m_j\le n_j<m_{j+1}\) for all \(j\) such that

$$\begin{aligned} \frac{\epsilon }{3} < \varpi ^r \quad \quad&m_j \le r <n_j, \end{aligned}$$
(65)
$$\begin{aligned} \varpi ^r \le \frac{\epsilon }{3} \quad \quad&n_j \le r < m_{j+1}. \end{aligned}$$
(66)

On the other hand, since \(\sum _{r=1}^{\infty } \frac{\varpi ^r}{r} < \infty \), there exists an index \(\bar{r}\) such that

$$\begin{aligned} \sum _{r = \bar{r}}^{\infty } \frac{\varpi ^r}{r} < \frac{\epsilon ^2}{9\tau }. \end{aligned}$$
(67)

Therefore, for every \(r_0 \ge \bar{r}\) with \(m_j \le r_0 \le n_j-1\), we have

$$\begin{aligned} |\varpi ^{n_j} - \varpi ^{r_0}|&\le \sum _{r = r_0}^{n_j-1}|\varpi ^{r+1} - \varpi ^r| \nonumber \\&\le \sum _{r = r_0}^{n_j-1} \frac{\tau }{r} \end{aligned}$$
(68)
$$\begin{aligned}&\le \frac{3}{\epsilon } \sum _{r = r_0}^{n_j-1} \frac{\tau }{r} \varpi ^r \end{aligned}$$
(69)
$$\begin{aligned}&\le \frac{3\tau \epsilon ^2}{9\epsilon \tau } = \frac{\epsilon }{3}, \end{aligned}$$
(70)

where (69) follows from (65), and (70) is a direct consequence of (67). Hence the triangle inequality implies

$$\begin{aligned} \varpi ^{r_0} \le \varpi ^{n_j} + |\varpi ^{n_j} - \varpi ^{r_0}| \le \frac{\epsilon }{3} + \frac{\epsilon }{3} = \frac{2\epsilon }{3}, \end{aligned}$$

for any \(r_0 \ge \bar{r}\) with \(m_j \le r_0 < n_j\). Combined with (66), this shows \(\varpi ^r \le \frac{2\epsilon }{3}\) for all sufficiently large \(r\), which contradicts (64). Therefore,

$$\begin{aligned} \limsup _{r\rightarrow \infty } \varpi ^r = 0. \end{aligned}$$

\(\square \)
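
As a quick numerical illustration of Lemma 3 (a synthetic check, not part of the paper), one can compare a sequence satisfying both hypotheses with the power-of-two counterexample noted after the lemma statement:

```python
import numpy as np

R = 1_000_000
r = np.arange(1, R + 1)

# (a) w_r = 1/sqrt(r): sum_r w_r / r = sum_r r**(-1.5) < inf and
#     |w_{r+1} - w_r| <= 1/r, so Lemma 3 applies and indeed w_r -> 0.
w = 1.0 / np.sqrt(r)
print(np.sum(w / r), w[-1])  # weighted sum ~ 2.61 (finite); tail ~ 1e-3

# (b) spikes at powers of two: sum_r v_r / r <= sum_k 2**(-k) < inf, but
#     increments jump by 1, so Lemma 3 does not apply and v_r has no limit.
v = np.zeros(R)
v[2 ** np.arange(20) - 1] = 1.0  # v_r = 1 exactly when r is a power of two
print(np.sum(v / r), v.max())  # finite weighted sum, yet limsup v_r = 1
```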

Cite this article

Razaviyayn, M., Sanjabi, M. & Luo, ZQ. A Stochastic Successive Minimization Method for Nonsmooth Nonconvex Optimization with Applications to Transceiver Design in Wireless Communication Networks. Math. Program. 157, 515–545 (2016). https://doi.org/10.1007/s10107-016-1021-7
