Abstract
We consider the problem of minimizing a composite convex function with two different access methods: an oracle, for which we can evaluate the value and gradient, and a structured function, which we access only by solving a convex optimization problem. We are motivated by two associated technological developments. For the oracle, systems like PyTorch or TensorFlow can automatically and efficiently compute gradients, given a computation graph description. For the structured function, systems like CVXPY accept a high-level domain-specific language description of the problem and automatically translate it to a standard form for efficient solution. We develop a method that makes minimal assumptions about the two functions, does not require the tuning of algorithm parameters, and works well in practice across a variety of problems. Our algorithm combines a number of well-known ideas, including a low-rank quasi-Newton approximation of curvature, piecewise affine lower bounds from bundle-type methods, and two types of damping to ensure stability. We illustrate the method on stochastic optimization, utility maximization, and risk-averse programming problems, showing that our method is more efficient than standard solvers when the oracle function involves a large amount of data.
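To make the two access modes concrete, the following minimal sketch (not the method proposed in the paper) evaluates an oracle term f through PyTorch automatic differentiation and touches a structured term g only by solving a convex subproblem in CVXPY; the least squares loss, the \(\ell_1\) penalty, and all data below are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np
import torch

# Illustrative data (assumptions, not from the paper): a least squares
# oracle term f and an l1-norm structured term g.
A = torch.randn(100, 5)
b = torch.randn(100)

def f_oracle(x_np):
    # Oracle access: evaluate f(x) and its gradient via automatic differentiation.
    x = torch.tensor(x_np, dtype=A.dtype, requires_grad=True)
    val = torch.sum((A @ x - b) ** 2) / A.shape[0]
    val.backward()
    return val.item(), x.grad.numpy()

def g_structured_step(x_center, lam=1.0):
    # Structured access: g is touched only by solving a convex subproblem in
    # CVXPY (here a simple proximal step on g(x) = ||x||_1 around x_center).
    x = cp.Variable(x_center.shape[0])
    objective = cp.norm1(x) + (lam / 2) * cp.sum_squares(x - x_center)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value

x0 = np.zeros(5)
val, grad = f_oracle(x0)
x1 = g_structured_step(x0 - 0.1 * grad)  # one illustrative proximal gradient step
```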
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Adamson DS, Winant CW (1969) A SLANG simulation of an initially strong shock wave downstream of an infinite area change. In: Proceedings of the Conference on Applications of Continuous-System Simulation Languages, pp 231–240
Agrawal A, Verschueren R, Diamond S, Boyd S (2018) A rewriting system for convex optimization problems. J Control Decision 5(1):42–60
Ahmadi-Javid A (2011) An information-theoretic approach to constructing coherent risk measures. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp 2125–2127. IEEE
Ahmed S, Çakmak U, Shapiro A (2007) Coherent risk measures in inventory problems. Eur J Oper Res 182(1):226–238
MOSEK ApS (2019) MOSEK Optimizer API for Python 9.2.40. https://docs.mosek.com/9.2/pythonapi/index.html
Armijo L (1966) Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J Math 16(1):1–3
Asi H, Duchi J (2019) Modeling simple structures and geometry for better stochastic optimization algorithms. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp 2425–2434
Bagirov A, Karmitsa N, Mäkelä MM (2014) Bundle Methods. Introduction to Nonsmooth Optimization: Theory, Practice and Software, pp 305–310. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-08114-4_12
Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2018) Automatic differentiation in machine learning: a survey. J Mach Learn Res 18(153):1–43
Becker S, Fadili J, Ochs P (2019) On quasi-Newton forward-backward splitting: proximal calculus and convergence. SIAM J Optim 29(4):2445–2481
Becker S, Fadili MJ (2012) A quasi-Newton proximal splitting method. arXiv preprint arXiv:1206.1156
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Broyden CG (1965) A class of methods for solving nonlinear simultaneous equations. Math Comput 19(92):577–593
Busseti E, Ryu EK, Boyd S (2016) Risk-constrained Kelly gambling. J Invest 25(3):118–134
Byrd RH, Nocedal J, Schnabel RB (1994) Representations of quasi-Newton matrices and their use in limited memory methods. Math Progr 63(1):129–156
Daníelsson J, Jorgensen BN, Samorodnitsky G, Sarma M, de Vries CG (2013) Fat tails, VaR and subadditivity. J Econom 172(2):283–291
Daníelsson J, Jorgensen BN, de Vries CG, Yang X (2008) Optimal portfolio allocation under the probabilistic VaR constraint and incentives for financial innovation. Ann Financ 4(3):345–367
Davidon WC (1959) Variable metric method for minimization. Tech. rep., Argonne National Lab., Lemont, Ill
De Oliveira W, Solodov M (2016) A doubly stabilized bundle method for nonsmooth convex optimization. Math Program 156(1–2):125–159
de Oliveira W, Sagastizábal C, Lemaréchal C (2014) Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math Program 148(1):241–277
Dennis JE Jr, Moré JJ (1977) Quasi-Newton methods, motivation and theory. SIAM Rev 19(1):46–89
Diamond S, Boyd S (2016) CVXPY: A Python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5. http://stanford.edu/~boyd/papers/pdf/cvxpy_paper.pdf
Domahidi A, Chu E, Boyd S (2013) ECOS: An SOCP solver for embedded systems. In: European Control Conference (ECC), pp 3071–3076
Erdogdu MA, Montanari A (2015) Convergence rates of sub-sampled Newton methods. Adv Neur Inf Process Syst 28:3052–3060
Fletcher R (2005) A new low rank quasi-Newton update scheme for nonlinear programming. In: IFIP Conference on System Modeling and Optimization, pp 275–293. Springer
Fletcher R, Powell M (1963) A rapidly convergent descent method for minimization. Comput J 6(2):163–168
Frangioni A (2002) Generalized bundle methods. SIAM J Optim 13(1):117–156
Fukushima M (1984) A descent algorithm for nonsmooth convex optimization. Math Program 30(2):163–175
Gao Y, Zhang YY, Wu X (2015) Penalized exponential series estimation of copula densities with an application to intergenerational dependence of body mass index. Empir Econ 48(1):61–81
Ghanbari H, Scheinberg K (2018) Proximal quasi-Newton methods for regularized convex optimization with linear and accelerated sublinear convergence rates. Comput Optim Appl 69(3):597–627
Gower R, Le Roux N, Bach F (2018) Tracking the gradients using the hessian: A new look at variance reducing stochastic methods. In: International Conference on Artificial Intelligence and Statistics, pp 707–715. PMLR
Grant M, Boyd S (2014) CVX: Matlab software for disciplined convex programming, version 2.1
Grant M, Boyd S, Ye Y (2006) Disciplined convex programming. In: L. Liberti, N. Maculan (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pp 155–210. Springer
Huang X, Liang X, Liu Z, Li L, Yu Y, Li Y (2020) Span: a stochastic projected approximate newton method. Proc AAAI Conf Artif Intell 34(02):1520–1527
Hutchinson MF (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat -Sim Comput 18(3):1059–1076
Innes M, Edelman A, Fischer K, Rackauckas C, Saba E, Shah VB, Tebbutt W (2019) A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint arXiv:1907.07587
Innes M, Saba E, Fischer K, Gandhi D, Rudilosso MC, Joy NM, Karmali T, Pal A, Shah V (2018) Fashionable modelling with Flux. CoRR abs/1811.01457. https://arxiv.org/abs/1811.01457
Kelly Jr JL (2011) A new interpretation of information rate. In: The Kelly capital growth investment criterion: theory and practice, pp 25–34. World Scientific
Kiwiel KC (1990) Proximity control in bundle methods for convex nondifferentiable minimization. Math Program 46(1):105–122
Kiwiel KC (2000) Efficiency of proximal bundle methods. J Optim Theory Appl 104(3):589–603
Lee CP, Wright SJ (2019) Inexact successive quadratic approximation for regularized optimization. Comput Optim Appl 72(3):641–674
Lee JD, Sun Y, Saunders MA (2014) Proximal Newton-type methods for minimizing composite functions. SIAM J Optim 24(3):1420–1443
Lemaréchal C (1978) Nonsmooth optimization and descent methods. RR-78-004
Lemaréchal C, Nemirovskii A, Nesterov Y (1995) New variants of bundle methods. Math Program 69(1):111–147
Lemaréchal C, Sagastizábal C (1997) Variable metric bundle methods: from conceptual to implementable forms. Math Program 76(3):393–410
Levitin ES, Polyak BT (1966) Constrained minimization methods. USSR Comput Math Math Phys 6(5):1–50
Li J, Andersen MS, Vandenberghe L (2017) Inexact proximal Newton methods for self-concordant functions. Math Methods Oper Res 85(1):19–41
Lukšan L, Vlček J (1998) A bundle-Newton method for nonsmooth unconstrained minimization. Math Program 83(1):373–391
Maclaurin D, Duvenaud D, Adams RP (2015) Autograd: Effortless gradients in numpy. In: ICML 2015 AutoML Workshop, vol. 238, p. 5
Marsh P (2007) Goodness of fit tests via exponential series density estimation. Comput Stat & Data Anal 51(5):2428–2441
Meyer RA, Musco C, Musco C, Woodruff DP (2021) Hutch++: Optimal stochastic trace estimation. In: Symposium on Simplicity in Algorithms (SOSA), pp 142–155. SIAM
Mifflin R (1996) A quasi-second-order proximal bundle algorithm. Math Program 73(1):51–72
Mifflin R, Sun D, Qi L (1998) Quasi-Newton bundle-type methods for nondifferentiable convex optimization. SIAM J Optim 8(2):583–603
Mordukhovich BS, Yuan X, Zeng S, Zhang J (2020) A globally convergent proximal Newton-type method in nonsmooth convex optimization. arXiv preprint arXiv:2011.08166
Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161
Nolan JF (1953) Analytical differentiation on a digital computer. Ph.D. thesis, Massachusetts Institute of Technology
Noll D (2013) Bundle method for non-convex minimization with inexact subgradients and function values. In: Computational and analytical mathematics, pp 555–592. Springer
O’Donoghue B, Chu E, Parikh N, Boyd S (2016) Conic optimization via operator splitting and homogeneous self-dual embedding. J Optim Theory Appl 169(3):1042–1068. http://stanford.edu/~boyd/papers/scs.html
O’Donoghue B, Chu E, Parikh N, Boyd S (2019) SCS: Splitting conic solver, version 2.1.2. https://github.com/cvxgrp/scs
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J Bank & Financ 26(7):1443–1471
Scheinberg K, Tang X (2016) Practical inexact proximal quasi-Newton method with global complexity analysis. Math Program 160(1):495–529
Schmidt M, Berg E, Friedlander M, Murphy K (2009) Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In: Artificial Intelligence and Statistics, pp 456–463. PMLR
Schramm H, Zowe J (1992) A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM J Optim 2(1):121–152
Sra S, Nowozin S, Wright SJ (2012) Optimization for machine learning. MIT Press
Teo CH, Vishwanathan SVN, Smola A, Le QV (2010) Bundle methods for regularized risk minimization. J Mach Learn Res 11(1):311–365
Udell M, Mohan K, Zeng D, Hong J, Diamond S, Boyd S (2014) Convex optimization in Julia. In: Proceedings of the Workshop for High Performance Technical Computing in Dynamic Languages, pp 18–28
Uryasev S, Rockafellar RT (2001) Conditional Value-at-Risk: Optimization Approach, pp 411–435. Springer US, Boston, MA
Van Ackooij W, Frangioni A (2018) Incremental bundle methods using upper models. SIAM J Optim 28(1):379–410
van Ackooij W, Bello Cruz JY, de Oliveira W (2016) A strongly convergent proximal bundle method for convex minimization in Hilbert spaces. Optimization 65(1):145–167
Wainwright M, Jordan M (2008) Graphical models, exponential families, and variational inference. Now Publishers Inc
Wu X (2010) Exponential series estimator of multivariate densities. J Econom 156(2):354–366
Ye H, Luo L, Zhang Z (2017) Approximate Newton methods and their local convergence. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp 3931–3939
Yu J, Vishwanathan SVN, Günter S, Schraudolph NN (2010) A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. J Mach Learn Res 11:1145–1200
Yue MC, Zhou Z, So AMC (2019) A family of inexact SQA methods for non-smooth convex minimization with provable convergence guarantees based on the Luo-Tseng error bound property. Math Program 174(1):327–358
Additional information
The work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, by the Key Area R&D Program of Guangdong Province with grant No. 2018B030338001, by Shenzhen Outstanding Talents Training Fund, and by Guangdong Research Project No. 2017ZT07X152. Stephen Boyd’s work was funded in part by the AI Chip Center for Emerging Smart Systems (ACCESS).
A Appendix
A.1 Details of forming \(G_k\)
We form \(G_k\) in (3) using the quasi-Newton update of Fletcher (2005). The main idea is to split \(G_k\) into two blocks \(G_k^{(1)}\) and \(G_k^{(2)}\), consisting of the first \(r_1\) and the last \(r_2=r-r_1\) columns of \(G_k\), respectively, and to update them separately into \(G_{k+1}^{(1)}\) and \(G_{k+1}^{(2)}\). Then \(G_{k+1}\) is given by
The details are as follows. Let \(s_k = x^{k+1} - x^k\) and \(y_k = \nabla f(x^{k+1}) - \nabla f(x^k)\). By the convexity of f, \(s_k^T y_k\ge 0\). Suppose that \(s_k^T y_k\) is not too small, in the sense that
where constants \(\varepsilon _{\mathrm {abs}}, \varepsilon _{\mathrm {rel}}>0\) are given. Then \(r_1\) is chosen as the largest integer in [0, r] such that \(G_k^{(1)}\) satisfies
According to (39), the condition above holds at least for \(r_1=0\), in which case \(G^{(1)}_k\) degenerates to 0.
Once \(r_1\) is obtained, we define \(w_k^{(1)} = (G^{(1)}_k)^T s_k\) and \(w_k^{(2)} = (G^{(2)}_k)^T s_k\). Then \(R_k^{(1)}\in {\mathbf{R}}^{(r_1+1)\times (r_1+1)}\) is the upper triangular factor in the following R-Q decomposition
and \(Q_k^{(2)}\in {\mathbf{R}}^{r_2 \times (r_2 - 1)}\) is a matrix whose columns form an orthonormal basis of the subspace orthogonal to \(w_k^{(2)}\). The update is
There are some corner cases. If (39) holds and \(r_1 = 0\), then \(G_{k+1}^{(1)} = y_k / \sqrt{s_k^T y_k}\). If \(r_1 = r-1\) or \(r_1 = r\), then \(G_{k+1}^{(2)}\) vanishes, and \(G_{k+1}\) takes the first r columns of \(G_{k+1}^{(1)}\).
In cases where (39) does not hold, if \(\Vert w_k^{(2)}\Vert _2 > \varepsilon _{\mathrm {abs}}\), then we can still define \(G_{k+1}^{(2)}\) in the same way, and \(G_{k+1} = \left[ G^{(2)}_{k+1} \quad 0_n\right] .\) Otherwise, \(G_{k+1} = G_k\).
It is easy to check that, with \(G_k\) defined as above, the trace of \(H_k\) is uniformly bounded above in k.
The default values for the parameters are \(\varepsilon _{\mathrm {abs}}=10^{-8}\) and \(\varepsilon _{\mathrm {rel}}=10^{-3}\).
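Since the precise form of condition (39) and of the block update equations is not reproduced here, the sketch below only illustrates the flavor of the acceptance test and the \(r_1 = 0\) corner case stated in the text; the threshold used is an assumption, and the general update (which needs the RQ factorization above) is omitted.

```python
import numpy as np

def curvature_pair_ok(s, y, eps_abs=1e-8, eps_rel=1e-3):
    # Hypothetical stand-in for condition (39): require s^T y to be not too
    # small, both in absolute terms and relative to the sizes of s and y.
    # The exact inequality used in the paper may differ.
    sty = float(s @ y)
    return sty >= max(eps_abs, eps_rel * np.linalg.norm(s) * np.linalg.norm(y))

def first_block_r1_zero(s, y):
    # Corner case stated above: if (39) holds and r_1 = 0, the updated first
    # block G_{k+1}^{(1)} is the single column y / sqrt(s^T y).
    return y / np.sqrt(float(s @ y))
```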
A.2 Details of computing an optimal subgradient of g
Here we show how to compute a subgradient \(q^{k+1} \in \partial g(x^{k+1/2})\) satisfying the optimality conditions in (8), which implies (10). This, in turn, is important because it allows us to compute the stopping criteria described in Sect. 2.6.
Since we know the third term on the right-hand side of (8), it suffices to find an optimal subgradient \(l_k^{k+1/2} \in \partial l_k(x^{k+1/2})\), which is easier. We start by rewriting the defining problem for \(x^{k+1/2}\) in a more useful form. Putting (4) and (7) together, we see that we can alternatively express \(x^{k+1/2}\) as the solution to the convex problem
with variables x and \(z \in {\mathbf{R}}\).
The KKT conditions for problem (40), which are necessary and sufficient for the points \((x^{k+1/2}, z^{k+1/2})\) and \(\gamma _i, \; i=\max \{0, k-M+1\},\ldots ,k\), to be primal and dual optimal, are
Here we used the definition of \(v^k\) to simplify the stationarity condition (44).
Now we claim that
To see this, note that (42) says the \(\gamma _i\) are nonnegative and sum to one, while (41) and (43) together imply that \(\gamma _i\) can be positive only when \(\nabla f(x^i)\) is active; by (9), which says the subdifferential \(\partial l_k(x^{k+1/2})\) is the convex hull of the active gradients, the resulting convex combination is indeed a subgradient of \(l_k\) at \(x^{k+1/2}\). Therefore, rearranging (44) gives
which shows (10).
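The construction above can be mirrored in code. The sketch below is a simplified stand-in for problem (40): it uses a plain proximal term \((\lambda_k/2)\Vert x-x^k\Vert_2^2\) in place of the full quadratic model and takes g as a user-supplied CVXPY expression; the point is only to show how the dual variables \(\gamma_i\) of the cuts are recovered and combined into a subgradient of \(l_k\), as in (44).

```python
import cvxpy as cp
import numpy as np

def bundle_step(x_k, f_vals, f_grads, x_pts, lam, g_of):
    # Simplified stand-in for problem (40): epigraph variable z models the
    # piecewise affine minorant l_k; g_of(x) must return a CVXPY expression
    # for g. The paper's subproblem also carries the quasi-Newton curvature
    # term built from G_k, which is omitted here.
    n = x_k.shape[0]
    x, z = cp.Variable(n), cp.Variable()
    cuts = [fi + gi @ (x - xi) <= z
            for fi, gi, xi in zip(f_vals, f_grads, x_pts)]
    prob = cp.Problem(
        cp.Minimize(z + (lam / 2) * cp.sum_squares(x - x_k) + g_of(x)), cuts)
    prob.solve()
    # The duals gamma_i of the cuts are nonnegative and sum to one; their
    # convex combination of the stored gradients is a subgradient of l_k at
    # the solution, which is what (44) is rearranged to isolate.
    gammas = np.array([float(c.dual_value) for c in cuts])
    l_subgrad = sum(g * gi for g, gi in zip(gammas, f_grads))
    return x.value, gammas, l_subgrad
```

In the limited memory setting of the paper, f_vals, f_grads, and x_pts would hold the values, gradients, and points of the last M iterates.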
A.3 Proof that undamped steps occur infinitely often
Assume that \(\nabla f\) is Lipschitz continuous with constant L, and that \(\mu _{\max }\tau _{\min } > 2L / (1-\alpha )\). We show that for any iteration index \(k_0\), there is some \(k \ge k_0\) such that \(t_k = 1\). This means that there exists a subsequence \((k_{\ell })_{\ell =1}^\infty\) such that \(t_{k_{\ell }} = 1\).
To show the result, we first claim that the line search condition (20) is satisfied as soon as
We prove the claim later. Taking the claim as given for now, the main result follows by contradiction. Suppose there exists some iteration index \(k_0\) such that \(t_k < 1\) for each \(k \ge k_0\). Then, because \(t_k\) is the largest number of the form \(t_k=\beta ^j\) (with j a nonnegative integer) satisfying the line search condition, the line search condition does not hold with \(t_k=1\), and thus (45) does not hold with \(t_k=1\), meaning that \(\lambda _k < 2 L / (1-\alpha )\) for each \(k \ge k_0\). But from (23), we have \(\mu _k = \min \left\{ \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0},\mu _{\max }\right\}\), since we assumed \(t_k < 1\) for every \(k \ge k_0\). Additionally, from (22), we get \(\lambda _k \ge \mu _k \tau _{\min }\). So, we now have two cases. Either \(\lambda _k \ge \mu _{\max } \tau _{\min } > 2L / (1-\alpha )\), which contradicts \(\lambda _k < 2 L / (1-\alpha )\). Or \(\lambda _k \ge \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0}\tau _{\min }\), in which case \(\gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0}\tau _{\min }\) grows exponentially in k; this means we must have \(\lambda _k \ge 2 L / (1-\alpha )\) for k sufficiently large, which is again a contradiction. This finishes the proof of the main result.
Now we prove the claim. Observe that
where we used (45) to get the first inequality. We now use the following two facts: (i) \(\phi _k\) is convex in t, and (ii) the L-Lipschitz continuity of \(\nabla f\) implies that \(\phi _k^\prime\) is \(L \Vert v^k \Vert _2^2\)-Lipschitz in t. By the first fact (convexity), we have
Adding and subtracting \(\phi _k^\prime (0)\) gives
The second fact (Lipschitz continuity) yields a bound on the third term on the right-hand side,
Finally, using (18) and (46) to bound \(\phi _k^\prime (0)\) and \(L t_k^2 \Vert v^k \Vert _2^2\) in (47) yields (20); that is, the line search condition is satisfied, as claimed.
We note that the claim above also implies the following lower bound, which is used in Sect. 3.1. Observe that if \(t_k\) indeed satisfies the line search condition, then we can express \(t_k = \beta ^j\), where j is the smallest integer such that (45) holds (see Sect. 2.4). Now consider two cases. If \(1 \le (1-\alpha ) \lambda _k / (2L)\), then we can simply take \(j = 0\), so that \(t_k = 1\). On the other hand, if \(1 > (1-\alpha ) \lambda _k / (2L)\), then a short calculation shows that \(j=\lceil \log _{\beta } ((1-\alpha ) \lambda _k / (2L)) \rceil\), and so \(j < 1 + \log _{\beta } ((1-\alpha ) \lambda _k / (2L))\), giving \(t_k = \beta ^j > \beta (1-\alpha ) \lambda _k / (2L)\). To summarize, (45) implies that \(t_k\) satisfies the inequalities
where we also used the fact that \(\lambda _k \ge \mu _{\min }\tau _{\min }\).
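As a concrete illustration of the backtracking rule analyzed here, the following minimal sketch assumes the line search condition (20) is a standard sufficient-decrease test \(\phi_k(t) \le \phi_k(0) + \alpha t \phi_k'(0)\); the exact condition used in the paper is not reproduced in this appendix and may differ.

```python
def backtrack(phi, dphi0, alpha=0.3, beta=0.5, max_halvings=60):
    # Find the largest t = beta^j, j = 0, 1, ..., satisfying the (assumed)
    # sufficient-decrease test phi(t) <= phi(0) + alpha * t * phi'(0).
    phi0 = phi(0.0)
    t = 1.0
    for _ in range(max_halvings):
        if phi(t) <= phi0 + alpha * t * dphi0:
            return t
        t *= beta
    return t

# Example: phi(t) = (1 - t)^2 accepts t = 1 immediately (an undamped step).
t_k = backtrack(lambda t: (1.0 - t) ** 2, dphi0=-2.0)
```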
A.4 Proof that the limited memory piecewise affine minorant is accurate enough
Assume that \(\nabla f\) is Lipschitz continuous with constant L. For the rest of the proof, fix any k such that \(t_k = 1\). (From Sect. A.3, it is always possible to find at least one such k.) We will show that for any such k, the limited memory piecewise affine minorant (4) satisfies the bound (29); because our choice of k was arbitrary, the required result will then follow.
For any \(l_k^{k+1} \in \partial l_k(x^{k+1})\), note that adding and subtracting \(\nabla f(x^k)\) in the left-hand side of (29) easily gives
The Lipschitz continuity of \(\nabla f\), in turn, immediately gives a bound for the first term on the right-hand side of (49), i.e., we get
using the definition of \(v^k\).
Therefore, we focus on the second term on the right-hand side of (49). For this term, (9) tells us that for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\),
because the maximum of a convex function over a convex polytope is attained at one of its vertices. The Lipschitz continuity of \(\nabla f\) again shows that, for any \(j \in \{ \max \{0,k-M+1\}, \ldots , k \}\),
To get the third line, we used the fact that the sum on the second line telescopes, and applied the triangle inequality. To get the fourth line, we used the fact that the average is less than the max. Putting this last inequality together with (51), we see that
for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\). Combining (52) and (50) gives (29). This completes the proof of the result.
Cite this article
Shen, X., Ali, A. & Boyd, S. Minimizing oracle-structured composite functions. Optim Eng 24, 743–777 (2023). https://doi.org/10.1007/s11081-021-09705-0