Minimizing oracle-structured composite functions

  • Research Article
  • Published in Optimization and Engineering

Abstract

We consider the problem of minimizing a composite convex function with two different access methods: an oracle, for which we can evaluate the value and gradient, and a structured function, which we access only by solving a convex optimization problem. We are motivated by two associated technological developments. For the oracle, systems like PyTorch or TensorFlow can automatically and efficiently compute gradients, given a computation graph description. For the structured function, systems like CVXPY accept a high-level domain-specific language description of the problem, and automatically translate it to a standard form for efficient solution. We develop a method that makes minimal assumptions about the two functions, does not require the tuning of algorithm parameters, and works well in practice across a variety of problems. Our algorithm combines a number of well-known ideas, including a low-rank quasi-Newton approximation of curvature, piecewise affine lower bounds from bundle-type methods, and two types of damping to ensure stability. We illustrate the method on stochastic optimization, utility maximization, and risk-averse programming problems, showing that our method is more efficient than standard solvers when the oracle function contains much data.

References

  • Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org

  • Adamson DS, Winant CW (1969) A SLANG simulation of an initially strong shock wave downstream of an infinite area change. In: Proceedings of the Conference on Applications of Continuous-System Simulation Languages, pp 231–240

  • Agrawal A, Verschueren R, Diamond S, Boyd S (2018) A rewriting system for convex optimization problems. J Control Decision 5(1):42–60

  • Ahmadi-Javid A (2011) An information-theoretic approach to constructing coherent risk measures. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp 2125–2127. IEEE

  • Ahmed S, Çakmak U, Shapiro A (2007) Coherent risk measures in inventory problems. Eur J Oper Res 182(1):226–238

  • MOSEK ApS (2019) MOSEK Optimizer API for Python 9.2.40. https://docs.mosek.com/9.2/pythonapi/index.html

  • Armijo L (1966) Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J Math 16(1):1–3

  • Asi H, Duchi J (2019) Modeling simple structures and geometry for better stochastic optimization algorithms. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp 2425–2434

  • Bagirov A, Karmitsa N, Mäkelä MM (2014) Bundle Methods. Introduction to Nonsmooth Optimization: Theory, Practice and Software, pp 305–310. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-08114-4_12

  • Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2018) Automatic differentiation in machine learning: a survey. J Mach Learn Res 18(153):1-43

  • Becker S, Fadili J, Ochs P (2019) On quasi-Newton forward-backward splitting: proximal calculus and convergence. SIAM J Optim 29(4):2445–2481

  • Becker S, Fadili MJ (2012) A quasi-Newton proximal splitting method. arXiv preprint arXiv:1206.1156

  • Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press

  • Broyden CG (1965) A class of methods for solving nonlinear simultaneous equations. Math Comput 19(92):577–593

  • Busseti E, Ryu EK, Boyd S (2016) Risk-constrained Kelly gambling. J Invest 25(3):118–134

  • Byrd RH, Nocedal J, Schnabel RB (1994) Representations of quasi-Newton matrices and their use in limited memory methods. Math Progr 63(1):129–156

  • Daníelsson J, Jorgensen BN, Samorodnitsky G, Sarma M, de Vries CG (2013) Fat tails, VaR and subadditivity. J Econom 172(2):283–291

  • Daníelsson J, Jorgensen BN, de Vries CG, Yang X (2008) Optimal portfolio allocation under the probabilistic VaR constraint and incentives for financial innovation. Ann Financ 4(3):345–367

  • Davidon WC (1959) Variable metric method for minimization. Tech. rep., Argonne National Lab., Lemont, Ill

  • De Oliveira W, Solodov M (2016) A doubly stabilized bundle method for nonsmooth convex optimization. Math Program 156(1–2):125–159

  • de Oliveira W, Sagastizábal C, Lemaréchal C (2014) Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math Program 148(1):241–277

  • Dennis JE Jr, Moré JJ (1977) Quasi-Newton methods, motivation and theory. SIAM Rev 19(1):46–89

  • Diamond S, Boyd S (2016) CVXPY: A Python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5. http://stanford.edu/~boyd/papers/pdf/cvxpy_paper.pdf

  • Domahidi A, Chu E, Boyd S (2013) ECOS: An SOCP solver for embedded systems. In: European Control Conference (ECC), pp 3071–3076

  • Erdogdu MA, Montanari A (2015) Convergence rates of sub-sampled Newton methods. Adv Neur Inf Process Syst 28:3052–3060

  • Fletcher R (2005) A new low rank quasi-Newton update scheme for nonlinear programming. In: IFIP Conference on System Modeling and Optimization, pp 275–293. Springer

  • Fletcher R, Powell M (1963) A rapidly convergent descent method for minimization. Comput J 6(2):163–168

  • Frangioni A (2002) Generalized bundle methods. SIAM J Optim 13(1):117–156

  • Fukushima M (1984) A descent algorithm for nonsmooth convex optimization. Math Program 30(2):163–175

  • Gao Y, Zhang YY, Wu X (2015) Penalized exponential series estimation of copula densities with an application to intergenerational dependence of body mass index. Empir Econ 48(1):61–81

  • Ghanbari H, Scheinberg K (2018) Proximal quasi-Newton methods for regularized convex optimization with linear and accelerated sublinear convergence rates. Comput Optim Appl 69(3):597–627

  • Gower R, Le Roux N, Bach F (2018) Tracking the gradients using the hessian: A new look at variance reducing stochastic methods. In: International Conference on Artificial Intelligence and Statistics, pp 707–715. PMLR

  • Grant M, Boyd S (2014) CVX: Matlab software for disciplined convex programming, version 2.1

  • Grant M, Boyd S, Ye Y (2006) Disciplined convex programming. In: L. Liberti, N. Maculan (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pp 155–210. Springer

  • Huang X, Liang X, Liu Z, Li L, Yu Y, Li Y (2020) SPAN: A stochastic projected approximate Newton method. Proc AAAI Conf Artif Intell 34(02):1520–1527

  • Hutchinson MF (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat Simul Comput 18(3):1059–1076

  • Innes M, Edelman A, Fischer K, Rackauckas C, Saba E, Shah VB, Tebbutt W (2019) A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint arXiv:1907.07587

  • Innes M, Saba E, Fischer K, Gandhi D, Rudilosso MC, Joy NM, Karmali T, Pal A, Shah V (2018) Fashionable modelling with Flux. CoRR abs/1811.01457. https://arxiv.org/abs/1811.01457

  • Kelly Jr JL (2011) A new interpretation of information rate. In: The Kelly capital growth investment criterion: theory and practice, pp 25–34. World Scientific

  • Kiwiel KC (1990) Proximity control in bundle methods for convex nondifferentiable minimization. Math Program 46(1):105–122

  • Kiwiel KC (2000) Efficiency of proximal bundle methods. J Optim Theory Appl 104(3):589–603

  • Lee CP, Wright SJ (2019) Inexact successive quadratic approximation for regularized optimization. Comput Optim Appl 72(3):641–674

  • Lee JD, Sun Y, Saunders MA (2014) Proximal Newton-type methods for minimizing composite functions. SIAM J Optim 24(3):1420–1443

  • Lemaréchal C (1978) Nonsmooth optimization and descent methods. RR-78-004

  • Lemaréchal C, Nemirovskii A, Nesterov Y (1995) New variants of bundle methods. Math Program 69(1):111–147

  • Lemaréchal C, Sagastizábal C (1997) Variable metric bundle methods: from conceptual to implementable forms. Math Program 76(3):393–410

  • Levitin ES, Polyak BT (1966) Constrained minimization methods. USSR Comput Math Math Phys 6(5):1–50

  • Li J, Andersen MS, Vandenberghe L (2017) Inexact proximal Newton methods for self-concordant functions. Math Methods Oper Res 85(1):19–41

  • Lukšan L, Vlček J (1998) A bundle-Newton method for nonsmooth unconstrained minimization. Math Program 83(1):373–391

  • Maclaurin D, Duvenaud D, Adams RP (2015) Autograd: Effortless gradients in numpy. In: ICML 2015 AutoML Workshop, vol. 238, p. 5

  • Marsh P (2007) Goodness of fit tests via exponential series density estimation. Comput Stat & Data Anal 51(5):2428–2441

  • Meyer RA, Musco C, Musco C, Woodruff DP (2021) Hutch++: Optimal stochastic trace estimation. In: Symposium on Simplicity in Algorithms (SOSA), pp 142–155. SIAM

  • Mifflin R (1996) A quasi-second-order proximal bundle algorithm. Math Program 73(1):51–72

  • Mifflin R, Sun D, Qi L (1998) Quasi-Newton bundle-type methods for nondifferentiable convex optimization. SIAM J Optim 8(2):583–603

  • Mordukhovich BS, Yuan X, Zeng S, Zhang J (2020) A globally convergent proximal Newton-type method in nonsmooth convex optimization. arXiv preprint arXiv:2011.08166

  • Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161

  • Nolan JF (1953) Analytical differentiation on a digital computer. Ph.D. thesis, Massachusetts Institute of Technology

  • Noll D (2013) Bundle method for non-convex minimization with inexact subgradients and function values. In: Computational and analytical mathematics, pp 555–592. Springer

  • O’Donoghue B, Chu E, Parikh N, Boyd S (2016) Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications 169(3):1042–1068. http://stanford.edu/~boyd/papers/scs.html

  • O’Donoghue B, Chu E, Parikh N, Boyd S (2019) SCS: Splitting conic solver, version 2.1.2. https://github.com/cvxgrp/scs

  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32: pp. 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

  • Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J Bank & Financ 26(7):1443–1471

  • Scheinberg K, Tang X (2016) Practical inexact proximal quasi-Newton method with global complexity analysis. Math Program 160(1):495–529

  • Schmidt M, Berg E, Friedlander M, Murphy K (2009) Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In: Artificial Intelligence and Statistics, pp 456–463. PMLR

  • Schramm H, Zowe J (1992) A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM J Optim 2(1):121–152

  • Sra S, Nowozin S, Wright SJ (2012) Optimization for machine learning. MIT Press

  • Teo CH, Vishwanathan SVN, Smola A, Le QV (2010) Bundle methods for regularized risk minimization. J Mach Learn Res 11(1):311–365

  • Udell M, Mohan K, Zeng D, Hong J, Diamond S, Boyd S (2014) Convex optimization in Julia. In: Proceedings of the Workshop for High Performance Technical Computing in Dynamic Languages, pp 18–28

  • Uryasev S, Rockafellar RT (2001) Conditional Value-at-Risk: Optimization Approach, pp 411–435. Springer US, Boston, MA

  • Van Ackooij W, Frangioni A (2018) Incremental bundle methods using upper models. SIAM J Optim 28(1):379–410

  • van Ackooij W, Bello Cruz JY, de Oliveira W (2016) A strongly convergent proximal bundle method for convex minimization in Hilbert spaces. Optimization 65(1):145–167

  • Wainwright M, Jordan M (2008) Graphical models, exponential families, and variational inference. Now Publishers Inc

  • Wu X (2010) Exponential series estimator of multivariate densities. J Econom 156(2):354–366

  • Ye H, Luo L, Zhang Z (2017) Approximate Newton methods and their local convergence. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp 3931–3939

  • Yu J, Vishwanathan SVN, Günter S, Schraudolph NN (2010) A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. J Mach Learn Res 11:1145–1200

  • Yue MC, Zhou Z, So AMC (2019) A family of inexact SQA methods for non-smooth convex minimization with provable convergence guarantees based on the Luo-Tseng error bound property. Math Program 174(1):327–358

Author information

Corresponding author

Correspondence to Xinyue Shen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, by the Key Area R&D Program of Guangdong Province with grant No. 2018B030338001, by Shenzhen Outstanding Talents Training Fund, and by Guangdong Research Project No. 2017ZT07X152. Stephen Boyd’s work was funded in part by the AI Chip Center for Emerging Smart Systems (ACCESS).

A Appendix

A.1 Details of forming \(G_k\)

We form \(G_k\) in (3) by adopting a quasi-Newton update given in Fletcher (2005). The main idea is to divide \(G_k\) into two matrices \(G_k^{(1)}\) and \(G_k^{(2)}\), consisting of the first \(r_1\) and the last \(r_2=r-r_1\) columns of \(G_k\), respectively, and to update them separately into \(G_{k+1}^{(1)}\) and \(G_{k+1}^{(2)}\). Then \(G_{k+1}\) is obtained by concatenation,

$$\begin{aligned} G_{k+1} = \left[ G^{(1)}_{k+1} \quad G^{(2)}_{k+1}\right] . \end{aligned}$$

The details are as follows. Let \(s_k = x^{k+1} - x^k\) and \(y_k = \nabla f(x^{k+1}) - \nabla f(x^k)\). From the convexity of f, \(s_k^T y_k\ge 0\). Suppose that \(s_k^T y_k\) is not too small, in the sense that

$$\begin{aligned} s_k^T y_k > \max (\varepsilon _{\mathrm {abs}}, \varepsilon _{\mathrm {rel}}\Vert s_k\Vert _2\Vert y_k\Vert _2), \end{aligned}$$
(39)

where constants \(\varepsilon _{\mathrm {abs}}, \varepsilon _{\mathrm {rel}}>0\) are given. Then \(r_1\) is chosen as the largest integer in [0, r] such that \(G_k^{(1)}\) satisfies

$$\begin{aligned} s_k^Ty_k - \left\| (G_k^{(1)})^Ts_k\right\| _2^2 > \varepsilon _{\mathrm {rel}} \Vert s_k\Vert _2 \left\| y_k - G_k^{(1)}(G_k^{(1)})^Ts_k \right\| _2. \end{aligned}$$

According to (39), the above holds at least for \(r_1=0\), in which case \(G^{(1)}_k\) is empty and the terms involving it vanish.

Once \(r_1\) is obtained, we define \(w_k^{(1)} = (G^{(1)}_k)^T s_k\) and \(w_k^{(2)} = (G^{(2)}_k)^T s_k\). Then \(R_k^{(1)}\in {\mathbf{R}}^{(r_1+1)\times (r_1+1)}\) is the upper triangular factor in the following RQ decomposition

$$\begin{aligned} R_k^{(1)} Q_k^{(1)} = \left[ \begin{array}{cc} \frac{1}{\sqrt{s_k^T y_k - \Vert w_k^{(1)}\Vert _2^2}} & 0_{r_1}^T \\ \frac{-1}{\sqrt{s_k^T y_k - \Vert w_k^{(1)}\Vert _2^2}}\, w_k^{(1)} & I_{r_1} \end{array} \right] \in {\mathbf{R}}^{(r_1+1)\times (r_1+1)}, \end{aligned}$$

and \(Q_k^{(2)}\in {\mathbf{R}}^{r_2 \times (r_2 - 1)}\) is a matrix whose columns form an orthonormal basis of the subspace orthogonal to \(w_k^{(2)}\). The update is

$$\begin{aligned} G^{(1)}_{k+1} = \left[ y_k \quad G^{(1)}_k \right] R_k^{(1)}, \quad G^{(2)}_{k+1} = G^{(2)}_{k} Q_k^{(2)}. \end{aligned}$$

There are some corner cases. If (39) holds and \(r_1 = 0\), then \(G_{k+1}^{(1)} = y_k / \sqrt{s_k^T y_k}\). If \(r_1 = r-1\) or \(r_1 = r\), then \(G_{k+1}^{(2)}\) vanishes, and \(G_{k+1}\) takes the first r columns of \(G_{k+1}^{(1)}\).

In cases where (39) does not hold, if \(\Vert w_k^{(2)}\Vert _2 > \varepsilon _{\mathrm {abs}}\), then we can still define \(G_{k+1}^{(2)}\) in the same way, and \(G_{k+1} = \left[ G^{(2)}_{k+1} \quad 0_n\right] .\) Otherwise, \(G_{k+1} = G_k\).

It is easily checked that, with \(G_k\) defined as above, the trace of \(H_k\) is uniformly bounded above in k.

The default values for the parameters are \(\varepsilon _{\mathrm {abs}}=10^{-8}\) and \(\varepsilon _{\mathrm {rel}}=10^{-3}\).
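
For concreteness, here is a minimal NumPy/SciPy sketch of one update \(G_k \rightarrow G_{k+1}\) following the description above. It is our own illustrative reading, not the authors' code: the function name update_G, the use of scipy.linalg.rq for the RQ factorization, and the handling of the case where (39) fails (we take \(G^{(2)}_k = G_k\) there, i.e., \(r_1 = 0\)) are assumptions.

```python
# One Fletcher-style low-rank curvature update G_k -> G_{k+1} (illustrative sketch).
import numpy as np
from scipy.linalg import rq

def update_G(G, s, y, eps_abs=1e-8, eps_rel=1e-3):
    """G is n x r; s = x^{k+1} - x^k; y = grad f(x^{k+1}) - grad f(x^k)."""
    n, r = G.shape
    sty = s @ y
    if sty <= max(eps_abs, eps_rel * np.linalg.norm(s) * np.linalg.norm(y)):
        # condition (39) fails: drop the direction aligned with s, pad with a zero column
        w = G.T @ s
        if np.linalg.norm(w) <= eps_abs:
            return G
        Q = np.linalg.qr(w.reshape(-1, 1), mode="complete")[0][:, 1:]  # basis orthogonal to w
        return np.hstack([G @ Q, np.zeros((n, 1))])

    # r1 = largest integer in [0, r] passing the acceptance test; r1 = 0 always passes, by (39)
    for r1 in range(r, -1, -1):
        G1 = G[:, :r1]
        w1 = G1.T @ s
        if sty - w1 @ w1 > eps_rel * np.linalg.norm(s) * np.linalg.norm(y - G1 @ w1):
            break
    G1, G2 = G[:, :r1], G[:, r1:]
    w1 = G1.T @ s

    if r1 == 0:
        G1_new = (y / np.sqrt(sty)).reshape(-1, 1)            # corner case noted in the text
    else:
        d = np.sqrt(sty - w1 @ w1)
        M = np.block([[np.array([[1.0 / d]]), np.zeros((1, r1))],
                      [-w1.reshape(-1, 1) / d, np.eye(r1)]])
        R, _ = rq(M)                                          # RQ factorization; R upper triangular
        G1_new = np.hstack([y.reshape(-1, 1), G1]) @ R        # n x (r1 + 1)

    if r1 >= r - 1:                                           # G_{k+1}^{(2)} vanishes
        return G1_new[:, :r]
    w2 = G2.T @ s
    Q2 = np.linalg.qr(w2.reshape(-1, 1), mode="complete")[0][:, 1:]  # r2 x (r2 - 1), orthogonal to w2
    return np.hstack([G1_new, G2 @ Q2])                       # n x r
```

Assuming \(H_{k+1} = G_{k+1}G_{k+1}^T\), so that \(G_{k+1}\) enters the method only through this product, the sign ambiguity in the RQ factor returned by scipy.linalg.rq is harmless.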

A.2 Details of computing an optimal subgradient of g

Here we show how to compute a subgradient \(q^{k+1} \in \partial g(x^{k+1/2})\) satisfying the optimality conditions in (8), which implies (10). This, in turn, is important because it allows us to compute the stopping criteria described in Sect. 2.6.

Since we know the third term on the right-hand side of (8), it suffices to find an optimal subgradient \(l_k^{k+1/2} \in \partial l_k(x^{k+1/2})\), which is easier. We start by rewriting the defining problem for \(x^{k+1/2}\) in a more useful form. Putting (4) and (7) together, we see that we can alternatively express \(x^{k+1/2}\) as the solution to the convex problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & z + g(x) + \frac{1}{2}(x-x^k)^T (H_k + \lambda _k I) (x-x^k) \\ \text{subject to} & z \ge f(x^i) + \nabla f(x^i)^T (x - x^i), \quad \text{for } i=\max \{0, k-M+1\},\ldots ,k, \end{array} \end{aligned}$$
(40)

with variables x and \(z \in {\mathbf{R}}\).

The KKT conditions for problem (40), which are necessary and sufficient for the points \((x^{k+1/2}, z^{k+1/2})\) and \(\gamma _i, \; i=\max \{0, k-M+1\},\ldots ,k\), to be primal and dual optimal, are

$$\begin{aligned} z^{k+1/2} \ge f(x^i) + \nabla f(x^i)^T (x^{k+1/2} - x^i), \quad&i=\max \{0, k-M+1\},\ldots ,k; \end{aligned}$$
(41)
$$\begin{aligned} \sum _{i=\max \{0, k-M+1\}}^k \gamma _i = 1, \quad \gamma _i \ge 0, \quad&i=\max \{0, k-M+1\},\ldots ,k; \end{aligned}$$
(42)
$$\begin{aligned} z^{k+1/2} > f(x^i) + \nabla f(x^i)^T (x^{k+1/2} - x^i) \implies \gamma _i = 0, \quad&i=\max \{0, k-M+1\},\ldots ,k; \end{aligned}$$
(43)
$$\begin{aligned} \partial g(x^{k+1/2}) + (H_k + \lambda _k I) v^k + \sum _{i=\max \{0, k-M+1\}}^k \gamma _i \nabla f(x^i) \ni&0. \end{aligned}$$
(44)

Here we used the definition of \(v^k\) to simplify the stationarity condition (44).

Now we claim that

$$\begin{aligned} l_k^{k+1/2} = \sum _{i=\max \{0, k-M+1\}}^k \gamma _i \nabla f(x^i) \in \partial l_k(x^{k+1/2}). \end{aligned}$$

To see this, note that (42) says the \(\gamma _i\) are nonnegative and sum to one, while (41) and (43) together imply that \(\gamma _i\) can be positive only when the i-th affine lower bound is active; the expression above is thus a convex combination of active gradients, which by (9) lies in \(\partial l_k(x^{k+1/2})\), the convex hull of the active gradients. Therefore, re-arranging (44) gives

$$\begin{aligned} q^{k+1} = - l_k^{k+1/2} - (H_k + \lambda _k I) v^k \in \partial g(x^{k+1/2}), \end{aligned}$$

which shows (10).
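
As an illustration of how this recovery can be carried out, the following CVXPY sketch solves (40) and reads off the dual variables \(\gamma _i\) of the cutting-plane constraints to assemble \(q^{k+1}\) via (44). The names are ours and several details are assumptions: g_expr_fn is assumed to map a CVXPY variable to an expression for g, the curvature matrix is assumed to be \(H_k = G G^T\) with low-rank factor G, and \(v^k\) is taken to be \(x^{k+1/2} - x^k\).

```python
# Solve the bundle subproblem (40) with CVXPY and recover q^{k+1} from its duals (sketch).
import cvxpy as cp
import numpy as np

def prox_step_with_subgradient(g_expr_fn, f_vals, f_grads, f_pts, G, lam, xk):
    n = xk.size
    x, z = cp.Variable(n), cp.Variable()
    # piecewise affine lower bound on f built from the bundle of past points
    cuts = [z >= fv + fg @ (x - xi) for fv, fg, xi in zip(f_vals, f_grads, f_pts)]
    # (1/2)(x - x^k)^T (H_k + lam I)(x - x^k), written in DCP form via the factor G
    quad = 0.5 * (cp.sum_squares(G.T @ (x - xk)) + lam * cp.sum_squares(x - xk))
    prob = cp.Problem(cp.Minimize(z + g_expr_fn(x) + quad), cuts)
    prob.solve()
    gammas = np.array([c.dual_value for c in cuts])            # dual variables gamma_i
    l_half = sum(gi * fg for gi, fg in zip(gammas, f_grads))   # element of the subdifferential of l_k
    v = x.value - xk                                           # v^k (assumed x^{k+1/2} - x^k)
    q = -l_half - (G @ (G.T @ v) + lam * v)                    # q^{k+1}, from (44)
    return x.value, q, gammas
```

With \(q^{k+1}\) in hand, the stopping criteria of Sect. 2.6 can be evaluated directly.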

A.3 Proof that undamped steps occur infinitely often

Assume that \(\nabla f\) is Lipschitz continuous with constant L, and that \(\mu _{\max }\tau _{\min } > 2L / (1-\alpha )\). We show that for any iteration index \(k_0\), there is some \(k \ge k_0\) such that \(t_k = 1\); equivalently, there exists a subsequence \((k_{\ell })_{\ell =1}^\infty\) such that \(t_{k_{\ell }} = 1\).

To show the result, we first claim that the line search condition (20) is satisfied as soon as

$$\begin{aligned} t_k \le \frac{1-\alpha }{2L} \lambda _k. \end{aligned}$$
(45)

We prove the claim below. Taking it as given for now, we obtain the main result by contradiction. Suppose there exists some iteration index \(k_0\) such that \(t_k < 1\) for every \(k \ge k_0\). Since \(t_k\) is the largest number of the form \(t_k=\beta ^j\), with j a nonnegative integer, satisfying the line search condition, the line search condition must fail at \(t_k=1\); by the claim, (45) must then also fail at \(t_k=1\), so that \(\lambda _k < 2 L / (1-\alpha )\) for each \(k \ge k_0\). On the other hand, from (23) we have \(\mu _k = \min \left\{ \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0},\mu _{\max }\right\}\), since we assumed \(t_k < 1\) for every \(k \ge k_0\), and from (22) we get \(\lambda _k \ge \mu _k \tau _{\min }\). There are now two cases. Either \(\lambda _k \ge \mu _{\max } \tau _{\min } > 2L / (1-\alpha )\), which contradicts \(\lambda _k < 2 L / (1-\alpha )\). Or \(\lambda _k \ge \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0}\tau _{\min }\), where the right-hand side grows exponentially in k, so that \(\lambda _k \ge 2 L / (1-\alpha )\) for k sufficiently large, which is again a contradiction. This proves the main result.

Now we prove the claim. Observe that

$$\begin{aligned} L t_k^2 \Vert v^k\Vert ^2 \le \frac{1-\alpha }{2} \lambda _k t_k \Vert v^k\Vert ^2 \le \frac{1-\alpha }{2} t_k (v^k)^T (H_k +\lambda _k I) v^k, \end{aligned}$$
(46)

where we used (45) to get the first inequality. We now use the following two facts: (i) \(\phi _k\) is convex in t, by the convexity of f, and (ii) \(\phi _k^\prime\) is \(L \Vert v^k \Vert ^2\)-Lipschitz in t, by the L-Lipschitz continuity of \(\nabla f\). By the first fact (convexity), we have

$$\begin{aligned} \phi _k(t_k) \le \phi _k(0) + \phi _k^\prime (t_k) t_k. \end{aligned}$$

Adding and subtracting \(\phi _k^\prime (0)\) gives

$$\begin{aligned} \phi _k(t_k) \le \phi _k(0) + \phi _k^\prime (0) t_k + (\phi _k^\prime (t_k) - \phi _k^\prime (0)) t_k. \end{aligned}$$

The second fact (Lipschitz continuity) yields a bound on the third term on the right-hand side,

$$\begin{aligned} \phi _k(t_k) \le \phi _k(0) + \phi _k^\prime (0) t_k + L t_k^2 \Vert v^k\Vert ^2. \end{aligned}$$
(47)

Finally, using (18) and (46) to bound \(\phi _k^\prime (0)\) and \(L t_k^2 \Vert v^k \Vert _2^2\) in (47) yields (20), i.e., the line search condition is satisfied, as claimed.

We note that the claim above also implies the following lower bound, which is used in Sect. 3.1. The backtracking line search (see Sect. 2.4) returns \(t_k = \beta ^j\), where j is at most the smallest nonnegative integer for which (45) holds. Now consider two cases. If \(1 \le (1-\alpha ) \lambda _k / (2L)\), then (45) holds with \(t_k=1\), so the line search accepts \(t_k = 1\). On the other hand, if \(1 > (1-\alpha ) \lambda _k / (2L)\), then a short calculation shows that the smallest such j is \(j=\lceil \log _{\beta } ((1-\alpha ) \lambda _k / (2L)) \rceil < 1 + \log _{\beta } ((1-\alpha ) \lambda _k / (2L))\), giving \(t_k \ge \beta ^j > \beta (1-\alpha ) \lambda _k / (2L)\). To sum up, \(t_k\) satisfies the inequalities

$$\begin{aligned} t_k > \beta \min \left\{ 1, \frac{1-\alpha }{2L} \lambda _k \right\} \ge \beta \min \left\{ 1,\frac{1-\alpha }{2L} \mu _{\min }\tau _{\min }\right\} , \end{aligned}$$
(48)

where we also used the fact that \(\lambda _k \ge \mu _{\min }\tau _{\min }\).
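
As a numerical illustration of the bound (48), take \(\beta = 1/2\), \(\alpha = 0.1\), \(L = 10\), and \(\mu _{\min }\tau _{\min } = 10^{-2}\); these values are purely illustrative, not the method's defaults. Then

$$\begin{aligned} t_k > \beta \min \left\{ 1,\frac{1-\alpha }{2L} \mu _{\min }\tau _{\min }\right\} = \frac{1}{2}\min \left\{ 1, \frac{0.9}{20}\cdot 10^{-2}\right\} = 2.25\times 10^{-4}, \end{aligned}$$

so the accepted step sizes are bounded away from zero, uniformly in k.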

A.4 Proof that the limited memory piecewise affine minorant is accurate enough

Assume that \(\nabla f\) is Lipschitz continuous with constant L. For the rest of the proof, fix any k such that \(t_k = 1\). (From Sect. A.3, it is always possible to find at least one such k.) We will show that for any such k, the limited memory piecewise affine minorant (4) satisfies the bound (29); because our choice of k was arbitrary, the required result will then follow.

For any \(l_k^{k+1} \in \partial l_k(x^{k+1})\), note that adding and subtracting \(\nabla f(x^k)\) on the left-hand side of (29) and applying the triangle inequality gives

$$\begin{aligned} \Vert \nabla f(x^{k+1}) - l_k^{k+1} \Vert _2 \le \Vert \nabla f(x^{k+1}) - \nabla f(x^k) \Vert _2 + \Vert \nabla f(x^k) - l_k^{k+1} \Vert _2. \end{aligned}$$
(49)

The Lipschitz continuity of \(\nabla f\), in turn, immediately gives a bound for the first term on the right-hand side of (49), i.e., we get

$$\begin{aligned} \Vert \nabla f(x^{k+1}) - l_k^{k+1} \Vert _2&\le L \Vert x^{k+1} - x^k \Vert _2 + \Vert \nabla f(x^k) - l_k^{k+1} \Vert _2 \nonumber \\&\lesssim \Vert v^k \Vert _2 + \Vert \nabla f(x^k) - l_k^{k+1} \Vert _2, \end{aligned}$$
(50)

using the definition of \(v^k\).

Therefore, we focus on the second term on the right-hand side of (49). For this term, (9) tells us that for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\),

$$\begin{aligned} \Vert \nabla f(x^k) - l_k^{k+1} \Vert _2 \le \max _{j=\max \{0,k-M+1\},\ldots ,k} \Vert \nabla f(x^k) - \nabla f(x^j) \Vert _2, \end{aligned}$$
(51)

because the maximum of a convex function over a convex polytope is attained at one of its vertices. The Lipschitz continuity of \(\nabla f\) again shows that, for any \(j \in \{ \max \{0,k-M+1\}, \ldots , k \}\),

$$\begin{aligned} \Vert \nabla f(x^k) - \nabla f(x^j) \Vert _2&\le L \Vert x^k - x^j \Vert _2 \\&= L \Vert (x^k - x^{k-1}) + (x^{k-1} - x^{k-2}) + \cdots + (x^{j+2} - x^{j+1}) + (x^{j+1} - x^j) \Vert _2 \\&\le L \sum _{\ell =j}^{k-1} \Vert v^\ell \Vert _2 \\&\lesssim \max _{\ell =j,\ldots ,k-1} \Vert v^\ell \Vert _2. \end{aligned}$$

The second line writes \(x^k - x^j\) as a telescoping sum; to get the third line, we applied the triangle inequality and used \(\Vert x^{\ell +1} - x^\ell \Vert _2 = t_\ell \Vert v^\ell \Vert _2 \le \Vert v^\ell \Vert _2\). To get the fourth line, note that the sum has at most \(M-1\) terms, each bounded by the maximum; the factor \(M-1\), along with L, is absorbed into the \(\lesssim\) notation. Putting this last inequality together with (51), we see that

$$\begin{aligned} \Vert \nabla f(x^k) - l_k^{k+1} \Vert _2 \lesssim \max _{j=\max \{0,k-M+1\},\ldots ,k} \Vert v^j \Vert _2, \end{aligned}$$
(52)

for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\). Combining (52) and (50) gives (29). This completes the proof of the result.
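
To make the constants hidden in the \(\lesssim\) symbols explicit, note that the argument above only uses \(t_\ell \le 1\) and the fact that at most \(M-1\) differences appear in the telescoping sum, so one admissible explicit form of the bound is

$$\begin{aligned} \Vert \nabla f(x^{k+1}) - l_k^{k+1} \Vert _2 \le L\Vert v^k\Vert _2 + L(M-1)\max _{j=\max \{0,k-M+1\},\ldots ,k}\Vert v^j \Vert _2 \le LM \max _{j=\max \{0,k-M+1\},\ldots ,k}\Vert v^j \Vert _2, \end{aligned}$$

i.e., the constant implicit in (29) can be taken to be \(LM\).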

About this article

Cite this article

Shen, X., Ali, A. & Boyd, S. Minimizing oracle-structured composite functions. Optim Eng 24, 743–777 (2023). https://doi.org/10.1007/s11081-021-09705-0
