Abstract
We consider the problem of minimizing a composite convex function with two different access methods: an oracle, for which we can evaluate the value and gradient, and a structured function, which we access only by solving a convex optimization problem. We are motivated by two associated technological developments. For the oracle, systems like PyTorch or TensorFlow can automatically and efficiently compute gradients, given a computation graph description. For the structured function, systems like CVXPY accept a high-level domain-specific language description of the problem and automatically translate it to a standard form for efficient solution. We develop a method that makes minimal assumptions about the two functions, does not require the tuning of algorithm parameters, and works well in practice across a variety of problems. Our algorithm combines a number of well-known ideas, including a low-rank quasi-Newton approximation of curvature, piecewise affine lower bounds from bundle-type methods, and two types of damping to ensure stability. We illustrate the method on stochastic optimization, utility maximization, and risk-averse programming problems, showing that our method is more efficient than standard solvers when the oracle function involves a large amount of data.
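To make the two access modes concrete, the following minimal sketch (not the method proposed in the paper) evaluates an oracle term f through PyTorch automatic differentiation and touches a structured term g only by solving a convex subproblem in CVXPY; the least squares loss, the \(\ell_1\) penalty, and all data below are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np
import torch

# Illustrative data (assumptions, not from the paper): a least squares
# oracle term f and an l1-norm structured term g.
A = torch.randn(100, 5)
b = torch.randn(100)

def f_oracle(x_np):
    # Oracle access: evaluate f(x) and its gradient via automatic differentiation.
    x = torch.tensor(x_np, dtype=A.dtype, requires_grad=True)
    val = torch.sum((A @ x - b) ** 2) / A.shape[0]
    val.backward()
    return val.item(), x.grad.numpy()

def g_structured_step(x_center, lam=1.0):
    # Structured access: g is touched only by solving a convex subproblem in
    # CVXPY (here a simple proximal step on g(x) = ||x||_1 around x_center).
    x = cp.Variable(x_center.shape[0])
    objective = cp.norm1(x) + (lam / 2) * cp.sum_squares(x - x_center)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value

x0 = np.zeros(5)
val, grad = f_oracle(x0)
x1 = g_structured_step(x0 - 0.1 * grad)  # one illustrative proximal gradient step
```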
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Adamson DS, Winant CW (1969) A SLANG simulation of an initially strong shock wave downstream of an infinite area change. In: Proceedings of the Conference on Applications of Continuous-System Simulation Languages, pp 231–240
Agrawal A, Verschueren R, Diamond S, Boyd S (2018) A rewriting system for convex optimization problems. J Control Decision 5(1):42–60
Ahmadi-Javid A (2011) An information-theoretic approach to constructing coherent risk measures. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp 2125–2127. IEEE
Ahmed S, Çakmak U, Shapiro A (2007) Coherent risk measures in inventory problems. Eur J Oper Res 182(1):226–238
MOSEK ApS (2019) MOSEK Optimizer API for Python 9.2.40. https://docs.mosek.com/9.2/pythonapi/index.html
Armijo L (1966) Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J Math 16(1):1–3
Asi H, Duchi J (2019) Modeling simple structures and geometry for better stochastic optimization algorithms. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp 2425–2434
Bagirov A, Karmitsa N, Mäkelä MM (2014) Bundle Methods. Introduction to Nonsmooth Optimization: Theory, Practice and Software, pp 305–310. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-08114-4_12
Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2018) Automatic differentiation in machine learning: a survey. J Mach Learn Res 18(153):1–43
Becker S, Fadili J, Ochs P (2019) On quasi-Newton forward-backward splitting: proximal calculus and convergence. SIAM J Optim 29(4):2445–2481
Becker S, Fadili MJ (2012) A quasi-Newton proximal splitting method. arXiv preprint arXiv:1206.1156
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Broyden CG (1965) A class of methods for solving nonlinear simultaneous equations. Math Comput 19(92):577–593
Busseti E, Ryu EK, Boyd S (2016) Risk-constrained Kelly gambling. J Invest 25(3):118–134
Byrd RH, Nocedal J, Schnabel RB (1994) Representations of quasi-Newton matrices and their use in limited memory methods. Math Progr 63(1):129–156
Daníelsson J, Jorgensen BN, Samorodnitsky G, Sarma M, de Vries CG (2013) Fat tails, VaR and subadditivity. J Econom 172(2):283–291
Daníelsson J, Jorgensen BN, de Vries CG, Yang X (2008) Optimal portfolio allocation under the probabilistic VaR constraint and incentives for financial innovation. Ann Financ 4(3):345–367
Davidon WC (1959) Variable metric method for minimization. Tech. rep., Argonne National Lab., Lemont, Ill
De Oliveira W, Solodov M (2016) A doubly stabilized bundle method for nonsmooth convex optimization. Math Program 156(1–2):125–159
de Oliveira W, Sagastizábal C, Lemaréchal C (2014) Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math Program 148(1):241–277
Dennis JE Jr, Moré JJ (1977) Quasi-Newton methods, motivation and theory. SIAM Rev 19(1):46–89
Diamond S, Boyd S (2016) CVXPY: A Python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5. http://stanford.edu/~boyd/papers/pdf/cvxpy_paper.pdf
Domahidi A, Chu E, Boyd S (2013) ECOS: An SOCP solver for embedded systems. In: European Control Conference (ECC), pp 3071–3076
Erdogdu MA, Montanari A (2015) Convergence rates of sub-sampled Newton methods. Adv Neur Inf Process Syst 28:3052–3060
Fletcher R (2005) A new low rank quasi-Newton update scheme for nonlinear programming. In: IFIP Conference on System Modeling and Optimization, pp 275–293. Springer
Fletcher R, Powell M (1963) A rapidly convergent descent method for minimization. Comput J 6(2):163–168
Frangioni A (2002) Generalized bundle methods. SIAM J Optim 13(1):117–156
Fukushima M (1984) A descent algorithm for nonsmooth convex optimization. Math Program 30(2):163–175
Gao Y, Zhang YY, Wu X (2015) Penalized exponential series estimation of copula densities with an application to intergenerational dependence of body mass index. Empir Econ 48(1):61–81
Ghanbari H, Scheinberg K (2018) Proximal quasi-Newton methods for regularized convex optimization with linear and accelerated sublinear convergence rates. Comput Optim Appl 69(3):597–627
Gower R, Le Roux N, Bach F (2018) Tracking the gradients using the hessian: A new look at variance reducing stochastic methods. In: International Conference on Artificial Intelligence and Statistics, pp 707–715. PMLR
Grant M, Boyd S (2014) CVX: Matlab software for disciplined convex programming, version 2.1
Grant M, Boyd S, Ye Y (2006) Disciplined convex programming. In: L. Liberti, N. Maculan (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pp 155–210. Springer
Huang X, Liang X, Liu Z, Li L, Yu Y, Li Y (2020) Span: a stochastic projected approximate newton method. Proc AAAI Conf Artif Intell 34(02):1520–1527
Hutchinson MF (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat -Sim Comput 18(3):1059–1076
Innes M, Edelman A, Fischer K, Rackauckas C, Saba E, Shah VB, Tebbutt W (2019) A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint arXiv:1907.07587
Innes M, Saba E, Fischer K, Gandhi D, Rudilosso MC, Joy NM, Karmali T, Pal A, Shah V (2018) Fashionable modelling with Flux. CoRR abs/1811.01457. https://arxiv.org/abs/1811.01457
Kelly Jr JL (2011) A new interpretation of information rate. In: The Kelly capital growth investment criterion: theory and practice, pp 25–34. World Scientific
Kiwiel KC (1990) Proximity control in bundle methods for convex nondifferentiable minimization. Math Program 46(1):105–122
Kiwiel KC (2000) Efficiency of proximal bundle methods. J Optim Theory Appl 104(3):589–603
Lee CP, Wright SJ (2019) Inexact successive quadratic approximation for regularized optimization. Comput Optim Appl 72(3):641–674
Lee JD, Sun Y, Saunders MA (2014) Proximal Newton-type methods for minimizing composite functions. SIAM J Optim 24(3):1420–1443
Lemaréchal C (1978) Nonsmooth optimization and descent methods. RR-78-004
Lemaréchal C, Nemirovskii A, Nesterov Y (1995) New variants of bundle methods. Math Program 69(1):111–147
Lemaréchal C, Sagastizábal C (1997) Variable metric bundle methods: from conceptual to implementable forms. Math Program 76(3):393–410
Levitin ES, Polyak BT (1966) Constrained minimization methods. USSR Comput Math Math Phys 6(5):1–50
Li J, Andersen MS, Vandenberghe L (2017) Inexact proximal Newton methods for self-concordant functions. Math Methods Oper Res 85(1):19–41
Lukšan L, Vlček J (1998) A bundle-Newton method for nonsmooth unconstrained minimization. Math Program 83(1):373–391
Maclaurin D, Duvenaud D, Adams RP (2015) Autograd: Effortless gradients in numpy. In: ICML 2015 AutoML Workshop, vol. 238, p. 5
Marsh P (2007) Goodness of fit tests via exponential series density estimation. Comput Stat & Data Anal 51(5):2428–2441
Meyer RA, Musco C, Musco C, Woodruff DP (2021) Hutch++: Optimal stochastic trace estimation. In: Symposium on Simplicity in Algorithms (SOSA), pp 142–155. SIAM
Mifflin R (1996) A quasi-second-order proximal bundle algorithm. Math Program 73(1):51–72
Mifflin R, Sun D, Qi L (1998) Quasi-Newton bundle-type methods for nondifferentiable convex optimization. SIAM J Optim 8(2):583–603
Mordukhovich BS, Yuan X, Zeng S, Zhang J (2020) A globally convergent proximal Newton-type method in nonsmooth convex optimization. arXiv preprint arXiv:2011.08166
Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161
Nolan JF (1953) Analytical differentiation on a digital computer. Ph.D. thesis, Massachusetts Institute of Technology
Noll D (2013) Bundle method for non-convex minimization with inexact subgradients and function values. In: Computational and analytical mathematics, pp 555–592. Springer
O’Donoghue B, Chu E, Parikh N, Boyd S (2016) Conic optimization via operator splitting and homogeneous self-dual embedding. J Optim Theory Appl 169(3):1042–1068. http://stanford.edu/~boyd/papers/scs.html
O’Donoghue B, Chu E, Parikh N, Boyd S (2019) SCS: Splitting conic solver, version 2.1.2. https://github.com/cvxgrp/scs
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J Bank & Financ 26(7):1443–1471
Scheinberg K, Tang X (2016) Practical inexact proximal quasi-Newton method with global complexity analysis. Math Program 160(1):495–529
Schmidt M, Berg E, Friedlander M, Murphy K (2009) Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In: Artificial Intelligence and Statistics, pp 456–463. PMLR
Schramm H, Zowe J (1992) A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM J Optim 2(1):121–152
Sra S, Nowozin S, Wright SJ (2012) Optimization for machine learning. MIT Press
Teo CH, Vishwanathan SVN, Smola A, Le QV (2010) Bundle methods for regularized risk minimization. J Mach Learn Res 11(1):311–365
Udell M, Mohan K, Zeng D, Hong J, Diamond S, Boyd S (2014) Convex optimization in Julia. In: Proceedings of the Workshop for High Performance Technical Computing in Dynamic Languages, pp 18–28
Uryasev S, Rockafellar RT (2001) Conditional Value-at-Risk: Optimization Approach, pp 411–435. Springer US, Boston, MA
Van Ackooij W, Frangioni A (2018) Incremental bundle methods using upper models. SIAM J Optim 28(1):379–410
van Ackooij W, Bello Cruz JY, de Oliveira W (2016) A strongly convergent proximal bundle method for convex minimization in Hilbert spaces. Optimization 65(1):145–167
Wainwright M, Jordan M (2008) Graphical models, exponential families, and variational inference. Now Publishers Inc
Wu X (2010) Exponential series estimator of multivariate densities. J Econom 156(2):354–366
Ye H, Luo L, Zhang Z (2017) Approximate Newton methods and their local convergence. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp 3931–3939
Yu J, Vishwanathan SVN, Günter S, Schraudolph NN (2010) A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. J Mach Learn Res 11:1145–1200
Yue MC, Zhou Z, So AMC (2019) A family of inexact SQA methods for non-smooth convex minimization with provable convergence guarantees based on the Luo-Tseng error bound property. Math Program 174(1):327–358
Additional information
The work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, by the Key Area R&D Program of Guangdong Province with grant No. 2018B030338001, by Shenzhen Outstanding Talents Training Fund, and by Guangdong Research Project No. 2017ZT07X152. Stephen Boyd’s work was funded in part by the AI Chip Center for Emerging Smart Systems (ACCESS).
A Appendix
A.1 Details of forming \(G_k\)
We form \(G_k\) in (3) using the quasi-Newton update of Fletcher (2005). The main idea is to split \(G_k\) into two blocks \(G_k^{(1)}\) and \(G_k^{(2)}\), consisting of the first \(r_1\) and the last \(r_2=r-r_1\) columns of \(G_k\), respectively, and to update them separately into \(G_{k+1}^{(1)}\) and \(G_{k+1}^{(2)}\). Then \(G_{k+1}\) is given by
The details are as follows. Let \(s_k = x^{k+1} - x^k\) and \(y_k = \nabla f(x^{k+1}) - \nabla f(x^k)\). By the convexity of f, \(s_k^T y_k\ge 0\). Suppose that \(s_k^T y_k\) is not too small, in the sense that
where constants \(\varepsilon _{\mathrm {abs}}, \varepsilon _{\mathrm {rel}}>0\) are given. Then \(r_1\) is chosen as the largest integer in [0, r] such that \(G_k^{(1)}\) satisfies
According to (39), the condition above holds at least for \(r_1=0\), in which case \(G^{(1)}_k\) degenerates to 0.
Once \(r_1\) is obtained, we define \(w_k^{(1)} = (G^{(1)}_k)^T s_k\) and \(w_k^{(2)} = (G^{(2)}_k)^T s_k\). Then \(R_k^{(1)}\in {\mathbf{R}}^{(r_1+1)\times (r_1+1)}\) is the upper triangular factor in the following R-Q decomposition
and \(Q_k^{(2)}\in {\mathbf{R}}^{r_2 \times (r_2 - 1)}\) is a matrix whose columns form an orthonormal basis of the subspace orthogonal to \(w_k^{(2)}\). The update is
There are some corner cases. If (39) holds and \(r_1 = 0\), then \(G_{k+1}^{(1)} = y_k / \sqrt{s_k^T y_k}\). If \(r_1 = r-1\) or \(r_1 = r\), then \(G_{k+1}^{(2)}\) vanishes, and \(G_{k+1}\) takes the first r columns of \(G_{k+1}^{(1)}\).
In cases where (39) does not hold, if \(\Vert w_k^{(2)}\Vert _2 > \varepsilon _{\mathrm {abs}}\), then we can still define \(G_{k+1}^{(2)}\) in the same way, and \(G_{k+1} = \left[ G^{(2)}_{k+1} \quad 0_n\right] .\) Otherwise, \(G_{k+1} = G_k\).
It is easy to check that, with \(G_k\) defined as above, the trace of \(H_k\) is uniformly bounded above in k.
The default values for the parameters are \(\varepsilon _{\mathrm {abs}}=10^{-8}\) and \(\varepsilon _{\mathrm {rel}}=10^{-3}\).
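Since the precise form of condition (39) and of the block update equations is not reproduced here, the sketch below only illustrates the flavor of the acceptance test and the \(r_1 = 0\) corner case stated in the text; the threshold used is an assumption, and the general update (which needs the RQ factorization above) is omitted.

```python
import numpy as np

def curvature_pair_ok(s, y, eps_abs=1e-8, eps_rel=1e-3):
    # Hypothetical stand-in for condition (39): require s^T y to be not too
    # small, both in absolute terms and relative to the sizes of s and y.
    # The exact inequality used in the paper may differ.
    sty = float(s @ y)
    return sty >= max(eps_abs, eps_rel * np.linalg.norm(s) * np.linalg.norm(y))

def first_block_r1_zero(s, y):
    # Corner case stated above: if (39) holds and r_1 = 0, the updated first
    # block G_{k+1}^{(1)} is the single column y / sqrt(s^T y).
    return y / np.sqrt(float(s @ y))
```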
A.2 Details of computing an optimal subgradient of g
Here we show how to compute a subgradient \(q^{k+1} \in \partial g(x^{k+1/2})\) satisfying the optimality conditions in (8), which implies (10). This, in turn, is important because it allows us to compute the stopping criteria described in Sect. 2.6.
Since we know the third term on the right-hand side of (8), it suffices to find an optimal subgradient \(l_k^{k+1/2} \in \partial l_k(x^{k+1/2})\), which is easier. We start by rewriting the defining problem for \(x^{k+1/2}\) in a more useful form. Putting (4) and (7) together, we see that we can alternatively express \(x^{k+1/2}\) as the solution to the convex problem
with variables x and \(z \in {\mathbf{R}}\).
The KKT conditions for problem (40), which are necessary and sufficient for the points \((x^{k+1/2}, z^{k+1/2})\) and \(\gamma _i, \; i=\max \{0, k-M+1\},\ldots ,k\), to be primal and dual optimal, are
Here we used the definition of \(v^k\) to simplify the stationarity condition (44).
Now we claim that
To see this, note that (42) says the \(\gamma _i\) are nonnegative and sum to one, while (41) and (43) together imply that \(\gamma _i\) can be positive only when \(\nabla f(x^i)\) is active; by (9), which says the subdifferential \(\partial l_k(x^{k+1/2})\) is the convex hull of the active gradients, the resulting convex combination is indeed a subgradient of \(l_k\) at \(x^{k+1/2}\). Therefore, rearranging (44) gives
which shows (10).
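The construction above can be mirrored in code. The sketch below is a simplified stand-in for problem (40): it uses a plain proximal term \((\lambda_k/2)\Vert x-x^k\Vert_2^2\) in place of the full quadratic model and takes g as a user-supplied CVXPY expression; the point is only to show how the dual variables \(\gamma_i\) of the cuts are recovered and combined into a subgradient of \(l_k\), as in (44).

```python
import cvxpy as cp
import numpy as np

def bundle_step(x_k, f_vals, f_grads, x_pts, lam, g_of):
    # Simplified stand-in for problem (40): epigraph variable z models the
    # piecewise affine minorant l_k; g_of(x) must return a CVXPY expression
    # for g. The paper's subproblem also carries the quasi-Newton curvature
    # term built from G_k, which is omitted here.
    n = x_k.shape[0]
    x, z = cp.Variable(n), cp.Variable()
    cuts = [fi + gi @ (x - xi) <= z
            for fi, gi, xi in zip(f_vals, f_grads, x_pts)]
    prob = cp.Problem(
        cp.Minimize(z + (lam / 2) * cp.sum_squares(x - x_k) + g_of(x)), cuts)
    prob.solve()
    # The duals gamma_i of the cuts are nonnegative and sum to one; their
    # convex combination of the stored gradients is a subgradient of l_k at
    # the solution, which is what (44) is rearranged to isolate.
    gammas = np.array([float(c.dual_value) for c in cuts])
    l_subgrad = sum(g * gi for g, gi in zip(gammas, f_grads))
    return x.value, gammas, l_subgrad
```

In the limited memory setting of the paper, f_vals, f_grads, and x_pts would hold the values, gradients, and points of the last M iterates.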
A.3 Proof that undamped steps occur infinitely often
Assume that \(\nabla f\) is Lipschitz continuous with constant L, and that \(\mu _{\max }\tau _{\min } > 2L / (1-\alpha )\). We show that for any iteration index \(k_0\), there is some \(k \ge k_0\) such that \(t_k = 1\). This means that there exists a subsequence \((k_{\ell })_{\ell =1}^\infty\) such that \(t_{k_{\ell }} = 1\).
To show the result, we first claim that the line search condition (20) is satisfied as soon as
We prove the claim later. Taking the claim as given for now, the main result follows by contradiction. Suppose there exists some iteration index \(k_0\) such that \(t_k < 1\) for each \(k \ge k_0\). Then, because \(t_k\) is the largest number of the form \(t_k=\beta ^j\) (with j a nonnegative integer) satisfying the line search condition, the line search condition does not hold with \(t_k=1\), and thus (45) does not hold with \(t_k=1\), meaning that \(\lambda _k < 2 L / (1-\alpha )\) for each \(k \ge k_0\). But from (23), we have \(\mu _k = \min \left\{ \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0},\mu _{\max }\right\}\), since we assumed \(t_k < 1\) for every \(k \ge k_0\). Additionally, from (22), we get \(\lambda _k \ge \mu _k \tau _{\min }\). So, we now have two cases. Either \(\lambda _k \ge \mu _{\max } \tau _{\min } > 2L / (1-\alpha )\), which contradicts \(\lambda _k < 2 L / (1-\alpha )\). Or \(\lambda _k \ge \gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0}\tau _{\min }\), in which case \(\gamma _{\mathrm {inc}}^{k-k_0} \mu _{k_0}\tau _{\min }\) grows exponentially in k; this means we must have \(\lambda _k \ge 2 L / (1-\alpha )\) for k sufficiently large, which is again a contradiction. This finishes the proof of the main result.
Now we prove the claim. Observe that
where we used (45) to get the first inequality. We now use the following two facts: (i) \(\phi _k\) is convex in t, and (ii) the L-Lipschitz continuity of \(\nabla f\) implies that \(\phi _k^\prime\) is \(L \Vert v^k \Vert _2^2\)-Lipschitz in t. By the first fact (convexity), we have
Adding and subtracting \(\phi _k^\prime (0)\) gives
The second fact (Lipschitz continuity) yields a bound on the third term on the right-hand side,
Finally, using (18) and (46) to bound \(\phi _k^\prime (0)\) and \(L t_k^2 \Vert v^k \Vert _2^2\) in (47) yields (20); that is, the line search condition is satisfied, as claimed.
We note that the claim above also implies the following lower bound, which is used in Sect. 3.1. Observe that if \(t_k\) indeed satisfies the line search condition, then we can express \(t_k = \beta ^j\), where j is the smallest integer such that (45) holds (see Sect. 2.4). Now consider two cases. If \(1 \le (1-\alpha ) \lambda _k / (2L)\), then we can simply take \(j = 0\), so that \(t_k = 1\). On the other hand, if \(1 > (1-\alpha ) \lambda _k / (2L)\), then a short calculation shows that \(j=\lceil \log _{\beta } ((1-\alpha ) \lambda _k / (2L)) \rceil\), and so \(j < 1 + \log _{\beta } ((1-\alpha ) \lambda _k / (2L))\), giving \(t_k = \beta ^j > \beta (1-\alpha ) \lambda _k / (2L)\). To summarize, (45) implies that \(t_k\) satisfies the inequalities
where we also used the fact that \(\lambda _k \ge \mu _{\min }\tau _{\min }\).
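As a concrete illustration of the backtracking rule analyzed here, the following minimal sketch assumes the line search condition (20) is a standard sufficient-decrease test \(\phi_k(t) \le \phi_k(0) + \alpha t \phi_k'(0)\); the exact condition used in the paper is not reproduced in this appendix and may differ.

```python
def backtrack(phi, dphi0, alpha=0.3, beta=0.5, max_halvings=60):
    # Find the largest t = beta^j, j = 0, 1, ..., satisfying the (assumed)
    # sufficient-decrease test phi(t) <= phi(0) + alpha * t * phi'(0).
    phi0 = phi(0.0)
    t = 1.0
    for _ in range(max_halvings):
        if phi(t) <= phi0 + alpha * t * dphi0:
            return t
        t *= beta
    return t

# Example: phi(t) = (1 - t)^2 accepts t = 1 immediately (an undamped step).
t_k = backtrack(lambda t: (1.0 - t) ** 2, dphi0=-2.0)
```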
A.4 Proof that the limited memory piecewise affine minorant is accurate enough
Assume that \(\nabla f\) is Lipschitz continuous with constant L. For the rest of the proof, fix any k such that \(t_k = 1\). (From Sect. A.3, it is always possible to find at least one such k.) We will show that for any such k, the limited memory piecewise affine minorant (4) satisfies the bound (29); because our choice of k was arbitrary, the required result will then follow.
For any \(l_k^{k+1} \in \partial l_k(x^{k+1})\), note that adding and subtracting \(\nabla f(x^k)\) in the left-hand side of (29) easily gives
The Lipschitz continuity of \(\nabla f\), in turn, immediately gives a bound for the first term on the right-hand side of (49), i.e., we get
using the definition of \(v^k\).
Therefore, we focus on the second term on the right-hand side of (49). For this term, (9) tells us that for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\),
because the maximum of a convex function over a convex polytope is attained at one of its vertices. The Lipschitz continuity of \(\nabla f\) again shows that, for any \(j \in \{ \max \{0,k-M+1\}, \ldots , k \}\),
To get the third line, we used the fact that the sum on the second line telescopes, and applied the triangle inequality. To get the fourth line, we used the fact that the average is less than the max. Putting this last inequality together with (51), we see that
for any \(l_k^{k+1} \in \partial l_k(x^{k+1})\). Combining (52) and (50) gives (29). This completes the proof of the result.
Cite this article
Shen, X., Ali, A. & Boyd, S. Minimizing oracle-structured composite functions. Optim Eng 24, 743–777 (2023). https://doi.org/10.1007/s11081-021-09705-0