Abstract
We consider the problem of minimizing the sum of two convex functions, one smooth and the other possibly nonsmooth. Many high-dimensional learning problems (classification/regression) can be formulated in this framework and solved efficiently with first-order proximal methods. Because traditional proximal methods converge slowly, a recent trend is to accelerate them. Such proximal gradient methods belong to the wider class of forward–backward algorithms, which can be interpreted mathematically as fixed-point iterative schemes. In this paper, we design several new proximal gradient methods corresponding to state-of-the-art fixed-point iterative schemes and compare their performance on the regression problem. In addition, we propose a new accelerated proximal gradient algorithm that outperforms earlier methods in terms of convergence speed and regression error. To demonstrate the applicability of our method, we conducted regression experiments on several publicly available high-dimensional real datasets from different application domains. Empirical results show that the proposed method outperforms previous methods in terms of convergence, accuracy, and objective function value.
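The composite setting described above, min over x of f(x) + g(x) with f smooth and g nonsmooth, is what forward–backward methods solve: a gradient (forward) step on f followed by a proximal (backward) step on g. A minimal sketch of the classical accelerated variant, FISTA (Beck and Teboulle 2009), applied to the lasso is shown below; this is the standard baseline, not the paper's proposed algorithm, and the function names and step-size choice are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_lasso(A, b, lam, n_iter=500):
    """FISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    y = x.copy()                               # extrapolated point
    t = 1.0                                    # momentum parameter
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)               # forward (gradient) step on f
        x_new = soft_threshold(y - grad / L, lam / L)   # backward (proximal) step on g
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # acceleration/extrapolation
        x, t = x_new, t_new
    return x
```

Without the extrapolation step (i.e., y = x_new, the plain ISTA iteration), the objective decreases at rate O(1/k); the momentum update above improves this to O(1/k^2), which is the acceleration the abstract refers to.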
Acknowledgements
The first author would like to acknowledge the financial support of the Indian Institute of Technology (Banaras Hindu University) in the form of a teaching assistantship.
Cite this article
Verma, M., Shukla, K.K. A new accelerated proximal technique for regression with high-dimensional datasets. Knowl Inf Syst 53, 423–438 (2017). https://doi.org/10.1007/s10115-017-1047-z