[1]黄鉴之,丁成诚,陶蔚,等.非光滑凸情形Adam型算法的最优个体收敛速率[J].智能系统学报,2020,15(6):1140-1146.[doi:10.11992/tis.202006046]
 HUANG Jianzhi,DING Chengcheng,TAO Wei,et al.Optimal individual convergence rate of Adam-type algorithms in nonsmooth convex optimization[J].CAAI Transactions on Intelligent Systems,2020,15(6):1140-1146.[doi:10.11992/tis.202006046]

非光滑凸情形Adam型算法的最优个体收敛速率

《智能系统学报》[ISSN:1673-4785/CN:23-1538/TP]

Volume:
Vol. 15
Issue:
No. 6, 2020
Pages:
1140-1146
Section:
Academic papers - machine learning
Publication date:
2020-11-05

文章信息/Info

Title:
Optimal individual convergence rate of Adam-type algorithms in nonsmooth convex optimization
作者:
黄鉴之 1, 丁成诚 1, 陶蔚 2, 陶卿 1
1. 中国人民解放军陆军炮兵防空兵学院 信息工程系, 安徽 合肥 230031;
2. 中国人民解放军陆军工程大学 指挥控制工程学院, 江苏 南京 210007
Author(s):
HUANG Jianzhi 1, DING Chengcheng 1, TAO Wei 2, TAO Qing 1
1. Department of Information Engineering, Army Academy of Artillery and Air Defense of PLA, Hefei 230031, China;
2. Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
关键词:
机器学习; AdaGrad算法; RMSProp算法; 动量方法; Adam算法; AMSGrad算法; 个体收敛速率; 稀疏性
Keywords:
machine learning; AdaGrad algorithm; RMSProp algorithm; momentum methods; Adam algorithm; AMSGrad algorithm; individual convergence rate; sparsity
CLC number:
TP181
DOI:
10.11992/tis.202006046
摘要:
Adam是目前深度神经网络训练中广泛采用的一种优化算法框架,同时使用了自适应步长和动量技巧,克服了SGD的一些固有缺陷。但即使对于凸优化问题,目前Adam也只是在线学习框架下给出了和梯度下降法一样的regret界,动量的加速特性并没有得到体现。这里针对非光滑凸优化问题,通过巧妙选取动量和步长参数,证明了Adam的改进型具有最优的个体收敛速率,从而说明了Adam同时具有自适应和加速的优点。通过求解 ${l_1}$ 范数约束下的hinge损失问题,实验验证了理论分析的正确性和在算法保持稀疏性方面的良好性能。
Abstract:
Adam is a widely used optimization framework for training deep neural networks; it combines adaptive step sizes with momentum to overcome some inherent drawbacks of SGD. However, even for convex optimization problems, Adam has so far only been shown to attain the same regret bound as gradient descent in the online learning setting, so the acceleration effect of momentum is not reflected. This paper focuses on nonsmooth convex problems. By carefully selecting the time-varying step-size and momentum parameters, we prove that an improved variant of Adam attains the optimal individual convergence rate, which shows that Adam enjoys both adaptivity and acceleration. Experiments on the hinge loss problem under an l1-norm ball constraint verify the correctness of the theoretical analysis and the good performance of the algorithm in preserving sparsity.
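
To make the setting above concrete, the following minimal Python sketch runs an Adam-type (AMSGrad-style) projected subgradient method on the l1-norm ball constrained hinge loss described in the abstract. It is an illustration only: the function names project_l1_ball and adam_type_hinge are hypothetical, and the schedules beta1_t = t/(t + 2) and alpha/sqrt(t) are assumptions made for the sketch rather than the exact parameter choices proved optimal in the paper. The projection onto the l1 ball follows Duchi et al. [18].

import numpy as np

def project_l1_ball(v, z=1.0):
    # Euclidean projection onto the l1 ball of radius z (Duchi et al. [18]).
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (cssv - z))[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def adam_type_hinge(X, y, radius=1.0, T=1000, alpha=0.1, beta2=0.999, eps=1e-8):
    # Adam-type projected subgradient sketch for the l1-ball constrained hinge loss.
    # The time-varying momentum and step-size schedules below are assumptions.
    n, d = X.shape
    w = np.zeros(d)
    m = np.zeros(d)        # first-moment (momentum) estimate
    v = np.zeros(d)        # second-moment estimate
    v_hat = np.zeros(d)    # AMSGrad-style running maximum of v
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)
        margin = y[i] * X[i].dot(w)
        g = -y[i] * X[i] if margin < 1.0 else np.zeros(d)  # hinge-loss subgradient
        beta1_t = t / (t + 2.0)                            # assumed time-varying momentum
        m = beta1_t * m + (1.0 - beta1_t) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        step = alpha / np.sqrt(t)                          # assumed O(1/sqrt(t)) step size
        w = project_l1_ball(w - step * m / (np.sqrt(v_hat) + eps), radius)
    return w

Called as w = adam_type_hinge(X, y) with labels y in {-1, +1}, the returned w satisfies the l1-norm constraint by construction; that constraint is what encourages the sparsity examined in the experiments.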

参考文献/References:

[1] KINGMA D P, BA J L. Adam: a method for stochastic optimization[C]//Proceedings of the 3rd International Conference for Learning Representations. San Diego, USA, 2015.
[2] DUCHI J, HAZAN E, SINGER Y. Adaptive subgradient methods for online learning and stochastic optimization[J]. The journal of machine learning research, 2011, 12: 2121-2159.
[3] ZINKEVICH M. Online convex programming and generalized infinitesimal gradient ascent[C]//Proceedings of the 20th International Conference on Machine Learning. Washington, USA, 2003: 928-935.
[4] TIELEMAN T, HINTON G. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude[R]. Toronto: University of Toronto, 2012.
[5] ZEILER M D. ADADELTA: an adaptive learning rate method[EB/OL]. (2012-12-22)[2020-04-20]. https://arxiv.org/abs/1212.5701
[6] POLYAK B T. Some methods of speeding up the convergence of iteration methods[J]. USSR computational mathematics and mathematical physics, 1964, 4(5): 1-17.
[7] NESTEROV Y E. A method of solving a convex programming problem with convergence rate O(1/k^2)[J]. Soviet mathematics doklady, 1983, 27(2): 372-376.
[8] GHADIMI E, FEYZMAHDAVIAN H R, JOHANSSON M. Global convergence of the Heavy-ball method for convex optimization[C]//Proceedings of 2015 European Control Conference. Linz, Austria, 2015: 310-315.
[9] SHAMIR O, ZHANG Tong. Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes[C]//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA, 2013: I-71-I-79.
[10] 陶蔚, 潘志松, 储德军, 等. 使用Nesterov步长策略投影次梯度方法的个体收敛性[J]. 计算机学报, 2018, 41(1): 164-176
TAO Wei, PAN Zhisong, CHU Dejun, et al. The individual convergence of projected subgradient methods using the Nesterov’s step-size strategy[J]. Chinese journal of computers, 2018, 41(1): 164-176
[11] TAO Wei, PAN Zhisong, WU Gaowei, et al. The strength of Nesterov’s extrapolation in the individual convergence of nonsmooth optimization[J]. IEEE transactions on neural networks and learning systems, 2020, 31(7): 2557-2568.
[12] 程禹嘉, 陶蔚, 刘宇翔, 等. Heavy-Ball型动量方法的最优个体收敛速率[J]. 计算机研究与发展, 2019, 56(8): 1686-1694
CHENG Yujia, TAO Wei, LIU Yuxiang, et al. Optimal individual convergence rate of the Heavy-ball-based momentum methods[J]. Journal of computer research and development, 2019, 56(8): 1686-1694
[13] KIROS R, ZEMEL R S, SALAKHUTDINOV R, et al. A multiplicative model for learning distributed text-based attribute representations[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada, 2014: 2348-2356.
[14] BAHAR P, ALKHOULI T, PETER J T, et al. Empirical investigation of optimization algorithms in neural machine translation[J]. The Prague bulletin of mathematical linguistics, 2017, 108(1): 13-25.
[15] REDDI S J, KALE S, KUMAR S. On the convergence of Adam and beyond[C]//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada, 2018.
[16] WANG Guanghui, LU Shiyin, TU Weiwei, et al. SAdam: a variant of Adam for strongly convex functions[C]//Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia, 2020.
[17] CHEN Xiangyi, LIU Sijia, SUN Ruoyu, et al. On the convergence of a class of Adam-type algorithms for non-convex optimization[C]//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA, 2019.
[18] DUCHI J, SHALEV-SHWARTZ S, SINGER Y, et al. Efficient projections onto the l1-ball for learning in high dimensions[C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland, 2008: 272-279.
[19] AGARWAL A, BARTLETT P L, RAVIKUMAR P, et al. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization[J]. IEEE transactions on information theory, 2012, 58(5): 3235-3249.
[20] SHALEV-SHWARTZ S, SINGER Y, SREBRO N, et al. Pegasos: primal estimated sub-gradient solver for SVM[J]. Mathematical programming, 2011, 127(1): 3-30.
[21] RAKHLIN A, SHAMIR O, SRIDHARAN K. Making gradient descent optimal for strongly convex stochastic optimization[C]//Proceedings of the 29th International Conference on Machine Learning. Edinburgh, Scotland, 2012: 1571-1578.

备注/Memo

Received: 2020-06-28.
Foundation item: National Natural Science Foundation of China (61673394; 62076252)
About the authors: HUANG Jianzhi, master's student; his main research interests are convex optimization algorithms and their applications in machine learning. DING Chengcheng, master's student; his main research interests are convex optimization algorithms and their applications in machine learning. TAO Qing, professor, PhD; his main research interests are pattern recognition, machine learning, and applied mathematics. He has undertaken projects supported by the National Natural Science Foundation of China and the Natural Science Foundation of Anhui Province, and has published more than 60 academic papers.
Corresponding author: TAO Qing. E-mail: qing.tao@ia.ac.cn
更新日期/Last Update: 2020-12-25