YU Chengyu, LI Zhiyuan, MAO Wenyu, et al. Design and implementation of an efficient accelerator for sparse convolutional neural network[J]. CAAI Transactions on Intelligent Systems, 2020, 15(2): 323-333. DOI: 10.11992/tis.201902007.
Journal: CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 15
Issue: 2020, No. 2
Pages: 323-333
Section: Academic Articles (Machine Learning)
Publication date: 2020-03-05
Title: Design and implementation of an efficient accelerator for sparse convolutional neural network
Authors: YU Chengyu1,2, LI Zhiyuan1,2, MAO Wenyu1, LU Huaxiang1,2,3,4
1. Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China;
2. University of Chinese Academy of Sciences, Beijing 100089, China;
3. Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China;
4. Beijing Key Laboratory of Semiconductor Neural Network Intelligent Sensing and Computing Technology, Beijing 100083, China
Keywords: convolutional neural network; sparsity; embedded FPGA; ReLU; hardware acceleration; parallel computing; deep learning
CLC number: TN4
DOI: 10.11992/tis.201902007
Abstract: Hardware implementation of convolutional neural network (CNN) computation remains difficult. Most previous CNN accelerator designs have concentrated on the computation-performance and bandwidth bottlenecks while overlooking the significance of CNN sparsity to accelerator design, and the few recent accelerator designs that can exploit sparsity tend to struggle to balance computational flexibility, parallel efficiency, and resource overhead at the same time. This paper first compares how different parallel unrolling schemes affect the exploitation of sparsity and analyzes the different methods of utilizing sparsity. It then proposes a parallel unrolling scheme that exploits activation sparsity to accelerate CNN computation while offering higher parallel efficiency and lower additional resource overhead than other designs in the field. Finally, the corresponding sparse CNN accelerator is designed and implemented on an FPGA. The results show that, running VGG-16 on the ImageNet dataset, the sparse CNN accelerator built with this unrolling scheme improves convolution performance by 108.8% and overall performance by 164.6% relative to a dense-network design on the same device, a clear performance advantage.
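The mechanism the abstract relies on, skipping multiply-accumulate (MAC) operations for the zero activations that ReLU produces, can be illustrated with a minimal software sketch. The NumPy toy below is not the paper's hardware architecture; the function name sparse_conv2d and the input-stationary loop order are our illustrative assumptions. It only shows why the fraction of zeros in a ReLU-ed feature map translates directly into skipped MACs.

```python
import numpy as np

def relu(x):
    """Rectified linear unit; the source of activation sparsity."""
    return np.maximum(x, 0.0)

def sparse_conv2d(activations, weights):
    """Zero-skipping 2-D convolution (stride 1, valid padding).

    Iterates only over nonzero activations and scatters each one into
    every output it influences, so zero activations cost no MACs.
    Returns the output map and the number of MACs actually performed.
    """
    H, W = activations.shape
    K, _ = weights.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((OH, OW))
    macs = 0
    ys, xs = np.nonzero(activations)  # gather nonzero coordinates once
    for y, x in zip(ys, xs):
        a = activations[y, x]
        for ky in range(K):
            for kx in range(K):
                oy, ox = y - ky, x - kx
                if 0 <= oy < OH and 0 <= ox < OW:
                    out[oy, ox] += a * weights[ky, kx]
                    macs += 1
    return out, macs

rng = np.random.default_rng(0)
fmap = relu(rng.standard_normal((32, 32)))  # roughly half the entries become 0
kernel = rng.standard_normal((3, 3))
out, macs = sparse_conv2d(fmap, kernel)
dense_macs = (32 - 2) * (32 - 2) * 3 * 3    # what a dense design always pays
print(f"{macs} MACs vs {dense_macs} dense ({1 - macs / dense_macs:.0%} skipped)")
```

In hardware, the gathering step becomes a scheduler that dispatches only nonzero activations to the MAC array, whereas a dense design executes every MAC regardless of input; the paper's contribution is an unrolling scheme that does this scheduling with high parallel efficiency and low extra resource cost.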
Memo
Received: 2019-02-14.
Funding: National Natural Science Foundation of China (61701473); Science and Technology Service Network Initiative (STS) of the Chinese Academy of Sciences (KFJ-STS-ZDTP-070); National Defense Science and Technology Innovation Fund of the Chinese Academy of Sciences (CXJJ-17-M152); Strategic Priority Research Program of the Chinese Academy of Sciences, Category A (XDA18040400); Beijing Municipal Science and Technology Program (Z181100001518006)
About the authors: YU Chengyu, master's student; research interest: hardware acceleration of algorithms. LI Zhiyuan, PhD candidate; research interest: computer vision. MAO Wenyu, assistant researcher; research interests: intelligent computing systems, artificial intelligence algorithms, and signal processing; principal investigator of one NSFC project and one CAS Innovation Fund project, holder of one granted patent, and author of more than ten academic papers.
Corresponding author: MAO Wenyu. E-mail: maowenyu@semi.ac.cn