[1] MO Lingfei, JIANG Hongliang, LI Xuanpeng. Review of deep learning-based video prediction[J]. CAAI Transactions on Intelligent Systems, 2018, 13(1): 85-96. [doi:10.11992/tis.201707032]

Review of deep learning-based video prediction

References:
[1] LECUN Y. Predictive Learning[R]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain, 2016.
[2] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[3] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012. South Lake Tahoe, NV, USA, 2012: 1097-1105.
[4] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 1026-1034.
[5] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[Z]. arXiv preprint arXiv: 1409.1556, 2014.
[6] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 770-778.
[7] HINTON G, DENG Li, YU Dong, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups[J]. IEEE signal processing magazine, 2012, 29(6): 82-97.
[8] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Quebec, Canada, 2014: 3104-3112.
[9] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of machine learning research, 2003, 3: 1137-1155.
[10] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing atari with deep reinforcement learning[Z]. arXiv preprint arXiv: 1312.5602, 2013.
[11] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[12] DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA, 2009: 248-255.
[13] SRIVASTAVA N, MANSIMOV E, SALAKHUTDINOV R. Unsupervised learning of video representations using LSTMs[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille, France, 2015: 843-852.
[14] MCCULLOCH W S, PITTS W. A logical calculus of the ideas immanent in nervous activity[J]. The bulletin of mathematical biophysics, 1943, 5(4): 115-133.
[15] HEBB D O. The organization of behavior: A neuropsychological theory[M]. New York: Chapman & Hall, 1949.
[16] MINSKY M L, PAPERT S A. Perceptrons: an introduction to computational geometry[M]. 2nd ed. Cambridge, MA, USA: MIT Press, 1988.
[17] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning representations by back-propagating errors[J]. Nature, 1986, 323(6088): 533-536.
[18] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[19] HINTON G E, OSINDERO S, TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural computation, 2006, 18(7): 1527-1554.
[20] JORDAN M I. Serial order: A parallel distributed processing approach[J]. Advances in psychology, 1997, 121: 471-495.
[21] BENGIO Y. Learning deep architectures for AI[J]. Foundations and trends in machine learning, 2009, 2(1): 1-127.
[22] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Quebec, Canada, 2014: 2672-2680.
[23] BENGIO Y, COURVILLE A, VINCENT P. Representation learning: a review and new perspectives[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(8): 1798-1828.
[24] HUBEL D H, WIESEL T N. Receptive fields and functional architecture of monkey striate cortex[J]. The journal of physiology, 1968, 195(1): 215-243.
[25] FUKUSHIMA K, MIYAKE S. Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition[M]//AMARI S I, ARBIB M A. Competition and Cooperation in Neural Nets. Berlin Heidelberg: Springer, 1982: 267-285.
[26] ZEILER M D, KRISHNAN D, TAYLOR G W, et al. Deconvolutional networks[C]//Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA, 2010: 2528-2535.
[27] NOH H, HONG S, HAN B. Learning deconvolution network for semantic segmentation[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 1520-1528.
[28] RADFORD A, METZ L, CHINTALA S. Unsupervised representation learning with deep convolutional generative adversarial networks[Z]. arXiv preprint arXiv: 1511.06434, 2015.
[29] JI Shuiwang, XU Wei, YANG Ming, et al. 3D convolutional neural networks for human action recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(1): 221-231.
[30] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[31] GERS F A, SCHMIDHUBER J. Recurrent nets that time and count[C]//Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. Como, Italy, 2000, 3: 189-194.
[32] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[Z]. arXiv preprint arXiv: 1406.1078, 2014.
[33] SHI Xingjian, CHEN Zhourong, WANG Hao, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Quebec, Canada, 2015: 802-810.
[34] VINCENT P, LAROCHELLE H, LAJOIE I, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion[J]. Journal of machine learning research, 2010, 11: 3371-3408.
[35] NG A. Sparse autoencoder[R]. CS294A Lecture Notes, 2011: 72.
[36] KINGMA D P, WELLING M. Auto-encoding variational bayes[Z]. arXiv preprint arXiv: 1312.6114, 2013.
[37] REZENDE D J, MOHAMED S, WIERSTRA D. Stochastic backpropagation and approximate inference in deep generative models[Z]. arXiv preprint arXiv: 1401.4082, 2014.
[38] MIRZA M, OSINDERO S. Conditional generative adversarial nets[Z]. arXiv preprint arXiv: 1411.1784, 2014.
[39] CHEN Xi, DUAN Yan, HOUTHOOFT R, et al. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets[C]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 2172-2180.
[40] LEDIG C, THEIS L, HUSZÁR F, et al. Photo-realistic single image super-resolution using a generative adversarial network[Z]. arXiv preprint arXiv: 1609.04802, 2016.
[41] WU Jiajun, ZHANG Chengkai, XUE Tianfan, et al. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling[C]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 82-90.
[42] ISOLA P, ZHU Junyan, ZHOU Tinghui, et al. Image-to-image translation with conditional adversarial networks[Z]. arXiv preprint arXiv: 1611.07004, 2016.
[43] VONDRICK C, PIRSIAVASH H, TORRALBA A. Generating videos with scene dynamics[C]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 613-621.
[44] VONDRICK C, PIRSIAVASH H, TORRALBA A. Anticipating visual representations from unlabeled video[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 98-106.
[45] LAN Tian, CHEN T C, SAVARESE S. A hierarchical representation for future action prediction[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 689-704.
[46] HOAI M, DE LA TORRE F. Max-margin early event detectors[J]. International journal of computer vision, 2014, 107(2): 191-202.
[47] RYOO M S. Human activity prediction: Early recognition of ongoing activities from streaming videos[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain, 2011: 1036-1043.
[48] VU T H, OLSSON C, LAPTEV I, et al. Predicting actions from static scenes[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 421-436.
[49] PEI Mingtao, JIA Yunde, ZHU Songchun. Parsing video events with goal inference and intent prediction[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain, 2011: 487-494.
[50] FOUHEY D F, ZITNICK C L. Predicting object dynamics in scenes[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, 2014: 2027-2034.
[51] KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 38(1): 14-29.
[52] HUANG Dean, KITANI K M. Action-reaction: Forecasting the dynamics of human interaction[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 489-504.
[53] PICKUP L C, PAN Zheng, WEI Donglai, et al. Seeing the arrow of time[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, 2014: 2043-2050.
[54] LAMPERT C H. Predicting the future behavior of a time-varying probability distribution[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, 2015: 942-950.
[55] PINTEA S L, VAN GEMERT J C, SMEULDERS A W M. Déjà vu: Motion prediction in static images[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 172-187.
[56] KITANI K M, ZIEBART B D, BAGNELL J A, et al. Activity forecasting[C]//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy, 2012: 201-214.
[57] GONG Haifeng, SIM J, LIKHACHEV M, et al. Multi-hypothesis motion planning for visual object tracking[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain, 2011: 619-626.
[58] KOOIJ J F P, SCHNEIDER N, FLOHR F, et al. Context-based pedestrian path prediction[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 618-633.
[59] WALKER J, DOERSCH C, GUPTA A, et al. An uncertain future: Forecasting from static images using variational autoencoders[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, Netherlands, 2016: 835-851.
[60] WALKER J, GUPTA A, HEBERT M. Dense optical flow prediction from a static image[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 2443-2451.
[61] WALKER J, GUPTA A, HEBERT M. Patch to the future: Unsupervised visual prediction[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, 2014: 3302-3309.
[62] YUEN J, TORRALBA A. A data-driven approach for event prediction[C]//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Crete, Greece, 2010: 707-720.
[63] MOTTAGHI R, RASTEGARI M, GUPTA A, et al. “What happens if...” learning to predict the effect of forces in images[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, Netherlands, 2016: 269-285.
[64] SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge, UK, 2004, 3: 32-36.
[65] VUKOTIĆ V, PINTEA S L, RAYMOND C, et al. One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network[C]//Proceedings of the 19th International Conference on Image Analysis and Processing. Catania, Italy, 2017: 140-151.
[66] IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE transactions on pattern analysis and machine intelligence, 2014, 36(7): 1325-1339.
[67] YAN Yichao, XU Jingwei, NI Bingbing, et al. Skeleton-aided articulated motion generation[Z]. arXiv preprint arXiv: 1707.01058, 2017.
[68] VILLEGAS R, YANG Jimei, ZOU Yuliang, et al. Learning to generate long-term future via hierarchical prediction[Z]. arXiv preprint arXiv: 1704.05831, 2017.
[69] SOOMRO K, ZAMIR A R, SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[Z]. arXiv preprint arXiv: 1212.0402, 2012.
[70] MATHIEU M, COUPRIE C, LECUN Y. Deep multi-scale video prediction beyond mean square error[Z]. arXiv preprint arXiv: 1511.05440, 2015.
[71] HINTZ J J. Generative adversarial reservoirs for natural video prediction[D]. Austin, USA: The University of Texas.
[72] VILLEGAS R, YANG Jimei, HONG S, et al. Decomposing motion and content for natural video sequence prediction[C]//Proceedings of the 2017 International Conference on Learning Representations. Toulon, France, 2017.
[73] LIU Ziwei, et al. Video frame synthesis using deep voxel flow[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA, 2017: 4463-4471.
[74] GORBAN A, IDREES H, JIANG Yugang, et al. THUMOS challenge: Action recognition with a large number of classes[EB/OL]. (2015–05). http://www.thumos.info.
[75] GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: the KITTI dataset[J]. The international journal of robotics research, 2013, 32(11): 1231-1237.
[76] LOTTER W, KREIMAN G, COX D. Deep predictive coding networks for video prediction and unsupervised learning[Z]. arXiv preprint arXiv: 1605.08104, 2016.
[77] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain, 2011: 2556-2563.
[78] CORDTS M, OMRAN M, RAMOS S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 3213-3223.
[79] JIN Xiaojie, LI Xin, XIAO Huaxin, et al. Video scene parsing with predictive feature learning[Z]. arXiv preprint arXiv: 1612.00119, 2016.
[80] LOTTER W, KREIMAN G, COX D. Unsupervised learning of visual structure using predictive generative networks[Z]. arXiv preprint arXiv: 1511.06380, 2015.
[81] YAN Xing, CHANG Hong, SHAN Shiguang, et al. Modeling video dynamics with deep dynencoder[C]//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland, 2014: 215-230.
[82] RANZATO M, SZLAM A, BRUNA J, et al. Video (language) modeling: a baseline for generative models of natural videos[Z]. arXiv preprint arXiv: 1412.6604, 2014.
[83] OH J, GUO Xiaoxiao, LEE H, et al. Action-conditional video prediction using deep networks in atari games[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Quebec, Canada, 2015: 2863-2871.
[84] FINN C, GOODFELLOW I, LEVINE S. Unsupervised learning for physical interaction through video prediction[C]//Proceedings of the 30th Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 64-72.
[85] LUC P, NEVEROVA N, COUPRIE C, et al. Predicting deeper into the future of semantic segmentation[Z]. arXiv preprint arXiv: 1703.07684, 2017.
[86] CHEN Xiongtao, WANG Wenmin, WANG Jinzhou, et al. Long-term video interpolation with bidirectional predictive network[Z]. arXiv preprint arXiv: 1706.03947, 2017.
[87] XUE Tianfan, WU Jiajun, BOUMAN K, et al. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks[C]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain, 2016: 91-99.
[88] DENTON E, BIRODKAR V. Unsupervised learning of disentangled representations from video[Z]. arXiv preprint arXiv: 1705.10915, 2017.
Similar articles:
[1] ZHANG Yuanyuan, HUO Jing, YANG Wanqi, et al. A deep belief network-based heterogeneous face verification method for the second-generation identity card[J]. CAAI Transactions on Intelligent Systems, 2015, 10(2): 193. [doi:10.3969/j.issn.1673-4785.201405060]
[2] DING Ke, TAN Ying. A review on general purpose computing on GPUs and its applications in computational intelligence[J]. CAAI Transactions on Intelligent Systems, 2015, 10(1): 1. [doi:10.3969/j.issn.1673-4785.201403072]
[3] MA Xiao, ZHANG Fandong, FENG Jufu. Sparse representation via deep learning features based face recognition method[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 279. [doi:10.11992/tis.201603026]
[4] LIU Shuaishi, CHENG Xi, GUO Wenyan, et al. Progress report on new research in deep learning[J]. CAAI Transactions on Intelligent Systems, 2016, 11(5): 567. [doi:10.11992/tis.201511028]
[5] MA Shilong, WUNIRI Qiqige, LI Xiaoping. Deep learning with big data: state of the art and development[J]. CAAI Transactions on Intelligent Systems, 2016, 11(6): 728. [doi:10.11992/tis.201611021]
[6] WANG Yajie, QIU Hongkun, WU Yanyan, et al. Research and development of computer games[J]. CAAI Transactions on Intelligent Systems, 2016, 11(6): 788. [doi:10.11992/tis.201609006]
[7] HUANG Xinhan. A3I: the star of science and technology for the 21st century[J]. CAAI Transactions on Intelligent Systems, 2016, 11(6): 835. [doi:10.11992/tis.201605022]
[8] SONG Wanru, ZHAO Qingqing, CHEN Changhong, et al. Survey on pedestrian re-identification research[J]. CAAI Transactions on Intelligent Systems, 2017, 12(6): 770. [doi:10.11992/tis.201706084]
[9] YANG Mengduo, LUAN Yonghong, LIU Wenjun, et al. Feature transfer algorithm based on an auto-encoder[J]. CAAI Transactions on Intelligent Systems, 2017, 12(6): 894. [doi:10.11992/tis.201706037]
[10] WANG Kejun, ZHAO Yandong, XING Xianglei. Deep learning in driverless vehicles[J]. CAAI Transactions on Intelligent Systems, 2018, 13(1): 55. [doi:10.11992/tis.201609029]

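Memo

Received: 2017-07-19.
Funding: Key Project of the National Science and Technology Support Program of the 12th Five-Year Plan (2015BAG09B01).
About the authors: MO Lingfei, male, born in 1981, associate professor, Ph.D. His main research interests are machine learning and artificial intelligence, the Internet of Things and edge computing, and intelligent robots. He has published many academic papers, more than 40 of which are indexed by SCI or EI. JIANG Hongliang, male, born in 1993, master's student. His main research interests are deep unsupervised learning and computer vision. LI Xuanpeng, male, born in 1985, lecturer, Ph.D. His main research interests are machine vision, driver assistance systems, and environment perception and information fusion.
Corresponding author: MO Lingfei. E-mail: lfmo@seu.edu.cn.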

Last Update: 2018-02-01