[1] MA Zhiqiang, LI Tuya, YANG Shuangtao, et al. Mongolian acoustic modeling based on deep neural network[J]. CAAI Transactions on Intelligent Systems, 2018, 13(03): 486-492. [doi:10.11992/tis.201710029]

Mongolian acoustic modeling based on deep neural network

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
13
Issue:
2018, No. 03
Pages:
486-492
Column:
Publication date:
2018-05-05

Article Info

Title:
Mongolian acoustic modeling based on deep neural network
Author(s):
MA Zhiqiang, LI Tuya, YANG Shuangtao, ZHANG Li
School of Data Science & Application, Inner Mongolia University of Technology, Hohhot 010080, China
Keywords:
speech recognition; acoustic model; GMM-HMM; DNN-HMM; supervised learning; pre-training; over-fitting; dropout
Classification number:
TP391
DOI:
10.11992/tis.201710029
Abstract:
When the Gaussian mixture model (GMM) is used for acoustic modeling in Mongolian speech recognition, it cannot adequately describe the correlations among Mongolian acoustic features and depends on independence assumptions. To address this problem, this study investigates Mongolian acoustic modeling with a deep neural network (DNN). First, building on the DNN, classification is tightly coupled with learning the internal structure of speech features in order to extract Mongolian acoustic features, and a DNN-HMM Mongolian acoustic model is constructed. Second, a training algorithm is designed that combines unsupervised pre-training with supervised fine-tuning. In addition, dropout is applied during training of the DNN-HMM Mongolian acoustic model to avoid over-fitting. Finally, the GMM-HMM and DNN-HMM Mongolian acoustic models are compared on a small-scale corpus using the Kaldi experimental platform. The experimental results show that the DNN-HMM Mongolian acoustic model reduces the word recognition error rate by 7.5% and the sentence recognition error rate by 13.63%, and that applying dropout during training effectively avoids over-fitting of the DNN-HMM Mongolian acoustic model.
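As an illustration of the dropout idea mentioned in the abstract, the following is a minimal NumPy sketch, not the authors' Kaldi setup. It assumes a small feed-forward network with sigmoid hidden layers and a softmax output over HMM states, and uses the common "inverted dropout" formulation. The layer sizes, the dropout rate of 0.5, and the function names `dropout` and `forward` are hypothetical choices made only for this example.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return h  # at test time the activations pass through unchanged
    keep = 1.0 - rate
    mask = rng.random(h.shape) < keep  # True for units that are kept
    return h * mask / keep

def forward(x, weights, biases, rate, rng, training=True):
    """Sigmoid hidden layers with dropout, then a softmax over output classes."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
        h = dropout(h, rate, rng, training)
    logits = h @ weights[-1] + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical dimensions: 39-dim acoustic features, two hidden layers, 10 HMM states.
rng = np.random.default_rng(0)
sizes = [39, 64, 64, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

frames = rng.normal(size=(5, 39))  # five random stand-in feature frames
posteriors = forward(frames, weights, biases, rate=0.5, rng=rng, training=True)
```

Each row of `posteriors` is a distribution over the assumed HMM states for one frame. Calling `forward` with `training=False` disables the random masking, which is how such a network would typically be used at recognition time.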

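Memo

Received: 2017-10-31.
Funding: National Natural Science Foundation of China (61762070, 61650205).
About the authors: MA Zhiqiang, male, born in 1972, professor. His main research interests are machine learning, speech recognition, and natural language processing. He has published more than 30 academic papers, more than 10 of which are indexed by EI. LI Tuya, female, born in 1993, master's student. Her main research interests are machine learning, speech recognition, and natural language processing. YANG Shuangtao, male, born in 1990, master's student. His main research interests are machine learning, speech recognition, and natural language processing.
Corresponding author: LI Tuya. E-mail: 2297854548@qq.com.
Last Update: 2018-06-25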