[1]方鹏,李贤,汪增福.运用核聚类和偏最小二乘回归的歌唱声音转换[J].智能系统学报,2016,11(1):55-60.[doi:10.11992/tis.201506022]
 FANG Peng,LI Xian,WANG Zengfu.Conversion of singing voice based on kernel clustering and partial least squares regression[J].CAAI Transactions on Intelligent Systems,2016,11(1):55-60.[doi:10.11992/tis.201506022]

运用核聚类和偏最小二乘回归的歌唱声音转换

《智能系统学报》编辑部[ISSN:1673-4785/CN:23-1538/TP]

Volume:
11
Issue:
2016, No. 1
Pages:
55-60
Publication date:
2016-02-25

文章信息/Info

Title:
Conversion of singing voice based on kernel clustering and partial least squares regression
作者:
方鹏1,2,3 李贤1,3 汪增福1,2,3
1. 中国科学技术大学信息科学技术学院, 安徽合肥 230027;
2. 中国科学院合肥智能机械研究所, 安徽合肥 230031;
3. 语音及语言信息处理国家工程实验室, 安徽合肥 230027
Author(s):
FANG Peng1,2,3 LI Xian1,3 WANG Zengfu1,2,3
1. Department of Automation, University of Science and Technology of China, Hefei 230027, China;
2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China;
3. National Engineering Laboratory of Speech and Language Information Processing, Hefei 230027, China
关键词:
计算机视觉; 语音转换; 歌唱声音; 核聚类; 偏最小二乘回归; 高斯混合模型; MLSA
Keywords:
computer vision; voice conversion; singing voice; kernel clustering; partial least squares regression; Gaussian mixture model; Mel log spectrum approximation
分类号:
TN912;TP37
DOI:
10.11992/tis.201506022
摘要:
语音转换是计算机听觉领域的热点问题之一,将歌声运用于语音转换是一种新的研究方向,同时拓宽了语音转换的应用范围。经典的高斯混合模型的方法在少量训练数据时会出现过拟合的现象,而且在转换时并未有效利用音乐信息。为此提出一种歌唱声音转换方法以实现少量训练数据时的音色转换,并且利用歌曲的基频信息提高转换歌声的声音质量。该方法使用核聚类和偏最小二乘回归进行训练得到转换函数,采用梅尔对数频谱近似(MLSA)滤波器对源歌唱声音的波形直接进行滤波来获得转换后的歌唱声音,以此提高转换歌声的声音质量。实验结果表明,在少量训练数据时,该方法在相似度和音质方面都有更好的效果,说明在少量训练数据时该方法优于传统的高斯混合模型的方法。
Abstract:
Voice conversion is a popular topic in the field of computer hearing, and the application of singing voices to voice conversion is a relatively new research direction that widens the application scope of voice conversion. When the training dataset is small, the conventional Gaussian mixture model (GMM) method may overfit, and it does not make sufficient use of music information during conversion. In this study, we propose a method for converting the voice timbre of a source singer into that of a target singer, and we employ the fundamental frequency of the song to improve the quality of the converted singing voice. We use kernel clustering and partial least squares regression on the training data to obtain the conversion function. To further improve quality, we apply a Mel log spectrum approximation (MLSA) filter, which synthesizes the converted singing voice by directly filtering the source singing waveform. Experimental results show that, with a small training dataset, the proposed method achieves better voice similarity and quality than the GMM-based method.
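The training pipeline the abstract describes — partition source spectral frames by kernel clustering, then learn one partial-least-squares mapping per cluster — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: it uses hard RBF-kernel k-means in place of the paper's kernel fuzzy clustering [12], plain NIPALS PLS2 rather than SIMPLS [14], and random toy vectors in place of mel-cepstral features; the function names, `gamma`, and component counts are assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix of the Gaussian (RBF) kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(X, k, gamma, n_iter=50):
    """Hard kernel k-means (the paper uses kernel *fuzzy* clustering;
    the hard variant keeps this sketch short)."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # deterministic farthest-point initialization in feature space
    anchors = [0]
    while len(anchors) < k:
        d_min = np.min(2.0 - 2.0 * K[:, anchors], axis=1)
        anchors.append(int(np.argmax(d_min)))
    labels = np.argmax(K[:, anchors], axis=1)
    for _ in range(n_iter):
        D = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # squared feature-space distance to the implicit cluster centroid
            D[:, c] = (np.diag(K) - 2.0 * K[:, mask].sum(axis=1) / m
                       + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def pls_fit(X, Y, n_components):
    """PLS2 regression via NIPALS; returns (B, x_mean, y_mean) so that
    Y is approximated by (X - x_mean) @ B + y_mean."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W, P, C = [], [], []
    for _ in range(n_components):
        u = Yc[:, :1].copy()
        for _ in range(500):                  # inner power-style iteration
            w = Xc.T @ u
            w /= np.linalg.norm(w)
            t = Xc @ w
            c = Yc.T @ t / (t.T @ t)
            u_new = Yc @ c / (c.T @ c)
            if np.linalg.norm(u_new - u) < 1e-12:
                u = u_new
                break
            u = u_new
        p_load = Xc.T @ t / (t.T @ t)
        Xc = Xc - t @ p_load.T                # deflate before next component
        Yc = Yc - t @ c.T
        W.append(w); P.append(p_load); C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    B = W @ np.linalg.inv(P.T @ W) @ C.T
    return B, x_mean, y_mean

# Toy demo: two well-separated frame "clusters", one conversion function each
# (random 4-dim vectors stand in for the paper's spectral features).
rng = np.random.default_rng(0)
src = np.vstack([rng.normal(0, 0.3, (40, 4)), rng.normal(5, 0.3, (40, 4))])
tgt = src @ rng.normal(size=(4, 4)) + 0.5     # pretend target-singer features
labels = kernel_kmeans(src, 2, gamma=0.2)
models = {c: pls_fit(src[labels == c], tgt[labels == c], 4)
          for c in np.unique(labels)}
```

Fitting a low-rank regressor per cluster keeps the parameter count small, which is consistent with the abstract's claim that the method resists overfitting on small training sets better than a full GMM joint-density model.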

参考文献/References:

[1] VILLAVICENCIO F, BONADA J. Applying voice conversion to concatenative singing-voice synthesis[C]//Proceedings of Interspeech. Chiba, Japan, 2010:2162-2165.
[2] ABE M, NAKAMURA S, SHIKANO K, et al. Voice conversion through vector quantization[J]. Journal of the acoustical society of Japan (E), 1990, 11(2):71-76.
[3] KAIN A, MACON M W. Spectral voice conversion for text-to-speech synthesis[C]//Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing. Seattle, WA, USA, 1998, 1:285-288.
[4] STYLIANOU Y, CAPPÉ O, MOULINES E. Continuous probabilistic transform for voice conversion[J]. IEEE transactions on speech and audio processing, 1998, 6(2):131-142.
[5] TODA T, BLACK A W, TOKUDA K. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory[J]. IEEE transactions on audio, speech, and language processing, 2007, 15(8):2222-2235.
[6] HELANDER E, VIRTANEN T, NURMINEN J, et al. Voice conversion using partial least squares regression[J]. IEEE transactions on audio, speech, and language processing, 2010, 18(5):912-921.
[7] LIU Lijuan, CHEN Linghui, LING Zhenhua, et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion[C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy, 2014:7884-7888.
[8] CHEN Linghui, LING Zhenhua, LIU Lijuan, et al. Voice conversion using deep neural networks with layer-wise generative training[J]. IEEE/ACM Transactions on audio, speech, and language processing, 2014, 22(12):1859-1872.
[9] DESAI S, BLACK A W, YEGNANARAYANA B, et al. Spectral mapping using artificial neural networks for voice conversion[J]. IEEE transactions on audio, speech, and language processing, 2010, 18(5):954-964.
[10] KOBAYASHI K, TODA T, NEUBIG G, et al. Statistical singing voice conversion with direct waveform modification based on the spectrum differential[C]//Proceedings of Interspeech. Singapore, 2014.
[11] KAWAHARA H, MORISE M, TAKAHASHI T, et al. Tandem-STRAIGHT:A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation[C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. Las Vegas, NV, USA, 2008:3933-3936.
[12] WU Zhongdong, XIE Weixin, YU Jianping. Fuzzy C-means clustering algorithm based on kernel method[C]//Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA). Xi’an, China, 2003:49-54.
[13] GRAVES D, PEDRYCZ W. Kernel-based fuzzy clustering and fuzzy clustering:a comparative experimental study[J]. Fuzzy sets and systems, 2010, 161(4):522-543.
[14] DE JONG S. SIMPLS:An alternative approach to partial least squares regression[J]. Chemometrics and intelligent laboratory systems, 1993, 18(3):251-263.
[15] IMAI S, SUMITA K, FURUICHI C. Mel log spectrum approximation (MLSA) filter for speech synthesis[J]. Electronics and communications in Japan (Part I:Communications), 1983, 66(2):10-18.


备注/Memo

Received: 2015-06-11.
Foundation item: Supported by the National Natural Science Foundation of China (61472393, 613031350).
About the authors: FANG Peng, born in 1990, M.S. candidate. His research interest is singing voice conversion. LI Xian, born in 1988, Ph.D. candidate. His research interests include emotional speech, voice conversion, and singing voice synthesis. WANG Zengfu, born in 1960, professor and doctoral supervisor. He serves on the editorial board of Pattern Recognition and Artificial Intelligence and as an associate editor of the International Journal of Information Acquisition, and he received the ACM Multimedia 2009 Best Paper Award. His research interests include computer vision, computer audition, human-computer interaction, and intelligent robots; he has published more than 180 academic papers.
Corresponding author: WANG Zengfu. E-mail: zfwang@ustc.edu.cn.