[1]FANG Peng,LI Xian,WANG Zengfu.Conversion of singing voice based on kernel clustering and partial least squares regression[J].CAAI Transactions on Intelligent Systems,2016,11(1):55-60.[doi:10.11992/tis.201506022]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 11
Issue: 2016(1)
Pages: 55-60
Column: Academic Papers - Machine Perception and Pattern Recognition
Publication date: 2016-02-25
- Title:
Conversion of singing voice based on kernel clustering and partial least squares regression
- Author(s):
FANG Peng1,2,3; LI Xian1,3; WANG Zengfu1,2,3
1. Department of Automation, University of Science and Technology of China, Hefei 230027, China;
2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China;
3. National Engineering Laboratory of Speech and Language Information Processing, Hefei 230027, China
- Keywords:
computer audition; voice conversion; singing voice; kernel clustering; partial least squares regression; Gaussian mixture model; Mel log spectrum approximation
- CLC:
TN912; TP37
- DOI:
10.11992/tis.201506022
- Abstract:
Voice conversion is a popular topic in computer audition, and applying it to singing voices is a relatively new research direction that widens the application scope of voice conversion. When the training dataset is small, the conventional Gaussian mixture model (GMM) method tends to overfit and makes insufficient use of musical information. In this study, we propose a method that converts the voice timbre of a source singer into that of a target singer and employs the fundamental frequency to improve the quality of the converted singing voice. We use kernel clustering and partial least squares regression on the training data to obtain the conversion function. To further improve the quality of the converted singing voice, we apply a Mel log spectrum approximation (MLSA) filter, which synthesizes the converted voice by filtering the source singing waveform. Experimental results show that the proposed method achieves better voice similarity and quality than the GMM-based method, and is therefore the better choice when the training dataset is small.
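The paper's exact algorithm is not reproduced in this abstract, but the general idea it names (cluster source frames in a kernel-induced feature space, then fit a partial least squares regression per cluster as the spectral conversion function) can be sketched in plain numpy. Everything below is an illustrative assumption, not the authors' implementation: the toy "parallel corpus", the RBF kernel choice, the SVD-based PLS variant, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF kernel K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    # Kernel k-means: distances to cluster centroids in the implicit
    # feature space are computed from kernel values only.
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, n_clusters, n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def pls_fit(X, Y, n_components):
    # PLS2 regression: per component, the X-weight vector is the leading
    # left singular vector of Xc.T @ Yc; X and Y are then deflated.
    # Returns means and coefficients B so that Y ~ (X - xm) @ B + ym.
    xm, ym = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - xm, Y - ym
    W, P, Q = [], [], []
    for _ in range(n_components):
        U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
        w = U[:, 0]
        t = Xc @ w
        tt = t @ t
        p, q = Xc.T @ t / tt, Yc.T @ t / tt
        Xc, Yc = Xc - np.outer(t, p), Yc - np.outer(t, q)
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
    B = W @ np.linalg.inv(P.T @ W) @ Q.T
    return xm, ym, B

# Sanity check: on noiseless linear data, PLS with a full set of
# components should recover the mapping almost exactly.
rng = np.random.default_rng(1)
Xl = rng.normal(size=(100, 5))
Yl = Xl @ rng.normal(size=(5, 3))
xm, ym, B = pls_fit(Xl, Yl, n_components=5)
linear_mse = np.mean(((Xl - xm) @ B + ym - Yl) ** 2)

# Toy "parallel corpus": the source-to-target mapping differs between two
# regimes, mimicking cluster-dependent spectral behaviour.
n, d = 200, 6
X = rng.normal(size=(n, d))
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Y = np.where((X[:, 0] > 0)[:, None], X @ A1, X @ A2)

labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), n_clusters=2)
models = {c: pls_fit(X[labels == c], Y[labels == c], n_components=d)
          for c in np.unique(labels)}

# Conversion: each frame is mapped by the PLS model of its own cluster.
Y_hat = np.empty_like(Y)
for c, (cm, tm, Bc) in models.items():
    idx = labels == c
    Y_hat[idx] = (X[idx] - cm) @ Bc + tm
```

The per-cluster models are the point of the clustering step: a single global linear regression cannot represent a mapping that changes across regions of the source spectral space, whereas a small number of local PLS models can, which also keeps each model's parameter count low enough for small training sets.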