[1]王健宗,张旭龙,姜桂林,等.基于分层联邦框架的音频模型生成技术研究[J].智能系统学报,2024,19(5):1331-1339.[doi:10.11992/tis.202306054]
 WANG Jianzong,ZHANG Xulong,JIANG Guilin,et al.Research on audio model generation technology based on a hierarchical federated framework[J].CAAI Transactions on Intelligent Systems,2024,19(5):1331-1339.[doi:10.11992/tis.202306054]

基于分层联邦框架的音频模型生成技术研究

参考文献/References:
[1] LIU Pengfei, YUAN Weizhe, FU Jinlan, et al. Pretrain, prompt, and predict: A systematic survey of prompting methods in natural language processing[J]. ACM computing surveys, 2023, 55(9): 1-35.
[2] TRUMMER I. From BERT to GPT-3 codex[J]. Proceedings of the VLDB endowment, 2022, 15(12): 3770-3773.
[3] GHOSAL D, MAJUMDER N, MEHRISH A, et al. Text-to-audio generation using instruction-tuned LLM and latent diffusion model[EB/OL]. (2023–04–24)[2023–06–30]. http://arxiv.org/abs/2304.13731v2.
[4] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM transactions on audio, speech, and language processing, 2021, 29: 3451-3460.
[5] ZEGHIDOUR N, LUEBS A, OMRAN A, et al. SoundStream: an end-to-end neural audio codec[J]. IEEE/ACM transactions on audio, speech, and language processing, 2021, 30: 495-507.
[6] HAYASHI T, WATANABE S. DiscreTalk: text-to-speech as a machine translation problem[EB/OL]. (2020–05–12)[2023–06–30]. http://arxiv.org/abs/2005.05525v1.
[7] BORSOS Z, MARINIER R, VINCENT D, et al. AudioLM: a language modeling approach to audio generation[EB/OL]. (2022–09–07)[2023–06–30]. https://arxiv.org/abs/2209.03143.
[8] AGOSTINELLI A, DENK T I, BORSOS Z, et al. MusicLM: generating music from text[EB/OL]. (2023–01–26)[2023–06–30]. https://arxiv.org/abs/2301.11325.
[9] NGUYEN T A, KHARITONOV E, COPET J, et al. Generative spoken dialogue language modeling[J]. Transactions of the association for computational linguistics, 2023, 11: 250-266.
[10] CUI Xiaodong, LU Songtao, KINGSBURY B. Federated acoustic modeling for automatic speech recognition[C]//ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6748–6752.
[11] HONG Zhenhou, WANG Jianzong, QU Xiaoyang, et al. Federated learning with dynamic transformer for text to speech[C]//Interspeech 2021. Brno: ISCA, 2021: 3590–3594.
[12] WU Yusong, CHEN Ke, ZHANG Tianyu, et al. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation[EB/OL]. (2022–11–12)[2023–06–30]. http://arxiv.org/abs/2211.06687v4.
[13] WU Shangda, YU Dingyao, TAN Xu, et al. CLaMP: contrastive language-music pre-training for cross-modal symbolic music information retrieval[EB/OL]. (2023–04–21)[2023–06–30]. http://arxiv.org/abs/2304.11029v4.
[14] WU Junru, LIANG Yi, HAN Feng, et al. Scaling multimodal pre-training via cross-modality gradient harmonization[J]. Advances in neural information processing systems, 2022, 35: 36161-36173.
[15] WANG Chengyi, CHEN Sanyuan, WU Yu, et al. Neural codec language models are zero-shot text to speech synthesizers[EB/OL]. (2023–01–05)[2023–06–30]. http://arxiv.org/abs/2301.02111v1.
[16] 谢旭康, 陈戈, 孙俊, 等. TCN-Transformer-CTC的端到端语音识别[J]. 计算机应用研究, 2022, 39(3): 699-703.
XIE Xukang, CHEN Ge, SUN Jun, et al. TCN-Transformer-CTC for end-to-end speech recognition[J]. Application research of computers, 2022, 39(3): 699-703.
[17] 解元, 邹涛, 孙为军, 等. 面向高混响环境的欠定卷积盲源分离算法[J]. 通信学报, 2023, 44(2): 82-93.
XIE Yuan, ZOU Tao, SUN Weijun, et al. Algorithm of underdetermined convolutive blind source separation for high reverberation environment[J]. Journal on communications, 2023, 44(2): 82-93.
[18] 方昕, 黄泽鑫, 张聿晗, 等. 基于时域波形的半监督端到端虚假语音检测方法[J]. 计算机应用, 2023, 43(1): 227-231.
FANG Xin, HUANG Zexin, ZHANG Yuhan, et al. Semi-supervised end-to-end fake speech detection method based on time-domain waveforms[J]. Journal of computer applications, 2023, 43(1): 227-231.
[19] 钟佳淋, 吴亚辉, 邓苏, 等. 基于改进NSGA-Ⅲ的多目标联邦学习进化算法[J]. 计算机科学, 2023, 50(4): 333-342.
ZHONG Jialin, WU Yahui, DENG Su, et al. Multi-objective federated learning evolutionary algorithm based on improved NSGA-Ⅲ[J]. Computer science, 2023, 50(4): 333-342.
[20] 陈洋, 廖灿辉, 张锟, 等. 基于自监督对比学习的信号调制识别算法[J]. 系统工程与电子技术, 2023, 45(4): 1200-1206.
CHEN Yang, LIAO Canhui, ZHANG Kun, et al. A signal modulation identification algorithm based on self-supervised contrast learning[J]. Systems engineering and electronics, 2023, 45(4): 1200-1206.
[21] 罗贤昌, 薛吟兴. 基于BERT的提示学习实现软件需求精确分类[J]. 信息技术与网络安全, 2022, 41(2): 39-45.
LUO Xianchang, XUE Yinxing. Accurately classify software requirements using prompt learning on BERT[J]. Information technology and network security, 2022, 41(2): 39-45.
[22] WANG Chengyi, WU Yu, QIAN Yao, et al. UniSpeech: unified speech representation learning with labeled and unlabeled data[EB/OL]. (2021–01–19)[2023–06–30]. http://arxiv.org/abs/2101.07597v2.
[23] TAN Yue, LONG Guodong, MA Jie, et al. Federated learning from pre-trained models: a contrastive learning approach[EB/OL]. (2022–09–21)[2023–06–30]. http://arxiv.org/abs/2209.10083v1.
[24] ITO K. The LJ Speech dataset[EB/OL]. [2023–06–30]. https://keithito.com/LJ-Speech-Dataset/.
[25] QIAN Kaizhi, ZHANG Yang, CHANG Shiyu, et al. AutoVC: zero-shot voice style transfer with only autoencoder loss[C]//Proceedings of the 36th International Conference on Machine Learning. Long Beach: PMLR, 2019: 5210–5219.
[26] KANEKO T, KAMEOKA H, TANAKA K, et al. CycleGAN-VC3: examining and improving CycleGAN-VCs for mel-spectrogram conversion[EB/OL]. (2020–10–22)[2023–06–30]. http://arxiv.org/abs/2010.11672v1.
[27] QIAN Kaizhi, ZHANG Yang, CHANG Shiyu, et al. Unsupervised speech decomposition via triple information bottleneck[C]//Proceedings of the 37th International Conference on Machine Learning. Virtual: PMLR, 2020: 7836–7846.
[28] SHEN Kai, JU Zeqian, TAN Xu, et al. NaturalSpeech 2: latent diffusion models are natural and zero-shot speech and singing synthesizers[EB/OL]. (2023–04–18)[2023–06–30]. http://arxiv.org/abs/2304.09116v3.

备注/Memo

Received: 2023-06-30.
Funding: Guangdong Provincial Key-Area Research and Development Program, "New Generation Artificial Intelligence" Major Project (2021B0101400003).
About the authors: WANG Jianzong, PhD, is deputy chief engineer of Ping An Technology (Shenzhen) Co., Ltd., senior director of artificial intelligence, general manager of the Federated Learning Technology Department, and dean of the Institute of Frontier Technologies for Intelligent Finance. He was a postdoctoral researcher in artificial intelligence at the University of Florida and received his doctorate through a joint program of Rice University and Huazhong University of Science and Technology. He is a senior member of the China Computer Federation (CCF), a member of the CCF Big Data Expert Committee, and deputy director of the Federated Data and Federated Intelligence Committee of the Chinese Association of Automation. His research interests include large models, federated learning, and deep learning. E-mail: jzwang@188.com. ZHANG Xulong, PhD, is a senior algorithm researcher at Ping An Technology (Shenzhen) Co., Ltd., and an external supervisor at the Shenzhen Research Institute of Tsinghua University and the Institute of Advanced Technology of the University of Science and Technology of China. He is a member of IEEE, the Chinese Association of Automation, and the China Computer Federation, and serves on the Federated Data and Federated Intelligence Committee. His research interests include speech synthesis, voice conversion, audio-driven virtual human generation, music information retrieval, and applications of machine learning and deep learning in artificial intelligence. In 2023 he was selected for the Youth Program of the Shanghai Oriental Talent Plan. E-mail: zhangxulong@ieee.org. XIAO Jing, PhD, is a nationally distinguished expert and the technical lead of the National New-Generation Inclusive Finance Artificial Intelligence Open Innovation Platform. He is a member of the Shenzhen CPPCC and the Shenzhen Decision Advisory Committee, vice chairman of the CCF Shenzhen chapter, vice chairman of the Guangdong Artificial Intelligence and Robotics Society, president of the Shenzhen Artificial Intelligence Industry Association, vice chairman of the Shenzhen Artificial Intelligence Society, and a guest professor at Tsinghua University, Shanghai Jiao Tong University, Tongji University, and other institutions. He previously held senior R&D management positions at Epson Research and Development in the United States and at Microsoft. He is currently chief scientist of Ping An Group, responsible for artificial intelligence R&D and its applications in finance, healthcare, and smart cities, where his team has set multiple benchmarks for intelligent operations in traditional industries. His research interests include artificial intelligence and big data analytics and mining. He has participated in or led 8 national-level projects and holds 101 US patents and 155 Chinese invention patents. His honors include the 2018 China Patent Award, the 2019 Wu Wenjun Outstanding Contribution Award in Artificial Intelligence, the 2020 Wu Wenjun Artificial Intelligence Science and Technology Progress First Prize, the 2020 Shanghai Science and Technology Progress First Prize, selection as one of China's Top 10 AI Figures of 2020, the 2021 Shenzhen May 1st Labor Medal, and the 2022 Shenzhen Most Beautiful Science and Technology Worker award. He has published 249 academic papers.
Corresponding author: ZHANG Xulong. E-mail: zhangxulong@ieee.org

更新日期/Last Update: 2024-09-05
Copyright © Editorial Office of CAAI Transactions on Intelligent Systems
Address: Building 145-1, Nantong Street, Nangang District, Harbin 150001, Heilongjiang Province, China. Tel: 0451-82534001, 82518134. E-mail: tis@vip.sina.com