
[1]李荣军,郭秀焱,杨静远.面向鲁棒口语理解的声学组块混淆语言模型微调算法[J].智能系统学报,2023,18(1):131-137.[doi:10.11992/tis.202109024]
　LI Rongjun,GUO Xiuyan,YANG Jingyuan.A fine-tuning algorithm for acoustic text chunk confusion language model orienting to understand robust spoken language[J].CAAI Transactions on Intelligent Systems,2023,18(1):131-137.[doi:10.11992/tis.202109024]


面向鲁棒口语理解的声学组块混淆语言模型微调算法


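CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]; Volume: 18; Issue: 2023, No. 1; Pages: 131-137; Column: Academic Papers, Natural Language Processing and Understanding; Publication date: 2023-01-05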

Title:: A fine-tuning algorithm for acoustic text chunk confusion language model orienting to understand robust spoken language

作者:: 李荣军, 郭秀焱, 杨静远; 华为技术有限公司 AI应用研究中心，广东深圳 518129

Author(s):: LI Rongjun, GUO Xiuyan, YANG Jingyuan; AI Application Research Center, Huawei Technologies Co., Ltd., Shenzhen 518129, China

关键词:: 自然语言理解; 口语语言理解; 意图识别; 预训练语言模型; 语音识别; 鲁棒性; 语言模型微调; 深度学习

Keywords:: natural language understanding; spoken language understanding; intent recognition; pre-trained language model; speech recognition; robustness; language model fine-tuning; deep learning

CLC number:: TP18

DOI:: 10.11992/tis.202109024

摘要:: 利用预训练语言模型（pre-trained language models，PLM）提取句子的特征表示，在处理下游书面文本的自然语言理解的任务中已经取得了显著的效果。但是，当将其应用于口语语言理解（spoken language understanding，SLU）任务时，由于前端语音识别（automatic speech recognition，ASR）的错误，会导致SLU精度的下降。因此，本文研究如何增强PLM提高SLU模型对ASR错误的鲁棒性。具体来讲，通过比较ASR识别结果和人工转录结果之间的差异，识别出连读和删除的文本组块，通过设置新的预训练任务微调PLM，使发音相近的文本组块产生类似的特征嵌入表示，以达到减轻ASR错误对PLM影响的目的。通过在3个基准数据集上的实验表明，所提出的方法相比之前的方法，精度有较大提升，验证方法的有效性。

Abstract:: Using a pre-trained language model (PLM) to extract sentence feature representations has achieved remarkable results on downstream natural language understanding tasks for written text. However, when a PLM is applied to spoken language understanding (SLU) tasks, errors from the front-end automatic speech recognition (ASR) system degrade SLU accuracy. This paper therefore investigates how to enhance a PLM so that the SLU model is more robust to ASR errors. Specifically, by comparing ASR output with manual transcriptions, we identify text chunks that were concatenated or deleted. We then set up a new pre-training task to fine-tune the PLM so that text chunks with similar pronunciation produce similar feature embeddings, which reduces the influence of ASR errors on the PLM. Experiments on three SLU benchmark datasets show substantial accuracy improvements over previous methods, validating the effectiveness of the proposed approach.
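
The chunk-identification step described in the abstract (comparing ASR output against a manual transcription) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace-tokenized text and uses Python's standard `difflib.SequenceMatcher` for the alignment, and the function name `confusion_chunks` and the example utterance are hypothetical.

```python
from difflib import SequenceMatcher


def confusion_chunks(reference: str, hypothesis: str):
    """Return (reference_chunk, asr_chunk, kind) for every span where the
    ASR hypothesis differs from the manual transcription.

    Assumption: words are separated by whitespace. `kind` is the difflib
    opcode: "replace", "delete", or "insert".
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # autojunk=False keeps difflib from ignoring frequent words on long inputs.
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical spans carry no confusion information
        chunks.append((" ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]), tag))
    return chunks


if __name__ == "__main__":
    # Hypothetical example: "the next" is merged into "then", "please" is dropped.
    for chunk in confusion_chunks("play the next song please", "play then song"):
        print(chunk)
    # ('the next', 'then', 'replace')
    # ('please', '', 'delete')
```

Pairs extracted this way (a reference chunk and its acoustically confusable ASR counterpart) are the kind of data a fine-tuning objective could use to pull the two embeddings together; the specific pre-training task used in the paper is not given in the abstract and is not reproduced here.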



备注/Memo

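Received: 2021-09-13.
About the authors: LI Rongjun, principal engineer; research interests: human-machine dialogue and speech recognition. GUO Xiuyan, senior engineer; research interests: knowledge graphs, human-machine dialogue, and speech recognition. YANG Jingyuan, senior engineer; research interests: intelligent question answering, task-oriented dialogue systems, and speech error correction.
Corresponding author: LI Rongjun. E-mail: lirongjun3@huawei.com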

