
[1]柳泽明,程子豪,刘晶晶,等.中文多技能对话评估[J].智能系统学报,2025,20(5):1281-1293.[doi:10.11992/tis.202411001]
　LIU Zeming,CHENG Zihao,LIU Jingjing,et al.Evaluation of Chinese multiskill dialogues[J].CAAI Transactions on Intelligent Systems,2025,20(5):1281-1293.[doi:10.11992/tis.202411001]


中文多技能对话评估


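CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP] Volume: 20; Issue: No. 5, 2025; Pages: 1281-1293; Column: Artificial Intelligence Deans' Forum (人工智能院长论坛); Publication date: 2025-09-05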

Title:: Evaluation of Chinese multiskill dialogues

作者:: 柳泽明, 程子豪, 刘晶晶, 杨晓, 郭园方, 王蕴红; 北京航空航天大学计算机学院, 北京 100191

Author(s):: LIU Zeming, CHENG Zihao, LIU Jingjing, YANG Xiao, GUO Yuanfang, WANG Yunhong; School of Computer Science and Engineering, Beihang University, Beijing 100191, China

关键词:: 多技能对话; 对话评估; 闲聊; 开放域对话; 对话推荐; 画像聊天; 知识对话; 大语言模型

Keywords:: multiskill dialogue; dialogue evaluation; chit-chat; open domain dialogue; conversational recommendation; persona-chat; knowledge-grounded dialogue; large language model

CLC number (分类号):: TP39

DOI:: 10.11992/tis.202411001

摘要:: 准确评估多技能对话系统的能力，对满足用户多样化的需求，例如社交闲聊、深入的知识对话、角色化聊天以及对话推荐至关重要。现有的基准仅针对特定对话技能的评估，无法有效地同时评估多种对话技能。为解决这一问题，本文构建了一个中文多技能评估基准(multi-skill dialogue evaluation benchmark, MSDE)，它包含1 781个对话和21 218条话语，覆盖4类常见的对话任务，即闲聊、知识对话、画像聊天和对话推荐。然后，本文基于MSDE做了大量实验，并分析了自动评估指标和人工评估指标的相关性。实验结果表明：1)在4类对话任务中，闲聊最难评估，知识对话最容易评估。2)不同指标在MSDE上的表现存在明显差异。3)对于人工评估，各指标在不同对话任务上的评估难度不同。部分数据发布在https://github.com/IRIP-LLM/MSDE，全部数据将在整理后发布。

Abstract:: Accurately evaluating the capabilities of a multiskill dialogue system is essential for meeting users' diverse needs, such as social chit-chat, in-depth knowledge-grounded dialogue, persona-based chat, and conversational recommendation. Existing benchmarks each target a specific dialogue skill and cannot effectively evaluate multiple dialogue skills at once. To address this problem, this study builds a Chinese multiskill evaluation benchmark, the Multi-Skill Dialogue Evaluation benchmark (MSDE). MSDE contains 1,781 dialogues and 21,218 utterances covering four common dialogue tasks: chit-chat, knowledge-grounded dialogue, persona-chat, and conversational recommendation. We performed extensive experiments on MSDE and analyzed the correlation between automatic and human evaluation metrics. The results indicate that (1) among the four dialogue tasks, chit-chat is the most difficult to evaluate and knowledge-grounded dialogue is the easiest; (2) different metrics perform markedly differently on MSDE; and (3) for human evaluation, the difficulty of assessing each metric varies across dialogue tasks. Part of the data is available at https://github.com/IRIP-LLM/MSDE, and the full dataset will be released after curation.
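
The abstract reports a correlation analysis between automatic and human evaluation metrics but does not say which coefficient was used. As an illustration only, the sketch below computes Spearman's rank correlation, a common choice for this kind of comparison, using the Python standard library. The function names and the example scores are hypothetical and are not taken from MSDE.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-response scores, invented for illustration.
    automatic_scores = [0.12, 0.35, 0.30, 0.80, 0.55]  # e.g. an overlap metric
    human_ratings = [2, 3, 1, 5, 4]                    # e.g. a 1-5 rating scale
    rho = spearman(automatic_scores, human_ratings)
    print(f"Spearman correlation: {rho:.3f}")
```

A coefficient near 1 would mean the automatic metric ranks responses much as human raters do. Computing it separately per dialogue task is one way to compare how hard each task is to evaluate automatically.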


备注/Memo

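Received: 2024-11-01.
Funding: National Key Research and Development Program of China (2023YFF0725600); National Natural Science Foundation of China (62406015).
Author biographies:
LIU Zeming, assistant professor, Ph.D. Member of the Large Model and Generation Technical Committee of the Chinese Information Processing Society of China (CIPS); deputy secretary-general and founding member of the CIPS Embodied Intelligence Technical Committee (in preparation). Main research interests: natural language processing, dialogue systems, large models, and embodied intelligence. Has led projects funded by the National Natural Science Foundation of China, a Young Scientist project task of the National Key Research and Development Program, the CCF-Baidu Pinecone Fund (CCF-百度松果基金), and several university-industry research collaborations. Honors include Beihang Outstanding Young Scholar and "Excellent Innovation and Entrepreneurship Mentor" in the Beijing division of the China International College Students' Innovation Competition. Holds 10 granted invention patents and has published more than 40 academic papers, including more than 20 as first or corresponding author. E-mail: zmliu@buaa.edu.cn
CHENG Zihao. Main research interests: natural language processing and tool learning. E-mail: zihaocheng@buaa.edu.c
WANG Yunhong, professor, dean of the School of Computer Science and Engineering, Beihang University. Chair of the Intelligent Interaction Technical Committee of the Chinese Association for Artificial Intelligence (CAAI), executive council member of CAAI, and executive council member of the China Society of Image and Graphics; Fellow of the Institute of Electrical and Electronics Engineers (IEEE), the International Association for Pattern Recognition (IAPR), the China Computer Federation (CCF), and CAAI. Has led projects under the National High-Tech Research and Development Program, the National Key Basic Research Program, and the National Natural Science Foundation of China. Received the second prize of the National Technology Invention Award, the China Youth Science and Technology Award, and the first prize of the Beijing Teaching Achievement Award; was named an Outstanding Individual of the 863 Program by the Ministry of Science and Technology; and was selected for the Ministry of Education's New Century Excellent Talents program. Received the IAPR Maria Petrou Prize for women scientists, the first Chinese recipient since the prize was established. Holds more than 30 granted invention patents and has published more than 200 academic papers. E-mail: yhwang@buaa.edu.cn
Corresponding author: WANG Yunhong. E-mail: yhwang@buaa.edu.cn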

更新日期/Last Update: 2025-09-05
