[1] LIU Zeming, CHENG Zihao, LIU Jingjing, et al. Evaluation of Chinese multiskill dialogues[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1281-1293. [doi:10.11992/tis.202411001]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785 / CN 23-1538/TP]
Volume: 20
Issue: 2025(5)
Pages: 1281-1293
Column: Artificial Intelligence Deans Forum
Publication date: 2025-09-05
Title: Evaluation of Chinese multiskill dialogues
Author(s): LIU Zeming; CHENG Zihao; LIU Jingjing; YANG Xiao; GUO Yuanfang; WANG Yunhong
Affiliation: School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Keywords: multiskill dialogue; dialogue evaluation; chit-chat; open-domain dialogue; conversational recommendation; persona-chat; knowledge-grounded dialogue; large language model
CLC: TP39
DOI: 10.11992/tis.202411001
Abstract: Accurately evaluating the capabilities of a multiskill dialogue system is important for satisfying users' diverse demands, including social chit-chat, in-depth knowledge-grounded discussion, persona-based conversation, and conversational recommendation. Current benchmarks concentrate on assessing individual dialogue skills and cannot efficiently evaluate multiple dialogue skills concurrently. To facilitate the evaluation of multiskill dialogues, this study establishes a Chinese benchmark, the Multi-Skill Dialogue Evaluation benchmark (MSDE). MSDE contains 1,781 dialogues and 21,218 utterances covering four common dialogue tasks: chit-chat, knowledge-grounded dialogue, persona-based dialogue, and conversational recommendation. We performed extensive experiments on MSDE and examined the correlation between automatic and human evaluation metrics. Results indicate that (1) among the four dialogue tasks, chit-chat is the most difficult to evaluate, while knowledge-grounded dialogue is the easiest; (2) the various metrics differ significantly in performance on MSDE; and (3) in human evaluation, the difficulty of assessing each metric varies across dialogue tasks. Part of the data is available at https://github.com/IRIP-LLM/MSDE; the full dataset will be released after curation.
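As a rough illustration of the metric-correlation analysis mentioned in the abstract (a minimal sketch under assumed data, not the authors' released code), the following Python snippet computes Pearson and Spearman correlations between automatic metric scores and human ratings; all variable names and example values are hypothetical.

    # Minimal sketch (hypothetical data; not the authors' code) of correlating
    # an automatic dialogue metric with human quality ratings.
    from scipy.stats import pearsonr, spearmanr

    auto_scores = [0.41, 0.55, 0.32, 0.78, 0.60]   # e.g., per-dialogue metric scores
    human_scores = [2.0, 3.5, 1.5, 4.0, 3.0]       # e.g., 1-5 human quality ratings

    r, r_p = pearsonr(auto_scores, human_scores)       # linear correlation
    rho, rho_p = spearmanr(auto_scores, human_scores)  # rank correlation
    print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
    print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")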