[1]黄河燕,李思霖,兰天伟,等.大语言模型安全性:分类、评估、归因、缓解、展望[J].智能系统学报,2025,20(1):2-32.[doi:10.11992/tis.202401006]
 HUANG Heyan,LI Silin,LAN Tianwei,et al.A survey on the safety of large language model: classification, evaluation, attribution, mitigation and prospect[J].CAAI Transactions on Intelligent Systems,2025,20(1):2-32.[doi:10.11992/tis.202401006]

大语言模型安全性:分类、评估、归因、缓解、展望

参考文献/References:
[1] WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models[EB/OL]. (2022-07-15)[2024-01-03]. https://arxiv.org/abs/2206.07682.
[2] LIU Xiao, JI Kaixuan, FU Yicheng, et al. P-Tuning: prompt tuning can be comparable to fine-tuning across scales and tasks[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin: Association for Computational Linguistics, 2022: 61-68.
[3] HAO Jianye, YANG Tianpei, TANG Hongyao, et al. Exploration in deep reinforcement learning: from single-agent to multiagent domain[EB/OL]. (2021-09-14)[2024-01-03]. https://arxiv.org/abs/2109.06668v6.
[4] JI Jiaming, QIU Tianyi, CHEN Boyuan, et al. AI alignment: a comprehensive survey[EB/OL]. (2023-10-30)[2024-01-03]. https://arxiv.org/abs/2310.19852.
[5] 鲍小异, 姜晓彤, 王中卿, 等. 基于跨语言图神经网络模型的属性级情感分类[J]. 软件学报, 2023, 34(2): 676-689.
BAO Xiaoyi, JIANG Xiaotong, WANG Zhongqing, et al. Cross-lingual aspect-level sentiment classification with graph neural network[J]. Journal of software, 2023, 34(2): 676-689.
[6] HASTIE T, TIBSHIRANI R, FRIEDMAN J. The elements of statistical learning[M]. 2nd ed. New York: Springer, 2009.
[7] DENG Li, YU Dong. Deep learning: methods and applications[J]. Foundations and trends® in signal processing, 2014, 7(3-4): 197-387.
[8] BUBECK S, CHANDRASEKARAN V, ELDAN R, et al. Sparks of artificial general intelligence: early experiments with GPT-4[EB/OL]. (2023-03-22)[2024-01-03]. https://arxiv.org/abs/2303.12712v5.
[9] OpenAI. GPT-4 technical report[EB/OL]. (2023-08-30)[2024-01-03]. https://api.semanticscholar.org/CorpusID:266362871.
[10] DONG Qingxiu, LI Lei, DAI Damai, et al. A survey on in-context learning[EB/OL]. (2022-12-31)[2024-01-03]. https://arxiv.org/abs/2301.00234.
[11] LIU Pengfei, YUAN Weizhe, FU Jinlan, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing[J]. ACM computing surveys, 2023, 55(9): 1-35.
[12] WEI J, WANG Xuezhi, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[EB/OL]. (2022-01-28)[2024-01-03]. https://arxiv.org/abs/2201.11903v6.
[13] SON Guijin, JUNG Hanna, JIN S. Beyond classification: financial reasoning in state-of-the-art language models[EB/OL]. (2023-05-30)[2024-01-03]. https://api.semanticscholar.org/CorpusID:258437058.
[14] BLAIR-STANEK A, HOLZENBERGER N, VAN DURME B. Can GPT-3 perform statutory reasoning?[C]//Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. Braga: ACM, 2023: 22-31.
[15] YU Fangyi, QUARTEY L, SCHILDER F. Legal prompting: teaching a language model to think like a lawyer[EB/OL]. (2022-12-30)[2024-01-03]. https://api.semanticscholar.org/CorpusID:254221002.
[16] TANG Ruixiang, HAN Xiaotian, JIANG Xiaoqian, et al. Does synthetic data generation of LLMs help clinical text mining?[EB/OL]. (2023-03-25)[2024-01-03]. https://api.semanticscholar.org/CorpusID:257405132.
[17] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. [2024-01-03]. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[18] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901.
[19] CARLINI N, TRAMER F, WALLACE E, et al. Extracting training data from large language models[C]//30th USENIX Security Symposium (USENIX Security 21). [S.l.]: [s.n.], 2021: 2633-2650.
[20] ABID A, FAROOQI M, ZOU J. Persistent anti-muslim bias in large language models[C]//Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. Virtual Event: ACM, 2021: 298-306.
[21] TAYLOR R, KARDAS M, CUCURULL G, et al. Galactica: a large language model for science[EB/OL]. (2022-11-16)[2024-01-03]. https://arxiv.org/abs/2211.09085v1.
[22] EDWARDS B. New meta AI demo writes racist and inaccurate scientific literature, gets pulled[EB/OL]. (2022-11-18)[2023-09-27]. https://arstechnica.com/information-technology/2022/11/after-controversy-meta-pulls-demo-of-ai-model-that-writes-scientific-papers/.
[23] RAE J W, BORGEAUD S, CAI T, et al. Scaling language models: methods, analysis & insights from training Gopher[EB/OL]. (2021-12-08)[2024-01-03]. https://arxiv.org/abs/2112.11446v2.
[24] 任奎, 孟泉润, 闫守琨, 等. 人工智能模型数据泄露的攻击与防御研究综述[J]. 网络与信息安全学报, 2021, 7(1): 1-10.
REN Kui, MENG Quanrun, YAN Shoukun, et al. Survey of artificial intelligence data security and privacy protection[J]. Chinese journal of network and information security, 2021, 7(1): 1-10.
[25] GOLDSTEIN J A, SASTRY G, MUSSER M, et al. Generative language models and automated influence operations: emerging threats and potential mitigations[EB/OL]. (2023-01-10)[2024-01-03]. https://arxiv.org/abs/2301.04246v1.
[26] ZHAO W X, ZHOU Kun, LI Junyi, et al. A survey of large language models[EB/OL]. (2023-03-31)[2024-01-03]. https://arxiv.org/abs/2303.18223v15.
[27] WANG Yufei, ZHONG Wanjun, LI Liangyou, et al. Aligning large language models with human: a survey[EB/OL]. (2023-07-28)[2024-01-03]. https://api.semanticscholar.org/CorpusID:260356605.
[28] HUANG Xiaowei, RUAN Wenjie, HUANG Wei, et al. A survey of safety and trustworthiness of large language models through the lens of verification and validation[J]. Artificial intelligence review, 2024, 57(7): 175.
[29] LIU Yang, YAO Yuanshun, TON J F, et al. Trustworthy LLMs: a survey and guideline for evaluating large language models’ alignment[EB/OL]. (2023-08-10)[2024-01-03]. https://arxiv.org/abs/2308.05374.
[30] ZHANG Yue, LI Yafu, CUI Leyang, et al. Siren’s song in the AI ocean: a survey on hallucination in large language models[EB/OL]. (2023-09-03)[2024-01-03]. https://arxiv.org/abs/2309.01219.
[31] RAWTE V, SHETH A, DAS A. A survey of hallucination in large foundation models[EB/OL]. (2023-09-12)[2024-01-03]. https://arxiv.org/abs/2309.05922.
[32] CHANG Yupeng, WANG Xu, WANG Jindong, et al. A survey on evaluation of large language models[EB/OL]. (2023-07-06)[2024-01-03]. https://arxiv.org/abs/2307.03109?context=cs.AI.
[33] WEIDINGER L, UESATO J, RAUH M, et al. Taxonomy of risks posed by language models[C]//2022 ACM Conference on Fairness, Accountability, and Transparency. Seoul: ACM, 2022: 214-229.
[34] OUYANG Long, WU J, JIANG Xu, et al. Training language models to follow instructions with human feedback[EB/OL]. (2022-03-04)[2024-01-03]. https://arxiv.org/abs/2203.02155?context=cs.CL.
[35] STIENNON N, OUYANG Long, WU J, et al. Learning to summarize from human feedback[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 3008-3021.
[36] SHUMAILOV I, ZHAO Yiren, BATES D, et al. Sponge examples: energy-latency attacks on neural networks[C]//2021 IEEE European Symposium on Security and Privacy. Vienna: IEEE, 2021: 212-231.
[37] GRESHAKE K, ABDELNABI S, MISHRA S, et al. Not what you’ve signed up for: compromising real-world LLM-integrated applications with indirect prompt injection[C]//Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. Copenhagen: ACM, 2023: 79-90.
[38] PEREZ F, RIBEIRO I. Ignore previous prompt: attack techniques for language models[EB/OL]. (2022-11-18)[2024-01-03]. https://api.semanticscholar.org/CorpusID:253581710.
[39] ZHU Kaijie, WANG Jindong, ZHOU Jiaheng, et al. PromptRobust: towards evaluating the robustness of large language models on adversarial prompts[EB/OL]. (2023-06-07) [2024-01-03]. https://arxiv.org/abs/2306.04528.
[40] 冀甜甜, 方滨兴, 崔翔, 等. 深度学习赋能的恶意代码攻防研究进展[J]. 计算机学报, 2021, 44(4): 669-695.
JI Tiantian, FANG Binxing, CUI Xiang, et al. Research on deep learning-powered malware attack and defense techniques[J]. Chinese journal of computers, 2021, 44(4): 669-695.
[41] LI Linyang, SONG Demin, LI Xiaonan, et al. Backdoor attacks on pre-trained models by layerwise weight poisoning[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana: Association for Computational Linguistics, 2021: 3023-3032.
[42] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: Association for Computational Linguistics, 2019: 4171-4186.
[43] YANG Wenkai, LI Lei, ZHANG Zhiyuan, et al. Be careful about poisoned word embeddings: exploring the vulnerability of the embedding layers in NLP models[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2021: 2048-2058.
[44] LI Shaofeng, LIU Hui, DONG Tian, et al. Hidden backdoors in human-centric language models[C]//Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. Virtual Event: ACM, 2021: 3123-3140.
[45] CHEN Kangjie, MENG Yuxian, SUN Xiaofei, et al. BadPre: task-agnostic backdoor attacks to pre-trained NLP foundation models[C]//International Conference on Learning Representations. Virtual Event: ICLR, 2022.
[46] CHEN Xiaoyi, SALEM A, CHEN Dingfan, et al. BadNL: backdoor attacks against NLP models with semantic-preserving improvements[EB/OL]. (2020-06-06)[2024-01-03]. https://arxiv.org/abs/2006.01043v2.
[47] ZHANG Zhengyan, XIAO Guangxuan, LI Yongwei, et al. Red alarm for pre-trained models: universal vulnerability to neuron-level backdoor attacks[J]. Machine intelligence research, 2023, 20(2): 180-193.
[48] LIU Yinhan, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-06)[2024-01-03]. https://arxiv.org/abs/1907.11692v1.
[49] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2020-10-22)[2024-01-03]. https://arxiv.org/abs/2010.11929.
[50] XU Jiashu, MA M D, WANG Fei, et al. Instructions as backdoors: backdoor vulnerabilities of instruction tuning for large language models[EB/OL]. (2023-05-24)[2024-01-03]. https://arxiv.org/abs/2305.14710.
[51] LEE H, PHATALE S, MANSOOR H, et al. RLAIF vs. RLHF: scaling reinforcement learning from human feedback with AI feedback[EB/OL]. (2023-09-01)[2024-01-03]. https://arxiv.org/abs/2309.00267.
[52] KIRK H R, VIDGEN B, RÖTTGER P, et al. Personalisation within bounds: a risk taxonomy and policy framework for the alignment of large language models with personalised feedback[EB/OL]. (2023-03-10)[2024-01-03]. https://arxiv.org/abs/2303.05453.
[53] PAN A, BHATIA K, STEINHARDT J. The effects of reward misspecification: mapping and mitigating misaligned models[EB/OL]. (2022-01-10)[2024-01-03]. https://arxiv.org/abs/2201.03544.
[54] OREKONDY T, SCHIELE B, FRITZ M. Knockoff nets: stealing functionality of black-box models[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4949-4958.
[55] KRISHNA K, TOMAR G S, PARIKH A P, et al. Thieves on Sesame Street! model extraction of BERT-based APIs[EB/OL]. (2019-10-06)[2024-01-03]. https://arxiv.org/abs/1910.12366v3.
[56] LIU Yupei, JIA Jinyuan, LIU Hongbin, et al. StolenEncoder: stealing pre-trained encoders in self-supervised learning[C]//Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. Los Angeles: ACM, 2022: 2115-2128.
[57] DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248-255.
[58] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR, 2021: 8748-8763.
[59] ELMAHDY A, SALEM A. Deconstructing classifiers: towards a data reconstruction attack against text classification models[C]//Proceedings of the Fifth Workshop on Privacy in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2024: 143-158.
[60] AL-KASWAN A, IZADI M, VAN DEURSEN A. Targeted attack on GPT-Neo for the SATML language model data extraction challenge[EB/OL]. (2023-02-13)[2024-01-03]. https://arxiv.org/abs/2302.07735.
[61] BLACK S, BIDERMAN S, HALLAHAN E, et al. GPT-NeoX-20B: an open-source autoregressive language model[EB/OL]. (2022-06-14) [2024-01-03]. https://arxiv.org/abs/2204.06745v1.
[62] ZANELLA-BÉGUELIN S, WUTSCHITZ L, TOPLE S, et al. Analyzing information leakage of updates to natural language models[C]//Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. Virtual Event: ACM, 2020: 363-375.
[63] LI Haoran, GUO Dadi, FAN Wei, et al. Multi-step jailbreaking privacy attacks on ChatGPT[EB/OL]. (2023-04-11)[2024-01-03]. https://arxiv.org/abs/2304.05197.
[64] CHEN Lingjiao, ZAHARIA M, ZOU J. How is ChatGPT’s behavior changing over time?[J/OL]. Harvard data science review, (2024-03-13)[2024-11-15]. https://doi.org/10.1162/99608f92.5317da47.
[65] ZOU A, WANG Zifan, CARLINI N, et al. Universal and transferable adversarial attacks on aligned language models[EB/OL]. (2023-07-17)[2024-01-03]. https://arxiv.org/abs/2307.15043.
[66] YUAN Youliang, JIAO Wenxiang, WANG Wenxuan, et al. GPT-4 is too smart to be safe: stealthy chat with LLMs via cipher[EB/OL]. (2023-08-12)[2024-01-03]. https://arxiv.org/abs/2308.06463?context=cs.
[67] SHEN Xinyue, CHEN Z, BACKES M, et al. In ChatGPT we trust? measuring and characterizing the reliability of ChatGPT[EB/OL]. (2023-04-18)[2024-01-03]. https://api.semanticscholar.org/CorpusID:258187122.
[68] WANG Jindong, HU Xixu, HOU Wenxin, et al. On the robustness of ChatGPT: an adversarial and out-of-distribution perspective[C]//ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models. [S.l.]: ICLR, 2023: 48-62.
[69] DENG Jiawen, CHENG Jiale, SUN Hao, et al. Towards safer generative language models: a survey on safety risks, evaluations, and improvements[EB/OL]. (2023-11-30)[2024-01-03]. https://arxiv.org/abs/2302.09270v3.
[70] Hugging Face. 为大语言模型建立红队对抗 (Red-teaming large language models)[EB/OL]. (2023-02-24)[2023-09-27]. https://huggingface.co/blog/zh/red-teaming.
[71] BORKAR J. What can we learn from data leakage and unlearning for law?[EB/OL]. (2023-07-19)[2024-01-03]. https://arxiv.org/abs/2307.10476.
[72] KIM S, YUN S, LEE H, et al. ProPILE: probing privacy leakage in large language models[EB/OL]. (2023-07-04)[2024-01-03]. https://arxiv.org/abs/2307.01881.
[73] BOSTROM N. Information hazards: a typology of potential harms from knowledge[J]. Review of contemporary philosophy, 2011(10): 44-79.
[74] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of machine learning research, 2020, 21(140): 5485-5551.
[75] SCHRAMOWSKI P, TURAN C, ANDERSEN N, et al. Large pre-trained language models contain human-like biases of what is right and wrong to do[J]. Nature machine intelligence, 2022, 4: 258-268.
[76] CALISKAN A, BRYSON J J, NARAYANAN A. Semantics derived automatically from language corpora contain human-like biases[J]. Science, 2017, 356(6334): 183-186.
[77] LUCY L, BAMMAN D. Gender and representation bias in GPT-3 generated stories[C]//Proceedings of the Third Workshop on Narrative Understanding. Stroudsburg: Association for Computational Linguistics, 2021: 48-55.
[78] HARTMANN J, SCHWENZOW J, WITTE M. The political ideology of conversational AI: converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation[J]. SSRN electronic journal, 2023: 4216084.
[79] GEHMAN S, GURURANGAN S, SAP M, et al. RealToxicityPrompts: evaluating neural toxic degeneration in language models[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: Association for Computational Linguistics, 2020: 3356-3369.
[80] 汪楠, 成鹰, 曹辉, 等. 信息检索技术[M]. 第3版. 北京: 清华大学出版社, 2018.
WANG Nan, CHENG Ying, CAO Hui, et al. Information retrieval technology[M]. 3rd ed. Beijing: Tsinghua University Press, 2018.
[81] FU Xiaorong, ZHANG Bin, et al. Impact of quantity and timeliness of EWOM information on consumer’s online purchase intention under C2C environment[J]. Asian journal of business research, 2011, 1(2): 110010.
[82] EDMONDS C T, EDMONDS J E, VERMEER B Y, et al. Does timeliness of financial information matter in the governmental sector?[J]. Journal of accounting and public policy, 2017, 36(2): 163-176.
[83] OUYANG Long, WU J, JIANG Xu, et al. Training language models to follow instructions with human feedback[EB/OL]. (2022-05-04)[2024-01-03]. https://arxiv.org/abs/2203.02155.
[84] ROKEACH M. The nature of human values[M]. New York: The Free Press, 1973.
[85] DIGNUM V. Responsible artificial intelligence: how to develop and use AI in a responsible way[M]. Cham: Springer International Publishing, 2019.
[86] HÄMMERL K, DEISEROTH B, SCHRAMOWSKI P, et al. Do multilingual language models capture differing moral norms?[EB/OL]. (2022-03-18)[2024-01-03]. https://arxiv.org/abs/2203.09904.
[87] TOUILEB S, ØVRELID L, VELLDAL E. Occupational biases in Norwegian and multilingual language models[C]//Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing. Seattle: Association for Computational Linguistics, 2022: 200-211.
[88] HAEMMERL K, DEISEROTH B, SCHRAMOWSKI P, et al. Speaking multiple languages affects the moral bias of language models[C]//Findings of the Association for Computational Linguistics: ACL 2023. Toronto: Association for Computational Linguistics, 2023: 2137-2156.
[89] WU Baoyuan, CHEN Hongrui, ZHANG Mingda, et al. BackdoorBench: a comprehensive benchmark of backdoor learning[EB/OL]. (2022-06-25)[2024-01-03]. https://arxiv.org/abs/2206.12654.
[90] SHENG Xuan, HAN Zhaoyang, LI Piji, et al. A survey on backdoor attack and defense in natural language processing[C]//2022 IEEE 22nd International Conference on Software Quality, Reliability and Security. Guangzhou: IEEE, 2022: 809-820.
相似文献/References:
[1]吴国栋,秦辉,胡全兴,等.大语言模型及其个性化推荐研究[J].智能系统学报,2024,19(6):1351.[doi:10.11992/tis.202309036]
 WU Guodong,QIN Hui,HU Quanxing,et al.Research on large language models and personalized recommendation[J].CAAI Transactions on Intelligent Systems,2024,19(6):1351.[doi:10.11992/tis.202309036]

备注/Memo

Received: 2024-01-03.
Funding: National Natural Science Foundation of China (U21B2009); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0106601).
About the authors: HUANG Heyan, professor, also serves as director of the Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications. Her main research interests are machine translation and natural language processing. She has led more than 20 national research projects, including National Key R&D Program projects, key projects of the National Natural Science Foundation of China, and National High-Tech R&D Program topics, and has received more than 10 national and provincial or ministerial awards, including a first prize of the National Science and Technology Progress Award. She has received the State Council special government allowance since 1997 and was named a National Outstanding Science and Technology Worker in 2014. E-mail: hhy63@bit.edu.cn. LI Silin, master, mainly engaged in information extraction and the safety of language models. E-mail: lisilin87@outlook.com. GUO Yuhang, lecturer, mainly engaged in natural language processing, information extraction, machine translation, machine learning, and artificial intelligence. E-mail: guoyuhang@bit.edu.cn.
Corresponding author: GUO Yuhang. E-mail: guoyuhang@bit.edu.cn.

更新日期/Last Update: 2025-01-05