[1] YU Runyu, DU Junping, XUE Zhe, et al. Research on named entity recognition for scientific and technological conferences[J]. CAAI Transactions on Intelligent Systems, 2022, 17(1): 50-58. [doi:10.11992/tis.202107010]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 17
Issue: No. 1, 2022
Pages: 50-58
Column: Academic Papers: Machine Learning
Publication date: 2022-01-05
- Title: Research on named entity recognition for scientific and technological conferences
- Author(s): YU Runyu¹, DU Junping¹, XUE Zhe¹, XU Xin¹, XI Junqing²
  1. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Judicial Information Centre, Beijing 100020, China
- Keywords: named entity recognition; long-short term memory network; attention mechanism; character-word fusion; accurate portrait; natural language processing; information extraction; pre-trained models
- CLC number: TP391
- DOI: 10.11992/tis.202107010
- Abstract: General-domain named entity recognition (NER) algorithms struggle to fully exploit the semantic information in papers from scientific and technological academic conferences. To address this, a conference-oriented NER algorithm is proposed that combines a keyword-character long short-term memory (LSTM) network with an attention mechanism. First, keyword features in the paper dataset are pre-trained to obtain latent word-level semantic information, which is fused with character-level semantic information so that incorrect word boundaries do not degrade recognition accuracy. Next, the output vectors of a bidirectional LSTM (BiLSTM) and of the attention mechanism are fused, so that both contextual and global information are taken into account. Finally, a conditional random field (CRF) identifies the entities. Experiments show that the proposed algorithm achieves good recognition results on different datasets and improves precision, recall, and F1 score over the comparison algorithms.
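The final step named in the abstract, CRF-based entity identification, is usually carried out at inference time with Viterbi decoding over per-character tag scores. The following is a minimal sketch of that standard decoding step, not the authors' implementation: the B/I/O tag set, the function name `viterbi_decode`, and all emission and transition scores are hypothetical values chosen for illustration. In a real system the emission scores would come from the fused BiLSTM and attention outputs, and the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag path for one sentence.

    emissions[t][tag]      : score of `tag` at position t
    transitions[prev][tag] : score of moving from `prev` to `tag`
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = best[-1][prev] + transitions[prev][tag] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy scores: every transition is neutral except O -> I,
# which is heavily penalised because an entity cannot start with "I".
TAGS = ["B", "I", "O"]
TRANSITIONS = {p: {t: 0.0 for t in TAGS} for p in TAGS}
TRANSITIONS["O"]["I"] = -10.0
EMISSIONS = [
    {"B": 1.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 3.0, "O": 0.0},
    {"B": 0.0, "I": 0.0, "O": 1.0},
]

print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))  # ['B', 'I', 'O']
```

Picking the best tag independently at each position would give O, I, O here, which is not a valid entity span. The transition penalty steers the decoder to B, I, O instead, which illustrates why a CRF layer is placed on top of the per-character scores.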
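Memo
Received: 2021-07-09.
Funding: National Key Research and Development Program of China (2018YFB1402600); National Natural Science Foundation of China (61772083, 61802028); Guangxi Science and Technology Major Project (Guike AA18118054).
About the authors: YU Runyu, master's student; main research interests are deep learning and data mining. DU Junping, professor and doctoral supervisor; main research interests are artificial intelligence, social network analysis, data mining, and motion image processing. She has led one National Key Research and Development Program project and one key project of the National Natural Science Foundation of China, published more than 400 papers, and authored 6 academic monographs. XUE Zhe, associate professor; main research interests are machine learning, artificial intelligence, data mining, and image processing. He has led a Young Scientists Fund project of the National Natural Science Foundation of China and participated in one National Key Research and Development Program project, and has published more than 30 academic papers and 1 monograph.
Corresponding author: DU Junping. E-mail: junpingdu@126.com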