[1] ZHU Yefen, XIAN Yantuan, YU Zhengtao, et al. Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer[J]. CAAI Transactions on Intelligent Systems, 2024, 19(2): 401-410. [doi:10.11992/tis.202209034]
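CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume:
19
Issue:
No. 2, 2024
Pages:
401-410
Section:
Academic Papers: Natural Language Processing and Understanding
Publication date:
2024-03-05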
- Title:
-
Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer
- Author(s):
-
ZHU Yefen (朱叶芬)1,2, XIAN Yantuan (线岩团)1,2, YU Zhengtao (余正涛)1,2, XIANG Yan (相艳)1,2
-
1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
-
- Keywords:
-
Thai word segmentation; part-of-speech tagging; joint learning; local Transformer; sub-word features; syllable features; linear conditional random field; joint model
- CLC number:
-
TP391
- DOI:
-
10.11992/tis.202209034
- Document code:
-
2023-11-16
- Abstract:
-
Thai word segmentation (WS) and part-of-speech (POS) tagging are closely related tasks, and previous studies have shown that learning them jointly can effectively improve model performance. We therefore propose a joint WS and POS tagging model tailored to Thai spelling and word-formation characteristics. Because Thai characters form syllables and syllables in turn form words, a local Transformer network is employed to learn WS features from windowed syllable sequences. Considering the association between POS and syllables such as roots and affixes, the syllable features used for WS are fused into the word-sequence features to alleviate the lack of POS tagging features for out-of-vocabulary words. On this basis, the model uses a linear classification layer to predict WS labels and a linear conditional random field to model the dependencies within the POS tag sequence. Experimental results on the Thai LST20 dataset show that the model achieves a WS F1, a POS tagging micro-averaged F1, and a POS tagging macro-averaged F1 of 96.33%, 97.06%, and 85.98%, respectively, which are improvements of 0.33%, 0.44%, and 0.12% over the baseline models.
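The abstract reports both a micro-averaged and a macro-averaged F1 for POS tagging, and the two differ by roughly eleven points. The sketch below is not the authors' evaluation script; it is a minimal illustration, using invented toy tags (`NOUN`, `VERB`, `RARE`) and a hypothetical helper `micro_macro_f1`, of how the two averages are usually computed and why a rarely occurring tag can pull the macro average well below the micro average.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro F1, macro F1) for single-label sequence tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag receives a false positive
            fn[g] += 1  # gold tag receives a false negative
    labels = sorted(set(gold) | set(pred))
    # Micro: pool the counts over all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data, invented for illustration only (not taken from LST20).
gold = ["NOUN"] * 8 + ["VERB"] + ["RARE"]
pred = ["NOUN"] * 8 + ["VERB"] + ["NOUN"]  # the single rare tag is missed

micro, macro = micro_macro_f1(gold, pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.900 macro=0.647
```

In this toy case one missed token costs the micro average only a single error out of ten, while the macro average counts the `RARE` tag's F1 of zero as a full third of the score.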
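Memo
Received: 2022-09-16.
Funding: National Natural Science Foundation of China (62266028); Yunnan Provincial Major Science and Technology Special Plan (202002AD080001)
About the authors: ZHU Yefen, master's student; main research interests are natural language processing and lexical analysis. E-mail: 846415516@qq.com; XIAN Yantuan, associate professor; main research interests are natural language processing and information extraction. He has led or participated in 10 government-funded projects, including projects of the National Natural Science Foundation of China and the Yunnan Natural Science Foundation, has led more than 2 industry-funded projects, holds more than 10 granted patents and software copyrights, and has published more than 20 academic papers. E-mail: xianyt@kust.edu.cn; YU Zhengtao, professor; main research interests are natural language processing, information retrieval, machine translation, and machine learning. He has led or participated in 30 government-funded projects, including projects of the National Natural Science Foundation of China and the Yunnan Natural Science Foundation, has led more than 20 industry-funded projects, holds more than 50 granted patents and software copyrights, and has published more than 80 academic papers. E-mail: ztyu@hotmail.com
Corresponding author: XIAN Yantuan. E-mail: xianyt@kust.edu.cn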
Last Update:
1900-01-01