[1] ZHU Yefen, XIAN Yantuan, YU Zhengtao, et al. Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer[J]. CAAI Transactions on Intelligent Systems, 2024, 19(2): 401-410. [doi:10.11992/tis.202209034]

Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer

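CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 19; Issue: No. 2, 2024; Pages: 401-410; Column: Academic Papers, Natural Language Processing and Understanding; Publication date: 2024-03-05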

Title:: Joint model for Thai word segmentation and part-of-speech tagging via a local Transformer

Author(s):: ZHU Yefen^1,2, XIAN Yantuan^1,2, YU Zhengtao^1,2, XIANG Yan^1,2; 1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China

Keywords:: Thai word segmentation; part-of-speech tagging; joint learning; local Transformer; sub-word features; syllable features; linear conditional random field; joint model

CLC number:: TP391

DOI:: 10.11992/tis.202209034

Document code:: 2023-11-16

Abstract:: Thai word segmentation (WS) and part-of-speech (POS) tagging are highly correlated tasks, and previous studies have shown that learning them jointly can effectively improve model performance. We therefore propose a joint WS and POS tagging model tailored to Thai spelling and word-formation characteristics. Because Thai characters form syllables and syllables form words, a local Transformer network is employed to learn WS features from windowed syllable sequences. Considering the association between POS and syllables such as roots and affixes, the syllable features used for WS are integrated into the word-sequence features to alleviate the lack of POS tagging features for out-of-vocabulary words. On this basis, the model uses a linear classification layer to predict WS labels and a linear conditional random field to model the label dependencies of the POS sequence. Experimental results on the Thai LST20 dataset show that the model achieves a WS F₁, POS tagging micro-F₁, and macro-F₁ of 96.33%, 97.06%, and 85.98%, respectively, which are improvements of 0.33%, 0.44%, and 0.12% over the baseline models.
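
The idea of restricting a Transformer's self-attention to a local window of syllables can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections (queries, keys, and values are the input vectors themselves), and the function names and the `window` parameter are hypothetical choices made for illustration.

```python
import math


def local_attention_mask(seq_len, window):
    """Return a boolean matrix; mask[i][j] is True when j is within `window` steps of i."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def local_self_attention(x, window):
    """Single-head scaled dot-product self-attention limited to a local window.

    x: list of equal-length vectors, one per syllable (illustrative input).
    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    n, d = len(x), len(x[0])
    mask = local_attention_mask(n, window)
    out = []
    for i in range(n):
        # Only positions inside the window of i take part in the softmax.
        idx = [j for j in range(n) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in idx]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the in-window vectors.
        out.append([sum(w * x[j][k] for w, j in zip(weights, idx)) for k in range(d)])
    return out


# Toy "syllable embeddings": five positions, two dimensions each.
syllables = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
contextual = local_self_attention(syllables, window=1)
```

With `window=1`, each position attends only to itself and its immediate neighbours, so a change to a distant syllable leaves the representation of the first syllable untouched. That locality is what makes windowed attention a plausible fit for learning word boundaries from nearby syllables.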

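Memo

Received: 2022-09-16.
Funding: National Natural Science Foundation of China (62266028); Yunnan Provincial Major Science and Technology Special Plan (202002AD080001)
About the authors: ZHU Yefen, master's student; main research interests: natural language processing and lexical analysis. E-mail: 846415516@qq.com
XIAN Yantuan, associate professor; main research interests: natural language processing and information extraction. Has led or participated in 10 projects funded by the National Natural Science Foundation of China, the Yunnan Provincial Natural Science Foundation, and other government-funded programs; led more than 2 industry-sponsored projects; obtained more than 10 patents and software copyrights; and published more than 20 academic papers. E-mail: xianyt@kust.edu.cn
YU Zhengtao, professor; main research interests: natural language processing, information retrieval, machine translation, and machine learning. Has led or participated in 30 projects funded by the National Natural Science Foundation of China, the Yunnan Provincial Natural Science Foundation, and other government-funded programs; led more than 20 industry-sponsored projects; obtained more than 50 patents and software copyrights; and published more than 80 academic papers. E-mail: ztyu@hotmail.com
Corresponding author: XIAN Yantuan. E-mail: xianyt@kust.edu.cn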
