[1] GULNAZ Alimjan, HURXIDA Jumahun, SUN Tieli, et al. The nearest neighbor text classification method based on support vector[J]. CAAI Transactions on Intelligent Systems, 2018, 13(05): 799-807. [doi:10.11992/tis.201711007]

The nearest neighbor text classification method based on support vector

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
13
Issue:
2018, No. 05
Pages:
799-807
Publication date:
2018-09-05

Article Information

Title:
The nearest neighbor text classification method based on support vector
Author(s):
GULNAZ Alimjan (1, 2, 3); HURXIDA Jumahun (1); SUN Tieli (2); LIANG Yi (1)
1. Department of Electronics and Information Engineering, Yili Normal University, Yining 835000, China;
2. School of Information Science and Technology, Northeast Normal University, Changchun 130117, China;
3. Department of Geographical Science, Northeast Normal University, Changchun 130024, China
Keywords:
stemming; preprocessing; support vector machines; text categorization; classification accuracy
CLC number:
TP309
DOI:
10.11992/tis.201711007
Abstract:
Text categorization automatically assigns a set of predefined categories or topics to a document. In text classification, how a document is represented strongly affects the performance of the learning machine. With the aim of classifying Kazakh text, a stemmer is designed and implemented according to Kazakh grammar rules, which completes the preprocessing of Kazakh text. A sample distance formula based on the nearest support vectors is proposed, which avoids having to select the parameter k. Kazakh texts are then classified with SV-NN, a special combination of the SVM and KNN classification algorithms. Text classification simulation experiments were carried out on a Kazakh text corpus constructed by the authors. The numerical experiments demonstrate the effectiveness of the proposed algorithm and confirm the theoretical results.
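The abstract does not reproduce the paper's distance formula, so the following is only a minimal sketch of the general idea of classifying by the nearest support vector, which needs no k. It assumes the support vectors and their class labels have already been extracted from a trained SVM, and it substitutes plain Euclidean distance for the authors' formula. The function name, the toy vectors, and the category names are hypothetical.

```python
import math


def nearest_support_vector_label(x, support_vectors, labels):
    """Return the label of the support vector closest to x.

    Assumption: Euclidean distance stands in for the paper's own
    distance formula, which this page does not give.
    """
    if not support_vectors or len(support_vectors) != len(labels):
        raise ValueError("need at least one support vector and one label per vector")
    best_label, best_dist = None, math.inf
    for sv, label in zip(support_vectors, labels):
        dist = math.dist(x, sv)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy 2-D support vectors with made-up values (not data from the paper).
svs = [(1.0, 1.0), (2.0, 1.5), (5.0, 5.0), (6.0, 4.5)]
sv_labels = ["sports", "sports", "economy", "economy"]

predicted = nearest_support_vector_label((1.5, 1.2), svs, sv_labels)
print(predicted)
```

Because only the single closest support vector decides the class, there is no neighborhood size to tune, and the search runs over the support vectors rather than the whole training set.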



Memo:
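Received: 2017-11-02.
Funding: General Project of Yili Normal University (2016WXYB0004); National Natural Science Foundation of China (61663045); Key Research Project of the Xinjiang Higher Education Scientific Research Program (XJEDU2014I043); Key Project of Yili Normal University (2016YSZD04).
About the authors: GULNAZ Alimjan, female, born in 1972, associate professor, Ph.D. Her main research interests are machine learning, pattern recognition, intelligent information classification, and image processing. She has participated in 3 national and provincial/ministerial research projects, led 4 key institute-level projects, and published more than 20 academic papers. HURXIDA Jumahun, female, born in 1966, professor. Her main research interests are intelligent information processing and face recognition. She has led 4 national and provincial/ministerial research projects, published more than 20 academic papers, and published 1 textbook. SUN Tieli, male, born in 1956, professor and doctoral supervisor. His main research interests are intelligent user interfaces and intelligent information mining. He has led 12 national and provincial/ministerial research projects, published more than 150 academic papers, and published 10 monographs and textbooks.
Corresponding author: GULNAZ Alimjan. E-mail: alay328@163.com.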
Last Update: 2018-10-25