[1]沈高峰,谷淑敏.基于遗传算法优化综合启发式的中文网页特征提取[J].智能系统学报,2014,9(4):474-479.[doi:10.3969/j.issn.1673-4785.201305044]
SHEN Gaofeng, GU Shumin. Chinese Web page feature extraction by optimizing comprehensive heuristics based on GA[J]. CAAI Transactions on Intelligent Systems, 2014, 9(4): 474-479. [doi:10.3969/j.issn.1673-4785.201305044]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
9
Issue:
No. 4, 2014
Pages:
474-479
Column:
Academic Papers - Intelligent Systems
Publication date:
2014-08-25
- Title:
-
Chinese Web page feature extraction by optimizing comprehensive heuristics based on GA
- Author(s):
-
SHEN Gaofeng (沈高峰)1, GU Shumin (谷淑敏)2
-
1. School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China;
2. Department of Basic Subjects, College of Information & Business, Zhongyuan University of Technology, Zhengzhou 450007, China
-
- Keywords:
-
feature extraction; genetic algorithm (GA); text classification; text clustering; word frequency; word correlation
- CLC number:
-
TP391.1
- DOI:
-
10.3969/j.issn.1673-4785.201305044
- Abstract:
-
Feature extraction underlies information retrieval, text classification, text clustering, and automatic summarization. Traditional feature extraction methods cannot evaluate candidate feature words comprehensively and effectively. To address this shortcoming, this paper proposes a Chinese web page feature extraction method that uses a genetic algorithm (GA) to optimize a set of combined heuristics. The method evaluates candidate features with several heuristics, including word frequency, word correlation, part of speech (POS), and position, and uses the GA to optimize the weight parameter of each heuristic. Comparative experiments on different test sets show that, relative to traditional methods, the proposed method effectively avoids the bias that those methods introduce and yields a more representative feature set, which gives it practical value.
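As a rough illustration of the idea summarized in the abstract, the Python sketch below scores each candidate word as a weighted sum of four heuristic values and uses a small genetic algorithm to search for the weights. The abstract does not give the scoring formula, the fitness function, or the GA operators, so everything here is an assumption rather than the authors' method: the weighted sum, the overlap-based fitness against a hand-labelled reference set, truncation selection, one-point crossover, Gaussian mutation, and all names and toy values.

```python
import random

# Heuristic names follow the abstract; the order fixes the weight positions.
HEURISTICS = ("frequency", "correlation", "pos", "position")


def score(features, weights):
    """Weighted sum of heuristic values for one candidate word (assumed form)."""
    return sum(w * features[h] for w, h in zip(weights, HEURISTICS))


def top_k(candidates, weights, k):
    """Return the k candidate words with the highest combined score."""
    ranked = sorted(candidates, key=lambda word: score(candidates[word], weights),
                    reverse=True)
    return ranked[:k]


def fitness(weights, candidates, reference, k):
    """Assumed fitness: share of the top-k words found in a labelled reference set."""
    return len(set(top_k(candidates, weights, k)) & reference) / k


def normalize(weights):
    """Rescale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def optimize_weights(candidates, reference, k, pop_size=20, generations=30,
                     mutation_rate=0.2, seed=0):
    """Search for heuristic weights with a small, seeded genetic algorithm."""
    rng = random.Random(seed)
    population = [normalize([rng.random() + 1e-9 for _ in HEURISTICS])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, candidates, reference, k),
                        reverse=True)
        parents = population[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(HEURISTICS))  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:  # Gaussian mutation of one weight
                i = rng.randrange(len(child))
                child[i] = max(1e-9, child[i] + rng.gauss(0, 0.1))
            children.append(normalize(child))
        population = parents + children
    return max(population, key=lambda w: fitness(w, candidates, reference, k))


# Toy heuristic values in [0, 1]; these numbers are invented for illustration.
candidates = {
    "classification": {"frequency": 0.6, "correlation": 0.9, "pos": 1.0, "position": 0.8},
    "clustering":     {"frequency": 0.5, "correlation": 0.8, "pos": 1.0, "position": 0.6},
    "homepage":       {"frequency": 0.9, "correlation": 0.2, "pos": 1.0, "position": 0.9},
    "click":          {"frequency": 0.8, "correlation": 0.1, "pos": 0.5, "position": 0.3},
    "copyright":      {"frequency": 0.7, "correlation": 0.1, "pos": 1.0, "position": 0.1},
}
reference = {"classification", "clustering"}  # hypothetical hand-labelled features

best = optimize_weights(candidates, reference, k=2)
print(dict(zip(HEURISTICS, best)))
print(top_k(candidates, best, 2))
```

In this toy setup a high-frequency word such as "homepage" can outrank the reference words when frequency dominates the score, which is the kind of single-heuristic bias the abstract says the combined, GA-weighted scoring is meant to avoid.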
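Memo
Received: 2013-05-10.
Funding: Henan Province Basic and Frontier Technology Research Program (102300410266); Doctoral Research Fund of Zhengzhou University of Light Industry
Corresponding author: SHEN Gaofeng (沈高峰), male, born in 1978, lecturer. His main research interests are database applications and data mining. Eight of his research results have passed provincial-level appraisal, and he has published 11 academic papers and contributed to 4 textbooks. E-mail: 45125301@qq.com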