[1] SHEN Gaofeng, GU Shumin. Chinese Web page feature extraction by optimizing comprehensive heuristics based on GA[J]. CAAI Transactions on Intelligent Systems, 2014, 9(04): 474-479. [doi:10.3969/j.issn.1673-4785.201305044]

Chinese Web page feature extraction by optimizing comprehensive heuristics based on GA

CAAI Transactions on Intelligent Systems [ISSN: 1673-4785 / CN: 23-1538/TP]

Volume:
9
Issue:
2014, No. 04
Pages:
474-479
Publication date:
2014-08-25

Article Info

Title:
Chinese Web page feature extraction by optimizing comprehensive heuristics based on GA
Author(s):
SHEN Gaofeng¹, GU Shumin²
1. School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China;
2. Department of Basic Subjects, College of Information & Business, Zhongyuan University of Technology, Zhengzhou 450007, China
Keywords:
feature extraction; GA; text classification; text clustering; word frequency; word correlation
CLC number:
TP391.1
DOI:
10.3969/j.issn.1673-4785.201305044
Abstract:
Feature extraction underlies technologies such as information retrieval, text classification, text clustering, and automatic summarization. Traditional feature extraction methods cannot evaluate candidate feature words comprehensively and effectively. To address this shortcoming, this paper proposes a Chinese web page feature extraction method that uses a genetic algorithm (GA) to optimize a combination of heuristics. The method evaluates candidate features with several heuristics, including word frequency, word correlation, part of speech (POS), and position, and uses the GA to optimize the weight of each heuristic. Comparative experiments on different test sets show that, compared with traditional methods, the proposed method effectively avoids the bias that traditional feature extraction methods produce and obtains a representative feature set, which gives it a certain practical value.
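The abstract names the heuristics and the role of the GA but gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each heuristic value is already scaled to [0, 1], that the combined score is a plain weighted sum, and that fitness is the share of the top-k words found in a hand-labeled reference set. The function names, the fitness definition, and the GA operators (truncation selection, uniform crossover, Gaussian mutation) are all illustrative choices.

```python
import random

# The four heuristics listed in the abstract; the key names are hypothetical.
HEURISTICS = ("frequency", "correlation", "pos", "position")


def score(features, weights):
    """Weighted sum of heuristic values (assumed to be scaled to [0, 1])."""
    return sum(w * features[h] for h, w in zip(HEURISTICS, weights))


def top_k(candidates, weights, k):
    """Return the k candidate words with the highest combined score."""
    ranked = sorted(candidates,
                    key=lambda word: score(candidates[word], weights),
                    reverse=True)
    return ranked[:k]


def fitness(weights, candidates, reference, k):
    """Assumed fitness: fraction of the top-k words that are in the reference set."""
    return len(set(top_k(candidates, weights, k)) & reference) / k


def normalize(weights):
    """Rescale the weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def optimize_weights(candidates, reference, k,
                     pop_size=20, generations=30, seed=0):
    """Search for heuristic weights with a small elitist GA (illustrative only)."""
    rng = random.Random(seed)
    population = [normalize([rng.random() + 1e-9 for _ in HEURISTICS])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, candidates, reference, k),
                        reverse=True)
        parents = population[:pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))
            child[i] = max(1e-9, child[i] + rng.gauss(0, 0.1))  # mutate one gene
            children.append(normalize(child))
        population = parents + children
    return max(population, key=lambda w: fitness(w, candidates, reference, k))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = {
        "algorithm": {"frequency": 0.9, "correlation": 0.7, "pos": 1.0, "position": 0.8},
        "feature":   {"frequency": 0.8, "correlation": 0.9, "pos": 1.0, "position": 0.6},
        "click":     {"frequency": 0.9, "correlation": 0.1, "pos": 0.5, "position": 0.1},
        "the":       {"frequency": 1.0, "correlation": 0.0, "pos": 0.0, "position": 0.5},
    }
    best = optimize_weights(toy, reference={"algorithm", "feature"}, k=2)
    print(dict(zip(HEURISTICS, best)))
```

In this sketch a word that scores high on frequency alone, such as a function word, can be pushed out of the top-k once the GA shifts weight toward correlation or part of speech, which is the kind of single-heuristic bias the abstract says the combined method avoids.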

Memo:
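Received: 2013-05-10.
Funding: Henan Province Basic and Frontier Technology Research Program (102300410266); Doctoral Research Fund of Zhengzhou University of Light Industry.
Corresponding author: SHEN Gaofeng, male, born in 1978, lecturer. His main research interests are database applications and data mining. He has passed eight provincial-level achievement appraisals, published 11 academic papers, and contributed to four textbooks. E-mail: 45125301@qq.com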
Last Update: 1900-01-01