[1] TAN Ying, ZHU Yuan-chun. Advances in antispam techniques[J]. CAAI Transactions on Intelligent Systems, 2010, 5(03): 189-201.





Advances in antispam techniques
TAN Ying (1,2), ZHU Yuan-chun (1,2)
1. Key Laboratory of Machine Perception (MOE), Peking University, Beijing 100871, China;
2. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
antispam feature extraction intelligent detection technique performance evaluation
As spam poses an increasingly severe threat on the Internet, antispam techniques have become a research hotspot. This paper reviews the history, current state, and latest advances in research on spam control. First, three types of email feature extraction methods are introduced and analyzed: text-based, image-based, and behavior-based approaches. Next, current antispam techniques are described and discussed, including legislation, simple methods, and intelligent approaches. Performance evaluation methods and standard data sets are then discussed. Finally, current research on antispam techniques is summarized and directions for future research are pointed out, including improvements to email feature extraction techniques, improvements to legislation, and new intelligent antispam approaches.






Corresponding author: TAN Ying. E-mail: ytan@pku.edu.cn.
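TAN Ying, male, born in 1964, is a professor and doctoral supervisor with a Ph.D., and an IEEE Senior Member. He is an associate editor of IJSIR, of IES Journal B, Intelligent Devices and Systems, and of the Journal of Computer Science and Systems Biology; an editorial board member of the International Journal of KES; an editor of special issues for Springer and several major international journals; General Chair of ICSI2010; and Program Committee Chair of ISNN2008. His main research interests include computational intelligence, swarm intelligence, intelligent information processing, computer security, data mining, and pattern recognition. He has led more than 30 research projects, including projects under the National "863" Program and the National Natural Science Foundation of China. He received a Second Prize of the 2009 National Natural Science Award and was selected for the Hundred Talents Program of the Chinese Academy of Sciences. He has published more than 200 academic papers.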
Last Update: 2010-07-14