Feature Selection Based on Class Distribution Difference and VPRS for Text Classification
Wu Di①②; Zhang Ya-ping①; Yin Fu-liang①; Li Ming②
①Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China; ②Shenyang Aircraft Design & Research Institute, China Aviation Industry Corporation I, Shenyang 110035, China
Abstract: Term weighting and feature reduction are key preprocessing steps in text classification. First, features that are useless for classifying texts are filtered out according to the difference in each feature's document frequency distribution across categories. Then, to overcome the limitations of the TF-IDF weighting formula, a novel weighting formula called TF-CDF is presented. The weight of each feature is calculated with TF-CDF, and a Vector Space Model (VSM) is built for the entire corpus. To select significant features, a feature selection approach based on the Variable Precision Rough Set (VPRS) model is also proposed and implemented in SQL statements, combining the definitions of VPRS with the advantages of SQL. Finally, experiments with different weighting formulas and feature selection methods are conducted using libSVM as the text classifier. The experimental results show that the proposed feature filtering, weighting formula, and feature selection method improve the performance of text classification.
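The abstract does not give the VPRS formulas, so the following is only a minimal sketch of the idea behind VPRS-based feature evaluation, under stated assumptions. It uses the commonly cited definition in which an equivalence class belongs to the β-positive region when at least a fraction β of its documents share one category, with β in (0.5, 1]. It is written in Python rather than the SQL used in the paper. The function names, the toy data, and the discretized feature values are all hypothetical.

```python
from collections import defaultdict


def beta_positive_region(rows, condition_attrs, decision_attr, beta=0.8):
    """Return sorted indices of rows in the beta-positive region (assumed definition)."""
    # Group row indices into equivalence classes by condition-attribute values.
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(i)

    positive = []
    for members in classes.values():
        # Count how often each category occurs inside this equivalence class.
        counts = defaultdict(int)
        for i in members:
            counts[rows[i][decision_attr]] += 1
        # Keep the class if its dominant category covers at least beta of it.
        if max(counts.values()) / len(members) >= beta:
            positive.extend(members)
    return sorted(positive)


def beta_dependency(rows, condition_attrs, decision_attr, beta=0.8):
    """Fraction of rows that fall in the beta-positive region."""
    region = beta_positive_region(rows, condition_attrs, decision_attr, beta)
    return len(region) / len(rows)


# Hypothetical toy corpus: t1 and t2 are discretized feature weights.
docs = [
    {"t1": 1, "t2": 0, "label": "sports"},
    {"t1": 1, "t2": 0, "label": "sports"},
    {"t1": 1, "t2": 1, "label": "sports"},
    {"t1": 1, "t2": 0, "label": "sports"},
    {"t1": 1, "t2": 1, "label": "finance"},
    {"t1": 0, "t2": 1, "label": "finance"},
    {"t1": 0, "t2": 0, "label": "finance"},
    {"t1": 0, "t2": 1, "label": "finance"},
]

print(beta_dependency(docs, ["t1"], "label", beta=0.8))
print(beta_dependency(docs, ["t2"], "label", beta=0.8))
print(beta_dependency(docs, ["t1"], "label", beta=1.0))
```

In this toy data, feature `t1` tolerates one noisy document at β = 0.8 and still covers the whole corpus, while `t2` covers none of it. A selection procedure could therefore prefer `t1`. Setting β = 1.0 recovers the classical rough set positive region, which excludes the noisy class.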
Wu Di; Zhang Ya-ping; Yin Fu-liang; Li Ming. Feature Selection Based on Class Distribution Difference and VPRS for Text Classification[J]. Journal of Electronics & Information Technology, 2007, 29(12): 2880-2884.