Feature Selection Based on Class Distribution Difference and VPRS for Text Classification
Wu Di①②; Zhang Ya-ping①; Yin Fu-liang①; Li Ming②
①Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;
②Shenyang Aircraft Design & Research Institute, China Aviation Industry Corporation I, Shenyang 110035, China
Abstract Weight calculation and feature reduction are key preprocessing steps in text classification. First, features that are useless for classifying texts are filtered out according to the difference in each feature's document frequency distribution across categories. Then, to overcome the limitations of the TF-IDF weighting formula, a novel weighting formula called TF-CDF is presented. The weight of each feature is calculated according to TF-CDF, and the Vector Space Model (VSM) is built for the entire corpus. To select significant features, a feature selection approach based on the Variable Precision Rough Set (VPRS) is also proposed and implemented with SQL statements, combining the definitions of VPRS with the advantages of SQL. Finally, experiments based on different weighting formulas and feature selection methods are conducted using libSVM as the text classifier. The experimental results show that the novel feature filtering, weighting formula, and feature selection method improve the performance of text classification.
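The VPRS idea mentioned in the abstract can be illustrated with a minimal sketch of the standard β-positive region and dependency degree. This is not the authors' SQL implementation: the function names, the toy decision table, the attribute names `t1`, `t2`, and `label`, and the threshold `beta` (assumed to lie in (0.5, 1]) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def beta_positive_region(rows, condition_attrs, decision_attr, beta=0.8):
    """Return sorted indices of objects in the beta-positive region.

    An equivalence class (objects sharing the same values on
    condition_attrs) is accepted when its majority decision value
    covers at least a fraction `beta` of its members.
    """
    # Partition objects into equivalence classes by condition values.
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(i)

    positive = []
    for members in classes.values():
        # Count how often each decision value occurs inside the class.
        counts = defaultdict(int)
        for i in members:
            counts[rows[i][decision_attr]] += 1
        if max(counts.values()) / len(members) >= beta:
            positive.extend(members)
    return sorted(positive)


def dependency(rows, condition_attrs, decision_attr, beta=0.8):
    """Fraction of objects that fall in the beta-positive region."""
    region = beta_positive_region(rows, condition_attrs, decision_attr, beta)
    return len(region) / len(rows)


# Toy decision table (assumed): discretized term weights plus a category label.
rows = [
    {"t1": 1, "t2": 0, "label": "sports"},
    {"t1": 1, "t2": 0, "label": "sports"},
    {"t1": 1, "t2": 1, "label": "sports"},
    {"t1": 1, "t2": 1, "label": "politics"},
    {"t1": 0, "t2": 1, "label": "politics"},
    {"t1": 0, "t2": 0, "label": "politics"},
]

for attrs in (["t1"], ["t2"], ["t1", "t2"]):
    print(attrs, dependency(rows, attrs, "label", beta=0.7))
```

In this toy table, a feature subset with a higher dependency degree at the chosen `beta` separates the categories more reliably, which is the kind of signal a rough-set feature selector can rank on. Setting `beta` to 1.0 recovers the classical rough set positive region, which tolerates no misclassified objects within an equivalence class.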
Received: 28 December 2006