Automatic Classification of Chinese Documents Based on Rough Set and Improved Quick-Reduce Algorithm
Sheng Xiao-wei①②; Jiang Ming-hu①②
①Lab of Computational Linguistics, Dept of Chinese Language, Tsinghua University, Beijing 100084, China; ②State Key Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
Abstract: Many previous automatic text classification (TC) methods depend closely on the construction of document vectors. Because each term corresponds to one component of the vector, these methods map documents into a very high-dimensional space, possibly of tens of thousands of dimensions, which results in a massive amount of computation. Traditional dimension-reduction algorithms based on frequency and threshold filtering often lose useful information. This paper therefore presents a new TC system that introduces rough set theory, whose attribute reduction algorithm can greatly reduce the dimension of the document vectors. The empirical results show that the approach is successful: it not only reduces the dimensionality effectively but also reaches higher accuracy while losing less information than the usual reduction methods.
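As an illustration of the kind of rough-set attribute reduction the abstract refers to, the following is a minimal sketch of the standard greedy QuickReduce procedure from the rough set literature. It is not the improved variant proposed in the paper, which the abstract does not describe. The binary term-presence encoding, the toy documents, and the function names `dependency` and `quick_reduce` are assumptions made for this example only.

```python
from collections import defaultdict


def dependency(rows, labels, attrs):
    """Dependency degree of the class labels on the attribute subset `attrs`:
    the fraction of documents whose equivalence class (documents sharing the
    same values on `attrs`) contains a single class label."""
    label_sets = defaultdict(set)
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        label_sets[key].add(label)
        sizes[key] += 1
    positive = sum(sizes[k] for k, labs in label_sets.items() if len(labs) == 1)
    return positive / len(rows)


def quick_reduce(rows, labels):
    """Greedy forward selection: repeatedly add the attribute that gives the
    highest dependency degree until it matches that of the full attribute set.
    Ties are resolved in favour of the lowest attribute index, and an attribute
    is added even when no single one improves the score, so the loop always
    terminates (an assumption of this sketch)."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        candidates = [a for a in all_attrs if a not in reduct]
        scores = {a: dependency(rows, labels, reduct + [a]) for a in candidates}
        chosen = max(candidates, key=lambda a: scores[a])
        reduct.append(chosen)
        best = scores[chosen]
    return reduct


# Hypothetical toy corpus: each row marks the presence (1) or absence (0)
# of four terms in one document.
rows = [
    (1, 0, 1, 0),
    (1, 0, 0, 0),
    (1, 1, 1, 0),
    (0, 1, 1, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

print(quick_reduce(rows, labels))
```

In this toy corpus the first term alone separates the two classes, so the four-dimensional vectors reduce to a single attribute without losing any class information. Frequency or threshold filtering, by contrast, ranks terms without checking whether the retained set still distinguishes the classes.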
盛晓炜;江铭虎. 基于Rough集约简算法的中文文本自动分类系统[J]. 电子与信息学报, 2005, 27(7): 1047-1052 .
Sheng Xiao-wei①②; Jiang Ming-hu①②. Automatic Classification of Chinese Documents Based on Rough Set and Improved Quick-Reduce Algorithm. Journal of Electronics & Information Technology, 2005, 27(7): 1047-1052.