Automatic Classification of Chinese Documents Based on Rough Set and Improved Quick-Reduce Algorithm
Sheng Xiao-wei①②; Jiang Ming-hu①②
① Lab of Computational Linguistics, Dept of Chinese Language, Tsinghua University, Beijing 100084, China; ② State Key Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
Abstract Many previous automatic text classification (TC) methods are closely tied to the construction of document vectors. Because each term corresponds to one component of the vector, such methods map documents into a very high-dimensional space, possibly of tens of thousands of dimensions, which entails a massive amount of computation. Traditional reduction algorithms based on frequency and threshold filtering often lose useful information. This paper therefore presents a new TC system that introduces rough set theory, whose reduction algorithm can greatly reduce the dimensionality of the document vectors. Empirical results show that the approach not only reduces the dimensionality effectively but also achieves higher accuracy while losing less information than the usual reduction methods.
Received: 19 February 2004
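
For readers unfamiliar with rough-set reduction, the idea named in the abstract can be illustrated with a minimal sketch of the baseline QuickReduct procedure from the rough-set literature. It is not the improved variant proposed in this paper, whose details the abstract does not give. The function names `dependency` and `quick_reduct`, the binary term-presence encoding, and the toy data are all assumptions made for illustration only.

```python
from collections import defaultdict


def dependency(rows, labels, attrs):
    """Rough-set dependency degree of the labels on the attribute subset `attrs`.

    Objects with identical values on `attrs` form an equivalence class; the
    degree is the fraction of objects lying in classes that carry one label.
    """
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        classes[tuple(row[a] for a in attrs)].append(label)
    consistent = sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)
    return consistent / len(rows)


def quick_reduct(rows, labels):
    """Greedy forward selection (baseline QuickReduct, illustrative only).

    Adds the attribute that yields the highest dependency degree until the
    subset explains the labels as well as the full attribute set does.
    Assumes `rows` is non-empty and all rows have the same length.
    """
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        candidates = [a for a in all_attrs if a not in reduct]
        scored = [(dependency(rows, labels, reduct + [a]), a) for a in candidates]
        # Ties go to the earliest attribute; an attribute is added even when
        # no single one improves the degree, so the loop always terminates.
        best, attr = max(scored, key=lambda s: s[0])
        reduct.append(attr)
    return reduct


# Hypothetical toy corpus: columns mark the presence of three terms,
# in the order "the", "stock", "goal".
rows = [(1, 1, 0), (0, 1, 0), (1, 0, 1), (0, 0, 1)]
labels = ["finance", "finance", "sports", "sports"]
selected = quick_reduct(rows, labels)
print(selected)
```

In this toy corpus the uninformative term "the" is discarded and a single discriminating term suffices, which is the kind of dimensionality reduction the abstract describes. The greedy search returns a small attribute subset but does not guarantee the smallest possible one.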