Research on Automatic Text Classification Based on a Hybrid Language Model

doi:10.3724/SP.J.1146.2005.01015


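Abstract (translated from Chinese): With the explosive growth of available information on the Internet and intranets, text classification has become one of the key technologies for processing and organizing large volumes of document data. This paper proposes a hybrid language model that combines ontology with statistical methods to address automatic text classification. First, a linguistic ontology knowledge base is learned for each category from that category's training corpus, and these knowledge bases are built into per-category classifiers. For an actual document, an evaluation value is obtained from each category's linguistic ontology knowledge base, and the highest evaluation value determines the document's category. The method is compared with three typical text classifiers: Bayes, k-nearest neighbor, and support vector machine. The experimental results show that the proposed method outperforms all three.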

Authors: Zheng De-quan, Li Sheng, Zhao Tie-jun, Yu Hao

Keywords (translated from Chinese): text classification, ontology, hybrid language model, context, multi-gram information

Abstract: As the volume of information available on the Internet and corporate intranets continues to increase, text classification has become one of the key technologies for organizing and processing large amounts of document data. This paper presents a novel method of Chinese text categorization that combines ontology with statistical methods. First, a linguistic ontology knowledge bank is acquired for each class by learning from that class's training corpus, and each knowledge bank serves as the classifier for its class. For an actual document, an evaluation value is computed against each class's linguistic ontology knowledge bank, and the document is assigned to the class with the highest evaluation value. The method is compared with three typical text classifiers: Bayes, k-nearest neighbor, and support vector machine. The experimental results show that the proposed method outperforms these three methods.
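The decision rule described in the abstract (one model per class, one evaluation value per class, highest value wins) can be sketched as follows. This is only an illustrative sketch: the abstract does not give the paper's ontology-based evaluation function, so an add-alpha smoothed unigram log-likelihood is swapped in as a stand-in. The function names `train_class_models`, `evaluate`, and `classify`, the `alpha` parameter, and the toy corpus are all hypothetical, and documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def train_class_models(corpus_by_class):
    """Build one term-count table per class.

    Stand-in for the per-class knowledge bank; the paper's actual
    ontology-based representation is not specified in the abstract.
    """
    return {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in corpus_by_class.items()
    }


def evaluate(doc, counts, vocab_size, alpha=1.0):
    """Evaluation value of a document under one class model.

    Assumption: add-alpha smoothed unigram log-likelihood,
    not the paper's scoring function.
    """
    total = sum(counts.values())
    denom = total + alpha * vocab_size
    return sum(math.log((counts[tok] + alpha) / denom) for tok in doc)


def classify(doc, models):
    """Score the document against every class and pick the highest value."""
    vocab = set().union(*models.values())
    vocab_size = max(len(vocab), 1)
    scores = {
        label: evaluate(doc, counts, vocab_size)
        for label, counts in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up training data (pre-tokenized documents).
corpus = {
    "sports": [["match", "team", "goal"], ["team", "coach", "win"]],
    "finance": [["stock", "market", "bank"], ["bank", "loan", "rate"]],
}
models = train_class_models(corpus)
label, scores = classify(["team", "goal", "win"], models)
print(label, scores)
```

Keeping a separate model per class means a new category can be added by training one more model, without retraining the others; only the final comparison across classes changes.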

Key words: text classification, ontology, hybrid language model, context, multi-grams

Received: 2005-08-17

PACS:

TP391

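Funding:

Supported by the National Natural Science Foundation of China (60302021) and the Natural Science Foundation of Heilongjiang Province (F2004-04)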

Cite this article:

Zheng De-quan; Li Sheng; Zhao Tie-jun; Yu Hao. Research on Automatic Text Classification Based on a Hybrid Language Model[J]. Journal of Electronics & Information Technology, 2007, 29(3): 601-605.

Link to this article:

http://jeit.ie.ac.cn/CN/10.3724/SP.J.1146.2005.01015 or http://jeit.ie.ac.cn/CN/Y2007/V29/I3/601