Knowledge Clustering and Statistics Based on MapReduce
XU Xiaolong, LI Yongping
(College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Abstract The large scale and coarse classification granularity of resources in literature knowledge bases cause disorientation and overload when learners retrieve and read papers. This paper proposes a knowledge clustering and knowledge statistics mechanism based on MapReduce. First, it presents a co-occurrence matrix building algorithm based on MapReduce (MR-CoMatrix). Second, it combines the co-occurrence matrix with a similarity coefficient to build the similarity matrix. Third, the similarity matrix is standardized with Z scores. Finally, knowledge clusters are constructed with Ward's method. After clustering, the paper introduces a knowledge statistics algorithm based on MapReduce (MR-Statistics) to mine the hidden information in each cluster. Experimental results show that a literature knowledge base equipped with MR-CoMatrix and MR-Statistics achieves accurate, fine-grained clustering and multi-dimensional statistics with high computational efficiency and low time cost.
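The first three steps of the pipeline described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' MR-CoMatrix implementation: the function names (`map_pairs`, `reduce_counts`, `ochiai`, `z_scores`) and the toy keyword lists are hypothetical, the Ochiai coefficient is assumed because the abstract does not name the similarity coefficient, and row-wise standardization is likewise an assumption.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def map_pairs(keywords):
    """Map step: emit ((a, b), 1) for each keyword and each unordered keyword pair of one paper."""
    terms = sorted(set(keywords))
    for term in terms:
        yield (term, term), 1  # diagonal entry: the paper contains this term
    for pair in combinations(terms, 2):
        yield pair, 1  # off-diagonal entry: the two terms co-occur in this paper


def reduce_counts(emitted):
    """Reduce step: sum all values emitted for the same key."""
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return counts


def ochiai(counts, a, b):
    """Ochiai coefficient c_ab / sqrt(c_aa * c_bb); assumed, not taken from the paper."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    return counts[key] / sqrt(counts[(a, a)] * counts[(b, b)])


def z_scores(values):
    """Standardize a list of values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical keyword lists, one per paper.
papers = [
    ["clustering", "mapreduce", "hadoop"],
    ["clustering", "mapreduce"],
    ["clustering", "statistics"],
]

counts = reduce_counts(kv for paper in papers for kv in map_pairs(paper))
terms = sorted({t for paper in papers for t in paper})
similarity = [[ochiai(counts, a, b) for b in terms] for a in terms]
standardized = [z_scores(row) for row in similarity]
```

The rows of `standardized` could then be passed to a hierarchical clustering routine that supports Ward's criterion, for example `scipy.cluster.hierarchy.linkage` with `method="ward"`. That step is omitted here to keep the sketch free of third-party dependencies.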
Received: 12 February 2015
Published: 17 November 2015
Fund: The National Natural Science Foundation of China (61202004, 61472192), The Special Fund for Fast Sharing of Science Paper in Net Era by CSTD (2013116), The Natural Science Fund of Higher Education of Jiangsu Province (14KJB520014)
Corresponding author: XU Xiaolong, E-mail: xuxl@njupt.edu.cn