A Summary of Clustering Algorithms

Author: trentadue   Date: 2020-02-27

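I recently had to do a clustering project on Spark in which both the data volume and the number of clusters are fairly large. KMeans gave acceptable results but was somewhat slow, so I went back over the commonly used algorithms. In the end I chose mini-batch kmeans, with a kmeans++-like method to initialize the cluster centers.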

kMeans   note: center initialization (random vs kMeans++)
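The kMeans++ option named above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the Spark or scikit-learn implementation; the helper names `sq_dist` and `kmeanspp_init` and the toy points are made up for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """Pick k initial centers: the first uniformly at random, each later one
    with probability proportional to its squared distance to the nearest
    center chosen so far. Assumes at least k distinct points."""
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Keep, for every point, the distance to its closest chosen center.
        d2 = [min(old, sq_dist(p, new_center)) for p, old in zip(points, d2)]
    return centers

# Three tight, well-separated groups (toy data, invented for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.9, 0.1)]
centers = kmeanspp_init(points, k=3, rng=random.Random(0))
```

Because an already chosen point has weight zero, it cannot be drawn again, and far-away points are favored, which is what tends to spread the initial centers across the data compared with purely random initialization.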

mini-batch kMeans   loops: randomly select a batch of samples; find the closest center for every sample in the batch; update that center once per sample
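A minimal sketch of that loop, assuming a per-center learning rate of 1/count (a common choice for mini-batch k-means, not something stated in these notes). The function names, the toy blobs, and the choice of one starting center per blob are all assumptions for illustration; in practice the starting centers would come from a kMeans++-style initialization.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def minibatch_kmeans(points, init_centers, batch_size, n_iters, rng):
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)  # number of samples each center has absorbed
    for _ in range(n_iters):
        batch = rng.sample(points, batch_size)            # random select samples
        assigned = [nearest(p, centers) for p in batch]   # find closest center
        for p, j in zip(batch, assigned):                 # update per sample
            counts[j] += 1
            lr = 1.0 / counts[j]                          # per-center learning rate
            centers[j] = [(1 - lr) * c + lr * x for c, x in zip(centers[j], p)]
    return centers

# Two Gaussian blobs around (0, 0) and (10, 10); toy data for illustration.
rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
points = blob_a + blob_b
centers = minibatch_kmeans(points, [blob_a[0], blob_b[0]],
                           batch_size=10, n_iters=50, rng=rng)
```

With the 1/count rate each center is simply the running mean of the samples assigned to it so far, so every iteration touches only `batch_size` points instead of the full data set.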

mean shift   init: get centers by bandwidth   loops: find neighbors of centers;update centers;de-duplicate
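The mean shift steps above might look like this in a small pure-Python sketch. It assumes a flat kernel (every point within `bandwidth` counts equally) and seeds one center per occupied grid cell of side `bandwidth`; both choices, along with the function names and the toy points, are assumptions made for the example.

```python
import math

def shift(c, points, bandwidth, n_iters):
    """Move center c to the mean of its neighbors until it stops moving."""
    for _ in range(n_iters):
        nbrs = [p for p in points if math.dist(p, c) <= bandwidth]
        if not nbrs:
            return None  # seed landed in an empty region
        new_c = tuple(sum(xs) / len(nbrs) for xs in zip(*nbrs))
        if math.dist(new_c, c) < 1e-9:
            return new_c
        c = new_c
    return c

def mean_shift(points, bandwidth, n_iters=100):
    # init: one seed per occupied grid cell of side `bandwidth`
    seeds = {tuple(round(x / bandwidth) * bandwidth for x in p) for p in points}
    # loops: find neighbors of each center and update it
    shifted = [shift(s, points, bandwidth, n_iters) for s in sorted(seeds)]
    # de-duplicate: drop centers that ended up within `bandwidth` of a kept one
    modes = []
    for c in shifted:
        if c is not None and all(math.dist(c, m) > bandwidth for m in modes):
            modes.append(c)
    return modes

# Two small groups near (0, 0) and (5, 5); toy data for illustration.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (4.9, 5.2)]
modes = mean_shift(points, bandwidth=1.0)
```

The number of clusters is not passed in; it falls out of the bandwidth, which is the main knob to tune.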

DBSCAN   init: get the densest core samples   loops: add further core samples found near the ones already collected
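A compact sketch of the DBSCAN expansion, under a few assumptions: it starts from any unvisited core sample rather than the densest one, counts a point as its own neighbor, and uses brute-force neighbor search. The function name, parameters `eps` and `min_samples`, and the toy points are chosen for illustration.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point; -1 marks noise."""
    n = len(points)
    # Brute-force neighborhoods (each point is its own neighbor).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in nbrs]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Start a new cluster at this core sample and grow it outward.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            cur = frontier.pop()
            for j in nbrs[cur]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if is_core[j]:  # only core samples keep expanding
                        frontier.append(j)
        cluster += 1
    return labels

# Two dense groups plus one isolated point; toy data for illustration.
points = [(0, 0), (0, 0.5), (0.5, 0), (5, 5), (5, 5.5), (5.5, 5), (20, 20)]
labels = dbscan(points, eps=1.0, min_samples=3)
```

On this toy input the two dense groups become clusters 0 and 1, and the isolated point keeps the noise label -1.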

Ward hierarchical   init: each sample as center   loops: merge to minimize RMSE within clusters
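A naive sketch of the Ward merge loop. It reads "minimize RMSE within clusters" as Ward's usual criterion, merging the pair whose union increases the within-cluster sum of squared errors the least; that reading, the O(n^3) pair search, and the names used here are assumptions for illustration.

```python
def ward(points, n_clusters):
    """Agglomerative clustering with Ward's criterion (naive pair search)."""
    # init: each sample is its own cluster, stored as (member indices, centroid)
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                # Increase in within-cluster squared error if a and b merge.
                cost = na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]

# Two groups of three points each; toy data for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = ward(points, n_clusters=2)
```

Stopping at `n_clusters` is one possible cut; recording every merge instead would give the full hierarchy.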

Spectral clustering   steps: build similarity matrix S; decompose S=UV (U holds the leading eigenvectors); run kmeans on the rows of U

AP cluster (affinity propagation)   init: get S; Rik=0, Aik=0   loops: Rik = Sik - Max_{k'!=k}(Aik' + Sik'); Aik = min(0, Rkk + Sum_{i'!=i,k} Max(0, Ri'k)) for i != k; Akk = Sum_{i'!=k} Max(0, Ri'k)   end: for each i, take the k that maximizes Rik + Aik as its exemplar
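The message-passing updates above can be tried out with a small pure-Python sketch. Two things here are not in the notes and are assumptions: the damping factor (commonly used to avoid oscillation) and the preference value -20 placed on the diagonal of S. The function name and the 1-D toy data are also made up for the example.

```python
def affinity_propagation(S, n_iters=200, damping=0.5):
    """Return, for each point i, the index of its exemplar."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities, Rik
    A = [[0.0] * n for _ in range(n)]  # availabilities, Aik
    for _ in range(n_iters):
        # Rik = Sik - max over k' != k of (Aik' + Sik')
        for i in range(n):
            for k in range(n):
                best = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best)
        # Aik = min(0, Rkk + sum of positive Ri'k), Akk = sum of positive Ri'k
        for k in range(n):
            for i in range(n):
                pos = sum(max(0.0, R[ii][k]) for ii in range(n) if ii not in (i, k))
                new = pos if i == k else min(0.0, R[k][k] + pos)
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # end: each point picks the k maximizing Rik + Aik
    return [max(range(n), key=lambda k: R[i][k] + A[i][k]) for i in range(n)]

# 1-D toy data: two groups with an obvious middle point in each.
xs = [0, 1, 2, 10, 11, 12]
preference = -20.0  # assumed value for Skk; lower values give fewer exemplars
S = [[preference if i == k else -float((xi - xk) ** 2)
      for k, xk in enumerate(xs)] for i, xi in enumerate(xs)]
exemplars = affinity_propagation(S)
```

On this input the middle point of each group (indices 1 and 4) should end up as the exemplar for its neighbors; as with mean shift, the number of clusters is not given up front but is steered by the preference.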

Topic Model(LDA)

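In fact scikit-learn implements many of these algorithms and ships ready-made datasets to experiment with. For example, http://scikit-learn.org/stable/modules/clustering.html has some result plots and notes on how well each algorithm scales.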

Figure: A comparison of the clustering algorithms in scikit-learn
