I recently had to build a clustering project on Spark, where both the data volume and the number of clusters are fairly large. KMeans gives acceptable results but is a bit slow, so I went back over the commonly used algorithms. In the end I chose mini-batch kMeans, initializing the cluster centers with a kMeans++-style method.
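As a point of reference, here is a minimal local sketch of that final choice using scikit-learn's MiniBatchKMeans with k-means++ initialization. The Spark project itself would go through MLlib instead; the make_blobs data and every parameter value below are placeholders rather than the real settings.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# toy stand-in for the real data; sizes and cluster count are illustrative only
X, _ = make_blobs(n_samples=10000, centers=50, n_features=20, random_state=0)

mbk = MiniBatchKMeans(
    n_clusters=50,        # assumed number of clusters
    init="k-means++",     # k-means++-style center initialization
    batch_size=1024,      # size of each random mini-batch
    random_state=0,
)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)   # (50, 20)
```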
kMeans — key point: how the centers are initialized (random vs. kMeans++).
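To see why the initialization matters, a small comparison sketch with assumed parameters: with a single restart, k-means++ typically ends at a lower inertia than random initialization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=20, random_state=0)

for init in ("random", "k-means++"):
    # n_init=1 so the comparison isolates the effect of the initialization
    km = KMeans(n_clusters=20, init=init, n_init=1, random_state=0).fit(X)
    print(init, km.inertia_)
```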
mini-batch kMeans — loop: randomly sample a mini-batch; assign each sampled point to its closest center; update those centers.
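The loop is short enough to write out directly; a minimal NumPy sketch of those three steps (random center initialization, with a hypothetical batch size and iteration count):

```python
import numpy as np

rng = np.random.default_rng(0)

def mini_batch_kmeans(X, k, batch_size=100, n_iter=100):
    # initialize centers from k random samples (k-means++ seeding would also work here)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    counts = np.zeros(k)                                   # samples seen per center
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # assign every sample in the batch to its closest center
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # move each assigned center toward the sample with a per-center rate 1/count
        for x, c in zip(batch, labels):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]
    return centers

X = rng.normal(size=(1000, 2))
centers = mini_batch_kmeans(X, k=5)
```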
mean shift — init: seed candidate centers using the bandwidth; loop: find each center's neighbors within the bandwidth; shift the center to the neighborhood mean; de-duplicate centers that converge to the same point.
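A minimal mean-shift sketch with scikit-learn on toy data, estimating the bandwidth from the data (the quantile value is an arbitrary choice):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
bandwidth = estimate_bandwidth(X, quantile=0.2)      # neighborhood size used for shifting
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)
print(ms.cluster_centers_)                           # shifted, de-duplicated centers
```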
DBSCAN — init: start from core samples in the densest regions; loop: expand clusters by adding core samples found in the neighborhood of existing ones.
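A minimal DBSCAN sketch on the bundled two-moons generator (eps and min_samples are assumed values that happen to suit this toy data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5)   # eps: neighborhood radius, min_samples: density threshold
labels = db.fit_predict(X)            # label -1 marks noise points outside any dense region
print(db.core_sample_indices_[:10])   # core samples that the clusters grew from
```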
Ward hierarchical — init: every sample starts as its own cluster; loop: merge the pair of clusters whose union gives the smallest increase in within-cluster squared error.
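A minimal Ward-linkage sketch with scikit-learn's AgglomerativeClustering (the cluster count is assumed known):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# "ward" merges the pair of clusters that least increases the within-cluster squared error
ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = ward.fit_predict(X)
```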
Spectral clustering — steps: build the similarity matrix S; eigendecompose the graph Laplacian of S to get an embedding U; run kMeans on the rows of U.
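Spelled out slightly more precisely, the usual recipe takes the eigenvectors of a graph Laplacian built from S and runs kMeans on them. A minimal NumPy/scikit-learn sketch on toy data (RBF similarity with an assumed gamma, unnormalized Laplacian for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k = 2
S = rbf_kernel(X, gamma=15.0)            # similarity matrix S
L = np.diag(S.sum(axis=1)) - S           # unnormalized graph Laplacian D - S
_, eigvecs = np.linalg.eigh(L)           # eigenvalues come back in ascending order
U = eigvecs[:, :k]                       # embedding: eigenvectors of the k smallest eigenvalues
labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)   # kMeans on the rows of U
```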
AP clustering — init: build the similarity matrix S, set r(i,k) = 0 and a(i,k) = 0;
loop: r(i,k) ← s(i,k) − max_{k'≠k} [a(i,k') + s(i,k')];
      a(i,k) ← min(0, r(k,k) + Σ_{i'∉{i,k}} max(0, r(i',k)))  (for i ≠ k);
      a(k,k) ← Σ_{i'≠k} max(0, r(i',k));
end: each point i takes the k that maximizes r(i,k) + a(i,k) as its exemplar.
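A compact NumPy sketch of exactly those responsibility/availability updates (the damping factor, iteration count, and median-based preference are assumed values; scikit-learn's AffinityPropagation implements the same updates):

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Minimal sketch of the r/a message-passing updates above."""
    n = S.shape[0]
    R = np.zeros((n, n))            # responsibilities r(i,k)
    A = np.zeros((n, n))            # availabilities a(i,k)
    idx = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        first = AS.max(axis=1)
        first_k = AS.argmax(axis=1)
        AS[idx, first_k] = -np.inf
        second = AS.max(axis=1)
        max_rest = np.where(idx[None, :] == first_k[:, None],
                            second[:, None], first[:, None])
        R = damping * R + (1 - damping) * (S - max_rest)
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())          # keep r(k,k) itself in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp        # drop the i-th term from each column sum
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)               # a(k,k) is not clipped at 0
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)                 # exemplar of i: argmax_k r(i,k) + a(i,k)

# similarity: negative squared distances, preference (diagonal) set to the median similarity
X = np.random.default_rng(0).normal(size=(60, 2))
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(S, np.median(S))
exemplars = affinity_propagation(S)
```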
Topic Model (LDA)
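A minimal scikit-learn LDA sketch on a few toy documents (the document list and topic count are placeholders):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "spark cluster kmeans centers",
    "topic model latent dirichlet allocation",
    "mini batch kmeans update centers",
    "lda topics words documents",
]
X = CountVectorizer().fit_transform(docs)                 # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                         # per-document topic mixture
print(lda.components_.shape)                              # per-topic word weights
```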
In fact scikit-learn implements most of these algorithms and ships ready-made datasets to experiment with. For example, http://scikit-learn.org/stable/modules/clustering.html shows comparison plots of the results along with notes on each algorithm's scalability.
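For example, a small experiment in the spirit of that comparison page, using the bundled two-moons generator (all parameter values are assumptions):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
for name, model in [("kmeans", KMeans(n_clusters=2, n_init=10)),
                    ("dbscan", DBSCAN(eps=0.2, min_samples=5))]:
    labels = model.fit_predict(X)
    # DBSCAN can follow the non-convex moon shapes that k-means cannot
    print(name, adjusted_rand_score(y, labels))
```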
[Figure: A comparison of the clustering algorithms in scikit-learn]