Clustering and K-Means Algorithm
Clustering is a type of unsupervised leraning. Unsupervised learning is contrasted from supervised learning because it uses an unlabeled training set rather than a labeled one.
Clustering is good for market segmentation, social network analysis, organizing computer clusters, astronomical data analysis, etc…
K-Means Algorithm
The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.
Notations
$c^{(i)}$ = index of cluster to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid (k=1,2,…,K)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned
Optimization Objective
Our optimization objective is to minimize the following cost function:
\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \dfrac{1}{m}\sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2\]Algorithm
In order to minimize this cost function, we follow these steps:
-
Random Initialization
From the training examples, randomly pick $K$ points called the cluster centroids($c^{(1)}…c^{(K)}$). Make sure the number of your clusters($K$) is less than the number of your training examples($m$). -
Cluster assignment
Assign all examples into one of $K$ groups based on which cluster centroid the example is closest to.
The goal is to minimize cost $J$ with $c$ while holding $\mu$ fixed. -
Move centroid
Compute the averages for all the points inside each of the $K$ cluster centroid groups, then move the cluster centroid points to those averages.
The goal is to minimize cost $J$ with $\mu$ while holding $c$ fixed. -
Re-run (2) and (3) until we have found our clusters.
-
Try out different random initialization and choose the result that has minimal cost. We do this because depending on the initialization, K-means can get stuck in local optima.
Note that with k-means, it is not possible for the cost function to sometimes increase. It should always descend.
Leave a Comment