K-means clustering
Last updated
Was this helpful?
Last updated
Was this helpful?
Imagine you have a big box of colored marbles, but instead of colors, each marble has different features like size, weight, and pattern, but no names to tell them apart. You want to organize these marbles into groups where marbles in the same group are very similar to each other, but you don't know where to start because they don't come with labels telling you how to group them.
This is like having a bunch of information (data) without knowing what it's for (labels), and you want to learn from it without being told what to look for. It's a bit like trying to sort toys without knowing which ones are supposed to go together, based on how they look or feel.
K-Means is like a game we can play with this box of marbles to sort them into groups.
Pick the Number of Groups (Choose the Number of Clusters, K): Imagine you have a bunch of stickers and you want to sort them into different piles. First, decide how many piles you want. This is like picking "K", the number of clusters.
Start with Guesses (Select Initial Centroids): Now, randomly choose one sticker for each pile to be the first sticker. These are your starting points, or "centroids", where each pile will begin.
Sort Stickers by Closest Starting Point (Assign Points to the Nearest Centroid): Look at each sticker and decide which starting sticker (centroid) it is closest to. Put it in that pile. This step is like grouping all the stickers based on which starting sticker they're nearest to.
Find the Middle of Each Pile (Update Centroids): Once all stickers are in piles, find the middle sticker of each pile. This new middle sticker is the new starting point, or "centroid", for the next round of sorting.
Do It Again (Repeat Assignment and Update Steps): With the new middle stickers, sort all the stickers again into the closest pile. Then, find the new middle stickers again. Keep doing this—sorting and finding new middles—until the piles look just right.
When Piles Stop Changing (Check for Convergence): Keep sorting stickers and finding new middles until the piles don't really change anymore. When the piles stay the same after sorting, that means you're done!
Finished Piles (Result): Now, you have your stickers sorted into piles, with each pile grouped by how similar the stickers are. Each pile is a cluster, and you've found a nice way to organize all your stickers!
K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.
K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids (2) chosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters. (Chris Piech, n.d.)
Chris Piech on the from Stanford.