Hult EdTech Club

K-means clustering

Last updated 1 year ago

The simple explanation below is a simplified analogy provided by ChatGPT.

Simple explanation

Imagine you have a big box of marbles. Each marble has its own features, such as size, weight, and pattern, but none of them carries a label telling you which group it belongs to. You want to organize the marbles into groups where marbles in the same group are very similar to each other, but you don't know where to start because nothing tells you how to group them.

This is like having a bunch of information (data) without knowing what it's for (labels), and you want to learn from it without being told what to look for. It's a bit like trying to sort toys without knowing which ones are supposed to go together, based on how they look or feel.

K-Means is like a game we can play with this box of marbles to sort them into groups.

  1. Pick the Number of Groups (Choose the Number of Clusters, K): First, decide how many piles you want to sort the marbles into. This number is "K", the number of clusters.

  2. Start with Guesses (Select Initial Centroids): Now, randomly choose one marble for each pile. These are your starting points, or "centroids", around which each pile will form.

  3. Sort Marbles by Closest Starting Point (Assign Points to the Nearest Centroid): Look at each marble and decide which centroid it most resembles in size, weight, and pattern. Put it in that pile.

  4. Find the Middle of Each Pile (Update Centroids): Once all marbles are in piles, work out the average size, weight, and pattern of each pile. This average is the pile's new "centroid" for the next round of sorting. It does not have to match any actual marble.

  5. Do It Again (Repeat Assignment and Update Steps): Using the new centroids, sort all the marbles again into the closest pile, then work out each pile's average again.

  6. When Piles Stop Changing (Check for Convergence): Keep repeating steps 3 and 4 until no marble moves to a different pile. When the piles stay the same after a round of sorting, you're done!

  7. Finished Piles (Result): Now you have your marbles sorted into piles, with each pile grouped by how similar the marbles are. Each pile is a cluster, and you've found a sensible way to organize the whole box!
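The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name `kmeans`, its parameters (`max_iters`, `seed`), and the sample `data` are all assumptions made up for this example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)     # cluster index for each point
print(centroids)  # one centroid per cluster
```

On this toy data the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids, so the label numbers themselves carry no meaning.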

Formal definition

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters. (Chris Piech, n.d.)
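The two alternating steps can also be written in symbols. The notation here is an assumption chosen for illustration and is not taken from the cited source: $x_i$ is the $i$-th data point, $\mu_j$ is the centroid of cluster $j$, and $c_i$ is the cluster currently assigned to point $i$.

$$
\text{(1) assignment:}\quad c_i \leftarrow \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2
\qquad
\text{(2) update:}\quad \mu_j \leftarrow \frac{1}{\lvert \{ i : c_i = j \} \rvert} \sum_{i \,:\, c_i = j} x_i
$$

Neither step can increase the total squared distance $\sum_i \lVert x_i - \mu_{c_i} \rVert^2$, which is why the alternation eventually settles. The result it settles on can depend on the starting centroids.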

References

Chris Piech (n.d.). Definition of K-means clustering. Stanford.