K-Means Clustering is a Machine Learning model we use to group similar items, even when the relation is not apparent.
This post will give some examples of when to use K-Means Clustering, so you can evaluate whether it may help you. Then it will use an infographic to explain how it works from an intuitive, citizen data scientist's point of view. Finally, the post will propose the next steps for you to leverage K-Means Clustering.
Use cases
K-Means Clustering allows us to group items through dimensions (metrics) that are numeric and continuous.
The way to cluster the data may not be evident at first sight. The ultimate goal of K-Means Clustering is to reveal a grouping pattern that helps us take different actions. Below are several use cases.
- Document classification, based on tags, topics or the content of the documents. For example, this project categorises COVID-19 research articles. It helps health professionals find the most relevant content for their interests faster, so they can keep up with new information about the virus.
- Delivery store optimisation, by finding the most appropriate routes and the optimal number of launch locations. This paper shows how to enhance deliveries with truck drones through this algorithm.
- Customer segmentation. The objective is to group customers based on similar interests to target them with the appropriate marketing campaigns. Here you have an example.
- Fraud detection. For example, we can use historical data to cluster fraudulent patterns. If an insurance company receives a new claim, it can evaluate its proximity to a fraudulent cluster and investigate it more carefully.
- Find the dominant colour in an image. You can find out how to do it here; a minimal sketch also follows this list.
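Taking the dominant-colour use case as an example, a few lines of Python with scikit-learn are enough to sketch the idea. This is only a hedged sketch: the image path "photo.jpg" and the choice of five clusters are placeholders for your own data.

```python
# A minimal sketch of the dominant-colour use case, assuming scikit-learn,
# NumPy and Pillow are installed; "photo.jpg" and 5 clusters are placeholders.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Each pixel becomes one observation with three continuous dimensions (R, G, B).
pixels = np.asarray(Image.open("photo.jpg").convert("RGB")).reshape(-1, 3)

# The cluster centres of the fitted model are the dominant colours.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(pixels)
print(kmeans.cluster_centers_.astype(int))
```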
K-Means Clustering 101
As we already mentioned, K-Means Clustering is a Machine Learning model. In particular, it is an unsupervised learning algorithm.
The infographic below explains the intuition behind how K-Means Clustering works. We use a two-dimensional model as an example, i.e. one with two variables: one on the x-axis and one on the y-axis.
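If you prefer code to pictures, here is a hedged sketch of that same two-dimensional intuition, using synthetic data from scikit-learn's make_blobs instead of a real dataset:

```python
# A sketch of the two-dimensional example: two variables (x and y) and three
# hidden groups, generated synthetically with make_blobs.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Fit K-Means with k=3 and assign each point to its nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the points coloured by cluster, with the centroids marked on top.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.show()
```

The algorithm only sees the x and y values; the three groups and their centroids emerge from the distances between the points.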
Next Steps
Data
First and foremost, you need enough historical data to develop your K-Means Clustering model, or for that matter, any Machine Learning algorithm. If the data is too old, it won't match your current situation and will interfere with the outcome.
You also need quality data, i.e. data that accurately represents the whole spectrum of your items, scenarios or customers. So you must check that there is no bias, missing information, etc.
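As a rough illustration, those checks could look like the sketch below. The file name and column names are made up; swap in your own dataset. Note that K-Means works with distances, so features on very different scales usually need standardising.

```python
# A hedged sketch of basic data checks before clustering; "customers.csv",
# "age" and "annual_spend" are hypothetical names.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")

# Look for missing information and obvious gaps in coverage.
print(df.isna().sum())
print(df.describe())

# Standardise the numeric, continuous dimensions we will cluster on.
features = StandardScaler().fit_transform(df[["age", "annual_spend"]])
```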
Development tool
Then, the classic approach is to use a Jupyter Notebook and code the algorithm in your favourite programming language. You can find many examples on Kaggle to help you.
Depending on your environment, you can use different services based on Jupyter Notebooks, such as Colab, Databricks, Azure Machine Learning, Vertex AI Workbench, etc. These services offer additional features and capabilities that simplify the model’s implementation and integration.
On a separate note, there are already several products on the market that allow you to use Machine Learning models without coding the whole algorithm, i.e. you don't need a Jupyter Notebook (or an equivalent service) anymore. For example, BigQuery ML allows you to create a K-Means Clustering model on your BigQuery data set and read the output with SQL.
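As a rough, hedged example of that route, the snippet below creates a K-Means model with BigQuery ML from Python and reads the cluster assignments back with SQL. The dataset, table and column names are placeholders, and you still need a configured Google Cloud project and the google-cloud-bigquery client.

```python
# A sketch (not a definitive recipe) of K-Means in BigQuery ML; all dataset,
# table and column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Train the clustering model directly on the BigQuery table.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.customer_clusters`
    OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
    SELECT age, annual_spend
    FROM `my_dataset.customers`
""").result()

# Read each row's cluster assignment back with SQL.
rows = client.query("""
    SELECT CENTROID_ID, age, annual_spend
    FROM ML.PREDICT(MODEL `my_dataset.customer_clusters`,
                    (SELECT age, annual_spend FROM `my_dataset.customers`))
""").result()
```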
Trial and error
Beware that we used only two dimensions (variables) to illustrate the K-Means Clustering model. However, you may need more to define the clusters that best fit your use case.
You must follow a trial-and-error process to test and refine your model. For example, tags, topics and the variety of content within the documents represent several dimensions in a document classification use case. You will need to identify the dimensions that influence the clustering.
In any case, you should define an acceptable output in your scenario before you start, so you can evaluate the model as you build and tune it.
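One common shape for that trial-and-error loop is to try several numbers of clusters and compare them with a metric such as the silhouette score. The sketch below uses synthetic data; in practice you would plug in your own prepared features.

```python
# A sketch of evaluating several values of k; the synthetic data stands in
# for your own prepared and scaled features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)

# Lower inertia and a higher silhouette score both suggest a better fit,
# but the final choice should also make sense for your use case.
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    print(k, round(kmeans.inertia_, 1), round(silhouette_score(X, labels), 3))
```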
Tame Complexity
You may need to apply other techniques besides K-Means Clustering to achieve a good result for your use case. For instance, in the project categorising the COVID-19 research articles, the authors use Natural Language Processing to parse the texts and identify the content, alongside other Machine Learning models.
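As a simplified, hedged illustration of that kind of combination, the sketch below turns a handful of made-up documents into TF-IDF features before clustering them; the real project is far more elaborate.

```python
# A toy sketch of combining a text-processing step (TF-IDF) with K-Means;
# the documents are invented examples.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "covid vaccine trial results",
    "mrna vaccine immune response",
    "truck drone delivery routes",
    "optimal launch locations for delivery drones",
]

# Turn free text into numeric, continuous dimensions K-Means can work with.
X = TfidfVectorizer(stop_words="english").fit_transform(documents)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print(labels)
```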
It is better to start with a simple use case and, as you master the technique, move on to more complex ones.
The Hidden Technical Debt
Ward Cunningham coined the metaphor “Technical debt” to help reason about the long-term costs we incur when we move quickly in software engineering.
Machine Learning systems are particularly prone to technical debt. They have all of the maintenance problems of traditional code, plus issues derived from data variability: we use data to develop our Machine Learning models, and as that data evolves, the model's ability to discover patterns or make predictions on new data decreases.
To address the technical debt in your project, you must implement MLOps: a set of practices for deploying and maintaining Machine Learning models in Production reliably and efficiently.
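What that looks like in practice depends on your stack, but one simple, hedged example of an MLOps-style check is to monitor how well new data still fits the clusters learned at training time, and retrain when the fit degrades:

```python
# A rough sketch of monitoring cluster fit over time; the data, the threshold
# and the retraining decision are all placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_train, _ = make_blobs(n_samples=500, centers=3, random_state=42)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

# Baseline: average distance from each training point to its nearest centroid.
baseline = np.mean(np.min(model.transform(X_train), axis=1))

# The same metric on fresh data; a large increase hints at drift.
X_new, _ = make_blobs(n_samples=500, centers=3, center_box=(-5, 15), random_state=7)
drift_ratio = np.mean(np.min(model.transform(X_new), axis=1)) / baseline

if drift_ratio > 1.5:  # hypothetical threshold
    print("Data drift suspected: consider retraining the model.")
```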
I hope all the above works as a checklist for you to get ready to implement a K-Means Clustering model. I’d love to hear what your use case is.