Introduction

K-Means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Let's break this statement down into its sub-terminologies to better understand what K-means clustering means.

Swarnim Kashyap · 5 min read · Aug 11, 2021

What is Unsupervised Learning?
Unsupervised Learning is a machine learning technique in which there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.

What is Clustering?

A cluster refers to a collection of data points aggregated together because of certain similarities.

What is K-means?

K-means is an algorithm that identifies K centroids and then allocates every data point to the nearest centroid, keeping the resulting clusters as compact as possible.

Here "K" refers to the number of centroids we need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.

What does "means" refer to in K-means?

The ‘means’ in K-means refers to the averaging of the data, that is, finding the centroid.
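As a minimal illustration (using NumPy, with made-up points), the centroid of a cluster is just the coordinate-wise mean of the points assigned to it:

```python
import numpy as np

# Hypothetical 2-D points belonging to one cluster
points = np.array([[1.0, 2.0],
                   [2.0, 3.0],
                   [3.0, 4.0]])

# The "mean" in K-means: average each coordinate to get the centroid
centroid = points.mean(axis=0)
print(centroid)  # [2. 3.]
```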

K-Means Clustering Use Cases in Security:

Cyber profiling of criminals

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiling, which provides the investigation division with information used to classify the types of criminals who were at the crime scene. Here is an interesting white paper on how to cyber-profile users in an academic environment based on user data preferences.

Through the K-means clustering algorithm, the data can be grouped by the number of websites visited. This grouping aims to reveal which websites users access most frequently.

The internet access data of users at an institution can be categorized as large-scale data, so the analysis can be done with data mining. In this case, a clustering algorithm, as one of the data mining techniques, can be used to find useful groups (clusters) of objects; which groups are useful depends on the purpose of the data analysis.

Clustering analysis is one of the most useful methods for the acquisition of knowledge and is used to find clusters, which are a fundamental and important pattern in the distribution of the data itself.
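As a rough sketch of this idea (assuming scikit-learn is available, with entirely made-up visit counts per website category), users could be grouped by their browsing behaviour like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per user, one column per
# website category (e.g. visit counts for social media, academic
# sites, streaming, and forums); the numbers are invented
user_visits = np.array([
    [50,  2, 30,  5],
    [48,  1, 25,  7],
    [ 3, 60,  2, 10],
    [ 5, 55,  4, 12],
    [10,  8, 70,  1],
])

# Group the users into three behavioural profiles
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(user_visits)

print(kmeans.labels_)           # which profile each user belongs to
print(kmeans.cluster_centers_)  # average visit pattern of each profile
```

Each row of cluster_centers_ then summarises the average access pattern of one user profile.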

Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.
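One way to sketch this (again assuming scikit-learn, with purely synthetic claim features) is to cluster historical claims and then check whether a new claim falls closest to a cluster that past investigations have marked as fraudulent:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical historical claims: [claim_amount, days_since_policy_start]
historical_claims = np.array([
    [1200, 400], [1500, 380], [1100, 420],   # typical claims
    [9000,  10], [8500,  15], [9400,   8],   # pattern previously flagged as fraud
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(historical_claims)

# Look up which cluster the known-fraudulent claims fell into
fraud_cluster = kmeans.labels_[-1]

# A new claim is flagged if it falls closest to that cluster's centroid
new_claim = np.array([[8700, 12]])
if kmeans.predict(new_claim)[0] == fraud_cluster:
    print("flag claim for review")
else:
    print("claim looks normal")
```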

How Does K-Means Clustering Work?

At a high level, k-means clustering works as follows:

The goal of the K-means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use trial and error by specifying a value of K (e.g., 3, 4, 5) and changing it as we go until we get the best clusters.

Another method is to use the Elbow technique to determine the value of K. Once we have the value of K, the system assigns that many centroids randomly and measures the distance of each data point from these centroids. Each point is then assigned to the centroid from which its distance is minimum, so every data point ends up attached to its closest centroid. This gives us K initial clusters.
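A minimal sketch of the Elbow technique (assuming scikit-learn, on a synthetic data set) is to fit K-means for several values of K and watch the within-cluster sum of squares (inertia); the K at which the curve stops dropping sharply, the "elbow", is a reasonable choice:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn around three loose group centers
data = np.vstack([rng.normal(center, 0.5, size=(50, 2))
                  for center in ([0, 0], [5, 5], [0, 5])])

# Inertia (within-cluster sum of squared distances) for K = 1..8
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 1))
# The K where inertia stops dropping sharply (the "elbow") is a good choice;
# for this synthetic data it should be around K = 3.
```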

For the newly formed clusters, the algorithm calculates new centroid positions; each centroid moves away from its randomly allocated starting position toward the center of its cluster.
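To make the assignment and update steps concrete, here is a minimal from-scratch sketch of one iteration (pure NumPy, with made-up points and randomly chosen initial centroids):

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.random((10, 2))                        # made-up 2-D data points
init_idx = rng.choice(len(points), size=2, replace=False)
centroids = points[init_idx]                        # K = 2 random initial centroids

# Assignment step: each point goes to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: move each centroid to the mean of its assigned points
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)
```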

Use-Cases in the Security Domain

Perform Clustering

We need to create the clusters step by step, as described below:

Consider a small two-dimensional data set, and let us solve the problem using K-Means clustering (taking K = 2).

The first step in k-means clustering is the random allocation of two centroids (as K = 2). Two points are assigned as centroids. Note that these points can be anywhere, as they are chosen randomly. They are called centroids, but initially they are not the actual center points of the given data set.

The next step is to determine the distance of each data point from the randomly assigned centroids. For every point, the distance is measured from both centroids, and the point is assigned to whichever centroid is closer. Each data point thus ends up attached to one of the two centroids, forming two groups.

The next step is to determine the actual centroid for each of these two clusters. The originally allocated random centroids are repositioned to the actual centroids of the clusters.

This process of calculating the distances and repositioning the centroids continues until we obtain our final clusters; then the centroid repositioning stops.

Once the centroids no longer need repositioning, the algorithm has converged, and we are left with two clusters, each with its own centroid.
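Putting the whole walkthrough together, here is a hedged end-to-end sketch (pure NumPy, on made-up 2-D points) that repeats the assignment and update steps until the centroids stop moving, mirroring the K = 2 example above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two loose groups of made-up 2-D points
points = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
                    rng.normal([4, 4], 0.5, size=(20, 2))])

k = 2
centroids = points[rng.choice(len(points), size=k, replace=False)]

while True:
    # Assignment: attach every point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: reposition each centroid to the mean of its cluster
    new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    # Converged once the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # final centroids of the two clusters
```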

Thanks for Reading !! 🙌🏻

Keep Learning !! Keep Sharing !!

