What is K-means Clustering and its use cases ?
✏️ What is Clustering ?
Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
✏️ What is K Means Clustering ?
K-means Clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid), serving as a prototype of the cluster.
In simple words, k-means clustering groups similar items into clusters: it measures how similar the items in a dataset are and places items that are close to each other in the same cluster.
❄️ The main objective of the K-Means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
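To make this objective concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance and uses the squared distances that standard k-means minimizes; the function names and the toy data are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum, over all points, of the squared distance to the assigned centroid."""
    return sum(
        squared_distance(point, centroids[label])
        for point, label in zip(points, labels)
    )


# Hypothetical toy data: two tight groups on a line.
points = [(1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(2.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]  # index of the centroid each point belongs to

print(within_cluster_sum_of_squares(points, labels, centroids))  # 4.0
```

A worse grouping gives a larger value: moving the point (3.0, 0.0) into the second cluster raises the sum from 4.0 to 67.0, which is exactly what the algorithm tries to avoid.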
✏️ Types of Clustering:
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.
The various types of clustering are:
- Hierarchical clustering
- Partitioning clustering
Hierarchical clustering is further subdivided into:
- Agglomerative clustering
- Divisive clustering
Partitioning clustering is further subdivided into:
- K-Means clustering
- Fuzzy C-Means clustering
✏️ How are Clusters formed / How does K Means Clustering work ?
First, we choose the number of clusters we want, which is k. For example, if we want 2 clusters, then k is 2.
Then we randomly select a centroid for each cluster. With k = 2, that means two centroids.
In the above image, the red and green circles represent the centroids of their clusters.
Then assign each point to the cluster of its nearest centroid.
We can now clearly see that the points closest to the red centroid belong to the red cluster, and the same holds for the green centroid.
Now recompute the centroid of each newly formed cluster as the mean of the points assigned to it.
In the above image, the crosses mark the new centroids of their respective clusters.
Now repeat the same process: assign each point to the cluster of its closest centroid. Then we get the below image:
Through this process every point is assigned to a specific cluster. One round of assigning points and recomputing centroids is called a single iteration.
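The steps above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: Euclidean distance, hand-picked starting centroids instead of random ones so the result is reproducible, and an empty cluster simply keeping its old centroid. The function names and data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(coords) / len(members) for coords in zip(*members))
        )
    return new_centroids


points = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # the "red" and "green" starting centroids

# One iteration: assign every point, then recompute the centroids.
labels = assign_points(points, centroids)
centroids = update_centroids(points, labels, centroids)

print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.5, 1.0), (8.5, 9.0)]
```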
🤔 But when should we stop the process to get final clusters ?
❄️ Stopping Criteria of K Means Clustering :
There are three stopping criteria that can be adopted to stop the K-means algorithm:
- Centroids of newly formed clusters do not change
- Points remain in the same cluster
- Maximum number of iterations is reached
We can stop the algorithm when the centroids of the newly formed clusters no longer change, or when the changes are very small. We can also stop when the points remain in the same cluster even after further iterations.
Finally, we can stop the training if the maximum number of iterations is reached.
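Here is one way the three stopping criteria could fit into a complete loop. Treat it as a sketch, not a reference implementation: the `tolerance`, `max_iterations` and `seed` parameters are names chosen for this example, and the starting centroids are simply sampled from the data.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100, tolerance=1e-9, seed=0):
    """Cluster points (tuples of floats); returns (centroids, labels, iterations)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids taken from the data
    labels = None
    for iteration in range(1, max_iterations + 1):
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Criterion 2: points remain in the same cluster.
        if new_labels == labels:
            return centroids, labels, iteration
        labels = new_labels

        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # empty cluster keeps its centroid

        # Criterion 1: centroids do not change (or change only very slightly).
        shift = max(squared_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tolerance:
            return centroids, labels, iteration

    # Criterion 3: maximum number of iterations reached.
    return centroids, labels, max_iterations


# Hypothetical data: three points near (1, 1) and three near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, iterations = k_means(points, k=2)
print(labels, iterations)
```

Which cluster gets index 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.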
✏️ Application of K Means Clustering :
K Means Clustering is used in many business-related use cases. A few are as follows :
- Academic performance
- Search engines
- Diagnostic systems
- Wireless sensor networks
Academic Performance :
Based on the scores, students are categorized into grades like A, B, or C.
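As a rough illustration of this use case, the sketch below clusters one-dimensional scores with k-means and then names the clusters from the highest average score to the lowest. Everything here is an assumption made for the example: the scores are invented, the starting centroids are spread evenly over the sorted scores rather than chosen at random, and the loop runs for a fixed number of iterations.

```python
def grade_students(scores, grades=("A", "B", "C"), iterations=20):
    """Cluster 1-D scores with k-means, then name clusters from best to worst."""
    k = len(grades)  # needs at least 2 grades
    ordered = sorted(scores)
    # Assumption: spread the starting centroids evenly across the sorted scores.
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]

    def nearest(score):
        return min(range(k), key=lambda i: abs(score - centroids[i]))

    for _ in range(iterations):
        labels = [nearest(s) for s in scores]
        for i in range(k):
            members = [s for s, label in zip(scores, labels) if label == i]
            if members:
                centroids[i] = sum(members) / len(members)

    labels = [nearest(s) for s in scores]
    # The cluster with the highest mean score gets the first grade, and so on.
    ranking = sorted(range(k), key=lambda i: centroids[i], reverse=True)
    grade_of_cluster = {cluster: grades[rank] for rank, cluster in enumerate(ranking)}
    return [grade_of_cluster[label] for label in labels]


scores = [95, 91, 88, 72, 70, 68, 45, 40, 38]  # hypothetical exam scores
print(grade_students(scores))
# ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
```

Note that k-means itself only produces unnamed groups. Mapping them to A, B and C is an extra step layered on top.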
Diagnostic systems :
The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
Search engines :
Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this.
Wireless sensor networks :
The clustering algorithm plays the role of finding the cluster heads, each of which collects all the data in its respective cluster.
✏️ K Means Clustering in Security Domain :
The k-Means clustering algorithm partitions a dataset into meaningful patterns. An Intrusion Detection System detects malicious attacks, which generally include theft of information. Studies suggest that clustering-based intrusion detection methods may be helpful in detecting unknown attack patterns compared to traditional intrusion detection systems. One study on this topic presents a modified k-Means that applies preprocessing and normalization steps. According to that study, this improves effectiveness and overcomes shortcomings of plain k-Means. The approach is intended for network intrusion data; the algorithm was tested on the KDD99 dataset and reported satisfactory results.
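The text above does not say which normalization the modified k-Means uses, so the sketch below shows min-max scaling purely as one common choice, not as the method from that study. The idea is that features measured on very different scales (for example seconds and bytes) would otherwise let the largest one dominate the distance calculation. The records are invented for the example.

```python
def min_max_normalize(rows):
    """Rescale every column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            # A constant column carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical connection records: (duration in seconds, bytes sent).
records = [(0.0, 200.0), (5.0, 10200.0), (10.0, 5200.0)]
print(min_max_normalize(records))
# [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```

After scaling, both features contribute on a comparable scale when k-means measures how close two records are.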
Images used in this blog are created and owned by: www.analyticsvidhya.com
Thanks for reading.