K-means Clustering and its use-case in the Security Domain
♦ What is Unsupervised Learning?
Unsupervised learning is a type of machine learning that draws inferences from input data without labeled responses. Here, the “training data” is simply a collection of unlabeled examples.
♦ What is K-Means Clustering?
The term “k-means” was first used by James MacQueen in 1967 in his paper “Some Methods for Classification and Analysis of Multivariate Observations”. The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957 as a technique for pulse-code modulation. E. W. Forgy published essentially the same method in 1965, which is why it is sometimes referred to as the Lloyd-Forgy algorithm.
Clustering is the assignment of items to homogeneous groups (called clusters) such that objects in different groups are dissimilar. Clustering is an unsupervised task, since it tries to uncover the hidden structure of the items.
K-means Clustering is a common unsupervised machine learning technique that is both easy and effective.
Clustering is the process of dividing a population or set of data points into groups so that data points in the same group are more similar to each other than to data points in other groups. Put simply, the goal is to sort items with similar characteristics into clusters. The k-means algorithm’s objective is to discover such groupings in data.
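One common way to make this objective precise is the within-cluster sum of squares. The notation below is introduced here for illustration and is not from the text above: $x_i$ are the data points, $C_1, \dots, C_k$ are the clusters, and $\mu_j$ is the mean (centroid) of cluster $C_j$.

```latex
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller value means that points sit closer to the centroid of their own cluster, which is the sense in which members of a group are "similar".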
♦ Types of Clustering:
The various types of clustering are:
- Hierarchical clustering
- Partitioning clustering
Hierarchical clustering is further subdivided into:
- Agglomerative clustering
- Divisive clustering
Partitioning clustering is further subdivided into:
- K-Means clustering
- Fuzzy C-Means clustering
♦ Where is k-means clustering algorithm used?
The k-means clustering technique is used in machine learning models where we need to conduct unsupervised learning because the historical data has no reliable labels.
It’s an iterative method that separates an unlabeled dataset into k distinct clusters, with each data point belonging to just one of them.
♦ Working of the K-Means Algorithm
The k-means algorithm works as follows:
- Specify the number of clusters K.
- Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
- Repeat the following three steps until the centroids no longer change, i.e. until the assignment of data points to clusters stops changing:
- Compute the squared distance between each data point and every centroid.
- Assign each data point to the closest cluster (centroid).
- Recompute the centroid of each cluster by taking the average of all data points that belong to it.
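The approach k-means follows to solve the problem is called Expectation-Maximization: the E-step assigns each data point to its closest cluster, and the M-step recomputes the centroid of each cluster.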
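The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the `max_iter` and `seed` parameters, and the toy data are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick K distinct data points as the initial centroids (no replacement).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Toy data (hypothetical): two visually separate groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

The result depends on which points are drawn as initial centroids, so a different `seed` can lead to a different final grouping, even on small data like this.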
♦ Applications of K-means Clustering
The k-means method is used in a wide range of applications, including market segmentation, document clustering, image segmentation, and image compression, among others. It is mostly used to categorise unlabeled data.
- Customer Profiling
- Market segmentation
- Computer vision
- Geo-statistics
- Astronomy
- Document clustering
- Identifying crime-prone areas
- Cluster analysis
- Feature learning or dictionary learning
- Insurance fraud detection
- Public transport data analysis
♦ Limitations of K-means Clustering:
- It is sometimes quite difficult to predict the number of clusters, i.e. the value of k.
- The output is strongly influenced by the initial input, for example the number of clusters.
- The order of the data can substantially affect the final results.
- When clusters have complex geometric shapes, k-means is not a good choice.
- The algorithm is sensitive to scale: rescaling the data through normalization or standardization can change the output entirely.
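The sensitivity to scale can be seen even in a single assignment step. The sketch below uses made-up numbers and hypothetical helper names (`nearest_centroid`, `rescale`): the same point is closest to a different centroid once one feature is expressed in different units.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def rescale(point, factors):
    """Multiply each feature by its own factor (a change of units)."""
    return tuple(x * f for x, f in zip(point, factors))


# Hypothetical features: (height in metres, income in dollars).
point = (1.60, 30000.0)
centroids = [(1.90, 30500.0), (1.62, 32000.0)]

# Raw units: the income column dominates the distance.
raw_choice = nearest_centroid(point, centroids)

# Same data with income expressed in units of 10,000 dollars.
factors = (1.0, 1e-4)
scaled_choice = nearest_centroid(
    rescale(point, factors), [rescale(c, factors) for c in centroids]
)

print(raw_choice, scaled_choice)  # 0 1
```

Neither answer is "wrong"; the point is that the choice of units decides which feature drives the distance, so the scaling should be chosen deliberately.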
♦ Use-Cases in the Security Domain
K-means clustering is used in many areas, and it has numerous use-cases in the security sector. Some of the most significant ones are described below.
✔ Using the K-Means Clustering Algorithm to Analyze Logs from a Proxy Server and Captive Portal:
The amount of data created by users’ interactions with websites is continuously rising, as is the amount of traffic on the World Wide Web. As a result, web data has become one of the most important sources for retrieving information and discovering new knowledge.
Web Usage Mining was used to find valuable and interesting patterns in web data, using logs from the Proxy Server and Captive Portal databases. In addition, the k-means clustering method was used to group user access patterns based on the number of user sessions and the websites visited by network users. It was discovered as a result of the findings
✔ Cyber-Profiling Criminals:
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at a crime scene. There is an interesting white paper on how to cyber-profile users in an academic environment based on user data preferences.
✔ Call detail record analysis:
A call detail record (CDR) is the information captured by telecom companies during a customer’s call, SMS, and internet activity. This information provides greater insight into the customer’s needs when combined with customer demographics. Customer activity over 24 hours can be clustered with the unsupervised k-means algorithm to understand segments of customers with respect to their usage by hour.
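Before any clustering can happen, raw records have to be turned into one feature vector per customer. One possible representation, assumed here purely for illustration, is a 24-bin profile of each customer's share of activity per hour. The record layout and the name `hourly_profiles` are hypothetical.

```python
from collections import defaultdict


def hourly_profiles(records):
    """Turn (customer_id, hour) events into 24-bin activity shares per customer."""
    counts = defaultdict(lambda: [0] * 24)
    for customer_id, hour in records:
        counts[customer_id][hour] += 1
    profiles = {}
    for customer_id, bins in counts.items():
        total = sum(bins)
        # Normalise so heavy and light users are compared by shape, not volume.
        profiles[customer_id] = [c / total for c in bins]
    return profiles


# Hypothetical records: (customer_id, hour of day 0-23), one per call, SMS or data session.
records = [
    ("a", 9), ("a", 9), ("a", 14), ("a", 21),
    ("b", 23), ("b", 23), ("b", 2), ("b", 1),
]
profiles = hourly_profiles(records)
print(profiles["a"][9])  # 0.5
```

Each 24-element list can then be passed to k-means as one data point, so that customers with similar daily rhythms end up in the same cluster.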
✔ Anomaly detection:
Anomaly detection refers to methods that provide warnings of unusual behaviors which may compromise the security and performance of communication networks. Anomalous behaviors can be identified by comparing the distance between real data and cluster centroids. Identifying network anomalies is essential for communication networks of enterprises or institutions. The goal is to provide an early warning about an unusual behavior which can affect the security and the performance of a network.
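The distance-to-centroid idea can be sketched as follows. The centroids, the feature meanings, and the threshold are all assumed values for illustration; in practice the centroids would come from a k-means run on normal traffic, and the threshold would need tuning.

```python
import math


def distance_to_nearest(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def find_anomalies(points, centroids, threshold):
    """Return the points that lie farther than `threshold` from every centroid."""
    return [p for p in points if distance_to_nearest(p, centroids) > threshold]


# Hypothetical centroids learned from normal traffic:
# (requests per minute, kilobytes per request).
centroids = [(20.0, 5.0), (200.0, 1.0)]
observations = [(22.0, 4.5), (195.0, 1.2), (90.0, 40.0)]

# The threshold is an assumption; it would normally be tuned on historical data.
anomalies = find_anomalies(observations, centroids, threshold=15.0)
print(anomalies)  # [(90.0, 40.0)]
```

The first two observations sit close to a known pattern of behaviour, while the third is far from both centroids and is therefore reported.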
✔ Malware Detection:
Malware detection is the process of identifying the presence of malware on a host system, or of determining whether a given application is malicious or benign. Malware detection techniques are essential for identifying attacks that have a significant impact on the cyber world. Unsupervised machine learning can detect malware by clustering observed behaviour.
✔ Insurance fraud detection:
Machine learning plays an important role in fraud detection and has a wide range of applications in the automotive, healthcare, and insurance industries. Using historical data on fraudulent claims, new claims can be flagged based on their closeness to clusters that indicate fraudulent patterns. Because insurance fraud has the potential to cost a firm millions of dollars, the ability to identify fraud is critical.
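A rough sketch of this "closeness to fraudulent clusters" idea is shown below. It assumes that clusters have already been fitted on past claims and that the share of known fraud in each cluster is available. The feature names, the numbers, the `cutoff` parameter, and the function names are all hypothetical.

```python
def nearest_cluster(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def flag_claim(claim, centroids, fraud_rate, cutoff=0.5):
    """Flag a claim for review if its nearest cluster was mostly fraudulent in past data."""
    return fraud_rate[nearest_cluster(claim, centroids)] >= cutoff


# Hypothetical, already standardised features: (claim amount, days since policy start).
centroids = [(-0.5, 0.8), (1.9, -1.2)]
# Hypothetical share of known fraudulent claims in each historical cluster.
fraud_rate = [0.02, 0.70]

print(flag_claim((2.1, -1.0), centroids, fraud_rate))  # True
print(flag_claim((-0.3, 0.5), centroids, fraud_rate))  # False
```

A flag produced this way is only a prompt for manual review, since being near a suspicious cluster does not prove that a claim is fraudulent.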
♦ Conclusion:
K-means clustering is one of the most common clustering methods, and it’s generally the first thing practitioners try when working on a clustering problem to get a sense of the dataset’s structure. The purpose of k-means is to divide data into separate, non-overlapping groups. It performs well when the clusters have a spherical shape, and it’s very useful in the security domain.
Thank you for taking the time to read this. I’ll be back with a fresh piece shortly.
THANK YOU…… 😊