# Comparative Analysis of K-Means and Mean Shift Clustering Algorithms

**Introduction**

Clustering is the task of grouping a set of objects so that each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. This project is intended to study and compare two different data clustering algorithms. The algorithms are compared on the Iris data set, which consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The two algorithms are compared according to the following factors: distortion, time taken to form clusters, and accuracy of the result.
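As an illustrative sketch (not the project's own implementation), the two algorithms can be run on the Iris data set with scikit-learn, assuming it is installed:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, MeanShift

X = load_iris().data  # 150 samples, 4 features (sepal/petal length and width)

# K-Means needs the number of clusters up front; Mean Shift estimates it itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mean_shift = MeanShift().fit(X)

print("K-Means clusters:", len(set(kmeans.labels_)))
print("Mean Shift clusters:", len(set(mean_shift.labels_)))
```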

**Problem Statements**

There are different types of files, such as text files, data files, directory files, binary files, numeric files, and graphics files, and these different types of files store different types of information. No single algorithm is fit to cluster all data types; algorithms perform differently on different data. It is difficult to identify which algorithm gives better performance for a given kind of data. This project therefore evaluates two clustering algorithms and analyzes their output.

**Use Case Diagram**

##### System Architecture

In the first phase, a data-set file and a configuration file are selected and uploaded from the user interface. The data set is then fed to the K-Means and Mean Shift clustering algorithms respectively. If the data set is image data, preprocessing and feature extraction are performed first.

The system creates the clusters by executing the respective clustering algorithms and outputs the distortion, time, RI, VOI, and accuracy of each algorithm applied to the data set. Finally, a comparative analysis of the algorithms is performed based on the output and the reports generated by the system.

**Interface Design**

We have used Flask as the web framework for the user interface design. Flask is a micro web framework written in Python that simplifies the process of building a web application. It imposes very little structure of its own: the look and feel of every component can be customized through templates without significant changes to the application code, and additional functionality can be added through pluggable extensions. This flexibility makes writing applications in Flask easy and distinguishes it from heavier web frameworks.
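A minimal sketch of such an upload endpoint in Flask (the route name and form field here are hypothetical, not taken from the project code):

```python
import io
from flask import Flask, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    # Receive the data-set file chosen in the user interface
    dataset = request.files["dataset"]
    return "received " + dataset.filename

# Exercise the endpoint with Flask's built-in test client
client = app.test_client()
resp = client.post("/upload",
                   data={"dataset": (io.BytesIO(b"5.1,3.5,1.4,0.2\n"), "iris.csv")})
print(resp.data.decode())  # received iris.csv
```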

##### Flowchart of K-means Algorithm

The flowchart for the K-Means clustering algorithm is shown below. In the flowchart, the data set is passed as input to the algorithm, which creates clusters based on the smallest Euclidean distance. After the calculation, the algorithm outputs the distortion (the sum of the squared distances between each observation point and its dominating centroid), the time taken, and the accuracy as the result.

**Flow chart for mean-shift**

**Implementation**

The implementation model is shown below. Initially, the Iris data set or an image file is selected from the disk. The selected file is then subjected to clustering, and the application also reports the clustering time.

For the numerical data set, we used the Iris data set. The use of the Iris data set in cluster analysis is not common, since the data set contains only two clusters with rather obvious separation: one cluster contains Iris setosa, while the other contains both Iris virginica and Iris versicolor and is not separable without the species information. The total number of samples is 150, so the data set contains a set of 150 records under five attributes: petal length, petal width, sepal length, sepal width, and species. A sample of the Iris data set is shown below:

**Preprocessing**

For preprocessing of an image, the system builds a histogram of the image based on pixel position and RGB value. An image histogram is a graph plotting the frequency of occurrence of the different color intensities in the image. Simply put, it shows how many pixels of each possible intensity there are in the image. Each point on the image histogram represents one intensity level, where 0 represents black and 255 represents white.
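As a small sketch of this idea (using a random stand-in image rather than the project's data), a per-channel histogram can be computed with NumPy:

```python
import numpy as np

# Stand-in 64x64 RGB image; a real image would be loaded from disk instead
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Count how many pixels fall at each of the 256 intensity levels (0 = black, 255 = white)
hist_red, _ = np.histogram(img[:, :, 0], bins=256, range=(0, 256))

print(hist_red.shape)  # (256,)
print(hist_red.sum())  # 4096, one count per pixel
```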

**K-means Algorithm**

```text
Input:
    E = {e1, e2, …, en}           (set of entities to be clustered)
    K                             (number of clusters)
    MaxIters                      (maximum number of iterations)
Output:
    C = {c1, c2, …, ck}           (set of cluster centroids)
    L = {l(e) | e = 1, 2, …, n}   (set of cluster labels of E)

foreach ci ∈ C do
    ci ← ej ∈ E                   (e.g. random selection)
end
foreach ei ∈ E do
    l(ei) ← argmin_j euclidean_distance(ei, cj), j ∈ {1…K}
end
changed ← false;
iter ← 0;
repeat
    foreach ci ∈ C do
        UpdateCluster(ci);
    end
    foreach ei ∈ E do
        minDist ← argmin_j distance(ei, cj), j ∈ {1…K};
        if minDist ≠ l(ei) then
            l(ei) ← minDist;
            changed ← true;
        end
    end
    iter++;
until changed = true and iter ≤ MaxIters;
```
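The pseudocode above can be sketched in Python with NumPy; this is an illustrative implementation, not the project's own code:

```python
import numpy as np

def kmeans(E, k, max_iters=100, seed=0):
    """Cluster the rows of E into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by random selection from E
    C = E[rng.choice(len(E), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(E[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new_C = np.array([E[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):  # stop when no centroid moved
            break
        C = new_C
    return C, labels

# Two well-separated blobs should end up in two different clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(points, k=2)
print(labels)
```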

**Mean shift Algorithms**

```python
import math
import numpy as np

def euclid_distance(a, b):
    # Euclidean distance between two points
    return np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))

def neighbourhood_points(X, x_centroid, distance=5):
    # All points of X within `distance` of x_centroid
    eligible_X = []
    for x in X:
        distance_between = euclid_distance(x, x_centroid)
        # print('Evaluating: [%s vs %s] yield dist=%.2f' % (x, x_centroid, distance_between))
        if distance_between <= distance:
            eligible_X.append(x)
    return eligible_X

def weighting_function(distance, bandwidth):
    # Gaussian kernel weight
    return (1 / (bandwidth * math.sqrt(2 * math.pi))) * np.exp(-0.5 * (distance / bandwidth) ** 2)

X = np.copy(original_X)  # original_X, look_distance and kernel_bandwidth
past_X = []              # are set elsewhere in the application
n_iterations = 50
for it in range(n_iterations):
    for i, x in enumerate(X):
        # Step 1. For each datapoint x ∈ X, find the neighbouring points N(x) of x.
        neighbours = neighbourhood_points(X, x, look_distance)
        # Step 2. For each datapoint x ∈ X, calculate the mean shift m(x).
        numerator = 0
        denominator = 0
        for neighbour in neighbours:
            distance = euclid_distance(neighbour, x)
            weight = weighting_function(distance, kernel_bandwidth)
            numerator += weight * neighbour
            denominator += weight
        new_x = numerator / denominator
        # Step 3. For each datapoint x ∈ X, update x ← m(x).
        X[i] = new_x
    # Step 4. Repeat until the shifted points converge (here: a fixed number of iterations).
    past_X.append(np.copy(X))
```

**System Testing**

Testing is one of the critical phases for discovering errors in an application. System testing is the process of validating and verifying the software program or application. It begins with initial testing by the developer and ensures that the system meets the given specifications.

**Iris Data Set (Test Cases)**


**K means coding result for iris dataset**

This test case is run with k = 3 cluster centres; the data points are then clustered by the system as shown in the table. All 150 data points are assigned to a cluster. In this test case the algorithm runs up to the 10th iteration before producing the final result. One cluster contains 50 iris flowers, but the others contain 54 and 46, which is an error of the system. If outliers are added to the data, they are still assigned to a cluster.

**Mean shift coding results for iris dataset**

This test case is run without specifying the number of cluster centres; the data points are then clustered by the system as shown in the table. Not all of the 150 data points are assigned to the flower clusters: only 146 are. In this test case the algorithm runs up to the 6th iteration before producing the final result. One cluster contains 50 iris flowers, but the others contain 58 and 38, which is an error of the system. If outliers are added to the data, they are not assigned to the flower clusters; instead, the outliers form separate clusters of their own.

**Variation of Information (VOI)**

The Variation of Information (VOI) metric defines the distance between two clusterings as the average conditional entropy of one clustering given the other, and thus measures the amount of randomness in one clustering that cannot be explained by the other [12]. Suppose we have two clusterings (divisions of a set into several subsets) X and Y, where X = {X1, X2, …, Xk}, p_i = |X_i| / n, and n = Σ_i |X_i|. Then the variation of information between the two clusterings is:

VI (X; Y) = H(X) + H(Y) − 2I(X, Y)

where H(X) is the entropy of X and I(X, Y) is the mutual information between X and Y. The mutual information of two clusterings is the reduction in uncertainty about one clustering when the other is given. Thus, the mutual information is non-negative and bounded by min{H(X), H(Y)} ≤ log2(n).
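As an illustrative sketch, VOI can be computed from two label vectors (these helpers are not taken from the project code):

```python
import numpy as np

def entropy(labels):
    # H(X) = -sum(p_i * log2(p_i)) over cluster proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    # I(X, Y) = sum over cells of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xi in np.unique(x):
        for yi in np.unique(y):
            pxy = np.mean((x == xi) & (y == yi))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == xi) * np.mean(y == yi)))
    return mi

def variation_of_information(x, y):
    # VI(X; Y) = H(X) + H(Y) - 2 I(X, Y)
    return entropy(x) + entropy(y) - 2 * mutual_information(x, y)

a = [0, 0, 1, 1]
b = [1, 1, 0, 0]  # same partition, different label names
print(variation_of_information(a, b))  # 0.0 (identical clusterings)
```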

**Performance Evaluation**

After the results are obtained, performance evaluation is carried out on the basis of several factors: distortion, time, and accuracy for the numerical data (the Iris data set), and Rand index (RI), Variation of Information (VOI), and time for a single image.

**For Iris Data**

**Cluster Size**

The two algorithms produce their own respective cluster sizes, as shown in the figure below. The K-means algorithm produces better cluster sizes with a greater number of iterations, whereas the Mean Shift algorithm takes a smaller number of iterations but its cluster sizes do not converge.

**Comparison of Distortion**

Distortion is defined as the sum of the squared distances between each observation vector and its dominating centroid. Each step of the K-means and Mean Shift algorithms refines the choice of centroids to reduce distortion. The change in distortion can be used as a stopping criterion: when the change falls below a threshold, the algorithm is no longer making sufficient progress and terminates. One can also set a maximum number of iterations. Here the K-means algorithm gives the better value of distortion; a lower distortion signifies a better result. The following bar graph gives a step-wise overview of the distortion for both algorithms.
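Distortion as defined above can be computed directly; a minimal sketch with NumPy (not the project's own code):

```python
import numpy as np

def distortion(X, labels, centroids):
    # Sum of squared distances between each observation and its dominating centroid
    return float(sum(np.sum((X[labels == j] - c) ** 2)
                     for j, c in enumerate(centroids)))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [10.0, 0.0]])
print(distortion(X, labels, centroids))  # 2.0  (1^2 + 1^2 + 0)
```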

**Clustering Time Analysis**

Time taken by K-means = 11.115787348 s

Time taken by Mean Shift = 7.644271321 s

Although the Mean Shift algorithm takes less time, it provides a less accurate result than K-means. Mean Shift takes less time here because it forms clusters based on similarity between nearby data points, whereas K-means repeatedly reassigns every point by its distance to each centroid and therefore takes more time.
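The timings can be reproduced in spirit with `time.perf_counter` (a sketch using scikit-learn; absolute numbers will differ from the values above and by machine):

```python
import time
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, MeanShift

X = load_iris().data

start = time.perf_counter()
KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
kmeans_seconds = time.perf_counter() - start

start = time.perf_counter()
MeanShift().fit(X)
mean_shift_seconds = time.perf_counter() - start

print(f"K-Means: {kmeans_seconds:.4f} s, Mean Shift: {mean_shift_seconds:.4f} s")
```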

**Conclusion**

The performance of the two clustering algorithms is compared based on the time taken to form the estimated clusters, the accuracy of the result, and the distortion of the result for numerical data, and on RI and VOI for a single image. As the number of clusters increases, the time to form the clusters also increases. The Mean Shift algorithm takes only a few seconds to cluster the Iris data set, whereas K-means takes more time to cluster the numerical data, but the result of the clustering by K-means is better. The K-means algorithm is not sensitive to outliers, whereas the Mean Shift algorithm is very sensitive to them. Both algorithms show some ambiguity on noisy data: the noise makes it difficult for the algorithm to assign an object to its suitable cluster, which affects the results.

In addition, the unsupervised cluster-based algorithms were applied to image clustering based on color and intensity. The clustering techniques K-means and Mean Shift were tested on different images, and their performance was measured using the parameters RI, VOI, and time. The computational results showed that K-means clustering of a single image consumes more time and provides poorer results, while the Mean Shift algorithm takes fewer iterations than K-means and provides a better result.

**REFERENCES**

[1] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, 31(3), 1999.

[2] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.

Resource Provider: Bishal Upadhaya
