K-Means Clustering - A Sparse Version of Principal Component Analysis

Introduction

The K-Means algorithm is one of the most popular clustering techniques; it groups similar data points into clusters and helps uncover hidden insights. For example, by analyzing sales data, customers can be grouped into segments to better understand their behaviour. In one of the earlier posts, we explored Principal Component Analysis, a dimension reduction tool that transforms a number of (possibly) correlated columns into a (smaller) number of uncorrelated columns called principal components.


K-Means and PCA are usually thought of as two different procedures with two different purposes, and at first sight they do not seem related. But that isn’t true!


While principal components can, for example, help rank the records of a dataset, K-Means just groups the records into buckets such as good/bad. Principal components are essentially the continuous solutions to the discrete cluster membership indicators of K-Means clustering[1]. In other words, K-Means is a sparse version of PCA. We explore this idea with the European employment dataset by grouping similar countries. The principal components are then combined with the results of K-Means to see the similarities between the two techniques.

Required Packages

The packages listed below are required to replicate this project and generate this HTML document.

  • tidyverse - The gas for the caR
  • cluster - Cluster Analysis
  • factoextra - Visualizing Clusters
  • plotly - Interactive Plots
  • shiny - Center Plotly charts
  • gridExtra - Plot graphs in grids
  • DT - Print HTML Tables
  • knitr - Generating HTML document
  • rmdformats - Document theme

Data

The dataset contains the distribution of employment in different sectors of 26 European countries. There are ten variables, the first being the name of the country, and the remaining nine holding the percentage employment of the countries’ citizens in nine different industries.

The dataset seems to contain the USSR and East Germany, so I’m assuming it dates from the 1980s or earlier. The data is read in, and a box plot of the distribution in each industry is presented below.
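
Since the original read-in chunk isn't shown here, the sketch below is only an assumption of how it might look: the file name "european_employment.csv" and the column name "country" are hypothetical, and the box plot uses ggplot2 from the tidyverse.

# Read the data and move the country column into row names,
# matching the structure the later chunks expect
data <- read_csv("european_employment.csv") %>% 
  as.data.frame() %>% 
  column_to_rownames("country")

# Box plot of the employment share in each industry
data %>% 
  rownames_to_column("country") %>% 
  pivot_longer(-country, names_to = "industry", values_to = "percent") %>% 
  ggplot(aes(x = industry, y = percent)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Employment (%)")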

K-Means Basics

K-Means clustering is one of the most commonly used unsupervised machine learning algorithms; it groups data into a pre-specified number of clusters, K. Data points are assigned to clusters in such a way that points within a cluster are as similar to each other as possible. Each cluster is represented by a centroid, which is the vector of mean values of all columns for the points in that cluster. Within-cluster similarity is measured through the sum of squared residuals between the centroid and all data points assigned to it. For each cluster,

\[withinness(C) = \sum_{\substack{j\in 1:ncol \\ i\in 1:nrow_{cluster}}}{(x_{ij} - \mu_j)^2} \]
  • \(C\) denotes the cluster, with one centroid value \(\mu_j\) per variable in the dataset
  • \(x_{ij}\) is the element belonging to record i and column j
  • \(\mu_j\) is the average of the column j for all i in that cluster

For reference, this excellent post by Brad Boehmke goes into a lot more detail.
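
As a quick sanity check on the formula (this sketch is mine, not from the original post), the within-cluster sum of squares can be computed by hand and compared with the withinss values that kmeans() reports; data is assumed to be the employment data frame read in earlier.

# Standardize the data (the same transformation used in the later chunks)
scaled <- scale(data)

set.seed(1804)
km <- kmeans(scaled, centers = 3, nstart = 25)

# Withinness of each cluster: squared deviations from that cluster's centroid
manual_withinss <- sapply(1:3, function(k) {
  members <- scaled[km$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, colMeans(members))^2)
})

all.equal(unname(manual_withinss), km$withinss)  # should be TRUE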

The K-Means algorithm is applied to the dataset with the number of clusters varying from 2 to 7. The countries are colored based on the cluster they are assigned to and presented on the choropleth map below. The number of clusters can be changed from the dropdown. For two clusters, the algorithm pits Turkey against the rest of Europe. With seven clusters, we see the countries separated broadly into central Europe, the Mediterranean, the East, and Scandinavia, with Turkey and Hungary in their own clusters.

# K-Means requires the data to be standardized to N(0,1) so that the
# Euclidean distance measurements are comparable across variables.
data_scaled <- scale(data) %>% as.data.frame()

# Country code is combined to plot on map
country_code <- read_csv("country_code.csv") %>% 
  rename(country = "COUNTRY", code = "CODE")
data_kmeans <- data_scaled %>% rownames_to_column("country") %>% 
  left_join(country_code, by = "country") 

# Placeholder columns for the number of clusters (k) and the assignment
data_kmeans$k <- 0
data_kmeans$cluster <- 0
# Results for each value of k are accumulated into geo_data
geo_data <- NULL

for (i in 2:7) {
  set.seed(1804)
  kmeans_model <- kmeans(data_scaled, i, nstart = 25)
  data_kmeans$cluster <- kmeans_model$cluster
  data_kmeans$k <- i
  geo_data <- geo_data %>% bind_rows(data_kmeans)
}

# Drop unnecessary columns
geo_data <- geo_data %>% select(country, code, k, cluster)

# Change data structure for Plotly
geo_data_played <- geo_data %>% spread(k, cluster, sep = "value")
geo_colnames <- colnames(geo_data_played)

# Plotly choropleth settings
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(showframe = FALSE, 
          # lonaxis = list(range = c(-20, 50)),
          # lataxis = list(range = c(30, 90)),
          lonaxis = list(range = c(-10, 50)),
          lataxis = list(range = c(30, 80)),
          showcoastlines = FALSE,
          projection = list(type = 'Mercator'))

# Build plot with data for two clusters visible
geo_plot <- geo_data_played %>% plot_geo() %>%
  add_trace(z = ~kvalue2, color = ~kvalue2, 
            text = ~country, locations = ~code, 
            marker = list(line = l), showscale = FALSE) %>% 
  colorbar(title = "Scale") %>%
  layout(geo = g,
         title = "Countries grouped based on employment distribution",
         annotations = list(x = 0, y = 1, text = "Change clusters in dropdown",
                            showarrow = FALSE, xref = "paper", yref = "paper"))

# Add traces for clusters 3 to 7, initially hidden
for (col in geo_colnames[4:8]) {
  geo_plot <- geo_plot %>% 
    add_trace(z = geo_data_played[[col]], color = geo_data_played[[col]], 
              text = ~country, locations = ~code, 
              marker = list(line = l), showscale = FALSE, visible = FALSE) 
}

# Add dropdown for clusters and set corresponding visibility 
geo_plot <- geo_plot %>%
  layout(
    updatemenus = list(
      list(y = 0.9, x = 0.08,
           buttons = list(
             list(method = "restyle",
                  args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                  label = "2"),
             list(method = "restyle",
                  args = list("visible",list(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)),
                  label = "3"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)),
                  label = "4"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)),
                  label = "5"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
                  label = "6"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)),
                  label = "7")
           )))) %>% 
  config(displayModeBar = FALSE) 

div(geo_plot, align = "center")

As we already observed, the dataset contains East and West Germany, Yugoslavia and other countries which no longer exist on modern maps. If you don’t see 7 colors for 7 clusters, that’s because those countries are not plotted on the map.

We clearly see patterns emerging that we weren’t aware of before. Countries that are geographically and culturally closer have similar employment patterns. Unsupervised learning thus helped us draw inferences without a labelled response in the input dataset.

Determining the optimal number of clusters

Because the data was on a map, it was easy for us to see that 7 clusters grouped the data better than 2. But in other cases, where the output may not be as easy to interpret, it becomes essential to have a principled way of choosing the number of clusters. There are three major methods to determine K, sketched in code after the list below.

  • Elbow - The total within-cluster sum of squares is plotted against the number of clusters, and the point where the curve bends sharply (the elbow) is considered optimal.
  • Average Silhouette - The silhouette value measures how similar a data point is to its own cluster compared to other clusters. The optimal number of clusters is the one that maximizes the average silhouette.
  • Gap Statistic - This method computes the within-cluster variation and compares it against that of a randomly generated reference dataset. The K that gives the largest gap is considered optimal.
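
All three diagnostics are available in the factoextra package loaded above; the sketch below shows one way to produce them (the chart options in the original post may differ).

set.seed(1804)
p_elbow      <- fviz_nbclust(data_scaled, kmeans, method = "wss")
p_silhouette <- fviz_nbclust(data_scaled, kmeans, method = "silhouette")
p_gap        <- fviz_nbclust(data_scaled, kmeans, method = "gap_stat", nboot = 50)

# Arrange the three charts side by side
grid.arrange(p_elbow, p_silhouette, p_gap, ncol = 3)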

The result of the Elbow method is ambiguous, while three clusters have the highest average silhouette. The gap statistic method suggests one cluster as the optimum, which is impractical.

Going by the silhouette method, three clusters broadly separate the data into western and eastern Europe, with Turkey and Yugoslavia placed in the third cluster. This seems fair, and the rest of the document will work with three clusters.
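
For concreteness, here is a minimal sketch of the final fit used in the rest of the post; the object name kmeans_final is my own.

set.seed(1804)
kmeans_final <- kmeans(data_scaled, centers = 3, nstart = 25)

# Number of countries in each cluster
table(kmeans_final$cluster)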

Merging Results with Principal Components

A PCA model is applied to the data and the results are printed below. The first three components explain ~75% of the variance. We will use only these three components, which also makes plotting possible.
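
A minimal sketch of this step, assuming the same scaled data used for clustering; the original chunk may differ in presentation.

# The data is already centered and scaled, so prcomp needs no extra scaling
pca_model <- prcomp(data_scaled)

# Standard deviation and cumulative proportion of variance per component
summary(pca_model)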

The data points are plotted in the space of the first three principal components, and the clustering results are rendered as colors. The output makes it apparent how closely PCA and K-Means are related: while PCA assigns each country to a point in 3-dimensional space, K-Means only assigns the country to one of the K clusters.
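
Roughly, the 3-D chart can be built as sketched below, reusing pca_model and kmeans_final from the sketches above; the styling of the original chart may differ.

# First three principal component scores, colored by the K-Means cluster
pca_scores <- as.data.frame(pca_model$x[, 1:3]) %>% 
  rownames_to_column("country") %>% 
  mutate(cluster = factor(kmeans_final$cluster))

plot_ly(pca_scores, x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~cluster, text = ~country,
        type = "scatter3d", mode = "markers") %>% 
  layout(title = "Countries on the first three principal components") %>% 
  config(displayModeBar = FALSE)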

Matrix Algebra and Summary

For K-Means

The K-Means algorithm is essentially an optimization problem: find centroids and cluster assignments such that the residual sum of squares is minimized. Following what we saw earlier, this can be written in matrix terms as


\[\underset{C,\ W}{minimize}\ ||CW - X||^2 \]


  • C denotes a matrix of mean values of all variables in different clusters. The dimension is J by K where J is the number of variables in the dataset and K is the number of clusters. Thus each column of matrix C represents a cluster.
  • W is the assignment matrix of dimension K rows and I columns where I is the number of records in the dataset.
  • X is the original data. The double vertical bar in the equation denotes the Frobenius (Euclidean) norm, which is the square root of the sum of squares of the elements of the matrix.

The figure below, from page 293 of the book Machine Learning Refined, illustrates the concept.

The point to note in the above figure is the matrix W. The constraint on W is that each column should contain a single 1 and all other elements of that column must be 0; that is, each point belongs to exactly one cluster.
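
The formulation can be checked numerically. The sketch below (mine, reusing data_scaled and the three-cluster kmeans_final from earlier) builds C, W and X as defined above and verifies that the squared norm equals the total within-cluster sum of squares reported by kmeans().

# X: J x I, variables in rows and countries in columns
X <- t(as.matrix(data_scaled))
# C: J x K, one cluster centroid per column
C <- t(kmeans_final$centers)
# W: K x I one-hot assignment matrix with a single 1 per column
W <- sapply(kmeans_final$cluster, function(cl) as.numeric(1:3 == cl))

all(colSums(W) == 1)                                        # each country in exactly one cluster
all.equal(sum((C %*% W - X)^2), kmeans_final$tot.withinss)  # the K-Means objective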

For PCA

PCA solves the same objective function, with the only difference being that the “categorical” constraint on W is removed. It’s the relaxed version of the K-Means problem, where

  • C is a matrix of principal components and
  • W is the matrix of eigenvectors, transposed

The optimization solves for C and W to minimize the difference between C*W and X.
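
This, too, can be verified numerically. In the sketch below (reusing pca_model from earlier; note the data here is in its usual countries-by-variables orientation), keeping all components reconstructs the scaled data exactly, while keeping only three gives the approximation plotted above.

C_pca <- pca_model$x                 # principal component scores
W_pca <- t(pca_model$rotation)       # eigenvectors, transposed

# With every component retained, C %*% W recovers the scaled data
all.equal(C_pca %*% W_pca, as.matrix(data_scaled), check.attributes = FALSE)

# Squared reconstruction error when only the first three components are kept
sum((pca_model$x[, 1:3] %*% t(pca_model$rotation[, 1:3]) - as.matrix(data_scaled))^2)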

For Linear Regression

Come to think of it, the same minimization can be applied to linear regression as well. The matrices in that case would be

  • C - The matrix of predictor values (the design matrix)
  • W - The column of coefficients to be determined
  • X - The response values Y
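
As a toy illustration (not from the original post), the least-squares coefficients for regressing one industry on the others solve exactly this minimization; the choice of response column is arbitrary and the intercept is dropped since the data is already centered.

y  <- data_scaled[[1]]               # first industry column, used here as the response
Xp <- as.matrix(data_scaled[, -1])   # the remaining industries as predictors

# Solve the normal equations: minimizes ||Xp %*% beta - y||^2
beta_ls <- solve(t(Xp) %*% Xp, t(Xp) %*% y)

# The same coefficients from lm() without an intercept
all.equal(as.numeric(beta_ls), unname(coef(lm(y ~ Xp - 1))))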

Summary

Principal Component Analysis is one of the classic dimension reduction algorithms. While not particularly useful for predictive modeling, its fundamental matrix minimization problem helps frame our understanding of other popular models such as K-Means and linear regression. It just goes to show the importance of understanding the underlying concepts in data science.

Here’s a link to the repository - GitHub

Harish M