Building Index Using Principal Component Analysis

Harish M


PCA?

Let’s suppose one is trying to rank the students of a class based on their scores in several subjects. The obvious way to go about it would be to calculate the average score. If one subject is more important than another, the approach would be a weighted average. A third case arises when the scores in one subject, let’s say Math, are spread widely between 50% and 100%, while every student scored above 90% in Art. In other words, the variance in Math is high. A student with 100% in Math should then be rewarded more than one with 100% in Art. This is where Principal Component Analysis can prove useful, by indexing the students with weights calculated according to the variability in the scores.

Student Scores - Sample
math science art lang
60 70 100 100
70 75 98 96
80 80 96 92
90 85 94 88
100 90 92 84

The principal components of a dataset are essentially linear functions of the original variables. A dataset with \(j\) columns will have \(j\) principal components. However, the first few components will usually capture a large percentage of the variance in the dataset. The higher the collinearity between the variables, the higher the variance captured by the first few components. PCA therefore serves best when analyzing datasets with a large number of correlated variables. The first principal component explains most of the variance in the data, while the second component explains the variation that is not explained by the first.
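Concretely, the first principal component can be written as a weighted sum of the \(j\) standardized variables \(x_1, \dots, x_j\), where the weights are chosen to maximize the variance of the resulting score while keeping the weight vector at unit length:

\[
\mathrm{PC}_1 = w_{11} x_1 + w_{12} x_2 + \dots + w_{1j} x_j, \qquad \sum_{k=1}^{j} w_{1k}^2 = 1
\]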

Going back to the example of ranking students, let’s suppose the subjects are Math, Science, Art and Language. It would be fair to assume that students who performed well in Math would also score high in Science, and that those who did well in Art would also do well in Language. The variables are multicollinear. When PCA is applied to this data, component 1 will largely explain the scores in Math and Science, while component 2 will explain those in Art and Language.

Effectively, the dimensionality of the dataset is brought down from 4 to 2, and the multicollinearity is eliminated.
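As a quick sketch of what this looks like in code, here is one way to reduce the four subject columns above to two components, assuming scikit-learn is available. The scores are the ones from the example table; with such a tiny, artificial dataset the exact split of variance between the two components will of course differ from a real class.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scores from the example table: math, science, art, lang
scores = np.array([
    [ 60, 70, 100, 100],
    [ 70, 75,  98,  96],
    [ 80, 80,  96,  92],
    [ 90, 85,  94,  88],
    [100, 90,  92,  84],
])

# Standardize so every subject contributes on the same scale
standardized = StandardScaler().fit_transform(scores)

# Keep the first two of the four possible components
pca = PCA(n_components=2)
student_index = pca.fit_transform(standardized)

print(pca.explained_variance_ratio_)  # share of variance captured by each component
print(student_index)                  # one row per student, one column per component
```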

Eigenvalues and Eigenvectors

As established, the objective of PCA is to capture the variance. This can be achieved by rotating the axes. Let’s look at Galton’s data studying the relationship between parents’ heights and those of their children. The graph below on the left shows the original data, with the parent’s height on the x axis and the child’s on the y. The maximum variance in the data is observed along the perpendicular red lines drawn through the data. Switching the axes to these red lines, as shown on the right, will thus enable studying the variance better.

So in which direction should the axes rotate, and how much variance will the new axes explain? The eigenvectors and eigenvalues of the covariance matrix answer these two questions respectively. The eigenvector with the highest eigenvalue will therefore be the first principal component. In the figure above, the horizontal red line will be the first principal component. In datasets with higher dimensions, some information is lost when keeping only the first few principal components, but that is the price paid for dimensionality reduction.
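A minimal sketch of this relationship in Python, assuming a two-column NumPy array standing in for data like Galton’s (the heights below are simulated, not the actual dataset):

```python
import numpy as np

# Simulated stand-in for a parent-height / child-height dataset like Galton's
rng = np.random.default_rng(0)
parent = rng.normal(68, 2.0, size=500)
child = 0.6 * parent + rng.normal(27, 1.5, size=500)
heights = np.column_stack([parent, child])

# Covariance matrix of the two variables (np.cov centers the data itself)
cov = np.cov(heights, rowvar=False)

# Eigen decomposition: eigenvectors give the directions of the new axes,
# eigenvalues give the variance along each of those directions
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns values in ascending order; sort so the largest comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)         # variance explained along each principal axis
print(eigenvectors[:, 0])  # direction of the first principal component
```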

Data to Index

The dataset printed below contains the running times of 56 countries in 8 events; the sprint events (100m to 400m) are timed in seconds and the longer events (800m to the marathon) in minutes. The first 5 records are shown.

Olympics - Men’s Running Events Times - Sample
m_100 m_200 m_400 m_800 m_1500 m_5000 m_10000 marath country
10.39 20.81 46.84 1.81 3.70 14.04 29.36 137.72 Argentin
10.31 20.06 44.84 1.74 3.57 13.28 27.66 128.30 Australi
10.44 20.81 46.82 1.79 3.60 13.26 27.72 135.90 Austria
10.34 20.68 45.04 1.73 3.60 13.22 27.45 129.95 Belgium
10.28 20.58 45.91 1.80 3.75 14.68 30.55 146.62 Bermuda

To index these countries, let’s first apply PCA. The standard deviation of each component (its square is the corresponding eigenvalue) and the proportion of variance it explains are printed below.

Principal Components - Explained Variance
component standard_deviation proportion_variance cumulative_proportion
Comp.1 2.573 0.828 0.828
Comp.2 0.937 0.110 0.937
Comp.3 0.399 0.020 0.957
Comp.4 0.352 0.016 0.973
Comp.5 0.283 0.010 0.983
Comp.6 0.261 0.008 0.991
Comp.7 0.215 0.006 0.997
Comp.8 0.150 0.003 1.000

Component 1 explains about 83% of the variance, while components 1 and 2 together explain about 94%. That is more than enough. As already explained, each component is a linear function of the original variables. Let’s take a look at the coefficients of the components.

Coefficients of Principal Components
m_100 m_200 m_400 m_800 m_1500 m_5000 m_10000 marath
Comp.1 0.318 0.337 0.356 0.369 0.373 0.364 0.367 0.342
Comp.2 0.567 0.462 0.248 0.012 -0.140 -0.312 -0.307 -0.439
Comp.3 0.332 0.361 -0.560 -0.532 -0.153 0.190 0.182 0.263
Comp.4 0.128 -0.259 0.652 -0.480 -0.405 0.030 0.080 0.300
Comp.5 0.263 -0.154 -0.218 0.540 -0.488 -0.254 -0.133 0.498
Comp.6 0.594 -0.656 -0.157 0.015 0.158 0.141 0.219 -0.315
Comp.7 0.136 -0.113 -0.003 -0.238 0.610 -0.591 -0.177 0.399
Comp.8 0.106 -0.096 0.000 -0.038 0.139 0.547 -0.797 0.158

Component 1 is essentially a weighted average of all 8 races and explains about 83% of the variance in the data. Component 2, however, gets interesting: shorter races have positive coefficients, while longer races have negative ones. It thus captures how a country’s performance in shorter races compares with its performance in longer ones. Bringing the number of dimensions down in this way also eliminates the correlation between the columns.
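Here is a sketch of how this analysis could be reproduced in Python with scikit-learn. The file name mens_running_times.csv and the country column name are assumptions about where the data lives, and the signs of the loadings may come out flipped relative to the tables above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical CSV holding the table shown earlier: eight event columns plus "country"
running = pd.read_csv("mens_running_times.csv")
events = running.drop(columns=["country"])

# Standardize the event columns so no single event dominates the components
standardized = StandardScaler().fit_transform(events)

pca = PCA()
pca.fit(standardized)

# Proportion of variance explained by each component
# (compare with the proportion_variance column above)
print(pca.explained_variance_ratio_)

# Loadings: one row per component, one column per event
# (compare with the coefficients table above)
print(pd.DataFrame(pca.components_, columns=events.columns))
```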

To get the equivalent scores, the data must be standardized and multiplied by the coefficients. The first five records, sorted by component 1, are printed below.

Principal Component Equivalents
country Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Usa -3.462 -1.120 -0.041 -0.328 0.123 0.378 -0.154 0.005
Gbni -3.052 -0.281 0.237 0.342 -0.103 0.013 0.041 -0.160
Italy -2.752 -0.999 -0.496 0.262 -0.102 0.382 0.250 0.127
Ussr -2.651 -0.764 -0.205 -0.272 0.158 0.277 0.109 0.048
Gdr -2.614 -0.314 0.087 -0.062 -0.018 -0.038 0.032 0.016

USA has the lowest value for component 1 (shorter times are better) and hence ranks first. If one were to use the 8 columns of running times as variables in other processes, such as a linear regression, they could instead opt to use just the first few principal components.
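Finally, a self-contained sketch of turning the components into an index: standardize the data, project it onto the components, and sort the countries by the first score. It repeats the setup from the previous sketch, the file name is again hypothetical, and depending on the sign convention of the fitted components the sort order may need to be reversed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical CSV with the eight event columns plus "country"
running = pd.read_csv("mens_running_times.csv")
events = running.drop(columns=["country"])

standardized = StandardScaler().fit_transform(events)

# Fit on all eight components, then project the standardized data onto them
pca = PCA()
scores = pca.fit_transform(standardized)

index = pd.DataFrame(scores[:, :2], columns=["Comp.1", "Comp.2"])
index["country"] = running["country"].values

# With the sign convention of the tables above, a lower Comp.1 means faster
# times overall, so sorting in ascending order ranks the fastest nations first
ranking = index.sort_values("Comp.1")
print(ranking.head())
```

Keeping only the first component, or the first two, gives each country a single uncorrelated score that still reflects most of the variance across the eight events.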