Does Regression hold its own in the age of Machine Learning?
Harish M

Introduction

Classification is one of the two common supervised machine learning techniques, used to predict the outcome of an observation. The outcome can be binary ("yes" or "no", as in this analysis) or a multinomial a/b/c/d/… categorization. Logistic regression helps us achieve this. Without getting technical (which we will in the later sections), the idea is to build a linear equation of predictor variables that tells us the probability of the outcome for each observation.

In the last decade, however, machine learning algorithms have become extremely popular owing to the simplicity of the models and the minimal tweaking required to fit them to a wide array of problems. A few hours is all it takes to fit a random forest model that works pretty well with minimal business know-how. This being the case, it was surprising to learn that regression models are still the most used data science method according to the Kaggle survey: 42% of Kaggle users reported using GLMs at work, compared to 34% who use random forests.

Granted, logistic regression is easy to use and highly interpretable, but does it produce results comparable to modern and advanced techniques like bagging and boosting? This analysis sets out to explore that question.


The IBM employee attrition dataset is used in this report to predict whether or not an employee quits. Beginning with a basic model, the dataset is run through decision trees, random forests and boosting machines, each tuned through a hypergrid. Model performance is evaluated based on the confusion matrix output and the area under the ROC curve.

A Few Things To Note

The libraries in the code-block below will be required to run the analysis in R. Markdown knitting options are also being set globally.

library(dplyr) #The gasoline to the caR
library(knitr) #Produce this HTML doc
library(rmdformats) #Produce this HTML doc
library(DT) #Print HTML datatables
library(rsample) #Sampling data into train and test
library(readxl) #Data import
library(boot) #Cross validate the model
library(ggplot2) #Visualization
library(plotly) #Interactive Visualization
library(tibble) #Data manipulation
library(gridExtra) #Print ggplots together
library(Hmisc) #Correlation
library(corrplot) #Correlation plot 
library(caret) #Classification Tools
library(ROCR) #Calculate ROC area
library(pROC) #Plot ROC curve
library(glmnet) #Regularized regressions
library(rpart) #Decision trees
library(rpart.plot) #Plot them
library(gganimate) #Animated ggplots
library(ipred) #Decision tree Bagging
library(ranger) #Random Forest
library(gbm) #Gradient Boosting

# Setting Global Options for the page render
options(max.print = "75")
opts_knit$set(width = "75")

# Globally controlling code blocks
opts_chunk$set(message = FALSE, # prevent messages 
               warning = FALSE, # prevent warnings
               fig.align = "center", # graph position
               fig.height = 4, # graph default height and width
               fig.width = 6)

# DT::datatable parameters
options(DT.options = list(paging = FALSE, # disable pagination
                          scrollY = "200px", # enable vertical scrolling
                          scrollX = TRUE, # enable horizontal scrolling
                          scrollCollapse = TRUE,
                          autoWidth = TRUE, # automatic column width
                          ordering = FALSE, # disable sorting data
                          dom = "t",  # display just the table
                          initComplete = JS("function(settings, json) {" ,
                                            "$(this.api().table().header()).css(
                                            {'color': '#888888'});",
                                            "}")))

The performance of each model is evaluated using the following parameters. A brief explanation is presented.

  • Confusion Matrix - A 2x2 table showing the actual vs predicted results. Here’s an example.


    \(TN = 50\) - True Negative is the number of negatives predicted correctly. \(FP = 10\) - False Positive is the number of negatives incorrectly tagged as positive. \(TP\) (True Positive) and \(FN\) (False Negative) are the corresponding counts for the positive class.

  • Accuracy - The proportion of observations predicted correctly. From the confusion matrix,
    \[accuracy = \frac{TN + TP}{n}\]


  • Sensitivity - Also called True Positive Rate, it measures the proportion of actual positives that are correctly identified.
    \[sensitivity = \frac{TP}{TP + FN}\]


  • Specificity - Also called True Negative Rate, it measures the proportion of actual negatives that are correctly identified.
    \[specificity = \frac{TN}{TN + FP}\]


  • Cost - This is a custom parameter, a measure of weighted accuracy that places more emphasis on false negatives than on false positives. The percentage of employees who quit is much smaller than the percentage who didn't, so it's imperative to place more weight on correctly identifying the employees who actually quit. At the same time, too many false positives could cause the organization monetary loss. An optimal balance needs to be struck, and this can get subjective. In this analysis the weight on FN is set to 10. Cost should be as low as possible, ideally zero.
    \[cost = \frac{FP + (10 * FN)}{n}\]

Most of the models in this analysis return the probability of an employee quitting - a number between 0 and 1, with 1 indicating "yes". A threshold needs to be set on this probability to categorize each row. The threshold generally depends on the class balance and the cost function. If the ratio of yes to no is roughly equal, a threshold of 0.5 works well. But when there is imbalance, as in our case where "yes" is the smaller class, the threshold should ideally be much lower.

  • Area under the ROC curve - Since the perceived performance of a model varies with the threshold, the Receiver Operating Characteristic curve measures overall performance across all values of the threshold. The curve plots Sensitivity on the y-axis against 1 - Specificity (the \(False\ Positive\ Rate\)) on the x-axis, for thresholds between 0 and 1. The higher the area under this curve, the better the performance, with an area > 0.7 generally accepted as adequate discriminant power. A sample plot is presented below.

    It is to be noted that throughout this report, the ROC area of the testing sample will be used to compare models.

The two functions written below will be used by all the models to evaluate these parameters.

The conf_matrix function takes in the predicted probabilities and the threshold, and returns the confusion matrix and the evaluation parameters.

As explained, the threshold is not a constant and can therefore be tuned to improve performance. Moreover, different models can return probability scores that are interpreted differently. The thres_tune function takes in the predicted probabilities and calculates the evaluation parameters for threshold values from 0.05 to 0.5, in steps of 0.01. conf_matrix is called internally.
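The original chunks are not reproduced in this excerpt, so the sketch below shows one way these helpers could look. The function bodies, the "No"/"Yes" level names and the hard-coded FN weight of 10 are assumptions based on the descriptions above, not the exact code used in the report.

conf_matrix <- function(probs, actual, threshold) {
  # Assumes the attrition factor has levels "No" and "Yes"
  predicted <- factor(ifelse(probs > threshold, "Yes", "No"),
                      levels = levels(actual))
  cm <- table(Predicted = predicted, Actual = actual)
  TP <- cm["Yes", "Yes"]; TN <- cm["No", "No"]
  FP <- cm["Yes", "No"];  FN <- cm["No", "Yes"]
  n  <- sum(cm)
  list(matrix      = cm,
       accuracy    = (TN + TP) / n,
       sensitivity = TP / (TP + FN),
       specificity = TN / (TN + FP),
       cost        = (FP + 10 * FN) / n)   # FN weighted 10x, as defined above
}

thres_tune <- function(probs, actual, thresholds = seq(0.05, 0.5, by = 0.01)) {
  # Evaluate conf_matrix at every threshold and return one row per threshold
  do.call(rbind, lapply(thresholds, function(t) {
    m <- conf_matrix(probs, actual, t)
    data.frame(threshold = t, accuracy = m$accuracy,
               sensitivity = m$sensitivity, specificity = m$specificity,
               cost = m$cost)
  }))
}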

Data Prep

The data is imported and the categorical fields are typed correctly. Columns with no information are dropped. A sample of the data is presented below. Explanations of the columns are available in sheet 2 of the Excel file.

The categorical fields that are unordered are marked as factors for the models to dummy encode.

There are 30 possible predictors for attrition and 1470 observations.

As seen in the table below, the data is imbalanced, with only about 16% of the employees quitting. Ideally, the probability threshold will lie around this number.

All models are trained on 70% of the data and the remaining 30% is used for testing. Passing the attrition column to the strata argument ensures that the 0's and 1's of the response are distributed proportionally between the two sets.
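A minimal sketch of the split with rsample follows, assuming the imported data frame is called data (as in the gif chunk later on); the seed and the data_train/data_test object names are placeholders that the later sketches reuse.

# Stratified 70/30 split (object names are placeholders used throughout the sketches)
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = "attrition")
data_train <- training(data_split)
data_test  <- testing(data_split)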

Different models require the data to be passed differently, with some requiring the predictors as a matrix.

Histograms of the different columns are presented below.

GLM - The Gold Standard

Logistic regression works very similarly to linear regression. Each predictor on the right-hand side of the equation is assigned a coefficient and the terms are added up. The coefficients are estimated by minimizing the negative log-likelihood, and the sum of the terms gives the log-odds, or logit, of the event occurring.


\[\operatorname{logit} p = \ln \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x \]

The probability is then calculated by exponentiating both sides and solving for p, which will now be a number between 0 and 1.
\[p = \frac{e^{\beta_0 + \beta_1 x}}{e^{\beta_0 + \beta_1 x}+1} \]


Since exponentials are introduced, the change in probability is not linear in the predictors; the same unit change in a predictor moves the probability far more near the middle of the curve than out in the tails. To understand this better, let's look at a dummy example with one coefficient, rating, and an intercept.
\[y = intercept + coefficient * rating\]

As seen in the table, y increases linearly with rating, whereas the probability doesn't. A unit change in rating from 1 to 2 increases the probability by about 0.33, whereas a unit change from 3 to 4 only causes a 0.016 change.
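For illustration, here is a small sketch of that dummy example. The intercept and coefficient values are arbitrary placeholders (the ones behind the report's table are not shown), so the exact probabilities differ, but the non-linear shape is the same.

# Dummy example: the linear predictor grows linearly, the probability doesn't
intercept   <- -2     # placeholder values
coefficient <- 1.5
rating      <- 1:6

y <- intercept + coefficient * rating   # linear in rating
p <- exp(y) / (1 + exp(y))              # squashed into a probability between 0 and 1

data.frame(rating, y, probability = round(p, 3))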

Along with the logit, there are other link functions for modelling the probability, such as the probit and complementary log-log links. We will not be covering them in this analysis.

Basic Model

A model with all the variables acting as predictors is built on the training data. The coefficients are plotted below. Hover over the dots to see them.
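A sketch of how this basic model might be fit, assuming the placeholder objects from the split sketch above; the actual chunk is not reproduced in this excerpt.

# Basic logistic regression with every column as a predictor
glm_model <- glm(attrition ~ ., data = data_train, family = binomial)

# Coefficients (the interactive plot below is built from these)
summary(glm_model)$coefficients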

It appears that 48 variables are being used in the model. There are a number of variables with very low coefficients, but that doesn't necessarily mean they are less important: if the corresponding \(X_i\) is large, they could still influence the estimate of \(y\). The ROC curve is plotted below.

The area under the curve, 0.823, is a pretty good start.

Let's apply the model to the test data. The threshold is set at 0.5, meaning an employee is tagged as will-quit if the model gives a probability above 0.5. The confusion matrix and the evaluation parameters are printed below.
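A sketch of that prediction step, reusing the conf_matrix helper sketched earlier and the placeholder object names.

# Predicted probabilities of quitting on the test data, evaluated at a 0.5 cutoff
glm_probs <- predict(glm_model, newdata = data_test, type = "response")
conf_matrix(glm_probs, data_test$attrition, threshold = 0.5)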

As expected, the accuracy is really good, while the specificity is poor. This can be corrected by setting the right threshold. The thres_tune function is used and the top 10 results, sorted by cost, are printed below.

The numbers are now more balanced, with a strong decrease in sensitivity. That's expected, and this is the best balance the model can come up with for the cost we defined earlier. The confusion matrix below, for the best-cost threshold, shows that 176 employees who did not quit were marked as will-quit, while 6 who did quit were not identified correctly.

Keeping in mind that this is the most basic model possible, these numbers are great!

Regularization

Regularization in regression can be thought of as a penalty for complexity. It's used when the number of features is high or there is multicollinearity between the variables. By penalizing the coefficients, overfitting is avoided and model interpretability is improved. Mathematically, this works by adding a penalty term \(P\) to the log-likelihood function of \(\beta\). As the penalty increases, the estimated coefficients shrink. The magnitude of the penalty is controlled by \(\lambda\). We look into two types of regularization: ridge and lasso.



To simplify the mathematics, all equations below correspond to the response variable y being continuous.


\[y_{predicted} = \hat{y_i} = \beta_0 + \sum_j{\beta_j x_{ij}} \]

\[minimize \{\sum{(y_i - \hat{y_i})^2} + \lambda P \} \]


where \(\sum{(y_i - \hat{y_i})^2}\) is the difference between the observed and predicted y, squared so that negative differences don't cancel out, and summed. This is formally defined as \(SSE\), the sum of squared errors.

This works because
\(P_{ridge} = \sum{\beta_i^2}\) and \(P_{lasso} = \sum{|\beta_i|}\)


That is, in ridge regression, because of the minimization, adding \(\beta_i^2\) forces the magnitude of each coefficient to be as small as possible. In lasso, the \(|\beta_i|\) penalty can force coefficients all the way to zero if they don't help the model. This property of regularization is used as a variable selection method in modelling.

The effect of regularization on the coefficients is visualized below. Setting \(\lambda = 0\) gives back ordinary regression, while \(\lambda = \infty\) shrinks all coefficients to zero.

\(P_{lasso}\) and \(P_{ridge}\) can be used together to form elastic net regression. This is done by introducing a second parameter \(\alpha\): setting it to 0 gives ridge regression, setting it to 1 gives lasso, and any value in between gives an elastic net. The optimal value of \(\lambda\) is calculated by cross-validating on the training data. The full formula is


\[minimize \{SSE + \lambda[P_{ridge} (1-\alpha) + P_{lasso} (\alpha)] \} \]


The equation, when y is binary, is

\[minimize\left\{\sum_i{\left(-y_i \hat{y_i} + \log(1+e^{\hat{y_i}})\right)} + \lambda\left[P_{ridge} (1-\alpha) + P_{lasso} (\alpha)\right]\right\} \]

A more detailed explanation of Lasso can be found in this research paper on variable selection and corporate bankruptcy forecasts.

Lasso

Howdy partner?

As explained, lasso regularization is analogous to throwing a rope around the coefficients and pulling them closer, if not all the way, to zero. A basic lasso regression with \(\alpha\) set to 1 is implemented below. Through cross-validation on the training dataset, the \(\lambda\) that minimizes the error is estimated by the function, and that value is then used when predicting on the test dataset. lambda.min is the value of lambda with the lowest cross-validated error. nfolds is set at the default value of 10.
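A sketch of what this could look like with glmnet, assuming the predictors are converted to a matrix with model.matrix; the x_train/x_test names are placeholders reused in the alpha-tuning sketch further down.

# Predictors as a numeric matrix, as glmnet requires (drop the intercept column)
x_train <- model.matrix(attrition ~ ., data_train)[, -1]
x_test  <- model.matrix(attrition ~ ., data_test)[, -1]

# 10-fold cross-validated lasso (alpha = 1)
lasso_cv <- cv.glmnet(x_train, data_train$attrition,
                      family = "binomial", alpha = 1, nfolds = 10)
lasso_cv$lambda.min     # lambda with the lowest cross-validated error

# Test-set probabilities and AUC at lambda.min
lasso_probs <- predict(lasso_cv, newx = x_test,
                       s = "lambda.min", type = "response")[, 1]
roc(data_test$attrition, lasso_probs)$auc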

The calculated minimum \(\lambda\) is 0.0020386. The area under the curve is 0.831, a slight decrease from the GLM model's. This could be because there was no overfitting in the first place. However, the number of variables in the model has come down to 40, a sharp drop from the GLM model's 48. The coefficients and the threshold-tuned evaluation parameters are presented below.

Tuning λ

In all of the estimations above, we used \(\lambda_{min}\) to calculate the model coefficients. But since \(\lambda_{min}\) comes from 10-fold cross-validation, there is some uncertainty around it. Choosing a \(\lambda\) that is one standard error away from the minimum can be just as useful for the test-data predictions. \(\lambda_{1se}\) was calculated to be 0.0108792.

The area under the curve is slightly lower than before at 0.826, but the number of variables has again dropped sharply, to 28.

Bias-Variance tradeoff

Across the three models so far, the number of variables has come down from 48 to 40 and then 28, while the area under the curve has moved from 0.823 to 0.831 and then 0.826. Evidently there is a trend, and this is where the bias-variance tradeoff comes in: more complex models give better results on average (low bias), but their variance on new test data can be high, while reducing model complexity sacrifices some accuracy.


Any model should be viewed carefully as a balance of bias and variance.

Tuning α

The final meTric up our regression sleeve is the choice of alpha. An elastic net regularization with \(\alpha\) between 0 and 1 might still improve the area under the curve. In increments of 0.05, a model is built for each value of alpha with its \(\lambda_{min}\), and the area under the curve is calculated. Looks like the lasso model was already performing the best.
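One possible way to run that sweep, reusing the placeholder matrices from the lasso sketch; the grid and the use of test AUC as the comparison metric follow the description above, but the exact code is an assumption.

# Sweep alpha from 0 (ridge) to 1 (lasso) in steps of 0.05
alpha_grid <- seq(0, 1, by = 0.05)

alpha_auc <- sapply(alpha_grid, function(a) {
  cv_fit <- cv.glmnet(x_train, data_train$attrition,
                      family = "binomial", alpha = a, nfolds = 10)
  probs  <- predict(cv_fit, newx = x_test, s = "lambda.min",
                    type = "response")[, 1]
  as.numeric(roc(data_test$attrition, probs)$auc)
})

data.frame(alpha = alpha_grid, test_auc = round(alpha_auc, 3))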

With the bias-variance tradeoff in mind, the lasso model with \(\lambda_{min}\) is the best choice of the three.

Trees and Forests

Any dataset can be broken down hierarchically into smaller subgroups. The observations can be grouped conditionally based on their values, and those conditions can then be applied to new data to determine which subgroup each new observation belongs to. Decision tree algorithms work on this idea.

The training data is grown into the largest tree possible, essentially giving a perfectly fitted model. The tree is then 'pruned' from the bottom until an acceptable level of error is reached. The bias-variance tradeoff makes an appearance here, determining how many branches to keep so that model error and complexity are balanced for the testing data.

The idea of pruning is a little similar to the penalty in regularized regression, incorporating a cost-complexity parameter (CP). Cross-validation is performed to get the optimal number of branches. To understand the idea better, a basic model is built with 10-fold cross-validation (the default xval = 10) and no pruning (cp = 0).
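A sketch of that basic tree, assuming the placeholder data_train object from earlier.

# Fully grown classification tree: 10-fold cross-validation, no pruning
tree_basic <- rpart(attrition ~ ., data = data_train, method = "class",
                    control = rpart.control(xval = 10, cp = 0))

# Cross-validated error against the complexity parameter
plotcp(tree_basic)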

Oh, this was not expected.

The error is expected to decrease with decreasing CP: the less you prune, the lower the error. But since the data is imbalanced, some branches end up with all 1's, leading to inconsistent errors. This can be rectified by incorporating the cost of false negatives while pruning the tree.
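A sketch of the same tree with the false-negative cost folded in via rpart's loss matrix; the matrix mirrors the parms used in the gif chunk further below, but the exact chunk used at this point is not shown.

# Same tree, but with a loss matrix that makes a false negative 10x as costly
tree_loss <- rpart(attrition ~ ., data = data_train, method = "class",
                   parms = list(split = "information",
                                loss = matrix(c(0, 10, 1, 0), nrow = 2)),
                   control = rpart.control(xval = 10, cp = 0))

plotcp(tree_loss)
printcp(tree_loss)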

That looks more like it! At a CP value of 0.0084, the cross-validated error is low enough, and therefore 32 subgroups, or terminal nodes, give a good, if slightly underfit, model. The horizontal line here is one standard error above the minimal error on the training data. The first point below the line should be chosen as the optimal number of terminal nodes. The actual tree for that value of CP is visualized below.

Time to apply the model and evaluate the performance on the test data.

The results don’t look great. The area under the curve is unusable at 0.7. The CP could be tuned to look for better results. In addition to pruning, decision trees can be further tuned by adjusting the minsplit and maxdepth parameters.


  • minsplit refers to the minimum number of observations that must exist in a node before a split is attempted. The smaller the number, the higher the chances of overfitting.

  • maxdepth sets the maximum depth of the tree, i.e. the number of splits between the root node and a terminal node. The higher the number, the larger the tree, which could lead to overfitting.


With this knowledge, we build a grid of models for different values of minsplit, maxdepth and CP. Since a CP of 0.0051 did not give great results, it is reduced to 0.001 and increased in steps of 0.001, up to 0.01. maxdepth is allowed to range between 5 and 15, around the current 9.
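A hypothetical version of that grid search; the parameter ranges follow the text, while the minsplit values and the evaluation on test AUC are assumptions.

# Hypothetical hypergrid over minsplit, maxdepth and cp
tree_grid <- expand.grid(minsplit = seq(5, 20, by = 5),
                         maxdepth = seq(5, 15, by = 2),
                         cp       = seq(0.001, 0.01, by = 0.001))

tree_grid$test_auc <- apply(tree_grid, 1, function(row) {
  fit <- rpart(attrition ~ ., data = data_train, method = "class",
               parms = list(split = "information",
                            loss = matrix(c(0, 10, 1, 0), nrow = 2)),
               control = rpart.control(minsplit = row["minsplit"],
                                       maxdepth = row["maxdepth"],
                                       cp       = row["cp"],
                                       xval     = 10))
  probs <- predict(fit, data_test, type = "prob")[, 2]   # "quit" class assumed second
  as.numeric(roc(data_test$attrition, probs)$auc)
})

head(tree_grid[order(-tree_grid$test_auc), ], 10)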

The results improve only a little. The area under the curve remains the same for a large range of minsplit values at maxdepth = 5, indicating that shorter trees, i.e. less complex models, have better prediction accuracy. Here's a look at the model with the best performance.

A clear demonstration of the drawback of decision trees: the model depends heavily on the training data and adapts very poorly to anything new. Thus, a more general model (less complex, shorter) fits the testing data better.

To further establish this dependence, let's build more trees, with the only difference between them being the sampled training data. The variable importance of each model is calculated and plotted in the gif below. Variable importance will be covered in detail in later sections, but the gif illustrates the substantial changes in the importance order of the variables and in the AUC across models that differ only in their training data.

# Not evaluated (eval = FALSE): RPubs does not allow hosting gifs.
gif_data <- data.frame()

for (i in 1:25) {
  
  set.seed(NULL)
  train_prop <- runif(1, min = 0.6, max = 0.9)
  gif_data_split <- initial_split(data, prop = train_prop, strata = "attrition")
  gif_data_train <- training(gif_data_split)
  gif_data_test  <- testing(gif_data_split)
  
  gif_rpart_model <- rpart(attrition ~ ., gif_data_train, method = "class",
                           parms = list(split = "information", 
                                        loss = matrix(c(0, 10, 1, 0), nrow = 2)),
                           control = list(xval = 10))
  
  gif_rpart_predict <- predict(gif_rpart_model, gif_data_test, type = "prob")[,2]
  gif_rpart_auc <- round(roc(gif_data_test$attrition, gif_rpart_predict)$auc, 3)
  
  var_imp <- varImp(gif_rpart_model) %>% rownames_to_column("Variable") %>% 
    arrange(desc(Overall)) 
  
  gif_data <- bind_rows(gif_data,
                        data.frame(iteration = i, # scalars recycle across all rows
                                   roc_area_train = gif_rpart_auc,
                                   variable = var_imp[1],
                                   importance = var_imp[2]))
}

gif_data <- gif_data %>% group_by(iteration) %>%
  mutate(ordering = row_number(Overall) * 1) %>% ungroup()

a <- gif_data %>% 
  ggplot(aes(x = ordering, group = Variable)) +
  geom_tile(aes(y = Overall/2, 
                height = Overall,
                width = 0.8), fill = "gray60") +
  geom_text(aes(y = 0, label = Variable), hjust = 0, size = 3) +
  coord_flip(clip = "off", expand = FALSE) + theme_light() +
  geom_tile(aes(y = round(roc_area_train * 300, 2), 
                height = 2,
                width = 4), fill = "black") +
  geom_text(aes(x = 27, y = round(roc_area_train * 300 + 10), 
                label = round(roc_area_train * 100, 2)),
            size = 3) + 
  theme(axis.ticks.y = element_blank(),
        axis.text.y  = element_blank()) +
  labs(x = " ", y = "Variable Importance") +
  transition_states(iteration, 
                    transition_length = 2, state_length = 1) +
  ease_aes("cubic-in-out") + labs(title = 'Iteration: {closest_state}')

animate(a, fps = 30, nframes = 750)

Rpubs has some trouble hosting gifs. Click to view it on giphy.


We modelled multiple trees earlier and observed different results. What if the approach is taken a little further, where the data is bootstrap-sampled into multiple chunks, a tree is grown over each chunk and the results are averaged? That's the idea behind bootstrap aggregating, or bagging. The concept is simple and it solves the decision tree's drawback of poor adaptability to new data. Below is a basic bagging model. We see an immediate jump in the area under the curve.
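A sketch of that basic bagging model with ipred, again using the placeholder split objects.

# Basic bagged trees with ipred (25 bootstrap trees by default)
bag_model <- bagging(attrition ~ ., data = data_train, nbagg = 25, coob = FALSE)

bag_probs <- predict(bag_model, newdata = data_test, type = "prob")[, 2]   # "quit" class assumed second
roc(data_test$attrition, bag_probs)$auc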

Bootstrapping brings with it the 0.632 rule: on average, any bootstrap sample contains only about 63.2% of the unique records. The remaining third of the training data, called the out-of-bag data, can be used to estimate the prediction error within the model by setting coob = TRUE. But since our dataset is imbalanced, this provision is not being used.

The trees are grown over bootstrap-sampled chunks of the data, that is, chunks sampled randomly with replacement. By default, 25 trees are grown; this can be adjusted to look for an improvement in performance. Evidently there is a good increase, reaching levels similar to those of our good old GLM.

To observe how the ensemble learning works, the AUC is plotted against the number of trees and the gradual increase is clearly observed.

The model with the highest AUC is applied to the test data and the probability cutoff is tuned to find the best cost. We observe the specificity going above 90% for the first time and the cost dropping below 0.6. This is the best model so far, even better than the GLM's cost of 0.63.

Bagging is a great idea and it seems to work well too. The performance can be enhanced by introducing another level of randomness. A random forest works very much like bagging, with an additional mtry parameter that randomly restricts the number of variables considered at each split. It essentially forces the trees to differ from one another, coercing all of the variables into playing a part in the model. Setting mtry equal to the number of variables is the same as bagging. The rule of thumb in classification problems is to set mtry to the square root of the number of variables; in our case, that's 5. A simple model is built below using the ranger package. num.trees here refers to the number of bootstrap-sampled trees to be built, the same as nbagg, and for now it's set to 60 trees.
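A sketch of that simple ranger model; the seed and the use of the second probability column as the "quit" class are assumptions.

# Random forest with ranger, grown as a probability forest
rf_model <- ranger(attrition ~ ., data = data_train,
                   num.trees = 60, mtry = 5,
                   probability = TRUE, seed = 123)

# Second probability column assumed to be the "quit" class
rf_probs <- predict(rf_model, data = data_test)$predictions[, 2]
roc(data_test$attrition, rf_probs)$auc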

The result looks pretty good with an AUC of 0.771.

Like every other model so far, it's tuning time. num.trees is set to vary between 10 and 100, the same range we chose for bagging; mtry ranges between 3 and 7; and min.node.size is set between 1 and 5.
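A hypothetical version of that tuning grid; the ranges follow the text, everything else is an assumption.

# Hypothetical tuning grid for the random forest
rf_grid <- expand.grid(num.trees     = seq(10, 100, by = 10),
                       mtry          = 3:7,
                       min.node.size = 1:5)

rf_grid$test_auc <- apply(rf_grid, 1, function(row) {
  fit <- ranger(attrition ~ ., data = data_train,
                num.trees     = row["num.trees"],
                mtry          = row["mtry"],
                min.node.size = row["min.node.size"],
                probability   = TRUE, seed = 123)
  probs <- predict(fit, data = data_test)$predictions[, 2]
  as.numeric(roc(data_test$attrition, probs)$auc)
})

head(rf_grid[order(-rf_grid$test_auc), ], 10)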

The performance gains are significant again, with the AUC crossing the 0.8 mark.

The best model is applied to the test data and the cost is evaluated. The same level of AUC improvement is not reflected in the cost; it remains at roughly the same level as bagging. However, the top 20 results all have a cost below 0.7, while the cost for bagging went up to 0.95. This stability leads us to conclude that random forest is a better model than bagging for our data.

Unlike regression techniques, random forests suffer from a lack of interpretability. The best they can do is tell us the 'importance' of each variable. There are a couple of methods for measuring this importance; the second makes use of the bootstrap out-of-bag samples.

  • Impurity - Measures the decrease in node impurity (the improvement in split quality) contributed by each variable, accumulated across all nodes and trees and averaged. The higher the value, the higher the importance.

  • Permutation - Measures the increase in out-of-bag prediction error when the values of a variable are randomly shuffled. The larger the increase, the higher the importance.


The importance of the different variables, based on both impurity and permutation, is visualized for the best model that was evaluated earlier.
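A sketch of how both measures could be pulled out of ranger; the hyperparameters here are placeholders rather than the tuned best model.

# Variable importance computed both ways (hyperparameters are placeholders)
rf_impurity <- ranger(attrition ~ ., data = data_train, num.trees = 100,
                      mtry = 5, probability = TRUE,
                      importance = "impurity", seed = 123)
rf_permute  <- ranger(attrition ~ ., data = data_train, num.trees = 100,
                      mtry = 5, probability = TRUE,
                      importance = "permutation", seed = 123)

head(sort(rf_impurity$variable.importance, decreasing = TRUE), 10)
head(sort(rf_permute$variable.importance,  decreasing = TRUE), 10)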

Boosting

Gradient Boosting Machines (GBMs) are considered the latest and greatest in the world of machine learning. The tree-based techniques so far have modelled trees independently on random samples and averaged them to reach the optimum. GBMs, however, model trees sequentially, each tree learning from the errors of the weaker trees before it. "Boosting" here refers to iteratively improving the sequence of decision trees; "gradient" refers to the negative gradient of the error, the direction of steepest decrease. Trees are grown sequentially, the error dropping at each step, and growth continues until the error curve flattens out or starts going up. The model is cross-validated at each step to measure the error, and the direction of growth at each iteration is the one that reduces the error the most.

The important parameters in GBM modelling are listed below.

  • n.trees is the maximum number of trees the model can grow.

  • shrinkage is the learning rate at each step. The lower the number, the slower the learning, but with less risk of overshooting an optimum. If it is set very low, n.trees should be high enough to give the model a chance to learn.

  • cv.folds is the number of cross-validation folds.

  • bag.fraction is the fraction of the training data randomly selected to propose the next tree in the expansion.

  • interaction.depth is the maximum depth of each tree. It can be as low as 1.

  • n.minobsinnode is the minimum number of observations in a terminal node.


A basic implementation is shown below, with the maximum number of trees set at 1000, a matching shrinkage of 0.1, and a maximum depth of 3. The interaction between the number of trees and the cross-validation error is shown in the plot below.
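A sketch of that basic GBM; the 1000 trees, 0.1 shrinkage and depth of 3 come from the text, while the 0/1 recoding of the response (gbm's bernoulli distribution expects a numeric 0/1 outcome) and cv.folds = 10 are assumptions.

# gbm's bernoulli distribution needs a 0/1 numeric response
gbm_train <- data_train
gbm_train$attrition <- as.numeric(gbm_train$attrition) - 1   # assumes "No" < "Yes"

gbm_model <- gbm(attrition ~ ., data = gbm_train, distribution = "bernoulli",
                 n.trees = 1000, shrinkage = 0.1, interaction.depth = 3,
                 cv.folds = 10)

# Optimal number of trees according to cross-validation (also draws the error plot)
best_trees <- gbm.perf(gbm_model, method = "cv")

gbm_probs <- predict(gbm_model, newdata = data_test,
                     n.trees = best_trees, type = "response")
roc(data_test$attrition, gbm_probs)$auc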

It appears that the model reaches optimal performance with just 82 trees. Note that this tree count applies only to this specific set of hyperparameters. The model is applied to the test data and the probability cutoff is tuned. The results look pretty good.

The hyperparameters can be tuned to improve performance. Although we observed peak performance at 82 trees, we continue to work with 1000 trees since lower shrinkage values are being tuned. The advantage of building a larger model, owing to the sequential nature of boosting, is that the number of trees can be limited at prediction time, removing the need to refit a smaller model. Shrinkage is adjusted around the initial 0.1 and the interaction depth is allowed to drop to 1. The minimum number of observations in a terminal node is allowed to be higher, as this limits overfitting. The bag fraction is also varied. The top models, sorted by ROC area, are printed below.

The 36-model grid took 21 minutes to run, and this is one of the big drawbacks of GBMs: they are extremely resource intensive. We see small improvements over the original model. The probability cutoff is tuned for the best model and the results are good, but not great.

Hypergrid tuning helped improve the cost, but it still does not outperform the GLM or the random forest. A second weakness of GBM: since it is not possible to pass our cost function to it, it does not perform as well on imbalanced data, and we see a large number of false negatives driving up the cost. There is, however, a huge leap in performance over a solitary decision tree, and we do observe the gradient-descent learning at work. The random forest still performed better, though.

The relative influence of each variable is visualized for the best model.

Summary

A table of the numbers from the best models of the previous sections is presented. It's emphasized again that these models were chosen purely based on the best cost we defined, and that the number of false negatives can be adjusted by altering the threshold.

GLM did good!

Although it did not have the best cost, it was comparable to that of the random forest and the GBM. All of that with a single line of code that took a fraction of a second to fit. Interpretability is easy as well! For a technique with its roots in the 19th century, this is remarkable.

Here comes the "however": the data we worked with was very clean. No processing was required, and the variables were well distributed. Had that not been the case, we would have observed larger differences between model performances. And of course, the random forest did perform better, and the decimal differences between the models could translate to millions in the real world.

So the takeaway from this report is that GLMs are a great place to start modelling the data. They do a pretty good job and give an idea of what to expect, while being extremely easy to implement. For this reason they will continue to be popular and used regularly in the field of data science.



Source files can be found here.

LinkedIn