Usually, tree-based models are better suited for tabular data than deep learning algorithms. One way to explain this is through the Manifold Hypothesis: high-dimensional data often lies on a lower-dimensional surface, so it can be represented in fewer dimensions. Think principal components. If the data is donut shaped in 3-D space, it can be squeezed into 2-D. Looking at the figure below, one can be confident that a tree model would classify it very well.
However, if the data points form interlocked topologies, like the linked doughnuts in the figure below, they cannot be bent into lower dimensions without tearing them apart. This is where wide and/or deep neural networks excel, which makes them extremely popular in image and text classification.
The easiest way to understand how Artificial Neural Networks function is to understand their original inspiration, the biological neuron. Artificial neurons accept inputs, weight and combine them in hidden layers, and estimate an output.
A single neuron is essentially a linear model passed through an activation function. A network of neurons working together can significantly outperform a linear model: by chaining neurons, the model gains the ability to capture non-linearities in the dataset.
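To make the linear-model analogy concrete, here is a minimal sketch in base R of a single neuron with a logistic activation; the inputs, weights and bias are made-up values for illustration.

```r
# A single neuron: a weighted sum of inputs plus a bias,
# squashed through a logistic activation
neuron <- function(x, w, b) {
  1 / (1 + exp(-(sum(w * x) + b)))
}

x <- c(0.5, -1.2, 3.0)   # hypothetical inputs
w <- c(0.8, 0.1, -0.4)   # hypothetical weights
b <- 0.2                 # bias

neuron(x, w, b)
# equivalent to the logistic regression link: plogis(sum(w * x) + b)
```

Strip the activation and what remains, `sum(w * x) + b`, is exactly a linear model; the activation is what lets chained neurons bend decision boundaries.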
R has several implementations of neural networks, the most commonly used being nnet and neuralnet. They perform similarly, but neuralnet allows multiple hidden layers, while nnet allows weight regularization. To understand what that means, let's first walk through the workings of a two-layer network with \(h\) and \(q\) neurons in its hidden layers.
In the figure above, \(x_i\) are the input columns passed to the first hidden layer. Each of the \(h\) neurons in that layer multiplies them by randomly initialized weights \(w_{ih}\) and adds a bias \(b_i\). The outputs \(z_i\) are then passed as inputs to the second hidden layer, which has \(q\) sets of weights and biases. This step is called the forward pass, and the outputs \(y_i\) are recorded. The recorded outputs are compared against the actual values and the error is calculated. The process then repeats, only this time the weights are adjusted to reduce the error. This is called backpropagation. It continues until an acceptable level of error is reached.
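The forward pass described above can be sketched in base R for a toy network with \(h = 2\) and \(q = 2\) hidden neurons; all the weights here are random placeholders, not fitted values.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(42)
x <- c(0.2, 0.7, 0.5)                    # three input columns

# First hidden layer: h = 2 neurons, weights w_ih plus biases
W1 <- matrix(rnorm(2 * 3), nrow = 2)
b1 <- rnorm(2)
z  <- sigmoid(W1 %*% x + b1)             # layer outputs z_i

# Second hidden layer: q = 2 neurons
W2 <- matrix(rnorm(2 * 2), nrow = 2)
b2 <- rnorm(2)
a  <- sigmoid(W2 %*% z + b2)

# Output neuron
w_out <- rnorm(2)
b_out <- rnorm(1)
y_hat <- sigmoid(sum(w_out * a) + b_out)

# Error against a known label; backpropagation would adjust
# W1, W2 and w_out to shrink this on the next pass
y <- 1
(y - y_hat)^2
```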
A higher number of neurons in a layer helps capture the non-linearities in the dataset. A common rule of thumb is to set it to 2/3 the size of the input layer, plus the size of the output layer.
A higher number of hidden layers helps distinguish increasingly complex features of the input, as in image recognition, where the objective is to detect edges, shapes and eventually whole objects. However, the usefulness of multiple hidden layers largely ends there. For most tabular classification problems, the Universal Approximation Theorem states that a single hidden layer with enough neurons can approximate any continuous function.
Apart from the number of layers and neurons, there are two other parameters that can be controlled in a neural network.
The first is the learning rate, which controls how fast the model converges during backpropagation. A slow learning rate means the model takes longer to converge, but it is less likely to skip over a local minimum. On the other hand, a fast learning rate might not converge at all, or might even diverge.
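The effect is easy to demonstrate with plain gradient descent on \(f(x) = x^2\); the step sizes below are arbitrary choices for illustration.

```r
# Gradient descent on f(x) = x^2, whose gradient is 2x
descend <- function(lr, steps = 30, x = 1) {
  for (i in seq_len(steps)) x <- x - lr * 2 * x
  x
}

descend(0.4)   # modest step: converges toward the minimum at 0
descend(1.1)   # step too large: the iterates oscillate and blow up
```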
The second is the weight decay. This is a regularization term added to the loss during weight estimation that forces the weights to be as small as possible. It plays the same role as \(\lambda\) in a ridge regression, since the penalty is on the squared weights.
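A small sketch of what the decay term does to the loss; the weights, base loss and \(\lambda\) below are all hypothetical numbers.

```r
w      <- c(1.5, -0.8, 0.3)   # hypothetical network weights
lambda <- 0.01                # the decay parameter

base_loss      <- 0.42        # made-up unpenalized error
penalized_loss <- base_loss + lambda * sum(w^2)   # L2 penalty on the weights
penalized_loss
```

Because the penalty grows with the squared weights, minimizing the penalized loss pulls every weight toward zero unless it earns its keep by reducing the error.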
In this post, we aim to predict bankruptcies through ANNs. Although this isn't their best use case, the post serves as a tutorial for applying the models to tabular data and observing how performance changes as the hyperparameters are altered. Thus both the nnet and neuralnet implementations are employed.
The following libraries are required to reproduce the output.
library(tidyverse) #The gas to the caR
library(rsample) #Sampling data into train and test
library(caret) #Classification Tools
library(pROC) #Calculate ROC area
library(plotly) #Interactive plots
library(nnet) #Neural Network - allows weight regularization
library(neuralnet) #Neural Network - allows multiple hidden layers
library(knitr) #Generate HTML doc
library(prettydoc) #HTML doc theme
The bankruptcy dataset used in this post contains 10 predictor variables that indicate whether or not an entity went bankrupt. dlrsn is the binary response variable and there are 5436 records. The list of variables, along with a description and summary statistics, is presented below.
Variable | Description | Min. | Max. | Mean | Median |
---|---|---|---|---|---|
dlrsn | Indicates bankruptcy | 0.00 | 1.00 | NA | NA |
r1 | Working Capital/Total Asset | -4.38 | 2.02 | -0.24 | -0.22 |
r2 | Retained Earning/Total Asset | -2.24 | 1.49 | -0.29 | 0.13 |
r3 | Earning Before Interest & Tax/Total Asset | -2.06 | 2.14 | -0.24 | 0.07 |
r4 | Market Capital / Total Liability | -0.43 | 6.70 | 0.24 | -0.31 |
r5 | SALE/Total Asset | -1.36 | 4.04 | -0.13 | -0.35 |
r6 | Total Liability/Total Asset | -1.51 | 5.11 | 0.20 | 0.00 |
r7 | Current Asset/Current Liability | -1.23 | 2.88 | -0.10 | -0.43 |
r8 | Net Income/Total Asset | -2.21 | 2.00 | -0.23 | 0.21 |
r9 | LOG(Sale) | -2.76 | 2.18 | 0.03 | 0.06 |
r10 | LOG(Market Cap) | -2.21 | 2.48 | 0.18 | 0.12 |
The columns of the data are scaled to a [0, 1] range before being fed into the neural network models. The data is then split into 75/25 training and testing sets.
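A sketch of the preprocessing, with a small simulated data frame standing in for the real dataset; base R's `sample.int` is used for the split here, though `rsample::initial_split(prop = 0.75)` from the loaded libraries does the same job.

```r
# Min-max scaling to the [0, 1] range
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy stand-in for the real bankruptcy data
set.seed(7)
bankruptcy <- data.frame(dlrsn = rbinom(100, 1, 0.3),
                         r1    = rnorm(100),
                         r2    = rnorm(100))

scaled <- as.data.frame(lapply(bankruptcy, rescale01))

# 75/25 train/test split
idx        <- sample.int(nrow(scaled), size = 0.75 * nrow(scaled))
data_train <- scaled[idx, ]
data_test  <- scaled[-idx, ]
```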
The neuralnet model implemented below has one neuron in the first hidden layer and two in the second. This is set through the hidden parameter. stepmax is the maximum number of iterations the model is allowed to run. The activation function is set to logistic, and setting linear.output to FALSE indicates that we are running a classification rather than a regression. The fitted model, with its weights and biases, is plotted below.
neuralnet_model <- neuralnet::neuralnet(dlrsn ~ ., data = data_train, stepmax = 1e6,
linear.output = FALSE, act.fct = "logistic",
hidden = c(1, 2))
plot(neuralnet_model)
The change in model performance for different numbers of layers and neurons can be observed through the AUC values in the chart below.
The first three models have one layer with one to three neurons. The next three models have two layers, with one to two neurons in each layer.
Looking at the results on the testing data, adding neurons seems to help performance up to a point: beyond three neurons the model starts overfitting and the testing AUC declines. Adding more layers, as expected, makes little difference.
As already discussed, the advantage of nnet over neuralnet is that it allows weight regularization, which helps avoid overfitting to an extent. Unlike cv.glmnet, which cross-validates to estimate the best \(\lambda\), nnet expects a decay value to be passed manually. The initial model below shows the implementation with three neurons.
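Since the original code chunk isn't reproduced in this excerpt, here is a sketch of what an nnet call with three neurons and a manually chosen decay might look like; the decay value of 0.01 and maxit are illustrative, and data_train is the scaled training set from earlier.

```r
# size sets the (single) hidden layer's neuron count,
# decay is the L2 regularization strength
nnet_model <- nnet::nnet(dlrsn ~ ., data = data_train,
                         size  = 3, decay = 0.01,
                         maxit = 500, trace = FALSE)
```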
The decay values are set between 0 and ~69, each value 1.2 times the previous, as shown in the code below.
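The exact grid isn't reproduced in this excerpt, but one construction consistent with the description (101 values, from 0 up to roughly 69, growing geometrically by a factor of 1.2) would be:

```r
# 0, then a geometric sequence 1e-6 * 1.2^0 ... 1e-6 * 1.2^99,
# which tops out around 69
decay_grid <- c(0, 1e-6 * 1.2^(0:99))

length(decay_grid)
max(decay_grid)
```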
For each of the 101 decay values, five models are built with one to five neurons. Each model is 5-fold cross-validated on the whole dataset and the average AUC is calculated.
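This grid search maps naturally onto caret's train; the sketch below is one way to set it up, assuming the scaled data frame `data`, a hypothetical vector `decay_values` holding the 101 decay values, and dlrsn recoded as a two-level factor with valid R level names (twoClassSummary needs those to compute class probabilities).

```r
library(caret)

# 5-fold cross-validation, scoring each fit by ROC area
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# 1 to 5 hidden neurons crossed with every decay value
grid <- expand.grid(size  = 1:5,
                    decay = decay_values)

tuned <- train(dlrsn ~ ., data = data,
               method = "nnet", metric = "ROC",
               trControl = ctrl, tuneGrid = grid,
               trace = FALSE)
```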
The decline in performance with increasing regularization can be observed in the plot above. Also evident is that performance improves as the number of neurons in the hidden layer increases.