Artificial Neural Networks

Harish M


Why ANNs?

Usually, tree-based models are better suited for tabular data than deep learning algorithms. This can be explained through the Manifold Hypothesis. The idea is that if the observations in a dataset are just points in an n-dimensional space, they can often be represented in fewer dimensions. Think principal components. If the data is doughnut shaped in 3-D space, it can be squeezed into 2-D. Looking at the figure below, one can be confident that a tree model would classify it very well.

However, if the data points form entangled topologies, like the linked doughnuts in the figure below, they cannot be bent into lower dimensions. This is where wide and/or deep neural networks excel, which makes them extremely popular in image and text classification.

Source: colah.github.io

The Basics

The easiest way to understand how Artificial Neural Networks function is to understand their original inspiration: biological neurons. An ANN accepts inputs, adjusts the weights in its hidden layers and estimates the output.

A single neuron is essentially a linear model. A network of neurons functioning together can offer a significant performance improvement over linear models. By chaining neurons, the model gains the ability to map non-linearities in the dataset.

Learning Method and Implementations

R has several implementations of neural networks, the most commonly used being nnet and neuralnet. They perform similarly, but neuralnet allows multiple hidden layers, while nnet supports weight regularization (decay). To understand what that means, let's first walk through the workings of a neural network with two hidden layers of \(h\) and \(q\) neurons.

Source: Wiki

In the figure above, \(x_i\) are the input columns passed to the first hidden layer. Each of the \(h\) neurons in that layer combines them using randomly initialized weights \(w_{ih}\) and a bias \(b_i\). The outputs \(z_i\) are then passed as inputs to the second hidden layer, which has its own \(q\) sets of weights and biases. This step is called the forward pass, and the outputs \(y_i\) are recorded. The recorded outputs are compared against the actual values and the error is calculated. The process then repeats, only this time the weights are adjusted to minimize the error. This is called backpropagation. It continues until an acceptable level of error is reached.
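To make that concrete, here is a minimal sketch of a single forward pass in base R. The dimensions and random weights are made up for illustration; this shows the arithmetic only and is not how nnet or neuralnet work internally.

```r
# Minimal forward-pass sketch for a 2-layer network (illustrative only;
# the dimensions and random values here are assumptions for the example)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x  <- runif(10)                        # one observation with 10 input columns
W1 <- matrix(rnorm(3 * 10), 3, 10)     # weights for h = 3 neurons in layer 1
b1 <- rnorm(3)                         # biases for layer 1
W2 <- matrix(rnorm(2 * 3), 2, 3)       # weights for q = 2 neurons in layer 2
b2 <- rnorm(2)                         # biases for layer 2
w_out <- rnorm(2); b_out <- rnorm(1)   # output layer weights and bias

z1 <- sigmoid(W1 %*% x + b1)           # first hidden layer outputs
z2 <- sigmoid(W2 %*% z1 + b2)          # second hidden layer outputs
y  <- sigmoid(sum(w_out * z2) + b_out) # predicted probability
```

Backpropagation then nudges W1, b1, W2, b2 and the output weights in the direction that reduces the error between y and the observed label, and the pass is repeated.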

Choosing Neurons and Layers

A higher number of neurons in a layer helps capture the non-linearities in the dataset. The rule of thumb is to set it at 2/3 the size of the input layer, plus the size of the output layer.
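As a quick numeric illustration of that rule of thumb (the input and output sizes below are assumptions matching the dataset used later in this post):

```r
# Rule-of-thumb starting point for the number of hidden neurons
n_inputs  <- 10                                    # predictor columns (assumed)
n_outputs <- 1                                     # single binary response (assumed)
n_hidden  <- ceiling(2 / 3 * n_inputs + n_outputs)
n_hidden                                           # 8
```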

A higher number of hidden layers helps distinguish increasingly complex features of the input, as in image recognition, where the objective is to detect edges, then shapes and finally objects. However, the usefulness of multiple hidden layers largely ends there. For most tabular classification problems, the Universal Approximation Theorem states that a single hidden layer, given enough neurons, can approximate any continuous function.

Setting Rate and Decay

Apart from the number of layers and neurons, there are two other parameters that can be controlled in a neural network.

The first is the learning rate, which controls how fast the model converges during backpropagation. A slow learning rate means the model takes longer to converge, but it is less likely to overshoot a minimum. On the other hand, a fast learning rate might not converge at all, or might even diverge.

The second is the weight decay. This is a regularization parameter added to the objective when estimating the weights \(w_i\), forcing them to be as small as possible. It plays the same role as the \(\lambda\) penalty in ridge regression, since it penalizes the sum of squared weights.
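Concretely, with weight decay the weights are estimated by minimizing a penalized error of the form below (the exact error term depends on the implementation):

\[
\min_{w} \; E(w) + \lambda \sum_{i} w_i^2
\]

where a larger \(\lambda\) pushes the weights closer to zero.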


In this post, we aim to predict bankruptcies with ANNs. Although this isn't their best use, it serves as a tutorial for applying these models to tabular data and observing how performance changes as the hyperparameters are altered. Both the nnet and neuralnet implementations are employed.

Data that we are working with

The bankruptcy dataset used in this post contains 10 predictor variables that point to whether or not an entity went bankrupt. dlrsn is the binary response variable and there are 5436 records. The variables, along with descriptions and summary statistics, are presented below.

Variable summary statistics

Variable  Description                                 Min.   Max.   Mean  Median
dlrsn     Indicates bankruptcy                         0.00   1.00     NA      NA
r1        Working Capital/Total Asset                 -4.38   2.02  -0.24   -0.22
r2        Retained Earning/Total Asset                -2.24   1.49  -0.29    0.13
r3        Earning Before Interest & Tax/Total Asset   -2.06   2.14  -0.24    0.07
r4        Market Capital/Total Liability              -0.43   6.70   0.24   -0.31
r5        Sale/Total Asset                            -1.36   4.04  -0.13   -0.35
r6        Total Liability/Total Asset                 -1.51   5.11   0.20    0.00
r7        Current Asset/Current Liability             -1.23   2.88  -0.10   -0.43
r8        Net Income/Total Asset                      -2.21   2.00  -0.23    0.21
r9        LOG(Sale)                                   -2.76   2.18   0.03    0.06
r10       LOG(Market Cap)                             -2.21   2.48   0.18    0.12

The columns of the data are scaled to a [0, 1] range before being fed into a neural network model. The data is split 75/25 into training and testing sets.
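A minimal sketch of that preprocessing, assuming the raw data sits in a data frame called bankruptcy with the columns listed above (the object names, seed and exact split code are assumptions, not the post's original script):

```r
# Scale predictors to [0, 1] and split 75/25 into training and testing sets
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- bankruptcy
scaled[paste0("r", 1:10)] <- lapply(scaled[paste0("r", 1:10)], range01)

set.seed(42)
train_idx <- sample(nrow(scaled), size = floor(0.75 * nrow(scaled)))
train <- scaled[train_idx, ]
test  <- scaled[-train_idx, ]
```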

Modeling neuralnet

The neuralnet model implemented below has one neuron in the first hidden layer and two in the second. This is set through the hidden parameter. stepmax is the maximum number of iterations the model is allowed to run. The activation function is set to logistic, and setting linear.output to FALSE indicates that we are running a classification. The fitted model, with its weights and biases, is printed below.
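The call looks roughly like the sketch below; the explicit formula helper, object names and stepmax value are assumptions, while the remaining hyperparameters mirror the description above.

```r
library(neuralnet)

# neuralnet's formula interface works best with the predictors spelled out
f <- as.formula(paste("dlrsn ~", paste(paste0("r", 1:10), collapse = " + ")))

nn_fit <- neuralnet(f, data = train,
                    hidden        = c(1, 2),     # 1 neuron, then 2 neurons
                    stepmax       = 1e6,         # iteration cap (assumed value)
                    act.fct       = "logistic",  # logistic activation
                    linear.output = FALSE)       # classification output

nn_fit$result.matrix                             # fitted weights and biases
```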

The change in model performance for different numbers of layers and neurons can be observed through the AUC in the chart below.

The first three models have one layer with one to three neurons. The next three models have two layers, with one to two neurons in each layer.
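One way such a comparison could be scripted is sketched below, reusing the f, train and test objects from the sketches above. The exact layer configurations and the use of pROC for AUC are assumptions, not necessarily the post's original code.

```r
library(neuralnet)
library(pROC)

configs <- list(1, 2, 3, c(1, 1), c(1, 2), c(2, 2))   # assumed hidden setups

aucs <- sapply(configs, function(h) {
  fit  <- neuralnet(f, data = train, hidden = h,
                    act.fct = "logistic", linear.output = FALSE)
  prob <- compute(fit, test[paste0("r", 1:10)])$net.result
  as.numeric(auc(test$dlrsn, as.vector(prob)))
})
aucs
```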

Looking at the results on the testing data, adding neurons seems to help performance, but after 3 neurons the model starts overfitting and the testing AUC goes down. As for the number of layers, adding more seems to make little difference, as expected.

Modeling nnet

As already discussed, the advantage of nnet over neuralnet is that it supports weight regularization, which helps avoid overfitting to an extent. Unlike cv.glmnet, which cross-validates and estimates the best \(\lambda\), nnet expects a decay value to be passed manually. The initial model below shows the implementation with three neurons.
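A sketch of that initial fit (the decay and maxit values shown are placeholders, not tuned choices):

```r
library(nnet)

nnet_fit <- nnet(dlrsn ~ ., data = train,
                 size  = 3,       # three neurons in the single hidden layer
                 decay = 0.1,     # weight decay (placeholder value)
                 maxit = 500,     # iteration cap (placeholder value)
                 trace = FALSE)

summary(nnet_fit)                 # fitted weights, layer by layer
```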

The decay values are set to range from 0 to ~69, with each successive value growing by a factor of 1.2, as shown in the code below.

For each of the 101 decay values, 5 models are built with 1 to 5 neurons. Each model is 5-fold cross-validated on the whole dataset and the average AUC is calculated.
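A sketch of how that grid search could be wired up is shown below. The grid construction and fold assignment are plausible reconstructions, not the post's original code; only the overall shape (101 decay values, 1 to 5 neurons, 5-fold cross-validation averaging AUC) follows the text.

```r
library(nnet)
library(pROC)

# 101 decay values: 0 plus a geometric sequence growing by a factor of 1.2
# and reaching roughly 69; the starting value 1e-6 is an assumption
decays <- c(0, 1e-6 * 1.2^(0:99))

set.seed(42)
folds <- sample(rep(1:5, length.out = nrow(scaled)))

cv_auc <- function(size, decay) {
  mean(sapply(1:5, function(k) {
    fit  <- nnet(dlrsn ~ ., data = scaled[folds != k, ],
                 size = size, decay = decay, maxit = 500, trace = FALSE)
    prob <- predict(fit, scaled[folds == k, ])
    as.numeric(auc(scaled$dlrsn[folds == k], as.vector(prob)))
  }))
}

grid     <- expand.grid(size = 1:5, decay = decays)
grid$auc <- mapply(cv_auc, grid$size, grid$decay)
```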

The plot above shows performance dropping as regularization increases. Also evident is that performance improves as the number of neurons in the hidden layer increases.