# This is PaperWhy. Our sisyphean endeavour not to drown in the immense Machine Learning literature.

With thousands of papers every month, keeping up with and making sense of recent research in machine learning has become almost impossible. By routinely reviewing and reporting papers we help ourselves and hopefully someone else.

# Deep sparse rectifier neural networks

tl;dr: use ReLUs by default. Don’t pretrain if you have lots of labeled training data, but do in unsupervised settings. Use regularisation on weights / activations. $L_1$ might promote sparsity, ReLUs already do and this seems good if the data itself is.

This seminal paper settled the introduction of ReLUs1 into the neural network community (they had already been used in other contexts, e.g. in RBMs.2

rectifying neurons (…) yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data

Back in the late 2000s, layer-wise pretraining was introduced to prime neural networks, achieving significant improvements in performance.3 The authors used the now classical $\max (0,x)$ function together with

an $L_1$ regularizer on the activation values to promote sparsity and prevent potential numerical problems with unbounded activation,

as well as pre-training using denoising autoencoders and found that:

surprisingly, rectifying activation allows deep networks to achieve their best performance without unsupervised pre-training.

It is however noted that unsupervised pre-training can help when much of the training data is unlabeled (in semi-supervised settings)

[Let’s skip the brain stuff, which is always pretty unconvincing anyway…]

A very interesting feature of ReLUs is that they induce sparse representations because they can be completely off. The authors give four reasons why this is good:

• “Information disentangling”: small perturbations of the input are less likely to shift many weights in the network if many of them are zeroes.
• “Efficient variable-size representation”: Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision.
• “Linear separability”: having many zeroes in a (high-dimensional) representation makes it more easily separable with linear boundaries.
• “Distributed but sparse”: albeit of lower expressivity than dense representations, being still distributed makes them “exponentially better than local ones”.

The ReLU acts like a switch. A unit either works linearly or not at all. This asymmetry might create some issues (see below) but it has the advantage that for any fixed input, the whole network is linear and thus can be seen as an “exponential number of linear models that share parameters”.

This is very reminiscent of the interpretation of Dropout as an extreme form of bagging.4 But perhaps more importantly, gradients flow backwards unhindered in active neurons, thus facilitating learning and alleviating the vanishing gradient problem. This is typically seen as the reason why ReLUs perform so well, brain stuff notwithstanding.

What are some possible disadvantages of ReLUs? Experimental results show that the non-differentiability (“hard saturation”) is not an issue, as long as enough hidden units are non-zero. However, with unbounded activations network weights might grow indefinitely if no regularisation is used and the authors used an $L^1$ penalty term in the cost to promote further sparsity (think of the Lasso and how the $L^1$ norm approximates the “counting zero entries” norm). Finally,

rectifier networks are sub ject to illconditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.

This will be important later in their experiments.

Using ReLUs in autoencoders for unsupervised pre-training presents two main difficulties:

1. The hard thresholding at 0 impedes backpropagation of gradients during reconstruction: Indeed, whenever the network happens to reconstruct a zero in place of a non-zero target, the reconstruction unit can not backpropagate any gradient. In hidden layers this is probably not an issue “because gradients can still flow through the active (non-zero) [gates], possibly helping rather than hurting the assignment of credit”.

2. As already mentioned, regularisation is required to keep weights from growing unchecked.

Four different combinations of modified activations and cost functions are proposed:

 A B C D Activation softplus scaled ReLU + sigmoid reconstr. linear + scaled inputs ReLU + scaled inputs Cost function quadratic cross-entropy quadratic quadratic Regularisation $L\_1$ (act.) $L\_1$ (act.) ? ?

Table 1. Strategies tested. See the paper for definitions. “act.” means “on activations”

The results of the better performing strategies A and B are detailed over four datasets: MNIST, CIFAR10, NISTP and NORB. They used stacked denoising auto-encoders with masking noise as the corruption process and posterior supervised fine-tuning with negative log-likelihood as cost function over a softmaxing of the outputs. Besides the already mentioned $L^1$ penalty on the activations, they used standard minibatch SGD and no other regularisation. An important detail of their setup was:

To take into account the potential problem of rectifier units not being symmetric around 0, we use a variant of the activation function for which half of the units output values are multiplied by -1.

Interestingly, they tried a cost function interpolating between softplus and ReLU and by moving the parameter in its whole range and found that there is no performance gain in using the smoother activation over the ReLU. Therefore, rectifier units are preferable due to their being computationally cheaper and inherently sparse (an average 70~80% of the hidden units inactive in all tests!). Furthermore, pre-training achieved almost no improvement in fully supervised tasks, hinting again at how ReLUs enable more efficient searching of the energy landscape.

However,

In semi-supervised setups (with few labeled data), the pre-training is highly beneficial. But the more the labeled set grows, the closer the models with and without pre-training. Eventually, when all available data is labeled, the two models achieve identical performance.

Finally, tests on textual data for sentiment analysis and review ratings are performed. ReLUs seemed to perform very well and adapt to the inherent sparsity of the data, due to its representation using bag-of-words and binary vectors to encode presence/absence of terms. Pre-training was crucial here and architectures tanh activations were outperformed. The final ReLU networks displayed an average sparsity of around 50%, still much lower that the average of 99.4% zero features in the data, but a significant improvement.

1. The name ReLU was not used in this paper so we are indulging in a bit of an anachronism by using it.
2. Rectified linear units improve restricted boltzmann machines, Nair, V. , Hinton, G. (2010)