# This is PaperWhy. Our sisyphean endeavour not to drown in the immense Machine Learning literature.

With thousands of papers every month, keeping up with and making sense of recent research in machine learning has become almost impossible. By routinely reviewing and reporting papers we help ourselves and hopefully someone else.

# Identity matters in Deep Learning

tl;dr: vanilla residual networks are very good approximators of functions which can be represented as linear perturbations of the identity. In the linear setting, optimization is aided by a benevolent landscape having only minima in certain (interesting) regions. Finally, very simple ResNets can completely learn datasets with $\mathcal{O} (n \log n + \ldots)$ parameters. All this seems to indicate that deep and simple architectures might be enough to achieve great performance.

In general, it is hard for classical deep nets to “preserve features which are good”: initialization with zero mean and small gradients make it hard to learn the identity at any given layer. Even though batchnorm1 seeks to alleviate this issue, it has been residual networks which have most improved upon it.2 In a residual net

(…) each residual layer has the form $x + h (x)$, rather than $h (x)$. This simple reparameterization allows for much deeper architectures largely avoiding the problem of vanishing (or exploding) gradients.

### Identity parametrizations improve optimization

The authors work first in the linear setting, i.e. they consider only networks which are compositions of linear perturbations of the identity:3

$h (x) = (I + A_{l}) \cdots (I + A_{2}) (I + A_{1}) x$

The objective function to mimize is the population risk with quadratic loss:

$f (A) = f (A_{1}, \ldots, A_{l}) := \mathbb{E}_{X, Y} | Y - (I + A_{l}) \cdots (I + A_{1}) X |^2 .$

Labels are assigned with noise, that is $Y = RX + \xi$, with $\xi \sim \mathcal{N} (0, I_{d})$. Note that the problem over the variables $A = (A_{1}, \ldots, A_{l})$ is non-convex.

The first result of the paper states that deep networks with many layers have minima of proportionally low (spectral) norm.4 Because of this, it makes sense to study critical points with small norm, and it turns out that there are only minima. The main result of this section is the following one, where one can think of $\tau$ as being $\mathcal{O} (1 / l)$:

Theorem 1: For every $\tau < 1$, every critical point $A$ of the objective function $f$ with $\underset{1 \leqslant i \leqslant l}{\max} | A_{i} | \leqslant \tau$ is a global minimum.

This is good news since, under the assumption that the model is correct, a “good” (in some sense) optimization algorithm will converge to it.5 The proof is relatively straightforward too: rewrite the risk as the norm of a product by the covariance: $f (A) = | E (A) \Sigma^{1 / 2} |^2 + C$, then, using that $| A_{i} |$ are small, show that if $\nabla f (A) = 0$ this can only be if $E (A) = 0$, where $E$ precisely encodes the condition of being at an optimum: $E (A) = (I + A_{l}) \cdots (I + A_{1}) - R$.

### Identity parametrizations improve representation

The authors consider next non-linear simplified residual networks with each layer of the form

$$\label{eq:residual-unit}\tag{1} h_{j} (x) = x + V_{j} \sigma (U_{j} x + b_{j})$$

where $V_{j}, U_{j} \in \mathbb{R}^{k \times k}$ are weight matrices, $b_{j} \in \mathbb{R}^k$ is a bias vector and $\sigma (z) = \max (0, z)$ is a ReLU activation.6 Note that the layer size is constant. No batchnorm is applied. The problem is $r$-class classification. Assuming (the admittedly natural condition) that all the training data are uniformly separated by a minimal distance $\rho > 0$, they prove that perfectly learning $n$ data points is possible with $n \log n$ parameters:

Theorem 2: There exists a residual network with $\mathcal{O}(n \log n + r^2)$ parameters that perfectly fits the training data.

By choosing the hyperparameter $k \in \mathcal{O} (\log n)$ and $l = \lceil n / k \rceil$ the complexity stated is obtained with a bit of arithmetic. The proof consists then of a somewhat explicit construction of the network. Very roughly, the weight matrices of the hidden layers are chosen as to assign each data point to one of $r$ surrogate label vectors $q_{1}, \ldots, q_{r} \in \mathbb{R}^k$, then the last layer converts these to 1-hot label vectors for output. The main ingredient is therefore the proof that it is possible to map the inputs $x_{i}$ to the surrogate vectors in a way such that the final layer has almost no work left to do.7

This is achieved by showing that:

(…) for an (almost) arbitrary sequence of vectors $x_{1}, \ldots, x_{n}$ there exist [weights $U, V, b$] such that operation (1) transforms $k$ [of them] to an arbitrary set of other $k$ vectors that we can freely choose, and maintains the value of the remaining $n - k$ vectors.

### Experiments

Working on CIFAR10 and CIFAR100, the authors tweaked a standard ResNet architecture in Tensorflow to have constant size $c$ of convolution, no batch norm and smaller weight initialization.8 Several features stand out:

• The last layer is a fixed random projection. Therefore all parameters are in the convolutioms.
• Lack of batchnorm or other regularizers seemed not to lead to serious overfitting, even though the model had $\sim 13.6 \times 10^6$ parameters.
• Both problems where tackled with the same networks and the convolutions where of constant size.

Finally, results on ImageNet were not as bright, though the authors seem confident that this is due to lack of tuning of hyperparameters and learning rate. It would be interesting to find out how true this is and how much harder this tuning becomes by having discarded regularization techniques like dropout or additional data processing.

### Extensions

For further work in the non-linear case (which as of this writing is reported to be work in progress) where the model is no longer to be assumed correct but instead only (mild) regularity assumptions are made, be sure to check out Bartlett’s talk .

1. See .
2. See .
3. For extensions to the non-linear setting see . Note that even though in principle the network can be flattened to only one linear map by taking the product of all $A_{i}$ with no loss in its representational capacity, a great cost in efficiency can be incurred in doing so, see (§G). The dynamics of optimization also are affected by the stacking of purely linear layers, see e.g. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe, A. , McClelland, J. , Ganguli, S. (2013) .
4. Essentially, the theorem states that there exists a constant $\gamma$ depending on the largest and smallest singular values of $R$ such that there exists a global minimum $A^{\star}$ of the population risk fulfilling $\underset{1 \leqslant i \leqslant l}{\max} | A^{\star}_{i} | \leqslant \frac{1}{l} (4 \pi + 3 \gamma)$ whenever the number of layers $l \geqslant 3 \gamma$.
5. Equation (2.3) of the paper provides a lower bound on the gradient which does guarantee convergence under the assumption that iterates don’t jump out of the domain of interest. See the reference in the paper.
6. This is a deliberately simpler setup than the original ResNets with two ReLU activations and two instances of batch normalization. See Identity Mappings in Deep Residual Networks, He, K. , Zhang, X. , Ren, S. , Sun, J. (2016) .
7. An interesting point is the use of the Johsonn-Lindestrauss lemma to ensure that an initial random projection of input data onto $\mathbb{R}^k$ by the first layer does not violate the condition that it remains separated, with high probability.
8. Gaussian with $\sigma \sim (k^{- 2} c^{- 1})$ instead of $\sigma \sim (k^{- 1} c^{- 1 / 2})$. For more on proper initialisation see e.g. On the importance of initialization and momentum in deep learning, Sutskever, I. , Martens, J. , Dahl, G. , Hinton, G. (2013) .