This is PaperWhy. Our sisyphean endeavour not to drown in the immense Machine Learning literature.

With thousands of papers every month, keeping up with and making sense of recent research in machine learning has become almost impossible. By routinely reviewing and reporting papers we help ourselves and hopefully someone else.

Representational and optimization properties of Deep Residual Networks

Today’s post reviews a recent talk given at the Simons Institute for the Theory of Computing in their current workshop series Computational Challenges in Machine Learning.

tl;dr: Sufficiently regular functions (roughly: having Lipschitz, invertible derivatives) can be represented as compositions of decreasing, small perturbations of the identity. Furthermore, critical points of the quadratic loss for these target functions are proven to be always minima, thus ensuring loss-reducing gradient descent steps. This makes this class of functions “easily” approximable by Deep Residual Networks.

Deep networks are deep compositions of non-linear functions

$h = h_m \circ h_{m - 1} \circ \ldots \circ h_1 .$

Depth provides effective, parsimonious representations of features and nonlinearities provide better rates of approximation (known from representation theory). Even more, shallow representations of some functions are necessarily more complex than deeper ones (no flattening theorems).1 But optimization is hard with many layers: conventional DNNs show increasingly poorer performance at the same task with growing depth, even though they can approximate a strictly larger set of functions.

Deep Residual Networks2 overcome this limitation by introductig skip-paths connecting the input of layer $j$ to the output of layer $j+1$:

$f_{j + 1} (x_j) = w_{j + 1} \sigma (w_j x_j) + x_j,$

where $\sigma$ is typically a ReLU.3

Why? First consider the following fact about linear maps:

Theorem 1:4 Any $A$ with $\det A > 0$ can be written as a product of perturbations of the identity with decreasing norm, i.e.

$A = (I + A_m) \cdots (I + A_1)$

with the spectral norm of each perturbation fulfilling $| A_i | =\mathcal{O} (1 / m)$.

Consider now a linear Gaussian model $y = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N} (0, \sigma^2 I)$. If one sets to find $A_i$ which minimize the quadratic loss

$$\mathbb{E} | (I + A_m) \cdots (I + A_1) x - y |^2,$$

over all $A_{i}$ such that $I + A_{i}$ is near the identity, i.e. among all $|A_{i} | \ll 1$ in some suitable norm, then it can be shown that every stationary point is a global optimum, that is: if one has $\nabla _{A_1, \ldots, A_m} R (A^{\star}_{1}, \ldots, A_{m}^{\star}) = 0$, then one has found the true $A = (I + A^{\star}_{m}) \cdots (I + A^{\star}_{1})$ (according to the model). Note that this a property of stationary points in this region and does not say that one can attain these points by some particular discretization of gradient descent, i.e. this is neither a generalization bound nor a global result.

The goal of the talk is to show that similar staments hold in the non-linear case as well. The main result is (roughly):

Theorem 2: The computation of a “smooth invertible map” $h$ can be spread throughout a deep network

$h = h_m \circ h_{m - 1} \circ \ldots \circ h_1,$

so that all layers compute near-identity functions:

$| h_i - Id |_L =\mathcal{O} \left( \frac{\log m}{m} \right).$

Where the $| f |_L$ semi-norm is the optimal Lipschitz constant of $f$ and “smoothly invertible” means invertible and differentiable, with Lipschitz derivative and inverse

How does this relate to DRNs? Recall that a standard unit of the kind $A_i \sigma (B_i x)$ is known to be a universal approximator so any of the $h_i$ in the decomposition of Theorem 2 can be approximated by units of DRNs of the form

$\hat{h}_i(x) = x + A_i \sigma (B_i x),$

with the perturbations “becoming flatter and flatter” with $i$. This means that DRNs allow for compositional representation of functions where terms are increasingly “flat” as one goes deeper. Note however that this is only proved for functions which are invertible and differentiable, with Lipshitz derivative and inverse.

The functions $h_i$ of the theorem can be explicitly constructed via adequate scalings $a_1, \dots, a_m \in \mathbb{R}$ such that:

$h_1 (x) = h (a_1 x) / a_1, h_2 (h_1 (x)) = h (a_2 x) / a_2, \ldots, h_m \circ \cdots \circ h_1 (x) = h (a_m x) / a_m,$

and the $a_i$ small enough that $h_i \approx Id$.

Analogously to the linear case, for the class of functions which may be represented as such nested, near-identity compositions of maps, stationary points of the risk function

$Q (h) = \frac{1}{2} \mathbb{E} | h (X) - Y |_2^2$

are global minima.5 Then

Theorem 3: for any function

$h = h_m \circ h_{m - 1} \circ \ldots \circ h_1,$

where $| h_i - Id |_L \leqslant \varepsilon < 1$, it holds that for all $i$:

$| D_{h_i} Q (h) | \geqslant \frac{(1 - \varepsilon)^{m - 1}}{| h - h^{\ast} |} (Q (h) - Q (h^{\ast})) .$

This means that if we start with any $h$ in this class of functions near the identity which is suboptimal (i.e. $Q (h) - Q (h^{\ast}) > 0$), then the (Fréchet) gradient is bounded below and a gradient descent step can be taken to improve the risk.

Note that this is in the whole space of such nested functions, not in the particular parameter space of some instance $\hat{h}$: it can happen that the optimal direction in the whole space is “orthogonal” to the whole subspace allowed by changing weights in the layers of $\hat{h}$. Or it can (and will!) happen that there are local minima among all possible parametrizations of any layer $\hat{h}$. The following statement remains a bit unclear to me:

We should expect suboptimal stationary points in the ReLU or sigmoid parameter space, but these cannot arise because of interactions between parameters in different layers; they arise only within a layer.

Basically: if we are able to optimize in the space of architectures, we should always be able to improve performance (assuming invertible $h$ with Lipschitz derivative and so on).

1. See e.g. .
2. .
3. .
4. .
5. Recall that the minimizer of the risk with quadratic loss is the $L^2$ projection, i.e. the conditional expectation $h^{\ast} (x) = \mathbb{E} [Y|X = x]$.