As we design ever deeper networks it becomes imperative to understand how adding layers can increase the complexity and expressiveness of the network. Even more important is the ability to design networks where adding layers makes networks strictly more expressive rather than just different. To make some progress we need a bit of mathematics.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
```

```python
from mxnet import init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()
```

```python
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
```

```python
import tensorflow as tf
from d2l import tensorflow as d2l
```

## 8.6.1. Function Classes

Consider \(\mathcal{F}\), the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach. That is, for all \(f \in \mathcal{F}\) there exists some set of parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset. Let’s assume that \(f^*\) is the “truth” function that we really would like to find. If it is in \(\mathcal{F}\), we are in good shape, but typically we will not be quite so lucky. Instead, we will try to find some \(f^*_\mathcal{F}\) which is our best bet within \(\mathcal{F}\). For instance, given a dataset with features \(\mathbf{X}\) and labels \(\mathbf{y}\), we might try finding it by solving the following optimization problem:

\[f^*_\mathcal{F} \stackrel{\textrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \textrm{ subject to } f \in \mathcal{F}. \tag{8.6.1}\]

We know that regularization (Morozov, 1984, Tikhonov and Arsenin, 1977) may control the complexity of \(\mathcal{F}\) and achieve consistency, so a larger size of training data generally leads to better \(f^*_\mathcal{F}\). It is only reasonable to assume that if we design a different and more powerful architecture \(\mathcal{F}'\) we should arrive at a better outcome. In other words, we would expect that \(f^*_{\mathcal{F}'}\) is “better” than \(f^*_{\mathcal{F}}\). However, if \(\mathcal{F} \not\subseteq \mathcal{F}'\) there is no guarantee that this should even happen. In fact, \(f^*_{\mathcal{F}'}\) might well be worse. As illustrated by Fig. 8.6.1, for non-nested function classes, a larger function class does not always move closer to the “truth” function \(f^*\). For instance, on the left of Fig. 8.6.1, though \(\mathcal{F}_3\) is closer to \(f^*\) than \(\mathcal{F}_1\), \(\mathcal{F}_6\) moves away and there is no guarantee that further increasing the complexity can reduce the distance from \(f^*\). With nested function classes, where \(\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_6\) on the right of Fig. 8.6.1, we can avoid the aforementioned issue of the non-nested function classes.

Fig. 8.6.1 For non-nested function classes, a larger (indicated by area) function class does not guarantee we will get closer to the “truth” function (\(\mathit{f}^*\)). This does not happen in nested function classes.

Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network. For deep neural networks, if we can train the newly-added layer into an identity function \(f(\mathbf{x}) = \mathbf{x}\), the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.
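To make the nesting argument concrete, here is a minimal numerical sketch (not from the original text, using polynomial regression as a stand-in for network capacity): degree-\(d\) polynomials are contained in degree-\(d+1\) polynomials, so the best achievable training loss can never increase as the class grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=50)

# Polynomial classes are nested: every degree-d polynomial is also a
# degree-(d + 1) polynomial, so the optimal training loss is monotone.
losses = []
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    residuals = np.polyval(coeffs, x) - y
    losses.append(float(np.mean(residuals ** 2)))

# Each larger class fits the training data at least as well.
assert all(l2 <= l1 + 1e-9 for l1, l2 in zip(losses, losses[1:]))
```

No such monotonicity holds for non-nested classes, which is exactly the failure mode illustrated on the left of Fig. 8.6.1.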

This is the question that He *et al.* (2016) considered when working on very deep computer vision models. At the heart of their proposed *residual network* (*ResNet*) is the idea that every additional layer should more easily contain the identity function as one of its elements. These considerations are rather profound but they led to a surprisingly simple solution, a *residual block*. With it, ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015. The design had a profound influence on how to build deep neural networks. For instance, residual blocks have been added to recurrent networks (Kim et al., 2017, Prakash et al., 2016). Likewise, Transformers (Vaswani et al., 2017) use them to stack many layers of networks efficiently. The same idea is used in graph neural networks (Kipf and Welling, 2016) and, as a basic concept, it has been used extensively in computer vision (Redmon and Farhadi, 2018, Ren et al., 2015). Note that residual networks are predated by highway networks (Srivastava et al., 2015) that share some of the motivation, albeit without the elegant parametrization around the identity function.

## 8.6.2. Residual Blocks

Let’s focus on a local part of a neural network, as depicted in Fig. 8.6.2. Denote the input by \(\mathbf{x}\). We assume that \(f(\mathbf{x})\), the desired underlying mapping we want to obtain by learning, is to be used as input to the activation function on the top. On the left, the portion within the dotted-line box must directly learn \(f(\mathbf{x})\). On the right, the portion within the dotted-line box needs to learn the *residual mapping* \(g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}\), which is how the residual block derives its name. If the identity mapping \(f(\mathbf{x}) = \mathbf{x}\) is the desired underlying mapping, the residual mapping amounts to \(g(\mathbf{x}) = 0\) and it is thus easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., a fully connected layer or a convolutional layer) within the dotted-line box to zero. The right figure illustrates the *residual block* of ResNet, where the solid line carrying the layer input \(\mathbf{x}\) to the addition operator is called a *residual connection* (or *shortcut connection*). With residual blocks, inputs can forward propagate faster through the residual connections across layers. In fact, the residual block can be thought of as a special case of the multi-branch Inception block: it has two branches, one of which is the identity mapping.

Fig. 8.6.2 In a regular block (left), the portion within the dotted-line box must directly learn the mapping \(\mathit{f}(\mathbf{x})\). In a residual block (right), the portion within the dotted-line box needs to learn the residual mapping \(\mathit{g}(\mathbf{x}) = \mathit{f}(\mathbf{x}) - \mathbf{x}\), making the identity mapping \(\mathit{f}(\mathbf{x}) = \mathbf{x}\) easier to learn.
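To see concretely why the identity is easy to reach, consider this minimal sketch (an illustration added here, not part of the original text): a residual branch \(g\) realized by a single linear layer whose weights and biases have been pushed to zero, so that \(g(\mathbf{x}) + \mathbf{x}\) reduces to the identity.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 8)

# Residual branch g: a single weight layer with parameters pushed to zero.
g = nn.Linear(8, 8)
nn.init.zeros_(g.weight)
nn.init.zeros_(g.bias)

# The block computes g(x) + x, which is exactly the identity when g(x) = 0.
y = g(x) + x
assert torch.allclose(y, x)
```

Reaching \(f(\mathbf{x}) = \mathbf{x}\) thus only requires driving one layer’s parameters to zero, rather than fitting an identity mapping through a stack of nonlinear layers.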

ResNet follows VGG’s full \(3\times 3\) convolutional layer design. The residual block has two \(3\times 3\) convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation function. This kind of design requires that the output of the two convolutional layers be of the same shape as the input, so that they can be added together. If we want to change the number of channels, we need to introduce an additional \(1\times 1\) convolutional layer to transform the input into the desired shape for the addition operation. Let’s have a look at the code below.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
class Residual(nn.Module):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)
```

```python
class Residual(nn.Block):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return npx.relu(Y + X)
```

```python
class Residual(nn.Module):  #@save
    """The Residual block of ResNet models."""
    num_channels: int
    use_1x1conv: bool = False
    strides: tuple = (1, 1)
    training: bool = True

    def setup(self):
        self.conv1 = nn.Conv(self.num_channels, kernel_size=(3, 3),
                             padding='same', strides=self.strides)
        self.conv2 = nn.Conv(self.num_channels, kernel_size=(3, 3),
                             padding='same')
        if self.use_1x1conv:
            self.conv3 = nn.Conv(self.num_channels, kernel_size=(1, 1),
                                 strides=self.strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm(not self.training)
        self.bn2 = nn.BatchNorm(not self.training)

    def __call__(self, X):
        Y = nn.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return nn.relu(Y)
```

```python
class Residual(tf.keras.Model):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, padding='same',
                                            kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                            padding='same')
        self.conv3 = None
        if use_1x1conv:
            self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                                strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return tf.keras.activations.relu(Y)
```

This code generates two types of networks: one where we add the input to the output before applying the ReLU nonlinearity whenever `use_1x1conv=False`; and one where we adjust channels and resolution by means of a \(1 \times 1\) convolution before adding. Fig. 8.6.3 illustrates this.

Fig. 8.6.3 ResNet block with and without \(1 \times 1\) convolution, which transforms the input into the desired shape for the addition operation.

Now let’s look at a situation where the input and output are of the same shape, where the \(1 \times 1\) convolution is not needed.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape
```

```
torch.Size([4, 3, 6, 6])
```

```python
blk = Residual(3)
blk.initialize()
X = np.random.randn(4, 3, 6, 6)
blk(X).shape
```


```
(4, 3, 6, 6)
```

```python
blk = Residual(3)
X = jax.random.normal(d2l.get_key(), (4, 6, 6, 3))
blk.init_with_output(d2l.get_key(), X)[0].shape
```

```
(4, 6, 6, 3)
```

```python
blk = Residual(3)
X = tf.random.normal((4, 6, 6, 3))
Y = blk(X)
Y.shape
```

```
TensorShape([4, 6, 6, 3])
```

We also have the option to halve the output height and width while increasing the number of output channels. In this case we use \(1 \times 1\) convolutions via `use_1x1conv=True`. This comes in handy at the beginning of each ResNet block to reduce the spatial dimensionality via `strides=2`.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
```

```
torch.Size([4, 6, 3, 3])
```

```python
blk = Residual(6, use_1x1conv=True, strides=2)
blk.initialize()
blk(X).shape
```

```
(4, 6, 3, 3)
```

```python
blk = Residual(6, use_1x1conv=True, strides=(2, 2))
blk.init_with_output(d2l.get_key(), X)[0].shape
```

```
(4, 3, 3, 6)
```

```python
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
```

```
TensorShape([4, 3, 3, 6])
```

## 8.6.3. ResNet Model

The first two layers of ResNet are the same as those of the GoogLeNet we described before: the \(7\times 7\) convolutional layer with 64 output channels and a stride of 2 is followed by the \(3\times 3\) max-pooling layer with a stride of 2. The difference is the batch normalization layer added after each convolutional layer in ResNet.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
```

```python
class ResNet(d2l.Classifier):
    def b1(self):
        net = nn.Sequential()
        net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
                nn.BatchNorm(), nn.Activation('relu'),
                nn.MaxPool2D(pool_size=3, strides=2, padding=1))
        return net
```

```python
class ResNet(d2l.Classifier):
    arch: tuple
    lr: float = 0.1
    num_classes: int = 10
    training: bool = True

    def setup(self):
        self.net = self.create_net()

    def b1(self):
        return nn.Sequential([
            nn.Conv(64, kernel_size=(7, 7), strides=(2, 2), padding='same'),
            nn.BatchNorm(not self.training),
            nn.relu,
            lambda x: nn.max_pool(x, window_shape=(3, 3),
                                  strides=(2, 2), padding='same')])
```

```python
class ResNet(d2l.Classifier):
    def b1(self):
        return tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(64, kernel_size=7, strides=2,
                                   padding='same'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation('relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2,
                                      padding='same')])
```

GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules made up of residual blocks, each of which uses several residual blocks with the same number of output channels. The number of channels in the first module is the same as the number of input channels. Since a max-pooling layer with a stride of 2 has already been used, it is not necessary to reduce the height and width. In the first residual block for each of the subsequent modules, the number of channels is doubled compared with that of the previous module, and the height and width are halved.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)
```

```python
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
```

```python
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True,
                                strides=(2, 2), training=self.training))
        else:
            blk.append(Residual(num_channels, training=self.training))
    return nn.Sequential(blk)
```

```python
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = tf.keras.models.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
```

Then, we add all the modules to ResNet. Here, two residual blocks are used for each module. Lastly, just like GoogLeNet, we add a global average pooling layer, followed by the fully connected layer output.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
```

```python
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential()
    self.net.add(self.b1())
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    self.net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    self.net.initialize(init.Xavier())
```

```python
@d2l.add_to_class(ResNet)
def create_net(self):
    net = nn.Sequential([self.b1()])
    for i, b in enumerate(self.arch):
        net.layers.extend([self.block(*b, first_block=(i==0))])
    net.layers.extend([nn.Sequential([
        # Flax does not provide a GlobalAvg2D layer
        lambda x: nn.avg_pool(x, window_shape=x.shape[1:3],
                              strides=x.shape[1:3], padding='valid'),
        lambda x: x.reshape((x.shape[0], -1)),
        nn.Dense(self.num_classes)])])
    return net
```

```python
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = tf.keras.models.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    self.net.add(tf.keras.models.Sequential([
        tf.keras.layers.GlobalAvgPool2D(),
        tf.keras.layers.Dense(units=num_classes)]))
```

There are four convolutional layers in each module (excluding the \(1\times 1\) convolutional layer). Together with the first \(7\times 7\) convolutional layer and the final fully connected layer, there are 18 layers in total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers of channels and residual blocks in the module, we can create different ResNet models, such as the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of GoogLeNet, ResNet’s structure is simpler and easier to modify. All these factors have resulted in the rapid and widespread use of ResNet. Fig. 8.6.4 depicts the full ResNet-18.

Fig. 8.6.4 The ResNet-18 architecture.
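The layer count above can be checked with a line of arithmetic (a small sketch added here for illustration):

```python
# ResNet-18 layer count: 4 modules, each with 2 residual blocks of
# 2 convolutions each (the optional 1x1 shortcut convolutions are
# excluded), plus the initial 7x7 convolution and the final
# fully connected layer.
convs_per_module = 2 * 2
num_modules = 4
stem_conv, final_fc = 1, 1
total = num_modules * convs_per_module + stem_conv + final_fc
print(total)  # 18
```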

Before training ResNet, let’s observe how the input shape changes across different modules in ResNet. As in all the previous architectures, the resolution decreases while the number of channels increases, up until the point where a global average pooling layer aggregates all features.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
```

```
Sequential output shape:  torch.Size([1, 64, 24, 24])
Sequential output shape:  torch.Size([1, 64, 24, 24])
Sequential output shape:  torch.Size([1, 128, 12, 12])
Sequential output shape:  torch.Size([1, 256, 6, 6])
Sequential output shape:  torch.Size([1, 512, 3, 3])
Sequential output shape:  torch.Size([1, 10])
```

```python
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
```

```
Sequential output shape:       (1, 64, 24, 24)
Sequential output shape:       (1, 64, 24, 24)
Sequential output shape:       (1, 128, 12, 12)
Sequential output shape:       (1, 256, 6, 6)
Sequential output shape:       (1, 512, 3, 3)
GlobalAvgPool2D output shape:  (1, 512, 1, 1)
Dense output shape:            (1, 10)
```

```python
class ResNet18(ResNet):
    arch: tuple = ((2, 64), (2, 128), (2, 256), (2, 512))
    lr: float = 0.1
    num_classes: int = 10

ResNet18(training=False).layer_summary((1, 96, 96, 1))
```

```
Sequential output shape:  (1, 24, 24, 64)
Sequential output shape:  (1, 24, 24, 64)
Sequential output shape:  (1, 12, 12, 128)
Sequential output shape:  (1, 6, 6, 256)
Sequential output shape:  (1, 3, 3, 512)
Sequential output shape:  (1, 10)
```

```python
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 96, 96, 1))
```

```
Sequential output shape:  (1, 24, 24, 64)
Sequential output shape:  (1, 24, 24, 64)
Sequential output shape:  (1, 12, 12, 128)
Sequential output shape:  (1, 6, 6, 256)
Sequential output shape:  (1, 3, 3, 512)
Sequential output shape:  (1, 10)
```

## 8.6.4. Training

We train ResNet on the Fashion-MNIST dataset, just like before. ResNet is quite a powerful and flexible architecture. The plot capturing training and validation loss illustrates a significant gap between the two curves, with the training loss being considerably lower. For a network of this flexibility, more training data would offer a distinct benefit in closing the gap and improving accuracy.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)
```

```python
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)
```

```python
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)
```

```python
trainer = d2l.Trainer(max_epochs=10)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
with d2l.try_gpu():
    model = ResNet18(lr=0.01)
    trainer.fit(model, data)
```

## 8.6.5. ResNeXt

One of the challenges one encounters in the design of ResNet is the trade-off between nonlinearity and dimensionality within a given block. That is, we could add more nonlinearity by increasing the number of layers, or by increasing the width of the convolutions. An alternative strategy is to increase the number of channels that can carry information between blocks. Unfortunately, the latter comes with a quadratic penalty since the computational cost of ingesting \(c_\textrm{i}\) channels and emitting \(c_\textrm{o}\) channels is proportional to \(\mathcal{O}(c_\textrm{i} \cdot c_\textrm{o})\) (see our discussion in Section 7.4).

We can take some inspiration from the Inception block of Fig. 8.4.1, which has information flowing through the block in separate groups. Applying the idea of multiple independent groups to the ResNet block of Fig. 8.6.3 led to the design of ResNeXt (Xie et al., 2017). Different from the smorgasbord of transformations in Inception, ResNeXt adopts the *same* transformation in all branches, thus minimizing the need for manual tuning of each branch.

Fig. 8.6.5 The ResNeXt block. The use of grouped convolution with \(\mathit{g}\) groups is \(\mathit{g}\) times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels \(\mathit{b}\) is less than \(\mathit{c}\).

Breaking up a convolution from \(c_\textrm{i}\) to \(c_\textrm{o}\) channels into one of \(g\) groups of size \(c_\textrm{i}/g\) generating \(g\) outputs of size \(c_\textrm{o}/g\) is called, quite fittingly, a *grouped convolution*. The computational cost (proportionally) is reduced from \(\mathcal{O}(c_\textrm{i} \cdot c_\textrm{o})\) to \(\mathcal{O}(g \cdot (c_\textrm{i}/g) \cdot (c_\textrm{o}/g)) = \mathcal{O}(c_\textrm{i} \cdot c_\textrm{o} / g)\), i.e., it is \(g\) times faster. Even better, the number of parameters needed to generate the output is also reduced from a \(c_\textrm{i} \times c_\textrm{o}\) matrix to \(g\) smaller matrices of size \((c_\textrm{i}/g) \times (c_\textrm{o}/g)\), again a \(g\)-fold reduction. In what follows we assume that both \(c_\textrm{i}\) and \(c_\textrm{o}\) are divisible by \(g\).
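The parameter saving is easy to verify empirically. The following sketch (with channel counts chosen here purely for illustration) compares a dense and a grouped \(3 \times 3\) convolution in PyTorch:

```python
import torch
from torch import nn

c_i, c_o, g = 32, 32, 4  # illustrative channel counts and group number

dense = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1)
grouped = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1, groups=g)

# Dense weights: c_o x c_i x 3 x 3; grouped: c_o x (c_i / g) x 3 x 3.
n_dense = dense.weight.numel()
n_grouped = grouped.weight.numel()
assert n_dense == g * n_grouped  # a g-fold reduction in weights
```

The same factor-of-\(g\) reduction applies to the multiply-accumulate count, since each output channel only sees \(c_\textrm{i}/g\) input channels.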

The only challenge in this design is that no information is exchanged between the \(g\) groups. The ResNeXt block of Fig. 8.6.5 amends this in two ways: the grouped convolution with a \(3 \times 3\) kernel is sandwiched in between two \(1 \times 1\) convolutions. The second one serves double duty in changing the number of channels back. The benefit is that we only pay the \(\mathcal{O}(c \cdot b)\) cost for \(1 \times 1\) kernels and can make do with an \(\mathcal{O}(b^2 / g)\) cost for \(3 \times 3\) kernels. Similar to the residual block implementation in Section 8.6.2, the residual connection is replaced (thus generalized) by a \(1 \times 1\) convolution.

The right-hand figure in Fig. 8.6.5 provides a much more concise summary of the resulting network block. It will also play a major role in the design of generic modern CNNs in Section 8.8. Note that the idea of grouped convolutions dates back to the implementation of AlexNet (Krizhevsky et al., 2012). When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.

The following implementation of the `ResNeXtBlock` class takes as argument `groups` (\(g\)), with `bot_channels` (\(b\)) intermediate (bottleneck) channels. Lastly, when we need to reduce the height and width of the representation, we add a stride of \(2\) by setting `use_1x1conv=True, strides=2`.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)
```

```python
class ResNeXtBlock(nn.Block):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1, **kwargs):
        super().__init__(**kwargs)
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.Conv2D(bot_channels, kernel_size=1, padding=0,
                               strides=1)
        self.conv2 = nn.Conv2D(bot_channels, kernel_size=3, padding=1,
                               strides=strides,
                               groups=bot_channels//groups)
        self.conv3 = nn.Conv2D(num_channels, kernel_size=1, padding=0,
                               strides=1)
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()
        self.bn3 = nn.BatchNorm()
        if use_1x1conv:
            self.conv4 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
            self.bn4 = nn.BatchNorm()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = npx.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return npx.relu(Y + X)
```

```python
class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    num_channels: int
    groups: int
    bot_mul: int
    use_1x1conv: bool = False
    strides: tuple = (1, 1)
    training: bool = True

    def setup(self):
        bot_channels = int(round(self.num_channels * self.bot_mul))
        self.conv1 = nn.Conv(bot_channels, kernel_size=(1, 1),
                             strides=(1, 1))
        self.conv2 = nn.Conv(bot_channels, kernel_size=(3, 3),
                             strides=self.strides, padding='same',
                             feature_group_count=bot_channels//self.groups)
        self.conv3 = nn.Conv(self.num_channels, kernel_size=(1, 1),
                             strides=(1, 1))
        self.bn1 = nn.BatchNorm(not self.training)
        self.bn2 = nn.BatchNorm(not self.training)
        self.bn3 = nn.BatchNorm(not self.training)
        if self.use_1x1conv:
            self.conv4 = nn.Conv(self.num_channels, kernel_size=(1, 1),
                                 strides=self.strides)
            self.bn4 = nn.BatchNorm(not self.training)
        else:
            self.conv4 = None

    def __call__(self, X):
        Y = nn.relu(self.bn1(self.conv1(X)))
        Y = nn.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return nn.relu(Y + X)
```

```python
class ResNeXtBlock(tf.keras.Model):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = tf.keras.layers.Conv2D(bot_channels, 1, strides=1)
        self.conv2 = tf.keras.layers.Conv2D(bot_channels, 3, strides=strides,
                                            padding="same",
                                            groups=bot_channels//groups)
        self.conv3 = tf.keras.layers.Conv2D(num_channels, 1, strides=1)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.bn3 = tf.keras.layers.BatchNormalization()
        if use_1x1conv:
            self.conv4 = tf.keras.layers.Conv2D(num_channels, 1,
                                                strides=strides)
            self.bn4 = tf.keras.layers.BatchNormalization()
        else:
            self.conv4 = None

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = tf.keras.activations.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return tf.keras.activations.relu(Y + X)
```

Its use is entirely analogous to that of the `Residual` block discussed previously. For instance, when using `use_1x1conv=False, strides=1`, the input and output are of the same shape. Alternatively, setting `use_1x1conv=True, strides=2` halves the output height and width.

*(PyTorch / MXNet / JAX / TensorFlow)*

```python
blk = ResNeXtBlock(32, 16, 1)
X = torch.randn(4, 32, 96, 96)
blk(X).shape
```

```
torch.Size([4, 32, 96, 96])
```

```python
blk = ResNeXtBlock(32, 16, 1)
blk.initialize()
X = np.random.randn(4, 32, 96, 96)
blk(X).shape
```

```
(4, 32, 96, 96)
```

```python
blk = ResNeXtBlock(32, 16, 1)
X = jnp.zeros((4, 96, 96, 32))
blk.init_with_output(d2l.get_key(), X)[0].shape
```

```
(4, 96, 96, 32)
```

```python
blk = ResNeXtBlock(32, 16, 1)
X = tf.random.normal((4, 96, 96, 32))
Y = blk(X)
Y.shape
```

```
TensorShape([4, 96, 96, 32])
```

## 8.6.6. Summary and Discussion

Nested function classes are desirable since they allow us to obtain strictly *more powerful* rather than also subtly *different* function classes when adding capacity. One way of accomplishing this is by letting additional layers simply pass the input through to the output. Residual connections allow for this. As a consequence, this changes the inductive bias from simple functions being of the form \(f(\mathbf{x}) = 0\) to simple functions looking like \(f(\mathbf{x}) = \mathbf{x}\).

The residual mapping can learn the identity function more easily, for example by pushing parameters in the weight layer to zero. We can train an effective *deep* neural network by having residual blocks. Inputs can forward propagate faster through the residual connections across layers. As a consequence, we can train much deeper networks. For instance, the original ResNet paper (He et al., 2016) allowed for up to 152 layers. Another benefit of residual networks is that they allow us to add layers, initialized as the identity function, *during* the training process. After all, the default behavior of a layer is to let the data pass through unchanged. This can accelerate the training of very large networks in some cases.

Prior to residual connections, bypassing paths with gating units were introduced to effectively train highway networks with over 100 layers (Srivastava et al., 2015). Using identity functions as bypassing paths, ResNet performed remarkably well on multiple computer vision tasks. Residual connections had a major influence on the design of subsequent deep neural networks, of either convolutional or sequential nature. As we will introduce later, the Transformer architecture (Vaswani et al., 2017) adopts residual connections (together with other design choices) and is pervasive in areas as diverse as language, vision, speech, and reinforcement learning.

ResNeXt is an example of how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off against the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing grouped convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such “tricks” that lead to more efficient networks. For instance, ShiftNet (Wu et al., 2018) mimics the effects of a \(3 \times 3\) convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.

A common feature of the designs we have discussed so far is that the network design is fairly manual, primarily relying on the ingenuity of the designer to find the “right” network hyperparameters. While clearly feasible, it is also very costly in terms of human time, and there is no guarantee that the outcome is optimal in any sense. In Section 8.8 we will discuss a number of strategies for obtaining high-quality networks in a more automated fashion. In particular, we will review the notion of *network design spaces* that led to the RegNetX/Y models (Radosavovic et al., 2020).

## 8.6.7. Exercises

1. What are the major differences between the Inception block in Fig. 8.4.1 and the residual block? How do they compare in terms of computation, accuracy, and the classes of functions they can describe?
2. Refer to Table 1 in the ResNet paper (He et al., 2016) to implement different variants of the network.
3. For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model complexity. Try to implement it.
4. In subsequent versions of ResNet, the authors changed the “convolution, batch normalization, and activation” structure to the “batch normalization, activation, and convolution” structure. Make this improvement yourself. See Figure 1 in He *et al.* (2016) for details.
5. Why can’t we just increase the complexity of functions without bound, even if the function classes are nested?
