ResNet — Understand and Implement from scratch

Rohit Modi · Published in Analytics Vidhya · 6 min read · Dec 1, 2021

Anyone who has worked with CNNs has probably come across ResNets, or at least heard of them, and we know that ResNets perform really well on most computer vision tasks. But why was there even a need for an architecture like this when we already had other well-performing architectures? To answer that, let us look at the drawbacks of the deep neural network architectures that were used before ResNets.

Intuitively, the deeper the neural network, the better the performance should be. But when researchers experimented with deeper networks, they found that adding more layers does not always improve performance and can in fact degrade it, largely because of vanishing gradients in very deep networks. The idea behind ResNets is that adding more layers should either improve performance or leave it unchanged, but never make it worse. To achieve this, they came up with the concept of skip connections (residual connections), which prevent the loss of information flow. Let us understand what skip connections are.

A skip/residual connection takes the activations from the (n-1)ᵗʰ convolution layer, adds them to the convolution output of the (n+1)ᵗʰ layer, and then applies ReLU to this sum, thus skipping the nᵗʰ layer.

The diagram below explains how a skip connection works. (Here I am using f(x) to denote ReLU applied to x, where x is the output of a convolution operation.)

[Figure: skip connection diagram]

But how does this even help? Simply put, even if the nᵗʰ layer is not learning anything, we do not lose any information, because at the (n+1)ᵗʰ layer we also use the output of the (n-1)ᵗʰ layer when moving forward, and then apply the activation to this sum.

In other words, the network can effectively skip a layer in between if that layer provides little or no useful information (i.e., its output is essentially 0): the previous layer's information is still carried forward, so performance is maintained. And if both layers do learn something significant, having the earlier information available as well only boosts performance.
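As a minimal PyTorch sketch of this idea (the layer names and sizes here are illustrative, not taken from the article's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv_n      = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # the n-th layer (may learn nothing useful)
conv_n_plus = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # the (n+1)-th layer

x = torch.randn(1, 64, 56, 56)          # activations of the (n-1)-th layer
out = conv_n_plus(F.relu(conv_n(x)))    # plain forward path through layers n and n+1
out = F.relu(out + x)                   # skip connection: add the (n-1)-th activations, then apply ReLU
```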

For the sake of simplicity, we will implement ResNet-18 because it has fewer layers. We will implement it in PyTorch and will use Batch Normalization, Max Pooling, and Dropout layers as well.

Below is the architecture and layer configuration of ResNet-18, taken from the research paper Deep Residual Learning for Image Recognition [Link to the paper].

[Figure: ResNet-18 architecture and layer configuration]

Let us pick the Conv3_x block and walk through what is happening inside it, in terms of Convolution and Identity blocks.

Conv3_x block data flow using Convolution and Identity Blocks


The image above details how a 56x56 image propagates through the Conv3_x block. Now let us look at how the image transforms at each step within these blocks.

The input to the Conv3_x block is a (56x56) image with 64 channels. The first convolution layer transforms it into a (28x28) image with 128 channels, using a 3x3 kernel with a 2x2 stride and 1x1 padding, and applies ReLU to the result. The second convolution layer takes this as input and outputs an image with the same shape (28x28x128). Now, to apply the residual/skip connection, we have to add the output of the Conv2_x block, which is (56x56x64), to the output of the second convolution, which is (28x28x128). To make this possible, we first convert the Conv2_x output to (28x28x128) by applying another convolution with a 1x1 filter and a 2x2 stride, with 64 input channels and 128 output channels.
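A quick way to verify these shapes (a standalone sketch, not the article's code) is to push a dummy tensor through the layers just described:

```python
import torch
import torch.nn as nn

op2 = torch.randn(1, 64, 56, 56)                                  # output of Conv2_x: 56x56 with 64 channels

conv1 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)    # first conv of Conv3_x (stride 2 halves the size)
conv2 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)   # second conv of Conv3_x (keeps the size)
proj  = nn.Conv2d(64, 128, kernel_size=1, stride=2)               # 1x1 conv that resizes op2 for the addition

x = conv2(torch.relu(conv1(op2)))
print(x.shape)              # torch.Size([1, 128, 28, 28])
print(proj(op2).shape)      # torch.Size([1, 128, 28, 28]) -> the two tensors can now be added
```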

Simply put, the Convolution block is responsible for transforming the output of one block with a convolution operation so that it can effectively be added to the output of another convolution block.

After the addition, the activation (ReLU) is applied to the result, which is then sent to the Identity block.

The input and output of an Identity block have the same shape, so to apply the residual/skip connection we do not need any transformation on the output of the Convolution block. All we need to do is add the output of the Convolution block to the output of the 4th convolution layer in the Conv3_x block, and then apply ReLU to the sum.

We now understand that whenever the output needs to be adjusted before a residual connection can be applied, we need a Convolution block, and whenever the input and output shapes already match, we need an Identity block.

Let us define a class that implements the ResNet-18 model. The model configuration and flow will be defined in the __init__() function, and the forward propagation will be defined in the forward() function.
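Here is a minimal sketch of such a class, following the block-by-block flow described in this article (the complete code is linked at the end). The helper functions conv_bn/conv_bn_1x1 and attribute names like b3_c1 are illustrative naming rather than the original listing, and the Dropout layer mentioned above is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(in_ch, out_ch, stride=1):
    """3x3 convolution followed by batch norm; ReLU is applied in forward()."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

def conv_bn_1x1(in_ch, out_ch):
    """1x1 convolution with stride 2 that resizes a shortcut before the addition."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class ResNet18(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # Input layer: (224x224x3) -> 7x7 conv, stride 2, padding 3 -> (112x112x64) -> maxpool -> (56x56x64)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # BLOCK2 (Conv2_x): (56x56x64) -> (56x56x64), shortcuts are pure identity
        self.b2_c1, self.b2_c2 = conv_bn(64, 64), conv_bn(64, 64)
        self.b2_c3, self.b2_c4 = conv_bn(64, 64), conv_bn(64, 64)

        # BLOCK3 (Conv3_x): (56x56x64) -> (28x28x128)
        self.b3_c1, self.b3_c2 = conv_bn(64, 128, stride=2), conv_bn(128, 128)
        self.b3_skip = conv_bn_1x1(64, 128)      # adjusts op2 so it can be added
        self.b3_c3, self.b3_c4 = conv_bn(128, 128), conv_bn(128, 128)

        # BLOCK4 (Conv4_x): (28x28x128) -> (14x14x256)
        self.b4_c1, self.b4_c2 = conv_bn(128, 256, stride=2), conv_bn(256, 256)
        self.b4_skip = conv_bn_1x1(128, 256)
        self.b4_c3, self.b4_c4 = conv_bn(256, 256), conv_bn(256, 256)

        # BLOCK5 (Conv5_x): (14x14x256) -> (7x7x512)
        self.b5_c1, self.b5_c2 = conv_bn(256, 512, stride=2), conv_bn(512, 512)
        self.b5_skip = conv_bn_1x1(256, 512)
        self.b5_c3, self.b5_c4 = conv_bn(512, 512), conv_bn(512, 512)

        # Output layer: 7x7 average pool -> (1x1x512) -> flatten -> fully connected
        self.avgpool = nn.AvgPool2d(kernel_size=7, stride=1)
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):
        # Input layer
        x = self.maxpool(F.relu(self.bn1(self.conv1(x))))             # (56x56x64)

        # BLOCK2 (Conv2_x): two identity blocks, no resizing needed for the shortcut
        op2 = F.relu(self.b2_c2(F.relu(self.b2_c1(x))) + x)
        op2 = F.relu(self.b2_c4(F.relu(self.b2_c3(op2))) + op2)

        # BLOCK3 (Conv3_x): convolution block (resize op2 with a 1x1 conv, add, ReLU) ...
        x = self.b3_c2(F.relu(self.b3_c1(op2)))                       # (28x28x128)
        op3_1 = F.relu(x + self.b3_skip(op2))
        # ... followed by an identity block (shapes already match)
        op3 = F.relu(self.b3_c4(F.relu(self.b3_c3(op3_1))) + op3_1)

        # BLOCK4 and BLOCK5 repeat the same pattern with 256 and 512 channels
        x = self.b4_c2(F.relu(self.b4_c1(op3)))                       # (14x14x256)
        op4_1 = F.relu(x + self.b4_skip(op3))
        op4 = F.relu(self.b4_c4(F.relu(self.b4_c3(op4_1))) + op4_1)

        x = self.b5_c2(F.relu(self.b5_c1(op4)))                       # (7x7x512)
        op5_1 = F.relu(x + self.b5_skip(op4))
        op5 = F.relu(self.b5_c4(F.relu(self.b5_c3(op5_1))) + op5_1)

        # Output layer
        x = torch.flatten(self.avgpool(op5), 1)                       # (1x1x512) -> 512
        return self.fc(x)
```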

Now let us understand what is happening in #BLOCK3 (Conv3_x) in the above code.

  1. Block 3 takes as input the output of Block 2, ‘op2’, which is an image of shape (56x56x64). It applies a convolution with a kernel size of 3x3, a stride of (2,2), and padding of (1,1), and outputs an image of shape (28x28x128); note that the stride of 2 is what reduces the image size. Batch Normalization and ReLU are then applied to this output.
  2. The next layer is again a convolution layer, but this time its parameters (kernel size (3x3), padding (1,1), and stride (1,1)) keep the image size the same, so it outputs a (28x28x128) image, and Batch Normalization is applied to it; let us call this output ‘x’. If you look at Block 3 in the forward() function, you can see that before applying ReLU to ‘x’, we apply a convolution operation to ‘op2’, which is the output of Block 2 (adjusting the size of op2 to match the current output), add it to ‘x’, and then apply ReLU. At this point we have implemented the skip connection using the Convolution block. Let us call this output op3_1.
  3. Now that we have used the Convolution block, the image shape and size are not going to change anymore within Block 3 (based on the model architecture), so the remaining half of Block 3 is an Identity block.
  4. In the forward() function we apply Conv → Batchnorm → ReLU, then Conv → Batchnorm, and get an output x; we then simply add the Convolution block's output ‘op3_1’ to x and apply ReLU. Note that throughout the Identity block the image size stays the same.
  5. Similarly, we can create all the other blocks. The only exception is Block 2, where the image size does not reduce even in its first pair of layers, so the shortcut can be added directly, or we can use a convolution operation with stride (1,1).
  6. For the input layer, we use Conv → Batchnorm → Maxpool, where the convolution kernel size is 7x7 with a stride of 2 and padding of 3, which changes the input image from (224x224x3) to (112x112x64); the max pool then changes its size to (56x56x64).
  7. For the output/classification layer, we use a fully connected layer, but before that we apply an average pooling operation to the output of Block 5, which has shape (7x7x512), using a 7x7 filter and a stride of 1, reducing it to (1x1x512). We then reshape/flatten this output so that it can be fed to the fully connected layer, as the shape check after this list confirms.
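As a quick sanity check for the shapes in points 6 and 7 (a standalone sketch; the 3x3 max pool with stride 2 and padding 1 is the standard ResNet setting and is assumed here):

```python
import torch
import torch.nn as nn

# Input layer: 7x7 conv, stride 2, padding 3, then 3x3 max pool with stride 2
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
print(stem(torch.randn(1, 3, 224, 224)).shape)    # torch.Size([1, 64, 56, 56])

# Output layer: 7x7 average pool on the (7x7x512) Block 5 output, then flatten
pool = nn.AvgPool2d(kernel_size=7, stride=1)
feat = pool(torch.randn(1, 512, 7, 7))            # torch.Size([1, 512, 1, 1])
print(torch.flatten(feat, 1).shape)               # torch.Size([1, 512]) -> ready for the FC layer
```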

Great, now you know how to implement ResNets. Note that instead of creating each block separately, we can write functions (or modules) for the Convolution and Identity blocks and call them any number of times, which makes it easy to build even deeper ResNet architectures, for example as sketched below.
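One possible way to fold both patterns into a single reusable module (a sketch, not the code used above):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers with a skip connection: acts as a Convolution block
    when the stride or channel count changes, and as an Identity block otherwise."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Resize the shortcut only when the shapes would otherwise not match
        self.skip = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.skip(x))
```

Stacking more of these blocks per stage then gives the deeper ResNet variants.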

Once the class is implemented, we can directly create an object of it, pass in the number of output classes of our dataset, and use it to train the network on any image data.
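For example, assuming the ResNet18 sketch from earlier and a two-class problem, usage looks roughly like this (the optimizer, learning rate, and dummy tensors are illustrative; the real data loading and training loop are in the linked code):

```python
import torch
import torch.nn as nn

model = ResNet18(n_classes=2)                      # e.g. Pneumonia vs. Normal
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One dummy training step on a random batch, just to show the flow
images = torch.randn(8, 3, 224, 224)               # stand-in for a batch of chest X-ray images
labels = torch.randint(0, 2, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```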

Go to the following link to check out the complete code to build a ResNet-18 model using the above class and train it using PyTorch on a dataset of Chest X-Ray images to classify if a person has Pneumonia or not.
