Paper Review

[English] [Paper Review] Deep Residual Learning for Image Recognition (ResNet)

Eugene129 2022. 11. 20. 19:51

This paper was written by Kaiming He et al. in 2015.

 

Seven years have passed, but it still has a powerful impact.

 

The paper has been cited about 140,000 times so far.

 

Why does this paper have such a powerful impact?

  1. This paper introduced the "Skip Connection" that is now used in state-of-the-art deep learning models like YOLO, Transformer, U-Net, etc.
  2. This paper made the "Deep" in Deep Learning possible. It enabled neural networks to stack extremely deep layers in a very economical way.

 

From this, it is possible to estimate how important ResNet is in the field of deep learning.


This review assumes that the reader understands basic FCNN and CNN.

 

If you lack an understanding of FCNNs, refer to the following video.

https://www.youtube.com/watch?v=aircAruvnKk&ab_channel=3Blue1Brown 

 

If you lack an understanding of CNNs, see the following blog.

(The figures are perfect for understanding CNNs!)

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

 


 

 

I review every paragraph of the Abstract and Introduction, because each one contains very important content.

 

After the Introduction, I review the Figures and Tables.

 

So, let's get started!

 

 


Abstract & The heart of the paper

  • Observed problem: the degradation problem.
    • When layers are stacked extremely deep, accuracy decreases.

  • The solution this paper suggests: ResNet.
    • Even when ResNet stacks extremely deep layers, it learns well.
    • Even extremely deep ResNets are easy to optimize.

The 1st paragraph in Introduction

  • Introduces the existing perspective in the field of deep learning.

    • Existing perspective:
      "Deepening layers is (always) effective."
      • The performance of neural networks has improved a lot by deepening the layers.
      • As layers get deeper, the level of the extracted features can be further enriched.
      • Recent state-of-the-art models on ImageNet stack 16~30 layers.
        (Quite deep, isn't it?)

 

 


The 2nd paragraph in Introduction

  • Some people asked a question:
    "Really, the more layers a model stacks, the better it learns?"

    No, it didn't turn out that way.

  • This is the "vanishing gradient" problem.
    But this problem had already been addressed at that point (see the small numeric sketch after this list).
    • Weights used to be initialized from a normal distribution with a mean of 0 and a standard deviation of 1.
    • It was observed that, with this initialization, the outputs of the activation function mostly end up at the extremes near 0 and 1.
    • When the output of the activation function is near 0 or 1, the derivative of the activation function approaches 0.
      (Recall the derivative of the sigmoid function: output * (1 - output).)
    • Then, as the layers get deeper, the backpropagated values converge to zero. The gradients vanish.
    • In other words, no learning happens at all.

  • Eventually, this was solved.
    • Xavier Glorot and Yoshua Bengio suggested Normalized Initialization to solve this.
      • The convolution layers in the TensorFlow API use the uniform initialization from that paper as the "default" initialization method.
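To make the argument above concrete, here is a small numeric sketch (my own illustration, not code from the paper) showing how tiny the sigmoid's derivative becomes at the extremes, and how multiplying many such factors shrinks the gradient.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# With weights drawn from N(0, 1), pre-activations easily grow large,
# pushing the sigmoid outputs toward the extremes of 0 and 1.
outputs = sigmoid(np.array([-6.0, -4.0, 0.0, 4.0, 6.0]))

# Derivative of the sigmoid: output * (1 - output)
grads = outputs * (1 - outputs)
print(grads)       # approx. [0.0025, 0.0177, 0.25, 0.0177, 0.0025]

# Backpropagation multiplies one such factor per layer, so with ~20 layers
# of small factors the gradient all but disappears:
print(0.1 ** 20)   # 1e-20
```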

 

However, we faced another problem.


The 3rd paragraph in Introduction

  • "Really, the more layers the models stack, the better the models perform?"

    • The problem of gradients vanishing  was solved.
      The models with extremely deep layers, started to learn anyway.
    • However, the Degradation problem was observed.

      • it was a phenomenon in which as the depth of the neural network deepened, the accuracy starts to "saturate".

      • Because the "Training error" increased when network be deepened, it was not due to overfitting.

Let's take a look at the left of Figure 1. As the layer be deepened, the training error increased. As shown on the right, the test error also increased.

 

So what on earth is the problem?


The 4th paragraph in Introduction

  • Explains the experiment the authors conducted to explore the problem.

 

 

 

 

 

 

 

 

Source : https://developer.nvidia.com/discover/artificial-neural-network

 

Suppose there are two FCNNs that learn that the figure shown above is "Sara".

One of them has learned the Sara image with the following architecture.

 

Neural network drawn with https://alexlenail.me/NN-SVG/index.html

 

Then, copy the trained neural network and add some layers at the top (near the output layer).

 

 

Neural network drawn with https://alexlenail.me/NN-SVG/index.html

 

Ideally, it would be nice if each added layer worked as an identity mapping.
(Identity mapping: a layer that approximates the identity function.)

 

In other words, it would be very nice if each added layer were optimized to approximate the identity function.

(The identity function is a function that returns its input unchanged as its output.)

 

If so, this "deeper" neural network will still be able to correctly infer Sara.

 

If the added layers successfully approximated the identity function,

the "deeper" neural network should perform better than, or at least similarly to, the "shallower" neural network.

 

However, the results showed that the "deeper" network performed worse.

 

Let's take a look at the left of Figure 1 again. As the network gets deeper, the training error increases. As shown on the right, the test error increases as well.

 

Therefore, this is a problem of the difficulty of optimizing deep neural networks.

It is a very fundamental problem.
It's not just a matter of overfitting.

 

(In fact, it is not easy for neural network layers to approximate the identity function.

The added layers are more likely to approximate the zero function than the identity function.

This is because, in the first place, the weights flowing through a deep neural network tend to be pushed toward 0.)


So, how can we reduce the difficulty of optimization?

 


 

The 5th paragraph in Introduction

  • Suggests a way to reduce the difficulty of optimization
    • Residual Mapping

 

 

 

 

 

 

 

 

 

The 6th paragraph In Introduction

 

  • Explains Shortcut Connections

 

 

 

 

 

 

 

 

Before we get into the heart of this paper, there is something to point out.

It's the definition of a function.

 

A function is an expression, rule, or law that defines a relationship between one variable and another.

For a given input, a function must produce its output in a consistent way.

 

 

Let's look at the picture above.

There is a function H that produces the output H(x) for the input x.

If the output of this function, H(x), is equal to x, then we can say that the function H is the identity function.

 

"Approximating Identity function" is equal to "Approximating the function H that makes H(x) = x."

The problem is that, as mentioned above, it is not easy for neural network layers to approximate the Identity Function.

 

But what if we could learn H(x) - x instead of H?

(Let F(x) = H(x) - x.)

 

Instead of learning the "ideal" H(x), the model learns the residual H(x) - x, the gap between the ideal and the reality.

Let's change Figure 2 a little by referring to the picture above.

I covered the picture in Figure 2 with a blue box. That blue box represents the function H.

 

The function H in that blue box represents F(x) + x.

Let's remove the blue box.

 

H(x) = 2 weight layers + a shortcut connection

 

The function F consists of "convolution - ReLU activation - convolution".

The function H consists of "convolution - ReLU activation - convolution - an operation that adds the input x to F(x)".

 

Here, "the operation that adds the input x to F(x)" is just a "command" that has no parameters to learn.


It is simply an instruction to add the input x (a minimal code sketch of this computation follows below).
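Here is a minimal PyTorch sketch of the computation just described: F(x) is "conv - ReLU - conv", and H(x) is F(x) plus the parameter-free addition of x. This is my own simplified illustration; details the paper also uses, such as batch normalization and downsampling, are omitted.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x, where F is conv - ReLU - conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x): conv - ReLU - conv
        return self.relu(f + x)                   # "+ x" is just an addition, no parameters

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```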

 

This achieves three things.

 

1. When updating parameters, we learn the function F, not the function H.

This is because, in H(x) = F(x) + x, the "+ x" term has no parameters to learn.

F(x) = H(x) - x.

Without any additional parameters to learn, the residual, the gap between the ideal and the reality, is learned.

 

2. It becomes easier for the function H to approximate the identity function.
Since H(x) = F(x) + x, if F(x) becomes 0, then H(x) = 0 + x = x.

That is, if the function F approximates the zero function, the function H becomes the identity function.

 

It is relatively easy to approximate the zero function, because all the necessary parameters simply need to be zero.
(The same holds whether it's an FCNN or a CNN: all operations consist only of multiplications and additions.)

 

In addition, the zero function is easy to approximate because the weights flowing through a deep neural network tend to be pushed toward zero.

 

3. Because there are no additional parameters to learn,
a fair comparison can be made between
"plain networks" without residual connections and
"residual networks" with residual connections.

 

 


The 7th ~ 9th paragraph in Introduction

 

The results of the experiments with ResNet are summarized.

  • With ResNet, even when the depth was increased dramatically, optimization remained easy.
    • Remember? The deeper plain net unexpectedly increased the training error.
  • The deeper the network, the higher the accuracy.
    • ResNet recorded the best accuracy among the state-of-the-art networks.

  • The experimental results are consistent on both the ImageNet dataset and the CIFAR-10 dataset.
    • The results are not specific to a particular dataset.

  • At an incredible depth of 152 layers, ResNet is the deepest network ever presented on ImageNet, yet it has fewer parameters to learn than VGG nets.
    • Thanks to shortcut connections.

  • ResNet won the following competitions:
    • ImageNet detection in ILSVRC 2015
    • ImageNet localization in ILSVRC 2015
    • COCO detection in COCO 2015
    • COCO segmentation in COCO 2015

 

 

Let's look into details in Figures and Tables.


Figure 3

 

Figure 3 visualizes, at a glance, the network architectures used in the experiments alongside VGG-19.

  • Plain Network
    • If the feature map size is halved,
      the number of filters is doubled to preserve the "time complexity" of each layer.
      (Inspired by the VGG philosophy.)

      The time complexity of an algorithm is the number of basic operations, such as multiplications and additions, that the algorithm performs.
 
  • Residual Network = Plain Network + Shortcuts
    • Dimension-staying shortcuts: identity shortcuts
    • Dimension-increasing shortcuts
      • When convolving with a stride of 2, height and width decrease
        and the number of channels increases.


      • For the operation F(x) + x,
        F(x) and x must have the same dimensions,

        so the dimensions are adjusted to match (see the sketch after this list).

        • Zero-padding shortcuts
        • Projection shortcuts
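Below is a hedged sketch (my own code, not the authors') of what a dimension-increasing block looks like: the first 3x3 convolution uses a stride of 2, so height and width are halved while the number of filters doubles, and the shortcut needs a matching shape, here via a 1x1, stride-2 projection (option B later in the paper).

```python
import torch
import torch.nn as nn

class DownsamplingResidualBlock(nn.Module):
    """Dimension-increasing block: a stride-2 conv halves H and W and doubles channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: a 1x1 convolution with stride 2 so that x matches F(x).
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.projection(x))  # F(x) + projected x

x = torch.randn(1, 64, 56, 56)
print(DownsamplingResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```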

 

 

 

 

 

 

 

 


Table 1

 

Table 1 summarizes the architectures of the ResNet models used in the ImageNet experiments, organized by depth.

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

 

 

Each time the network enters a new convX_x stage, the feature map is downsampled by a stride-2 convolution.

(Refer to the 34-layer residual architecture in Figure 3.)

 

However, if you look closely at Table 1, there is something strange.

Within conv2_x, conv3_x, and the other stages, the output size is preserved across the convolution operations.

Looking at the PyTorch source code, as expected, there is an appropriate amount of padding in each convolution layer:

in the ResNet source in PyTorch, the 3x3 convolution inside a block uses padding of 1, as sketched below.
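As a quick sanity check, here is roughly how that padded 3x3 convolution looks (a simplified version of the helper in the torchvision ResNet source; the exact signature varies by version): with kernel_size=3 and padding=1, the spatial size is preserved when stride=1 and halved when stride=2.

```python
import torch
import torch.nn as nn

def conv3x3(in_planes, out_planes, stride=1):
    # padding=1 keeps the output the same spatial size as the input when stride=1
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)

x = torch.randn(1, 64, 56, 56)
print(conv3x3(64, 64)(x).shape)             # torch.Size([1, 64, 56, 56])  -- size preserved
print(conv3x3(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28]) -- downsampled
```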

 

Table 1, modified to include the padding information.

 


 

 


Table 2

 

Table 2 summarizes the validation error after training on the ImageNet 2012 classification dataset.

  • ImageNet 2012 classification dataset
    • 1,000 classes
    • 1.28 million training images
    • 50,000 validation images

 

  • Results
    • The deeper the plain net, the higher the error.
    • The deeper the ResNet, the lower the error.

 

Figure 4

 

Figure 4 shows the error curves during training on the ImageNet 2012 classification dataset.

 

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

 

  • The authors argue that the error increase observed when deepening the plain net is not due to vanishing gradients.

    • Methods that address vanishing gradients had already been applied.
    • If it were due to vanishing gradients, learning should not have happened at all.
      • But as you can see, Plain-34 still achieves competitive accuracy.
      • That means learning is happening to some degree.

 

 

 

Let's summarize Table 2 and Figure 4.

 

 

 

 

  • The degradation problem can be solved with residual learning.
    • When ResNet gets deeper, the training error and validation error decrease.

  • The effect of residual learning in deep neural networks is verified.
    • ResNet-34 reduces the error by 3.5% compared to Plain-34.

  • ResNet optimizes much faster than the plain net.
    • If the network is not very deep, the plain net performs comparably well (27.94 vs 27.88).
    • But you should still use ResNet: it converges much faster.

 


Table 3 ~ 5

 

For dimension-increasing shortcut connections,

two methods were used to make the dimensions match (refer to the Figure 3 explanation).


  • Zero-padding shortcuts: add extra channels filled with zeros.
    • No parameters to learn (a minimal sketch follows this list).

  • Projection shortcuts: a 1x1 convolution that doubles the number of filters.
    • These introduce parameters to learn.
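For reference, here is a minimal sketch of a zero-padding (option A) shortcut, assuming the common implementation trick of subsampling every other pixel and padding the missing channels with zeros; the implementation details are mine, the paper only states that extra zero entries are padded.

```python
import torch
import torch.nn.functional as F

def zero_padding_shortcut(x, out_channels):
    """Option A: no learnable parameters.
    Downsample spatially by taking every other pixel, then zero-pad the new channels."""
    x = x[:, :, ::2, ::2]                    # halve height and width
    extra = out_channels - x.shape[1]        # number of channels to add
    return F.pad(x, (0, 0, 0, 0, 0, extra))  # pad zeros along the channel dimension

x = torch.randn(1, 64, 56, 56)
print(zero_padding_shortcut(x, 128).shape)   # torch.Size([1, 128, 28, 28])
```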

 

 

 

 

 

 

 

 

 

 

 

 

Table 3 summarizes the validation errors for various combinations of these two methods on dimension-increasing shortcuts and dimension-staying shortcuts.

 


Let's look into the middle part of Table 3.

It compares the plain net and ResNet-34 under options A/B/C.

 

  • Option A
    • Dimension-increasing shortcuts: zero-padding shortcuts
    • Dimension-staying shortcuts: identity shortcuts
    • → No shortcut has parameters to learn.

  • Option B
    • Dimension-increasing shortcuts: projection shortcuts
    • Dimension-staying shortcuts: identity shortcuts
    • → Only the dimension-increasing shortcuts introduce parameters to learn.

  • Option C
    • Dimension-increasing shortcuts: projection shortcuts
    • Dimension-staying shortcuts: projection shortcuts
    • → All shortcuts introduce parameters to learn.

 

To summarize the middle part of Table 3:

  • With any option, ResNet-34 performs better than Plain-34.
  • Option B is slightly better than option A,
    because with option A no residual learning happens in the zero-padded channels.
  • The differences between A/B/C are insignificant → projection shortcuts are not essential for solving the degradation problem.
  • The economical option B is used in the remaining experiments.

 


Let's look into the bottom of Table 3.

 

 

ResNets with more than 34 layers were also tested.

However, using the original building block as-is would have required too many parameters.

So ResNet-50, ResNet-101, and ResNet-152 use a new building block.

 

 

The picture on the right side of Figure 5 shows the "bottleneck" building block.

In the bottleneck block, the number of channels is first reduced by a 1x1 convolution.

The result then goes through a 3x3 convolution.

A second 1x1 convolution then restores the number of channels.

 

I compared the number of learnable parameters because I was curious whether the bottleneck building block is really efficient.

 

 

Certainly,
the bottleneck building block in the blue box has fewer parameters than the building block in the red box,
so the computation is more efficient.
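Here is my back-of-the-envelope count (bias terms ignored), assuming the red box is a "basic" block of two 3x3 convolutions on 256-channel feature maps and the blue box is the bottleneck block of Figure 5 (1x1, 64 → 3x3, 64 → 1x1, 256); the boxes are my reading of the figure, not labels from the paper.

```python
# Weights in a convolution layer: kernel_h * kernel_w * in_channels * out_channels
basic      = 3*3*256*256 + 3*3*256*256            # two 3x3 convs, 256 -> 256
bottleneck = 1*1*256*64 + 3*3*64*64 + 1*1*64*256  # 1x1 reduce, 3x3, 1x1 restore

print(basic)       # 1179648
print(bottleneck)  # 69632
```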

 


To summarize the bottom of Table 3, plus Table 4 and Table 5:

Table 3, 4, 5

 
  • The bottom of Table 3
    • Even extremely deep ResNets learn well.
    • The deeper the ResNet, the lower the error.

  • Table 4
    • ResNet-34 beat all the state-of-the-art deep learning models.
    • The single-model ResNet-152 beat all the state-of-the-art ensemble models.

  • Table 5
    • An ensemble was built from six different ResNets.
    • It won ILSVRC 2015 with a top-5 error of 3.57% (improving on the previous error by 26%).

 


Table 6

Experiments were conducted on the CIFAR-10 dataset to show that ResNet's benefits are not limited to a particular dataset.

 

The authors deliberately conducted the CIFAR-10 experiments with "simple architectures".

This was in order to focus on analyzing the behavior of extremely deep neural networks rather than on producing the best possible results.

 

A figure summarizing the "simple architecture" used for training on the CIFAR-10 dataset is included in the paper.

The experiments on CIFAR-10 were conducted with a simple ResNet like the one above.

 

 

The picture alone wasn't easy to grasp, so I drew the following table.

 

The simple ResNet used for the CIFAR-10 experiments, summarized as a table.

 

The results are as follows.

 

 

Table 6

  • ResNet beat FitNet and Highway, the previous best classification models.
  • ResNet performs better with far fewer parameters.

 

Q. By the way, Table 6 shows that data augmentation was used.

Were FitNet and Highway trained with data augmentation as well?

→ Reviewing those models' papers: yes, they were, although the augmentation methods differ.

 


Figure 6


Let's look into the left of Figure 6.

 

Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

 

 

The following conclusions can be drawn from the left picture of Figure 6, the left picture of Figure 4, and prior work.

"The optimization difficulty of deep neural networks is a FUNDAMENTAL problem."

(It's not just a matter of overfitting.)

  • The left of Figure 6
    • The deep plain net "suffers" from the increased depth, and the training error increases as the depth increases.

  • The left of Figure 4
    • For the ImageNet dataset,
      the training error was also observed to increase as the depth increased.

  • Similar observations have been reported with MNIST dataset.
    • R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
 

 


The following conclusions can be drawn from the middle picture of Figure 6 and the right picture of Figure 4.

 

 

 

Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

 

"ResNet has overcome the optimization difficulity. Accuracy increased when depth increased."

 

  • The middle of Figure 6
    • As the depth of ResNet increases, the training error and testing error decrease.
  • The right of Figure 4
    • Similar observations were made on the ImageNet dataset.

 

 

 

 

 


 

Thereafter, a ResNet model with 1202 layers was evaluated by setting n = 200.
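The CIFAR-10 architecture in the paper stacks 6n + 2 weighted layers (2n layers for each of the three feature-map sizes, plus the first convolution and the final fully connected layer), so n = 200 indeed gives 1202 layers:

```python
# Depths used in the CIFAR-10 experiments: 6n + 2 weighted layers
for n in (3, 5, 7, 9, 18, 200):
    print(n, 6 * n + 2)  # 20, 32, 44, 56, 110, 1202
```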

 

 

Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

 

The test error of ResNet-1202 was 7.93%, higher than that of ResNet-110.

The authors argue that this may be due to overfitting,

because, although its test error is higher, the training error of ResNet-1202 is similar to that of ResNet-110.

 

The authors speculate that ResNet-1202 is an "unnecessarily" large model for training on small datasets such as CIFAR-10.

 

They also mention that follow-up work will apply a regularization technique such as dropout.

 


Figure 7

Figure 7. Standard deviations (std) of layer responses on CIFAR10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.

 

Figure 7 plots the standard deviation of each layer's responses against the layer index.

 

Since ResNet learns the residual, "the gap between the ideal and the reality",
the authors expected its layer responses to be smaller than those of the plain net.

Figure 7 supports that expectation.

 


 

 

Table 7, 8

 

 

 

Tables 7 and 8 show the results of applying ResNet to the object detection task, rather than the image classification task.

Lastly, the authors wrap up the paper by emphasizing the excellence of ResNet and revealing that it won the following competitions:

 

  • ImageNet detection in ILSVRC 2015
  • ImageNet localization in ILSVRC 2015
  • COCO detection in COCO 2015
  • COCO segmentation in COCO 2015