The module essentially acts as several convolution filters of different sizes, applied in parallel to the same input alongside a pooling operation, with the results then concatenated. This allows the model to take advantage of multi-level feature extraction: for instance, it extracts general (5x5) and local (1x1) features at the same time.
Using features from multiple filters improves the performance of the network. Beyond that, there is another property that makes the Inception architecture stand out. All architectures prior to Inception performed convolution over the spatial and channel-wise domains together. By performing a 1x1 convolution, the Inception block computes cross-channel correlations while ignoring the spatial dimensions. This is followed by cross-spatial and cross-channel correlations via the 3x3 and 5x5 filters.
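To make this concrete, here is a minimal PyTorch-style sketch of such a block. The branch channel counts are illustrative assumptions, not the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Minimal Inception-style block: parallel 1x1, 3x3, 5x5 and pooling
    branches whose outputs are concatenated along the channel axis.
    Channel counts are illustrative, not the GoogLeNet values."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 branch: pure cross-channel correlation, no spatial context
        self.branch1x1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        # 3x3 branch: 1x1 reduction first, then cross-spatial correlation
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=3, padding=1),
        )
        # 5x5 branch: larger receptive field for more "general" features
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),
            nn.Conv2d(8, 12, kernel_size=5, padding=2),
        )
        # pooling branch followed by a 1x1 projection
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 12, kernel_size=1),
        )

    def forward(self, x):
        # every branch keeps the spatial size, so outputs can be concatenated
        outs = [self.branch1x1(x), self.branch3x3(x),
                self.branch5x5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)  # concatenate along channels

# e.g. an input of shape (1, 64, 28, 28) yields (1, 16+24+12+12, 28, 28)
```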
The Inception Module is based on a pattern recognition network which mimics the animal visual cortex. After being shown many example images, the network becomes attuned to small details, mid-sized features, or almost whole images if they come up very often. Each layer of the deep network reinforces the features it thinks are there and passes them on to the next. If it has been trained to recognize faces, for instance, the first layer detects edges, the second the overall shape, the third eyes, mouth and nose, the fourth the face, and the fifth the mood.
According to the universal approximation theorem, a feedforward network with a single hidden layer, given enough capacity, is sufficient to represent any function. However, that layer might have to be massive, and such a network is prone to overfitting the data. Hence the common trend in the research community: network architectures need to go deeper.
However, increasing network depth does not work by simply stacking layers together. Deep networks are hard to train because of the notorious vanishing gradient problem: as the gradient is back-propagated to earlier layers, repeated multiplication may make it vanishingly small. As a result, as the network goes deeper, its performance saturates or even starts degrading rapidly.
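As a toy illustration of this effect (the per-layer factor below is an assumed number, not measured from any real network), repeated multiplication alone is enough to shrink the signal by orders of magnitude:

```python
# Toy illustration: the gradient that reaches the first layer is a product
# of per-layer factors; if each factor is below 1, the product shrinks
# exponentially with depth.
depth = 50
per_layer_factor = 0.9   # assumed magnitude of each layer's local derivative
grad = 1.0
for _ in range(depth):
    grad *= per_layer_factor
print(grad)   # 0.9 ** 50 ~= 0.005: early layers barely receive a signal
```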
Before ResNet, there had been several ways to deal with the vanishing gradient issue, e.g. adding an auxiliary loss at a middle layer as extra supervision, but none seemed to really tackle the problem once and for all.
The core idea of ResNet is to introduce a so-called “identity shortcut connection” that skips one or more layers.
The authors of ResNet argue that stacking layers shouldn’t degrade network performance, because we could simply stack identity mappings (layers that do nothing) on top of the current network, and the resulting architecture would perform the same. This implies that a deeper model should not produce a training error higher than its shallower counterpart. They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping, and the residual block explicitly allows them to do precisely that.
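A minimal sketch of such a residual block in PyTorch, with illustrative layer sizes rather than the exact configuration from the paper, might look like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block sketch (layer sizes are illustrative).
    The stacked layers learn a residual F(x); the identity shortcut
    adds x back, so the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # identity shortcut: if the weights drive F(x) toward zero,
        # the block degenerates to an identity mapping
        return self.relu(residual + x)
```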
DenseNet (Densely Connected Convolutional Networks) is one of the more recent neural networks for visual object recognition. It is quite similar to ResNet but has some fundamental differences.
For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
The ResNet architecture proposed the residual connection, from previous layers to the current one. Roughly speaking, the input to the present layer is obtained by summing the outputs of previous layers.
So, let’s imagine we have an image of shape (28, 28, 3). First, we expand the image to an initial 24 channels, obtaining a tensor of shape (28, 28, 24). Every subsequent convolution layer generates k = 12 feature maps and keeps the width and height the same. The output of layer Lᵢ will be (28, 28, 12), but the input to Lᵢ₊₁ will be (28, 28, 24 + 12), to Lᵢ₊₂ (28, 28, 24 + 12 + 12), and so on.
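A minimal dense-block sketch in PyTorch using the numbers from this example (24 initial channels, growth rate k = 12; the exact layer composition here is an assumption for illustration) could look like this:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense-block sketch: start from 24 channels, each layer adds
    k = 12 feature maps, and every layer sees the concatenation of all
    previous outputs along the channel axis."""
    def __init__(self, in_ch=24, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]                               # (N, 24, 28, 28)
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # k = 12 new feature maps
            features.append(out)
        return torch.cat(features, dim=1)            # (N, 24 + 3*12, 28, 28)

# x = torch.randn(1, 24, 28, 28); DenseBlock()(x).shape -> (1, 60, 28, 28)
```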
Inception uses many 1x1 convolutions to reduce the dimensionality of the features, while the 3x3 and 5x5 convolutions are performed in parallel within the Inception block.
ResNet is famous for its shortcut connection, which is basically the feeding (by summation) of features from preceding layers into later layers. This strengthens the features, and ultimately higher accuracies are achieved.
DenseNets are somewhat similar to ResNets; the difference is that the summation is replaced by concatenation, which forwards the features from all preceding layers to all subsequent layers through direct feed-forward concatenation connections.