Convolutional neural networks (CNNs) are now also reaching impressive performance on non-classification image processing tasks, such as denoising, demosaicing, super-resolution, and super slow motion. Consequently, CNNs are increasingly deployed on very high-resolution images. However, the resulting high-resolution feature maps place unprecedented demands on the memory system of neural network processors, as on-chip memories are too small to store high-resolution feature maps, while off-chip memories are very costly in terms of I/O bandwidth and power. This paper first shows that the classical layer-by-layer inference approaches are bounded in their external I/O bandwidth versus on-chip memory tradeoff space, making it infeasible to scale up to very high resolutions at a reasonable cost. Next, we demonstrate how an alternative depth-first network computation can reduce I/O bandwidth requirements up to >200× for a fixed on-chip memory size or, alternatively, reduce on-chip memory requirements up to >10 000× for a fixed I/O bandwidth limitation. We further introduce an enhanced depth-first method, exploiting both line buffers and tiling, to further improve the external I/O bandwidth versus on-chip memory capacity tradeoff and quantify its improvements beyond the current state of the art.
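As a rough back-of-the-envelope sketch of why depth-first execution relaxes the memory requirements, compare the storage for one full intermediate feature map (layer-by-layer execution) against a line buffer holding only the rows a 3×3 kernel needs before the next layer can consume them. The frame size, channel count, and 8-bit activations below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical sizes for illustration only (not taken from the paper):
H, W, C = 2160, 3840, 32      # one 4K-resolution intermediate feature map
BYTES_PER_VALUE = 1           # 8-bit activations
K = 3                         # 3x3 convolution kernels

# Layer-by-layer inference: the complete intermediate feature map must be
# stored somewhere (on-chip if it fits, otherwise streamed to external DRAM).
full_map_bytes = H * W * C * BYTES_PER_VALUE

# Depth-first (line-buffered) inference: only the K rows of the intermediate
# feature map needed to produce the next output row are kept on chip.
line_buffer_bytes = K * W * C * BYTES_PER_VALUE

print(f"full feature map : {full_map_bytes / 2**20:8.1f} MiB")
print(f"line buffer      : {line_buffer_bytes / 2**10:8.1f} KiB")
print(f"reduction factor : {full_map_bytes / line_buffer_bytes:8.0f}x")
```

Under these assumed numbers the line buffer is H/K = 720× smaller than the full feature map, which is the kind of gap that makes on-chip storage feasible where buffering whole layers is not.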
I don't understand, what is the issue? This is how convolutional layers work: if you put 12 filters of 3x3 pixels in the first layer, each filter will have 3x3x3 weights, and when you apply such a filter to the 3 channels of the original image, the 3 channels are mixed together by the third dimension of the filter (3x3x3) to output a single channel per filter. Otherwise (if each filter were only 3x3x1) you would end up with a 32x32x36 volume, as each filter would be applied to each channel of the input image separately and produce 3 channels as a result (see the sketch below).
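A minimal PyTorch sketch of the two cases described above; the 32x32 RGB input and the layer definitions are illustrative assumptions, not code from the original discussion:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image in (N, C, H, W) layout

# Standard convolution: 12 filters, each with 3x3x3 weights, so every filter
# spans all 3 input channels and emits a single output channel.
standard = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, padding=1)
print(standard.weight.shape)    # torch.Size([12, 3, 3, 3])
print(standard(x).shape)        # torch.Size([1, 12, 32, 32])

# Per-channel alternative: 3x3x1 filters applied to each input channel
# separately (groups=3). With 12 filters per channel this yields 3 * 12 = 36
# output channels, i.e. the 32x32x36 volume mentioned above.
per_channel = nn.Conv2d(in_channels=3, out_channels=36, kernel_size=3,
                        padding=1, groups=3)
print(per_channel.weight.shape)  # torch.Size([36, 1, 3, 3])
print(per_channel(x).shape)      # torch.Size([1, 36, 32, 32])
```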