"The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one" (section 2.1 p. 9)
The article goes into the details of this conclusion.
Regards
[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.
Deep Neural Networks (DNNs) have emerged as powerful models that outperform Shallow Neural Networks (SNNs) in various domains. One key advantage of DNNs is their ability to learn hierarchical representations of data. Through multiple layers, DNNs progressively extract increasingly abstract features from the input, allowing them to capture complex patterns and relationships. This hierarchical representation learning enables DNNs to better understand the underlying structure of the data and make more accurate predictions.
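As a rough illustration of this layering (a minimal sketch assuming PyTorch; the layer sizes are arbitrary), the structural difference between a shallow and a deep network is simply how many times the input is re-represented before the output layer:

```python
import torch
import torch.nn as nn

# A shallow net: one hidden layer maps the raw input straight to the output.
shallow = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# A deep net: each hidden layer re-represents the previous layer's output,
# so later layers operate on increasingly abstract features.
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 784)                 # a batch of 32 flattened 28x28 inputs
print(shallow(x).shape, deep(x).shape)   # both yield (32, 10) class scores
```

Both models map the same input to the same output shape; the deep one just does it through a stack of intermediate representations.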
In addition, DNNs possess a larger model capacity than SNNs. With a greater number of parameters, DNNs can capture more intricate variations in the data. This increased capacity allows DNNs to model complex tasks that may be beyond the reach of SNNs.
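A quick sketch of how parameter count (and hence capacity) grows as layers are stacked, again assuming PyTorch; the widths and depths below are arbitrary choices for illustration:

```python
import torch.nn as nn

def make_mlp(n_hidden, width=256, d_in=784, d_out=10):
    """Build an MLP with n_hidden hidden layers of the given width."""
    layers, prev = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, d_out))
    return nn.Sequential(*layers)

for n_hidden in (1, 4, 8):
    model = make_mlp(n_hidden)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_hidden} hidden layer(s): {n_params:,} parameters")
```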
Feature reuse and compositionality are other strengths of DNNs. In deep architectures, features learned in early layers can be reused and combined in subsequent layers, forming more meaningful and sophisticated representations. This feature reuse and compositionality enable DNNs to model and generalize from the data effectively, leading to improved performance.
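One way to picture this reuse is a shared stack of early layers feeding several later layers (a hypothetical PyTorch sketch; the two task heads are invented purely for illustration):

```python
import torch
import torch.nn as nn

# Early layers act as a shared backbone: general-purpose features learned once...
backbone = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)

# ...which later layers can recombine for different purposes.
digit_head = nn.Linear(128, 10)    # e.g. which digit is it?
parity_head = nn.Linear(128, 2)    # e.g. is it odd or even?

x = torch.randn(32, 784)
features = backbone(x)             # computed once, reused by both heads
print(digit_head(features).shape, parity_head(features).shape)
```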
Efficient gradient propagation is another critical factor contributing to the success of DNNs. DNNs are trained with backpropagation, which allows gradients to be computed and propagated efficiently through the layers during training. Provided the architecture preserves gradient flow (for example through suitable activation functions and initialization), the network parameters can be effectively updated and optimized even across many layers.
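A minimal training step, assuming PyTorch with dummy data, shows a single backward pass delivering gradients to every layer so that all parameters are updated at once:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)            # dummy inputs
y = torch.randint(0, 10, (32,))     # dummy labels

logits = model(x)                   # forward pass through every layer
loss = loss_fn(logits, y)
loss.backward()                     # gradients flow back through all layers
optimizer.step()                    # every layer's parameters get updated

# Every parameter received a gradient during the backward pass.
print(all(p.grad is not None for p in model.parameters()))
```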
In summary, DNNs surpass SNNs due to their hierarchical representation learning, larger model capacity, feature reuse and compositionality, efficient gradient propagation, and implicit regularization. These factors collectively contribute to their ability to capture complex patterns, generalize well, and achieve superior performance. Nonetheless, the choice of neural network architecture depends on the specific requirements of the task, the nature of the data, and the available computational resources.