A short answer to that would be that due to the large complexity of the models (mllions of parameters to train) they are generally capable of representing and learning very complex data patterns. And now we know how to effectively train these models.Another aspect is the hierarchical processing in layers, where we have concise representations of the data, from general ones at the beginning to more specific ones at the end of the net.