It seems that using dropout immediately before batch normalization can cause trouble, and many authors suggest placing the activation and dropout (when dropout is needed at all) after batch normalization. However, I couldn't find any solid guidance on the ordering of batch normalization and pooling layers. Which one should come first, and why?

I like to think of batch normalization as normalizing the input to the next layer rather than merely the output of the current layer. But if we normalize before pooling, I'm not sure we end up with the same statistics, since pooling (max-pooling in particular) changes the distribution of the activations. Any suggestions?
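To make the two orderings concrete, here is a minimal sketch of the two candidate blocks. This is just my own illustration, assuming PyTorch; the channel counts, kernel sizes, and input shape are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# Ordering A: normalize before pooling (Conv -> BN -> ReLU -> Pool).
# BN computes its batch statistics over the full-resolution feature map.
block_bn_first = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# Ordering B: pool before normalizing (Conv -> ReLU -> Pool -> BN).
# BN now sees only the pooled activations, so its statistics differ
# (max-pooling shifts the per-channel mean upward, for instance).
block_pool_first = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.BatchNorm2d(16),
)

x = torch.randn(8, 3, 32, 32)        # dummy batch: 8 RGB images, 32x32
print(block_bn_first(x).shape)       # torch.Size([8, 16, 16, 16])
print(block_pool_first(x).shape)     # torch.Size([8, 16, 16, 16])
```

Both blocks produce the same output shape, so the question is purely about which statistics batch normalization should be computed over, not about architecture compatibility.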
