Hi everyone, I'm trying to implement the Tacotron speech synthesis system from scratch to make sure I understand it. I've finished the first convolutional filterbank layer and the max-pooling layer, but I don't understand why the authors chose max pooling over time with a stride of 1. They say it is to preserve the temporal resolution, but as far as I can tell, a stride of 1 is equivalent to doing nothing and keeping the data as is.
As an example, say we have a matrix in which every time step corresponds to one column:
A = [1,2,3,4;
5,6,7,8;
1,2,3,4];
If we max pool over time with a window of 2 time steps and a stride of 2, we get:
B = [2,4;
6,8;
2,4]
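To make the example concrete, here is a minimal sketch of how I'm computing the pooling. It assumes a window of 2 time steps and no padding at the edges; the function name `max_pool_time` and its parameters are my own and are not taken from the paper.

```python
def max_pool_time(x, width=2, stride=2):
    """Max-pool each row of x along time (rows = features, columns = time steps).

    Assumes no padding: only windows that fit entirely inside the row are used.
    """
    n_steps = len(x[0])
    # Start index of every pooling window along the time axis.
    starts = range(0, n_steps - width + 1, stride)
    return [[max(row[s:s + width]) for s in starts] for row in x]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [1, 2, 3, 4]]

B = max_pool_time(A, width=2, stride=2)
print(B)  # [[2, 4], [6, 8], [2, 4]]
```

Calling the same function with `stride=1` and different values of `width` is an easy way to check what a stride-1 pool returns for a given window size (with `width=1` it trivially returns the input unchanged).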
Max pooling with a stride of 1 keeps the time resolution, but as I understand it, it also results in B = A (every column is kept). So what's the point of even saying that max pooling was applied?
I hope my question was clear enough, thank you for reading.