13 December 2022

Hi everyone, I'm implementing the Tacotron speech synthesis system from scratch to make sure I understand it. I've finished the first convolutional filterbank layer and the max-pooling layer, but I don't understand why the authors chose max pooling over time with a stride of 1. They say it preserves the temporal resolution, but it seems to me that a stride of 1 is equivalent to doing nothing and keeping the data as is.

As an example, say we have a matrix in which every time step corresponds to one column:

A = [1,2,3,4;
     5,6,7,8;
     1,2,3,4];

If we max pool over time with stride 2 (taking the max over each pair of adjacent columns), we get:

B = [2,4;
     6,8;
     2,4];
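To make the stride-2 example concrete, here is a minimal pure-Python sketch of how I computed B. The helper name max_pool_time and the window width of 2 are my own assumptions for illustration (no padding is applied); this is not code from the Tacotron paper.

```python
def max_pool_time(matrix, width, stride):
    """Max-pool each row along the time (column) axis, without padding."""
    n_steps = len(matrix[0])
    # Start index of every pooling window along the time axis.
    starts = range(0, n_steps - width + 1, stride)
    return [[max(row[s:s + width]) for s in starts] for row in matrix]

# Each column of A is one time step.
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [1, 2, 3, 4]]

B = max_pool_time(A, width=2, stride=2)
print(B)  # [[2, 4], [6, 8], [2, 4]]
```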

Max pooling with stride 1 keeps the time resolution, but as far as I can tell it also results in B = A (every column is kept). So what is the point of saying that max pooling was applied at all?
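While trying to check this, I wrote the small sketch below. It suggests the outcome may depend on the pooling window width, which I did not fix above: with a width of 1, stride 1 does return A unchanged, but with a width of 2 the values change while the number of time steps stays the same. The function name max_pool_stride1 and the right-side padding with negative infinity are assumptions of mine, not something I took from the paper.

```python
def max_pool_stride1(matrix, width):
    """Stride-1 max pooling along time, right-padded so the length is unchanged."""
    pooled = []
    for row in matrix:
        # Pad with -inf so the padding never wins the max (assumed padding scheme).
        padded = row + [float("-inf")] * (width - 1)
        pooled.append([max(padded[t:t + width]) for t in range(len(row))])
    return pooled

# Each column of A is one time step.
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [1, 2, 3, 4]]

same_as_input = max_pool_stride1(A, width=1)  # identical to A
width_two = max_pool_stride1(A, width=2)      # same shape, different values
print(same_as_input == A)  # True
print(width_two)           # [[2, 3, 4, 4], [6, 7, 8, 8], [2, 3, 4, 4]]
```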

I hope my question was clear enough, thank you for reading.
