Hi,
this paper
https://arxiv.org/pdf/1502.01852.pdf
suggests initializing the weights differently when ReLU and PReLU activations are used. As far as I understand it, I should initialize the weights of the first layer with:
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D
from keras.initializers import RandomNormal

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 kernel_initializer=RandomNormal(stddev=np.sqrt(1 / (img_rows * img_cols))),
                 input_shape=input_shape))
and, from the second layer on, with:
model.add(Conv2D(64, (3, 3), activation='relu',
                 kernel_initializer=RandomNormal(stddev=np.sqrt(2 / (3 * 3 * 32)))))
See equation 10 on page 4.
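To make my reading of Eqn. (10) explicit: I take it to mean Var[w_l] = 2 / n_l with n_l = k^2 * d, where k is the kernel size and d is the number of input channels of layer l. A minimal sketch of how I compute the standard deviations (the layer shapes are the ones from the model above):

import numpy as np

def he_stddev(kernel_size, in_channels):
    # std per Eqn. (10): sqrt(2 / n_l) with n_l = k^2 * d
    n_l = kernel_size * kernel_size * in_channels
    return np.sqrt(2.0 / n_l)

# second conv layer: 3x3 kernel, 32 input feature maps
print(he_stddev(3, 32))  # ~0.0833, the same as sqrt(2 / (3*3*32)) above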
For the first layer they write:
"For the first layer (l = 1), we should have n1 Var[w1] = 1
because there is no ReLU applied on the input signal. But
the factor 1/2 does not matter if it just exists on one layer.
So we also adopt Eqn.(10) in the first layer for simplicity."
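What I am not sure about is what n1 means for the first layer: the number of input pixels (which is what my sqrt(1/(img_rows * img_cols)) above assumes), or the fan-in of the first conv layer, k^2 * d with d = 1 for grayscale MNIST. A sketch of both readings for a 28x28 grayscale input:

import numpy as np

img_rows, img_cols, channels = 28, 28, 1  # MNIST

# reading 1: n_1 = number of input pixels (what my first layer above uses)
std_pixels = np.sqrt(1.0 / (img_rows * img_cols))  # ~0.0357

# reading 2: n_1 = fan-in of the first conv (k^2 * d), with Eqn. (10) applied
std_fan_in = np.sqrt(2.0 / (3 * 3 * channels))     # ~0.4714

print(std_pixels, std_fan_in)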
But sadly, this performs slightly worse on MNIST and on my own data set. On MNIST, with this special initialization, the accuracy starts at 0.9655 and peaks at 0.9895. With glorot_uniform (Xavier uniform) it starts at 0.9763 and peaks at 0.9905.
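For reference, the baseline run is the identical architecture with the initializer left at the Keras default (a sketch; I assume glorot_uniform, which is Keras's default kernel_initializer for Conv2D, and a 28x28x1 MNIST input):

input_shape = (28, 28, 1)  # MNIST, channels_last

baseline = Sequential()
baseline.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                    kernel_initializer='glorot_uniform',  # Keras default
                    input_shape=input_shape))
baseline.add(Conv2D(64, (3, 3), activation='relu',
                    kernel_initializer='glorot_uniform'))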
This is reproducible. Did I get the formula wrong? Or is something wrong in my implementation?