Hi,
this paper
https://arxiv.org/pdf/1502.01852.pdf
suggests initializing the weights differently when ReLU and PReLU activations are used. As far as I understand it, I should initialize the weights of the first layer with:
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D
from keras.initializers import RandomNormal

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 kernel_initializer=RandomNormal(stddev=np.sqrt(1 / (img_rows * img_cols))),
                 input_shape=input_shape))
and, from the second layer on, with:
model.add(Conv2D(64, (3, 3), activation='relu',
                 kernel_initializer=RandomNormal(stddev=np.sqrt(2 / (3 * 3 * 32)))))
See equation 10 on page 4.
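To make my reading of Eqn. (10) explicit: I take it to mean Var[w_l] = 2 / n_l with n_l = k^2 * d, where k is the kernel size and d is the number of input channels of layer l. A minimal sketch of how I compute the standard deviations (the layer shapes are the ones from the model above):

import numpy as np

def he_stddev(kernel_size, in_channels):
    # std per Eqn. (10): sqrt(2 / n_l) with n_l = k^2 * d
    n_l = kernel_size * kernel_size * in_channels
    return np.sqrt(2.0 / n_l)

# second conv layer: 3x3 kernel, 32 input feature maps
print(he_stddev(3, 32))  # ~0.0833, the same as sqrt(2 / (3*3*32)) above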
For the first layer they write:
"For the first layer (l = 1), we should have n1 Var[w1] = 1
because there is no ReLU applied on the input signal. But
the factor 1/2 does not matter if it just exists on one layer.
So we also adopt Eqn.(10) in the first layer for simplicity."
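What I am not sure about is what n1 means for the first layer: the number of input pixels (which is what my sqrt(1/(img_rows * img_cols)) above assumes), or the fan-in of the first conv layer, k^2 * d with d = 1 for grayscale MNIST. A sketch of both readings for a 28x28 grayscale input:

import numpy as np

img_rows, img_cols, channels = 28, 28, 1  # MNIST

# reading 1: n_1 = number of input pixels (what my first layer above uses)
std_pixels = np.sqrt(1.0 / (img_rows * img_cols))  # ~0.0357

# reading 2: n_1 = fan-in of the first conv (k^2 * d), with Eqn. (10) applied
std_fan_in = np.sqrt(2.0 / (3 * 3 * channels))     # ~0.4714

print(std_pixels, std_fan_in)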
But sadly, this performs slightly worse on MNIST and on my own data set. On MNIST, with this special initialization, the accuracy starts at 0.9655 and peaks at 0.9895. With glorot_uniform (Xavier uniform) it starts at 0.9763 and peaks at 0.9905.
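For reference, the baseline run is the identical architecture with the initializer left at the Keras default (a sketch; I assume glorot_uniform, which is Keras's default kernel_initializer for Conv2D, and a 28x28x1 MNIST input):

input_shape = (28, 28, 1)  # MNIST, channels_last

baseline = Sequential()
baseline.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                    kernel_initializer='glorot_uniform',  # Keras default
                    input_shape=input_shape))
baseline.add(Conv2D(64, (3, 3), activation='relu',
                    kernel_initializer='glorot_uniform'))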
This is reproducible. Did I get the formula wrong? Or is something wrong in my implementation?