Quite a few people have been apprehensive about how a slightly different scaling of their weight initialization can completely alter training. In this post, we will take a closer look at the sensitive process of initializing the weights of a neural net. Here’s the agenda:

- What is weight initialization?
- How we shouldn’t initialize weights.
- How we should, and why.
- NumPy and TensorFlow options

## What is it?

Imagine a simple two-layer MLP. Each layer has neurons, and each connection into a neuron carries a weight. These weights are the values that are adjusted via backpropagation (for more information about gradients, check out this **post**). But do we have to do anything special to these weights, the way we may choose to normalize (zero mean, unit variance) our input data? It turns out that properly initializing our weights is very helpful for training efficiently.

## The Incorrect Way

First, we will see how not to initialize our weights.

One option is to initialize all weights to zero. This may seem intuitive: since our input is normalized, we will need both positive and negative weights to process it, and 0 is a nice balance. Unfortunately, with backpropagation, the gradient update for a layer's weights depends on the other weights in the network as well. Since those weights are all 0, the gradients flowing backward are zero too, so the weights will never update and no learning will occur.
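A quick NumPy sketch makes this concrete (layer sizes here are arbitrary, chosen just for illustration): with zero weights, the hidden layer outputs zero, and the gradient flowing back through the zero second-layer weights is zero as well, so the first layer receives no learning signal.

```python
import numpy as np

# Tiny 2-layer MLP with all weights initialized to zero (hypothetical sizes).
x = np.random.randn(4, 3)     # batch of 4 inputs, 3 features
W1 = np.zeros((3, 5))         # zero-initialized first layer
W2 = np.zeros((5, 2))         # zero-initialized second layer

h = np.maximum(0, x @ W1)     # hidden activations: all zeros
out = h @ W2                  # outputs: all zeros

# Backprop the gradient of a dummy loss (sum of outputs) to W1.
dout = np.ones_like(out)
dh = dout @ W2.T              # zero, because W2 is zero
dW1 = x.T @ (dh * (h > 0))    # zero, so W1 never updates

print(np.all(dW1 == 0))       # True: no learning signal reaches W1
```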

We can also choose to initialize with small random numbers. Randomness breaks the symmetry so all the weights are able to update, but without controlling the scale, the variance of each layer's outputs shrinks or grows as it propagates through the network, and the resulting errors (especially during the initial epochs) produce gradients that vanish or blow up during backpropagation. So we want randomness in our weights so they all learn, but we also need some way to initialize the weights so that we can control the variance of their outputs, which lets us control the gradient flow.

## The Proper Way

Now let’s take a look at the proper way to initialize our weights. The objective is to have weights that produce outputs following a similar distribution across all neurons. This greatly helps convergence during training, letting us train faster and more effectively. But how do we calibrate the weights so that we can normalize the variance of our outputs?

We want the pre-activation output of each neuron to have unit variance before sending it to the activation, so we start with the variance of a neuron's weighted sum. For a neuron computing s = w_1*x_1 + ... + w_n*x_n over n inputs, with independent zero-mean weights and unit-variance inputs, the variance of the sum is Var(s) = n * Var(w) * Var(x). Setting Var(s) = Var(x) = 1 gives Var(w) = 1/n.

So we need to scale our random normal weight initializations by 1/sqrt(n) in order to have unit variance. For ReLU units, this becomes sqrt(2/n): a ReLU unit is zero for non-positive inputs, which halves the variance, so we need to double Var(w) to compensate. Take a look at this **paper** for more on Xavier Glorot initialization and this **paper** for the ReLU initialization.
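We can sanity-check both scalings with a quick NumPy simulation (the layer width and batch size below are arbitrary). With the 1/sqrt(n) scaling, the linear outputs come out with std ≈ 1; with the sqrt(2/n) scaling, the ReLU outputs keep a mean square of ≈ 1, which is the quantity the He et al. derivation controls.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
x = rng.standard_normal((2000, n))            # unit-variance inputs

# Xavier-style scaling for linear/tanh units: std = 1/sqrt(n).
W = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)
s = x @ W
print(s.std())                                # ≈ 1.0

# He scaling for ReLU units: std = sqrt(2/n) compensates for the zeroed half.
W_relu = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
r = np.maximum(0, x @ W_relu)
print(np.mean(r**2))                          # ≈ 1.0
```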

## NumPy and TensorFlow Implementations

Just a few examples (there are many more options for initialization):

NumPy:

```python
import numpy as np

# He-style initialization (std = sqrt(2/fan_in)) for a 784-100-100-10 MLP.
W1_init = np.random.randn(784, 100).astype(np.float32) * np.sqrt(2.0 / 784)
b1_init = np.zeros([100]).astype(np.float32)
W2_init = np.random.randn(100, 100).astype(np.float32) * np.sqrt(2.0 / 100)
b2_init = np.zeros([100]).astype(np.float32)
W3_init = np.random.randn(100, 10).astype(np.float32) * np.sqrt(2.0 / 100)
b3_init = np.zeros([10]).astype(np.float32)

W_inits = [W1_init, b1_init, W2_init, b2_init, W3_init, b3_init]
```

TensorFlow:

```python
W = tf.get_variable("W", shape=[784, 100],
                    initializer=tf.contrib.layers.xavier_initializer())
```

## Looking Ahead

Sure, we can initialize our weights to control the variance of the outputs at the start of training, but what about once the weights are updated and change? Well, since our weights were properly initialized and our inputs were normalized, the updates themselves are calibrated in a sense, and backpropagation will not cause major variations in the initialized weights very quickly. Additionally, we have techniques like batch norm and layer norm that help with controlling the normalization of the outputs continuously throughout training. You can find information and implementations of those techniques **here**.
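As a rough sketch of the idea, here is a minimal batch-norm forward pass in NumPy (stripped down for illustration: a real implementation also learns a scale and shift per feature and tracks running statistics for inference):

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Normalize each feature to zero mean / unit variance over the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h = rng.standard_normal((128, 64)) * 5.0 + 3.0   # poorly scaled activations
h_norm = batchnorm_forward(h)
print(h_norm.mean(), h_norm.std())               # ≈ 0 and ≈ 1
```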
