Weights Initialization

Quite a few people have asked me how a slightly different scaling of their weight initialization can completely alter training. In this post, we will take a closer look at the sensitive process of initializing the weights of a neural network. Here’s the agenda:

  1. What is weight initialization?
  2. How we shouldn’t initialize weights.
  3. How we should, and why.
  4. NumPy and TensorFlow options

What is it?

Imagine a simple two-layer MLP. Each layer has neurons, and each neuron is initialized with a weight. These weights are the values that are adjusted via backpropagation (for more information about gradients, check out this post). But do we have to do anything special to these weights, the way we may choose to normalize (zero mean, unit variance) our input data? It turns out that properly initializing our weights is very helpful for training efficiently.

The Incorrect Way

First, we will see how not to initialize our weights.

One option is to initialize to zeros. This may seem intuitive: since our input is normalized, we will need positive and negative weights to process it, and 0 is a nice balance. Unfortunately, with backpropagation, the gradient update for a layer’s weights depends on the other layers’ weights as well. And since those weights are all 0, the gradient signal reaching the earlier layers is zero too, so those weights will never update and no learning will occur.
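To see this concretely, here is a minimal NumPy sketch (the layer sizes and inputs are made up for illustration) of a two-layer ReLU network with all-zero weights. The gradient that flows back to the first layer’s weights is exactly zero, so they can never update:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)          # a small batch of normalized inputs

W1 = np.zeros((3, 5))              # all-zero weights, layer 1
W2 = np.zeros((5, 2))              # all-zero weights, layer 2

# Forward pass
h = np.maximum(0, X @ W1)          # hidden ReLU activations -> all zeros
scores = h @ W2                    # output scores -> all zeros

# Backward pass with a stand-in upstream gradient of ones
dscores = np.ones_like(scores)
dh = dscores @ W2.T                # zero, because W2 is zero
dW1 = X.T @ (dh * (h > 0))         # zero -> W1 never updates

print(np.all(dW1 == 0))            # True
```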

We can also choose to initialize with random numbers, either very small ones or ones with a large deviation. Neither extreme works well: with very small weights, the activations (and the gradients flowing back through them) shrink layer by layer, so in a deep network the signal can vanish, while with very large weights the outputs and gradients can blow up during backpropagation. So we do want randomness in our weights so that they all update and learn differently, but we need some way to initialize them that controls the variance of their outputs, which in turn lets us control the gradient flow.
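A quick NumPy experiment (the depth, layer width, and 0.01 scale are arbitrary choices for illustration) shows the shrinking effect of small random weights — after a handful of layers the activations have all but vanished:

```python
import numpy as np

np.random.seed(1)
a = np.random.randn(500, 256)               # unit-variance input batch

# Push the batch through 10 layers of tiny random weights
for _ in range(10):
    W = np.random.randn(256, 256) * 0.01    # small random init
    a = np.tanh(a @ W)

# Each layer multiplies the variance by roughly n * 0.01^2,
# so the activations collapse toward zero.
print(a.std())
```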

The Proper Way

Now let’s take a look at the proper way to initialize our weights. The objective is to have weights that produce outputs following a similar distribution across all neurons. This greatly helps convergence during training, letting us train faster and more effectively. But how do we calibrate the weights so that we can normalize the variance of our outputs?

We want each neuron’s output to have unit variance before it is sent to the activation, so we start with this:

For a neuron s = Σᵢ wᵢ xᵢ with n inputs, where the weights and inputs are independent and zero-mean:

Var(s) = Var(Σᵢ wᵢ xᵢ) = Σᵢ Var(wᵢ) Var(xᵢ) = n Var(w) Var(x)

With unit-variance inputs, Var(s) = 1 requires Var(w) = 1/n, i.e. a standard deviation of 1/sqrt(n).

So we need to scale our random normal weight initializations by 1/sqrt(n) in order to have unit variance. For ReLU units this becomes sqrt(2/n): a ReLU unit is zero for non-positive inputs, which halves the output variance, so we need to double the variance of the weights to compensate. Take a look at this paper for more on Xavier (Glorot) initialization and this paper for the ReLU (He) initialization.
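We can verify the scaling empirically with NumPy (the layer width and batch size here are arbitrary): unscaled weights blow the output variance up by a factor of n, while the 1/sqrt(n) scaling keeps it near one.

```python
import numpy as np

np.random.seed(2)
n = 256
x = np.random.randn(500, n)                     # unit-variance inputs

W_naive  = np.random.randn(n, n)                # unscaled
W_xavier = np.random.randn(n, n) / np.sqrt(n)   # scaled by 1/sqrt(n)
W_he     = np.random.randn(n, n) * np.sqrt(2.0 / n)  # sqrt(2/n), for ReLU

print((x @ W_naive).std())    # ~sqrt(n) = 16: variance blows up
print((x @ W_xavier).std())   # ~1: unit variance preserved
# With the sqrt(2/n) scaling, the signal survives the ReLU without shrinking away
print(np.maximum(0, x @ W_he).std())
```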

NumPy and TensorFlow Implementations

Just a few examples (there are many more initialization options):


import numpy as np

# He initialization (scale by sqrt(2/fan_in)) for a 784-100-100-10 MLP
W1_init = np.random.randn(784, 100).astype(np.float32) * np.sqrt(2.0 / 784)
b1_init = np.zeros([100]).astype(np.float32)
W2_init = np.random.randn(100, 100).astype(np.float32) * np.sqrt(2.0 / 100)
b2_init = np.zeros([100]).astype(np.float32)
W3_init = np.random.randn(100, 10).astype(np.float32) * np.sqrt(2.0 / 100)
b3_init = np.zeros([10]).astype(np.float32)
W_inits = [W1_init, b1_init, W2_init, b2_init, W3_init, b3_init]


import tensorflow as tf

# TensorFlow ships an Xavier initializer out of the box
W = tf.get_variable("W", shape=[784, 100],
                    initializer=tf.contrib.layers.xavier_initializer())

Looking Ahead

Sure, we can initialize our weights at the beginning to control the variance of the outputs, but what about when we update our weights and they change? Well, since our weights were properly initialized and our inputs were normalized, the updates themselves are calibrated in a sense, and backpropagation will not cause major variations in the initialized weights very quickly. Additionally, we have techniques like batchnorm and layernorm that help with controlling the normalization of the outputs continuously throughout training. You can find information and implementations of those techniques here.
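As a taste of what batchnorm does, here is a minimal sketch of its normalization step in NumPy (real batchnorm also learns a scale and shift per feature, omitted here): it re-standardizes each feature across the batch, so the activations stay zero-mean and unit-variance even if the weights drift during training.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) to zero mean and unit variance
    # across the batch; eps guards against division by zero.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

np.random.seed(3)
h = np.random.randn(100, 10) * 5 + 2   # badly scaled activations
h_norm = batch_norm(h)
print(h_norm.mean(), h_norm.std())     # ~0 and ~1
```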
