Vanilla Neural Network

I. Objective:

A simple multilayer perceptron (MLP) with one hidden layer. f is the activation function, which introduces the non-linearity into our system.


II. Linear Model:

A simple linear model with a softmax layer on top. The main difference here is the lack of a non-linear activation function (ReLU, tanh, etc.). Thanks to Karpathy for the data and code structure; here we will break down the math behind the lines for better understanding. You can check out the data-loading code on the GitHub repo, but we will focus on the main model operations.


# Class scores [N x C]
logits = np.dot(X, W)


# Class probabilities
exp_logits = np.exp(logits)
probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)


# Loss
correct_class_logprobs = -np.log(probs[range(len(probs)), y])
loss = np.sum(correct_class_logprobs) / config.DATA_SIZE
loss += 0.5 * config.REG * np.sum(W*W)
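As a sanity check, the probability and loss computations above can be run on a tiny hand-made batch. The numbers below are made up for illustration, and the snippet subtracts the row max before exponentiating, a common numerical-stability trick that the code above omits (it does not change the probabilities):

```python
import numpy as np

# Toy batch: N=3 examples, C=3 classes (made-up logits)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.0, 1.0, 1.0]])
y = np.array([0, 1, 2])  # correct classes

# Shift by the row max for numerical stability (same probabilities)
shifted = logits - np.max(logits, axis=1, keepdims=True)
exp_logits = np.exp(shifted)
probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

# Cross-entropy loss, averaged over the batch (no regularization here)
correct_class_logprobs = -np.log(probs[range(len(probs)), y])
loss = np.sum(correct_class_logprobs) / len(probs)
```

Each row of probs sums to 1, and the third example (all logits equal) gets a uniform distribution over the three classes.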


# Backpropagation
dscores = probs
dscores[range(len(probs)), y] -= 1
dscores /= config.DATA_SIZE


dW = np.dot(X.T, dscores)
dW += config.REG*W


W += -config.LEARNING_RATE * dW
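Putting the snippets together, a minimal training loop for the linear model might look like the sketch below. The synthetic blob data and the hyperparameter values are my own stand-ins for config and the repo's data loader; since there is no bias term, the blobs are placed at equal angles around the origin so the classes are separable by direction alone:

```python
import numpy as np

np.random.seed(0)

# Stand-in hyperparameters (the post reads these from config)
NUM_CLASSES, DIMENSIONS = 3, 2
POINTS_PER_CLASS = 50
DATA_SIZE = NUM_CLASSES * POINTS_PER_CLASS
REG, LEARNING_RATE, NUM_EPOCHS = 1e-3, 0.05, 500

# Synthetic blobs at equal angles around the origin
angles = 2 * np.pi * np.arange(NUM_CLASSES) / NUM_CLASSES
centers = 4 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.concatenate([np.random.randn(POINTS_PER_CLASS, DIMENSIONS) + centers[c]
                    for c in range(NUM_CLASSES)])
y = np.repeat(np.arange(NUM_CLASSES), POINTS_PER_CLASS)

W = 0.01 * np.random.randn(DIMENSIONS, NUM_CLASSES)
for epoch in range(NUM_EPOCHS):
    logits = np.dot(X, W)                        # class scores [N x C]
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

    dscores = probs.copy()
    dscores[range(DATA_SIZE), y] -= 1            # dL/dlogits = probs - 1{y}
    dscores /= DATA_SIZE
    dW = np.dot(X.T, dscores) + REG * W          # backprop + regularization
    W += -LEARNING_RATE * dW

accuracy = np.mean(np.argmax(np.dot(X, W), axis=1) == y)
```

On blobs this well separated, full-batch gradient descent converges to near-perfect training accuracy.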


We can see that the decision boundary of our classifier is linear and cannot adapt to the non-linear contortions of the data.


III. Neural Network:

Now we introduce a neural net with a softmax on the last layer for class probabilities. We use a ReLU unit to introduce non-linearity. Our network will have two layers: the input of shape [N x D] is transformed into a hidden representation of shape [N x H], and then into class scores of shape [N x C].


Once again, let’s break down the code.


z_2 = np.dot(X, W_1)


a_2 = np.maximum(0, z_2) # ReLU


logits = np.dot(a_2, W_2)


# Class probabilities
exp_logits = np.exp(logits)
probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)


# Loss
correct_class_logprobs = -np.log(probs[range(len(probs)), y])
loss = np.sum(correct_class_logprobs) / config.DATA_SIZE
loss += 0.5 * config.REG * np.sum(W_1*W_1)
loss += 0.5 * config.REG * np.sum(W_2*W_2)


# Backpropagation
dscores = probs
dscores[range(len(probs)), y] -= 1
dscores /= config.DATA_SIZE
dW2 = np.dot(a_2.T, dscores)


dhidden = np.dot(dscores, W_2.T)
dhidden[a_2 <= 0] = 0 # ReLU backprop
dW1 = np.dot(X.T, dhidden)


dW2 += config.REG * W_2
dW1 += config.REG * W_1


W_1 += -config.LEARNING_RATE * dW1
W_2 += -config.LEARNING_RATE * dW2
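A quick way to trust the backpropagation above is a numerical gradient check on a tiny network. This check is my own addition, not part of the repo; it compares the analytic dW1 against a centered-difference estimate of the loss for one entry of W_1:

```python
import numpy as np

np.random.seed(1)
N, D, H, C, REG = 5, 4, 3, 3, 1e-3
X = np.random.randn(N, D)
y = np.random.randint(0, C, size=N)
W_1 = 0.1 * np.random.randn(D, H)
W_2 = 0.1 * np.random.randn(H, C)

def loss_fn(W_1, W_2):
    # Forward pass, exactly as in the post (with a stability shift)
    a_2 = np.maximum(0, np.dot(X, W_1))
    logits = np.dot(a_2, W_2)
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    loss = np.sum(-np.log(probs[range(N), y])) / N
    loss += 0.5 * REG * (np.sum(W_1 * W_1) + np.sum(W_2 * W_2))
    return loss, probs, a_2

# Analytic gradients, following the backprop snippets above
loss, probs, a_2 = loss_fn(W_1, W_2)
dscores = probs.copy()
dscores[range(N), y] -= 1
dscores /= N
dW2 = np.dot(a_2.T, dscores) + REG * W_2
dhidden = np.dot(dscores, W_2.T)
dhidden[a_2 <= 0] = 0  # ReLU backprop
dW1 = np.dot(X.T, dhidden) + REG * W_1

# Centered-difference estimate for one entry of W_1
h = 1e-5
W_plus, W_minus = W_1.copy(), W_1.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numeric = (loss_fn(W_plus, W_2)[0] - loss_fn(W_minus, W_2)[0]) / (2 * h)
```

If the backprop is correct, numeric and dW1[0, 0] agree to several decimal places.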

Computing the accuracy is simple: do a forward pass, then compare the predicted class with the target class.

def accuracy(X, y, W_1, W_2=None):
    if W_2 is None:
        # Linear model
        logits = np.dot(X, W_1)
    else:
        # Two-layer network
        z_2 = np.dot(X, W_1)
        a_2 = np.maximum(0, z_2)
        logits = np.dot(a_2, W_2)
    predicted_class = np.argmax(logits, axis=1)
    print "Accuracy: %.3f" % (np.mean(predicted_class == y))


The resulting decision boundary is able to classify the non-linear data really well.


IV. Tensorflow Implementation:

We will start by setting up our TensorFlow model, with an extra function called summarize() that stores our progress as we train through the epochs. We decide which values to store with tf.scalar_summary() so we can inspect the changes later.

def create_model(sess, FLAGS):

    model = mlp(FLAGS.DIMENSIONS, FLAGS.NUM_HIDDEN_UNITS,
                FLAGS.NUM_CLASSES, FLAGS.REG, FLAGS.LEARNING_RATE)
    sess.run(tf.initialize_all_variables())
    return model

class mlp(object):

    def __init__(self, input_dimensions, num_hidden_units,
                 num_classes, regularization, learning_rate):

        # Placeholders
        self.X = tf.placeholder("float", [None, None])
        self.y = tf.placeholder("float", [None, None])

        # Weights
        W1 = tf.Variable(tf.random_normal(
            [input_dimensions, num_hidden_units], stddev=0.01), "W1")
        W2 = tf.Variable(tf.random_normal(
            [num_hidden_units, num_classes], stddev=0.01), "W2")

        with tf.name_scope('forward_pass') as scope:
            z_2 = tf.matmul(self.X, W1)
            a_2 = tf.nn.relu(z_2)
            self.logits = tf.matmul(a_2, W2)

        # Add summary ops to collect data
        W_1 = tf.histogram_summary("W1", W1)
        W_2 = tf.histogram_summary("W2", W2)

        with tf.name_scope('cost') as scope:
            self.cost = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(self.logits, self.y)) \
                    + 0.5 * regularization * tf.reduce_sum(W1*W1) \
                    + 0.5 * regularization * tf.reduce_sum(W2*W2)

            tf.scalar_summary("cost", self.cost)

        with tf.name_scope('train') as scope:
            self.optimizer = tf.train.AdamOptimizer(
                learning_rate).minimize(self.cost)

    def step(self, sess, batch_X, batch_y):

        input_feed = {self.X: batch_X, self.y: batch_y}
        output_feed = [self.logits, self.cost, self.optimizer]

        outputs = sess.run(output_feed, input_feed)
        return outputs[0], outputs[1], outputs[2]

    def summarize(self, sess, batch_X, batch_y):
        # Merge all summaries into a single operator
        merged_summary_op = tf.merge_all_summaries()
        summary_str = sess.run(merged_summary_op,
            feed_dict={self.X: batch_X, self.y: batch_y})
        return summary_str

Then we will train for several epochs and save the summary each time.

def train(FLAGS):

    # Load the data
    FLAGS, X, y = load_data(FLAGS)

    with tf.Session() as sess:

        model = create_model(sess, FLAGS)
        summary_writer = tf.train.SummaryWriter(
          FLAGS.TENSORBOARD_DIR, graph=sess.graph)

        # y to categorical
        Y = tf.one_hot(y, FLAGS.NUM_CLASSES).eval()

        for epoch_num in range(FLAGS.NUM_EPOCHS):
            logits, training_loss, _ = model.step(sess, X, Y)
            # Display
            if epoch_num%FLAGS.DISPLAY_STEP == 0:
                print "EPOCH %i: \n Training loss: %.3f, Accuracy: %.3f" \
                  % (epoch_num, training_loss,
                     np.mean(np.argmax(logits, 1) == y))

                # Write logs for each epoch_num
                summary_str = model.summarize(sess, X, Y)
                summary_writer.add_summary(summary_str, epoch_num)

if __name__ == '__main__':
    FLAGS = parameters()
    train(FLAGS)

Finally, we can view our training progress using:

$ tensorboard --logdir=logs

and then heading over to http://localhost:6006/ in your browser to view the results.


Extras (Dropout and DropConnect):

There are many add-on techniques to this vanilla neural network that work to improve optimization, robustness, and overall performance. We will be covering many of them in future posts, but here I will briefly talk about a very common regularization technique: dropout.

What is it? Dropout is a regularization technique that sets the outputs of certain neurons to zero, which is effectively the same as those neurons not existing in the network. We do this for p% of the total neurons in each layer, and for each batch a new p% of the neurons in each layer are “dropped”.

Why do we do this? It works out to be a great regularization technique because for each input batch we are effectively sampling from a different neural net, since a whole new set of neurons is dropped. By repeating this, we prevent the units from co-adapting too much to the data. The original paper describes each iteration as training a “thinned” network because p% of the neurons are dropped. Note: dropout is only applied at training time. At test time, we do not drop any neurons (in practice, the activations or weights are scaled to compensate for the units that were dropped during training).


With p = 0.5, half of a layer’s units are dropped. In another iteration, a different half of the neurons will be dropped. Let’s take a look at the masking code to really understand what’s happening.

We draw a mask of 0s and 1s from a Bernoulli distribution, where each entry is 0 with probability p, and apply this mask to the outputs of our layer. The units that are multiplied by zero are our “dropped” neurons, since they will yield an output of 0 when multiplied by the next set of weights.
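In code, the masking step might look like this. This is a sketch of the “inverted dropout” variant, where the surviving activations are scaled by 1/(1-p) at train time so nothing needs to change at test time; the shapes and p value are illustrative:

```python
import numpy as np

np.random.seed(2)
p = 0.5                       # probability that a unit is dropped
a_2 = np.random.randn(4, 10)  # made-up hidden-layer activations

# Bernoulli mask: 0 with probability p, 1/(1-p) otherwise
mask = (np.random.rand(*a_2.shape) > p) / (1 - p)
a_2_dropped = a_2 * mask      # "dropped" units output exactly 0

# At test time we simply skip the mask and use a_2 as-is.
```

The 1/(1-p) scaling keeps the expected value of each activation the same as without dropout, which is why no test-time adjustment is needed.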

Another regularization method, which is an extension of dropout, is dropconnect. It uses a similar mechanism, but the mask is applied to the weights instead.

Notice that here, a set of weights is dropped instead of the neurons.


We apply a similar Bernoulli mask to the weights and use the masked weights for the layer. Any inputs dotted with the zeroed weights contribute 0 to the output. You can see the similarity with dropout, and empirically both techniques offer similar results. Dropconnect was proposed because there are always more weights than neurons, so there are more ways to create “thinned” models, which can make training more robust. In practice, however, papers mostly use dropout and only rarely dropconnect, since the results are similar.
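A dropconnect layer can be sketched the same way, except the Bernoulli mask is applied to the weight matrix instead of the activations (again, the shapes and p value are illustrative):

```python
import numpy as np

np.random.seed(3)
p = 0.5                     # probability that a weight is dropped
X = np.random.randn(4, 6)   # made-up layer input
W = np.random.randn(6, 10)  # made-up weight matrix

# Mask individual weights rather than whole units
weight_mask = (np.random.rand(*W.shape) > p).astype(W.dtype)
out = np.dot(X, W * weight_mask)  # zeroed weights contribute nothing
```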

You can read more about dropout here and dropconnect here.

V. Raw Code:

GitHub Repo  (Updating all repos, will be back up soon!)
