I wanted to mainly read Section 3 of this paper for the negative sampling softmax details.
TLDR; When computing the softmax of the output from our decoder, the denominator involves taking the sum of probabilities for all the words. This is not feasible for a large dataset. So we will use negative sampling.
1.) We will only take a small subset V’ from the entire vocab size V. We will also use a predefined proposal distribution Q.
2.) We will define our subset V’ by looking at each training subsequence input and collecting words into V’ until a threshold amount is met. The proposal distribution is different for each training subsequence and if defined by each training subsequence’s subset V’.
UPDATE: Many people asked for a bit more explanation on negative sampling and sampling based approached in general, so here it is.
Let’s say you have a translation task and you have processed the output from the encoder. Now you need to get the logits and apply softmax to determine the appropriate translation. Now imagine your dataset has 1 billion different words (classes) to choose from. When computing the softmax, the denominator (normalization component) is highly computationally expensive to calculate. What all the different sampling based approached do is essentially replace this denominator with an approximation for its true value. Note: this can only be used during training because during inference you actually need to compute the score for all the classes in order to determine the correct result.
We will start by representing our softmax loss as follows:
We can then compute the gradient which will result in below.
We can think of this gradient as two halves. The first term is for the actual target class and is positive. The second term is negative which acts as a negative influence for all the other classes and it is also weighted by P.
The heart of all sampling approaches is to approximate this P which will allow us to skip calculating the sum of probabilities for all the words in set V.
In order to approximate P, we will use a proposal distribution Q which will be basis for our Monte-Carlo sampling. As we can see from the main arXiv paper of this post, our proposal distribution will be something simple based off of the unigram distribution of our training data.
Specifically focusing on negative sampling, the sampled softmax can be represented by:
where k is the number of terms we will to sample with (instead of all the words). The higher k is, the better of an approximation Q will be for P. There are many variations to sampled softmax but these general principles hold.