Saturday, December 3, 2022

[Paper Review - NLP] Negative Sampling (Mikolov et al., 2013)

 Goals

  • improve the skip-gram model in terms of training speed and the accuracy of the learned vectors
  • extend word representations beyond individual words, so that idiomatic expressions (e.g., "Boston Globe") can also be represented

The original skip-gram model

  • Skip-gram model: aims to predict the words surrounding the current (center) word; training therefore maximizes the average log probability
$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the training context (which can be a function of the center word $w_t$), and $p(w_{t+j} \mid w_t)$ is defined by the softmax

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the input and output vector representations of $w$, and $W$ is the number of words in the vocabulary (a small cost sketch follows this list).
  • Hierarchical Softmax (HS): replaces the full softmax to reduce the time cost of normalizing over the entire vocabulary
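
To see where the softmax cost comes from, here is a minimal NumPy sketch (names such as `W_in`, `W_out`, and `softmax_prob` are my own, not from the paper): computing a single probability $p(w_O \mid w_I)$ requires a dot product with every output vector in the vocabulary.

```python
import numpy as np

# Toy dimensions; real vocabularies have 10^5 to 10^6 entries,
# which is what makes the full softmax expensive.
vocab_size, dim = 10_000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input vectors v_w
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output vectors v'_w

def softmax_prob(center_id: int, context_id: int) -> float:
    """p(w_O | w_I) under the full softmax: one dot product per vocabulary word."""
    scores = W_out @ W_in[center_id]   # shape (vocab_size,)
    scores -= scores.max()             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[context_id])

print(softmax_prob(center_id=42, context_id=7))
```

With hundreds of thousands of vocabulary words, this per-example cost is exactly what HS and NEG are designed to avoid.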

Methods for improvement

  • Negative Sampling (NEG): reduce unnecessary computation by updating only a handful of sampled words rather than the whole vocabulary

i) Noise Contrastive Estimation (NCE): assumes that a good model should be able to differentiate data from noise by means of logistic regression.

ii) NEG is a simplified version of NCE: NCE requires both samples and the numerical probabilities of the noise distribution, while NEG requires samples only.

iii) NCE approximately maximizes the log probability of the softmax (accuracy), whereas the skip-gram model only cares about the quality of the vector representations; NEG simplifies the objective with this in mind.

Negative sampling replaces $\log p(w_O \mid w_I)$ in the objective with

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

As the first term (the score of the true pair) gets bigger, its log-sigmoid approaches 0, its maximum; the negated scores of the sampled wrong pairs ($-{v'_{w_i}}^{\top} v_{w_I}$) behave the opposite way, pushing the negative samples toward low scores.
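
Below is a minimal NumPy sketch of this objective for a single (center, context) pair, assuming k negative samples drawn from the unigram distribution raised to the 3/4 power as in the paper; the function and variable names (`neg_objective`, `W_in`, `W_out`) are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 100, 5
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # v_w
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # v'_w

# Noise distribution P_n(w): unigram counts raised to the 3/4 power
# (the counts here are random placeholders).
counts = rng.integers(1, 1_000, size=vocab_size).astype(float)
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(center_id: int, context_id: int) -> float:
    """NEG objective for one (center, context) pair with k sampled negatives."""
    v_c = W_in[center_id]
    pos = np.log(sigmoid(W_out[context_id] @ v_c))        # true pair: push its score up
    neg_ids = rng.choice(vocab_size, size=k, p=p_noise)   # sample k "wrong" words
    neg = np.log(sigmoid(-(W_out[neg_ids] @ v_c))).sum()  # wrong pairs: push scores down
    return float(pos + neg)  # training maximizes this quantity

print(neg_objective(center_id=42, context_id=7))
```
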
  • Subsampling of frequent words: very frequent words tend not to be informative, so each word in the training set is discarded with a probability that grows with its frequency
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $P(w_i)$ is the probability of discarding word $w_i$, $f(w_i)$ is the word's frequency, and $t$ is a chosen threshold (around $10^{-5}$ in the paper).

The more frequent a word is, the more likely it is to be discarded. This counters problems caused by the imbalance between rare and frequent words, and it also increases training speed and the accuracy of the learned vectors.
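
A minimal sketch of this subsampling rule; the `subsample` helper and the toy corpus are my own, and the threshold passed in the example is far larger than the paper's typical $t \approx 10^{-5}$ only because the toy corpus is tiny.

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Drop each token with probability P(w) = 1 - sqrt(t / f(w)),
    where f(w) is the word's relative frequency in the corpus."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - np.sqrt(t / f))  # negative for rare words: always kept
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

corpus = ("the cat sat on the mat the dog ate the food " * 100).split()
print(len(corpus), "->", len(subsample(corpus, t=0.05)))  # "the" is dropped most often
```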

  • Experiments (analogical reasoning task for words)

i) Task: predict the fourth word that completes an analogy given the first three (e.g., Germany : Berlin :: France : ?), covering both syntactic and semantic analogies (a nearest-neighbor sketch follows the results below)

ii) Data: an internal Google dataset of news articles with one billion words as training data; the vocabulary contains 692K words after discarding words that occur fewer than 5 times

iii) Result: (accuracy) NEG > HS and NEG > NCE; (speed) subsampling is faster than the alternatives. The linear structure of the skip-gram vectors suits this analogical task, and with much larger training data even non-linear models appear to learn representations with a similar linear structure.
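
For reference, analogies of this kind are usually solved by nearest-neighbor search around vec(b) - vec(a) + vec(c). The sketch below uses toy random vectors, so its output is only meaningful once real trained skip-gram vectors are plugged in; the `analogy` helper is my own.

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Solve a : b :: c : ? by cosine nearest neighbor of vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    scores = {
        w: float(v @ target / np.linalg.norm(v))
        for w, v in vectors.items()
        if w not in (a, b, c)          # exclude the query words themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy random vectors purely for illustration.
rng = np.random.default_rng(0)
words = ["Germany", "Berlin", "France", "Paris", "river"]
vectors = {w: rng.normal(size=50) for w in words}
print(analogy("Germany", "Berlin", "France", vectors))
```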

  • Word-based => phrase-based, using a data-driven approach that scores bigrams by how often the two words co-occur relative to their individual counts (a scoring sketch follows the phrase experiments below)
  • Experiments (Analogical reasoning task for phrases)

i) Task: a new analogical reasoning task whose items are phrases (e.g., New York : New York Times :: Baltimore : Baltimore Sun)

ii) Results: NEG-15 (k = 15) > NEG-5 (k = 5); models with subsampling outperform those without; HS-Huffman with subsampling outperforms the version without subsampling

iii) With about 33 billion training words, HS with dimensionality 1,000 and a context covering the entire sentence reaches 72% accuracy; the combination of HS and subsampling showed the best performance
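
The data-driven phrase detection mentioned above scores each bigram by how often it co-occurs relative to its parts, using a discounting coefficient δ so that bigrams made of very infrequent words do not dominate. The `phrase_scores` helper, the δ value, and the example corpus below are my own illustration of that scoring rule.

```python
from collections import Counter

def phrase_scores(tokens, delta=1.0):
    """Score each bigram as (count(a b) - delta) / (count(a) * count(b));
    bigrams scoring above a chosen threshold are merged into single
    phrase tokens such as "new_york"."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n_ab - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

tokens = "the new york times reported that new york is busy".split()
scores = phrase_scores(tokens)
print(max(scores, key=scores.get))  # ('new', 'york') scores highest in this toy corpus
```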

Additive compositionality

The skip-gram model can produce meaningful representations by element-wise addition of word vectors (e.g., vec("Russia") + vec("river") is close to vec("Volga River")).
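
A toy sketch of that idea: add two word vectors element-wise and look up the closest word by cosine similarity. The `nearest` helper and the random vectors are my own; with trained skip-gram vectors the paper reports, for example, vec("Russia") + vec("river") being close to vec("Volga River").

```python
import numpy as np

def nearest(query_vec, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in exclude:
            continue
        sim = float(v @ q / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy random vectors stand in for trained skip-gram vectors.
rng = np.random.default_rng(0)
words = ["Russia", "river", "Volga_River", "Germany", "capital", "Berlin"]
vectors = {w: rng.normal(size=50) for w in words}

composed = vectors["Russia"] + vectors["river"]   # element-wise addition
print(nearest(composed, vectors, exclude=("Russia", "river")))
```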
