Efficient estimation of word representations in vector space (Mikolov et al., 2013)
Introduction
Traditional n-gram model
Principle: predict the next word from the previous n-1 words; the probability of a sentence is obtained by multiplying the conditional (n-gram) probabilities of its words (see the sketch after the examples below)
e.g., "This is a puppy"
n = 1 (unigram): [This] [is] [a] [puppy] (this, is, a, puppy)
n = 2 (bigram): [This is] [is a] [a puppy] (this is, is a, a puppy)
n = 3 (trigram): [This is a] [is a puppy] (this is a, is a puppy)
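A minimal sketch (not from the paper) of how these n-grams are extracted and how a bigram model would score the sentence; the toy corpus and counts are made up for illustration:

```python
# Minimal n-gram sketch (illustrative only, not from the paper).
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a puppy".split()
print(ngrams(sentence, 1))  # [('This',), ('is',), ('a',), ('puppy',)]
print(ngrams(sentence, 2))  # [('This', 'is'), ('is', 'a'), ('a', 'puppy')]
print(ngrams(sentence, 3))  # [('This', 'is', 'a'), ('is', 'a', 'puppy')]

# Bigram sentence probability from (hypothetical) corpus counts:
# P(sentence) ≈ Π P(w_i | w_{i-1}) = Π count(w_{i-1}, w_i) / count(w_{i-1})
corpus = ["This is a puppy", "This is a dog", "a puppy is a dog"]
unigram_counts = Counter(w for s in corpus for w in s.split())
bigram_counts = Counter(b for s in corpus for b in ngrams(s.split(), 2))

prob = 1.0
for prev, cur in ngrams(sentence, 2):
    prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
print(prob)
```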
Limits:
i. insufficient amount of in-domain data (for ASR)
ii. curse of dimensionality
iii. difficulty of generalization (because the n-gram model operates in a discrete space, similarity between words cannot be exploited)
iv. poor performance on word similarity tasks
Distributed representations of words as (continuous) vectors (vs. one-hot encoding)
e.g., queen → | 0.313 | 0.123 | 0.326 | 0.128 | 0.610 | 0.415 | 0.120 |
learning cost: with V words in the vocabulary and m parameters (dimensions) per word vector (here m = 7), the representation requires V * m parameters to learn (see the sketch below)
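A small illustrative contrast between a one-hot vector and a dense vector, and the V * m parameter count; the toy vocabulary and random vectors are assumptions, not values from the paper:

```python
# Illustrative contrast between one-hot and distributed representations.
import numpy as np

vocab = ["king", "queen", "man", "woman"]            # toy vocabulary, V = 4
V, m = len(vocab), 7                                 # m = vector dimensionality

# One-hot: a V-dimensional vector with a single 1; no notion of similarity.
one_hot_queen = np.eye(V)[vocab.index("queen")]
print("one-hot queen:", one_hot_queen)

# Distributed: every word is a dense m-dimensional vector; the whole table
# has V * m trainable parameters (the "learning cost" mentioned above).
embedding = np.random.default_rng(0).normal(size=(V, m))
queen_vec = embedding[vocab.index("queen")]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("parameters:", V * m)                          # 4 * 7 = 28
print("cos(queen, king):", cosine(queen_vec, embedding[vocab.index("king")]))
```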
- Previous work on representation of words as continuous vectors: Feedforward NNLM
- Motivation and goal: the need to train models on much larger data sets than previous work (which used no more than a few hundred million words, with a modest word-vector dimensionality of 50-100); building a model that captures multiple degrees of similarity between words and performs better on word similarity tasks
Model architectures
Some terms
Computational complexity of a model: the number of parameters that need to be accessed to fully train the model
Training complexity is proportional to
💡 O = E (number of training epochs; usually 3-50) * T (number of words in the training set; up to 1 billion) * Q (defined depending on model architecture)
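A worked example of this formula with illustrative values (E and T in the ranges above; Q = 500 is an arbitrary placeholder, since Q depends on the architecture):

```python
# Worked example of O = E * T * Q with illustrative values.
E = 3              # training epochs
T = 1_000_000_000  # words in the training set (up to ~1B)
Q = 500            # per-example cost; depends on the model architecture

O = E * T * Q
print(f"O = {O:.2e} operations (proportional)")  # 1.50e+12
```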
Non-linear NN models
Feedforward NNLM
i. components: input, projection, hidden, and output layers
ii. learning the joint probability distribution of word sequences together with distributed feature vectors for the words, using a feedforward NN
iii. input, output, and evaluation
input: the concatenated feature vectors of the N previous words
output: an estimated probability distribution over the next word, from which the probability of a word sequence is obtained
a softmax function in the output layer normalizes the scores into this probability distribution
iv. computational complexity per training example
💡 Q = N * D + N * D * H + H * V, where the N previous words are encoded with 1-of-V coding and projected to dimensionality D, H is the hidden layer size, and V is the vocabulary size
The number of output units that must be evaluated can be reduced to about log_2(V) using a hierarchical softmax (see the sketch below)
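A sketch plugging typical values into Q for the feedforward NNLM and showing how much the hierarchical softmax reduces the output term; the concrete numbers (N = 10, D = H = 500, V = 1M) are illustrative:

```python
# Per-example complexity of the feedforward NNLM, Q = N*D + N*D*H + H*V,
# with illustrative values in the ranges discussed in the paper.
import math

N, D, H, V = 10, 500, 500, 1_000_000

q_full = N * D + N * D * H + H * V
# With a hierarchical softmax, the V output units that must be evaluated
# drop to about log2(V):
q_hier = N * D + N * D * H + H * math.log2(V)

print(f"full softmax:         Q ≈ {q_full:,.0f}")   # dominated by H*V
print(f"hierarchical softmax: Q ≈ {q_hier:,.0f}")   # dominated by N*D*H
```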
Recurrent NNLM: RNN-based NNLM
i. components: input (the current word vector & the hidden-layer values from the previous time step), hidden, and output layers; no projection layer
ii. the recurrent connections form a short-term memory that is used to predict the next word given the current word and the history
iii. computational complexity per training example
💡 Q = H * H + H * V, where the word representations have the same dimensionality as the hidden layer H
The number of output units that must be evaluated can likewise be reduced to about log_2(V) using a hierarchical softmax.
For more on hierarchical softmax:
http://building-babylon.net/2017/08/01/hierarchical-softmax/
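A minimal sketch of the idea behind the hierarchical softmax, using a balanced binary tree over a toy vocabulary; the paper's models use a Huffman-coded tree, so the code below is my own illustration rather than the paper's implementation:

```python
# Minimal hierarchical-softmax sketch: each word is a leaf of a binary tree,
# and its probability is a product of sigmoid decisions along a path of
# length ≈ log2(V). Parameter names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, H = 8, 4                       # toy vocabulary and hidden size
depth = int(np.log2(V))           # path length ≈ log2(V)

# One weight vector per internal node of the tree (V - 1 of them).
node_weights = rng.normal(size=(V - 1, H))

def word_probability(word_index, hidden):
    """P(word | hidden) as a product of sigmoid decisions along the path."""
    prob, node = 1.0, 0
    for bit in format(word_index, f"0{depth}b"):      # the word's binary code
        score = 1.0 / (1.0 + np.exp(-hidden @ node_weights[node]))
        prob *= score if bit == "1" else (1.0 - score)
        node = 2 * node + (2 if bit == "1" else 1)    # descend the tree
    return prob

hidden = rng.normal(size=H)
probs = [word_probability(w, hidden) for w in range(V)]
print(sum(probs))   # ≈ 1.0: the tree defines a proper distribution
```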
New log-linear models
i. The main observation from the previous section is that most of the complexity is caused by the non-linear hidden layer in the model
ii. While this non-linearity is what makes NNs so attractive, it is worth exploring simpler models that may not represent the data as precisely as NNs but can be trained on much more data efficiently
iii. The two architectures
Continuous Bag-of-words Model
Similar to the feedforward NNLM, but without the hidden layer and with the projection layer shared for all words: the context vectors (from both the past and the future of the current word) are averaged, so word order does not matter, hence "bag of words" (a minimal forward pass is sketched below)
Q = N * D + D * log_2(V)
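A minimal CBOW forward-pass sketch under simplifying assumptions: a full softmax instead of the hierarchical softmax (which is what gives the log_2(V) term in Q), toy sizes, and random weights:

```python
# Minimal CBOW forward pass (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 10, 8, 4                    # vocab size, vector dim, context words

W_in = rng.normal(size=(V, D)) * 0.1  # input (projection) vectors
W_out = rng.normal(size=(D, V)) * 0.1 # output vectors

context_ids = [1, 2, 4, 5]            # N surrounding words (bag of words)
target_id = 3                         # the word in the middle to predict

h = W_in[context_ids].mean(axis=0)    # average: word order is ignored
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[target_id])      # cross-entropy for the target word
print(f"P(target | context) = {probs[target_id]:.4f}, loss = {loss:.4f}")
```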
Continuous Skip-gram Model
uses the current word as input to predict the words within a certain range before and after it
Q = C (max distance of the words) * (D + D * log_2(V))
e.g., if C = 5, for each training word a number R is selected at random from the range <1; C>, and the R words before and the R words after the current word are used as correct labels. This requires R * 2 word classifications, with the current word as input and each of the R + R surrounding words as output (see the sketch below).
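A sketch of the training-pair construction just described; the token list, C, and the random seed are illustrative:

```python
# Skip-gram training pairs: for each position, draw R uniformly from 1..C and
# pair the current word with the R words before and R words after it.
import random

def skipgram_pairs(tokens, C=5, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        R = rng.randint(1, C)                       # R in <1; C>
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))   # up to 2R classifications
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(tokens, C=2)[:6]:
    print(pair)
```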
Results
Task description
i. a comprehensive test set of five types of semantic questions and nine types of syntactic questions (analogies such as big : biggest :: small : ?), answered by algebraic operations on the word vectors; a question is counted as correct only if the closest vector corresponds exactly to the expected word
ii. evaluation: accuracy for each question type separately (see the sketch below)
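A sketch of how an analogy question is answered with vector algebra and cosine similarity, as described in the paper; the toy embeddings are constructed by hand so that the analogy holds, whereas real vectors would be learned:

```python
# Answer "a : b :: c : ?" by finding the word closest to vec(b) - vec(a) + vec(c).
import numpy as np

def answer_analogy(a, b, c, embeddings):
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):                 # exclude the question words
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy embeddings built so that "-est" is a shared direction (hypothetical).
rng = np.random.default_rng(0)
suffix = rng.normal(size=8)
base = {w: rng.normal(size=8) for w in ["big", "small", "cold"]}
embeddings = {**base,
              "biggest": base["big"] + suffix,
              "smallest": base["small"] + suffix,
              "coldest": base["cold"] + suffix}

# A question is correct only if the closest word matches exactly.
print(answer_analogy("big", "biggest", "small", embeddings))  # smallest
```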
Maximization of accuracy
Data: a Google News corpus for training the word vectors (6B tokens)
i. restricted the vocabulary size to the 1 million most frequent words
ii. first evaluated models trained on subsets of the training data, with the vocabulary restricted to the most frequent 30k words
iii. 3 training epochs with SGD and backpropagation; the learning rate starts at 0.025 and decreases linearly toward zero by the end of the last epoch (see the sketch below)
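A sketch of that linear learning-rate decay; the per-word schedule granularity and the small floor are assumptions added for illustration:

```python
# Linear learning-rate decay from 0.025 toward zero over the training epochs.
def learning_rate(words_seen, total_words, epochs=3, lr0=0.025):
    progress = words_seen / (epochs * total_words)   # fraction of training done
    return max(lr0 * (1.0 - progress), lr0 * 1e-4)   # small floor to avoid 0

total = 1_000_000
for frac in (0.0, 0.5, 1.0):
    lr = learning_rate(frac * 3 * total, total)
    print(f"{frac:.0%} done -> lr = {lr:.6f}")
```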
Comparison of model architectures
Data: several LDC corpora (320M words, 82k vocabulary)
Large scale parallel training of models
Several models trained on the Google News 6B data set, with mini-batch asynchronous gradient descent and the adaptive learning rate procedure (Adagrad)
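A minimal sketch of the Adagrad update rule (the adaptive learning-rate procedure named above), independent of the paper's distributed training setup; the learning rate and gradients are illustrative:

```python
# Adagrad: the per-parameter step shrinks with the accumulated squared gradient.
import numpy as np

def adagrad_step(params, grad, accum, lr=0.025, eps=1e-8):
    """One Adagrad update; `accum` holds the running sum of squared gradients."""
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

params = np.zeros(3)
accum = np.zeros(3)
for grad in (np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.1, 0.0])):
    params, accum = adagrad_step(params, grad, accum)
print(params)   # the effective step size shrinks as squared gradients accumulate
```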
Conclusion
It is possible to train high quality word vectors using very simple model architectures
It is possible to compute very accurate high dimensional word vectors from a much larger data set
The word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts