Goals
- improve skip-gram model in terms of speed and accuracy
- extend word representations beyond individual words, so that idiomatic phrases (e.g., Boston Globe) can also be represented
The original skip-gram model
- Skip-gram model: predicts the words surrounding the current word; training maximizes the average log probability of these context words (a small sketch follows this list)
- Hierarchical Softmax (HS): replaces the full softmax to cut the cost of normalizing over the entire vocabulary (roughly from O(V) to O(log V) per prediction)
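A minimal sketch of the skip-gram objective with a full softmax, assuming toy sizes and randomly initialized vectors (none of the names or numbers below are from the paper); it also shows where the cost that HS removes comes from:

```python
import numpy as np

# Toy sizes for illustration only; each word gets an "input" and an "output" vector.
vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, dim))   # input (center-word) vectors
W_out = rng.normal(size=(vocab_size, dim))  # output (context-word) vectors

def log_p(context, center):
    # log p(context | center) under the full softmax: the normalization term
    # sums over the whole vocabulary, which is the cost HS avoids by instead
    # walking a binary tree over the vocabulary (~log2(vocab_size) decisions).
    scores = W_out @ W_in[center]
    return scores[context] - np.log(np.sum(np.exp(scores)))

# Training maximizes the average of log_p over all (center, context) pairs
# found within a window around each position in the corpus.
print(log_p(context=3, center=7))
```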
Methods for improvement
- Negative sampling (NEG): reduces unnecessary computation by updating only the target word and a few sampled noise words, rather than every word in the vocabulary
i) Noise Contrastive Estimation (NCE): a good model should separate data from noise using logistic regression.
ii) NEG is a simplified version of NCE: NCE requires both samples and the numerical probabilities of the noise distribution, while NEG requires samples only
iii) NCE approximately maximizes the log probability of the softmax, whereas the skip-gram model only needs to learn high-quality vector representations; NEG is defined with that goal in mind (see the sketch below)
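A minimal sketch of the NEG objective for a single training pair, assuming the word vectors are already given as NumPy arrays (function names and sizes are illustrative; in the paper the k noise words are drawn from the unigram distribution raised to the 3/4 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(center_vec, context_vec, noise_vecs):
    # log sigma(v'_context . v_center) for the true (center, context) pair ...
    pos = np.log(sigmoid(context_vec @ center_vec))
    # ... plus sum_i log sigma(-v'_noise_i . v_center) over k sampled noise words,
    # so only k + 1 output vectors are updated instead of the whole vocabulary.
    neg = np.sum(np.log(sigmoid(-noise_vecs @ center_vec)))
    return pos + neg

rng = np.random.default_rng(0)
dim, k = 5, 5
print(neg_objective(rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=(k, dim))))
```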
- Subsampling of frequent words: very frequent words (e.g., the, a) carry little information, so this method randomly discards them during training
The more frequent a word is, the more likely it is to be discarded => this counters the imbalance between rare and frequent words, and it also speeds up training and improves the accuracy of the vectors (the discard rule is sketched below)
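A minimal sketch of the discard rule, using the paper's formula P(w) = 1 - sqrt(t / f(w)), where f(w) is a word's relative frequency and t is a threshold (around 1e-5 in the paper); the example frequencies are made up:

```python
import numpy as np

def discard_prob(freq, t=1e-5):
    # Probability of dropping a word with relative frequency `freq` from the
    # training text: frequent words are dropped often, words with freq <= t
    # are never dropped.
    return max(0.0, 1.0 - np.sqrt(t / freq))

print(discard_prob(0.05))   # a very frequent word like "the" -> ~0.99
print(discard_prob(1e-6))   # a rare word -> 0.0, always kept
```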
- Experiments (analogical reasoning task for words)
i) Task: given the first three words of an analogy, predict the fourth (e.g., Germany : Berlin :: France : ?); the test covers both syntactic and semantic analogies (a vector-arithmetic sketch follows this list)
ii) Data: an internal Google dataset of about one billion words (training data); discarding words that occur fewer than 5 times leaves a vocabulary of 692K words
iii) Result: (accuracy) NEG > HS, NEG > NCE; (speed) subsampling > the others / The linear structure of the skip-gram vectors suits this analogy test, but vectors learned by non-linear models also improve substantially on it when trained on much larger datasets
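A minimal sketch of how an analogy question is answered with the learned vectors; toy random embeddings stand in for trained ones here, whereas with real skip-gram vectors the nearest neighbour of Berlin - Germany + France is Paris:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy embeddings; real skip-gram vectors are learned from text.
emb = {w: rng.normal(size=50) for w in ["Germany", "Berlin", "France", "Paris", "river"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    # a : b :: c : ?  ->  nearest neighbour of vec(b) - vec(a) + vec(c),
    # excluding the three question words themselves.
    query = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))

print(analogy("Germany", "Berlin", "France"))
```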
- Word-based => phrase-based: phrases are identified with a simple data-driven approach that scores bigrams from their unigram and bigram counts (see the sketch below)
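A minimal sketch of the data-driven phrase scoring used to merge frequent bigrams into single tokens; the score and the discounting coefficient delta follow the paper, while the toy sentence and the chosen delta are made up (the paper runs several passes with decreasing score thresholds):

```python
from collections import Counter

def phrase_scores(tokens, delta=5):
    # score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj));
    # delta discounts very rare bigrams so they are not turned into phrases.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg: (c - delta) / (unigrams[bg[0]] * unigrams[bg[1]])
            for bg, c in bigrams.items()}

tokens = ("the boston globe and the new york times reported "
          "that the boston globe was sold").split()
scores = phrase_scores(tokens, delta=1)
# Bigrams whose score exceeds a chosen threshold become single tokens, e.g. "boston_globe".
print(max(scores, key=scores.get))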
- Experiments (analogical reasoning task for phrases)
i) Task: a new analogical reasoning task built from phrases (e.g., New York : New York Times :: Baltimore : Baltimore Sun)
ii) Results: NEG-15 (k = 15) > NEG-5 (k = 5); models trained with subsampling > those without; HS-Huffman with subsampling > HS-Huffman without
iii) With about 33 billion training words, HS, dimensionality = 1,000, and the entire sentence as context, accuracy reached 72%; the combination of HS and subsampling gave the best performance
Additive compositionality
The skip-gram vectors compose meaningfully: element-wise addition of two word vectors often yields a vector close to a semantically related word or phrase (a small sketch follows)
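A minimal sketch of the idea, again with hypothetical toy embeddings; the paper's example is that vec(Russia) + vec(river) lies close to vec(Volga River) when the vectors are trained on real text:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy embeddings; with vectors trained on real text,
# emb["Russia"] + emb["river"] ends up close to emb["Volga_River"].
emb = {w: rng.normal(size=50) for w in ["Russia", "river", "Volga_River"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

composed = emb["Russia"] + emb["river"]       # element-wise addition
print(cosine(composed, emb["Volga_River"]))   # high for trained vectors
```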