Thursday, March 31, 2022

[Paper Review - NLP] Mikolov et al., 2013 (CBOW & Skip Gram)

 

Efficient estimation of word representations in vector space (Mikolov et al., 2013)

Introduction

Traditional n-gram model

  • Principle: prediction of next word based on previous n-1 words; probability of a sentence obtained by multiplying the probability (the n-gram model) of each word

    e.g., "This is a puppy"

    n = 1 (unigram): [This] [is] [a] [puppy] (this, is, a puppy)

    n = 2 (bigram): [This is] [is a] [a puppy] (this is, is a, a puppy)

    n = 3 (trigram): [This is a] [is a puppy] (this is a, is a puppy)

  • Limits:

    i. insufficient amount of in-domain data (for ASR)

    ii. curse of dimensionality

    iii. difficulty of generalization (because the n-gram model has discrete space)

    iv. poor performance on word similarity tasks

Distributed representations of words as (continuous) vectors (vs. one-hot encoding)

e.g., queen → | 0.313 | 0.123 | 0.326 | 0.128 | 0.610 | 0.415 | 0.120 |

number of parameters to learn: the vocabulary size V * the vector dimensionality m (here, m = 7)

  • Previous work on representation of words as continuous vectors: Feedforward NNLM
  • Motivation and goal: the need to train more complex models on much larger data sets (previous architectures used a modest word-vector dimensionality of 50-100 and were trained on no more than a few hundred million words); building a model that performs better on word similarity tasks (capturing multiple degrees of similarity)

Model architectures

Some terms

  • Computational complexity of a model: the number of parameters that need to be accessed to fully train the model

  • Training complexity is proportional to

    💡 O = E (number of training epochs; usually 3-50) * T (number of words in the training set; up to 1 billion) * Q (defined depending on model architecture)

Non-linear NN models

  • Feedforward NNLM

    i. components: input, projection, hidden, and output layers

    ii. learning the joint probability distribution of word sequences together with word feature vectors, using a feedforward NN

    iii. input, output, and evaluation

    input: the concatenated feature vectors of the previous words

    output: the estimated probability distribution over the next word (from which the probability of an n-word sequence follows)

    evaluation: a softmax function over the output layer

    iv. computational complexity per each training example

    💡 Q = N * D + N * D * H + H * V, where N is the number of previous words (1-of-V coded), D the projection layer dimensionality, H the hidden layer size, and V the vocabulary size

    The number of output units can be reduced to about log_2(V) using a hierarchical softmax.

  • Recurrent NNLM: RNN-based NNLM

    i. components: input (the current word vector & hidden neuron values of the previous word), hidden, and output layers

    ii. predicting the current word

    iii. computational complexity per each training example

     💡 Q = H * H + H * V

    As with the feedforward NNLM, the number of output units can be reduced to about log_2(V) using a hierarchical softmax.

    For more on hierarchical softmax:

    http://building-babylon.net/2017/08/01/hierarchical-softmax/


  • New log-linear models

    i. The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model

    ii. While this non-linearity is what makes NNs so attractive, the authors explore simpler models that may not represent the data as precisely as NNs but can be trained on much more data far more efficiently

    iii. The two architectures

    Continuous Bag-of-words Model

    Similar to the feedforward NNLM, but the non-linear hidden layer is removed and the projection layer is shared for all words; the surrounding context words are used to predict the current (middle) word

    Q = N * D + D * log_2(V)

    Continuous Skip-gram Model

    uses the current word as input to predict the surrounding words within a certain range

    Q = C * (D + D * log_2(V)), where C is the maximum distance of the words

    If C = 5, for each training word a number R in the range <1; C> is selected at random, and then the R previous and R following words of the current word are used as correct labels. This requires R * 2 word classifications, with the current word as input and each of the R + R words as output.
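    As a concrete illustration of the skip-gram setup described above, here is a minimal Python sketch that generates (input, output) word pairs for a tokenized sentence. The function name and the toy sentence are my own; only the sampling of R in <1; C> follows the paper's description. CBOW reverses the roles, using the surrounding words to predict the current word.

```python
import random

def skipgram_pairs(tokens, C=5):
    """Generate (current word, context word) training pairs.

    For each position, a window radius R is sampled uniformly from <1; C>,
    and the R previous and R following words serve as correct labels."""
    pairs = []
    for i, center in enumerate(tokens):
        R = random.randint(1, C)
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("this is a puppy".split(), C=2))
# e.g. [('this', 'is'), ('is', 'this'), ('is', 'a'), ('a', 'is'), ('a', 'puppy'), ...]
```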

Results

    Task description

    i. a comprehensive test set that contains 5 types of semantic questions and 9 types of syntactic questions (overall, 8869 semantic and 10675 syntactic questions)

    ii. evaluation: accuracy for each question type separately

    Maximization of accuracy

    Data: a Google News corpus for training the word vectors (6B tokens)

    i. restricted the vocabulary size to the 1 million most frequent words

    ii. first evaluated models trained on subsets of the training data (the most frequent 30k words)

    iii. 3 training epochs with SGD and backpropagation; the learning rate starts at 0.025 and decreases linearly

    Comparison of model architectures

    Data: several LDC corpora (320M words, 82k vocabulary)

  • Large-scale parallel training of models

    Several models were trained on the Google News 6B data set using mini-batch asynchronous gradient descent and an adaptive learning rate procedure (AdaGrad)

    Examples of the learned relationships (e.g., vector("King") - vector("Man") + vector("Woman") is closest to vector("Queen"))

Conclusion

    It is possible to train high quality word vectors using very simple model architectures

    It is possible to compute very accurate high dimensional word vectors from a much larger data set

    The word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts


[Python Code - NLP] Sentiment Analysis of Movie Reviews in Korean Using Keras and PyTorch


In this posting, I am going to compare Keras and PyTorch by performing sentiment analysis of movie reviews in Korean, provided by NAVER. It may be more reasonable to compare Tensorflow and PyTorch, because Keras is a high-level deep learning API for Tensorflow while PyTorch is an independent deep learning framework. Nevertheless, Keras is used here because simple tasks, such as classification, are easier to handle with it. I will also briefly explain what Keras and PyTorch are, but as this is a practical tutorial, see the official documentation of Keras and PyTorch, or other materials on them, for more detailed information on their architecture.


Keras (with Tensorflow)

Keras is a high-level deep learning API for Tensorflow, which provides high-level features for building deep learning models. This means that Keras itself does not deal with low-level computations, such as tensor manipulation and differentiation. Rather, it delegates them to various backend engines, such as Tensorflow, CNTK, etc. Keras consists of many independent modules: there are separate modules for neural layers, cost functions, optimizers, activation functions, and the like, and models are built by combining them. This is a brief summary of Keras; for more information, read Deep Learning with Python by François Chollet.

Loading Packages and Data (NAVER movie reviews)

Now, I am going to perform a sentiment analysis of movie reviews in Korean, provided by NAVER, using Keras. This corpus consists of train and test dataset files. The corpus can be loaded directly from the GitHub source using urllib, but here it was downloaded to a local machine from [here](https://github.com/e9t/nsmc). Both the train and test datasets were loaded, along with the necessary packages, as below. As illustrated, the train dataset consists of a document id, the document, and a label.

The label is binary-encoded: 1 for positive and 0 for negative. After a brief look at the data, the class distribution was examined, because it is not adequate to do a classification task with imbalanced data. As in cells [9] and [10], positive and negative reviews are well balanced. The null values were then removed.
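Below is a minimal sketch of this loading step, assuming the ratings_train.txt and ratings_test.txt files from the NSMC repository have been downloaded to the working directory (the exact file names used in the original notebook may differ).

```python
import pandas as pd

# Load the tab-separated NSMC files (columns: id, document, label)
train_data = pd.read_table('ratings_train.txt')
test_data = pd.read_table('ratings_test.txt')

# Check class balance (cf. cells [9] and [10]) and drop null reviews
print(train_data['label'].value_counts())
train_data = train_data.dropna(how='any')
test_data = test_data.dropna(how='any')
```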


Data Preprocessing with Korean NLP in Python (KoNLPy)

The next step was to preprocess the data. Non-Hangeul strings were removed, as illustrated in cell [13]. Whitespace-only entries were also changed into empty values and then removed, as in cell [14]. The same processing was applied to the test dataset, as illustrated in cell [15]. Then, the [Korean stop word list](https://bab2min.tistory.com) was loaded, with which stopwords were removed. Thereafter, all the reviews were tokenized using Mecab in KoNLPy. There are other tokenizers (or morphological parsers) available in KoNLPy. Among them, Okt (Twitter) is widely used as it provides stemming as an option, but in terms of speed, Mecab is second to none, so it was used here. (Previously, I tested various morphological parsers and compared which one is good for which task; I will write a post about this if I have some time.)
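A rough sketch of these preprocessing steps is shown below, reusing the train_data and test_data frames from the loading sketch. The stop-word list here is only a small illustrative subset (the post uses the full list linked above), and the column name `tokenized` is my own.

```python
import re
from konlpy.tag import Mecab

mecab = Mecab()

# Illustrative subset of Korean stopwords; the post uses a longer published list
stopwords = ['의', '가', '이', '은', '들', '는', '좀', '잘', '과', '도', '를', '으로', '에', '와', '한', '하다']

def preprocess(text):
    # Keep only Hangeul characters and spaces (cf. cell [13])
    text = re.sub(r'[^ㄱ-ㅎㅏ-ㅣ가-힣 ]', '', text)
    # Tokenize with Mecab and drop stopwords
    return [tok for tok in mecab.morphs(text) if tok not in stopwords]

train_data['tokenized'] = train_data['document'].apply(preprocess)
test_data['tokenized'] = test_data['document'].apply(preprocess)
```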

Integer Encoding and Padding

Thereafter, all the tokens were encoded as integers using the fit_on_texts method of the Keras Tokenizer, which returned a frequency-ranked vocabulary dictionary, as in cell [21]. Tokens that occurred fewer than three times in the corpus were discarded, as in cell [22], after which the integer encoding was applied again. Then, texts_to_sequences was applied to both the train and test datasets. Now each review consists of the indices of its tokens, as in cell [25], and the labels were transformed into an np.array, as in cell [26]. The very last step of preprocessing was sentence padding: the maximum review length was set to 35, which covers around 94% of the train data, as in cell [30], and all the reviews were padded to this same maximum length.
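Below is a sketch of the encoding and padding steps, assuming the `tokenized` columns produced in the previous sketch.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# First pass: build a frequency-ranked vocabulary (cf. cell [21])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data['tokenized'])

# Drop tokens that occur fewer than three times (cf. cell [22])
rare = sum(1 for cnt in tokenizer.word_counts.values() if cnt < 3)
vocab_size = len(tokenizer.word_index) - rare + 1   # +1 for the padding index 0

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_data['tokenized'])

X_train = tokenizer.texts_to_sequences(train_data['tokenized'])
X_test = tokenizer.texts_to_sequences(test_data['tokenized'])
y_train = np.array(train_data['label'])
y_test = np.array(test_data['label'])

# Pad every review to the maximum length of 35 (covers ~94% of the train data)
max_len = 35
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
```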

Implementations of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)

The next step was to train a model on the data. This time, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, both types of Recurrent Neural Networks (RNNs), were adopted because the movie review data are sequential. (It is also possible to do text classification using Convolutional Neural Networks (CNNs), following Yoon Kim, 2014, but if the data size is big enough, an RNN model is recommended.) Instead of vanilla RNNs, LSTM and GRU were chosen because RNNs suffer from the vanishing gradient problem as the number of timesteps increases. LSTM was devised to fix this problem by adding a cell state and gates (i.e., forget, input, and output gates); thus, it is capable of learning long-term dependencies (for more information on the architecture of LSTM, read Hochreiter & Schmidhuber, 1997). GRU, a simplification of LSTM, is also capable of learning long-term dependencies. Instead of the three gates of LSTM, only two gates (i.e., reset and update gates) are used, which gives it a simpler structure (for more information on the architecture of GRU, read Cho et al., 2014).

To build the input, hidden, and output layers, Sequential() from Keras was called. The other layers needed to be defined separately, so the embedding, LSTM, and fully-connected layers were set as in cell [32]. The embedding layer was set with our vocabulary size, and the dimension of the output layer in Dense() was set to 1, as this is a binary classification. Thereafter, EarlyStopping and ModelCheckpoint were set as callback functions, as in cell [33]. Then, Adam was adopted as the optimizer, binary cross-entropy was used as the loss function, and accuracy was used as the metric. Lastly, 20% of the train data were used as validation data, the batch size was set to 64, and the number of epochs to 15. As illustrated in cell [34], training stopped after 6 epochs; it took around 6 minutes, and the accuracy was around 84%. A GRU model was also built with the same parameter settings; it stopped after 5 epochs, took around 5 minutes to train, and likewise reached an accuracy of around 84%.
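Below is a sketch of the LSTM variant, reusing vocab_size, X_train, and y_train from the earlier sketches; the embedding dimension and the number of hidden units are illustrative values, not necessarily those used in the original notebook. Swapping LSTM for GRU gives the second model.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

embedding_dim, hidden_units = 100, 128   # illustrative sizes

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(LSTM(hidden_units))
model.add(Dense(1, activation='sigmoid'))        # binary classification

# Stop early when validation loss stops improving, keep the best checkpoint
es = EarlyStopping(monitor='val_loss', mode='min', patience=3)
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', save_best_only=True)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train, epochs=15, batch_size=64,
                    validation_split=0.2, callbacks=[es, mc])
```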



PyTorch

PyTorch is a deep learning framework by Facebook for various machine/deep learning tasks. As briefly mentioned above, it may be more reasonable to compare it with Tensorflow in terms of architecture, because Keras does not deal with low-level computations by itself; to use an analogy, PyTorch and Tensorflow are like a stick shift while Keras is like an automatic. Thus, in this section, I briefly explain the major architectural differences between PyTorch and Tensorflow, and then compare PyTorch and Keras based more on hands-on experience by performing the same sentiment analysis of the NAVER movie reviews as above.

PyTorch vs. Tensorflow in terms of architecture (rather than Keras)

PyTorch and Tensorflow are similar in that they both utilize the Graphics Processing Unit (GPU) for computation, operate on tensors, and view models as Directed Acyclic Graphs (DAGs). However, they differ in how the computation graph is defined. It is widely recognized that Tensorflow follows a define-and-run idiom while PyTorch follows define-by-run. The former implies that the graph must be defined before the model runs (thus a static graph), while the latter means that the graph is defined as the model runs, so the model can change while it executes (thus a dynamic graph). Another difference (actually more of an advantage of PyTorch) is related to debugging. It is often said that it is sometimes difficult to figure out where errors come from (e.g., backend parts or more model-specific ones) when using Tensorflow. On the other hand, debugging is easier in PyTorch because it is essentially more pythonic and thus gives easy access to the code. Based on what has been explained so far, it seems that PyTorch is the winner. However, there are some advantages of using Tensorflow over PyTorch. One of them is that Tensorflow has a larger user community than PyTorch, so if you are stuck on something, a tremendous number of Tensorflow users can help you!
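As a toy illustration of define-by-run (not taken from the post), ordinary Python control flow can alter the computation graph on every forward pass:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x):
        h = self.fc(x)
        if h.sum() > 0:      # data-dependent branch decided while the graph is built
            h = self.fc(h)   # the layer is applied a second time only for some inputs
        return h

out = DynamicNet()(torch.randn(2, 4))
```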

PyTorch Installation

As this was the first time for me to use PyTorch, I had to install it, which was easy. If you go to the official homepage and click on your operating system and the like, it will give you the commands for installation; all you need to do is type them in your terminal (for Mac users). As the NVIDIA CUDA toolkit no longer supports macOS, CPU should be chosen if your macOS version is above 10.13, as below. I installed PyTorch with that command via Terminal. In addition, torchtext was installed via Terminal for data handling, and pytorchtools was installed for early stopping.

Loading Packages and Data (NAVER movie reviews) & Hyperparameter Setting

The same NAVER movie review data were used for the PyTorch model, and the necessary packages were loaded as below. Unlike with Keras, several hyperparameters needed to be set up front, as in cell [2]. The batch size was set to 64 (the same as above), the learning rate to 0.001, and the number of epochs to 10. (It should have been set to 15 as above, but since I failed to apply EarlyStopping to the PyTorch model because of a loading error, I changed it to 10 for fear of taking too much time.) In addition, the device needed to be set, and as CUDA is not available on my local machine, the CPU was used (CUDA can be used with Colab). The data were loaded, and after removing null values, they were saved as csv files, as in cells [4] and [6]. This treatment is purely for practical convenience.
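Here is a sketch of the setup described in cell [2]; the variable names are my own.

```python
import torch

BATCH_SIZE = 64       # same as the Keras model
LEARNING_RATE = 0.001
EPOCHS = 10           # reduced from 15 because EarlyStopping could not be applied

# Use the GPU only if CUDA is available; on this machine the CPU is used
USE_CUDA = torch.cuda.is_available()
DEVICE = torch.device('cuda' if USE_CUDA else 'cpu')
```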

Data Preprocessing with Torchtext

The next step was to preprocess the data, as was done above. For this, torchtext was used, with which the necessary preprocessing steps, such as tokenization and padding, can be done in one pass because they are provided as parameters, as seen in cell [8]. As with the Keras model, the maximum review length was set to 35, and Mecab was used as the tokenizer. batch_first was set to True, meaning that the mini-batch dimension comes first, and sequential was set to True for the text and to False for the label. Then, the fields were defined as in cell [9]. With this field setting, both the train and test data were preprocessed using TabularDataset, as in cell [10]. Thereafter, a vocabulary was built using build_vocab; as with the Keras model, only words that occurred at least three times were included. Both texts and labels were encoded as integers, as in cell [13].
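A sketch of the torchtext pipeline is given below. It assumes the legacy Field/TabularDataset API (on older torchtext versions the import is simply `from torchtext import data`), and the csv file names are my assumption for the files saved in cells [4] and [6].

```python
from konlpy.tag import Mecab
from torchtext.legacy import data   # older versions: from torchtext import data

mecab = Mecab()

# Tokenization, padding to length 35, and batch-first tensors are all
# handled by the Field parameters (cf. cell [8])
TEXT = data.Field(sequential=True, use_vocab=True, tokenize=mecab.morphs,
                  batch_first=True, fix_length=35)
LABEL = data.Field(sequential=False, use_vocab=False, is_target=True)

fields = [('id', None), ('document', TEXT), ('label', LABEL)]

# Assumed file names for the csv files saved earlier
train_data, test_data = data.TabularDataset.splits(
    path='.', train='train.csv', test='test.csv',
    format='csv', fields=fields, skip_header=True)

# Keep only tokens that occur at least three times
TEXT.build_vocab(train_data, min_freq=3)
```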

Implementation of a Gated Recurrent Unit (GRU)

Thereafter, 20% of the train data were set aside as validation data, as in cell [17]. Three iterators for the train, validation, and test data were built for batch learning using BucketIterator, as in cell [18]. After checking a few batches, the iterators were re-loaded, as in cell [22]. Now, we need to build a GRU model to train on our data. This time, instead of using both LSTM and GRU, only a GRU model was used due to lack of time. The GRU class consists of __init__, forward, and init_state functions, as in cell [23]. Parameters were initialized in __init__ using nn.Module. In the forward function, the first hidden state is set to a zero vector; the GRU returns a tensor of shape (batch size, sequence length, hidden state size), and only the hidden state at the last time step is kept (for instance, if the tensor size is [3, 5, 7] and x[:, -1, :] is applied to it, it returns [3, 7]). Lastly, the init_state function was built for initializing the hidden state. A GRU model was instantiated, and Adam was used as the optimizer, as in cell [24].
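Below is a sketch of the class and the optimizer setup, reusing TEXT, DEVICE, and LEARNING_RATE from the earlier sketches; the embedding and hidden dimensions are illustrative values.

```python
import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, n_layers, hidden_dim, n_vocab, embed_dim, n_classes):
        super().__init__()
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(n_vocab, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        x = self.embed(x)                              # [batch, seq_len, embed_dim]
        h_0 = self.init_state(x.size(0)).to(x.device)  # zero vector as the first hidden state
        x, _ = self.gru(x, h_0)                        # [batch, seq_len, hidden_dim]
        h_t = x[:, -1, :]                              # hidden state at the last time step
        return self.out(h_t)

    def init_state(self, batch_size):
        return torch.zeros(self.n_layers, batch_size, self.hidden_dim)

model = GRU(n_layers=1, hidden_dim=256, n_vocab=len(TEXT.vocab),
            embed_dim=128, n_classes=2).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```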

The two functions train and evaluate were defined as in cells [25] and [26]. Cross-entropy, computed on log-probabilities, was used as the loss function, and the accuracy rate was used for model assessment. The model was trained as in cell [27]: it took around 22 minutes over 10 epochs, and the test accuracy was around 86%.
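Here is a sketch of the two functions, assuming the iterators yield batches with .document and .label attributes as defined in the fields above and reusing DEVICE from the setup sketch.

```python
import torch
import torch.nn.functional as F

def train(model, optimizer, train_iter):
    model.train()
    for batch in train_iter:
        x, y = batch.document.to(DEVICE), batch.label.to(DEVICE)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)   # cross entropy over the two classes
        loss.backward()
        optimizer.step()

def evaluate(model, val_iter):
    model.eval()
    total_loss, correct = 0, 0
    with torch.no_grad():
        for batch in val_iter:
            x, y = batch.document.to(DEVICE), batch.label.to(DEVICE)
            logits = model(x)
            total_loss += F.cross_entropy(logits, y, reduction='sum').item()
            correct += (logits.argmax(dim=1) == y).sum().item()
    n = len(val_iter.dataset)
    return total_loss / n, correct / n
```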





Keras (with Tensorflow) vs. PyTorch based on hands-on experience

The aim of this posting was to compare Keras (with Tensorflow) and PyTorch, for which sentiment analyses of movie reviews in Korean were performed. Based on the time taken to train, Keras is the winner (Keras-GRU: around 5 minutes over 5 epochs vs. PyTorch-GRU: around 20 minutes over 10 epochs), but based on accuracy (Keras-GRU: 84% vs. PyTorch-GRU: 86%), PyTorch is the winner. However, it may not be legitimate to judge which is better based on model accuracy and training time, because the two setups were not exactly comparable. Specifically, the discrepancies might have been caused by the following: stopwords were not removed in the PyTorch model, and EarlyStopping failed to be applied to it. So, instead, I offer some personal feedback on each framework.

Personally, it was easier to perform the task with Keras than with PyTorch, mainly because I have some experience with Keras while this was my first time using PyTorch, and partly because Keras seems to require less fine-tuning. Moreover, whenever I was stuck, it was easy to find solutions on Stack Overflow. Nevertheless, I strongly felt that PyTorch was more pythonic and object-oriented, and thus, once a model is well established, it can be reused as a kind of template. In a nutshell, both definitely have their own advantages over the other, so it is a matter of taste!




References
Keras architecture: Deep Learning with Python by François Chollet
Keras official documentation: <https://keras.io/api/>
PyTorch official documentation: <https://pytorch.org/tutorials/>
LSTM (Hochreiter & Schmidhuber, 1997): <https://www.bioinf.jku.at/publications/older/2604.pdf>
GRU (Cho et al., 2014): <https://arxiv.org/pdf/1406.1078.pdf>
Text classification with CNNs (Yoon Kim, 2014): <https://arxiv.org/pdf/1408.5882.pdf>
A deep learning tutorial book in Korean: Deep Learning from Scratch (Korean edition) by Saito Goki
A deep learning tutorial book in Korean: <https://wikidocs.net/book/2788>