Transformer models have become the go-to architecture for NLP tasks. In this article, we will walk through the end-to-end process of training robust NLP models with the Transformer architecture, following a top-down approach. We generally use a semi-supervised approach to train an NLP model, whether for classification or generation.
We will mainly discuss fine-tuning BERT and text classification. Let’s start with the overall picture and then get into the details.
For data augmentation with text, read this article.
Understanding the Semi-Supervised Transformers Architecture
Training a semi-supervised model requires two steps: first, fine-tune a language model (BERT) on unlabeled domain text; second, train a task model (classification or named entity recognition) on top of it.
The diagram below is a GIF of semi-supervised training. First, we fine-tune BERT, then we train a text classifier or a named entity recognition model. It gives a visual idea of the link between fine-tuning the language model and then using the fine-tuned model for classification.
Let’s go over all the components of the above diagram:
The main job of the embedding layer is to convert the textual input into a format the model understands. It has three main components: token embeddings, segment embeddings, and position embeddings.
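As a rough sketch of the idea, the embedding layer can be thought of as summing three lookup tables, one per component, for each input position. The sizes below are toy values for illustration; BERT-base uses a 30,522-word vocabulary, 512 positions, and 768 features.

```python
import random

random.seed(0)

# Toy sizes; BERT-base uses 30522 / 512 / 2 / 768.
VOCAB, MAX_LEN, SEGMENTS, HIDDEN = 100, 16, 2, 8

def make_table(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

token_emb = make_table(VOCAB, HIDDEN)      # one row per vocabulary word
position_emb = make_table(MAX_LEN, HIDDEN) # one row per position in the sentence
segment_emb = make_table(SEGMENTS, HIDDEN) # one row per sentence segment

def embed(token_ids, segment_ids):
    """Embedding layer: for each position, sum the token, position,
    and segment embeddings into one vector."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        out.append([token_emb[tok][d] + position_emb[pos][d] + segment_emb[seg][d]
                    for d in range(HIDDEN)])
    return out

vectors = embed([5, 17, 42], [0, 0, 0])  # 3 tokens, all in segment 0
```

Each input token comes out as one HIDDEN-dimensional vector, ready for the encoder layers.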
There are a total of 12 BERT layers in this encoder stack (for BERT-base).
Each layer has two main components: a multi-head self-attention block and a position-wise feed-forward network.
Although this is not a formal definition, the combined output of the two layers above, the embedding and encoder layers, is a matrix with 768 features per token that captures the context of the sentence, so we will refer to it as the context vector. This is the most important part of the model: all the training done in the unsupervised phase adjusts the weights so that these context vectors are computed correctly.
In models such as RNNs and LSTMs, the input is static word embeddings, which by themselves cannot capture the context of the sentence, so a neural network is trained on top of the embedding layer to capture it. Here, the output already captures context in 768 features. This output can be passed as input to classification or generation models, making the overall model highly robust and accurate.
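The mechanism that makes these vectors context-dependent is self-attention. A minimal sketch, single-head and without the learned query/key/value projections real Transformer layers use, shows the core idea: each output vector is a weighted mix of every input vector, so a word's representation changes with its neighbours, unlike a static embedding.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy self-attention: for each token, score it against all tokens,
    turn the scores into weights, and mix all token vectors accordingly.
    The output for a token therefore depends on the whole sentence."""
    d = len(vectors[0])
    out = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

ctx = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because each output is a convex combination of the inputs, changing any word in the sentence changes every token's output vector, which is exactly the contextual behaviour static embeddings lack.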
This is the unsupervised part of the training. Additional convolutional layers are added on top of the context vectors. The final output of this model is a vector of vocabulary size, i.e. 30,522: the model must predict a word from the existing vocabulary. We feed it millions of unlabeled sentences and let it adjust its weights until it produces appropriate context vectors.
There are two ways to train a language model:
Masked Language Model:
We mask 10–30% of the input sentence and ask the model to predict the masked words. As you can see in the diagram below, the input sentence is “women floral printed [MASK] top”.
Convolutional layers are added on top of the encoder layer to predict the missing word.
The model’s output is a vector of vocabulary size (30,522 in this case) used to predict the missing word. The loss is computed from the model’s prediction, and the weights are adjusted accordingly.
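The two moving parts of MLM training, masking the input and scoring the prediction, can be sketched in a few lines. This is a simplified illustration, not the Hugging Face implementation: the vocabulary here has only four words, and positions that are not masked are simply skipped when computing the loss.

```python
import math
import random

random.seed(7)
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly hide a fraction of tokens; each hidden original word
    becomes the training target at that position."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)    # the model must recover this word
        else:
            masked.append(tok)
            targets.append(None)   # this position is not scored
    return masked, targets

def mlm_loss(logits, target_index):
    """Cross-entropy at one masked position: softmax over the whole
    vocabulary, then negative log-probability of the true token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

masked, targets = mask_tokens("women floral printed crop top".split(), mask_prob=0.3)
```

A prediction that puts more weight on the true token yields a lower loss, which is the signal used to adjust the weights.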
After training a BERT language model with the Masked Language Model technique on 20 million sentences from the fashion domain, you can clearly observe the difference in the GIF below.
Input: women [MASK] top
BERT (Fashion): women floral printed crop top
BERT (Original): women who on were top
Causal Language Model:
With CLM, the model tries to predict the next word from the existing vocabulary and compares it with the true output. CLM is generally used for generative models, i.e. GPT architectures. We are not covering GPT-based architectures in this article, but the GIF below shows the difference between a fashion-specific GPT-2 model and the GPT-2 model with its original weights.
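To make the "predict the next word" objective concrete, here is the idea reduced to its simplest possible form: a bigram count model with greedy decoding. The three-sentence "fashion corpus" is made up for illustration; a real causal LM conditions on the full prefix with a Transformer instead of just the previous word.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus standing in for real fashion-domain training data.
corpus = [
    "women floral printed crop top",
    "women floral printed maxi dress",
    "men printed cotton shirt",
]

# The causal-LM objective, reduced to counting: at each step, predict the
# next word given the words so far (here, only the previous word).
next_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_counts[prev][nxt] += 1

def generate(start, n=4):
    """Greedy decoding: repeatedly append the most likely next word."""
    out = [start]
    for _ in range(n):
        options = next_counts.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])
    return " ".join(out)
```

A domain-specific model differs from a general one exactly as in the GIF: the counts (weights) learned from fashion text push generation toward fashion phrases.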
Once BERT/GPT is fine-tuned on the custom dataset, the second part is to train the classification and generation models on top of it.
Although the weights of the entire model get updated during training, for our purposes we can assume that the context vector is now the input to these models.
It is now a simple classification problem. In the diagram below, dense layers are added on top of the context vectors, finally reducing the features to the number of classes. The better the context-layer output, the more robust and accurate the model will be.
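A minimal sketch of that classification head, assuming a toy 8-feature context vector (BERT-base gives 768) and three hypothetical fashion categories: one dense layer maps the context vector to class logits, and a softmax turns them into probabilities.

```python
import math
import random

random.seed(0)
HIDDEN, NUM_CLASSES = 8, 3                # toy sizes; BERT-base context vectors have 768 features
LABELS = ["shirts", "jackets", "jeans"]   # hypothetical category labels

# One dense layer: weight matrix (classes x features) plus a bias per class.
W = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

def classify(context_vector):
    """Dense layer reduces 'HIDDEN' features to 'NUM_CLASSES' logits;
    softmax converts logits to class probabilities."""
    logits = [sum(w * x for w, x in zip(row, context_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return LABELS[probs.index(max(probs))], probs

label, probs = classify([random.uniform(-1, 1) for _ in range(HIDDEN)])
```

In the real model this head is trained jointly with (or on top of) the fine-tuned encoder, so better context vectors translate directly into better class probabilities.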
When working with Transformers, we can also obtain word-level confidence scores showing how each word contributed to the prediction of a given class.
For the sentence pantaloons green shirt with jacket, wear it with black jeans, we are trying to predict a fashion category. The sentence contains several categories, including shirts, jackets, and jeans. As you can see in the first example, the model correctly predicts shirts, with the word shirt having the highest confidence. In the other examples, we changed the ground truth to different labels and observed the behavior: as the label changes, the word-level confidences change too, automatically pointing to the words that contribute positively to the new label. For example, when we change the true label to jeans in the second example, the confidence of the word jeans becomes positive while words such as shirt and jacket become negative. This is the beauty of self-attention models.
Named Entity Recognition
As with text classification, we treat the context vectors as input and add layers for the NER task. The NER model predicts a label for each input word.
Since the data is fashion specific, the labels are granular. NER also helps in phrase discovery, which in turn leads to trend discovery. There are many methods for NER, but the scope of this article is the semi-supervised approach.
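The structural difference from classification can be sketched in a few lines: instead of pooling the sentence into one prediction, the same dense head is applied to every token's context vector, yielding one label per word. The tag set and the random vectors below are made up for illustration.

```python
import random

random.seed(1)
HIDDEN = 8                                 # toy size; BERT-base uses 768
LABELS = ["O", "B-CATEGORY", "B-COLOR"]    # hypothetical granular fashion tags

# One shared dense head: a row of weights per label.
W = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in LABELS]

def tag_tokens(context_vectors):
    """Token classification: score each token's context vector against
    every label and pick the best, producing one tag per input word."""
    tags = []
    for vec in context_vectors:
        logits = [sum(w * x for w, x in zip(row, vec)) for row in W]
        tags.append(LABELS[logits.index(max(logits))])
    return tags

sentence = ["women", "green", "shirt"]
vectors = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in sentence]
tags = tag_tokens(vectors)
```

Because each tag decision reads a context vector rather than a static embedding, the same surface word (e.g. shirt as a product vs. part of a phrase) can receive different tags in different sentences.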
Now that we have gone through all the components, I suggest you look again at the Transformers end-to-end training GIF.
The code for training is available on Hugging Face.
In this article, I have tried to give an idea of the semi-supervised training we can do with Transformers to train highly accurate models. Let me know your thoughts.