This project uses NLP techniques to classify whether tweets are sarcastic. A BERT model is trained on labeled tweets and then used to make the predictions.
The executable code is in the file Sentiment_Analysis_with_BERT.ipynb, which is meant to be run directly in Google Colab. Click on the button below to open the file in Colab.
Once in Colab, the code needs to run on a GPU. Navigate to Edit > Notebook settings and select GPU from the Hardware accelerator dropdown.
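To confirm that the GPU runtime is active, a quick check along these lines can be run in a notebook cell. This is a minimal sketch and not part of the original notebook; the helper name `list_gpus` is hypothetical, while `tf.config.list_physical_devices` is TensorFlow's own API.

```python
def list_gpus():
    """Return the GPU devices TensorFlow can see (an empty list if none)."""
    # Imported inside the function so that defining the helper stays cheap.
    import tensorflow as tf

    return tf.config.list_physical_devices("GPU")
```

If `list_gpus()` returns an empty list, the hardware accelerator setting has probably not taken effect yet.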
Run the notebook by executing all the code blocks in order, clicking the black 'Play' button at the top of each block.
When the run finishes, all the predictions are stored in answer.txt in the output folder of the workspace.
A video tutorial is available HERE
This project uses BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art machine learning model for NLP tasks. BERT is pre-trained on large amounts of text and can be fine-tuned to solve text classification problems such as this one. It is built on the Transformer, an attention-based architecture that learns contextual relations between words (or sub-words) in a text.
The Hugging Face Transformers library provides the BERT model in a form that works with TensorFlow.
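As a rough illustration of what loading BERT through this library can look like, here is a minimal sketch. It is not the notebook's actual code: the checkpoint name `bert-base-uncased`, the helper names `load_bert` and `encode`, and the `max_length` of 128 are all assumptions, and it presumes an installed transformers version that still ships the TensorFlow classes.

```python
def load_bert(model_name="bert-base-uncased", num_labels=2):
    """Return a (tokenizer, model) pair for two-class sequence classification."""
    # Imported inside the function because both packages are heavy to load.
    # Assumption: the checkpoint name is illustrative, not taken from the notebook.
    from transformers import BertTokenizer, TFBertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = TFBertForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=128):
    """Turn raw strings into padded input IDs and attention masks (TF tensors)."""
    return tokenizer(
        list(texts),
        padding="max_length",   # pad every tweet to the same length
        truncation=True,        # cut tweets longer than max_length
        max_length=max_length,
        return_tensors="tf",
    )
```

The tokenizer splits each tweet into the sub-word units that BERT expects, and the resulting tensors can be passed to the model for training or prediction.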
At a high level, the code works as follows:
- Copy the testing and training data from GitHub to the Colab workspace.
- Read the testing and training data from the JSONL files and convert them to CSV files.
- Clean the input data by removing URL and USER tags from the tweets.
- Split the training dataset into training and validation sets. Both are used when training the model.
- Extract only the required columns for further processing.
- Create the BERT model and tokenizer.
- Convert the training and validation data into the BERT input format using the helper functions defined earlier in the notebook.
- Use model.compile to set the optimizer and loss function used to train the model.
- Call model.fit to train the model on the training data, evaluating it against the validation data.
- Make predictions on the test data based on the trained model.
- Write the results to answer.txt in the output folder in the workspace.
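The data-preparation steps in the list above (JSONL to CSV conversion, tag removal, and the training/validation split) could be sketched roughly as follows. This uses only the Python standard library as a stand-in for whatever the notebook does with pandas and sklearn. The function names, the 20% validation fraction, and the assumption that the tags appear literally as `<URL>` and `@USER` are illustrative and not taken from the notebook.

```python
import csv
import json
import random
import re

# Assumption: links appear as "<URL>" and mentions as "@USER" in the tweets.
TAG_PATTERN = re.compile(r"<URL>|@USER")


def clean_tweet(text):
    """Remove URL and USER placeholder tags and collapse leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


def jsonl_to_csv(jsonl_path, csv_path, columns):
    """Copy the chosen columns of a JSONL file into a CSV file; return the row count."""
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # Keys not listed in `columns` are ignored; missing ones are left blank.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():  # skip blank lines
                writer.writerow(json.loads(line))
                count += 1
    return count


def split_rows(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (training_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

For example, `clean_tweet("@USER look at this <URL>")` returns `"look at this"`. A call such as `jsonl_to_csv("train.jsonl", "train.csv", ["label", "response"])` would perform the conversion step, with the file and column names here being hypothetical.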
The notebook relies on the following software and libraries:
- python
- tensorflow
- transformers
- pandas
- sklearn
- os
- urllib
- jsonlines
- csv