We participated in the Text Classification Competition for Sarcasm Detection in Tweets. Our team beat the baseline (0.723) and achieved an F1 score of 0.7542963307013469.
The code can be used to preprocess the given dataset (train.jsonl and test.jsonl) and train a BERT model. Usage instructions can be found in the "Source Code Walkthrough" section.
- Suraj Bisht surajb2@illinois.edu (Team Leader)
- Sithira Serasinghe sithira2@illinois.edu
- Santosh Kore kore3@illinois.edu
- Anaconda 1.9.12
- Python 3.8.3
- PyTorch 1.7.0
- Transformers 3.0.0
Make sure to run this program in an Anaconda environment (i.e., a Conda console). This has been tested on *nix and Windows systems.
1. Install the required libraries
```
pip install tweet-preprocessor textblob wordsegment contractions tqdm
```

2. Download TextBlob corpora
```
python -m textblob.download_corpora
```

3. Install PyTorch & Transformers
```
conda install pytorch torchvision torchaudio cpuonly -c pytorch transformers
```

If it complains that the transformers lib is not installed, try this command:
```
conda install -c conda-forge transformers
```

First, `cd src` and run the following commands:
tl;dr
```
python clean.py && python train.py && python eval.py
```

This will preprocess the data, train the model, and generate the answer.txt file, which can then be submitted to the grader for evaluation.
Description of each step:
- Clean the dataset

  ```
  python clean.py
  ```

- Train the model

  ```
  python train.py
  ```

  Once the model is trained, it will create an `input/model.bin` file which saves our model to a binary file. We can later load this file (in the evaluation step) to make predictions.

- Make predictions & create the answer.txt file

  ```
  python eval.py
  ```

  The answer.txt file is created in the `output` folder.
The following section describes each of these steps in-depth.
We perform the same data cleaning steps for both train.jsonl and test.jsonl so that they are normalized for training and evaluation. The algorithm for cleaning the data is as follows (a rough code sketch follows the list):
For each tweet:
- Append all `context` to become one sentence and prefix it to the `response`.
- Fix the tweet if it has special characters to support better expansion of contractions.
- Remove all digits from the tweets.
- Remove `<URL>` and `@USER` as they do not add any value.
- Convert all tweets to lowercase.
- Use the tweet-preprocessor library to remove emojis, URLs, smileys, and '@' mentions.
- Do hashtag segmentation to expand any hashtags to words.
- Expand contracted words.
- Remove all special symbols.
- Perform lemmatization on the words.
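As a rough illustration only (function names, file handling, and the exact order of operations are assumptions, not the code in clean.py), the installed libraries could be combined along these lines:

```python
# Rough sketch of the cleaning pipeline described above. Names are illustrative
# and may not match clean.py exactly.
import re
import string

import contractions                       # expands "don't" -> "do not"
import preprocessor as p                  # the tweet-preprocessor package
from nltk.stem import WordNetLemmatizer   # corpora installed via textblob.download_corpora
from wordsegment import load, segment     # dictionary-based hashtag segmentation

p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI, p.OPT.SMILEY)
load()                                    # load wordsegment's word statistics once
lemmatizer = WordNetLemmatizer()

def clean_tweet(context, response):
    # Join the context turns into one sentence and prefix it to the response.
    text = " ".join(context) + " " + response
    # Normalize curly apostrophes so contraction expansion works better.
    text = text.replace("\u2019", "'")
    # Drop digits and the <URL>/@USER placeholder tokens, then lowercase.
    text = re.sub(r"\d+", "", text).replace("<URL>", " ").replace("@USER", " ").lower()
    # Strip emojis, URLs, smileys, and @-mentions.
    text = p.clean(text)
    # Expand hashtags into words, e.g. "#notfunny" -> "not funny".
    text = " ".join(
        " ".join(segment(tok[1:])) if tok.startswith("#") and len(tok) > 1 else tok
        for tok in text.split()
    )
    # Expand contractions, remove remaining special symbols, then lemmatize.
    text = contractions.fix(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(lemmatizer.lemmatize(w) for w in text.split())
```

In practice, clean.py presumably applies something like this to every record in train.jsonl and test.jsonl and writes the results to the train.csv and test.csv files consumed by the later steps.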
A model can be built and trained with the provided parameters by issuing the `python train.py` command. The following steps run in sequence during model training (a rough sketch of the flow follows the list).
- Read in the train.csv from the prior step.
- The training dataset (5000 records) is split into training and validation sets in an 80:20 ratio.
- Feed in the parameters to the model.
- Perform model training for the given number of epochs.
- Calculate validation accuracy for each run and save the best model as a bin file
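The sketch below is a rough, hypothetical reconstruction of this flow, loosely modelled on the bert-sentiment repository credited at the end of this README. The class names (`TweetDataset`, `BERTSarcasm`), the column names (`text`, a 0/1-encoded `label`), and the file paths are assumptions; the actual implementation in src/train.py and src/model.py may differ.

```python
# Rough, illustrative sketch of the training flow; names and paths are assumptions.
import pandas as pd
import torch
import torch.nn as nn
import transformers
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, get_linear_schedule_with_warmup

DEVICE, MAX_LEN, EPOCHS = "cpu", 256, 5                 # mirrors src/config.py
BERT_PATH, TRAIN_BS, VALID_BS = "bert-base-uncased", 8, 4

class TweetDataset(Dataset):
    """Tokenizes cleaned tweets for BERT."""
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels
        self.tok = transformers.BertTokenizer.from_pretrained(BERT_PATH)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        enc = self.tok(self.texts[i], max_length=MAX_LEN, padding="max_length",
                       truncation=True, return_tensors="pt")
        return ({k: v.squeeze(0) for k, v in enc.items()},
                torch.tensor(self.labels[i], dtype=torch.float))

class BERTSarcasm(nn.Module):
    """bert-base-uncased with a dropout + single-logit classification head."""
    def __init__(self):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(BERT_PATH)
        self.drop = nn.Dropout(0.1)                     # dropout value from src/model.py
        self.out = nn.Linear(768, 1)                    # 768 = bert-base hidden size

    def forward(self, ids, mask, token_type_ids):
        pooled = self.bert(ids, attention_mask=mask,
                           token_type_ids=token_type_ids)[1]  # pooled [CLS] output
        return self.out(self.drop(pooled))

df = pd.read_csv("../input/train.csv")                     # output of clean.py; assumes
train_df, valid_df = train_test_split(df, test_size=0.2)   # text + 0/1 label columns

train_dl = DataLoader(TweetDataset(train_df.text.tolist(), train_df.label.tolist()),
                      batch_size=TRAIN_BS, shuffle=True)
valid_dl = DataLoader(TweetDataset(valid_df.text.tolist(), valid_df.label.tolist()),
                      batch_size=VALID_BS)

model = BERTSarcasm().to(DEVICE)
optimizer = AdamW(model.parameters(), lr=2e-5)          # as listed under src/train.py
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2, num_training_steps=len(train_dl) * EPOCHS)
loss_fn = nn.BCEWithLogitsLoss()

best_acc = 0.0
for epoch in range(EPOCHS):
    model.train()
    for batch, labels in train_dl:                      # one optimisation step per batch
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE),
                       batch["token_type_ids"].to(DEVICE)).squeeze(-1)
        loss_fn(logits, labels.to(DEVICE)).backward()
        optimizer.step()
        scheduler.step()

    model.eval()                                        # validation accuracy per epoch
    correct = total = 0
    with torch.no_grad():
        for batch, labels in valid_dl:
            logits = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE),
                           batch["token_type_ids"].to(DEVICE)).squeeze(-1)
            correct += ((torch.sigmoid(logits) > 0.5).float().cpu() == labels).sum().item()
            total += labels.size(0)
    if correct / total > best_acc:                      # keep only the best checkpoint
        best_acc = correct / total
        torch.save(model.state_dict(), "../input/model.bin")
```

The dropout value, learning rate, warmup steps, and validation split size used here correspond to the tunable parameters listed in the next section.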
The following parameters could be tuned to achieve a better result (a short example of adjusting one of them follows the listings).
src/config.py

```
DEVICE = "cpu" # If you have CUDA GPU, change this to 'cuda'
MAX_LEN = 256 # Max length of the tokens in a given document
EPOCHS = 5 # Number of epochs to train the model for
BERT_PATH = "bert-base-uncased" # Our base BERT model. Can plug in different models such as bert-large-uncased
TRAIN_BATCH_SIZE = 8 # Size of the training dataset batch
VALID_BATCH_SIZE = 4 # Size of the validation dataset batch
```

src/train.py

```
L25: test_size=0.15 # Size of the validation dataset
L69: optimizer = AdamW(optimizer_parameters, lr=2e-5) # A different optimizer can be plugged in, or a different learning rate can be defined here
L71: num_warmup_steps=2 # No. of warmup steps that need to run before the actual training step
```

src/model.py

```
L13: nn.Dropout(0.1) # Configure the dropout value
```
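The BERT_PATH comment above notes that a different base model such as bert-large-uncased can be plugged in. One thing to keep in mind is that bert-large-uncased has a hidden size of 1024 (vs. 768 for bert-base-uncased), so any linear layer stacked on top of the pooled BERT output must match it. The snippet below is only an illustration of how the two could be kept in sync; it is not necessarily how src/model.py is written:

```python
# Illustrative only: derive the classifier input size from the chosen BERT model
# so that swapping BERT_PATH (e.g. to "bert-large-uncased") stays consistent.
import torch.nn as nn
import transformers

BERT_PATH = "bert-large-uncased"                    # hypothetical alternative value

bert_config = transformers.BertConfig.from_pretrained(BERT_PATH)
bert = transformers.BertModel.from_pretrained(BERT_PATH)
dropout = nn.Dropout(0.1)                           # the tunable dropout noted above
classifier = nn.Linear(bert_config.hidden_size, 1)  # 768 for bert-base, 1024 for bert-large
```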
A high-level view of the sequence of operations run during the evaluation step is as follows (a rough sketch follows the list).

- Load the test.csv file from the data transformation step.
- Load the best performing model from the training step.
- Perform predictions for each test tweet (1800 total records)
- Generate answer.txt in the "output" folder; this is the file submitted to the grader.
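Continuing the hypothetical sketch from the training section (the same caveats apply: the class names, column names, file paths, and the exact answer.txt format are assumptions; the actual code is in src/eval.py):

```python
# Rough sketch of the evaluation step; reuses the illustrative TweetDataset and
# BERTSarcasm classes from the training sketch above.
import pandas as pd
import torch
from torch.utils.data import DataLoader

test_df = pd.read_csv("../input/test.csv")               # output of clean.py
model = BERTSarcasm()
model.load_state_dict(torch.load("../input/model.bin", map_location="cpu"))
model.eval()

test_dl = DataLoader(TweetDataset(test_df.text.tolist(), [0] * len(test_df)),
                     batch_size=4)                       # dummy labels, unused here

preds = []
with torch.no_grad():
    for batch, _ in test_dl:
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"]).squeeze(-1)
        preds.extend((torch.sigmoid(logits) > 0.5).long().tolist())

# Write one prediction per line; the id column and the "<id>,<label>" format are
# assumptions and must match whatever the grader actually expects.
with open("../output/answer.txt", "w") as f:
    for tweet_id, is_sarcasm in zip(test_df.id, preds):
        f.write(f"{tweet_id},{'SARCASM' if is_sarcasm else 'NOT_SARCASM'}\n")
```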
- Suraj Bisht surajb2@illinois.edu (Team Leader)
- Improving the initial coding workflow (Google Colab, local setup, etc.).
- Investigating Sequential model, Logistic Regression, SVC etc.
- Investigating the `bert-base-uncased` model.
- Investigating data preprocessing options.
- Hyperparameter tuning to improve the current model.
- Sithira Serasinghe sithira2@illinois.edu
- Setting up the initial workflow.
- Investigating LSTM/BiDirectional LSTM, Random Forest etc.
- Investigating various data preprocessing options.
- Investigating the `bert-base-uncased` model.
- Hyperparameter tuning to improve the current model.
- Santosh Kore kore3@illinois.edu
- Improving the initial coding workflow (Google Colab, local setup, etc.).
- Investigating Sequential models, SimpleRNN, CNN etc.
- Investigating the `bert-large-uncased` model.
- Investigating data preprocessing options.
- Hyperparameter tuning to improve the current model.
- Cleaning data further with different methods.
- Optimizing BERT model parameters and trying different BERT models (e.g., RoBERTa).
- Re-using some of the previously tried models and optimizing them to improve the F1 score.
- Extracting emojis to add more meaning to the sentiment of the tweets.
- Data augmentation steps to prevent overfitting.
- Trying an ensemble of models (e.g., BERT + VLaD).
- Running our model on different test data and comparing the results against the state of the art.
The usage of the BERT model is inspired by https://github.com/abhishekkrthakur/bert-sentiment.