
Text Classification Competition: Twitter Sarcasm Detection

This project was completed for CS410 during the Fall 2020 semester at UIUC. More details about the competition can be found at: https://github.com/CS410Fall2020/ClassificationCompetition

Implementation Overview

This classifier relies on the Simple Transformers package, which is built on the Transformers package from Hugging Face. The documentation for Simple Transformers can be found here: https://simpletransformers.ai/docs/classification-models/#classificationmodel. All required packages can be installed with the package manager pip.

The code is broken into two files - train.py and test.py. train.py focuses on training the model. Before training, we preprocess the data: first we remove all stop words, then we replace all emojis with text that may provide useful information to the model. For training we rely on the Simple Transformers package, using its binary classification model. Its setup allows us to bring in any pretrained model and fine-tune it on our data. We evaluated multiple pretrained models from Hugging Face, found here: https://huggingface.co/models. In the end we found that simply using the BERT model yielded the best results.
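As a rough illustration, a binary Simple Transformers classifier built on a pretrained BERT model is typically set up as in the sketch below. The column names, epoch count, and example rows are assumptions for illustration, not the project's actual settings.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical training data: one text column and one binary label
# (1 = sarcastic, 0 = not sarcastic); column names are an assumption.
train_df = pd.DataFrame(
    [["great, another monday", 1], ["heading to the store now", 0]],
    columns=["text", "labels"],
)

# Binary classifier built on a pretrained BERT model from Hugging Face.
model = ClassificationModel(
    "bert",
    "bert-base-uncased",
    args={"num_train_epochs": 3, "train_batch_size": 100},
    use_cuda=False,  # set to True if a GPU is available
)

model.train_model(train_df)
```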

To apply labels to a new set of tweets, run the test.py file. It performs the same preprocessing - converting emojis to text and removing stop words - then reads in the model trained by train.py and uses its predict() function to generate the new labels.
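Loading a saved model and predicting with Simple Transformers generally looks like the following sketch; the example tweets are placeholders, and the outputs/ path matches the default output directory described below.

```python
from simpletransformers.classification import ClassificationModel

# Load the model saved by train.py from the outputs/ folder.
model = ClassificationModel("bert", "outputs/", use_cuda=False)

# predict() returns the predicted labels and the raw model outputs.
predictions, raw_outputs = model.predict(
    ["oh sure, because that always works", "see you at noon"]
)
print(predictions)  # e.g. [1, 0]
```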

Running the Code

To run the code, first ensure that all the required packages are installed. The code can be run from the command line with either python train.py or python test.py, depending on which file needs to be run; it can also be run from a Python IDE. Variables such as the training data file name, number of epochs, and learning rate can be adjusted at the beginning of each file. Running train.py writes its results to a folder titled outputs. The test.py file reads from the outputs folder and writes an answer.txt file.
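The block below illustrates the kind of configuration variables defined at the top of the files; the names and values are hypothetical, not the repository's exact code.

```python
# Hypothetical configuration block; actual variable names in train.py may differ.
TRAIN_DATA = "train.jsonl"   # file name for the training data (assumed)
EPOCHS = 3                   # number of training epochs
BATCH_SIZE = 100             # training batch size
LEARNING_RATE = 4e-5         # learning rate
OUTPUT_DIR = "outputs/"      # where the trained model is written
```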

For a detailed overview of running the code, an instructional demo can be found here: https://youtu.be/hzyMMAHryAE

Functions

The code contains three main functions - convert_emojis (appears in both files), bert_training (train.py), and predict_sarcasm (test.py).

convert_emojis(text): This function takes a single input, text. It cycles through the list of known emojis, looking for them in the text and replacing each one with descriptive text so the model sees consistent input.
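A minimal sketch of this kind of replacement is shown below; the mapping is illustrative and much smaller than the list used in the project.

```python
# Illustrative emoji/emoticon-to-text mapping (not the project's full list).
EMOJI_MAP = {
    "\U0001F602": " face with tears of joy ",
    "\U0001F644": " face with rolling eyes ",
    ":)": " smiling face ",
}

def convert_emojis(text):
    """Replace known emojis/emoticons with descriptive text."""
    for emoji, description in EMOJI_MAP.items():
        text = text.replace(emoji, description)
    return text

print(convert_emojis("thanks a lot \U0001F602"))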

bert_training(model_type, model_base, train_data, early_stop, early_stop_delta, overwrite, epoch, batch_size, learning_rate, output): This function is where the model is trained. It requires several inputs, all of which are defined at the beginning of the file, which makes it easy to train models with different model types, numbers of epochs, and batch sizes. It also lets the user specify where the output should be written, making it easy to save several different models without overwriting existing ones.
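The signature suggests a thin wrapper around ClassificationModel. The sketch below shows one plausible mapping of those parameters onto Simple Transformers arguments; it is an assumption about the structure, not the repository's exact code.

```python
from simpletransformers.classification import ClassificationModel

def bert_training(model_type, model_base, train_data, early_stop,
                  early_stop_delta, overwrite, epoch, batch_size,
                  learning_rate, output):
    """Train a binary classifier and save it to the given output folder.

    Sketch only: the real implementation may map parameters differently.
    """
    args = {
        "use_early_stopping": early_stop,
        "early_stopping_delta": early_stop_delta,
        "overwrite_output_dir": overwrite,
        "num_train_epochs": epoch,
        "train_batch_size": batch_size,
        "learning_rate": learning_rate,
        "output_dir": output,
    }
    model = ClassificationModel(model_type, model_base, args=args)
    model.train_model(train_data)
    return model
```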

predict_sarcasm(data_path, results, model_loc, model): This function takes a model and input text and generates labels indicating whether or not each tweet is sarcastic. Its inputs let the user specify the location of the data, the name of the results file, where the model is located, and what type of model it is, making it easy to change parameters and compare the performance of different models.
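A rough sketch of how such a function could be structured follows. The input format, text column name, and output label strings are all assumptions; it also reuses the convert_emojis helper described above.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

def predict_sarcasm(data_path, results, model_loc, model):
    """Label the tweets at data_path and write them to a results file.

    Sketch only: input/output formats are assumptions, not the
    project's exact choices.
    """
    data = pd.read_json(data_path, lines=True)              # assumed JSON-lines input
    texts = [convert_emojis(t) for t in data["response"]]   # assumed text column

    classifier = ClassificationModel(model, model_loc, use_cuda=False)
    predictions, _ = classifier.predict(texts)

    with open(results, "w") as f:
        for label in predictions:
            f.write("SARCASM\n" if label == 1 else "NOT_SARCASM\n")  # assumed label format
```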

Parameter Tuning and Model Exploration

The main parameters we focused on for this project were the learning rate, batch size, number of epochs, and base model. We explored RoBERTa, XLNet, and ELECTRA before finally deciding on BERT. We also tried an ensemble method combining results from BERT, ELECTRA, and RoBERTa; however, those results were below the baseline, so we opted to use the BERT model alone.

We also looked at batch size and found that a batch size of 100 performed well. The Simple Transformers package uses 8 as the default; this did well, but not well enough to beat the baseline. When we increased the batch size much further, we found that the model tended to predict everything as sarcastic.

Notes

The final trained model was not uploaded to the repository. Please reach out if you would like to see the model or have questions about it. A similar model can be generated by running train.py before running test.py.

Useful Links and Sources

https://huggingface.co/models
https://simpletransformers.ai/docs/binary-classification/
https://towardsdatascience.com/simple-transformers-introducing-the-easiest-bert-roberta-xlnet-and-xlm-library-58bf8c59b2a3
https://github.com/ThilinaRajapakse/simpletransformers
