The final version of our project code is available as a Google Colab notebook, shared at SequenceClassification. Through this code, we trained and fine-tuned a RoBERTa-based transformer model for the text classification competition (identifying whether a tweet is SARCASM or NOT_SARCASM), and we were able to achieve an F1 score higher than the baseline.
Below are the steps to execute the code end to end. Since the modelling methodology is transformer-based, a GPU is recommended for processing. Google Colab is the preferred environment to run the end-to-end process and generate the results.
We have created a checkpoint of our model, available to download from the path below. To use our model, you can simply load the checkpoint into the 'AutoModelForSequenceClassification' API (see https://huggingface.co/transformers/model_doc/auto.html for more details); a minimal loading sketch is shown after the links below. For the use case of our text competition (and as an example of using our model), please refer to the Colab notebook RoBERTa_Model_Test for using the checkpoint of our model and reproducing answer.txt.
Tutorial: Reproduce_Answers_with_Checkpoint
Checkpoint available to Download from: Checkpoint
See slide 11 of the Presentation for detailed step-by-step guidance.
Please see Presentation for the slides we created.
Please see the video Tutorial: Reproduce_Answers_with_Checkpoint for step-by-step guidance on how to re-run and test our model.
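As a minimal sketch of loading the checkpoint into the Auto classes mentioned above: the checkpoint directory path below is a placeholder for wherever you unzip the downloaded checkpoint, and it assumes the tokenizer files were saved alongside the model weights.

```python
# Minimal loading sketch -- the directory path is a placeholder, not the actual checkpoint location.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint_dir = '/content/drive/My Drive/roberta_sarcasm_checkpoint'  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)
```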
Each of us collaborated closely in every step of research, experimentation, and improvement. We had a touch point scheduled on a weekly basis, where we shared learnings and resources, discussed our approach, and walked through our code. With that, each of us contributed 100% effort in each step of the project process.
Our team members are:
Wenxi Fan (NetID: wenxif2; Email:wenxif2@illinois.edu)
Abhishek Jain (NetID: aj26; Email:aj26@illinois.edu)
(The steps below follow the notebook SequenceClassification.)
First, set up the environment. We will perform the following steps here (an installation sketch follows the list):
- Transformers Model Installation
- Hyperparameter Tuning Library Installation
- Colab Setup
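A sketch of the installation cells is shown below; the exact packages and pinned versions in the notebook may differ.

```python
# Installation cells (Colab) -- versions are not pinned here and may differ from the notebook.
!pip install transformers datasets       # Hugging Face Transformers + Datasets utilities
!pip install optuna "ray[tune]"          # hyperparameter search backends for the Trainer
```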
You will be required to authorize the code using your Google account. Copy the generated authorization code and paste it into the input box shown in the notebook when you run the mount-drive code. Below is the reference:
# Colab setup
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Also, please copy the provided train & test JSONL files, required for training and testing the models, into your Google Drive.
Tutorial: Environment Setup
The next step is to load the training & test datasets as Pandas dataframes. Please update the data path to point to where the training and test datasets are copied in your Google Drive.
import pandas as pd

datapath = r'/content/drive/My Drive/mcsds/cs-410-text-mining/project/ClassificationCompetition/data'
train_pddf = pd.read_json(datapath + '/train.jsonl', lines=True)
test_pddf = pd.read_json(datapath + '/test.jsonl', lines=True)

The example above assumes the train & test JSONL files are copied to the '/mcsds/cs-410-text-mining/project/ClassificationCompetition/data' location in Google Drive.
Then run the data load section. Reference: Data Load & Preprocessing
Next step is to run the data preprocessing steps. Below are the different components of it:
Create new features:
- Last Response: Extract the last response from the context; since the current response was generated in reply to it, it can be treated separately.
- Context Reversed: Reverse the context before feeding it to the transformer so that the latest tweets receive more attention and, if the context is too long, the latest tweets are still the ones considered.
- Combine all tweets into a single sequence.
- Combine the current & last response into a single sequence.
Define how we want to structure the different tweets; two approaches are followed (a preprocessing sketch is shown after this list):
- Combine into a single sequence: the last response only, all tweets together, or the current and last responses.
- Two sequences: (Current, Last Response) or (Current, Context Reversed).
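A hedged sketch of the feature construction above, assuming the competition JSONL exposes 'response', 'context' (a list of tweets, oldest first), and 'label' fields; the derived column names are illustrative, not the notebook's exact identifiers.

```python
# Illustrative feature construction -- column names are assumptions, not the notebook's exact ones.
def add_features(df):
    df['last_response'] = df['context'].apply(lambda ctx: ctx[-1])                 # last tweet in the context
    df['context_rev'] = df['context'].apply(lambda ctx: ' '.join(reversed(ctx)))   # latest tweet first
    df['all_combined'] = df['response'] + ' ' + df['context_rev']                  # everything in one sequence
    df['response_last'] = df['response'] + ' ' + df['last_response']               # current + last response only
    return df

train_pddf = add_features(train_pddf)
test_pddf = add_features(test_pddf)

# Map the string labels to integer ids for training (the competition test set has no labels).
train_pddf['label'] = train_pddf['label'].map({'NOT_SARCASM': 0, 'SARCASM': 1})
```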
Translate the preprocessed dataframes into Transformer Datasets. This step converts our Pandas dataframes into the Datasets construct that the Transformers training utilities expect. Reference: Data Load & Preprocessing
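Continuing the sketch, one way to do this with the Hugging Face datasets library; the notebook may build the splits differently, and the split ratio is a placeholder.

```python
# Convert the pandas dataframes into Hugging Face Dataset objects and hold out
# part of the labelled training data for evaluation; the unlabelled competition
# test set is kept separate.
from datasets import Dataset, DatasetDict

train_ds = Dataset.from_pandas(train_pddf)
split = train_ds.train_test_split(test_size=0.2, seed=42)   # placeholder split ratio

dataset = DatasetDict({
    'train': split['train'],
    'test': split['test'],
    'competition_test': Dataset.from_pandas(test_pddf),
})
```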
Configure which model strategy to select, the train/test/validation splits, the performance metric, training batch sizes, etc. Below are the details (a configuration sketch follows the list):
- model_checkpoint: which model to use for text sequence classification. RoBERTa models were observed to give the best performance.
- task: specifies how to structure the sequences, as described in the sequence structuring step. We observed the best performance with the 'response_context_rev_sep' structure. This format structures the input as the two sequences <response, context>, where response is the last tweet to be classified and context contains the previous tweets in reverse order of occurrence.
- metric_name: metric to be optimized while training. We have configured it to accuracy.
- num_labels: 2, the number of classes (SARCASM, NOT_SARCASM).
- batch_size: 16 for RoBERTa and 64 for BERT; otherwise we run into out-of-memory issues.
- train_test_split: to divide training data into train and test datasets.
- test_valid_split: to divide test dataset into test and validation set.
- epoch: the number of epochs to train the model for.
- weight_decay: regularization term in the weight update rule that causes the weights to decay exponentially toward zero.
- learning_rate: determines how much an update step influences the current value of the weights.
Reference: Model Config
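A hedged configuration sketch reflecting the settings above; the numeric values (learning rate, epochs, weight decay) and the exact model size are placeholders, not the tuned values from the notebook.

```python
# Configuration sketch -- numeric values are placeholders, not the notebook's tuned settings.
from transformers import TrainingArguments

model_checkpoint = 'roberta-base'       # RoBERTa gave the best performance (model size is a placeholder)
task = 'response_context_rev_sep'       # two-sequence <response, reversed context> structure
metric_name = 'accuracy'
num_labels = 2                          # SARCASM vs NOT_SARCASM
batch_size = 16                         # 16 for RoBERTa, 64 for BERT

args = TrainingArguments(
    output_dir=f'{model_checkpoint}-finetuned-{task}',
    learning_rate=2e-5,                 # placeholder
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,                 # placeholder
    weight_decay=0.01,                  # placeholder
)
```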
This step translates words into tokens. The Transformers tokenizer tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and puts them in the format the model expects, as well as generating the other inputs the model requires.
Reference: Tokenization & Single Model Fine Tuning
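Continuing the sketch, tokenization for the two-sequence 'response_context_rev_sep' structure might look like this; the column names follow the earlier (assumed) preprocessing sketch.

```python
# Tokenize (response, reversed context) pairs; truncation keeps sequences within the model's limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def preprocess(examples):
    return tokenizer(examples['response'], examples['context_rev'], truncation=True)

encoded_dataset = dataset.map(preprocess, batched=True)
```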
Download the pretrained model and fine-tune the selected model with the arguments configured in the previous step.
Reference: Tokenization & Single Model Fine Tuning
Reference: Training Results
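A hedged fine-tuning sketch, continuing the cells above and using the standard Trainer workflow; the notebook's compute_metrics function and split names may differ.

```python
# Fine-tune the pretrained model; accuracy is the metric optimized, per the config above.
import numpy as np
from transformers import AutoModelForSequenceClassification, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['test'],   # held-out labelled split from the earlier sketch
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```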
Validate the results on test data and compute the metrics.
Reference: Validation Results
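Continuing the sketch, evaluation on the held-out split could be run as follows; the returned keys depend on the compute_metrics function above.

```python
# Compute metrics on the held-out labelled split.
eval_results = trainer.evaluate(encoded_dataset['test'])
print(eval_results)
```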
Can only be run in a high-GPU environment. This step uses the Transformers Trainer utility, which supports hyperparameter search through the Optuna or Ray Tune libraries we installed in a previous step. During hyperparameter tuning the Trainer runs several trainings, so the model needs to be defined via a function (so it can be reinitialized for each new run) instead of being passed directly. The hyperparameter_search method returns a BestRun object, which contains the value of the objective maximized and the hyperparameters it used for that run.
Reference: Hyperparameter Tuning
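A hedged sketch of the hyperparameter search step, continuing the earlier cells; the number of trials is an assumption, and the search backend (Optuna or Ray Tune) is whichever is available in the environment.

```python
# The model must come from a function so the Trainer can re-initialize it for every trial.
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

hp_trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Returns a BestRun with the maximized objective and the hyperparameters of that run.
best_run = hp_trainer.hyperparameter_search(n_trials=10, direction='maximize')
print(best_run)
```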
Can only be run in a high-GPU environment. To reproduce the best training run from the previous hyperparameter tuning step, we set the best hyperparameters on the TrainingArguments before training the model again. Reference: Hyperparameter Tuning
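A sketch of reproducing the best run by copying the hyperparameters found above back into the TrainingArguments and retraining, following the pattern shown in the Transformers documentation.

```python
# Copy the best hyperparameters onto the existing TrainingArguments, then retrain.
for name, value in best_run.hyperparameters.items():
    setattr(hp_trainer.args, name, value)

hp_trainer.train()
```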