My Attempt at the Kaggle Toxic Comment Classification Competition
I built a model that predicts the probability of a comment belonging to each of the competition's six toxicity classes. I used XGBoost after generating feature vectors from GloVe and Google News Word2Vec embeddings.
I achieved an overall AUC of 0.82.
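The following is a minimal sketch of that pipeline, not the exact code in this repo: it averages pretrained Google News Word2Vec embeddings per comment and trains one binary XGBoost classifier per class. The file names, column names, and hyperparameter values are assumptions based on the standard Kaggle data layout.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from gensim.models import KeyedVectors
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Load the competition training data and the pretrained Google News vectors.
train = pd.read_csv("train.csv")
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def comment_vector(text, dim=300):
    """Average the embeddings of all in-vocabulary tokens in a comment."""
    tokens = [t for t in str(text).lower().split() if t in w2v]
    if not tokens:
        return np.zeros(dim)
    return np.mean([w2v[t] for t in tokens], axis=0)

X = np.vstack([comment_vector(t) for t in train["comment_text"]])

# One binary XGBoost classifier per class; report the mean validation AUC.
aucs = []
for label in LABELS:
    y = train[label].values
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=42)
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]))
print("mean AUC:", np.mean(aucs))
```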
Resources needed:
- Download the data from the Kaggle competition page here
- Download the GloVe word vectors here; choose the 300d, 840B-token model (`glove.840B.300d`)
- Download the GoogleNews Word2Vec vectors here (a loading sketch for both embedding files follows this list)
- To use the Keras model built in the file `example_to_clarify.py`, you need to download the 20 Newsgroups dataset
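Below is a hedged sketch of loading the two embedding downloads. The file names are the standard distribution names and assume you saved both files in the working directory.

```python
import numpy as np
from gensim.models import KeyedVectors

# The GoogleNews vectors ship in word2vec binary format; gensim reads them directly.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# GloVe ships as plain text: a token followed by 300 coefficients per line.
# Some tokens in the 840B file contain spaces, so split the coefficients off the right.
glove = {}
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, coefs = " ".join(parts[:-300]), parts[-300:]
        glove[word] = np.asarray(coefs, dtype="float32")

print(w2v["comment"].shape, glove["comment"].shape)  # both (300,)
```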
Note:
The `final_try.py` file is an implementation of the XGBoost algorithm on the same data.
To Do:
- You can definitely do much more hyperparameter optimization, especially for the LSTM model. For example, try playing around with `max_features`, `max_len`, `dropout_rate`, the size of the Dense layer, etc. (see the sketch after this list)
- You can try different feature engineering and normalization techniques for the text data
- In general, try playing around with parameters like `batch_size`, `num_epochs`, and `learning_rate`
- Try different optimizers, such as `Adagrad`, `Adadelta`, or `SGD`
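As a reference for where those hyperparameters plug in, here is a minimal Keras LSTM sketch. All layer sizes and values are illustrative assumptions, not the settings used in this repo.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam  # swap in Adagrad, Adadelta, or SGD to experiment

max_features = 20000   # vocabulary size kept by the tokenizer (assumed value)
max_len = 100          # padded sequence length per comment (assumed value)
dropout_rate = 0.2
dense_size = 50        # size of the Dense layer
batch_size = 32
num_epochs = 2
learning_rate = 1e-3

model = Sequential([
    Embedding(max_features, 128, input_length=max_len),
    LSTM(64),
    Dropout(dropout_rate),
    Dense(dense_size, activation="relu"),
    Dense(6, activation="sigmoid"),  # one probability per toxicity class
])
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss="binary_crossentropy", metrics=["AUC"])
# model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs)
```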