In this project, we present a comprehensive study of sentiment analysis on Twitter data, where the task is to predict whether a tweet originally contained a positive or a negative smiley, given only the tweet message. With a fully automated framework, we developed and experimented with the most powerful solutions proposed in the related literature, covering text preprocessing, text representation (also known as feature extraction), and supervised classification techniques. Different combinations of these algorithms led to a better understanding of each component, and exhaustive test procedures resulted in a very high classification score in our final results.
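The three stages above (preprocessing, representation, classification) can be sketched end to end with scikit-learn. This is a minimal illustration on invented toy tweets, not the project's actual implementation:

```python
# Minimal sketch of the preprocessing -> representation -> classification
# pipeline with scikit-learn; the tiny dataset is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i love this movie", "great day today", "this is awful", "worst experience ever"]
labels = [1, 1, 0, 0]  # 1 = positive smiley, 0 = negative smiley

model = make_pipeline(
    TfidfVectorizer(lowercase=True),  # text representation (feature extraction)
    LogisticRegression(),             # supervised classifier
)
model.fit(tweets, labels)
print(model.predict(["what a great movie"]))
```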
See the Project Specification at the EPFL /epfml/ML_course GitHub page.
In order to run the project you will need the following dependencies installed:
- **Anaconda3** - Download and install Anaconda with Python 3.
- **Scikit-Learn** - Download the scikit-learn library with conda:

  ```
  $ conda install scikit-learn
  ```

- **Gensim** - Install the Gensim library:

  ```
  $ conda install gensim
  ```

- **NLTK** - Download all packages of NLTK:

  ```
  $ python
  >>> import nltk
  >>> nltk.download()
  ```

  and then download all packages from the GUI.

- **GloVe** - Install the GloVe Python implementation:

  ```
  $ pip install glove_python
  ```

  Sometimes `libgcc` also needs to be installed:

  ```
  $ pip install libgcc
  ```

- **FastText** - Install the FastText implementation:

  ```
  $ pip install fasttext
  ```

- **Tensorflow** - Install the TensorFlow library (recommended version: 0.12.0):

  ```
  $ pip install tensorflow
  ```
- **Train Data** - Download the Positive & Negative tweet files in order to train the models, and move them into the `data/datasets/` directory.
- **Test Data** - Download the Test tweet file in order to test the models on Kaggle, and move it into the `data/datasets/` directory.
- **Stanford Pretrained GloVe Word Embeddings** - Download the GloVe pretrained word embeddings. Then, unzip the downloaded file and move the extracted files into the `data/glove/` directory. The default data is the 200d version (`cp glove.twitter.27B.200d.txt` into the `data/glove/` directory).
- **Preprocessed Tweet Files (Optional)** - In case you want to avoid the preprocessing execution time, you can download the preprocessed train tweets of the full dataset. After downloading the above file, just place it in the `data/preproc` directory. Also, before running any algorithm, make sure that the `preprocess` parameter is enabled and that only the default preprocess parameters are enabled (in any other case, the test set is going to be processed in a different way, which is unwanted). If the preprocessed file is in the right place, it is going to be loaded. Finally, in case you want to test the algorithm on different datasets, do not forget to remove the preprocessed file (normally done by enabling the `clear` and `preproc` parameters in the corresponding algorithm).
- **Pretrained Word Embeddings (Optional)** - In case you want to avoid training the whole word embeddings matrix from scratch, you can download the glove_python, the hybrid, and the baseline word embeddings (created with our default parameters). After downloading one of the aforementioned files, place it in the `data/glove` directory. When the corresponding method is chosen in the `options.py` file and the required embeddings file exists, it is simply loaded and the training phase is skipped. In any other case, the files are going to be reproduced from scratch (this might take a while).
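The pretrained GloVe files above are plain text, with one token per line followed by its vector components. A hedged sketch of loading that format into a `{word: vector}` dict (this is a generic reading of the file format, not the project's actual loader):

```python
# Sketch: parse a GloVe-format embeddings file (token followed by float
# components on each line) into a {word: vector} dict. Illustrated on an
# in-memory sample; with the real file you would open
# data/glove/glove.twitter.27B.200d.txt instead.
def load_glove(lines):
    """Map each word to its embedding vector (a list of floats)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

sample = ["the 0.1 0.2 0.3", "tweet -0.5 0.0 0.9"]
vectors = load_glove(sample)
print(vectors["tweet"])  # [-0.5, 0.0, 0.9]
```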
- A computer with:
  - at least 16 GB of RAM
  - a graphics card (optional - needed for faster training of the CNN solution)
  - a Unix-based operating system (e.g. Linux, macOS); tested & developed on Ubuntu
See the Public Leaderboard on Kaggle.
Our Team's name is **gLove is in the air...**♩ ♪ ♫ ♬:heart:
Go to the `src/` directory and set the `algorithms` variable in the `options.py` file.
In case you want to parametrise a model, just set the corresponding dictionary in `options.py`. For more details, check the important parameters of each algorithm in the aforementioned file.
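The exact contents of `options.py` are project-specific; the fragment below is a hypothetical illustration of how such a per-algorithm parameter dictionary might look (all names and values here are invented for the example):

```python
# Hypothetical options.py fragment -- the real parameter names may differ.
algorithms = ["CNN"]   # which algorithm(s) main.py should run

CNN = {
    "train": True,     # train from scratch vs. load a checkpoint
    "cv": False,       # enable cross-validation scoring
    "num_epochs": 10,
    "batch_size": 64,
}
```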
Then just start the `main.py` file:

```
$ cd src/
$ python main.py
```

When the program terminates, you will get all the predictions for the test file in the `data/submissions/` directory.
By setting the `cv` option to `true` in the `options.py` file (in the corresponding algorithm), you can get a good approximation of the Kaggle score directly from cross-validation (this might take a while for the full datasets).
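A cross-validation estimate of this kind can be sketched with scikit-learn's `cross_val_score`; the toy data below is invented, and the project's actual `cv` option may be implemented differently:

```python
# Sketch: estimate held-out accuracy with k-fold cross-validation,
# similar in spirit to the cv option described above. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

tweets = ["good", "nice day", "love it", "happy", "bad", "awful", "hate it", "sad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(model, tweets, labels, cv=4, scoring="accuracy")
print(scores.mean())  # average accuracy across the 4 folds
```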
In the `data/models/BEST` directory, we have stored a checkpoint of our best trained CNN model.
Go to the `options.py` file and:
- set `algorithm` to `CNN`
- in the `CNN` dictionary:
  - set `train` to `False`
  - make sure `checkpoint_dir` is set to `TF_SAVE_PATH + '/BEST/checkpoints'`
Finally, just follow the Demo procedure.
- Beril Besbinar
- Dimitrios Sarigiannis
- Panayiotis Smeros
License: MIT