In this project, we present a comprehensive study of sentiment analysis on Twitter data, where the task is to predict whether a tweet originally contained a positive or a negative smiley, given only the tweet message. With a fully automated framework, we developed and experimented with the most powerful solutions proposed in the related literature, covering text preprocessing, text representation (also known as feature extraction), and supervised classification techniques. Different combinations of these algorithms led to a better understanding of each component, and exhaustive test procedures resulted in a very high classification score in our final results.
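The three stages above (preprocessing, representation, classification) can be sketched end to end with scikit-learn. This is a minimal illustration on invented toy tweets, not the project's actual implementation:

```python
# Minimal sketch of the preprocessing -> representation -> classification
# pipeline with scikit-learn; the tiny dataset is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i love this movie", "great day today", "this is awful", "worst experience ever"]
labels = [1, 1, 0, 0]  # 1 = positive smiley, 0 = negative smiley

model = make_pipeline(
    TfidfVectorizer(lowercase=True),  # text representation (feature extraction)
    LogisticRegression(),             # supervised classifier
)
model.fit(tweets, labels)
print(model.predict(["what a great movie"]))
```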
See the Project Specification at the EPFL /epfml/ML_course GitHub page.
In order to run the project you will need the following dependencies installed:
- **Anaconda3** - Download and install Anaconda with Python 3.
- **Scikit-Learn** - Download the scikit-learn library with conda:

  ```
  $ conda install scikit-learn
  ```

- **Gensim** - Install the Gensim library:

  ```
  $ conda install gensim
  ```

- **NLTK** - Download all packages of NLTK:

  ```
  $ python
  >>> import nltk
  >>> nltk.download()
  ```

  and then download all packages from the GUI.

- **GloVe** - Install the GloVe Python implementation:

  ```
  $ pip install glove_python
  ```

  Sometimes `libgcc` also needs to be installed:

  ```
  $ pip install libgcc
  ```

- **FastText** - Install the FastText implementation:

  ```
  $ pip install fasttext
  ```

- **Tensorflow** - Install the TensorFlow library (recommended version: 0.12.0):

  ```
  $ pip install tensorflow
  ```
- **Train Data** - Download the Positive & Negative tweet files in order to train the models, and move them into the `data/datasets/` directory.
- **Test Data** - Download the Test tweet file in order to test the models on Kaggle, and move it into the `data/datasets/` directory.
- **Stanford Pretrained GloVe Word Embeddings** - Download the GloVe pretrained word embeddings. Then, unzip the downloaded file and move the extracted files into the `data/glove/` directory. The default data is the 200d version (`cp glove.twitter.27B.200d.txt` into the `data/glove/` directory).
- **Preprocessed Tweet Files (Optional)** - In case you want to avoid the preprocessing execution time, you can download the preprocessed train tweets of the full dataset. After downloading the above file, just place it in the `data/preproc` directory. Also, before running any algorithm, make sure that the `preprocess` parameter is enabled and that only the default preprocess parameters are enabled (in any other case, the test set is going to be processed in a different way, which is unwanted). If the preprocessed file is in the right place, it is going to be loaded. Finally, in case you want to test the algorithm on different datasets, do not forget to remove the preprocessed file (normally done by enabling the `clear` and `preproc` parameters in the corresponding algorithm).
- **Pretrained Word Embeddings (Optional)** - In case you want to avoid training the whole word embeddings matrix from scratch, you can download the glove_python, the hybrid, and the baseline word embeddings (created with our default parameters). After downloading one of the aforementioned files, place it in the `data/glove` directory. When the corresponding method is chosen in the `options.py` file and the required embeddings file exists, it is simply loaded and the training phase is skipped. In any other case, the files are going to be reproduced from scratch (this might take a while).
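The pretrained GloVe files above are plain text, with one token per line followed by its vector components. A hedged sketch of loading that format into a `{word: vector}` dict (this is a generic reading of the file format, not the project's actual loader):

```python
# Sketch: parse a GloVe-format embeddings file (token followed by float
# components on each line) into a {word: vector} dict. Illustrated on an
# in-memory sample; with the real file you would open
# data/glove/glove.twitter.27B.200d.txt instead.
def load_glove(lines):
    """Map each word to its embedding vector (a list of floats)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

sample = ["the 0.1 0.2 0.3", "tweet -0.5 0.0 0.9"]
vectors = load_glove(sample)
print(vectors["tweet"])  # [-0.5, 0.0, 0.9]
```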
- A computer with:
  - at least 16 GB of RAM
  - a graphics card (optional - needed for faster training of the CNN solution)
  - a Unix-based operating system (e.g. Linux, macOS); tested & developed on Ubuntu
See the Public Leaderboard on Kaggle.
Our Team's name is **gLove is in the air...**♩ ♪ ♫ ♬:heart:
Go to the `src/` directory and set the `algorithms` variable in the `options.py` file.
In case you want to parametrise a model, just set the corresponding dictionary in `options.py`. For more details, check the important parameters of each algorithm in the aforementioned file.
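The exact contents of `options.py` are project-specific; the fragment below is a hypothetical illustration of how such a per-algorithm parameter dictionary might look (all names and values here are invented for the example):

```python
# Hypothetical options.py fragment -- the real parameter names may differ.
algorithms = ["CNN"]   # which algorithm(s) main.py should run

CNN = {
    "train": True,     # train from scratch vs. load a checkpoint
    "cv": False,       # enable cross-validation scoring
    "num_epochs": 10,
    "batch_size": 64,
}
```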
Then just start the `main.py` file:

```
$ cd src/
$ python main.py
```

When the program terminates, you will get all the predictions for the test file in the `data/submissions/` directory.
By setting the `cv` option to `true` in the `options.py` file (in the corresponding algorithm), you can get a good approximation of the Kaggle score directly from cross-validation (this might take a while for the full datasets).
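A cross-validation estimate of this kind can be sketched with scikit-learn's `cross_val_score`; the toy data below is invented, and the project's actual `cv` option may be implemented differently:

```python
# Sketch: estimate held-out accuracy with k-fold cross-validation,
# similar in spirit to the cv option described above. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

tweets = ["good", "nice day", "love it", "happy", "bad", "awful", "hate it", "sad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(model, tweets, labels, cv=4, scoring="accuracy")
print(scores.mean())  # average accuracy across the 4 folds
```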
In the `data/models/BEST` directory, we have stored a checkpoint of our best trained CNN model.
Go to the `options.py` file and:
- set `algorithm` to `CNN`
- in the `CNN` dictionary:
  - set `train` to `False`
  - make sure `checkpoint_dir` is set to `TF_SAVE_PATH + '/BEST/checkpoints'`
Finally, just follow the Demo procedure.
- Beril Besbinar
- Dimitrios Sarigiannis
- Panayiotis Smeros
License: MIT