Twitter Sentiment Analysis

In this project, we present a comprehensive study of sentiment analysis on Twitter data, where the task is to predict whether the smiley in a tweet is positive or negative, given the tweet message. Using a fully automated framework, we implemented and experimented with the most powerful solutions proposed in the related literature, covering text preprocessing, text representation (also known as feature extraction), and supervised classification techniques. Comparing different combinations of these algorithms gave us a better understanding of each component, and exhaustive testing led to a very high classification score in our final results.
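As an illustration of such a pipeline (preprocessing, feature extraction, supervised classification), here is a minimal sketch using scikit-learn; the toy tweets and the choice of TF-IDF plus logistic regression are ours for demonstration, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: tweets labeled 1 (positive smiley) or 0 (negative smiley)
tweets = ["i love this movie", "great day today",
          "this is awful", "worst experience ever"]
labels = [1, 1, 0, 0]

# Feature extraction (TF-IDF) chained with a supervised classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)

print(model.predict(["what a great movie"]))
```

In the actual project, the vectorizer and classifier are swapped for the various representations and models explored (GloVe, FastText, CNN, and so on).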

Project Specification

See the Project Specification on the EPFL epfml/ML_course GitHub page.

Dependencies

To run the project, you will need the following dependencies installed:

Libraries

  • Anaconda3 - Download and install Anaconda with Python 3

  • Scikit-Learn - Download scikit-learn library with conda

    $ conda install scikit-learn
  • Gensim - Install Gensim library

    $ conda install gensim
  • NLTK - Download all packages of NLTK

    $ python
    >>> import nltk
    >>> nltk.download()

    and then download all packages from the GUI.

  • GloVe - Install the GloVe Python implementation

    $ pip install glove_python

    Sometimes libgcc also needs to be installed:

    $ conda install libgcc
  • FastText - Install FastText implementation

    $ pip install fasttext
  • TensorFlow - Install the TensorFlow library

    $ pip install tensorflow

    (Recommended version: TensorFlow 0.12.0)

Files

  • Train Data

Download the Positive & Negative tweet files in order to train the models and move them into the data/datasets/ directory.

  • Test Data

Download the Test tweet file in order to test the models on Kaggle and move it into the data/datasets/ directory.

  • Stanford Pretrained Glove Word Embeddings

Download the GloVe Pretrained Word Embeddings. Then, unzip the downloaded file and move the extracted files into the data/glove/ directory. The default is the 200-dimensional version (copy glove.twitter.27B.200d.txt into the data/glove/ directory).

  • Preprocessed Tweet Files (Optional)

If you want to avoid the preprocessing execution time, you can download the preprocessed train tweets of the full dataset. After downloading the file, just place it in the data/preproc/ directory. Also, before running any algorithm, make sure that the preprocess parameter is enabled and that only the default preprocess parameters are enabled (otherwise, the test set will be preprocessed in a different way, which is unwanted). If the preprocessed file is in the right place, it will be loaded automatically. Finally, if you want to test the algorithm on different datasets, do not forget to remove the preprocessed file (normally done by enabling the clear and preproc parameters in the corresponding algorithm).

  • Pretrained Word Embeddings (Optional)

If you want to avoid training the whole word-embeddings matrix from scratch, you can download the glove_python, hybrid, and baseline word embeddings (created with our default parameters). After downloading one of these files, place it in the data/glove/ directory. When the corresponding method is chosen in the options.py file and the required embeddings file exists, it is simply loaded and the training phase is skipped. Otherwise, the files are reproduced from scratch (this might take a while).
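The load-or-train behaviour described above can be sketched as follows; the path, function names, and the training stub are illustrative stand-ins, not the project's actual code:

```python
import os
import numpy as np

EMBEDDINGS_PATH = "data/glove/embeddings.npy"  # illustrative path

def train_embeddings(dim=200, vocab_size=1000):
    """Stub standing in for the real (slow) embedding training."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((vocab_size, dim))

def load_or_train_embeddings(path=EMBEDDINGS_PATH):
    # If the pretrained file is already in place, just load it...
    if os.path.exists(path):
        return np.load(path)
    # ...otherwise reproduce it from scratch and cache it for next time
    emb = train_embeddings()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    np.save(path, emb)
    return emb
```

The second call with the same path returns the cached matrix instead of retraining, which is the behaviour the optional downloads exploit.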

Hardware Requirements

  • A Computer with:
    • at least 16 GB of RAM
    • a Graphics Card (optional - needed for faster training in CNN solution)
    • a Unix-based Operating System (e.g. Linux, macOS). Tested & developed on Ubuntu

Kaggle Submission

See the Public Leaderboard on Kaggle.

Our Team's name is **gLove is in the air...**♩ ♪ ♫ ♬:heart:

Demo

Go to the src/ directory and set the algorithms variable in the options.py file.

If you want to tune a model's parameters, just set the corresponding dictionary in options.py.

For more details, check the important parameters of each algorithm in the aforementioned file.
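For orientation, such a configuration could look roughly like the sketch below; the key names and values are hypothetical stand-ins, so check options.py itself for the authoritative variables and defaults:

```python
# Hypothetical sketch of an options.py configuration; the real file
# in src/ defines the authoritative names and defaults.
algorithms = ["CNN"]  # which algorithm(s) to run

CNN = {
    "train": True,    # train from scratch vs. load a checkpoint
    "cv": False,      # enable cross validation (slow on full data)
    "preproc": True,  # run tweet preprocessing
}
```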

Then just run the main.py file:

$ cd src/
$ python main.py

When the program terminates, you will find all the predictions for the test file in the data/submissions/ directory.

By setting the cv option to True in the options.py file (in the corresponding algorithm), you can get a good approximation of the Kaggle score directly from cross validation (this might take a while on the full dataset).
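The idea behind the cv option can be sketched with scikit-learn's cross_val_score on a toy dataset (this is not the project's pipeline; the data and model here are synthetic stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the tweet features and smiley labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross validation: the mean held-out accuracy approximates
# the score you would get on an unseen (Kaggle) test set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```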

Reproduce Our Best Kaggle Score

In data/models/BEST directory, we have stored a checkpoint of our best CNN trained model.

Go to options.py file and:

  • Set algorithm to CNN
  • and in the CNN dictionary:
    • set train to False
    • make sure checkpoint_dir is set to TF_SAVE_PATH + '/BEST/checkpoints'

Finally, just follow the Demo procedure.

Contributors

  • Beril Besbinar
  • Dimitrios Sarigiannis
  • Panayiotis Smeros

License: MIT
