This is a repository for Machine Learning coursework from Cardiff University.
This repository is part of the Machine Learning (ML) coursework for the MSc in Data Science and Analytics at Cardiff University. The objective of this exercise is to provide a Python machine learning model that performs sentiment analysis on a movie review dataset (IMDb).
The dataset used in this exercise is available in this repository. However, if you want to obtain it yourself, it is based on the IMDb Reviews data. Go to the site, look for “Large Movie Review Dataset v1.0” and download it. You will then find a file called aclImdb_v1.tar.gz in your downloads folder. Unpack it and save it in the same folder where you run your Python file. Once unpacked, the dataset is provided as text files with positive and negative reviews. In the coursework dataset files, the core dataset contains 25,000 reviews split into train, development and test sets. The overall distribution of labels is roughly balanced.
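If you prefer to unpack the archive from Python rather than by hand, a minimal sketch using the standard tarfile module (assuming the default file name aclImdb_v1.tar.gz) is:
import tarfile
# Unpack aclImdb_v1.tar.gz into the current folder (creates an aclImdb directory)
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as archive:
    archive.extractall('.')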
The repository contains two items:
- Dataset (it has 3 folders - train, develop and test; each folder contains files with negative and positive reviews, one review per line)
- One Python script
This exercise was built to run in Python from the terminal. To follow these instructions more easily, I suggest downloading the data files available here and saving them in the same folder as the Python file.
Reference paper
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
This code uses the following packages:
import numpy as np
import pandas as pd
import re
import nltk
import sklearn
# Data Preprocessing and Feature Engineering
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, ENGLISH_STOP_WORDS
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
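Note that word_tokenize and WordNetLemmatizer rely on NLTK data packages that are not installed with the library itself. If they are missing, a one-off download is needed (a small setup sketch):
import nltk
# One-off downloads required by word_tokenize and WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # only needed on newer NLTK versions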
The original datasets are split into positive and negative text files, so the contents of both files need to be merged. I used the following commands to do that. First, we create the train dataset using pandas.
Load the positive and negative .txt files with pandas, using the newline character (‘\n’) as the separator, where ‘imdb_train_pos.txt’ is the positive review file and 'imdb_train_neg.txt' the negative one.
df_pos = pd.read_csv('imdb_train_pos.txt', sep='\n', header = None)
df_neg = pd.read_csv('imdb_train_neg.txt', sep='\n', header = None)
Then, create a second column [1] with the review labels: ‘1’ for positive reviews and ‘0’ for negative ones.
df_pos[1] = 1
df_neg[1] = 0
#print(df_pos.head())
#print(df_neg.head())
df_train = pd.concat([df_pos,df_neg])
df_train.columns = ['text','label']
df_pos_test = pd.read_csv('imdb_test_pos.txt', sep='\n', header = None)
df_neg_test = pd.read_csv('imdb_test_neg.txt', sep='\n', header = None)
df_pos_test[1] = 1
df_neg_test[1] = 0
df_test = pd.concat([df_pos_test,df_neg_test])
df_test.columns = ['text','label']
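Depending on the pandas version installed, read_csv may reject ‘\n’ as a separator. A version-independent alternative is to read each file line by line and build the same DataFrames directly; the sketch below uses a hypothetical helper called load_reviews and the same file names as above.
def load_reviews(path, label):
    # Read one review per line from a plain-text file and attach a sentiment label
    with open(path, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    return pd.DataFrame({'text': lines, 'label': label})

df_train = pd.concat([load_reviews('imdb_train_pos.txt', 1),
                      load_reviews('imdb_train_neg.txt', 0)], ignore_index=True)
df_test = pd.concat([load_reviews('imdb_test_pos.txt', 1),
                     load_reviews('imdb_test_neg.txt', 0)], ignore_index=True)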
Feature extraction/engineering covers techniques for treating, extracting and reducing the features fed into machine learning models. It can reduce the time of the machine learning process and increase the accuracy of the model, because it lowers the dimensionality without losing important information. For this reason, feature extraction is essential to an effective machine learning model. This exercise applied the following preprocessing and feature extraction methods (a toy illustration follows the list):
- Vectorization
- Bag-of-words with 3-gram range
- Stop words
- Lemmatization
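To make the bag-of-words step concrete, here is a toy illustration (not part of the coursework code) of the features a 1-3 gram range produces for a single short review:
# Toy illustration: n-gram features extracted from one short review
toy_vect = CountVectorizer(ngram_range=(1, 3))
toy_vect.fit(['the film was great'])
print(toy_vect.get_feature_names_out())
# n-grams produced: 'film', 'film was', 'film was great', 'great', 'the', 'the film', 'the film was', 'was', 'was great'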
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
# Lemmatization: wrap NLTK's WordNetLemmatizer so CountVectorizer can use it as a tokenizer
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
#stopwords
my_stopwords = ENGLISH_STOP_WORDS.union(['@','<br />'])
#vectorization, bag of words and max. features
vect = CountVectorizer(max_features=1000, ngram_range = (1,3), stop_words = my_stopwords, tokenizer=LemmaTokenizer())
X = vect.fit_transform(df_train.text)
# Use transform (not fit_transform) so the test set is encoded with the vocabulary fitted on the training set
X_test = vect.transform(df_test.text)
#Transform to an array
my_array = X.toarray()
my_array_test = X_test.toarray()
#Transform back to a dataframe, assign column names
# get_feature_names_out() replaces the older get_feature_names() in recent scikit-learn versions
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())
X_df_test = pd.DataFrame(my_array_test, columns=vect.get_feature_names_out())
After preprocessing and extracting features, we can now train our machine learning model on the train dataset. I am using the SVM from the sklearn package here. If necessary, we can use the development dataset to test new feature extraction settings before running the model on the test dataset (see the sketch after the training step below).
svm_review = sklearn.svm.SVC(kernel="linear",gamma='auto')
model = svm_review.fit(X_df,df_train.label)
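For example, the development set can be loaded and vectorized exactly like the test set and used to compare feature extraction settings before touching the test data. This is a sketch; the file names imdb_dev_pos.txt and imdb_dev_neg.txt are assumptions based on the naming used above.
# Load the development reviews the same way as the train/test files (file names assumed)
df_pos_dev = pd.read_csv('imdb_dev_pos.txt', sep='\n', header = None)
df_neg_dev = pd.read_csv('imdb_dev_neg.txt', sep='\n', header = None)
df_pos_dev[1] = 1
df_neg_dev[1] = 0
df_dev = pd.concat([df_pos_dev,df_neg_dev])
df_dev.columns = ['text','label']
# Transform with the vectorizer already fitted on the training data
X_df_dev = pd.DataFrame(vect.transform(df_dev.text).toarray(), columns=vect.get_feature_names_out())
# Check development accuracy before running on the test set
print(accuracy_score(df_dev.label, model.predict(X_df_dev)))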
Then, once we have trained our model and are comfortable with the initial results, we can run it on the test dataset and get predictions from our model.
predictions = model.predict(X_df_test)
print(confusion_matrix(df_test.label,predictions))
print(classification_report(df_test.label,predictions))
print(accuracy_score(df_test.label, predictions))
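As a possible refinement (not part of the original coursework code), the vectorizer and the classifier can be bundled into a single scikit-learn Pipeline, so the test set is automatically transformed with the vocabulary fitted on the training data:
from sklearn.pipeline import Pipeline
# Sketch: combine vectorization and classification in one estimator
sentiment_pipeline = Pipeline([
    ('vect', CountVectorizer(max_features=1000, ngram_range=(1, 3),
                             stop_words=my_stopwords, tokenizer=LemmaTokenizer())),
    ('clf', SVC(kernel='linear', gamma='auto')),
])
sentiment_pipeline.fit(df_train.text, df_train.label)
print(accuracy_score(df_test.label, sentiment_pipeline.predict(df_test.text)))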