CourseProject

Software Implementation:

The implementation for this program comes in two parts. We decided to use a Naive Bayes methodology, which starts with a training method and then uses that data to predict classifications onto a testing dataset. We use a bag of words model in which we consider each tweet to be represented as a bag of independent words, meaning that we will ignore the position that the words appear and only deal with the frequency that they appear.

Using a unigram model, the program first calls the training function found in train.py, which goes through the training data given and creates a bag of words model. Because each tweet is pre-labeled as either “SARCASM” or “NOT_SARCASM”, we can treat each tweet as a list of words using the NLTK TweetTokenizer. This method returns a list of sarcastic words and their frequencies, a list of non-sarcastic words and their frequencies, as well as totals for both sarcastic and non sarcastic words.

After that, during the classification phase, the program calculates the probability for each tweet to be “SARCASM” or “NOT_SARCASM” based on the probabilities that were developed from the training set. The program develops a posterior probability for each tweet, and based on which is higher, classifies the tweet as either “SARCASM”or “NOT_SARCASM”. In order to avoid zero probabilities, we have a smoothing parameter in place, which has been optimized to give the best results.

Installation and running:

To install the software, first clone this git repository: https://github.com/shail4221/Classification-Competition.git

Once finished, you can run the program inside by running python classify.py. This will train the data on the train.jsonl file, run the classification on the test.jsonl file, and output the results into answer.txt

The video demo for running the code could be found at: https://drive.google.com/file/d/14abGsfp8Gjn4iRPimjK375e4egvLOwgn/view?usp=sharing

Description of contribution:

Both members worked together mostly equally, with Xuechen focusing primarily on the training side of the program and Shail focusing on the classification side. Both team members worked together to refine and test the program for accuracy.

The training file is train.py, the classification file is classify.py, and the output file is answer.txt. The link to video demo is also in VideoLink.txt.

Thank you.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CourseProject

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
data		data
410_Project_Documentation.docx		410_Project_Documentation.docx
Progress_Report.pdf		Progress_Report.pdf
Project Proposal.pdf		Project Proposal.pdf
README.md		README.md
VideoLink.txt		VideoLink.txt
answer.txt		answer.txt
classify.py		classify.py
test.jsonl		test.jsonl
train.jsonl		train.jsonl
train.py		train.py

xw23/CourseProject

Folders and files

Latest commit

History

Repository files navigation

CourseProject

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages