li-951/CourseProject

 
 

CourseProject

Please fork this repository and paste the GitHub link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.

Team info

Team name: kms

Team members: Kim Li (kimli2), Kevin Tzeng (ktzeng2), Shreyas Chandrashekaran (svc3)

Requirements

Installation

From the project directory, run the following command to install requirements:

$ pip install -r requirements.txt

Overview

The function cleanData() cleans the raw e-mail text in project/original_dataset.csv, splitting the e-mail subject and the e-mail body into separate columns. It returns a pandas DataFrame with the columns spam, subject, body, attachment, reply, and forward.
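
As a rough illustration of the subject/body split, the sketch below processes a single raw e-mail string using only the standard library. It is not the project's cleanData() implementation: the helper name split_email is hypothetical, and the conventions it relies on (a "Subject:" header on the first line, "Re:" marking a reply, "Fw:"/"Fwd:" marking a forward) are assumptions about the raw data. Attachment detection is left out because the README does not say how it is determined.

```python
import re


def split_email(raw_text):
    """Split one raw e-mail string into subject and body, plus reply/forward flags."""
    # Assumption: the raw text starts with a "Subject:" header on its first line.
    first_line, _, rest = raw_text.partition("\n")
    match = re.match(r"\s*subject:\s*(.*)", first_line, flags=re.IGNORECASE)
    if match:
        subject, body = match.group(1).strip(), rest.strip()
    else:
        # No recognizable header: treat the whole text as the body.
        subject, body = "", raw_text.strip()

    lowered = subject.lower()
    return {
        "subject": subject,
        "body": body,
        # Assumption: replies and forwards are marked by a subject prefix.
        "reply": int(lowered.startswith("re:")),
        "forward": int(lowered.startswith(("fw:", "fwd:"))),
    }


example = split_email("Subject: Re: quarterly report\nPlease see my comments below.")
print(example)
```

Applying such a helper to every row of the raw CSV would yield the subject, body, reply, and forward columns described above.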

The function filter(df, args) takes the cleaned DataFrame and the user-supplied command-line arguments, applies the requested filters, and returns the filtered DataFrame.
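
The filtering logic can be sketched as follows. To keep the example dependency-free it works on a list of dictionaries rather than a pandas DataFrame, so it is an analogue of filter(df, args) and not the project's code. The name filter_rows and its keyword parameters are hypothetical, and case-insensitive substring matching for the subject and body keywords is an assumption.

```python
def filter_rows(rows, spam=None, attachment=None, reply=None, forward=None,
                subject=None, body=None):
    """Return the rows that satisfy every filter that is not None."""
    flag_filters = {"spam": spam, "attachment": attachment,
                    "reply": reply, "forward": forward}
    keyword_filters = {"subject": subject, "body": body}

    kept = []
    for row in rows:
        # 0/1 flags must match exactly when a value was requested.
        flags_ok = all(int(row[col]) == want
                       for col, want in flag_filters.items() if want is not None)
        # Assumption: keyword filters are case-insensitive substring matches.
        keywords_ok = all(kw.lower() in row[col].lower()
                          for col, kw in keyword_filters.items() if kw is not None)
        if flags_ok and keywords_ok:
            kept.append(row)
    return kept


rows = [
    {"spam": 0, "subject": "Re: budget", "body": "see attached",
     "attachment": 1, "reply": 1, "forward": 0},
    {"spam": 1, "subject": "win a prize", "body": "click here",
     "attachment": 0, "reply": 0, "forward": 0},
]
matches = filter_rows(rows, attachment=1, reply=1)
print(matches)
```

Filters left as None are skipped, which mirrors the fact that every command-line argument is optional.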

Software documentation

The following (optional) arguments can be passed in the terminal to filter the dataset:

Argument      Type     Description
--csv         string   CSV file to write the output to (default is output.csv)
--spam        integer  filter based on the spam label (0 for not spam, 1 for spam)
--attachment  integer  filter based on whether the e-mail contains an attachment (0 for no attachment, 1 for attachment included)
--reply       integer  filter based on whether the e-mail is a reply (0 for not a reply, 1 for a reply)
--forward     integer  filter based on whether the e-mail is forwarded (0 for not forwarded, 1 for forwarded)
--subject     string   filter based on keywords contained in the subject
--body        string   filter based on keywords contained in the e-mail body
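
One way these arguments could be declared is with the standard-library argparse module, as in the sketch below. Whether parser.py actually uses argparse is an assumption, as are the helper name build_parser and the restriction of the integer flags to the values 0 and 1 via choices.

```python
import argparse


def build_parser():
    """Build a parser for the optional filtering arguments listed above."""
    parser = argparse.ArgumentParser(description="Filter the cleaned e-mail dataset.")
    parser.add_argument("--csv", type=str, default="output.csv",
                        help="CSV file to write the output to")
    # Assumption: the 0/1 flags are validated with choices; None means "no filter".
    for flag in ("spam", "attachment", "reply", "forward"):
        parser.add_argument(f"--{flag}", type=int, choices=[0, 1], default=None,
                            help=f"filter on the {flag} column (0 or 1)")
    parser.add_argument("--subject", type=str, default=None,
                        help="keywords the subject must contain")
    parser.add_argument("--body", type=str, default=None,
                        help="keywords the e-mail body must contain")
    return parser


# Parse an explicit argument list here so the sketch runs without a real command line.
args = build_parser().parse_args(["--csv", "sample.csv", "--attachment", "1", "--reply", "1"])
print(args)
```

Arguments that are not supplied stay at None, so downstream filtering code can tell "not requested" apart from an explicit 0.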

Software usage documentation

To clean the data without applying any filters, run the following:

$ python parser.py

To filter the dataset, pass one or more of the filtering flags. For example, project/sample.csv was created using the following filters:

$ python parser.py --csv 'sample.csv' --attachment 1 --reply 1

and contains a list of all e-mails that include an attachment and are a reply to a previous e-mail.

Contribution

  • Kim: filtering dataset
  • Kevin: filtering dataset
  • Shreyas: cleaning dataset
