Key Documents

Final Presentation Video (link)
Project Proposal (link)
Mid Term Progress Report (link)

Overview

This README serves as the final documentation of our project. As per the submission guidelines, this document covers our original functional goals as stated in our proposal, what we were able to accomplish, and implementation details. We also provide installation instructions and a user guide for our application. We conclude with the key challenges we faced during the project and some future work.

Functional Goal(s)

In spite of the advent of many communication tools (e.g. social media, Slack), email still occupies an important position in one's communication tool chest. The main goal of this project, code-named "Maximus", was to perform topic modelling on a user's Inbox (Gmail) and assign topics to each email. We believe this would help users deal with the deluge of emails they receive every day by summarizing the main topics discovered in their Inbox (especially among the unread emails) and help them focus on the emails they are most interested in.

What did we accomplish?

We were able to build a basic web app (a dockerized, Python-based Flask application) that downloads emails from a user's Gmail Inbox and discovers the latent topics in the email corpus. We used the LDA implementation from the Gensim package for topic modelling. We were also able to compute the topic coverage for each email in the Inbox and associate topic label(s) with each email.

We built a dashboard (using ReactJS) to display the topics and the number of emails associated with each topic.

Installation and User Guide

There are 2 ways to install and use the application.

Local Install (personal instance)

The application can be installed locally on your laptop/desktop (or a VM in the cloud).
To install the application you will need Docker Desktop installed on your machine. If you don't have it already, it is fairly simple to install: please go here if you work on Windows or here if you are a macOS user.
Clone this git repository.
cd into the source/ folder and run the command docker-compose up. This should bring up the application.
The application can be accessed at url http://localhost:8080

Or, Use our Cloud Instance

If you do not want to run locally or have run into issues, you can choose to access the application hosted at http://138.68.131.219:8080
This instance is already configured with a test gmail account (amythcloud@gmail.com). See details in the below section.

Getting Started with Topic Discovery

Now for the fun part. To discover latent topics in your Gmail Inbox (only Gmail for now) you will have to authorize the app to read emails from your Inbox. Don't panic! We are not going to read all your email :-). Instead, we ask you to add a label called cs410 to the emails that you are OK with the app accessing, e.g. spam mail, promotions etc. We ONLY read emails tagged cs410 from your Inbox; if you have not tagged anything, we read zero emails.
This is less of a concern when running the app locally on your computer (the data stays on your computer, just as with any desktop email client) than when using the cloud instance.
Note: Since our application has not gone through Google certification, we will need to explicitly add your Gmail id to the app tester list. Please send your Gmail id to ameetd2@illinois.edu to get it added as a tester for this app.
If you want to play it absolutely safe (understandably so), you can use the "test" Gmail account we have set up (mailID: amythcloud@gmail.com) with some pre-seeded emails. Since this is a public repo, please email ameetd2@illinois.edu for the password for this test Gmail account.
Hitting the URL for the first time will automatically direct you to Google to get your authorization for accessing your Inbox.
Once authorization is successful, click the Sync Emails button at the top right corner, click Sync in the dialog that pops up, and wait for the spinner to stop. If there are errors, a red badge appears on the button; click the button to reopen the dialog and see the cause. The sync may take time depending on the number of emails and network speed. The blue badge shows the total count of downloaded messages.
Once the Sync Emails button shows a green tick mark, click the Analyze Emails button to discover topics. Again, wait until the spinner stops and a green tick mark appears on the Analyze Emails button.
Once you see a green tick mark on the Analyze Emails button, REFRESH the dashboard page to see all the topics discovered in your Inbox :-).

Software Design and Implementation Details

The following diagram depicts the key components of the application and the flow of data between them.

The major components of the application are as follows:

Data PipeLine

This component is responsible for bringing in emails from the Gmail service, passing them through a pre-processor (see below), and storing them in a MongoDB document collection called maximus.emails. The code for this module can be found in datapipe.py. It relies heavily on gmail_client.py to interact with the Gmail service.
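The flow described above can be sketched as follows. This is a minimal illustration, not the code in datapipe.py: the names run_pipeline, fake_fetch and the message fields "id" and "body" are hypothetical, and a plain dict stands in for the maximus.emails collection so the example runs without MongoDB or Gmail access.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    fetch_messages: Callable[[], Iterable[dict]],
    preprocess: Callable[[str], List[str]],
    store: Dict[str, dict],
) -> int:
    """Fetch raw emails, pre-process each body, and upsert the result by id.

    `store` is a stand-in for the MongoDB collection. Keying by message id
    means a repeated sync overwrites a message instead of duplicating it.
    """
    count = 0
    for message in fetch_messages():
        tokens = preprocess(message["body"])
        store[message["id"]] = {"id": message["id"], "tokens": tokens}
        count += 1
    return count


# Hypothetical stand-in for the Gmail client.
def fake_fetch():
    return [
        {"id": "m1", "body": "Sale ends today"},
        {"id": "m2", "body": "Meeting moved to Friday"},
    ]


emails: Dict[str, dict] = {}
synced = run_pipeline(fake_fetch, lambda text: text.lower().split(), emails)
print(synced, emails["m1"]["tokens"])
```

Passing the fetch and pre-process steps in as functions keeps the pipeline testable without network access; a real run would supply the Gmail client and the pre-processor instead.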

Data Pre-Processor

This is one of the key components of the application, as the success of topic modeling depends heavily on the quality of the data. The module is responsible for extracting text from emails of different MIME types (using a package called BeautifulSoup), tokenizing, stemming, and removing stop words. We also remove all tokens with non-ASCII characters, purely numeric tokens etc. Most of the code for this module can be found in preprocessor.py.
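The cleaning steps listed above can be illustrated with a short sketch. It is not the code in preprocessor.py: to stay dependency-free it swaps in the standard library's html.parser for BeautifulSoup, uses a tiny illustrative stop-word list, and leaves out stemming (which would need a third-party stemmer). The names html_to_text and preprocess are assumptions.

```python
import re
from html.parser import HTMLParser
from typing import List

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def preprocess(html: str) -> List[str]:
    """Lowercase, tokenize, and drop stop words, non-ASCII and numeric tokens."""
    tokens = re.findall(r"\w+", html_to_text(html).lower())
    return [
        tok for tok in tokens
        if tok not in STOP_WORDS and tok.isascii() and not tok.isdigit()
    ]


sample = "<p>The café offers 20 discounts on shoes</p><script>var x = 1;</script>"
print(preprocess(sample))
```

In this example "the" and "on" are dropped as stop words, "café" as non-ASCII, "20" as purely numeric, and the script body never reaches the tokenizer.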

Topic Modeling Engine

This module is the heart of the application. It loads pre-processed emails from the MongoDB collection, constructs a vocabulary, applies a TF-IDF transformation to each email, and then trains an LDA model using the Gensim Python package. Model parameters such as the number of topics, passes, and iterations are fixed in advance, based on a cross-validation exercise done out-of-band, in order to reduce topic discovery time. We also use the topic coherence metric as described here to select the topics whose constituent words are most coherent. Using the trained LDA model, we then compute the topic coverage for each email and tag each email with the topic(s) whose coverage in that email is >= 0.30. The code for this module can be found in topic_model.py.
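The tagging rule at the end of this step is simple enough to sketch on its own. The example below assumes each email's topic coverage is available as a list of (topic_id, probability) pairs, which is the shape Gensim's LdaModel.get_document_topics returns; the function name tag_email and the sample numbers are illustrative, and the model training itself is not shown.

```python
from typing import Dict, List, Tuple

COVERAGE_THRESHOLD = 0.30  # from the text: keep topics with >= 0.30 coverage


def tag_email(
    topic_coverage: List[Tuple[int, float]],
    threshold: float = COVERAGE_THRESHOLD,
) -> List[int]:
    """Return the topic ids whose coverage meets the threshold, highest first."""
    kept = [(tid, prob) for tid, prob in topic_coverage if prob >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tid for tid, _ in kept]


# Made-up coverage values for three emails.
coverage_by_email: Dict[str, List[Tuple[int, float]]] = {
    "m1": [(0, 0.55), (3, 0.35), (7, 0.10)],
    "m2": [(2, 0.90), (5, 0.10)],
    "m3": [(1, 0.25), (4, 0.25), (6, 0.25), (8, 0.25)],
}
tags = {eid: tag_email(cov) for eid, cov in coverage_by_email.items()}
print(tags)
```

Note that an email whose coverage is spread thinly across many topics (m3 here) ends up with no tag at all under this rule.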

Topic Insights Per Email

The code for this module is in the file "topic_model_per_email.py". In this module we go through all emails in batches of 10 and extract topic insights for each email. For each batch of 10 emails, we use NMF to discover 40 topics using unigrams, bigrams and trigrams. The intent is to have very high quality topics. After fitting the data and finding the topics, we pick the top 3 topics with the highest scores for each email. The reason we limit batches to 10 emails is that, with larger batches, we would still want to maintain a 1:3 or 1:4 ratio of emails to topics to ensure we extract high quality topics. The number of topics would then grow with the batch size, resulting in a huge sparse matrix and making the fit and transform very slow. We also tried TruncatedSVD and LDA but found, by visual inspection, that NMF gave the best results.
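The batching and top-3 selection can be sketched without the factorization itself. In the example below the helper names batches and top_topics are hypothetical, and the score list stands in for one email's row of the document-topic matrix that an NMF fit would produce; the NMF step is not shown.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 10  # emails per batch, from the text
TOP_K = 3        # topics kept per email, from the text


def batches(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def top_topics(scores: Sequence[float], k: int = TOP_K) -> List[int]:
    """Indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


email_ids = [f"m{i}" for i in range(23)]
batch_sizes = [len(batch) for batch in batches(email_ids)]
print(batch_sizes)

# Made-up topic scores for a single email.
print(top_topics([0.0, 0.4, 0.1, 0.9, 0.3]))
```

With 23 emails the last batch holds only 3, so a real implementation would need to decide whether to keep 40 topics for a short final batch or scale the count down.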

Topic Dashboard

This is the front end of the application, built using ReactJS. The code is located in the source/webapp folder. The front end shows simple ReactJS cards, one for each topic discovered. Inside each card we show the top 10 words with the highest probability mass. We also list the messages that have significant coverage of this topic (>0.30), categorized as read vs. unread.
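To make the card contents concrete, here is a sketch of how the data behind one card could be assembled on the backend. Everything about the shape is an assumption: the function name build_card, the field names, and the per-message "coverage" and "unread" keys are hypothetical and are not taken from the repository.

```python
from typing import Dict, List


def build_card(
    topic_id: int,
    word_probs: Dict[str, float],
    messages: List[dict],
    threshold: float = 0.30,
    top_n: int = 10,
) -> dict:
    """Assemble one topic card: top words plus read/unread message ids."""
    # Words ordered by probability, highest first, capped at top_n.
    top_words = sorted(word_probs, key=word_probs.get, reverse=True)[:top_n]
    read: List[str] = []
    unread: List[str] = []
    for msg in messages:
        # The dashboard text states coverage > 0.30 for a message to be listed.
        if msg["coverage"].get(topic_id, 0.0) > threshold:
            (unread if msg["unread"] else read).append(msg["id"])
    return {"topic": topic_id, "words": top_words, "read": read, "unread": unread}


# Made-up word probabilities and messages for topic 0.
word_probs = {"sale": 0.20, "discount": 0.15, "offer": 0.10, "shoes": 0.05}
messages = [
    {"id": "m1", "unread": True, "coverage": {0: 0.55}},
    {"id": "m2", "unread": False, "coverage": {0: 0.40, 1: 0.50}},
    {"id": "m3", "unread": True, "coverage": {1: 0.90}},
]
card = build_card(0, word_probs, messages)
print(card)
```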

Building Code

All code is in the source directory. To build a fresh Docker image of the application, first build the front end: $ cd webapp/; yarn build. Then run the ./build.sh script in the source folder. After that, docker-compose up should automatically run the new image.

Challenges

There are 2 major challenges we ran into during this project.

It took considerable time and effort to clean the email data (mostly due to the variety of MIME types) and get the extracted text into a form that our topic modelling module could consume.
It was difficult to evaluate the quality of the topics that our LDA model generated, due to the unsupervised nature of the algorithm. Also, parameters tuned for one Inbox did not generalize well to other Inboxes. Topic coherence was a useful (but not sufficient) metric for selecting model parameters, at least within the scope of a single dataset (i.e. one Inbox).
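Topic coherence can be made concrete with a small example. The text does not say which coherence measure was used, so the sketch below assumes the UMass measure, which scores a topic by how often its top words appear together in the same documents. The function name and the toy corpus are illustrative only.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set


def umass_coherence(top_words: Sequence[str], docs: List[Set[str]]) -> float:
    """UMass coherence of a topic's top words over a corpus of token sets.

    Sums log((D(later, earlier) + 1) / D(earlier)) over all word pairs, where
    top_words is ordered most probable first and D counts the documents that
    contain all the given words. Higher scores mean more co-occurrence.
    """
    def doc_freq(*words: str) -> int:
        return sum(1 for doc in docs if all(w in doc for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        base = doc_freq(earlier)
        if base == 0:
            continue  # word absent from the corpus; the pair carries no evidence
        score += math.log((doc_freq(later, earlier) + 1) / base)
    return score


# Toy corpus: each document is the set of tokens in one email.
docs = [
    {"sale", "discount", "shoes"},
    {"sale", "discount"},
    {"meeting", "friday"},
    {"sale", "meeting"},
]
coherent = umass_coherence(["sale", "discount"], docs)
mixed = umass_coherence(["sale", "friday"], docs)
print(coherent, mixed)
```

Here the pair that co-occurs in two emails scores higher than the pair that never co-occurs. Because the counts come from a single corpus, scores like this are only comparable within one Inbox, which is consistent with the limitation noted above.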

Future Work

In order to produce meaningful results for each user's Inbox, we believe the number of topics (and other model parameters) needs to be determined more dynamically, as it can differ for each user's Inbox. This can be quite time consuming, especially for large Inboxes.
We believe the topics assigned to each email can be used to cluster similar emails or to help users find related emails easily.
Use topic labels as additional input/features for spam vs. ham classification.

Team Members and Contributions

Name	Email Address	Contributions
Ameet Deulgaonkar	ameetd2@illinois.edu	Built the data pipeline to download and pre-process emails. Built the Flask backend and the ReactJS front-end dashboard. Implemented the corpus-level topic modelling using LDA. Packaging and installation.
Praveen Purohit	purohit4@illinois.edu	Built the per-email topic discovery to summarize emails. Experimented with various topic modelling approaches (LDA, NMF and PLSA) and helped pick the right model for our use case. Tested the application.
