CS-410-CourseProject

Presentation

https://mediaspace.illinois.edu/media/1_8aiq79tk

Objective

Reproducing results from a paper: "Mining Causal Topics in Text Data: Iterative Topic Modeling with Time Series Feedback"

Problems

Data: The first problem encountered was obtaining the data in order used to produce the results. Unfortunately, I missed the piazza post to request access to the NY times dataset. So, I wrote a webscraper (data_scraper.py) to scrape articles from NY Times, which I was able to save in the data folder. I only saved results which contained "bush" or "gore" so it was a manageable size to be uploaded. For the Market data, I manually copied the data into a csv.
Scope: The second problem encountered was the scope of the problem. After reading through the paper and doing a couple of hours of research, I discovered this would be a very complex result to produce, especially working as an individual, and factoring in my own gaps of knowledge involving the algorithm to be implemented. I was unable to completely recreate the algorithm using time stamp series feedback.

Results

I used an LDA model to find the top 3 words from topics discovered from the relevant documents. These are the results.

Top 3 Words in Significant Topics
party nader vote
tax plan social
oil price juniper
street music sunday
company court death
debate right candidate
city game old
clinton cheney know
school test student
clinton lazio mr_clinton

Looking at these words, we can definitely see some topics that were very relevant to the presidential election. Many of these words were also mined from the algorithm used in the paper.

Setup/Run Code

* If desired, set up virtual env. Then, to set up the code: *

# Clone repo, could take a couple seconds-minutes depending on download speed because of data (20-30 MB)
git clone https://github.com/Ludy11xc/CS-410-CourseProject.git

# Navigate into repo
cd CS-410-CourseProject

# Install dependencies.  If EnvironmentError is encountered, rerun with --user
pip install -r requirements.txt

If you would like to run the websraper, you can with

python data_scrapper.py

However, know that it will most likely take multiple hours to complete, and the data has already been scraped and is present in the data folder.

Then to run the code,

# Run this to get results from current model
python lda.py

# OR, run this to train a new model and get new results
python lda.py train

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
model		model
CS 410 Progress Report.pdf		CS 410 Progress Report.pdf
CS 410 Project Proposal.pdf		CS 410 Project Proposal.pdf
README.md		README.md
data_scrapper.py		data_scrapper.py
lda.py		lda.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CS-410-CourseProject

Presentation

Objective

Problems

Results

Setup/Run Code

About

Uh oh!

Releases

Packages

Languages

Ludy11xc/CS-410-CourseProject

Folders and files

Latest commit

History

Repository files navigation

CS-410-CourseProject

Presentation

Objective

Problems

Results

Setup/Run Code

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages