All project deliverables (source code, documentation, and presentation) were completed by me, as I was not a member of a team.
Mining Causal Topics in Text Data: Iterative Topic Modeling with Time Series Feedback (Kim et al. 2013)
This project attempts to reproduce the 2000 Presidential Election experiment outlined in the paper above.
All programming was done in Python 3.6 (specifically Anaconda distribution 4.4.0).
A requirements.txt file is provided listing all non-standard libraries used and their versions:
pandas==1.1.5
gensim==3.8.0
nltk==3.4.4
scipy==1.5.4
numpy==1.19.4
All source code is contained within ITMTFPresidential.py as a series of functions.
For convenience, the following objects have been serialized to files (specifically .pkl files) for easy re-use:
- president_norm_stock_ts: The non-text time series containing normalized presidential stock prices (May through October 2000, for the Bush-Gore 2000 presidential race).
- gore_bush_nyt_ts: The document time series containing NY Times articles from May through October 2000 mentioning either Bush or Gore.
- cleaned_doc_list: The document collection to be analyzed (i.e., a list of documents, each represented as a list of tokens or words).
- gensim_dictionary: The gensim dictionary object to be analyzed (built from gore_bush_doc_list).
- gensim_corpus: The gensim corpus to be analyzed (i.e., a list of documents, each represented as a list of (wordID, count) pairs; see the sketch after this list).
- word_impact_dict: A dictionary of corpus {wordID: (impact score, p-value)} entries. It represents the result of section 4.2.2's Word-level Causality analysis.
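To make the dictionary/corpus relationship concrete, here is a minimal sketch using toy documents and gensim's public API. The variable names mirror the objects above, but the documents themselves are illustrative:

```python
from gensim.corpora import Dictionary

# Toy stand-in for the real cleaned_doc_list (tokenized NYT articles)
cleaned_doc_list = [
    ["bush", "gore", "debate"],
    ["gore", "campaign", "campaign"],
]

gensim_dictionary = Dictionary(cleaned_doc_list)  # maps wordID <-> token
gensim_corpus = [gensim_dictionary.doc2bow(doc) for doc in cleaned_doc_list]

print(gensim_corpus[1])  # e.g. [(2, 1), (3, 2)]: "gore" once, "campaign" twice
```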
At runtime, if the script's reload_data variable is set to False, the script reloads president_norm_stock_ts and gore_bush_nyt_ts from disk in O(1) time. If it is set to True, the functions build_datasets() and parse_nyt_corpus_for_gore_bush() are called to rebuild these datasets from a .csv file and from the NYT corpus of XML documents, respectively. Because the NYT dataset was too large to upload to this repository, setting this variable to True is not recommended.
Additionally, if the script's build_new_corpus variable is set to False, it reloads all of the other remaining objects from disk in O(1) time. If set to True, it rebuilds them from scratch: the document collection, the gensim dictionary, and the gensim corpus (the object passed to LdaModel() for its corpus parameter). It then re-performs the Word-level Causality Analysis from section 4.2.2, storing the result per gensim dictionary word ID in a dictionary for quick lookup during ITMTF iterations.
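For illustration, the reload pattern described above might look like the sketch below; the .pkl file names are assumptions, not necessarily the exact paths the script uses:

```python
import pickle

def load_pkl(path):
    """Deserialize a previously saved object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

reload_data = False       # False -> reuse the serialized time series
build_new_corpus = False  # False -> reuse the serialized corpus objects

if not reload_data:
    president_norm_stock_ts = load_pkl("president_norm_stock_ts.pkl")
    gore_bush_nyt_ts = load_pkl("gore_bush_nyt_ts.pkl")

if not build_new_corpus:
    cleaned_doc_list = load_pkl("cleaned_doc_list.pkl")
    gensim_dictionary = load_pkl("gensim_dictionary.pkl")
    gensim_corpus = load_pkl("gensim_corpus.pkl")
    word_impact_dict = load_pkl("word_impact_dict.pkl")
```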
The following four parameters are then set, and the ITMTF iterations are started by calling the recursive ITMTF() function.
min_significance_value = 0.8
min_topic_prob = 0.001
iterations = 5
number_of_topics = 10
causal_topics = ITMTF(gore_bush_gensim_corpus, gore_bush_gensim_dictionary, number_of_topics, number_of_topics, word_impact_dict, gore_bush_nyt_ts, president_norm_stock_ts, ts_tsID_map, min_significance_value, min_topic_prob, iterations)
The ITMTF function calls itself for the specified number of iterations, each time building an LdaModel() object and passing a (num_topics, num_unique_terms) matrix of re-calculated prior topic-word probability distributions into its eta parameter. To re-calculate the prior, causal topics are first extracted during the topic-level causality analysis (see section 4.2.1 in the paper), which is performed using a Pearson correlation between each topic stream and the presidential stock price time series. The topic-word probability prior is then calculated for the top words of each causal topic in the build_prior_matrix() function, which looks up each word's impact in the word_impact_dict object described above.
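The sketch below illustrates one plausible shape for this per-iteration prior update. It assumes topic significance is scored as 1 - p-value of the Pearson correlation and that prior mass is boosted additively by word impact; the helper signatures are hypothetical, and the real ITMTF() and build_prior_matrix() may differ in detail.

```python
import numpy as np
from scipy.stats import pearsonr
from gensim.models import LdaModel

def find_causal_topics(topic_streams, stock_ts, min_significance_value):
    """Flag topics whose stream correlates significantly with the external
    time series (section 4.2.1). Scoring significance as 1 - p-value is an
    assumption of this sketch."""
    causal = []
    for k, stream in enumerate(topic_streams):
        r, p = pearsonr(stream, stock_ts)
        if 1 - p >= min_significance_value:
            causal.append(k)
    return causal

def build_prior_matrix(num_topics, num_terms, causal_topics,
                       top_words, word_impact_dict):
    """Build the eta prior: start uniform, then boost the top words of
    causal topics in proportion to their word-level impact scores."""
    eta = np.full((num_topics, num_terms), 1.0 / num_terms)
    for k in causal_topics:
        for word_id in top_words[k]:
            impact, p_value = word_impact_dict[word_id]
            eta[k, word_id] += abs(impact)  # heavier prior on impactful words
    return eta

# The next ITMTF iteration then seeds LDA with this prior, e.g.:
# lda = LdaModel(corpus=gensim_corpus, id2word=gensim_dictionary,
#                num_topics=number_of_topics, eta=eta)
```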
With each iteration, the .csv files causal_topic_words.csv and itmtf_stats.csv in the working directory are appended with, respectively, a list of significant topics with their top 5 words, and that iteration's average causality confidence and average purity.
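A minimal sketch of this per-iteration logging, assuming pandas and illustrative column names (the script's actual schema may differ):

```python
import os
import pandas as pd

def append_csv(path, df):
    """Append rows to a CSV, writing the header only if the file is new."""
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

def log_iteration(iteration, causal_rows, avg_confidence, avg_purity):
    # causal_rows: list of (topic_id, "top 5 words") pairs for this iteration
    append_csv("causal_topic_words.csv",
               pd.DataFrame(causal_rows, columns=["topic", "top_5_words"])
                 .assign(iteration=iteration))
    append_csv("itmtf_stats.csv",
               pd.DataFrame([{"iteration": iteration,
                              "avg_causality_confidence": avg_confidence,
                              "avg_purity": avg_purity}]))
```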
The project aims to reproduce the paper's results documented in Table 2 and Figure 2(b). You will note that the µ parameter from the paper is not defined prior to calling the ITMTF function above, primarily because it is not a parameter of the LdaModel() class provided by gensim. For this reason, Figure 2(a) from the paper was not reproduced.
- Install the most recent versions of the above non-standard libraries using pip in a Python 3 environment, e.g.:
pip install pandas

- Clone the repository, which contains the working directory and all dependent objects.
- Navigate to the working directory and run the script with
python ITMTFPresidential.py
- Presentation slides
- Voiced Presentation - To listen, open the .pptx document in PowerPoint, then navigate to the Slide Show tab and hit the From Beginning button. The presentation will start from there.
- Voiced Presentation Video - If you are unable to listen directly in PowerPoint, you can view it as a video on Illinois Media Space.
- Video Demoing the Code
In general, this implementation's results were poorer than those reported in the paper. Improvements in causality confidence and purity were not observed with more iterations. Top words for causal topics seemed relevant, but not especially compelling or distinct from one another.
The poorer results are likely due to the following differences between this implementation and the paper's:
- Lack of a background model – Although the paper does not explicitly cite the use of a background model prior, its results imply one was used. My topics contain many more "background words" (e.g., "would").
- The lack of a background model could also be a main reason for the substantially lower purities at each iteration (0-5% vs. 40-100%).
- Lack of the µ parameter – This is likely the reason that neither purity nor causality confidence improved with each iteration, regardless of the initial number of topics. Using this parameter would have ensured that prior words (and topics) appeared in the next iteration's results.


