Final Project for CS410 of UIUC.
This project consists of two major tasks. The first is to identify emerging topics on Twitter within the computer science field, and the second is to recommend relevant slides for the identified topics.
In this task, we first crawled 680k tweets from Twitter with the query "computer science", which limits the scope of our topics. Then, to mine topics from these tweets, we found the number of topics that maximizes the coherence value and trained an LDA topic model with that optimal number. Finally, we visualized the topics as word clouds by analyzing hashtags. In addition, we support identifying emerging topics by crawling the latest tweets and predicting their topics with the pre-trained LDA model; the newly crawled data is also used to update the model.
All related files are in Identify_Topics/ directory.
- Crawling/: Keep crawling tweets and form training data.
- data/: Store raw data (sorted_tweets.json), processed data (pre-processed.pkl), stopwords (stopwords.txt), and pictures of topics (topic_desc_fig/).
- model/: Store pre-trained LDA model files.
- topic_discovery.py: Extract topics from crawled tweets, evaluate models with different numbers of topics to find the optimal one, generate topics and draw word cloud pictures for them, and predict emerging topics based on the pre-trained model.
- topics.json: Word distributions of topics, which feed into the next part.
In this task, we first crawled 100+ course slide decks from UIUC. Then, taking the above word distributions of the topics as input, we used BM25 to find relevant slides.
All related files are in Recommend_Slides/ directory.
- pdf_download.py: Scrapes slides from four fixed UIUC course websites (CS225, CS240, CS425, and CS447) and downloads all the PDF documents to a local directory "slides".
- pdf_miner.py: Reads the slides under the "slides" folder and uses the pdfminer tool to extract text from them, then writes the raw text to a "txt" file under the folder "raw". For example, for a PDF file "slides/Lecture1.pdf", it produces a text file "raw/Lecture1.txt" containing the text of the original PDF.
- filter_raw.py: Reads the raw text files under the "raw" folder and filters them so that they can be used by the ranking algorithms. It removes stop words, meaningless numbers, and other useless tokens, then lemmatizes and stems the words so that derived forms are treated equally. The results are saved under the "corpus" folder; each file there represents the cleaned content of a PDF file from the "slides" folder. For example, for a text file "raw/Lecture1.txt", it produces "corpus/Lecture1.txt" containing the cleaned text.
- bm25.py: Reads the topic file "topics.json" and generates one query per topic from the keyword distribution of that topic. For each query, it runs the BM25 ranking algorithm to score every document in the "corpus" folder, takes the top 10 documents, and writes the result to "result/bm25.txt".
- doc_sim.py: Similar to "bm25.py"; the only difference is the ranking algorithm. It computes the cosine similarity between TF-IDF vectors of each query-document pair, takes the top 10 documents, and writes the result to "result/sim.txt".
This part generates a dataset of recent tweets from Twitter.
Due to Twitter's rate limit, we can only crawl a small number of tweets every 15 minutes.
Therefore, in the Crawling/ directory, we implemented a crawler that crawls tweets automatically.
- Twitter_crawler.py: Crawl recent tweets that don't overlap with previously crawled tweets.
- utils.py: Sort crawled tweets by creation time, which optimizes the crawling and saving process (see the sketch below).
- Twitter_crawler.sh: Auto-crawling bash script that runs Twitter_crawler.py and utils.py repeatedly every 15 minutes.
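A minimal sketch of the merge-and-sort bookkeeping described above, assuming sorted_tweets.json holds a JSON list of tweet objects with "id" fields (the exact layout in the repository may differ):

```python
import json

def merge_and_sort(new_tweets, path="data/sorted_tweets.json"):
    # Load previously crawled tweets, drop overlaps, and keep the file sorted.
    with open(path) as f:
        tweets = json.load(f)
    seen = {t["id"] for t in tweets}
    tweets.extend(t for t in new_tweets if t["id"] not in seen)
    # Tweet IDs increase over time, so sorting by id orders tweets by creation time.
    tweets.sort(key=lambda t: t["id"])
    with open(path, "w") as f:
        json.dump(tweets, f)
```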
This part generates topics from the crawled tweets. Here we applied the LDA algorithm for topic mining.
All related code is in topic_discovery.py under the Identify_Topics/ directory. Its implementation is as follows:
- Preprocess: For each tweet, we 1) lowercase the text; 2) remove usernames, hashtags, URLs, numbers, punctuation, special characters, and short words; 3) tokenize; 4) remove stopwords; 5) lemmatize; and 6) stem. The processed data is then saved to data/pre-processed.pkl for training (see the sketch after this list).
- Find optimal number of topics: We trained LDA models with the number of topics ranging from 2 to 14 and found that 10 is optimal.
- Training: We set the number of topics to 10, trained an LDA model on 662k processed tweets, and saved the model files in the model/ directory.
- Saving Topics: We loaded the pre-trained model files, saved the word distributions of the topics, and drew word cloud figures by analyzing hashtags for all topics.
- Predict: With the pre-trained model, we can crawl the latest tweets about computer science and predict their topics to find out which topics are emerging. Meanwhile, we use these newly crawled tweets to update the LDA model.
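A minimal sketch of the preprocessing pipeline, assuming spaCy for tokenization/lemmatization and NLTK's PorterStemmer for stemming, with stopwords read from data/stopwords.txt; the exact regexes and order of operations in topic_discovery.py may differ:

```python
import re

import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
stemmer = PorterStemmer()
# data/stopwords.txt is assumed to hold one stopword per line.
stopwords = set(open("data/stopwords.txt").read().split())

def preprocess(tweet):
    text = tweet.lower()
    text = re.sub(r"@\w+|#\w+|https?://\S+", " ", text)  # usernames, hashtags, URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # numbers, punctuation, symbols
    tokens = [tok.lemma_ for tok in nlp(text)            # tokenize + lemmatize
              if len(tok.text) > 2 and tok.text not in stopwords]
    return [stemmer.stem(tok) for tok in tokens]         # stem

# The processed token lists are then pickled to data/pre-processed.pkl.
```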
This module does the following:
- Given a course website page, use BeautifulSoup to extract all the elements with tag "a".
- Check whether each element is a link ending with ".txt". If so, concatenate the prefix and the link to obtain the complete URL.
- Download each PDF file with its original name and put it into the "slides" folder.
Functions are:
- getTagA(root_url): Obtain all the elements with tag "a" and return a list of strings.
- downPdf(root_url, prefix, list_a): Download all the PDF files found at root_url. The argument "prefix" is used to complete the PDF links; it varies across course websites.
- getFile(url): Download the file at the given URL into the "slides" folder.
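A minimal sketch of this scraping flow with requests and BeautifulSoup; it filters links ending in ".pdf", so adjust the extension check and prefix handling to match the actual course pages:

```python
import os
import requests
from bs4 import BeautifulSoup

def get_tag_a(root_url):
    # Collect the href attribute of every <a> element on the course page.
    soup = BeautifulSoup(requests.get(root_url).text, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

def download_pdfs(prefix, links, out_dir="slides"):
    os.makedirs(out_dir, exist_ok=True)
    for link in links:
        if not link.lower().endswith(".pdf"):
            continue
        url = prefix + link  # complete the relative link
        with open(os.path.join(out_dir, os.path.basename(link)), "wb") as f:
            f.write(requests.get(url).content)
```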
This module does the following:
- Read each PDF file under the "slides" folder.
- Create a PDF parser for each PDF file, then parse it and extract the text data from each page.
- Write the raw text data to a target file under the folder "raw".
Functions are:
parse(filename): Extract text data from a PDF file and write it to a target text file (each PDF file is written to its own text file).
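The module builds an explicit parser per file; the sketch below uses pdfminer.six's high-level helper instead, which yields the same raw text (the file paths are illustrative):

```python
from pdfminer.high_level import extract_text

def parse(pdf_path, txt_path):
    # Extract all text from the PDF and write it to the target .txt file.
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(extract_text(pdf_path))

parse("slides/Lecture1.pdf", "raw/Lecture1.txt")
```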
This module does the following:
- Read each raw text file under the "raw" folder.
- Tokenize the text data and remove short words, numbers, and stop words from the text.
- Lemmatize and stem the words, then write them to a text file under the "corpus" folder (each raw text file is written to its own target file).
Functions are:
- get_raw_data(filepath): Read a raw text file and return a list of strings; each element represents one line of the raw text file.
- pre_process(data, filename): First, use "re" (regular expressions) to remove unwanted words. Second, use "spacy" to lemmatize words and "nltk.stem" to stem them. Finally, write the stemmed words to the target file under the "corpus" folder.
This module does the following:
- Read the topic file "topics.json" and generate a query based on the keyword distribution of each topic. For example, for the topic {"topic1": {"keyword1": 0.5, "keyword2": 0.5}}, it generates a query like "keyword1 keyword1 keyword2 keyword2" that follows the distribution.
- Treat this query as a document and compute its BM25 score against each document in our corpus, using the BM25 model implemented in the gensim library.
- Take the 10 highest-scoring document names and write the result to "result/bm25.txt".
Functions are:
- tokenization(filename): Read a document under the "corpus" folder and return the list of words in that document.
- read_corpus(dir_path): Read all the documents under dir_path and return a 2-dimensional list of strings; the first dimension indexes documents and the second contains the words of each document.
- simulate_query_by_topics(topic_file): Generate queries from topics. In this implementation, each query is built with a word base of 100: if keyword1 and keyword2 have weights 0.2 and 0.5, the query contains 20 copies of keyword1 and 50 copies of keyword2. Note: each topic only keeps its top few keywords, so their weights may not sum to one, but this does not affect their relative sizes.
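A minimal sketch of this query generation and ranking. The project scores with gensim's BM25 implementation; for brevity this sketch substitutes the rank_bm25 package, and assumes topics.json maps topic names to {keyword: weight} dictionaries as in the example above:

```python
import json
import os

from rank_bm25 import BM25Okapi  # stand-in for gensim's BM25 implementation

def simulate_query_by_topics(topic_file, base=100):
    # Repeat each keyword proportionally to its weight (word base 100).
    with open(topic_file) as f:
        topics = json.load(f)
    return {topic: [w for w, p in dist.items() for _ in range(int(p * base))]
            for topic, dist in topics.items()}

def rank(corpus_dir="corpus", topic_file="topics.json", top_k=10):
    names = sorted(os.listdir(corpus_dir))
    docs = [open(os.path.join(corpus_dir, n)).read().split() for n in names]
    bm25 = BM25Okapi(docs)
    for topic, query in simulate_query_by_topics(topic_file).items():
        scores = bm25.get_scores(query)
        top = sorted(zip(names, scores), key=lambda x: -x[1])[:top_k]
        print(topic, [n for n, _ in top])
```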
This module does the following:
- Read the topic file "topics.json" and generate a query based on the keyword distribution of each topic. For example, for the topic {"topic1": {"keyword1": 0.5, "keyword2": 0.5}}, it generates a query like "keyword1 keyword1 keyword2 keyword2" that follows the distribution.
- Treat this query as a document and compute the cosine similarity between its TF-IDF vector and that of each document under the "corpus" folder.
- Take the 10 most similar document names and write the result to "result/sim.txt".
Functions are similar to those in "bm25.py".
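A minimal sketch of the TF-IDF cosine-similarity ranking; the project does not name a library for this, so scikit-learn is used here for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_cosine(doc_texts, doc_names, query_words, top_k=10):
    # Vectorize the documents and the query with the same TF-IDF vocabulary,
    # then rank documents by cosine similarity to the query.
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(doc_texts)         # one row per document
    query_vec = vectorizer.transform([" ".join(query_words)])
    sims = cosine_similarity(query_vec, doc_vecs)[0]
    return sorted(zip(doc_names, sims), key=lambda x: -x[1])[:top_k]
```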
This software requires Python 3.5+; the external libraries it depends on can be installed with:
pip install -r requirements.txt
After installing the spaCy library, you also need to download its English model:
python -m spacy download en_core_web_sm
Now you have all the necessary packages! Before any later steps, clone this repository:
git clone https://github.com/liuyuxiang512/CourseProject
cd CourseProject
The Identify_Topics/ directory serves to identify emerging topics on Twitter.
cd Identify_Topics
You can skip this step by downloading our crawled data sorted_tweets.json,
which contains 680k tweets, and saving the file into the data/ directory.
In order to crawl tweets, you first need a Twitter developer account. Then:
- Create a Twitter application via https://developer.twitter.com/.
- Create a Twitter app to access Twitter's API.
- Find the authentication info in the "Keys and Access Tokens" tab of the app's properties, including consumer_key, consumer_secret, access_token, and access_token_secret.
- Fill the authentication info into authentication.txt, one item per line (four lines in total).
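A minimal sketch of how these credentials can be loaded and used, assuming the tweepy client and the one-item-per-line layout of authentication.txt described above; the project's crawler may use a different client or file format:

```python
import tweepy  # assumed Twitter client

# authentication.txt: consumer_key, consumer_secret, access_token,
# access_token_secret -- one per line, in that order.
with open("authentication.txt") as f:
    consumer_key, consumer_secret, access_token, access_token_secret = \
        [line.strip() for line in f][:4]

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
print(api.verify_credentials().screen_name)  # sanity check before crawling
```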
Then, you can keep crawling tweets by
cd Crawling
bash Twitter_crawler.sh
In this step, you can try LDA models with the number of topics ranging from 2 to 14
and get the corresponding coherence values.
A higher coherence value means a better model.
If you don't want to use our processed data Identify_Topics/data/pre-processed.pkl,
you may first remove it and continue, but preprocessing will take some time.
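A minimal sketch of such a coherence sweep with gensim, assuming the processed data is a list of token lists; the exact parameters used by topic_discovery.py may differ:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def tune(texts, k_min=2, k_max=14):
    # texts: list of token lists, e.g. loaded from data/pre-processed.pkl
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    best_k, best_cv = None, -1.0
    for k in range(k_min, k_max + 1):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
        cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                            coherence="c_v").get_coherence()
        print(f"Number of Topics: {k} --- Coherence Value: {cv}")
        if cv > best_cv:
            best_k, best_cv = k, cv
    return best_k
```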
To find out the optimal number of topics, run
python topic_discovery.py --tune
Then you will get
Tuning...
Number of Topics: 2 --- Coherence Value: 0.49589240555472486
Number of Topics: 3 --- Coherence Value: 0.4752864500035534
Number of Topics: 4 --- Coherence Value: 0.4844109302488787
Number of Topics: 5 --- Coherence Value: 0.5426149238108859
Number of Topics: 6 --- Coherence Value: 0.5708485237453553
Number of Topics: 7 --- Coherence Value: 0.5514423515877226
Number of Topics: 8 --- Coherence Value: 0.5778541035204716
Number of Topics: 9 --- Coherence Value: 0.566857492981066
Number of Topics: 10 --- Coherence Value: 0.5808911042666589
Number of Topics: 11 --- Coherence Value: 0.5561191556402437
Number of Topics: 12 --- Coherence Value: 0.5699566981479943
Number of Topics: 13 --- Coherence Value: 0.5522769193550581
Number of Topics: 14 --- Coherence Value: 0.5433632323040761
...
The optimal number of topics is: 10
Therefore, the optimal number of topics is 10, and we will use 10 for our LDA model in formal training.
With 10 as the number of topics, you can now train an LDA model by running
python topic_discovery.py --train
This step will take a long time, but you can go ahead and directly use our
pre-trained model in the Identify_Topics/model/ directory for subsequent steps.
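Internally, the trained model can be persisted and reloaded with gensim's save/load; a short sketch reusing the corpus and dictionary from the tuning sketch above (the file name under model/ is illustrative):

```python
from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5)
lda.save("model/lda.model")             # persisted under model/

lda = LdaModel.load("model/lda.model")  # later steps reload the same files
```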
This step uses the pre-trained model to get the topics and draw word clouds for them.
python topic_discovery.py --display
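The word cloud figures can be produced with the wordcloud package; a minimal sketch, where the frequency values and output path are illustrative:

```python
from wordcloud import WordCloud

# freqs maps words (e.g. hashtags) to weights for one topic; the values here are made up.
freqs = {"machinelearning": 0.30, "ai": 0.25, "python": 0.20, "datascience": 0.15}
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freqs)
wc.to_file("data/topic_desc_fig/topic_0.png")  # hypothetical output path
```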
Or you can see what we got after this step: word distributions of the topics in
Identify_Topics/topics.json and word cloud figures of the topics in
Identify_Topics/data/topic_desc_fig/. The following are word cloud figures for 2 of the 10 topics.
After the previous steps, you have successfully obtained processed data
(Identify_Topics/data/pre-processed.pkl), word distributions of topics
(Identify_Topics/topics.json), and word cloud pictures of topics in
Identify_Topics/data/topic_desc_fig/.
This step is a further extension of our software.
You still need a Twitter developer account to crawl the latest tweets.
Please refer to the "Crawling Tweets from Twitter" section above for how to get the authentication info.
You can always find out emerging topics on Twitter by running
python topic_discovery.py --predict
It will crawl the latest tweets, predict topics for them, and figure out which topics are popular. Meanwhile, it also uses this newly crawled data to update the LDA model.
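A minimal sketch of this predict-and-update loop with gensim, reusing the dictionary and lda objects from the sketches above; new_texts is assumed to be the newly crawled tweets preprocessed the same way as the training data, and counting dominant topics is just one plausible way to rank emerging topics:

```python
from collections import Counter

new_corpus = [dictionary.doc2bow(doc) for doc in new_texts]

# Count which topic each new tweet most likely belongs to.
counts = Counter()
for bow in new_corpus:
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
    counts[topic_id] += 1
print("Emerging Topics ID (Ordered):", [t for t, _ in counts.most_common()])

# Incrementally update the model with the new documents.
lda.update(new_corpus)
```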
We got the following results on Dec. 13th at 9 am:
Emerging Topics ID (Ordered): 8 6 0 4 1 3
To see what these topics are, you may go to the
data/topic_desc_fig/ directory and find the corresponding word clouds!
> python topic_discovery.py -h
usage: topic_discovery.py [-h] [-i INPUT_FILE] [-n NUM_TOPICS] [-f FIELD] [-o OUTPUT_FILE] [--train] [--display] [--tune] [--predict]
Identify In-demanding Skills
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input_file INPUT_FILE
input file contains tweets crawled from Twitter - (default: data/sorted_tweets.json).
-n NUM_TOPICS, --num_topics NUM_TOPICS
number of topics - (default: 10).
-f FIELD, --field FIELD
field of subject to mine - (default: computer science)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
output file contains term distribution of topics - (default: topics.json).
--train preprocess and train
--display save topics and draw pictures
--tune find the optimal number of topics
--predict predict emerging topics with trained model
The Recommend_Slides/ directory serves to recommend related slides based on the topics.
cd Recommend_Slides
Download the slides using pdf_download.py; this may be slow. Alternatively, you can get the PDF slides from the link: Download Slides. Then, unzip the archive into the "slides" folder.
python3 pdf_download.py
Then, we need to extract raw text from the PDF files and filter these raw texts.
python3 pdf_miner.py
python3 filter_raw.py
Finally, use bm25.py or doc_sim.py to calculate the final results. After this step, you can see the result files under the "result" folder.
python3 bm25.py
python3 doc_sim.py
main.py: Users can run this script with python3. It provides 2 kinds of commands. (Note: these two commands are available only after filtering the raw text.)
python3 main.py
- The first one is "latest". It automatically generates the results with the existing topics in "topics.json".
- The second one is "query". It asks you to type in a query and outputs the 10 files that are most relevant to your query. This ranking is based on the BM25 algorithm because, in our manual evaluation, BM25 ranking performed better than cosine-similarity ranking.
search.py: Users can run this script with python3. It provides 2 kinds of commands. (Note: these two commands are available only after filtering the raw text.)
python3 search.py
- The first one is "bm25", meaning the ranking is based on the BM25 algorithm. It asks you to type in a query and outputs the 10 filenames that are most relevant to your query.
- The second one is "sim", meaning the ranking is based on document cosine similarity. It asks you to type in a query and outputs the 10 filenames that are most relevant to your query.

