CourseProject

Student: Chirag C. Shetty (cshetty2)

Paper: ChengXiang Zhai, Atulya Velivelli, and Bei Yu. 2004. A cross-collection mixture model for comparative text mining. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004). ACM, New York, NY, USA, 743-748. DOI: 10.1145/1014052.1014150 [link]

Introduction

The paper explores an improvement over PLSA for mining topics. In PLSA, k topics are mined from the entire collection. In comparative text mining, however, the data consists of several comparable collections, and we want to discover the themes within each collection while also comparing them across collections. The paper adds one more level to the generative model (a mixing weight lambda_C between a common topic model and collection-specific topic models) to achieve this.
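As a rough sketch (my notation, which may differ from the paper), a word w in a document d belonging to collection C_i is generated as:

$$
p_{d,i}(w) = \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\left[\lambda_C\, p(w\mid\theta_j) + (1-\lambda_C)\, p(w\mid\theta_{j,i})\right]
$$

where theta_B is the background model, theta_j the common (cross-collection) model for topic j, theta_{j,i} the collection-specific model for topic j in collection C_i, pi_{d,j} the topic coverage of document d, and lambda_B, lambda_C fixed mixing weights.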

Data

The original paper from 2004 used a set of news articles and reviews from epinions.com. That site is no longer active and the dataset was not archived anywhere, so I decided to write a scraper, starting from the code used in the MPs. I chose CNN, which has a search feature on its webpage: I scrape the results page for a search on a topic of interest and extract the news articles. This mostly involved handcrafting the extraction process.

Procedure for scraping

The main Python file is scrap.py.

  1. Edit the 'name' variable to indicate the topic. Extracted files will be stored under this name.
  2. Set no_pages, the number of search result pages to fetch. Each page lists 10 articles.
  3. Set dir_url to a topic search page on the CNN website. For example, this page shows the results for the search 'election': https://www.cnn.com/search?q=election
  4. Run scrap.py (tested with Python 3.5). The extracted docs will be stored in the folder 'cnn'.
  5. Repeat for as many topics as you wish (a rough sketch of the scraping loop is shown below).
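For reference, here is a minimal, hedged sketch of the kind of scraping loop scrap.py performs. The variable names (name, no_pages, dir_url) follow the steps above, but the pagination parameters and HTML selectors are assumptions: CNN's search page and markup change over time, and the actual scrap.py may extract articles differently.

```python
import os
import requests
from bs4 import BeautifulSoup

# Settings mirroring scrap.py (names from the README; values are examples).
name = "election"                                   # topic label; files are stored under this name
no_pages = 2                                        # number of search result pages (10 articles each)
dir_url = "https://www.cnn.com/search?q=" + name    # topic search page on the CNN website

os.makedirs("cnn", exist_ok=True)

doc_id = 0
for page in range(no_pages):
    # Pagination parameters are an assumption; the real search page may use a different scheme.
    resp = requests.get(dir_url, params={"size": 10, "from": page * 10})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Placeholder selector: keep links that look like CNN article URLs.
    links = {a["href"] for a in soup.find_all("a", href=True)
             if "index.html" in a["href"]}

    for link in links:
        if link.startswith("/"):
            link = "https://www.cnn.com" + link
        article = BeautifulSoup(requests.get(link).text, "html.parser")
        # Hand-crafted extraction: keep paragraph text only.
        text = "\n".join(p.get_text(" ", strip=True) for p in article.find_all("p"))
        if text:
            out_path = os.path.join("cnn", "{}_{}.txt".format(name, doc_id))
            with open(out_path, "w", encoding="utf-8") as f:
                f.write(text)
            doc_id += 1
```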

Baseline model

As the baseline, the paper uses the standard PLSA model. Starting from the PLSA code in MP3, a background model was added, so the complete PLSA (with background) is implemented in plsa_proj.py.
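For reference, PLSA with a background model generates a word w in document d as (my notation):

$$
p_d(w) = \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j)
$$

which is the special case of the cross-collection model above with a single collection and no collection-specific distributions.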

Cross-Collection Mixture Model

The model is implemented in ccmix.py, following the EM update equations from the paper:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/531cce09-14ce-46c7-b6d9-c0923a2e9784/Untitled.png
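Since the embedded image may no longer render, the following is a hedged reconstruction of the E-step and M-step as implemented in ccmix.py. The notation is mine and may differ from the paper, and the collection-specific update follows note 2 under Important notes below. For a word w in document d from collection C_i:

E-step:

$$
p(z_{d,w}=B) = \frac{\lambda_B\,p(w\mid\theta_B)}{\lambda_B\,p(w\mid\theta_B) + (1-\lambda_B)\sum_{j'}\pi_{d,j'}\big[\lambda_C\,p(w\mid\theta_{j'}) + (1-\lambda_C)\,p(w\mid\theta_{j',i})\big]}
$$

$$
p(z_{d,w}=j) = \frac{\pi_{d,j}\big[\lambda_C\,p(w\mid\theta_j) + (1-\lambda_C)\,p(w\mid\theta_{j,i})\big]}{\sum_{j'}\pi_{d,j'}\big[\lambda_C\,p(w\mid\theta_{j'}) + (1-\lambda_C)\,p(w\mid\theta_{j',i})\big]},
\qquad
p(y_{d,w,j}=C) = \frac{\lambda_C\,p(w\mid\theta_j)}{\lambda_C\,p(w\mid\theta_j) + (1-\lambda_C)\,p(w\mid\theta_{j,i})}
$$

M-step (each right-hand side normalized over its free index):

$$
\pi_{d,j} \propto \sum_{w} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)
$$

$$
p(w\mid\theta_j) \propto \sum_{i}\sum_{d\in C_i} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)\,p(y_{d,w,j}=C)
$$

$$
p(w\mid\theta_{j,i}) \propto \sum_{d\in C_i} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)\,\big(1-p(y_{d,w,j}=C)\big)
$$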

Procedure:

  1. Run scrap.py for each collection, setting dir_url to a topic search page on the CNN website and the other variables as described in scrap.py.
  2. Set N, the number of docs of each kind (collection) to use.
  3. Set name_set, the list of collection names, e.g. ['elon', 'bezos'].
  4. Set number_of_topics.
  5. Run ccmix.py (an example of these settings is shown below).
  6. The output displays the top_n words in each distribution.
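A hedged example of the settings described above (the variable names follow the README, but the exact syntax and values in ccmix.py may differ):

```python
# Example settings for ccmix.py (values are illustrative).
N = 50                           # number of docs taken from each collection
name_set = ['elon', 'bezos']     # one scraped collection per name (produced by scrap.py)
number_of_topics = 5             # number of common topics k
top_n = 10                       # how many top words to display per distribution
```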

Important notes

  1. In calculating c(w,d), the count of word w in doc d, smoothing must be applied across all words and docs: no c(w,d) should be exactly 0, otherwise it causes a division-by-zero problem. In the code, term_doc_matrix stores c(w,d).

  2. In the EM update steps given in the paper, observe the update for P(w|theta_{j,i}), i.e. the collection-specific word distributions. Since both numerator and denominator are summed over the entire collection, P(w|theta_{j,i}) would not capture features specific to the sub-collections; they would all behave similarly. Hence, in the implementation the summations are taken only over the docs in the collection concerned (see the sketch below).
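A minimal sketch illustrating both notes, assuming numpy arrays with the shapes named below. term_doc_matrix follows the README; the other names and shapes are my assumptions and not necessarily those used in ccmix.py.

```python
import numpy as np

# Assumed shapes (illustrative):
#   term_doc_matrix : (D, W)      c(w, d), count of word w in doc d
#   topic_post      : (D, W, k)   E-step posterior that word w in doc d belongs to topic j
#   common_post     : (D, W, k)   E-step posterior that the word came from the common model
#   coll_of_doc     : (D,)        index i of the collection each doc belongs to

def smooth_counts(term_doc_matrix, eps=1e-3):
    """Note 1: ensure no c(w,d) is exactly zero, to avoid division by zero."""
    return term_doc_matrix + eps

def update_collection_specific(term_doc_matrix, topic_post, common_post, coll_of_doc, n_collections):
    """Note 2: update p(w | theta_{j,i}) summing only over the docs in collection i."""
    k = topic_post.shape[2]
    W = term_doc_matrix.shape[1]
    p_w_specific = np.zeros((n_collections, k, W))
    for i in range(n_collections):
        docs = np.where(coll_of_doc == i)[0]              # only docs in collection C_i
        # weight of each (d, w, j): c(w,d) * p(topic j) * p(collection-specific)
        weights = term_doc_matrix[docs, :, None] * topic_post[docs] * (1.0 - common_post[docs])
        numer = weights.sum(axis=0).T                      # (k, W): sum over docs in C_i only
        p_w_specific[i] = numer / numer.sum(axis=1, keepdims=True)   # normalize over words
    return p_w_specific
```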
