Topic-Modelling

In this project, we implement two models for topic modeling- Latent Dirichlet Allocation (LDA) and Deep Belief Net (DBN). We use the bag-of-words and term-frequency models to produce latent representation of text data. The three files included are data_preprocessing.py, lda.py and dbn.py.

Requirements

numpy
scikit-learn
tensorflow
nltk
pandas
scipy
python 2 (for lda.py)
python 3 (for dbn.py)

Dataset

The two datasets that are used in our models can be found at 20Newsgroups and BBCSport.

How to run

data_preprocessing.py: loads raw text file of the dataset; performs basic preprocessing steps (tokenization, stemming, lemmatization) and term frequency count; the output is a .mat file with term frequencies and corresponding labels.
lda.py (requires python 2): fits an lda model on the input data from sklearn library; random forest classifier is used on lda output to generate accuracy results and then visualize.
dbn.py (requires python 3): run this file to generate output accuracy using DBN Note: preprocessed data files bbc_train_data.mat and 20NG_train_data.mat are included with this project.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
data_preprocessing.py		data_preprocessing.py
dbn.py		dbn.py
lda.py		lda.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Topic-Modelling

Requirements

Dataset

How to run

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Topic-Modelling

Requirements

Dataset

How to run

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages