The goal of this project is to develop a means of easily comparing topic modeling methods, such as LDA and Top2Vec.
This was implemented through a web app hosted here. A demo video is available here.
The web app instance hosted online has limited CPU and RAM resources.
For heavy testing, it is recommended to run this app locally.
- Set up your environment
    conda create -n myapp python=3.8
    conda activate myapp
- Clone this repo and switch to project directory
- Install dependencies
    pip install -r requirements.txt
- Launch app in browser
    streamlit run main.py
The app has two components:
- A sidebar for user input and control parameters
  - choose dataset / web-scraping parameters
  - set parameters such as number of topics
  - search topic models with a keyword
- The main pane for displaying results
  - each algorithm has a dedicated column, lined up side-by-side for ease of comparison
  - topics shown via wordclouds where word size corresponds to term weight
  - documents returned from keyword search are displayed in height-adjustable boxes
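As a rough illustration of how word size can track term weight in a wordcloud, the sketch below maps weights to font sizes by linear interpolation. The helper name `scale_font_sizes`, the size bounds, and the sample weights are assumptions made for this example and are not taken from the app's code.

```python
def scale_font_sizes(term_weights, min_size=10, max_size=60):
    """Map each term's weight to a font size by linear interpolation.

    The heaviest term gets max_size and the lightest gets min_size.
    (Hypothetical helper; the size bounds are illustrative defaults.)
    """
    if not term_weights:
        return {}
    lo = min(term_weights.values())
    hi = max(term_weights.values())
    span = hi - lo
    sizes = {}
    for term, weight in term_weights.items():
        # If all weights are equal, fall back to the largest size.
        fraction = (weight - lo) / span if span else 1.0
        sizes[term] = round(min_size + fraction * (max_size - min_size))
    return sizes


# Made-up weights for a single topic
topic = {"engine": 0.08, "car": 0.05, "wheel": 0.02}
sizes = scale_font_sizes(topic)
print(sizes)
```

A linear mapping is the simplest choice; a wordcloud library may use a different scaling internally.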
Currently supported algorithms are LDA and Top2Vec. A simplified overview and comparison of the two is available in this tech review note.
Although many features were planned for this app, the first version was kept deliberately simple rather than cluttered with dozens of parameters and customization options.
Ideas for future releases:
Data
- expand available datasets for testing
- speed up web scraping through parallelization
- add options for lemmatization and word n-grams in vocabulary
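One possible way to parallelize the scraping step is a thread pool, since fetching pages is mostly waiting on the network. The sketch below is only an outline under that assumption: `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever function actually downloads a page.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently and return results in input order.

    `fetch` is any callable taking a URL and returning its content
    (hypothetical here; a real scraping function would go in its place).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of `urls` even though calls overlap
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline
    return f"content of {url}"


pages = scrape_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
print(pages)
```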
Features
- phrase/multi-term search
- add more algorithms for comparison
- give users access to more parameters for fine-tuning models
Utility
- show and compare time taken to train topic models and perform search
- offer customizable result display:
  - number of documents to show
  - default height of document display box
  - number of wordclouds
  - number of words per wordcloud
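The timing comparison mentioned under Utility could be sketched with the standard library alone. In the example below, `timed` and `compare_timings` are hypothetical helpers, and the two dummy "trainers" are placeholders standing in for the real LDA and Top2Vec training calls.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_timings(trainers, documents):
    """Time each trainer on the same documents.

    `trainers` maps an algorithm name to a callable that trains a model
    (placeholders here, standing in for the actual training calls).
    """
    timings = {}
    for name, train in trainers.items():
        _, elapsed = timed(train, documents)
        timings[name] = elapsed
    return timings


# Dummy trainers so the sketch runs without any topic modeling library
trainers = {
    "model_a": lambda docs: sorted(docs),
    "model_b": lambda docs: [d.upper() for d in docs],
}
timings = compare_timings(trainers, ["some text", "more text"])
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

The same `timed` wrapper could be applied to the keyword search so that training and search times are reported side by side.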
LDA is implemented via gensim and Top2Vec via top2vec. Both Python packages are available via pip.