wujameszj/CourseProject

 
 

CourseProject

The goal of this project is to develop a means of easily comparing topic modeling methods, such as LDA and Top2Vec.
This was implemented through a web app hosted here. A demo video is available here.

Install

The web app instance hosted online has limited CPU and RAM resources.
For heavy testing, it is recommended to run this app locally.

  1. Set up your environment: conda create -n myapp python=3.8, then conda activate myapp
  2. Clone this repo and switch to the project directory
  3. Install dependencies: pip install -r requirements.txt
  4. Launch the app in your browser: streamlit run main.py

Usage

The app has two components:

  • A sidebar for user input and control parameters

    • choose dataset / web-scraping parameters
    • set parameters such as number of topics
    • search topic models with a keyword
  • The main pane for displaying results

    • each algorithm has a dedicated column, lined up side-by-side for ease of comparison
    • topics shown via wordclouds where word size corresponds to term weight
    • documents returned from keyword search are displayed in height-adjustable boxes
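The link between word size and term weight can be illustrated with a small sketch. This is not the app's actual rendering code: the function name `scale_font_sizes`, the size range, and the example weights are all assumptions made for illustration, and a wordcloud library may use a different scaling rule.

```python
def scale_font_sizes(term_weights, min_size=10, max_size=60):
    """Map each term's weight linearly onto a font size in [min_size, max_size]."""
    if not term_weights:
        return {}
    lo = min(term_weights.values())
    hi = max(term_weights.values())
    span = hi - lo
    sizes = {}
    for term, weight in term_weights.items():
        # If every weight is equal, avoid dividing by zero and use the max size.
        fraction = (weight - lo) / span if span > 0 else 1.0
        sizes[term] = round(min_size + fraction * (max_size - min_size))
    return sizes


# Hypothetical topic with made-up weights, for illustration only.
example_topic = {"space": 0.08, "nasa": 0.05, "orbit": 0.02}
sizes = scale_font_sizes(example_topic)
```

With this linear rule, the heaviest term in a topic gets the largest font and the lightest term the smallest, so relative importance within a topic is visible at a glance.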

Currently supported algorithms are LDA and Top2Vec. A simplified overview and comparison of the two is available in this tech review note.

Reflection and Future Work

Although many features were planned for this app, the first version was kept deliberately simple rather than cluttered with dozens of parameters and customization options.

Ideas for future releases:

Data

  • expand available datasets for testing
  • speed up web scraping through parallelization
  • add options for lemmatization and word n-grams in vocabulary
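One possible shape for the parallel scraping idea is sketched below, using a thread pool from the standard library. The names `fetch_all` and `fake_fetch` are hypothetical, and the stand-in fetcher replaces a real HTTP request so the sketch runs offline. The app's scraper may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; results come back in the same order as urls."""
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though the calls overlap in time.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption, for illustration only).
    return f"<html>{url}</html>"


pages = fetch_all(["page/1", "page/2", "page/3"], fake_fetch)
```

Threads suit this kind of work because scraping is dominated by waiting on the network rather than by computation.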

Features

  • phrase/multi-term search
  • add more algorithms for comparison
  • give users more parameters for fine-tuning models

Utility

  • show and compare time taken to train topic models and perform search
  • offer customizable result display:
    • number of documents to show
    • default height of document display box
    • number of wordclouds
    • number of words per wordcloud
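The timing comparison could be built on a small helper like the one sketched here. The helper names and the two stand-in training functions are hypothetical. In the app, the callables would wrap the actual LDA and Top2Vec training steps.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_timings(trainers, documents):
    """Time each trainer on the same documents; return {name: seconds}."""
    timings = {}
    for name, train in trainers.items():
        _, seconds = timed(train, documents)
        timings[name] = seconds
    return timings


# Hypothetical stand-ins for real model training functions.
def train_fast(docs):
    return len(docs)


def train_slow(docs):
    time.sleep(0.05)  # simulate a slower training step
    return len(docs)


timings = compare_timings({"fast": train_fast, "slow": train_slow}, ["a", "b"])
```

The resulting dictionary could then be shown next to each algorithm's column so training cost is compared alongside topic quality.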

Reference

LDA is implemented with the gensim package and Top2Vec with the top2vec package. Both Python packages can be installed via pip.
