This is an implementation of LDA (Latent Dirichlet Allocation) topic modelling, done as one of my projects for QUT IFN619. The project is implemented in Python in a Jupyter notebook.
The dataset used here is 'million ABC news headlines', sourced from https://www.kaggle.com/therohk/million-headlines. It provides about a million news headlines and their publication dates, from 2003 until 2020.
# Standard library
import datetime
import re
import string

# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from wordcloud import WordCloud

# Text processing and topic modelling
import nltk
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
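As a rough sketch of how the headlines and their dates can be read, the snippet below parses a few rows laid out like the Kaggle file. It assumes two columns named `publish_date` (stored as yyyymmdd) and `headline_text`. The sample rows and the helper `load_headlines` are made up for illustration, and only the standard library is used so the snippet runs without the notebook's dependencies.

```python
import csv
import io
from datetime import datetime

# Made-up rows in the assumed layout of the CSV (publish_date, headline_text).
SAMPLE_CSV = """publish_date,headline_text
20030219,council chief fails to secure position
20100101,heavy rain eases drought fears
20200615,state budget delayed by pandemic
"""


def load_headlines(handle):
    """Return a list of (date, headline) tuples from an open CSV file object."""
    rows = []
    for record in csv.DictReader(handle):
        # Dates are assumed to be stored as yyyymmdd, e.g. 20030219.
        published = datetime.strptime(record["publish_date"], "%Y%m%d").date()
        rows.append((published, record["headline_text"]))
    return rows


# In the notebook this would be an open file handle instead of a string.
headlines = load_headlines(io.StringIO(SAMPLE_CSV))
years = sorted({published.year for published, _ in headlines})
print(years)
```

With the real file, the same function could be called on `open(path, newline="")`, where `path` points at the downloaded CSV.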
To achieve our goal, a tf/tf-idf model is first built on tri-gram phrases with the Python scikit-learn library, and then a topic modelling technique is applied. Topic modelling can be described as a method for finding groups of words (i.e. topics) in a collection of documents that best represent the information in the collection. In this project, LDA (Latent Dirichlet Allocation) is chosen as the topic modelling technique. An evaluation and visualization of the results are also given.
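To make the tri-gram tf-idf step concrete, here is a small hand-rolled sketch. It is not the scikit-learn code used in the notebook: it uses the plain textbook weighting tf * log(N / df), whereas scikit-learn's `TfidfVectorizer` (for example with `ngram_range=(3, 3)`) applies a smoothed idf and L2 normalization by default, so the numbers would differ. The function names and the sample headlines are invented for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Split a headline on whitespace and return its word tri-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def tfidf(documents):
    """Return one {tri-gram: weight} dict per document using tf * log(N / df)."""
    grams_per_doc = [trigrams(doc) for doc in documents]
    # Document frequency: in how many documents each tri-gram occurs.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        weights.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return weights


# Invented headlines, not taken from the dataset.
docs = [
    "police search for missing man",
    "police search for missing woman",
    "rain eases drought fears in the west",
]
weights = tfidf(docs)
for gram, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{gram}: {weight:.3f}")
```

A tri-gram that appears in only one headline ("for missing man") gets a higher weight than one shared by two headlines ("police search for"), which is the behaviour tf-idf is meant to provide.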
After applying the LDA (Latent Dirichlet Allocation) algorithm to our dataset, 11 topics were extracted; the top-weighted terms for each topic are presented in the following graphs. We also visualized the distribution of the 11 topics across the decade's news titles in order to obtain a clearer overview of the top news events of the past 10 years.
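The distribution of topics over time can be sketched as follows, assuming each headline has already been given a single dominant topic id (for instance the highest-probability entry returned by gensim's `LdaModel.get_document_topics`). The helper `topic_share_by_year` and the sample pairs are hypothetical, and 3 topics are used instead of 11 to keep the example short.

```python
from collections import Counter, defaultdict


def topic_share_by_year(assignments, num_topics):
    """Map each year to a list of per-topic shares that sums to 1."""
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1
    shares = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        # Topics that never occur in a year get a share of 0.
        shares[year] = [topic_counts[t] / total for t in range(num_topics)]
    return shares


# Hypothetical (year, dominant topic id) pairs, one per headline.
assignments = [(2010, 0), (2010, 0), (2010, 2), (2011, 1), (2011, 2)]
shares = topic_share_by_year(assignments, num_topics=3)
for year in sorted(shares):
    print(year, [round(s, 2) for s in shares[year]])
```

The resulting per-year lists can be passed to a stacked bar or area plot in matplotlib to show how the share of each topic changes from year to year.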











