jasonzjc/CourseProject

 
 

CourseProject: Intelligent Learning Platform

Overview

The goal of this project is to organize scattered lectures into a coherent “multimedia textbook” and create an index for it. A course typically covers many words, but only a few are keywords that relate to the knowledge introduced in the course. A learner may want to quickly find the lecture, or the location within it, where a specific topic is presented or a specific keyword is defined and explained. This project relates to the text retrieval, language model, and topic analysis material introduced in the text information system class.

Software Implementation

Dataset Selection

Originally, I planned to use the content of this course as an example, and I did extract its key phrases (i.e., its index). However, I found no way to embed the Coursera videos in my web page, which I believe is because Coursera does not allow video embedding. Furthermore, I found no explicit way to generate a link to a specific timestamp in a Coursera video. Therefore, I had to turn to an alternative solution.
In this project, I used open course playlists on YouTube as the input data, because:

  1. Many such courses are available on YouTube, e.g., MIT OCW.
  2. The transcripts are corrected and well organized.
  3. YouTube allows users to embed its videos in web pages.

Software Structure

This software is constructed from three modules:

  • Data collection
  • Key phrase extraction
  • Platform integration

They are explained in detail below.

Data Collection

The data collection module gathers the essential content of a user-specified course: the URL and the transcript of each video. A YouTube playlist URL looks like this: https://www.youtube.com/playlist?list=PLUl4u3cNGP61iQEFiWLE21EJCxwmWvvek. The string after 'list=' is the playlist ID. The playlist page lists all of the videos in the course.

First, the title and URL of each video are scraped using BeautifulSoup. A YouTube video URL typically looks like this: https://www.youtube.com/watch?v=YrHlHbtiSM0. The string after 'v=' is the unique video ID, so the video IDs of the whole playlist can be obtained from the scraped URLs. Given a video ID, its transcript can be fetched; here YouTubeTranscriptApi is used for that.
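The ID extraction described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper name `extract_query_id` is hypothetical, and it assumes the ID is always carried in the query string as in the two example URLs.

```python
from urllib.parse import urlparse, parse_qs


def extract_query_id(url: str, key: str) -> str:
    """Return the value of query parameter `key` (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    values = query.get(key)
    if not values:
        raise ValueError(f"no '{key}=' parameter in {url!r}")
    return values[0]


playlist_url = "https://www.youtube.com/playlist?list=PLUl4u3cNGP61iQEFiWLE21EJCxwmWvvek"
video_url = "https://www.youtube.com/watch?v=YrHlHbtiSM0"

print(extract_query_id(playlist_url, "list"))  # the playlist ID
print(extract_query_id(video_url, "v"))        # the video ID
```

Parsing the query string rather than splitting on 'v=' keeps the helper working when a URL carries extra parameters such as a playlist index.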

From this information, two data structures are generated:

  • A string concatenating all the texts in the transcripts, to represent the document of this course. This will be fed into the algorithm to extract the key phrases.
  • A dictionary containing the title, URL, content, start and stop time of each sentence. This is used to locate the key phrases and these locations will be used by the platform.
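One possible shape for these two structures is sketched below. It is an assumption-laden illustration rather than the repository's code: the function name `build_course_data` and the input layout are hypothetical, the index is modeled as a list of per-sentence dictionaries, and each transcript segment is assumed to be a dictionary with `text`, `start`, and `duration` keys, so the stop time is computed as `start + duration`.

```python
from typing import Dict, List, Tuple


def build_course_data(videos: List[dict]) -> Tuple[str, List[Dict]]:
    """Build the course document string and a per-sentence index.

    `videos` is a hypothetical input: one dict per video with "title",
    "url", and "segments" (transcript pieces in playback order).
    """
    text_parts = []
    index = []
    for video in videos:
        for seg in video["segments"]:
            text_parts.append(seg["text"])
            index.append({
                "title": video["title"],
                "url": video["url"],
                "content": seg["text"],
                "start": seg["start"],
                # Assumed: stop time is start plus segment duration.
                "stop": seg["start"] + seg["duration"],
            })
    return " ".join(text_parts), index


# Made-up sample data, for illustration only.
sample = [{
    "title": "Lecture 1",
    "url": "https://www.youtube.com/watch?v=YrHlHbtiSM0",
    "segments": [
        {"text": "welcome to the course", "start": 0.0, "duration": 2.5},
        {"text": "today we discuss graphs", "start": 2.5, "duration": 3.0},
    ],
}]
document, index = build_course_data(sample)
print(document)
print(index[1])
```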

Key Phrases Generation

Given the content of a course, the next step is to generate the key phrases. A key phrase is not necessarily a single word; it may contain two, three, or more words. This makes it difficult to rely on the TF-IDF algorithm alone, so I used BERT instead. BERT stands for Bidirectional Encoder Representations from Transformers. It is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. The document embedding is extracted with BERT first, then embeddings are extracted for the candidate N-gram words and phrases. Lastly, cosine similarity is used to find the words and phrases most similar to the document. The KeyBERT tool is used to apply BERT in this way.
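The rank-by-similarity idea can be illustrated without any model download. The sketch below is not KeyBERT and does not use BERT: it swaps in a toy bag-of-words count vector as the "embedding" so that the candidate generation and cosine ranking steps are visible and runnable. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_candidates(tokens, max_n=3):
    """All unique 1..max_n-grams, in order of first appearance per n."""
    phrases = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[i:i + n]))
    return list(dict.fromkeys(phrases))  # de-duplicate, keep order


def embed(tokens):
    """Toy stand-in for a BERT embedding: a bag-of-words count vector."""
    return Counter(tokens)


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_phrases(document, max_n=3, top_k=5):
    """Score every candidate phrase against the whole document."""
    tokens = tokenize(document)
    doc_vec = embed(tokens)
    scored = [(c, cosine(embed(c.split()), doc_vec))
              for c in ngram_candidates(tokens, max_n)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


for phrase, score in rank_phrases("graph search graph search tree"):
    print(f"{score:.3f}  {phrase}")
```

With count vectors, a phrase scores highly simply because it covers the document's frequent words, which tends to favor longer phrases. A contextual embedding captures meaning rather than word overlap, which is the reason for using BERT here.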

Some of the phrases BERT finds are not really keywords of the document. For example, a word may be repeated in an example without being a technical term in the course. To overcome this, I tried combining KeyBERT with another keyword extraction method, Yake!. Yake! performs a first round of keyword extraction, and the resulting list of keywords is then fed into BERT for keyword and phrase extraction. In the limited cases studied, I found this combination works better than using BERT alone.
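The two-stage arrangement can be pictured as a shortlist step followed by a re-ranking step. The sketch below shows only that pipeline shape: `two_stage_extract` is a hypothetical helper, and the two toy functions are simple stand-ins for Yake! and BERT respectively, not calls into either library.

```python
from typing import Callable, List


def two_stage_extract(
    document: str,
    first_pass: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    top_k: int = 10,
) -> List[str]:
    """Shortlist candidates with `first_pass`, then order them with `rerank`."""
    shortlist = first_pass(document)
    if not shortlist:
        return []
    return rerank(document, shortlist)[:top_k]


def toy_first_pass(document: str) -> List[str]:
    # Stand-in for the first-round extractor: unique words longer than 4 letters.
    return list(dict.fromkeys(w for w in document.split() if len(w) > 4))


def toy_rerank(document: str, candidates: List[str]) -> List[str]:
    # Stand-in for the similarity ranking: most frequent candidates first.
    words = document.split()
    return sorted(candidates, key=lambda c: -words.count(c))


doc = "gradient descent updates the gradient until the loss stops improving"
print(two_stage_extract(doc, toy_first_pass, toy_rerank, top_k=2))
```

Keeping the stages as separate callables makes it easy to compare the combined pipeline against the second stage alone on the same document.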

After the key phrases are extracted, they are located in the videos. The current implementation selects the first occurrence, because intuitively a key phrase is typically explained during or just after its first mention in a course.
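A first-occurrence lookup might look like the following sketch. It assumes a hypothetical index: a list of per-sentence dictionaries in playback order, each with `url`, `content`, and `start` keys. The function name `first_location` is also hypothetical.

```python
from typing import Dict, List, Optional


def first_location(phrase: str, index: List[Dict]) -> Optional[Dict]:
    """Return the URL and start time of the first sentence containing `phrase`."""
    needle = phrase.lower()
    for entry in index:  # assumed to be ordered by video, then by time
        if needle in entry["content"].lower():
            return {"url": entry["url"], "start": entry["start"]}
    return None  # phrase never appears within a single sentence


# Made-up sample index, for illustration only.
index = [
    {"url": "https://www.youtube.com/watch?v=YrHlHbtiSM0",
     "content": "welcome to the course", "start": 0.0},
    {"url": "https://www.youtube.com/watch?v=YrHlHbtiSM0",
     "content": "a Binary Search tree keeps keys ordered", "start": 12.4},
    {"url": "https://www.youtube.com/watch?v=YrHlHbtiSM0",
     "content": "binary search is also used on arrays", "start": 40.0},
]
print(first_location("binary search", index))
```

One limitation of this simple substring match is that a phrase split across two consecutive transcript sentences is not found.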

Platform Integration

The key phrases and their locations are presented on a web page generated with Flask. The key phrases are listed on the left side of the page, each as a link that includes the corresponding video ID and timestamp. The video block is located on the right. Clicking the link of a key phrase loads the corresponding YouTube video on the right. When the play button is clicked, the embedded video starts from the timestamp at which the key phrase is introduced.
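One way to make an embedded video begin at a given timestamp is YouTube's `start` player parameter, which takes whole seconds. The sketch below builds such an embed URL. The helper name `embed_url` is hypothetical, and whether the project builds its links exactly this way is an assumption.

```python
def embed_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube embed URL that starts playback at `start_seconds`."""
    start = max(0, int(start_seconds))  # the parameter expects whole seconds
    return f"https://www.youtube.com/embed/{video_id}?start={start}"


print(embed_url("YrHlHbtiSM0", 83.7))
```

The resulting URL can be used as the `src` of an `iframe` in the page template, with the key phrase link supplying the video ID and timestamp.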

Further Improvements

If this project is continued in the future, I would like to improve the key phrase generation algorithm, for example by including context in the algorithm. The title of each video is a good candidate for improving the accuracy of key phrase generation. Also, the best location for a key phrase is not necessarily the first time it is mentioned. Semantic analysis could be used to pinpoint the sentence(s) in which it is explained. Thirdly, the platform can be improved to better integrate the key phrases, the transcript, and the video. The key phrases can also be arranged hierarchically to form the structure of a course.

Software Usage

The software is intended to run on Python 3.6.14 or later.

  1. Clone the repo and ensure that python3 is installed. After cloning, cd into CourseProject.
  2. Run pip3 install -r requirements.txt to ensure you have the required packages to run this project.
  3. Ensure you have a Chrome browser installed. Check its version and download the corresponding chromedriver here. Use it to overwrite the chromedriver file in the repo.
  4. Find a course playlist on YouTube. You can find a lot of playlists here. Make sure you copy the URL of a playlist, in a form like this: https://www.youtube.com/playlist?list=PLUl4u3cNGP63z5HAguqleEbsICfHgDPaG
  5. Open youtube_keywords_generation.py and replace the url value with your URL in the __main__ function (Line 207). Run this file.
  6. Open webpage_flask.py and replace the url value with your URL (Line 7). Run this file.
  7. Open http://127.0.0.1:5000/ in your browser. You should be able to click the key phrases and see the video change accordingly.

Video Demonstration

YouTube: https://youtu.be/K23OyTei1vk
Illinois.edu: https://mediaspace.illinois.edu/media/t/1_xzz7ab9j

Contribution

Jiecheng Zhao (NetID: jz109) is the only member of this team.
