Generate lyrics in the style of any artist!

cho4/Versify

INTRODUCTION

Discovering music by category is now easier than ever through a Google search. However, the industry is made up of countless unique styles that serve as an artist’s signature and help them stand out. There are several reasons why one might want to imitate an artist’s songwriting style, such as finding inspiration, creating parodies or tributes, or expanding one’s own style. In exploring possible technological advancements for the music industry, we also recognized that with the explosion of ChatGPT and other generative AI models, the applications of prompt engineering have become increasingly prominent.

In light of this, our project uses graphs in conjunction with natural language processing to generate completely new song lyrics in the style of a given musical artist. The generated lyrics are based on the vocabulary and semantic patterns the artist commonly uses, derived from their existing songs.

To compute this, we model an artist’s discography as a graph in which each vertex represents a song by that artist. Using the API of Cohere, an NLP startup, we obtain embeddings of the lyrics of these songs, which give us a numerical representation of each song’s semantics. We then form an edge between two songs when their embeddings are more than 75% similar. Finally, we take the five songs with the highest degree in the discography graph, which we treat as the most representative of the artist, and use them with OpenAI’s GPT-3.5 model, whose prompt-based generation produces new lyrics in that artist’s style. This process makes up our course project, Versify: The Future of Songwriting.
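The graph step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes that "75% similar" means a cosine similarity above 0.75, and the function names (`cosine_similarity`, `build_similarity_graph`, `most_connected`) and the toy two-dimensional vectors are hypothetical stand-ins for real Cohere embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, threshold=0.75):
    """Map each song title to the set of titles it shares an edge with.

    `embeddings` maps song title -> embedding vector (hypothetical input;
    in Versify these vectors would come from Cohere). Cosine similarity
    is an assumption about how "75% similar" is measured.
    """
    graph = {title: set() for title in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def most_connected(graph, k=5):
    """Return the k titles with the highest degree, ties broken by title."""
    return sorted(graph, key=lambda t: (-len(graph[t]), t))[:k]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
toy = {
    "Song A": [1.0, 0.0],
    "Song B": [0.9, 0.1],
    "Song C": [0.8, 0.3],
    "Song D": [0.0, 1.0],
}
graph = build_similarity_graph(toy)
top_songs = most_connected(graph, k=2)
```

Comparing every pair of songs is quadratic in the size of the discography, which is acceptable for a single artist's catalogue but is one reason to cache the finished graph rather than rebuild it on every run.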

DATASETS

In order to make Versify work, we needed a way to access data about artists, their discographies, and the lyrics of each song. Initially, we proposed using the Musixmatch API to access this information. However, testing in the early stages of development showed it to be both slow and unreliable. We ultimately decided to switch gears and use a fixed dataset instead, which lets us access the data directly.

We therefore turned to Kaggle, a Google-owned service that allows users to find and publish datasets for training machine learning models and for other data science applications. We quickly found a suitable dataset, the 5 Million Song Lyrics Dataset, which is provided as a CSV file.

However, we soon determined that we could not work with this dataset in its given format, as iterating through such a large file is computationally inefficient. After trying several efficiency-related workarounds with the csv and pandas libraries, we ultimately settled on the Python sqlite3 library, which the benchmarks we consulted reported to be about 50 times more efficient than pandas.

We then converted the data in our CSV file to a SQL table in a database file. This was a one-time step run in the Python console: the pandas.read_csv function loads the CSV data into a pandas.DataFrame object, and the pandas.DataFrame.to_sql method then writes it to a SQL table called ‘songs’, stored in lyrics_ds.db. To reduce the size of the already large database file and further improve efficiency, we dropped all columns outside the scope of the project, keeping only ‘title’, ‘artist’, ‘views’ and ‘lyrics’. In our testing, a SELECT query on ‘songs’ takes 20-30 seconds on average, which is significantly better than the several minutes the same task takes with the CSV file.
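The shape of that conversion can be illustrated with a small, dependency-free sketch. Note the substitution: the project used pandas.read_csv and pandas.DataFrame.to_sql, whereas this example uses the standard-library csv module so that it runs without extra packages. The sample rows, the extra `tag` and `year` columns, and the `load_songs` helper are hypothetical, and an in-memory database stands in for lyrics_ds.db.

```python
import csv
import io
import sqlite3


def load_songs(csv_file, conn):
    """Copy only the four kept columns of a CSV stream into a 'songs' table."""
    conn.execute(
        "CREATE TABLE songs (title TEXT, artist TEXT, views INTEGER, lyrics TEXT)"
    )
    reader = csv.DictReader(csv_file)
    # Columns other than the four below are simply never read, i.e. dropped.
    rows = (
        (row["title"], row["artist"], int(row["views"]), row["lyrics"])
        for row in reader
    )
    conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Hypothetical sample data; the real file has millions of rows.
sample_csv = io.StringIO(
    "title,tag,artist,year,views,lyrics\n"
    "First Song,pop,Artist One,2019,1200,la la la\n"
    "Second Song,rap,Artist Two,2020,800,na na na\n"
    "Third Song,pop,Artist One,2021,4300,do re mi\n"
)

conn = sqlite3.connect(":memory:")  # stand-in for lyrics_ds.db
load_songs(sample_csv, conn)

# A parameterized SELECT of the kind used to fetch one artist's songs.
songs_by_artist = conn.execute(
    "SELECT title, views FROM songs WHERE artist = ? ORDER BY views DESC",
    ("Artist One",),
).fetchall()
```

Passing the artist name as a query parameter rather than formatting it into the SQL string also guards against malformed or malicious user input.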

Moreover, we created an additional single-column table in lyrics_ds.db called ‘artists’, which contains the names of all artists in our dataset. Because the ‘artists’ table has far fewer rows than ‘songs’ (approximately 700,000 vs. 5 million), it lets us validate more efficiently whether an inputted artist exists in our dataset and give the user near-instant feedback.
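One way such a lookup table could be built and queried is sketched below. The helper names `build_artists_table` and `artist_exists`, the sample rows, and the choice of an exact-match comparison are assumptions made for illustration; the source does not say how the table was populated or how names are matched.

```python
import sqlite3


def build_artists_table(conn):
    """Create a one-column 'artists' table from the distinct artists in 'songs'."""
    conn.execute("CREATE TABLE artists AS SELECT DISTINCT artist FROM songs")
    conn.commit()


def artist_exists(conn, name):
    """Return True if `name` appears in 'artists' (exact match, an assumption)."""
    row = conn.execute(
        "SELECT 1 FROM artists WHERE artist = ? LIMIT 1", (name,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")  # stand-in for lyrics_ds.db
conn.execute(
    "CREATE TABLE songs (title TEXT, artist TEXT, views INTEGER, lyrics TEXT)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [
        ("First Song", "Artist One", 1200, "la la la"),
        ("Second Song", "Artist Two", 800, "na na na"),
        ("Third Song", "Artist One", 4300, "do re mi"),
    ],
)
build_artists_table(conn)

known = artist_exists(conn, "Artist One")
unknown = artist_exists(conn, "Nobody")
```

Validating against the small table first means the slow query on ‘songs’ only runs for names that are known to be present.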

In addition, we generated a separate file called discographies.pkl, which contains a dictionary of already instantiated Discography objects for several artists (serialized using the pickle library). This file is updated every time the user enters a new artist. When the user enters an artist name that is already in discographies.pkl, the process of querying lyrics_ds.db is skipped, avoiding expensive recomputation.
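A pickle-backed cache of this kind might look roughly like the sketch below. The `load_cache` and `get_discography` helpers and the `fake_build` callback are hypothetical, and a plain dictionary stands in for the project's Discography class, whose definition is not shown in this document.

```python
import os
import pickle
import tempfile


def load_cache(path):
    """Return the cached dict of discographies, or an empty dict if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def get_discography(artist, path, build):
    """Return the cached discography for `artist`, building and saving it if new.

    `build` is a callable that does the expensive work (database query,
    embeddings, graph construction) for an artist not yet in the cache.
    """
    cache = load_cache(path)
    if artist in cache:
        return cache[artist]
    cache[artist] = build(artist)
    with open(path, "wb") as f:
        pickle.dump(cache, f)
    return cache[artist]


calls = []  # records how often the expensive builder actually runs


def fake_build(artist):
    """Hypothetical stand-in for constructing a Discography object."""
    calls.append(artist)
    return {"artist": artist, "songs": ["Song A", "Song B"]}


cache_path = os.path.join(tempfile.mkdtemp(), "discographies.pkl")
first = get_discography("Artist One", cache_path, fake_build)
second = get_discography("Artist One", cache_path, fake_build)  # served from cache
```

Because pickle can execute arbitrary code when loading, a cache file like this should only ever be read from a location the program itself controls.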

THE CULMINATION OF VERSIFY

Following thorough testing and analysis of Versify, we have concluded that the program can capture the style and themes of a given artist to a certain degree. For instance, lyrics generated in the style of a hip-hop artist typically feature more slang and profanity than those generated for a country artist. The graph structure and the Cohere API's embedding feature were well suited to modeling an artist's discography, connecting songs with similar vocabulary and semantic patterns. Moreover, OpenAI's GPT-3.5 model gave the program a powerful natural language generation component. After extensive testing, we are confident that the project runs stably and satisfactorily. We also went beyond the core functionality by adding memoization and a graphical user interface to improve the program's efficiency and usability. Our team has dedicated significant effort to this project, and we hope that users of Versify will find it satisfying to use.

REFERENCES

Altman, Sam. OpenAI API, 2015, https://platform.openai.com/docs/introduction.

“API Reference.” API Reference - Pandas 1.5.3 Documentation, 2008, https://pandas.pydata.org/docs/reference/index.html.

Digital, Mason. “SQLite VS Pandas: Performance Benchmarks.” The Data Incubator, 5 Dec. 2022, https://www.thedataincubator.com/blog/2022/11/17/sqlite-vs-pandas-performance-benchmarks/#:~:text=sqlite%20or%20memory-sqlite%20is,seconds%20for%2010%20million%20records.

Gomez, Aidan. “Add Language AI Capability to Your System.” Cohere AI, 2019, https://docs.cohere.ai/reference/about.

Gomez, Aidan. “Embeddings.” Cohere AI, 2019, https://docs.cohere.ai/docs/embeddings.

Nayak, Nikhil. “5 Million Song Lyrics Dataset.” Kaggle, 22 Apr. 2022, https://www.kaggle.com/datasets/nikhilnayak123/5-million-song-lyrics-dataset?resource=download.

NeuralNine. “Tkinter Beginner Course - Python Gui Development.” YouTube, 29 Sept. 2021, https://youtu.be/ibf5cx221hk.

Openai. “Openai/Tiktoken: Tiktoken Is a Fast BPE Tokeniser for Use with Openai's Models.” GitHub, https://github.com/openai/tiktoken.

“Pickle - Python Object Serialization.” Python Documentation, https://docs.python.org/3/library/pickle.html.

Real Python. “An Intro to Threading in Python.” Real Python, 22 May 2022, https://realpython.com/intro-to-python-threading/.

Schimansky, Tom. “Tomschimansky/Customtkinter: A Modern and Customizable Python UI-Library Based on Tkinter.” GitHub, https://github.com/TomSchimansky/CustomTkinter.

“SQLITE3 - DB-API 2.0 Interface for SQLite Databases.” Python Documentation, https://docs.python.org/3/library/sqlite3.html.

“Tkinter - Python Interface to TCL/TK.” Python Documentation, 2001, https://docs.python.org/3/library/tkinter.html.
