
Generating Images from Audios Using AI


Charlie Boxall

08/05/2025

Supervised by Bailin Deng; Moderated by Tingting Li

This project aims to generate a single image for each input acoustic music audio file using deep learning techniques. The student's role will involve:
- Creating a dataset of acoustic music audio files to train the AI model on
- Experimenting with different audio captioning models to generate text descriptions from the audio
- Testing text-to-image generation models to convert the text descriptions into images
- Comparing the performance of various model combinations
- Constructing an end-to-end pipeline that takes in an acoustic music clip and outputs a corresponding generated image

The focus is specifically on acoustic music rather than speech or other audio types. The goal is to find a suitable combination of AI models that can accurately capture the mood, instruments, genre and other characteristics of a music clip and render them in an automatically generated image.
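The two-stage pipeline described above (audio captioning followed by text-to-image generation) could be organised roughly as in the following sketch. It is an illustration only, not the project's actual implementation: the names `Captioner`, `ImageGenerator`, `run_pipeline`, `run_all_combinations`, `stub_captioner` and `stub_generator` are all hypothetical, and the stub functions stand in for real models, which are not specified here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Tuple

# Hypothetical interfaces (assumptions, not taken from the project):
# a captioner maps an audio file path to a text description,
# an image generator maps a text prompt to encoded image bytes.
Captioner = Callable[[str], str]
ImageGenerator = Callable[[str], bytes]


@dataclass
class PipelineResult:
    audio_path: str
    caption: str
    image: bytes


def run_pipeline(audio_path: str, captioner: Captioner,
                 generator: ImageGenerator) -> PipelineResult:
    """Run one audio clip through a captioner, then an image generator."""
    caption = captioner(audio_path)   # stage 1: audio -> text
    image = generator(caption)        # stage 2: text -> image
    return PipelineResult(audio_path, caption, image)


def run_all_combinations(
    audio_path: str,
    captioners: Dict[str, Captioner],
    generators: Dict[str, ImageGenerator],
) -> Dict[Tuple[str, str], PipelineResult]:
    """Run every captioner/generator pairing so the outputs can be compared."""
    return {
        (c_name, g_name): run_pipeline(audio_path, c, g)
        for (c_name, c), (g_name, g) in product(captioners.items(),
                                                generators.items())
    }


# Placeholder stand-ins for real models, used only to make the sketch runnable.
def stub_captioner(audio_path: str) -> str:
    return f"calm acoustic guitar piece ({audio_path})"


def stub_generator(prompt: str) -> bytes:
    # A real generator would return image data; this just echoes the prompt.
    return prompt.encode("utf-8")


if __name__ == "__main__":
    results = run_all_combinations(
        "clip.wav",
        {"captioner_a": stub_captioner},
        {"generator_x": stub_generator, "generator_y": stub_generator},
    )
    for combo, result in results.items():
        print(combo, result.caption)
```

Keeping both stages behind plain callables means any captioning or image model can be swapped in without changing the pipeline code, which is one way to support the comparison of model combinations.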


Initial Plan (02/02/2025) [Zip Archive]

Final Report (08/05/2025) [Zip Archive]

Publication Form