Generating Images from Audio Using AI

Jingtao Yu


Supervised by Bailin Deng; Moderated by Usashi Chatterjee

Given an audio clip, can you draw the scene corresponding to the audio? This project aims to solve this problem using deep learning techniques. We will first use an audio captioning model (for example, see https://github.com/audio-captioning/audio-captioning-papers for a list) to generate a free-text description of the audio clip. An image can then be generated from the text description using a text-to-image generation model (see https://stability.ai/blog/stable-diffusion-public-release for an example). In this project, you will need to test different AI models to compare their performance, and to identify suitable models for constructing an audio-to-image generation pipeline.
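The two-stage pipeline above can be sketched as a simple composition of functions. This is a minimal illustration only: `caption_audio` and `generate_image` are hypothetical placeholder names standing in for a real audio captioning model and a real text-to-image model (such as Stable Diffusion), which the project would swap in.

```python
def caption_audio(audio_path: str) -> str:
    """Stage 1: produce a free-text description of the audio clip.

    Placeholder standing in for a real audio captioning model.
    """
    return f"a scene described by the audio in {audio_path}"


def generate_image(caption: str) -> bytes:
    """Stage 2: render an image from the text caption.

    Placeholder standing in for a real text-to-image model;
    returns the caption bytes in place of actual image data.
    """
    return caption.encode("utf-8")


def audio_to_image(audio_path: str) -> bytes:
    """Chain the two stages: audio -> caption -> image."""
    return generate_image(caption_audio(audio_path))
```

Because each stage is an independent function, different captioning and text-to-image models can be substituted and compared without changing the rest of the pipeline.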

Initial Plan (06/02/2023) [Zip Archive]

Final Report (17/05/2023) [Zip Archive]

Publication Form