Skip to the content.

Music Enhancement via Image Translation and Vocoding

Nikhil Kandpal, Oriol Nieto, Zeyu Jin

Paper       Code

Mel2Mel+Diffwave Model Overview


Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. We find that this approach to music enhancement outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms. Additionally, in evaluating the proposed method with a listening test, we analyze the reliability of common audio enhancement evaluation metrics when used in the music domain.

Real-World Samples

Below, we share a few representative samples generated by our best performing model, Mel2Mel + Diffwave, on real-world samples sourced from Youtube.

Low-Quality Mel2Mel + Diffwave Source
Dance Monkey - Street Musician Karsten Belt (Saxophone)
Top 10 Street Piano Performances
People Singing In Front Of Other People In Public Compilation

Medley-Solos DB Samples

Additionally, we share test samples generated by all of the models we evaluate in the paper: Mel2Mel + Diffwave, Mel2Mel + Griffin-Lim, and Demucs. Each row in the table contains the ground-truth high-quality recording from the Medley-Solos DB test set, the simulated low-quality recording, and the reconstruction generated by each model.

Low-Quality Mel2Mel + Diffwave Mel2Mel + Griffin-Lim Demucs High-Quality