Music Enhancement via Image Translation and Vocoding

Nikhil Kandpal, Oriol Nieto, Zeyu Jin


Figure: Mel2Mel + Diffwave model overview

Abstract

Consumer-grade music recordings, such as those captured by mobile devices, typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. We find that this approach to music enhancement outperforms baselines that use classical methods for mel-spectrogram inversion, as well as an end-to-end approach that maps noisy waveforms directly to clean waveforms. Additionally, in evaluating the proposed method with a listening test, we analyze the reliability of common audio enhancement evaluation metrics when used in the music domain.
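To make the three distortions named in the abstract concrete, here is a minimal NumPy sketch that applies synthetic reverb, a crude EQ, and additive noise to a clean signal. It is an illustration only, not the simulation procedure used in the paper: the function names (`add_reverb`, `apply_eq`, `add_noise`, `degrade`), the 16 kHz sample rate, the decaying-noise impulse response, the hard 20 dB high-frequency cut, and all default parameter values are assumptions chosen for this example.

```python
import numpy as np


def add_reverb(signal, sample_rate, decay_seconds, rng):
    """Convolve with a synthetic impulse response (hypothetical model).

    The impulse response is white noise under an exponential envelope
    that falls by roughly 60 dB over decay_seconds.
    """
    n = int(sample_rate * decay_seconds)
    t = np.arange(n) / sample_rate
    ir = rng.normal(size=n) * np.exp(-6.9 * t / decay_seconds)
    ir[0] = 1.0                       # direct (unreflected) path
    ir /= np.sqrt(np.sum(ir ** 2))    # unit energy, keeps levels comparable
    return np.convolve(signal, ir)[: len(signal)]


def apply_eq(signal, sample_rate, cutoff_hz):
    """Crude microphone-style EQ: attenuate content above cutoff_hz by 20 dB."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gain = np.where(freqs > cutoff_hz, 0.1, 1.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def degrade(clean, sample_rate, snr_db=10.0, decay_seconds=0.3,
            cutoff_hz=4000.0, seed=0):
    """Apply reverb, then EQ, then noise (an assumed ordering)."""
    rng = np.random.default_rng(seed)
    out = add_reverb(clean, sample_rate, decay_seconds, rng)
    out = apply_eq(out, sample_rate, cutoff_hz)
    return add_noise(out, snr_db, rng)


# Usage: degrade one second of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)
low_quality = degrade(clean, sample_rate)
```

In this sketch the noise is added last so that it is not itself reverberated or filtered; a real recording chain could order these effects differently.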

Real-World Samples

Below, we share a few representative outputs of our best-performing model, Mel2Mel + Diffwave, on real-world recordings sourced from YouTube.

Medley-Solos DB Samples

Additionally, we share test samples generated by all of the models we evaluate in the paper: Mel2Mel + Diffwave, Mel2Mel + Griffin-Lim, and Demucs. Each row in the table contains the ground-truth high-quality recording from the Medley-Solos DB test set, the simulated low-quality recording, and the reconstruction generated by each model.

Low-Quality	Mel2Mel + Diffwave	Mel2Mel + Griffin-Lim	Demucs	High-Quality

Real-world samples (sources on YouTube):

Low-Quality	Mel2Mel + Diffwave	Source
		Dance Monkey - Street Musician Karsten Belt (Saxophone)
		Top 10 Street Piano Performances
		People Singing In Front Of Other People In Public Compilation