Gradient descent-based vocoder

Recovers audio from Mel spectrograms using gradient descent

The basic idea of this project is to recover audio from a spectrogram. The Griffin-Lim family of algorithms already does this, and in some cases it can even reconstruct the audio perfectly. The challenge in my case is that I binned the spectrogram into the Mel scale, compressing the magnitude data by a factor of 2 and making the original unrecoverable. Griffin-Lim can only be applied by first expanding the compressed Mel bins back into a blurry approximation of the original spectrogram.
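
For concreteness, the Mel binning step looks roughly like this in TensorFlow; the STFT size, hop, and Mel bin count below are illustrative stand-ins, not necessarily what this repo uses:

```python
import tensorflow as tf

# Illustrative parameters only; the repo's actual STFT/Mel settings may differ.
SR, FRAME_LEN, FRAME_STEP, N_MELS = 16000, 1024, 256, 256

def wav_to_mel(wav):
    """Bin STFT magnitudes into Mel bands, roughly halving the frequency resolution."""
    stft = tf.signal.stft(wav, frame_length=FRAME_LEN, frame_step=FRAME_STEP)
    mag = tf.abs(stft)  # [frames, FRAME_LEN // 2 + 1] linear-frequency bins
    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=N_MELS,
        num_spectrogram_bins=FRAME_LEN // 2 + 1,
        sample_rate=SR, lower_edge_hertz=20.0, upper_edge_hertz=7600.0)
    return tf.matmul(mag, mel_weights)  # [frames, N_MELS] Mel bins
```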

How can we (slightly) improve on Griffin-Lim in this case? There are state-of-the-art neural models for this, but what I tried was gradient descent on the uncompressed data, using the sum of the STFT reconstruction error and the error with respect to the Mel spectrogram as the loss. This lets the optimizer estimate the underlying magnitudes even though they are underspecified by the Mel spectrogram.
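
As a rough sketch of that loss (one plausible parameterization, not a transcription of main.py): treat a waveform estimate and a full-resolution magnitude estimate as trainable variables, penalize disagreement between the waveform's STFT magnitudes and the magnitude estimate, and penalize disagreement between the Mel-binned magnitude estimate and the target Mel spectrogram:

```python
import tensorflow as tf

# Hedged sketch of the optimization idea; variable names and parameterization
# are assumptions, not the repo's actual code.
SR, FRAME_LEN, FRAME_STEP, N_MELS = 16000, 1024, 256, 256  # illustrative values

mel_weights = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=N_MELS, num_spectrogram_bins=FRAME_LEN // 2 + 1,
    sample_rate=SR, lower_edge_hertz=20.0, upper_edge_hertz=7600.0)

def step(wav_hat, mag_hat, mel_target, opt):
    """One gradient step on a waveform estimate and a full-res magnitude estimate."""
    with tf.GradientTape() as tape:
        stft_mag = tf.abs(tf.signal.stft(wav_hat, FRAME_LEN, FRAME_STEP))
        # STFT reconstruction error: the waveform's spectrogram should agree
        # with the estimated full-resolution magnitudes.
        stft_loss = tf.reduce_mean(tf.square(stft_mag - mag_hat))
        # Mel error: binning those magnitudes should reproduce the target.
        mel_loss = tf.reduce_mean(
            tf.square(tf.matmul(mag_hat, mel_weights) - mel_target))
        loss = stft_loss + mel_loss
    grads = tape.gradient(loss, [wav_hat, mag_hat])
    opt.apply_gradients(zip(grads, [wav_hat, mag_hat]))
    return loss
```

Here `wav_hat` and `mag_hat` would be `tf.Variable`s (with `opt` something like `tf.keras.optimizers.Adam`), and `wav_hat` is what eventually gets written out as audio.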

How to run

In a TensorFlow 2 environment, run `python main.py samples/arctic_raw_16k.wav my/out/path.wav`; press Ctrl-C to stop processing and dump the output. The demo converts the input file into a compressed Mel spectrogram, then tries to recover the wav using gradient descent.

Demo files (use headphones)

The difference isn't drastic, but it is an improvement. Gradient descent actually does best when the Griffin-Lim output is used as its parameter initialization, so the process is: run Griffin-Lim for a few iterations, then run gradient descent (a sketch of that initialization step follows the table).

| File | G-L iters | G.D. iters | Notes |
| --- | --- | --- | --- |
| arctic_raw_16k.wav | -- | -- | original, unprocessed file |
| griffinlim.wav | 5000 | -- | ~25 s of Griffin-Lim processing |
| gl200.wav | 200 | -- | short Griffin-Lim run |
| graddescent.wav | 200 | 5000 | ~25 s of gradient descent processing |
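
The initialization stage might look roughly like the following (a sketch using `tf.signal`; the blurry magnitude guess, iteration counts, and STFT settings are assumptions rather than the repo's actual choices):

```python
import numpy as np
import tensorflow as tf

FRAME_LEN, FRAME_STEP = 1024, 256  # illustrative values, as above

def griffin_lim_init(mag_init, n_iter=200):
    """A few Griffin-Lim iterations over a blurry full-resolution magnitude
    guess (e.g. the Mel bins expanded back to linear bins); the result seeds
    the waveform variable for gradient descent."""
    inv_window = tf.signal.inverse_stft_window_fn(FRAME_STEP)
    phase = tf.random.uniform(tf.shape(mag_init), 0.0, 2.0 * np.pi)
    mag_c = tf.cast(mag_init, tf.complex64)
    for _ in range(n_iter):
        stft = mag_c * tf.exp(tf.complex(0.0, 1.0) * tf.cast(phase, tf.complex64))
        wav = tf.signal.inverse_stft(stft, FRAME_LEN, FRAME_STEP, window_fn=inv_window)
        phase = tf.math.angle(tf.signal.stft(wav, FRAME_LEN, FRAME_STEP))
    return wav
```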

Objectively, the reconstruction loss immediately improves when you feed Griffin-Lim output into gradient descent (even after 5000 G-L iterations). Subjectively, I think the Griffin-Lim output sounds a bit more phasey/robotic than the gradient descent output (it's easier to hear if you compare against gl200.wav), but you might not be able to tell without headphones.

Other files

This repo is a mishmash of old files, to be honest. Some of the functions in util.py might be useful to you, e.g. for visualizing STFT data. The story is that I made this back in fall 2018 and then upgraded it to TensorFlow 2 in 2021 (kind of: it still uses the graph API) so I could push it online.

gen_dataset.py is an unrelated script I used to render the STFTs of sound files as PNGs; it may or may not still work.

There is some random legacy stuff in the old/ directory.
