
GANs and Audio Synthesis
Introduction
Ever since the release of OpenAI’s immensely popular ChatGPT, AI-generated content has been taking popular culture by storm. The public now has easy access to text generation through ChatGPT and to image generation through other recent tools. People have taken these newfound abilities in all kinds of directions, but popular generative software has focused mainly on text and images. Far less explored is the application of neural networks to generating sound.
This article explores the potential and mechanics of using AI, specifically generative adversarial networks, to synthesize sound. While the research in this space is still emerging, these methods could apply to fields ranging from sound effects and design to text-to-speech algorithms.
What is a GAN?
A generative adversarial network, or GAN, is a type of machine learning model made up of two neural networks: one that generates content and one that tries to distinguish real content from generated content. The two networks are trained in tandem on real data. Over many rounds of training, the first network (the generator) produces content that is fed to the second (the discriminator). The generator attempts to “trick” the discriminator into classifying generated content as real, and both networks improve over time until the generated content is very difficult to distinguish from the real training data.
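The adversarial loop described above can be made concrete with a short sketch. The tiny fully connected networks, hyperparameters, and data shape below are illustrative placeholders, not the architecture of WaveGAN, GANSynth, or any other system discussed later; the point is only the alternation between a discriminator update and a generator update.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 100, 16384   # e.g. ~1 second of 16 kHz audio, flattened

    # Placeholder networks: real audio GANs use convolutional architectures instead.
    generator = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                              nn.Linear(512, data_dim), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(data_dim, 512), nn.LeakyReLU(0.2),
                                  nn.Linear(512, 1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(real_batch):
        n = real_batch.size(0)
        # Discriminator step: label real samples 1 and generated samples 0.
        fake = generator(torch.randn(n, latent_dim)).detach()
        d_loss = bce(discriminator(real_batch), torch.ones(n, 1)) + \
                 bce(discriminator(fake), torch.zeros(n, 1))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
        # Generator step: try to make the discriminator call fresh fakes real.
        g_loss = bce(discriminator(generator(torch.randn(n, latent_dim))),
                     torch.ones(n, 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        return d_loss.item(), g_loss.item()

Calling train_step repeatedly on batches of real audio drives both losses toward a rough equilibrium in which the discriminator can no longer reliably tell real from generated samples.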
In general, GANs are a tried-and-true method of synthesizing content such as images. Photos of human faces generated with GANs can often be mistaken for real people. GANs for sound synthesis are far less present in the public consciousness, but a few key research milestones in the last few years have shown real potential.
What companies and researchers are prominent in this space?
A few key projects come up again and again in the emerging space of AI sound generation. WaveGAN, published in 2019, generates one-second-long sound effects and speech snippets. Additionally, Google’s Magenta project (built on TensorFlow) developed GANSynth, which generates an entire audio sequence in a single pass by modeling spectral magnitudes and phase-derived frequencies, rather than trying to produce a waveform sample by sample.
How can GANs be used to generate audio?
Many companies and researchers have attempted to use GANs to generate audio, but the projects mentioned above have found the most success, using a variety of methods. Because audio is made up of waveforms, it is far more periodic than image data, and that regularity presents both benefits and challenges compared to image generation. Neural networks, GANs included, are better at capturing general structure than exact positions, which is a problem for sound waves, whose samples must oscillate at precise intervals; even small timing or phase errors are audible. GANSynth ran into this issue and switched to modeling instantaneous frequency, a quantity computed from the frame-to-frame phase changes of a signal’s spectrogram, rather than trying to replicate the waveform (or its raw phase) directly. This sidesteps the need to teach the model at an impractically high level of precision.
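A rough sense of what “instantaneous frequency” means can be had from a short NumPy/librosa sketch. This is not GANSynth’s actual pipeline (which also rescales frequencies and trains on musical notes); the synthetic sine-wave input and STFT parameters below are placeholders chosen only to show how phase differences produce a smoother, more learnable target than raw phase.

    import numpy as np
    import librosa

    # Placeholder signal: one second of a 440 Hz tone sampled at 16 kHz.
    sr = 16000
    t = np.arange(sr) / sr
    audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

    stft = librosa.stft(audio, n_fft=1024, hop_length=256)
    log_magnitude = np.log(np.abs(stft) + 1e-6)   # what the generator would model
    phase = np.angle(stft)

    # Raw phase wraps around every 2*pi and looks nearly random to a network.
    # Unwrapping it along time and differencing adjacent frames gives the
    # instantaneous frequency, which stays smooth for steady tones.
    unwrapped = np.unwrap(phase, axis=1)
    inst_freq = np.diff(unwrapped, axis=1) / (2 * np.pi)

    print(log_magnitude.shape, inst_freq.shape)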
One audio synthesis approach, exemplified by a third framework called MelGAN, adds an intermediate step. Text is first transformed into a type of time-frequency mapping known as a spectrogram, and the resulting spectrogram is then converted into a sound file; MelGAN itself is the spectrogram-to-audio half of that pipeline. Because a spectrogram is an image-like representation, this intermediate step can make the process more predictable and lets audio software tap into the vast progress that has been made on visual learning, rather than working with raw audio alone.
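The shape of that two-stage pipeline can be sketched as follows. The synthetic waveform stands in for the audio a text-to-spectrogram model would describe, and librosa’s classical Griffin-Lim inversion stands in for the learned MelGAN vocoder, so this illustrates the pipeline rather than MelGAN’s own code; the sample rate and FFT settings are placeholders.

    import numpy as np
    import librosa
    import soundfile as sf

    # Placeholder waveform: two seconds of a 220 Hz tone.
    sr = 22050
    t = np.arange(2 * sr) / sr
    audio = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)

    # Stage 1 output: a mel spectrogram, the intermediate "frequency mapping".
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)

    # Stage 2: convert the spectrogram back into a waveform. MelGAN does this with
    # a single pass through a trained generator; Griffin-Lim is a non-learned stand-in.
    recovered = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                     hop_length=256)
    sf.write('recovered.wav', recovered, sr)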
The WaveGAN paper demonstrated that, despite the difficulties of training on waveforms directly, such a network can still produce convincing results. The study compared two approaches: WaveGAN, which learned from raw waveforms, and SpecGAN, which generated spectrograms, an approach related in spirit to MelGAN’s. Both showed promise, successfully creating short snippets of sound effects and speech, and WaveGAN actually appeared to do a little better than SpecGAN, despite the inherent difficulty of reproducing as exact and repetitive a representation as a waveform. Promising as these results are, the study was limited to short, one-second snippets of audio; GANSynth, by contrast, has been shown to generate much longer clips.
What are the benefits/implications of this?
This developing technology can be used both for creative purposes (e.g., music generation, sound design, and effects) and for more everyday ones, like improving text-to-speech. Pipelines built around software such as MelGAN are particularly suited to that use case, since they can go from a text input to a spectrogram to a corresponding sound file.
While this progress has many potential applications, it is important, as with any new AI software, to be mindful of its ethical implications. Using AI for sound synthesis could raise copyright questions similar to those that have plagued ChatGPT and AI art, and AI-fabricated audio could be used to spread misinformation. At the same time, the technology could strongly benefit causes like accessibility through applications such as text-to-speech, while serving as a tool to bolster creative work.
Conclusion
Generative adversarial networks can be applied to many different purposes, just one of which is sound synthesis. Although many methods are being explored, there are sure signs of progress in the field. The new developments have the potential to change the way we think about audio, whether that is through text-to-speech or sound effects. As AI grows more ubiquitous in this or in any facet of life, it is important both to understand its methodologies and to think about its impacts.