Google: SoundStorm is said to make audio generation faster and more efficient

With SoundStorm, Google has released an audio AI model that can generate 30 seconds of audio in half a second on an AI accelerator, the Tensor Processing Unit v4 (TPU v4). According to Google, SoundStorm takes as input the semantic tokens produced by the AudioLM framework. The quality matches that of AudioLM, but SoundStorm is said to be more coherent and faster because the speech-generation steps run in parallel. That is according to the paper on generative audio AI models by the Google research team led by Zalán Borsos.
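The speed-up comes from filling in many tokens per step instead of generating them one at a time. The following is a minimal sketch of that idea, confidence-based parallel infill; the toy `predict` function, the token vocabulary size, and the round schedule are illustrative assumptions, not Google's actual code:

```python
import random

MASK = -1  # sentinel for a not-yet-generated token

def predict(tokens):
    """Toy stand-in for the model: returns a (token, confidence)
    guess for every masked position. In the real system this would
    be a neural network conditioned on the semantic tokens."""
    return {i: (random.randrange(1024), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def parallel_decode(length, rounds=4):
    """Fill all positions in a few rounds rather than one token at
    a time: each round commits the most confident guesses in parallel."""
    tokens = [MASK] * length
    for r in range(rounds):
        guesses = predict(tokens)
        if not guesses:
            break
        # commit the top fraction of guesses this round
        keep = max(1, len(guesses) // (rounds - r))
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:keep]
        for i, (tok, _) in best:
            tokens[i] = tok
    # commit anything still masked after the scheduled rounds
    for i, (tok, _) in predict(tokens).items():
        tokens[i] = tok
    return tokens
```

With `rounds=4`, a sequence of 16 tokens is produced in four model calls instead of 16, which is the source of the speed advantage over purely sequential decoding.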

For AudioLM, texts do not have to be transcribed first. Instead, the AI draws on existing audio databases – in this case the automatic speech recognition corpus LibriSpeech, which consists of 1,000 hours of public-domain audiobooks. Using machine learning, the audio files are tokenized, i.e. divided into short sound snippets represented as discrete tokens. This training data is then fed into a machine learning model that uses natural-language-processing techniques to learn the sound patterns.
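The tokenization step can be sketched as vector quantization: split the signal into fixed-size frames and replace each frame by the index of its nearest codebook entry. This is a simplified illustration of the general idea, not the actual AudioLM tokenizer (which works on learned neural embeddings); the frame size and codebook here are made up:

```python
def quantize(frame, codebook):
    """Map one frame of samples to the index of its nearest
    codebook vector (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(frame, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def tokenize(samples, frame_size, codebook):
    """Split raw samples into fixed-size frames and replace each
    frame by a discrete token (a codebook index)."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [quantize(f, codebook) for f in frames]
```

For example, with a two-entry codebook `[[0.0, 0.0], [1.0, 1.0]]`, the samples `[0.1, 0.0, 0.9, 1.1]` become the token sequence `[0, 1]` – continuous audio turned into symbols a language-model-style architecture can learn from.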

The open-source model Bark is based on a similar approach. Besides music, it can generate speech including melody, accent, and other prosodic properties. A few seconds of audio input are enough to produce speech that sounds more natural than with previous models.

Used in conjunction with SPEAR-TTS, a multi-speaker text-to-speech system, SoundStorm can generate natural dialogue. The content is controlled via transcripts, the speaking voices via short voice prompts, and speaker changes via annotations in the transcript. Generating 30 seconds of dialogue with multiple speakers takes two seconds on a TPU v4.
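A transcript with speaker-change annotations might be parsed into turns like this; the `Speaker A:` markup is an illustrative convention of this sketch, not the exact format SoundStorm or SPEAR-TTS uses:

```python
import re

def parse_turns(transcript):
    """Split an annotated transcript into (speaker, text) turns,
    where '|' separates turns and 'Speaker X:' names the voice."""
    pattern = re.compile(r"Speaker ([A-Z]):\s*([^|]+)")
    return [(s, t.strip()) for s, t in pattern.findall(transcript)]

dialogue = ("Speaker A: Did you hear about the new model? | "
            "Speaker B: Yes, it generates audio in parallel.")
```

Each resulting turn would then be synthesized in the voice selected by its speaker label, which is how a single transcript can drive a multi-speaker conversation.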

Ever-improving audio AI models also offer great potential for abuse, for example identity theft by tricking voice-ID systems. Many banks in Europe and the USA offer voice ID as a login option. People whose voices are readily available on the internet are particularly vulnerable to such scams.

AI researchers, including those at Google, are therefore also working on techniques that let people distinguish natural audio from synthetically generated audio. One conceivable approach is to watermark AI-generated output so that it is easier to tell apart from real recordings.


(mack)
