Google DeepMind Launches V2A: Next-Gen Audio for Video Creation

Onsa Mustafa | Last Updated: Jun 24, 2024

Google DeepMind has introduced video-to-audio (V2A) technology, which enables synchronized audiovisual creation. By combining video pixels with natural language text instructions, V2A generates immersive audio for on-screen action. To find the most scalable AI architecture, the team experimented with both autoregressive and diffusion approaches and found that the diffusion-based approach produced the most convincing and realistic synchronization of audio and visuals.

The first step in their video-to-audio technology is encoding the input video into a compressed representation. The diffusion model then iteratively refines the audio, starting from random noise. Visual input and natural language prompts guide this process, generating realistic, synchronized audio that closely follows the instructions. Finally, the audio output is decoded, turned into a waveform, and combined with the video data.


V2A encodes both the video and the audio prompt before passing them to the diffusion model, which produces compressed audio that is then decoded into a waveform. To improve audio quality and make it possible to steer the model toward specific sounds, the researchers supplemented the training process with additional information, such as transcripts of spoken dialogue and AI-generated annotations with detailed descriptions of sound.
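The stages described above (encode, refine from noise, decode, combine) can be illustrated with a toy sketch in Python. This is not DeepMind's implementation: every function name here is hypothetical, the "denoiser" is a hand-written stand-in for a trained network, and the encodings are deliberately trivial. It only shows the shape of the loop, in which a latent that starts as random noise is nudged step by step toward a prediction conditioned on the video and the text prompt.

```python
import random


def encode_video(frames):
    """Toy video encoder: compress each frame (a list of pixel values) to its mean."""
    return [sum(frame) / len(frame) for frame in frames]


def encode_prompt(prompt):
    """Toy text encoder: map the prompt to a single number in [0, 1)."""
    return (sum(ord(ch) for ch in prompt) % 100) / 100.0


def toy_denoiser(latent, video_code, prompt_code):
    """Hypothetical stand-in for a trained network.

    A real model would look at the noisy latent; this toy ignores it and
    simply returns a 'clean' latent derived from the conditioning signals.
    """
    return [v + prompt_code for v in video_code]


def generate_audio_latent(video_code, prompt_code, steps=50, seed=0):
    """Start from random noise and iteratively refine toward the prediction."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in video_code]
    for step in range(steps):
        predicted = toy_denoiser(latent, video_code, prompt_code)
        # Blend fraction grows each step and reaches 1.0 on the final step.
        blend = 1.0 / (steps - step)
        latent = [x + blend * (p - x) for x, p in zip(latent, predicted)]
    return latent


def decode_to_waveform(latent, samples_per_frame=4):
    """Toy decoder: repeat each latent value to form audio samples."""
    return [value for value in latent for _ in range(samples_per_frame)]


def combine(frames, waveform, samples_per_frame=4):
    """Pair each video frame with its slice of audio samples."""
    return [
        (frame, waveform[i * samples_per_frame:(i + 1) * samples_per_frame])
        for i, frame in enumerate(frames)
    ]


if __name__ == "__main__":
    frames = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # three tiny "frames"
    video_code = encode_video(frames)
    prompt_code = encode_prompt("rain on a tin roof")
    latent = generate_audio_latent(video_code, prompt_code)
    waveform = decode_to_waveform(latent)
    for frame, audio in combine(frames, waveform):
        print(frame, audio)
```

In a real system each of these stand-ins would be a learned model, and the refinement schedule would come from the diffusion formulation rather than a simple linear blend.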

By training on video, audio, and these additional annotations, the technology learns to associate specific audio events with different visual scenes while responding to the information in the transcripts or annotations. V2A technology can also be paired with video generation models like Veo to create dramatic scores, realistic sound effects, or dialogue that complements the characters and tone of a video.

V2A technology can also generate soundtracks for a wide range of traditional footage, such as silent films and archival material, which opens up new creative possibilities. Users can generate as many soundtracks as they want for any video input. They can also use a “positive prompt” to guide the output towards desired sounds or a “negative prompt” to steer it away from unwanted noises. This flexibility gives users more control over V2A’s audio output, making it easier to experiment and find the best match for their creative vision.
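DeepMind has not described how V2A applies positive and negative prompts internally. As an illustration only, the sketch below shows one technique commonly used in diffusion-based generators, where the model's prediction for the negative prompt serves as a baseline and the output is pushed away from it in the direction of the positive prompt. The function name and the `guidance_scale` parameter are assumptions made for this example, not part of V2A.

```python
def guided_prediction(pred_positive, pred_negative, guidance_scale=3.0):
    """Combine two model predictions so the result leans toward the positive
    prompt and away from the negative prompt.

    This is an assumed, generic guidance formula, not V2A's documented method:
        guided = negative + scale * (positive - negative)
    """
    return [
        neg + guidance_scale * (pos - neg)
        for pos, neg in zip(pred_positive, pred_negative)
    ]


if __name__ == "__main__":
    # Hypothetical per-step predictions for two prompts.
    pred_for_desired_sound = [1.0, 0.5]   # e.g. "cinematic, tense music"
    pred_for_unwanted_sound = [0.5, 0.5]  # e.g. "crowd noise"
    print(guided_prediction(pred_for_desired_sound, pred_for_unwanted_sound))
```

With a scale of 1.0 the result equals the positive-prompt prediction; larger values exaggerate the difference between the two prompts.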


The team is continuing research and development to address several limitations. They note that the quality of the audio output depends on the quality of the video input, and that distortions or artifacts in the video that fall outside the model’s training distribution can lead to noticeable audio degradation. They are also working on improving lip-syncing for videos that involve speech, where V2A uses the input transcripts to generate speech synchronized with the characters’ mouth movements. A remaining issue is that the paired video generation model may not be conditioned on the transcript, and this mismatch can lead to awkward lip-syncing. Their commitment to high standards and continuous improvement is evident.

The team is actively seeking input from prominent creators and filmmakers, recognizing the value of their insights and contributions to the development of V2A technology. This collaborative approach is intended to ensure that V2A technology has a positive impact on the creative community, meeting their needs and enhancing their work. To help protect against misuse of AI-generated content, they have incorporated the SynthID toolkit into the V2A research to watermark all AI-generated content, demonstrating their commitment to the ethical use of technology.
