Google researchers find novel way of turning a single photo of a human into AI-generated video good enough to make you think ‘this might go badly’

Google researchers have found a way to create video versions of humans generated from just a single still image. This enables things like generating a video of someone speaking from input text, or changing a person's mouth movements to match an audio track in a different language from the one originally spoken. It also feels like a slippery slope into identity theft and misinformation, but what's AI without a hint of frightening consequences?

The tech itself is rather interesting: it's called VLOGGER by the Google researchers who published the paper. In it, the authors (Enric Corona et al.) offer up various examples of how the AI takes a single input image of a human (in this case, I believe, mostly AI-generated humans) and, given an audio file, produces both facial and bodily movements to match.

That’s just one of a few potential use cases for the tech. Another is editing video, specifically a video subject’s facial expressions. In an example, the researchers show various versions of the same clip: one has a presenter speaking to camera, another with the presenter’s mouth closed in an eerie fashion, another with their eyes closed. My favourite is the video of the presenter with their eyes artificially held open by the AI, unblinking. Huge serial killer vibes. Thanks, AI.

The most useful feature, in my opinion, is the ability to swap a video's audio track for a dubbed foreign-language version and have the AI lip-sync the person's facial movements to the new audio.

It works in two stages: "1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion based architecture that augments text-to-image models with both temporal and spatial controls. This approach enables the generation of high quality videos of variable length, that are easily controllable through high-level representations of human faces and bodies," the project's GitHub page says.
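The paper doesn't ship reference code, but the shape of that two-stage pipeline can be sketched in toy Python. Everything below (function names, array shapes, the stand-in "denoising" loops) is a hypothetical illustration of the structure the quote describes, not Google's actual implementation: stage one turns audio into per-frame motion parameters starting from random noise, and stage two animates the single reference photo under those motion controls.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_to_motion(audio_features, n_frames, n_params=64, steps=8):
    """Stage 1 (sketch): a stochastic diffusion model mapping audio to a
    sequence of 3D face/body motion parameters, one vector per frame.
    The 'denoiser' here is a stand-in linear step, not a trained network."""
    x = rng.standard_normal((n_frames, n_params))       # start from pure noise
    cond = np.resize(audio_features, (n_frames, n_params))  # toy conditioning
    for _ in range(steps):
        # each step nudges the noisy motion toward the audio conditioning
        x = 0.9 * x + 0.1 * cond
    return x

def motion_to_video(ref_image, motion, steps=8):
    """Stage 2 (sketch): a temporal diffusion model that 'animates' the
    single reference image under the per-frame motion controls."""
    n_frames = motion.shape[0]
    h, w, c = ref_image.shape
    frames = rng.standard_normal((n_frames, h, w, c))   # noisy video clip
    for _ in range(steps):
        # denoise every frame toward the reference image, modulated per
        # frame by a scalar summary of that frame's motion controls (a toy
        # stand-in for the paper's spatial and temporal controls)
        gain = np.tanh(motion.mean(axis=1))[:, None, None, None]
        frames = 0.8 * frames + 0.2 * (ref_image[None] * (1 + 0.1 * gain))
    return frames

# toy inputs: one 'photo' and one second of 'audio features' at 25 fps
photo = rng.random((32, 32, 3))
audio = rng.random(400)
motion = audio_to_motion(audio, n_frames=25)
video = motion_to_video(photo, motion)
print(video.shape)  # (25, 32, 32, 3): frames, height, width, channels
```

The point of the split, per the quote, is that the hard generative work happens in a compact motion space first, so the video stage only has to render pixels that follow those high-level face and body controls.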


Admittedly, the tech isn't perfect. In the examples given, the mouth movements have that telltale uncanny quality common across AI-generated video. It's also pretty creepy at times, as noted by users responding to a thread about the technology by EyeingAI on X. But VLOGGER doesn't need to fool everyone, or even anyone at all, to have some use. And if it were a more convincing technology, it would be even more worrying to think about how it could be used to create deepfakes, spread misinformation, or steal identities. We'll get there one day, and I for one hope we have a better handle on how to deal with this stuff by then.

