Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

ICLR 2024 (Spotlight)

¹Zhejiang University, ²ByteDance, ³HKUST(GZ)

Abstract

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image and then animate it with a reference video or audio to generate a talking portrait video. Existing methods fail to achieve accurate 3D avatar reconstruction and stable talking face animation at the same time. Moreover, existing works mainly focus on synthesizing the head, although natural torso and background segments are also vital for a realistic talking portrait.

To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with an image-to-plane (I2P) model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter (MA); (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution (HTB-SR) model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion (A2M) model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods.
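The data flow between the four components can be sketched roughly as follows. This is a minimal illustration inferred from the abstract only: the names `Stages` and `synthesize`, the callable signatures, and the exact ordering of the stages are assumptions made for this sketch, not the interfaces of the released model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Stages:
    """Hypothetical container for the four components named in the abstract."""
    i2p: Callable[[Any], Any]               # source image -> 3D representation
    a2m: Callable[[Any], List[Any]]         # audio -> sequence of motion conditions
    ma: Callable[[Any, Any], Any]           # (3D representation, motion) -> animated head
    htb_sr: Callable[[Any, Any, Any], Any]  # (animated head, source image, background) -> frame


def synthesize(
    stages: Stages,
    source_image: Any,
    motions: Optional[List[Any]] = None,
    audio: Optional[Any] = None,
    background: Optional[Any] = None,
) -> List[Any]:
    """Return one output frame per motion condition (assumed data flow)."""
    if motions is None:
        if audio is None:
            raise ValueError("provide either motions (video-driven) or audio (audio-driven)")
        # Audio-driven case: A2M turns audio into motion conditions.
        motions = stages.a2m(audio)
    # One-shot reconstruction: run I2P once for the unseen identity.
    representation = stages.i2p(source_image)
    frames = []
    for motion in motions:
        head = stages.ma(representation, motion)  # motion-conditioned animation
        frames.append(stages.htb_sr(head, source_image, background))
    return frames
```

In this sketch the video-driven and audio-driven settings differ only in where the motion sequence comes from, which is why both can share the same reconstruction and rendering stages.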

Demo 2: Comparison with VD/AD Baselines

Real3D-Portrait supports both video-driven and audio-driven talking face generation, so we compare it against state-of-the-art video-driven (VD) and audio-driven (AD) baselines in turn.

1. Video-Driven Comparison

We compare with Face-vid2vid (CVPR 2021 Oral), OTAvatar (CVPR 2023), and HiDe-NeRF (CVPR 2023). The comparison shows that Real3D-Portrait overcomes the challenges faced by these methods and achieves realistic one-shot talking video generation.

2. Audio-Driven Comparison

We compare with two one-shot methods, MakeItTalk (SIGGRAPH Asia 2020) and PC-AVS (CVPR 2021), as well as an identity-overfitted NeRF-based method, RAD-NeRF (arXiv 2023). We find that Real3D-Portrait achieves the best lip-sync among all tested baselines and produces the best visual quality among the one-shot methods, comparable even to the identity-overfitted RAD-NeRF.

Demo 3: Features of the HTB-SR Model

Thanks to the head-torso-background super-resolution (HTB-SR) model, Real3D-Portrait can generate realistic, high-fidelity video with natural torso movement and a switchable background.

1. Natural Torso Movement

With the warp-based torso branch of the HTB-SR model, Real3D-Portrait can generate natural torso movement even under large-scale motion.

2. Switchable Background

With the separate background branch of the HTB-SR model, Real3D-Portrait supports a switchable background, which is useful in video conferencing.
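One common way to make a background switchable is to keep the foreground (head and torso) separate from the background and blend them with a per-pixel alpha mask. The sketch below shows plain alpha compositing on tiny images stored as nested lists; it is an assumption for illustration only, since this page does not state how HTB-SR fuses its branches, and the name `composite` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[float, ...]   # RGB values in [0, 1]
Image = List[List[Pixel]]   # rows of pixels


def composite(foreground: Image, alpha: List[List[float]], background: Image) -> Image:
    """Blend per pixel: out = alpha * foreground + (1 - alpha) * background."""
    out: Image = []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        out.append([
            tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg))
            for fg, a, bg in zip(fg_row, a_row, bg_row)
        ])
    return out
```

Under this scheme, swapping the background only means passing a different `background` image while the foreground and mask stay unchanged.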

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate a talking video of any person without that person's consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you could be in violation of copyright laws.

BibTeX

@inproceedings{ye2024real3dportrait,
  author    = {Ye, Zhenhui and Zhong, Tianyun and Ren, Yi and Yang, Jiaqi and Li, Weichuang and Huang, Jiangwei and Jiang, Ziyue and He, Jinzheng and Huang, Rongjie and Liu, Jinglin and Zhang, Chen and Yin, Xiang and Ma, Zejun and Zhao, Zhou},
  title     = {Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis},
  booktitle = {ICLR},
  year      = {2024},
}