Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

ICLR 2024 (Spotlight)

¹Zhejiang University, ²ByteDance, ³HKUST(GZ)


One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image and then animate it with a reference video or audio to produce a talking portrait video. Existing methods fail to simultaneously achieve accurate 3D avatar reconstruction and stable talking face animation. Moreover, while existing works mainly focus on synthesizing the head, it is also vital to generate a natural torso and background to obtain a realistic talking portrait.

To address these limitations, we present Real3D-Portrait, a framework that (1) improves one-shot 3D reconstruction with an image-to-plane (I2P) model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter (MA); (3) synthesizes realistic video with natural torso movement and a switchable background using a head-torso-background super-resolution (HTB-SR) model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion (A2M) model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos than previous methods.
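To make the division of labor between the four modules concrete, here is a minimal shape-level sketch of their interfaces. All function names, tensor shapes, and the tri-plane layout are illustrative assumptions for exposition, not the authors' released API; each body is a placeholder where a neural network would run.

```python
import numpy as np

# Assumed tri-plane layout: 3 planes, 32 channels, 256x256 spatial size
# (an assumption for illustration, not the paper's exact configuration).
PLANE = (3, 32, 256, 256)

def image_to_plane(src_img):
    """I2P: reconstruct a canonical tri-plane 3D avatar from one image."""
    assert src_img.shape == (3, 512, 512)
    return np.zeros(PLANE)  # placeholder for the predicted canonical plane

def audio_to_motion(audio_feats):
    """A2M: map per-frame audio features to PNCC motion maps."""
    T = audio_feats.shape[0]
    return np.zeros((T, 3, 256, 256))  # one PNCC image per frame

def motion_adapter(pncc):
    """MA: turn one PNCC frame into a residual 'motion diff-plane'."""
    return np.zeros(PLANE)

def htb_sr(head_lr, torso_src, bg_src):
    """HTB-SR: fuse head, torso, and background; upsample to 512x512."""
    assert head_lr.shape == (3, 128, 128)
    return np.zeros((3, 512, 512))
```

The key design point is that the canonical plane (identity) and the motion diff-plane (expression/pose) share the same layout, so animation reduces to adding the two before rendering.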

Overall Pipeline

The inference process of Real3D-Portrait is shown below.

Demo 1: One-Shot Realistic 3D Talking Face Generation

To show the overall performance of Real3D-Portrait, the following video provides demos of 10 unseen identities driven by 6 audio clips in various languages.

Demo 2: Comparison with VD/AD Baselines

Real3D-Portrait supports both video/audio-driven talking face generation. Hence we compare it against state-of-the-art video-driven (VD) and audio-driven (AD) baselines, respectively.

1. Video-Driven Comparison

We compare with Face-vid2vid (CVPR 2021 Oral), OTAvatar (CVPR 2023), and HiDe-NeRF (CVPR 2023). The comparison shows that Real3D-Portrait overcomes the challenges faced by these methods and achieves one-shot realistic talking video generation.

2. Audio-Driven Comparison

We compare with two one-shot methods, MakeItTalk (SIGGRAPH Asia 2020) and PC-AVS (CVPR 2021), as well as an identity-overfitted NeRF-based method, RAD-NeRF (arXiv 2022). Real3D-Portrait achieves the best lip-sync among all tested baselines and the best visual quality among the one-shot methods, even comparable to the identity-overfitted RAD-NeRF.

Demo 3: Features of the HTB-SR Model

With the head-torso-background super-resolution (HTB-SR) model, Real3D-Portrait can generate realistic, high-fidelity video with natural torso movement and a switchable background.

1. Natural Torso Movement

With the warp-based torso branch of the HTB-SR model, Real3D-Portrait can generate natural torso movement even under large-scale motion.

2. Switchable Background

With the separate background branch of the HTB-SR model, Real3D-Portrait supports switching the background, which is useful in video conferencing.

Demo 4: How Real3D-Portrait Generates the Final Video

In the following video, we show how the final video is generated. (1) The first column is the source image, which the image-to-plane model processes to reconstruct a 3D avatar (the canonical plane). (2) The second column is the motion representation, PNCC, predicted by the audio-to-motion model from the input audio. (3) The motion adapter then converts the PNCC into a motion diff-plane; volume rendering the sum of the canonical plane and the motion diff-plane yields the third column, a 128x128 head image. (4) We also visualize the depth map of the 3D head. (5) Finally, the head image and the source torso/background image are fed into the head-torso-background super-resolution model to produce the final 512x512 video (the fourth column).
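The five steps above can be sketched as a per-frame inference loop. The stand-in modules below (lambdas and `volume_render`) are assumptions that only reproduce the data flow and tensor shapes, not the released Real3D-Portrait code; the camera argument is also a hypothetical placeholder.

```python
import numpy as np

PLANE = (3, 32, 256, 256)  # assumed tri-plane layout

# Stand-in modules (illustrative assumptions, not the released models):
image_to_plane = lambda img: np.zeros(PLANE)                  # I2P
audio_to_motion = lambda a: np.zeros((len(a), 3, 256, 256))   # A2M -> PNCC
motion_adapter = lambda pncc: np.zeros(PLANE)                 # MA -> diff-plane
htb_sr = lambda head, torso, bg: np.zeros((3, 512, 512))      # HTB-SR

def volume_render(plane, cam):
    """Stand-in renderer: returns a 128x128 head image and a depth map."""
    return np.zeros((3, 128, 128)), np.zeros((128, 128))

def generate_video(src_img, audio_feats, cams):
    canonical = image_to_plane(src_img)          # (1) reconstruct 3D avatar
    pnccs = audio_to_motion(audio_feats)         # (2) audio -> PNCC motion
    frames = []
    for pncc, cam in zip(pnccs, cams):
        diff = motion_adapter(pncc)              # (3) motion diff-plane
        head_lr, depth = volume_render(canonical + diff, cam)  # (3)/(4)
        frames.append(htb_sr(head_lr, src_img, src_img))       # (5) 512x512
    return np.stack(frames)                      # (T, 3, 512, 512) video
```

Note that only steps (3)–(5) run per frame; the canonical plane from step (1) is reconstructed once per identity and reused for every frame.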


Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's talking video without his or her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.


@article{ye2024real3d,
    author    = {Ye, Zhenhui and Zhong, Tianyun and Ren, Yi and Yang, Jiaqi and Li, Weichuang and Huang, Jiangwei and Jiang, Ziyue and He, Jinzheng and Huang, Rongjie and Liu, Jinglin and Zhang, Chen and Yin, Xiang and Ma, Zejun and Zhao, Zhou},
    title     = {Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis},
    journal   = {ICLR},
    year      = {2024},
}