Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

Zhibin Liu, Haoye Dong, Aviral Chharia, Hefeng Wu
Sun Yat-sen University · Carnegie Mellon University

arXiv 2024


We propose Human-VDM, a novel 3D human generation framework. Given a single RGB human image, Human-VDM generates a high-fidelity 3D human model that preserves facial identity, delivers realistic texture, ensures accurate geometry, and maintains a valid pose, surpassing current state-of-the-art models.

Abstract

Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often suffer from view inconsistency, which hinders high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D humans from a single RGB image using video diffusion models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into the human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and frame interpolation to enhance the texture and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns a lifelike human under the guidance of these high-resolution, view-consistent frames. Experiments demonstrate that Human-VDM generates high-quality 3D humans from a single image, outperforming state-of-the-art methods in both qualitative and quantitative evaluations. The code will be released upon acceptance.

Human-VDM Model Architecture

An input image I is first fed into the view-consistent human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and frame interpolation to enhance texture and produce high-quality intermediate frames. Finally, the 3D Human Gaussian Splatting module learns a lifelike 3D human from the augmented frames.
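To make the three-stage data flow concrete, here is a minimal, runnable Python sketch. The actual diffusion, super-resolution, interpolation, and splatting networks are replaced by trivial stand-ins (tiling, bicubic upsampling, linear blending); only the staging mirrors the pipeline described above, and all function names are our own assumptions, not the released implementation.

    # Minimal sketch of the Human-VDM pipeline's data flow.
    # The real modules are replaced by trivial stand-ins; only the
    # three-stage structure is faithful to the paper's description.

    import torch
    import torch.nn.functional as F

    def human_video_diffusion(image, num_views=24):
        # Stand-in: the real module is a video diffusion model that renders
        # a view-consistent orbit of the subject. Here we just tile the input.
        return image.unsqueeze(0).repeat(num_views, 1, 1, 1)   # (T, 3, H, W)

    def super_resolve(frames, scale=2):
        # Stand-in for the super-resolution network: bicubic upsampling.
        return F.interpolate(frames, scale_factor=scale, mode="bicubic",
                             align_corners=False)

    def interpolate_frames(frames):
        # Stand-in for learned frame interpolation: insert a blended frame
        # between each consecutive pair to densify the view sampling.
        mids = 0.5 * (frames[:-1] + frames[1:])
        out = torch.empty(2 * len(frames) - 1, *frames.shape[1:])
        out[0::2], out[1::2] = frames, mids
        return out

    image = torch.rand(3, 256, 256)              # a single RGB human image
    frames = interpolate_frames(super_resolve(human_video_diffusion(image)))
    print(frames.shape)                          # densified, upscaled views

The densified, view-consistent frames then serve as supervision for the Gaussian Splatting stage, sketched after the contributions list below.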


Contributions

  • We propose a novel single-view 3D human generation framework that leverages the human video diffusion model to produce view-consistent human frames.

  • We carefully design a video augmentation module, consisting of super-resolution and frame interpolation, to enhance the quality of the generated video.

  • We introduce an effective Gaussian Splatting framework with offset prediction for 3D human reconstruction (see the sketch after this list).

  • Extensive experiments demonstrate that the proposed Human-VDM generates realistic 3D humans from single-view images, outperforming state-of-the-art methods both qualitatively and quantitatively.
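The offset-prediction idea can be illustrated with a short sketch: Gaussian centers are anchored to a human body template (e.g., vertices of a parametric body model) and a learned per-point offset is added on top, with the Gaussians optimized against the augmented video frames. This is a minimal sketch under our own assumptions; OffsetGaussians, render, and the regularization weight are hypothetical and may differ from the paper's actual parameterization.

    # Hedged sketch: Gaussians anchored to template points plus a learned
    # offset, supervised by the augmented video frames. The differentiable
    # rasterizer `render` is a placeholder, not a real library call.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OffsetGaussians(nn.Module):
        def __init__(self, template_xyz):            # (N, 3) template points
            super().__init__()
            n = template_xyz.shape[0]
            self.register_buffer("template", template_xyz)
            self.offset = nn.Parameter(torch.zeros(n, 3))        # predicted offsets
            self.log_scale = nn.Parameter(torch.full((n, 3), -4.0))
            self.rotation = nn.Parameter(
                torch.tensor([1.0, 0.0, 0.0, 0.0]).repeat(n, 1))  # identity quats
            self.opacity = nn.Parameter(torch.zeros(n, 1))        # pre-sigmoid
            self.color = nn.Parameter(torch.rand(n, 3))

        def centers(self):
            # Final Gaussian centers = template vertices + learned offsets.
            return self.template + self.offset

    def training_step(model, render, frame, camera, optimizer):
        pred = render(model, camera)                 # placeholder rasterizer
        loss = F.l1_loss(pred, frame)                # photometric supervision
        loss = loss + 1e-2 * model.offset.square().mean()  # keep offsets small
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Anchoring centers to a template and regularizing the offsets keeps the reconstruction close to a plausible body shape while still letting the Gaussians capture clothing and hair details beyond the template surface.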


Ablation Study

We conduct comprehensive ablation studies to validate the effectiveness of each component of our model.


Quantitative Comparison with State-of-the-Art Models

Quantitative comparison of the proposed Human-VDM with recent state-of-the-art models on the THuman 2.0 dataset.




Human-VDM Quantitative Comparison & User Study

Qualitative Results and Comparisons

Qualitative visual results of Human-VDM and comparisons with state-of-the-art models, including in-the-wild testing.

Single-view 3D Avatar Results

BibTeX

@misc{liu2024humanvdmlearningsingleimage3d,
      title={Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models}, 
      author={Zhibin Liu and Haoye Dong and Aviral Chharia and Hefeng Wu},
      year={2024},
      eprint={2409.02851},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.02851}, 
}