This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data, which is common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation.
Social media platforms have revolutionized how people interact with dance content, with TikTok as a prominent example. However, skeletal data extracted from such videos often suffers from missing joints, occlusions, and noise, which traditional motion capture pipelines handle poorly. DanceFusion addresses these issues by reconstructing missing or noisy motion sequences while maintaining musical synchronization, generating realistic and engaging dance motions applicable to content creation, gaming, and virtual avatars.
In this paper, we present the DanceFusion framework, which integrates a hierarchical Spatio-Temporal Transformer-based Variational Autoencoder (VAE) with a diffusion model to achieve robust motion reconstruction and audio-driven dance motion generation. The input to the model is a sequence of skeleton joints extracted from TikTok dance videos, often containing missing or noisy data. The framework refines these skeleton sequences through a series of denoising steps, ensuring that the generated or reconstructed motion is temporally consistent and aligned with the accompanying audio.
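To make that input concrete, the following is a minimal sketch of one plausible way to represent a noisy skeleton sequence together with a per-joint validity mask; the joint count, the (x, y, confidence) channel layout, and the confidence threshold are illustrative assumptions rather than the paper's exact data format.

```python
# Illustrative sketch: representing a noisy skeleton sequence and its validity mask.
# The joint count (J = 17, COCO-style) and channel layout are assumptions for clarity,
# not the exact data format used by DanceFusion.
import torch

T, J = 120, 17                      # frames, joints per frame (assumed)
joints = torch.randn(T, J, 3)       # (x, y, confidence) per joint, stand-in for pose-estimator output
mask = joints[..., 2] > 0.3         # binary validity mask: True where the joint is reliably detected

# Missing / low-confidence joints are zeroed so they cannot leak into spatial encoding.
joints = joints * mask.unsqueeze(-1)
print(joints.shape, mask.shape)     # torch.Size([120, 17, 3]) torch.Size([120, 17])
```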
The DanceFusion framework introduces a hierarchical Transformer-based Variational Autoencoder (VAE) that integrates spatio-temporal encoding to capture the spatial and temporal structure inherent in skeleton sequences. Unlike vision Transformers such as ViViT, which tokenize fixed grid patches of video frames, we treat each skeleton joint as a token, so that the sequence of joints over time forms a spatio-temporal grid. Each joint is encoded based on its spatial relationship to the other joints, and the temporal sequence of these joint positions is fed into the Transformer for processing.
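The sketch below illustrates this joint-as-token scheme with learned spatial (per-joint) and temporal (per-frame) embeddings feeding a standard PyTorch Transformer encoder, assuming a three-channel joint representation like the one sketched above. The layer sizes, the single-level encoder, and the omission of the VAE latent head are simplifications, not the paper's exact hierarchical architecture.

```python
import torch
import torch.nn as nn

class SkeletonTokenizer(nn.Module):
    """Treats every (frame, joint) position as one token with spatio-temporal embeddings.

    Dimensions and the single-encoder layout are illustrative; the paper's model is
    hierarchical and also carries a VAE latent head, omitted here for brevity.
    """
    def __init__(self, num_joints=17, coord_dim=3, d_model=128, max_frames=512):
        super().__init__()
        self.joint_proj = nn.Linear(coord_dim, d_model)          # lift raw coordinates to token space
        self.spatial_emb = nn.Embedding(num_joints, d_model)     # which joint (spatial identity)
        self.temporal_emb = nn.Embedding(max_frames, d_model)    # which frame (temporal position)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, joints):                                   # joints: (B, T, J, coord_dim)
        B, T, J, _ = joints.shape
        tok = self.joint_proj(joints)                            # (B, T, J, d_model)
        tok = tok + self.spatial_emb(torch.arange(J, device=joints.device))
        tok = tok + self.temporal_emb(torch.arange(T, device=joints.device)).unsqueeze(1)
        tok = tok.reshape(B, T * J, -1)                          # flatten the spatio-temporal grid into tokens
        return self.encoder(tok)                                 # (B, T*J, d_model)
```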
Diffusion models have gained prominence in the generation of complex data sequences, demonstrating exceptional ability in synthesizing realistic and contextually accurate motion sequences. In DanceFusion, we leverage diffusion models to refine skeleton sequences and synchronize dance movements with audio inputs, thus enhancing both the visual quality and the auditory alignment of generated dance sequences.
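As a rough illustration of how such refinement can be conditioned on audio, the sketch below implements a standard DDPM-style reverse step over a motion latent, with the audio feature injected by simple concatenation. The MLP denoiser, the linear beta schedule, and the conditioning scheme are generic assumptions and not DanceFusion's exact formulation.

```python
import torch
import torch.nn as nn

# Generic DDPM-style reverse process over a motion latent, conditioned on audio features.
# The MLP denoiser and linear beta schedule are placeholders for the paper's
# Transformer-based denoiser; only the update rule follows the standard DDPM form.
STEPS = 1000
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, motion_dim=128, audio_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, audio_feat, t):
        # Concatenate noisy motion latent, audio condition, and normalized timestep.
        t_emb = t.float().unsqueeze(-1) / STEPS
        return self.net(torch.cat([x_t, audio_feat, t_emb], dim=-1))  # predicted noise

@torch.no_grad()
def reverse_step(model, x_t, audio_feat, t):
    """One denoising step x_t -> x_{t-1}, following the DDPM posterior mean."""
    eps = model(x_t, audio_feat, torch.full((x_t.shape[0],), t))
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)

# Usage: iterate t = STEPS-1 ... 0 from pure noise to obtain a refined motion latent.
model = AudioConditionedDenoiser()
x = torch.randn(4, 128)                 # batch of noisy motion latents
audio = torch.randn(4, 64)              # assumed per-sequence audio embedding
for t in reversed(range(STEPS)):
    x = reverse_step(model, x, audio, t)
```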
One of the major challenges in motion reconstruction is handling incomplete skeleton data, where certain joints may be missing due to occlusions or sensor noise. In the DanceFusion framework, this is addressed through a masking mechanism applied during the encoding phase. Each joint in the input sequence is either present or missing, and this information is captured in a binary mask. The mask prevents the model from considering missing joints during the spatial encoding process, ensuring that only reliable joints contribute to the final representation.
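One way to realize such a mask inside a Transformer encoder is to convert per-joint validity into a key-padding mask so that missing joints are excluded from attention. The sketch below assumes the flattened joint-token layout from the earlier example and standard PyTorch masking semantics; it is an interpretation of the mechanism described above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Sketch: excluding missing joints from attention via a key-padding mask.
# Assumes joint tokens of shape (B, T*J, d_model) and a per-joint validity mask (B, T, J).
d_model = 128
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

B, T, J = 2, 16, 17
tokens = torch.randn(B, T * J, d_model)          # spatio-temporal joint tokens
valid = torch.rand(B, T, J) > 0.2                # True where the joint was detected

# PyTorch expects True for positions that should be IGNORED, so invert the validity mask.
key_padding_mask = ~valid.reshape(B, T * J)

# Zero out missing-joint tokens as well, so unreliable coordinates cannot leak
# through residual connections.
tokens = tokens * valid.reshape(B, T * J, 1)

encoded = encoder(tokens, src_key_padding_mask=key_padding_mask)
print(encoded.shape)                             # torch.Size([2, 272, 128])
```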
@article{dancefusion2024,
  author  = {Zhao, Li and Lu, Zhengmin},
  title   = {DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction},
  journal = {arXiv preprint arXiv:2411.04646},
  year    = {2024},
}