Accelerating Vision Diffusion Transformers with Skip Branches (2024)

Guanjie Chen¹*, Xinyu Zhao²*, Yucheng Zhou³, Tianlong Chen²†, Cheng Yu⁴†
*Equal contribution. †Corresponding authors.

1Shanghai Jiao Tong University 2The University of North Carolina at Chapel Hill
3SKL-IOTSC, CIS, University of Macau 4The Chinese University of Hong Kong
chenguanjie@sjtu.edu.cn, xinyu@cs.unc.edu

Abstract

Diffusion Transformers (DiT), an emerging architecture for image and video generation, have demonstrated great potential owing to their high generation quality and scalability. Despite this impressive performance, their practical deployment is constrained by the computational complexity and redundancy of the sequential denoising process. While feature caching across timesteps has proven effective for accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through an empirical analysis of DiT feature dynamics, we identify that large feature variation between DiT blocks is a key obstacle to feature reusability. To address this, we convert standard DiT into Skip-DiT, which adds skip branches to enhance feature smoothness. We further introduce Skip-Cache, which uses these skip branches to cache DiT features across timesteps at inference time. We validate the effectiveness of our proposal on different DiT backbones for video and image generation, showing that skip branches help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5× speedup almost for free and a 2.2× speedup with only a minor reduction in quantitative metrics. Code is available at https://github.com/OpenSparseLLMs/Skip-DiT.git.

[Figure 1]

1 Introduction

Diffusion models[9, 3, 24, 46] have emerged as the de-facto solution for visual generation, owing to their high fidelity outputs and ability to incorporate various conditioning signals, particularly natural language. Classical diffusion models, which adopt U-Net[27] as their denoising backbone, have dominated image and video generation applications. More recently, Diffusion Transformers (DiT)[4, 23] have introduced an alternative architecture that replaces traditional sequential convolutional networks with Vision Transformers, offering enhanced scalability potential. While initially designed for image generation, DiT has demonstrated remarkable effectiveness when extended to video generation tasks[19, 16, 25]. However, despite these advances, significant challenges remain in scaling diffusion models efficiently, particularly for applications involving large numbers of input tokens such as video generation. This scaling challenge is especially pronounced in DiT architectures, where the attention mechanism’s computational complexity grows quadratically with input size. The magnitude of this challenge is illustrated by Sora[20], a state-of-the-art video generation model that remains unavailable to the public.

Numerous approaches have been proposed to enhance the efficiency of diffusion models, including reduced sampling techniques[33], distillation methods[41, 29], and quantization strategies[5]. Caching mechanisms, which reuse noise latents across timesteps, have emerged as a particularly promising direction as they do not require extensive model retraining[18, 15, 38, 47, 6]. However, many existing caching approaches[18, 38] are specifically tailored to U-Net architectures, leveraging their unique structural properties—particularly the skip connections between paired downsampling and upsampling blocks that enable high-level feature reuse while selectively updating low-level features. While some recent studies[47, 6] have attempted to adapt caching mechanisms for DiT acceleration, they have not achieved the same level of efficiency gains and performance preservation as their U-Net counterparts.

[Figure 2]

To understand the key challenges of feature caching in DiT, we analyze the feature dynamics during the denoising process. Drawing inspiration from loss-landscape visualization techniques[13, 11], we examine feature changes using the early and late denoising timesteps as a case study. For effective caching, features should exhibit minimal variation between timesteps, allowing us to reuse features from previous steps and bypass computation in subsequent Transformer blocks. We term this property "feature smoothness", which manifests as flatness in the landscape visualization. However, as illustrated in Figure 2, vanilla DiT (w/o skip branches) exhibits high feature variance across timesteps, contrary to the characteristics desired for effective caching. Thus, we ask: can the feature smoothness of DiT be improved so that cross-timestep feature caching becomes effective?

Drawing from the findings in[13], where residual connections are shown to mitigate sharp loss landscapes, in this study ❶ we first conduct preliminary experiments that add skip branches from shallow to deep blocks of a pre-trained DiT model, yielding Skip-DiT, which achieves significantly improved feature smoothness with minimal continual pre-training, as demonstrated in Figure 2 (w/ skip branches). ❷ We then leverage these skip branches during inference to implement an efficient caching mechanism, Skip-Cache, where only the first DiT block's output needs to be computed for subsequent timesteps while deep block outputs are cached and reused. ❸ To evaluate our proposal, we conduct extensive experiments across multiple DiT backbones, covering image and video generation as well as class-conditioned and text-conditioned generation. We demonstrate that Skip-DiT consistently outperforms both dense baselines and existing caching mechanisms in qualitative and quantitative evaluations. Our contributions are three-fold:

  • We identify feature smoothness as a critical factor limiting the effectiveness of cross-timestep feature caching in DiT, providing a better understanding of caching efficiency.

  • We build Skip-DiT, a skip-branch-augmented DiT architecture that enhances feature smoothness, and Skip-Cache, which efficiently leverages the skip branches to cache features across timesteps. Skip-Cache enables accelerated inference via caching while maintaining visual generation quality.

  • Extensive empirical evaluations demonstrate that Skip-Cache achieves substantial acceleration: up to a 1.5× speedup almost for free and a 2.2× speedup with only a minor reduction in quantitative metrics.

2 Related Works

Transformer-based Diffusion Models

Diffusion models have become the dominant architecture for image and video generation; their main idea is to iteratively generate high-fidelity images or video frames from noise[26]. Early diffusion models mainly employ U-Net as the denoising backbone[24, 3]. However, U-Net architectures struggle to model long-range dependencies due to the local nature of convolutions, which motivated the Diffusion Transformer (DiT) for image generation[4, 2, 23]. Recent years have witnessed significant growth in studies of video DiT. Proprietary DiTs such as Sora[20] and Movie Gen[25] show impressive generation quality, as do open-source implementations[48, 12]. Latte decomposes the spatial and temporal dimensions into four efficient variants for handling video tokens, allowing effective modeling of the large number of tokens extracted from videos in the latent space[19]. CogVideoX adds a 3D VAE combined with an expert transformer using adaptive LayerNorm, enabling the generation of longer, higher-resolution videos[40]. However, as the number of tokens grows rapidly with video length and spatial resolution, the computational complexity of DiT, especially of the self-attention mechanism, remains a significant bottleneck for video generation.

Diffusion Acceleration with Feature Caching

Since diffusion models involve iterative denoising, caching features across timesteps, model layers, and modules has proven an effective way to reduce inference cost. For U-Net diffusion, DeepCache[18] and FRDiff[32] exploit temporal redundancy by reusing features across adjacent denoising steps, while other works take a more structured approach by analyzing and caching specific architectural components: Faster Diffusion[15] targets encoder feature reuse while enabling parallel decoder computation, and Block Caching[38] introduces automated caching schedules for different network blocks based on their temporal stability patterns. Recently, cache-based acceleration has also been applied to DiT. PAB[47] introduces a pyramid broadcasting strategy for attention outputs. Δ-DiT[6] adaptively caches different DiT blocks based on their roles in generation: rear blocks during early sampling for details and front blocks during later stages for outlines. T-Gate[45] identifies a natural two-stage inference process, enabling the caching and reuse of text-oriented semantic features after the initial semantics-planning stage. While these caching techniques have shown promise, they are primarily limited to inference-time optimization, and there remains significant potential for improving their acceleration factors.

3 Methodology

[Figure 3]

3.1 Preliminaries

Diffusion model

The concept of diffusion models mirrors particle-dispersion physics, where particles spread out through random motion. It involves forward and backward diffusion. The forward phase adds noise to data across $T$ timesteps. Starting from data $\mathbf{x}_0 \sim q(\mathbf{x})$, noise is added at each timestep $t \in \{1, \ldots, T\}$:

$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}$  (1)

where $\alpha_t$ determines the noise level and $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ is Gaussian noise. The data $\mathbf{x}_t$ becomes increasingly noisy over time, approaching $\mathcal{N}(0, \mathbf{I})$ at $t = T$. Reverse diffusion then reconstructs the original data as follows, where $\mu_\theta$ and $\Sigma_\theta$ refer to the learnable mean and covariance:

$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\big)$  (2)
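For concreteness, the forward process can be sketched as below. The per-step form mirrors Eq. (1); the closed-form variant uses the standard cumulative-product identity that practical implementations typically rely on. Function names are ours, not from any released codebase.

```python
import torch

def forward_diffuse_step(x_prev, alpha_t):
    """One step of Eq. (1): x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = torch.randn_like(x_prev)
    return (alpha_t ** 0.5) * x_prev + ((1.0 - alpha_t) ** 0.5) * eps

def forward_diffuse_closed_form(x0, alphas, t):
    """Standard closed form (not stated above, but equivalent):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of the per-step alphas."""
    alpha_bar_t = torch.cumprod(alphas, dim=0)[t]
    eps = torch.randn_like(x0)
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * eps
```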

Diffusion Transformer

In our study, we consider two types of DiT models processing different visual modalities: image DiT and video DiT, as illustrated in Figure 3 (a). ❶ The image DiT follows the Vision Transformer: each block contains self-attention, cross-attention, and a feed-forward network (FFN), where the cross-attention module incorporates text and timestep conditions. ❷ The video DiT adopts a dual-subblock architecture: a spatial Transformer subblock processes tokens within the same frame, while a temporal subblock models cross-frame relationships. These subblocks alternate in sequence, with cross-attention integrating conditional inputs. A complete video DiT block pairs one spatial and one temporal component in an interleaved pattern, following[19, 47].
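To make the dual-subblock structure concrete, the following PyTorch sketch alternates spatial and temporal attention over a (batch, frames, tokens, dim) tensor. The class name, the use of nn.MultiheadAttention, and the omission of cross-attention, FFNs, and conditioning are our simplifications for illustration, not Latte's actual implementation.

```python
import torch
import torch.nn as nn

class VideoDiTBlockSketch(nn.Module):
    """Illustrative spatial + temporal sub-block pair (simplified)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, n, d = x.shape
        # Spatial sub-block: attend over the tokens of each frame independently.
        xs = x.reshape(b * f, n, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        x = xs.reshape(b, f, n, d)
        # Temporal sub-block: attend across frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)
```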

3.2 Visualizing the Feature Smoothness of DiT

Section 1 introduces the concept of feature smoothness and provides an intuitive explanation. In this section, we detail the feature-smoothness visualization and analysis.

To use this approach, we denote the original parameters of the base model as $\theta^*$ and choose two random direction vectors, $\delta$ and $\eta$, each sharing the same dimensionality as $\theta^*$. These directions are first normalized with respect to the parameters they correspond to. We then use $\delta$ and $\eta$ to perturb the model with strength coefficients $\alpha$ and $\beta$, obtaining a new model with parameters $\theta'$:

$\theta' = \theta^* + \alpha\delta + \beta\eta$  (3)

We denote the predicted noise after $k$ denoising steps, before and after adding the perturbation, as $\bm{x}_{\theta^*}^{k}$ and $\bm{x}_{\theta'}^{k}$, respectively. We then define the feature difference as:

$L(\theta^*) = \dfrac{\bm{x}_{\theta^*}^{k} \cdot \bm{x}_{\theta'}^{k}}{\|\bm{x}_{\theta^*}^{k}\|\,\|\bm{x}_{\theta'}^{k}\|}$  (4)

We then plot the feature-difference surface according to the 2-D function:

$f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta)$  (5)

This approach was employed in [8] and [14], where $L(\theta^*)$ represents the loss of a model with parameters $\theta^*$, to analyze the trajectories of various minimization methods and model structures. Similarly, [11, 13] utilized this approach to demonstrate that different optimization algorithms converge to distinct local minima in the 2-D projected space.
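The procedure of Eqs. (3)-(5) can be sketched as follows. Here `run_k_steps` is an assumed helper that returns the predicted noise after $k$ denoising steps for a fixed input and random seed, and the per-parameter rescaling follows the normalization idea of [13]; this is an illustrative sketch, not the exact script used for Figure 2.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_landscape(model, run_k_steps, alphas, betas):
    """Sketch of Eqs. (3)-(5): cosine similarity between the base model's
    k-step noise prediction and that of a perturbed model, on an (alpha, beta) grid."""
    base = {n: p.detach().clone() for n, p in model.named_parameters()}
    # Two random directions, rescaled to the norm of each parameter (cf. [13]).
    delta = {n: torch.randn_like(p) for n, p in base.items()}
    eta = {n: torch.randn_like(p) for n, p in base.items()}
    for direction in (delta, eta):
        for n, p in base.items():
            direction[n] *= p.norm() / (direction[n].norm() + 1e-12)

    x_ref = run_k_steps(model)                       # x_{theta*}^k
    surface = torch.zeros(len(alphas), len(betas))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            for n, p in model.named_parameters():    # theta' = theta* + a*delta + b*eta
                p.copy_(base[n] + a * delta[n] + b * eta[n])
            x_pert = run_k_steps(model)              # x_{theta'}^k
            surface[i, j] = F.cosine_similarity(
                x_ref.flatten(), x_pert.flatten(), dim=0)
    for n, p in model.named_parameters():            # restore original weights
        p.copy_(base[n])
    return surface                                   # flat surface = smooth features
```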

3.3 Skip-DiT: Improving Feature Smoothness

We visualize the feature smoothness of vanilla DiT, trained on and generating from the Taichi dataset, in Figure 2 (w/o skip branches). We observe drastic feature changes at the two DDPM steps shown, indicating that they are not ideal states for caching features, which shrinks the room for caching. Motivated by the insights of[13] and by diffusion-caching work that exploits U-Net residual-connection features[18], we next investigate whether DiT feature smoothness can be improved by a minimal structural modification that injects a residual property, i.e., skip branches.

A vanilla DiT can be converted into Skip-DiT by connecting shallow blocks to deep blocks with skip branches, as shown in Figure 3 (b). Let $\bm{x}$ denote the input noise embedding and $\bm{x}'_l$ the output of the $l$-th layer of Skip-DiT. The architecture consists of $L$ sequential DiT blocks with skip connections. Each DiT block $\mathcal{F}_{\text{DiT}}^{l}$ processes features as $\bm{x}' = \mathcal{F}_{\text{DiT}}^{l}(\bm{x})$. The $i$-th skip branch ($i \in \{1, \ldots, L//2\}$), denoted $\mathcal{F}_{\text{Skip}}^{i}(\cdot, \cdot)$, connects the $i$-th block to the $(L+1-i)$-th block. Given the output $\bm{x}'_i$ from the start of the skip branch and $\bm{x}'_{l-1}$ from the previous layer, the skip branch aggregates them into the input of the $l$-th block as:

$\bm{x}_l = \mathcal{F}_{\text{Skip}}^{i}(\bm{x}'_i, \bm{x}'_{l-1}) = \texttt{Linear}\big(\texttt{Norm}(\bm{x}'_i \oplus \bm{x}'_{l-1})\big)$  (6)

where $\oplus$ denotes concatenation, Norm is layer normalization, and Linear is a fully-connected layer. The final output $\bm{x}'_L$ of Skip-DiT is the processed noise prediction. Each skip branch creates a shortcut path that helps preserve and propagate information from earlier layers, enabling better gradient flow and feature reuse throughout the network. The combination of DiT blocks and skip branches allows the model to effectively learn the underlying noise distribution while maintaining stable training dynamics.
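A minimal PyTorch sketch of Eq. (6) and the resulting Skip-DiT forward pass is shown below, assuming an even or odd number of blocks with the pairing described above. Block internals, conditioning, and module names are placeholders of ours, not the released implementation.

```python
import torch
import torch.nn as nn

class SkipBranch(nn.Module):
    """Eq. (6): fuse the shallow feature with the previous layer's output
    via concatenation, LayerNorm, and a linear projection back to dim."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x_shallow, x_prev):
        # x_shallow: output of block i, x_prev: output of block l-1; both (B, N, D)
        return self.proj(self.norm(torch.cat([x_shallow, x_prev], dim=-1)))


class SkipDiTSketch(nn.Module):
    """Skip branch i connects block i to block L + 1 - i."""
    def __init__(self, blocks, dim):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)                       # L DiT blocks
        self.skips = nn.ModuleList([SkipBranch(dim) for _ in range(len(blocks) // 2)])

    def forward(self, x):
        L = len(self.blocks)
        shallow = []                                              # outputs of the shallow half
        for l, block in enumerate(self.blocks, start=1):
            if l > L - L // 2:                                    # deep half: skip branch i = L + 1 - l
                x = self.skips[L - l](shallow[L - l], x)
            x = block(x)
            if l <= L // 2:
                shallow.append(x)
        return x
```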

Similarly, we initialize a class-to-video DiT with skip branches, train it from scratch on the Taichi dataset, and visualize its feature smoothness in Figure 2 (w/ skip branches). The results show that Skip-DiT has a much flatter feature-change landscape at both the early and final timesteps, which justifies our choice of skip connections to enhance DiT feature smoothness.

3.4 Skip-Cache: Caching with Skip Branches

Having shown that Skip-DiT exhibits better feature smoothness, we exploit its feature stability and skip branches to implement an efficient DiT caching scheme, Skip-Cache. A full forward pass at global timestep $t$ can be expressed as:

$\bm{x}'^{\,t} = \mathcal{F}_{\text{DiT}}^{L}\big(\mathcal{F}_{\text{DiT}}^{L-1}(\ldots \mathcal{F}_{\text{DiT}}^{1}(\bm{x}))\big)$  (7)

Consider timesteps $t-1$ and $t$, where the model generates $\bm{x}^{t-1}$ conditioned on $\bm{x}^{t}$. During timestep $t$, we cache the output of the second-to-last block as:

$\mathcal{C}^{t}_{L-1} = \bm{x}_{L-1}^{t}$  (8)

For the local timestep $t-1$, with the cached feature $\mathcal{C}^{t}_{L-1}$ and the first skip branch $\mathcal{F}_{\text{Skip}}^{1}(\cdot, \cdot)$, the inference process can be formulated as:

$\bm{x}'^{\,t-1} = \mathcal{F}_{\text{DiT}}^{L}\big(\mathcal{F}_{\text{Skip}}^{1}(\bm{x}'^{\,t-1}_{1}, \mathcal{C}^{t}_{L-1})\big)$  (9)

At each such local timestep, only the 1st and $L$-th blocks are executed, while the cached feature from the previous global timestep is reused through the skip branch, significantly reducing computational overhead while maintaining generation quality. This extends to a 1:$N$ inference pattern, where $\mathcal{C}^{t}_{L-1}$ is reused for the next $N-1$ timesteps.
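The 1:$N$ schedule can be sketched as follows. Here `model.embed`, `model.unembed`, and `model.scheduler_step` are assumed helpers standing in for patchification/conditioning, the output head, and one reverse-diffusion update, and the block/skip layout follows the sketch in Section 3.3; the released code may organize this differently.

```python
import torch

@torch.no_grad()
def skip_cache_denoise(model, x, timesteps, N=2):
    """Every N-th timestep runs all L blocks and caches block L-1's output
    (Eq. (8)); the next N-1 timesteps run only block 1 and block L,
    fused through the first skip branch (Eq. (9))."""
    cache = None
    L = len(model.blocks)
    for step, t in enumerate(timesteps):
        h = model.embed(x, t)                          # assumed helper
        if step % N == 0:
            shallow = []
            for l, block in enumerate(model.blocks, start=1):
                if l > L - L // 2:                     # deep half: skip branch i = L + 1 - l
                    h = model.skips[L - l](shallow[L - l], h)
                h = block(h)
                if l <= L // 2:
                    shallow.append(h)
                if l == L - 1:
                    cache = h                          # C^t_{L-1}
        else:
            h1 = model.blocks[0](h)                    # fresh output of block 1
            h = model.blocks[-1](model.skips[0](h1, cache))
        eps = model.unembed(h, t)                      # predicted noise (assumed helper)
        x = model.scheduler_step(x, eps, t)            # one reverse update (assumed helper)
    return x
```

In this schedule only two of the $L$ blocks run on cached steps, which is where the reported speedups come from.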

4 Experiments

4.1 Implementation Details

Table 1: Class-to-video generation results (FVD↓ / FID↓ per dataset), with FLOPs, latency, and speedup. n denotes the Skip-Cache caching interval (each cached feature is reused for the next n-1 timesteps).

| Method | UCF101 FVD↓ | UCF101 FID↓ | FFS FVD↓ | FFS FID↓ | Sky FVD↓ | Sky FID↓ | Taichi FVD↓ | Taichi FID↓ | FLOPs (T) | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Latte | 155.22 | 22.97 | 28.88 | 5.36 | 49.46 | 11.51 | 166.84 | 11.57 | 278.63 | 9.90 | 1.00× |
| Δ-DiT | 161.62 | 25.33 | 25.80 | 4.46 | 51.70 | 11.67 | 188.39 | 12.09 | 226.10 | 8.09 | 1.22× |
| FORA | 160.52 | 23.52 | 27.23 | 4.64 | 52.90 | 11.96 | 198.56 | 13.68 | 240.26 | 9.00 | 1.10× |
| PAB23 | 213.50 | 30.96 | 58.15 | 5.94 | 96.97 | 16.38 | 274.90 | 16.05 | 233.87 | 7.63 | 1.30× |
| PAB35 | 1176.57 | 93.30 | 863.18 | 128.34 | 573.72 | 55.66 | 828.40 | 42.96 | 222.90 | 7.14 | 1.39× |
| Skip-DiT | 141.30 | 23.78 | 20.62 | 4.32 | 49.21 | 11.92 | 163.03 | 13.55 | 290.05 | 10.02 | 1.00× |
| Skip-Cache, n=2 | 141.42 | 21.46 | 23.55 | 4.49 | 51.13 | 12.66 | 167.54 | 13.89 (0.34↑) | 180.68 | 6.40 | 1.56× |
| Skip-Cache, n=3 | 137.98 | 19.93 | 26.76 | 4.75 | 54.17 | 13.11 | 179.43 | 14.53 | 145.87 | 5.24 | 1.91× |
| Skip-Cache, n=4 | 143.00 | 19.03 | 30.19 | 5.18 | 57.36 | 13.77 | 188.44 | 14.38 | 125.99 | 4.57 | 2.19× |
| Skip-Cache, n=5 | 145.39 | 18.72 | 35.52 | 5.86 | 62.92 | 14.18 | 209.38 | 15.20 | 121.02 | 4.35 | 2.30× |
| Skip-Cache, n=6 | 151.77 | 18.78 | 42.41 | 6.42 | 68.96 | 15.16 | 208.04 | 15.78 | 111.07 | 4.12 | 2.43× |

Models

To demonstrate the remarkable effectiveness of Skip-DiT and Skip-Cache in video generation, we employ the classic and open-source DiT model, Latte [19], as our base model. Latte consists of spatial and temporal transformer blocks, making it suitable for both class-to-video and text-to-video tasks. Hunyuan-DiT [16], the first text-to-image DiT model with skip branches, modifies the model structure by splitting transformer blocks into encoder and decoder blocks, which are connected via long skip connections, similar to U-Nets. In this work, we leverage Hunyuan-DiT to investigate the effectiveness of skip connections in text-to-image tasks. Additionally, we modify the structure of Latte, following Figure 3, to evaluate the performance of skip connections in video generation tasks and explore techniques for integrating skip connections into a pre-trained DiT model.

Datasets

In the class-to-video task, we conduct comprehensive experiments on four public datasets: FaceForensics [28], SkyTimelapse [39], UCF101 [34], and Taichi-HD [31]. Following the experimental settings in Latte, we extract 16-frame video clips from these datasets and resize all frames to a resolution of 256×256.

For the text-to-video task, the original Latte is trained on WebVid-10M [1] and Vimeo-2M [36], comprising approximately 330k text-video pairs in total. Because the resolution of WebVid-10M is lower than 512×512 and it is predominantly used in the early stages of pre-training [20], we utilize only Vimeo-2M for training Skip-DiT. To align with Latte, we sample 330k text-video pairs from Vimeo-2M. All training data are resized to a resolution of 512×512, with 16 frames per video at a frame rate of 8.

Training Details

For the training of class-to-video tasks, we train all instances of Skip-DiT from scratch without any initialization. During training, we update all parameters in Skip-DiT, including skip branches. For text-to-video tasks, we propose a two-stage continual training strategy as follows:

  • Skip-branch training: The model is initialized with the weights of the original text-to-video Latte, while the weights of the skip branches are initialized randomly. During this stage, we train only the skip branches until the model can roughly generate content. This stage takes approximately one day.

  • Overall training: After the skip branches are fully trained, we unfreeze all other parameters and train the whole model. At this stage, Skip-DiT rapidly recovers its generation capability within approximately two days and, with an additional three days of training, generates content comparable to the original Latte.

This strategy significantly reduces training costs compared to training from scratch; a minimal sketch of the staging is given below. All our training experiments are conducted on 8 H100 GPUs, employing the video-image joint training strategy proposed in Ma et al. [19], which we find significantly enhances training stability.
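The sketch assumes the skip-branch parameters can be identified by the substring "skips" in their names and uses a placeholder learning rate; neither detail is taken from the released training configuration.

```python
import torch

def configure_stage(model, stage):
    """Stage 1: train only the skip branches (base DiT weights frozen).
    Stage 2: unfreeze everything for overall training.
    'skips' is our naming convention; lr is a placeholder value."""
    for name, param in model.named_parameters():
        param.requires_grad = (stage == 2) or ("skips" in name)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)

# Usage (model is a Skip-DiT instance initialized from pre-trained Latte weights):
# optimizer = configure_stage(model, stage=1)   # skip-branch-only training (~1 day)
# ... continue pre-training, then:
# optimizer = configure_stage(model, stage=2)   # overall training (~2-5 days)
```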

Evaluation Details

Following previous works [47, 18] and Latte, we evaluate text-to-video models using VBench [10], Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS) [43], and the Structural Similarity Index Measure (SSIM) [37]. VBench is a comprehensive benchmark suite comprising 16 evaluation dimensions; PSNR is a widely used metric for assessing reconstruction quality; LPIPS measures distances between features extracted from pre-trained networks; and SSIM evaluates differences in structural information. All videos used for evaluation are sampled with 50-step DDIM [33], the default setting in Latte.

For class-to-video tasks, we evaluate the similarity between generated and real videos using Fréchet Video Distance (FVD) [35] and Fréchet Inception Distance (FID) [21], following the evaluation guidelines of StyleGAN-V [42]. Latte uses 250-step DDPM [9] as the default solver for class-to-video tasks, which we adopt for all tasks except UCF101; for UCF101 we employ 50-step DDIM [33], as it outperforms 250-step DDPM on both Latte and Skip-DiT. Table 2 highlights this phenomenon: under comparable throughput, our method with 250-step DDPM consistently outperforms 50-step DDIM, except on UCF101, where DDIM performs better than 250-step DDPM.

Implementation Details of Other Caching Methods

We compare with four other DiT caching methods for video and image generation: ❶ T-GATE [44] reuses self-attention in the semantics-planning phase and skips cross-attention in the fidelity-improving phase; we follow Zhao et al. [47] to split these two phases. ❷ Δ-DiT identifies high similarity in deviations between feature maps and reuses them at the next timestep; while this method is originally designed for images, we extend it to video DiTs by caching only the front blocks, as we observe significant degradation when caching the back blocks. ❸ FORA [30] reuses attention features across timesteps. ❹ PAB [47] further extends this by broadcasting cross-, spatial-, and temporal-attention features separately. All of these caching methods are applied to Latte for a fair comparison with Skip-Cache on Skip-DiT.

4.2 Main Results

Class-to-video Generation

We compare the quantitative performance of Latte and Skip-DiT on four class-to-video tasks in Table 1. Skip-DiT consistently outperforms Latte in FVD across all tasks while achieving comparable FID, demonstrating strong video generation capabilities. Furthermore, Skip-Cache significantly outperforms other caching methods on most metrics, incurring only an average loss of 2.37 in FVD and 0.27 in FID while achieving a 1.56× speedup. In comparison, only PAB [47] reaches a speedup of 1.3× or more, but at the cost of a substantial average loss of 60.78 in FVD and 4.48 in FID. Notably, on the Taichi [31] task, all other caching methods exhibit significant FVD degradation (≥ 21.5), whereas Skip-Cache experiences only a slight loss (163.03→167.54). To reach an even higher speedup on the other three class-to-video tasks, we run Skip-Cache with a larger caching interval (n=4), resulting in a 2.19× speedup with an average loss of 6.47 in FVD and a 0.68 improvement in FID.

Table 2: Comparison of 250-step DDPM (default, with Skip-Cache) and 50-step DDIM on class-to-video tasks (FVD↓ / FID↓).

| Method | UCF101 FVD↓ | UCF101 FID↓ | FFS FVD↓ | FFS FID↓ | Sky FVD↓ | Sky FID↓ | Taichi FVD↓ | Taichi FID↓ |
|---|---|---|---|---|---|---|---|---|
| Latte | 165.04 | 23.75 | 28.88 | 5.36 | 49.46 | 11.51 | 166.84 | 11.57 |
| Skip-DiT | 173.70 | 22.95 | 20.62 | 4.32 | 49.22 | 12.05 | 163.03 | 13.55 |
| Skip-Cache, n=2 | 165.60 | 22.73 | 23.55 | 4.49 | 51.13 | 12.66 | 167.54 | 13.89 |
| DDIM + Skip-DiT | 134.22 | 24.60 | 37.28 | 6.48 | 86.39 | 13.67 | 343.97 | 21.01 |
| DDIM + Latte | 146.78 | 23.06 | 39.10 | 6.47 | 78.38 | 13.73 | 321.97 | 21.86 |
| Skip-Cache, n=3 | 169.37 | 22.47 | 26.76 | 4.75 | 54.17 | 13.11 | 179.43 | 14.53 |
| DDIM + Skip-DiT | 139.52 | 24.71 | 39.20 | 6.49 | 90.62 | 13.80 | 328.47 | 21.33 |
| DDIM + Latte | 148.46 | 23.41 | 41.00 | 6.54 | 74.39 | 14.20 | 327.22 | 22.96 |

Table 3: Text-to-video generation. 75% and 65% denote the share of denoising timesteps over which caching is applied (the 800→50 and 700→50 ranges, respectively).

| Method | VBench (%)↑ | PSNR↑ | LPIPS↓ | SSIM↑ | FLOPs (T) | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|
| Latte | 76.14 | – | – | – | 1587.25 | 27.11 | 1.00× |
| T-GATE | 75.68 (↓0.46) | 22.78 | 0.19 | 0.78 | 1470.72 | 24.15 | 1.12× |
| Δ-DiT | 76.06 (↓0.08) | 24.01 | 0.17 | 0.81 | 1274.36 | 21.40 | 1.27× |
| FORA | 76.06 (↓0.08) | 22.93 | 0.14 | 0.79 | 1341.72 | 24.21 | 1.19× |
| PAB235 | 73.79 (↓2.35) | 19.18 | 0.27 | 0.66 | 1288.08 | 23.24 | 1.24× |
| PAB347 | 72.08 (↓4.06) | 18.20 | 0.32 | 0.63 | 1239.35 | 22.23 | 1.29× |
| PAB469 | 71.64 (↓4.50) | 17.40 | 0.35 | 0.60 | 1210.11 | 21.60 | 1.33× |
| Skip-DiT | 75.60 | – | – | – | 1648.13 | 28.72 | 1.00× |
| Skip-Cache (75%), n=2 | 75.36 (↓0.24) | 26.02 | 0.10 | 0.84 | 1066.62 | 18.25 | 1.57× |
| Skip-Cache (75%), n=3 | 75.07 (↓0.53) | 22.85 | 0.18 | 0.76 | 852.38 | 14.88 | 1.93× |
| Skip-Cache (75%), n=4 | 74.43 (↓1.17) | 22.08 | 0.22 | 0.73 | 760.56 | 13.03 | 2.20× |
| Skip-Cache (65%), n=2 | 75.51 (↓0.09) | 29.52 | 0.06 | 0.89 | 1127.83 | 19.28 | 1.49× |
| Skip-Cache (65%), n=3 | 75.26 (↓0.34) | 27.46 | 0.09 | 0.85 | 974.80 | 16.67 | 1.72× |
| Skip-Cache (65%), n=4 | 74.73 (↓0.87) | 25.97 | 0.13 | 0.81 | 882.98 | 15.12 | 1.90× |

Text-to-video Generation

Table 4: Text-to-image generation with HunYuan-DiT.

| Method | FID↓ | CLIP↑ | PSNR↑ | LPIPS↓ | SSIM↑ | FLOPs (T) | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| HunYuan-DiT | 32.64 | 30.51 | – | – | – | 514.02 | 18.69 | 1.00 |
| T-GATE | 32.71 | 30.64 | 16.80 | 0.24 | 0.61 | 378.94 | 13.21 | 1.41 |
| Δ-Cache | 28.35 | 30.35 | 16.56 | 0.21 | 0.65 | 362.67 | 13.58 | 1.38 |
| FORA | 31.21 | 30.53 | 19.58 | 0.14 | 0.75 | 330.68 | 13.20 | 1.42 |
| Skip-Cache, n=2 | 31.30 | 30.52 | 22.09 | 0.10 | 0.84 | 348.24 | 12.76 | 1.46 |
| Skip-Cache, n=3 | 29.53 | 30.55 | 21.25 | 0.11 | 0.81 | 299.48 | 10.91 | 1.71 |
| Skip-Cache, n=4 | 27.49 | 30.55 | 20.55 | 0.13 | 0.78 | 270.22 | 10.02 | 1.87 |
| Skip-Cache, n=5 | 28.37 | 30.56 | 19.94 | 0.14 | 0.76 | 260.47 | 9.51 | 1.96 |
| Skip-Cache, n=6 | 27.21 | 30.71 | 19.18 | 0.18 | 0.70 | 240.96 | 8.96 | 2.09 |

In Table 3, we present a quantitative evaluation of all text-to-video models and caching methods. Videos are generated using the prompts from VBench [10], which is considered a more generalized benchmark [47, 10, 48]. Compared with the original Latte, Skip-DiT achieves a comparable VBench score (75.60 vs. 76.14) with only six days of continual pre-training on 330k training samples.

To demonstrate the superiority of the caching mechanism in Skip-Cache, we evaluate two caching settings: caching over timesteps 700→50 and 800→50 (out of 1000 timesteps in total). In both settings, Skip-Cache achieves the highest speedup while maintaining superior PSNR, LPIPS, and SSIM, with only a minor loss in VBench score.

In the first setting, our caching mechanism achieves 1.49× and 1.72× speedups with only a 0.12% and 0.22% loss in VBench score, respectively. Among other caching methods, only PAB469 achieves a speedup above 1.30×, but at the cost of a 4.50% drop in VBench score. Moreover, our caching method can reach a 1.90× speedup while still maintaining strictly better PSNR, LPIPS, and SSIM than the other caching methods.

Furthermore, in the second setting, we achieve a 2.20× speedup with just a 1.17% sacrifice in VBench score, the highest speedup among current training-free DiT acceleration methods.

4.3 Generalizing Skip-Cache to Image Generation

Hunyuan-DiT [16] is a powerful text-to-image DiT model featuring skip branches, whose effectiveness has been demonstrated in [16]; however, its skip branches have not been explored for accelerating image generation. We leverage these skip branches using the same caching mechanism as Skip-Cache and compare our caching strategy with other training-free acceleration methods. Furthermore, we extend Skip-Cache to the class-to-image task in Appendix 6, where Skip-DiT exceeds the vanilla DiT model with only around 38% of its training cost.

Evaluation Details

To evaluate the generalization of the caching mechanism in Skip-Cache to text-to-image tasks, we use the zero-shot Fréchet Inception Distance (FID) on the MS COCO [17] 256×256 validation set, generating 30,000 images from its prompts, following the evaluation guidelines established by Hunyuan-DiT. Additionally, we employ Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS) [43], and the Structural Similarity Index Measure (SSIM) [37] to assess the changes introduced by the caching methods. To ensure a fair comparison, we disable the prompt-enhancement feature of Hunyuan-DiT. All images are generated at a resolution of 1024×1024 and subsequently resized to 256×256 for evaluation.

Evaluation Results

Table 4 provides a comprehensive comparison of Hunyuan-DiT and various caching methods. Notably, our caching mechanism achieves a 2.28× speedup without any degradation in FID or CLIP scores. Furthermore, it outperforms all other caching methods in PSNR, LPIPS, and SSIM, consistently maintaining the highest performance even at a 1.93× speedup. These findings underscore the robustness and adaptability of our caching mechanism to image generation tasks.

Table 5: Ablation on the caching timestep range (text-to-video).

| Caching timesteps | VBench (%)↑ | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|
| 700→50 | 75.51 | 29.52 | 0.06 | 0.89 |
| 950→300 | 75.48 | 20.58 | 0.23 | 0.73 |
| 800→50 | 75.36 | 26.02 | 0.10 | 0.84 |
| 900→50 | 75.24 | 22.13 | 0.19 | 0.76 |
Table 6: Existing caching methods applied to Skip-DiT (text-to-video).

| Method | VBench (%)↑ | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|
| Skip-DiT | 75.60 | – | – | – |
| T-GATE | 75.16 | 24.09 | 0.16 | 0.78 |
| Δ-DiT | 75.48 | 22.64 | 0.17 | 0.79 |
| FORA | 75.38 | 23.26 | 0.16 | 0.79 |
| PAB235 | 73.79 | 19.92 | 0.29 | 0.68 |
| PAB347 | 72.08 | 18.98 | 0.34 | 0.65 |
| PAB469 | 71.64 | 18.02 | 0.37 | 0.62 |

4.4 Ablation studies

[Figure 4]

Selecting the best timesteps to cache

A heat map visualizing feature dynamics across blocks is shown in Figure 4. After incorporating skip branches into Latte, we observe that major changes concentrate in the early timesteps, with feature dynamics becoming considerably smoother in the later timesteps (700→50). In contrast, features in Latte change rapidly across all timesteps. This finding highlights that caching during smoother timesteps yields significantly better performance, supporting the hypothesis that smooth features enhance caching efficiency in DiT. Table 5 further validates this observation: under equivalent throughput, caching in the later timesteps (700→50), where features are smoother, outperforms caching in the earlier timesteps (950→300), achieving superior PSNR, LPIPS, and SSIM. Additionally, we segment the rapidly changing timesteps and experiment with three caching ranges: 900→50, 800→50, and 700→50. The results show that increasing the proportion of smoother regions significantly improves caching performance, underscoring the importance of leveraging smoother feature dynamics for optimal caching.
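Such a heat map can be reproduced in spirit with the sketch below. The exact distance used in Figure 4 is not specified here, so we illustrate it with 1 - cosine similarity between each block's outputs at consecutive timesteps, collected (e.g., via forward hooks) into `block_outputs`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_dynamics_heatmap(block_outputs):
    """block_outputs[t][l] holds the output of block l at denoising timestep t.
    Returns a (T-1, L) matrix; large values mark rapidly changing,
    cache-unfriendly (timestep, block) regions."""
    T, L = len(block_outputs), len(block_outputs[0])
    heat = torch.zeros(T - 1, L)
    for t in range(T - 1):
        for l in range(L):
            a = block_outputs[t][l].flatten()
            b = block_outputs[t + 1][l].flatten()
            heat[t, l] = 1.0 - F.cosine_similarity(a, b, dim=0)
    return heat
```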

Compatibility of Skip-DiT with other caching methods

As shown in Table 6, we extend existing DiT caching methods to Skip-DiT and observe slight performance improvements. Specifically, in Δ-DiT, the middle blocks are cached instead of the front blocks, as discussed in Section 4.1, to leverage Skip-DiT's U-shaped structure. Taking PAB [47] as an example, it loses 1.15% less in VBench score and achieves noticeably better PSNR and SSIM on Skip-DiT than on Latte, highlighting the potential of Skip-DiT to enhance cache-based methods and confirming the superior caching efficiency of models with better feature smoothness.

5 Conclusion

In this work, we introduce Skip-DiT, a skip-branch-enhanced DiT model designed to produce smoother features, and propose Skip-Cache to improve caching efficiency in video and image generation. By enhancing feature smoothness across timesteps, Skip-DiT unlocks the potential to cache most blocks while maintaining high generation quality, and Skip-Cache leverages its U-Net-style architecture to enable cross-timestep feature caching. Our approach achieves the highest speedup among cache-based visual DiT generation methods while preserving the highest similarity to the original outputs. Furthermore, we analyze feature dynamics before and after incorporating skip branches, demonstrating the effectiveness of caching at timesteps with smoother features, and show that Skip-DiT is compatible with other caching methods, further extending its applicability. Overall, Skip-DiT integrates seamlessly with various DiT backbones, enabling fast, high-quality video and image generation while consistently outperforming baseline methods. We believe Skip-Cache offers a simple yet powerful foundation for advancing future research and practical applications in visual generation.

References

  • Bain etal. [2021]Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman.Frozen in time: A joint video and image encoder for end-to-end retrieval.In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021.
  • Bao etal. [2022]Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu.All are worth words: A vit backbone for diffusion models.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22669–22679, 2022.
  • Betker et al.James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh.Improving image generation with better captions.
  • Chen etal. [2023]Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, JamesT. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.ArXiv, abs/2310.00426, 2023.
  • Chen etal. [2024a]Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu.Q-dit: Accurate post-training quantization for diffusion transformers.arXiv preprint arXiv:2406.17343, 2024a.
  • Chen etal. [2024b]Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen.Delta-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b.
  • Deng etal. [2009]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.Imagenet: A large-scale hierarchical image database.In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Goodfellow and Vinyals [2014]IanJ. Goodfellow and Oriol Vinyals.Qualitatively characterizing neural network optimization problems.CoRR, abs/1412.6544, 2014.
  • Ho etal. [2020]Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Huang etal. [2024]Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu.Vbench: Comprehensive benchmark suite for video generative models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21807–21818, 2024.
  • Im etal. [2016]DanielJiwoong Im, Michael Tao, and Kristin Branson.An empirical analysis of deep network loss surfaces.ArXiv, abs/1612.04010, 2016.
  • Lab and etc. [2024]PKU-Yuan Lab and TuzhanAI etc.Open-sora-plan, 2024.
  • Li etal. [2017]Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein.Visualizing the loss landscape of neural nets.ArXiv, abs/1712.09913, 2017.
  • Li etal. [2018]Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein.Visualizing the loss landscape of neural nets.In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6391–6401, 2018.
  • Li etal. [2023]Senmao Li, Taihang Hu, FahadShahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, and Jian Yang.Faster diffusion: Rethinking the role of unet encoder in diffusion models.arXiv preprint arXiv:2312.09608, 2023.
  • Li etal. [2024]Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, and Qinglin Lu.Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024.
  • Lin etal. [2014]Tsung-Yi Lin, Michael Maire, SergeJ. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C.Lawrence Zitnick.Microsoft COCO: common objects in context.In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
  • Ma etal. [2023]Xinyin Ma, Gongfan Fang, and Xinchao Wang.Deepcache: Accelerating diffusion models for free.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15762–15772, 2023.
  • Ma etal. [2024]Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao.Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024.
  • OpenAI [2024]OpenAI.Sora: Creating video from text.https://openai.com/sora, 2024.
  • Parmar etal. [2021]Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu.On buggy resizing libraries and surprising subtleties in FID calculation.CoRR, abs/2104.11222, 2021.
  • Peebles and Xie [2023]William Peebles and Saining Xie.Scalable diffusion models with transformers.In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 4172–4182. IEEE, 2023.
  • Peebles and Xie [2022]WilliamS. Peebles and Saining Xie.Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022.
  • Podell etal. [2023]Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, and Robin Rombach.Sdxl: Improving latent diffusion models for high-resolution image synthesis.ArXiv, abs/2307.01952, 2023.
  • Polyak etal. [2024]Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, MiteshKumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, CenPeng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, SaraK. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du.Movie Gen: A Cast of Media Foundation Models.arXiv e-prints, art. arXiv:2410.13720, 2024.
  • Rombach etal. [2021]Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
  • Ronneberger etal. [2015]Olaf Ronneberger, Philipp Fischer, and Thomas Brox.U-net: Convolutional networks for biomedical image segmentation.ArXiv, abs/1505.04597, 2015.
  • Rössler etal. [2018]Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner.Faceforensics: A large-scale video dataset for forgery detection in human faces.CoRR, abs/1803.09179, 2018.
  • Sauer etal. [2023]Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach.Adversarial diffusion distillation.arXiv preprint arXiv:2311.17042, 2023.
  • Selvaraju etal. [2024]Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang.Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425, 2024.
  • Siarohin etal. [2019]Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe.First order motion model for image animation.In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7135–7145, 2019.
  • So etal. [2023]Junhyuk So, Jungwon Lee, and Eunhyeok Park.Frdiff: Feature reuse for exquisite zero-shot acceleration of diffusion models.arXiv preprint arXiv:2312.03517, 2023.
  • Song etal. [2021]Jiaming Song, Chenlin Meng, and Stefano Ermon.Denoising diffusion implicit models.In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Soomro etal. [2012]Khurram Soomro, AmirRoshan Zamir, and Mubarak Shah.UCF101: A dataset of 101 human actions classes from videos in the wild.CoRR, abs/1212.0402, 2012.
  • Unterthiner etal. [2018]Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly.Towards accurate generative models of video: A new metric & challenges.CoRR, abs/1812.01717, 2018.
  • Wang etal. [2023]Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, ChenChange Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu.LAVIE: high-quality video generation with cascaded latent diffusion models.CoRR, abs/2309.15103, 2023.
  • Wang and Bovik [2002]Zhou Wang and AlanC. Bovik.A universal image quality index.IEEE Signal Process. Lett., 9(3):81–84, 2002.
  • Wimbauer etal. [2024]Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, etal.Cache me if you can: Accelerating diffusion models through block caching.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6211–6220, 2024.
  • Xiong etal. [2018]Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo.Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks.In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2364–2373. Computer Vision Foundation / IEEE Computer Society, 2018.
  • Yang etal. [2024]Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, etal.Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024.
  • Yin etal. [2024]Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, WilliamT Freeman, and Taesung Park.One-step diffusion with distribution matching distillation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.
  • Yu etal. [2022]Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin.Generating videos with dynamics-aware implicit generative adversarial networks.In International Conference on Learning Representations, 2022.
  • Zhang etal. [2018]Richard Zhang, Phillip Isola, AlexeiA Efros, Eli Shechtman, and Oliver Wang.The unreasonable effectiveness of deep features as a perceptual metric.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhang etal. [2024a]Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, MikeZheng Shou, and Jürgen Schmidhuber.Cross-attention makes inference cumbersome in text-to-image diffusion models.CoRR, abs/2404.02747, 2024a.
  • Zhang etal. [2024b]Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, MikeZheng Shou, and Jürgen Schmidhuber.Cross-attention makes inference cumbersome in text-to-image diffusion models.arXiv preprint arXiv:2404.02747, 2024b.
  • Zhang etal. [2024c]Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, and Kai Chen.Pia: Your personalized image animator via plug-and-play modules in text-to-image models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7747–7756, 2024c.
  • Zhao etal. [2024]Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You.Real-time video generation with pyramid attention broadcast, 2024.
  • Zheng etal. [2024]Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You.Open-sora: Democratizing efficient video production for all, 2024.


Supplementary Material

6 Class-to-image Generation Experiments

Peebles and Xie [22] proposed the first diffusion model based on the transformer architecture, which outperforms all prior diffusion models on the class-conditional ImageNet [7] 512×512 and 256×256 benchmarks. We add skip branches to its largest model, DiT-XL/2, to obtain Skip-DiT. We train Skip-DiT on class-conditional ImageNet at 256×256 resolution from scratch with exactly the same experimental settings as DiT-XL/2, and it far exceeds DiT-XL/2 with only around 38% of its training cost.

Training of Skip-DiT

We modify the structure of DiT-XL/2 following the methodology outlined in Section 3 and train Skip-DiT for 2,900,000 steps on 8 A100 GPUs, compared to 7,000,000 steps for DiT-XL/2, which also uses 8 A100 GPUs. The datasets and other training settings remain identical to those used for DiT-XL/2, and we utilize the official training code of DiT-XL/2 (https://github.com/facebookresearch/DiT). The performance comparison is presented in Table 7, which demonstrates that Skip-DiT significantly outperforms DiT-XL/2 while requiring only about 38% of its training steps, highlighting the training efficiency and effectiveness of Skip-DiT.

Table 7: Class-conditional ImageNet 256×256: DiT-XL/2 vs. Skip-DiT.

| Model | Training steps | FID↓ | sFID↓ | IS↑ | Precision↑ | Recall↑ |
|---|---|---|---|---|---|---|
| cfg = 1.0 | | | | | | |
| DiT-XL/2 | 7000k | 9.49 | 7.17 | 122.49 | 0.67 | 0.68 |
| Skip-DiT | 2900k | 8.37 | 6.50 | 127.63 | 0.68 | 0.68 |
| cfg = 1.5 | | | | | | |
| DiT-XL/2 | 7000k | 2.30 | 4.71 | 276.26 | 0.83 | 0.58 |
| Skip-DiT | 2900k | 2.29 | 4.58 | 281.81 | 0.83 | 0.58 |

Acceleration Evaluation

Table 8: Caching methods on class-conditional ImageNet 256×256.

| Method | FID↓ | sFID↓ | IS↑ | Precision (%) | Recall (%) | Speedup |
|---|---|---|---|---|---|---|
| cfg = 1.5 | | | | | | |
| DiT-XL/2 | 2.30 | 4.71 | 276.26 | 82.68 | 57.65 | 1.00× |
| FORA | 2.45 | 5.44 | 265.94 | 81.21 | 58.36 | 1.57× |
| Δ-DiT | 2.47 | 5.61 | 265.33 | 81.05 | 58.83 | 1.45× |
| Skip-DiT | 2.29 | 4.58 | 281.81 | 82.88 | 57.53 | 1.00× |
| Skip-Cache, n=2 | 2.31 | 4.76 | 277.51 | 82.52 | 58.06 | 1.46× |
| Skip-Cache, n=3 | 2.40 | 4.98 | 272.05 | 82.14 | 57.86 | 1.73× |
| Skip-Cache, n=4 | 2.54 | 5.31 | 267.34 | 81.60 | 58.31 | 1.93× |
| cfg = 1.0 | | | | | | |
| DiT-XL/2 | 9.49 | 7.17 | 122.49 | 66.66 | 67.69 | 1.00× |
| FORA | 11.72 | 9.27 | 113.01 | 64.46 | 67.69 | 1.53× |
| Δ-DiT | 12.03 | 9.68 | 111.86 | 64.57 | 67.53 | 1.42× |
| Skip-DiT | 8.37 | 6.50 | 127.63 | 68.06 | 67.89 | 1.00× |
| Skip-Cache, n=2 | 9.25 | 7.09 | 123.57 | 67.32 | 67.40 | 1.46× |
| Skip-Cache, n=3 | 10.18 | 7.72 | 119.60 | 66.53 | 67.84 | 1.71× |
| Skip-Cache, n=4 | 11.37 | 8.49 | 116.01 | 65.73 | 67.32 | 1.92× |

We evaluate Skip-Cache on Skip-DiT and compare its performance against two other caching methods: Δ-DiT and FORA. As shown in Table 8, Skip-Cache achieves a 1.46× speedup with only a minimal FID loss of 0.02 at a classifier-free guidance scale of 1.5, compared to the 7–8× larger losses observed with Δ-DiT and FORA. Moreover, even at a 1.9× acceleration, Skip-DiT performs better than the other caching methods. These findings further confirm the effectiveness of Skip-DiT for class-to-image tasks.

7 Evaluation Details

VBench [10]

is an evaluation framework for video generation models. It breaks video generation assessment down into 16 dimensions covering video quality and condition consistency: subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality, object class, multiple objects, human action, color, spatial relationship, scene, temporal style, appearance style, and overall consistency.

Peak Signal-to-Noise Ratio (PSNR)

measures the quality of generated visual content by comparing a processed version $\mathbf{v}$ to the original reference $\mathbf{v}_{r}$:

$\mathrm{PSNR} = 10 \times \log_{10}\!\left(\frac{R^{2}}{\mathrm{MSE}(\mathbf{v},\mathbf{v}_{r})}\right)$    (10)

where $R$ is the maximum possible pixel value and $\mathrm{MSE}(\cdot,\cdot)$ computes the mean squared error between the original and processed images or videos. Higher PSNR indicates better reconstruction quality; however, PSNR does not always correlate with human perception and is sensitive to pixel-level changes.

Structural Similarity Index Measure (SSIM)

is a perceptual metric that evaluates image quality by considering luminance, contrast, and structure:

$\mathrm{SSIM} = [l(\mathbf{v},\mathbf{v}_{r})]^{\alpha}\cdot[c(\mathbf{v},\mathbf{v}_{r})]^{\beta}\cdot[s(\mathbf{v},\mathbf{v}_{r})]^{\gamma}$    (11)

where $\alpha,\beta,\gamma$ weight the luminance, contrast, and structure terms. The luminance comparison is $l(\mathbf{v},\mathbf{v}_{r})=\frac{2\mu_{\mathbf{v}}\mu_{\mathbf{v}_{r}}+C_{1}}{\mu_{\mathbf{v}}^{2}+\mu_{\mathbf{v}_{r}}^{2}+C_{1}}$, the contrast comparison is $c(\mathbf{v},\mathbf{v}_{r})=\frac{2\sigma_{\mathbf{v}}\sigma_{\mathbf{v}_{r}}+C_{2}}{\sigma_{\mathbf{v}}^{2}+\sigma_{\mathbf{v}_{r}}^{2}+C_{2}}$, and the structure comparison is $s(\mathbf{v},\mathbf{v}_{r})=\frac{\sigma_{\mathbf{v}\mathbf{v}_{r}}+C_{3}}{\sigma_{\mathbf{v}}\sigma_{\mathbf{v}_{r}}+C_{3}}$, where $\sigma_{\mathbf{v}\mathbf{v}_{r}}$ is the covariance between $\mathbf{v}$ and $\mathbf{v}_{r}$ and the $C_{i}$ are numerical-stability constants. SSIM scores range from -1 to 1, with 1 indicating identical visual content.
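Below is a minimal, single-window sketch of Equation 11. It assumes α = β = γ = 1 and the commonly used constants C1 = (0.01R)², C2 = (0.03R)², C3 = C2/2, and it omits the local-window averaging used by practical SSIM implementations:

```python
import numpy as np

def ssim_global(v: np.ndarray, v_ref: np.ndarray, max_val: float = 255.0) -> float:
    """Single-window SSIM of Equation 11 with alpha = beta = gamma = 1."""
    v, v_ref = v.astype(np.float64), v_ref.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    c3 = c2 / 2.0                                    # conventional stability constant
    mu_v, mu_r = v.mean(), v_ref.mean()
    sigma_v, sigma_r = v.std(), v_ref.std()
    cov = ((v - mu_v) * (v_ref - mu_r)).mean()       # covariance between v and v_ref
    l = (2 * mu_v * mu_r + c1) / (mu_v**2 + mu_r**2 + c1)              # luminance
    c = (2 * sigma_v * sigma_r + c2) / (sigma_v**2 + sigma_r**2 + c2)  # contrast
    s = (cov + c3) / (sigma_v * sigma_r + c3)                          # structure
    return float(l * c * s)
```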

Learned Perceptual Image Patch Similarity (LPIPS)

is a deep-learning-based metric that measures perceptual similarity as the squared L2 distance between visual features $v\in\mathbb{R}^{H\times W\times C}$ extracted by a pretrained CNN $\mathcal{F}(\cdot)$. LPIPS captures semantic similarities and is therefore more robust to small geometric transformations than PSNR and SSIM.

$\mathrm{LPIPS} = \frac{1}{HW}\sum_{h,w}\left\lVert\mathcal{F}(v_{r})-\mathcal{F}(v)\right\rVert_{2}^{2}$    (12)
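The sketch below illustrates the distance in Equation 12, with a truncated torchvision VGG-16 standing in for the pretrained CNN F(·). The actual LPIPS metric additionally unit-normalizes channels, applies learned linear weights, and averages over several layers, so this is only an approximation:

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

# A truncated VGG-16 stands in for the pretrained feature extractor F(.) of Eq. 12.
_features = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()

@torch.no_grad()
def lpips_like(v: torch.Tensor, v_ref: torch.Tensor) -> torch.Tensor:
    """Spatially averaged squared L2 distance between deep features (Eq. 12).

    v, v_ref: (N, 3, H, W) tensors normalized as the backbone expects.
    """
    f_v, f_r = _features(v), _features(v_ref)
    # Squared L2 over the channel dimension, averaged over spatial positions.
    return ((f_r - f_v) ** 2).sum(dim=1).mean(dim=(1, 2))
```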

Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD)

FID measures the quality and diversity of generated images by computing the distance between the feature distributions of reference images, $\mathcal{N}(\mu_{r},\Sigma_{r})$, and generated images, $\mathcal{N}(\mu,\Sigma)$, using an Inception CNN, where $\mu,\Sigma$ denote the mean and covariance of the features.

$\mathrm{FID} = \lVert\mu_{r}-\mu\rVert^{2} + \mathrm{Tr}\!\left(\Sigma_{r}+\Sigma-2(\Sigma_{r}\Sigma)^{1/2}\right)$    (13)

FVD is a video extension of FID. Lower FID and FVD indicate higher generation quality.
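Given pre-extracted Inception features (or video-network features for FVD), Equation 13 can be evaluated as in the following sketch; feature extraction itself is omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_features(feats_ref: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance (Eq. 13) between Gaussians fitted to (N, D) feature arrays."""
    mu_r, mu_g = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_ref, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```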

8 Implementation Details

DeepCache

DeepCache [18] is a training-free caching method designed for U-Net-based diffusion models that leverages the inherent temporal redundancy of sequential denoising steps. It uses the skip connections of the U-Net to reuse high-level features while efficiently updating low-level features. Skip-Cache shares significant similarities with DeepCache but extends the method to DiT models: we upgrade traditional DiT models to Skip-DiT and cache them with Skip-Cache. DeepCache introduces two key caching decisions: (1) N, the number of steps for which cached high-level features are reused; cached features are computed once and reused for the next N-1 steps; and (2) the layer at which caching is performed; for instance, caching at the first layer ensures that only the first and last layers of the U-Net are recomputed. In Skip-Cache, we adopt both strategies and additionally account for which timesteps to cache, addressing the greater complexity of DiT models compared to U-Net-based diffusion models. For all tasks except the class-to-image task, caching is performed at the first layer; for the class-to-image task, it is applied at the third layer.
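To make the schedule concrete, the sketch below shows the N-step reuse loop described above. The hooks full_forward and partial_forward are hypothetical: the first runs the whole network and returns the high-level features cached at cache_layer, while the second recomputes only the uncached layers and consumes the cached features through the skip branch:

```python
def cached_denoising_loop(model, latents, timesteps, n=2, cache_layer=0):
    """Sketch of the N-step feature-reuse schedule (hypothetical model hooks)."""
    cache = None
    for step, t in enumerate(timesteps):
        if step % n == 0 or cache is None:
            # Full forward pass: run every block and cache features at cache_layer.
            latents, cache = model.full_forward(latents, t, cache_layer=cache_layer)
        else:
            # Partial pass: recompute only the uncached layers, reuse cached features.
            latents = model.partial_forward(latents, t, cache_layer=cache_layer,
                                            cached_features=cache)
    return latents
```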

Δ-DiT

Δ-DiT [6] is a training-free caching method designed for image-generation DiT models. Instead of caching feature maps directly, it caches the offsets of features to preserve input information. This approach is based on the observation that the front blocks of DiT are responsible for generating image outlines, while the rear blocks focus on finer details. A hyperparameter b denotes the boundary between the outline and detail generation stages: when t ≤ b, Δ-Cache is applied to the rear blocks; when t > b, it is applied to the front blocks. The number of cached blocks is denoted N_c.

While this caching method was initially designed for image generation tasks, we extend it to video generation tasks. In video generation, we observe significant performance degradation when caching the rear blocks, so we restrict caching to the front blocks during the outline generation stage. For Hunyuan-DiT [16], we cache the middle blocks due to its U-shaped transformer architecture. Detailed configurations are provided in Table 9.

Table 9: Δ-DiT configurations for each task.

Model | Task | Diffusion steps | b | All layers | N_c
Latte | t2v | 50 | 12 | 28 | 21
Latte | c2v | 250 | 60 | 14 | 10
Hunyuan | t2i | 50 | 12 | 28 | 18
DiT-XL/2 | c2i | 250 | 60 | 28 | 21
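The following sketch illustrates the Δ-Cache rule described above on a generic stack of DiT blocks. The block interface is an assumption, and the periodic refresh of the cached offset is omitted for brevity:

```python
def delta_dit_step(blocks, x, t, b, n_c, delta_cache=None):
    """One denoising step with the Δ-Cache rule (sketch).

    Instead of feature maps, the offset (span output minus span input) of a
    contiguous run of n_c blocks is cached and re-added on later steps. The
    cached span covers the rear blocks when t <= b (detail stage) and the
    front blocks when t > b (outline stage).
    """
    n = len(blocks)
    start, end = (n - n_c, n) if t <= b else (0, n_c)
    for block in blocks[:start]:          # blocks before the cached span
        x = block(x, t)
    if delta_cache is None:
        x_in = x
        for block in blocks[start:end]:   # compute the span once, cache its offset
            x = block(x, t)
        delta_cache = x - x_in
    else:
        x = x + delta_cache               # reuse the cached offset, skip the span
    for block in blocks[end:]:            # blocks after the cached span
        x = block(x, t)
    return x, delta_cache
```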

PAB

PAB (Pyramid Attention Broadcast) [47] is one of the most promising caching methods designed for real-time video generation. It leverages the observation that attention differences across diffusion steps follow a U-shaped pattern, broadcasting attention outputs to subsequent steps in a pyramid-like manner. Different broadcast ranges are set for the three attention types (spatial, temporal, and cross-attention) based on their respective differences. PAB_{αβγ} denotes the broadcast ranges for spatial (α), temporal (β), and cross (γ) attention.

In this work, we use the official implementation of PAB for text-to-video tasks on Latte and adapt the caching method to other tasks in-house. For the class-to-video task, where cross-attention is absent, PAB_{αβ} refers to the broadcast ranges of spatial (α) and temporal (β) attention. In the text-to-image task, which lacks temporal attention, PAB_{αβ} instead denotes the broadcast ranges of spatial (α) and cross (β) attention. We do not apply PAB to the class-to-image task, as it involves only spatial attention.
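One way to express the broadcast schedule is a wrapper that recomputes an attention module's output only when the cached copy is older than that attention type's broadcast range. The wrapper below is an illustrative sketch, not the official PAB implementation:

```python
class BroadcastAttention:
    """Wrap an attention module so its output is broadcast across steps (sketch).

    broadcast_range plays the role of α, β, or γ for this attention type: after
    a fresh computation, the same output is reused for the next broadcast_range
    denoising steps before being recomputed.
    """
    def __init__(self, attn, broadcast_range: int):
        self.attn = attn
        self.broadcast_range = broadcast_range
        self._cached = None
        self._age = 0

    def __call__(self, *args, **kwargs):
        if self._cached is None or self._age >= self.broadcast_range:
            self._cached = self.attn(*args, **kwargs)  # recompute this step
            self._age = 0
        else:
            self._age += 1                             # reuse the broadcast output
        return self._cached
```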

Table 10: T-GATE configurations for each task.

Model | Task | Diffusion steps | m | k
Latte | t2v | 50 | 20 | 2
Hunyuan-DiT | t2i | 50 | 20 | 2

T-GATE

T-GATE divides the diffusion process into two phases: (1) the semantics-planning phase and (2) the fidelity-improving phase. In the first phase, self-attention is computed and then reused every k steps. In the second phase, cross-attention is reused from a cache instead of being recomputed. The hyperparameter m determines the boundary between the two phases. For our implementation, we use the same hyperparameters as PAB [47]. Detailed configurations are provided in Table 10.
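The two-phase schedule can be summarized by a per-step decision rule such as the following sketch (the function and its keys are illustrative):

```python
def tgate_schedule(step: int, m: int, k: int) -> dict:
    """Per-step caching decisions for the two-phase schedule (sketch).

    Steps before m form the semantics-planning phase: self-attention is
    recomputed only every k steps and reused in between. Steps from m onward
    form the fidelity-improving phase: the cached cross-attention output is
    reused instead of being recomputed.
    """
    if step < m:
        return {"recompute_self_attn": step % k == 0, "reuse_cross_attn": False}
    return {"recompute_self_attn": True, "reuse_cross_attn": True}
```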

FORA

FORA (Fast-Forward Caching) [30] stores and reuses intermediate outputs from attention and MLP layers across denoising steps. In the original FORA paper, however, features are cached in advance of the diffusion process, which we do not adopt because it is highly time-consuming. Instead, we skip FORA's "Initialization" step and compute the features dynamically during the diffusion process.
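A possible realization of this variant, where each cached attention or MLP sub-layer fills its cache on first use and is refreshed every interval denoising steps, is sketched below (the helper and its interface are ours):

```python
def fora_forward(layer, x, cache: dict, key: str, step: int, interval: int):
    """Return the sub-layer output, recomputing only at refresh steps (sketch).

    Without FORA's original initialization pass, the cache entry for `key`
    (e.g. an attention or MLP sub-layer) is filled the first time it is needed
    and refreshed every `interval` steps; other steps reuse the stored output.
    """
    if key not in cache or step % interval == 0:
        cache[key] = layer(x)   # compute dynamically and refresh the cache
    return cache[key]
```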

9 Case Study

Video Generation

In Figure 5, we showcase video frames generated from text prompts with Skip-Cache and PAB, comparing them to the original model. From portraits to scenery, Skip-Cache consistently demonstrates better visual fidelity along with faster generation. Figure 6 presents class-to-video generation examples with Skip-Cache for caching steps ∈ {2, 4, 6}. Comparing Skip-Cache to the original model, we see that it maintains good generation quality across different caching steps.

Image Generation

Figure 7 compares qualitative results of Skip-Cache with other caching-based acceleration methods (Δ-DiT, FORA, T-GATE) on Hunyuan-DiT: Skip-Cache shows clear advantages in both speedup and similarity to the original generation, while the other baselines exhibit varying degrees of change in details such as color, texture, and posture. Similarly, Figure 8 presents Skip-Cache with varying caching steps, showing that even with more steps cached it maintains high fidelity to the original generation.
