StreamDiffusion, a pipeline for real-time interactive generation
StreamDiffusion is a new diffusion pipeline for real-time interactive image generation, built for live streaming and similar high-throughput scenarios. It replaces traditional sequential denoising with a batched denoising process (Stream Batch) and adds a parallel input/output queue for smoother operation.
The pipeline also introduces Residual Classifier-Free Guidance (RCFG), which reduces the number of negative-conditioning U-Net passes needed for guidance, and a Stochastic Similarity Filter that improves energy efficiency by skipping redundant frames. Overall, Stream Batch delivers up to 1.5x faster processing than sequential denoising, RCFG adds up to a 2.05x speedup over conventional CFG, and the full pipeline reaches 91.07 frames per second on an RTX 4090 GPU. Power consumption is also significantly reduced, making StreamDiffusion a more efficient solution for real-time image generation.
In this approach, instead of waiting until one image is fully denoised before processing the next input, the pipeline takes in the next input image after each denoising step. This creates a denoising batch in which the denoising stages of consecutive images are interleaved. Combining the denoising steps into a single batch lets a continuous input stream be processed efficiently with one batched U-Net pass per step: the input image encoded at time step t is generated and decoded at time step t + n, where n is the number of denoising steps.
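The mechanics of such a rolling denoising batch can be sketched roughly as follows (illustrative pseudocode only, not StreamDiffusion's actual API; `encode`, `decode`, `unet_step`, and `timestep_for` are placeholder stand-ins):

```python
import collections
import torch

n_steps = 4                              # number of denoising steps (n)
batch = collections.deque()              # latents at staggered denoising stages

# --- placeholder components (stand-ins for the real VAE encoder/decoder and U-Net) ---
def encode(frame):      return torch.randn(4, 64, 64)      # VAE encode -> latent
def decode(latent):     return latent                      # VAE decode -> image
def timestep_for(step): return n_steps - step              # scheduler timestep lookup
def unet_step(latents, timesteps):                         # one batched denoising pass
    return [lat * 0.9 for lat in latents]                  # dummy "denoising"

def stream_batch_tick(new_frame):
    """One pipeline tick: admit a new frame and advance every latent one step."""
    batch.appendleft({"latent": encode(new_frame), "step": 0})

    # A single batched U-Net call advances every element by one denoising step.
    denoised = unet_step([b["latent"] for b in batch],
                         [timestep_for(b["step"]) for b in batch])
    for item, lat in zip(batch, denoised):
        item["latent"], item["step"] = lat, item["step"] + 1

    # The oldest element has now gone through all n steps and can be decoded.
    if batch and batch[-1]["step"] == n_steps:
        return decode(batch.pop()["latent"])   # image that entered n ticks earlier
    return None
```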
Figure: the orange vectors depict the virtual residual noise, which starts from the PF ODE (probability flow ordinary differential equation) trajectory and points toward the original input latent x0.
Traditional diffusion models rely on sequential denoising steps, so processing time grows linearly with the number of steps, dominated by the repeated U-Net passes. Higher-fidelity images require more denoising steps, which means longer latency.
Stream Batch addresses this by replacing sequential denoising with a batched process. Each batch holds several consecutive input images at staggered stages of denoising, and every element in the batch advances one step of the denoising sequence per U-Net pass. In this way, the input image received at one time step is carried smoothly through the batch and emerges as a finished image a fixed number of time steps later.
This approach removes the need for multiple U-Net inferences per frame and avoids the linear growth of processing time with the number of steps. The main trade-off shifts from processing time versus generation quality to VRAM capacity versus generation quality: with enough VRAM, high-quality images can be produced with a single batched U-Net pass per frame, largely eliminating the latency growth caused by additional denoising steps.
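As a back-of-the-envelope comparison (an illustrative approximation that ignores the overhead of running the U-Net at a larger batch size; here $t_U$ is the time of one U-Net pass and $n$ the number of denoising steps):

$$
T_{\text{sequential}} \approx n \cdot t_U \quad \text{per frame},
\qquad
T_{\text{Stream Batch}} \approx t_U^{(\text{batch}=n)} \quad \text{per frame},
\qquad
\text{VRAM} \propto n .
$$

Each individual image still needs $n$ pipeline ticks from input to output, but throughput is now bounded by a single batched U-Net pass per tick instead of $n$ sequential passes.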
Residual Classifier-Free Guidance

Traditional Classifier-Free Guidance (CFG) improves image generation by strengthening the effect of the conditioning, but it is computationally expensive because it requires additional U-Net passes for every output. Residual Classifier-Free Guidance (RCFG) addresses this by introducing the concept of virtual residual noise, which points toward the latent representation of the original input image at a given point in the generation process. This makes it possible to steer the generated image away from the original input by a controllable amount without any additional U-Net computation; this variant is referred to as Self-Negative RCFG.
In addition, RCFG can be used to deviate from any negative condition by computing the negative-condition residual noise only once, at the first denoising step, and reusing it throughout the process (Onetime-Negative RCFG). This significantly reduces the computational burden, requiring only n or n+1 U-Net computations for Self-Negative and Onetime-Negative RCFG, respectively, compared to the 2n computations required for conventional CFG. RCFG is therefore noticeably more efficient, while preserving or even improving the quality of the generated images.
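A simplified sketch of the per-step difference might look like this (illustrative only, under a generic DDPM-style parameterization with cumulative alphas `alpha_bar_t`; the exact equations in the paper may differ, and `unet` is just a callable placeholder):

```python
import torch

def cfg_step(unet, x_t, t, cond, neg_cond, gamma=1.4):
    """Conventional CFG: two U-Net passes per denoising step (2n passes for n steps)."""
    eps_cond = unet(x_t, t, cond)
    eps_neg = unet(x_t, t, neg_cond)       # the extra, negative-condition pass
    return eps_neg + gamma * (eps_cond - eps_neg)

def self_negative_rcfg_step(unet, x_t, t, cond, x0_latent, alpha_bar_t,
                            gamma=1.4, delta=1.0):
    """Self-Negative RCFG: a single U-Net pass (n passes for n steps).

    The negative-condition noise is replaced by 'virtual residual noise' pointing
    from the current point on the trajectory back toward the input latent x0,
    scaled by a magnitude coefficient delta.
    """
    eps_cond = unet(x_t, t, cond)                                      # only U-Net pass
    eps_virtual = (x_t - alpha_bar_t.sqrt() * x0_latent) / (1 - alpha_bar_t).sqrt()
    return delta * eps_virtual + gamma * (eps_cond - delta * eps_virtual)
```

Onetime-Negative RCFG would instead spend one extra U-Net call at the first step to compute the negative-condition noise, cache it, and reuse it for the remaining steps, which is where the n+1 count comes from.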
Converting input images into the tensor format the pipeline works with, and converting decoded tensors back into output images, takes a non-trivial amount of additional processing time. To keep this image handling out of the main bottleneck, the neural-network inference itself, pre- and post-processing are moved into separate threads so that they run in parallel. In addition, an Input Tensor Queue compensates for input frames that arrive late because of device faults or communication errors, keeping the stream smooth.
High-speed image generation is further optimized by moving work that does not require the neural network, such as image pre- and post-processing, into parallel processing outside the main pipeline. Input images are resized, converted to tensors, and normalized. To reconcile the mismatch between the rate at which humans supply input and the model's throughput, the authors implemented a system of input/output queues. Input tensors are queued for the diffusion model; the resulting latents pass through the variational autoencoder (VAE) decoder to form images, and the decoded tensors go to the output queue for post-processing and format conversion before being sent to the rendering client. This strategy improves overall efficiency and speeds up image generation.
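In Python this pattern could be organized roughly as follows (a minimal sketch using the standard `threading` and `queue` modules; `resize`, `to_tensor`, `to_image`, `pipeline`, `frame_source`, and `render_client` are hypothetical placeholders):

```python
import queue
import threading

input_queue = queue.Queue(maxsize=8)    # pre-processed input tensors waiting for the model
output_queue = queue.Queue(maxsize=8)   # decoded tensors waiting for post-processing

def preprocess_worker(frame_source):
    """Runs in its own thread: resize, tensor conversion, normalization."""
    for frame in frame_source:
        input_queue.put(to_tensor(resize(frame)))        # placeholder helpers

def inference_worker():
    """Main pipeline thread: only neural-network work happens here."""
    while True:
        x = input_queue.get()            # tolerates late frames without stalling the stream
        output_queue.put(pipeline(x))    # diffusion model + VAE decode (placeholder)

def postprocess_worker(render_client):
    """Runs in its own thread: tensor-to-image conversion, then hand-off to the client."""
    while True:
        render_client.send(to_image(output_queue.get()))  # placeholder helpers

# Each worker would be started with threading.Thread(target=..., daemon=True).start()
```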
The core of the diffusion pipeline consists of the VAE and the U-Net. By combining batched denoising with pre-computed caches for the text prompt embeddings, the sampled noise, and the scheduler values, the inference pipeline gains enough speed for real-time image generation. The Stochastic Similarity Filter (SSF) is designed to reduce GPU power consumption: it dynamically decides whether the diffusion model needs to run at all for a given frame. Together, these components provide fast and energy-efficient real-time output.
To avoid generating redundant images and wasting GPU resources in scenarios with minimal change or a static environment, the authors propose the Stochastic Similarity Filter. It computes the cosine similarity between the current input image and a previous reference image and, based on that similarity, derives a probability of skipping the VAE and U-Net stages to cut unnecessary computation. When the similarity is high, indicating little change between frames, the pipeline is more likely to skip processing, saving computational resources. This mechanism works well both in dynamic scenes, where every frame is processed, and in static scenes, where the processing rate can drop. Unlike a hard threshold, which can make the video cut out abruptly, the probabilistic approach of SSF yields smoother generation, adapting to the changing dynamics of the scene without hurting the perceived result.
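A minimal sketch of the filter's core decision (the mapping from cosine similarity to skip probability used here is an assumption for illustration, not the paper's exact formula):

```python
import torch
import torch.nn.functional as F

def should_skip(current: torch.Tensor, reference: torch.Tensor) -> bool:
    """Probabilistically skip U-Net/VAE processing for near-identical frames.

    The closer the cosine similarity between the current input and the reference
    frame is to 1, the higher the probability of skipping this frame.
    """
    sim = F.cosine_similarity(current.flatten(), reference.flatten(), dim=0)
    skip_prob = torch.clamp((sim - 0.98) / (1.0 - 0.98), 0.0, 1.0)  # assumed mapping
    return bool(torch.rand(()) < skip_prob)
```

Because the decision is random rather than a hard cutoff, a nearly static scene still lets occasional frames through, which is what keeps the output from looking choppy.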
Precomputation

Optimizing the U-Net for interactive or streaming use cases starts with pre-computing and caching the embedding of the text prompt, which stays constant across frames. The cached embedding is used to compute the prompt's key-value pairs inside the U-Net, and these are stored and only recomputed when the prompt changes. In addition, the Gaussian noise for each denoising step is sampled once and cached, so the same noise is used at every time step, which improves consistency in image-to-image tasks. The noise strength coefficients for each denoising step are also pre-computed, reducing overhead at high frame rates. For Latent Consistency Models, the required coefficient functions are pre-computed for all denoising steps or set to constants, avoiding repeated computation during inference.
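A rough sketch of what such a cache might hold (attribute names and constructor arguments are invented for illustration, not StreamDiffusion's actual internals; `text_encoder` and `scheduler` are placeholders):

```python
import torch

class PrecomputedCache:
    """Caches per-stream constants so nothing is recomputed inside the frame loop."""

    def __init__(self, text_encoder, scheduler, prompt, n_steps, latent_shape):
        # 1. Text prompt embedding: computed once, reused for every frame.
        self.prompt_embeds = text_encoder(prompt)

        # 2. Gaussian noise: pre-sampled once per denoising step, so the added noise
        #    is identical across frames, stabilizing image-to-image streaming.
        self.noise = torch.randn(n_steps, *latent_shape)

        # 3. Scheduler values (timesteps / noise-strength coefficients) for each step.
        self.timesteps = list(scheduler.timesteps)[:n_steps]

    def set_prompt(self, text_encoder, prompt):
        # The text encoder only runs again when the prompt actually changes.
        self.prompt_embeds = text_encoder(prompt)
```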
Figure: comparison of GPU utilization in a static scene (GPU: RTX 3060, 20 frames). The blue line is GPU utilization with SSF, the orange line is GPU utilization without SSF, and the red line is the skip probability computed from the similarity between input frames; the top of the plot shows the corresponding input images, in which the character only blinks.

The U-Net and VAE engines are built with TensorRT. To squeeze out additional speed, the authors use static batch sizes and fixed input resolutions, which lets the computational graph and memory allocation be optimized for those specific sizes and reduces processing time. The downside is reduced flexibility: handling images of a different shape, or a different batch size, requires building a new engine tailored to those dimensions.
In addition, the system uses the Tiny AutoEncoder (TAESD), an optimized, lightweight alternative to the standard Stable Diffusion autoencoder. TAESD converts latents into full-size images quickly and performs decoding with significantly lower computational cost.
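For reference, with the diffusers library the tiny autoencoder can typically be swapped in along these lines (an illustrative snippet; model IDs and API details should be checked against the current diffusers documentation):

```python
import torch
from diffusers import AutoPipelineForImage2Image, AutoencoderTiny

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Replace the full Stable Diffusion VAE with the Tiny AutoEncoder (TAESD)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```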
Comparison with AutoPipelineForImage2Image: the proposed pipeline shows a significant speedup over Hugging Face's AutoPipelineForImage2Image. With TensorRT, StreamDiffusion is at least 13x faster for 10 denoising steps and up to 59.6x faster for a single denoising step. Without TensorRT, the speedup is still substantial: 29.7x for one denoising step and 8.3x for 10 steps.
StreamDiffusion with RCFG versus regular CFG: the extra computation for Self-Negative RCFG is minimal, so inference time is essentially unchanged. Onetime-Negative RCFG needs an additional U-Net pass only at the first step, so for a single denoising step its inference time is close to that of conventional CFG, but its advantage grows as the number of denoising steps increases. At five denoising steps, Self-Negative RCFG is 2.05x faster and Onetime-Negative RCFG is 1.79x faster than conventional CFG.
Figure: results with no CFG, standard CFG, and RCFG in its Self-Negative and Onetime-Negative variants. Compared to no CFG, using CFG strengthens the influence of the text prompt, and the proposed RCFG makes this influence even more pronounced. Both CFG and RCFG use a guidance scale of γ = 1.4; for RCFG, the first two rows use a residual noise magnitude coefficient δ = 1.0 and the third row uses δ = 0.5. With RCFG, the generated images match the prompt conditions better than with CFG. RCFG handles image modifications such as color changes or added elements better because it continually refers back to the latent of the input image and the initially sampled noise. This produces stronger prompt effects and more pronounced changes, although it can also increase image contrast; reducing the magnitude of the virtual residual noise vector (δ) mitigates this effect.