StreamDiffusion, a pipeline for real-time interactive generation
StreamDiffusion is a new diffusion pipeline for real-time interactive image generation, built for live streaming and similar high-throughput scenarios. It replaces traditional sequential denoising with a batched denoising process (Stream Batch) and adds a parallel input/output queue for smoother operation.
The pipeline also introduces Residual Classifier-Free Guidance (RCFG), which reduces the number of negative-conditioning U-Net passes needed for guidance, and a Stochastic Similarity Filter that improves energy efficiency by skipping redundant frames. Overall, Stream Batch delivers up to 1.5x faster processing than sequential denoising, RCFG adds up to a 2.05x speedup over conventional CFG, and the full pipeline reaches 91.07 frames per second on an RTX 4090 GPU. Power consumption is also significantly reduced, making StreamDiffusion a more efficient solution for real-time image generation.
In this approach, instead of waiting until one image is fully denoised before processing the next input, the pipeline takes in the next input image after each denoising step. This creates a denoising batch in which the denoising stages of consecutive images are interleaved. Combining the denoising steps into a single batch lets a continuous input stream be processed efficiently with one batched U-Net pass per step: the input image encoded at time step t is generated and decoded at time step t + n, where n is the number of denoising steps.
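The mechanics of such a rolling denoising batch can be sketched roughly as follows (illustrative pseudocode only, not StreamDiffusion's actual API; `encode`, `decode`, `unet_step`, and `timestep_for` are placeholder stand-ins):

```python
import collections
import torch

n_steps = 4                              # number of denoising steps (n)
batch = collections.deque()              # latents at staggered denoising stages

# --- placeholder components (stand-ins for the real VAE encoder/decoder and U-Net) ---
def encode(frame):      return torch.randn(4, 64, 64)      # VAE encode -> latent
def decode(latent):     return latent                      # VAE decode -> image
def timestep_for(step): return n_steps - step              # scheduler timestep lookup
def unet_step(latents, timesteps):                         # one batched denoising pass
    return [lat * 0.9 for lat in latents]                  # dummy "denoising"

def stream_batch_tick(new_frame):
    """One pipeline tick: admit a new frame and advance every latent one step."""
    batch.appendleft({"latent": encode(new_frame), "step": 0})

    # A single batched U-Net call advances every element by one denoising step.
    denoised = unet_step([b["latent"] for b in batch],
                         [timestep_for(b["step"]) for b in batch])
    for item, lat in zip(batch, denoised):
        item["latent"], item["step"] = lat, item["step"] + 1

    # The oldest element has now gone through all n steps and can be decoded.
    if batch and batch[-1]["step"] == n_steps:
        return decode(batch.pop()["latent"])   # image that entered n ticks earlier
    return None
```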
Figure: the orange vectors depict the virtual residual noise, which starts from the PF ODE (probability flow ordinary differential equation) trajectory and points toward the original input latent x0.
Traditional diffusion models rely on sequential denoising steps, so processing time grows linearly with the number of steps, dominated by the repeated U-Net passes. Higher-fidelity images require more denoising steps, which means longer latency.
Stream Batch addresses this by replacing sequential denoising with a batched process. Each batch holds several consecutive input images at staggered stages of denoising, and every element in the batch advances one step of the denoising sequence per U-Net pass. In this way, the input image received at one time step is carried smoothly through the batch and emerges as a finished image a fixed number of time steps later.
This approach removes the need for multiple U-Net inferences per frame and avoids the linear growth of processing time with the number of steps. The main trade-off shifts from processing time versus generation quality to VRAM capacity versus generation quality: with enough VRAM, high-quality images can be produced with a single batched U-Net pass per frame, largely eliminating the latency growth caused by additional denoising steps.
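As a back-of-the-envelope comparison (an illustrative approximation that ignores the overhead of running the U-Net at a larger batch size; here $t_U$ is the time of one U-Net pass and $n$ the number of denoising steps):

$$
T_{\text{sequential}} \approx n \cdot t_U \quad \text{per frame},
\qquad
T_{\text{Stream Batch}} \approx t_U^{(\text{batch}=n)} \quad \text{per frame},
\qquad
\text{VRAM} \propto n .
$$

Each individual image still needs $n$ pipeline ticks from input to output, but throughput is now bounded by a single batched U-Net pass per tick instead of $n$ sequential passes.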
Residual Classifier-Free Guidance

Traditional Classifier-Free Guidance (CFG) improves image generation by strengthening the effect of the conditioning, but it is computationally expensive because it requires additional U-Net passes for every output. Residual Classifier-Free Guidance (RCFG) addresses this by introducing the concept of virtual residual noise, which points toward the latent representation of the original input image at a given point in the generation process. This makes it possible to steer the generated image away from the original input by a controllable amount without any additional U-Net computation; this variant is referred to as Self-Negative RCFG.
In addition, RCFG can be used to deviate from any negative condition by computing the negative-condition residual noise only once, at the first denoising step, and reusing it throughout the process (Onetime-Negative RCFG). This significantly reduces the computational burden, requiring only n or n+1 U-Net computations for Self-Negative and Onetime-Negative RCFG, respectively, compared to the 2n computations required for conventional CFG. RCFG is therefore noticeably more efficient, while preserving or even improving the quality of the generated images.
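A simplified sketch of the per-step difference might look like this (illustrative only, under a generic DDPM-style parameterization with cumulative alphas `alpha_bar_t`; the exact equations in the paper may differ, and `unet` is just a callable placeholder):

```python
import torch

def cfg_step(unet, x_t, t, cond, neg_cond, gamma=1.4):
    """Conventional CFG: two U-Net passes per denoising step (2n passes for n steps)."""
    eps_cond = unet(x_t, t, cond)
    eps_neg = unet(x_t, t, neg_cond)       # the extra, negative-condition pass
    return eps_neg + gamma * (eps_cond - eps_neg)

def self_negative_rcfg_step(unet, x_t, t, cond, x0_latent, alpha_bar_t,
                            gamma=1.4, delta=1.0):
    """Self-Negative RCFG: a single U-Net pass (n passes for n steps).

    The negative-condition noise is replaced by 'virtual residual noise' pointing
    from the current point on the trajectory back toward the input latent x0,
    scaled by a magnitude coefficient delta.
    """
    eps_cond = unet(x_t, t, cond)                                      # only U-Net pass
    eps_virtual = (x_t - alpha_bar_t.sqrt() * x0_latent) / (1 - alpha_bar_t).sqrt()
    return delta * eps_virtual + gamma * (eps_cond - delta * eps_virtual)
```

Onetime-Negative RCFG would instead spend one extra U-Net call at the first step to compute the negative-condition noise, cache it, and reuse it for the remaining steps, which is where the n+1 count comes from.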
Converting input images into the tensor format the pipeline works with, and converting decoded tensors back into output images, takes a non-trivial amount of additional processing time. To keep this image handling out of the main bottleneck, the neural-network inference itself, pre- and post-processing are moved into separate threads so that they run in parallel. In addition, an Input Tensor Queue compensates for input frames that arrive late because of device faults or communication errors, keeping the stream smooth.
High-speed image generation is further optimized by moving work that does not require the neural network, such as image pre- and post-processing, into parallel processing outside the main pipeline. Input images are resized, converted to tensors, and normalized. To reconcile the mismatch between the rate at which humans supply input and the model's throughput, the authors implemented a system of input/output queues. Input tensors are queued for the diffusion model; the resulting latents pass through the variational autoencoder (VAE) decoder to form images, and the decoded tensors go to the output queue for post-processing and format conversion before being sent to the rendering client. This strategy improves overall efficiency and speeds up image generation.
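In Python this pattern could be organized roughly as follows (a minimal sketch using the standard `threading` and `queue` modules; `resize`, `to_tensor`, `to_image`, `pipeline`, `frame_source`, and `render_client` are hypothetical placeholders):

```python
import queue
import threading

input_queue = queue.Queue(maxsize=8)    # pre-processed input tensors waiting for the model
output_queue = queue.Queue(maxsize=8)   # decoded tensors waiting for post-processing

def preprocess_worker(frame_source):
    """Runs in its own thread: resize, tensor conversion, normalization."""
    for frame in frame_source:
        input_queue.put(to_tensor(resize(frame)))        # placeholder helpers

def inference_worker():
    """Main pipeline thread: only neural-network work happens here."""
    while True:
        x = input_queue.get()            # tolerates late frames without stalling the stream
        output_queue.put(pipeline(x))    # diffusion model + VAE decode (placeholder)

def postprocess_worker(render_client):
    """Runs in its own thread: tensor-to-image conversion, then hand-off to the client."""
    while True:
        render_client.send(to_image(output_queue.get()))  # placeholder helpers

# Each worker would be started with threading.Thread(target=..., daemon=True).start()
```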
The core of the diffusion pipeline consists of the VAE and the U-Net. By combining batched denoising with pre-computed caches for the text prompt embeddings, the sampled noise, and the scheduler values, the inference pipeline gains enough speed for real-time image generation. The Stochastic Similarity Filter (SSF) is designed to reduce GPU power consumption: it dynamically decides whether the diffusion model needs to run at all for a given frame. Together, these components provide fast and energy-efficient real-time output.
To avoid generating redundant images and wasting GPU resources in scenarios with minimal change or a static environment, the authors propose the Stochastic Similarity Filter. It computes the cosine similarity between the current input image and a previous reference image and, based on that similarity, derives a probability of skipping the VAE and U-Net stages to cut unnecessary computation. When the similarity is high, indicating little change between frames, the pipeline is more likely to skip processing, saving computational resources. This mechanism works well both in dynamic scenes, where every frame is processed, and in static scenes, where the processing rate can drop. Unlike a hard threshold, which can make the video cut out abruptly, the probabilistic approach of SSF yields smoother generation, adapting to the changing dynamics of the scene without hurting the perceived result.
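A minimal sketch of the filter's core decision (the mapping from cosine similarity to skip probability used here is an assumption for illustration, not the paper's exact formula):

```python
import torch
import torch.nn.functional as F

def should_skip(current: torch.Tensor, reference: torch.Tensor) -> bool:
    """Probabilistically skip U-Net/VAE processing for near-identical frames.

    The closer the cosine similarity between the current input and the reference
    frame is to 1, the higher the probability of skipping this frame.
    """
    sim = F.cosine_similarity(current.flatten(), reference.flatten(), dim=0)
    skip_prob = torch.clamp((sim - 0.98) / (1.0 - 0.98), 0.0, 1.0)  # assumed mapping
    return bool(torch.rand(()) < skip_prob)
```

Because the decision is random rather than a hard cutoff, a nearly static scene still lets occasional frames through, which is what keeps the output from looking choppy.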
Precomputation

Optimizing the U-Net for interactive or streaming use cases starts with pre-computing and caching the embedding of the text prompt, which stays constant across frames. The cached embedding is used to compute the prompt's key-value pairs inside the U-Net, and these are stored and only recomputed when the prompt changes. In addition, the Gaussian noise for each denoising step is sampled once and cached, so the same noise is used at every time step, which improves consistency in image-to-image tasks. The noise strength coefficients for each denoising step are also pre-computed, reducing overhead at high frame rates. For Latent Consistency Models, the required coefficient functions are pre-computed for all denoising steps or set to constants, avoiding repeated computation during inference.
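A rough sketch of what such a cache might hold (attribute names and constructor arguments are invented for illustration, not StreamDiffusion's actual internals; `text_encoder` and `scheduler` are placeholders):

```python
import torch

class PrecomputedCache:
    """Caches per-stream constants so nothing is recomputed inside the frame loop."""

    def __init__(self, text_encoder, scheduler, prompt, n_steps, latent_shape):
        # 1. Text prompt embedding: computed once, reused for every frame.
        self.prompt_embeds = text_encoder(prompt)

        # 2. Gaussian noise: pre-sampled once per denoising step, so the added noise
        #    is identical across frames, stabilizing image-to-image streaming.
        self.noise = torch.randn(n_steps, *latent_shape)

        # 3. Scheduler values (timesteps / noise-strength coefficients) for each step.
        self.timesteps = list(scheduler.timesteps)[:n_steps]

    def set_prompt(self, text_encoder, prompt):
        # The text encoder only runs again when the prompt actually changes.
        self.prompt_embeds = text_encoder(prompt)
```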
Figure: comparison of GPU utilization in a static scene (GPU: RTX 3060, 20 frames). The blue line is GPU utilization with SSF, the orange line is GPU utilization without SSF, and the red line is the skip probability computed from the similarity between input frames; the top of the plot shows the corresponding input images, in which the character only blinks.

The U-Net and VAE engines are built with TensorRT. To squeeze out additional speed, the authors use static batch sizes and fixed input resolutions, which lets the computational graph and memory allocation be optimized for those specific sizes and reduces processing time. The downside is reduced flexibility: handling images of a different shape, or a different batch size, requires building a new engine tailored to those dimensions.
In addition, the system uses the Tiny AutoEncoder (TAESD), an optimized, lightweight alternative to the standard Stable Diffusion autoencoder. TAESD converts latents into full-size images quickly and performs decoding with significantly lower computational cost.
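For reference, with the diffusers library the tiny autoencoder can typically be swapped in along these lines (an illustrative snippet; model IDs and API details should be checked against the current diffusers documentation):

```python
import torch
from diffusers import AutoPipelineForImage2Image, AutoencoderTiny

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Replace the full Stable Diffusion VAE with the Tiny AutoEncoder (TAESD)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```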
Comparison with AutoPipelineForImage2Image: the proposed pipeline shows a significant speedup over Hugging Face's AutoPipelineForImage2Image. With TensorRT, StreamDiffusion is at least 13x faster for 10 denoising steps and up to 59.6x faster for a single denoising step. Without TensorRT, the speedup is still substantial: 29.7x for one denoising step and 8.3x for 10 steps.
StreamDiffusion with RCFG versus regular CFG: the extra computation for Self-Negative RCFG is minimal, so inference time is essentially unchanged. Onetime-Negative RCFG needs an additional U-Net pass only at the first step, so for a single denoising step its inference time is close to that of conventional CFG, but its advantage grows as the number of denoising steps increases. At five denoising steps, Self-Negative RCFG is 2.05x faster and Onetime-Negative RCFG is 1.79x faster than conventional CFG.
Figure: results with no CFG, standard CFG, and RCFG in its Self-Negative and Onetime-Negative variants. Compared to no CFG, using CFG strengthens the influence of the text prompt, and the proposed RCFG makes this influence even more pronounced. Both CFG and RCFG use a guidance scale of γ = 1.4; for RCFG, the first two rows use a residual noise magnitude coefficient δ = 1.0 and the third row uses δ = 0.5. With RCFG, the generated images match the prompt conditions better than with CFG. RCFG handles image modifications such as color changes or added elements better because it continually refers back to the latent of the input image and the initially sampled noise. This produces stronger prompt effects and more pronounced changes, although it can also increase image contrast; reducing the magnitude of the virtual residual noise vector (δ) mitigates this effect.