Improving Diffusers Package for High-Quality Image Generation | by Andrew Zhu | Apr, 2023

Goodbye Babel, generated by Andrew Zhu using Diffusers in pure Python

Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images using the Diffusion model. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often require more control over the image generation process. This is where the diffusers package from huggingface comes in, providing a way to run the Diffusion model in Python and allowing users to customize their models and prompts to generate images to their specific needs.

Despite its potential, the Diffusers package has several limitations that prevent it from generating images as good as those produced by the Stable Diffusion WebUI. The most significant of these limitations include:

  • The inability to use custom models in the .safetensor file format;
  • The 77 prompt token limitation;
  • A lack of LoRA support;
  • The absence of image scale-up functionality (also known as HighRes in Stable Diffusion WebUI);
  • Low performance and high VRAM usage by default.

This article aims to address these limitations and enable the Diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancement solutions provided, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility in their image generation processes while also achieving exceptional results. In the following sections, we'll explore the various techniques and strategies that can be used to overcome these limitations and unlock the full potential of the Diffusers package.

Note: if this is your first time running Stable Diffusion, please follow this link to install all required CUDA and Python packages.

1. Load Up Local Model files in .safetensor Format

Users can easily spin up diffusers to generate an image like this:

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")  # move the model to the GPU
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You may not be satisfied with either the output image or the performance. Let's deal with the problems one by one. First, let's load a custom model in .safetensor format located anywhere on your machine. You cannot simply load the model file like this:

pipeline = DiffusionPipeline.from_pretrained("/model/custom_model.safetensors")
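
A likely reason this fails is that from_pretrained expects a Hub repo id or a local folder in Diffusers layout, not a single checkpoint file. As a minimal sketch, you can check which kind of path you have before loading. The helper name looks_like_diffusers_folder is hypothetical, and the check assumes a Diffusers model folder keeps a model_index.json file at its top level:

```python
import os

def looks_like_diffusers_folder(path):
    """Return True if `path` is a directory that contains model_index.json.

    Assumption: a model saved in Diffusers format stores model_index.json at the
    top level of its folder, while a .safetensors checkpoint is a single file.
    """
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "model_index.json"))
```

If the check returns False for your path, the file most likely needs the conversion steps that follow.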

Here are the detailed steps to convert a .safetensor file to Diffusers format:

Step 1. Pull all diffusers code from GitHub

git clone

Step 2. Under the scripts folder, locate the file:

In your terminal, run this command to convert the .safetensor file to Diffusers format. Remember to change the --checkpoint_path value to match your case.

python --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path='D:\sd_models\deliberate_v2' --device='cuda:0'

Step 3. Now you can load the pipeline using the newly converted model. Here is the complete code:

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"  # the --dump_path folder from Step 2
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You should be able to convert and use any models you download from huggingface or

Cat playing piano generated by the above code

2. Improve the Performance of Diffusers

Generating high-quality images can be a time-consuming process even for the latest 3xxx and 4xxx Nvidia RTX GPUs. By default, the Diffusers package comes with non-optimized settings. Two solutions can be applied to greatly boost performance.

Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti (8 GB) when generating a 512×512 image.

  • Use Half Precision Weights

The first solution is to use half precision weights. Half precision weights use 16-bit floating-point numbers instead of the standard 32-bit numbers. This halves the memory required for storing weights and speeds up computation, which can significantly improve the performance of the Diffusers package.
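
To make the trade-off concrete, here is a small sketch using only the Python standard library. It compares the storage size of FP32 and FP16 values and shows the rounding that FP16 introduces. The helper names are hypothetical, and any parameter count you pass in is your own assumption about the model:

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # bytes used by one 32-bit float
BYTES_FP16 = struct.calcsize("e")  # bytes used by one 16-bit float

def weight_memory_mib(num_params, bytes_per_param):
    """Memory needed to store `num_params` weights, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)

def to_fp16_and_back(value):
    """Round a Python float to the nearest representable FP16 value."""
    return struct.unpack("e", struct.pack("e", value))[0]
```

Calling weight_memory_mib with the same parameter count and each byte size shows the FP16 figure is half the FP32 one, while to_fp16_and_back(0.1) returns a value that is close to, but not exactly, 0.1. That small rounding error is the price paid for the memory and speed gains.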

According to this video, reducing float precision from FP32 to FP16 will also enable the Tensor Cores.

I wrote another article that tests how much GPU Tensor Cores can boost computation speed.

Here is how to enable FP16 in diffusers. Just adding two lines of code boosts performance about 5x, with almost no impact on image quality.

from diffusers import DiffusionPipeline
import torch # <----- Line 1 added
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"  # path to the converted model folder
    ,torch_dtype = torch.float16 # <----- Line 2 Added
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

Now the iteration speed rises to 10.x iterations per second, about 5x faster.

  • Use Xformers

Xformers is an open-source library that provides a set of high-performance building blocks for Transformer models, including a memory-efficient attention implementation. It is built on top of PyTorch and can be easily integrated into existing pipelines. (Nowadays, are there any models that don't use Transformer? :P)

Install Xformers with pip install xformers, then we can switch diffusers to use xformers with one line of code.

pipeline.enable_xformers_memory_efficient_attention()  # <--- one line added

This one line of code boosts performance by another 20%.

3. Remove the 77 prompt tokens limitation

In the current version of Diffusers, there is a limit of 77 prompt tokens that can be used in the generation of images.

Fortunately, there is a solution to this problem. By using the “lpw_stable_diffusion” pipeline provided by the community, you can lift the 77 prompt token limitation and generate high-quality images with longer prompts.

To use the “lpw_stable_diffusion” pipeline, you can use the following code:

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2",  # path to the converted model folder
    custom_pipeline="lpw_stable_diffusion",  #<--- code added
)

In this code, we are initializing a new DiffusionPipeline object using the “from_pretrained” method. We specify the path to the pre-trained model and set the “custom_pipeline” argument to “lpw_stable_diffusion”. This tells Diffusers to use the “lpw_stable_diffusion” pipeline, which lifts the 77 prompt token limitation.
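
One way to get past a fixed 77-token window, and roughly the idea behind long-prompt pipelines, is to split the token ids into windows, encode each window separately, and concatenate the results. The sketch below shows only the splitting step, using plain integers. It is an illustration rather than the actual lpw_stable_diffusion code; the function name and the begin/end ids passed in are placeholders:

```python
def chunk_token_ids(token_ids, bos_id, eos_id, max_length=77):
    """Split a long list of token ids into windows of `max_length` ids.

    Each window reserves two slots for the begin and end markers, so it carries
    at most `max_length - 2` prompt tokens. Short windows are padded with
    `eos_id` (an assumption made for this sketch).
    """
    body = max_length - 2
    chunks = []
    for start in range(0, max(len(token_ids), 1), body):
        piece = token_ids[start:start + body]
        padding = [eos_id] * (body - len(piece))
        chunks.append([bos_id] + piece + padding + [eos_id])
    return chunks
```

With 100 prompt tokens, this produces two windows of 77 ids each: the first holds 75 prompt tokens, and the second holds the remaining 25 plus padding.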

Now, let's use a long prompt string to try it out. Here is the complete code:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"  # path to the converted model folder
    ,custom_pipeline = "lpw_stable_diffusion" #<--- code added
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
image = pipeline(prompt).images[0]
image.save("goodbye_babel_tower.png")

And you will get an image like this:

Goodbye Babel, generated by Andrew Zhu using diffusers

If you still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ) . Running this sequence through the model will result in indexing errors. That is normal; you can just ignore it.

4. Use Custom LoRA with Diffusers

Despite the claims of LoRA support in Diffusers, users still face limitations when it comes to loading local LoRA files in the .safetensor file format. This can be a significant obstacle for users who want to use LoRAs from the community.

To overcome this limitation, I have created a function that allows users to load LoRA files with a weight at runtime. This function can be used to load LoRA files and their corresponding weights into a Diffusers model, enabling the generation of high-quality images with LoRA data.
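
The core of weighted LoRA loading is a single update per layer: the product of the two small LoRA matrices, scaled by the weight, is added to the original layer weight. Here is a minimal sketch of that arithmetic with plain Python lists. The function names are hypothetical and the tiny matrices are for illustration only:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_up, lora_down, alpha):
    """Return weight + alpha * (lora_up @ lora_down) without modifying the inputs."""
    delta = matmul(lora_up, lora_down)
    return [
        [w + alpha * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

For example, merging lora_up = [[1.0], [2.0]] and lora_down = [[3.0, 4.0]] into a 2×2 identity weight with alpha = 0.5 adds half of [[3, 4], [6, 8]] to the identity matrix. Because the update is additive, applying several LoRAs one after another simply stacks their deltas.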

Here is the function body:

import torch
from safetensors.torch import load_file

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'

    alpha = lora_weight
    visited = []

    # directly update weight in diffusers model
    for key in state_dict:

        # as we have set the alpha beforehand, so just skip
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER+'_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET+'_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                # the attribute name itself contains '_', so keep joining parts
                if len(temp_name) > 0:
                    temp_name += '_'+layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline

The logic is extracted from a script in the diffusers git repo.
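
The least obvious part of the function is the loop that finds the target layer. LoRA keys flatten the module path with underscores, but some attribute names contain underscores themselves, so the loop keeps joining name parts until an attribute matches. Here is a standalone sketch of that greedy lookup on plain Python objects. The name resolve_layer and the toy object tree in the usage note are hypothetical:

```python
def resolve_layer(root, flat_name):
    """Walk from `root` to a nested attribute named by an underscore-joined path.

    Parts are joined greedily: if "to" is not an attribute, try "to_q", and so on.
    """
    parts = flat_name.split("_")
    current = root
    name = parts.pop(0)
    while True:
        if hasattr(current, name):
            current = getattr(current, name)
            if not parts:
                return current
            name = parts.pop(0)
        else:
            if not parts:
                raise AttributeError(f"cannot resolve {flat_name!r} at {name!r}")
            name += "_" + parts.pop(0)
```

Given an object tree where root.down_blocks has a child named "0" with an attn1.to_q attribute, resolve_layer(root, "down_blocks_0_attn1_to_q") first fails on "down", succeeds on "down_blocks", and later joins "to" and "q" into "to_q" in the same way.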

Take one of the well-known LoRAs, MoXin, as an example. You can use the __load_lora function like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"  # path to the converted model folder
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
lora = (r"D:\sd_models\Lora\Moxin_10.safetensors",0.8)
pipeline = __load_lora(pipeline=pipeline,lora_path=lora[0],lora_weight=lora[1])
pipeline.to("cuda")

prompt = """
shukezouma,negative space,shuimobysim
a branch of flower, traditional chinese ink painting
"""
image = pipeline(prompt).images[0]
image.save("a branch of flower.png")

The prompt will generate an image like this:

a branch of flower, generated by Andrew Zhu using diffusers

You can call __load_lora() multiple times to load multiple LoRAs for one generation.

With this function, you can now load LoRA files with weights at runtime and use them to generate high-quality images with Diffusers. LoRA loading is pretty fast, usually taking only 1–2 seconds, far better than converting the LoRA and using the result (which will generate another model file of GB size).

5. Use Custom Textual Inversions with Diffusers

Using custom Textual Inversions with the Diffusers package can be a powerful way to generate high-quality images. However, the official documentation of Diffusers suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.

So I investigated it and found a solution that enables diffusers to use a textual inversion just like in Stable Diffusion WebUI. Below is the function I created to load a custom Textual Inversion.

import torch

def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    '''
    Use this function to load textual inversion model in model initialization stage
    or image generation stage.
    '''
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token in tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        #print(f"The tokenizer already contains the token {token}.The new token will replace the previous one")
        raise ValueError(f"The tokenizer already contains the token {token}. Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds
    return (tokenizer,text_encoder)

In the load_textual_inversion() function, you need to provide the following arguments:

  • learned_embeds_path: Path to the pre-trained textual inversion model file in .pt or .bin format.
  • text_encoder: Text encoder object obtained from the Diffusion Pipeline.
  • tokenizer: Tokenizer object obtained from the Diffusion Pipeline.
  • token: Optional argument specifying the prompt token. By default, it is set to None. It is the keyword that triggers the textual inversion in your prompt.
  • weight: Optional argument specifying the weight of the textual inversion. By default, I set it to 0.5. You can change it to another value as needed.
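
Conceptually, loading a textual inversion comes down to three steps: register a new token, grow the embedding table by one row, and store the scaled learned vector in that row. The toy sketch below mirrors those steps with a dictionary and a list in place of the real tokenizer and text encoder. The name add_token_embedding is hypothetical:

```python
def add_token_embedding(vocab, embedding_table, token, embedding, weight=0.5):
    """Register `token` and append its scaled embedding; return the new token id.

    `vocab` maps token strings to row indices of `embedding_table`.
    Raises ValueError if the token already exists, as the real function does.
    """
    if token in vocab:
        raise ValueError(f"The vocabulary already contains the token {token}.")
    token_id = len(embedding_table)            # the new row goes at the end
    vocab[token] = token_id
    embedding_table.append([weight * v for v in embedding])
    return token_id
```

After the call, any prompt containing the new token looks up the appended row, which is how a single keyword such as styleempire can steer the generated style.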

Now you can use the function with a diffusers pipeline like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"  # path to the converted model folder
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
    ,safety_checker = None
)

textual_inversion_path = r""

tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
    learned_embeds_path = textual_inversion_path
    , tokenizer = tokenizer
    , text_encoder = text_encoder
    , token = 'styleempire'
)
pipeline.to("cuda")  # the CUDA generator below needs the model on the GPU

prompt = """
styleempire,award winning beautiful street, storm,((dark storm clouds))
, fluffy clouds in the sky, shaded flat illustration, digital art
, trending on artstation, highly detailed, fine detail, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((bad art)), ((deformed)), ((poorly drawn))
, ((extra limbs)), ((close up)), ((b&w)), weird colors, blurry
, hat, cap, glasses, sunglasses, lightning, face
"""

generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    ,negative_prompt =neg_prompt
    ,generator = generator
).images[0]

Here is the result of applying an Empire Style Textual Inversion.

The modern street on the left turns into an old London style.

6. Upscale Images

The Diffusers package is great for generating high-quality images, but image upscaling is not its primary function. However, the Stable-Diffusion-WebUI offers a feature called HighRes, which allows users to upscale their generated images to 2x or 4x. It would be great if Diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images to 2x or 4x after they are generated.

To use the SwinIR model for image upscaling, we can use the code from the GitHub repository of JingyunLiang/SwinIR. If you just want the code, downloading models/, utils/ and is enough. Following the readme guideline, you can upscale images like magic.

Here is a sample of how well SwinIR can scale up an image.

Left: original image, Right: 4x SwinIR upscaled image

Many other open-source solutions can be used to improve image quality. Here are three other models that I tried that return wonderful results.

RealSR can scale up an image 4 times almost as well as SwinIR, and its execution is the fastest: instead of invoking PyTorch and CUDA, the author compiles the code and CUDA usage directly to a binary. In my observations, RealSR can upscale an image in just about 2–4 seconds.

CodeFormer is good at restoring blurred or broken faces; it can also remove noise and enhance background details. This solution and algorithm is widely used in other applications, including Stable-Diffusion-WebUI.

GFPGAN is another powerful open-source solution that achieves excellent results in face restoration, and it is fast too. GFPGAN is also integrated into Stable-Diffusion-WebUI.

7. Optimize Diffusers CUDA Memory Usage

When using Diffusers to generate images, it is important to consider CUDA memory usage, especially when you want to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you may encounter a RuntimeError: CUDA out of memory because the Diffusers model is still occupying CUDA memory.

To mitigate this issue, there are several ways to optimize CUDA memory usage. The following two solutions are the ones I found work best:

  • Sliced Attention for Additional Memory Savings

Sliced attention is a technique that reduces the memory usage of the self-attention mechanism in transformers. By partitioning the attention computation into smaller blocks, the peak memory requirement is reduced. This technique can be used with the Diffusers package to reduce the memory footprint of the Diffusers model.
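
To see why slicing saves memory without changing the result, here is a pure-Python sketch. It computes attention for a few query rows at a time, so only a slice of the score matrix exists at any moment. This is an illustration of the general idea under that assumption; the function names are hypothetical, and Diffusers may slice along a different dimension internally:

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        probs = softmax(scores)
        outputs.append(
            [sum(p * v[j] for p, v in zip(probs, values)) for j in range(len(values[0]))]
        )
    return outputs

def sliced_attention(queries, keys, values, slice_size):
    """Same result as attention(), computed `slice_size` query rows at a time."""
    outputs = []
    for start in range(0, len(queries), slice_size):
        outputs.extend(attention(queries[start:start + slice_size], keys, values))
    return outputs
```

Each query row's softmax depends only on that row's scores, so processing the rows in slices gives the same output as processing them all at once. The cost is a little extra loop overhead in exchange for a smaller peak memory footprint.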

To use it in Diffusers, simply add one line of code:

pipeline.enable_attention_slicing()

  • Model Offloading to CPU

Usually, you won't have two models running at the same time. The idea is to offload the model data to CPU memory temporarily, freeing up CUDA memory for other models, and only load it into VRAM when you start using the model.

To dynamically offload data to CPU memory in Diffusers, use this line of code:

pipeline.enable_model_cpu_offload()

After applying this, whenever Diffusers finishes an image generation task, the model data is automatically offloaded to CPU memory until the next call.


This article discusses how to improve the performance and capabilities of the Diffusers package. It covers several solutions to common issues faced by Diffusers users, including loading local .safetensor models, boosting performance, removing the 77 prompt tokens limitation, using custom LoRA and Textual Inversion, upscaling images, and optimizing CUDA memory usage.

By applying these solutions, Diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.

If you can successfully apply these solutions and code in your case, there is an additional benefit, one I have benefited from a lot: you can implement your own solutions by reading the Diffusers source code and better understand how Stable Diffusion works. To me, learning, finding, and implementing these solutions has been a fun journey. I hope these solutions help you too, and I wish you enjoyment with Stable Diffusion and the diffusers package.

Here is the prompt that generates the heading image:

Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine

Size: 600 * 800
Seed: 3977059881
Scheduler (or Sampling method): DPMSolverMultistepScheduler
Sampling steps: 25
CFG Scale (or Guidance Scale): 7.5
SwinIR model: 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth

License and Code Reuse

The solutions provided in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article are the only working versions across the internet.

If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference this Medium article. The code presented here is licensed under the MIT license, which allows you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.

Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and are subject to change as new developments and improvements are made. It is always recommended to thoroughly test and validate any code before implementing it in a production environment.

