# v0.9.0: Stable Diffusion 2
🎨 Stable Diffusion 2 is here!
## Installation

```bash
pip install "diffusers[torch]==0.9" transformers
```
Stable Diffusion 2.0 is available in several flavors:
### Stable Diffusion 2.0-V at 768x768

New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. It has the same number of U-Net parameters as version 1.5, but uses OpenCLIP-ViT/H as its text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model (see the sketch after the snippet below).
```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2"
# Load the fp16 weights and swap in the DPM-Solver++ multistep scheduler,
# which gives good samples in ~25 steps.
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
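For reference, "v-prediction" refers to the parameterization from Salimans & Ho (2022), where the network predicts a velocity-like target instead of the noise. A minimal sketch of the target follows; the helper below is hypothetical and purely illustrative, since the diffusers schedulers handle this internally whenever their config has `prediction_type="v_prediction"`.

```python
import torch

# Hypothetical helper illustrating the v-prediction target
#     v = alpha_t * eps - sigma_t * x0   (Salimans & Ho, 2022),
# with alpha_t = sqrt(alphas_cumprod[t]) and sigma_t = sqrt(1 - alphas_cumprod[t]).
def v_prediction_target(x0: torch.Tensor, eps: torch.Tensor, alphas_cumprod: torch.Tensor, t: int) -> torch.Tensor:
    alpha_t = alphas_cumprod[t].sqrt()
    sigma_t = (1.0 - alphas_cumprod[t]).sqrt()
    return alpha_t * eps - sigma_t * x0

# Given a model output v at the noisy sample x_t, the denoised estimate is
# x0 = alpha_t * x_t - sigma_t * v, which is what the schedulers compute
# when configured for v-prediction.
```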
### Stable Diffusion 2.0-base at 512x512

The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# The base model generates at 512x512 by default.
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
### Stable Diffusion 2.0 for Inpainting

This model for text-guided inpainting is finetuned from SD 2.0-base. It follows the mask-generation strategy presented in LAMA, which, in combination with the latent VAE representation of the masked image, is used as additional conditioning (see the sketch after the snippet below).
```python
import PIL
import requests
import torch
from io import BytesIO

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

# White pixels in the mask are repainted; black pixels are preserved.
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
image.save("yellow_cat.png")
```
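Under the hood, the inpainting UNet takes extra input channels: the mask (downsampled to latent resolution) and the VAE latents of the masked image are concatenated with the noisy latents along the channel dimension. A rough sketch of the shapes involved, with dummy tensors and not the pipeline's exact internals:

```python
import torch

# Dummy tensors at latent resolution (a 512x512 image maps to 64x64 latents).
latents = torch.randn(1, 4, 64, 64)               # noisy latents being denoised
mask = torch.ones(1, 1, 64, 64)                   # binary mask, downsampled to latent size
masked_image_latents = torch.randn(1, 4, 64, 64)  # VAE encoding of image * (1 - mask)

# The conditioning is channel concatenation: the inpainting UNet expects
# 4 + 1 + 4 = 9 input channels instead of the usual 4.
unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])
```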
### Stable Diffusion X4 Upscaler

The model was trained on crops of size 512x512 and is a text-guided latent upscaling diffusion model. In addition to the textual input, it receives a `noise_level` as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule (see the example after the snippet below).
```python
import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import StableDiffusionUpscalePipeline

model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# Fetch a low-resolution image; the model upscales it 4x (128x128 -> 512x512).
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")
```
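The `noise_level` argument mentioned above can be passed directly to the pipeline call; higher values add more noise to the low-resolution input before upscaling. A quick variation of the snippet above (the value 50 is just an illustrative choice):

```python
# Add more noise to the low-res input before upscaling; noise_level indexes
# into the model's low-resolution noising schedule.
upscaled_noisier = pipeline(prompt=prompt, image=low_res_img, noise_level=50).images[0]
upscaled_noisier.save("upsampled_cat_noise50.png")
```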
## Saving & Loading is fixed for Versatile Diffusion

Previously there was a 🐛 when saving & loading Versatile Diffusion - this is fixed now so that memory-efficient saving & loading works as expected.
- [Versatile Diffusion] Fix remaining tests by @patrickvonplaten in #1418
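For reference, the save/load round-trip that previously broke looks like this. A minimal sketch: the repo id `shi-labs/versatile-diffusion` and the local path are illustrative choices, not part of the fix itself.

```python
import torch
from diffusers import VersatileDiffusionPipeline

pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
# Round-trip: save to disk, then reload - this previously failed for
# Versatile Diffusion and works as expected as of this release.
pipe.save_pretrained("./versatile-diffusion-local")
pipe = VersatileDiffusionPipeline.from_pretrained("./versatile-diffusion-local", torch_dtype=torch.float16)
```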
## 📝 Changelog
- add v prediction by @patil-suraj in #1386
- Adapt UNet2D for super-resolution by @patil-suraj in #1385
- Version 0.9.0.dev0 by @anton-l in #1394
- Make height and width optional by @patrickvonplaten in #1401
- [Config] Add optional arguments by @patrickvonplaten in #1395
- Upscaling fixed by @patrickvonplaten in #1402
- Add the new SD2 attention params to the VD text unet by @anton-l in #1400
- Deprecate sample size by @patrickvonplaten in #1406
- Support SD2 attention slicing by @anton-l in #1397
- Add SD2 inpainting integration tests by @anton-l in #1412
- Fix sample size conversion script by @patrickvonplaten in #1408
- fix clip guided by @patrickvonplaten in #1414
- Fix all stable diffusion by @patrickvonplaten in #1415
- [MPS] call contiguous after permute by @kashif in #1411
- Deprecate `predict_epsilon` by @pcuenca in #1393
- Fix ONNX conversion and inference by @anton-l in #1416
- Allow to set config params directly in init by @patrickvonplaten in #1419
- Add tests for Stable Diffusion 2 V-prediction 768x768 by @anton-l in #1420
- StableDiffusionUpscalePipeline by @patil-suraj in #1396
- added initial v-pred support to DPM-solver by @kashif in #1421
- SD2 docs by @patrickvonplaten in #1424