Commit

Merge branch 'main' into fix-fast-tests-sayak

sayakpaul authored Dec 16, 2024
2 parents 78ec48d + 5a196e3 commit a2211b2
Showing 41 changed files with 4,225 additions and 92 deletions.
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
```diff
@@ -284,6 +284,8 @@
       title: PriorTransformer
     - local: api/models/sd3_transformer2d
       title: SD3Transformer2DModel
+    - local: api/models/sana_transformer2d
+      title: SanaTransformer2DModel
     - local: api/models/stable_audio_transformer
       title: StableAudioDiTModel
     - local: api/models/transformer2d
@@ -434,6 +436,8 @@
       title: PixArt-α
     - local: api/pipelines/pixart_sigma
       title: PixArt-Σ
+    - local: api/pipelines/sana
+      title: Sana
     - local: api/pipelines/self_attention_guidance
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
```
34 changes: 34 additions & 0 deletions docs/source/en/api/models/sana_transformer2d.md
@@ -0,0 +1,34 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# SanaTransformer2DModel

A Diffusion Transformer model for 2D data, introduced in [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han from NVIDIA and MIT HAN Lab.

The abstract from the paper is:

*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import SanaTransformer2DModel

transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", torch_dtype=torch.float16)
```
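
The explicitly loaded transformer can also be swapped into a pipeline via the standard `diffusers` component-override pattern; a minimal sketch (the checkpoint name matches the snippet above):

```python
import torch

from diffusers import SanaPipeline, SanaTransformer2DModel

transformer = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    subfolder="transformer",
    torch_dtype=torch.float16,
)
# Load the remaining pipeline components and plug in the transformer loaded above.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.float16,
)
```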

## SanaTransformer2DModel

[[autodoc]] SanaTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
65 changes: 65 additions & 0 deletions docs/source/en/api/pipelines/sana.md
@@ -0,0 +1,65 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# SanaPipeline

[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) is from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han.

The abstract from the paper is:

*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` |
| [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | `torch.float16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) collection for more information.

<Tip>

Pass the `variant` argument when downloading checkpoints to use less disk space: set it to `"fp16"` for models whose recommended dtype is `torch.float16`, and to `"bf16"` for models whose recommended dtype is `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which take twice the disk space. Additionally, `torch.float32` weights can be downcast on the fly by passing the `torch_dtype` argument. Read about it in the [docs](https://huggingface.co/docs/diffusers/v0.31.0/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).

</Tip>
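
For example, a minimal sketch of loading one of the `fp16` checkpoints above (assumes a CUDA device; the prompt and output path are illustrative):

```python
import torch

from diffusers import SanaPipeline

# Download only the fp16 variant and load it directly in half precision.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(prompt="a tiny astronaut hatching from an egg on the moon").images[0]
image.save("sana.png")
```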

## SanaPipeline

[[autodoc]] SanaPipeline
- all
- __call__

## SanaPAGPipeline

[[autodoc]] SanaPAGPipeline
- all
- __call__

## SanaPipelineOutput

[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput
11 changes: 5 additions & 6 deletions examples/community/pipeline_flux_rf_inversion.py
```diff
@@ -648,6 +648,8 @@ def __call__(
         height: Optional[int] = None,
         width: Optional[int] = None,
         eta: float = 1.0,
+        decay_eta: Optional[bool] = False,
+        eta_decay_power: Optional[float] = 1.0,
         strength: float = 1.0,
         start_timestep: float = 0,
         stop_timestep: float = 0.25,
@@ -880,12 +882,9 @@ def __call__(
                 v_t = -noise_pred
                 v_t_cond = (y_0 - latents) / (1 - t_i)
                 eta_t = eta if start_timestep <= i < stop_timestep else 0.0
-                if start_timestep <= i < stop_timestep:
-                    # controlled vector field
-                    v_hat_t = v_t + eta * (v_t_cond - v_t)
-
-                else:
-                    v_hat_t = v_t
+                if decay_eta:
+                    eta_t = eta_t * (1 - i / num_inference_steps) ** eta_decay_power  # Decay eta over the loop
+                v_hat_t = v_t + eta_t * (v_t_cond - v_t)
 
                 # SDE Eq: 17 from https://arxiv.org/pdf/2410.10792
                 latents = latents + v_hat_t * (sigmas[i] - sigmas[i + 1])
```
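
The change above folds the previous if/else into a single controlled-vector-field update and optionally decays `eta_t` over the sampling loop. For intuition, a standalone sketch of the schedule (all values here are illustrative, not pipeline defaults):

```python
# Sketch of the eta schedule introduced above; values are hypothetical.
eta = 1.0
num_inference_steps = 28
start_timestep, stop_timestep = 0, 7  # step indices bounding the controlled range
decay_eta = True
eta_decay_power = 1.0

for i in range(num_inference_steps):
    # eta_t is non-zero only inside the controlled range...
    eta_t = eta if start_timestep <= i < stop_timestep else 0.0
    if decay_eta:
        # ...and decays polynomially toward 0 as i approaches num_inference_steps.
        eta_t = eta_t * (1 - i / num_inference_steps) ** eta_decay_power
    # The pipeline then computes v_hat_t = v_t + eta_t * (v_t_cond - v_t).
```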
2 changes: 2 additions & 0 deletions examples/flux-control/README.md
````diff
@@ -36,6 +36,7 @@ accelerate launch train_control_lora_flux.py \
   --max_train_steps=5000 \
   --validation_image="openpose.png" \
   --validation_prompt="A couple, 4k photo, highly detailed" \
+  --offload \
   --seed="0" \
   --push_to_hub
 ```
@@ -154,6 +155,7 @@ accelerate launch --config_file=accelerate_ds2.yaml train_control_flux.py \
   --validation_steps=200 \
   --validation_image "2_pose_1024.jpg" "3_pose_1024.jpg" \
   --validation_prompt "two friends sitting by each other enjoying a day at the park, full hd, cinematic" "person enjoying a day at the park, full hd, cinematic" \
+  --offload \
   --seed="0" \
   --push_to_hub
 ```
````
13 changes: 10 additions & 3 deletions examples/flux-control/train_control_flux.py
```diff
@@ -541,6 +541,11 @@ def parse_args(input_args=None):
         default=1.29,
         help="Scale of mode weighting scheme. Only effective when using the `'mode'` as the `weighting_scheme`.",
     )
+    parser.add_argument(
+        "--offload",
+        action="store_true",
+        help="Whether to offload the VAE and the text encoders to CPU when they are not used.",
+    )
 
     if input_args is not None:
         args = parser.parse_args(input_args)
@@ -999,8 +1004,9 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 control_latents = encode_images(
                     batch["conditioning_pixel_values"], vae.to(accelerator.device), weight_dtype
                 )
-                # offload vae to CPU.
-                vae.cpu()
+                if args.offload:
+                    # offload vae to CPU.
+                    vae.cpu()
 
                 # Sample a random timestep for each image
                 # for weighting schemes where we sample timesteps non-uniformly
@@ -1064,7 +1070,8 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 if args.proportion_empty_prompts and random.random() < args.proportion_empty_prompts:
                     prompt_embeds.zero_()
                     pooled_prompt_embeds.zero_()
-                text_encoding_pipeline = text_encoding_pipeline.to("cpu")
+                if args.offload:
+                    text_encoding_pipeline = text_encoding_pipeline.to("cpu")
 
                 # Predict.
                 model_pred = flux_transformer(
```
14 changes: 11 additions & 3 deletions examples/flux-control/train_control_lora_flux.py
```diff
@@ -573,6 +573,11 @@ def parse_args(input_args=None):
         default=1.29,
         help="Scale of mode weighting scheme. Only effective when using the `'mode'` as the `weighting_scheme`.",
     )
+    parser.add_argument(
+        "--offload",
+        action="store_true",
+        help="Whether to offload the VAE and the text encoders to CPU when they are not used.",
+    )
 
     if input_args is not None:
         args = parser.parse_args(input_args)
@@ -1140,8 +1145,10 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 control_latents = encode_images(
                     batch["conditioning_pixel_values"], vae.to(accelerator.device), weight_dtype
                 )
-                # offload vae to CPU.
-                vae.cpu()
+
+                if args.offload:
+                    # offload vae to CPU.
+                    vae.cpu()
 
                 # Sample a random timestep for each image
                 # for weighting schemes where we sample timesteps non-uniformly
@@ -1205,7 +1212,8 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 if args.proportion_empty_prompts and random.random() < args.proportion_empty_prompts:
                     prompt_embeds.zero_()
                     pooled_prompt_embeds.zero_()
-                text_encoding_pipeline = text_encoding_pipeline.to("cpu")
+                if args.offload:
+                    text_encoding_pipeline = text_encoding_pipeline.to("cpu")
 
                 # Predict.
                 model_pred = flux_transformer(
```