[core] ConsisID #10140

Open
wants to merge 68 commits into
base: main
Changes from 52 commits
Commits
0036376
Update __init__.py
SHYuanBest Dec 6, 2024
940ec92
Merge branch 'huggingface:main' into main
SHYuanBest Dec 9, 2024
c78cf01
add consisid
SHYuanBest Dec 10, 2024
61c85f7
update consisid
SHYuanBest Dec 10, 2024
12855b2
update consisid
SHYuanBest Dec 10, 2024
787a69c
make style
SHYuanBest Dec 10, 2024
33d4291
make_style
SHYuanBest Dec 10, 2024
455d68d
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
8f310c5
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
0f447a4
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
d348901
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
a35f92a
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
33f3acb
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 10, 2024
6503a17
add doc
SHYuanBest Dec 10, 2024
a24a4ee
Merge branch 'main' into main
SHYuanBest Dec 10, 2024
19d1fa3
Merge branch 'huggingface:main' into main
SHYuanBest Dec 10, 2024
c13fb17
make style
SHYuanBest Dec 10, 2024
61ad37b
Rename consisid .md to consisid.md
SHYuanBest Dec 10, 2024
3a274ca
Update geodiff_molecule_conformation.ipynb
hlky Dec 11, 2024
02c16ba
Update geodiff_molecule_conformation.ipynb
hlky Dec 11, 2024
e76338e
Update geodiff_molecule_conformation.ipynb
hlky Dec 11, 2024
a597713
Update demo.ipynb
hlky Dec 11, 2024
0a633e4
Merge branch 'main' into main
hlky Dec 11, 2024
51003e8
Update pipeline_consisid.py
hlky Dec 11, 2024
a0e746e
make fix-copies
hlky Dec 11, 2024
14ad9af
Update docs/source/en/using-diffusers/consisid.md
SHYuanBest Dec 12, 2024
e5c84c7
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 12, 2024
0bb54c9
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 12, 2024
c389400
Update docs/source/en/using-diffusers/consisid.md
SHYuanBest Dec 12, 2024
4fb4529
Update docs/source/en/using-diffusers/consisid.md
SHYuanBest Dec 12, 2024
9b2bd31
update doc & pipeline code
SHYuanBest Dec 12, 2024
211331b
fix typo
SHYuanBest Dec 12, 2024
590b1bd
make style
SHYuanBest Dec 12, 2024
8e5b070
update example
SHYuanBest Dec 12, 2024
f234376
Merge branch 'huggingface:main' into main
SHYuanBest Dec 12, 2024
a0acc02
Update docs/source/en/using-diffusers/consisid.md
SHYuanBest Dec 13, 2024
d23d933
Merge branch 'huggingface:main' into main
SHYuanBest Dec 13, 2024
2a722f2
update example
SHYuanBest Dec 17, 2024
1c5a1f2
update example
SHYuanBest Dec 17, 2024
7ceffc9
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 18, 2024
95decbd
Update src/diffusers/pipelines/consisid/pipeline_consisid.py
SHYuanBest Dec 18, 2024
665d1b4
Merge branch 'huggingface:main' into main
SHYuanBest Dec 18, 2024
5139afc
update
SHYuanBest Dec 18, 2024
1e10927
Merge branch 'huggingface:main' into main
SHYuanBest Dec 18, 2024
58f6570
add test and update
SHYuanBest Dec 18, 2024
32649b2
Merge branch 'huggingface:main' into main
SHYuanBest Dec 18, 2024
141038b
remove some changes from docs
a-r-r-o-w Dec 18, 2024
d0fe503
refactor
a-r-r-o-w Dec 18, 2024
60856c7
fix
a-r-r-o-w Dec 18, 2024
313c2e3
undo changes to examples
a-r-r-o-w Dec 18, 2024
935319a
remove save/load and fuse methods
a-r-r-o-w Dec 18, 2024
0f5d677
update
a-r-r-o-w Dec 18, 2024
aa7b0eb
link hf-doc-img & make test extremely small
SHYuanBest Dec 19, 2024
aa98858
update
SHYuanBest Dec 19, 2024
03ebc66
Merge branch 'huggingface:main' into main
SHYuanBest Dec 19, 2024
c8ba3c0
Merge branch 'huggingface:main' into main
SHYuanBest Dec 19, 2024
2e15509
Merge branch 'huggingface:main' into main
SHYuanBest Dec 20, 2024
b174d9f
add lora
SHYuanBest Dec 21, 2024
fbb09aa
fix test
SHYuanBest Dec 22, 2024
3b05257
Merge branch 'huggingface:main' into main
SHYuanBest Dec 22, 2024
5813825
update
SHYuanBest Dec 22, 2024
7734a29
update
SHYuanBest Dec 22, 2024
5fd9a81
change expected_diff_max to 0.4
SHYuanBest Dec 23, 2024
0937753
Merge branch 'huggingface:main' into main
SHYuanBest Dec 23, 2024
cdc04bf
fix typo
SHYuanBest Dec 23, 2024
0af2f83
fix link
SHYuanBest Dec 24, 2024
e17aa82
fix typo
SHYuanBest Dec 24, 2024
3b17e2e
Merge branch 'main' into main
SHYuanBest Dec 24, 2024
6 changes: 6 additions & 0 deletions docs/source/en/_toctree.yml
@@ -79,6 +79,8 @@
- sections:
- local: using-diffusers/cogvideox
title: CogVideoX
- local: using-diffusers/consisid
title: ConsisID
- local: using-diffusers/sdxl
title: Stable Diffusion XL
- local: using-diffusers/sdxl_turbo
@@ -266,6 +268,8 @@
title: AuraFlowTransformer2DModel
- local: api/models/cogvideox_transformer3d
title: CogVideoXTransformer3DModel
- local: api/models/consisid_transformer3d
title: ConsisIDTransformer3DModel
- local: api/models/cogview3plus_transformer2d
title: CogView3PlusTransformer2DModel
- local: api/models/dit_transformer2d
@@ -368,6 +372,8 @@
title: CogVideoX
- local: api/pipelines/cogview3
title: CogView3
- local: api/pipelines/consisid
title: ConsisID
- local: api/pipelines/consistency_models
title: Consistency Models
- local: api/pipelines/controlnet
30 changes: 30 additions & 0 deletions docs/source/en/api/models/consisid_transformer3d.md
@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ConsisIDTransformer3DModel

A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by Peking University, University of Rochester, and others.

The model can be loaded with the following code snippet.

```python
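import torch
from diffusers import ConsisIDTransformer3DModel

# Load only the transformer weights from the pipeline repository in bfloat16 and move them to the GPU
transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")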
```

## ConsisIDTransformer3DModel

[[autodoc]] ConsisIDTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
60 changes: 60 additions & 0 deletions docs/source/en/api/pipelines/consisid.md
@@ -0,0 +1,60 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# ConsisID

[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University, University of Rochester, and others, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan.

The abstract from the paper is:

*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. 
Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsisID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).

There are two official ConsisID checkpoints for identity-preserving text-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |

### Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To replicate these numbers, refer to [this script](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258).

| Feature (applied on top of the previous rows) | Max Memory Allocated | Max Memory Reserved |
| :----------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |
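
The features in the table correspond to methods on the pipeline and on its VAE. Below is a minimal sketch, assuming `pipe` is an already loaded `ConsisIDPipeline` and that only one of the two CPU offload modes is chosen at a time. The helper name `apply_memory_optimizations` and its parameters are hypothetical and not part of diffusers.

```python
def apply_memory_optimizations(pipe, offload="model", vae_slicing=True, vae_tiling=True):
    """Enable memory-saving features on a pipeline and return the names of those applied.

    `offload` is an illustrative switch: "model", "sequential", or None.
    """
    applied = []
    if offload == "model":
        # Moves whole sub-models to the GPU only while they are in use
        pipe.enable_model_cpu_offload()
        applied.append("enable_model_cpu_offload")
    elif offload == "sequential":
        # Offloads at a finer granularity: lower memory, slower inference
        pipe.enable_sequential_cpu_offload()
        applied.append("enable_sequential_cpu_offload")
    elif offload is not None:
        raise ValueError(f"Unknown offload mode: {offload!r}")

    if vae_slicing:
        # Decode the batch one sample at a time
        pipe.vae.enable_slicing()
        applied.append("vae.enable_slicing")
    if vae_tiling:
        # Decode each frame in spatial tiles
        pipe.vae.enable_tiling()
        applied.append("vae.enable_tiling")
    return applied


# Usage (assumes `pipe` was created with ConsisIDPipeline.from_pretrained):
# apply_memory_optimizations(pipe, offload="sequential")
```

When a CPU offload mode is enabled, skip the `pipe.to("cuda")` call, since offloading manages device placement itself.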

## ConsisIDPipeline

[[autodoc]] ConsisIDPipeline

- all
- __call__

## ConsisIDPipelineOutput

[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput
90 changes: 90 additions & 0 deletions docs/source/en/using-diffusers/consisid.md
@@ -0,0 +1,90 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID

[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are:

- Frequency decomposition: The characteristics of the DiT architecture are analyzed from the frequency domain perspective, and based on these characteristics, a reasonable control information injection method is designed.
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity preservation performance.
- Inference without finetuning: Previous methods required case-by-case finetuning of the input ID before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free.
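
As a loose illustration of what "low-frequency" and "high-frequency" mean in the first bullet, the sketch below splits a 1-D signal into a smooth component and a detail component. It is a generic moving-average example, not the extractor design used by ConsisID; the function names and the sample signal are made up for illustration.

```python
def moving_average(signal, window=3):
    """Low-pass filter: replace each sample by the mean of its neighbourhood."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def decompose(signal, window=3):
    """Split a signal into a smooth (low-frequency) part and a residual (high-frequency) part."""
    low = moving_average(signal, window)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A slow step from ~0.5 to ~4.5 with a fast alternation on top
signal = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
low, high = decompose(signal)

# The two components add back up to the original signal
reconstructed = [l + h for l, h in zip(low, high)]
```

The smooth part carries the overall shape, while the residual carries the fine detail.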

This guide will walk you through using ConsisID for identity-preserving text-to-video generation.

## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.


```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download

# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")

# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)

# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
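
The loading snippet above depends on the extra packages named in its first comment. A small sketch for checking that they are importable before loading the face models, assuming each package's import name matches its pip name; the helper `find_missing_modules` is hypothetical.

```python
import importlib.util

# Assumption: import names are identical to the pip package names
REQUIRED_MODULES = ("consisid_eva_clip", "insightface", "facexlib")


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing_modules()
if missing:
    print("Install before loading the face models: pip install " + " ".join(missing))
```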

## Identity-Preserving Text-to-Video
For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.

```python
from diffusers.utils import export_to_video

prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)

video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
Review comment (Member):

Any results being demonstrated should be linked from the huggingface documentation-images repository on HF Hub: https://huggingface.co/datasets/huggingface/documentation-images/tree/main/diffusers

If you could open a PR there, I can merge it and then that could be linked here.

<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th>
</tr>
<tr>
<td><img src="https://github.com/user-attachments/assets/be0257b5-9d90-47ba-93f4-5faf78fd1859" style="height: auto; width: 600px;"></td>
<td><img src="https://github.com/user-attachments/assets/f0e2803c-7214-4463-afd8-b28c0cd87c64" style="height: auto; width: 2000px;"></td>
<td>The video features a woman in exquisite hybrid armor adorned with iridescent gemstones, standing amidst gently falling cherry blossoms. Her piercing yet serene gaze hints at quiet determination, as a breeze catches a loose strand of her hair ......</td>
</tr>
<tr>
<td><img src="https://github.com/user-attachments/assets/c1418804-3e5b-4f8b-87f1-25d4ddeee99e" style="height: auto; width: 600px;"></td>
<td><img src="https://github.com/user-attachments/assets/3491e75c-e01a-41d3-ae01-0c2535b7fa81" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge ......</td>
</tr>
<tr>
<td><img src="https://github.com/user-attachments/assets/2c4ea113-47cd-4295-b643-a10e2a566823" style="height: auto; width: 600px;"></td>
<td><img src="https://github.com/user-attachments/assets/2ffb154f-23dc-4314-9976-95c0bd16810b" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured ......</td>
</tr>
<tr>
<td><img src="https://github.com/user-attachments/assets/d48cb0be-0a64-40fa-8f86-ac406548d592" style="height: auto; width: 600px;"></td>
<td><img src="https://github.com/user-attachments/assets/9eb298a3-4c2a-407e-b73b-32f88895df22" style="height: auto; width: 2000px;"></td>
<td>The video features a man standing at an easel, focused intently as his brush dances across the canvas. His expression is one of deep concentration, with a hint of satisfaction as each brushstroke adds color and form ......</td>
</tr>
</table>

## Resources

Learn more about ConsisID with the following resources.
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440) for more details.
2 changes: 2 additions & 0 deletions docs/source/zh/_toctree.yml
@@ -5,6 +5,8 @@
title: 快速入门
- local: stable_diffusion
title: 有效和高效的扩散
- local: consisid
title: 身份保持的文本到视频生成
- local: installation
title: 安装
title: 开始