Redo custom attention processor to support other attention types #6550
base: main
Conversation
I haven't looked at the code yet, but do you know if there are still use cases for attention processors other than Torch 2.0 SDP? Based on the benchmarking that diffusers has done, it seems like the all-around best choice. But maybe there are still reasons to use other implementations, e.g. on very-low-VRAM systems?
I thought roughly the same:
On CUDA, torch's SDP was faster than xformers for me when I last checked a month or so back. IIRC it was just a couple of percent faster.
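For reference, a minimal sketch of the kind of SDP-vs-xformers comparison described above; the tensor shapes, iteration count, and the `bench` helper are assumptions for illustration, not code from this PR:

```python
import time

import torch
import torch.nn.functional as F


def bench(fn, *args, iters=50):
    """Average the runtime of fn(*args) over `iters` calls, with warmup and CUDA syncs."""
    for _ in range(5):
        fn(*args)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


# Roughly SD-like self-attention shapes (batch, heads, tokens, head_dim) - an assumption.
q, k, v = (torch.randn(2, 10, 4096, 64, device="cuda", dtype=torch.float16) for _ in range(3))
print("torch-sdp:", bench(F.scaled_dot_product_attention, q, k, v))

try:
    import xformers.ops

    # xformers expects (batch, tokens, heads, head_dim), hence the transposes.
    qx, kx, vx = (t.transpose(1, 2).contiguous() for t in (q, k, v))
    print("xformers:", bench(xformers.ops.memory_efficient_attention, qx, kx, vx))
except ImportError:
    print("xformers is not installed; skipping")
```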
I thought about this some more, and I'm hesitant to proceed with trying to merge this until we have more clarity around which attention implementations we actually want to support. Right now, we have `_adjust_memory_efficient_attention`, which tries to configure attention based on the config and the system properties. The logic in this function is outdated, and I think there has been hesitation to change it out of fear of causing a regression on some systems. Let's get to the bottom of this before deciding how to proceed with this PR. My current guess is that supporting just torch SDP and sliced attention would cover all use cases, but we need to do some testing to determine whether this is accurate. A few data points to consider:
@StAlKeR7779 do you want to look into this?
@RyanJDick ok, I removed
Did some smoke tests covering various permutations of attention type (normal vs torch-sdp) and slice size configs (none, max, balanced, 1, 3, 6) - all good.
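For anyone repeating this, a sketch of the matrix covered above; the config keys shown are assumptions based on this thread, so check them against your `invokeai.yaml`:

```python
# Enumerate the attention-type x slice-size permutations smoke-tested above.
from itertools import product

attention_types = ["normal", "torch-sdp"]
slice_sizes = ["none", "max", "balanced", 1, 3, 6]

for attention_type, slice_size in product(attention_types, slice_sizes):
    # Assumed key names: attention_type / attention_slice_size.
    print(f"invokeai.yaml -> attention_type: {attention_type}, attention_slice_size: {slice_size}")
    # ...then run a generation (e.g. with an IP-Adapter or a regional prompt) and check the output.
```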
Co-Authored-By: psychedelicious <[email protected]>
pls run
Co-Authored-By: psychedelicious <[email protected]>
Looks good to me. I just did a basic smoke test - looks like others have done more rigorous testing.
A few minor things:
- Delete `_ignore_xformers_triton_message_on_windows`
- Delete `logging.getLogger("xformers").addFilter(lambda record: "A matching Triton is not available" not in record.getMessage())`
- Remove `xformers` instructions from `020_INSTALL_MANUAL.md`
Co-Authored-By: Ryan Dick <[email protected]>
Only requesting changes here to ensure we don't merge this before testing on a couple of older GPUs. Will test it ASAP.
How critical is it to remove
Other than that,
That seems reasonable, with a configurable override if the user wants to force one. Does look like we should add it back.
The PyTorch blog says Flash Attention is supported from sm80 compute capability onwards: https://pytorch.org/blog/accelerated-pytorch-2/, so perhaps we should default to
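A rough sketch of that kind of capability-based default with a configurable override; the function name and the fallback choice for pre-sm80 cards are assumptions, not decisions made in this PR:

```python
from typing import Optional

import torch


def choose_attention_type(force: Optional[str] = None) -> str:
    """Illustrative only: pick a default attention_type, honoring a user override."""
    if force is not None:  # configurable override, as suggested above
        return force
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        # Per the linked PyTorch 2.0 blog post, flash attention needs sm80 (Ampere) or newer.
        if major >= 8:
            return "torch-sdp"
        return "normal"  # assumption: a non-flash default (normal or sliced) for older cards
    return "torch-sdp"  # CPU / MPS: scaled_dot_product_attention is still available
```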
Not for this PR, but I did some performance testing and we'll probably want to address this at some point:

SDXL:

```
>>> Time taken to prepare attention processors: 0.10069823265075684s
>>> Time taken to prepare attention processors: 0.07877492904663086s
>>> Time taken to set attention processors: 0.1278061866760254s
>>> Time taken to reset attention processors: 0.13225793838500977s
```

Code used to measure:

```python
def apply_custom_attention(self, unet: UNet2DConditionModel):
    """A context manager that patches `unet` with CustomAttnProcessor2_0 attention layers."""
    start = time.time()
    attn_procs = self._prepare_attention_processors(unet)
    time_1 = time.time()
    print(f">>> Time taken to prepare attention processors: {time_1 - start}s")
    orig_attn_processors = unet.attn_processors
    time_2 = time.time()
    print(f">>> Time taken to prepare attention processors: {time_2 - time_1}s")
    try:
        # Note to future devs: set_attn_processor(...) does something slightly unexpected - it pops elements from
        # the passed dict. So, if you wanted to keep the dict for future use, you'd have to make a
        # moderately-shallow copy of it. E.g. `attn_procs_copy = {k: v for k, v in attn_procs.items()}`.
        unet.set_attn_processor(attn_procs)
        time_3 = time.time()
        print(f">>> Time taken to set attention processors: {time_3 - time_2}s")
        yield None
    finally:
        time_4 = time.time()
        unet.set_attn_processor(orig_attn_processors)
        time_5 = time.time()
        print(f">>> Time taken to reset attention processors: {time_5 - time_4}s")
```
Tested after changes, seeing expected performance increases on Ampere and no performance degradation on Pascal. LGTM!!
I think this is just about good-to-go. There are a couple minor requests for docs, but otherwise the code looks good to me. (I haven't tested all cases myself, but it sounds like others have.)
…d cuda, multihead xformers for high heads count)
Co-Authored-By: Ryan Dick <[email protected]>
It looks like there was a significant re-write of the attention logic after the latest round of review and testing on this PR. @StAlKeR7779 can you shed some light on the benefits / motivation for that latest re-write? Given the amount of testing and discussion that went into this branch before the re-write, I'm wondering if we should revert those latest changes, merge this PR, and then open a new PR with the changes. Let me know what you think.
Summary
The current attention processor implements only the `torch-sdp` attention type, so when any IP-Adapter or regional prompt is used, we override the model to run `torch-sdp` attention. The new attention processor combines the four attention implementations (`normal`, `sliced`, `xformers`, `torch-sdp`) by moving the parts of attention that differ (mask preparation and the attention computation itself) into a separate function call, where the required implementation is executed.
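As a rough illustration of that structure (class and method names here are illustrative, not the PR's actual implementation), the shared projection/reshaping logic lives in one place and only the attention step itself is dispatched:

```python
import torch
import torch.nn.functional as F


class DispatchingAttnProcessor:
    """Sketch: one processor that shares the surrounding logic (projections, IP-Adapter and
    regional-prompt handling, reshaping) and dispatches only the attention computation."""

    def __init__(self, attention_type: str = "torch-sdp", slice_size: int = 1):
        self.attention_type = attention_type
        self.slice_size = slice_size

    def run_attention(self, query, key, value, attn_mask=None):
        # Inputs are (batch, heads, tokens, head_dim); only this step differs per implementation.
        if self.attention_type == "torch-sdp":
            return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)

        if self.attention_type == "normal":
            scores = query @ key.transpose(-2, -1) * query.shape[-1] ** -0.5
            if attn_mask is not None:
                scores = scores + attn_mask
            return scores.softmax(dim=-1) @ value

        if self.attention_type == "sliced":
            # Bound peak memory by materializing the score matrix for a slice of query tokens at a time.
            out = torch.empty_like(query)
            scale = query.shape[-1] ** -0.5
            for start in range(0, query.shape[-2], self.slice_size):
                end = start + self.slice_size
                scores = query[..., start:end, :] @ key.transpose(-2, -1) * scale
                if attn_mask is not None:
                    scores = scores + attn_mask[..., start:end, :]
                out[..., start:end, :] = scores.softmax(dim=-1) @ value
            return out

        if self.attention_type == "xformers":
            import xformers.ops  # optional dependency

            # xformers expects (batch, tokens, heads, head_dim), hence the transposes.
            out = xformers.ops.memory_efficient_attention(
                query.transpose(1, 2), key.transpose(1, 2), value.transpose(1, 2), attn_bias=attn_mask
            )
            return out.transpose(1, 2)

        raise ValueError(f"Unsupported attention_type: {self.attention_type}")
```

Mask preparation would follow the same pattern: build the mask in whatever layout the selected implementation expects before calling `run_attention`.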
Related Issues / Discussions
None
QA Instructions
Change `attention_type` in `invokeai.yaml` and then run generation with an IP-Adapter or a regional prompt.
Merge Plan
None?
Checklist
@dunkeroni @RyanJDick