Incredibly drunk power behavior with a 7900 XTX. NOT the other power limit issue, this one is different. #430

Open
Sorrydough opened this issue Dec 19, 2024 · 5 comments

@Sorrydough

Sorrydough commented Dec 19, 2024

This is really difficult to explain so I made a video showing it. Please watch: https://youtu.be/ceHsFTM35nE

tl;dw: Power consumption sits about 50 W below the power limit for no discernible reason. With some fiddling I can get it to reach the limit as expected, but making that happen consistently is super buggy and requires running two applications (LACT and CoreCtrl) at the same time. You should really watch the video.

I know RDNA3 is infamous for weird power limit behavior on Linux, but the video shows that there seems to be a way to get it to work. I just don't know what that way is, since frankly it seems to involve a bug.

I didn't show this in the video, but if I apply settings with CoreCtrl first, I get the -50 W behavior and can't make it stop by fiddling with LACT. Only fiddling with CoreCtrl fixes it for me. The behavior is so convoluted in general that I have no idea what's going on: sometimes it works and sometimes it gets stuck at -50 W.
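To put a number on that gap without relying on either GUI, the power cap and the reported draw can be read straight from the amdgpu hwmon files. A minimal sketch, assuming the card's hwmon directory exposes `power1_cap` and `power1_average` (some kernels expose `power1_input` instead), both in microwatts. The function names are made up for illustration.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_watts(path: Path) -> float:
    """Read a hwmon power file (value in microwatts) and return watts."""
    return int(path.read_text().strip()) / MICROWATTS_PER_WATT


def power_gap(hwmon_dir: Path) -> dict:
    """Return the configured cap, the reported draw, and their difference."""
    cap = read_watts(hwmon_dir / "power1_cap")
    # Assumption: fall back to power1_input when power1_average is absent.
    draw_file = hwmon_dir / "power1_average"
    if not draw_file.exists():
        draw_file = hwmon_dir / "power1_input"
    draw = read_watts(draw_file)
    return {"cap_w": cap, "draw_w": draw, "gap_w": cap - draw}
```

On a real system the directory would be something like `/sys/class/drm/card0/device/hwmon/hwmon*`; the `card0` index is an assumption and differs between machines. The draw read this way is the driver's reported figure, so short transients will not show up in it.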

- LACT version: whichever one the lact-git package from the AUR provides
- GPU model: Sapphire 7900 XTX Nitro
- Kernel version: 6.13.0-rc1-273-tkg-bore
- Distribution: EndeavourOS (Arch-based)
@ilya-zlobintsev
Owner

There are two main reasons why I think this could be happening:

  • When you apply the power limit in CoreCtrl, the clockspeeds change (VRAM clock goes down, GPU clock goes up), so you at least have different OC settings in there. The GPU is also reporting TDC_SOC (electrical current) throttling. This could be a false positive, since the throttling reporting in the driver isn't 100% accurate, but the card could also be hitting a current limit before reaching the power limit when using those overclock settings.
  • RDNA3 is quite sensitive to what order the settings are applied in. LACT makes sure everything is correct by applying every setting in a specific order whenever you click apply, regardless of what was changed. I'm not sure if CoreCtrl does the same, as it could be applying only a new power limit and nothing else, potentially affecting other settings. This would explain why you sometimes see this behaviour in CoreCtrl as well, but not always.
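The "apply everything in a fixed order, regardless of what changed" idea from the second point can be sketched roughly like this. This is an illustration only: the setting names and the order in `APPLY_ORDER` are hypothetical and are not taken from LACT's source, and `write` stands in for whatever actually pushes a value to the driver.

```python
from typing import Callable, Dict, List

# Hypothetical order, for illustration only. The real order LACT uses
# is not stated in this thread.
APPLY_ORDER: List[str] = [
    "performance_level",
    "power_cap",
    "clocks",
    "voltage_offset",
]


def apply_all(config: Dict[str, object],
              write: Callable[[str, object], None]) -> List[str]:
    """Write every configured setting in APPLY_ORDER, changed or not."""
    # Refuse settings with no defined position before writing anything.
    unknown = set(config) - set(APPLY_ORDER)
    if unknown:
        raise KeyError(f"no defined position for: {sorted(unknown)}")
    applied = []
    for name in APPLY_ORDER:
        if name in config:
            write(name, config[name])
            applied.append(name)
    return applied
```

The point of the design is that the caller never passes a "diff": the whole configuration is replayed each time, so the driver always sees the same sequence of writes.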

What you should try is resetting all settings (via the dropdown menu), applying only the power limit, and seeing what the behaviour is. Then, if everything is okay, add the rest of the overclock settings.
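For reference, a rough approximation of such a reset done by hand through sysfs might look like the sketch below. It assumes the amdgpu `pp_od_clk_voltage` file accepts `r` (restore defaults) followed by `c` (commit), and that the hwmon directory exposes `power1_cap_default`. It is not a description of what LACT's reset button does internally, the helper names are hypothetical, and the writes need root.

```python
from pathlib import Path


def reset_overdrive(device_dir: Path) -> None:
    """Restore the overdrive table to defaults, then commit it."""
    od = device_dir / "pp_od_clk_voltage"
    od.write_text("r\n")  # "r": restore default clocks/voltages
    od.write_text("c\n")  # "c": commit the table


def reset_power_cap(hwmon_dir: Path) -> int:
    """Set power1_cap back to the board default; return it in microwatts."""
    default = (hwmon_dir / "power1_cap_default").read_text().strip()
    (hwmon_dir / "power1_cap").write_text(default + "\n")
    return int(default)
```

Here `device_dir` would be something like `/sys/class/drm/card0/device` and `hwmon_dir` a `hwmon*` directory underneath it; both paths are assumptions about the local system.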

@Sorrydough
Author

Sorrydough commented Dec 21, 2024

OK, I tried again using only LACT, and I ran into a ton of bugs. You can watch the video to see them (https://youtu.be/qHn2mFK6EEs); here's a summary.

  1. Settings don't apply properly upon boot (or upon opening the application), and I need to move a slider to make the apply button appear so I can apply them. A power limit of 400 (350 actually?) was applied by default, but everything else appeared to be incorrect.
  2. The VRAM got stuck in a low power state after reboot until I applied an overclock, although I think this is an overdrive bug and not yours. Maybe this is also where the 350 (400?) limit upon boot came from.
  3. After I applied settings with LACT in the game, the transient power usage and clock speeds went completely insane and stayed that way. At first I thought this was related to the offset power limit bug, but right after boot the GPU was already pinned at 350 W before I clicked apply, so I don't know what's going on.
  4. I can't change the power profile away from video. I assume this is because I'm using OBS.
  5. No matter how much I reset or fiddle or whatever, I can't get the card to use the correct amount of power... EXCEPT for when it was at its default stock power limit of 338 W. In that specific case it appeared to obey the power limit properly. After I stopped recording I couldn't fix the transient power draw no matter how I adjusted the power limit, including trying to reset back to default with LACT.
  6. LACT seems to think the GPU is constantly temperature and power throttling, even when it's not. Even when it's idle.

Also just for the record it's tough to not be a two-take jake when the behavior after I click apply is different every time I reboot the computer LMAO.

Also, this is tangential, but I have a 650 W power supply, and sometimes setting the power limit trips it offline and sometimes it doesn't. I think this is caused by the bugged/inconsistent transient power behavior I observed after altering the power limit. Like it's really bizarre. I really need to do more testing on this though.

@ilya-zlobintsev
Owner

The settings do get written automatically at boot. What the LACT UI shows on startup is what the GPU reports as set, not just the LACT configuration. However, it seems that for some reason the GPU reports these settings as applied even though it isn't actually using them (and when you change something and they get reapplied, it does use them). The order in which they're applied at boot might be a reason for this.

I'm going to suggest what I've suggested previously - reset all settings (not just the clocks and power limit separately, but everything with the reset configuration button in the dropdown menu), then reboot, then start applying settings one by one. First just change the power limit, see if it works, then the voltage offset etc. This way we can see what setting specifically causes this abnormal behaviour. It will also at the very least get rid of the VIDEO mode (at least when you have the performance level on auto).

The throttling reporting is a bug in the driver as well: https://gitlab.freedesktop.org/drm/amd/-/issues/3251

@Sorrydough
Author

Sorrydough commented Dec 22, 2024

Okay, I messed around with it some more and figured out that applying anything in LACT, no matter which setting or what value, totally breaks power management. Here's a video of me applying a change to VRAM speed and it just throws a fit: https://youtu.be/AffWkqobV84

As far as I can tell, behavior does not go back to normal if I click the "reset all configurations" button in the hamburger menu. It becomes less buggy, but it still doesn't work correctly, with huge transient power swings in every direction. I have to reboot to get it fully recovered.

This seems like it might be a driver bug? I'll learn how to change stuff with the CLI and see if the behavior is replicated, so we can figure out whether LACT is doing it or if overdrive is just broken somehow.
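A minimal sketch of what that LACT-free test could look like for the power limit alone, assuming the amdgpu hwmon directory exposes `power1_cap`, `power1_cap_min`, and `power1_cap_max` in microwatts. The helper name is made up, and writing the cap needs root.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def set_power_cap(hwmon_dir: Path, watts: int) -> int:
    """Write a new power cap and return the value read back (microwatts)."""
    lo = int((hwmon_dir / "power1_cap_min").read_text())
    hi = int((hwmon_dir / "power1_cap_max").read_text())
    target = watts * MICROWATTS_PER_WATT
    # Reject values outside the range the driver advertises.
    if not lo <= target <= hi:
        raise ValueError(
            f"{watts} W is outside the allowed range "
            f"{lo // MICROWATTS_PER_WATT}-{hi // MICROWATTS_PER_WATT} W"
        )
    cap_file = hwmon_dir / "power1_cap"
    cap_file.write_text(f"{target}\n")
    # Read back what the driver now reports as the cap.
    return int(cap_file.read_text())
```

If the power draw misbehaves in the same way after a write like this, with LACT and CoreCtrl both closed, that would point at the driver rather than at either tool.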

@Sorrydough
Author

image


2 participants