Every "Bug" Is Another User's Killer Feature
- added 15 Mar 2024
- Sometimes life imitates art, and this recent case with AMD undervolting reminds me quite a bit of XKCD 1172: every time you have a bug, there are going to be some users who end up relying on it, even if it's very clearly a bug.
==========Support The Channel==========
► Patreon: brodierobertson.xyz/patreon
► Paypal: brodierobertson.xyz/paypal
► Liberapay: brodierobertson.xyz/liberapay
► Amazon USA: brodierobertson.xyz/amazonusa
==========Resources==========
XKCD 1172: xkcd.com/1172/
Bug Report: gitlab.freedesktop.org/drm/am...
LACT: github.com/ilya-zlobintsev/LACT
Email Thread: git.kernel.org/pub/scm/linux/...
=========Video Platforms==========
🎥 Odysee: brodierobertson.xyz/odysee
🎥 Podcast: techovertea.xyz/youtube
🎮 Gaming: brodierobertson.xyz/gaming
==========Social Media==========
🎤 Discord: brodierobertson.xyz/discord
🐦 Twitter: brodierobertson.xyz/twitter
🌐 Mastodon: brodierobertson.xyz/mastodon
🖥️ GitHub: brodierobertson.xyz/github
==========Credits==========
🎨 Channel Art:
Profile Picture:
/ supercozman_draws
🎵 Ending music
Track: Debris & Jonth - Game Time [NCS Release]
Music provided by NoCopyrightSounds.
Watch: • Debris & Jonth - Game ...
Free Download / Stream: ncs.io/GameTime
DISCLOSURE: Wherever possible I use referral links, which means if you click one of the links in this video or description and make a purchase I may receive a small commission or other compensation. - Science & Technology
reading the comments makes me want to see an infinite thread that goes like this:
actually underclocking is not the same as undervolting, undervolting is ...
actually undervolting is not the same as underwatting, underwatting is ...
actually underwatting is not the same as underamping, underamping is ...
actually underamping is not the same as underpowering, underpowering is ...
actually underpowering is not the same as underlowering, underlowering is ...
and so on
underwear
And at the end of it all, nobody was confused that I was discussing power limits
@@BrodieRobertson Power limits in the mailing list were talking about wattage, not voltage or clock speed.
@@speedytruck Yeah, I thought it was obvious this was not underclocking or undervolting at all.
@@GeorgeN-ATX The reason there are so many comments (including mine :P) about undervolting is that Brodie used this term and claimed "obviously this does make [the GPU] run slower" at the very beginning of the video, before he even got to the mailing list part.
I'm gonna set my GPU to -1Watts to generate electricity.
200iq move
Why stop there, set it low enough to completely offset your power usage for your whole house! :D
I hope you're fast enough to generate each frame back to the GPU.
negative infinity. nuclear fusion achievement unlocked!
@@MarkParkTech Um ackchyually circuit breakers are limited to a sustained power of 1.5 kW and a maximum peak power of 1.8 kW, so U can't power more than a tea kettle. Also that would make your house really cold because the GPU would suck up all the heat.
Quasi-connectivity in Minecraft Java is a good example of a bug that became a feature. Originally it existed because Notch copied the code from doors to pistons and a few other components, which makes them redstone-wise think they're two blocks tall. And it's been abused to such a great extent by the redstone community that it's now a feature. And given the behaviour in Bugrock, I think single-tick-ejecting blocks from sticky pistons was originally a bug too.
single tick ejecting blocks was a patch to sand duplication on the java edition to my knowledge.
@@average-neco-arc-enjoyer
At least with end portals you can still duplicate all gravity blocks.
@@Lampe2020 the sand duping method which led to the patch being implemented was a separate duplication glitch though
@@average-neco-arc-enjoyer That's also why pistons were made slower.
Which is silly because Minecraft's changed redstone behaviour before. Most notably when they made the 1-by-1 redstone tile a square instead of a cross, and then later on when they let you toggle between them. Also the very first version of redstone allowed torches to power blocks in all directions. This was changed because it was too hard to stop the signal going to the wrong places.
1:40 undervolting in some cases can result in very little loss in performance while using half the amount of power so I can definitely see why people would want to do it.
1:40 IF December THEN overvolt (It's free cooling bro.)
undervolting can even add performance in some cases if the limit is thermal and not (core) stability
@@vincentschumann937 Yup, what you said is true.
This is important for AMD Ryzen 3rd gen and older.
I haven't tried undervolting AMD cards but I did undervolt some NVIDIA cards that were power limited (3070 and 3090). From memory I got about 5% more performance with about 20% less power usage and no obvious stability problems.
In the case of the 3090 it would have such high power spikes when stock that it would occasionally trip my power supply OCP and reboot the computer. Undervolting it stopped that.
It wasn't a really high amount of undervolting. A little bit can go a long way!
@@espertalhao041 It can also improve boost clocks when the CPU is constrained by max power limits. That's often the case on laptops (if they're not also thermally constrained).
Once I needed a heater for a job but wasn't allowed to have a space heater. So I downloaded a programming language and wrote an infinite loop in it. The laptop kept my poor hands from freezing and warmed up the room a couple of degrees. In short, using a CPU as a heater isn't totally ridiculous.
haha, i use my pc and server for heating about 6 months a year. it works like a charm, i only have to use the radiator every now and then when it gets really cold. of course, summer days are a lot less fun...
Just run 'yes' a whole bunch of times
I remember seeing a video (maybe here on youtube somewhere) of someone using an AMD AM2/AM3 CPU as a tiny grill.
In the winter I used to close my office door when I was a little cold because my work computer would warm it up a little. Now that I have an M1 Mac this trick doesn't work anymore as it never gets hot enough...
I mine Bitcoin and Monero when the temperature drops below 60F. Gotta burn that energy for heat anyway.
In laptops, undervolting often increases performance. Lower voltage means less heat, which means it takes longer to thermal throttle.
as an i9 MacBook user, I felt this 😭😭😭 the i7 is faster 💀
(for music production and the occasional blender at work. I use Linux on my desktop)
"You are buying from companies like ASUS, Gigabyte, EVGA"
Or at least we used to buy from EVGA...
Not AMD cards tho.
All their cards were made by Pegatron and Jetway, so you were essentially buying Asus and Zotac products.
@@erinw6120 No, they were EVGA designs, just as I can design a PCB and get someone like Pegatron or Jetway to build up the PCB layers, run it through a pick-and-place machine and a reflow oven, then bolt on the fan and such that I designed to go on it.
@@EwanMarshall Bruh, I know how they were made. Worked there for five years.
didn't matter even then
Just an FYI, undervolting is not the same as underclocking. Every chip has some variance, and the default voltage is the one that will let every unit run at the specified clock. But since there's usually also some safety margin, you can often reduce the supplied voltage without lowering the clock, in other words, without losing performance. It's only if you reduce the power further that the clocks can't be maintained and performance is reduced. Usually when (only) undervolting, we try to maintain the same performance.
These usually go hand in hand, typically you don't touch one slider without touching the others
Depends on the intent of the undervolt. It's not uncommon to see both a power limit and a positive core clock offset. This results in the card running at a lower power, but the same speed/performance. For example, I'm running a 3070 TI right now with a 55% power limit and 150MHz "overclock", resulting in just over stock clock speeds, but at vastly lower power and temperature.
@@BrodieRobertson True, but if you set your undervolt too low the transistors cannot physically conduct, so the GPU cannot work. An underclock/power limit is different: it would still run at low utilisation but crash under higher load due to a lack of power.
@@BrodieRobertson I undervolt my GPUs without underclocking them. It saves me a dozen or so watts for free.
@@jm56585 I am actually able to get away with undervolting and overclocking. Not by a huge amount, but still running better and more efficiently than stock settings.
Considering the silicon lottery, and how VERY conservative some companies can be (like setting a minimum, or no range at all), I think it should be possible to go outside those bounds, but with a warning.
The driver is open source... just remove the patch and recompile.
@@stephanweinberger ...after every change
@@abit_gray well... now you might understand why the developers don't like to keep adding options for (at best) extremely rare use cases.
That said: nobody is stopping you from automating your builds.
@@stephanweinberger the option is a "config value" which you check when changing the value. Patch needs to match those lines.
In this case, the option should be there. If they have too many, they should look at other options but maybe, just maybe, there is a good reason to give users options.
@@abit_gray ... and that's exactly how _everyone_ thinks about their favorite "feature". So why should the developers prefer your particular one over others, especially when it wasn't intended to be there in the first place and potentially operates hardware out-of-spec?
Complexity does not grow linearly with new options; as they all potentially interact/interfere with each other (and more often than not even with other subsystems, particularly when they are on the driver level) complexity tends to grow exponentially.
I remember fixing a bug that was accidentally overwriting the intended tokenizer for the AI software with a fallback version. After it was fixed, we got a lot of reports that the coherency was worse. Turns out that the software we built upon had bugs in their default one and the fallback we were accidentally using was better. We didn't keep the bug of course, that would be silly since it was a bug. But we did reverse the order of the tokenizers properly so that the fallback would be the one that people disliked and would only be used after the one they liked failed.
1:28 Actually not necessarily. Since undervolting also decreases heat output, some CPUs can clock higher thanks to the additional thermal headroom.
Kinda counter-intuitive, I know, but CPUs these days kinda try to optimize their own speed when needed. But obviously, thanks to the silicon lottery, the manufacturers can only apply something which works for all sold CPUs, which may not be optimal for your specific one.
The hardware channel JayzTwoCents recently did a bit on this. Normally the too-high values are set by the motherboard manufacturers, even defaulting to voltage values higher than the cpu really wants. This leads to thermal limits on boost behavior after a minute or less of moderate load, but a tiny performance boost for the first few seconds or marginally better stability. Turning down the voltage to the CPU default limit, or possibly a little lower if you get a golden chip, lets your system run at full speed much longer and much quieter.
I use Gentoo BTW.
In Gentoo you can drop patches into the folder /etc/portage/patches and, as if by magic, your package will be patched.
For Gentoo users, reverting that patch is a very easy task.
Undervolting doesn't mean the GPU runs slower. It just means you reduced the voltage supplied to the GPU. In most cases people undervolt in order to maintain stock performance at a lower voltage. Underclocking makes the GPU slower. You can combine underclocking with undervolting if you want to drop your power consumption even further, but you can undervolt without underclocking. Depending on the architecture, different things can happen when you undervolt too much. On older GPUs especially you typically get a driver crash; on newer GPUs, however, you might get clock stretching, where the driver doesn't crash and the GPU reports the clock speed you expect but the performance drops. This is why you have to validate performance with a consistent load, such as a benchmark, after any change to the GPU settings.
Edit: in the vast majority of cases undervolting has nothing to do with idle power consumption. GPUs already downclock themselves significantly when the PC is idle. In most cases undervolting centres around the voltage used under load.
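The distinction this comment draws can be sketched with the textbook dynamic-power approximation P ≈ C·V²·f. This is a deliberate simplification (it ignores static leakage and boost behaviour), and every number below is made up for illustration, not a real GPU spec:

```python
# Rough dynamic-power model: P = C * V^2 * f (static leakage ignored).
# All numbers are illustrative, not real GPU specifications.

def dynamic_power(c: float, volts: float, freq_mhz: float) -> float:
    """Dynamic power in watts for effective capacitance c, core voltage, and clock."""
    return c * volts ** 2 * freq_mhz

C = 0.1  # made-up effective switched capacitance

stock = dynamic_power(C, 1.0, 2000)       # 1.00 V @ 2000 MHz -> 200 W
undervolt = dynamic_power(C, 0.9, 2000)   # same clock, lower voltage -> 162 W
underclock = dynamic_power(C, 1.0, 1600)  # same voltage, lower clock -> 160 W

# Undervolting keeps the clock (and so the performance) while cutting power
# quadratically with voltage; underclocking cuts power only linearly with
# clock and costs performance directly.
print(f"{stock:.0f} {undervolt:.0f} {underclock:.0f}")  # 200 162 160
```

In practice real cards also drop voltage as they downclock, which is part of why the two knobs get conflated, but they are independent.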
I absolutely get why AMD would want to prevent users from setting their power limits too low. If people report issues with undervolted hardware that might be misfiring because the user got unlucky and the in-spec card can't tolerate the low voltages, that would be extremely frustrating and opens up massive potential time sink.
I would 100% prefer a kernel parameter like amdgpu_allow_unsupported_voltages=1 that communicates that the user is not going to get support if they set this parameter, but I see why they would want to make the way forward for undervolting fans be to patch the driver themselves so they can fully wash their hands of that class of issues.
Setting a power limit is not undervolting. That's why limiting it makes no sense
This is why a simple log entry, plus a requirement that bug reports contain the log or current configuration, solves the problem. The change does not cause damage, but it can cause instability. If you have crashes and you've modified these numbers, revert that flag. Bug reports made while that flag is on get rejected. That simple.
@@knghtbrd Minecraft modding has a similar thing; if you're running a game version with Forge modloader and use the FoamFix mod (which does some jank to fix other jank in Java Minecraft) - the game logs state right at the top "You are using FoamFix - do not report any issues to the Forge Devs unless you can replicate them without FoamFix!"
Suggestion: Adhere to the undervolt limit from the device manufacturer/designer and add to the Linux Kernel documentation how to change the limit when recompiling the kernel. Then the users who want to make the change can bear the administrative burden of their edge case, and the maintainers of the kernel can go back to maintaining the kernel. If you are tech-savvy enough to know about kernel flags and recompiling, you are savvy enough to hack the kernel (or should be).
Not really tho? I can run my own kernel but that's just wasting more time for no good reason.
@@FakeMichau "I can run my own kernel but that's just wasting more time for no good reason." Exactly. At some point, if the issue is not important enough for you to maintain your own esoteric configuration, it's time to accept the decisions of those who are doing the maintenance. The kernel maintainers listened to the bug reporters, considered the issue in good faith, made a reasonable decision, and gave a reasonable explanation for their decision.
The open-source kernel makes it possible for dissenters to make their own changes and recruit others to join them. Those who want to strike out on their own should do so and let the rest of us know how well that worked.
If they're silly enough to enforce a restriction that - let's be honest - has a snowball's chance in hell of damaging hardware... Just require a boot time parameter to enable it. That's all.
I hate developers when they get their anti-user urges. It makes them part of the biggest problem that's plaguing society - "I know better than you".
@@big0bad0brad the point is that _everybody_ wants such parameters for their very special cases, and it's getting harder and harder to maintain them.
Why should the developers be burdened with that extra work, plus dealing with the extra "bug" reports for issues that aren't real issues but just people overtaxing their hardware?
It's FOSS ffs - if you really want that change so badly, go implement it or pay someone to do it for you. If you're not willing to do that, the change obviously _isn't_ all that important, is it?
@@big0bad0brad if you know better than the devs, why not just make the change in the code and compile? It's open source for a reason. Hell, you can even fork it, change that one aspect, and then merge in from upstream. The issue is people who think they're smarter than the devs not wanting to do dev work.
From the original issue by Federico it sounds like the limit might not be working correctly: he states that setting it to 150W resulted in 300+W of actual consumption. So either the user settings are not respected at all, or the board firmware has some unreasonably high values. And this is, in my opinion, the bad side of the AMD driver team's decision: trusting the third-party manufacturer to set valid limits (just ask the cooked AMD CPU owners who had voltage auto-tune enabled on the motherboard).
I will disagree on the XKCD: the case in question has an actual use achieving an actual result. In XKCD 1172, on the other hand, the user could achieve the same result just by modifying the script to check how long the spacebar was pressed.
Also, I'm for "allow to set parameters outside of tested range, but your warranty is void"
To me it seems like part of the problem is that if you set a too low value it seems to be using the default, whereas I’d expect it to use the lower bound.
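The behaviour being contrasted here can be sketched as two tiny policies. The bounds are example values (borrowed from the 208-249 W range another commenter reports CoreCtrl showing); none of this is the actual amdgpu code:

```python
# Two ways a driver could handle an out-of-range power-limit request.
# Bounds are illustrative example values, not real amdgpu constants.
CAP_MIN, CAP_MAX, CAP_DEFAULT = 208, 249, 249  # watts

def set_cap_fallback(requested: int) -> int:
    """Observed behaviour: an out-of-range request silently reverts to the default."""
    if CAP_MIN <= requested <= CAP_MAX:
        return requested
    return CAP_DEFAULT

def set_cap_clamp(requested: int) -> int:
    """Expected behaviour: clamp the request to the nearest allowed bound."""
    return max(CAP_MIN, min(CAP_MAX, requested))

print(set_cap_fallback(150))  # 249 -- asked for less power, got the maximum
print(set_cap_clamp(150))     # 208 -- at least honours the direction of the request
```

A third option, arguably the cleanest, is rejecting the write with an error so the caller knows nothing was applied.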
This just happened to me during the alpha of Plasma 6 - when you activate "show desktop" and then launch a new application from the visible desktop (activating a launcher icon or something else; I'd open the current wallpaper in Gwenview), previously all the other windows would immediately return, but in the Plasma 6 alpha the other windows would not return and you'd need to activate some other window to get them back. That broke my workflow.
Truest video on YouTube. ;) I've been the "longtime user" many times in my 19 years as an open source user (1998-2017), and I've heard about many examples of the need for bug compatibility from long before, during, and after my time. One of the biggest things that attracted me to Linux was its scheduler, which I later learned was considered a simple stupid hack for toy OSs. When cleverer schedulers were implemented for Linux, I got much less comfortable with it because the parts I needed to respond were throttled hard. Evidently, my problem with Windows was that it had a professionally-designed scheduler all along.
undervolting is technically distinct from underwatting; underwatting makes the gpu draw fewer amps from the same configured voltage, and the voltage is generally the same across all power brackets
The Windows API, oh my, there are a shit ton of bugs, especially in the older parts, and because fixing them may break some application, they are "features", until someone makes an entirely new library.
I think there should be an option to enable it again. It could be named `I_Want_Brick_My_Card_And_I_Will_Not_Ask_For_Help: true`; this setting would remove the limits but print lots of warnings, and show that info in any diagnostics too. If someone tries to diagnose an error on someone else's hardware, they can simply ask for this diagnostic info, and when they see the warning they can simply reply "you got what you asked for".
To quote a comment from @aDifferentJT, which as I'm scrolling is shown directly above yours:
"To me it seems like part of the problem is that if you set a too low value it seems to be using the default, whereas I'd expect it to use the lower bound."
I would agree that this should be the expected behavior. (Assuming it's agreed that a reversion is not going to happen.)
Just to be clear (because I am very picky about my word choices, to keep people who don't know English well from misunderstanding me): what I mean is that, as long as the applications haven't been updated to tell you that you can't set a value below the minimum allowed power limit, the driver should be using the lowest allowed value instead of falling back to the default high power limit (while the application you set the power limit in tells you that yes, it is being limited, when it really isn't).
@@GeorgeN-ATX "Should be"... it depends:
- Some would say that errors should always fall back to a known state. That's a good reason to use the default, or maybe to just return an error and do nothing (I would vote for the latter, 100%). Setting it to the minimum possible sets the value to a state unknown to you.
We don't really know the full story; we're talking about what a particular app did in that case. Maybe the fallback to the default happens there, not in the kernel.
It's not that hard to imagine a situation where rail voltage is so low the dynamic power gating mosfets do not fully switch on/off within the gpu, turning the internal power rails into resistive heaters.
Hey! I'm one of the people who was heavily involved in this issue, and I applaud your confidence in spite of not understanding this issue at all.
Your first and biggest mistake was saying that this is at all related to undervolting. As I (Tomasz Pakuła) said in the amdgpu issue, this has nothing to do with either undervolting (which is actually still allowed, despite what you state in this video) or underclocking.
We were modifying the power limit without touching clocks and voltage. That simply meant the GPU throttled earlier. That's completely standard behavior. Unfortunately, even Alex fails to understand that simple thing, and he is wrong in saying that it would damage GPUs at all. My 6800XT idles at 8W, and well, it has been designed to do so. I'm an ex-AMD engineer who worked with the ROCm guys and I know a bit more about power gating for AI benchmarks.
Please, consider revisiting this topic and fully understanding it. I'll be happy to provide more information. Again, this is NOT undervolting.
As for the 0W limit: you could set it, but it was ignored, as the GPU has a minimum clock limit. It would just always be in that state and pull maybe 50W or thereabouts, so there were already safety checks around that.
Sorry if I sound harsh, but my heart sinks every time someone mixes these things up.
I remember setting my Intel CPU power limits to 0W; it just locked at 800 MHz, no damage done. Why is any of this an issue? Don't the GPU manufacturers just validate how safely it runs at the _lowest clock_? Or am I damaging my M3 by having it mostly idle when using it?
I think everyone is just drinking bleach with fluoride for ice in this situation
Xkcd is just a fortune teller, like Nostradamus
I can tell from personal experience that a bug becoming a feature due to customer activity is unfortunately a very common thing in corporate software, even though it's something to be avoided at all costs.
As a person without much technical knowledge about hardware: how the hell could undervolting damage a device? It could turn it off, but physically damage the board??
The Linux team is worried about bug reports and the AMD team is worried about PR
I don't think I heard damage being mentioned
Running at very low voltage will mess with clocks, and that can damage the GPU
@@user-ro1cc8tz6d yep, both are great reasons, my doubt was about real damage to the board
Seems very unlikely to me. Maybe not impossible if it causes registers to get into invalid states. I have GPUs that have been mildly undervolted for around four years and they are still fine.
As many people have pointed out, this was about power limiting rather than undervolting. And, as noted in one of the issues, setting the limit too low apparently caused the GPU to behave erratically in a way which, at least in theory, could in the very worst case risk causing damage, or at the very least cause other instability issues. It seems it somehow resulted in spikes drawing too much power.
I can't use my laptop anymore
I used the software from 1172 to heat up my lap
birb
Undervolting, when done right, doesn't affect performance. Too low a voltage will make the system unstable, or the GPU will fail to turn on. Also, there's no single minimum voltage for every GPU, because it depends on the quality of the silicon, so one GPU could undervolt lower than another GPU of the same vendor/model
Instructions unclear, used pacman, now my fan is a heater.
As someone who is essentially the "Support Contact" for other companies that integrate with our platform over API, that XKCD comic absolutely tortured me.
This reminded me of a bug in a Joy-Con driver: when the Joy-Con started to vibrate, it never stopped until disconnected.
This obviously looks like a feature...
The 'fix' was a good one. If people want to go past those limits, let them patch their kernel. That way they show they know they're in deep water.
Alex is a big fan of not being bothered by bogus bug reports. I am on his side. You run hardware out of spec, it is your stupid decision.
Making being stupid harder weeds out the clueless whiners and lets those who actually know what they are doing still do it.
This reminds me of those wannabe overclockers who RMA'ed a shitton of mainboards until they hit the 'golden one' that gave them 50MHz more - increasing hardware prices for everybody else, because all those RMA'ed boards have to be paid for by someone. And guess what: that is everybody else.
I actually have played Scaler. Quite neat and full of interesting ideas. Probably overshadowed by all the other good 3D platformers/action-games in that generation...
1172? which one was that...
OH , THAT ONE? LOL i remember this one but ...
I did not know underclocking was even a thing.
Not only can undervolting reduce the power drawn, but also reduces the heat generated and can therefore make the GPU quieter and maybe even last longer as a result, definitely worth doing.
Within reason, unless you enjoy unstable hardware
V*A=W
undervolting makes your gpu core + hotspot run cooler, while keeping the same power target or by simultaneously increasing it(W) it will leave room for additional core clock speed
Normally this is written as P = IV right?
@@evandrofilipe1526 They were using the units: Watts=(Amps)(Volts)
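Spelling the units out (the figures below are illustrative only; real rail voltages and currents vary by card):

```python
# W = V * A: the quantities this thread is juggling, with made-up figures.

def power_watts(volts: float, amps: float) -> float:
    """P = I * V."""
    return volts * amps

print(power_watts(1.0, 200))  # a 1.0 V core rail drawing 200 A dissipates 200 W
print(power_watts(0.9, 200))  # at the same current, 0.9 V would be ~180 W

# In practice lowering the voltage also lowers the switching current
# (I is roughly proportional to V * f), so the savings compound.
```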
It's not necessarily that simple. With too low a voltage, not only do you get increased resistance in the transistors that are supposed to be on, you may get the opposing ones to never switch off fully. And that's static power draw that won't go away with clock gating and will be at its worst exactly where the weakest links are. "May damage your hardware" is quite literal.
undervolting is not underclocking or underwatting, sometimes if you win the silicon lottery you can use the extra power saved via undervolting to overclock well past what you could when you run at stock power.
Generally people don't touch one slider without the other; I know they're not the same thing
@@BrodieRobertson Yet for the whole duration of the video you talk about undervolting, which wasn't the point of this amdgpu issue at all.
@@FOREST10PL everybody watching the video knows the video is about power limits
@@BrodieRobertson Yeah, clearly even @FOREST10PL here knows that.
As I implied in my comments on the comment you pinned:
Saying "underwatting" would get complaints, and saying "power limiting" in place of "undervolting" would sound weird [to me] and wouldn't be what people are colloquially expecting to hear.
(P.S. I'm about to edit the comment you replied to with a '?'; after thinking about it I've changed my mind slightly.)
@@GeorgeN-ATX I thought it was pretty normal to call all power limiting on a GPU "undervolting"; that's the only term I have ever heard used
So, I undervolted + underclocked my AMD RX 6800 (-250 mV, max. 1900 MHz, resulting in a max of 120 watts instead of 210 watts).
I did notice in CoreCtrl the power limit range now being 208-249 watts. However, it didn't affect me, because I'm already well below that wattage limit. And doesn't a power limit just lower clocks anyway??
This is why I'm confused about all the fuss. You can just decrease the clock frequency; it's the same as setting a power limit.
BTW, for those who are confused about undervolt && underclock:
Undervolting: lowering power without loss of performance (because the frequency stays the same), thus boosting efficiency. How much you can undervolt depends on your luck in the silicon lottery.
Underclocking: lowering the max allowed clock frequency (MHz), thus lowering power (potentially a lot) but also lowering performance (non-linearly; the perf loss depends on how much you reduce it).
GPUs have different power states, you need to test the GPU under load to check if that setting is being respected
How undervolting can cause damage I do not understand: less power == less energy == less risk. Does the same patch also better query for the maximum voltage? Undervolting at worst makes the system non-functional until you increase the voltage to the minimum required again. Current draw doesn't increase in response to the lower voltage (otherwise it wouldn't save power).
Undervolting by drastic amounts is unsupported behaviour and it's unclear how certain parts of the board may behave to it
Undervolting causes problems for a different reason than overvolting. For example, overvolting a traffic light will cause the lights to explode. Undervolting makes them barely visible, and that can result in people crashing because they don't see the signals. Undervolting doesn't directly cause the problem, but it can cause a side effect that causes the damage, whereas overvolting just makes the component go pop.
Possibly if adjacent chips have different voltages you could get a lot of current flowing between them.
I think things like the memory bus on the CPU have a buffered voltage just to talk to the memory.
@@BrodieRobertson Unsupported, yes, but we do know how FETs (FinFETs in this case) behave. From what I can tell, the worst-case scenario is that there isn't enough gate voltage to switch the FET, which means it stays off - think of it like not putting enough force on a pushbutton/keyboard key: it never reaches the activation point. Of course, merely stopping the undervolt fixes that. Yes, it is unsupported behaviour, but I cannot see how that causes damage, and you are not undervolting the whole board but the GPU IC itself. It is those voltage regulator modules you are turning down.
@@ShadowManceri Undervolting can cause crashes and errors, but that's not damage; no permanent problem exists, it just means you have to stop undervolting to fix it. Unlike overvolting and overclocking, which increase energy draw and therefore heat load on the FETs (overclocking increases current to make the FET gates charge faster).
I need to undervolt my GPU because FFXIV crashes otherwise. I've got a Sapphire 7900 XTX Nitro+ and when it starts to draw too much power I get a hard crash. I can't remember the log error; something about "gfx_boundary" or "fence" or straight up "lost device". Only FFXIV does this, and the only solution I've found so far is to pull down the power limit to ~200W. I haven't tried a different power supply (though I'd guess a 1000W beQuiet! Dark Rock Pro should be good enough) or a UPS. So it might just be that when my PC draws too much, something becomes unstable. It's only FFXIV though, and only in very specific circumstances, which is confusing as hell, since with my 6800 XT it just ran flawlessly.
Are you doing that on Linux? It sounds like you're on Windows. The change talked about would still not affect you, as your card manufacturer supports the undervolt.
@@bedel23 I have to do that on Linux, yes. Haven't checked whether the problem is present under Windows, since I haven't had it on this PC for quite some time. And I know I am well within what's considered safe bounds.
@@pldcanfly If your card and system are otherwise stock, then you have a legitimate bug and you should mention it on the LKML. "My software/computer crashes" rather than "I want to save a few bucks" is the sort of thing that people will actually take an interest in.
The good thing about open source is that you can recompile the kernel with the old setting just for you...then when you report the bug they will notice.
Undervolting is a completely different thing and is usually restricted in your video card's BIOS.
The driver should say: "These are the out-of-the-box design limits [min value - max value]. We will not look at your bug reports if any of these values are outside the manufacturer's parameters. Remember that you could kill your hardware."
Then the user can set whatever value he/she desires. If it works for them, great! If not, change the values.
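The policy this comment proposes could be sketched like this. To be clear, every name here and the clamping fallback are illustrative assumptions for this thread, not the real amdgpu driver logic:

```python
# Hypothetical sketch of the "warn, don't refuse" bounds policy the
# comment proposes. All names and the clamping fallback are made up
# for illustration; this is not the real amdgpu code.

def apply_power_cap(requested_uw, cap_min_uw, cap_max_uw, allow_unsafe=False):
    """Return (cap_to_program_in_microwatts, warning_or_None)."""
    if cap_min_uw <= requested_uw <= cap_max_uw:
        return requested_uw, None
    warning = (
        f"{requested_uw} uW is outside the validated range "
        f"[{cap_min_uw}, {cap_max_uw}] uW; out-of-range values are "
        "unsupported and could kill your hardware"
    )
    if allow_unsafe:
        # User explicitly opted in: program the value anyway, but warn.
        return requested_uw, warning
    # Default: clamp to the nearest validated bound instead of
    # silently disabling the limit (the behavior the bug exposed).
    clamped = max(cap_min_uw, min(requested_uw, cap_max_uw))
    return clamped, warning
```

With `allow_unsafe` left off, a too-low request lands on the validated floor; with it set, the user gets exactly what they asked for, plus the warning.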
Interesting. This reminds me of the Esperanto CPU, a RISC-V chip with over 1000 cores that's meant to be an AI accelerator. They optimized for power usage rather than speed. There are sharp diminishing returns on the speed gained by throwing more power at a chip, so with much less electrical power they keep much of the processing power. And with over a thousand cores on one chip, running a bit below the "speed limit" should still be a screamin' demon.
It is very simple, let them set their sane defaults as they like.
But there is a nice video by JayzTwoCents that is titled: "Motherboard Default settings could be COOKING your CPU!"
That video showcases the problems with the "limits", as vendors / add-in board partners have a "bigger numbers go brrrr and win benchmarks" mentality and set them a lot higher than needed on both sides.
For CPUs you have to really know what you are doing to get them back to AMD's or Intel's specs, depending on the BIOS. It's the same with GPUs, at least GPUs that aren't being forcefully locked down like Nvidia's.
I like a more open ecosystem like AMD's, open even to board partners so they can play around with the hardware. But if we are to have this, then we must also have the ability to fix their "brrrr" moments ourselves as users of said hardware.
Either that or full Nvidia, and it is a shame Alex is a bit short sighted here.
But as you said Brodie, if I get an AMD GPU, I will be looking for a kernel with Alex's stuff snipped out of it. He may have a hissyfit knowing that such a thing may end up existing, but that is his problem.
I tried undervolting because I heard it worked well with AMD but didn't feel comfortable with the tools, so I gave it up and put things back to defaults.
You know this makes me respect the strong AMD stance of wanting to remain actually open source in light of the HDMI mess a bit more... On some level if they were willing to do things like move stuff they don't like to firmware it just solves things like this for them. Standing firm on the open source line keeps this sticky for them, but helps prove they aren't just paying lip service.
"It should be illegal to change open-source software in a way that I don't like." Uh... sure thing there, Skippy.
It's interesting that this isn't firmware-limited. I believe the upper overclocking limits are enforced in firmware and the card refuses to let you go over the upper bounds (I think it reverts to some fail-safe mode?) even if the kernel/driver is modified.
The whole thing is firmware-controlled and needs the clock to even set the correct base voltage. It would have to be done in an analog way somehow, and I guess that would be expensive.
People already undervolted; the fact that a bug allowed the software to drive the hardware lower than is reasonably safe isn't a good end result. Undervolting isn't being stopped, the bug was just fixed. I can completely understand devs not wanting to waste cycles debugging problems users create completely outside of safe specs, which the users then either don't remember or simply don't mention when submitting bugs, so information relevant to the bug gets missed.
I sympathize with users who did this, but I can't believe that too many people would be affected by going super-low voltage. At least it's not like people who lower their cars to millimeters off the ground, or jack them up several feet in the air. Those aren't just lowered or raised vehicles, which is legal within guidelines; they are technically illegal for safety reasons (ground clearance and crash issues, bumper height, center of gravity and rollover, etc.), but people do it anyway.
There are literally dozens of them. Let that community of EXTREME undervolters deal with the issue themselves; like you said, they can revert the patch. Much past that, we're all just wasting time worrying about something that shouldn't be worried about.
This is why Minecraft is a great game: all the bugs that became the core of the Technical Minecraft community's feature set let them do things never intended, but very useful.
yeah I definitely reported problems that came from undervolting, hehe. But mostly running at normal voltage atm.
People can still tweak their V/F curve and get a lower power draw that way.
3:48-3:50 R.I.P. EVGA
To be fair, undervolting is a feature, and falling back to a higher power limit on an out-of-range setting is a bug. I have my CPU, iGPU and GPU undervolted (the GPU is also overclocked). Thanks to that I get better performance, with little to no throttling and lower power use at the same time. My laptop is 10.5 years old now and counting. Apart from the TIM, an SSD instead of the ODD, an SSD install (3 drives in total, one of which is a 5920rpm HDD), and 2133MHz RAM instead of the original 1600MHz, all of its parts are original.
Coming in with my outsider perspective as someone who doesn't even know what a GPU does (beyond it processing graphics), I genuinely don't see why they couldn't have included an option for undervolting/-clocking and just added a few dozen warnings to it.
Well if they know their way around proper values they might as well patch it themselves
But the driver shouldn't let your average enthusiast irrecoverably damage their hardware by default, am I wrong?
RIP EVGA GPUs
Stand proud EVGA GPUs owners o7
o7 EVGA GTX 1080 hybrid still going strong.
I love the fact that people are playing around with and modding GPUs, though it should not overrule defaults or regress features for the sole sake of stability.
This is not about undervolting at all. A card following its power limit does not deviate from the factory-specified voltage-frequency curve. There is no reduction of voltage margin.
Funnily enough the way I configured my desktop is an example of an XKCD1172 like bug reliance.
I have limited my Vega 64 card to its minimum of 150W, which, next to the obvious power and noise benefits, also works around an 8x PCIe 3.0 limit from my 3200G processor.
The way it throttles to achieve this lower power limit nearly completely resolves stuttering caused by that bottleneck.
Which is a weird workaround, but it works, really darn well, leaving hardware en/decoding reliable and making frame-times more consistent.
That said, this level is reproducible on Windows and sits at the lower limit of the card's specification, so the patch will not affect me.
On that note, I don't mind the officially supported driver doing the normal checks, as custom drivers that remove the limit will be packaged for those inclined to limit further.
Which I would do if I got my hands on an RX6900 or similar card, as lowering the wattage limit within reason (>~60%) to my understanding carries practically zero risk of damaging the card.
This is because it simply makes the card run at a lower power limit it is designed to handle. At worst it might struggle with hardware en/decoding, and one of the first troubleshooting steps for that is reverting driver settings.
The underwatting/undervolting mixup is inexcusable though, as this refers to a very different thing.
I'm aware that the comments are filled with explanations on this, but I will add my own.
Underwatting is setting a lower power limit, making the card choose a lower profile from its set of factory-specified options.
Undervolting is changing these profiles to have a lower voltage, thus lowering the power draw while using this profile.
So the former leaves it running within specification (within reason), the latter does not and might cause internal stability or signal integrity issues.
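For what it's worth, the two knobs even live in different sysfs files on amdgpu. Here is a rough sketch of both; the exact paths, the microwatt unit of `power1_cap`, and the `vo`/`c` command syntax of `pp_od_clk_voltage` come from the amdgpu hwmon/overdrive interfaces, but they vary by GPU generation, so treat the details as assumptions to check against your card's documentation:

```python
from pathlib import Path

def set_power_cap_watts(hwmon_dir, watts):
    # "Underwatting": lower the board power limit. The card then picks
    # lower points from its factory-validated set of profiles on its own.
    # power1_cap is expressed in microwatts.
    (Path(hwmon_dir) / "power1_cap").write_text(str(int(watts * 1_000_000)))

def set_voltage_offset_mv(device_dir, offset_mv):
    # Undervolting: shift the factory V/F curve itself. The "vo" offset
    # command exists on some RDNA-era cards; each write is a separate
    # command, and writing "c" commits the pending change.
    od = Path(device_dir) / "pp_od_clk_voltage"
    od.write_text(f"vo {offset_mv}\n")
    od.write_text("c\n")
```

So `set_power_cap_watts(..., 150)` stays within specification, while `set_voltage_offset_mv(..., -50)` leaves it, which is exactly the distinction this comment draws.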
There should have been no further discussion after "add an option and a warning". That sounds like the only reasonable thing to do; everything else is either condescending or unsafe. *Probably add to that a rule that bug reports are required to include a dump of the module options used.*

I hope nobody used the original behaviour to make a system run stable with an underdimensioned power supply (in which case, there could be actual disastrous results from this regression).

Overclocking used to mean exactly that: doing modifications in proud contempt of manufacturer specifications and marketing, at your own risk. It's been co-opted as a marketing device, unfortunately. If an undervolter runs a switching converter into underload conditions with loss of regulation, or causes a latch-up in their chip with absurd voltage ratios (I don't know if either of these is possible with these board designs), of course that is on them.
I am on an RX 580 so I do not get power limit controls; I get voltage target and clock speed settings. I just wish I could read a voltage sensor on the card... the only thing I can read is the target voltage in the PowerPlay table.
I mean, if you REALLY want this, there is nothing stopping you from just patching the kernel yourself. Remove the line that checks the lower bound, set the variable to zero, recompile, and you're good. It's not a hard thing to do if you are THAT MUCH into undervolting that the base limits don't suffice for you.
Not being able to set arbitrarily low power levels is one thing, but if the setting hits the lower limit (and the limit is enforced) the card should be set to the lowest it allows, not go to full power.
For me, I'm more on the side of not letting users set out-of-bounds values, especially if it can damage the hardware, since it also means a virus could physically destroy your computer, not just your data.
However, as they proposed, I think there should be a global switch which disables every vendor-recommended bound so people can use outside values. If possible this should only be available from a place of trust, like the kernel parameters at boot, or even better something which can only be modified in the UEFI, preventing a virus from destroying your system while still letting users play outside the normal ranges.
These devs seem too nice, keeping the back-and-forth dialogue open for so long. How many points/counterpoints can you make?
AMD firmware has a history of interesting behaviour when power-draw-related controls are set to unrealistically low values. Anyone remember the Ryzen EDC bug? Just like this amdgpu bug, it basically disabled the power limit.
Undervolting has been very common in the last generations of both GPUs and CPUs and is not the same thing as underclocking; you lose very little performance but get a lower risk of thermal throttling.
It's been a common feature, but it being this widespread a thing that people talk about doing is relatively new.
@@BrodieRobertson hey! I'm one of the people who was heavily involved in this issue and I applaud your confidence in spite of not understanding this issue at all.
Your first and biggest mistake was saying that this is at all related to undervolting. As I (Tomasz Pakuła) said in the amdgpu issue, this has nothing to do with either undervolting (which is actually still allowed, despite what you state in this video) or underclocking.
We were modifying the power limit without touching clocks and voltage. That simply meant the GPU throttled earlier. That's completely standard behavior. Unfortunately, even Alex fails to understand that simple thing, and he is wrong in saying that it would damage GPUs at all. My 6800XT idles at 8W, and well, it has been designed to do so. I'm an ex-AMD engineer who worked with the ROCm guys and I know a bit more about power gating for AI benchmarks.
Please consider revisiting this topic and fully understanding it. I'll be happy to provide more information. Again, this is NOT undervolting.
@@BrodieRobertson I would say it's only necessary on the latest gens of GPUs; before that the power draw wasn't so high. I have only undervolted my 7900XTX, in Windows. My workstation with Linux has an Nvidia RTX A4000 and there's no need for any optimization.
Gentoo approach looking better and better every day.
No need to beg for parameters ever again, or even need them at all.
Would you mind elaborating? I would appreciate it!
@@GeorgeN-ATX Compiling from source is the default there. Even for the kernel.
@@cgarzs Thanks, I appreciate that, makes sense.
(Weird I didn't get a notification from you replying to me.)
@@GeorgeN-ATX Sounds about right. YouselessTube constantly shadow-censors me these days as well. I constantly have to check in a private tab that what I've said is actually visible to anyone else. It's such a pita.
Adding to cgarzs' explanation, I'd also like to mention that Portage makes patching packages dead easy by just dropping your patches on a folder. It really is wonderful.
yeaa... one of the reasons I'm staying on 6.6 for the time being
I used to limit my RX6800 GPU to 150W (from 219W) while only losing ~10% performance. I built amdgpu-dkms, but it doesn't sound like a great idea to use.
AMD's engineers know better than I do, I reckon, what works for their hardware design for longevity and stability. It is not a hill worth dying on; compile your own driver and kernel if you want it.
As a software engineer, I would be pissed if somebody came to me with complaints and I spent hours chasing it only to discover they changed something that was never intended to be.
I didn't know GPU undervolting was a thing.
Interesting.
"using values lower than the validated range can lead to undefined behavior and could potentially damage your hardware"
dang it, i'm using linux, i'm allowed to rm -rf / if i want, give us FREEDOM!!!!!!!!!!
Should've been a module flag to ignore the lower bounds, imho. To me it just sounds like the "we know better for you, so you will be happy" paradigm that people using open source software tend to want to escape from. o.o So yeah, would've wanted a flag.
That said, the guy should really take an hour or two and learn about the GPU modding scene - there is a ton of valuable information to be discovered. ^^
Sounds like a blanket way to disregard features that are now being restricted. Not every niche use case is invalid.
As someone who always puts a cap on the power usage of their systems... this just adds another thing to check for: "what is the limit of how little power I can feed this thing". It would be kinda nice if this data were available somewhere.
(Then again, I'm already in laptop-chips-only territory by now (for power reasons), so it won't really affect me.)
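That floor actually is exposed on amdgpu: hwmon publishes `power1_cap_min` and `power1_cap_max` alongside `power1_cap`, all in microwatts. A small reader, sketched with the hwmon directory passed in so it stays hardware-independent (on a real system the directory looks like `/sys/class/drm/card0/device/hwmon/hwmon*`):

```python
from pathlib import Path

def power_cap_range(hwmon_dir):
    # hwmon power values are in microwatts; convert to watts for reading.
    read_w = lambda name: int((Path(hwmon_dir) / name).read_text()) / 1_000_000
    return read_w("power1_cap_min"), read_w("power1_cap_max")
```

A min of 0 W (as one commenter below reports for a Radeon VII) just means the firmware advertises no meaningful floor there.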
When I wanted a space heater I used to run Android Studio
I am one of those who underclock my GPU to about 95%.
I say, let people do with their hardware what they want. If it's working for them, then I guess it's their right to set unconventional settings. And if it breaks, it's their fault. So for people who know what they're doing, there could be a hidden backdoor with special warnings that the setting could damage hardware. I think that's fair. Especially when it worked like that for years (without problems).
12:30 i cant LOL
There should be a way to turn this off that doesn't require recompiling the kernel. There are a hundred different ways to make your computer explode in Linux, one more shouldn't be that big of a deal.
But as long as there are third party workarounds, I'm fine with this
If they set a lower power draw than the envelope allows, then set it to the lower bound, don't unset it...
Like... What?
The user should have the right to potentially damage hardware if they so wish, unlike software, hardware is something you own and you should have the right to do what you want with it.
And you do, just not with the driver shipped by AMD
Imho, changing clock speeds to any value within bounds set by the manufacturer does not count as overclocking. Agreed that anything within those bounds should not cause harm to the thing being clocked, and anything outside of them should not be covered by warranty. With all the new tech like boosting, things have seemingly gotten more complicated, and they've tricked us into thinking that sliding a bar within bounds is overclocking; they're safe and happy with that. But if I bought a piece of hardware I would like the ability to completely ruin it if I want to... Linux used to be the go-to OS for such things, but I guess we lost that when manufacturers and industry took some interest in the platform and had to impose their control and will on its users.
I'm so confused...
I myself use LACT to undervolt my GPU, because my stock GPU is LOUD af. I just set an offset of -50mV or -100mV, adjust the fan curve and lower the clock speed a tiny bit. My GPU draws less power and is way cooler and quieter.
As much as I love tweaking and testing the limits of cards, I can see their point: if users aren't running things in spec but are still asking for support, I would get over it pretty quickly. I wish people would understand that if you are modifying things, you are on your own.
That is exactly the bug I had with my 7900XTX (Sapphire Nitro Vapor-X), and people told me the GPU was faulty and I should RMA it. I didn't, because it didn't make sense. I had to get rid of Arch and go back to Windows 11 (I was on 10 and was making the transition to Arch when I built this new PC with this card) to see if the issue might be OS-based (or down to the driver version, and I'd tried every one that existed at the time). It happened there too, just without fully freezing the entire desktop like on Arch, which crashed with that ring fx error after you exited a game or a 3D-accelerated program/app.
Many weeks later after the new AMD driver and the issue is fixed, it never happened again.
I miss that Arch system :'(
Why don't I go back and reinstall it?
Because I have a 4TB NVMe drive where I initially moved all my stuff, including games, and that's around 3TB of data (from my previous rig). It means I'd have to copy it all somewhere else first, then convert the current drive and partitions to ext4, and then put all that data back. It's a useless waste of write cycles.
I mean, it's still a bit dumb that setting things too low turns the limit off entirely; they should at least snap to the actual manufacturer floor.
Are we using power and voltage interchangeably here? Last I knew, volts and watts are different units, and whenever there's a value in here, it's in watts.
Well, P = V²/R, so changing the voltage also changes the power delivered. Undervolting specifically lowers the voltage supplied to the card, but that also results in less power being delivered.
@@HobbitJack1 Well, it obviously does, but sadly we're talking about P here, not V = √(PR).
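The quadratic term in this exchange is why undervolting pays off so well. A back-of-the-envelope sketch using the standard CMOS dynamic-power approximation P ≈ C·V²·f (the capacitance and frequency figures below are invented purely for illustration):

```python
def dynamic_power(c_eff_farads, volts, freq_hz):
    # Classic CMOS switching-power approximation: P ~ C * V^2 * f.
    return c_eff_farads * volts ** 2 * freq_hz

base = dynamic_power(1e-9, 1.00, 2.0e9)         # 2.0 W at 1.00 V
undervolted = dynamic_power(1e-9, 0.90, 2.0e9)  # same clock at 0.90 V
# A 10% undervolt cuts dynamic power by ~19%, since 0.9**2 == 0.81.
```

Power limits, by contrast, cap P directly and let the firmware pick whatever (V, f) point fits under the cap, which is the distinction the thread keeps circling.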
Heh.. I really see both sides of this. Obviously GPU manufacturers will err on the side of caution, reducing support burden etc... They have to behave as if the average user is a complete moron without actually saying it. And let's face it, there are plenty of morons... I think the solution you alluded to at the end is probably the "Best" one... If you want that feature, hack the driver or wait for someone else to do it.
The average user is quite bad with tech. Now consider that half of the population is even worse than the average user. That gives a little perspective on how foolproof these things need to be to even have a chance.
Is that the excuse for the AMD driver selecting the worst available pixel format over HDMI? It's hardcoded, with a little comment nearby saying it shouldn't be.
power1_cap_min is 0 on my Radeon VII 16GB reference card, so I am just confused now.
I don't really get why someone would buy (for example) a 200W GPU with a safe limit of 100W and then undervolt it to 50W. If they wanted to reach 50W anyway, why didn't they just buy a GPU with that TDP to begin with? At those levels of extreme undervolting they would have to forcefully underclock it as well, since the two parameters are directly proportional; it's physically impossible to maintain the exact same performance at a fraction of the power. To my eyes it's literally the same as just buying an older GPU (assuming you're not going far enough back that the drivers themselves are deprecated).
We have a saying in Brazil, "extracting milk from a stone", and as I see it, the people complaining about this bug seem to be addicted to it on a level I can't comprehend anymore. Kind of like the "just buy a new one" crowd, but on the opposite end of the spectrum.
So I am not the only one who can hear, from how the fans are spinning, how the machine is working. Well, obviously, having to deal with stupid users is never fun. On the other side, since it is open source (and even if it weren't), if this affected me I would either modify or reverse-engineer the drivers myself. At the end of the day, that way I'd know not to submit a bug report after changing the driver, since I broke the code and possibly the hardware myself; I'd know it was me, and I wouldn't go shaggy on Linus or other devs.
Ah, another case of "this bug being fixed is a minor inconvenience, it should be illegal to do this!!!"
I used that feature on my 7900XTX, thinking it was a legit thing, before they "broke" it with their "fix". The 7900 is very power-hungry, so you used to be able to power-limit it down to ~45W. Now you can't do that, and even in simple 2D games this shit will eat 400W. Nice.
I didn't do a thing; if it wasn't in scope for the Windows drivers, I didn't consider it a production workaround for the hardware but a hack with risks. The idea that we shouldn't trust the bounds provided by hardware creators is a dangerous slope, and I'd prefer not to go down that path of zero trust at the hardware level, as most people don't have the understanding needed to write their own drivers.