So there's this AMD R9 290 I had lying around from back when two friends and I set up an ethereum mining rig (which is a story for another day). Only one fan, stock cooler.
I was thinking of selling it, but after a couple google searches, the internet told me that this card was in fact better than the Nvidia GTX 960 I was running.
So I pop it in place, reconfigure everything to use amdgpu instead of nvidia, and go for my standard test: Counter-Strike: Global Offensive.
Aaaand as soon as the game started, my screen went black, the GPU fan went to max, and I had to force a reboot. Ugly. Looking at the temperatures told a simple story: the GPU was reaching 100°C (94°C is the maximum allowed), and it was aborting all operations. This command shows the temps:
sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info
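The debugfs file needs root. A minimal sketch of an alternative, assuming the card's hwmon directory is `/sys/class/drm/card0/device/hwmon/hwmon0` (the index can differ per machine) and that it exposes a `temp1_input` file in millidegrees Celsius, as hwmon temperature files normally do. The function names are made up for illustration:

```shell
#!/bin/sh
# Assumption: hwmon directory of the GPU; the hwmonN index may differ on your machine.
HWMON="${HWMON:-/sys/class/drm/card0/device/hwmon/hwmon0}"

# Convert a millidegree reading (e.g. 77000) to whole degrees Celsius.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print the GPU temperature in °C.
# $1 = optional path to a hwmon temperature file (defaults to temp1_input).
gpu_temp_c() {
    millideg_to_c "$(cat "${1:-$HWMON/temp1_input}")"
}
```

Calling `gpu_temp_c` prints a single integer, which is easier to feed into a script than the full `amdgpu_pm_info` dump.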
Take it out, watch a video on how to disassemble the thing, remove all the little screws, curse bad screws and bad screwdrivers. Remove all the dry fossilized ancient thermal paste from the year 200 BC. I've never had isopropyl alcohol at home, so 96% ethanol had to do. Apply new thermal paste. Put the thing back together. Try again. Same story.
Maybe I put too little thermal paste! Disassemble it again, clean it well, drop a big fucking line of paste, reassemble.
Card back in the PC, this time we are going to be careful. And MEASURE things.
Well, it turns out that, first of all, the power draw of this thing is ridiculous, at least on this Linux/amdgpu combination. With a single monitor at 60Hz it draws 20W. Which is a lot, but acceptable. Now when you have 2 monitors, or a single one at 144Hz, you jump to 65 fucking watts. IDLE. Only xorg running, "GPU Load" at 0%. 65W.
Now, second thing: the fan speed is not ramping up properly at all. We were going from "pretty quiet" to "fuckfuckfuck max throttle" in one go, when it's already too late.
Poking around shows that we have the fan speed at
pwm2_max show that the range is 0-255.
By itself it was sitting at around 90, which is fairly quiet and gets the GPU under 60°C with one monitor at 60Hz.
If I want to keep things under control with 2 monitors though, I have to force the speed to 140.
Then at speed 170 I could open CSGO, although the temperatures were slowly rising. At speed 210 I got it to stabilize at 92°C. FPS unlocked, GPU more or less drawing as much as it could. So speed 210 is safe, it seems. We won't die if we keep it. Thing is, 210 is LOUD. Vacuum cleaner loud, almost. Not acceptable by any means.
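Those numbers suggest a gentler ramp than the driver's all-or-nothing behaviour. A rough sketch of such a fan curve, assuming two hypothetical anchor points taken from the measurements above (90 is enough at or below 60°C, 210 holds the card at about 92°C) and linear interpolation in between. The variable and function names are invented for illustration, and the curve is untested guesswork, not a tuned profile:

```shell
#!/bin/sh
# Hypothetical curve points, picked from the measurements in this post.
T_LOW=60;  PWM_LOW=90     # quiet speed that was enough at idle
T_HIGH=90; PWM_HIGH=210   # speed that kept CSGO stable around 92°C

# Map a temperature in °C to a PWM value (0-255 scale) by linear
# interpolation between the two points, clamped at both ends.
fan_pwm_for_temp() {
    t=$1
    if [ "$t" -le "$T_LOW" ]; then
        echo "$PWM_LOW"
    elif [ "$t" -ge "$T_HIGH" ]; then
        echo "$PWM_HIGH"
    else
        echo $(( PWM_LOW + (t - T_LOW) * (PWM_HIGH - PWM_LOW) / (T_HIGH - T_LOW) ))
    fi
}
```

For example, `fan_pwm_for_temp 75` prints `150`. A loop that reads the temperature every second or so and writes the result to the card's pwm file would turn this into a crude software fan controller. It only smooths the ramp, though: at full load it still ends up at the loud 210.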
Maybe I should just give up on this card.
Although, my brother has one that's the exact same model, and I think he doesn't have these problems?
radeon though, not
More investigation required. Tomorrow.
From the overclocking section of the Arch Wiki, even though I am not trying to overclock, I got a useful bit of info. I can limit the power draw with:
echo 150000000 > /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
where 150000000 means 150W (the value is in microwatts).
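Counting zeros by hand is error-prone, so here is a small sketch that takes the cap in watts and does the conversion itself. It assumes the same `power1_cap` path as above (the hwmon index may differ on other machines), and the function names are hypothetical:

```shell
#!/bin/sh
# Convert watts to the microwatt value that power1_cap expects.
watts_to_microwatts() {
    echo $(( $1 * 1000000 ))
}

# Write a power cap given in watts.
# $1 = watts, $2 = optional path to the power1_cap file.
# Writing to the real sysfs file needs root.
set_power_cap() {
    cap_file="${2:-/sys/class/drm/card0/device/hwmon/hwmon0/power1_cap}"
    watts_to_microwatts "$1" > "$cap_file"
}
```

Run as root, `set_power_cap 150` would write the same value as the echo above.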
I think the main problem I have is that the card is not thermally throttling properly. I tested CSGO on Windows, and there the fans spun up a bit (but not super high), but most importantly, the card lowered its power so it never exceeded 94°C. The game was playable, and the noise was bearable.
On Linux, on the other hand, if I don't limit the wattage and don't force the fans up, it puts the fans on max really late and then shuts down (due to emergency temp). There's a bunch of people with the same problem at https://bugzilla.kernel.org/show_bug.cgi?id=201539, although they don't seem to be able to manually set the pwm speed (which I can).
Just like them, I get a buggy reading for crit and hyst from
edge: +77.0°C (crit = +104000.0°C, hyst = -273.1°C)
So in short, I have a hardware issue and a software issue:
- Hardware: the cooling on this card is pretty shit.
- Software: the amdgpu driver doesn't throttle properly, and its automatic fan control is pretty bad.

- Linux version: 5.6.11
- Mesa: 20.0.6
- Distro: Arch Linux
- xf86-video-amdgpu: 19.1.0
- Kernel parameters: