Anyone have experience with StableDiffusion on Linux with AMD and NVIDIA?

abcdqfr@lemmy.world · edited 5 months ago

wewbull · 5 months ago

My experience is that AMD’s virtual memory system for VRAM is buggy and those bugs cause kernel crashes. A few tips:

If running both cards is overstressing your PSU, you might be suffering from voltage drops when your GPU draws maximum power. I was able to run games absolutely fine on my previous PSU, but running diffusion models caused it to collapse. Try just a single card to see if it helps stability.
Make sure your kernel is as recent as possible. There have been a number of fixes in the 6.x series, and I have seen stability go up. Remember: Docker images still use your host OS kernel.
If you can, disable the desktop (e.g. systemctl isolate multi-user.target) and run the web GUI over the network from another machine. If you’re running ComfyUI, that means adding --listen to the command-line options. It’s normally the desktop environment that triggers the crashes, when it tries to access something in VRAM that has been swapped out to normal RAM to make room for your models. Giving the whole GPU to the one task boosts stability massively. It’s not the desktop environment’s fault, though: the GPU driver should handle this situation.
When you get a crash, often it’s just the GPU that has crashed and not the machine (this won’t be true of a power supply issue). SSHing in and shutting down cleanly can save your filesystems the trauma of a hard reboot. If you don’t have another machine, grab an SSH client for your phone, like JuiceSSH on Android. (Not affiliated. It just works for me.)
Using rocm-smi to reset the card after a crash might bring things back, but not always. Obviously you have to do this over the network as your display has gone.
Be aware of your VRAM usage (amdgpu_top) and try to avoid overcommitting it. It sucks, but if you can avoid swapping VRAM everything goes better. Low-memory modes in the tools can help. ComfyUI has --lowvram, for example, which more aggressively removes things from VRAM when it has finished using them. It slows down generations a bit, but that’s better than crashing.
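
For the kernel and VRAM tips above, a rough pre-flight check could look something like the sketch below. It is only an illustration, not anything from ComfyUI or ROCm: the function names are made up, and it assumes the amdgpu driver exposes its mem_info_vram_used and mem_info_vram_total counters under the card's sysfs device directory (which card number that is will vary, especially on a two-GPU box).

```python
import platform
import re
from pathlib import Path
from typing import Optional


def kernel_major_minor(release: str) -> tuple[int, int]:
    """Parse a release string such as '6.8.0-45-generic' into (6, 8)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def vram_usage(device_dir: Path) -> tuple[int, int]:
    """Return (used_bytes, total_bytes) read from the amdgpu sysfs counters.

    Assumption: device_dir is something like /sys/class/drm/card0/device.
    """
    used = int((device_dir / "mem_info_vram_used").read_text().strip())
    total = int((device_dir / "mem_info_vram_total").read_text().strip())
    return used, total


def has_headroom(used: int, total: int, needed: int) -> bool:
    """True if `needed` more bytes fit in VRAM without overcommitting it."""
    return used + needed <= total


def report(device_dir: Path, needed: int, release: Optional[str] = None) -> str:
    """Build a one-line summary of kernel version and VRAM headroom."""
    # The host kernel is what matters, even when running inside Docker.
    major, minor = kernel_major_minor(release or platform.release())
    used, total = vram_usage(device_dir)
    gib = 1024 ** 3
    verdict = "fits" if has_headroom(used, total, needed) else "would overcommit"
    return (
        f"kernel {major}.{minor}, "
        f"VRAM {used / gib:.1f}/{total / gib:.1f} GiB used, "
        f"{needed / gib:.1f} GiB request {verdict}"
    )
```

Calling report(Path("/sys/class/drm/card0/device"), 6 * 1024**3) before starting a run would print the kernel version and say whether a (hypothetical) 6 GiB model fits in what is currently free. The 6 GiB figure is a placeholder; measure your own models with amdgpu_top.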

With this I’ve been running SDXL on an 8GB RX 7600 pretty successfully (~1s per iteration). I’ve been thinking about upgrading, but I think I’ll wait for the RX 8000 series now. It’s possible the underlying problem is something in the GPU hardware, as AMD are definitely improving things with software changes but not solving it once and for all. I’m also hopeful that they will upgrade the VRAM across the range. The 16GB 7600 XT says to me that they know <16GB isn’t practical anymore, so the high end also has to go up, right?