[Home]

FLUX.1 on a low-end machine

When I tried Stable Diffusion on a low-end PC, I was impressed by how easy and cheap it was to get it usable with just a small GPU investment. But after SD 1.5, the models quickly grew in size, and hardware requirements grew with them, so I largely lost interest. AI seemed to have become a brute-force effort demanding ever more excessive resources.

But at one point I decided to try out FLUX.1 and was impressed again: not only by the greatly improved picture quality, but also by the fact that it was once more runnable on a low-end machine. The essence of the story here is quantization: there are versions of the models well below 10 GB in size, so they work nicely even with just 16 GB of RAM and even less VRAM, while still staying close to the full models in quality.

This time, the only investment I made was an M.2 SSD and a PCI-E adapter (total cost under 100 €) to raise the model loading speed from a spinning SATA disk's 100 MB/s to 350 MB/s (PCI-E Gen2 x1). Not strictly necessary, but a nice improvement.
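As a rough sanity check on that upgrade, load time can be estimated as model size divided by sequential read throughput (assuming loading is I/O-bound, which it largely is here). A minimal sketch using the Q5_K_M size from the table below:

```python
def load_time_s(model_gb: float, throughput_mb_s: float) -> float:
    """Rough load-time estimate: size / sequential read throughput."""
    return model_gb * 1024 / throughput_mb_s

# Q5_K_M (7.85 GB) from the SATA spinning disk vs. the NVMe adapter:
print(round(load_time_s(7.85, 100)))   # ~100 MB/s SATA HDD -> 80
print(round(load_time_s(7.85, 350)))   # ~350 MB/s PCI-E Gen2 x1 -> 23
```

The estimate for the NVMe case lands close to the 24 s actually measured, so the disk really is the bottleneck during loading.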

Some examples of quantized model sizes (FLUX.1 Fill Dev):

Model variant | Size     | Load time (NVMe SSD, ext4)
--------------|----------|---------------------------
Full          | 22.17 GB | 68 s
Q8            | 11.85 GB | 36 s
Q5_K_M        | 7.85 GB  | 24 s
Q4_K_M        | 6.47 GB  | 20 s
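From the file sizes alone one can estimate how aggressively each variant is quantized, assuming the full checkpoint stores roughly 16 bits per weight (a simplification; the K-quant formats actually mix block scales and several bit widths):

```python
FULL_GB = 22.17  # full checkpoint, ~16 bits/weight assumed
variants = {"Q8": 11.85, "Q5_K_M": 7.85, "Q4_K_M": 6.47}

for name, size_gb in variants.items():
    bits = 16 * size_gb / FULL_GB  # effective bits per weight
    print(f"{name}: ~{bits:.1f} bits/weight")
```

The numbers come out near the nominal bit widths the quant names suggest (around 8.6, 5.7, and 4.7 bits per weight), which is why the names are a good proxy for size.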

Once loaded, Q5_K_M and Q4_K_M can be kept fully in RAM on a 16 GB machine, and they also fit into the 12 GB VRAM of an NVIDIA RTX 3060. Given their fast loading and small size, these are the most attractive ones to work with. The bigger models also work with partial loading, but things become just too slow. Quality-wise the smaller ones produce good, comparable results. Q5_K_M turned out the best fit for me, and it was pleasing to simply delete the others, forget about them, and free up disk space.

In image generation, the iteration speed at 1024x1024 resolution is 4.74 s/it (roughly 8 times the per-iteration time of SD 1.5) on an AMD FX-6330 with 16 GB RAM and an NVIDIA RTX 3060 12 GB graphics card (in a PCI-E Gen2 x16 slot). The bottleneck here is either the GPU or the PCI-E bus.
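At that iteration speed, the wall-clock time per image depends mainly on the sampler step count. A quick estimate, assuming 20 steps (a common FLUX.1 setting, not something measured here):

```python
SEC_PER_IT = 4.74  # measured at 1024x1024 on the RTX 3060
STEPS = 20         # assumed sampler step count

gen_time = SEC_PER_IT * STEPS
print(f"~{gen_time:.0f} s per 1024x1024 image")  # ~95 s
```

So on this machine a single image takes about a minute and a half, which is slow but quite workable for hobby use.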

But this shows that at least with FLUX models quantization works really well, and less can be more.


Last updated: 2025-04-23 19:49 (EEST)