Another new year, another generation of AI text-to-image models, so it's time to tinker. While the early SD1.5 was cool (and still is) for its hackability, FLUX.1 was mostly boring despite its impressive out-of-the-box quality. FLUX.2 and Z-Image bring a fresh kick again, and the development on the text encoder side is especially interesting.
Even though all the AI stuff is full of bloat and brute force, quantization nowadays helps keep GPU and VRAM requirements somewhat reasonable. However, the combined size of the denoiser and text encoder models is still huge, so CPU RAM is actually becoming the bigger bottleneck on my low-end machine.
To get reasonable throughput, models must stay resident in RAM. With a "small" 16 GB of RAM this is getting tough, so one needs to pick the right combinations carefully. While it's still possible to use the bigger models, re-loading them is so painfully slow it's not really worth it. Especially with distilled models, where inference itself is fast, constant re-loading would multiply the wall time per image.
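To put a number on that multiplier, here is a back-of-envelope sketch. The 60 s reload time is a hypothetical figure for illustration only, not a measurement; the 15 s inference time is the distilled FLUX.2 Klein 4B case from the tables in this post.

```python
# Illustration of why re-loading kills distilled models.
# reload_s is a HYPOTHETICAL disk-reload time, not measured.
reload_s = 60.0   # assumed time to re-read the model from disk
infer_s = 15.0    # e.g. FLUX.2 Klein 4B, 6 steps (measured)

per_image_cached = infer_s               # model stays resident in RAM
per_image_reload = reload_s + infer_s    # model evicted between images

ratio = per_image_reload / per_image_cached
print(f"{ratio:.0f}x wall time per image with constant re-loading")
```

With a slow 30-step model the same reload overhead is a much smaller fraction of the total, which is why it hurts the fast distilled models the most.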
With InvokeAI, assuming you have a dedicated headless machine, you can pin the model cache size to 14 GB - based on experiments, the remaining 2 GB is enough for InvokeAI itself and the underlying OS/software stack to still run smoothly. That leaves 14 GB for models, and the table below shows what can actually fit there at maximum:
| Model | Variant | Size (GB) | Text Encoder | Size (GB) | Total (GB) | Usage (GB) |
|---|---|---|---|---|---|---|
| FLUX.1 Fill Dev | Q4_K_M | 6.94 | BNB INT8 | 4.90 | 11.84 | 11.89 |
| FLUX.1 Fill Dev | Q5_K_M | 8.43 | BNB INT8 | 4.90 | 13.33 | 13.27 |
| FLUX.2 Klein 4B | BF16 | 7.75 | Q8 | 4.28 | 12.03 | 12.44 |
| FLUX.2 Klein 4B | Q8 | 4.30 | BF16 | 8.06 | 12.36 | 12.64 |
| FLUX.2 Klein 4B Base | BF16 | 7.75 | Q8 | 4.28 | 12.03 | 12.44 |
| FLUX.2 Klein 9B | Q4_K_M | 5.91 | Q5_K_M | 5.85 | 11.76 | 13.57 |
| FLUX.2 Klein 9B | Q5_K_M | 7.02 | Q4_K_M | 5.03 | 12.05 | 13.81 |
| Z-Image Base | Q8 | 7.22 | Q8 | 4.28 | 11.50 | 13.09 |
| Z-Image Turbo | Q8 | 6.58 | Q8 | 4.28 | 10.86 | 12.64 |
The "usage" column refers to InvokeAI's reported "cache high water mark" after repeated real use. It is sometimes notably bigger than the calculated total - exactly why is unclear, but at least the VAE and other components also need to live in the cache.
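For reference, the cache pinning mentioned above goes into `invokeai.yaml`. The key name below matches recent InvokeAI releases as far as I know, but it has changed between versions (older configs used a `ram:` setting), so double-check against your version's documentation:

```yaml
# invokeai.yaml - pin the model RAM cache to 14 GB on a 16 GB machine,
# leaving ~2 GB for InvokeAI itself and the OS.
# NOTE: key name may differ on older InvokeAI versions.
max_cache_ram_gb: 14
```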
Just some performance numbers for my personal use and reference.
Inference performance with NVIDIA RTX 3060 12 GB VRAM:
| Model | Variant | Inference steps | Total wall time |
|---|---|---|---|
| FLUX.1 Fill Dev | Q4_K_M | 30 (Euler) | 140 s |
| FLUX.1 Fill Dev | Q5_K_M | 30 (Euler) | 150 s |
| FLUX.2 Klein 4B | BF16 | 6 | 15 s |
| FLUX.2 Klein 4B | Q8 | 6 | 15 s |
| FLUX.2 Klein 4B Base | BF16 | 30 (Euler) | 55 s |
| FLUX.2 Klein 9B | Q4_K_M | 6 | 35 s |
| FLUX.2 Klein 9B | Q5_K_M | 6 | 35 s |
| Z-Image Base | Q8 | 30 (Euler) | 195 s |
| Z-Image Turbo | Q8 | 8 (Euler) | 30 s |
These numbers include the full text encoder pass, i.e. a fresh prompt for every image.
With musubi-tuner, both FLUX.2 Klein 4B and Z-Image LoRA training work with 16 GB RAM and 12 GB VRAM. During training, CPU RAM is not an issue since latents and text encoder outputs can be cached, leaving only the denoiser in use while training.
| Model | LoRA rank | Memory saving | Resolution | VRAM usage | Speed |
|---|---|---|---|---|---|
| FLUX.2 Klein 4B Base | 16 | (not needed) | 512 | 8456MiB | 2.0 s/it |
| FLUX.2 Klein 4B Base | 16 | (not needed) | 1024 | 10156MiB | 6.4 s/it |
| Z-Image Base | 16 | blocks_to_swap=8 | 512 | 10826MiB | 3.7 s/it |
| Z-Image Base | 16 | blocks_to_swap=10 | 1024 | 11578MiB | 14 s/it |
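To turn the s/it figures into something tangible, here is a rough total-time estimate. The 2000-step run length is an assumed example, not from any actual run:

```python
# Back-of-envelope training wall time from the measured s/it figures.
# `steps` is a HYPOTHETICAL run length for illustration.
steps = 2000
s_per_it = {
    "Klein 4B @512":  2.0,
    "Klein 4B @1024": 6.4,
    "Z-Image @512":   3.7,
    "Z-Image @1024":  14.0,
}
for name, sit in s_per_it.items():
    print(f"{name}: {steps * sit / 3600:.1f} h")
```

The 1024-resolution runs are roughly 3-4x slower per step, which adds up quickly over a full training run on this class of hardware.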
Last updated: 2026-04-07 22:51 (EEST)