Another new year, another generation of AI text-to-image models, so it's time to tinker. While the early SD1.5 was cool (and still is) for its hackability, FLUX.1 was mostly boring despite its impressive out-of-the-box quality. FLUX.2 and Z-Image bring a fresh kick again, and the development on the text encoder side is especially interesting.
Even though all the AI stuff is full of bloat and brute force, quantization nowadays helps keep GPU and VRAM requirements somewhat reasonable. However, the combined size of the denoiser and text encoder models is still huge, so CPU RAM is actually becoming the bigger bottleneck on my low-end machine.
To get reasonable throughput, models must stay resident in RAM. With a "small" 16 GB of RAM this is getting tough, so one needs to pick the right combinations carefully. While it's still possible to use the bigger models, re-loading them is so painfully slow it's not really worth it. Especially with distilled models, where inference itself is fast, constant re-loading would multiply the wall time per image.
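To put a number on that multiplier, here is a back-of-envelope sketch. The 60 s reload time is a hypothetical figure for illustration only, not a measurement; the 15 s inference time is the distilled FLUX.2 Klein 4B case from the tables in this post.

```python
# Illustration of why re-loading kills distilled models.
# reload_s is a HYPOTHETICAL disk-reload time, not measured.
reload_s = 60.0   # assumed time to re-read the model from disk
infer_s = 15.0    # e.g. FLUX.2 Klein 4B, 6 steps (measured)

per_image_cached = infer_s               # model stays resident in RAM
per_image_reload = reload_s + infer_s    # model evicted between images

ratio = per_image_reload / per_image_cached
print(f"{ratio:.0f}x wall time per image with constant re-loading")
```

With a slow 30-step model the same reload overhead is a much smaller fraction of the total, which is why it hurts the fast distilled models the most.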
With InvokeAI, assuming you have a dedicated headless machine, you can pin the model cache size to 14 GB - based on experiments, the remaining 2 GB is enough for InvokeAI itself and the underlying OS/software stack to still run smoothly. That leaves 14 GB for models, and the table below shows what can actually fit there at maximum:
| Model | Variant | Size (GB) | Text Encoder | Size (GB) | Total (GB) | Usage (GB) |
|---|---|---|---|---|---|---|
| FLUX.1 Fill Dev | Q4_K_M | 6.94 | BNB INT8 | 4.90 | 11.84 | 11.89 |
| FLUX.1 Fill Dev | Q5_K_M | 8.43 | BNB INT8 | 4.90 | 13.33 | 13.27 |
| FLUX.2 Klein 4B | BF16 | 7.75 | Q8 | 4.28 | 12.03 | 12.44 |
| FLUX.2 Klein 4B | Q8 | 4.30 | BF16 | 8.06 | 12.36 | 12.64 |
| FLUX.2 Klein 4B Base | BF16 | 7.75 | Q8 | 4.28 | 12.03 | 12.44 |
| FLUX.2 Klein 9B | Q4_K_M | 5.91 | Q5_K_M | 5.85 | 11.76 | 13.57 |
| FLUX.2 Klein 9B | Q5_K_M | 7.02 | Q4_K_M | 5.03 | 12.05 | 13.81 |
| Z-Image Base | Q8 | 7.22 | Q8 | 4.28 | 11.50 | 13.09 |
| Z-Image Turbo | Q8 | 6.58 | Q8 | 4.28 | 10.86 | 12.64 |
The "usage" column refers to InvokeAI's reported "cache high water mark" after repeated real use. It is sometimes notably bigger than the calculated total - exactly why is unclear, but at least the VAE and other components also need to live in the cache.
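For reference, the cache pinning mentioned above goes into `invokeai.yaml`. The key name below matches recent InvokeAI releases as far as I know, but it has changed between versions (older configs used a `ram:` setting), so double-check against your version's documentation:

```yaml
# invokeai.yaml - pin the model RAM cache to 14 GB on a 16 GB machine,
# leaving ~2 GB for InvokeAI itself and the OS.
# NOTE: key name may differ on older InvokeAI versions.
max_cache_ram_gb: 14
```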
Just some performance numbers for my personal use and reference.
Inference performance with NVIDIA RTX 3060 12 GB VRAM:
| Model | Variant | Inference steps | Total wall time |
|---|---|---|---|
| FLUX.1 Fill Dev | Q4_K_M | 30 (Euler) | 140 s |
| FLUX.1 Fill Dev | Q5_K_M | 30 (Euler) | 150 s |
| FLUX.2 Klein 4B | BF16 | 6 | 15 s |
| FLUX.2 Klein 4B | Q8 | 6 | 15 s |
| FLUX.2 Klein 4B Base | BF16 | 30 (Euler) | 55 s |
| FLUX.2 Klein 9B | Q4_K_M | 6 | 35 s |
| FLUX.2 Klein 9B | Q5_K_M | 6 | 35 s |
| Z-Image Base | Q8 | 30 (Euler) | 195 s |
| Z-Image Turbo | Q8 | 8 (Euler) | 30 s |
These numbers include the full text encoder pass, i.e. a fresh prompt for every image.
With musubi-tuner, both FLUX.2 Klein 4B and Z-Image LoRA training work with 16 GB RAM and 12 GB VRAM. During training, CPU RAM is not an issue since latents and text encoder outputs can be cached, leaving only the denoiser in use while training.
| Model | LoRA rank | Memory saving | Resolution | VRAM usage | Speed |
|---|---|---|---|---|---|
| FLUX.2 Klein 4B Base | 16 | (not needed) | 512 | 8456MiB | 2.0 s/it |
| FLUX.2 Klein 4B Base | 16 | (not needed) | 1024 | 10156MiB | 6.4 s/it |
| Z-Image Base | 16 | blocks_to_swap=8 | 512 | 10826MiB | 3.7 s/it |
| Z-Image Base | 16 | blocks_to_swap=10 | 1024 | 11578MiB | 14 s/it |
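To turn the s/it figures into something tangible, here is a rough total-time estimate. The 2000-step run length is an assumed example, not from any actual run:

```python
# Back-of-envelope training wall time from the measured s/it figures.
# `steps` is a HYPOTHETICAL run length for illustration.
steps = 2000
s_per_it = {
    "Klein 4B @512":  2.0,
    "Klein 4B @1024": 6.4,
    "Z-Image @512":   3.7,
    "Z-Image @1024":  14.0,
}
for name, sit in s_per_it.items():
    print(f"{name}: {steps * sit / 3600:.1f} h")
```

The 1024-resolution runs are roughly 3-4x slower per step, which adds up quickly over a full training run on this class of hardware.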
Last updated: 2026-04-07 22:51 (EEST)