Z Image Turbo - Character LoRA Training Guide [Easy How-To Guide, ADHD friendly]

Introduction

Hey folks out there. This is the standalone guide without the LoRA attached. If you just want to start, skip forward to "Dataset".

I tried to write a guide on how I did it. I searched a lot for simple, straightforward guides for ZIT character LoRAs but couldn't find a proper one, so I decided to write my own. This is my first time writing a guide, and especially my first time writing something like this in English. I can't promise that this guide will work out for you. I have ADHD, so I wrote it optimized for my way of thinking and my level of understanding. I assume you know how to use the common base AI models, like the big LLMs (Gemini, ChatGPT, ...) and Midjourney or other image generation models/tools. I also assume you have the hardware to do this or know how to rent it. You just need one card with 24GB of VRAM or more. I will try it on an RTX 5060 Ti 16GB in the future and share the results as an update, but as far as I know, training for Z-Image should also be possible on lower-tier cards.

Content/Procedure

  • dataset creation

  • training

Tools being used

  • ChatGPT or Gemini (other LLMs possible, but these are the most common ones)

  • Google's Nano Banana

  • Ostris AI-Toolkit (also available as a template on runpod.io)

Dataset

A dataset is a set of pictures and captions used for LoRA training. Generating your dataset takes multiple steps.

Base portrait

Your first step is the base portrait of the character, which you can generate wherever you want. My recommendations are Midjourney or Google's Nano Banana.

The base portrait should be a frontal 3/4 upper-body portrait with a neutral expression.

Edit to dataset

Once you have your base portrait, you generate the dataset from it: an AI model edits the base portrait into different portraits of the same character.

The best model for this task is Google's Nano Banana. You can also use Flux.1 Kontext, but in my experience Nano Banana is far better.

In this guide we'll generate a total of 20 images for the dataset. More is possible and could improve results, but 20 is more than fine.

Always use your base portrait as the input picture and generate picture after picture. You can use the following 19 neutral prompts, but I'd recommend asking ChatGPT or Gemini to generate optimized prompts for you, based on the neutral prompts and your base portrait. (If you'd rather script the generation, see the sketch after the prompt list.)

Prompts:

Head & Face Focus (9 prompts)

1. Frontal headshot

“Edit the image to show a frontal headshot, head facing directly forward, neutral expression, even soft lighting.”

2. Slight left turn

“Edit the image so the head is turned slightly to the left, close-up portrait, relaxed neutral expression.”

3. Slight right turn

“Edit the image so the head is turned slightly to the right, close-up portrait, neutral expression.”

4. Three-quarter left

“Edit the image to a three-quarter left head angle, face clearly visible, soft natural lighting.”

5. Three-quarter right

“Edit the image to a three-quarter right head angle, calm expression, balanced lighting.”

6. Left profile

“Edit the image to a clean left profile view, head in full side angle, neutral lighting.”

7. Right profile

“Edit the image to a clean right profile view, head in full side angle, soft shadows.”

8. Slight downward gaze

“Edit the image so the head tilts slightly downward, eyes looking down gently, close-up framing.”

9. Slight upward gaze

“Edit the image so the chin is slightly raised, eyes looking forward or slightly upward, soft light.”

Upper Body / Mid-Shot (6 prompts)

10. Upper-body frontal

“Edit the image to an upper-body portrait, frontal angle, relaxed posture, neutral background.”

11. Upper-body three-quarter left

“Edit the image to an upper-body shot at a three-quarter left angle, natural stance.”

12. Upper-body three-quarter right

“Edit the image to an upper-body shot at a three-quarter right angle, calm posture.”

13. Seated upper-body

“Edit the image so the subject is seated, upper-body framing, relaxed shoulders, natural lighting.”

14. Upper-body with slight lean

“Edit the image to show a subtle forward lean, upper-body framing, neutral expression.”

15. Upper-body with head tilt

“Edit the image so the head tilts slightly to one side, upper-body portrait, soft light.”

Full Body / Spatial Context (4 prompts)

16. Full-body frontal

“Edit the image to a full-body view, standing naturally, frontal angle, simple environment.”

17. Full-body three-quarter left

“Edit the image to a full-body view at a three-quarter left angle, relaxed stance.”

18. Full-body three-quarter right

“Edit the image to a full-body view at a three-quarter right angle, neutral posture.”

19. Full-body side view

“Edit the image to a full-body side view, natural standing pose, balanced lighting.”
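If you'd rather script the 19 edits than generate them one by one in a web UI, here is a minimal sketch using Google's google-genai Python SDK. Treat the model ID ("gemini-2.5-flash-image"), the filenames, and the API-key handling as assumptions and check the current Gemini API docs before running it.

```python
# Minimal sketch: batch-edit the base portrait via the google-genai SDK.
from io import BytesIO
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # assumption: key from Google AI Studio
base = Image.open("base_portrait.png")         # assumption: your base portrait file

prompts = [
    "Edit the image to show a frontal headshot, head facing directly forward, "
    "neutral expression, even soft lighting.",
    # ...add the remaining 18 prompts from the list above
]

for i, prompt in enumerate(prompts, start=1):
    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumption: current Nano Banana model ID
        contents=[base, prompt],
    )
    # The response can mix text and image parts; save the image ones.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save(f"yourcharacter_{i:04d}.png")
```

This also names the files in the scheme used below, so you can skip the manual renaming if you go this route.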

After you've generated every picture, store them in a folder and rename them. It is important to add a counter to the filename:

yourcharacter_0001.png

yourcharacter_0002.png
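If you don't want to rename 20 files by hand, a small script can do it for you. A minimal sketch, assuming your generations sit in a folder called raw_generations and should end up in dataset:

```python
# Minimal sketch: copy images into the dataset folder with a zero-padded counter.
from pathlib import Path
import shutil

src = Path("raw_generations")  # assumption: folder with your generated images
dst = Path("dataset")          # assumption: final dataset folder
dst.mkdir(exist_ok=True)

for i, img in enumerate(sorted(src.glob("*.png")), start=1):
    shutil.copy(img, dst / f"yourcharacter_{i:04d}.png")
```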

Captioning

A dataset isn't complete without captions, and you have multiple options for getting them. You need one .txt file for each picture, with the same filename as the picture:

yourcharacter_0001.txt

yourcharacter_0002.txt

The first option is to type them by hand: just one or two short sentences describing the picture. Try to be consistent and use the same sentence scheme for each picture. There are websites like this one that handle the filename formatting for you.
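For example, a hand-written caption for yourcharacter_0001.txt could look like this (the trigger word "yourcharacter" is just a placeholder):

"photo of yourcharacter, frontal headshot, neutral expression, soft even lighting, plain background"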

The second and more efficient option for captioning is using Gemini or ChatGPT. Both models are capable of writing all the captions for you, even optimized for Z Image Turbo. Just instruct them to caption the following pictures optimized for Z Image Turbo and post the pictures into the chat. You want them to output one .txt per picture with the correct filename. Read through all of them to make sure they are correct.
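An instruction along these lines works well (the trigger word and the image count are placeholders to adapt):

"Caption the following 20 images for LoRA training on Z Image Turbo. Start every caption with the trigger word yourcharacter, keep each caption to one or two sentences describing pose, framing and lighting, and output one .txt file per image, named yourcharacter_0001.txt, yourcharacter_0002.txt, and so on."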

The third option is as good as Gemini or ChatGPT, but runs locally. If Gemini or ChatGPT won't caption your pictures because they are too NSFW, you can use JoyCaption. For this process I'll link to this guide from StartHua. I won't explain ComfyUI in this guide.

After you have all the captions, put them into the same folder as your pictures. Now your dataset is complete.
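Before moving on, it's worth checking that every picture really has a caption and vice versa. A minimal sketch, assuming the dataset folder from the renaming step:

```python
# Minimal sketch: verify that .png and .txt files pair up one-to-one.
from pathlib import Path

dataset = Path("dataset")  # assumption: folder name from the renaming step
images = {p.stem for p in dataset.glob("*.png")}
captions = {p.stem for p in dataset.glob("*.txt")}

print("pictures missing a caption:", sorted(images - captions) or "none")
print("captions missing a picture:", sorted(captions - images) or "none")
```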

Set up the Dataset in AI-Toolkit

To train the actual LoRA we'll use Ostris AI-Toolkit. You can run it via runpod.io with the Ostris template or run it locally. For the setup I'll link to the guide from Ostris himself; I won't explain the setup here.

After you've started AI-Toolkit, head over to "Datasets". Click "New Dataset" and name it whatever you want; the name doesn't matter for the training. Then click "Add Images" and upload all your pictures and caption files from the dataset. Once the upload finishes, AI-Toolkit assigns the captions to the right pictures.

The Training Job

Now head over to "New Job". Here you'll set up the parameters for the training. Don't worry: it's easy to learn, just hard to master.

I'll walk you through every section I think is relevant for now, from left to right. The good thing is that most of the settings come pre-applied; just change the ones I point out and you should be good to go.

You start in the "Job" section, where you just need to give your LoRA a name. The next section is "Model". Click "Model Architecture" and choose "Z-Image Turbo (w/Training Adapter)". In the "Quantization" section you can change the "Transformer" to "NONE" or leave it at "float8", depending on how much VRAM you have. With more than 24GB of VRAM you should be fine with "NONE"; "float8" saves VRAM but is slower.

Now head over to the "Training" section. Change the "Timestep Type" to "Sigmoid" and activate "Cache Text Embeddings". Don't change anything else here, especially not the "Learning Rate": increasing it can make the training fall apart. After you've trained your first model you can try increasing the steps if you think you need more, but 3000 should be enough.

After this, go to the "Dataset" section. Here you just select your dataset.

The last step is the sample prompts. If you're using a trigger word, make sure you include it in your sample prompts; otherwise the samples won't apply the LoRA properly.
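For example, with the placeholder trigger word "yourcharacter", a sample prompt could be:

"photo of yourcharacter, upper-body portrait, standing in a park, natural lighting"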

Now you can start your training job.

The training should take about 2 hours on an RTX 5090. After the training is done, go through your samples and decide which checkpoint is the best. Across all my tries, I found the 3000-step one the most useful.

And that's it: you now have a Character LoRA for Z-Image Turbo.

A few words in the end

I hope this guide helped you train a Character LoRA for ZIT without your brain melting. I know I didn't explain much about what each setting does, but that wasn't my intent with this guide. It's just meant to walk you easily through the steps needed to train a LoRA.

I'm open to hints or suggestions for improving the guide, and as I already said last time: I would love to exchange information and research on this topic.
