Advanced LoRA Training: Weight Scaling Options

May 14, 2025

training guide

Intro

For kohya_ss LoRA training, there is some advice floating around that is quite useless. I looked closely at what the weight_decay and scale_weight_norms options really do, and here's what I found out. Note that weight_decay is an optimizer option and not a kohya_ss/sd-scripts feature, so that part applies to other trainers as well. Summary for the impatient:

  • Setting weight_decay=0.01 in your optimizer isn't affecting your training at all (Adafactor, AdamW, Prodigy)

  • weight_decay is scaled internally by the learning rate, so such small values shrink the weights by practically nothing

  • Setting scale_weight_norms=1.0 isn't going to do much and might be harmful

  • There's academic advice to set scale_weight_norms to 3 or 4, but that doesn't apply to LoRA training

  • scale_weight_norms is probably not useful, but the extra logging helps when playing with weight_decay for example

I will explain how these two features work and how to find effective values that at least do something tangible. But you might be wondering if it's worth bothering with these at all. I always set scale_weight_norms=10.0 in my training now, which gives me additional graphs in tensorboard about the key norms. A norm of 10 will never be reached, so it doesn't do anything to my model and I might learn something from this additional data in the future. Effective values for scale_weight_norms probably don't make better LoRAs. Properly set values for weight_decay certainly do something, but I'm not sure yet if it's useful or practical.

scale_weight_norms

So scale_weight_norms implements max norm regularization. The glaring issue here is that it's applied to the effective LoRA weights only (the product of the up and down weight matrices). That product is the full-size weight delta that gets added to the model weights. So that gets me thinking that somewhat larger norms might actually be healthy for the LoRA, since it sometimes needs to cancel out a large weight in the checkpoint model (but that's just a thought).

In any case, if scale_weight_norms had any proven benefit it would be highly dependent on setting it to some very specific value. And that's kind of dubious since it's the same for all trained layers/parameters in the network, which all have slightly different norms. Let me explain how it's implemented so you can do your own experiments if you want though.

When you set scale_weight_norms to a non-zero value, the feature is enabled and your training run produces 3 additional graphs in tensorboard under max_norm: average norm, max norm and number of keys_scaled. Average and key counts are also printed out alongside the progress bar in the console. Each "key" in this case is one layer/module/parameter matrix in the model. SDXL LoRAs target 722 modules for UNet and 264 for the text encoders.

After each training step, if the max norm of the LoRA weights exceeds the limit set by scale_weight_norms, the LoRA weights are scaled back. A scaling factor (smaller than one) is determined that makes the norm match the limit, and both the up and down matrix are multiplied by the square root of this factor. The keys_scaled counter is increased by one for each key where scaling happened. So the max_key_norm curve will always top out at the limit you set. If a lot of scaling happens, that of course shows in the average norm value too.
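
To make that concrete, here's a rough Python sketch of that scaling step, assuming a PyTorch-style setup. The function name and tensor shapes are mine for illustration, not sd-scripts' actual code:

```python
import torch

def apply_max_norm(up_weight, down_weight, limit):
    """Scale one LoRA module's up/down matrices so the norm of their
    product stays at or below `limit`. Returns True if scaling happened."""
    # Effective LoRA weight for this module (what gets added to the model)
    effective = up_weight @ down_weight
    norm = effective.norm()
    if norm <= limit:
        return False
    # Factor < 1 that brings the norm down to the limit; the square root
    # goes onto each matrix so the product shrinks by the full factor.
    ratio = (limit / norm).sqrt()
    up_weight.mul_(ratio)
    down_weight.mul_(ratio)
    return True

# One module's worth of weights; with this init the norm is well above 1.0,
# so scaling kicks in and the product's norm lands exactly at the limit.
up = torch.randn(64, 8) * 0.5
down = torch.randn(8, 64) * 0.5
scaled = apply_max_norm(up, down, limit=1.0)
print(scaled, (up @ down).norm())
```

In a trainer you would loop this over every targeted key, summing the True results into the keys_scaled count and logging the norms.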

It just so happens that my training runs these days usually don't even reach a max norm of 1.0, which is the value commonly suggested when you search around. And if scaling does happen, it's only for a couple of keys, and only later in the run. Setting much lower values (so that aggressive scaling happens, and early) didn't seem to make my models better. But again, maybe I'm missing something, and it's always good when more people run experiments.

weight_decay

So in general machine learning, some of the trained weights like to grow much more than others which leads to undesirable results. It would be good if you could tell the solver/optimizer to go looking for weight solutions that are more balanced and spread out. Of course that's something that you can do: Everything that the optimizer does tries to minimize the loss function. So you can guide training by adding terms to that function which get larger when there is something undesirable going on, like large weight norms in this case.

That term is called the L2 penalty and it grows with the square of the weights, so the optimizer is steered away from solutions that rely on a few very large weights. But it already has a tough job, and we don't want to add even more complexity to the solution space it needs to explore. It also gets in the way of other clever stuff that some optimizers are doing.
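
As a toy illustration (the function name and lambda value are made up, not from any particular trainer), the penalty is just the sum of squared weights added onto the task loss:

```python
import torch

def loss_with_l2(task_loss, parameters, l2_lambda):
    # Sum of squared entries across all parameter tensors: this term is
    # small for balanced weights and blows up for a few large ones.
    penalty = sum((p ** 2).sum() for p in parameters)
    return task_loss + l2_lambda * penalty

params = [torch.ones(2, 2)]                              # 4 weights of 1.0
total = loss_with_l2(torch.tensor(1.0), params, l2_lambda=0.1)
# task loss 1.0 plus 0.1 * 4 = 1.4
```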

For stochastic gradient descent, the L2 penalty is equivalent to just shrinking the weights a little overall at each step (sort of, see https://arxiv.org/pdf/1711.05101 for example). That shrinkage is called weight_decay, and many optimizers have that feature. It's not true proper L2 regularization, but still good in theory. The feature is really, really simple: At each training step, multiply all of the weights by a factor smaller than one. That's it (a single line of code). In practice they actually add a small negative factor, but it's mathematically the same (W*(1-d)=W+(-d)*W)
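
A minimal sketch of that one line (the function name is mine), showing that the multiplicative and additive forms are the same thing:

```python
import torch

def apply_weight_decay(weight, d):
    # The whole feature: multiply by a factor just below one.
    return weight * (1.0 - d)

w = torch.tensor([1.0, -2.0, 0.5])
decayed = apply_weight_decay(w, 0.01)

# Identical to the additive form W + (-d) * W
assert torch.allclose(decayed, w + (-0.01) * w)
```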

When searching around for advice on what value to actually use (for LoRA training), you will commonly find forum posts where someone suggests weight_decay=0.01 (or even 0.001). These people usually sound competent and experienced, so I tried it out. I repeated an identical training run (same seed and all) with and without weight_decay=0.01. Not only were the key norms (logged by scale_weight_norms) and all other graphs identical; the calculated model hashes were identical too. So I had trained the exact same LoRA twice, with all final weights matching up exactly, bit for bit.

I guess the intent of using 0.01 is to keep 99% of each value, shrinking the weights by 1% at each step. That should produce a clearly visible change over 2000 training steps. So what did the "experts" miss? The scaling factor gets multiplied by the current learning rate! That makes sense, because otherwise a decaying learning rate schedule would shrink the model down a lot as the LR approaches zero (the updates get tiny, but the decay would keep going at full strength).

If your LR is at 1e-5, that means we don't lower the weights by 1% but by one ten-millionth (1e-7) instead. Oops. If we really wanted that 1% reduction at LR 1e-5, the proper weight_decay would be 1e-2/1e-5 = 1e3 = 1000. So yeah. That's probably too much. So far I've only used this once: with a value of around 20, it certainly had an effect.
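
A quick sanity check of the arithmetic, assuming the per-step decay factor is lr * weight_decay (as in PyTorch's AdamW, for example):

```python
lr = 1e-5
weight_decay = 0.01

# Actual per-step shrinkage: about 1e-7, one ten-millionth
per_step_shrink = lr * weight_decay

# Decay needed to really shrink the weights by 1% per step at this LR:
# about 1000
required_decay = 0.01 / lr
```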

I will be playing more with this in the future. Note that scaling back the weights basically makes a weaker model, which lowering the learning rate also does. So the question is always if this can achieve something that lowering the learning rate cannot. Another way to look at this is that you can train with higher LR and the decay would scale things back to usable levels (assuming this is beneficial for model quality).

This is something that I'm already doing unknowingly anyways: My trained LoRAs always come out best looking when genning with a strength below 1.0. Before uploading them on civitai, I rescale them so the usable weight becomes 1.0 again. Nobody reads the recommended strength setting anyways and it's a hassle to copy it to your UI's LoRA preferences. Careful use of weight_decay could apply this rescaling during training already, resulting in a better model in theory since the weights don't overshoot during training.

In case you are wondering, this is certainly an advanced feature for fine tuning your training methods. You would use this to make a good LoRA better. If you are new and struggling to get good results at all, don't turn to weight_decay trying to get something usable. That's just my opinion, but also remember to never blindly trust any "experts".

Update

When crafting values to use for weight_decay, keep in mind that the optimizer handles the up and down weight matrices as separately trained parameters. So the scaling factor is applied twice, and the actual LoRA weights (up times down) shrink by the square of that factor. If the weights at the optimizer level shrink by 1% (factor 0.99), the effective LoRA weight shrinks to a strength of 98.01%, not 99%.
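
A tiny sketch of that squaring effect, with made-up tensors standing in for one LoRA module:

```python
import torch

up = torch.randn(4, 2)
down = torch.randn(2, 4)

before = up @ down
# Decay of 1% applied to each matrix independently, as the optimizer does
after = (up * 0.99) @ (down * 0.99)

# The effective LoRA weight kept 0.99^2 = 98.01%, not 99%
assert torch.allclose(after, before * 0.9801)
```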

Just another one of those examples where optimizer settings interact with LoRA training in very specific ways.
