Fine-tuning Hyperparameters

In machine learning, the settings that control the fine-tuning process are called hyperparameters.

To configure fine-tuning hyperparameters, click "Show advanced."

These settings vary depending on what platform and base model you select. Generally speaking, Entry Point sets reasonable defaults for hyperparameters, so modifying them further is optional.

Below is some information on the most common hyperparameters.

Epochs

The number of epochs describes how many times your entire dataset will be used to train the model. Usually, the best value is in the range of 2-5. Using too many epochs introduces the risk of "overfitting" your data, which degrades the model's ability to generalize and perform well on data that was not present at training time.

You can determine if you had too many epochs by comparing the training loss and validation loss after training. If training loss keeps decreasing but validation loss starts to increase from a local minimum on the chart, you are likely overfitting. The model is not trained on validation examples, which makes them useful for testing it against "unseen" data.
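As an illustration, here is a minimal sketch of that comparison. The loss values are made up for the example:

```python
# Hypothetical per-epoch losses from a fine-tuning run (illustrative numbers only).
training_loss   = [2.10, 1.40, 0.95, 0.60, 0.38, 0.22]
validation_loss = [2.20, 1.55, 1.20, 1.05, 1.12, 1.30]

# Find the epoch where validation loss bottoms out; training past it likely overfits.
best_epoch = min(range(len(validation_loss)), key=lambda i: validation_loss[i]) + 1
print(f"Validation loss is lowest after epoch {best_epoch}.")
print("Training loss keeps falling afterward while validation loss rises: overfitting.")
```

Here training loss falls every epoch, but validation loss turns upward after epoch 4, so fewer epochs would likely generalize better.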

As a general rule, you can get better results with fewer epochs and a larger training dataset.

Learning Rate

Learning rate determines how much a model's weights should be updated after each batch of examples that it tries to predict. The best learning rate often depends on the model and how it was trained originally.

Common values for learning rate when fine-tuning range from 0.0001 to 0.0002.
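For intuition, here is a single gradient-descent update in plain Python (not any particular framework; the weight and gradient values are made up):

```python
learning_rate = 0.0001  # i.e. 1e-4 in scientific notation; a common fine-tuning value

weight = 0.5     # one hypothetical model weight
gradient = 2.0   # hypothetical gradient computed from one batch of examples

# The learning rate controls how far the weight moves against the gradient.
weight = weight - learning_rate * gradient
print(weight)  # a small step: 0.5 -> 0.4998
```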

Learning rate is often expressed in scientific notation, but Entry Point presents it in decimal form, which is more familiar outside of mathematics.

If the learning rate is exposed directly, the presumption is that it will be used to train all of the linear layers of the model (the neural networks that are the "brain" of the model). However, in some cases, it makes sense to train certain layers more than others. In such cases, the learning rate multiplier is an attractive alternative to a fixed learning rate.

Learning rate multiplier

The learning rate multiplier is a hyperparameter that lets you increase or decrease the learning rate relative to the platform's internal default. As implied by the name, it directly multiplies the internal learning rate.

Common values are 0.5 to 2. The default is usually 1 which leaves the learning rate unchanged.
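For example (the base learning rate here is a hypothetical internal default):

```python
base_learning_rate = 0.0001  # hypothetical internal default
multiplier = 2               # doubles the effective learning rate

effective_learning_rate = base_learning_rate * multiplier
print(effective_learning_rate)  # 0.0002
```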

Batch Size

The batch size is how many examples to process at a time. It's used to optimize memory and performance of fine-tuning.

Batch size should vary with the size of the dataset.

After each batch, the model's weights get updated. For example, if you had a dataset of 50 examples and set your batch size to 50, then your model's weights would only get updated once. That's not ideal because we want a model to learn in small increments over time as it processes our data. Conversely, a batch size of 50 might be entirely appropriate for a dataset of 10,000 examples.

For very small datasets, we often set it to 1.
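The relationship between dataset size, batch size, and the number of weight updates can be sketched as:

```python
import math

def weight_updates(dataset_size: int, batch_size: int, epochs: int) -> int:
    """Total weight updates during training: one per batch, per epoch."""
    return math.ceil(dataset_size / batch_size) * epochs

print(weight_updates(50, 50, 1))      # 1 update -- too few to learn gradually
print(weight_updates(10_000, 50, 1))  # 200 updates per epoch
print(weight_updates(50, 1, 3))       # 150 updates -- small dataset, batch size 1
```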

LoRA & QLoRA Hyperparameters

Rank

Rank determines the size of the decomposed weight matrices (A and B) used in LoRA and QLoRA fine-tuning. A higher Rank value provides more precision when fine-tuning. Common values for Rank range from 8 to 64, but can go much higher. Higher rank uses more memory when fine-tuning.

The highest possible rank is determined by the original size of the linear layers of the model being trained. Also, rank does not need to be a power of 2, even though powers of 2 are the most common choices.
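To see why rank drives memory use, compare parameter counts for a single linear layer (the layer dimensions below are hypothetical):

```python
# LoRA trains a low-rank update delta_W = B @ A instead of the full weight matrix,
# where A has shape (rank, d_in) and B has shape (d_out, rank).
d_in, d_out = 4096, 4096  # hypothetical linear-layer dimensions
rank = 16

full_params = d_in * d_out                # updating the layer's weights directly
lora_params = rank * d_in + d_out * rank  # training only the A and B matrices

print(full_params)  # 16777216
print(lora_params)  # 131072 -- a small fraction of the full layer
```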

Scale Factor

Scale Factor multiplies your fine-tuned weight matrices before they are added to the original weights of the model. This increases or decreases their impact on the original weights. You can think of Scale Factor roughly as an alternative to adjusting the learning rate, although it is a much more blunt approach.

The original LoRA paper does not use the term Scale Factor, but it effectively calculates it as Alpha divided by Rank. At Entry Point AI, you can set Alpha or Scale Factor. We believe Scale Factor is a more intuitive hyperparameter than Alpha, even though the end result is the same.
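The relationship is simple to state in code (the Alpha and Rank values below are just examples):

```python
# The LoRA paper scales the adapter update by alpha / rank before it is
# added to the original weights.
rank = 16
alpha = 32  # hypothetical Alpha setting

scale_factor = alpha / rank
print(scale_factor)  # 2.0 -- setting Scale Factor to 2 directly is equivalent
```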

Dropout

Dropout randomly sets some parameters to zero when training. The theory is that by "turning off" some of the parameters during each weight update, it can prevent overfitting and encourage the training of more parameters. Common values are 0, 0.05, or 0.10, which correspond to 0%, 5%, and 10% respectively.
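A minimal sketch of the idea in plain Python (real implementations typically also rescale the surviving values by 1 / (1 - rate), omitted here for simplicity):

```python
import random

def apply_dropout(values, rate, seed=None):
    """Zero out each value with probability `rate` during a training step."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < rate else v for v in values]

activations = [0.3, -1.2, 0.8, 0.5, -0.7]  # hypothetical activations
print(apply_dropout(activations, rate=0.10, seed=42))
```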
