Fine-tuning Small Language Models is a highly effective way to improve their performance, add desired behaviors, or remove undesirable ones. Traditional full fine-tuning improves model performance on specific tasks but comes with high resource requirements and costs.
This is where Parameter-Efficient Fine-Tuning (PEFT) steps in, offering a smarter, more resource-conscious approach. In this article, we compare some key PEFT techniques—LoRA, QLoRA, DoRA, and QDoRA—highlighting their foundations and trade-offs. This overview will help you choose the best approach for your specific fine-tuning needs.
This post is part of our series on fine-tuning techniques for small language models. If you're looking to understand the fundamentals and set the stage for these advanced techniques, be sure to read our introductory blog: "Fine-Tuning in Small Language Models". For those interested in exploring practical applications and experimental insights, our follow-up blog, "Fine-Tuning Small Language Models: Experimental Insights", dives into real-world results and best practices for implementing these methods.
Full Fine-Tuning vs Parameter Efficient Fine-Tuning (PEFT)
To understand how to fine-tune efficiently, we first need to understand the complexity of Full Fine-Tuning.
Full Fine-Tuning
Full fine-tuning involves updating all parameters of a model, which means updating very large weight matrices (for a 10-billion-parameter model, you update all 10 billion weights). Storing and updating those weights takes a great deal of memory. Full fine-tuning offers comprehensive adaptation and potentially high performance, but at the cost of significant computational resources and memory requirements.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT achieves comparable performance to full fine-tuning by updating only a fraction of the parameters, reducing computational costs, lowering memory usage, and enabling faster training.
While PEFT offers remarkable efficiency gains, it may introduce some implementation complexity and performance trade-offs.
LoRA (Low-Rank Adaptation)
The uniqueness of LoRA is that instead of updating the weights of a model directly, we leave them unchanged (effectively “freezing” them) and add a few new weights to be trained instead. Fine-tuning these significantly smaller matrices is what brings us the benefits.
The mathematical foundation of LoRA is a concept called Matrix Decomposition, which allows us to train only a subset of the weights, combine the frozen and the trained weights, and treat them as if they were a single matrix. This helps avoid having to fundamentally change how everything else in the Transformer architecture works. The way these matrices are combined during forward-passes is illustrated in the following figure:
LoRA representations
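To make the idea concrete, here is a minimal, illustrative sketch of a LoRA layer in PyTorch. The class name `LoRALinear`, the initialization values, and the layer sizes are placeholders chosen for this example rather than a library API; the point is that the frozen weight and the two small trainable matrices are combined as if they were a single matrix.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: a frozen base weight plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the original weights
        d_out, d_in = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # small random init
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))        # zero init: no change at step 0
        self.scaling = alpha / rank                 # explained in the hyperparameters section below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + low-rank update, behaving as if W and B·A were a single matrix
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# usage: wrap an existing projection layer
layer = LoRALinear(nn.Linear(512, 512), rank=8, alpha=16)
out = layer(torch.randn(2, 512))
```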
One of the major advantages of LoRA is that after fine-tuning for a specific task, we are left with the original model untouched plus a much smaller “LoRA adapter”. Instead of requiring gigabytes of storage, the LoRA adapter typically needs only a few megabytes, often representing just a tiny fraction of the model's original size.
Calculating Parameter Efficiencies
To gain a clearer perspective on the parameter savings achieved by LoRA, let's attempt to estimate the number of parameters involved. We'll use some assumptions to make this estimation more manageable:
- We’ll consider the model as a single weight matrix, rather than the common layered architecture. This setup is simpler but equivalent to a regular Transformer, where you would simply repeat the calculation for each layer.
- We'll define a specific model size (total number of parameters) as our starting point and base all subsequent calculations on this figure.
- We’ll assume the dimensions of this matrix (d, k) are equal, resulting in a square matrix.
- When simulating the matrix dimension of our model, if the total number of parameters does not form a perfect square (which is often the case), we'll round its square root up to the nearest integer so that we only work with natural numbers.
Let's walk through a practical example. To keep it straightforward, we'll use a small number of parameters.
Explicit calculation example
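The same arithmetic can be written as a few lines of Python. The helper below simply follows the assumptions above (a single square matrix, square root rounded up); the function name `lora_parameter_share` is an illustrative choice for this post.

```python
import math

def lora_parameter_share(total_params: int, rank: int = 1):
    """Estimate LoRA's parameter count under the simplifying assumptions listed above."""
    d = math.ceil(math.sqrt(total_params))   # square matrix side, square root rounded up
    lora_params = 2 * d * rank               # A is (rank x d) and B is (d x rank)
    return d, lora_params, 100 * lora_params / total_params

print(lora_parameter_share(25, rank=1))  # (5, 10, 40.0)  -> 40% of a 25-parameter "model"
print(lora_parameter_share(25, rank=2))  # (5, 20, 80.0)  -> doubling the rank doubles the share
```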
Notice how the relationship changes as we redefine the rank, as shown in the image below. This adjustment remains mathematically correct because the multiplication of these matrices still results in a matrix with the dimensions of the original model.
Rank 2 calculations example
For different model sizes, using a fixed rank (say rank = 1), we can see how the percentage decreases as the total parameter count grows.
| Total Parameters | Matrix Dimension (D) | LoRA Parameters | Percentage |
|------------------|----------------------|-----------------|------------|
| 25               | 5                    | 10              | 40%        |
| 100              | 10                   | 20              | 20%        |
| 4B               | 63246                | 126492          | 0.0032%    |
| 7B               | 83667                | 167334          | 0.0024%    |
| 13B              | 114018               | 228036          | 0.0018%    |
| 80B              | 282843               | 565686          | 0.0007%    |
| 200B             | 447214               | 894428          | 0.0004%    |
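For completeness, the rows above can be reproduced with the `lora_parameter_share` helper sketched earlier, looping over model sizes at rank 1:

```python
# reuses the lora_parameter_share helper sketched above
for total in [25, 100, 4_000_000_000, 7_000_000_000,
              13_000_000_000, 80_000_000_000, 200_000_000_000]:
    d, lora, pct = lora_parameter_share(total, rank=1)
    print(f"{total:>15,}  D={d:<7}  LoRA params={lora:<9}  {pct:.4f}%")
```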
If we instead fix the model size and vary the rank, we see that the percentage of parameters used by LoRA increases with the rank value, but it remains a very small fraction of the model's total parameter count.
| Rank | Matrix Dimension (D) | LoRA Parameters | Percentage |
|------|----------------------|-----------------|------------|
| 4    | 63246                | 505968          | 0.0126%    |
| 8    | 63246                | 1011936         | 0.0253%    |
| 16   | 63246                | 2023872         | 0.0506%    |
| 32   | 63246                | 4047744         | 0.1012%    |
| 64   | 63246                | 8095488         | 0.2024%    |
4B model with different ranks
PEFT Hyperparameters
Rank
A fair question you may ask is: “What should I set the rank to?” or “What is the right value for it?”
The theory is that downstream tasks are intrinsically low-rank, meaning we can describe the model's adaptation almost as accurately using far fewer dimensions.
In the LoRA paper, the authors provide valuable insights into the impact of rank on model performance and efficiency. They demonstrate that performance often plateaus after a certain rank, suggesting diminishing returns for higher values. Lower ranks (such as 8 or 16) were found to be particularly efficient, offering an optimal balance between performance and computational cost.
Using a higher rank can be particularly beneficial in scenarios where you're teaching the model complex behaviors or when you're introducing behaviors that contradict or extend beyond what the model has learned in its previous training.
The researchers emphasize that the "sweet spot" for rank can vary depending on the specific task, model architecture, and dataset characteristics. "The optimal r [rank] is task-dependent and can be treated as a hyperparameter to tune." This variability highlights the importance of empirical testing and careful consideration of the trade-offs between performance and resource utilization when implementing LoRA for different applications.
During inference, the LoRA adapter works alongside the original model. The key advantage is that multiple LoRA adapters can reuse the same LLM, significantly reducing overall memory requirements when managing various tasks and use cases.
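As a rough sketch of what this looks like with the Hugging Face peft library: the base model and adapter paths below are placeholders, but the pattern of loading several adapters on top of a single frozen base model is the relevant point.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# one copy of the base model in memory...
base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# ...shared by several task-specific LoRA adapters (paths are placeholders)
model = PeftModel.from_pretrained(base, "adapters/summarization", adapter_name="summarization")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("sql")  # switch which adapter is active per request
```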
Alpha
Alpha determines a scaling factor that is applied to your weight changes before they are added to the original model weights. The scaling factor is calculated as alpha divided by rank. For example, in the QLoRA paper (which we will cover in the next section), the authors set an alpha of 16 and a rank of 64, giving a scaling factor of 16/64 = 1/4, i.e. the update has a 25% impact.
A common rule of thumb is to set alpha to twice the rank, giving the update a 2x impact.
Dropout
As in traditional Machine Learning, dropout in LoRA is a regularization technique that prevents overfitting by randomly zeroing out a percentage of parameters during training. The QLoRA paper recommends a 10% dropout rate for models with 7-13 billion parameters, and 5% for larger models (33-65 billion parameters).
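Putting the three hyperparameters together, a typical configuration with the peft library might look like the sketch below. The target module names depend on the model architecture and are given here only as an example.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling = alpha / rank = 2 (the "2x impact" rule of thumb)
    lora_dropout=0.1,                     # 10% dropout, in line with the QLoRA suggestion for 7-13B models
    target_modules=["q_proj", "v_proj"],  # which projections get adapters (architecture-dependent example)
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)
```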
DoRA (Weight-Decomposed Low-Rank Adaptation)
While exploring the accuracy differences between Full Fine-Tuning (FT) and LoRA, researchers identified differences in how these two methods behave during training and set out to address them. The result was DoRA.
To explain DoRA, we can think of fine-tuning a model as the journey of climbing a mountain. In this journey, you want to optimize your route from the base of a mountain (the initial set of model weights) toward the peak (a better set of weights).
At each step you take, there are two critical decisions: the direction in which you walk, and how far you go in that direction. Researchers discovered that the way LoRA makes these decisions about direction and magnitude differs from how Full Fine-Tuning makes them, and they tried to bridge that gap.
To address this, they “decomposed” the pre-trained weight matrix into two components: one capturing the direction the optimization takes, and the other capturing its magnitude. By training these components independently, DoRA more closely mimics the natural progression of Full Fine-Tuning, allowing the model to "climb the mountain" faster and even reach higher peaks, resulting in better-quality results.
DoRA representations
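Here is a simplified sketch of the decomposition (not the paper's full implementation; the shapes and initialization are placeholders): the adapted weight is rebuilt from a learned per-column magnitude and a unit-norm direction that carries the LoRA-style low-rank update.

```python
import torch

def dora_weight(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Recombine a DoRA-style weight: learned magnitude times a unit-norm direction."""
    direction = W0 + B @ A                               # frozen weight plus low-rank update
    col_norm = direction.norm(p=2, dim=0, keepdim=True)  # per-column norms
    return m * (direction / col_norm)                    # rescale each column by its learned magnitude

d_out, d_in, rank = 512, 512, 8                          # placeholder shapes
W0 = torch.randn(d_out, d_in)                            # frozen pre-trained weight
B = torch.zeros(d_out, rank)                             # trainable, zero-initialized
A = torch.randn(rank, d_in) * 0.01                       # trainable
m = W0.norm(p=2, dim=0, keepdim=True)                    # magnitude initialized from W0's column norms
W_adapted = dora_weight(W0, B, A, m)
```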
In the paper, we see how DoRA consistently outperforms LoRA across various tasks and model sizes, and it requires minimal additional computational overhead compared to LoRA.
An advantage of DoRA is its straightforward implementation: it extends an existing LoRA setup with only minimal modifications.
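For instance, recent versions of the Hugging Face peft library expose this as a single flag on an ordinary LoRA configuration (the other values below are example choices):

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,                        # turn the LoRA adapters into DoRA adapters
    target_modules=["q_proj", "v_proj"],  # example modules, architecture-dependent
    task_type="CAUSAL_LM",
)
```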
QLoRA (Quantized Low-Rank Adaptation)
QLoRA is a more memory-efficient evolution of LoRA. It enhances LoRA by quantizing the frozen weights of the base model to a lower precision, while the small LoRA adapter matrices are still trained in higher precision.
Precision
To talk about quantization, we need to understand a simple concept: Precision.
A neural network's weights are floating-point numbers (typically stored in the float32 data type). In computing, floating-point numbers are represented using binary digits (bits) allocated to the sign, exponent, and fraction. The same number can be expressed at different levels of precision.
At times, the number may remain unaffected, but most of the time, there will be some loss when converting to a lower precision format.
Quantization techniques reduce precision by using fewer bits, as in this example, which switches from a 32-bit (float32) to a 16-bit (float16) representation. According to Hugging Face, the two most common quantization cases are float32 -> float16 and float32 -> int8.
Quantization
The core concept of QLoRA is to apply 4-bit quantization to the pre-trained model weights, which means inputs are processed through a quantized version of the model rather than the original full-precision weights.
During inference or further processing, dequantization occurs, where the quantized weights are converted back to a higher precision format. However, as the original precision is reduced, some information loss may occur, potentially leading to slight differences in the model’s outputs compared to those generated using the original, unquantized weights.
ABSMAX Quantization example
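As an illustration of the idea (using int8 absmax for readability, whereas QLoRA itself uses a 4-bit NormalFloat data type), here is a minimal sketch of the quantize/dequantize round trip and the small reconstruction error it introduces:

```python
import torch

def absmax_quantize(x: torch.Tensor):
    """Map floats to int8 codes using the absolute maximum as the scale."""
    scale = 127.0 / x.abs().max()
    q = torch.round(x * scale).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) / scale

w = torch.tensor([0.12, -0.84, 0.33, 1.95, -0.07])
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
print(q)                        # int8 codes
print((w - w_hat).abs().max())  # small reconstruction error: the information lost to quantization
```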
Despite this potential for minor deviations, QLoRA balances significant memory savings with a level of effectiveness similar to LoRA's. Using lower precision (such as 16-bit half precision) often works reasonably well for training neural networks, although performance can vary depending on the dataset.
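In practice, a QLoRA-style setup with the Hugging Face transformers and peft libraries looks roughly like the following sketch; the model name and target modules are example choices, and the rank, alpha, and dropout values mirror the QLoRA-paper configuration mentioned above.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, introduced in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the forward pass
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",     # example base model
    quantization_config=bnb_config,
)

qlora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,  # the QLoRA-paper values discussed above
    target_modules=["q_proj", "v_proj"],    # example modules, architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, qlora_config)
```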
QDoRA (Quantized Weight-Decomposed Low-Rank Adaptation)
Just as QLoRA introduces quantization to LoRA, QDoRA applies a similar quantized approach to DoRA. Like DoRA, QDoRA focuses on efficient fine-tuning through weight decomposition, but with the added benefit of quantization.
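Under the same assumptions as the two sketches above, a QDoRA-style setup is essentially the 4-bit quantized base model combined with DoRA adapters:

```python
# reuse bnb_config and the quantized base model from the QLoRA sketch above,
# but train DoRA adapters on top of it
qdora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,                        # DoRA-style magnitude/direction decomposition
    target_modules=["q_proj", "v_proj"],  # example modules, architecture-dependent
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base, qdora_config)
```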
Key Takeaways
In conclusion, each of these fine-tuning optimization techniques—LoRA, QLoRA, DoRA, and QDoRA—brings its own set of strengths to the table. LoRA offers a straightforward and efficient way to fine-tune models with minimal memory overhead, while QLoRA enhances this efficiency through quantization. DoRA provides a more nuanced approach to refining weight adjustments, and QDoRA combines these benefits for even greater performance gains. When choosing between these methods, it's essential to consider your specific needs, whether they be memory efficiency, training speed, or performance optimization.
For those interested in seeing these techniques in action, we conducted an in-depth experiment using the Phi-3-Mini-4k-Instruct model, where we applied LoRA, QLoRA, DoRA, and QDoRA under real-world conditions. To learn more about the results and insights from this experiment, including practical recommendations for implementing these techniques, I encourage you to read our follow-up post: "Fine-Tuning LLMs: Experimental Insights on LoRA, DoRA, and Quantization Techniques."