Large Language Models (LLMs) are useful, but their associated bills can be prohibitive. In this post, we'll explore how Small Language Models (SLMs) can achieve good performance without breaking the bank, even in highly specialized use cases. After all, do we really need a rocket ship to cross the street?
Some of the topics we'll cover include:
- How SLMs and LLMs compare
- Deciding if fine-tuning an SLM is the right decision
- A step-by-step overview of the fine-tuning process
- Fine-tuning challenges and considerations
Understanding SLMs vs LLMs
Regardless of their size, language models can address complex industry-specific tasks, from automating customer service through chatbots to distilling oceans of feedback into clear strategic directives. The fundamental difference is not about capabilities but scale and performance across different tasks and benchmarks.
- Large Language Models (LLMs), trained on vast datasets with hundreds of billions of parameters, excel in handling complex, general-purpose tasks across diverse domains.
- Small Language Models (SLMs), with parameters typically in the millions to billions, offer impressive capabilities. SLMs shine in scenarios requiring faster processing, lower computational resources, and more efficient deployment, making them particularly attractive for specific business applications where speed and efficiency are critical.
Why Fine-Tune SLMs?
Fine-tuning adapts pre-trained models to specific use cases, enhancing their accuracy and relevance for custom requirements. When it comes to SLMs, fine-tuning offers unique advantages that are inherent to their compact size:
- Fine-tuned SLMs can achieve performance comparable or superior to LLMs on specific tasks.
- SLMs require less computational power, making the fine-tuning process faster and less resource-intensive.
- Faster training allows for quicker iterations and development cycles, which translates to a faster time-to-market.
- With more GPU memory available, there's more room for extensive hyperparameter experimentation.
- SLMs offer more efficient inference, reducing operational costs.
- Their size enables their use on smaller devices, opening possibilities for edge computing and mobile applications.
When to Fine-Tune?
Before diving into fine-tuning, take a step back and ask: What's driving this decision? Are you aiming to enhance domain-specific knowledge, improve output quality, or reduce hallucinations?
Once you have that clarified, consider whether fine-tuning is the best approach for these goals. Ask yourself these questions when considering alternatives:
- Prompt Engineering: Could crafting more effective prompts achieve your goals without the complexity of fine-tuning?
- Retrieval-Augmented Generation (RAG): For tasks requiring up-to-date or specialized information, could RAG provide a more flexible solution?
- Cognitive Modeling: Does adding intermediate reasoning steps improve the quality of responses? (e.g. Chain of Thought, Multi-Agent Architecture, Memory-Augmentation)
Consider these alternatives based on your needs. Fine-tuning is not a sledgehammer solution; in some cases, a simpler method that saves time and resources will suffice.
The Fine-Tuning Process
It's key to clearly articulate the specific task or problem you aim to address with generative AI. Consider your domain (whether medical or tech), the type of output format (text, code), and any project-specific constraints. With that foundation in place, follow these key steps to ensure an efficient fine-tuning process.
Set your success metrics
Choose metrics that matter to your project's objectives and use case:
- Accuracy: For classification tasks, how often is your model correct?
- Perplexity: How well does your model predict the next token?
- Benchmarks: How does your fine-tuned model perform on your specific use case?
- Coherence: Are there contradictions in your model's responses?
- Relevance: Are your model's responses what you would expect?
Some of these metrics, like coherence and relevance, require a subjective evaluation where you can leverage human collaborators or other LLMs. Always remember, the best metric is one that translates directly to value for your end-users.
Set realistic and meaningful targets and make sure to keep an eye on them.
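To make one of these concrete, here is a minimal sketch of measuring perplexity with Hugging Face transformers; the model name and evaluation text are placeholders you would swap for your own.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"  # placeholder: use your model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Replace this with a sample from your validation set."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity is the exponential of the cross-entropy loss
print(f"Perplexity: {math.exp(outputs.loss.item()):.2f}")
```

Lower perplexity on a held-out validation set is a quick signal that fine-tuning moved the model in the right direction, though it should never be your only metric.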
Choose a base model
Choosing the right base model is so important that it deserves its own dedicated process.
- Survey available foundation models that are suitable for your use case, considering both open-source and commercial options. As a rule of thumb, focus on models of no more than 16B parameters.
- Once you identify your potential candidates, run tests with selected models using a representative subset of your data.
- Analyze the results holistically, considering performance on your specific task, implementation and usage costs, deployment in your infrastructure, and alignment with ethical and security requirements.
- Choose the model that offers the optimal balance of these elements.
[Figure: Model comparison example]
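As a quick illustration of the "run tests with selected models" step, here is a rough sketch that smoke-tests a couple of candidate models on representative prompts; the model names and prompts are illustrative, not recommendations.

```python
from transformers import pipeline

# Illustrative shortlist and prompts; substitute your own candidates and
# a representative subset of your data.
candidates = ["microsoft/Phi-3-mini-4k-instruct", "Qwen/Qwen2.5-1.5B-Instruct"]
prompts = ["Summarize the following ticket: ...", "Classify the sentiment: ..."]

for name in candidates:
    generator = pipeline("text-generation", model=name)
    for prompt in prompts:
        output = generator(prompt, max_new_tokens=64)[0]["generated_text"]
        print(f"[{name}] {output[:120]!r}")
```

Capture these outputs alongside cost and latency notes so the holistic comparison in the next step has something concrete to work from.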
Prepare your data
When it comes to data preparation, you have two main routes: create your dataset or use an existing dataset. The main things to keep in mind are data relevance and quality.
The following flowchart will help you determine which route to choose.
[Figure: Decision flowchart for selecting a dataset]
Other data considerations to keep in mind (a short sketch follows this list):
- Data preprocessing to improve data quality
- Scrubbing sensitive information (e.g. names, addresses, phones, emails)
- Splitting into training/validation/test sets
- Ensuring compatibility with the model (e.g. context window length)
- Necessary pipelines to make the process automatically reproducible
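Here is a minimal sketch of two of these items, assuming a local JSONL dataset with a "text" field: scrubbing simple PII patterns and splitting into training/validation/test sets with the datasets library.

```python
import re
from datasets import load_dataset

# Simple illustrative patterns; production PII scrubbing needs more care.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def scrub(example):
    text = example["text"]
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    example["text"] = text
    return example

dataset = load_dataset("json", data_files="data.jsonl")["train"].map(scrub)

# 80/10/10 split: carve out 20%, then halve the holdout.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train, validation, test = split["train"], holdout["train"], holdout["test"]
```

Wrapping steps like these in a script or pipeline is what makes the process automatically reproducible later on.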
Set up your workspace and select compute
There are two compute-related aspects that you need to consider simultaneously: platform and GPU.
Selecting a platform depends on your current infrastructure, the features you need, and the skills your current team has, while selecting a GPU depends on the expected computational demands of your fine-tuning efforts. Not all GPUs are available on all platforms, so take that into consideration before committing to one.
Amazon AWS, Microsoft Azure, Google Cloud Platform, and Databricks are a few popular computing platforms.
There is no strict rule when selecting a GPU; however, here is a general guideline based on model size:
[Table: GPU guideline by model size]
As a strategy, begin with the minimum viable GPU that can handle your model, and then scale up if needed.
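A back-of-envelope estimate can help you gauge that minimum. This sketch counts only the memory needed to hold the model weights; gradients, optimizer states, and activations during fine-tuning can multiply this figure several times over. The 3.8B figure is Phi-3-mini's parameter count, used here as an example.

```python
# Rough estimate: bytes to hold just the model weights at a given precision.
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for precision, nbytes in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"3.8B params @ {precision}: ~{weight_memory_gb(3.8, nbytes):.1f} GB")
```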
Here are some useful tips:
- Check the model's documentation; many model cards include suggested GPU specifications for fine-tuning.
Here's an example from the Hugging Face site: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[Figure: Example from the model catalog in Azure Machine Learning Studio]
- Look for blog posts or articles from practitioners who have fine-tuned the model you're interested in. The Hugging Face forums, Reddit, and other machine learning communities are helpful; practitioners there frequently share insightful information about the hardware they used and any challenges they faced.
Be aware that GPU resources, especially on free or low-cost platforms, are in extremely high demand. For example, while Google Colab offers free GPU access, its availability can be inconsistent, and your workload can be stopped at any time. Be prepared for potential waits or interruptions to your training.
Monitor and document
Make sure to monitor your training progress. Platforms like Weights and Biases offer comprehensive visualization and logging capabilities. Focus on key metrics such as loss curves (training and validation), GPU utilization, and memory consumption.
[Figure: Example of a system graph for four experiments tracked in Weights and Biases]
Set up all the metrics you plan to evaluate from the beginning; this will keep your documentation organized, save time during comparisons, and spare you pain and regret later.
Write down your hyperparameters, data preprocessing steps, breakthroughs, challenges, and other notable observations (anything that makes you say "aha!" or "what is this?"). This can come in handy when you're trying to reproduce your results or explain your process to others.
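Here is a minimal sketch of this kind of tracking with the Weights and Biases client; the project name and configuration values are hypothetical, and the loop is a stand-in for your actual training loop.

```python
import wandb

# Hypothetical configuration; log whatever hyperparameters you actually use.
run = wandb.init(
    project="slm-finetuning",  # illustrative project name
    config={
        "base_model": "microsoft/Phi-3-mini-4k-instruct",
        "learning_rate": 2e-4,
        "batch_size": 8,
        "epochs": 3,
    },
)

for step in range(100):  # stand-in for your training loop
    simulated_loss = 2.0 * (0.98 ** step)  # replace with your real loss
    run.log({"train/loss": simulated_loss, "step": step})

run.finish()
```

Because the config is logged with each run, comparing hyperparameters across experiments later becomes a filtering exercise rather than archaeology.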
Plan and run your initial tests
As you might have noticed by now, careful planning is key when preparing to fine-tune your model. As the saying often attributed to Lincoln goes, "Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." In the same spirit, here are some recommendations to sharpen your fine-tuning axe:
- Start small. Running a test with a small fraction of your dataset (5-10%) can give you quick insights into how your model will train, as well as potential challenges and optimizations (see the subset sketch after this list). Running a few of these pilot runs can help you get a feel for the impact of hyperparameters in your specific scenario.
- Flexible planning is key. While it's important to craft an initial training plan, remember that fine-tuning is often an iterative process. Your initial hyperparameters and approach should be seen as a starting point, not a rigid blueprint, so be prepared to adjust them as results come in.
- Time estimation. A hidden benefit of starting with small-scale experiments is the ability to estimate full training time: the relationship between training time and dataset size is approximately linear in most cases, so a run on 10% of your data suggests the full dataset will take about ten times longer.
To illustrate this point, let's look at some real-world data from one of our fine-tuning experiments:
[Figure: Relationship between training time and the percentage of data used when fine-tuning Phi-3-mini-4k-instruct on the Open-Platypus dataset]
This insight can be crucial for resource planning and scheduling.
- Compare against the base model. A good starting point is to compare your model's performance to the original (non-fine-tuned) model to quantify improvements. Do not limit yourself to just one metric; consider any domain-specific measures relevant to your task.
- Real-world testing. Ideally, tests should put your model in scenarios that closely mimic your real-world use cases to estimate its practical performance. With those evaluation results in hand, you are ready to refine your approach.
- Targeted troubleshooting. Develop hypotheses about why certain issues are occurring and how they might be addressed. Rather than overhauling everything, focus on making targeted adjustments to address specific shortcomings.
- Keep a detailed log of each iteration. Note the changes made and their impact. As you progress, this documentation is key for understanding what works and what does not.
- Incremental progress. Do not expect dramatic changes with each iteration. As improvement is often incremental, steady progress is good progress.
- Iterative refinement. While refinement may be the last item on our list, it is not a "final step" as such, but rather part of an ongoing cycle, with each iteration bringing you closer to your ideal model performance.
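Here is a tiny sketch of the "start small" idea above: carving out roughly 10% of a training set for a pilot run. The dataset is the one from the experiment shown earlier; swap in your own.

```python
from datasets import load_dataset

# Illustrative dataset; replace with your own prepared training set.
train = load_dataset("garage-bAInd/Open-Platypus", split="train")
pilot = train.shuffle(seed=42).select(range(int(0.1 * len(train))))
print(f"Pilot set: {len(pilot)} of {len(train)} examples")
```

Shuffling before selecting keeps the pilot subset representative rather than biased toward whatever ordering the dataset shipped with.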
Challenges Ahead
While the potential of fine-tuned SLMs is immense, there are challenges you may run into:
- GPU Memory Management:
- Frequent Out of Memory (OOM) errors, especially with limited hardware like V100 GPUs.
- Mixed precision issues, requiring environment restarts.
- Balancing model size and context window length within memory constraints.
- GPU Selection and Access:
- Difficulty obtaining high-performance GPUs (e.g., NVIDIA A100, H100) due to high demand and limited availability.
- Inconsistent access to top-tier GPUs on cloud platforms.
- Hyperparameter Tuning:
- Time-consuming process of manual tuning.
- Significant impact on training duration and model performance.
- Balancing automation tools with manual optimization for best results.
- PEFT Technique Selection:
- Choosing between methods like LoRA, QLoRA, DoRA, and QDoRA (a LoRA configuration sketch follows this list).
- Weighing trade-offs in performance, speed, and resource utilization.
- Model and Context Window Sizing:
- Deciding on optimal model size for the task at hand.
- Determining appropriate context window length within memory limitations.
- Training and Evaluation Speed:
- Variations in training time based on model size and chosen PEFT technique.
- Potentially resource-intensive and time-consuming evaluation processes.
- Hardware Limitations:
- Dealing with memory leaks and unexpected behavior in quantized versions.
- Managing long sequences and their impact on processing time and memory consumption.
- Experiment Tracking:
- Implementing a robust system to document and reproduce experiments.
- Managing multiple iterations of models, datasets, and hyperparameters.
- Library Compatibility:
- Navigating version incompatibilities, particularly with Hugging Face libraries.
- Performance Optimization:
- Balancing model accuracy with resource constraints.
- Addressing unexpected issues like quantized versions consuming more memory.
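To ground the PEFT selection challenge above, here is a minimal LoRA configuration sketch using the peft library; the rank, alpha, and dropout are plausible defaults rather than tuned values, and the target module names shown match Phi-3's architecture but differ from model to model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=16,                                    # rank of the low-rank updates
    lora_alpha=32,                           # scaling factor
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],   # module names vary per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a tiny fraction of the model
```

Because only the small adapter matrices are trained, memory pressure and training time drop substantially, which is exactly the trade-off the challenges above ask you to weigh.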
Key Takeaways
Fine-tuning Small Language Models (SLMs) is a cost-efficient way to achieve strong performance in specific business tasks, offering a practical alternative to Large Language Models (LLMs).
The decision to fine-tune should be based on clear goals. Before fine-tuning, consider alternatives like prompt engineering or RAG. If fine-tuning is necessary, make sure to plan carefully, and focus on model selection, data preparation, and monitoring.
Despite challenges like GPU limitations, fine-tuning SLMs can lead to faster processing, lower costs, and deployment on smaller devices. By starting small, staying adaptable, and learning from each iteration, you're positioning yourself to achieve the best possible results with your model.