- Supervised fine-tuning (SFT) updates a pre-trained model's parameters by training it on labeled data for the specific task you want the model to perform. For instance, you can fine-tune a model to be better at summarization by training it on examples of texts paired with their summaries (a minimal sketch follows below).
- It’s useful for domains where labeled examples exist, such as Q&A datasets or chat logs.
- Learn more: Hugging Face Transformers Fine-Tuning Tutorial
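- Below is a minimal sketch of SFT on a toy summarization pair, assuming the Hugging Face `transformers` library and a small causal LM checkpoint (`gpt2` here as a stand-in); the data, prompt format, and hyperparameters are illustrative only, not a production recipe.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy labeled data: each example pairs an input text with the desired output.
examples = [
    {"text": "Long article about climate policy ...", "summary": "Short overview of climate policy."},
    {"text": "Long article about quantum computing ...", "summary": "Quantum computing in one line."},
]

def collate(batch):
    # Concatenate prompt and target into one sequence; the causal LM loss is taken over
    # the whole sequence here (a production setup would usually mask prompt and padding tokens).
    texts = [f"Summarize: {ex['text']}\nSummary: {ex['summary']}" for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()
    return enc

loader = DataLoader(examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss      # cross-entropy on next-token prediction
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```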
- Reinforcement Learning from Human Feedback (RLHF) adds a feedback loop: human reviewers rate or rank the model’s outputs, a reward model is trained on those preferences, and the model is then fine-tuned with reinforcement learning to score higher on that reward. This is useful for conversational AI because it improves the quality of the interaction in response to human evaluations.
- Direct Preference Optimization (DPO) is a simpler alternative to RLHF that skips training an explicit reward model and optimizes the model directly on preference pairs (a preferred vs. a rejected response). The aim is to reduce complexity while still leveraging human feedback; the loss is sketched below.
- Learn more about RLHF: OpenAI's Guide to Reinforcement Learning
- For an overview of DPO: Direct Preference Optimization (Rafailov et al., 2023)
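- For intuition, here is a sketch of the DPO loss itself, assuming you have already computed the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the trainable policy and a frozen reference model; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) likely each response is under the policy than under the reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the margin between preferred and dispreferred responses up; beta controls
    # how far the policy is allowed to drift from the reference model.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
lp = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*lp))
```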
- Continual learning lets a model learn from new data incrementally without forgetting previously learned information, preserving existing knowledge while picking up new tasks or data.
- Useful for adapting models over time in production environments without retraining from scratch.
- Learn more: Continual Learning in Neural Networks
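- One simple, common recipe is experience replay: keep a small buffer of past examples and mix them into each new-task batch. A minimal PyTorch sketch with a toy classifier and made-up buffer sizes:

```python
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))   # toy classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
replay_buffer = []   # holds (x, y) pairs sampled from earlier tasks

def train_on_task(task_batches, replay_size=16, buffer_cap=500):
    for x_new, y_new in task_batches:
        x, y = x_new, y_new
        if replay_buffer:
            # Mix replayed examples from earlier tasks into the batch to reduce forgetting.
            old = random.sample(replay_buffer, min(replay_size, len(replay_buffer)))
            x = torch.cat([x, torch.stack([xo for xo, _ in old])])
            y = torch.cat([y, torch.stack([yo for _, yo in old])])
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep a small reservoir of the new task's examples for future replay.
        for xi, yi in zip(x_new, y_new):
            if len(replay_buffer) < buffer_cap:
                replay_buffer.append((xi, yi))

# Train on task A, then task B; task B batches are mixed with replayed task A examples.
task_a = [(torch.randn(8, 32), torch.randint(0, 5, (8,))) for _ in range(20)]
task_b = [(torch.randn(8, 32), torch.randint(5, 10, (8,))) for _ in range(20)]
train_on_task(task_a)
train_on_task(task_b)
```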
- Multi-task learning involves training a model on multiple tasks at the same time, allowing the model to share learned representations across tasks. This improves the generalization of the model and helps it perform better on related tasks.
- Particularly effective in settings where tasks share similar structures or data.
- Learn more: Multi-Task Learning: A Survey
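- A minimal hard-parameter-sharing sketch: one shared encoder feeds task-specific heads, so gradients from every task shape the shared representation. The dimensions, task names, and data below are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256):
        super().__init__()
        # Shared representation learned from all tasks.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Task-specific output heads.
        self.heads = nn.ModuleDict({
            "sentiment": nn.Linear(hidden_dim, 2),    # binary classification
            "topic": nn.Linear(hidden_dim, 10),       # 10-way classification
        })

    def forward(self, x, task):
        return self.heads[task](self.encoder(x))

model = MultiTaskModel()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One combined step: gradients from both tasks update the shared encoder,
# while each head only sees its own task's loss.
x1, y_sent = torch.randn(8, 128), torch.randint(0, 2, (8,))
x2, y_topic = torch.randn(8, 128), torch.randint(0, 10, (8,))
loss = loss_fn(model(x1, "sentiment"), y_sent) + loss_fn(model(x2, "topic"), y_topic)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```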
- Meta-learning, or "learning to learn," focuses on training models to quickly adapt to new tasks with few examples. This technique is valuable for scenarios with limited labeled data and enables models to generalize more efficiently across tasks.
- Learn more: Model-Agnostic Meta-Learning
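- A simplified first-order MAML-style sketch on toy sine-wave regression (the classic MAML demo). Full MAML backpropagates through the inner-loop updates, which needs higher-order gradients; this approximation just applies the adapted model's query-set gradients to the meta-parameters. Sizes and learning rates are illustrative.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))   # meta-learned initialization
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn, inner_lr = nn.MSELoss(), 0.01

def sample_task():
    # Each task is a sine wave with its own random amplitude and phase.
    a, b = torch.rand(1) * 4 + 1, torch.rand(1) * 3.1416
    def draw(n=10):
        x = torch.rand(n, 1) * 10 - 5
        return x, a * torch.sin(x + b)
    return draw

for step in range(300):
    meta_opt.zero_grad()
    for _ in range(4):                                   # meta-batch of tasks
        draw = sample_task()
        learner = copy.deepcopy(model)                   # fast weights for the inner loop
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(5):                               # adapt on the support set
            x_s, y_s = draw()
            inner_opt.zero_grad()
            loss_fn(learner(x_s), y_s).backward()
            inner_opt.step()
        x_q, y_q = draw()                                # evaluate the adapted weights
        grads = torch.autograd.grad(loss_fn(learner(x_q), y_q), learner.parameters())
        for p, g in zip(model.parameters(), grads):      # first-order meta-gradient
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```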
- Prefix-tuning fine-tunes a model by learning task-specific prefix vectors (continuous "virtual tokens" prepended to the model's attention inputs) that influence how the LLM processes input and generates output. The core model weights stay frozen, making it memory-efficient. It's particularly useful for generative tasks like text completion and dialog (see the sketch below).
- Learn more: Prefix-Tuning
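- This is how prefix-tuning typically looks with the Hugging Face `peft` library (an assumption; it can also be implemented by hand). The model name and number of virtual tokens are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base, config)   # base weights stay frozen; only the prefix parameters train
model.print_trainable_parameters()     # reports the tiny trainable fraction
# `model` can now be trained with an ordinary SFT loop, e.g. the one sketched earlier.
```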
- Prompt-tuning learns a small set of continuous prompt embeddings ("soft prompts") that are prepended to the input, effectively guiding the model to produce domain-specific responses. It is also efficient in memory and compute, since only those few parameters are updated.
- Learn more: Prompt Tuning for GPT-3
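- A sketch with the same `peft` library (again an assumption), initializing the soft prompt from a natural-language instruction; the model name, token count, and init text are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,                          # learned soft-prompt tokens prepended to the input
    prompt_tuning_init=PromptTuningInit.TEXT,      # initialize the soft prompt from real text
    prompt_tuning_init_text="Summarize the following document:",
    tokenizer_name_or_path="gpt2",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()                 # only the prompt embeddings are trainable
```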
- Adapters are small bottleneck layers inserted into each layer of a transformer. Only these adapter layers are trained during fine-tuning, keeping the majority of the model frozen. This method is resource-efficient and is particularly advantageous when multiple tasks are involved, as different adapters can be used with the same core model.
- Learn more: Adapters in Transformers
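- A minimal bottleneck-adapter module in plain PyTorch to show the shape of the idea; in practice, libraries in the AdapterHub ecosystem wire such modules into every transformer layer for you. The hidden and bottleneck sizes are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, hidden_size, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as a near-identity so training is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter(hidden_size=768)
h = torch.randn(2, 16, 768)              # (batch, seq_len, hidden)
print(adapter(h).shape)                  # torch.Size([2, 16, 768])
```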
- LoRA (Low-Rank Adaptation) reduces computational requirements by training only a small number of additional low-rank weight matrices instead of the model's full weight matrices; this is often sufficient for tasks that don't need all of the model's capacity.
- The idea is to decompose the change in weights into smaller rank matrices. The original model weights remain frozen (unchanged), and the adjustments are captured through these additional, lower-dimensional matrices.
- This approach reduces resource requirements, meaning you can fine-tune a large model with less GPU memory, and it's suitable when you need to adapt a model to multiple different tasks while maintaining the same base model.
- Learn more: LoRA: Low-Rank Adaptation of Large Language Models
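- A minimal sketch of the decomposition in plain PyTorch (in practice you would usually reach for a library such as `peft` rather than writing this by hand); the rank, scaling, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (B A) x * (alpha / r)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # original weights stay frozen
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # low-rank factors: the only
        self.B = nn.Parameter(torch.zeros(d_out, r))          # parameters that are trained
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12,288 trainable values vs. ~590k frozen in the base layer
```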
- Knowledge distillation trains a smaller "student" model to mimic a larger, fine-tuned "teacher" model. It yields efficiency gains by producing a more compact version of a high-performing model (the standard loss is sketched below).
- Learn more: DistilBERT: A distilled version of BERT
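- A sketch of the usual distillation objective: a KL-divergence term between temperature-softened teacher and student distributions, blended with the ordinary cross-entropy on the true labels. The temperature and mixing weight are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the teacher; the KL term is scaled by T^2 to keep gradients comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # standard supervised loss on true labels
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10 classes; the teacher logits would come from the frozen teacher model.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss)
```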
- Full training means updating all the parameters of the model during the training process.
- In the context of a pre-trained language model, this means the entire model, with millions or even billions of parameters, is adjusted during fine-tuning.
- Full training is resource-intensive—it requires a lot of compute power, memory, and time because you need to backpropagate gradients through all of the model’s layers.
- Typically, full training is done when training a model from scratch, but it can also be used to fully adapt a pre-trained model to a new task or dataset. However, this can be impractical for very large models like GPT-3 or LLaMA, as it demands enormous computational resources.
- PEFT methods like LoRA, Prefix-Tuning, Adapters, and Prompt-Tuning allow you to update only a small subset of model parameters, which significantly reduces computational requirements.
- These methods are highly suitable when fine-tuning large models for multiple tasks or when computational resources are limited.
- SFT can be done with both Full Training and PEFT methods:
- With Full Training: You perform SFT by updating all the model parameters. This is very effective but can be expensive, especially for large LLMs.
- With PEFT: You perform SFT by updating only a small number of additional parameters (e.g., LoRA matrices, adapters) while keeping the main model frozen. This is much more efficient and often yields good results without needing the massive compute power of full training.
- Full Training updates all parameters during SFT, which is computationally heavy but can lead to significant adaptation of the model.
- PEFT is a more efficient way to do SFT by only updating a small number of additional parameters, making it easier and cheaper to adapt large models.
- You can use SFT with either method, PEFT or Full Training; the difference is simply how many parameters you choose to update:
- If you have the resources, full training can be applied.
- If you need efficiency and want to avoid retraining the entire model, PEFT is the better approach.
- So, SFT is not inherently tied to PEFT or Full Training; it’s about using a labeled dataset, and how you train the model depends on your resources and goals.