Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Fine-Tuning

Process of specializing a pre-trained model for a specific task or domain through additional training with curated data, adapting its behavior without starting from scratch.

evergreen · #fine-tuning #llm #transfer-learning #lora #rlhf #training

What it is

Fine-tuning is the process of taking a pre-trained language model and training it further with specific data to adapt it to a particular task, domain, or style. Instead of training from scratch (costly and impractical), it leverages the base model's general knowledge and specializes it.

When to use fine-tuning

The decision between fine-tuning, RAG, and prompt engineering depends on the problem:

| Criterion | Prompt engineering | RAG | Fine-tuning |
| --- | --- | --- | --- |
| Initial cost | Low | Medium | High |
| Production latency | Low | Medium (retrieval) | Low |
| Updatable knowledge | No | Yes | No (requires retraining) |
| Consistent style/format | Limited | Limited | Excellent |
| Domain terminology | Limited | Good | Excellent |
| Data needed | None | Documents | 100-10,000 examples |
| Maintenance | Low | Medium (index) | High (retraining) |

Practical rule: start with prompt engineering, add RAG if external knowledge is needed, and resort to fine-tuning only when the model can't achieve the desired format, style, or terminology.
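That escalation path can be sketched as a toy decision function (the function and its parameters are illustrative, not from any library):

```python
def choose_approach(needs_external_knowledge: bool,
                    needs_custom_style: bool,
                    prompt_engineering_suffices: bool) -> str:
    """Toy encoding of the practical rule: escalate only when needed."""
    if prompt_engineering_suffices:
        return "prompt engineering"
    if needs_external_knowledge and not needs_custom_style:
        return "RAG"
    if needs_custom_style:
        # Fine-tuning fixes style/format; RAG still handles fresh knowledge
        return "fine-tuning" + (" + RAG" if needs_external_knowledge else "")
    return "prompt engineering"
```

The point of the sketch: fine-tuning appears only in the branch where style, format, or terminology cannot be achieved otherwise.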

Fine-tuning techniques

Full fine-tuning

Updates all of the model's parameters. It can produce the best results, but:

  • requires GPUs with lots of memory (A100 80GB or higher)
  • requires large datasets (thousands of examples)
  • risks "catastrophic forgetting" of the base model's knowledge

LoRA (Low-Rank Adaptation)

Freezes the base model's weights and trains small low-rank adaptation matrices. Instead of updating a weight matrix W of dimension d×d, LoRA trains two matrices A (d×r) and B (r×d) whose product approximates the update to W, with r much smaller than d (typically 8-64).
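The savings are easy to quantify. A quick check with illustrative dimensions (d = 4096, as in many 7-8B models, and r = 16):

```python
d = 4096   # hidden dimension of the weight matrix W (d x d)
r = 16     # LoRA rank

full = d * d            # parameters updated by full fine-tuning of W
lora = d * r + r * d    # parameters in A (d x r) plus B (r x d)

print(f"full fine-tuning: {full:,} params")   # 16,777,216
print(f"LoRA adapter:     {lora:,} params")   # 131,072
print(f"ratio: {lora / full:.2%}")            # 0.78%
```

Per adapted matrix, LoRA trains under 1% of the parameters, which is why the resulting adapters are so small.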

QLoRA

LoRA applied on top of a 4-bit quantized base model. The QLoRA paper demonstrates fine-tuning a 65B-parameter model on a single 48GB GPU; smaller models like Llama 3.1 8B fit comfortably on a 24GB consumer card.
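A back-of-the-envelope estimate shows why quantization matters. This counts weight memory only, ignoring activations, the adapter's optimizer state, and quantization overhead:

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights at a given precision."""
    return n_params * bits / 8 / 1e9

print(weight_gb(8e9, 16))   # Llama 3.1 8B in fp16: ~16 GB
print(weight_gb(8e9, 4))    # same model quantized to 4-bit: ~4 GB
print(weight_gb(70e9, 4))   # a 70B model at 4-bit: ~35 GB of weights alone
```

At fp16, the 8B model's weights alone nearly fill a 24GB GPU before training even starts; at 4-bit they leave ample headroom for the LoRA adapter and its gradients.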

RLHF (Reinforcement Learning from Human Feedback)

Aligns the model with human preferences using a reward model trained with human comparisons. This is how Claude, GPT-4, and other chat models are trained.
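The reward model at the core of RLHF is typically trained with a pairwise (Bradley-Terry style) loss: it should score the human-preferred response above the rejected one. A minimal sketch of that loss on scalar rewards:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response scores clearly higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))  # small loss: ranking agrees with humans
print(preference_loss(-1.0, 2.0))  # large loss: ranking is inverted
```

Training pushes rewards apart on human-labeled comparison pairs; the resulting reward model then scores candidate responses during reinforcement learning.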

Example: LoRA with PEFT

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
 
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # QLoRA: 4-bit base
)
 
# LoRA configuration
lora_config = LoraConfig(
    r=16,                     # adaptation rank
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "v_proj"],  # layers to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~0.1% of total
 
# Dataset in instruction-response format
dataset = load_dataset("json", data_files="training_data.jsonl")
 
trainer = SFTTrainer(
    model=model,  # already wrapped with LoRA, so no peft_config needed here
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("./lora-adapter")  # Only saves the adapter (~50MB)

The resulting adapter weighs ~50MB instead of the full model's ~16GB, and can be loaded on top of the base model in production.

Data preparation

Dataset quality is the single most important factor in the result:

  • Format: instruction-response pairs in JSONL, consistent in structure
  • Quality > quantity: 500 well-curated examples outperform 10,000 noisy ones
  • Diversity: cover the variety of expected use cases
  • Synthetic data: using a more capable model to generate training data is a common practice

For example, two sentiment-classification records:

{"instruction": "Classify the sentiment", "input": "The service was excellent", "output": "positive"}
{"instruction": "Classify the sentiment", "input": "They took 2 hours to serve me", "output": "negative"}

Evaluating the fine-tuned model

Loss going down is not enough — evaluate in the real usage context:

  • Held-out set: reserve 10-20% of data for evaluation
  • Task metrics: accuracy, F1, BLEU, ROUGE depending on the case
  • Human evaluation: compare base model vs. fine-tuned responses on real cases
  • Regression: verify the model didn't lose general capabilities
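With a labeled held-out set, the base-vs-fine-tuned comparison reduces to standard metrics. A minimal accuracy check on mock predictions (replace the hard-coded lists with real model outputs):

```python
def accuracy(predictions, labels):
    """Fraction of predictions matching the gold labels."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

labels          = ["positive", "negative", "negative", "positive"]
base_preds      = ["positive", "positive", "negative", "negative"]  # base model
finetuned_preds = ["positive", "negative", "negative", "positive"]  # after LoRA

print(accuracy(base_preds, labels))       # 0.5
print(accuracy(finetuned_preds, labels))  # 1.0
```

For generation tasks, swap accuracy for F1, BLEU, or ROUGE as the bullet list suggests, and run the same comparison on a regression set of general-capability prompts.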

Why it matters

Fine-tuning allows adapting a general model to a specific domain with your own data. With LoRA and QLoRA, hardware cost dropped dramatically — fine-tuning Llama 3.1 8B fits on a 24GB GPU. The key decision is not how to fine-tune, but whether you actually need it: prompt engineering and RAG solve most cases without the maintenance cost of a custom model.

References

  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021. Foundational low-rank adaptation method.
  • QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al., 2023. 4-bit fine-tuning for consumer hardware.
  • Hugging Face PEFT — Hugging Face, 2024. Library for efficient fine-tuning with LoRA, QLoRA, and other methods.
  • Hugging Face Training Guide — Hugging Face, 2024. Practical fine-tuning guide with Transformers.
  • How to Fine-Tune Chat Models — OpenAI, 2024. Fine-tuning guide for chat models with the OpenAI API.

Related content

  • Large Language Models

    Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

  • Artificial Intelligence

    Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

  • Synthetic Data

    Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.
