Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Fine-Tuning

Process of specializing a pre-trained model for a specific task or domain through additional training with curated data, adapting its behavior without starting from scratch.

evergreen · #fine-tuning #llm #transfer-learning #lora #rlhf #training

What it is

Fine-tuning is the process of taking a pre-trained language model and training it further with specific data to adapt it to a particular task, domain, or style. Instead of training from scratch (costly and impractical), it leverages the base model's general knowledge and specializes it.

When to use fine-tuning

The decision between fine-tuning, RAG, and prompt engineering depends on the problem:

| Criterion | Prompt engineering | RAG | Fine-tuning |
| --- | --- | --- | --- |
| Initial cost | Low | Medium | High |
| Production latency | Low | Medium (retrieval) | Low |
| Updatable knowledge | No | Yes | No (requires retraining) |
| Consistent style/format | Limited | Limited | Excellent |
| Domain terminology | Limited | Good | Excellent |
| Data needed | None | Documents | 100-10,000 examples |
| Maintenance | Low | Medium (index) | High (retraining) |

Practical rule: start with prompt engineering, add RAG if external knowledge is needed, and resort to fine-tuning only when the model can't achieve the desired format, style, or terminology.
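That escalation path can be sketched as a toy decision function (the function and its parameters are illustrative, not from any library):

```python
def choose_approach(needs_external_knowledge: bool,
                    needs_custom_style: bool,
                    prompt_engineering_suffices: bool) -> str:
    """Toy encoding of the practical rule: escalate only when needed."""
    if prompt_engineering_suffices:
        return "prompt engineering"
    if needs_external_knowledge and not needs_custom_style:
        return "RAG"
    if needs_custom_style:
        # Fine-tuning fixes style/format; RAG still handles fresh knowledge
        return "fine-tuning" + (" + RAG" if needs_external_knowledge else "")
    return "prompt engineering"
```

The point of the sketch: fine-tuning appears only in the branch where style, format, or terminology cannot be achieved otherwise.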

Fine-tuning techniques

Full fine-tuning

Updates all of the model's parameters. It can produce the best results, but:

  • requires GPUs with lots of memory (A100 80GB or higher)
  • requires large datasets (thousands of examples)
  • risks "catastrophic forgetting" of the base model's knowledge

LoRA (Low-Rank Adaptation)

Freezes the base model's weights and trains small low-rank adaptation matrices. Instead of updating a weight matrix W of dimension d×d, LoRA trains two matrices A (d×r) and B (r×d) whose product approximates the update to W, with r much smaller than d (typically 8-64).
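The savings are easy to quantify. A quick check with illustrative dimensions (d = 4096, as in many 7-8B models, and r = 16):

```python
d = 4096   # hidden dimension of the weight matrix W (d x d)
r = 16     # LoRA rank

full = d * d            # parameters updated by full fine-tuning of W
lora = d * r + r * d    # parameters in A (d x r) plus B (r x d)

print(f"full fine-tuning: {full:,} params")   # 16,777,216
print(f"LoRA adapter:     {lora:,} params")   # 131,072
print(f"ratio: {lora / full:.2%}")            # 0.78%
```

Per adapted matrix, LoRA trains under 1% of the parameters, which is why the resulting adapters are so small.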

QLoRA

LoRA applied on top of a 4-bit quantized base model. The QLoRA paper demonstrates fine-tuning a 65B-parameter model on a single 48GB GPU; smaller models like Llama 3.1 8B fit comfortably on a 24GB consumer card.
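A back-of-the-envelope estimate shows why quantization matters. This counts weight memory only, ignoring activations, the adapter's optimizer state, and quantization overhead:

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights at a given precision."""
    return n_params * bits / 8 / 1e9

print(weight_gb(8e9, 16))   # Llama 3.1 8B in fp16: ~16 GB
print(weight_gb(8e9, 4))    # same model quantized to 4-bit: ~4 GB
print(weight_gb(70e9, 4))   # a 70B model at 4-bit: ~35 GB of weights alone
```

At fp16, the 8B model's weights alone nearly fill a 24GB GPU before training even starts; at 4-bit they leave ample headroom for the LoRA adapter and its gradients.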

RLHF (Reinforcement Learning from Human Feedback)

Aligns the model with human preferences using a reward model trained with human comparisons. This is how Claude, GPT-4, and other chat models are trained.
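The reward model at the core of RLHF is typically trained with a pairwise (Bradley-Terry style) loss: it should score the human-preferred response above the rejected one. A minimal sketch of that loss on scalar rewards:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response scores clearly higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))  # small loss: ranking agrees with humans
print(preference_loss(-1.0, 2.0))  # large loss: ranking is inverted
```

Training pushes rewards apart on human-labeled comparison pairs; the resulting reward model then scores candidate responses during reinforcement learning.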

Example: LoRA with PEFT

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
 
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # QLoRA: 4-bit base
)
 
# LoRA configuration
lora_config = LoraConfig(
    r=16,                     # adaptation rank
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "v_proj"],  # layers to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~0.1% of total
 
# Dataset in instruction-response format
dataset = load_dataset("json", data_files="training_data.jsonl")
 
trainer = SFTTrainer(
    model=model,  # already wrapped with LoRA, so no peft_config needed here
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("./lora-adapter")  # Only saves the adapter (~50MB)

The resulting adapter weighs ~50MB instead of the full model's ~16GB, and can be loaded on top of the base model in production.

Data preparation

Dataset quality is the single most important factor in the result:

  • Format: instruction-response pairs in JSONL, consistent in structure
  • Quality > quantity: 500 well-curated examples outperform 10,000 noisy ones
  • Diversity: cover the variety of expected use cases
  • Synthetic data: using a more capable model to generate training data is a common practice

For example, two sentiment-classification records:

{"instruction": "Classify the sentiment", "input": "The service was excellent", "output": "positive"}
{"instruction": "Classify the sentiment", "input": "They took 2 hours to serve me", "output": "negative"}

Evaluating the fine-tuned model

Loss going down is not enough — evaluate in the real usage context:

  • Held-out set: reserve 10-20% of data for evaluation
  • Task metrics: accuracy, F1, BLEU, ROUGE depending on the case
  • Human evaluation: compare base model vs. fine-tuned responses on real cases
  • Regression: verify the model didn't lose general capabilities
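With a labeled held-out set, the base-vs-fine-tuned comparison reduces to standard metrics. A minimal accuracy check on mock predictions (replace the hard-coded lists with real model outputs):

```python
def accuracy(predictions, labels):
    """Fraction of predictions matching the gold labels."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

labels          = ["positive", "negative", "negative", "positive"]
base_preds      = ["positive", "positive", "negative", "negative"]  # base model
finetuned_preds = ["positive", "negative", "negative", "positive"]  # after LoRA

print(accuracy(base_preds, labels))       # 0.5
print(accuracy(finetuned_preds, labels))  # 1.0
```

For generation tasks, swap accuracy for F1, BLEU, or ROUGE as the bullet list suggests, and run the same comparison on a regression set of general-capability prompts.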

Why it matters

Fine-tuning allows adapting a general model to a specific domain with your own data. With LoRA and QLoRA, hardware cost dropped dramatically — fine-tuning Llama 3.1 8B fits on a 24GB GPU. The key decision is not how to fine-tune, but whether you actually need it: prompt engineering and RAG solve most cases without the maintenance cost of a custom model.

References

  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021. Foundational low-rank adaptation method.
  • QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al., 2023. 4-bit fine-tuning for consumer hardware.
  • Hugging Face PEFT — Hugging Face, 2024. Library for efficient fine-tuning with LoRA, QLoRA, and other methods.
  • Hugging Face Training Guide — Hugging Face, 2024. Practical fine-tuning guide with Transformers.
  • How to Fine-Tune Chat Models — OpenAI, 2024. Fine-tuning guide for chat models with the OpenAI API.

Related content

  • Large Language Models

    Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

  • Artificial Intelligence

    Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

  • Synthetic Data

    Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.
