Fine‑Tuning LLMs with LoRA: A Practical, End‑to‑End Tutorial
Step-by-step guide to fine-tuning LLMs with LoRA/QLoRA using Transformers, PEFT, and TRL—from data prep to deployment.
Overview
Low-Rank Adaptation (LoRA) is the most practical way to fine‑tune large language models (LLMs) without retraining all parameters. By injecting small, trainable matrices into a frozen base model, LoRA slashes memory and compute needs while achieving strong task performance. This tutorial walks you end‑to‑end through data preparation, training with LoRA/QLoRA, evaluation, merging, and deployment using Hugging Face Transformers, PEFT, TRL, and bitsandbytes.
What you’ll build:
- A LoRA‑fine‑tuned instruction model from an open base LLM
- An efficient training pipeline (optionally with 4‑bit quantization, a.k.a. QLoRA)
- Reproducible scripts for evaluation, merging, and inference
Prerequisites and Hardware
- Python 3.10+
- A CUDA‑capable GPU (NVIDIA) with at least 12–24 GB VRAM recommended for 7B–13B models using QLoRA. Smaller VRAM (8–12 GB) can work with careful batch sizes and gradient accumulation.
- CUDA toolkit and compatible PyTorch build
Tip: Start with a 7B model. You’ll iterate faster and still learn the full workflow.
What Are LoRA and QLoRA?
- LoRA: Replace full‑rank weight updates with a pair of low‑rank matrices (rank r ≪ d). During training, only these small adapters are updated; the base model remains frozen.
- QLoRA: Load the base model in 4‑bit quantized weights (NF4) to further reduce memory, while still training LoRA adapters in higher precision (e.g., bfloat16). This is the go‑to approach for consumer GPUs.
Why it works:
- Most of the model’s expressivity is preserved in the frozen base weights.
- Task‑specific adaptation is captured by small, efficient adapters plugged into attention/MLP projections.
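The update rule can be sketched in a few lines of plain PyTorch (a minimal illustration of the LoRA forward pass, not tied to any library; the dimensions and hyperparameters are arbitrary examples):

```python
import torch

d, k, r = 512, 512, 16       # output dim, input dim, LoRA rank (r << d)
alpha = 32                   # LoRA scaling hyperparameter

W = torch.randn(d, k)        # frozen base weight (not trained)
A = torch.randn(r, k) * 0.01 # trainable down-projection
B = torch.zeros(d, r)        # trainable up-projection, initialized to zero

x = torch.randn(k)
# LoRA forward: base output plus the scaled low-rank update
y = W @ x + (alpha / r) * (B @ (A @ x))
```

Because B starts at zero, the adapter contributes nothing at initialization, so training begins from the base model's behavior. Only A and B (r·k + d·r parameters) are trained, versus d·k for the full weight.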
Project Setup
Install core libraries:
pip install --upgrade transformers accelerate datasets peft bitsandbytes trl evaluate sentencepiece
Optionally, verify GPU and bitsandbytes:
import torch
print('CUDA:', torch.cuda.is_available())
print('Device count:', torch.cuda.device_count())

import bitsandbytes as bnb
print('bitsandbytes:', bnb.__version__)
Choose a Base Model
Pick a permissively licensed, instruction‑tuned or base LLM that fits your VRAM and use case. Examples: Llama‑like, Mistral‑like, or similar 7B/8B models for general instruction following. Ensure that the tokenizer and model families match.
Guidelines:
- 7B–8B for prototyping and small datasets
- 13B for stronger reasoning if VRAM allows
- Prefer models with robust tokenizers and long context support if needed
Prepare Your Dataset
Your goal is to produce a clean, instruction‑response dataset. Two common formats:
- Chat messages (list of role/content messages)
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain LoRA in simple terms."},
{"role": "assistant", "content": "LoRA adds small adapters to a frozen model..."}
]}
- Instruction format (Alpaca‑style)
{"instruction": "Summarize this note.", "input": "LoRA reduces trainable params...", "output": "LoRA adds small matrices..."}
Best practices:
- Deduplicate, normalize whitespace, and remove harmful or licensed‑incompatible content.
- Keep outputs concise and accurate; the model will imitate what you provide.
- Split into train/eval (e.g., 95/5).
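The cleanup steps above can be sketched as a small preprocessing pass over a JSONL file (a minimal example; it collapses all whitespace, including newlines, in string fields, so relax the normalization if your outputs contain meaningful formatting):

```python
import json
import random

def clean_and_split(in_path, train_path, eval_path, eval_frac=0.05, seed=42):
    """Deduplicate records, normalize whitespace, and split train/eval."""
    seen, records = set(), []
    with open(in_path) as f:
        for line in f:
            ex = json.loads(line)
            # Collapse runs of whitespace in all string fields
            ex = {k: ' '.join(v.split()) if isinstance(v, str) else v
                  for k, v in ex.items()}
            key = json.dumps(ex, sort_keys=True)
            if key not in seen:  # drop exact duplicates
                seen.add(key)
                records.append(ex)
    random.Random(seed).shuffle(records)
    n_eval = max(1, int(len(records) * eval_frac))
    for path, chunk in [(eval_path, records[:n_eval]),
                        (train_path, records[n_eval:])]:
        with open(path, 'w') as f:
            for ex in chunk:
                f.write(json.dumps(ex) + '\n')
```

For larger corpora you would add near-duplicate detection (e.g., MinHash) and content filtering on top of this exact-match pass.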
Training Script (QLoRA + LoRA with TRL)
The following script loads a 4‑bit base model, prepares it for k‑bit training, attaches LoRA adapters, and launches SFT (supervised fine‑tuning).
# train_lora.py
import os, math
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
from peft import LoraConfig, prepare_model_for_kbit_training
model_name = os.environ.get('BASE_MODEL', 'your-base-llm')
train_path = os.environ.get('TRAIN_JSONL', 'train.jsonl')
eval_path = os.environ.get('EVAL_JSONL', 'eval.jsonl')
output_dir = os.environ.get('OUTPUT_DIR', 'lora-out')
# 4-bit quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
# Prepare for k-bit training (enables gradients, gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)
# LoRA configuration (r, alpha, dropout). Target modules depend on architecture.
# For Llama/Mistral-like models:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
)
# Load dataset
raw = load_dataset('json', data_files={'train': train_path, 'eval': eval_path})
# Format examples into plain text sequences using a simple template
# Prefer tokenizer.apply_chat_template(...) when your data uses chat messages
def format_example(ex):
    if 'messages' in ex and isinstance(ex['messages'], list):
        # Flatten chat messages into one string. The assistant turn is
        # included, so it serves as the training target.
        parts = []
        for m in ex['messages']:
            role = m.get('role', 'user').capitalize()
            content = m.get('content', '')
            parts.append(f"{role}: {content}")
        return '\n'.join(parts)
    else:
        instr = ex.get('instruction', '')
        inp = ex.get('input', '')
        out = ex.get('output', '')
        prompt = (
            f"### Instruction\n{instr}\n\n"
            + (f"### Input\n{inp}\n\n" if inp else '')
            + f"### Response\n{out}"
        )
        return prompt
train_dataset = raw['train'].map(lambda ex: {'text': format_example(ex)})
eval_dataset = raw['eval'].map(lambda ex: {'text': format_example(ex)})
# Training arguments — adjust to your GPU
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    evaluation_strategy='steps',  # 'eval_strategy' in newer transformers releases
    save_total_limit=2,
    bf16=True,
    lr_scheduler_type='cosine',
    warmup_ratio=0.05,
    optim='paged_adamw_32bit',  # paged bitsandbytes optimizer for QLoRA
    report_to='none',
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    peft_config=lora_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field='text',
    packing=True,       # better GPU utilization via sequence packing
    max_seq_length=2048,  # match your context window
    args=training_args,
)
trainer.train()
trainer.save_model() # saves LoRA adapter + config to output_dir
Run it:
BASE_MODEL="your-base-llm" \
TRAIN_JSONL=train.jsonl EVAL_JSONL=eval.jsonl \
OUTPUT_DIR=lora-out \
python train_lora.py
Notes:
- Adjust per_device_train_batch_size and gradient_accumulation_steps to fit your VRAM; the effective batch size is their product.
- If you run out of memory, reduce max_seq_length or lower the batch size further. Gradient checkpointing is already enabled by prepare_model_for_kbit_training by default.
- TRL's API evolves quickly: in recent versions, dataset_text_field, packing, and max_seq_length are set on an SFTConfig rather than passed to SFTTrainer directly. Match the snippet to your installed version.
Monitoring and Logging
- Watch training loss and eval loss; they should trend downward and stabilize.
- Early plateaus suggest increasing training steps, lowering learning rate, or improving data quality.
- Keep an eye on tokenization warnings; bad tokenization can silently degrade results.
Quick Evaluation (Perplexity + Spot Checks)
Before full benchmarking, sanity‑check on the eval split and with a few prompts.
# eval_snippets.py
import math, torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
adapter_dir = 'lora-out'
# Load adapter+base and compute perplexity on a few samples
peft_model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(peft_model.base_model.config._name_or_path, use_fast=True)
texts = [
    'User: Explain LoRA succinctly.\nAssistant:',
    '### Instruction\nList three benefits of QLoRA.\n\n### Response\n',
]
peft_model.eval()
with torch.no_grad():
    for t in texts:
        ids = tokenizer(t, return_tensors='pt').to(peft_model.device)
        out = peft_model(**ids, labels=ids['input_ids'])
        ppl = math.exp(out.loss.item())
        print('PPL:', round(ppl, 2))
Also test generation quality with a simple pipeline:
from transformers import pipeline
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_name = 'your-base-llm'
adapter_dir = 'lora-out'
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16, device_map='auto')
model = PeftModel.from_pretrained(base, adapter_dir)
pipe = pipeline('text-generation', model=model, tokenizer=tok, torch_dtype=torch.bfloat16, device_map='auto')
print(pipe('User: Give two tips for stable LoRA training.\nAssistant:', max_new_tokens=128, do_sample=True, temperature=0.7)[0]['generated_text'])
Merge and Export (Optional)
Merging bakes the LoRA deltas into the base weights so you can serve a single model without adapters.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
adapter_dir = 'lora-out'
merged_dir = 'merged-model'
peft_model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype=torch.bfloat16, device_map='auto')
merged = peft_model.merge_and_unload() # applies the LoRA deltas
# Save as standard HF weights
tokenizer = AutoTokenizer.from_pretrained(peft_model.base_model.config._name_or_path)
merged.save_pretrained(merged_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_dir)
You can now load merged_dir like any regular Transformers model.
Inference Examples
- With adapters (quick swaps across tasks):
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = 'your-base-llm'
adapter = 'lora-out'
tok = AutoTokenizer.from_pretrained(base)
base_model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map='auto')
model = PeftModel.from_pretrained(base_model, adapter)
pipe = pipeline('text-generation', model=model, tokenizer=tok, torch_dtype=torch.bfloat16, device_map='auto')
print(pipe('Write a concise guide to LoRA evaluation:', max_new_tokens=200)[0]['generated_text'])
- With merged weights (simpler serving):
from transformers import pipeline
pipe = pipeline('text-generation', model='merged-model', torch_dtype='auto', device_map='auto')
print(pipe('Draft a 2-sentence product description for a coffee grinder:', max_new_tokens=100)[0]['generated_text'])
Hyperparameter Cheatsheet
- Rank r: 8–32 commonly. Higher r increases capacity and memory. Start with r=16.
- lora_alpha: 16–64. Often scale with r (e.g., alpha=2×r).
- lora_dropout: 0.05–0.1 for regularization.
- Learning rate: 1e‑4 to 2e‑4 for adapters is typical; reduce if overfitting or unstable.
- Batch size: Maximize effective batch via gradient accumulation.
- Max sequence length: Use realistic context; packing improves throughput but verify quality.
- Optimizer: paged_adamw_32bit for QLoRA; standard AdamW for full‑precision adapter training.
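The alpha=2×r rule of thumb keeps the adapter's effective scaling constant as you sweep ranks, since LoRA multiplies the low-rank update by alpha/r:

```python
# Effective LoRA scaling is alpha / r, so pairing alpha = 2 * r keeps
# the update magnitude comparable while r controls adapter capacity.
for r in (8, 16, 32):
    alpha = 2 * r
    print(f"r={r:2d} alpha={alpha:2d} scale={alpha / r}")
```

This is why changing r alone (with alpha fixed) also changes the update strength, which can confound hyperparameter sweeps.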
Troubleshooting
- CUDA OOM: Lower batch size, increase gradient_accumulation_steps, shorten max_seq_length, enable gradient checkpointing, or switch to smaller base model.
- Training diverges: Reduce LR, add warmup (5–10%), increase lora_dropout, clean noisy labels.
- Bad generations: Improve data quality, lengthen training, raise r, or fine‑tune on higher‑quality prompts/targets.
- Tokenizer mismatches: Always load tokenizer from the same family as the base model; set pad_token if missing.
- Target modules: For Llama/Mistral‑like models, adapt q/k/v/o and MLP projections; for other architectures, inspect layer names (print(model) and search for projection layers).
- Catastrophic forgetting: Mix a small subset of general instruction data with your domain data or apply regularization.
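For the target-modules point above, a small helper can enumerate candidate layers instead of eyeballing print(model) (a sketch; it simply collects the trailing names of all nn.Linear layers, which is where LoRA adapters are typically attached):

```python
import torch.nn as nn

def find_linear_module_names(model):
    """Return the unique trailing names of all Linear layers in a model,
    e.g. ['q_proj', 'v_proj', ...] - candidates for LoRA target_modules."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(full_name.split('.')[-1])
    return sorted(names)

# Usage, after loading your base model:
# print(find_linear_module_names(model))
```

On Llama/Mistral-like models this should surface the q/k/v/o and gate/up/down projections used in the training script above.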
Safety, Compliance, and Licensing
- Respect base model and dataset licenses; verify commercial terms.
- Filter personally identifiable information and unsafe content in training data.
- Evaluate for prompt injection, jailbreaks, and misinformation on your target domain.
- Provide disclaimers and guardrails in downstream applications.
Scaling Up
- Multi‑adapter workflows: Keep separate LoRA heads for different domains and hot‑swap at inference.
- Long‑context training: Use models with native long context; increase max_seq_length and ensure sufficient VRAM.
- Better data curation: Preference optimization (e.g., DPO) after SFT can significantly improve helpfulness.
- Serving: Run merged models in standard backends or serve base+adapters in a runtime that supports LoRA composition. Engines like vLLM can load LoRA adapters efficiently.
Summary
LoRA and QLoRA make LLM fine‑tuning feasible on modest hardware by training only small adapter layers while keeping the base model frozen (and optionally quantized). With a clean dataset, carefully chosen hyperparameters, and the scripts above, you can ship high‑quality, domain‑specific models quickly—and iterate just as fast.