A Practical Guide to Multi‑Modal RAG: Images Plus Text, End‑to‑End Tutorial

Overview

Multi‑modal Retrieval‑Augmented Generation (RAG) extends classic text‑only RAG by letting a model retrieve and reason over both images and text. In this tutorial you will build an end‑to‑end pipeline that:

Ingests images and documents
Extracts text with OCR and generates captions
Embeds images and text into a shared vector space
Retrieves relevant items for a user query (text or image)
Fuses results from multiple signals (image, caption, OCR)
Generates a grounded answer with a vision‑language model (VLM)

Use cases include searching design catalogs by sketch and description, answering questions about receipts or forms, explaining dashboards or charts, and visual help‑center assistants that understand UI screenshots.

Architecture at a glance

A practical multi‑modal RAG system has 6 layers:

Ingest

Accepted inputs: PNG/JPG images, PDFs, and text files
Store originals in blob storage; track metadata such as source, timestamp, page number, bounding boxes

Preprocess

OCR for text inside images and scanned PDFs (e.g., Tesseract, PaddleOCR)
Image captioning for scene‑level semantics (e.g., BLIP‑style models)
Optional tiling for large images; split PDFs by page

Embed

Text encoder for text chunks and captions
Image encoder for global image vectors (CLIP‑style, OpenCLIP)
Optional region‑level embeddings using detected boxes

Index

Vector database (FAISS locally or Qdrant/Milvus/Weaviate in production)
Separate indices per modality: image, caption, OCR text; keep a relational mapping back to the same asset

Retrieve and fuse

Given a query (text or image), run multiple retrievers
Normalize scores, apply Reciprocal Rank Fusion (RRF) or weighted sum
Optional reranking with a stronger cross‑modal scorer

Generate

Hand the top‑k retrieved context (images + text snippets) to a VLM
Constrain the model to cite evidence and avoid hallucinations

Data preparation

Good retrieval demands thoughtful chunking and metadata.

Images
- Keep a global image vector per asset
- For dense scenes or documents, tile into overlapping crops and index each crop (store crop coordinates)
- Save captions and OCR text alongside the image id
Text
- Chunk long text to 256‑512 tokens with overlap
- Track source, page, section, and confidence scores
Metadata
- width, height, dpi
- OCR language, avg confidence
- Captioning model used and version

Embedding strategies

Pick one or combine several strategies depending on your data and latency budget.

Dual‑encoder image↔text retrieval (CLIP‑style)

Pros: fast ANN search; supports cross‑modal queries out of the box
Cons: struggles with dense text in documents; can miss fine‑grained details

OCR‑first indexing

Run OCR and index the extracted text using a strong text encoder
Pros: precise when the question targets specific words, numbers, or fields
Cons: OCR errors; layout and figures without text are hard

Image captioning + text retrieval

Generate one or more captions per image (short global, detailed long)
Index captions in a text vector store
Pros: bridges visual content into text space; better with semantic or style queries
Cons: captions may omit crucial details

Most robust systems combine all three: CLIP for semantics, OCR for literal strings, captions for context.

Vector store choices

FAISS: fast, simple local prototyping; IVF or HNSW for scale; PQ for memory savings
Qdrant/Milvus/Weaviate: production‑grade, HNSW‑based ANN, filters, payload metadata, horizontal scale
Similarity metric: cosine or inner product; normalize vectors for cosine‑like behavior

Retrieval orchestration and fusion

Given a user text query like: find the invoice with total 129.50 that mentions ACME and a red logo

Run three retrievers:
- Text→OCR store (exact numbers and names)
- Text→caption store (semantic cues like red logo, invoice)
- Text→image store with CLIP (visual style and layout)
Normalize scores to [0,1] (z‑score or min‑max)
Fuse with Reciprocal Rank Fusion (robust) or a learned weighted sum
Optional rerank: compute CLIP similarity at higher resolution for top‑N, or use a cross‑encoder VLM to score text‑image pairs

For image queries (query‑by‑image), embed the query image and search the image index; optionally expand with a caption generated from the query image and search the caption/OCR stores too.

Generation with a VLM

After retrieval, pass to a VLM that can accept both images and text. Construct a prompt that:

Lists the user question
Includes the top‑k evidence items: small thumbnails or URLs for images plus OCR and caption text
Instructs the model to answer strictly from evidence and cite the ids of used items

Prompt sketch:

System: You are a visual analyst. Only answer using the provided evidence. If uncertain, say you do not know.
User: question
Context: k entries with image references, captions, OCR snippets, and metadata

Step‑by‑step implementation

The example below uses widely available open‑source pieces for a local prototype. Adjust models to suit your stack.

Setup

# vision and NLP
pip install pillow sentence-transformers torchvision transformers timm
# indexing and OCR
pip install faiss-cpu pytesseract
# captioning utils (optional)
pip install accelerate

Ingestion and preprocessing

import os, json, uuid, pathlib
from PIL import Image
import pytesseract

DATA_DIR = 'data/assets'  # images and pdf pages already split as images
os.makedirs('artifacts', exist_ok=True)

records = []
for fp in pathlib.Path(DATA_DIR).glob('**/*.*'):
    if fp.suffix.lower() in {'.png', '.jpg', '.jpeg'}:
        img = Image.open(fp).convert('RGB')
        # OCR (simple baseline)
        ocr_text = pytesseract.image_to_string(img)
        rec = {
            'id': str(uuid.uuid4()),
            'path': str(fp),
            'ocr_text': ocr_text.strip(),
            'meta': {
                'source': 'local-demo',
                'filename': fp.name
            }
        }
        records.append(rec)

with open('artifacts/records.jsonl', 'w') as f:
    for r in records:
        f.write(json.dumps(r) + '\n')

Image captioning (optional but powerful)

from transformers import AutoProcessor, BlipForConditionalGeneration
import torch

cap_model_name = 'Salesforce/blip-image-captioning-base'  # pick any BLIP captioner available locally
processor = AutoProcessor.from_pretrained(cap_model_name)
cap_model = BlipForConditionalGeneration.from_pretrained(cap_model_name)
cap_model.eval()

for r in records:
    img = Image.open(r['path']).convert('RGB')
    inputs = processor(images=img, return_tensors='pt')
    with torch.no_grad():
        out = cap_model.generate(**inputs, max_new_tokens=40)
    r['caption'] = processor.tokenizer.decode(out[0], skip_special_tokens=True)

Multi‑modal embeddings with CLIP

Using sentence‑transformers makes CLIP embeddings convenient for both text and images.

import numpy as np
from sentence_transformers import SentenceTransformer
from PIL import Image
import faiss

clip_model = SentenceTransformer('clip-ViT-B-32')

# Build three indices: images, captions, OCR text
img_vecs, cap_vecs, ocr_vecs = [], [], []
img_ids, cap_ids, ocr_ids = [], [], []

for r in records:
    img = Image.open(r['path']).convert('RGB')
    ivec = clip_model.encode([img], batch_size=1, convert_to_numpy=True, normalize_embeddings=True)[0]
    img_vecs.append(ivec); img_ids.append(r['id'])

    if r.get('caption'):
        cvec = clip_model.encode([r['caption']], convert_to_numpy=True, normalize_embeddings=True)[0]
        cap_vecs.append(cvec); cap_ids.append(r['id'])

    if r.get('ocr_text'):
        ovec = clip_model.encode([r['ocr_text']], convert_to_numpy=True, normalize_embeddings=True)[0]
        ocr_vecs.append(ovec); ocr_ids.append(r['id'])

img_mat = np.vstack(img_vecs).astype('float32')
cap_mat = np.vstack(cap_vecs).astype('float32') if cap_vecs else np.zeros((0, img_mat.shape[1]), dtype='float32')
ocr_mat = np.vstack(ocr_vecs).astype('float32') if ocr_vecs else np.zeros((0, img_mat.shape[1]), dtype='float32')

# FAISS inner product indices (embeddings already normalized)
img_index = faiss.IndexFlatIP(img_mat.shape[1]); img_index.add(img_mat)
cap_index = faiss.IndexFlatIP(img_mat.shape[1]);
ocr_index = faiss.IndexFlatIP(img_mat.shape[1]);
if cap_mat.shape[0] > 0: cap_index.add(cap_mat)
if ocr_mat.shape[0] > 0: ocr_index.add(ocr_mat)

Retrieval and fusion

from collections import defaultdict
import numpy as np

def rrf(ranked_ids, k=60):
    # ranked_ids: list of lists of ids in rank order
    scores = defaultdict(float)
    for lst in ranked_ids:
        for rank, _id in enumerate(lst, start=1):
            scores[_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# text query example
query = 'invoice mentioning ACME with red logo total 129.50'
q_vec = clip_model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype('float32')

k = 10
D_img, I_img = img_index.search(q_vec, k)
D_cap, I_cap = (np.array([[]]), np.array([[]])) if cap_index.ntotal == 0 else cap_index.search(q_vec, k)
D_ocr, I_ocr = (np.array([[]]), np.array([[]])) if ocr_index.ntotal == 0 else ocr_index.search(q_vec, k)

rank_lists = []
rank_lists.append([img_ids[i] for i in I_img[0] if i != -1])
if cap_index.ntotal: rank_lists.append([cap_ids[i] for i in I_cap[0] if i != -1])
if ocr_index.ntotal: rank_lists.append([ocr_ids[i] for i in I_ocr[0] if i != -1])

fused = rrf(rank_lists)
results = [rid for rid, _ in fused[:k]]

Generation stub with a VLM

Plug in your preferred VLM. The function below shows the interface you will need.

def vlm_answer(question, evidence):
    # evidence: list of dicts with 'path', 'caption', 'ocr_text'
    # Replace with your VLM call; pass images and a structured prompt
    context_lines = []
    for i, e in enumerate(evidence, 1):
        line = f'[item {i}] file={os.path.basename(e["path"])} caption={e.get("caption", "")} ocr={e.get("ocr_text", "")[:200]}'
        context_lines.append(line)
    prompt = (
        'You are a visual analyst. Answer only using the evidence items below. '
        'Cite item numbers used. If unsure, say you do not know.\n\n'
        + '\n'.join(context_lines) + f'\n\nQuestion: {question}\nAnswer:'
    )
    # return call_to_vlm(images=[Image.open(e['path']) for e in evidence], prompt=prompt)
    return '(demo) cite items [1,2]; total appears as 129.50 and vendor ACME.'

evidence = [next(r for r in records if r['id'] == rid) for rid in results[:3]]
print(vlm_answer(query, evidence))

Notes

In production, stream thumbnails or signed URLs instead of raw images
Use a VLM that accepts multiple images and long context; otherwise chunk evidence across multiple rounds

Evaluation

Measure retrieval first, then end‑to‑end generation.

Retrieval metrics
- Recall@k and NDCG@k for text→image, text→OCR, and image→image
- Build a small labeled set: each query lists relevant item ids
- Compare single‑signal vs fused retrieval
Generation metrics
- Human evaluation for factuality and citation correctness
- Task‑specific checks: numeric fields matched, key entities extracted
Ablations to try
- With vs without captions
- OCR confidence thresholds and language packs
- Tiling settings for large documents or dashboards

Latency, cost, and scaling

Precompute and cache all embeddings; persist in your vector DB
Use approximate indices (HNSW/IVF) and float16 or product quantization to reduce memory
Batch encode images and texts on GPU; enable pinned memory and dataloader workers
Limit evidence to the minimal set that answers the question; rerank top‑50 to top‑5
Compress OCR text with sentence selection before handing to the generator

Guardrails and quality

Safety: filter NSFW or sensitive imagery before indexing
Privacy: redact PII in OCR text when required; separate public and private collections
Freshness: stamp each asset with timestamps; add a recency prior in fusion
Observability: log queries, retrieved ids, scores, and selected evidence; capture model output with citations

Troubleshooting checklist

Recall is low on exact fields
- Improve OCR (language packs, denoise, thresholding, rotation)
- Add an exact‑match keyword search over OCR text alongside vector search
Semantics are missed despite good OCR
- Add or improve captions; generate 2‑3 diverse captions per image
- Fine‑tune CLIP on in‑domain pairs if feasible
The generator hallucinates
- Tighten the prompt to require citations
- Post‑verify answers by re‑checking cited evidence items
Latency is too high
- Reduce top‑k per retriever before fusion
- Quantize indices and batch VLM inference

What to build next

Region‑aware retrieval: detect key fields and index region crops with coordinates; show highlighted boxes in answers
Layout‑aware embeddings for documents and forms
Learned fusion: train a light model to predict relevance from features such as scores, caption length, OCR confidence, and recency

Conclusion

Multi‑modal RAG combines the strengths of vision and language systems: images anchor visual evidence, OCR preserves literal strings, and captions provide semantic glue. With a modest stack built from OCR, a captioner, CLIP embeddings, and a vector index, you can answer rich, grounded questions about visual data. Start simple with the baseline above, add fusion and reranking, and evolve toward region‑aware retrieval and robust VLM reasoning as your use cases grow.