GraphRAG Tutorial: From Documents to Knowledge Graph–Powered RAG

Build a practical GraphRAG pipeline: extract a knowledge graph, index nodes and chunks, retrieve local paths and global summaries, and synthesize grounded answers.

ASOasis

Overview

Retrieval-augmented generation (RAG) couples a large language model (LLM) with a retriever to ground answers in your data. GraphRAG extends this idea by building and querying a knowledge graph so the LLM can reason over entities, relations, and global structure—not just isolated chunks. The payoff is better multi-hop answers, disambiguation, and explainability via paths and citations.

This tutorial walks you end-to-end: ingesting documents, extracting a graph, indexing both text and graph signals, and implementing a two-level retriever (local and global) that feeds a final answer synthesis prompt.

When to use GraphRAG

  • Your questions require multi-hop reasoning (A → related-to → B → causes → C).
  • You need interpretable answers with entity/edge citations.
  • Your corpus has recurring entities across documents (people, orgs, APIs, components).
  • You want global summaries of regions of the graph (communities, topics) to complement local evidence.

Architecture at a glance

  • Ingestion: parse and chunk documents.
  • Graph extraction: LLM or NLP pipeline yields triples (subject, relation, object) with evidence.
  • Storage: text chunks in a vector store; graph in NetworkX or Neo4j (optional).
  • Indexing: embeddings for chunks and for node/edge text; community detection for global structure.
  • Retrieval:
    • Local: semantic search over chunks + node neighborhoods.
    • Global: community- or subgraph-level summaries.
  • Synthesis: structured prompt that includes local facts, paths, global summaries, and citations.

ASCII sketch:

[Docs] -> [Chunker] -> (1) [Vector Index]
                 \-> (2) [Triple Extractor] -> [Graph DB] -> [Communities + Summaries]

Query -> [Entity Linking + Seed Nodes] -> [Neighborhood Expand] -> [Local + Global Context] -> [LLM Answer]

Prerequisites

  • Python 3.10+
  • Packages: networkx, sentence-transformers, faiss-cpu (or Chroma), scikit-learn, pydantic, python-dotenv, spacy (optional), fastapi (optional for serving)
  • An embedding model (e.g., sentence-transformers) and an LLM provider (any; wrap behind a simple function).

Project setup

Create a minimal environment and install dependencies:

python -m venv .venv && source .venv/bin/activate
pip install networkx sentence-transformers faiss-cpu scikit-learn pydantic python-dotenv spacy
python -m spacy download en_core_web_sm

A simple layout:

project/
  data/                   # raw docs (.md)
  build/
    chunks.jsonl          # chunk cache
    graph.jsonl           # triples cache
    node_summaries.jsonl  # community summaries
  app/
    ingest.py
    llm.py
    extract_graph.py
    index.py
    summarize_graph.py
    embed.py
    retriever.py
    answer.py

Step 1 — Ingest and chunk documents

Keep chunks small enough for precise retrieval but large enough for context (~400–800 tokens). The splitter below caps chunks at 1,800 characters, which is roughly 450 tokens of English text. Store chunk text and metadata (doc id, page, headings).

# app/ingest.py
from pathlib import Path
import re, json
from typing import List, Dict

def simple_md_split(text: str, max_chars: int = 1800) -> List[str]:
    paras = [p.strip() for p in re.split(r"\n\n+", text) if p.strip()]
    chunks, buf = [], ""
    for p in paras:
        if len(buf) + len(p) + 2 > max_chars:
            if buf: chunks.append(buf); buf = ""
        buf = (buf + "\n\n" + p).strip()
    if buf: chunks.append(buf)
    return chunks

def load_docs(path="data") -> Dict[str, List[str]]:
    docs = {}
    for f in Path(path).glob("**/*.md"):
        text = f.read_text(encoding="utf-8")
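        # Use the relative path as the id so same-named files in different folders do not collide
        doc_id = f.relative_to(path).with_suffix("").as_posix()
        docs[doc_id] = simple_md_split(text)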
    return docs

if __name__ == "__main__":
    docs = load_docs()
    Path("build").mkdir(exist_ok=True)
    with open("build/chunks.jsonl", "w", encoding="utf-8") as w:
        for doc_id, chunks in docs.items():
            for i, ch in enumerate(chunks):
                w.write(json.dumps({"doc_id": doc_id, "chunk_id": i, "text": ch})+"\n")
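
One gap worth noting: simple_md_split never breaks up a single paragraph, so a paragraph longer than max_chars comes out as one oversized chunk. The following is a minimal sketch of one way to handle that, assuming sentence boundaries are acceptable split points. The helper name split_long_paragraph is hypothetical and not part of the pipeline above.

```python
import re
from typing import List

def split_long_paragraph(p: str, max_chars: int = 1800) -> List[str]:
    """Split one paragraph into pieces of at most max_chars characters.

    Hypothetical helper: prefers sentence boundaries and hard-cuts only
    when a single sentence is itself longer than max_chars.
    """
    if len(p) <= max_chars:
        return [p]
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", p)
    pieces, buf = [], ""
    for s in sentences:
        # Hard-cut sentences that cannot fit in a chunk on their own.
        while len(s) > max_chars:
            if buf:
                pieces.append(buf)
                buf = ""
            pieces.append(s[:max_chars])
            s = s[max_chars:]
        if buf and len(buf) + 1 + len(s) > max_chars:
            pieces.append(buf)
            buf = s
        else:
            buf = f"{buf} {s}".strip()
    if buf:
        pieces.append(buf)
    return pieces
```

One way to use it would be to run each paragraph through split_long_paragraph before the buffering loop in simple_md_split, so every piece the loop sees already fits within max_chars.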

Step 2 — Extract entities and relations (triples)

You can use either:

  • LLM-based extraction with a JSON schema (best quality, higher cost), or
  • Lightweight NLP (spaCy + patterns) as a fallback.

LLM wrapper (provider-agnostic):

# app/llm.py
import os, json
from typing import List

# Implement this to call your LLM provider (OpenAI, Azure, Anthropic, local, etc.)
# It should return a JSON string (not yet parsed) that fits the schema we request;
# the caller runs json.loads on it.

def call_llm(system: str, prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM provider here.")

TRIPLE_SCHEMA = {
  "type": "object",
  "properties": {
    "triples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "subject": {"type": "string"},
          "relation": {"type": "string"},
          "object": {"type": "string"},
          "evidence": {"type": "string"},
          "confidence": {"type": "number"}
        },
        "required": ["subject","relation","object","evidence","confidence"]
      }
    }
  },
  "required": ["triples"]
}

EXTRACT_SYSTEM = """
You extract knowledge graph triples from text. Output compact JSON only.
Entities should be canonical (merge aliases). Use relation verbs or nouns.
Include a short evidence quote from the text and a confidence in [0,1].
"""

# Single-quote delimiters let the prompt itself wrap the text in """ ... """.
EXTRACT_PROMPT_TMPL = '''
Text:
"""{text}"""

Respond with JSON per schema: {schema}
'''.strip()

Extraction driver:

# app/extract_graph.py
import json
from pathlib import Path
from llm import call_llm, EXTRACT_SYSTEM, EXTRACT_PROMPT_TMPL, TRIPLE_SCHEMA

def extract_triples_for_chunks(chunks_path="build/chunks.jsonl", out_path="build/graph.jsonl"):
    with open(chunks_path, "r", encoding="utf-8") as r, open(out_path, "w", encoding="utf-8") as w:
        for line in r:
            rec = json.loads(line)
            prompt = EXTRACT_PROMPT_TMPL.format(text=rec["text"], schema=json.dumps(TRIPLE_SCHEMA))
            try:
                resp = call_llm(EXTRACT_SYSTEM, prompt)
                data = json.loads(resp)
                for t in data.get("triples", []):
                    t.update({"doc_id": rec["doc_id"], "chunk_id": rec["chunk_id"]})
                    w.write(json.dumps(t)+"\n")
            except Exception as e:
                # Log and continue so one bad chunk does not abort the run
                print(f"[extract] skipped {rec['doc_id']}#{rec['chunk_id']}: {e!r}")

if __name__ == "__main__":
    extract_triples_for_chunks()
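
LLM output does not always match the requested schema, and the confidence field is only useful if something acts on it. Here is a minimal sketch of a validation step that could run on the parsed response before triples are written. The helper name clean_triples and the defaults min_confidence=0.5 and max_per_chunk=20 are illustrative assumptions, not values from the pipeline above.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("subject", "relation", "object", "evidence", "confidence")

def clean_triples(raw: Any, min_confidence: float = 0.5,
                  max_per_chunk: int = 20) -> List[Dict[str, Any]]:
    """Keep only well-formed, sufficiently confident triples.

    Hypothetical helper; the thresholds are illustrative defaults.
    """
    if not isinstance(raw, dict) or not isinstance(raw.get("triples"), list):
        return []
    kept = []
    for t in raw["triples"]:
        if not isinstance(t, dict) or any(k not in t for k in REQUIRED_KEYS):
            continue
        conf = t["confidence"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            continue
        if not (min_confidence <= conf <= 1.0):
            continue
        # Subject, relation, and object must be non-empty strings.
        if not all(isinstance(t[k], str) and t[k].strip()
                   for k in ("subject", "relation", "object")):
            continue
        kept.append(t)
    # Highest-confidence triples first, then cap the count per chunk.
    kept.sort(key=lambda t: t["confidence"], reverse=True)
    return kept[:max_per_chunk]
```

In the driver, this would sit between json.loads(resp) and the write loop, for example by iterating over clean_triples(data) instead of data.get("triples", []).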

Tip: post-process to normalize entity names (lowercase, strip punctuation, map aliases like “IBM” ↔ “International Business Machines”).
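
A minimal sketch of that normalization, assuming a small hand-maintained alias table. The ALIASES entries and the function name normalize_entity are hypothetical. Punctuation is removed outright, so "I.B.M." and "IBM" produce the same key.

```python
import re
import unicodedata
from typing import Dict

# Hypothetical alias table: normalized alias -> canonical name.
ALIASES: Dict[str, str] = {
    "ibm": "international business machines",
    "international business machines corporation": "international business machines",
}

def normalize_entity(name: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Lowercase, strip punctuation, collapse whitespace, then map aliases."""
    text = unicodedata.normalize("NFKC", name).lower()
    # Drop every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return aliases.get(text, text)
```

If node names are normalized this way, query mentions need the same treatment before they are used as seeds; keeping the original surface form as a display label on the node preserves readable citations.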

Step 3 — Build the graph and compute communities

Use NetworkX for a portable graph. Optionally mirror to Neo4j if you need Cypher queries or a production-grade store.

# app/index.py
import json, networkx as nx
from collections import defaultdict

class GraphIndex:
    def __init__(self):
        self.G = nx.MultiDiGraph()
        self.node_text = defaultdict(list)

    def add_triple(self, s,r,o,evidence,meta):
        self.G.add_node(s)
        self.G.add_node(o)
        self.G.add_edge(s,o,relation=r,evidence=evidence,**meta)
        self.node_text[s].append(evidence)
        self.node_text[o].append(evidence)

    @classmethod
    def from_jsonl(cls, path="build/graph.jsonl"):
        gi = cls()
        with open(path,"r",encoding="utf-8") as f:
            for line in f:
                t = json.loads(line)
                gi.add_triple(t["subject"], t["relation"], t["object"], t["evidence"], {"doc_id":t["doc_id"],"chunk_id":t["chunk_id"]})
        return gi

if __name__ == "__main__":
    gi = GraphIndex.from_jsonl()
    print(gi.G.number_of_nodes(), gi.G.number_of_edges())

Community detection (global structure) and community summaries:

# app/summarize_graph.py
import json
import networkx as nx
from llm import call_llm
from collections import defaultdict

COMMUNITY_PROMPT = """
Summarize this set of related entities and relations in 5-8 bullet points.
Be factual; cite 2-4 key entity names.
Input triples:\n{triples}
"""

def detect_communities(G):
    # Placeholder, NOT Louvain: treats each connected component of the
    # undirected view as one community. Swap in a real community algorithm
    # (e.g., Louvain or Leiden) for finer-grained structure.
    return list(nx.connected_components(G.to_undirected()))

def summarize_communities(G, out_path="build/node_summaries.jsonl"):
    comms = detect_communities(G)
    with open(out_path, "w", encoding="utf-8") as w:
        for i, nodes in enumerate(comms):
            sub = G.subgraph(nodes)
            triples = []
            for u,v,k,d in sub.edges(keys=True, data=True):
                triples.append(f"({u}) -[{d.get('relation','related')}]-> ({v})")
            prompt = COMMUNITY_PROMPT.format(triples="\n".join(triples[:60]))
            try:
                summary = call_llm("You write precise technical summaries.", prompt)
            except Exception:
                summary = "- Related entities: " + ", ".join(list(nodes)[:8])
            w.write(json.dumps({"community_id": i, "nodes": list(nodes), "summary": summary})+"\n")

Step 4 — Dual index: vectors for text and nodes

Embed both chunk texts and node profiles (concatenated evidence snippets). Use the same embedding model so you can rank both together.

# app/embed.py
import json
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

class DualIndex:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.vec_dim = self.model.get_sentence_embedding_dimension()
        self.faiss_chunks = faiss.IndexFlatIP(self.vec_dim)
        self.faiss_nodes  = faiss.IndexFlatIP(self.vec_dim)
        self.chunk_meta = []
        self.node_meta  = []

    def _emb(self, texts):
        X = self.model.encode(texts, normalize_embeddings=True)
        return np.asarray(X, dtype="float32")

    def add_chunks(self, chunks_jsonl="build/chunks.jsonl"):
        texts, metas = [], []
        with open(chunks_jsonl,"r",encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                texts.append(rec["text"])
                # Keep the text too, so the answer prompt can quote the chunk body
                metas.append({k: rec[k] for k in ("doc_id", "chunk_id", "text")})
        X = self._emb(texts)
        self.faiss_chunks.add(X)
        self.chunk_meta = metas

    def add_nodes(self, graph_index, min_snips=3):
        texts, metas = [], []
        for node, snips in graph_index.node_text.items():
            if len(snips) < min_snips: continue
            txt = f"Entity: {node}\nEvidence:\n- " + "\n- ".join(snips[:10])
            texts.append(txt)
            metas.append({"entity": node})
        if not texts:
            return  # no node met min_snips; leave the node index empty
        X = self._emb(texts)
        self.faiss_nodes.add(X)
        self.node_meta = metas
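
Because both indexes hold L2-normalized vectors from the same model and use inner product, their scores are cosine similarities on the same scale, which is what makes joint ranking possible. A minimal sketch of merging the two result lists follows; merge_hits and the Hit tuple are hypothetical names, and the inputs are assumed to be the id and score arrays returned by an index search together with the matching metadata lists.

```python
from typing import Any, Dict, List, Sequence, Tuple

Hit = Tuple[float, str, Dict[str, Any]]  # (score, kind, metadata)

def merge_hits(chunk_ids: Sequence[int], chunk_scores: Sequence[float],
               chunk_meta: List[Dict[str, Any]],
               node_ids: Sequence[int], node_scores: Sequence[float],
               node_meta: List[Dict[str, Any]], topk: int = 5) -> List[Hit]:
    """Merge chunk and node search results into one list ranked by score."""
    hits: List[Hit] = []
    for kind, ids, scores, meta in (
        ("chunk", chunk_ids, chunk_scores, chunk_meta),
        ("node", node_ids, node_scores, node_meta),
    ):
        for i, s in zip(ids, scores):
            if i < 0:  # FAISS pads missing results with -1
                continue
            hits.append((float(s), kind, meta[i]))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:topk]
```

The kind field lets downstream code route chunk hits into the local evidence block and node hits into seed selection.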

Step 5 — Query-time retrieval: local + graph

At query time we:

  1. Detect mentioned entities to seed a neighborhood search.
  2. Run semantic search over text chunks.
  3. Expand the graph k steps from seed nodes to collect high-signal edges and nearby nodes.
  4. Retrieve community summaries for any touched communities.
  5. Assemble a structured context with citations and paths.

Retrieval helpers:

# app/retriever.py
import json, re
import networkx as nx
import numpy as np
from typing import List, Dict, Any

MENTION_RX = re.compile(r"[A-Z][A-Za-z0-9_\-]{2,}")

def detect_entities(q: str) -> List[str]:
    # very naive: use spaCy NER in production
    return list(set(MENTION_RX.findall(q)))

def k_hop_neighborhood(G, seeds: List[str], k=2, max_nodes=60):
    # Ignore seeds that are not graph nodes (e.g., "How" from the naive regex)
    visited = {s for s in seeds if s in G}
    frontier = set(visited)
    for _ in range(k):
        nxt = set()
        for u in frontier:
            for _, v in G.out_edges(u): nxt.add(v)
            for v, _ in G.in_edges(u): nxt.add(v)
        frontier = nxt - visited
        visited |= frontier
        if len(visited) > max_nodes: break
    return G.subgraph(visited)

def paths_as_text(SG):
    lines = []
    for u,v,k,d in SG.edges(keys=True, data=True):
        rel = d.get('relation','related')
        ev  = d.get('evidence','')
        lines.append(f"({u}) -[{rel}]-> ({v}); evidence: {ev[:120]}")
    return "\n".join(lines[:80])

def search_faiss(index, query_vec, topk=5):
    D, I = index.search(query_vec, topk)
    return I[0], D[0]

def embed_query(model, q):
    x = model.encode([q], normalize_embeddings=True).astype('float32')
    return x
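
The regex in detect_entities also picks up capitalized question words and misses lowercase mentions. Short of full NER, one option is to match the query against the node names already in the graph. A minimal sketch, assuming node names are plain strings; link_entities is a hypothetical helper, not part of the retriever above.

```python
import re
from typing import Iterable, List

def link_entities(query: str, node_names: Iterable[str]) -> List[str]:
    """Return graph node names that appear in the query as whole words.

    Hypothetical helper: case-insensitive, longest names first so that
    "Service Y" wins over a shorter overlapping name like "Service".
    """
    found: List[str] = []
    remaining = query
    for name in sorted(set(node_names), key=len, reverse=True):
        if not name.strip():
            continue
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, remaining, flags=re.IGNORECASE):
            found.append(name)
            # Blank out the match so shorter names cannot re-match inside it.
            remaining = re.sub(pattern, " ", remaining, flags=re.IGNORECASE)
    return found
```

With a NetworkX graph G, the call would look like link_entities(q, G.nodes). The scan is linear in the number of nodes, so a large graph would need a proper index (a trie or an inverted token index) instead.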

Answer synthesis with a structured prompt:

# app/answer.py
import json
from llm import call_llm
from embed import DualIndex
from index import GraphIndex
from retriever import detect_entities, k_hop_neighborhood, paths_as_text, embed_query, search_faiss

SYNTH_PROMPT = """
You are a careful assistant. Use only the provided context. Cite entities or doc_ids.
Question: {q}

Local evidence (top chunks):
{local_blocks}

Graph paths (k-hop neighborhood):
{graph_paths}

Global summaries (communities):
{global_summaries}

Instructions:
- First list 3-6 key grounded facts with citations.
- Then produce a concise answer.
- Finally, show 1-3 critical paths as bullet points: (A) -[rel]-> (B) -[rel]-> (C).
"""

def build_and_answer(q:str):
    gi = GraphIndex.from_jsonl()
    di = DualIndex(); di.add_chunks(); di.add_nodes(gi)

    qv = embed_query(di.model, q)
    # Local text search
    I, D = search_faiss(di.faiss_chunks, qv, topk=6)
    local_blocks = []
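    for idx in I:
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than topk vectors exist
        meta = di.chunk_meta[idx]
        block = f"- doc={meta['doc_id']} chunk={meta['chunk_id']}"
        # Quote the chunk body when the index kept it; otherwise cite ids only
        if meta.get("text"):
            block += f"\n  {meta['text']}"
        local_blocks.append(block)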

    # Graph neighborhood
    # Keep only mentions that are actual graph nodes
    seeds = [s for s in detect_entities(q) if s in gi.G]
    if not seeds:
        # fall back to the best-matching node profiles
        NI, _ = search_faiss(di.faiss_nodes, qv, topk=3)
        seeds = [di.node_meta[i]['entity'] for i in NI if i >= 0]
    SG = k_hop_neighborhood(gi.G, seeds, k=2, max_nodes=80)
    graph_paths = paths_as_text(SG)

    # Global summaries: collect any communities overlapping SG nodes
    # Here we just load precomputed summaries
    comm_summaries = []
    try:
        with open("build/node_summaries.jsonl","r",encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if any(n in rec["nodes"] for n in SG.nodes):
                    comm_summaries.append(f"- c{rec['community_id']}: {rec['summary']}")
    except FileNotFoundError:
        pass

    prompt = SYNTH_PROMPT.format(q=q,
        local_blocks="\n".join(local_blocks),
        graph_paths=graph_paths,
        global_summaries="\n".join(comm_summaries[:4]))

    ans = call_llm("Grounded answering with rigorous citations.", prompt)
    return ans
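
The summary lookup in build_and_answer rescans the summaries file and tests every subgraph node against every community's node list on each query. A minimal sketch of an alternative that builds a node-to-community lookup once and ranks communities by how many retrieved nodes they contain. The function names load_community_index and summaries_for are hypothetical; the record fields match those written by summarize_communities.

```python
import json
from typing import Dict, Iterable, List, Tuple

def load_community_index(lines: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build node -> community_id and community_id -> summary lookups
    from node_summaries.jsonl records (one JSON object per line)."""
    node_to_comm: Dict[str, int] = {}
    summaries: Dict[int, str] = {}
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        summaries[rec["community_id"]] = rec["summary"]
        for n in rec["nodes"]:
            node_to_comm[n] = rec["community_id"]
    return node_to_comm, summaries

def summaries_for(nodes: Iterable[str], node_to_comm: Dict[str, int],
                  summaries: Dict[int, str], limit: int = 4) -> List[str]:
    """Summaries of communities touched by nodes, most-touched first."""
    counts: Dict[int, int] = {}
    for n in nodes:
        c = node_to_comm.get(n)
        if c is not None:
            counts[c] = counts.get(c, 0) + 1
    # Sort by descending hit count, then by community id for stable output.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return [f"- c{c}: {summaries[c]}" for c in ranked[:limit]]
```

An open file handle can be passed directly as lines, and the two lookups can be kept in memory across queries.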

Step 6 — Running the pipeline

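Run everything from the project root. The modules import each other by bare name (for example, from llm import call_llm), so the python -c commands put app/ on the import path via PYTHONPATH.

  • Ingest: python app/ingest.py
  • Extract triples: python app/extract_graph.py
  • Build graph and summaries: PYTHONPATH=app python -c "from index import GraphIndex; from summarize_graph import summarize_communities; gi = GraphIndex.from_jsonl(); print('nodes', gi.G.number_of_nodes()); summarize_communities(gi.G)"
  • Answer a question: PYTHONPATH=app python -c "from answer import build_and_answer; print(build_and_answer('How does Component X integrate with Service Y?'))"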

Prompting tips that matter

  • Extraction: require JSON with confidence scores. Cap the number of triples per chunk to control cost.
  • Canonicalization: normalize case; map aliases with a small dictionary; merge near-duplicates by Jaccard similarity over name tokens.
  • Summaries: keep them terse and cache them; refresh when the subgraph changes.
  • Answering: separate local facts, graph paths, and global summaries; ask the LLM to cite entities/doc_ids explicitly.
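
To make the canonicalization tip concrete, here is a minimal sketch of merging near-duplicate entity names by Jaccard similarity over whitespace tokens. The threshold of 0.6 and the greedy first-match strategy are illustrative assumptions; merge_near_duplicates is a hypothetical helper.

```python
from typing import Dict, List, Set

def tokens(name: str) -> Set[str]:
    """Lowercased whitespace tokens of an entity name."""
    return set(name.lower().split())

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_near_duplicates(names: List[str], threshold: float = 0.6) -> Dict[str, str]:
    """Greedy single pass: map each name to the first earlier
    representative whose token-set Jaccard similarity >= threshold."""
    reps: List[str] = []
    mapping: Dict[str, str] = {}
    for name in names:
        match = next((r for r in reps
                      if jaccard(tokens(name), tokens(r)) >= threshold), None)
        if match is None:
            reps.append(name)
            match = name
        mapping[name] = match
    return mapping
```

The pass is quadratic in the number of distinct names and order-dependent, which is usually acceptable for a few thousand entities; sorting names by frequency first makes the most common form the representative.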

Evaluation and debugging

  • Faithfulness: check whether cited entities/edges actually exist. Auto-verify by matching answer citations to the graph.
  • Coverage: fraction of gold edges present in retrieved subgraph for a QA set.
  • Latency: break down time spent in embedding search, neighborhood expansion, and LLM calls.
  • Ablations: compare vanilla RAG vs +node paths vs +global summaries.
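
The faithfulness and coverage checks above can be sketched in a few lines, assuming answers cite paths in the (A) -[rel]-> (B) format requested by the synthesis prompt and that edges are compared as exact (subject, relation, object) tuples. The function names are hypothetical, and treating an answer with no cited edges as fully faithful is a convention chosen here, not a standard.

```python
import re
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)

# One step of a path; the lookahead leaves (B) available as the next subject.
PATH_STEP = re.compile(r"\(([^()]+)\)\s*-\[([^\]]+)\]->\s*(?=\(([^()]+)\))")

def cited_edges(answer: str) -> List[Edge]:
    """Extract (A) -[rel]-> (B) steps, including chained paths."""
    return [(s.strip(), r.strip(), o.strip())
            for s, r, o in PATH_STEP.findall(answer)]

def faithfulness(answer: str, graph_edges: Iterable[Edge]) -> float:
    """Fraction of cited edges that exist in the graph (1.0 if none cited)."""
    known: Set[Edge] = set(graph_edges)
    cited = cited_edges(answer)
    if not cited:
        return 1.0
    return sum(e in known for e in cited) / len(cited)

def coverage(gold_edges: Iterable[Edge], retrieved_edges: Iterable[Edge]) -> float:
    """Fraction of gold edges present in the retrieved subgraph."""
    gold = set(gold_edges)
    if not gold:
        return 1.0
    return len(gold & set(retrieved_edges)) / len(gold)
```

For a NetworkX MultiDiGraph, the graph edge tuples could be built as (u, d.get("relation"), v) for each u, v, d in G.edges(data=True). Exact string matching is strict, so entity names should be canonicalized the same way on both sides.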

Debug routines to add:

  • Print top-5 retrieved chunks with scores.
  • Visualize k-hop subgraph with colors per community.
  • Show the shortest paths connecting seed entities, and flag the nodes on them with the highest betweenness centrality.
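
For the latency breakdown mentioned under evaluation, a small timer that accumulates wall-clock time per stage is often enough. This is a minimal sketch using only the standard library; StageTimer and the stage names in the usage note are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict

class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[name] += time.perf_counter() - start

    def report(self) -> str:
        """One line per stage, slowest first, with share of total time."""
        total = sum(self.totals.values()) or 1.0
        rows = sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
        return "\n".join(f"{name:<12} {secs * 1000:8.1f} ms  {secs / total:6.1%}"
                         for name, secs in rows)
```

Usage would look like wrapping each step of the query path, for example with timer.stage("embed"), with timer.stage("expand"), and with timer.stage("llm"), then printing timer.report() after the answer returns.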

Production considerations

  • Storage: Neo4j (with vector indexes) is ideal for large graphs; NetworkX is fine for prototypes.
  • Incremental updates: re-extract triples only for changed chunks; recompute affected subgraphs/communities.
  • Caching: memoize extraction and summaries; use an on-disk key-value store.
  • Guardrails: drop low-confidence triples; require two pieces of evidence for critical edges.
  • Privacy: strip PII during extraction; consider an allowlist of relations.
  • Cost control: batch LLM calls; use smaller models for extraction and bigger ones for final synthesis.
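
Incremental updates and caching can share one mechanism: key the extraction result by a hash of the chunk text, so unchanged chunks never trigger another LLM call. A minimal sketch using SQLite from the standard library as the on-disk key-value store; the class name ExtractionCache, the default file path, and the extract callback are assumptions for illustration.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List

class ExtractionCache:
    """On-disk key-value cache keyed by a hash of the chunk text."""

    def __init__(self, path: str = "build/extract_cache.sqlite") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")

    @staticmethod
    def key_for(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_extract(self, text: str,
                       extract: Callable[[str], List[dict]]) -> List[dict]:
        key = self.key_for(text)
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # unchanged chunk: skip the LLM call
        triples = extract(text)
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(triples)))
        self.db.commit()
        return triples
```

If the extraction prompt or model changes, cached results become stale; folding a prompt or model version string into the hashed key is one way to invalidate them.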

Variations and extensions

  • Path planning: ask the LLM to propose target entity pairs, then compute k-shortest paths in the graph and re-rank by semantic similarity to the question.
  • Temporal GraphRAG: attach timestamps to edges; filter neighborhoods by time range for time-sensitive QA.
  • Heterogeneous graphs: distinct node/edge types (e.g., API, endpoint, product, team) with type-specific prompts.
  • Hybrid scoring: combine vector similarity, PageRank on the subgraph, and relation priors to rank evidence.
  • Structured querying: expose a Cypher tool to the LLM for precise graph lookups when needed.
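
As an illustration of the hybrid scoring idea, the sketch below combines three per-evidence signals with a weighted sum after min-max normalization. It assumes the similarity, PageRank, and relation-prior values were computed elsewhere and are keyed by a shared evidence id. The weights (0.6, 0.3, 0.1) and the names minmax and hybrid_rank are illustrative, not tuned or taken from any library.

```python
from typing import Dict, List, Tuple

def minmax(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}

def hybrid_rank(similarity: Dict[str, float], pagerank: Dict[str, float],
                relation_prior: Dict[str, float],
                weights: Tuple[float, float, float] = (0.6, 0.3, 0.1)
                ) -> List[Tuple[str, float]]:
    """Rank evidence ids by a weighted sum of three normalized signals.

    Missing signals count as 0. The weights are illustrative, not tuned.
    """
    sim, pr, prior = minmax(similarity), minmax(pagerank), minmax(relation_prior)
    w_sim, w_pr, w_prior = weights
    ids = set(sim) | set(pr) | set(prior)
    scored = [(i, w_sim * sim.get(i, 0.0) + w_pr * pr.get(i, 0.0)
               + w_prior * prior.get(i, 0.0)) for i in ids]
    # Highest score first; ties broken by id for deterministic output.
    return sorted(scored, key=lambda x: (-x[1], x[0]))
```

Normalizing each signal first keeps one of them from dominating purely because of its scale; the ablation setup from the evaluation section is a natural way to choose the weights.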

What you should have now

  • A working pipeline that builds a knowledge graph from your documents.
  • Dual indices over chunks and graph entities.
  • A two-level retriever (local + global) and a synthesis prompt that yields grounded, multi-hop answers with citations and paths.

Next steps

  • Swap in your preferred LLM provider in app/llm.py.
  • Replace the toy community detector with Louvain/Leiden.
  • Add a front end (FastAPI + simple UI) and telemetry to log queries, retrieved edges, and answer citations.

By adding graph structure to RAG, you give the model the scaffolding it needs for reliable multi-hop reasoning—while keeping answers grounded, auditable, and maintainable as your corpus grows.
