A Practical Tutorial on Knowledge Graph–Enhanced AI Retrieval (GraphRAG)

Build a production-ready tutorial for knowledge graph–enhanced AI retrieval: schema, ingestion, Cypher, hybrid search, and evaluation.

ASOasis

Overview

Retrieval-augmented generation (RAG) works well for fuzzy, semantic lookups. But when your data has rich structure—people, organizations, documents, events, citations—plain vector search loses context and precision. A knowledge graph (KG) restores that structure and lets you ask topology-aware questions (“Who collaborated with whom?” “What’s the provenance of this claim?”), while still benefiting from embeddings for semantic recall.

This tutorial walks you through building a graph-enhanced retrieval pipeline—often called GraphRAG—from first principles. You’ll design a schema, ingest data, create a vector index, write Cypher queries for graph traversal, and combine the results into a hybrid retriever that feeds your LLM with precise, provenance-rich context.

What you’ll build:

  • A minimal KG with Documents, Chunks, Entities (Person, Organization, Topic), and typed relationships
  • A vector index for semantic recall
  • A hybrid retrieval function that expands from semantic seeds over graph neighborhoods and re-ranks evidence
  • An evaluation harness for faithfulness and coverage

Prerequisites:

  • Python 3.10+
  • A Neo4j database (local or managed)
  • Basic familiarity with embeddings and Cypher

Architecture at a Glance

  • Ingestion: Parse documents → chunk text → extract entities/relations → upsert to Neo4j
  • Indexing: Embed chunks → store vectors in FAISS (or your vector DB of choice) → map chunk_id ↔ vector_id
  • Retrieval: For each question, do semantic recall (vector) → graph expansion (Cypher) → re-rank → assemble citations → generate answer
  • Evaluation: Offline Q/A runs measuring faithfulness, coverage, latency, and context length
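The four stages above can be wired together as a thin pipeline. Everything in this sketch is a stand-in (word overlap instead of embeddings, a hard-coded adjacency map instead of Neo4j, returning context instead of calling an LLM) just to show how the stages hand off to each other:

```python
# Minimal GraphRAG pipeline skeleton. All stages are stand-ins:
# real code would call an embedder, Neo4j, and an LLM instead.

TOY_CHUNKS = {
    "c1": "Ada Lovelace worked with Charles Babbage on the Analytical Engine.",
    "c2": "Charles Babbage designed the Analytical Engine in the 1830s.",
    "c3": "The Jacquard loom inspired punched-card programming.",
}
# Stand-in for graph expansion: chunk -> related chunks via shared entities.
TOY_NEIGHBORS = {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2"]}

def semantic_recall(question: str, k: int = 2) -> list[str]:
    """Stand-in for vector search: rank chunks by word overlap."""
    q = set(question.lower().split())
    scored = sorted(
        TOY_CHUNKS,
        key=lambda cid: len(q & set(TOY_CHUNKS[cid].lower().split())),
        reverse=True,
    )
    return scored[:k]

def expand(seed_ids: list[str]) -> list[str]:
    """Stand-in for a Cypher neighborhood query."""
    out = list(seed_ids)
    for cid in seed_ids:
        for nbr in TOY_NEIGHBORS.get(cid, []):
            if nbr not in out:
                out.append(nbr)
    return out

def assemble_context(chunk_ids: list[str]) -> str:
    """Cite each chunk by id so the final answer is auditable."""
    return "\n".join(f"[c:{cid}] {TOY_CHUNKS[cid]}" for cid in chunk_ids)

def answer(question: str) -> str:
    seeds = semantic_recall(question)
    context = assemble_context(expand(seeds))
    # Real code would send `context` plus the question to an LLM here.
    return context

print(answer("Who worked with Charles Babbage?"))
```

The rest of the tutorial replaces each stand-in with the real component while keeping this shape.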

Step 1: Model a Minimal Schema

Keep the first iteration small and explicit. You can always refine.

Entities (labels):

  • Person(name, aliases)
  • Organization(name, aliases)
  • Topic(name)
  • Document(id, title, url, published_at)
  • Chunk(id, text, doc_id, order, embedding)

Relationships (types):

  • (Document)-[:CONTAINS]->(Chunk)
  • (Chunk)-[:MENTIONS]->(Person|Organization|Topic)
  • (Person)-[:AFFILIATED_WITH]->(Organization)
  • (Document)-[:CITES]->(Document)
  • (Chunk)-[:ABOUT]->(Topic)

Why this works:

  • Chunk nodes hold the raw text you’ll feed to the LLM.
  • Entity nodes capture canonical references and enable disambiguation.
  • Edges preserve provenance and support topological queries (neighbors, paths, communities).

Step 2: Create Constraints and Indexes (Cypher)

Use constraints to keep your graph clean and fast.

// Neo4j 5+ style
CREATE CONSTRAINT person_name IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT org_name IF NOT EXISTS FOR (o:Organization) REQUIRE o.name IS UNIQUE;
CREATE CONSTRAINT topic_name IF NOT EXISTS FOR (t:Topic) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT doc_id IF NOT EXISTS FOR (d:Document) REQUIRE d.id IS UNIQUE;
CREATE CONSTRAINT chunk_id IF NOT EXISTS FOR (c:Chunk) REQUIRE c.id IS UNIQUE;

// Optional text indexes for name lookups
CREATE FULLTEXT INDEX entity_name IF NOT EXISTS
FOR (n:Person|Organization|Topic) ON EACH [n.name, n.aliases];

Step 3: Ingest Documents and Extract Entities

Start with a small corpus (e.g., internal docs, papers, or blog posts). Chunk by semantic boundaries (e.g., headings or ~250–400 tokens).

Python snippet using spaCy for quick NER and simple relation heuristics:

import os, uuid, json
import spacy
from sentence_transformers import SentenceTransformer
from neo4j import GraphDatabase

# Load NLP + embeddings
nlp = spacy.load("en_core_web_sm")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

driver = GraphDatabase.driver(os.getenv("NEO4J_URI"), auth=(os.getenv("NEO4J_USER"), os.getenv("NEO4J_PASSWORD")))

def chunk_text(text, max_chars=1000):
    # Naive sentence-boundary chunking on periods; swap in a proper
    # sentence splitter (or heading-aware chunking) for production.
    parts, buf = [], []
    for sent in text.split('.'):
        if sum(len(s) for s in buf) + len(sent) < max_chars:
            buf.append(sent)
        else:
            if buf:  # guard: a single over-long sentence shouldn't emit an empty part
                parts.append('.'.join(buf).strip() + '.')
            buf = [sent]
    if buf:
        parts.append('.'.join(buf).strip() + '.')
    return [p.strip() for p in parts if p.strip()]

def upsert(tx, query, **params):
    tx.run(query, **params)

def ingest_document(doc):
    # doc = {"id": str, "title": str, "text": str, "url": str|None}
    chunks = chunk_text(doc["text"]) 
    with driver.session() as sess:
        sess.execute_write(upsert, 
            "MERGE (d:Document {id:$id}) SET d.title=$title, d.url=$url",
            id=doc["id"], title=doc["title"], url=doc.get("url"))
        
        for order, text in enumerate(chunks):
            cid = str(uuid.uuid4())
            emb = embedder.encode([text])[0].tolist()
            ents = nlp(text).ents
            sess.execute_write(upsert,
                "MERGE (c:Chunk {id:$id}) SET c.text=$text, c.order=$order, c.doc_id=$doc_id, c.embedding=$emb",
                id=cid, text=text, order=order, doc_id=doc["id"], emb=emb)
            sess.execute_write(upsert,
                "MATCH (d:Document {id:$doc_id}), (c:Chunk {id:$cid}) MERGE (d)-[:CONTAINS]->(c)",
                doc_id=doc["id"], cid=cid)
            
            # Upsert entities + MENTIONS
            for e in ents:
                if e.label_ in ["PERSON", "ORG"]:
                    label = "Person" if e.label_ == "PERSON" else "Organization"
                    sess.execute_write(upsert,
                        f"MERGE (n:{label} {{name:$name}})", name=e.text)
                    sess.execute_write(upsert,
                        # Match by label too, so a Topic with the same name isn't linked by mistake
                        f"MATCH (c:Chunk {{id:$cid}}), (n:{label} {{name:$name}}) MERGE (c)-[:MENTIONS]->(n)",
                        cid=cid, name=e.text)

Notes:

  • For production, replace simple heuristics with robust IE (e.g., pattern libraries, domain ontologies, distantly supervised relation extraction, or human-in-the-loop review).
  • Keep raw text only on Chunk nodes. Documents carry metadata and provenance.

Step 4: Build a Vector Index

You can store embeddings in Neo4j or use a dedicated vector database. A simple, fast baseline is FAISS.

import faiss
import numpy as np
from neo4j import GraphDatabase

# Build an index from Chunk embeddings stored in Neo4j
vec_dim = 384  # all-MiniLM-L6-v2
index = faiss.IndexFlatIP(vec_dim)
chunk_ids = []

with driver.session() as sess:
    result = sess.run("MATCH (c:Chunk) RETURN c.id as id, c.embedding as emb")
    vectors = []
    for rec in result:
        chunk_ids.append(rec["id"])
        v = np.array(rec["emb"], dtype=np.float32)
        v /= np.linalg.norm(v) + 1e-12
        vectors.append(v)
    mat = np.vstack(vectors)
    index.add(mat)

# Map from FAISS row -> chunk id
id_map = np.array(chunk_ids)
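For small corpora you can skip the FAISS dependency entirely: a brute-force cosine scan over a few thousand normalized vectors is fast enough and trivially correct. A pure-Python sketch (in practice you would feed it the chunk embeddings loaded above):

```python
import math

def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def top_k(query_vec, vectors_by_id, k=5):
    """Brute-force cosine top-k over a {chunk_id: vector} dict."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), cid)
        for cid, v in vectors_by_id.items()
    ]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy 2-d vectors standing in for 384-d chunk embeddings
vecs = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
print(top_k([1.0, 0.05], vecs, k=2))
```

Switch to FAISS (or a vector database) once the scan shows up in your latency budget; the calling code stays the same.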

Step 5: Graph-Aware Retrieval

Combine semantic recall with topology-aware expansion. Here’s a practical pattern:

  1. Semantic seeds: top-k chunks by embedding similarity to the question
  2. Graph expansion: pull neighbors around those seeds—entities, cited documents, related topics, co-mentioned chunks
  3. Candidate assembly: gather chunks reached through the graph that likely carry corroborating evidence
  4. Re-ranking: cross-encoder or LLM scoring to select final N chunks

from sentence_transformers import CrossEncoder

cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def semantic_seeds(question, k=12):
    q = embedder.encode([question])[0].astype(np.float32)
    q /= np.linalg.norm(q) + 1e-12
    D, I = index.search(q.reshape(1, -1), k)
    return id_map[I[0]].tolist()

def graph_expand(seed_chunk_ids, max_candidates=200):
    # Note: Neo4j 5 writes relationship-type alternation as [:A|B], not [:A|:B]
    cypher = """
    UNWIND $ids AS cid
    MATCH (c:Chunk {id:cid})
    // chunks in documents cited by this chunk's document
    OPTIONAL MATCH (c)<-[:CONTAINS]-(:Document)-[:CITES]->(:Document)-[:CONTAINS]->(c2:Chunk)
    // chunks sharing an entity or topic with this chunk
    OPTIONAL MATCH (c)-[:MENTIONS|ABOUT]->()<-[:MENTIONS|ABOUT]-(c3:Chunk)
    WITH c, collect(DISTINCT c2) + collect(DISTINCT c3) AS nbrs
    UNWIND nbrs AS cand
    WITH DISTINCT cand LIMIT $limit
    RETURN cand.id AS id, cand.text AS text
    """
    with driver.session() as sess:
        recs = sess.run(cypher, ids=seed_chunk_ids, limit=max_candidates)
        return [{"id": r["id"], "text": r["text"]} for r in recs]

def rerank(question, candidates, top_n=12):
    pairs = [(question, c["text"]) for c in candidates]
    scores = cross.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

def hybrid_retrieve(question, k_seeds=12, max_cands=200, top_n=12):
    seeds = semantic_seeds(question, k=k_seeds)
    cands = graph_expand(seeds, max_candidates=max_cands)
    # include the seeds themselves to avoid missing high-similarity text
    with driver.session() as sess:
        seed_text = sess.run("MATCH (c:Chunk) WHERE c.id IN $ids RETURN c.id as id, c.text as text", ids=seeds)
        for r in seed_text:
            cands.append({"id": r["id"], "text": r["text"]})
    # dedupe by id
    uniq = {c["id"]: c for c in cands}.values()
    return rerank(question, list(uniq), top_n=top_n)

This function returns a compact set of highly relevant chunks tied together by graph structure. These chunks are ready to be fed as grounded context to your LLM.

Step 6: Structured Queries with Cypher Templates

Many enterprise questions are inherently structured. For repeatable intents, define Cypher templates rather than always relying on natural-language-to-Cypher.

Examples:

  • “Which authors collaborated with [person]?”
MATCH (p:Person {name:$name})<-[:MENTIONS]-(c1:Chunk)<-[:CONTAINS]-(d:Document)-[:CONTAINS]->(c2:Chunk)-[:MENTIONS]->(co:Person)
WHERE co <> p
RETURN co.name AS collaborator, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT 20;
  • “Show documents that cite [document_id] and their topics”
MATCH (d:Document {id:$doc_id})<-[:CITES]-(citing:Document)
OPTIONAL MATCH (citing)-[:CONTAINS]->(:Chunk)-[:ABOUT]->(t:Topic)
RETURN citing.id AS id, citing.title AS title, collect(DISTINCT t.name) AS topics
LIMIT 50;
  • “Top organizations co-mentioned with [topic]”
MATCH (:Topic {name:$topic})<-[:ABOUT]-(c:Chunk)-[:MENTIONS]->(o:Organization)
RETURN o.name AS org, count(*) AS freq
ORDER BY freq DESC LIMIT 20;

You can detect such intents with lightweight patterns or a classifier and route to the right template before falling back to hybrid retrieval.
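A minimal pattern-based router might look like the following sketch; the regexes and template names are illustrative, not a fixed API:

```python
import re

# Illustrative regex patterns mapped to Cypher template names (assumed names).
INTENT_PATTERNS = [
    (re.compile(r"who collaborated with (?P<name>.+?)\??$", re.I), "collaborators"),
    (re.compile(r"(?:what|which) documents cite (?P<doc_id>\S+)", re.I), "citations"),
    (re.compile(r"top organizations .* topic '(?P<topic>[^']+)'", re.I), "orgs_by_topic"),
]

def route(question: str):
    """Return (template_name, params) for a recognized structured intent, else None."""
    for pattern, template in INTENT_PATTERNS:
        m = pattern.search(question)
        if m:
            return template, m.groupdict()
    return None  # no structured match: fall back to hybrid retrieval

print(route("Who collaborated with Ada Lovelace?"))
```

The captured groups become the `$name`/`$doc_id`/`$topic` parameters of the corresponding template; a small classifier can replace the regexes once you have traffic to train on.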

Step 7: Assemble Grounded Prompts for the LLM

Concise, well-cited context prevents hallucinations. Include chunk IDs and document metadata.

def make_prompt(question, evidences):
    header = "You are a careful analyst. Answer using only the EVIDENCE. Cite chunk ids like [c:ID]. If unsure, say you don't know.\n\n"
    ev_txt = []
    for e in evidences:
        ev_txt.append(f"[c:{e['id']}] " + e['text'].strip())
    context = "\n\n".join(ev_txt)
    return f"{header}QUESTION: {question}\n\nEVIDENCE:\n{context}\n\nANSWER:"

You can use any chat completion API. Always track which chunks are used in the final answer for auditability.
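Because the prompt asks the model to cite chunk ids as [c:ID], auditing is a regex away. A small helper (the marker format matches make_prompt above; the field names are just illustrative):

```python
import re

CITE = re.compile(r"\[c:([^\]]+)\]")

def cited_chunks(answer_text: str, provided_ids: set) -> dict:
    """Split the ids an answer cites into known (auditable) vs unknown ids."""
    cited = set(CITE.findall(answer_text))
    return {
        "used": sorted(cited & provided_ids),
        "unknown": sorted(cited - provided_ids),  # cited but never provided: red flag
        "unused": sorted(provided_ids - cited),
    }

report = cited_chunks(
    "Ada worked with Babbage [c:abc] on the Engine [c:xyz].",
    provided_ids={"abc", "def"},
)
print(report)
```

Log the "used" list per answer for audit trails; any "unknown" id means the model fabricated a citation and the answer should be flagged.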

Step 8: Evaluation—Don’t Skip This

Measure the pipeline with a small but representative test set.

Metrics:

  • Faithfulness: percentage of answers fully supported by provided chunks (manual review or automated heuristics)
  • Coverage/Recall: fraction of gold evidence chunks present in the retrieved set
  • Precision: fraction of retrieved chunks that reviewers deem relevant
  • Latency: end-to-end p95
  • Context budget: average tokens used per query

Procedure:

  1. Create ~100–300 question–answer–evidence triples from your domain.
  2. Run baseline vector-only RAG.
  3. Run hybrid graph + vector retrieval.
  4. Compare faithfulness, coverage, and latency; iterate on schema, expansion rules, and re-ranking.
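Coverage and precision reduce to set arithmetic once each question has gold evidence chunk ids; a sketch of the harness core (faithfulness still needs human or LLM-assisted review):

```python
def retrieval_metrics(retrieved: list, gold: set) -> dict:
    """Per-question coverage (recall of gold evidence) and precision."""
    hit = set(retrieved) & gold
    return {
        "coverage": len(hit) / len(gold) if gold else 0.0,
        "precision": len(hit) / len(retrieved) if retrieved else 0.0,
    }

def macro_average(runs: list) -> dict:
    """Average per-question metric dicts across the eval set."""
    keys = runs[0].keys()
    return {k: sum(r[k] for r in runs) / len(runs) for k in keys}

# Toy eval: two questions with known gold evidence
runs = [
    retrieval_metrics(["c1", "c2", "c9"], gold={"c1", "c2", "c3"}),
    retrieval_metrics(["c4"], gold={"c4"}),
]
print(macro_average(runs))
```

Run the same harness over the vector-only baseline and the hybrid retriever so the comparison in step 3 vs step 4 is apples to apples.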

Operational Tips and Pitfalls

  • Canonicalization: Merge duplicate entities. Use lowercase, trimmed names and alias lists. Consider string similarity thresholds (e.g., Jaro–Winkler) under human review.
  • Provenance-first: Every edge should be explainable. Keep source doc IDs and stable references.
  • Depth control: Limit expansion hops to 1–2 to cap latency and noise. Rely on re-ranking rather than deep traversals.
  • Balance recall vs. precision: Start with generous recall (k≈10–20 seeds, 100–300 candidates) then tighten after measuring precision.
  • Incremental updates: Upsert by document ID. Maintain a change log for re-embedding only changed chunks.
  • Safety and privacy: Filter PII; restrict sensitive nodes/edges by role. Apply row-level security where possible.
  • Monitoring: Track query types, cache hit rates, token usage, and any “empty context” failures.
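The canonicalization tip can be prototyped with the standard library: difflib's ratio stands in for Jaro–Winkler here, and the 0.9 threshold is a guess you would tune under human review:

```python
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace before comparing."""
    return " ".join(name.lower().split())

def merge_candidates(names: list, threshold: float = 0.9):
    """Greedily group names whose normalized similarity exceeds the threshold.
    SequenceMatcher.ratio() is a stand-in for Jaro-Winkler; 0.9 is a guess to tune."""
    groups = []
    for name in names:
        norm = normalize_name(name)
        for group in groups:
            if SequenceMatcher(None, norm, normalize_name(group[0])).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

print(merge_candidates(["OpenAI", "openai ", "Open AI Inc", "Neo4j"]))
```

Groups above the threshold become merge candidates for review, not automatic merges; wrongly fused entities are far harder to undo than duplicates.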

Extending the Tutorial

  • Add richer relations: WORKS_AT, AUTHORED, PUBLISHED_IN, LOCATED_IN, DERIVES_FROM.
  • Use a graph algorithm: Personalized PageRank or community detection to identify influential nodes around a query topic.
  • NL-to-Cypher: Train a few-shot prompt or a constrained decoder that emits only allowed labels and relations from your schema.
  • Multi-index retrieval: Blend BM25 keyword search, vector search, and graph paths; then learn weights via logistic regression on your eval set.
  • Reranking upgrades: Try larger cross-encoders or LLM-based re-rankers in a two-stage cascade.
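The multi-index blend reduces to a weighted sum over per-source scores; the weights below are placeholders you would fit (e.g., with logistic regression) on your eval set:

```python
def fuse_scores(per_source: dict, weights: dict) -> list:
    """Combine {source: {chunk_id: score}} into one weighted ranking.
    Scores are assumed pre-normalized to [0, 1] within each source."""
    fused = {}
    for source, scores in per_source.items():
        w = weights.get(source, 0.0)
        for cid, s in scores.items():
            fused[cid] = fused.get(cid, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

ranking = fuse_scores(
    {
        "bm25":   {"c1": 0.9, "c2": 0.2},
        "vector": {"c2": 0.8, "c3": 0.7},
        "graph":  {"c2": 1.0},
    },
    weights={"bm25": 0.3, "vector": 0.5, "graph": 0.2},  # placeholders to learn
)
print(ranking)
```

Here c2 wins despite a weak BM25 score because two other sources agree, which is exactly the agreement signal the blend is meant to capture.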

End-to-End Example

Putting it together for a single question:

question = "Which organizations frequently co-occur with the topic 'Generative AI' and what sources cite them?"
seeds = semantic_seeds(question, k=15)
candidates = graph_expand(seeds, max_candidates=250)
# Add seeds and dedupe
with driver.session() as sess:
    seed_text = sess.run("MATCH (c:Chunk) WHERE c.id IN $ids RETURN c.id as id, c.text as text", ids=seeds)
    for r in seed_text:
        candidates.append({"id": r["id"], "text": r["text"]})
uniq = {c["id"]: c for c in candidates}.values()
final_ctx = rerank(question, list(uniq), top_n=15)
prompt = make_prompt(question, final_ctx)
print(prompt[:1000])  # send to your LLM of choice

Optionally, add a structured pass first:

// Top orgs tied to a topic with supporting docs
MATCH (:Topic {name:$topic})<-[:ABOUT]-(c:Chunk)-[:MENTIONS]->(o:Organization)
WITH o, count(*) AS freq
ORDER BY freq DESC LIMIT 10
MATCH (o)<-[:MENTIONS]-(c2:Chunk)<-[:CONTAINS]-(d:Document)
RETURN o.name AS org, freq, collect(DISTINCT d.id)[..5] AS evidence_docs;

What You’ll Have by the End

  • A working KG-backed retriever that preserves structure and provenance
  • A reproducible ingestion pipeline and vector index
  • Measurable improvements in faithfulness and answerability on structured questions

Next Steps

  • Enrich schema and relations driven by your evaluation findings.
  • Introduce caching for frequent queries (semantic + graph sub-results).
  • Productionize: containerize the retriever, move to managed graph/vector services, add tracing and dashboards.
  • Expand your evaluation set and run regular regressions before each schema or model change.

With these building blocks, you’ve moved from “semantic search plus LLM” to a principled, graph-aware retrieval system that can explain its answers and scale with your domain’s complexity.
