LangChain API Tutorial: From Hello World to Production RAG with FastAPI and LangServe

Build a production-ready LangChain API: LCEL chains, LangServe, FastAPI streaming, RAG, structured outputs, testing, and deployment tips.

ASOasis
7 min read
LangChain API Tutorial: From Hello World to Production RAG with FastAPI and LangServe

Image used for representation purposes only.

Overview

LangChain makes it straightforward to turn large language models (LLMs) into real applications: text summarizers, question-answering bots, and full RAG (Retrieval-Augmented Generation) systems. In this tutorial you’ll build a production-ready HTTP API around LangChain using both LangServe (the fastest path) and a hand-rolled FastAPI route (for maximum control). You’ll also add retrieval, structured outputs, streaming, testing, and deployment tips.

By the end you will:

  • Create a minimal LCEL chain (LangChain Expression Language).
  • Expose it over HTTP with LangServe.
  • Build a custom FastAPI endpoint with streaming.
  • Add RAG using a vector store and embeddings.
  • Return structured JSON with Pydantic models.
  • Ship to production with observability and guardrails.

What We’ll Build

  • A /summarize endpoint that condenses long text.
  • A /rag/ask endpoint that answers questions over your documents.
  • Optional: a /structured endpoint that returns strongly-typed JSON.

We’ll use Python, but the concepts carry over to JS/TS as well.

Prerequisites

  • Python 3.10+
  • An LLM provider key (e.g., OpenAI, Anthropic, or Azure OpenAI). We’ll show OpenAI in code; swap for your provider as needed.
  • Basic command line and HTTP familiarity.

Project Setup

Create and activate a virtual environment, then install dependencies.

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -U 'langchain>=0.2' langchain-openai langserve fastapi uvicorn[standard] tiktoken 'faiss-cpu>=1.7.4' langchain-text-splitters langchain-community pydantic python-dotenv

Set your environment variable(s):

export OPENAI_API_KEY='sk-...'
# Optional observability
export LANGCHAIN_TRACING_V2='true'
export LANGCHAIN_API_KEY='ls_...'

Create a basic app structure:

.
├─ app/
│  ├─ __init__.py
│  ├─ chains.py
│  ├─ rag.py
│  └─ main.py
└─ data/
   └─ docs/  # place your PDFs or text files here

Step 1: A Minimal LCEL Chain

LCEL (LangChain Expression Language) is a composable, lazy pipeline syntax. Build a simple summarizer chain.

# app/chains.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

# Prompt → LLM → Parser
summary_prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a concise assistant. Summarize the user\'s text in 3-4 bullet points.'),
    ('user', '{text}')
])

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.2)

summarize_chain = summary_prompt | llm | StrOutputParser()

Usage in Python:

# Quick test (optional)
if __name__ == '__main__':
    result = summarize_chain.invoke({'text': 'LangChain helps developers build LLM apps with chains, tools, and RAG.'})
    print(result)

Step 2: Serve It in 2 Minutes with LangServe

LangServe automatically mounts chains as HTTP routes with input/output schemas and optional streaming.

# app/main.py
from fastapi import FastAPI
from langserve import add_routes
from app.chains import summarize_chain

app = FastAPI(title='LangChain API')

# Mount at /summarize
add_routes(app, summarize_chain, path='/summarize')

@app.get('/health')
def health():
    return {'status': 'ok'}

Run the server:

uvicorn app.main:app --reload --port 8000

Call it with HTTPie (preferred for readability):

http POST :8000/summarize text='LangChain simplifies building LLM-powered apps.'

Or curl (JSON-escaped example):

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{"text":"LangChain simplifies building LLM-powered apps."}' \
  http://localhost:8000/summarize

Tip: LangServe also supports Server-Sent Events (SSE) for streaming. Append ?stream=true when supported by the chain/model.

Step 3: Build a Custom FastAPI Endpoint (Manual Control)

Sometimes you want custom auth, logging, or streaming format. Here’s a hand-rolled route that uses the same chain.

# app/main.py (continued)
from fastapi import Body
from fastapi.responses import StreamingResponse, JSONResponse
from app.chains import summarize_chain

@app.post('/v1/summarize')
async def summarize_v1(payload: dict = Body(...)):
    text = payload.get('text', '')
    if not text:
        return JSONResponse({'error': 'text is required'}, status_code=400)

    # Non-streaming (simple):
    result = await summarize_chain.ainvoke({'text': text})
    return {'summary': result}

@app.post('/v1/summarize/stream')
async def summarize_stream(payload: dict = Body(...)):
    text = payload.get('text', '')
    if not text:
        return JSONResponse({'error': 'text is required'}, status_code=400)

    async def event_gen():
        async for chunk in summarize_chain.astream({'text': text}):
            # Each chunk is a token/string when the last stage is a StrOutputParser
            yield f'data: {chunk}\n\n'
        yield 'event: end\n\n'

    return StreamingResponse(event_gen(), media_type='text/event-stream')

Step 4: Add Retrieval (RAG)

Let’s index your documents and answer questions grounded in that data.

# app/rag.py
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

DATA_DIR = Path('data/docs')

# 1) Load & split your corpus (replace with proper loaders for PDFs/HTML)
def load_corpus():
    texts = []
    for p in DATA_DIR.rglob('*.txt'):
        texts.append(p.read_text(encoding='utf-8'))
    return texts

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)

# 2) Create embeddings and a vector store
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')

def build_vectorstore():
    raw_docs = load_corpus()
    docs = []
    for t in raw_docs:
        docs.extend(splitter.create_documents([t]))
    return FAISS.from_documents(docs, embeddings)

# Initialize at import (small corpora). For large ones, persist & load.
VECTORSTORE = build_vectorstore()
retriever = VECTORSTORE.as_retriever(search_kwargs={'k': 4})

# 3) RAG chain: question + retrieved context → LLM
rag_prompt = ChatPromptTemplate.from_messages([
    ('system',
     'You answer using only the provided context. If the answer is unknown, say you do not know.\n' \
     'Context:\n{context}'),
    ('user', '{question}')
])

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

def _format_docs(docs):
    return '\n\n'.join(d.page_content for d in docs)

rag_chain = {
    'context': retriever | _format_docs,
    'question': RunnablePassthrough()
} | rag_prompt | llm | StrOutputParser()

Mount the RAG chain via LangServe and add a custom endpoint too:

# app/main.py (continued)
from app.rag import rag_chain

# LangServe route
from langserve import add_routes
add_routes(app, rag_chain, path='/rag/ask')

# Custom JSON contract
@app.post('/v1/rag/ask')
async def rag_ask(payload: dict = Body(...)):
    question = payload.get('question', '')
    if not question:
        return JSONResponse({'error': 'question is required'}, status_code=400)
    answer = await rag_chain.ainvoke(question)
    return {'answer': answer}

Query it:

http POST :8000/v1/rag/ask question='What does our policy say about refunds?'

Step 5: Strongly-Typed JSON with Pydantic

When downstream services consume your API, prefer structured outputs over free-form text.

# app/chains.py (add structured output)
from pydantic import BaseModel, Field
from typing import List
from langchain_openai import ChatOpenAI

class Summary(BaseModel):
    bullets: List[str] = Field(..., description='3-5 concise bullet points')

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
structured_llm = llm.with_structured_output(Summary)

structured_summary_chain = summary_prompt | structured_llm

Expose it:

# app/main.py (continued)
from app.chains import structured_summary_chain
add_routes(app, structured_summary_chain, path='/structured/summarize')

Response shape example:

{
  "bullets": [
    "LangChain composes LLM pipelines via LCEL.",
    "LangServe turns chains into HTTP endpoints.",
    "RAG grounds answers in your data."
  ]
}

Step 6: Testing and Evaluation

  • Unit tests: call your chain’s .invoke with deterministic settings (temperature 0) and compare to regex/patterns.
  • Contract tests: validate schemas for /structured endpoints.
  • RAG evals: create a small set of (question, expected answer, context docs). Score with semantic similarity and exact matches.
  • Load tests: use k6 or Locust to measure latency and throughput under realistic concurrency.

Observability and Tracing

Enable LangSmith (optional) to trace prompts, tokens, and latencies across chains.

  • Set LANGCHAIN_TRACING_V2='true' and LANGCHAIN_API_KEY.
  • Review spans, inputs/outputs, and errors. Use it to diagnose context stuffing, timeouts, or slow retriever calls.

Performance and Cost Tips

  • Use the smallest model that meets quality requirements; reserve larger models for fallback.
  • Stream responses to reduce time-to-first-token and perceived latency.
  • Cache: memoize embeddings for unchanged documents; persist FAISS indexes to disk.
  • Limit retrieved chunks (k) and keep chunk_size reasonable (500–1,000 chars) with some overlap (50–150).
  • Batch embedding jobs and parallelize ingestion.
  • Add timeouts and retries on model calls; circuit-break noisy dependencies.

Security and Reliability

  • Require auth (e.g., an API key header or OAuth2); never expose provider keys to clients.
  • Input validation: enforce max input sizes and supported MIME types.
  • Prompt injection defenses: prepend a strict system message and refuse to execute arbitrary instructions from retrieved text.
  • PII and secrets: scrub logs; consider redaction before indexing.
  • Rate limiting: per-IP and per-token limits; consider a usage quota.
  • Observability: log request IDs, latency, token counts, and outcome (success/error).

Deployment Options

  • Uvicorn/Gunicorn on a VM or container orchestrator (ECS, Kubernetes). Example:
gunicorn 'app.main:app' -k uvicorn.workers.UvicornWorker --workers 4 --bind 0.0.0.0:8000
  • Serverless containers (Cloud Run) or functions (optional with cold start considerations).
  • Edge runtimes for lightweight rerankers and request pre-processing; call your core API for heavy LLM work.

Basic Dockerfile snippet:

FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install -U pip && \
    pip install -U 'langchain>=0.2' langchain-openai langserve fastapi uvicorn[standard] tiktoken 'faiss-cpu>=1.7.4' langchain-text-splitters langchain-community pydantic python-dotenv
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Troubleshooting

  • Import errors: ensure you installed provider-specific packages (e.g., langchain-openai). Restart your interpreter after installs.
  • Token limit errors: reduce chunk_size or retrieved k; switch to models with larger context windows.
  • Slow queries: profile retriever latency, reduce documents, or enable streaming. Cache embeddings and persist FAISS.
  • Non-determinism: set temperature to 0 for tests; pin model names in production.
  • 429/Rate limit: implement exponential backoff and concurrency caps.

Extending This Tutorial

  • Swap providers (Anthropic, Azure OpenAI, Google, Cohere) by changing the LLM and Embeddings classes.
  • Add tools/agents for multi-step workflows (search, code execution, web retrieval) with strict tool schemas.
  • Add reranking (e.g., embeddings cosine + cross-encoder) to improve RAG answer quality.
  • Add guardrails: JSON schema validation, toxicity or PII filters, or post-generation classifiers.

Summary

You built a LangChain-powered API three ways: a minimal LCEL chain, a LangServe-mounted route, and a custom FastAPI endpoint with streaming. Then you layered in RAG, structured outputs, testing, and production concerns. This architecture scales from a single endpoint to a full platform: keep chains modular, measure everything, and ship behind a well-defined HTTP contract.

Related Posts