LangChain API Tutorial: From Hello World to Production RAG with FastAPI and LangServe

Overview

LangChain makes it straightforward to turn large language models (LLMs) into real applications: text summarizers, question-answering bots, and full RAG (Retrieval-Augmented Generation) systems. In this tutorial you’ll build a production-ready HTTP API around LangChain using both LangServe (the fastest path) and a hand-rolled FastAPI route (for maximum control). You’ll also add retrieval, structured outputs, streaming, testing, and deployment tips.

By the end you will:

Create a minimal LCEL chain (LangChain Expression Language).
Expose it over HTTP with LangServe.
Build a custom FastAPI endpoint with streaming.
Add RAG using a vector store and embeddings.
Return structured JSON with Pydantic models.
Ship to production with observability and guardrails.

What We’ll Build

A /summarize endpoint that condenses long text.
A /rag/ask endpoint that answers questions over your documents.
Optional: a /structured endpoint that returns strongly-typed JSON.

We’ll use Python, but the concepts carry over to JS/TS as well.

Prerequisites

Python 3.10+
An LLM provider key (e.g., OpenAI, Anthropic, or Azure OpenAI). We’ll show OpenAI in code; swap for your provider as needed.
Basic command line and HTTP familiarity.

Project Setup

Create and activate a virtual environment, then install dependencies.

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -U 'langchain>=0.2' langchain-openai langserve fastapi uvicorn[standard] tiktoken 'faiss-cpu>=1.7.4' langchain-text-splitters langchain-community pydantic python-dotenv

Set your environment variable(s):

export OPENAI_API_KEY='sk-...'
# Optional observability
export LANGCHAIN_TRACING_V2='true'
export LANGCHAIN_API_KEY='ls_...'

Create a basic app structure:

.
├─ app/
│  ├─ __init__.py
│  ├─ chains.py
│  ├─ rag.py
│  └─ main.py
└─ data/
   └─ docs/  # place your PDFs or text files here

Step 1: A Minimal LCEL Chain

LCEL (LangChain Expression Language) is a composable, lazy pipeline syntax. Build a simple summarizer chain.

# app/chains.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

# Prompt → LLM → Parser
summary_prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a concise assistant. Summarize the user\'s text in 3-4 bullet points.'),
    ('user', '{text}')
])

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.2)

summarize_chain = summary_prompt | llm | StrOutputParser()

Usage in Python:

# Quick test (optional)
if __name__ == '__main__':
    result = summarize_chain.invoke({'text': 'LangChain helps developers build LLM apps with chains, tools, and RAG.'})
    print(result)

Step 2: Serve It in 2 Minutes with LangServe

LangServe automatically mounts chains as HTTP routes with input/output schemas and optional streaming.

# app/main.py
from fastapi import FastAPI
from langserve import add_routes
from app.chains import summarize_chain

app = FastAPI(title='LangChain API')

# Mount at /summarize
add_routes(app, summarize_chain, path='/summarize')

@app.get('/health')
def health():
    return {'status': 'ok'}

Run the server:

uvicorn app.main:app --reload --port 8000

Call it with HTTPie (preferred for readability):

http POST :8000/summarize text='LangChain simplifies building LLM-powered apps.'

Or curl (JSON-escaped example):

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '{"text":"LangChain simplifies building LLM-powered apps."}' \
  http://localhost:8000/summarize

Tip: LangServe also supports Server-Sent Events (SSE) for streaming. Append ?stream=true when supported by the chain/model.

Step 3: Build a Custom FastAPI Endpoint (Manual Control)

Sometimes you want custom auth, logging, or streaming format. Here’s a hand-rolled route that uses the same chain.

# app/main.py (continued)
from fastapi import Body
from fastapi.responses import StreamingResponse, JSONResponse
from app.chains import summarize_chain

@app.post('/v1/summarize')
async def summarize_v1(payload: dict = Body(...)):
    text = payload.get('text', '')
    if not text:
        return JSONResponse({'error': 'text is required'}, status_code=400)

    # Non-streaming (simple):
    result = await summarize_chain.ainvoke({'text': text})
    return {'summary': result}

@app.post('/v1/summarize/stream')
async def summarize_stream(payload: dict = Body(...)):
    text = payload.get('text', '')
    if not text:
        return JSONResponse({'error': 'text is required'}, status_code=400)

    async def event_gen():
        async for chunk in summarize_chain.astream({'text': text}):
            # Each chunk is a token/string when the last stage is a StrOutputParser
            yield f'data: {chunk}\n\n'
        yield 'event: end\n\n'

    return StreamingResponse(event_gen(), media_type='text/event-stream')

Step 4: Add Retrieval (RAG)

Let’s index your documents and answer questions grounded in that data.

# app/rag.py
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

DATA_DIR = Path('data/docs')

# 1) Load & split your corpus (replace with proper loaders for PDFs/HTML)
def load_corpus():
    texts = []
    for p in DATA_DIR.rglob('*.txt'):
        texts.append(p.read_text(encoding='utf-8'))
    return texts

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)

# 2) Create embeddings and a vector store
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')

def build_vectorstore():
    raw_docs = load_corpus()
    docs = []
    for t in raw_docs:
        docs.extend(splitter.create_documents([t]))
    return FAISS.from_documents(docs, embeddings)

# Initialize at import (small corpora). For large ones, persist & load.
VECTORSTORE = build_vectorstore()
retriever = VECTORSTORE.as_retriever(search_kwargs={'k': 4})

# 3) RAG chain: question + retrieved context → LLM
rag_prompt = ChatPromptTemplate.from_messages([
    ('system',
     'You answer using only the provided context. If the answer is unknown, say you do not know.\n' \
     'Context:\n{context}'),
    ('user', '{question}')
])

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

def _format_docs(docs):
    return '\n\n'.join(d.page_content for d in docs)

rag_chain = {
    'context': retriever | _format_docs,
    'question': RunnablePassthrough()
} | rag_prompt | llm | StrOutputParser()

Mount the RAG chain via LangServe and add a custom endpoint too:

# app/main.py (continued)
from app.rag import rag_chain

# LangServe route
from langserve import add_routes
add_routes(app, rag_chain, path='/rag/ask')

# Custom JSON contract
@app.post('/v1/rag/ask')
async def rag_ask(payload: dict = Body(...)):
    question = payload.get('question', '')
    if not question:
        return JSONResponse({'error': 'question is required'}, status_code=400)
    answer = await rag_chain.ainvoke(question)
    return {'answer': answer}

Query it:

http POST :8000/v1/rag/ask question='What does our policy say about refunds?'

Step 5: Strongly-Typed JSON with Pydantic

When downstream services consume your API, prefer structured outputs over free-form text.

# app/chains.py (add structured output)
from pydantic import BaseModel, Field
from typing import List
from langchain_openai import ChatOpenAI

class Summary(BaseModel):
    bullets: List[str] = Field(..., description='3-5 concise bullet points')

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
structured_llm = llm.with_structured_output(Summary)

structured_summary_chain = summary_prompt | structured_llm

Expose it:

# app/main.py (continued)
from app.chains import structured_summary_chain
add_routes(app, structured_summary_chain, path='/structured/summarize')

Response shape example:

{
  "bullets": [
    "LangChain composes LLM pipelines via LCEL.",
    "LangServe turns chains into HTTP endpoints.",
    "RAG grounds answers in your data."
  ]
}

Step 6: Testing and Evaluation

Unit tests: call your chain’s .invoke with deterministic settings (temperature 0) and compare to regex/patterns.
Contract tests: validate schemas for /structured endpoints.
RAG evals: create a small set of (question, expected answer, context docs). Score with semantic similarity and exact matches.
Load tests: use k6 or Locust to measure latency and throughput under realistic concurrency.

Observability and Tracing

Enable LangSmith (optional) to trace prompts, tokens, and latencies across chains.

Set LANGCHAIN_TRACING_V2='true' and LANGCHAIN_API_KEY.
Review spans, inputs/outputs, and errors. Use it to diagnose context stuffing, timeouts, or slow retriever calls.

Performance and Cost Tips

Use the smallest model that meets quality requirements; reserve larger models for fallback.
Stream responses to reduce time-to-first-token and perceived latency.
Cache: memoize embeddings for unchanged documents; persist FAISS indexes to disk.
Limit retrieved chunks (k) and keep chunk_size reasonable (500–1,000 chars) with some overlap (50–150).
Batch embedding jobs and parallelize ingestion.
Add timeouts and retries on model calls; circuit-break noisy dependencies.

Security and Reliability

Require auth (e.g., an API key header or OAuth2); never expose provider keys to clients.
Input validation: enforce max input sizes and supported MIME types.
Prompt injection defenses: prepend a strict system message and refuse to execute arbitrary instructions from retrieved text.
PII and secrets: scrub logs; consider redaction before indexing.
Rate limiting: per-IP and per-token limits; consider a usage quota.
Observability: log request IDs, latency, token counts, and outcome (success/error).

Deployment Options

Uvicorn/Gunicorn on a VM or container orchestrator (ECS, Kubernetes). Example:

gunicorn 'app.main:app' -k uvicorn.workers.UvicornWorker --workers 4 --bind 0.0.0.0:8000

Serverless containers (Cloud Run) or functions (optional with cold start considerations).
Edge runtimes for lightweight rerankers and request pre-processing; call your core API for heavy LLM work.

Basic Dockerfile snippet:

FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install -U pip && \
    pip install -U 'langchain>=0.2' langchain-openai langserve fastapi uvicorn[standard] tiktoken 'faiss-cpu>=1.7.4' langchain-text-splitters langchain-community pydantic python-dotenv
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Troubleshooting

Import errors: ensure you installed provider-specific packages (e.g., langchain-openai). Restart your interpreter after installs.
Token limit errors: reduce chunk_size or retrieved k; switch to models with larger context windows.
Slow queries: profile retriever latency, reduce documents, or enable streaming. Cache embeddings and persist FAISS.
Non-determinism: set temperature to 0 for tests; pin model names in production.
429/Rate limit: implement exponential backoff and concurrency caps.

Extending This Tutorial

Swap providers (Anthropic, Azure OpenAI, Google, Cohere) by changing the LLM and Embeddings classes.
Add tools/agents for multi-step workflows (search, code execution, web retrieval) with strict tool schemas.
Add reranking (e.g., embeddings cosine + cross-encoder) to improve RAG answer quality.
Add guardrails: JSON schema validation, toxicity or PII filters, or post-generation classifiers.

Summary

You built a LangChain-powered API three ways: a minimal LCEL chain, a LangServe-mounted route, and a custom FastAPI endpoint with streaming. Then you layered in RAG, structured outputs, testing, and production concerns. This architecture scales from a single endpoint to a full platform: keep chains modular, measure everything, and ship behind a well-defined HTTP contract.

LangChain API Tutorial: From Hello World to Production RAG with FastAPI and LangServe

Overview

What We’ll Build

Prerequisites

Project Setup

Step 1: A Minimal LCEL Chain

Step 2: Serve It in 2 Minutes with LangServe

Step 3: Build a Custom FastAPI Endpoint (Manual Control)

Step 4: Add Retrieval (RAG)

Step 5: Strongly-Typed JSON with Pydantic

Step 6: Testing and Evaluation

Observability and Tracing

Performance and Cost Tips

Security and Reliability

Deployment Options

Troubleshooting

Extending This Tutorial

Summary

Tags

Related Posts

Building and Scaling an AI Image Generator API: Architecture, Costs, and Best Practices

Build an AI App with Flutter: Architecture, Streaming, and Production Best Practices

Alibaba Expands Qwen3 AI Models to Major Developer Platforms

Services

Products

Company

Legal