Build an AI Meeting Transcription API: An End-to-End Tutorial

Build a production-ready AI meeting transcription API with streaming, diarization, summaries, and exports. Code samples and architecture included.

ASOasis

Overview

This hands-on tutorial walks you through building a production-ready AI meeting transcription API. You will design clean REST and WebSocket interfaces, wire in a speech-to-text (ASR) engine, add diarization (who spoke when), and generate meeting summaries and action items with an LLM. We will also cover exports (SRT/VTT), quality evaluation, security, and scaling.

What you will build:

  • Batch transcription endpoint for recorded meetings
  • Real-time streaming transcription over WebSocket for live captions
  • Optional diarization and word-level timestamps
  • Post-processing: punctuation, normalization, and redaction
  • Summaries, topics, action items, and highlights
  • Exports to JSON, SRT, and VTT

This tutorial is provider-agnostic. Plug in the ASR and LLM engines you prefer (cloud or on-prem). Code samples use Python (FastAPI) on the server and browser JavaScript on the client.

Architecture at a Glance

The system has four layers:

  1. Ingestion
    • Batch: storage URL (e.g., cloud object store) or direct upload
    • Streaming: WebSocket sends small audio chunks from the browser or softphone
  2. Processing
    • Decoder: normalize audio to 16 kHz, mono PCM
    • ASR: transcribe (streaming or batch)
    • Diarization: detect speakers and assign segments
    • Post-processing: punctuation, capitalization, profanity masking, PII redaction
  3. Intelligence
    • Summarize and extract topics, action items, decisions, and follow-ups
    • Confidence scores, timestamps, and search-ready JSON
  4. Delivery
    • REST to fetch JSON
    • SRT/VTT for captions
    • Webhooks for job completion

Prerequisites

  • Python 3.10+
  • FastAPI and Uvicorn
  • ffmpeg installed on the server
  • An ASR engine (cloud SDK or local library) with streaming and/or batch support
  • An LLM provider (cloud or local) for summaries and action items

Install base dependencies:

pip install fastapi uvicorn websockets pydantic httpx orjson

Optional libraries you may use behind the adapter interfaces:

  • ASR: faster-whisper, Vosk, cloud SDKs
  • Diarization: pyannote.audio (offline), vendor diarization
  • Redaction: custom regex or NER models

API Design

REST (Batch)

  • POST /v1/transcripts: create a transcription job
  • GET /v1/transcripts/{id}: get job status and results
  • GET /v1/transcripts/{id}.srt: export SRT
  • GET /v1/transcripts/{id}.vtt: export VTT
  • Optional: POST /v1/webhooks/test for integration checks

Request example:

{
  "audio_url": "https://storage.example.com/meetings/abc123.m4a",
  "diarize": true,
  "language": "en",
  "redact_pii": true,
  "summarize": true
}

Response (202 Accepted):

{
  "id": "job_e2c1b4",
  "status": "queued"
}
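From the client's side, the batch flow is create-then-poll. The following is a minimal sketch of the polling half, assuming the endpoints listed above. The names `poll_transcript` and `fetch_job` are hypothetical; the fetch function is injected so the loop can be exercised without a network.

```python
import time
from typing import Callable, Dict

TERMINAL = {"succeeded", "failed"}

def poll_transcript(
    fetch_job: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll GET /v1/transcripts/{id} until the job reaches a terminal status."""
    waited = 0.0
    while True:
        job = fetch_job(job_id)  # expected to return the parsed JSON body
        if job.get("status") in TERMINAL:
            return job
        if waited >= timeout:
            raise TimeoutError(f"{job_id} still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

In real use, `fetch_job` could be something like `lambda jid: httpx.get(f"{BASE_URL}/v1/transcripts/{jid}").json()`, where `BASE_URL` is a placeholder for your deployment. If you enable webhooks, polling becomes a fallback rather than the primary path.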

WebSocket (Streaming)

  • wss://api.example.com/v1/stream
  • Client sends small audio chunks (e.g., 20–60 ms) and control messages
  • Server replies with partial and final segments, including timestamps and speaker labels (if available)

Client-to-server JSON control message:

{ "type": "start", "sample_rate": 16000, "language": "en", "diarize": true }

Server-to-client transcript frame (partial):

{
  "type": "partial",
  "seq": 42,
  "start": 12.36,
  "end": 13.10,
  "text": "we should move the deadline",
  "speaker": "S1",
  "confidence": 0.87
}

Final segments use type: "final". If summarization is enabled, the server sends a type: "summary" frame at the end of the session.
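A caption client usually shows partial text immediately and replaces it once the final version of that segment arrives. Here is a small sketch of that bookkeeping, assuming frames shaped like the example above and assuming (this is not stated by the protocol) that a final frame reuses the `seq` of the partials it supersedes. `CaptionBuffer` is a hypothetical helper name.

```python
import json
from typing import Dict, List, Optional

class CaptionBuffer:
    """Keeps the latest frame per segment; finals overwrite partials."""

    def __init__(self) -> None:
        self._segments: Dict[int, Dict] = {}
        self.summary: Optional[Dict] = None

    def apply(self, raw: str) -> None:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind in ("partial", "final"):
            current = self._segments.get(frame["seq"])
            # Never let a late partial overwrite a segment that is already final.
            if current and current["type"] == "final" and kind == "partial":
                return
            self._segments[frame["seq"]] = frame
        elif kind == "summary":
            self.summary = frame

    def lines(self) -> List[str]:
        """Render the current transcript in time order, one line per segment."""
        ordered = sorted(self._segments.values(), key=lambda f: f["start"])
        return [f'{f.get("speaker", "S?")}: {f["text"]}' for f in ordered]
```

Calling `apply()` for every incoming text message and re-rendering `lines()` is enough for a basic live caption view.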

Implementing the Server (FastAPI)

Define adapter interfaces so you can swap engines without rewriting your API.

# adapters/asr.py
from typing import List, Dict, Any

class ASRStreaming:
    async def start(self, sample_rate: int, language: str | None = None):
        ...
    async def accept(self, pcm16_bytes: bytes) -> List[Dict[str, Any]]:
        """Return zero or more partial segments with timestamps."""
        ...
    async def finalize(self) -> List[Dict[str, Any]]:
        """Return final segments."""
        ...

class ASRBatch:
    async def transcribe(self, wav_path: str, language: str | None = None) -> Dict:
        ...

# adapters/diarization.py
from typing import List, Dict

async def diarize(wav_path: str) -> List[Dict]:
    """Return speaker segments: [{speaker: "S1", start: 0.0, end: 5.2}, ...]"""
    ...

# adapters/llm.py
from typing import Dict, List

SUMMARY_PROMPT = (
    "You are an assistant that writes faithful, concise meeting summaries.\n"
    "Use the transcript to produce: 1) executive summary (5 bullets),\n"
    "2) decisions, 3) action items with owners and dates, 4) risks/blockers.\n"
    "Cite timecodes for each decision/action item when possible.\n"
)

async def summarize(transcript: List[Dict]) -> Dict:
    # Call your preferred LLM here and return a structured dict
    return {"summary": "...", "actions": [], "decisions": []}

Utilities: Audio Normalization

# utils/audio.py
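import os
import subprocess
import tempfile

FFMPEG = "ffmpeg"

def to_wav_pcm16_mono_16k(src_path: str) -> str:
    """Transcode any ffmpeg-readable input to a 16 kHz mono PCM16 WAV file."""
    # mkstemp creates the file safely; tempfile.mktemp is deprecated and race-prone
    fd, dst = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    cmd = [
        FFMPEG, "-y", "-i", src_path,
        "-ac", "1", "-ar", "16000", "-f", "wav", "-acodec", "pcm_s16le",
        dst,
    ]
    subprocess.check_call(cmd)  # blocking; raises CalledProcessError if ffmpeg fails
    return dst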

FastAPI Endpoints (Batch)

# main.py
import asyncio
import os
import uuid
from urllib.parse import urlparse

import httpx
import orjson
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel

from utils.audio import to_wav_pcm16_mono_16k
from adapters.asr import ASRBatch
from adapters.diarization import diarize
from adapters.llm import summarize

app = FastAPI()
JOBS: dict[str, dict] = {}  # in-memory job store; contents are lost on restart

class CreateJob(BaseModel):
    audio_url: str
    diarize: bool = False
    language: str | None = None
    redact_pii: bool = False
    summarize: bool = False

asr_batch = ASRBatch()  # plug your engine

async def download(url: str, path: str):
    async with httpx.AsyncClient() as client:
        r = await client.get(url, follow_redirects=True)
        r.raise_for_status()
        with open(path, "wb") as f:
            f.write(r.content)

# 202 matches the documented response; FastAPI would otherwise return 200
@app.post("/v1/transcripts", status_code=202)
async def create_job(req: CreateJob, bg: BackgroundTasks):
    job_id = f"job_{uuid.uuid4().hex[:6]}"
    JOBS[job_id] = {"id": job_id, "status": "queued"}

    async def worker():
        try:
            tmp = f"/tmp/{job_id}"
            # Take the extension from the URL path so query strings on signed URLs are ignored
            src = tmp + os.path.splitext(urlparse(req.audio_url).path)[1]
            await download(req.audio_url, src)
            # ffmpeg is a blocking call; run it in a thread to keep the event loop free
            wav = await asyncio.to_thread(to_wav_pcm16_mono_16k, src)

            result = await asr_batch.transcribe(wav, req.language)
            segments = result.get("segments", [])

            if req.diarize:
                spk = await diarize(wav)
                segments = align_speakers(segments, spk)

            if req.redact_pii:
                segments = redact(segments)

            output = {"id": job_id, "status": "succeeded", "segments": segments}

            if req.summarize:
                output["intelligence"] = await summarize(segments)

            JOBS[job_id] = output
        except Exception as e:
            JOBS[job_id] = {"id": job_id, "status": "failed", "error": str(e)}

    bg.add_task(worker)
    return {"id": job_id, "status": "queued"}

@app.get("/v1/transcripts/{job_id}")
async def get_job(job_id: str):
    job = JOBS.get(job_id)
    if not job:
        raise HTTPException(404, "not found")
    return job

Support functions used above (speaker alignment and redaction) are sketched next.

from typing import List, Dict

def align_speakers(segments: List[Dict], speakers: List[Dict]) -> List[Dict]:
    # Greedy overlap assignment: pick speaker whose [start,end] overlaps segment midpoint
    def mid(seg):
        return (seg["start"] + seg["end"]) / 2
    out = []
    for seg in segments:
        m = mid(seg)
        spk = next((s["speaker"] for s in speakers if s["start"] <= m <= s["end"]), "S?")
        seg = dict(seg)
        seg["speaker"] = spk
        out.append(seg)
    return out

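import re

# Illustrative patterns only: US SSN, US ZIP (optionally ZIP+4), and 16-digit card numbers.
# Note that the ZIP pattern also matches any standalone 5-digit number.
PII = re.compile(r"(\b\d{3}-?\d{2}-?\d{4}\b|\b\d{5}(?:-\d{4})?\b|\b\d{16}\b)")

def redact(segments: List[Dict]) -> List[Dict]:
    # Mutates the segments in place and returns the same list
    for seg in segments:
        seg["text"] = PII.sub("[REDACTED]", seg["text"])
    return segments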
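Midpoint matching is simple, but it can mislabel a segment that straddles a speaker change or leave it unlabeled when the midpoint falls in a gap between turns. One alternative, sketched below, is to pick the speaker with the largest total time overlap. `align_by_overlap` is a hypothetical name and is not part of the code above; it accepts the same inputs as `align_speakers`.

```python
from collections import defaultdict
from typing import Dict, List

def align_by_overlap(segments: List[Dict], speakers: List[Dict]) -> List[Dict]:
    """Label each ASR segment with the speaker covering most of its duration."""
    out = []
    for seg in segments:
        totals = defaultdict(float)  # speaker label -> seconds of overlap
        for turn in speakers:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > 0:
                totals[turn["speaker"]] += overlap
        labeled = dict(seg)  # copy so the input list is not mutated
        labeled["speaker"] = max(totals, key=totals.get) if totals else "S?"
        out.append(labeled)
    return out
```

This is quadratic in the number of segments and turns, which is usually fine for a single meeting. For very long recordings, sorting both lists and sweeping them together would be the natural optimization.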

WebSocket for Live Transcription

# main.py (continued)
from fastapi import WebSocket, WebSocketDisconnect
from adapters.asr import ASRStreaming

@app.websocket("/v1/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    asr = ASRStreaming()  # one engine session per connection; do not share across sockets
    started = False
    try:
        while True:
            msg = await ws.receive()
            if msg["type"] == "websocket.disconnect":
                break  # calling receive() again after a disconnect raises RuntimeError
            if msg.get("text") is not None:
                data = orjson.loads(msg["text"]) if msg["text"] else {}
                if data.get("type") == "start":
                    await asr.start(data.get("sample_rate", 16000), data.get("language"))
                    started = True
                    await ws.send_text(orjson.dumps({"type": "started"}).decode())
                elif data.get("type") == "end":
                    finals = await asr.finalize() if started else []
                    await ws.send_text(orjson.dumps({"type": "final", "segments": finals}).decode())
                    break
            elif msg.get("bytes") is not None and started:
                # Audio that arrives before "start" is ignored
                parts = await asr.accept(msg["bytes"])  # raw PCM16 samples
                if parts:
                    # Batched variant of the "partial" frame: several segments per message
                    await ws.send_text(orjson.dumps({"type": "partial_batch", "segments": parts}).decode())
    except WebSocketDisconnect:
        pass

Browser Client: Capturing and Streaming Audio

Below is a minimal client that captures mic audio and streams PCM16 frames.

<button id="start">Start</button>
<button id="stop">Stop</button>
<script>
const startBtn = document.getElementById('start');
const stopBtn = document.getElementById('stop');
let ws, audioCtx, processor, source, micStream;

async function start() {
  ws = new WebSocket('wss://api.example.com/v1/stream');
  ws.onopen = () => ws.send(JSON.stringify({type:'start', sample_rate:16000, language:'en'}));

  audioCtx = new (window.AudioContext || window.webkitAudioContext)({sampleRate:16000});
  micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  source = audioCtx.createMediaStreamSource(micStream);
  await audioCtx.audioWorklet.addModule('pcm-worklet.js'); // resolves with no value
  processor = new AudioWorkletNode(audioCtx, 'pcm16-writer');
  processor.port.onmessage = (e) => {
    if (ws.readyState === WebSocket.OPEN) ws.send(e.data); // ArrayBuffer of PCM16 samples
  };
  source.connect(processor).connect(audioCtx.destination);
}

function stop() {
  if (ws && ws.readyState === WebSocket.OPEN) ws.send(JSON.stringify({type:'end'}));
  processor && processor.disconnect();
  source && source.disconnect();
  micStream && micStream.getTracks().forEach((t) => t.stop()); // release the microphone
  audioCtx && audioCtx.close();
}

startBtn.onclick = start;
stopBtn.onclick = stop;
</script>

AudioWorklet for PCM16 frames:

// pcm-worklet.js
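// An AudioWorklet render quantum is 128 samples (8 ms at 16 kHz), so samples are
// buffered into 20 ms frames (320 samples) to match the chunk size the API expects.
const FRAME = 320;

class PCM16Writer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.buf = new Int16Array(FRAME);
    this.n = 0;
  }
  process(inputs) {
    const input = inputs[0][0];
    if (!input) return true;
    for (let i = 0; i < input.length; i++) {
      // Clamp to [-1, 1], then scale to the signed 16-bit range
      const s = Math.max(-1, Math.min(1, input[i]));
      this.buf[this.n++] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      if (this.n === FRAME) {
        this.port.postMessage(this.buf.slice().buffer); // send a copy of the full frame
        this.n = 0;
      }
    }
    return true;
  }
}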
registerProcessor('pcm16-writer', PCM16Writer);

Post-processing and Exports

Generate SRT from segmented transcript:

def to_srt(segments):
    lines = []
    for i, s in enumerate(segments, 1):
        start = srt_time(s['start'])
        end = srt_time(s['end'])
        text = s['text']
        spk = s.get('speaker', '')
        label = f"{spk}: " if spk else ''
        lines.append(f"{i}\n{start} --> {end}\n{label}{text}\n")
    return "\n".join(lines)

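def srt_time(t):
    # Round to whole milliseconds first; truncating the float fraction
    # would turn 12.36 s into 00:00:12,359
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"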

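Add an HTTP route for the SRT export:

from fastapi.responses import PlainTextResponse

# Declare this route before GET /v1/transcripts/{job_id}: routes match in
# declaration order, and {job_id} would otherwise capture "job_x.srt".
@app.get("/v1/transcripts/{job_id}.srt")
async def export_srt(job_id: str):
    job = JOBS.get(job_id)
    if not job or job.get('status') != 'succeeded':
        raise HTTPException(404, "transcript not found or not ready")
    # Return plain text; a bare str would be JSON-encoded (quoted) by FastAPI
    return PlainTextResponse(to_srt(job['segments']), media_type="application/x-subrip")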
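The VTT export listed in the API design follows the same pattern as SRT. Below is a minimal sketch; `vtt_time` and `to_vtt` are hypothetical helper names mirroring the SRT helpers, and speaker labels are written as WebVTT voice tags.

```python
from typing import Dict, List

def vtt_time(t: float) -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before milliseconds)."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02}.{ms:03}"

def _escape(text: str) -> str:
    # Cue text must not contain raw "&" or "<"; ">" is escaped for symmetry.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def to_vtt(segments: List[Dict]) -> str:
    """Render segments as a WebVTT document, one cue per segment."""
    blocks = ["WEBVTT"]
    for seg in segments:
        spk = seg.get("speaker", "")
        text = _escape(seg["text"])
        if spk:
            text = f"<v {spk}>{text}"
        blocks.append(f"{vtt_time(seg['start'])} --> {vtt_time(seg['end'])}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

A route for `/v1/transcripts/{job_id}.vtt` can mirror the SRT route and return this string with the `text/vtt` media type.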

Meeting Intelligence with an LLM

Use prompts that request structured output. A few guardrails help keep that output faithful to the transcript:

  • Ask the model to quote timecodes alongside each action/decision
  • Provide a strict JSON schema and validate responses
  • Include a short transcript excerpt window for each extraction

Example prompt fragment:

System: You extract structured insights from transcripts without inventing facts.
User: Produce JSON with keys: summary, topics[], decisions[], actions[].
Each decision/action must include an array of supporting timecodes.
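Whatever the model returns should be validated before it is stored or shown. The sketch below uses only the standard library and checks the keys named in the prompt. The per-item field name `timecodes` and the choice to express timecodes as seconds are assumptions made for illustration; adapt them to your own schema.

```python
import json
from typing import Any, Dict

REQUIRED = {"summary": str, "topics": list, "decisions": list, "actions": list}

def parse_insights(raw: str) -> Dict[str, Any]:
    """Parse model output and reject anything that violates the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped key: {key}")
    for kind in ("decisions", "actions"):
        for item in data[kind]:
            codes = item.get("timecodes") if isinstance(item, dict) else None
            if not isinstance(codes, list) or not codes:
                raise ValueError(f"{kind} entry lacks supporting timecodes")
            if not all(isinstance(c, (int, float)) and c >= 0 for c in codes):
                raise ValueError(f"{kind} entry has invalid timecodes")
    return data
```

On a `ValueError`, a reasonable policy is to retry the LLM call once with the error message appended, then fall back to returning the transcript without the intelligence block.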

Post-processing the model output:

  • Deduplicate identical action items
  • Normalize owners to canonical names if diarization provides speaker maps
  • Attach deep links to the media player using timecodes
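Two of these steps can be sketched in a few lines. The example assumes each action item is a dict with a `text` key and that your media player accepts a `t` query parameter in whole seconds; both are assumptions for illustration, as are the function names.

```python
import re
from typing import Dict, List
from urllib.parse import urlencode

def dedupe_actions(actions: List[Dict]) -> List[Dict]:
    """Drop action items whose text matches an earlier one after normalization."""
    seen, out = set(), []
    for action in actions:
        # Ignore case and punctuation when comparing
        key = re.sub(r"\W+", " ", action["text"]).strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(action)
    return out

def deep_link(player_url: str, seconds: float) -> str:
    """Build a player link that starts playback at the given timecode."""
    return f"{player_url}?{urlencode({'t': int(seconds)})}"
```

Exact-match deduplication only catches trivially repeated items. Near-duplicates ("send deck" vs. "email the deck to the team") would need fuzzy or embedding-based matching.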

Quality and Evaluation

Track these metrics continuously:

  • Word Error Rate (WER) and Character Error Rate (CER)
  • Diarization Error Rate (DER): missed speech, false alarm, confusion
  • Real-time Factor (RTF): processing_time / audio_duration
  • Time to First Token (TTFT) for streaming captions
  • Hallucination rate for summaries (percentage of unsupported claims)
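WER and RTF are straightforward to compute yourself. The sketch below implements WER as word-level edit distance divided by the number of reference words. It lowercases and splits on whitespace only; real evaluations usually apply more careful text normalization (punctuation, numerals) before scoring.

```python
from typing import List

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Two-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor; values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds
```

For example, scoring the hypothesis "we could move deadline" against the reference "we should move the deadline" counts one substitution and one deletion over five reference words, giving a WER of 0.4.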

Create a small, representative benchmark from real meetings (with consent). Re-run benchmarks for every engine or prompt change.

Audio Tips for Higher Accuracy

  • Prefer 16 kHz mono PCM; avoid overly compressed sources when possible
  • Capture separate tracks per participant if you can; mixdown with channel tags
  • Use voice activity detection (VAD) to drop long silences and reduce cost
  • Enable acoustic echo cancellation on the sender side to limit cross-talk
  • Normalize loudness (e.g., -23 LUFS) for consistent decoding
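To illustrate the VAD idea, here is a deliberately crude energy gate for a single PCM16 frame. It is only a sketch: the -40 dBFS threshold is an arbitrary assumption, and a trained VAD model will handle background noise far better than a fixed energy threshold.

```python
import math
import struct

def is_speech(frame: bytes, threshold_dbfs: float = -40.0) -> bool:
    """Return True if a PCM16 little-endian mono frame is louder than the threshold."""
    n = len(frame) // 2
    if n == 0:
        return False
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    if rms == 0:
        return False  # digital silence; log10(0) is undefined
    level_dbfs = 20 * math.log10(rms / 32768.0)
    return level_dbfs > threshold_dbfs
```

A streaming server could skip `accept()` calls for frames where this returns False, ideally with a short hangover period so that quiet word endings are not clipped.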

Security, Privacy, and Compliance

  • Transport: TLS 1.2+ for all endpoints; use secure WebSocket (wss)
  • Storage: encrypt at rest; restrict buckets with IAM policies
  • Access: OAuth2 or API keys with rotation; least-privilege roles
  • PII: configurable redaction; allow region pinning and data residency
  • Retention: short-lived URLs and automatic deletion schedules
  • Audit: request/response logging with structured metadata (hash, size, duration)

Scaling and Cost Control

  • Use a job queue for batch workloads; autoscale workers on CPU/GPU pools
  • Chunk long audio (e.g., 30–60 s windows) with small overlaps; stitch with timestamps
  • Cache decoded PCM on ephemeral SSD to avoid repeated transcoding
  • Gate expensive LLM calls behind a size threshold or summary-on-demand flag
  • For streaming, cap per-connection bitrate and close idle sockets proactively
  • Pre-warm models to reduce cold starts; pool GPU sessions when possible
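The chunking step above can be sketched as a function that returns overlapping windows. The 30 s window and 2 s overlap defaults are illustrative, and `chunk_windows` is a hypothetical name.

```python
from typing import List, Tuple

def chunk_windows(
    duration: float, window: float = 30.0, overlap: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering [0, duration] with the given overlap."""
    if window <= overlap:
        raise ValueError("window must be longer than overlap")
    out: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        out.append((start, end))
        if end >= duration:
            break
        start = end - overlap  # step back so consecutive windows share audio
    return out
```

When stitching, add each window's start time to the timestamps its ASR call returns, then drop words duplicated inside the overlap region (for example, by keeping only the words whose midpoint falls in the first half of the overlap for the earlier window and the second half for the later one).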

Testing and Monitoring

  • Golden tests: fixed audio snippets with pinned expected outputs
  • Fuzz tests: random audio segments to catch decoder edge cases
  • Synthetics: schedule a test call hourly to validate E2E path
  • Observability: emit spans for decode, ASR, diarization, LLM; track errors by cause
  • SLIs: success rate, P95 TTFT, P95 completion time, queue age, average cost/minute

Putting It All Together

You now have a clear blueprint and reference implementation for an AI meeting transcription API. Start with the skeleton above, plug in your preferred ASR and LLM adapters, and iterate on quality with real-world audio. Focus early on audio normalization, diarization alignment, and robust exports; then layer on summaries and action items with strict validation. Finally, invest in observability and cost controls to ship a reliable service that scales with your users.

Next Steps

  • Implement an actual ASR adapter (e.g., local or cloud)
  • Add speaker name mapping UI so users can rename S1, S2 to real names
  • Extend exports to DOCX/HTML and add a shareable meeting page
  • Build webhook retries with exponential backoff
  • Add semantic search over transcripts using embeddings
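As a starting point for the webhook retry item, the delay schedule can be computed separately from the delivery code. This sketch uses exponential growth with a cap and full jitter; the defaults and the name `backoff_delays` are assumptions, not values prescribed by this tutorial.

```python
import random
from typing import List, Optional

def backoff_delays(
    attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays (seconds) before each retry: exponential, capped, with full jitter."""
    rng = rng or random.Random()
    # Retry n waits a random time in [0, min(cap, base * 2**n)]
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

A delivery loop would sleep for each delay in turn, stop at the first 2xx response, and mark the webhook as failed once the list is exhausted.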
