GraphQL Federation for Microservices: A Practical Architecture Guide

A practical guide to GraphQL federation in microservices—concepts, schema design, gateway patterns, performance, security, and migration strategies.

ASOasis
9 min read
GraphQL Federation for Microservices: A Practical Architecture Guide

Image used for representation purposes only.

Why GraphQL federation exists

Teams adopt microservices to ship features independently, but APIs often lag behind. A single GraphQL server can hide service boundaries, yet it quickly becomes a monolith: one schema, one deployment, one bottleneck. Federation solves this by allowing many teams to contribute parts of a single graph while keeping ownership, deployment, and scaling independent. The client still sees one endpoint; behind the scenes, a router composes and orchestrates multiple subgraphs.

Key outcomes:

  • Preserve domain ownership per team while offering a unified API.
  • Compose a single supergraph schema from separately owned subgraph schemas.
  • Let the router plan requests across services so clients don’t coordinate calls.

Core building blocks

  • Subgraph: A GraphQL service that owns a slice of the schema (types, fields, resolvers). Each subgraph deploys independently.
  • Supergraph (composed schema): The union of all subgraphs plus their relationships. It’s served by a router/gateway.
  • Entities and keys: “Entities” are types referenced across subgraphs, declared with a unique key (e.g., id or upc). A key lets the router reference and join entities across services.
  • Reference resolvers: A subgraph that owns an entity must implement a resolver to load that entity by key when another subgraph references it.
  • Federation directives: Common patterns include:
    • @key(fields: …): Declares the unique fields that identify an entity.
    • @extends and @external: Extend a type in another subgraph; mark external fields used from that type.
    • @requires(fields: …): A field depends on other fields (fetched from the same entity selection) for its computation.
    • @provides(fields: …): A subgraph can supply fields of a related entity it does not own, avoiding extra hops.
  • Composition: A build step validates and merges subgraphs into a single supergraph schema, checking for conflicts, missing keys, and directive semantics.
  • Query planning: The router turns a query into a plan spanning subgraphs (fetch, join, sequence/parallelism), executes the plan, and merges the results.

A minimal federated example (SDL)

Imagine four subgraphs: accounts, products, reviews, and shipping.

Accounts subgraph (owns User):

# accounts.graphql

type User @key(fields: "id") {
  id: ID!
  name: String!
  username: String!
}

Products subgraph (owns Product, links to User):

# products.graphql

type Product @key(fields: "upc") {
  upc: ID!
  name: String!
  price: Int
  weight: Int
  createdBy: User @provides(fields: "username")
}

extend type User @key(fields: "id") {
  id: ID! @external
  username: String! @external
}

Reviews subgraph (owns Review, extends Product):

# reviews.graphql

type Review @key(fields: "id") {
  id: ID!
  body: String!
  author: User @provides(fields: "username")
  product: Product
}

extend type Product @key(fields: "upc") {
  upc: ID!
  reviews: [Review]!
}

Shipping subgraph (computes a field that needs other fields):

# shipping.graphql

extend type Product @key(fields: "upc") {
  upc: ID!
  weight: Int @external
  price: Int @external
  shippingEstimate: Int @requires(fields: "weight price")
}

Sample query against the supergraph:

query ProductDetail($upc: ID!) {
  product(upc: $upc) {
    name
    price
    shippingEstimate
    createdBy { username }
    reviews { body author { username } }
  }
}

At runtime, the router plans the query to fetch Product from products, compute shippingEstimate in shipping (using price and weight), and fetch reviews from reviews. Because createdBy and author both provide username, the plan may avoid extra calls to accounts when only username is selected.

Request lifecycle in a federated graph

  1. Client sends an operation to the router.
  2. Router validates against the composed supergraph.
  3. Query planner produces a federated plan: which subgraphs to hit, in what order, and how to join by keys.
  4. Router executes sub-requests, in parallel where possible, injecting and resolving entity references.
  5. Router merges responses, handles errors/partial data, applies policies (caching, limits), and returns a single JSON payload.

Federation vs. schema stitching

Schema stitching merges schemas but typically relies on manual resolver wiring and lacks standardized entity semantics. Federation focuses on:

  • Declarative ownership and entity identity via @key.
  • Automated composition and query planning.
  • Team autonomy with well-defined extension points (@extends/@external).

Stitching can work for smaller setups; federation scales better with many teams and cross-service references.

Designing a healthy supergraph

  • Stable entity boundaries: Choose keys that are unique, stable, and globally meaningful (e.g., user id, product upc). Avoid composite keys that can change.
  • Ownership clarity: Each entity type should have a single owning subgraph. Other subgraphs may extend it but not redefine ownership.
  • Minimize cross-subgraph hops: Design fields to avoid deep chains of requires/extends. Model frequently co-accessed data together.
  • Nullability discipline: Non-null means you can always deliver. In distributed systems, that is a promise—use it sparingly.
  • Pagination consistency: Prefer cursor-based connections for lists. Keep arguments consistent (first/after, last/before) across subgraphs.
  • Enumerations and scalars: Centralize common scalars (DateTime, Money) and enums to avoid drift. Consider a shared utilities subgraph for platform types.
  • Directives as contracts: Use @provides and @requires where they reduce hops; document the guarantee so teams understand expectations.
  • Deprecation over deletion: Mark fields deprecated, announce timelines, and remove after clients migrate. Federation makes breakages more costly.

Performance patterns

  • Prevent N+1 with batching: Use request-scoped loaders in subgraphs to batch by key (e.g., DataLoader). Federation can easily create N+1 at service boundaries if each reference resolver hits a database individually.
  • Co-locate hot fields: If a field is always fetched with another, consider moving or duplicating it with @provides to avoid extra round-trips.
  • Limit query cost: Apply depth/alias/complexity limits and timeouts at the router. Consider per-client budgets.
  • Cache wisely:
    • Router-level result cache for public, cacheable queries.
    • Subgraph response caching by key (short TTLs) to absorb spikes.
    • Persisted queries and CDN caching when possible.
  • Stream and defer: For large lists or slow, optional fields, use @defer and @stream to improve time-to-first-byte without sacrificing completeness.
  • Plan-aware observability: Track average plan depth, number of subgraph hops, and share of time per subgraph. Optimize the biggest joins first.

Security and authorization

  • Authentication at the edge: Verify a token (e.g., JWT) at the router and propagate claims in request context to subgraphs.
  • Authorization split:
    • Centralized: Gateway enforces coarse rules; subgraphs trust claims for fine-grained checks.
    • Decentralized: Each subgraph performs its own checks, verifying tokens/claims.
    • Hybrid: Gateway enforces global policies (PII access), subgraphs enforce domain rules.
  • Field-level guards: Use schema directives or middleware for rules like “only owner can read email.” Keep rules close to type definitions for clarity.
  • Operation safelists: Register allowed operations and block ad-hoc queries in production.
  • Query hardening: Depth/alias/complexity limits, bounded list sizes, cost-based timeouts, and error redaction.
  • Privacy and caching: Mark sensitive data as non-cacheable; segregate public/private responses. Avoid leaking secrets via error messages.
  • Limit introspection in production (or restrict to authenticated/admin contexts) to reduce surface area.

Reliability and failure modes

  • Timeouts and circuit breakers: Bound latency per subgraph to protect the router. Tripping a breaker should surface partial data with clear errors.
  • Partial data strategy: Prefer nullable fields to allow degraded responses. Define user-facing fallback text for optional fields.
  • Retry budgets: Limited, jittered retries for idempotent subgraph operations. Avoid retry storms.
  • Backpressure and pool sizing: Cap concurrent sub-requests per subgraph to prevent overload.
  • Stale-on-failure: For read-heavy fields, a brief stale cache can hide transient outages.
  • Schema health: Composition should fail fast on breaking changes. Keep contract tests that assert cross-subgraph assumptions.

Tooling and ecosystem (select examples)

  • Routers/gateways: High-performance routers exist for Node.js and Rust; some are open source and others managed. Choose based on throughput, latency, support model, and operational fit.
  • Server frameworks: Many GraphQL servers support federation in Node.js, TypeScript, .NET, Go, and Java ecosystems. Verify support for federation directives and reference resolvers.
  • Schema registry and composition: Use a registry to publish subgraph schemas, validate composition in CI, and manage rollout policies.
  • Operation registry/metrics: Track which clients run which operations; attach SLIs/SLOs per operation and per subgraph.

Local development and testing

  • Compose locally: Keep a script that pulls subgraph SDLs and composes the supergraph; fail the build on composition errors.
  • Contract tests: For every field that uses @provides/@requires, write tests that assert the promise. E.g., if products provides User.username, test that requesting username does not require accounts.
  • Mock subgraphs: Stand up lightweight mocks for dependent subgraphs to speed development.
  • Federated integration tests: Run end-to-end tests against a local router and a subset of real subgraphs. Seed data for predictable keys.
  • Ephemeral environments: Spin up per-PR composed graphs with production-like settings to catch plan regressions early.

Example TypeScript sketch for an entity reference resolver:

// accounts/resolvers.ts
import DataLoader from 'dataloader'

export function createContext(req) {
  return {
    userById: new DataLoader(async (ids) => fetchUsersByIds(ids)),
    auth: req.auth, // claims from the router
  }
}

export const resolvers = {
  User: {
    __resolveReference: (ref, ctx) => ctx.userById.load(ref.id),
  },
}

Deployment patterns

  • Central router: One global router layer in front of all subgraphs. Simplest to operate; add regional replicas for latency.
  • Edge routers: Push the router to the CDN/edge for lower latency and better caching. Ensure secure, low-latency links to subgraphs.
  • Multi-region: Deploy subgraphs regionally with data locality. Use entity keys that are region-agnostic. Consider read replicas for joins.
  • Rollouts: Blue/green or canary deploys per subgraph. Compose the supergraph with the new subgraph schema before shifting traffic.
  • Versioning the supergraph: Treat the composed schema as an artifact. Tag releases and roll back quickly if plan metrics degrade.

Migration playbooks

From REST microservices:

  • Start schema-first: Model core entities and queries that reflect product language, not service boundaries.
  • Wrap reads first: Implement query fields that aggregate multiple REST calls behind a single GraphQL operation.
  • Introduce entities: Add @key and reference resolvers so lists and details can join cleanly across domains.
  • Iterate to mutations: Add write paths after you stabilize read patterns and auth rules.

From a monolithic GraphQL server:

  • Carve by domain: Extract types and resolvers into subgraphs that own entities (e.g., accounts, catalog, orders).
  • Keep the client contract: Preserve the external schema where possible; use @extends to maintain joins.
  • Gradual cutover: Point the router at a mix of extracted subgraphs and a shrinking monolith subgraph until fully decomposed.

Common pitfalls and anti-patterns

  • Everything is an entity: Only mark cross-subgraph types as entities. Value types (Money, Address) rarely need keys.
  • Deep @requires chains: Each require adds a hop. If you need three or more requires for a hot path, reconsider ownership.
  • Chatty resolvers: Avoid per-item cross-service calls inside list resolvers; batch or prefetch by keys.
  • Leaky boundaries: Do not expose internal service IDs that change across systems. Use global, stable IDs in the schema.
  • Per-client fields: Resist adding fields only one client needs; prefer client-side composition or custom operations for niche needs.
  • Silent breaking changes: Removing fields or changing nullability will fail composition or, worse, break clients. Deprecate and communicate instead.

Launch readiness checklist

  • Composition passes in CI for every subgraph change.
  • Operation registry enabled with safelists in production.
  • Query limits configured (depth, aliases, cost, timeouts).
  • AuthN/Z model documented; claims propagated end-to-end.
  • Observability: Traces include subgraph spans; dashboards track plan depth, subgraph latency, error rates.
  • Performance budgets: Max hops per hot operation; N+1 checks with loaders.
  • Failure drills: Timeouts, partial data behavior, and circuit breakers verified.
  • Documentation: Entity ownership map, schema conventions, and deprecation policy published.

Conclusion

GraphQL federation lets you scale your API and your organization at the same time. It preserves the simplicity of a single graph for clients while giving each team independent control over its slice of the schema and runtime. With clear entity boundaries, disciplined schema design, robust security, plan-aware performance work, and strong observability, a federated supergraph becomes a durable platform for product velocity—without re-centralizing your architecture.

Related Posts