Designing Resilient APIs with the Circuit Breaker Pattern

Learn how the API circuit breaker pattern prevents cascading failures, with design choices, observability, and code examples in Java, .NET, Node.js, and Python.

ASOasis

Why APIs Fail Under Load

Distributed systems fail in creative ways: a slow dependency ties up threads, a database throttles, a third‑party API rate‑limits you, or a network partition makes calls hang. If each service waits synchronously and keeps retrying, the load ripples outward, saturating connections and CPUs. This is how benign hiccups become cascading outages.

The circuit breaker resilience pattern prevents those cascades. It turns certain classes of remote errors into fast, predictable failures, allowing the rest of your system to degrade gracefully while dependencies recover.

What a Circuit Breaker Does

A circuit breaker wraps a remote call and tracks recent outcomes. Based on error rates and latencies, it toggles among three states:

  • Closed: Calls flow normally. Failures are counted inside a rolling window.
  • Open: Calls fail fast without hitting the dependency. This protects resources and avoids queue buildup.
  • Half‑open: After a cool‑down, a limited number of test calls probe the dependency. Success closes the breaker; failures reopen it.

Minimal state machine pseudocode:

state = CLOSED
onCall():
  if state == OPEN and now < nextAttempt: return fail(FastOpen)
  if state == HALF_OPEN and not acquireProbePermit(): return fail(FastOpen)
  result = attemptRemote()
  record(result)
  if state == CLOSED and failureRateExceeds(): state = OPEN; nextAttempt = now + openInterval
  elif state == HALF_OPEN:
     if result.success and probeQuotaSatisfied(): state = CLOSED
     elif result.failure: state = OPEN; nextAttempt = now + openInterval
  return result

onTimer():
  if state == OPEN and now >= nextAttempt: state = HALF_OPEN; resetProbeQuota()
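Assuming a single-threaded caller, the state machine can be fleshed out as a runnable Python sketch. The class and exception names here are illustrative, not from any library:

```python
import time
from collections import deque

class BreakerOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is OPEN."""

class CircuitBreaker:
    """Minimal count-window breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, window=10, failure_rate=0.5, open_interval=30.0,
                 probe_quota=3, clock=time.monotonic):
        self.outcomes = deque(maxlen=window)  # recent results: True = success
        self.failure_rate = failure_rate
        self.open_interval = open_interval
        self.probe_quota = probe_quota
        self.clock = clock
        self.state = "CLOSED"
        self.next_attempt = 0.0
        self.probes_left = 0

    def call(self, fn, *args):
        now = self.clock()
        if self.state == "OPEN":
            if now < self.next_attempt:
                raise BreakerOpenError("failing fast")       # fast open
            self.state, self.probes_left = "HALF_OPEN", self.probe_quota
        if self.state == "HALF_OPEN" and self.probes_left <= 0:
            raise BreakerOpenError("probe quota exhausted")  # gate probes
        try:
            result = fn(*args)
        except Exception:
            self._record(ok=False, now=now)
            raise
        self._record(ok=True, now=now)
        return result

    def _record(self, ok, now):
        self.outcomes.append(ok)
        if self.state == "HALF_OPEN":
            if not ok:                        # any failed probe reopens
                self._trip(now)
            else:
                self.probes_left -= 1
                if self.probes_left == 0:     # every probe passed: close
                    self.state = "CLOSED"
                    self.outcomes.clear()     # forget the bad spell
        elif self.state == "CLOSED":
            full = len(self.outcomes) == self.outcomes.maxlen
            failures = sum(1 for o in self.outcomes if not o)
            if full and failures / len(self.outcomes) > self.failure_rate:
                self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.next_attempt = now + self.open_interval
```

Injecting the clock makes the OPEN interval and HALF_OPEN transition easy to unit-test without sleeping.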

When to Use (and When Not To)

Use a circuit breaker when:

  • You call remote services with variable reliability or capacity.
  • Timeouts and retries alone risk saturating threads, connection pools, or rate limits.
  • A fast, explicit failure enables a safe fallback or graceful degradation.

Avoid or reconsider when:

  • The dependency is fully in‑process (no I/O). Prefer local guards.
  • Calls have non‑idempotent side effects, so probe and retry calls are dangerous.
  • A slow warm‑up period is normal and your client can back off without tripping a breaker (e.g., scheduled batch with passive backpressure).

Design Parameters That Matter

  • Error classification: What counts as a failure? Include timeouts, connection errors, 5xx responses, and often 429/Too Many Requests. Exclude 4xx client errors unless they reflect persistent misconfiguration.
  • Sliding window: Size by count (e.g., last 100 calls) or by time (e.g., last 30 seconds). Time windows adapt better to changing traffic.
  • Thresholds: Typical open conditions are failure rate > 50% with at least N calls observed. Tune N to avoid noisy flips at low volume.
  • Open interval (cool‑down): Start with 30–60 seconds. Too short causes thrashing; too long delays recovery.
  • Half‑open probes: Allow a small concurrent probe quota (1–10). Gate probes to avoid stampedes when dependencies recover.
  • Per‑key granularity: Breakers should be scoped by dependency and often by endpoint/host. Coarse breakers cause unnecessary outages; overly fine breakers are hard to tune.
  • Concurrency isolation: Combine with bulkheads (separate thread pools/connection pools) so a slow service cannot starve others.

Timeouts, Retries, and Breakers: The Right Order

  • Always set an explicit timeout smaller than your end‑to‑end SLO budget. A breaker cannot help if calls never time out.
  • Combine retries with a breaker only if you cap attempts and add jittered backoff; each failed attempt contributes to breaker stats.
  • Place the breaker outermost when you must fail fast; place it inside the retry when a few retries should be allowed to succeed during transient blips.

A pragmatic stack (outer to inner): CircuitBreaker -> Retry (backoff + jitter, small max) -> Timeout -> HTTP client.
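The Retry layer of that stack can be sketched as a capped, full-jitter backoff wrapper; the helper name and defaults are illustrative, and the `sleep` parameter is injected so tests need not actually wait:

```python
import random
import time

def retry_with_backoff(call, max_attempts=3, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry `call` a bounded number of times with full-jitter backoff.

    Each failed attempt should also be recorded by the breaker wrapped
    around this function, so retries still move the breaker toward OPEN.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # attempts capped: surface the last error
            # full jitter: sleep a random time in [0, min(cap, base * 2^attempt))
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (a uniformly random delay up to the exponential cap) spreads retries out so concurrent clients do not hammer a recovering dependency in lockstep.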

Fallbacks and Graceful Degradation

When open, respond with a controlled alternative:

  • Serve cached or stale‑while‑revalidate data.
  • Return a synthesized minimal response (e.g., hide recommendations, default prices, or disable noncritical features).
  • Enqueue for later processing if strong consistency is not required.

Make fallbacks explicit and observable. Document the user impact so product stakeholders know what degrades under stress.
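A minimal fallback sketch, assuming a breaker that raises some fast-open exception (`BreakerOpenError`, `get_product`, and the cache shape are all illustrative):

```python
# Assumed: the surrounding breaker raises this when failing fast.
class BreakerOpenError(Exception):
    pass

CACHE: dict[str, dict] = {}

def get_product(pid: str, live_call):
    """Serve live data when possible; degrade explicitly when the breaker is open."""
    try:
        data = live_call(pid)
        CACHE[pid] = data                            # refresh cache on success
        return {**data, "stale": False}
    except BreakerOpenError:
        if pid in CACHE:                             # stale-but-usable copy
            return {**CACHE[pid], "stale": True}
        # synthesized minimal response: core fields only, features disabled
        return {"id": pid, "recommendations": [], "stale": True}
```

The explicit `stale` flag is what makes the degradation observable: callers and dashboards can count stale responses instead of silently serving them.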

Observability and Tuning

Expose metrics and events:

  • State transitions: opened, half‑opened, closed.
  • Failure rate and slow‑call rate.
  • Rejected calls due to OPEN state.
  • Latency percentiles (p50/p95/p99) of successful calls.
  • Probe successes/failures.

Dashboards should correlate breaker events with upstream error rates and downstream saturation (CPU, connection pools). Alert on sustained OPEN states and rising fallback rates. Use logs to capture sample payloads and error causes for triage.

Implementation Examples

Java (Resilience4j)

import io.github.resilience4j.circuitbreaker.*;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.*;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)
    .slowCallRateThreshold(50f)
    .slowCallDurationThreshold(Duration.ofMillis(800))
    .minimumNumberOfCalls(50)
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED)
    .slidingWindowSize(30) // seconds
    .waitDurationInOpenState(Duration.ofSeconds(45))
    .permittedNumberOfCallsInHalfOpenState(5)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("catalog-api", cbConfig);
Retry retry = Retry.ofDefaults("catalog-retry");

// Bound each attempt with the HTTP client's own ~1s timeout; Resilience4j's
// TimeLimiter only decorates async CompletionStage calls, so it is omitted here.
Supplier<String> supplier = () -> httpGet("https://svc/catalog/42");
Supplier<String> protectedCall = Decorators.ofSupplier(supplier)
    .withRetry(retry)
    .withCircuitBreaker(breaker) // applied last, so it is outermost and fails fast
    .decorate();

try {
  String body = protectedCall.get();
} catch (CallNotPermittedException open) {
  // Fast open: serve fallback
  return cachedCatalog("42");
}

Key points:

  • Use time‑based windows to adapt to variable traffic.
  • Track slow calls, not just outright failures.
  • Treat CallNotPermittedException as a signal to return fallbacks quickly.

.NET (Polly)

var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(1));
var retry = Policy
  .Handle<HttpRequestException>()
  .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
  .WaitAndRetryAsync(2, attempt => TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
var breaker = Policy
  .Handle<HttpRequestException>()
  .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.TooManyRequests)
  .AdvancedCircuitBreakerAsync(
      failureThreshold: 0.5,                      // open at >50% handled failures...
      samplingDuration: TimeSpan.FromSeconds(30), // ...measured over a 30s window...
      minimumThroughput: 50,                      // ...once at least 50 calls were seen
      durationOfBreak: TimeSpan.FromSeconds(45));

var policyWrap = Policy.WrapAsync(breaker, retry, timeout);

var response = await policyWrap.ExecuteAsync(() => httpClient.GetAsync("/inventory/42"));

Register policies in HttpClientFactory for reuse and to keep breaker state process‑wide per dependency.

Node.js (opossum)

const CircuitBreaker = require('opossum');
const axios = require('axios');

async function getPrice(id) {
  return axios.get(`https://price/api/${id}`, { timeout: 900 });
}

const breaker = new CircuitBreaker(getPrice, {
  errorThresholdPercentage: 50, // open at >50% failures...
  volumeThreshold: 50,          // ...once at least 50 calls were seen
  timeout: 1000,                // opossum's own timeout; keep it above axios's 900ms
  resetTimeout: 45000,          // cool-down before HALF_OPEN
  rollingCountTimeout: 30000    // 30s rolling statistics window
});

breaker.fallback((id) => ({ data: { id, price: cachedPrice(id), stale: true } }));

breaker.on('open', () => logger.warn('price-api breaker OPEN'));
breaker.on('halfOpen', () => logger.info('price-api HALF_OPEN'));
breaker.on('close', () => logger.info('price-api CLOSED'));

const result = await breaker.fire('42');

Python (pybreaker)

import requests
import pybreaker

breaker = pybreaker.CircuitBreaker(
    fail_max=5,       # consecutive failures before opening (count-based, not rate-based)
    reset_timeout=45  # seconds to stay OPEN before a half-open probe
)

@breaker
def fetch_user(uid: str):
    r = requests.get(f'https://users/api/{uid}', timeout=0.9)
    if r.status_code >= 500 or r.status_code == 429:
        # Raise an ordinary exception so the breaker counts a failure;
        # CircuitBreakerError is what the breaker itself raises while OPEN.
        raise RuntimeError(f'upstream error {r.status_code}')
    return r.json()

try:
    user = fetch_user('42')
except (pybreaker.CircuitBreakerError, RuntimeError, requests.RequestException):
    user = cached_user('42')

Note: pybreaker’s default counters are count‑based; add your own time‑window logic or wrap with a rolling window if you need time sensitivity.
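One way to add that time sensitivity is a small rolling window maintained alongside the breaker; a sketch, not pybreaker API (the class name is illustrative, and the clock is injected for testability):

```python
import time
from collections import deque

class RollingFailureWindow:
    """Track outcomes over the last `window_seconds` and report the failure rate."""

    def __init__(self, window_seconds=30.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, ok: bool):
        self.events.append((self.clock(), ok))
        self._evict()

    def failure_rate(self) -> float:
        self._evict()
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def _evict(self):
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
```

Check `failure_rate()` against your threshold before delegating to the count-based breaker, so a burst of old failures ages out instead of tripping it later.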

Testing and Chaos

  • Unit tests: Verify transitions by simulating sequences of successes, timeouts, and 5xx responses.
  • Contract/integration: Point to a stub server that returns controlled delays and statuses to trigger slow‑call thresholds.
  • Load tests: Ramp traffic until the breaker opens; ensure latency for callers stays bounded and fallbacks behave correctly.
  • Chaos experiments: Inject dependency latency and packet loss in staging. Validate that breakers open, retries back off with jitter, and bulkheads prevent thread starvation.

Advanced Tactics

  • Adaptive thresholds: Lower thresholds during incident conditions to fail fast; raise them during high‑availability windows. Feature‑flag the parameters.
  • Per‑tenant breakers: In multi‑tenant systems, isolate noisy neighbors by scoping breakers per tenant or plan tier.
  • Dual thresholds: Combine failure rate and slow‑call rate to catch brownouts that return 200 OK but exceed SLOs.
  • Hedging: For read‑heavy, idempotent operations, send a backup request after the p95 latency. Use sparingly and with rate limits.
  • Coordinated backoff: On 429/503 with Retry‑After, set breaker to OPEN and align reset with the header to avoid thundering herds.

Anti‑Patterns and Gotchas

  • No timeout: A breaker cannot help if calls hang indefinitely. Set strict client timeouts first.
  • Infinite retries: Retries amplify load on a limping dependency. Cap attempts and add jitter.
  • Global breaker: One breaker for “HTTP” will spill failures across unrelated services. Scope by host/endpoint.
  • Hidden fallbacks: Silent degradation confuses operators and product teams. Emit metrics and logs when serving fallbacks.
  • Thrashing half‑open: Allow only a few concurrent probes; otherwise you DDoS a recovering service.
  • Shared thread pool: Without bulkheads, a slow dependency monopolizes threads even when the breaker is OPEN due to queued work.

Rollout Checklist

  1. Inventory external dependencies and set explicit client timeouts per call path.
  2. Define SLOs and what “degraded but acceptable” responses look like.
  3. Choose breaker libraries consistent with your platform.
  4. Start conservative: moderate thresholds, 30–60s open interval, small probe quota.
  5. Add dashboards and alerts for state transitions, rejection counts, and fallback rates.
  6. Run load and chaos tests; tune thresholds and windows from real telemetry.
  7. Gate in production behind a feature flag; roll out per service and per endpoint.
  8. Document behavior so on‑call engineers and product owners understand trade‑offs.

Conclusion

Circuit breakers convert unpredictable remote failures into fast, bounded behavior that protects capacity and user experience. Pair them with explicit timeouts, disciplined retries, and bulkheads. Instrument thoroughly, test under stress, and roll out gradually. Done well, circuit breakers become a quiet guardian—rarely noticed in steady state, invaluable during incidents.
