Building Resilient APIs with NestJS

Every API works perfectly on localhost. The interesting question is what happens at 2 a.m. on a high-traffic day when a downstream payment service starts timing out, the database connection pool is exhausted, and a deploy is rolling out at the same time. That’s where resilience lives, and it’s the part of backend work that separates a demo from a system a business can actually depend on.

The key mental shift is to stop treating resilience as a feature you add at the end. It’s a mindset you start with: assume every dependency you don’t own will eventually fail, be slow, or lie to you, and design so that when it does, your service degrades gracefully instead of falling over and taking everything with it.

Let’s walk through the patterns that matter most, with the NestJS code to back them up.

Never let a slow dependency hang your requests

The fastest way to turn one slow downstream into a full outage is to wait on it forever. A single hung request holds a connection, and under load those pile up until your service stops responding entirely. The fix is a hard timeout on every request, applied globally with an interceptor:

import {
  Injectable, NestInterceptor, ExecutionContext, CallHandler,
  RequestTimeoutException,
} from "@nestjs/common";
import { Observable, throwError, TimeoutError } from "rxjs";
import { catchError, timeout } from "rxjs/operators";

@Injectable()
export class TimeoutInterceptor implements NestInterceptor {
  intercept(_: ExecutionContext, next: CallHandler): Observable<any> {
    return next.handle().pipe(
      // Give up on any handler that has not produced a response within 5 seconds.
      timeout(5000),
      // Map the RxJS timeout to an HTTP 408; rethrow every other error unchanged.
      catchError((err) =>
        err instanceof TimeoutError
          ? throwError(() => new RequestTimeoutException())
          : throwError(() => err),
      ),
    );
  }
}

Five seconds is a starting point, not a law. The point is that a bounded failure is recoverable; an unbounded one isn’t.
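One caveat worth spelling out: the interceptor bounds how long the client waits, but for promise-based handlers it does not cancel outbound work the handler already started. A minimal sketch of a per-call timeout that also cancels, assuming your outbound client accepts an AbortSignal (fetch does). The names withTimeout and CallTimeoutError are hypothetical, not part of NestJS:

```typescript
// Hypothetical helper, not part of NestJS: bounds one outbound call and
// aborts it when the deadline passes.
class CallTimeoutError extends Error {
  constructor(ms: number) {
    super(`Call exceeded ${ms} ms`);
    this.name = "CallTimeoutError";
  }
}

function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // Reject first so the caller sees the timeout, then cancel the real work.
      reject(new CallTimeoutError(ms));
      controller.abort();
    }, ms);
    work(controller.signal).then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Usage sketch (the URL is yours to supply):
// const res = await withTimeout((signal) => fetch(url, { signal }), 3000);
```

Passing the signal through is what makes the cancellation real: without it, the timed-out call keeps its socket open in the background.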

Fail fast when a dependency is already down

Retries are tempting, but naive retries make outages worse: you hammer a struggling service with even more traffic until it stays down. The mature pattern is a circuit breaker. After a threshold of failures it “trips,” stops calling the broken dependency for a cooldown period, and serves a fallback instead. A battle-tested choice for this is opossum:

import CircuitBreaker from "opossum";
// Hypothetical module: callInventoryService stands for your own async
// function that takes a SKU and calls the inventory service.
import { callInventoryService } from "./inventory.client";

const options = {
  timeout: 3000,                 // give up on a single call after 3s
  errorThresholdPercentage: 50,  // trip the breaker once half the calls fail
  resetTimeout: 10000,           // wait 10s before testing the waters again
};

const breaker = new CircuitBreaker(callInventoryService, options);

// When inventory is down, return something useful instead of an error wall.
breaker.fallback(() => ({ stock: "unknown", degraded: true }));

export const getStock = (sku: string) => breaker.fire(sku);
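To make the trip, cooldown, and trial-call cycle concrete, here is a hand-rolled sketch of the state machine. It is for illustration only and is not how opossum is implemented: it counts consecutive failures rather than a rolling error percentage, and every name in it (TinyBreaker, failureThreshold, resetTimeoutMs) is hypothetical.

```typescript
// Illustrative only: a consecutive-failure breaker, simpler than a
// rolling error percentage. All names here are hypothetical.
type BreakerState = "closed" | "open" | "half-open";

class TinyBreaker<A extends unknown[], R> {
  state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly action: (...args: A) => Promise<R>;
  private readonly fallback: (...args: A) => R;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    action: (...args: A) => Promise<R>,
    fallback: (...args: A) => R,
    opts: { failureThreshold: number; resetTimeoutMs: number; now?: () => number },
  ) {
    this.action = action;
    this.fallback = fallback;
    this.failureThreshold = opts.failureThreshold;
    this.resetTimeoutMs = opts.resetTimeoutMs;
    this.now = opts.now ?? Date.now; // injectable clock keeps the sketch testable
  }

  async fire(...args: A): Promise<R> {
    if (this.state === "open") {
      // Still cooling down: skip the dependency entirely.
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return this.fallback(...args);
      }
      // Cooldown over: let a trial call through.
      this.state = "half-open";
    }
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.state = "closed"; // a success closes the circuit again
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call, or too many failures in a row, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return this.fallback(...args);
    }
  }
}
```

The sketch lets every concurrent caller through during the half-open phase, which a production breaker would limit. That kind of detail is a good reason to reach for a maintained library rather than rolling your own.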

The mindset shift here is important: a circuit breaker protects both sides. It shields the downstream from a stampede, and it shields your own service from drowning in calls that were never going to succeed. Pair it with exponential backoff and jitter on the retries you do keep, and only retry operations that are safe to repeat.
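Backoff with jitter is easier to reason about once you see the arithmetic. The following is a sketch of the "full jitter" variant, where each retry waits a random time between zero and an exponentially growing ceiling. The helper names and the default numbers (100 ms base, 5 s cap, 3 attempts) are assumptions chosen for illustration:

```typescript
// Hypothetical helpers illustrating "full jitter" exponential backoff.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, and so on
  baseMs = 100,
  capMs = 5000,
  random: () => number = Math.random,
): number {
  // The ceiling doubles with each attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  // Pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}

// Only pass operations that are safe to repeat (reads, idempotent writes).
async function retryIdempotent<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Sleep only when another attempt is coming.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Keeping maxAttempts small matters as much as the delay: a retry budget of two or three bounds the extra load you can add to a dependency that is already struggling.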

Reject bad input at the door, and never leak internals

A resilient API is also a strict one. Validate everything at the edge with DTOs and a global pipe, so malformed requests never reach your business logic:

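// main.ts
import { ValidationPipe } from "@nestjs/common";

// whitelist strips undeclared properties, forbidNonWhitelisted rejects them
// with a 400 instead, and transform turns payloads into DTO class instances.
app.useGlobalPipes(
  new ValidationPipe({ whitelist: true, forbidNonWhitelisted: true, transform: true }),
);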
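The pipe only enforces rules that a DTO declares, so it helps to see one. This is a sketch with a hypothetical CreateOrderDto and OrdersController. The field names are illustrative, and it assumes the class-validator and class-transformer packages are installed, since ValidationPipe relies on them:

```typescript
import { Body, Controller, Post } from "@nestjs/common";
import { IsInt, IsNotEmpty, IsString, Min } from "class-validator";

// Hypothetical DTO: the fields are illustrative, not from a real schema.
export class CreateOrderDto {
  @IsString()
  @IsNotEmpty()
  sku!: string;

  @IsInt()
  @Min(1)
  quantity!: number;
}

@Controller("orders")
export class OrdersController {
  @Post()
  create(@Body() dto: CreateOrderDto) {
    // By the time this runs, dto has passed every rule declared above.
    return { accepted: true, sku: dto.sku, quantity: dto.quantity };
  }
}
```

With the global pipe in place, a body such as { "sku": "", "quantity": 0 } or one carrying an undeclared property is rejected with a 400 before create() is ever called.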

Just as important, your errors should have one consistent, safe shape. A leaked stack trace is both a security risk and a debugging nightmare for whoever consumes your API. A catch-all exception filter solves it:

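import {
  ArgumentsHost, Catch, ExceptionFilter, HttpException,
} from "@nestjs/common";
import { Response } from "express";

@Catch() // no arguments: handle every exception, not only HttpException
export class AllExceptionsFilter implements ExceptionFilter {
  catch(exception: unknown, host: ArgumentsHost) {
    const res = host.switchToHttp().getResponse<Response>();
    // Anything that is not an HttpException is an unexpected failure: report 500.
    const status =
      exception instanceof HttpException ? exception.getStatus() : 500;

    res.status(status).json({
      statusCode: status,
      // Never echo the message of a 500; it may describe server internals.
      message: status === 500 ? "Internal server error" : (exception as HttpException).message,
      timestamp: new Date().toISOString(),
    });
  }
}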

Clients get a predictable contract, and your 500s never spill the contents of your server into the response body.
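None of these pieces do anything until they are registered. One way to wire the pipe, the TimeoutInterceptor, and the AllExceptionsFilter globally is sketched below. The file paths, the AppModule name, and port 3000 are assumptions about your project layout:

```typescript
// main.ts
import { NestFactory } from "@nestjs/core";
import { ValidationPipe } from "@nestjs/common";
// Hypothetical file paths; adjust them to your project layout.
import { AppModule } from "./app.module";
import { TimeoutInterceptor } from "./timeout.interceptor";
import { AllExceptionsFilter } from "./all-exceptions.filter";

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Reject malformed input before it reaches any controller.
  app.useGlobalPipes(
    new ValidationPipe({ whitelist: true, forbidNonWhitelisted: true, transform: true }),
  );
  // Bound every request.
  app.useGlobalInterceptors(new TimeoutInterceptor());
  // Give every error the same safe shape.
  app.useGlobalFilters(new AllExceptionsFilter());

  await app.listen(3000);
}

bootstrap();
```

Components registered this way are created outside the dependency injection container. If a filter or interceptor needs injected services, registering it as a provider with the APP_FILTER or APP_INTERCEPTOR token from @nestjs/core is the usual alternative.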

Survive your own deploys

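This is the pattern that matters most, because it’s where the quiet outages hide: zero-downtime deploys. When your orchestrator sends SIGTERM to roll out a new version, an unprepared NestJS app simply exits and drops every in-flight request on the floor, because Nest does not listen for termination signals by default. Enabling shutdown hooks makes it listen: on SIGTERM the app runs its shutdown lifecycle (onModuleDestroy, beforeApplicationShutdown, onApplicationShutdown) and closes the HTTP server, which gives in-flight requests a chance to finish: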

// main.ts
app.enableShutdownHooks();

Combine that with proper liveness and readiness probes via @nestjs/terminus, so traffic only routes to instances that are actually ready to serve. The orchestrator stops sending requests to a shutting-down pod, the pod finishes what it’s already handling, and users never see a blip. That’s the difference between “we deployed” and “nobody noticed we deployed.”
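A minimal sketch of what those probes can look like with @nestjs/terminus. The route names, the 300 MB heap threshold, and the choice of a memory check are assumptions made for illustration. A real readiness check would usually also ping the dependencies the instance cannot serve without, such as its database:

```typescript
import { Controller, Get, Module } from "@nestjs/common";
import {
  HealthCheck,
  HealthCheckService,
  MemoryHealthIndicator,
  TerminusModule,
} from "@nestjs/terminus";

@Controller("health")
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly memory: MemoryHealthIndicator,
  ) {}

  // Liveness: "is the process alive?" Keep it trivial, so a slow dependency
  // never causes the orchestrator to restart a healthy process.
  @Get("live")
  @HealthCheck()
  live() {
    return this.health.check([]);
  }

  // Readiness: "can this instance take traffic right now?"
  @Get("ready")
  @HealthCheck()
  ready() {
    return this.health.check([
      // Illustrative threshold: report not ready above 300 MB of heap.
      () => this.memory.checkHeap("memory_heap", 300 * 1024 * 1024),
      // Add checks for critical dependencies here, e.g. a database ping.
    ]);
  }
}

@Module({
  imports: [TerminusModule],
  controllers: [HealthController],
})
export class HealthModule {}
```

Point the orchestrator's liveness probe at /health/live and its readiness probe at /health/ready. Keeping the two separate is the design choice that matters: a failed readiness check takes the instance out of rotation, while a failed liveness check gets it restarted.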

The habits that hold it all together

A few non-negotiables worth applying on every production service:

Throttle aggressively:

Use @nestjs/throttler to cap requests per client. A traffic spike, accidental or malicious, should be slowed down at the edge, not allowed to collapse your service.
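A sketch of a global limit, assuming @nestjs/throttler version 5 or later, where the configuration is an array and ttl is expressed in milliseconds (earlier versions used a single object with ttl in seconds). The numbers are illustrative, not a recommendation:

```typescript
import { Module } from "@nestjs/common";
import { APP_GUARD } from "@nestjs/core";
import { ThrottlerGuard, ThrottlerModule } from "@nestjs/throttler";

@Module({
  imports: [
    // Assumes @nestjs/throttler v5+, where ttl is in milliseconds.
    // Illustrative numbers: at most 100 requests per client per 60 seconds.
    ThrottlerModule.forRoot([{ ttl: 60_000, limit: 100 }]),
  ],
  providers: [
    // Apply the guard to every route; exceeding the limit yields HTTP 429.
    { provide: APP_GUARD, useClass: ThrottlerGuard },
  ],
})
export class AppModule {}
```

The guard identifies clients by IP address unless you customize it, so behind a load balancer or reverse proxy it is worth checking that the app sees the real client address rather than the proxy's.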

Make failure observable:

You cannot fix what you cannot see. Structured logs, a correlation ID on every request, and real metrics streamed to a tool like Datadog turn a 2 a.m. incident from guesswork into a five-minute fix. Resilience without observability is just hope.
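A correlation ID is the cheapest of these to add. The sketch below uses Node's AsyncLocalStorage so any log line written while handling a request picks up that request's ID without passing it around by hand. The header name x-correlation-id, the helper names, and the log shape are all assumptions:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Per-request context that follows the request across async boundaries.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Minimal structural types so the sketch does not depend on Express typings.
type IncomingLike = { headers: Record<string, string | string[] | undefined> };
type OutgoingLike = { setHeader(name: string, value: string): void };

// Express-style middleware (hypothetical name).
function correlationIdMiddleware(
  req: IncomingLike,
  res: OutgoingLike,
  next: () => void,
): void {
  // Reuse the caller's ID when present so traces span multiple services.
  const incoming = req.headers["x-correlation-id"];
  const correlationId =
    typeof incoming === "string" && incoming.length > 0 ? incoming : randomUUID();
  res.setHeader("x-correlation-id", correlationId);
  requestContext.run({ correlationId }, next);
}

// Structured log line: one JSON object, always carrying the request's ID.
function logLine(
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    level,
    message,
    correlationId: requestContext.getStore()?.correlationId ?? null,
    timestamp: new Date().toISOString(),
  });
}
```

In a NestJS app on the default Express adapter, something like app.use(correlationIdMiddleware) in main.ts would register it, and logLine would feed whatever logger you already use.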

Default to graceful degradation:

Decide ahead of time which dependencies are critical and which are optional. A reviews widget being down should never block a checkout.
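In code, that decision shows up as which failures are allowed to propagate. A sketch using the checkout example, where the cart is critical and reviews are optional. The function and field names (optionalDep, buildCheckoutView, loadCart, loadReviews) are hypothetical:

```typescript
// Hypothetical helper: resolve an optional dependency to a fallback on failure.
async function optionalDep<T>(
  work: Promise<T>,
  fallback: T,
): Promise<{ value: T; degraded: boolean }> {
  try {
    return { value: await work, degraded: false };
  } catch {
    return { value: fallback, degraded: true };
  }
}

interface CheckoutDeps {
  loadCart: () => Promise<{ items: string[]; totalCents: number }>; // critical
  loadReviews: () => Promise<string[]>; // optional
}

async function buildCheckoutView(deps: CheckoutDeps) {
  // Start both calls at once; only the cart is allowed to fail the request.
  const reviewsPending = optionalDep(deps.loadReviews(), []);
  const cart = await deps.loadCart();
  const reviews = await reviewsPending;
  return { cart, reviews: reviews.value, reviewsDegraded: reviews.degraded };
}
```

Returning a degraded flag alongside the fallback lets the client hide the widget and lets your metrics count how often you are running in degraded mode. An optional dependency can still be slow, so it deserves a timeout as much as a critical one does.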

Wrapping up

Resilient APIs aren’t built by adding a magic library at the end of a sprint. They come from a series of small, deliberate decisions made while you design: bound every wait, fail fast when something’s already broken, validate at the edge, and make sure your own deploys don’t take you down. NestJS gives you clean hooks for most of these (interceptors, pipes, exception filters, lifecycle events) and leaves room to plug in libraries like opossum for the rest, which is a large part of why it scales so well from a side project to a system handling serious traffic.

Build for the 2 a.m. failure, not the localhost demo, and your future on-call self will thank you.