Observability with OpenTelemetry: Distributed Tracing Across Next.js, FastAPI, and Django
Instrument three-tier stacks with OpenTelemetry. Propagate trace context across service boundaries and ship spans to your own collector.
Modern stacks are rarely monolithic. If you're running a Next.js frontend talking to a FastAPI service that delegates work to a Django backend, you've got three runtimes, two languages, and a whole lot of surface area for things to go wrong silently. OpenTelemetry gives you the connective tissue to trace a single user request across all of it—without locking yourself into a vendor's proprietary SDK.
This tutorial walks you through instrumenting all three layers with OpenTelemetry, propagating trace context across service boundaries, and shipping spans to a collector you control. It builds on the structured logging guide for this stack—if you've already got correlated logs in place, distributed tracing is the natural next layer.
Why OpenTelemetry and Not Something Proprietary
I've watched teams burn months migrating away from vendor-specific tracing libraries. OpenTelemetry is the CNCF standard—your instrumentation code stays portable whether you're sending data to Jaeger, Tempo, Honeycomb, or Datadog. Pick your backend later; instrument once.
The other reason: OpenTelemetry is now the officially recommended approach in the Next.js documentation, which means framework-level spans (routing, rendering, data fetching) are available out of the box when you wire up the SDK correctly.
The Architecture We're Instrumenting
Each hop is an HTTP call. The goal is one coherent trace spanning all three services, visible in a single waterfall view. Each service creates child spans parented to the incoming request's span, all sharing the same traceId.
Key Concepts Before You Start
| Term | What it means in practice |
|---|---|
| Trace | The full journey of one request across all services |
| Span | A single unit of work within a trace (e.g. one HTTP handler, one DB query) |
| Context propagation | Passing traceparent/tracestate headers between services so spans link up |
| Collector | A vendor-neutral proxy that receives spans and forwards them to your backend |
| Sampler | Decides which traces to keep — critical for controlling volume at scale |
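To make context propagation concrete, here is a minimal sketch of the traceparent header itself, assuming the W3C Trace Context version 00 layout (version, trace id, parent span id, flags, all lowercase hex). The names SpanContext, format_traceparent, and parse_traceparent are illustrative helpers, not part of any OpenTelemetry package; the SDKs do this for you.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, identifies the calling (parent) span
    sampled: bool  # lowest bit of the flags byte


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a span context into a traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are defined as invalid by the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A receiving service that parses this header reuses trace_id for its own spans and records span_id as their parent, which is what makes the spans link up in one waterfall.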
Step 1: Run a Local Collector and Jaeger
Start with a docker-compose.yml:
otel-collector-config.yaml, a production-ready config including batching and 10% probabilistic sampling (the probabilistic_sampler processor, which is not the same thing as tail sampling):
Local vs production: For local development, set sampling_percentage: 100 so you see every trace. Drop it to 10–20% in staging/production to reduce storage costs without losing statistical coverage.
Run docker compose up -d.
Step 2: Instrument Next.js
Next.js has first-class OpenTelemetry support via its instrumentation file convention. There are two approaches: the quick path with @vercel/otel, and a fully manual SDK setup if you need more control.
Option A — Quick setup with @vercel/otel (recommended)
Create instrumentation.ts in your project root (not inside app/ or pages/):
This is the approach documented by the Next.js team and gives you framework spans for routing, rendering, and fetch calls without any additional configuration.
Option B — Manual SDK setup (more control)
Use this when you need custom span processors, custom resource attributes, or to mix in your own instrumentation libraries:
Auto-instrumentation handles fetch calls, so outbound requests to FastAPI carry traceparent headers automatically—no manual header injection needed in Node.js.
Step 3: Instrument FastAPI
FastAPIInstrumentor reads the incoming traceparent header and creates a child span automatically. The Resource with SERVICE_NAME is important—without it, Jaeger labels your service as unknown_service and traces become hard to filter.
Step 4: Instrument Django
Add to manage.py before execute_from_command_line:
You can also drive the service name entirely from the environment without any code change:
Step 5: Inject Context on Outbound HTTP in Python
This is the critical step that trips up more engineers than anything else. When FastAPI calls Django via httpx or requests, you must inject trace context into outbound headers:
Without inject(), the Django span starts a brand-new trace and you lose end-to-end visibility entirely. The traceparent header encodes the current traceId and spanId; Django's instrumentation reads it on arrival and creates a child span automatically.
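The effect of a missing inject() call can be simulated without any tracing library. The sketch below is a toy model, not the OpenTelemetry API: new_span, inject_traceparent, extract_parent, and handle_request are hypothetical stand-ins for what the SDK and the Django instrumentation do, and spans are plain dicts.

```python
import secrets
from typing import Dict, Optional, Tuple

Span = Dict[str, Optional[str]]


def new_span(parent: Optional[Tuple[str, str]] = None) -> Span:
    """Create a span; inherit the trace id from parent=(trace_id, span_id)."""
    return {
        "trace_id": parent[0] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent[1] if parent else None,
    }


def inject_traceparent(span: Span, headers: Dict[str, str]) -> None:
    """Stand-in for inject(): write the caller's ids into outbound headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract_parent(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Stand-in for the receiving instrumentation reading the header."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    return parts[1], parts[2]


def handle_request(headers: Dict[str, str]) -> Span:
    """Receiving service: child span if a header arrived, else a new root."""
    return new_span(parent=extract_parent(headers))


fastapi_span = new_span()

outbound: Dict[str, str] = {}
inject_traceparent(fastapi_span, outbound)
linked = handle_request(outbound)  # same trace_id, parented to fastapi_span

orphan = handle_request({})        # no header: a brand-new trace
```

In this model, linked shares fastapi_span's trace_id and points at its span_id as parent, while orphan gets a fresh trace_id and no parent, which is the "separate trace" symptom described above.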
If you're using httpx throughout your app, you can also install opentelemetry-instrumentation-httpx and call HTTPXClientInstrumentor().instrument() at startup—it patches all httpx clients to inject context automatically, the same way Node.js auto-instrumentation handles fetch.
Step 6: Add Custom Spans Around Business Logic
Auto-instrumentation gives you HTTP-level spans. For production debugging you'll want spans around meaningful units of work—a payment processor, a permission check, a slow database query.
Python (FastAPI or Django):
TypeScript (Next.js):
Name span attributes after the OpenTelemetry semantic conventions where possible (http.method, db.statement), and pick one consistent naming scheme for custom ones (user.id). This makes it far easier to query spans across services consistently.
Verifying the Trace
Open Jaeger at http://localhost:16686, search for service nextjs-frontend, and find a recent trace. You should see three grouped spans sharing the same traceId:
Debugging a broken trace:
| Symptom | Likely cause |
|---|---|
| Django shows a separate traceId | inject() was not called on the outbound Python HTTP client |
| FastAPI spans missing | setup_tracing(app) called after route registration |
| Spans appear but service name is unknown_service | Resource not configured with SERVICE_NAME |
| No spans at all from Next.js | instrumentation.ts not in project root, or experimental.instrumentationHook: true missing in older Next.js versions |
Production-Ready Collector Config
For a production deployment, extend the collector config with memory limits and OTLP forwarding to a managed backend (e.g. Grafana Cloud, Honeycomb):
Key points:
- memory_limiter must come before batch in the processor chain; it protects the collector from running out of memory under spike load.
- send_batch_size: 512 / timeout: 5s: export a batch when it hits 512 spans or 5 seconds have elapsed, whichever comes first. These values mirror the SDK BatchSpanProcessor defaults (the collector's batch processor itself defaults to larger batches and a much shorter timeout) and are a safe starting point.
- probabilistic_sampler is in otel/opentelemetry-collector-contrib (not the core image), which is why the docker-compose.yml uses collector-contrib.
- Swap otlp/honeycomb for any other OTLP-compatible backend (Grafana Tempo, Datadog, New Relic, SigNoz) by changing the exporter block. Your service code never changes.
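The "size or timeout, whichever comes first" rule for batching can be sketched in a few lines. This is a simplified model, not the collector's implementation: SpanBatcher and its tick() method are hypothetical, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time
from typing import Any, Callable, List, Optional


class SpanBatcher:
    """Buffer spans and export when the batch is full or has waited too long."""

    def __init__(
        self,
        export: Callable[[List[Any]], None],
        max_batch_size: int = 512,
        timeout_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._export = export
        self._max = max_batch_size
        self._timeout = timeout_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._first_at: Optional[float] = None  # arrival time of oldest span

    def add(self, span: Any) -> None:
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(span)
        if len(self._buffer) >= self._max:  # size trigger
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes once the oldest span hits the timeout."""
        if self._buffer and self._clock() - self._first_at >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._first_at = None
            self._export(batch)
```

A larger batch size means fewer export calls under load, while the timeout bounds how stale a span can get on a quiet service; tuning either is a trade between throughput and how quickly traces show up in the backend.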
Environment Variables Cheat Sheet
Keeping config out of code makes deployments cleaner. OpenTelemetry's SDKs respect these natively:
| Variable | Used by | Example value |
|---|---|---|
| OTEL_SERVICE_NAME | All SDKs | django-backend |
| OTEL_EXPORTER_OTLP_ENDPOINT | All SDKs | http://otel-collector:4318 |
| OTEL_TRACES_SAMPLER | All SDKs | parentbased_traceidratio |
| OTEL_TRACES_SAMPLER_ARG | All SDKs | 0.1 (10%) |
| OTEL_RESOURCE_ATTRIBUTES | All SDKs | deployment.environment=production |
Setting OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 applies a 10% head-based sampler at the SDK level—useful when you want to reduce volume before spans even reach the collector.
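As a rough illustration of what parentbased_traceidratio decides, the sketch below follows the parent's decision when there is one and otherwise compares the low 64 bits of the trace id against a threshold derived from the ratio. It is modelled on how the Python SDK's ratio sampler behaves as far as I understand it; the function names are hypothetical and other SDKs may compute the threshold differently.

```python
from typing import Optional

_LOWER_64_BITS = (1 << 64) - 1


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Keep the trace if its low 64 bits fall under ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & _LOWER_64_BITS) < bound


def parent_based_ratio(
    trace_id: int, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the parent's decision if present; otherwise sample by ratio.

    parent_sampled is None for a root span (no incoming traceparent).
    """
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)
```

Because the decision is a pure function of the trace id, every service that sees the same root trace id and ratio reaches the same verdict, and the parent-based wrapper keeps downstream services from dropping spans of a trace the frontend already chose to keep.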
Production Considerations
- Replace localhost collector URLs with your actual endpoint using environment variables per service. Never hardcode hostnames.
- Use BatchSpanProcessor (as shown above) rather than SimpleSpanProcessor; the latter blocks the request thread and will hurt your p99 latency.
- Choose your sampling strategy deliberately. Head-based sampling (SDK-level) is simpler but can drop interesting rare traces. Tail-based sampling (collector-level, via tailsamplingprocessor) lets you keep all traces with errors regardless of sample rate, which is worth the added complexity at scale.
- Add custom spans around business logic using tracer.start_as_current_span() in Python or tracer.startActiveSpan() in TypeScript. HTTP spans alone won't tell you which function is slow.
- Propagate traceId into your structured logs. If you're already using structured logging across this stack, injecting the current traceId and spanId into every log line lets you jump from a trace waterfall directly to the relevant log lines.
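The log-correlation point can be sketched with the standard library alone. Here current_span_ids is a hypothetical stand-in for the active span context; in a real service you would presumably read the ids from the OpenTelemetry API (for example via trace.get_current_span().get_span_context()) instead of a hand-rolled context variable.

```python
import contextvars
import logging
from typing import Optional, TextIO, Tuple

# Hypothetical stand-in for the active span: (trace_id, span_id) as hex strings.
current_span_ids: contextvars.ContextVar[Optional[Tuple[str, str]]] = (
    contextvars.ContextVar("current_span_ids", default=None)
)


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record so formatters can use them."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = current_span_ids.get()
        record.trace_id, record.span_id = ids if ids else ("-", "-")
        return True  # never drop the record, only enrich it


def build_logger(stream: TextIO) -> logging.Logger:
    """Return a logger whose lines carry the current trace and span ids."""
    logger = logging.getLogger("traced-app")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
        )
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger
```

With the ids on every line, a log search for a trace_id copied from the waterfall view returns exactly the lines emitted while that request was in flight.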
Final Thought
End-to-end distributed tracing across a polyglot stack is one of the highest-leverage things you can do for a production system. You stop guessing where latency lives and start having evidence. OpenTelemetry's instrumentation libraries handle the heavy lifting; your job is wiring the collector, configuring sensible sampling, and remembering to call inject() on outbound Python HTTP requests. That last bit trips up more engineers than anything else—and now you know why.
Damian Hodgkiss
Senior Staff Engineer at Sumo Group, leading development of AppSumo marketplace. Technical solopreneur with 25+ years of experience building SaaS products.