Streaming LLM Responses from FastAPI to Next.js with the OpenAI and Anthropic APIs
End-to-end streaming setup for OpenAI and Anthropic APIs across FastAPI and Next.js. Cover backend generators, SDK differences, error handling, and production patterns.
Streaming LLM Responses from FastAPI to Next.js with the OpenAI and Anthropic APIs
If you've built anything with LLMs recently, you know the problem: waiting three to eight seconds for a complete response before rendering anything feels broken. Users assume it's crashed. Streaming fixes that — the first tokens appear almost immediately, and the interface feels alive. Getting it working end-to-end across FastAPI and Next.js is straightforward once you understand the plumbing, but a few spots quietly break. This tutorial covers the full path: backend generator setup, SDK-level differences between OpenAI and Anthropic, Next.js consumption patterns, error handling, and production gotchas.
The Architecture in One Sentence
FastAPI produces a StreamingResponse using an async generator; Next.js consumes it via the fetch API reading from response.body as a ReadableStream. Both OpenAI and Anthropic support streaming natively — the differences are minor and will be called out explicitly.
Prerequisites
- Python 3.9+ with
fastapi,uvicorn,openai, andanthropicinstalled - Node.js 18+ (for native
fetchandReadableStream) - Next.js 13+ (App Router)
OPENAI_API_KEYandANTHROPIC_API_KEYset as environment variables
FastAPI: Setting Up the Streaming Endpoint
Install your dependencies first:
Here's the core pattern. OpenAI first, then the Anthropic variant, then a shared router that exposes both.
OpenAI Streaming
The key detail: yield delta emits raw text chunks. You could format these as Server-Sent Events (text/event-stream) — and should in production (see the SSE section below) — but for a simple chat interface, plain text/plain with chunked transfer encoding is less ceremony to start with.
Why if delta matters. The OpenAI SDK emits a final chunk where chunk.choices[0].delta.content is None (signalling the end of generation). Without the guard, yield None causes FastAPI to serialise "None" as a string into the response body — a subtle, hard-to-spot bug.
Anthropic Streaming
Anthropic's SDK exposes .text_stream on the stream context manager, which iterates directly over decoded text strings — so you never need to inspect event types or unwrap delta objects yourself. This is confirmed in the official Anthropic Python SDK docs: calling for text in stream.text_stream (sync) or async for text in stream.text_stream (async) is the idiomatic path. With OpenAI, you pull out chunk.choices[0].delta.content manually and guard for None; the Anthropic helper removes that ceremony.
max_tokens is required for Anthropic. Unlike OpenAI (where it's optional), the Anthropic API will return a validation error if max_tokens is omitted. Set it to a reasonable upper bound for your use case.
CORS
If your Next.js dev server runs on a different origin, add this before your routes:
In production, replace "http://localhost:3000" with your actual frontend domain. Avoid allow_origins=["*"] unless the endpoint is truly public and unauthenticated.
SSE vs Plain Text {#sse-vs-plain-text}
The examples above use media_type="text/plain". That works for a simple demo, but in production you should use Server-Sent Events (text/event-stream) for two reasons:
- Error signalling mid-stream. With plain text you cannot distinguish an error chunk from a content chunk once the stream has started and HTTP 200 has been sent. SSE lets you send a named
event: errorwith a structured payload. - Reconnection and event IDs. Browsers reconnect SSE automatically on network drop; plain chunked transfer does not.
Here's the SSE-formatted version of the OpenAI generator:
The [DONE] sentinel mirrors OpenAI's own wire format and gives the client an explicit termination signal. Parse it on the frontend with:
Next.js: Consuming the Stream
Pattern 1 — React Server Action (App Router)
Create app/actions/stream.ts:
Then in your component:
The for await...of loop over an async generator is the cleanest pattern here. Each chunk appends to state, React re-renders incrementally, and users see tokens arriving in real time.
Pattern 2 — Next.js Route Handler
If you prefer a Route Handler (closer to an API proxy), create app/api/stream/route.ts:
This pattern is useful when you want the Next.js layer to handle authentication (e.g. check a session cookie) before proxying to FastAPI, without duplicating the streaming logic.
Error Handling Mid-Stream
Once FastAPI sends HTTP 200 and begins streaming, you can no longer change the status code. This means errors that occur mid-generation need to be communicated in-band. Two patterns:
1. Sentinel token in plain text — emit a special string like __ERROR__: rate limit exceeded and detect it on the client. Simple, but fragile if the LLM output could ever contain that string.
2. SSE with a named error event — the cleanest approach (shown above). The client checks event.type before treating data as content:
3. Wrap the generator in try/except on the FastAPI side — always do this regardless of format. An unhandled exception in a generator will silently terminate the stream from the server side; the client sees an abrupt disconnect, not a meaningful error message.
Where Things Actually Break
Buffering by a reverse proxy
Nginx buffers responses by default. Add X-Accel-Buffering: no as a response header in FastAPI, or set proxy_buffering off in your Nginx config block. This silently kills streaming in production whilst everything works locally. The header approach is preferable because it lives in your application code rather than infrastructure config:
For Nginx config, the equivalent directive is:
Vercel's function timeout
Streams deployed to Vercel Serverless Functions hit the default 10-second execution timeout (or 60s on Pro). Either:
- Keep prompts bounded so generation completes in time, or
- Switch to Edge Runtime (
export const runtime = "edge") which has a 25-second (Hobby) or unlimited (Pro) streaming duration
Edge Runtime doesn't support Node.js APIs, so make sure your route handler is pure Web APIs only.
Missing stream=True on the SDK call
Both SDKs default to non-streaming. Forgetting this means you block until the model finishes, then emit the full response at once. FastAPI still returns it as a StreamingResponse, but the latency is identical to non-streaming — the entire response is buffered in the generator before the first yield.
Decoder state across chunks
Always pass { stream: true } to TextDecoder.decode(). Without it, multibyte UTF-8 characters that straddle chunk boundaries produce garbled output — common with CJK characters, emoji, or any non-ASCII content in the response.
Connection drops during long generations
For responses that may take 30+ seconds (e.g. max_tokens=4096), add an AbortController so you can cancel the stream when the user navigates away:
SDK-Level Differences at a Glance
OpenAI (openai ≥ 1.0) | Anthropic (anthropic ≥ 0.25) | |
|---|---|---|
| Async client | AsyncOpenAI() | AsyncAnthropic() |
| Stream call | await client.chat.completions.create(..., stream=True) | async with client.messages.stream(...) as stream: |
| Iterate tokens | async for chunk in stream: → chunk.choices[0].delta.content | async for text in stream.text_stream: |
| Null guard needed | Yes — delta is None on final chunk | No — .text_stream yields strings only |
max_tokens required | No | Yes |
| Stop reason | chunk.choices[0].finish_reason | stream.get_final_message().stop_reason |
| Usage data | chunk.usage (only on final chunk with stream_options) | await stream.get_final_message() → .usage |
To access token usage from OpenAI during streaming, pass stream_options={"include_usage": True} to create() — usage appears on the final chunk.
Choosing Between OpenAI and Anthropic
From an integration standpoint, they're nearly identical. Anthropic's .text_stream is marginally more ergonomic because you skip the manual delta unwrap and null guard. The underlying wire protocol is SSE in both cases; the Python SDKs abstract that away entirely.
If the model choice matters for your product, make the endpoint provider-agnostic (as shown above) and swap at the call site — you're not locked in. A simple provider query param or request body field is all you need.
Full Working Example: main.py
Run with:
Wrapping Up
The full pattern is roughly 50 lines of Python and 40 lines of TypeScript. Once in place, the user experience improvement is significant and the complexity overhead is low. The key is recognising that streaming is fundamentally a generator-to-readable-stream pipeline — everything else is SDK syntax differences and deployment config.
The spots that silently break in production are:
- Nginx buffering — add
X-Accel-Buffering: no - Vercel function timeouts — use Edge Runtime
- Missing
stream=True— blocks the generator entirely - TextDecoder state — always pass
{ stream: true } - Missing error handling in the generator — unhandled exceptions disconnect silently
Get the plumbing right once, and it's reusable across every LLM feature you build.
Damian Hodgkiss
Senior Staff Engineer at Sumo Group, leading development of AppSumo marketplace. Technical solopreneur with 25+ years of experience building SaaS products.