Streaming Response
Problem
A user asks your assistant a question and then waits. And waits. The model is happily generating a 400-word answer, but your code is calling await response.json(), so nothing reaches the screen until the very last token has been produced. Ten seconds of blank space, then a wall of text appears at once.
This is one of the most common mistakes I see in AI interfaces, and it’s avoidable. The model streamed that answer to your server token by token, and the API can stream it to the browser the same way. Buffering the whole thing before rendering throws away the one property that makes a slow response bearable: that you can see it working.
The blank-wait version feels broken even when it’s fast, because the user has no evidence anything is happening. The streamed version feels fast even when it’s slow, because words are appearing the entire time. Same latency, completely different experience.
Solution
Request a streaming response and read the body as it arrives rather than awaiting the full payload. Most model APIs return either a raw text stream or Server-Sent Events, where each chunk carries a small delta of the reply.
The mechanics are the same everywhere: get a reader from response.body, decode each chunk to text, pull the token deltas out of it, and append them to the message you’re rendering. Because that text lives in state the view is already bound to, the UI updates on every chunk with no extra machinery.
Two details are easy to get wrong. Decode with a streaming TextDecoder ({ stream: true }) so a character whose bytes land in two different chunks doesn’t come out garbled; a chunk can end in the middle of a character. And thread an AbortController signal through the fetch so the stream can be canceled; a half-finished generation the user has moved on from should stop consuming tokens. Keep the accumulated text in state you already render, and the Thinking Indicator you showed before the first token simply gets replaced by real content.
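To make the decoding point concrete, here is a small sketch that fakes a chunk boundary falling inside a two-byte character. The sample string and the split position are made up for illustration; only TextEncoder and TextDecoder are real APIs.

```javascript
// "é" is two bytes in UTF-8 (0xC3 0xA9); pretend the network split them.
const bytes = new TextEncoder().encode('café');
const first = bytes.slice(0, 4); // 'c', 'a', 'f', and the first byte of 'é'
const second = bytes.slice(4);   // the second byte of 'é'

// Without { stream: true } each chunk is decoded in isolation,
// so each half of the split character becomes U+FFFD.
const naive = new TextDecoder();
const broken = naive.decode(first) + naive.decode(second);

// With { stream: true } the decoder holds the dangling byte until the rest arrives.
const streamingDecoder = new TextDecoder();
const intact =
  streamingDecoder.decode(first, { stream: true }) +
  streamingDecoder.decode(second, { stream: true });

console.log(broken.includes('\uFFFD')); // true
console.log(intact);                    // "café"
```

The same decoder instance has to be reused for the whole response, since the held-back bytes live inside it.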
Example
Here’s streaming from the raw fetch reader up to parsing SSE deltas, plus two rendering problems that trip people up: incremental markdown and scroll anchoring.
Reading the Stream
The core loop reads chunks from the response body and appends decoded text as it arrives.
React:
import { useState } from 'react';
function useStreamingReply() {
  const [text, setText] = useState('');
  const [streaming, setStreaming] = useState(false);
  const send = async (prompt, signal) => {
    setText('');
    setStreaming(true);
    // fetch sits inside try so a failed or aborted request still resets `streaming`
    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        body: JSON.stringify({ prompt }),
        signal,
      });
      if (!response.ok) throw new Error(`Request failed: ${response.status}`);
      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      while (true) {
        const { value, done } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunks
        const chunk = decoder.decode(value, { stream: true });
        setText(prev => prev + chunk);
      }
    } finally {
      setStreaming(false);
    }
  };
  return { text, streaming, send };
}
Vue:
<script setup>
import { ref } from 'vue';
const text = ref('');
const streaming = ref(false);
async function send(prompt, signal) {
  text.value = '';
  streaming.value = true;
  // fetch sits inside try so a failed or aborted request still resets `streaming`
  try {
    const response = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ prompt }),
      signal,
    });
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      text.value += decoder.decode(value, { stream: true });
    }
  } finally {
    streaming.value = false;
  }
}
defineExpose({ text, streaming, send });
</script>
<template>
  <div class="reply">{{ text }}</div>
</template>
Svelte:
<script>
  let text = '';
  let streaming = false;
  export async function send(prompt, signal) {
    text = '';
    streaming = true;
    // fetch sits inside try so a failed or aborted request still resets `streaming`
    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        body: JSON.stringify({ prompt }),
        signal,
      });
      if (!response.ok) throw new Error(`Request failed: ${response.status}`);
      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      while (true) {
        const { value, done } = await reader.read();
        if (done) break;
        text += decoder.decode(value, { stream: true });
      }
    } finally {
      streaming = false;
    }
  }
</script>
<div class="reply">{text}</div>
Web Components:
class StreamingReply extends HTMLElement {
  async send(prompt, signal) {
    this.textContent = '';
    const response = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ prompt }),
      signal,
    });
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunks
      this.textContent += decoder.decode(value, { stream: true });
    }
  }
}
customElements.define('streaming-reply', StreamingReply);
Parsing Server-Sent Events
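Most model APIs don’t send raw text. They send SSE frames where each data: line holds a JSON delta. OpenAI-style APIs mark the end of the stream with a data: [DONE] sentinel; other providers may signal completion differently, and the path to the delta (choices[0].delta.content in this example) varies by provider too. Buffer partial frames, because a frame can be split across chunks.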
// Cancellation comes from the AbortSignal passed to fetch, so the parser takes only the response.
async function* parseSSE(response) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Frames are separated by a blank line (LF or CRLF); keep the trailing partial in the buffer
    const frames = buffer.split(/\r?\n\r?\n/);
    buffer = frames.pop();
    for (const frame of frames) {
      const line = frame.split(/\r?\n/).find(l => l.startsWith('data:'));
      if (!line) continue;
      const data = line.slice(5).trim();
      if (data === '[DONE]') return;
      // OpenAI-style delta shape; adjust the path for your provider
      const token = JSON.parse(data).choices?.[0]?.delta?.content;
      if (token) yield token;
    }
  }
}
// Usage: for await (const token of parseSSE(response)) append(token);
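To see why the buffer matters, it can help to run the same frame-splitting logic on a payload cut at an awkward point. The sketch below is a push-based variant written only for this demonstration; createFrameBuffer is a hypothetical helper, and the payload shape is an OpenAI-style assumption.

```javascript
// Hypothetical helper: the same buffering idea, but fed chunks by hand
// so it can be tried without a network connection.
function createFrameBuffer() {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const frames = buffer.split('\n\n');
    buffer = frames.pop(); // the trailing partial frame waits for the next chunk
    const tokens = [];
    for (const frame of frames) {
      const line = frame.split('\n').find(l => l.startsWith('data:'));
      if (!line) continue;
      const data = line.slice(5).trim();
      if (data === '[DONE]') break;
      const token = JSON.parse(data).choices?.[0]?.delta?.content;
      if (token) tokens.push(token);
    }
    return tokens;
  };
}

// Two frames plus the end sentinel, roughly as they might appear on the wire.
const wire =
  'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n' +
  'data: [DONE]\n\n';

// Cut the payload in the middle of the second frame's JSON.
const cut = wire.indexOf('"lo"') + 2;
const push = createFrameBuffer();
const received = [...push(wire.slice(0, cut)), ...push(wire.slice(cut))];

console.log(received.join('')); // "Hello"
```

Parsing each chunk on its own would have thrown on the half-finished JSON; holding the partial frame back until its blank-line terminator arrives avoids that.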
Rendering Streaming Markdown
Model output is markdown, and re-parsing on every token will flash broken tables and unterminated code fences. Only render the part of the text that has stabilized, and hold back the trailing fragment until it resolves.
import { useMemo } from 'react';
// <Markdown> stands for whatever markdown renderer component your app already uses.
function StreamingMarkdown({ text, streaming }) {
  const { safe, pending } = useMemo(() => {
    if (!streaming) return { safe: text, pending: '' };
    // If we're inside an unclosed code fence, render up to it as markdown
    // and show the rest as plain text until the closing ``` arrives.
    const fenceCount = (text.match(/```/g) || []).length;
    if (fenceCount % 2 === 1) {
      const lastFence = text.lastIndexOf('```');
      return { safe: text.slice(0, lastFence), pending: text.slice(lastFence) };
    }
    return { safe: text, pending: '' };
  }, [text, streaming]);
  return (
    <div className="markdown">
      <Markdown>{safe}</Markdown>
      {pending && <pre className="pending">{pending}</pre>}
    </div>
  );
}
Staying Anchored to the Latest Text
As the reply grows it scrolls past the fold. Keep the newest tokens visible, but only while the user hasn’t scrolled up to read something earlier.
import { useEffect, useRef } from 'react';
function useStickToBottom(ref, dep) {
  const pinned = useRef(true);
  useEffect(() => {
    const el = ref.current;
    if (!el) return; // nothing mounted yet
    const onScroll = () => {
      const distance = el.scrollHeight - el.scrollTop - el.clientHeight;
      pinned.current = distance < 40; // near the bottom = follow along
    };
    el.addEventListener('scroll', onScroll);
    return () => el.removeEventListener('scroll', onScroll);
  }, [ref]);
  useEffect(() => {
    const el = ref.current;
    if (el && pinned.current) el.scrollTop = el.scrollHeight;
  }, [dep, ref]); // dep = the streaming text, so this runs on every chunk
}
Benefits
- The interface feels responsive even on a slow model, because text starts appearing as soon as the first tokens arrive and keeps moving until the answer is done.
- The wait feels much shorter, because reading the answer as it forms fills the time that would otherwise be dead air.
- Users can start acting on the beginning of a long answer before the end has finished generating.
- Abandonment drops, because there’s continuous evidence the app is working rather than a silent gap that reads as “frozen.”
- It pairs naturally with a Stop Generation control, since you already hold the reader and the abort signal.
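As a rough illustration of that last point, the sketch below wraps the read loop so the caller gets back a stop() function. startGeneration, its return shape, and the injectable fetchImpl parameter are all hypothetical names chosen for this example; the /api/chat endpoint is the one used earlier.

```javascript
// Hypothetical helper: starts a generation and hands back a stop() function.
// fetchImpl is injectable only so the sketch can be exercised without a server.
function startGeneration(prompt, onText, fetchImpl = fetch) {
  const controller = new AbortController();
  const finished = (async () => {
    let text = '';
    try {
      const response = await fetchImpl('/api/chat', {
        method: 'POST',
        body: JSON.stringify({ prompt }),
        signal: controller.signal,
      });
      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      while (true) {
        const { value, done } = await reader.read();
        if (done) break;
        text += decoder.decode(value, { stream: true });
        onText(text);
      }
      return { text, stopped: false };
    } catch (err) {
      // After abort(), fetch rejects the pending read; treat that as a normal stop
      // and keep whatever text had already arrived.
      if (controller.signal.aborted) return { text, stopped: true };
      throw err;
    }
  })();
  return { stop: () => controller.abort(), finished };
}
```

A Stop button would call stop() and then await finished to learn how much text was kept. Whether the model actually stops generating depends on the server noticing the closed connection, which this client-side sketch cannot guarantee.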
Tradeoffs
- Chunk boundaries don’t line up with characters, tokens, or markdown syntax, so decoding or rendering the obvious way produces visibly broken output.
- Streaming markdown is genuinely fiddly: half-formed fences, tables, and lists all need special handling to avoid flicker.
- Error handling gets harder: a request can fail after the user has already seen half an answer, so you need to preserve partial text rather than discard it.
- Auto-scrolling fights the user if you don’t detect when they’ve deliberately scrolled up to read.
- You lose the ability to validate or post-process the complete response before showing anything, since you’re committing to tokens as they arrive.
- Some of the infrastructure between your server and the browser (certain proxies and compression layers) holds the chunks until the response closes, which defeats streaming entirely.
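The error-handling tradeoff is easier to reason about with a concrete shape. One possible approach, sketched here with a hypothetical readKeepingPartial helper, is to have the read loop return the text it collected alongside any error instead of throwing, so the UI can keep the half answer and offer a retry.

```javascript
// Hypothetical helper: reads a byte stream to the end and never throws.
// On failure it returns whatever text arrived, plus the error.
async function readKeepingPartial(stream, onText = () => {}) {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = '';
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) return { text, error: null };
      text += decoder.decode(value, { stream: true });
      onText(text);
    }
  } catch (error) {
    // Keep the half answer; the caller decides how to present the failure.
    return { text, error };
  }
}

// Usage sketch: const { text, error } = await readKeepingPartial(response.body, render);
```

Returning both values keeps the decision about what to show (partial text with a warning, a retry button, or both) in the component rather than in the stream reader.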
Summary
Streaming a response means reading the model’s reply as it’s produced and appending each chunk to the UI, turning a long silent wait into a live, readable answer. Decode chunks with a streaming TextDecoder, parse SSE deltas by buffering partial frames, render only the stabilized prefix of markdown, and anchor to the bottom without hijacking the user’s scroll. The result is an interface that feels fast because the user can watch it think out loud.