Taming AI Chat Streaming: RAF Throttling, Segmented Markdown Caching, and Virtual Scrolling

Background

If you've ever built an AI chat interface, you know the drill: tokens stream in from the LLM, the UI needs to update in real time, and everything feels fine — until it doesn't. A long response with code blocks starts to stutter. A conversation with 50+ messages makes scrolling sluggish. The Markdown parser that seemed lightweight enough suddenly becomes a bottleneck.

I was using a custom lightweight Markdown parser (deliberately avoiding the heavy react-markdown + rehype/remark plugin chain), which was a good starting point. But I identified three specific bottlenecks that were still causing jank:

Every SSE chunk triggered a full React re-render — sometimes dozens per second
The entire Markdown content was re-parsed on every update — even though 90% of it hadn't changed
All messages stayed in the DOM forever — long conversations caused DOM bloat

Let me walk through how I solved each one.

Optimization 1: Stream Chunk Merging with requestAnimationFrame

Before: Every Chunk = Every Render

The typical streaming loop reads chunks from a ReadableStream and updates state on each one:

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value, { stream: true });
  accumulated += chunk;

  // Every single chunk triggers a state update!
  setMessages((prev) => {
    const updated = [...prev];
    const lastMsg = updated[updated.length - 1];
    if (lastMsg && lastMsg.role === 'assistant') {
      updated[updated.length - 1] = { ...lastMsg, content: accumulated };
    }
    return updated;
  });
}

SSE chunks can arrive every 10-50ms. Each setMessages call triggers a React re-render, which means the Markdown parser runs, the DOM updates, and the browser recalculates layout — all potentially dozens of times per second. Most of these renders are wasted because the user can't perceive differences at that speed.

After: RAF-Based Throttling

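The browser paints at most once per display refresh: every ~16.7ms on a 60Hz screen, more often on 120Hz panels. Updating React state faster than that produces renders the user never sees. requestAnimationFrame runs a callback just before the next paint, so keeping at most one callback pending at a time coalesces every chunk that arrives within a frame into a single update.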

const streamingContentRef = useRef('');
const rafIdRef = useRef<number>(0); // 0 = no frame scheduled

// Before entering the streaming loop:
streamingContentRef.current = '';
let accumulated = '';

const flushStreamUpdate = () => {
  rafIdRef.current = 0;
  // Snapshot the ref now: updater functions should be pure and should not
  // read a mutable ref at whatever later time React happens to call them
  const content = streamingContentRef.current;
  setMessages((prev) => {
    const updated = [...prev];
    const lastMsg = updated[updated.length - 1];
    if (lastMsg && lastMsg.role === 'assistant') {
      updated[updated.length - 1] = { ...lastMsg, content };
    }
    return updated;
  });
};

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value, { stream: true });
  accumulated += chunk;
  streamingContentRef.current = accumulated;

  // Only schedule a render if one isn't already pending
  if (rafIdRef.current === 0) {
    rafIdRef.current = requestAnimationFrame(flushStreamUpdate);
  }
}

// Flush anything the streaming decoder is still buffering (an incomplete
// multi-byte sequence at the very end of the stream)
accumulated += decoder.decode();
streamingContentRef.current = accumulated;

// Cancel any pending frame and flush the final content synchronously
if (rafIdRef.current !== 0) {
  cancelAnimationFrame(rafIdRef.current);
}
flushStreamUpdate();

The key insight: streamingContentRef accumulates the latest content synchronously (so we never lose data), but setMessages is called at most once per animation frame. If five chunks arrive within one frame, React renders once, with the text accumulated through the fifth chunk. Nothing is dropped; only the intermediate renders are skipped.
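The coalescing behavior is easier to see with React out of the picture. Below is a framework-free sketch of the same scheduling pattern; createFrameBatcher and the manually driven frame queue are hypothetical scaffolding, not part of the app. In a browser, requestAnimationFrame and cancelAnimationFrame would be passed as the two scheduler functions.

```typescript
// Hypothetical scheduler signatures; requestAnimationFrame/cancelAnimationFrame fit them
type Schedule = (callback: () => void) => number;
type Cancel = (id: number) => void;

function createFrameBatcher(schedule: Schedule, cancel: Cancel, flush: (text: string) => void) {
  let latest = ''; // always holds everything received so far
  let pendingId = 0; // 0 means no frame is scheduled

  const onFrame = () => {
    pendingId = 0;
    flush(latest);
  };

  return {
    // Per network chunk: store the data immediately, schedule at most one frame
    push(chunk: string): void {
      latest += chunk;
      if (pendingId === 0) pendingId = schedule(onFrame);
    },
    // End of stream: if a frame is still pending, cancel it and flush synchronously
    finish(): void {
      if (pendingId === 0) return; // the last frame already flushed the latest text
      cancel(pendingId);
      pendingId = 0;
      flush(latest);
    },
  };
}

// Usage with a manually driven frame queue standing in for the browser
const frameQueue = new Map<number, () => void>();
let nextFrameId = 1;
const fakeSchedule: Schedule = (callback) => {
  const id = nextFrameId++;
  frameQueue.set(id, callback);
  return id;
};
const fakeCancel: Cancel = (id) => {
  frameQueue.delete(id);
};
const runFrame = () => {
  const callbacks = Array.from(frameQueue.values());
  frameQueue.clear();
  callbacks.forEach((callback) => callback());
};

const renders: string[] = [];
const batcher = createFrameBatcher(fakeSchedule, fakeCancel, (text) => {
  renders.push(text);
});
for (const chunk of ['Hel', 'lo', ', ', 'wor', 'ld']) batcher.push(chunk);
runFrame(); // five chunks arrived within one frame: one flush
batcher.push('!');
batcher.finish(); // the stream ends before the next frame: flushed synchronously
console.log(renders); // [ 'Hello, world', 'Hello, world!' ]
```

The batcher never decides what to render, only when; because latest always holds the full text, skipping a frame can delay an update but never lose data.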

Don't forget cleanup on unmount:

useEffect(() => {
  return () => {
    if (rafIdRef.current !== 0) {
      cancelAnimationFrame(rafIdRef.current);
      rafIdRef.current = 0;
    }
  };
}, []);

Why Not debounce/throttle?

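I considered debounce and throttle first, but requestAnimationFrame fits this use case better:

Approach	Timing	Frame Alignment	Notes
`debounce(16ms)`	Fires 16ms after the last call	No, may fire mid-frame	Never fires while chunks keep arriving less than 16ms apart, so the UI can stall until the stream pauses
`throttle(16ms)`	Fires at most once per 16ms	No, may fire mid-frame	Timer drift against the display refresh can land two updates in one frame and none in the next
`requestAnimationFrame`	Fires once per frame, just before paint	Yes, aligned with paint	Renders the latest data each frame; paused in background tabs (the flush after the loop covers that case)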
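To make the debounce and rAF rows concrete, here's a small deterministic model rather than a benchmark. It assumes a chunk every 10ms for 200ms and, for simplicity, treats a frame as exactly 16ms; debounceFireTimes and frameFlushTimes are hypothetical helpers.

```typescript
// Deterministic model, not a benchmark: all times in ms, frames treated as exactly 16ms

// When would a trailing-edge debounce(wait) fire for these chunk arrival times?
function debounceFireTimes(arrivals: number[], wait: number): number[] {
  const fires: number[] = [];
  for (let i = 0; i < arrivals.length; i++) {
    // Each arrival restarts the timer; it only fires if the next chunk comes later than `wait`
    const isLastChunk = i === arrivals.length - 1;
    if (isLastChunk || arrivals[i + 1] - arrivals[i] > wait) {
      fires.push(arrivals[i] + wait);
    }
  }
  return fires;
}

// When would a rAF-style batcher flush? Once, at the first frame boundary after a
// chunk arrives, no matter how many chunks land inside that frame.
function frameFlushTimes(arrivals: number[], frameMs: number): number[] {
  const flushes: number[] = [];
  for (const t of arrivals) {
    const frame = (Math.floor(t / frameMs) + 1) * frameMs;
    if (flushes[flushes.length - 1] !== frame) flushes.push(frame);
  }
  return flushes;
}

// A chunk every 10ms for 200ms (21 chunks)
const chunkArrivals = Array.from({ length: 21 }, (_, i) => i * 10);
console.log(debounceFireTimes(chunkArrivals, 16)); // [ 216 ]: nothing renders until the stream ends
console.log(frameFlushTimes(chunkArrivals, 16).length); // 13 renders, the first at 16ms
```

With 21 chunks, per-chunk rendering would mean 21 renders; the frame-aligned version does 13, while the trailing debounce shows nothing until 16ms after the last chunk.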

Optimization 2: Segmented Markdown Rendering with Cache

Before: Full Re-Parse on Every Update

During streaming, the Markdown content grows incrementally. But the parser re-parses the entire content from scratch on every update. If the AI has already written 500 words and is adding one more, we're re-parsing all 500 words for no reason — only the last segment is actually changing.

After: Split by Block Boundaries + Cache Completed Segments

The core idea: split the Markdown content into segments at block-level boundaries, cache the parsed React nodes for completed segments, and only re-parse the last (active) segment.

Splitting Content into Segments

function splitContentSegments(content: string): string[] {
  const segments: string[] = [];
  let current = '';
  let inCodeBlock = false;

  for (const line of content.split('\n')) {
    // Respect code block boundaries — never split inside a code block
    if (line.startsWith('```')) {
      inCodeBlock = !inCodeBlock;
      current += (current ? '\n' : '') + line;
      if (!inCodeBlock) {
        segments.push(current);
        current = '';
      }
      continue;
    }

    if (inCodeBlock) {
      current += (current ? '\n' : '') + line;
      continue;
    }

    // Empty line = block boundary
    if (line.trim() === '' && current.trim()) {
      segments.push(current);
      current = '';
      continue;
    }

    current += (current ? '\n' : '') + line;
  }

  if (current.trim()) {
    segments.push(current);
  }

  return segments;
}

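The splitting logic respects code block boundaries (``` pairs), so a code block is never broken in half, and empty lines serve as natural paragraph separators. The trade-off is that each segment is parsed in isolation: Markdown constructs that span a blank line, such as a loose list whose items are separated by empty lines, end up in separate segments and may render differently than they would in a single whole-document parse.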

Rendering with Cache

import { Fragment, useMemo, useState, type ReactNode } from 'react';

const MAX_CACHE_SIZE = 200;

// Inside the message component. Note: entries are keyed by segment text only,
// so they are not invalidated if `translations` changes.
const [segmentCache] = useState<Map<string, ReactNode>>(() => new Map());

const renderedSegments = useMemo(() => {
  const segments = splitContentSegments(content);
  const result: ReactNode[] = [];

  for (let i = 0; i < segments.length; i++) {
    const seg = segments[i];
    const isLast = i === segments.length - 1;

    let nodes: ReactNode;
    if (!isLast && segmentCache.has(seg)) {
      // Completed segment: reuse the cached parse output
      nodes = segmentCache.get(seg);
    } else {
      // Active segment (or cache miss): parse fresh
      nodes = parseMarkdown(seg, translations, i * 10000);
      // Cache completed segments for future reuse
      if (!isLast) {
        segmentCache.set(seg, nodes);
      }
    }

    // Attach the positional key at render time instead of caching it: identical
    // segment text (two equal paragraphs, say) shares one cache entry, and a
    // cached key would then appear twice among siblings
    result.push(<Fragment key={`seg-${i}`}>{nodes}</Fragment>);
  }

  // Evict old entries when the cache grows too large (keeps the newest half)
  if (segmentCache.size > MAX_CACHE_SIZE) {
    const entries = [...segmentCache.entries()];
    segmentCache.clear();
    entries.slice(-MAX_CACHE_SIZE / 2).forEach(([k, v]) => segmentCache.set(k, v));
  }

  return result;
}, [content, translations, segmentCache]);

The cache key is the raw segment string itself. This works because during streaming, completed segments' text doesn't change — only the last segment grows. When a segment is "completed" (a new segment starts after it), it's cached. On subsequent renders, it's a direct cache hit.

I use useState with a lazy initializer instead of useRef for the cache Map. This avoids a React 19 lint rule (react-hooks/refs) that warns about reading refs during render.
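The cache logic itself doesn't depend on React. Here's a framework-free sketch, assuming the content has already been split into segments (the arrays below play the role of splitContentSegments output) and using a hypothetical countingParse in place of parseMarkdown so the parse calls can be counted across three streaming snapshots:

```typescript
// Framework-free model of the segment cache. `segments` stands in for a splitter's
// output and `parse` for parseMarkdown.
function renderWithCache<T>(
  segments: string[],
  cache: Map<string, T>,
  parse: (segment: string) => T,
): T[] {
  return segments.map((seg, i) => {
    const isLast = i === segments.length - 1;
    // Completed segments come from the cache when possible
    if (!isLast && cache.has(seg)) return cache.get(seg) as T;
    const parsed = parse(seg);
    // Only completed segments are cached; the last one may still be growing
    if (!isLast) cache.set(seg, parsed);
    return parsed;
  });
}

// Hypothetical parser that counts how often it runs
let parseCalls = 0;
const countingParse = (seg: string): string => {
  parseCalls++;
  return `<p>${seg}</p>`;
};

// Three snapshots of one streaming response, already split into segments
const snapshots: string[][] = [
  ['Intro paragraph', 'Second par'],
  ['Intro paragraph', 'Second paragraph', 'Thi'],
  ['Intro paragraph', 'Second paragraph', 'Third one.'],
];
const cache = new Map<string, string>();
for (const segs of snapshots) renderWithCache(segs, cache, countingParse);
console.log(parseCalls, cache.size); // 5 2
```

Without the cache, the three snapshots would cost 2 + 3 + 3 = 8 parses. With it, the active segment is re-parsed on every snapshot and each completed segment is parsed once, the first time it appears as completed, for 5 in total. Identical segment text shares a single cache entry.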

Results

For a 1000-word AI response split into ~15 segments:

Metric	Before	After
Segments parsed per update	15	1 (last segment only)
Cache hit rate (after first render)	N/A	~93%
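Parse work per frame	O(total content)	O(last segment only)

Splitting still scans the full content on every update, so some linear work per frame remains; what the cache eliminates is re-parsing the completed segments.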

Optimization 3: Virtual Scrolling for Message List

Before: All Messages in the DOM

A typical chat message list renders all messages with a simple map:

{messages.map((message, index) => (
  <ChatMessage key={message.id} message={message} showCursor={...} />
))}

In a 50-message conversation, all 50 message components (each potentially containing a full Markdown renderer) stay in the DOM. This causes:

DOM bloat: Hundreds of DOM nodes for off-screen messages
Slow scroll: Browser must calculate layout for all nodes
Wasted memory: React must maintain fiber nodes for invisible components

After: Custom Virtual Scrolling

I built a virtual message list that only renders messages visible in the viewport (plus a small buffer). Here's the architecture:

┌─────────────────────────┐
│   Top Spacer (height)   │  ← Off-screen messages replaced by spacer
├─────────────────────────┤
│   Message #5 (visible)  │  ← Actually rendered
│   Message #6 (visible)  │  ← Actually rendered
│   Message #7 (visible)  │  ← Actually rendered
├─────────────────────────┤
│  Bottom Spacer (height) │  ← Off-screen messages replaced by spacer
└─────────────────────────┘

Height Tracking with ResizeObserver

Chat messages have variable heights (short text vs. long code blocks). I use ResizeObserver to measure actual rendered heights and MutationObserver to watch for new elements being added to the DOM:

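useEffect(() => {
  const container = scrollRef.current;
  if (!container) return;

  const resizeObserver = new ResizeObserver((entries) => {
    const updates: Record<string, number> = {};
    for (const entry of entries) {
      const el = entry.target as HTMLElement;
      // A message unmounted by the virtual list can still report a final 0x0 size;
      // stop observing it so its last real measurement is kept
      if (!el.isConnected) {
        resizeObserver.unobserve(el);
        continue;
      }
      const id = el.getAttribute('data-msg-id');
      if (id) {
        // Border-box height includes padding and border (contentRect.height does not);
        // fall back to offsetHeight where borderBoxSize is unavailable
        const boxHeight = entry.borderBoxSize?.[0]?.blockSize ?? el.offsetHeight;
        updates[id] = boxHeight + MESSAGE_GAP;
      }
    }
    if (Object.keys(updates).length > 0) {
      setHeights((prev) => {
        let changed = false;
        const next = { ...prev };
        for (const [id, height] of Object.entries(updates)) {
          if (prev[id] === undefined || Math.abs(prev[id] - height) > 2) {
            next[id] = height;
            changed = true;
          }
        }
        return changed ? next : prev;
      });
    }
  });

  // Re-observing an already observed element resets its observation and can fire
  // the callback again, so remember which elements are already registered
  const observed = new WeakSet<Element>();
  const observeElements = () => {
    container.querySelectorAll('[data-msg-id]').forEach((el) => {
      if (!observed.has(el)) {
        observed.add(el);
        resizeObserver.observe(el);
      }
    });
  };

  observeElements();

  // Pick up message elements mounted later (new messages, or ones scrolled into range)
  const mutationObserver = new MutationObserver(() => {
    observeElements();
  });

  mutationObserver.observe(container, { childList: true, subtree: true });

  return () => {
    resizeObserver.disconnect();
    mutationObserver.disconnect();
  };
}, []);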

The 2px tolerance in height comparison prevents infinite loops from sub-pixel rounding differences.
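The tolerance check lives inside a setState updater, which makes it awkward to test. As a sketch, here is the same logic extracted into a pure function (mergeHeights is a hypothetical name). Returning prev itself when nothing changed is what lets React bail out, since it compares old and new state with Object.is.

```typescript
type Heights = Record<string, number>;

// Pure version of the setHeights updater: returns `prev` itself when no height moved
// by more than `tolerancePx`, so React can skip the re-render entirely.
function mergeHeights(prev: Heights, updates: Heights, tolerancePx = 2): Heights {
  let next: Heights | null = null;
  for (const [id, height] of Object.entries(updates)) {
    const old = prev[id];
    if (old === undefined || Math.abs(old - height) > tolerancePx) {
      // Copy lazily, only once something actually changed
      if (next === null) next = { ...prev };
      next[id] = height;
    }
  }
  return next ?? prev;
}

const h0: Heights = { a: 120 };
const h1 = mergeHeights(h0, { a: 121.4 }); // sub-pixel jitter: ignored
const h2 = mergeHeights(h1, { a: 180, b: 64 }); // real growth plus a new message
console.log(h1 === h0, h2); // true { a: 180, b: 64 }
```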

Visible Range Calculation

With height data available, computing which messages are visible is straightforward:

const visibleRange = useMemo(() => {
  if (messages.length === 0) return { start: 0, end: -1 };

  let start = 0;
  let end = messages.length - 1;
  const bufferHeight = OVERSCAN * (ESTIMATED_HEIGHT + MESSAGE_GAP);
  const viewTop = scrollTop - bufferHeight;
  const viewBottom = scrollTop + viewportHeight + bufferHeight;

  // Scan for first visible message
  for (let i = 0; i < messages.length; i++) {
    const h = heights[messages[i].id] ?? ESTIMATED_HEIGHT + MESSAGE_GAP;
    if (offsets[i] + h >= viewTop) {
      start = Math.max(0, i - OVERSCAN);
      break;
    }
  }

  // Scan for last visible message
  for (let i = messages.length - 1; i >= 0; i--) {
    if (offsets[i] <= viewBottom) {
      end = Math.min(messages.length - 1, i + OVERSCAN);
      break;
    }
  }

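  // Always include the last message while a response is streaming (isLoading).
  // An unconditional Math.max(end, messages.length - 1) would always evaluate to
  // the last index, which would silently disable the bottom bound computed above.
  if (isLoading) {
    end = messages.length - 1;
  }

  return { start, end };
}, [scrollTop, viewportHeight, offsets, heights, messages, isLoading]);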

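OVERSCAN = 3 is applied twice: bufferHeight widens the scan window by three estimated message heights in pixels, and the index adjustments then add three more messages on each side. That generous margin keeps scrolling smooth without blank flashes, at the cost of mounting a few more messages than strictly necessary.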
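The range calculation reads an offsets array that the snippets never show. Here's a minimal sketch of how it and the two spacer heights could be derived, assuming offsets are prefix sums of the same per-message heights (measured, or ESTIMATED_HEIGHT + MESSAGE_GAP when unmeasured); computeLayout, spacerHeights, and MessageLayout are hypothetical names.

```typescript
interface MessageLayout {
  offsets: number[]; // offsets[i]: top edge of message i inside the scroll content
  total: number; // height of all messages stacked
}

// Prefix sums over per-message heights; unmeasured messages use `estimate`
function computeLayout(
  ids: string[],
  heights: Record<string, number>,
  estimate: number,
): MessageLayout {
  const offsets: number[] = [];
  let y = 0;
  for (const id of ids) {
    offsets.push(y);
    y += heights[id] ?? estimate;
  }
  return { offsets, total: y };
}

// Heights of the spacers that stand in for unrendered messages outside [start, end]
function spacerHeights(layout: MessageLayout, start: number, end: number) {
  const { offsets, total } = layout;
  const top = start < offsets.length ? offsets[start] : total;
  const bottom = end + 1 < offsets.length ? total - offsets[end + 1] : 0;
  return { top, bottom };
}

const layout = computeLayout(['m1', 'm2', 'm3', 'm4'], { m1: 100, m2: 250, m4: 80 }, 116);
console.log(layout.offsets, layout.total); // [ 0, 100, 350, 466 ] 546
console.log(spacerHeights(layout, 1, 2)); // { top: 100, bottom: 80 }
```

Because offsets is sorted ascending, the two linear scans in the range calculation could become binary searches (O(log n)) if conversations grow very long.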

Smart Auto-Scroll

One of the trickiest parts of a chat virtual list is auto-scrolling. The behavior should be:

User is at the bottom → auto-scroll to follow new content
User has scrolled up → don't force-scroll (they're reading history)
User sends a new message → always scroll to bottom

const isAtBottomRef = useRef(true);
const prevMsgCountRef = useRef(0);

const handleScroll = useCallback(() => {
  const el = scrollRef.current;
  if (!el) return;
  setScrollTop(el.scrollTop);
  // 80px threshold accounts for input area and padding
  isAtBottomRef.current = el.scrollHeight - el.scrollTop - el.clientHeight < 80;
}, []);

// Force scroll to bottom when user sends a message. Declared before the
// auto-scroll effect on purpose: effects run in declaration order, so the flag
// is already set when the auto-scroll effect reads it in the same commit.
useEffect(() => {
  if (messages.length > prevMsgCountRef.current) {
    // Check every newly added message: an assistant placeholder may be
    // appended in the same update as the user's message
    const newMessages = messages.slice(prevMsgCountRef.current);
    if (newMessages.some((m) => m.role === 'user')) {
      isAtBottomRef.current = true;
    }
  }
  prevMsgCountRef.current = messages.length;
}, [messages]);

// Auto-scroll when at bottom
useEffect(() => {
  if (!scrollRef.current || !isAtBottomRef.current) return;
  requestAnimationFrame(() => {
    if (scrollRef.current && isAtBottomRef.current) {
      scrollRef.current.scrollTop = scrollRef.current.scrollHeight;
    }
  });
}, [messages, isLoading]);
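The three rules can be modeled without React, which makes the edge cases easy to test. A sketch under assumptions: nextPinned and the ScrollSignal shape are hypothetical, the 80px threshold is the one from handleScroll, and "pinned" plays the role of isAtBottomRef.current.

```typescript
interface ScrollMetrics {
  scrollTop: number;
  scrollHeight: number;
  clientHeight: number;
}

type ScrollSignal =
  | { type: 'scroll'; metrics: ScrollMetrics }
  | { type: 'message-added'; role: 'user' | 'assistant' };

const BOTTOM_THRESHOLD_PX = 80;

function isNearBottom(m: ScrollMetrics, threshold = BOTTOM_THRESHOLD_PX): boolean {
  return m.scrollHeight - m.scrollTop - m.clientHeight < threshold;
}

// Returns whether the view should stay pinned to the bottom after `signal`
function nextPinned(pinned: boolean, signal: ScrollSignal): boolean {
  // Rules 1 and 2: scrolling near the bottom pins, scrolling away unpins
  if (signal.type === 'scroll') return isNearBottom(signal.metrics);
  // Rule 3: the user's own message always re-pins; assistant output keeps the state
  return signal.role === 'user' ? true : pinned;
}

let pinned = true;
pinned = nextPinned(pinned, {
  type: 'scroll',
  metrics: { scrollTop: 200, scrollHeight: 2000, clientHeight: 600 },
});
console.log(pinned); // false: the user is reading history
pinned = nextPinned(pinned, { type: 'message-added', role: 'assistant' });
console.log(pinned); // false: streaming output doesn't yank the view
pinned = nextPinned(pinned, { type: 'message-added', role: 'user' });
console.log(pinned); // true
```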

Pitfalls and Lessons Learned

1. React 19's `react-hooks/refs` Rule

My first implementation used useRef for the segment cache and height tracking, reading refs during render. React 19's new react-hooks/refs lint rule flags this because reading refs during render can lead to stale UI (refs don't trigger re-renders). I switched to:

Segment cache: useState<Map> with lazy initializer — the Map persists across renders but is "state"
Height tracking: useState<Record<string, number>> updated via ResizeObserver callbacks

2. `setState` Inside `useLayoutEffect` Triggers Lint Error

The react-hooks/set-state-in-effect rule in React 19 flags synchronous setState calls inside effects. My initial approach used useLayoutEffect to measure heights and call setHeights synchronously. The fix was to use ResizeObserver + MutationObserver instead — the setState calls happen in the observer callbacks (which are asynchronous), not in the effect body itself.

3. Virtual Scrolling + Streaming = Always Render Last Message

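A subtle but critical detail: during streaming, the last message is constantly growing. While a response is streaming, the virtual list must always include the last message in the rendered range, even if the scroll position would normally exclude it. Without this, the streaming text would disappear from the DOM when the message grows tall enough to push it out of the viewport.

The obvious one-liner needs care, though. Because end never exceeds messages.length - 1, an unconditional end = Math.max(end, messages.length - 1) always evaluates to the last index and quietly keeps every message below the viewport mounted. Gating it on the streaming flag keeps the window bounded once the response finishes (while streaming, a user scrolled far up still mounts everything below them):

// Always include the last message while streaming
if (isLoading) end = messages.length - 1;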

4. Cache Eviction Strategy

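The segment cache uses a simple "clear half when full" strategy. A proper LRU would be more precise, but the cache is filled append-only: segments are inserted in the order they complete, and since a Map iterates in insertion order, the half that survives eviction is the most recently completed segments. Each message component also owns its own cache, so a MAX_CACHE_SIZE of 200 only comes into play for a single response with more than 200 blocks, which is why eviction rarely triggers. One side effect of per-component caches: a message unmounted by the virtual list loses its cache and is re-parsed in full when it scrolls back into view.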
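For comparison, here's a sketch of the LRU the paragraph mentions, built on the same Map insertion-order property: get() moves a key to the end, and set() evicts the first key once capacity is exceeded. LruCache is a hypothetical helper, not part of the article's code; its has/get/set/size surface matches what the render loop uses, so it could stand in for the Map held in useState (with the manual eviction block removed).

```typescript
// Minimal LRU cache on top of Map's insertion order
class LruCache<K, V> {
  private readonly map = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  has(key: K): boolean {
    return this.map.has(key);
  }

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    // Re-insert so this key becomes the most recently used
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // The first key in iteration order is the least recently used
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number {
    return this.map.size;
  }
}

const lru = new LruCache<string, string>(2);
lru.set('a', 'A');
lru.set('b', 'B');
lru.get('a'); // touch 'a' so 'b' becomes least recently used
lru.set('c', 'C'); // evicts 'b'
console.log(lru.has('a'), lru.has('b'), lru.has('c')); // true false true
```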

Summary

Optimization	Technique	Key Benefit
Stream throttling	`requestAnimationFrame` batching	Cuts renders from one per chunk to at most one per frame (~60/sec on a 60Hz display)
Segmented rendering	Block-level splitting + segment cache	Avoids re-parsing 90%+ of unchanged content
Virtual scrolling	ResizeObserver + visible range calculation	Mounted messages track the viewport, not the conversation length

These three optimizations complement each other: RAF throttling reduces render frequency, segmented caching reduces parse work per render, and virtual scrolling reduces DOM size. The result is a chat interface that stays smooth even with long streaming responses and long conversation histories, all without adding a virtualization library such as react-window or react-virtuoso.