Building Resilient AI Chat Streaming: Preserving Interrupted Content and Auto-Retrying with Exponential Backoff

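Introduction

When building an AI chat interface, few experiences are more frustrating than watching a half-generated answer vanish because of a brief network hiccup. The stream breaks, an error message pops up, and the user is left with nothing, not even the part they had already read.

That is exactly the problem I ran into. My chat implementation uses standard HTTP streaming to push AI responses in real time. On a stable network everything works well, but the moment the connection wavers the whole experience falls apart. The error handling was primitive: it overwrote everything received so far with a static "network error" string. Users lost context, patience, and trust.

This post documents how I rebuilt the error-handling layer to make it genuinely resilient. The solution rests on two ideas: preserve the partial content already received when a stream is interrupted, and recover from transient failures through automatic retries with exponential backoff.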

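The Problem: Fragile Streaming Error Handling

The original streaming logic is straightforward. fetch issues the request, a ReadableStream reader consumes chunks, each chunk is appended to an accumulated string, and the UI is updated: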

let accumulated = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value, { stream: true });
  accumulated += chunk;

  // Update the UI with the accumulated content
  setMessages(/* ... accumulated ... */);
}
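
One detail in the loop above that is easy to overlook is the { stream: true } option. Network chunks can split a multi-byte UTF-8 character in half, and the option tells TextDecoder to hold the incomplete bytes until the next chunk arrives. The sketch below isolates that behavior. The accumulateChunks helper and the sample bytes are made up for illustration and are not part of the original implementation.

```typescript
// Hypothetical helper: decode byte chunks the same way the streaming loop
// does, then flush whatever the decoder is still buffering.
function accumulateChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder('utf-8');
  let accumulated = '';
  for (const value of chunks) {
    // stream: true keeps a trailing partial character buffered
    accumulated += decoder.decode(value, { stream: true });
  }
  // A final call without arguments flushes any buffered bytes
  accumulated += decoder.decode();
  return accumulated;
}

// "\u00e9" (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
// Split it across two chunks, as a network boundary might.
const bytes = new TextEncoder().encode('caf\u00e9');
const first = bytes.slice(0, 4); // "caf" plus the first byte of the accented e
const second = bytes.slice(4);   // the second byte of the accented e

// Streaming decode reassembles the character across the chunk boundary
const streamed = accumulateChunks([first, second]);

// Decoding each chunk independently yields replacement characters instead
const naive = new TextDecoder().decode(first) + new TextDecoder().decode(second);
```

This matters for the rest of the article because accumulated is what gets preserved on error: if chunks were decoded independently, the saved partial content could contain replacement characters at chunk boundaries.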

The catch block around this loop, however, is the weak link of the whole system:

catch (err) {
  if (err instanceof DOMException && err.name === 'AbortError') {
    // User-initiated cancel: nothing to recover, just stop
    setIsLoading(false);
    return;
  }

  // ❌ The problem: whatever is in accumulated gets thrown away
  setMessages((prev) => {
    const updated = [...prev];
    const lastMsg = updated[updated.length - 1];
    if (lastMsg && lastMsg.role === 'assistant') {
      updated[updated.length - 1] = {
        ...lastMsg,
        content: 'Network connection error. Please check your network and try again.',
      };
    }
    return updated;
  });
}

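Note the key flaw: the accumulated variable holds everything received before the error, yet the catch block ignores it entirely. The message is overwritten with a static error string. For a response that has been streaming for 10 seconds, that means 10 seconds of useful content disappears.

There is also no recovery mechanism. The only option is a manual retry button, which resends the entire user message and makes the AI regenerate the answer from scratch. That is both wasteful and slow.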

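Core Principle 1: Separate Content from Error State

The first insight is that content and error state are orthogonal. A message can hold partial content and be in an error state at the same time. The two should not be collapsed into a single string.

I extended the message type with an optional error field: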

interface ChatMessage {
  id: string;
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: number;
  mode?: 'resume' | 'general';
  error?: string; // New: error state, kept separate from content
}

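This separation lets the UI render the partial content and an error indicator together. Users see what arrived before the interruption instead of staring at a bare error message.

The catch block was rewritten to preserve accumulated: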

catch (err) {
  if (err instanceof DOMException && err.name === 'AbortError') {
    setIsLoading(false);
    return;
  }

  if (accumulated.trim()) {
    // ✅ Keep the partial content and attach the error separately
    setMessages((prev) => {
      const updated = [...prev];
      const lastMsg = updated[updated.length - 1];
      if (lastMsg && lastMsg.role === 'assistant') {
        updated[updated.length - 1] = {
          ...lastMsg,
          content: accumulated,      // keep what was already received
          error: '...',              // the error message goes here
        };
      }
      return updated;
    });
  } else {
    // Nothing was received: show a generic error
    setMessages(/* ... network error ... */);
  }
}
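
The update inside the catch block can also be written as a pure function, which makes the "preserve content, attach error" rule easy to unit-test outside React. This is a sketch under assumptions: withStreamError is a hypothetical name, the fallback text is a placeholder, the ChatMessage type is trimmed to the fields the function touches, and setting error in both branches is my choice rather than something the original code confirms.

```typescript
interface ChatMessage {
  id: string;
  role: 'user' | 'assistant' | 'system';
  content: string;
  error?: string;
}

// Hypothetical helper: returns a new array and never mutates the input
function withStreamError(
  messages: ChatMessage[],
  accumulated: string,
  errorText: string,
): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') return messages;

  const hasPartial = accumulated.trim().length > 0;
  const patched: ChatMessage = {
    ...last,
    // Keep whatever arrived; fall back to a generic notice only when empty
    content: hasPartial ? accumulated : 'Network error, please retry.',
    // Assumption: the error field is set in both branches
    error: errorText,
  };
  return [...messages.slice(0, -1), patched];
}

const before: ChatMessage[] = [
  { id: '1', role: 'user', content: 'Explain backoff' },
  { id: '2', role: 'assistant', content: '' },
];
const after = withStreamError(before, 'Backoff means wai', 'Connection lost');
```

In a component this would be used as setMessages((prev) => withStreamError(prev, accumulated, errorText)), which keeps the state update itself a one-liner.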

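In the UI, the message component renders content as usual and, when error is set, shows it underneath in a separate visual block:

┌─────────────────────────────┐
│  This is the partial answer │  ← content (preserved)
│  received before the drop...│
├─────────────────────────────┤
│  ⚠️  Auto-retrying... (1/3) │  ← error (new)
│  [Retry]                    │
└─────────────────────────────┘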

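This small architectural change greatly improves the experience. Even when recovery fails, users keep the half-read response.

Core Principle 2: Automatic Retry with Exponential Backoff

Preserving content is only half the battle. The other half is recovering from failures automatically whenever possible.

I implemented an automatic retry mechanism with exponential backoff. The design goals:

Automatic: for transient failures, the user does not have to click anything.
Bounded: never retry indefinitely. Cap both the number of attempts and the delay.
Non-intrusive: do not block the user from sending new messages while a retry is pending.
Cancellable: if the user interacts with the chat, cancel any pending retry.

The Backoff Algorithm

The retry delay follows capped exponential backoff: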

const MAX_AUTO_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 1000;
const MAX_RETRY_DELAY_MS = 8000;

function calculateBackoffDelay(attempt: number): number {
  return Math.min(
    INITIAL_RETRY_DELAY_MS * Math.pow(2, attempt),
    MAX_RETRY_DELAY_MS
  );
}

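With MAX_AUTO_RETRIES = 3, attempts 0, 1, and 2 wait 1 s, 2 s, and 4 s respectively. The 8-second cap is not reached with these settings. It only takes effect if the retry limit is raised, in which case the fourth attempt waits 8 s and every later one is clamped to 8 s instead of growing further.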
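
To sanity-check the schedule, it helps to compute the delay for each attempt. The snippet repeats the constants and function from above so it runs on its own; backoffSchedule is a hypothetical helper added only for this illustration.

```typescript
const MAX_AUTO_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 1000;
const MAX_RETRY_DELAY_MS = 8000;

function calculateBackoffDelay(attempt: number): number {
  return Math.min(INITIAL_RETRY_DELAY_MS * Math.pow(2, attempt), MAX_RETRY_DELAY_MS);
}

// Hypothetical helper: delays for attempts 0 .. count - 1
function backoffSchedule(count: number): number[] {
  return Array.from({ length: count }, (_, attempt) => calculateBackoffDelay(attempt));
}

const schedule = backoffSchedule(MAX_AUTO_RETRIES); // [1000, 2000, 4000]
const extended = backoffSchedule(6);                // [1000, 2000, 4000, 8000, 8000, 8000]
const totalWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 7000
```

With three retries the backoff delays add up to 7 seconds before the manual retry fallback appears, not counting the time each failed attempt itself takes.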

The Retry State Machine

A ref-based counter tracks the number of retries across renders:

const autoRetryCountRef = useRef(0);
const autoRetryTimerRef = useRef<ReturnType<typeof setTimeout> | null>(null);

When a stream error occurs and partial content exists, the logic checks whether any retries remain:

if (accumulated.trim()) {
  const canAutoRetry = autoRetryCountRef.current < MAX_AUTO_RETRIES;

  if (canAutoRetry) {
    autoRetryCountRef.current++;
    const errorContent = `Auto-retrying... (${autoRetryCountRef.current}/${MAX_AUTO_RETRIES})`;

    // Show the retry status in the error field
    setMessages(/* ... content: accumulated, error: errorContent ... */);

    // Schedule the retry
    const delay = calculateBackoffDelay(autoRetryCountRef.current - 1);
    autoRetryTimerRef.current = setTimeout(() => {
      // Resend the last user message with the trimmed context
      // ...
    }, delay);
  } else {
    // All retries have been used up
    setMessages(/* ... error: "Auto-retry failed. Please retry manually." ... */);
  }
}
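
The branching above boils down to a small decision: given how many retries have been used, either schedule another one or give up. A framework-free sketch of that decision follows. planRetry and the RetryPlan type are hypothetical names; the constants and status strings mirror the ones used in this article.

```typescript
const MAX_AUTO_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 1000;
const MAX_RETRY_DELAY_MS = 8000;

type RetryPlan =
  | { kind: 'retry'; nextCount: number; delayMs: number; status: string }
  | { kind: 'exhausted'; status: string };

// Hypothetical pure version of the canAutoRetry branch
function planRetry(usedRetries: number): RetryPlan {
  if (usedRetries >= MAX_AUTO_RETRIES) {
    return { kind: 'exhausted', status: 'Auto-retry failed. Please retry manually.' };
  }
  const nextCount = usedRetries + 1;
  // The delay is computed from the zero-based attempt index
  const delayMs = Math.min(
    INITIAL_RETRY_DELAY_MS * Math.pow(2, nextCount - 1),
    MAX_RETRY_DELAY_MS,
  );
  return {
    kind: 'retry',
    nextCount,
    delayMs,
    status: `Auto-retrying... (${nextCount}/${MAX_AUTO_RETRIES})`,
  };
}

const firstPlan = planRetry(0); // retry, 1000 ms, "Auto-retrying... (1/3)"
const lastPlan = planRetry(2);  // retry, 4000 ms, "Auto-retrying... (3/3)"
const donePlan = planRetry(3);  // exhausted
```

Keeping the decision separate from setMessages and setTimeout means the counter arithmetic, where off-by-one mistakes tend to hide, can be checked without rendering anything.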

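The Retry Action

When the timer fires, the retry logic has to rebuild the conversation state. The key challenge is avoiding duplicate user messages. The original retry function had a subtle bug: it removed only the assistant message and kept the user message, and then sendMessage appended a new user message. The AI would see the same user message twice.

The corrected approach removes both the failed assistant message and the user message before it, then resends: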

autoRetryTimerRef.current = setTimeout(() => {
  const currentMessages = messagesRef.current;

  // Abort if the last message no longer carries an error,
  // e.g. because the user has sent a new message in the meantime
  const lastMsg = currentMessages[currentMessages.length - 1];
  if (!lastMsg?.error) return;

  const lastUserMessage = [...currentMessages]
    .reverse()
    .find((m) => m.role === 'user');

  if (!lastUserMessage) return;

  // Remove the failed assistant message
  let trimmedMessages = currentMessages.slice(0, -1);

  // Also remove the preceding user message so it is not duplicated
  const lastTrimmed = trimmedMessages[trimmedMessages.length - 1];
  if (lastTrimmed?.role === 'user' && lastTrimmed.content === lastUserMessage.content) {
    trimmedMessages = trimmedMessages.slice(0, -1);
  }

  setMessages(trimmedMessages);
  setTimeout(() => {
    sendMessageRef.current(lastUserMessage.content, trimmedMessages);
  }, 0);
}, delay);

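Note the use of sendMessageRef, a mutable ref that always points to the latest sendMessage function. This matters because the setTimeout callback closes over the ref object and reads sendMessageRef.current only when it fires, so it calls the current function rather than a stale instance captured when the timer was scheduled.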
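
The trimming step inside the timer callback is the part most worth testing in isolation, because a mistake here either duplicates the user message or deletes an unrelated one. Below is a minimal sketch as a pure function. trimForRetry and the reduced Msg type are hypothetical and only model the fields the logic reads.

```typescript
interface Msg {
  role: 'user' | 'assistant' | 'system';
  content: string;
  error?: string;
}

interface RetryInput {
  history: Msg[]; // context to send along with the retry
  prompt: string; // user message to resend
}

// Hypothetical helper: returns null when there is nothing to retry
function trimForRetry(messages: Msg[]): RetryInput | null {
  const lastUser = [...messages].reverse().find((m) => m.role === 'user');
  if (!lastUser) return null;

  // Drop the failed assistant message
  let history = messages.slice(0, -1);

  // Drop the user message too, since resending appends it again
  const tail = history[history.length - 1];
  if (tail?.role === 'user' && tail.content === lastUser.content) {
    history = history.slice(0, -1);
  }
  return { history, prompt: lastUser.content };
}

const conversation: Msg[] = [
  { role: 'user', content: 'Hi' },
  { role: 'assistant', content: 'Hello!' },
  { role: 'user', content: 'Explain backoff' },
  { role: 'assistant', content: 'Backoff means wai', error: 'Auto-retrying... (1/3)' },
];

// history keeps the first two messages; prompt is "Explain backoff"
const retryInput = trimForRetry(conversation);
```

Because slice returns a copy, the original conversation array is left untouched, which is what a React state update expects.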

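Cancellation Safety

A retry must not outlive its validity. The cancelAutoRetry function is called in every scenario that invalidates a pending retry:

The user sends a new message
The user clicks manual retry
The user switches chat mode
The user clears the messages
The component unmounts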

const cancelAutoRetry = useCallback(() => {
  if (autoRetryTimerRef.current) {
    clearTimeout(autoRetryTimerRef.current);
    autoRetryTimerRef.current = null;
  }
  autoRetryCountRef.current = 0;
}, []);

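In addition, the timer callback verifies that the last message still carries an error field before doing anything. If the user has already interacted with the chat (for example, by sending a new message), the error field is gone and the retry aborts.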
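
Outside React, the same cancel-and-reset behavior can be captured in a small class. This is a sketch, not the hook used in this article: RetryScheduler is a hypothetical name, and the timer functions are injected so the behavior can be exercised without waiting on real timeouts.

```typescript
type TimerId = number;
interface Timers {
  set: (fn: () => void, ms: number) => TimerId;
  clear: (id: TimerId) => void;
}

// Hypothetical scheduler: at most one pending retry, cancellable at any time
class RetryScheduler {
  private timer: TimerId | null = null;
  private count = 0;
  private timers: Timers;
  private maxRetries: number;

  constructor(timers: Timers, maxRetries = 3) {
    this.timers = timers;
    this.maxRetries = maxRetries;
  }

  // Returns false when the retry budget is used up
  schedule(task: () => void, delayMs: number): boolean {
    if (this.count >= this.maxRetries) return false;
    this.clearTimer();
    this.count++;
    this.timer = this.timers.set(() => {
      this.timer = null;
      task();
    }, delayMs);
    return true;
  }

  // Mirrors cancelAutoRetry: clear the timer and reset the counter
  cancel(): void {
    this.clearTimer();
    this.count = 0;
  }

  private clearTimer(): void {
    if (this.timer !== null) {
      this.timers.clear(this.timer);
      this.timer = null;
    }
  }
}

// Fake timers for the demo: callbacks run only when flush() is called
const pending = new Map<TimerId, () => void>();
let nextId = 1;
const fakeTimers: Timers = {
  set: (fn) => { const id = nextId++; pending.set(id, fn); return id; },
  clear: (id) => { pending.delete(id); },
};
const flush = (): void => {
  const fns = Array.from(pending.values());
  pending.clear();
  fns.forEach((fn) => fn());
};

const scheduler = new RetryScheduler(fakeTimers);
let fired = 0;
scheduler.schedule(() => { fired++; }, 1000);
scheduler.cancel(); // e.g. the user sent a new message
flush();            // the cancelled retry never runs
scheduler.schedule(() => { fired++; }, 1000);
flush();            // this one runs
```

In a real app the injected timers would simply wrap setTimeout and clearTimeout. The injection exists only so the cancellation path can be checked deterministically.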

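Pitfalls and Lessons Learned

Pitfall 1: Stale Closures in setTimeout

In my first implementation of auto-retry, the setTimeout callback captured the sendMessage function directly. Because sendMessage is a useCallback with many dependencies, the closure kept referring to an old version of the function after state changed. The retry ran against stale messages and produced the wrong context.

The fix is the sendMessageRef pattern: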

const sendMessageRef = useRef<(content: string, overrideMessages?: ChatMessage[]) => void>(() => {});
// ...
sendMessageRef.current = sendMessage;
// ...
setTimeout(() => {
  sendMessageRef.current(lastUserMessage.content, trimmedMessages);
}, delay);

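A ref is mutable and does not trigger re-renders, which makes it a good fit for reaching the "latest" version of a callback from asynchronous code.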
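
The difference between capturing a function and capturing a ref is easy to reproduce without React. In the sketch below a plain { current } object stands in for useRef, and each call to makeSend stands in for a render producing a new sendMessage. All names are illustrative.

```typescript
type Send = (content: string) => string;

// Each "render" creates a new send function bound to that render's messages
function makeSend(snapshot: string[]): Send {
  return (content) => `${content} (context: ${snapshot.length} messages)`;
}

// Stand-in for useRef: a stable object whose .current can be reassigned
const sendRef: { current: Send } = { current: makeSend([]) };

// Render 1
const sendV1 = makeSend(['Hi']);
sendRef.current = sendV1;

// Two callbacks created during render 1, as a setTimeout would hold them
const staleCallback = () => sendV1('retry');          // closes over the function
const freshCallback = () => sendRef.current('retry'); // closes over the ref

// Render 2: state changed, so a new send function exists
const sendV2 = makeSend(['Hi', 'Hello!', 'Explain backoff']);
sendRef.current = sendV2;

const staleResult = staleCallback(); // "retry (context: 1 messages)"
const freshResult = freshCallback(); // "retry (context: 3 messages)"
```

The stale callback still works, which is what makes this bug hard to spot: nothing throws, the retry just runs with an outdated view of the conversation.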

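Pitfall 2: Duplicate User Messages on Retry

The original manual retry function had the same subtle bug described earlier. It removed the last assistant message but left the user message in place, and then sendMessage appended a new user message. The AI would see the same user message twice.

The fix removes both the assistant and the user message before resending:

let trimmedMessages = currentMessages.slice(0, -1); // Remove the assistant message
const lastTrimmed = trimmedMessages[trimmedMessages.length - 1];
if (lastTrimmed?.role === 'user') {
  trimmedMessages = trimmedMessages.slice(0, -1); // Also remove the user message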
}

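Pitfall 3: Orphaned Timers

Without proper cleanup, an auto-retry timer can still fire after the user has started a new conversation. The result is confusing: an old message suddenly reappears.

A thorough cleanup strategy includes:

Calling cancelAutoRetry() on every user-initiated state change
Checking lastMsg?.error in the timer callback before acting
Cleaning up in the component's unmount effect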

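Summary

Building resilient streaming means thinking beyond the happy path. The key takeaways from this implementation:

Technique	Purpose
Separate `error` field	Preserve partial content while signaling failure
Exponential backoff	Retry transient failures without overwhelming the server
`sendMessageRef` pattern	Avoid stale closures in asynchronous callbacks
Dual message cleanup	Prevent duplicate user messages on retry
Thorough cancellation	Prevent orphaned retries from causing confusion

The result is a chat interface that degrades gracefully under poor network conditions. Users see their partial answer preserved, watch the automatic recovery attempts, and always keep the option of a manual retry when everything else fails.

For AI applications, where responses can take a long time to generate, preserving partial progress is more than a nice-to-have; it is key to maintaining user trust.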
