Four Layers of Context Management: How I Stopped My AI Assistant from Forgetting

April 24, 2026 · 12 min read · Featured
Tags: AI, LLM, Context-Management

Table of Contents

  • Introduction
  • The Problem with Fixed Windows
  • Layer 1: Token Counting & Dynamic Window
  • Estimating Tokens Without a Tokenizer
  • Model-Aware Context Windows
  • Calculating the Token Budget
  • Selecting Messages by Budget, Not by Count
  • Layer 2: Conversation Summarization
  • When to Trigger
  • Splitting Old from New
  • What the Summary Preserves
  • Layer 3: Key Information Extraction
  • What Gets Extracted
  • When to Extract
  • Incremental Merging
  • Layer 4: System Prompt Compression
  • The Compression Strategy
  • Automatic Fallback
  • How the Four Layers Work Together
  • The Client-Server Contract
  • Results
  • Lessons Learned

Introduction

Every developer who has built an LLM-powered chat app hits the same wall eventually: the model starts "forgetting" things said earlier in the conversation. My portfolio site's AI assistant was no exception. The original implementation used a hard-coded sliding window of 20 messages — anything beyond that was silently discarded. No summary, no compression, no preservation. Just gone.

This works fine for quick Q&A. But when a user has a real conversation — asking follow-up questions, referencing earlier points, building on previous answers — the hard cutoff becomes painfully obvious. The AI contradicts itself, asks for information the user already provided, or loses the thread entirely.

I decided to fix this properly. Not with a band-aid, but with a layered system where each layer solves a different aspect of the same problem. Here's what I built and why.

The Problem with Fixed Windows

The original code was as simple as it gets:

const MAX_CONTEXT_MESSAGES = 20;

const contextMessages = [...currentMessages, userMessage].slice(
  -MAX_CONTEXT_MESSAGES
);

This has several compounding issues:

| Problem | Why It Matters |
| --- | --- |
| Hard truncation | Everything before the window is gone: no recovery, no summary |
| No token awareness | 20 short messages and 20 long messages consume vastly different amounts of context |
| Bloated system prompt | In resume mode, the full portfolio context eats up most of the context window |
| Model-agnostic | Same limit regardless of whether the model has 128K or 200K tokens |
| Arbitrary total limit | A 10,000-character cap was pulled from thin air |

The fundamental insight: a message count is a terrible proxy for context consumption. What you actually need to manage is tokens.

Layer 1: Token Counting & Dynamic Window

Estimating Tokens Without a Tokenizer

Running a real tokenizer (like tiktoken) server-side would add a heavy dependency. Instead, I built a heuristic estimator that's accurate enough for budget allocation:

const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

export function estimateTokens(text: string): number {
  if (!text) return 0;

  let cjkCount = 0;
  let otherCount = 0;

  for (const char of text) {
    if (CJK_REGEX.test(char)) {
      cjkCount++;
    } else {
      otherCount++;
    }
  }

  // CJK characters ≈ 1.5 tokens, English/other ≈ 0.25 tokens
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

This heuristic works because CJK characters typically tokenize to 1-2 tokens each in most modern tokenizers, while English text averages about 4 characters per token (hence 0.25 tokens per character). For a mixed-language chat, this gets you within 10-15% of reality — more than sufficient for deciding how many messages to include.
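As a quick sanity check, the arithmetic can be worked through by hand. The snippet below repeats the estimator so it runs on its own; the sample strings are illustrative only and say nothing about how a real tokenizer would split them.

```typescript
const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

function estimateTokens(text: string): number {
  if (!text) return 0;
  let cjkCount = 0;
  let otherCount = 0;
  for (const char of text) {
    if (CJK_REGEX.test(char)) cjkCount++;
    else otherCount++;
  }
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

// 13 non-CJK characters: ceil(13 * 0.25) = ceil(3.25) = 4
const englishEstimate = estimateTokens('Hello, world!');

// 4 CJK characters: ceil(4 * 1.5) = 6
const chineseEstimate = estimateTokens('你好世界');

// Mixed: 6 non-CJK characters ("Hello" plus a space) and 2 CJK characters:
// ceil(6 * 0.25 + 2 * 1.5) = ceil(4.5) = 5
const mixedEstimate = estimateTokens('Hello 你好');
```

Because of the final `Math.ceil`, any non-empty string costs at least one token, which keeps very short messages from being counted as free.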

Model-Aware Context Windows

Different models have wildly different context windows. A lookup table handles this:

const MODEL_CONTEXT_WINDOWS: Record<string, number> = {
  'glm-4-flash': 128000,
  'glm-4-long': 1000000,
  'gpt-4o-mini': 128000,
  'gpt-4o': 128000,
  'claude-3-haiku-20240307': 200000,
  'claude-3-5-sonnet-20241022': 200000,
};

The system auto-detects the active model from environment variables and picks the right window size. No more one-size-fits-all limits.
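The lookup itself might look something like the sketch below. The function name `resolveContextWindow`, the `DEFAULT_CONTEXT_WINDOW` fallback, and the `AI_MODEL` variable mentioned in the comment are all assumptions for illustration; the post does not show how unknown models are handled.

```typescript
const MODEL_CONTEXT_WINDOWS: Record<string, number> = {
  'glm-4-flash': 128000,
  'gpt-4o': 128000,
  'claude-3-haiku-20240307': 200000,
};

// Hypothetical fallback for models missing from the table (an assumption).
const DEFAULT_CONTEXT_WINDOW = 128000;

function resolveContextWindow(modelName: string | undefined): number {
  const key = modelName?.trim() ?? '';
  // Own-property check so inherited keys like "constructor" never match.
  const known = Object.prototype.hasOwnProperty.call(MODEL_CONTEXT_WINDOWS, key);
  return known ? MODEL_CONTEXT_WINDOWS[key] : DEFAULT_CONTEXT_WINDOW;
}

// On a Node.js server this could be called as
// resolveContextWindow(process.env.AI_MODEL), where AI_MODEL is a made-up name.
const haikuWindow = resolveContextWindow('claude-3-haiku-20240307'); // 200000
const unknownWindow = resolveContextWindow('some-new-model');        // 128000 (fallback)
```

Falling back to a conservative window rather than throwing means a newly configured model still works, at the cost of possibly under-using its context.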

Calculating the Token Budget

The available budget for conversation messages is what's left after accounting for everything else:

Budget = Context Window - System Prompt - Summary - Extracted Info - Reserved Response

I reserve 4,096 tokens for the response so that the model always has room to generate a reply, even when the context is nearly full.
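A minimal sketch of that subtraction follows. The object-style signature, the `BudgetInputs` name, and the clamp at zero are assumptions; the real `calculateTokenBudget` may take its arguments differently. The numbers in the example are made up to show the arithmetic.

```typescript
const RESERVED_RESPONSE_TOKENS = 4096;

interface BudgetInputs {
  contextWindow: number;
  systemPromptTokens: number;
  summaryTokens: number;
  extractedInfoTokens: number;
  reservedResponseTokens?: number;
}

function calculateTokenBudget(inputs: BudgetInputs): number {
  const reserved = inputs.reservedResponseTokens ?? RESERVED_RESPONSE_TOKENS;
  const remaining =
    inputs.contextWindow -
    inputs.systemPromptTokens -
    inputs.summaryTokens -
    inputs.extractedInfoTokens -
    reserved;
  // Clamp at zero (an assumption) so callers never see a negative budget.
  return Math.max(0, remaining);
}

// 128000 - 3000 - 600 - 200 - 4096 = 120104
const sampleBudget = calculateTokenBudget({
  contextWindow: 128000,
  systemPromptTokens: 3000,
  summaryTokens: 600,
  extractedInfoTokens: 200,
});
```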

Selecting Messages by Budget, Not by Count

Instead of keeping the last N messages, we now walk backwards from the newest message, accumulating tokens until we hit the budget:

export function selectMessagesByBudget(
  messages: Array<{ role: string; content: string }>,
  budget: number
): Array<{ role: string; content: string }> {
  let usedTokens = 0;
  const selected: Array<{ role: string; content: string }> = [];

  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4; // +4 for role overhead
    if (usedTokens + msgTokens > budget) break;
    usedTokens += msgTokens;
    selected.unshift(messages[i]);
  }

  return selected;
}

This maximizes context utilization regardless of message length. A conversation with short messages naturally includes more of them; a conversation with long messages includes fewer — but always fits within the budget.
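To see the selection in action, here is a small, self-contained run. The estimator and selector are repeated so the snippet stands alone; the three messages and the budget of 12 are invented for the example.

```typescript
type ChatMessage = { role: string; content: string };

const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

function estimateTokens(text: string): number {
  if (!text) return 0;
  let cjkCount = 0;
  let otherCount = 0;
  for (const char of text) {
    if (CJK_REGEX.test(char)) cjkCount++;
    else otherCount++;
  }
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

function selectMessagesByBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  let usedTokens = 0;
  const selected: ChatMessage[] = [];
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4; // +4 for role overhead
    if (usedTokens + msgTokens > budget) break;
    usedTokens += msgTokens;
    selected.unshift(messages[i]);
  }
  return selected;
}

const chatLog: ChatMessage[] = [
  { role: 'user', content: 'Tell me about your projects in as much detail as you can.' },
  { role: 'assistant', content: 'Sure.' },  // ceil(5 * 0.25) + 4 = 6 tokens
  { role: 'user', content: 'Thanks!' },     // ceil(7 * 0.25) + 4 = 6 tokens
];

// Walking backwards: 6 + 6 = 12 fits exactly; the long first message does not.
const selectedMessages = selectMessagesByBudget(chatLog, 12);
// selectedMessages holds the last two messages, still in chronological order.
```

Note that the loop stops at the first message that does not fit instead of skipping it, so the selected window is always a contiguous tail of the conversation.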

Layer 2: Conversation Summarization

Dynamic window sizing prevents overflow, but it doesn't solve the information loss problem. When old messages are excluded from the budget, their content is still gone. Summarization fixes this by compressing old messages into a compact narrative that persists.

When to Trigger

I set the threshold at 40,000 tokens — roughly 30% of a 128K context window. This gives enough headroom for the summary itself plus recent messages plus the system prompt.

export function shouldSummarize(
  messages: Array<{ role: string; content: string }>,
  existingSummary: string | null,
  threshold: number = 40000
): boolean {
  if (existingSummary) return false; // Don't re-summarize if we already have one
  const messagesTokens = estimateMessagesTokens(messages);
  return messagesTokens >= threshold;
}

Key design decision: once a summary exists, we don't re-summarize on every request. The summary is maintained incrementally and stored on the client side.
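The `estimateMessagesTokens` helper is called in `shouldSummarize` but not shown. A plausible version, assuming it applies the same +4 per-message overhead as the selection loop, could be:

```typescript
type ChatMessage = { role: string; content: string };

const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

function estimateTokens(text: string): number {
  if (!text) return 0;
  let cjkCount = 0;
  let otherCount = 0;
  for (const char of text) {
    if (CJK_REGEX.test(char)) cjkCount++;
    else otherCount++;
  }
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

// Assumed implementation: content estimate plus 4 tokens of role overhead each.
function estimateMessagesTokens(messages: ChatMessage[]): number {
  let total = 0;
  for (const message of messages) {
    total += estimateTokens(message.content) + 4;
  }
  return total;
}

function shouldSummarize(
  messages: ChatMessage[],
  existingSummary: string | null,
  threshold: number = 40000
): boolean {
  if (existingSummary) return false;
  return estimateMessagesTokens(messages) >= threshold;
}

const shortChat: ChatMessage[] = [
  { role: 'assistant', content: 'Sure.' },  // 2 + 4 = 6
  { role: 'user', content: 'Thanks!' },     // 2 + 4 = 6
];
const shortChatTokens = estimateMessagesTokens(shortChat); // 12
```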

Splitting Old from New

When summarization triggers, messages are split into two groups:

  • Old messages → sent to the AI for summarization
  • Recent messages → kept as-is (using half the threshold as the recent budget)

export function splitMessagesForSummary(
  messages: Array<{ role: string; content: string }>,
  threshold: number = 40000
) {
  const recentTokenBudget = Math.floor(threshold / 2);
  let recentTokens = 0;
  // Default to 0: if every message fits in the recent budget, nothing is "old"
  let splitIndex = 0;

  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4;
    if (recentTokens + msgTokens > recentTokenBudget) {
      splitIndex = i + 1;
      break;
    }
    recentTokens += msgTokens;
  }

  // Always keep at least the last 2 messages
  splitIndex = Math.min(splitIndex, Math.max(0, messages.length - 2));

  return [
    messages.slice(0, splitIndex),   // old → summarize
    messages.slice(splitIndex),       // recent → keep
  ];
}

What the Summary Preserves

The summarization prompt instructs the AI to retain four categories of information:

  1. Core questions and needs — What was the user actually asking for?
  2. Topics and conclusions — What was discussed and what was decided?
  3. Preferences and background — What do we know about the user?
  4. Unresolved issues — What's still pending?

The resulting summary is typically 300-500 words — a fraction of the original messages, but containing all the essential context.
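The actual prompt text is not shown in this post, so the following is only a sketch of how such a prompt could be assembled. The wording of `SUMMARY_INSTRUCTIONS` and the `buildSummaryPrompt` helper are assumptions; only the four categories and the 300-500 word target come from the description above.

```typescript
type ChatMessage = { role: string; content: string };

// Hypothetical instruction text covering the four categories.
const SUMMARY_INSTRUCTIONS = [
  'Summarize the conversation below in 300-500 words. Preserve:',
  '1. Core questions and needs',
  '2. Topics discussed and conclusions reached',
  '3. User preferences and background',
  '4. Unresolved issues',
].join('\n');

function buildSummaryPrompt(oldMessages: ChatMessage[]): string {
  // Flatten the old messages into a simple "role: content" transcript.
  const transcript = oldMessages
    .map((message) => `${message.role}: ${message.content}`)
    .join('\n');
  return `${SUMMARY_INSTRUCTIONS}\n\n---\n${transcript}`;
}

const summaryPrompt = buildSummaryPrompt([
  { role: 'user', content: 'Do you have React experience?' },
  { role: 'assistant', content: 'Yes, several projects use React.' },
]);
```

Only the old half of the split goes into this prompt; the recent messages are sent to the model verbatim and do not need summarizing.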

Layer 3: Key Information Extraction

Summarization preserves the narrative, but some information deserves to be extracted as structured, persistent context. Think of it as the difference between reading a story and having a cheat sheet.

What Gets Extracted

The extraction prompt produces categorized, tagged output:

[Preference] User prefers TypeScript over JavaScript
[Need] Looking for a senior frontend position
[Background] 5 years of React experience
[Follow-up] Still needs project deployment help

This structured format makes it easy for the AI to reference specific categories. When the user says "like I mentioned before," the AI can check the extracted preferences rather than scanning through the entire conversation.
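One benefit of the tagged format is that it is trivial to parse if the client ever needs to display or filter it. The parser below is a hypothetical helper, not part of the system described here, which passes the extracted text to the model as-is.

```typescript
type ExtractedInfo = Record<string, string[]>;

function parseExtractedInfo(raw: string): ExtractedInfo {
  const result: ExtractedInfo = {};
  for (const line of raw.split('\n')) {
    // Expect lines shaped like "[Tag] free text"; skip anything else.
    const match = /^\[([^\]]+)\]\s*(.+)$/.exec(line.trim());
    if (!match) continue;
    const tag = match[1];
    const text = match[2].trim();
    if (!Object.prototype.hasOwnProperty.call(result, tag)) {
      result[tag] = [];
    }
    result[tag].push(text);
  }
  return result;
}

const parsedInfo = parseExtractedInfo(
  [
    '[Preference] User prefers TypeScript over JavaScript',
    '[Need] Looking for a senior frontend position',
    '[Background] 5 years of React experience',
    '[Follow-up] Still needs project deployment help',
  ].join('\n')
);
// parsedInfo['Preference'] is ['User prefers TypeScript over JavaScript']
```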

When to Extract

Extraction is triggered at two points:

  • When summarization occurs — This is a natural checkpoint; we're already processing the conversation
  • When the user has sent ≥ 6 messages and no extraction exists yet — This catches cases where the conversation is substantive but hasn't hit the summarization threshold
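The two triggers above can be expressed as a small predicate. The name `shouldExtract` and its parameters are assumptions made for this sketch.

```typescript
type ChatMessage = { role: string; content: string };

const MIN_USER_MESSAGES_FOR_EXTRACTION = 6;

function shouldExtract(
  messages: ChatMessage[],
  existingExtractedInfo: string | null,
  summaryTriggered: boolean
): boolean {
  // Trigger 1: summarization just ran, so extraction piggybacks on it.
  if (summaryTriggered) return true;
  // Trigger 2: a substantive conversation with nothing extracted yet.
  if (existingExtractedInfo) return false;
  const userMessageCount = messages.filter((m) => m.role === 'user').length;
  return userMessageCount >= MIN_USER_MESSAGES_FOR_EXTRACTION;
}
```

Only user messages are counted, so a conversation with long assistant replies does not trigger extraction any earlier.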

Incremental Merging

Like summaries, extracted info supports incremental updates. If extraction already exists, the new conversation is processed together with the existing info to produce an updated version — no information is lost in the merge.

Layer 4: System Prompt Compression

In resume mode, the system prompt includes the full portfolio context: resume content, basic info, skills, projects, and work experience. This can consume thousands of tokens, leaving less room for conversation.

The Compression Strategy

The compressed format transforms verbose sections into compact single-line entries:

Before:

## Projects
### Personal Portfolio
- Description: A modern portfolio website built with Next.js
- Tech Stack: Next.js, TypeScript, Tailwind CSS
- Featured Project
- AI Powered

After:

[Projects] Personal Portfolio[Next.js,TypeScript,Tailwind CSS]*AI

The resume content itself is truncated to the first 500 characters (with a ... indicator). Skills are listed as comma-separated names without proficiency levels. Projects drop descriptions entirely, keeping only the title and tech stack. Work experience compresses to role@company(period).

This typically reduces the portfolio context by about 60% while preserving all the information the AI actually needs to answer questions.
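A rough sketch of those formatting rules is shown below. The `Project` and `WorkExperience` shapes, the helper names, and the `; ` separator between projects are assumptions. I am also assuming that `*` marks a featured project and `AI` marks an AI-powered one, which is how the before/after example reads. The sample work experience is invented.

```typescript
interface Project {
  title: string;
  techStack: string[];
  description?: string; // dropped entirely in compressed mode
  featured?: boolean;
  aiPowered?: boolean;
}

interface WorkExperience {
  role: string;
  company: string;
  period: string;
}

const RESUME_CHAR_LIMIT = 500;

function truncateResume(content: string): string {
  return content.length > RESUME_CHAR_LIMIT
    ? content.slice(0, RESUME_CHAR_LIMIT) + '...'
    : content;
}

function compressProject(project: Project): string {
  const flags = (project.featured ? '*' : '') + (project.aiPowered ? 'AI' : '');
  return `${project.title}[${project.techStack.join(',')}]${flags}`;
}

function compressProjects(projects: Project[]): string {
  return `[Projects] ${projects.map((p) => compressProject(p)).join('; ')}`;
}

function compressExperience(job: WorkExperience): string {
  return `${job.role}@${job.company}(${job.period})`;
}

const compressedProjects = compressProjects([
  {
    title: 'Personal Portfolio',
    techStack: ['Next.js', 'TypeScript', 'Tailwind CSS'],
    description: 'A modern portfolio website built with Next.js',
    featured: true,
    aiPowered: true,
  },
]);
// compressedProjects === '[Projects] Personal Portfolio[Next.js,TypeScript,Tailwind CSS]*AI'
```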

Automatic Fallback

The system doesn't compress by default. It first calculates the token budget with the full system prompt. Only if the budget is insufficient for the conversation messages does it switch to the compressed version:

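const messagesTokens = estimateMessagesTokens(activeMessages);
if (messagesTokens > tokenBudget && mode === 'resume') {
  // Rebuild the system prompt in compressed form and re-estimate its size
  const compressedSystemPrompt = await getResumePrompt(locale, true);
  const compressedTokens = estimateTokens(compressedSystemPrompt);
  const compressedBudget = calculateTokenBudget(compressedTokens, ...);

  // Only switch if compression actually frees up room for messages
  if (compressedBudget > tokenBudget) {
    useCompressedPrompt = true;
    tokenBudget = compressedBudget;
  }
}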

This means short conversations get the full, rich system prompt. Long conversations automatically downgrade to the compressed version to make room for more messages. The user never notices the switch.
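The decision can also be isolated as a pure function, which makes it easier to test. Everything here (`choosePrompt`, the option names, and the deliberately tiny 8,000-token window in the example) is a hypothetical restatement of the logic, not the production code.

```typescript
interface PromptChoice {
  useCompressedPrompt: boolean;
  tokenBudget: number;
}

interface PromptOptions {
  mode: string;
  messagesTokens: number;
  fullPromptTokens: number;
  compressedPromptTokens: number;
  contextWindow: number;
  otherReservedTokens: number; // summary + extracted info + reserved response
}

function choosePrompt(options: PromptOptions): PromptChoice {
  const budgetFor = (promptTokens: number): number =>
    Math.max(0, options.contextWindow - promptTokens - options.otherReservedTokens);

  const fullBudget = budgetFor(options.fullPromptTokens);
  // Keep the rich prompt whenever the messages already fit, or outside resume mode.
  if (options.messagesTokens <= fullBudget || options.mode !== 'resume') {
    return { useCompressedPrompt: false, tokenBudget: fullBudget };
  }

  const compressedBudget = budgetFor(options.compressedPromptTokens);
  return compressedBudget > fullBudget
    ? { useCompressedPrompt: true, tokenBudget: compressedBudget }
    : { useCompressedPrompt: false, tokenBudget: fullBudget };
}

// Full budget: 8000 - 3000 - 4096 = 904, too small for 2000 tokens of messages.
// Compressed budget: 8000 - 1200 - 4096 = 2704, so the compressed prompt wins.
const longChatChoice = choosePrompt({
  mode: 'resume',
  messagesTokens: 2000,
  fullPromptTokens: 3000,
  compressedPromptTokens: 1200,
  contextWindow: 8000,
  otherReservedTokens: 4096,
});
```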

How the Four Layers Work Together

Every incoming request passes through the four layers in a fixed order, broken down here into eight steps:

Client sends: all messages + cached summary + cached extracted info
  │
  ├─ 1. Token count ≥ 40K?
  │     YES → Split messages, AI generates summary, keep recent messages only
  │
  ├─ 2. Summary triggered OR ≥ 6 user messages with no extraction?
  │     YES → AI extracts structured preferences/needs/background
  │
  ├─ 3. Calculate token budget
  │     Budget = Window - SystemPrompt - Summary - ExtractedInfo - Reserved
  │
  ├─ 4. Budget too small for all messages?
  │     YES → Switch to compressed system prompt, recalculate budget
  │
  ├─ 5. Select messages by budget (newest-first accumulation)
  │
  ├─ 6. Merge summary + extracted info into system prompt
  │
  ├─ 7. Call AI provider with selected messages + enriched system prompt
  │
  └─ 8. Return response + updated summary/extracted info via headers
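Step 6, merging the summary and extracted info into the system prompt, could be as simple as appending labeled sections. The helper name and the section labels below are assumptions, since the post does not show the exact wording used.

```typescript
function buildEnrichedSystemPrompt(
  basePrompt: string,
  summary: string | null,
  extractedInfo: string | null
): string {
  const sections: string[] = [basePrompt];
  // Each optional section is added only when there is something to add.
  if (summary) {
    sections.push(`Conversation summary so far:\n${summary}`);
  }
  if (extractedInfo) {
    sections.push(`Key facts about the user:\n${extractedInfo}`);
  }
  return sections.join('\n\n');
}

const enrichedPrompt = buildEnrichedSystemPrompt(
  'You are the assistant for this portfolio site.',
  'The user asked about frontend projects.',
  '[Preference] User prefers TypeScript over JavaScript'
);
```

With no summary and no extracted info, the function returns the base prompt unchanged, so early requests in a conversation pay no extra token cost.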

The Client-Server Contract

The summary and extracted info are stored on the client and sent with each request. When the server updates them, it returns the new values via response headers:

// Server: encode and send via headers
responseHeaders['X-Conversation-Summary'] = encodeURIComponent(currentSummary);
responseHeaders['X-Extracted-Info'] = encodeURIComponent(currentExtractedInfo);

// Client: read and update local state
const summaryHeader = response.headers.get('X-Conversation-Summary');
if (summaryHeader) {
  setConversationSummary(decodeURIComponent(summaryHeader));
}

This keeps the server stateless — no session storage, no database — while still maintaining context continuity across requests within a conversation.
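The percent-encoding matters because header values cannot safely carry raw newlines or non-ASCII text such as CJK characters. A small pair of helpers makes the round trip explicit. The helper names are assumptions, and the guarded decode is an extra precaution that the original snippet does not include.

```typescript
function encodeHeaderValue(value: string): string {
  // Produces ASCII only: newlines become %0A, CJK becomes UTF-8 percent escapes.
  return encodeURIComponent(value);
}

function decodeHeaderValue(value: string | null): string | null {
  if (!value) return null;
  try {
    return decodeURIComponent(value);
  } catch (_error) {
    // decodeURIComponent throws URIError on malformed input; treat it as missing.
    return null;
  }
}

const originalSummary = '用户偏好 TypeScript\nStill needs deployment help';
const encodedSummary = encodeHeaderValue(originalSummary);
const decodedSummary = decodeHeaderValue(encodedSummary);
// decodedSummary === originalSummary
```

Returning `null` on a malformed value lets the client keep its previously cached summary instead of overwriting it with garbage.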

Results

| Metric | Before | After |
| --- | --- | --- |
| Context selection | Fixed 20 messages | Dynamic token-based budget |
| Long conversation support | Hard truncation, early context lost | Summary preserves key points |
| User preference retention | None | Structured extraction persists across requests |
| System prompt efficiency | Full portfolio always loaded | Auto-compresses when budget is tight |
| Model adaptability | Same limits for all models | Context window sized per model |
| Total length limit | 10,000 chars (arbitrary) | Model context window × 3 chars/token |

Lessons Learned

  1. Heuristic token estimation is good enough for budgeting. You don't need exact counts to decide how many messages to include. A CJK-aware heuristic gets you within 10-15% of reality, which is more than sufficient for allocation decisions.

  2. Summarization is expensive but necessary. Each summarization call adds latency and cost. The key is to trigger it only at thresholds, not on every request, and to cache the result on the client side for subsequent requests.

  3. Compression has diminishing returns. Going from verbose to compact saves ~60%, but over-compressing loses nuance. A two-tier approach (full vs. compressed) is the right balance — you get the rich version when you can afford it, and the compact version when you can't.

  4. Client-side state keeps the server simple. Storing summary and extracted info on the client means the server remains stateless. No Redis, no database, no session management. The cost is that refreshing the page loses the summary, which is acceptable for a portfolio chat assistant.

  5. The four layers are complementary, not redundant. Token budgeting prevents overflow. Summarization preserves narrative. Extraction preserves facts. Prompt compression creates headroom. Each solves a different facet of the same problem, and they work best together.