Four Layers of Context Management: How I Stopped My AI Assistant from Forgetting

Introduction

Every developer who has built an LLM-powered chat app hits the same wall eventually: the model starts "forgetting" things said earlier in the conversation. My portfolio site's AI assistant was no exception. The original implementation used a hard-coded sliding window of 20 messages — anything beyond that was silently discarded. No summary, no compression, no preservation. Just gone.

This works fine for quick Q&A. But when a user has a real conversation — asking follow-up questions, referencing earlier points, building on previous answers — the hard cutoff becomes painfully obvious. The AI contradicts itself, asks for information the user already provided, or loses the thread entirely.

I decided to fix this properly. Not with a band-aid, but with a layered system where each layer solves a different aspect of the same problem. Here's what I built and why.

The Problem with Fixed Windows

The original code was as simple as it gets:

const MAX_CONTEXT_MESSAGES = 20;

const contextMessages = [...currentMessages, userMessage].slice(
  -MAX_CONTEXT_MESSAGES
);

This has several compounding issues:

Problem	Why It Matters
Hard truncation	Everything before the window is gone — no recovery, no summary
No token awareness	20 short messages and 20 long messages consume vastly different amounts of context
Bloated system prompt	In resume mode, the full portfolio context takes thousands of tokens out of every request
Model-agnostic	Same limit regardless of whether the model has 128K or 200K tokens
Arbitrary total limit	A 10,000-character cap was pulled from thin air

The fundamental insight: a message count is a terrible proxy for context consumption. What you actually need to manage is tokens.

Layer 1: Token Counting & Dynamic Window

Estimating Tokens Without a Tokenizer

Running a real tokenizer (like tiktoken) server-side would add a heavy dependency. Instead, I built a heuristic estimator that's accurate enough for budget allocation:

const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

export function estimateTokens(text: string): number {
  if (!text) return 0;

  let cjkCount = 0;
  let otherCount = 0;

  for (const char of text) {
    if (CJK_REGEX.test(char)) {
      cjkCount++;
    } else {
      otherCount++;
    }
  }

  // CJK characters ≈ 1.5 tokens, English/other ≈ 0.25 tokens
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

This heuristic works because CJK characters typically tokenize to 1-2 tokens each in most modern tokenizers, while English text averages about 4 characters per token (hence 0.25 tokens per character). For a mixed-language chat, this gets you within 10-15% of reality — more than sufficient for deciding how many messages to include.
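
To make the arithmetic concrete, here is a small worked example. It repeats a condensed copy of the estimator so the snippet runs on its own; the sample strings are made up for illustration.

```typescript
// Condensed copy of the estimator above, repeated so this snippet is self-contained.
const CJK = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

function estimate(text: string): number {
  let cjk = 0;
  let other = 0;
  for (const ch of text) {
    if (CJK.test(ch)) cjk++;
    else other++;
  }
  return Math.ceil(cjk * 1.5 + other * 0.25);
}

// 11 non-CJK characters: 11 * 0.25 = 2.75, rounded up to 3
const english = estimate("Hello world");

// 4 CJK characters: 4 * 1.5 = 6
const chinese = estimate("你好世界");

// 7 non-CJK characters ("Hello, ") plus 2 CJK: 1.75 + 3 = 4.75, rounded up to 5
const mixed = estimate("Hello, 世界");

console.log({ english, chinese, mixed });
```

Note that the same number of characters costs several times more budget in Chinese than in English, which is exactly why a plain character or message count misleads.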

Model-Aware Context Windows

Different models have wildly different context windows. A lookup table handles this:

const MODEL_CONTEXT_WINDOWS: Record<string, number> = {
  'glm-4-flash': 128000,
  'glm-4-long': 1000000,
  'gpt-4o-mini': 128000,
  'gpt-4o': 128000,
  'claude-3-haiku-20240307': 200000,
  'claude-3-5-sonnet-20241022': 200000,
};

The system auto-detects the active model from environment variables and picks the right window size. No more one-size-fits-all limits.
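
The lookup itself is not shown above, so here is a minimal sketch of how it might work. The environment variable name `AI_MODEL`, the function name `resolveContextWindow`, and the 32,000-token fallback are all assumptions made for illustration; the table is a trimmed copy of the one above.

```typescript
// Trimmed copy of the lookup table above.
const CONTEXT_WINDOWS: Record<string, number> = {
  "glm-4-flash": 128000,
  "gpt-4o-mini": 128000,
  "claude-3-haiku-20240307": 200000,
};

// Assumption: a conservative fallback for models missing from the table.
const DEFAULT_CONTEXT_WINDOW = 32000;

// Assumption: the model id lives in an env var named AI_MODEL (hypothetical name).
// The env object is passed in as a parameter so the function is easy to test.
function resolveContextWindow(env: Record<string, string | undefined>): number {
  const model = env["AI_MODEL"]?.trim();
  if (!model) return DEFAULT_CONTEXT_WINDOW;
  return CONTEXT_WINDOWS[model] ?? DEFAULT_CONTEXT_WINDOW;
}
```

In a Node.js server this would be called as `resolveContextWindow(process.env)`. Falling back to a small window for unknown models errs on the side of sending too little context rather than overflowing.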

Calculating the Token Budget

The available budget for conversation messages is what's left after accounting for everything else:

Budget = Context Window - System Prompt - Summary - Extracted Info - Reserved Response

The 4,096 reserved response tokens ensure the model always has room to generate a reply, even if the context is nearly full.
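
As a rough sketch, the formula translates into a few lines of code. The real `calculateTokenBudget` signature is not shown in this post, so the `BudgetInput` shape and the name `calculateBudget` below are hypothetical.

```typescript
// Hypothetical input shape; the real signature is not shown in this post.
interface BudgetInput {
  contextWindow: number; // total tokens the model accepts
  systemPromptTokens: number;
  summaryTokens: number;
  extractedInfoTokens: number;
  reservedResponseTokens?: number; // defaults to the 4,096 mentioned above
}

function calculateBudget(input: BudgetInput): number {
  const reserved = input.reservedResponseTokens ?? 4096;
  const remaining =
    input.contextWindow -
    input.systemPromptTokens -
    input.summaryTokens -
    input.extractedInfoTokens -
    reserved;
  // Clamp at zero so an oversized system prompt never yields a negative budget.
  return Math.max(0, remaining);
}

// 128000 - 3000 - 600 - 200 - 4096 = 120104
const budget = calculateBudget({
  contextWindow: 128000,
  systemPromptTokens: 3000,
  summaryTokens: 600,
  extractedInfoTokens: 200,
});
```

The token counts in the example call are invented; only the subtraction and the 4,096 default come from the text above.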

Selecting Messages by Budget, Not by Count

Instead of keeping the last N messages, we now walk backwards from the newest message, accumulating tokens until we hit the budget:

export function selectMessagesByBudget(
  messages: Array<{ role: string; content: string }>,
  budget: number
): Array<{ role: string; content: string }> {
  let usedTokens = 0;
  const selected: Array<{ role: string; content: string }> = [];

  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4; // +4 for role overhead
    if (usedTokens + msgTokens > budget) break;
    usedTokens += msgTokens;
    selected.unshift(messages[i]);
  }

  return selected;
}

This maximizes context utilization regardless of message length. A conversation with short messages naturally includes more of them; a conversation with long messages includes fewer — but always fits within the budget.
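
One detail worth seeing in action: the walk stops at the first message that does not fit instead of skipping over it. The demo below is a variation on the function above, not the original: the estimator is injected (so a trivial one-token-per-character stand-in can be used) and the result is produced with a single `slice`. The name `selectByBudget` and the sample messages are made up.

```typescript
type ChatMessage = { role: string; content: string };

// Same newest-first walk as above, with the estimator passed in so the demo
// stays self-contained.
function selectByBudget(
  messages: ChatMessage[],
  budget: number,
  estimate: (text: string) => number
): ChatMessage[] {
  let used = 0;
  let start = messages.length; // index of the oldest message that still fits
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimate(messages[i].content) + 4; // +4 for role overhead
    if (used + cost > budget) break;
    used += cost;
    start = i;
  }
  return messages.slice(start);
}

const chatHistory: ChatMessage[] = [
  { role: "user", content: "hi" }, // cost 2 + 4 = 6
  { role: "assistant", content: "x".repeat(50) }, // cost 50 + 4 = 54
  { role: "user", content: "thanks" }, // cost 6 + 4 = 10
];

// Budget 20: "thanks" fits (10), the 54-token reply does not, so the walk stops.
// The older "hi" would fit on its own but is not included.
const picked = selectByBudget(chatHistory, 20, (text) => text.length);
console.log(picked.map((m) => m.content)); // only "thanks" is selected
```

Stopping rather than skipping keeps the selected history contiguous, so the model never sees an old message with the turns after it missing.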

Layer 2: Conversation Summarization

Dynamic window sizing prevents overflow, but it doesn't solve the information loss problem. When old messages are excluded from the budget, their content is still gone. Summarization fixes this by compressing old messages into a compact narrative that persists.

When to Trigger

I set the threshold at 40,000 tokens — roughly 30% of a 128K context window. This gives enough headroom for the summary itself plus recent messages plus the system prompt.

export function shouldSummarize(
  messages: Array<{ role: string; content: string }>,
  existingSummary: string | null,
  threshold: number = 40000
): boolean {
  if (existingSummary) return false; // Don't re-summarize if we already have one
  const messagesTokens = estimateMessagesTokens(messages);
  return messagesTokens >= threshold;
}

Key design decision: once a summary exists, we don't re-summarize on every request. The summary is maintained incrementally and stored on the client side.
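
`shouldSummarize` relies on `estimateMessagesTokens`, which is not shown above. A minimal sketch, assuming it sums the per-message estimate plus the same +4 role overhead used in `selectMessagesByBudget`. The extra `estimate` parameter and the `roughEstimate` stand-in exist only to keep the snippet self-contained.

```typescript
type ChatMessage = { role: string; content: string };

// Assumption: total = sum of (content estimate + 4 tokens of role overhead).
function estimateMessagesTokens(
  messages: ChatMessage[],
  estimate: (text: string) => number
): number {
  return messages.reduce((total, m) => total + estimate(m.content) + 4, 0);
}

// Stand-in estimator: roughly 4 characters per token, English only.
const roughEstimate = (text: string): number => Math.ceil(text.length / 4);

const total = estimateMessagesTokens(
  [
    { role: "user", content: "What stack does the portfolio use?" }, // 34 chars -> 9
    { role: "assistant", content: "Next.js, TypeScript and Tailwind CSS." }, // 37 chars -> 10
  ],
  roughEstimate
);
// (9 + 4) + (10 + 4) = 27
```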

Splitting Old from New

When summarization triggers, messages are split into two groups:

Old messages → sent to the AI for summarization
Recent messages → kept as-is (using half the threshold as the recent budget)

export function splitMessagesForSummary(
  messages: Array<{ role: string; content: string }>,
  threshold: number = 40000
) {
  const recentTokenBudget = Math.floor(threshold / 2);
  let recentTokens = 0;
  let splitIndex = 0; // stays 0 (nothing to summarize) when every message fits the recent budget

  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4;
    if (recentTokens + msgTokens > recentTokenBudget) {
      splitIndex = i + 1;
      break;
    }
    recentTokens += msgTokens;
  }

  // Always keep at least the last 2 messages
  splitIndex = Math.min(splitIndex, Math.max(0, messages.length - 2));

  return [
    messages.slice(0, splitIndex),   // old → summarize
    messages.slice(splitIndex),       // recent → keep
  ];
}

What the Summary Preserves

The summarization prompt instructs the AI to retain four categories of information:

Core questions and needs — What was the user actually asking for?
Topics and conclusions — What was discussed and what was decided?
Preferences and background — What do we know about the user?
Unresolved issues — What's still pending?

The resulting summary is typically 300-500 words — a fraction of the original messages, but containing all the essential context.
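
The wording of the actual summarization prompt is not shown in this post. As an illustrative guess, a prompt builder covering the four categories could look like this; `buildSummaryPrompt` and every sentence of the prompt text are assumptions.

```typescript
type ChatMessage = { role: string; content: string };

// The four categories listed above.
const SUMMARY_CATEGORIES = [
  "Core questions and needs",
  "Topics and conclusions",
  "Preferences and background",
  "Unresolved issues",
];

// Hypothetical prompt builder: numbered categories followed by the transcript.
function buildSummaryPrompt(oldMessages: ChatMessage[]): string {
  const transcript = oldMessages
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
  const categories = SUMMARY_CATEGORIES
    .map((c, i) => `${i + 1}. ${c}`)
    .join("\n");
  return [
    "Summarize the conversation below in at most 500 words.",
    "Retain the following:",
    categories,
    "",
    "Conversation:",
    transcript,
  ].join("\n");
}

const summaryPrompt = buildSummaryPrompt([
  { role: "user", content: "I am looking for a senior frontend role." },
  { role: "assistant", content: "Noted. Which frameworks do you prefer?" },
]);
console.log(summaryPrompt);
```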

Layer 3: Key Information Extraction

Summarization preserves the narrative, but some information deserves to be extracted as structured, persistent context. Think of it as the difference between reading a story and having a cheat sheet.

What Gets Extracted

The extraction prompt produces categorized, tagged output:

[Preference] User prefers TypeScript over JavaScript
[Need] Looking for a senior frontend position
[Background] 5 years of React experience
[Follow-up] Still needs project deployment help

This structured format makes it easy for the AI to reference specific categories. When the user says "like I mentioned before," the AI can check the extracted preferences rather than scanning through the entire conversation.
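
Because every line follows the same `[Tag] text` shape, the output is also easy to parse if the application ever needs it as data. A small sketch, with `parseExtractedInfo` as a hypothetical helper and sample lines partly invented for the demo:

```typescript
type ExtractedInfo = Record<string, string[]>;

// Parses lines of the form "[Tag] text" into a map from tag to entries.
// Lines that do not match the pattern are ignored.
function parseExtractedInfo(raw: string): ExtractedInfo {
  const result: ExtractedInfo = {};
  for (const line of raw.split("\n")) {
    const match = /^\[([^\]]+)\]\s*(.+)$/.exec(line.trim());
    if (!match) continue;
    const tag = match[1];
    const text = match[2];
    if (!result[tag]) result[tag] = [];
    result[tag].push(text);
  }
  return result;
}

const info = parseExtractedInfo(
  [
    "[Preference] User prefers TypeScript over JavaScript",
    "[Background] 5 years of React experience",
    "[Preference] Likes concise answers",
  ].join("\n")
);
// info["Preference"] holds two entries, info["Background"] holds one
```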

When to Extract

Extraction is triggered at two points:

When summarization occurs — This is a natural checkpoint; we're already processing the conversation
When the user has sent ≥ 6 messages and no extraction exists yet — This catches cases where the conversation is substantive but hasn't hit the summarization threshold
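
The two triggers above can be sketched as a single predicate. The name `shouldExtract` and its signature are assumptions; only the two conditions come from the description.

```typescript
type ChatMessage = { role: string; content: string };

// Hypothetical helper; the post only describes the two triggers.
function shouldExtract(
  messages: ChatMessage[],
  existingInfo: string | null,
  summaryTriggered: boolean,
  minUserMessages: number = 6
): boolean {
  // Trigger 1: summarization just ran, so the conversation is already being processed.
  if (summaryTriggered) return true;
  // Trigger 2: enough user messages and nothing extracted yet.
  if (existingInfo) return false;
  const userCount = messages.filter((m) => m.role === "user").length;
  return userCount >= minUserMessages;
}
```

Only messages with the `user` role count toward the threshold, so a conversation with six user turns and six assistant replies triggers extraction at the sixth user message, not the sixth message overall.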

Incremental Merging

Like summaries, extracted info supports incremental updates. If extraction already exists, the new conversation is processed together with the existing info to produce an updated version — no information is lost in the merge.

Layer 4: System Prompt Compression

In resume mode, the system prompt includes the full portfolio context: resume content, basic info, skills, projects, and work experience. This can consume thousands of tokens, leaving less room for conversation.

The Compression Strategy

The compressed format transforms verbose sections into compact single-line entries:

Before:

## Projects
### Personal Portfolio
- Description: A modern portfolio website built with Next.js
- Tech Stack: Next.js, TypeScript, Tailwind CSS
- Featured Project
- AI Powered

After:

[Projects] Personal Portfolio[Next.js,TypeScript,Tailwind CSS]*AI

The resume content itself is truncated to the first 500 characters (with a ... indicator). Skills are listed as comma-separated names without proficiency levels. Projects drop descriptions entirely, keeping only the title and tech stack. Work experience compresses to role@company(period).

This typically reduces the portfolio context by about 60% while keeping the facts the AI needs most often: names, technologies, roles, and dates.
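
A sketch of formatters that would produce the compact lines described above. The `Project` and `Experience` shapes are hypothetical, and so is the reading of the flags: I am assuming `*` marks a featured project and a trailing `AI` marks an AI-powered one.

```typescript
// Hypothetical data shapes; the real portfolio types are not shown in this post.
interface Project {
  title: string;
  techStack: string[];
  featured?: boolean;
  aiPowered?: boolean;
}

interface Experience {
  role: string;
  company: string;
  period: string;
}

// Assumption: "*" marks a featured project, a trailing "AI" an AI-powered one.
function compressProject(p: Project): string {
  const flags = (p.featured ? "*" : "") + (p.aiPowered ? "AI" : "");
  return `${p.title}[${p.techStack.join(",")}]${flags}`;
}

function compressExperience(e: Experience): string {
  return `${e.role}@${e.company}(${e.period})`;
}

// Keeps the first maxChars UTF-16 code units and appends "..." only when cut.
function truncateResume(text: string, maxChars: number = 500): string {
  return text.length > maxChars ? text.slice(0, maxChars) + "..." : text;
}

const line =
  "[Projects] " +
  compressProject({
    title: "Personal Portfolio",
    techStack: ["Next.js", "TypeScript", "Tailwind CSS"],
    featured: true,
    aiPowered: true,
  });
// line === "[Projects] Personal Portfolio[Next.js,TypeScript,Tailwind CSS]*AI"
```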

Automatic Fallback

The system doesn't compress by default. It first calculates the token budget with the full system prompt. Only if the budget is insufficient for the conversation messages does it switch to the compressed version:

const messagesTokens = estimateMessagesTokens(activeMessages);
if (messagesTokens > tokenBudget && mode === 'resume') {
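  const compressedSystemPrompt = await getResumePrompt(locale, true);
  // Re-estimate the prompt size so the budget reflects the compressed version
  const compressedTokens = estimateTokens(compressedSystemPrompt);
  const compressedBudget = calculateTokenBudget(compressedTokens, ...);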

  if (compressedBudget > tokenBudget) {
    useCompressedPrompt = true;
    tokenBudget = compressedBudget;
  }
}

This means short conversations get the full, rich system prompt. Long conversations automatically downgrade to the compressed version to make room for more messages. The user never notices the switch.

How the Four Layers Work Together

Every incoming request passes through the same eight steps, which interleave the four layers:

Client sends: all messages + cached summary + cached extracted info
  │
  ├─ 1. Token count ≥ 40K?
  │     YES → Split messages, AI generates summary, keep recent messages only
  │
  ├─ 2. Summary triggered OR ≥ 6 user messages with no extraction?
  │     YES → AI extracts structured preferences/needs/background
  │
  ├─ 3. Calculate token budget
  │     Budget = Window - SystemPrompt - Summary - ExtractedInfo - Reserved
  │
  ├─ 4. Budget too small for all messages?
  │     YES → Switch to compressed system prompt, recalculate budget
  │
  ├─ 5. Select messages by budget (newest-first accumulation)
  │
  ├─ 6. Merge summary + extracted info into system prompt
  │
  ├─ 7. Call AI provider with selected messages + enriched system prompt
  │
  └─ 8. Return response + updated summary/extracted info via headers
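
Step 6 is the simplest of these to show in code. A minimal sketch, assuming the summary and extracted info are appended to the base prompt as labeled sections; the function name, the section labels, and the ordering are all assumptions.

```typescript
// Sketch of step 6: append whichever pieces of cached context exist.
function buildEnrichedSystemPrompt(
  basePrompt: string,
  summary: string | null,
  extractedInfo: string | null
): string {
  const sections = [basePrompt];
  if (summary) {
    sections.push("Summary of the earlier conversation:\n" + summary);
  }
  if (extractedInfo) {
    sections.push("Known facts about the user:\n" + extractedInfo);
  }
  // Blank lines between sections keep them visually distinct for the model.
  return sections.join("\n\n");
}

const enriched = buildEnrichedSystemPrompt(
  "You are the assistant for this portfolio site.",
  "The user asked about frontend roles.",
  "[Preference] User prefers TypeScript over JavaScript"
);
console.log(enriched);
```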

The Client-Server Contract

The summary and extracted info are stored on the client and sent with each request. When the server updates them, it returns the new values via response headers:

// Server: encode and send via headers
responseHeaders['X-Conversation-Summary'] = encodeURIComponent(currentSummary);
responseHeaders['X-Extracted-Info'] = encodeURIComponent(currentExtractedInfo);

// Client: read and update local state
const summaryHeader = response.headers.get('X-Conversation-Summary');
if (summaryHeader) {
  setConversationSummary(decodeURIComponent(summaryHeader));
}

This keeps the server stateless — no session storage, no database — while still maintaining context continuity across requests within a conversation.
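
The snippet above only reads the summary header. A slightly fuller sketch of both directions follows, using plain objects in place of the real request and response header types. The helper names are hypothetical, and the defensive `try`/`catch` is my addition, not part of the original code.

```typescript
const SUMMARY_HEADER = "X-Conversation-Summary";
const INFO_HEADER = "X-Extracted-Info";

// Server side: only set a header when there is something to send.
function encodeContextHeaders(
  summary: string | null,
  extractedInfo: string | null
): Record<string, string> {
  const headers: Record<string, string> = {};
  // encodeURIComponent keeps the value ASCII-only and free of line breaks,
  // which makes it safe to place in a header.
  if (summary) headers[SUMMARY_HEADER] = encodeURIComponent(summary);
  if (extractedInfo) headers[INFO_HEADER] = encodeURIComponent(extractedInfo);
  return headers;
}

// Client side: decode one header value, tolerating absent or malformed input.
function decodeContextHeader(value: string | undefined): string | null {
  if (!value) return null;
  try {
    return decodeURIComponent(value);
  } catch {
    // Malformed percent-encoding: treat as absent rather than crash the client.
    return null;
  }
}

const original = "用户喜欢 TypeScript\n[Need] Looking for a senior frontend position";
const sent = encodeContextHeaders(original, null);
const received = decodeContextHeader(sent[SUMMARY_HEADER]);
// received equals original; no X-Extracted-Info header is set
```

One caveat with this transport: percent-encoding turns each Chinese character into nine ASCII characters, and many servers and proxies cap total header size at a few kilobytes. If summaries grow long, moving them into the response body may be the safer choice.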

Results

Metric	Before	After
Context selection	Fixed 20 messages	Dynamic token-based budget
Long conversation support	Hard truncation, early context lost	Summary preserves key points
User preference retention	None	Structured extraction persists across requests
System prompt efficiency	Full portfolio always loaded	Auto-compresses when budget is tight
Model adaptability	Same limits for all models	Context window sized per model
Total length limit	10,000 chars (arbitrary)	Model context window × 3 chars/token

Lessons Learned

Heuristic token estimation is good enough for budgeting. You don't need exact counts to decide how many messages to include. A CJK-aware heuristic gets you within 10-15% of reality, which is more than sufficient for allocation decisions.
Summarization is expensive but necessary. Each summarization call adds latency and cost. The key is to trigger it only at thresholds, not on every request, and to cache the result on the client side for subsequent requests.
Compression has diminishing returns. Going from verbose to compact saves ~60%, but over-compressing loses nuance. A two-tier approach (full vs. compressed) is the right balance — you get the rich version when you can afford it, and the compact version when you can't.
Client-side state keeps the server simple. Storing summary and extracted info on the client means the server remains stateless. No Redis, no database, no session management. The trade-off is that refreshing the page loses the summary, which is acceptable for a portfolio chat assistant.
The four layers are complementary, not redundant. Token budgeting prevents overflow. Summarization preserves narrative. Extraction preserves facts. Prompt compression creates headroom. Each solves a different facet of the same problem, and they work best together.

