Introduction
Every developer who has built an LLM-powered chat app hits the same wall eventually: the model starts "forgetting" things said earlier in the conversation. My portfolio site's AI assistant was no exception. The original implementation used a hard-coded sliding window of 20 messages — anything beyond that was silently discarded. No summary, no compression, no preservation. Just gone.
This works fine for quick Q&A. But when a user has a real conversation — asking follow-up questions, referencing earlier points, building on previous answers — the hard cutoff becomes painfully obvious. The AI contradicts itself, asks for information the user already provided, or loses the thread entirely.
I decided to fix this properly. Not with a band-aid, but with a layered system where each layer solves a different aspect of the same problem. Here's what I built and why.
The Problem with Fixed Windows
The original code was as simple as it gets:
const MAX_CONTEXT_MESSAGES = 20;
const contextMessages = [...currentMessages, userMessage].slice(
  -MAX_CONTEXT_MESSAGES
);
This has several compounding issues:
| Problem | Why It Matters |
|---|---|
| Hard truncation | Everything before the window is gone — no recovery, no summary |
| No token awareness | 20 short messages and 20 long messages consume vastly different amounts of context |
| Bloated system prompt | In resume mode, the full portfolio context adds thousands of tokens to every request, shrinking the room left for conversation |
| Model-agnostic | Same limit regardless of whether the model has 128K or 200K tokens |
| Arbitrary total limit | A 10,000-character cap was pulled from thin air |
The fundamental insight: a message count is a terrible proxy for context consumption. What you actually need to manage is tokens.
Layer 1: Token Counting & Dynamic Window
Estimating Tokens Without a Tokenizer
Running a real tokenizer (like tiktoken) server-side would add a heavy dependency. Instead, I built a heuristic estimator that's accurate enough for budget allocation:
const CJK_REGEX = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;
export function estimateTokens(text: string): number {
  if (!text) return 0;
  let cjkCount = 0;
  let otherCount = 0;
  for (const char of text) {
    if (CJK_REGEX.test(char)) {
      cjkCount++;
    } else {
      otherCount++;
    }
  }
  // CJK characters ≈ 1.5 tokens, English/other ≈ 0.25 tokens
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}
This heuristic works because CJK characters typically tokenize to 1-2 tokens each in most modern tokenizers, while English text averages about 4 characters per token (hence 0.25 tokens per character). For a mixed-language chat, this gets you within 10-15% of reality — more than sufficient for deciding how many messages to include.
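To make the arithmetic concrete, here is a small worked example. It repeats the heuristic under a different name (`estimateTokensSketch`, chosen here so the snippet runs on its own) and applies it to three sample strings that are purely illustrative.

```typescript
const CJK_SKETCH = /[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]/;

// Same heuristic as estimateTokens above, repeated so the snippet is self-contained.
function estimateTokensSketch(text: string): number {
  if (!text) return 0;
  let cjkCount = 0;
  let otherCount = 0;
  for (const char of text) {
    if (CJK_SKETCH.test(char)) {
      cjkCount++;
    } else {
      otherCount++;
    }
  }
  return Math.ceil(cjkCount * 1.5 + otherCount * 0.25);
}

// 13 non-CJK characters: ceil(13 * 0.25) = ceil(3.25) = 4
const english = estimateTokensSketch('Hello, world!');

// 4 CJK characters: ceil(4 * 1.5) = 6
const chinese = estimateTokensSketch('你好世界');

// 3 CJK + 11 other (space + "TypeScript"): ceil(4.5 + 2.75) = 8
const mixed = estimateTokensSketch('我喜欢 TypeScript');

console.log({ english, chinese, mixed });
```

Because the result is rounded up per call, very short strings are slightly overestimated, which errs on the safe side for budgeting.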
Model-Aware Context Windows
Different models have wildly different context windows. A lookup table handles this:
const MODEL_CONTEXT_WINDOWS: Record<string, number> = {
  'glm-4-flash': 128000,
  'glm-4-long': 1000000,
  'gpt-4o-mini': 128000,
  'gpt-4o': 128000,
  'claude-3-haiku-20240307': 200000,
  'claude-3-5-sonnet-20241022': 200000,
};
The system auto-detects the active model from environment variables and picks the right window size. No more one-size-fits-all limits.
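The detection code isn't shown here, so the following is only a sketch of how the lookup might behave. The environment variable name `AI_MODEL`, the helper name `resolveContextWindow`, and the 128,000-token fallback for unknown models are all assumptions made for illustration.

```typescript
// Subset of the lookup table above, repeated so the snippet runs on its own.
const CONTEXT_WINDOWS: Record<string, number> = {
  'glm-4-flash': 128000,
  'gpt-4o-mini': 128000,
  'claude-3-haiku-20240307': 200000,
};

// Hypothetical fallback: assume the smallest window in the table for unknown models.
const DEFAULT_CONTEXT_WINDOW = 128000;

// `env` is passed in (rather than read from process.env) to keep the sketch testable.
function resolveContextWindow(env: Record<string, string | undefined>): number {
  const model = env['AI_MODEL']; // hypothetical variable name
  if (model === undefined) return DEFAULT_CONTEXT_WINDOW;
  const known = Object.prototype.hasOwnProperty.call(CONTEXT_WINDOWS, model)
    ? CONTEXT_WINDOWS[model]
    : undefined;
  return typeof known === 'number' ? known : DEFAULT_CONTEXT_WINDOW;
}

console.log(resolveContextWindow({ AI_MODEL: 'claude-3-haiku-20240307' }));
console.log(resolveContextWindow({ AI_MODEL: 'some-new-model' }));
```

Falling back to a small window is the cautious choice: an underestimate wastes some context, while an overestimate risks a rejected request.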
Calculating the Token Budget
The available budget for conversation messages is what's left after accounting for everything else:
Budget = Context Window - System Prompt - Summary - Extracted Info - Reserved Response
The 4,096 reserved response tokens ensure the model always has room to generate a reply, even if the context is nearly full.
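The formula can be sketched as a small function. The real `calculateTokenBudget` signature isn't shown in this post, so the name `calculateBudget`, the object-style parameter, and the clamp at zero are assumptions for illustration.

```typescript
interface BudgetInput {
  contextWindow: number;
  systemPromptTokens: number;
  summaryTokens: number;
  extractedInfoTokens: number;
  reservedResponseTokens?: number; // defaults to 4,096 as described above
}

const RESERVED_RESPONSE_TOKENS = 4096;

// Budget = Context Window - System Prompt - Summary - Extracted Info - Reserved Response
function calculateBudget(input: BudgetInput): number {
  const reserved =
    input.reservedResponseTokens !== undefined
      ? input.reservedResponseTokens
      : RESERVED_RESPONSE_TOKENS;
  const budget =
    input.contextWindow -
    input.systemPromptTokens -
    input.summaryTokens -
    input.extractedInfoTokens -
    reserved;
  // Assumption: never report a negative budget; callers treat 0 as "no room".
  return Math.max(0, budget);
}

// 128000 - 6000 - 600 - 200 - 4096 = 117104
const exampleBudget = calculateBudget({
  contextWindow: 128000,
  systemPromptTokens: 6000,
  summaryTokens: 600,
  extractedInfoTokens: 200,
});
console.log(exampleBudget);
```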
Selecting Messages by Budget, Not by Count
Instead of keeping the last N messages, we now walk backwards from the newest message, accumulating tokens until we hit the budget:
export function selectMessagesByBudget(
  messages: Array<{ role: string; content: string }>,
  budget: number
): Array<{ role: string; content: string }> {
  let usedTokens = 0;
  const selected: Array<{ role: string; content: string }> = [];
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4; // +4 for role overhead
    if (usedTokens + msgTokens > budget) break;
    usedTokens += msgTokens;
    selected.unshift(messages[i]);
  }
  return selected;
}
This maximizes context utilization regardless of message length. A conversation with short messages naturally includes more of them; a conversation with long messages includes fewer — but always fits within the budget.
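A quick way to see the effect is to run the same budget against a chat of short messages and a chat of long ones. The sketch below uses a simplified ASCII-only stand-in for `estimateTokens` (`roughTokens`) and a copy of the selection loop (`selectByBudget`); both names, the synthetic messages, and the 300-token budget are illustrative assumptions.

```typescript
type ChatMessage = { role: string; content: string };

// Stand-in for estimateTokens: non-CJK branch only, about 4 characters per token.
function roughTokens(text: string): number {
  return Math.ceil(text.length * 0.25);
}

// Same newest-first walk as selectMessagesByBudget, repeated so this runs alone.
function selectByBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  let usedTokens = 0;
  const selected: ChatMessage[] = [];
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = roughTokens(messages[i].content) + 4;
    if (usedTokens + msgTokens > budget) break;
    usedTokens += msgTokens;
    selected.unshift(messages[i]);
  }
  return selected;
}

// Build `count` synthetic messages of `charsEach` characters each.
function makeMessages(count: number, charsEach: number): ChatMessage[] {
  const content = new Array(charsEach + 1).join('x');
  const out: ChatMessage[] = [];
  for (let i = 0; i < count; i++) {
    out.push({ role: i % 2 === 0 ? 'user' : 'assistant', content: content });
  }
  return out;
}

const shortChat = makeMessages(30, 40); // 10 + 4 = 14 tokens per message
const longChat = makeMessages(30, 400); // 100 + 4 = 104 tokens per message

const shortKept = selectByBudget(shortChat, 300).length; // 21 fit (21 * 14 = 294)
const longKept = selectByBudget(longChat, 300).length; // 2 fit (2 * 104 = 208)
console.log({ shortKept, longKept });
```

With a fixed 20-message window, both chats would have kept 20 messages, and the long one would have cost roughly 2,080 estimated tokens against a 300-token budget.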
Layer 2: Conversation Summarization
Dynamic window sizing prevents overflow, but it doesn't solve the information loss problem. When old messages are excluded from the budget, their content is still gone. Summarization fixes this by compressing old messages into a compact narrative that persists.
When to Trigger
I set the threshold at 40,000 tokens — roughly 30% of a 128K context window. This gives enough headroom for the summary itself plus recent messages plus the system prompt.
export function shouldSummarize(
  messages: Array<{ role: string; content: string }>,
  existingSummary: string | null,
  threshold: number = 40000
): boolean {
  if (existingSummary) return false; // Don't re-summarize if we already have one
  const messagesTokens = estimateMessagesTokens(messages);
  return messagesTokens >= threshold;
}
Key design decision: once a summary exists, we don't re-summarize on every request. The summary is maintained incrementally and stored on the client side.
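`shouldSummarize` calls `estimateMessagesTokens`, which isn't shown in this post. A plausible sketch, assuming it sums the per-message estimate plus the same +4 role overhead used during selection, might look like this. The `roughTokens` helper is a simplified ASCII-only stand-in for `estimateTokens`.

```typescript
type ChatMessage = { role: string; content: string };

// Stand-in for estimateTokens: non-CJK branch only, about 4 characters per token.
function roughTokens(text: string): number {
  return Math.ceil(text.length * 0.25);
}

// Assumed behavior: total of per-message estimates, each with +4 for role overhead.
function estimateMessagesTokens(messages: ChatMessage[]): number {
  let total = 0;
  for (const msg of messages) {
    total += roughTokens(msg.content) + 4;
  }
  return total;
}

const sample: ChatMessage[] = [
  { role: 'user', content: 'What stack does the portfolio site use?' }, // 39 chars: 10 + 4
  { role: 'assistant', content: 'Next.js.' }, // 8 chars: 2 + 4
];
const sampleTotal = estimateMessagesTokens(sample); // 14 + 6 = 20
console.log(sampleTotal);
```

Using the same overhead constant in both places keeps the threshold check and the selection loop in agreement about what a message costs.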
Splitting Old from New
When summarization triggers, messages are split into two groups:
- Old messages → sent to the AI for summarization
- Recent messages → kept as-is (using half the threshold as the recent budget)
export function splitMessagesForSummary(
  messages: Array<{ role: string; content: string }>,
  threshold: number = 40000
) {
  const recentTokenBudget = Math.floor(threshold / 2);
  let recentTokens = 0;
  // Start at 0: if every message fits in the recent budget, nothing counts as old
  let splitIndex = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4;
    if (recentTokens + msgTokens > recentTokenBudget) {
      splitIndex = i + 1;
      break;
    }
    recentTokens += msgTokens;
  }
  // Always keep at least the last 2 messages
  splitIndex = Math.min(splitIndex, Math.max(0, messages.length - 2));
  return [
    messages.slice(0, splitIndex), // old → summarize
    messages.slice(splitIndex), // recent → keep
  ];
}
What the Summary Preserves
The summarization prompt instructs the AI to retain four categories of information:
- Core questions and needs — What was the user actually asking for?
- Topics and conclusions — What was discussed and what was decided?
- Preferences and background — What do we know about the user?
- Unresolved issues — What's still pending?
The resulting summary is typically 300-500 words: a fraction of the original messages, but enough to carry the essential context forward.
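The exact prompt isn't reproduced here. As a rough sketch, a prompt builder covering the four categories above could look like the following; the function name `buildSummaryPrompt` and the wording of the instructions are illustrative assumptions, not the production prompt.

```typescript
type ChatMessage = { role: string; content: string };

// Hypothetical prompt builder: lists the four categories, then the transcript.
function buildSummaryPrompt(oldMessages: ChatMessage[]): string {
  const transcript = oldMessages
    .map((m) => m.role + ': ' + m.content)
    .join('\n');
  return [
    'Summarize the conversation below in 300-500 words. Preserve:',
    '1. Core questions and needs',
    '2. Topics and conclusions',
    '3. Preferences and background',
    '4. Unresolved issues',
    '',
    'Conversation:',
    transcript,
  ].join('\n');
}

const summaryPrompt = buildSummaryPrompt([
  { role: 'user', content: 'Which projects use TypeScript?' },
  { role: 'assistant', content: 'The portfolio site does.' },
]);
console.log(summaryPrompt);
```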
Layer 3: Key Information Extraction
Summarization preserves the narrative, but some information deserves to be extracted as structured, persistent context. Think of it as the difference between reading a story and having a cheat sheet.
What Gets Extracted
The extraction prompt produces categorized, tagged output:
[Preference] User prefers TypeScript over JavaScript
[Need] Looking for a senior frontend position
[Background] 5 years of React experience
[Follow-up] Still needs project deployment help
This structured format makes it easy for the AI to reference specific categories. When the user says "like I mentioned before," the AI can check the extracted preferences rather than scanning through the entire conversation.
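If you ever need the extracted info in code rather than in the prompt, the `[Tag] text` format is easy to parse. The sketch below groups lines by tag; `parseExtractedInfo` is a hypothetical helper, and skipping lines that don't match the format is an assumption about how to handle stray model output.

```typescript
type ExtractedInfo = Record<string, string[]>;

// Group "[Tag] text" lines by tag; anything not in that form is ignored.
function parseExtractedInfo(text: string): ExtractedInfo {
  const result: ExtractedInfo = {};
  for (const line of text.split('\n')) {
    const match = /^\[([^\]]+)\]\s*(.+)$/.exec(line.trim());
    if (!match) continue;
    const tag = match[1];
    const value = match[2];
    if (!Object.prototype.hasOwnProperty.call(result, tag)) {
      result[tag] = [];
    }
    result[tag].push(value);
  }
  return result;
}

const parsed = parseExtractedInfo(
  [
    '[Preference] User prefers TypeScript over JavaScript',
    '[Need] Looking for a senior frontend position',
    '[Background] 5 years of React experience',
    '[Follow-up] Still needs project deployment help',
  ].join('\n')
);
console.log(parsed['Preference']);
```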
When to Extract
Extraction is triggered at two points:
- When summarization occurs — This is a natural checkpoint; we're already processing the conversation
- When the user has sent ≥ 6 messages and no extraction exists yet — This catches cases where the conversation is substantive but hasn't hit the summarization threshold
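The two triggers above can be sketched as a single predicate. The name `shouldExtract`, its parameters, and the assumption that user messages carry `role === 'user'` are illustrative; the post doesn't show the actual check.

```typescript
type ChatMessage = { role: string; content: string };

const MIN_USER_MESSAGES = 6;

function shouldExtract(
  messages: ChatMessage[],
  summaryTriggered: boolean,
  existingExtraction: string | null
): boolean {
  // Trigger 1: summarization is already running, so extract at the same checkpoint.
  if (summaryTriggered) return true;
  // Trigger 2 only applies when nothing has been extracted yet.
  if (existingExtraction) return false;
  const userCount = messages.filter((m) => m.role === 'user').length;
  return userCount >= MIN_USER_MESSAGES;
}

// Build a chat with `n` user messages, each followed by an assistant reply.
function chatWithUserTurns(n: number): ChatMessage[] {
  const out: ChatMessage[] = [];
  for (let i = 0; i < n; i++) {
    out.push({ role: 'user', content: 'question ' + i });
    out.push({ role: 'assistant', content: 'answer ' + i });
  }
  return out;
}

console.log(shouldExtract(chatWithUserTurns(6), false, null));
console.log(shouldExtract(chatWithUserTurns(5), false, null));
```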
Incremental Merging
Like summaries, extracted info supports incremental updates. If an extraction already exists, the new conversation is processed together with the existing info to produce an updated version, so earlier entries carry over instead of being overwritten.
Layer 4: System Prompt Compression
In resume mode, the system prompt includes the full portfolio context: resume content, basic info, skills, projects, and work experience. This can consume thousands of tokens, leaving less room for conversation.
The Compression Strategy
The compressed format transforms verbose sections into compact single-line entries:
Before:
## Projects
### Personal Portfolio
- Description: A modern portfolio website built with Next.js
- Tech Stack: Next.js, TypeScript, Tailwind CSS
- Featured Project
- AI Powered
After:
[Projects] Personal Portfolio[Next.js,TypeScript,Tailwind CSS]*AI
The resume content itself is truncated to the first 500 characters (with a ... indicator). Skills are listed as comma-separated names without proficiency levels. Projects drop descriptions entirely, keeping only the title and tech stack. Work experience compresses to role@company(period).
This typically reduces the portfolio context by about 60% while preserving the facts the AI most often needs to answer questions.
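The rules above can be sketched as a few small formatters. The data shapes (`Project`, `Job`), the function names, the `[Experience]` label, the `; ` separator between entries, and the reading of `*` as "featured" and `AI` as "AI powered" are all assumptions inferred from the before/after example.

```typescript
interface Project {
  title: string;
  techStack: string[];
  featured?: boolean;
  aiPowered?: boolean;
}

interface Job {
  role: string;
  company: string;
  period: string;
}

// Keep the first 500 characters and mark the cut with "...".
function truncateResume(text: string, maxChars: number = 500): string {
  return text.length > maxChars ? text.slice(0, maxChars) + '...' : text;
}

// Title[stack] plus flags; descriptions are dropped entirely.
function compressProjects(projects: Project[]): string {
  const entries = projects.map(
    (p) =>
      p.title +
      '[' + p.techStack.join(',') + ']' +
      (p.featured ? '*' : '') +
      (p.aiPowered ? 'AI' : '')
  );
  return '[Projects] ' + entries.join('; ');
}

// role@company(period)
function compressExperience(jobs: Job[]): string {
  const entries = jobs.map((j) => j.role + '@' + j.company + '(' + j.period + ')');
  return '[Experience] ' + entries.join('; ');
}

const projectsLine = compressProjects([
  {
    title: 'Personal Portfolio',
    techStack: ['Next.js', 'TypeScript', 'Tailwind CSS'],
    featured: true,
    aiPowered: true,
  },
]);
console.log(projectsLine);
```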
Automatic Fallback
The system doesn't compress by default. It first calculates the token budget with the full system prompt. Only if the budget is insufficient for the conversation messages does it switch to the compressed version:
const messagesTokens = estimateMessagesTokens(activeMessages);
if (messagesTokens > tokenBudget && mode === 'resume') {
  const compressedSystemPrompt = await getResumePrompt(locale, true);
  // Re-estimate the prompt cost using the compressed version
  const compressedTokens = estimateTokens(compressedSystemPrompt);
  const compressedBudget = calculateTokenBudget(compressedTokens, ...);
  // Only switch if compression actually frees up room
  if (compressedBudget > tokenBudget) {
    useCompressedPrompt = true;
    tokenBudget = compressedBudget;
  }
}
This means short conversations get the full, rich system prompt. Long conversations automatically downgrade to the compressed version to make room for more messages. The user never notices the switch.
How the Four Layers Work Together
Every incoming request passes through all four layers. Note that the runtime order differs from the order in which the layers were introduced above:
Client sends: all messages + cached summary + cached extracted info
│
├─ 1. Token count ≥ 40K?
│ YES → Split messages, AI generates summary, keep recent messages only
│
├─ 2. Summary triggered OR ≥ 6 user messages with no extraction?
│ YES → AI extracts structured preferences/needs/background
│
├─ 3. Calculate token budget
│ Budget = Window - SystemPrompt - Summary - ExtractedInfo - Reserved
│
├─ 4. Budget too small for all messages?
│ YES → Switch to compressed system prompt, recalculate budget
│
├─ 5. Select messages by budget (newest-first accumulation)
│
├─ 6. Merge summary + extracted info into system prompt
│
├─ 7. Call AI provider with selected messages + enriched system prompt
│
└─ 8. Return response + updated summary/extracted info via headers
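Step 6 of the flow above is the simplest to sketch. The function name `enrichSystemPrompt` and the section labels are illustrative assumptions; the actual wording used in the system prompt isn't shown in this post.

```typescript
// Append the summary and extracted info to the base prompt, skipping empty parts.
function enrichSystemPrompt(
  basePrompt: string,
  summary: string | null,
  extractedInfo: string | null
): string {
  const sections: string[] = [basePrompt];
  if (summary) {
    sections.push('Summary of the earlier conversation:\n' + summary);
  }
  if (extractedInfo) {
    sections.push('Known facts about the user:\n' + extractedInfo);
  }
  return sections.join('\n\n');
}

const enriched = enrichSystemPrompt(
  'You are the portfolio assistant.',
  'The user asked about frontend roles.',
  '[Preference] User prefers TypeScript over JavaScript'
);
console.log(enriched);
```

Both additions count against the token budget, which is why the budget formula subtracts them before any messages are selected.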
The Client-Server Contract
The summary and extracted info are stored on the client and sent with each request. When the server updates them, it returns the new values via response headers:
// Server: encode and send via headers
responseHeaders['X-Conversation-Summary'] = encodeURIComponent(currentSummary);
responseHeaders['X-Extracted-Info'] = encodeURIComponent(currentExtractedInfo);
// Client: read and update local state
const summaryHeader = response.headers.get('X-Conversation-Summary');
if (summaryHeader) {
  setConversationSummary(decodeURIComponent(summaryHeader));
}
This keeps the server stateless — no session storage, no database — while still maintaining context continuity across requests within a conversation.
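The encoding step matters because HTTP header values can't safely carry newlines or non-ASCII text such as CJK. Here is a minimal round-trip sketch using plain objects in place of real header containers; the helper names `writeContextHeaders` and `readContextHeader` are hypothetical, and returning `null` on a malformed value is an assumption.

```typescript
type HeaderMap = Record<string, string>;

// Server side: percent-encode so the value is header-safe ASCII.
function writeContextHeaders(
  headers: HeaderMap,
  summary: string | null,
  extractedInfo: string | null
): HeaderMap {
  if (summary) headers['X-Conversation-Summary'] = encodeURIComponent(summary);
  if (extractedInfo) headers['X-Extracted-Info'] = encodeURIComponent(extractedInfo);
  return headers;
}

// Client side: decode, treating a missing or malformed value as "no update".
function readContextHeader(headers: HeaderMap, name: string): string | null {
  const raw = headers[name];
  if (!raw) return null;
  try {
    return decodeURIComponent(raw);
  } catch (e) {
    return null; // decodeURIComponent throws URIError on malformed escapes
  }
}

const originalInfo = '[Preference] 用户偏好 TypeScript\n[Need] Senior frontend role';
const sent = writeContextHeaders({}, null, originalInfo);
const received = readContextHeader(sent, 'X-Extracted-Info');
console.log(received === originalInfo);
```

One thing to keep in mind with this design: percent-encoding inflates non-ASCII text considerably, and servers and proxies commonly cap total header size, so a long summary may be a better fit for the response body.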
Results
| Metric | Before | After |
|---|---|---|
| Context selection | Fixed 20 messages | Dynamic token-based budget |
| Long conversation support | Hard truncation, early context lost | Summary preserves key points |
| User preference retention | None | Structured extraction persists across requests |
| System prompt efficiency | Full portfolio always loaded | Auto-compresses when budget is tight |
| Model adaptability | Same limits for all models | Context window sized per model |
| Total length limit | 10,000 chars (arbitrary) | Model context window × 3 chars/token |
Lessons Learned
- Heuristic token estimation is good enough for budgeting. You don't need exact counts to decide how many messages to include. A CJK-aware heuristic gets you within 10-15% of reality, which is more than sufficient for allocation decisions.
- Summarization is expensive but necessary. Each summarization call adds latency and cost. The key is to trigger it only at thresholds, not on every request, and to cache the result on the client side for subsequent requests.
- Compression has diminishing returns. Going from verbose to compact saves ~60%, but over-compressing loses nuance. A two-tier approach (full vs. compressed) is the right balance: you get the rich version when you can afford it, and the compact version when you can't.
- Client-side state keeps the server simple. Storing summary and extracted info on the client means the server remains stateless. No Redis, no database, no session management. The cost is that refreshing the page loses the summary, which is acceptable for a portfolio chat assistant.
- The four layers are complementary, not redundant. Token budgeting prevents overflow. Summarization preserves narrative. Extraction preserves facts. Prompt compression creates headroom. Each solves a different facet of the same problem, and they work best together.