Comprehensive Guide to Cost Reduction and Efficiency in LLM Applications
Research current as of: January 2026
Token optimization has become a critical competency for organizations deploying LLM applications at scale. With AI workloads projected to exceed $840 billion by 2026, and output tokens costing 3-10x more than input tokens, systematic optimization strategies are no longer optional—they're essential for sustainable AI operations.
The fundamental asymmetry in LLM pricing, where output tokens cost 3-10x more than input tokens (4-5x for the models listed below), creates the primary opportunity for optimization. This premium reflects the computational cost of token-by-token generation versus processing the input.
| Provider / Model | Input (per 1M tokens) | Output (per 1M tokens) | Premium Ratio | Context Window |
|---|---|---|---|---|
| Anthropic Claude Opus 4.5 | $5.00 | $25.00 | 5x | 200K tokens |
| Anthropic Claude Sonnet 4.5 | $3.00 | $15.00 | 5x | 200K tokens |
| Anthropic Claude Haiku 4.5 | $1.00 | $5.00 | 5x | 200K tokens |
| OpenAI GPT-4o | $2.50 | $10.00 | 4x | 128K tokens |
| OpenAI o1 (reasoning) | $15.00 | $60.00 | 4x | 128K tokens |
| Provider | Cache Write Multiplier | Cache Read Multiplier | Effective Savings | Cache TTL |
|---|---|---|---|---|
| Anthropic (5-min cache) | 1.25x base input | 0.1x base input | 90% on cache hits | 5 minutes |
| Anthropic (1-hour cache) | 2.0x base input | 0.1x base input | 90% on cache hits | 1 hour |
| OpenAI | 1.0x base input | 0.5x base input | 50% on cache hits | 5-10 minutes |
Example: At $60/M output tokens, a seemingly simple 100-token response that also generates 2,000 hidden reasoning tokens is billed for 2,100 output tokens and costs about $0.126 instead of the expected $0.006, a roughly 21x cost multiplier.
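As a rough sketch of how these rates translate into per-request costs, the helper below applies the prices and cache multipliers from the tables above. The function name, price table, and structure are illustrative, not part of any provider SDK.

# Illustrative price table (USD per million tokens), copied from the tables above
PRICES = {
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    "o1": {"input": 15.00, "output": 60.00},
}

def estimate_request_cost(model, input_tokens, output_tokens,
                          cache_read_tokens=0, cache_write_tokens=0):
    p = PRICES[model]
    return (
        input_tokens * p["input"]
        + output_tokens * p["output"]
        + cache_read_tokens * p["input"] * 0.1    # cache hits billed at 10% of input rate
        + cache_write_tokens * p["input"] * 1.25  # 5-minute cache writes carry a 25% premium
    ) / 1_000_000

# The o1 example above: 100 visible + 2,000 reasoning tokens are all billed as output
print(estimate_request_cost("o1", input_tokens=0, output_tokens=2_100))  # ~$0.126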
Understanding the financial impact of optimization strategies requires modeling realistic usage patterns:
Scenario: customer support app
Before optimization: $62.50/day ($22,813/year)
After applying the optimizations described below: $8.75/day ($3,194/year), an 86% reduction
Prompt compression reduces the length and complexity of input prompts while preserving semantic meaning and task performance. Several families of techniques have emerged, chiefly:
Hard prompt compression: these methods remove redundant tokens from natural language prompts while retaining semantic meaning.
Soft prompt compression: these methods encode prompts into continuous trainable embeddings or key-value pairs.
Before compression (verbose prompt):
System: You are a helpful customer service agent for Acme Corp.
Our company values excellent customer service. We sell widgets,
gadgets, and various products. Our return policy allows returns
within 30 days. We offer free shipping on orders over $50.
Customer satisfaction is our top priority. We have a dedicated
support team available 24/7 to assist with any questions or
concerns. Our products come with a one-year warranty...
[Additional 600 tokens of context and instructions]
User: How do I return a defective widget?
Token Count: ~800 tokens
Cost: $0.002 (at $2.50/M input tokens)
After compression:
customer service Acme Corp. return policy 30 days
free shipping $50. support 24/7. warranty one-year
User: return defective widget?
Token Count: ~40 tokens (20x compression)
Cost: $0.0001 (at $2.50/M input tokens)
Savings: 95% cost reduction
LLMLingua: Microsoft's open-source solution for prompt compression, integrated with LlamaIndex.
Extreme compression for specialized use cases using soft prompt methods.
Comprehensive toolkit for prompt compression with multiple algorithms.
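A minimal sketch of hard prompt compression with the llmlingua package. The LLMLingua-2 model name and parameters shown are typical defaults; exact arguments may vary by library version.

from llmlingua import PromptCompressor

# LLMLingua-2: a small bidirectional model trained for task-agnostic compression
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_context = "You are a helpful customer service agent for Acme Corp. ..."  # the ~800-token prompt above

result = compressor.compress_prompt(
    long_context,
    instruction="Answer the customer's question using the policy context.",
    question="How do I return a defective widget?",
    target_token=100,  # aim for roughly 8x compression on this example
)

print(result["compressed_prompt"])  # shortened context to place in the LLM call
# The result also reports original vs. compressed token counts for monitoring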
Prompt caching allows LLM providers to store and reuse frequently accessed context, offering dramatic cost reductions (50-90%) and latency improvements (up to 85%) for applications with repeated context. This is particularly valuable for:
Claude's prompt caching (documented in Anthropic's "Prompt Caching with Claude" guide) processes the prompt in order: tools → system → messages, marking cacheable sections with the cache_control parameter.
| Cache Type | TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|---|
| 5-minute cache | 5 minutes (refreshed on use) | 1.25x base input | 0.1x base input | High-frequency, short-duration workloads |
| 1-hour cache | 1 hour (refreshed on use) | 2.0x base input | 0.1x base input | Sustained workloads, batch processing |
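A quick break-even sketch using the multipliers above; the base input price and token counts are placeholders. With a 1.25x write and 0.1x read multiplier, the 5-minute cache pays for itself from the second request inside the TTL.

def cached_vs_uncached(context_tokens, requests, base_price_per_m=3.00,
                       write_mult=1.25, read_mult=0.1):
    # Without caching: every request re-sends the full context at the base input rate
    uncached = requests * context_tokens * base_price_per_m / 1e6
    # With caching: one cache write, then cheap cache reads for the remaining requests
    cached = (context_tokens * write_mult
              + (requests - 1) * context_tokens * read_mult) * base_price_per_m / 1e6
    return round(uncached, 4), round(cached, 4)

# 10K-token stable context, 100 requests within the TTL, Sonnet input pricing
print(cached_vs_uncached(10_000, 100))  # (3.0, 0.3345) -> roughly 89% savings on that context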
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-5-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an AI assistant for Acme Corp...",
},
{
"type": "text",
"text": "Here is our product catalog: [10,000 tokens of product data]",
"cache_control": {"type": "ephemeral"} # Cache this section
}
],
messages=[
{"role": "user", "content": "What's the price of widget XYZ?"}
]
)
OpenAI's approach (documented in "Prompt Caching in the API") is simpler: it automatically caches the longest previously computed prefix of a prompt, with no API changes required.
// Variable content first
User query: "What is product X?"
// Static content last (not cached)
System: Long instructions...
Tools: [tool definitions...]
Knowledge: [product catalog...]
Cache hit rate: ~0%
// Static content first (cached)
Tools: [tool definitions...]
System: Long instructions...
Knowledge: [product catalog...]
// Variable content last
User query: "What is product X?"
Cache hit rate: ~90%+
Process multiple queries against the same context within the cache TTL window to maximize cache utilization:
# Batch-related queries within 5-minute window
questions = [
"What's the price of widget A?",
"Does widget B come with warranty?",
"Compare widgets C and D"
]
# Process rapidly to hit cache
for question in questions:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
system=[...cached_context...],
messages=[{"role": "user", "content": question}]
)
# 90% cost reduction on cached tokens after first call
Track cache performance to optimize your implementation:
| Scenario | Cache Hit Rate | Requests/Hour | Cost Without Cache | Cost With Cache | Savings |
|---|---|---|---|---|---|
| RAG Q&A (10K context) | 80% | 1,000 | $30.00 | $4.50 | 85% |
| Document Analysis | 90% | 500 | $75.00 | $9.75 | 87% |
| Agent with Tools (5K context) | 85% | 2,000 | $30.00 | $5.25 | 82.5% |
| Low-frequency (poor fit) | 20% | 100 | $3.00 | $3.15 | -5% |
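Following up on the tracking point above, a sketch of computing hit rate from the usage fields the Anthropic API returns on each response (input_tokens, cache_read_input_tokens, cache_creation_input_tokens). The accumulator class itself is illustrative.

class CacheStats:
    """Accumulate cache metrics across requests from Anthropic usage objects."""

    def __init__(self):
        self.cache_read = 0
        self.cache_write = 0
        self.uncached_input = 0

    def record(self, usage):
        self.cache_read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.cache_write += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.uncached_input += usage.input_tokens

    @property
    def hit_rate(self):
        total = self.cache_read + self.cache_write + self.uncached_input
        return self.cache_read / total if total else 0.0

stats = CacheStats()
# After each call: stats.record(response.usage)
# Alert if stats.hit_rate drops below your target (e.g. 0.7); low hit rates mean caching may not fit the workload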
For an estimated 70-80% of production workloads, mid-tier models perform comparably to premium models. Intelligent routing directs simple queries to cost-effective models while reserving expensive frontier models for complex reasoning tasks.
Analyze query complexity and route accordingly:
| Query Complexity | Indicators | Routed Model | Example |
|---|---|---|---|
| Simple | Short, factual, FAQ-like | Haiku ($1/$5) | "What are your hours?" |
| Medium | Multi-step, some reasoning | Sonnet ($3/$15) or GPT-4o ($2.50/$10) | "Compare these two products" |
| Complex | Deep reasoning, analysis, code | Opus ($5/$25) | "Analyze this legal contract" |
| Advanced Reasoning | Multi-step logic, math, planning | o1 ($15/$60) - use sparingly! | "Solve this proof" |
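As one illustration of the table above, here is a toy heuristic router. The keyword lists and length thresholds are placeholders to tune against your own traffic, not a production classifier.

def pick_model(query: str) -> str:
    """Map a query to a model tier with a cheap keyword/length heuristic."""
    q = query.lower()
    reasoning_markers = ("prove", "step by step", "derive", "optimize", "plan out")
    complex_markers = ("analyze", "compare", "evaluate", "contract", "architecture")

    if any(m in q for m in reasoning_markers):
        return "o1"                   # advanced reasoning: use sparingly
    if any(m in q for m in complex_markers) or len(q.split()) > 80:
        return "claude-opus-4-5"      # deep analysis, code, long multi-part asks
    if len(q.split()) > 20:
        return "claude-sonnet-4-5"    # medium complexity
    return "claude-haiku-4-5"         # short, factual, FAQ-like

print(pick_model("What are your hours?"))          # claude-haiku-4-5
print(pick_model("Analyze this legal contract."))  # claude-opus-4-5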
Attempt with smaller model first, escalate only on failure:
async def cascade_route(query, context):
# Try Haiku first (cheap, fast)
haiku_response = await call_haiku(query, context)
# Check confidence/quality
if haiku_response.confidence > 0.85:
return haiku_response # 70% of queries stop here
# Escalate to Sonnet for medium complexity
sonnet_response = await call_sonnet(query, context)
if sonnet_response.confidence > 0.90:
return sonnet_response # 25% need this tier
# Final escalation to Opus for complex tasks
return await call_opus(query, context) # Only 5% reach here
Use lightweight classifier to determine query intent and route accordingly:
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer  # named SemanticRouter in newer releases

# Define routes by intent; the mapping to a model tier lives in a plain dict
faq_route = Route(
    name="faq",
    utterances=["hours", "location", "contact", "price"]
)
support_route = Route(
    name="support",
    utterances=["problem", "not working", "error", "help"]
)
analysis_route = Route(
    name="analysis",
    utterances=["compare", "analyze", "evaluate", "recommend"]
)

ROUTE_TO_MODEL = {
    "faq": "claude-haiku-4-5",
    "support": "claude-sonnet-4-5",
    "analysis": "claude-opus-4-5",
}

router = RouteLayer(
    encoder=OpenAIEncoder(),
    routes=[faq_route, support_route, analysis_route]
)

choice = router(user_query)  # Fast, cheap embedding-based classification
model = ROUTE_TO_MODEL.get(choice.name, "claude-sonnet-4-5")  # Fall back to the mid tier
response = await call_model(model, user_query)
Without routing (all traffic on a single premium-tier model):
Daily cost: $62.50/day = $22,813/year
With complexity-based routing across Haiku, Sonnet, and Opus:
Daily cost: $27.50/day = $10,038/year
Savings: 56% ($12,775/year)
Training-based router using RL to optimize cost/quality tradeoffs dynamically.
Fast, lightweight intent classification for routing decisions.
Multi-provider routing with load balancing and fallbacks.
Retrieval-Augmented Generation (RAG) applications face a unique token challenge: retrieved context often constitutes 70-90% of total input tokens. Optimizing RAG pipelines can deliver 50-80% cost reduction while improving accuracy.
Chunking determines how documents are segmented for indexing and retrieval. Poor chunking fragments context, retrieves irrelevant passages, and wastes tokens. Common strategies compare as follows:
| Strategy | Accuracy | Precision | Recall | Best For |
|---|---|---|---|---|
| Adaptive Chunking | 87% | 7.5 | 89% | Clinical/technical domains |
| Semantic Chunking | 85% | 7.8 | 88% | General purpose, coherent content |
| Recursive Token-Based (R100-0) | 82% | 7.2 | 86% | Balanced performance/efficiency |
| ClusterSemantic (400 tokens) | 80% | 7.0 | 91.3% | High recall requirements |
| ClusterSemantic (200 tokens) | 78% | 8.0 | 85% | High precision requirements |
| Fixed Size (baseline) | 50% | 5.0 | 70% | Quick prototypes only |
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Semantic chunking based on embedding similarity
embeddings = OpenAIEmbeddings()
text_splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile", # Use percentile-based threshold
breakpoint_threshold_amount=95, # Split at 95th percentile similarity
number_of_chunks=None # Let it determine optimal chunk count
)
# Process documents
chunks = text_splitter.create_documents([long_document])
# Result: Semantically coherent chunks (avg 200-400 tokens)
# vs fixed-size chunks that often break mid-thought
Combine multiple retrieval methods to maximize relevance while minimizing retrieved tokens:
from langchain.retrievers import EnsembleRetriever
from langchain.vectorstores import FAISS
from langchain.retrievers import BM25Retriever
# Dense retrieval (semantic search)
vectorstore = FAISS.from_documents(chunks, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Sparse retrieval (keyword-based)
sparse_retriever = BM25Retriever.from_documents(chunks)
sparse_retriever.k = 5
# Hybrid ensemble (weighted combination)
ensemble_retriever = EnsembleRetriever(
retrievers=[dense_retriever, sparse_retriever],
weights=[0.6, 0.4] # Favor semantic, but include keyword matches
)
# Retrieve most relevant chunks (better precision = fewer tokens)
relevant_docs = ensemble_retriever.get_relevant_documents(query)
Re-ranking reduces retrieved context by 50-70% while improving relevance:
Without re-ranking (all retrieved chunks sent to the model):
Input tokens: 8,000
Cost: $0.024 (at $3/M)
Relevance: ~40% of context
With re-ranking (top 5 chunks after cross-encoder scoring):
Input tokens: 2,000
Cost: $0.006 (at $3/M)
Relevance: ~85% of context
Savings: 75%
from sentence_transformers import CrossEncoder
# Load cross-encoder re-ranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Initial retrieval (cast a wide net: top 20 candidates)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
initial_docs = retriever.get_relevant_documents(query)
# Re-rank based on query-document relevance
pairs = [[query, doc.page_content] for doc in initial_docs]
scores = reranker.predict(pairs)
# Sort by score and keep top-k
ranked_docs = [doc for _, doc in sorted(
zip(scores, initial_docs),
key=lambda x: x[0],
reverse=True
)][:5] # Keep only top 5 most relevant
# Result: 75% fewer tokens, higher relevance
xRAG specializes in compressing retrieved context while preserving answer-critical information. Stacking these techniques yields a layered optimization pipeline:
| Optimization Layer | Technique | Token Reduction | Complexity |
|---|---|---|---|
| 1. Chunking | Semantic chunking (200-400 tokens) | Baseline | Low |
| 2. Retrieval | Hybrid (dense + sparse) | 20-30% | Medium |
| 3. Re-ranking | Cross-encoder filtering | 50-70% | Medium |
| 4. Compression | xRAG or LLMLingua | 60-80% | High |
| 5. Caching | Cache stable knowledge base | 90% (on hits) | Low |
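Putting the layers together, a hedged end-to-end sketch that reuses the pieces shown above. The `compressor` is an LLMLingua-style object, and the wiring between helpers is illustrative rather than a fixed API.

def build_rag_context(query, ensemble_retriever, reranker, compressor, top_k=5):
    # Layer 2: hybrid retrieval casts a wide net
    candidates = ensemble_retriever.get_relevant_documents(query)

    # Layer 3: cross-encoder re-ranking keeps only the most relevant chunks
    scores = reranker.predict([[query, d.page_content] for d in candidates])
    ranked = [d for _, d in sorted(zip(scores, candidates),
                                   key=lambda pair: pair[0], reverse=True)][:top_k]

    # Layer 4: compress the surviving context before it reaches the model
    context = "\n\n".join(d.page_content for d in ranked)
    compressed = compressor.compress_prompt(
        context, question=query, target_token=800
    )["compressed_prompt"]

    # Layer 5: place the compressed knowledge in a cacheable system block
    return [{"type": "text", "text": compressed,
             "cache_control": {"type": "ephemeral"}}]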
Both OpenAI and Anthropic offer significant discounts for batch processing—50% off standard pricing for workloads that can tolerate 24-hour processing windows.
| Use Case | Batch Fit | Example |
|---|---|---|
| Data extraction | Excellent | Extract structured data from 10,000 documents overnight |
| Content generation | Excellent | Generate product descriptions for entire catalog |
| Analysis & classification | Excellent | Classify customer feedback, sentiment analysis |
| Translation | Good | Translate documentation to multiple languages |
| Customer support | Poor | Real-time chat requires immediate response |
| Code generation (interactive) | Poor | Developer tools need low-latency feedback |
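A minimal sketch of submitting an overnight job through Anthropic's Message Batches API; `documents` is a placeholder list. OpenAI's Batch API follows a similar submit-then-poll pattern with the same 50% discount.

import anthropic

client = anthropic.Anthropic()

# Submit many requests at 50% of standard token pricing
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 512,
                "messages": [{"role": "user",
                              "content": f"Extract structured data from:\n{doc}"}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)

# Poll later; results are returned within the 24-hour processing window
status = client.messages.batches.retrieve(batch.id)
if status.processing_status == "ended":
    for item in client.messages.batches.results(batch.id):
        print(item.custom_id, item.result)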
Continuous batching, used by self-hosted inference servers such as vLLM, dynamically groups in-flight requests to keep the GPU saturated.
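For self-hosted models, a minimal vLLM sketch: the model name is a placeholder and `reviews` is assumed to be a list of strings. vLLM applies continuous batching automatically, so the caller just submits prompts.

from vllm import LLM, SamplingParams

# vLLM's scheduler interleaves requests (continuous batching) to keep the GPU busy
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(max_tokens=64, temperature=0.0)

prompts = [f"Classify sentiment: {review}" for review in reviews]
outputs = llm.generate(prompts, params)  # batched and scheduled internally

for out in outputs:
    print(out.outputs[0].text.strip())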
BatchPrompt batches multiple data points in each prompt to improve efficiency. Instead of making 100 API calls for 100 items, make 10 calls with 10 items each.
for review in reviews: # 100 reviews
prompt = f"Classify sentiment: {review}"
result = llm.call(prompt)
# Cost: 100 API calls
# Latency: 100 * 2s = 200s
batches = chunk(reviews, 10) # 10 batches of 10
for batch in batches:
prompt = f"Classify sentiment for each:\n"
for i, review in enumerate(batch):
prompt += f"{i+1}. {review}\n"
results = llm.call(prompt)
# Cost: 10 API calls (90% reduction)
# Latency: 10 * 3s = 30s (85% faster)
The BatchPrompt paper introduces two sophisticated techniques:
Batch Permutation and Ensembling (BPE) runs multiple voting rounds with different data orderings to improve accuracy:
import random
from collections import Counter, defaultdict

def bpe_classify(batch, num_rounds=3):
votes = defaultdict(list)
for round in range(num_rounds):
# Shuffle batch order each round
shuffled = random.sample(batch, len(batch))
# Get predictions for this ordering
predictions = batch_prompt(shuffled)
# Collect votes
for item_id, prediction in predictions.items():
votes[item_id].append(prediction)
# Majority vote
return {id: Counter(v).most_common(1)[0][0]
for id, v in votes.items()}
Self-reflection-guided Early Stopping (SEAS) stops voting early for confident predictions:
def seas_classify(batch, confidence_threshold=0.9, max_rounds=5):
votes = defaultdict(list)
confidences = defaultdict(list)
completed = set()
for round in range(max_rounds):
# Skip completed items
active_batch = [item for item in batch
if item.id not in completed]
if not active_batch:
break
# Get predictions with confidence scores
results = batch_prompt_with_confidence(active_batch)
for item_id, (pred, conf) in results.items():
votes[item_id].append(pred)
confidences[item_id].append(conf)
# Early stop if confident
if conf > confidence_threshold and len(votes[item_id]) >= 2:
if votes[item_id][-1] == votes[item_id][-2]:
completed.add(item_id)
return {id: Counter(v).most_common(1)[0][0]
for id, v in votes.items()}
| Method | API Calls | Accuracy (BoolQ) | Accuracy (RTE) | Token Efficiency |
|---|---|---|---|---|
| Single-data prompting | 100% | 78.5% | 72.3% | Baseline |
| Basic BatchPrompt (batch=32) | 15.7% | 72.1% | 68.9% | 6.4x better |
| BatchPrompt + BPE | 31.4% | 77.8% | 71.5% | 3.2x better |
| BatchPrompt + BPE + SEAS | 22.5% | 79.2% | 73.1% | 4.4x better |
| Approach | Token Cost | Best For | Model Type |
|---|---|---|---|
| Zero-Shot | Minimal | Strong models, simple tasks, format alignment | Opus 4.5, GPT-4o, Sonnet 4.5 |
| Zero-Shot CoT | Low (+15 tokens) | Reasoning tasks with strong models | Opus 4.5, GPT-4o |
| Few-Shot (1-3 examples) | Medium (+200-600 tokens) | Format specification, weaker models, edge cases | Haiku, GPT-3.5, specialized formats |
| Few-Shot CoT (3-5 examples) | High (+1000-2000 tokens) | Weaker models requiring reasoning scaffolding | GPT-3.5, smaller models |
Few-Shot CoT prompt (~600 tokens):
Example 1: [Complex example with reasoning steps]
[200 tokens]
Example 2: [Another complex example]
[200 tokens]
Example 3: [Third example]
[200 tokens]
Now solve this problem:
[User query]
Zero-Shot CoT prompt (~20 tokens):
[User query]
Let's think step by step.
Savings: 96.7% token reduction
For GPT-4o, Opus 4.5: Same or better accuracy
A minimal one-shot prompt for format specification:
Extract key information in JSON format.
Example:
Input: "John Smith, age 35, lives in NYC"
Output: {"name": "John Smith", "age": 35, "city": "NYC"}
Now process: [user input]
~50 tokens vs. 200+ for verbose examples. Use for format specification only.
Automatically generate diverse, simple examples instead of manual curation:
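A hedged sketch of that idea: use a cheap model to draft candidate examples, then keep the shortest, most distinct ones. The prompt wording and selection rule are illustrative.

# Draft candidate few-shot examples with an inexpensive model
draft = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write 10 short, varied input/output examples for extracting "
                   "{name, age, city} as JSON from one-sentence bios. "
                   "Keep each example under 30 tokens."
    }],
)

candidates = draft.content[0].text.split("\n\n")
# Keep the 2-3 shortest candidates as the production few-shot examples
examples = sorted(candidates, key=len)[:3]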
Both OpenAI and Anthropic now offer structured output modes that eliminate the need for verbose format instructions:
Before (format instructions embedded in the prompt, ~150 tokens):
Extract information and return in JSON format.
Use the following schema:
{
"name": "string",
"age": "integer",
"email": "string",
"interests": ["string"]
}
Ensure all fields are present. Use null if
unknown. Validate email format. Return only
valid JSON, no explanations.
After (schema enforced by the API):
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "schema": schema, "strict": True}
    },
    messages=[...]
)
# Schema enforced by the API (OpenAI shown; Anthropic supports an
# equivalent by forcing tool use with a JSON schema)
# No format instructions needed in the prompt
Savings: 150 tokens per call
Industry analysts project that 75% of enterprises will adopt FinOps automation by 2026, shifting from reactive cost control to autonomous optimization in which AI agents manage AI costs. Real-time cost tracking is the foundation, and several observability platforms provide it out of the box:
Open-source LLM observability with automatic token tracking for OpenAI, Anthropic, and other providers.
Simplest solution for LLM cost and token tracking with optimization tools built-in.
End-to-end tracing with OpenAI cost breakdowns from project to individual model token consumption.
Monitors LLM quality, cost, and performance from development through production.
Single pane of glass for all inference traffic with unified observability and cost attribution.
Automatically logs calls to OpenAI, Anthropic, and other LLM libraries with full cost tracking.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Setup OpenTelemetry
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
otlp_exporter = OTLPSpanExporter(endpoint="your-collector:4317")
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(otlp_exporter)
)
# Instrument LLM calls
@tracer.start_as_current_span("llm_call")
def call_llm(prompt, model="claude-sonnet-4-5"):
span = trace.get_current_span()
# Add metadata
span.set_attribute("llm.model", model)
span.set_attribute("llm.prompt_tokens", len(prompt.split()))
response = client.messages.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
# Track usage and cost
span.set_attribute("llm.input_tokens", response.usage.input_tokens)
span.set_attribute("llm.output_tokens", response.usage.output_tokens)
span.set_attribute("llm.cache_read_tokens",
response.usage.cache_read_input_tokens or 0)
# Calculate cost
cost = calculate_cost(response.usage, model)
span.set_attribute("llm.cost", cost)
return response
# Set up cost alerting
class BudgetExceededError(Exception):
    pass

class CostGuardrail:
    def __init__(self, daily_budget=100.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0

    def reset_daily(self):
        # Call from a daily scheduled job to start a fresh budget window
        self.current_spend = 0.0
def check_budget(self, estimated_cost):
if self.current_spend + estimated_cost > self.daily_budget:
raise BudgetExceededError(
f"Request would exceed daily budget: "
f"${self.current_spend + estimated_cost:.2f} > ${self.daily_budget}"
)
return True
def record_spend(self, actual_cost):
self.current_spend += actual_cost
# Alert at 80% threshold
if self.current_spend > self.daily_budget * 0.8:
send_alert(f"Daily budget 80% consumed: ${self.current_spend:.2f}")
# Use in application
guardrail = CostGuardrail(daily_budget=100.0)
def process_request(user_input):
estimated_cost = estimate_cost(user_input)
guardrail.check_budget(estimated_cost)
    response = call_llm(user_input)
    actual_cost = calculate_cost(response.usage, "claude-sonnet-4-5")  # same pricing helper as above
    guardrail.record_spend(actual_cost)
return response
Implement token optimization in stages to manage complexity and measure impact:
| Metric | Baseline | Target (Phase 3) | Measurement Method |
|---|---|---|---|
| Cost per 1M requests | $1,000 | $200-300 | Observability platform |
| Cache hit rate | 0% | >70% | API response metadata |
| Avg tokens per request | 5,000 | <2,000 | Token usage tracking |
| User satisfaction | 85% | >83% (maintain) | User feedback, ratings |
| Response quality | 0.90 | >0.88 (maintain) | Eval suite, human review |
Estimate your potential savings:
| Current Metrics | Your Value | Optimized Value | Savings |
|---|---|---|---|
| Monthly API spend | $10,000 | $2,500 | $7,500/mo |
| Annual savings | — | — | $90,000/year |
| Implementation cost (12 weeks) | — | $30,000 | — |
| ROI (first year) | — | — | 200% |
| Payback period | — | — | 4 months |
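The arithmetic behind the table as a small helper; the 75% reduction and $30,000 implementation cost are the assumptions from the rows above.

def optimization_roi(monthly_spend, reduction=0.75, implementation_cost=30_000):
    monthly_savings = monthly_spend * reduction
    annual_savings = monthly_savings * 12
    first_year_roi = (annual_savings - implementation_cost) / implementation_cost
    payback_months = implementation_cost / monthly_savings
    return annual_savings, first_year_roi, payback_months

# $10K/month spend -> $90,000 saved, 200% first-year ROI, 4-month payback
print(optimization_roi(10_000))  # (90000.0, 2.0, 4.0)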
| Rank | Technique | Savings | Complexity | When to Use |
|---|---|---|---|---|
| 1 | Prompt Caching | 50-90% | Low | Stable context, high-frequency requests |
| 2 | Model Routing | 50-80% | Medium | >1M tokens/day, varied complexity |
| 3 | Batch Processing | 50% | Low | Async workloads, can tolerate 24h latency |
| 4 | RAG Optimization | 50-80% | Medium-High | RAG applications with chunking/retrieval |
| 5 | Zero-Shot CoT | 30-50% | Low | Strong models, reasoning tasks |
| 6 | Prompt Compression | 70-94% | High | Very long contexts, specialized domains |
| 7 | BatchPrompt (BPE+SEAS) | 70-80% | Medium-High | Classification, batch tasks |
As AI workloads are projected to exceed $840 billion by 2026, token optimization has become a core competency for AI teams. Organizations that master these techniques gain a significant competitive advantage in both cost and latency.
Practical Claude Code patterns for token optimization. These examples demonstrate selective context loading, session continuity, and lazy file references based on the context compression research (LLMLingua-2).
Only load the files needed for the task, reducing token usage significantly. This implements the hierarchical context pattern from the compression research (LongLLMLingua).
from claude_agent_sdk import query, ClaudeAgentOptions
# BAD: Loading entire codebase (expensive)
# async for msg in query("Fix the bug in our codebase", ...)
# GOOD: Targeted context loading
async for message in query(
prompt="""Fix the authentication bug.
Context needed:
- @src/auth/login.py (the buggy file)
- @src/auth/types.py (type definitions)
- @tests/auth/test_login.py (failing tests)""",
options=ClaudeAgentOptions(
allowed_tools=["Read", "Edit"],
permission_mode="acceptEdits"
)
):
pass
# Even better: Use Glob/Grep to find relevant files first
async for message in query(
prompt="""Find and fix the authentication timeout issue.
1. Use Grep to find where timeout is configured
2. Read only the relevant files
3. Fix the issue""",
options=ClaudeAgentOptions(
allowed_tools=["Grep", "Read", "Edit"],
permission_mode="acceptEdits"
)
):
pass
Maintain session state across queries to avoid re-loading context, implementing the KV-cache optimization pattern (H2O: Heavy-Hitter Oracle).
from claude_agent_sdk import query, ClaudeAgentOptions
# First query: establish context
session_id = None
async for message in query(
prompt="Read and understand the authentication module in src/auth/",
options=ClaudeAgentOptions(allowed_tools=["Read", "Glob"])
):
if hasattr(message, 'subtype') and message.subtype == 'init':
session_id = message.session_id # Capture session ID
# Second query: reuse session (context already loaded)
async for message in query(
prompt="Now find all places that call the login function",
options=ClaudeAgentOptions(
resume=session_id # Resume existing session
)
):
if hasattr(message, "result"):
print(message.result)
# Third query: continue with same context
async for message in query(
prompt="Add rate limiting to each of those call sites",
options=ClaudeAgentOptions(
resume=session_id,
allowed_tools=["Read", "Edit"]
)
):
pass
import { query } from "@anthropic-ai/claude-agent-sdk";
// Capture session ID from first query
let sessionId: string | null = null;
for await (const msg of query({
prompt: "Analyze the database schema in src/models/",
options: { allowedTools: ["Read", "Glob"] }
})) {
if (msg.subtype === "init") sessionId = msg.sessionId;
}
// Subsequent queries reuse the session
for await (const msg of query({
prompt: "Add indexes for the slow queries we discussed",
options: { resume: sessionId, allowedTools: ["Read", "Edit"] }
})) {
if ("result" in msg) console.log(msg.result);
}
Use file references that load on-demand rather than eagerly including all content in the prompt.
# BAD: Eagerly loads entire file into prompt
claude "Here is my code: $(cat src/auth/login.py) - Fix the bug"
# GOOD: Lazy reference - Claude loads only what's needed
claude "Fix the authentication bug in @src/auth/login.py"
# BETTER: Multiple lazy references
claude "Review @src/auth/login.py against @docs/auth-spec.md"
# BEST: Let Claude discover what to read
claude "Find and fix the login timeout bug in the auth module"
GSD workflows implement hierarchical context compression through file-based state management, following the research on efficient context organization.
# GSD uses hierarchical context loading:
# Level 1: PROJECT.md (always loaded, ~500 tokens)
# - Core constraints and decisions
# Level 2: STATE.md (loaded for continuity, ~300 tokens)
# - Current position, accumulated decisions
# Level 3: PLAN.md (loaded for execution, ~500 tokens)
# - Specific tasks and verification criteria
# Level 4: SUMMARY.md files (loaded on-demand)
# - Only loaded when context from completed work is needed
# Result: Fresh execution agent loads ~1300 tokens of context
# vs. loading entire project history (potentially 50k+ tokens)
GSD enforces the 50% context budget rule to address the quality degradation curve identified in research. Each plan targets completion within 50% of available context.
| Context Usage | Quality Level | GSD Strategy |
|---|---|---|
| 0-30% | PEAK - Thorough, comprehensive | Optimal operating range |
| 30-50% | GOOD - Confident, solid work | Target completion zone |
| 50-70% | DEGRADING - Efficiency mode | Split into new plan |
| 70%+ | POOR - Rushed, minimal | Never reach this zone |
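A tiny sketch of enforcing the budget rule with the thresholds from the table; the token-usage figure would come from your own accounting, and the zone labels mirror the table above.

def context_strategy(tokens_used, context_window=200_000):
    usage = tokens_used / context_window
    if usage < 0.30:
        return "peak: continue normally"
    if usage < 0.50:
        return "good: aim to complete the current plan here"
    if usage < 0.70:
        return "degrading: split remaining work into a new plan"
    return "poor: stop and hand off to a fresh session"

print(context_strategy(45_000))  # ~22% of a 200K window -> peak zone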
Research current as of: January 2026