# Context Window Handling in Local Swarm

## Overview

This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.
## The Core Challenge
When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:
- 7B model at 32K context: ~8GB VRAM per worker
- 7B model at 64K context: ~14GB VRAM per worker
- Input duplication: Each worker processes the full input independently
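
These figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes a Llama-3-8B-like geometry (32 layers, 8 KV heads via grouped-query attention, head dimension 128), 4-bit quantized weights, and an fp16 KV cache; these are our assumptions, not measurements.

```python
# Rough per-worker memory model: quantized weights + fp16 KV cache.
# Geometry is Llama-3-8B-like (assumed): 32 layers, 8 KV heads (GQA),
# head dimension 128. The factor of 2 stores both K and V per token.

def kv_cache_gib(tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # ~128 KiB
    return tokens * per_token / 2**30

WEIGHTS_GIB = 4.5  # ~8B parameters at 4-bit quantization (assumed)

for ctx in (32_768, 65_536):
    total = WEIGHTS_GIB + kv_cache_gib(ctx)
    print(f"{ctx:>6}-token context: ~{total:.1f} GiB per worker")
# -> ~8.5 GiB at 32K and ~12.5 GiB at 64K, in line with the figures above
```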
## Industry Approaches

### 1. Mixture of Experts (MoE)

Used by: Mixtral 8x7B (and, reportedly, GPT-4)
- Full input goes to all "expert" sub-models
- Router network decides which experts to activate
- Each expert is small (e.g., Mixtral's eight 7B experts versus one dense model of comparable total size)
- Trade-off: More parameters total, but only a subset active per token
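
A toy sketch of the routing idea (illustrative only, not any production implementation): a learned router scores all experts for each token, but only the top-k actually run.

```python
import numpy as np

# Toy top-k MoE routing: the router sees the full input, but only k of the
# n experts do any work for a given token.
def moe_layer(x, router_w, experts, k=2):
    logits = x @ router_w                          # one score per expert
    top = np.argsort(logits)[-k:]                  # pick the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                       # softmax over the winners
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8                             # Mixtral-like: 8 experts, top-2
experts = [lambda x, W=rng.normal(size=(dim, dim)): x @ W
           for _ in range(n_experts)]
router_w = rng.normal(size=(dim, n_experts))
out = moe_layer(rng.normal(size=dim), router_w, experts)
```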
### 2. Ensemble Voting (Local Swarm's Approach)

Characteristics (a minimal sketch follows the list):
- Full input to all workers
- Each worker generates independently
- Vote on final outputs
- Pros: True parallel processing, diverse perspectives
- Cons: 100% input duplication, memory intensive
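
The sketch below illustrates this flow; `generate` is a stub standing in for a call to one local model instance (the real Local Swarm worker API may differ).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stub for one local model instance (hypothetical API).
def generate(worker_id: int, prompt: str) -> str:
    rng = random.Random(worker_id)            # each worker samples independently
    return rng.choice(["answer A", "answer B", "answer A"])

def swarm_answer(n_workers: int, prompt: str) -> str:
    # Every worker receives the FULL prompt -- this is the 100% input
    # duplication (and per-worker KV cache) the cons above refer to.
    with ThreadPoolExecutor(n_workers) as pool:
        outputs = pool.map(lambda i: generate(i, prompt), range(n_workers))
    return Counter(outputs).most_common(1)[0][0]  # majority vote

print(swarm_answer(5, "a long 30K-token task ..."))
```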
### 3. Pipeline / Multi-Agent

Used by: LangChain, AutoGPT
- Different workers get different subtasks
- Sequential processing (not parallel)
- Pros: Efficient memory usage, specialization
- Cons: Loses swarm consensus benefit, higher latency
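
For contrast, a minimal sketch of the pipeline style (the stage functions are illustrative stand-ins, not the LangChain or AutoGPT APIs):

```python
# Each stage handles one subtask and only one model is active at a time,
# so memory stays low -- but there is no consensus vote at the end.
def run_pipeline(task: str, stages) -> str:
    result = task
    for stage in stages:                 # strictly sequential
        result = stage(result)
    return result

stages = [
    lambda t: f"outline({t})",           # stand-in for a planning worker
    lambda t: f"draft({t})",             # stand-in for a drafting worker
    lambda t: f"review({t})",            # stand-in for a review worker
]
print(run_pipeline("write release notes", stages))
```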
### 4. Speculative Decoding

Used by: vLLM, Text Generation Inference
- Small "draft" model processes input
- Large model verifies (doesn't reprocess)
- Pros: 2-3x speedup
- Cons: Complex implementation
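
A toy illustration of the accept/reject loop; both "models" are deterministic stand-ins, and real systems like vLLM verify the whole draft in a single batched forward pass rather than token by token.

```python
# Toy speculative decoding: the draft proposes k tokens, the target keeps
# the agreeing prefix and corrects the first mismatch. Up to k tokens are
# produced per target-model pass instead of one.
def draft_propose(ctx, k=4):
    return [f"tok{len(ctx) + i}" for i in range(k)]   # cheap guesses

def target_next(ctx):
    return f"tok{len(ctx)}"                           # the big model's choice

def speculative_step(ctx):
    accepted = []
    for tok in draft_propose(ctx):
        want = target_next(ctx + accepted)
        accepted.append(want)
        if tok != want:                  # first disagreement ends the round
            break
    return ctx + accepted

print(speculative_step(["tok0", "tok1"]))   # -> tok0 .. tok5
```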
## Memory Offloading

### What It Is
Moving part of the model's state from GPU VRAM to system RAM:
- Hot context (active tokens) → GPU VRAM (fast)
- Cold context (earlier tokens) → System RAM (slower)
### Performance Impact

| Configuration | Relative speed | Memory footprint |
|---|---|---|
| 100% GPU | 100% | 20GB VRAM |
| 50% offload | 75% | 10GB VRAM + 10GB RAM |
| 80% offload | 60% | 4GB VRAM + 16GB RAM |
### When to Use
- Recommended: When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
- Trade-off: 25-40% slower, but can run 2-3x more workers
- Implementation: vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
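
As a concrete example of the vLLM route, recent vLLM versions expose a `cpu_offload_gb` engine argument that keeps part of the weights in system RAM (check that your installed version supports it; the model name is just an example).

```python
from vllm import LLM

# Hedged sketch: hold ~4 GiB of weights in system RAM to shrink the
# per-worker VRAM footprint, at the cost of some throughput.
llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",  # example long-context model
    cpu_offload_gb=4,                          # weights offloaded to RAM (GiB)
    gpu_memory_utilization=0.8,                # cap this worker's VRAM share
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```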
## Can Workers Share Context?

### The Short Answer

- Raw input tokens: yes (negligible memory)
- KV cache (attention states): no (~99% of the memory, unique per worker)
### Why KV Cache Can't Be Shared

The attention mechanism requires unique Key/Value tensors per token position:

```
Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous
```
Even with identical input, the caches cannot be pooled in practice:

- During prefill, the K/V tensors for the same prompt are identical across workers, but each worker is a separate model instance with its own GPU allocation, so there is no shared memory to hold them (engine-level tricks like vLLM's prefix caching only apply within a single inference engine)
- Once sampling begins, workers with different seeds choose different tokens, and their caches diverge position by position
- The "notes and highlights" (KV cache) therefore end up unique to each worker
### Analogy
Five people reading the same book:
- ✅ Can share: The physical book (input tokens)
- ❌ Can't share: Their notes, highlights, thoughts (KV cache)
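
The imbalance is easy to quantify. Using the same assumed Llama-3-8B-like geometry as the estimate earlier:

```python
# The shareable part (raw token IDs) is negligible next to the per-worker
# KV cache, which is why sharing it buys almost nothing.
tokens = 60_000
input_mib = tokens * 4 / 2**20                      # int32 token IDs: ~0.23 MiB
kv_gib = tokens * (2 * 32 * 8 * 128 * 2) / 2**30    # fp16 KV cache: ~7.3 GiB
print(f"shareable input: {input_mib:.2f} MiB vs "
      f"per-worker KV cache: {kv_gib:.1f} GiB")
```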
## Options for Long Context (30K-60K+ Tokens)

### Option 1: Long-Context Models

Models: Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)
Pros:
- Simplest architecture
- True parallel swarm voting
- No preprocessing
Cons:
- Requires 8-12GB VRAM per worker at 60K context
- Limited model selection
Best for: Users with high-end GPUs (RTX 4090, 24GB+ VRAM)
### Option 2: Context Compression

Architecture: two-stage processing (see the sketch at the end of this option)

```
Stage 1: Compression swarm (3-5 workers)
  - Split the 60K input into chunks
  - Summarize each chunk
  - Aggregate into an ~8K compressed context

Stage 2: Solution swarm (N workers)
  - Each worker gets the 8K compressed context + 2K of relevant original text
  - Generate independently
  - Vote on the best output
```
Pros:
- Works with standard 8K models
- Maintains swarm architecture
- More workers possible
Cons:
- Potential information loss
- Added latency (~2-3s)
Best for: Users with 8-16GB VRAM who need 30K+ context
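
A sketch of the two-stage flow; `summarize` and `solve` are hypothetical stand-ins for calls to local workers, and the 4-chars-per-token chunking is a crude heuristic.

```python
# Stage 1 compresses the long document; stage 2 runs the normal swarm on
# the compressed context. All worker calls are stand-ins (assumed API).
CHARS_PER_TOKEN = 4  # rough heuristic

def summarize(chunk: str) -> str:
    return chunk[:200]                    # stand-in for a compression worker

def solve(context: str, task: str) -> str:
    return f"solution({task!r})"          # stand-in for a solution worker

def compressed_swarm(document: str, task: str, n_solvers: int = 5,
                     chunk_tokens: int = 12_000) -> list[str]:
    step = chunk_tokens * CHARS_PER_TOKEN
    chunks = [document[i:i + step] for i in range(0, len(document), step)]
    compressed = "\n".join(summarize(c) for c in chunks)   # target: ~8K tokens
    return [solve(compressed, task) for _ in range(n_solvers)]  # then vote

answers = compressed_swarm("..." * 100_000, "fix the failing test")
```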
### Option 3: Hierarchical RAG

Architecture: three-tier system (see the sketch at the end of this option)

```
Tier 1: Indexing swarm
  - Embed the context into a vector database
  - Create a searchable knowledge graph

Tier 2: Retrieval + generation
  - Query the index for relevant context
  - Each worker gets ~6K retrieved + 2K raw tokens
  - Generate solutions

Tier 3: Voting swarm
  - Rerank and reach consensus
```
Pros:
- Scales to 100K+ tokens
- Most robust to information loss
- Specialized workers
Cons:
- Complex implementation
- 3x higher latency
- Requires vector DB
Best for: Maximum accuracy, production deployments
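
A compressed sketch of the three tiers; `embed` is a stand-in (a real build would use an embedding model plus a vector DB such as FAISS or Chroma), and the generation and rerank steps are stubs.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=64)                       # stand-in embedding

def hierarchical_rag(chunks, query, n_workers=3, top_k=4):
    index = np.stack([embed(c) for c in chunks])     # Tier 1: index once
    scores = index @ embed(query)                    # Tier 2: retrieve...
    retrieved = [chunks[i] for i in np.argsort(scores)[-top_k:]]
    answers = [f"worker{w}({retrieved[0][:20]}...)"  # ...and generate (stubs)
               for w in range(n_workers)]
    return max(set(answers), key=answers.count)      # Tier 3: vote / rerank

docs = [f"chunk {i}: ..." for i in range(50)]
print(hierarchical_rag(docs, "what changed in v2?"))
```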
## Current Local Swarm Implementation

Local Swarm currently uses ensemble voting (approach 2 above) with standard context windows:
- 2K-8K context (model dependent)
- Each worker loads full model independently
- No context sharing between workers
- No offloading to system RAM (yet)
## Recommendations

### For 8K-16K Context

Use the current implementation with standard models.

### For 30K+ Context

Choose based on your hardware:
| Setup | Recommended Approach |
|---|---|
| RTX 4090 (24GB) | Option 1: Long-context models |
| RTX 4060 Ti (16GB) | Option 2: Context compression |
| Multiple machines (federated) | Option 2 or 3 |
| CPU-only | Option 2 with aggressive compression |
### Memory-Constrained Setups

Enable CPU offloading to run more workers. In llama.cpp, placement is controlled by `--n-gpu-layers` (`-ngl`): layers not assigned to the GPU run from system RAM.

```bash
# Keep roughly a quarter of a 32-layer model on the GPU; the remaining
# layers run from system RAM (see ./main --help for your version's flags)
./main -m model.gguf --n-gpu-layers 8
```
## Future Enhancements
Potential improvements for Local Swarm:
- Context compression layer (Option 2 implementation)
- CPU offloading support for memory-constrained systems
- Hierarchical RAG for enterprise use cases
- Speculative decoding for 2-3x speedup
## References
- vLLM PagedAttention: Efficient KV cache management
- DeepSpeed ZeRO-Infinity: Offloading to CPU/NVMe
- Mixtral 8x7B: Mixture of Experts architecture
- Phi-3.5 Technical Report: Long-context small models