# Context Window Handling in Local Swarm
## Overview
This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.
## The Core Challenge
When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:
- **7B model at 32K context:** ~8GB VRAM per worker
- **7B model at 64K context:** ~14GB VRAM per worker
- **Input duplication:** Each worker processes the full input independently
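Most of that footprint is the KV cache, which grows linearly with context length. A rough back-of-the-envelope estimate (a sketch only; exact numbers depend on layer count, attention-head layout, and cache precision):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one K and one V tensor per layer,
    each of shape [context_len, n_kv_heads * head_dim] (fp16 by default)."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem

# Illustrative values for a Mistral-7B-style model with grouped-query attention
# (32 layers, 8 KV heads, head dim 128) -- assumptions, not exact specs.
print(kv_cache_bytes(32, 8, 128, 32_768) / 1024**3)   # ~4.0 GiB at 32K context
print(kv_cache_bytes(32, 8, 128, 65_536) / 1024**3)   # ~8.0 GiB at 64K context
```

Adding roughly 4GB of weights for a 4-bit quantized 7B model lands close to the per-worker figures above.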
## Industry Approaches
### 1. Mixture of Experts (MoE)
**Used by:** GPT-4, Mixtral 8x7B
- Full input goes to all "expert" sub-models
- Router network decides which experts to activate
- Each expert is smaller (e.g., 8x7B vs 1x56B equivalent)
- **Trade-off:** More parameters total, but only a subset active per token
### 2. Ensemble Voting (Local Swarm's Approach)
**Characteristics:**
- Full input to all workers
- Each worker generates independently
- Vote on final outputs
- **Pros:** True parallel processing, diverse perspectives
- **Cons:** 100% input duplication, memory intensive
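A minimal sketch of this voting loop, where `generate(worker_id, prompt)` is a placeholder for a call to one local model instance (Local Swarm's real worker interface may differ):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def swarm_vote(generate: Callable[[int, str], str],
               prompt: str, n_workers: int = 5) -> str:
    """Send the *same* full prompt to every worker (the 100% input duplication),
    collect the independent answers, and return the majority choice."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        answers = list(pool.map(lambda i: generate(i, prompt), range(n_workers)))
    return Counter(answers).most_common(1)[0][0]
```

Exact-match counting is only the simplest form of consensus; answers can be normalized or reranked first, as in the voting tier of Option 3 below.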
### 3. Pipeline/Multi-Agent
**Used by:** LangChain, AutoGPT
- Different workers get different subtasks
- Sequential processing (not parallel)
- **Pros:** Efficient memory usage, specialization
- **Cons:** Loses swarm consensus benefit, higher latency
### 4. Speculative Decoding
**Used by:** vLLM, Text Generation Inference
- Small "draft" model proposes the next tokens cheaply
- Large model verifies the draft in one forward pass instead of generating token by token
- **Pros:** 2-3x speedup
- **Cons:** Complex implementation
## Memory Offloading
### What It Is
Moving part of the model's state from GPU VRAM to system RAM:
- **Hot context** (active tokens) → GPU VRAM (fast)
- **Cold context** (earlier tokens) → System RAM (slower)
### Performance Impact
| Configuration | Speed | Memory |
|---------------|-------|--------|
| 100% GPU | 100% | 20GB VRAM |
| 50% offload | 75% | 10GB VRAM + 10GB RAM |
| 80% offload | 60% | 4GB VRAM + 16GB RAM |
### When to Use
- **Recommended:** When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
- **Trade-off:** 25-40% slower, but can run 2-3x more workers
- **Implementation:** vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
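A sketch of partial offloading with the llama-cpp-python bindings (model path and layer count are illustrative; the right split depends on your model, quantization, and VRAM budget):

```python
from llama_cpp import Llama

# Keep only some transformer layers on the GPU; the remaining layers
# (and their share of the KV cache) stay in system RAM.
llm = Llama(
    model_path="models/worker-7b-q4.gguf",  # hypothetical path
    n_gpu_layers=8,                          # e.g. 8 of ~32 layers on the GPU
    n_ctx=32_768,                            # context window to reserve
)
```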
## Can Workers Share Context?
### The Short Answer
- **Raw input tokens:** Yes (negligible memory)
- **KV Cache (attention states):** No (99% of memory, unique per worker)
### Why KV Cache Can't Be Shared
The attention mechanism requires unique Key/Value tensors per token position:
```
Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous
```
Even with the same input:
- Different sampling seeds → different output tokens, so the caches diverge as soon as generation starts
- Each worker builds its own understanding of the problem
- The "notes and highlights" (KV cache) are unique per worker
### Analogy
Five people reading the same book:
- **Can share:** The physical book (input tokens)
- **Can't share:** Their notes, highlights, thoughts (KV cache)
## Options for Long Context (30K-60K+ tokens)
### Option 1: Long-Context Models
**Models:** Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)

**Pros:**
- Simplest architecture
- True parallel swarm voting
- No preprocessing

**Cons:**
- Requires 8-12GB VRAM per worker at 60K context
- Limited model selection

**Best for:** Users with high-end GPUs (RTX 4090, 24GB+ VRAM)
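If a worker is served through an Ollama-style local API, the larger window can be requested per call (model tag, input file, and the 64K figure are illustrative; Local Swarm's actual serving stack may differ):

```python
import requests

long_document = open("design_notes.txt").read()  # hypothetical input file

resp = requests.post(
    "http://localhost:11434/api/generate",       # assumes a local Ollama server
    json={
        "model": "phi3.5",                       # any long-context model tag
        "prompt": long_document + "\n\nSummarize the key design decisions.",
        "stream": False,
        "options": {"num_ctx": 65_536},          # context window in tokens
    },
)
print(resp.json()["response"])
```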
### Option 2: Context Compression
**Architecture:** Two-stage processing

**Stage 1:** Compression swarm (3-5 workers)
- Split 60K into chunks
- Summarize each chunk
- Aggregate to 8K compressed context

**Stage 2:** Solution swarm (N workers)
- Each worker gets 8K compressed + 2K relevant original
- Generate independently
- Vote on best

**Pros:**
- Works with standard 8K models
- Maintains swarm architecture
- More workers possible

**Cons:**
- Potential information loss
- Added latency (~2-3s)

**Best for:** Users with 8-16GB VRAM who need 30K+ context
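A sketch of the two-stage flow described above (the `summarize` and `generate` callables are placeholders for compression and solution workers; chunking by whitespace-separated words is a simplification of real tokenization):

```python
from typing import Callable, List

def compress_context(full_text: str, summarize: Callable[[str], str],
                     chunk_words: int = 3000) -> str:
    """Stage 1: split the long context into chunks, summarize each chunk
    with a compression worker, and join the summaries."""
    words = full_text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    return "\n\n".join(summarize(chunk) for chunk in chunks)

def solve(question: str, compressed: str,
          generate: Callable[[int, str], str], n_workers: int = 5) -> List[str]:
    """Stage 2: every solution worker gets the same compressed context,
    generates independently, and the answers go to the usual vote."""
    prompt = f"Context:\n{compressed}\n\nTask: {question}"
    return [generate(i, prompt) for i in range(n_workers)]
```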
### Option 3: Hierarchical RAG
**Architecture:** Three-tier system

**Tier 1:** Indexing swarm
- Embed context into vector database
- Create searchable knowledge graph

**Tier 2:** Retrieval + Generation
- Query index for relevant context
- Each worker gets ~6K retrieved + 2K raw
- Generate solutions

**Tier 3:** Voting swarm
- Rerank and consensus

**Pros:**
- Scales to 100K+ tokens
- Most robust to information loss
- Specialized workers

**Cons:**
- Complex implementation
- 3x higher latency
- Requires vector DB

**Best for:** Maximum accuracy, production deployments
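A sketch of the retrieval step at the heart of Tier 2, using plain cosine similarity in place of a real vector database (`embed` is a placeholder for any embedding model):

```python
import numpy as np
from typing import Callable, List, Tuple

def build_index(chunks: List[str],
                embed: Callable[[str], np.ndarray]) -> Tuple[np.ndarray, List[str]]:
    """Tier 1: embed every chunk of the long context once."""
    return np.stack([embed(c) for c in chunks]), chunks

def retrieve(query: str, index: Tuple[np.ndarray, List[str]],
             embed: Callable[[str], np.ndarray], k: int = 5) -> List[str]:
    """Tier 2: return the k chunks most similar to the query; the retrieved
    text (plus a little raw context) is what each solution worker sees."""
    vectors, chunks = index
    q = embed(query)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```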
## Current Local Swarm Implementation
Local Swarm currently uses **Ensemble Voting** (Industry Approach 2 above) with standard context windows:
- 2K-8K context (model dependent)
- Each worker loads full model independently
- No context sharing between workers
- No offloading to system RAM (yet)
## Recommendations
### For 8K-16K Context
Use current implementation with standard models
### For 30K+ Context
Choose based on your hardware:

| Setup | Recommended Approach |
|-------|---------------------|
| RTX 4090 (24GB) | Option 1: Long-context models |
| RTX 4060 Ti (16GB) | Option 2: Context compression |
| Multiple machines (federated) | Option 2 or 3 |
| CPU-only | Option 2 with aggressive compression |
### Memory-Constrained Setups
Enable CPU offloading to run more workers:
```bash
# llama.cpp example: keep only part of the model on the GPU; the
# remaining layers stay in system RAM (e.g. 8 of ~32 layers on GPU
# leaves roughly 75-80% of the model in RAM)
./main -m model.gguf --n-gpu-layers 8
```
## Future Enhancements
Potential improvements for Local Swarm:
1. **Context compression layer** (Option 2 implementation)
2. **CPU offloading support** for memory-constrained systems
3. **Hierarchical RAG** for enterprise use cases
4. **Speculative decoding** for 2-3x speedup
## References
- vLLM PagedAttention: Efficient KV cache management
- DeepSpeed ZeRO-Infinity: Offloading to CPU/NVMe
- Mixtral 8x7B: Mixture of Experts architecture
- Phi-3.5 Technical Report: Long-context small models