# Context Window Handling in Local Swarm

## Overview

This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.

## The Core Challenge

When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:

- **7B model at 32K context:** ~8GB VRAM per worker
- **7B model at 64K context:** ~14GB VRAM per worker
- **Input duplication:** Each worker processes the full input independently
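
The per-worker figures above can be approximated from the KV cache alone. The sketch below is a rough estimate, not Local Swarm code: it assumes a hypothetical 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16 cache) plus roughly 4GB of quantized weights; real models vary.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Per-worker KV cache: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

WEIGHTS_GB = 4.0  # assumed ~4-bit quantized 7B weights

for ctx in (8_192, 32_768, 65_536):
    cache_gb = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>6} tokens: ~{cache_gb:.1f} GB cache + {WEIGHTS_GB:.0f} GB weights "
          f"= roughly {cache_gb + WEIGHTS_GB:.0f} GB per worker")
```

Because none of this cache can be shared between workers (see below), the total footprint scales linearly with the number of workers.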

## Industry Approaches

### 1. Mixture of Experts (MoE)

**Used by:** Mixtral 8x7B (and reportedly GPT-4)

- Full input goes to all "expert" sub-models
- A router network decides which experts to activate
- Each expert is smaller (e.g., 8x7B instead of a single 56B-equivalent model)
- **Trade-off:** More parameters in total, but only a subset is active per token

### 2. Ensemble Voting (Local Swarm's Approach)

**Characteristics:**

- Full input to all workers
- Each worker generates independently
- Vote on the final outputs
- **Pros:** True parallel processing, diverse perspectives
- **Cons:** 100% input duplication, memory intensive
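
In code, the ensemble-voting pattern is essentially "same prompt in, N independent generations out, majority wins". The sketch below is illustrative only: `generate` is a stand-in for however the swarm calls its local model backend, not Local Swarm's actual API.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, seed: int) -> str:
    """Stand-in for one worker's call to a local model (hypothetical)."""
    rng = random.Random(seed)
    return rng.choice(["answer A", "answer B"])  # toy output for illustration

def swarm_vote(prompt: str, n_workers: int = 5) -> str:
    # Every worker receives the full prompt: this is the 100% input duplication.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outputs = list(pool.map(lambda seed: generate(prompt, seed), range(n_workers)))
    # Consensus step: the most common output wins the vote.
    return Counter(outputs).most_common(1)[0][0]

print(swarm_vote("Summarize the design doc."))
```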

### 3. Pipeline/Multi-Agent

**Used by:** LangChain, AutoGPT

- Different workers get different subtasks
- Sequential processing (not parallel)
- **Pros:** Efficient memory usage, specialization
- **Cons:** Loses the swarm consensus benefit, higher latency
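
For contrast with the voting sketch above, a pipeline hands each worker a different subtask in sequence. The stage names and the `run_worker` call below are hypothetical, not any particular framework's API.

```python
def run_worker(role: str, payload: str) -> str:
    """Stand-in for a single specialized worker call (hypothetical)."""
    return f"[{role}] output for: {payload[:40]}"

def pipeline(task: str) -> str:
    # Each stage sees only the previous stage's output, not the whole swarm's work,
    # so memory use stays low but there is no consensus vote at the end.
    plan = run_worker("planner", task)
    draft = run_worker("coder", plan)
    return run_worker("reviewer", draft)

print(pipeline("Add context compression to Local Swarm"))
```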

### 4. Speculative Decoding

**Used by:** vLLM, Text Generation Inference

- A small "draft" model proposes the next tokens cheaply
- The large model only verifies the drafted tokens instead of generating each one itself
- **Pros:** Typically 2-3x speedup
- **Cons:** Complex implementation
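
A heavily simplified sketch of the draft-then-verify loop, for intuition only. `draft_next` and `target_next` are hypothetical callables that return a model's next token; a real implementation verifies all drafted tokens in a single batched forward pass and uses rejection sampling when sampling, which this toy greedy version omits.

```python
def speculative_decode(prompt_tokens, draft_next, target_next, k=4, max_new=32):
    """Toy greedy draft-then-verify loop (not a production implementation)."""
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        # 1. The small draft model cheaply proposes up to k tokens ahead.
        drafts = []
        for _ in range(k):
            drafts.append(draft_next(out + drafts))
        # 2. The large target model checks each proposal in order.
        for tok in drafts:
            verified = target_next(out)
            if verified == tok:
                out.append(tok)       # draft accepted, no extra generation step needed
            else:
                out.append(verified)  # mismatch: keep the target's token and re-draft
                break
    return out
```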

## Memory Offloading

### What It Is

Moving part of the model's state from GPU VRAM to system RAM:

- **Hot context** (active tokens) → GPU VRAM (fast)
- **Cold context** (earlier tokens) → System RAM (slower)

### Performance Impact

| Configuration | Relative Speed | Memory |
|---------------|----------------|--------|
| 100% GPU | 100% | 20GB VRAM |
| 50% offload | 75% | 10GB VRAM + 10GB RAM |
| 80% offload | 60% | 4GB VRAM + 16GB RAM |

### When to Use

- **Recommended:** When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
- **Trade-off:** 25-40% slower, but can run 2-3x more workers
- **Implementations:** vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
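
As one concrete example of partial offloading, the snippet below uses llama.cpp's Python bindings to keep only some transformer layers on the GPU and leave the rest in system RAM. The model path and layer split are placeholders, and Local Swarm does not currently expose this option.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Keep roughly a quarter of the layers on the GPU; the remaining layers
# (and their share of the KV cache) stay in system RAM, trading speed
# for a much smaller VRAM footprint per worker.
llm = Llama(
    model_path="./models/7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=8,   # e.g. 8 of 32 layers on the GPU
    n_ctx=8192,       # context window to allocate
)

print(llm("Explain KV cache offloading in one sentence.", max_tokens=64))
```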

## Can Workers Share Context?

### The Short Answer

**Raw input tokens:** Yes (negligible memory)

**KV Cache (attention states):** No (~99% of the context memory, unique per worker)

### Why KV Cache Can't Be Shared

The attention mechanism requires unique Key/Value tensors per token position:

```
Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous
```

Even with the same input:

- Different sampling seeds → different generated tokens, so the caches diverge as soon as generation starts
- Each worker is a separate model instance that builds and stores its own state
- The "notes and highlights" (KV cache) end up unique per worker

### Analogy

Five people reading the same book:

- ✅ **Can share:** The physical book (input tokens)
- ❌ **Can't share:** Their notes, highlights, and thoughts (KV cache)

## Options for Long Context (30K-60K+ tokens)

### Option 1: Long-Context Models

**Models:** Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)

**Pros:**

- Simplest architecture
- True parallel swarm voting
- No preprocessing

**Cons:**

- Requires 8-12GB VRAM per worker at 60K context
- Limited model selection

**Best for:** Users with high-end GPUs (RTX 4090, 24GB+ VRAM)

### Option 2: Context Compression

**Architecture:** Two-stage processing

**Stage 1:** Compression swarm (3-5 workers)

- Split the 60K input into chunks
- Summarize each chunk
- Aggregate into an ~8K compressed context

**Stage 2:** Solution swarm (N workers)

- Each worker gets the 8K compressed context + ~2K of relevant original text
- Generate independently
- Vote on the best output

**Pros:**

- Works with standard 8K models
- Maintains the swarm architecture
- More workers possible

**Cons:**

- Potential information loss
- Added latency (~2-3s)

**Best for:** Users with 8-16GB VRAM who need 30K+ context
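
A minimal sketch of this two-stage flow, assuming hypothetical `summarize` and `generate` calls into compression and solution workers (the stubs below only stand in for real model calls), with chunking done by characters rather than tokens for simplicity:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_CHARS = 16_000  # crude stand-in for a ~4K-token chunk

def summarize(chunk: str) -> str:
    # Placeholder for a compression-worker call; a real worker would summarize.
    return chunk[:200]

def generate(context: str, task: str) -> str:
    # Placeholder for a solution-worker call.
    return f"solution drafted from {len(context)} chars of compressed context"

def compress_then_solve(long_context: str, task: str, n_solvers: int = 5) -> list[str]:
    # Stage 1: split the long input and summarize the chunks in parallel.
    chunks = [long_context[i:i + CHUNK_CHARS]
              for i in range(0, len(long_context), CHUNK_CHARS)]
    with ThreadPoolExecutor() as pool:
        compressed = "\n".join(pool.map(summarize, chunks))
    # Stage 2: every solution worker sees the same compressed context; vote afterwards.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda _: generate(compressed, task), range(n_solvers)))
```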

### Option 3: Hierarchical RAG

**Architecture:** Three-tier system

**Tier 1:** Indexing swarm

- Embed the context into a vector database
- Create a searchable knowledge graph

**Tier 2:** Retrieval + generation

- Query the index for relevant context
- Each worker gets ~6K retrieved + 2K raw tokens
- Generate solutions

**Tier 3:** Voting swarm

- Rerank and reach consensus

**Pros:**

- Scales to 100K+ tokens
- Most robust to information loss
- Specialized workers

**Cons:**

- Complex implementation
- 3x higher latency
- Requires a vector DB

**Best for:** Maximum accuracy, production deployments
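
The core of Tiers 1 and 2 is embed-then-retrieve. The sketch below uses a toy character-frequency "embedding" purely so it runs; a real deployment would use an embedding model and a vector database, and the function names are illustrative, not Local Swarm APIs.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a character-frequency vector.
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Tier 1: embed each chunk once and keep it next to its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: list[tuple[str, list[float]]], top_k: int = 4) -> list[str]:
    # Tier 2: rank indexed chunks against the query and keep the best few
    # (roughly the ~6K tokens of retrieved context per worker described above).
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```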

## Current Local Swarm Implementation

Local Swarm currently uses **ensemble voting** (the architecture behind Option 1 above) with standard context windows:

- 2K-8K context (model dependent)
- Each worker loads the full model independently
- No context sharing between workers
- No offloading to system RAM (yet)

## Recommendations

### For 8K-16K Context

Use the current implementation with standard models.

### For 30K+ Context

Choose based on your hardware:

| Setup | Recommended Approach |
|-------|----------------------|
| RTX 4090 (24GB) | Option 1: Long-context models |
| RTX 4060 Ti (16GB) | Option 2: Context compression |
| Multiple machines (federated) | Option 2 or 3 |
| CPU-only | Option 2 with aggressive compression |

### Memory-Constrained Setups

Enable CPU offloading to run more workers. With llama.cpp this is controlled by how many layers are kept on the GPU:

```bash
# llama.cpp example: keep ~20% of a 32-layer model on the GPU,
# leaving roughly 80% of the weights and KV cache in system RAM
./main -m model.gguf --n-gpu-layers 6
```

## Future Enhancements

Potential improvements for Local Swarm:

1. **Context compression layer** (Option 2 implementation)
2. **CPU offloading support** for memory-constrained systems
3. **Hierarchical RAG** for enterprise use cases
4. **Speculative decoding** for 2-3x speedup

## References

- vLLM PagedAttention: efficient KV cache management
- DeepSpeed ZeRO-Infinity: offloading to CPU/NVMe
- Mixtral 8x7B: Mixture of Experts architecture
- Phi-3.5 Technical Report: long-context small models