# Context Window Handling in Local Swarm

## Overview

This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.

## The Core Challenge

When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:

- **7B model at 32K context:** ~8GB VRAM per worker
- **7B model at 64K context:** ~14GB VRAM per worker
- **Input duplication:** Each worker processes the full input independently
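
The per-worker figures above can be approximated from the KV cache alone. The sketch below is a rough estimate, not Local Swarm code: it assumes a hypothetical 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16 cache) plus roughly 4GB of quantized weights; real models vary.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Per-worker KV cache: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

WEIGHTS_GB = 4.0  # assumed ~4-bit quantized 7B weights

for ctx in (8_192, 32_768, 65_536):
    cache_gb = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>6} tokens: ~{cache_gb:.1f} GB cache + {WEIGHTS_GB:.0f} GB weights "
          f"= roughly {cache_gb + WEIGHTS_GB:.0f} GB per worker")
```

Because none of this cache can be shared between workers (see below), the total footprint scales linearly with the number of workers.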

## Industry Approaches

### 1. Mixture of Experts (MoE)

**Used by:** Mixtral 8x7B (and reportedly GPT-4)

- Full input goes to all "expert" sub-models
- A router network decides which experts to activate
- Each expert is smaller (e.g., 8x7B instead of a single 56B-equivalent model)
- **Trade-off:** More parameters in total, but only a subset is active per token

### 2. Ensemble Voting (Local Swarm's Approach)

**Characteristics:**

- Full input to all workers
- Each worker generates independently
- Vote on the final outputs
- **Pros:** True parallel processing, diverse perspectives
- **Cons:** 100% input duplication, memory intensive
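
In code, the ensemble-voting pattern is essentially "same prompt in, N independent generations out, majority wins". The sketch below is illustrative only: `generate` is a stand-in for however the swarm calls its local model backend, not Local Swarm's actual API.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, seed: int) -> str:
    """Stand-in for one worker's call to a local model (hypothetical)."""
    rng = random.Random(seed)
    return rng.choice(["answer A", "answer B"])  # toy output for illustration

def swarm_vote(prompt: str, n_workers: int = 5) -> str:
    # Every worker receives the full prompt: this is the 100% input duplication.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outputs = list(pool.map(lambda seed: generate(prompt, seed), range(n_workers)))
    # Consensus step: the most common output wins the vote.
    return Counter(outputs).most_common(1)[0][0]

print(swarm_vote("Summarize the design doc."))
```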

### 3. Pipeline/Multi-Agent

**Used by:** LangChain, AutoGPT

- Different workers get different subtasks
- Sequential processing (not parallel)
- **Pros:** Efficient memory usage, specialization
- **Cons:** Loses the swarm consensus benefit, higher latency
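
For contrast with the voting sketch above, a pipeline hands each worker a different subtask in sequence. The stage names and the `run_worker` call below are hypothetical, not any particular framework's API.

```python
def run_worker(role: str, payload: str) -> str:
    """Stand-in for a single specialized worker call (hypothetical)."""
    return f"[{role}] output for: {payload[:40]}"

def pipeline(task: str) -> str:
    # Each stage sees only the previous stage's output, not the whole swarm's work,
    # so memory use stays low but there is no consensus vote at the end.
    plan = run_worker("planner", task)
    draft = run_worker("coder", plan)
    return run_worker("reviewer", draft)

print(pipeline("Add context compression to Local Swarm"))
```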

### 4. Speculative Decoding

**Used by:** vLLM, Text Generation Inference

- A small "draft" model proposes the next tokens cheaply
- The large model only verifies the drafted tokens instead of generating each one itself
- **Pros:** Typically 2-3x speedup
- **Cons:** Complex implementation
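
A heavily simplified sketch of the draft-then-verify loop, for intuition only. `draft_next` and `target_next` are hypothetical callables that return a model's next token; a real implementation verifies all drafted tokens in a single batched forward pass and uses rejection sampling when sampling, which this toy greedy version omits.

```python
def speculative_decode(prompt_tokens, draft_next, target_next, k=4, max_new=32):
    """Toy greedy draft-then-verify loop (not a production implementation)."""
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        # 1. The small draft model cheaply proposes up to k tokens ahead.
        drafts = []
        for _ in range(k):
            drafts.append(draft_next(out + drafts))
        # 2. The large target model checks each proposal in order.
        for tok in drafts:
            verified = target_next(out)
            if verified == tok:
                out.append(tok)       # draft accepted, no extra generation step needed
            else:
                out.append(verified)  # mismatch: keep the target's token and re-draft
                break
    return out
```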

## Memory Offloading

### What It Is

Moving part of the model's state from GPU VRAM to system RAM:

- **Hot context** (active tokens) → GPU VRAM (fast)
- **Cold context** (earlier tokens) → System RAM (slower)

### Performance Impact

| Configuration | Relative Speed | Memory |
|---------------|----------------|--------|
| 100% GPU | 100% | 20GB VRAM |
| 50% offload | 75% | 10GB VRAM + 10GB RAM |
| 80% offload | 60% | 4GB VRAM + 16GB RAM |

### When to Use

- **Recommended:** When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
- **Trade-off:** 25-40% slower, but can run 2-3x more workers
- **Implementations:** vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
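
As one concrete example of partial offloading, the snippet below uses llama.cpp's Python bindings to keep only some transformer layers on the GPU and leave the rest in system RAM. The model path and layer split are placeholders, and Local Swarm does not currently expose this option.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Keep roughly a quarter of the layers on the GPU; the remaining layers
# (and their share of the KV cache) stay in system RAM, trading speed
# for a much smaller VRAM footprint per worker.
llm = Llama(
    model_path="./models/7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=8,   # e.g. 8 of 32 layers on the GPU
    n_ctx=8192,       # context window to allocate
)

print(llm("Explain KV cache offloading in one sentence.", max_tokens=64))
```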

## Can Workers Share Context?

### The Short Answer

**Raw input tokens:** Yes (negligible memory)

**KV Cache (attention states):** No (~99% of the context memory, unique per worker)

### Why KV Cache Can't Be Shared

The attention mechanism requires unique Key/Value tensors per token position:

```
Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous
```

Even with the same input:

- Different sampling seeds → different generated tokens, so the caches diverge as soon as generation starts
- Each worker is a separate model instance that builds and stores its own state
- The "notes and highlights" (KV cache) end up unique per worker

### Analogy

Five people reading the same book:

- ✅ **Can share:** The physical book (input tokens)
- ❌ **Can't share:** Their notes, highlights, and thoughts (KV cache)

## Options for Long Context (30K-60K+ tokens)

### Option 1: Long-Context Models

**Models:** Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)

**Pros:**

- Simplest architecture
- True parallel swarm voting
- No preprocessing

**Cons:**

- Requires 8-12GB VRAM per worker at 60K context
- Limited model selection

**Best for:** Users with high-end GPUs (RTX 4090, 24GB+ VRAM)

### Option 2: Context Compression

**Architecture:** Two-stage processing

**Stage 1:** Compression swarm (3-5 workers)

- Split the 60K input into chunks
- Summarize each chunk
- Aggregate into an ~8K compressed context

**Stage 2:** Solution swarm (N workers)

- Each worker gets the 8K compressed context + ~2K of relevant original text
- Generate independently
- Vote on the best output

**Pros:**

- Works with standard 8K models
- Maintains the swarm architecture
- More workers possible

**Cons:**

- Potential information loss
- Added latency (~2-3s)

**Best for:** Users with 8-16GB VRAM who need 30K+ context
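
A minimal sketch of this two-stage flow, assuming hypothetical `summarize` and `generate` calls into compression and solution workers (the stubs below only stand in for real model calls), with chunking done by characters rather than tokens for simplicity:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_CHARS = 16_000  # crude stand-in for a ~4K-token chunk

def summarize(chunk: str) -> str:
    # Placeholder for a compression-worker call; a real worker would summarize.
    return chunk[:200]

def generate(context: str, task: str) -> str:
    # Placeholder for a solution-worker call.
    return f"solution drafted from {len(context)} chars of compressed context"

def compress_then_solve(long_context: str, task: str, n_solvers: int = 5) -> list[str]:
    # Stage 1: split the long input and summarize the chunks in parallel.
    chunks = [long_context[i:i + CHUNK_CHARS]
              for i in range(0, len(long_context), CHUNK_CHARS)]
    with ThreadPoolExecutor() as pool:
        compressed = "\n".join(pool.map(summarize, chunks))
    # Stage 2: every solution worker sees the same compressed context; vote afterwards.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda _: generate(compressed, task), range(n_solvers)))
```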

### Option 3: Hierarchical RAG

**Architecture:** Three-tier system

**Tier 1:** Indexing swarm

- Embed the context into a vector database
- Create a searchable knowledge graph

**Tier 2:** Retrieval + generation

- Query the index for relevant context
- Each worker gets ~6K retrieved + 2K raw tokens
- Generate solutions

**Tier 3:** Voting swarm

- Rerank and reach consensus

**Pros:**

- Scales to 100K+ tokens
- Most robust to information loss
- Specialized workers

**Cons:**

- Complex implementation
- 3x higher latency
- Requires a vector DB

**Best for:** Maximum accuracy, production deployments
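
The core of Tiers 1 and 2 is embed-then-retrieve. The sketch below uses a toy character-frequency "embedding" purely so it runs; a real deployment would use an embedding model and a vector database, and the function names are illustrative, not Local Swarm APIs.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a character-frequency vector.
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Tier 1: embed each chunk once and keep it next to its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: list[tuple[str, list[float]]], top_k: int = 4) -> list[str]:
    # Tier 2: rank indexed chunks against the query and keep the best few
    # (roughly the ~6K tokens of retrieved context per worker described above).
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```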

## Current Local Swarm Implementation

Local Swarm currently uses **ensemble voting** (the architecture behind Option 1 above) with standard context windows:

- 2K-8K context (model dependent)
- Each worker loads the full model independently
- No context sharing between workers
- No offloading to system RAM (yet)

## Recommendations

### For 8K-16K Context

Use the current implementation with standard models.

### For 30K+ Context

Choose based on your hardware:

| Setup | Recommended Approach |
|-------|----------------------|
| RTX 4090 (24GB) | Option 1: Long-context models |
| RTX 4060 Ti (16GB) | Option 2: Context compression |
| Multiple machines (federated) | Option 2 or 3 |
| CPU-only | Option 2 with aggressive compression |

### Memory-Constrained Setups

Enable CPU offloading to run more workers. With llama.cpp this is controlled by how many layers are kept on the GPU:

```bash
# llama.cpp example: keep ~20% of a 32-layer model on the GPU,
# leaving roughly 80% of the weights and KV cache in system RAM
./main -m model.gguf --n-gpu-layers 6
```

## Future Enhancements

Potential improvements for Local Swarm:

1. **Context compression layer** (Option 2 implementation)
2. **CPU offloading support** for memory-constrained systems
3. **Hierarchical RAG** for enterprise use cases
4. **Speculative decoding** for 2-3x speedup

## References

- vLLM PagedAttention: efficient KV cache management
- DeepSpeed ZeRO-Infinity: offloading to CPU/NVMe
- Mixtral 8x7B: Mixture of Experts architecture
- Phi-3.5 Technical Report: long-context small models