# Context Window Handling in Local Swarm

## Overview

This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.
## The Core Challenge
When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:
- 7B model at 32K context: ~8GB VRAM per worker
- 7B model at 64K context: ~14GB VRAM per worker
- Input duplication: Each worker processes the full input independently
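
These figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes a Llama-3-8B-like geometry (32 layers, 8 KV heads via grouped-query attention, head dimension 128), 4-bit quantized weights, and an fp16 KV cache; these are our assumptions, not measurements.

```python
# Rough per-worker memory model: quantized weights + fp16 KV cache.
# Geometry is Llama-3-8B-like (assumed): 32 layers, 8 KV heads (GQA),
# head dimension 128. The factor of 2 stores both K and V per token.

def kv_cache_gib(tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # ~128 KiB
    return tokens * per_token / 2**30

WEIGHTS_GIB = 4.5  # ~8B parameters at 4-bit quantization (assumed)

for ctx in (32_768, 65_536):
    total = WEIGHTS_GIB + kv_cache_gib(ctx)
    print(f"{ctx:>6}-token context: ~{total:.1f} GiB per worker")
# -> ~8.5 GiB at 32K and ~12.5 GiB at 64K, in line with the figures above
```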
## Industry Approaches

### 1. Mixture of Experts (MoE)

Used by: Mixtral 8x7B (and, reportedly, GPT-4)
- Full input goes to all "expert" sub-models
- Router network decides which experts to activate
- Each expert is small (e.g., Mixtral's eight 7B experts versus one dense model of comparable total size)
- Trade-off: More parameters total, but only a subset active per token
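
A toy sketch of the routing idea (illustrative only, not any production implementation): a learned router scores all experts for each token, but only the top-k actually run.

```python
import numpy as np

# Toy top-k MoE routing: the router sees the full input, but only k of the
# n experts do any work for a given token.
def moe_layer(x, router_w, experts, k=2):
    logits = x @ router_w                          # one score per expert
    top = np.argsort(logits)[-k:]                  # pick the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                       # softmax over the winners
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8                             # Mixtral-like: 8 experts, top-2
experts = [lambda x, W=rng.normal(size=(dim, dim)): x @ W
           for _ in range(n_experts)]
router_w = rng.normal(size=(dim, n_experts))
out = moe_layer(rng.normal(size=dim), router_w, experts)
```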
### 2. Ensemble Voting (Local Swarm's Approach)

Characteristics (a minimal sketch follows the list):
- Full input to all workers
- Each worker generates independently
- Vote on final outputs
- Pros: True parallel processing, diverse perspectives
- Cons: 100% input duplication, memory intensive
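
The sketch below illustrates this flow; `generate` is a stub standing in for a call to one local model instance (the real Local Swarm worker API may differ).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stub for one local model instance (hypothetical API).
def generate(worker_id: int, prompt: str) -> str:
    rng = random.Random(worker_id)            # each worker samples independently
    return rng.choice(["answer A", "answer B", "answer A"])

def swarm_answer(n_workers: int, prompt: str) -> str:
    # Every worker receives the FULL prompt -- this is the 100% input
    # duplication (and per-worker KV cache) the cons above refer to.
    with ThreadPoolExecutor(n_workers) as pool:
        outputs = pool.map(lambda i: generate(i, prompt), range(n_workers))
    return Counter(outputs).most_common(1)[0][0]  # majority vote

print(swarm_answer(5, "a long 30K-token task ..."))
```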
### 3. Pipeline / Multi-Agent

Used by: LangChain, AutoGPT
- Different workers get different subtasks
- Sequential processing (not parallel)
- Pros: Efficient memory usage, specialization
- Cons: Loses swarm consensus benefit, higher latency
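
For contrast, a minimal sketch of the pipeline style (the stage functions are illustrative stand-ins, not the LangChain or AutoGPT APIs):

```python
# Each stage handles one subtask and only one model is active at a time,
# so memory stays low -- but there is no consensus vote at the end.
def run_pipeline(task: str, stages) -> str:
    result = task
    for stage in stages:                 # strictly sequential
        result = stage(result)
    return result

stages = [
    lambda t: f"outline({t})",           # stand-in for a planning worker
    lambda t: f"draft({t})",             # stand-in for a drafting worker
    lambda t: f"review({t})",            # stand-in for a review worker
]
print(run_pipeline("write release notes", stages))
```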
### 4. Speculative Decoding

Used by: vLLM, Text Generation Inference
- Small "draft" model processes input
- Large model verifies (doesn't reprocess)
- Pros: 2-3x speedup
- Cons: Complex implementation
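
A toy illustration of the accept/reject loop; both "models" are deterministic stand-ins, and real systems like vLLM verify the whole draft in a single batched forward pass rather than token by token.

```python
# Toy speculative decoding: the draft proposes k tokens, the target keeps
# the agreeing prefix and corrects the first mismatch. Up to k tokens are
# produced per target-model pass instead of one.
def draft_propose(ctx, k=4):
    return [f"tok{len(ctx) + i}" for i in range(k)]   # cheap guesses

def target_next(ctx):
    return f"tok{len(ctx)}"                           # the big model's choice

def speculative_step(ctx):
    accepted = []
    for tok in draft_propose(ctx):
        want = target_next(ctx + accepted)
        accepted.append(want)
        if tok != want:                  # first disagreement ends the round
            break
    return ctx + accepted

print(speculative_step(["tok0", "tok1"]))   # -> tok0 .. tok5
```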
## Memory Offloading

### What It Is
Moving part of the model's state from GPU VRAM to system RAM:
- Hot context (active tokens) → GPU VRAM (fast)
- Cold context (earlier tokens) → System RAM (slower)
### Performance Impact

| Configuration | Relative speed | Memory footprint |
|---|---|---|
| 100% GPU | 100% | 20GB VRAM |
| 50% offload | 75% | 10GB VRAM + 10GB RAM |
| 80% offload | 60% | 4GB VRAM + 16GB RAM |
### When to Use
- Recommended: When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
- Trade-off: 25-40% slower, but can run 2-3x more workers
- Implementation: vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
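
As a concrete example of the vLLM route, recent vLLM versions expose a `cpu_offload_gb` engine argument that keeps part of the weights in system RAM (check that your installed version supports it; the model name is just an example).

```python
from vllm import LLM

# Hedged sketch: hold ~4 GiB of weights in system RAM to shrink the
# per-worker VRAM footprint, at the cost of some throughput.
llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",  # example long-context model
    cpu_offload_gb=4,                          # weights offloaded to RAM (GiB)
    gpu_memory_utilization=0.8,                # cap this worker's VRAM share
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```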
## Can Workers Share Context?

### The Short Answer

- Raw input tokens: yes (negligible memory)
- KV cache (attention states): no (~99% of the memory, unique per worker)
### Why KV Cache Can't Be Shared

The attention mechanism requires unique Key/Value tensors per token position:

```
Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous
```
Even with identical input, the caches cannot be pooled in practice:

- During prefill, the K/V tensors for the same prompt are identical across workers, but each worker is a separate model instance with its own GPU allocation, so there is no shared memory to hold them (engine-level tricks like vLLM's prefix caching only apply within a single inference engine)
- Once sampling begins, workers with different seeds choose different tokens, and their caches diverge position by position
- The "notes and highlights" (KV cache) therefore end up unique to each worker
### Analogy
Five people reading the same book:
- ✅ Can share: The physical book (input tokens)
- ❌ Can't share: Their notes, highlights, thoughts (KV cache)
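
The imbalance is easy to quantify. Using the same assumed Llama-3-8B-like geometry as the estimate earlier:

```python
# The shareable part (raw token IDs) is negligible next to the per-worker
# KV cache, which is why sharing it buys almost nothing.
tokens = 60_000
input_mib = tokens * 4 / 2**20                      # int32 token IDs: ~0.23 MiB
kv_gib = tokens * (2 * 32 * 8 * 128 * 2) / 2**30    # fp16 KV cache: ~7.3 GiB
print(f"shareable input: {input_mib:.2f} MiB vs "
      f"per-worker KV cache: {kv_gib:.1f} GiB")
```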
## Options for Long Context (30K-60K+ Tokens)

### Option 1: Long-Context Models

Models: Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)
Pros:
- Simplest architecture
- True parallel swarm voting
- No preprocessing
Cons:
- Requires 8-12GB VRAM per worker at 60K context
- Limited model selection
Best for: Users with high-end GPUs (RTX 4090, 24GB+ VRAM)
### Option 2: Context Compression

Architecture: two-stage processing (see the sketch at the end of this option)

```
Stage 1: Compression swarm (3-5 workers)
  - Split the 60K input into chunks
  - Summarize each chunk
  - Aggregate into an ~8K compressed context

Stage 2: Solution swarm (N workers)
  - Each worker gets the 8K compressed context + 2K of relevant original text
  - Generate independently
  - Vote on the best output
```
Pros:
- Works with standard 8K models
- Maintains swarm architecture
- More workers possible
Cons:
- Potential information loss
- Added latency (~2-3s)
Best for: Users with 8-16GB VRAM who need 30K+ context
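
A sketch of the two-stage flow; `summarize` and `solve` are hypothetical stand-ins for calls to local workers, and the 4-chars-per-token chunking is a crude heuristic.

```python
# Stage 1 compresses the long document; stage 2 runs the normal swarm on
# the compressed context. All worker calls are stand-ins (assumed API).
CHARS_PER_TOKEN = 4  # rough heuristic

def summarize(chunk: str) -> str:
    return chunk[:200]                    # stand-in for a compression worker

def solve(context: str, task: str) -> str:
    return f"solution({task!r})"          # stand-in for a solution worker

def compressed_swarm(document: str, task: str, n_solvers: int = 5,
                     chunk_tokens: int = 12_000) -> list[str]:
    step = chunk_tokens * CHARS_PER_TOKEN
    chunks = [document[i:i + step] for i in range(0, len(document), step)]
    compressed = "\n".join(summarize(c) for c in chunks)   # target: ~8K tokens
    return [solve(compressed, task) for _ in range(n_solvers)]  # then vote

answers = compressed_swarm("..." * 100_000, "fix the failing test")
```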
### Option 3: Hierarchical RAG

Architecture: three-tier system (see the sketch at the end of this option)

```
Tier 1: Indexing swarm
  - Embed the context into a vector database
  - Create a searchable knowledge graph

Tier 2: Retrieval + generation
  - Query the index for relevant context
  - Each worker gets ~6K retrieved + 2K raw tokens
  - Generate solutions

Tier 3: Voting swarm
  - Rerank and reach consensus
```
Pros:
- Scales to 100K+ tokens
- Most robust to information loss
- Specialized workers
Cons:
- Complex implementation
- 3x higher latency
- Requires vector DB
Best for: Maximum accuracy, production deployments
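
A compressed sketch of the three tiers; `embed` is a stand-in (a real build would use an embedding model plus a vector DB such as FAISS or Chroma), and the generation and rerank steps are stubs.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=64)                       # stand-in embedding

def hierarchical_rag(chunks, query, n_workers=3, top_k=4):
    index = np.stack([embed(c) for c in chunks])     # Tier 1: index once
    scores = index @ embed(query)                    # Tier 2: retrieve...
    retrieved = [chunks[i] for i in np.argsort(scores)[-top_k:]]
    answers = [f"worker{w}({retrieved[0][:20]}...)"  # ...and generate (stubs)
               for w in range(n_workers)]
    return max(set(answers), key=answers.count)      # Tier 3: vote / rerank

docs = [f"chunk {i}: ..." for i in range(50)]
print(hierarchical_rag(docs, "what changed in v2?"))
```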
## Current Local Swarm Implementation

Local Swarm currently uses ensemble voting (approach 2 above) with standard context windows:
- 2K-8K context (model dependent)
- Each worker loads full model independently
- No context sharing between workers
- No offloading to system RAM (yet)
## Recommendations

### For 8K-16K Context

Use the current implementation with standard models.

### For 30K+ Context

Choose based on your hardware:
| Setup | Recommended Approach |
|---|---|
| RTX 4090 (24GB) | Option 1: Long-context models |
| RTX 4060 Ti (16GB) | Option 2: Context compression |
| Multiple machines (federated) | Option 2 or 3 |
| CPU-only | Option 2 with aggressive compression |
### Memory-Constrained Setups

Enable CPU offloading to run more workers. In llama.cpp, placement is controlled by `--n-gpu-layers` (`-ngl`): layers not assigned to the GPU run from system RAM.

```bash
# Keep roughly a quarter of a 32-layer model on the GPU; the remaining
# layers run from system RAM (see ./main --help for your version's flags)
./main -m model.gguf --n-gpu-layers 8
```
## Future Enhancements
Potential improvements for Local Swarm:
- Context compression layer (Option 2 implementation)
- CPU offloading support for memory-constrained systems
- Hierarchical RAG for enterprise use cases
- Speculative decoding for 2-3x speedup
## References
- vLLM PagedAttention: Efficient KV cache management
- DeepSpeed ZeRO-Infinity: Offloading to CPU/NVMe
- Mixtral 8x7B: Mixture of Experts architecture
- Phi-3.5 Technical Report: Long-context small models