Context Window Handling in Local Swarm

Overview

This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.

The Core Challenge

When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:

  • 7B model at 32K context: ~8GB VRAM per worker
  • 7B model at 64K context: ~14GB VRAM per worker
  • Input duplication: Each worker processes the full input independently (see the memory sketch below)
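
For intuition, a back-of-the-envelope sketch of where those numbers come from. It assumes a Mistral-7B-like layout (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 KV cache; exact figures vary by model.

# Rough KV cache size for a 7B-class model (assumed layout:
# 32 layers, 8 KV heads via GQA, head dim 128, fp16 cache)
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # factor 2 = separate Key and Value tensors per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 2**30

for ctx in (32_768, 65_536):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.0f} GiB of KV cache per worker")
# 32768 tokens -> 4 GiB, 65536 tokens -> 8 GiB; add a few GiB for
# quantized weights and you land roughly at the figures above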

Industry Approaches

1. Mixture of Experts (MoE)

Used by: Mixtral 8x7B (and, reportedly, GPT-4)

  • Full input goes to all "expert" sub-models
  • Router network decides which experts to activate
  • Each expert is smaller (e.g., 8x7B vs 1x56B equivalent)
  • Trade-off: More parameters total, but only a subset active per token (see the routing sketch below)
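
A minimal sketch of the routing idea, with NumPy stubs standing in for real sub-models (all names here are illustrative):

import numpy as np

def moe_forward(x, experts, router_w, k=2):
    # score each expert, keep the top-k (Mixtral activates 2 of 8)
    logits = router_w @ x
    top = np.argsort(logits)[-k:]
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners
    # only the chosen experts run; the rest stay idle for this token
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(16, 16)): W @ x for _ in range(8)]
router_w = rng.normal(size=(8, 16))
y = moe_forward(rng.normal(size=16), experts, router_w)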

2. Ensemble Voting (Local Swarm's Approach)

Characteristics:

  • Full input to all workers
  • Each worker generates independently
  • Vote on final outputs
  • Pros: True parallel processing, diverse perspectives
  • Cons: 100% input duplication, memory intensive (see the voting sketch below)
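
The voting step itself is simple. A minimal sketch (not Local Swarm's actual code), with stub workers standing in for independent model instances:

from collections import Counter

def swarm_vote(workers, prompt):
    # every worker sees the full prompt -> 100% input duplication
    answers = [worker(prompt) for worker in workers]
    # most common answer wins (ties go to the first seen)
    return Counter(answers).most_common(1)[0][0]

workers = [lambda p: "42", lambda p: "42", lambda p: "41"]
print(swarm_vote(workers, "What is 6 * 7?"))  # -> 42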

3. Pipeline/Multi-Agent

Used by: LangChain, AutoGPT

  • Different workers get different subtasks
  • Sequential processing (not parallel)
  • Pros: Efficient memory usage, specialization
  • Cons: Loses swarm consensus benefit, higher latency (see the hand-off sketch below)
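
For contrast with the parallel swarm above, a sketch of the sequential hand-off (the stage functions are placeholders):

def run_pipeline(stages, task):
    # one worker at a time: low peak memory, but latency adds up
    result = task
    for stage in stages:
        result = stage(result)
    return result

stages = [
    lambda t: t + " -> outline",
    lambda t: t + " -> draft",
    lambda t: t + " -> review",
]
print(run_pipeline(stages, "write docs"))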

4. Speculative Decoding

Used by: vLLM, Text Generation Inference

  • Small "draft" model proposes the next few tokens cheaply
  • Large model verifies all draft tokens in one batched pass (instead of generating them one at a time)
  • Pros: 2-3x speedup
  • Cons: Complex implementation (see the loop sketch below)
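
A much-simplified greedy sketch of the draft-and-verify loop (real schemes, e.g. in vLLM, use rejection sampling; the stub models here are illustrative):

def speculative_decode(draft_model, target_model, tokens, k=4, max_new=6):
    start = len(tokens)
    while len(tokens) - start < max_new:
        drafted = draft_model(tokens, k)        # k cheap guesses from the small model
        agreed = target_model(tokens, drafted)  # one big-model pass checks all k
        tokens = tokens + agreed                # advances by at least one token
    return tokens

draft_model = lambda toks, k: ["the"] * k                   # stub: always guesses "the"
target_model = lambda toks, drafted: drafted[:2] + ["end"]  # stub: accepts 2, corrects 1
print(speculative_decode(draft_model, target_model, ["<s>"]))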

Memory Offloading

What It Is

Moving part of the model's state from GPU VRAM to system RAM:

  • Hot context (active tokens) → GPU VRAM (fast)
  • Cold context (earlier tokens) → System RAM (slower)

Performance Impact

Configuration   Speed   Memory
100% GPU        100%    20 GB VRAM
50% offload     75%     10 GB VRAM + 10 GB RAM
80% offload     60%     4 GB VRAM + 16 GB RAM
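
The arithmetic behind the table, assuming a worker whose total state is 20 GB:

def offload_split(offload_frac, total_gib=20):
    # the offloaded fraction lives in system RAM, the rest stays in VRAM
    return total_gib * (1 - offload_frac), total_gib * offload_frac

for frac in (0.0, 0.5, 0.8):
    vram, ram = offload_split(frac)
    print(f"{frac:.0%} offload -> {vram:.0f} GiB VRAM + {ram:.0f} GiB RAM")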

When to Use

  • Recommended: When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
  • Trade-off: 25-40% slower, but can run 2-3x more workers
  • Implementation: vLLM, DeepSpeed ZeRO-Infinity, llama.cpp

Can Workers Share Context?

The Short Answer

  • Raw input tokens: Yes (negligible memory)
  • KV cache (attention states): No (99% of the memory, unique per worker)

Why KV Cache Can't Be Shared

The attention mechanism requires unique Key/Value tensors per token position:

Token 1: [K1, V1] ← unique to this position
Token 2: [K2, V2] ← depends on Token 1
...
Token N: [KN, VN] ← depends on all previous

Even with the same input:

  • Each worker holds its KV cache in its own process and GPU memory
  • Different sampling seeds → different generated tokens, so the caches diverge as soon as generation starts
  • Each worker builds its own understanding: the "notes and highlights" (KV cache) end up unique per worker (see the toy illustration below)

Analogy

Five people reading the same book:

  • Can share: The physical book (input tokens)
  • Can't share: Their notes, highlights, thoughts (KV cache)

Options for Long Context (30K-60K+ tokens)

Option 1: Long-Context Models

Models: Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)

Pros:

  • Simplest architecture
  • True parallel swarm voting
  • No preprocessing

Cons:

  • Requires 8-12GB VRAM per worker at 60K context
  • Limited model selection

Best for: Users with high-end GPUs (RTX 4090, 24GB+ VRAM)

Option 2: Context Compression

Architecture: Two-stage processing (both stages sketched in code below)

Stage 1: Compression swarm (3-5 workers)

  • Split the 60K-token input into chunks
  • Summarize each chunk
  • Aggregate the summaries into an ~8K compressed context

Stage 2: Solution swarm (N workers)

  • Each worker gets 8K compressed + 2K relevant original
  • Generate independently
  • Vote on best
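
A sketch of both stages, using character counts as a stand-in for token counts; summarize, vote, and the worker callables are placeholders:

def compress_context(summarize, full_text, chunk_size=8000, target_size=8000):
    # Stage 1: chunk, summarize each chunk (3-5 workers can do this in
    # parallel), then concatenate into one compressed context
    chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]
    return "\n".join(summarize(c) for c in chunks)[:target_size]

def solve(workers, vote, compressed, relevant_original):
    # Stage 2: every solution worker sees the same compressed context
    # plus a slice of the raw original, then the swarm votes
    prompt = compressed + "\n" + relevant_original
    return vote([worker(prompt) for worker in workers])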

Pros:

  • Works with standard 8K models
  • Maintains swarm architecture
  • More workers possible

Cons:

  • Potential information loss
  • Added latency (~2-3s)

Best for: Users with 8-16GB VRAM who need 30K+ context

Option 3: Hierarchical RAG

Architecture: Three-tier system (retrieval sketched in code below)

Tier 1: Indexing swarm

  • Embed context into vector database
  • Create searchable knowledge graph

Tier 2: Retrieval + Generation

  • Query index for relevant context
  • Each worker gets ~6K retrieved + 2K raw
  • Generate solutions

Tier 3: Voting swarm

  • Rerank and consensus
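
A minimal sketch of tiers 1 and 2, with a toy word-count embedding standing in for a real embedding model and vector database:

import numpy as np

def build_index(embed, chunks):
    # Tier 1: embed every chunk of the long context
    return np.stack([embed(c) for c in chunks])

def retrieve(embed, index, chunks, query, k=3):
    # Tier 2: cosine similarity, keep only the top-k chunks per worker
    q = embed(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]

embed = lambda text: np.array([text.count(w) for w in ("cache", "memory", "vote")], float)
chunks = ["KV cache lives in VRAM", "workers vote on answers", "system RAM is slower memory"]
index = build_index(embed, chunks)
print(retrieve(embed, index, chunks, "where does the cache live?", k=1))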

Pros:

  • Scales to 100K+ tokens
  • Most robust to information loss
  • Specialized workers

Cons:

  • Complex implementation
  • 3x higher latency
  • Requires vector DB

Best for: Maximum accuracy, production deployments

Current Local Swarm Implementation

Local Swarm currently uses Ensemble Voting (industry approach 2 above) with standard context windows:

  • 2K-8K context (model dependent)
  • Each worker loads full model independently
  • No context sharing between workers
  • No offloading to system RAM (yet)

Recommendations

For 8K-16K Context

Use the current implementation with standard models.

For 30K+ Context

Choose based on your hardware:

Setup                           Recommended Approach
RTX 4090 (24GB)                 Option 1: Long-context models
RTX 4060 Ti (16GB)              Option 2: Context compression
Multiple machines (federated)   Option 2 or 3
CPU-only                        Option 2 with aggressive compression

Memory-Constrained Setups

Enable CPU offloading to run more workers. In llama.cpp this is controlled with --n-gpu-layers: layers you do not place on the GPU run from system RAM.

# llama.cpp example: keep ~20% of a 32-layer model on the GPU,
# leaving the other ~80% in system RAM
./main -m model.gguf --n-gpu-layers 6

Future Enhancements

Potential improvements for Local Swarm:

  1. Context compression layer (Option 2 implementation)
  2. CPU offloading support for memory-constrained systems
  3. Hierarchical RAG for enterprise use cases
  4. Speculative decoding for 2-3x speedup

References

  • vLLM PagedAttention: Efficient KV cache management
  • DeepSpeed ZeRO-Infinity: Offloading to CPU/NVMe
  • Mixtral 8x7B: Mixture of Experts architecture
  • Phi-3.5 Technical Report: Long-context small models