Local Swarm TODO / Future Enhancements

Context Window Optimization (for 30K+ token contexts)

Based on docs/CONTEXT.md, implement context compression for memory-constrained setups:

Stage 1: Compression Swarm (3-5 workers)

  • Split 60K input into 6x 10K chunks
  • Each worker summarizes one chunk
  • Aggregate summaries into 8K compressed context
  • Added latency: ~2-3 seconds

Stage 2: Solution Swarm (N workers)

  • Each worker gets 8K compressed + 2K relevant original
  • Generate solutions independently
  • Vote on best response

Benefits:

  • Works with standard 8K models
  • Maintains swarm consensus architecture
  • 2-3x more workers possible

Implementation:

# New: CompressionEngine class (sketch; summarize_fn is a pluggable worker call)
class CompressionEngine:
    def __init__(self, summarize_fn, chunk_size=10_000):
        self.summarize_fn = summarize_fn  # e.g. one swarm worker per chunk
        self.chunk_size = chunk_size

    def compress(self, text: str, target_tokens: int) -> str:
        # Split into chunks
        chunks = [text[i:i + self.chunk_size]
                  for i in range(0, len(text), self.chunk_size)]
        # Summarize each chunk (parallel in the real swarm; sequential here)
        budget = target_tokens // max(len(chunks), 1)
        summaries = [self.summarize_fn(c, budget) for c in chunks]
        # Aggregate results
        return "\n".join(summaries)

Option 3: Hierarchical RAG (for 100K+ token contexts)

Tier 1: Indexing

  • Embed context into vector database
  • Build searchable knowledge graph

Tier 2: Retrieval + Generation

  • Query index for relevant context
  • Each worker gets ~6K retrieved + 2K raw

Tier 3: Voting

  • Rerank and consensus

Use case: Codebase-wide analysis, large document processing
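The three tiers above can be sketched end-to-end. This is a minimal illustration only: `TinyIndex`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" stands in for a real neural embedder and vector database.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real index would use a neural embedder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyIndex:
    """Tier 1: index context chunks; Tier 2: retrieve top-k for a query."""
    def __init__(self):
        self.chunks = []  # list of (text, embedding) pairs

    def add(self, chunk: str):
        self.chunks.append((chunk, embed(chunk)))

    def retrieve(self, query: str, k: int = 2):
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [c[0] for c in ranked[:k]]
```

Tier 3 (voting) would then run each worker on its ~6K retrieved slice and rerank the candidate answers.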


Tool Execution Enhancements

Streaming Tool Results

  • Stream long file reads progressively
  • Show bash command output in real-time
  • Progress indicators for large operations
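Progressive file reads could be exposed as a generator so callers stream chunks as they arrive instead of blocking on one large read. A minimal sketch (the function name `stream_file` is an assumption, not existing API):

```python
def stream_file(path: str, chunk_size: int = 4096):
    """Yield a file's contents progressively instead of one large read.

    Callers can forward each chunk to the client as it arrives,
    which is the basis for progress indicators on large files.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```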

Tool Permissions

  • Configurable permission levels per tool
  • Approval required for destructive operations (rm, overwrite)
  • Audit log of all tool executions
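One possible shape for this: a gate that allows safe tools, requires an approval callback for destructive ones, and records every decision. `PermissionGate` and the `DESTRUCTIVE` set are illustrative starting points, not the project's actual design.

```python
from datetime import datetime, timezone

# Assumed starter set of destructive commands; would be configurable per tool.
DESTRUCTIVE = {"rm", "mv", "dd", "truncate"}

class PermissionGate:
    def __init__(self, approver=None):
        # Approval callback for destructive operations; deny by default.
        self.approver = approver or (lambda cmd: False)
        self.audit_log = []  # audit trail of every tool execution decision

    def check(self, command: str) -> bool:
        parts = command.split()
        tool = parts[0] if parts else ""
        allowed = tool not in DESTRUCTIVE or self.approver(command)
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "command": command,
            "allowed": allowed,
        })
        return allowed
```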

Tool Result Caching

  • Cache file reads (hash-based)
  • Invalidate on file modification
  • Reduce redundant disk I/O
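A cache along these lines could key on the path and invalidate using the file's mtime/size signature as a cheap change check (a content-hash variant would digest the bytes instead). `FileReadCache` is a hypothetical sketch:

```python
import os

class FileReadCache:
    """Cache file reads; invalidate when mtime or size changes.

    The (st_mtime_ns, st_size) pair is a cheap stand-in for hashing:
    if neither changed, the cached content is assumed still valid.
    """
    def __init__(self):
        self._cache = {}  # path -> (stat_signature, content)

    def read(self, path: str) -> str:
        st = os.stat(path)
        sig = (st.st_mtime_ns, st.st_size)
        hit = self._cache.get(path)
        if hit is not None and hit[0] == sig:
            return hit[1]  # cache hit: skip disk read
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        self._cache[path] = (sig, content)
        return content
```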

Federation Improvements

Automatic Peer Discovery

  • Better mDNS reliability
  • Fallback to broadcast/multicast
  • Manual peer list persistence

Load Balancing

  • Distribute requests across peers based on:
    • Current load (active workers)
    • Latency (response time)
    • Capability (model quality)
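The three criteria above could combine into a single weighted score per peer; lowest score wins. The `Peer` fields and weights here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    active_workers: int   # current load
    latency_ms: float     # recent response time
    quality: float        # model capability score in [0, 1]

def pick_peer(peers, w_load=1.0, w_latency=0.01, w_quality=2.0):
    """Lower score is better: penalize load and latency, reward quality."""
    def score(p):
        return (w_load * p.active_workers
                + w_latency * p.latency_ms
                - w_quality * p.quality)
    return min(peers, key=score)
```

Tuning the weights trades off throughput (load), responsiveness (latency), and answer quality (capability).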

Fault Tolerance

  • Automatic peer failover
  • Retry with different peers
  • Degraded mode (fewer voters)
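All three bullets fit into one request loop: try each peer, skip failures, and accept a degraded quorum if enough voters responded. A sketch under those assumptions (`call_with_failover` and `min_voters` are hypothetical names):

```python
def call_with_failover(peers, request_fn, min_voters=1):
    """Collect responses from peers, tolerating individual failures.

    Succeeds as long as at least `min_voters` peers responded,
    i.e. degraded mode runs consensus with fewer voters.
    """
    responses = []
    for peer in peers:
        try:
            responses.append(request_fn(peer))
        except Exception:
            continue  # failover: move on to the next peer
    if len(responses) < min_voters:
        raise RuntimeError("all peers failed")
    return responses
```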

UI/UX Enhancements

Web Dashboard

  • Real-time worker status visualization
  • Generation progress bars
  • Tool execution log viewer
  • Configuration management UI

Better Error Messages

  • Clear explanations of OOM errors
  • Suggested configurations based on hardware
  • Model compatibility checker

Performance Optimizations

Speculative Decoding

  • Small draft model generates tokens
  • Large model verifies (2-3x speedup)
  • Requires draft model download
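The core accept/reject loop of speculative decoding can be shown with toy token functions standing in for the two models (the real speedup comes from the large model verifying all k draft tokens in one batched forward pass):

```python
def speculative_step(draft_fn, verify_fn, prefix, k=4):
    """One speculative-decoding step.

    The draft model proposes k cheap tokens; the large model's next-token
    predictions accept the longest agreeing prefix, then substitute its own
    token at the first disagreement.
    """
    proposed = draft_fn(prefix, k)
    accepted = []
    for tok in proposed:
        expected = verify_fn(prefix + accepted)  # target model's next token
        if tok == expected:
            accepted.append(tok)          # draft token verified
        else:
            accepted.append(expected)     # take the target's token instead
            break
    return accepted
```

With a good draft model most proposals are accepted, so several tokens land per large-model step instead of one.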

KV Cache Optimization

  • PagedAttention (vLLM-style)
  • Memory-efficient attention states
  • Better long-context performance

Model Quantization

  • Support for GPTQ/AWQ quantization
  • 2-3x smaller models with minimal quality loss
  • Enable larger models on same hardware

Completed ✓

  • Tool execution architecture (local + remote)
  • Simplified tool instructions (300 tokens vs 40k)
  • Federation with peer discovery
  • Hardware auto-detection
  • MLX backend for Apple Silicon
  • Consensus voting strategies
  • Model auto-selection based on VRAM