Local Swarm TODO / Future Enhancements
Context Window Optimization (For Long Contexts, 30K+ Tokens)
Based on docs/CONTEXT.md, implement context compression for memory-constrained setups:
Option 2: Context Compression (Recommended for 16GB VRAM)
Stage 1: Compression Swarm (3-5 workers)
- Split the 60K-token input into six 10K-token chunks
- Each worker summarizes one chunk
- Aggregate the summaries into an 8K-token compressed context
- Added latency: ~2-3 seconds
Stage 2: Solution Swarm (N workers)
- Each worker gets the 8K compressed context + 2K of relevant original text
- Generate solutions independently
- Vote on best response
Benefits:
- Works with standard 8K-context models
- Maintains swarm consensus architecture
- 2-3x more workers possible
Implementation:
# New: CompressionEngine class
class CompressionEngine:
    def compress(self, text: str, target_tokens: int) -> str:
        # 1. Split text into chunks
        # 2. Summarize chunks in parallel
        # 3. Aggregate summaries into one compressed context
        pass
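A minimal sketch of what compress() could look like, assuming a hypothetical summarize_chunk() helper that wraps one swarm worker request, and a rough 4-characters-per-token approximation (neither exists in the codebase yet):

from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk: str, max_tokens: int) -> str:
    """Hypothetical: dispatch a summarization prompt to one swarm worker."""
    ...

def compress(text: str, target_tokens: int, chunk_tokens: int = 10_000) -> str:
    # ~4 characters per token is a crude but workable approximation
    chunk_chars = chunk_tokens * 4
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    budget = target_tokens // max(len(chunks), 1)
    # Stage 1: each worker summarizes one chunk in parallel
    with ThreadPoolExecutor(max_workers=max(len(chunks), 1)) as pool:
        summaries = pool.map(lambda c: summarize_chunk(c, budget), chunks)
    # Aggregate the summaries into a single compressed context
    return "\n\n".join(summaries)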
Option 3: Hierarchical RAG (For 100K+ contexts)
Tier 1: Indexing
- Embed context into vector database
- Build searchable knowledge graph
Tier 2: Retrieval + Generation
- Query index for relevant context
- Each worker gets ~6K of retrieved context + 2K of raw context
Tier 3: Voting
- Rerank candidate responses and vote to reach consensus
Use case: Codebase-wide analysis, large document processing
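A minimal sketch of the indexing and retrieval tiers, assuming a hypothetical embed() function and a simple in-memory cosine-similarity search (a real implementation would use a vector database):

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical: return an embedding vector for the text."""
    ...

class ContextIndex:
    def __init__(self, passages: list[str]):
        self.passages = passages
        # Tier 1: embed every passage once, up front
        self.vectors = np.stack([embed(p) for p in passages])

    def retrieve(self, query: str, k: int = 8) -> list[str]:
        # Tier 2: rank passages by cosine similarity to the query
        q = embed(query)
        sims = (self.vectors @ q) / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q))
        top = np.argsort(sims)[::-1][:k]
        return [self.passages[i] for i in top]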
Tool Execution Enhancements
Streaming Tool Results
- Stream long file reads progressively
- Show bash command output in real-time
- Progress indicators for large operations
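A minimal sketch of both streaming paths, using plain generators (function names are illustrative):

import subprocess
from typing import Iterator

def stream_file(path: str, chunk_bytes: int = 64 * 1024) -> Iterator[str]:
    # Yield the file chunk by chunk so the caller can render progressively
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk

def stream_command(cmd: list[str]) -> Iterator[str]:
    # Yield the command's stdout line by line, in real time
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    yield from proc.stdout
    proc.wait()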
Tool Permissions
- Configurable permission levels per tool
- Approval required for destructive operations (rm, overwrite)
- Audit log of all tool executions
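A minimal sketch of per-tool permission levels; the level names, default policy, and approval hook are all assumptions:

from enum import Enum
from typing import Callable

class Permission(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # require approval first
    DENY = "deny"    # never run

# Illustrative defaults: destructive operations require approval
POLICY = {
    "read_file": Permission.ALLOW,
    "write_file": Permission.ASK,
    "bash": Permission.ASK,
}

def check_permission(tool: str, approve: Callable[[str], bool]) -> bool:
    level = POLICY.get(tool, Permission.DENY)  # deny unknown tools
    if level is Permission.ALLOW:
        return True
    if level is Permission.ASK:
        return approve(tool)  # e.g. prompt the user interactively
    return False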
Tool Result Caching
- Cache file reads (hash-based)
- Invalidate on file modification
- Reduce redundant disk I/O
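A minimal sketch of such a cache (an assumed design, not existing code); it uses an mtime/size stamp as a cheaper stand-in for a full content hash:

import os

# path -> ((mtime, size), contents)
_read_cache: dict[str, tuple[tuple[float, int], str]] = {}

def cached_read(path: str) -> str:
    stat = os.stat(path)
    stamp = (stat.st_mtime, stat.st_size)
    hit = _read_cache.get(path)
    if hit is not None and hit[0] == stamp:
        return hit[1]  # file unchanged since the last read
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        content = f.read()
    _read_cache[path] = (stamp, content)
    return content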
Federation Improvements
Automatic Peer Discovery
- Better mDNS reliability
- Fallback to broadcast/multicast
- Manual peer list persistence
Load Balancing
- Distribute requests across peers based on:
  - Current load (active workers)
  - Latency (response time)
  - Capability (model quality)
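A minimal sketch of scoring peers along those three axes; the field names and weights are illustrative and would need tuning:

from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    active_workers: int  # current load
    latency_ms: float    # recent response time
    quality: float       # model capability score in [0, 1]

def pick_peer(peers: list[Peer]) -> Peer:
    # Lower load and latency are better; higher quality is better
    def score(p: Peer) -> float:
        return 1.0 * p.active_workers + 0.01 * p.latency_ms - 2.0 * p.quality
    return min(peers, key=score)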
Fault Tolerance
- Automatic peer failover
- Retry with different peers
- Degraded mode (fewer voters)
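A minimal sketch of retry-with-failover, assuming a hypothetical send_request() RPC that raises on peer failure:

def send_request(peer: str, payload: dict) -> dict:
    """Hypothetical: forward a generation request to a federation peer."""
    ...

def request_with_failover(peers: list[str], payload: dict) -> dict:
    errors: list[tuple[str, Exception]] = []
    for peer in peers:  # try peers in priority order
        try:
            return send_request(peer, payload)
        except (ConnectionError, TimeoutError) as exc:
            errors.append((peer, exc))  # fall through to the next peer
    # Degraded mode would go here: rerun the vote with fewer voters
    raise RuntimeError(f"all peers failed: {errors}")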
UI/UX Enhancements
Web Dashboard
- Real-time worker status visualization
- Generation progress bars
- Tool execution log viewer
- Configuration management UI
Better Error Messages
- Clear explanations of OOM errors
- Suggested configurations based on hardware
- Model compatibility checker
Performance Optimizations
Speculative Decoding
- Small draft model generates tokens
- Large model verifies (2-3x speedup)
- Requires draft model download
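A minimal sketch of the greedy variant of the draft-and-verify loop (real speculative sampling uses a probabilistic acceptance rule); draft_tokens() and verify_tokens() are hypothetical model interfaces:

def draft_tokens(ids: list[int], k: int) -> list[int]:
    """Hypothetical: small draft model proposes the next k tokens."""
    ...

def verify_tokens(ids: list[int], proposed: list[int]) -> list[int]:
    """Hypothetical: large model scores the proposed span in one forward
    pass and returns its own greedy token at each position."""
    ...

def speculative_step(ids: list[int], k: int = 4) -> list[int]:
    proposed = draft_tokens(ids, k)
    verified = verify_tokens(ids, proposed)
    accepted: list[int] = []
    for p, v in zip(proposed, verified):
        accepted.append(v)  # the large model's token is always valid
        if p != v:          # first mismatch: stop accepting draft tokens
            break
    return ids + accepted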
KV Cache Optimization
- PagedAttention (vLLM-style)
- Memory-efficient attention states
- Better long-context performance
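A minimal sketch of the paged idea: KV entries live in fixed-size physical blocks, and each sequence keeps a block table instead of one contiguous buffer (all names illustrative):

BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))     # pool of physical blocks
        self.tables: dict[int, list[int]] = {}  # seq_id -> block table

    def slot_for(self, seq_id: int, pos: int) -> tuple[int, int]:
        # Return (block, offset) where token `pos`'s KV entry is stored
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0 and pos // BLOCK_SIZE == len(table):
            table.append(self.free.pop())  # lazily allocate a new block
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        # Finished sequences return their blocks to the shared pool
        self.free.extend(self.tables.pop(seq_id, []))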
Model Quantization
- Support for GPTQ/AWQ quantization
- 2-3x smaller models with minimal quality loss
- Enable larger models on same hardware
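As a rough worked example of the savings: a 7B-parameter model needs ~14 GB for weights at fp16 (2 bytes/param) but only ~3.5 GB at 4-bit (0.5 bytes/param), before KV cache and activation overhead, which is what lets larger models fit on the same hardware.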
Completed ✓
- Tool execution architecture (local + remote)
- Simplified tool instructions (300 tokens vs 40K)
- Federation with peer discovery
- Hardware auto-detection
- MLX backend for Apple Silicon
- Consensus voting strategies
- Model auto-selection based on VRAM