feat: add --tool-port argument for tool server (default: 17616)
- Tool server now runs on port 17616 by default (separate from main API on 17615)
- Add --tool-port argument to customize tool server port
- Update help text to reflect default port 17616
- Prevent port conflicts when running both services on same machine
@@ -0,0 +1,134 @@
# Local Swarm TODO / Future Enhancements

## Context Window Optimization (For Long Context 30K+)

Based on docs/CONTEXT.md, implement context compression for memory-constrained setups:

### Option 2: Context Compression (Recommended for 16GB VRAM)

**Stage 1: Compression Swarm (3-5 workers)**
- Split 60K input into 6x 10K chunks
- Each worker summarizes one chunk
- Aggregate summaries into 8K compressed context
- Added latency: ~2-3 seconds

**Stage 2: Solution Swarm (N workers)**
- Each worker gets 8K compressed + 2K relevant original
- Generate solutions independently
- Vote on best response

**Benefits:**
- Works with standard 8K-context models
- Maintains swarm consensus architecture
- 2-3x more workers possible

**Implementation:**
```python
# New: CompressionEngine class
class CompressionEngine:
    def compress(self, text: str, target_tokens: int) -> str:
        # Split into chunks
        # Parallel summarization
        # Aggregate results
        pass
```
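The stub above could be fleshed out roughly as follows — a minimal sketch, not the actual implementation; the `summarize_chunk` callable and the ~4 chars/token budget are assumptions made here for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, chunk_tokens: int, chars_per_token: int = 4) -> List[str]:
    """Approximate token-based chunking by character count."""
    size = chunk_tokens * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), size)]


def compress(text: str, target_tokens: int,
             summarize_chunk: Callable[[str], str], workers: int = 4) -> str:
    """Summarize chunks in parallel, then join summaries into one compressed context."""
    chunks = split_into_chunks(text, chunk_tokens=10_000)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize_chunk, chunks))
    compressed = "\n".join(summaries)
    # Naive truncation to the target budget (~4 chars/token)
    return compressed[: target_tokens * 4]
```

In the real pipeline, `summarize_chunk` would be a call into a compression-swarm worker rather than a local function.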

### Option 3: Hierarchical RAG (For 100K+ contexts)

**Tier 1: Indexing**
- Embed context into vector database
- Build searchable knowledge graph

**Tier 2: Retrieval + Generation**
- Query index for relevant context
- Each worker gets ~6K retrieved + 2K raw

**Tier 3: Voting**
- Rerank and consensus

**Use case:** Codebase-wide analysis, large document processing
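A toy sketch of Tiers 1-2, using bag-of-words cosine similarity as a stand-in for real embeddings and a vector database (all names here are illustrative, not part of the codebase):

```python
import math
from collections import Counter
from typing import List, Tuple


def _vec(text: str) -> Counter:
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TieredIndex:
    """Tier 1: index passages. Tier 2: retrieve top-k for a query."""

    def __init__(self) -> None:
        self.passages: List[Tuple[str, Counter]] = []

    def add(self, passage: str) -> None:
        self.passages.append((passage, _vec(passage)))

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        q = _vec(query)
        ranked = sorted(self.passages, key=lambda p: _cosine(q, p[1]), reverse=True)
        return [p[0] for p in ranked[:k]]
```

Tier 3 would then rerank and vote over worker outputs generated from the retrieved passages.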

---

## Tool Execution Enhancements

### Streaming Tool Results
- Stream long file reads progressively
- Show bash command output in real time
- Progress indicators for large operations
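Progressive file reads could be sketched as a generator that yields fixed-size chunks; the function name and chunk size are illustrative assumptions:

```python
from typing import Iterator


def stream_file(path: str, chunk_bytes: int = 65_536) -> Iterator[str]:
    """Yield a large file in chunks so results can stream to the client as they arrive."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                return
            yield chunk
```

Bash output streaming would follow the same shape, yielding from the subprocess pipe instead of a file handle.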

### Tool Permissions
- Configurable permission levels per tool
- Approval required for destructive operations (rm, overwrite)
- Audit log of all tool executions
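One possible shape for permission levels — the tool names and tier assignments below are hypothetical, not the project's actual configuration:

```python
from enum import Enum


class Permission(Enum):
    READ_ONLY = 0    # read/list tools only
    STANDARD = 1     # plus writes inside the workspace
    DESTRUCTIVE = 2  # plus rm/overwrite; would require approval


# Hypothetical per-tool requirements
TOOL_REQUIREMENTS = {
    "read": Permission.READ_ONLY,
    "write": Permission.STANDARD,
    "bash": Permission.DESTRUCTIVE,
}


def is_allowed(tool: str, granted: Permission) -> bool:
    """Unknown tools default to the strictest requirement."""
    required = TOOL_REQUIREMENTS.get(tool, Permission.DESTRUCTIVE)
    return granted.value >= required.value
```

The audit log would then record every `is_allowed` decision alongside the executed tool call.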

### Tool Result Caching
- Cache file reads (hash-based)
- Invalidate on file modification
- Reduce redundant disk I/O
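A minimal sketch of the cache, using mtime/size as a cheap stand-in for content hashing (hashing alone would require reading the file, which is the work being avoided):

```python
import os
from typing import Dict, Tuple


class FileReadCache:
    """Cache file reads; invalidate when the file's mtime or size changes."""

    def __init__(self) -> None:
        # path -> ((mtime, size), cached content)
        self._cache: Dict[str, Tuple[Tuple[float, int], str]] = {}

    def read(self, path: str) -> str:
        stat = os.stat(path)
        stamp = (stat.st_mtime, stat.st_size)
        hit = self._cache.get(path)
        if hit is not None and hit[0] == stamp:
            return hit[1]  # cache hit: skip the disk read
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        self._cache[path] = (stamp, content)
        return content
```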

---

## Federation Improvements

### Automatic Peer Discovery
- Better mDNS reliability
- Fallback to broadcast/multicast
- Manual peer list persistence

### Load Balancing
- Distribute requests across peers based on:
  - Current load (active workers)
  - Latency (response time)
  - Capability (model quality)
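The three signals could combine into a single score per peer — the weights below are placeholders to be tuned, not measured values:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    name: str
    active_workers: int  # current load
    latency_ms: float    # recent response time
    quality: float       # model capability score, 0..1


def pick_peer(peers: List[Peer], w_load: float = 1.0,
              w_latency: float = 0.01, w_quality: float = 2.0) -> Peer:
    """Lower load and latency are better; higher quality is better."""
    def score(p: Peer) -> float:
        return w_quality * p.quality - w_load * p.active_workers - w_latency * p.latency_ms
    return max(peers, key=score)
```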

### Fault Tolerance
- Automatic peer failover
- Retry with different peers
- Degraded mode (fewer voters)
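Peer failover with retry might look like this sketch (transport details omitted; `request` is any callable that raises on failure — both names are illustrative):

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_failover(peers: List[str], request: Callable[[str], T]) -> T:
    """Try each peer in order, falling through to the next on failure."""
    last_error: Optional[Exception] = None
    for peer in peers:
        try:
            return request(peer)
        except Exception as exc:  # real code would catch only transport errors
            last_error = exc
    raise RuntimeError(f"all {len(peers)} peers failed") from last_error
```

Degraded mode would wrap this: if fewer peers respond than the voting quorum expects, proceed with the voters that remain.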

---

## UI/UX Enhancements

### Web Dashboard
- Real-time worker status visualization
- Generation progress bars
- Tool execution log viewer
- Configuration management UI

### Better Error Messages
- Clear explanations of OOM errors
- Suggested configurations based on hardware
- Model compatibility checker

---

## Performance Optimizations

### Speculative Decoding
- Small draft model generates tokens
- Large model verifies (2-3x speedup)
- Requires draft model download

### KV Cache Optimization
- PagedAttention (vLLM-style)
- Memory-efficient attention states
- Better long-context performance

### Model Quantization
- Support for GPTQ/AWQ quantization
- 2-3x smaller models with minimal quality loss
- Enable larger models on same hardware

---

## Completed ✓

- [x] Tool execution architecture (local + remote)
- [x] Simplified tool instructions (300 tokens vs 40K)
- [x] Federation with peer discovery
- [x] Hardware auto-detection
- [x] MLX backend for Apple Silicon
- [x] Consensus voting strategies
- [x] Model auto-selection based on VRAM
@@ -197,11 +197,17 @@ Examples:
         action="store_true",
         help="Run as dedicated tool execution server (executes read/write/bash tools)"
     )
+    parser.add_argument(
+        "--tool-port",
+        type=int,
+        default=17616,
+        help="Port for tool execution server (default: 17616)"
+    )
     parser.add_argument(
         "--tool-host",
         type=str,
         default=None,
-        help="URL of tool execution server (e.g., http://192.168.1.10:17616). Tools will be executed remotely."
+        help="URL of tool execution server (default: http://<local-ip>:17616). Tools will be executed remotely."
     )
     parser.add_argument(
         "--version",
@@ -250,13 +256,14 @@ Examples:
             return {"status": "healthy", "mode": "tool-server"}

         host = args.host if args.host else get_local_ip()
-        print(f"🔗 Tool server running at http://{host}:{args.port}")
+        tool_port = args.tool_port
+        print(f"🔗 Tool server running at http://{host}:{tool_port}")
         print(f"   Endpoints:")
         print(f"   - POST /v1/tools/execute")
         print(f"   - GET /health")
         print(f"\n✅ Tool server ready!")

-        uvicorn.run(app, host=host, port=args.port)
+        uvicorn.run(app, host=host, port=tool_port)
         return

     # Determine model configuration