5 Commits

d33fa406b6
feat: CUDA/Android support and federation metrics (#7)
* optimize(federation): run local and peer generation in parallel

  Previously, the federation waited for local generation to complete before asking peers to generate. This wasted time since peers sat idle while the host generated. Now the local swarm and all peers generate simultaneously:

  - Fire local generation and peer requests at the same time
  - Wait for all to complete with asyncio.gather()
  - Then run global consensus

  This reduces total generation time from ~2x to ~1x when using federation with multiple nodes.

  Changes:
  - Modified generate_with_federation() to run tasks in parallel
  - Updated logging to reflect parallel execution
  - Added proper error handling for local generation failures

* feat(federation): add federation support to streaming path

  Previously, federation only worked with non-streaming requests. When opencode used streaming (which it does by default), only the local swarm was queried, ignoring peer nodes. Now, when federation is enabled and peers exist:

  - Start federation generation in the background (parallel)
  - Stream from the local swarm immediately
  - Log federation results when complete

  This enables federation to work with opencode and other streaming clients while maintaining fast streaming responses. Also added webfetch instructions to prevent hallucinated URLs.

  Changes:
  - Modified the streaming path to detect and use federation
  - Added asyncio import
  - Updated tool instructions to prevent URL hallucination

* fix(federation): wait for consensus and use federated result in streaming

  Changed federation in streaming mode to:

  - Wait for ALL nodes to complete generation
  - Use the consensus result (not just the local one)
  - Stream the federated response to the client

  This ensures voting from all nodes is properly considered. The previous implementation streamed locally while federation ran in the background for logging only, which ignored the consensus.
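The parallel fan-out described in the first sub-commit can be sketched with `asyncio.gather()`. This is a minimal illustration, not the project's actual code: `generate_local`, `query_peer`, and the function signature are placeholder names standing in for the real swarm and peer calls.

```python
import asyncio

async def generate_local(prompt: str) -> str:
    # Stand-in for local swarm generation (illustrative).
    await asyncio.sleep(0.01)
    return f"local:{prompt}"

async def query_peer(peer: str, prompt: str) -> str:
    # Stand-in for an HTTP request to a federation peer (illustrative).
    await asyncio.sleep(0.01)
    return f"{peer}:{prompt}"

async def generate_with_federation(prompt: str, peers: list[str]) -> list[str]:
    # Fire local generation AND all peer requests at the same time,
    # then wait for everything with asyncio.gather() before consensus.
    tasks = [generate_local(prompt)] + [query_peer(p, prompt) for p in peers]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop failed nodes instead of failing the whole federation.
    return [r for r in results if not isinstance(r, Exception)]

candidates = asyncio.run(generate_with_federation("hi", ["nodeA", "nodeB"]))
```

Because the local task is launched alongside the peer tasks rather than awaited first, total wall-clock time is roughly the slowest single node rather than local time plus peer time.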
* fix(federation): properly stream federated response

  The federation case was setting the response but not returning a StreamingResponse, so nothing was sent back to the client. Added a proper streaming generator for federation results that:

  - Sends the role chunk
  - Streams content in chunks
  - Sends the final [DONE] chunk

  This fixes the issue where opencode only saw local node output.

* feat(federation): add winner tracking and token usage reporting

  - Track which node won the consensus voting (local or peer name)
  - Add winner to the FederationResult dataclass
  - Log the winner in server logs
  - Calculate and report token usage in federation streaming
  - Fix prompt_tokens calculation in the streaming path

  Now opencode will show:
  - Context tokens used
  - Which node won the vote (in logs)

* fix(federation): parse tool calls from federated response

  Federation now properly handles tools:

  - Removed the 'not has_tools' condition so federation works with tools
  - Added tool call parsing for federated responses
  - Returns a proper tool_calls delta with finish_reason=tool_calls
  - Falls through to content streaming when no tool calls are present

  This fixes the opencode issue where federation was skipped when tools were present.

* fix(federation): fix token count scope issue in generators

  The async generators couldn't access the token count variables because those lived in the outer function scope. Fixed by:

  - Calculating token counts inside each generator function
  - Using separate local variable names to avoid scope issues

  Both tool_calls and content streaming now work correctly.

* config(federation): increase peer timeout from 30s to 60s

  The federation client timeout determines how long to wait for peer responses before giving up and falling back to the local result. Changed from 30s to 60s to give peers more time to respond, especially on slower networks or machines.
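The role-chunk / content-chunks / [DONE] sequence mentioned above follows the OpenAI-style server-sent-events shape. A minimal sketch of such a generator (the function name, model label, and chunk size are illustrative, not the project's API):

```python
import json

def stream_federated(content: str, model: str = "swarm", chunk_size: int = 8):
    """Yield OpenAI-style SSE chunks: role first, then content, then [DONE]."""
    def sse(delta, finish=None):
        payload = {"choices": [{"delta": delta, "finish_reason": finish}],
                   "model": model}
        return f"data: {json.dumps(payload)}\n\n"

    yield sse({"role": "assistant"})                 # role chunk
    for i in range(0, len(content), chunk_size):     # content in small pieces
        yield sse({"content": content[i:i + chunk_size]})
    yield sse({}, finish="stop")                     # final chunk with finish_reason
    yield "data: [DONE]\n\n"                         # end-of-stream marker

chunks = list(stream_federated("hello federated world"))
```

Wrapping a generator like this in a `StreamingResponse` (rather than only assigning the result) is what actually delivers bytes to the client, which is the bug the first sub-commit fixes.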
* feat(federation): add CUDA/Android support and peer metrics tracking

  Changes:

  - GPU layer auto-configuration based on hardware detection
    - Offload all layers for Apple Silicon
    - Configure NVIDIA layers based on GPU count and compute capability
    - Add GPU device count and compute capability tracking
  - Android platform detection
    - Detect Android via environment variables and file paths
    - Check /proc/sys/kernel/osrelease for the kernel version
    - Normalize Android file paths (~ expansion, /sdcard alternatives)
    - Android-specific paths in hardware/qualcomm.py
  - Federation metrics tracking
    - Add PeerMetrics dataclass with success rate, average latency, and error tracking
    - Track total, successful, and failed requests
    - Record the last error with a timestamp
    - Add success_rate property (auto-calculated)
  - Peer-specific timeout configuration
    - Add timeout_seconds to the PeerInfo dataclass
    - Use the peer-specific timeout in FederationClient requests
    - Use aiohttp.ClientTimeout for proper timeout handling
    - Track request start time for accurate latency calculation
  - Comprehensive tests
    - test_hardware_detector.py: 14 test cases for GPU detection and Android
    - test_federation_metrics.py: 13 test cases for metrics and timeouts
    - All 35 tests pass (100% pass rate)
  - Documentation
    - Add TODO.md with CUDA/Android implementation status
    - Document known issues and recommendations
    - Testing checklist and implementation priorities

  Token impact: no prompt changes
  Tests: 35/35 passing

  Resolves federation timeout and observability issues.

472961cc23
feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
  - Each worker runs one at a time to avoid GPU conflicts
  - All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
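The sequential generation mode above (workers stay loaded, but only one generates at a time) can be sketched with a shared lock. This is an illustrative model, not the project's `src/swarm/manager.py`: the `Worker` class and lock-per-GPU approach are assumptions about how one-at-a-time scheduling could work.

```python
import asyncio

class Worker:
    def __init__(self, name: str):
        self.name = name  # e.g. "Alpha", "Raven", "Zeus"

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for MLX/Metal inference
        return f"{self.name}:{prompt}"

async def generate_sequential(workers: list[Worker], prompt: str):
    # All workers stay loaded in memory, but a shared lock ensures only
    # one touches the GPU at a time, avoiding Metal contention.
    gpu_lock = asyncio.Lock()
    order: list[str] = []
    results: list[str] = []

    async def run(w: Worker) -> None:
        async with gpu_lock:
            order.append(w.name)
            results.append(await w.generate(prompt))

    await asyncio.gather(*(run(w) for w in workers))
    return order, results

order, results = asyncio.run(
    generate_sequential([Worker("Alpha"), Worker("Raven")], "hi")
)
```

Serializing GPU access this way trades throughput for stability: switching between already-loaded workers is fast, while concurrent Metal kernels from multiple processes are what caused conflicts.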

e794fe29d4
Fix critical bugs, concurrency issues, and code quality across codebase
- Fix asyncio.create_task() crash in zeroconf background thread (discovery.py)
- Fix int(bytes) TypeError in peer property decoding (discovery.py)
- Fix unreachable Android/Qualcomm GPU detection path (detector.py)
- Add nvmlShutdown() to prevent NVML resource leak (detector.py)
- Wrap blocking inference in asyncio.to_thread() to unblock the event loop (llamacpp.py, mlx.py)
- Initialize and use asyncio.Lock for concurrency safety (llamacpp.py)
- Fix VRAM regex matching the GPU index instead of the byte value (amd.py)
- Implement the best_of_n federation strategy (was dead code) (federation.py)
- Lazy-import aiohttp/mcp to avoid a hard ImportError (federation.py, mcp_server.py)
- Fix response_model conflict with streaming responses (routes.py)
- Fix CORS allow_origins=* with allow_credentials=True violation (server.py)
- Fix memory calculation using the pre-clamped instance count (selector.py)
- Fix calculate_max_instances returning 2 when only 0-1 fit (selector.py)
- Atomic downloads via a .part file to prevent caching partial files (downloader.py)
- Replace recursive menu navigation with a loop-based approach (interactive.py)
- Implement actual majority voting in _majority_vote (consensus.py)
- Fix false-positive list detection in quality scoring (consensus.py)
- Replace 15+ bare except: with except Exception: across the codebase
- Fix .json() -> .model_dump_json() for Pydantic v2 (routes.py)
- Remove unused MCP imports, add empty prompt validation (mcp_server.py)
- Use the tokenizer for accurate MLX token counting (mlx.py)
- Fix memory estimate from FP32 (*4) to quantized (*0.6) (llamacpp.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
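For the `_majority_vote` fix above, a straightforward majority vote over node responses can be built on `collections.Counter`. This is a simplified stand-in (the real consensus code presumably normalizes or embeds responses before comparing; the function name and tie-break rule here are illustrative):

```python
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    """Return the most common response; ties break toward the earliest one seen."""
    if not responses:
        raise ValueError("no responses to vote on")
    counts = Counter(responses)
    # Counter.most_common sorts by count and preserves first-insertion
    # order among equal counts, so the earliest response wins ties.
    return counts.most_common(1)[0][0]

winner = majority_vote(["B", "A", "A", "C"])
```

Before this fix the commit describes `_majority_vote` as not actually counting votes; exact-match counting like this is the minimal correct baseline that semantic-similarity voting can then refine.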

08a5b800d0
Phase 7: Add AMD, Intel, and Qualcomm GPU support
Add src/hardware/amd.py:
- AMD GPU detection via ROCm (rocm-smi)
- Windows AMD detection via PowerShell/WMI
- Fallback to PCI detection on Linux
- VRAM parsing from ROCm output
- Driver version detection
- Supports Radeon RX series and other AMD GPUs

Add src/hardware/intel.py:
- Intel GPU detection via OneAPI (sycl-ls)
- OpenCL fallback detection
- Windows Intel detection via PowerShell
- Arc, Iris Xe, and UHD Graphics support
- VRAM estimation for discrete vs. integrated GPUs
- Driver version detection

Add src/hardware/qualcomm.py:
- Qualcomm Snapdragon detection for Android/Termux
- Multi-method detection (cpuinfo, hardware, getprop)
- Termux environment detection
- Adreno GPU model extraction
- RAM-based VRAM estimation (25% of total)
- Setup requirements checking
- Device model name retrieval

Update src/hardware/detector.py:
- Add is_mobile flag to the GPUInfo dataclass
- Update detect_gpu() to check all GPU vendors
- Priority: NVIDIA > AMD > Intel > Qualcomm
- Add detect_qualcomm() helper function

All detection modules support:
- Multiple detection methods with fallbacks
- Platform-specific implementations (Linux/Windows/Android)
- Graceful handling of missing tools/drivers
- Consistent GPUInfo return format

Phase 7 complete: extended GPU support for AMD, Intel, and Qualcomm/Adreno GPUs.
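The vendor-priority chain above (NVIDIA > AMD > Intel > Qualcomm, each probe failing gracefully) can be sketched as a first-hit loop over detector callables. The `GPUInfo` fields beyond `is_mobile`, and the dict-of-probes interface, are illustrative assumptions rather than the module's real signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GPUInfo:
    vendor: str
    name: str
    vram_mb: int
    is_mobile: bool = False  # flag added in this phase for Qualcomm/Adreno

def detect_gpu(detectors: dict[str, Callable[[], Optional[GPUInfo]]]) -> Optional[GPUInfo]:
    """Try each vendor probe in priority order; return the first hit.

    A probe returns None when its tools/drivers are missing, so absent
    vendors are skipped gracefully instead of raising.
    """
    for vendor in ("nvidia", "amd", "intel", "qualcomm"):
        probe = detectors.get(vendor)
        if probe is None:
            continue
        info = probe()
        if info is not None:
            return info
    return None

# Example with stubbed probes: the NVIDIA probe finds nothing, AMD succeeds,
# so the lower-priority Qualcomm probe is never consulted.
gpu = detect_gpu({
    "nvidia": lambda: None,
    "amd": lambda: GPUInfo("amd", "Radeon RX 7900", 20480),
    "qualcomm": lambda: GPUInfo("qualcomm", "Adreno 740", 4096, is_mobile=True),
})
```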

0e08a2d66a
Phase 1: Implement hardware detection and model selection
- Add src/hardware/detector.py with cross-platform GPU/CPU/RAM detection
- Add src/models/registry.py with a model database (Qwen, DeepSeek, CodeLlama)
- Add src/models/selector.py with an optimal model selection algorithm
- Update main.py to use the new modules and display results

Features:
- Detects NVIDIA GPUs on Windows/Linux
- Detects Apple Silicon on macOS
- Calculates available memory based on platform (100% of GPU VRAM, 50% of unified RAM)
- Selects the optimal model, quantization, and instance count
- Supports 2-8 instances with quality-based selection
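The memory-budget rules above (100% of discrete GPU VRAM, 50% of Apple Silicon unified RAM) combined with an instance count can be sketched as follows. Function names, the platform string, and the exact clamping behavior are illustrative assumptions, not the actual selector.py API.

```python
def available_memory_mb(platform: str, vram_mb: int, ram_mb: int) -> int:
    # Discrete GPUs: the whole VRAM is budgeted for models.
    # Apple Silicon unified memory: budget only half the RAM, since the
    # OS and apps share the same pool.
    if platform == "apple_silicon":
        return ram_mb // 2
    return vram_mb

def calculate_instances(available_mb: int, model_mb: int, max_instances: int = 8) -> int:
    # How many copies of the model fit in the budget, capped at the
    # supported maximum; 0 signals that the model does not fit at all.
    fit = available_mb // model_mb
    if fit < 1:
        return 0
    return min(fit, max_instances)

# e.g. a 32 GB Apple Silicon machine running a ~5 GB quantized model:
budget = available_memory_mb("apple_silicon", vram_mb=0, ram_mb=32768)
n = calculate_instances(budget, model_mb=5000)
```

Fixing this calculation to use the pre-clamp fit count correctly (rather than always returning 1 or 2) is exactly the selector bug called out in the later "Fix critical bugs" commit.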