sleepy d33fa406b6 feat: CUDA/Android support and federation metrics (#7)
* optimize(federation): run local and peer generation in parallel

Previously, the federation waited for local generation to complete
before asking peers to generate. This wasted time since peers sat
idle while the host generated.

Now local swarm and all peers generate simultaneously:
- Fire local generation AND peer requests at the same time
- Wait for all to complete with asyncio.gather()
- Then run global consensus

This roughly halves total generation time when using
federation with multiple nodes.

Changes:
- Modified generate_with_federation() to run tasks in parallel
- Updated logging to reflect parallel execution
- Added proper error handling for local generation failures
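The parallel pattern described above can be sketched with `asyncio.gather`. This is an illustrative reconstruction, not the project's actual `generate_with_federation()`; the `local_generate` and `peer_generate` callables are hypothetical stand-ins for the local swarm and peer request functions:

```python
import asyncio

async def generate_with_federation(prompt, peers, local_generate, peer_generate):
    """Fire local generation and all peer requests at once, then gather
    every result before running consensus (sketch; names are assumptions)."""
    tasks = [asyncio.create_task(local_generate(prompt))]
    tasks += [asyncio.create_task(peer_generate(peer, prompt)) for peer in peers]
    # return_exceptions=True so one failed peer doesn't sink the whole batch
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop failures; consensus runs over whatever completed successfully
    return [r for r in results if not isinstance(r, Exception)]
```

Because all tasks start before any `await`, total wall time is bounded by the slowest node rather than the sum of local and peer generation.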

* feat(federation): add federation support to streaming path

Previously, federation only worked with non-streaming requests.
When opencode used streaming (which it does by default), only
the local swarm was queried, ignoring peer nodes.

Now when federation is enabled and peers exist:
- Start federation generation in background (parallel)
- Stream from local swarm immediately
- Log federation results when complete

This enables federation to work with opencode and other
streaming clients while maintaining fast streaming response.

Also added webfetch instructions to prevent hallucinating URLs.

Changes:
- Modified streaming path to detect and use federation
- Added asyncio import
- Updated tool instructions to prevent URL hallucination

* fix(federation): wait for consensus and use federated result in streaming

Changed federation in streaming mode to:
- Wait for ALL nodes to complete generation
- Use the consensus result (not just local)
- Stream the federated response to client

This ensures voting from all nodes is properly considered.

Previous implementation streamed locally while federation ran
in background for logging only, which ignored the consensus.

* fix(federation): properly stream federated response

The federation case was setting the response but not returning
a StreamingResponse, so nothing was sent back to the client.

Added proper streaming generator for federation results that:
- Sends role chunk
- Streams content in chunks
- Sends final [DONE] chunk

This fixes the issue where opencode only saw local node output.
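The chunk sequence above (role delta, content deltas, final chunk, `[DONE]`) can be sketched as an OpenAI-compatible SSE generator. This is illustrative code following the standard `chat.completion.chunk` framing, not the project's exact implementation; `model` and `chunk_size` are assumed parameters:

```python
import json

def federation_stream_chunks(text, model="local-swarm", chunk_size=32):
    """Yield SSE-framed chunks for a fully materialized federated response:
    one role delta, then content deltas, then a stop chunk and [DONE]."""
    def sse(payload):
        return f"data: {json.dumps(payload)}\n\n"

    base = {"object": "chat.completion.chunk", "model": model}
    # 1. Role chunk so clients initialize an assistant message
    yield sse({**base, "choices": [{"delta": {"role": "assistant"}, "index": 0}]})
    # 2. Content streamed in fixed-size slices
    for i in range(0, len(text), chunk_size):
        yield sse({**base, "choices": [{"delta": {"content": text[i:i + chunk_size]},
                                        "index": 0}]})
    # 3. Final chunk with finish_reason, then the terminator
    yield sse({**base, "choices": [{"delta": {}, "finish_reason": "stop", "index": 0}]})
    yield "data: [DONE]\n\n"
```

Wrapping this generator in a `StreamingResponse` is what actually delivers the federated text to the client.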

* feat(federation): add winner tracking and token usage reporting

- Track which node won the consensus voting (local or peer name)
- Add winner to FederationResult dataclass
- Log winner in server logs
- Calculate and report token usage in federation streaming
- Fix prompt_tokens calculation in streaming path

Now opencode will show:
- Context tokens used
- Which node won the vote (in logs)

* fix(federation): parse tool calls from federated response

Federation now properly handles tools:
- Removed 'not has_tools' condition so federation works with tools
- Added tool call parsing for federated responses
- Returns proper tool_calls delta with finish_reason=tool_calls
- Falls through to content streaming when no tool calls

This fixes opencode issue where federation was skipped
when tools were present.

* fix(federation): fix token count scope issue in generators

The async generators couldn't access the token count variables
because they were in the outer function scope. Fixed by:
- Calculating token counts inside each generator function
- Using separate local variable names to avoid scope issues
- Both tool_calls and content streaming now work correctly

* config(federation): increase peer timeout from 30s to 60s

Federation client timeout determines how long to wait for
peer responses before giving up and falling back to local result.

Changed from 30s to 60s to give peers more time to respond
especially on slower networks or machines.

* feat(federation): add CUDA/Android support and peer metrics tracking

Changes:
- GPU layer auto-configuration based on hardware detection
  - Offload all layers for Apple Silicon
  - Configure NVIDIA layers based on GPU count and compute capability
  - Add GPU device count and compute capability tracking

- Android platform detection
  - Detect Android via environment variables and file paths
  - Check /proc/sys/kernel/osrelease for kernel version
  - Normalize Android file paths (~ expansion, /sdcard alternatives)
  - Android-specific paths in hardware/qualcomm.py

- Federation metrics tracking
  - Add PeerMetrics dataclass with success rate, avg latency, error tracking
  - Track total requests, successful requests, failed requests
  - Record last error with timestamp
  - Add success_rate property (auto-calculated)

- Peer-specific timeout configuration
  - Add timeout_seconds to PeerInfo dataclass
  - Use peer-specific timeout in FederationClient requests
  - Use aiohttp.ClientTimeout for proper timeout handling
  - Track request start time for accurate latency calculation

- Comprehensive tests
  - test_hardware_detector.py: 14 test cases for GPU detection and Android
  - test_federation_metrics.py: 13 test cases for metrics and timeouts
  - All 35 tests pass (100% pass rate)

- Documentation
  - Add TODO.md with CUDA/Android implementation status
  - Document known issues and recommendations
  - Testing checklist and implementation priorities

Token impact: No prompt changes
Tests: 35/35 passing

Resolves federation timeout and observability issues.
2026-02-25 00:53:07 +01:00

# TODO: CUDA and Android Support in Federation
## Overview
This document tracks known issues and recommendations for adding CUDA (NVIDIA) and Android nodes to the local_swarm federation system.
## Current Status
- **Apple Silicon (macOS)**: Fully supported with MLX backend
- ⚠️ **CUDA/Android**: Not currently supported, requires implementation work
- **Linux**: Should work with llama.cpp + CUDA
- **Windows**: Should work with llama.cpp + CUDA (not tested)
## Known Issues
### 1. No CUDA Backend for macOS
**Problem:**
- `__init__.py` only chooses MLX or llama.cpp
- No CUDA path for macOS
- Apple Silicon only supports Metal acceleration, not CUDA
**Impact:**
- CUDA/Android nodes on macOS cannot use GPU acceleration
- These nodes will fall back to CPU-only mode
**References:**
- `src/backends/__init__.py` (lines 26-32)
- `src/hardware/detector.py` (Apple Silicon detection)
**Recommendation:**
- Current architecture is correct for macOS - CUDA is not supported on Apple Silicon
- Would need separate CUDA backend implementation (not recommended)
---
### 2. Platform Detection in `hardware/detector.py`
**Current Detection:**
```python
def detect_gpu():
    # macOS: Apple Silicon (Metal only, no CUDA)
    # Linux/Windows: NVIDIA/AMD/Intel GPU (potential CUDA)
    # Android/Termux: CPU-only (no GPU)
```
**Impact:**
- Android/Termux devices detected as Linux
- Will use CPU-only mode (expected)
- No special handling for Android platform
**Potential Issue:**
- Termux on Android reports as "linux"
- May have different requirements (file paths, permissions)
- Need to test if file paths work correctly on Android
**References:**
- `src/hardware/detector.py:170-221` (Android/Termux detection via `is_termux()`)
**Recommendation:**
- Add explicit Android platform detection beyond `is_termux()`
- Test file path handling on Termux
- Consider Android's unique file system limitations
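A minimal sketch of the explicit Android detection recommended above, assuming Termux's well-known environment variables and Android filesystem landmarks (the project's actual signals may differ):

```python
import os
from pathlib import Path

def is_android() -> bool:
    """Detect Android/Termux beyond plain platform.system() == 'Linux'.
    Checks Termux env vars first, then Android-specific paths (sketch)."""
    # Termux exports TERMUX_VERSION and a PREFIX under com.termux
    if "TERMUX_VERSION" in os.environ:
        return True
    if "com.termux" in os.environ.get("PREFIX", ""):
        return True
    # Stock Linux lacks these Android system paths
    return Path("/system/build.prop").exists() or Path("/sdcard").exists()
```

Callers could then branch on `is_android()` to normalize paths (`~` expansion, `/sdcard` alternatives) before touching the filesystem.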
---
### 3. Llama.cpp Backend Configuration
**Current GPU Layer Logic:**
```python
# src/backends/__init__.py (line 35)
if hardware.gpu and not hardware.is_apple_silicon:
    n_gpu_layers = -1  # Offload all to GPU (Metal/CUDA)
else:
    n_gpu_layers = 0  # CPU-only
```
**For CUDA Support on Linux:**
- Should set `n_gpu_layers` based on actual GPU count
- NVIDIA: Set to GPU count (1-8 for multi-GPU)
- AMD ROCm: Different backend, not tested
**Impact:**
- Currently hardcoded to -1 on Apple Silicon (Metal)
- CUDA nodes on Linux need proper layer configuration
- No validation that requested layers match available GPU
**References:**
- `src/backends/llamacpp.py` (line 16, n_gpu_layers parameter)
- `src/backends/__init__.py` (line 35)
**Recommendation:**
- Make `n_gpu_layers` configurable per backend
- Auto-detect GPU capabilities from `pynvml` or system
- Add GPU layer validation
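The recommended auto-configuration could look like the following sketch. The policy (offload everything on Apple Silicon or modern NVIDIA GPUs, partial offload on older cards) and the threshold values are assumptions for illustration, not the project's actual rules:

```python
from typing import Optional

def choose_gpu_layers(is_apple_silicon: bool, gpu_count: int,
                      compute_capability: Optional[float]) -> int:
    """Pick an n_gpu_layers value from detected hardware (sketch).
    -1 is llama.cpp's convention for 'offload all layers'."""
    if is_apple_silicon:
        return -1  # Metal: unified memory, offload everything
    if gpu_count > 0 and (compute_capability or 0.0) >= 6.0:
        return -1  # modern CUDA GPU (Pascal or newer): offload everything
    if gpu_count > 0:
        return 20  # older GPU: partial offload (illustrative number)
    return 0  # CPU-only
```

The compute capability and GPU count would come from the hardware detector (e.g. via `pynvml` on NVIDIA systems), with validation errors raised when a requested layer count exceeds what the hardware supports.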
---
### 4. Seed Variation Mode (Not an Issue, but Important)
**Current Behavior:**
```python
# src/swarm/manager.py (lines 76-82)
if use_seed_variation is None and hardware.is_apple_silicon:
    self.use_seed_variation = True  # Auto-enabled on macOS
```
**How It Works:**
- Runs 1 model instance with different random seeds
- Simulates multiple "workers" for consensus
- Saves memory by not loading multiple models
**Impact on Federation:**
- Your Mac: 1 worker → 2 votes (from 2 seeds)
- Peer Mac: 2 workers → 2 votes (from 2 seeds)
- Total: 4 votes instead of 8 (if using 4 actual instances)
**This is CORRECT behavior** for seed variation mode.
**Recommendation:**
- To get 4 votes per machine (8 total), use `--instances 4` flag
- Seed variation is a design choice, not a bug
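Seed variation boils down to reseeding the sampler between generations on a single loaded model, so each run counts as one consensus vote. A minimal sketch (the `generate` callable and reseeding mechanism are stand-ins for the backend's actual sampling path):

```python
import random

def seed_variation_votes(generate, prompt: str, n_seeds: int = 2):
    """Produce n_seeds candidate answers from ONE model instance by
    varying the sampling seed, instead of loading n_seeds models (sketch)."""
    votes = []
    for seed in range(n_seeds):
        random.seed(seed)  # vary sampling without another model in memory
        votes.append(generate(prompt))
    return votes
```

Memory cost stays at one model regardless of `n_seeds`, which is why this mode is the default on unified-memory Apple Silicon.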
---
### 5. Federation Client Timeout
**Status:** **FIXED**
**Previous:**
- Default timeout: 30 seconds
- Peers on slow networks or slow machines would timeout
**Current:**
- Default timeout: 60 seconds (increased in `src/network/federation.py:38`)
- Gives peers more time to respond
**References:**
- `src/network/federation.py` (line 38)
**Recommendation:**
- Current 60s is reasonable
- Consider making timeout configurable per peer in discovery
- Add retry logic for failed requests
---
### 6. Network Discovery
**Current Implementation:** **PLATFORM AGNOSTIC**
**Uses:**
- mDNS/Bonjour for peer discovery
- Standard network protocols
- No platform-specific blocking
**Status:** Should work on all platforms (macOS, Linux, Windows, Android)
**References:**
- `src/network/discovery.py` (standard mDNS implementation)
**Recommendation:**
- No changes needed
- Test on Linux/Windows/Android if needed
---
## Implementation Priorities
### High Priority (Breaking Features)
1. **CUDA Backend for Linux** (if needed)
- Add CUDA-specific backend or extend llama.cpp
- Auto-detect NVIDIA GPU and configure layers
- Test on actual CUDA hardware
- **Effort:** 3-5 days
2. **Android Platform Detection**
- Add explicit Android detection beyond Termux
- Handle Android's file system and package manager differences
- Test on real Android device
- **Effort:** 2-3 days
### Medium Priority (Improvements)
1. **GPU Layer Auto-Configuration**
- Auto-detect GPU capabilities from system
- Match requested layers to available hardware
- Add validation and helpful error messages
- **Effort:** 1-2 days
2. **Federation Metrics**
- Add per-peer timeout in PeerInfo
- Track latency and success rates
- Better error handling for retry logic
- **Effort:** 1 day
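The metrics tracking above could be a small dataclass like the following sketch. Field and method names are assumptions based on this TODO, not the shipped `PeerMetrics`:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PeerMetrics:
    """Per-peer request statistics for the federation client (sketch)."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_latency: float = 0.0
    last_error: Optional[str] = None
    last_error_time: Optional[float] = None

    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.successful_requests / self.total_requests

    @property
    def avg_latency(self) -> float:
        if self.successful_requests == 0:
            return 0.0
        return self.total_latency / self.successful_requests

    def record_success(self, latency: float) -> None:
        self.total_requests += 1
        self.successful_requests += 1
        self.total_latency += latency

    def record_failure(self, error: str) -> None:
        self.total_requests += 1
        self.failed_requests += 1
        self.last_error = error
        self.last_error_time = time.time()
```

Retry logic could then consult `success_rate` and `avg_latency` to deprioritize flaky or slow peers.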
### Low Priority (Nice to Have)
1. **GPU Backend Selection UI**
- Allow users to manually select MLX vs llama.cpp
- Add warning for CUDA backend on macOS (not supported)
- **Effort:** 2 hours
2. **Seed Variation Toggle**
- Add command-line flag to disable seed variation
- Document the trade-offs clearly
- **Effort:** 30 minutes
## Testing Checklist
Before marking any issue as complete, test on:
### macOS (Apple Silicon)
- [ ] Federation with macOS peers (current environment)
- [ ] Seed variation mode works correctly
- [ ] MLX backend loads and generates
- [ ] No crashes with multiple instances
### Linux (NVIDIA GPU)
- [ ] llama.cpp backend loads with CUDA support
- [ ] Federation with Linux peers works
- [ ] GPU layers configured correctly
- [ ] No GPU conflicts
### Windows (NVIDIA GPU)
- [ ] llama.cpp backend loads with CUDA support
- [ ] Federation with Windows peers works
- [ ] No GPU conflicts
### Android (CPU-only)
- [ ] Federation with Android peers works (mDNS should work)
- [ ] CPU-only generation works
- [ ] File paths work on Termux/Android
## Notes
### Architecture Decisions
**Why not per-platform backends:**
- Simplifies codebase (single MLX path, single llama.cpp path)
- Reduces maintenance burden
- Trade-off: Can't optimize for platform-specific GPUs in backends
**Why seed variation on macOS:**
- Apple Silicon has unified memory, not discrete VRAM
- Loading multiple models would consume too much RAM
- Seed variation allows consensus quality with 1 model instance
**CUDA/Android is not a bug:**
- Current system is designed for Apple Silicon + llama.cpp
- Adding CUDA support requires significant architecture work
- Focus on federation quality for current platforms first
## Related Files
- `src/backends/__init__.py` - Backend selection logic
- `src/backends/mlx.py` - Apple Silicon MLX backend
- `src/backends/llamacpp.py` - llama.cpp backend (supports CUDA)
- `src/hardware/detector.py` - Platform and GPU detection
- `src/network/federation.py` - Federation communication
- `src/network/discovery.py` - Peer discovery via mDNS
- `src/swarm/manager.py` - Swarm orchestration
## Conclusion
The current federation implementation is **platform-agnostic** and should work on Linux/Windows with CUDA nodes. The main limitation is that macOS (Apple Silicon) only supports Metal/MLX, not CUDA.
**For immediate use:**
- Use `--instances 4` flag on each machine to get 4 votes per machine
- Test federation between different platforms (macOS + Linux)
- Android/Termux should work as-is (CPU-only mode)
**For future work:**
- Implement high-priority items if CUDA/Android support is needed
- Add GPU layer auto-configuration for better hardware utilization