Add NETWORK.md documenting federation status and TODO

Documents the current state of network federation:
- What's working (discovery, federation client, network binding)
- What's missing (integration in main.py)
- Relevant files and functions
- Scope and limitations
- Comprehensive TODO list for implementation

Federation exists but isn't wired up to the main application flow.
@@ -0,0 +1,204 @@
# Network Federation Status

## Overview

Local Swarm has a federation system designed to let multiple instances on the same network collaborate, enabling distributed consensus and load balancing across machines.

## Current Implementation Status

### ✅ What's Working

#### 1. Network Discovery (`src/network/discovery.py`)

**Purpose**: Automatic discovery of other Local Swarm instances on the local network using mDNS/Bonjour.

**Key Components**:
- `SwarmDiscovery` class - Main discovery service
- `PeerInfo` dataclass - Stores information about peer swarms
- `start_advertising()` - Announces this swarm to the network
- `start_discovery()` - Listens for other swarms on the network
- `create_discovery_service()` - Factory function to create discovery instance

**How It Works**:
- Uses mDNS service type: `_local-swarm._tcp.local.`
- Advertises on port 63323 (discovery) and the API port (17615)
- Broadcasts: version, instances, model_id, hardware_summary
- Peers time out after 60 seconds if not seen
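
The 60-second peer timeout can be illustrated with a small bookkeeping sketch. This is not the code in `discovery.py`: `PeerRecord` and `PeerTable` are hypothetical names, and the real `PeerInfo` dataclass may carry different fields. It only shows one way to drop peers that have not been seen recently.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PEER_TIMEOUT_SECONDS = 60.0  # timeout value stated in this document


@dataclass
class PeerRecord:
    """Hypothetical stand-in for PeerInfo; the real fields may differ."""
    name: str
    host: str
    api_port: int = 17615
    last_seen: float = field(default_factory=time.monotonic)


class PeerTable:
    """Tracks peers and drops those not seen within the timeout."""

    def __init__(self, timeout: float = PEER_TIMEOUT_SECONDS) -> None:
        self.timeout = timeout
        self._peers: Dict[str, PeerRecord] = {}

    def mark_seen(self, name: str, host: str, now: Optional[float] = None) -> None:
        """Record or refresh a peer when its advertisement is observed."""
        now = time.monotonic() if now is None else now
        self._peers[name] = PeerRecord(name=name, host=host, last_seen=now)

    def prune(self, now: Optional[float] = None) -> List[str]:
        """Remove peers whose last sighting is older than the timeout."""
        now = time.monotonic() if now is None else now
        stale = [n for n, p in self._peers.items() if now - p.last_seen > self.timeout]
        for n in stale:
            del self._peers[n]
        return stale

    def active(self) -> List[PeerRecord]:
        """Return the peers that are still considered alive."""
        return list(self._peers.values())
```

Passing `now` explicitly keeps the timeout logic testable without sleeping; a real service would call `prune()` periodically from its event loop.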

#### 2. Federation Client (`src/network/federation.py`)

**Purpose**: Communication protocol between peer swarms.

**Key Components**:
- `FederationClient` class - HTTP client for peer communication
- `FederatedSwarm` class - Wraps local swarm with federation logic
- `request_vote()` - Gets generation results from peers
- `generate_with_federation()` - Coordinates distributed generation
- Federation strategies: `best_of_n`, `weighted_vote`, `first_valid`

**API Endpoints** (not yet exposed):
- `POST /v1/federation/vote` - Request generation from peer
- `GET /v1/federation/health` - Check peer health
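
This document names the three federation strategies but does not define them, so the sketch below is only one plausible reading. The `Vote` shape, the `confidence` field, and the selection rules are all assumptions and may not match `PeerVote` or the logic in `federation.py`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Vote:
    """Hypothetical vote shape; the real PeerVote may carry other fields."""
    source: str
    text: str
    confidence: float  # assumed score in [0, 1]


def best_of_n(votes: List[Vote]) -> Optional[Vote]:
    """Assumed meaning: pick the single highest-confidence response."""
    return max(votes, key=lambda v: v.confidence, default=None)


def weighted_vote(votes: List[Vote]) -> Optional[Vote]:
    """Assumed meaning: identical texts pool their confidence; the largest pool wins."""
    if not votes:
        return None
    totals: Dict[str, float] = defaultdict(float)
    for v in votes:
        totals[v.text] += v.confidence
    winning_text = max(totals, key=lambda t: totals[t])
    # Return the most confident vote among those that produced the winning text.
    return max((v for v in votes if v.text == winning_text), key=lambda v: v.confidence)


def first_valid(votes: List[Vote]) -> Optional[Vote]:
    """Assumed meaning: take the first non-empty response in arrival order."""
    return next((v for v in votes if v.text.strip()), None)


STRATEGIES: Dict[str, Callable[[List[Vote]], Optional[Vote]]] = {
    "best_of_n": best_of_n,
    "weighted_vote": weighted_vote,
    "first_valid": first_valid,
}


def select(votes: List[Vote], strategy: str = "best_of_n") -> Optional[Vote]:
    """Dispatch to the named strategy; raises KeyError for unknown names."""
    return STRATEGIES[strategy](votes)
```

Keeping the strategies in a dictionary keyed by name would let a CLI option or config value choose one without branching in the caller.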

#### 3. Network Binding (`main.py`)

**Purpose**: Secure local network access without internet exposure.

**Implementation**:
- `get_local_ip()` - Detects local network IP (192.x.x.x or 100.x.x.x)
- Binds to the specific local IP instead of `0.0.0.0`
- Falls back to localhost if not on a private network
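
The binding behaviour described above could look roughly like the following. This is a sketch, not the `get_local_ip()` in `main.py`: the function names are hypothetical, and the accepted ranges are an assumption that narrows "192.x.x.x or 100.x.x.x" to 192.168.0.0/16 (RFC 1918) and 100.64.0.0/10 (RFC 6598 shared address space).

```python
import ipaddress
import socket
from typing import Optional

# Assumed ranges; the real check in main.py may be broader or narrower.
LAN_NETWORKS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("100.64.0.0/10"),
]


def is_lan_address(ip: str) -> bool:
    """Return True if ip parses and falls inside one of the assumed LAN ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in LAN_NETWORKS)


def detect_outbound_ip() -> str:
    """Ask the OS which local address it would use for outbound traffic.

    Connecting a UDP socket sends no packets; it only selects a route.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect(("192.0.2.1", 9))  # TEST-NET-1 address, never contacted
            return sock.getsockname()[0]
        except OSError:
            return "127.0.0.1"


def choose_bind_host(candidate: Optional[str] = None) -> str:
    """Bind to the LAN address when there is one, otherwise to localhost."""
    ip = candidate if candidate is not None else detect_outbound_ip()
    return ip if is_lan_address(ip) else "127.0.0.1"
```

For example, `choose_bind_host("192.168.1.100")` returns the address unchanged, while a public address such as `8.8.8.8` falls back to `127.0.0.1`.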

## ❌ What's Missing

### Critical Gap: No Integration

**The federation system exists as standalone modules but is NOT connected to the main application flow.**

**Specific Issues**:

1. **No CLI Flag**: No `--federation` or `--enable-federation` argument in `main.py`

2. **Discovery Never Starts**:
   - `SwarmDiscovery` class is imported in `src/network/__init__.py`
   - But it is never instantiated or started in `main.py`
   - `start_advertising()` and `start_discovery()` are never called

3. **Federation Never Starts**:
   - `FederatedSwarm` class exists but is never instantiated
   - `main.py` calls `swarm.generate()` directly
   - Should call `federated_swarm.generate_with_federation()` when enabled

4. **API Routes Not Registered**:
   - Federation endpoints exist in `federation.py` but aren't added to the FastAPI router
   - Routes in `src/api/routes.py` don't include `/v1/federation/*`

5. **No Peer Management UI**:
   - No way to see discovered peers
   - No status dashboard for federation
   - No manual peer configuration

## File Structure

```
src/network/
├── __init__.py                     # Exports SwarmDiscovery, FederationClient, etc.
├── discovery.py                    # mDNS/Bonjour discovery service
│   ├── SwarmDiscovery              # Main discovery class
│   ├── PeerInfo                    # Peer information dataclass
│   └── create_discovery_service()  # Factory function
├── federation.py                   # Inter-swarm communication
│   ├── FederationClient            # HTTP client for peers
│   ├── FederatedSwarm              # Wraps swarm with federation
│   ├── PeerVote                    # Vote from peer
│   └── FederationResult            # Result of federated generation
└── (routes missing)                # Should add federation routes

main.py                             # Should integrate federation here
    ├── Currently: Just runs local swarm
    └── Should: Optionally run federated swarm with discovery
```

## Scope

### In Scope
- Automatic discovery of peers on same local network
- Distributed generation across multiple machines
- Consensus voting between local and peer responses
- Health checking and peer timeout handling
- Secure local network binding (no internet exposure)

### Out of Scope (Future)
- Internet-wide federation (would need authentication/encryption)
- Cross-platform federation (Mac ↔ Linux ↔ Windows)
- Peer authentication/authorization
- Encrypted peer communication
- WAN federation through NAT traversal
- Peer reputation/scoring system

## TODO

### Phase 1: Basic Integration (Minimum Viable)
1. **Add `--federation` CLI flag** to `main.py`
   - Add argument parser entry
   - Conditionally enable federation

2. **Integrate discovery in main flow**
   ```python
   # In main.py after swarm initialization:
   if args.federation:
       discovery = await create_discovery_service(args.port)
       await discovery.start_advertising(swarm_info)
       await discovery.start_discovery()
   ```

3. **Add federation API routes** to `src/api/routes.py`
   - `POST /v1/federation/vote`
   - `GET /v1/federation/health`
   - `GET /v1/federation/peers` (list discovered peers)

4. **Create FederatedSwarm wrapper**
   ```python
   # Replace: result = await swarm.generate(...)
   # With:
   if args.federation:
       federated = FederatedSwarm(swarm, discovery)
       result = await federated.generate_with_federation(...)
   else:
       result = await swarm.generate(...)
   ```
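
For step 1, the argument parser entry could be as small as the sketch below. The existing parser in `main.py` is not shown in this document, so `build_parser()` and `parse_args()` are hypothetical helpers, and the `--auto` and `--port` definitions are placeholders inferred from their use elsewhere in this file.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the proposed --federation flag."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Placeholders: --auto and --port are used elsewhere in this document,
    # but their real definitions in main.py may differ.
    parser.add_argument("--auto", action="store_true", help="placeholder for the existing flag")
    parser.add_argument("--port", type=int, default=17615, help="API port")
    # Proposed flag: off by default so existing behaviour is unchanged.
    parser.add_argument(
        "--federation",
        action="store_true",
        help="enable peer discovery and federated generation",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)
```

Using `action="store_true"` means `args.federation` is `False` unless the flag is passed, which matches the `if args.federation:` checks in steps 2 and 4.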

### Phase 2: Polish
5. **Add peer status display**
   - Show discovered peers in startup banner
   - Display peer count in status
   - Log when peers join/leave

6. **Handle edge cases**
   - No peers available (fall back to local only)
   - All peers time out (graceful degradation)
   - Split-brain scenarios

7. **Configuration**
   - Config file support for federation settings
   - Manual peer list (bypass discovery)
   - Federation strategy selection
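
The first two edge cases in step 6 (no peers, or peers that time out) can be handled with one pattern: always produce the local result, and add only the peer results that arrive in time. The sketch below assumes local and peer generation are exposed as async callables taking a prompt; `generate_with_fallback` and the 5-second default are hypothetical and not part of `FederatedSwarm`. Split-brain handling is not covered.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Assumed shape: any async callable that maps a prompt to generated text.
Generator = Callable[[str], Awaitable[str]]


async def generate_with_fallback(
    prompt: str,
    local: Generator,
    peers: Sequence[Generator],
    peer_timeout: float = 5.0,
) -> List[str]:
    """Return the local result plus whatever peer results arrive in time."""

    async def ask(peer: Generator) -> str:
        # A slow peer raises asyncio.TimeoutError instead of blocking everyone.
        return await asyncio.wait_for(peer(prompt), timeout=peer_timeout)

    # Start local generation first so it runs concurrently with the peers.
    local_task = asyncio.ensure_future(local(prompt))
    peer_results = await asyncio.gather(
        *(ask(p) for p in peers), return_exceptions=True
    )
    results = [await local_task]
    # Failed or timed-out peers come back as exceptions and are skipped.
    results.extend(r for r in peer_results if not isinstance(r, BaseException))
    return results
```

With an empty `peers` sequence this degrades to local-only generation, and the returned list can be handed to whichever consensus strategy is configured.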

### Phase 3: Testing
8. **Integration tests**
   - Two instances on same machine
   - Two instances on same network
   - Peer timeout handling
   - Consensus validation

## Usage (When Complete)

### Start Federated Mode
```bash
# On Mac 1 (192.168.1.100)
python main.py --auto --federation

# On Mac 2 (192.168.1.101)
python main.py --auto --federation

# Both will:
# 1. Start local API on 192.168.x.x:17615
# 2. Advertise via mDNS
# 3. Discover each other within 5-10 seconds
# 4. Distribute generation requests between them
```

### Expected Behavior
1. Both Macs advertise themselves via mDNS
2. Each discovers the other within 10 seconds
3. When a request comes in, both generate responses
4. Consensus algorithm picks best response
5. Result returned to client

## Benefits When Complete
- **More workers**: Combine instances across machines
- **Better consensus**: More responses = better selection
- **Load balancing**: Distribute generation across devices
- **Redundancy**: If one fails, others continue
- **Heterogeneous hardware**: Mix Macs, PCs, servers

## Current Workaround
Until federation is integrated:
1. Run instances independently on different machines
2. Point clients to specific instances manually
3. Expect no automatic peer discovery or coordination
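
On the client side, manual pointing can be made a little more robust by trying a hand-maintained list of instances in order. The helper below is a hypothetical sketch that only checks whether a TCP connection succeeds; it does not verify that a Local Swarm API is actually answering on that port.

```python
import socket
from typing import Iterable, Optional, Tuple


def first_reachable(
    instances: Iterable[Tuple[str, int]], timeout: float = 1.0
) -> Optional[Tuple[str, int]]:
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in instances:
        try:
            # The connection is closed immediately; this is only a liveness probe.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # refused, unreachable, or timed out: try the next one
    return None
```

A client could call `first_reachable([("192.168.1.100", 17615), ("192.168.1.101", 17615)])` with the example addresses from the usage section and send its request to whichever instance is returned.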