local_swarm/REVIEW.md
sleepy 472961cc23 feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one-at-a-time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
2026-02-23 22:57:38 +01:00


Here's a comprehensive review of your project. It's well-architected overall, but there are several issues worth addressing.
---
Critical Bugs
1. `src/network/discovery.py:128` — `asyncio.create_task()` called from a non-asyncio thread
Zeroconf's `ServiceBrowser` invokes `_on_service_state_change` from a background thread, but `asyncio.create_task()` requires a running event loop in the current thread. This crashes with `RuntimeError: no running event loop`. Use `asyncio.run_coroutine_threadsafe(coro, loop)` instead.
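A minimal sketch of the thread-safe pattern; `handle_service_change` and the callback below are hypothetical stand-ins for the real discovery handlers:

```python
import asyncio
import threading

results = []

async def handle_service_change(name: str) -> str:
    # Hypothetical stand-in for the real async handler in discovery.py.
    return f"handled {name}"

def zeroconf_callback(loop: asyncio.AbstractEventLoop, name: str) -> None:
    # Runs on a non-asyncio thread, as zeroconf's ServiceBrowser callbacks do.
    # asyncio.create_task() would raise RuntimeError here; this hands the
    # coroutine to the loop's own thread safely instead.
    future = asyncio.run_coroutine_threadsafe(handle_service_change(name), loop)
    results.append(future.result(timeout=5))  # blocks only this worker thread

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

worker = threading.Thread(target=zeroconf_callback, args=(loop, "worker-alpha"))
worker.start()
worker.join()
loop.call_soon_threadsafe(loop.stop)

print(results)  # ['handled worker-alpha']
```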
2. `src/network/discovery.py:161` — unguarded `int()` on a raw TXT property
`int(properties.get(b"instances", b"0"))` — `int()` does accept ASCII-digit bytes in Python 3, but zeroconf TXT values are `Optional[bytes]`: a key-only attribute maps to `None`, and `int(None)` raises `TypeError`. Decode and validate the value before converting.
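A defensive parse, assuming zeroconf's `Optional[bytes]` property values; `parse_instances` is an illustrative helper, not the project's code:

```python
def parse_instances(properties: dict) -> int:
    # TXT values may be bytes or None (key-only attribute); fall back to 0
    # for missing, empty, or malformed values instead of raising.
    raw = properties.get(b"instances") or b"0"
    try:
        return int(raw.decode("ascii").strip())
    except (ValueError, UnicodeDecodeError):
        return 0

print(parse_instances({b"instances": b"4"}))   # 4
print(parse_instances({b"instances": None}))   # 0
print(parse_instances({}))                     # 0
```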
3. `src/hardware/detector.py:149,174` — Android/Qualcomm detection is unreachable
`platform.system()` returns `"Linux"` on Android, not `"android"`. So the code enters the Linux branch, tries NVIDIA/AMD/Intel, fails, and returns `None` — never reaching Qualcomm detection.
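One way to catch Android before the generic Linux branch. The markers below (`ANDROID_ROOT`, a Termux prefix, `/system/build.prop`) are common Android conventions, not something verified against this codebase:

```python
import os
import platform

def is_android() -> bool:
    # platform.system() reports "Linux" on Android, so test Android-specific
    # markers first and only then fall through to the Linux GPU probes.
    return (
        "ANDROID_ROOT" in os.environ                      # set by the Android runtime
        or "com.termux" in os.environ.get("PREFIX", "")   # Termux shells
        or os.path.exists("/system/build.prop")           # Android system partition
    )

print(platform.system(), is_android())
```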
4. `src/api/routes.py:77` — `response_model` breaks streaming
The route declares `response_model=ChatCompletionResponse`, but when `request.stream=True` it returns a `StreamingResponse`. FastAPI will try to validate the streaming response against the Pydantic model and fail.
---
High Severity
5. `src/backends/llamacpp.py:85-94` and `src/backends/mlx.py:88-96` — Blocking calls in async methods
Both backends call synchronous inference (`self._llm(...)`, `mlx_generate(...)`) directly inside `async def` methods. This blocks the entire event loop, freezing the API server during inference. Wrap the calls in `await asyncio.to_thread(...)`.
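The shape of the fix, with a `time.sleep` stand-in for the real inference call:

```python
import asyncio
import time

def _blocking_generate(prompt: str) -> str:
    # Stand-in for the synchronous llama.cpp / MLX inference call.
    time.sleep(0.2)
    return prompt.upper()

async def generate(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    # to serve other requests during inference.
    return await asyncio.to_thread(_blocking_generate, prompt)

async def main() -> list[str]:
    # Two requests overlap instead of serializing on the event loop.
    return await asyncio.gather(generate("hello"), generate("world"))

results = asyncio.run(main())
print(results)  # ['HELLO', 'WORLD']
```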
6. `src/backends/llamacpp.py:29` — Lock declared but never initialized
`self._lock = None` is never replaced with an actual `asyncio.Lock()`, so there is no concurrency protection when multiple requests hit the same backend instance.
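A sketch of the intended pattern (the `Backend` class here is illustrative). Since Python 3.10, `asyncio.Lock()` no longer binds to an event loop at construction, so creating it in `__init__` is safe:

```python
import asyncio

class Backend:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()  # real lock, not a None placeholder
        self.calls = 0

    async def generate(self, prompt: str) -> str:
        async with self._lock:  # serialize access to the shared model instance
            self.calls += 1
            await asyncio.sleep(0)  # stand-in for offloaded inference
            return prompt[::-1]

async def main() -> tuple[int, list[str]]:
    backend = Backend()
    outputs = await asyncio.gather(backend.generate("abc"), backend.generate("xyz"))
    return backend.calls, outputs

calls, outputs = asyncio.run(main())
print(calls, outputs)  # 2 ['cba', 'zyx']
```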
7. `src/swarm/consensus.py:85,89` — Blocking I/O in async context
`SentenceTransformer('all-MiniLM-L6-v2')` downloads and loads a model synchronously, and `.encode()` is CPU-bound. Both freeze the event loop.
8. `src/hardware/amd.py:80` — VRAM regex matches the wrong number
`re.search(r'(\d+)', line)` on a line like `GPU[0] : VRAM Total Memory (B): 17179869184` matches `0` (from `GPU[0]`), not the VRAM value.
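Anchoring the pattern on the label fixes it. The sample line below reproduces the output format quoted above (presumably `rocm-smi`):

```python
import re

line = "GPU[0] : VRAM Total Memory (B): 17179869184"

# The naive pattern grabs the first integer on the line: the GPU index.
naive = re.search(r"(\d+)", line)
print(naive.group(1))  # 0

# Anchor on the label and capture the number after the colon instead.
anchored = re.search(r"VRAM Total Memory \(B\):\s*(\d+)", line)
vram_bytes = int(anchored.group(1))
print(vram_bytes)  # 17179869184
```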
9. `src/models/downloader.py:79-88` — Partial downloads cached as valid
If a download is interrupted, the partial file remains. `is_model_cached()` sees `size > 0` and treats it as valid. Should download to a `.tmp` file and rename atomically on completion.
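A minimal sketch of the atomic pattern; `fetch_chunks` is a placeholder for the real HTTP streaming code:

```python
import os
import tempfile

def download_atomically(dest_path: str, fetch_chunks) -> None:
    # Write to a temp file in the same directory (same filesystem), then
    # os.replace() it into place: readers never observe a partial file.
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in fetch_chunks():
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # leave no partial file behind
        raise

# Demo with an in-memory "download".
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "model.gguf")
    download_atomically(target, lambda: iter([b"abc", b"def"]))
    with open(target, "rb") as f:
        data = f.read()
print(data)  # b'abcdef'
```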
10. `src/network/federation.py:253-277` — `best_of_n` strategy is non-functional
The code creates `GenerationResponse` objects but never uses them, then simply returns the local response. This strategy is dead code.
---
Medium Severity
11. `src/models/selector.py:182-184` — Memory calculation uses the wrong instance count
`total_memory_gb = smallest_quant.vram_gb * instances` uses the pre-clamped value, but `instances` is only clamped with `max(instances, 1)` on the next line, so the reported memory total and the final instance count can disagree.
12. `src/models/selector.py:65` — `calculate_max_instances` returns an infeasible count
Returns `MIN_INSTANCES` (2) even when only 0-1 instances fit in memory. `_try_smallest_variant` calls this without the memory guard that `_try_model` has.
13. `src/hardware/detector.py:87-88` — NVML resource leak
`pynvml.nvmlInit()` is called but `nvmlShutdown()` is never called. Wrap the queries in `try`/`finally`.
14. `src/api/server.py:60-66` — Invalid CORS configuration
`allow_origins=["*"]` combined with `allow_credentials=True` violates the CORS spec: browsers refuse to expose responses that pair a wildcard `Access-Control-Allow-Origin` with credentials. List explicit origins or drop `allow_credentials`.
15. `src/swarm/consensus.py:186-199` — `_majority_vote` doesn't do majority voting
It picks the median-length response, not the most common one. The name and docstring are misleading.
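For contrast, an actual majority vote over normalized responses. This sketch groups by exact match after normalization; the real consensus module could group by embedding similarity instead:

```python
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    # Count near-duplicate answers (exact match after normalization) and
    # return the most common one, breaking ties by first appearance.
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    winner = max(counts, key=lambda key: (counts[key], -normalized.index(key)))
    return responses[normalized.index(winner)]

print(majority_vote(["Paris", "paris ", "Lyon", "Paris"]))  # Paris
```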
16. `src/interactive.py:226,368,458` — Recursive menu navigation risks stack overflow
Menu functions call each other recursively, so repeated back-and-forth navigation can blow the stack. Use a loop-based state machine instead.
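A toy version of the loop-based shape; the menu names and keys are invented for illustration. Each iteration renders one menu and transitions on the choice, so navigation depth stays constant:

```python
def run_menus(inputs: list[str]) -> list[str]:
    # Loop-based state machine: no menu ever calls another menu function,
    # so arbitrarily deep back-and-forth navigation uses constant stack.
    visited = []
    state = "main"
    feed = iter(inputs)  # scripted "user input" for the demo
    while state != "quit":
        visited.append(state)
        choice = next(feed, "q")
        if state == "main":
            state = {"c": "context", "o": "offload", "q": "quit"}.get(choice, "main")
        elif state in ("context", "offload"):
            state = "main" if choice == "b" else state  # "b" = back
    return visited

path = run_menus(["c", "b", "o", "b", "q"])
print(path)  # ['main', 'context', 'main', 'offload', 'main']
```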
17. Multiple files — Bare `except:` clauses
`llamacpp.py:157,187`, `mlx.py:141`, `detector.py:108,190`, `amd.py:214`, `intel.py:220,248`, `qualcomm.py:185`, `discovery.py:236`, `federation.py:116`, `updater.py:141,218,231` — a bare `except:` also swallows `SystemExit` and `KeyboardInterrupt`. Use `except Exception:` instead.
---
Low Severity / Code Quality
18. `src/api/routes.py:112,133,147` — `.json()` is deprecated in Pydantic v2. Use `.model_dump_json()`.
19. `src/backends/mlx.py:59-63` — GGUF loading via MLX is suspect. Passing the parent directory of a GGUF file to `mlx_lm.load()` likely won't work.
20. `src/swarm/consensus.py:233` — False-positive list detection. Checking for `-`, `*`, `1.`, `2.` also matches hyphens in code, multiplication operators, version numbers, etc.
21. `src/network/discovery.py:56` — `Dict[str, any]` should be `Dict[str, Any]` (capital A).
22. `src/mcp_server.py:15-18` — Unused imports (`ImageContent`, `Resource`, `EmbeddedResource`, `LoggingLevel`).
23. `src/models/downloader.py:74,118` — `timeout=30` is connect-only with no read timeout; multi-GB downloads can hang on stalled reads.
24. `src/models/downloader.py` — No checksum verification after download, so corrupted files are silently cached.
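A streaming SHA-256 check is cheap to add. The expected digest would come from upstream metadata (e.g. Hugging Face's LFS file info, which is an assumption here, not something verified against this codebase):

```python
import hashlib

def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    # Hash in chunks so multi-GB model files never need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex

# Demo against a small temporary file.
import os
import tempfile
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"model weights")
    path = f.name
expected = hashlib.sha256(b"model weights").hexdigest()
ok = verify_sha256(path, expected)
os.unlink(path)
print(ok)  # True
```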
25. Tests directory is empty — `tests/__init__.py` exists but no actual tests.
---
Suggested Improvements
1. Wrap all blocking inference in `asyncio.to_thread()` — this is the single most impactful fix. Without it, the API server can only handle one request at a time.
2. Atomic downloads — download to a `.part` file, rename on success, and verify the checksum against HuggingFace metadata.
3. Replace recursive menus with a loop-based state machine — e.g. `state = "main"` in a `while True` loop with `if state == "main": ...` branches.
4. Add proper logging — replace all `print()` calls with `logging.getLogger(__name__)`. The codebase uses `print()` everywhere, making it hard to control verbosity.
5. Fix the Android detection path — check `is_termux()` or `/system/build.prop` existence early in `detect_gpu()`, before the platform branching.
6. Add integration tests — even simple smoke tests (hardware detection returns valid data, model selection picks something reasonable, the API server starts and responds to `/health`) would catch regressions.
7. Use `aiohttp.ClientSession` as an async context manager in federation to ensure proper cleanup.
8. Consider separating streaming and non-streaming API routes — this avoids the `response_model` conflict and makes the code clearer.