472961cc23

Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one at a time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
Here's a comprehensive review of your project. It's well-architected overall, but there are several issues worth addressing.

---

Critical Bugs

1. src/network/discovery.py:128 — asyncio.create_task() called from a non-asyncio thread

Zeroconf's ServiceBrowser invokes _on_service_state_change from a background thread, but asyncio.create_task() requires a running event loop in the current thread. This will crash with RuntimeError: no current event loop. Use asyncio.run_coroutine_threadsafe(coro, loop) instead.
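A minimal sketch of the thread-safe hand-off; the callback and handler names below are stand-ins, not the project's real ones:

```python
import asyncio
import threading

results = []

async def handle_service() -> None:
    # Stand-in for the real async discovery handler.
    results.append("handled")

def on_service_state_change(loop: asyncio.AbstractEventLoop) -> None:
    # Runs in zeroconf's worker thread, where asyncio.create_task() would
    # raise "no running event loop". Hand the coroutine to the loop instead.
    future = asyncio.run_coroutine_threadsafe(handle_service(), loop)
    future.result(timeout=5)  # blocks this worker thread only, not the loop

async def main() -> None:
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=on_service_state_change, args=(loop,))
    worker.start()
    await asyncio.to_thread(worker.join)  # wait without blocking the loop

asyncio.run(main())
```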

2. src/network/discovery.py:161 — int() on bytes raises TypeError

int(properties.get(b"instances", b"0")) — in Python 3, calling int() on bytes without an explicit base raises a TypeError. Need .decode() first.
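For example (the TXT-record dict below is illustrative):

```python
# zeroconf exposes TXT records as a bytes-keyed, bytes-valued dict.
properties = {b"instances": b"4"}

raw = properties.get(b"instances", b"0")
# int(raw) would raise TypeError on bytes; decode to str first.
instances = int(raw.decode("utf-8"))
```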

3. src/hardware/detector.py:149,174 — Android/Qualcomm detection is unreachable

platform.system() returns "Linux" on Android, not "android". So the code enters the Linux branch, tries NVIDIA/AMD/Intel, fails, and returns None — never reaching Qualcomm detection.
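One way to special-case Android before the generic platform branching — a sketch, assuming the usual Termux/Android environment markers:

```python
import os
import platform

def is_android() -> bool:
    # platform.system() says "Linux" on Android, so look for
    # Android-specific markers before entering the Linux branch.
    return "ANDROID_ROOT" in os.environ or os.path.exists("/system/build.prop")

def detect_platform() -> str:
    if is_android():
        return "android"  # route to Qualcomm detection from here
    return platform.system().lower()  # "linux", "darwin", "windows", ...
```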

4. src/api/routes.py:77 — response_model breaks streaming

The route declares response_model=ChatCompletionResponse, but when request.stream=True it returns a StreamingResponse. FastAPI will try to validate the streaming response against the Pydantic model and fail.

---

High Severity

5. src/backends/llamacpp.py:85-94 and src/backends/mlx.py:88-96 — Blocking calls in async methods

Both backends call synchronous inference (self._llm(...), mlx_generate(...)) directly inside async def methods. This blocks the entire event loop, freezing the API server during inference. Wrap the calls in await asyncio.to_thread(...).
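A sketch of the fix, with a sleep standing in for the synchronous inference call:

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for the synchronous self._llm(...) / mlx_generate(...) call.
    time.sleep(0.05)
    return f"response to: {prompt}"

async def generate(prompt: str) -> str:
    # The worker thread absorbs the block; the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)

async def main() -> list:
    # Two requests now overlap instead of freezing the server serially.
    return await asyncio.gather(generate("a"), generate("b"))

results = asyncio.run(main())
```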

6. src/backends/llamacpp.py:29 — Lock declared but never initialized

self._lock = None is never replaced with an actual asyncio.Lock(), so there is no concurrency protection when multiple requests hit the same backend instance.
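A sketch of eager lock initialization (the class shape here is illustrative):

```python
import asyncio

class LlamaCppBackend:
    def __init__(self) -> None:
        # Create the lock up front; leaving self._lock = None gives
        # no protection at all when requests arrive concurrently.
        self._lock = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        async with self._lock:  # one request at a time per model instance
            return await asyncio.to_thread(str.upper, prompt)

async def main() -> str:
    backend = LlamaCppBackend()  # created inside the running loop
    return await backend.generate("hi")

out = asyncio.run(main())
```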

7. src/swarm/consensus.py:85,89 — Blocking I/O in async context

SentenceTransformer('all-MiniLM-L6-v2') downloads/loads a model synchronously, and .encode() is CPU-bound. Both freeze the event loop.

8. src/hardware/amd.py:80 — VRAM regex matches wrong number

re.search(r'(\d+)', line) on a line like GPU[0] : VRAM Total Memory (B): 17179869184 matches 0 (from GPU[0]), not the VRAM value.
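Anchoring the pattern on the field label fixes it:

```python
import re

line = "GPU[0]\t\t: VRAM Total Memory (B): 17179869184"

# A bare (\d+) finds the "0" in "GPU[0]" first; anchor on the field name
# and capture the number that follows the colon instead.
match = re.search(r"VRAM Total Memory \(B\):\s*(\d+)", line)
vram_bytes = int(match.group(1)) if match else 0
vram_gib = vram_bytes / (1024 ** 3)
```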

9. src/models/downloader.py:79-88 — Partial downloads cached as valid

If a download is interrupted, the partial file remains. is_model_cached() sees size > 0 and treats it as valid. Should download to a .tmp file and rename atomically on completion.
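A sketch of the atomic write (the function name and chunk source are illustrative):

```python
import os
import tempfile
from pathlib import Path

def download_atomically(dest: Path, chunks) -> None:
    # Write into a temp file in the destination directory, then rename:
    # os.replace() is atomic on the same filesystem, so dest is either
    # absent or complete, and an interrupted download is never mistaken
    # for a cached model.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_name, dest)
    except BaseException:
        os.unlink(tmp_name)
        raise

# Demo: two chunks land as one complete file, with no .tmp left behind.
dest = Path(tempfile.mkdtemp()) / "model.bin"
download_atomically(dest, [b"abc", b"def"])
```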

10. src/network/federation.py:253-277 — best_of_n strategy is non-functional

The code creates GenerationResponse objects but never uses them, then just returns the local response. This strategy is dead code.

---

Medium Severity

11. src/models/selector.py:182-184 — Memory calculation uses wrong instance count

total_memory_gb = smallest_quant.vram_gb * instances uses the pre-clamped value, but instances is clamped with max(instances, 1) on the next line, so the reported memory total and the actual instance count can disagree.

12. src/models/selector.py:65 — calculate_max_instances returns infeasible count

Returns MIN_INSTANCES (2) even when only 0-1 instances fit in memory. _try_smallest_variant calls this without the memory guard that _try_model has.

13. src/hardware/detector.py:87-88 — NVML resource leak

pynvml.nvmlInit() is called but nvmlShutdown() is never called. Need a try/finally.

14. src/api/server.py:60-66 — Invalid CORS configuration

allow_origins=["*"] with allow_credentials=True violates the CORS spec. Browsers will reject this.

15. src/swarm/consensus.py:186-199 — _majority_vote doesn't do majority voting

It picks the median-length response, not the most common one. Name and docstring are misleading.
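An actual majority vote is a few lines with collections.Counter — a sketch; real responses would likely need more careful normalization than strip():

```python
from collections import Counter

def majority_vote(responses: list) -> str:
    # Count (lightly normalized) responses; ties go to the earliest seen,
    # since Counter preserves first-insertion order in most_common().
    counts = Counter(r.strip() for r in responses)
    return counts.most_common(1)[0][0]

winner = majority_vote(["42", "42 ", "41", "42", "41"])
```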

16. src/interactive.py:226,368,458 — Recursive menu navigation risks stack overflow

Menu functions call each other recursively. Repeated back-and-forth navigation can blow the stack. Use a loop-based state machine instead.
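A minimal sketch of loop-based navigation; the menu names and scripted input are illustrative:

```python
def run_menu(inputs: list) -> list:
    # Each handler returns the *next* state instead of calling the next
    # menu function, so deep back-and-forth never grows the call stack.
    visited = []
    state = "main"
    script = iter(inputs)
    while state != "quit":
        visited.append(state)
        choice = next(script)
        if state == "main":
            state = choice if choice in ("models", "quit") else "main"
        elif state == "models":
            state = "main" if choice == "back" else "models"
    return visited

path = run_menu(["models", "back", "quit"])
```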

17. Multiple files — Bare except: clauses

llamacpp.py:157,187, mlx.py:141, detector.py:108,190, amd.py:214, intel.py:220,248, qualcomm.py:185, discovery.py:236, federation.py:116, updater.py:141,218,231 — all catch SystemExit and KeyboardInterrupt. Use except Exception: instead.
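The distinction in one stand-in probe:

```python
from typing import Optional

def probe_gpu() -> Optional[str]:
    try:
        raise RuntimeError("driver not found")  # stand-in for a failed probe
    except Exception:
        # Unlike a bare `except:`, this lets KeyboardInterrupt and
        # SystemExit propagate, so Ctrl-C still stops the program.
        return None

result = probe_gpu()
```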

---

Low Severity / Code Quality

18. src/api/routes.py:112,133,147 — .json() deprecated in Pydantic v2. Use .model_dump_json().

19. src/backends/mlx.py:59-63 — GGUF loading via MLX is suspect. Passing the parent directory of a GGUF file to mlx_lm.load() likely won't work.

20. src/swarm/consensus.py:233 — False-positive list detection. Checks for -, *, 1., 2., which match hyphens in code, multiplication operators, version numbers, etc.

21. src/network/discovery.py:56 — Dict[str, any] should be Dict[str, Any] (capital A).

22. src/mcp_server.py:15-18 — Unused imports (ImageContent, Resource, EmbeddedResource, LoggingLevel).

23. src/models/downloader.py:74,118 — timeout=30 is connect-only, no read timeout. Multi-GB downloads can hang on stalled reads.

24. src/models/downloader.py — No checksum verification after download. Corrupted files are silently cached.

25. Tests directory is empty — tests/__init__.py exists but no actual tests.

---

Suggested Improvements

1. Wrap all blocking inference in asyncio.to_thread() — this is the single most impactful fix. Without it, the API server can only handle one request at a time.

2. Atomic downloads — download to a .part file, rename on success, verify checksum against HuggingFace metadata.

3. Replace recursive menus with a loop-based state machine — e.g. state = "main" in a while True loop with if state == "main": ... branches.

4. Add proper logging — replace all print() calls with logging.getLogger(__name__). The codebase uses print() everywhere, making it hard to control verbosity.

5. Fix the Android detection path — check is_termux() or /system/build.prop existence early in detect_gpu() before the platform branching.

6. Add integration tests — even simple smoke tests (hardware detection returns valid data, model selection picks something reasonable, API server starts and responds to /health) would catch regressions.

7. Use aiohttp.ClientSession as an async context manager in federation to ensure proper cleanup.

8. Consider separating streaming and non-streaming API routes — this avoids the response_model conflict and makes the code clearer.