Commit 472961cc23 by sleepy: feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one-at-a-time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
2026-02-23 22:57:38 +01:00


Here's a comprehensive review of your project. It's well-architected overall, but there are several issues worth addressing.


Critical Bugs

  1. src/network/discovery.py:128 — asyncio.create_task() called from a non-asyncio thread. Zeroconf's ServiceBrowser invokes _on_service_state_change from a background thread, but asyncio.create_task() requires a running event loop in the current thread, so this crashes with RuntimeError: no running event loop. Use asyncio.run_coroutine_threadsafe(coro, loop) instead, passing the loop captured on the asyncio side.
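
A sketch of the thread-to-loop handoff (class and method bodies here are illustrative, not the actual code in discovery.py):

```python
import asyncio

class DiscoveryHandler:
    """Illustrative sketch: bridging a Zeroconf callback thread to asyncio."""

    def __init__(self) -> None:
        # Capture the running loop once, on the asyncio side.
        self._loop = asyncio.get_running_loop()
        self.seen: list[str] = []

    def _on_service_state_change(self, name: str) -> None:
        # Runs on Zeroconf's background thread: create_task() would raise
        # here, but run_coroutine_threadsafe() schedules onto the loop safely.
        asyncio.run_coroutine_threadsafe(self._handle_service(name), self._loop)

    async def _handle_service(self, name: str) -> None:
        self.seen.append(name)
```

run_coroutine_threadsafe returns a concurrent.futures.Future if the callback needs the result; fire-and-forget is usually fine for discovery events.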

  2. src/network/discovery.py:161 — fragile int() on raw TXT property bytes. int(properties.get(b"instances", b"0")) happens to work for ASCII digits (int() accepts such bytes), but python-zeroconf surfaces valueless TXT keys as None, which raises TypeError, and any non-numeric value raises ValueError. Decode and validate explicitly.
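
A defensive parse might look like this (read_instances is a hypothetical helper, not a name from the codebase):

```python
from typing import Dict, Optional

def read_instances(properties: Dict[bytes, Optional[bytes]]) -> int:
    """Hypothetical helper: decode and validate a Zeroconf TXT property."""
    raw = properties.get(b"instances") or b"0"  # valueless keys map to None
    try:
        return int(raw.decode("ascii").strip())
    except (UnicodeDecodeError, ValueError):
        return 0  # fall back to a safe default on garbage values
```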

  3. src/hardware/detector.py:149,174 — Android/Qualcomm detection is unreachable. platform.system() returns "Linux" on Android, not "android", so the code enters the Linux branch, tries NVIDIA/AMD/Intel, fails, and returns None — never reaching the Qualcomm path.
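
One way to restructure, checking Android-specific markers before the platform branching (helper names are illustrative, and the markers are an assumption about typical Android/Termux environments):

```python
import os
import platform

def is_android() -> bool:
    # platform.system() reports "Linux" on Android, so test Android-specific
    # markers instead of the platform string. ANDROID_ROOT is set by the OS
    # (and inherited by Termux); /system/build.prop exists on most devices.
    return "ANDROID_ROOT" in os.environ or os.path.exists("/system/build.prop")

def detect_platform() -> str:
    # Short-circuit to Android first; otherwise use the usual branches.
    if is_android():
        return "android"
    return platform.system().lower()
```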

  4. src/api/routes.py:77 — response_model breaks streaming. The route declares response_model=ChatCompletionResponse, but when request.stream=True it returns a StreamingResponse; FastAPI then tries to validate the streaming response against the Pydantic model and fails.


High Severity

  1. src/backends/llamacpp.py:85-94 and src/backends/mlx.py:88-96 — Blocking calls in async methods. Both backends call synchronous inference (self._llm(...), mlx_generate(...)) directly inside async def methods. This blocks the entire event loop, freezing the API server for the duration of every inference. Wrap the calls in await asyncio.to_thread(...).
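
A minimal sketch of the pattern, with _generate_sync standing in for the real llama.cpp/MLX call:

```python
import asyncio
import time

def _generate_sync(prompt: str) -> str:
    # Stand-in for the blocking inference call.
    time.sleep(0.05)
    return prompt.upper()

async def generate(prompt: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop keeps serving other requests meanwhile.
    return await asyncio.to_thread(_generate_sync, prompt)
```

This works because the inference libraries spend their time in native code; if a backend held the GIL for long stretches of pure-Python work, a process pool would be needed instead.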

  2. src/backends/llamacpp.py:29 — Lock declared but never initialized. self._lock = None is never replaced with an actual asyncio.Lock(), so there is no concurrency protection when multiple requests hit the same backend instance.
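
A sketch of the fix, with str.upper standing in for the real inference call:

```python
import asyncio

class LlamaBackend:
    """Illustrative: one model instance guarded by a real asyncio.Lock."""

    def __init__(self) -> None:
        # An actual Lock, not a None placeholder.
        self._lock = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        async with self._lock:
            # Only one coroutine at a time reaches the shared model state.
            return await asyncio.to_thread(str.upper, prompt)
```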

  3. src/swarm/consensus.py:85,89 — Blocking I/O in async context. SentenceTransformer('all-MiniLM-L6-v2') downloads and loads a model synchronously, and .encode() is CPU-bound; both freeze the event loop.

  4. src/hardware/amd.py:80 — VRAM regex matches the wrong number. re.search(r'(\d+)', line) on a line like GPU[0] : VRAM Total Memory (B): 17179869184 matches the 0 in GPU[0], not the VRAM value. Anchor the pattern on the field label.
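
A sketch of a label-anchored pattern (the exact rocm-smi output format may vary by version, so the label text is an assumption taken from the example line above):

```python
import re
from typing import Optional

def parse_vram_total_bytes(line: str) -> Optional[int]:
    # Anchor on the field label so the digits in "GPU[0]" can't match first.
    m = re.search(r"VRAM Total Memory \(B\):\s*(\d+)", line)
    return int(m.group(1)) if m else None
```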

  5. src/models/downloader.py:79-88 — Partial downloads cached as valid. If a download is interrupted, the partial file remains on disk; is_model_cached() sees size > 0 and treats it as valid. Download to a .tmp file and rename atomically on completion.
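
A sketch of the atomic pattern (download_atomically is a hypothetical helper; checksum verification would slot in just before the rename):

```python
import os
from pathlib import Path
from typing import Iterable

def download_atomically(dest: Path, chunks: Iterable[bytes]) -> None:
    """Sketch: stream to a temp file and rename only on success, so the
    cache never contains a partially written model."""
    tmp = dest.with_name(dest.name + ".tmp")
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp, dest)  # atomic within one filesystem
    except BaseException:
        tmp.unlink(missing_ok=True)  # clean up the partial file
        raise
```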

  6. src/network/federation.py:253-277 — best_of_n strategy is non-functional. The code creates GenerationResponse objects but never uses them, then simply returns the local response; the strategy is dead code.


Medium Severity

  1. src/models/selector.py:182-184 — Memory calculation uses the wrong instance count. total_memory_gb = smallest_quant.vram_gb * instances uses the pre-clamped value, but instances is clamped with max(instances, 1) on the next line, so the reported memory total can disagree with the instance count actually used.

  2. src/models/selector.py:65 — calculate_max_instances returns an infeasible count. It returns MIN_INSTANCES (2) even when only 0-1 instances fit in memory, and _try_smallest_variant calls it without the memory guard that _try_model has.
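
A sketch of a feasibility-honest calculation (names and signature are illustrative; the real function takes different arguments):

```python
def calculate_max_instances(available_gb: float, per_instance_gb: float) -> int:
    # Report what actually fits; a hard-coded floor of 2 lets callers
    # over-commit memory when only 0 or 1 instances fit.
    if per_instance_gb <= 0:
        return 0
    return int(available_gb // per_instance_gb)
```

Callers can then treat a return of 0 as "this model does not fit" instead of discovering it at load time.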

  3. src/hardware/detector.py:87-88 — NVML resource leak. pynvml.nvmlInit() is called but nvmlShutdown() never is; wrap the query in try/finally.

  4. src/api/server.py:60-66 — Invalid CORS configuration. allow_origins=["*"] combined with allow_credentials=True violates the CORS spec: browsers refuse credentialed responses that carry a wildcard Access-Control-Allow-Origin. Either list explicit origins or drop allow_credentials.

  5. src/swarm/consensus.py:186-199 — _majority_vote doesn't do majority voting. It picks the median-length response, not the most common one, so both the name and the docstring are misleading.
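
If literal majority voting is wanted, a minimal version over whitespace-normalized responses might look like this (the median-length heuristic could keep its own, honestly named, helper):

```python
from collections import Counter
from typing import List

def majority_vote(responses: List[str]) -> str:
    # Count exact (whitespace-normalized) duplicates and return the most
    # common answer; on a tie, the first-seen response wins in CPython.
    counts = Counter(" ".join(r.split()) for r in responses)
    winner, _ = counts.most_common(1)[0]
    return winner
```

For free-form generations, exact matching rarely ties responses together; clustering by embedding similarity and voting per cluster is the usual refinement.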

  6. src/interactive.py:226,368,458 — Recursive menu navigation risks stack overflow. Menu functions call each other recursively, so repeated back-and-forth navigation keeps deepening the call stack. Use a loop-based state machine instead.
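
A loop-based sketch of the same navigation (states, transition map, and the inputs parameter are illustrative; a real version would read user input instead):

```python
from typing import Iterable, List

def run_menus(inputs: Iterable[str]) -> List[str]:
    """Sketch: loop-based menu navigation; the stack depth stays constant
    no matter how often the user bounces between menus."""
    feed = iter(inputs)
    visited = []
    state = "main"
    while state != "quit":
        visited.append(state)
        choice = next(feed)
        if state == "main":
            state = {"1": "models", "2": "settings", "q": "quit"}.get(choice, "main")
        elif state == "models":
            state = "main" if choice == "b" else "models"
        elif state == "settings":
            state = "main" if choice == "b" else "settings"
    return visited
```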

  7. Multiple files — Bare except: clauses. llamacpp.py:157,187, mlx.py:141, detector.py:108,190, amd.py:214, intel.py:220,248, qualcomm.py:185, discovery.py:236, federation.py:116, updater.py:141,218,231 all use bare except:, which also swallows SystemExit and KeyboardInterrupt. Use except Exception: instead.


Low Severity / Code Quality

  1. src/api/routes.py:112,133,147 — .json() deprecated in Pydantic v2. Use .model_dump_json().

  2. src/backends/mlx.py:59-63 — GGUF loading via MLX is suspect. Passing the parent directory of a GGUF file to mlx_lm.load() likely won't work.

  3. src/swarm/consensus.py:233 — False-positive list detection. Checks for -, *, 1., 2. which match hyphens in code, multiplication operators, version numbers, etc.

  4. src/network/discovery.py:56 — Dict[str, any] should be Dict[str, Any] (capital A).

  5. src/mcp_server.py:15-18 — Unused imports (ImageContent, Resource, EmbeddedResource, LoggingLevel).

  6. src/models/downloader.py:74,118 — timeout=30 bounds connection setup and individual socket reads, not total transfer time. A multi-GB download that stalls or trickles bytes can hang far longer than expected.

  7. src/models/downloader.py — No checksum verification after download. Corrupted files are silently cached.

  8. Tests directory is empty — tests/__init__.py exists but no actual tests.


Suggested Improvements

  1. Wrap all blocking inference in asyncio.to_thread() — this is the single most impactful fix. Without it, the API server can only handle one request at a time.
  2. Atomic downloads — download to .part file, rename on success, verify checksum against HuggingFace metadata.
  3. Replace recursive menus with a loop-based state machine — e.g. state = "main" in a while True loop with if state == "main": ... branches.
  4. Add proper logging — replace all print() calls with logging.getLogger(__name__). The codebase uses print() everywhere, making it hard to control verbosity.
  5. Fix the Android detection path — check is_termux() or /system/build.prop existence early in detect_gpu() before the platform branching.
  6. Add integration tests — even simple smoke tests (hardware detection returns valid data, model selection picks something reasonable, API server starts and responds to /health) would catch regressions.
  7. Use aiohttp.ClientSession as async context manager in federation to ensure proper cleanup.
  8. Consider separating streaming and non-streaming API routes — this avoids the response_model conflict and makes the code clearer.