472961cc23

Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one at a time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
Here's a comprehensive review of your project. It's well-architected overall, but there are several issues worth addressing.

---

Critical Bugs

1. src/network/discovery.py:128 — asyncio.create_task() called from a non-asyncio thread

Zeroconf's ServiceBrowser invokes _on_service_state_change from a background thread, but asyncio.create_task() requires a running event loop in the current thread. This will crash with RuntimeError: no current event loop. Use asyncio.run_coroutine_threadsafe(coro, loop) instead.
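A minimal sketch of the thread-safe hand-off; the callback and handler names below are stand-ins, not the project's real ones:

```python
import asyncio
import threading

results = []

async def handle_service() -> None:
    # Stand-in for the real async discovery handler.
    results.append("handled")

def on_service_state_change(loop: asyncio.AbstractEventLoop) -> None:
    # Runs in zeroconf's worker thread, where asyncio.create_task() would
    # raise "no running event loop". Hand the coroutine to the loop instead.
    future = asyncio.run_coroutine_threadsafe(handle_service(), loop)
    future.result(timeout=5)  # blocks this worker thread only, not the loop

async def main() -> None:
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=on_service_state_change, args=(loop,))
    worker.start()
    await asyncio.to_thread(worker.join)  # wait without blocking the loop

asyncio.run(main())
```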

2. src/network/discovery.py:161 — int() on bytes raises TypeError

int(properties.get(b"instances", b"0")) — in Python 3, calling int() on bytes without an explicit base raises a TypeError. Need .decode() first.
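For example (the TXT-record dict below is illustrative):

```python
# zeroconf exposes TXT records as a bytes-keyed, bytes-valued dict.
properties = {b"instances": b"4"}

raw = properties.get(b"instances", b"0")
# int(raw) would raise TypeError on bytes; decode to str first.
instances = int(raw.decode("utf-8"))
```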

3. src/hardware/detector.py:149,174 — Android/Qualcomm detection is unreachable

platform.system() returns "Linux" on Android, not "android". So the code enters the Linux branch, tries NVIDIA/AMD/Intel, fails, and returns None — never reaching Qualcomm detection.
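One way to special-case Android before the generic platform branching — a sketch, assuming the usual Termux/Android environment markers:

```python
import os
import platform

def is_android() -> bool:
    # platform.system() says "Linux" on Android, so look for
    # Android-specific markers before entering the Linux branch.
    return "ANDROID_ROOT" in os.environ or os.path.exists("/system/build.prop")

def detect_platform() -> str:
    if is_android():
        return "android"  # route to Qualcomm detection from here
    return platform.system().lower()  # "linux", "darwin", "windows", ...
```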

4. src/api/routes.py:77 — response_model breaks streaming

The route declares response_model=ChatCompletionResponse, but when request.stream=True it returns a StreamingResponse. FastAPI will try to validate the streaming response against the Pydantic model and fail.

---

High Severity

5. src/backends/llamacpp.py:85-94 and src/backends/mlx.py:88-96 — Blocking calls in async methods

Both backends call synchronous inference (self._llm(...), mlx_generate(...)) directly inside async def methods. This blocks the entire event loop, freezing the API server during inference. Wrap the calls in await asyncio.to_thread(...).
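A sketch of the fix, with a sleep standing in for the synchronous inference call:

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for the synchronous self._llm(...) / mlx_generate(...) call.
    time.sleep(0.05)
    return f"response to: {prompt}"

async def generate(prompt: str) -> str:
    # The worker thread absorbs the block; the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)

async def main() -> list:
    # Two requests now overlap instead of freezing the server serially.
    return await asyncio.gather(generate("a"), generate("b"))

results = asyncio.run(main())
```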

6. src/backends/llamacpp.py:29 — Lock declared but never initialized

self._lock = None is never replaced with an actual asyncio.Lock(), so there is no concurrency protection when multiple requests hit the same backend instance.
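A sketch of eager lock initialization (the class shape here is illustrative):

```python
import asyncio

class LlamaCppBackend:
    def __init__(self) -> None:
        # Create the lock up front; leaving self._lock = None gives
        # no protection at all when requests arrive concurrently.
        self._lock = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        async with self._lock:  # one request at a time per model instance
            return await asyncio.to_thread(str.upper, prompt)

async def main() -> str:
    backend = LlamaCppBackend()  # created inside the running loop
    return await backend.generate("hi")

out = asyncio.run(main())
```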

7. src/swarm/consensus.py:85,89 — Blocking I/O in async context

SentenceTransformer('all-MiniLM-L6-v2') downloads/loads a model synchronously, and .encode() is CPU-bound. Both freeze the event loop.

8. src/hardware/amd.py:80 — VRAM regex matches wrong number

re.search(r'(\d+)', line) on a line like GPU[0] : VRAM Total Memory (B): 17179869184 matches 0 (from GPU[0]), not the VRAM value.
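Anchoring the pattern on the field label fixes it:

```python
import re

line = "GPU[0]\t\t: VRAM Total Memory (B): 17179869184"

# A bare (\d+) finds the "0" in "GPU[0]" first; anchor on the field name
# and capture the number that follows the colon instead.
match = re.search(r"VRAM Total Memory \(B\):\s*(\d+)", line)
vram_bytes = int(match.group(1)) if match else 0
vram_gib = vram_bytes / (1024 ** 3)
```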

9. src/models/downloader.py:79-88 — Partial downloads cached as valid

If a download is interrupted, the partial file remains. is_model_cached() sees size > 0 and treats it as valid. Should download to a .tmp file and rename atomically on completion.
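A sketch of the atomic write (the function name and chunk source are illustrative):

```python
import os
import tempfile
from pathlib import Path

def download_atomically(dest: Path, chunks) -> None:
    # Write into a temp file in the destination directory, then rename:
    # os.replace() is atomic on the same filesystem, so dest is either
    # absent or complete, and an interrupted download is never mistaken
    # for a cached model.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_name, dest)
    except BaseException:
        os.unlink(tmp_name)
        raise

# Demo: two chunks land as one complete file, with no .tmp left behind.
dest = Path(tempfile.mkdtemp()) / "model.bin"
download_atomically(dest, [b"abc", b"def"])
```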

10. src/network/federation.py:253-277 — best_of_n strategy is non-functional

The code creates GenerationResponse objects but never uses them, then just returns the local response. This strategy is dead code.

---

Medium Severity

11. src/models/selector.py:182-184 — Memory calculation uses wrong instance count

total_memory_gb = smallest_quant.vram_gb * instances uses the pre-clamped value, but instances is clamped with max(instances, 1) on the next line, so the reported memory total and the actual instance count can disagree.

12. src/models/selector.py:65 — calculate_max_instances returns infeasible count

Returns MIN_INSTANCES (2) even when only 0-1 instances fit in memory. _try_smallest_variant calls this without the memory guard that _try_model has.

13. src/hardware/detector.py:87-88 — NVML resource leak

pynvml.nvmlInit() is called but nvmlShutdown() is never called. Need a try/finally.

14. src/api/server.py:60-66 — Invalid CORS configuration

allow_origins=["*"] with allow_credentials=True violates the CORS spec. Browsers will reject this.

15. src/swarm/consensus.py:186-199 — _majority_vote doesn't do majority voting

It picks the median-length response, not the most common one. Name and docstring are misleading.
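An actual majority vote is a few lines with collections.Counter — a sketch; real responses would likely need more careful normalization than strip():

```python
from collections import Counter

def majority_vote(responses: list) -> str:
    # Count (lightly normalized) responses; ties go to the earliest seen,
    # since Counter preserves first-insertion order in most_common().
    counts = Counter(r.strip() for r in responses)
    return counts.most_common(1)[0][0]

winner = majority_vote(["42", "42 ", "41", "42", "41"])
```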

16. src/interactive.py:226,368,458 — Recursive menu navigation risks stack overflow

Menu functions call each other recursively. Repeated back-and-forth navigation can blow the stack. Use a loop-based state machine instead.
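A minimal sketch of loop-based navigation; the menu names and scripted input are illustrative:

```python
def run_menu(inputs: list) -> list:
    # Each handler returns the *next* state instead of calling the next
    # menu function, so deep back-and-forth never grows the call stack.
    visited = []
    state = "main"
    script = iter(inputs)
    while state != "quit":
        visited.append(state)
        choice = next(script)
        if state == "main":
            state = choice if choice in ("models", "quit") else "main"
        elif state == "models":
            state = "main" if choice == "back" else "models"
    return visited

path = run_menu(["models", "back", "quit"])
```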

17. Multiple files — Bare except: clauses

llamacpp.py:157,187, mlx.py:141, detector.py:108,190, amd.py:214, intel.py:220,248, qualcomm.py:185, discovery.py:236, federation.py:116, updater.py:141,218,231 — all catch SystemExit and KeyboardInterrupt. Use except Exception: instead.
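The distinction in one stand-in probe:

```python
from typing import Optional

def probe_gpu() -> Optional[str]:
    try:
        raise RuntimeError("driver not found")  # stand-in for a failed probe
    except Exception:
        # Unlike a bare `except:`, this lets KeyboardInterrupt and
        # SystemExit propagate, so Ctrl-C still stops the program.
        return None

result = probe_gpu()
```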

---

Low Severity / Code Quality

18. src/api/routes.py:112,133,147 — .json() deprecated in Pydantic v2. Use .model_dump_json().

19. src/backends/mlx.py:59-63 — GGUF loading via MLX is suspect. Passing the parent directory of a GGUF file to mlx_lm.load() likely won't work.

20. src/swarm/consensus.py:233 — False-positive list detection. Checks for -, *, 1., 2., which match hyphens in code, multiplication operators, version numbers, etc.

21. src/network/discovery.py:56 — Dict[str, any] should be Dict[str, Any] (capital A).

22. src/mcp_server.py:15-18 — Unused imports (ImageContent, Resource, EmbeddedResource, LoggingLevel).

23. src/models/downloader.py:74,118 — timeout=30 is connect-only, no read timeout. Multi-GB downloads can hang on stalled reads.

24. src/models/downloader.py — No checksum verification after download. Corrupted files are silently cached.

25. Tests directory is empty — tests/__init__.py exists but no actual tests.

---

Suggested Improvements

1. Wrap all blocking inference in asyncio.to_thread() — this is the single most impactful fix. Without it, the API server can only handle one request at a time.

2. Atomic downloads — download to a .part file, rename on success, verify checksum against HuggingFace metadata.

3. Replace recursive menus with a loop-based state machine — e.g. state = "main" in a while True loop with if state == "main": ... branches.

4. Add proper logging — replace all print() calls with logging.getLogger(__name__). The codebase uses print() everywhere, making it hard to control verbosity.

5. Fix the Android detection path — check is_termux() or /system/build.prop existence early in detect_gpu() before the platform branching.

6. Add integration tests — even simple smoke tests (hardware detection returns valid data, model selection picks something reasonable, API server starts and responds to /health) would catch regressions.

7. Use aiohttp.ClientSession as an async context manager in federation to ensure proper cleanup.

8. Consider separating streaming and non-streaming API routes — this avoids the response_model conflict and makes the code clearer.