Commit Graph

10 Commits

Author SHA1 Message Date
sleepy dcca89d89a fix: OpenAI API compatibility for hollama and other clients
- Fixed ChatMessage.tool_calls to be Optional with default None (excluded when empty)
- Added logprobs field to ChatCompletionChoice (always included as null)
- Added stats and system_fingerprint to ChatCompletionResponse
- Fixed streaming response to use delta format (not message format)
- Fixed non-streaming response to include logprobs: null
- Updated tool instructions to include 'NO explanations'
- Added pytest-asyncio markers to async tests
- All 41 tests passing

This fixes the "Cannot read properties of undefined (reading 'content')" error in hollama and ensures compatibility with OpenAI clients.
2026-02-25 19:39:05 +01:00
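A framework-free sketch of the serialization rules this commit describes (the project itself uses Pydantic models; the function names and shapes below are illustrative assumptions):

```python
from typing import Any, Optional

def serialize_message(role: str, content: Optional[str],
                      tool_calls: Optional[list] = None) -> dict[str, Any]:
    """Build an OpenAI-style message dict, omitting tool_calls when empty."""
    msg: dict[str, Any] = {"role": role, "content": content}
    if tool_calls:  # excluded when None or empty, per the fix above
        msg["tool_calls"] = tool_calls
    return msg

def serialize_choice(index: int, message: dict[str, Any],
                     finish_reason: str = "stop") -> dict[str, Any]:
    # logprobs is always present as null: some clients index into the
    # choice object and break on a missing key.
    return {"index": index, "message": message,
            "logprobs": None, "finish_reason": finish_reason}
```

The asymmetry is the point: an empty `tool_calls` disappears from the payload, while `logprobs` is always emitted as `null`.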
sleepy 1acebbc6a2 refactor(models): extract memory calculations and config from selector
Changes:
- selector.py: 486 → 329 lines (-32%)
- Extracted memory calculation functions to memory_calculator.py
- Extracted constants to selector_config.json
- Updated selector.py to load config and import from memory_calculator
- All 35 tests pass
2026-02-25 13:23:47 +01:00
sleepy 0886b9ae73 fix: handle --model with full format (model:size:quant)
- Parse model ID with format like qwen2.5-coder:7b:4bit
- Return specific error if requested config not found or doesn't fit
- Don't fall back to auto-selection when specific config requested
2026-02-24 13:15:59 +01:00
sleepy d30eedaa63 Fix opencode integration: streaming, response format, and tool handling
- Fix streaming to work even when tools are present (was forcing JSON mode)
- Fix response format: use empty list [] instead of null for tool_calls
- Add exclude_none config to ChatMessage model to match OpenAI format
- Remove tool instructions from prompt (were confusing 3B model)
- Fix tool call parsing to handle markdown code blocks properly
- Change default instances from 3 to 1 for faster debugging
- Allow 1 instance minimum in interactive config (was 2 on Mac)
- Add debug logging to track requests and responses

Fixes infinite loop issue where opencode would retry requests repeatedly
2026-02-24 03:44:46 +01:00
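The markdown-fence handling in the tool-call fix can be sketched like this (the regex and function name are illustrative, not the project's actual code):

```python
import json
import re

# Match an optional ```json ... ``` fence around the payload.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def parse_tool_call(text: str):
    """Extract a JSON tool call that a small model may have wrapped
    in a markdown code fence; returns None if no valid JSON is found."""
    m = _FENCE.search(text)
    payload = m.group(1) if m else text
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None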
sleepy 2461f45ca8 fix: Remove slow HF API check from recommended config selection
- select_optimal_model was checking HF API for available quantizations
- This caused menu to hang/slow down when changing context
- Now only checks availability when browsing or custom config
- Recommended config uses default quantizations (faster)
2026-02-23 23:54:57 +01:00
sleepy f2d0fddfa4 fix: Update selector to check available quantizations on Mac 2026-02-23 23:52:29 +01:00
sleepy 792c40594e fix: Recommended config shows 3 responses on Mac instead of 1
Updated _try_model_with_context and _try_smallest_variant_with_context:
- On Mac (use_mlx=True): Returns 3 responses by default
- On other platforms: Still calculates based on VRAM
- Memory calculation fixed for Mac (doesn't multiply by response count)

Fixes issue where recommended config showed 'Responses: 1' on Mac
2026-02-23 23:46:01 +01:00
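The platform split described above amounts to something like this (parameter names and the VRAM formula are assumptions; only the Mac-returns-3 behavior comes from the commit):

```python
def default_response_count(use_mlx: bool, vram_gb: float,
                           per_instance_gb: float) -> int:
    """Pick how many responses to generate for the recommended config."""
    if use_mlx:
        # Apple Silicon: unified memory, one loaded model generating
        # sequentially, so 3 responses by default.
        return 3
    # Other platforms: derive the count from how many instances fit in VRAM.
    return max(1, int(vram_gb // per_instance_gb))
```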
sleepy 472961cc23 feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one-at-a-time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
2026-02-23 22:57:38 +01:00
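The sequential generation mode above boils down to serializing GPU access while keeping every worker resident. A minimal sketch of that idea (class and method names are invented; the real implementation lives in src/swarm/manager.py):

```python
import asyncio

class SequentialSwarm:
    """Run generation requests one at a time so multiple workers can
    share a single Metal GPU without conflicts."""

    def __init__(self) -> None:
        self._gpu_lock = asyncio.Lock()

    async def generate(self, worker, prompt: str) -> str:
        # Workers stay loaded in memory for fast switching; only the
        # actual generation step is serialized behind the lock.
        async with self._gpu_lock:
            return await asyncio.to_thread(worker, prompt)
```

Running the blocking `worker` call in a thread keeps the event loop (and the live status display) responsive while the lock guarantees one generation at a time.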
sleepy e794fe29d4 Fix critical bugs, concurrency issues, and code quality across codebase
- Fix asyncio.create_task() crash in zeroconf background thread (discovery.py)
- Fix int(bytes) TypeError in peer property decoding (discovery.py)
- Fix unreachable Android/Qualcomm GPU detection path (detector.py)
- Add nvmlShutdown() to prevent NVML resource leak (detector.py)
- Wrap blocking inference in asyncio.to_thread() to unblock event loop (llamacpp.py, mlx.py)
- Initialize and use asyncio.Lock for concurrency safety (llamacpp.py)
- Fix VRAM regex matching GPU index instead of byte value (amd.py)
- Implement best_of_n federation strategy (was dead code) (federation.py)
- Lazy-import aiohttp/mcp to avoid hard ImportError (federation.py, mcp_server.py)
- Fix response_model conflict with streaming responses (routes.py)
- Fix CORS allow_origins=* with allow_credentials=True violation (server.py)
- Fix memory calculation using pre-clamped instance count (selector.py)
- Fix calculate_max_instances returning 2 when only 0-1 fit (selector.py)
- Atomic downloads via .part file to prevent caching partial files (downloader.py)
- Replace recursive menu navigation with loop-based approach (interactive.py)
- Implement actual majority voting in _majority_vote (consensus.py)
- Fix false-positive list detection in quality scoring (consensus.py)
- Replace 15+ bare except: with except Exception: across codebase
- Fix .json() -> .model_dump_json() for Pydantic v2 (routes.py)
- Remove unused MCP imports, add empty prompt validation (mcp_server.py)
- Use tokenizer for accurate MLX token counting (mlx.py)
- Fix memory estimate from FP32 (*4) to quantized (*0.6) (llamacpp.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 20:11:58 +01:00
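The atomic-download fix from this commit follows a standard write-then-rename pattern; a sketch under assumptions (`fetch` stands in for the real HTTP streaming logic in downloader.py):

```python
import os

def download_atomically(url: str, dest: str, fetch) -> None:
    """Stream to dest + '.part' and rename only on success, so an
    interrupted download never leaves a partial file at the final path."""
    part = dest + ".part"
    try:
        with open(part, "wb") as f:
            for chunk in fetch(url):
                f.write(chunk)
        os.replace(part, dest)  # atomic rename on POSIX filesystems
    except Exception:
        if os.path.exists(part):
            os.remove(part)  # clean up the partial file before re-raising
        raise
```

Anything that caches or scans the destination path only ever sees a complete file, which is exactly what the partial-file caching bug required.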
sleepy d68eda45d8 Fix .gitignore to allow src/models/ directory
The .gitignore had 'models/' which excluded both:
- The models/ cache directory at root (intended)
- The src/models/ module directory (NOT intended)

Changed to '/models/' to only exclude root-level models/ directory
while allowing src/models/ to be tracked.

This fixes the 'No module named models' error on fresh clones.
2026-02-23 19:51:40 +01:00
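The two patterns side by side (taken directly from the commit description):

```
# Before: matched every "models" directory at any depth,
# including the tracked src/models/ package.
models/

# After: the leading slash anchors the pattern to the repo root,
# so only the top-level cache directory is ignored.
/models/
```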