Local Swarm Architecture
Core Concept
Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, and the swarm votes on the best answer. Multiple machines running the swarm can be federated into a single "hive mind", putting all your old hardware to work.
How It Works
┌─────────────────┐ ┌─────────────────────────────────────┐
│ Your Prompt │────▶│ Swarm Manager │
└─────────────────┘ │ ┌─────────┐ ┌─────────┐ ┌─────────┐│
│ │Worker 1 │ │Worker 2 │ │Worker 3 ││
│ │ (LLM) │ │ (LLM) │ │ (LLM) ││
│ └────┬────┘ └────┬────┘ └────┬────┘│
│ └───────────┼───────────┘ │
│ ▼ │
│ Consensus Engine │
│ (Picks best answer) │
└───────────────────┬─────────────────┘
▼
┌───────────────┐
│ Best Response │
└───────────────┘
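The flow above can be sketched in a few lines: fan the prompt out to every worker in parallel, then vote. This is a minimal illustration using stdlib tools only; the function names and toy workers are hypothetical, not the project's actual API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def swarm_generate(workers, prompt):
    """Send the same prompt to every worker, then majority-vote the answers."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        answers = list(pool.map(lambda w: w(prompt), workers))
    # Counter.most_common(1) yields the answer with the most votes
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Toy callables standing in for real LLM instances:
workers = [lambda p: "4", lambda p: "4", lambda p: "5"]
print(swarm_generate(workers, "What is 2 + 2?"))  # -> 4
```

In the real system each worker wraps a llama.cpp or MLX process rather than a lambda, but the fan-out-then-vote shape is the same.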
Project Structure
local_swarm/
├── main.py # Entry point (99 lines)
├── src/
│ ├── api/ # HTTP API layer
│ │ ├── routes.py # FastAPI routes (252 lines)
│ │ ├── formatting.py # Message formatting (265 lines)
│ │ ├── tool_parser.py # Tool parsing (250 lines)
│ │ ├── chat_handlers.py # Chat completion logic (287 lines)
│ │ ├── server.py # Server setup
│ │ └── models.py # API data models
│ ├── cli/ # Command-line interface
│ │ ├── parser.py # CLI argument parsing
│ │ ├── main_runner.py # Main application logic
│ │ ├── server_runner.py # Server management
│ │ ├── test_runner.py # Test mode execution
│ │ └── tool_server.py # Tool server runner
│ ├── swarm/ # Swarm orchestration
│ │ ├── manager.py # Swarm manager
│ │ ├── worker.py # LLM worker implementation
│ │ ├── consensus.py # Consensus algorithms
│ │ └── orchestrator.py # Generation orchestration
│ ├── models/ # Model management
│ │ ├── registry.py # Model registry (194 lines)
│ │ ├── selector.py # Model selection (329 lines)
│ │ ├── memory_calculator.py # Memory calculation utilities
│ │ └── downloader.py # Model downloading
│ ├── backends/ # LLM backends
│ │ ├── llama_cpp.py # llama.cpp backend
│ │ ├── mlx.py # Apple Silicon MLX backend
│ │ └── base.py # Base backend interface
│ ├── hardware/ # Hardware detection
│ │ ├── detector.py # Hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection
│ │ ├── intel.py # Intel GPU detection
│ │ ├── qualcomm.py # Qualcomm detection
│ │ └── ...
│ ├── network/ # Network federation
│ │ ├── federation.py # Cross-swarm consensus
│ │ ├── discovery.py # Peer discovery (mDNS)
│ │ └── discovery_core.py # Discovery utilities
│ ├── tools/ # Tool execution
│ │ └── executor.py # Tool execution engine
│ ├── interactive/ # Interactive CLI
│ │ ├── ui.py # UI utilities
│ │ ├── display.py # Hardware/resource display
│ │ ├── tips.py # Help content
│ │ └── config_utils.py # Configuration selection
│ └── utils/ # Utilities
│ ├── token_counter.py # Token counting
│ ├── project_discovery.py # Project root discovery
│ ├── network.py # Network utilities
│ └── logging_config.py # Logging configuration
├── config/
│ └── models/ # Model configuration files
│ ├── model_metadata.json # Model metadata
│ ├── mlx_quant_sizes.json # MLX quantization sizes
│ ├── gguf_quant_sizes.json # GGUF quantization sizes
│ └── selector_config.json # Selection constants
└── tests/ # Test suite
Architecture Principles
1. Separation of Concerns
Each module has a single responsibility:
- API layer (src/api/) - HTTP routing only
- CLI layer (src/cli/) - User interface and orchestration
- Swarm layer (src/swarm/) - LLM worker management
- Models layer (src/models/) - Model selection and downloading
2. Configuration Over Code
Static data extracted to JSON configs:
- Model metadata in config/models/model_metadata.json
- Quantization sizes in mlx_quant_sizes.json and gguf_quant_sizes.json
- Selection constants in selector_config.json
3. Modular Utilities
Shared functionality in reusable modules:
- utils/token_counter.py - Centralized token counting
- utils/project_discovery.py - Project root detection
- utils/network.py - IP detection and network utilities
Components
1. Hardware Detection (src/hardware/)
Detects your GPU and available memory to optimize model selection.
- NVIDIA - pynvml
- AMD - rocm-smi
- Intel - sycl-ls
- Apple Silicon - sysctl/unified memory
- Qualcomm - Android/Termux detection
- CPU - psutil
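A detection cascade like this is commonly written as an ordered list of probes, tried until one succeeds. The sketch below is illustrative only; the probe functions and result shape are assumptions, not the project's actual detector.py API.

```python
def detect_hardware(probes):
    """Run probes in priority order; return the first successful result.

    Each probe is a zero-argument callable returning a dict such as
    {"backend": "nvidia", "memory_gb": 24}, or raising if unavailable.
    """
    for probe in probes:
        try:
            return probe()
        except Exception:
            continue  # this hardware isn't present; try the next probe
    return {"backend": "cpu", "memory_gb": 8}  # conservative fallback

def nvidia_probe():
    raise RuntimeError("no NVIDIA GPU")  # e.g. pynvml not importable

def apple_probe():
    return {"backend": "apple", "memory_gb": 16}  # e.g. read via sysctl

print(detect_hardware([nvidia_probe, apple_probe]))
```

Ordering the probes from most to least specific means a machine with both a discrete GPU and a CPU fallback always gets the stronger backend.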
2. Model Selection (src/models/)
Automatically picks the best model based on available memory:
Available Memory → Model Size → Quantization → Instance Count
24 GB → 14B → Q4_K_M → 2-3 instances
16 GB → 7B → Q4_K_M → 3-4 instances
8 GB → 3B → Q6_K → 2-3 instances
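The table above maps directly to a threshold function. This is a simplified sketch mirroring the table, not the real selector.py logic (which also weighs quantization sizes and per-model metadata); the sub-8 GB fallback row is a hypothetical addition.

```python
def pick_model(memory_gb):
    """Map available memory to (model size, quantization, instance count)."""
    if memory_gb >= 24:
        return ("14B", "Q4_K_M", 2)
    if memory_gb >= 16:
        return ("7B", "Q4_K_M", 3)
    if memory_gb >= 8:
        return ("3B", "Q6_K", 2)
    return ("1B", "Q4_K_M", 1)  # hypothetical floor for tiny systems

print(pick_model(16))  # -> ('7B', 'Q4_K_M', 3)
```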
Key modules:
- registry.py - Loads model data from JSON configs
- selector.py - Selects optimal model for hardware
- memory_calculator.py - Calculates memory requirements
3. Backends (src/backends/)
Run the actual LLM inference:
- llama.cpp - CUDA, ROCm, SYCL, CPU (cross-platform)
- MLX - Apple Silicon optimized
4. Swarm Management (src/swarm/)
Manages multiple LLM workers and consensus voting.
Workers: Each worker runs an independent LLM instance.
Consensus: Picks the best response using one of:
- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)
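Three of the four strategies can be sketched compactly; the function and the "quality" heuristic below are illustrative assumptions, not the project's actual consensus.py implementation (semantic similarity grouping is omitted here since it needs an embedding model).

```python
from collections import Counter

def consensus(responses, strategy="majority"):
    """Pick one answer from a list of (text, latency_seconds) tuples."""
    if strategy == "majority":
        texts = [text for text, _ in responses]
        return Counter(texts).most_common(1)[0][0]   # most frequent exact match
    if strategy == "fastest":
        return min(responses, key=lambda r: r[1])[0]  # lowest latency wins
    if strategy == "quality":
        # crude proxy: prefer answers with code blocks, then longer answers
        return max(responses, key=lambda r: ("```" in r[0], len(r[0])))[0]
    raise ValueError(f"unknown strategy: {strategy}")

responses = [("42", 1.2), ("42", 0.9), ("41", 0.3)]
print(consensus(responses))             # -> 42
print(consensus(responses, "fastest"))  # -> 41
```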
Key modules:
- manager.py - Swarm lifecycle and coordination
- worker.py - Individual worker implementation
- consensus.py - Consensus algorithms
- orchestrator.py - Generation orchestration
5. Network Federation (src/network/)
Connect multiple machines into a distributed swarm:
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
Discovery: mDNS/Bonjour auto-discovery
Protocol: HTTP between peers
Voting: Two-phase (local consensus → global consensus)
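The two-phase vote can be illustrated with plain majority counting: each machine first reduces its own workers' answers to one candidate, then the candidates are voted on globally. A sketch under that assumption (function names are illustrative, not the federation.py API):

```python
from collections import Counter

def local_consensus(worker_answers):
    """Phase 1: one machine reduces its workers' answers to a candidate."""
    return Counter(worker_answers).most_common(1)[0][0]

def global_consensus(machine_candidates):
    """Phase 2: majority vote across the per-machine candidates."""
    return Counter(machine_candidates).most_common(1)[0][0]

machines = [
    ["A", "A", "B", "A"],  # machine 1 (4 workers)
    ["B", "B"],            # machine 2 (2 workers)
    ["A", "A", "B"],       # machine 3 (3 workers)
]
candidates = [local_consensus(m) for m in machines]  # ['A', 'B', 'A']
print(global_consensus(candidates))  # -> A
```

Note the trade-off this shape implies: each machine gets one vote regardless of how many workers it runs, so a 2-worker machine counts as much as a 4-worker one in phase 2.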
6. API (src/api/)
OpenAI-compatible REST API:
- POST /v1/chat/completions - Main endpoint
- GET /v1/models - List models
- GET /health - Health check
- POST /v1/tools/execute - Tool execution (when enabled)
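Because the API is OpenAI-compatible, a request is just a standard chat-completions payload. A minimal stdlib sketch (the model name and port are placeholders; check your server configuration):

```python
import json
from urllib.request import Request

def chat_request(prompt, base_url="http://localhost:8000"):
    """Build an OpenAI-style chat completion request for the swarm API."""
    payload = {
        "model": "local-swarm",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Hello, swarm!")
# urllib.request.urlopen(req) would perform the call once the server is up.
print(req.full_url)
```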
Modular design:
- routes.py - HTTP routing only (thin controllers)
- formatting.py - Message formatting logic
- tool_parser.py - Tool call parsing
- chat_handlers.py - Chat completion business logic
7. CLI (src/cli/)
Command-line interface modules:
- parser.py - Argument parsing
- main_runner.py - Main application orchestration
- server_runner.py - Server lifecycle management
- test_runner.py - Test mode execution
- tool_server.py - Tool server management
8. Tools (src/tools/)
Optional tool execution for enhanced capabilities:
- read_file - Read files
- write_file - Write files
- execute_bash - Run shell commands
- webfetch - Fetch web content
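A tool executor of this kind is typically a name-to-handler dispatch table. The sketch below shows the shape for the two file tools; the registry layout and keyword-argument convention are assumptions, not the actual executor.py interface.

```python
import os
import tempfile

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes"

# Dispatch table: tool name -> handler (execute_bash/webfetch omitted here)
TOOLS = {"read_file": read_file, "write_file": write_file}

def execute_tool(name, **kwargs):
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

path = os.path.join(tempfile.mkdtemp(), "note.txt")
execute_tool("write_file", path=path, content="hello")
print(execute_tool("read_file", path=path))  # -> hello
```

Keeping the table explicit makes it easy to disable dangerous tools (e.g. shell execution) by simply omitting them from the registry.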
9. Interactive Mode (src/interactive/)
Interactive CLI components:
- ui.py - Menu display and input handling
- display.py - Hardware and resource display
- tips.py - Educational content and help
- config_utils.py - Configuration selection utilities
10. Utilities (src/utils/)
Shared utility functions:
- token_counter.py - Token counting with tiktoken
- project_discovery.py - Project root detection
- network.py - Network utilities (IP detection)
- logging_config.py - Logging configuration
Data Flow
1. Request comes in via the API
2. Routes (thin layer) forward to handlers
3. Chat Handlers process the request
4. Swarm Manager sends the prompt to all workers
5. Workers generate responses in parallel
6. Consensus picks the best answer
7. Response returned to the client
Memory Model
- External GPU: Use 90% of VRAM
- Apple Silicon: Use RAM - 4GB buffer
- CPU-only: Use RAM - 4GB buffer
Each worker loads the full model independently (no sharing).
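The budget rules above reduce to a small function. This sketch uses exactly the buffer sizes stated here; the function name and hardware labels are illustrative, not the memory_calculator.py API.

```python
def usable_memory_gb(total_gb, hardware):
    """Memory budget per the rules above."""
    if hardware == "gpu":             # external/discrete GPU: 90% of VRAM
        return total_gb * 0.9
    return max(total_gb - 4, 0)       # Apple Silicon / CPU-only: RAM - 4 GB buffer

print(usable_memory_gb(24, "gpu"))    # -> 21.6
print(usable_memory_gb(16, "apple"))  # -> 12
```

Since each worker loads the full model with no sharing, the instance count is roughly this budget divided by one model's footprint at the chosen quantization.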
Configuration Files
Static data extracted to JSON for easy maintenance:
config/models/
├── model_metadata.json # Model names, descriptions, priorities
├── mlx_quant_sizes.json # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json # Selection constraints and defaults
Code Quality Standards
- No files > 300 lines (with a few exceptions)
- No functions > 50 lines
- No indentation > 3 levels
- No duplicate code (>3 lines)
- Single responsibility per module
- Configuration over code for static data
Testing
tests/
├── test_hardware_detector.py # Hardware detection tests
├── test_tool_parsing.py # Tool parsing tests
└── test_federation_metrics.py # Federation tests
Run tests: python -m pytest tests/ -v
Future Ideas
- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms