
Local Swarm Architecture

Core Concept

Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connecting multiple machines running the swarm creates a "hive mind" that puts all of your old hardware to use.

How It Works

┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│         Swarm Manager               │
└─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │ (LLM)   │ │ (LLM)   │ │ (LLM)   ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │         Consensus Engine            │
                        │         (Picks best answer)         │
                        └───────────────────┬─────────────────┘
                                            ▼
                                    ┌───────────────┐
                                    │ Best Response │
                                    └───────────────┘
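The fan-out and vote loop in the diagram can be sketched with asyncio (the `ask_worker` and `pick_best` names are illustrative stubs, not the project's actual API):

```python
import asyncio

async def ask_worker(worker_id: int, prompt: str) -> str:
    # Stand-in for a real backend call; each worker answers independently.
    await asyncio.sleep(0)
    return f"answer from worker {worker_id}"

def pick_best(responses: list[str]) -> str:
    # Placeholder consensus: exact-match majority vote.
    return max(set(responses), key=responses.count)

async def swarm_generate(prompt: str, n_workers: int = 3) -> str:
    # Fan the same prompt out to every worker in parallel...
    responses = await asyncio.gather(
        *(ask_worker(i, prompt) for i in range(n_workers))
    )
    # ...then let the consensus engine pick one answer.
    return pick_best(list(responses))

result = asyncio.run(swarm_generate("What is 2 + 2?"))
print(result)
```

The real manager talks to llama.cpp or MLX backends instead of stubbed coroutines, but the shape is the same: parallel generation, then a single consensus step.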

Project Structure

local_swarm/
├── main.py                    # Entry point (99 lines)
├── src/
│   ├── api/                   # HTTP API layer
│   │   ├── routes.py          # FastAPI routes (252 lines)
│   │   ├── formatting.py      # Message formatting (265 lines)
│   │   ├── tool_parser.py     # Tool parsing (250 lines)
│   │   ├── chat_handlers.py   # Chat completion logic (287 lines)
│   │   ├── server.py          # Server setup
│   │   └── models.py          # API data models
│   ├── cli/                   # Command-line interface
│   │   ├── parser.py          # CLI argument parsing
│   │   ├── main_runner.py     # Main application logic
│   │   ├── server_runner.py   # Server management
│   │   ├── test_runner.py     # Test mode execution
│   │   └── tool_server.py     # Tool server runner
│   ├── swarm/                 # Swarm orchestration
│   │   ├── manager.py         # Swarm manager
│   │   ├── worker.py          # LLM worker implementation
│   │   ├── consensus.py       # Consensus algorithms
│   │   └── orchestrator.py    # Generation orchestration
│   ├── models/                # Model management
│   │   ├── registry.py        # Model registry (194 lines)
│   │   ├── selector.py        # Model selection (329 lines)
│   │   ├── memory_calculator.py # Memory calculation utilities
│   │   └── downloader.py      # Model downloading
│   ├── backends/              # LLM backends
│   │   ├── llama_cpp.py       # llama.cpp backend
│   │   ├── mlx.py             # Apple Silicon MLX backend
│   │   └── base.py            # Base backend interface
│   ├── hardware/              # Hardware detection
│   │   ├── detector.py        # Hardware detection
│   │   ├── nvidia.py          # NVIDIA GPU detection
│   │   ├── intel.py           # Intel GPU detection
│   │   ├── qualcomm.py        # Qualcomm detection
│   │   └── ...
│   ├── network/               # Network federation
│   │   ├── federation.py      # Cross-swarm consensus
│   │   ├── discovery.py       # Peer discovery (mDNS)
│   │   └── discovery_core.py  # Discovery utilities
│   ├── tools/                 # Tool execution
│   │   └── executor.py        # Tool execution engine
│   ├── interactive/           # Interactive CLI
│   │   ├── ui.py              # UI utilities
│   │   ├── display.py         # Hardware/resource display
│   │   ├── tips.py            # Help content
│   │   └── config_utils.py    # Configuration selection
│   └── utils/                 # Utilities
│       ├── token_counter.py   # Token counting
│       ├── project_discovery.py # Project root discovery
│       ├── network.py         # Network utilities
│       └── logging_config.py  # Logging configuration
├── config/
│   └── models/                # Model configuration files
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── tests/                     # Test suite

Architecture Principles

1. Separation of Concerns

Each module has a single responsibility:

  • API layer (src/api/) - HTTP routing only
  • CLI layer (src/cli/) - User interface and orchestration
  • Swarm layer (src/swarm/) - LLM worker management
  • Models layer (src/models/) - Model selection and downloading

2. Configuration Over Code

Static data extracted to JSON configs:

  • Model metadata in config/models/model_metadata.json
  • Quantization sizes in mlx_quant_sizes.json and gguf_quant_sizes.json
  • Selection constants in selector_config.json
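Loading one of these configs is a one-liner; the fallback keys below are hypothetical, not the project's real schema:

```python
import json
from pathlib import Path

config_path = Path("config/models/selector_config.json")
if config_path.exists():
    selector_config = json.loads(config_path.read_text())
else:
    # Hypothetical defaults for this sketch; the actual keys may differ.
    selector_config = {"vram_fraction": 0.9, "ram_buffer_gb": 4}

print(sorted(selector_config))
```

Keeping these values in JSON means tuning a threshold is a data change, not a code change.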

3. Modular Utilities

Shared functionality in reusable modules:

  • utils/token_counter.py - Centralized token counting
  • utils/project_discovery.py - Project root detection
  • utils/network.py - IP detection and network utilities

Components

1. Hardware Detection (src/hardware/)

Detects your GPU and available memory to optimize model selection.

  • NVIDIA - pynvml
  • AMD - rocm-smi
  • Intel - sycl-ls
  • Apple Silicon - sysctl/unified memory
  • Qualcomm - Android/Termux detection
  • CPU - psutil

2. Model Selection (src/models/)

Automatically picks the best model based on available memory:

Available Memory → Model Size → Quantization → Instance Count
     24 GB     →   14B      →    Q4_K_M    →   2-3 instances
     16 GB     →    7B      →    Q4_K_M    →   3-4 instances
      8 GB     →    3B      →    Q6_K      →   2-3 instances

Key modules:

  • registry.py - Loads model data from JSON configs
  • selector.py - Selects optimal model for hardware
  • memory_calculator.py - Calculates memory requirements
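The table above reduces to a simple threshold lookup; a sketch with an illustrative function name and return shape:

```python
def select_model(available_gb: float) -> dict:
    # Thresholds mirror the table: more memory -> bigger model.
    if available_gb >= 24:
        return {"size": "14B", "quant": "Q4_K_M", "instances": (2, 3)}
    if available_gb >= 16:
        return {"size": "7B", "quant": "Q4_K_M", "instances": (3, 4)}
    if available_gb >= 8:
        return {"size": "3B", "quant": "Q6_K", "instances": (2, 3)}
    raise ValueError("Not enough memory for any configured model")

print(select_model(16)["size"])  # -> 7B
```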

3. Backends (src/backends/)

Run the actual LLM inference:

  • llama.cpp - CUDA, ROCm, SYCL, CPU (cross-platform)
  • MLX - Apple Silicon optimized

4. Swarm Management (src/swarm/)

Manages multiple LLM workers and consensus voting.

Workers: Each runs an independent LLM instance.

Consensus: Picks the best response using:

  • Similarity (semantic grouping)
  • Quality (code blocks, structure)
  • Fastest (latency)
  • Majority (exact match)

Key modules:

  • manager.py - Swarm lifecycle and coordination
  • worker.py - Individual worker implementation
  • consensus.py - Consensus algorithms
  • orchestrator.py - Generation orchestration
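The Majority strategy, for example, is exact-match counting; a minimal sketch (not the actual consensus.py):

```python
from collections import Counter

def majority_consensus(responses: list[str]) -> str:
    # Exact-match vote: the most common response wins; on a tie,
    # the response that appeared first is kept.
    best, _ = Counter(responses).most_common(1)[0]
    return best

answers = ["4", "4", "5"]
print(majority_consensus(answers))  # -> 4
```

The Similarity strategy generalizes this by grouping semantically equivalent responses before counting.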

5. Network Federation (src/network/)

Connect multiple machines into a distributed swarm:

Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘

Discovery: mDNS/Bonjour auto-discovery
Protocol: HTTP between peers
Voting: Two-phase (local consensus → global consensus)
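Two-phase voting can be sketched as a local vote per machine followed by a global vote over the local winners (illustrative only):

```python
from collections import Counter

def local_consensus(responses: list[str]) -> str:
    # Phase 1: each machine picks its own winner by exact-match majority.
    return Counter(responses).most_common(1)[0][0]

def federated_consensus(per_machine: list[list[str]]) -> str:
    # Phase 2: vote again over the local winners from every machine.
    local_winners = [local_consensus(r) for r in per_machine]
    return Counter(local_winners).most_common(1)[0][0]

swarm = [
    ["A", "A", "B", "B"],  # Machine 1 (4 workers)
    ["A", "B"],            # Machine 2 (2 workers)
    ["B", "B", "A"],       # Machine 3 (3 workers)
]
print(federated_consensus(swarm))  # -> A
```

Only the local winners cross the network, which keeps peer traffic small regardless of worker count.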

6. API (src/api/)

OpenAI-compatible REST API:

  • POST /v1/chat/completions - Main endpoint
  • GET /v1/models - List models
  • GET /health - Health check
  • POST /v1/tools/execute - Tool execution (when enabled)
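Because the API is OpenAI-compatible, any OpenAI client can point at it; a raw HTTP request works too (host, port, and model name here are assumptions about your deployment):

```python
import json
import urllib.request

# Assumed local endpoint; adjust host/port for your setup.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "local-swarm",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello, swarm"}],
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a running server, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```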

Modular design:

  • routes.py - HTTP routing only (thin controllers)
  • formatting.py - Message formatting logic
  • tool_parser.py - Tool call parsing
  • chat_handlers.py - Chat completion business logic

7. CLI (src/cli/)

Command-line interface modules:

  • parser.py - Argument parsing
  • main_runner.py - Main application orchestration
  • server_runner.py - Server lifecycle management
  • test_runner.py - Test mode execution
  • tool_server.py - Tool server management

8. Tools (src/tools/)

Optional tool execution for enhanced capabilities:

  • read_file - Read files
  • write_file - Write files
  • execute_bash - Run shell commands
  • webfetch - Fetch web content
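At its core, a tool executor of this shape is a name-to-callable dispatch table (a sketch; the real executor.py will differ):

```python
import os
import tempfile
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes"

# Registry mapping tool names (as the LLM emits them) to callables.
TOOLS = {"read_file": read_file, "write_file": write_file}

def execute_tool(name: str, **kwargs) -> str:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

tmp = os.path.join(tempfile.gettempdir(), "swarm_demo.txt")
print(execute_tool("write_file", path=tmp, content="hi"))  # -> wrote 2 bytes
print(execute_tool("read_file", path=tmp))                 # -> hi
```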

9. Interactive Mode (src/interactive/)

Interactive CLI components:

  • ui.py - Menu display and input handling
  • display.py - Hardware and resource display
  • tips.py - Educational content and help
  • config_utils.py - Configuration selection utilities

10. Utilities (src/utils/)

Shared utility functions:

  • token_counter.py - Token counting with tiktoken
  • project_discovery.py - Project root detection
  • network.py - Network utilities (IP detection)
  • logging_config.py - Logging configuration
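A counter of this shape typically wraps tiktoken with a crude fallback so counting never hard-fails; a sketch, not the project's token_counter.py:

```python
def count_tokens(text: str, model: str = "gpt-4") -> int:
    # Prefer tiktoken when installed; fall back to a rough
    # whitespace heuristic otherwise.
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        return max(1, len(text.split()))

print(count_tokens("hello swarm world"))
```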

Data Flow

  1. Request comes in via API
  2. Routes (thin layer) forward to handlers
  3. Chat Handlers process the request
  4. Swarm Manager sends to all workers
  5. Workers generate responses in parallel
  6. Consensus picks the best answer
  7. Response returned to client

Memory Model

  • External GPU: Use 90% of VRAM
  • Apple Silicon: Use RAM - 4GB buffer
  • CPU-only: Use RAM - 4GB buffer

Each worker loads the full model independently (no sharing).
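These rules reduce to a small budget function (name and signature are illustrative):

```python
from typing import Optional

def memory_budget_gb(vram_gb: Optional[float], ram_gb: float,
                     apple_silicon: bool = False) -> float:
    # External GPU: use 90% of VRAM.
    if vram_gb is not None and not apple_silicon:
        return vram_gb * 0.9
    # Apple Silicon unified memory / CPU-only: leave a 4 GB OS buffer.
    return max(0.0, ram_gb - 4)

print(memory_budget_gb(vram_gb=24, ram_gb=64))    # ~21.6
print(memory_budget_gb(vram_gb=None, ram_gb=16))  # 12.0
```

The budget is then divided by the per-instance model footprint (full model, since workers don't share weights) to get the instance count.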

Configuration Files

Static data extracted to JSON for easy maintenance:

config/models/
├── model_metadata.json      # Model names, descriptions, priorities
├── mlx_quant_sizes.json     # MLX quantization VRAM requirements
├── gguf_quant_sizes.json    # GGUF quantization VRAM requirements
└── selector_config.json     # Selection constraints and defaults

Code Quality Standards

  • No files > 300 lines (with a few exceptions)
  • No functions > 50 lines
  • No indentation > 3 levels
  • No duplicate code (>3 lines)
  • Single responsibility per module
  • Configuration over code for static data

Testing

tests/
├── test_hardware_detector.py    # Hardware detection tests
├── test_tool_parsing.py         # Tool parsing tests
└── test_federation_metrics.py   # Federation tests

Run tests: python -m pytest tests/ -v

Future Ideas

  • Context compression for long inputs
  • CPU offloading for memory-constrained systems
  • RAG integration for knowledge bases
  • Speculative decoding for speed
  • More sophisticated consensus algorithms