[refactor] Pivot from MLX to LiteRT-LM backend (#5) #6

Merged
sleepy merged 1 commit from refactor/5-pivot-litert-lm into main 2026-05-22 14:45:57 +02:00

Pivot inference backend from MLX Swift to Google LiteRT-LM.

Key improvements

  • E2B: 2.58 GB file (28% smaller, was 3.6 GB), 607 MB RAM (about 5x less, was 3.2 GB), 56.5 tok/s on GPU
  • E4B: 3.65 GB file (27% smaller, was 5.0 GB), 961 MB RAM (about 5x less, was 5.0 GB)
  • Built-in multimodal vision + audio with on-demand loading
  • Native function calling for web search
  • Official Google Swift API
  • GPU-first with CPU fallback
  • Works on A16+ iPads (not just M-series)
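
The "GPU-first with CPU fallback" behaviour can be sketched as below. This is a minimal illustration, not the PR's code: `Backend`, `LoadError`, and `loadWithFallback` are hypothetical names, and the `load` closure stands in for whatever LiteRT-LM call actually creates the engine.

```swift
import Foundation

// Hypothetical types for illustration; none of these are LiteRT-LM API.
enum Backend: String {
    case gpu, cpu
}

enum LoadError: Error {
    case noBackends
    case unavailable(Backend)
}

/// Tries each backend in order and returns the first one that loads.
/// If every backend fails, the error from the last attempt is rethrown.
func loadWithFallback<T>(
    order: [Backend] = [.gpu, .cpu],
    load: (Backend) throws -> T
) throws -> (backend: Backend, value: T) {
    var lastError: Error = LoadError.noBackends
    for backend in order {
        do {
            let value = try load(backend)
            return (backend: backend, value: value)
        } catch {
            // Remember why this backend failed, then try the next one.
            lastError = error
        }
    }
    throw lastError
}
```

Returning the backend alongside the loaded value lets the caller show which path was taken, for example to warn that generation will be slower on CPU.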

Changes

  • Replace gemma-4-swift-mlx with LiteRT-LM Swift Package
  • Rewrite InferenceService with Engine/Conversation API
  • Rewrite ModelManager for single .litertlm files with real download progress
  • Remove Q4/Q8 selection (LiteRT-LM models are pre-quantized)
  • Add extended-virtual-addressing entitlement
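
One way to get real download progress for a single model file is Foundation's `URLSessionDownloadDelegate`. The sketch below is an assumption about the approach, not the PR's `ModelManager`: `ModelDownloader` and `downloadFraction` are hypothetical names, and error handling is omitted.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on Linux
#endif

/// Fraction in 0...1, or nil when the server sent no usable Content-Length
/// (URLSession reports -1 for an unknown expected size).
func downloadFraction(written: Int64, expected: Int64) -> Double? {
    guard expected > 0 else { return nil }
    return min(1.0, Double(written) / Double(expected))
}

// Hypothetical downloader. Marked @unchecked Sendable on the assumption that
// the callbacks are set before start(from:) and never changed afterwards.
final class ModelDownloader: NSObject, URLSessionDownloadDelegate, @unchecked Sendable {
    var onProgress: ((Double?) -> Void)?
    var onFinish: ((URL) -> Void)?
    private var task: URLSessionDownloadTask?
    private lazy var session: URLSession = URLSession(
        configuration: .default, delegate: self, delegateQueue: nil)

    func start(from url: URL) {
        task = session.downloadTask(with: url)
        task?.resume()
    }

    func cancel() {
        task?.cancel()
    }

    // Called repeatedly on the session's delegate queue as bytes arrive.
    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didWriteData bytesWritten: Int64, totalBytesWritten: Int64,
                    totalBytesExpectedToWrite: Int64) {
        onProgress?(downloadFraction(written: totalBytesWritten,
                                     expected: totalBytesExpectedToWrite))
    }

    // The temporary file is removed after this returns, so a real app
    // would move it to its final location inside onFinish.
    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        onFinish?(location)
    }
}
```

Returning `nil` for an unknown size lets the UI fall back to an indeterminate spinner instead of showing a misleading percentage.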

Closes #5

- Replace gemma-4-swift-mlx with Google LiteRT-LM Swift Package
- E2B: 2.58 GB (was 3.6 GB), 607 MB RAM (was 3.2 GB), 56.5 tok/s GPU
- E4B: 3.65 GB (was 5.0 GB), 961 MB RAM (was 5.0 GB)
- Built-in multimodal vision + audio with on-demand encoder loading
- Native function calling with constrained decoding for web search
- Official Swift API with Engine/Conversation/streaming
- GPU-first with CPU fallback
- Real download progress with cancel support
- Single .litertlm model file per variant
- Added extended-virtual-addressing entitlement for large models
- Removed Q4/Q8 selection (LiteRT models pre-quantized)
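
As a quick check, the "was" figures above reproduce the percentages quoted in the PR description. The helper names are hypothetical, and the RAM ratios assume 1 GB = 1000 MB.

```swift
/// Percentage by which `new` is smaller than `old`.
func percentSmaller(old: Double, new: Double) -> Double {
    (old - new) / old * 100
}

/// How many times smaller `new` is than `old`.
func timesSmaller(old: Double, new: Double) -> Double {
    old / new
}

// File sizes in GB, RAM in MB (1 GB = 1000 MB is an assumption).
let e2bFile = percentSmaller(old: 3.6, new: 2.58)  // about 28.3
let e4bFile = percentSmaller(old: 5.0, new: 3.65)  // about 27.0
let e2bRAM = timesSmaller(old: 3200, new: 607)     // about 5.3
let e4bRAM = timesSmaller(old: 5000, new: 961)     // about 5.2
```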

Closes #5
sleepy merged commit 8b2293e664 into main 2026-05-22 14:45:57 +02:00
Reference
sleepy/Gemma4Pad!6