Kaloyan Nikolov
Software engineer working on LLM inference and training. M.Sc. Computer Science @ RWTH Aachen.
Current focus
- Multi-Token Prediction (MTP) and speculative decoding for local inference
- KV cache quantization and Metal GPU kernel optimization
- Research on diffusion-based training for hybrid (attention + linear) architectures
- Research on ternary weight quantization
Active projects
- omlx: personal fork adding MTP decoding and a Q4 KV cache with Hadamard rotation
- sleepy-llm: Zig-native LLM inference engine with hand-tuned Metal kernels
- sleepy-agent: fully local Android AI assistant running Gemma 4 inference on-device
- qwen_orthrus: exploring Orthrus diffusion for ternary-weight and hybrid LLM architectures
Background
- Ported a C++ speech recognition toolkit to Android in full using the NDK.
- Worked on a custom multi-head attention (MHA) module in PyTorch.
- Trained attention-based and CTC multilingual ASR models, including fine-tuning for code-switching.
Stack
Python · Zig · C/C++ · TypeScript · Kotlin