• Joined on 2026-05-08

Kaloyan Nikolov

Software engineer working on LLM inference and training. M.Sc. Computer Science @ RWTH Aachen.


Current focus

  • Multi-Token Prediction (MTP) and speculative decoding for local inference
  • KV cache quantization and Metal GPU kernel optimization
  • Research on diffusion-based training for hybrid attention+linear architectures
  • Ternary-weight quantization research

Active projects

omlx — personal fork with MTP decoding and Q4 KV cache with Hadamard rotation

sleepy-llm — Zig-native LLM inference engine with hand-tuned Metal kernels

sleepy-agent — fully local Android AI assistant, on-device Gemma 4 inference

qwen_orthrus — exploring Orthrus diffusion for ternary-weight and hybrid LLM architectures


Background

  • Full port of a C++ speech recognition toolkit to Android using the NDK
  • Work on a custom multi-head attention (MHA) PyTorch module
  • Training of multilingual attention-based and CTC ASR models, with fine-tuning for code-switching

Stack

Python · Zig · C/C++ · TypeScript · Kotlin