Implement MXFP4 GGUF converter #37

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments

Goal

Convert Qwen3.6-27B MXFP4 weights to GGUF format for use with llama.cpp.

MXFP4 format

  • Weights: dtype=U32, shape [out, in/8] -- 8 fp4_e2m1 nibbles per uint32
  • Scales: dtype=U8, shape [out, in/32] -- E4M3 unsigned with bias=7
  • Non-quantized: layernorm, conv1d, dt_bias (BF16)
  • Total model size: roughly 14.9 GB
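
A minimal sketch of decoding this layout, assuming numpy, first-element-in-the-lowest-nibble packing (an assumption, not verified against the MLX layout), and the nominal fp4_e2m1 value table; `unpack_fp4` is a hypothetical helper, not an existing API:

```python
import numpy as np

# Nominal fp4_e2m1 values: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
FP4_E2M1 = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Unpack a [out, in/8] uint32 tensor into [out, in] float32.

    Assumes the first element sits in the lowest nibble; flip the
    shift order if the MLX packing turns out to be the reverse.
    """
    out_dim, in_div8 = packed.shape
    shifts = np.arange(8, dtype=np.uint32) * 4        # 0, 4, ..., 28
    nibbles = (packed[:, :, None] >> shifts) & 0xF    # [out, in/8, 8]
    return FP4_E2M1[nibbles].reshape(out_dim, in_div8 * 8)
```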

Conversion challenges

  1. Nibble unpacking: fp4_e2m1 to float32 (MLX has an fp4_e2m1 operator; see the sketch under "MXFP4 format" above)
  2. Scale conversion: E4M3 (MLX) to E8M0 (expected by GGML) -- see the first sketch after this list
  3. Block structure: MLX keeps scales in a separate per-32-column tensor, while GGML's MXFP4 type interleaves one scale byte with each 32-element block -- see the repacking sketch after this list
  4. Tensor name remapping: MLX prefixes tensors with language_model.model, the BF16 checkpoint with model.language_model (a one-line remap is included in the repacking sketch)
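
For challenge 2, a hedged sketch under two assumptions: the E4M3 scale bytes follow the standard FP8 bit layout (4 exponent bits, 3 mantissa bits, sign bit fixed to zero) and GGML's E8M0 is the usual 2**(byte - 127) power-of-two encoding. Since E8M0 has no mantissa, rounding is lossy unless the dropped (1 + m/8) factor is folded back into the fp4 weights:

```python
import numpy as np

def e4m3_to_float32(scales: np.ndarray) -> np.ndarray:
    """Decode unsigned E4M3 bytes (bias 7) to float32 scales."""
    e = (scales >> 3) & 0xF                 # 4 exponent bits
    m = (scales & 0x7).astype(np.float64)   # 3 mantissa bits
    normal = np.ldexp(1.0 + m / 8.0, e.astype(np.int32) - 7)
    subnormal = np.ldexp(m / 8.0, -6)       # e == 0: 2**(1-7) * m/8
    return np.where(e > 0, normal, subnormal).astype(np.float32)

def float32_to_e8m0(scales: np.ndarray) -> np.ndarray:
    """Round scales to the nearest power of two, encoded as 2**(byte - 127).

    Lossy for any E4M3 scale with a nonzero mantissa; the dropped
    (1 + m/8) factor would have to be multiplied into the block's
    weights to keep the dequantized product exact.
    """
    exp = np.round(np.log2(np.maximum(scales, 2.0**-127))).astype(np.int32)
    return np.clip(exp + 127, 0, 254).astype(np.uint8)  # 0xFF reserved
```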

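For challenges 3 and 4, a repacking sketch that assumes llama.cpp's block_mxfp4 layout of one E8M0 scale byte followed by 16 nibble bytes per 32 elements, with elements j and j+16 sharing byte j as in ggml's other 4-bit types -- the exact nibble split should be confirmed against ggml-quants.c before trusting this:

```python
import numpy as np

def pack_mxfp4_row(codes: np.ndarray, e8m0: np.ndarray) -> bytes:
    """Repack one row of fp4 codes (uint8 indices 0..15, shape [in]) and
    its per-block E8M0 scale bytes (shape [in // 32]) into ggml-style
    blocks: 1 scale byte + 16 nibble bytes per 32 elements.
    """
    blocks = codes.reshape(-1, 32)
    lo, hi = blocks[:, :16], blocks[:, 16:]   # assumed low/high nibble split
    qs = (lo | (hi << 4)).astype(np.uint8)    # [n_blocks, 16]
    # Interleave: scale byte first, then the 16 packed bytes, per block.
    return np.concatenate([e8m0[:, None], qs], axis=1).tobytes()

# Hypothetical one-line remap for challenge 4 (prefixes from the issue text):
def remap_name(mlx_name: str) -> str:
    return mlx_name.replace("language_model.model.", "model.language_model.", 1)
```

Writing the result out still depends on gguf-py accepting pre-quantized MXFP4 data as a raw dtype; worth verifying before building on this.
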
Reference

  • Detailed analysis: ANALYSIS_QWEN3_5_MXFP4.md
  • MLX quantized tensor format: ~/.omlx/models/Qwen3.6-27B-mxfp4/

Priority

Low -- blocked until kernel performance issues are resolved.

sleepy added the feature label 2026-04-30 18:11:37 +02:00