Implement MXFP4 GGUF converter #37

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments

Goal

Convert Qwen3.6-27B MXFP4 weights to GGUF format for use with llama.cpp.

MXFP4 format

  • Weights: dtype=U32, shape [out, in/8] -- 8 fp4_e2m1 nibbles per uint32
  • Scales: dtype=U8, shape [out, in/32] -- E4M3 unsigned with bias=7
  • Non-quantized: layernorm, conv1d, dt_bias (BF16)
  • Total model size: roughly 14.9 GB
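
A minimal sketch of decoding this layout, assuming numpy, first-element-in-the-lowest-nibble packing (an assumption, not verified against the MLX layout), and the nominal fp4_e2m1 value table; `unpack_fp4` is a hypothetical helper, not an existing API:

```python
import numpy as np

# Nominal fp4_e2m1 values: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
FP4_E2M1 = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Unpack a [out, in/8] uint32 tensor into [out, in] float32.

    Assumes the first element sits in the lowest nibble; flip the
    shift order if the MLX packing turns out to be the reverse.
    """
    out_dim, in_div8 = packed.shape
    shifts = np.arange(8, dtype=np.uint32) * 4        # 0, 4, ..., 28
    nibbles = (packed[:, :, None] >> shifts) & 0xF    # [out, in/8, 8]
    return FP4_E2M1[nibbles].reshape(out_dim, in_div8 * 8)
```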

Conversion challenges

  1. Nibble unpacking: fp4_e2m1 to float32 (MLX has an fp4_e2m1 operator; see the sketch under "MXFP4 format" above)
  2. Scale conversion: E4M3 (MLX) to E8M0 (expected by GGML) -- see the first sketch after this list
  3. Block structure: MLX keeps scales in a separate per-32-column tensor, while GGML's MXFP4 type interleaves one scale byte with each 32-element block -- see the repacking sketch after this list
  4. Tensor name remapping: MLX prefixes tensors with language_model.model, the BF16 checkpoint with model.language_model (a one-line remap is included in the repacking sketch)
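
For challenge 2, a hedged sketch under two assumptions: the E4M3 scale bytes follow the standard FP8 bit layout (4 exponent bits, 3 mantissa bits, sign bit fixed to zero) and GGML's E8M0 is the usual 2**(byte - 127) power-of-two encoding. Since E8M0 has no mantissa, rounding is lossy unless the dropped (1 + m/8) factor is folded back into the fp4 weights:

```python
import numpy as np

def e4m3_to_float32(scales: np.ndarray) -> np.ndarray:
    """Decode unsigned E4M3 bytes (bias 7) to float32 scales."""
    e = (scales >> 3) & 0xF                 # 4 exponent bits
    m = (scales & 0x7).astype(np.float64)   # 3 mantissa bits
    normal = np.ldexp(1.0 + m / 8.0, e.astype(np.int32) - 7)
    subnormal = np.ldexp(m / 8.0, -6)       # e == 0: 2**(1-7) * m/8
    return np.where(e > 0, normal, subnormal).astype(np.float32)

def float32_to_e8m0(scales: np.ndarray) -> np.ndarray:
    """Round scales to the nearest power of two, encoded as 2**(byte - 127).

    Lossy for any E4M3 scale with a nonzero mantissa; the dropped
    (1 + m/8) factor would have to be multiplied into the block's
    weights to keep the dequantized product exact.
    """
    exp = np.round(np.log2(np.maximum(scales, 2.0**-127))).astype(np.int32)
    return np.clip(exp + 127, 0, 254).astype(np.uint8)  # 0xFF reserved
```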

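For challenges 3 and 4, a repacking sketch that assumes llama.cpp's block_mxfp4 layout of one E8M0 scale byte followed by 16 nibble bytes per 32 elements, with elements j and j+16 sharing byte j as in ggml's other 4-bit types -- the exact nibble split should be confirmed against ggml-quants.c before trusting this:

```python
import numpy as np

def pack_mxfp4_row(codes: np.ndarray, e8m0: np.ndarray) -> bytes:
    """Repack one row of fp4 codes (uint8 indices 0..15, shape [in]) and
    its per-block E8M0 scale bytes (shape [in // 32]) into ggml-style
    blocks: 1 scale byte + 16 nibble bytes per 32 elements.
    """
    blocks = codes.reshape(-1, 32)
    lo, hi = blocks[:, :16], blocks[:, 16:]   # assumed low/high nibble split
    qs = (lo | (hi << 4)).astype(np.uint8)    # [n_blocks, 16]
    # Interleave: scale byte first, then the 16 packed bytes, per block.
    return np.concatenate([e8m0[:, None], qs], axis=1).tobytes()

# Hypothetical one-line remap for challenge 4 (prefixes from the issue text):
def remap_name(mlx_name: str) -> str:
    return mlx_name.replace("language_model.model.", "model.language_model.", 1)
```

Writing the result out still depends on gguf-py accepting pre-quantized MXFP4 data as a raw dtype; worth verifying before building on this.
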
Reference

  • Detailed analysis: ANALYSIS_QWEN3_5_MXFP4.md
  • MLX quantized tensor format: ~/.omlx/models/Qwen3.6-27B-mxfp4/

Priority

Low -- blocked until kernel performance issues are resolved.

sleepy added the feature label 2026-04-30 18:11:37 +02:00