Profile graph fusion effectiveness #34

Open
opened 2026-04-30 18:11:36 +02:00 by sleepy · 0 comments
Owner

Problem

GGML_METAL_FUSION_DISABLE=1 has negligible impact on tok/s. Graph debug shows some ops being fused, but the benefit appears minimal.

Data (9B Q4_0)

| Config | tg128 (tok/s) |
|--------|--------------|
| Fusion ON (default) | 53.83 |
| Fusion OFF | 53.13 |

Fusion gains only 0.70 tok/s over the unfused run (about 1.3%).
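For reference, the arithmetic behind that figure can be reproduced with a few lines of Python. This is only a sketch using the two numbers from the table; the per-token latency view is an added convenience, not something measured separately.

```python
# Measured tg128 throughput from the table above (tok/s).
fusion_on = 53.83
fusion_off = 53.13

# Absolute and relative gain, using the unfused run as the baseline.
delta = fusion_on - fusion_off
gain_pct = 100.0 * delta / fusion_off

# Same data expressed as per-token latency in milliseconds.
ms_on = 1000.0 / fusion_on
ms_off = 1000.0 / fusion_off
saved_ms = ms_off - ms_on

print(f"delta = {delta:.2f} tok/s, gain = {gain_pct:.1f}%")
print(f"per-token latency: {ms_off:.3f} ms -> {ms_on:.3f} ms (saves {saved_ms:.3f} ms)")
```

Seen this way, fusion saves roughly a quarter of a millisecond per generated token, which gives a concrete budget to compare against per-kernel timings in a GPU trace.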

Questions

  • Which ops are being fused? Use GGML_METAL_GRAPH_DEBUG=1 to check
  • Are fused ops compute-bound or memory-bound? If runtime is dominated by memory-bound weight reads, fusing the surrounding element-wise ops helps less
  • Can more aggressive fusion patterns be added?
  • MUL_MAT ops (249 per tick) are the heaviest -- can they be fused with subsequent element-wise ops?
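A rough back-of-envelope model may help frame the memory-bound question. The sketch below estimates what fraction of memory traffic is saved by fusing one unary element-wise op into a batch-size-1 MUL_MAT with Q4_0 weights. The layer dimensions are hypothetical placeholders (not taken from the 9B model), activations are assumed to be F32, and kernel launch overhead and caching are ignored, so treat the result as an order-of-magnitude hint only.

```python
# Q4_0 stores 32 weights per block: a 2-byte fp16 scale plus 16 bytes of 4-bit quants.
Q4_0_BYTES_PER_WEIGHT = 18 / 32
F32_BYTES = 4  # assumption: activations are read and written as F32


def fusion_traffic_saving(n_in: int, n_out: int) -> float:
    """Fraction of memory traffic saved by fusing MUL_MAT with one unary op (batch size 1)."""
    weight_bytes = n_in * n_out * Q4_0_BYTES_PER_WEIGHT
    input_bytes = n_in * F32_BYTES
    output_bytes = n_out * F32_BYTES

    # Unfused: MUL_MAT writes the intermediate vector, then the element-wise
    # kernel reads it back and writes the final result.
    unfused = weight_bytes + input_bytes + 3 * output_bytes
    # Fused: the intermediate stays in registers; only the final result is written.
    fused = weight_bytes + input_bytes + output_bytes

    return (unfused - fused) / unfused


# Hypothetical layer shape, chosen only for illustration.
saving = fusion_traffic_saving(n_in=4096, n_out=4096)
print(f"traffic saved by fusion: {100 * saving:.2f}%")
```

Under these assumptions the weight reads dominate and the saved traffic is well below 1% of the total, which would be consistent with the small tok/s difference above. Confirming this still needs real per-kernel timings.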

Approach

  1. Capture graph with GGML_METAL_GRAPH_DEBUG=1 and grep for fuse markers
  2. Compare fused vs unfused runs in an Xcode GPU trace (.gputrace capture)
  3. Identify candidate fusion patterns (e.g., MUL_MAT + ADD + UNARY into a single kernel)
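Step 1 could be scripted along these lines. This is a minimal sketch: the exact format of the graph debug output is not shown in this issue, so the marker pattern (`fus`, matched case-insensitively) and the one-node-per-line layout are assumptions to adjust after looking at a real log.

```python
import re
from collections import Counter
from typing import Iterable


def count_fuse_markers(lines: Iterable[str], marker: str = r"fus") -> Counter:
    """Count log lines in total and those matching a fuse marker.

    `marker` is a hypothetical regex; replace it with whatever string the
    debug output actually prints for fused nodes.
    """
    pattern = re.compile(marker, re.IGNORECASE)
    counts: Counter = Counter()
    for line in lines:
        counts["total"] += 1
        if pattern.search(line):
            counts["fused"] += 1
    return counts
```

To use it, redirect the debug output of a run to a file, open that file, and pass the file object to `count_fuse_markers`. Comparing the `fused` count against the 249 MUL_MAT ops per tick gives a first idea of how much of the graph fusion currently touches.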
sleepy added the profiling label 2026-04-30 18:11:36 +02:00

Reference: sleepy/llama.cpp#34