[cleanup] Gate profiling prints behind debug flag (#44) #54

Merged
sleepy merged 1 commit from refactor/44-remove-profiling-prints into main 2026-05-20 20:10:43 +02:00
Owner

Summary

Gates decode_profile and argmax_compare prints (and their macOS-specific mach_absolute_time / mach_timebase_info calls) behind @import("builtin").mode == .Debug.

  • Preserves the CLI "Decode: X tokens in Ys = Z tok/s" timing in engine.zig
  • Removes profiling noise from release builds
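A minimal sketch of the gating pattern described above, not the PR's actual diff. The helper name `reportDecodeProfile` and the message format are hypothetical; only `@import("builtin").mode` and `std.debug.print` come from the description. Because the flag is known at compile time, the print should not be emitted in release builds.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Comptime-known flag: true only when building in Debug mode.
const profiling_enabled = builtin.mode == .Debug;

// Hypothetical helper (not from the PR): prints profiling output in Debug
// builds only. In release modes the branch is comptime-false and is dropped.
fn reportDecodeProfile(tokens: usize, elapsed_ns: u64) void {
    if (profiling_enabled) {
        std.debug.print("decode_profile: {d} tokens in {d} ns\n", .{ tokens, elapsed_ns });
    }
}

pub fn main() void {
    // Prints in a Debug build; silent with -Doptimize=ReleaseFast.
    reportDecodeProfile(4, 1_000);
}
```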

Test Results

  • zig build compiles cleanly

Benchmarks

No measurable performance impact expected: release builds no longer execute the profiling timing calls.

Author
Owner

CHANGES_REQUESTED

  1. Dead GPU work in release builds: dispatch.set_argmax_bf16() is dispatched unconditionally, but the GPU argmax result is never read in non-debug builds. Move the dispatch inside the if (Debug) block to avoid wasting GPU cycles per token in release mode.

  2. Unnecessary code duplication: The copy_bf16_buffer_to_f32 + CPU argmax logic is duplicated in both branches. Restructure so the shared readback and CPU argmax happen once after the conditional debug timing, rather than duplicating ~20 lines of identical logic.

Suggested: gate only the mach_absolute_time calls and std.debug.print lines behind Debug, keep the shared readback/argmax single-path.
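The suggested structure could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `engine.zig`: `argmax` and `pickToken` are hypothetical names, a plain `f32` slice stands in for the logits read back by `copy_bf16_buffer_to_f32`, and `std.time.nanoTimestamp` stands in for the macOS `mach_absolute_time` calls (assuming a Zig version whose standard library still provides it).

```zig
const std = @import("std");
const builtin = @import("builtin");

// Comptime-known flag: true only when building in Debug mode.
const debug_build = builtin.mode == .Debug;

// Hypothetical stand-in for the CPU argmax over the read-back logits.
fn argmax(logits: []const f32) usize {
    var best: usize = 0;
    for (logits, 0..) |v, i| {
        if (v > logits[best]) best = i;
    }
    return best;
}

// Hypothetical per-token selection step with a single shared argmax path.
fn pickToken(logits: []const f32) usize {
    // Debug-only: start the timer. A debug-only GPU argmax dispatch would
    // also belong in a block like this, so release builds skip it entirely.
    const start: i128 = if (debug_build) std.time.nanoTimestamp() else 0;

    // Shared path: runs once, in every build mode.
    const token = argmax(logits);

    // Debug-only: report the timing.
    if (debug_build) {
        const elapsed = std.time.nanoTimestamp() - start;
        std.debug.print("argmax_compare: token={d} elapsed_ns={d}\n", .{ token, elapsed });
    }
    return token;
}

test "pickToken returns the index of the largest logit" {
    try std.testing.expectEqual(@as(usize, 2), pickToken(&[_]f32{ 0.1, 0.5, 0.9, 0.3 }));
}
```

Running `zig test` on this file exercises the shared path. The design point is that only the timing and printing sit behind the flag, so there is one copy of the readback and argmax logic to maintain.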

sleepy force-pushed refactor/44-remove-profiling-prints from 103f6d5c2a to e11677ecdc 2026-05-20 20:09:54 +02:00 Compare
sleepy merged commit 953928c4d4 into main 2026-05-20 20:10:43 +02:00
sleepy deleted branch refactor/44-remove-profiling-prints 2026-05-20 20:10:43 +02:00
Reference
sleepy/sleepy-llm!54