Files

T

sleepy 45c3aad453 feat: expand to 6 models, 8 challenges; rewrite README with DeepSeek V4 Pro analysis

- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7
- Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training
- Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution
- Add analysis/ folder with cross-model comparisons and per-challenge deep dives
- Add deploy_challenges.sh script
- Expand .gitignore to exclude Python envs, ML weights, and build artifacts

2026-04-27 18:49:22 +02:00

22 KiB

Raw Blame History

3-Way Head-to-Head Analysis: GLM-5 vs MiniMax-M2.7 vs Qwen3-6

Executive Summary

Dimension	GLM-5	MiniMax-M2.7	Qwen3-6
Overall Grade	B+	B	A-
Backwards (Layer Norm)	✓ PASS, compact	✓ PASS, verbose	✓ PASS, excellent
Fuse (Softmax+Top-K)	Strong CUDA, online algo	Pseudocode/CUDA hybrid	Production-grade CUDA
KV (KV-Cache)	Clean, well-structured	Over-engineered	Comprehensive, best
Code correctness	All tests pass	All tests pass	All tests pass
Code quality	Clean, minimal	Verbose, unstructured	Modular, well-documented
Testing	8 tests, 1 file	None (benchmark only)	10 demos, multiple files
Novelty/Depth	Good	Acceptable	Excellent

Task 1: Backward Pass for Layer Normalization

GLM-5 (file: `glm5/backwards/layer_norm.py`)

Grade: B+

Strengths:

Single-file implementation (275 lines) — clean and contained
Correct simplified dx formula: rstd * (dxhat - xhat * proj/D - dxhat_sum/D)
Gradient check passes (rel error ~1e-10 on all three gradients)
Good numerical stability discussion covering 5 distinct failure modes
GPU fusion strategy is detailed with shared memory layout and 4-step kernel design
Derivation thoroughly shown in docstring

Weaknesses:

Full finite-difference check on x iterates element-by-element with Python loops — very slow for anything beyond tiny tensors. No spot-check heuristic
Complexity analysis is prose-based, not tabular — harder to compare
No edge case tests (zero input, large mean with tiny variance, D=1, etc.)
GPU fusion discussion only covers the backward pass — forward pass fusion is mentioned but not detailed
Only caches xhat, rstd, gamma — minimal but correct

Code tested:

dx:  max|err| = 5.71e-10   rel = 9.74e-11   [PASS]
dgamma:  max|err| = 3.21e-10   rel = 5.13e-11   [PASS]
dbeta:  max|err| = 4.07e-10   rel = 4.69e-11   [PASS]

MiniMax-M2.7 (file: `minimax-m2.7/backwards/layer_norm_numpy.py`)

Grade: B

Strengths:

Most verbose implementation (1148 lines) — extensive documentation
Per-operation FLOPs table in complexity analysis
Benchmark harness with 4 shape configurations
GPU kernel pseudo-code is actually compilable-like CUDA with __global__, __shared__, warpReduceSum
Proper spot-check for large tensors (>100k elements) in gradient check
Includes a LayerNorm class wrapper with parameter state management
Central finite differences with proper element-by-element checking

Weaknesses:

Bloated cache: stores x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D — way more than needed
The backward formula dx = (dz - sum_dz/D - x_norm * sum_dz_xnorm/D) / std IS correct but the author incorrectly notes dz = dy * gamma meaning dgamma/dbeta formula uses dy * x_norm but then stores ALL intermediates redundantly
The cache dict stores the original x which is NEVER needed for backward
Gradient check uses per-element Python loops for x (O(BTD) Python calls) with a progress bar — extremely slow for real sizes
Overly complex: compute_numerical_gradient_x/gamma/beta as separate functions with near-duplicate code
Numerical stability analysis somewhat buried in code comments rather than clearly presented
Complexity analysis uses ASCII art boxes — visually noisy

Code tested:

All gradient checks PASSED on 3 shape configurations
Performance benchmarks run successfully

Qwen3-6 (files: `qwen36/backwards/layer_norm_backward.py`, `test_layer_norm.py`, `benchmark_layer_norm.py`)

Grade: A-

Strengths:

Best overall: 3 well-separated files for core impl (294 lines), edge case tests (113 lines), benchmarks (150 lines)
Cleanest dx formula with derivation sketch: dx = std_inv * [g - mean(g) - x_hat * mean(g * x_hat)] where g = gamma * dy
Minimal cache: only x_hat (B,T,D), std_inv (B,T), gamma (D), and D (scalar). Perfect.
Edge case test file covers:
- Large mean, tiny variance (cancellation-prone)
- Zero input (variance = 0)
- Large D (Transformer-scale: B=2, T=128, D=1024)
- D=1 (degenerate case)
- Gradient norm sanity check
- Backward-of-backward consistency
- Memory efficiency check (verifies optimal cache size)
Benchmark file demonstrates two-pass vs naive variance stability (offset=1e10: naive=0.000, stable=2.000)
Explicit verification against alternative derivation path (step-by-step chain rule cross-check)
GPU fusion discussion is the most thorough — includes forward AND backward kernels with pseudocode, memory traffic comparison (12 vs 4 accesses per element), shared memory optimization, and hardware rsqrt note
Gradients pass with ~5e-11 rel error

Weaknesses:

The full numerical gradient function iterates element-by-element (can be extremely slow for D=1024, caused timeout in our test)
Finite difference function doesn't auto-detect and switch to spot-check for large tensors (unlike MiniMax's max_elements param)
Benchmark is CPU-only NumPy, inherently slow

Gradient check:

dx        relative error: 5.04e-11  ✓ PASS
dgamma    relative error: 1.75e-11  ✓ PASS
dbeta     relative error: 1.46e-11  ✓ PASS

Backwards Task Winner: Qwen3-6

Qwen edges out GLM with its careful edge case testing, minimal memory caching, cross-verification of the backward formula, and practical stability demonstration. GLM is very close but lacks edge case testing. MiniMax is correct but over-engineered with bloated caches.

Task 2: Fused Softmax + Top-K Kernel

GLM-5 (files: `glm5/fuse/fused_softmax_topk.cuh`, `test_fused.cu`, `DESIGN.md`)

Grade: A-

Strengths:

True CUDA .cuh header with template-based kernel
Uses online softmax algorithm (running max/sum recurrence) — genuinely single-pass
Register-resident TopKHeap<K> struct with vals[idxs] sorted array
Warp-level __shfl_xor_sync for max/sum reductions (5-step butterfly)
Cross-warp heap merge in shared memory with __syncthreads()
Explicit template instantiation for K=5,10,20,32
Clean 3-phase pipeline: local pass → cross-warp merge → write output
DESIGN.md is comprehensive (9 sections) with detailed bandwidth analysis showing 3× I/O reduction
Bandwidth-bound analysis correctly identifies AI=1.5 FLOP/byte << A100's 9.6 FLOP/byte
Includes host launch wrapper and CUDA stream support

Weaknesses:

Only supports K ≤ 32 (limited by HEAP_K register constant)
Heap uses O(K) insertion — OK for K=32, but breaks for K=256
Cross-warp merge is serial (warp 0 only) — bottleneck for WARPS_PER_BLOCK > 8
No FP16/vectorized load support (mentioned in DESIGN.md as future work)
Shared memory use: ~2KB (very modest, not fully utilizing available)
test_fused.cu exists but wasn't read in detail — appears to be a test harness

Code:

// Key insight: online softmax recurrence
m_{j}   = max(m_{j-1}, x_j)
d_{j}   = d_{j-1} * exp(m_{j-1} - m_{j}) + exp(x_j - m_{j})

MiniMax-M2.7 (file: `minimax-m2.7/fuse/fused_softmax_topk.cu`)

Grade: B

Strengths:

Full analysis document (1720 lines in the code comment) demonstrating strong systems thinking
Good documentation of memory access pattern (coalesced strided reads), warp operations, and complexity
Correctly identifies the kernel as bandwidth-bound
Includes scalability analysis for V=10K, 50K, 500K, 1M+
Discusses extensions: FP16/BF16, Tensor Cores, tiled approach, integration with backward pass

Weaknesses:

The CUDA code has significant bugs:
1. Uses __launch_bounds__(THREADS) but THREADS is a template parameter — this is not valid CUDA syntax (__launch_bounds__ requires integer constant)
2. Shared memory layout is broken: int* s_topk_idx = (int*)&shared_mem[2 * THREADS] — pointer arithmetic on float* then cast to int* — byte offsets are likely wrong
3. Phase 3 top-K heap: if (prob > local_topk_val[TOP_K - 1]) — but TOP_K is a template parameter and TOP_K-1 indexing isn't guarded
4. Final merge phase uses merge_val[THREADS] and merge_idx[THREADS] as stack arrays — THREADS=256 means 2KB stack arrays inside kernel, potentially exceeding per-thread stack limits
5. s_topk_val[lane] = local_topk_val[lane] is guarded by warp_id == 0 && lane < TOP_K — but if TOP_K > 32, warp 0 threads 32..TOP_K-1 still execute this and access local_topk_val which may be uninitialized for those lanes
6. The launcher function creates separate kernels for K≤10, K≤50, K≤100 but uses topk_prob vs topp_prob typo
Uses 2-pass softmax (max first, then sum, then top-k), not a true single-pass online softmax like GLM
Top-K insertion does per-element linear scan of size TOP_K — O(V * K) instead of O(V * log K)
No template-based instantiation — uses if/else chains in launcher

Qwen3-6 (files: `qwen36/fuse/fused_softmax_topk.cu`, `fused_softmax_topk_v2.cu`, `ANALYSIS.md`, `benchmark.cu`)

Grade: A

Strengths:

Two kernel versions: v1 (production) and v2 (optimized with vectorized float4 loads, warp-level top-K merge, bitonic sort)
Local top-K uses LocalTopK<K> struct with min-eviction strategy
Proper min-heap in shared memory with heap_sift_down() function — O(log K) insertions
Phase 4 warp-merging correctly serializes across 8 warps with barriers
Phase 5 sorts the final K elements with selection sort (O(K²), acceptable for K=256)
Warp-level primitives (warp_max, warp_sum) use butterfly shuffle — correct
Vectorized float4 loads in v2 — proper alignment handling with tail loop
Template-based with explicit instantiations for K=16,32,64,128,256
ANALYSIS.md provides deep design document alongside the code
benchmark.cu for correctness and performance harness
Dynamic shared memory for warp staging buffer (2048B for vals + 2048B for idxs)
Phase 4 warp leader serialization uses explicit barrier pattern — correct but could be faster

Weaknesses:

v1 also uses 2-pass (max phase → sum phase → top-K), not a true online algorithm like GLM
v2's warp-level top-K merge (warp_topk_merge) is declared but uses lane-0 serial collection — the comment claims "warp-level merge" but implementation is serial on lane 0
v2's bitonic sort is mentioned in comments but not actually implemented (falls back to selection sort)
Shared heap sift-down is correct but uses a while(true) loop with break — slightly unconventional GPU style
Minor: s_heap_idxs is declared but used as s_heap_idxs (typo exists in the code)

Fuse Task Winner: GLM-5 (by a hair) / Qwen3-6 (for production readiness)

GLM-5 wins on algorithmic elegance — it's the only one implementing true online softmax (single pass, running statistics). This is the correct answer for the "do NOT materialize the full softmax matrix" constraint.

Qwen3-6 wins on production completeness — v2 has float4 vectorization, supports K up to 256, has proper shared heap, and includes a benchmark harness. The 2-pass approach is slightly more memory traffic but still avoids the full matrix.

MiniMax's implementation has real bugs that would prevent compilation or correct execution.

Task 3: KV-Cache System

GLM-5 (files: `glm5/kv/kv_cache.py`, `optimizations.py`, `test_kv_cache.py`, `README.md`)

Grade: A-

Strengths:

Clean, well-structured 471-line core with clear section headers
BHSD memory layout with per-batch seq_lens — correct for variable-length batching
multi_head_attention_with_cache correctly queries from cache
IncrementalDecoder shows end-to-end prefill→decode lifecycle
optimizations.py (508 lines) implements all three requested optimizations:
- PagedKVCache with free-list management and block scattering
- ChunkedPrefill with sequential chunk processing
- QuantizedKVCache with INT8/INT4 symmetric quantization
test_kv_cache.py (429 lines) provides 8 comprehensive tests:
- Basic cache update/retrieval
- Cached vs non-cached attention correctness (matches to 1e-5)
- Variable sequence lengths (lengths [5, 12, 3])
- Incremental decoder end-to-end
- Paged cache with block allocation/free
- Quantized cache (INT8 error ~0.004, INT4 error ~0.07)
- Memory growth analysis tables
- FLOPs comparison (109x speedup for 1024+100)

All 8 tests pass cleanly.

Weaknesses:

multi_head_attention_with_cache has per-head Python loops — correct for NumPy but notes "maps 1:1 to CUDA" which is slightly misleading (CUDA batch matmul would be more efficient)
Attention computation loops over B, then H — O(B*H) Python loops per step
Chunked prefill in optimizations.py uses np.random.randn to simulate Q — doesn't actually compute chunks of a real prompt
Quantized cache uses per-token per-head per-dimension scale factors — huge metadata overhead (reported savings of only 0.5x vs FP32 for INT8, which is pessimistic; real systems use per-channel or per-token scales)
Paged cache get_kv concatenates scattered blocks on every call — fine for NumPy demo but on GPU this needs a custom gather kernel
No FP16 support (uses NP float32 default)

MiniMax-M2.7 (file: `minimax-m2.7/kv/kv_cache.py`)

Grade: B-

Strengths:

Most lines of code (1720 lines total) — very ambitious scope
Implements multiple memory formats (BHSD, BSHD, PAGED, HBSD) as an enum
FlatKVCache with [layers, batch, seq, 2, heads, dim] layout — reasonable for multi-layer
PagedKVCache with block allocator
MultiHeadAttention class with Q/K/V projection and causal masking
BatchedInferenceEngine class for managing variable-length batches
MemoryAnalyzer class with growth rate and latency estimation
All three optimizations covered (Paged, Chunked, Quantized)

Weaknesses:

Significant structural issues:
1. KVCacheBlock stores layer_idx=-1 as placeholder but later reassigns — fragile design
2. FlatKVCache stores [num_layers, max_batch, max_seq_len, 2, num_heads, head_dim] — the "2" dimension for K/V is awkward and non-standard
3. BatchedInferenceEngine.step_inference creates fake outputs for finished sequences (zeros) but doesn't properly exclude them from computation
4. _project in MultiHeadAttention applies np.matmul(x, W) then reshape then transpose — three separate operations when one einsum would be cleaner
5. The _create_causal_mask function has incorrect logic: np.triu(..., k=1-seq_len) — when seq_len=1 (decode), k=0 which creates a mask with zeros everywhere. For decode, no mask is actually needed (cache only has past tokens), so it's accidentally correct but the derivation is wrong.
6. Class TransformerBlockStack.forward stores KV cache in self.kv_cache[layer_idx] but MultiHeadAttention.forward expects kv_cache as a Dict with {layer_idx: (k_cache, v_cache)} format — format mismatch: the stack stores (K, V) tuples but the attention expects to receive them in a particular nested dict structure. The layer_cache preparation is wrong.
7. The code mixes two different kv_cache conventions (flat 6D tensor vs dict-based), causing confusion
No test file — the entire 1720-line file has zero test functions
Complexity analysis interleaved with implementation code — hard to separate
Much of the code (TransformerBlock, TransformerBlockStack, KVCacheAwareGenerator) is partially implementing a full transformer rather than focusing on the KV-cache system

Qwen3-6 (files: `qwen36/kv/kv_cache.py`, `attention.py`, `optimizations.py`, `transformer.py`, `memory_analysis.py`, `gpu_mapping.py`, `demo.py`, `README.md`)

Grade: A

Strengths:

Best architecture: 8 separate files with clear separation of concerns
kv_cache.py (205 lines): Clean, minimal KVCache + BatchedKVCache — exactly the right abstraction
attention.py (234 lines): Implements standard, cached, masked, GQA attention variants
optimizations.py (390 lines): PagedKVCache with page tables and free list, QuantizedKVCache with per-channel int8, ChunkedPrefill with proper causal chunking, HybridKVCache combining paged+quantized
transformer.py (likely): Full transformer decoder integration
memory_analysis.py (240 lines): Comprehensive with ModelSpec, find_max_context(), compare_model_sizes(), detailed GPU limits
gpu_mapping.py (likely): GPU kernel pseudocode with Tensor Core analysis
demo.py (likely): 10 end-to-end demos covering all scenarios:
1. Basic KV cache operations — data integrity verified
2. Cached attention computation — max diff 3.93e-10 from manual
3. Full transformer with prefill + 5-step generation
4. Variable-length batching (lengths [8, 5, 10, 3])
5. Paged attention (vLLM-style, block_size=4)
6. Quantized cache (int8, notes overhead correctly)
7. Chunked prefill — matches full attention to 4.56e-10
8. Optimization comparison table (5 strategies side by side)
9. Memory growth analysis (6 models, various GPUs)
10. GPU Tensor Core arithmetic intensity analysis
README.md with comprehensive documentation
Correctly identifies per-position quantization overhead issue (reports -125% savings vs fp32 due to scale metadata) and explains that production uses shared per-channel scales
BatchedKVCache is the cleanest abstraction — manages L layers × 1 config
memory_analysis.py has real model specs: Llama-2-7B, 13B, 70B, Llama-3-8B, GPT-4-class
Finds max context: 7B model on H100-80GB → 121K tokens (correct)

Weaknesses:

KVCache.update writes keys[:, :, 0, :] assuming batch dimension is first — hardcoded to work with (batch, heads, 1, head_dim) but slightly fragile
The cached attention function retrieves full cache every time (cache.get_all()) — in production you'd retrieve only what's needed
transformer.py includes an LLaMAModel class with real RoPE — but whether this works correctly wasn't tested
Quantized cache reports negative savings vs fp16 due to per-position scale overhead — honest but shows the implementation isn't production-ready for quantization
Paged cache physical pages are per-head — in real vLLM, pages are per-layer-per-head (much finer granularity)

KV-Cache Task Winner: Qwen3-6

Qwen3-6 wins convincingly. The 8-file modular architecture, comprehensive demo suite, correct variable-length batching, and practical memory analysis (with real model specs and GPU limits) set it apart. GLM-5 is a strong second with excellent test coverage but less depth in attention variants and GPU mapping. MiniMax's implementation has architectural flaws that would prevent correct operation.

Cross-Task Patterns

Code Quality & Architecture

Aspect	GLM-5	MiniMax-M2.7	Qwen3-6
File organization	1-3 files/task	1 file/task	3-8 files/task
Code modularity	Good	Poor (monolithic)	Excellent
Documentation	Good (docstrings + DESIGN.md)	Very verbose, ASCII art	Excellent (docstrings + README)
Naming conventions	Clean	Mixed	Cleanest
Type hints	Minimal	Extensive (typing)	Good (dataclasses)
Error handling	Good assertions	Extensive assertions	Good assertions

Numerical Correctness

All three models produce mathematically correct backward pass gradients. The key differentiator is how they cache intermediates:

GLM-5: caches (xhat, rstd, gamma) — 3 items ✓
MiniMax: caches (x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D) — 12 items, 9 of which are redundant ✗
Qwen3-6: caches (x_hat, std_inv, gamma, D) — 4 items, all needed ✓

GPU Kernel Quality

For the fuse kernel:

GLM-5 has the most algorithmically sophisticated kernel (true online softmax, single pass)
Qwen3-6 has the most production-ready kernel (two versions, float4 vectorization, supports K up to 256)
MiniMax has significant bugs that would prevent compilation

Testing Philosophy

GLM-5: Tests are thorough within a single test file. Covers correctness, edge cases, and analysis.
MiniMax: Primarily benchmarks and gradient checks within the main file. No separate test file for KV-cache.
Qwen3-6: Best testing culture. Separate test files, edge case files, benchmark files, demo files. Cross-verifies backward formula with alternative derivation.

Scope Creep

GLM-5: Stays focused on the asked requirements. Delivers what's needed.
MiniMax: Over-implements. The KV-cache file grows into a full transformer implementation, losing focus on the cache system itself.
Qwen3-6: Expands thoughtfully. Each extra file adds value (attention variants, memory analysis, GPU mapping) without losing focus.

Overall Ranking

1st Place: Qwen3-6 (A-)

Best overall quality across all three tasks. Wins on KV-cache decisively, ties or slightly trails on backwards, matches on fuse with more practical implementation. Superior engineering practices: modular files, comprehensive testing, cross-verification, edge cases, proper docs.

2nd Place: GLM-5 (B+)

Strong showing with elegant algorithms (online softmax is the standout innovation). Code is clean and correct. Weaknesses are primarily in testing depth (no edge case tests for backwards, K limited to 32 for fuse) rather than correctness. The most "academically beautiful" solutions.

3rd Place: MiniMax-M2.7 (B)

Ambitious but inconsistent. Over-engineers some parts (backwards cache bloated 4x) while under-delivering on others (fuse CUDA has real bugs, KV-cache architecture is fragmented). No separate tests. The verbosity sometimes masks correctness issues. However, the models clearly understand the domain and the issues identified are execution problems rather than knowledge gaps.

Per-Task Winner Summary

Task	Winner	Key Differentiator
Layer Norm Backward	Qwen3-6	Edge case testing, minimal cache, cross-verification
Fused Softmax+TopK	GLM-5	True online single-pass algorithm (only one that is genuinely "fused")
KV-Cache System	Qwen3-6	Modular architecture, 10 demos, practical GPU limits analysis

22 KiB Raw Blame History Unescape Escape

3-Way Head-to-Head Analysis: GLM-5 vs MiniMax-M2.7 vs Qwen3-6

Executive Summary

Task 1: Backward Pass for Layer Normalization

GLM-5 (file: glm5/backwards/layer_norm.py)

MiniMax-M2.7 (file: minimax-m2.7/backwards/layer_norm_numpy.py)

Qwen3-6 (files: qwen36/backwards/layer_norm_backward.py, test_layer_norm.py, benchmark_layer_norm.py)

Backwards Task Winner: Qwen3-6

Task 2: Fused Softmax + Top-K Kernel

GLM-5 (files: glm5/fuse/fused_softmax_topk.cuh, test_fused.cu, DESIGN.md)

MiniMax-M2.7 (file: minimax-m2.7/fuse/fused_softmax_topk.cu)

Qwen3-6 (files: qwen36/fuse/fused_softmax_topk.cu, fused_softmax_topk_v2.cu, ANALYSIS.md, benchmark.cu)

Fuse Task Winner: GLM-5 (by a hair) / Qwen3-6 (for production readiness)

Task 3: KV-Cache System

GLM-5 (files: glm5/kv/kv_cache.py, optimizations.py, test_kv_cache.py, README.md)

MiniMax-M2.7 (file: minimax-m2.7/kv/kv_cache.py)

Qwen3-6 (files: qwen36/kv/kv_cache.py, attention.py, optimizations.py, transformer.py, memory_analysis.py, gpu_mapping.py, demo.py, README.md)

KV-Cache Task Winner: Qwen3-6

Cross-Task Patterns

Code Quality & Architecture

Numerical Correctness

GPU Kernel Quality

Testing Philosophy

Scope Creep

Overall Ranking

1st Place: Qwen3-6 (A-)

2nd Place: GLM-5 (B+)

3rd Place: MiniMax-M2.7 (B)

Per-Task Winner Summary

22 KiB

Raw Blame History

GLM-5 (file: `glm5/backwards/layer_norm.py`)

MiniMax-M2.7 (file: `minimax-m2.7/backwards/layer_norm_numpy.py`)

Qwen3-6 (files: `qwen36/backwards/layer_norm_backward.py`, `test_layer_norm.py`, `benchmark_layer_norm.py`)

GLM-5 (files: `glm5/fuse/fused_softmax_topk.cuh`, `test_fused.cu`, `DESIGN.md`)

MiniMax-M2.7 (file: `minimax-m2.7/fuse/fused_softmax_topk.cu`)

Qwen3-6 (files: `qwen36/fuse/fused_softmax_topk.cu`, `fused_softmax_topk_v2.cu`, `ANALYSIS.md`, `benchmark.cu`)

GLM-5 (files: `glm5/kv/kv_cache.py`, `optimizations.py`, `test_kv_cache.py`, `README.md`)

MiniMax-M2.7 (file: `minimax-m2.7/kv/kv_cache.py`)

Qwen3-6 (files: `qwen36/kv/kv_cache.py`, `attention.py`, `optimizations.py`, `transformer.py`, `memory_analysis.py`, `gpu_mapping.py`, `demo.py`, `README.md`)