# Driver/Stack Misalignment Analysis
## Overview
This document catalogs the specific code locations, design decisions, and architectural mismatches that cause Intel Arc GPUs to underperform on LLM inference.
---
## 1. llama.cpp SYCL Backend Misalignments
### 1.1 Kernel Dispatch Logic
**File:** `repos/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp`
**Lines:** ~3258-3660
**Current Dispatch Algorithm:**
```
mul_mat dispatch prefers:
1. MMVQ (reorder path) if src0 type in ggml_sycl_supports_reorder_mmvq()
2. DMMV (reorder path) if src0 type in ggml_sycl_supports_reorder_dmmv()
3. SYCL native matmul as fallback
```
**Support Lists (lines ~3269-3300):**
```cpp
// Supports MMVQ reorder
ggml_sycl_supports_reorder_mmvq(): Q4_0, Q8_0, Q4_K, Q6_K
// Supports DMMV reorder
ggml_sycl_supports_reorder_dmmv(): Q4_0, Q8_0 ONLY
// Supports SYCL matmul reorder
ggml_sycl_supports_reorder_mul_mat_sycl(): Q4_0, Q8_0, Q4_K*, Q6_K*
(* = !g_ggml_sycl_prioritize_dmmv)
```
**Problem:** Q4_K and Q6_K support MMVQ reorder but NOT DMMV reorder. When dispatch conditions favor DMMV, these quants fall through to the slow generic path.
### 1.2 DMMV Kernel iter_stride Problem
**File:** `repos/llama.cpp/ggml/src/ggml-sycl/dmmv.cpp`
**Lines:** ~975-1100 (dequantize_mul_mat_vec_q8_0_sycl)
**Generic DMMV (used by Q8_0):**
```cpp
iter_stride = 2 * GGML_SYCL_DMMV_X = 64 // processes 2 values per iteration
```
**Reorder DMMV (Q4_0 path):**
```cpp
iter_stride = 8 * 2 * GGML_SYCL_DMMV_X = 512 // processes 16 values per iteration
```
**Root Cause:** Q8_0's 34-byte blocks (32 quant bytes + 2-byte scale) resist the power-of-2 stride optimization that the reorder layout achieves for Q4_0's 18-byte blocks (16 quant bytes + 2-byte scale) by separating scales from quants.
### 1.3 Missing Q8_0 Reorder Implementation
**File:** `repos/llama.cpp/ggml/src/ggml-sycl/mmvq.cpp`
**Lines:** ~682-730
**Q4_0 Reorder Kernel:**
```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q4_0>>
```
**Q8_0 Reorder Kernel (added by PR #21527):**
```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>>
```
**Note:** PR #21527 adds Q8_0 to reorder framework. Without this fix, Q8_0 defaults to slow DMMV path.
### 1.4 Q4_K DMMV Reorder Gap
**Problem:** Q4_K has reorder structure (`reorder_qw_q4_k()`) but DMMV path doesn't use it.
**Current State:**
- Q4_K MMVQ reorder: ✅ Working
- Q4_K DMMV reorder: ❌ Not implemented
**Impact:** When DMMV is prioritized (GGML_SYCL_PRIORITIZE_DMMV=1), Q4_K gets no optimization.
---
## 2. llama.cpp Vulkan Backend Misalignments
### 2.1 Cooperative Matrix Detection
**File:** `repos/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp`
**Lines:** ~343, ~15972
**Detection Logic:**
```cpp
// Step 1: Architecture classification
if (subgroup_size_control_props.minSubgroupSize == 16) {
    return vk_device_architecture::INTEL_XE2;
}
// Falls through to OTHER for 140T (minSubgroupSize = 8)

// Step 2: Coopmat support check
case VK_VENDOR_ID_INTEL:
    return arch == vk_device_architecture::INTEL_XE2;
    // Returns false for OTHER
```
**Problem:** Arc 140T (Arrow Lake H) reports minSubgroupSize=8 despite having Xe2 architecture and full coopmat support.
### 2.2 DP4A/DPAS Utilization Gap
**Current State:**
- Vulkan backend has DP4A instruction support
- Matrix multiplication (matmul) path doesn't use DPAS
- Only Flash Attention path partially uses coopmat
**Missing:**
- Q4_K, Q8_0 quantized matmul via DPAS
- Subgroup-level parallelism for token generation
---
## 3. IPEX-LLM vs llama.cpp Gap
### 3.1 Performance Comparison
| Aspect | IPEX-LLM | llama.cpp SYCL | llama.cpp Vulkan |
|--------|----------|----------------|------------------|
| Q4_0 | Fast | Fast | Medium |
| Q4_K | Fast | Medium | Medium |
| Q8_0 | Fast | Was Broken | Was Broken |
| K-quants on Xe2 | Crashes | Works | Works |
| FlashAttention | Full | Partial | Partial |
| vRAM usage | Lower | Higher | Higher |
### 3.2 Source of Optimization Gap
**IPEX-LLM advantages:**
1. Closed-source optimized kernels (not in llama.cpp)
2. oneDNN GEMM integration
3. Lower-level hardware access
4. syclcompat library for platform-specific tuning
**llama.cpp limitations:**
1. Open-source kernels visible to competitors
2. Generic SYCL must work across all Intel GPUs
3. Can't leverage IPEX's proprietary optimizations
---
## 4. Architecture Detection Mismatches
### 4.1 Xe1 vs Xe2 Detection
**Current Detection:** Uses compute capability (device version)
**Problem:**
- Arc A770 reports compute version 1.3 (Xe1)
- Arc B580 reports compute version 1.6 (Xe2)
- BUT: Same driver branch reports different subgroup sizes (8 vs 16)
### 4.2 Missing Architecture-Specific Tuning
**Current kernels:** Single implementation for all Intel GPUs
**Needed:**
| Feature | Xe1 (Alchemist) | Xe2 (Battlemage) |
|---------|-----------------|------------------|
| L2 cache | 16 MB | Larger |
| Optimal block size | 64 | 128 |
| Prefetch depth | 2 | 4 |
| Vector width | 8 | 16 |
---
## 5. Quantization Format Support Matrix
### 5.1 Current Support State
| Format | DMMV Reorder | MMVQ Reorder | SYCL Matmul | Vulkan | Notes |
|--------|--------------|--------------|-------------|--------|-------|
| Q4_0 | ✅ | ✅ | ✅ | ✅ | Fully optimized |
| Q4_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_0 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q8_0 | ✅* | ✅* | ✅ | ✅ | *Fixed by PR #21527 |
| Q4_K | ❌ | ✅ | ✅* | ✅ | *Prioritize DMMV breaks |
| Q5_K | ❌ | ❌ | ❌ | ❌ | No reorder support |
| Q6_K | ❌ | ✅ | ✅* | ✅ | *Prioritize DMMV breaks |
| IQ4_NL | ❌ | ❌ | ✅ | ❌ | 14% bandwidth, crashes |
| IQ4_XS | ❌ | ❌ | ✅ | ✅ | Not optimized |
### 5.2 Block Size Analysis
| Format | Block Size | Power of 2? | Cache Line Aligned? |
|--------|-----------|-------------|---------------------|
| Q4_0 | 18 bytes | No | Partial |
| Q4_K | 54 bytes | No | No |
| Q5_K | 62 bytes | No | No |
| Q6_K | 66 bytes | No | No |
| Q8_0 | 34 bytes | No | No |
| IQ4_NL | 16 bytes | Yes | Yes |
**Hypothesis:** Block layouts that reduce to power-of-2 strides enable efficient memory access patterns: IQ4_NL's 16-byte blocks are naturally aligned, and Q4_0 becomes power-of-2-friendly once the reorder path separates its 2-byte scales from the 16 quant bytes. Formats whose blocks stay at odd sizes suffer.
---
## 6. Key File Locations Summary
### Core Problem Areas:
```
repos/llama.cpp/ggml/src/ggml-sycl/
├── ggml-sycl.cpp
│   ├── Line 219: GGML_SYCL_PRIORITIZE_DMMV env var
│   ├── Lines 3258-3260: mul_mat_algo enum (DMMV, MMVQ, SYCL)
│   ├── Lines 3269-3292: ggml_sycl_supports_reorder_*() functions
│   ├── Lines 3549-3650: dispatch logic with fallback chains
│   └── Problem: routing logic doesn't handle Q4_K/Q6_K correctly
├── dmmv.cpp
│   ├── ~975-1100: dequantize_mul_mat_vec_q8_0_sycl()
│   ├── iter_stride = 64 (generic path)
│   └── Problem: 8x less work per iteration than reorder path
├── mmvq.cpp
│   ├── ~550-570: Q4_0 reorder kernel
│   ├── ~695-720: Q8_0 reorder kernel (after PR #21527)
│   ├── ~1100-1200: Q4_K kernel (no DMMV support)
│   └── Problem: missing Q5_K, Q6_K reorder
└── vecdotq.hpp
    ├── ~844: vec_dot_q8_0_q8_1 implementation
    └── Problem: memory coalescing suboptimal for Xe2

repos/llama.cpp/ggml/src/ggml-vulkan/
└── ggml-vulkan.cpp
    ├── ~343: get_device_architecture() classification
    ├── ~15972: coopmat support check
    └── Problem: minSubgroupSize = 8 causes 140T misdetection

repos/ipex-llm/ (archived Jan 2026)
├── Closed-source optimized kernels (not in upstream)
├── syclcompat library
├── oneDNN integration
└── Problem: archive status, no community maintenance
```
---
## 7. Misalignment Summary Table
| Component | Expected | Actual | Impact |
|-----------|----------|--------|--------|
| Q8_0 DMMV | 16 values/iter (reorder) | 2 values/iter (generic) | 4x slower |
| Q4_K DMMV | Reorder enabled | Not implemented | 40% slower |
| Q5_K MMVQ | Reorder support | Missing | 3x slower |
| Arc 140T detection | INTEL_XE2 | OTHER | Coopmat disabled |
| Q8_0 on B70 | 60% BW | 21% BW | 3x slower |
---
*Last Updated: April 2026*