train_300m.py smoke test has input_ids/labels length mismatch #17

Open
opened 2026-05-09 19:22:08 +02:00 by sleepy · 0 comments
Owner

Problem

The smoke test's TinyDataset (scripts/train_300m.py:107) yields:

yield {
    "input_ids": torch.tensor(seq, dtype=torch.long),       # len=256
    "labels": torch.tensor(seq[1:] + [0], dtype=torch.long)  # len=256 (255+1)
}

This is inconsistent with data.py, which uses input_ids == labels (same tensor, same length) and relies on the loss function's logits[:, :-1] vs labels[:, 1:] shift for teacher forcing.

The smoke test's seq[1:] + [0] shifts the labels before they reach the loss. If the smoke test goes through the same loss, the targets are shifted twice: the logit at position t is scored against seq[t+2] instead of seq[t+1], and the last scored position gets the padded 0 (BOS/pad) as its target. The smoke test therefore computes its loss against different targets than the main data pipeline.
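The effect of pre-shifting the labels can be seen with plain Python lists. This is a minimal sketch that assumes the smoke test goes through the same loss shift as the main pipeline (logits[:, :-1] against labels[:, 1:]); `shifted_targets` is a hypothetical helper for illustration, not a function from the repository.

```python
def shifted_targets(labels):
    """Pair each scored position with its target, as a
    logits[:, :-1] vs labels[:, 1:] loss would.

    The logit at position t is scored against labels[t + 1].
    """
    return [(t, labels[t + 1]) for t in range(len(labels) - 1)]


seq = [10, 11, 12, 13, 14]  # toy token ids; the real smoke test uses length 256

# data.py pattern: labels hold the same tokens as input_ids.
production = shifted_targets(seq)

# Smoke-test pattern: labels are already shifted by one and padded with 0.
smoke = shifted_targets(seq[1:] + [0])

print(production)  # [(0, 11), (1, 12), (2, 13), (3, 14)]
print(smoke)       # [(0, 12), (1, 13), (2, 14), (3, 0)]
```

With unshifted labels, position 0 (token 10) is asked to predict 11, the next token. With pre-shifted labels it is asked to predict 12, two tokens ahead, and the last scored position is asked to predict the padded 0.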

Impact

  • Smoke test validates a slightly different data alignment than production code
  • If the smoke test passes but training fails, it is unclear whether the cause is a model bug or the data difference

Action needed

Make the smoke test use the same input_ids == labels pattern as data.py:

yield {
    "input_ids": torch.tensor(seq, dtype=torch.long),
    "labels": torch.tensor(seq, dtype=torch.long),
}
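One possible way to keep the two code paths from drifting apart again is a small guard in the smoke test that checks each sample before it reaches the model. The sketch below is an assumption about how such a check could look; `as_list` and `assert_unshifted_labels` are hypothetical names, and the helper accepts either plain lists or tensor-like objects that expose `.tolist()`.

```python
def as_list(x):
    """Convert a tensor-like object (anything with .tolist()) or a sequence to a list."""
    return x.tolist() if hasattr(x, "tolist") else list(x)


def assert_unshifted_labels(sample):
    """Raise AssertionError unless labels are an exact copy of input_ids."""
    input_ids = as_list(sample["input_ids"])
    labels = as_list(sample["labels"])
    assert len(input_ids) == len(labels), "input_ids and labels differ in length"
    assert input_ids == labels, "labels differ from input_ids (pre-shifted?)"


seq = [10, 11, 12, 13]
# Matches the data.py pattern, so this passes silently.
assert_unshifted_labels({"input_ids": seq, "labels": list(seq)})
```

A sample built with seq[1:] + [0] has the right length but fails the second assertion, which is the case this issue describes.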

Files

  • scripts/train_300m.py:107
Reference
sleepy/ternary#17