PackedDataset.__iter__ silently drops items when buffer < batch_size * 4 #13

Open
opened 2026-05-09 07:30:22 +02:00 by sleepy · 0 comments
Owner

Problem

PackedDataset.__iter__() (data.py:104-112) only yields items when len(buffer) >= batch_size * 4:

while len(buffer) >= batch_size * 4:
    for item in buffer[:batch_size]:
        yield item
    buffer = buffer[batch_size:]

When the source is exhausted, whatever is left in the buffer (up to batch_size * 4 - 1 items) is below the threshold and is never yielded. Those items are silently dropped, so any streaming dataset that ends loses its final sequences.
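
A minimal reproduction sketch of the behaviour described above. The function `iterate_without_flush` and the way the buffer is filled (one sample appended at a time) are assumptions made for illustration; only the inner `while` loop is taken from this issue, and the real `PackedDataset` may fill its buffer differently.

```python
from typing import Iterable, Iterator, List


def iterate_without_flush(source: Iterable[int], batch_size: int) -> Iterator[int]:
    """Hypothetical stand-in for PackedDataset.__iter__ (fill logic is assumed)."""
    buffer: List[int] = []
    for sample in source:
        buffer.append(sample)
        # Loop quoted in this issue: it drains only at or above the threshold.
        while len(buffer) >= batch_size * 4:
            for item in buffer[:batch_size]:
                yield item
            buffer = buffer[batch_size:]
    # No flush here: whatever is still in `buffer` is never yielded.


yielded = list(iterate_without_flush(range(10), batch_size=2))
print(f"yielded {len(yielded)} of 10 items: {yielded}")
```

With 10 input items and batch_size=2, this sketch yields only the first 4 items; the remaining 6 stay in the buffer and are lost.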

Impact

  • Data loss at the end of each dataset iteration
  • For short datasets (only a few multiples of batch_size long), this can be a meaningful fraction of the data; a dataset with fewer than batch_size * 4 items yields nothing at all
  • No warning or logging when data is dropped
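
To put a rough number on the loss, the sketch below counts how many items would be left behind. It assumes the buffer is filled one item at a time and drained by the loop quoted in this issue; `dropped_count` is a hypothetical helper, not part of the codebase.

```python
def dropped_count(n_items: int, batch_size: int) -> int:
    """Items left in the buffer when the source ends (assumed fill model)."""
    threshold = batch_size * 4
    if n_items < threshold:
        # The threshold is never reached, so nothing is ever yielded.
        return n_items
    # Each drain removes batch_size items, leaving 3 * batch_size behind;
    # any items that arrive after the last drain add to that remainder.
    return 3 * batch_size + n_items % batch_size


for n_items, batch_size in [(10, 2), (100, 8), (1000, 32)]:
    lost = dropped_count(n_items, batch_size)
    print(f"n={n_items}, batch_size={batch_size}: {lost} dropped ({lost / n_items:.1%})")
```

Under these assumptions, once the threshold has been reached at least once, between 3 * batch_size and 4 * batch_size - 1 items are dropped per pass.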

Action needed

  • Yield remaining items in the buffer even if below the threshold, OR
  • Add a final flush step after the loop to yield remaining items
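
One possible shape for the flush option, as a sketch only. `iterate_with_flush` is a hypothetical standalone function; the buffer-filling loop around the quoted `while` loop is assumed, and the actual fix in `PackedDataset.__iter__` may need to differ (for example, if partial batches must be padded or reported).

```python
from typing import Iterable, Iterator, List


def iterate_with_flush(source: Iterable[int], batch_size: int) -> Iterator[int]:
    """Hypothetical variant of the loop in this issue with a final flush."""
    buffer: List[int] = []
    for sample in source:
        buffer.append(sample)
        # Same threshold-based draining as the existing code.
        while len(buffer) >= batch_size * 4:
            yield from buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Final flush: yield the leftover items once the source is exhausted.
    yield from buffer


flushed = list(iterate_with_flush(range(10), batch_size=2))
print(f"yielded {len(flushed)} of 10 items")
```

The flush keeps the buffering behaviour during iteration unchanged and only affects what happens after the source ends, which is why it may be the less invasive of the two options.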

Files

  • tergent/data.py:104-112

Reference
sleepy/ternary#13