Add offline mode for HuggingFace tokenizer and dataset loading #4

Open
opened 2026-05-01 14:01:26 +02:00 by sleepy · 0 comments
sleepy commented 2026-05-01 14:01:26 +02:00 (Migrated from localhost:18431)

Problem

Training fails when no internet connection is available: AutoTokenizer.from_pretrained() and load_dataset() attempt online checks even when the tokenizer and dataset are already cached locally.

Fix

  • Set os.environ["HF_HUB_OFFLINE"] = "1" at module level in fixes/data.py and fixes/train_fixed.py
  • Pass local_files_only=True to AutoTokenizer.from_pretrained()
  • Cached resources available on remote: ~/.cache/huggingface/hub/models--hf-internal-testing--llama-tokenizer/ and datasets--HuggingFaceFW--fineweb-edu/
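
A minimal sketch of the steps above, not the actual contents of fixes/data.py or fixes/train_fixed.py. The helper names load_tokenizer_offline and cache_dir_for are hypothetical, the repo id hf-internal-testing/llama-tokenizer is inferred from the cache directory name, and the cache root assumes the default ~/.cache/huggingface/hub location (it moves if HF_HOME or a similar variable is set). The environment variable is set before transformers is imported, on the understanding that huggingface_hub reads it at import time.

```python
import os
from pathlib import Path

# Set at module level, before transformers/datasets are imported,
# so the Hub client starts in offline mode.
os.environ["HF_HUB_OFFLINE"] = "1"

# Assumption: default cache location, as listed in the issue.
HUB_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def cache_dir_for(repo_id, repo_type="models"):
    """Return the expected cache directory for a repo (hypothetical helper).

    Follows the naming visible in the issue, e.g.
    models--hf-internal-testing--llama-tokenizer.
    """
    return HUB_CACHE / f"{repo_type}--{repo_id.replace('/', '--')}"


def load_tokenizer_offline(repo_id="hf-internal-testing/llama-tokenizer"):
    """Load a tokenizer from the local cache only (hypothetical helper)."""
    # Imported lazily so HF_HUB_OFFLINE is already set when the library loads.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id, local_files_only=True)
```

Checking cache_dir_for(...).exists() for both the tokenizer and the dataset before training starts gives a clearer error than a failed network lookup if a resource is missing on the remote.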

Status

  • [x] fixes/data.py updated
  • [x] fixes/train_fixed.py updated
  • [ ] Verified on remote with --smoke flag
Reference
sleepy/ternary#4