Data-Poor and Disk-Poor: Training on 1TB Datasets with 3GB RAM
Everyone wants to talk about training LLMs, but nobody wants to talk about the logistics of feeding them. If you’re a hobbyist or a researcher without a petabyte of NVMe storage, you’ve likely hit a wall: How do you train on a 1TB dataset (like codeparrot/github-code) when you only need a 50GB subset and your disk is nearly full?
Most people download the whole terabyte, filter it, and delete the rest. That’s a waste of bandwidth and time. This post is about doing it the smart way using Lance.
1. The Requirement
Before we touch a line of code, we need a strategy. To train a model efficiently, our data must satisfy two conditions:
Serialized Tokens: Text/code must be pre-tokenized and stored in a flat, array-like structure. Loading a window of $k+1$ tokens and splitting it into inputs ($x$, the first $k$ tokens) and targets ($y$, the same window shifted by one) should be a simple slice operation.
Random Access without Memory Bloat: We need to access any chunk of data by index without loading the entire 50GB+ file into RAM.
numpy.memmap is the classic choice here, but it’s brittle and lacks high-level structure.
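For context, here is what the classic memmap approach looks like: a minimal sketch assuming the tokens have already been flattened into one long int64 array on disk (the file path and window size are illustrative):

```python
import os
import tempfile
import numpy as np

# Persist a flat token stream to disk (stand-in values for real token IDs).
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
np.arange(1_000, dtype=np.int64).tofile(path)

# Re-open it lazily: the OS pages data in on demand, nothing is loaded up front.
mm = np.memmap(path, dtype=np.int64, mode="r")

# Slice k tokens for x and the one-shifted window for y.
k, start = 8, 100
x = mm[start : start + k]
y = mm[start + 1 : start + k + 1]
```

The brittleness shows up immediately: you must track the dtype, the sample boundaries, and the offsets yourself, because the file on disk is just raw bytes.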
2. Why Lance?
Lance is a columnar data format (from the LanceDB project) built for ML workloads. The killer feature? Zero-copy random access. You can pull specific indices from a massive dataset on disk, and Lance will only load exactly what you asked for. No offset magic required.
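To make "offset magic" concrete, here is a hypothetical hand-rolled version of random access over variable-length samples; this is exactly the bookkeeping Lance does for you internally (all names and values below are illustrative):

```python
# Variable-length token samples flattened into one stream.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
flat = [t for s in samples for t in s]

# Manual offset index: offsets[i] marks where sample i starts in the flat stream.
offsets = [0]
for s in samples:
    offsets.append(offsets[-1] + len(s))

def take(i):
    """Fetch sample i by slicing between its recorded offsets."""
    return flat[offsets[i] : offsets[i + 1]]
```

With Lance, a single `dataset.take([i])` call replaces all of this, and the offsets live inside the file format itself.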
3. The Implementation
Step 1: Streaming and Tokenization
We use the HuggingFace datasets library in streaming mode. This is non-negotiable: with streaming=False, the load_dataset call starts downloading the full 1TB to disk immediately.
import lance
import pyarrow as pa
from tqdm.auto import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# STREAMING is key. Do not download the 1TB.
dataset = load_dataset(
"codeparrot/github-code",
streaming=True,
split="train",
languages=["Python"]
).shuffle(seed=42)
Step 2: Processing as a RecordBatch
We need to turn our stream into a format Lance understands. We’ll use a generator to yield PyArrow RecordBatches. This keeps the memory footprint under 3GB because we only ever hold a small window of samples in RAM.
def process_samples(total_samples=5_000_000):
    for i, sample in enumerate(tqdm(dataset, total=total_samples)):
        if i >= total_samples:
            break
        # Tokenize the 'code' field
        tokens = tokenizer(sample['code'])['input_ids']
        # Yield as a RecordBatch (schema: "value" -> list of int64)
        yield pa.RecordBatch.from_arrays([pa.array([tokens])], names=["value"])
# Define the Arrow schema
schema = pa.schema([pa.field("value", pa.list_(pa.int64()))])
Step 3: Writing the Lance Dataset
Now we pipe that generator into the Lance writer.
# Convert generator to a Reader
reader = pa.RecordBatchReader.from_batches(schema, process_samples())
# Write to disk
lance.write_dataset(reader, "code_parrot_5M_subset.lance", schema=schema)
4. High-Speed Loading
Once the dataset is written, loading it for your training loop is trivial. You pass a list of indices, and Lance retrieves them via its C++/Rust backend.
dataset = lance.dataset("code_parrot_5M_subset.lance")
def load_data(indices):
    """
    Fetches only the specific samples at 'indices' from disk.
    """
    data = dataset.take(indices).to_pylist()
    return [x['value'] for x in data]
# Example: Get the first 10 samples
batch = load_data([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The Verdict
We processed, tokenized, and saved a 5-million-sample subset of a massive dataset in under 70 lines of code. More importantly, we did it without buying a new hard drive or blowing up our RAM.
This is the standard you should aim for: Efficient, reproducible, and cheap.