Data-Poor and Disk-Poor: Training on 1TB Datasets with 3GB RAM


Everyone wants to talk about training LLMs, but nobody wants to talk about the logistics of feeding them. If you’re a hobbyist or a researcher without a petabyte of NVMe storage, you’ve likely hit a wall: How do you train on a 1TB dataset (like codeparrot/github-code) when you only need a 50GB subset and your disk is nearly full?

Most people download the whole terabyte, filter it, and delete the rest. That’s a waste of bandwidth and time. This post is about doing it the smart way using Lance.

1. The Requirement

Before we touch a line of code, we need a strategy. To train a model efficiently, our data must satisfy two conditions:

  1. Serialized Tokens: Text/code must be pre-tokenized and stored in a flat, array-like structure. Loading a window of $k+1$ tokens should be a simple slice: the first $k$ tokens become the input ($x$) and the same window shifted by one becomes the targets ($y$).

  2. Random Access without Memory Bloat: We need to access any chunk of data by index without loading the entire 50GB+ file into RAM. numpy.memmap is the classic choice here, but it’s brittle and lacks high-level structure.
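Requirement 1 is the standard next-token setup. As a minimal sketch in plain NumPy (the token values here are made up):

```python
import numpy as np

# A flat array of token ids, as it would sit on disk (values are made up)
tokens = np.arange(100, dtype=np.int64)

k = 8   # context length
i = 10  # arbitrary start offset

x = tokens[i : i + k]          # input:   positions i .. i+k-1
y = tokens[i + 1 : i + k + 1]  # targets: the same window shifted by one

assert x.shape == y.shape == (k,)
```

With a memmap or Lance-backed array, those two slices are all the "dataloading" a training step needs.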

2. Why Lance?

Lance is a columnar data format written in Rust, optimized for ML. It’s built on Apache Arrow, making it incredibly fast for I/O.

The killer feature? Zero-copy random access. You can pull specific indices from a massive dataset on disk, and Lance will only load exactly what you asked for. No offset magic required.


3. The Implementation

Step 1: Streaming and Tokenization

We use the HuggingFace datasets library in streaming mode. This is non-negotiable: with streaming=False, the library starts downloading the full 1TB dataset immediately.

Python
import lance
import pyarrow as pa
from tqdm.auto import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# STREAMING is key. Do not download the 1TB.
dataset = load_dataset(
    "codeparrot/github-code", 
    streaming=True, 
    split="train", 
    languages=["Python"]
).shuffle(seed=42)

Step 2: Processing as a RecordBatch

We need to turn our stream into a format Lance understands. We’ll use a generator to yield PyArrow RecordBatches. This keeps the memory footprint under 3GB because we only ever hold a small window of samples in RAM.

Python
def process_samples(total_samples=5_000_000, batch_size=1_000):
    buffer = []
    for i, sample in enumerate(tqdm(dataset, total=total_samples)):
        if i >= total_samples:
            break

        # Tokenize the 'code' field
        buffer.append(tokenizer(sample['code'])['input_ids'])

        # Yield a RecordBatch every `batch_size` samples (Schema: "value" -> List of Int64).
        # One batch per sample works too, but batching amortizes the Arrow overhead.
        if len(buffer) == batch_size:
            yield pa.RecordBatch.from_arrays([pa.array(buffer)], names=["value"])
            buffer = []

    # Flush the remainder
    if buffer:
        yield pa.RecordBatch.from_arrays([pa.array(buffer)], names=["value"])

# Define the Arrow schema
schema = pa.schema([pa.field("value", pa.list_(pa.int64()))])

Step 3: Writing the Lance Dataset

Now we pipe that generator into the Lance writer.

Python
# Convert generator to a Reader
reader = pa.RecordBatchReader.from_batches(schema, process_samples())

# Write to disk
lance.write_dataset(reader, "code_parrot_5M_subset.lance", schema=schema)

4. High-Speed Loading

Once the dataset is written, loading it for your training loop is trivial. You pass a list of indices, and Lance retrieves them via its Rust backend.

Python
dataset = lance.dataset("code_parrot_5M_subset.lance")

def load_data(indices):
    """
    Fetches only the specific samples at 'indices' from disk.
    """
    data = dataset.take(indices).to_pylist()
    return [x['value'] for x in data]

# Example: Get the first 10 samples
batch = load_data([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
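In a real training loop you would draw random indices rather than the first ten. A minimal sketch of such a sampler (the `load_fn` and `num_rows` parameters are illustrative names, not part of Lance's API):

```python
import numpy as np

def sample_batch(load_fn, num_rows, batch_size, rng):
    # Draw random row indices, then fetch only those rows from disk.
    indices = rng.integers(0, num_rows, size=batch_size).tolist()
    return load_fn(indices)

# With the Lance dataset above, this would look like:
#   rng = np.random.default_rng(42)
#   batch = sample_batch(load_data, dataset.count_rows(), 32, rng)
```

Because Lance only reads the requested rows, each step touches a few kilobytes of disk instead of the whole 50GB file.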

The Verdict

We processed, tokenized, and saved a 5-million-sample subset of a massive dataset in under 70 lines of code. More importantly, we did it without buying a new hard drive or blowing up our RAM.

This is the standard you should aim for: Efficient, reproducible, and cheap.
