Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator

Overview

This tutorial builds a simple data curation pipeline with NeMo Curator that downloads, processes, and filters the TinyStories dataset. For brevity, it works primarily with the validation file, which contains around 22,000 records.

Defining Custom Document Builders

NeMo Curator provides abstract classes for downloading remote data and extracting text records and metadata, enabling customization for different datasets. Implementations are available for datasets like CommonCrawl, Wikipedia, and arXiv.
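As a rough sketch of what such a custom document builder might look like, the class below downloads the dataset file if it is not already on disk. The class and method names are illustrative, mimicking the downloader role described above rather than reproducing NeMo Curator's actual abstract classes:

```python
import os
import urllib.request


class TinyStoriesDownloader:
    """Illustrative downloader: fetches a dataset file if it is not already on disk."""

    def __init__(self, download_dir: str):
        os.makedirs(download_dir, exist_ok=True)
        self._download_dir = download_dir

    def download(self, url: str) -> str:
        filename = os.path.basename(url)
        output_file = os.path.join(self._download_dir, filename)
        if os.path.exists(output_file):
            # Skip files that were already fetched in an earlier run.
            return output_file
        urllib.request.urlretrieve(url, output_file)
        return output_file
```

Caching the file locally means the pipeline can be re-run without repeating the download.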

Iterating and Extracting Text from the Dataset

Records in the downloaded file are separated by a delimiter token, and each record spans multiple lines. Implement a class that iterates through the file, yields each record's raw text, and attaches optional metadata (such as a unique ID) to identify every record.
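The iteration step can be sketched as a plain generator, assuming the `<|endoftext|>` token as the record delimiter used in the TinyStories files (the function name and ID format are illustrative, not NeMo Curator's API):

```python
from typing import Dict, Iterator, Tuple

# Assumed record delimiter in the TinyStories files.
SEPARATOR = "<|endoftext|>"


def iterate_records(file_path: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, raw_text) pairs, one per multi-line story."""
    example_id = 0
    lines = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == SEPARATOR:
                if lines:
                    # A separator closes the current record; emit it with a unique ID.
                    yield {"id": f"tinystories-{example_id}"}, " ".join(lines)
                    example_id += 1
                    lines = []
            elif line:
                lines.append(line)
    if lines:
        # The final record may not be followed by a separator.
        yield {"id": f"tinystories-{example_id}"}, " ".join(lines)
```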

Writing the Dataset to JSONL Format

NeMo Curator provides helpers for loading datasets from disk in JSONL, Parquet, or Pickle format. To produce JSONL, create a JSON object for each story, store the story text in a designated field, and write each object as a single line of the output file.
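Writing the JSONL file itself needs nothing beyond the standard library. A minimal sketch, assuming records arrive as (metadata, text) pairs and that the story content goes in a `"text"` field (the field name is a common convention, not mandated by the source):

```python
import json
import os


def write_jsonl(records, output_path: str, text_field: str = "text") -> None:
    """Write an iterable of (metadata, text) pairs as one JSON object per line."""
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for metadata, text in records:
            # Each story becomes one self-contained JSON object on its own line.
            doc = {text_field: text, **metadata}
            f.write(json.dumps(doc) + "\n")
```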

Modifying Documents

Define how the text of each document should be modified via the NeMo Curator interface. Multiple modifications can be chained in a list and applied sequentially to each document.
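The pattern of chained modifiers can be sketched in plain Python. The classes below mimic a modifier interface with a single `modify_document` method; the names and the specific modifications (quote unification, Unicode normalization) are illustrative assumptions, not NeMo Curator's actual classes:

```python
import unicodedata


class QuotationUnifier:
    """Replace curly quotation marks with their straight ASCII equivalents."""

    def modify_document(self, text: str) -> str:
        return (text.replace("\u2018", "'").replace("\u2019", "'")
                    .replace("\u201c", '"').replace("\u201d", '"'))


class UnicodeNormalizer:
    """Apply NFC normalization so equivalent characters share one encoding."""

    def modify_document(self, text: str) -> str:
        return unicodedata.normalize("NFC", text)


def apply_modifiers(text: str, modifiers) -> str:
    # Each modifier runs in order over the output of the previous one.
    for m in modifiers:
        text = m.modify_document(text)
    return text
```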

Dataset Filtering

Filtering is crucial during dataset curation to remove documents that don't meet specific criteria. Implement filters that discard documents failing those criteria, such as stories shorter than 80 words or stories that appear incomplete.
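A word-count filter of this kind can be sketched with a score/keep split, where one method computes a per-document score and another decides whether that score passes. The class name and two-method interface are assumptions modeled on the filtering behavior described above:

```python
class WordCountFilter:
    """Keep only documents with at least `min_words` whitespace-separated words."""

    def __init__(self, min_words: int = 80):
        self._min_words = min_words

    def score_document(self, text: str) -> int:
        # The score is simply the document's word count.
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        return score >= self._min_words


def filter_dataset(docs, doc_filter):
    # Retain only documents whose score satisfies the filter's criterion.
    return [d for d in docs
            if doc_filter.keep_document(doc_filter.score_document(d))]
```

Separating scoring from the keep decision makes it easy to inspect score distributions before committing to a threshold.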

Together, these steps show how to assemble custom data curation pipelines with NeMo Curator, covering the diverse data processing needs of LLM training.