Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator

Overview
This tutorial walks through creating a simple data curation pipeline with NeMo Curator that downloads, processes, and filters the TinyStories dataset, using primarily the validation file of around 22,000 records.
Defining Custom Document Builders
NeMo Curator provides abstract classes for downloading remote data and for extracting text records and metadata from it, which you can subclass to support your own datasets. Ready-made implementations are available for sources such as CommonCrawl, Wikipedia, and arXiv.
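As a rough sketch, a custom downloader for TinyStories might look like the following. It assumes the DocumentDownloader base class lives in nemo_curator.download.doc_builder (verify the import path against your installed NeMo Curator version); the download directory and the Hugging Face URL in the usage comment are illustrative and should be checked against the dataset's hosting page.

```python
import os

import requests

from nemo_curator.download.doc_builder import DocumentDownloader


class TinyStoriesDownloader(DocumentDownloader):
    """Downloads the raw TinyStories text file into a local directory."""

    def __init__(self, download_dir: str):
        super().__init__()
        os.makedirs(download_dir, exist_ok=True)
        self._download_dir = download_dir

    def download(self, url: str) -> str:
        # Reuse the file if it has already been downloaded.
        output_file = os.path.join(self._download_dir, os.path.basename(url))
        if os.path.isfile(output_file):
            return output_file

        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with open(output_file, "wb") as fp:
            fp.write(response.content)
        return output_file


# Usage (URL shown for illustration; verify against the dataset's hosting page):
# downloader = TinyStoriesDownloader("./tinystories")
# raw_file = downloader.download(
#     "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStories-valid.txt"
# )
```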
Iterating and Extracting Text from the Dataset
Records in the downloaded file are separated by an <|endoftext|> token, and each record can span multiple lines. Implement a class that iterates through the file, yields each record's raw text, and attaches optional metadata (such as an ID) so every record can be uniquely identified.
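A minimal sketch of such an iterator and extractor follows. It assumes the DocumentIterator and DocumentExtractor base classes in nemo_curator.download.doc_builder and that iterate() and extract() return (metadata, text) pairs; newer NeMo Curator releases may use a slightly different return convention, so check the documentation for your version. The ID scheme is only an example.

```python
import os

from nemo_curator.download.doc_builder import DocumentExtractor, DocumentIterator


class TinyStoriesIterator(DocumentIterator):
    """Yields (metadata, raw_text) pairs, one per story, from the raw file."""

    SEPARATOR = "<|endoftext|>"

    def iterate(self, file_path: str):
        file_name = os.path.basename(file_path)
        example_id = 0
        lines = []
        with open(file_path, "r", encoding="utf-8") as fp:
            for line in fp:
                line = line.strip()
                if line == self.SEPARATOR:
                    if lines:
                        # Attach a unique ID so records can be traced later.
                        yield {"id": f"{file_name}-{example_id}"}, " ".join(lines)
                        example_id += 1
                        lines = []
                elif line:
                    lines.append(line)
        if lines:
            yield {"id": f"{file_name}-{example_id}"}, " ".join(lines)


class TinyStoriesExtractor(DocumentExtractor):
    """Records are already plain text, so extraction is a pass-through."""

    def extract(self, content: str):
        # Return (metadata, text); no extra parsing is needed for this dataset.
        return {}, content
```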
Writing the Dataset to JSONL Format
Create a JSON object for each story and write it as a single line in an output .jsonl file, storing the story content in the object's "text" field. NeMo Curator provides helpers to load such datasets from disk in JSONL, Parquet, or Pickle formats.
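A possible sketch, reusing the hypothetical TinyStoriesIterator from above and assuming DocumentDataset.read_json in nemo_curator.datasets accepts a file path and an add_filename flag (confirm the exact signature in your installed version):

```python
import json

from nemo_curator.datasets import DocumentDataset


def write_jsonl(input_file: str, output_file: str) -> None:
    """Converts the raw TinyStories file into one JSON object per line."""
    iterator = TinyStoriesIterator()  # from the previous sketch
    with open(output_file, "w", encoding="utf-8") as out:
        for metadata, text in iterator.iterate(input_file):
            # Store the story under "text" and keep the ID for traceability.
            record = {"text": text, **metadata}
            out.write(json.dumps(record) + "\n")


# Load the resulting file back as a DocumentDataset for the later steps.
write_jsonl("tinystories/TinyStories-valid.txt", "tinystories/tinystories-valid.jsonl")
dataset = DocumentDataset.read_json("tinystories/tinystories-valid.jsonl", add_filename=True)
```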
Modifying Documents
Define how the text of each document should be modified by implementing the NeMo Curator modifier interface. Several modifications can be chained in a list and applied sequentially to every document.
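A sketch under the assumption that the modifier interface is the DocumentModifier class with a modify_document() method, that Modify and Sequential are importable from the top-level nemo_curator package, and that UnicodeReformatter is one of the bundled modifiers; the QuotationUnifier class is illustrative, not part of the library.

```python
from nemo_curator import Modify, Sequential
from nemo_curator.modifiers import DocumentModifier, UnicodeReformatter


class QuotationUnifier(DocumentModifier):
    """Replaces curly quotation marks with plain ASCII equivalents."""

    def modify_document(self, text: str) -> str:
        return (
            text.replace("\u2018", "'")
            .replace("\u2019", "'")
            .replace("\u201c", '"')
            .replace("\u201d", '"')
        )


# Chain modifications; each one is applied to every document in order.
cleaning_pipeline = Sequential(
    [
        Modify(QuotationUnifier()),
        Modify(UnicodeReformatter()),
    ]
)
cleaned_dataset = cleaning_pipeline(dataset)
```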
Dataset Filtering
Filtering is a crucial step in dataset curation, removing documents that do not meet specific criteria. Implement filters that discard documents that are too short (for example, fewer than 80 words) or that otherwise fail to meet your requirements.
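A hedged sketch of such filtering, assuming a DocumentFilter base class with score_document() and keep_document() methods, a ScoreFilter wrapper, and a built-in WordCountFilter that takes a min_words argument (all subject to your NeMo Curator version); CompleteStoryFilter is an illustrative custom filter, not a library class.

```python
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.filters import DocumentFilter, WordCountFilter


class CompleteStoryFilter(DocumentFilter):
    """Keeps only stories that end with terminating punctuation."""

    def score_document(self, text: str) -> bool:
        # The "score" here is simply whether the story looks complete.
        return text.strip().endswith((".", "!", "?", '"'))

    def keep_document(self, score: bool) -> bool:
        return score


filtering_pipeline = Sequential(
    [
        # Discard stories with fewer than 80 words.
        ScoreFilter(WordCountFilter(min_words=80), text_field="text"),
        # Discard stories that appear to be cut off mid-sentence.
        ScoreFilter(CompleteStoryFilter(), text_field="text"),
    ]
)
filtered_dataset = filtering_pipeline(cleaned_dataset)
```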
This overview outlines the building blocks for creating custom data curation pipelines with NeMo Curator, which can be adapted to diverse data processing needs for LLM training.