Curating Non-English Datasets for LLM Training with NVIDIA NeMo Curator

Data Curation Pipeline for a Multilingual Dataset
In this section, we will walk through the data curation pipeline for the Thai Wikipedia dataset. The steps involved are:
1. Download the Thai Wikipedia Dataset
- Download the Thai Wikipedia dump from the Wikimedia archive and extract it to JSONL files using NeMo Curator's download pipeline.
2. Language Separation
- Perform language separation to retain only the Thai documents in the dataset.
3. Document Modification
- Apply a predefined modifier to the Thai subset from the language separation output.
4. Exact Deduplication with GPU Acceleration
- Utilize GPU-accelerated exact deduplication to remove identical documents efficiently.
Downloading the Thai Wikipedia Dataset
NeMo Curator's download pipeline fetches the raw Wikipedia dump, extracts the articles, and writes them out as JSONL shards, as shown in the sketch below.
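The following is a minimal sketch of this step, assuming a recent NeMo Curator release where `download_wikipedia` and `get_client` are importable as shown; the output directory and dump date are placeholders to adjust for your environment.

```python
from nemo_curator.download import download_wikipedia
from nemo_curator.utils.distributed_utils import get_client

# Start a local Dask client; the CPU cluster type is sufficient for the
# download stage (the exact get_client signature can vary across versions).
client = get_client(cluster_type="cpu")

# Download the Thai ("th") Wikipedia dump and extract each article into
# JSONL shards under the output directory. The dump date is a placeholder;
# use one listed at https://dumps.wikimedia.org/thwiki/.
thai_wikipedia = download_wikipedia(
    "./thai_wikipedia",
    language="th",
    dump_date="20240201",
)

# Peek at a few extracted records.
print(thai_wikipedia.df.head())
```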
Language Separation
Run language identification to classify each document and split the dataset by language; as a sanity check, print a document that was classified as English, as shown in the sketch below.
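This sketch assumes the fastText `lid.176.bin` language-identification model has been downloaded separately, and that `separate_by_metadata` writes one subdirectory per detected language code (e.g. `TH`, `EN`); paths and field names are illustrative.

```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
from nemo_curator.utils.file_utils import (
    get_all_files_paths_under,
    separate_by_metadata,
)

# Load the JSONL shards produced by the download step.
files = get_all_files_paths_under("./thai_wikipedia")
dataset = DocumentDataset.read_json(files, add_filename=True)

# Annotate each document with a fastText language prediction.
# lid.176.bin must be downloaded separately from
# https://fasttext.cc/docs/en/language-identification.html
lang_id = ScoreFilter(
    FastTextLangId("./lid.176.bin"),
    score_field="language",
    score_type="object",
)
dataset = lang_id(dataset)

# The filter stores a [confidence, language_code] pair; keep only the
# code so that documents can be grouped by language.
dataset.df["language"] = dataset.df["language"].apply(
    lambda pair: pair[1], meta=("language", "object")
)

# Write one subdirectory per detected language (e.g. TH, EN).
separate_by_metadata(dataset.df, "./language_separated", "language").compute()

# Sanity check: print one document that was classified as English.
en_files = get_all_files_paths_under("./language_separated/EN")
en_docs = DocumentDataset.read_json(en_files)
print(en_docs.df.head(1))
```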
Document Modification
Apply a predefined modifier to the Thai subset produced by the language-separation step, using NeMo Curator's `Modify` interface (see the sketch below).
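The sketch below assumes the predefined modifier in question is NeMo Curator's `UnicodeReformatter`, an ftfy-based Unicode fixer; the input path mirrors the language-separation output above and is illustrative.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.file_utils import get_all_files_paths_under

# Load the Thai subset produced by the language-separation step.
# The "TH" directory name assumes fastText's upper-cased language codes.
thai_files = get_all_files_paths_under("./language_separated/TH")
thai_dataset = DocumentDataset.read_json(thai_files, add_filename=True)

# UnicodeReformatter repairs mojibake and other Unicode artifacts that
# are common in web-scraped text.
cleaner = Modify(UnicodeReformatter())
cleaned_dataset = cleaner(thai_dataset)

# Persist the cleaned documents, preserving the original shard filenames.
cleaned_dataset.to_json("./thai_cleaned", write_to_filename=True)
```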
Exact Deduplication with GPU Acceleration
Use GPU-accelerated exact deduplication to remove identical documents from the dataset efficiently; the snippet below identifies the groups of duplicate documents and shows how to inspect and drop them.
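This sketch assumes a GPU Dask cluster backed by cuDF and that each record carries a unique `id` column (NeMo Curator's `AddId` module can create one if not). `ExactDuplicates` hashes the full text of each document, so only byte-for-byte identical documents are grouped; the removal logic keeps the first document from each group.

```python
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.file_utils import get_all_files_paths_under

# Start a GPU Dask cluster so hashing runs on cuDF dataframes.
client = get_client(cluster_type="gpu")

# Load the cleaned Thai documents with the cuDF backend.
thai_files = get_all_files_paths_under("./thai_cleaned")
dataset = DocumentDataset.read_json(thai_files, backend="cudf")

# Hash the "text" field with md5 and group identical hashes.
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
)
duplicates = exact_dedup(dataset)

# Inspect a few records that belong to duplicate groups.
print(duplicates.df.head())

# Keep the first document of each identical-hash group and drop the rest.
docs_to_remove = duplicates.df.map_partitions(
    lambda part: part[part._hashes.duplicated(keep="first")]
)
deduped_df = dataset.df[~dataset.df["id"].isin(docs_to_remove["id"].compute())]
deduped_dataset = DocumentDataset(deduped_df)
```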
Conclusion
Following this data curation pipeline produces a cleaned, deduplicated Thai Wikipedia dataset that is well suited for training language models.