Curating Non-English Datasets for LLM Training with NVIDIA NeMo Curator


Data Curation Pipeline for Multilingual Dataset

In this section, we walk through a data curation pipeline for the Thai Wikipedia dataset. The pipeline consists of four steps:

1. Download Thai Wikipedia Dataset

  • Download the Thai Wikipedia dump from the Wikimedia archives and extract it to JSONL using NeMo Curator's downloading pipeline.

2. Language Separation

  • Perform language separation to retain only the Thai documents in the dataset.

3. Document Modification

  • Apply a predefined modifier to the Thai subset from the language separation output.

4. Exact Deduplication with GPU Acceleration

  • Utilize GPU-accelerated exact deduplication to remove identical documents efficiently.

Downloading Thai Wikipedia Dataset

First, download the Thai Wikipedia dataset and extract it to JSONL using NeMo Curator's downloading pipeline.
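Below is a minimal sketch using NeMo Curator's download_wikipedia helper. The output directory and the dump_date value are illustrative; choose a snapshot date that actually exists on the Wikimedia dump server, and note that the exact signature may differ slightly across NeMo Curator versions.

```python
from nemo_curator.download import download_wikipedia

# Download the Thai Wikipedia dump and extract it to JSONL shards.
# language="th" selects the Thai edition; dump_date pins a specific snapshot
# (illustrative value; pick one available at dumps.wikimedia.org).
thai_wikipedia = download_wikipedia(
    "./thai_wikipedia",   # output directory for the extracted JSONL files
    language="th",
    dump_date="20240201",
)
```

The returned DocumentDataset wraps a Dask DataFrame with one record per article (fields such as text, title, id, and url), which the later steps consume.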

Language Separation

Next, run language identification to classify each document and split the dataset by detected language, keeping the Thai subset for the remaining steps. As a sanity check, the snippet below also prints a document that the classifier labels as English, showing the kind of content that language separation removes.
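The sketch below follows NeMo Curator's language identification pattern: a ScoreFilter wrapping FastTextLangId scores each document, the detected language code is extracted, and separate_by_metadata writes one subdirectory per language. The fastText model file (lid.176.bin) is assumed to have been downloaded separately from fasttext.cc, and all paths are illustrative.

```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
from nemo_curator.utils.file_utils import separate_by_metadata

dataset = DocumentDataset.read_json("./thai_wikipedia", add_filename=True)

# Score every document with the fastText language-ID model. The score field
# holds [confidence, language_code] pairs; keep only the language code.
langid = ScoreFilter(
    FastTextLangId("./lid.176.bin"),  # path to the pretrained fastText LID model
    score_field="language",
    score_type="object",
)
identified = langid(dataset)
identified.df["language"] = identified.df["language"].apply(
    lambda score: score[1], meta=("language", "object")
)

# Write one subdirectory per detected language, e.g. ./separated/TH, ./separated/EN.
separate_by_metadata(
    identified.df, "./separated", metadata_field="language"
).compute()

# Sanity check: inspect one document the model labeled as English.
english = DocumentDataset.read_json("./separated/EN").df.head(1)
print(english["text"].iloc[0])
```

Documents that fall below the filter's confidence threshold are dropped, so the separated output contains only confidently identified documents.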

Document Modification

Next, clean the Thai subset produced by the language separation step by applying one of NeMo Curator's predefined document modifiers, as sketched below.
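A minimal sketch, assuming the predefined modifier in question is UnicodeReformatter, an ftfy-based cleaner that repairs mojibake and other Unicode errors that non-Latin-script corpora are especially prone to; any other DocumentModifier can be swapped into Modify the same way.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers import UnicodeReformatter

# Load the Thai subset produced by the language separation step.
thai_dataset = DocumentDataset.read_json("./separated/TH", add_filename=True)

# Modify wraps a DocumentModifier and rewrites the text field of every document.
cleaner = Modify(UnicodeReformatter())
cleaned = cleaner(thai_dataset)

cleaned.to_json("./thai_cleaned", write_to_filename=True)
```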

Exact Deduplication with GPU Acceleration

Finally, run exact deduplication with GPU acceleration to remove identical documents efficiently. Exact deduplication hashes each document's text and treats documents with identical hashes as copies of one another; the snippet below identifies the duplicates so you can inspect them before removal.
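The sketch below assumes a CUDA-capable GPU with RAPIDS cuDF installed. A Dask GPU cluster is started via get_client, the dataset is loaded with the cudf backend, and ExactDuplicates hashes each document's text with MD5, returning the documents whose hashes collide. Field names and paths are illustrative.

```python
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client

# Start a Dask cluster backed by GPU workers so computation runs on cuDF.
client = get_client(cluster_type="gpu")

# Load the cleaned Thai subset onto the GPU.
dataset = DocumentDataset.read_json("./thai_cleaned", backend="cudf")

dedup = ExactDuplicates(
    id_field="id",      # unique document identifier column
    text_field="text",  # column to hash
    hash_method="md5",
)

# Returns the documents whose text hashes collide, i.e. exact duplicates.
duplicates = dedup(dataset)
print(duplicates.df.head())
```

To actually remove them, drop all but one document per hash group, for example by grouping the returned IDs on their hash and filtering the original dataset against that ID list.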

Conclusion

Following this pipeline of downloading, language separation, cleaning, and deduplication yields a high-quality Thai Wikipedia dataset ready for training language models.