NVIDIA Technical Blog

Selecting Large Language Model Customization Techniques


Enterprises often require custom models to tailor language processing capabilities to their specific use cases and domain knowledge. NVIDIA NeMo is an end-to-end framework for building, customizing, and deploying generative AI models anywhere. When selecting a customization technique for a large language model (LLM), consider the trade-offs among dataset size, training effort, and downstream task accuracy. The following techniques span that spectrum:

  1. Parameter-efficient fine-tuning (PEFT): This technique introduces a small number of additional parameters or layers to an existing LLM architecture and trains only those with use-case-specific data, leaving the base model's weights frozen. PEFT provides higher accuracy than prompt engineering and prompt learning but requires more training data and compute resources.

  2. Few-shot prompting: This technique adds a few example prompt-and-completion pairs before the actual prompt. From these examples, the LLM infers the desired task and response format for new, unseen prompts. Few-shot prompting is an efficient way to customize an LLM without any additional training.

  3. Chain-of-thought reasoning: Similar to how humans decompose large problems into smaller ones, chain-of-thought reasoning is a prompt engineering technique that improves LLM performance on multi-step tasks. The prompt instructs the model to produce intermediate reasoning steps before giving its final answer, typically by including a worked example or an explicit instruction to reason step by step.

  4. Prompt learning: Prompt learning is a customization method that adapts pretrained LLMs to various downstream tasks without tuning the model's full set of parameters. Instead, a small set of virtual prompt embeddings is optimized with gradient descent while the model's weights stay fixed, making the technique efficient and adaptable.

  5. Prompt tuning: Prompt tuning keeps the pretrained LLM frozen and learns soft prompt embeddings, initialized as a 2D matrix sized by the number of virtual tokens and the embedding dimension. Each task learns its own soft prompt; tasks do not share parameters during training or inference. After training completes, the learned virtual token embeddings are stored in a prompt table alongside any p-tuned soft prompts and retrieved by task name at inference. This technique reduces the number of trainable parameters while maintaining strong performance on downstream tasks.

  6. Adapter learning: Adapter learning introduces small feed-forward layers between the layers of the core transformer architecture. Only the adapters are trained for a given task, allowing easy customization of the LLM without compromising the pretrained model's overall performance.
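One widely used PEFT method (item 1) is low-rank adaptation (LoRA), which the text above does not spell out; as an illustrative sketch with toy dimensions, the frozen weight matrix W is augmented with a trainable low-rank product BA:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and low rank (toy values)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    """Effective weight is W + B @ A; only A and B are trained."""
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
y = lora_forward(x)
print(y.shape)  # (1, 8)
```

Because B starts at zero, the adapted layer initially reproduces the frozen model's output exactly; training only A and B shifts it toward the task.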
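Few-shot prompting (item 2) needs no training code at all; a minimal sketch is plain string assembly, where the translation pairs below are invented examples:

```python
examples = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
]
query = "Translate to French: apple"

# Prepend sample prompt/completion pairs so the model can infer the task
# and response format from context alone.
prompt = "\n".join(f"{p}\n{c}" for p, c in examples) + f"\n{query}\n"
print(prompt)
```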
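Chain-of-thought reasoning (item 3) is likewise a prompting pattern. One common variant prepends a worked, step-by-step exemplar to the new question; the arithmetic below is a made-up illustration:

```python
cot_example = (
    "Q: A cafe had 23 apples, used 20, then bought 6 more. How many now?\n"
    "A: Start with 23. 23 - 20 = 3. 3 + 6 = 9. The answer is 9.\n"
)
question = "Q: Ben has 5 pens and buys 2 packs of 3. How many pens?\nA:"

# The exemplar's explicit intermediate steps nudge the model to reason
# through the new question rather than jump straight to an answer.
prompt = cot_example + "\n" + question
print(prompt)
```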
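The soft-prompt mechanics behind items 4 and 5 can be sketched in a few lines: a trainable 2D matrix of virtual-token embeddings is prepended to the frozen input embeddings. The sizes here are toy values, not real model dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_virtual_tokens, seq_len, hidden = 4, 6, 8  # toy sizes

# Trainable soft prompt: a 2D matrix with one row per virtual token.
soft_prompt = rng.normal(size=(num_virtual_tokens, hidden))

# Frozen token embeddings for a tokenized input sequence.
input_embeds = rng.normal(size=(seq_len, hidden))

# Virtual tokens are prepended; only soft_prompt would receive gradients.
model_input = np.concatenate([soft_prompt, input_embeds], axis=0)
print(model_input.shape)  # (10, 8)
```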
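Adapter learning (item 6) inserts small bottleneck feed-forward blocks between transformer layers; a minimal sketch of one adapter (down-project, nonlinearity, up-project, residual add), with toy sizes, looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 8, 2  # toy sizes; real adapters might use 1024 -> 64

W_down = rng.normal(size=(hidden, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, hidden))  # zero-init so the adapter starts as identity

def adapter(h):
    """Small feed-forward block with a residual connection."""
    z = np.maximum(h @ W_down, 0.0)  # down-project + ReLU
    return h + z @ W_up              # up-project + residual add

h = rng.normal(size=(3, hidden))
out = adapter(h)
print(out.shape)  # (3, 8)
```

The zero-initialized up-projection means an untrained adapter passes activations through unchanged, which is why inserting adapters does not disturb the pretrained model's behavior before fine-tuning.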

By considering these customization techniques, enterprises can select the most suitable approach for their specific needs and strike a balance between data requirements, training effort, and downstream task accuracy.