LLM Benchmarking: Fundamental Concepts

Introduction
- NVIDIA provides GenAI-Perf, an open-source benchmarking tool for evaluating LLM performance.
- Load testing and performance benchmarking are distinct but complementary methods for assessing an LLM deployment: load testing stresses the system with many concurrent requests, while performance benchmarking measures latency and throughput under controlled conditions.
Stages of an LLM Application
- Prompt: the user composes a query and the client sends the request.
- Queuing: the request waits in a queue until the inference engine is ready to process it.
- Prefill: the model processes the prompt tokens and produces the first output token.
- Generation: the model emits output tokens one at a time until a stop condition is met (modeled in the sketch after this list).
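
As a rough illustration, the four stages can be modeled as per-request durations, from which time to first token and end-to-end latency fall out. This is a minimal sketch with made-up numbers; the class and field names are hypothetical, and real stage boundaries come from the serving stack's instrumentation.

```python
# Minimal sketch: an LLM request's lifecycle as stage durations.
# All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class RequestStages:
    queue_s: float     # waiting until the inference engine picks up the request
    prefill_s: float   # processing the prompt tokens, up to the first output token
    generate_s: float  # emitting the remaining output tokens one by one

    @property
    def ttft(self) -> float:
        # Server-side time to first token: queuing plus prefill
        # (a client-side measurement would add network latency).
        return self.queue_s + self.prefill_s

    @property
    def total_latency(self) -> float:
        return self.queue_s + self.prefill_s + self.generate_s

stages = RequestStages(queue_s=0.05, prefill_s=0.35, generate_s=1.60)
print(f"TTFT = {stages.ttft:.2f}s, end-to-end = {stages.total_latency:.2f}s")
```
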
AI Token and Context Length
- The token is a concept specific to LLMs: it is the basic unit of text the model processes, and most inference performance metrics are expressed in tokens.
- Context length is the number of tokens the model uses at each generation step, covering both the input prompt and the output generated so far (see the tokenization sketch below).
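
For a concrete sense of what a token is, the sketch below counts tokens with the open-source tiktoken tokenizer. This assumes tiktoken is installed; model families ship different tokenizers, so counts vary across models.

```python
# Minimal sketch: counting tokens with a tokenizer.
# Assumes the `tiktoken` package (pip install tiktoken); other models
# use other tokenizers, so token counts differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain the difference between prefill and generation."
prompt_tokens = enc.encode(prompt)
print(f"prompt is {len(prompt_tokens)} tokens")

# Context length at a given generation step = prompt tokens plus all
# output tokens generated so far; it must stay within the model limit.
```
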
LLM Inference Metrics
- Common metrics include time to first token (TTFT) and intertoken latency (ITL).
- Measured from the client, TTFT includes queuing time, prefill time, and network latency.
- Reported values vary across benchmarking tools and methodologies, so numbers should only be compared when the underlying definitions match (see the sketch after this list).
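
The sketch below derives TTFT and ITL from client-side timestamps of a streamed response; the timestamp values are made up for illustration.

```python
# Minimal sketch: TTFT and intertoken latency (ITL) from client-side
# timestamps of a streamed response. Values are illustrative.

request_sent = 0.000                               # request sent (s)
token_times = [0.410, 0.455, 0.498, 0.540, 0.585]  # each output token's arrival (s)

# TTFT: request send to first output token; measured at the client this
# spans queuing, prefill, and network latency.
ttft = token_times[0] - request_sent

# ITL: mean gap between consecutive output tokens during generation.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps)

print(f"TTFT = {ttft * 1000:.0f} ms, ITL = {itl * 1000:.1f} ms")
```
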
Events and Definitions
- Key events must be defined precisely: when each request starts and ends (its end-to-end latency) and when the benchmark as a whole starts and ends.
- GenAI-Perf and LLMPerf calculate TPS differently, so the same deployment can report different numbers depending on the tool (see the event-record sketch below).
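
One way to make those event definitions concrete is to record, per request, when it was sent, when its first token arrived, and when it completed; the benchmark's start and end then fall out of the per-request records. The sketch below is illustrative and does not reproduce GenAI-Perf's or LLMPerf's actual schemas.

```python
# Minimal sketch: per-request timeline events a benchmark records.
# Field names and values are illustrative, not any tool's schema.
from dataclasses import dataclass

@dataclass
class RequestEvents:
    sent_at: float         # client sends the request
    first_token_at: float  # first output token arrives
    done_at: float         # last output token arrives

    @property
    def latency(self) -> float:
        # End-to-end request latency: request sent to last token.
        return self.done_at - self.sent_at

reqs = [RequestEvents(0.0, 0.4, 2.0), RequestEvents(0.5, 1.1, 2.8)]

# Benchmark start/end: first request sent, last response completed.
duration = max(r.done_at for r in reqs) - min(r.sent_at for r in reqs)
print(f"latencies = {[r.latency for r in reqs]}, benchmark duration = {duration:.1f}s")
```
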
TPS and Requests per Second
- Total (system) TPS is the total number of output tokens produced during the benchmark divided by the benchmark's end-to-end duration.
- TPS per user is a single request's output tokens divided by that request's end-to-end latency, i.e., the throughput an individual user experiences.
- RPS is the average number of successfully completed requests per second over the benchmark (computed in the sketch after this list).
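
Putting the definitions together, the sketch below computes total TPS, TPS per user, and RPS from completed request records. The benchmark window used here (first request sent to last response received) is one common convention, not necessarily what every tool uses; all numbers are made up.

```python
# Minimal sketch: total TPS, per-user TPS, and RPS from completed
# request records. The benchmark window (first request sent to last
# response received) is an assumed convention; tools bound it differently.
records = [
    {"sent_at": 0.0, "done_at": 2.0, "output_tokens": 64},
    {"sent_at": 0.5, "done_at": 2.8, "output_tokens": 80},
]

duration = max(r["done_at"] for r in records) - min(r["sent_at"] for r in records)

# Total (system) TPS: all output tokens over the whole benchmark window.
total_tps = sum(r["output_tokens"] for r in records) / duration

# TPS per user: each request's output tokens over its own latency.
tps_per_user = [r["output_tokens"] / (r["done_at"] - r["sent_at"]) for r in records]

# RPS: successfully completed requests per second (all completed here).
rps = len(records) / duration

print(f"total TPS = {total_tps:.1f}")
print(f"mean TPS per user = {sum(tps_per_user) / len(tps_per_user):.1f}")
print(f"RPS = {rps:.2f}")
```
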