LLM Benchmarking: Fundamental Concepts

Introduction

  • NVIDIA provides GenAI-Perf, an open-source benchmarking tool for evaluating LLM performance.
  • Load testing and performance benchmarking are essential for assessing how a model deployment performs under realistic traffic (a minimal example follows below).
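
A minimal load-testing sketch, not GenAI-Perf itself: it fires concurrent requests at a hypothetical OpenAI-compatible endpoint and records each request's end-to-end latency. The URL, model name, and payload shape are illustrative assumptions.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical server

def send_request(prompt: str) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    payload = json.dumps({"model": "my-model", "prompt": prompt, "max_tokens": 64})
    req = urllib.request.Request(
        ENDPOINT, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # block until the full response body arrives
    return time.perf_counter() - start

# Drive 32 requests at a concurrency of 8 and report the average latency.
with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(send_request, ["Hello"] * 32))
print(f"avg latency: {sum(latencies) / len(latencies):.3f}s")
```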

Stages of an LLM Inference Request

  1. Prompt: the user submits an input query.
  2. Queuing: the request waits in line until the server can process it.
  3. Prefill: the model processes the input tokens to produce the first output token.
  4. Generation: the model emits output tokens one at a time until completion (timed in the sketch below).
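
A sketch of how these four stages could be timed for a single request, assuming the client or serving stack can record each stage boundary; the timestamps below are made-up example values.

```python
from dataclasses import dataclass

@dataclass
class RequestTimeline:
    prompt_sent: float   # client finishes sending the prompt
    dequeued: float      # scheduler picks the request up
    first_token: float   # prefill done, first output token emitted
    last_token: float    # generation done, final token emitted

    @property
    def queuing(self) -> float:
        return self.dequeued - self.prompt_sent

    @property
    def prefill(self) -> float:
        return self.first_token - self.dequeued

    @property
    def generation(self) -> float:
        return self.last_token - self.first_token

t = RequestTimeline(prompt_sent=0.00, dequeued=0.12, first_token=0.35, last_token=2.10)
print(f"{t.queuing:.2f} {t.prefill:.2f} {t.generation:.2f}")  # 0.12 0.23 1.75
```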

AI Token and Context Length

  • An AI token is an LLM-specific unit of text and the basis for inference performance metrics.
  • Context length is the number of tokens the LLM uses at each generation step, including both the input tokens and the output tokens generated so far (see the sketch below).
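
A small sketch of how context length grows during generation, assuming a Hugging Face tokenizer (the gpt2 tokenizer is just an illustrative choice): at each step the context is the input tokens plus every output token produced so far.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here
input_ids = tokenizer.encode("Benchmarking measures tokens, not characters.")
generated = ["Each", " step", " appends", " one", " token", "."]

for step, piece in enumerate(generated, start=1):
    context_length = len(input_ids) + step  # input tokens + outputs so far
    print(f"step {step} ({piece!r}): context length = {context_length}")
```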

LLM Inference Metrics

  • Common metrics include time to first token (TTFT) and inter-token latency (ITL).
  • Measured from the client side, TTFT includes queuing time, prefill time, and network latency (see the sketch after this list).
  • Reported values can vary depending on the benchmarking tool and methodology used.
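
A minimal sketch of computing TTFT and ITL from client-side token arrival times in a streamed response; the timestamps are made-up example values.

```python
request_sent = 0.00
token_times = [0.42, 0.47, 0.53, 0.58, 0.64]  # arrival time of each output token

ttft = token_times[0] - request_sent  # includes queuing, prefill, network latency
# ITL: average gap between consecutive output tokens after the first.
itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
print(f"TTFT = {ttft:.2f}s, ITL = {itl:.3f}s/token")
```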

Events and Definitions

  • Metrics depend on precise definitions of key events, e.g., when a request's latency starts and ends and when the benchmark window opens and closes.
  • GenAI-Perf and LLMPerf calculate TPS differently, so results are not directly comparable across the two tools (a generic computation is sketched below).
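
A sketch of one common way to compute total TPS from benchmark events. The window used here (first request sent to last response received) is one convention; GenAI-Perf's and LLMPerf's exact definitions should be taken from their own documentation.

```python
requests = [
    # (sent_at, finished_at, output_tokens) -- illustrative values
    (0.0, 2.1, 128),
    (0.1, 2.6, 160),
    (0.2, 3.0, 144),
]

benchmark_start = min(sent for sent, _, _ in requests)
benchmark_end = max(done for _, done, _ in requests)
total_tokens = sum(tokens for _, _, tokens in requests)

tps = total_tokens / (benchmark_end - benchmark_start)
print(f"total TPS = {tps:.1f} tokens/s")  # 432 tokens over 3.0s = 144.0
```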

TPS and Requests per Second

  • Total TPS represents system throughput: the total number of output tokens divided by the end-to-end benchmark duration.
  • TPS per user captures the throughput an individual user sees: a request's output tokens divided by that request's end-to-end latency.
  • RPS is the average number of successfully completed requests per second (see the sketch below).
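
Continuing the sketch above, per-user TPS and RPS can be computed from the same per-request records; the values are illustrative.

```python
requests = [(0.0, 2.1, 128), (0.1, 2.6, 160), (0.2, 3.0, 144)]

# TPS per user: each request's output tokens over its own end-to-end latency.
tps_per_user = [tokens / (done - sent) for sent, done, tokens in requests]
# RPS: completed requests over the benchmark window.
window = max(d for _, d, _ in requests) - min(s for s, _, _ in requests)
rps = len(requests) / window

print([f"{v:.1f}" for v in tps_per_user])  # per-request token throughput
print(f"RPS = {rps:.2f}")                  # 3 requests / 3.0s = 1.00
```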