Large Language Models On-Device with MediaPipe and TensorFlow Lite


Table of Contents

  1. Introduction
  2. Models
  3. Performance Metrics
  4. Performance Optimizations
  5. What's Next

Introduction

The MediaPipe and TensorFlow Lite teams have collaborated to enable running large language models (LLMs) on-device, thanks to optimizations across the on-device stack such as new ops, quantization, caching, and weight sharing. The experimental MediaPipe LLM Inference API is designed to simplify the integration of LLMs for developers, with support for Web, Android, and iOS.
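
To make the integration concrete, here is a minimal sketch using the Web API from the @mediapipe/tasks-genai package. The WASM CDN URL, model file path, and sampling values are illustrative assumptions, not recommendations from the release.

  // A minimal sketch of running an LLM in the browser with the LLM Inference API.
  // The asset paths and option values below are assumptions for illustration.
  import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

  async function generateOnDevice(prompt: string): Promise<string> {
    // Load the WebAssembly runtime files for the GenAI tasks.
    const genaiFileset = await FilesetResolver.forGenAiTasks(
        'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

    // Create the task from a locally hosted model file with basic sampling options.
    const llmInference = await LlmInference.createFromOptions(genaiFileset, {
      baseOptions: {modelAssetPath: '/assets/gemma-2b-it-gpu-int4.bin'},  // assumed path
      maxTokens: 1280,
      topK: 40,
      temperature: 0.8,
      randomSeed: 101,
    });

    // Generate a complete response for the prompt.
    return llmInference.generateResponse(prompt);
  }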


Models

The initial release of the MediaPipe LLM Inference API supports four model architectures: Gemma, Phi 2, Falcon, and Stable LM. Developers can use the base model weights, community fine-tuned weights, or weights fine-tuned on their own data. Each model comes with a built-in tokenizer that converts text into tokens.


Performance Metrics

  1. Max Tokens: Maximum total tokens allowed for the LLM prompt and response.
  2. Time to First Token: Duration from calling the LLM Inference API to receiving the first token of the response.
  3. Decode Speed: The rate at which response tokens are generated, influenced by the model, the hardware, and the max tokens setting.

Performance metrics were measured on high-end devices using max tokens of 1280, a 1024-token input prompt, and int8 weight quantization.
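
To show how the first two metrics map onto application code, here is a hedged sketch that uses the Web API's streaming progress listener. The helper name is invented, and the decode-rate figure is only an approximation: it counts streamed text chunks rather than true tokens.

  import {LlmInference} from '@mediapipe/tasks-genai';

  // Rough measurement sketch: records when the first streamed chunk arrives
  // (time to first token) and approximates decode rate as chunks per second.
  async function measureLatency(llm: LlmInference, prompt: string): Promise<void> {
    const start = performance.now();
    let firstChunkAt: number | null = null;
    let chunks = 0;

    await llm.generateResponse(prompt, (partialResult: string, done: boolean) => {
      if (firstChunkAt === null && partialResult.length > 0) {
        firstChunkAt = performance.now();
      }
      chunks += 1;
      if (done) {
        const end = performance.now();
        const ttft = (firstChunkAt ?? end) - start;                      // time to first token (ms)
        const decodeSecs = Math.max((end - (firstChunkAt ?? end)) / 1000, 1e-3);
        console.log(`Time to first token: ${ttft.toFixed(0)} ms`);
        console.log(`Approx. decode rate: ${(chunks / decodeSecs).toFixed(1)} chunks/s`);
      }
    });
  }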


Performance Optimizations

Several optimizations across MediaPipe, TensorFlow Lite, XNNPack, and the GPU-accelerated runtime contribute to the performance described above. Key optimizations include:

  1. Weights Sharing: Weights and the KV cache are shared across the inference contexts used for the prefill and decode phases, rather than being duplicated (a toy sketch of this idea follows the list).
  2. Optimized KV Cache Layout: KV cache entries are stored in a specialized layout tailored for convolution weights, since they ultimately serve as weights for the convolutions that stand in for matrix multiplications.
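
The following toy sketch is not MediaPipe's implementation; every name in it is invented for illustration. It only shows the idea behind sharing one set of weights and one KV cache between phases: prefill fills the cache for the whole prompt at once, and decode then appends exactly one entry per generated token while reusing both the weights and everything already cached.

  // Toy prefill/decode sketch; invented names, not MediaPipe internals.
  interface KvCache {
    keys: number[][];
    values: number[][];
  }

  // Stand-in for a learned projection (real models use large weight matrices).
  function project(token: number, weights: number[]): number[] {
    return weights.map((w) => w * token);
  }

  // Prefill: process the whole prompt in one pass, filling the shared cache.
  function prefill(promptTokens: number[], weights: number[], cache: KvCache): void {
    for (const token of promptTokens) {
      cache.keys.push(project(token, weights));
      cache.values.push(project(token, weights));
    }
  }

  // Decode: append exactly one cache entry per generated token, reusing the
  // same weights and the entries prefill already produced.
  function decodeStep(lastToken: number, weights: number[], cache: KvCache): number {
    cache.keys.push(project(lastToken, weights));
    cache.values.push(project(lastToken, weights));
    // A real decoder attends over all cached keys/values; this sketch just
    // returns a dummy next token so the control flow stays visible.
    return cache.keys.length;
  }

  // One weight vector and one cache serve both phases, with no copies.
  const sharedWeights = [0.1, 0.2, 0.3];
  const sharedCache: KvCache = {keys: [], values: []};
  prefill([5, 7, 9], sharedWeights, sharedCache);
  decodeStep(9, sharedWeights, sharedCache);
  console.log(`cache entries after prefill + one decode step: ${sharedCache.keys.length}`);  // 4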

What's Next

The experimental release of the MediaPipe LLM Inference API showcases promising optimizations and performance. Future developments may focus on expanding model support, further optimizing performance, and enhancing the developer experience.