Robust Scene Text Detection and Recognition: Inference Optimization

In this blog post, we discuss how to optimize deep learning models for robust scene text detection and recognition (STDR) using ONNX Runtime and NVIDIA TensorRT. We present benchmark results for the STDR models, comparing the performance of the two inference-optimization tools.
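As a rough illustration of how latency numbers like these can be collected on the ONNX Runtime side, the sketch below times repeated runs of an inference session. The model path, input name, and input shape are placeholders, not values from our pipeline.

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model file, input name, and shape; substitute the values
# for the model actually being benchmarked.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
feed = {"input": np.random.rand(1, 3, 736, 1280).astype(np.float32)}

for _ in range(10):          # warm-up iterations
    sess.run(None, feed)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, feed)
print(f"mean latency: {(time.perf_counter() - start) / runs * 1e3:.2f} ms")
```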
We first discuss the importance of robust scene text detection and recognition in various industries and the challenges involved. We then explain the process of converting the STDR models from ONNX to TensorRT using the NGC container image for TensorRT.
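As one possible way to perform this conversion inside the container, the following sketch uses the TensorRT Python API; the container also ships the trtexec command-line tool, which can do the same in a single command. The file names here are placeholders.

```python
import tensorrt as trt

# Parse an ONNX model and build a serialized TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("craft.onnx", "rb") as f:  # placeholder ONNX file
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

engine_bytes = builder.build_serialized_network(network, config)
with open("craft.plan", "wb") as f:  # placeholder engine file
    f.write(engine_bytes)
```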
We focus on optimizing each building block of the STDR pipeline, starting with scene text detection. We showcase the inference speed-up achieved with TensorRT over TorchScript, deploying the CRAFT detection model as a TensorRT engine with FP32 precision.
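For illustration, here is a minimal sketch of running such a serialized FP32 engine with Polygraphy, which also ships in the NGC TensorRT container. The engine path, input name, and input shape are assumptions.

```python
import numpy as np
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# Assumed engine file and input metadata; adjust to the built engine.
load_engine = EngineFromBytes(open("craft_fp32.plan", "rb").read())

with TrtRunner(load_engine) as runner:
    feed = {"input": np.random.rand(1, 3, 736, 1280).astype(np.float32)}
    outputs = runner.infer(feed)
    print({name: arr.shape for name, arr in outputs.items()})
```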
Next, we move on to scene text recognition and explain how the PARSeq model is converted from TorchScript to ONNX and then to a TensorRT engine. Low latency matters especially here: each image may contain multiple text fields, so the recognizer is invoked several times per image.
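The export step can be sketched as follows, assuming a saved TorchScript checkpoint and the canonical PARSeq input size of 3 x 32 x 128; the file names, tensor names, and shapes are assumptions to adjust to the actual model.

```python
import torch

# Load the (assumed) TorchScript checkpoint and export it to ONNX with a
# dynamic batch dimension, so crops of many text fields can be batched.
model = torch.jit.load("parseq_ts.pt").eval()
dummy = torch.randn(1, 3, 32, 128)

torch.onnx.export(
    model,
    dummy,
    "parseq.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
```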
Finally, we discuss the orchestrator module, a backend that coordinates the scene text detection and scene text recognition models. We present benchmark results for the full STDR pipeline served with NVIDIA Triton Inference Server on an NVIDIA RTX A5000 Laptop GPU.
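One way to implement such an orchestrator is Triton's Python backend with Business Logic Scripting (BLS), where one model issues inference requests to others. The sketch below is illustrative only: the model names ("craft", "parseq"), tensor names, and axis-aligned box format are assumptions, and CRAFT in practice returns polygons that call for perspective cropping rather than simple slicing.

```python
import cv2
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical BLS orchestrator: detect text fields, then recognize each."""

    def execute(self, requests):
        responses = []
        for request in requests:
            image_tensor = pb_utils.get_input_tensor_by_name(request, "IMAGE")

            # Step 1: run scene text detection on the full image.
            det_resp = pb_utils.InferenceRequest(
                model_name="craft",                 # assumed model name
                requested_output_names=["BOXES"],   # assumed output name
                inputs=[image_tensor],
            ).exec()
            if det_resp.has_error():
                raise pb_utils.TritonModelException(det_resp.error().message())
            boxes = pb_utils.get_output_tensor_by_name(det_resp, "BOXES").as_numpy()

            # Step 2: crop every detected field (assumed HWC uint8 image and
            # [x0, y0, x1, y1] boxes) and batch the crops for recognition.
            image = image_tensor.as_numpy()
            crops = np.stack([
                cv2.resize(image[y0:y1, x0:x1], (128, 32))
                for x0, y0, x1, y1 in boxes.astype(int)
            ])

            rec_resp = pb_utils.InferenceRequest(
                model_name="parseq",                # assumed model name
                requested_output_names=["TEXT"],    # assumed output name
                inputs=[pb_utils.Tensor("CROPS", crops)],
            ).exec()
            text = pb_utils.get_output_tensor_by_name(rec_resp, "TEXT")

            responses.append(pb_utils.InferenceResponse(output_tensors=[text]))
        return responses
```

Batching all crops into a single recognition request, as above, keeps per-image latency low even when an image contains many text fields.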
We conclude by emphasizing the importance of optimizing deep learning models for inference and the benefits of using the end-to-end NVIDIA AI Enterprise software solution for building efficient and robust scene text OCR systems. With these technologies, developers can create powerful applications for a wide range of industries.