Article · developer.nvidia.com
deep-learning, inference, gpu, optimization, nvidia, tensorrt, onnx, cuda

TensorRT

Optimize and deploy trained deep learning models with NVIDIA TensorRT for high-performance inference on NVIDIA GPUs, reducing latency and increasing throughput.

intermediate · 2-4 hours · 4 steps
The play
  1. Install TensorRT
    Download and install TensorRT from the NVIDIA Developer website. Ensure you have a compatible NVIDIA GPU and CUDA toolkit installed. Follow the installation guide specific to your operating system and CUDA version.
  2. Convert a Model to TensorRT
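    Use the TensorRT API or the `trtexec` command-line tool to convert a trained model into a TensorRT engine. Models trained in TensorFlow or PyTorch are typically exported to ONNX first, since ONNX is the format TensorRT's parser reads. Conversion parses the model, optimizes the graph, and serializes the resulting execution plan as an engine file.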
  3. Load and Run the TensorRT Engine
    Load the generated TensorRT engine into your application. Allocate input and output buffers on the GPU, copy input data to the input buffer, execute the engine, and retrieve the results from the output buffer.
  4. Optimize for Performance
    Experiment with TensorRT build settings, such as reduced precision (FP16, INT8) and dynamic shapes, to maximize performance for your specific model and hardware. Layer fusion is applied by TensorRT automatically during the build, so its effect shows up in the profile rather than as a setting you toggle. Profile your application to identify bottlenecks and areas for further optimization.
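Steps 2 and 4 can be sketched with the TensorRT Python API. This is a minimal sketch, not a definitive implementation: it assumes the `tensorrt` Python package (8.4 or newer) is installed, and the function name `build_engine`, its parameters, and the 1 GiB workspace default are illustrative choices rather than anything prescribed by TensorRT.

```python
from pathlib import Path


def build_engine(onnx_path, engine_path, fp16=False, workspace_bytes=1 << 30):
    """Parse an ONNX file and write a serialized TensorRT engine to disk.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    onnx_file = Path(onnx_path)
    if not onnx_file.is_file():
        raise FileNotFoundError(f"ONNX model not found: {onnx_file}")

    # Imported lazily so the path check above works without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # The ONNX parser needs an explicit-batch network in TensorRT 8.x.
    # TensorRT 10 marks this flag deprecated and ignores it.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Step 2: parse the model and collect parser errors if it fails.
    if not parser.parse(onnx_file.read_bytes()):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Step 4: build settings such as workspace size and precision live here.
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT failed to build the engine")

    with open(engine_path, "wb") as f:
        f.write(serialized)
    return Path(engine_path)
```

Calling `build_engine("model.onnx", "model.engine", fp16=True)` (file names are placeholders) would produce an FP16-enabled engine on a machine with a supported GPU. Building once with `fp16=False` and once with `fp16=True`, then timing both, is a simple way to start the experiments described in step 4.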
Starter code
Start by installing TensorRT and converting a simple ONNX model with the `trtexec` command-line tool. Then load and run the engine in a Python script to verify the setup.
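One possible starting point is sketched here, under some assumptions: `trtexec` is on your `PATH`, the `tensorrt` Python package (8.5 or newer) is installed, and `model.onnx` / `model.engine` are placeholder file names. The helper names `trtexec_command` and `describe_engine` are hypothetical. The script builds the engine with `trtexec`, then deserializes it and lists its input and output tensors as a setup check; actually executing the engine also requires allocating GPU buffers, which this sketch leaves out.

```python
import shutil
import subprocess
from pathlib import Path


def trtexec_command(onnx_path, engine_path, fp16=False):
    """Return the trtexec argument list that builds an engine from ONNX."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


def describe_engine(engine_path):
    """Deserialize an engine and return (name, mode, shape, dtype) per I/O tensor."""
    # Imported lazily so the command helper works without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        raise RuntimeError(f"Could not deserialize engine: {engine_path}")

    tensors = []
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        tensors.append((
            name,
            engine.get_tensor_mode(name).name,    # "INPUT" or "OUTPUT"
            tuple(engine.get_tensor_shape(name)),
            engine.get_tensor_dtype(name).name,
        ))
    return tensors


if __name__ == "__main__":
    # Placeholder file names; substitute your own model.
    cmd = trtexec_command("model.onnx", "model.engine", fp16=True)
    if shutil.which("trtexec") is None or not Path("model.onnx").is_file():
        print("trtexec or model.onnx not found; would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
        for tensor in describe_engine("model.engine"):
            print(tensor)
```

If the engine deserializes and the listed tensor names and shapes match what you expect from the ONNX model, the installation and conversion steps are working.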
Source
TensorRT — Action Pack