Tags: edge-deployment, onnx, quantization, tensorrt, model-compression, pytorch, nvidia, inference-optimization

Optimize PyTorch Models for Edge with TensorRT

Export a PyTorch model to ONNX, then use NVIDIA TensorRT to build an FP16-optimized engine from it for high-performance edge inference. This pack provides the code to export your model and to run the resulting engine with the TensorRT Python API.

Intermediate · 30 min · 4 steps
The play
  1. Install Prerequisites
    Install PyTorch, torchvision, ONNX, and PyCUDA. You also need the NVIDIA TensorRT SDK, which includes the `trtexec` tool. The recommended ways to get TensorRT are the official NVIDIA developer website or an NVIDIA NGC container.
  2. Export PyTorch Model to ONNX
    Load a pre-trained model (e.g., ResNet18) and export it to the ONNX (Open Neural Network Exchange) format. This creates a standardized, framework-agnostic version of your model that TensorRT can parse.
  3. Build Optimized Engine with trtexec
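    Use the `trtexec` command-line tool from the TensorRT SDK to convert the ONNX file into an optimized TensorRT engine. The `--fp16` flag allows TensorRT to run layers in half precision (FP16), which typically speeds up inference on GPUs with fast FP16 support at a small cost in accuracy. Build the engine on the target device where possible, because a serialized engine is tied to the TensorRT version and GPU it was built with.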
  4. Run Inference with the TensorRT Engine
    Load the optimized `.plan` engine file in Python using the TensorRT API. This script allocates memory on the GPU, copies input data, runs inference, and retrieves the output. This is the final step to using your optimized model in an application.
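To give a feel for the "minimal accuracy loss" mentioned in step 3, the sketch below rounds a few values to half precision using only the Python standard library. It illustrates the FP16 number format itself, not TensorRT's kernels, and the helper names `to_fp16` and `relative_error` are hypothetical, introduced here for illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The struct format code "e" is a 16-bit float, so packing and then
    # unpacking returns the value that FP16 storage would actually hold.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def relative_error(value: float) -> float:
    """Relative rounding error introduced by storing `value` in FP16."""
    return abs(to_fp16(value) - value) / abs(value)


if __name__ == "__main__":
    for sample in (0.1, 3.14159, 1234.567):
        print(f"{sample!r} -> {to_fp16(sample)!r} "
              f"(relative error {relative_error(sample):.2e})")

    # The largest finite FP16 value is 65504, so bigger magnitudes overflow.
    try:
        to_fp16(70000.0)
    except OverflowError as exc:
        print(f"70000.0 does not fit in FP16: {exc}")
```

FP16 keeps 11 significant bits, so values in its normal range round with a relative error of at most 2^-11 (about 0.05%). Classification logits usually tolerate that, but large activations can overflow, which is one reason to compare FP16 and FP32 outputs on your own data.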
Starter code
import torch
import torchvision
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import os

# --- 1. Export PyTorch model to ONNX ---
def export_to_onnx(model_name="resnet18.onnx"):
    print("Exporting PyTorch model to ONNX...")
    dummy_input = torch.randn(1, 3, 224, 224, device='cpu')
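    # `pretrained=True` is deprecated since torchvision 0.13; the `weights` enum replaces it.
    model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT).eval()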

    torch.onnx.export(model, 
                      dummy_input, 
                      model_name, 
                      verbose=False,
                      input_names=['input'],
                      output_names=['output'],
                      export_params=True
    )
    print(f"Model successfully exported to {model_name}")
    return model_name

# --- 2. Guide user to build TensorRT engine ---
def prompt_to_build_engine(onnx_path="resnet18.onnx", engine_path="resnet18.plan"):
    print("\n--- ACTION REQUIRED ---")
    print("Run the following command in your terminal to build the TensorRT engine:")
    print(f'\n  trtexec --onnx={onnx_path} --saveEngine={engine_path} --fp16\n')
    input("Press Enter after you have successfully run the command and the engine file is created...")
    if not os.path.exists(engine_path):
        print(f"Error: Engine file '{engine_path}' not found. Please build it first.")
        raise SystemExit(1)  # non-zero exit status signals the failure to the caller
    return engine_path

# --- 3. Run inference with the TensorRT engine ---
def run_inference(engine_path="resnet18.plan"):
    print("\nRunning inference with the TensorRT engine...")
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()
    
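    # Prepare input and output buffers.
    # This script targets the TensorRT 8.x binding-index API (get_binding_shape,
    # execute_async_v2), which TensorRT 10 no longer provides.
    h_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
    # Binding 0 is the input and binding 1 is the output of this single-input, single-output model.
    h_output = np.empty(tuple(engine.get_binding_shape(1)), dtype=np.float32)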
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    bindings = [int(d_input), int(d_output)]
    
    stream = cuda.Stream()

    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()

    print(f"Inference complete. Output shape: {h_output.shape}")
    # Example: print top 5 class predictions
    top5_indices = np.argsort(h_output[0])[::-1][:5]
    print(f"Top 5 predicted class indices: {top5_indices}")

if __name__ == '__main__':
    # Ensure you have installed torch, torchvision, onnx, pycuda, and the TensorRT SDK
    onnx_file = export_to_onnx()
    engine_file = prompt_to_build_engine(onnx_path=onnx_file)
    run_inference(engine_path=engine_file)
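The manual step in `prompt_to_build_engine` could instead be automated by calling `trtexec` from Python. The following is a minimal sketch under a few assumptions: `trtexec` is on `PATH` (some installs place it elsewhere), and the helper names `build_trtexec_command` and `build_engine` are hypothetical. It uses only the three flags already shown in this pack.

```python
import shutil
import subprocess
from pathlib import Path


def build_trtexec_command(onnx_path, engine_path, fp16=True):
    """Return the trtexec argument list used in this pack."""
    command = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        command.append("--fp16")
    return command


def build_engine(onnx_path="resnet18.onnx", engine_path="resnet18.plan", fp16=True):
    """Run trtexec and return the engine path, raising if the build fails."""
    # Assumption: trtexec is discoverable on PATH.
    if shutil.which("trtexec") is None:
        raise FileNotFoundError("trtexec not found on PATH; check your TensorRT install")

    # check=True raises CalledProcessError if trtexec exits with an error.
    subprocess.run(build_trtexec_command(onnx_path, engine_path, fp16), check=True)

    if not Path(engine_path).exists():
        raise RuntimeError(f"trtexec finished but '{engine_path}' was not created")
    return engine_path
```

In the script above, `engine_file = build_engine(onnx_path=onnx_file)` would then stand in for the call to `prompt_to_build_engine`. Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.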