Paper·arxiv.org
llm · machine-learning · deployment · infrastructure · evaluation · nvidia-tensorrt
GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
Accelerate transformer model inference using NVIDIA TensorRT and mixed-precision techniques. This Action Pack guides you through optimizing models like BERT and GPT-2 for real-time, low-latency applications on GPUs.
intermediate · 1 hour · 5 steps
The play
- Set up NVIDIA TensorRT Environment: Ensure your system has an NVIDIA GPU, the CUDA Toolkit, and cuDNN installed. Install NVIDIA TensorRT following the official documentation, typically via pip for Python or by downloading the tar package and configuring paths.
- Convert Transformer Model to ONNX: Export your pre-trained transformer model (e.g., from Hugging Face Transformers) to the ONNX format. This is often an intermediate step for TensorRT conversion. Specify dynamic axes for varying batch sizes and sequence lengths.
- Build TensorRT Engine with Mixed Precision: Load the ONNX model into TensorRT. Configure the builder to optimize for your target GPU, enabling mixed precision (FP16) for performance. Define optimization profiles for different batch sizes and sequence lengths.
- Perform GPU-Accelerated Inference: Load the optimized TensorRT engine. Prepare your input data (e.g., tokenized text) and transfer it to the GPU. Execute inference using the TensorRT engine, utilizing the defined optimization profiles.
- Evaluate Real-Time Performance: Measure inference latency and throughput across various batch sizes and sequence lengths. Compare performance against the original model to quantify the speedup achieved with TensorRT and mixed precision.
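The evaluation step can be sketched with a small timing harness. This is a minimal illustration, not tooling shipped with TensorRT: `benchmark`, `run_once`, and the default warm-up and iteration counts are hypothetical names and values chosen for the example. It assumes `run_once` performs one complete inference call and blocks until the result is ready; for GPU work that means synchronizing the device inside `run_once`, otherwise only the asynchronous launch is timed.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], batch_size: int,
              warmup: int = 10, iters: int = 100) -> Dict[str, float]:
    """Time run_once() and report latency statistics and throughput.

    run_once is a hypothetical callable that runs one inference on a batch
    of `batch_size` samples and returns only after the result is ready.
    """
    # Warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        run_once()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = statistics.fmean(latencies_ms)
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "mean_ms": mean_ms,
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        # Samples processed per second at the measured mean latency.
        "samples_per_s": batch_size * 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

Running the same harness over the original PyTorch model and the TensorRT engine, with identical batch sizes and sequence lengths, gives directly comparable numbers for the speedup. Reporting a tail percentile alongside the mean matters for real-time use, where occasional slow requests are what users notice.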
Starter code
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# 1. Load a pre-trained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
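# torchscript=True makes the model return plain tuples instead of dict-like outputs, which tracing-based export expects
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)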
# 2. Create dummy input for ONNX export
max_seq_length = 128
inputs = tokenizer("Hello, world!", return_tensors="pt", max_length=max_seq_length, padding="max_length", truncation=True)
# 3. Export to ONNX
# This is a prerequisite for TensorRT conversion. TensorRT can directly consume ONNX.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
    opset_version=13,
)
print("Model exported to model.onnx")
print("Next: Use trtexec or TensorRT Python API to convert model.onnx to a TensorRT engine with FP16.")
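For the trtexec route, the command line can be assembled programmatically so that the optimization-profile shapes stay in sync with the input names used at export time. The sketch below only builds and prints the command; it does not run it. The helper names (`trtexec_command`, `_shape_arg`), the engine filename `model.plan`, and the min/opt/max shapes are assumptions for illustration, and should be replaced with the batch sizes and sequence lengths your application actually serves. It uses the `--onnx`, `--saveEngine`, `--fp16`, `--minShapes`, `--optShapes`, and `--maxShapes` options of trtexec.

```python
import shlex
from typing import Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]


def _shape_arg(shapes: Dict[str, Shape]) -> str:
    # trtexec expects comma-separated name:AxB pairs, e.g. input_ids:1x128
    return ",".join(
        f"{name}:{'x'.join(str(dim) for dim in dims)}" for name, dims in shapes.items()
    )


def trtexec_command(onnx_path: str, engine_path: str, input_names: Sequence[str],
                    min_shape: Shape, opt_shape: Shape, max_shape: Shape,
                    fp16: bool = True) -> List[str]:
    """Build a trtexec argument list with one dynamic-shape profile.

    Every input gets the same (batch, sequence) shape, which matches the
    input_ids / attention_mask pair exported in the starter code.
    """
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels where TensorRT finds them faster
    for flag, shape in (("--minShapes", min_shape),
                        ("--optShapes", opt_shape),
                        ("--maxShapes", max_shape)):
        cmd.append(f"{flag}={_shape_arg({name: shape for name in input_names})}")
    return cmd


# Illustrative shapes (assumed): batch 1-32, sequence length 16-128, tuned for 8x128.
cmd = trtexec_command(
    "model.onnx", "model.plan", ["input_ids", "attention_mask"],
    min_shape=(1, 16), opt_shape=(8, 128), max_shape=(32, 128),
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell on a machine with TensorRT installed, or passed as a list to `subprocess.run`. TensorRT tunes kernels for the opt shape, so it is worth setting that to the most common request size rather than the maximum.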