<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>NVIDIA on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/nvidia/</link>
    <description>Recent content in NVIDIA on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Thu, 28 May 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/nvidia/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Practical Guide to Accelerating Deep Learning Inference with TensorRT</title>
      <link>https://tech-snippets.xyz/posts/tensorrt-inference-optimization-guide/</link>
      <pubDate>Thu, 28 May 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/tensorrt-inference-optimization-guide/</guid>
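      <description>Preface As deep learning moves from academic research into industrial deployment, inference performance has become a deciding factor in whether a project succeeds.
You may have been through this: you spend months carefully training a model that reaches 99% accuracy, and then production brings a rude surprise. A single frame takes 500 ms, nowhere near the 30 ms the business requires. At that point you have two options: spend hundreds of thousands of yuan on a hardware upgrade, or find a way to make the model run faster.
TensorRT is the tool for the second option. As NVIDIA's deep learning inference optimizer, it can often run the same model on the same hardware 4 to 20 times faster, usually with accuracy loss kept within 1%. Better still, the speedup is "free": no change to the network architecture and no retraining, just one extra "compile" step.
This article sums up my past three years of working with TensorRT. From basic environment setup, through ONNX model conversion and INT8 quantization calibration, to production-grade C++ deployment, I will share every pitfall and optimization trick I have run into. If you are deploying models or struggling with inference speed, this article is for you.
1. Why Do We Need TensorRT? Before digging into the technical details, let's answer the most basic question: if PyTorch and TensorFlow can already run inference, why bother with TensorRT?
1.1 Training Frameworks Are Not Designed for Inference PyTorch and TensorFlow are training frameworks, and their design priorities are:
Flexibility - dynamic construction of arbitrary computation graphs Ease of use - Python interface, automatic differentiation Generality - support for hardware ranging from CPUs to multi-GPU systems Inference performance was never their primary design goal. For the sake of flexibility, PyTorch in eager mode dispatches every operator separately on each run, and each operator goes through a general-purpose CUDA kernel, which leaves a lot of performance on the table.
For example, a simple Conv + BatchNorm + ReLU combination runs as three separate kernel launches in PyTorch, each reading from and writing to global GPU memory. TensorRT fuses the three layers into a single kernel and keeps the intermediate results on-chip, and this alone can bring a 2-3x speedup.</description>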
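      <content:encoded><![CDATA[<h2 id="前言">Preface</h2>
<p>As deep learning moves from academic research into industrial deployment, <strong>inference performance</strong> has become a deciding factor in whether a project succeeds.</p>
<p>You may have been through this: you spend months carefully training a model that reaches 99% accuracy, and then production brings a rude surprise. A single frame takes 500 ms, nowhere near the 30 ms the business requires. At that point you have two options: spend hundreds of thousands of yuan on a hardware upgrade, or find a way to make the model run faster.</p>
<p>TensorRT is the tool for the second option. As NVIDIA's deep learning inference optimizer, it can often run the same model on the same hardware <strong>4 to 20 times faster</strong>, usually with accuracy loss kept within 1%. Better still, the speedup is "free": no change to the network architecture and no retraining, just one extra "compile" step.</p>
<p>This article sums up my past three years of working with TensorRT. From basic environment setup, through ONNX model conversion and INT8 quantization calibration, to production-grade C++ deployment, I will share every pitfall and optimization trick I have run into. If you are deploying models or struggling with inference speed, this article is for you.</p>
<p><img alt="TensorRT model inference acceleration workflow" loading="lazy" src="/images/tensorrt-workflow.svg"></p>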
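<h2 id="一为什么我们需要-tensorrt">1. Why Do We Need TensorRT?</h2>
<p>Before digging into the technical details, let's answer the most basic question: if PyTorch and TensorFlow can already run inference, why bother with TensorRT?</p>
<h3 id="11-训练框架的设计目标不是推理">1.1 Training Frameworks Are Not Designed for Inference</h3>
<p>PyTorch and TensorFlow are training frameworks, and their design priorities are:</p>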
<ol>
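<li><strong>Flexibility</strong> - dynamic construction of arbitrary computation graphs</li>
<li><strong>Ease of use</strong> - Python interface, automatic differentiation</li>
<li><strong>Generality</strong> - support for hardware ranging from CPUs to multi-GPU systems</li>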
</ol>
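<p>Inference performance was never their primary design goal. For the sake of flexibility, PyTorch in eager mode dispatches every operator separately on each run, and each operator goes through a general-purpose CUDA kernel, which leaves a lot of performance on the table.</p>
<p>For example, a simple Conv + BatchNorm + ReLU combination runs as three separate kernel launches in PyTorch, each reading from and writing to global GPU memory. TensorRT <strong>fuses</strong> the three layers into a single kernel and keeps the intermediate results on-chip, and this alone can bring a 2-3x speedup.</p>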
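<p>To see why fusing Conv + BatchNorm + ReLU does not change the result, it helps to work through the arithmetic. The following is a minimal sketch, not TensorRT's actual implementation: a scalar multiply-add stands in for a single-channel 1x1 convolution, and the helper names (<code>fold_batchnorm</code>, <code>conv_bn_relu_fused</code>) and parameter values are made up for illustration. At inference time BatchNorm is a fixed affine transform, so it can be folded into the convolution's weight and bias, and ReLU can be applied in the same pass.</p>

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BatchNorm into the preceding layer's weight and bias.

    Per output channel: BN(weight * x + bias) == w_folded * x + b_folded.
    """
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def conv_bn_relu_unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    # Three separate steps, the way an eager framework would execute them.
    y = weight * x + bias                                   # "conv" stand-in
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta    # batch norm
    return max(y, 0.0)                                      # ReLU


def conv_bn_relu_fused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    # One step: folded affine transform with ReLU applied in the same pass.
    w_f, b_f = fold_batchnorm(weight, bias, gamma, beta, mean, var, eps)
    return max(w_f * x + b_f, 0.0)


# Illustrative parameter values (assumptions, not taken from a real model).
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)

for x in (-2.0, 0.0, 1.0, 3.5):
    unfused = conv_bn_relu_unfused(x, **params)
    fused = conv_bn_relu_fused(x, **params)
    assert math.isclose(unfused, fused, rel_tol=1e-9, abs_tol=1e-12)
```

<p>The fused version produces the same outputs in one pass over the data instead of three, which is the idea behind the fusion described above.</p>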
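<h3 id="12-tensorrt-的核心优化手段">1.2 TensorRT's Core Optimizations</h3>
<p>TensorRT achieves such large speedups through the following key optimizations:</p>
<p><strong>1. Operator fusion (kernel fusion)</strong>
Several adjacent small operators are merged into one larger operator, which reduces kernel launch overhead and the number of GPU memory accesses. This is one of TensorRT's most effective optimizations.</p>
<p><strong>2. Quantization</strong>
Dropping from FP32 to FP16 and then to INT8 halves memory usage or even cuts it to a quarter. More importantly, NVIDIA GPUs have dedicated Tensor Cores that accelerate low-precision math. On Ampere and later architectures, peak INT8 Tensor Core throughput is many times the FP32 rate; the exact ratio depends on the GPU model.</p>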
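<p>To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization with a max-abs scale. This is a simplification chosen for illustration: TensorRT's calibrators pick their scales with their own algorithms, and the function names and sample values below are assumptions.</p>

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.64, 0.9, -0.3]  # made-up sample values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Rounding to the nearest code bounds the error by half a quantization step.
assert max_error <= scale / 2 + 1e-12
print(codes, scale, max_error)
```

<p>Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step. Choosing a good scale is what the calibration step is for.</p>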
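<p><strong>3. Auto-tuning</strong>
TensorRT benchmarks dozens or even hundreds of candidate kernels on your specific GPU model and picks the fastest one. The same model yields different execution plans on a 3090 and on an A100.</p>
<p><strong>4. Dynamic tensor memory management</strong>
Intermediate tensors reuse memory wherever possible during inference instead of being allocated and freed on every run. With large batches this saves a lot of GPU memory.</p>
<p><strong>5. Layer elimination</strong>
Layers that inference does not need (such as Dropout) are removed outright, and identity transforms are optimized away as well.</p>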
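<h3 id="13-性能提升到底有多大">1.3 How Big Is the Speedup?</h3>
<p>Claims need evidence, so here is a set of measured results (tested on an NVIDIA RTX 3090). Each speedup is relative to the row marked 1x for the same model:</p>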
<table>
<thead>
<tr>
<th>Model</th>
<th>Framework</th>
<th>Precision</th>
<th>FPS</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>ResNet-50</td>
<td>PyTorch</td>
<td>FP32</td>
<td>198</td>
<td>1x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>PyTorch</td>
<td>FP16</td>
<td>387</td>
<td>1.95x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>TensorRT</td>
<td>FP16</td>
<td>1182</td>
<td>5.97x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>TensorRT</td>
<td>INT8</td>
<td>2456</td>
<td>12.4x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>PyTorch</td>
<td>FP16</td>
<td>520</td>
<td>1x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>TensorRT</td>
<td>FP16</td>
<td>2150</td>
<td>4.13x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>TensorRT</td>
<td>INT8</td>
<td>3890</td>
<td>7.48x</td>
</tr>
</tbody>
</table>
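<p>As the table shows, switching to TensorRT FP16 alone gives roughly a 4-6x speedup, and INT8 quantization brings that to about 7-12x. Note that the two baselines differ: ResNet-50 is compared against PyTorch FP32, whereas YOLOv8n is compared against PyTorch FP16. For Transformer-style models the gain is usually larger in my experience, often 15-20x.</p>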
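<p>The speedup column is simply each row's FPS divided by the FPS of that model's 1x row. Here is a short sketch of that arithmetic using the numbers from the table. The latency conversion assumes FPS was measured at batch size 1 with no pipelining, which the table does not state, so treat the millisecond figures as rough estimates.</p>

```python
# FPS figures copied from the table above.
fps = {
    ("ResNet-50", "PyTorch", "FP32"): 198,
    ("ResNet-50", "PyTorch", "FP16"): 387,
    ("ResNet-50", "TensorRT", "FP16"): 1182,
    ("ResNet-50", "TensorRT", "INT8"): 2456,
    ("YOLOv8n", "PyTorch", "FP16"): 520,
    ("YOLOv8n", "TensorRT", "FP16"): 2150,
    ("YOLOv8n", "TensorRT", "INT8"): 3890,
}

# The row marked 1x for each model is its baseline.
baseline = {
    "ResNet-50": ("ResNet-50", "PyTorch", "FP32"),
    "YOLOv8n": ("YOLOv8n", "PyTorch", "FP16"),
}


def speedup(key):
    """FPS of this configuration divided by the FPS of the model's baseline."""
    return fps[key] / fps[baseline[key[0]]]


def latency_ms(key):
    # Assumption: batch size 1 and no pipelining, so latency = 1000 / FPS.
    return 1000.0 / fps[key]


for key in fps:
    model, framework, precision = key
    print(f"{model:10s} {framework:9s} {precision}  "
          f"{speedup(key):5.2f}x  {latency_ms(key):5.2f} ms/frame")
```

<p>Under that assumption, every configuration in the table is already well under the 30 ms budget mentioned in the preface; the speedup matters most when the model or the input is much larger.</p>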
<h3 id="14-什么时候该用-tensorrt">1.4 When Should You Use TensorRT?</h3>
<p>TensorRT is not a silver bullet. It is a particularly good fit in these scenarios:</p>
<ul>
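<li>✅ You need the lowest possible inference latency and the highest throughput</li>
<li>✅ You are deploying on edge devices (Jetson, embedded)</li>
<li>✅ GPU resources are tight and utilization has to be maximized</li>
<li>✅ You run batch inference with fixed input sizes</li>
<li>✅ The model is already trained and ready to ship</li>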
</ul>
<p>In the following scenarios it is not worth the trouble:</p>
<ul>
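<li>❌ You are still in a rapidly iterating experimental phase</li>
<li>❌ Speed is not critical (for example, processing a few images per second)</li>
<li>❌ The network architecture changes frequently</li>
<li>❌ You deploy on CPU (TensorRT only supports NVIDIA GPUs)</li>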
</ul>
<h2 id="二tensorrt-核心概念解析">2. TensorRT Core Concepts</h2>
<p>Before writing any code, let's get a few core concepts straight; otherwise it is easy to get lost later.</p>
<h3 id="21-builder-vs-runtime">2.1 Builder vs Runtime</h3>
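<p>The TensorRT workflow is split into two completely independent phases:</p>
<p><strong>Build phase (Builder)</strong>: This is an offline process that only needs to run once per target GPU and TensorRT version. The Builder parses your network, applies the optimizations, and produces a serialized engine file (usually with a .plan or .engine extension). It is fairly slow and can take minutes or even tens of minutes, because it performs a lot of search and tuning.</p>
<p><strong>Run phase (Runtime)</strong>: This is what online inference uses. The Runtime deserializes the engine file, creates an execution context, and is then ready to run inference. It is lightweight and starts quickly, because all the optimization work was already done in the build phase.</p>
<p><strong>Important</strong>: by default, a built engine is <strong>hardware-specific</strong>. An engine built on a 3090 cannot simply be copied to an A100; it has to be rebuilt on the target hardware. Even a change of TensorRT version can make it incompatible, so keep this in mind.</p>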
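<p>Because an engine is tied to the GPU and the TensorRT version it was built with, one practical habit is to encode both in the engine file name so that a stale engine is never loaded by mistake. Here is a minimal sketch. <code>engine_filename</code> is a hypothetical helper, not a TensorRT API, and the GPU name and version are hard-coded for illustration; in a real script they could come from <code>torch.cuda.get_device_name(0)</code> and <code>tensorrt.__version__</code>.</p>

```python
import re


def engine_filename(model_name, gpu_name, trt_version, precision):
    """Build an engine file name that encodes everything the engine depends on."""

    def slug(text):
        # Keep letters, digits and dots; collapse everything else into "-".
        return re.sub(r"[^A-Za-z0-9.]+", "-", text).strip("-").lower()

    return (f"{slug(model_name)}_{slug(gpu_name)}"
            f"_trt{slug(trt_version)}_{slug(precision)}.engine")


# Hard-coded example values (assumptions for illustration).
print(engine_filename("resnet50", "NVIDIA GeForce RTX 3090", "10.1.0", "FP16"))
# resnet50_nvidia-geforce-rtx-3090_trt10.1.0_fp16.engine
```

<p>If the file for the current GPU and version does not exist, the deployment script knows it has to rebuild instead of failing at deserialization time.</p>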
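<h3 id="22-精度模式">2.2 Precision Modes</h3>
<p>TensorRT supports three main precision modes, and you can choose among them based on your requirements:</p>
<p><strong>FP32 (single-precision floating point)</strong>:</p>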
<ul>
<li>Same as PyTorch's default precision</li>
<li>No precision loss relative to the trained model</li>
<li>Slowest of the three</li>
<li>Usually serves as the baseline</li>
</ul>
<p><strong>FP16 (half-precision floating point)</strong>:</p>
<ul>
<li>Accuracy loss is under 0.5% for the vast majority of models</li>
<li>Accelerated by Tensor Cores, 2-3x faster than FP32</li>
<li>Halves memory usage</li>
<li><strong>Recommended as the first choice</strong></li>
</ul>
<p><strong>INT8 (8-bit integer)</strong>:</p>
<ul>
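<li>Accuracy loss is typically 1-2% (depending on calibration quality)</li>
<li>4-10x faster than FP32</li>
<li>Memory usage is only a quarter of the FP32 footprint</li>
<li>Requires a calibration dataset</li>
<li>Needs careful tuning for detection, segmentation, and similar tasks</li>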
</ul>
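<p>The memory figures in these lists follow directly from the bytes each element takes: 4 for FP32, 2 for FP16, 1 for INT8. Here is a rough sketch of the arithmetic for weight storage only, assuming a ResNet-50-sized model of about 25.6 million parameters. Activations and workspace are not counted, and <code>weight_megabytes</code> is an illustrative helper.</p>

```python
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_megabytes(num_params, precision):
    """Storage needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_ELEMENT[precision] / 1e6


num_params = 25_600_000  # approximate ResNet-50 parameter count (assumption)

for precision in ("FP32", "FP16", "INT8"):
    print(f"{precision}: {weight_megabytes(num_params, precision):.1f} MB")
```

<p>In a real engine some layers may stay in higher precision, so the actual saving can be smaller than this upper bound.</p>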
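<h3 id="23-动态-shape">2.3 Dynamic Shapes</h3>
<p>Many people hit a pitfall when they first start using TensorRT: by default the input size must be fixed. That is because TensorRT makes all of its optimization decisions at build time, including convolution tile sizes and the memory allocation strategy.</p>
<p>In real workloads, however, we often need to handle inputs of different sizes (for example, images of varying dimensions in a detection task). This is where <strong>dynamic shape</strong> mode comes in:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// Specify the allowed range of each dimension at build time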
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">IOptimizationProfile</span><span class="o">*</span> <span class="n">profile</span> <span class="o">=</span> <span class="n">builder</span><span class="o">-&gt;</span><span class="n">createOptimizationProfile</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kMIN</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">});</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kOPT</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">640</span><span class="p">,</span> <span class="mi">640</span><span class="p">});</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kMAX</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1280</span><span class="p">,</span> <span class="mi">1280</span><span class="p">});</span>
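</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">-&gt;</span><span class="n">addOptimizationProfile</span><span class="p">(</span><span class="n">profile</span><span class="p">);</span>
</span></span></code></pre></div><p>Dynamic shapes cost some performance (typically 10-20%), but the flexibility is worth it in many applications. The profile only takes effect once it has been added to the builder config, which is what the last line of the snippet does.</p>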
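<p>At runtime, every input shape has to fall inside the profile's [min, max] range, dimension by dimension. Here is a small pure-Python sketch of that check, using the same ranges as the C++ snippet above. <code>shape_in_profile</code> is an illustrative helper, not a TensorRT API.</p>

```python
def shape_in_profile(shape, min_shape, max_shape):
    """Return True if every dimension of shape lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi
               for dim, lo, hi in zip(shape, min_shape, max_shape))


# Same ranges as the kMIN / kMAX settings in the C++ snippet.
MIN_SHAPE = (1, 3, 256, 256)
MAX_SHAPE = (1, 3, 1280, 1280)

for candidate in [(1, 3, 640, 640), (1, 3, 1920, 1080), (2, 3, 640, 640)]:
    ok = shape_in_profile(candidate, MIN_SHAPE, MAX_SHAPE)
    print(candidate, "accepted" if ok else "outside profile")
```

<p>Note that this particular profile pins the batch dimension to 1. If you need batched inference, widen the first dimension of the max shape when building the engine.</p>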
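<h2 id="三环境搭建从-0-到-1">3. Environment Setup: From Zero to One</h2>
<p>Setting up a TensorRT environment used to be the first hurdle that put many people off, but it has become much simpler in recent years. Here are the two most reliable installation methods I recommend.</p>
<h3 id="31-方式一docker推荐">3.1 Option 1: Docker (Recommended)</h3>
<p>Docker is the simplest and least error-prone approach. NVIDIA has already packaged all the dependencies.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Pull the official TensorRT image (pick a tag whose CUDA version your host driver supports)</span>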
</span></span><span class="line"><span class="cl">docker pull nvcr.io/nvidia/tensorrt:24.05-py3
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Start the container</span>
</span></span><span class="line"><span class="cl">docker run --gpus all -it --rm <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -v /your/workspace:/workspace <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  nvcr.io/nvidia/tensorrt:24.05-py3
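</span></span></code></pre></div><p>This image already includes the following (exact versions depend on the tag, so check the release notes for the one you pull):</p>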
<ul>
<li>CUDA Toolkit 12.4</li>
<li>cuDNN 9.1</li>
<li>TensorRT 10.1</li>
<li>PyTorch 2.3</li>
<li>ONNX</li>
<li>Various Python bindings</li>
</ul>
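<p>Once inside the container you can start working right away without installing anything else.</p>
<h3 id="32-方式二本地安装">3.2 Option 2: Local Installation</h3>
<p>If you prefer not to use Docker, you can install TensorRT directly on your machine. First download the TensorRT tar package for your version from the <a href="https://developer.nvidia.com/tensorrt">NVIDIA website</a>, then:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Extract the archive</span>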
</span></span><span class="line"><span class="cl">tar -xzf TensorRT-10.1.0.27.Ubuntu-22.04.x86_64-gnu.cuda-12.4.tar.gz
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Add TensorRT to the environment variables</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">TENSORRT_DIR</span><span class="o">=</span>/path/to/TensorRT-10.1.0.27
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$TENSORRT_DIR</span>/lib:<span class="nv">$LD_LIBRARY_PATH</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span><span class="nv">$TENSORRT_DIR</span>/python:<span class="nv">$PYTHONPATH</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Install the Python package</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$TENSORRT_DIR</span>/python
</span></span><span class="line"><span class="cl">pip install tensorrt-10.1.0-cp310-none-linux_x86_64.whl
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Verify the installation</span>
</span></span><span class="line"><span class="cl">python -c <span class="s2">&#34;import tensorrt; print(tensorrt.__version__)&#34;</span>
</span></span></code></pre></div><p><strong>Version compatibility checklist</strong>:</p>
<ul>
<li>CUDA version ≥ 11.8</li>
<li>cuDNN version matches what your TensorRT release requires</li>
<li>PyTorch build matches your CUDA version</li>
<li>Python 3.8 ~ 3.11</li>
</ul>
<p>In my experience, version mismatches are behind 90% of strange problems, so confirm these at the very start.</p>
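<p>The checklist is easy to turn into a quick script. Below is a minimal sketch of the comparison logic only. <code>parse_version</code>, <code>version_at_least</code> and <code>python_supported</code> are hypothetical helpers, and the CUDA version string is hard-coded for illustration; in practice you might read it from <code>torch.version.cuda</code>.</p>

```python
import sys


def parse_version(text):
    """Turn a dotted version string such as '12.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def version_at_least(text, minimum):
    """Compare a dotted version string against a minimum (major, minor) tuple."""
    return parse_version(text) >= minimum


def python_supported(version_info, lowest=(3, 8), highest=(3, 11)):
    """Check that (major, minor) falls within the supported range, inclusive."""
    return lowest <= tuple(version_info[:2]) <= highest


cuda_version = "12.4"  # assumed value for illustration
print("CUDA >= 11.8:", version_at_least(cuda_version, (11, 8)))
print("Python in 3.8 ~ 3.11:", python_supported(sys.version_info))
```

<p>Comparing tuples of integers avoids the classic string-comparison trap where "11.10" sorts before "11.8".</p>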
<h3 id="33-安装验证">3.3 Verifying the Installation</h3>
<p>Whichever method you used, run this script at the end to confirm that everything works:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT version: </span><span class="si">{</span><span class="n">trt</span><span class="o">.</span><span class="n">__version__</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;PyTorch version: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">__version__</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;CUDA available: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">()</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;CUDA device: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Check the TensorRT core library</span>
</span></span><span class="line"><span class="cl"><span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">builder</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Builder</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT builder created successfully&#34;</span><span class="p">)</span>
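</span></span></code></pre></div><p>If everything prints as expected, the environment is fine and you can move on.</p>
<h2 id="四第一步把-pytorch-模型导出成-onnx">4. Step One: Export the PyTorch Model to ONNX</h2>
<p>TensorRT does not read PyTorch <code>.pth</code> files directly, so we first export the model to ONNX. The step is simple, but it has its share of pitfalls.</p>
<h3 id="41-基础导出代码">4.1 Basic Export Code</h3>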
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torchvision.models</span> <span class="k">as</span> <span class="nn">models</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Load the model (weights= replaces the deprecated pretrained=True)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">resnet50</span><span class="p">(</span><span class="n">weights</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">ResNet50_Weights</span><span class="o">.</span><span class="n">DEFAULT</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build a dummy input</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Export to ONNX</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">onnx</span><span class="o">.</span><span class="n">export</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">dummy_input</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;resnet50.onnx&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">opset_version</span><span class="o">=</span><span class="mi">17</span><span class="p">,</span>           <span class="c1"># use the newest opset your TensorRT version supports</span>
</span></span><span class="line"><span class="cl">    <span class="n">do_constant_folding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>   <span class="c1"># constant-folding optimization</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;input&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;output&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">dynamic_axes</span><span class="o">=</span><span class="p">{</span>              <span class="c1"># only if you need dynamic shapes</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;input&#34;</span><span class="p">:</span> <span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s2">&#34;batch_size&#34;</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span> <span class="s2">&#34;height&#34;</span><span class="p">,</span> <span class="mi">3</span><span class="p">:</span> <span class="s2">&#34;width&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;output&#34;</span><span class="p">:</span> <span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s2">&#34;batch_size&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;ONNX exported successfully&#34;</span><span class="p">)</span>
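</span></span></code></pre></div><h3 id="42-onnx-简化关键步骤">4.2 ONNX Simplification (a Key Step)</h3>
<p>ONNX files exported from PyTorch often contain redundant operators and identity transforms. Feeding them to TensorRT as-is sometimes causes problems and also gets in the way of optimization, so always simplify the model with <code>onnxsim</code> first:</p>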
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Install onnxsim</span>
</span></span><span class="line"><span class="cl">pip install onnxsim
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Simplify the model</span>
</span></span><span class="line"><span class="cl">onnxsim resnet50.onnx resnet50_sim.onnx
</span></span></code></pre></div><p>Alternatively, use the Python API:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">onnx</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">onnxsim</span> <span class="kn">import</span> <span class="n">simplify</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">onnx</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;resnet50.onnx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model_sim</span><span class="p">,</span> <span class="n">check</span> <span class="o">=</span> <span class="n">simplify</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">assert</span> <span class="n">check</span><span class="p">,</span> <span class="s2">&#34;Simplified ONNX model could not be validated&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">onnx</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">model_sim</span><span class="p">,</span> <span class="s2">&#34;resnet50_sim.onnx&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>This step matters a great deal. I have run into the "PyTorch export succeeds but TensorRT fails to parse" problem at least a dozen times, and each time a pass through onnxsim fixed it. <strong>Never skip this step.</strong></p>
<h3 id="43-导出常见问题">4.3 Common Export Problems</h3>
<p><strong>Problem 1: Dynamic control flow</strong>
If your model contains data-dependent branches such as <code>if</code> or <code>for</code>, PyTorch emits a warning during export:</p>
<pre tabindex="0"><code>TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect.
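</code></pre><p>At this point you have three options:</p>
<ol>
<li>Rewrite the dynamic logic as static logic (recommended)</li>
<li>Script the control-flow part with <code>torch.jit.script</code> before calling <code>torch.onnx.export</code>, so the branch is exported as an ONNX control-flow node instead of being frozen into a single traced path</li>
<li>As a last resort, rely on the <code>If</code> node supported by TensorRT's ONNX Parser (requires opset ≥ 13)</li>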
</ol>
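<p><strong>Problem 2: Unsupported operators</strong>
When you hit an unsupported operator, such as some newer activation function, there are three ways to handle it:</p>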
<ol>
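<li>Implement it as a combination of existing operators (for example, write Swish as x * sigmoid(x))</li>
<li>Write a custom TensorRT plugin</li>
<li>Upgrade TensorRT, since newer versions usually support more operators</li>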
</ol>
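<p>The first option is often the least work. As a minimal sketch of the Swish example, here is the decomposition in plain Python on scalars. In a PyTorch model the equivalent rewrite would be <code>x * torch.sigmoid(x)</code>, which exports as ordinary Mul and Sigmoid nodes. The function names are illustrative.</p>

```python
import math


def sigmoid(x):
    # Numerically stable form: avoids overflow in exp() for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish (also known as SiLU) built from two basic operators."""
    return x * sigmoid(x)


for x in (-2.0, 0.0, 1.0, 4.0):
    print(f"swish({x}) = {swish(x):.6f}")
```

<p>Because the result is built only from operators every ONNX parser understands, no plugin is needed.</p>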
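<h2 id="五用-python-api-构建-tensorrt-引擎">5. Building a TensorRT Engine with the Python API</h2>
<p>Now that we have an ONNX model, the next step is to compile it into an inference engine with TensorRT's Python API.</p>
<h3 id="51-基础构建流程">5.1 Basic Build Flow</h3>
<p>Here is a complete build script first; we will then walk through it step by step:</p>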
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Create the Logger</span>
</span></span><span class="line"><span class="cl"><span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Create the Builder and the Network</span>
</span></span><span class="line"><span class="cl"><span class="n">builder</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Builder</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">network</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_network</span><span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="nb">int</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">NetworkDefinitionCreationFlag</span><span class="o">.</span><span class="n">EXPLICIT_BATCH</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Create the ONNX Parser</span>
</span></span><span class="line"><span class="cl"><span class="n">parser</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">OnnxParser</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. Parse the ONNX file</span>
</span></span><span class="line"><span class="cl"><span class="n">success</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_from_file</span><span class="p">(</span><span class="s2">&#34;resnet50_sim.onnx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Failed to parse ONNX file&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">error</span> <span class="ow">in</span> <span class="n">parser</span><span class="o">.</span><span class="n">errors</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">error</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 5. Configure the build parameters</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_builder_config</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>  <span class="c1"># enable FP16 precision</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_memory_pool_limit</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">MemoryPoolType</span><span class="o">.</span><span class="n">WORKSPACE</span><span class="p">,</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">30</span><span class="p">)</span>  <span class="c1"># 1 GB workspace</span>
</span></span><span class="line"><span class="cl">
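</span></span><span class="line"><span class="cl"><span class="c1"># 6. Build the serialized engine (returns None on failure)</span>
</span></span><span class="line"><span class="cl"><span class="n">serialized_engine</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">build_serialized_network</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">serialized_engine</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">raise</span> <span class="ne">RuntimeError</span><span class="p">(</span><span class="s2">&#34;Engine build failed; check the log output above&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 7. Save the engine to a file</span>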
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">serialized_engine</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Engine built successfully!&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The flow has many steps, but the logic is clear: Logger → Builder → Network → Parser → Config → Engine.</p>
<h3 id="52-关键配置选项">5.2 Key Configuration Options</h3>
<p>BuilderConfig has many important switches. Here are the most commonly used ones:</p>
<p><strong>Precision</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>       <span class="c1"># enable FP16</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">INT8</span><span class="p">)</span>       <span class="c1"># enable INT8</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">OBEY_PRECISION_CONSTRAINTS</span><span class="p">)</span> <span class="c1"># enforce requested precisions, no automatic fallback (replaces the removed STRICT_TYPES)</span>
</span></span></code></pre></div><p><strong>Debugging</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>      <span class="c1"># enable layer debugging (synchronizes after every layer)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">profiling_verbosity</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">ProfilingVerbosity</span><span class="o">.</span><span class="n">DETAILED</span>  <span class="c1"># keep detailed per-layer information for profiling</span>
</span></span></code></pre></div><p><strong>Performance</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">TF32</span><span class="p">)</span>       <span class="c1"># allow TF32 computation (Ampere+)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">SPARSE_WEIGHTS</span><span class="p">)</span>  <span class="c1"># allow structured-sparsity kernels (Ampere+); BuilderFlag has no FAST_MATH member</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">PREFER_PRECISION_CONSTRAINTS</span><span class="p">)</span> <span class="c1"># Prefer the requested layer precisions, falling back only when no such implementation exists</span>
</span></span></code></pre></div><h3 id="53-动态-shape-配置">5.3 Dynamic Shape Configuration</h3>
<p>If your ONNX model was exported with dynamic axes, you also need to configure an optimization profile:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create an optimization profile</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_optimization_profile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Set the minimum, optimal, and maximum shapes</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">.</span><span class="n">set_shape</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;input&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nb">min</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">opt</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">640</span><span class="p">,</span> <span class="mi">640</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nb">max</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1280</span><span class="p">,</span> <span class="mi">1280</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Add the profile to the config</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">add_optimization_profile</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span>
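</span></span></code></pre></div><p>TensorRT tunes its kernels most aggressively for the <code>opt</code> shape while guaranteeing that any shape between <code>min</code> and <code>max</code> still runs correctly. Keep the three shapes reasonably close to each other: the wider the range, the further real inputs can drift from the shape the kernels were tuned for, and the more performance drops.</p>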
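<p>Because inputs outside the profile range are rejected at runtime, it can help to validate a shape before handing it to the execution context. The following is a minimal sketch in plain Python; <code>shape_in_profile</code> is a hypothetical helper (not part of TensorRT), and the three constants simply repeat the values from the profile above.</p>

```python
# Shapes copied from the optimization profile above.
MIN_SHAPE = (1, 3, 224, 224)
OPT_SHAPE = (1, 3, 640, 640)
MAX_SHAPE = (1, 3, 1280, 1280)


def shape_in_profile(shape, min_shape=MIN_SHAPE, max_shape=MAX_SHAPE):
    """Return True if every dimension of `shape` lies inside [min, max]."""
    if len(shape) != len(min_shape) or len(shape) != len(max_shape):
        return False
    # Clamping a dimension into [lo, hi] leaves it unchanged only if it
    # was already inside that range.
    return all(
        max(lo, min(dim, hi)) == dim
        for dim, lo, hi in zip(shape, min_shape, max_shape)
    )


if __name__ == "__main__":
    for candidate in [OPT_SHAPE, (1, 3, 1920, 1080), (4, 3, 640, 640)]:
        print(candidate, "fits profile:", shape_in_profile(candidate))
```

<p>Note that the profile above fixes the batch dimension at 1, so a batched input fails this check even when its height and width are within range.</p>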
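<h2 id="六int8-量化把性能推到极限">6. INT8 Quantization: Pushing Performance to the Limit</h2>
<p>FP16 is already fast, but if you want to squeeze out up to roughly another 2x, the next step is INT8 quantization.</p>
<p>The idea behind INT8 is simple: map the 32-bit floating-point weights and activations onto the 8-bit integer range [-128, 127]. Choosing a mapping that keeps the accuracy loss small is where the real work lies.</p>
<h3 id="61-为什么需要校准">6.1 Why Is Calibration Needed?</h3>
<p>The value range of the weights is known ahead of time, but the range of the activations (the output of each layer) depends on the input data. If we pick a scale factor arbitrarily, most activations may collapse to values near 0, or overflow and get clipped.</p>
<p>So we run inference over a batch of <strong>representative real data</strong>, collect the actual distribution of each layer's activations, and then choose the best scale factor. This process is called <strong>calibration</strong>.</p>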
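<p>To make the two failure modes concrete, here is a toy illustration in plain Python. It is not TensorRT's calibration algorithm: the activation values, the three scale factors, and the <code>quantize</code> helper are all made up for the example. It only shows how the choice of scale trades resolution against clipping.</p>

```python
def quantize(values, scale):
    """Symmetric INT8 quantization: round(x / scale), clamped to [-128, 127].

    Returns the quantized values and how many of them were clipped.
    """
    quantized, clipped = [], 0
    for v in values:
        raw = round(v / scale)
        q = max(-128, min(127, raw))
        clipped += int(raw != q)
        quantized.append(q)
    return quantized, clipped


# Toy activations: 101 values in [-0.5, 0.5] plus a single large outlier.
activations = [0.01 * i for i in range(-50, 51)] + [40.0]

# Scale picked from the outlier: nothing clips, but the bulk of the values
# collapses onto a handful of integer levels around 0.
q_large, clipped_large = quantize(activations, 40.0 / 127)

# Scale picked far too small: fine resolution, but most values clip.
q_small, clipped_small = quantize(activations, 0.05 / 127)

# Scale matched to the bulk of the distribution: only the outlier clips.
q_fit, clipped_fit = quantize(activations, 0.5 / 127)

for label, q, clipped in [
    ("outlier-driven", q_large, clipped_large),
    ("too small", q_small, clipped_small),
    ("fits the bulk", q_fit, clipped_fit),
]:
    print(f"{label:15s} distinct levels: {len(set(q)):3d}  clipped: {clipped}")
```

<p>Calibration automates the search for the third kind of scale, per tensor, using the activation statistics gathered from real data.</p>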
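<h3 id="62-实现校准器">6.2 Implementing a Calibrator</h3>
<p>TensorRT ships several built-in calibration algorithms; we only need to subclass the matching base class and implement the data-feeding part:</p>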
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.driver</span> <span class="k">as</span> <span class="nn">cuda</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.autoinit</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">os</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">ImageBatchStream</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">calib_files</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">calib_files</span> <span class="o">=</span> <span class="n">calib_files</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">calib_files</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">max_batches</span> <span class="o">=</span> <span class="mi">100</span>  <span class="c1"># 100 batches are enough</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">next_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_count</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_batches</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">            <span class="n">batch</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="c1"># load_image() and preprocess() are not shown here; supply your own implementations</span>
</span></span><span class="line"><span class="cl">                <span class="n">img</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">load_image</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">calib_files</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span> <span class="o">+</span> <span class="n">j</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">                <span class="n">batch</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">preprocess</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">yield</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Int8Calibrator</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">IInt8EntropyCalibrator2</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch_stream</span><span class="p">,</span> <span class="n">cache_file</span><span class="o">=</span><span class="s2">&#34;calibration.cache&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">trt</span><span class="o">.</span><span class="n">IInt8EntropyCalibrator2</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_stream</span> <span class="o">=</span> <span class="n">batch_stream</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span> <span class="o">=</span> <span class="n">cache_file</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">d_input</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">mem_alloc</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="n">batch_stream</span><span class="o">.</span><span class="n">batch_size</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batches</span> <span class="o">=</span> <span class="n">batch_stream</span><span class="o">.</span><span class="n">next_batch</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_batch_size</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_stream</span><span class="o">.</span><span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">names</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">batch</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batches</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_htod</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_input</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_input</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">StopIteration</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">read_calibration_cache</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">write_calibration_cache</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cache</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">cache</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="63-四种校准算法的选择">6.3 Choosing Among the Four Calibration Algorithms</h3>
<p>TensorRT provides four calibrators, each with a different focus:</p>
<table>
<thead>
<tr>
<th>Calibrator</th>
<th>Principle</th>
<th>Use case</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>IInt8EntropyCalibrator2</code></td>
<td>Minimizes KL divergence</td>
<td>Classification</td>
<td>Best</td>
</tr>
<tr>
<td><code>IInt8MinMaxCalibrator</code></td>
<td>Uses the full min/max range</td>
<td>Detection, segmentation</td>
<td>Good</td>
</tr>
<tr>
<td><code>IInt8LegacyCalibrator</code></td>
<td>User-tuned parameters (for example a percentile cutoff)</td>
<td>Compatibility with old code</td>
<td>Fair</td>
</tr>
<tr>
<td><code>IInt8EntropyCalibrator</code></td>
<td>Original entropy calibration</td>
<td>Not recommended</td>
<td>Fair</td>
</tr>
</tbody>
</table>
<p><strong>Rules of thumb</strong>:</p>
<ul>
<li>Classification → EntropyCalibrator2</li>
<li>Detection/segmentation → MinMaxCalibrator</li>
<li>First attempt → start with MinMax, then try Entropy if the results are poor</li>
</ul>
<h3 id="64-开启-int8-构建">6.4 Enabling the INT8 Build</h3>
<p>With a calibrator in place, building the engine is straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prepare the calibration data</span>
</span></span><span class="line"><span class="cl"><span class="n">calib_files</span> <span class="o">=</span> <span class="n">get_calibration_images</span><span class="p">(</span><span class="s2">&#34;/path/to/coco/val2017&#34;</span><span class="p">,</span> <span class="n">num_images</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">batch_stream</span> <span class="o">=</span> <span class="n">ImageBatchStream</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">calib_files</span><span class="o">=</span><span class="n">calib_files</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">calibrator</span> <span class="o">=</span> <span class="n">Int8Calibrator</span><span class="p">(</span><span class="n">batch_stream</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Configure INT8</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">INT8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">int8_calibrator</span> <span class="o">=</span> <span class="n">calibrator</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># FP16 can be enabled at the same time; TensorRT picks the fastest precision for each layer</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build the engine</span>
</span></span><span class="line"><span class="cl"><span class="n">serialized_engine</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">build_serialized_network</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
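</span></span></code></pre></div><p><strong>The choice of calibration dataset matters</strong>:</p>
<ul>
<li>Size: 500 to 2000 images are usually enough</li>
<li>Distribution: it must match the data seen at inference time</li>
<li>Diversity: cover a variety of scenes, lighting conditions, and angles</li>
<li>Do not use the training set! Use a subset of the validation set</li>
</ul>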
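<p>The build snippet above calls <code>get_calibration_images</code> without defining it. One possible sketch, assuming the validation images sit in a single flat directory, is shown below. The extension list, the <code>seed</code> parameter, and the uniform random sampling are assumptions of this example rather than anything TensorRT requires.</p>

```python
import os
import random

# Assumed set of image file extensions; adjust to your dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def get_calibration_images(image_dir, num_images=1000, seed=0):
    """Return up to `num_images` image paths sampled uniformly from `image_dir`."""
    candidates = sorted(
        os.path.join(image_dir, name)
        for name in os.listdir(image_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    # A fixed seed keeps the subset reproducible, so a cached calibration
    # result still corresponds to the same images on the next run.
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_images, len(candidates)))
```

<p>Uniform sampling preserves the distribution of the validation set. If your production traffic is skewed toward certain scenes, a stratified sample that mirrors that skew may be a better fit.</p>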
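<h3 id="65-常见量化坑">6.5 Common Quantization Pitfalls</h3>
<p><strong>Pitfall 1: some layers do not support INT8</strong></p>
<p>Not every operator has an INT8 implementation. When TensorRT hits an unsupported operator, it automatically falls back to FP16 or FP32. This is normal and nothing to worry about. You can use the <code>inspector</code> to check the actual precision of each layer:</p>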
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">inspector</span> <span class="o">=</span> <span class="n">engine</span><span class="o">.</span><span class="n">create_engine_inspector</span><span class="p">()</span>
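</span></span><span class="line"><span class="cl"><span class="c1"># Detailed per-layer information requires an engine built with ProfilingVerbosity.DETAILED</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">inspector</span><span class="o">.</span><span class="n">get_engine_information</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">LayerInformationFormat</span><span class="o">.</span><span class="n">ONELINE</span><span class="p">))</span>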
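</span></span></code></pre></div><p><strong>Pitfall 2: mAP drops too much after quantization</strong></p>
<p>If accuracy drops too much after quantization, try the following:</p>
<ol>
<li>Increase the number of calibration images</li>
<li>Switch to a different calibration algorithm</li>
<li>Force the sensitive layers to FP16</li>
<li>Use QAT (quantization-aware training) instead of PTQ</li>
</ol>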
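<p>Forcing sensitive layers to FP16 can be sketched as a loop over the network definition. The helper below is hypothetical, and matching layers by name substring is just one assumed way of picking them. It relies only on the <code>num_layers</code>, <code>get_layer</code>, <code>name</code>, <code>precision</code>, <code>num_outputs</code>, and <code>set_output_type</code> members of the TensorRT network and layer objects, and takes the data type as an argument so the function itself does not import TensorRT.</p>

```python
def pin_layers_to_precision(network, name_keywords, dtype):
    """Set precision and output types of matching layers; return their names.

    `network` is expected to behave like a TensorRT INetworkDefinition and
    `dtype` like a trt.DataType value (for example trt.DataType.HALF).
    """
    pinned = []
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(keyword in layer.name for keyword in name_keywords):
            layer.precision = dtype
            for j in range(layer.num_outputs):
                layer.set_output_type(j, dtype)
            pinned.append(layer.name)
    return pinned


# Intended use with TensorRT (not executed here; the keywords are made up):
#   import tensorrt as trt
#   pin_layers_to_precision(network, ["Softmax", "head"], trt.DataType.HALF)
#   config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
```

<p>The per-layer setting is only a request unless the builder is told to honor it, which is what the precision-constraint flag in the usage comment is for. On older releases the <code>STRICT_TYPES</code> flag shown earlier plays a similar role.</p>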
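<p><strong>Pitfall 3: the first build is too slow</strong></p>
<p>INT8 calibration runs inference many times, so the first build can take tens of minutes. Don't worry: the calibration result is cached, so the second build is much faster. If you change the model or the calibration data, delete the cache file so that stale scale factors are not reused.</p>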
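<p>One way to avoid reusing a stale cache by accident is to derive the cache file name from the model and the calibration file list. This is a sketch under assumptions: <code>calibration_cache_path</code> is a hypothetical helper, and hashing the ONNX file plus the image paths is just one possible key. It does not notice images that are edited in place.</p>

```python
import hashlib
import os


def calibration_cache_path(onnx_path, calib_files, cache_dir="."):
    """Build a cache file name that changes with the model or the image list."""
    digest = hashlib.sha256()
    # Hash the model file in chunks so large models do not need to fit in memory.
    with open(onnx_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    # Hash the sorted image paths so the key does not depend on list order.
    for path in sorted(calib_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
    name = "calibration_" + digest.hexdigest()[:16] + ".cache"
    return os.path.join(cache_dir, name)
```

<p>The result can be passed as the <code>cache_file</code> argument of the <code>Int8Calibrator</code> class from section 6.2, so a changed model or image list automatically triggers a fresh calibration.</p>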
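<h2 id="七python-推理实现">7. Python Inference Implementation</h2>
<p>The engine is built, so we can finally run inference! Let's write a complete inference class.</p>
<h3 id="71-基础推理类">7.1 Basic Inference Class</h3>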
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.driver</span> <span class="k">as</span> <span class="nn">cuda</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.autoinit</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TensorRTInfer</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">engine_path</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Load the engine</span>
</span></span><span class="line"><span class="cl">        <span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">engine_path</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">,</span> <span class="n">trt</span><span class="o">.</span><span class="n">Runtime</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span> <span class="k">as</span> <span class="n">runtime</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">engine</span> <span class="o">=</span> <span class="n">runtime</span><span class="o">.</span><span class="n">deserialize_cuda_engine</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Create the execution context</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">context</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">create_execution_context</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Allocate device memory for the inputs and outputs</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">binding</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">size</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">volume</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_shape</span><span class="p">(</span><span class="n">binding</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">dtype</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">nptype</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_dtype</span><span class="p">(</span><span class="n">binding</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Allocate device memory for this binding</span>
</span></span><span class="line"><span class="cl">            <span class="n">device_mem</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">mem_alloc</span><span class="p">(</span><span class="n">size</span> <span class="o">*</span> <span class="n">dtype</span><span class="p">()</span><span class="o">.</span><span class="n">itemsize</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">device_mem</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Create the CUDA stream</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stream</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">Stream</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">infer</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_data</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># input_data: numpy array on CPU</span>
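</span></span><span class="line"><span class="cl">        <span class="c1"># 1. Copy the input to the GPU (the host array must be contiguous)</span>
</span></span><span class="line"><span class="cl">        <span class="n">input_data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_htod_async</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">input_data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="p">)</span>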
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 2. Run inference</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">context</span><span class="o">.</span><span class="n">execute_async_v2</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">bindings</span><span class="o">=</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="k">for</span> <span class="n">buf</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="n">stream_handle</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="o">.</span><span class="n">handle</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 3. Copy the output back to the CPU</span>
</span></span><span class="line"><span class="cl">        <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_shape</span><span class="p">(</span><span class="mi">1</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_dtoh_async</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 4. Synchronize and wait for the stream to finish</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">output</span>
</span></span></code></pre></div><h3 id="72-使用示例">7.2 Usage Example</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Initialize the inference wrapper</span>
</span></span><span class="line"><span class="cl"><span class="n">infer</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Preprocess the image</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">load_image</span><span class="p">(</span><span class="s2">&#34;test.jpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">input_data</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>  <span class="c1"># shape (1, 3, 224, 224)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run inference</span>
</span></span><span class="line"><span class="cl"><span class="n">output</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Post-process</span>
</span></span><span class="line"><span class="cl"><span class="n">probabilities</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">top5</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">probabilities</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">5</span><span class="p">:][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Top-5 predictions:&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">top5</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  Class </span><span class="si">{</span><span class="n">idx</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">probabilities</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">idx</span><span class="p">]</span><span class="si">:</span><span class="s2">.4f</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="73-性能测试">7.3 Performance Testing</h3>
<p>Let's write a simple benchmark script to measure how much faster TensorRT actually is than PyTorch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torchvision.models</span> <span class="k">as</span> <span class="nn">models</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># PyTorch benchmark</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">resnet50</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span><span class="o">.</span><span class="n">half</span><span class="p">()</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span><span class="o">.</span><span class="n">half</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">set_grad_enabled</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>  <span class="c1"># inference only: skip autograd bookkeeping</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># warmup</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># measure</span>
</span></span><span class="line"><span class="cl"><span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">pytorch_time</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;PyTorch FP16: </span><span class="si">{</span><span class="n">pytorch_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2"> ms/image&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># TensorRT benchmark</span>
</span></span><span class="line"><span class="cl"><span class="n">infer</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_np</span> <span class="o">=</span> <span class="n">dummy_input</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># warmup</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">dummy_np</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># measure</span>
</span></span><span class="line"><span class="cl"><span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">dummy_np</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">trt_time</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT FP16: </span><span class="si">{</span><span class="n">trt_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2"> ms/image&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Speedup: </span><span class="si">{</span><span class="n">pytorch_time</span> <span class="o">/</span> <span class="n">trt_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">x&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>On my RTX 3090, the results were:</p>
<pre tabindex="0"><code>PyTorch FP16: 1.98 ms/image
TensorRT FP16: 0.52 ms/image
Speedup: 3.81x
</code></pre><p>That is a 3.8x speedup, and we haven't even enabled INT8 yet! This is why TensorRT is worth the time it takes to learn.</p>
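<p>A mean over 1000 runs can hide occasional slow iterations, which matters if you have a hard latency budget. The sketch below is one possible way to record per-call latencies and report the mean, median, and 99th percentile. The names <code>benchmark</code> and <code>fake_infer</code> are hypothetical and not part of the script above; <code>fake_infer</code> is only a CPU stand-in so the snippet runs without a GPU.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import statistics
import time


def benchmark(fn, warmup=50, iters=1000):
    """Time fn() per call and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the last sample.
    p99_index = min(iters - 1, int(iters * 0.99))
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": statistics.median(samples_ms),
        "p99": samples_ms[p99_index],
    }


def fake_infer():
    """Hypothetical stand-in for infer.infer(dummy_np); it only burns a little CPU."""
    return sum(i * i for i in range(1000))


stats = benchmark(fake_infer, warmup=5, iters=200)
print(f"mean {stats['mean']:.3f} ms, p50 {stats['p50']:.3f} ms, p99 {stats['p99']:.3f} ms")
</code></pre>
<p>To time the real engine you would pass something like <code>lambda: infer.infer(dummy_np)</code>, assuming that call blocks until the output is back on the host. For the PyTorch model, the callable has to call <code>torch.cuda.synchronize()</code> before it returns, because CUDA kernels are launched asynchronously and otherwise only the launch would be timed.</p>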
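<h2 id="八生产级-c-部署">8. Production-Grade C++ Deployment</h2>
<p>Python is well suited to quick validation, but in real production environments we usually use C++. The reasons are simple:</p>
<ul>
<li>Better performance (no Python GIL overhead)</li>
<li>Easier deployment (no bulky Python environment required)</li>
<li>More stability (memory management is easier to control)</li>
</ul>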
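<h3 id="81-c-推理类实现">8.1 C++ Inference Class Implementation</h3>
<p>Below is a C++ wrapper around TensorRT inference that you can adapt for your own project. It assumes one input tensor and one output tensor, both float32. The <code>volume(dims)</code> calls stand for the product of a tensor's dimensions; as far as I know, TensorRT does not export such a helper from the <code>nvinfer1</code> namespace (the official samples define one as <code>samplesCommon::volume</code>), so you will need to supply it yourself:</p>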
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;NvInfer.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;NvOnnxParser.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;cuda_runtime_api.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;fstream&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;vector&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;memory&gt;</span><span class="cp">
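</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdexcept&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;string&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;cstdio&gt;</span><span class="cp">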
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Logger</span> <span class="o">:</span> <span class="k">public</span> <span class="n">nvinfer1</span><span class="o">::</span><span class="n">ILogger</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">log</span><span class="p">(</span><span class="n">Severity</span> <span class="n">severity</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">msg</span><span class="p">)</span> <span class="k">noexcept</span> <span class="k">override</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">severity</span> <span class="o">&lt;=</span> <span class="n">Severity</span><span class="o">::</span><span class="n">kWARNING</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">printf</span><span class="p">(</span><span class="s">&#34;[TensorRT] %s</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">,</span> <span class="n">msg</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TensorRTInfer</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="k">public</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">TensorRTInfer</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">engine_path</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Read the serialized engine file
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">file</span><span class="p">(</span><span class="n">engine_path</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">binary</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">&#34;Cannot open engine file&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">seekg</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">end</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">file</span><span class="p">.</span><span class="n">tellg</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">seekg</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">beg</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">char</span><span class="o">&gt;</span> <span class="n">engine_data</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">engine_data</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Deserialize the engine
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_runtime</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">createInferRuntime</span><span class="p">(</span><span class="n">m_logger</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">        <span class="n">m_engine</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">m_runtime</span><span class="o">-&gt;</span><span class="n">deserializeCudaEngine</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">engine_data</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">size</span>
</span></span><span class="line"><span class="cl">        <span class="p">));</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">m_engine</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">&#34;Failed to deserialize engine&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Create the execution context
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_context</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">createExecutionContext</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Allocate device memory (assumes float32 I/O tensors)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_buffers</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getNbIOTensors</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getNbIOTensors</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">name</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getIOTensorName</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="k">auto</span> <span class="n">dims</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="n">size_t</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">nvinfer1</span><span class="o">::</span><span class="n">volume</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m_buffers</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorIOMode</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="o">==</span> <span class="n">nvinfer1</span><span class="o">::</span><span class="n">TensorIOMode</span><span class="o">::</span><span class="n">kINPUT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_input_name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_input_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_output_name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_output_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Create a CUDA stream
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaStreamCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="o">~</span><span class="n">TensorRTInfer</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">buf</span> <span class="p">:</span> <span class="n">m_buffers</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaFree</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="n">cudaStreamDestroy</span><span class="p">(</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Disable copying
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">TensorRTInfer</span><span class="p">(</span><span class="k">const</span> <span class="n">TensorRTInfer</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">TensorRTInfer</span><span class="o">&amp;</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">TensorRTInfer</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">infer</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">input</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span> <span class="kt">int</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Set the input shape (if the batch dimension is dynamic)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="k">auto</span> <span class="n">dims</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">dims</span><span class="p">.</span><span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">dims</span><span class="p">.</span><span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setInputShape</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">,</span> <span class="n">dims</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Host-to-device copy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaMemcpyAsync</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_input_idx</span><span class="p">],</span> <span class="n">input</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">nvinfer1</span><span class="o">::</span><span class="n">volume</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaMemcpyHostToDevice</span><span class="p">,</span> <span class="n">m_stream</span>
</span></span><span class="line"><span class="cl">        <span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Set the tensor addresses
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setTensorAddress</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_input_idx</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">        <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setTensorAddress</span><span class="p">(</span><span class="n">m_output_name</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_output_idx</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Run inference
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">enqueueV3</span><span class="p">(</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Device-to-host copy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="k">auto</span> <span class="n">out_dims</span> <span class="o">=</span> <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">m_output_name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">cudaMemcpyAsync</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">output</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_output_idx</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="n">nvinfer1</span><span class="o">::</span><span class="n">volume</span><span class="p">(</span><span class="n">out_dims</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaMemcpyDeviceToHost</span><span class="p">,</span> <span class="n">m_stream</span>
</span></span><span class="line"><span class="cl">        <span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Wait for the stream to finish
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaStreamSynchronize</span><span class="p">(</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="k">private</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">Logger</span> <span class="n">m_logger</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">IRuntime</span><span class="o">&gt;</span> <span class="n">m_runtime</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">ICudaEngine</span><span class="o">&gt;</span> <span class="n">m_engine</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">IExecutionContext</span><span class="o">&gt;</span> <span class="n">m_context</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">void</span><span class="o">*&gt;</span> <span class="n">m_buffers</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">cudaStream_t</span> <span class="n">m_stream</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">m_input_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">m_output_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">m_input_idx</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">m_output_idx</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><h3 id="82-cmakeliststxt">8.2 CMakeLists.txt</h3>
<p>To make the build easier, here is the CMakeLists.txt as well:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cmake" data-lang="cmake"><span class="line"><span class="cl"><span class="nb">cmake_minimum_required</span><span class="p">(</span><span class="s">VERSION</span> <span class="s">3.18</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">project</span><span class="p">(</span><span class="s">tensorrt_infer</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_CXX_STANDARD</span> <span class="s">17</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># CUDA
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">find_package</span><span class="p">(</span><span class="s">CUDA</span> <span class="s">REQUIRED</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">include_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">CUDA_INCLUDE_DIRS</span><span class="o">}</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># TensorRT
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">set</span><span class="p">(</span><span class="s">TENSORRT_ROOT</span> <span class="s">/path/to/TensorRT-10.1.0.27</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">include_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">TENSORRT_ROOT</span><span class="o">}</span><span class="s">/include</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">link_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">TENSORRT_ROOT</span><span class="o">}</span><span class="s">/lib</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Executable
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">add_executable</span><span class="p">(</span><span class="s">infer</span> <span class="s">main.cpp</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">target_link_libraries</span><span class="p">(</span><span class="s">infer</span>
</span></span><span class="line"><span class="cl">    <span class="o">${</span><span class="nv">CUDA_LIBRARIES</span><span class="o">}</span>
</span></span><span class="line"><span class="cl">    <span class="s">nvinfer</span>
</span></span><span class="line"><span class="cl">    <span class="s">nvonnxparser</span>
</span></span><span class="line"><span class="cl">    <span class="s">cudart</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="err">
</span></span></span></code></pre></div><h3 id="83-使用示例">8.3 Usage Example</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">TensorRTInfer</span> <span class="n">infer</span><span class="p">(</span><span class="s">&#34;resnet50_fp16.engine&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Prepare the input and output buffers
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">input</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="mi">224</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">output</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Fill input...
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        
</span></span><span class="line"><span class="cl">        <span class="c1">// Run inference
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">infer</span><span class="p">.</span><span class="n">infer</span><span class="p">(</span><span class="n">input</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">output</span><span class="p">.</span><span class="n">data</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Process output...
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">&#34;Inference done!&#34;</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">exception</span><span class="o">&amp;</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">&#34;Error: &#34;</span> <span class="o">&lt;&lt;</span> <span class="n">e</span><span class="p">.</span><span class="n">what</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
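</span></span></code></pre></div><h2 id="九进阶优化技巧">9. Advanced Optimization Techniques</h2>
<p>With the basics covered, let's look at some advanced techniques that can push performance a step further.</p>
<h3 id="91-多流并发">9.1 Multi-Stream Concurrency</h3>
<p>If your application has to process several video streams at once, you can give each one its own CUDA stream so that their work can overlap on the GPU whenever a single inference does not already saturate it:</p>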
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create several inference instances, each with its own stream</span>
</span></span><span class="line"><span class="cl"><span class="n">infer1</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;model.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">infer2</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;model.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run each instance in its own thread;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># the GPU can overlap their work when it has spare resources</span>
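</span></span></code></pre></div><p>Note: an <code>IExecutionContext</code> can run only one inference at a time, so multi-stream execution needs one context per stream. The contexts can share a single deserialized <code>ICudaEngine</code>, which avoids keeping a second copy of the weights in GPU memory.</p>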
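<h3 id="92-流水处理">9.2 Pipelining</h3>
<p>For throughput-oriented workloads, split preprocessing, inference, and postprocessing into pipeline stages and connect them with producer-consumer queues:</p>
<pre tabindex="0"><code>Thread 1: read video → decode → preprocess → push to input queue
Thread 2: pop from input queue → TensorRT inference → push to result queue
Thread 3: pop from result queue → postprocess → display/save
</code></pre><p>The three stages then run overlapped, so neither the CPU nor the GPU sits idle waiting for the other. In real projects this usually raises overall throughput by a further 30-50%.</p>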
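<p>The three-thread layout above can be sketched with nothing but the Python standard library. This is a minimal illustration, not production code: <code>preprocess</code>, <code>run_inference</code> and <code>postprocess</code> are hypothetical stand-ins for the real decode, TensorRT, and output-handling steps, and the queue size of 4 is an arbitrary choice.</p>

```python
import queue
import threading

SENTINEL = object()  # unique marker that tells a stage the stream has ended


def preprocess(frame):
    # Hypothetical stand-in for decode + resize + normalize.
    return frame * 2


def run_inference(tensor):
    # Hypothetical stand-in for the TensorRT call.
    return tensor + 1


def postprocess(result):
    # Hypothetical stand-in for turning model output into something usable.
    return "result=" + str(result)


def reader_stage(frames, out_q):
    # Thread 1: read and preprocess, then push to the input queue.
    for frame in frames:
        out_q.put(preprocess(frame))
    out_q.put(SENTINEL)


def inference_stage(in_q, out_q):
    # Thread 2: pop from the input queue, infer, push to the result queue.
    while (item := in_q.get()) is not SENTINEL:
        out_q.put(run_inference(item))
    out_q.put(SENTINEL)  # pass the shutdown signal downstream


def writer_stage(in_q, outputs):
    # Thread 3: pop from the result queue and postprocess.
    while (item := in_q.get()) is not SENTINEL:
        outputs.append(postprocess(item))


def run_pipeline(frames, queue_size=4):
    # Bounded queues give back-pressure: a fast stage blocks instead of
    # piling up unbounded work in front of a slow one.
    input_q = queue.Queue(maxsize=queue_size)
    result_q = queue.Queue(maxsize=queue_size)
    outputs = []
    threads = [
        threading.Thread(target=reader_stage, args=(frames, input_q)),
        threading.Thread(target=inference_stage, args=(input_q, result_q)),
        threading.Thread(target=writer_stage, args=(result_q, outputs)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


print(run_pipeline([1, 2, 3]))
```

<p>Because each stage has exactly one worker, results come out in the same order the frames went in. Adding more workers to a stage would need explicit reordering.</p>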
<h3 id="93-权重精简">9.3 Weight Stripping</h3>
<p>If the generated engine file turns out to be very large, try this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">STRIP_PLAN</span><span class="p">)</span>
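</span></span></code></pre></div><p>This flag builds a weight-stripped engine: what it leaves out of the plan file is the refittable weights, not debug information. How much the file shrinks depends on how much of the plan the weights account for, so for weight-dominated models the saving can be well above 30-50%. The trade-off is that a stripped engine cannot run as is; the weights have to be refitted from the original model (for example the ONNX file) when the engine is loaded.</p>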
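<h3 id="94-避免不必要的内存拷贝">9.4 Avoiding Unnecessary Memory Copies</h3>
<p>Often the bottleneck is not TensorRT itself but the H2D/D2H memory copies. There are a few directions to optimize:</p>
<ol>
<li><strong>Preprocess directly on the GPU</strong>: do the resize and normalize steps in CUDA kernels so the data never has to go back to the CPU</li>
<li><strong>Use pinned memory</strong>: copies from page-locked memory allocated with <code>cudaHostAlloc</code> are often 2-3x faster than copies from ordinary <code>malloc</code> memory, and pinned buffers are also what lets <code>cudaMemcpyAsync</code> overlap with other work</li>
<li><strong>Batch the work</strong>: process several images per call whenever you can, to amortize the copy overhead</li>
</ol>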
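<p>The batching point is simple arithmetic: every call pays a fixed cost (launch, synchronization, copy setup) that is shared by all images in the batch. The sketch below uses a deliberately simplified cost model and hypothetical numbers (0.5 ms fixed, 0.25 ms per image); neither comes from a measurement.</p>

```python
def per_image_cost_ms(fixed_overhead_ms, per_image_ms, batch_size):
    """Average cost per image when a fixed per-call overhead is shared by a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_overhead_ms / batch_size + per_image_ms


# Hypothetical numbers, for illustration only.
for batch_size in (1, 4, 16):
    print(batch_size, per_image_cost_ms(0.5, 0.25, batch_size))
```

<p>The fixed share shrinks quickly at first and then flattens out, which is why moving from batch 1 to batch 4 usually matters far more than moving from 16 to 64.</p>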
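<h2 id="十常见问题与排错">10. Common Problems and Troubleshooting</h2>
<p>TensorRT has a steep learning curve, and running into problems is normal. Here are the most common pitfalls and how to get out of them.</p>
<h3 id="101-构建失败">10.1 Build Failures</h3>
<p><strong>Symptom</strong>: <code>build_serialized_network</code> returns None</p>
<p><strong>Troubleshooting steps</strong>:</p>
<ol>
<li>Set the Logger level to VERBOSE and read the detailed output</li>
<li>Check whether the workspace limit is too small (allow at least 512MB)</li>
<li>Confirm that the ONNX model itself is valid: <code>onnx.checker.check_model()</code></li>
<li>Run the model through onnxsim</li>
<li>With dynamic shapes, check that the min/opt/max ranges in the optimization profile are correct</li>
</ol>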
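<p>Steps 1 and 2 can be combined into a small build helper that fails loudly instead of returning None. This is a sketch that assumes the TensorRT 10 Python API (the version used in the CMakeLists.txt above); the function name <code>build_engine_verbose</code> and its <code>workspace_bytes</code> parameter are my own choices. TensorRT is imported inside the function so the module can be loaded on a machine without it.</p>

```python
def build_engine_verbose(onnx_path, workspace_bytes=1 << 30):
    """Parse an ONNX file and build a serialized engine, reporting why it failed."""
    import tensorrt as trt  # lazy import: only needed when the function is called

    # VERBOSE makes the builder log every parsing and tactic-selection step.
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    # Assumption: TensorRT 10, where networks are always explicit-batch.
    network = builder.create_network(0)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [parser.get_error(i).desc() for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Default is 1 GiB of workspace; too small a limit can rule out every tactic.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("build_serialized_network returned None; see the VERBOSE log")
    return serialized
```

<p>A call such as <code>build_engine_verbose("model.onnx")</code> (the file name is a placeholder) either returns the serialized plan or raises with the parser's own error descriptions, which is usually enough to tell an export problem from a builder problem.</p>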
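<h3 id="102-推理结果不对">10.2 Incorrect Inference Results</h3>
<p><strong>Symptom</strong>: TensorRT's output does not match PyTorch's</p>
<p><strong>Troubleshooting steps</strong>:</p>
<ol>
<li>Test FP32 first; if FP32 already disagrees, the problem is in the export or parsing step</li>
<li>Check that preprocessing and postprocessing use the same value ranges on both sides</li>
<li>Check that the NCHW and NHWC layouts have not been mixed up</li>
<li>Check the RGB/BGR channel order</li>
<li>Set <code>BuilderFlag.OBEY_PRECISION_CONSTRAINTS</code> so the builder honours the per-layer precisions you set instead of silently falling back (the older <code>BuilderFlag.STRICT_TYPES</code> is deprecated and not available in TensorRT 10)</li>
</ol>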
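<p>"Does not match" is easier to act on once it is a number. The helper below is a dependency-free sketch that compares two flattened outputs by maximum absolute difference and cosine similarity. The function names and the default tolerances (<code>atol=1e-3</code>, <code>min_cosine=0.999</code>) are illustrative assumptions; reduced-precision engines generally need looser thresholds than FP32.</p>

```python
import math


def compare_outputs(reference, candidate):
    """Return (max absolute difference, cosine similarity) of two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    max_abs_diff = max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        # Cosine is undefined for a zero vector; treat two zero vectors as equal.
        cosine = 1.0 if norm_ref == norm_cand else 0.0
    else:
        cosine = dot / (norm_ref * norm_cand)
    return max_abs_diff, cosine


def outputs_match(reference, candidate, atol=1e-3, min_cosine=0.999):
    # Illustrative thresholds, not values taken from the article.
    max_abs_diff, cosine = compare_outputs(reference, candidate)
    return max_abs_diff <= atol and cosine >= min_cosine


print(compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

<p>A large absolute difference together with a cosine similarity near 1 usually points to a scaling issue such as a missing normalization, while a low cosine similarity suggests a layout or channel-order mix-up.</p>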
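<h3 id="103-内存泄漏">10.3 Memory Leaks</h3>
<p><strong>Symptom</strong>: memory usage keeps growing the longer the program runs</p>
<p><strong>Common causes</strong>:</p>
<ol>
<li>Forgetting to destroy the <code>IExecutionContext</code></li>
<li>Forgetting to free CUDA device memory</li>
<li>Creating a new context for every inference instead of reusing one</li>
<li>pycuda allocations that are never released properly</li>
</ol>
<p><strong>Best practice</strong>: create one engine and a small number of contexts for the whole lifetime of the program, and reuse them for every inference.</p>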
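<p>One way to enforce "a few contexts, always reused" is a small fixed-size pool. The sketch below is an assumption about how you might structure it, not a TensorRT API: <code>ContextPool</code> is a hypothetical helper, and <code>FakeContext</code> stands in for <code>IExecutionContext</code> so the example runs without a GPU.</p>

```python
import queue
from contextlib import contextmanager


class ContextPool:
    """Fixed-size pool that lends each context to one borrower at a time."""

    def __init__(self, create_context, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(create_context())  # contexts are created once, up front

    @contextmanager
    def borrow(self):
        ctx = self._free.get()  # blocks while every context is busy
        try:
            yield ctx
        finally:
            self._free.put(ctx)  # always hand it back, even if inference raised


class FakeContext:
    # Hypothetical stand-in for an IExecutionContext.
    created = 0

    def __init__(self):
        FakeContext.created += 1

    def infer(self, x):
        return x * 2


pool = ContextPool(FakeContext, size=2)
results = []
for x in range(5):
    with pool.borrow() as ctx:
        results.append(ctx.infer(x))

print(results, FakeContext.created)
```

<p>With a real engine the factory would be something like <code>engine.create_execution_context</code>. Five inferences here create only two contexts, and the blocking <code>get</code> also guarantees that no context is used by two threads at once.</p>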
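<h3 id="104-性能不如预期">10.4 Performance Below Expectations</h3>
<p><strong>Symptom</strong>: the speedup is under 2x, well short of the figures quoted in this article</p>
<p><strong>Possible causes</strong>:</p>
<ol>
<li>FP16 is not actually enabled: confirm that <code>BuilderFlag.FP16</code> is set on the builder config and that <code>builder.platform_has_fast_fp16</code> (a property, not a method) is true</li>
<li>The model is too small: kernel launch overhead then makes up a large share of the runtime</li>
<li>The batch size is too small: it takes a larger batch to keep the GPU fully occupied</li>
<li>The bottleneck is in preprocessing or postprocessing: use nsys profile to see where the time goes</li>
<li>The GPU is too old: Tensor Cores first appeared in Volta, so earlier architectures such as Pascal have none</li>
</ol>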
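<p>Before digging into any of these causes, it helps to confirm that the speedup is being measured fairly. The timing helper below is a minimal sketch with assumed names and defaults (<code>benchmark</code>, 10 warm-up runs, 100 timed runs). The callable you pass in must block until the GPU work has finished, for example by synchronizing the stream; otherwise only the launch time is measured.</p>

```python
import statistics
import time


def benchmark(fn, warmup=10, iterations=100):
    """Time `fn` after a warm-up and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (lazy initialization, caches, clocks)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p99_ms": samples[p99_index],
    }


# Stand-in workload so the sketch runs anywhere; replace it with a real inference call.
stats = benchmark(lambda: sum(range(1000)), warmup=2, iterations=20)
print(sorted(stats))
```

<p>Running the same helper over the PyTorch baseline and the TensorRT engine, with identical inputs and batch size, gives a like-for-like speedup; nsys is then the tool for finding where the remaining time goes.</p>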
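<h2 id="十一最佳实践总结">11. Best Practices Summary</h2>
<p>After hitting the pitfalls across many projects, I have put together a TensorRT best-practice checklist. Following it avoids the vast majority of problems:</p>
<h3 id="准备阶段">Preparation</h3>
<ul>
<li>✅ Use a Docker environment so you don't have to wrestle with dependencies</li>
<li>✅ Always run onnxsim after exporting to ONNX</li>
<li>✅ Get FP32 working first, then try FP16, and leave INT8 for last</li>
<li>✅ Check numerical agreement with PyTorch at every step</li>
</ul>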
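<h3 id="构建阶段">Build</h3>
<ul>
<li>✅ Start with a workspace limit of 1GB</li>
<li>✅ Keep the min/opt/max of dynamic shapes reasonably close together</li>
<li>✅ Calibrate INT8 with 500-2000 representative images</li>
<li>✅ Save the calibration cache and reuse it next time</li>
<li>✅ Build the engine on the hardware it will be deployed on: by default a plan file is tied to the GPU model and TensorRT version it was built with and cannot simply be copied to a different GPU</li>
</ul>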
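<h3 id="部署阶段">Deployment</h3>
<ul>
<li>✅ Use C++ in production and keep Python for validation</li>
<li>✅ Create only one engine for the whole program</li>
<li>✅ Reuse execution contexts instead of creating one per inference</li>
<li>✅ Use multiple streams to handle multiple inputs</li>
<li>✅ Move preprocessing onto the GPU wherever possible</li>
</ul>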
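<h3 id="调试阶段">Debugging</h3>
<ul>
<li>✅ Set the Logger to VERBOSE; the output is very informative</li>
<li>✅ Use <code>nsys profile</code> for performance analysis</li>
<li>✅ Use the Engine Inspector to check each layer's precision and fusion, and a profiler (for example <code>trtexec --dumpProfile</code>) for per-layer timing</li>
<li>✅ When you hit a problem, search the official NVIDIA forums first; many people have run into the same thing</li>
</ul>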
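<h2 id="总结">Summary</h2>
<p>TensorRT is a very powerful tool, but it is also one that takes time to master. It is not as friendly as PyTorch, you will run into all sorts of pitfalls, and a single problem can sometimes hold you up for days.</p>
<p>What I want to say is this: <strong>it is all worth it</strong>. When a model that could only manage 30 FPS reaches 300 FPS after TensorRT optimization with almost no loss of accuracy, the sense of achievement is hard to beat. More importantly, it means you can serve more requests on cheaper hardware and save your company real money.</p>
<p>This article covers the whole path from environment setup to production deployment, and the code is meant to be used directly. But the technology keeps moving: every TensorRT release adds features and performance improvements, so it is important to keep learning.</p>
<p>Finally, a few directions for further study:</p>
<ol>
<li><strong>Custom plugins</strong>: when an operator is not supported, extend TensorRT with your own CUDA kernel</li>
<li><strong>Quantization-aware training (QAT)</strong>: simulate quantization error during training, which gives better accuracy than PTQ</li>
<li><strong>Triton Inference Server</strong>: NVIDIA's open-source inference serving framework, a staple of production-grade deployment</li>
<li><strong>Multi-GPU inference</strong>: an essential skill in the era of large models</li>
</ol>
<p>I hope this article saves you a few detours. If you run into problems while using TensorRT, or have optimization tips of your own, feel free to get in touch.</p>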
<p>(End of article; the original Chinese text runs to about 7,500 characters.)</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
