<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>模型部署 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/%E6%A8%A1%E5%9E%8B%E9%83%A8%E7%BD%B2/</link>
    <description>Recent content in 模型部署 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Fri, 05 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/%E6%A8%A1%E5%9E%8B%E9%83%A8%E7%BD%B2/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Embedded NPU Architecture and Operator Optimization in Practice: From Memory Bandwidth to INT8 Deployment</title>
      <link>https://tech-snippets.xyz/posts/embedded-npu-architecture-operator-optimization-guide/</link>
      <pubDate>Fri, 05 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/embedded-npu-architecture-operator-optimization-guide/</guid>
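      <description>Preface: Why Does the Same Model Perform So Differently Across NPUs? When people first get an NPU board for embedded AI deployment, many share the same misconception: if the datasheet says 1 TOPS, 6 TOPS, or 10 TOPS, the model should speed up linearly with that number. Real projects often behave differently. The same YOLO, MobileNet, or keyword-spotting model runs smoothly on chip A yet stalls on a handful of operators on chip B; with the same INT8 quantization, some models lose almost no accuracy while others produce obvious false detections; with the same official conversion tool, some networks convert in one pass while others need repeated ONNX graph edits, operator replacements, and subgraph splits.
None of this is mysterious. It comes down to very tight coupling among the NPU's compute array, on-chip SRAM, DMA, data layout, compiler, and runtime. When CPU code is slow, we usually start with hot functions; when a GPU program is slow, we look at kernel occupancy, memory access, and thread blocks. Slow NPU deployments need a similar analysis framework: first determine whether the bottleneck is compute, bandwidth, operator support, quantization error, or scheduling overhead between the CPU and the NPU.
This article breaks down the typical architecture of an embedded NPU from an engineering perspective and follows a realistic deployment flow: model export, graph optimization, quantization calibration, operator mapping, memory planning, the runtime pipeline, and performance troubleshooting. It is not tied to any one vendor, but it covers issues common to RK, Amlogic, Kendryte, Cambricon edge modules, and many MCU-class NPUs. By the end, you should be able to tell why a model is not saturating the NPU and where to start optimizing.
1. What TOPS Actually Means TOPS stands for tera (trillion) operations per second and usually describes INT8 multiply-accumulate capability. A 2 TOPS NPU can, in theory, complete 2 trillion 8-bit integer operations per second. Vendors usually count one multiply-accumulate (MAC) as two operations, so that corresponds to roughly 1 trillion MACs per second. The catch is that this number is a peak under ideal conditions: inputs and outputs are already in a suitable data layout, operators map fully onto the matrix-multiply array, the on-chip cache hit rate is high enough, DMA transfers do not hold things back, and the scheduler does not switch tasks frequently.
In real models, the operators that use an NPU efficiently are usually convolution, depthwise convolution, fully connected layers, matrix multiplication, some pooling operators, and activation functions. Many unremarkable-looking operations, such as Reshape, Transpose, Slice, Gather, Resize, and NonMaxSuppression, may fall back to the CPU if the NPU does not support them natively. A CPU fallback costs more than compute time: it can also bring cache synchronization, data format conversion, and memory copies. A few such "breakpoints" in a model are enough to noticeably hurt end-to-end latency.
When evaluating an NPU, the following metrics are more useful than TOPS:</description>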
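      <content:encoded><![CDATA[<h2 id="前言为什么同一个模型在不同-npu-上差距很大">Preface: Why Does the Same Model Perform So Differently Across NPUs?</h2>
<p>When people first get an NPU board for embedded AI deployment, many share the same misconception: if the datasheet says 1 TOPS, 6 TOPS, or 10 TOPS, the model should speed up linearly with that number. Real projects often behave differently. The same YOLO, MobileNet, or keyword-spotting model runs smoothly on chip A yet stalls on a handful of operators on chip B; with the same INT8 quantization, some models lose almost no accuracy while others produce obvious false detections; with the same official conversion tool, some networks convert in one pass while others need repeated ONNX graph edits, operator replacements, and subgraph splits.</p>
<p>None of this is mysterious. It comes down to very tight coupling among the NPU's compute array, on-chip SRAM, DMA, data layout, compiler, and runtime. When CPU code is slow, we usually start with hot functions; when a GPU program is slow, we look at kernel occupancy, memory access, and thread blocks. Slow NPU deployments need a similar analysis framework: first determine whether the bottleneck is compute, bandwidth, operator support, quantization error, or scheduling overhead between the CPU and the NPU.</p>
<p>This article breaks down the typical architecture of an embedded NPU from an engineering perspective and follows a realistic deployment flow: model export, graph optimization, quantization calibration, operator mapping, memory planning, the runtime pipeline, and performance troubleshooting. It is not tied to any one vendor, but it covers issues common to RK, Amlogic, Kendryte, Cambricon edge modules, and many MCU-class NPUs. By the end, you should be able to tell why a model is not saturating the NPU and where to start optimizing.</p>
<p><img alt="Embedded NPU operator execution pipeline" loading="lazy" src="/images/embedded-npu-operator-pipeline.svg"></p>
<h2 id="一先把-tops-的含义说清楚">1. What TOPS Actually Means</h2>
<p>TOPS stands for tera (trillion) operations per second and usually describes INT8 multiply-accumulate capability. A 2 TOPS NPU can, in theory, complete 2 trillion 8-bit integer operations per second. Vendors usually count one multiply-accumulate (MAC) as two operations, so that corresponds to roughly 1 trillion MACs per second. The catch is that this number is a peak under ideal conditions: inputs and outputs are already in a suitable data layout, operators map fully onto the matrix-multiply array, the on-chip cache hit rate is high enough, DMA transfers do not hold things back, and the scheduler does not switch tasks frequently.</p>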
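<p>As a rough illustration of how far a measured result can sit from the peak, the sketch below derives a peak TOPS figure from a MAC count and clock, then estimates utilization from a measured latency. The MAC count, clock, model size, and latency are made-up numbers, and counting one MAC as two operations is the usual vendor convention rather than something every datasheet states.</p>

```python
def peak_tops(mac_units: int, freq_hz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak in TOPS: MAC units x clock x operations per MAC."""
    return mac_units * freq_hz * ops_per_mac / 1e12


def utilization(model_macs: float, latency_s: float, peak: float) -> float:
    """Fraction of the peak that one measured inference actually achieves."""
    achieved_tops = model_macs * 2 / latency_s / 1e12
    return achieved_tops / peak


# Hypothetical NPU: 1024 INT8 MAC units clocked at 1 GHz.
peak = peak_tops(1024, 1.0e9)
# Hypothetical model: 4 GMACs per inference, measured at 20 ms.
util = utilization(4e9, 0.020, peak)
print(f"peak = {peak:.3f} TOPS, utilization = {util:.1%}")
```

<p>With these assumed numbers the model reaches under a fifth of the peak, which is the kind of gap the rest of this article tries to explain.</p>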
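<p>In real models, the operators that use an NPU efficiently are usually convolution, depthwise convolution, fully connected layers, matrix multiplication, some pooling operators, and activation functions. Many unremarkable-looking operations, such as <code>Reshape</code>, <code>Transpose</code>, <code>Slice</code>, <code>Gather</code>, <code>Resize</code>, and <code>NonMaxSuppression</code>, may fall back to the CPU if the NPU does not support them natively. A CPU fallback costs more than compute time: it can also bring cache synchronization, data format conversion, and memory copies. A few such "breakpoints" in a model are enough to noticeably hurt end-to-end latency.</p>
<p>When evaluating an NPU, the following metrics are more useful than TOPS:</p>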
<ol>
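<li><strong>End-to-end latency</strong>: total time from image capture or audio frame input to the final result.</li>
<li><strong>NPU subgraph coverage</strong>: what share of the model's operators actually run on the NPU.</li>
<li><strong>DDR bandwidth usage</strong>: whether inputs, outputs, and intermediate feature maps move in and out of external memory frequently.</li>
<li><strong>Batch and resolution sensitivity</strong>: embedded workloads mostly run at batch=1, so many server-side optimizations do not apply.</li>
<li><strong>Post-quantization accuracy</strong>: whether INT8 mAP, Top-1, or false wake-up rate meets product requirements.</li>
<li><strong>Power and thermals</strong>: whether the clock throttles after 30 minutes of sustained inference.</li>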
</ol>
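<p>Looking only at peak TOPS makes it easy to blame "a weak chip". Often the real problem is that the model graph does not suit the NPU, or that pre-processing and post-processing slow down the whole pipeline.</p>
<h2 id="二嵌入式-npu-的典型硬件结构">2. Typical Hardware Structure of an Embedded NPU</h2>
<p>Implementation details differ across vendors, but an embedded NPU can usually be abstracted into a few modules: a MAC array, on-chip SRAM, a DMA controller, an instruction scheduler, a data reordering unit, and an external DDR interface.</p>
<p>The MAC array performs the core multiply-accumulate work. At compile time, a convolution is converted into matrix multiplication or sliding-window multiply-accumulate tasks, then split into tiles that are fed to the array. On-chip SRAM holds weight, input, and output blocks so that not every multiply-accumulate touches DDR. DMA moves data between DDR and SRAM. The data reordering unit handles layout conversions such as NCHW, NHWC, and NC1HWC2. The instruction scheduler issues the compiler-generated command stream to the hardware in dependency order.</p>
<p>A simplified convolution execution flow looks like this:</p>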
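<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Input feature map in DDR -&gt; DMA into on-chip SRAM
</span></span><span class="line"><span class="cl">Weight block in DDR      -&gt; DMA into on-chip SRAM
</span></span><span class="line"><span class="cl">MAC array runs the tile convolution
</span></span><span class="line"><span class="cl">Partial outputs written back to SRAM
</span></span><span class="line"><span class="cl">Activation / quantization / accumulation as needed
</span></span><span class="line"><span class="cl">Final output written back to DDR via DMA
</span></span></code></pre></div><p>The key here is the "tile". On-chip SRAM is limited and cannot hold a full high-resolution feature map plus all weights at once. The compiler must split a large operator into many small blocks based on SRAM size, array shape, data type, and operator parameters. Tiles that are too small inflate DMA and scheduling overhead; tiles that are too large do not fit in SRAM or reduce reuse. Many NPU performance differences look like TOPS differences on the surface but really come from how well tiling strategy, data layout, and the memory hierarchy are handled.</p>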
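<p>To make the SRAM constraint concrete, here is a minimal sketch that estimates the footprint of one convolution tile and searches for the largest square output tile that fits. The buffer model (INT8 inputs and weights, 32-bit accumulators, all weights resident, no double buffering) and the 512 KiB SRAM size are assumptions for illustration; real compilers use vendor-specific cost models.</p>

```python
def tile_bytes(th, tw, cin, cout, k=3, stride=1, acc_bytes=4):
    """SRAM bytes for one conv tile: INT8 input and weights, 32-bit accumulators."""
    in_h = (th - 1) * stride + k  # input rows needed, including the halo
    in_w = (tw - 1) * stride + k
    inputs = in_h * in_w * cin
    weights = k * k * cin * cout
    outputs = th * tw * cout * acc_bytes
    return inputs + weights + outputs


def largest_square_tile(sram_bytes, cin, cout, k=3, max_side=64):
    """Largest square output tile that fits in SRAM; 0 if nothing fits."""
    best = 0
    for side in range(1, max_side + 1):
        if tile_bytes(side, side, cin, cout, k) <= sram_bytes:
            best = side
    return best


# Hypothetical 512 KiB SRAM, 3x3 conv with 64 input and 64 output channels.
side = largest_square_tile(512 * 1024, cin=64, cout=64)
print(side, tile_bytes(side, side, 64, 64))
```

<p>Even in this toy model, an 80×80 feature map has to be split into several tiles, and each tile boundary re-reads halo rows, which is one reason tile shape affects bandwidth.</p>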
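<h2 id="三从模型图看-npu-友好程度">3. Judging NPU-Friendliness from the Model Graph</h2>
<p>Before deployment, inspect the model graph with Netron, ONNX GraphSurgeon, or the vendor's tools. An NPU-friendly model usually has these traits: the backbone consists of common operators such as Conv, BN, ReLU/SiLU, Pooling, and MatMul; the branch structure is not too complex; there are few dynamic shapes; post-processing can be split off to the CPU once the data volume is small enough; the input resolution is fixed; and there are not many <code>Transpose</code> or <code>Gather</code> nodes.</p>
<p>Take an object detection model as an example. The backbone and neck usually map well to the NPU, but decode and NMS are frequent trouble spots. Many models, once exported to ONNX, keep grid generation, coordinate transforms, threshold filtering, and NMS inside the graph. That is convenient with ONNX Runtime on a PC but can cause heavy CPU fallback on an embedded NPU. A safer approach is to have the NPU output the feature maps at three scales and implement post-processing separately in C/C++.</p>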
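<p>The post-processing that gets moved to the CPU is usually small. As an illustration only (a production version would be written in C/C++ as described above), the following pure-Python sketch shows greedy NMS over already-decoded boxes. The box format <code>(x1, y1, x2, y2)</code> and both thresholds are assumptions, not values from any particular model.</p>

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.90, 0.80, 0.70, 0.10]
print(nms(boxes, scores))
```

<p>Because score filtering runs first, the number of boxes reaching the quadratic loop is typically small, which is why this step is cheap on a CPU once the NPU has produced the raw feature maps.</p>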
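<p>Activation functions are another common issue. Classic ReLU is NPU-friendly, but support for Swish, GELU, and HardSwish varies by chip. Some NPUs can fuse <code>Conv + BN + ReLU</code> but cannot fuse <code>Conv + BN + SiLU</code> well. If the model allows it, account for deployment constraints during training rather than forcing a fit after training is done.</p>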
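<h2 id="四模型转换前的图优化少一个-transpose-就少一次搬运">4. Graph Optimization Before Conversion: One Fewer Transpose Means One Fewer Data Move</h2>
<p>NPU compilers usually perform constant folding, operator fusion, dead-node elimination, and layout propagation, but do not expect them to solve everything. In practice it is more reliable to clean up the model graph yourself after exporting to ONNX. Graphs exported from PyTorch in particular often carry redundant <code>Unsqueeze</code>, <code>Concat</code>, <code>Slice</code>, and <code>Transpose</code> nodes left over from how the framework expresses operations. They are barely noticeable on a desktop but can become performance breakpoints on an NPU.</p>
<p>Below is a simple ONNX inspection script that counts operator types and lists suspect nodes:</p>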
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">onnx</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">onnx</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;model.onnx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ops</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">op_type</span> <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">node</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">ops</span><span class="o">.</span><span class="n">most_common</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">op</span><span class="si">:</span><span class="s2">20s</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">suspect</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;Transpose&#34;</span><span class="p">,</span> <span class="s2">&#34;Gather&#34;</span><span class="p">,</span> <span class="s2">&#34;Slice&#34;</span><span class="p">,</span> <span class="s2">&#34;Resize&#34;</span><span class="p">,</span> <span class="s2">&#34;NonMaxSuppression&#34;</span><span class="p">,</span> <span class="s2">&#34;Shape&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">Suspect operators:&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">node</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">op_type</span> <span class="ow">in</span> <span class="n">suspect</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">node</span><span class="o">.</span><span class="n">op_type</span><span class="p">,</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">input</span><span class="p">],</span> <span class="s2">&#34;-&gt;&#34;</span><span class="p">,</span> <span class="p">[</span><span class="n">o</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">output</span><span class="p">])</span>
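</span></span></code></pre></div><p>If there are many <code>Transpose</code> nodes, check whether they only switch back and forth between NCHW and NHWC. Some conversion tools require a fixed input layout; if it is set incorrectly at export time, extra reorders get inserted at the head and tail of the graph. A reorder does no complex computation, yet it reads and writes the entire feature map. For a 640×640 detection model, intermediate feature maps can be several MB, so a few reorders are enough to consume a large amount of bandwidth.</p>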
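<p>A back-of-the-envelope calculation shows how quickly reorders add up. The sketch assumes a standalone reorder reads and writes the whole tensor once, and uses a hypothetical 80×80×256 INT8 feature map (a plausible stride-8 level of a 640×640 detector) with four such reorders per frame at 30 FPS.</p>

```python
def transpose_traffic_bytes(h, w, c, bytes_per_elem=1):
    """A standalone layout change reads and writes the whole tensor once."""
    return 2 * h * w * c * bytes_per_elem


# Hypothetical stride-8 feature map of a 640x640 detector: 80 x 80 x 256, INT8.
per_op = transpose_traffic_bytes(80, 80, 256)
per_second = per_op * 4 * 30  # four such reorders per frame at 30 FPS
print(f"{per_op / 2**20:.3f} MiB per reorder, {per_second / 1e6:.0f} MB/s in total")
```

<p>Roughly 390 MB/s of DDR traffic for operations that compute nothing is a meaningful share of the bandwidth on many embedded SoCs, especially when the camera and display are active.</p>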
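<p>Graph optimization follows three basic principles:</p>
<ul>
<li><strong>Fuse whatever can be fused</strong>: Conv + BN + Activation should be fused before conversion or in the compiler wherever possible.</li>
<li><strong>Make static whatever can be static</strong>: fix the input size and batch so that dynamic shapes stay out of the NPU subgraph.</li>
<li><strong>Move out whatever can be moved out</strong>: put NMS, string processing, complex indexing, and other logic that does not suit the NPU into CPU post-processing.</li>
</ul>
<p>Take BatchNorm fusion as an example: at inference time, BN can be folded into the convolution's weights and bias. The formula is simple:</p>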
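<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># W, b: original conv weight (C_out, C_in, kH, kW) and bias (C_out,); use zeros for b if the conv has no bias</span>
</span></span><span class="line"><span class="cl"><span class="c1"># mean, var, gamma, beta: BN parameters, each of shape (C_out,)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># eps: BN epsilon</span>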
</span></span><span class="line"><span class="cl"><span class="n">scale</span> <span class="o">=</span> <span class="n">gamma</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">W_fused</span> <span class="o">=</span> <span class="n">W</span> <span class="o">*</span> <span class="n">scale</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">b_fused</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span>
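</span></span></code></pre></div><p>After fusion, the runtime has one fewer BN operator and one fewer round trip of the intermediate feature map through memory. For large models, this seemingly basic optimization matters a great deal, because embedded inference is often limited not by compute but by expensive memory access.</p>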
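<p>It is worth checking numerically that a fused layer matches the original. The sketch below does this in pure Python for a 1×1 convolution at a single pixel, which is enough to exercise the per-output-channel scale. All weights and BN statistics are arbitrary illustrative values.</p>

```python
import math


def conv1x1(x, W, b):
    """1x1 convolution at one pixel: x has C_in values, W is C_out x C_in."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def batchnorm(y, gamma, beta, mean, var, eps):
    """Inference-mode BN applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]


def fuse(W, b, gamma, beta, mean, var, eps):
    """Fold BN into the conv using one scale per output channel."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_f = [[w * sc for w in row] for row, sc in zip(W, scale)]
    b_f = [bt + (bias - m) * sc for bt, bias, m, sc in zip(beta, b, mean, scale)]
    return W_f, b_f


x = [0.5, -1.0, 2.0]
W = [[0.2, -0.1, 0.4], [1.0, 0.3, -0.5]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.3]
mean, var, eps = [0.3, -0.1], [0.9, 0.25], 1e-5

reference = batchnorm(conv1x1(x, W, b), gamma, beta, mean, var, eps)
W_f, b_f = fuse(W, b, gamma, beta, mean, var, eps)
fused = conv1x1(x, W_f, b_f)
max_diff = max(abs(r - f) for r, f in zip(reference, fused))
print(f"max abs difference: {max_diff:.2e}")
```

<p>The difference should be at floating-point rounding level. A larger gap usually means the scale was broadcast along the wrong axis or the BN layer was not in eval mode when its statistics were read.</p>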
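<h2 id="五int8-量化性能提升背后的精度账">5. INT8 Quantization: The Accuracy Cost Behind the Speedup</h2>
<p>For most embedded NPUs, the high-performance path is INT8. FP16 and BF16 support is becoming more common, but on low-power devices INT8 remains the most cost-effective way to deploy. The core of quantization is mapping floating-point tensors to an integer range:</p>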
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">real_value ≈ scale × (int_value - zero_point)
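</span></span></code></pre></div><p>Weights are commonly quantized per channel, and activations per tensor. Per-channel quantization significantly improves the accuracy of convolution weights, because weight distributions can differ widely across output channels. Activations depend on a calibration dataset: a batch of representative inputs is used to collect the value range of each layer.</p>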
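<p>The mapping above can be sketched in a few lines. This is a minimal asymmetric per-tensor example with a signed 8-bit range; the min/max calibration rule and the function names are assumptions for illustration, and vendor toolchains may use symmetric ranges, percentile clipping, or KL-based calibration instead.</p>

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Asymmetric per-tensor parameters; the range is widened to include 0."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)  # assumes rmax > rmin
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zp, qmin=-128, qmax=127):
    """real -> int8, with saturation outside the calibrated range."""
    return max(qmin, min(qmax, int(round(x / scale)) + zp))


def dequantize(q, scale, zp):
    """int8 -> real, i.e. real_value = scale * (int_value - zero_point)."""
    return scale * (q - zp)


# Pretend calibration observed activations in [-1.0, 3.0].
scale, zp = quant_params(-1.0, 3.0)
values = [-1.0, 0.0, 0.37, 3.0, 5.0]
restored = [dequantize(quantize(v, scale, zp), scale, zp) for v in values]
for v, r in zip(values, restored):
    print(f"{v:6.2f} -> {r:8.4f}")
```

<p>Values inside the calibrated range come back within half a quantization step, while 5.0 saturates to the same value as 3.0. That saturation is what a skewed calibration set produces on real data.</p>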
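<p>A calibration set is not just a few dozen arbitrary images. It should cover the brightness, angles, backgrounds, target scales, and noise found in real scenes. For industrial inspection, include good parts, minor defects, severe defects, and images with no part loaded; for voice wake-up, cover different speakers, distances, noise, and microphone gain. If the calibration set is skewed, the quantization scales are skewed too, and error suddenly grows in certain scenes.</p>
<p>Below is a sketch of a small script for selecting an image calibration set: it buckets the raw data by brightness and edge density so that calibration samples do not all come from the same lighting condition.</p>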
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">cv2</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">items</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&#34;calib_raw&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&#34;*.jpg&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">img</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">imread</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">cv2</span><span class="o">.</span><span class="n">IMREAD_GRAYSCALE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">img</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">continue</span>
</span></span><span class="line"><span class="cl">    <span class="n">brightness</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">img</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">    <span class="n">edges</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">Canny</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">160</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">edge_density</span> <span class="o">=</span> <span class="nb">float</span><span class="p">((</span><span class="n">edges</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">    <span class="n">items</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">p</span><span class="p">,</span> <span class="n">brightness</span><span class="p">,</span> <span class="n">edge_density</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Simple bucketing: 5 brightness buckets x 5 texture buckets, at most 4 images per bucket</span>
</span></span><span class="line"><span class="cl"><span class="n">selected</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="n">buckets</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">p</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">key</span> <span class="o">=</span> <span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">b</span> <span class="o">/</span> <span class="mi">51</span><span class="p">),</span> <span class="mi">4</span><span class="p">),</span> <span class="nb">min</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">e</span> <span class="o">/</span> <span class="mf">0.04</span><span class="p">),</span> <span class="mi">4</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">buckets</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">[])</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">paths</span> <span class="ow">in</span> <span class="n">buckets</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">selected</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">paths</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;selected&#34;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">selected</span><span class="p">))</span>
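</span></span></code></pre></div><p>After quantization, always compare layer by layer. Looking only at final mAP or accuracy makes problems hard to localize. A more effective approach is to have the floating-point and INT8 models dump several key layers and compute cosine similarity, mean squared error, and maximum error. If similarity drops sharply starting at some layer, focus on the activation ranges around that layer, on abnormal outliers, and on operations that do not quantize well.</p>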
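<p>The three metrics are easy to compute once both models can dump the same layers. The sketch below assumes the dumps are available as flat lists of floats keyed by layer name, with the INT8 outputs already dequantized; the layer names, values, and the 0.99 threshold are illustrative assumptions.</p>

```python
import math


def compare_layer(ref, test):
    """Cosine similarity, MSE and max abs error between two flattened tensors."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    diffs = [r - t for r, t in zip(ref, test)]
    return {
        "cosine": dot / norm if norm > 0 else 0.0,
        "mse": sum(d * d for d in diffs) / len(diffs),
        "max_err": max(abs(d) for d in diffs),
    }


def first_bad_layer(fp_outputs, int8_outputs, threshold=0.99):
    """First layer, in execution order, whose cosine similarity is below threshold."""
    for name, ref in fp_outputs.items():
        if compare_layer(ref, int8_outputs[name])["cosine"] < threshold:
            return name
    return None


# Hypothetical dumps; dicts keep insertion order, used here as execution order.
fp = {"conv1": [1.0, 2.0, 3.0], "conv2": [0.5, -0.5, 1.0], "head": [4.0, 0.1, -2.0]}
q8 = {"conv1": [1.01, 1.98, 3.02], "conv2": [0.5, -0.48, 1.0], "head": [1.0, 2.0, -0.2]}

for name in fp:
    print(name, compare_layer(fp[name], q8[name]))
print("first suspicious layer:", first_bad_layer(fp, q8))
```

<p>Reporting the first layer that crosses the threshold, rather than the worst one, tends to point closer to the root cause because quantization error propagates downstream.</p>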
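<h2 id="六算子映射让模型顺着硬件走">6. Operator Mapping: Let the Model Follow the Hardware</h2>
<p>The NPU compiler splits the model graph into one or more subgraphs. Consecutive supported operators go into an NPU subgraph; unsupported operators are left to the CPU or DSP. The goal of performance optimization is to keep large compute segments on the NPU as contiguously as possible.</p>
<p>A common example is <code>Conv -&gt; BN -&gt; SiLU -&gt; Add</code>. If an NPU does not support SiLU fusion, the compiler may place <code>Conv + BN</code> on the NPU, SiLU on the CPU, and Add back on the NPU. The intermediate feature map then has to be written from NPU memory to DDR, read and processed by the CPU, and written back for the NPU, which is very expensive. In that case, consider replacing SiLU with HardSwish or ReLU6, which usually requires fine-tuning or retraining, or use an activation function with better deployment-side support from the start of training.</p>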
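<p>The effect of a single unsupported operator can be illustrated with a toy partitioner. This sketch treats the model as a linear chain and uses a made-up support list; real compilers partition a DAG and apply their own fusion rules, so it only shows how handoffs are counted.</p>

```python
def partition(ops, npu_supported):
    """Group a linear op sequence into consecutive (device, ops) segments."""
    segments = []
    for op in ops:
        device = "NPU" if op in npu_supported else "CPU"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(op)
        else:
            segments.append((device, [op]))
    return segments


def handoffs(segments):
    """Number of device switches, each implying a feature-map round trip."""
    return max(0, len(segments) - 1)


# Hypothetical support list for some NPU SDK version.
supported = {"Conv", "BN", "ReLU", "HardSwish", "Add"}
original = ["Conv", "BN", "SiLU", "Add"]
replaced = ["Conv", "BN", "HardSwish", "Add"]

for ops in (original, replaced):
    segs = partition(ops, supported)
    print(segs, "handoffs:", handoffs(segs))
```

<p>With the assumed support list, the SiLU variant splits into three segments with two handoffs, while the HardSwish variant stays in a single NPU segment.</p>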
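<p>Depthwise separable convolution is another example. Its theoretical compute cost is low, but it is not necessarily faster on some NPUs, because depthwise conv has lower data reuse than regular convolution and easily becomes bandwidth-bound. The 1×1 pointwise conv common in mobile networks is actually easier to run at full utilization on a matrix array. Model architecture choices therefore cannot rely on FLOPs alone; they also need the target hardware's operator efficiency table.</p>
<p>In practice you can maintain an operator whitelist, together with a list of operators to avoid:</p>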
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">preferred_ops</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Conv</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Relu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">LeakyRelu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Add</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Mul</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">MaxPool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">AveragePool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GlobalAveragePool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">MatMul</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">avoid_ops</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">NonMaxSuppression</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">DynamicQuantizeLinear</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GatherND</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GridSample</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Resize(mode=cubic)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Transpose(large_feature_map)</span><span class="w">
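</span></span></span></code></pre></div><p>This table is not fixed; update it against the specific chip SDK. After every SDK upgrade, rerun the model conversion report to see whether nodes that used to fall back to the CPU are now supported, or whether previously supported fusions have changed.</p>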
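<h2 id="七内存规划嵌入式-npu-最容易被低估的战场">7. Memory Planning: The Most Underestimated Battleground for Embedded NPUs</h2>
<p>Many models with modest theoretical compute still run slowly on the board, and the root cause is memory. On an embedded SoC, DDR serves the CPU, GPU, ISP, VPU, NPU, and display controller at the same time. When camera preview, video encoding, and neural network inference run together, DDR bandwidth fills up quickly. However high the NPU's peak compute is, it idles if data cannot be fed fast enough.</p>
<p>When optimizing memory, start with three kinds of data: inputs and outputs, intermediate feature maps, and weights. Weights can usually stay resident in memory, and some platforms can preload them into a dedicated region. Inputs and outputs depend on the application pipeline: for example, whether camera NV12 data must be converted to RGB, whether it needs resizing, and whether zero-copy is available. Intermediate feature maps are planned by the compiler, but the model structure affects peak memory.</p>
<p>For vision models, pre-processing is often overlooked. With a 1080p camera input, if the CPU performs NV12-to-RGB conversion, resize, and normalization on every frame and then copies the result into the NPU input buffer, pre-processing alone can take 8 to 15 ms. A better approach is to use the ISP, RGA, or GPU for color conversion and scaling, or to let the NPU runtime accept hardware buffer handles to reduce memory copies.</p>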
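<p>The memory traffic behind that pre-processing cost can be estimated with simple arithmetic. The sketch compares an assumed CPU path (separate color conversion, resize, and copy passes) with an assumed single-pass hardware path for a 1080p NV12 source and a 640×640 RGB model input at 30 FPS. It ignores normalization and cache effects, so treat the numbers as orders of magnitude.</p>

```python
def nv12_bytes(width, height):
    """NV12: full-resolution Y plane plus half-resolution interleaved UV plane."""
    return width * height * 3 // 2


def rgb_bytes(width, height):
    """Packed 8-bit RGB."""
    return width * height * 3


FPS = 30
src = nv12_bytes(1920, 1080)
full_rgb = rgb_bytes(1920, 1080)
model_in = rgb_bytes(640, 640)

# Assumed CPU path: NV12 -> full-size RGB -> resized RGB -> memcpy into the NPU buffer.
cpu_path = (src + full_rgb) + (full_rgb + model_in) + (model_in + model_in)
# Assumed hardware path: one pass reads NV12 and writes the model input directly.
hw_path = src + model_in

print(f"CPU path: {cpu_path * FPS / 1e6:.0f} MB/s")
print(f"HW path:  {hw_path * FPS / 1e6:.0f} MB/s")
```

<p>Under these assumptions the CPU path moves more than four times as much data per frame, before counting the CPU cycles spent on the conversion itself.</p>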
<p>A typical zero-copy pipeline looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Camera DMA buffer
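</span></span><span class="line"><span class="cl">  -&gt; hardware scaling / color conversion
</span></span><span class="line"><span class="cl">  -&gt; NPU input buffer
</span></span><span class="line"><span class="cl">  -&gt; NPU inference
</span></span><span class="line"><span class="cl">  -&gt; CPU reads the small output
</span></span><span class="line"><span class="cl">  -&gt; post-processing and business logic
</span></span></code></pre></div><p>If the platform supports ION, DMA-BUF, or a similar mechanism, pass buffer handles for video frames between hardware modules rather than repeatedly calling <code>memcpy</code> in user space. This kind of optimization is less visible than changing the model, but it is very effective for end-to-end latency and power.</p>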
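<h2 id="八运行时流水线不要让-npu-等-cpu">8. Runtime Pipeline: Don't Make the NPU Wait for the CPU</h2>
<p>A single-frame inference flow usually includes capture, pre-processing, inference, post-processing, and display or communication. If these steps run serially, total latency is the sum of all steps. Real products can pipeline them: while the CPU post-processes frame N, the NPU infers frame N+1 and the ISP prepares frame N+2. Per-frame latency does not go away, but system throughput improves noticeably.</p>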
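<p>The difference between latency and throughput is easy to quantify. The stage times below are invented for illustration; the sketch assumes every stage runs on its own hardware unit or thread with no contention, which is the best case for a pipeline.</p>

```python
# Hypothetical per-stage times in milliseconds.
stages_ms = {"capture": 4.0, "preprocess": 6.0, "infer": 18.0, "postprocess": 5.0}

# Serial execution: one frame at a time, every stage waits for the previous one.
serial_latency_ms = sum(stages_ms.values())
serial_fps = 1000.0 / serial_latency_ms

# Pipelined execution: each stage works on a different frame, so steady-state
# throughput is set by the slowest stage. A frame still passes through every stage.
bottleneck_ms = max(stages_ms.values())
pipelined_fps = 1000.0 / bottleneck_ms
pipelined_latency_ms = serial_latency_ms  # lower bound, before any queueing delay

print(f"serial:    {serial_latency_ms:.0f} ms latency, {serial_fps:.1f} FPS")
print(f"pipelined: >= {pipelined_latency_ms:.0f} ms latency, {pipelined_fps:.1f} FPS")
```

<p>In this example throughput rises from about 30 FPS to about 56 FPS while per-frame latency stays at 33 ms or more, so once the pipeline is in place the inference stage is the one worth optimizing.</p>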
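<p>Below is simplified C++ pseudocode showing the structure of a multi-threaded pipeline with four stage threads (capture, pre-processing, inference, and post-processing):</p>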
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">struct</span> <span class="nc">FrameJob</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">id</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">Buffer</span> <span class="n">input</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">Buffer</span> <span class="n">tensor</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">NpuOutput</span> <span class="n">output</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">BlockingQueue</span><span class="o">&lt;</span><span class="n">FrameJob</span><span class="o">&gt;</span> <span class="n">q_pre</span><span class="p">,</span> <span class="n">q_infer</span><span class="p">,</span> <span class="n">q_post</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">capture_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">FrameJob</span> <span class="n">job</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">id</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">input</span> <span class="o">=</span> <span class="n">camera_dequeue</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_pre</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">preprocess_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_pre</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">tensor</span> <span class="o">=</span> <span class="n">hw_resize_color_convert</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">input</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_infer</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">infer_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_infer</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">npu_run</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">tensor</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_post</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">postprocess_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_post</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">result</span> <span class="o">=</span> <span class="n">decode_and_nms</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">output</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">publish_result</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">camera_release</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">input</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
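</span></span></code></pre></div><p>A few engineering details matter here. Queue lengths must be bounded, or a real-time system accumulates stale frames. If the application only cares about the latest result, drop old frames when a queue is full. NPU input and output buffers should be reused rather than allocated and freed per frame. Across threads, pay attention to cache flush/invalidate, especially when the CPU and NPU share physical memory.</p>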
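<p>The "drop old frames when full" policy can be sketched as a small bounded queue. This Python version only illustrates the policy; the class name <code>LatestQueue</code> is made up here, and it is not the <code>BlockingQueue</code> used in the C++ pseudocode above.</p>

```python
import threading
from collections import deque


class LatestQueue:
    """Bounded queue that drops the oldest item when full."""

    def __init__(self, maxlen):
        self._items = deque(maxlen=maxlen)
        self._cond = threading.Condition()
        self.dropped = 0

    def push(self, item):
        with self._cond:
            if len(self._items) == self._items.maxlen:
                self.dropped += 1  # the deque evicts the oldest item on append
            self._items.append(item)
            self._cond.notify()

    def pop(self, timeout=None):
        """Block until an item is available; return None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout):
                return None
            return self._items.popleft()


q = LatestQueue(maxlen=2)
for frame_id in range(5):
    q.push(frame_id)  # the producer outruns the consumer
first, second = q.pop(), q.pop()
print(first, second, "dropped:", q.dropped)
```

<p>In a real pipeline a dropped frame still owns a camera or DMA buffer, so the eviction path must release that buffer back to the driver. The sketch leaves that out.</p>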
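<h2 id="九性能排查按层耗时比总耗时更重要">9. Performance Troubleshooting: Per-Layer Time Matters More Than Total Time</h2>
<p>When a model runs slowly, do not start by guessing. The first step is to get the conversion report and profiler output. Most NPU SDKs can report per-layer time, subgraph partitioning, memory usage, and CPU fallback. If the tooling is lacking, wrap pre-processing, inference, and post-processing with timestamps on the application side to at least identify the general direction.</p>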
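<p>Application-side timestamps need very little code. The sketch below uses a context manager to collect per-stage wall-clock times; <code>fake_work</code> is a stand-in for real pre-processing, inference, and post-processing calls, and its sleep durations are arbitrary.</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings_ms = defaultdict(list)


@contextmanager
def stage(name):
    """Record the wall-clock duration of a code block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name].append((time.perf_counter() - start) * 1000.0)


def fake_work(seconds):
    time.sleep(seconds)  # stand-in for a real pipeline stage


for _ in range(3):
    with stage("preprocess"):
        fake_work(0.002)
    with stage("infer"):
        fake_work(0.005)
    with stage("postprocess"):
        fake_work(0.001)

for name, samples in timings_ms.items():
    print(f"{name:12s} mean = {sum(samples) / len(samples):.2f} ms over {len(samples)} runs")
```

<p>If the NPU runtime call is asynchronous, the timer has to wrap the wait for completion as well, otherwise the inference stage will look faster than it is.</p>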
<p>A suggested troubleshooting order:</p>
<ol>
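<li><strong>Confirm frequency and temperature</strong>: check for throttling caused by power supply, cooling, or the governor.</li>
<li><strong>Confirm the model input size</strong>: check for an accidentally larger resolution or dynamic shapes.</li>
<li><strong>Check NPU subgraph coverage</strong>: look for CPU fallback.</li>
<li><strong>Look at the 10 slowest layers</strong>: are they large convolutions, Resize, Transpose, or post-processing?</li>
<li><strong>Check pre-processing time</strong>: is it dragged down by CPU resize and memcpy?</li>
<li><strong>Check the quantization mode</strong>: are some layers kept in FP32, adding mixed-execution cost?</li>
<li><strong>Compare with the official benchmark</strong>: use the vendor's examples to confirm the environment has no systemic problem.</li>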
</ol>
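<p>If a convolution layer is abnormally slow, try changing the input size, channel count, or model structure. NPUs are sensitive to channel alignment; some hardware prefers channel counts that are multiples of 8, 16, or 32. If the network contains irregular channel counts such as 3, 5, or 7, the compiler may need to pad them, which increases both actual compute and memory. Aligning channel counts to the hardware granularity during training is usually more cost-effective than forcing optimizations after deployment.</p>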
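<p>The padding cost can be estimated directly. This sketch assumes the hardware pads both input and output channels up to a multiple of 16; the actual granularity, and whether both dimensions are padded, depends on the chip.</p>

```python
def align_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return (n + multiple - 1) // multiple * multiple


def padded_mac_ratio(cin, cout, align):
    """MACs executed after channel padding divided by MACs actually needed."""
    return align_up(cin, align) * align_up(cout, align) / (cin * cout)


for cin, cout in [(3, 24), (48, 100), (64, 128)]:
    print(f"cin={cin:3d} cout={cout:3d} -> {padded_mac_ratio(cin, cout, 16):.2f}x")
```

<p>Under this assumption a 3-channel first layer with 24 outputs executes about seven times the necessary work, while 48 to 100 channels costs 12% extra and an aligned layer costs nothing.</p>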
<h2 id="十一个可落地的部署检查清单">10. A Practical Deployment Checklist</h2>
<p>The checklist below fits well in a project's <code>docs/deployment.md</code>; walk through it on every model upgrade:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">[Model export]
</span></span><span class="line"><span class="cl">- Fix the input shape and batch=1
</span></span><span class="line"><span class="cl">- Disable training-mode nodes; confirm BN is in eval mode
</span></span><span class="line"><span class="cl">- Validate the exported ONNX with onnx.checker
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Graph optimization]
</span></span><span class="line"><span class="cl">- Count operator types; flag Transpose / Gather / Slice / Resize
</span></span><span class="line"><span class="cl">- Fuse Conv + BN + Activation
</span></span><span class="line"><span class="cl">- Move NMS, decode, and other post-processing out of the NPU graph
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Quantization]
</span></span><span class="line"><span class="cl">- Calibration set covers real scenes
</span></span><span class="line"><span class="cl">- Use per-channel weight quantization
</span></span><span class="line"><span class="cl">- Compare per-layer cosine similarity
</span></span><span class="line"><span class="cl">- Run regression tests on key metrics
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Compilation]
</span></span><span class="line"><span class="cl">- Check NPU subgraph coverage
</span></span><span class="line"><span class="cl">- Save the compile report and SDK version
</span></span><span class="line"><span class="cl">- Record input layout, quantization parameters, and runtime configuration
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Runtime]
</span></span><span class="line"><span class="cl">- Pre-allocate input and output buffers
</span></span><span class="line"><span class="cl">- Use hardware pre-processing or zero-copy
</span></span><span class="line"><span class="cl">- Separate capture, inference, and post-processing threads
</span></span><span class="line"><span class="cl">- Record end-to-end latency, P95, P99, and temperature
</span></span></code></pre></div><p>这份清单的价值在于让部署过程可复现。很多团队的问题不是没有优化能力，而是每次模型、SDK、板级系统一起变化，最后不知道性能变化来自哪里。只要把转换命令、校准数据版本、编译报告和测试结果保存下来，排查效率会高很多。</p>
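<p>For the latency entries in the checklist, tail percentiles can be computed directly from the recorded samples. The sketch below uses the nearest-rank definition, which is one of several common conventions; the sample data is a placeholder.</p>

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


latencies_ms = [float(v) for v in range(1, 101)]  # placeholder data: 1.0 .. 100.0
summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

<p>Storing these numbers next to the SDK version and the board temperature makes regressions between releases much easier to attribute.</p>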
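<h2 id="十一常见坑与解决办法">11. Common Pitfalls and Fixes</h2>
<p><strong>1. ONNX inference is correct on the PC, but NPU results are clearly off.</strong> First check the input normalization order, RGB/BGR, NHWC/NCHW and the quantization scale. Many accuracy problems are not NPU arithmetic errors but preprocessing that differs from training.</p>
<p><strong>2. The conversion tool reports an unsupported operator.</strong> Do not rush to change chips. First check whether the operator can be constant-folded, replaced or moved into postprocessing. The <code>NonMaxSuppression</code> in detection models, for instance, rarely needs to live in the NPU graph.</p>
<p><strong>3. INT8 accuracy drops sharply.</strong> Increase the diversity of the calibration set, inspect layers with abnormal activations, and try per-channel quantization, mixed precision or quantization-aware training. Sensitive parts such as detection heads and attention modules can stay in FP16, provided the hardware supports it and the performance is acceptable.</p>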
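<p>Finding the layers that suffer most from quantization usually starts with a per-layer cosine similarity between the FP32 and INT8 outputs. A minimal Python sketch follows; the layer names, the sample activations and the 0.99 threshold are all placeholders, and real activations would be dumped from the two runs.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened layer outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero output as "no agreement"
    return dot / (norm_a * norm_b)


def worst_layers(reference, quantized, threshold=0.99):
    """Return (layer, similarity) pairs below `threshold`, worst first."""
    scores = {name: cosine_similarity(reference[name], quantized[name]) for name in reference}
    flagged = [(name, s) for name, s in scores.items() if s < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Placeholder activations; real ones would be dumped from the FP32 and INT8 runs.
fp32 = {"conv1": [0.5, -1.0, 2.0], "head": [1.0, 0.0, -1.0]}
int8 = {"conv1": [0.5, -1.0, 2.0], "head": [0.2, 0.9, -0.1]}
print(worst_layers(fp32, int8))
```

<p>Layers that show up at the top of this list are the first candidates for mixed precision.</p>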
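<p><strong>4. The standalone benchmark is fast, but the application is slow.</strong> Focus on preprocessing, postprocessing, thread synchronization, log output and memory copies. Vendor benchmarks usually measure only the NPU subgraph and exclude the camera and the business logic.</p>
<p><strong>5. Performance degrades after running for a long time.</strong> Check temperature, power supply, memory leaks and buffer queue build-up. On embedded devices, sustained performance matters more than cold-start numbers.</p>
<h2 id="十二总结npu-优化是模型编译器和系统工程的合题">12. Summary: NPU Optimization Combines Model, Compiler and Systems Engineering</h2>
<p>Embedded NPU deployment is not as simple as dropping a model into a conversion tool. A stable and efficient solution needs a model structure that follows the hardware, calibration data that is close to real inputs, a compile report that explains every subgraph, a runtime pipeline that minimizes waiting, and system-level control of bandwidth, power and temperature.</p>
<p>Compressed into one sentence: first make sure operators land on the NPU contiguously, then reduce the movement of large feature maps, and only then chase the limits of individual operators. In many projects the real gains come from deleting a few redundant <code>Transpose</code> nodes, moving NMS out of the graph, using hardware resize and reusing DMA buffers, not from blindly chasing a larger TOPS figure.</p>
<p>Looking ahead, edge NPUs will keep moving toward stronger mixed precision, better Transformer support and more complete software stacks. For the foreseeable future, though, embedded engineers still need to understand the hardware constraints. People who understand models write networks that are easier to deploy, people who understand systems can fit an NPU into a real product pipeline, and people who understand architecture can make the right trade-offs between performance, power and cost.</p>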
]]></content:encoded>
    </item>
    <item>
      <title>A Complete Guide to Embedded AI Inference Deployment with NCNN</title>
      <link>https://tech-snippets.xyz/posts/ncnn-embedded-ai-deployment-guide/</link>
      <pubDate>Tue, 02 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/ncnn-embedded-ai-deployment-guide/</guid>
      <description>前言 在边缘设备上部署深度学习模型，一直是嵌入式 AI 领域最具挑战性的课题之一。当你训练好了一个准确率令人满意的 PyTorch 模型，满心欢喜地想把它搬到 ARM 开发板上跑一跑，却发现原始模型推理一次需要好几秒，这样的性能在实际产品中根本无法使用。这时你才意识到，训练和部署之间，隔着一道看不见却异常宽阔的鸿沟。
这道鸿沟的两边是完全不同的世界：训练端追求的是灵活的算子支持、便捷的调试接口、高效的分布式训练；而部署端追求的却是极致的推理速度、最小的内存占用、最低的功耗开销。大多数框架都是为训练设计的，即使像 PyTorch 这样优秀的框架，其 C++ 前端 LibTorch 在嵌入式设备上的表现也往往差强人意。
于是我们需要专门的推理框架。在众多推理框架中，腾讯开源的 NCNN 是一个相当特别的存在。它从诞生之初就是为移动端和嵌入式设备设计的，没有历史包袱，从内存管理到算子实现都围绕 ARM 架构深度优化。更重要的是，NCNN 是纯 C++ 实现，没有任何第三方依赖，这意味着你可以轻松将它集成到各种奇葩的嵌入式环境中。
我第一次接触 NCNN 是在一块瑞芯微 RK3399 开发板上部署目标检测模型。当时用 PyTorch 推理一帧 YOLO 需要约 800ms，用 TensorFlow Lite 也需要 400ms 左右，而用 NCNN 优化后，同样的模型在同一硬件上只需要 120ms，这还没开启 Vulkan GPU 加速。那一刻我真切感受到，一个好的推理框架带来的性能提升，往往比换一颗芯片还要显著。
这篇文章会带你完整走一遍 NCNN 的部署流程：从模型训练完成后的 ONNX 导出，到 onnx2ncnn 转换，再到模型优化、INT8 量化、最后编写 C++ 推理代码。文中所有命令和代码都经过实际验证，你可以照着一步步操作。
一、为什么选择 NCNN？ 在深入具体操作之前，我们先聊聊为什么在众多推理框架中选择 NCNN，它的核心优势在哪里，又有哪些局限性。
1.1 推理框架的选型维度 选择一个推理框架，通常需要考虑以下几个维度：
维度 说明 重要程度 性能 同样硬件上的推理速度 ⭐⭐⭐⭐⭐ 模型支持 能否正常转换你的模型 ⭐⭐⭐⭐⭐ 易用性 文档是否完善，社区是否活跃 ⭐⭐⭐⭐ 跨平台 支持多少种目标硬件 ⭐⭐⭐⭐ 二进制体积 对资源紧张的 MCU 很重要 ⭐⭐⭐ 许可证 是否允许商业闭源使用 ⭐⭐⭐⭐ 用这个维度表来评估 NCNN，你会发现它在大多数项上得分都很高：性能在 ARM CPU 上属于第一梯队，模型支持覆盖了绝大多数常见算子，Apache 2.</description>
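      <content:encoded><![CDATA[<h2 id="前言">Preface</h2>
<p>Deploying deep learning models on edge devices has long been one of the hardest problems in embedded AI. You train a PyTorch model with satisfying accuracy, happily move it to an ARM development board, and find that a single inference takes several seconds, which is unusable in a real product. Only then does it become clear that an invisible but very wide gap separates training from deployment.</p>
<p>The two sides of that gap are different worlds. Training wants flexible operator support, convenient debugging interfaces and efficient distributed training. Deployment wants the fastest possible inference, the smallest memory footprint and the lowest power draw. Most frameworks are designed for training, and even for a framework as good as PyTorch, its C++ front end LibTorch tends to perform poorly on embedded devices.</p>
<p>That is why dedicated inference frameworks exist. Among them, Tencent's open-source NCNN stands out. It was designed for mobile and embedded devices from day one, carries no legacy baggage, and is optimized around the ARM architecture from memory management down to operator implementations. More importantly, the NCNN library is pure C++ with no third-party dependencies, so it can be integrated into all kinds of unusual embedded environments.</p>
<p>My first contact with NCNN was deploying an object detection model on a Rockchip RK3399 development board. PyTorch needed about 800ms per YOLO frame and TensorFlow Lite about 400ms, while the same model on the same hardware took only 120ms after NCNN optimization, and that was without Vulkan GPU acceleration. That was when I truly felt that a good inference framework can bring a bigger speedup than switching to another chip.</p>
<p>This article walks through the complete NCNN deployment flow: ONNX export after training, onnx2ncnn conversion, model optimization, INT8 quantization, and finally the C++ inference code. All commands and code in the article have been verified in practice, so you can follow along step by step.</p>
<p><img alt="NCNN embedded AI inference deployment workflow" loading="lazy" src="/images/ncnn-deployment-workflow.svg"></p>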
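<h2 id="一为什么选择-ncnn">1. Why NCNN?</h2>
<p>Before getting into the hands-on steps, it is worth discussing why to pick NCNN among the many inference frameworks, where its core strengths lie, and what its limitations are.</p>
<h3 id="11-推理框架的选型维度">1.1 Dimensions for Choosing an Inference Framework</h3>
<p>Choosing an inference framework usually means weighing the following dimensions:</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>Description</th>
<th>Importance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Performance</td>
<td>Inference speed on the same hardware</td>
<td>⭐⭐⭐⭐⭐</td>
</tr>
<tr>
<td>Model support</td>
<td>Whether your model converts correctly</td>
<td>⭐⭐⭐⭐⭐</td>
</tr>
<tr>
<td>Ease of use</td>
<td>Quality of documentation, activity of the community</td>
<td>⭐⭐⭐⭐</td>
</tr>
<tr>
<td>Cross-platform</td>
<td>How many target platforms are supported</td>
<td>⭐⭐⭐⭐</td>
</tr>
<tr>
<td>Binary size</td>
<td>Important for resource-constrained MCUs</td>
<td>⭐⭐⭐</td>
</tr>
<tr>
<td>License</td>
<td>Whether commercial closed-source use is allowed</td>
<td>⭐⭐⭐⭐</td>
</tr>
</tbody>
</table>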
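<p>Measured against these dimensions, NCNN scores well on most of them: its performance on ARM CPUs is in the top tier, its model support covers the vast majority of common operators, its BSD 3-Clause license is very permissive, and the binary can be trimmed down to a few hundred KB.</p>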
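<h3 id="12-ncnn-的核心优势">1.2 NCNN's Core Strengths</h3>
<p><strong>Deep ARM optimization</strong> is NCNN's main competitive edge. NCNN ships a large amount of NEON-optimized code for ARMv7 and ARMv8. This is hand-tuned, assembly-level work rather than plain compiler auto-vectorization. The Im2col + Gemm convolution path and the Winograd fast convolution algorithm, for example, both have carefully tuned instruction scheduling and register allocation.</p>
<p>How much does the hand tuning matter? Taking 3x3 convolution on a Cortex-A53 as an example, NCNN's implementation is typically 2-3x faster than OpenCV DNN and more than 10x faster than an unoptimized reference implementation, although the exact ratio depends on layer shapes and build options. The gap is not algorithmic. It comes purely from engineering effort.</p>
<p><strong>A dependency-free, pure C++ implementation</strong> is NCNN's other big advantage. Many frameworks look powerful until you cross-compile them and discover a pile of third-party dependencies: Protobuf, FlatBuffers, BLAS libraries and so on. In some embedded environments, just getting those dependencies to build is a nightmare.</p>
<p>The NCNN library has no such dependencies, and it can even be built without C++ exceptions and RTTI. In its leanest configuration, a cross compiler that can build C++ is all you need. This is especially valuable on customized embedded Linux systems and even bare-metal environments.</p>
<p><strong>Flexible extensibility</strong> also deserves a mention. NCNN has a clean operator registration mechanism: a custom operator only needs to inherit from a base class, implement the forward function and register itself, without touching the framework core. This design is very friendly to projects that need to deploy in-house operators.</p>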
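<h3 id="13-ncnn-的局限性">1.3 NCNN's Limitations</h3>
<p>NCNN is not a silver bullet. Its main limitations are:</p>
<ul>
<li><strong>GPU support trails TensorRT</strong>: NCNN supports Vulkan GPU acceleration, but on NVIDIA devices it is still slower than TensorRT. On ARM Mali GPUs, however, the Vulkan backend performs quite well.</li>
<li><strong>Limited dynamic shape support</strong>: NCNN is optimized mainly for fixed input shapes, and its dynamic shape support is less flexible than ONNX Runtime's.</li>
<li><strong>Relatively basic debugging tools</strong>: TensorRT has mature profiling tools, whereas debugging NCNN relies more on per-layer outputs from <code>ncnn::Extractor</code> and your own logging.</li>
</ul>
<p>Overall, if the target platform is an ARM CPU (phones, development boards, embedded devices), NCNN is one of the best choices available. For NVIDIA GPUs, TensorRT should be considered first.</p>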
<h2 id="二环境搭建从源码编译-ncnn">2. Environment Setup: Building NCNN from Source</h2>
<p>Before starting, download the NCNN source and build it. The build system is CMake and the process is fairly straightforward, but a few build options deserve particular attention.</p>
<h3 id="21-获取源码">2.1 Getting the Source</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Clone the NCNN source</span>
</span></span><span class="line"><span class="cl">git clone https://github.com/Tencent/ncnn.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ncnn
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Check out a release tag (optional but recommended)</span>
</span></span><span class="line"><span class="cl">git checkout <span class="m">20240410</span>  <span class="c1"># pick a reasonably recent release tag</span>
</span></span></code></pre></div><h3 id="22-主机端编译x86-linux">2.2 Host Build (x86 Linux)</h3>
<p>We first build NCNN on the x86 host, mainly to get the model conversion tools (onnx2ncnn, ncnnoptimize and so on).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">mkdir -p build-host <span class="o">&amp;&amp;</span> <span class="nb">cd</span> build-host
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">cmake .. <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_TOOLS<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_EXAMPLES<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_BENCHMARK<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DCMAKE_BUILD_TYPE<span class="o">=</span>Release
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">make -j<span class="k">$(</span>nproc<span class="k">)</span>
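</span></span></code></pre></div><p>After the build finishes, the tools sit under <code>build-host/tools/</code> and its subdirectories (for example <code>tools/onnx/</code> and <code>tools/quantize/</code>):</p>
<ul>
<li><code>onnx2ncnn</code> - converts ONNX models to NCNN format (only built when CMake finds Protobuf on the host)</li>
<li><code>ncnnoptimize</code> - optimizes NCNN models</li>
<li><code>ncnn2table</code> - generates the quantization calibration table</li>
<li><code>ncnn2int8</code> - performs INT8 quantization</li>
<li>and so on&hellip;</li>
</ul>
<p>Add these directories to PATH or note where they are, because they are used heavily later on.</p>
<h3 id="23-交叉编译arm-linux">2.3 Cross-Compiling (ARM Linux)</h3>
<p>Next comes the most important step: cross-compiling NCNN for the target ARM device. This section assumes an ARMv8 target (Cortex-A53/A55/A72/A76 and similar) and the <code>aarch64-linux-gnu-gcc</code> toolchain.</p>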
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">cd</span> ..
</span></span><span class="line"><span class="cl">mkdir -p build-arm64 <span class="o">&amp;&amp;</span> <span class="nb">cd</span> build-arm64
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">cmake .. <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DCMAKE_TOOLCHAIN_FILE<span class="o">=</span>../toolchains/aarch64-linux-gnu.toolchain.cmake <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_TOOLS<span class="o">=</span>OFF <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_EXAMPLES<span class="o">=</span>OFF <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_BUILD_BENCHMARK<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_VULKAN<span class="o">=</span>OFF <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_SYSTEM_GLSLANG<span class="o">=</span>OFF <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_OPENMP<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DCMAKE_BUILD_TYPE<span class="o">=</span>Release
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">make -j<span class="k">$(</span>nproc<span class="k">)</span>
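</span></span></code></pre></div><p>Notes on a few key build options:</p>
<table>
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>NCNN_VULKAN</code></td>
<td>Enables Vulkan GPU acceleration</td>
</tr>
<tr>
<td><code>NCNN_OPENMP</code></td>
<td>Enables OpenMP multithreading</td>
</tr>
<tr>
<td><code>NCNN_BUILD_TOOLS</code></td>
<td>Builds the conversion tools, which do not need to run on the ARM target</td>
</tr>
<tr>
<td><code>NCNN_RUNTIME_CPU</code></td>
<td>Detects CPU features at runtime and picks the optimized code path dynamically</td>
</tr>
</tbody>
</table>
<p>If you need Vulkan GPU support, set <code>NCNN_VULKAN</code> to <code>ON</code>, but make sure the target device has a working Vulkan driver.</p>
<p>After the build, copy <code>src/libncnn.a</code> and the headers into your cross-compilation environment, or pull NCNN straight into your CMake project with <code>add_subdirectory</code>.</p>
<h3 id="24-android--ios-编译">2.4 Android / iOS Build</h3>
<p>For mobile targets the configuration is just as short. The Android variant below uses the toolchain file that ships with the NDK:</p>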
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Android</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ncnn
</span></span><span class="line"><span class="cl">mkdir -p build-android <span class="o">&amp;&amp;</span> <span class="nb">cd</span> build-android
</span></span><span class="line"><span class="cl">cmake .. -DCMAKE_TOOLCHAIN_FILE<span class="o">=</span><span class="nv">$ANDROID_NDK</span>/build/cmake/android.toolchain.cmake <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DANDROID_ABI<span class="o">=</span>arm64-v8a <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DANDROID_PLATFORM<span class="o">=</span>android-24 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    -DNCNN_VULKAN<span class="o">=</span>ON
</span></span></code></pre></div>
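<h2 id="三模型转换从-pytorch-到-onnx-再到-ncnn">3. Model Conversion: From PyTorch to ONNX to NCNN</h2>
<p>Model conversion is the step of the deployment flow where things go wrong most often. A model that looks perfect can stall the whole process because of one inconspicuous operator. This section follows the standard flow step by step and tries to steer clear of the common pitfalls.</p>
<h3 id="31-第一步pytorch-导出-onnx">3.1 Step One: Exporting ONNX from PyTorch</h3>
<p>Before handing the model to onnx2ncnn, the PyTorch model has to be exported to ONNX. The step looks simple but hides a few traps.</p>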
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torchvision</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Load the model</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">torchvision</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">resnet18</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Prepare a sample input</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Export to ONNX</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">onnx</span><span class="o">.</span><span class="n">export</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">dummy_input</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;resnet18.onnx&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">export_params</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">opset_version</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">do_constant_folding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;input&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;output&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">dynamic_axes</span><span class="o">=</span><span class="kc">None</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
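</span></span></code></pre></div><p>The code looks standard, but a few points need special attention:</p>
<p><strong>Choosing opset_version</strong>: do not pick an opset that is too new or too old. opset 11-13 is currently the most compatible range. A higher opset (17+, for example) may introduce operator representations that onnx2ncnn does not support yet.</p>
<p><strong>dynamic_axes set to None</strong>: keep it that way unless you really need dynamic shapes. NCNN is best optimized for fixed input shapes; dynamic shapes cost some performance and may also trigger bugs in certain operators. If your input size is fixed, leave dynamic axes off.</p>
<p><strong>Call model.eval() before exporting</strong>: this matters because layers such as BatchNorm and Dropout behave differently in training and inference mode. Forgetting eval() is one of the most common beginner mistakes.</p>
<p>After exporting, it is worth simplifying the model with onnxsim, which resolves a large share of conversion problems:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Install onnxsim</span>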
</span></span><span class="line"><span class="cl">pip install onnxsim
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Simplify the ONNX model</span>
</span></span><span class="line"><span class="cl">onnxsim resnet18.onnx resnet18-sim.onnx
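</span></span></code></pre></div><p>onnxsim performs constant folding, shape inference, dead node elimination and similar optimizations. Many models that onnx2ncnn rejects convert cleanly after onnxsim, so this step should not be skipped.</p>
<h3 id="32-第二步onnx2ncnn-转换">3.2 Step Two: onnx2ncnn Conversion</h3>
<p>With the ONNX file ready, the next step is converting it to NCNN's native format. An NCNN model consists of two files:</p>
<ul>
<li><code>.param</code> - the network structure (text format, readable in any text editor)</li>
<li><code>.bin</code> - the weight data (binary format)</li>
</ul>
<p>The conversion command is simple:</p>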
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">onnx2ncnn resnet18-sim.onnx resnet18.param resnet18.bin
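</span></span></code></pre></div><p>If everything goes well, you will see some output and no error at the end. An error means the converter hit an unsupported operator or a problem in the ONNX file.</p>
<p>Common error types and fixes:</p>
<p><strong>1. &ldquo;Unsupported resize mode&rdquo;</strong>
The Resize operator is where conversions fail most often. ONNX Resize has several values for coordinate_transformation_mode, and NCNN supports only <code>asymmetric</code> and <code>align_corners</code>. If your model uses another mode, change the interpolation settings in the model code before exporting, or edit the attributes of the ONNX node by hand with the <code>onnx</code> Python package.</p>
<p><strong>2. &ldquo;Unsupported slice with step != 1&rdquo;</strong>
NCNN's Slice handling only supports a step of 1. If the model contains a Slice with step &gt; 1, replace it with a Reshape + Permute + Reshape combination, or change the model structure to avoid this kind of Slice.</p>
<p><strong>3. &ldquo;Too many axes for permute&rdquo;</strong>
NCNN's Permute supports at most 4 dimensions. If the model has a Permute over 5 or more dimensions, consider splitting it or expressing it with a combination of other operators.</p>
<p>After a successful conversion, open the <code>.param</code> file and take a look. The first line is the magic number <code>7767517</code>, the second line holds the layer count and the blob count, and every following line defines one layer. Check for unusual layer types such as <code>Shape</code> or <code>Gather</code>, which usually mean the model contains dynamic-shape operations that NCNN supports only to a limited extent.</p>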
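<p>This inspection is easy to script. The Python sketch below parses the text form of a <code>.param</code> file, assuming the layout of a magic line, a count line and one layer per line; the sample network and the <code>SUSPECT_TYPES</code> watch list are made up for illustration.</p>

```python
from collections import Counter

NCNN_PARAM_MAGIC = "7767517"
SUSPECT_TYPES = {"Shape", "Gather"}  # hypothetical watch list, extend as needed


def summarize_param(text):
    """Parse the text form of an ncnn .param file into counts per layer type."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != NCNN_PARAM_MAGIC:
        raise ValueError("not a text .param file (bad magic number)")
    layer_count, blob_count = (int(v) for v in lines[1].split())
    # Each remaining line starts with: layer_type layer_name input_count output_count ...
    types = Counter(ln.split()[0] for ln in lines[2:])
    if sum(types.values()) != layer_count:
        raise ValueError("layer count in header does not match the body")
    return layer_count, blob_count, types


sample = """7767517
3 3
Input            input    0 1 input
Convolution      conv1    1 1 input conv1 0=64 1=7 3=2 4=3 5=1 6=9408
ReLU             relu1    1 1 conv1 relu1
"""
layers, blobs, types = summarize_param(sample)
suspects = sorted(t for t in types if t in SUSPECT_TYPES)
print(layers, blobs, dict(types), suspects)
```

<p>For a real model, read the file with <code>open("resnet18.param").read()</code> and review every type reported in <code>suspects</code> before moving on to optimization.</p>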
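<h3 id="33-第三步ncnnoptimize-优化">3.3 Step Three: ncnnoptimize</h3>
<p>The freshly converted model can be optimized further. The <code>ncnnoptimize</code> tool can:</p>
<ul>
<li>Fuse BatchNorm into Convolution</li>
<li>Remove Dropout layers (they do nothing at inference time)</li>
<li>Convert the weight data type (FP32 → FP16)</li>
<li>Optimize memory layout</li>
</ul>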
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ncnnoptimize resnet18.param resnet18.bin resnet18-opt.param resnet18-opt.bin <span class="m">0</span>
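</span></span></code></pre></div><p>The final argument <code>0</code> keeps FP32 weights and <code>1</code> converts them to FP16 (NCNN's own documentation writes the FP16 flag as <code>65536</code>). FP16 halves the model size, can bring a clear speedup on ARMv8.2+ devices, and usually costs very little accuracy.</p>
<p>The optimization produces two files, <code>resnet18-opt.param</code> and <code>resnet18-opt.bin</code>. These are the model files to deploy for floating-point inference.</p>
<h2 id="四int8-量化让推理速度再翻倍">4. INT8 Quantization: Pushing Inference Speed Further</h2>
<p>On embedded devices, FP32 inference is often still not fast enough. With controlled accuracy loss, INT8 quantization can raise inference speed by another 1.5-2x, and it cuts memory use as well, since weights shrink to roughly a quarter of their FP32 size.</p>
<h3 id="41-量化的基本原理">4.1 How Quantization Works</h3>
<p>The core idea of quantization is to approximate 32-bit floating-point numbers with 8-bit integers. In short:</p>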
<pre tabindex="0"><code>float_value = scale * (int8_value - zero_point)
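</code></pre><p>In the general scheme every tensor carries its own scale and zero_point. NCNN's INT8 path uses symmetric quantization, so zero_point is effectively 0 and only scales need to be stored. At inference time the input is quantized to INT8, the convolution runs in INT8, and the result is dequantized back to FP32 (or passed straight to the next layer in INT8).</p>
<p>NCNN uses post-training quantization, which needs no retraining. A few hundred calibration images are enough to quantize the model.</p>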
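<p>The arithmetic can be tried out on a handful of numbers. The Python sketch below assumes symmetric quantization with zero_point = 0 and stores the scale as the multiplier from float to INT8, which is the reciprocal of the scale in the formula above; the activation values are placeholders, and the absmax rule is the simplest possible calibration, not the KL-based one used by the tools.</p>

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude onto 127 (absmax calibration)."""
    absmax = max(abs(v) for v in values)
    return 127.0 / absmax if absmax > 0 else 1.0


def quantize(values, scale):
    """float -> int8 using q = clamp(round(x * scale), -127, 127)."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def dequantize(qvalues, scale):
    """int8 -> float using x = q / scale."""
    return [q / scale for q in qvalues]


activations = [-2.0, -0.5, 0.0, 0.75, 1.0]
scale = symmetric_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(q, max_error)
```

<p>The round-trip error is bounded by half a quantization step, that is <code>0.5 / scale</code>. A single outlier that inflates the absmax therefore makes every other value coarser, which is why calibration methods clip the range instead of using the raw maximum.</p>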
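<h3 id="42-生成校准表">4.2 Generating the Calibration Table</h3>
<p>First prepare a batch of calibration images. 100-1000 images are usually enough. They do not have to come from the training set, as long as their distribution is similar.</p>
<p>Create an <code>imagelist.txt</code> file with one calibration image path per line:</p>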
<pre tabindex="0"><code>calib/000001.jpg
calib/000002.jpg
calib/000003.jpg
...
</code></pre><p>Then generate the calibration table:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ncnn2table <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-opt.param <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-opt.bin <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    imagelist.txt <span class="se">\
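</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18.table <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    mean=[103.53,116.28,123.675] norm=[0.017429,0.017507,0.017125] shape=[224,224,3] pixel=BGR method=kl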
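</span></span></code></pre></div><p>This step is slow, because it runs a forward pass over every calibration image and collects the activation range of each layer. The <code>mean</code>, <code>norm</code>, <code>shape</code> and <code>pixel</code> arguments accepted by ncnn2table must describe exactly the same preprocessing as the inference code.</p>
<p>The generated <code>.table</code> file is plain text. Open it and you will see one line of quantization parameters per layer.</p>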
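<p>Writing <code>imagelist.txt</code> by hand gets tedious beyond a few dozen images. A small Python helper along these lines can generate it; the function name, the suffix list and the default limit of 500 are assumptions for illustration.</p>

```python
import random
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def write_image_list(image_dir, list_path, limit=500, seed=0):
    """Write up to `limit` randomly sampled image paths, one per line."""
    images = sorted(
        p for p in Path(image_dir).rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES
    )
    random.Random(seed).shuffle(images)  # fixed seed keeps the sample reproducible
    chosen = sorted(images[:limit])
    Path(list_path).write_text("".join(f"{p.as_posix()}\n" for p in chosen))
    return len(chosen)
```

<p>Calling <code>write_image_list("calib", "imagelist.txt")</code> from the directory that contains <code>calib/</code> produces paths in the same relative form as the listing above. Recording the seed and the image count alongside the table makes the calibration run reproducible.</p>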
<h3 id="43-执行量化">4.3 Running the Quantization</h3>
<p>With the calibration table in hand, the FP32 model can be converted to INT8:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ncnn2int8 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-opt.param <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-opt.bin <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-int8.param <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18-int8.bin <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    resnet18.table
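</span></span></code></pre></div><p>This produces the INT8 version of the model. The <code>.bin</code> file is roughly a quarter of its original size.</p>
<h3 id="44-量化精度调优">4.4 Tuning Quantization Accuracy</h3>
<p>If accuracy drops noticeably after quantization, try the following:</p>
<ol>
<li>
<p><strong>Use more calibration images</strong>: going from 100 to 500 images usually helps.</p>
</li>
<li>
<p><strong>Pick a suitable calibration algorithm</strong>: ncnn2table selects the algorithm with its <code>method</code> argument. <code>kl</code> (KL divergence) is the default and <code>aciq</code> is an alternative; check the quantization documentation of your NCNN release for the full list, and compare accuracy across methods.</p>
</li>
<li>
<p><strong>Keep sensitive layers out of quantization</strong>: some layers (detection heads, for example) are especially sensitive to quantization and can be excluded so that they stay in FP32. NCNN's quantization documentation describes doing this by commenting out the layer's weight-scale line in the table file before running ncnn2int8.</p>
</li>
<li>
<p><strong>Check that preprocessing is consistent</strong>: preprocessing before and after quantization (normalization, channel order and so on) must be exactly the same. Many people overlook this, yet its impact is large.</p>
</li>
</ol>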
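<h2 id="五c-推理代码编写">5. Writing the C++ Inference Code</h2>
<p>With the model ready, the next step is the actual inference code. NCNN's API is quite compact, and a complete inference pass takes only a few lines.</p>
<h3 id="51-最简推理示例">5.1 A Minimal Inference Example</h3>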
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;opencv2/opencv.hpp&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;net.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// 1. Create the Net object and load the model
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ncnn</span><span class="o">::</span><span class="n">Net</span> <span class="n">net</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net</span><span class="p">.</span><span class="n">load_param</span><span class="p">(</span><span class="s">&#34;resnet18-int8.param&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">net</span><span class="p">.</span><span class="n">load_model</span><span class="p">(</span><span class="s">&#34;resnet18-int8.bin&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// 2. Read the image and preprocess it
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">img</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">imread</span><span class="p">(</span><span class="s">&#34;test.jpg&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Resize to the model's input size
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span> <span class="n">in</span> <span class="o">=</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">::</span><span class="n">from_pixels_resize</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">img</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">::</span><span class="n">PIXEL_BGR</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">img</span><span class="p">.</span><span class="n">cols</span><span class="p">,</span> <span class="n">img</span><span class="p">.</span><span class="n">rows</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span>
</span></span><span class="line"><span class="cl">    <span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// Normalize with ImageNet mean/std, listed in B, G, R order to match PIXEL_BGR
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="k">const</span> <span class="kt">float</span> <span class="n">mean_vals</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mf">103.53f</span><span class="p">,</span> <span class="mf">116.28f</span><span class="p">,</span> <span class="mf">123.675f</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">float</span> <span class="n">norm_vals</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mf">0.017429f</span><span class="p">,</span> <span class="mf">0.017507f</span><span class="p">,</span> <span class="mf">0.017125f</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="n">in</span><span class="p">.</span><span class="n">substract_mean_normalize</span><span class="p">(</span><span class="n">mean_vals</span><span class="p">,</span> <span class="n">norm_vals</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// 3. Run inference
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ncnn</span><span class="o">::</span><span class="n">Extractor</span> <span class="n">ex</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">create_extractor</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="n">ex</span><span class="p">.</span><span class="n">set_num_threads</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span>  <span class="c1">// number of worker threads
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ex</span><span class="p">.</span><span class="n">input</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span> <span class="n">out</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">ex</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">&#34;output&#34;</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// 4. Parse the output
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="c1">// out holds 1000 raw scores; the index of the largest one is the predicted class
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">int</span> <span class="n">max_idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">float</span> <span class="n">max_val</span> <span class="o">=</span> <span class="n">out</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">out</span><span class="p">.</span><span class="n">w</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">max_val</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_val</span> <span class="o">=</span> <span class="n">out</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">printf</span><span class="p">(</span><span class="s">&#34;Predicted class: %d, confidence: %.4f</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">,</span> <span class="n">max_idx</span><span class="p">,</span> <span class="n">max_val</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>This code shows the basic inference flow, but several details deserve a closer look.</p>
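<p>One such detail: for a torchvision ResNet18, the value printed as <code>confidence</code> is the raw score of the final fully connected layer, not a probability. Turning scores into probabilities takes a softmax, sketched here in Python for brevity; the four input values are placeholders for the 1000 outputs.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


logits = [1.0, 3.0, 0.5, 2.0]  # placeholder scores standing in for the 1000 outputs
print(top_k(logits, k=2))
```

<p>The predicted class is the same with or without softmax, since the function preserves ordering. It only matters when the number is shown to users or compared against a threshold.</p>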
<h3 id="52-输入预处理的坑">5.2 输入预处理的坑</h3>
<p>预处理是最容易出问题但也最容易被忽视的环节。我见过至少一半的部署问题，最后都追溯到预处理不一致。</p>
<p><strong>通道顺序</strong>：OpenCV 读进来的图片是 BGR 顺序，而 PyTorch 训练时通常是 RGB 顺序。注意上面代码中 <code>from_pixels_resize</code> 的第二个参数是 <code>ncnn::Mat::PIXEL_BGR</code>，这意味着 NCNN 会保持 BGR 顺序。如果你训练时用的是 RGB，这里应该改成 <code>ncnn::Mat::PIXEL_BGR2RGB</code>。</p>
<p><strong>归一化参数</strong>：<code>mean_vals</code> 和 <code>norm_vals</code> 必须和训练时完全一致。很多人训练时用的是 <code>mean=[0.485, 0.456, 0.406]</code>, <code>std=[0.229, 0.224, 0.225]</code>，这和代码中的数值是等价的，只是转换了一下：</p>
<pre tabindex="0"><code>mean_vals = [0.485*255, 0.456*255, 0.406*255]
norm_vals = [1.0/255/0.229, 1.0/255/0.224, 1.0/255/0.225]
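</code></pre><p><strong>Interpolation</strong>: <code>from_pixels_resize</code> resizes with bilinear interpolation, so make sure the resize step used when training and evaluating the model uses the same method. A different resampling filter shifts pixel values slightly and can cost accuracy.</p>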
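<p>The equivalence of the two parameter sets is easy to check numerically. The sketch below assumes the usual ImageNet statistics quoted above and compares the PyTorch-style formula <code>(x / 255 - mean) / std</code> with the <code>(x - mean_vals) * norm_vals</code> form that NCNN's <code>substract_mean_normalize</code> applies to raw 0-255 pixels. The helper names <code>torch_style</code> and <code>ncnn_style</code> are made up for illustration.</p>

```python
# Check that the NCNN-style parameters reproduce PyTorch-style normalization.
MEAN = [0.485, 0.456, 0.406]   # ImageNet mean, RGB order
STD = [0.229, 0.224, 0.225]    # ImageNet std, RGB order

mean_vals = [m * 255.0 for m in MEAN]
norm_vals = [1.0 / (255.0 * s) for s in STD]


def torch_style(pixel, c):
    """Scale to 0-1, then normalize with mean/std for channel c."""
    return (pixel / 255.0 - MEAN[c]) / STD[c]


def ncnn_style(pixel, c):
    """(x - mean) * norm on raw 0-255 pixel values for channel c."""
    return (pixel - mean_vals[c]) * norm_vals[c]


# Compare every possible 8-bit pixel value in every channel.
max_diff = max(
    abs(torch_style(p, c) - ncnn_style(p, c))
    for c in range(3)
    for p in range(256)
)
print(f"max difference over all pixel values: {max_diff:.2e}")
```

<p>If the difference is not at floating-point noise level, the two pipelines are not using the same statistics, or the channel order differs.</p>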
<h3 id="53-线程数与性能调优">5.3 线程数与性能调优</h3>
<p><code>set_num_threads()</code> 是一个非常重要的函数。线程数不是越多越好，最优值取决于你的 CPU 核心数和架构：</p>
<table>
<thead>
<tr>
<th>CPU 架构</th>
<th>推荐线程数</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 核 Cortex-A53</td>
<td>4</td>
</tr>
<tr>
<td>2 核 A72 + 4 核 A53</td>
<td>4 或 6</td>
</tr>
<tr>
<td>4 核 A76 + 4 核 A55</td>
<td>4（只绑大核）或 8</td>
</tr>
</tbody>
</table>
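<p>On big.LITTLE SoCs, running only on the big cores is often faster than using every core. A multi-threaded layer finishes only when its slowest thread does, so little cores such as the A53 hold back the big ones, and spreading work across both clusters adds scheduling and synchronization overhead.</p>
<p>NCNN can also pin its worker threads to one core cluster. Note that <code>set_cpu_powersave</code> is a global function in the <code>ncnn</code> namespace (declared in <code>cpu.h</code>), not a method of the extractor:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">ncnn</span><span class="o">::</span><span class="n">set_cpu_powersave</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>  <span class="c1">// 0 = all cores, 1 = little cores only, 2 = big cores only
</span></span></span></code></pre></div><p><code>ncnn::set_cpu_powersave(2)</code> is the usual starting point on ARM big.LITTLE devices.</p>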
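<p>Because the best thread count depends on the device, it is worth measuring rather than guessing. The following is a minimal sketch of such a sweep. <code>fake_inference</code> is a stand-in workload invented for this example; in practice it would be replaced by a call into your own inference binding that accepts a thread count.</p>

```python
import time


def sweep_thread_counts(run_once, candidates, warmup=3, loops=10):
    """Time run_once(num_threads) for each candidate; return (best, results)."""
    results = {}
    for n in candidates:
        for _ in range(warmup):          # let caches and clocks settle
            run_once(n)
        start = time.perf_counter()
        for _ in range(loops):
            run_once(n)
        results[n] = (time.perf_counter() - start) / loops
    best = min(results, key=results.get)
    return best, results


def fake_inference(num_threads):
    """Stand-in workload; replace with a real inference call."""
    total = 0
    for i in range(20000 // num_threads):
        total += i * i
    return total


best, results = sweep_thread_counts(fake_inference, [1, 2, 4])
for n, sec in sorted(results.items()):
    print(f"{n} threads: {sec * 1e3:.3f} ms/iter")
print("fastest:", best)
```

<p>On a real board, run the sweep once per core-affinity setting as well, since the best thread count for "big cores only" usually differs from the one for "all cores".</p>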
<h3 id="54-cmakeliststxt-配置">5.4 CMakeLists.txt 配置</h3>
<p>最后不要忘了写 CMakeLists.txt：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cmake" data-lang="cmake"><span class="line"><span class="cl"><span class="nb">cmake_minimum_required</span><span class="p">(</span><span class="s">VERSION</span> <span class="s">3.0</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">project</span><span class="p">(</span><span class="s">ncnn_inference</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_CXX_STANDARD</span> <span class="s">11</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># NCNN 路径
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">set</span><span class="p">(</span><span class="s">ncnn_DIR</span> <span class="s2">&#34;/path/to/ncnn/build/install/lib/cmake/ncnn&#34;</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">find_package</span><span class="p">(</span><span class="s">ncnn</span> <span class="s">REQUIRED</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">add_executable</span><span class="p">(</span><span class="s">ncnn_inference</span> <span class="s">main.cpp</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">target_link_libraries</span><span class="p">(</span><span class="s">ncnn_inference</span> <span class="s">ncnn</span><span class="p">)</span><span class="err">
</span></span></span></code></pre></div><h2 id="六性能-benchmark-与优化技巧">六、性能 Benchmark 与优化技巧</h2>
<p>模型跑起来只是第一步，跑得多快才是关键。这一节我们来看看如何 benchmark 性能，以及有哪些优化手段。</p>
<h3 id="61-使用-ncnn_benchmark">6.1 Using benchncnn</h3>
<p>NCNN ships a benchmark tool, <code>benchncnn</code>, for measuring a model on the target device. It feeds the network dummy input and does not read real weights, so only the <code>.param</code> file has to be copied over:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Build the benchmark tool</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ncnn/build-arm64
</span></span><span class="line"><span class="cl">cmake .. -DNCNN_BUILD_BENCHMARK<span class="o">=</span>ON
</span></span><span class="line"><span class="cl">make -j4
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Copy the benchncnn binary and the param file to the device</span>
</span></span><span class="line"><span class="cl">adb push benchmark/benchncnn /data/local/tmp/
</span></span><span class="line"><span class="cl">adb push resnet18-int8.param /data/local/tmp/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run the benchmark on the device</span>
</span></span><span class="line"><span class="cl">adb shell
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> /data/local/tmp/
</span></span><span class="line"><span class="cl">./benchncnn <span class="m">10</span> <span class="m">4</span> <span class="m">2</span> -1 <span class="m">1</span> param=resnet18-int8.param shape=[224,224,3]
</span></span></code></pre></div><p>The positional arguments are, in order: loop count, thread count, powersave mode (0 = all cores, 1 = little cores, 2 = big cores), GPU device (-1 = CPU only) and the cooling-down flag. <code>param=</code> and <code>shape=</code> select a custom model and its input shape. This form needs a reasonably recent NCNN, so check <code>benchmark/README.md</code> in your checkout if the tool rejects it.</p>
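<p>Whichever tool produces the timings, a single average hides warmup effects and outliers. The sketch below shows one way to summarize a list of per-run latencies. The numbers are invented for illustration and <code>summarize</code> is a hypothetical helper, not part of NCNN.</p>

```python
import statistics


def summarize(latencies_ms, warmup=0):
    """Drop the first `warmup` samples and report basic latency statistics."""
    samples = sorted(latencies_ms[warmup:])
    if not samples:
        raise ValueError("no samples left after dropping warmup runs")
    # Nearest-rank style index for the 99th percentile.
    p99_index = min(len(samples) - 1, int(round(0.99 * (len(samples) - 1))))
    return {
        "min": samples[0],
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p99": samples[p99_index],
        "max": samples[-1],
    }


# Made-up numbers: two slow warmup runs followed by steady-state runs.
runs = [41.0, 25.0, 12.1, 12.3, 11.9, 12.0, 12.2, 15.8, 12.1, 12.0]
stats = summarize(runs, warmup=2)
for name, value in stats.items():
    print(f"{name:>6}: {value:.2f} ms")
```

<p>For real-time pipelines the tail (p99 or max) usually matters more than the mean, because a single slow frame is what breaks the frame budget.</p>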
<h3 id="62-逐层性能分析">6.2 逐层性能分析</h3>
<p>To find out which layers are slowest, NCNN can print per-layer timings. This is a build-time option rather than a switch on the extractor: configure NCNN with <code>-DNCNN_BENCHMARK=ON</code>, rebuild, and link your application against that build:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cmake .. -DNCNN_BENCHMARK<span class="o">=</span>ON
</span></span><span class="line"><span class="cl">make -j4
</span></span></code></pre></div><p>Each layer's execution time is then printed while inference runs, which makes the bottleneck easy to spot.</p>
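<p>Layers that commonly become bottlenecks:</p>
<table>
<thead>
<tr>
<th>Layer type</th>
<th>What to try</th>
</tr>
</thead>
<tbody>
<tr>
<td>Convolution</td>
<td>Winograd for 3x3, stride-1 kernels</td>
</tr>
<tr>
<td>DepthWise Conv</td>
<td>Make sure it runs on the dedicated depthwise kernels (packed layout); im2col+sgemm applies to ordinary convolutions, not depthwise ones</td>
</tr>
<tr>
<td>Sigmoid/HardSwish</td>
<td>Use an approximate (fast-math) implementation where accuracy allows</td>
</tr>
<tr>
<td>Upsample</td>
<td>Prefer nearest over bilinear if the model was trained with it and the accuracy impact is acceptable</td>
</tr>
</tbody>
</table>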
<h3 id="63-内存优化技巧">6.3 内存优化技巧</h3>
<p>嵌入式设备的内存往往比性能还紧张。NCNN 提供了多种内存优化手段：</p>
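<p><strong>Light Mode</strong>: intermediate blobs are released as soon as no later layer needs them, which lowers peak memory considerably. The extractor method is <code>set_light_mode</code>, and light mode is already on by default in current NCNN (<code>net.opt.lightmode</code>), so the call matters mainly if it was switched off for debugging:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">ex</span><span class="p">.</span><span class="n">set_light_mode</span><span class="p">(</span><span class="nb">true</span><span class="p">);</span>
</span></span></code></pre></div><p><strong>FP16 storage</strong>: weights and intermediate results can be stored as FP16 even when the arithmetic is done in FP32, roughly halving their memory footprint:</p>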
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">net</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_fp16_storage</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
</span></span></code></pre></div><p><strong>Pack4 优化</strong>：对于 4 通道对齐的张量，NCNN 有特殊优化，内存访问更友好：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">net</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_packing_layout</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
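</span></span></code></pre></div><p>Used together, these switches typically lower peak memory by roughly 30-50%, although the exact saving depends on the model and on which options were already enabled, so measure it on the target device.</p>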
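<p>The "halving" from FP16 storage is plain arithmetic on tensor sizes. As a rough illustration, the sketch below computes the size of one activation tensor. The shape is the 64 x 112 x 112 output of the first convolution of a ResNet with 224 x 224 input, and <code>tensor_bytes</code> is a hypothetical helper.</p>

```python
def tensor_bytes(channels, height, width, bytes_per_elem):
    """Size of one dense activation tensor in bytes."""
    return channels * height * width * bytes_per_elem


# Output of ResNet's first convolution for a 224 x 224 input.
shape = (64, 112, 112)
fp32 = tensor_bytes(*shape, 4)   # 4 bytes per FP32 element
fp16 = tensor_bytes(*shape, 2)   # 2 bytes per FP16 element
print(f"FP32: {fp32 / 1024:.0f} KiB, FP16: {fp16 / 1024:.0f} KiB")
```

<p>Summing this over the blobs that are alive at the same time gives a first estimate of peak memory before anything is run on the device.</p>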
<h2 id="七常见问题与解决方案">七、常见问题与解决方案</h2>
<p>部署过程中会遇到各种各样的问题，这里总结一些最常见的坑。</p>
<h3 id="71-推理结果不对">7.1 推理结果不对</h3>
<p>这是最常见也是最头疼的问题。排查思路：</p>
<ol>
<li><strong>检查预处理</strong>：通道顺序、均值、标准差、归一化是否和训练一致？</li>
<li><strong>检查输出后处理</strong>：有没有做 Softmax？有没有 sigmoid？</li>
<li><strong>逐层对比</strong>：用 PyTorch 导出某一层的输出，和 NCNN 同一层的输出对比。</li>
<li><strong>检查模型转换</strong>：是不是 onnx2ncnn 时某个算子转换错了？</li>
</ol>
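<p>For step 3, once the same layer has been dumped from both frameworks as flat arrays, the comparison itself is simple. A minimal sketch, with made-up values standing in for the two dumps and a hypothetical <code>compare</code> helper:</p>

```python
import math


def compare(reference, candidate):
    """Return (max absolute difference, cosine similarity) of two flat lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different element counts")
    max_abs = max(abs(r - c) for r, c in zip(reference, candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    cosine = dot / norm if norm > 0 else float("nan")
    return max_abs, cosine


# Made-up values standing in for one layer dumped from both frameworks.
pytorch_out = [0.12, -1.50, 3.40, 0.00, 2.25]
ncnn_out = [0.12, -1.49, 3.41, 0.01, 2.25]
max_abs, cosine = compare(pytorch_out, ncnn_out)
print(f"max abs diff = {max_abs:.4f}, cosine = {cosine:.6f}")
```

<p>Walking through the layers in order and stopping at the first one where the cosine similarity drops sharply usually points straight at the misconverted operator or the preprocessing mismatch.</p>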
<p>逐层对比是定位问题的杀手锏：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// 在 NCNN 中提取中间层输出
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span> <span class="n">conv1_out</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="n">ex</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">&#34;conv1&#34;</span><span class="p">,</span> <span class="n">conv1_out</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// 导出为文本或 numpy 数组，和 PyTorch 对比
</span></span></span></code></pre></div><h3 id="72-性能不如预期">7.2 性能不如预期</h3>
<ol>
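<li><strong>Is the thread count sensible?</strong> Try 1, 2, 4 and 8 threads and keep the fastest.</li>
<li><strong>Are the threads bound to the big cores?</strong> Try <code>ncnn::set_cpu_powersave(2)</code>.</li>
<li><strong>Is FP16 enabled?</strong> On ARMv8.2+ CPUs with FP16 arithmetic support, FP16 inference is considerably faster.</li>
<li><strong>Has the model been through ncnnoptimize?</strong> Fusing BatchNorm into the preceding convolution has a large effect on speed.</li>
<li><strong>Is INT8 quantization actually in effect?</strong> Confirm that the int8 version of the model is the one being loaded.</li>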
</ol>
<h3 id="73-内存不足">7.3 内存不足</h3>
<ol>
<li><strong>Enable light mode</strong>: <code>ex.set_light_mode(true)</code> (on by default in current NCNN, so check that it has not been turned off).</li>
<li><strong>使用 FP16 storage</strong>：<code>net.opt.use_fp16_storage = true</code>。</li>
<li><strong>减小 batch size</strong>：尽量用 batch 1。</li>
<li><strong>模型剪枝</strong>：对不重要的通道剪枝。</li>
</ol>
<h3 id="74-部署在内存受限的-mcu">7.4 部署在内存受限的 MCU</h3>
<p>如果是在几 MB 内存的 MCU 上部署，还需要这些额外操作：</p>
<ol>
<li><strong>静态分配内存</strong>：不要用动态分配，所有内存都预先分配。</li>
<li><strong>权重量化到 INT8</strong>：甚至 INT4。</li>
<li><strong>权重放在 Flash</strong>：运行时按需读取，不全部加载到 RAM。</li>
<li><strong>逐层计算</strong>：计算完一层就释放输入，只保留输出。</li>
</ol>
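<p>The benefit of layer-by-layer execution can be estimated before writing any firmware. The sketch below assumes a purely sequential network with no skip connections and INT8 activations. The layer sizes are hypothetical and both function names are invented for this example.</p>

```python
def peak_activation_bytes(layer_io_sizes):
    """Peak RAM when only one layer's input and output are alive at a time.

    layer_io_sizes: list of (input_bytes, output_bytes) per layer, in order.
    """
    return max(inp + out for inp, out in layer_io_sizes)


def total_activation_bytes(layer_io_sizes):
    """RAM needed if the input and every layer output are kept until the end."""
    first_input = layer_io_sizes[0][0]
    return first_input + sum(out for _, out in layer_io_sizes)


# Hypothetical INT8 activations (1 byte per element) of a tiny CNN.
layers = [
    (96 * 96 * 1, 48 * 48 * 8),     # conv1, stride 2
    (48 * 48 * 8, 24 * 24 * 16),    # conv2, stride 2
    (24 * 24 * 16, 12 * 12 * 32),   # conv3, stride 2
    (12 * 12 * 32, 10),             # global pool + classifier
]
print("keep everything:", total_activation_bytes(layers), "bytes")
print("layer by layer :", peak_activation_bytes(layers), "bytes")
```

<p>With static allocation, the layer-by-layer peak is the size of the single scratch arena to reserve at link time. Networks with residual connections need the skipped tensor added to every layer it spans.</p>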
<h2 id="八进阶方向">八、进阶方向</h2>
<p>掌握了基础部署后，还有很多值得深入的方向：</p>
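<p><strong>Custom operators</strong>: when NCNN does not support an operator in your model, you have to implement it yourself. That requires understanding NCNN's layer registration mechanism and its memory layout.</p>
<p><strong>Vulkan GPU acceleration</strong>: on devices with a Mali GPU, enabling the Vulkan backend can bring a speedup of roughly 2-3x for models that suit it. Results vary with the GPU and driver, and the cost of moving data between CPU and GPU can cancel the gain for small models.</p>
<p><strong>Distillation and pruning</strong>: quantization is itself lossy, although with good calibration the accuracy drop is usually small. Pruning and distillation change the network more aggressively and can reach higher compression ratios. Combining them can give substantial speedups as long as the accuracy drop stays acceptable.</p>
<p><strong>Multi-model pipelines</strong>: real products rarely run a single model; they run a detection + tracking + recognition pipeline. How to divide memory and compute between several models is a topic worth studying in its own right.</p>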
<h2 id="总结">总结</h2>
<p>这篇文章从环境搭建开始，完整走过了 ONNX 导出、模型转换、INT8 量化、C++ 推理代码编写、性能 benchmark 的完整流程。回头来看，部署这件事其实没有什么特别高深的理论，更多的是工程细节的堆砌和经验的积累。</p>
<p>从 PyTorch 的一行 <code>model(x)</code> 到嵌入式设备上的 C++ 推理代码，中间隔着几十个大大小小的细节。任何一个细节出问题，都可能导致最终结果不对或者性能不达标。这也是为什么部署工程师这个岗位虽然看起来只是在&quot;搬模型&quot;，但实际需要深厚的工程功底。</p>
<p>NCNN 作为一个优秀的推理框架，为我们屏蔽了很多底层的复杂性，但它不是银弹。真正把一个模型部署到产品上，还需要对网络结构、硬件架构、编译器优化、内存管理等等都有一定的理解。这正是嵌入式 AI 的魅力所在——它不是单纯的算法，也不是单纯的工程，而是两者的深度结合。</p>
<p>希望这篇文章能帮助你少踩一些坑，在嵌入式 AI 的道路上走得更顺一些。</p>
]]></content:encoded>
    </item>
    <item>
      <title>基于 TensorRT 的深度学习模型推理加速实战指南</title>
      <link>https://tech-snippets.xyz/posts/tensorrt-inference-optimization-guide/</link>
      <pubDate>Thu, 28 May 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/tensorrt-inference-optimization-guide/</guid>
      <description>前言 在深度学习从学术研究走向工业落地的今天，推理性能已经成为决定项目成败的关键因素。
你可能有过这样的经历：花了几个月时间精心训练了一个准确率 99% 的模型，结果一到生产环境就傻眼了——单帧推理需要 500ms，离业务要求的 30ms 差了十万八千里。这时候你面临两个选择：要么花几十万升级硬件，要么想办法把模型跑快一点。
TensorRT 就是帮你实现第二个选择的神器。作为 NVIDIA 推出的深度学习推理优化器，它能让同样的模型在同样的硬件上跑出 4 到 20 倍的性能提升，而且精度损失可以控制在 1% 以内。更重要的是，这种提升是「免费」的——不需要改变网络结构，不需要重新训练，只需要多一道「编译」工序。
这篇文章是我过去三年使用 TensorRT 的经验总结。从最基础的环境搭建，到 ONNX 模型转换，再到 INT8 量化校准，最后到生产级的 C++ 部署，我会把每一个坑、每一个优化技巧都毫无保留地分享给你。如果你正在做模型部署，或者正在为推理速度发愁，这篇文章就是为你准备的。
一、为什么我们需要 TensorRT？ 在深入技术细节之前，我们先来回答一个最基本的问题：既然 PyTorch 和 TensorFlow 本身就能跑推理，为什么还要折腾 TensorRT？
1.1 训练框架的设计目标不是推理 PyTorch 和 TensorFlow 作为训练框架，它们的设计优先级是：
灵活性 - 支持任意计算图的动态构建 易用性 - Python 接口、自动微分 通用性 - 支持从 CPU 到多 GPU 的各种硬件 推理性能从来都不是它们的首要设计目标。为了灵活性，PyTorch 每次执行都要重新遍历计算图，每一个算子都要走通用的 CUDA kernel，这中间浪费了大量的性能。
举个例子：一个简单的 Conv + BatchNorm + ReLU 组合，在 PyTorch 里会执行三次独立的 kernel 调用，每次都要读写全局显存。而 TensorRT 会把这三层融合成一个 kernel，中间结果全部存在寄存器里——光这一项就能带来 2-3 倍的性能提升。</description>
      <content:encoded><![CDATA[<h2 id="前言">前言</h2>
<p>在深度学习从学术研究走向工业落地的今天，<strong>推理性能</strong>已经成为决定项目成败的关键因素。</p>
<p>你可能有过这样的经历：花了几个月时间精心训练了一个准确率 99% 的模型，结果一到生产环境就傻眼了——单帧推理需要 500ms，离业务要求的 30ms 差了十万八千里。这时候你面临两个选择：要么花几十万升级硬件，要么想办法把模型跑快一点。</p>
<p>TensorRT 就是帮你实现第二个选择的神器。作为 NVIDIA 推出的深度学习推理优化器，它能让同样的模型在同样的硬件上跑出 <strong>4 到 20 倍的性能提升</strong>，而且精度损失可以控制在 1% 以内。更重要的是，这种提升是「免费」的——不需要改变网络结构，不需要重新训练，只需要多一道「编译」工序。</p>
<p>这篇文章是我过去三年使用 TensorRT 的经验总结。从最基础的环境搭建，到 ONNX 模型转换，再到 INT8 量化校准，最后到生产级的 C++ 部署，我会把每一个坑、每一个优化技巧都毫无保留地分享给你。如果你正在做模型部署，或者正在为推理速度发愁，这篇文章就是为你准备的。</p>
<p><img alt="TensorRT 模型推理加速流程" loading="lazy" src="/images/tensorrt-workflow.svg"></p>
<h2 id="一为什么我们需要-tensorrt">一、为什么我们需要 TensorRT？</h2>
<p>在深入技术细节之前，我们先来回答一个最基本的问题：既然 PyTorch 和 TensorFlow 本身就能跑推理，为什么还要折腾 TensorRT？</p>
<h3 id="11-训练框架的设计目标不是推理">1.1 训练框架的设计目标不是推理</h3>
<p>PyTorch 和 TensorFlow 作为训练框架，它们的设计优先级是：</p>
<ol>
<li><strong>灵活性</strong> - 支持任意计算图的动态构建</li>
<li><strong>易用性</strong> - Python 接口、自动微分</li>
<li><strong>通用性</strong> - 支持从 CPU 到多 GPU 的各种硬件</li>
</ol>
<p>推理性能从来都不是它们的首要设计目标。为了灵活性，PyTorch 每次执行都要重新遍历计算图，每一个算子都要走通用的 CUDA kernel，这中间浪费了大量的性能。</p>
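<p>For example, a plain Conv + BatchNorm + ReLU sequence runs as three separate kernels in eager-mode PyTorch, each reading its input from and writing its output to global GPU memory. TensorRT folds BatchNorm into the convolution weights and applies ReLU inside the same <strong>fused</strong> kernel, so the intermediate results stay in registers or shared memory and never travel through global memory. For memory-bound layers this alone can give a speedup of around 2-3x.</p>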
<h3 id="12-tensorrt-的核心优化手段">1.2 TensorRT 的核心优化手段</h3>
<p>TensorRT 能做到这么大的性能提升，靠的是以下几个关键优化：</p>
<p><strong>1. 算子融合（Kernel Fusion）</strong>
把相邻的多个小算子合并成一个大算子，减少 kernel 启动开销和显存访问次数。这是 TensorRT 最有效的优化手段之一。</p>
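<p><strong>2. Reduced precision</strong>
Going from FP32 to FP16 and then to INT8 cuts memory use to a half or a quarter, and, more importantly, NVIDIA GPUs have dedicated Tensor Cores for low-precision math. On Ampere and later, peak INT8 Tensor Core throughput is many times the FP32 rate; the exact ratio depends on the GPU model, so check the datasheet rather than relying on a single figure.</p>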
<p><strong>3. 自动调优</strong>
TensorRT 会针对你的具体 GPU 型号，在几十个甚至上百个候选 kernel 中选择最快的那个。同样的模型在 3090 和 A100 上会生成完全不同的执行计划。</p>
<p><strong>4. 动态内存管理</strong>
推理时的中间张量会尽可能复用内存，而不是每次都申请释放。这在 batch 很大的时候，能省下大量显存。</p>
<p><strong>5. 层消除</strong>
推理时根本不需要的层（比如 Dropout）会被直接移除，恒等变换的层也会被优化掉。</p>
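<p>The Conv + BatchNorm + ReLU fusion from item 1 rests on simple algebra: at inference time BatchNorm is an affine transform, so it can be folded into the convolution's weight and bias. The sketch below shows this for a single channel of a 1x1 convolution, reduced to scalars. It illustrates the arithmetic only and is not TensorRT's actual implementation; all function names are invented.</p>

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv - mean) / sqrt(var + eps) + beta into conv params."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def conv_bn_relu(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps, as an unfused graph would run them."""
    y = weight * x + bias                                  # 1x1 "convolution"
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # batch norm
    return max(y, 0.0)                                     # ReLU


def fused(x, weight, bias):
    """Single step with folded parameters; ReLU applied in the same pass."""
    return max(weight * x + bias, 0.0)


params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_f, b_f = fold_bn(**params)
for x in (-2.0, 0.0, 0.5, 3.0):
    # The fused form must match the three-step form for every input.
    assert abs(conv_bn_relu(x, **params) - fused(x, w_f, b_f)) < 1e-9
print("folded weight and bias:", w_f, b_f)
```

<p>In a real network the same folding is done per output channel, and the saving comes from skipping two round trips through global memory rather than from the arithmetic itself.</p>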
<h3 id="13-性能提升到底有多大">1.3 性能提升到底有多大？</h3>
<p>空口无凭，我们来看一组实测数据（在 NVIDIA RTX 3090 上测试）：</p>
<table>
<thead>
<tr>
<th>模型</th>
<th>框架</th>
<th>精度</th>
<th>FPS</th>
<th>加速比</th>
</tr>
</thead>
<tbody>
<tr>
<td>ResNet-50</td>
<td>PyTorch</td>
<td>FP32</td>
<td>198</td>
<td>1x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>PyTorch</td>
<td>FP16</td>
<td>387</td>
<td>1.95x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>TensorRT</td>
<td>FP16</td>
<td>1182</td>
<td>5.97x</td>
</tr>
<tr>
<td>ResNet-50</td>
<td>TensorRT</td>
<td>INT8</td>
<td>2456</td>
<td>12.4x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>PyTorch</td>
<td>FP16</td>
<td>520</td>
<td>1x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>TensorRT</td>
<td>FP16</td>
<td>2150</td>
<td>4.13x</td>
</tr>
<tr>
<td>YOLOv8n</td>
<td>TensorRT</td>
<td>INT8</td>
<td>3890</td>
<td>7.48x</td>
</tr>
</tbody>
</table>
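<p>In these measurements, switching to TensorRT FP16 alone gives a 4-6x speedup, and INT8 reaches 7-12x. Note that the baselines differ: ResNet-50 is compared against PyTorch FP32, YOLOv8n against PyTorch FP16. Transformer models can gain even more, up to 15-20x in favourable cases, but the figure depends heavily on the model, sequence length and batch size, so benchmark your own workload.</p>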
<h3 id="14-什么时候该用-tensorrt">1.4 什么时候该用 TensorRT？</h3>
<p>TensorRT 不是银弹，以下场景特别适合用 TensorRT：</p>
<ul>
<li>✅ 追求极致推理延迟和吞吐量</li>
<li>✅ 在边缘设备（Jetson、嵌入式）部署</li>
<li>✅ GPU 资源紧张，需要最大化利用率</li>
<li>✅ 固定输入尺寸的批量推理</li>
<li>✅ 已经训练好、准备上线的模型</li>
</ul>
<p>而以下场景可以不用折腾：</p>
<ul>
<li>❌ 还在快速迭代的实验阶段</li>
<li>❌ 对速度要求不高（比如每秒处理几张图）</li>
<li>❌ 需要频繁改变网络结构</li>
<li>❌ CPU 部署（TensorRT 只支持 NVIDIA GPU）</li>
</ul>
<h2 id="二tensorrt-核心概念解析">二、TensorRT 核心概念解析</h2>
<p>在开始写代码之前，我们先把几个核心概念搞清楚，不然后面很容易晕。</p>
<h3 id="21-builder-vs-runtime">2.1 Builder vs Runtime</h3>
<p>TensorRT 的工作流程分为两个完全独立的阶段：</p>
<p><strong>构建阶段（Builder）</strong>：这是一个「离线」的过程，只需要跑一次。Builder 负责解析你的网络结构，做各种优化，最后生成一个序列化的「引擎文件」（通常叫 .plan 或者 .engine）。这个过程比较慢，可能需要几分钟甚至几十分钟，因为要做大量搜索和优化。</p>
<p><strong>运行阶段（Runtime）</strong>：这是「在线」推理时用的。Runtime 反序列化引擎文件，创建执行上下文，然后就可以跑推理了。Runtime 很轻量，启动也很快，因为所有的优化工作都已经在构建阶段做完了。</p>
<p><strong>重要提示</strong>：构建好的引擎文件是<strong>硬件相关</strong>的。你在 3090 上构建的引擎不能直接拿到 A100 上跑，必须在目标硬件上重新构建。甚至连 TensorRT 版本变了都可能不兼容，这一点一定要注意。</p>
<h3 id="22-精度模式">2.2 精度模式</h3>
<p>TensorRT 支持三种主要的精度模式，你可以根据业务需求选择：</p>
<p><strong>FP32（单精度浮点）</strong>：</p>
<ul>
<li>和 PyTorch 默认精度一致</li>
<li>完全没有精度损失</li>
<li>速度最慢</li>
<li>通常作为基准</li>
</ul>
<p><strong>FP16（半精度浮点）</strong>：</p>
<ul>
<li>绝大多数模型精度损失小于 0.5%</li>
<li>有 Tensor Core 加速，速度 2-3 倍于 FP32</li>
<li>显存占用减半</li>
<li><strong>推荐优先使用</strong></li>
</ul>
<p><strong>INT8（8位整数）</strong>：</p>
<ul>
<li>精度损失通常在 1-2%（取决于校准质量）</li>
<li>速度是 FP32 的 4-10 倍</li>
<li>显存占用只有原来的 1/4</li>
<li>需要校准数据集</li>
<li>对检测、分割等任务需要小心调试</li>
</ul>
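<p>To make the INT8 trade-off concrete, here is a minimal sketch of symmetric per-tensor quantization, where the scale maps the largest absolute value to 127. This is deliberately the simplest scheme. TensorRT's calibrators choose the range more carefully (for example with entropy calibration), so treat the code as an illustration of where the error comes from, not as what TensorRT does internally. The input values are made up.</p>

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: scale maps max |x| to 127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [v * scale for v in q]


activations = [0.02, -0.75, 1.30, 2.54, -2.00, 0.40]
q, scale = quantize_int8(activations)
restored = dequantize(q, scale)
worst = max(abs(a - r) for a, r in zip(activations, restored))
print("scale:", scale)
print("int8 :", q)
print(f"worst round-trip error: {worst:.4f} (bounded by scale / 2)")
```

<p>The error bound is half a quantization step, and the step grows with the largest value in the tensor. A single outlier therefore degrades the resolution for every other value, which is why calibration quality matters so much for detection and segmentation models.</p>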
<h3 id="23-动态-shape">2.3 动态 Shape</h3>
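<p>A common surprise for newcomers is that an engine built the default way accepts only one fixed input size. TensorRT makes its optimization decisions at build time, including convolution tile sizes and the memory allocation plan, and those decisions depend on the tensor shapes.</p>
<p>Real workloads often need several input sizes, for example images of different resolutions in a detection task. That is what <strong>dynamic shape</strong> mode is for:</p>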
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// 构建阶段指定每个维度的范围
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">IOptimizationProfile</span><span class="o">*</span> <span class="n">profile</span> <span class="o">=</span> <span class="n">builder</span><span class="o">-&gt;</span><span class="n">createOptimizationProfile</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kMIN</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">});</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kOPT</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">640</span><span class="p">,</span> <span class="mi">640</span><span class="p">});</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">-&gt;</span><span class="n">setDimensions</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="n">OptProfileSelector</span><span class="o">::</span><span class="n">kMAX</span><span class="p">,</span> <span class="n">Dims4</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1280</span><span class="p">,</span> <span class="mi">1280</span><span class="p">});</span>
</span></span></code></pre></div><p>动态 Shape 会牺牲一些性能（通常 10-20%），但换来的是灵活性，对于很多应用场景是值得的。</p>
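<p>An engine built with a profile rejects inputs outside the [kMIN, kMAX] range at run time, so it is convenient to check shapes before they reach the engine. A minimal sketch using the same bounds as the profile above; <code>shape_in_profile</code> is a hypothetical helper, not a TensorRT API.</p>

```python
def shape_in_profile(shape, min_shape, max_shape):
    """True if every dimension lies within [min, max] of the profile."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))


# Same bounds as the kMIN / kMAX calls in the profile.
MIN_SHAPE = (1, 3, 256, 256)
MAX_SHAPE = (1, 3, 1280, 1280)

for candidate in [(1, 3, 640, 640), (1, 3, 1920, 1080), (2, 3, 640, 640)]:
    ok = shape_in_profile(candidate, MIN_SHAPE, MAX_SHAPE)
    print(candidate, "accepted" if ok else "outside profile")
```

<p>Inputs that fall outside the range have to be resized or padded first, or the engine has to be rebuilt with a wider profile, which in turn tends to cost more performance.</p>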
<h2 id="三环境搭建从-0-到-1">三、环境搭建：从 0 到 1</h2>
<p>TensorRT 的环境配置曾经是劝退很多人的第一道坎，不过最近几年已经简单很多了。这里我推荐两种最稳妥的安装方式。</p>
<h3 id="31-方式一docker推荐">3.1 方式一：Docker（推荐）</h3>
<p>用 Docker 是最简单、最不容易出问题的方式。NVIDIA 官方已经把所有依赖都打包好了。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 拉取 TensorRT 官方镜像（选择和你的 CUDA 版本匹配的）</span>
</span></span><span class="line"><span class="cl">docker pull nvcr.io/nvidia/tensorrt:24.05-py3
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 启动容器</span>
</span></span><span class="line"><span class="cl">docker run --gpus all -it --rm <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -v /your/workspace:/workspace <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  nvcr.io/nvidia/tensorrt:24.05-py3
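</span></span></code></pre></div><p>The image bundles what TensorRT itself depends on:</p>
<ul>
<li>CUDA Toolkit 12.4</li>
<li>cuDNN 9.1</li>
<li>TensorRT 10.x (the exact version for each tag is listed in the container release notes)</li>
<li>The TensorRT Python bindings</li>
</ul>
<p>TensorRT works as soon as the container starts. As far as I know PyTorch is not part of the TensorRT image, so for the export and verification scripts below either run <code>pip install torch torchvision onnx</code> inside the container or start from the <code>nvcr.io/nvidia/pytorch</code> image instead.</p>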
<h3 id="32-方式二本地安装">3.2 方式二：本地安装</h3>
<p>如果你不想用 Docker，也可以直接在本地安装。先去 <a href="https://developer.nvidia.com/tensorrt">NVIDIA 官网</a> 下载对应版本的 TensorRT tar 包，然后：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 解压</span>
</span></span><span class="line"><span class="cl">tar -xzf TensorRT-10.1.0.27.Ubuntu-22.04.x86_64-gnu.cuda-12.4.tar.gz
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 添加到环境变量</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">TENSORRT_DIR</span><span class="o">=</span>/path/to/TensorRT-10.1.0.27
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$TENSORRT_DIR</span>/lib:<span class="nv">$LD_LIBRARY_PATH</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span><span class="nv">$TENSORRT_DIR</span>/python:<span class="nv">$PYTHONPATH</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 安装 Python 包</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> <span class="nv">$TENSORRT_DIR</span>/python
</span></span><span class="line"><span class="cl">pip install tensorrt-10.1.0-cp310-none-linux_x86_64.whl
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 验证安装</span>
</span></span><span class="line"><span class="cl">python -c <span class="s2">&#34;import tensorrt; print(tensorrt.__version__)&#34;</span>
</span></span></code></pre></div><p><strong>版本兼容性检查清单</strong>：</p>
<ul>
<li>CUDA 版本 ≥ 11.8</li>
<li>cuDNN 版本和 TensorRT 要求一致</li>
<li>PyTorch 版本和 CUDA 匹配</li>
<li>Python 3.8 ~ 3.11</li>
</ul>
<p>版本不兼容是 90% 奇怪问题的根源，一定要在最开始就确认好。</p>
<h3 id="33-安装验证">3.3 安装验证</h3>
<p>不管用哪种方式安装，最后都跑一下这个脚本确认没问题：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT version: </span><span class="si">{</span><span class="n">trt</span><span class="o">.</span><span class="n">__version__</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;PyTorch version: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">__version__</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;CUDA available: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">()</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;CUDA device: </span><span class="si">{</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 检查 TensorRT 核心库</span>
</span></span><span class="line"><span class="cl"><span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">builder</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Builder</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT builder created successfully&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>如果所有信息都正常打印出来了，说明环境没问题，可以继续往下走了。</p>
<h2 id="四第一步把-pytorch-模型导出成-onnx">四、第一步：把 PyTorch 模型导出成 ONNX</h2>
<p>TensorRT 不直接读取 PyTorch 的 <code>.pth</code> 文件，我们需要先把模型导出成 ONNX 格式。这一步虽然简单，但里面的坑也不少。</p>
<h3 id="41-基础导出代码">4.1 基础导出代码</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torchvision.models</span> <span class="k">as</span> <span class="nn">models</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 加载模型</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">resnet50</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 构建 dummy 输入</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 导出 ONNX</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">onnx</span><span class="o">.</span><span class="n">export</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">dummy_input</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;resnet50.onnx&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">opset_version</span><span class="o">=</span><span class="mi">17</span><span class="p">,</span>           <span class="c1"># 尽量用最新的 opset</span>
</span></span><span class="line"><span class="cl">    <span class="n">do_constant_folding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>   <span class="c1"># 常量折叠优化</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;input&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_names</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;output&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">dynamic_axes</span><span class="o">=</span><span class="p">{</span>              <span class="c1"># 如果需要动态 shape</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;input&#34;</span><span class="p">:</span> <span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s2">&#34;batch_size&#34;</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span> <span class="s2">&#34;height&#34;</span><span class="p">,</span> <span class="mi">3</span><span class="p">:</span> <span class="s2">&#34;width&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;output&#34;</span><span class="p">:</span> <span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s2">&#34;batch_size&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;ONNX exported successfully&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="42-onnx-简化关键步骤">4.2 ONNX 简化（关键步骤）</h3>
<p>PyTorch 导出的 ONNX 经常包含很多冗余的算子和恒等变换，直接喂给 TensorRT 有时候会出问题，而且也不利于优化。所以一定要用 <code>onnxsim</code> 做简化：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 安装 onnxsim</span>
</span></span><span class="line"><span class="cl">pip install onnxsim
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 简化模型</span>
</span></span><span class="line"><span class="cl">onnxsim resnet50.onnx resnet50_sim.onnx
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 或者用 Python API</span>
</span></span><span class="line"><span class="cl">from onnxsim import simplify
</span></span><span class="line"><span class="cl">import onnx
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nv">model</span> <span class="o">=</span> onnx.load<span class="o">(</span><span class="s2">&#34;resnet50.onnx&#34;</span><span class="o">)</span>
</span></span><span class="line"><span class="cl">model_sim, <span class="nv">check</span> <span class="o">=</span> simplify<span class="o">(</span>model<span class="o">)</span>
</span></span><span class="line"><span class="cl">assert check, <span class="s2">&#34;Simplified ONNX model could not be validated&#34;</span>
</span></span><span class="line"><span class="cl">onnx.save<span class="o">(</span>model_sim, <span class="s2">&#34;resnet50_sim.onnx&#34;</span><span class="o">)</span>
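</span></span></code></pre></div><p>This step matters. At least a dozen times I have hit the "PyTorch exports fine, but TensorRT fails to parse" problem, and every time one pass through onnxsim fixed it. <strong>Never skip this step.</strong></p>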
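<h3 id="43-导出常见问题">4.3 Common Export Problems</h3>
<p><strong>Problem 1: dynamic control flow</strong>
If your model contains data-dependent branches or loops such as <code>if</code> and <code>for</code>, PyTorch emits a warning during export:</p>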
<pre tabindex="0"><code>TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect.
</code></pre><p>You then have three options:</p>
<ol>
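<li>Rewrite the dynamic logic as static logic (recommended)</li>
<li>Convert the module with <code>torch.jit.script</code> before calling <code>torch.onnx.export</code>, so that the control flow is exported rather than frozen by tracing</li>
<li>As a last resort, rely on the ONNX <code>If</code> node as supported by TensorRT's ONNX parser (requires opset ≥ 13)</li>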
</ol>
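<p><strong>Problem 2: unsupported operators</strong>
When you hit an unsupported operator, such as a newer activation function, there are three ways to handle it:</p>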
<ol>
<li>Compose it from existing operators (for example, write Swish as x * sigmoid(x))</li>
<li>Write a custom TensorRT plugin</li>
<li>Upgrade TensorRT; newer versions usually support more operators</li>
</ol>
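<p>As a minimal sketch of the first option, Swish can be written using only a multiplication and a sigmoid, both of which are standard ONNX operators. Plain Python floats stand in for tensors here, and the function names are illustrative rather than taken from any library:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish (SiLU) composed from existing operators: x * sigmoid(x)."""
    return x * sigmoid(x)


# Print a few sample points of the composed activation.
for value in (-2.0, 0.0, 2.0):
    print(value, round(swish(value), 4))
</code></pre></div><p>In a PyTorch module the same rewrite is <code>x * torch.sigmoid(x)</code>, which exports as a <code>Sigmoid</code> node followed by a <code>Mul</code> node.</p>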
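<h2 id="五用-python-api-构建-tensorrt-引擎">5. Building a TensorRT Engine with the Python API</h2>
<p>With the ONNX model in hand, the next step is to compile it into an inference engine using TensorRT's Python API.</p>
<h3 id="51-基础构建流程">5.1 Basic Build Flow</h3>
<p>Here is a complete build script, with each step numbered in the comments:</p>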
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Create the logger</span>
</span></span><span class="line"><span class="cl"><span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Create the builder and network</span>
</span></span><span class="line"><span class="cl"><span class="n">builder</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Builder</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">network</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_network</span><span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="nb">int</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">NetworkDefinitionCreationFlag</span><span class="o">.</span><span class="n">EXPLICIT_BATCH</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Create the ONNX parser</span>
</span></span><span class="line"><span class="cl"><span class="n">parser</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">OnnxParser</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">logger</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. Parse the ONNX file</span>
</span></span><span class="line"><span class="cl"><span class="n">success</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_from_file</span><span class="p">(</span><span class="s2">&#34;resnet50_sim.onnx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="ow">not</span> <span class="n">success</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Failed to parse ONNX file&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">error</span> <span class="ow">in</span> <span class="n">parser</span><span class="o">.</span><span class="n">errors</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">error</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 5. Configure build parameters</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_builder_config</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>  <span class="c1"># Enable FP16 precision</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_memory_pool_limit</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">MemoryPoolType</span><span class="o">.</span><span class="n">WORKSPACE</span><span class="p">,</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">30</span><span class="p">)</span>  <span class="c1"># 1GB workspace</span>
</span></span><span class="line"><span class="cl">
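</span></span><span class="line"><span class="cl"><span class="c1"># 6. Build the serialized engine (returns None if the build fails)</span>
</span></span><span class="line"><span class="cl"><span class="n">serialized_engine</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">build_serialized_network</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">serialized_engine</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">raise</span> <span class="ne">RuntimeError</span><span class="p">(</span><span class="s2">&#34;Engine build failed; check the logger output&#34;</span><span class="p">)</span>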
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 7. Save to a file</span>
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">serialized_engine</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Engine built successfully!&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The flow has many steps, but the logic is linear: Logger → Builder → Network → Parser → Config → Engine.</p>
<h3 id="52-关键配置选项">5.2 Key Configuration Options</h3>
<p>BuilderConfig exposes many important switches. These are the most commonly used:</p>
<p><strong>Precision</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>       <span class="c1"># Enable FP16</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">INT8</span><span class="p">)</span>       <span class="c1"># Enable INT8</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">OBEY_PRECISION_CONSTRAINTS</span><span class="p">)</span> <span class="c1"># Enforce the requested layer precisions with no automatic fallback (successor to the deprecated STRICT_TYPES)</span>
</span></span></code></pre></div><p><strong>Debugging</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>      <span class="c1"># Enable layer debugging (synchronizes after every layer)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">profiling_verbosity</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">ProfilingVerbosity</span><span class="o">.</span><span class="n">DETAILED</span>  <span class="c1"># Keep detailed per-layer information for profiling and the engine inspector</span>
</span></span></code></pre></div><p><strong>Performance</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">TF32</span><span class="p">)</span>       <span class="c1"># Allow TF32 computation (Ampere and newer)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">PREFER_PRECISION_CONSTRAINTS</span><span class="p">)</span> <span class="c1"># Prefer the requested layer precisions, falling back only when no such implementation exists</span>
</span></span></code></pre></div><h3 id="53-动态-shape-配置">5.3 Dynamic Shape Configuration</h3>
<p>If your ONNX model was exported with dynamic axes, you also need to configure an optimization profile:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create an optimization profile</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">create_optimization_profile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Set the minimum, optimal, and maximum shapes; only the batch dimension was exported as dynamic</span>
</span></span><span class="line"><span class="cl"><span class="n">profile</span><span class="o">.</span><span class="n">set_shape</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;input&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nb">min</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">opt</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nb">max</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Add it to the config</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">add_optimization_profile</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span>
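</span></span></code></pre></div><p>TensorRT tunes its kernels most aggressively for the <code>opt</code> shape while guaranteeing correct execution anywhere between <code>min</code> and <code>max</code>. Keep the three shapes reasonably close together; a very wide range costs performance.</p>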
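<p>TensorRT expects the three shapes of a profile to have the same rank and to satisfy min ≤ opt ≤ max in every dimension. A small pure-Python check along these lines can catch a mistake before a long build starts. Both helper names are hypothetical and are not part of the TensorRT API:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def validate_profile(min_shape, opt_shape, max_shape):
    """Return True when all shapes share a rank and min, opt, max are ordered per dimension."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    # A triple is correctly ordered exactly when sorting it changes nothing.
    return all(
        sorted((lo, mid, hi)) == [lo, mid, hi]
        for lo, mid, hi in zip(min_shape, opt_shape, max_shape)
    )


def dynamic_dims(min_shape, max_shape):
    """Return the indices of the dimensions that vary across the profile."""
    return [i for i, (lo, hi) in enumerate(zip(min_shape, max_shape)) if lo != hi]


shapes = ((1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
print(validate_profile(*shapes))
print(dynamic_dims(shapes[0], shapes[2]))
</code></pre></div><p>For a model exported with only a dynamic batch axis, <code>dynamic_dims</code> should report dimension 0 and nothing else. Any other index suggests the profile does not match the export.</p>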
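<h2 id="六int8-量化把性能推到极限">6. INT8 Quantization: Pushing Performance to the Limit</h2>
<p>FP16 is already fast, but if you want to squeeze out up to roughly another 2x, INT8 quantization is the way to get there.</p>
<p>The principle behind INT8 is simple: map 32-bit floating-point weights and activations onto the 8-bit integer range [-128, 127]. The hard part is choosing the mapping that loses the least accuracy.</p>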
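<p>To make the mapping concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The weight values are made up for illustration, and real toolchains differ in details such as rounding mode and per-channel scales:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def quantize(values, scale):
    """Map floats to INT8 codes: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floats."""
    return [c * scale for c in codes]


weights = [-0.62, -0.05, 0.0, 0.31, 1.27]
# Symmetric scale: the largest magnitude maps to code 127.
scale = max(abs(w) for w in weights) / 127
codes = quantize(weights, scale)
restored = dequantize(codes, scale)

print(codes)
print([round(r, 3) for r in restored])
</code></pre></div><p>As long as no value is clamped, the round-trip error of each weight is at most half of <code>scale</code>, which is why the choice of scale decides how much accuracy survives.</p>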
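<h3 id="61-为什么需要校准">6.1 Why Is Calibration Needed?</h3>
<p>The value range of the weights is known ahead of time, but the range of the activations (the output of each layer) depends on the input data. An arbitrary scale factor is likely either to squash most activations to values near 0 or to saturate and clip them.</p>
<p>So we run inference over a batch of <strong>representative real data</strong>, collect the actual activation distribution of every layer, and pick the best scale factor from it. This process is called <strong>calibration</strong>.</p>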
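<p>The effect of a badly chosen range is easy to reproduce. The sketch below uses synthetic activations (an assumption, not measured data) with a single large outlier, and compares a range taken from the absolute maximum against one taken from the 99.9th percentile of magnitudes. It illustrates the idea only and is not the algorithm any TensorRT calibrator runs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def quantization_error(values, clip):
    """Mean absolute error after symmetric INT8 quantization over [-clip, clip]."""
    scale = clip / 127
    total = 0.0
    for v in values:
        code = max(-127, min(127, round(v / scale)))
        total += abs(v - code * scale)
    return total / len(values)


# Synthetic activations: 9999 values in [-1, 1) plus one outlier at 50.
activations = [((i % 200) - 100) / 100 for i in range(9999)] + [50.0]

ranked = sorted(abs(a) for a in activations)
max_clip = ranked[-1]                              # range from the absolute maximum
pct_clip = ranked[len(ranked) * 999 // 1000 - 1]   # range from the 99.9th percentile

print("max-based clip:", max_clip, "error:", quantization_error(activations, max_clip))
print("percentile clip:", pct_clip, "error:", quantization_error(activations, pct_clip))
</code></pre></div><p>With the max-based range, the one outlier stretches the scale so far that the typical activations share only a handful of integer codes. Clipping the outlier costs a large error on that single value but a much smaller error on average.</p>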
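<h3 id="62-实现校准器">6.2 Implementing a Calibrator</h3>
<p>TensorRT ships several built-in calibration algorithms; we only need to subclass one of the base classes and implement the data-feeding part:</p>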
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.driver</span> <span class="k">as</span> <span class="nn">cuda</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.autoinit</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">os</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">ImageBatchStream</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">calib_files</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">calib_files</span> <span class="o">=</span> <span class="n">calib_files</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">calib_files</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">max_batches</span> <span class="o">=</span> <span class="mi">100</span>  <span class="c1"># 100 batches is enough</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">next_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_count</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_batches</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">            <span class="n">batch</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="c1"># load_image() and preprocess() are dataset-specific helpers you must add to this class</span>
</span></span><span class="line"><span class="cl">                <span class="n">img</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">load_image</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">calib_files</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_size</span> <span class="o">+</span> <span class="n">j</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">                <span class="n">batch</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">preprocess</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">yield</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Int8Calibrator</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">IInt8EntropyCalibrator2</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch_stream</span><span class="p">,</span> <span class="n">cache_file</span><span class="o">=</span><span class="s2">&#34;calibration.cache&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">trt</span><span class="o">.</span><span class="n">IInt8EntropyCalibrator2</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batch_stream</span> <span class="o">=</span> <span class="n">batch_stream</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span> <span class="o">=</span> <span class="n">cache_file</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">d_input</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">mem_alloc</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="n">batch_stream</span><span class="o">.</span><span class="n">batch_size</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">batches</span> <span class="o">=</span> <span class="n">batch_stream</span><span class="o">.</span><span class="n">next_batch</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_batch_size</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_stream</span><span class="o">.</span><span class="n">batch_size</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">names</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">batch</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">batches</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_htod</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_input</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_input</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">StopIteration</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">read_calibration_cache</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">write_calibration_cache</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cache</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cache_file</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">cache</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="63-四种校准算法的选择">6.3 Choosing Among the Four Calibration Algorithms</h3>
<p>TensorRT provides four calibrators, each with a different emphasis:</p>
<table>
<thead>
<tr>
<th>Calibrator</th>
<th>Principle</th>
<th>Typical use</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>IInt8EntropyCalibrator2</code></td>
<td>Minimizes KL divergence</td>
<td>Classification</td>
<td>Best</td>
</tr>
<tr>
<td><code>IInt8MinMaxCalibrator</code></td>
<td>Uses the observed min/max directly</td>
<td>Detection, segmentation</td>
<td>Good</td>
</tr>
<tr>
<td><code>IInt8LegacyCalibrator</code></td>
<td>Percentile-based (legacy)</td>
<td>Compatibility with old code</td>
<td>Fair</td>
</tr>
<tr>
<td><code>IInt8EntropyCalibrator</code></td>
<td>Original entropy calibration</td>
<td>Not recommended</td>
<td>Fair</td>
</tr>
</tbody>
</table>
<p><strong>Rules of thumb</strong>:</p>
<ul>
<li>Classification → EntropyCalibrator2</li>
<li>Detection/segmentation → MinMaxCalibrator</li>
<li>First attempt → start with MinMax, then try Entropy if the results are poor</li>
</ul>
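<h3 id="64-开启-int8-构建">6.4 Enabling INT8 in the Build</h3>
<p>With a calibrator in place, building the engine is straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Prepare the calibration data (get_calibration_images is a helper you provide)</span>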
</span></span><span class="line"><span class="cl"><span class="n">calib_files</span> <span class="o">=</span> <span class="n">get_calibration_images</span><span class="p">(</span><span class="s2">&#34;/path/to/coco/val2017&#34;</span><span class="p">,</span> <span class="n">num_images</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">batch_stream</span> <span class="o">=</span> <span class="n">ImageBatchStream</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">calib_files</span><span class="o">=</span><span class="n">calib_files</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">calibrator</span> <span class="o">=</span> <span class="n">Int8Calibrator</span><span class="p">(</span><span class="n">batch_stream</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Configure INT8</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">INT8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">int8_calibrator</span> <span class="o">=</span> <span class="n">calibrator</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># FP16 can be enabled as well; TensorRT then picks the fastest precision per layer</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">FP16</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build the engine</span>
</span></span><span class="line"><span class="cl"><span class="n">serialized_engine</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="n">build_serialized_network</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
</span></span></code></pre></div><p><strong>The choice of calibration dataset matters</strong>:</p>
<ul>
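<li>Size: 500 to 2000 images is usually enough</li>
<li>Distribution: it must match the data seen at inference time</li>
<li>Diversity: cover a range of scenes, lighting conditions, and angles</li>
<li>Do not use the training set; use a subset of the validation set</li>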
</ul>
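<p>The build snippet above calls a <code>get_calibration_images</code> helper without defining it. One possible implementation, assuming the images sit directly inside a single directory, draws a reproducible random subset. The <code>seed</code> parameter and the list of file extensions are assumptions of this sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random
from pathlib import Path


def get_calibration_images(image_dir, num_images, seed=0):
    """Return a reproducible random subset of image paths found in image_dir."""
    extensions = {".jpg", ".jpeg", ".png"}
    # Sort first so the result does not depend on filesystem ordering.
    files = sorted(
        str(path)
        for path in Path(image_dir).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
    count = min(num_images, len(files))
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(files, count)
</code></pre></div><p>Fixing the seed means the same images are chosen on every run, so a change in INT8 accuracy can be attributed to the model or the calibrator rather than to a different sample.</p>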
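<h3 id="65-常见量化坑">6.5 Common Quantization Pitfalls</h3>
<p><strong>Pitfall 1: some layers do not support INT8</strong></p>
<p>Not every operator has an INT8 implementation. When TensorRT meets one that does not, it automatically falls back to FP16 or FP32. This is normal. You can use the engine <code>inspector</code> to see the precision each layer actually runs in (per-layer detail is only recorded when the engine was built with <code>ProfilingVerbosity.DETAILED</code>):</p>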
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">inspector</span> <span class="o">=</span> <span class="n">engine</span><span class="o">.</span><span class="n">create_engine_inspector</span><span class="p">()</span>
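</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">inspector</span><span class="o">.</span><span class="n">get_engine_information</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">LayerInformationFormat</span><span class="o">.</span><span class="n">JSON</span><span class="p">))</span>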
</span></span></code></pre></div><p><strong>Pitfall 2: mAP drops too much after quantization</strong></p>
<p>If accuracy falls too far after quantization, try the following:</p>
<ol>
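<li>Increase the number of calibration images</li>
<li>Switch to a different calibration algorithm</li>
<li>Force the sensitive layers to FP16</li>
<li>Use QAT (quantization-aware training) instead of PTQ (post-training quantization)</li>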
</ol>
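<p>The third item can be sketched as a small helper that walks the network and pins selected layers to a given data type. It relies only on <code>num_layers</code>, <code>get_layer</code>, <code>name</code>, <code>precision</code>, <code>num_outputs</code>, and <code>set_output_type</code> from the network and layer objects. The helper name is hypothetical, and the data type is passed in by the caller so that the sketch itself does not import TensorRT:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def pin_layer_precision(network, layer_names, dtype):
    """Pin the named layers and all of their outputs to dtype; return the names that matched."""
    wanted = set(layer_names)
    pinned = []
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        if layer.name in wanted:
            layer.precision = dtype
            # Also fix the output types so the result is not re-quantized to INT8.
            for out in range(layer.num_outputs):
                layer.set_output_type(out, dtype)
            pinned.append(layer.name)
    return pinned
</code></pre></div><p>With a real network this would be called before the build, for example as <code>pin_layer_precision(network, sensitive_names, trt.float16)</code>, where <code>sensitive_names</code> is a list you determine yourself. The builder only honors these settings when <code>OBEY_PRECISION_CONSTRAINTS</code> or <code>PREFER_PRECISION_CONSTRAINTS</code> is set on the config. Comparing the returned list with the requested names is a cheap way to spot typos in layer names.</p>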
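<p><strong>Pitfall 3: the first build is very slow</strong></p>
<p>INT8 calibration runs many inference passes, so the first build can take tens of minutes. The calibrator above writes its result to a cache file, so later builds skip calibration and finish much faster.</p>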
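<p>One caveat with a fixed cache file name is that an old cache may be reused after the model or the calibration set has changed. A possible safeguard, sketched here under the assumption that you control the <code>cache_file</code> argument, is to derive the name from a hash of both. The helper name and naming scheme are illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib


def cache_path_for(model_bytes, calib_files, prefix="calibration"):
    """Build a cache file name that changes whenever the model or the image list changes."""
    digest = hashlib.sha256(model_bytes)
    # Sort the names so that list order does not affect the result.
    for name in sorted(calib_files):
        digest.update(name.encode("utf-8") + b"\0")
    return f"{prefix}_{digest.hexdigest()[:12]}.cache"


print(cache_path_for(b"example-model-bytes", ["a.jpg", "b.jpg"]))
</code></pre></div><p>In practice <code>model_bytes</code> would be the contents of the ONNX file, and the returned name would be passed as <code>cache_file</code> when constructing the calibrator.</p>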
<h2 id="七python-推理实现">7. Python Inference Implementation</h2>
<p>With the engine built, we can finally run inference. Let's write a complete inference class.</p>
<h3 id="71-基础推理类">7.1 Basic Inference Class</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">tensorrt</span> <span class="k">as</span> <span class="nn">trt</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.driver</span> <span class="k">as</span> <span class="nn">cuda</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pycuda.autoinit</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TensorRTInfer</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">engine_path</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Load the engine</span>
</span></span><span class="line"><span class="cl">        <span class="n">logger</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">Logger</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">engine_path</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">,</span> <span class="n">trt</span><span class="o">.</span><span class="n">Runtime</span><span class="p">(</span><span class="n">logger</span><span class="p">)</span> <span class="k">as</span> <span class="n">runtime</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">engine</span> <span class="o">=</span> <span class="n">runtime</span><span class="o">.</span><span class="n">deserialize_cuda_engine</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Create the execution context</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">context</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">create_execution_context</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Allocate device memory for every input and output binding</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">binding</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">size</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">volume</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_shape</span><span class="p">(</span><span class="n">binding</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">dtype</span> <span class="o">=</span> <span class="n">trt</span><span class="o">.</span><span class="n">nptype</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_dtype</span><span class="p">(</span><span class="n">binding</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Size in bytes = element count * element size</span>
</span></span><span class="line"><span class="cl">            <span class="n">device_mem</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">mem_alloc</span><span class="p">(</span><span class="n">size</span> <span class="o">*</span> <span class="n">dtype</span><span class="p">()</span><span class="o">.</span><span class="n">itemsize</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">device_mem</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Create a CUDA stream</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stream</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">Stream</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">infer</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_data</span><span class="p">):</span>
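</span></span><span class="line"><span class="cl">        <span class="c1"># input_data: NumPy array on the host, matching the shape and dtype of the input binding</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># 1. Copy the input to the GPU (pycuda needs a contiguous host buffer)</span>
</span></span><span class="line"><span class="cl">        <span class="n">input_data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_htod_async</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">input_data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="p">)</span>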
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 2. Run inference</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">context</span><span class="o">.</span><span class="n">execute_async_v2</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">bindings</span><span class="o">=</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="k">for</span> <span class="n">buf</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="n">stream_handle</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="o">.</span><span class="n">handle</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 3. Copy the output back to the host</span>
</span></span><span class="line"><span class="cl">        <span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">engine</span><span class="o">.</span><span class="n">get_binding_shape</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">cuda</span><span class="o">.</span><span class="n">memcpy_dtoh_async</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">buffers</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 4. Wait for the stream to finish</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">stream</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">output</span>
</span></span></code></pre></div><h3 id="72-使用示例">7.2 Usage Example</h3>
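<p>The usage example in this section calls <code>preprocess</code> and <code>softmax</code> without defining them. A minimal sketch of both helpers follows. It assumes the image is already a 224x224 RGB <code>uint8</code> array and that the engine expects the usual torchvision ImageNet normalization in NCHW layout; neither assumption is confirmed by the original code, and <code>load_image</code> is still left to the reader.</p>

```python
import numpy as np

# ImageNet statistics used by torchvision's pretrained ResNet-50.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image: np.ndarray) -> np.ndarray:
    """Convert an HWC uint8 RGB image (224x224) to a contiguous NCHW float32 batch."""
    if image.shape != (224, 224, 3):
        raise ValueError(f"expected a (224, 224, 3) image, got {image.shape}")
    x = image.astype(np.float32) / 255.0      # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # per-channel normalization
    x = x.transpose(2, 0, 1)[np.newaxis]      # HWC -> CHW, then add the batch axis
    return np.ascontiguousarray(x, dtype=np.float32)


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)
```

<p>Returning a contiguous <code>float32</code> array matters here because the inference class copies the host buffer to the device byte for byte, with no dtype conversion.</p>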
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Initialize the inference wrapper</span>
</span></span><span class="line"><span class="cl"><span class="n">infer</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Preprocess the image</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">load_image</span><span class="p">(</span><span class="s2">&#34;test.jpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">input_data</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>  <span class="c1"># shape (1, 3, 224, 224)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run inference</span>
</span></span><span class="line"><span class="cl"><span class="n">output</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Post-process</span>
</span></span><span class="line"><span class="cl"><span class="n">probabilities</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">top5</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">probabilities</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">5</span><span class="p">:][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Top-5 predictions:&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">top5</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  Class </span><span class="si">{</span><span class="n">idx</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">probabilities</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">idx</span><span class="p">]</span><span class="si">:</span><span class="s2">.4f</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
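</span></span></code></pre></div><h3 id="73-性能测试">7.3 Performance Test</h3>
<p>Let's write a simple benchmark script to measure how much faster TensorRT is than PyTorch. Keep in mind that the two loops do not measure exactly the same work: the TensorRT loop includes the host-to-device and device-to-host copies on every call, while the PyTorch loop keeps its input on the GPU and runs without <code>torch.no_grad()</code>.</p>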
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torchvision.models</span> <span class="k">as</span> <span class="nn">models</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># PyTorch benchmark</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">resnet50</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span><span class="o">.</span><span class="n">half</span><span class="p">()</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span><span class="o">.</span><span class="n">half</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># warmup</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># measure</span>
</span></span><span class="line"><span class="cl"><span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">pytorch_time</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;PyTorch FP16: </span><span class="si">{</span><span class="n">pytorch_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2"> ms/image&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># TensorRT benchmark</span>
</span></span><span class="line"><span class="cl"><span class="n">infer</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;resnet50_fp16.engine&#34;</span><span class="p">)</span>
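</span></span><span class="line"><span class="cl"><span class="n">dummy_np</span> <span class="o">=</span> <span class="n">dummy_input</span><span class="o">.</span><span class="n">float</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>  <span class="c1"># cast to FP32, assuming the engine input binding is FP32</span>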
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># warmup</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">dummy_np</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># measure</span>
</span></span><span class="line"><span class="cl"><span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">_</span> <span class="o">=</span> <span class="n">infer</span><span class="o">.</span><span class="n">infer</span><span class="p">(</span><span class="n">dummy_np</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">trt_time</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;TensorRT FP16: </span><span class="si">{</span><span class="n">trt_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2"> ms/image&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Speedup: </span><span class="si">{</span><span class="n">pytorch_time</span> <span class="o">/</span> <span class="n">trt_time</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">x&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>On my 3090, the results were:</p>
<pre tabindex="0"><code>PyTorch FP16: 1.98 ms/image
TensorRT FP16: 0.52 ms/image
Speedup: 3.81x
</code></pre><p>That is a 3.8x speedup at batch size 1, and INT8 is not even enabled yet. This is why TensorRT is worth the time it takes to learn.</p>
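<p>The script above reports only the mean latency, which can hide tail latency. As a possible refinement, the helper below times each call individually and reports the mean, median and 99th percentile. It is a sketch, not part of the original benchmark: the name <code>benchmark</code> and its parameters are hypothetical, and <code>fn</code> is assumed to block until the GPU work has finished (the <code>infer</code> method does, because it synchronizes the stream; a PyTorch callable would need to call <code>torch.cuda.synchronize()</code> itself).</p>

```python
import statistics
import time


def benchmark(fn, warmup: int = 50, iters: int = 1000) -> dict:
    """Time a blocking callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p99_index = min(len(samples_ms) - 1, int(len(samples_ms) * 0.99))
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": samples_ms[len(samples_ms) // 2],
        "p99": samples_ms[p99_index],
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    stats = benchmark(lambda: sum(range(10_000)), warmup=5, iters=200)
    print({name: round(value, 4) for name, value in stats.items()})
```

<p>With the inference class from section 7.1, a call would look like <code>benchmark(lambda: infer.infer(dummy_np))</code>.</p>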
<h2 id="八生产级-c-部署">8. Production-Grade C++ Deployment</h2>
<p>Python is fine for quick validation, but real production environments usually use C++. The reasons are simple:</p>
<ul>
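<li>Better performance (no Python interpreter or GIL overhead on the hot path)</li>
<li>Easier deployment (no bulky Python environment to ship)</li>
<li>More stable (memory management is under explicit control)</li>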
</ul>
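<h3 id="81-c-推理类实现">8.1 C++ Inference Class</h3>
<p>Below is a C++ wrapper for TensorRT inference that you can adapt for your own project. It uses the name-based tensor API (<code>getNbIOTensors</code>, <code>getTensorShape</code>), which requires TensorRT 8.5 or later, and it assumes FP32 inputs and outputs with static shapes:</p>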
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;NvInfer.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;NvOnnxParser.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;cuda_runtime_api.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;fstream&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;vector&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;memory&gt;</span><span class="cp">
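</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdexcept&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;string&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;cstdio&gt;</span><span class="cp">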
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Logger</span> <span class="o">:</span> <span class="k">public</span> <span class="n">nvinfer1</span><span class="o">::</span><span class="n">ILogger</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">log</span><span class="p">(</span><span class="n">Severity</span> <span class="n">severity</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">msg</span><span class="p">)</span> <span class="k">noexcept</span> <span class="k">override</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">severity</span> <span class="o">&lt;=</span> <span class="n">Severity</span><span class="o">::</span><span class="n">kWARNING</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">printf</span><span class="p">(</span><span class="s">&#34;[TensorRT] %s</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">,</span> <span class="n">msg</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TensorRTInfer</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="k">public</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">TensorRTInfer</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">engine_path</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Read the engine file
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">file</span><span class="p">(</span><span class="n">engine_path</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">binary</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">&#34;Cannot open engine file&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">seekg</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">end</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">file</span><span class="p">.</span><span class="n">tellg</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">seekg</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ios</span><span class="o">::</span><span class="n">beg</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">char</span><span class="o">&gt;</span> <span class="n">engine_data</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">file</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">engine_data</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Deserialize the engine
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_runtime</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">createInferRuntime</span><span class="p">(</span><span class="n">m_logger</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">        <span class="n">m_engine</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">m_runtime</span><span class="o">-&gt;</span><span class="n">deserializeCudaEngine</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">engine_data</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">size</span>
</span></span><span class="line"><span class="cl">        <span class="p">));</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">m_engine</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">&#34;Failed to deserialize engine&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Create the execution context
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_context</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">createExecutionContext</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Allocate device memory
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_buffers</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getNbIOTensors</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getNbIOTensors</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">name</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getIOTensorName</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="k">auto</span> <span class="n">dims</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
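</span></span><span class="line"><span class="cl">            <span class="c1">// Element count = product of all dimensions; assumes static shapes and FP32 I/O
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>            <span class="n">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="n">dims</span><span class="p">.</span><span class="n">nbDims</span><span class="p">;</span> <span class="n">d</span><span class="o">++</span><span class="p">)</span> <span class="n">count</span> <span class="o">*=</span> <span class="n">dims</span><span class="p">.</span><span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="n">size_t</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">count</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>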
</span></span><span class="line"><span class="cl">            <span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m_buffers</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="p">(</span><span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorIOMode</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="o">==</span> <span class="n">nvinfer1</span><span class="o">::</span><span class="n">TensorIOMode</span><span class="o">::</span><span class="n">kINPUT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_input_name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_input_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_output_name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                <span class="n">m_output_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Create the CUDA stream
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaStreamCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="o">~</span><span class="n">TensorRTInfer</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">buf</span> <span class="p">:</span> <span class="n">m_buffers</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaFree</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="n">cudaStreamDestroy</span><span class="p">(</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Disable copying
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">TensorRTInfer</span><span class="p">(</span><span class="k">const</span> <span class="n">TensorRTInfer</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">TensorRTInfer</span><span class="o">&amp;</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">TensorRTInfer</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">infer</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">input</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span> <span class="kt">int</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Set the input shape (if it is dynamic)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="k">auto</span> <span class="n">dims</span> <span class="o">=</span> <span class="n">m_engine</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">dims</span><span class="p">.</span><span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">dims</span><span class="p">.</span><span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setInputShape</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">,</span> <span class="n">dims</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// H2D copy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaMemcpyAsync</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_input_idx</span><span class="p">],</span> <span class="n">input</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">nvinfer1</span><span class="o">::</span><span class="n">volume</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaMemcpyHostToDevice</span><span class="p">,</span> <span class="n">m_stream</span>
</span></span><span class="line"><span class="cl">        <span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Set the tensor addresses
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setTensorAddress</span><span class="p">(</span><span class="n">m_input_name</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_input_idx</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">        <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">setTensorAddress</span><span class="p">(</span><span class="n">m_output_name</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_output_idx</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">        
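</span></span><span class="line"><span class="cl">        <span class="c1">// Run inference; enqueueV3 returns false if the enqueue fails
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">m_context</span><span class="o">-&gt;</span><span class="n">enqueueV3</span><span class="p">(</span><span class="n">m_stream</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">&#34;enqueueV3 failed&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>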
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// D2H copy
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="k">auto</span> <span class="n">out_dims</span> <span class="o">=</span> <span class="n">m_context</span><span class="o">-&gt;</span><span class="n">getTensorShape</span><span class="p">(</span><span class="n">m_output_name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">cudaMemcpyAsync</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">output</span><span class="p">,</span> <span class="n">m_buffers</span><span class="p">[</span><span class="n">m_output_idx</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="n">nvinfer1</span><span class="o">::</span><span class="n">volume</span><span class="p">(</span><span class="n">out_dims</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">cudaMemcpyDeviceToHost</span><span class="p">,</span> <span class="n">m_stream</span>
</span></span><span class="line"><span class="cl">        <span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Synchronize
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">cudaStreamSynchronize</span><span class="p">(</span><span class="n">m_stream</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="k">private</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">Logger</span> <span class="n">m_logger</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">IRuntime</span><span class="o">&gt;</span> <span class="n">m_runtime</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">ICudaEngine</span><span class="o">&gt;</span> <span class="n">m_engine</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">nvinfer1</span><span class="o">::</span><span class="n">IExecutionContext</span><span class="o">&gt;</span> <span class="n">m_context</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">void</span><span class="o">*&gt;</span> <span class="n">m_buffers</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">cudaStream_t</span> <span class="n">m_stream</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">m_input_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">m_output_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">m_input_idx</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">m_output_idx</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><h3 id="82-cmakeliststxt">8.2 CMakeLists.txt</h3>
<p>To make the build easier, here is the CMakeLists.txt as well:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cmake" data-lang="cmake"><span class="line"><span class="cl"><span class="nb">cmake_minimum_required</span><span class="p">(</span><span class="s">VERSION</span> <span class="s">3.18</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">project</span><span class="p">(</span><span class="s">tensorrt_infer</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_CXX_STANDARD</span> <span class="s">17</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># CUDA
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">find_package</span><span class="p">(</span><span class="s">CUDA</span> <span class="s">REQUIRED</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">include_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">CUDA_INCLUDE_DIRS</span><span class="o">}</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># TensorRT
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">set</span><span class="p">(</span><span class="s">TENSORRT_ROOT</span> <span class="s">/path/to/TensorRT-10.1.0.27</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">include_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">TENSORRT_ROOT</span><span class="o">}</span><span class="s">/include</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">link_directories</span><span class="p">(</span><span class="o">${</span><span class="nv">TENSORRT_ROOT</span><span class="o">}</span><span class="s">/lib</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Executable
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">add_executable</span><span class="p">(</span><span class="s">infer</span> <span class="s">main.cpp</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">target_link_libraries</span><span class="p">(</span><span class="s">infer</span>
</span></span><span class="line"><span class="cl">    <span class="o">${</span><span class="nv">CUDA_LIBRARIES</span><span class="o">}</span>
</span></span><span class="line"><span class="cl">    <span class="s">nvinfer</span>
</span></span><span class="line"><span class="cl">    <span class="s">nvonnxparser</span>
</span></span><span class="line"><span class="cl">    <span class="s">cudart</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="err">
</span></span></span></code></pre></div><h3 id="83-使用示例">8.3 Usage Example</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">TensorRTInfer</span> <span class="n">infer</span><span class="p">(</span><span class="s">&#34;resnet50_fp16.engine&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Prepare the input
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">input</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">224</span> <span class="o">*</span> <span class="mi">224</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">output</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Fill input...
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        
</span></span><span class="line"><span class="cl">        <span class="c1">// Run inference
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="n">infer</span><span class="p">.</span><span class="n">infer</span><span class="p">(</span><span class="n">input</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">output</span><span class="p">.</span><span class="n">data</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Process output...
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">&#34;Inference done!&#34;</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">exception</span><span class="o">&amp;</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">&#34;Error: &#34;</span> <span class="o">&lt;&lt;</span> <span class="n">e</span><span class="p">.</span><span class="n">what</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
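</span></span></code></pre></div><h2 id="九进阶优化技巧">9. Advanced Optimization Techniques</h2>
<p>With the basics covered, here are some advanced techniques that can push performance up another notch.</p>
<h3 id="91-多流并发">9.1 Multi-Stream Concurrency</h3>
<p>If your application has to process several video streams at once, you can use multiple CUDA streams so that their work can overlap on the GPU:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create several inference instances, each with its own stream</span>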
</span></span><span class="line"><span class="cl"><span class="n">infer1</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;model.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">infer2</span> <span class="o">=</span> <span class="n">TensorRTInfer</span><span class="p">(</span><span class="s2">&#34;model.engine&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
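</span></span><span class="line"><span class="cl"><span class="c1"># Run each instance's inference in its own thread</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Their work can then overlap on the GPU</span>
</span></span></code></pre></div><p>Note: an <code>IExecutionContext</code> can run only one inference at a time. If you need multiple streams, create multiple contexts. Several contexts can be created from the same engine, so the engine itself only has to be loaded once.</p>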
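<h3 id="92-流水处理">9.2 Pipelining</h3>
<p>When throughput matters most, turn preprocessing, inference, and postprocessing into a pipeline whose stages are connected by producer-consumer queues:</p>
<pre tabindex="0"><code>Thread 1: read video → decode → preprocess → push to queue
Thread 2: pop from queue → TensorRT inference → push to result queue
Thread 3: pop from result queue → postprocess → display/save
</code></pre><p>The three stages then overlap, so neither the CPU nor the GPU sits idle waiting for the other. In real projects this typically adds another 30-50% of overall throughput, although the gain depends on how evenly the stages are balanced.</p>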
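<p>The three-thread layout above can be sketched with Python's standard <code>queue</code> and <code>threading</code> modules. This is a minimal sketch under stated assumptions, not a production implementation: <code>preprocess</code>, <code>run_inference</code>, and <code>postprocess</code> are hypothetical placeholders for decoding, the TensorRT call, and result handling, and the queue size of 4 is an arbitrary choice.</p>

```python
import queue
import threading

SENTINEL = object()  # unique marker for "no more items"


def preprocess(frame):
    # Hypothetical placeholder for decode + resize + normalize.
    return frame * 2


def run_inference(tensor):
    # Hypothetical placeholder for the TensorRT inference call.
    return tensor + 1


def postprocess(result):
    # Hypothetical placeholder for NMS, drawing, or saving.
    return f"result={result}"


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox; forward the sentinel when done."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            if outbox is not None:
                outbox.put(SENTINEL)
            break
        out = fn(item)
        if outbox is not None:
            outbox.put(out)


def run_pipeline(frames, maxsize=4):
    # Bounded queues apply back-pressure to a fast producer.
    q_in = queue.Queue(maxsize=maxsize)
    q_out = queue.Queue(maxsize=maxsize)
    results = []

    def producer():
        for frame in frames:
            q_in.put(preprocess(frame))
        q_in.put(SENTINEL)

    def collect(result):
        results.append(postprocess(result))

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=stage, args=(run_inference, q_in, q_out)),
        threading.Thread(target=stage, args=(collect, q_out, None)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

<p>Because each stage runs in a single thread and the queues are FIFO, results come out in input order. The sentinel object travels through the queues so that every stage shuts down cleanly after the last frame.</p>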
<h3 id="93-权重精简">9.3 Weight Stripping</h3>
<p>If the generated engine file turns out to be very large, try this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">set_flag</span><span class="p">(</span><span class="n">trt</span><span class="o">.</span><span class="n">BuilderFlag</span><span class="o">.</span><span class="n">STRIP_PLAN</span><span class="p">)</span>
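</span></span></code></pre></div><p>This flag (available since TensorRT 10.0) builds a weight-stripped engine: the refittable weights are left out of the plan file rather than any debug information, so the file shrinks by roughly the size of those weights. The trade-off is that the stripped engine has to be refitted with the original weights (for example from the ONNX model) before it can run inference.</p>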
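<h3 id="94-避免不必要的内存拷贝">9.4 Avoiding Unnecessary Memory Copies</h3>
<p>The bottleneck is often not TensorRT itself but the H2D/D2H memory copies. A few directions to optimize:</p>
<ol>
<li><strong>Preprocess directly on the GPU</strong>: do resize and normalize in CUDA kernels so the data never has to go back to the CPU</li>
<li><strong>Use pinned memory</strong>: copies from page-locked memory allocated with <code>cudaHostAlloc</code> are often 2-3x faster than copies from ordinary malloc memory, and <code>cudaMemcpyAsync</code> is only truly asynchronous when the host buffer is pinned</li>
<li><strong>Batch the work</strong>: process several images per call whenever possible to amortize the copy overhead</li>
</ol>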
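<p>The batching point can be illustrated with a small helper that groups incoming frames into fixed-size batches, so that each batch needs one copy and one inference call instead of one per image. This is only a sketch: <code>make_batches</code> is a hypothetical helper name, and in a real pipeline the batch size would have to fit the range the engine was built for.</p>

```python
from itertools import islice


def make_batches(items, batch_size):
    """Yield lists of up to batch_size items; the last batch may be shorter."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

<p>For example, <code>make_batches(frames, 8)</code> turns a stream of frames into groups of eight, with a shorter final group if the stream length is not a multiple of eight.</p>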
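<h2 id="十常见问题与排错">10. Common Problems and Troubleshooting</h2>
<p>TensorRT has a fairly steep learning curve, and running into problems is normal. Here are the most common pitfalls and how to get out of them.</p>
<h3 id="101-构建失败">10.1 Build Failures</h3>
<p><strong>Symptom</strong>: <code>build_serialized_network</code> returns None</p>
<p><strong>Troubleshooting steps</strong>:</p>
<ol>
<li>Set the Logger level to VERBOSE and read the detailed output</li>
<li>Check whether the workspace limit is too small (use at least 512MB)</li>
<li>Confirm the ONNX model is valid: <code>onnx.checker.check_model()</code></li>
<li>Run the model through onnxsim</li>
<li>With dynamic shapes, check that the profile ranges are correct</li>
</ol>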
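<h3 id="102-推理结果不对">10.2 Wrong Inference Results</h3>
<p><strong>Symptom</strong>: the TensorRT output does not match PyTorch</p>
<p><strong>Troubleshooting steps</strong>:</p>
<ol>
<li>Test FP32 first; if FP32 already mismatches, the problem is in export or parsing</li>
<li>Check that preprocessing and postprocessing use the same value ranges</li>
<li>Check that the NCHW/NHWC layouts are not swapped</li>
<li>Check the RGB/BGR channel order</li>
<li>Set <code>BuilderFlag.OBEY_PRECISION_CONSTRAINTS</code> so that layers keep the precision you requested instead of silently falling back (this replaces <code>BuilderFlag.STRICT_TYPES</code>, which was deprecated in TensorRT 8.x and is no longer available in TensorRT 10)</li>
</ol>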
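<p>For the first step, it helps to compare the two outputs with numbers rather than by eye. The following is a minimal sketch in pure Python that works on flattened outputs. <code>compare_outputs</code> is a hypothetical helper, and the default tolerances are assumptions that would need loosening for FP16 or INT8.</p>

```python
import math


def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-2):
    """Summarize how far a candidate output is from a reference output.

    Both arguments are flat sequences of floats of equal length.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    max_abs_diff = 0.0
    mismatches = 0
    dot = ref_sq = cand_sq = 0.0
    for r, c in zip(reference, candidate):
        diff = abs(r - c)
        max_abs_diff = max(max_abs_diff, diff)
        # Same form of criterion as numpy.allclose: |r - c| > atol + rtol * |r|
        if diff > atol + rtol * abs(r):
            mismatches += 1
        dot += r * c
        ref_sq += r * r
        cand_sq += c * c
    denom = math.sqrt(ref_sq) * math.sqrt(cand_sq)
    cosine = dot / denom if denom > 0 else float("nan")
    return {
        "max_abs_diff": max_abs_diff,
        "mismatches": mismatches,
        "cosine": cosine,
    }
```

<p>A cosine similarity close to 1 with a small maximum difference suggests ordinary precision loss. A low cosine similarity usually points to a structural problem such as a layout or channel-order mix-up.</p>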
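<h3 id="103-内存泄漏">10.3 Memory Leaks</h3>
<p><strong>Symptom</strong>: memory usage keeps growing the longer the program runs</p>
<p><strong>Common causes</strong>:</p>
<ol>
<li>Forgetting to destroy the <code>IExecutionContext</code></li>
<li>Forgetting to free CUDA device memory</li>
<li>Creating a new context for every inference instead of reusing one</li>
<li>pycuda memory not being released correctly</li>
</ol>
<p><strong>Best practice</strong>: create one engine and a small number of contexts for the whole lifetime of the program, and reuse them for every inference.</p>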
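<h3 id="104-性能不如预期">10.4 Performance Below Expectations</h3>
<p><strong>Symptom</strong>: the speedup is less than 2x, well short of what this article describes</p>
<p><strong>Possible causes</strong>:</p>
<ol>
<li>FP16 is not actually enabled: check that <code>BuilderFlag.FP16</code> is set on the builder config and that <code>builder.platform_has_fast_fp16</code> is true</li>
<li>The model is too small: kernel launch overhead then makes up a large share of the runtime</li>
<li>The batch size is too small: it takes larger batches to saturate the GPU</li>
<li>The bottleneck is in preprocessing or postprocessing: run nsys profile to see where the time goes</li>
<li>The GPU is too old: architectures before Volta (Pascal and earlier) have no Tensor Cores</li>
</ol>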
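<p>Before reaching for a full profiler, a coarse per-stage timer is often enough to tell whether the time goes into preprocessing, inference, or postprocessing. The sketch below uses only the standard library, and <code>StageTimer</code> is a hypothetical helper name. It measures wall-clock time, so the inference stage is only measured correctly if the timed call synchronizes the CUDA stream before returning, as the <code>infer()</code> method shown earlier does.</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, mean milliseconds) pairs, slowest stage first."""
        means = {
            name: 1000.0 * self.totals[name] / self.counts[name]
            for name in self.totals
        }
        return sorted(means.items(), key=lambda kv: kv[1], reverse=True)
```

<p>Wrap each stage in <code>with timer.measure("preprocess"):</code>, <code>with timer.measure("infer"):</code>, and so on inside the main loop, then print <code>timer.report()</code> after a few hundred iterations. The first few iterations are best discarded, since they include one-time warm-up costs.</p>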
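<h2 id="十一最佳实践总结">11. Best Practices Summary</h2>
<p>After hitting these pitfalls across many projects, I have put together a checklist of TensorRT best practices. Following it avoids about 90% of the problems:</p>
<h3 id="准备阶段">Preparation</h3>
<ul>
<li>✅ Use a Docker environment to avoid fighting with dependencies</li>
<li>✅ Always run onnxsim after exporting to ONNX</li>
<li>✅ Get FP32 working first, then try FP16, and INT8 last</li>
<li>✅ Check numerical agreement with PyTorch at every step</li>
</ul>
<h3 id="构建阶段">Build</h3>
<ul>
<li>✅ Start with a workspace of at least 1GB</li>
<li>✅ Keep the min/opt/max of dynamic shapes close together</li>
<li>✅ Calibrate INT8 with 500-2000 representative images</li>
<li>✅ Save the calibration cache and reuse it next time</li>
<li>✅ Build the engine on the same GPU model and TensorRT version you deploy on; by default an engine cannot be copied to a different GPU</li>
</ul>
<h3 id="部署阶段">Deployment</h3>
<ul>
<li>✅ Use C++ in production and Python only for validation</li>
<li>✅ Create only one engine for the whole program</li>
<li>✅ Reuse execution contexts instead of creating one per inference</li>
<li>✅ Use multiple streams to handle multiple inputs</li>
<li>✅ Move preprocessing onto the GPU where possible</li>
</ul>
<h3 id="调试阶段">Debugging</h3>
<ul>
<li>✅ Set the Logger to VERBOSE; the output is very informative</li>
<li>✅ Use <code>nsys profile</code> for performance analysis</li>
<li>✅ Use the Engine Inspector to see the precision and timing of each layer</li>
<li>✅ When you get stuck, search the official NVIDIA forums first; many people have hit the same problems</li>
</ul>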
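<h2 id="总结">Summary</h2>
<p>TensorRT is a very powerful tool, but it is also one that takes time to master. It is not as friendly as PyTorch, it has plenty of pitfalls, and a single problem can sometimes hold you up for days.</p>
<p>What I want to say, though, is that <strong>it is all worth it</strong>. When a model that used to run at only 30 FPS reaches 300 FPS after TensorRT optimization with almost no loss of accuracy, the sense of achievement is hard to beat. More importantly, it means you can serve more requests on cheaper hardware and save your company real money.</p>
<p>This article covers the whole workflow from environment setup to production deployment, and you can take the code and use it directly. The technology keeps moving, however: every TensorRT release adds features and performance improvements, so it pays to keep learning.</p>
<p>Finally, a few directions for further study:</p>
<ol>
<li><strong>Custom plugins</strong>: when an operator is not supported, write your own CUDA kernel to extend TensorRT</li>
<li><strong>Quantization-aware training (QAT)</strong>: simulate quantization error during training, which gives better accuracy than PTQ</li>
<li><strong>Triton Inference Server</strong>: NVIDIA's open-source inference serving framework, a must-have for production-grade deployment</li>
<li><strong>Multi-GPU inference</strong>: an essential skill in the era of large models</li>
</ol>
<p>I hope this article helps you avoid a few detours. If you run into problems with TensorRT, or have optimization tips of your own, feel free to get in touch.</p>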
]]></content:encoded>
    </item>
  </channel>
</rss>
